Introduction

We are in production with our shop example [1]. We’d like to get some statistics about our implementation: how often are the Lambda functions called? How fast are they?

Of course, we could use the statistics from the performance test, but there is a faster way: AWS X-Ray. I implemented this in the shop-3 directory of my repository [2].

AWS Shop example: Amazon X-Ray 01 Shops with shops message ids 2

X-Ray in API Gateway

In the API Gateway, you will find the switch at the stage level. Go to AMIS_api_gateway > Stages > prod and click on the tab “Logs/Tracing”. You will find the checkbox “Enable X-Ray Tracing” in the X-Ray Tracing part of this screen:

AWS Shop example: Amazon X-Ray 02 AWS X Ray enable X Ray Tracing in API Gateway
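
The same switch can probably also be set from a script instead of the console. The following is a minimal sketch, assuming boto3 is installed and credentials are configured; the helper names tracing_patch and set_stage_tracing are my own and do not come from the shop-3 repository, and the REST API id has to be looked up first.

```python
def tracing_patch(enabled: bool) -> list:
    """Build the patch operation that toggles X-Ray tracing on a stage."""
    return [
        {
            "op": "replace",
            "path": "/tracingEnabled",
            "value": "true" if enabled else "false",
        }
    ]


def set_stage_tracing(rest_api_id: str, stage_name: str = "prod",
                      enabled: bool = True) -> dict:
    """Apply the patch to one stage (hypothetical helper, needs AWS access)."""
    import boto3  # imported here so tracing_patch() works without boto3

    client = boto3.client("apigateway")
    return client.update_stage(
        restApiId=rest_api_id,
        stageName=stage_name,
        patchOperations=tracing_patch(enabled),
    )


if __name__ == "__main__":
    # Only prints the patch; call set_stage_tracing("<your-api-id>") to apply it.
    print(tracing_patch(True))
```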

X-Ray in Lambda functions

You can switch X-Ray on or off per Lambda function. I switched it on for all production objects: AMIS_shop_accept, AMIS_shop_decrypt and AMIS_shop_update_db. You will find the switch in the box about AWS X-Ray:

AWS Shop example: Amazon X-Ray 01 AWS X Ray activate X Ray in Lambda
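
When there are more than a few functions, clicking through the console gets tedious. A minimal sketch of the same change with boto3 could look like this. It assumes boto3 and valid credentials; the helper names are hypothetical, and the function names are the three production functions mentioned above.

```python
PRODUCTION_FUNCTIONS = [
    "AMIS_shop_accept",
    "AMIS_shop_decrypt",
    "AMIS_shop_update_db",
]


def tracing_config(active: bool) -> dict:
    """Return the TracingConfig value for active or pass-through tracing."""
    return {"Mode": "Active" if active else "PassThrough"}


def set_lambda_tracing(function_names, active: bool = True) -> None:
    """Switch X-Ray tracing on or off for each function (needs AWS access)."""
    import boto3  # imported here so tracing_config() works without boto3

    client = boto3.client("lambda")
    for name in function_names:
        client.update_function_configuration(
            FunctionName=name,
            TracingConfig=tracing_config(active),
        )


if __name__ == "__main__":
    # Only prints the configuration; call set_lambda_tracing(PRODUCTION_FUNCTIONS)
    # to apply it.
    print(tracing_config(True))
```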

Code

Just two lines of code have to be added to let X-Ray also see the connections to the AWS services that are called from within the code:

from aws_xray_sdk.core import patch
patch(['botocore'])

The library aws-xray-sdk is not present in AWS Lambda by default, so it has to be added to the zip file. The resulting zip file is 9.5 MB, which is about 10 times the size of the request library. It is so big that the AWS console will not display the content of the zip file (and the source code) for these functions:

AWS Shop example: Amazon X-Ray 02a No source code

It is also possible to use X-Ray for code that is running in a virtual machine or for specific parts of your code. See the X-Ray developer guide for more information [3]. For my Lambda functions, I just added the two lines that I showed before.

AWS X-Ray

When you start the smoke test, and then go to AWS X-Ray > Service map, you will see the statistics about our functions:

AWS Shop example: Amazon X-Ray 03 Service map

Let’s zoom into this:

AWS Shop example: Amazon X-Ray 03b Right

You can see that, after adding the two lines of code (and the libraries to the zip file), the first start of a function takes quite long: the AMIS_shop_decrypt function took 2.19 seconds for the first run (!). You can zoom in by clicking on one of the circles (e.g. AMIS_shop_decrypt):

AWS Shop example: Amazon X-Ray 03c AMIS shop decrypt 2

If we had more data, we would see a nice graph on the right that shows how often the duration is long and how often it is short.

When you select “Traces” in the menu on the left after running a performance test, you will see the details of each individual call. When you group the results by “Resource ARN” you can see the results for the API Gateway, the accept function, the decrypt function and the update_db function:

AWS Shop example: Amazon X-Ray 04 X Ray traces 2

Let’s look at the Analytics part within AWS X-Ray (via the menu option in the left menu). When you click at about 400ms and then drag the mouse to the maximum, you will see the following image:

AWS Shop example: Amazon X-Ray 05 X Ray analytics 1

You can now see that there were three traces of more than 400 ms. Two of them were at the beginning of the tests and a third was about halfway through: these are the dark blue boxes on the timeline.
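
The same selection can also be made outside the console. The following is a minimal sketch, assuming boto3 and valid credentials. The helper names and the 30 minute look-back window are my own choices; the filter expression uses the responsetime keyword of X-Ray, expressed in seconds.

```python
from datetime import datetime, timedelta, timezone


def slow_trace_filter(threshold_seconds: float) -> str:
    """Build an X-Ray filter expression for traces slower than the threshold."""
    return f"responsetime > {threshold_seconds}"


def find_slow_traces(threshold_seconds: float = 0.4,
                     minutes_back: int = 30) -> list:
    """Return trace summaries above the threshold, slowest first (needs AWS access)."""
    import boto3  # imported here so slow_trace_filter() works without boto3

    client = boto3.client("xray")
    end = datetime.now(timezone.utc)
    start = end - timedelta(minutes=minutes_back)

    summaries = []
    paginator = client.get_paginator("get_trace_summaries")
    for page in paginator.paginate(
        StartTime=start,
        EndTime=end,
        FilterExpression=slow_trace_filter(threshold_seconds),
    ):
        summaries.extend(page["TraceSummaries"])

    return sorted(summaries, key=lambda s: s.get("ResponseTime", 0), reverse=True)


if __name__ == "__main__":
    # Only prints the filter; call find_slow_traces() to query X-Ray.
    print(slow_trace_filter(0.4))
```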

When the shop example runs in a real production environment, we wouldn’t want to keep tracing data for every call to the API Gateway. Click on Sampling in the menu and let’s first look at the default: click on the link with the name default:

AWS Shop example: Amazon X-Ray 06 Default sampling X Ray 1

You can now see that the number of requests that are traced is determined by a reservoir size and a fixed rate. In the default rule, these are set to 1 and 5%. The reservoir size is the number of requests per second that are always traced. When there are fewer requests per second than the reservoir size (or exactly as many), all requests are traced. So with the default rule, the first request in each second is always traced. When there is more than one request per second, 5% of the remaining requests are traced as well.

AWS Shop example: Amazon X-Ray 06a Default sampling X Ray details
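
To get a feeling for these numbers, the rule can be modelled in a few lines of Python. This is only a sketch of the reservoir and fixed rate arithmetic as described above, not the actual X-Ray implementation; the function names are my own.

```python
import random


def expected_sampled(requests_in_second: int, reservoir: int = 1,
                     fixed_rate: float = 0.05) -> float:
    """Average number of traced requests in one second under the rule."""
    from_reservoir = min(requests_in_second, reservoir)
    remaining = requests_in_second - from_reservoir
    return from_reservoir + remaining * fixed_rate


def simulate_sampled(requests_in_second: int, reservoir: int = 1,
                     fixed_rate: float = 0.05, rng=None) -> int:
    """Simulate one second: the reservoir first, then a coin flip per request."""
    rng = rng or random.Random()
    from_reservoir = min(requests_in_second, reservoir)
    remaining = requests_in_second - from_reservoir
    from_rate = sum(1 for _ in range(remaining) if rng.random() < fixed_rate)
    return from_reservoir + from_rate


if __name__ == "__main__":
    for load in (1, 10, 101):
        print(load, "requests/s ->", expected_sampled(load), "traced on average")
```

With the default rule, a single request per second is always traced, while out of 101 requests per second about 6 are traced on average (1 from the reservoir plus 5% of the remaining 100).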

The matching criteria cannot be changed for the default rule. When you create an extra rule (with a priority number lower than 10000), you can use the name of the Lambda function or the name of the API Gateway as the service name to get more (or less) tracing for these functions.
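
Such an extra rule could also be created from a script. The following is a minimal sketch, assuming boto3 and valid credentials. The rule name, the priority of 100, the reservoir size of 5 and the fixed rate of 10% are made-up example values, and the helper names are hypothetical.

```python
def build_sampling_rule(rule_name: str, service_name: str, priority: int = 100,
                        reservoir_size: int = 5, fixed_rate: float = 0.10) -> dict:
    """Build a sampling rule that matches on service name only."""
    if not 1 <= priority <= 9999:
        raise ValueError("priority must be between 1 and 9999")
    if not 0.0 <= fixed_rate <= 1.0:
        raise ValueError("fixed_rate must be between 0 and 1")
    return {
        "RuleName": rule_name,
        "Priority": priority,              # lower number wins over the default rule
        "ReservoirSize": reservoir_size,   # requests per second that are always traced
        "FixedRate": fixed_rate,           # fraction of the remaining requests
        "ServiceName": service_name,
        "ServiceType": "*",                # all other criteria match everything
        "Host": "*",
        "HTTPMethod": "*",
        "URLPath": "*",
        "ResourceARN": "*",
        "Version": 1,
    }


def create_rule(rule: dict) -> dict:
    """Send the rule to X-Ray (needs AWS access)."""
    import boto3  # imported here so build_sampling_rule() works without boto3

    return boto3.client("xray").create_sampling_rule(SamplingRule=rule)


if __name__ == "__main__":
    # Only prints the rule; call create_rule(rule) to create it.
    print(build_sampling_rule("AMIS_decrypt_more", "AMIS_shop_decrypt"))
```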

Searching for specific cases

Let’s assume that we want to search for a specific case. Go back to the traces, select the range from 400 ms and up again and then scroll down: you will see the three items in the Trace list. Let’s look at the first one, which has the highest response time, and click on that link:

AWS Shop example: Amazon X-Ray 07 Trace list

You can now see the details of this trace. This first message needed 8.9 seconds to get from the API Gateway to the update of the DynamoDB table:

AWS Shop example: Amazon X-Ray 08 Details of trace

When you scroll down, you can see the whole duration of this first message. In this case, it is clear that much of the response time is taken by the initialization of the Lambda functions (636 ms for the accept function, 624 ms for the decrypt function, etc.).

AWS Shop example: Amazon X-Ray 08a Detailed graph

Let’s assume we also want to look at the CloudWatch logging for this specific trace. This is possible, because the ID that is shown in this screen is also used in the CloudWatch logs:

AWS Shop example: Amazon X-Ray 09 CloudWatch for this trace

Please note that when you search in CloudWatch, you should only search for the part without the dash. You can see in the REPORT section of the logging that three items have been added: XRAY TraceId, SegmentId and Sampled.
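
When the logs are exported or read by a script, these three items can be picked out of the REPORT line with a regular expression. The following is a minimal sketch. The sample line is made up and its exact layout is an assumption based on the three field names above; the helper names are my own. The second helper splits the trace ID on the dashes, so that each part can be used as a search term without a dash.

```python
import re

# Made-up REPORT line for illustration; the real layout may differ in details.
SAMPLE_REPORT = (
    "REPORT RequestId: 11111111-2222-3333-4444-555555555555\t"
    "Duration: 12.34 ms\tBilled Duration: 100 ms\t"
    "Memory Size: 128 MB\tMax Memory Used: 70 MB\t"
    "XRAY TraceId: 1-5e1b4151-5ac6c58f3f0b4a7d9e2c1f00\t"
    "SegmentId: 0a1b2c3d4e5f6a7b\tSampled: true"
)

XRAY_FIELDS = re.compile(
    r"XRAY TraceId:\s*(?P<trace_id>\S+)\s+"
    r"SegmentId:\s*(?P<segment_id>\S+)\s+"
    r"Sampled:\s*(?P<sampled>true|false)"
)


def parse_xray_fields(report_line: str):
    """Return the X-Ray fields of a REPORT line, or None when they are absent."""
    match = XRAY_FIELDS.search(report_line)
    if match is None:
        return None
    return {
        "trace_id": match["trace_id"],
        "segment_id": match["segment_id"],
        "sampled": match["sampled"] == "true",
    }


def dash_free_parts(trace_id: str) -> list:
    """Split a trace ID on its dashes, giving search terms without a dash."""
    return trace_id.split("-")


if __name__ == "__main__":
    fields = parse_xray_fields(SAMPLE_REPORT)
    print(fields)
    print(dash_free_parts(fields["trace_id"]))
```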

Errors shown in X-Ray

I didn’t have to change that much in the configuration: I just marked some check boxes in the API Gateway and in the Lambda functions. I added a few lines of code, I added some libraries to the zip file – and that’s all.

I didn’t have to configure anything in X-Ray itself (I accepted the defaults for the amount of tracing that is done). The only objects that are traced by X-Ray in the shop-3 example are the objects of the production environment: the API Gateway and the three AMIS_shop_<<name>> Lambda functions.

Let’s add X-Ray to the AMIS_unittest_object_under_test_update_db function and see what happens when there are errors, e.g. because we ran some unit tests.

Go to Lambda functions and select the AMIS_unittest_object_under_test_update_db function. Scroll down to the AWS X-Ray block and check the checkbox Active tracing. Also increase the timeout from the default of three seconds to ten seconds:

AWS Shop example: Amazon X-Ray 10 Change X Ray and timeout

When you change the X-Ray checkbox, you will see a red message that IAM roles have been changed. This is strange, because I added the necessary permissions to the IAM roles for the accept, decrypt and update_db functions myself, and we saw before that this worked perfectly fine. On the other hand, it does no harm to allow sending data to X-Ray in two different IAM policies, so you can leave the role as it is.

AWS Shop example: Amazon X-Ray 11 Added X Ray to execution role

Now, start the AMIS_unittest_test_update_db function. When you go back to the X-Ray service map, you can see that some of the calls to the functions have failed (the yellow part). The green parts succeeded. Writing to the non-existing-AMIS-unittest-sho… table never succeeded. I think we shouldn’t worry too much about this…

AWS Shop example: Amazon X-Ray 12 unittest object under test test

Effects on performance

We might expect some effects of adding the X-Ray functionality to our code. To see the effects, let’s look at the output of the get_statistics Lambda function of the performance tests. It is fair to compare (just) the accept function: this function is only changed for the X-Ray functionality, it has not been changed to solve the SNS duplicate messages issue. Left is the accept function in the shop-2 environment (I copied it from the blog about performance tests), right is the accept function after adding the X-Ray library:

AWS Shop example: Amazon X-Ray 13 Performance accept

On average, the function is 30 ms slower than before. The minimum duration is 12 ms slower, which is a smaller difference than I expected. The maximum duration is about 0.7 seconds slower than before: this is where the size of the zip file matters. On average, the function used 10 MB more memory than before.

Play along

You can play along with this series [2]. See the first blog in this series for how to start the VM. Don’t deploy shop-1 or shop-2 at the same time as shop-3: some objects have the same name and your deployment will fail.