OpenTelemetry Lambda for Python!
by Ted Young
In December, Alex Boten did some great work experimenting with running OpenTelemetry on Lambda. A number of improvements have happened since then, so it’s time for another update.
Most importantly, the Lightstep Lambda extension has been donated to OpenTelemetry. You can find it here.
A Lambda extension is a way of running a separate process alongside your serverless function. This allows you to add an OpenTelemetry Collector as a sidecar.
The main goal of running a sidecar is that it moves the configuration and management of OpenTelemetry out of your serverless function. Ultimately, we’d like to offer an experience where you can take an existing Lambda function and add observability to it without having to modify the function itself.
AWS has now taken this a step further and added auto-instrumentation support for Python. It provides a layer that bundles together two extensions: a stripped-down Collector running as an external extension, and a Python extension which sets up OpenTelemetry within your function. This means you can add OpenTelemetry to your function without writing any code, and control your configuration via YAML.
This automation is helpful in several ways.
It automatically installs any available instrumentation, such as Redis, Flask, SQLAlchemy, etc. (BTW, you can automatically instrument any Python service, not just on Lambda. The details can be found here, under “Getting Started.”)
The extension will automatically add relevant Lambda-specific resource attributes to your traces, such as the function name, ARN, version, and request ID, so that the traces can be properly indexed.
The SDK and the collector have been tuned to buffer and batch data in a way that works best for capturing and sending data with a single invoke. This includes flushing the data before the function is shut down.
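To make the last two points concrete, here is a minimal sketch of the idea, not the layer's actual code. It pulls Lambda details off the invoke context and flushes buffered data before the handler returns. The context fields (function_name, function_version, aws_request_id, invoked_function_arn) are the ones the Python runtime provides. Everything else is an assumption for illustration: the attribute keys are loosely modeled on OpenTelemetry's semantic conventions and may not match what the layer emits, and BufferingExporter and flush_after_invoke are hypothetical names standing in for the SDK's batching machinery.

```python
import functools
import os
from types import SimpleNamespace


def lambda_resource_attributes(context):
    """Collect Lambda-specific attributes from the invoke context and environment."""
    # Key names are illustrative assumptions, not confirmed output of the layer.
    return {
        "faas.name": context.function_name,
        "faas.version": context.function_version,
        "faas.invocation_id": context.aws_request_id,
        "cloud.resource_id": context.invoked_function_arn,
        "cloud.region": os.environ.get("AWS_REGION", "unknown"),
    }


class BufferingExporter:
    """Hypothetical stand-in for a batch processor: buffer first, send on flush."""

    def __init__(self):
        self.buffer = []
        self.sent = []

    def record(self, span):
        self.buffer.append(span)

    def flush(self):
        # Hand off everything buffered during this invoke as a single batch.
        batch, self.buffer = self.buffer, []
        if batch:
            self.sent.append(batch)


def flush_after_invoke(exporter):
    """Wrap a handler so buffered spans are flushed before the invoke returns."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(event, context):
            span = {"name": handler.__name__, **lambda_resource_attributes(context)}
            try:
                return handler(event, context)
            finally:
                # Runs even if the handler raises, so nothing is left in the
                # buffer when the environment is frozen or shut down.
                exporter.record(span)
                exporter.flush()
        return wrapper
    return decorator


exporter = BufferingExporter()


@flush_after_invoke(exporter)
def handler(event, context):
    return {"statusCode": 200}


# A fake context object, standing in for the one Lambda passes to the handler.
fake_context = SimpleNamespace(
    function_name="demo",
    function_version="$LATEST",
    aws_request_id="req-1",
    invoked_function_arn="arn:aws:lambda:us-east-1:123456789012:function:demo",
)
response = handler({}, fake_context)
```

If you manage the SDK yourself rather than using the layer, the analogous step is calling force_flush on the SDK's tracer provider at the end of each invoke.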
Sounds good, right? Unfortunately, there are still some tricky issues and tradeoffs, which are worth being aware of. Let’s dive into them.
Lambda functions have caps on the amount of RAM and the size of the image. Extensions count against these limits. To help with the image size, AWS offers a stripped-down version of the Collector with most of the plugins removed (just X-Ray and the OTLP exporter remaining). Memory is the other concern: buffering data can create memory pressure, and the Collector itself consumes memory. The provided Collector has been tuned to help mitigate this, but it is good to be aware of it, so consider checking your memory usage when you first roll it out.
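One low-effort way to check is to compare the memory figures Lambda reports at the end of each invoke, before and after adding the layer. The sketch below parses a REPORT log line and computes how much of the limit was used. The line format is an assumption based on what Lambda typically writes to its logs, the 70% threshold is an arbitrary example, and memory_headroom is a hypothetical helper name. It is worth confirming in your own environment that the reported figure reflects the extension processes as well as your function.

```python
import re

# Assumed shape of the tail of a Lambda REPORT log line.
REPORT_PATTERN = re.compile(
    r"Memory Size:\s*(?P<limit>\d+)\s*MB\s+Max Memory Used:\s*(?P<used>\d+)\s*MB"
)


def memory_headroom(report_line):
    """Return (limit_mb, used_mb, fraction_used), or None if the line does not match."""
    match = REPORT_PATTERN.search(report_line)
    if match is None:
        return None
    limit_mb = int(match.group("limit"))
    used_mb = int(match.group("used"))
    return limit_mb, used_mb, used_mb / limit_mb


# Made-up sample line for illustration; real values will differ.
sample = (
    "REPORT RequestId: 00000000-0000-0000-0000-000000000000\t"
    "Duration: 12.34 ms\tBilled Duration: 13 ms\t"
    "Memory Size: 128 MB\tMax Memory Used: 96 MB"
)

limit_mb, used_mb, fraction = memory_headroom(sample)
if fraction > 0.7:  # hypothetical alert threshold
    print(f"warning: {used_mb} of {limit_mb} MB used")
```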
Lambda functions may be frozen instead of shut down, in order to leave them available for future requests while ceasing to consume resources in the meantime. Unlike shutdown, there is no signal before a freeze occurs. This can interrupt network connections if it occurs while one is in flight. Batching is tuned to help with this, but occasionally the collector may need to resend data if it gets interrupted. This can be problematic, as the data may arrive late.
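As a rough illustration of what resending after an interruption looks like, here is a small retry loop with exponential backoff. This is a sketch under my own assumptions, not the Collector's actual retry logic: send_with_retry, flaky_send, and the delay values are all hypothetical. The flaky sender simulates a connection that was cut by a freeze and succeeds on the second try.

```python
import time


def send_with_retry(send, batch, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Send a batch, retrying with exponential backoff on connection errors.

    Returns the number of tries it took. Re-raises if every attempt fails.
    """
    for attempt in range(attempts):
        try:
            send(batch)
            return attempt + 1
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            # Wait 0.1s, 0.2s, 0.4s, ... before trying again.
            sleep(base_delay * (2 ** attempt))


calls = []


def flaky_send(batch):
    """Hypothetical sender whose first attempt is interrupted."""
    calls.append(batch)
    if len(calls) == 1:
        raise ConnectionResetError("connection lost while frozen")


# The span records its own start time, so a late resend still carries
# the time the work actually happened.
batch = [{"name": "handler", "start_time_unix_nano": time.time_ns()}]
tries = send_with_retry(flaky_send, batch, sleep=lambda _: None)
```

Note that retrying only recovers the data. It does not make it arrive on time, which is the tradeoff described above.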
I did some light digging into the freezing issue on Lambda since it seems like signaling a freeze would be a handy feature. I noticed that Lambda runs on the Firecracker virtual machine manager, which itself is built on KVM. It looks like neither of these layers supports a freeze signal. This makes sense. No one thinking about virtualization or containers prior to serverless would have considered this a useful feature. Freezing was intended to make a virtual machine cease to run as quickly as possible, without the contained process having any awareness of the occurrence. See the Linux cgroups freezer as the relevant example in this case. I think this is fascinating stuff, so if anyone reading this has further details on how a freeze signal might work with KVM, please DM me on Twitter.
Those gotchas aside, this is a big improvement for the state of observability on Lambda. You can use the extension today with any language, provided you manage the SDK yourself, and now there is automated support for Python. As support is extended to other languages and the implementation is further tuned, I’ll keep you updated.
Interested in a further recap? I recently did a webinar with Nizar Tyrewalla from AWS, covering an overview of OpenTelemetry and a deep dive into the Lambda runtime. You can check out the entire recording here, or skip right to the Lambda details. There is also a Lambda working group which currently meets every Wednesday. You can find it on the OpenTelemetry public calendar, or join the OpenTelemetry Slack channel and say hi. (If you are new, you can create a CNCF Slack account here.)