In December, Alex Boten did some great work experimenting with running OpenTelemetry on Lambda. A number of improvements have happened since then, so it’s time for another update.
Most importantly, the Lightstep Lambda extension has been donated to OpenTelemetry. You can find it here.
What is a Lambda extension?
A Lambda extension is a way of running a separate process alongside your serverless function. This allows you to add an OpenTelemetry Collector as a sidecar.
The main benefit of running a sidecar is that it moves the configuration and management of OpenTelemetry out of your serverless function. Ultimately, we’d like to offer an experience where you can take an existing Lambda function and add observability to it without having to modify the function itself.
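Concretely, the sidecar is attached to a function as a Lambda layer. As a minimal sketch, attaching it in an AWS SAM template might look like the following; the layer ARN shown is a placeholder, not the real published layer (the actual ARN comes from the extension’s release notes):

```yaml
Resources:
  MyFunction:
    Type: AWS::Serverless::Function
    Properties:
      Runtime: python3.8
      Handler: app.handler
      # The Collector extension rides along as a layer. This ARN is a
      # placeholder for illustration only.
      Layers:
        - arn:aws:lambda:us-east-1:123456789012:layer:otel-collector:1
```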
OpenTelemetry Lambda for Python
To help with this, AWS has now taken this a step further and added auto-instrumentation support for Python. This creates a layer that bundles together two extensions: a stripped-down Collector running as an external extension, and a Python extension which sets up OpenTelemetry within your function. This means you can add OpenTelemetry to your function without writing any code, and control your configuration via YAML.
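For example, the bundled Collector is driven by a small YAML file. The sketch below is illustrative, assuming an OTLP backend at a placeholder endpoint; the exact receivers and exporters available depend on the stripped-down build:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
exporters:
  otlp:
    # Placeholder endpoint; point this at your tracing backend.
    endpoint: ingest.example.com:443
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
```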
This automation is helpful in several ways.
It automatically installs any available instrumentation, such as Redis, Flask, SQLAlchemy, etc. (BTW, you can automatically instrument any Python service, not just on Lambda. The details can be found here, under “Getting Started.”)
The extension will automatically add relevant Lambda-specific resources to your trace, such as the function name, ARN, version, request ID, etc., so that the traces can be properly indexed.
The SDK and the collector have been tuned to buffer and batch data in a way that works best for capturing and sending data with a single invoke. This includes flushing the data before the function is shut down.
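To illustrate the resource detection mentioned above, here is a stdlib-only sketch of how Lambda-specific attributes can be derived from the environment variables the runtime sets. The attribute keys follow OpenTelemetry’s FaaS semantic conventions; the function name in the usage example is hypothetical, and in practice the extension does this for you:

```python
import os

def lambda_resource_attributes(environ=os.environ):
    """Collect Lambda-specific resource attributes, mirroring what the
    Python extension attaches automatically to every trace."""
    # These environment variables are set by the Lambda runtime itself.
    return {
        "faas.name": environ.get("AWS_LAMBDA_FUNCTION_NAME"),
        "faas.version": environ.get("AWS_LAMBDA_FUNCTION_VERSION"),
        "cloud.region": environ.get("AWS_REGION"),
        "cloud.provider": "aws",
    }
```

Inside a real function, these attributes would be merged into the SDK’s `Resource` so that every span carries them.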
Sounds good, right? Unfortunately, there are still some tricky issues and tradeoffs, which are worth being aware of. Let’s dive into them.
Lambda functions have caps on the amount of RAM and the size of the image, and extensions count against these limits. To help with this, AWS offers a stripped-down version of the Collector with most of the plugins removed (only the X-Ray and OTLP exporters remain), which reduces the image size. Buffering data can create memory pressure, and the Collector itself consumes memory. The provided Collector has been tuned to mitigate this, but it is good to be aware of it, so consider checking your memory usage when you first roll it out.
Lambda functions may be frozen instead of shut down, leaving them available for future requests while ceasing to consume resources in the meantime. Unlike shutdown, there is no signal before a freeze occurs. This can interrupt network connections that are in flight. Batching is tuned to help with this, but occasionally the Collector may need to resend data if a send gets interrupted. This can be problematic, as the data may arrive late.
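Since there is no freeze signal, resending is the only recourse. Below is a toy sketch of that retry pattern, not the Collector’s actual implementation: it assumes a `send` callable that raises `ConnectionError` when a freeze cuts the connection.

```python
import time

def send_with_retry(send, batch, attempts=3, backoff=0.5):
    """Retry an export that may be cut off by a freeze.

    `send` is any callable that raises ConnectionError on an
    interrupted connection; later attempts deliver the data late
    rather than losing it entirely.
    """
    for attempt in range(attempts):
        try:
            send(batch)
            return True
        except ConnectionError:
            # Back off before retrying. In Lambda, the retry may not
            # actually run until the next invoke thaws the function,
            # which is why the data can arrive late.
            time.sleep(backoff * (2 ** attempt))
    return False
```

This is the behavior to keep in mind when interpreting late-arriving spans from a Lambda-hosted Collector.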
I did some light digging into the freezing issue on Lambda, since it seems like signaling a freeze would be a handy feature. I noticed that Lambda runs on the Firecracker virtual machine manager, which itself is built on KVM. It looks like neither of these layers supports a freeze signal. This makes sense: no one thinking about virtualization or containers prior to serverless would consider this a useful feature. Freezing was intended to make a virtual machine cease to run as quickly as possible, without the contained process having any awareness of the occurrence. See the Linux cgroups freezer as the relevant example in this case. I think this is fascinating stuff, so if anyone reading this has further details on how a freeze signal might work with KVM, please DM me on Twitter.
Those gotchas aside, this is a big improvement for the state of observability on Lambda. You can use the extension today with any language, provided you manage the SDK yourself, and now there is automated support for Python. As support is extended to other languages and the implementation is further tuned, I’ll keep you updated.
Interested in a further recap? I recently did a webinar with Nizar Tyrewalla from AWS, covering an overview of OpenTelemetry and a deep dive into the Lambda runtime. You can check out the entire recording here, or skip right to the Lambda details. There is also a Lambda working group, which currently meets every Wednesday. You can find it on the OpenTelemetry public calendar, or join the OpenTelemetry Slack channel and say hi. (If you are new, you can create a CNCF Slack account here.)
Interested in joining our team? See our open positions here.
February 26, 2021
About the author
Ted Young