Monitoring Serverless Functions Using OpenTracing

June 21, 2018 | Parker Edwards | Technical

Introduction

The adoption of serverless functions is reaching record levels within enterprise organizations. Interestingly, given the growing adoption and interest, many monitoring solutions silo the performance of code executed in these environments, or provide only basic metrics about the execution. To understand the performance of applications, I want to know what bottlenecks exist, where time is being spent, and the current state of each system involved in fulfilling a request. While metrics, logs, and segmented stack traces are helpful, a more cohesive performance story is still the most useful way to understand an application, and should be achievable using existing technologies.

In this post, I’ll explore how to achieve a cohesive performance story using OpenTracing to instrument an Express app running in a local container, and a function running on Lambda. For visualizations and analysis, I’ll use LightStep [x]PM for this example, although you can choose another OpenTracing backend like Jaeger.

System performance is exactly the sum of its parts

Before we begin, it’s worth examining the purpose of this exercise. As I mentioned, almost every FaaS (function as a service) provider offers some amount of visibility into the performance of individual functions at reasonable price points.

Having invocation counts, error rates, logs, and even stack traces at your fingertips is very compelling. Even more compelling is getting this information without having to do much on top of your actual development work. However, when performance data is segmented by actor, piecing together a cohesive story is an exercise in frustration. Whenever I’m working on a project and trying to quantify the value of that work, I tend to ask, “So what? Who cares?” Or put another way: what value exactly am I creating through this work, product, or technology?

The value of monitoring is measured at the intersection of the unique data provided, the problems being solved, and how it gets into the hands of those who need it. While metrics and individual stack traces are appropriate for many use cases, I created this example because, as a developer writing new features, refactoring existing code, or just keeping the lights on, you may have to rely on systems and services outside your control. In those cases, distributed tracing is, in my opinion, the best tool for the job.

Hello, OpenTracing

OpenTracing is a vendor-neutral interface that defines how to measure the performance of any individual operation or component in your infrastructure, and how you can tie those individual bits together into a cohesive end-to-end performance story. Importantly, the data required to do so is well defined and very simple: start and stop timestamps, service and operation names, and a common transaction ID are essentially all you need to get started. The backend typically takes care of “gluing” together the related operations into a single trace, so overhead can be extremely minimal.

OpenTracing’s lightweight data model makes it a perfect fit for measuring the performance of ephemeral architecture components, including containers and serverless functions. We’ll need three things: the OpenTracing data, a way to get that data to our backend, and of course, the backend itself.
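To make the simplicity of that data model concrete, here is a toy sketch in plain JavaScript. This is not the OpenTracing API itself (we'll use the real API below); it only illustrates what a span fundamentally carries: names, timestamps, and shared IDs.

```javascript
// A toy illustration of the OpenTracing data model (not the real API):
// each span records little more than names, timestamps, and a few IDs.
function startSpan(serviceName, operationName, parent) {
  return {
    serviceName,
    operationName,
    // all spans in one transaction share a trace ID; each span gets its own span ID
    traceId: parent ? parent.traceId : Math.random().toString(16).slice(2),
    spanId: Math.random().toString(16).slice(2),
    parentSpanId: parent ? parent.spanId : null,
    startTime: Date.now(),
    finishTime: null,
    finish() { this.finishTime = Date.now(); },
  };
}

const parent = startSpan('node-service', 'router');
const child = startSpan('lambda_function', 'lambda-execution', parent);
child.finish();
parent.finish();
```

With nothing more than this, a backend can reassemble the tree: spans sharing a trace ID belong to one transaction, and parent/child IDs plus timestamps establish the hierarchy and the timing.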

The OpenTracing data

OpenTracing defines the exact API calls required in each language to extract the necessary information and put together an end-to-end trace. The intent is for these calls to be made throughout your environment, and then for your desired client library (the library that receives OpenTracing data and sends it to your backend) to receive the resulting data by establishing itself as the global destination for the instrumentation. This means you define exactly what you’d like to measure in your code. More importantly, because OpenTracing is an open standard, it’s widely adopted by library and infrastructure developers, and many popular frameworks and tools have built-in OpenTracing instrumentation.

In this example, we won’t leverage any of the community-driven plugins and instrumentation. Instead, we’ll write the instrumentation by hand to demonstrate explicitly how the technology works. We’ll start with the backend.

A brief note on SpanContext

As I mentioned, OpenTracing requires a common transaction ID to be present when each measurement is made. This is bundled and transmitted throughout your system using an object called the SpanContext. It can be included in many carriers, such as HTTP headers, text maps, and binary blobs, and transmitted in whatever format works for your services. In this example, since we’re making HTTP requests, we’ll inject the SpanContext into the HTTP headers.
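Conceptually, inject and extract just write and read a few well-known keys on the carrier. A hand-rolled sketch of the mechanics (the header names here are invented for illustration; every real tracer defines its own):

```javascript
// A hand-rolled sketch of inject/extract over an HTTP-header carrier.
// The "x-example-*" header names are made up; real tracers define their own.
function inject(spanContext, headerCarrier) {
  headerCarrier['x-example-trace-id'] = spanContext.traceId;
  headerCarrier['x-example-span-id'] = spanContext.spanId;
}

function extract(headerCarrier) {
  if (!headerCarrier['x-example-trace-id']) return null;
  return {
    traceId: headerCarrier['x-example-trace-id'],
    spanId: headerCarrier['x-example-span-id'],
  };
}

// sending side: inject the context into outbound request headers
const headers = { 'Content-Type': 'application/json' };
inject({ traceId: 'abc123', spanId: 'def456' }, headers);

// receiving side: recover the context from the inbound headers
const extracted = extract(headers);
```

The real `tracer.inject()` and `tracer.extract()` calls we use below do exactly this, with the carrier format (`FORMAT_HTTP_HEADERS` here) telling the tracer how to serialize the context.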

Backend instrumentation

First, we’re going to initialize our client library and assign it as our global tracer, which makes it the destination for any OpenTracing data emitted throughout the application.

let opentracing = require('opentracing');
let lightstep = require('lightstep-tracer');

opentracing.initGlobalTracer(new lightstep.Tracer({
  access_token: process.env['lightstep-token'],
  component_name: 'node-service',
}));

Next, we’ll instrument one of our Express routes with OpenTracing.

var express = require('express');
var router = express.Router();
var createError = require('http-errors');
var request = require('request-promise');
var opentracing = require('opentracing');

router.get('/ot-gen', function(req, res, next) {
  // create our parent span, using the operation name "router"
  let parentSpan = opentracing.globalTracer().startSpan('router');
  // assign the count passed in our request query to a variable which will be used in our for loop below
  let count = parseInt(req.query.count, 10);
  let promises = [];
  if (count) {
    for (let c = 0; c < count; c++) {
      promises.push(new Promise((resolve) => {
        // create our child spans for each request that will be made to Lambda
        let childSpan = opentracing.globalTracer().startSpan('service-egress', { childOf: parentSpan });
        // create a carrier object and inject the child span's SpanContext into it
        var headerCarrier = {
          'Content-Type': 'application/json'
        };
        opentracing.globalTracer().inject(childSpan.context(), opentracing.FORMAT_HTTP_HEADERS, headerCarrier);
        // make our outbound POST request to our Lambda function, with the SpanContext in the request headers
        request.post(process.env['lambda-url'], { headers: headerCarrier, body: { 'example': 'request' }, json: true })
          .then((response) => {
            // append some contextual information to the span for use later in LightStep
            childSpan.logEvent('remote request ok', { response });
            childSpan.setTag('Destination', 'xxx.us-east1.amazonaws.com/production/');
          })
          .catch((err) => {
            // if there is an error in the Lambda response, attach an error boolean tag and the error
            // message received to the span, which we can then access later in LightStep
            childSpan.setTag('error', 'true');
            childSpan.logEvent('remote request failed', {
              error: err
            });
          })
          .then(() => {
            // finish our child span
            childSpan.finish();
          });
        // resolve once the request has been dispatched; we don't wait for the response here
        resolve();
      }));
    }
  } else {
    // basic route functionality
    next(createError(400, 'Count not provided'));
  }
  // end the response so it doesn't stay open while we wait for Lambda to return our responses
  res.end();
  // after all requests have been made, end the parent span
  Promise.all(promises).then(() => parentSpan.finish());
});
module.exports = router;

As you can see, I’ve made a route here which allows us to instruct Express to send a variable number of requests (passed as a query parameter) to our Lambda function. I’ve left comments in the code above, but the primary takeaways and steps here are:

  1. Start a parent span when the route is invoked. This establishes the foundation of our trace hierarchy, which we’ll use in the next step.
  2. For each spawned Lambda function, we want to create a child span. In OpenTracing, each span has its own ID, which along with the persistent transaction ID, is included in the SpanContext. This means each child span has its own SpanContext that it needs to pass along to the next hop in the system.
  3. Inject the SpanContext into the request headers and attach it to our Lambda POST request.
  4. If there is an error in the request, we attach the error along with a boolean KV pair which tells our backend that there is an error present on the active span. This becomes really useful in a system like LightStep [x]PM, where error rate is automatically calculated and can be used as a foundation for alerts and analytics.
  5. End our parent span after all Lambda requests are made, and end each child span after we receive a 200 from the AWS API Gateway. We’re ending our parent span here after the Lambda requests leave our app, but we aren’t waiting for the responses. This is a matter of preference, and it can be modified to fit your needs.
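If you’d rather have the parent span cover the full round trip, resolve each child promise only after the response (or error) has been handled, and finish the parent inside the `Promise.all` callback. A minimal sketch of that variation, using stand-in spans and a fake request so the pattern is visible on its own (in the real route these would be OpenTracing spans and `request.post` calls):

```javascript
// Stand-in span: the real one comes from opentracing.globalTracer().startSpan(...)
function fakeSpan(name) {
  return { name, finished: false, finish() { this.finished = true; } };
}

// Stand-in for the outbound Lambda request
function fakeRequest() {
  return new Promise((resolve) => setTimeout(() => resolve({ ok: true }), 10));
}

const parentSpan = fakeSpan('router');
const promises = [];
for (let c = 0; c < 3; c++) {
  promises.push(new Promise((resolve) => {
    const childSpan = fakeSpan('service-egress');
    fakeRequest()
      .catch(() => { /* tag the error on childSpan here */ })
      .then(() => {
        childSpan.finish();
        // resolve only once the response has been handled...
        resolve();
      });
  }));
}
// ...so the parent span now covers the full round trip
const done = Promise.all(promises).then(() => parentSpan.finish());
```

With this shape, the parent span’s duration includes the slowest Lambda response, which is often what you want when measuring end-to-end latency.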

Function instrumentation

In instrumenting any serverless function, we want to make certain that we’re being as performant as possible. Most providers take execution time into account, so any overhead added during monitoring has a direct cost associated with it.

There’s one extra step when tracing HTTP requests on Lambda: exposing the HTTP headers in the event. Luckily, this great blog post by Ken Brodhagen explains how to do it. Once you’ve added the mapping template to the API Gateway, ‘event.headers’ will contain the HTTP headers, allowing you to extract the SpanContext.
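For reference, the mapping template follows the common API Gateway pattern of copying every request header into the event. This is a sketch of that pattern (consult Brodhagen’s post and the API Gateway mapping template reference for the exact template to use):

```
{
  "body": $input.json('$'),
  "headers": {
    #foreach($header in $input.params().header.keySet())
    "$header": "$util.escapeJavaScript($input.params().header.get($header))"#if($foreach.hasNext),#end
    #end
  }
}
```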

var opentracing = require('opentracing');
var lightstep   = require('lightstep-tracer');

opentracing.initGlobalTracer(new lightstep.Tracer({
    access_token: process.env['lightstep-token'],
    component_name: 'lambda_function',
    // along with other considerations, we remove any initial reporting delay in the library
    // initialization to avoid unnecessary overhead. NOTE: this is a LightStep-tracer-specific
    // option, though similar options should be available in your tracer of choice
    delay_initial_report_millis: 0,
}));

exports.handler = (event, context, callback) => {
    // all interactions with a request in a Lambda function (after it is through the API Gateway) occur through a handler
    // we don't really need promises here, but we've utilized them for part 2 of our exploration, which will be coming soon
    let promise = new Promise(function (resolve) {
        // we extract the SpanContext from the headers of our request
        let extractedContext = opentracing.globalTracer().extract(opentracing.FORMAT_HTTP_HEADERS, event.headers);
        // and use it to start a span called "lambda-execution"
        let span = opentracing.globalTracer().startSpan('lambda-execution', { childOf : extractedContext });
        span.finish();
        resolve();
    });
    // signal completion to Lambda once our span has been recorded
    promise.then(() => callback(null, { statusCode: 200 }));
};

You’ll notice that we have a configuration option in our Tracer initialization which zeros out an initial reporting delay. We typically suggest setting this option to a few hundred milliseconds to allow the NodeJS event loop to flush and stabilize before adding new network calls. However, since Lambda is highly optimized and our function is lightweight, we remove any delay to reduce our overhead.

Walking through the instrumentation in this function, we initialize our tracer and attach it to OpenTracing as the global tracer. Then we extract the SpanContext from the request headers, and use it to start the span we’ll use for all measurements in this function. We then immediately finish the span, but in a more realistic example, you’d probably do some kind of work between these calls.

Putting it all together

[Screenshot: a LightStep [x]PM trace of the instrumented request]
The final trace illustrates the performance of the entire transaction, from Express routing to Lambda execution.

LightStep [x]PM takes all of the emitted span data from both environments and uses the timestamps and relational data to put together a cohesive story of the system performance. In operations where you might want some additional context outside of timing, or other golden signals, you can add any number of key/value pairs (including metrics) and logs right from your code. To make this automatic, many users add the OpenTracing logging calls to their existing error handling or logging libraries, which adds exception messages directly to the individual operation.
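One hypothetical way to wire this up (the names here are made up, not part of any library) is a thin wrapper that mirrors every error log onto the currently active span, so exceptions reach both your log aggregator and the trace:

```javascript
// A hypothetical wrapper that mirrors error logs onto the active span.
// getActiveSpan() would come from your own request-scoped state; the span
// methods (setTag/logEvent) match the OpenTracing calls used earlier.
function wrapLogger(logger, getActiveSpan) {
  return {
    error(message, fields) {
      const span = getActiveSpan();
      if (span) {
        // flag the span as errored and attach the message for later analysis
        span.setTag('error', 'true');
        span.logEvent(message, fields);
      }
      logger.error(message, fields);
    },
  };
}

// example with stand-ins for the span and the underlying logger:
const events = [];
const mockSpan = {
  tags: {},
  setTag(k, v) { this.tags[k] = v; },
  logEvent(msg, fields) { events.push({ msg, fields }); },
};
const log = wrapLogger({ error() {} }, () => mockSpan);
log.error('upstream timeout', { code: 504 });
```

Because the wrapper preserves the original logger’s interface, it can be dropped into existing error-handling code without touching call sites.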

This view, enabled through the lightweight data model of OpenTracing and the low overhead of the LightStep client library, is the fastest and most context-rich way to understand a distributed performance story. Feel free to contact us with any comments or questions, or to share your own OpenTracing experiences.
