Monitoring Serverless Functions Using OpenTracing

Introduction

The adoption of serverless functions is reaching record levels within enterprise organizations. Despite this growing adoption and interest, many monitoring solutions still silo the performance of the code executed in these environments, or provide only basic metrics about each execution. To understand the performance of an application, I want to know what bottlenecks exist, where time is being spent, and the current state of each system involved in fulfilling a request. While metrics, logs, and segmented stack traces are helpful, a cohesive performance story is still the most useful way to understand an application, and it should be achievable with existing technologies.

In this post, I’ll explore how to achieve a cohesive performance story using OpenTracing to instrument an Express app running in a local container, and a function running on Lambda. For visualizations and analysis, I’ll use LightStep [x]PM for this example, although you can choose another OpenTracing backend like Jaeger.

System performance is exactly the sum of its parts

Before we begin, it’s worth examining the purpose of this exercise. As I mentioned, almost every FaaS (function as a service) provider offers some amount of visibility into the performance of individual functions at reasonable price points.

Having invocation counts, error rates, logs, and even stack traces at your fingertips is very compelling. Even more compelling is getting this information without having to do much on top of your actual development work. But when performance data is segmented by actor, piecing together a cohesive story becomes an exercise in frustration. Whenever I’m working on a project and trying to quantify the value of that work, I tend to think, “so what, who cares?” Or put another way: what value exactly am I creating through this work, product, or technology?

The value of monitoring is measured at the intersection of the unique data provided, the problems being solved, and how it gets into the hands of those who need it. While metrics and individual stack traces are appropriate for many use cases, I created this example because, as a developer writing new features, refactoring existing code, or just keeping the lights on, you may have to rely on systems and services outside your control. In those cases, distributed tracing is, in my opinion, the best tool for the job.

Hello, OpenTracing

OpenTracing is a vendor-neutral interface that defines how to measure the performance of any individual operation or component in your infrastructure, and how you can tie those individual bits together into a cohesive end-to-end performance story. Importantly, the data required to do so is well defined and very simple: start and stop timestamps, service and operation names, and a common transaction ID are essentially all you need to get started. The backend typically takes care of “gluing” together the related operations into a single trace, so overhead can be extremely minimal.
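
As a minimal sketch of what that looks like in Node.js (the operation and tag names here are just placeholders), a single measurement is little more than starting and finishing a span:

let opentracing = require('opentracing');

// globalTracer() returns a no-op tracer until a real one is registered,
// so this snippet is safe to run before any backend is configured
let tracer = opentracing.globalTracer();

// startSpan records the operation name and a start timestamp
let span = tracer.startSpan('example-operation');

// optional key/value context can be attached at any point
span.setTag('component', 'example-service');

// finish() records the stop timestamp; the transaction ID that ties
// related spans together lives in span.context()
span.finish();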

OpenTracing’s lightweight data model makes it a perfect fit for measuring the performance of ephemeral architecture components, including containers and serverless functions. We’ll need three things: the OpenTracing data, a way to get that data to our backend, and of course, the backend itself.

The OpenTracing data

OpenTracing defines the exact API calls required in each language to extract the necessary information and put together an end-to-end trace. The intent is for these calls to be made throughout your environment, and then for your desired client library (the library that receives OpenTracing data and sends it to your backend) to receive the resulting data by establishing itself as the global destination for the instrumentation. This means you define exactly what you’d like to measure in your code. More importantly, because OpenTracing is an open standard, it’s widely adopted by library and infrastructure developers, and many popular frameworks and tools have built-in OpenTracing instrumentation.

In this example, we won’t leverage any of the community-driven plugins and instrumentation. Instead, we’ll write the instrumentation by hand to show explicitly how the technology works. After a brief note on SpanContext, we’ll start with the backend service.

A brief note on SpanContext

As I mentioned, OpenTracing requires a common transaction ID to be present when each measurement is made. This ID is bundled and transmitted throughout your system in an object called the SpanContext. It can be carried in HTTP headers, text maps, or binary blobs, and transmitted in whatever format works for your services. In this example, since the outbound call is an HTTP request, we’ll inject the SpanContext into the HTTP headers.
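
As a rough sketch of those mechanics (the real calls appear in the service and function code below), inject and extract are mirror images of each other:

let opentracing = require('opentracing');
let tracer = opentracing.globalTracer();

// sending side: serialize the span's SpanContext into a header-shaped carrier
let parent = tracer.startSpan('outbound-call');
let carrier = {};
tracer.inject(parent.context(), opentracing.FORMAT_HTTP_HEADERS, carrier);
// carrier now holds the trace and span IDs as plain header key/value pairs

// receiving side: rebuild the SpanContext from those headers and continue the trace
let extracted = tracer.extract(opentracing.FORMAT_HTTP_HEADERS, carrier);
let child = tracer.startSpan('inbound-work', { childOf: extracted });
child.finish();
parent.finish();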

Backend instrumentation

First, we’re going to initialize our client library and assign it as our global tracer, which makes it the destination for any OpenTracing data emitted throughout the application.

let opentracing = require('opentracing');
let lightstep = require('lightstep-tracer');

opentracing.initGlobalTracer(new lightstep.Tracer({
  access_token: process.env['lightstep-token'],
  component_name: 'node-service',
}));

Next, we’ll instrument one of our Express routes with OpenTracing.

var express = require('express');
var router = express.Router();
var createError = require('http-errors');
var request = require('request-promise');
var opentracing = require('opentracing');

router.get('/ot-gen', function(req, res, next) {
  // create our parent span, using the operation name "router"
  let parentSpan = opentracing.globalTracer().startSpan('router');
  // assign the count passed in our request query to a variable which will be used in our for loop below
  let count = parseInt(req.query.count);
  let promises = [];
  if (count) {
    for (let c = 0; c < count; c++) {
      promises.push(new Promise((resolve) => {
        // create our child spans for each request that will be made to Lambda
        let childSpan = opentracing.globalTracer().startSpan('service-egress', { childOf: parentSpan });
        // create a carrier object and inject the child span's SpanContext into it
        let headerCarrier = {
          'Content-Type': 'application/json'
        };
        opentracing.globalTracer().inject(childSpan.context(), opentracing.FORMAT_HTTP_HEADERS, headerCarrier);
        // make our outbound POST request to our Lambda function, with the SpanContext injected into the request headers
        request.post(process.env['lambda-url'], { headers: headerCarrier, body: { 'example': 'request' }, json: true })
          .then((response) => {
            // append some contextual information to the span for use later in LightStep
            childSpan.logEvent('remote request ok', { response });
            childSpan.setTag('Destination', 'xxx.us-east1.amazonaws.com/production/');
          })
          .catch((err) => {
            // if there is an error in the Lambda response, attach an error boolean tag and the error
            // message received to the span, which we can then access later in LightStep
            childSpan.setTag('error', 'true');
            childSpan.logEvent('remote request failed', {
              error: err
            });
          })
          .then(() => {
            // finish our child span once the response (or error) has been handled
            childSpan.finish();
          });
        // resolve as soon as the request has been dispatched; we aren't waiting for the Lambda response
        resolve();
      }));
    }
  } else {
    // basic route functionality
    next(createError(400, 'Count not provided'));
  }
  // end the response so it doesn't stay open while we wait for Lambda to return our responses
  res.end();
  // after all requests have been made, end the parent span
  Promise.all(promises).then(() => parentSpan.finish());
});
module.exports = router;

As you can see, I’ve created a route that instructs Express to send a variable number of requests (passed as a query parameter) to our Lambda function. I’ve left comments in the code above, but the primary takeaways and steps are:

  1. Start a parent span when the route is invoked. This establishes the foundation of our trace hierarchy, which we’ll use in the next step.
  2. For each spawned Lambda function, we want to create a child span. In OpenTracing, each span has its own ID, which along with the persistent transaction ID, is included in the SpanContext. This means each child span has its own SpanContext that it needs to pass along to the next hop in the system.
  3. Inject the SpanContext into the request headers and attach it to our Lambda POST request.
  4. If there is an error in the request, we attach the error along with a boolean KV pair which tells our backend that there is an error present on the active span. This becomes really useful in a system like LightStep [x]PM, where error rate is automatically calculated and can be used as a foundation for alerts and analytics.
  5. End our parent span after all Lambda requests are made, and end each child span after we receive a 200 from the AWS API Gateway. We’re ending our parent span here after the Lambda requests leave our app, but we aren’t waiting for the responses. This is a matter of preference, and it can be modified to fit your needs, as sketched below.
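
For example, if you’d rather have the parent span cover the full round trip instead of just the fan-out, here’s a rough sketch of how the loop body and the final Promise.all call in the route above could be reworked (using the same lambda-url environment variable as above; one option among several):

// variation on the route above: resolve each promise only after the Lambda
// response (or error) has been handled, so Promise.all (and therefore the
// parent span) waits for the full round trip
promises.push(new Promise((resolve) => {
  let childSpan = opentracing.globalTracer().startSpan('service-egress', { childOf: parentSpan });
  let headerCarrier = { 'Content-Type': 'application/json' };
  opentracing.globalTracer().inject(childSpan.context(), opentracing.FORMAT_HTTP_HEADERS, headerCarrier);
  request.post(process.env['lambda-url'], { headers: headerCarrier, body: { 'example': 'request' }, json: true })
    .catch((err) => {
      // same error handling as in the route above
      childSpan.setTag('error', 'true');
      childSpan.logEvent('remote request failed', { error: err });
    })
    .then(() => {
      childSpan.finish();
      resolve(); // resolve only once the response has been handled
    });
}));

// parentSpan now ends after the last response, not after the last request
Promise.all(promises).then(() => parentSpan.finish());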

Function instrumentation

When instrumenting any serverless function, we want to add as little overhead as possible. Most providers bill by execution time, so any overhead added by monitoring has a direct cost associated with it.

There’s one extra step when tracing HTTP requests on Lambda: exposing the HTTP headers in the event. Luckily, this great blog post by Ken Brodhagen explains how to do it. Once you’ve added the mapping template to the API Gateway, ‘event.headers’ will contain the HTTP headers, allowing you to extract the SpanContext.

var opentracing = require('opentracing');
var lightstep   = require('lightstep-tracer');

opentracing.initGlobalTracer(new lightstep.Tracer({
    access_token: process.env['lightstep-token'],
    component_name : 'lambda_function',
    // along with other considerations, we remove any initial reporting delay in the library initialization to remove unnecessary overhead. NOTE: This is a LightStep tracer specific option, though similar options should be available in your tracer of choice
    delay_initial_report_millis: 0,
}));

exports.handler = (event, context, callback) => {
    // all interactions with a request in a Lambda function (after it is through the API Gateway) occur through a handler
    // we don't really need to use promises here, but we've utilized them for part 2 of our exploration, which will be coming soon
    let promise = new Promise(function (resolve) {
        // we extract the SpanContext from the HTTP headers of our request
        let extractedContext = opentracing.globalTracer().extract(opentracing.FORMAT_HTTP_HEADERS, event.headers);
        // and use it to start a span called "lambda-execution"
        let span = opentracing.globalTracer().startSpan('lambda-execution', { childOf : extractedContext });
        span.finish();
        resolve();
    });
    // signal completion back to the API Gateway once the span has been finished
    promise.then(() => callback(null, 'ok'));
};

You’ll notice that we have a configuration option in our Tracer initialization which zeros out an initial reporting delay. We typically suggest setting this option to a few hundred milliseconds to allow the NodeJS event loop to flush and stabilize before adding new network calls. However, since Lambda is highly optimized and our function is lightweight, we remove any delay to reduce our overhead.

Walking through the instrumentation in this function, we initialize our tracer and attach it to OpenTracing as the global tracer. Then we extract the SpanContext from the HTTP headers exposed on the event, and use it to start the span we’ll use for all measurements in this function. We then immediately finish the span, but in a more realistic example, you’d probably do some kind of work between these calls.

Putting it all together

The final trace illustrates the performance of the entire transaction, from Express routing to Lambda execution.

LightStep [x]PM takes all of the emitted span data from both environments and uses the timestamps and relational data to put together a cohesive story of the system’s performance. For operations where you want additional context beyond timing or the other golden signals, you can add any number of key/value pairs (including metrics) and logs right from your code. To make this automatic, many users hook the OpenTracing logging calls into their existing error handling or logging libraries, which attaches exception messages directly to the individual operation.
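
As a rough sketch of that pattern (the reportError wrapper here is hypothetical, not part of the example app), hooking the OpenTracing calls into an existing error handler might look like this:

let opentracing = require('opentracing');

// hypothetical wrapper around existing error handling: keep whatever logging
// you already do, and also annotate the active span so the exception shows up
// on the individual operation in the trace
function reportError(span, err) {
  console.error(err);
  if (span) {
    span.setTag('error', true);
    span.log({ event: 'error', message: err.message, stack: err.stack });
  }
}

// usage inside an instrumented handler, where span came from startSpan():
// try { doWork(); } catch (err) { reportError(span, err); }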

This view, enabled through the lightweight data model of OpenTracing and the low overhead of the LightStep client library, is the fastest and most context-rich way to understand a distributed performance story. Feel free to contact us with any comments or questions, or to share your own OpenTracing experiences.

LightStep [x]PM Architecture Explained

Intro

LightStep [x]PM has made an incredible impact at some of the world’s most innovative companies. It provides an unprecedented level of visibility into the production performance of these highly-distributed applications. When we say unprecedented, we mean it – we analyze 100% of the performance data flowing through these enterprise systems. These analyses include a large number of customizable facets, and we provide real-time, end-to-end distributed traces, with no up-front sampling at all. Our users can see their applications and services in entirely new ways, which we’ll discuss a bit later. But first, it’s important to explore how we can analyze a near-limitless volume of data with no scaling, cardinality, or overhead concerns.

Measure everything, diagnose anything

Our unique data collection architecture allows us to collect and analyze the large volume of production data that our enterprise customers generate. LightStep was founded by pioneers in the distributed tracing space who realized that end-to-end traces are the holy grail of performance data. These traces provide visibility into exactly how the separate services and parts of an application interact with each other.

Furthermore, time-series data that represents latency, throughput, and error rate for operations, services, and entire transactions is necessary to enable Service Level Objective (SLO) and root cause analysis capabilities in any modern performance monitoring solution.

The gap in existing solutions can be seen in the granularity and availability of both distributed tracing and time-series data. Early in the design of [x]PM, our team realized that in order to reliably provide this data, we couldn’t do any heavy lifting on the application hosts: measuring and analyzing either time-series or trace data there would be computationally expensive. So instead, we built a new way of collecting and analyzing this data off the host.

Granular time-series and trace data can be collected from any facet of a distributed environment, at any scale

Our Satellite Architecture collects the performance data of individual operations in a service through the OpenTracing libraries. OpenTracing is a vendor-neutral API that defines both how we can measure the performance of an operation in a system and how we can piece it together with other related distributed operations. The OpenTracing libraries, in conjunction with LightStep, are extremely lightweight.

[x]PM was designed from the outset to have no measurable performance overhead, and LightStep as a company has a “first, do no harm” policy – performance transparency has been a first principles priority here from day one. As a result, 100% of our customers run LightStep 100% of the time in production. We make no attempts to do any kind of intensive processing on the app host, making overhead concerns a distant memory. Our Satellite Architecture can also use log translations to extract the necessary information from the system.

That data then flows to our satellites, which sit on-premises within your datacenter or cloud environment. These stand-alone satellites are a key component of our high-performance stream processing system. The satellites store and analyze the performance data (extracted from your system) for about 3-5 minutes. This gives us enough time to index the entire volume of performance data against historical norms, user-defined thresholds, and other metrics.

For example, let’s say you’re measuring the performance of an authentication transaction. That transaction might rely on several services, each with its own datastore. Our satellites will receive the performance data of each operation, across every service, and automatically analyze the performance of each segment against historical performance, error rates, and throughput.

LightStep [x]PM segments performance data across two VIP customers, without front-end sampling or data smoothing

But let’s say you want to analyze that performance data along a couple of dimensions. Maybe you want to track the performance of your authentication service by identity management provider or deployment version. Maybe you want to track SSO functionality versus traditional username and password logins. LightStep (and OpenTracing) model those dimensions as key/value pairs called “tags.” Once those tags are flowing into LightStep [x]PM, our satellites will automatically partition the data and provide time-series and trace data for each segment, without you having to worry about the cardinality of your performance data.
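
In code, those dimensions are just tags on a span. A rough sketch, with hypothetical key names and values, might look like this:

let opentracing = require('opentracing');

// hypothetical tag keys: any key/value pairs added to a span become dimensions
// the satellites can segment latency, throughput, and error rate by
let span = opentracing.globalTracer().startSpan('authenticate-user');
span.setTag('auth.provider', 'example-idp');   // identity management provider
span.setTag('auth.method', 'sso');             // SSO vs. username/password
span.setTag('service.version', 'v1.7.2');      // deployment version
span.setTag('customer.tier', 'vip');           // e.g. the VIP segmentation shown above
span.finish();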

We take this one step further by offering the ability to set your own SLOs for each dimension. So now, you can track the performance of any arbitrary segment of your application, set different performance thresholds for each, and receive context-rich alerts – complete with moment-in-time traces relevant to those segments.

The operation started with a front-end user interaction, but the “problem” was a long-running DB call over a hundred layers deep

See the entire request (and response) payload that was generated when the user first clicked “Get Historical Matches”

And this is only the beginning. Because we’re storing all of your performance data for that 3-5 minute window of omniscience, we can tell you not only when and where a performance issue occurs in your stack, but also everything that happens both downstream and upstream of that issue. So, if you’re trying to diagnose a 1-in-1,000 database issue that affects only users of a certain deployment, starting with a client-side operation, we can tell you exactly what that environment looked like, and every call leading from the front end to the datastore and back.

This is a glimpse of what is possible when you have 100% visibility into the performance of your production distributed system. With LightStep [x]PM, you get a fine-grained, objective view into your system, so you no longer have to react with partial information to the latest unexpected systems failures; instead, you can focus on building, improving, and tracking the core value your system was built to deliver.