Know What’s Normal: Historical Layers

In April, we released real-time latency histograms and explained why performance is a shape, not a number. These histograms change the game when it comes to characterizing performance and identifying worrisome behaviors, but they also raise immediate questions: “Is this normal? Did something change? If so, when?”

Today, we’re announcing historical layers for LightStep [x]PM Live View. Historical layers allow you to compare the up-to-the-second latency histogram against the performance shape from an hour, a day, and a week ago. When you interactively filter the latency histogram to restrict its focus to a specific service, operation, and/or collection of tags, the historical layers will also reflect the specified query criteria. Now with just a glance, you can determine whether performance behavior has improved or degraded for any aspect of your application (and when that change occurred).

LightStep Adds Historical Layers to [x]PM for Performance Management
Historical layers quickly show when performance has improved or degraded for any aspect of your application

We chose the different time intervals deliberately to cover a wide range of scenarios and to account for common cyclical performance variations. This new capability is designed for firefighting time-sensitive issues, investigating latency spikes to isolate root cause, and validating whether application changes are having the expected effect over time. Historical layers make it easy to spot even the most subtle and harmful performance regressions.

LightStep [x]PM captures and stores the data and statistics required to produce these historical layers through its unique Satellites. In contrast to traditional time-series data, this information is available automatically and doesn’t require any additional configuration or preparation. You can also filter using high-cardinality tags – so you can view this level of detail for virtually any aspect of your application: from specific services or product versions to individual customers.

We’re extremely proud of these new capabilities and encouraged by the enthusiastic feedback we’ve received from our early beta customers, but we’re certainly not stopping here! In the coming months, we’ll be delivering more unique capabilities that will complement historical layers to make high-fidelity performance management and monitoring even more intuitive and insightful for our customers.

Are you adopting microservices? Contact us to learn more and see exactly how LightStep [x]PM works.

Monitoring Serverless Functions Using OpenTracing

Introduction

The adoption of serverless functions is reaching record levels within enterprise organizations. Interestingly, despite that growing adoption and interest, many monitoring solutions silo the performance of code executed in these environments, or provide only basic metrics about the execution. To understand the performance of applications, I want to know what bottlenecks exist, where time is being spent, and the current state of each system involved in fulfilling a request. While metrics, logs, and segmented stack traces are helpful, a more cohesive performance story is still the most useful way to understand an application, and should be achievable using existing technologies.

In this post, I’ll explore how to achieve a cohesive performance story using OpenTracing to instrument an Express app running in a local container, and a function running on Lambda. For visualizations and analysis, I’ll use LightStep [x]PM for this example, although you can choose another OpenTracing backend like Jaeger.

System performance is exactly the sum of its parts

Before we begin, it’s worth examining the purpose of this exercise. As I mentioned, almost every FaaS (function as a service) provider offers some amount of visibility into the performance of individual functions at reasonable price points.

Having invocation counts, error rates, logs, and even stack traces at your fingertips is very compelling. Even more compelling is getting this information without having to do much on top of your actual development work. However, when performance data is segmented by actor, piecing together a cohesive story is an exercise in frustration. Whenever I’m working on a project and trying to quantify the value of that work, I tend to ask, “so what, who cares?” Or, put another way: what value exactly am I creating through this work, product, or technology?

The value of monitoring is measured at the intersection of the unique data provided, the problems being solved, and how it gets into the hands of those who need it. While metrics and individual stack traces are appropriate for many use cases, I created this example because, as a developer writing new features, refactoring existing code, or just keeping the lights on, you may have to rely on systems and services outside your control. In those cases, distributed tracing is, in my opinion, the best tool for the job.

Hello, OpenTracing

OpenTracing is a vendor-neutral interface that defines how to measure the performance of any individual operation or component in your infrastructure, and how you can tie those individual bits together into a cohesive end-to-end performance story. Importantly, the data required to do so is well defined and very simple: start and stop timestamps, service and operation names, and a common transaction ID are essentially all you need to get started. The backend typically takes care of “gluing” together the related operations into a single trace, so overhead can be extremely minimal.

OpenTracing’s lightweight data model makes it a perfect fit for measuring the performance of ephemeral architecture components, including containers and serverless functions. We’ll need three things: the OpenTracing data, a way to get that data to our backend, and of course, the backend itself.
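
To make that concrete, here’s a minimal sketch of a single measurement in Node.js using the OpenTracing API. The operation and tag names are placeholders, and it assumes a tracer has already been registered as the global tracer (as shown later in this post):

let opentracing = require('opentracing');

// starting a span records the operation name and a start timestamp
let span = opentracing.globalTracer().startSpan('process-order');
// optional key/value tags add context to the measurement
span.setTag('customer_id', 'example-customer');
// ... do the work being measured ...
// finishing the span records the stop timestamp and hands the data to the tracer
span.finish();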

The OpenTracing data

OpenTracing defines the exact API calls required in each language to extract the necessary information and put together an end-to-end trace. The intent is for these calls to be made throughout your environment, and then for your desired client library (the library that receives OpenTracing data and sends it to your backend) to receive the resulting data by establishing itself as the global destination for the instrumentation. This means you define exactly what you’d like to measure in your code. More importantly, because OpenTracing is an open standard, it’s widely adopted by library and infrastructure developers, and many popular frameworks and tools have built-in OpenTracing instrumentation.

In this example, we won’t leverage any of the community-driven plugins and instrumentation. Instead, we’ll do ours by hand to explicitly demonstrate how the technology works. We’ll start with the backend.

A brief note on SpanContext

As I mentioned, OpenTracing requires a common transaction ID to be present when each measurement is made. This ID is bundled and transmitted throughout your system using an object called the SpanContext. It can be included in many carriers, such as HTTP headers, text maps, and binary blobs, and transmitted in whatever format works for your services. In this example, since we’re making an HTTP request, we’ll inject the SpanContext into the HTTP headers.
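
As a rough sketch of that round trip (separate from the full example below; the span, incomingHeaders, and operation names are placeholders), injection on the sending side and extraction on the receiving side look like this:

// sender: serialize the active span's SpanContext into a plain object used as the header carrier
let headerCarrier = {};
opentracing.globalTracer().inject(span.context(), opentracing.FORMAT_HTTP_HEADERS, headerCarrier);
// ...attach headerCarrier to the outbound request's HTTP headers...

// receiver: rebuild the SpanContext from the incoming request headers and continue the trace
let parentContext = opentracing.globalTracer().extract(opentracing.FORMAT_HTTP_HEADERS, incomingHeaders);
let serverSpan = opentracing.globalTracer().startSpan('handle-request', { childOf: parentContext });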

Backend instrumentation

First, we’re going to initialize our client library and assign it as our global tracer, which makes it the destination for any OpenTracing data emitted throughout the application.

let opentracing = require('opentracing');
let lightstep = require('lightstep-tracer');

opentracing.initGlobalTracer(new lightstep.Tracer({
  access_token: process.env['lightstep-token'],
  component_name : 'node-service',
}));

Next, we’ll instrument one of our Express routes with OpenTracing.

var express = require('express');
var router = express.Router();
var createError = require('http-errors');
var request = require('request-promise');
var opentracing = require('opentracing');

router.get('/ot-gen', function(req, res, next) {
  // create our parent span, using the operation name "router"
  let parentSpan = opentracing.globalTracer().startSpan('router');
  // assign the count passed in our request query to a variable which will be used in our for loop below
  let count = parseInt(req.query.count, 10);
  let promises = [];
  if (count) {
    for (let c = 0; c < count; c++) {
      promises.push(new Promise((resolve) => {
        // create a child span for each request that will be made to Lambda
        let childSpan = opentracing.globalTracer().startSpan('service-egress', { childOf: parentSpan });
        // create a carrier object and inject the child span's SpanContext into it
        var headerCarrier = {
          'Content-Type': 'application/json'
        };
        opentracing.globalTracer().inject(childSpan.context(), opentracing.FORMAT_HTTP_HEADERS, headerCarrier);
        // make our outbound POST request to our Lambda function, passing the SpanContext along in the request headers
        request.post(process.env['lambda-url'], { headers: headerCarrier, body: { 'example': 'request' }, json: true }).then((response) => {
          // append some contextual information to the span for use later in LightStep
          childSpan.logEvent('remote request ok', { response });
          childSpan.setTag('Destination', 'xxx.us-east1.amazonaws.com/production/');
        })
        .catch((err) => {
          // if there is an error in the Lambda response, attach an error boolean tag and the error
          // message received to the span, which we can then access later in LightStep
          childSpan.setTag('error', 'true');
          childSpan.logEvent('remote request failed', {
            error: err
          });
        })
        .then(() => {
          // finish our child span and resolve the promise for this request
          childSpan.finish();
          resolve();
        });
      }));
    }
  } else {
    // basic route functionality
    next(createError(400, 'Count not provided'));
  }
  // end the response so it doesn't stay open while we wait for Lambda to return our responses
  res.end();
  // finish the parent span once all requests have been dispatched (we intentionally don't wait for the Lambda responses)
  Promise.all(promises).then(parentSpan.finish());
});
module.exports = router;

As you can see, I’ve made a route here which allows us to instruct Express to send a variable number of requests (passed as a query parameter) to our Lambda function. I’ve left comments in the code above, but the primary takeaways and steps here are:

  1. Start a parent span when the route is invoked. This establishes the foundation of our trace hierarchy, which we’ll use in the next step.
  2. For each spawned Lambda function, we want to create a child span. In OpenTracing, each span has its own ID, which along with the persistent transaction ID, is included in the SpanContext. This means each child span has its own SpanContext that it needs to pass along to the next hop in the system.
  3. Inject the SpanContext into the request headers and attach it to our Lambda POST request.
  4. If there is an error in the request, we attach the error along with a boolean KV pair which tells our backend that there is an error present on the active span. This becomes really useful in a system like LightStep [x]PM, where error rate is automatically calculated and can be used as a foundation for alerts and analytics.
  5. End our parent span after all Lambda requests are made, and end each child span after we receive a 200 from the AWS API Gateway. We’re ending our parent span here after the Lambda requests leave our app, but we aren’t waiting for the responses. This is a matter of preference, and it can be modified to fit your needs.

Function instrumentation

When instrumenting any serverless function, we want to add as little overhead as possible. Most providers charge based on execution time, so any overhead added by monitoring has a direct cost associated with it.

There’s one extra step when tracing HTTP requests on Lambda: exposing the HTTP headers in the event. Luckily, this great blog post by Ken Brodhagen explains how to do it. Once you’ve added the mapping template to the API Gateway, ‘event.headers’ will contain the HTTP headers, allowing you to extract the SpanContext.

var opentracing = require('opentracing');
var lightstep   = require('lightstep-tracer');

opentracing.initGlobalTracer(new lightstep.Tracer({
    access_token: process.env['lightstep-token'],
    component_name : 'lambda_function',
    // along with other considerations, we remove any initial reporting delay in the library initialization to remove unnecessary overhead. NOTE: This is a LightStep tracer specific option, though similar options should be available in your tracer of choice
    delay_initial_report_millis: 0,
}));

exports.handler = (event, context, callback) => {
    // all interactions with a request in a Lambda function (after it is through the API Gateway) occur through a handler
    // we don't really need to use promises here, but we've utilized them for part 2 of our exploration, which will be coming soon
    let promise = new Promise(function () {
        // we extract the SpanContext from the incoming HTTP request headers
        let extractedContext = opentracing.globalTracer().extract(opentracing.FORMAT_HTTP_HEADERS, event.headers);
        // and use it to start a span called "lambda-execution"
        let span = opentracing.globalTracer().startSpan('lambda-execution', { childOf: extractedContext });
        span.finish();
    });
};

You’ll notice that we have a configuration option in our Tracer initialization which zeros out an initial reporting delay. We typically suggest setting this option to a few hundred milliseconds to allow the NodeJS event loop to flush and stabilize before adding new network calls. However, since Lambda is highly optimized and our function is lightweight, we remove any delay to reduce our overhead.

Walking through the instrumentation in this function, we initialize our tracer and attach it to OpenTracing as the global tracer. Then we extract the SpanContext from the incoming HTTP headers, and use it to start the span we’ll use for all measurements in this function. We then immediately finish the span, but in a more realistic example, you’d probably do some kind of work between these calls.

Putting it all together

LightStep [x]PM - Monitoring Serverless Functions with OpenTracing
The final trace illustrates the performance of the entire transaction from Express routing to Lambda execution

LightStep [x]PM takes all of the emitted span data from both environments and uses the timestamps and relational data to put together a cohesive story of the system performance. In operations where you might want some additional context outside of timing, or other golden signals, you can add any number of key/value pairs (including metrics) and logs right from your code. To make this automatic, many users add the OpenTracing logging calls to their existing error handling or logging libraries, which adds exception messages directly to the individual operation.
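
For example, a team might wrap its existing error handling so that exceptions are recorded on the active span as well as in the logs. Here’s a rough sketch; the logError helper and the way the active span is passed in are assumptions for illustration, not part of the example above:

// hypothetical helper: report an error to the existing logger and to the active span
function logError(logger, span, err) {
  // existing logging path stays unchanged
  logger.error(err.message);
  if (span) {
    // mark the span as errored so the backend can count it toward error rates
    span.setTag('error', true);
    // attach the details so they appear on the individual operation in the trace view
    span.log({ event: 'error', message: err.message, stack: err.stack });
  }
}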

This view, enabled through the lightweight data model of OpenTracing and the low overhead of the LightStep client library, is the fastest and most context-rich way to understand a distributed performance story. Feel free to contact us with any comments or questions, or to share your own OpenTracing experiences.

Monitorama 2018 – Metrics, Serverless, Tracing, and Socks

As a former Portlander, I found Monitorama to be a great reason to return to the Silicon Forest while summer keeps the infamous rains at bay. It was also an excellent opportunity to learn about the latest technologies and techniques for monitoring and observing large-scale, distributed systems from industry experts and practitioners.

This year the conference organizers implemented a few scheduling changes intended to support more “hallway track” – speakers were aligned to a single track and breaks were longer. This absolutely helped sustain a high level of energy throughout the three-day event.

OpenTracing - Monitorama 2018
Lots of conversations took place in the hallway track

An obvious trend at the conference was the prevalence of foot-encasing swag. Socks were available from a variety of vendors, including LightStep, with Monitorama-branded socks available for each attendee.

It’s always important to make a statement with your first speaker, and this year’s last-minute substitution generated a lot of buzz. Logan McDonald of Buzzfeed gave a talk, Optimizing for Learning, that blended behavioral research with practical techniques and, based on my conversations throughout the three days, resonated with many attendees. It was the first of several talks that touched on how to sustainably build, grow, and integrate teams in fast-paced technology companies, including talks by Kishore Jalleda (Microsoft), Zach Musgrave/Angelo Licastro (Yelp), and Aditya Mukerjee (Stripe).

As expected, serverless was the subject of talks and open discussion, displacing the last few years of container domination (although containers were still represented by Allan Espinosa from Bloomberg). Serverless was featured alongside a plethora of cat puns in Pam Selle’s talk, and in Yan Cui’s effort to submit feature requests to every vendor simultaneously with his walkthrough of the ideal serverless monitoring system he wanted to use.

Metrics continued to be very important, but tracing made its mark with a strong showing from vendors (Sematext announced support for OpenTracing), speakers (OpenTracing’s Ted Young and his lightning talk), and strong interest from attendees at Wednesday’s Tracing breakfast. Some thirty folks braved the early-morning pastries of Cafe Umbria, following the late night of Tuesday’s vendor parties. Many attendees were just beginning their journey into microservices and considering OpenZipkin and Jaeger, while others were on the hunt for anecdotes about a vendor that could meet their needs as microservices and serverless continue to increase the observability complexity of their environments.

Tracing Breakfast - Monitorama 2018
Lively discussion at the Tracing breakfast

I left Monitorama feeling really energized by the current state of the industry and how our field is quickly becoming a focal point for this latest wave of DevOps innovation and best practices. So many of the observability challenges that were raised and passionately discussed are precisely the ones we’re focused on solving at LightStep. If you’d like to learn more about what we do, see how our customers use our product to maintain performance and reliability for their modern applications with advanced distributed tracing using industry-adopted standards.

DevOps and Site Reliability Engineering – What’s Different at LightStep

At LightStep, we spend every day helping our customers understand performance behavior in their distributed applications. We’re proud our product is used to diagnose problems for many important software systems. And because our product is used to improve performance and reliability in other applications, we must hold it to even higher standards on those same metrics. At the same time, we challenge ourselves to innovate quickly while still meeting (or exceeding) those standards.

As one of the co-founders and the CTO at LightStep, I’d like to share a bit of what it’s like to work on the engineering team, how we collaborate, and our process for bringing ideas to market.

One critical part of running highly available services is determining who is responsible for making sure that those services are available. Two related terms that get tossed around a lot here are DevOps and Site Reliability Engineering (SRE). Unfortunately, neither of these terms is particularly well defined – just Google them and see for yourself!

One of the parts of DevOps that I like best (though certainly not the only part) is that individual teams are responsible for the entire application lifecycle, from design, to coding and testing, to deployment and ongoing maintenance. This gives teams the flexibility to choose the processes and tools that will work best for them. However, that autonomy can lead to fragmentation across the org in how services are managed and duplication of effort across teams.

On the other side, SRE is often used to describe organizations that are laser-focused on product availability, performance, and incident response. While these are all important, these SRE organizations can sometimes build antagonistic relationships with the rest of engineering where SRE is seen as impeding progress for the sake of its own goals.

At LightStep, we believe in a hybrid implementation of these two philosophies, where our engineers are organized into small groups with split responsibilities but shared objectives. SRE at LightStep is responsible in part for building shared infrastructure that is leveraged by the whole organization, but they are also embedded within teams to help spread best practices and understand current developer pain points. This structure has enabled our teams to remain agile, to conduct rapid product experiments, and to have the flexibility to quickly adopt new (or discard old) technologies and tools. Retaining the natural and healthy tension between maintaining product stability and accelerating innovation to market ensures every decision we make is a balance that ultimately focuses on our customers’ success.

When considering prospective DevOps engineers or SREs (titles don’t really matter much to us at LightStep), we look for engineers who are excited about working side-by-side with the rest of our team. To us, SRE isn’t a separate organization so much as a mindset: we look for engineers who are excited to collaborate and apply a broad set of tools – including traditional operational tools like automation and monitoring as well as robust software development practices – to improve the reliability of our product and increase the velocity of individual teams and of our organization as a whole.

We’re always striving to improve how we do things and looking to new team members to help us on this journey. All of our engineers bring complementary skills and experience from both academia and industry. Above all, we value those who respect differing opinions, communicate clearly, and are empathetic towards their peers.

If you’d like to be part of this journey and would enjoy working on these engineering challenges, we’d love to hear from you!

Twilio Engineer Shares How They Achieve Five 9s of Availability

In our recent tech talk on SD Times – Managing the Performance of Applications in the Microservices Era – Tyler Wells, Director of Engineering at Twilio, shared his insights on how to effectively manage the performance of microservices-based applications and how they achieve five 9s of availability and success.

Tyler said that integrating new tools and solutions into a developer’s workflow can be a challenge for any organization: there needs to be a big carrot. For Twilio, the carrot was a 92% reduction in mean time to resolution (MTTR) for production incidents, and 70% improvements to mean latency for critical services. Now, they can also detect failures before they impact customers. This article shows how they accomplished these results and how other organizations can do the same.

How Twilio integrated [x]PM into its engineering process and workflow

Tyler described why his team was motivated to try [x]PM and how it fit into their workflow. “Twilio was born and raised in the cloud and has always been built on distributed microservices. My team was an early adopter of LightStep. We were excited about the opportunity to instrument and add tracing to the complex distributed systems we have in the Programmable Video group. You can imagine that setting up a video call involves a lot of steps, and there are a lot of systems. The orchestration messages have to pass through: authorization, authentication, creating the Room [session], orchestrating the Room, adding Participants to the Room. These are all distributed systems, so we added tracing, including Tags and rich information specific to our business, and we started watching. We watched the p99 latency, and we started honing in on the outliers. As we highlighted these outliers, we pulled the information we needed to help identify one of these Rooms using [the Room’s] Sid or GUIDs. We used those IDs to look through [LightStep] and figure out, from the highlighted spans showing the latency, exactly what was going on. That was our first experience with LightStep and how we started to derive value.”

LightStep [x]PM - Managing Application Performance in the Microservices Era
Monitor latency, alert on SLA violations, and focus on the outliers to quickly determine root cause

How chaos actually helps

Tyler talked about the benefits of always assuming that things will break. “We like to break our systems before we put them into the hands of our customers, so we do a lot of Chaos Engineering. We use a tool like Gremlin to start breaking things. LightStep makes it easy for us to be able to hone in on what happens when things go wrong. We know when you’re operating in the cloud, everything is going to break at some point in time. Using LightStep in conjunction with our ‘Game Days,’ we got a ton of visualization, so we could create the SLA alerts, which we have integrated into PagerDuty and Slack. If incidents are triggered, our team immediately shows up in a Slack channel and all of the rich LightStep information is there for us to help identify issues.”

Achieving five 9s of availability and success

Tyler explains how they achieve operational excellence. “We have a program at Twilio called Operational Maturity Model (OMM). It’s a program all teams must follow when pushing product into production. The program has a number of different dimensions: LightStep sits in the Operations dimension. We have a specific policy in the Operations dimension that’s literally called LightStep. There are a number of items in every dimension that teams need to check off to reach a specific grade, with the highest grade being Iron Man. In order for any team to go into production and claim general availability, they have to implement LightStep, use LightStep as part of their Game Days, and they have to achieve Iron Man status. That’s how we use it at Twilio.”

Tyler summarized Twilio’s focus on operational excellence to build customer confidence: “We typically target five 9s [99.999%] of availability and five 9s of success. Generally speaking, five 9s is discipline, not luck.”

Overcoming resistance to change

Tyler described how his team was able to show results and convince other teams at Twilio to use [x]PM. “Any time you try to introduce a new tool to engineers, there’s always going to be some level of resistance. Everybody has more work on their plates and in their backlog than they can handle, and then someone shows up and says: ‘hey, here’s this really cool tool that you should try.’ It’s always met with a healthy dose of skepticism. We had some teams that were early adopters that really derived incredible value from using LightStep. We were able to articulate those results and show other teams (that may have been skeptics). We showed how it helped us solve production-level issues, meet our goals on the operational excellence front, and deliver that higher level of operational maturity to our customers.”

Watch the tech talk, Managing the Performance of Applications in the Microservices Era, to get all of the details about how Twilio is using [x]PM. Don’t miss the demo to see [x]PM in action.

Performance is a Shape, Not a Number

This article originally appeared on Medium.

Applications have evolved – again – and it’s time for performance analysis to follow suit

In the last twenty years, the internet applications that improve our lives and drive our economy have become far more powerful. As a necessary side-effect, these applications have become far more complex, and that makes it much harder for us to measure and explain their performance – especially in real-time. Despite that, the way that we both reason about and actually measure performance has barely changed.

I’m not here to argue for the importance of understanding real-time performance in the face of rising complexity – by now, we all realize it’s vital – but to make the case for improving our mental model as we recognize and diagnose anomalies. When assessing “right now,” our industry relies almost entirely on averages and percentile estimates: these are not enough to efficiently diagnose performance problems in modern systems. Performance is a shape, not a number, and effective tools and workflows should present and explore that shape, as we illustrate below.

We’ll divide the evolution of application performance measurement into three “phases.” Each phase had its own deployment model, its own predominant software architecture, and its own way of measuring performance. Without further ado, let’s go back to the olden days: before AWS, before the smartphone, and before Facebook (though perhaps not Friendster)…

Watch our tech talk now. Hear Ben Sigelman, LightStep CEO, present the case for unsampled latency histograms as an evolution of and replacement for simple averages and percentile estimates.

Phase 1: Bare Metal and average latency (~2002)

LightStep - the Stack 2002
The stack (2002): a monolith running on a hand-patched server with a funny hostname in a datacenter you have to drive to yourself.

If you measured application performance at all in 2002, you probably did it with average request latency. Simple averages work well for simple things: namely, normally-distributed things with low variance. They are less appropriate when there’s high variance, and they are particularly bad when the sample values are not normally distributed. Unfortunately, latency distributions today are rarely normally distributed, can have high variance, and are often multimodal to boot. (More on that later)

To make this more concrete, here’s a chart of average latency for one of the many microservice handlers in LightStep’s SaaS:

LightStep - Recent Average Latency
Recent average latency for an important internal microservice API call at LightStep

It holds steady at around 5ms, essentially all of the time. Looks good! 5ms is fast. Unfortunately it’s not so simple: average latency is a poor leading indicator of reliability woes, especially for scaled-out internet applications. We’ll need something better…

Phase 2: Cloud VMs and p99 latency (~2012)

LightStep - the Stack 2012
The stack (2012): a monolith running in AWS with a few off-the-shelf services doing special-purpose heavy lifting (Solr, Redis, etc).

Even if average latency looks good, we still don’t know a thing about the outliers. Per this great Jeff Dean talk, in a microservices world with lots of fanout, an end-to-end transaction is only as fast as its slowest dependency. As our applications transitioned to the cloud, we learned that high-percentile latency was an important leading indicator of systemic performance problems.

Of course, this is even more true today: when ordinary user requests touch dozens or hundreds of service instances, high-percentile latency in backends translates to average-case user-visible latency in frontends.

To emphasize the importance of looking (very) far from the mean, let’s look at recent p95 for that nice, flat, 5ms average latency graph from above:

LightStep - Recent p95 Latency
Recent p95 latency for the same important internal microservice API call at LightStep

The latency for p95 is higher than p50, of course, but it’s still pretty boring. That said, when we plot recent measurements for p99.9, we notice meaningful instability and variance over time:

LightStep - Recent p99.9 Latency
Recent p99.9 latency for the same microservice API call. Now we see some instability.

Now we’re getting somewhere! With a p99.9 like that, we suspect that the shape of our latency distribution is not a nice, clean bell curve, after all… But what does it look like?

Phase 3: Microservices and detailed latency histograms (2018)

LightStep - the Stack 2018
The stack (2018): A few legacy holdovers (monoliths or otherwise) surrounded — and eventually replaced — by a growing constellation of orchestrated microservices.

When we reason about a latency distribution, we’re trying to understand the distinct behaviors of our software application. What is the shape of the distribution? Where are the “bumps” (i.e., the modes of the distribution) and why are they there? Each mode in a latency distribution is a different behavior in the distributed system, and before we can explain these behaviors we must be able to see them.

In order to understand performance “right now”, our workflow ought to look like this:

  1. Identify the modes (the “bumps”) in the latency histogram
  2. Triage to determine which modes we care about: consider both their performance (latency) and their prevalence
  3. Explain the behaviors that characterize these high-priority modes

Too often we just panic and start clicking around in hopes that we stumble upon a plausible explanation. Other times we are more disciplined, but our tools only expose bare statistics without context or relevant example transactions.

This article is meant to be about ideas (rather than a particular product), but the only real-world example I can reference is the recently released Live View functionality in LightStep [x]PM. Live View is built around an unsampled, filterable, real-time histogram representation of performance that’s tied directly to distributed tracing for root-cause analysis. To get back to our example, below is the live latency distribution corresponding to the percentile measurements above:

LightStep - A Real-Time View of Latency
A real-time view of latency for a particular API call in a particular microservice. We can clearly distinguish distinct modes (the “bumps”) in the distribution; if we want to restrict our analysis to traces from the slowest mode, we filter interactively.

The histogram makes it easy to identify the distinct modes of behavior (the “bumps” in the histogram) and to triage them. In this situation, we care most about the high-latency outliers on the right side. Compare this data with the simple statistics from “Phase 1” and “Phase 2” where the modes are indecipherable.

Having identified and triaged the modes in our latency distribution, we now need to explain the concerning high-latency behavior. Since [x]PM has access to all (unsampled) trace data, we can isolate and zoom in on any feature regardless of its size. We filter interactively to hone in on an explanation: first by restricting to a narrow latency band, and then further by adding key:value tag restrictions. Here we see how the live latency distribution varies from one project_id to the next (project_id being a high-cardinality tag for this dataset):

LightStep - Isolate and Zoom In on Any Feature
Given 100% of the (unsampled) data, we can isolate and zoom in on any feature, no matter how small. Here the user restricts the analysis to project_id 22, then project_id 36 (which have completely different performance characteristics). The same can be done for any other tag, even those with high cardinality: experiment ids, release ids, and so on.

Here we are surprised to learn that project_id 36 experienced consistently slower performance than the aggregate. Again: Why? We restrict our view to project_id=36, filter to examine the latency outliers, and open a trace. Since [x]PM can assemble these traces retroactively, we always find an example, even for rare behavior:

LightStep - End-to-End Transaction Traces
To attempt end-to-end root cause analysis, we need end-to-end transaction traces. Here we filter to outliers for project_id 36, choose a trace from a few seconds ago, and realize it took 109ms to acquire a mutex lock: our smoking gun.

The (rare) trace we isolated shows us the smoking gun: that contention around mutex acquisition dominates the critical path (and explains why this particular project — with its own highly-contended mutex — has inferior performance relative to others). Again, compare against a bare percentile: simply measuring p99 latency is a far cry from effective performance analysis.

Stepping back and looking forward…

As practitioners, we must recognize that countless disconnected timeseries statistics are not enough to explain the behavior of modern applications. While p99 latency can still be a useful statistic, the complexity of today’s microservice architectures warrants a richer and more flexible approach. Our tools must identify, triage, and explain latency issues, even as organizations adopt microservices.

If you made it this far, I hope you’ve learned some new ways to think about latency measurements and how they play a part in diagnostic workflows. LightStep continues to invest heavily in this area: to that end, please share your stories and points of view in the comment section, or reach out to me directly (Twitter, Medium, LinkedIn), either to provide feedback or to nudge us in a particular direction. I love to nerd out along these lines and welcome outside perspectives.

Want to work on this with me and my colleagues? It’s fun! LightStep is hiring.

Want to make your own complex software more comprehensible? We can show you exactly how LightStep [x]PM works.

LightStep [x]PM Architecture Explained

Intro

LightStep [x]PM has made an incredible impact at some of the world’s most innovative companies. It provides an unprecedented level of visibility into the production performance of these highly-distributed applications. When we say unprecedented, we mean it – we analyze 100% of the performance data flowing through these enterprise systems. These analyses include a large number of customizable facets, and we provide real-time, end-to-end distributed traces, with no up-front sampling at all. Our users can see their applications and services in entirely new ways, which we’ll discuss a bit later. But first, it’s important to explore how we can analyze a near-limitless volume of data with no scaling, cardinality, or overhead concerns.

Measure everything, diagnose anything

Our unique data collection architecture allows us to collect and analyze the large volume of production data that our enterprise customers generate. LightStep was founded by pioneers in the distributed tracing space who realized that end-to-end traces are the holy grail of performance data. These traces provide visibility into exactly how separate services and parts of an application interact with each other.

Furthermore, time-series data that represents latency, throughput, and error rate for operations, services, and entire transactions is necessary to enable Service Level Objective (SLO) and root cause analysis capabilities in any modern performance monitoring solution.

The gap in existing solutions can be seen in the granularity and availability of both distributed tracing and time-series data. Early in the design of [x]PM, our team realized that in order to reliably provide this data, we couldn’t do any heavy lifting on the application hosts. We wanted to provide a new way to measure and analyze time-series and trace data, and doing that analysis on the hosts themselves would be computationally very expensive. So instead, we built a new way of collecting and analyzing this data.

[x]PM Architecture
Granular timeseries and trace data can be collected from any facet of a distributed environment, at any scale

Our Satellite Architecture collects the performance data of individual operations in a service through the OpenTracing libraries. OpenTracing is a vendor-neutral API that defines both how we can measure performance in a system and piece it together with other related distributed operations. The OpenTracing libraries in conjunction with LightStep are extremely lightweight.

[x]PM was designed from the outset to have no measurable performance overhead, and LightStep as a company has a “first, do no harm” policy – performance transparency has been a first principles priority here from day one. As a result, 100% of our customers run LightStep 100% of the time in production. We make no attempts to do any kind of intensive processing on the app host, making overhead concerns a distant memory. Our Satellite Architecture can also use log translations to extract the necessary information from the system.

That data then flows to our satellites, which sit on-premises within your datacenter or cloud environment. These stand-alone satellites are a key component of our high-performance stream processing system. The satellites store and analyze the performance data (extracted from your system) for about 3-5 minutes. This gives us enough time to index the entire volume of performance data against historical norms, user-defined thresholds, and other metrics.

For example, let’s say you’re measuring the performance of an authentication transaction. That transaction might rely on several services, each with its own datastore. Our satellites will receive the performance data of each operation, across every service, and automatically analyze the performance of each segment against historical performance, error rates, and throughput.

LightStep [x]PM segments performance data across two VIP customers, without front-end sampling or data-smoothing

But let’s say you want to analyze that performance data along a couple of dimensions. Maybe you want to track the performance of your authentication service by identity management provider or deployment version. Maybe you want to track SSO functionality versus traditional username and password logins. LightStep (and OpenTracing) model those dimensions as key/value pairs called “tags.” Once they’ve been added to LightStep [x]PM, our satellites will automatically partition that data out and provide the time-series and trace data for each segment without you having to worry about the cardinality of your performance data.
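
For instance, a service handling logins might record those dimensions on its spans as plain tags; here’s a hypothetical sketch (the span variable, tag names, values, and the usedSSO flag are placeholders):

// record the dimensions we want to segment performance by
span.setTag('identity_provider', 'example-idp');
span.setTag('deployment.version', 'v2.3.1');
span.setTag('login.method', usedSSO ? 'sso' : 'password');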

We take this one step further by offering the ability to set your own SLOs for each dimension. So now, you can track the performance of any arbitrary segment of your application, set different performance thresholds for each, and receive context-rich alerts – complete with moment-in-time traces relevant to those segments.

The operation started with a front-end user interaction but the “problem” was a long-running DB call over a hundred layers deep

See the entire request (and response) payload that was generated when the user first clicked “Get Historical Matches”

And this is only the beginning. Because we’re storing all of your performance data for that 3-5 minute window of omniscience, we can tell you not only when and where a performance issue occurs in your stack, but also everything that happens both downstream and upstream from that issue. So, if you’re trying to diagnose a 1-in-1,000 database issue that only affects users of a certain deployment, starting with a client-side operation, we can tell you exactly what that environment looked like, and every call leading from the front-end to the datastore and back.

This is a glimpse of what is possible when you have 100% visibility into the performance of your production distributed system. With LightStep [x]PM, you get a fine-grained, objective view into your system, so you no longer have to react with partial information to the latest unexpected systems failures; instead, you can focus on building, improving, and tracking the core value your system was built to deliver.

Solving Research Problems Before Lunch – Applying the Scientific Method to Software Engineering

Remember the scientific method? You probably first learned about it in elementary school when you had to apply it to create a lovely tri-fold backboard science fair project. If that was not only the first time you used it but also the last, it went like this: ask a question, do background research, formulate a hypothesis, test the hypothesis, draw a conclusion and determine next steps. While this might bring back memories of long, drawn-out research projects, I have found that this exact method creates a great framework for solving many problems I run into at work.

Sometimes, the right approach to take when solving a problem can be opaque. When is it right to ask for help? When does it make sense to power through and find the answer alone? If asking for help, how does one do that while still maintaining co-workers’ high esteem? The scientific method can help guide this process: it will provide some direction on how to start solving the problem, and lead to the ability to ask intelligent and thoughtful questions. I will walk through both the basic principles as well as how I applied them to a real software engineering problem that I had to solve recently.

Ask a question – Start with a problem. Figure out what the component pieces of the problem are.

Using Go, a language I am still pretty unfamiliar with, how do I take in a generic payload and truncate its fields to make viable JSON, while allocating minimal memory and time?

Background Research – Think about what pieces of knowledge might be needed to solve the problem. Use Google, company docs, and similar code that already exists to build a foundation of knowledge to work off of.

I looked at the Go concepts that seemed the most relevant: empty interfaces (interface{}), switch statements, some of the finer details around pass by reference vs. pass by value for how Go handles specific types when passed into a function as a parameter (answer: technically always pass by value, but this can be confusing depending on the type), and slices vs. arrays. I also knew that my team had made an attempt at this problem in the past, but it was not as memory efficient or fast as we needed, so I made sure I was familiar with that code.

Formulate a hypothesis – This could be the full answer to the problem or a theory that tackles only a portion. The answer might still seem far away but it is important to do this step; it forces a thorough understanding and is a good exercise for problems in the future.

After looking at the previous version of the truncator, doing my Google research, and sketching out some ideas, I came up with a possible solution that involved walking through the interface{} and encapsulating information in a variant on a linked list type structure. This preserved the order I wanted, limited memory usage, and kept track of what information to include in the final output and what was to be truncated.

Experiment – Take the hypothesis and test it out. The controlled, repetitive style of testing used in real scientific research is not necessary here, but the idea of finding components of the hypothesis that are easily testable is important. Some simple problems, like determining how a specific language handles the rounding of floats to ints, lend themselves to a simple experiment. For more complex problems, try to find components that are clearly testable, as above. For the larger picture, there may not be an experiment in the classical sense; instead, draw logic diagrams or write out the method signatures the hypothesis would require.

I could have sat down and cranked out my complete project but that would very likely have led to my being confused and writing less than perfect code. So, as a way to examine my hypothesis, I wrote out a simple scaffolding including all the public methods, a number of the private methods with quick descriptions of what they would do, and the data structure I wanted to use. I filled in some of the structure with very basic code just to make sure I could create and use my modified linked list in Go the way I thought I could.

Conclusion and Next Steps – At this point there should be more clarity around the problem. It is possible that the solution is now fairly obvious, and thus solvable. If not, then what were once open-ended questions are now more targeted and concrete, and can be posed to a colleague. Going to a colleague at this point means you have had a chance to learn and to develop a more solid idea of the problem being tackled; it allows for smarter questions and more thoughtful dialogue, and it demonstrates both self-reliance and respect for the co-worker’s time.

I now had a complete plan, and a few basic assumptions that I had tested out. It looked like my approach was going to work, but since this was a fairly complex solution, I wanted to run my ideas by my colleague and mentor. I walked him through my thought process and showed him the scaffolding that I had written. We ended up needing to tweak my idea a fair amount, but the scaffolding and POC were extremely useful. They gave my mentor something tangible to look at and give feedback on, and the scaffolding became the start of code that I could quickly iterate on.

When I was first presented with the task of writing the truncator I felt overwhelmed and not certain how to proceed. Using the scientific method made the project more manageable. It made certain that I had a firm grasp on the high level problem I was trying to solve, which made it easier to understand the finer details. I still went to my mentor, but I was able to present a solution and engage with the feedback on a much deeper level.

Once I started employing the scientific method in this way, I found that I was able to wrap my head around problems more quickly, that the help I received from my co-workers was more valuable, and that I kept learning. I find myself solving mini research problems every day. Sometimes, before lunch!

GoLang Dep: The Missing Manual

At LightStep we run a number of applications written in Go that handle data ingestion from customers, query processing, monitoring and a variety of other tasks. We’ve adopted dep to manage dependencies for these apps. As our engineering team has grown, and as the number of different applications have grown, it’s helped us stay sane when dealing with changes in external dependencies.

During the transition from a custom vendor management solution to dep, I found some of the important documentation to be scattered around. This post collects the most helpful “extra” information I wish I had known getting started.

What’s dep?

dep is a dependency management tool for the Go programming language. We elected to use dep because of its close relationship with the official Go toolchain developers and its straightforward, “unmagical” dependency management model.

It has an active and helpful development community over on gophers.slack.com#vendor and an exciting roadmap. If you start using dep you should join up and get to know the community. It’s a great time to help find edge cases and make this tool better.

Set up dep

Follow the steps in the dep README to install dep. The homebrew release is up-to-date and is the recommended way to install dep.

Where to run dep

dep should be run at the project root—the directory just above where your vendor directory sits. dep assumes that any packages that cannot be reached in your GOPATH by navigating down from where it’s run are external packages that need to be added to the vendor directory.

To get started, run dep init. This will read your application code and generate a set of constraints based on its best solution for your dependencies. After running this, you may need to edit the generated Gopkg.toml to be more specific about constraints, especially if you depend on older versions of some projects.

Gopkg.toml

Read this: Gopkg.toml README

The Gopkg.toml file describes all the dependencies required for your project. It only describes primary dependencies, not transitive dependencies, leaving the dep constraint solver free to pick transitive dependency versions as long as a version can be chosen that satisfies all constraints.

The most important fields in the file are the constraint entries. Constraints look like this:

[[constraint]]
  name = "github.com/lightstep/lightstep-tracer-go"
  version = "v0.14.0"

Every constraint needs a name, which is actually the URI you would use when go get-ing the project. Every constraint should also have either a version, branch, or revision, with version being preferred if you can use it.

version fields use SemVer 2.0 (http://semver.org/), and dep assumes that v1.2.0 means ^v1.2.0. If you need to constrain to an exact version, use =v1.2.0.

Never edit the lock file!

Gopkg.lock is really an output of the constraint solver. Editing it does nothing.

dep ensure -v is your friend

It’s possible to describe a set of constraints that cannot be solved—you may have a primary dependency on a version of a package, and one of your primary dependencies may depend on an incompatible version. In that case, dep ensure will fail and print an error message. By running dep ensure -v you will get detailed output from the constraint solver that can help you identify the source of the problem.

dep ensure doesn’t do much if your constraints are already satisfied!

dep ensure will “ensure” that a package that you import is also installed in your vendor directory and satisfies any described constraints… That’s it! It won’t make sure you have the latest release if your constraints are already satisfied.

Use dep ensure -update pkgname to get the latest version that satisfies constraints.

Gopkg.toml trick: required

Sometimes you need some go code included that your application doesn’t run directly. For example, https://github.com/grpc-ecosystem/grpc-gateway generates code which can then be committed to a repository, but a dep ensure will not install it, and a dep prune would remove it if installed.

The required keyword lets you depend on repositories that are not dependencies of your application like so:

required = ["github.com/grpc-ecosystem/grpc-gateway"]

Note that after ensuring it’s installed, you still need to go install your requirements:

$ dep ensure
$ go install vendor/github.com/grpc-ecosystem/grpc-gateway/...

There are other tools that can be used to make these installations project-specific as well, like virtualgo: https://github.com/GetStream/vg

Gopkg.toml trick: ignored

The ignored keyword prevents a package and its transitive dependencies from being installed in a project. Why would you want to do that? A typical use case might be to support updating to a new major version of a library that removes a package. Let’s say that you depend on github.com/foo/bar/baz somewhere in your application, and version 2.0.0 of foo/bar drops this package.

Original Gopkg.toml:

[[constraint]]
  name = "github.com/foo/bar"
  version = "1.2.1"

Updating to 2.0.0 without changing your source code:

ignored = ["github.com/foo/bar/baz"]

[[constraint]]
  name = "github.com/foo/bar"
  version = "2.0.0"

This can help you install the new version of your library, even though it doesn’t have all the packages required by your application. You can then work to transition your application code while having the source for the library version you’re working with installed locally.

Committing vendor

If you’re writing a library, especially an open-source library, it’s not generally a good idea to commit your vendor directory. Any users who go get your code may have an impossible time compiling if they have conflicting dependencies with you. It’s a great idea, however, to commit and share your Gopkg.toml file. This will help other users of dep easily consume your library.

If you’re writing an application that emits binaries, there are some arguments for committing your vendor directory and some arguments against it.

In favor of committing vendor

  • You have all the source and binaries needed to build your application in your repository. This can speed CI builds by avoiding a lot of downloading and dependency resolution when building. It also gives you a repeatable set of source to build from.
  • You’re protected from upstream changes breaking your builds—if a dependency unpublishes a previous release, you’ll still have a copy.

Against committing vendor

  • You may be storing and handling a lot of code that isn’t directly related to your application in your own repository.
  • Changes that include dependency upgrades will have very large diffs and may be unwieldy.

We’ve chosen, for the time being, to commit our vendor directories for our application binaries.

Making your project dep friendly

This section assumes you’re hosting your project as a git repository. Similar rules hold for bzr and hg as well.

Use annotated git tags to mark releases: git tag -a v2.1.3 -m "Release Version 2.1.3". Including release notes is a great idea; leave off the -m argument to open your editor and add a longer tag message.

Mark releases with three-part SemVer tags: v2.1.3.

Be honest about what a “breaking change” is

Any change that would cause a build failure if it is installed is considered a “breaking change” by SemVer and should be released with a major revision update. Changes that add new APIs can be considered minor releases, and changes that do not modify the API of the project can be considered a “patch” version.
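To make the connection to dep concrete: dep interprets a bare version constraint as a caret range, so anything you tag as a minor or patch release is fair game for dep ensure -update. A short illustration, reusing the hypothetical package path from above:

# dep treats version = "2.1.3" as the caret range ^2.1.3, i.e. ">= 2.1.3, < 3.0.0".
# A build-breaking change therefore has to ship as v3.0.0, or downstream users
# running "dep ensure -update" will pick it up and break.
[[constraint]]
  name = "github.com/foo/bar"
  version = "2.1.3"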

We’ve found dep to be a straightforward, understandable tool for managing application dependencies. It fits nicely into our CI and development workflows, and it’s easy to understand which version of a given dependency is currently being used in a build. Let us know in the comments about your experiences with dep and share any information you’ve found helpful when using the tool!

Everything I Wish I Had Known about Enterprise SSO | Part II

Single sign-on (SSO) is critical for Enterprise products. It offers security benefits to the client company and an enhanced user experience for the end user. However, there is sometimes a conflict between security and ease of use, and balancing these sometimes-conflicting goals is an important part of defining SSO for your application. Once SSO is on the product roadmap, creating a detailed spec is the first step toward achieving both of these goals.

In our last post, we covered the technical research that goes into building SSO for Enterprise, and this post will detail the product specification best practices that are required to create a compelling, yet secure, user experience. This list might seem long, but we’ve included many things you’ll need for a robust implementation. While most of the tips are table-stakes for a releasable enterprise SSO feature, we’ve added some :sparkles: Bonus tips :sparkles: as well to uncover the extra mile.

New user registration or sign-up

Skip email verification once you have SSO setup.

Because SSO enables you to verify emails on the spot, you no longer need to send verification emails or confirm accounts. This helps shorten the account creation process, but that also means redoing or significantly changing your current setup.

Use just-in-time provisioning to increase product adoption.

Just-in-time (JIT) provisioning (based on domain whitelisting, for example) minimizes manual work, eliminates wait time, and achieves the ultimate aim of SSO by propagating the customer’s user account management directly into the app. This requires automating the process that creates a new account and grants the correct permissions to that new user.
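
A minimal sketch of the idea (the in-memory user store, the allowedDomains whitelist, and the ssoProfile shape are all stand-ins for your own systems; the email is assumed to be already verified by the Identity Provider):

// Just-in-time provisioning sketch: create the account on first SSO login.
const allowedDomains = new Set(['example.com']); // stand-in domain whitelist
const usersByEmail = new Map();                  // stand-in for your user database

function provisionOnFirstLogin(ssoProfile) {
    // ssoProfile.email was verified by the Identity Provider, so no email round trip.
    const existing = usersByEmail.get(ssoProfile.email);
    if (existing) {
        return existing;
    }
    const domain = ssoProfile.email.split('@')[1];
    if (!allowedDomains.has(domain)) {
        throw new Error('domain is not whitelisted for just-in-time provisioning');
    }
    // First login from a whitelisted domain: create the account with default permissions.
    const user = { email: ssoProfile.email, name: ssoProfile.name, role: 'member' };
    usersByEmail.set(ssoProfile.email, user);
    return user;
}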

Consider your pricing model when choosing your provisioning method.

The permissions model for your application and even the pricing model will impact how the sign-up process is designed. If a customer wants fine-grained access control for their users within the app, then manual account creation or approval is necessary. If the pricing is per-seat, then blanket domain-whitelisting may not be the right approach.

Sign-in

Highlight your preferred sign-in process in the user interface.

It is important to decide which takes precedence: manual sign-in (the user enters a username and password) or SSO.

For example, Medium leads with SSO and buries the option for email and password in a link, whereas Heroku takes the opposite approach.

Medium Sign In Page

Heroku Sign In Page

Ask for user input when supporting multiple Identity Providers.

When supporting multiple Identity Providers (Google Sign-in, GitHub, OneLogin, Ping Identity, etc.), ask the user for their email address or unique account URL to determine the correct Identity Provider. While offering buttons for each provider is the most direct approach, it can clutter the UI and assumes the user knows which provider their company uses. This is often not true—think of a Google Apps user whose company also uses Okta.

Pagerduty SSO Sign In Page
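
A minimal sketch of that lookup, assuming you store each customer’s configured Identity Provider keyed by verified email domain (the domains and URLs here are made up):

// Map verified email domains to each customer's configured Identity Provider.
const idpByDomain = new Map([
    ['acme.com',    { protocol: 'saml', ssoUrl: 'https://idp.acme.com/sso' }],
    ['example.org', { protocol: 'oidc', ssoUrl: 'https://accounts.example.org/authorize' }],
]);

function resolveIdentityProvider(email) {
    const domain = email.split('@')[1];
    // Unknown domains fall back to username/password (or an explanatory error page).
    return idpByDomain.get(domain) || null;
}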

Use cookies to reduce the user’s signing-in steps.

To reduce manual input, use a long-lifetime cookie that specifies which Identity Provider a user is affiliated with. The next time they log in, send them directly to the authentication page. If they have an active session with the Identity Provider, they will be signed in without any extra steps, and their experience will be the same as if they had stayed logged into your app.
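
A sketch of that shortcut using Express and cookie-parser (the cookie name, the one-year lifetime, and the provider URL are illustrative placeholders, not recommendations):

const express = require('express');
const cookieParser = require('cookie-parser');

const app = express();
app.use(cookieParser());

app.get('/login', (req, res) => {
    const rememberedIdp = req.cookies.last_idp;
    if (rememberedIdp) {
        // We already know where this user authenticates; skip the email prompt.
        return res.redirect(rememberedIdp);
    }
    res.send('render the "enter your work email" form here');
});

// After a successful SSO round trip, remember the provider for next time.
app.post('/sso/callback', (req, res) => {
    res.cookie('last_idp', 'https://idp.acme.com/sso', {
        maxAge: 365 * 24 * 60 * 60 * 1000, // roughly one year
        httpOnly: true,
    });
    res.redirect('/');
});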

Redirect new users to sign-up flow while maintaining the authenticated state.

While obvious in retrospect, many SSO implementers forget the scenario where a new user accidentally tries to sign in via SSO (instead of signing up). In this case, redirect the user to the sign-up flow while maintaining the authenticated state through the sign-up process. Putting them through the authentication flow twice is like being that credit card customer support rep who asks for identity verification after the caller has already spent 18 minutes punching in the numbers in the automated system.

Logging out

Ask the users whether they’d like to be logged out of their Identity Provider as well.

When a user logs out of the application, they can either be logged out of the Identity Provider or not, and user expectations vary here. The options are either to prompt the user to make a choice upon logging out or to make it an admin setting for organizations.

Don’t risk unauthorized access by keeping authentication sessions verified for too long.

The longer the session persists, the more opportunity there is for an account that has been revoked with the Identity Provider to still have access to the application.
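
For example, with express-session the session lifetime is a single setting; keeping it reasonably short bounds how long a revoked account can keep using the app (the values below are illustrative):

const express = require('express');
const session = require('express-session');

const app = express();
app.use(session({
    secret: 'replace-with-a-real-secret',
    resave: false,
    saveUninitialized: false,
    cookie: {
        // After 8 hours the session expires and the user must re-authenticate
        // with the Identity Provider, which re-checks that their account is active.
        maxAge: 8 * 60 * 60 * 1000,
    },
}));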

Identity provider flow

Support all the Identity Providers your customers need.

Enterprises will often not even consider products that do not support their SSO Identity Providers.

:sparkles: Bonus tip: Don’t overlook the permissions-grant page.

This is an oft-overlooked page in product design. However, this is where the user grants the application permissions, so it is important that it inspires trust. Some key elements are the product image (which should be correctly sized and high resolution) and the content of the drop-down menu. The correct user email address needs to populate here as well.

Google Auth with Details

:sparkles: Bonus tip: Whitelist the app with SSO Identity Providers

A lesser-known fact is that applications can often be whitelisted with SSO Identity Providers, allowing the user to skip the permissions-grant page altogether. For example, in the case of Google Apps whitelisting, the customer’s admin can configure settings to allow an app direct access without interrupting the user with the permissions-grant page. This is the pinnacle of the SSO experience: a brand new user can arrive at the app via a deep link and start using it with exactly one click. Magic.

Multiple sign-in options

Give customers the option to force SSO as the only sign-in option.

This can be a gating feature. Enterprise customers often want to force SSO as the only sign-in mechanism because of easier user management. If an employee is no longer at a company, SSO makes it easy to revoke access.

Specify the process for resolving a user signing in via multiple SSO Identity Providers or manually.

Often a user will forget they signed up manually or via SSO and try the other option. When using a globally unique identifier, such as email, to identify users, automatically merging new accounts with the same email address would create the best user experience.

:sparkles: Bonus tip: Allow customer admins to bypass SSO with a username and password.

This is relevant when a company forces SSO for their employees. If SSO breaks or the client switches providers, then the admin needs to be able to log in and change configurations in their organization settings to indicate the new provider. Otherwise, everyone is locked out.

User management

:sparkles: Bonus tip: Allow the application to connect with the client’s user management directory for advanced controls.

The client’s user management directories offer metadata, such as access level or role in the company. Once connected to the application, they can be used to create whitelists and blacklists or level-based access.

Provide non-SSO guest accounts.

The majority of users from an Enterprise customer will come through their SSO with emails on the customer’s domain. Inevitably, however, the customer will want to grant access via other means for users outside of their domain, such as external consultants.


There are many considerations when building an SSO feature for Enterprise customers. The above is a list of the best practices we found useful when spec’ing out our implementation. It is by no means an exhaustive list but should serve as a guiding document which can help uncover more questions that need to be answered for a truly comprehensive spec. And once you’re there, it’s time to build. Good luck!

Everything I Wish I Had Known About Enterprise SSO

Single sign-on (SSO) makes it easy for users to get started with an application. For enterprise applications, support for SSO is critical: many corporate security policies require that all applications must use approved SSO methods. While the experience of using SSO is simple, its specification is anything but simple. It’s easy to get lost in a sea of jargon: OAuth 1.0, 1.0a, 2.0, SAML, JWT, OpenID, OpenID Connect, JIT, and tokens: bearer tokens, refresh tokens, access tokens, authorization tokens, skeeball tokens. Standards documents are too specific to allow you to generalize, and content from vendors is designed to make you think it’s all too complicated to do yourself. When I was tasked with building SSO for LightStep, I spent days researching. Below are some lessons that I hope will save you time and headaches. It boils down to knowing your market, your vocabulary, your standards, and your platform.

Know your market

This is both the most important and most time consuming task before designing your application’s SSO. All other considerations are moot if you don’t understand the ways (there can be more than one) your customers are already managing their accounts. It is worth probing a bit deeper than simply asking “What do you use for SSO?” because you may discover that there are multiple options available.

Important questions to consider:

  • Do they use Gmail or Google Apps? Does everyone in the company have an account with Github using their work email address?
  • Do they have a vendor for Single Sign-on such as Okta, OneLogin, or Ping Identity? Do they have an LDAP/Active Directory server or other internally managed Identity Provider?
  • How do they log in and manage user accounts with their other SaaS vendors?

Know your vocabulary

There is a lot of jargon in SSO standards, tutorials, and documentation, including the ways of referring to the parties in an SSO transaction. Sometimes the same term is used with different meanings! For example, I have seen Service Provider used to refer to at least two distinct roles in the authentication transaction.

These are the three most important concepts in SSO:

  • The User (or Principal or Client) is the individual whose identity is being verified so that you can grant them access to your application.
  • Your application is the Service Provider (or Consumer or Relying Party), which uses a third-party to verify the User’s credentials.
  • The Identity Provider (or Server or OpenID Connect Provider) is the authority responsible for verifying the identity of the User and furnishing claims or assertions of the User’s credentials to the Service Provider.

Know your standards

There are many shared authentication and authorization schemes and standards, though the term “standard” is often used loosely. Some sound a lot alike but are actually quite different. Here are the important standards used in Enterprise SSO:

  • SAML: Security Assertion Markup Language is an XML-based data format and protocol for user authentication. SAML has been around the longest, and it is more common in larger enterprises. Even companies such as Google that have embraced newer standards still support SAML-based workflows for SSO.
  • OpenID: OpenID 1.0 and 2.0 are obsolete standards for maintaining a digital identity with an Identity Provider, which would verify your identity to other websites, also known as Service Providers. They have been replaced by OpenID Connect.
  • OpenID Connect is the latest standard authentication protocol and data format from OpenID. It was originally based on the design of Facebook Connect and has since been embraced by Google and other large providers of user authentication. The standard relies on the OAuth 2.0 protocol for the User to grant the Service Provider access to their identity data, and on JSON Web Tokens (JWT) to package identity and other claims in the token payload (see the decoding sketch after this list).
  • OAuth 1.0a was deprecated in favor of OAuth 2.0 but is still in use by large companies, including Twitter. Revision A was released to close security holes in the original standard. It is considered more secure but more difficult to implement than its successor, OAuth 2.0.
  • OAuth 2.0 is the current OAuth standard as of this writing. OAuth 2.0 introduced a wider variety of data flows to support clients beyond the standard in-browser web application. OAuth 2.0 also removed the requirement for the client to encrypt the request, falling back on the built-in encryption of https communication. Even with the updated protocol, OAuth is an authorization scheme, not an authentication scheme. When you use OAuth for authentication, you are using it to get authorization from the User to access their credentials stored with the Provider, and you are trusting the provider to have verified those credentials.
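
Because an OpenID Connect ID token is just a signed JWT, you can inspect its claims by decoding the payload. A minimal sketch for Node (the decodeJwtClaims name is ours, and signature verification is deliberately omitted; real code must verify the token against the provider’s published keys):

// Decode (but do NOT verify) a JWT ID token to inspect its claims.
function decodeJwtClaims(idToken) {
    const payload = idToken.split('.')[1]; // format is header.payload.signature
    // JWTs use base64url; translate to standard base64 before decoding.
    const base64 = payload.replace(/-/g, '+').replace(/_/g, '/');
    return JSON.parse(Buffer.from(base64, 'base64').toString('utf8'));
}

// Typical OpenID Connect claims: iss (issuer), sub (stable user id),
// aud (your client id), exp (expiry), and often email and name.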

Know your platform

Many of these standards and protocols were developed with a focus on the canonical web app in a browser. If your application does not fit that mold, you may have fewer choices of how you implement SSO.

  • Mobile: Mobile apps do not have access to the stored cookies in the phone’s web browser when making http requests in a webview, but these cookies are how Identity Providers maintain the logged-in state of the User. Without that state, the mobile app may be able to use the credentials from the Identity Provider, but the User will have to log in again for each new app, taking you from single sign-on to every-time sign-on. Newer protocols, such as OpenID Connect, are working to support native mobile applications without this limitation through NAPPS support. NAPPS provides capabilities that enable the application to register a custom callback URL to support a flow from the app into and back out of the core system web browser.
  • Single-page apps: The standard SSO authentication flows involve at least one communication step between the application’s server and the Identity Provider, but in a single-page app you may want to avoid the server altogether. Even if you don’t, you may not want to lose the application state through the standard redirect-based approach. This is where the alternative flows for OAuth 2.0 and OpenID Connect come in. You need to be even more careful when implementing the serverless flows, since the entire flow happens in client-side JS and you don’t have the out-of-band server-to-server channel to exchange secure credentials. For this reason, big Identity Providers such as Google want you to use their JS toolkits.

I hope you found the distillation of my research on SSO useful. If there’s anything I missed, feel free to add to it in the comments and I will add it to the post and attribute it to you. Now that you have a solid foundation of SSO in enterprise, the next step is product design. Good luck!

Using a Mystery Shopper: Discovering Service Interruptions in Monitoring Systems

Many retail stores use mystery shoppers to assess the quality of their customer-facing operations. Mystery shoppers are employees or contractors who pretend to be ordinary shoppers, but ask specific questions or make specific complaints and then report on their experiences. These undercover shoppers can act as a powerful tool: not only do organizations get information on their employees’ reactions, they don’t need to depend on ordinary shoppers to ask the right questions.

At LightStep, we faced a similar problem: we wanted to continuously assess how well our service is monitoring our customers’ applications and identifying cases where those applications fail to meet their SLAs (or more properly, their SLOs). However, being an optimistic bunch, we don’t want to rely on our customers’ applications to continuously fail to meet their SLAs. 🙂 We needed another way to test whether or not LightStep was noticing when things were going wrong.

Who watches the watcher?

To provide some context, LightStep is a reliability monitoring tool that builds and analyzes traces of distributed or highly concurrent software applications. (A trace tells the story of a single request’s journey from mobile app or browser, through all the backend servers, and then back.) As a monitoring service, it’s critical that we carefully track our own service levels. Part of our solution is what we call the Sentinel. From the point of view of the rest of LightStep, the Sentinel looks just like any other customer. Unlike our real customers, however, the Sentinel’s behavior is predictable, and it is designed to trigger specific behaviors in our service. (We named it the “Sentinel” both because it helps keep watch on our service and because it creates traces with the intention of finding them later, making it similar in spirit to a sentinel value.)

Designing the sentinel

To understand what the Sentinel does, you’ll first need a crash course on LightStep: as part of tracing, every component in an application (including mobile apps, browsers, and backend servers) records the durations of important operations along with a log of any important events that occurred during those operations. It then packages this information up as a set of spans and sends it all to LightStep. There, each trace is assembled by taking all the spans for an end-user request and building a graph that shows the causal relationships between those spans. Of course, assembling every trace would be expensive, so choosing the right set of traces to assemble is an important part of the value that LightStep provides.

Distributed call graph (showing connections between components) and a trace showing the timing of one of these calls.

In designing the Sentinel, we first identified two important features of LightStep: assembling traces based on request latency and alerting our customers when the rate of errors in their applications exceeds a predetermined threshold. To exercise these features, the Sentinel generates two streams of data. The first is a kind of background or ambient signal: a set of spans that represent ordinary, day-to-day application requests. We ensure that the latencies of these spans test the limits of our high-fidelity latency histograms, and, most importantly, we check that the number and content of the assembled traces matches our expectations.

The second stream of spans represents a set of application errors. This stream periodically starts and stops, and each batch of errors exceeds the SLA threshold and causes an alert to trigger. Moments later, after the batch ends, the alert becomes inactive. On and off, on and off, all day long, these spans trigger alerts, and we verify each one.
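
As a rough sketch of that second stream, using the opentracing JavaScript API (the span name, tag, and cadence are invented for illustration; globalTracer() is a no-op unless a real tracer has been registered at startup):

const opentracing = require('opentracing');
const tracer = opentracing.globalTracer();

// Emit a burst of error spans large enough to push the error rate over the
// alert threshold; between bursts the stream goes quiet so the alert resolves.
function emitErrorBurst(count) {
    for (let i = 0; i < count; i++) {
        const span = tracer.startSpan('sentinel/error-request');
        span.setTag('error', true);
        span.finish();
    }
}

// On and off, all day long (illustrative cadence).
setInterval(() => emitErrorBurst(100), 10 * 60 * 1000);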

The Sentinel has helped us discover incidents that other monitoring tools haven’t and avoids spurious alerts that might have been caused solely by changes in our customers’ behavior. We’ve found the Sentinel to be a particularly powerful technique when used in combination with a load test. While the load test simulates an unruly customer, the Sentinel acts as a well-behaved one. Using them together means that we can verify that one doesn’t interfere with the other.

Comparison to other monitoring techniques

Why not just use a health-checking tool like Pingdom? Of course, we use tools like those as well, but we’ve found that the Sentinel enables us to test more complicated interactions than off-the-shelf health-checking tools. Assembling traces from complex applications can be… well, complex, since spans from even a single trace can come from different parts of an application and may arrive out of order. No single span has the complete picture of what’s happening: in fact, the point of assembling a trace is to show the complete picture! Another way of saying this is that the correctness condition for trace assembly is defined globally: only by considering many different API requests (and their results) can we say whether or not a trace was assembled correctly.

Isn’t this all just an integration test? In a way, yes, but we see integration testing as a way of validating that our code works, while our online monitoring, including the Sentinel, ensures that our service continues to work. We explicitly decided that we wouldn’t try to use the Sentinel to cover all of LightStep’s features. While coverage is important for integration testing, we wanted the Sentinel just to test the most important features and components of LightStep and to test them continuously. Picking a subset of features helps us keep the Sentinel simpler and more robust.

When to use your own mystery shopper

The Sentinel acts as a mystery shopper, letting us carefully control the input to LightStep and validate the results. You might find a similar technique valuable, especially if the behavior of your service can’t be tested with a single API request and there are complex interactions between requests, including time dependence or the potential for interference with other systems.

For example, if you have a product that includes some form of user notification, you might want to test the following sequence:

  1. Set up a notification rule
  2. Send a request that triggers the rule
  3. Check that the notification is sent

Continuously exercising this sequence can give you confidence that your service is up and running. Just don’t forget to remove the notification rule so that it can be tested again!
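
A stripped-down sketch of that loop (the client object and its three methods are hypothetical stand-ins for your own service’s API):

// Hypothetical client with createNotificationRule, sendRequest,
// waitForNotification, and deleteNotificationRule methods.
async function runMysteryShopperCheck(client) {
    const rule = await client.createNotificationRule({ match: 'sentinel-event' });
    try {
        await client.sendRequest({ type: 'sentinel-event' });
        const delivered = await client.waitForNotification(rule.id, { timeoutMs: 60000 });
        if (!delivered) {
            throw new Error('notification was not delivered within the timeout');
        }
    } finally {
        // Always clean up the rule so the same check can run again next time.
        await client.deleteNotificationRule(rule.id);
    }
}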

As in the case of any testing or monitoring, think about what matters to your users. What features do they depend on most? Just as a retail store manager can hire a mystery shopper to ask the right questions, you should use monitoring tools to verify that your most important features are working to spec.

Want to chat about monitoring, mystery shoppers, or SLAs? Reach us at hello@lightstep.com, @lightstephq, or in the comments below.

TracedPromise: Visualizing Async JS Code

Writing code in Node.js strongly encourages use of event-driven callbacks and asynchronous functions. There are plenty of advantages to writing event-driven code, but it can often obscure the control flow of operations.

We’ll look at how adding OpenTracing instrumentation to JavaScript ES6 Promise objects can allow you to visualize how asynchronous code behaves. We’ll then dig into the nuts and bolts of a TracedPromise class to demonstrate how you can see useful tracing information with only a small code change.

As the example throughout, we’ll use an instrumented Promise.all() call. The example contains five sub-operations, including a single rejected Promise. The trace for the example looks like this:

A bit about promises

If you’re already a Promises expert, feel free to skip right along to the next section.

There are many ways to make coordinating asynchronous JavaScript code more sane. For example, Caolan McMahon’s async library is one immensely popular option that is not promises-based. There are also plenty of libraries that have a more promise-like API, though they may use terminology like futures, deferreds, subscriptions, or callback objects. The comprehensive Bluebird package and the concisely named ‘Q’ package are two of the more popular ones.

Promises have recently become a more interesting choice as the Promise class is a standard ES6 class* with pretty good support across platforms. With Promises as a standard API—and a rather effective, trim API at that—the process of writing asynchronous code is a lot more natural and the code itself becomes less mentally taxing to read.

*ES6 being shorthand for ECMAScript 6th Edition. JavaScript is an implementation of ECMAScript.

Example code

Let’s look at the Promises “fail-fast” example, copied from the MDN documentation:

var p1 = new Promise((resolve, reject) => {
    setTimeout(resolve, 1000, "one");
});
var p2 = new Promise((resolve, reject) => {
    setTimeout(resolve, 2000, "two");
});
var p3 = new Promise((resolve, reject) => {
    setTimeout(resolve, 3000, "three");
});
var p4 = new Promise((resolve, reject) => {
    setTimeout(resolve, 4000, "four");
});
var p5 = new Promise((resolve, reject) => {
    reject("reject");
});

Promise.all([p1, p2, p3, p4, p5]).then(value => {
    console.log(value);
}, function(reason) {
    console.log(reason)
});

While this isn’t a full tutorial on using Promises, the general idea is that Promise objects allow for asynchronous operations to be written without resorting to deep nesting of callbacks. Functions that finish with a single resolve or reject callback are encapsulated into Promise objects, which in turn can be chained together or coordinated using a few standard Promise methods. These Promise method calls replace the deep nesting.

If you’re interested in more about why ES6 Promises are a good fit for JavaScript, check out runnable’s recent post on reasons to use Promises.

Visualizing the control flow with a trace

While the code is easier to write (and more importantly to read) with Promises, there’s still the challenge of knowing what actually happened at runtime. That’s precisely what tracing is for.

Let’s show the Promise.all() trace again. It’s based on the example code from the prior section (we’ll get to how this was instrumented in a moment):

Very briefly, a trace is a collection of related spans. A span can be thought of as a single operation with well-defined start and finish times. A trace is a graph of the relationships between those spans. In the common case where a trace represents a single request, the trace is just a tree of spans with parent-child relationships. The Google research paper, Dapper, a Large-Scale Distributed Systems Tracing Infrastructure, is a good place to head if you’d like to get more detail.

In the specific trace above, five operations are started concurrently (p1 through p5). Promise.all() is used to coordinate on the success or failure of that group of operations. The example code itself is contrived (each operation is simply a timeout), but it’s easy to imagine the operations as network, file, database, or other asynchronous requests. The five inner requests are issued and the outer operation cannot proceed until the child operations have either succeeded or failed.

Fail-fast behavior visualized

One quick insight from visualizing this trace: the Promise.all() function is, in fact, a “fail-fast” operation. The timing of the Promise.all() call shows that control returns to its handler as soon as p5 fails. It does not wait for p4 to resolve before proceeding.

For a standard function like all(), the insight is limited (it’s documented behavior after all). However, what’s happening is made clear from the trace rather than from what’s been manually documented. Visualizing the runtime behavior is especially helpful when applied to a production system where complexity and lack of documentation might make such insight otherwise unavailable.

Taking it for granted that tracing visualizations would be useful in a complex system: what code changes are needed to make an ES6 Promise into a building block for collecting tracing data about a system?

Instrumenting promises

To keep the code simple and clean, we’ll create a new class called TracedPromise that closely mirrors a standard Promise. Mirroring the existing API allows existing code to be adapted rapidly. This technique should apply equally well to other promise-like libraries and methods, making it relatively easy to adapt something like Q’s spread method.

A standard ES6 Promise with a little more data

The TracedPromise API should be as simple and familiar as possible. The intent of tracing is to make code easier to understand, not more complex, after all. There are only two additional pieces of information we need to make standard promises more effective for tracing:

  • Add a meaningful name for each promise: good names will make examining the data offline more effective.
  • Add a reference to the parent: Promise objects can be chained, but don’t have the notion of parent-child relationships. In practice, it is not uncommon in systems of even moderate complexity to have deeply nested Promises. Exposing those relationships in the tracing data can be very valuable.

Example code—this time with tracing!

The original example needs a few modifications to use TracedPromise:

  • Create an outer span to capture the total work being done (i.e. to correspond to what Promise.all() is tracking)
  • Use a TracedPromise object in place of a standard Promise
  • Add arguments to give the TracedPromise objects names and parent-child relationships

// NEW: Set up an initial span to track all the subsequent work
let parent = opentracing.startSpan('Promises.all');

// NEW: Assign a parent and name to each child Promise
let p1 = new TracedPromise(parent, 'p1', (resolve, reject) => {
    setTimeout(resolve, 100, 'one');
});
let p2 = new TracedPromise(parent, 'p2', (resolve, reject) => {
    setTimeout(resolve, 200, 'two');
});
let p3 = new TracedPromise(parent, 'p3', (resolve, reject) => {
    setTimeout(resolve, 300, 'three');
});
let p4 = new TracedPromise(parent, 'p4', (resolve, reject) => {
    setTimeout(resolve, 400, 'four');
});
let p5 = new TracedPromise(parent, 'p5', (resolve, reject) => {
    setTimeout(reject, 250, 'failure!');
});

// NEW: Use TracedPromise.all() to finish the parent span
TracedPromise.all(parent, [p1, p2, p3, p4, p5]).then(value => {
    console.log(`Resolved: ${value}`);
}, reason => {
    console.log(`Rejected: ${reason}`);
});

The code has the same fundamental form and should be easily understood and recognizable to anyone familiar with an untraced Promise.

Implementation with OpenTracing

If you’re using a standard ES6 Promise, the TracedPromise class may already be directly usable in your code. If you’re using another approach to dealing with asynchronous code, the handful of changes needed to create TracedPromise should generalize to many similar libraries and primitives.

OpenTracing spans

Tracing and OpenTracing are topics of their own. For the purposes of the code snippets below, the key idea is this: an OpenTracing Span tracks a meaningful operation from start to finish. To give a very quick example, here’s how a span could be used to track an asynchronous file read:

var span = Tracer.startSpan('fs.readFile');
fs.readFile('my_file.txt', 'utf8', function(err, text) { // ignoring errors!
    span.finish();
    doStuffWithTheText(text);
});

In the case of a promise, the promise itself is the operation. A Span object should be used to track when the promise’s work logically starts and when it finishes.

First a few helpers

Promises work on the basis of resolve and reject callback functions, which finish the promise (moving it out of the pending state into either a fulfilled or rejected state). Two simple helpers will be used to create wrapped versions of the resolve and reject functions. These will behave the same as the unwrapped versions but also finish the span associated with the Promise.

function wrapResolve(span, f) {
    return function (...args) {
        span.finish();
        return f(...args);
    };
}

function wrapReject(span, f) {
    return function (...args) {
        span.setTag('error', 'true');
        span.finish();
        return f(...args);
    };
}

(Quick note on a span that has an error: there’s still some discussion around “standard tags” in the OpenTracing community. The tracer used to generate the images, LightStep, treats spans with the tag 'error' as containing errors.)

The TracedPromise constructor()

The TracedPromise class encapsulates a paired Promise and OpenTracing Span.

The first two arguments to the constructor are used to set up the TracedPromise’s own span. The third argument mirrors the standard Promise constructor’s argument: a single callback that receives a resolve, reject pair. The helpers above are used to wrap the resolve and reject function objects in new function objects that will finish the span in concert with the promise.

constructor(parent, name, callback) {
    let span = opentracing.startSpan(name, { childOf : parent });
    let wrappedCallback = (resolve, reject) => callback(
        wrapResolve(span, resolve),
        wrapReject(span, reject)
    );
    this._promise = new Promise(wrappedCallback);
    this._span = span;
}

(An aside: TracedPromise intentionally does not inherit from Promise. The native promise implementation—at least as of Node v6.3.0—will create new Promise objects using the invoking object’s constructor, meaning it will internally create TracedPromise objects, not Promise objects. Using inheritance with TracedPromise would require code specifically be aware of this implementation detail, which is not desirable.)

The then() and catch() methods

The then method behaves very similarly to the constructor: it simply wraps the two function objects and then forwards them along to the underlying promise.

then(onFulfilled, onRejected) {
    return this._promise.then(
        wrapResolve(this._span, onFulfilled),
        wrapReject(this._span, onRejected)
    );
}

…and catch() is even simpler:

catch(onRejected) {
    return this._promise.catch(wrapReject(this._span, onRejected));
}

all() & race() methods

In the case of the static all method, the span should finish as soon as the promise created by all settles (resolves or rejects). Standard then and catch handlers can do just this. The race() method has the same requirement. We’ll add one more helper and then implement all and race:

function chainFinishSpan(promise, span) {
    return promise.then((value) => {
        span.finish();
        return value;
    }, (reason) => {
        span.setTag('error', 'true');
        span.finish();
        return Promise.reject(reason);
    });
}

Back in the TracedPromise implementation:

static all(span, arr) {
    return chainFinishSpan(Promise.all(arr), span);
}

static race(span, arr) {
    return chainFinishSpan(Promise.race(arr), span);
}

reject() & resolve() methods

The final two static methods of Promise are reject and resolve. These are trivial to handle. Both immediately return a completed Promise. There’s no need to add tracing instrumentation as no on-going operation exists.

static reject(...args) {
    return Promise.reject(...args);
}

static resolve(...args) {
    return Promise.resolve(...args);
}

That’s it! The ES6 Promise is a slim API composed of only a handful of methods.

Adding tracing to real world systems

The exercise of creating a TracedPromise is intended to be a simple, but practical example of how tracing can make asynchronous control flow far easier to understand. The implementation details also show that adding tracing instrumentation is fairly straightforward.

Production systems in the real world aren’t all exclusively based on ES6 Promises, but most architectures of a certain scale are built around a handful of basic primitives and patterns. Whether they are Promise objects, Promise-like objects, control flow helper libraries, or custom in-house code, the general principles behind the instrumentation stay the same. A few simple additions to the core building blocks of an asynchronous system can go a long way toward a better understanding of the system itself.

OpenTracing is an open source project on Github. Contributions for tracing-enabled versions of common Node.js libraries and utilities would undoubtedly be welcome! LightStep is a distributed tracing system that relies on the OpenTracing standard.

The full code for this post is available at: github.com/opentracing-contrib/javascript-tracedpromise.