Performance is a Shape, Not a Number

This article originally appeared on Medium.

Applications have evolved – again – and it’s time for performance analysis to follow suit

In the last twenty years, the internet applications that improve our lives and drive our economy have become far more powerful. As a necessary side-effect, these applications have become far more complex, and that makes it much harder for us to measure and explain their performance – especially in real-time. Despite that, the way that we both reason about and actually measure performance has barely changed.

I’m not here to argue for the importance of understanding real-time performance in the face of rising complexity – by now, we all realize it’s vital – but to argue that we need a better mental model as we recognize and diagnose anomalies. When assessing “right now,” our industry relies almost entirely on averages and percentile estimates: these are not enough to efficiently diagnose performance problems in modern systems. Performance is a shape, not a number, and effective tools and workflows should present and explore that shape, as we illustrate below.

We’ll divide the evolution of application performance measurement into three “phases.” Each phase had its own deployment model, its own predominant software architecture, and its own way of measuring performance. Without further ado, let’s go back to the olden days: before AWS, before the smartphone, and before Facebook (though perhaps not Friendster)…


Phase 1: Bare Metal and average latency (~2002)

The stack (2002): a monolith running on a hand-patched server with a funny hostname in a datacenter you have to drive to yourself.

If you measured application performance at all in 2002, you probably did it with average request latency. Simple averages work well for simple things: namely, normally-distributed things with low variance. They are less appropriate when there’s high variance, and they are particularly bad when the sample values are not normally distributed. Unfortunately, latency distributions today are rarely normally distributed, can have high variance, and are often multimodal to boot. (More on that later.)
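To see how easily a mean can mislead, here is a minimal sketch in Python (synthetic numbers, not LightStep data): a bimodal distribution in which one request in twenty is an order of magnitude slower than the rest, yet the mean and median both look healthy.

```python
import random
import statistics

random.seed(42)

# Synthetic bimodal latency distribution: most requests are fast,
# but a minority fall into a much slower second mode (e.g., cache misses).
fast = [random.gauss(5.0, 0.5) for _ in range(9500)]    # ~5ms mode
slow = [random.gauss(80.0, 10.0) for _ in range(500)]   # ~80ms mode
latencies_ms = sorted(fast + slow)

mean = statistics.mean(latencies_ms)
p50 = latencies_ms[int(0.50 * len(latencies_ms))]
p99 = latencies_ms[int(0.99 * len(latencies_ms))]

# The mean (~9ms) and median (~5ms) both look fine; only a high
# percentile hints that a second, slower mode exists at all.
print(f"mean={mean:.1f}ms  p50={p50:.1f}ms  p99={p99:.1f}ms")
```

Neither the mean nor the median tells you that the slow mode exists, let alone why.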

To make this more concrete, here’s a chart of average latency for one of the many microservice handlers in LightStep’s SaaS:

Recent average latency for an important internal microservice API call at LightStep

It holds steady at around 5ms, essentially all of the time. Looks good! 5ms is fast. Unfortunately it’s not so simple: average latency is a poor leading indicator of reliability woes, especially for scaled-out internet applications. We’ll need something better…

Phase 2: Cloud VMs and p99 latency (~2012)

The stack (2012): a monolith running in AWS with a few off-the-shelf services doing special-purpose heavy lifting (Solr, Redis, etc.).

Even if average latency looks good, we still don’t know a thing about the outliers. Per this great Jeff Dean talk, in a microservices world with lots of fanout, an end-to-end transaction is only as fast as its slowest dependency. As our applications transitioned to the cloud, we learned that high-percentile latency was an important leading indicator of systemic performance problems.

Of course, this is even more true today: when ordinary user requests touch dozens or hundreds of service instances, high-percentile latency in backends translates to average-case user-visible latency in frontends.
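A back-of-the-envelope calculation makes the point (a sketch with assumed numbers, not measurements from any particular system): if each backend call independently exceeds its p99 latency 1% of the time, the fraction of user requests that hit at least one p99-slow call grows rapidly with fanout.

```python
# Fraction of user requests that touch at least one "slow" backend call,
# assuming each of n_calls independently exceeds its p99 latency 1% of the time.
def frac_seeing_slow_call(n_calls: int, p_slow: float = 0.01) -> float:
    return 1.0 - (1.0 - p_slow) ** n_calls

for n in (1, 10, 100):
    print(f"fanout={n:>3}: {frac_seeing_slow_call(n):.0%} of requests see a p99-slow call")
# fanout=  1:  1% of requests see a p99-slow call
# fanout= 10: 10% of requests see a p99-slow call
# fanout=100: 63% of requests see a p99-slow call
```

With a fanout of 100, what was a one-in-a-hundred event per backend becomes the common case for end users.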

To emphasize the importance of looking (very) far from the mean, let’s look at recent p95 for that nice, flat, 5ms average latency graph from above:

Recent p95 latency for the same important internal microservice API call at LightStep

The latency for p95 is higher than p50, of course, but it’s still pretty boring. That said, when we plot recent measurements for p99.9, we notice meaningful instability and variance over time:

Recent p99.9 latency for the same microservice API call. Now we see some instability.

Now we’re getting somewhere! With a p99.9 like that, we suspect that the shape of our latency distribution is not a nice, clean bell curve, after all… But what does it look like?

Phase 3: Microservices and detailed latency histograms (2018)

The stack (2018): A few legacy holdovers (monoliths or otherwise) surrounded — and eventually replaced — by a growing constellation of orchestrated microservices.

When we reason about a latency distribution, we’re trying to understand the distinct behaviors of our software application. What is the shape of the distribution? Where are the “bumps” (i.e., the modes of the distribution) and why are they there? Each mode in a latency distribution is a different behavior in the distributed system, and before we can explain these behaviors we must be able to see them.

In order to understand performance “right now”, our workflow ought to look like this:

  1. Identify the modes (the “bumps”) in the latency histogram
  2. Triage to determine which modes we care about: consider both their performance (latency) and their prevalence
  3. Explain the behaviors that characterize these high-priority modes
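
As a rough illustration of the first step (a sketch over a flat list of latency samples, not how any particular product implements it), a fixed-width histogram plus a simple local-maximum scan is enough to surface candidate modes:

```python
from collections import Counter

def latency_histogram(latencies_ms, bucket_ms=5):
    """Count samples per fixed-width bucket (a real tool would likely use
    log-scale or variable-width buckets)."""
    counts = Counter(int(latency // bucket_ms) for latency in latencies_ms)
    return [counts.get(b, 0) for b in range(max(counts) + 1)]

def find_modes(hist, min_count=10):
    """Return bucket indices that are local maxima: the candidate 'bumps'."""
    modes = []
    for i, count in enumerate(hist):
        left = hist[i - 1] if i > 0 else 0
        right = hist[i + 1] if i + 1 < len(hist) else 0
        if count >= min_count and count >= left and count >= right:
            modes.append(i)
    return modes

# hist = latency_histogram(latencies_ms)
# find_modes(hist)  # e.g. buckets near 5ms and 80ms for the bimodal example above
```

Triage (step 2) then comes down to weighing each mode’s latency (how far right it sits) against its prevalence (how tall the bump is).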

Too often we just panic and start clicking around in hopes that we stumble upon a plausible explanation. Other times we are more disciplined, but our tools only expose bare statistics without context or relevant example transactions.

This article is meant to be about ideas (rather than a particular product), but the only real-world example I can reference is the recently released Live View functionality in LightStep [x]PM. Live View is built around an unsampled, filterable, real-time histogram representation of performance that’s tied directly to distributed tracing for root-cause analysis. To get back to our example, below is the live latency distribution corresponding to the percentile measurements above:

A real-time view of latency for a particular API call in a particular microservice. We can clearly distinguish distinct modes (the “bumps”) in the distribution; if we want to restrict our analysis to traces from the slowest mode, we filter interactively.

The histogram makes it easy to identify the distinct modes of behavior (the “bumps” in the histogram) and to triage them. In this situation, we care most about the high-latency outliers on the right side. Compare this data with the simple statistics from “Phase 1” and “Phase 2” where the modes are indecipherable.

Having identified and triaged the modes in our latency distribution, we now need to explain the concerning high-latency behavior. Since [x]PM has access to all (unsampled) trace data, we can isolate and zoom in on any feature regardless of its size. We filter interactively to home in on an explanation: first by restricting to a narrow latency band, and then further by adding key:value tag restrictions. Here we see how the live latency distribution varies from one project_id to the next (project_id being a high-cardinality tag for this dataset):

Given 100% of the (unsampled) data, we can isolate and zoom in on any feature, no matter how small. Here the user restricts the analysis to project_id 22, then project_id 36 (which have completely different performance characteristics). The same can be done for any other tag, even those with high cardinality: experiment ids, release ids, and so on.
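
Mechanically, this kind of interactive slicing is just filtering the same set of spans by latency band and by tag values. Here is a minimal sketch over a hypothetical in-memory list of (duration, tags) records, not the actual [x]PM data model:

```python
# Hypothetical span records: (duration_ms, {tag_key: tag_value, ...}).
spans = [
    (4.8,   {"project_id": "22"}),
    (105.3, {"project_id": "36"}),
    # ... one record per traced request
]

def filter_spans(spans, min_ms=0.0, max_ms=float("inf"), **tags):
    """Keep spans inside a latency band whose tags match every key=value given."""
    return [
        (duration, span_tags)
        for duration, span_tags in spans
        if min_ms <= duration <= max_ms
        and all(span_tags.get(k) == v for k, v in tags.items())
    ]

# Restrict to the slow mode, then drill into a single high-cardinality tag value.
slow_spans = filter_spans(spans, min_ms=100.0)
slow_project_36 = filter_spans(spans, min_ms=100.0, project_id="36")
```

The hard part at scale is doing this over 100% of the data in real time; the slicing logic itself is straightforward.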

Here we are surprised to learn that project_id 36 experienced consistently slower performance than the aggregate. Again: Why? We restrict our view to project_id=36, filter to examine the latency outliers, and open a trace. Since [x]PM can assemble these traces retroactively, we always find an example, even for rare behavior:

To attempt end-to-end root cause analysis, we need end-to-end transaction traces. Here we filter to outliers for project_id 36, choose a trace from a few seconds ago, and realize it took 109ms to acquire a mutex lock: our smoking gun.

The (rare) trace we isolated shows us the smoking gun: that contention around mutex acquisition dominates the critical path (and explains why this particular project — with its own highly-contended mutex — has inferior performance relative to others). Again, compare against a bare percentile: simply measuring p99 latency is a far cry from effective performance analysis.

Stepping back and looking forward…

As practitioners, we must recognize that countless disconnected timeseries statistics are not enough to explain the behavior of modern applications. While p99 latency can still be a useful statistic, the complexity of today’s microservice architectures warrants a richer and more flexible approach. Our tools must identify, triage, and explain latency issues, even as organizations adopt microservices.

If you made it this far, I hope you’ve learned some new ways to think about latency measurements and how they play a part in diagnostic workflows. LightStep continues to invest heavily in this area: to that end, please share your stories and points of view in the comment section, or reach out to me directly (Twitter, Medium, LinkedIn), either to provide feedback or to nudge us in a particular direction. I love to nerd out along these lines and welcome outside perspectives.

Want to work on this with me and my colleagues? It’s fun! LightStep is hiring.

Want to make your own complex software more comprehensible? We can show you exactly how LightStep [x]PM works.

KubeCon 2017: The Application Layer Strikes Back

You know it’s a special event when it snows in Texas.

Several of my delightful colleagues and I just returned from a remarkably chilly – and remarkably memorable – trip to Austin for KubeCon+CloudNativeCon Americas. We went because we were excited to talk shop about the future of microservices with 4,500 others involved with the larger cloud-native ecosystem. We had high hopes for the conference, as you won’t find a higher-density group of attendees when it comes to strategic, forward-thinking infrastructure people; yet even our lofty expectations were outdone by the buzz and momentum on display at the event.

On Wednesday at 7:55am, I emerged from my hotel room and had the good fortune of running into the inimitable Kelsey Hightower on my way to the elevator. I never miss an opportunity to learn something from Kelsey, so I asked him what was new and special in k8s-land these days. His response, paraphrased, was that “the big feature news is that – finally – we don’t have a big new feature in Kubernetes.” He went on to explain that this newfound stability at the infrastructural layer is a huge milestone for the k8s movement and opens the door to innovation above and around Kubernetes proper.

From an ecosystem standpoint, I was also lucky to speak with Chen Goldberg as part of a dinner that IBM organized. It was fascinating to hear how she and her team have architected the boundaries of Kubernetes to optimize for community. The project nails down the parts of the system that require standardization, while carving out white-space for projects and vendors to innovate around those core primitives.

This combination of technology and project vision, along with API stability, has led us to the present reality: Kubernetes has won when it comes to container orchestration and scheduling. That was not clear last year and was very far from clear two or three years ago, but with even the likes of AWS going all-in on Kubernetes, we now have OSS developers, startup vendors, and all of the big cloud providers bought in on the platform. So now everyone and their dog are going to become a Kubernetes expert, right?

Not really. It’s even better than that: our industry is evolving towards a reality where everyone and their dog are going to depend on Kubernetes and containers, yet nobody will need to care about Kubernetes and containers. This is a huge and much-needed transformation, and reminiscent of how microservice development looked within Google: every service did indeed run in a container which was managed by an orchestration and scheduling system (internally code-named “Borg”), but developers didn’t know or care how the containers were built, nor did they need to know or care how Borg worked.

So what will devs and devops care about? They will care about application-layer primitives, and those primitives are what KubeCon + CloudNativeCon was about this year. As I mentioned in my keynote on Wednesday, this means that devs and devops will be able to take advantage of CNCF technologies like service mesh (e.g., Envoy and Istio) as well as OpenTracing in order to tell coherent stories about the interaction between their microservices and monoliths.

We were humbled to hear existing LightStep customers telling folks who stopped by our booth how our solution has helped them tell clear stories about the most urgent issues affecting their own systems. Because LightStep integrates at the application layer – through OpenTracing, Envoy, transcoded logging data, or in-house tracing systems – it’s easy to connect our explanations for system behavior to the business logic and application semantics, and to steer clear of the poor signal-to-noise ratio of unfiltered container-level data.

Given the momentum behind Kubernetes and microservices in general, KubeCon felt like a glimpse into the future. That future will empower devs/devops to build and ship features faster and with greater independence. With CNCF’s portfolio of member projects fleshing out the stack around and above Kubernetes, we’re all moving to a world where we can stop caring about containers and keep our focus where it belongs: at the application layer where our developers write and debug their own software.

Announcing LightStep: A New Approach for a New Software Paradigm


Today, LightStep emerged from stealth and announced its first product, LightStep [x]PM, as well as its Series A and Series B funding.

With today’s launch, we’re excited to speak more openly about what we’ve been up to here at LightStep. As a company, we focus on delivering deep insights about every aspect of high-stakes production software. With our first product, LightStep [x]PM, we identify and troubleshoot the most impactful performance and reliability issues. This post is about how we got here and why we’re so excited.

I started thinking about this problem in 2004. It began during an impromptu conversation I had with Sharon Perl, a brilliant research scientist who came to Google in the early days. She was mainly working on an object store (à la S3) at the time but also had a few prototype side projects. We talked through five of them, I believe, but one captured my attention in particular: Dapper.

Dapper circa 2004 was not fully baked, though the idea was magical to me: Google was operating thousands of independently-scalable services (they’d be called “microservices” today), and Dapper was able to automatically follow user requests across service boundaries, giving developers and operators a clear picture of why some requests were slow and others ended with an error message. I was so enamored of the idea that I dropped what I was doing at the time, adopted the (orphaned) Dapper prototype, and built a team to get something production-ready deployed across 100% of Google’s services. What we built was (and is still) essential for long-term performance analysis, but in order to contend with the scale of the systems being monitored, Dapper only centrally recorded 0.01% of the performance data; this meant that it was challenging to apply to certain use cases, such as real-time incident response (i.e., “most firefighting”).

Ten years later, Ben Cronin, Spoons (Daniel Spoonhower), and I co-founded LightStep. Enterprises are in the midst of an architectural transformation, and the systems our customers and prospects build look a lot like the ones I grew up with at Google. We visit with enterprise engineering and ops leaders frequently, and what we see are businesses that live (or die) by their software, yet often struggle to stay in control of it given the overwhelming scale and complexity of their own systems.

We built LightStep to help with this, and we started with LightStep [x]PM to focus on performance and reliability in particular. Our platform is not a reimplementation of Dapper, but an evolution and a broadening of its value prop: with LightStep’s unconventional architecture, we can analyze 100.0% of transaction data rather than 0.01% of it like we did with Dapper. This unique – and technically sophisticated – approach gives our customers the freedom to focus on the performance issues that are most vital to their business and jump to the root cause with detailed end-to-end traces, all in real-time.

For instance, Lyft sends us a vast amount of data – LightStep analyzes 100,000,000,000 microservice calls every day. At first glance, that data is all noise and no signal: overwhelming and uncorrelated. Yet by considering the entirety of it, LightStep can measure how performance affects different aspects of Lyft’s business, then explain issues and anomalies using end-to-end traces that extend from their mobile apps to the bottom of their microservices stack. The story is similar for Twilio, GitHub, Yext, DigitalOcean, and the rest of our customers: they run LightStep 100% of the time, in production, and use it to answer pressing questions about the behavior of their own complex software.

The credit for what LightStep has accomplished goes to our team. We value technical skill and motivation, of course; that said, we also value emotional sensitivity, situational awareness, and the ability to prioritize and leverage our limited resources. LightStep will continue to innovate and grow well into the future, and the people here and their relationships with our inspiring customers are the reason why. The company has also benefited in innumerable ways from early investors Aileen Lee and Michael Dearing, the staff at Heavybit, and of course our board members from Redpoint and Sequoia, Satish Dharmaraj and Aaref Hilaly. Our board brings deep company-building experience as well as a humility and humor that we don’t take for granted.

It’s no secret that software is getting more powerful every day. As it does, it becomes more complex. LightStep exists in order to decipher that complexity, and ultimately to deliver insights and information that let our customers get back to innovating. Nothing gets us more excited than the success stories we hear from our customers. As we continue to build towards our larger vision, we look forward to hearing many more.