This article originally appeared on Medium.
Applications have evolved – again – and it’s time for performance analysis to follow suit
In the last twenty years, the internet applications that improve our lives and drive our economy have become far more powerful. As a necessary side-effect, these applications have become far more complex, and that makes it much harder for us to measure and explain their performance – especially in real-time. Despite that, the way that we both reason about and actually measure performance has barely changed.
I’m not here to argue about the importance of understanding real-time performance in the face of rising complexity – by now, we all realize it’s vital – but for the need to improve our mental model as we recognize and diagnose anomalies. When assessing “right now,” our industry relies almost entirely on averages and percentile estimates: these are not enough to efficiently diagnose performance problems in modern systems. Performance is a shape, not a number, and effective tools and workflows should present and explore that shape, as we illustrate below.
We’ll divide the evolution of application performance measurement into three “phases.” Each phase had its own deployment model, its own predominant software architecture, and its own way of measuring performance. Without further ado, let’s go back to the olden days: before AWS, before the smartphone, and before Facebook (though perhaps not Friendster)…
Watch our tech talk now. Hear Ben Sigelman, LightStep CEO, present the case for unsampled latency histograms as an evolution of and replacement for simple averages and percentile estimates.
Phase 1: Bare Metal and average latency (~2002)
The stack (2002): a monolith running on a hand-patched server with a funny hostname in a datacenter you have to drive to yourself.
If you measured application performance at all in 2002, you probably did it with average request latency. Simple averages work well for simple things: namely, normally-distributed things with low variance. They are less appropriate when there’s high variance, and they are particularly bad when the sample values are not normally distributed. Unfortunately, latency distributions today are rarely normally distributed, can have high variance, and are often multimodal to boot. (More on that later)
To make this more concrete, here’s a chart of average latency for one of the many microservice handlers in LightStep’s SaaS:
Recent average latency for an important internal microservice API call at LightStep
It holds steady at around 5ms, essentially all of the time. Looks good! 5ms is fast. Unfortunately it’s not so simple: average latency is a poor leading indicator of reliability woes, especially for scaled-out internet applications. We’ll need something better…
Phase 2: Cloud VMs and p99 latency (~2012)
The stack (2012): a monolith running in AWS with a few off-the-shelf services doing special-purpose heavy lifting (Solr, Redis, etc).
Even if average latency looks good, we still don’t know a thing about the outliers. Per this great Jeff Dean talk, in a microservices world with lots of fanout, an end-to-end transaction is only as fast as its slowest dependency. As our applications transitioned to the cloud, we learned that high-percentile latency was an important leading indicator of systemic performance problems.
Of course, this is even more true today: when ordinary user requests touch dozens or hundreds of service instances, high-percentile latency in backends translates to average-case user-visible latency in frontends.
To emphasize the importance of looking (very) far from the mean, let’s look at recent p95 for that nice, flat, 5ms average latency graph from above:
Recent p95 latency for the same important internal microservice API call at LightStep
The latency for p95 is higher than p50, of course, but it’s still pretty boring. That said, when we plot recent measurements for p99.9, we notice meaningful instability and variance over time:
Recent p99.9 latency for the same microservice API call. Now we see some instability.
Now we’re getting somewhere! With a p99.9 like that, we suspect that the shape of our latency distribution is not a nice, clean bell curve, after all… But what does it look like?
Phase 3: Microservices and detailed latency histograms (2018)
The stack (2018): A few legacy holdovers (monoliths or otherwise) surrounded — and eventually replaced — by a growing constellation of orchestrated microservices.
When we reason about a latency distribution, we’re trying to understand the distinct behaviors of our software application. What is the shape of the distribution? Where are the “bumps” (i.e., the modes of the distribution) and why are they there? Each mode in a latency distribution is a different behavior in the distributed system, and before we can explain these behaviors we must be able to see them.
In order to understand performance “right now”, our workflow ought to look like this:
- Identify the modes (the “bumps”) in the latency histogram
- Triage to determine which modes we care about: consider both their performance (latency) and their prevalence
- Explain the behaviors that characterize these high-priority modes
Too often we just panic and start clicking around in hopes that we stumble upon a plausible explanation. Other times we are more disciplined, but our tools only expose bare statistics without context or relevant example transactions.
This article is meant to be about ideas (rather than a particular product), but the only real-world example I can reference is the recently released Live View functionality in LightStep [x]PM. Live View is built around an unsampled, filterable, real-time histogram representation of performance that’s tied directly to distributed tracing for root-cause analysis. To get back to our example, below is the live latency distribution corresponding to the percentile measurements above:
A real-time view of latency for a particular API call in a particular microservice. We can clearly distinguish distinct modes (the “bumps”) in the distribution; if we want to restrict our analysis to traces from the slowest mode, we filter interactively.
The histogram makes it easy to identify the distinct modes of behavior (the “bumps” in the histogram) and to triage them. In this situation, we care most about the high-latency outliers on the right side. Compare this data with the simple statistics from “Phase 1” and “Phase 2” where the modes are indecipherable.
Having identified and triaged the modes in our latency distribution, we now need to explain the concerning high-latency behavior. Since [x]PM has access to all (unsampled) trace data, we can isolate and zoom in on any feature regardless of its size. We filter interactively to hone in on an explanation: first by restricting to a narrow latency band, and then further by adding key:value tag restrictions. Here we see how the live latency distribution varies from one project_id to the next (project_id being a high-cardinality tag for this dataset):
Given 100% of the (unsampled) data, we can isolate and zoom in on any feature, no matter how small. Here the user restricts the analysis to project_id 22, then project_id 36 (which have completely different performance characteristics). The same can be done for any other tag, even those with high cardinality: experiment ids, release ids, and so on.
Here we are surprised to learn that project_id 36 experienced consistently slower performance than the aggregate. Again: Why? We restrict our view to project_id=36, filter to examine the latency outliers, and open a trace. Since [x]PM can assemble these traces retroactively, we always find an example, even for rare behavior:
To attempt end-to-end root cause analysis, we need end-to-end transaction traces. Here we filter to outliers for project_id 36, choose a trace from a few seconds ago, and realize it took 109ms to acquire a mutex lock: our smoking gun.
The (rare) trace we isolated shows us the smoking gun: that contention around mutex acquisition dominates the critical path (and explains why this particular project — with its own highly-contended mutex — has inferior performance relative to others). Again, compare against a bare percentile: simply measuring p99 latency is a far cry from effective performance analysis.
Stepping back and looking forward…
As practitioners, we must recognize that countless disconnected timeseries statistics are not enough to explain the behavior of modern applications. While p99 latency can still be a useful statistic, the complexity of today’s microservice architectures warrants a richer and more flexible approach. Our tools must identify, triage, and explain latency issues, even as organizations adopt microservices.
If you made it this far, I hope you’ve learned some new ways to think about latency measurements and how they play a part in diagnostic workflows. LightStep continues to invest heavily in this area: to that end, please share your stories and points of view in the comment section, or reach out to me directly (Twitter, Medium, LinkedIn), either to provide feedback or to nudge us in a particular direction. I love to nerd out along these lines and welcome outside perspectives.
Want to work on this with me and my colleagues? It’s fun! LightStep is hiring.
Want to make your own complex software more comprehensible? We can show you exactly how LightStep [x]PM works.