How Deep Systems Broke Observability — and What We Can Do About It
by Ben Sigelman
"Responsibility without control."
It’s the textbook definition of stress. And for the human beings who build, maintain, and operate microservices at scale and in production, it is relentless and often overwhelming. The stress takes on many forms — if you work with microservices, you’ve surely heard it firsthand:
- “Don’t deploy on Fridays”
- “I’m too busy to update the runbook”
- “Where’s Chris?! I’m dealing with a P0 and they’re the only one who knows how to debug this.”
- “We have too many dashboards”
- “What’s an SLO?”
- “I don’t know what this graph means but it looks like it might be correlated”
So why is it so rough out there? And how did it get this way?
There are two ways systems scale. They can scale wide, or they can scale deep.
Countless real-world systems scale wide: Lakes scale wide. Pizzas scale wide. Traffic jams scale wide. And in software, MapReduces and memcache pools scale wide. When things scale wide, you “just add more of them”: more water, more dough, more cars, more processes.
But some systems scale deep: Cities scale deep. Brains scale deep. And when things scale deep, they don’t just get bigger, they get different. Paris is nothing like a very large village. The brain in your pet goldfish is nothing like the brain in your head.
And when microservice architectures scale, they scale deep. I first managed to deploy Dapper across all of Google’s production systems back in 2005. I found some websearch “cache miss” traces and started expanding them, all the way from the top to the bottom. From the ingress frontend that handled the query down through the depths of the serving system, there were more than 20 layers of microservices! The complexity was mesmerizing, though also somewhat terrifying; who could possibly understand a system that deep? With conventional tools at the time, the answer was simple: nobody.
But are non-Google, non-Facebook, non-planet-scale companies dealing with deep systems, too? At Lightstep we have insight into this, as our customers are somewhere along the journey towards a microservices architecture, often in conjunction with a monolith of some sort. Here’s what their systems look like (with service names blurred out for confidentiality reasons):
Service diagrams (and “excerpts” from service diagrams that are too big for a single screenshot) from typical Lightstep customers. All service names have been blurred out for confidentiality reasons. The smallest architectures have a depth of 3 service layers, and the largest have depths of 20+ service layers.
The pattern that emerges is clear: microservice architectures scale deep, not wide. And once there are four layers of microservices, it’s a deep system – and old-style observability starts to fall apart.
As we mentioned at the outset, stress can be defined as “responsibility without control.” For the human beings who manage the individual services within a deep system, what is their scope of responsibility? And what can they actually control?
The relationship between control and responsibility for service-owners in deep systems. The dots represent distinct services, with the nested triangles representing the nested “scopes of responsibilities” for the people who manage those services.
Conceptually speaking, each service sits at the top of a triangle that contains that service’s downstream dependencies, and each of those dependencies sits at the top of a smaller triangle, and so on and so forth. Of course, in practice both the depth and breadth of these triangles can be much larger – just look at the diagrams of real-world microservice architectures above.
For the human beings who maintain a service, the scope of control is really “just that service.” This is by design: the whole point of microservices is that individual teams can deploy and operate their own services, all without interference or artificial barriers involving other teams. Put another way, overlapping control was what made monoliths such a disaster for developer productivity and release velocity; hence the very narrow scope of control for microservice owners.
The scope of responsibility, however, is everything “in the triangle” beneath the given service. Service-owners are responsible for their service’s latency and error SLOs (more on SLOs in this post and this forthcoming book), but their service can only be as fast and reliable as its slowest, least-stable dependency. To make matters worse, the scope of responsibility grows geometrically relative to the depth of the dependencies triangle.
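To make the geometric growth concrete, here's a minimal sketch. The `fanout` parameter is a hypothetical branching factor (how many downstream dependencies each service calls) – the article doesn't specify one, so the numbers are illustrative only:

```python
# Sketch: size of a service-owner's "responsibility triangle,"
# assuming (hypothetically) each service calls `fanout` downstream services.

def responsibility_scope(depth: int, fanout: int = 2) -> int:
    """Total services a top-level owner is responsible for: their own
    service plus every transitive dependency in the triangle beneath it."""
    # A full dependency tree of the given depth: 1 + f + f^2 + ... + f^(depth-1)
    return sum(fanout ** level for level in range(depth))

# The scope of control is always exactly 1 service, no matter the depth.
for depth in (3, 5, 10, 20):
    print(depth, responsibility_scope(depth))
```

Even with a modest fanout of 2, a 20-layer system puts over a million services inside the apex owner's scope of responsibility – while their scope of control stays fixed at one.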
So if stress is “responsibility without control,” deep systems are a recipe for, well, lots of stress. As systems grow deeper, the scope of a service-owner’s responsibility dwarfs their scope of control.
No wonder it’s so hard out there.
What is observability, really? The term was introduced in the 1960s in the context of Control Theory. Conceptually, it’s simple: observability is a measure of how well one can understand the internal state of the system (i.e., “what’s actually happening in production”) given only the outputs of the system (i.e., the telemetry you gather from your system). Here’s an illustration:
For microservices, the conventional wisdom is that there are “three pillars of observability,” namely metrics, logs, and tracing. The argument (read: “the dogma”) goes something like this:
- Observing microservices is hard
- Google, Facebook, and Twitter solved this already (PS: they didn’t)
- They used metrics, logging, and distributed tracing…
- … “So we should, too.”
I’ve long argued (on Twitter and at conferences) that the three pillars are really “three pipes,” and that the conventional wisdom makes no sense – unless you’re a vendor who sells three SKUs that map onto the three pillars, in which case it’s high margins and money in the bank, though end users are no better off for it.
Make no mistake: metrics, logs, and traces are all vital and important. But they are “the telemetry,” not “the observability.” And traces are not peers to metrics and logs: traces must form the backbone of observability in deep systems. We’ll get back to that in a bit.
As far as metrics, logs, and traces are concerned, bet on OpenTelemetry. At Lightstep, we did a lot of technical and bridge-building work across our ecosystem to make the OpenTelemetry project, which merges OpenTracing and OpenCensus, a reality. We want to move all telemetry out into the open as a portable commodity, available by default and decoupled from any particular vendor. Unlike OpenTracing and OpenCensus, OpenTelemetry won’t even require manual source-code modifications: automatic instrumentation can still be portable instrumentation.
But how do we take advantage of high-quality, ubiquitous telemetry? Certainly not by deploying three parallel products in three parallel browser tabs. If we go back to the “triangle of responsibility,” conventional logging and metrics solutions are a disaster for deep systems. Not only does the raw dollar cost grow geometrically with the size of the triangle, but the sheer cognitive overhead of hunting through dashboards and logs does, too – and that is untenable.
Schematically, it looks like this:
If you are responsible for the service at the apex of the triangle, the cognitive overhead of conventional metrics and logs is proportional to the depth of your system, squared. I’ve heard many customers talk about “the bad old days” when they would spend days, weeks, or even months searching in vain for an explanation for real, SLO-violating problems affecting their service. This is a really big deal! Guess-and-check searches through metrics, dashboards, and logs are completely unsustainable in deep systems, and yet they remain standard practice in many organizations.
But what about traces?
Taken individually, traces can certainly be useful. But after four years of development and innovation at Lightstep, we find that the true value of tracing appears in the aggregate: these aggregates tell valuable stories themselves (e.g., our correlations feature), but they also allow us to use an SLO at “the apex of the triangle” to filter and rank the metrics and logs streaming in across the entire dependency tree.
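Here's a minimal sketch of that idea, using a hypothetical data model (flat tuples of trace ID, service name, and duration; a made-up apex service and SLO) rather than any real tracing API: start from apex requests that violate the SLO, then aggregate and rank the downstream services that appear in those slow traces.

```python
# Sketch (hypothetical data model): using trace aggregates to rank
# downstream services by their latency in SLO-violating apex requests.
from collections import defaultdict

# Each span: (trace_id, service, duration_ms). "frontend" is the apex.
spans = [
    ("t1", "frontend", 620), ("t1", "auth", 40), ("t1", "db", 480),
    ("t2", "frontend", 310), ("t2", "auth", 35), ("t2", "db", 200),
    ("t3", "frontend", 700), ("t3", "auth", 45), ("t3", "db", 590),
]
SLO_MS = 500  # hypothetical latency SLO for the apex service

# 1. Find the traces where the apex service violated its SLO.
slow = {tid for tid, svc, ms in spans if svc == "frontend" and ms > SLO_MS}

# 2. Aggregate downstream latencies only within those slow traces.
contribution = defaultdict(list)
for tid, svc, ms in spans:
    if tid in slow and svc != "frontend":
        contribution[svc].append(ms)

# 3. Rank dependencies by average latency in the slow traces.
ranked = sorted(contribution.items(),
                key=lambda kv: -sum(kv[1]) / len(kv[1]))
for svc, durations in ranked:
    print(svc, sum(durations) / len(durations))
```

The point isn't the toy ranking itself: it's that the trace context lets you interrogate only the telemetry that's causally connected to the SLO violation, instead of searching everything in the triangle.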
This filtering and ranking reduces the cognitive overhead to “the height of the triangle,” not the area. We’ve taken O(depth²) and made it O(depth).
In deep systems, tracing-first observability can turn O(depth²) problems into O(depth) problems – whether they be human problems or dollar problems. And this is why tracing must be the backbone of unified observability, and why Lightstep has built its product strategy around this underlying thesis about deep systems.
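Under simplified, illustrative assumptions (one dashboard or log stream per service, and a triangular layout where layer n holds n services), the O(depth²)-versus-O(depth) difference is easy to quantify:

```python
# Sketch: cognitive overhead of conventional tools (the triangle's area,
# ~depth^2) vs. tracing-first observability (the triangle's height, ~depth).

def conventional_overhead(depth: int) -> int:
    # Guess-and-check across one dashboard/log stream per service;
    # a triangle of `depth` layers holds 1 + 2 + ... + depth services.
    return depth * (depth + 1) // 2

def tracing_overhead(depth: int) -> int:
    # Following one critical-path trace touches each layer once.
    return depth

for depth in (3, 10, 20):
    print(depth, conventional_overhead(depth), tracing_overhead(depth))
```

At a depth of 3 the gap is small; at a depth of 20, it's the difference between scanning hundreds of telemetry sources and walking a single twenty-hop path.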
When microservices proliferate, ordinary systems become deep systems. In deep systems, the scope of responsibility for service owners grows geometrically without any change to the scope of control. This induces stress and dysfunction.
Deep systems break conventional observability: the dollar cost of metrics and logs explodes, and at the same time the metrics and logs themselves become less useful. Service owners cannot possibly comprehend the firehose of metrics and logs streaming out of their entire dependency tree, and yet, somehow, they must.
If we escape the idea of tracing as merely “a third pillar,” tracing can solve this problem – but not by sprinkling individual traces on top of metrics and logging products. Again, tracing must form the backbone of unified observability: only the context found in trace aggregates can address the sprawling, many-layered complexity that deep systems introduce.
Simple observability in deep systems realigns “responsibility” and “control” for service owners, empowering them to act with confidence, independence, and — last but certainly not least — less stress.