If you're migrating to a microservices architecture, using serverless functions, or running a service mesh, it's likely that your system is not merely distributed — it's deep.
Deep systems are architectures with at least four layers of stacked, independently operated services, including cloud or SaaS dependencies.
If you have a deep system, you've probably heard things like this around the office (or on Slack):
- "Don't deploy on Fridays"
- "I'm too busy to update the runbook"
- "Where's Chris?! I'm dealing with a P0, and they're the only one who knows how to debug this."
- "We have too many dashboards"
- "What's an SLO?"
- "I don't know what this graph means, but it looks like it might be correlated"
Deep systems, by their very nature, are notoriously difficult to understand and operate. Conventional tools can neither track nor analyze the requests that pass through these independently managed layers, and yet without that context it's impossible for operators to understand the interdependencies that drive application behavior.
The result: a lack of confidence in the production system, difficult deploys, and long resolution times for critical performance issues.
Organizations move to microservices so that developers can deploy faster, without having to plan or coordinate their activities with other teams.
But if microservices were supposed to be about release velocity, why has "pushing new code" become such a stressful experience?
Because developers in deep systems are responsible for far more than they can control.
If you own a service, you're focused on its health — latency, error rates, throughput, etc. — but that service can only be as fast and reliable as its slowest, most error-prone dependency. Furthermore, any of these dependencies can kick off a deploy or config push at any time, so a service-owner in a deep system is ultimately responsible for the side-effects of countless unscheduled change events.
So, as systems scale and more services and layers are added, the gap between what you can control and what you are responsible for keeps growing. You control only your own service, but the scope and complexity of the services you are implicitly responsible for increase with each new service and dependency deployed to production.
Observability should enable operators to understand the "what" and the "why" of their production software — the inner workings — without recompiling, reconfiguring, or redeploying any component of it.
In a deep system, the purpose of observability is to close the gap between responsibility and control, and ultimately, give developers the confidence to ship code faster.
But when trying to solve problems in deep systems — when requests cross boundaries between layers and teams — conventional observability falls apart. The amount of time, cognitive load, and tribal knowledge required to search for the right dashboards, grep through logs, or try to figure out who owns a particular service is simply not scalable.
So, how should we think about observability in deep systems?
Observability is often (mis)understood in terms of "metrics, logs, and tracing": aka, the "three pillars of observability." In this framing, metrics, logs, and tracing are presented as distinct, loosely-coupled products or tools. Treating traces, metrics, and logs as separate capabilities means that you will have (at least) three tabs open during releases, incidents, and performance investigations. You lose context as you switch from one to the other (and back again, and again).
Without context, deep systems are a recipe for catastrophic on-call shifts, performance mysteries, unexplained regressions, inter-team finger-pointing, and an overarching lack of confidence that decelerates feature velocity and, ultimately, innovation.
Observability should be structured around use cases that create a better experience for customers.
In a deep system, use cases are best managed through service level objectives (SLOs), which pair service level indicators (SLIs) — typically latency, throughput, or error rate — with a target, such as 99.9% uptime, p99 latency < 1s, etc.
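To make the SLI/SLO pairing concrete, here is a minimal sketch of checking an availability SLI against a 99.9% target and measuring how much error budget has been consumed. The function name and inputs are illustrative, not part of any vendor's API:

```python
# Illustrative sketch: evaluate an availability SLI against an SLO target.
# The function and its arguments are hypothetical, not a vendor API.

def evaluate_slo(total_requests: int, failed_requests: int, target: float):
    """Return the measured SLI and the fraction of error budget consumed."""
    sli = (total_requests - failed_requests) / total_requests  # success ratio
    error_budget = 1.0 - target              # e.g., 0.1% for a 99.9% target
    budget_used = (failed_requests / total_requests) / error_budget
    return sli, budget_used

# A service with a 99.9% availability SLO that failed 30 of 100,000 requests:
sli, budget_used = evaluate_slo(100_000, 30, target=0.999)
print(f"SLI: {sli:.4%}, error budget consumed: {budget_used:.0%}")
# → SLI: 99.9700%, error budget consumed: 30%
```

The same shape works for a latency SLI: count requests slower than the p99 threshold as "failed" and reuse the budget arithmetic unchanged.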
There are three critical use cases for observability in deep systems:
- Deploying new service versions (i.e., innovating)
- Reducing MTTR (i.e., enforcing SLOs)
- Improving steady-state performance (i.e., improving SLOs)
These map to a better experience — new functionality, less downtime, and faster, more responsive products — and provide a framework for shipping code faster.
But in a deep system, maintaining SLOs is particularly challenging. Not only must we understand the services we can control, we must also understand the relationship between service layers and, in effect, the entire system itself.
The difficult problems in deep systems typically involve interactions between multiple services communicating across multiple independently-managed levels of the distributed stack.
Traces are the only type of telemetry that model these multi-service, multi-layer dependencies — this is why tracing must form the backbone of observability in deep systems.
Rather than three related but disconnected pipes of data, tracing enables an integrated workflow across telemetry types, surfacing only the right metrics and the right logs for any given scenario.
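Concretely, a trace is a tree of spans joined by parent references, which is what lets a single request reconstruct its path through every layer it touched. A minimal sketch in Python — the `Span` type, its fields, and the service names are illustrative, not any particular tracer's data model:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    """One timed operation in one service (fields are illustrative)."""
    span_id: str
    service: str
    operation: str
    duration_ms: float
    parent_id: Optional[str] = None          # None marks the trace root
    tags: dict = field(default_factory=dict)  # e.g., {"version": "v42"}

# One request that crossed several independently operated layers:
trace = [
    Span("1", "frontend",  "GET /checkout", 480.0),
    Span("2", "cart",      "get_cart",      130.0, parent_id="1"),
    Span("3", "pricing",   "quote",         310.0, parent_id="1"),
    Span("4", "service-z", "lookup",        290.0, parent_id="3",
         tags={"version": "v42"}),
]

def path_to_root(trace, span_id):
    """Walk parent pointers to show which layers a slow span sits beneath."""
    spans = {s.span_id: s for s in trace}
    path, node = [], spans[span_id]
    while node is not None:
        path.append(node.service)
        node = spans.get(node.parent_id) if node.parent_id else None
    return list(reversed(path))

print(path_to_root(trace, "4"))  # → ['frontend', 'pricing', 'service-z']
```

No metric or log line can reproduce that `path_to_root` result on its own — the parent links are exactly the multi-layer dependency information that only traces carry.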
For instance, let's say you build, monitor, and maintain Service A. If Service A depends on Service Z — perhaps through several intermediaries — and Service Z pushes a bad release, that will very likely affect the performance and reliability of Service A and everything in between.
The right approach is to build a model of the application from the perspective of Service A, and to take snapshots of that model before, during, and after events such as Service Z's hypothetical bad release above.
By assembling thousands of traces in each snapshot, an observability solution can find extremely strong statistical evidence that the regression in Service A's behavior is due to the change in the version tag in Service Z, and can correlate the negative change to other metrics and logs in Service Z — both before and during the bad release.
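A sketch of that aggregate analysis, assuming each trace snapshot records Service A's end-to-end latency alongside the version tag observed on Service Z. The data, field names, and grouping helper are all made up for illustration:

```python
from statistics import mean

# Hypothetical snapshots: Service A's latency (ms) plus the Service Z
# version tag observed in each trace. Values are invented for illustration.
snapshots = [
    {"latency_ms": 110, "z_version": "v41"},
    {"latency_ms": 95,  "z_version": "v41"},
    {"latency_ms": 105, "z_version": "v41"},
    {"latency_ms": 640, "z_version": "v42"},
    {"latency_ms": 710, "z_version": "v42"},
    {"latency_ms": 675, "z_version": "v42"},
]

def latency_by_tag(traces, tag):
    """Group trace latencies by a tag's value and average each group."""
    groups = {}
    for t in traces:
        groups.setdefault(t[tag], []).append(t["latency_ms"])
    return {value: mean(vals) for value, vals in groups.items()}

by_version = latency_by_tag(snapshots, "z_version")
print(by_version)  # v41 averages ~103 ms; v42 averages 675 ms
```

With thousands of traces per snapshot instead of six, the same grouping yields the statistical evidence described above: the regression in Service A lines up with the `v42` tag on Service Z, several layers away.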
Traces, in aggregate, prevent developers and operators from having to consider most of the system most of the time.
Gather your traces, metrics, and logs using portable, high-performance, vendor-neutral instrumentation. This should take less than 10 minutes with LightStep, which integrates immediately with OpenTracing, OpenCensus, Jaeger, and Zipkin (and soon OpenTelemetry).
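Under the hood, vendor-neutral instrumentation boils down to attaching a trace context to every outgoing request so the next layer down can join the same trace. A rough sketch using the W3C Trace Context `traceparent` header format — the helper functions themselves are hypothetical, and real tracers such as Jaeger or OpenTelemetry handle this propagation automatically:

```python
import uuid

# Hedged sketch of context propagation: mint a trace context, then encode
# it into outgoing HTTP headers so downstream services can continue the
# trace. Header layout follows the W3C Trace Context "traceparent" format:
# version-traceid-parentid-flags.

def start_trace():
    """Mint a new context: a 128-bit trace ID and a 64-bit span ID (hex)."""
    return {"trace_id": uuid.uuid4().hex, "span_id": uuid.uuid4().hex[:16]}

def inject(context, headers):
    """Encode the trace context into outgoing HTTP headers."""
    headers["traceparent"] = f"00-{context['trace_id']}-{context['span_id']}-01"
    return headers

ctx = start_trace()
headers = inject(ctx, {"content-type": "application/json"})
print(headers["traceparent"])
```

A downstream service would parse `traceparent`, adopt the trace ID, and record its own spans under it — which is how a single request stays stitched together across every layer of a deep system.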
LightStep analyzes 100% of the requests flowing through your service, and automatically surfaces correlations that best explain errors and latency across your entire system.
Since canaries and new versions are automatically detected, there's no need for special integrations into your CI/CD system. Integrate LightStep into your neighbor services and detect their releases just as easily.
Complete visibility into your system at any moment: before, during, or after every release. See how each deploy or event impacts SLOs for all of your service's endpoints, and track and pinpoint performance issues across any number of services.