Making Your System Observable from the Outside In
The chants coming from conferences, newsletters, and books are relentless: “Observability, observability, observability.” You’ve heard it. I’ve said it. Now that I work on an observable system, I never want to be without one again. Observable systems are easier to understand, easier to control, and easier to fix. When you build a new system, how you’ll observe it should be a key design consideration. But all software developers and operators work every day with systems we didn’t build, whose design we didn’t influence. How do we make those observable?
Principles of Observability
Observable systems have some key characteristics:
Their Service-Level Indicators are defined. Service-Level Indicators (SLIs) are the characteristics of a system that really matter to customers or to the systems that depend on you. Availability, latency, throughput, and error rate are common (and excellent) SLIs.
They emit Telemetry Data. Telemetry is the set of granular measurements that let you both measure SLIs and explain their values. Logs, events, spans, metrics - whatever format works for you - need to be emitted by your components and collected for analysis and monitoring in a cost-effective way.
You can analyze them in real-time. “Observable” means you need to be able to find out what’s going on now. When debugging a problem, or fighting an outage, information from ten minutes ago just isn’t going to cut it.
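To make the SLI idea concrete, here is a minimal sketch of turning raw request records into the four SLIs named above. The `Request` record and the `compute_slis` function are names I made up for illustration, and the 95th percentile is just one reasonable choice of latency summary; your own definitions may differ.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # how long the request took to serve
    ok: bool            # True if the caller got a successful response


def compute_slis(requests, window_seconds):
    """Summarize one time window of requests into a few common SLIs."""
    total = len(requests)
    if total == 0:
        # No traffic: ratios are undefined rather than zero.
        return {"availability": None, "error_rate": None,
                "p95_latency_ms": None, "throughput_rps": 0.0}

    errors = sum(1 for r in requests if not r.ok)
    durations = sorted(r.duration_ms for r in requests)

    # Nearest-rank 95th percentile, using integer math to avoid float rounding:
    # rank = ceil(0.95 * total)
    rank = (95 * total + 99) // 100
    p95 = durations[rank - 1]

    return {
        "availability": (total - errors) / total,
        "error_rate": errors / total,
        "p95_latency_ms": p95,
        "throughput_rps": total / window_seconds,
    }


# Hypothetical data: 20 requests in a 10-second window, one of them failed.
sample = [Request(duration_ms=10.0 * i, ok=(i != 20)) for i in range(1, 21)]
slis = compute_slis(sample, window_seconds=10)
```

Every input here is something you can observe at the boundary of the service, which is what makes these numbers good candidates for SLIs.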
Help! I’ve got an old program that I can’t change
That’s alright! There are a lot of good options for getting from zero to Observable.
First, define your SLIs. You’ll notice that they all face “up and out” from your system - they’re focused on the interface that you provide to customers. This provides a straightforward approach to adding telemetry data. Let’s illustrate with a “simple” dependency chain:
    1        2           3
    A ---|
    B ---|---xCx-------- D
               |x------- E
If you’re working on service C, you’ll want to implement some telemetry emission at each of the x’s on the chart. That will cover the calls from systems that depend on C, and calls to the systems that C depends on. After that, you might want to hop into system D’s Slack channel and gently (or not so gently) nudge them to do the same!
Next, decide what telemetry to emit. I would make a strong bet on the OpenTelemetry project’s standards and practices - you’ll have a telemetry solution that supports many different data types, and it’s supported by both vendor-supplied solutions and open-source projects. You should consider all the following types of telemetry data:
- Counters for API calls, API errors, and other items that are useful to analyze in the aggregate.
- Spans that cover the latency, logs, and tags for a particular transaction. They can then be tied up with the other spans representing that transaction into a distributed trace.
- Logs to cover security compliance, audit trails and the like.
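Here is a rough sketch of how those three kinds of telemetry can sit side by side in one request path. To keep it self-contained it uses only the Python standard library and an in-memory store; `count`, `span`, `get_order`, and the tag names are all hypothetical stand-ins, not the OpenTelemetry API. In a real system you would swap them for your telemetry SDK’s counters and tracer.

```python
import logging
import time
from collections import Counter
from contextlib import contextmanager

counters = Counter()     # aggregate counts, keyed by (name, sorted tags)
finished_spans = []      # completed spans, kept in memory for this sketch
audit_log = logging.getLogger("audit")


def count(name, **tags):
    """Increment a counter identified by its name and tags."""
    counters[(name, tuple(sorted(tags.items())))] += 1


@contextmanager
def span(name, trace_id, **tags):
    """Time one unit of work and record it with its tags and trace id."""
    record = {"name": name, "trace_id": trace_id, "tags": dict(tags), "error": False}
    start = time.perf_counter()
    try:
        yield record
    except Exception:
        record["error"] = True
        raise
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        finished_spans.append(record)


def get_order(order_id, trace_id, user):
    """Hypothetical API handler that emits all three telemetry types."""
    with span("GET /orders", trace_id, order_id=order_id) as current:
        count("api.calls", route="/orders")
        # Audit trail: who touched what, tied back to the trace.
        audit_log.info("user=%s read order=%s trace=%s", user, order_id, trace_id)
        if order_id < 0:
            count("api.errors", route="/orders")
            raise ValueError("invalid order id")
        current["tags"]["cache"] = "miss"
        return {"id": order_id}
```

The shared `trace_id` is what lets spans from different services be stitched into one distributed trace later.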
Now, implement the telemetry. This is where things can get hairy for existing systems. It’s also where we can make incremental progress. One straightforward approach to telemetry is to implement an “onion-skin” wrapper around the service. Duplicate the API surface, emit all telemetry in this onion-skin, then forward the request directly to the application. You can do the same with outgoing calls to other services — intercept outgoing calls, emit telemetry data, and then forward to the downstream service.
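As a minimal in-process sketch of the onion-skin idea, the decorator below wraps a callable, emits one telemetry event per call, and forwards the call unchanged. The names `onion_skin`, `legacy_handle`, and `legacy_call_d` are made up for illustration, and `events.append` stands in for whatever exporter you actually use. It also assumes you can repoint the existing code at the wrapped downstream call; if you can’t, the same wrapping has to happen outside the process.

```python
import functools
import time


def onion_skin(name, emit):
    """Wrap a callable so every call emits one event, then forwards as-is."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            outcome = "ok"
            try:
                return fn(*args, **kwargs)   # forward untouched
            except Exception:
                outcome = "error"
                raise                        # callers still see the original error
            finally:
                emit({
                    "name": name,
                    "outcome": outcome,
                    "duration_ms": (time.perf_counter() - start) * 1000,
                })
        return wrapper
    return decorate


events = []  # stand-in for a real telemetry exporter


def legacy_call_d(order_id):
    """Stand-in for the existing client that calls downstream service D."""
    if order_id < 0:
        raise ValueError("invalid order id")
    return {"order_id": order_id, "status": "shipped"}


# Outgoing skin: intercept calls from C to D.
call_d = onion_skin("outbound:D", events.append)(legacy_call_d)


def legacy_handle(order_id):
    """Stand-in for the existing request handler in service C."""
    return call_d(order_id)


# Incoming skin: same API surface as the original handler.
handle = onion_skin("inbound:C", events.append)(legacy_handle)
```

Calling `handle(1)` returns exactly what the legacy code returns, and leaves two events behind: one for the outbound call to D and one for the inbound request to C.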
Using a service proxy like Envoy is one of the simplest methods to implement the onion-skin approach. There are zero code changes required, but there are some added operational concerns.
When adding telemetry to a system, it’s tempting to add instrumentation everywhere. The more you know, the more you can do, right? Not quite - you can easily pile up data that’s expensive to send, expensive to keep, and not very useful for analysis. As Andy Grove says in High Output Management, you need to focus your measurements on vital, measurable indicators.
This is why we define SLIs in the first place. Add your telemetry only in places where you can measure your SLIs. Then, move on to measuring the SLIs of systems that you call. As your ability to observe and control your system spreads across services, you’ll have views into the things that matter, without drowning in the lake of data.
In a large system, it’s often best to start this process at the very outside of the whole system. Measure API performance first, then work your way in toward the core of your data storage systems.
OK, we have the data. Now what?
Telemetry is only the first step towards Observability. For your system to be observable, you need to know what’s going on now, whether that’s different in an important way, and why it’s different. You can analyze this data in-house with systems like Prometheus, Grafana, and Jaeger, but for deep, insightful analysis of complex systems, automatic analysis is extremely useful.
Modern observability tools can automatically identify a number of issues and their causes:
- Failures caused by routine changes. Deployments are the most common source of regressions. Other routine configuration changes are also a big source.
- Regressions that only affect specific customers. Problems that affect subsets of users are often “lost in the noise” of a large system. Stronger statistics can still identify these issues.
- Downstream errors causing errors at the API boundary. Errors in APIs are often correlated with errors in some other system. Unless you’re a dashboard expert, finding these connections is a slow, frustrating process. Powerful tools can show you these connections immediately.
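To give a feel for the customer-specific case, here is a deliberately simple sketch: it compares each customer’s error rate with the rate for everyone else and flags the ones that stand out by a z-score. This is my own illustration, not how any particular tool works; the function name, the thresholds, and the `(customer_id, ok)` input shape are all assumptions, and real tools use stronger statistics than this.

```python
import math
from collections import defaultdict


def customers_with_elevated_errors(requests, z_threshold=3.0, min_requests=30):
    """Flag customers whose error rate is far above the rest of the traffic.

    `requests` is an iterable of (customer_id, ok) pairs.
    Returns (customer_id, error_rate, z_score) tuples, worst first.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for customer, ok in requests:
        totals[customer] += 1
        if not ok:
            errors[customer] += 1

    all_total = sum(totals.values())
    all_errors = sum(errors.values())

    flagged = []
    for customer, n in totals.items():
        rest_n = all_total - n
        if n < min_requests or rest_n == 0:
            continue  # too little data to say anything useful
        rate = errors[customer] / n
        rest_rate = (all_errors - errors[customer]) / rest_n
        # Standard error of this customer's rate if it matched everyone else's.
        se = math.sqrt(rest_rate * (1 - rest_rate) / n)
        if se == 0:
            z = math.inf if rate > rest_rate else 0.0
        else:
            z = (rate - rest_rate) / se
        if z >= z_threshold:
            flagged.append((customer, rate, z))
    return sorted(flagged, key=lambda item: item[2], reverse=True)


# Hypothetical traffic: "small" fails 20% of the time but is only ~5% of requests.
traffic = (
    [("big", i >= 10) for i in range(1000)]    # 10 errors in 1000 requests
    + [("small", i >= 10) for i in range(50)]  # 10 errors in 50 requests
    + [("tiny", False) for _ in range(5)]      # too few requests to judge
)
flagged = customers_with_elevated_errors(traffic)
```

In this made-up traffic the overall error rate is only about 2.4%, so the "small" customer’s 20% failure rate would be easy to miss on an aggregate dashboard, yet it stands out clearly once you slice by customer.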
To Sum It Up
In existing systems, you can go from zero to Observable incrementally, accruing the benefits as you go. By starting at the boundaries of your services and expanding to other services’ boundaries, you’ll get the most meaningful information quickly without getting lost in the weeds of instrumentation.