What is observability and why should you care?
by Ted Young
I’ve been asked several times about the difference between the terms ‘observability’ and ‘monitoring’, and why someone might prefer one approach vs. the other.
Since this is a bit of a false dichotomy, and the (many) articles written on this subject appear to be rather vague, I wanted to write up my thoughts and add some clarity.
Observability and monitoring do not have specific technical meanings; instead, they delineate different eras in history. Additionally, corporate marketing has naturally muddied the waters around their definitions. But don’t worry about this. Instead, familiarize yourself with the new technology and techniques that the term ‘observability’ has become associated with: a movement to contextualize and standardize data as a way to improve quality.
The OpenTelemetry project has become the focal point for making these improvements a reality.
The observability movement is about saving time – saving so much time that it changes how you approach the problem.
If you’ve been operating systems for a while, and fought a number of fires, ask yourself: how much time, on average, do you spend collecting, correlating, and cleaning data before you can analyze it? Do you choose to limit your investigations due to the effort in collecting data? Has it gotten worse as your system has grown to include more instances of every component and have more components involved in every transaction? These are the issues that have pushed distributed tracing and a unified approach to observability into the limelight.
In the end, it always comes down to this:
- Receive an alert that something changed;
- Look at a graph that went squiggly;
- Scroll up and down looking for other graphs that went squiggly at the same time – possibly putting a ruler or piece of paper up to the screen to help;
- Make a guess;
- Dig through a huge pile of logs, hoping to confirm your guess;
- Make another guess.
Now, there’s nothing fundamentally wrong with this basic alert-read-eval-resolve routine. But the way we have traditionally instrumented our systems makes it much harder – and slower – than it needs to be.
Logs describe events but have virtually no structure; metrics aggregate events but have no explicit relationship with the logs that describe them. This lack of coherence makes investigating and responding to incidents extremely labor-intensive. The observability revolution of the past five years has been focused on reducing the effort needed to investigate each hypothesis, increasing the speed at which data can be correlated, and opening the door to automated analysis.
Here’s a rundown of the real changes.
The biggest change has been a shift from traditional logging to distributed tracing. Here’s a simple definition: if you have one log you are interested in, you can automatically find every related log from the client all the way down to the database.
How does it work? In tracing, every event (a.k.a. a log) is stamped with two identifiers: a 128-bit trace ID, which represents the entire transaction, and a 64-bit span ID, which represents the operation within that transaction.
With these two IDs, you can now easily find all of the relevant logs. Got an exception? Look at its log, do a search by its trace ID, and see all the related logs which led to the exception. Boom. Done. That’s it! By the addition of these two IDs, logs have been transformed into something far more powerful – a graph.
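To make that concrete, here is a minimal sketch in plain Python of stamping events with the two IDs and then searching by trace ID. The names (`Event`, `related_events`, `log_stream`) are made up for illustration and are not part of any tracing library; a real system would propagate the IDs between services rather than generate them in one place.

```python
import secrets
from dataclasses import dataclass


def new_trace_id() -> str:
    """Return a random 128-bit trace ID as 32 hex characters."""
    return secrets.token_hex(16)


def new_span_id() -> str:
    """Return a random 64-bit span ID as 16 hex characters."""
    return secrets.token_hex(8)


@dataclass
class Event:
    trace_id: str  # identifies the whole transaction
    span_id: str   # identifies the operation that emitted the event
    message: str


def related_events(events: list[Event], trace_id: str) -> list[Event]:
    """Return every event that belongs to the given transaction."""
    return [e for e in events if e.trace_id == trace_id]


# Two unrelated transactions writing to the same log stream.
checkout, search = new_trace_id(), new_trace_id()
client_span, db_span = new_span_id(), new_span_id()
log_stream = [
    Event(checkout, client_span, "POST /checkout"),
    Event(search, new_span_id(), "GET /search"),
    Event(checkout, db_span, "INSERT INTO orders"),
    Event(checkout, db_span, "exception: deadlock detected"),
]

# Start from the exception and pull everything that led up to it.
failure = log_stream[-1]
history = related_events(log_stream, failure.trace_id)
for event in history:
    print(event.span_id, event.message)
```

The search filters out the unrelated `GET /search` event without any timestamp matching or guesswork, which is the whole point.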
What if you want even more terms to query your logs by? Just add more attributes to each transaction, operation, or individual log. By having three levels of context – trace, span, and event – it is now possible to recreate what happened in your system and automate analysis of cause and effect.
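As a rough illustration of those three levels, the sketch below attaches attributes to a trace, a span, and an event, then queries across all of them at once. The classes and attribute names (`customer_tier`, `db_table`, `level`) are hypothetical and chosen only for this example; they are not OpenTelemetry's data model.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    name: str
    attributes: dict = field(default_factory=dict)


@dataclass
class Span:
    name: str
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)


@dataclass
class Trace:
    attributes: dict = field(default_factory=dict)
    spans: list = field(default_factory=list)


def find_events(traces, **wanted):
    """Yield (trace, span, event) whose combined context matches every wanted attribute."""
    for trace in traces:
        for span in trace.spans:
            for event in span.events:
                # An event inherits context from its span and its trace.
                context = {**trace.attributes, **span.attributes, **event.attributes}
                if all(context.get(k) == v for k, v in wanted.items()):
                    yield trace, span, event


traces = [
    Trace(
        attributes={"customer_tier": "enterprise"},
        spans=[
            Span(
                "db.query",
                {"db_table": "orders"},
                [Event("retry", {"attempt": 2}), Event("timeout", {"level": "error"})],
            )
        ],
    ),
    Trace(
        attributes={"customer_tier": "free"},
        spans=[Span("db.query", {"db_table": "orders"}, [Event("timeout", {"level": "error"})])],
    ),
]

# "Errors that hit enterprise customers": one attribute lives on the trace, the other on the event.
matches = list(find_events(traces, customer_tier="enterprise", level="error"))
```

Neither attribute alone is enough to answer the question; the query only works because the event carries its span's and trace's context with it.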
There are a number of other benefits, too, like automatically recording the timing of operations. Unfortunately, implementing distributed tracing is harder than it sounds.
Automated analysis tools are still limited by their understanding of the data they are looking at. When syntax is unknown, the data cannot be parsed. When the semantics are unknown, the data cannot be interpreted. A shared language for describing distributed systems has been needed for some time.
If there’s no common schema, then you end up comparing apples to oranges: one system records response codes as HTTP => 500, and another records them as http_code => 500 Internal Error. Yet another records httpStatus => 5xx. That lack of uniformity is just sand in the gears.
Solving this means agreeing on how we record common operations and standard protocols, such as HTTP requests and CPU usage. We have to decide as a community which attributes are necessary, the names of those attributes, and the format of their values. This allows analysis tools to understand the meaning of each observation and automate a lot of work that currently must be done by hand.
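Here is a small sketch of what that agreement buys you, using the three mismatched records from the example above. The target key `http.response.status_code` is my assumption of a reasonable shared name, modeled on OpenTelemetry's HTTP conventions; check the current specification for the exact attribute. The `normalize` helper is hypothetical.

```python
import re

# Assumed shared attribute name for this sketch.
STATUS_KEY = "http.response.status_code"

# Legacy keys taken from the three examples in the text.
LEGACY_KEYS = ("HTTP", "http_code", "httpStatus")


def normalize(record: dict) -> dict:
    """Return a copy of record with the status code under one key, as an int."""
    out = {k: v for k, v in record.items() if k not in LEGACY_KEYS}
    for key in LEGACY_KEYS:
        if key in record:
            match = re.match(r"\d{3}", str(record[key]))
            # A value like "5xx" has already lost the exact code, so it maps to None.
            out[STATUS_KEY] = int(match.group()) if match else None
            break
    return out


records = [
    {"HTTP": 500},
    {"http_code": "500 Internal Error"},
    {"httpStatus": "5xx"},
]
normalized = [normalize(r) for r in records]
```

Note the third record: once a system has thrown away the exact code, no amount of after-the-fact cleanup can bring it back. That is why the conventions need to be applied at the point of instrumentation rather than patched on later.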
Traditionally, the tools for tracing, logs, and metrics have been split. Each tool was seen as its own separate technique with separate communities and separate toolchains. The entire pipeline for each tool – the instrumentation, the data transmission, and the analysis – had nothing in common.
Removing this separation and creating a single stream of data, which shares context between observations, creates a platform for innovation. For example, if logs are collected as traces, and traces are associated with metrics, then alerts can automatically collect examples of problem operations when they are triggered.
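A toy version of that last idea might look like the following: a counter that remembers the trace IDs of the most recent operations it counted, so that an alert arrives with examples attached. This is a sketch under my own assumptions; `ErrorCounter` and its fields are invented for illustration and do not correspond to any particular metrics API.

```python
from collections import deque


class ErrorCounter:
    """Counts errors and remembers the trace IDs of the most recent ones."""

    def __init__(self, threshold: int, max_exemplars: int = 3):
        self.threshold = threshold
        self.count = 0
        # Bounded buffer: only the newest trace IDs are kept.
        self.exemplars = deque(maxlen=max_exemplars)

    def record(self, trace_id: str) -> None:
        """Count one error and keep the trace it happened in."""
        self.count += 1
        self.exemplars.append(trace_id)

    def check(self):
        """Return an alert with example trace IDs once the threshold is reached, else None."""
        if self.count < self.threshold:
            return None
        return {
            "alert": "error count high",
            "count": self.count,
            "example_trace_ids": list(self.exemplars),
        }


errors = ErrorCounter(threshold=3, max_exemplars=2)
for trace_id in ["aaa1", "bbb2", "ccc3"]:
    errors.record(trace_id)

alert = errors.check()
```

Whoever receives the alert can jump straight from the metric to concrete traces of failing operations, instead of starting the squiggly-graph hunt from scratch.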
To implement all of these desired changes, the industry came together to define a shared standard for observability: OpenTelemetry. And I do mean the industry. Over 200 organizations have contributed to date; OpenTelemetry is currently the most active project in the CNCF (Cloud Native Computing Foundation) after Kubernetes. Most of the major vendors (Lightstep, Splunk, Datadog, New Relic, Honeycomb, etc.) and all of the major infrastructure providers (Google, Microsoft, and Amazon) are either actively leading the project or committed to OpenTelemetry support.
OpenTelemetry encapsulates the design principles and features listed above. It combines tracing, metrics, and logs into a single system. Based on Ben Sigelman’s distributed tracing design (developed while he worked at Google), OpenTelemetry provides the missing context needed to correlate logs, metrics, and traces. It also includes a set of semantic conventions that define common operations, and a unified data protocol that combines all three types of signals into a single stream. The OpenTelemetry Collector Service takes this unified stream and converts it to various formats to support the wide variety of existing monitoring and analysis tools, such as Prometheus, Jaeger, and Lightstep.
So, what’s the current status? The tracing portion of OpenTelemetry is currently stabilizing; v1.0 releases will become available over the next several months. The metrics portion is currently experimental, but we are partnering with the Prometheus and OpenMetrics communities to ensure compatibility. (I say ‘we’ because I work on the OpenTelemetry project.)
Hopefully, that clears up some confusion about lingo and progress in the world of observability. We’re still monitoring as we always have, only faster and more effectively thanks to shared context and standardization.