What is Observability?
by Katy Farmer
There are some excellent resources on observability written by experts, and someday, I will read (and understand) them. What I needed instead was a foundational knowledge that I could grow over time and add nuance to when the time was right. This is how I learned to code, play the trumpet, do car repairs and pretty much everything else I know how to do.
This article will be a foundation you can build on to understand observability (often abbreviated as o11y). If you’re like me, you have a lot of questions about observability, and they all begin with one very important question: what exactly is observability?
As a philosophy, observability is our ability as developers to know and discover what is going on in our systems. In practice, it means adding telemetry to our systems in order to measure change and track workflows.
Imagine we’re baking cookies. We mix up the cookie dough while we jam out to our favorite ‘80s pop station, scoop 12 cookies out onto a baking sheet, set a timer, and put the cookies in the oven. When the timer goes off, we pull the cookies out to discover that they have formed one mega-cookie, completely covering the baking sheet. Why did this happen? If we were cookie experts, we might be able to guess, but we’re just cookie enthusiasts who want snacks.
We know the effect (mega-cookie), but we don’t know the cause. Did we forget to add an ingredient? Was the oven the wrong temperature? Did we get so wrapped up in ‘80s kitchen karaoke that we only added half the flour? If we could answer all of these questions, we would have enough context to find the cause of our mega-cookie. This is the underlying principle of observability: we want to be able to ask questions of our system to find out what is happening.
Telemetry, the practice of collecting measurements from a system and sending them somewhere we can look at them, gives us breadcrumbs (cookie crumbs?) to follow when we’re investigating behavior. An irregular error message might lead us to an endpoint, which might lead us to a service that uses an unmaintained library. If we had added telemetry to our baking process, we would know that the oven was the right temperature but that we only added half the flour. Easy, right?
In our example, observability empowers us to discover why we created a mega-cookie instead of the perfect dozen. Of course, our systems at work can be a bit more complex than cookies (though I would argue both are of equal importance), which is why it’s so important to be able to ask and answer questions.
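To make the cookie example a little more concrete, here is a minimal sketch in Python of what “adding telemetry to our baking process” could look like. Everything in it is made up for illustration: the BakeTelemetry helper, the RECIPE numbers, and the 5% tolerance are assumptions, not part of any real tool.

```python
from dataclasses import dataclass, field


@dataclass
class BakeTelemetry:
    """Collects named readings taken while baking (hypothetical helper)."""
    readings: dict = field(default_factory=dict)

    def record(self, name: str, value: float) -> None:
        self.readings[name] = value


# What the recipe calls for (made-up numbers for illustration).
RECIPE = {"flour_cups": 2.0, "butter_cups": 1.0, "oven_temp_f": 350.0}


def find_deviations(telemetry: BakeTelemetry, tolerance: float = 0.05) -> dict:
    """Return each reading that is missing or off by more than `tolerance` (a fraction of the expected value)."""
    deviations = {}
    for name, expected in RECIPE.items():
        actual = telemetry.readings.get(name)
        if actual is None:
            deviations[name] = "not recorded"
        elif abs(actual - expected) > tolerance * expected:
            deviations[name] = f"expected {expected}, got {actual}"
    return deviations


telemetry = BakeTelemetry()
telemetry.record("flour_cups", 1.0)    # only half the flour
telemetry.record("butter_cups", 1.0)
telemetry.record("oven_temp_f", 350.0)

print(find_deviations(telemetry))
# {'flour_cups': 'expected 2.0, got 1.0'}
```

Because we recorded what actually happened, asking “why did we get a mega-cookie?” becomes a lookup instead of a guess.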
Traditionally, observability is a combination of metrics, logs, and traces in our software (these are also referred to as the “three pillars of observability”). When we talk about telemetry, these are the kinds of data we collect. We could monitor the oven temperature (metric), read the baker’s notes (log) or examine the baker’s workflow (trace). In the real world, we’re probably going to use some kind of application metrics, database logs or syslogs, and a tracing tool. Metrics and logs are two types of valuable data, and traces show the life cycle of a request. That said, data is different from measurement. In our baking example, the oven temperature is data; this alone doesn’t tell us whether the oven was too hot for cookies. We also can’t be sure that the number on the dial was correct: maybe the oven was set to 350°F, but the real temperature inside fell to 325°F every time the oven door was opened.
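Here is a rough sketch of the three kinds of telemetry using only the Python standard library. It is not how a real tracing tool works; the span helper, the metrics dictionary, the 350°F target, and the 10-degree tolerance are all assumptions chosen to mirror the baking example. The last function shows the difference between data (the number) and measurement (the number compared against what we wanted).

```python
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("bakery")

metrics = {}  # metric name -> latest value
spans = []    # finished (step name, duration in seconds) pairs, in completion order


@contextmanager
def span(name):
    """Time one step of the workflow and remember how long it took."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append((name, time.perf_counter() - start))


def bake_cookies(flour_cups, oven_temp_f):
    with span("bake_cookies"):                      # trace: the whole workflow
        with span("mix_dough"):
            metrics["flour_cups"] = flour_cups      # metric: a number we can track
            log.info("mixed dough with %s cups of flour", flour_cups)  # log: the baker's notes
        with span("bake"):
            metrics["oven_temp_f"] = oven_temp_f
            log.info("oven reads %s F", oven_temp_f)


def interpret_oven(reading_f, target_f=350.0, tolerance_f=10.0):
    """Turn a raw reading (data) into a judgment by comparing it to a target."""
    if reading_f < target_f - tolerance_f:
        return "too cold"
    if reading_f > target_f + tolerance_f:
        return "too hot"
    return "ok"


bake_cookies(flour_cups=1.0, oven_temp_f=325.0)
print(metrics)                                 # {'flour_cups': 1.0, 'oven_temp_f': 325.0}
print([name for name, _ in spans])             # ['mix_dough', 'bake', 'bake_cookies']
print(interpret_oven(metrics["oven_temp_f"]))  # too cold
```

Inner steps finish before the outer one, which is why bake_cookies shows up last in the list of spans.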
When observability is really going right, it can sometimes show us new information we weren’t expecting. If I carefully measured all of my baking ingredients (which I do not), I would know that the reason my chocolate chip cookies are so crispy is that I always add a little more butter than necessary. If we add telemetry to our systems with the same kind of diligence, we might discover unused resources (that eat into our budget) or unnecessarily complex workflows that increase latency.
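As a small illustration of that kind of discovery, the sketch below compares what we have provisioned against what our telemetry says was actually used. The resource names and the request log are invented for the example; in a real system this data would come from whatever metrics or logs you already collect.

```python
from collections import Counter

# Everything we are paying for (hypothetical names).
provisioned = ["api-server", "worker-1", "worker-2", "report-cache"]

# Which resource handled each request, as recorded by our telemetry (made-up data).
request_log = ["api-server", "worker-1", "api-server", "worker-1", "api-server"]

usage = Counter(request_log)

# A Counter returns 0 for names it has never seen, so no special-casing is needed.
unused = [name for name in provisioned if usage[name] == 0]

print(unused)
# ['worker-2', 'report-cache']
```

Nobody went looking for idle workers here; the answer fell out of data we were already recording.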
Observability exists as a response to the deep technical question: What is going on in my system? I have certainly yelled versions of this, usually after a deployment. As with baking, we don’t know all of the things that can go wrong until our code is in production and our cookies are out of the oven. The explanations here are foundations for you to explore and further understand observability. If you want to see what an observability tool looks like in the real world, try the Lightstep sandbox, where you can happily click through problems that don’t affect you.