Why Observability Needs to Stay Weird
by Austin Parker
It’s a strange truth about technology: most of our problems with it wind up being, well, weird. That seems odd when you step back to think about it. There’s a meme that’s been going around for several years now, that a computer is just a rock that we tricked into thinking, and it’s true! Computers, and the systems we build with them, should all be very ordered and logical, because the underlying thing that backs them is extremely straightforward math. Every iota of functionality that we wring out of an application is just an abstraction over millions of arithmetic operations, over and over, in a ceaseless charge towards infinity.
So, if everything should be so simple and clean, why does it usually feel confusing and dirty? I’ve spoken at length about this topic: we build abstractions over abstractions over abstractions, and each layer adds uncertainty to everything we build on top of it. I believe this is the key to understanding what reliability in software actually means. It’s a way to control that uncertainty, to harness it and tame it, until it becomes an obstacle that you grow from mastering rather than one that you eternally stumble over.
The process of mastering these obstacles can be thought of as an element of “observability.” Let me focus on one specific, and extremely important, use case: alerting.
Alerting is fairly straightforward: you set some condition, “x,” on some value, “y.” When “y” approaches or violates “x,” this creates one or more signals that are then communicated through a variety of channels, ultimately leading to some intervention to “correct” “y” back to its “normal” form. This is a process that has been intrinsic to the operation of software since before software was a thing.
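To make that loop concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `AlertRule` class, the `warn_ratio` that decides what “approaching” means, and the use of a plain list as a stand-in for a real notification channel. It is not how any particular alerting product works.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class AlertRule:
    """A condition ("x") placed on some observed value ("y")."""
    name: str
    threshold: float                 # the condition "x"
    warn_ratio: float = 0.9          # hypothetical: fraction of x that counts as "approaching"
    channels: List[Callable[[str], None]] = field(default_factory=list)

    def evaluate(self, value: float) -> str:
        """Classify the value and send a signal to every channel unless it is normal."""
        if value >= self.threshold:
            state = "critical"       # y violates x
        elif value >= self.threshold * self.warn_ratio:
            state = "warning"        # y approaches x
        else:
            state = "ok"
        if state != "ok":
            for notify in self.channels:
                notify(f"{self.name}: {state} (value={value}, threshold={self.threshold})")
        return state


# A list's append method stands in for a pager, chat room, or email channel.
sent: List[str] = []
disk_rule = AlertRule(name="disk_used_percent", threshold=90.0, channels=[sent.append])
states = [disk_rule.evaluate(v) for v in (50.0, 85.0, 95.0)]
# states is ["ok", "warning", "critical"], and two signals were sent
```

Notice that both `threshold` and `warn_ratio` are frozen into the rule ahead of time. Someone had to decide what “normal” was before the rule could exist.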
We all experience alerting, to an extent, regardless of our role. Resource exhaustion is a common alert condition, something that you’ve probably had to resolve both professionally and in your interactions with personal computing. How many times have you ever had to delete an application due to storage exhaustion on your personal computer? How many downloaded files have you needed to delete, or move to secondary storage?
In a prior job, I worked in technical support for a major telecom provider, specifically handling BlackBerry phone cases. One flaw, somewhat endemic to these phones, was that as their internal storage got close to full, the performance of the device would quickly degrade. A fix was implemented: automatically delete old text messages in order to reclaim space. Internally, this logic is consistent with alerting: a condition is created on a value, and when that condition is approached, a signal is sent so that corrective action can be applied. This seems fine, yeah? This is how things should work! The system is self-correcting!
There was a problem, however.
What sort of text messages do people tend to keep around for a while? Things that are important, yeah? Messages from loved ones who may have passed, reminders from their past selves that they’d like to preserve, all sorts of emotional detritus that gets gummed up in possibly the least safe and reliable storage imaginable: a text message database on a BlackBerry. It’s impossible for me to know what the engineers who were tasked with implementing this system considered “normal,” but to the users who awoke one day to find that things they valued had been uncaringly deleted through deus ex machina, the cold and unfeeling technical explanation that they shouldn’t keep important things in their phone’s text messages rang rather hollow.
This is the hollow, and false, promise of monitoring.
We tend to think that we can accurately understand and model the myriad interactions that we build into, and through, our software with precision so great as to remove human interaction and understanding from the equation. What is “normal” to your software? What is “normal” to your on-call rotation? None of these things are rocks that we tricked into thinking; that’s just the bare and rough surface that we operate on. All of these things are extremely human, almost to a fault. Normal today may look different from normal tomorrow, or normal in a week, because of externalities that you didn’t or couldn’t comprehend at the time you figured out what normal was. Monitoring, as a discipline, requires you to pre-define normal and then freeze it, with ruthless efficiency, suborning agility and humanity and adaptiveness for the sake of producing a steady-state system.
A more prosaic example: You control some service, and you’ve determined that the size of requests is a valuable signal for the overall health of your service. If requests get too large, they introduce unacceptable latency, or even process failure. So, how do you monitor this value? You can introduce a metric or an attribute on a trace, which is a good start. But what does that look like? Even in state-of-the-art systems, it can be challenging to perform greater-than or less-than queries against an extremely large set of data, so you decide to coalesce these values into a simple boolean: true or false, is the request over a certain size in bytes?
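As a rough sketch of that pattern, the snippet below records a single boolean on a request’s attributes. A plain dictionary stands in for a span’s attributes, and the `MAX_REQUEST_BYTES` limit, the `annotate_request` helper, and the `request.oversized` key are all made up for illustration rather than taken from any tracing library.

```python
MAX_REQUEST_BYTES = 1_000_000  # hypothetical limit; pick one that fits your service


def annotate_request(attributes: dict, body: bytes, limit: int = MAX_REQUEST_BYTES) -> dict:
    """Record one boolean instead of the raw size, so queries become simple equality checks."""
    attributes["request.oversized"] = len(body) > limit
    return attributes


small = annotate_request({"route": "/upload"}, b"x" * 10)
large = annotate_request({"route": "/upload"}, b"x" * (MAX_REQUEST_BYTES + 1))

# Counting oversized requests is now a filter on a flag, not a range query.
oversized_count = sum(1 for attrs in (small, large) if attrs["request.oversized"])
```

The query side gets cheap, but the raw size is gone: whatever `MAX_REQUEST_BYTES` was on the day the request was recorded is baked into the data.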
This pattern works extremely well, as a matter of fact, so much so that I recommend it to people for this use case. However, this is still monitoring a value that you’ve predefined as “normal,” in a potentially awkward way. You need to be aware of normal so that you can change it proactively. What happens if the bounds of normal change? If your underlying assumptions about what “too big” means change, you’ll need to change that value as well. If your interaction or usage model changes, maybe you need to restrict the payload size. Perhaps you’re trying to monetize, or add different tiers of service, and now there are two normals, or even a whole matrix of normals.
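Here is what that matrix can start to look like, as a small sketch. The tier names and byte limits are invented for the example; the point is only that the same request can be normal for one customer and abnormal for another.

```python
# Hypothetical per-tier limits in bytes; none of these numbers come from a real system.
LIMITS_BY_TIER = {
    "free": 256_000,
    "pro": 1_000_000,
    "enterprise": 10_000_000,
}
DEFAULT_LIMIT = 256_000  # assumed fallback when a tier is unknown


def is_oversized(size_bytes: int, tier: str) -> bool:
    """Return True when a request exceeds the limit that counts as normal for its tier."""
    return size_bytes > LIMITS_BY_TIER.get(tier, DEFAULT_LIMIT)


request_size = 500_000
verdicts = {tier: is_oversized(request_size, tier) for tier in LIMITS_BY_TIER}
# verdicts is {"free": True, "pro": False, "enterprise": False}
```

Every new tier, region, or pricing change adds another entry that someone has to remember to keep current.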
The power of observability is not that it provides tools that somehow “magic” away these fundamental questions and tradeoffs. Instead, the power of observability is that it handles weirdness. In our prior example, observability would give you the power to avoid pre-computing “normal” values for your requests, as it encourages you to keep more data and gradually reduce the resolution over time through dynamic sampling approaches. Observability puts you in a better state to manage changes in normal because it approaches these questions as holistic concerns about how you and your team can witness system state, understand it, and shape it over time rather than as technical details to be frozen in a spec sheet somewhere. Observability refocuses your thinking about tools, asking “how can everyone use this?” rather than “how can I use this?” It leads you to create unselfish systems that are accessible to everyone involved in your software, rather than jealously hoarding information in dashboards and behind arcane queries. Observability helps build organizations and teams that are adaptable, who are empowered by the telemetry data their services generate, rather than ones who don’t “play well with others.”
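One way to picture the difference, as a sketch under some loud assumptions: keep the raw request sizes, decide what “too big” means when you ask the question, and thin out older data as it ages. The every-Nth-event rule below is a deliberately simple stand-in for a real dynamic sampling strategy, and the function names and default values are hypothetical.

```python
from typing import Iterable, List, Tuple

Event = Tuple[float, int]  # (timestamp in seconds, request size in bytes)


def reduce_resolution(events: Iterable[Event], now: float,
                      full_resolution_seconds: float = 3600.0,
                      keep_every: int = 10) -> List[Event]:
    """Keep every recent event, but only every Nth event older than the window."""
    kept: List[Event] = []
    older_seen = 0
    for ts, size in events:
        if now - ts <= full_resolution_seconds:
            kept.append((ts, size))          # recent data stays at full resolution
        else:
            if older_seen % keep_every == 0:
                kept.append((ts, size))      # older data is thinned out
            older_seen += 1
    return kept


def oversized_fraction(events: Iterable[Event], limit_bytes: int) -> float:
    """Apply a "normal" boundary at query time instead of at write time."""
    sizes = [size for _, size in events]
    if not sizes:
        return 0.0
    return sum(size > limit_bytes for size in sizes) / len(sizes)


events = [(float(t), 100 * t) for t in range(1, 11)]   # sizes 100, 200, ..., 1000
old_normal = oversized_fraction(events, limit_bytes=500)
new_normal = oversized_fraction(events, limit_bytes=800)
# old_normal is 0.5 and new_normal is 0.2, computed from the same stored data
```

Because the raw sizes were kept, changing the definition of normal is a new question asked of old data, not a migration. The tradeoff is that thinned-out history only gives you an estimate.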
Observability isn’t about three pillars, or dashboards, or a particular product. It’s about how we think about change, and how we manage the effects of that change. An observability mindset eschews blind monitoring and alerting; rather, it puts people at the center of the process of running reliable software, and asks them “how can I help?” rather than telling them “do this now.” It’s about embracing the weirdness rather than trying to stamp it out.
So c’mon, keep observability weird.
If you’re interested in trying an unselfish observability tool, check out lightstep.com/play.