The Case for Human-Centric Observability
by Austin Parker
Modern systems have resulted in an explosion of complexity for organizations of every size, shape, and purpose. In an effort to create more resilient and reliable software, we’ve cast around for solutions to tame and manage this complexity. Observability practices have come to be seen as essential for operating systems at scale, but in practice, they’re often seen as technical solutions to what is ultimately a social problem: Software, at the end of the day, is built and run by people.
Human-Centric Observability is a methodology to evaluate observability practices by how they impact, and are built around, three distinct groups of people: external users, internal users, and engineers. We’ll discuss how to identify these groups and the value of putting them at the center of your observability practice.
HCO provides a framework to evaluate your own internal practices and make on-call suck less for your team. It helps teams build not only resilient software but a resilient organization, one that can adapt to changing requirements and needs.
The important thing about observability isn’t the tool, it’s how you build it into your team. This is a ‘first principles’ line of thought: are you buying fancy dashboards you can put on a screen, or are you trying to run reliable software? Think about observability as something that has people at its center, and things start to become clearer.
Since external users are the ones most impacted by unreliable or unavailable software, you need to be able to connect their actions to insights. Think about it this way: you can have 99.9% reliability in aggregate, but that could mean 100% for everyone except one person, who’s got 0%. Behind every outage or incident is a person who relies on your application, and they’re often the one who is forgotten. Some examples of these users might be people trying to add items to a cart, request a ride, play a song, or even external consumers of an API.
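To make the aggregate-versus-individual gap concrete, here is a minimal sketch in Python. The request log, the user IDs, and the `success_rates` helper are all hypothetical and invented for illustration; the point is only that the same data can show 99.9% in aggregate while one user sees 0%.

```python
from collections import defaultdict

def success_rates(requests):
    """Return (aggregate_rate, per_user_rates) for (user_id, succeeded) pairs."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for user_id, ok in requests:
        totals[user_id] += 1
        if ok:
            successes[user_id] += 1
    aggregate = sum(successes.values()) / sum(totals.values())
    per_user = {user: successes[user] / totals[user] for user in totals}
    return aggregate, per_user

# Hypothetical data: 999 users each make one successful request,
# and one user makes a single request that fails.
requests = [(f"user-{i}", True) for i in range(999)] + [("user-999", False)]

aggregate, per_user = success_rates(requests)
worst_user = min(per_user, key=per_user.get)

print(f"aggregate success rate: {aggregate:.1%}")
print(f"worst-served user: {worst_user} at {per_user[worst_user]:.0%}")
```

Breaking the number down per user only works if your telemetry carries a user identifier in the first place, which is the practical meaning of connecting actions to insights.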
Internal users can be “the business” (if you want to justify scale, you need to back it up with numbers), but they are also product, design, marketing, and so on. Don’t neglect charts and graphs, but ask yourself how they can be harnessed for analytics and correlation between teams. Look for gulfs between KPIs. For example, are bad conversion rates on sign-up correlated with anything in terms of performance? Observability can help you answer these questions!
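As a rough illustration of the sign-up question, the sketch below computes a Pearson correlation between a performance metric and a business KPI. The daily figures are made up, and the `pearson` helper is written out by hand only to keep the example dependency-free; a real analysis would pull both series from your own telemetry and analytics tools.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical daily values: p95 latency of the sign-up page (ms)
# and the sign-up conversion rate for the same days.
p95_latency_ms = [200, 250, 300, 400, 600]
signup_conversion = [0.12, 0.11, 0.10, 0.08, 0.05]

r = pearson(p95_latency_ms, signup_conversion)
print(f"correlation between latency and conversion: {r:.2f}")
```

A strongly negative value here would be a prompt to investigate, not proof that latency causes the drop; the value of shared telemetry is that product and engineering can ask the question from the same data.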
Engineers are another important group to consider: is uptime more important than retention? What happens when someone goes on vacation? Consider the “bus factor” of your system: good observability can ameliorate single points of failure (SPOFs) in people, as well as in code. It’s also extremely humane to try to reduce the burden on the individuals and superheroes on your team; leaning on them too hard can lead to burnout.
At Lightstep, we’ve put people at the center of our internal observability practice. On-call gets a lot easier: because we use Lightstep to monitor Lightstep, we’re able to quickly identify both specific failures and aggregate performance regressions across releases. We correlate customers across all of our telemetry streams, not only internally but for analytics events as well, and can easily move between tools in order to understand user experience. For internal users, Lightstep not only helps us with capacity planning but also builds confidence through internal releases of new services before we roll them out to all of our customers.
First, you should be able to evaluate what you’re currently doing and try to understand how it’s helping or harming you. Are you missing outlier events because of aggressive sampling, or a reliance on too few streams of telemetry data? Are you resilient to people taking vacations, burning out, changing jobs, etc.? Can your existing analysis tools scale to handle a large volume of new events or a changing pattern in those events?
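The sampling question can be checked with a small experiment. The sketch below is a simplified, hypothetical model, not any particular vendor's sampler: it compares a fixed-rate sampler that keeps every Nth event regardless of outcome with one that additionally keeps anything slow or failed. The event shape, the thresholds, and both function names are assumptions made for illustration.

```python
def fixed_rate_sample(events, every_n):
    """Keep every Nth event without looking at what happened in it."""
    return [e for i, e in enumerate(events) if i % every_n == 0]

def outcome_aware_sample(events, every_n, slow_ms):
    """Keep every Nth event, plus any event that errored or was slow."""
    return [
        e for i, e in enumerate(events)
        if i % every_n == 0 or e["error"] or e["latency_ms"] >= slow_ms
    ]

# Hypothetical traffic: 1,000 healthy requests with one slow, failed outlier.
events = [{"id": i, "latency_ms": 40, "error": False} for i in range(1000)]
events[537] = {"id": 537, "latency_ms": 9000, "error": True}

fixed = fixed_rate_sample(events, every_n=100)
aware = outcome_aware_sample(events, every_n=100, slow_ms=1000)

outlier_in_fixed = any(e["id"] == 537 for e in fixed)
outlier_in_aware = any(e["id"] == 537 for e in aware)

print(f"fixed-rate kept {len(fixed)} events, outlier kept: {outlier_in_fixed}")
print(f"outcome-aware kept {len(aware)} events, outlier kept: {outlier_in_aware}")
```

Both samplers retain roughly 1% of the traffic, but only the second one still contains the request that a real person experienced as a failure.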
There’s usually no single fix, but there are guidelines that can help you understand and tame these problems. First, focus on the people at the heart of your practice. If your tools don’t serve the people, then they’re bad tools. Second, look toward open source frameworks like OpenTelemetry as the single source of telemetry data, and integrate it everywhere, as this avoids a lot of questions about “what tools to use”.
Finally, evaluate your observability tools through something like an Observability Scorecard - you need to be able to measure the impact of performance on your users, and you need to be able to explain the variations in these measurements by quickly narrowing the search space. With these principles in mind, you can make life better for all of the people touched by your software, both internally and externally.