Observability: what is it good for?
Observability tools form the cornerstone of understanding what’s happening in your production systems. But these tools only offer real value when they’re in the hands of developers and engineers across your organization. When you are evaluating existing or new tools, you need to consider how they are being used not just by experts in your organization, but by every member of your team whose work affects production systems.
When many people think about observability, their thoughts immediately jump to oncall and investigating production incidents. However, observability has an equally important role to play in many day-to-day activities, including deploying new features, optimizing critical code, and planning architectural changes to the application.
What do all of these activities have in common? They are all about managing change: whether it’s a planned change (like a new release) or an unplanned one (like an instance suddenly going offline). And in modern application development, not only are changes happening all of the time (both the planned ones and the unplanned ones), but these changes can also have effects far across the application. A new deployment of one service can affect the performance of another. A configuration change can lead to new errors. A new feature can change user behavior, which can put new pressure on backend services, which in turn can increase demand for infrastructure.
The purpose of observability is to track dependencies within an application and to connect changes in services, infrastructure, and user behavior with changes in user experience, cost, and reliability in general. That is, observability is about linking causes with the effects that matter to your business. It follows that observability tools are providing value to your organization if they are being used to determine which changes matter and address them in a timely way. Let’s consider a couple of examples in more detail.
Every deployment presents a risk and a tradeoff. Could more testing be useful? Perhaps that PR should bake a bit longer in staging. Or maybe we should wait until a senior engineer is back from vacation next week? On the other hand, shipping new functionality makes for happy users.
Monitoring dashboards – that is, looking at a set of graphs showing key aspects of application performance – has long been used as the main way to understand whether or not a service was healthy after a deployment. And it should continue to be part of that process! But unfortunately, it’s not enough for two reasons.
First, while a service may be healthy by its own standards, any deployment may have adverse effects on other services. It’s not realistic to expect a developer who just pushed a deployment to look at the dashboards of every service that might be affected. (Much less to understand them!) Observability tools must automatically assess the health of every potentially affected service and proactively notify everyone who might be affected (in addition to the developer who just deployed).
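To make "every potentially affected service" concrete, here is a minimal sketch in Python. It assumes you already have a call graph (for example, one derived from trace data) and walks it to find every service that calls the deployed service, directly or indirectly. The service names and the CALLS mapping are hypothetical, and a real tool would also have to consider downstream services that receive new load.

```python
from collections import deque


def affected_services(calls, deployed):
    """Return every service that calls `deployed`, directly or transitively.

    `calls` maps each service name to the list of services it calls.
    """
    # Invert the call graph: callee -> set of direct callers.
    callers = {}
    for caller, callees in calls.items():
        for callee in callees:
            callers.setdefault(callee, set()).add(caller)

    # Breadth-first walk "upstream" from the deployed service.
    seen = set()
    queue = deque([deployed])
    while queue:
        service = queue.popleft()
        for caller in callers.get(service, ()):
            if caller not in seen and caller != deployed:
                seen.add(caller)
                queue.append(caller)
    return seen


# Hypothetical call graph, for illustration only.
CALLS = {
    "web": ["checkout", "search"],
    "checkout": ["payments", "inventory"],
    "search": ["inventory"],
    "payments": [],
    "inventory": [],
}

# Services whose owners might want to hear about an "inventory" deployment.
print(sorted(affected_services(CALLS, "inventory")))
```

The resulting set is the list of services whose health should be checked, and whose owners should be notified, after the deployment.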
Second, even if a dashboard is a good way to assess whether or not a service is healthy after a deployment, it’s a terrible way to try to understand why that service became unhealthy. Assuming that the deployment looked good in staging (and if not, well, that’s another problem), there must be something about the production environment that’s different from staging, something that’s causing the service to become unhealthy. It could be a different version of a dependency that’s running in prod (but not in staging), something different about the load in prod, or just some configuration that’s different in that environment. Observability tools must be able to quickly identify the aspects of the production environment that can explain a change in health.
Observability can also help teams to become more proactive and to prevent problematic releases from impacting users. By integrating observability with your CI/CD pipeline, you can automatically identify releases that cause changes to key indicators before they roll out to your whole user base.
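As a rough illustration of what such a pipeline check might look like, the Python sketch below compares key indicators from a small canary release against the current baseline and blocks promotion when any indicator gets noticeably worse. The indicator names, the sample values, and the 10% tolerance are all assumptions made for the example. It also assumes that every indicator is "lower is better", as with error rate or latency.

```python
def regressed_indicators(baseline, canary, tolerance=0.10):
    """Return indicators where the canary is worse than the baseline.

    An indicator counts as regressed when the canary value exceeds the
    baseline value by more than `tolerance` (a relative fraction).
    All indicators are assumed to be "lower is better".
    """
    regressions = {}
    for name, base_value in baseline.items():
        canary_value = canary.get(name)
        if canary_value is None:
            continue  # Indicator not reported by the canary; skip it.
        limit = base_value * (1 + tolerance)
        if canary_value > limit:
            regressions[name] = (base_value, canary_value)
    return regressions


def should_promote(baseline, canary, tolerance=0.10):
    """Promote the release only if no key indicator regressed."""
    return not regressed_indicators(baseline, canary, tolerance)


# Hypothetical measurements, for illustration only.
BASELINE = {"error_rate": 0.010, "p95_latency_ms": 200.0}
CANARY = {"error_rate": 0.012, "p95_latency_ms": 205.0}

print(regressed_indicators(BASELINE, CANARY))
print(should_promote(BASELINE, CANARY))
```

A fixed threshold like this is the simplest possible gate. In practice you would likely want a comparison that accounts for normal variation in each indicator.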
Observability tools can also be used as part of thinking about future engineering work. While these changes have not yet occurred, observability can play a key role as part of a “what if” analysis.
One example is looking at improving overall application performance. For user-facing interactions, improvements as small as 100 milliseconds can have a measurable impact on user behavior. But among the dozens (or hundreds) of services in a microservice-based application, which service needs to be optimized to improve user experience? This is exactly the sort of question that observability can help answer: how can we connect a change in an individual service’s performance (the cause) to the overall performance of the application (the effect)? In my time at Google, this sort of analysis was critical to determining which teams should be working on performance optimization and which should focus instead on building new features or reducing technical debt.
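A toy model can show why this question is harder than it looks. The Python sketch below assumes each service does some work of its own and then calls all of its dependencies in parallel, so a request takes as long as the service's own work plus its slowest dependency. The service names and timings are hypothetical, and real traces are rarely this tidy, but the model is enough to show that speeding up a service only helps users if that service is on the slowest path.

```python
def end_to_end_ms(service, own_ms, calls):
    """Latency of a request entering at `service`.

    Assumes a service does `own_ms[service]` of work and then calls all of
    its dependencies in parallel, waiting for the slowest one.
    """
    children = calls.get(service, [])
    slowest_child = max(
        (end_to_end_ms(child, own_ms, calls) for child in children),
        default=0.0,
    )
    return own_ms[service] + slowest_child


def what_if(service, saving_ms, root, own_ms, calls):
    """End-to-end improvement if `service` got `saving_ms` faster."""
    faster = dict(own_ms)
    faster[service] = max(0.0, faster[service] - saving_ms)
    return end_to_end_ms(root, own_ms, calls) - end_to_end_ms(root, faster, calls)


# Hypothetical services and timings, for illustration only.
OWN_MS = {"web": 20, "checkout": 30, "search": 80, "inventory": 40}
CALLS = {"web": ["checkout", "search"], "checkout": ["inventory"], "search": []}

print(end_to_end_ms("web", OWN_MS, CALLS))            # total request latency
print(what_if("search", 20, "web", OWN_MS, CALLS))     # search is on the slow path
print(what_if("inventory", 20, "web", OWN_MS, CALLS))  # inventory is not
```

In this example, saving 20 milliseconds in the search service improves the user-facing request by only 10 milliseconds, because the checkout path then becomes the slowest one. Saving the same 20 milliseconds in the inventory service improves nothing at all.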
Observability should also be part of how you measure the reliability of your application, whether through service level objectives (SLOs) or another technique. As in the case of performance optimization, it can tell you whether your application as a whole – as well as each individual service – is reliable enough or if engineering time should be budgeted to improve reliability.
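One common way to turn an SLO into a budgeting decision is an error budget: the number of failures the SLO allows over some window, compared with the failures actually observed. The Python sketch below shows the arithmetic for a request-based availability SLO. The 99.9% target and the request counts are made-up numbers for illustration.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative if overspent).

    `slo_target` is the fraction of requests that must succeed, e.g. 0.999.
    """
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures == 0:
        # A 100% target (or no traffic) leaves no room for failures.
        return 0.0 if failed_requests else 1.0
    return 1 - failed_requests / allowed_failures


# Hypothetical window: 1,000,000 requests against a 99.9% SLO
# allows roughly 1,000 failures.
print(error_budget_remaining(0.999, 1_000_000, 250))    # most of the budget left
print(error_budget_remaining(0.999, 1_000_000, 1_500))  # budget overspent
```

A service with most of its budget left can probably afford to keep shipping features, while one that has overspent its budget is a candidate for reliability work.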
A final example of using observability as part of planning engineering work is when making decisions about changes to application architecture. Say that a new feature will likely increase the load on one of your databases. Is it time for a cache to handle reads? Or maybe a message queue to manage updates? Under what conditions is database performance already slow? What other services and features depend on that database (and might be affected if performance of that database suffered as a result of that additional load)? Observability lets you explore these options using real world data, with the actual dependencies between services, and in the context of your users’ experience.
And of course, observability still has an important role as part of incident response. Whether it’s because of an unplanned change (read: disappearance) of some of your infrastructure, an outage at a cloud provider or a third-party API, or a flood of new users when your app blows up on social media – being able to quickly detect, identify, and mitigate the effects of these changes is central to the responsibilities of oncall engineers.
But like monitoring deployments and planning engineering work, handling pages cannot be an activity that is managed only by your organization’s most experienced engineers and SREs. If your organization is really going to scale – and take advantage of a microservice-based or other distributed architecture – these responsibilities need to be distributed as well. And because observability is central to all of these responsibilities, observability tools need to be not only accessible to every developer who touches prod but also used by them.