Observability: A complete overview for 2021
A practical guide for developers
Table of Contents
- What is Observability?
- Telemetry data: Logs, metrics, and traces
- What questions can Observability answer?
- Observability vs. monitoring
- Observability vs. APM
- Cardinality in Observability
- Why Do We Care About Cardinality?
- What strategies should we employ to address the critical issue of cardinality in observability?
- Benefits of Observability
- Observability benefits by role
- The Three Pillars of Observability
- Observability tools: Rules that cannot be broken
What is Observability?
Observability helps developers understand multi-layered architectures: what’s slow, what’s broken, and what needs to be done to improve performance.
By making systems observable, anyone on the team (excluding marketing, sales, HR — fine not anyone) can easily navigate from effect to cause in a production system. The path from effect to cause often requires many steps, including any number of innocent intermediaries. Observability is a means to follow each of those steps.
In the words of Shaun McCormick, senior staff engineer at BigCommerce, “Observability is not just knowing a problem is happening, but knowing why it is happening. And knowing how I can go in and fix it.”
More formally, observability is defined as the ability to measure the internal state of a system only by its outputs.
For distributed systems, such as microservices, serverless, service meshes, etc., these outputs are telemetry data: logs, metrics, and traces.
Telemetry data: Logs, metrics, and traces
There are three primary types of telemetry data through which systems are made observable.
Logs – Structured or unstructured lines of text that are emitted by an application in response to some event in the code. Logs are distinct records of “what happened” to or with a specific system attribute at a specific time. They are typically easy to generate, difficult to extract meaning from, and expensive to store.
Structured log example:
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
Unstructured log example:
Metrics – A metric is a value that expresses some data about a system. These metrics are usually represented as counts or measures, and are often aggregated or calculated over a period of time. A metric can tell you how much memory is being used by a process out of the total, or the number of requests per second being handled by a service.
Traces – A single trace shows the activity for an individual transaction or request as it flows through an application. Traces are a critical part of observability, as they provide context for other telemetry. For example, traces can help define which metrics would be most valuable in a given situation, or which logs are relevant to a particular issue.
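To make the three telemetry types concrete, here is a minimal sketch of a single request emitting all three. The in-memory sinks (`LOG_LINES`, `REQUEST_COUNT`, `SPANS`) and the `handle_request` function are hypothetical stand-ins for a real telemetry backend, not any particular library's API.

```python
import json
import time
import uuid
from collections import Counter

# Hypothetical in-memory sinks standing in for a real telemetry backend.
LOG_LINES = []
REQUEST_COUNT = Counter()
SPANS = []


def handle_request(route, status, trace_id=None):
    """Emit one log, one metric update, and one span for a single request."""
    trace_id = trace_id or uuid.uuid4().hex
    start = time.time()

    # ... the real request handling would happen here ...

    # Log: a structured record of what happened, at a specific time.
    LOG_LINES.append(json.dumps({
        "ts": start, "route": route, "status": status, "trace_id": trace_id,
    }))

    # Metric: an aggregated count, keyed by a small set of attributes.
    REQUEST_COUNT[(route, status)] += 1

    # Trace: one span describing this request's activity and duration.
    SPANS.append({
        "trace_id": trace_id, "name": f"GET {route}",
        "duration_s": time.time() - start,
    })
    return trace_id


first_trace_id = handle_request("/apache_pb.gif", 200)
handle_request("/apache_pb.gif", 200)
print(REQUEST_COUNT[("/apache_pb.gif", 200)])
print(json.loads(LOG_LINES[0])["trace_id"] == first_trace_id)
```

Because the log line and the span carry the same `trace_id`, the trace can later be used to find the logs that belong to it, which is the sense in which traces provide context for other telemetry.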
What questions can Observability answer?
This is undoubtedly an incomplete list, but given that the application of observability is so broad, we thought it’d be helpful to provide a sampling of possible questions that can be addressed by an effective observability solution:
- Why is x broken?
- What services does my service depend on — and what services are dependent on my service?
- What went wrong during this release?
- Why has performance degraded over the past quarter?
- Why did my pager just go off?
- What changed? Why?
- What logs should we look at right now?
- Should we roll back this canary?
- Is this issue affecting certain Android users or all of them?
- What is system performance like for our most important customers?
- What SLO should we set?
- Are we out of SLO?
- What did my service look like at time point x?
- What was the relationship between my service and x at time point y?
- What was the relationship of attributes across the system before we deployed? What’s it like now?
- What is most likely contributing to latency right now? What is most likely not?
- Are these performance optimizations on the critical path?
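As one illustration, the dependency question from the list above can be answered directly from trace data. The sketch below assumes a hypothetical span shape (`id`, `parent`, `service`); real tracing systems use richer records, but the idea is the same.

```python
from collections import defaultdict

# Hypothetical spans: each records its service and the span that called it.
spans = [
    {"id": "a", "parent": None, "service": "web"},
    {"id": "b", "parent": "a", "service": "checkout"},
    {"id": "c", "parent": "b", "service": "payments"},
    {"id": "d", "parent": "b", "service": "inventory"},
]


def dependency_map(spans):
    """Return (depends_on, depended_on_by) maps keyed by service name."""
    service_of = {s["id"]: s["service"] for s in spans}
    depends_on = defaultdict(set)
    depended_on_by = defaultdict(set)
    for span in spans:
        parent = span["parent"]
        if parent is None:
            continue  # root span: nothing called it
        caller, callee = service_of[parent], span["service"]
        if caller != callee:  # ignore calls within the same service
            depends_on[caller].add(callee)
            depended_on_by[callee].add(caller)
    return depends_on, depended_on_by


depends_on, depended_on_by = dependency_map(spans)
print(sorted(depends_on["checkout"]))      # ['inventory', 'payments']
print(sorted(depended_on_by["checkout"]))  # ['web']
```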
Observability vs. monitoring
Monitoring requires you to know what you care about before you know you care about it. Observability, which comes from control theory, allows you to understand your entire system and how it fits together, and then use that information to discover what specifically you should care about when it’s most important.
For example, when you’re trying to pinpoint the problem during an incident: if there’s a new dependency deep in the stack affecting your service, observability will surface this information. Monitoring tools will not.
Another way of looking at it: Monitoring requires you to already know what normal is. Observability allows discovery of different types of “normal” by looking at how the system actually behaves, over time, in different circumstances.
“In control theory, observability is a measure of how well internal states of a system can be inferred by knowledge of its external outputs. The observability and controllability of a system are mathematical duals.” - Wikipedia
The one constant in complex systems is change, whether software versions, customer traffic, or even third party dependencies. Expecting anyone to predict normal in this environment is not setting them or the company up for success. Observability embraces a philosophy of awareness and flexibility instead of hoping for perfect prediction.
Observability vs. APM
Conventional application performance monitoring (APM) vendors rely heavily on sampling. This means that when you need to debug an issue, you only get 1% or so of the data. Of course, the result is that there is only a fractional chance that you’ll get to the root cause on the first attempt.
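A quick calculation shows why heavy sampling hurts debugging. The sketch below assumes independent, uniformly random sampling, which is a simplification; `chance_of_capture` is a hypothetical helper, not part of any APM product.

```python
def chance_of_capture(sample_rate, occurrences):
    """Probability that at least one of `occurrences` events is sampled."""
    return 1 - (1 - sample_rate) ** occurrences


# With 1% sampling, how likely is it that a rare failure shows up at all?
for occurrences in (1, 20, 100):
    print(occurrences, round(chance_of_capture(0.01, occurrences), 3))
```

Under these assumptions, a failure that happens 20 times has only about an 18% chance of appearing in the sampled data even once, and one that happens 100 times is still missed more than a third of the time.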
After you’ve finally found the root cause, should such a thing exist (hello, contrarian DevOps and SRE teams), there is a delay until you can retest. This will drag out your MTTR efforts dramatically.
Also, these tools use an agent-based approach, which requires CPU. As you journey into the world of services, containers, or even multiple monoliths, this will require additional resources you’ll need to account for.
Lastly, after “x” months, your data will turn into data aggregation summaries, meaning you will no longer have full-fidelity data if you need to audit your deployments or attempt to find efficiencies in your code.
Cardinality in Observability
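Cardinality is a mathematical term that comes to us from set theory, a fundamental theory of mathematics pioneered in the 1870s by mathematicians such as Georg Cantor. To overly simplify, a set is any collection of unique elements, and the cardinality of that set is the count of those elements. What does this have to do with observability? In a metrics system, each unique combination of a metric name and its attribute values is a distinct time series, so the cardinality of a metric is the number of time series it produces.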
The cardinality of observability data is not a problem that can be ignored: it must be managed. There are many solutions to this problem, the details of which I can’t evaluate completely in this post, but what’s more important is understanding why cardinality challenges are unavoidable. Once we understand that, then it becomes possible to discuss how to mitigate these issues.
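To ground the set-theory definition: if every unique combination of attribute values on a metric is one time series, the cardinality of the metric is the size of that set of combinations. The attribute names and values below are hypothetical, chosen only to show how quickly the count grows.

```python
from itertools import product
from math import prod

# Hypothetical attribute values for one metric.
attributes = {
    "service": ["web", "checkout", "payments"],
    "status": ["200", "404", "500"],
    "region": ["us-east", "eu-west"],
}


def series_cardinality(attributes):
    """Upper bound on distinct time series: the product of value counts."""
    return prod(len(values) for values in attributes.values())


# The set of unique attribute combinations; its size is the cardinality.
combinations = set(product(*attributes.values()))
print(len(combinations), series_cardinality(attributes))  # 18 18

# One extra attribute with many values multiplies the total.
attributes["user_id"] = [f"user-{i}" for i in range(10_000)]
print(series_cardinality(attributes))  # 180000
```

The product is an upper bound: a real system only stores the combinations that actually occur, but an unbounded attribute such as a user ID tends to push the real number toward that bound.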
Why Do We Care About Cardinality?
- Ultimately, cardinality is an unavoidable consequence of scale. As systems and applications become increasingly larger and more complex, our requirements for understanding those systems become more nuanced and detailed, not less.
- Cardinality is not a problem that can be solved by simply throwing more resources at it. Generally, the number of unique time series will grow faster than the resources that can be allocated to hold them in memory for a query. Different components of your system will need different solutions as well – where the cardinality gets added can matter significantly! Consider shared services: different consumers of the service may wish to apply their own attributes to measurements that other consumers may not need. This can complicate how you write, deploy, configure, and operate applications.
- If you aren’t prepared for cardinality, you can easily wander into situations that are difficult to recover from. A single errant attribute added to a widely used metric can suddenly trigger an explosion of new time series being generated and consumed by your metrics server. This can severely inflate memory usage, not only in your application services but also in your metrics server and its associated components as they struggle to keep up with the increased burden. As a side effect, your metrics infrastructure may crash or become extremely slow to respond to queries, blinding you to problems that are occurring and removing one of your tools for understanding why they are occurring in the first place!
What strategies should we employ to address the critical issue of cardinality in observability?
- Think of metrics as the edge of your telemetry. In general, metrics processing is fast – faster than log aggregation, faster than aggregate trace analysis. You want to lean on that speed and ensure that you’re emitting metrics for things that are leading indicators of trouble, and you want to do this consistently across all of your services.
- Be consistent with attributes. Your logs, traces, and metrics should use the same keys for the same concept across multiple servers, operating systems, deployment strategies, etc. Not only does this make it easier for humans to understand, but it reduces the cognitive load required to understand and interpret unfamiliar dashboards and telemetry data when it’s being emitted by a dependent service.
- Treat your observability code the same as any other change to your system. You wouldn’t blindly merge a pull request that changed a critical feature of your application, and you shouldn’t blindly merge a pull request that adds or mutates existing metrics. When deploying changes with a potential impact to metric cardinality, be sure to take the pager and ensure things go off without a hitch.
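One way to put the last two strategies into practice is a small guard that can run in tests or code review. This is a sketch under stated assumptions: the key names in `ALLOWED_KEYS`, the `MAX_SERIES_PER_METRIC` budget, and the `CardinalityGuard` class are all hypothetical and would need to match your own conventions and backend limits.

```python
# Hypothetical shared attribute keys, used by logs, metrics, and traces alike.
ALLOWED_KEYS = {"service.name", "http.route", "http.status_code", "region"}
MAX_SERIES_PER_METRIC = 1_000  # assumed budget; tune for your backend


class CardinalityGuard:
    """Tracks the unique attribute sets seen for each metric."""

    def __init__(self):
        self.seen = {}

    def check(self, metric, attrs):
        # Reject keys outside the shared vocabulary before they ship.
        unknown = set(attrs) - ALLOWED_KEYS
        if unknown:
            raise ValueError(f"{metric}: unexpected attribute keys {sorted(unknown)}")
        # Record this attribute combination as one time series.
        series = self.seen.setdefault(metric, set())
        series.add(tuple(sorted(attrs.items())))
        if len(series) > MAX_SERIES_PER_METRIC:
            raise ValueError(f"{metric}: exceeded {MAX_SERIES_PER_METRIC} series")


guard = CardinalityGuard()
guard.check("http_requests", {"service.name": "web", "http.status_code": "200"})

try:
    guard.check("http_requests", {"service.name": "web", "user_id": "42"})
except ValueError as err:
    print(err)
```

A check like this turns an errant attribute into a failed test instead of a production incident, which is the point of reviewing observability changes like any other change.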
Benefits of Observability
Observable systems are easier to understand, easier to control, and easier to fix than those that are not.
As systems change, whether on purpose (because you’re deploying new software or new configuration, or scaling up or down) or because of some other unknown action, observability enables developers to understand when the system has drifted from the state it’s intended to be in.
Modern observability tools can automatically identify a number of issues and their causes, such as failures caused by routine changes, regressions that only affect specific customers, and downstream errors in services and third-party SaaS.
Observability benefits by role
Observability reduces the amount of stress when deploying code or making changes to the system. By highlighting “what changed” after any deploy or alerting on p9x outliers, customer-affecting issues can be found quickly and rolled back before SLOs are broken.
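As a rough sketch of what alerting on p9x outliers can look like, the example below compares the 99th percentile latency before and after a deploy. The latency samples, the `regressed` helper, and the 1.5x tolerance are hypothetical; `statistics.quantiles` is from the Python standard library.

```python
import statistics


def percentile(samples, p):
    """Return the p-th percentile (1-99) using the stdlib's quantile cut points."""
    return statistics.quantiles(samples, n=100)[p - 1]


def regressed(before, after, p=99, tolerance=1.5):
    """True if the p-th percentile grew by more than `tolerance` times."""
    return percentile(after, p) > tolerance * percentile(before, p)


# Hypothetical latencies in milliseconds: steady before the deploy,
# with a handful of very slow requests afterwards.
before = [20 + (i % 10) for i in range(200)]
after = before[:-4] + [900, 950, 1000, 1100]

print(statistics.median(before), statistics.median(after))
print(percentile(before, 99), percentile(after, 99))
print(regressed(before, after))
```

In this data the median does not move at all, so an average-based dashboard would look healthy, while the p99 comparison flags the deploy.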
With a real-time understanding of full-system dependencies, developers spend far less time in meetings. There’s no longer a reason to wait around on a call to see who owns a particular service, or what the system looked like hours, days, or months before the most-recent deployment.
By revealing the critical path of end-to-end requests, and surfacing only the relevant data to resolve an issue, observability enables better workflows for debugging, performance optimization, and fire fighting.
By ruling out signals that are unlikely to have contributed to the root cause, developers can form and investigate more effective hypotheses.
For teams of all sizes, observability offers a shared view of the system. This includes its health, its architecture, its performance over time, and how requests make their way from frontend / web apps to backend and third-party services.
Observability provides context across roles and organizations, as it enables developers, operators, managers, PMs, contractors, and any other approved team members to work with the same views and insights about services, specific customers, SQL queries, etc.
Since observability tools enable automated capture of any moment in time, they serve as a historical record of system architecture, dependencies, and service health: both what was and what changed over time.
Observability drives more effective post-mortems following incidents, because it allows the team to revisit actual system behavior at the time of the incident, rather than rely on the recollections of individuals operating under a stressful situation.
Ultimately, by making a system observable, organizations are able to release more code, more quickly, and more safely.
What often determines whether your business is successful is the ability to change and ship new features. But that change is directly opposed to the stability of your systems. And so there’s this basic tension: you need to change so that you can expand your business, but whenever change is introduced, risk is introduced, and that risk could create negative outcomes for the business.
Observability resolves this basic tension. Businesses can make changes with higher levels of confidence and figure out whether those are or are not having the intended effect, and limit the negative impact of change.
The result is more confident releases at a higher velocity. You can deploy more frequently and with greater confidence, because you have tooling that will help you understand what goes wrong, isolate any issues, and make immediate improvements.
Ultimately, it allows you to keep customers happy with less downtime, new features, and faster systems.
To the best of our knowledge, observability does not benefit crows. They care not for the health of your systems, distributed or otherwise.
The Three Pillars of Observability
Historically, the three types of telemetry have been referred to as the “three pillars” of observability: separate data types often with their own dashboards.
The growing scale and complexity of software has led to changes in this model, however, as practitioners have not only identified the interrelationships between these types of telemetry data, but coordinated workflows involving them.
For example, time series metrics dashboards can be used to identify a subset of traces that point to underlying issues or bugs — and log messages associated with those traces can identify the root cause of the issue. Then, new metrics can be configured to more proactively identify similar issues before the next incident.
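The workflow described above (metrics point to traces, traces point to logs) can be sketched with a shared `trace_id`. The records and helper functions below are hypothetical and deliberately tiny.

```python
# Hypothetical telemetry, joined by trace_id.
traces = [
    {"trace_id": "t1", "route": "/cart", "duration_ms": 42},
    {"trace_id": "t2", "route": "/cart", "duration_ms": 1870},
    {"trace_id": "t3", "route": "/cart", "duration_ms": 39},
]
logs = [
    {"trace_id": "t1", "level": "INFO", "msg": "cart loaded"},
    {"trace_id": "t2", "level": "ERROR", "msg": "inventory lookup timed out"},
    {"trace_id": "t3", "level": "INFO", "msg": "cart loaded"},
]


def slow_traces(traces, threshold_ms):
    """Step 1: a latency metric crosses a threshold, so find the traces behind it."""
    return [t for t in traces if t["duration_ms"] > threshold_ms]


def logs_for(trace_ids, logs):
    """Step 2: pull only the log lines attached to those traces."""
    wanted = set(trace_ids)
    return [entry for entry in logs if entry["trace_id"] in wanted]


suspects = slow_traces(traces, threshold_ms=1000)
relevant = logs_for([t["trace_id"] for t in suspects], logs)
for entry in relevant:
    print(entry["level"], entry["msg"])  # ERROR inventory lookup timed out
```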
Also, when viewed in aggregate, traces can reveal immediate insights into what is having the largest impact on performance or customer experience, and surface only the metrics and logs that are relevant to an issue.
Let’s say there is a sudden regression in the performance of a particular backend service, deep in your stack. It turns out that the underlying issue was that one of your many customers changed their traffic pattern and started sending significantly more complex requests. This would be obvious within seconds after looking at aggregate trace statistics, though it would have taken days just looking at logs, metrics, or even individual traces on their own.
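The scenario above can be sketched as a simple group-by over span data. The customer names, the `items` attribute (standing in for request complexity), and the `latency_by` helper are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical spans for one backend service.
spans = [
    {"customer": "acme", "items": 3, "duration_ms": 40},
    {"customer": "acme", "items": 4, "duration_ms": 45},
    {"customer": "globex", "items": 250, "duration_ms": 900},
    {"customer": "globex", "items": 300, "duration_ms": 1100},
    {"customer": "initech", "items": 2, "duration_ms": 38},
]


def latency_by(spans, key):
    """Average span duration for each distinct value of `key`."""
    groups = defaultdict(list)
    for span in spans:
        groups[span[key]].append(span["duration_ms"])
    return {value: mean(durations) for value, durations in groups.items()}


by_customer = latency_by(spans, "customer")
worst = max(by_customer, key=by_customer.get)
print(worst, by_customer[worst])
```

Grouping by a different attribute (version, region, route) is a one-argument change, which is what makes aggregate trace analysis quick for ruling causes in or out.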
In short: observability is not simply telemetry — it’s how that telemetry is used to solve problems and ultimately create a better experience for customers.
Observability tools: Rules that cannot be broken
There are three main ways to make a distributed system observable. Like any choice, each comes with a set of benefits and costs (the latter of which may be literal).
Teams can build their own observability tools, work with open source software such as Jaeger or Zipkin, or purchase an observability solution.
Regardless of how you decide to make your system observable, there are a handful of vetted rules that apply to all observability solutions.
If integration is too difficult, it’s unlikely your project will ever get off the ground. Who has extra cycles for multi-month integration projects and testing? And, even if there is somehow availability, who wants to take on these sorts of projects?
Successful observability efforts connect with the tools you are already using. They should be able to support polyglot environments (with whatever languages and frameworks you use), integrate with your service mesh or container platform, and connect to Slack and PagerDuty or whatever system your on-call team prefers.
If the platform is too difficult to learn or use on a daily basis, it won’t become part of existing processes and workflows. Developers won’t feel comfortable turning to the tool during moments of high stress, and little improvement will be made in the health and reliability of the system.
Upfront sampling, random sampling, and most methods of sampling data simply won’t enable observability at scale. High-fidelity data is required to identify outliers or specific issues in a distributed system, as incident resolution may require the analysis of intermittent, infrequent, and rare events.
Reports, dashboards, and queries need to provide insight into what’s going on “right now,” so that developers can understand the severity of an issue or the immediate impact of their performance optimizations. When debugging a problem, or fighting an outage, information from ten minutes ago just isn’t going to cut it.
Systems at scale — or even relatively few microservices — produce an enormous amount of data. Far more than can be easily understood by humans without some sort of guidance. For an observability tool to be effective, it needs to make insights obvious. This includes interactive visual summaries for incident resolution and clear dashboards that offer an at-a-glance understanding of any event.
What is an investigation without context? Guesswork? Trial and error? An observability tool should guide its users toward successful incident resolution and system exploration by providing context every step of the way.
Temporal context: How does something look now versus one hour, one day, or one week ago? What did this look like 3 months ago? What did it look like before we deployed?
Relative context: How much has this changed relative to other changes in the system?
Relational context: What is dependent on this, and what is it dependent on? How will changes to this dependency chain affect other services?
Proportional context: What is the scope of an incident or issue? How many customers, versions, or geographies are affected? Are VIP customers more or less impacted?
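Two of these kinds of context, temporal and proportional, reduce to simple comparisons once the data is available. The sketch below uses hypothetical per-customer error counts for two time windows and an assumed `min_increase` cutoff.

```python
# Hypothetical error counts per customer, for two time windows.
one_hour_ago = {"acme": 2, "globex": 1, "initech": 0, "umbrella": 1}
now = {"acme": 3, "globex": 40, "initech": 0, "umbrella": 35}
vip_customers = {"globex"}


def temporal_change(before, after):
    """Temporal context: how each customer's error count changed."""
    return {name: after[name] - before.get(name, 0) for name in after}


def affected(before, after, min_increase=10):
    """Customers whose errors rose by at least `min_increase`."""
    changes = temporal_change(before, after)
    return {name for name, delta in changes.items() if delta >= min_increase}


hit = affected(one_hour_ago, now)
print(sorted(hit))                  # ['globex', 'umbrella']
print(len(hit) / len(now))          # proportional context: 0.5
print(sorted(hit & vip_customers))  # ['globex']
```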
Ingest, process, and analyze your data without latency. Additionally, it can’t be cost-prohibitive to do so, as data goes from terabytes to petabytes and beyond. An effective observability tool must then provide insights across services, which will likely require tracing.
This may seem obvious, but it can be all too easy to unintentionally conflate how a tool is perceived by a developer and how it actually does — or does not — drive business value. Ultimately, observability tools must improve the customer experience, increase developer velocity, and ensure a more reliable, resilient, and stable system at scale.