OpenTelemetry 101: What Is Observability?
by Austin Parker
You may be asking yourself, “What’s observability, anyway?” Observability is a topic that has been in the news a lot recently, and it seems that every application monitoring vendor is trying to rebrand as an ‘observability’ vendor. This paper aims to demystify observability, explain the concepts you need to know in order to understand OpenTelemetry, and show why this project matters.
The term ‘observability’ stems from control theory, an engineering discipline concerned with keeping dynamic systems in check, and refers to the ability to infer the internal state of a system from its external outputs. Cruise control for cars is an applied example of control theory. Under constant power, a car’s speed would decrease as it drives up a hill; to keep the vehicle’s speed consistent, an algorithm increases the power output of the engine when the measured speed drops. This is also an application of observability: the cruise control subsystem is able to infer the state of the engine by observing the measured output (in this case, the speed of the car).
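The cruise control example can be sketched as a small simulation. This is illustrative only: the proportional-integral controller, the toy vehicle model, and every name and constant here (`simulate_cruise_control`, `hill_drag`, `kp`, `ki`, `friction`) are assumptions made for this sketch, not a description of a real vehicle. The thing to notice is that the controller never reads `hill_drag`; it sees only the measured speed.

```python
def simulate_cruise_control(target_speed, hill_drag, steps=1000, dt=0.1):
    """Hold target_speed using only the measured speed as feedback.

    hill_drag is hidden internal state: the controller never reads it.
    Returns the final (speed, power) pair.
    """
    kp, ki = 2.0, 0.5      # controller gains (assumed values)
    friction = 0.1         # drag proportional to speed (toy model)
    speed = target_speed   # start at the set speed
    integral = 0.0         # accumulated speed error
    power = 0.0

    for _ in range(steps):
        # Controller: observes only the external output (speed).
        error = target_speed - speed
        integral += error * dt
        power = kp * error + ki * integral

        # Vehicle: its dynamics depend on the hill the controller cannot see.
        acceleration = power - friction * speed - hill_drag
        speed += acceleration * dt

    return speed, power


flat_speed, flat_power = simulate_cruise_control(25.0, hill_drag=0.0)
hill_speed, hill_power = simulate_cruise_control(25.0, hill_drag=3.0)
```

In both runs the speed settles at the target, but the power the controller settles on is higher on the hill, so the hidden load can be inferred from outputs alone.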
In software, observability generally refers to the ability to understand an application’s performance based on output data, or telemetry. In distributed systems, this telemetry can be divided into three major categories:
- Traces: contextual data about a request through a system
- Metrics: quantitative information about processes
- Logs: specific messages emitted by a process or service
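To make the three categories concrete, here is a minimal sketch of a single request emitting all of them. It uses only the Python standard library and is not the OpenTelemetry API; `handle_request`, the in-memory lists, and the record fields are hypothetical names chosen for illustration.

```python
import time
import uuid
from collections import Counter

spans = []                   # traces: one record per unit of work in a request
request_counts = Counter()   # metrics: aggregated numbers about the process
log_lines = []               # logs: individual messages emitted along the way


def handle_request(route):
    """Pretend to serve a request and emit all three kinds of telemetry."""
    trace_id = uuid.uuid4().hex          # context shared by this request
    start = time.perf_counter()

    log_lines.append({"trace_id": trace_id, "message": f"handling {route}"})
    request_counts[route] += 1

    duration = time.perf_counter() - start
    spans.append({"trace_id": trace_id, "name": route, "duration_s": duration})
    return trace_id


trace_id = handle_request("/checkout")
```

The span and the log line carry the same `trace_id`, which is what later allows them to be joined; the metric is an aggregate and carries no per-request context.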
Historically, these three verticals have been referred to as the “three pillars” of observability. The growing scale and complexity of software have led to changes in this model, however, as practitioners have not only identified the interrelationships between these types of telemetry data but also built coordinated workflows that involve all of them.
For example, time series metrics dashboards can be used to identify a subset of traces that point to underlying issues or bugs. Log messages associated with those traces can identify the root cause of the issue. When resolving the issue, new metrics can be configured to more proactively identify similar issues before the next incident.
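The metrics-to-traces-to-logs workflow described above might look like this in miniature. The data shapes, the `investigate` function, and the 5% error-rate threshold are assumptions made for this sketch; real telemetry backends expose this through their own query interfaces.

```python
def investigate(metric_points, traces, logs, threshold=0.05):
    """Walk from metrics to traces to logs, returning suspect log messages."""
    # 1. Metrics: find the time buckets where the error rate is abnormal.
    bad_minutes = {
        p["minute"] for p in metric_points if p["error_rate"] > threshold
    }

    # 2. Traces: narrow down to failed requests inside those buckets.
    suspect_ids = {
        t["trace_id"]
        for t in traces
        if t["minute"] in bad_minutes and t["error"]
    }

    # 3. Logs: pull the messages attached to the suspect traces.
    return [e["message"] for e in logs if e["trace_id"] in suspect_ids]


# Hypothetical sample data.
metric_points = [
    {"minute": 1, "error_rate": 0.01},
    {"minute": 2, "error_rate": 0.20},
]
traces = [
    {"trace_id": "a1", "minute": 1, "error": False},
    {"trace_id": "b2", "minute": 2, "error": True},
    {"trace_id": "c3", "minute": 2, "error": False},
]
logs = [
    {"trace_id": "a1", "message": "order placed"},
    {"trace_id": "b2", "message": "payment service timeout"},
    {"trace_id": "c3", "message": "order placed"},
]

messages = investigate(metric_points, traces, logs)
```

Each step shrinks the search space: the metric identifies when something went wrong, the traces identify which requests were affected, and the logs say what happened inside them.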
The ultimate goal for OpenTelemetry is to ensure that this telemetry data is a built-in feature of cloud-native software. This means that libraries, frameworks, and SDKs should emit this telemetry data without requiring end-users to proactively instrument their code. To accomplish this, OpenTelemetry is producing a single set of system components and language-specific libraries that can create, collect, and capture these sources of telemetry data and export them to analysis tools through a simple exporter model.
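One way to picture an exporter model is a pipeline that buffers telemetry and hands it to any number of interchangeable backends. The sketch below is a generic illustration of that idea, not OpenTelemetry's actual interfaces; `Exporter`, `Pipeline`, `InMemoryExporter`, and `ConsoleExporter` are hypothetical names, and the code assumes Python 3.9 or later.

```python
from typing import Protocol


class Exporter(Protocol):
    """Anything with an export() method can receive telemetry."""

    def export(self, records: list[dict]) -> None: ...


class InMemoryExporter:
    """Keeps records in a list, e.g. for tests."""

    def __init__(self) -> None:
        self.received: list[dict] = []

    def export(self, records: list[dict]) -> None:
        self.received.extend(records)


class ConsoleExporter:
    """Prints each record, standing in for a real analysis backend."""

    def export(self, records: list[dict]) -> None:
        for record in records:
            print(record)


class Pipeline:
    """Buffers telemetry and fans it out to every configured exporter."""

    def __init__(self, exporters: list[Exporter]) -> None:
        self.exporters = exporters
        self.buffer: list[dict] = []

    def record(self, kind: str, **fields) -> None:
        self.buffer.append({"kind": kind, **fields})

    def flush(self) -> None:
        for exporter in self.exporters:
            exporter.export(list(self.buffer))  # each exporter gets a copy
        self.buffer.clear()


memory = InMemoryExporter()
pipeline = Pipeline([memory, ConsoleExporter()])
pipeline.record("metric", name="requests", value=1)
pipeline.record("log", message="handling /checkout")
pipeline.flush()
```

Because the instrumented code talks only to the pipeline, switching analysis tools means swapping an exporter rather than re-instrumenting the application.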
In summary, observability in software is about integrating multiple forms of telemetry data, which together can help you better understand how your software is operating. It differs from traditional application monitoring in that it focuses on the relationships between those forms of telemetry rather than on any single one in isolation. Observability doesn't stop at the capture of telemetry data, however. The most critical aspect of the practice is what you do with the data once it's been collected. This is where a tool like Lightstep comes in handy, providing features such as correlation detection, historical context, and automatic point-in-time snapshots through unparalleled analysis of your telemetry data.
In the next part of this series, we'll take a deeper dive into telemetry data sources, starting with tracing.