By now, you’ve likely heard about OpenTelemetry, an open source observability framework created by the merger of OpenTracing and OpenCensus. You may be asking yourself, however, “what’s observability, anyway?” It’s a valid question – it’s a topic that’s been in the news a lot recently, and it seems that every application monitoring vendor is trying to rebrand as an ‘observability’ vendor. In this series of blog posts, I’ll demystify observability and explain the concepts you need to know in order to understand OpenTelemetry, and why it matters.
Observability as a term stems from control theory, an engineering discipline that concerns itself with how to keep dynamic systems in check. An applied example of control theory can be seen in cruise control for cars – under constant power, your speed would decrease as you drive up a hill. Instead, in order to keep your speed consistent, an algorithm increases the power output of the engine in response to the measured speed. This is also an application of observability – the cruise control subsystem is able to infer the state of the engine by observing the measured output (in this case, the speed of the car).
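To make the cruise control example concrete, here is a minimal sketch of that feedback loop. It is a toy model, not how a real vehicle controller is built: the drag coefficient, hill load, gains, and the `final_speed` function are all illustrative assumptions. The controller sees only the measured speed, yet that output is enough for it to work out how much extra power the hill demands.

```python
def final_speed(feedback, target=100.0, hill_load=5.0, steps=300):
    """Simulate a car hitting a hill, with or without speed feedback.

    All constants are made up for illustration; units are arbitrary.
    """
    drag = 0.5            # resistance proportional to speed (assumed)
    dt = 0.1              # simulation time step (assumed)
    kp, ki = 2.0, 0.2     # proportional and integral gains (assumed)

    base_power = drag * target  # power that holds the target speed on flat ground
    speed = target
    integral = 0.0

    for _ in range(steps):
        # The only thing the controller observes is the measured speed.
        error = target - speed
        integral += error

        # With feedback, power rises in response to the measured shortfall.
        correction = kp * error + ki * integral if feedback else 0.0
        power = base_power + correction

        # Toy dynamics: power speeds the car up, drag and the hill slow it down.
        speed += dt * (power - drag * speed - hill_load)

    return speed


print(f"constant power: {final_speed(feedback=False):.1f}")
print(f"with feedback:  {final_speed(feedback=True):.1f}")
```

In this model the car under constant power settles about ten units below the target, while the feedback version climbs back to the target speed on the same hill.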
In software, observability is a bit more prosaic, referring to the telemetry produced by services in an application. This telemetry data can be divided into three major forms:
- Traces: contextual data about a request through a system.
- Metrics: quantitative information about processes, such as counters and gauges.
- Logs: specific messages emitted by a process or service.
Historically, these three verticals have been referred to as the “three pillars” of observability. The growing scale and complexity of software have led to changes in this model, however, as practitioners have not only identified the interrelationships between these types of telemetry data, but also coordinated workflows involving them.
For example, time-series metrics dashboards can be used to identify a subset of traces that point to underlying issues or bugs. Log messages associated with those traces can identify the root cause of the issue. When resolving the issue, new metrics can be configured to more proactively identify similar issues before the next incident.
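The workflow described above can be sketched in a few lines. This is a simplified illustration rather than any particular tool's data model: the `Span` and `LogRecord` classes, the sample data, and the 500 ms threshold are all hypothetical, and the only real requirement is that traces and logs share an identifier so they can be joined.

```python
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    service: str
    duration_ms: float


@dataclass
class LogRecord:
    trace_id: str
    message: str


def slow_trace_ids(spans, threshold_ms):
    """A latency threshold (the metric signal) narrows the search to a subset of traces."""
    return {span.trace_id for span in spans if span.duration_ms > threshold_ms}


def messages_for(logs, trace_ids):
    """Log messages attached to those traces point toward the root cause."""
    return [record.message for record in logs if record.trace_id in trace_ids]


# Hypothetical sample data for illustration only.
SPANS = [
    Span("t1", "checkout", 42.0),
    Span("t2", "checkout", 1870.0),
    Span("t3", "cart", 35.0),
]
LOGS = [
    LogRecord("t1", "order placed"),
    LogRecord("t2", "payment gateway timeout, retrying"),
    LogRecord("t3", "cart updated"),
]

suspects = slow_trace_ids(SPANS, threshold_ms=500.0)
print(messages_for(LOGS, suspects))
```

Here the slow `checkout` request is the only trace over the threshold, and its log message is the one that explains the delay. The shared `trace_id` is what turns three separate data sources into a single investigation.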
The ultimate goal for OpenTelemetry is to ensure that this telemetry data is a built-in feature of cloud-native software. This means that libraries, frameworks, and SDKs should emit this telemetry data without requiring end-users to proactively instrument their code. To accomplish this, OpenTelemetry is producing a single set of system components and language-specific libraries that can create, collect, and capture these sources of telemetry data and export them to analysis tools through a simple exporter model.
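One way to picture an exporter model is as a small interface that sits between the code that records telemetry and the tools that analyze it. The sketch below is an assumption-laden illustration and not the OpenTelemetry API: `Exporter`, `InMemoryExporter`, and `TelemetryPipeline` are hypothetical names invented for this example.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class Exporter(ABC):
    """Hypothetical interface: anything that can receive a batch of telemetry."""

    @abstractmethod
    def export(self, batch: list[dict]) -> None:
        ...


class InMemoryExporter(Exporter):
    """Stand-in for an analysis backend; it simply keeps what it receives."""

    def __init__(self) -> None:
        self.received: list[dict] = []

    def export(self, batch: list[dict]) -> None:
        self.received.extend(batch)


class TelemetryPipeline:
    """Collects telemetry from instrumented code and hands it to every exporter."""

    def __init__(self, exporters: list[Exporter]) -> None:
        self._exporters = list(exporters)
        self._buffer: list[dict] = []

    def record(self, kind: str, **fields) -> None:
        # kind is a free-form label such as "trace", "metric", or "log".
        self._buffer.append({"kind": kind, **fields})

    def flush(self) -> None:
        # Swap the buffer out first so each record is exported exactly once.
        batch, self._buffer = self._buffer, []
        for exporter in self._exporters:
            exporter.export(batch)


backend_a = InMemoryExporter()
backend_b = InMemoryExporter()
pipeline = TelemetryPipeline([backend_a, backend_b])

pipeline.record("metric", name="requests", value=1)
pipeline.record("log", message="request handled")
pipeline.flush()

print(len(backend_a.received), len(backend_b.received))
```

The design point is that the instrumented code only ever calls `record`. Swapping or adding an analysis tool means registering a different exporter, not touching the instrumentation.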
In summary, observability in software is about integrating multiple forms of telemetry data, which together help you better understand how your software is operating. It differs from traditional application monitoring in its focus on the relationships between those forms of data. Observability doesn’t stop at the capture of telemetry data, however: the most critical aspect of the practice is what you do with the data once it’s been collected. This is where a tool like LightStep comes in handy, providing features such as correlation detection, historical context, and automatic point-in-time snapshots through unparalleled analysis of your telemetry data.
In the next part of this series, we’ll take a deeper dive into telemetry data sources, starting with tracing.