Why choose OpenTelemetry?
by Ted Young
I’ve written quite a bit about OpenTelemetry at this point, but almost all of it has been focused on explaining what OpenTelemetry does and how to use it. But why does OpenTelemetry even exist in the first place? What problem does it solve, and why does it scratch an itch for a lot of developers?
Personally, it was frustration that drove me to work on observability full time, and to focus on telemetry in particular. Telemetry – traces, logs, metrics, etc – is the language our systems use to describe what they are doing. And it felt like the traditional “three pillars” approach to generating telemetry was designed to make my life a nightmare.
So here, in a nutshell, are the top four reasons that motivated me to help create the OpenTelemetry project.
I always want to try new tools before I buy them. But having to rip and replace my entire telemetry pipeline in order to do that creates a serious headache.
With the OpenTelemetry Collector, you can add and remove providers with a simple configuration change.
Here’s why. When you exchange one set of instrumentation for another, you’re not just switching out code; you’re also changing what data is emitted. Even something as simple as swapping out one Java Agent for another will have this effect. The new data won’t work with the old system. Not only will the new data be in an incompatible format, its content will be completely different – different metrics, different labels, different logs, etc. So even if you translated the new data into the old format, your current dashboards and alerts would still be broken.
But with OpenTelemetry, you can now send the same telemetry to almost every observability provider. And you can tee the data off to multiple providers at the same time. This makes trying out new services easy.
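As a rough sketch, a Collector configuration that tees the same trace data off to two providers at once might look like this (the exporter names and endpoints here are placeholders, not real provider addresses):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  # Hypothetical endpoints -- substitute your providers' actual OTLP endpoints.
  otlphttp/provider-a:
    endpoint: https://otlp.provider-a.example.com
  otlphttp/provider-b:
    endpoint: https://otlp.provider-b.example.com

service:
  pipelines:
    traces:
      receivers: [otlp]
      # Both exporters receive the same data; dropping a provider
      # is a one-line change here, with no application redeploy.
      exporters: [otlphttp/provider-a, otlphttp/provider-b]
```

Adding a third provider to try out, or removing one you’ve decided against, is just an edit to the `exporters` list.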
There’s a real observer’s paradox with telemetry. Managing a high volume telemetry pipeline can be a real beast, and operators often need to make changes quickly, in a coordinated fashion across the entire deployment. If making those changes involves reconfiguring and restarting applications, operators have to risk impacting the system.
In some cases, they may require an application developer to make the changes for them. This can remove quite a bit of agency from the operator. It can be especially painful when applications go through a complex release pipeline, where they end up running in many different environments (integration testing, staging, load testing, etc) all of which have different telemetry setups.
When running OpenTelemetry, applications can stick to the default OTLP settings. Instead of making configuration changes in the application, telemetry routing and processing can be managed using pools of Collectors. These Collector deployments can be fully controlled by the operator, making telemetry management a separate concern from application management. Operators can make updates whenever they want, without accidentally affecting production.
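For instance, the application can simply emit OTLP to the Collector’s standard local endpoint, while all routing and processing decisions live in the operator-owned Collector config. A sketch of what that Collector side might look like (the sampling percentage and backend endpoint are illustrative, not recommendations):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # the default OTLP/gRPC port applications expect

processors:
  batch: {}
  probabilistic_sampler:
    sampling_percentage: 25      # illustrative value; the operator can tune this
                                 # without touching any application

exporters:
  otlphttp:
    endpoint: https://telemetry.internal.example.com   # placeholder backend

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, probabilistic_sampler]
      exporters: [otlphttp]
```

Each environment – staging, load testing, production – can run its own variant of this config while the applications themselves stay identical.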
Large-scale production systems need to handle huge numbers of concurrent requests, all attempting to utilize the same resources at the same time. These complex, emergent interactions end up generating all kinds of unexpected and unfortunate behavior. Because these issues are ephemeral and only emerge under certain conditions, they can be difficult to diagnose.
An exciting recent development in observability is the use of machine learning and other statistical tools to identify emergent patterns of bad behavior. But there’s a telemetry problem – logs, metrics, traces, RUM, and other data types are traditionally kept in completely separate systems. It should go without saying, but automated analysis can’t find correlations between two data points when they aren’t stored in the same place, or otherwise connected in any way.
OpenTelemetry integrates logging, metrics, tracing, and resources into a single data structure that is ideal for finding correlations and other forms of statistical analysis.
OpenTelemetry solves this with a unified protocol, OTLP. This is more than just putting traces, logs, and metrics next to each other in the same pipe: it is highly integrated data, which can only be generated by context-aware instrumentation.
For example, OpenTelemetry has trace exemplars. Whenever metrics are emitted, OpenTelemetry will correlate those metrics with a sampling of traces. So, when counting status codes, the counts are linked to the traces of requests which produced those status codes. And when measuring RAM or CPU usage, the measurements are linked to traces of the requests which were active on that machine at the time. And when I look at any of these traces, I want to also see the logs.
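To make this concrete, here’s a plain-Python sketch – not the actual OpenTelemetry SDK – of what an exemplar-carrying metric data point contains: the count itself, plus the trace and span IDs of a sampled request that contributed to it. The class and field names are mine, loosely modeled on the OTLP metrics data model:

```python
from dataclasses import dataclass, field

@dataclass
class Exemplar:
    # IDs of one sampled request that contributed to the metric value,
    # letting an analysis tool jump from the count to a concrete trace.
    trace_id: str
    span_id: str
    value: int

@dataclass
class CounterPoint:
    # A metric data point with attached exemplars (hypothetical names,
    # illustrating the shape of the data rather than the real SDK API).
    name: str
    attributes: dict
    count: int = 0
    exemplars: list = field(default_factory=list)

    def add(self, value, trace_id=None, span_id=None):
        self.count += value
        if trace_id is not None:
            self.exemplars.append(Exemplar(trace_id, span_id, value))

# Counting HTTP 500s, linked back to the trace of a request that produced one:
point = CounterPoint("http.server.responses", {"status_code": 500})
point.add(1, trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
          span_id="00f067aa0ba902b7")
```

The point is that the metric and the trace reference travel together, so a backend never has to guess which requests are behind a spike.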
This kind of integrated telemetry is designed to power modern observability systems, which use machine analysis to surface correlations across all of these signals.
While we don’t need standards for everything, data protocols are one place where they can be extremely useful. OpenTelemetry isn’t just for application developers; having a standard also enables OSS libraries, databases, and managed services to participate in observability.
OSS code is run by many different organizations, all of which have made different choices about what observability system they want to use. When the only instrumentation options available are proprietary, or open but tied to a specific observability platform, it’s hard to emit telemetry from these shared libraries and services. Making telemetry work for OSS is an important goal for the OpenTelemetry project. That’s why we work so hard to ensure that OpenTelemetry is stable, and works with every observability system.
These four reasons are why I work on OpenTelemetry. If some of those reasons resonate, let me know! This was just a quick overview, but if you want an in-depth, deep-dive into why OpenTelemetry exists and how to use it, check out my O’Reilly report on OpenTelemetry and the future of Observability.