Deployment strategies for OpenTelemetry
by Ted Young
In this blog post, we will explore important changes in how we practice observability, through the lens of progressively deploying OpenTelemetry across a large organization.
Currently, we are in the initial stages of a large paradigm shift taking place in the world of observability. It is probably the biggest change I’ve experienced in my career, which is longer than I care to mention. And, as I hope you’ll see by the end, OpenTelemetry is a key part of moving us to a new set of practices, which I like to call Modern Observability.
To me, Modern Observability is what happens when we move past the three pillars model of separate, siloed data streams, and into a new model of integrated data, braided together into a single stream. Along the way, I’d also like to showcase several short-term, practical benefits that help us adopt the tools we want to use, when we want to use them, while managing the complex hodgepodge of legacy tools which often builds up around large, long-lived computer systems.
Let’s say you’ve heard of OpenTelemetry, and you’re wondering if you should use it. But, your services are already emitting quite a bit of telemetry. And from experience, you know that switching telemetry systems can often be very painful.
- Re-instrumenting takes time
- Load patterns may shift
- Data will definitely shift
- Data needs to keep flowing - no gaps in coverage
- Traces need to stay connected
So given that, what’s the best approach to adopting OTel? And what would you get from switching to it, even if you’re happy with your current setup? To keep things practical, we’ll explain everything by walking through an OpenTelemetry transition, step by step, and discuss the value each step unlocks. We’ll review how to execute a seamless transition, explain why this will be the last transition, show how OpenTelemetry will create a fundamental shift in our observability practices, and highlight all the benefits you will gain along the way.
Let’s start by imagining a large computer system, managed by a large organization. The system is patchwork – some parts were built in different eras, other parts were built by different teams, and several parts were glued on through multiple acquisitions. This complex system is running a variety of different observability tools, because the authors of each portion all chose different solutions.
How can we clean up this mess, without losing any visibility along the way?
We want to bring the new telemetry system online, but we don’t want to disrupt the current system. To do this, we start with a middleman: the OpenTelemetry Collector.
The Collector is a stand-alone service for processing telemetry. It has a wide variety of benefits – once operators become familiar with it, the Collector often becomes their favorite Swiss Army knife. To start our seamless transition, we’re going to set up the collector as a translation service.
The Collector functions as an extensible pipeline. The architecture is simple: Receivers, Processors, and Exporters.
We’ll start with receivers. Receivers accept different telemetry protocols. The default receiver is for OTLP, OpenTelemetry’s native format. But receiver plugins are available for many popular tracing, metrics, and logging protocols: Zipkin, Jaeger, Prometheus, StackDriver, Fluent Forward, StatsD. Just about everything. Currently, there are 45 supported receivers in the Collector-Contrib repository, and both push and pull models of telemetry are supported.
Collectors are configured via YAML. No, I’m not exactly a huge fan of YAML, either. But being able to define these pipelines in a declarative fashion is very powerful, and YAML is convenient for this purpose.
To build a translation service, start by defining receivers for every type of telemetry your system produces. Traces, metrics, logs – all of it. For example, a StatsD receiver would look like this:
```yaml
receivers:
  statsd:
    endpoint: "localhost:8127"
    aggregation_interval: 70s
    enable_metric_type: true
    is_monotonic_counter: false
    timer_histogram_mapping:
      - statsd_type: "histogram"
        observer_type: "gauge"
      - statsd_type: "timing"
        observer_type: "gauge"
```
Next, define a set of exporters that matches your current telemetry. Exporters are just like receivers, but in reverse – they produce data in a variety of formats. Here’s an example exporter for Jaeger:
```yaml
exporters:
  jaeger:
    endpoint: jaeger-all-in-one:14250
    tls:
      cert_file: file.cert
      key_file: file.key
```
Once you have your receivers and exporters, define a pipeline which connects them together. This effectively turns the Collector into a proxy - what comes in, goes out.
```yaml
service:
  pipelines:
    metrics:
      receivers: [opencensus, prometheus]
      exporters: [opencensus, prometheus]
    traces:
      receivers: [opencensus, jaeger]
      processors: [batch]
      exporters: [opencensus, jaeger]
```
These collectors can then be deployed locally on every machine. Begin routing all of your traffic through these collectors, so they now sit in the middle of your telemetry pipeline.
Why deploy a proxy? What’s the point? For starters, it lets you move your existing traffic over without any disruption, so it’s a safe and controlled way to introduce a new software component. It’s easier to verify that nothing changed than to deploy a new component and new behavior at the same time.
But, to avoid extra overhead, the Collector can quickly begin to replace other pieces of your pipeline, which are now redundant. The Collector works as a replacement for most telemetry services, such as:
- Prometheus servers, providing TSDB sources for scaling solutions such as Cortex, Thanos, etc.
- Agents which collect host metrics
- Log processors
- Trace buffers
Basically, there’s no need to run separate services for processing and transmitting metrics, traces, and logs. All of these various jobs can be folded into the collector. This saves on cost, and simplifies your deployment topology.
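For instance, a single Collector config can take over both host-metric collection and Prometheus scraping. Here’s a sketch using the hostmetrics and prometheus receivers from the Collector-Contrib repository – the scrape targets and intervals are illustrative examples, not recommendations:

```yaml
receivers:
  # Replaces a separate host-metrics agent.
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu:
      memory:
      disk:
      network:
  # Replaces a Prometheus server's scrape loop.
  # The job name and target below are hypothetical.
  prometheus:
    config:
      scrape_configs:
        - job_name: "app-metrics"
          scrape_interval: 15s
          static_configs:
            - targets: ["localhost:9100"]
```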
Ok, so you have collectors deployed, all telemetry is running through them, and you’ve retired any other services which are now redundant. What new capabilities does this deployment give you? What you get at this stage is flexibility.
For example: want to change backends for tracing, metrics, or logs? Just add those new backends by defining additional exporters. No need to re-instrument your system — OpenTelemetry does the translation for you. Likewise, you can take your hodgepodge of existing observability systems and begin to consolidate them.
For example, let’s say your system was emitting three different types of metrics. Pick one metric type as the solution you want to use, then translate the other metric types into that format. You can tee the data off to both systems during the transition; once your dashboards have been recreated in the system of choice, shut down the other two metrics systems.
This kind of flexible, extensible exporting allows you to try out new observability solutions without disruption. Overlapping rollout means no downtime. You can even perform a bake-off: try several solutions at once, and compare their features using the same data. No reinstrumenting. No giant lift. Just reconfigure your Collectors. You don’t even need to restart your applications.
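A bake-off is just a pipeline with more than one exporter. The prometheusremotewrite and datadog exporters below are real Collector-Contrib components, but treat the pairing as an example – substitute whichever candidate backends you are evaluating, and assume the receivers and exporters are defined elsewhere in the same config:

```yaml
service:
  pipelines:
    metrics:
      receivers: [statsd, prometheus]
      processors: [batch]
      # The same metric stream fans out to both candidate backends,
      # so you can compare their features using identical data.
      exporters: [prometheusremotewrite, datadog]
```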
Collectors also include a robust processing pipeline. Between Receivers and Exporters, all telemetry is converted into OTLP. Since processors are written against OTLP, they are flexible and reusable across all types of input. You can install processors to scrub sensitive information, normalize the data coming in from different sources, and generate new metrics from existing metrics, logs, and traces. Potentially, you can also reduce cost by applying sampling algorithms.
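As a sketch of what scrubbing might look like, the attributes processor (a standard Collector component) can delete or hash values by key – the specific attribute names here are hypothetical:

```yaml
processors:
  attributes:
    actions:
      # Drop a sensitive header entirely (key name is an example).
      - key: http.request.header.authorization
        action: delete
      # Hash a user identifier instead of sending it in the clear.
      - key: user.email
        action: hash
  # Batch telemetry before export to reduce outbound connections.
  batch:
```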
The Collector is also a good place to attach resources – metadata which describes the services producing the telemetry. Kubernetes pods, host names, regions, and other cloud identifiers are good examples of resources. The Collector can also capture the usual machine metrics, such as CPU, memory, disk, and network usage.
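A minimal sketch of resource attachment, using the resourcedetection processor from Collector-Contrib (cloud-specific detectors for AWS, GCP, and Azure also exist; which ones you enable depends on where you run):

```yaml
processors:
  resourcedetection:
    # Detectors run in order; each fills in resource attributes
    # (host name, OS, environment variables, etc.) that earlier ones missed.
    detectors: [env, system]
    timeout: 5s
```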
Okay, that’s the first stage of our journey. To review the value proposition for deploying the OpenTelemetry Collector on its own: over time, large organizations end up managing complex workloads, which can include a hodgepodge of different observability solutions, creating a fractured environment that is difficult to manage. The Collector knits all of these services together into a single observability pipeline. Operators can control this pipeline through configuration, progressively adapting the flow of data as they adopt and shift to new solutions, without writing any code or redeploying any applications.
This flexibility is an important piece of modern observability. Use the Collector to future-proof your system while paving over any legacy lumpiness which is already present.
Speaking of the future - what are some of the deeper changes coming to our practice of observability? To discuss that, let’s look at how to install OpenTelemetry clients in all of our applications, so that we can begin exporting OTLP - a unified data format.
Installing OpenTelemetry is different in every language, but the basics are the same. The client has two parts: an API for writing instrumentation, and an SDK for processing telemetry.
Because the OpenTelemetry API is backwards compatible and free of dependencies, it can eventually be baked natively into OSS libraries, making observability a core feature that libraries can provide. But for now, you will need to install instrumentation plugins provided by OpenTelemetry.
It is extremely important to install instrumentation for all major libraries your application uses - every application framework and web server, as well as all HTTP, RPC, messaging, and database clients. Missing instrumentation can break tracing – make sure that instrumentation is available for all of your libraries, and that you have installed it properly.
Besides instrumentation, you need to install the OpenTelemetry SDK so that you can start sending data. Like the collector, the SDK can be configured with a variety of exporters. But it is best to use the OTLP exporter – we’ll see why in a minute.
Instead of configuring the exporter in the SDK, use the default configuration, which sends OTLP data to localhost. Run a local collector to receive that data, and move all of your configuration there: operators can configure the collector to attach resource metadata, record machine metrics, and export data in the appropriate formats. This creates a clean separation of concerns – applications (and application developers) do not need to know anything about the telemetry setup, which is usually deployment-specific and may change significantly as an application moves from development to load testing to production. Operators manage the collectors, which gives them complete control over the telemetry pipeline without needing to bother developers or restart applications to make a configuration change. And if the application dies, you won’t lose any buffered data.
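Put together, a local collector for this setup might look something like the following sketch – the OTLP port is the gRPC default, and the Jaeger backend is just a stand-in for whatever exporter you chose earlier:

```yaml
receivers:
  # Accept OTLP from SDKs using their default localhost configuration.
  otlp:
    protocols:
      grpc:
        endpoint: "localhost:4317"
processors:
  # Attach deployment-specific resource metadata on behalf of the app.
  resourcedetection:
    detectors: [env, system]
  batch:
exporters:
  jaeger:
    endpoint: jaeger-all-in-one:14250
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [resourcedetection, batch]
      exporters: [jaeger]
```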
Given that re-instrumenting every service may take a fair amount of work, how should you roll out OpenTelemetry across your organization? This is where deploying the collector first comes in handy. Because the collector translates any input into the output you want, you can move an application over to OpenTelemetry while still sending data to the same backend. However, because the instrumentation has changed, the content of the data will be different - the keys and values will probably be slightly different. To account for this, you should clone your existing dashboards, then modify the duplicate dashboards to use the new OpenTelemetry data.
As a migration strategy, I recommend taking an end-to-end approach. Pick a specific target, such as a high-value transaction, and instrument all of the services involved in that transaction. Then move on to other targets, until all services have been migrated. This is more coherent than a patchwork or scattershot approach. If you’re adding tracing for the first time, focusing on instrumenting valuable transactions helps ensure that complete traces are being created, and you can start to investigate important issues early, without having to wait for the entire organization to complete its migration.
Okay, so what do you get for all of this work? Changing instrumentation is definitely a pain. So what’s the benefit?
The benefit is unified data. OpenTelemetry records all signal types at once - traces, metrics, and logs. And future signals such as eBPF, RUM, and profiling will eventually be added with no extra work. And unlike prior systems, all of this data is correlated, organized into a structured graph.
Every time you log, those logs are automatically connected to traces. This means you can instantly find all of the logs in a single transaction, no matter how many services participated in that transaction. Normally, this is a real pain – finding all of the logs and stitching them together takes a surprising amount of work, even in the 21st century. But when all the logs are marked with trace IDs, it’s a single lookup.
Likewise, every time you emit a metric, that metric is connected to a sampling of traces, called trace exemplars. This makes it easy to move between metrics and traces. For example, when looking at a spike in HTTP 500 status codes, you could immediately click through to the traces which generated those 500s, looking at both the logs and the trace graphs. This is also a real pain in today’s systems – metrics are completely separate from traces and logs, so you have to guess which ones are relevant.
Last but not least, all telemetry is connected to the resources which created it. By mapping metrics and traces to resources, a complete topology of the entire distributed system emerges.
This structured data supports common workflows - exploring our system by starting with one piece of data, then looking at relevant aggregates of data, then looking at examples of those aggregates, then looking at other aggregates those examples are part of. This is how we investigate our systems today, but we are slowed down considerably by the effort of manually collecting the relevant information and piecing it together.
When we are searching through all of this data, we are often looking for correlations. What shifts in metrics, traffic patterns, log messages, or trace attributes might be associated with an alert? Or a deployment? Or a spike in latency? Traditionally, we rely on our eyeballs to find these correlations – scanning through dashboards to see if a visual pattern emerges. In fact, I’ve often resorted to slapping a ruler up onto the monitor to help me find which little lines all went squiggly at the same time.
The point is, you need to correlate across multiple data sources before you can begin solving your problems. But we shouldn’t have to use our eyeballs to do this. If all of this data is connected into a single traversable graph, then computer analysis can be leveraged to automatically find these correlations for us.
Correlation does not equal causation - operators will still need to interpret the meaning of every correlation. But think about how much time we currently spend pawing through data and hunting for correlations – it’s substantial. So substantial that it makes us cautious about pursuing lines of inquiry. When collecting and reviewing the data takes time, and time is of the essence, you only get so many guesses. Automating this process is a huge win for observability.
Over the next year, observability tools are going to shift from analyzing one kind of data - traces, metrics, logs, etc - to analyzing all types of data together. That’s the only way to move freely between data types, and bring machine analysis into the picture. But in order to provide these tools, the data coming in has to be unified and cross indexed - metrics, traces, and logs connected to each other using trace IDs and resources.
So, this is the value proposition. Switch to a unified telemetry processing system – the Collector – to wrangle your existing telemetry into a single, extensible pipeline. Then, progressively deploy OpenTelemetry clients across your services so that you can begin to emit OTLP. This prepares you for automated analysis.
Many different observability platforms will begin adding features which take advantage of OTLP. As new platforms and features come out, you can try them out easily just by teeing the data off, and discover which ones are right for you. At the same time, many databases and hosted services will also begin to emit OTLP, allowing your application traces to continue deep inside the data storage systems you use, creating a layer of insight previously unavailable.
Today, OpenTelemetry tracing is stable. Metrics and logs are expected to be stable in early 2022. OpenTelemetry takes stability very seriously: we don’t mark something as stable until it is ready for long-term support – stable APIs and data formats must be backwards compatible for the remainder of their lifetime. Use this as a guideline when considering your adoption strategy. But, the sooner you start, the closer you will be to making the last telemetry transition you will ever need to make.