The Big Pieces: OpenTelemetry Collector design and architecture
by Ted Young
Welcome back to The Big Pieces, a weekly series focused on the high-level design of OpenTelemetry. In this installment, we’re covering how OpenTelemetry handles the transmission part of that “telemetry” term.
We covered client architecture in detail last week. Code is instrumented using a clean, low-dependency API. The API is implemented by the SDK, a data processing framework. The SDK includes Exporters, which are framework plugins for sending data in various formats. This allows clients to send data directly to your storage system of choice, without running any collectors.
A standard configuration is to run the SDK with the OTLP exporter over gRPC (OTLP is OpenTelemetry’s native format), pointed at the default address of localhost:4317, where it will expect a Collector to be listening.
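To make the default concrete, here is a minimal standard-library sketch of how a client might decide where to send OTLP/gRPC data. It assumes the `OTEL_EXPORTER_OTLP_ENDPOINT` environment variable as the override and falls back to localhost:4317. The function name `resolve_otlp_endpoint` is hypothetical, and real SDK exporters do this resolution themselves; this is an illustration, not the SDK's implementation.

```python
import os
from urllib.parse import urlsplit

# Default OTLP/gRPC address mentioned in the text.
DEFAULT_OTLP_GRPC_ENDPOINT = "http://localhost:4317"


def resolve_otlp_endpoint(environ=None):
    """Return (host, port) for the OTLP/gRPC target (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    raw = environ.get("OTEL_EXPORTER_OTLP_ENDPOINT") or DEFAULT_OTLP_GRPC_ENDPOINT
    # Accept a bare "host:port" by adding a scheme so urlsplit can parse it.
    if "://" not in raw:
        raw = "http://" + raw
    parts = urlsplit(raw)
    # Fall back to the default gRPC port when none is given.
    return parts.hostname, parts.port or 4317
```

With nothing configured, the client targets the local Collector; setting the variable redirects it without touching application code.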
The collector is a standalone service for transmitting observations. It follows the pipeline pattern: receivers, processors, and exporters can be chained together to form pipelines. Data can be received, processed, and exported in a variety of formats, and along the way the collector can buffer data, manage configuration, and convert from one format to another.
Collectors are configured via YAML files. In-depth documentation and details on the format can be found in the Collector documentation. Let's cover the basics.
- Receivers: Receivers ingest data from a variety of popular sources and formats, such as Zipkin and Prometheus. Running multiple types of receivers can help with mixed deployments, allowing for a seamless transition to OpenTelemetry from older systems.
- Processors: Processors allow traces, metrics, and resources to be manipulated in a variety of ways, such as batching, filtering, or modifying attributes.
- Exporters: Collector exporters are the same as client exporters. They take completed data from OpenTelemetry’s in-memory buffer and flush it to various endpoints in a variety of formats. Exporters can be run in parallel, and data may be efficiently fanned out to multiple endpoints at once. For example, you can send your trace data to both Lightstep and Jaeger while sending your metrics data to Prometheus.
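The pipeline pattern described above can be sketched in a few lines of Python. This is a toy model, not the Collector's implementation: the `Pipeline` class, the `Span` type, and both processor functions are hypothetical names invented for illustration, and plain lists stand in for real backends. It shows a batch passing through processors in order and then fanning out to every exporter.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Span = Dict[str, str]  # hypothetical stand-in for a telemetry record


@dataclass
class Pipeline:
    processors: List[Callable[[List[Span]], List[Span]]] = field(default_factory=list)
    exporters: List[Callable[[List[Span]], None]] = field(default_factory=list)

    def consume(self, batch: List[Span]) -> None:
        """Entry point a receiver would call with newly ingested data."""
        for process in self.processors:
            batch = process(batch)  # processors run in order
        for export in self.exporters:
            export(list(batch))  # fan out: each exporter gets its own list


def drop_health_checks(batch: List[Span]) -> List[Span]:
    # Filtering processor: discard noisy health-check spans.
    return [s for s in batch if s.get("name") != "/healthz"]


def add_environment(batch: List[Span]) -> List[Span]:
    # Attribute processor: tag every span with an assumed environment name.
    return [{**s, "deployment.environment": "staging"} for s in batch]


# Two in-memory "backends" stand in for real export destinations.
backend_a: List[Span] = []
backend_b: List[Span] = []
pipeline = Pipeline(
    processors=[drop_health_checks, add_environment],
    exporters=[backend_a.extend, backend_b.extend],
)
pipeline.consume([{"name": "/checkout"}, {"name": "/healthz"}])
```

After the call, both backends hold the same single processed span, which mirrors sending one stream of trace data to two destinations at once.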
Once the collectors have finished their transformations, OpenTelemetry data ends its journey by being handed off to some form of stable storage where analysis can be performed. By design, OpenTelemetry does not provide any analysis tools or long-term storage system; we’re focused on standardizing how systems describe themselves. What you do with that data is another matter, and we hope to see many great analysis tools designed to leverage OpenTelemetry’s data model.
Minimizing impact on the underlying system is a primary goal. The first tenet of OpenTelemetry design is “do no harm.”
What types of impact would we like to minimize, in this case?
- Rebooting application processes just to manage configuration changes
- Stealing system resources from the application process
- Slowing application shutdown while waiting for data to flush
A couple of deployment choices can help alleviate these issues. Take the rebooting issue first. To mitigate it, run the OpenTelemetry clients as close to their default configuration as possible, pointed at a local collector. Configuration changes can then be made by restarting the collector, not the application process. The local collector can also measure system metrics such as CPU and memory usage for your application.
While a local collector is good, it can’t solve the overhead issue. To manage telemetry at scale, a pool of data processing collectors can be run on separate machines in the same private network. This allows the local collector to avoid spending application resources, and allows the data processing to occur on machines where the collector is allowed to utilize the whole machine.
And that’s the basic design of the telemetry pipeline. Hopefully this high-level advice helps you understand the project components and decide how best to set up your own OpenTelemetry deployment.
Interested in more? Check out our latest video on how to instrument our OpenTelemetry Launchers.