The Life of an OTEP: The Story of Context Propagation in OpenTelemetry
by Matt Wear
Within the OpenTelemetry project, the process used to propose and discuss broad-impact changes is called an OpenTelemetry Enhancement Proposal, or OTEP for short. This concept isn’t unique to the OpenTelemetry project, and it's possible you've seen a similar process in place in the Rust, Kubernetes, and/or OpenTracing communities.
In OpenTelemetry, OTEPs are generally required when the proposed changes would have an impact across languages and implementations, and when what is being proposed would require a change in user behavior or would modify requirements.
An OTEP usually begins when an individual or small group of people finds an aspect of OpenTelemetry that they feel could use improvement. Their vision takes form as a document describing the technical details and reasoning behind their proposal. The individual or group then champions the proposal and shares it with the larger community, where the ideas are discussed, refined, and ultimately accepted or rejected.
The process of merging OpenTracing and OpenCensus to form OpenTelemetry has been a big undertaking. It has required a delicate balance: preserving the features of the previous projects while introducing improvements to the new project where they make sense. As the project has evolved and people have begun to put its functionality to use, it's not uncommon to find aspects that need to be refined. When people started to put the context system to use, concerns began to surface over its design. In the merger, OpenTelemetry inherited three subtly different types of context, and each of these types had to be managed and propagated separately. It was becoming clear that this system needed further attention and improvement.
Without getting too deep into the technical weeds, context propagation is essential for distributed tracing and distributed correlations. When a service sends a request to another service, the caller needs to send context to the callee so that the callee can continue the trace and gather correlations. Context is propagated into a request to another service via an `inject` operation. A receiving service uses the `extract` operation to restore the context received from a request. At the time we started working on a recent OTEP, each type of context needed to be injected or extracted separately, meaning that for n different context types there were n calls to `inject` or `extract`. To further complicate matters, some applications need to propagate the same type of context in multiple formats, leading to a combinatorial explosion at inject and extract call sites.
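To make the problem concrete, here is a minimal sketch of what such a call site could look like. It is an illustration only: the function names, header names, and the two formats (`format-a`, `format-b`) are hypothetical and are not taken from the OpenTelemetry API.

```python
from typing import Dict

Headers = Dict[str, str]


def inject_trace(trace_id: str, headers: Headers, fmt: str) -> None:
    """Hypothetical injector for trace context in a single wire format."""
    headers[f"{fmt}-trace"] = trace_id


def inject_correlations(correlations: str, headers: Headers, fmt: str) -> None:
    """Hypothetical injector for correlation context in a single wire format."""
    headers[f"{fmt}-correlations"] = correlations


def call_downstream(trace_id: str, correlations: str) -> Headers:
    """Build outgoing headers; every (context type, format) pair needs its own call."""
    headers: Headers = {}
    inject_trace(trace_id, headers, "format-a")
    inject_trace(trace_id, headers, "format-b")
    inject_correlations(correlations, headers, "format-a")
    inject_correlations(correlations, headers, "format-b")
    return headers


outgoing = call_downstream("trace-123", "user=42")
```

With two context types and two formats, the call site already carries four `inject` calls, and the receiving side needs a matching set of `extract` calls. Adding a third context type or a third format grows that list at every call site in the application.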
Lightstep realized the importance of getting context propagation right and decided to take that task on. We began writing OTEP 66 for a new approach to context propagation. The first step was to lay out our vision in broad strokes in a technical document. But while a well-written document is necessary in the OTEP process, it takes more than that to get the community engaged and to demonstrate that the ideas are sound. A proposal needs to be thoroughly thought out and shown to be viable across the supported languages in the OpenTelemetry ecosystem. This is no small task with nearly a dozen supported languages and the varying paradigms among them.
Based on our experience in OpenTelemetry and dating back to OpenTracing, we have found “debating in code” to be a very successful strategy. Reading words and reading code are two separate processes. Ideas may seem reasonable in prose, but can become complicated when it comes to implementation, so demonstrating that an idea works in code can help alleviate concerns, prevent debates, and keep discussions on track. This process helps the group to focus on actual problems, rather than concerns and what-if scenarios that arise when there is not enough information to go on.
Code often stimulates discussion far better than the written word. When developers see code, they have opinions, sometimes very strong ones, that might not surface when reading a technical document. Having written code helps to encourage community participation.
To facilitate discussion and smooth out any rough spots in the design of OTEP 66, Lightstep engineers built prototypes in five separate languages. We were joined in our efforts by engineers from other organizations, who contributed and built prototypes of their own. This helped to demonstrate that the approach was viable across the breadth of languages supported by OpenTelemetry. By writing code alongside the OTEP, we created a feedback loop in which the coding exercise informed and refined the OTEP, and vice versa. Having working examples in languages that participants are comfortable with helped to validate the ideas and ground the conversation. While we had a small group of people driving the process, the prototypes encouraged community participation and input on the design.
OTEP 66 took over two months to write and included 79 commits and 224 comments. Through our prototypes and the OTEP process, we were able to come up with a unified representation of the multiple, nuanced types of context, along with simple APIs to interact with it. We were able to simplify inject and extract to single calls regardless of the types and formats of context being propagated, avoiding the combinatorial explosion mentioned previously. Additionally, the resulting system is extensible to other forms of context and custom formats, ensuring it can accommodate unique and yet-to-be-imagined use cases.
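The shape of that result can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the API defined by OTEP 66: the `Context` alias, the `KeyPropagator` and `CompositePropagator` classes, and the header names are all hypothetical.

```python
from typing import Dict, List, Protocol

Context = Dict[str, str]  # hypothetical unified context: one bag for every context type
Headers = Dict[str, str]


class Propagator(Protocol):
    def inject(self, context: Context, headers: Headers) -> None: ...
    def extract(self, context: Context, headers: Headers) -> Context: ...


class KeyPropagator:
    """Hypothetical propagator that carries one context key under one header name."""

    def __init__(self, key: str, header: str) -> None:
        self.key = key
        self.header = header

    def inject(self, context: Context, headers: Headers) -> None:
        if self.key in context:
            headers[self.header] = context[self.key]

    def extract(self, context: Context, headers: Headers) -> Context:
        if self.header in headers:
            # Return a new context rather than mutating the one passed in.
            return {**context, self.key: headers[self.header]}
        return context


class CompositePropagator:
    """Fans a single inject or extract call out to every configured propagator."""

    def __init__(self, propagators: List[Propagator]) -> None:
        self.propagators = propagators

    def inject(self, context: Context, headers: Headers) -> None:
        for propagator in self.propagators:
            propagator.inject(context, headers)

    def extract(self, context: Context, headers: Headers) -> Context:
        for propagator in self.propagators:
            context = propagator.extract(context, headers)
        return context


# Configured once; types and formats are chosen here, not at each call site.
propagator = CompositePropagator([
    KeyPropagator("trace", "format-a-trace"),
    KeyPropagator("trace", "format-b-trace"),
    KeyPropagator("correlations", "format-a-correlations"),
])

outgoing: Headers = {}
propagator.inject({"trace": "trace-123", "correlations": "user=42"}, outgoing)  # one call
restored = propagator.extract({}, outgoing)  # one call on the receiving side
```

In this sketch, supporting a new context type or a custom format means adding one entry to the propagator list. The `inject` and `extract` call sites stay the same.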
The future of OpenTelemetry is shaped through the OTEP process and made possible by the hard work and investments made by Lightstep, and the other organizations and individuals helping to build it.