The Big Pieces: OpenTelemetry specification
by Ted Young
In this installment of The Big Pieces, we’re going to cover the way that OpenTelemetry designs itself, and how to get involved yourself.
OpenTelemetry is a big project. Every language requires an OpenTelemetry client, which can be complex and tricky to build and requires features that programmers normally do not worry about. For example, OpenTelemetry clients have strict rules around dependency management and backward compatibility, and they need to follow the flow of execution without being passed around as function parameters. The clients need to feel similar across all languages, produce the same data, and interoperate with each other. For example, there is more than one way to interpret OpenTelemetry data when exporting it to Prometheus or Jaeger. If every implementation varied its approach, we would have a mess. We want to provide clarity through consistency.
OpenTelemetry is also a community project. From the outset, we knew that we did not want to take the benevolent dictator approach, and simply trust the instincts of a single person or a small group of insiders. We want our decisions to be well researched and peer-reviewed, and allow anyone to participate in the design process.
All of this led towards a specification model for OpenTelemetry. The project began as a merger between OpenTracing and OpenCensus. We knew that we wanted to keep attributes from both projects; what would the result look like? The initial version of the specification resulted from these discussions.
Every week, we have a specification meeting (you can find all of our meeting times on our calendar). The spec meeting is a good place to bring up ideas and concerns, and to talk through issues when the writing process gets stuck. I recommend floating your idea here to gauge interest and identify stakeholders before putting in a lot of work. It’s not necessary but it helps.
Official work starts with an issue being raised in the specification. This tracking issue serves as a reference point and helps keep track of progress as it moves around the project.
For small additions that do not significantly change the meaning of the spec, you can just make a pull request and that’s the end.
For larger changes, you need to make an RFC to actually propose the change. We call ours OTEPs – OpenTelemetry Enhancement Proposals. This allows the entire proposal, supporting arguments, prior art, etc., to all be collected in one place and reviewed as a complete work.
As part of creating an OTEP, I strongly recommend prototyping in several languages. It can be hard to understand how a big change will actually play out. English is a sloppy language, and we tend to think in terms of our favorite programming language. The bigger the change, the bigger the effort, the sooner you want to confirm that the design proposal is universal and won’t run into implementation problems. I have found that prototyping identifies issues and solutions much faster than trying to work everything out in your mind like some kind of Greek philosopher.
Since it can be a fair amount of work, I try to recruit a group from different SIGs to help with prototyping and drafting the OTEP before I submit it. If no one is interested in prototyping with me, I’ve received my first piece of feedback and I go back to working on my pitch or I move on to a more popular problem.
In order to be accepted, OTEPs require four approvals from the specification team. OTEPs are our highest bar; we want all of the issues identified and worked out before we try to add something to the spec (and definitely not after we add something to the spec).
Once your OTEP is approved, you’re ready to add it to the spec.
We want the spec to be as clear as possible about what the requirements are, so really think about how to remove incidental detail and phrasing which sounds overly restrictive. More than once I’ve submitted spec changes that encoded the requirements as a particular solution when more than one solution exists! Be on the lookout for that, and try to simplify as much as possible.
When writing your pull request, it’s important to look over the entire spec, as larger changes often affect more than one section. Use the IETF RFC keywords to clarify which required features MUST be implemented and which optional features SHOULD be implemented, and don’t forget to update the changelog and the glossary!
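As a rough aid when reviewing your own draft, a small script can list where the RFC keywords appear so you can check each one is intentional. This is only a sketch: the function name `find_requirement_levels` and the sample sentences in `draft` are made up for illustration and are not taken from the OpenTelemetry specification or its tooling.

```python
import re

# Keyword phrases used in IETF RFC 2119. Longer phrases come first so that
# "MUST NOT" is matched before "MUST".
RFC_KEYWORDS = [
    "MUST NOT", "SHALL NOT", "SHOULD NOT", "NOT RECOMMENDED",
    "MUST", "REQUIRED", "SHALL", "SHOULD", "RECOMMENDED", "MAY", "OPTIONAL",
]

# Case-sensitive on purpose: only the upper-case forms are treated as keywords.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in RFC_KEYWORDS) + r")\b"
)


def find_requirement_levels(text):
    """Return (line_number, keyword) pairs for each RFC keyword in `text`."""
    found = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in _PATTERN.finditer(line):
            found.append((number, match.group(0)))
    return found


# Invented sample text, not real specification language.
draft = """The SDK MUST export spans in the order they end.
Exporters SHOULD retry on transient failures.
The API MUST NOT block the calling thread.
this must be lowercase and is ignored."""

for line_number, keyword in find_requirement_levels(draft):
    print(line_number, keyword)
```

A listing like this makes it easier to spot a requirement that was written in lower case by accident, or a MUST that encodes one particular solution when several would do.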
Once your PR is merged, it will go out in the next release of the specification and every SIG (special interest group) will implement the change. The specification maintainers can help with this, but as the champion, I like to follow up and ensure that the issues are properly created in every backlog.
And that’s all you need to know about the specification process! If you’d like to see an example of this process in action, please read on.
Sometimes an example of a process can help explain things. I recently went through the specification journey, defining the versioning procedures and stability guarantees for the OpenTelemetry clients. Since we do all of our work in public, it’s easy to reconstruct the entire process. Here’s how it went, from start to finish.
(I should mention that this OTEP is on the extreme end of organizing effort; most issues are definitely simpler than this, so please don’t be intimidated.)
We had been discussing versioning over the course of the project, and we knew that we wanted strict backward compatibility guarantees – if we broke the API, that would break every project which depends on OpenTelemetry. That was an acceptable risk in beta, but once mainstream adoption began it would no longer be an option. Now that we were getting close to finishing the tracing portion of the project, we needed to buckle down and define how we would deliver the stability guarantees we needed while still leaving a pathway open for experimentation and improvements. Someone needed to champion this issue, and since I had already been giving it a lot of thought, I raised my hand.
I like to add infographics when I can; I find they really help explain things.
I started by creating a spec issue, as usual. The GitHub collaboration tools aren’t always the best tools at this stage, so I made the first draft of my OTEP in a shared Google Doc so that others could comment and edit. We like to talk things out over Zoom, so I scheduled a set of meetings and brought it up at every spec meeting. We record all of our meetings, so those are public too (here’s an example).
Working with maintainers of different OpenTelemetry implementations, we prototyped our approach to versioning and stability by going through an exercise. We imagined that tracing was now stable, and we were going to add metrics using the process proposed in our rough draft. By listing out all the actions we would take, including versions, releases, movement of packages, etc., we identified a number of gotchas and differences in how languages manage backward compatibility. This exercise helped a lot.
After that, a refined version of the initial draft was submitted as an OTEP. There was still a lot of work to do, but because of our drafting and prototyping, I knew that the basic structure was working for people. Line by line commenting in GitHub would work now to discuss the remaining details.
Once the OTEP was approved, I added all of the relevant content to the specification as a pull request. Along the way, I noticed that the overview and the glossary did not do the greatest job of describing some of the client architecture that was relevant to versioning, so I improved those sections as well.
Some last-minute concerns were raised as well. We had made a rule about how instrumentation could be updated to ensure that future changes to the data OpenTelemetry emitted would not break dashboards. Members of the metrics working group pointed out that some of our assumptions about how metrics data could be updated did not hold for Stackdriver and possibly other systems. Rather than hold the entire process up, we removed that section and marked it TBD. I find this is a common pattern – you reach 90% agreement, but there are one or two thorny issues that can’t seem to reach consensus. In these cases, we always try to merge the 90% if we can, and not let perfect be the enemy of good enough.
With those issues resolved, we let the pull request sit open for a week, and then merged it. Now OpenTelemetry had a plan for versioning, and not a moment too soon – the tracing specification had stabilized, and we were ready to release v1.0!
Thanks for reading. I hope the specification process is clearer now, and I look forward to your changes.