OpenTelemetry: Emerging standard for all DevOps solutions - error analytics
by Clay Smith
Lightstep wants to help customers and our DevOps partners adopt OpenTelemetry. As a founding member and core contributor of the standard, we have the expertise, tools, and templates to help vendors easily adopt it in their solutions.
We are launching a series of OpenTelemetry-based tutorials and example instrumentation for different DevOps solutions. We’ll show how to connect these solutions to observability data within Lightstep, and why that makes for a better user experience when running experiments, operating cloud services, or investigating errors. In earlier posts, we showed how you can extend instrumentation to AWS cloud services and feature flag solutions.
In this post we discuss the value of adding instrumentation to error analytics tools as part of your overall monitoring and incident response workflow.
Error analytics is at the center of troubleshooting potential problems with software, especially unexpected problems with code. Error analytics solutions provide detailed analysis that can connect specific lines of code to a problem. They are a key part of the developer workflows that happen behind the scenes when a customer receives a message that says something went wrong.
Error analytics solutions work well when developers can connect a specific signal to an error. For example, a new version of an app is released and the error count spikes. A quick look at error analytics for that app points to an exception coming from a new line of code that didn’t cover some edge case.
In speaking with Lightstep customers, we know that some of the trickiest kinds of errors are those that are not connected to an obvious cause, or that are missed entirely. This is especially problematic when teams scale to dozens or hundreds of services.
Here are some examples of customer-facing errors we’ve seen whose root causes were not at all obvious:
- Latency in retrieving data from cloud-based object storage service—itself a dependency of a backend service—occasionally caused requests for mobile app users in Europe to fail with a cryptic error message.
- Requests suddenly fail for a small subset of customers when a pod restarts in Kubernetes.
- A dependency of a backend service returns 500s when a specific request is made by customers with over 1,000 active subscribers.
At the center of investigating all three problems is an error trace in an error analytics tool. Unfortunately, getting to the root cause can be extremely difficult—the services involved are owned by different teams, and the telemetry each team collects varies and lives in different places.
OpenTelemetry allows developers to link errors and their associated rich metadata (like the line of code where the error was observed) to their telemetry, specifically distributed traces. With a single line of code, this can be done automatically using an open-standards-based plugin.
Here’s Rollbar error information embedded in Lightstep’s Trace View:
Check out our Lightstep Developer Toolkit:
- Our demo app demonstrates OpenTelemetry-aware error tracking with Lightstep using an experimental Rollbar SDK plugin.
- A Lightstep learning path that walks through enabling this in your services, or see the demo app on GitHub.
- Our OpenTelemetry Docs for software teams considering adopting OpenTelemetry.
Contact us if you’d like to learn more, find out what’s planned for future integrations, or dig deeper into OpenTelemetry.