Observability in action: What happens when a deploy goes wrong
by Ben Sigelman
Sometimes we should philosophize about observability and sometimes we should just get ultra-pragmatic and examine real use cases from real systems. 😊
Here is one about a bad deploy we had at Lightstep the other day. Let’s get started with a picture.
In this example, we are contending with a failed deploy within Lightstep’s own (internal, multi-tenant) system. It was easy enough to detect the regression and roll back, but in order to fix the underlying issue, of course we had to understand it. We knew the failure was related to a bad deploy of the liveview service. The screenshot above shows liveview endpoints, ranked by the biggest change for the new release; at the top is ExplorerService/Create with a huge (!!) increase in error ratio.
It’s worth noting that this dashboard was created automatically from aggregated span (i.e., tracing) data, and the ExplorerService/Create endpoint rose to the top automatically as well. There is no need to manually create, maintain, or stare at ad hoc dashboards.
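For a feel of what "ranked by the biggest change" can mean in practice, here is a minimal sketch in Python. It is not how Lightstep computes its rankings; the span shape (`endpoint`, `error`), the function names, the `Hypothetical/Other` endpoint, and all of the sample data are assumptions made up for illustration.

```python
from collections import defaultdict


def error_ratio_by_endpoint(spans):
    """Return {endpoint: errors / total} for an iterable of span dicts."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for span in spans:
        totals[span["endpoint"]] += 1
        if span["error"]:
            errors[span["endpoint"]] += 1
    return {endpoint: errors[endpoint] / totals[endpoint] for endpoint in totals}


def rank_by_error_change(baseline_spans, release_spans):
    """Rank endpoints by the increase in error ratio, largest first."""
    before = error_ratio_by_endpoint(baseline_spans)
    after = error_ratio_by_endpoint(release_spans)
    # Endpoints that did not exist in the baseline are treated as 0% errors.
    changes = {ep: after[ep] - before.get(ep, 0.0) for ep in after}
    return sorted(changes.items(), key=lambda item: item[1], reverse=True)


# Made-up spans: "Hypothetical/Other" is an invented endpoint name.
baseline = [
    {"endpoint": "ExplorerService/Create", "error": False},
    {"endpoint": "ExplorerService/Create", "error": False},
    {"endpoint": "Hypothetical/Other", "error": False},
    {"endpoint": "Hypothetical/Other", "error": True},
]
release = [
    {"endpoint": "ExplorerService/Create", "error": True},
    {"endpoint": "ExplorerService/Create", "error": True},
    {"endpoint": "ExplorerService/Create", "error": False},
    {"endpoint": "Hypothetical/Other", "error": False},
    {"endpoint": "Hypothetical/Other", "error": True},
]

ranking = rank_by_error_change(baseline, release)
for endpoint, change in ranking:
    print(f"{endpoint}: {change:+.2f}")
```

With this toy data, the endpoint whose error ratio moved the most sorts to the top, while the endpoint that was already noisy before the release (and did not change) drops to the bottom.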
Where do we go from here? This is when things used to get particularly painful – one would open up a bunch of dashboards, stare at logs, and start guessing-and-checking. Not good.
What if we could just click on the spike in error rate?? Let’s try it:
In 99% of incidents (certainly including this one), the big question is “What Changed?!!” Observability should directly answer that question. Here we see an entire view populated with color-coded data showing, well, “what changed” with respect to our error rate spike:
Scrolling down the page, there are many avenues we can pursue to further understand this regression. The one that jumps out is an InvalidArgument tag that’s strongly correlated with our originating issue. Let’s click on that:
Simply by grouping the regression transactions by response_code, we find the smoking gun: more than 98% of our error spike is due to these InvalidArgument responses.
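The grouping step itself is simple enough to sketch. The snippet below assumes each error span carries a `tags` dict with a `response_code` entry; that shape, the `share_by_tag` helper, and the sample counts are all hypothetical and are not taken from the incident.

```python
from collections import Counter


def share_by_tag(spans, tag):
    """Group spans by a tag value and return (value, count, share) rows,
    most common first. Spans without the tag are counted as "<missing>"."""
    counts = Counter(span["tags"].get(tag, "<missing>") for span in spans)
    total = sum(counts.values())
    return [(value, n, n / total) for value, n in counts.most_common()]


# Made-up error spans from a regression window (not real incident data).
error_spans = (
    [{"tags": {"response_code": "InvalidArgument"}} for _ in range(9)]
    + [{"tags": {"response_code": "Unavailable"}}]
)

rows = share_by_tag(error_spans, "response_code")
for value, count, share in rows:
    print(f"{value}: {count} spans ({share:.0%} of errors)")
```

When a single value accounts for nearly all of the rows, as in the incident described here, that value is a strong candidate for the thing that changed.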
If we want to understand this in more detail, we can click on that row in the table to see (many) example transactions. They are all part of the original spike in liveview errors and all exhibit the InvalidArgument response code, and there are 545 to choose from!
At this point we have very high confidence that the InvalidArgument responses caused our error spike. Observability gave us this confidence by analyzing many thousands of distributed traces, logs, and metrics, though we didn’t have to dig through any of that by hand. We can select any one of the trace examples from the table of InvalidArgument examples above and immediately get our diagnosis. By automatically joining (transactional) logs with our traces, we see this error message:
invalid - cannot have empty analyzer query
And that’s all our developer needed to understand what had changed with the new (bad) release. The next roll-forward was successful, and that was that.
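The log-to-trace join mentioned above is worth a quick illustration. One common way to do it, assuming every log record carries the ID of the trace that produced it, is a plain lookup keyed on that ID. The field names (`trace_id`, `message`), the `attach_logs` helper, and the IDs below are assumptions for this sketch, not a description of Lightstep's internals.

```python
from collections import defaultdict


def attach_logs(traces, logs):
    """Return a copy of each trace with the log messages that share its trace_id."""
    logs_by_trace = defaultdict(list)
    for log in logs:
        logs_by_trace[log["trace_id"]].append(log["message"])
    return [
        {**trace, "logs": logs_by_trace.get(trace["trace_id"], [])}
        for trace in traces
    ]


# Hypothetical trace IDs; the message is the one quoted in this post.
traces = [
    {"trace_id": "t1", "response_code": "InvalidArgument"},
    {"trace_id": "t2", "response_code": "OK"},
]
logs = [
    {"trace_id": "t1", "message": "invalid - cannot have empty analyzer query"},
]

joined = attach_logs(traces, logs)
for trace in joined:
    print(trace["trace_id"], trace["response_code"], trace["logs"])
```

The point of doing the join up front is that nobody has to copy a trace ID into a separate log search: the error message shows up right next to the failing transaction.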
To recap our workflow: we began with the affected service, then simply clicked on whichever data seemed most relevant. And we never lost context, despite depending on (many thousands of) traces, metrics, and logs. That’s how observability should be: unified and simple.
To learn more about observability and how it can save you from a bad deploy, check out our Complete Guide to Observability.