Observability in action: What happens when a deploy goes wrong


by Ben Sigelman

06-23-2020

Sometimes we should philosophize about observability and sometimes we should just get ultra-pragmatic and examine real use cases from real systems. 😊

Here is one about a bad deploy we had at Lightstep the other day. Let’s get started with a picture.

[Screenshot: liveview endpoints ranked by change in the new release, with ExplorerService/Create at the top]

In this example, we are contending with a failed deploy within Lightstep’s own (internal, multi-tenant) system. It was easy enough to detect the regression and roll back, but in order to fix the underlying issue, we of course had to understand it first. We knew the failure was related to a bad deploy of the liveview service. The screenshot above shows liveview endpoints, ranked by the biggest change in the new release; at the top is ExplorerService/Create with a huge (!!) increase in error ratio.

Distributed Tracing: Insights vs. Dashboards

It’s worth noting that this dashboard was created automatically from aggregated span (i.e., tracing) data, and the ExplorerService/Create endpoint rose to the top automatically as well. There is no need to manually create, maintain, or stare at ad hoc dashboards.
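For intuition, here is a rough sketch of the idea behind that ranking: compute an error ratio per endpoint for a baseline window and for the new release, then sort by the delta. The span record format and field names below are made up for illustration; this is not Lightstep’s actual aggregation pipeline.

```python
from collections import defaultdict

def error_ratio_by_endpoint(spans):
    """Aggregate finished spans into an error ratio per endpoint."""
    totals, errors = defaultdict(int), defaultdict(int)
    for span in spans:
        endpoint = span["operation"]            # e.g. "ExplorerService/Create"
        totals[endpoint] += 1
        errors[endpoint] += 1 if span["error"] else 0
    return {ep: errors[ep] / totals[ep] for ep in totals}

def rank_by_error_ratio_change(baseline_spans, release_spans):
    """Rank endpoints by how much their error ratio grew in the new release."""
    before = error_ratio_by_endpoint(baseline_spans)
    after = error_ratio_by_endpoint(release_spans)
    deltas = {ep: ratio - before.get(ep, 0.0) for ep, ratio in after.items()}
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)

# Illustrative output only -- an endpoint like "ExplorerService/Create" with a
# large jump in error ratio lands at the top of the list, as in the screenshot above.
```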

Where do we go from here? This is where things used to get particularly painful: one would open up a bunch of dashboards, stare at logs, and start guessing and checking. Not good.

What if we could just click on the spike in error rate?? Let’s try it:

[Screenshot: clicking directly on the spike in error rate]

Observability answers “What Changed?”

In 99% of incidents (certainly including this one), the big question is “What Changed?!!” Observability should directly answer that question. Here we see an entire view populated with color-coded data showing, well, “what changed” with respect to our error rate spike:

[Screenshot: the “what changed” view, populated with color-coded data correlated with the error-rate spike]
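Conceptually, a “what changed” view like this comes from comparing the regression against a baseline: which tag values show up much more often in the bad window than before? Here is a simplified sketch of that comparison, using the same made-up data model as above (not Lightstep’s actual correlation algorithm):

```python
from collections import Counter

def tag_frequencies(spans):
    """Fraction of spans carrying each (tag, value) pair."""
    counts = Counter()
    for span in spans:
        for tag, value in span["tags"].items():
            counts[(tag, value)] += 1
    total = max(len(spans), 1)
    return {pair: n / total for pair, n in counts.items()}

def most_changed_tags(regression_spans, baseline_spans):
    """Score (tag, value) pairs by how much more frequent they are in the
    regression window than in the baseline, highest first."""
    reg = tag_frequencies(regression_spans)
    base = tag_frequencies(baseline_spans)
    scores = {pair: freq - base.get(pair, 0.0) for pair, freq in reg.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# A pair like ("response_code", "InvalidArgument") floating to the top of this
# ranking is what "strongly correlated with the regression" means in the UI.
```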

Scrolling down the page, there are many avenues we can pursue to further understand this regression. The one that jumps out is an InvalidArgument tag that’s strongly correlated with our originating issue. Let’s click on that:

[Screenshot: the InvalidArgument tag, strongly correlated with the regression]

Simply by grouping the regression transactions by response_code, we find the smoking gun: more than 98% of our error spike is due to these InvalidArgument responses!

[Screenshot: regression transactions grouped by response_code, with InvalidArgument accounting for more than 98% of the errors]
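The group-by itself is conceptually simple. Continuing the same made-up data model, “group the regression transactions by response_code” boils down to something like this (the shape of the computation, not Lightstep’s implementation):

```python
from collections import Counter

def error_share_by_response_code(regression_spans):
    """For the spans in the error spike, compute what fraction of the errors
    carry each response_code value."""
    error_spans = [s for s in regression_spans if s["error"]]
    codes = Counter(s["tags"].get("response_code", "unset") for s in error_spans)
    total = max(len(error_spans), 1)
    return {code: count / total for code, count in codes.most_common()}

# Illustrative output: {"InvalidArgument": 0.98, ...} -- the ">98%" smoking gun
# in the table above is exactly this kind of breakdown.
```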

If we want to understand this in more detail, we can click on that row in the table to see (many) example transactions. They are all part of the original spike in liveview errors, they all exhibit the InvalidArgument response code, and there are 545 of them to choose from!

[Screenshot: 545 example transactions from the error spike, all with the InvalidArgument response code]

At this point we have very high confidence that the InvalidArgument responses caused our error spike. Observability gave us this confidence by analyzing many thousands of distributed traces, logs, and metrics, though we didn’t have to dig through any of that by hand. We can select any one of the example traces from the InvalidArgument table above and immediately get our diagnosis. By automatically joining (transactional) logs with our traces, we see this error message: “invalid - cannot have empty analyzer query”

[Screenshot: an example trace with its joined log line, “invalid - cannot have empty analyzer query”]
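That join works because the error message is recorded on the span in the first place. As a rough illustration of what instrumentation inside a service like liveview might look like (shown here with the OpenTelemetry Python API; the handler name, request shape, and tag names are assumptions, not Lightstep’s actual code):

```python
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

def create_explorer_query(request):
    # Hypothetical handler for the ExplorerService/Create endpoint.
    with tracer.start_as_current_span("ExplorerService/Create") as span:
        if not request.get("analyzer_query"):
            # Tag the span so errors can later be grouped by response_code...
            span.set_attribute("response_code", "InvalidArgument")
            # ...and attach the log line so it shows up inline on the trace.
            span.add_event("log", {"message": "invalid - cannot have empty analyzer query"})
            span.set_status(Status(StatusCode.ERROR, "empty analyzer query"))
            return {"code": "InvalidArgument"}
        # ... handle the valid request ...
        return {"code": "OK"}
```

Because the message lives on the span itself, no separate log search is needed to connect it to the trace.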

And that’s all our developer needed to understand what had changed with the new (bad) release. The next roll-forward was successful, and that was that.

To recap our workflow: we began with the affected service, then simply clicked on whichever data seemed most relevant. And we never lost context, despite depending on (many thousands of) traces, metrics, and logs. That’s how observability should be: unified and simple.

To learn more about observability and how it can save you from a bad deploy, check out our Complete Guide to Observability.
