Observability in action: What happens when a deploy goes wrong

Sometimes we should philosophize about observability, and sometimes we should just get ultra-pragmatic and examine real use cases from real systems. 😊

Here is one about a bad deploy we had at Lightstep the other day. Let’s get started with a picture.

[Screenshot: liveview endpoints ranked by change in error ratio for the new release, with ExplorerService/Create at the top]

In this example, we are contending with a failed deploy within Lightstep’s own (internal, multi-tenant) system. It was easy enough to detect the regression and roll back, but in order to fix the underlying issue, of course we had to understand it. We knew the failure was related to a bad deploy of the liveview service. The screenshot above shows liveview endpoints, ranked by the biggest change for the new release; at the top is ExplorerService/Create with a huge (!!) increase in error ratio.

Distributed Tracing: Insights vs. Dashboards

It’s worth noting that this dashboard was created automatically from aggregated span (i.e., tracing) data, and the ExplorerService/Create endpoint rose to the top automatically as well. There is no need to manually create, maintain, or stare at ad hoc dashboards.
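Under the hood, the idea is simple enough to sketch: aggregate spans per endpoint for each release, compute an error ratio, and rank endpoints by the change. Here is a minimal illustration in Python; the Span shape and field names are hypothetical stand-ins for whatever your tracing pipeline emits, not Lightstep’s actual implementation.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Span:
    endpoint: str   # e.g. "ExplorerService/Create" (hypothetical field names)
    is_error: bool  # span-level error flag

def error_ratios(spans: list[Span]) -> dict[str, float]:
    """Aggregate spans per endpoint into an error ratio."""
    totals: dict[str, int] = defaultdict(int)
    errors: dict[str, int] = defaultdict(int)
    for s in spans:
        totals[s.endpoint] += 1
        if s.is_error:
            errors[s.endpoint] += 1
    return {ep: errors[ep] / totals[ep] for ep in totals}

def rank_by_change(old_spans: list[Span], new_spans: list[Span]) -> list[tuple[str, float]]:
    """Rank endpoints by the increase in error ratio between releases."""
    old, new = error_ratios(old_spans), error_ratios(new_spans)
    deltas = {ep: new[ep] - old.get(ep, 0.0) for ep in new}
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)
```

Ranking by the delta (rather than by the raw error ratio) is what pushes ExplorerService/Create to the top: it is the endpoint whose behavior changed with the release, not necessarily the endpoint with the most errors overall.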

Where do we go from here? This is when things used to get particularly painful: one would open up a bunch of dashboards, stare at logs, and start guessing and checking. Not good.

What if we could just click on the spike in error rate?? Let’s try it:

[Screenshot: clicking directly on the spike in the error-rate chart]

Observability answers “What Changed?”

In 99% of incidents (certainly including this one), the big question is “What Changed?!!” Observability should directly answer that question. Here we see an entire view populated with color-coded data showing, well, “what changed” with respect to our error rate spike:

[Screenshot: the “what changed” view, with attributes correlated to the error spike color-coded]
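One way to think about a “what changed” view (a deliberate simplification, not necessarily Lightstep’s actual algorithm) is as a frequency comparison between two populations of spans: tag key/value pairs that are markedly over-represented in the regression window relative to a baseline window are the ones worth surfacing. A rough sketch, with each span’s tags represented as a plain dict:

```python
from collections import Counter

def what_changed(baseline: list[dict[str, str]],
                 regression: list[dict[str, str]]) -> list[tuple[tuple[str, str], float]]:
    """Rank tag key/value pairs by how over-represented they are on spans
    in the regression window relative to the baseline window."""
    base_counts = Counter(kv for tags in baseline for kv in tags.items())
    regr_counts = Counter(kv for tags in regression for kv in tags.items())
    n_base, n_regr = max(len(baseline), 1), max(len(regression), 1)
    # Score each tag by the difference in its per-span frequency.
    scores = {kv: regr_counts[kv] / n_regr - base_counts[kv] / n_base
              for kv in regr_counts}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

A tag like response_code=InvalidArgument that is rare in the baseline but common during the spike scores near the top of such a ranking.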

Scrolling down the page, there are many avenues we can pursue to further understand this regression. The one that jumps out is an InvalidArgument tag that’s strongly correlated with our originating issue. Let’s click on that:

[Screenshot: an InvalidArgument tag strongly correlated with the error spike]

Simply by grouping the regression transactions by response_code, we find the smoking gun: more than 98% of our error spike is due to these InvalidArgument responses!

[Screenshot: regression transactions grouped by response_code; InvalidArgument accounts for more than 98% of the errors]
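The group-by itself is nothing exotic: bucket the erroring spans by their response_code tag and compute each bucket’s share of the spike. A minimal sketch (the tag name response_code comes straight from the screenshots; the span representation is again illustrative):

```python
from collections import Counter

def error_share_by(spans: list[dict], tag: str) -> list[tuple[str, float]]:
    """Group erroring spans by a tag's value and report each value's
    share of the total error count, largest first."""
    counts = Counter(s["tags"].get(tag, "<unset>")
                     for s in spans if s.get("is_error"))
    total = sum(counts.values()) or 1
    return [(value, n / total) for value, n in counts.most_common()]

# error_share_by(regression_spans, "response_code")
# -> e.g. [("InvalidArgument", 0.98), ...] per the incident above
```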

If we want to understand this in more detail, we can click on that row in the table to see (many) example transactions. They are all part of the original spike in liveview errors and all exhibit the InvalidArgument response code, and there are 545 to choose from!

[Screenshot: a table of 545 example transactions with the InvalidArgument response code]

At this point we have very high confidence that the InvalidArgument responses caused our error spike. Observability gave us this confidence by analyzing many thousands of distributed traces, logs, and metrics, though we didn’t have to dig through any of that by hand. We can select any one of the InvalidArgument examples from the table above and immediately get our diagnosis. By automatically joining (transactional) logs with our traces, we see this error message: “invalid - cannot have empty analyzer query”

[Screenshot: an example trace with its joined log line showing the error message]
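The trace/log join that produces this diagnosis typically hinges on one thing: every log line carries the ID of the trace that was active when it was written (OpenTelemetry-style context propagation gives you this for free). Once that holds, fetching the logs for any example trace is a lookup. A hypothetical sketch:

```python
from collections import defaultdict

def index_logs_by_trace(log_lines: list[dict]) -> dict[str, list[str]]:
    """Index structured log lines by the trace_id stamped on each one."""
    by_trace: dict[str, list[str]] = defaultdict(list)
    for line in log_lines:
        if trace_id := line.get("trace_id"):
            by_trace[trace_id].append(line["message"])
    return by_trace

# Pick any InvalidArgument example trace, and the diagnosis is one lookup away:
# index_logs_by_trace(logs)[example_trace_id]
# -> ["invalid - cannot have empty analyzer query", ...]
```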

And that’s all our developer needed to understand what had changed with the new (bad) release. The next roll-forward was successful, and that was that.

To recap our workflow: we began with the affected service, then simply clicked on whichever data seemed most relevant. And we never lost context, despite depending on (many thousands of) traces, metrics, and logs. That’s how observability should be: unified and simple.

To learn more about observability and how it can save you from a bad deploy, check out our Complete Guide to Observability.

Ben Sigelman | June 23, 2020