Observability in action: What happens when a deploy goes wrong

Sometimes we should philosophize about observability and sometimes we should just get ultra-pragmatic and examine real use cases from real systems. 😊

Here is one about a bad deploy we had at Lightstep the other day. Let’s get started with a picture.

[Screenshot: liveview endpoints ranked by the biggest change between releases, with ExplorerService/Create at the top]

In this example, we are contending with a failed deploy within Lightstep’s own (internal, multi-tenant) system. It was easy enough to detect the regression and roll back, but in order to fix the underlying issue, of course we had to understand it. We knew the failure was related to a bad deploy of the liveview service. The screenshot above shows liveview endpoints, ranked by the biggest change for the new release; at the top is ExplorerService/Create with a huge (!!) increase in error ratio.

Distributed Tracing: Insights vs. Dashboards

It’s worth noting that this dashboard was created automatically from aggregated span (i.e., tracing) data, and the ExplorerService/Create endpoint rose to the top automatically as well. There is no need to manually create, maintain, or stare at ad hoc dashboards.
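
For the curious, a view like this boils down to comparing each endpoint’s error ratio in the new release against the previous one and ranking by the change. Here’s a rough sketch of that idea in Go; the counts and the second endpoint name are made up, and this is an illustration of the concept rather than Lightstep’s actual implementation:

```go
package main

import (
	"fmt"
	"sort"
)

// counts holds aggregated span counts for one endpoint in one release window.
type counts struct {
	errors, total int
}

func errorRatio(c counts) float64 {
	if c.total == 0 {
		return 0
	}
	return float64(c.errors) / float64(c.total)
}

func main() {
	// Hypothetical aggregates for the previous and the new liveview release.
	prev := map[string]counts{
		"ExplorerService/Create": {errors: 2, total: 1000},
		"ExplorerService/Get":    {errors: 1, total: 5000},
	}
	curr := map[string]counts{
		"ExplorerService/Create": {errors: 450, total: 1000},
		"ExplorerService/Get":    {errors: 2, total: 5000},
	}

	type row struct {
		endpoint string
		delta    float64 // change in error ratio, new release minus previous
	}
	var rows []row
	for ep, c := range curr {
		rows = append(rows, row{ep, errorRatio(c) - errorRatio(prev[ep])})
	}
	// The endpoint with the biggest increase floats to the top of the view.
	sort.Slice(rows, func(i, j int) bool { return rows[i].delta > rows[j].delta })
	for _, r := range rows {
		fmt.Printf("%-24s %+.1f%% error ratio\n", r.endpoint, 100*r.delta)
	}
}
```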

Where do we go from here? This is when things used to get particularly painful – one would open up a bunch of dashboards, stare at logs, and start guessing-and-checking. Not good.

What if we could just click on the spike in error rate?? Let’s try it:

[Screenshot: clicking directly on the spike in error rate]

Observability answers “What Changed?”

In 99% of incidents (certainly including this one), the big question is “What Changed?!!” Observability should directly answer that question. Here we see an entire view populated with color-coded data showing, well, “what changed” with respect to our error rate spike:

[Screenshot: color-coded attributes showing what changed with respect to the error rate spike]
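
Under the hood, a comparison like this amounts to asking which tags show up much more often on the erroring spans than on the healthy ones. Here’s a deliberately naive sketch of that idea; it is an illustration, not Lightstep’s actual correlation analysis, and the sample spans are invented:

```go
package main

import "fmt"

// span is a minimal stand-in: did the operation error, and what tags did it carry?
type span struct {
	err  bool
	tags map[string]string
}

// overrepresented returns, for each tag key=value pair, how much more often it
// appears on erroring spans than on healthy ones (as a frequency difference).
// Large positive values are candidates for "what changed".
func overrepresented(spans []span) map[string]float64 {
	errTotal, okTotal := 0, 0
	errFreq, okFreq := map[string]int{}, map[string]int{}
	for _, s := range spans {
		for k, v := range s.tags {
			if s.err {
				errFreq[k+"="+v]++
			} else {
				okFreq[k+"="+v]++
			}
		}
		if s.err {
			errTotal++
		} else {
			okTotal++
		}
	}
	diff := map[string]float64{}
	for kv, n := range errFreq {
		pErr := float64(n) / float64(errTotal)
		pOK := 0.0
		if okTotal > 0 {
			pOK = float64(okFreq[kv]) / float64(okTotal)
		}
		diff[kv] = pErr - pOK
	}
	return diff
}

func main() {
	// Made-up sample data purely for illustration.
	spans := []span{
		{err: true, tags: map[string]string{"response_code": "InvalidArgument", "customer": "a"}},
		{err: true, tags: map[string]string{"response_code": "InvalidArgument", "customer": "b"}},
		{err: false, tags: map[string]string{"response_code": "OK", "customer": "a"}},
		{err: false, tags: map[string]string{"response_code": "OK", "customer": "b"}},
	}
	for kv, d := range overrepresented(spans) {
		fmt.Printf("%-30s %+.2f\n", kv, d)
	}
}
```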

Scrolling down the page, we find many avenues we can pursue to further understand this regression. The one that jumps out is an InvalidArgument tag that’s strongly correlated with our originating issue. Let’s click on that:

[Screenshot: the InvalidArgument tag, strongly correlated with the regression]

Simply by grouping the regression transactions by response_code, we find the smoking gun: more than 98% of our error spike is due to these InvalidArgument responses!

[Screenshot: regression transactions grouped by response_code, with InvalidArgument accounting for more than 98% of errors]
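
Conceptually, the grouping is simple: take the erroring spans from the regression window, bucket them by their response_code tag, and report each code’s share of the total. A minimal sketch, with a stand-in span type and made-up sample data rather than the real query engine:

```go
package main

import "fmt"

// errSpan is a minimal stand-in for an erroring span; real span data carries
// many more attributes than a single response_code tag.
type errSpan struct {
	tags map[string]string
}

// shareByResponseCode groups erroring spans by their response_code tag and
// returns each code's share of the total error count.
func shareByResponseCode(spans []errSpan) map[string]float64 {
	byCode := map[string]int{}
	for _, s := range spans {
		byCode[s.tags["response_code"]]++
	}
	shares := map[string]float64{}
	for code, n := range byCode {
		shares[code] = float64(n) / float64(len(spans))
	}
	return shares
}

func main() {
	// A tiny, made-up sample; the real analysis runs over every erroring
	// span in the regression window.
	spans := []errSpan{
		{tags: map[string]string{"response_code": "InvalidArgument"}},
		{tags: map[string]string{"response_code": "InvalidArgument"}},
		{tags: map[string]string{"response_code": "DeadlineExceeded"}},
	}
	for code, share := range shareByResponseCode(spans) {
		fmt.Printf("%-18s %5.1f%%\n", code, 100*share)
	}
}
```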

If we want to understand this in more detail, we can click on that row in the table to see (many) example transactions. They are all part of the original spike in liveview errors, all exhibiting the InvalidArgument response code, and there are 545 to choose from!

[Screenshot: 545 example transactions exhibiting the InvalidArgument response code]

At this point we have very high confidence that the InvalidArgument responses caused our error spike. Observability gave us this confidence by analyzing many thousands of distributed traces, logs, and metrics, though we didn’t have to dig through any of that by hand. We can select any one of the InvalidArgument examples from the table above and immediately get our diagnosis. By automatically joining (transactional) logs with our traces, we see this error message: “invalid - cannot have empty analyzer query”

[Screenshot: an example trace with the joined log message “invalid - cannot have empty analyzer query”]
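
That join only works because the service emits the message in the context of the failing request. We won’t dig into how liveview itself is instrumented here, but as a hedged sketch, an OpenTelemetry-style handler (with a hypothetical name and signature) might record both the response_code tag and the error like this:

```go
package liveview

import (
	"context"
	"errors"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
)

// createSnapshot is a hypothetical handler standing in for the real
// ExplorerService/Create endpoint.
func createSnapshot(ctx context.Context, analyzerQuery string) error {
	// The returned ctx would be passed to downstream calls so they join the same trace.
	ctx, span := otel.Tracer("liveview").Start(ctx, "ExplorerService/Create")
	defer span.End()

	if analyzerQuery == "" {
		err := errors.New("invalid - cannot have empty analyzer query")
		// Tag the span so errors can later be grouped by response_code...
		span.SetAttributes(attribute.String("response_code", "InvalidArgument"))
		// ...and attach the error itself, so the message shows up right next
		// to the trace instead of only in a log file somewhere.
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
		return err
	}

	span.SetAttributes(attribute.String("response_code", "OK"))
	// The real handler would use ctx here for downstream work.
	return nil
}
```

Whether the message is recorded directly on the span (as above) or joined in from transactional logs, the effect is the same: the diagnosis is one click away.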

And that’s all our developer needed to understand what had changed with the new (bad) release. The next roll-forward was successful, and that was that.

To recap our workflow: we began with the affected service, then simply clicked on whichever data seemed most relevant. And we never lost context, despite depending on (many thousands of) traces, metrics, and logs. That’s how observability should be: unified and simple.

To learn more about observability and how it can save you from a bad deploy, check out our Complete Guide to Observability.

June 23, 2020
3 min read
Observability


About the author

Ben Sigelman
