Lightstep from ServiceNow Logo

Products

Solutions

Documentation

Resources

Lightstep from ServiceNow Logo
< all blogs

Find the Needle: How Change Intelligence finds the cause of a metric deviation

Lightstep’s Change Intelligence combines metric and tracing telemetry data to gain full observability into your system using one tool. Your metric dashboards and alertsalerts not only show you when there’s a problem, they also become actionable tools that find the source. You don’t need to be a DevOps engineer or know the dependencies in your system; Lightstep understands them for you and can find the issues deep in your stack.

When you notice a deviation on a metric chart, you can use Change IntelligenceChange Intelligence to correlate that deviation with traces from your services to find what in your system may have caused that change. Change Intelligence uses trace data from your service instrumentation to determine up and downstream dependencies and then finds changes in that path that happened at the same time as the metric deviation.

Change Intelligence - Metric Deviation

When you find an issue in a dashboarddashboard or chartchart (even a chart in an alertalert), you can click into the deviation and choose What caused this change?

Change Intelligence - what caused this change

Change Intelligence begins by setting baseline and deviation time windows and then compares and analyzes the performance of the service that sent the metric data before and during the deviation.

Change Intelligence - setting baseline and deviation time windows

When Change Intelligence finds performance changes on any Key Operations on that service, it determines the magnitude of the change and lists them in descending order (most changed first). For each operation, it displays sparkcharts for latency, operation rate, and error rate.

In this example, Change Intelligence shows us that the update-catalog Key Operation on the warehouse service experienced the most change (in p99 latency) at the same time as the metric deviation.

Change Intelligence - biggest change in the warehouse

Once it finds a Key Operation with meaningful performance changes, Change Intelligence looks for traces with that operation. It analyses those traces, searching for attributesattributes that appear frequently on spans from services up and down the stack, during the performance regression. In other words, if an attribute appears on a number of traces with performance issues (and doesn’t on traces that are stable), it’s likely that something about that attribute is causing the issue. These are displayed as likely causes of the change.

In this example, the attribute customer:ProWool was found on over 41% of traces during the deviation and in less than 7% during the baseline. The latency on traces with that attributed increased 5x and the operation rate for those spans also increased.

Change Intelligence - most likely causes

Looking at the service diagram, you can see that attribute is being sent on traces coming from the /api/get-catalog operation on the iOS service. The diagram also shows that there's one service in between the iOS service and the warehouse service.

Change Intelligence - service diagrams & attributes

Change Intelligence collects exemplar traces that include that correlated attribute with the performance issue. Clicking View sample traces, allows you to choose one and open it in the Trace viewTrace view.

Change Intelligence - exemplar traces

In the trace, it looks like the customer ProWool sent 1,000 requests and the write to the database is overwhelmed. That's likely why the CPU metric spiked.

Change Intelligence - CPU spike

Change Intelligence was able to pinpoint the part of the system that is likely causing the change in the metric performance. By combining metrics with tracing, instead of just knowing that a change occurred, you can find the source without leaving Lightstep.

Interested in joining our team? See our open positions herehere.

March 11, 2021
3 min read
Observability

Share this article

About the author

Robin Whitmore

Robin Whitmore

Read moreRead more

How to Operate Cloud Native Applications at Scale

Jason Bloomberg | May 15, 2023

Intellyx explores the challenges of operating cloud-native applications at scale – in many cases, massive, dynamic scale across geographies and hybrid environments.

Learn moreLearn more

2022 in review

Andrew Gardner | Jan 30, 2023

Andrew Gardner looks back at Lightstep's product evolution and what's in store for 2023.

Learn moreLearn more

The origin of cloud native observability

Jason English | Jan 23, 2023

Almost every company that depends on digital capabilities is betting on cloud native development and observability. Jason English, Principal Analyst at Intellyx, looks at the origins of both and their growing role in operational efficiency.

Learn moreLearn more
THE CLOUD-NATIVE RELIABILITY PLATFORM

Lightstep sounds like a lovely idea

Monitoring and observability for the world’s most reliable systems