Find the Needle: How Change Intelligence finds the cause of a metric deviation

by Robin Whitmore

Lightstep’s Change Intelligence combines metric and tracing telemetry data to give you full observability into your system from one tool. Your metric dashboards and alerts not only show you when there’s a problem, they also become actionable tools that find the source. You don’t need to be a DevOps engineer or know the dependencies in your system; Lightstep understands them for you and can find issues deep in your stack.

When you notice a deviation on a metric chart, you can use Change Intelligence to correlate that deviation with traces from your services and find what in your system may have caused the change. Change Intelligence uses trace data from your service instrumentation to determine upstream and downstream dependencies, and then finds changes along that path that happened at the same time as the metric deviation.

Change Intelligence - Metric Deviation
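
To make the dependency step concrete, here is a minimal sketch of how upstream and downstream services could be derived from span parent/child links. It is not Lightstep's implementation, which the post does not describe. The span fields (`span_id`, `parent_id`, `service`) and the `api-gateway` service name are assumptions made for illustration.

```python
from collections import defaultdict, deque

def service_edges(spans):
    """Return (caller_service, callee_service) pairs found in the spans."""
    by_id = {s["span_id"]: s for s in spans}
    edges = set()
    for span in spans:
        parent = by_id.get(span.get("parent_id"))
        # Only a parent/child pair that crosses services is a dependency.
        if parent is not None and parent["service"] != span["service"]:
            edges.add((parent["service"], span["service"]))
    return edges

def reachable(edges, start, downstream=True):
    """Breadth-first walk of the service graph in one direction."""
    graph = defaultdict(set)
    for caller, callee in edges:
        if downstream:
            graph[caller].add(callee)
        else:
            graph[callee].add(caller)
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt != start and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Hypothetical trace: iOS -> api-gateway -> warehouse
spans = [
    {"span_id": "1", "parent_id": None, "service": "iOS"},
    {"span_id": "2", "parent_id": "1", "service": "api-gateway"},
    {"span_id": "3", "parent_id": "2", "service": "warehouse"},
]
edges = service_edges(spans)
upstream_of_warehouse = reachable(edges, "warehouse", downstream=False)
downstream_of_ios = reachable(edges, "iOS", downstream=True)
```

Walking the graph in both directions from the service that emitted the metric gives the set of services whose traces are worth comparing.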

When you find an issue in a dashboard or chart (even a chart in an alert), you can click into the deviation and choose What caused this change?

Change Intelligence - what caused this change

Change Intelligence begins by setting baseline and deviation time windows and then compares and analyzes the performance of the service that sent the metric data before and during the deviation.

Change Intelligence - setting baseline and deviation time windows
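
The baseline/deviation comparison can be sketched as follows. This is an illustration under stated assumptions, not the product's algorithm: the span fields (`ts`, `latency_ms`, `error`), the window representation, and the choice of p99 as the summary statistic are all hypothetical.

```python
from statistics import quantiles

def split_windows(spans, baseline, deviation):
    """Partition spans by timestamp; each window is (start, end), end exclusive."""
    base = [s for s in spans if baseline[0] <= s["ts"] < baseline[1]]
    dev = [s for s in spans if deviation[0] <= s["ts"] < deviation[1]]
    return base, dev

def summarize(spans):
    """Summarize one window: span count, p99 latency, and error rate."""
    if len(spans) < 2:
        raise ValueError("need at least two spans to compute a percentile")
    latencies = [s["latency_ms"] for s in spans]
    # quantiles(n=100) returns 99 cut points; the last one is the p99.
    p99 = quantiles(latencies, n=100, method="inclusive")[-1]
    errors = sum(1 for s in spans if s["error"])
    return {"count": len(spans), "p99_ms": p99, "error_rate": errors / len(spans)}

# Synthetic data: latency jumps from 10 ms to 50 ms at ts=100.
spans = [
    {"ts": t, "latency_ms": 10 if t < 100 else 50, "error": False}
    for t in range(200)
]
base, dev = split_windows(spans, baseline=(0, 100), deviation=(100, 200))
base_summary = summarize(base)
dev_summary = summarize(dev)
```

Comparing the two summaries shows whether the service behaved differently during the deviation than it did just before it.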

When Change Intelligence finds performance changes on any Key Operations on that service, it determines the magnitude of each change and lists the operations in descending order (most changed first). For each operation, it displays sparkcharts for latency, operation rate, and error rate.

In this example, Change Intelligence shows us that the update-catalog Key Operation on the warehouse service experienced the most change (in p99 latency) at the same time as the metric deviation.

Change Intelligence - biggest change in the warehouse
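
One way to picture the "most changed first" ordering is to score each operation by its largest relative change across the three metrics. The post does not say how magnitude is computed, so the scoring rule, the metric names, the `list-inventory` operation, and all numbers below are assumptions.

```python
def relative_change(before, after):
    """Absolute change as a fraction of the baseline value."""
    if before == 0:
        return float("inf") if after > 0 else 0.0
    return abs(after - before) / before

def rank_operations(baseline, deviation):
    """Rank operations by their largest relative metric change, biggest first.

    Both arguments map operation name -> {metric name -> value}.
    Returns a list of (operation, most_changed_metric, change) tuples.
    """
    ranked = []
    for op, base_metrics in baseline.items():
        dev_metrics = deviation.get(op)
        if dev_metrics is None:
            continue  # operation not seen during the deviation window
        changes = {
            metric: relative_change(value, dev_metrics[metric])
            for metric, value in base_metrics.items()
        }
        top_metric = max(changes, key=changes.get)
        ranked.append((op, top_metric, changes[top_metric]))
    ranked.sort(key=lambda row: row[2], reverse=True)
    return ranked

# Illustrative numbers only.
baseline = {
    "update-catalog": {"p99_ms": 100, "rate": 10, "error_rate": 0.01},
    "list-inventory": {"p99_ms": 80, "rate": 20, "error_rate": 0.02},
}
deviation = {
    "update-catalog": {"p99_ms": 500, "rate": 12, "error_rate": 0.01},
    "list-inventory": {"p99_ms": 88, "rate": 21, "error_rate": 0.02},
}
ranked = rank_operations(baseline, deviation)
```

With these numbers, `update-catalog` ranks first because its p99 latency changed far more than anything on `list-inventory`.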

Once it finds a Key Operation with meaningful performance changes, Change Intelligence looks for traces that include that operation. It analyzes those traces, searching for attributes that appear frequently during the performance regression on spans from services up and down the stack. In other words, if an attribute appears on many traces with performance issues (and not on traces that are stable), it’s likely that something about that attribute is causing the issue. These attributes are displayed as likely causes of the change.

In this example, the attribute customer:ProWool was found on over 41% of traces during the deviation and on less than 7% during the baseline. The latency on traces with that attribute increased 5x, and the operation rate for those spans also increased.

Change Intelligence - most likely causes
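
The attribute comparison described above can be sketched as a frequency difference between the two windows. This is a simplified illustration, not the product's actual scoring: the trace shape (a list of `key:value` strings per trace), the `min_lift` threshold, and the `region:us` attribute are assumptions, and the counts are chosen only to resemble the percentages in the example.

```python
from collections import Counter

def attribute_share(traces):
    """Fraction of traces that carry each attribute at least once."""
    if not traces:
        return {}
    counts = Counter()
    for trace in traces:
        # A set counts an attribute once per trace, even if it is on many spans.
        counts.update(set(trace["attributes"]))
    return {attr: n / len(traces) for attr, n in counts.items()}

def likely_causes(baseline_traces, deviation_traces, min_lift=0.2):
    """Attributes much more common during the deviation than the baseline.

    Returns (attribute, deviation_share, baseline_share) tuples, sorted by
    the difference between the two shares, largest first.
    """
    base = attribute_share(baseline_traces)
    dev = attribute_share(deviation_traces)
    scored = [
        (attr, share, base.get(attr, 0.0))
        for attr, share in dev.items()
        if share - base.get(attr, 0.0) >= min_lift
    ]
    scored.sort(key=lambda row: row[1] - row[2], reverse=True)
    return scored

def make_traces(total, with_prowool):
    """Build synthetic traces; the first `with_prowool` carry the customer tag."""
    return [
        {"attributes": ["region:us"] + (["customer:ProWool"] if i < with_prowool else [])}
        for i in range(total)
    ]

baseline_traces = make_traces(100, with_prowool=6)
deviation_traces = make_traces(100, with_prowool=42)
causes = likely_causes(baseline_traces, deviation_traces)
```

An attribute such as `region:us` that appears on every trace in both windows is filtered out, because it does not distinguish the regressed traces from the stable ones.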

Looking at the service diagram, you can see that the attribute is being sent on traces coming from the /api/get-catalog operation on the iOS service. The diagram also shows that there's one service between the iOS service and the warehouse service.

Change Intelligence - service diagrams & attributes

Change Intelligence collects exemplar traces that include both the correlated attribute and the performance issue. Clicking View sample traces lets you choose one and open it in the Trace view.

Change Intelligence - exemplar traces

In the trace, it looks like the customer ProWool sent 1,000 requests and the write to the database is overwhelmed. That's likely why the CPU metric spiked.

Change Intelligence - CPU spike

Change Intelligence was able to pinpoint the part of the system that is likely causing the change in the metric. By combining metrics with tracing, you don't just learn that a change occurred; you can find its source without leaving Lightstep.
