Find the Needle: How Change Intelligence finds the cause of a metric deviation
by Robin Whitmore
Lightstep’s Change Intelligence combines metric and tracing telemetry data to give you full observability into your system from one tool. Your metric dashboards and alerts not only show you when there’s a problem, they also become actionable tools that find the source. You don’t need to be a DevOps engineer or know the dependencies in your system; Lightstep understands them for you and can find the issues deep in your stack.
When you notice a deviation on a metric chart, you can use Change Intelligence to correlate that deviation with traces from your services to find what in your system may have caused the change. Change Intelligence uses trace data from your service instrumentation to determine upstream and downstream dependencies and then finds changes along that path that happened at the same time as the metric deviation.
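To make the idea of deriving dependencies from trace data concrete, here is a minimal sketch. It is not Lightstep's implementation. It assumes each span is a plain dictionary with hypothetical `span_id`, `parent_id`, and `service` fields, and the `gateway` and `database` service names are invented for illustration.

```python
from collections import defaultdict

def service_edges(spans):
    """Map each calling service to the set of services it calls directly."""
    by_id = {s["span_id"]: s for s in spans}
    edges = defaultdict(set)
    for span in spans:
        parent = by_id.get(span.get("parent_id"))
        # A parent span in a different service means a cross-service call.
        if parent and parent["service"] != span["service"]:
            edges[parent["service"]].add(span["service"])
    return edges

def reverse(edges):
    """Flip caller -> callee edges into callee -> caller edges."""
    rev = defaultdict(set)
    for caller, callees in edges.items():
        for callee in callees:
            rev[callee].add(caller)
    return rev

def reachable(edges, start):
    """Every service reachable from start by following edges."""
    seen, stack = set(), [start]
    while stack:
        for nxt in edges.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

# Hypothetical trace: one request passing through four services.
spans = [
    {"span_id": "a", "parent_id": None, "service": "iOS"},
    {"span_id": "b", "parent_id": "a", "service": "gateway"},
    {"span_id": "c", "parent_id": "b", "service": "warehouse"},
    {"span_id": "d", "parent_id": "c", "service": "database"},
]

edges = service_edges(spans)
downstream = reachable(edges, "warehouse")
upstream = reachable(reverse(edges), "warehouse")
print("downstream:", sorted(downstream))
print("upstream:", sorted(upstream))
```

Walking the edges in both directions from the service that reported the metric gives the set of services worth checking for changes.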
Change Intelligence begins by setting baseline and deviation time windows and then compares and analyzes the performance of the service that sent the metric data before and during the deviation.
When Change Intelligence finds performance changes on any Key Operations on that service, it determines the magnitude of each change and lists the operations in descending order (most changed first). For each operation, it displays sparkcharts for latency, operation rate, and error rate.
In this example, Change Intelligence shows us that the update-catalog Key Operation on the warehouse service experienced the most change (in p99 latency) at the same time as the metric deviation.
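The baseline-versus-deviation comparison can be sketched as follows. This is an illustration only and makes several assumptions: latencies are given as plain lists of milliseconds per operation, the percentile uses the nearest-rank method, and "magnitude of change" is taken to be the relative change in p99 latency. The `get-inventory` operation and all numbers are made up.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def rank_operations(baseline, deviation, pct=99):
    """Return (operation, relative change) pairs, most changed first."""
    changes = []
    for op, base_latencies in baseline.items():
        dev_latencies = deviation.get(op)
        if not base_latencies or not dev_latencies:
            continue  # operation missing from one of the windows
        before = percentile(base_latencies, pct)
        after = percentile(dev_latencies, pct)
        if before == 0:
            continue  # relative change is undefined for a zero baseline
        changes.append((op, abs(after - before) / before))
    return sorted(changes, key=lambda item: item[1], reverse=True)

# Hypothetical latency samples (ms) for each time window.
baseline = {"update-catalog": [100, 110, 120], "get-inventory": [50, 55, 60]}
deviation = {"update-catalog": [100, 300, 600], "get-inventory": [50, 56, 63]}

ranked = rank_operations(baseline, deviation)
for op, change in ranked:
    print(f"{op}: p99 changed by {change:.0%}")
```

The same ranking could be applied to operation rate or error rate by swapping in a different summary function for `percentile`.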
Once it finds a Key Operation with meaningful performance changes, Change Intelligence looks for traces with that operation. It analyzes those traces, searching for attributes that appear frequently during the performance regression on spans from services up and down the stack. In other words, if an attribute appears on a number of traces with performance issues (and doesn’t appear on traces that are stable), it’s likely that something about that attribute is causing the issue. These attributes are displayed as likely causes of the change.
In this example, the attribute customer:ProWool was found on over 41% of traces during the deviation and on less than 7% during the baseline. The latency on traces with that attribute increased 5x and the operation rate for those spans also increased.
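The attribute comparison described above can be approximated with a simple frequency check. The sketch below is a simplification, not the product's algorithm: it assumes each trace is a list of span attribute dictionaries, it looks only at how often an attribute appears (not at latency), and the `min_gap` threshold, the `region` attribute, and the sample counts are invented for illustration.

```python
from collections import Counter

def attribute_shares(traces):
    """Fraction of traces that carry each (key, value) attribute."""
    counts = Counter()
    for trace in traces:
        seen = set()
        for attrs in trace:  # one attribute dict per span
            seen.update(attrs.items())
        counts.update(seen)  # count each attribute once per trace
    total = len(traces)
    return {attr: n / total for attr, n in counts.items()} if total else {}

def likely_causes(baseline_traces, deviation_traces, min_gap=0.2):
    """Attributes much more common during the deviation than the baseline."""
    base = attribute_shares(baseline_traces)
    dev = attribute_shares(deviation_traces)
    hits = [
        (attr, share, base.get(attr, 0.0))
        for attr, share in dev.items()
        if share - base.get(attr, 0.0) >= min_gap
    ]
    return sorted(hits, key=lambda h: h[1] - h[2], reverse=True)

def make_traces(total, prowool):
    """Build hypothetical single-span traces for the example."""
    return [
        [{"customer": "ProWool" if i < prowool else "Other", "region": "us-east"}]
        for i in range(total)
    ]

baseline_traces = make_traces(100, 6)    # 6% of baseline traces
deviation_traces = make_traces(100, 42)  # 42% of deviation traces

causes = likely_causes(baseline_traces, deviation_traces)
for (key, value), dev_share, base_share in causes:
    print(f"{key}:{value} deviation={dev_share:.0%} baseline={base_share:.0%}")
```

An attribute such as `region`, which appears on every trace in both windows, is filtered out because its share does not change.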
Looking at the service diagram, you can see that attribute is being sent on traces coming from the /api/get-catalog operation on the iOS service. The diagram also shows that there's one service in between the iOS service and the
Change Intelligence collects exemplar traces that include the attribute correlated with the performance issue. Clicking View sample traces lets you choose one and open it in the Trace view.
In the trace, it looks like the customer ProWool sent 1,000 requests and the write to the database is overwhelmed. That's likely why the CPU metric spiked.
Change Intelligence was able to pinpoint the part of the system that is likely causing the change in the metric performance. By combining metrics with tracing, instead of just knowing that a change occurred, you can find the source without leaving Lightstep.