Lightstep’s Change Intelligence combines metric and tracing telemetry data to give you full observability into your system using one tool. Your metric dashboards and alerts not only show you when there’s a problem, they also become actionable tools that help you find the source. You don’t need to be a DevOps engineer or know the dependencies in your system; Lightstep understands them for you and can find issues deep in your stack.
When you notice a deviation on a metric chart, you can use Change Intelligence to correlate that deviation with traces from your services and find what in your system may have caused the change. Change Intelligence uses trace data from your service instrumentation to determine upstream and downstream dependencies, then looks for changes along that path that happened at the same time as the metric deviation.
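To make the idea of deriving dependencies from trace data concrete, here is a minimal Python sketch. It is not Lightstep's implementation; the span fields (`span_id`, `parent_id`, `service`) and the function names are assumptions chosen for illustration. It infers a service-to-service call graph from parent/child span relationships, then walks that graph to list everything downstream of a service.

```python
from collections import defaultdict, deque


def build_service_graph(spans):
    """Map each service to the set of services it calls directly.

    `spans` is a list of dicts with hypothetical keys:
    span_id, parent_id (None for a root span), and service.
    """
    service_of = {s["span_id"]: s["service"] for s in spans}
    downstream = defaultdict(set)
    for s in spans:
        parent_service = service_of.get(s["parent_id"])
        # Only record edges that cross a service boundary.
        if parent_service is not None and parent_service != s["service"]:
            downstream[parent_service].add(s["service"])
    return downstream


def invert(graph):
    """Flip caller -> callee edges to get callee -> caller (upstream) edges."""
    upstream = defaultdict(set)
    for caller, callees in graph.items():
        for callee in callees:
            upstream[callee].add(caller)
    return upstream


def reachable(graph, start):
    """Return every service reachable from `start` by following edges."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen and nxt != start:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

Calling `reachable(build_service_graph(spans), "warehouse")` gives the downstream services, and `reachable(invert(build_service_graph(spans)), "warehouse")` gives the upstream ones.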
When you find an issue in a dashboard or chart (even a chart in an alert), you can click into the deviation and choose What caused this change?
Change Intelligence begins by setting baseline and deviation time windows and then compares and analyzes the performance of the service that sent the metric data before and during the deviation.
When Change Intelligence finds performance changes on any Key Operations on that service, it determines the magnitude of each change and lists the operations in descending order (most changed first). For each operation, it displays sparkcharts for latency, operation rate, and error rate.
In this example, Change Intelligence shows us that the update-catalog Key Operation on the warehouse service experienced the most change (in p99 latency) at the same time as the metric deviation.
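The baseline-versus-deviation comparison and ranking described above can be sketched roughly as follows. This is an illustrative approximation, not the product's actual algorithm: it assumes each window is a dict mapping an operation name to a list of latency samples, uses a nearest-rank percentile, and scores each operation by the relative change in p99 latency. The function names are hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def rank_operations(baseline, deviation, pct=99):
    """Rank operations by relative latency change, most changed first.

    `baseline` and `deviation` map operation name -> latency samples (ms)
    collected in the baseline and deviation time windows.
    """
    changes = []
    # Only operations seen in both windows can be compared.
    for op in baseline.keys() & deviation.keys():
        before = percentile(baseline[op], pct)
        after = percentile(deviation[op], pct)
        if before == 0:
            continue  # avoid dividing by zero; no meaningful ratio
        changes.append((op, (after - before) / before))
    return sorted(changes, key=lambda c: abs(c[1]), reverse=True)
```

Sorting on the absolute value means a large improvement ranks as highly as a large regression, which matches "most changed first" rather than "most degraded first". The same approach could be repeated for operation rate and error rate.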
Once it finds a Key Operation with meaningful performance changes, Change Intelligence looks for traces with that operation. It analyzes those traces, searching for attributes that appear frequently on spans from services up and down the stack during the performance regression. In other words, if an attribute appears on a number of traces with performance issues (and doesn’t appear on traces that are stable), it’s likely that something about that attribute is causing the issue. These attributes are displayed as likely causes of the change.
In this example, the attribute customer:ProWool was found on over 41% of traces during the deviation and on less than 7% during the baseline. The latency on traces with that attribute increased 5x and the operation rate for those spans also increased.
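The attribute comparison can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not Lightstep's method: each trace is reduced to a flat dict of attributes, and an attribute is flagged when its share of traces in the deviation window exceeds its share in the baseline window by a hypothetical threshold (`min_lift`).

```python
from collections import Counter


def attribute_share(traces):
    """Fraction of traces on which each (key, value) attribute appears."""
    counts = Counter()
    for attrs in traces:
        # A set ensures each attribute is counted once per trace.
        counts.update(set(attrs.items()))
    total = len(traces)
    if total == 0:
        return {}
    return {attr: n / total for attr, n in counts.items()}


def likely_causes(baseline_traces, deviation_traces, min_lift=0.2):
    """Attributes that are much more common during the deviation.

    Returns (attribute, lift) pairs sorted by lift, largest first, where
    lift = share during deviation - share during baseline.
    """
    base = attribute_share(baseline_traces)
    dev = attribute_share(deviation_traces)
    lifts = [(attr, share - base.get(attr, 0.0)) for attr, share in dev.items()]
    flagged = [item for item in lifts if item[1] >= min_lift]
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```

With made-up sample data in which `customer: ProWool` appears on 6 of 100 baseline traces and 42 of 100 deviation traces, `likely_causes` returns that attribute with a lift of about 0.36. A fuller version would also compare latency and operation rate for the flagged traces, as the example above does.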
Looking at the service diagram, you can see that attribute is being sent on traces coming from the /api/get-catalog operation on the iOS service. The diagram also shows that there's one service in between the iOS service and the
Change Intelligence collects exemplar traces that show both the correlated attribute and the performance issue. Clicking View sample traces allows you to choose one and open it in the Trace view.
In the trace, it looks like the customer ProWool sent 1,000 requests and the write to the database is overwhelmed. That's likely why the CPU metric spiked.
Change Intelligence was able to pinpoint the part of the system that is likely causing the change in the metric performance. By combining metrics with tracing, instead of just knowing that a change occurred, you can find the source without leaving Lightstep.
March 11, 2021
About the author
Robin Whitmore