When there are unexpected issues in production, things can escalate quickly.

Suddenly, your phone is buzzing and Slack notifications are everywhere.

In situations like these, how do you know where to start your investigation?

Sure, sometimes there are clear indicators (AWS, GCP, or Azure is down).

But what about when there are not? When you don’t even know where something is broken?

Enter: Correlations.

What Makes Correlations Unique?

Leveraging the complete end-to-end traces produced by LightStep’s unique Satellite architecture, Correlations quantifies the relationship between system attributes and performance — surfacing likely root causes.

Because the Satellite architecture performs no up-front sampling, Correlations yields powerful, statistically driven insights in a fraction of the time it takes a person to find and view an individual trace.

  1. Correlations is fast. It produces root cause insights almost immediately, aggregating and analyzing thousands of traces in seconds to identify which signals best explain the regression in your service.
  2. No signal is too rare, subtle, or specific. Correlations reveals issues invisible to conventional monitoring solutions, including extreme outliers, low-frequency events, and performance issues related to any specific tag, trace, service, geography, release version, operation, or individual customer.
  3. Correlations automatically gathers evidence. As soon as a non-performant signal is identified, LightStep provides immediate access to the exact spans, traces, and tags related to that signal — all of which can be shared with your team via Snapshots.

See the Forest, Not the Trace

Let’s say you’ve just sat down for a delicious home-cooked meal with some friends, when an unexpected guest shows up: an alert containing a latency increase warning. You knew you were on-call, but there were no big deploys recently so everything should have been fine.

You apologize, grab your laptop, and open the alert. It’s from a service whose p99 latency has apparently been rising while you were whipping up dinner. And the service isn’t yours. In fact, you’re not sure you’ve ever even heard of it. The food will be hot for another 10 minutes or so … You check the dashboards and they confirm the rising latency. You open up a slow trace.

The beginning of an investigation: this isn’t my service. Let the guessing games begin.

And another. You open up the On-Call Playbook and start scanning. Nada. Well, might as well look at another trace. Two had errors, and the dashboard showed a rising error rate, so it seems like a reasonable place to dig in. You look up and see the steam coming from your beautiful dinner is getting faint.

But we’re here to talk about a better way, so before we dive deep into investigating errors, let’s jump into our mental DeLorean and go back in time, way back to when you first received the ill-fated page.

You open the alert and see the same service. But this time, rather than scanning over dashboards and looking at random traces, you use Correlations to analyze the service and the specific function that was reporting the errors.

The beginning of a real investigation with Correlations: Well, I’ll check out that host.

Correlations analyzes thousands of traces and immediately reveals that many of the high latency traces have something in common: {“host”: “66448856b8-dl7cc”}. A second search, and you confirm this host’s response time was rapidly rising. Dinner is still hot!

Remove the failing host. Move the food and plates to the table and tell your friends dinner is ready. Refresh the dashboard, and YES! Latency is back where it belongs. You’re amazing.

But how many dashboards and traces would you have needed to open and how many logs would you have needed to read through to think to check this host? How much luck would have been required to arrive at the same hypothesis? Instead, you saved dinner, you nailed the issue, and you can pour that second glass of wine without needing to worry.
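The analysis in the story above boils down to a simple question: which tags appear far more often on slow traces than on fast ones? A minimal sketch of that idea follows. It is purely illustrative: the span format, the fixed latency threshold, the difference-of-proportions score, and the healthy host name are all assumptions, not LightStep's actual implementation.

```python
from collections import Counter

def overrepresented_tags(spans, latency_threshold_ms):
    """Rank (key, value) tags by how much more often they appear
    on slow spans than on fast ones."""
    slow_tags, fast_tags = Counter(), Counter()
    n_slow = n_fast = 0
    for span in spans:
        is_slow = span["duration_ms"] >= latency_threshold_ms
        if is_slow:
            n_slow += 1
        else:
            n_fast += 1
        bucket = slow_tags if is_slow else fast_tags
        for kv in span["tags"].items():
            bucket[kv] += 1
    scores = {}
    for kv in set(slow_tags) | set(fast_tags):
        p_slow = slow_tags[kv] / n_slow if n_slow else 0.0
        p_fast = fast_tags[kv] / n_fast if n_fast else 0.0
        scores[kv] = p_slow - p_fast  # > 0: over-represented on slow spans
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy data: two slow spans pinned to the suspect host from the story,
# two fast spans on a made-up healthy host.
spans = [
    {"duration_ms": 900, "tags": {"host": "66448856b8-dl7cc"}},
    {"duration_ms": 850, "tags": {"host": "66448856b8-dl7cc"}},
    {"duration_ms": 40, "tags": {"host": "66448856b8-aaaaa"}},
    {"duration_ms": 55, "tags": {"host": "66448856b8-aaaaa"}},
]
ranked = overrepresented_tags(spans, latency_threshold_ms=500)
print(ranked[0])  # the suspect host floats to the top with score 1.0
```

The point of doing this over thousands of complete traces, rather than a sampled handful, is exactly the one the story makes: the failing host surfaces in one query instead of after a dozen dashboards.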

Forest Illuminated: Visualize High-Latency Tags and Operations

Let’s take a look at another example.

After receiving a complaint from a customer that their service had become slow, you query the top-level service: api-server.

This customer_id has a +.86 correlation, indicating a strong relationship with high latency.

In Correlations, system attributes related to latency are listed on the left, each with a correlation score. Hovering over or clicking an attribute displays the distribution of spans containing that attribute on the histogram above. You can easily see that customer_id: BEEMO appears on slower spans for requests coming through the api-server service, and the correlation coefficient of +.86 confirms what you see visually: slow spans tend to contain this tag, while lower-latency spans do not.
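A score like this can be computed in a standard way: treat tag presence as a 0/1 variable and correlate it with span latency (a point-biserial correlation). LightStep hasn't published the exact statistic behind its score, so the sketch below, with made-up latencies, is just one plausible way to arrive at a number like +.86:

```python
import statistics

def tag_latency_correlation(durations_ms, has_tag):
    """Point-biserial correlation between a binary tag indicator and latency.
    Returns a value in [-1, 1]: positive means the tag tends to sit on
    slower spans, negative means it tends to sit on faster ones."""
    with_tag = [d for d, t in zip(durations_ms, has_tag) if t]
    without = [d for d, t in zip(durations_ms, has_tag) if not t]
    p = len(with_tag) / len(durations_ms)  # fraction of spans carrying the tag
    q = 1.0 - p
    spread = statistics.pstdev(durations_ms)  # population std dev of all latencies
    mean_gap = statistics.mean(with_tag) - statistics.mean(without)
    return mean_gap / spread * (p * q) ** 0.5

# Hypothetical latencies: the first three spans carry customer_id: BEEMO.
durations_ms = [800, 900, 750, 60, 50, 70]
has_beemo_tag = [True, True, True, False, False, False]
r = tag_latency_correlation(durations_ms, has_beemo_tag)
print(round(r, 2))  # strongly positive: the tag clusters on slow spans
```

The same formula produces the negative scores mentioned further on: a tag concentrated on fast spans makes the mean gap, and therefore the score, negative.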

Correlation is not causation, of course, but you can see that another tag (below), subnetwork: us-west1, also appears more frequently when latency is higher. A hypothesis is born!

An almost identical distribution and correlation. Investigation breakthrough!

Correlations also highlights the inverse relationship: the system attributes related to lower latency. These attributes appear with a negative correlation score, indicating that they tend to sit on lower-latency spans. By selecting customer_id: ACME, you can see that this customer’s traffic is faster: the spans that contain the tag fall in the low-latency region of the histogram.

customer_id: ACME tends not to appear on high-latency spans.

So, What’s Next?

Correlations is already helping organizations better understand how their distributed systems are performing.

No matter where you are on your microservices journey, we can help make root cause analysis faster, easier, and more effective.

If you’d like to see Correlations for yourself, sign up for a demo, and we’ll walk you through our newest feature.