How distributed tracing makes logs more valuable
by Ben Sigelman
A few months ago, I sat down to chat with a platform team leader from one of Lightstep’s customers. We were talking about observability (big surprise!), and he made a fascinating remark: logging has become a “selfish” technology.
So, what did he mean by this?? What is logging withholding with its selfishness? And how can we address this with a more thoughtful and unified approach to observability?
First off, a crucial clarification: I don’t mean that the “loggers” – that is, the human operators – are selfish, of course! The problem has been that their (IMO primitive) toolchain needlessly localizes and constrains the value of the logging telemetry data.
How? Well, traditional logging tech encourages the code-owner to add logs in order to explain what’s happening with their code, in production, to themself. Or, maybe, to other/future owners of that particular patch of code. Useful? Sometimes. But both limited and limiting.
Logging data itself seems so rich, and surely relevant to countless “unsolved mysteries” in production. And yet: during any particular incident or investigation, the narrow sliver of relevant logging data can’t be separated from the rest of the firehose.
So how do we realize the actual value of our logs? How do we rank and filter them in the context of an actual incident or investigation? How do we “multiply their value,” and make logging a “generous” discipline rather than a “selfish” one?
- Automatically attach the logs to the trace context
- Collect everything
- Use tracing data and AI/ML to correlate critical symptoms in one service with the relevant logs from others … and now the logs help everyone, not just the logger.
These sorts of things are much easier to explain with a real-world example, so let’s make this more concrete. I’ll pick a recent issue from Lightstep’s own live, multi-tenant SaaS in order to illustrate the concepts here.
I heard that we had an elevated error ratio for HTTP POST in a central service – code-named “crouton” – a few nights ago. Here’s how it looked: (PS: “crouton” is the oldest service in our system, and the name is my fault. It was a dad-joke about “breadcrumbs.”)
In a conventional metrics or monitoring tool, this is when you’d start guessing and checking – very basic, and in a bad way. What you’d rather do is just click on the spike and understand what’s changed. Here, you can:
We need to understand “what changed.” To help with that, this view color-codes regression state (red) and baseline state (blue) and uses the root-cause-analysis engine to examine all of its upstream and downstream dependencies – including logs from those distant services!
Here’s why this would be so painful without 1000s of traces to power the analysis. This is the dependency diagram just for this HTTP POST regression. No human has time to figure this out manually, much less analyze the relevant snippets of log data along the way.
Thankfully, that sort of strenuous manual effort is no longer needed. An observability solution should be able to do that work for us, and should also provide quantitative impact data. Here’s what this view has to say about correlated logs:
This is interesting! We now have concrete evidence of an actual data integrity issue, including some specifics (“byte 22”, etc) – and we can follow up by looking at many distributed traces.
Here’s the first one of those, for example – this trace shows the connection between this deeply-nested log message and the HTTP POST that concerned us at the outset, and gives us enough detail to reproduce and resolve outside of production.
Logging tech used to be “selfish” – service logs only really helped that service’s owners. Enriched with context from tracing data, though, logs can help the entire org resolve mysteries across the stack. This is what unified observability can – and should – deliver.