Let me tell you a quick story about monitoring. In my last role, I was responsible for managing a fleet of integration testing infrastructure, which was a collection of pre-built VMs, custom PowerShell and bash scripts, and creaky internal tools with poor documentation. This setup was obviously of critical importance to the company, as we were unable to close tickets or validate releases without successful test runs of our software on this infrastructure.
It wouldn’t be a stretch to say that my experience working with that stack is the reason I’m writing this today -- many sleepless nights and long hours of investigating and correlating by hand sparked not only a deep interest in monitoring and observability, but also a passion for ensuring that nobody else ever has to suffer through what I did.
The most frequent and puzzling task I had to perform was correlating resource metrics, such as CPU, disk, or memory utilization, with transactions, like a specific test run that exercised an API route. Transactions were logged in plain text to a SQL database, and resources emitted rather coarse metrics from the hypervisor. Environments could have tens or hundreds of unique nodes, and tracking down correlations between metric signals and relevant logs was a painstaking manual process.
One of the reasons I’m such a proponent of distributed tracing is that it solves half of this problem: traces offer rich, detailed diagnostic information about the ‘golden signals’ of rate, error, and duration, with deep context into the upstream and downstream dependencies of a service. They’re the best way to understand transaction performance in a system.
Connecting this trace data to resource telemetry, however, remains a challenge. A given resource, such as a Kubernetes node or SQL server, can handle thousands or millions of transactions at any given moment. How do we bridge the gap between resource telemetry and transaction telemetry?
Exemplars are perhaps the most common solution to this problem. A metric exemplar associates a trace with a measurement -- so if your application records a metric of request counts, you can link that count measurement to a specific trace identifier. This doesn’t fully satisfy our needs, though. First, exemplar assignment must be performed manually in most cases. Second, you’re limited to the things you thought to create exemplars for in the first place. In addition, underlying resource metrics from Kubernetes can’t easily be assigned exemplars -- the node knows very little about what application code is running in a pod, after all.
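To make the idea of an exemplar concrete, here is a minimal sketch in plain Python. It is not the OpenTelemetry or Lightstep API; the `Exemplar` and `RequestCounter` names and the trace identifiers are hypothetical, chosen only to illustrate how a count measurement can carry a link to a specific trace.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Exemplar:
    """Hypothetical record linking one measurement to one trace."""
    trace_id: str
    value: float


@dataclass
class RequestCounter:
    """Hypothetical request-count metric that can carry exemplars."""
    count: int = 0
    exemplars: List[Exemplar] = field(default_factory=list)

    def record(self, trace_id: Optional[str] = None) -> None:
        # Every call increments the count; an exemplar is attached only
        # when the caller explicitly supplies a trace identifier.
        self.count += 1
        if trace_id is not None:
            self.exemplars.append(Exemplar(trace_id=trace_id, value=1.0))


counter = RequestCounter()
counter.record(trace_id="trace-a1")
counter.record()  # no trace supplied, so no exemplar is stored
counter.record(trace_id="trace-c3")

linked = [e.trace_id for e in counter.exemplars]
print(counter.count, linked)
```

Note that the link only exists where the calling code passed a trace identifier, which is the manual step described above.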
This week at KubeCon, Lightstep is previewing a new feature that we’re calling “attribute pivots” that solves this problem. Attribute pivots are like exemplars, but with a couple of crucial differences:
- Pivots don’t require specific exemplar assignment in advance
- Pivots can be between any metric and _any_ trace, as long as they share a common attribute
Pivots allow you to join your resource telemetry from a Kubernetes cluster, such as container memory utilization, with your traces, and view trace exemplars in situ in the same Notebook graph. This lets you visually correlate a resource consumption spike with long-tail latency in a service, or jump into application errors straight from a DB queue metric.
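The "share a common attribute" idea can be sketched as a simple join. The code below is an illustration only, not how Lightstep implements attribute pivots: the `MetricPoint`, `Span`, and `pivot` names, the 30-second window, and the sample data are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MetricPoint:
    name: str
    timestamp: float  # seconds since some epoch
    value: float
    attributes: Dict[str, str]


@dataclass
class Span:
    trace_id: str
    start: float  # seconds, same clock as MetricPoint.timestamp
    duration_ms: float
    attributes: Dict[str, str]


def pivot(point: MetricPoint, spans: List[Span], key: str,
          window_s: float = 30.0) -> List[Span]:
    """Return spans that share `key` with the metric point and started
    within `window_s` seconds of it. The window is an assumed heuristic."""
    wanted = point.attributes.get(key)
    if wanted is None:
        return []
    return [
        s for s in spans
        if s.attributes.get(key) == wanted
        and abs(s.start - point.timestamp) <= window_s
    ]


# Hypothetical data: a memory spike on one pod plus three spans.
spike = MetricPoint("container.memory.usage", 1000.0, 0.93,
                    {"k8s.pod.name": "checkout-7f9"})
spans = [
    Span("t1", 990.0, 1800.0, {"k8s.pod.name": "checkout-7f9"}),
    Span("t2", 1005.0, 40.0, {"k8s.pod.name": "cart-2b1"}),
    Span("t3", 1500.0, 35.0, {"k8s.pod.name": "checkout-7f9"}),
]

matches = pivot(spike, spans, "k8s.pod.name")
print([s.trace_id for s in matches])
```

Nothing in the span data was tagged as an exemplar ahead of time; the match falls out of the shared pod name and the timestamps, which is the difference from the exemplar approach.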
We’d love to show you this feature in action and get your feedback this week at KubeCon, so come to Booth P11 at 10:35 and 4:35 Wednesday or Thursday to watch a live demo. We believe that cloud-native observability requires the unification of telemetry signals from logs, metrics, and traces -- attribute pivots are an example of how this unification works in practice. In the future, you can imagine features like this working with log data, profiles, or any other structured event sent to the Lightstep platform. We’re hard at work integrating and building on the work that our new colleagues from Era Software started, which we’ll have more to say about in 2023, but let me be the first to say -- I can’t wait to show you what’s coming next.
October 25, 2022
About the author
Austin Parker