Announcing Change Intelligence, actionable Metrics dashboards and alerts, and a new approach for Observability
In this blog post
The Anatomy of Observability
Layer 1 – (Open)Telemetry: High-quality, built-in telemetry for all!
Layer 2 – Storage: “Time series, Transactions, or Efficiency: Pick Three”
Layer 3 – Benefits: Observability that explains change
Change Intelligence in action
Digging deeper
One last thing…
Getting started
Use our Community Tier
Start a free trial at scale
Speak with an expert
In closing…
Today, Lightstep announced many new things all at once. We went live with:
core monitoring dashboards and alerts that are 100% actionable
robust infrastructure, application, and cloud metrics
automatic migration from Datadog and Prometheus/Grafana, and more
So of course, these announcements mean it’s a big day for Lightstep.
It’s also a big day for anyone interested in observability, and particularly for anyone struggling with observability… and unfortunately, there have been a lot of DevOps teams and SREs stuck in that second camp. It’s easy to see why: until today, observability has either been (1) a shallow label – applied by incumbent vendors – to their siloed, minimally-integrated product portfolios, or (2) something accessible to only those DevOps engineers and SREs who have the time and background required to become bona fide observability experts.
With today’s announcement, Lightstep takes the insights previously restricted to observability experts and makes them accessible to every developer, operator, and SRE. We’ve done this by reframing observability around change, and integrating it with a clean, progressive, and general-purpose monitoring solution built on top of a wildly efficient time series database (TSDB), designed and built by the same people who created Google’s planet-scale Monarch system.
Let’s unpack all of this a bit further…
The Anatomy of Observability
You can read more about “The Anatomy of Observability” in this post; with today’s announcements, Lightstep is innovating at every layer.
Layer 1 – (Open)Telemetry: High-quality, built-in telemetry for all!
Lightstep co-created the OpenTelemetry project (aka “OTel”), as well as the OpenTracing project that preceded it, and through our work in numerous OTel SIGs, the OTel Technical Committee, and the OTel Governance Committee, we are doing all that we can to make high-quality, open-source, and vendor-neutral telemetry a built-in feature across your entire stack. This is important work and we’ve been at it for over 5 (!) years now. It’s particularly gratifying for us to see the recent traction and announcements of support from the likes of AWS, GCP, Azure, and many (many) other vendors and OSS projects. Among other things, today Lightstep announced native support for OpenTelemetry Metrics. If you’re looking for an OTel Metrics integration target, Lightstep’s free-forever Community Tier makes an excellent choice.
Layer 2 – Storage: “Time series, Transactions, or Efficiency: Pick Three”
Lightstep has long been an innovator in distributed tracing, and as part of that, we’ve built a heavily differentiated transaction (aka “event”, aka “tracing”) database capable of highly dynamic sampling, full-system Snapshots, widely distributed storage and query evaluation, and much more. But that’s not news. Beginning today, though, Lightstep is also offering time series storage, dashboarding, and alerting built on top of our next-gen time series database (TSDB). Our TSDB was designed and built by the same people who created Google’s planet-scale Monarch system, and we’ve taken many of our lessons there into account. Of course, it’s scalable, and it’s also been designed to make profiling and control over metrics data and telemetry costs easy, intuitive, and centrally manageable.
But most importantly, since our new TSDB was designed from day one to fit into Lightstep’s overall product vision, it enables a progressive and accessible approach to both everyday monitoring and change-oriented observability – the third and most important layer of this “anatomy of observability”...
Layer 3 – Benefits: Observability that explains change
Monitoring isn’t going anywhere, nor should it. What we monitor and how we monitor it can always be more thoughtful and precise, but some sort of charting and alerting is here to stay. Still, all that those charts and alerts can tell you is whether any given part of your system is unhealthy – the charts and alerts alone won’t answer the most important question in observability: “what caused that change?” Nearly every time you need observability, there’s a change taking place: either the intended changes of service deployments and config pushes, or the unintended changes of incident response and unanticipated workloads.
This is why Lightstep has built its entire product around Change Intelligence: by making change our core competency, we can take any deviation – from an alert, from a deployment, or even just from an ordinary chart in a metrics dashboard – and offer explanations across the distributed system.
Finally, the benefits of best-in-class observability are available to any SRE or DevOps engineer who detects an unwanted change in their own system. And that’s what makes today’s release so important for Lightstep and for our industry.
Change Intelligence in action
All of the above sounds wonderful, but you may be wondering – “what does it actually do?!” It’s a fair question.
We could play with synthetic demo data, but that always feels a little hollow. Or we could choose something obvious from a real production system, like a bad release.
But let’s try something more subtle!
We’ll start with a mysterious but unmistakable blip in an infrastructure metric we care about (in this case, heap usage), then…
We’ll try to determine what led to that unwanted change.
Here’s a dashboard of machine metrics for Lightstep’s TraceAssembler service (part of our SaaS), and we’ve highlighted the “mysterious blip” in question. If you’ve ever maintained a service in production, surely you’ve seen thousands of these sorts of things yourself:
Quite frankly, when I’ve encountered blips like this in the past – without Change Intelligence, that is – I’ve (a) been a little concerned, but (b) shrugged my shoulders because sudden and intermittent changes have been too difficult and time-consuming to diagnose.
But what if it wasn’t time-consuming or difficult to diagnose sudden changes? What if it was quick and easy? Like this:
With Change Intelligence, all we need to do is click. After clicking on the deviation where we see a spike, we’re immediately brought to a system-wide analysis that’s specifically tasked with explaining this particular change of behavior:
Without additional effort from me, the user, Lightstep compares the deviation we selected (in sky blue) with baseline behavior (in dark purple). And it’s already highlighted the most likely causes – let’s expand the top-ranked suggestion, project_name:8037…
Now this is really interesting! On the left we can see that there has been a change in the traffic coming from another service: that is, a service that has been calling the traceassembler has changed its workload.
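To make the idea of comparing a deviation against a baseline more concrete, here is a minimal sketch. It is not Lightstep’s implementation; it only illustrates one simple way such a ranking could work, assuming each observed operation carries a single attribute value (for example a project_name tag). The function name, the window lengths, and the sample data are all hypothetical.

```python
from collections import Counter


def rank_attribute_shifts(baseline, deviation, baseline_secs, deviation_secs):
    """Rank attribute values by how much their operation rate changed.

    `baseline` and `deviation` are lists of attribute values, one per
    observed operation, collected over windows of the given lengths.
    """
    base_counts = Counter(baseline)
    dev_counts = Counter(deviation)
    shifts = []
    for value in set(base_counts) | set(dev_counts):
        base_rate = base_counts[value] / baseline_secs
        dev_rate = dev_counts[value] / deviation_secs
        shifts.append((value, base_rate, dev_rate, dev_rate - base_rate))
    # Largest absolute change in rate first; ties broken by attribute value.
    shifts.sort(key=lambda s: (-abs(s[3]), s[0]))
    return shifts


# Hypothetical data: one project tag per operation, two 10-second windows.
baseline_ops = ["proj-a"] * 600 + ["proj-b"] * 400 + ["proj-c"] * 100
deviation_ops = ["proj-a"] * 610 + ["proj-b"] * 390 + ["proj-c"] * 300

ranked = rank_attribute_shifts(baseline_ops, deviation_ops, 10, 10)
for value, base_rate, dev_rate, delta in ranked:
    print(f"{value}: {base_rate:.0f} -> {dev_rate:.0f} ops/sec ({delta:+.0f})")
```

In this made-up data, proj-c ranks first because its rate tripled while the other two projects barely moved. A real system would presumably weigh many attributes at once and account for statistical noise.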
The first thing I’d like to point out is that overall traffic is flat! We’ve gone from about 1.31K operations per second to 1.36K operations per second: just a 3.8% change, which is basically noise.
But what Change Intelligence is telling us here is that project 8037 has gone from about 74 ops/sec to 174 ops/sec! That’s a big change: roughly 2.35x the baseline, or an increase of about 135%.
And now we know what created that mysterious spike in our heap usage: a single Lightstep customer (from project 8037) more than doubled their usual workload.
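The figures quoted above are easy to check with a few lines of arithmetic. The sketch below uses only the numbers from this walkthrough; the helper name relative_change is our own. Note that 174 ops/sec is roughly 235% of 74 ops/sec, which corresponds to an increase of about 135%.

```python
def relative_change(before, after):
    """Return the fractional change going from `before` to `after`."""
    return (after - before) / before


total = relative_change(1310, 1360)   # overall traffic, ops/sec
project = relative_change(74, 174)    # project 8037, ops/sec

print(f"overall traffic: {total:+.1%}")                       # +3.8%
print(f"project 8037:    {project:+.1%} ({174 / 74:.2f}x)")   # +135.1% (2.35x)
```

The overall rate hides the shift because project 8037 is a small share of total traffic, which is why breaking the change down by attribute is what surfaces it.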
Digging deeper
If we’d like to explore further, Change Intelligence includes representative traces for each candidate hypothesis – in this case, traces from the upstream traceanalyzer service, and specifically involving project_name:8037 – and we can examine as many as we’d like:
One last thing…
Hopefully it’s clear how this new functionality is innovative. You may also be wondering if it’s expensive!
It’s not. If you presently use a SaaS vendor for metrics, it’s likely that Lightstep will save you 50% or more on your bill. That’s because we built up our TSDB from scratch (and from first principles), and it’s awesome. 😄 More on that here.
Getting started
Interested in trying this for yourself? There are several risk-free ways to try Lightstep today:
Use our Community Tier
The Community Tier is “free forever,” no strings attached – be up and running with OpenTelemetry and Lightstep in minutes.
Start a free trial at scale
Lightstep’s “Teams” Tier offers a 14-day free trial. Get started and send as much telemetry as you’d like, kick the tires, and experience Change Intelligence with your own data (and your own anomalies). Send both your metrics and tracing data to understand how your infrastructure depends on your ever-fluctuating workload (e.g., “which customer is causing CPU spikes”).
Speak with an expert
Lightstep helped write the book on distributed tracing (no, literally) and SLOs, and has a founding role in the OpenTelemetry project. We can help you get started with any of the above, with or without Lightstep’s product – just get in touch.
In closing…
Today’s announcements are certainly the most significant innovations we’ve introduced since launching Lightstep. And from a personal standpoint, this is the most excited I’ve ever felt about the future of observability. My fellow Lightsteppers and I have been working hard for years to get us to this point, and we are eager to share it with the rest of the world – please check it out and let us know what you think!