Announcing Change Intelligence, actionable Metrics dashboards and alerts, and a new approach for Observability


by Ben Sigelman

02-04-2021

Today, Lightstep announced many new things all at once. We went live with:

  • Change Intelligence
  • core monitoring dashboards and alerts that are 100% actionable
  • robust infrastructure, application, and cloud metrics
  • automatic migration from Datadog and Prometheus/Grafana, and more

So of course, these announcements mean it’s a big day for Lightstep.

It’s also a big day for anyone interested in observability, and particularly for anyone struggling with observability… and unfortunately, there have been a lot of DevOps teams and SREs stuck in that second camp. It’s easy to see why: until today, observability has either been (1) a shallow label – applied by incumbent vendors – to their siloed, minimally-integrated product portfolios, or (2) something accessible to only those DevOps engineers and SREs who have the time and background required to become bona fide observability experts.

With today’s announcement, Lightstep takes the insights previously restricted to observability experts and makes them accessible to every developer, operator, and SRE. We’ve done this by reframing observability around change, and integrating it with a clean, progressive, and general-purpose monitoring solution built on top of a wildly efficient time series database (TSDB), designed and built by the same people who created Google’s planet-scale Monarch system.

Let’s unpack all of this a bit further…

The Anatomy of Observability

Change Intelligence - The Anatomy of Observability

You can read more about “The Anatomy of Observability” in this post; with today’s announcements, Lightstep is innovating at every layer.

Layer 1 – (Open)Telemetry: High-quality, built-in telemetry for all!

Lightstep co-created the OpenTelemetry project (aka “OTel”), as well as the OpenTracing project that preceded it, and through our work in numerous OTel SIGs, the OTel Technical Committee, and the OTel Governance Committee, we are doing all that we can to make high-quality, open-source, and vendor-neutral telemetry a built-in feature across your entire stack. This is important work and we’ve been at it for over 5 (!) years now. It’s particularly gratifying for us to see the recent traction and announcements of support from the likes of AWS, GCP, Azure, and many (many) other vendors and OSS projects. Among other things, today Lightstep announced native support for OpenTelemetry Metrics. If you’re looking for an OTel Metrics integration target, Lightstep’s free-forever Community Tier makes an excellent choice.

Layer 2 – Storage: “Time series, Transactions, or Efficiency: Pick Three”

Lightstep has long been an innovator in distributed tracing, and as part of that, we’ve built a heavily differentiated transaction (aka “event”, aka “tracing”) database capable of highly dynamic sampling, full-system Snapshots, widely distributed storage and query evaluation, and much more. But that’s not news. Beginning today, though, Lightstep is also offering time series storage, dashboarding, and alerting built on top of our next-gen time series database (TSDB). Our TSDB was designed and built by the same people who created Google’s planet-scale Monarch system, and we’ve taken many of the lessons we learned there into account. Of course, it’s scalable, and it’s also been designed to make profiling and controlling metrics data and telemetry costs easy, intuitive, and centrally manageable.

But most importantly, since our new TSDB was designed from day one to fit into Lightstep’s overall product vision, it enables a progressive and accessible approach to both everyday monitoring and change-oriented observability – the third and most important layer of this “anatomy of observability”...

Layer 3 – Benefits: Observability that explains change

Monitoring isn’t going anywhere, nor should it. What we monitor and how we monitor it can always be more thoughtful and precise, but some sort of charting and alerting is here to stay. Still, all that those charts and alerts can tell you is whether any given part of your system is unhealthy – the charts and alerts alone won’t answer the most important question in observability: “what caused that change?” Nearly every time you need observability, there’s a change taking place: either the intended changes of service deployments and config pushes, or the unintended changes of incident response and unanticipated workloads.

This is why Lightstep has built its entire product around Change Intelligence: by making change our core competency, we can take any deviation – from an alert, from a deployment, or even just from an ordinary chart in a metrics dashboard – and offer explanations across the distributed system.

Finally, the benefits of best-in-class observability are available to any SRE or DevOps engineer who detects an unwanted change in their own system. And that’s what makes today’s release so important for Lightstep and for our industry.

Change Intelligence in action

All of the above sounds wonderful, but you may be wondering: “what does it actually do?!” It’s a fair question.

We could play with synthetic demo data, but that always feels a little hollow. Or we could choose something obvious from a real production system, like a bad release.

But let’s try something more subtle!

  • We’ll start with a mysterious but unmistakable blip in an infrastructure metric we care about (in this case, heap usage), then…
  • We’ll try to determine what led to that unwanted change.

Here’s a dashboard of machine metrics for Lightstep’s TraceAssembler service (part of our SaaS), and we’ve highlighted the “mysterious blip” in question. If you’ve ever maintained a service in production, surely you’ve seen thousands of these sorts of things yourself:

Change Intelligence 2

Quite frankly, when I’ve encountered blips like this in the past – without Change Intelligence, that is – I’ve (a) been a little concerned, but (b) shrugged my shoulders because sudden and intermittent changes have been too difficult and time-consuming to diagnose.

But what if it wasn’t time-consuming or difficult to diagnose sudden changes? What if it was quick and easy? Like this:

Lightstep Change Intelligence - click on regression

With Change Intelligence, all we need to do is click. After clicking on the deviation where we see a spike, we’re immediately brought to a system-wide analysis that’s specifically tasked with explaining this particular change of behavior:

Change Intelligence 3

Without additional effort from me, the user, Lightstep compares the deviation we selected (in sky blue) with baseline behavior (in dark purple). And it’s already highlighted the most likely causes. Let’s expand the top-ranked suggestion, project_name:8037:

Change Intelligence 4

Now this is really interesting! On the left we can see that there has been a change in the traffic coming from another service: that is, a service that has been calling the traceassembler has changed its workload.

The first thing I’d like to point out is that overall traffic is flat! We’ve gone from about 1.31K operations per second to 1.36K operations per second: just a 3.8% change, which is basically noise.

But what Change Intelligence is telling us here is that project 8037 has gone from about 74 ops/sec to 174 ops/sec! That’s a big change: roughly 2.35x the baseline rate, or a 135% increase.

And now we know what created that mysterious spike in our heap usage: a single Lightstep customer (from project 8037) more than doubled their usual workload.
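The arithmetic behind this comparison can be sketched in a few lines. This is only an illustration of why a flat total can hide a large per-project shift, not a description of how Change Intelligence works internally. The `SegmentChange` class and `rank_by_relative_change` function are hypothetical names, and the "all other projects" rate is an assumption derived by subtracting project 8037 from the rounded totals quoted above.

```python
from dataclasses import dataclass


@dataclass
class SegmentChange:
    key: str
    baseline: float   # ops/sec during the baseline window
    deviation: float  # ops/sec during the deviation window

    @property
    def ratio(self) -> float:
        # How many times larger (or smaller) the deviation rate is.
        return self.deviation / self.baseline

    @property
    def pct_increase(self) -> float:
        # Relative change, as a percentage of the baseline rate.
        return (self.deviation - self.baseline) / self.baseline * 100.0


def rank_by_relative_change(baseline, deviation):
    """Rank keys present in both windows by the size of their relative change."""
    changes = [
        SegmentChange(k, baseline[k], deviation[k])
        for k in baseline
        if k in deviation and baseline[k] > 0
    ]
    return sorted(changes, key=lambda c: abs(c.pct_increase), reverse=True)


# Rates quoted in the post; "all other projects" is derived by subtraction
# from the rounded totals (1.31K and 1.36K), so it is an approximation.
baseline = {"project_name:8037": 74.0, "all other projects": 1310.0 - 74.0}
deviation = {"project_name:8037": 174.0, "all other projects": 1360.0 - 174.0}

total = SegmentChange("total", sum(baseline.values()), sum(deviation.values()))
ranked = rank_by_relative_change(baseline, deviation)

print(f"{total.key}: {total.pct_increase:+.1f}%")
for c in ranked:
    print(f"{c.key}: {c.pct_increase:+.1f}% ({c.ratio:.2f}x)")
```

With these inputs the total moves by about 3.8%, while project_name:8037 ranks first with a ratio of about 2.35x (a 135% increase). Aggregates smooth out exactly the kind of per-customer shift that explains the heap blip, which is why breaking traffic down by attribute matters.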

Digging deeper

If we’d like to explore further, Change Intelligence includes representative traces for each candidate hypothesis – in this case, traces from the upstream traceanalyzer service, and specifically involving project_name:8037 – and we can examine as many as we’d like:

Change Intelligence 5
Change Intelligence 6

One last thing…

Hopefully it’s clear how this new functionality is innovative. You may also be wondering if it’s expensive!

It’s not. If you presently use a SaaS vendor for metrics, it’s likely that Lightstep will save you 50% or more on your bill. That’s because we built up our TSDB from scratch (and from first principles), and it’s awesome. 😄 More on that here.

Getting started

Interested in trying this for yourself? There are several risk-free ways to try Lightstep today:

Use our Community Tier

The Community Tier is “free forever,” no strings attached – be up and running with OpenTelemetry and Lightstep in minutes.

Start a free trial at scale

Lightstep’s “Teams” Tier offers a 14-day free trial. Get started and send as much telemetry as you’d like, kick the tires, and experience Change Intelligence with your own data (and your own anomalies). Send both your metrics and tracing data to understand how your infrastructure depends on your ever-fluctuating workload (e.g., “which customer is causing CPU spikes”).

Speak with an expert

Lightstep helped write the book on distributed tracing (no, literally) and on SLOs, and has a founding role in the OpenTelemetry project. We can help you get started with any of the above, with or without Lightstep’s product – just get in touch.

In closing…

Today’s announcements are certainly the most significant innovations we’ve introduced since launching Lightstep. And from a personal standpoint, this is the most excited I’ve ever felt about the future of observability. My fellow Lightsteppers and I have been working hard for years to get us to this point, and we are eager to share it with the rest of the world – please check it out and let us know what you think!
