APM is Dying — and That’s Okay
by Ben Sigelman
In APM’s heyday (think “New Relic and AppDynamics circa 2015”), the value prop was straightforward: “Just add this one magic agent and you’ll never need to wonder why your monolithic app is broken!”
But then everything changed, and APM wasn’t able to change along with it. Here’s what happened…
Systems Got Deep
APM was designed for monoliths, where development revolved around a single application server. It turned out that monoliths slowed down developer velocity, so we broke them into layer upon layer of microservices.
In doing so, we enabled developers to build and release software faster, and with greater independence, as they were no longer beholden to the elaborate, Sisyphean release processes associated with monolithic architectures.
But as these now-distributed systems scaled, it became increasingly difficult for developers to see how their own services depend on or affect other services, especially after a deployment or during an outage, where speed and accuracy are critical.
Conventional APM tools weren’t built to understand or even represent these multi-layered architectures, let alone provide guidance on how to identify and improve performance when it matters most.
There are two ways systems scale. They can scale wide, or they can scale deep.
Countless real-world systems scale wide: Lakes scale wide. Pizzas scale wide. Traffic jams scale wide. And in software, MapReduces and memcache pools scale wide. When things scale wide, you “just add more of them”: more water, more dough, more cars, more processes.
But some systems scale deep: Cities scale deep. Brains scale deep. And when things scale deep, they don’t just get bigger, they get different. Paris is nothing like a very large village. The brain in your pet goldfish is nothing like the brain in your head.
And when microservice architectures scale, they scale deep.
To make the depth of typical microservice architectures more tangible, here are images taken from real architecture diagrams (blurred for confidentiality reasons). Even with just a dozen services, there are already 6+ layers of depth!
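That "depth" can be made concrete: it is the length of the longest call chain through the service dependency graph. Here is a minimal sketch that measures it for a hypothetical graph (the service names are illustrative, not from any real architecture):

```python
# Hypothetical dependency graph: each service maps to the services it calls.
DEPS = {
    "web": ["api"],
    "api": ["auth", "orders"],
    "orders": ["inventory", "billing"],
    "billing": ["payments"],
    "payments": ["bank-gateway"],
    "inventory": ["db"],
    "auth": ["db"],
    "bank-gateway": [],
    "db": [],
}

def depth(service: str) -> int:
    """Length of the longest call chain rooted at `service`."""
    children = DEPS.get(service, [])
    if not children:
        return 1
    return 1 + max(depth(child) for child in children)

# Fewer than a dozen services, yet the longest chain is already 6 deep:
# web -> api -> orders -> billing -> payments -> bank-gateway
print(depth("web"))
```

Note that adding a single service rarely widens this graph; it tends to lengthen a chain, which is exactly why a request's failure modes multiply as the system scales.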
Telemetry Got Portable
The most valuable thing about APM had been the agents. They gave us telemetry where before there had been — literally — none. Recently, though, OpenTracing, OpenCensus, and now OpenTelemetry have made that telemetry portable — and free.
Outdated pricing models compound the problem: conventional APM tools are not only ill-suited to analyzing deep systems, they are typically priced per-host or per-container. That is neither the vendor's unit of cost (COGS) nor the customer's unit of perceived value. With the container explosion, that's brutal for customers.
And perhaps the biggest problem for APM is that deep systems aren’t just bigger than monoliths, they’re different, and products designed for one don’t work for the other.
APM has a lucrative sweet spot; it just doesn’t cover where large-scale systems are headed.
What’s Replacing APM
Historically, most approaches to monitoring or observability have almost no way to analyze or represent the elaborate dependencies between services in deep systems.
They treat metrics and logs (and possibly traces) as loosely coupled products or tools, and fundamentally lack the context required to solve the complex challenges of today's multi-layered architectures.
In recent years, metrics- or logging-oriented products have thrown in distributed traces "on the side," typically as individual data points that can be inspected manually in a trace visualizer. This blunt, simplistic approach can catch a limited number of egregious problems, but complex issues in production are more subtle.
Lightstep’s approach is unique: We ingest 100% of event data, then aggregate and analyze it to address specific high-value questions:
- “What went wrong during this release?”
- “Why has performance degraded over the past quarter?”
- “Why did my pager just go off?!”
For instance, one of our customers recently experienced a sudden regression in the performance of a particular backend, deep in their stack. The underlying issue turned out to be that one of their 100,000 customers had changed its traffic pattern by 2000x. This was obvious within seconds of looking at aggregate trace statistics, though they estimated it would have taken days using logs, metrics, or even individual traces on their own.
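The kind of aggregate analysis described above can be sketched in a few lines. This is a toy illustration, not Lightstep's implementation: the span counts are fabricated to mimic the incident (one tenant's traffic jumping by orders of magnitude), and the tag names are hypothetical.

```python
from collections import Counter

# Per-customer request counts over two equal time windows, as a tracing
# backend might aggregate them from span tags. Fabricated example data.
baseline = Counter({"cust-a": 100, "cust-b": 120, "cust-c": 90})
incident = Counter({"cust-a": 105, "cust-b": 118, "cust-c": 180_000})

def biggest_shift(before: Counter, after: Counter) -> tuple[str, float]:
    """Return (tag, ratio) for the tag whose request rate grew the most."""
    ratios = {tag: after[tag] / count for tag, count in before.items() if count > 0}
    tag = max(ratios, key=ratios.get)
    return tag, ratios[tag]

tag, ratio = biggest_shift(baseline, incident)
print(f"{tag} changed its traffic by {round(ratio)}x")  # cust-c changed its traffic by 2000x
```

The point is that this question is trivial to answer over aggregated trace data grouped by a tag, yet nearly impossible to answer by eyeballing individual traces or unstructured logs, where no single record reveals the population-level shift.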
This is all possible because Lightstep’s Satellite architecture grants us access to about 100x more data than a conventional SaaS solution at the same (or lower) cost. With so much more data, and colocated storage and compute, we extract more context about deep systems. This is why we have earned the trust of customers like Lyft, GitHub, Twilio, UnderArmour, and many more.