How we built Lightstep Metrics: creating a database from scratch
by Alex Kehlenbeck
My introduction to the subject that we now call Observability came many years ago, when I attended an internal tech talk at Google given in 2006 by Ben Sigelman, now Lightstep’s CEO, who was then the tech lead of Dapper. I was immediately smitten with the technology (not to mention the dapper suit he wore for the occasion) and joined the project soon afterwards. Since then, with just a few brief interruptions, I’ve been working continuously on distributed systems related to observability. It’s an area with an endless supply of interesting problems related to data ingestion, storage, query serving, reliability, and cost.
Many of the hard problems faced by observability systems stem from an impedance mismatch between the write workloads and the read workloads that those systems are forced to handle. Consider a typical scenario: you’re collecting data from many resources -- they might be VMs, application processes, network routers, whatever -- and then constructing dashboards or alerts to monitor those resources. Each resource is generating many different kinds of data: it might be producing traditional metrics such as CPU and memory statistics, or spans that are part of distributed traces, or log messages related to error events.
Your dashboards and alerts usually cut across those resources. To render a dashboard showing CPU usage, the system needs just a little bit of information from each resource (its CPU usage). To visualize a distributed trace, we need to gather all spans belonging to that trace, regardless of which host or process generated them. To find log messages about an exception in a common library component, you must search across the logs from many different application processes.
One way of thinking about this mismatch is to see that write workloads are “resource-major” (when a host writes to the monitoring systems, it sends a single request containing a jumble of information about many different metrics), but queries are “data-major” (a query asks for a single metric, such as CPU, from many hosts; a single distributed trace involves spans that came from many different processes). This mismatch presents an immediate difficulty for any storage system: the way that data is organized when it is received is essentially orthogonal to the organization that you would prefer for serving queries.
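This orthogonality can be made concrete with a small sketch. The code below is purely illustrative (the payload shapes, field names, and `query_metric` helper are invented for this example, not TLDB's actual formats): each write arrives resource-major, mixing many metrics from one host, while a query must pluck one small piece out of every resource's write.

```python
# Illustrative sketch, not TLDB code: writes are "resource-major",
# queries are "data-major".

# Each write is one request from one resource, bundling many metrics.
writes = [
    {"resource": "host-a", "points": {"cpu": 0.72, "mem": 0.41, "disk": 0.15}},
    {"resource": "host-b", "points": {"cpu": 0.55, "mem": 0.63, "disk": 0.20}},
    {"resource": "host-c", "points": {"cpu": 0.91, "mem": 0.38, "disk": 0.44}},
]

def query_metric(writes, metric):
    """A query such as "CPU across all hosts" needs one small slice
    from every resource's write -- the orthogonal access pattern."""
    return {w["resource"]: w["points"][metric] for w in writes}

cpu_by_host = query_metric(writes, "cpu")
# e.g. {"host-a": 0.72, "host-b": 0.55, "host-c": 0.91}
```

Stored exactly as written, this data makes every dashboard query a scatter-gather across all resources; reorganizing it for reads is the storage system's core job.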
There’s a long list of additional challenges:

- Writes tend to be very small: many metric data points are as small as sixteen bytes, and many spans aren’t much larger. This necessitates judicious batching throughout the system to amortize any fixed overheads involved in processing the data.
- But batching is at odds with the product goal of serving alerts and queries in near-real time. The most important use of observability products is oncall response, and any significant delay introduced by batching can undermine that use case.
- The vast majority of queries only touch data that is relatively young, say a few hours or a day old. But users also expect historical queries over weeks or months to be served with interactive latency, so that class of queries can’t be ignored outright.
- Queries cover an enormous range of scale. Metric queries often involve aggregations that result in four or five orders of magnitude of reduction from input data to output data. Statistical analyses of trace data, such as those performed by Lightstep’s Root Cause Analysis feature, may process many thousands of traces in order to produce an output of just a dozen numbers. Drilldown and aggregation features make it easy for a user to issue two queries that, superficially, might look very similar but whose computational requirements differ by a factor of a thousand or more.
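The tension between batching and near-real-time serving is usually resolved by bounding both dimensions: flush a batch when it is full, or when its oldest point has waited too long, whichever comes first. The sketch below shows that standard technique (an assumption about one common approach, not TLDB's actual implementation; the `Batcher` class and its parameters are invented for illustration).

```python
import time

class Batcher:
    """Size-or-age batching sketch: amortize fixed per-point costs by
    grouping points, while capping the latency any single point can
    accumulate waiting in the batch."""

    def __init__(self, max_points, max_age_s, flush):
        self.max_points = max_points  # flush when the batch is this full...
        self.max_age_s = max_age_s    # ...or when its oldest point is this old
        self.flush = flush
        self.batch = []
        self.oldest = None

    def add(self, point, now=None):
        now = time.monotonic() if now is None else now
        if not self.batch:
            self.oldest = now
        self.batch.append(point)
        if len(self.batch) >= self.max_points or now - self.oldest >= self.max_age_s:
            self.flush(self.batch)
            self.batch = []

flushed = []
b = Batcher(max_points=3, max_age_s=1.0, flush=flushed.append)
for p in range(5):
    b.add(p, now=0.0)  # injected clock for a deterministic example
# The first three points go out as one batch; two remain pending.
```

A production system would also need a timer to flush a partially full batch that stops receiving points, but the size/age trade-off is the essence of the latency bound.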
One of the other leaders in our space recently observed on Twitter that the observability use case is sufficiently unique that trying to adapt off-the-shelf storage technology is likely to cost a lot, perform poorly, or both. We couldn’t agree more.
When we started thinking about a database solution for Lightstep’s new metrics product, we had a few goals in mind.
We wanted it to be cost-effective. A consistent, loud message from our customers, especially since the pandemic began, is that their observability solution must not break the bank. Many established vendors in this space have a reputation for being expensive, and we didn’t want to follow their example.
We wanted a system that worked for both metric data and our existing trace data. At the time Lightstep stored trace data in a combination of several managed storage systems run by our cloud provider; it was a solution that had worked well enough at low traffic volume but which was showing its age.
As mentioned above, the workloads for our trace and metric data are sufficiently similar that we didn’t want to develop and maintain two distinct systems (although we run many separate database instances for operational reasons). By sharing a codebase between our metric and trace databases, any improvements or new features developed for one kind of data would automatically benefit the other kind of data. Although there have been a handful of instances where this requirement made it more difficult to implement a particular feature, there have been many more instances of shared benefit. This decision was perhaps the best early architectural commitment we made.
Most importantly, Lightstep believes that separating metrics, traces, and logs into distinct pillars (also known as “browser tabs”) results in a poor product experience. Perhaps in 2010 it made sense to say that’s a metric and that’s a trace and the two shall not mingle. But in the last decade, we’ve learned a lot as an industry about how limiting that dogma can be.
The line between tracing data and metric data has already started blurring and, we believe, will continue to blur in the coming years. For a long time, many of the squiggly lines in Lightstep’s product that look like traditional time series have in fact been derived from information that’s implicitly contained in trace data. Our new Change Intelligence feature is powered by both metric and trace data, but doesn’t make a big fuss about the difference between those: it just tries to provide users with insight about their systems.
Why does this matter for the backend? We’ve found that there’s a kind of corollary to Conway’s Law at work: when data is siloed in the backend, the product experience inevitably ends up exposing that fact. By sharing a common storage and serving system for all of our telemetry data, our goal is to make it much easier to build product features that incorporate both kinds of data without jarring the user with sharp transitions.
Finally, we wanted control over the relationship between our storage layer and query execution. It comes as no surprise to anyone who has worked on database technology that the most difficult performance challenges appear at that boundary.
That combination of goals ruled out all existing open-source systems that we were aware of; many systems failed to meet even one of those goals. So we built one that did: TLDB is The Lightstep Database.
Lightstep has a lot of institutional experience with these challenges, earned both from having built a distributed tracing product and also, for some members of the team, from having previously built systems like Monarch.
Much of the cost and complexity of a system like Monarch stems from the fact that it must strenuously avoid dependencies on other critical systems that themselves want to use it. Ironically, despite the abundance of amazing internal technology at a cloud provider like Google, it’s one of the most difficult places to build a monitoring system. It’s a little like being a kid in a gourmet chocolate shop: there are fancy goodies all around you, but you aren’t allowed to touch any of them.
To address that problem, Monarch took the following approach (italics are my additions):
… Monarch must be low dependency on the alerting critical path. To minimize dependencies, Monarch stores monitoring data in memory despite the high cost. Most of Google’s storage systems, including [long list elided], rely on Monarch for reliable monitoring; thus, Monarch cannot use them on the alerting path to avoid a potentially dangerous circular dependency.
Facebook’s Gorilla paper contains very similar language.
Lightstep doesn’t have the problem of being used by the systems on which it is built. We still have to be mindful of our dependencies as a general matter, like all good distributed systems engineers, but we don’t have to worry about the catastrophic scenario where an outage of one of our cloud dependencies would make it impossible for that dependency to be debugged and restored to service.
The main way that this relaxation makes life simpler is that we aren’t forced to store multiple in-memory replicas of every piece of data. Monarch reports using three in-memory replicas of each data point; the Gorilla paper similarly reports using two in-memory replicas. Lightstep’s TLDB doesn’t eschew memory entirely -- we use memory as a cache for our hottest data -- but we aren’t forced to rely on in-memory storage as a tool for minimizing dependencies. The resulting software is simpler and the cost savings of this simplification are significant.
Not needing to store all data in memory opens up possibilities for additional cost and performance improvements. Above, we mentioned that one of the challenging features of telemetry workloads is that most writes are extremely small, often as small as sixteen bytes. Hence one easy way to build an expensive system is to touch each data point many times: even very small fixed costs add up rapidly if you have to pay them for every point. The standard technique is to batch points together, to amortize those fixed costs, but batching introduces new challenges for systems that aim to serve real time or near-real time data.
The data structures that are most natural for in-memory storage tend not to be very similar to the layouts that work well for on-disk storage, meaning that any system that utilizes both in-memory storage and on-disk storage is likely to touch each data point at least twice. Since we aren’t forced to store data in memory, we can translate more directly from ingested data to a format that makes sense for on-disk storage. Eliminating the extra touch of each point represents an additional significant savings.
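To make the "touch each point once" idea concrete, here is a minimal sketch of translating ingested points directly into a compact on-disk block in a single pass. The encoding shown (a point count, then delta-encoded timestamps with raw float values) is an assumption chosen for illustration, not TLDB's actual on-disk format.

```python
import struct

def encode_block(points):
    """Single-pass translation from ingested (timestamp, value) pairs,
    assumed sorted by timestamp, into a compact on-disk block:
    a uint32 count, then per point a uint32 timestamp delta and a
    float64 value. Each point is touched once on the write path."""
    out = bytearray(struct.pack("<I", len(points)))
    prev_ts = 0
    for ts, val in points:
        out += struct.pack("<Id", ts - prev_ts, val)  # delta ts, value
        prev_ts = ts
    return bytes(out)

def decode_block(block):
    """Inverse of encode_block, for the read path."""
    (n,) = struct.unpack_from("<I", block)
    off, prev_ts, pts = 4, 0, []
    for _ in range(n):
        delta, val = struct.unpack_from("<Id", block, off)
        off += 12  # 4-byte delta + 8-byte value
        prev_ts += delta
        pts.append((prev_ts, val))
    return pts

pts = [(1000, 0.5), (1015, 0.6), (1030, 0.4)]
block = encode_block(pts)
assert decode_block(block) == pts
```

Real formats go much further (timestamp delta-of-deltas, value XOR compression as in Gorilla, per-block indexes), but even this simple version shows how ingestion can write the persistent representation directly, with no intermediate in-memory structure to populate and later drain.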
The first version of TLDB launched in late 2019 as a backend behind Lightstep’s Service Health for Deployments feature. It was quite primitive at the time: it was statically sharded, had very limited query capabilities, relied on an off-the-shelf embedded key-value store, etc. But its design was such that we were confident we could evolve it into something much more featureful and high-performance.
Over the last year, we’ve incrementally extended TLDB into a fully-featured telemetry database. It now contains an integrated query engine that powers all of Lightstep’s alerting and dashboard features; it can be dynamically resharded to add capacity; and we replaced its embedded key-value store with a homegrown implementation tailored to our workloads. Nearly all data ingested by Lightstep is now stored in TLDB, and we are fully relying on it to serve our own internal alerts, dashboards, and traces.
As a rule, infrastructure projects tend not to have well-defined completions, but we have achieved most of our original goals. The following chart, for example, shows the latency reduction resulting from transitioning one of our user-facing workloads to TLDB.
This particular transition simultaneously cut our CPU usage at the time by half, our memory usage by about two-thirds, and our disk storage by more than 80%.
We hope that you enjoy using all of Lightstep’s features, both existing and new, that are now powered by TLDB. And if you are an engineer who thinks these kinds of problems sound interesting, we would love to hear from you.