At Lightstep, we treat metric queries and span queries almost exactly the same. In this post, we’ll explore how the different stages of a query interact with each other to take your raw data from our data storage layer and aggregate it into useful visualizations.
Before we dive into how data points are queried, we first need to understand how data is stored in Lightstep. We continuously collect customer data points that have a wide variety of attributes, such as customer_id or hostname. For both metric and span data points, we partition the points based on these attributes, as well as the name of the span or metric itself. Points with the exact same set of attributes are stored together in chronological order based on their timestamps. This is also known as a timeseries.
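As a rough mental model of this partitioning scheme, here is a minimal sketch in Python. The function name, the dictionary shape of a point, and the attribute names are illustrative assumptions, not Lightstep's actual storage code; the key idea is that the (name, attribute-set) pair identifies a timeseries.

```python
from collections import defaultdict

def partition_into_timeseries(points):
    """Bucket raw points by (name, attributes); each bucket is one timeseries."""
    series = defaultdict(list)
    for p in points:
        # frozenset makes the attribute set usable as a dictionary key
        key = (p["name"], frozenset(p["attributes"].items()))
        series[key].append((p["timestamp"], p["value"]))
    # keep each timeseries in chronological order
    return {k: sorted(v) for k, v in series.items()}

points = [
    {"name": "cpu.usage", "attributes": {"hostname": "A"}, "timestamp": 2, "value": 0.5},
    {"name": "cpu.usage", "attributes": {"hostname": "A"}, "timestamp": 1, "value": 0.4},
    {"name": "cpu.usage", "attributes": {"hostname": "B"}, "timestamp": 1, "value": 0.9},
]
series = partition_into_timeseries(points)
# two timeseries: one per unique (name, attribute-set) combination
```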
Every query at Lightstep can be broken down into multiple stages. The most common stages found in almost every query are:

- Fetch
- Filter
- Align
- Group by
These stages are applied to the data in the order defined by the user, which is usually the same order as they are listed above. Each of these stages uses the output of the previous stage and produces a collection of timeseries. Let’s look at each of these stages in more detail.
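The "each stage consumes the previous stage's output" structure can be sketched as a simple function pipeline. This is an assumption-laden illustration (the stage signature and the `run_pipeline` helper are invented for this post), but it captures the shape: every stage maps a collection of timeseries to a new collection of timeseries.

```python
def run_pipeline(timeseries, stages):
    """Apply each stage in order; every stage maps a collection of
    timeseries to a new collection of timeseries."""
    for stage in stages:
        timeseries = stage(timeseries)
    return timeseries

# a trivial example stage that drops empty timeseries
drop_empty = lambda collection: [ts for ts in collection if ts]

result = run_pipeline([[(1, 2.0)], []], [drop_empty])
# → [[(1, 2.0)]]
```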
Query Pipeline: Fetch and Filter
Now that we have data stored as timeseries, we can query that data. The first stage of every query pipeline is called the fetch operation. Fetch uses a predicate to determine which subset of timeseries needs to be loaded from the data storage layer. In Lightstep’s Unified Query Language (UQL) – a text-based query language for metrics and spans – this is spelled `metric <metric_name>` for metric queries, and `spans count` or `spans latency` for span queries, along with a `filter` operation (optional for metrics). In our visual query builder, the fetch and filter options correspond to the first section.
💡 TIP: For those readers who are familiar with the SQL language, the fetch and filter stages are quite similar to SQL’s FROM and WHERE clauses.
The fetch operation will grab all the timeseries that match the predicate. These points may not be temporally aligned, so the next stage in every pipeline aligns the data.
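To make the fetch-and-filter idea concrete, here is a minimal sketch against the toy storage layout from earlier. The `fetch` function and the predicate-as-lambda convention are illustrative assumptions, not the real query engine's API.

```python
def fetch(storage, name, predicate=lambda attrs: True):
    """Load only the timeseries whose name matches and whose
    attributes satisfy the filter predicate."""
    return {
        (n, attrs): points
        for (n, attrs), points in storage.items()
        if n == name and predicate(dict(attrs))
    }

storage = {
    ("cpu.usage", frozenset({("hostname", "A")})): [(1, 0.4)],
    ("cpu.usage", frozenset({("hostname", "B")})): [(1, 0.9)],
    ("mem.usage", frozenset({("hostname", "A")})): [(1, 0.7)],
}

# fetch cpu.usage, filtered to hostname == "A"
matched = fetch(storage, "cpu.usage", lambda a: a["hostname"] == "A")
# one timeseries survives both the fetch and the filter
```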
Query Pipeline: Align
Raw timeseries data for queries can often contain thousands of individual data points. If we attempted to plot this raw data on a chart, we’d quickly run into a fundamental issue: there aren’t enough pixels to render each value! To avoid this issue, we pick a few hundred timestamps that we want data for and align the raw data to match. We’ll also combine points across multiple timeseries in the “group by” stage. To do that, the points must first have the same timestamp. And so, the align stage is required for all telemetry queries at Lightstep.
For each output timestamp that has been chosen, we aggregate all the points from the original timeseries that have a timestamp between the output timestamp and [output timestamp - input window]. The default “input window” is just the distance between each output point, which results in each output point aggregating a unique set of input points. By specifying an input window that is larger than the distance between output points, the resultant data can be smoothed; the larger the input window the more smoothing occurs.
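A minimal sketch of that windowing logic, under the assumption that a timeseries is a list of (timestamp, value) pairs and that the aggregator is passed in as a function. Note how widening the input window beyond the output spacing makes neighboring output points share input points, which smooths the result.

```python
def align(points, output_timestamps, input_window, aggregate):
    """For each output timestamp t, aggregate the input points whose
    timestamps fall in the half-open window (t - input_window, t]."""
    out = []
    for t in output_timestamps:
        window = [v for ts, v in points if t - input_window < ts <= t]
        if window:
            out.append((t, aggregate(window)))
    return out

raw = [(1, 1.0), (2, 2.0), (3, 3.0), (4, 4.0), (5, 5.0), (6, 6.0)]
avg = lambda vs: sum(vs) / len(vs)

# default: window equals the output spacing, so each input point
# contributes to exactly one output point
aligned = align(raw, [2, 4, 6], input_window=2, aggregate=avg)
# → [(2, 1.5), (4, 3.5), (6, 5.5)]

# wider window: neighboring outputs share inputs, smoothing the series
smoothed = align(raw, [2, 4, 6], input_window=4, aggregate=avg)
# → [(2, 1.5), (4, 2.5), (6, 4.5)]
```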
This type of aggregation is called temporal aggregation because we combine points across the time dimension. Note that the points being combined during a temporal aggregation all come from the same timeseries and all have different timestamps. The following aggregators are available to use for temporal aggregation:
💡 TIP: Think of temporal aggregation as horizontal aggregation.
- latest: Output the last point found in the window.
- delta: Output the change in the metric from the earliest point to the latest point.
- rate: Output the rate at which the metric is changing (per second).
- sum: Output the mathematical sum of all points.
- max: Output the maximum point.
- min: Output the minimum point.
- mean: Output the average of all points.
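The temporal aggregators above are easy to express over a single window of chronologically ordered (timestamp, value) pairs. These are illustrative definitions rather than Lightstep's actual implementations:

```python
def latest(window):
    """Last point found in the window."""
    return window[-1][1]

def delta(window):
    """Change from the earliest point to the latest point."""
    return window[-1][1] - window[0][1]

def rate(window):
    """Per-second rate of change across the window."""
    (t0, v0), (t1, v1) = window[0], window[-1]
    return (v1 - v0) / (t1 - t0)

def mean(window):
    """Average of all points in the window."""
    return sum(v for _, v in window) / len(window)

# timestamps in seconds
window = [(0, 10.0), (5, 13.0), (10, 20.0)]
# latest(window) → 20.0, delta(window) → 10.0, rate(window) → 1.0
```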
Continuing the visual from above, here we show the fetch and align stages.
Query Pipeline: Group by
Now that we have aligned the timeseries data, we can combine it in a pipeline stage called “group by.” When grouping, we only consider attributes with the keys defined by the user and ignore all others. The “group by” stage combines all timeseries where the considered attributes’ key/values are equal using a specified aggregator.
For example, if we group by “host” and half of our fetched timeseries have a host value of “A” and the other half have value “B,” we end up with two timeseries where these two sets of values are aggregated.
This type of aggregation is called spatial aggregation because we combine points across the space of all the timeseries in each group. Note that the points being combined during a spatial aggregation all have the same exact timestamp and all come from a different timeseries. This is why the “align” stage is so crucial; it ensures that the points during spatial aggregation all have the same timestamp. The following aggregators are available to use for spatial aggregation:
💡 TIP: Think of spatial aggregation as vertical aggregation.
Group by aggregators:

- sum: For scalar values, all points are added together. For distribution values, the distributions are combined; this is the only supported aggregator for distribution metrics and span latency queries.
- mean: The mathematical mean of all points.
- max: The maximum value of the combined points.
- min: The minimum value of the combined points.
- count: Ignores the values of the points and just returns the number of points combined.
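Here is a minimal sketch of spatial aggregation, continuing the "group by host" example. The data shapes and the `group_by` helper are assumptions for illustration; the key property it relies on is the one the align stage guarantees: every member series in a group has points at the same timestamps.

```python
from collections import defaultdict

def group_by(aligned_series, keys, aggregate):
    """Combine aligned timeseries whose values for `keys` match,
    aggregating the values point by point (vertically)."""
    groups = defaultdict(list)
    for attrs, points in aligned_series:
        group_key = tuple(attrs.get(k) for k in keys)
        groups[group_key].append(points)
    result = {}
    for group_key, members in groups.items():
        # alignment guarantees matching timestamps across members
        timestamps = [t for t, _ in members[0]]
        result[group_key] = [
            (t, aggregate([m[i][1] for m in members]))
            for i, t in enumerate(timestamps)
        ]
    return result

series = [
    ({"host": "A", "az": "us-1"}, [(1, 2.0), (2, 4.0)]),
    ({"host": "A", "az": "us-2"}, [(1, 6.0), (2, 8.0)]),
    ({"host": "B", "az": "us-1"}, [(1, 1.0), (2, 1.0)]),
]
grouped = group_by(series, ["host"], aggregate=sum)
# the two host-A series collapse into one; host B is unchanged;
# the "az" attribute is ignored because it was not grouped on
```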
Let’s look at the “group by” stage in action. While this only shows two underlying timeseries in each group, in practice we can combine hundreds of timeseries in each group.
Tying it all together
Now that we've seen how each stage of the query pipeline works, let's put it together. Over the course of this post we’ve been building this query.
Here is the same query represented in UQL:
And here is the full animation of that query as it is processed through each pipeline stage.
Hopefully, you have a better understanding of how our query pipeline works for these basic queries. It works almost identically for metrics and spans, which means that once you know how to query one, it’s super easy to query the other! Stay tuned for part 2 where we will cover stages like joins that allow you to build even more powerful queries!
October 24, 2022

About the author
Brian Lamb