
Analyse billions of events in seconds with Notebook’s span time series


by Neena Dugar

We recently released Notebooks, which has one particularly game-changing feature: the ability to create time series out of your historical span data over the last three days! Before, if you wanted to graph your spans as a time series, you had to pre-define a stream or key operation. But when you’re troubleshooting, you don’t always know in advance exactly what you’re going to be looking for. Notebooks lets you make any query over the last three days of span data. While that flexibility is awesome as a user (our team is a huge fan of the feature and uses it constantly), it was a little daunting as an engineering problem: we’re talking about analysing millions – even billions – of events in just a few seconds.

[Animated demo: building a span time series in Notebooks]

Generating these time series really comes down to a problem of counting: we can label every span with its “time bucket” (which 15 second window it arrived in). Computing a rate time series then looks something like this:


const timeBucketSize = 15 * time.Second

// getTimeSeries counts the spans matching predicate in each 15-second
// bucket between startTime and startTime+duration.
func getTimeSeries(
  predicate query.Predicate,
  startTime time.Time,
  duration time.Duration,
) []int {
  numPoints := int(duration / timeBucketSize)
  timeSeries := make([]int, numPoints)

  for i := 0; i < numPoints; i++ {
    timeBucket := startTime.Add(time.Duration(i) * timeBucketSize)
    // pseudocode: count only the spans that match the predicate
    // and fall into this time bucket
    timeSeries[i] = countSpans(predicate AND time_bucket == timeBucket)
  }

  return timeSeries
}

Following the axiom of “premature optimisation is the root of all evil”, when we first started, we chose the simplest approach: read all the spans and count them.


// countSpans reads back every span matching the predicate and counts them.
func countSpans(predicate query.Predicate) int {
  spans := tldb.Query(predicate)
  count := 0
  for range spans {
    count++
  }
  return count
}

This approach just about worked when the predicate didn’t match too many spans, but as the predicate gets wider there’s a clear problem: the more spans you match, the more data you have to read. How much data are we talking about? A large Lightstep customer sends us anywhere from 500 MB/s to several GB/s of uncompressed span data. Taking 500 MB/s as our ingress, if we store that for 3 days, that works out to about 26 TB on disk. Just reading all that data from an SSD would take about 7 hours. We can (and do) parallelise by spreading the data over multiple disks, but that’s really not enough to bridge the gap between hours and seconds, let alone cover the cost of decompressing, parsing, and analysing that data.
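For a sense of where that “about 7 hours” comes from, here’s the back-of-envelope arithmetic – the ~1 GB/s sequential read figure for a single SSD is an illustrative assumption, not a benchmark:

// Back-of-envelope for one large tenant. The SSD throughput below is an
// illustrative assumption, not a measurement.
const (
  bytesOnDisk   = 26e12 // ≈26 TB of span data retained over 3 days
  ssdThroughput = 1e9   // assume ~1 GB/s sequential read from a single SSD
)

// 26e12 / 1e9 = 26,000 seconds ≈ 7.2 hours just to read the bytes once,
// before any decompression, parsing, or filtering.
var fullScanHours = bytesOnDisk / ssdThroughput / 3600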

Remember that here at Lightstep we’ve built our own database. That means that when we start thinking about this problem, we can approach it at literally any layer, from the application level to the files on disk. This blog post is about some of the tricks we ended up using to be able to analyse those terabytes of data in seconds.


Trick #1: Indexes are smaller than your data

First step: we need a way to serve those queries without actually reading all those bytes. Luckily, we know exactly what fields our queries will touch: attributes. By storing those associations in a separate index, and using that to count, we can avoid reading the spans themselves. Our index contains a mapping of each attribute to the span IDs it matches – we call the list of span IDs the “posting list”.

attribute => matching span IDs (posting list)

browser=chrome => [1,2,7]
browser=opera => [3,4,5]
browser=safari => [10,13,16]
customer=acme => [1,2,3,4,5]
customer=beemo => [6,7,11,12]
os=android => [1,7,12,15]
os=iOS => [2,10,13,16]

When we make queries, we just fetch the relevant index rows, and count based on the matching span IDs – we don’t need to yield the spans themselves.


// Count fetches the posting list for the predicate and counts the span IDs
// in it – it never reads the spans themselves.
func Count(predicate query.Predicate) int {
  postingList := getPostingList(predicate) // only touches index rows
  count := 0
  for range postingList {
    count++
  }
  return count
}

Storing the posting lists as plain arrays doesn’t scale well: as you store more spans, those lists grow, and scanning them to compute intersections and unions becomes slow. So we leverage Roaring Bitmaps, which make our stored posting lists smaller and give us faster set operations.
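To make that concrete, here’s a toy example using the open-source Roaring Bitmap library for Go, with the attribute values and span IDs from the little index above – an illustration of the idea, not our actual storage code. A predicate becomes a handful of bitmap operations, and counting is just asking for the cardinality:

package main

import (
  "fmt"

  "github.com/RoaringBitmap/roaring"
)

func main() {
  // Posting lists from the example index, stored as Roaring Bitmaps.
  chrome := roaring.BitmapOf(1, 2, 7)
  android := roaring.BitmapOf(1, 7, 12, 15)
  acme := roaring.BitmapOf(1, 2, 3, 4, 5)

  // browser=chrome AND os=android -> intersect the posting lists.
  matching := roaring.And(chrome, android)

  // ... AND customer=acme, applied in place.
  matching.And(acme)

  // Counting never touches the spans themselves.
  fmt.Println("count:", matching.GetCardinality()) // 1 (span ID 1)
}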


Trick #2: Getting out of the loop

In our getTimeSeries function, the bulk of the work is in this loop:


for i, timeBucket := range timeBuckets {
  // pseudocode: the predicate is re-evaluated for every bucket
  timeSeries[i] = countSpans(predicate AND time_bucket == timeBucket)
}

Notice that we’re computing the predicate numPoints times, since it happens inside the loop. It turns out that computing the predicate is actually relatively expensive, so it’s a big advantage to move it outside of the loop. We can do this by rewriting the query as a group-by count, like:

COUNT spans
WHERE ...
GROUP_BY time_bucket

Implementing a group-by count looks something like this (note: in our case, groupByKey would be time_bucket, but at this layer we’re implementing a generic group-by):

// fetch every value of the group-by key along with its posting list,
// plus the posting list for the predicate itself
groups, groupPostingLists := getValuesAndPostingListsForLabelKey(groupByKey)
postingList := getPostingList(predicate)

groupCounts := make(map[int]int)
for i, groupPostingList := range groupPostingLists {
  // spans in this group that also match the predicate
  intersection := intersect(postingList, groupPostingList)
  for range intersection {
    groupCounts[i]++
  }
}

So now, we only call getPostingList for the predicate once, regardless of the number of time buckets!


Trick #3: Leverage the group-by

We can do better! Once we’d done this, we started to notice we were spending a lot of time computing posting list intersections. We can leverage the fact that all of the groups are disjoint to avoid those intersections entirely.

So first, we read each of the group posting lists and create a mapping from each ID to the group it came from.


groups, groupPostingLists := getValuesAndPostingListsForLabelKey(groupByKey)

// map each span ID to the index of the group it belongs to
idToGroup := make(map[int]int)
for idx, postingList := range groupPostingLists {
  for _, id := range postingList {
    idToGroup[id] = idx
  }
}

Then, we read through the posting list and use the map we just created to record the count of each group.


postingList := getPostingList(predicate)

groupCounts := make(map[int]int)
for _, id := range postingList {
  group := idToGroup[id]
  groupCounts[group]++
}

Trick #4: Why compute when you can pre-compute?

In this new world, most of our time is spent computing the id-to-group map. For common group-by keys (such as time bucket, which is on every query), we precompute the id-to-group map and store it, which means we avoid computing it on every query.


[Figure: pre-computation of the id-to-group map]


This is an example of a really powerful trick that we use over and over again: amortising your query cost over your ingest cost. By writing auxiliary data at write time, we can significantly speed up our queries – we pay for it in CPU, but that’s spread out over the ingestion.
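As a sketch of what that amortisation looks like for the time-bucket key – the indexer type and its fields below are hypothetical, purely to show the shape of the idea, and it reuses the timeBucketSize constant from the first snippet:

// Hypothetical sketch: maintain the id-to-group map for time_bucket at write
// time, so queries never have to build it.
type indexer struct {
  bucketToGroup map[time.Time]int // time bucket -> group index
  idToGroup     map[int]int       // span ID -> time-bucket group index
}

func (ix *indexer) ingestSpan(spanID int, start time.Time) {
  bucket := start.Truncate(timeBucketSize) // the 15-second bucket this span falls in
  group, ok := ix.bucketToGroup[bucket]
  if !ok {
    group = len(ix.bucketToGroup)
    ix.bucketToGroup[bucket] = group
  }
  // Paid once per span at ingest, instead of once per span on every query.
  ix.idToGroup[spanID] = group
}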


Trick #5: Locality is king

Once we reached this point, we found we were spending most of our time reading posting lists and pre-computed maps, so we started to think about whether we could avoid reading some of that data altogether.

I mentioned that we store posting lists in the Roaring Bitmap format, which stores IDs in chunks of 2^16 (called containers). If we can make it so that our queries only touch a subset of containers, we reap many performance benefits – less to read from disk, less to decompress, better cache efficiency. We know that many of our queries are focussed on specific service/operations – those are particularly important span attributes – so we sort the data such that the IDs for a given service/operation are contiguous. This means that if a service/operation is only responsible for 1% of ingested spans, we only read 1% of the containers.
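Here’s a toy illustration of why that sorting matters, again using the open-source Roaring Bitmap Go library (the ID layouts are invented for the example): the same number of span IDs costs far more to store and read when they’re scattered across the ID space than when they’re contiguous.

package main

import (
  "fmt"

  "github.com/RoaringBitmap/roaring"
)

func main() {
  const n = 100_000

  // Same number of span IDs, laid out two different ways.
  contiguous := roaring.New() // IDs for one service/operation kept together
  scattered := roaring.New()  // IDs for one service/operation spread across the shard
  for i := uint32(0); i < n; i++ {
    contiguous.Add(i)
    scattered.Add(i * 1000) // every 1000th ID
  }

  // Contiguous IDs land in ~2 Roaring containers (each covers 2^16 IDs);
  // scattered IDs touch ~1500, so there is far more to read and decompress.
  fmt.Println("contiguous bytes:", contiguous.GetSerializedSizeInBytes())
  fmt.Println("scattered bytes: ", scattered.GetSerializedSizeInBytes())
}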

This makes a huge difference! Check out the trace below – can you spot which spans are on sorted vs unsorted data?


[Figure: trace of the same query over sorted vs. unsorted data]


Trick #6: SIMD

So far I’ve talked mostly about the single group-by case, but in reality we’re usually grouping by multiple keys. For latency queries we also want to group by the latency bucket; and we have a special internal tag called “adjusted count” which we use to re-inflate counts after sampling. For the multi-group-by case, we need to take each key’s group ID and combine them into a single group ID.

After all the optimisations above, this step started to become a bottleneck. As we started to look at this more closely, we realised that we could leverage vector intrinsics and get more parallelism during this step. So we’ve leveraged Go’s support for AVX512 instructions to get a ~3x speedup on that calculation.
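The combination step itself is simple arithmetic; a scalar sketch might look like the snippet below (the function and parameter names are ours for illustration – the real version performs this kind of multiply-and-add across whole vectors of group IDs at once using AVX-512):

// combineGroupIDs flattens per-key group indices (time bucket, latency
// bucket, adjusted count, ...) into a single group ID per span.
// groupIDs[k][i] is span i's group index for key k; numGroups[k] is the
// number of distinct groups for key k.
func combineGroupIDs(groupIDs [][]int, numGroups []int) []int {
  numSpans := len(groupIDs[0])
  combined := make([]int, numSpans)
  for i := 0; i < numSpans; i++ {
    id := 0
    for k := range groupIDs {
      // mixed-radix encoding: each group-by key contributes one "digit"
      id = id*numGroups[k] + groupIDs[k][i]
    }
    combined[i] = id
  }
  return combined
}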


Concluding thoughts

Performance is fun! This blog post only scratches the surface of what we’ve been working on: there are hundreds of other improvements that helped get us here. And honestly, every time I use Notebooks to make a spans query it feels a little bit magical – because I know we’re analysing trillions of spans in seconds.

Learn more about Notebooks and get started today.
