How We Built & Scaled Log Search and Pattern Extraction


by Karthik Kumar and Katia Bazzi



07-31-2020


We recently added the ability to search and aggregate trace logs in Lightstep!

With conventional logging systems, developers often lack the system context necessary to debug issues. With Lightstep’s new log analysis features, we have unified familiar logging workflows with rich tracing data to provide only the context relevant to your investigation. Now, for example, if you see that a service is logging a particular error, you can view a corresponding trace to identify abnormalities in caller or callee behavior.

[Image: Lightstep RCA errors]

With this launch, developers can now view:

  • A log timeline which illustrates the frequency of a log message over time
  • Aggregated log patterns, to minimize the noise when sifting through the logs
  • Example traces that contain a specific log message
  • Raw log message search results (and see the corresponding timeline)

We faced a number of interesting technical challenges while building this, which are outlined below.

Log Patterns

The feature supports automatically extracting patterns from logs. To do this, it parses each log into two parts: (1) the static words that describe the error (for example, “User auth failed”) and (2) the dynamic information that adds context to the error message (for example, entity identifiers, timestamps, structured payloads, etc.). Aggregating logs in this way allows us to compare log messages that originate from the same location in the code but may not be exactly the same. These logs are then counted and summarized to provide helpful information about which log messages may be the root cause of an error (especially after a deployment).

One important assumption is that the logs being aggregated come from the same service and operation, have the same log key, and “look similar”. We define similarity by hashing the constant parts of each message. We then group similar logs and identify patterns within these groups.
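To make this concrete, here is a minimal sketch (in Python) of one way such a grouping key could be formed. This is an illustration under assumptions, not Lightstep's implementation: dynamic tokens are approximated with a simple regex mask (quoted values and tokens containing digits), whereas the real system derives the constant parts from the tokenizer described below, and grouping_key and its arguments are hypothetical names.

import hashlib
import re

# Hypothetical sketch: approximate the "constant parts" of a message by
# masking obviously dynamic tokens (quoted values and tokens containing
# digits), then hash what remains together with the service, operation,
# and log key.
DYNAMIC = re.compile(r'"[^"]*"|\S*\d\S*')

def grouping_key(service, operation, log_key, message):
    constant_part = DYNAMIC.sub("*", message)
    return hashlib.sha1(f"{service}|{operation}|{log_key}|{constant_part}".encode()).hexdigest()

logs = [
    'User j1bjp logged in, with 23 items in cart, {"token": "123"}',
    'User sl9xh logged in, with 1 items in cart, {"token": "456"}',
]
print(len({grouping_key("auth", "/login", "message", m) for m in logs}))  # 1 -- both land in the same group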

Here’s an example (the [string], [1-98], and {"token": [*]} terms are the detected patterns):

INPUT:

User j1bjp logged in, with 23 items in cart, {"token": "123"}
User sl9xh logged in, with 1 items in cart, {"token": "456"}
User l81na logged in, with 98 items in cart, {"token": "789"}

OUTPUT:

User [string] logged in, with [1-98] items in cart, {"token": [*]}

Cool, right? So how does it work? Detecting log patterns is a two-step process: tokenization and summarization. First, each log message is split into tokens, each with a corresponding “kind” (a detected type, like “number” or “space”). The tokenizer preserves spaces and casing from the original log and tracks the start and end index of each token. Below is an example of tokenization in action; the output is in the format {[start,end), kind}.

INPUT:

abc 12.3

OUTPUT:

{[0,3), string}
{[3,4), space}
{[4,8), number}
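As a rough illustration of the tokenization step (not the actual tokenizer, which presumably detects many more kinds), here is a minimal Python sketch that distinguishes only numbers, spaces, and everything else, while tracking each token's [start, end) span. TOKEN_RE and tokenize are hypothetical names.

import re

# Hypothetical tokenizer sketch: each alternative is a named group whose
# name doubles as the token "kind". Real log lines would need more kinds
# (quoted strings, hex identifiers, timestamps, JSON payloads, ...).
TOKEN_RE = re.compile(r"(?P<number>\d+(?:\.\d+)?)|(?P<space>\s+)|(?P<string>\S+)")

def tokenize(message):
    """Split a log message into (start, end, kind) tuples, preserving spans."""
    return [(m.start(), m.end(), m.lastgroup) for m in TOKEN_RE.finditer(message)]

for start, end, kind in tokenize("abc 12.3"):
    print(f"{{[{start},{end}), {kind}}}")
# {[0,3), string}
# {[3,4), space}
# {[4,8), number}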

The summarization step builds on the tokens returned by the previous step. It uses the Levenshtein distance algorithm to find the shortest edit-distance path between the token sequences in a given set of logs, then identifies a pattern by combining adjacent kinds. For example:

INPUT (tokens):

"abc 12.3" -> {[0,3), string}, {[3,4), space}, {[4,8), number}
"defg 14.6" -> {[0,4), string}, {[4,5), space}, {[5,9), number}
"xy 1499.1" -> {[0,2), string}, {[2,3), space}, {[3,9), number}

OUTPUT:

[string] [12.3-1499.1]
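To reproduce that output, here is a simplified sketch of the merge that summarization performs. It sidesteps the Levenshtein alignment by assuming the grouped logs already tokenize to the same sequence of kinds, so a position-by-position merge suffices; summarize is a hypothetical helper, and tokenize is restated compactly from the sketch above so the snippet runs on its own.

import re

# Compact restatement of the hypothetical tokenizer sketched earlier.
TOKEN_RE = re.compile(r"(?P<number>\d+(?:\.\d+)?)|(?P<space>\s+)|(?P<string>\S+)")

def tokenize(message):
    return [(m.start(), m.end(), m.lastgroup) for m in TOKEN_RE.finditer(message)]

# Simplified merge: the real summarization step aligns token sequences with
# Levenshtein edit distance first; here we assume the sequences already line
# up, one token per column.
def summarize(messages, token_lists):
    parts = []
    for column in zip(*token_lists):                      # one column per token position
        values = [messages[i][s:e] for i, (s, e, _) in enumerate(column)]
        kinds = {kind for _, _, kind in column}
        if len(set(values)) == 1:
            parts.append(values[0])                       # constant text stays literal
        elif kinds == {"number"}:
            nums = sorted(float(v) for v in values)
            parts.append(f"[{nums[0]:g}-{nums[-1]:g}]")   # numeric range
        else:
            parts.append("[string]")                      # variable non-numeric token
    return "".join(parts)

logs = ["abc 12.3", "defg 14.6", "xy 1499.1"]
print(summarize(logs, [tokenize(m) for m in logs]))       # [string] [12.3-1499.1]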

Scaling this feature to support a large number of logs requires distributing the work across many machines. Logs are data-intensive, and loading logs from a large cohort of traces in real time is challenging for a single machine. Therefore, we use a scatter-gather model in which workers compute tokens for a given subset of logs. The tokens are gathered by the leader, and the summarization step occurs on a single node. This architecture allowed us to scale the feature to support the high volume of logs sent by our customers while still returning results in seconds.
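The shape of that scatter-gather flow might look something like the toy sketch below, which reuses the hypothetical tokenize and summarize helpers from the previous snippet. In production the workers are separate machines answering RPCs rather than threads in one process, but the division of labor is the same: workers tokenize their shard, and the leader gathers the tokens and runs summarization on a single node.

from concurrent.futures import ThreadPoolExecutor

def scatter_gather_patterns(logs, num_workers=4):
    # Scatter: split the logs into contiguous shards, one per worker.
    chunk = max(1, (len(logs) + num_workers - 1) // num_workers)
    shards = [logs[i:i + chunk] for i in range(0, len(logs), chunk)]
    # Each "worker" tokenizes its own shard.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        shard_tokens = pool.map(lambda shard: [tokenize(m) for m in shard], shards)
    # Gather: the leader concatenates the token lists in order...
    gathered = [tokens for shard in shard_tokens for tokens in shard]
    # ...and the summarization step runs on a single node.
    return summarize(logs, gathered)

print(scatter_gather_patterns(["abc 12.3", "defg 14.6", "xy 1499.1"]))  # [string] [12.3-1499.1]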

The feature also supports a basic substring search for log messages. When a user submits a query, we run a string alignment algorithm to determine the matching logs, scoring each log message's match with a heuristic that is biased toward earlier, longer matches. We decided against supporting more complex querying capabilities (like regular expressions or compound queries) to optimize for ease of use; most queries our users ran were simple substring ones. Here's an example:

[Image: Lightstep application showing log search]

Scoring these log messages against the query is a computationally intensive process, and this model allowed us to support real-time querying capabilities.
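As an illustration of that kind of heuristic (not the actual alignment algorithm), here is a sketch that scores each log by its longest common substring with the query, discounted by how far into the message the match starts, so longer and earlier matches rank higher; match_score is a hypothetical name.

# Hypothetical scoring sketch: classic O(len(query) * len(message))
# longest-common-substring dynamic program, biased toward matches that
# are longer and start earlier in the message.
def match_score(query, message):
    q, m = query.lower(), message.lower()
    best_len, best_start = 0, len(m)
    prev = [0] * (len(m) + 1)
    for i in range(1, len(q) + 1):
        curr = [0] * (len(m) + 1)
        for j in range(1, len(m) + 1):
            if q[i - 1] == m[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_start = curr[j], j - curr[j]
        prev = curr
    return best_len / (1.0 + best_start) if best_len else 0.0

logs = ["User auth failed for j1bjp", "Retrying request after auth failure"]
for log in sorted(logs, key=lambda l: match_score("auth failed", l), reverse=True):
    print(f"{match_score('auth failed', log):5.2f}  {log}")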

Interested in working on problems like these? We’re hiring!
