Diagnosing latency: Lightstep vs. Jaeger
by Andrew Chee
For many organizations starting out with distributed tracing, Jaeger is often the first tool used to ingest and visualize traces. It provides a way for developers to query for individual requests and see their behavior as they traverse all the services and operations to complete the request.
As powerful as this is, it only provides a partial picture of your system’s performance because you are only able to visualize individual requests or at most compare two requests to each other. There is no real way of knowing whether the requests you chose to view are in any way representative of the problems you are facing in the system.
With the head-based sampling strategies that most organizations use in production, it is practically impossible to tell whether the request you chose is representative of the performance problems you are facing. To make this assessment with confidence, you would have to view enough requests to show a picture of what is actually happening. In addition, you need the ability to intelligently connect traces to other telemetry like metrics and logs. This is where Lightstep excels.
Today, we will discuss how Lightstep approaches the problem of diagnosing latency and will demonstrate how our approach differs from Jaeger’s, even though we both accept the same distributed tracing telemetry.
Lightstep is able to ingest and analyze 100% of your production request traffic and show you systemic performance issues. By automatically analyzing hundreds or thousands of representative requests, Lightstep provides an aggregate view of your requests as opposed to just showing individual requests (which we do as well). Hopefully, by the end of this, you will be able to see how Lightstep’s approach gets you to an actionable analysis of your performance problems much faster.
The Jaeger interface basically provides a database of traces.
You are able to query your request traffic by service, span name, or attribute, and the system will list matching traces that you can view.
You are also able to select two different traces and compare the behaviors of the two requests.
When you select one of the listed traces, the system shows you the distributed trace for that individual request.
Showing an individual request tells you about the performance of that one request. Are there bottlenecks in that request? Are there errors? What services and operations does that request traverse in completing its work? All of that can be answered by viewing a request trace and that is powerful indeed!
Jaeger’s trace visualization is not able to answer the question, “does this request represent the problem I’m facing?” You must manually visualize or compare a large number of traces to gain an understanding of your system’s behavior.
Worse still, most organizations heavily sample their request traces. With head-based sampling at a 5% sample rate, the chance that any given request is both in the slowest 1% (at or above P99) and sampled is only 0.05%, making it virtually impossible to ensure that you capture your performance outliers. And even if you sample at 100% (no sampling), the system only provides you with a listing of traces, and it is still up to you to manually analyze each one yourself.
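The 0.05% figure follows from the fact that a head-based sampler decides whether to keep a request before its latency is known, so the two events are independent and their probabilities multiply. Here is a minimal sketch of that arithmetic; the function names are made up for illustration and are not part of Jaeger or Lightstep.

```python
def outlier_capture_probability(sample_rate: float, percentile: float) -> float:
    """Chance that a random request is both sampled and at or above `percentile`.

    Assumes head-based sampling, i.e. the keep/drop decision is made up front
    and is independent of how slow the request turns out to be.
    """
    tail_fraction = 1.0 - percentile / 100.0  # e.g. P99 -> slowest 1% of requests
    return sample_rate * tail_fraction


def expected_outliers_captured(total_requests: int, sample_rate: float,
                               percentile: float) -> float:
    """Expected number of tail requests that end up stored as traces."""
    return total_requests * outlier_capture_probability(sample_rate, percentile)


p99_at_5_percent = outlier_capture_probability(sample_rate=0.05, percentile=99)
kept_per_million = expected_outliers_captured(1_000_000, 0.05, 99)
```

With a 5% sample rate, `p99_at_5_percent` comes out to roughly 0.0005 (0.05%), and out of a million requests only about 500 of the 10,000 slowest are kept.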
Lightstep accepts the exact same open source instrumentation as Jaeger. The data contained in individual traces are exactly the same. One of the main differences though is that Lightstep natively captures and analyzes 100% of your tracing traffic.
For each service and operation, Lightstep is able to build the primary application metrics (latency/throughput/errors). And because we are analyzing 100% of your traces we are able to guarantee that we capture trace examples from your fastest to your slowest operations and all the different kinds of errors you are facing. The data analyzed by Lightstep truly reflects the performance spectrum of your production application.
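To make the idea of deriving latency, throughput, and error metrics from spans concrete, here is a rough sketch of the kind of aggregation involved. It is not Lightstep's implementation: the `Span` shape, the `summarize` function, and the nearest-rank percentile are assumptions chosen to keep the example small.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    # Hypothetical, simplified span record; real spans carry much more data.
    service: str
    operation: str
    duration_ms: float
    error: bool = False


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(spans, window_seconds):
    """Group spans by (service, operation) and compute the basic metrics."""
    groups = defaultdict(list)
    for span in spans:
        groups[(span.service, span.operation)].append(span)

    summary = {}
    for key, group in groups.items():
        durations = sorted(s.duration_ms for s in group)
        summary[key] = {
            "throughput_per_s": len(group) / window_seconds,
            "error_rate": sum(s.error for s in group) / len(group),
            "p50_ms": percentile(durations, 50),
            "p99_ms": percentile(durations, 99),
        }
    return summary
```

Because every span in the window feeds the summary, the slowest requests and the rare errors show up in the numbers, which is not the case when only a small sample of requests is kept.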
Additionally, Lightstep captures and stores a large, representative set of traces for each of these operations, so when problems occur, we have the data necessary for you to fully understand what is happening at a systemic level.
The operation diagram above shows you the operation and service dependencies for requests that started from an iOS mobile application. The yellow circles represent overall latency contribution. As you can see, there are two operations that are creating a bottleneck within the system. In order to generate this view, Lightstep automatically analyzes a large, representative dataset of traces for each operation/service combination.
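One way to think about "latency contribution" is the time an operation spends on its own work, summed over many traces. The sketch below illustrates that idea only. It is not how Lightstep computes the diagram: the `Span` fields and `latency_contribution` are hypothetical, and it assumes child spans run one after another rather than in parallel.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    # Hypothetical, simplified span record used only for this illustration.
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str
    operation: str
    duration_ms: float


def latency_contribution(spans):
    """Rank (service, operation) pairs by their share of total self time."""
    # Total time spent in the direct children of each span.
    child_time = defaultdict(float)
    for span in spans:
        if span.parent_id is not None:
            child_time[(span.trace_id, span.parent_id)] += span.duration_ms

    # Self time = own duration minus children (assumes sequential children).
    self_time = defaultdict(float)
    for span in spans:
        own = max(0.0, span.duration_ms - child_time[(span.trace_id, span.span_id)])
        self_time[(span.service, span.operation)] += own

    total = sum(self_time.values())
    if total == 0:
        return []
    shares = ((key, value / total) for key, value in self_time.items())
    return sorted(shares, key=lambda item: item[1], reverse=True)
```

Run over a large set of traces, the pairs at the top of the returned list are the candidates for the bottlenecks that the diagram highlights. A single trace can only tell you where that one request spent its time.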
If needed, you are also able to dig into a trace view of an individual request to confirm the behaviors seen in the aggregate level analysis.
As you can see from these two examples, even though Lightstep and Jaeger accept the exact same distributed instrumentation telemetry, our handling of the data is very different. Jaeger’s approach primarily provides a listing of traces that you have to sift through and analyze yourself.
On the other hand, Lightstep gathers a large number of representative traces and automatically analyzes them in aggregate to show you systematic behaviors and bottlenecks. This approach saves developers much time and effort by removing the need to tediously sift through a large number of traces to ensure the data is actually reflective of the problem faced.
Hope this was helpful! Want to try Lightstep yourself? A fun way to try the workflows described above is with our interactive sandbox. We also offer free community accounts, and you can schedule a demo at any time.