Using Operation diagrams to understand dependencies and performance
by Andrew Chee
We all know that in a modern distributed software system your application is only as fast as your slowest dependency. With the proliferation of microservices and containerization, any given application request can now potentially traverse multiple services to complete its work. Understanding the performance for a particular request as it hits all of the service boundaries is becoming more and more important because it’s useless to optimize one part of the stack when the bottleneck could be somewhere further downstream.
Lightstep offers a unique approach to understanding your operation dependencies and system performance using our Change Intelligence workflow and Operation diagrams. This allows even the most inexperienced engineers to quickly understand their system dependencies and performance bottlenecks.
Lightstep captures and analyses 100% of your distributed tracing requests. In Lightstep, the requests that are important to your service (such as API endpoints and other ingress points) are marked as Key Operations. For root cause analysis purposes, each Key Operation is tracked for a 10-day period, during which tens of thousands of traces are stored. These traces are intelligently chosen to represent the entirety of your latency distribution, including outliers and errors. When a performance regression occurs, the historical data is already stored in our system, which makes root cause analysis simple and powerful.
Let’s take a look at one of these scenarios:
Here is a service dependency map of the application in question. There are several web/mobile services that communicate with a set of backend services by going through an API gateway.
In this scenario, we are investigating a problem with the iOS service. As you can see in the image below, there is an operation called update-catalog that is performing slowly. Its P99 response time went from 266 ms to 1.29 s.
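To make the P99 figure concrete, here is a minimal sketch of how a 99th-percentile latency could be computed and compared across two time windows. It uses the nearest-rank definition of a percentile, which is an assumption on our part and not necessarily how Lightstep computes it. The sample latencies are synthetic and were chosen only so that the tail lands on the 266 ms and 1.29 s values mentioned above.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic latencies in milliseconds (made up for illustration, not real data).
baseline = [100] * 98 + [266, 300]
regression = [100] * 98 + [1290, 1500]

p50_before = percentile(baseline, 50)
p50_after = percentile(regression, 50)
p99_before = percentile(baseline, 99)
p99_after = percentile(regression, 99)

print(f"P50: {p50_before} ms -> {p50_after} ms")
print(f"P99: {p99_before} ms -> {p99_after} ms")
```

In this made-up data the median does not move at all while the P99 grows almost fivefold, which is why tail percentiles are usually the first place a regression like this shows up.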
As you saw in the original service dependency map above, that iOS service is potentially hitting 15 different downstream services. How do we understand what dependencies are hit with this single operation? How do we understand where the bottleneck actually originated? Do we wake up all the teams and put them into one big war room? Do we go ask the senior engineer who has been around forever (and knows where all the bodies are buried) for their expert opinion? Lightstep’s operation dependency diagrams can help even the most junior engineers understand and pinpoint performance bottlenecks and behaviors. Let’s see how.
Above is an image of Lightstep’s Operation diagram. We are able to analyze a large number of traces in aggregate and show not only the service and operation dependencies but also the aggregate time spent in each operation as well as any errors. Using this data, the system automatically shows you where the bottlenecks and errors are originating.
In the above scenario, we can quickly see that the iOS service's update-catalog operation calls downstream to the Krakend API gateway, which in turn calls down to an Authentication service as well as the Warehouse service. Within the Warehouse service, we see two large yellow circles on the database-update operations, indicating a high overall latency contribution. At a glance, we can see all the downstream services and operations that this one type of request hits, as well as where the bottlenecks are at a systemic level.
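The idea of aggregating many traces into one diagram can be sketched in a few lines. The example below is not Lightstep's implementation; it is a simplified illustration that assumes each span records its parent, and that child spans run sequentially so a span's "self time" is its duration minus the durations of its direct children. The `Span` type, the `aggregate` function, the `route` and `authenticate` operation names, and all durations are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str
    operation: str
    duration_ms: float
    error: bool = False


def aggregate(spans):
    """Roll spans from many traces up into per-operation nodes and caller->callee edges."""
    by_id = {(s.trace_id, s.span_id): s for s in spans}

    # Total time spent in the direct children of each span.
    child_ms = defaultdict(float)
    for s in spans:
        if s.parent_id is not None:
            child_ms[(s.trace_id, s.parent_id)] += s.duration_ms

    nodes = defaultdict(lambda: {"count": 0, "self_ms": 0.0, "errors": 0})
    edges = defaultdict(int)
    for s in spans:
        key = (s.service, s.operation)
        # Self time: assumes children do not overlap (a simplification).
        self_ms = max(0.0, s.duration_ms - child_ms[(s.trace_id, s.span_id)])
        nodes[key]["count"] += 1
        nodes[key]["self_ms"] += self_ms
        nodes[key]["errors"] += int(s.error)
        parent = by_id.get((s.trace_id, s.parent_id))
        if parent is not None:
            edges[((parent.service, parent.operation), key)] += 1
    return dict(nodes), dict(edges)


def make_trace(trace_id, db_ms):
    """Build one synthetic update-catalog trace (all values are made up)."""
    return [
        Span(trace_id, "a", None, "ios", "update-catalog", db_ms + 200.0),
        Span(trace_id, "b", "a", "krakend", "route", db_ms + 150.0),
        Span(trace_id, "c", "b", "authentication", "authenticate", 50.0),
        Span(trace_id, "d", "b", "warehouse", "database-update", db_ms),
    ]


spans = make_trace("t1", 1000.0) + make_trace("t2", 800.0)
nodes, edges = aggregate(spans)

# The operation with the largest aggregate self time is the likely bottleneck.
bottleneck = max(nodes, key=lambda k: nodes[k]["self_ms"])
print(bottleneck, nodes[bottleneck])
```

Ranking nodes by aggregate self time, not by total duration, is what keeps the upstream callers from looking slow simply because they wait on the Warehouse service.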
Using this information, we can easily identify the offending service (the Warehouse service). Because of this, we no longer need to notify all the different service teams in the stack. We can even rule out all but one of the service teams within this one request chain, because we have now pinpointed the problem to one service and two operations within that service.
Traditionally, this type of root cause analysis exercise would involve a large number of people, taking time away from other important tasks. But with Lightstep’s Operation diagrams and root cause analysis capabilities, we can easily pinpoint the cause and remove the need to involve so many resources unnecessarily.
Having the ability to visualize the operation and service flow of your important requests in aggregate allows you to quickly identify systemic problems. This type of analysis has traditionally been impossible because tracing systems did not analyze your entire traffic volume and could therefore not provide an aggregated/systematic view of your requests. The best they could do was to show you the performance of a single request. With Lightstep’s Operation diagrams and root cause analysis capabilities, this type of analysis becomes quick and simple for anyone on your team.