Being on call can be frustrating. Getting paged for a service or operation you've never even heard of can be even worse. You don't know where to start investigating, who to contact, or which services are even involved. Scouring through your favorite metrics, recent logs, and starred dashboards can quickly become overwhelming and counterproductive. Even looking at a full system topology of your application doesn't help in such situations: you need tools that allow you to focus your analysis on the subset of your architecture that's relevant to the actual problem you're trying to solve.
In a microservices environment, the situation described above is unfortunately all too common. A single component in your application can have many dependencies and investigating each one soon becomes intractable, especially while firefighting. The system architecture changes continuously, and no single engineer is able to maintain an accurate, up-to-date view of all the service interdependencies. Relying solely on intuition or giant, sprawling, per-service dashboards is costly and counterproductive during software outages.
Introducing: Service Diagram
Today, we're excited to introduce Lightstep's Service Diagram, a new tool to expedite root-cause analysis and reduce mean time to resolution. Service Diagram provides a visual, interactive, and hierarchical representation of a system's behavior that is filtered, constructed, and annotated to shed light on user-specified performance questions, all in real time.
Traditional full-system maps are a static, high-level view of sampled system state with limited interaction capabilities. They can be overwhelming and distracting, and they are usually irrelevant or even misleading during a root-cause investigation.
Lightstep's Service Diagram is uniquely designed to guide the user towards components undergoing a performance regression. A user can start their investigation by searching for a specific service, filter by a latency range, high-cardinality tag or operation, and view an interactive diagram that clearly explains the latency bottleneck. By starting with a symptom ("Why is my service slow?") and proceeding to a visual explanation ("One of the descendant services is returning errors"), Lightstep provides the context the user needs to gain insights in their investigation.
Lightstep's novel Satellite Architecture is what makes this all possible: the focused analysis is constructed just-in-time from thousands of distributed traces assembled in response to the user's query. This allows Lightstep to suppress the distractions from unrelated services and components, reducing the noise and further amplifying the signals generated from the actual issue under investigation.
Visualize dependencies between services
Service Diagram intelligently organizes the services in your system to match the flow of requests. It annotates nodes with important data, highlighting services that contribute to the latency of a transaction or services experiencing errors. Service Diagram helps you easily visualize complex system architecture, identify troublesome services, and narrow the search space of possible root causes.
In the example above, the user has queried for a single service, a tag for a single customer, and a specific latency region. Service Diagram builds a diagram to answer this very explicit question about application performance. It also clearly distinguishes between api-proxy, which is sending traffic to api-server, and the few services that are receiving traffic from api-server.
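To make the upstream/downstream distinction concrete, here is a minimal sketch of how caller-to-callee edges could be derived from trace spans. It is an illustration only, not Lightstep's implementation: the span fields (`span_id`, `parent_id`, `service`), the helper names, and the `auth` and `database` services are all assumptions made for the example.

```python
from collections import Counter

def service_edges(spans):
    """Count caller -> callee service edges across a list of spans.

    Assumes a hypothetical span shape: {"span_id", "parent_id", "service"},
    with span IDs that are unique across the whole list.
    """
    by_id = {s["span_id"]: s for s in spans}
    edges = Counter()
    for span in spans:
        parent = by_id.get(span.get("parent_id"))
        # Only parent/child pairs that cross a service boundary become edges.
        if parent is not None and parent["service"] != span["service"]:
            edges[(parent["service"], span["service"])] += 1
    return edges

def upstream_and_downstream(edges, service):
    """Split a service's neighbors into callers and callees."""
    upstream = sorted({src for (src, dst) in edges if dst == service})
    downstream = sorted({dst for (src, dst) in edges if src == service})
    return upstream, downstream

# Hypothetical spans from one request; "auth" and "database" are made-up names.
spans = [
    {"span_id": "a", "parent_id": None, "service": "api-proxy"},
    {"span_id": "b", "parent_id": "a", "service": "api-server"},
    {"span_id": "c", "parent_id": "b", "service": "auth"},
    {"span_id": "d", "parent_id": "b", "service": "database"},
    {"span_id": "e", "parent_id": "b", "service": "api-server"},  # internal call
]

edges = service_edges(spans)
upstream, downstream = upstream_and_downstream(edges, "api-server")
```

In this toy data, api-proxy ends up as the only upstream caller of api-server, and the internal api-server span adds no edge because it never leaves the service.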
Identify services contributing to latency
Service Diagram's distinct design surfaces performance bottlenecks in complex, multi-layered architectures. In the example above, Service Diagram guides the user's focus towards the two services highlighted in yellow. The size of the halos is proportional to the latency observed in each service, helping the user confidently narrow the scope of their investigation.
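One plausible way to attribute latency to individual services is "self time": a span's duration minus the time spent in its direct children. The sketch below uses that approach and scales a halo radius in proportion to each service's total. Both the attribution rule and the field names (`duration_ms`, `parent_id`) are assumptions for illustration; the post does not describe how Service Diagram actually computes these values.

```python
from collections import defaultdict

def self_time_by_service(spans):
    """Sum per-service self time, assuming children of a span do not overlap."""
    child_time = defaultdict(float)
    for s in spans:
        if s["parent_id"] is not None:
            child_time[s["parent_id"]] += s["duration_ms"]

    totals = defaultdict(float)
    for s in spans:
        # Clamp at zero in case child durations exceed the parent's duration.
        own = max(s["duration_ms"] - child_time[s["span_id"]], 0.0)
        totals[s["service"]] += own
    return dict(totals)

def halo_radii(totals, max_radius=40.0):
    """Scale each service's halo so the largest contributor gets max_radius."""
    largest = max(totals.values(), default=0.0)
    if largest == 0:
        return {svc: 0.0 for svc in totals}
    return {svc: max_radius * t / largest for svc, t in totals.items()}

# Hypothetical spans; service names other than api-proxy/api-server are made up.
spans = [
    {"span_id": "a", "parent_id": None, "service": "api-proxy", "duration_ms": 100.0},
    {"span_id": "b", "parent_id": "a", "service": "api-server", "duration_ms": 90.0},
    {"span_id": "c", "parent_id": "b", "service": "auth", "duration_ms": 10.0},
    {"span_id": "d", "parent_id": "b", "service": "database", "duration_ms": 60.0},
]

totals = self_time_by_service(spans)
radii = halo_radii(totals)
```

Here the made-up database service accounts for most of the self time, so it would receive the largest halo even though api-proxy has the longest end-to-end span.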
Service Diagram is fully interactive, enabling users to analyze system performance across any service, operation, or tag. Users can iteratively refine their query, select any interesting portion of the Latency Histogram, and view a diagram built from only the traces that match the given query and filter. This is especially useful when investigating issues impacting unfamiliar services or operations.
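The query-then-filter workflow can be pictured as selecting a subset of traces before any diagram is drawn. The following is a rough sketch under assumed data shapes, not Lightstep's query API: `matching_traces`, the trace and span dictionaries, and the `customer` tag values are all hypothetical.

```python
def matching_traces(traces, service, tags=None, latency_ms=(0.0, float("inf"))):
    """Keep traces with at least one span on `service` that carries all of
    `tags` and whose duration falls inside the selected latency range."""
    tags = tags or {}
    low, high = latency_ms
    selected = []
    for trace in traces:
        for span in trace["spans"]:
            if span["service"] != service:
                continue
            if not (low <= span["duration_ms"] <= high):
                continue
            span_tags = span.get("tags", {})
            if all(span_tags.get(k) == v for k, v in tags.items()):
                selected.append(trace)
                break  # one matching span is enough to keep the trace
    return selected

# Hypothetical traces; customer names and durations are made up.
traces = [
    {"trace_id": "t1", "spans": [
        {"service": "api-server", "duration_ms": 850.0, "tags": {"customer": "acme"}}]},
    {"trace_id": "t2", "spans": [
        {"service": "api-server", "duration_ms": 40.0, "tags": {"customer": "acme"}}]},
    {"trace_id": "t3", "spans": [
        {"service": "api-server", "duration_ms": 900.0, "tags": {"customer": "globex"}}]},
]

# "Slow api-server requests for one customer": a service, a tag, a latency band.
slow = matching_traces(traces, "api-server",
                       tags={"customer": "acme"},
                       latency_ms=(500.0, 1000.0))
slow_ids = [t["trace_id"] for t in slow]
```

Only the traces that survive this filter would feed the diagram, which is why narrowing the latency band or adding a tag changes what the diagram shows.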
Formulate, validate, and propagate root cause hypotheses
Service Diagram is built from the traces captured in Snapshots, which means it can be captured and shared across an organization. This allows users to share both historical and real-time observations while triaging a performance issue. For organizations that have recently adopted microservices, Service Diagram can be an invaluable resource to document architectural and performance changes over time.
Service Diagram demystifies some of the complexity commonly associated with microservices. It aggregates the rich information available in tracing data and surfaces it in an easy-to-visualize way. We've received positive feedback from customers with early access to this feature, and we're excited to see how Service Diagram will be used to make being on call a little less harrowing.
Are you adopting microservices? Do you relate a little too well to the on-call story above? Try Lightstep and see how we can help you.
February 11, 2019 • 4 min read • Announcements
About the authors

Danton Rodriguez
Karthik Kumar