DigitalOcean Uses Lightstep to Get a Reliable Picture of its Distributed System and Saves 1,000 Hours of Developer Time per Month
by Kristin Brennan
Lightstep uses end-to-end traces to help customers, including DigitalOcean, diagnose performance problems across service boundaries and identify the teams that can fix them. DigitalOcean uses Lightstep to monitor 100+ apps in real time across its distributed system. Lightstep also helps engineers work together and improve productivity, saving 1,000 hours of developer time per month.
Challenge: performing root cause analysis in a distributed system
As its software team was growing quickly, DigitalOcean wanted to improve the way it responded to errors and performance degradations. The company needed a source of truth: a complete, reliable, real-time picture of the system that would give every team the same baseline information. Teams were shipping features efficiently, but communication across engineering teams had suffered. Because it was so difficult to pinpoint the exact origin of a performance problem, it was also difficult to determine the right person to address it. Teams had logs at their disposal, but correlating events in log data was like looking for a needle in a haystack, wasting countless developer hours every week. According to Dave Smith, Sr. Director of Engineering at DigitalOcean: “In our increasingly complex environment, it was impossible for a single person to understand the entire system. Root cause analysis was becoming difficult, and we couldn’t find an application performance monitoring system robust enough to work with our heterogeneity.”
Find the root cause and assign the right team to fix it quickly
Lightstep was able to fit into DigitalOcean’s complex ecosystem, and now gives the engineers a real-time view of the entire system. More than 100 apps are being monitored using Lightstep, and the organization is using the results to promote intra-company accountability and visibility. DigitalOcean also has 144 dashboards, visible company-wide, that help each team understand its services’ performance and see how it relates to the services run by other teams.
Customize dashboards to measure application performance along any dimension: by team ownership, customer transactions, or even individual services.
Lightstep has also changed how teams collaborate on root cause analysis. Prior to using Lightstep, logs were one of the main ways to drill into issues and identify a root cause. This involved digging through multiple databases and external services to identify the problem, followed by a lengthy search through logs to find the cause. Identifying the team responsible for fixing the issue was an additional challenge before final remediation. With Lightstep’s end-to-end traces, alongside customizable dashboards and alerts, this process was cut down to 2-3 steps completed in less than 15 minutes. Lightstep breaks down a performance issue into detailed traces, which connect the dots and explicitly highlight the root cause. This makes it easy to identify the team that can mitigate the issue, even when it crosses team and service boundaries. “Lightstep scales beautifully with our business and our use cases. We’re very pleased with our decision to standardize on it for application performance management,” said Smith.
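To make the idea of tracing an error back to its owning team concrete, here is a minimal Python sketch. It is not Lightstep’s API or DigitalOcean’s implementation: the Span type, the service names, and the SERVICE_OWNERS map are all hypothetical, and the heuristic (treat the deepest errored spans as the likely origin) is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    """One unit of work in a trace (hypothetical, simplified shape)."""
    span_id: str
    parent_id: Optional[str]  # None for the root span of the trace
    service: str
    error: bool


# Hypothetical ownership map; a real one might come from a service catalog.
SERVICE_OWNERS: Dict[str, str] = {
    "api-gateway": "edge-team",
    "auth": "identity-team",
    "billing": "payments-team",
    "database": "storage-team",
}


def root_cause_spans(spans: List[Span]) -> List[Span]:
    """Return errored spans that have no errored child.

    Assumption: errors tend to propagate upward to callers, so the deepest
    errored spans are the most likely origin of the failure.
    """
    parents_of_errors = {s.parent_id for s in spans if s.error and s.parent_id}
    return [s for s in spans if s.error and s.span_id not in parents_of_errors]


def owning_teams(spans: List[Span]) -> List[str]:
    """Map each likely root-cause span to the team that owns its service."""
    return [SERVICE_OWNERS.get(s.service, "unknown") for s in root_cause_spans(spans)]


# A request that fails in the database and bubbles up through billing.
trace = [
    Span("1", None, "api-gateway", error=True),
    Span("2", "1", "auth", error=False),
    Span("3", "1", "billing", error=True),
    Span("4", "3", "database", error=True),
]

print(owning_teams(trace))  # prints ['storage-team']
```

In this toy trace, three services report an error, but only the database span has no failing child, so the sketch points to the team that owns it rather than to the callers that merely passed the error along.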
Read the full case study, DigitalOcean Uses Lightstep as a Source of Truth for its Distributed System, Saving 1000 Hours of Developer Time per Month, to get more information about DigitalOcean’s success.