DigitalOcean uses LightStep as a Source of Truth for its Distributed System, Saving 1000 Hours of Developer Time per Month

DigitalOcean, a cloud platform provider with offices in New York, NY and Cambridge, MA, makes it simple for developers to build great software by offering transparent and affordable pricing, a simple and elegant user experience, a highly engaged developer community, and one of the most comprehensive libraries of open source resources in the world.

The Challenge

Distributed Technologies, Distributed Systems, and Distributed Teams

DigitalOcean’s business has grown exponentially over the last four years. The company has more than one million users and has shipped seven new offerings in the last 18 months. DigitalOcean’s legacy code was written in Perl and Ruby, but the team switched to Go so they could better support the development velocity they needed for new product features and an increasingly complex distributed environment.

To support this growth and evolution, the leadership team decided to hire globally. Enabling teams to work across distributed locations was the best strategy to attract diverse, top-tier talent. The engineering team was thrilled to have the breadth of experience and knowledge that new hires across all continents brought. But they realized DigitalOcean needed a source of truth: a complete, reliable, real-time picture of the system that would give every developer the same baseline information.

Root Cause Analysis in a Heterogeneous System

With the software and team growing quickly, DigitalOcean wanted to improve the way they responded to errors and performance degradations. According to Dave Smith, Sr. Director of Engineering, DigitalOcean: “In our increasingly complex environment, it was impossible for a single person to understand the entire system. Root cause analysis was becoming difficult, and we couldn’t find an application performance monitoring system robust enough to work with our heterogeneity.”

Smith explained that individual engineering teams needed to see all the moving parts of the larger system because they had specialized in specific services. Teams were shipping features efficiently, but communication across different engineering teams had suffered.

Because it was so difficult to pinpoint the exact origins of a performance problem, it was also difficult to determine the right person to address the problem. Teams had logs at their disposal, but correlating events in log data was like looking for a needle in a haystack, wasting countless developer hours per week.

The Solution

LightStep Empowers Engineers

In the summer of 2016, Antoine Grondin, Sr. Engineer, discovered LightStep through developers in the Go community. When he investigated, he found LightStep was able to bring critical information about performance degradations front and center using clear dashboards and detailed traces. He also found that LightStep could deliver a customized implementation to fit DigitalOcean’s complex needs. He looped in Sr. Director of Engineering, Dave Smith, who quickly confirmed Antoine’s instinct and saw that LightStep could streamline cross-organization diagnostics. “Antoine found a technology that empowered engineers, and I knew the value he saw for his individual workflow would scale well to serve the rest of the organization. We value the customer experience, and ensuring the highest quality means we use the best solutions,” said Smith.

After instrumenting a few of his own services with OpenTracing and connecting them to LightStep, Antoine and his team saw immediate benefits from the root cause and bottleneck identification. Then, they instrumented DigitalOcean’s Remote Procedure Call (RPC) layer, which serves as the standard communication layer in the company’s architecture. Immediately, DigitalOcean found that with LightStep’s detailed alerts and traces, it became easy to identify which teams could mitigate an issue when it crossed team and service boundaries.

As a result, they incorporated LightStep into the company-wide service generator using OpenTracing. This ensures that any new microservice is OpenTracing-compatible and has the option of using LightStep, which immediately gives the entire organization visibility into all services touched by the RPC layer. This helped DigitalOcean and its heterogeneous system get great value from LightStep from the start.

Different Projects, Same Page

Distributed systems allow teams to focus on specific projects or services. As engineers get more specialized, staying up-to-date on everyone’s work and progress can be challenging. It’s also hard to understand who should be accountable for issues and performance degradations. By using the LightStep dashboard feature, DigitalOcean is able to create graphs related to each team’s work. LightStep is the unifying solution that DigitalOcean engineers use to get a real-time view of the entire system. It helps the distributed team understand where other team members are impacting the project. With LightStep, multiple teams create their own dashboards, which anyone in the organization can review, helping create a culture of transparency and accountability.

Another important aspect of this transparency is the way LightStep has changed how teams collaborate on root cause analysis. Prior to using LightStep, logs were one of the main ways to drill into issues and identify a root cause. It involved digging through multiple databases and external services to identify the problem, followed by a lengthy search through logs to find the cause. Identifying the responsible team to fix the issue was an additional challenge before final issue remediation. Using LightStep, this process was cut down to 2-3 steps, completed in less than 15 minutes. LightStep breaks down a performance issue into detailed traces, which connect the dots and explicitly highlight the service or component that holds the key to the root cause.

The Results

As a source of truth, LightStep not only connects a distributed team but also helps engineers work together and improve productivity. By removing the process of digging through logs for root cause analysis, LightStep saves the average backend engineer four hours per week. Collectively, this adds up to nearly 1000 hours per month of engineering time.
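As a rough sanity check on how four hours per week relates to nearly 1000 hours per month, the short Go sketch below works the arithmetic in both directions. The case study does not state how many backend engineers are included, so the headcount it prints is only what the two quoted figures imply, not a reported number. The function names and the 52/12 weeks-per-month convention are assumptions made for this illustration.

```go
package main

import "fmt"

// weeksPerMonth is the average number of weeks in a month (52 weeks / 12 months).
// Assumption: the case study does not say which convention it used.
const weeksPerMonth = 52.0 / 12.0

// MonthlyHours returns the total hours saved per month for a given number
// of engineers who each save hoursPerWeek.
func MonthlyHours(engineers int, hoursPerWeek float64) float64 {
	return float64(engineers) * hoursPerWeek * weeksPerMonth
}

// ImpliedEngineers works backward from a monthly total to the headcount
// that total implies at hoursPerWeek saved per engineer.
func ImpliedEngineers(totalMonthlyHours, hoursPerWeek float64) float64 {
	return totalMonthlyHours / (hoursPerWeek * weeksPerMonth)
}

func main() {
	// Headcount implied by the two figures quoted in the text.
	fmt.Printf("%.1f\n", ImpliedEngineers(1000, 4)) // prints 57.7

	// Forward check with a hypothetical 58 engineers.
	fmt.Printf("%.0f\n", MonthlyHours(58, 4)) // prints 1005
}
```

Under these assumptions, the two figures are consistent with a group of somewhat under 60 backend engineers.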

LightStep serves as a one-stop shop to help engineers understand various elements of the DigitalOcean ecosystem. More than 100 apps are now being monitored using LightStep, and the organization is using the results to promote intra-company accountability and visibility. Teams also maintain 144 dashboards that are visible company-wide, helping each team understand its services’ performance and see how it relates to the services run by other teams.

Challenges

  • Performing root cause analysis in a polyglot distributed system
  • Providing a source of truth for developers to see a complete, reliable picture of the system in real time
  • Creating a culture of trust and accountability
  • Improving communication across different engineering teams

Business Results

  • Monitors 100+ apps in real time
  • Saves 1000 hours per month of developer time
  • Completes root cause analysis in minutes instead of hours or days
  • Provides access to a real-time, accurate snapshot of the entire system

Organization Details

  • Headquarters: New York City
  • Industry Segment: Information Technology
  • Employees: 300
  • Funding: $305 Million