Twilio Improves Mean Time To Resolution (MTTR) by 92% with LightStep

Founded in 2008, Twilio’s mission is to fuel the future of communications. Twilio lowers engineering costs with its flexible, scalable communications APIs, and allows businesses to easily build relevant and contextual messaging, voice, and video capabilities directly into their software. Research shows that software applications increase user adoption and utilization rates when conversations with users take place within the application itself. This translates to lower costs, new revenue opportunities, and improved operational efficiencies. Thousands of global businesses including ING Bank, Morgan Stanley, Netflix, and Salesforce incorporate Twilio APIs into their products to better engage with their customers. Twilio has over 900 employees, with headquarters in San Francisco and other offices in Bogotá, Dublin, Hong Kong, London, Madrid, Malmö, Mountain View, Munich, New York City, Singapore, and Tallinn.

The Challenge

Selling Trust with Reliable Performance at Scale

Companies trust Twilio to power their communications with customers. Service interruptions or high latency in Twilio systems affect both Twilio’s customers and their end users, so application performance and reliability are critical to maintaining customer relationships. “We’re in the business of selling trust, and the ability to identify and resolve system issues quickly and efficiently is a huge priority for us,” said Jason Hudak, Vice President of Platform Engineering, Twilio.

Hudak and his team wanted to go beyond traditional monitoring approaches. Their vision was to deliver a flawless experience to the 1.6 million developers registered on the Twilio platform, as well as provide additional service and insights for premium clients.

Twilio’s engineers realized that quickly detecting and mitigating performance issues affecting their premium customers wasn’t going to work with traditional practices. Setting up Service Level Agreement (SLA) alerts on a customer-by-customer basis, using approaches like log data alerting, was cost-prohibitive due to the dramatic increase in logging data volumes associated with Twilio’s transition to microservices. Hudak wanted to be able to identify traces of specific, noteworthy events, but traditional approaches – like centralized logging – were “simply not the right solution. Logging solutions can provide information about who, what, and where things happened, but LightStep answers why things happened and helps us do root cause analysis very quickly,” noted Hudak.

As Data Scales, So Do Production Issues

As Twilio’s architecture grew to more than 40 core services, it experienced engineering growing pains that are typical of a high-volume, high-performance distributed system. It’s simple to track and segment data when you have a small number of services, users, and tags; but as an engineering system scales and the number of unique values in its data increases, so do the cost and difficulty of identifying specific system issues.

Opportunity Cost – Buy Versus Build

Hudak’s team explored several options to reinforce Twilio’s application performance management strategy. Traditional solutions could not support the scale and complexity of Twilio’s systems. The team estimated that building even a very simple in-house system would take at least 6 months and 2 full-time engineers. Maintaining that basic system would also require 10% of the developers’ time going forward. Moreover, the system would be incapable of capturing 100% of the application performance data, handling account-level application performance management, or enforcing SLAs.

The Solution

Save Engineering Time by Buying a Proactive Performance Management Solution

Once Twilio engineers discovered LightStep, they immediately recognized it as a solution capable of handling the complexity of Twilio’s multi-dimensional performance data, running across the entire distributed system. They quickly scrapped plans to build their own solution. “With LightStep, the decision to buy versus build a tracing-based application performance monitoring system became easy. We didn’t want to delay the availability of valuable insights by reinventing a wheel that LightStep already built,” Hudak said. “We take our commitments to customers very seriously and only rely on best-in-class technologies that deliver enterprise-wide visibility and scalability.”

With LightStep, Twilio avoided spending a significant amount of engineering time and money to build and maintain a basic proprietary system. LightStep took care of performance monitoring and issue identification, so Twilio’s engineering teams could focus on building new features and driving the platform forward.

“There’s total cost of ownership of a system. We should be shooting for 5 to 10x return on investment on every engineer. We originally thought building a performance management solution was a great use of engineering resources. Then we found LightStep and realized we’d not only save a significant amount of engineering time and money by using the product, but would also benefit from LightStep’s support and superior feature set,” said David Dunstan, Director of Insight Engineering, Twilio.

Keeping Top Customers Happy

When Hudak and his team chose LightStep, their top priority was proactive account-level performance management. They found LightStep provided a straightforward solution that didn’t involve additional overhead or engineering resources. To integrate LightStep into existing workflows and increase internal adoption, the Twilio team set up dashboards and alerts. These tools allowed Twilio teams to track individual customers and transactions – focusing on information most pertinent to them and their specific customers.

“We have over a million customers today. With LightStep, we can prioritize remediation for our most important customers and give them customized insights and reliability,” said Hudak. By using LightStep, Hudak’s team doesn’t have to sift through huge volumes of log data to identify performance traces, which is cost-prohibitive and time-consuming. “LightStep enables us to look at data at the account level and deliver alerts to account managers and customer success managers in real time when potential latency issues arise.” Hudak continued, “We can proactively work with customers to deliver premium service and remediate problems before they impact business operations.”

The Results

Trustworthy and Customized Performance Stability

LightStep helps Twilio deliver reliable performance while saving engineering time. Buying LightStep allowed David Dunstan’s Insight Engineering team to save a significant amount of engineering time and money and provided them a fully supported superior feature set. It has also made root cause analysis a breeze. “With LightStep, our ability to detect and remediate issues has dramatically improved,” said Hudak. “When we go through exercises to test the system, root cause analysis for many complex failures has been reduced from an average of 40 minutes to less than three minutes with LightStep. This saves our engineering team nearly 20 hours each week.”
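The headline 92% figure appears consistent with the root cause analysis times Hudak quotes. As a minimal sketch of that arithmetic, the snippet below computes the percentage reduction from 40 minutes to 3 minutes. The function name `percent_reduction` is purely illustrative, and 3 minutes is assumed as an upper bound because the quote says “less than three minutes”; the case study does not state exactly how the 92% figure was derived.

```python
def percent_reduction(before: float, after: float) -> float:
    """Return the percentage reduction going from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) * 100.0 / before


# Figures quoted in the case study (minutes per root cause analysis).
before_minutes = 40.0
# Assumption: "less than three minutes" is treated as exactly 3,
# so the real reduction would be at least this large.
after_minutes = 3.0

reduction = percent_reduction(before_minutes, after_minutes)
print(f"Reduction in root cause analysis time: {reduction:.1f}%")
```

Under that assumption the reduction works out to 92.5%, which lines up with the reported 92% improvement in mean time to resolution.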

After observing the impact of LightStep on Dunstan’s team, Hudak decided to extend the product to the Billing Transactions Engineering team. Within an hour of running LightStep, they were able to identify issues and deliver improvements that led to a 70% reduction in latency. This was possible because LightStep not only detects performance incidents but also explains why they occur.

Using LightStep also helped Twilio redefine personalized customer support by introducing SLA alerting for individual accounts. “We initially started with one of our enterprise accounts,” Hudak said. “The test proved successful when our support engineer received an alert and proactively reached out to remediate the issue with the customer. Had we not done so, the problem could have escalated into a service-impacting event.”

Twilio uses LightStep to monitor and manage the transaction performance data for each individual customer and to set up alerts for the appropriate member of the account team. While lower latency and less downtime translate into quantifiable business metrics, Hudak explained that maintaining and improving customer confidence is an even bigger win. “Operational excellence builds customer confidence. If our systems go down, we lose trust. LightStep helps us to reduce latency and build trust by delivering on our mission to provide the features customers want and the quality they deserve,” said Hudak.

Challenges

  • Continuously analyzing the transaction performance data generated by more than 40 core services without adding overhead
  • Getting a macro view of interactions across services and clients
  • Building, maintaining, and increasing customer trust
  • Measuring performance and mitigating issues for specific, strategic customers
  • Ensuring developers stay focused on shipping, not building monitoring
  • Needing to analyze all of the performance data generated by more than 40 core services

Business Results

  • Improved mean time to resolution by 92% for production issues
  • Found issues in the first hour of deployment that led to a 70% reduction in latency
  • Saved Insight Engineering team 20 hours per week
  • Launched segmented and detailed performance monitoring for each top Twilio customer, including customer-specific root cause analysis

Organization Details

  • Headquarters: San Francisco, CA
  • Industry Segment: Communications
  • Employees: 990
  • Publicly Traded: NYSE: TWLO
  • Market Capitalization: $2.5B+