Twilio Engineer Shares How They Achieve Five 9s of Availability

In our recent tech talk on SD Times – Managing the Performance of Applications in the Microservices Era – Tyler Wells, Director of Engineering at Twilio, shared his insights on how to effectively manage the performance of microservices-based applications and how they achieve five 9s of availability and success.

Tyler said that integrating new tools and solutions into a developer’s workflow can be a challenge for any organization: there needs to be a big carrot. For Twilio, the carrot was a 92% reduction in mean time to resolution (MTTR) for production incidents and a 70% improvement in mean latency for critical services. Now, they can also detect failures before they impact customers. This article shows how they accomplished these results and how other organizations can do the same.

How Twilio integrated [x]PM into its engineering process and workflow

Tyler described why his team was motivated to try [x]PM and how it fit into their workflow. “Twilio was born and raised in the cloud and has always been built on distributed microservices. My team was an early adopter of LightStep. We were excited about the opportunity to instrument and add tracing to the complex distributed systems we have in the Programmable Video group. You can imagine that setting up a video call involves a lot of steps, and there are a lot of systems. The orchestration messages have to pass through: authorization, authentication, creating the Room [session], orchestrating the Room, adding Participants to the Room. These are all distributed systems, so we added tracing, including Tags and rich information specific to our business, and we started watching. We watched the p99 latency, and we started honing in on the outliers. As we highlighted these outliers, we pulled the information we needed to help identify one of these Rooms using [the Room’s] Sid or GUIDs. We used those IDs to look through [LightStep] and figure out, from the highlighted spans showing the latency, exactly what was going on. That was our first experience with LightStep and how we started to derive value.”

LightStep [x]PM - Managing Application Performance in the Microservices Era: Monitor latency, alert on SLA violations, and focus on the outliers to quickly determine root cause
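To make the "watch the p99, then chase the outliers" workflow Tyler describes a little more concrete, here is a minimal Python sketch. It is not LightStep's API or Twilio's code: the `Span` class, the `room_sid` tag, the operation names, and the sample durations are all assumptions made up for illustration. It computes a nearest-rank p99 over span durations and lists the spans slower than that, along with the Room identifier you would then look up in a tracing tool.

```python
import math
from dataclasses import dataclass


@dataclass
class Span:
    operation: str      # hypothetical operation name, e.g. "create-room"
    room_sid: str       # hypothetical tag carrying the Room's Sid
    duration_ms: float


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def p99_outliers(spans):
    """Return the p99 latency and the spans that are slower than it."""
    threshold = percentile([s.duration_ms for s in spans], 99)
    return threshold, [s for s in spans if s.duration_ms > threshold]


# Made-up data: 198 ordinary spans plus two slow ones.
spans = [Span("create-room", f"RM{i:04d}", 20.0 + i % 10) for i in range(198)]
spans += [
    Span("create-room", "RM_slow_1", 950.0),
    Span("add-participant", "RM_slow_2", 1200.0),
]

threshold, outliers = p99_outliers(spans)
print(f"p99 = {threshold} ms")
for s in outliers:
    # The Room identifier is what you would search for in the tracing tool.
    print(s.room_sid, s.operation, s.duration_ms)
```

In this toy data set the two slow Rooms are the only spans above the p99, which is the point of the exercise: the percentile gives you the threshold, and the business-specific tag tells you which session to investigate.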

How chaos actually helps

Tyler talked about the benefits of always assuming that things will break. “We like to break our systems before we put them into the hands of our customers, so we do a lot of Chaos Engineering. We use a tool like Gremlin to start breaking things. LightStep makes it easy for us to be able to hone in on what happens when things go wrong. We know when you’re operating in the cloud, everything is going to break at some point in time. Using LightStep in conjunction with our ‘Game Days,’ we got a ton of visualization, so we could create the SLA alerts, which we have integrated into PagerDuty and Slack. If incidents are triggered, our team immediately shows up in a Slack channel and all of the rich LightStep information is there for us to help identify issues.”

Achieving five 9s of availability and success

Tyler explains how they achieve operational excellence. “We have a program at Twilio called Operational Maturity Model (OMM). It’s a program all teams must follow when pushing product into production. The program has a number of different dimensions: LightStep sits in the Operations dimension. We have a specific policy in the Operations dimension that’s literally called LightStep. There are a number of items in every dimension that teams need to check off to reach a specific grade, with the highest grade being Iron Man. In order for any team to go into production and claim general availability, they have to implement LightStep, use LightStep as part of their Game Days, and they have to achieve Iron Man status. That’s how we use it at Twilio.”

Tyler summarized Twilio’s focus on operational excellence to build customer confidence: “We typically target five 9s [99.999%] of availability and five 9s of success. Generally speaking, five 9s is discipline, not luck.”
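For a sense of how little room five 9s leaves, the downtime budget can be worked out directly from the percentage. The short Python sketch below is plain arithmetic, not anything Twilio published; the function name and the 365-day year and 30-day month are assumptions chosen for illustration.

```python
def allowed_downtime_seconds(availability_pct, period_days=365):
    """Seconds of downtime permitted over the period at the given availability."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    period_seconds = period_days * 24 * 60 * 60
    return period_seconds * (1 - availability_pct / 100)


per_year = allowed_downtime_seconds(99.999)            # five 9s over a 365-day year
per_month = allowed_downtime_seconds(99.999, 30)       # five 9s over a 30-day month

print(f"per year:  {per_year / 60:.2f} minutes")
print(f"per month: {per_month:.1f} seconds")
```

At 99.999%, the budget works out to roughly 5.26 minutes per year, or about 26 seconds in a 30-day month, which is why detection and root cause analysis have to happen in minutes rather than hours.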

Overcoming resistance to change

Tyler described how his team was able to show results and convince other teams at Twilio to use [x]PM. “Any time you try to introduce a new tool to engineers, there’s always going to be some level of resistance. Everybody has more work on their plates and in their backlog than they can handle, and then someone shows up and says: ‘hey, here’s this really cool tool that you should try.’ It’s always met with a healthy dose of skepticism. We had some teams that were early adopters that really derived incredible value from using LightStep. We were able to articulate those results and show other teams (that may have been skeptics). We showed how it helped us solve production-level issues, meet our goals on the operational excellence front, and deliver that higher level of operational maturity to our customers.”

Watch the tech talk, Managing the Performance of Applications in the Microservices Era, to get all of the details about how Twilio is using [x]PM. Don’t miss the demo to see [x]PM in action.

DigitalOcean Uses LightStep [x]PM to Get a Reliable Picture of its Distributed System and Saves 1,000 Hours of Developer Time per Month

LightStep [x]PM enables customers, including DigitalOcean, to diagnose performance problems across service boundaries and identify the teams that can fix them using end-to-end traces. DigitalOcean uses [x]PM to monitor 100+ apps in real time across its distributed system. [x]PM also helps engineers work together and improve productivity, saving 1,000 hours per month of developer time.

Challenge: performing root cause analysis in a distributed system

As its software team was growing quickly, DigitalOcean wanted to improve the way it responded to errors and performance degradations. The company needed a source of truth to see a complete, reliable picture of the system in real time that would help them all have the same baseline information. Teams were shipping features efficiently, but communication across different engineering teams had suffered. Because it was so difficult to pinpoint the exact origins of a performance problem, it was also difficult to determine the right person to address the problem. Teams had logs at their disposal, but correlating events in log data was like looking for a needle in a haystack, wasting countless developer hours per week. According to Dave Smith, Sr. Director of Engineering at DigitalOcean: “In our increasingly complex environment, it was impossible for a single person to understand the entire system. Root cause analysis was becoming difficult, and we couldn’t find an application performance monitoring system robust enough to work with our heterogeneity.”

Find the root cause and assign the right team to fix it quickly

[x]PM was able to fit into DigitalOcean’s complex ecosystem, and now gives the engineers a real-time view of the entire system. 100+ apps are being monitored using [x]PM, and the organization is using the results to promote intra-company accountability and visibility. They also have 144 company-wide visible dashboards that help each team understand their services’ performance and see how it relates to all the other services hosted by other teams.

DigitalOcean Uses [x]PM’s End-to-End Traces Along with Customizable Dashboards and Alerts: Customize dashboards to measure application performance along any dimension, by team ownership, customer transactions, or even individual services.

[x]PM has also changed how teams collaborate on root cause analysis. Prior to using [x]PM, logs were one of the main ways to drill into issues and identify a root cause. The process involved digging through multiple databases and external services to identify the problem, followed by a lengthy search through logs to find the cause. Identifying the responsible team to fix the issue was an additional challenge before final remediation. With [x]PM’s end-to-end traces, alongside customizable dashboards and alerts, this process was cut down to 2-3 steps and completed in less than 15 minutes. [x]PM breaks down a performance issue into detailed traces, which connect the dots and explicitly highlight the root cause. This process makes it easy to identify the team that can mitigate the issue even when it crosses teams and service boundaries. “[x]PM scales beautifully with our business and our use cases. We’re very pleased with our decision to standardize on it for application performance management,” said Smith.

Read the full case study, DigitalOcean Uses LightStep [x]PM as a Source of Truth for its Distributed System, Saving 1000 Hours of Developer Time per Month, to get more information about DigitalOcean’s success.

LightStep [x]PM Enables Lyft’s Move to Microservices

LightStep [x]PM allows customers, just like Lyft, to analyze every transaction at scale. Lyft uses [x]PM to observe every request, in real time, across web, mobile, microservices, and monoliths – ensuring end-to-end performance management across all of its systems. Pete Morelli, VP of Engineering at Lyft, says: “LightStep is the future of monitoring and was instrumental in our move to microservices.”

Challenge: moving to microservices to manage growth, reduce costs, and improve product efficiency

In order to rapidly scale its system and support growth, Lyft started to explore moving from a monolithic architecture to microservices. Today, Lyft deploys more than 200 microservices in its distributed architecture, and this number is growing. These services work together to perform fundamental functions of the Lyft app, including matching riders with drivers, optimizing the route for the most efficient ride, and processing riders’ payment information. It’s a challenge to quickly and accurately monitor Lyft’s system as the number of microservices grows, because a distributed architecture generates exponentially more data than its monolithic predecessor. To gain insight into this detailed level of performance as efficiently as possible, Lyft chose to implement [x]PM.

Diagnose anything by seeing everything

[x]PM is the only solution that monitors 100% of unsampled transaction data and is always-on in production environments, with negligible overhead. With its unique architecture, [x]PM can capture a near-limitless amount of data and weave distributed trace data together into meaningful point-in-time stories about the application – even if the data was produced asynchronously or across distinct service boundaries. [x]PM considers every operation and intelligently assembles traces automatically for interesting events like errors or latency spikes, as well as traces representative of normal operating behavior. Once assembled, these traces are stored indefinitely and can be reviewed at any time. By considering all of an application’s transactional data, [x]PM reliably detects one-in-a-million anomalies, unlike any other technology, and shows everything that happens both upstream and downstream from the event. Lyft’s systems generate more than 100 billion microservice calls per day. As Morelli stated, “With [x]PM, there is no risk of overlooking any problems at the edges where the biggest problems are found.”

LightStep [x]PM Enables Lyft’s Move to Microservices, Helping Drive Significant Revenue and Improving Product Efficiency: View detailed end-to-end traces for complex, distributed transactions and make better critical-path optimizations

Monitoring and application performance insights from [x]PM also empower engineers to make many critical-path optimizations that improve ride request times, increase dispatch efficiency, and ensure effective incident postmortems – all of which translate into increased revenue and developer efficiency. According to engineers Roy Williams and Danial Afzal, one of the first projects where they used [x]PM was a spring cleaning of the entire system. The focus was on identifying and optimizing critical paths for dispatch services that connect riders with drivers. Lyft was able to improve the efficiency of customer ride routes and accelerate response times by 60% (250 milliseconds). Saving time is a key goal, explained Williams: “The more time we get, the more efficient we can be. If we can use those extra milliseconds to find a more efficient match, that’s a win for us, that’s a win for our customers.”

Read the case study, LightStep [x]PM Enables Lyft’s Move to Microservices, Helping Drive Significant Revenue and Improving Product Efficiency, and get all of the details about Lyft’s success.

Yext Moves to LightStep [x]PM, Reducing Time Spent on Root Cause Analysis

LightStep [x]PM enables customers, including Yext, to diagnose anything across all components of their software systems. Yext uses [x]PM to identify latency and error issues and then remediate quickly and efficiently. Overall, Yext believes LightStep is saving its team nearly one week a month of staff time in problem diagnostics.

Challenge: making root cause analysis more efficient

Microservices allow a distributed system to evolve as a set of independent parts, but a complete, global view of the system is still a requirement to effectively manage a production system. Yext found that the traditional APM solutions could not provide that full visibility. The team wanted a solution that could trace a transaction end-to-end and be always-on in production, so they could understand the root cause and resolve it quickly. “As the other solutions just record a sample of transactions and traces, we were only getting a partial picture,” said Rob Figueiredo, VP of Engineering at Yext. “In addition, finding these partial records in the system, so we could act on them was extremely difficult.”

Pinpoint the root cause with LightStep [x]PM

[x]PM automatically finds and connects related trace data into meaningful stories about an application’s performance, even if the data was produced asynchronously and across distinct service boundaries. [x]PM pinpoints actual instances of notable events such as latency issues and errors. By analyzing 100% of diagnostic data, [x]PM reliably detects one-in-a-million anomalies. It shows everything that happens both upstream and downstream from these issues, including critical paths and logs, providing complete root cause analysis.

Yext Uses LightStep [x]PM, Reducing Time Spent on Root Cause Analysis: View end-to-end traces from frontend clients down to backend services, and jump straight to errors and bottlenecks

Yext is realizing substantial returns from its deployment of [x]PM, including improved efficiencies for the engineering team. Prior to the deployment of [x]PM, weekly production engineering meetings were time-consuming and inefficient. “We spent a lot of time on detective work, trying to perform root cause analysis on application performance from the prior week,” Figueiredo recalled. “Not only did these problems fester, but we expended a lot of valuable time trying to ascertain the root cause of problems. Our production engineering meetings are now embarrassingly easy. [x]PM is so simple; it tells us about performance problems within seconds, and we’re able to go straight to the root cause in the production system.”

Read the full case study, Yext Moves to LightStep [x]PM, Improving Application Performance, to get more information about Yext’s success.

Twilio Delivers Proactive Account-Level Performance Management for Premium Clients with LightStep [x]PM

LightStep [x]PM cuts through the scale and complexity in today’s software systems to help organizations, like Twilio, tie system performance to business objectives. [x]PM has allowed Twilio teams to track individual customers and transactions, so they can deliver premium service and remediate problems proactively.

Challenge: measuring performance and mitigating issues for specific, strategic customers

The Twilio team’s vision was to deliver a flawless experience to the 1.6 million developers registered on the Twilio platform, as well as provide additional service and insights for premium clients. They realized that quickly detecting and mitigating performance issues affecting their premium customers wasn’t going to work with traditional practices. Setting up Service Level Agreement (SLA) alerts on a customer-by-customer basis, using approaches like log data alerting, was cost-prohibitive due to the dramatic increase in logging data volumes associated with Twilio’s transition to microservices. “We’re in the business of selling trust, and the ability to identify and resolve system issues quickly and efficiently is a huge priority for us,” said Jason Hudak, Vice President of Platform Engineering at Twilio.

Keeping top customers happy with LightStep [x]PM

Twilio integrated [x]PM into existing workflows and set up dashboards and alerts which allowed the teams to track individual customers and their transactions. “[x]PM enables us to look at data at the account level and deliver alerts to account managers and customer success managers in real time when potential latency issues arise. We can proactively work with customers to deliver premium service and remediate problems before they impact business operations,” said Hudak.

Measure performance where it matters most

[x]PM enables enterprises to pursue any business objective that depends on fast, reliable software. Just like Twilio, companies can accurately monitor latency, operation rates, and errors, and also view detailed trace information for every transaction throughout their applications, all broken down by individual customers and in real time. [x]PM also provides the flexibility to create account-specific alerts for these metrics, so key team members are notified when acceptable thresholds are exceeded. The evaluation window for these alerts is customizable from minutes to even weeks, so users can both detect immediate issues before they escalate and also monitor app performance for sustained customer trends that may have serious business consequences.

Monitor account-specific performance metrics and create custom alerts for them
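The idea of an account-specific alert with a configurable evaluation window can be sketched in a few lines of Python. This is only an illustration of the concept described above, not [x]PM's alerting API: the `Sample` class, the `accounts_breaching` function, the account IDs, and the 300 ms threshold are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Sample:
    account_id: str       # hypothetical per-customer tag
    timestamp: datetime
    latency_ms: float


def accounts_breaching(samples, now, window, threshold_ms):
    """Return the accounts whose mean latency inside the window exceeds the threshold."""
    cutoff = now - window
    totals = {}  # account_id -> (sum of latencies, number of samples)
    for s in samples:
        if cutoff <= s.timestamp <= now:
            total, count = totals.get(s.account_id, (0.0, 0))
            totals[s.account_id] = (total + s.latency_ms, count + 1)
    return sorted(
        account for account, (total, count) in totals.items()
        if total / count > threshold_ms
    )


now = datetime(2024, 1, 1, 12, 0)
samples = [
    Sample("AC_premium", now - timedelta(minutes=1), 400.0),
    Sample("AC_premium", now - timedelta(minutes=2), 600.0),
    Sample("AC_other", now - timedelta(minutes=1), 100.0),
    Sample("AC_other", now - timedelta(minutes=3), 120.0),
    Sample("AC_other", now - timedelta(hours=2), 5000.0),  # old spike
]

# Short window: catches the issue affecting AC_premium right now.
print(accounts_breaching(samples, now, timedelta(minutes=5), 300.0))
# Long window: the earlier spike pulls AC_other over the threshold as well.
print(accounts_breaching(samples, now, timedelta(weeks=1), 300.0))
```

The same check gives different answers depending on the window, which mirrors the trade-off in the text: a short window surfaces immediate issues, while a long one surfaces sustained trends for a particular customer.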

[x]PM helped Twilio redefine personalized customer support by introducing SLA alerting for individual accounts. Hudak said, “We initially started with one of our enterprise accounts. The test proved successful when our support engineer received an alert and proactively reached out to remediate the issue with the customer. Had we not done so, the problem could have escalated into a service-impacting event.”

Twilio uses [x]PM to monitor and manage the transaction performance data for each individual customer and to set up alerts for the appropriate member of the account team. Lower latency and reduced downtime translate into quantified business metrics. However, maintaining and improving customer confidence, especially for premium clients, is an even bigger win. “Operational excellence builds customer confidence. If our systems go down, we lose trust. [x]PM helps us to reduce latency and build trust by delivering on our mission to provide the features customers want and the quality they deserve,” said Hudak.

Read the full case study, Twilio Improves Mean Time To Resolution (MTTR) by 92% with LightStep [x]PM, to get all of the details about Twilio’s success.

Twilio Improves Mean Time To Resolution (MTTR) by 92% with LightStep [x]PM

LightStep [x]PM has helped some of the world’s most innovative companies, including Twilio, monitor what matters most and diagnose anomalies within seconds. LightStep enables companies to pinpoint the root cause of issues quickly, and Twilio used [x]PM to improve mean time to resolution (MTTR) by 92%.

Challenge: reducing time to detect and remediate issues

When we first talked to the team at Twilio, they said they wanted to be able to identify traces of specific, noteworthy events, but traditional approaches – like centralized logging – were “simply not the right solution. Logging solutions can provide information about who, what, and where things happened, but LightStep [x]PM answers why things happened and helps us do root cause analysis very quickly,” said Jason Hudak, VP of Platform Engineering at Twilio.

LightStep [x]PM satellite architecture yields targeted insights

[x]PM is built on LightStep’s cutting-edge Satellite Architecture, which distributes data collection and statistical analyses, yielding targeted insights from anywhere within today’s software systems. To help customers reduce MTTR, [x]PM delivers prompt, content-rich alerts and provides real-time traces that give visibility into exactly how separate services and parts of an application interact with each other.

Root causes for anomalous latency spikes or errors are often buried in some backend service, making them extremely difficult to uncover. [x]PM lets users easily drill down and examine the complex service interactions for very large traces across arbitrary time ranges and for any latency band to diagnose those issues. [x]PM further analyzes these services within the context of one another for every trace to help users quickly determine the critical path, and it presents log information and payloads inline for each transaction of interest. These capabilities enable customers like Twilio to visualize, identify, and resolve issues faster.

LightStep [x]PM Real Time Trace for Root Cause Analysis: Visualize, identify, and resolve latency spikes and errors faster with LightStep [x]PM

[x]PM has demystified root cause analysis at Twilio. As Hudak said, “With [x]PM, our ability to detect and remediate issues has dramatically improved. When we go through exercises to test the system, root cause analysis for many complex failures has been reduced from an average of 40 minutes to less than three minutes with [x]PM. This saves our engineering team nearly 20 hours each week.”
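The figures in Hudak's quote line up with the headline number, as a quick check shows. The Python snippet below is just that arithmetic; the function name is an assumption, and three minutes is used as the upper bound because the quote says "less than three minutes."

```python
def percent_reduction(before, after):
    """Percentage by which `after` is lower than `before`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) * 100 / before


# 40 minutes down to (at most) 3 minutes of root cause analysis.
print(percent_reduction(40, 3))  # 92.5
```

Because the actual time was under three minutes, the reduction is at least 92.5%, consistent with the 92% MTTR improvement cited above.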

Read the full case study, Twilio Improves Mean Time To Resolution (MTTR) by 92% with LightStep [x]PM, to get all of the details about Twilio’s success.