LightStep and OpsGenie Partner to Improve Application Performance and Incident Management

Microservices-based architectures enable software teams to deliver innovations and value to their customers faster. Individual engineering teams often own their microservices outright, from development through deployment. This autonomy reduces cross-team dependencies, but it also means each team is accountable for the ongoing performance of its own services in production. With LightStep [x]PM and integrated solutions such as OpsGenie, a leading incident management platform, teams are proactively alerted when potential SLA violations or latency issues occur, and can see the associated end-to-end traces to pinpoint root causes quickly.

LightStep [x]PM is unique because it analyzes completely unsampled trace data and is able to segment this information by extremely high-cardinality key:value tags, such as customer IDs or build numbers. This means [x]PM captures every performance anomaly or failure, no matter how brief or rare the occurrences are. [x]PM is the ideal solution for companies with microservices-based applications, because users can isolate real-time and historical performance data along any dimension and uncover root causes even for complex transactions spanning service boundaries – letting teams focus on the issues they’re responsible for.

LightStep [x]PM with OpsGenie alerts the right people based on on-call schedules when SLA violations or latency issues occur.

Information is useful only when users receive it when and where they need it. [x]PM has been integrated with complementary DevOps solutions so that teams can access their performance data in their preferred, existing workflows. Our customers requested that we integrate with the OpsGenie incident management platform for operating always-on services. When a service-level agreement (SLA) is violated or a violation is resolved, [x]PM sends JSON notifications to OpsGenie, which automatically creates custom alerts, notifies the right people based on on-call schedules (via email, SMS text messages, phone calls, and iOS and Android push notifications), and escalates alerts until they are acknowledged or closed.
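To make the notification flow concrete, here is a minimal sketch of how an SLA notification could be mapped onto an OpsGenie alert. The notification fields (sla_id, sla_name, status, service, latency_ms, threshold_ms, trace_url) are hypothetical, since this post does not show the actual JSON that [x]PM sends. The output keys (message, alias, description, priority, tags, details) are fields accepted by OpsGenie's Alert API. The sketch only builds the payload and makes no network call.

```python
import json


def build_opsgenie_alert(notification: dict) -> dict:
    """Map a (hypothetical) SLA notification onto OpsGenie alert fields."""
    violated = notification["status"] == "violated"
    return {
        "message": f"SLA {notification['status']}: {notification['sla_name']}",
        # A stable alias lets a later "resolved" event refer to the same alert.
        "alias": f"sla-{notification['sla_id']}",
        "description": (
            f"p99 latency {notification['latency_ms']} ms "
            f"(threshold {notification['threshold_ms']} ms)"
        ),
        # Assumption: page loudly on violations, keep resolutions low priority.
        "priority": "P1" if violated else "P5",
        "tags": ["sla", notification["service"]],
        "details": {"trace_url": notification["trace_url"]},
    }


# Hypothetical example notification; every value here is made up.
sample = {
    "sla_id": 42,
    "sla_name": "checkout p99 latency",
    "status": "violated",
    "service": "checkout",
    "latency_ms": 870,
    "threshold_ms": 500,
    "trace_url": "https://example.invalid/trace/abc123",
}

if __name__ == "__main__":
    print(json.dumps(build_opsgenie_alert(sample), indent=2))
```

Deriving the alias from the SLA identifier is one reasonable design choice: the violation and its later resolution then point at the same alert instead of creating two unrelated ones.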

We’re excited about the value of this new integration for our customers. We’ll continue to enhance [x]PM to work well with popular tools and DevOps best practices for adopting, developing, and maintaining microservices-based applications.

Try it out and share your feedback at support@lightstep.com, and let us know what other integrations you’d like to see.

Twilio Engineer Shares How They Achieve Five 9s of Availability

In our recent tech talk on SD Times – Managing the Performance of Applications in the Microservices Era – Tyler Wells, Director of Engineering at Twilio, shared his insights on how to effectively manage the performance of microservices-based applications and how they achieve five 9s of availability and success.

Tyler said that integrating new tools and solutions into a developer’s workflow can be a challenge for any organization: there needs to be a big carrot. For Twilio, the carrot was a 92% reduction in mean time to resolution (MTTR) for production incidents and a 70% improvement in mean latency for critical services. Now, they can also detect failures before they impact customers. This article shows how they accomplished these results and how other organizations can do the same.

How Twilio integrated [x]PM into its engineering process and workflow

Tyler described why his team was motivated to try [x]PM and how it fit into their workflow. “Twilio was born and raised in the cloud and has always been built on distributed microservices. My team was an early adopter of LightStep. We were excited about the opportunity to instrument and add tracing to the complex distributed systems we have in the Programmable Video group. You can imagine that setting up a video call involves a lot of steps, and there are a lot of systems. The orchestration messages have to pass through: authorization, authentication, creating the Room [session], orchestrating the Room, adding Participants to the Room. These are all distributed systems, so we added tracing, including Tags and rich information specific to our business, and we started watching. We watched the p99 latency, and we started honing in on the outliers. As we highlighted these outliers, we pulled the information we needed to help identify one of these Rooms using [the Room’s] Sid or GUIDs. We used those IDs to look through [LightStep] and figure out, from the highlighted spans showing the latency, exactly what was going on. That was our first experience with LightStep and how we started to derive value.”

LightStep [x]PM - Managing Application Performance in the Microservices Era
Monitor latency, alert on SLA violations, and focus on the outliers to quickly determine root cause.
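The workflow Tyler describes (watch the p99 latency, then pull the IDs of the outliers) can be sketched in a few lines. This is an illustration, not how [x]PM computes percentiles. It uses a nearest-rank p99 over an in-memory list of spans, and the span fields room_sid and latency_ms are hypothetical names standing in for the Tags his team attached.

```python
def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # Integer ceiling of pct/100 * n, which avoids floating-point rounding.
    rank = max(1, (len(ordered) * pct + 99) // 100)
    return ordered[rank - 1]


def outlier_spans(spans, pct=99):
    """Return the spans whose latency is strictly above the pct-th percentile."""
    threshold = percentile([s["latency_ms"] for s in spans], pct)
    return [s for s in spans if s["latency_ms"] > threshold]


if __name__ == "__main__":
    # Made-up data: 100 spans with latencies of 1..100 ms.
    spans = [{"room_sid": f"RM{i}", "latency_ms": i} for i in range(1, 101)]
    for span in outlier_spans(spans):
        print(span["room_sid"], span["latency_ms"])
```

The returned room_sid values are what an engineer would then search for to open the corresponding traces.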

How chaos actually helps

Tyler talked about the benefits of always assuming that things will break. “We like to break our systems before we put them into the hands of our customers, so we do a lot of Chaos Engineering. We use a tool like Gremlin to start breaking things. LightStep makes it easy for us to be able to hone in on what happens when things go wrong. We know when you’re operating in the cloud, everything is going to break at some point in time. Using LightStep in conjunction with our ‘Game Days,’ we got a ton of visualization, so we could create the SLA alerts, which we have integrated into PagerDuty and Slack. If incidents are triggered, our team immediately shows up in a Slack channel and all of the rich LightStep information is there for us to help identify issues.”

Achieving five 9s of availability and success

Tyler explained how they achieve operational excellence. “We have a program at Twilio called Operational Maturity Model (OMM). It’s a program all teams must follow when pushing product into production. The program has a number of different dimensions: LightStep sits in the Operations dimension. We have a specific policy in the Operations dimension that’s literally called LightStep. There are a number of items in every dimension that teams need to check off to reach a specific grade, with the highest grade being Iron Man. In order for any team to go into production and claim general availability, they have to implement LightStep, use LightStep as part of their Game Days, and they have to achieve Iron Man status. That’s how we use it at Twilio.”

Tyler summarized Twilio’s focus on operational excellence to build customer confidence: “We typically target five 9s [99.999%] of availability and five 9s of success. Generally speaking, five 9s is discipline, not luck.”
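To put five 9s in perspective, the arithmetic below converts a target percentage into an error budget. At 99.999% availability, a 365-day year leaves roughly 5.26 minutes of allowed downtime. The second function assumes that "five 9s of success" means the share of requests that succeed, which is an interpretation and not something the talk spells out. The function names are illustrative.

```python
def downtime_budget_minutes(availability_pct, period_days=365):
    """Minutes of downtime allowed in the period at the given availability."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


def failure_budget(total_requests, success_pct):
    """Failed requests allowed at the given success rate (assumed meaning)."""
    return total_requests * (1 - success_pct / 100)


if __name__ == "__main__":
    print(f"{downtime_budget_minutes(99.999):.2f} minutes of downtime per year")
    print(f"{failure_budget(1_000_000, 99.999):.0f} failures per million requests")
```

For comparison, the same calculation for three 9s (99.9%) gives about 525.6 minutes per year, which is why each additional 9 demands a much tighter operational process.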

Overcoming resistance to change

Tyler described how his team was able to show results and convince other teams at Twilio to use [x]PM. “Any time you try to introduce a new tool to engineers, there’s always going to be some level of resistance. Everybody has more work on their plates and in their backlog than they can handle, and then someone shows up and says: ‘hey, here’s this really cool tool that you should try.’ It’s always met with a healthy dose of skepticism. We had some teams that were early adopters that really derived incredible value from using LightStep. We were able to articulate those results and show other teams (that may have been skeptics). We showed how it helped us solve production-level issues, meet our goals on the operational excellence front, and deliver that higher level of operational maturity to our customers.”

Watch the tech talk, Managing the Performance of Applications in the Microservices Era, to get all of the details about how Twilio is using [x]PM. Don’t miss the demo to see [x]PM in action.