Microservices Lead the New Class of Performance Management Solutions

Microservices have become mainstream and are disrupting the operational landscape

There is a fundamental shift taking place in the market today with the massive adoption of microservices. In greater numbers than ever, organizations are decomposing their monolithic applications into microservices and building new applications as microservices from the start. This movement is disrupting the operational landscape and breaking the traditional APM model. As Gartner stated in a recent report, most APM solutions are ill-suited to the dynamism, modularity and scale of microservice-based applications.

The back story

We wanted to better understand whether the trends we’re seeing around the rise of microservices and the pain associated with traditional APM tools were specific to early adopters or whether a massive shift was underway. To help answer that question, we surveyed hundreds of companies about their microservices adoption plans. The results of that formal survey are detailed in the Global Microservices Trends report. The survey was conducted by Dimensional Research in April 2018 and sponsored by LightStep, and it included a total of 353 development professionals across the U.S.; Europe, the Middle East and Africa (EMEA); and Asia. Company size ranged from 500 employees to more than 5,000.

Microservices bring new operational challenges

Almost all survey respondents, 99 percent, report challenges in monitoring microservices. And each additional microservice increases operational challenges, according to 56 percent of respondents. One of the key architectural differences in a microservices environment is that transactions are processed through heavy use of cross-service API calls, which has caused an exponential increase in data volume. Of those using microservices in production, 87 percent report that they generate more application data.

Figure: Global Microservices Trends Report 2018 - Challenges

Microservice performance management is critical to success

Among those who have microservices in production, 73 percent report that troubleshooting is more difficult in this environment than with a traditional monolithic application. Of the users who have trouble identifying the root cause of performance issues in microservices environments, 98 percent report a direct business impact, and 76 percent of those say it takes longer to resolve issues.

Investments are increasing in performance management for microservices

Performance management for microservices will be a big area of investment in the coming year, with most respondents (74 percent) reporting that they will increase their investment. The money to fund these purchases will often come from existing expenditures on performance management for monolithic applications: about a third of respondents (30 percent) will decrease their investments in those solutions in the coming year.

Figure: Global Microservices Trends Report 2018 - Investment

Record growth in microservices

According to the survey results, 92 percent of respondents said they increased their number of microservices in the last year, and 92 percent expect to grow their use of microservices in the coming year. Agility (82 percent) and scalability (78 percent) were the top motivators for microservice adoption.

Microservices are widely used today

Microservices have become ubiquitous among enterprise development teams. About 9 in 10 are currently using or have plans to use microservices. For well over half, 60 percent, adoption is already advanced. 86 percent expect microservices to be the default architecture within five years.

These findings in the Global Microservices Trends report underscore the need for a new class of performance management solutions. Monolithic applications and traditional APM tools are quickly becoming relics of the past. LightStep [x]PM is specifically designed to address the scale and complexity of modern applications.

Read the complete report and contact us, so we can help you manage the performance of your microservices.

DigitalOcean Uses LightStep [x]PM to Get a Reliable Picture of its Distributed System and Saves 1,000 Hours of Developer Time per Month

LightStep [x]PM enables customers, including DigitalOcean, to diagnose performance problems across service boundaries and identify the teams that can fix them using end-to-end traces. DigitalOcean uses [x]PM to monitor 100+ apps in real time across its distributed system. [x]PM also helps engineers work together and improve productivity, saving 1,000 hours per month of developer time.

Challenge: performing root cause analysis in a distributed system

As its software team was growing quickly, DigitalOcean wanted to improve the way it responded to errors and performance degradations. The company needed a source of truth to see a complete, reliable picture of the system in real time that would help them all have the same baseline information. Teams were shipping features efficiently, but communication across different engineering teams had suffered. Because it was so difficult to pinpoint the exact origins of a performance problem, it was also difficult to determine the right person to address the problem. Teams had logs at their disposal, but correlating events in log data was like looking for a needle in a haystack, wasting countless developer hours per week. According to Dave Smith, Sr. Director of Engineering at DigitalOcean: “In our increasingly complex environment, it was impossible for a single person to understand the entire system. Root cause analysis was becoming difficult, and we couldn’t find an application performance monitoring system robust enough to work with our heterogeneity.”

Find the root cause and assign the right team to fix it quickly

[x]PM fit into DigitalOcean’s complex ecosystem and now gives the engineers a real-time view of the entire system. More than 100 apps are being monitored using [x]PM, and the organization is using the results to promote intra-company accountability and visibility. They also have 144 dashboards, visible company-wide, that help each team understand its services’ performance and see how it relates to the services owned by other teams.

Figure: DigitalOcean uses [x]PM’s end-to-end traces along with customizable dashboards and alerts. Customize dashboards to measure application performance along any dimension: by team ownership, customer transactions, or even individual services.

[x]PM has also changed how teams collaborate on root cause analysis. Prior to using [x]PM, logs were one of the main ways to drill into issues and identify a root cause. It involved digging through multiple databases and external services to identify the problem, followed by a lengthy search through logs to find the cause. Identifying the responsible team to fix the issue was an additional challenge before final remediation. Using [x]PM’s end-to-end traces, alongside customizable dashboards and alerts, this process was cut down to two or three steps and completed in less than 15 minutes. [x]PM breaks down a performance issue into detailed traces, which connect the dots and explicitly highlight the root cause. This process makes it easy to identify the team that can mitigate the issue, even when it crosses team and service boundaries. “[x]PM scales beautifully with our business and our use cases. We’re very pleased with our decision to standardize on it for application performance management,” said Smith.

Read the full case study, DigitalOcean Uses LightStep [x]PM as a Source of Truth for its Distributed System, Saving 1000 Hours of Developer Time per Month, to get more information about DigitalOcean’s success.

LightStep [x]PM Enables Lyft’s Move to Microservices

LightStep [x]PM allows customers like Lyft to analyze every transaction at scale. Lyft uses [x]PM to observe every request, in real time, across web, mobile, microservices, and monoliths, ensuring end-to-end performance management across all of its systems. Pete Morelli, VP of Engineering at Lyft, says: “LightStep is the future of monitoring and was instrumental in our move to microservices.”

Challenge: moving to microservices to manage growth, reduce costs, and improve product efficiency

In order to rapidly scale its system and support growth, Lyft started to explore moving from a monolithic architecture to microservices. Today, Lyft deploys more than 200 microservices in its distributed architecture, and this number is growing. These services work together to perform fundamental functions of the Lyft app, including matching riders with drivers, optimizing the route for the most efficient ride, and processing riders’ payment information. It’s a challenge to quickly and accurately monitor Lyft’s system as the number of microservices grows, because a distributed architecture generates exponentially more data than its monolithic predecessor. To gain insight into this detailed level of performance as efficiently as possible, Lyft chose to implement [x]PM.

Diagnose anything by seeing everything

[x]PM is the only solution that monitors 100% of unsampled transaction data and is always-on in production environments, with negligible overhead. With its unique architecture, [x]PM can capture a near-limitless amount of data and weave distributed trace data together into meaningful point-in-time stories about the application – even if the data was produced asynchronously or across distinct service boundaries. [x]PM considers every operation and intelligently assembles traces automatically for interesting events like errors or latency spikes, as well as traces representative of normal operating behavior. Once assembled, these traces are stored indefinitely and can be reviewed at any time. By considering all of an application’s transactional data, [x]PM reliably detects one-in-a-million anomalies, unlike any other technology, and shows everything that happens both upstream and downstream from the event. Lyft’s systems generate more than 100 billion microservice calls per day. As Morelli stated, “With [x]PM, there is no risk of overlooking any problems at the edges where the biggest problems are found.”
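The idea of assembling traces and surfacing the ones tied to errors or latency spikes can be illustrated with a minimal Python sketch. This is not LightStep's implementation: the Span fields, the function names, the service names, and the fixed latency threshold are all hypothetical, chosen only to show the general technique of grouping spans by trace ID and flagging the traces that look interesting.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    # Hypothetical span record; real tracing systems carry many more fields.
    trace_id: str
    service: str
    operation: str
    start_ms: float
    duration_ms: float
    error: bool = False


def assemble_traces(spans):
    """Group spans by trace ID and order each trace by start time."""
    traces = defaultdict(list)
    for span in spans:
        traces[span.trace_id].append(span)
    return {tid: sorted(group, key=lambda s: s.start_ms) for tid, group in traces.items()}


def trace_latency_ms(trace):
    """End-to-end latency: earliest span start to latest span finish."""
    start = min(s.start_ms for s in trace)
    end = max(s.start_ms + s.duration_ms for s in trace)
    return end - start


def interesting_traces(spans, latency_threshold_ms):
    """Return the IDs of traces that contain an error or exceed the threshold."""
    flagged = []
    for tid, trace in assemble_traces(spans).items():
        if any(s.error for s in trace) or trace_latency_ms(trace) > latency_threshold_ms:
            flagged.append(tid)
    return sorted(flagged)


# Illustrative data only: three requests crossing made-up services.
spans = [
    Span("t1", "api", "request_ride", 0, 120),
    Span("t1", "dispatch", "match_driver", 10, 90),
    Span("t2", "api", "request_ride", 0, 900),
    Span("t2", "payments", "charge", 50, 800),
    Span("t3", "api", "request_ride", 0, 100),
    Span("t3", "payments", "charge", 20, 30, error=True),
]
print(interesting_traces(spans, latency_threshold_ms=500))  # ['t2', 't3']
```

Because every span is considered before anything is discarded, a rare slow or failing request is still seen, which is the property the paragraph above attributes to analyzing unsampled data.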

Figure: LightStep [x]PM Enables Lyft’s Move to Microservices, Helping Drive Significant Revenue and Improving Product Efficiency. View detailed end-to-end traces for complex, distributed transactions and make better critical-path optimizations.

Monitoring and application performance insights from [x]PM also empower engineers to make many critical-path optimizations that improve ride request times, increase dispatch efficiency, and ensure effective incident postmortems, all of which translate into increased revenue and developer efficiency. According to engineers Roy Williams and Danial Afzal, one of the first projects where they used [x]PM was a spring cleaning of the entire system. The focus was on identifying and optimizing critical paths for dispatch services that connect riders with drivers. Lyft was able to improve the efficiency of customer ride routes and accelerate response times by 60% (250 milliseconds). Saving time is a key goal, explained Williams: “The more time we get, the more efficient we can be. If we can use those extra milliseconds to find a more efficient match, that’s a win for us, that’s a win for our customers.”

Read the case study, LightStep [x]PM Enables Lyft’s Move to Microservices, Helping Drive Significant Revenue and Improving Product Efficiency, and get all of the details about Lyft’s success.

Yext Moves to LightStep [x]PM, Reducing Time Spent on Root Cause Analysis

LightStep [x]PM enables customers, including Yext, to diagnose anything across all components of their software systems. Yext uses [x]PM to identify latency and error issues and then remediate quickly and efficiently. Overall, Yext believes LightStep is saving its team nearly one week a month of staff time in problem diagnostics.

Challenge: making root cause analysis more efficient

Microservices allow a distributed system to evolve as a set of independent parts, but a complete, global view of the system is still required to manage a production system effectively. Yext found that traditional APM solutions could not provide that full visibility. The team wanted a solution that could trace a transaction end-to-end and be always-on in production, so they could understand the root cause of a problem and resolve it quickly. “As the other solutions just record a sample of transactions and traces, we were only getting a partial picture,” said Rob Figueiredo, VP of Engineering at Yext. “In addition, finding these partial records in the system so we could act on them was extremely difficult.”

Pinpoint the root cause with LightStep [x]PM

[x]PM automatically finds and connects related trace data into meaningful stories about an application’s performance, even if the data was produced asynchronously and across distinct service boundaries. [x]PM pinpoints actual instances of notable events such as latency issues and errors. By analyzing 100% of diagnostic data, [x]PM reliably detects one-in-a-million anomalies. It shows everything that happens both upstream and downstream from these issues, including critical paths and logs, providing complete root cause analysis.

Figure: Yext Uses LightStep [x]PM, Reducing Time Spent on Root Cause Analysis. View end-to-end traces from frontend clients down to backend services, and jump straight to errors and bottlenecks.

Yext is realizing substantial returns from its deployment of [x]PM, including improved efficiencies for the engineering team. Prior to the deployment of [x]PM, weekly production engineering meetings were time-consuming and inefficient. “We spent a lot of time on detective work, trying to perform root cause analysis on application performance from the prior week,” Figueiredo recalled. “Not only did these problems fester, but we expended a lot of valuable time trying to ascertain the root cause of problems. Our production engineering meetings are now embarrassingly easy. [x]PM is so simple; it tells us about performance problems within seconds, and we’re able to go straight to the root cause in the production system.”

Read the full case study, Yext Moves to LightStep [x]PM, Improving Application Performance, to get more information about Yext’s success.

Twilio Delivers Proactive Account-Level Performance Management for Premium Clients with LightStep [x]PM

LightStep [x]PM cuts through the scale and complexity in today’s software systems to help organizations, like Twilio, tie system performance to business objectives. [x]PM has allowed Twilio teams to track individual customers and transactions, so they can deliver premium service and remediate problems proactively.

Challenge: measuring performance and mitigating issues for specific, strategic customers

The Twilio team’s vision was to deliver a flawless experience to the 1.6 million developers registered on the Twilio platform, as well as provide additional service and insights for premium clients. They realized that quickly detecting and mitigating performance issues affecting their premium customers wasn’t going to work with traditional practices. Setting up Service Level Agreement (SLA) alerts on a customer-by-customer basis, using approaches like log data alerting, was cost-prohibitive due to the dramatic increase in logging data volumes associated with Twilio’s transition to microservices. “We’re in the business of selling trust, and the ability to identify and resolve system issues quickly and efficiently is a huge priority for us,” said Jason Hudak, Vice President of Platform Engineering at Twilio.

Keeping top customers happy with LightStep [x]PM

Twilio integrated [x]PM into existing workflows and set up dashboards and alerts which allowed the teams to track individual customers and their transactions. “[x]PM enables us to look at data at the account level and deliver alerts to account managers and customer success managers in real time when potential latency issues arise. We can proactively work with customers to deliver premium service and remediate problems before they impact business operations,” said Hudak.

Measure performance where it matters most

[x]PM enables enterprises to pursue any business objective that depends on fast, reliable software. Just like Twilio, companies can accurately monitor latency, operation rates, and errors, and also view detailed trace information for every transaction throughout their applications, all broken down by individual customers and in real time. [x]PM also provides the flexibility to create account-specific alerts for these metrics, so key team members are notified when acceptable thresholds are exceeded. The evaluation window for these alerts is customizable from minutes to even weeks, so users can both detect immediate issues before they escalate and also monitor app performance for sustained customer trends that may have serious business consequences.
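As a rough illustration of account-specific alerting with a configurable evaluation window, the following Python sketch tracks latency per account and reports a breach when the mean over the window exceeds a threshold. The class name, its parameters, the account IDs, and the choice of a simple mean are assumptions made for the example; they do not describe how [x]PM evaluates alerts. The sketch also assumes measurements for an account arrive in timestamp order.

```python
from collections import defaultdict, deque


class AccountLatencyAlert:
    """Hypothetical alert: mean latency per account over a sliding time window."""

    def __init__(self, threshold_ms, window_seconds):
        self.threshold_ms = threshold_ms
        self.window_seconds = window_seconds
        # account_id -> deque of (timestamp_seconds, latency_ms)
        self._samples = defaultdict(deque)

    def record(self, account_id, timestamp, latency_ms):
        """Add one measurement; return True if the account is now in breach."""
        samples = self._samples[account_id]
        samples.append((timestamp, latency_ms))
        # Drop measurements that have aged out of the evaluation window.
        while samples and samples[0][0] <= timestamp - self.window_seconds:
            samples.popleft()
        mean = sum(latency for _, latency in samples) / len(samples)
        return mean > self.threshold_ms


# Illustrative values: a 200 ms threshold evaluated over five minutes.
alert = AccountLatencyAlert(threshold_ms=200, window_seconds=300)
print(alert.record("acct-premium", 0, 150))    # False: mean is 150 ms
print(alert.record("acct-premium", 60, 400))   # True: mean is 275 ms
print(alert.record("acct-premium", 400, 100))  # False: older samples aged out
```

A short window, as here, reacts to immediate problems; setting window_seconds to days or weeks would instead surface the sustained trends the paragraph above mentions.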

Figure: Monitor account-specific performance metrics and create custom alerts for them.

[x]PM helped Twilio redefine personalized customer support by introducing SLA alerting for individual accounts. Hudak said, “We initially started with one of our enterprise accounts. The test proved successful when our support engineer received an alert and proactively reached out to remediate the issue with the customer. Had we not done so, the problem could have escalated into a service-impacting event.”

Twilio uses [x]PM to monitor and manage the transaction performance data for each individual customer and to set up alerts for the appropriate member of the account team. Lower latency and less downtime translate into quantifiable business metrics. However, maintaining and improving customer confidence, especially for premium clients, is an even bigger win. “Operational excellence builds customer confidence. If our systems go down, we lose trust. [x]PM helps us to reduce latency and build trust by delivering on our mission to provide the features customers want and the quality they deserve,” said Hudak.

Read the full case study, Twilio Improves Mean Time To Resolution (MTTR) by 92% with LightStep [x]PM, to get all of the details about Twilio’s success.

Twilio Improves Mean Time To Resolution (MTTR) by 92% with LightStep [x]PM

LightStep [x]PM has helped some of the world’s most innovative companies, including Twilio, monitor what matters most and diagnose anomalies within seconds. LightStep enables companies to pinpoint the root cause of issues quickly, and Twilio used [x]PM to improve mean time to resolution (MTTR) by 92%.

Challenge: reducing time to detect and remediate issues

When we first talked to the team at Twilio, they said they wanted to be able to identify traces of specific, noteworthy events, but traditional approaches – like centralized logging – were “simply not the right solution. Logging solutions can provide information about who, what, and where things happened, but LightStep [x]PM answers why things happened and helps us do root cause analysis very quickly,” said Jason Hudak, VP of Platform Engineering at Twilio.

LightStep [x]PM satellite architecture yields targeted insights

[x]PM is built on LightStep’s cutting-edge Satellite Architecture, which distributes data collection and statistical analyses, yielding targeted insights from anywhere within today’s software systems. To help customers reduce MTTR, [x]PM delivers prompt, content-rich alerts and provides real-time traces that give visibility into exactly how separate services and parts of an application interact with each other.

Root causes for anomalous latency spikes or errors are often buried in some backend service, making them extremely difficult to uncover. [x]PM lets users easily drill down and examine the complex service interactions for very large traces across arbitrary time ranges and for any latency band to diagnose those issues. [x]PM further analyzes these services within the context of one another for every trace to help users quickly determine the critical path, and it presents log information and payloads inline for each transaction of interest. These capabilities enable customers like Twilio to visualize, identify, and resolve issues faster.
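To make the notion of a critical path concrete, here is a simplified Python sketch. It assumes a trace is a tree of spans linked by a parent ID and, at each level, follows the child span that finishes last. The Span fields, the service names, and this "latest-finishing child" heuristic are assumptions for illustration; production tracing tools generally use more careful critical-path analysis, and nothing here describes [x]PM's own method.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    # Hypothetical span record with a parent link forming a tree.
    span_id: str
    parent_id: Optional[str]
    service: str
    start_ms: float
    duration_ms: float

    @property
    def end_ms(self):
        return self.start_ms + self.duration_ms


def critical_path(spans):
    """From the root down, follow the child span that finishes last at each level."""
    children = {}
    root = None
    for span in spans:
        if span.parent_id is None:
            root = span
        else:
            children.setdefault(span.parent_id, []).append(span)
    if root is None:
        return []
    path = [root]
    current = root
    while children.get(current.span_id):
        current = max(children[current.span_id], key=lambda s: s.end_ms)
        path.append(current)
    return [s.service for s in path]


# Illustrative trace: the slow work is buried two levels down, in "pricing".
trace = [
    Span("a", None, "api-gateway", 0, 300),
    Span("b", "a", "auth", 5, 20),
    Span("c", "a", "dispatch", 30, 260),
    Span("d", "c", "geo-index", 40, 60),
    Span("e", "c", "pricing", 45, 240),
]
print(critical_path(trace))  # ['api-gateway', 'dispatch', 'pricing']
```

In this made-up trace the gateway looks slow from the outside, but the path shows that the time is spent in a backend service, which is the kind of buried root cause the paragraph above describes.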

Figure: LightStep [x]PM real-time trace for root cause analysis. Visualize, identify, and resolve latency spikes and errors faster with LightStep [x]PM.

[x]PM has demystified root cause analysis at Twilio. As Hudak said, “With [x]PM, our ability to detect and remediate issues has dramatically improved. When we go through exercises to test the system, root cause analysis for many complex failures has been reduced from an average of 40 minutes to less than three minutes with [x]PM. This saves our engineering team nearly 20 hours each week.”
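The 92% figure in the headline lines up with the numbers in Hudak's quote, as a quick calculation shows. The helper name below is hypothetical, and three minutes is used as the upper bound because the quote says "less than three minutes", so the result is a minimum for the reduction.

```python
def percent_reduction(before_minutes, after_minutes):
    """Percentage by which a duration shrank, relative to the original duration."""
    return (before_minutes - after_minutes) * 100 / before_minutes


# 40 minutes down to (at most) 3 minutes per root cause analysis.
print(percent_reduction(40, 3))  # 92.5
```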

Read the full case study, Twilio Improves Mean Time To Resolution (MTTR) by 92% with LightStep [x]PM, to get all of the details about Twilio’s success.