Commiserating About Microservices Nightmares


Organizations adopt microservices to scale engineering productivity, and because they dream that things will become easier, that they’ll be able to move faster, and that their software will become more reliable. However, this dream often devolves into an elaborate, sprawling, and persistent nightmare of complexity and confusion. Software inevitably breaks, and when that breakage is difficult to observe or explain, it’s particularly painful. How do things go wrong in production, and what lessons can we learn from those experiences?

Those are the questions we’ve been asking ourselves, our customers, and leaders in the industry that we know and respect. We held our first “Evening of Microservices Nightmares” event at our new office last June. We had lively presentations by Bryan Cantrill, Vijay Gill and Marius Eriksen. Their presentations struck a chord with attendees. We know they did with us.

Learning and sharing

Learning and sharing are core to our values at LightStep. We wanted to start a dialogue about “microservices nightmares” and lessons learned. We believe there just isn’t enough open discussion in the industry, and there is mounting anxiety around potential pitfalls. There is plenty of hype about upside, but not enough clear, specific documentation of failure modes and possible remediations. Too many engineering teams in too many companies are struggling with the same issues: we’d like to have more forums where practitioners can ask questions and learn how things are or should be done with microservices.

Upcoming events

We’re excited to host our next “Evening of Microservices Nightmares” event on Thursday, September 20 in Austin. We’ve got engineering leaders from Under Armour, BigCommerce, and Indeed lined up to present lightning talks about what they’ve learned along their path to adopting microservices.

If you’re in the Austin area, check out the details and join us! It’s sure to be entertaining, and of course there’s an opportunity to commiserate over pizza and beer.

We’ll also be co-hosting an event on September 26 in San Francisco with Redpoint. Ben Sigelman, LightStep co-founder and CEO and former Googler, will be presenting “What we got wrong: Lessons from the birth of microservices at Google.” Google deserves a lot of credit for imagining (and popularizing) what we now call “microservice architectures,” but hindsight is 20/20. Many of the mistakes that were made at Google are being recreated by the rest of the industry today. Ben will talk about what they got wrong at Google and how those lessons can be applied today. We hope many of you in the San Francisco Bay Area will be able to join us.


Event Details

Name: What We Got Wrong – Lessons from the Birth of Microservices at Google
Description: Google deserves a lot of credit for imagining (and popularizing) what we now call microservice architectures. That said, hindsight is 20/20, and many of the mistakes we made at Google are being recreated by the rest of the industry today. What did we get wrong about microservices at Google, and how can we apply those lessons today?
Date: 9/26/2018, 5:30-7:30 pm
Speaker: Ben Sigelman
Location:
Redpoint’s South of Market Office
21 South Park Street
San Francisco, CA 94107

Tech Field Day at LightStep: A Blogger’s Paradise

Last week, we participated in our first Tech Field Day. We had a great time hosting the esteemed bloggers from literally all over the world. We shared the LightStep story with them and all of the people who watched the livestream. It was truly an opportunity to geek out and show them why we’re so passionate about application performance management for today’s complex distributed systems from monolith to microservices and serverless.

After we finished giving them a tour of our amazing office and highlighting all of our donut-themed conference rooms, we got down to business. First up was Ben Sigelman, CEO and Co-founder at LightStep, who shared his thoughts on why companies are moving away from monoliths to microservices and discussed the complexities that arise with this transition. He’s been thinking about these complexities since his days at Google, where he focused on Google’s large-scale distributed monitoring solutions, including Dapper (a large-scale distributed tracing system) and Monarch.

Ben Sigelman explaining why companies move to microservices

People are moving to microservices because of the promise of increased speed, agility, and scalability. Plus, microservices solve a number of managerial challenges. With microservices architectures, services and the teams that support them are independent actors who are managed on their own and can break free of the logjam created by monolithic architectures and organizational structures. Once you have 100 developers, or even potentially 50, the move away from the monolith to microservices is inevitable. Human communication patterns and Conway’s Law have proven that large teams are extremely inefficient, and organizations end up shipping their org chart. Vijay Gill recently shared in a guest blog post that shipping the org chart is the only good reason to adopt microservices.

Ben went on to explain that while microservices solve managerial issues, they also introduce new dependencies and complexities. The reality of microservices is that people often feel a little out of control and have poor visibility into what’s happening in the system. Application performance management tooling needs to adapt to deal with that complexity, which is precisely why LightStep [x]PM and OpenTracing, the distributed tracing standard, were created.

Next up was Dennis Chu, our head of Product Marketing at LightStep. He gave a great demo of LightStep [x]PM, which customers use to diagnose problems and pinpoint the root cause of application performance issues in their distributed systems and microservices environments. Dennis showed how customers use distributed traces from within [x]PM to jump directly to the critical path of a transaction and identify the function or microservice that’s causing the latency. He also demonstrated how [x]PM can move beyond simple p99 latency metrics and use histograms to understand the shape of the latency distribution and get better insight into what’s normal and what might be an anomaly. [x]PM doesn’t do any upfront sampling, so teams never miss an outlier or anomalous transaction. Dennis demonstrated that with no limits on cardinality, teams can monitor what matters most to them. Ben chimed in to show how our approach is completely different from the traditional APM way of monitoring systems. LightStep can use any correlation ID to stitch together the flow of the transaction. This enables [x]PM users to isolate the performance of their biggest customers or a specific release or project.
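To make the point about p99 versus histograms a little more concrete, here is a minimal Python sketch. It is not [x]PM code: the latency samples, the bucket bounds, and the function names are all made up for illustration. It shows two hypothetical services that report the same p99 even though their latency distributions have very different shapes.

```python
import math
from collections import Counter

def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def histogram(samples, bounds):
    """Count samples per bucket; each bucket is labelled by its upper bound in ms."""
    counts = Counter()
    for value in samples:
        label = next((f"<={b}ms" for b in bounds if value <= b), f">{bounds[-1]}ms")
        counts[label] += 1
    return dict(counts)

# Hypothetical latencies (ms): same p99, different shapes.
steady = [20] * 980 + [900] * 20
bimodal = [20] * 600 + [400] * 380 + [900] * 20

BOUNDS = [50, 500, 1000]  # assumed bucket upper bounds
steady_p99 = percentile(steady, 99)
bimodal_p99 = percentile(bimodal, 99)
steady_hist = histogram(steady, BOUNDS)
bimodal_hist = histogram(bimodal, BOUNDS)

print(steady_p99, bimodal_p99)
print(steady_hist)
print(bimodal_hist)
```

In this toy data, the p99 alone cannot distinguish the two services, while the histogram shows that more than a third of the second service’s requests land in a slower middle bucket.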

Dennis Chu is about to start the demo

Spoons, CTO and Co-founder at LightStep, then explained how the [x]PM magic happens. The unique [x]PM satellite architecture enables customers to send 100x more data than they send to any other monitoring solution. He explained what a distributed system is, and joked that distributed is a term computer scientists put in front of things to mean “but harder”.

Spoons explaining the [x]PM architecture

[x]PM has two primary components: Satellites and the LightStep engine (SaaS). The Satellites run within the customer’s environment or network region. They observe all transactions and communicate continuously with the LightStep engine (SaaS). They can scrub PII or other data before it exits the customer’s environment. The LightStep engine (SaaS) uses statistical models and customer configuration to identify and record detailed time-series data and the end-to-end traces that will be most valuable to the customer.
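As a rough illustration of the scrubbing step, the sketch below shows how a collector running inside a customer’s environment might redact data before anything is forwarded. This is not the Satellite implementation or its configuration format; the span layout, the tag names, the email pattern, and the `forward` helper are assumptions made for the example.

```python
import re

# Hypothetical rules: tag keys to redact outright, and a pattern to mask in free text.
SENSITIVE_KEYS = {"user.email", "card.number"}
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def scrub_span(span):
    """Return a copy of the span with sensitive tags redacted and emails masked."""
    clean_tags = {}
    for key, value in span["tags"].items():
        if key in SENSITIVE_KEYS:
            clean_tags[key] = "[REDACTED]"
        elif isinstance(value, str):
            clean_tags[key] = EMAIL_PATTERN.sub("[EMAIL]", value)
        else:
            clean_tags[key] = value
    return {**span, "tags": clean_tags}

def forward(spans, send):
    """Scrub every span locally, then hand it to `send` (a stand-in for the upload step)."""
    for span in spans:
        send(scrub_span(span))

outbox = []  # stands in for whatever leaves the customer's network
forward(
    [{"operation": "checkout", "duration_ms": 182,
      "tags": {"user.email": "pat@example.com",
               "note": "receipt sent to pat@example.com",
               "retries": 2}}],
    outbox.append,
)
print(outbox[0])
```

The design point is simply that redaction happens before the data crosses the network boundary, so the original values never leave the environment where they were produced.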

OpenTracing is one way to get data into the system, but people can use any correlation ID, including home-grown tracing solutions, logs, etc. There’s also a great blog post that goes into more depth on the [x]PM architecture.
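The idea of stitching a transaction together from any correlation ID can be sketched in a few lines of Python. This is only an illustration under assumed inputs, not how [x]PM does it: the `request_id` field, the record layout, and the `stitch` function are hypothetical.

```python
from collections import defaultdict

def stitch(records, id_field="request_id"):
    """Group records by correlation ID and order each group by start time."""
    flows = defaultdict(list)
    for record in records:
        correlation_id = record.get(id_field)
        if correlation_id is not None:  # records without an ID cannot be joined
            flows[correlation_id].append(record)
    return {cid: sorted(group, key=lambda r: r["start_ms"])
            for cid, group in flows.items()}

# Hypothetical records from different sources that only share a request ID.
records = [
    {"request_id": "r-1", "service": "payments", "start_ms": 40, "source": "log"},
    {"request_id": "r-1", "service": "gateway", "start_ms": 0, "source": "tracer"},
    {"request_id": "r-2", "service": "gateway", "start_ms": 5, "source": "tracer"},
    {"request_id": "r-1", "service": "orders", "start_ms": 15, "source": "homegrown"},
    {"service": "cron", "start_ms": 7, "source": "log"},
]

flows = stitch(records)
path_r1 = [r["service"] for r in flows["r-1"]]
print(path_r1)
```

Because the join key is just a field name, the same grouping works whether the records came from a tracer, a log line, or a home-grown system, as long as they carry the same ID.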

Ben then brought us home with some closing commentary on why we need a new rubric for observability. The conventional wisdom has been that metrics, logging, and tracing are “the three pillars” of observability, yet organizations check these boxes and still find themselves grasping at straws during emergencies. The problem is that metrics, logs, and traces are just data – if what we need is a car, all we’re talking about is the fuel. We’ll continue to disappoint ourselves until we reframe observability.

Watch the Tech Field Day videos to get lots more info and hear the Q&A with the bloggers.

Microservices Lead the New Class of Performance Management Solutions

Microservices have become mainstream and are disrupting the operational landscape

There is a fundamental shift taking place in the market today with the massive adoption of microservices. In greater numbers than ever, organizations are decomposing their monolithic applications into microservices and building new applications as microservices from the start. This movement is disrupting the operational landscape and breaking the traditional APM model. As Gartner stated in a recent report, most APM solutions are ill-suited to the dynamism, modularity and scale of microservice-based applications.

The back story

We wanted to better understand if the trends we’re seeing around the rise of microservices and the pain associated with traditional APM tools were specific to only the early adopters in the market or if there was a massive shift underway. To help answer that question, we surveyed hundreds of companies about their microservices adoption plans. The results of that formal survey are detailed in the Global Microservices Trends report. The survey was conducted by Dimensional Research in April 2018 and sponsored by LightStep, and it included a total of 353 development professionals across the U.S.; Europe, the Middle East, and Africa (EMEA); and Asia. Company size ranged from 500 employees to more than 5,000.

Microservices bring new operational challenges

Almost all survey respondents, 99 percent, report challenges in monitoring microservices. And each additional microservice increases operational challenges, according to 56 percent of respondents. One of the key architectural differences in a microservices environment is that transactions are processed through heavy use of cross-service API calls, which has caused an exponential increase in data volume. Of those using microservices in production, 87 percent report that they generate more application data.

Global Microservices Trends Report 2018 - Challenges

Microservice performance management is critical to success

Among those that have microservices in production, 73 percent report it is actually more difficult to troubleshoot in this environment than it is with a traditional monolithic application. Of the users who have trouble identifying the root cause of performance issues in microservices environments, 98 percent report that it has a direct business impact, with 76 percent of those reporting that it takes longer to resolve issues.

Investments are increasing in performance management for microservices

Performance management for microservices will be a big area of investment in the coming year, with most respondents (74 percent) reporting that they will increase their investment. The money to fund these purchases will frequently come from existing expenditures for performance management of monolithic applications, because about a third (30 percent) will be decreasing their investments in those types of solutions in the coming year.

Global Microservices Trends Report 2018 - Investment

Record growth in microservices

According to the survey results, 92 percent of respondents said they increased their number of microservices in the last year, and 92 percent expect to grow their use of microservices in the coming year. Agility (82 percent) and scalability (78 percent) were the top motivators for microservice adoption.

Microservices are widely used today

Microservices have become ubiquitous among enterprise development teams. About 9 in 10 are currently using or have plans to use microservices. For well over half, 60 percent, adoption is already advanced. 86 percent expect microservices to be the default architecture within five years.

These findings in the Global Microservices Trends report underscore the need for a new class of performance management solutions. The days of monolithic applications and traditional APM tools are quickly becoming a relic of the past. LightStep [x]PM is specifically designed to address the scale and complexity of modern applications.

Read the complete report and contact us, so we can help you manage the performance of your microservices.

DigitalOcean Uses LightStep [x]PM to Get a Reliable Picture of its Distributed System and Saves 1,000 Hours of Developer Time per Month

LightStep [x]PM enables customers, including DigitalOcean, to diagnose performance problems across service boundaries and identify the teams that can fix them using end-to-end traces. DigitalOcean uses [x]PM to monitor 100+ apps in real time across its distributed system. [x]PM also helps engineers work together and improve productivity, saving 1,000 hours per month of developer time.

Challenge: performing root cause analysis in a distributed system

As its software team was growing quickly, DigitalOcean wanted to improve the way it responded to errors and performance degradations. The company needed a source of truth to see a complete, reliable picture of the system in real time that would help them all have the same baseline information. Teams were shipping features efficiently, but communication across different engineering teams had suffered. Because it was so difficult to pinpoint the exact origins of a performance problem, it was also difficult to determine the right person to address the problem. Teams had logs at their disposal, but correlating events in log data was like looking for a needle in a haystack, wasting countless developer hours per week. According to Dave Smith, Sr. Director of Engineering at DigitalOcean: “In our increasingly complex environment, it was impossible for a single person to understand the entire system. Root cause analysis was becoming difficult, and we couldn’t find an application performance monitoring system robust enough to work with our heterogeneity.”

Find the root cause and assign the right team to fix it quickly

[x]PM was able to fit into DigitalOcean’s complex ecosystem, and now gives the engineers a real-time view of the entire system. More than 100 apps are being monitored using [x]PM, and the organization is using the results to promote intra-company accountability and visibility. They also have 144 dashboards, visible company-wide, that help each team understand their services’ performance and see how it relates to the services owned by other teams.

Customize dashboards to measure application performance along any dimension: by team ownership, customer transactions, or even individual services.

[x]PM has also changed how teams collaborate on root cause analysis. Prior to using [x]PM, logs were one of the main ways to drill into issues and identify a root cause. It involved digging through multiple databases and external services to identify the problem, followed by a lengthy search through logs to find the cause. Identifying the responsible team to fix the issue was an additional challenge before final remediation. Using [x]PM’s end-to-end traces, alongside customizable dashboards and alerts, this process was cut down to 2-3 steps that can be completed in less than 15 minutes. [x]PM breaks down a performance issue into detailed traces, which connect the dots and explicitly highlight the root cause. This process makes it easy to identify the team that can mitigate the issue, even when it crosses team and service boundaries. “[x]PM scales beautifully with our business and our use cases. We’re very pleased with our decision to standardize on it for application performance management,” said Smith.

Read the full case study, DigitalOcean Uses LightStep [x]PM as a Source of Truth for its Distributed System, Saving 1000 Hours of Developer Time per Month, to get more information about DigitalOcean’s success.

LightStep [x]PM Enables Lyft’s Move to Microservices

LightStep [x]PM allows customers, such as Lyft, to analyze every transaction at scale. Lyft uses [x]PM to observe every request, in real time, across web, mobile, microservices, and monoliths – ensuring end-to-end performance management across all of its systems. Pete Morelli, VP of Engineering at Lyft, says: “LightStep is the future of monitoring and was instrumental in our move to microservices.”

Challenge: moving to microservices to manage growth, reduce costs, and improve product efficiency

In order to rapidly scale its system and support growth, Lyft started to explore moving from a monolithic architecture to microservices. Today, Lyft deploys more than 200 microservices in its distributed architecture, and this number is growing. These services work together to perform fundamental functions of the Lyft app, including matching riders with drivers, optimizing the route for the most efficient ride, and processing riders’ payment information. It’s a challenge to quickly and accurately monitor Lyft’s system as the number of microservices grows, because a distributed architecture generates exponentially more data than its monolithic predecessor. To gain insight into this detailed level of performance as efficiently as possible, Lyft chose to implement [x]PM.

Diagnose anything by seeing everything

[x]PM is the only solution that monitors 100% of unsampled transaction data and is always-on in production environments, with negligible overhead. With its unique architecture, [x]PM can capture a near-limitless amount of data and weave distributed trace data together into meaningful point-in-time stories about the application – even if the data was produced asynchronously or across distinct service boundaries. [x]PM considers every operation and intelligently assembles traces automatically for interesting events like errors or latency spikes, as well as traces representative of normal operating behavior. Once assembled, these traces are stored indefinitely and can be reviewed at any time. By considering all of an application’s transactional data, [x]PM reliably detects one-in-a-million anomalies, unlike any other technology, and shows everything that happens both upstream and downstream from the event. Lyft’s systems generate more than 100 billion microservice calls per day. As Morelli stated, “With [x]PM, there is no risk of overlooking any problems at the edges where the biggest problems are found.”
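The difference between upfront sampling and deciding what to keep after every operation has been considered can be sketched as follows. This is a simplified illustration, not [x]PM’s actual selection logic, which the text describes only at a high level. The thresholds, the span fields, and the `select_traces` function are assumptions.

```python
import random
from collections import defaultdict

def select_traces(spans, latency_threshold_ms=500, baseline_rate=0.01, rng=None):
    """Look at every span first, then keep traces with errors or slow spans,
    plus a random baseline of normal traces."""
    rng = rng or random.Random()
    traces = defaultdict(list)
    for span in spans:
        traces[span["trace_id"]].append(span)

    kept = {}
    for trace_id, trace in traces.items():
        has_error = any(s.get("error") for s in trace)
        is_slow = max(s["duration_ms"] for s in trace) >= latency_threshold_ms
        if has_error:
            kept[trace_id] = "error"
        elif is_slow:
            kept[trace_id] = "slow"
        elif rng.random() < baseline_rate:
            kept[trace_id] = "baseline"  # representative of normal behavior
    return kept

# Hypothetical spans from three transactions.
spans = [
    {"trace_id": "t1", "duration_ms": 30},
    {"trace_id": "t1", "duration_ms": 45},
    {"trace_id": "t2", "duration_ms": 25, "error": True},
    {"trace_id": "t3", "duration_ms": 40},
    {"trace_id": "t3", "duration_ms": 1200},
]

interesting_only = select_traces(spans, baseline_rate=0.0)
with_baseline = select_traces(spans, baseline_rate=1.0)
print(interesting_only)
print(with_baseline)
```

An upfront sampler would have to decide whether to record `t2` and `t3` before knowing that one would fail and the other would be slow. Here the decision is made only after the outcome is visible.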

View detailed end-to-end traces for complex, distributed transactions and make better critical-path optimizations

Monitoring and application performance insights from [x]PM also empower engineers to make many critical-path optimizations that improve ride request times, increase dispatch efficiency, and ensure effective incident postmortems – all of which translate into increased revenue and developer efficiency. According to engineers Roy Williams and Danial Afzal, one of the first projects where they used [x]PM was a spring cleaning of the entire system. The focus was on identifying and optimizing critical paths for dispatch services that connect riders with drivers. Lyft was able to improve the efficiency of customer ride routes and accelerate response times by 60% (250 milliseconds). Saving time is a key goal, explained Williams: “The more time we get, the more efficient we can be. If we can use those extra milliseconds to find a more efficient match, that’s a win for us, that’s a win for our customers.”

Read the case study, LightStep [x]PM Enables Lyft’s Move to Microservices, Helping Drive Significant Revenue and Improving Product Efficiency, and get all of the details about Lyft’s success.

Yext Moves to LightStep [x]PM, Reducing Time Spent on Root Cause Analysis

LightStep [x]PM enables customers, including Yext, to diagnose anything across all components of their software systems. Yext uses [x]PM to identify latency and error issues and then remediate quickly and efficiently. Overall, Yext believes LightStep is saving its team nearly one week a month of staff time in problem diagnostics.

Challenge: making root cause analysis more efficient

Microservices allow a distributed system to evolve as a set of independent parts, but a complete, global view of the system is still a requirement to effectively manage a production system. Yext found that the traditional APM solutions could not provide that full visibility. The team wanted a solution that could trace a transaction end-to-end and be always-on in production, so they could understand the root cause and resolve it quickly. “As the other solutions just record a sample of transactions and traces, we were only getting a partial picture,” said Rob Figueiredo, VP of Engineering at Yext. “In addition, finding these partial records in the system, so we could act on them was extremely difficult.”

Pinpoint the root cause with LightStep [x]PM

[x]PM automatically finds and connects related trace data into meaningful stories about an application’s performance, even if the data was produced asynchronously and across distinct service boundaries. [x]PM pinpoints actual instances of notable events such as latency issues and errors. By analyzing 100% of diagnostic data, [x]PM reliably detects one-in-a-million anomalies. It shows everything that happens both upstream and downstream from these issues, including critical paths and logs, providing complete root cause analysis.

View end-to-end traces from frontend clients down to backend services, and jump straight to errors and bottlenecks

Yext is realizing substantial returns from its deployment of [x]PM, including improved efficiencies for the engineering team. Prior to the deployment of [x]PM, weekly production engineering meetings were time-consuming and inefficient. “We spent a lot of time on detective work, trying to perform root cause analysis on application performance from the prior week,” Figueiredo recalled. “Not only did these problems fester, but we expended a lot of valuable time trying to ascertain the root cause of problems. Our production engineering meetings are now embarrassingly easy. [x]PM is so simple; it tells us about performance problems within seconds, and we’re able to go straight to the root cause in the production system.”

Read the full case study, Yext Moves to LightStep [x]PM, Improving Application Performance, to get more information about Yext’s success.

Twilio Delivers Proactive Account-Level Performance Management for Premium Clients with LightStep [x]PM

LightStep [x]PM cuts through the scale and complexity in today’s software systems to help organizations, like Twilio, tie system performance to business objectives. [x]PM has allowed Twilio teams to track individual customers and transactions, so they can deliver premium service and remediate problems proactively.

Challenge: measuring performance and mitigating issues for specific, strategic customers

The Twilio team’s vision was to deliver a flawless experience to the 1.6 million developers registered on the Twilio platform, as well as provide additional service and insights for premium clients. They realized that quickly detecting and mitigating performance issues affecting their premium customers wasn’t going to work with traditional practices. Setting up Service Level Agreement (SLA) alerts on a customer-by-customer basis, using approaches like log data alerting, was cost-prohibitive due to the dramatic increase in logging data volumes associated with Twilio’s transition to microservices. “We’re in the business of selling trust, and the ability to identify and resolve system issues quickly and efficiently is a huge priority for us,” said Jason Hudak, Vice President of Platform Engineering at Twilio.

Keeping top customers happy with LightStep [x]PM

Twilio integrated [x]PM into existing workflows and set up dashboards and alerts which allowed the teams to track individual customers and their transactions. “[x]PM enables us to look at data at the account level and deliver alerts to account managers and customer success managers in real time when potential latency issues arise. We can proactively work with customers to deliver premium service and remediate problems before they impact business operations,” said Hudak.

Measure performance where it matters most

[x]PM enables enterprises to pursue any business objective that depends on fast, reliable software. Just like Twilio, companies can accurately monitor latency, operation rates, and errors, and also view detailed trace information for every transaction throughout their applications, all broken down by individual customers and in real time. [x]PM also provides the flexibility to create account-specific alerts for these metrics, so key team members are notified when acceptable thresholds are exceeded. The evaluation window for these alerts is customizable from minutes to even weeks, so users can both detect immediate issues before they escalate and also monitor app performance for sustained customer trends that may have serious business consequences.
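For readers who want a feel for account-specific alerting with a configurable evaluation window, here is a small Python sketch. It is not the [x]PM alerting API: the sample format, the per-account thresholds, and the `evaluate_alerts` function are hypothetical, and the mean latency is just one of several metrics such an alert could evaluate.

```python
from statistics import mean

def evaluate_alerts(samples, thresholds_ms, window_s, now_s):
    """Return accounts whose mean latency inside the evaluation window
    exceeds that account's threshold."""
    by_account = {}
    for s in samples:
        if now_s - window_s <= s["ts_s"] <= now_s:
            by_account.setdefault(s["account"], []).append(s["latency_ms"])
    return {
        account: mean(latencies)
        for account, latencies in by_account.items()
        if account in thresholds_ms and mean(latencies) > thresholds_ms[account]
    }

# Hypothetical per-transaction latency samples, tagged with the customer account.
samples = [
    {"account": "acme", "ts_s": 100, "latency_ms": 900},
    {"account": "acme", "ts_s": 940, "latency_ms": 300},
    {"account": "acme", "ts_s": 990, "latency_ms": 500},
    {"account": "globex", "ts_s": 950, "latency_ms": 120},
    {"account": "globex", "ts_s": 995, "latency_ms": 80},
]
thresholds_ms = {"acme": 350, "globex": 200}  # assumed per-account thresholds

short_window = evaluate_alerts(samples, thresholds_ms, window_s=120, now_s=1000)
long_window = evaluate_alerts(samples, thresholds_ms, window_s=1000, now_s=1000)
print(short_window)
print(long_window)
```

A short window reacts to what is happening right now, while a longer window also picks up older samples and is better suited to spotting sustained trends for a given account.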

Monitor account-specific performance metrics and create custom alerts for them

[x]PM helped Twilio redefine personalized customer support by introducing SLA alerting for individual accounts. Hudak said, “We initially started with one of our enterprise accounts. The test proved successful when our support engineer received an alert and proactively reached out to remediate the issue with the customer. Had we not done so, the problem could have escalated into a service-impacting event.”

Twilio uses [x]PM to monitor and manage the transaction performance data for each individual customer and to set up alerts for the appropriate member of the account team. Lower latency and less downtime translate into quantifiable business metrics. However, maintaining and improving customer confidence, especially for premium clients, is an even bigger win. “Operational excellence builds customer confidence. If our systems go down, we lose trust. [x]PM helps us to reduce latency and build trust by delivering on our mission to provide the features customers want and the quality they deserve,” said Hudak.

Read the full case study, Twilio Improves Mean Time To Resolution (MTTR) by 92% with LightStep [x]PM, to get all of the details about Twilio’s success.

Twilio Improves Mean Time To Resolution (MTTR) by 92% with LightStep [x]PM

LightStep [x]PM has helped some of the world’s most innovative companies, including Twilio, monitor what matters most and diagnose anomalies within seconds. LightStep enables companies to pinpoint the root cause of issues quickly, and Twilio used [x]PM to improve mean time to resolution (MTTR) by 92%.

Challenge: reducing time to detect and remediate issues

When we first talked to the team at Twilio, they said they wanted to be able to identify traces of specific, noteworthy events, but traditional approaches – like centralized logging – were “simply not the right solution. Logging solutions can provide information about who, what, and where things happened, but LightStep [x]PM answers why things happened and helps us do root cause analysis very quickly,” said Jason Hudak, VP of Platform Engineering at Twilio.

LightStep [x]PM satellite architecture yields targeted insights

[x]PM is built on LightStep’s cutting-edge Satellite Architecture which distributes data collection and statistical analyses, yielding targeted insights from anywhere within today’s software systems. To help customers reduce MTTR, [x]PM delivers prompt, content-rich alerts and provides real-time traces that give visibility into exactly how separate services and parts of an application interact with each other.

Root causes for anomalous latency spikes or errors are often buried in some backend service, making them extremely difficult to uncover. [x]PM lets users easily drill down and examine the complex service interactions for very large traces across arbitrary time ranges and for any latency band to diagnose those issues. [x]PM further analyzes these services within the context of one another for every trace to help users quickly determine the critical path, and it presents log information and payloads inline for each transaction of interest. These capabilities enable customers like Twilio to visualize, identify, and resolve issues faster.
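One simple way to think about the critical path of a trace is to start at the root span and, at each level, follow the child that finishes last. The Python sketch below implements that simplification. It is an illustration only, not [x]PM’s analysis, and the span fields and service names are invented.

```python
def critical_path(spans):
    """From the root span, repeatedly follow the child that finishes last."""
    children = {}
    root = None
    for span in spans:
        if span["parent"] is None:
            root = span
        else:
            children.setdefault(span["parent"], []).append(span)

    path = []
    current = root
    while current is not None:
        path.append(current["name"])
        kids = children.get(current["id"], [])
        current = max(kids, key=lambda s: s["end_ms"]) if kids else None
    return path

# Hypothetical trace: a gateway call fans out to auth and orders,
# and orders calls a database and a pricing service.
trace = [
    {"id": 1, "parent": None, "name": "api-gateway", "start_ms": 0, "end_ms": 480},
    {"id": 2, "parent": 1, "name": "auth", "start_ms": 5, "end_ms": 40},
    {"id": 3, "parent": 1, "name": "orders", "start_ms": 45, "end_ms": 470},
    {"id": 4, "parent": 3, "name": "inventory-db", "start_ms": 50, "end_ms": 120},
    {"id": 5, "parent": 3, "name": "pricing", "start_ms": 60, "end_ms": 460},
]

path = critical_path(trace)
print(path)
```

In this made-up trace the slow backend is `pricing`, two hops away from the service where the latency was observed, which is the kind of buried cause the paragraph above describes.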

Visualize, identify, and resolve latency spikes and errors faster with LightStep [x]PM

[x]PM has demystified root cause analysis at Twilio. As Hudak said, “With [x]PM, our ability to detect and remediate issues has dramatically improved. When we go through exercises to test the system, root cause analysis for many complex failures has been reduced from an average of 40 minutes to less than three minutes with [x]PM. This saves our engineering team nearly 20 hours each week.”

Read the full case study, Twilio Improves Mean Time To Resolution (MTTR) by 92% with LightStep [x]PM, to get all of the details about Twilio’s success.