Monitorama 2018 – Metrics, Serverless, Tracing, and Socks

As a former Portlander, I found Monitorama a great reason to return to the Silicon Forest while summer keeps the infamous rains at bay. It was also an excellent opportunity to learn about the latest technologies and techniques in monitoring and observing large-scale, distributed systems from industry experts and practitioners.

This year the conference organizers implemented a few scheduling changes intended to support more "hallway track" – speakers were aligned to a single track, and breaks were longer. This absolutely helped sustain a high level of energy throughout the three-day event.

Lots of conversations took place in the hallway track

An obvious trend at the conference was the prevalence of foot-encasing swag. Socks were available from a variety of vendors, including LightStep, with Monitorama-branded socks available for each attendee.

It's always important to make a statement with your first speaker, and this year's last-minute substitution generated a lot of buzz. Logan McDonald of BuzzFeed gave a talk, Optimizing for Learning, that blended behavioral research with practical techniques and, judging from my conversations throughout the three days, resonated with many attendees. It was the first of several talks that touched on how to sustainably build, grow, and integrate teams at fast-paced technology companies, including talks by Kishore Jalleda (Microsoft), Zach Musgrave and Angelo Licastro (Yelp), and Aditya Mukerjee (Stripe).

As expected, serverless was the subject of talks and open discussion, displacing the last few years of container domination (although containers were still represented by Allan Espinosa from Bloomberg). Serverless was featured alongside a plethora of cat puns in Pam Selle's talk, and in Yan Cui's walkthrough of the ideal serverless monitoring system he wanted to use – in effect, a feature request submitted to every vendor simultaneously.

Metrics continued to be very important, but tracing made its mark with a strong showing from vendors (Sematext announced support for OpenTracing), speakers (OpenTracing's Ted Young and his lightning talk), and strong interest from attendees in Wednesday's Tracing breakfast. Some thirty folks braved the early-morning pastries of Cafe Umbria, following the late night of Tuesday's vendor parties. Many attendees were just beginning their journey into microservices, considering OpenZipkin and Jaeger, while others were on the hunt for anecdotes about a vendor that would meet their needs as microservices and serverless continue to increase the observability complexity of their environments.

Lively discussion at the Tracing breakfast

I left Monitorama feeling really energized by the current state of the industry and how our field is quickly becoming a focal point for this latest wave of DevOps innovation and best practices. So many of the observability challenges that were raised and passionately discussed are precisely the ones we're focused on solving at LightStep. If you'd like to learn more about what we do, see how our customers use advanced distributed tracing, built on industry-adopted standards, to maintain the performance and reliability of their modern applications.

Performance is a Shape, Not a Number

This article originally appeared on Medium.

Applications have evolved – again – and it’s time for performance analysis to follow suit

In the last twenty years, the internet applications that improve our lives and drive our economy have become far more powerful. As a necessary side-effect, these applications have become far more complex, and that makes it much harder for us to measure and explain their performance – especially in real-time. Despite that, the way that we both reason about and actually measure performance has barely changed.

I'm not here to argue about the importance of understanding real-time performance in the face of rising complexity – by now, we all realize it's vital – but to argue that we need a better mental model for recognizing and diagnosing anomalies. When assessing "right now," our industry relies almost entirely on averages and percentile estimates: these are not enough to efficiently diagnose performance problems in modern systems. Performance is a shape, not a number, and effective tools and workflows should present and explore that shape, as we illustrate below.

We’ll divide the evolution of application performance measurement into three “phases.” Each phase had its own deployment model, its own predominant software architecture, and its own way of measuring performance. Without further ado, let’s go back to the olden days: before AWS, before the smartphone, and before Facebook (though perhaps not Friendster)…

Watch our tech talk now. Hear Ben Sigelman, LightStep CEO, present the case for unsampled latency histograms as an evolution of and replacement for simple averages and percentile estimates.

Phase 1: Bare Metal and average latency (~2002)

The stack (2002): a monolith running on a hand-patched server with a funny hostname in a datacenter you have to drive to yourself.

If you measured application performance at all in 2002, you probably did it with average request latency. Simple averages work well for simple things: namely, normally distributed things with low variance. They are less appropriate when there's high variance, and they are particularly bad when the sample values are not normally distributed. Unfortunately, latency distributions today are rarely normally distributed, can have high variance, and are often multimodal to boot. (More on that later.)
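To see how an average can mislead, here's a minimal sketch in Python (simulated data, not LightStep measurements): a bimodal latency distribution whose mean lands between the two modes and describes almost no actual request.

```python
import random
import statistics

random.seed(42)

# Simulated latencies (ms): ~95% of requests take a fast path (~5ms),
# ~5% take a slow path (~120ms), e.g. a cache miss.
latencies = [
    random.gauss(5, 1) if random.random() < 0.95 else random.gauss(120, 20)
    for _ in range(100_000)
]

print(f"mean:  {statistics.mean(latencies):6.1f} ms")  # ~10.8 ms
cuts = statistics.quantiles(latencies, n=1000)         # 999 cut points
print(f"p50:   {cuts[499]:6.1f} ms")                   # ~5 ms
print(f"p99.9: {cuts[998]:6.1f} ms")                   # ~160 ms
```

The mean (~10.8ms) sits in a valley of the distribution: almost no request actually takes that long.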

To make this more concrete, here’s a chart of average latency for one of the many microservice handlers in LightStep’s SaaS:

Recent average latency for an important internal microservice API call at LightStep

It holds steady at around 5ms, essentially all of the time. Looks good! 5ms is fast. Unfortunately it’s not so simple: average latency is a poor leading indicator of reliability woes, especially for scaled-out internet applications. We’ll need something better…

Phase 2: Cloud VMs and p99 latency (~2012)

The stack (2012): a monolith running in AWS with a few off-the-shelf services doing special-purpose heavy lifting (Solr, Redis, etc.).

Even if average latency looks good, we still don’t know a thing about the outliers. Per this great Jeff Dean talk, in a microservices world with lots of fanout, an end-to-end transaction is only as fast as its slowest dependency. As our applications transitioned to the cloud, we learned that high-percentile latency was an important leading indicator of systemic performance problems.

Of course, this is even more true today: when ordinary user requests touch dozens or hundreds of service instances, high-percentile latency in backends translates to average-case user-visible latency in frontends.
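The arithmetic behind that claim is worth a quick back-of-the-envelope sketch (assuming, for simplicity, independent backend calls that are each slow with the same probability):

```python
# If a single backend call is slow 1% of the time (p = 0.01), a request
# that fans out to N backends is slow whenever ANY of them is slow:
# P(at least one slow) = 1 - (1 - p) ** N
p = 0.01
for n in (1, 10, 100):
    print(f"fan-out {n:>3}: {1 - (1 - p) ** n:6.1%} of requests hit the tail")
# fan-out   1:   1.0%
# fan-out  10:   9.6%
# fan-out 100:  63.4%
```

With a fan-out of 100, the backend's p99 becomes something like the frontend's median.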

To emphasize the importance of looking (very) far from the mean, let’s look at recent p95 for that nice, flat, 5ms average latency graph from above:

Recent p95 latency for the same important internal microservice API call at LightStep

The p95 latency is higher than the average, of course, but it's still pretty boring. That said, when we plot recent measurements for p99.9, we notice meaningful instability and variance over time:

Recent p99.9 latency for the same microservice API call. Now we see some instability.

Now we’re getting somewhere! With a p99.9 like that, we suspect that the shape of our latency distribution is not a nice, clean bell curve, after all… But what does it look like?

Phase 3: Microservices and detailed latency histograms (2018)

The stack (2018): a few legacy holdovers (monoliths or otherwise) surrounded – and eventually replaced – by a growing constellation of orchestrated microservices.

When we reason about a latency distribution, we’re trying to understand the distinct behaviors of our software application. What is the shape of the distribution? Where are the “bumps” (i.e., the modes of the distribution) and why are they there? Each mode in a latency distribution is a different behavior in the distributed system, and before we can explain these behaviors we must be able to see them.

In order to understand performance "right now", our workflow ought to look like this (step 1 is sketched in code right after the list):

  1. Identify the modes (the “bumps”) in the latency histogram
  2. Triage to determine which modes we care about: consider both their performance (latency) and their prevalence
  3. Explain the behaviors that characterize these high-priority modes
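Step 1 is mechanical enough to sketch. Below is a toy mode detector – simple local maxima over fixed-width bins – where a real tool would smooth the histogram and pick bin widths adaptively:

```python
from collections import Counter

def find_modes(latencies_ms, bin_width=5.0, min_count=10):
    """Bin latencies and return local maxima as candidate modes."""
    counts = Counter(int(x // bin_width) for x in latencies_ms)
    modes = []
    for b in sorted(counts):
        left, right = counts.get(b - 1, 0), counts.get(b + 1, 0)
        if counts[b] >= min_count and counts[b] >= left and counts[b] > right:
            modes.append((b * bin_width, counts[b]))
    return modes  # [(bin_start_ms, count), ...]

# On the simulated bimodal sample above, this reports a mode near 5 ms and
# one near 120 ms (plus possibly some noise; real tools smooth first).
# Triage (step 2) then weighs each mode's latency against its prevalence.
```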

Too often we just panic and start clicking around in hopes that we stumble upon a plausible explanation. Other times we are more disciplined, but our tools only expose bare statistics without context or relevant example transactions.

This article is meant to be about ideas (rather than a particular product), but the only real-world example I can reference is the recently released Live View functionality in LightStep [x]PM. Live View is built around an unsampled, filterable, real-time histogram representation of performance that’s tied directly to distributed tracing for root-cause analysis. To get back to our example, below is the live latency distribution corresponding to the percentile measurements above:

A real-time view of latency for a particular API call in a particular microservice. We can clearly distinguish distinct modes (the "bumps") in the distribution; if we want to restrict our analysis to traces from the slowest mode, we filter interactively.

The histogram makes it easy to identify the distinct modes of behavior (the “bumps” in the histogram) and to triage them. In this situation, we care most about the high-latency outliers on the right side. Compare this data with the simple statistics from “Phase 1” and “Phase 2” where the modes are indecipherable.

Having identified and triaged the modes in our latency distribution, we now need to explain the concerning high-latency behavior. Since [x]PM has access to all (unsampled) trace data, we can isolate and zoom in on any feature regardless of its size. We filter interactively to home in on an explanation: first by restricting to a narrow latency band, and then further by adding key:value tag restrictions. Here we see how the live latency distribution varies from one project_id to the next (project_id being a high-cardinality tag for this dataset):

Given 100% of the (unsampled) data, we can isolate and zoom in on any feature, no matter how small. Here the user restricts the analysis to project_id 22, then project_id 36 (which have completely different performance characteristics). The same can be done for any other tag, even those with high cardinality: experiment ids, release ids, and so on.
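Mechanically, these interactive filters are just predicates over the unsampled trace set. Here is a toy sketch with a hypothetical record shape (not the product's actual API or data model):

```python
# Hypothetical per-trace records: a duration plus key:value tags.
traces = [
    {"duration_ms": 4.8,   "tags": {"project_id": "22"}},
    {"duration_ms": 131.0, "tags": {"project_id": "36"}},
    # ... one record per (unsampled) trace ...
]

def in_band(ts, low_ms, high_ms):
    """Restrict to a latency band, e.g. the slow mode of the histogram."""
    return [t for t in ts if low_ms <= t["duration_ms"] < high_ms]

def with_tag(ts, key, value):
    """Restrict to a key:value tag; works even for high-cardinality tags."""
    return [t for t in ts if t["tags"].get(key) == value]

# Slow mode first, then one tag value:
outliers = with_tag(in_band(traces, 100, 500), "project_id", "36")
```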

Here we are surprised to learn that project_id 36 experienced consistently slower performance than the aggregate. Again: Why? We restrict our view to project_id=36, filter to examine the latency outliers, and open a trace. Since [x]PM can assemble these traces retroactively, we always find an example, even for rare behavior:

To attempt end-to-end root cause analysis, we need end-to-end transaction traces. Here we filter to outliers for project_id 36, choose a trace from a few seconds ago, and realize it took 109ms to acquire a mutex lock: our smoking gun.

The (rare) trace we isolated shows us the smoking gun: that contention around mutex acquisition dominates the critical path (and explains why this particular project — with its own highly-contended mutex — has inferior performance relative to others). Again, compare against a bare percentile: simply measuring p99 latency is a far cry from effective performance analysis.

Stepping back and looking forward…

As practitioners, we must recognize that countless disconnected timeseries statistics are not enough to explain the behavior of modern applications. While p99 latency can still be a useful statistic, the complexity of today’s microservice architectures warrants a richer and more flexible approach. Our tools must identify, triage, and explain latency issues, even as organizations adopt microservices.

If you made it this far, I hope you’ve learned some new ways to think about latency measurements and how they play a part in diagnostic workflows. LightStep continues to invest heavily in this area: to that end, please share your stories and points of view in the comment section, or reach out to me directly (Twitter, Medium, LinkedIn), either to provide feedback or to nudge us in a particular direction. I love to nerd out along these lines and welcome outside perspectives.

Want to work on this with me and my colleagues? It’s fun! LightStep is hiring.

Want to make your own complex software more comprehensible? We can show you exactly how LightStep [x]PM works.

Microservices Lead the New Class of Performance Management Solutions

Microservices have become mainstream and are disrupting the operational landscape

There is a fundamental shift taking place in the market today with the massive adoption of microservices. In greater numbers than ever, organizations are decomposing their monolithic applications into microservices and building new applications as microservices from the start. This movement is disrupting the operational landscape and breaking the traditional APM model. As Gartner stated in a recent report, most APM solutions are ill-suited to the dynamism, modularity and scale of microservice-based applications.

The back story

We wanted to understand whether the trends we're seeing – the rise of microservices and the pain associated with traditional APM tools – were limited to early adopters or signaled a broader shift in the market. To help answer that question, we surveyed hundreds of companies about their microservices adoption plans. The results of that formal survey are detailed in the Global Microservices Trends report. The survey was conducted by Dimensional Research in April 2018 and sponsored by LightStep, and it included a total of 353 development professionals across the U.S., Europe, the Middle East and Africa (EMEA), and Asia. Company size ranged from 500 employees to more than 5,000.

Microservices bring new operational challenges

Almost all survey respondents – 99 percent – report challenges in monitoring microservices, and each additional microservice increases operational challenges, according to 56 percent of respondents. One of the key architectural differences in a microservices environment is that transactions are processed through heavy use of cross-service API calls, which drives an exponential increase in data volume: 87 percent of those using microservices in production report that they generate more application data.

Global Microservices Trends Report 2018 – Challenges

Microservice performance management is critical to success

Among those that have microservices in production, 73 percent report that it is actually more difficult to troubleshoot in this environment than with a traditional monolithic application. Of users that have trouble identifying the root cause of performance issues in microservices environments, 98 percent report a direct business impact, with 76 percent of those reporting that it takes longer to resolve issues.

Investments are increasing in performance management for microservices

Performance management for microservices will be a big area of investment in the coming year, with most respondents (74 percent) reporting that they will increase their investment. The money to fund these purchases will frequently come from existing expenditures on performance management for monolithic applications: about a third of respondents (30 percent) will decrease their investment in those solutions in the coming year.

Global Microservices Trends Report 2018 – Investment

Record growth in microservices

According to the survey results, 92 percent of respondents said they increased their number of microservices in the last year, and 92 percent expect to grow their use of microservices in the coming year. Agility (82 percent) and scalability (78 percent) were the top motivators for microservice adoption.

Microservices are widely used today

Microservices have become ubiquitous among enterprise development teams. About 9 in 10 are currently using or have plans to use microservices, and for well over half (60 percent), adoption is already advanced. A full 86 percent expect microservices to be the default architecture within five years.

These findings in the Global Microservices Trends report underscore the need for a new class of performance management solutions. Monolithic applications and traditional APM tools are quickly becoming relics of the past. LightStep [x]PM is specifically designed to address the scale and complexity of modern applications.

Read the complete report and contact us, so we can help you manage the performance of your microservices.

KubeCon 2017: The Application Layer Strikes Back

You know it’s a special event when it snows in Texas.

Several of my delightful colleagues and I just returned from a remarkably chilly – and remarkably memorable – trip to Austin for KubeCon+CloudNativeCon Americas. We went because we were excited to talk shop about the future of microservices with 4,500 others involved with the larger cloud-native ecosystem. We had high hopes for the conference, as you won’t find a higher-density group of attendees when it comes to strategic, forward-thinking infrastructure people; yet even our lofty expectations were outdone by the buzz and momentum on display at the event.

On Wednesday at 7:55am, I emerged from my hotel room and had the good fortune of running into the inimitable Kelsey Hightower on my way to the elevator. I never miss an opportunity to learn something from Kelsey, so I asked him what was new and special in k8s-land these days. His response, paraphrased, was that “the big feature news is that – finally – we don’t have a big new feature in Kubernetes.” He went on to explain that this newfound stability at the infrastructural layer is a huge milestone for the k8s movement and opens the door to innovation above and around Kubernetes proper.

From an ecosystem standpoint, I was also lucky to speak with Chen Goldberg as part of a dinner that IBM organized. It was fascinating to hear how she and her team have architected the boundaries of Kubernetes to optimize for community. The project nails down the parts of the system that require standardization, while carving out white-space for projects and vendors to innovate around those core primitives.

This Kubernetes technology and project vision, along with its API stability, have led us to the present reality: Kubernetes has won when it comes to container orchestration and scheduling. That was not clear last year and was very far from clear two or three years ago, but with even the likes of AWS going all-in on Kubernetes, we now have OSS developers, startup vendors, and all of the big cloud providers bought in on the platform. So now everyone and their dog are going to become a Kubernetes expert, right?

Not really. It’s even better than that: our industry is evolving towards a reality where everyone and their dog are going to depend on Kubernetes and containers, yet nobody will need to care about Kubernetes and containers. This is a huge and much-needed transformation, and reminiscent of how microservice development looked within Google: every service did indeed run in a container which was managed by an orchestration and scheduling system (internally code-named “Borg”), but developers didn’t know or care how the containers were built, nor did they need to know or care how Borg worked.

So what will devs and devops care about? They will care about application-layer primitives, and those primitives are what KubeCon + CloudNativeCon was about this year. As I mentioned in my keynote on Wednesday, this means that devs and devops will be able to take advantage of CNCF technologies like service mesh (e.g., Envoy and Istio) as well as OpenTracing in order to tell coherent stories about the interaction between their microservices and monoliths.

We were humbled to hear existing LightStep customers telling folks who stopped by our booth how our solution has helped them tell clear stories about the most urgent issues affecting their own systems. Because LightStep integrates at the application layer – through OpenTracing, Envoy, transcoded logging data, or in-house tracing systems – it’s easy to connect our explanations for system behavior to the business logic and application semantics, and to steer clear of the poor signal-to-noise ratio of unfiltered container-level data.

Given the momentum behind Kubernetes and microservices in general, KubeCon felt like a glimpse into the future. That future will empower devs/devops to build and ship features faster and with greater independence. With CNCF’s portfolio of member projects fleshing out the stack around and above Kubernetes, we’re all moving to a world where we can stop caring about containers and keep our focus where it belongs: at the application layer where our developers write and debug their own software.

The End of Microservices

A post from the future, where building reliable and scalable production systems has become as easy as, well, writing any other software. Read on to see what the future looks like…

Back in 2016, people wrote a lot about "microservices," sort of like how they wrote a lot about the "information superhighway" back in 1996. Just as the phrase "information superhighway" faded away and people got back to building the internet, the "micro" part of microservices was also dropped as services became the standard way of building scalable software systems. Despite the names we've used (and left behind), both terms marked a shift in how people thought about and used technology. Using services-based architectures meant that developers focused on the connections between services, which enabled them to build better software and to build it faster.

The rise and fall of the information superhighway (source)

Since 2016, developers have become more productive by focusing on one service at a time. What’s a “service”? Roughly, it’s the smallest useful piece of software that can be defined simply and deployed independently. Think of a notification service, a login service, or a persistent key-value storage service. A well-built service does just one thing, and it does it well. Developers now move faster because they don’t worry about virtual machines or other low-level infrastructure: services raise the level of abstraction. (Yet another buzzword: this was called serverless computing for a while.) And because the connections between services are explicit, developers are also freed from thinking about the application as a whole and can instead concentrate on their own features and on the services they depend on.

Back in the day, many organizations thought that moving to a microservice architecture just meant "splitting up one binary into 10 smaller ones." What they found when they did was that they had the same old problems, just repeated 10 times over. Over time, they realized that building a robust application wasn't just a matter of splitting up their monolith into smaller pieces but of understanding the connections between those pieces. This was when they started asking the right questions: What services does my service depend on? What happens when a dependency doesn't respond? What other services make RPCs to my service? How many RPCs do I expect? Answering these questions required a new set of tools and a new mindset.

Tools, tools, tools


Building service-based architectures wouldn’t have been possible without a new toolkit to reconcile the independent yet interconnected nature of services. One set of tools describes services programmatically, defining API boundaries and the relationships between services. They effectively define contracts that govern the interactions of different services. These tools also help document and test services, and generate a lot of the boilerplate that comes with building distributed applications.
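In practice these contracts are often written in an interface definition language such as Protocol Buffers or Thrift; the sketch below expresses the same idea in plain Python, with a hypothetical notification service as the example:

```python
from dataclasses import dataclass
from typing import Protocol

# The request/response shapes and the operation signatures together form
# the contract that other services build (and test) against.
@dataclass
class SendRequest:
    user_id: str
    message: str

@dataclass
class SendResponse:
    delivered: bool

class NotificationService(Protocol):
    def send(self, request: SendRequest) -> SendResponse: ...
```

Any implementation – in-process fake, test double, or the real remote service – satisfies the same boundary, which is what lets services evolve independently.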

Another important set of tools helps deploy and coordinate services: schedulers to map high-level services onto the underlying resources they consume (and to scale them appropriately), as well as service discovery and load balancers to make sure requests get where they need to go.

Finally, once an application is deployed, a third set of tools helps developers understand how service-based applications behave and helps them isolate where (and why) problems occur. Back in the early days of microservices, developers lost a lot of the visibility they were accustomed to having with monolithic applications. Suddenly it was no longer possible to just grep through a log file and find the root cause: now the answer was split across log files on hundreds of nodes and interleaved with thousands of other requests. Only with the advent of multi-process tracing, aggregate critical path analysis, and smart fault injection could the behavior of a distributed application really be understood.
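The core idea behind multi-process tracing is small: every hop carries the same trace identifier, so records emitted by many services can be reassembled into one end-to-end view. Here is a toy sketch, with in-process function calls standing in for the RPC hops (real systems propagate the context in RPC headers):

```python
import uuid

def new_context():
    """Start a trace at the edge of the system."""
    return {"trace_id": uuid.uuid4().hex, "span_id": uuid.uuid4().hex[:16]}

def child_context(parent):
    """Derive a child span: same trace_id, new span_id, linked to parent."""
    return {"trace_id": parent["trace_id"],
            "parent_span_id": parent["span_id"],
            "span_id": uuid.uuid4().hex[:16]}

def log(ctx, msg):
    print(f"trace={ctx['trace_id']} span={ctx['span_id']} {msg}")

def handle_checkout(ctx):
    log(ctx, "checkout started")
    charge_card(child_context(ctx))  # the downstream call inherits trace_id

def charge_card(ctx):
    log(ctx, "charging card")

handle_checkout(new_context())
```

Grepping any one service's logs for a trace_id now yields the full story of a single request, which is exactly the visibility that was lost when the monolith was split up.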

Many of these tools existed in 2016, but the ecosystem had yet to take hold. There were few standards, so new tools required significant investment and existing tools didn’t work well together.

A new approach


Services are now an everyday, every-developer way of thinking, in part because of this toolkit. But the revolution really happened when developers started thinking about services first and foremost while building their software. Just as test-driven development meant that developers started thinking about testing before writing their first lines of code, service-driven development meant that service dependencies, performance instrumentation, and RPC contexts became day-one concerns (and not just issues to be papered over later).

Overall, services (“micro” or otherwise) have been a good thing. (We don’t need to say “microservices” anymore since, in retrospect, it was never the size of the services that mattered: it was the connections and the relationships between them.) Services have re-centered the conversation of software development around features and enabled developers to work more quickly and more independently to do what really matters: delivering value to users.

Back in the present now… There's still a lot of exciting work to be done in building the services ecosystem, and here at LightStep, we are excited to be part of this revolution and to help improve visibility into production systems through tracing! Want to chat more about services, tracing, or visibility in general? Reach us at hello@lightstep.com, on Twitter at @lightstephq, or in the comments below.