Now Comes the Fun Part: Mainstream Microservices and Their Implications

I had a lovely time at QCon London earlier this month. I had the opportunity to present on a few of my favorite topics (hint: they all involve microservices) and also got to chat with devs/devops building many different flavors of powerful software for companies of all shapes and sizes. (As a side note, the vendor areas at most tech conferences seem to be fluorescent-lit, windowless rooms – not so at QCon London! We all had a beautiful floor-to-ceiling view of Westminster Abbey. Not bad!)

Mainstream Microservices and their Implications (QCon London 2019) - Volunteers
QCon London event staff (and a few LightSteppers) sporting fashionable “Tracing is Fun!” t-shirts.

I’ve been to a number of tech conferences in Europe over the years. Things felt qualitatively different this time around. In the past, it seemed like enterprise software developers in the E.U. were curious about microservices and other distributed architectures, but they were still stuck with their monoliths for various practical reasons. “Tracing” and “serverless” were similarly foreign, at least in production.

Fast forward to 2019: Microservices have gone mainstream. It was remarkable how far microservices – as well as the problems they introduce – have proliferated, especially at older, traditionally more risk-averse companies. This is no doubt due to the strength of the evidence in favor of a transition to microservices; for instance, Sarah Wells gave a wonderful keynote presentation in which she documented how the Financial Times increased its release velocity more than 100x by switching to microservices. It’s all very compelling and hard to ignore.

Granted, from a certain perspective, nothing has changed. Teams still need to provide an excellent (and speedy) product experience for their end users, they need to ship code faster, and they need to resolve incidents more quickly. How can we make all of this possible? What can we do to help organizations develop with confidence despite the growing complexity of their modern, distributed systems?

Perhaps we can come to agreement on a few guiding principles:

1. Observability must be service-centric
We can do a much better job transforming signals (spans, traces, etc.) into insights when we have clear objective functions. For example, once a service team declares their SLIs – and clearly states which metrics serve as indicators of the health of their service – our tools have an objective function to work with: p99 latencies, error rates, throughput, etc. This clarity lends itself to meaningful automation. Everything from automatic rollbacks (based on SLI latency thresholds) to dynamic, contextual analysis of spans and traces is suddenly possible.
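As a minimal sketch of why declared SLIs make automation possible, consider an automatic rollback check keyed off a p99 latency threshold. The threshold, data source, and function names here are illustrative assumptions, not a real LightStep API:

```python
# Hypothetical sketch: a declared SLI (p99 latency) becomes an objective
# function for an automatic rollback decision. Threshold and data are
# illustrative assumptions.
def p99(latencies):
    """Nearest-rank 99th-percentile latency."""
    ordered = sorted(latencies)
    index = min(int(0.99 * len(ordered)), len(ordered) - 1)
    return ordered[index]

def should_roll_back(latencies_ms, slo_p99_ms=250.0):
    """True if the canary's p99 latency violates the declared SLI."""
    return p99(latencies_ms) > slo_p99_ms

# Example: a canary whose tail has regressed, even though most requests are fast
canary = [12.0] * 990 + [900.0] * 10  # 1% of requests are very slow
print(should_roll_back(canary))       # True: the p99 lands in the slow tail
```

The point is less the arithmetic than the contract: once the team states which metric indicates health, tooling can act on it without a human in the loop.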

2. Tracing isn’t just for microservices
Traces should absolutely achieve coverage of the modern, progressive services in a production deployment – but they should also account for overall time spent in mobile apps, web clients, and monoliths. In fact, tracing is the best way to understand their interdependence. An Android dev may think about latency only in terms of the literal end user, whereas a backend engineer’s mind is likely focused on their particular service, but in distributed systems both developers are working on components that depend on each other. Mapping the journey of a transaction – from swipe to servers – is necessary if we expect to form a nuanced understanding of systemic issues afflicting modern applications.
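Stitching the mobile client and the backend into one trace comes down to propagating the same trace context across the process boundary. The sketch below is illustrative only (the header names are made up; real systems use standards like the W3C `traceparent` header, and real SDKs handle this for you):

```python
import uuid

# Illustrative sketch (not a real tracing SDK): propagating one trace's
# context from a client "swipe" into a backend service, so both ends
# contribute spans to the same trace.
def start_trace():
    return {"trace_id": uuid.uuid4().hex, "span_id": uuid.uuid4().hex}

def inject(context, headers):
    # Hypothetical header names, for illustration only.
    headers["x-trace-id"] = context["trace_id"]
    headers["x-parent-span-id"] = context["span_id"]
    return headers

def extract(headers):
    return {"trace_id": headers["x-trace-id"],
            "parent_span_id": headers["x-parent-span-id"],
            "span_id": uuid.uuid4().hex}

# Client side (mobile app): start the trace at the user's tap...
client_ctx = start_trace()
request_headers = inject(client_ctx, {})

# ...server side (backend): continue the same trace.
server_ctx = extract(request_headers)
assert server_ctx["trace_id"] == client_ctx["trace_id"]
```

Because both spans share a trace ID, the Android dev's "time to render" and the backend engineer's "time in my handler" become two segments of one measurable journey.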

3. There’s simply “too much signal”
Some say that observability has a “signal-to-noise” problem. I’d say it’s deeper than that: there’s simply too much signal. None of it should be discarded – it is signal, after all! – but we need tools to detect the actionable patterns and surface them for us. Simply discarding outliers because they are infrequent runs contrary to the very purpose of observability: to understand the inner workings of a system by its outputs. Does this mean we need to manually analyze every span? No – we don’t have the time or the brainpower to do so without assistance. But by using tools that ingest the firehose in its entirety, we can begin to understand and build a strong, evidence-based case about the root cause of complex, multifactorial problems in production.

4. Serverless means too many things
It’s problematic that “Serverless” has come to mean everything from FaaS in general, to “nanoservices,” to edge compute functions. It’s high time that we choose more self-descriptive terms, or we will inevitably end up talking past each other. ETL processes ported to Lambda and S3 are completely different than latency-sensitive consumer-facing products, even if they’re all “serverless.” As a trend, “serverless” is worth understanding, but it’s so broad that it’s difficult to have a coherent discussion about problems and solutions.

For those who were at QCon and those who weren’t, I’d love to get your feedback on these ideas. You can find me on Twitter @el_bhs or drop me a line over old-fashioned email if you’d like to use more than 280 characters.

Welcoming Denise Persson to our Board of Directors

Today it’s my pleasure to announce that Denise Persson has joined LightStep’s Board of Directors.

We couldn’t be happier about the news! From a pure resume standpoint, Denise is a world-class technology marketer: She is presently the Chief Marketing Officer at Snowflake, and she brings 20 years of experience scaling high-growth companies. Prior to joining Snowflake, she served as the CMO for Apigee, helping to take them public in 2015. When I asked around about marketing executives who really understand how to reach and educate a highly technical audience, Denise’s name came up time and time again.

While Denise is clearly an impressive operator, she is also a great coach and provides thoughtful, pragmatic advice and guidance, always informed by experience and examples. In our conversations about LightStep’s challenges and opportunities, she listens carefully, asks thoughtful questions, uncovers hidden assumptions, and helps me reframe my own thinking; this is the sort of thought-partnership that LightStep wants from an independent board member, and so we feel especially fortunate to be working with Denise in this capacity.

At Snowflake, she helped define – then lead – a new category. The industry-wide move to cloud computing was hugely disruptive for Snowflake’s market, and Denise and her team navigated that transition brilliantly. Today, LightStep’s market is in the midst of an analogous transformation brought on by microservices and serverless computing: as enterprises decompose their monoliths, their traditional APM solutions have ceased to function, leaving confusion and ambiguity in their wake. LightStep is eager to educate our ecosystem about these challenges and light a path toward operational confidence at scale, and Denise will be instrumental as we engage in this effort.

We asked Denise why she accepted the board role here at LightStep. In her own words: “I’m very happy to join LightStep’s board at an exciting phase of their growth, and a very transformative time for software development. LightStep is well positioned to become a key player in helping enterprises of all sizes to adopt microservices more quickly, which has become essential for any organization that needs to launch new products and services faster and more efficiently. LightStep has an incredible team of passionate and highly experienced individuals, I look forward to collaborating with them and sharing my experiences creating new categories.”

We look forward to it, too. Welcome, Denise!

Traces are Dead: Long Live (LightStep) Tracing

Several years ago when I’d give talks about tracing, I would start things off by asking the audience if they knew what distributed tracing was. Depending on the particular event, about 10-25% of the audience would raise their hands. These days, it’s more like 90-95%. It happened incrementally, but somehow distributed tracing has graduated to become a well-understood requirement for any microservices or serverless transition strategy. After all, how else could one hope to understand the behavior of the wildly complex distributed systems we’re all building?

But there’s a catch that’s not so well-understood: most distributed tracing users extract very little value from their traces. I’ve been working on some form of distributed tracing technology since I started tinkering with Dapper in late 2004. Over that time, I’ve come to recognize an important paradox: if we wish to understand the behavior of distributed systems, individual distributed traces are necessary but rarely sufficient on their own. Since each represents only an isolated transaction, individual traces cannot convey confidence about a larger pattern of behavior, nor can they diagnose contention for shared resources or other interference effects – we even documented many of these limitations in the Dapper paper. So how can service owners extract higher-level patterns and insights from the firehose of distinct distributed traces?

Today, we’re announcing LightStep Tracing: the fastest way for teams to adopt best-of-breed distributed tracing. It supports real-time, high-cardinality search over – and visualization of – unsampled distributed traces, and it also offers the powerful aggregate analysis features of its larger sibling, LightStep [x]PM.

LightStep Tracing - Explorer
Build dynamic system diagrams around any service or tag with unlimited cardinality

Like [x]PM, LightStep Tracing goes far beyond individual traces: tracing really gets powerful when our tools intelligently sample and analyze thousands of traces in order to answer specific, high-value questions. Features like high-fidelity histograms, historical layers, dynamic system diagrams – and building blocks like Snapshots – are all valuable industry firsts, and each offers something much richer than the distinct distributed traces used as input data.

LightStep Tracing makes all of this available – as cloud-hosted SaaS – to any team that’s adopting distributed tracing. And, since integration is based on a portable industry standard, those teams can switch to and from any other OpenTracing-compatible tracing solution in the future.

At LightStep, our mission is to deliver confidence at scale for those who develop, operate, and rely upon today’s powerful software applications. With LightStep Tracing, we hope that many more teams across our industry will be able to attain a new level of confidence about their own systems – because that’s what great tracing should do.

Ready to get tracing? Learn more

Looking Back, Looking Forward … and fundraising when we’re ready to use it

This article originally appeared on Medium.

There’s a rule of thumb about startup fundraising: “Raise when you can.”

But that’s not how we think about it. For our recent Series C financing, the idea wasn’t to “raise when we could” — in that case, we would have closed something much earlier. Rather, it was to “raise when we’re ready to use it,” and that’s exactly the situation we find ourselves in: everywhere we look, there are high-ROI projects to bet on, and we should pursue as many as possible.

Thinking back on the past year, it’s remarkable how much has changed. Business is great, of course — we are excited about each and every customer we’ve brought on. Lately, we are working with more traditional enterprises who have embarked on their own microservices adventure and want to use LightStep [x]PM to maintain confidence and control along the way. From a people standpoint, we’ve more than doubled in size, established multiple new departments while keeping our collaborative culture, formalized our values and built practices around them, and tried to lay the foundations — in particular, a company-wide sense of responsibility and autonomy — for more rapid, healthy growth in the years to come.

Our [x]PM product has evolved and continues to lead the industry in terms of microservice APM and observability at scale. If I had to summarize:

  • (Not) Sampling: Random sampling hobbles distributed tracing, especially during incident resolution: you miss the outliers because they are, by definition, rare. This has been obvious to us since we founded LightStep, and indeed we’ve been running in production without upfront sampling since 2015, but it still bears repeating since many of our fellow vendors are now “inventing” this idea. 🙂
  • Traces are the fuel, not the car: The most impactful work we’ve done in the past year involves trace aggregates: when we can look at the statistics of trace structures, we can make higher-level statements about our customers’ systems that go well beyond the mysteries of individual transactions.
  • Performance is a Shape: High-percentile latency is better than median latency, but neither holds a candle to real-time histograms with no cardinality limitations.
  • Snapshots: In my recent KubeCon talk about a new scorecard for observability, I presented what Cindy Sridharan referred to as the “CAP Theorem for Observability”: namely, that a positive-ROI observability solution can’t simultaneously have high throughput, high cardinality, historical context, and unsampled data. This is where Snapshots come in — they give us the fidelity of unsampled data, but in the past; all at scale and without the cardinality limits that cause trouble for traditional time-series statistics. The “big picture” for Snapshots will come into greater focus in 2019 as we deliver insights on top of the core abstraction; suffice it to say that Snapshots will present a more detailed, actionable picture of system behavior than anything else that’s out there.

As a company, LightStep exists to give developers and operators greater confidence as their software scales. What we’ve seen is that scaled-out software begets many developers, many developers beget many small teams, and the presence of many small teams forces an organization to adopt microservices and/or serverless for managerial reasons. Vijay Gill described this in his excellent blog post about the only good reason to adopt microservices. And, sure enough, big enterprises are now running microservices; not in some zero-throughput labs environment, but in production, powering their bread-and-butter applications.

Of course this shift was evident at KubeCon — in fact, it probably originated there! This month at KubeCon North America, my colleague Ted Young organized the first-ever Observability Practitioners Summit, with speakers from academia, open-source observability projects, other great vendors, and in-house practitioners. The talks went beyond “Observability 101” material, delving deep into the details of these new monitoring technologies, visualization strategies, and novel use cases. The slides for all of the talks are available via the link above (and are recommended to any other observability nerds out there). Moreover, the O.P.S. event was packed: two years ago we were still explaining what distributed tracing was, and now the conversation is far more developed and far larger to boot. Ted also gave a great talk during the main KubeCon event about using distributed traces as a way to make “distributed assertions” about the behavior of microservice applications: Trace Driven Testing.

I remember doing internal tech talks at Google twelve years ago, trying to get highly-specialized Google software engineers (who develop scaled-out distributed systems for a living) to care at all about my Dapper project and distributed tracing in general. Frankly, it was a bit like pushing a car uphill — at the time, the concepts were simply a bit too new. Jump to 2018, where Lew Cirne, New Relic’s founding CEO, is talking about distributed tracing by name during NEWR’s quarterly earnings call. What planet are we living on here? I’m not sure, but it’s a lot of fun.

In closing, what could be better than a growing company — built around a remarkable team of wonderful people with strong shared values — creating a novel product that’s leading the industry into a dynamic and rapidly-growing market? There’s a lot to be excited about, and that’s why we raised when we did: not because we needed to, not because we can, but because we know what to build with it, and we want to build it faster. We can’t wait for 2019.

Three Pillars with Zero Answers – Towards a New Scorecard for Observability

This article originally appeared on Medium.

The orthodoxy

Have you heard about the “three pillars of observability” yet? No? The story goes like this:

If you’re using microservices, you already know that they’re nearly impossible to understand using a conventional monitoring toolchain: since microservices were literally designed to prevent “two-pizza” devops teams from knowing about each other, it turns out that it’s incredibly difficult for any individual to understand the whole, or even their nearest neighbor in the service graph.

Well, Google and Facebook and Netflix were building microservices before they were called microservices, and I read on Twitter that they already solved all of these problems… phew! They did it using Metrics, Logging, and Distributed Tracing, so you should, too – those are called “The Three Pillars of Observability,” and you probably even know what they look like already:

LightStep - Beware Observability Dogma - Metrics
LightStep - Beware Observability Dogma - Logs
LightStep - Beware Observability Dogma - Traces

So, if you want to solve observability problems like Google and Facebook and Netflix, it’s simple… find a metrics provider, a logging provider, and a tracing provider, and voilà: your devops teams will bask in the light of an observable distributed system.

Fatal flaws

Perhaps the above is hyperbolic. Still, for those who deployed “the three pillars” as bare technologies, the initial excitement dissipated quickly as fatal flaws emerged.

Metrics and cardinality

For Metrics, we all needed to learn a new vocab word: cardinality. The beauty of metrics is that they make it easy to see when something bad happened: the metric looks like a squiggly line, and you can see it jump up (or down) when something bad happens. But diagnosing those anomalous moments is deeply difficult using metrics alone… the best we can do is to “drill down,” which usually means grouping the metric by a tag, hoping that a specific tag value explains the anomaly, then filtering by that tag and iterating on the drill-down process.

“Cardinality” refers to the number of elements in a set. In the case of metrics, cardinality refers to the number of values for any particular metric tag. If there are 5 values, we’re probably fine; 50 might be ok; 500 is probably too expensive; and once we get into the thousands, you simply can’t justify the ROI. Unfortunately, many real-world tags have thousands or millions of values (e.g., user-id, container-id, and so forth), so metrics often prove to be a dead end from an investigative standpoint.
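A back-of-the-envelope calculation shows why this matters: the number of distinct time series a metrics system must store grows with the *product* of the tags' cardinalities. The tag names and counts below are illustrative assumptions:

```python
# Illustrative: time-series count is the product of tag cardinalities.
tag_cardinalities = {
    "endpoint": 50,
    "status_code": 10,
    "region": 5,
    "container_id": 3000,  # one high-cardinality tag...
}

series = 1
for values in tag_cardinalities.values():
    series *= values

# ...and suddenly ONE metric requires 7.5 million distinct time series.
print(series)  # 7500000
```

Add a `user_id` tag with a million values and the product becomes absurd, which is exactly why metrics backends impose cardinality limits in the first place.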

Logging volumes with microservices

For Logs, the problem is simpler to describe: they just become too expensive, period. I was at a speaker dinner before a monitoring conference last year, and one of the other presenters – a really smart, reputable individual who ran the logging efforts for one of today’s most iconic tech companies – was giving a talk the following day about how to approach logging in a microservices environment. I was excited about the topic and asked him what his basic thesis was. He said, “Oh, it’s very simple: don’t log things anymore.”

It’s easy to understand why: if we want to use logs to account for individual transactions (like we used to in the days of a monolithic web server’s request logs), we would need to pay for the following:

LightStep - Beware Observability Dogma - Logging Costs

Logging systems can’t afford to store data about every transaction anymore because the cost of those transactional logs is proportional to the number of microservices touched by an average transaction. Not to mention that the logs themselves are less useful (independent of cost) due to the analytical need to understand concurrency and causality in microservice transactions. So conventional logging isn’t sufficient in our brave new architectural world.
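A rough cost model makes the proportionality concrete. All of the numbers below are illustrative assumptions, but the structure of the formula is the point: log volume scales linearly with the number of services each transaction touches.

```python
# Hypothetical cost model: transactional log volume grows with the number
# of services touched per transaction. All inputs are illustrative.
def monthly_log_bytes(transactions_per_sec, services_per_txn,
                      log_lines_per_service, bytes_per_line):
    seconds_per_month = 30 * 24 * 3600
    return (transactions_per_sec * services_per_txn *
            log_lines_per_service * bytes_per_line * seconds_per_month)

monolith = monthly_log_bytes(1000, 1, 5, 200)        # one service per txn
microservices = monthly_log_bytes(1000, 40, 5, 200)  # 40 services per txn

print(microservices / monolith)  # 40.0: same traffic, 40x the log bill
```

Same user traffic, same per-service logging discipline: the only thing that changed is the architecture, and the bill went up 40x.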

Tracing and foreknowledge

Which brings us to “Distributed Tracing,” a technology specifically developed to address the above problem with logging systems. I built out the Dapper project at Google myself. It certainly had its uses, especially for steady-state latency analysis, but we dealt with the data volume problem by applying braindead, entirely random, and very aggressive sampling. This has long been the elephant in the room for distributed tracing, and it’s the reason why Dapper was awkward to apply in on-call scenarios.

The obvious answer would be to avoid sampling altogether. For scaled-out microservices, though, the cost is a non-starter. It’s more realistic to defer the sampling decision until the transaction has completed. That’s an improvement, but it masks a crucial question: which traces should we sample, anyway? If we’re restricting our analysis to individual traces, we typically focus on “the slow ones” or those that result in an error; however, performance and reliability problems in production software are typically a byproduct of interference between transactions, and understanding that interference requires much more sophisticated sampling strategies that aggregate across related traces contending for the same resources.
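A deferred ("tail-based") sampling decision can be sketched as follows. The keep criteria here are deliberately simplistic assumptions – as noted above, diagnosing interference between transactions demands far more sophisticated strategies than per-trace rules like these:

```python
# Sketch of a deferred sampling decision: wait until the transaction
# finishes, then keep traces that look interesting. The criteria are
# simplistic, illustrative assumptions.
def keep_trace(trace, p99_latency_ms):
    if trace["error"]:
        return True                       # always keep errors
    if trace["duration_ms"] > p99_latency_ms:
        return True                       # keep the slow outliers
    return trace["trace_id"] % 100 == 0   # plus a small baseline sample

traces = [
    {"trace_id": 1, "duration_ms": 12,  "error": False},
    {"trace_id": 2, "duration_ms": 840, "error": False},
    {"trace_id": 3, "duration_ms": 15,  "error": True},
]
kept = [t["trace_id"] for t in traces if keep_trace(t, p99_latency_ms=500)]
print(kept)  # [2, 3]
```

Note what this sketch cannot do: trace 2 may be slow because of trace 1's resource consumption, and no per-trace predicate will ever surface that relationship.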

In any case, a single distributed trace is occasionally useful, but it’s a bit of a Hail Mary. Sampling the right distributed traces and extracting meaningful, accessible insights from them is a broader challenge, and a much more valuable one.

And about emulating Google (et al.) in the first place…

Another issue with “The Three Pillars” is the very notion that we should always aspire to build software that’s appropriate for the “planet-scale” infrastructure at Google (or Facebook, or Twitter, and so on). Long story short: don’t emulate Google. This is not a dig on Google – there are some brilliant people there, and they’ve done some terrific work given their requirements.

But! Google’s technologies are built to scale like crazy, and that isn’t necessarily “good”: Jeff Dean (one of those brilliant Googlers who deserves all of the accolades – he even has his own meme) would sometimes talk about how it’s nearly impossible to design a software system that’s appropriate for more than 3-4 orders of magnitude of scale. Further, there is a natural tension between a system’s scalability and its feature set.

Google’s microservices generate about 5 billion RPCs per second; building observability tools that scale to 5B RPCs/sec therefore boils down to building observability tools that are profoundly feature poor. If your organization is doing more like 5 million RPCs/sec, that’s still quite impressive, but you should almost certainly not use what Google uses: at 1/1000th the scale, you can afford much more powerful features.

Bits vs Benefits

So each “pillar” has a fatal flaw (or three), and that’s a problem. The other problem is even more fundamental: Metrics, Logs, and Distributed Traces are just bits. Each describes a particular type of data structure, and when we think of them, we tend to think of the most trivial visualization of those data structures: metrics look like squiggly lines; logs look like a chronological listing of formatted strings; and traces look like those nested waterfall timing diagrams.

None of the above directly addresses a particular pain point, use case, or business need. That is, with the “three pillars” orthodoxy, we implicitly leave the extraordinarily complex task of actually analyzing the metric, log, and trace data as “an exercise for the reader.” And, given the fatal flaws above and the subtle interactions and co-dependencies between these three types of data, our observability suffers greatly as a result.

In our next installment…

We need to put “metrics, logs, and tracing” back in their place: as implementation details of a larger strategy – they are the fuel, not the car. We need a new scorecard: stay tuned for our next post, where we will introduce and rationalize a novel way to independently measure and grade an observability strategy.

Hippies, Ants, and Healthy Microservices

This article originally appeared on Medium.

For any organization that expects its developers to produce powerful software, the decision to adopt microservices should be an easy one – developer velocity is king, and that’s hard to come by in the byzantine build-test-release lifecycles of monolithic software architectures. But the initial commitment to adopt microservices is much simpler than the decisions that follow about how to structure that adoption: there are uncountably many blog posts addressing various and sundry technical details, and dozens of partially-overlapping solutions for every problem, even (especially?) for the ones you haven’t encountered yet.

And yet, despite the mountains of content about the technical details, it surprises me how little has been written about the biggest failure mode I’ve seen out there in the wild: a fundamental misunderstanding of the goals surrounding a microservices migration, and how those goals best translate into engineering management practices. In particular, the conventional wisdom makes a microservices-oriented engineering organization sound like a hippie commune. But it should probably feel more like an ant colony.


LightStep Microservices - Hippies

Before I proceed, let it be known that I have a soft spot for the hippies of yore. I love idealists as long as they’re peaceful, and you can’t get much more peaceful or idealistic than a good, old-fashioned hippie. If the hippie ethos could be distilled into a single value, it would be the freedom to make independent decisions and act on them.

There are other posts offering greater detail (I’m especially fond of this article about the intersection of management and microservices from Vijay Gill, SVP Eng at Databricks), but to summarize: the only good reason to adopt microservices is to accelerate development through reduced human communication overhead. The idea is that each microservice gets its own development team, and these teams stay out of each other’s way – i.e., they make completely independent decisions that further their own goals, and they try to allow others to do the same. It’s like “the Me generation” for software.

But I don’t think hippies have the right instincts for engineering management. What happens if we try to truly maximize the independence of distinct microservice teams? Every dev team chooses the language, frameworks, message queue, CI/CD strategy, and naming conventions (etc) that make the most sense for their service and their expertise as a group. Since every service and situation is different, this appears to be a rational strategy: after all, aren’t microservices about increasing parallelism in decision-making (if not the software itself)?


LightStep Microservices - Ants

And yet there are many flavors of independence. Ants are certainly enterprising little creatures: they readily explore every nook and cranny, they can famously carry up to 50x their own body weight, and some species build architecturally marvelous structures for themselves. But they use social and utterly standardized behaviors (and some pheromones) to facilitate their own versions of load-balancing, discovery, security, and replication.

LightStep Microservices - Ants
Service discovery for ants

While one can observe an individual ant and reason about their actions in the context of their environment, their most adaptive behaviors rely on “biological standardization.” For instance, if an individual wanders its way to a plentiful food source, that ant will emit a “trail pheromone” and head straight back to the colony; their fellow ants pick up the scent and use it to backtrack to the food source.

Similarly, when ants are in an alarmed or panicked state, they emit chemicals that alert their peers to the threat and protect the group. And so on and so forth: for every macrobehavior that benefits the colony, there is a standard chemical mechanism that all individuals understand and obey that facilitates that macrobehavior.

These collective adaptations have made ants one of the most “horizontally scalable” animals on Earth: the largest ant colony is 3,700 miles wide and is home to billions of individual organisms. They are remarkable animals!

…and back to microservices

There’s no question that hippies are more independent than ants. And I suppose I should acknowledge that ants would make lousy engineering managers (they can’t even drink coffee). But when we’re spinning up microservices, we have a lot to learn from ants and other hive-minded animals: their reliance on the rigid standardization of certain functions facilitates optimal outcomes for the group as a whole.

There’s always a temptation to allow each service team to decide on a language, a stack, and a set of primitives that feel familiar or appropriate to them. This is well-intentioned, as it seems to maximize the autonomy of the distinct service teams. But in a microservices deployment – especially at scale – we must also facilitate cross-cutting concerns like deployment, load-balancing, service discovery, security, and observability. If we encourage our two-pizza teams to make entirely independent decisions about each of these critical aspects, we are left with a monstrous challenge when operating our distributed application, especially as teams disband and services go into maintenance mode.

When transitioning towards a microservices architecture, it’s best to create a limited number of choices – ideally only one – for each cross-cutting aspect of the larger system. For example:

  • Programming language(s)
  • Service (and infrastructure) naming conventions
  • Orchestration and auto-scaling
  • Web/RPC framework
  • Service-to-service authentication
  • Instrumentation for logging
  • Instrumentation for tracing (I am obligated as a co-creator to plug OpenTracing for this)
  • Instrumentation for metrics
  • Service discovery
  • Load balancing
  • (and so on…)

By standardizing in these areas, a central team can manage these well-factored facets of the larger system, and the developers working on the microservices themselves can focus on what’s most important: building something valuable.

Performance is a Shape, Not a Number

This article originally appeared on Medium.

Applications have evolved – again – and it’s time for performance analysis to follow suit

In the last twenty years, the internet applications that improve our lives and drive our economy have become far more powerful. As a necessary side-effect, these applications have become far more complex, and that makes it much harder for us to measure and explain their performance – especially in real-time. Despite that, the way that we both reason about and actually measure performance has barely changed.

I’m not here to argue for the importance of understanding real-time performance in the face of rising complexity – by now, we all realize it’s vital – but rather for the need to improve our mental model as we recognize and diagnose anomalies. When assessing “right now,” our industry relies almost entirely on averages and percentile estimates: these are not enough to efficiently diagnose performance problems in modern systems. Performance is a shape, not a number, and effective tools and workflows should present and explore that shape, as we illustrate below.

We’ll divide the evolution of application performance measurement into three “phases.” Each phase had its own deployment model, its own predominant software architecture, and its own way of measuring performance. Without further ado, let’s go back to the olden days: before AWS, before the smartphone, and before Facebook (though perhaps not Friendster)…

Watch our tech talk now. Hear Ben Sigelman, LightStep CEO, present the case for unsampled latency histograms as an evolution of and replacement for simple averages and percentile estimates.

Phase 1: Bare Metal and average latency (~2002)

The stack (2002): a monolith running on a hand-patched server with a funny hostname in a datacenter you have to drive to yourself.

If you measured application performance at all in 2002, you probably did it with average request latency. Simple averages work well for simple things: namely, normally-distributed things with low variance. They are less appropriate when there’s high variance, and they are particularly bad when the sample values are not normally distributed. Unfortunately, latency distributions today are rarely normally distributed, can have high variance, and are often multimodal to boot. (More on that later.)
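To make that failure mode concrete, here is a small synthetic sketch (all latency numbers are invented): a bimodal distribution where 2% of requests take a slow path. The mean lands between the modes and describes neither, while the median never sees the slow mode at all.

```python
import random
import statistics

random.seed(0)

# Invented bimodal latencies (ms): a fast mode near 5ms and a rare
# slow mode near 120ms (say, a lock-contention path).
latencies = [random.gauss(5, 1) for _ in range(9800)] + \
            [random.gauss(120, 10) for _ in range(200)]

mean = statistics.mean(latencies)                 # pulled up between the modes
p50 = statistics.quantiles(latencies, n=100)[49]  # blind to the slow mode

print(f"mean = {mean:.1f} ms, p50 = {p50:.1f} ms")
```

Neither number reveals that there are two distinct behaviors here, let alone where the second one lives.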

To make this more concrete, here’s a chart of average latency for one of the many microservice handlers in LightStep’s SaaS:

Recent average latency for an important internal microservice API call at LightStep

It holds steady at around 5ms, essentially all of the time. Looks good! 5ms is fast. Unfortunately, it’s not so simple: average latency is a poor leading indicator of reliability woes, especially for scaled-out internet applications. We’ll need something better…

Phase 2: Cloud VMs and p99 latency (~2012)

The stack (2012): a monolith running in AWS with a few off-the-shelf services doing special-purpose heavy lifting (Solr, Redis, etc.).

Even if average latency looks good, we still don’t know a thing about the outliers. Per this great Jeff Dean talk, in a microservices world with lots of fanout, an end-to-end transaction is only as fast as its slowest dependency. As our applications transitioned to the cloud, we learned that high-percentile latency was an important leading indicator of systemic performance problems.

Of course, this is even more true today: when ordinary user requests touch dozens or hundreds of service instances, high-percentile latency in backends translates to average-case user-visible latency in frontends.
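The fanout arithmetic behind that claim is easy to sketch. Assuming (as an idealization) that each of N parallel backend calls independently lands in its own p99 tail 1% of the time, the probability that a request touches at least one slow call grows quickly with N:

```python
# Probability that a request with n-way fanout hits at least one call
# that falls in its backend's 1% latency tail (independence assumed).
def p_at_least_one_slow(n, tail_prob=0.01):
    return 1 - (1 - tail_prob) ** n

for n in (1, 10, 100):
    print(f"fanout={n:>3}: {p_at_least_one_slow(n):.0%}")
# With 100-way fanout, roughly 63% of requests see a p99-slow call:
# backend tail latency has become frontend typical latency.
```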

To emphasize the importance of looking (very) far from the mean, let’s look at recent p95 for that nice, flat, 5ms average latency graph from above:

Recent p95 latency for the same important internal microservice API call at LightStep

The latency for p95 is higher than p50, of course, but it’s still pretty boring. That said, when we plot recent measurements for p99.9, we notice meaningful instability and variance over time:

Recent p99.9 latency for the same microservice API call. Now we see some instability.

Now we’re getting somewhere! With a p99.9 like that, we suspect that the shape of our latency distribution is not a nice, clean bell curve, after all… But what does it look like?

Phase 3: Microservices and detailed latency histograms (2018)

The stack (2018): a few legacy holdovers (monoliths or otherwise) surrounded — and eventually replaced — by a growing constellation of orchestrated microservices.

When we reason about a latency distribution, we’re trying to understand the distinct behaviors of our software application. What is the shape of the distribution? Where are the “bumps” (i.e., the modes of the distribution) and why are they there? Each mode in a latency distribution is a different behavior in the distributed system, and before we can explain these behaviors we must be able to see them.

In order to understand performance “right now”, our workflow ought to look like this:

  1. Identify the modes (the “bumps”) in the latency histogram
  2. Triage to determine which modes we care about: consider both their performance (latency) and their prevalence
  3. Explain the behaviors that characterize these high-priority modes
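Step 1 of the workflow above is mechanical enough to sketch in code. This toy mode-finder (the bucket width and sample values are invented, and production tools would smooth the histogram and weight modes by prevalence) simply looks for buckets taller than both neighbors:

```python
from collections import Counter

def latency_histogram(samples_ms, bucket_ms=5):
    # Bucket latencies into fixed-width bins keyed by each bin's lower edge.
    return Counter(int(s // bucket_ms) * bucket_ms for s in samples_ms)

def find_modes(hist, bucket_ms=5):
    # A "mode" here is a bucket strictly taller than both neighbors
    # (empty buckets count as zero).
    edges = list(range(min(hist), max(hist) + bucket_ms, bucket_ms))
    padded = [0] + [hist.get(e, 0) for e in edges] + [0]
    return [edges[i] for i in range(len(edges))
            if padded[i] < padded[i + 1] > padded[i + 2]]

# Invented samples: a fast mode near 5ms and a slow mode near 120ms.
samples = [4, 5, 5, 6, 6, 7] * 20 + [118, 121, 122] * 5
modes = find_modes(latency_histogram(samples))
print(modes)  # the lower edges of the two "bumps"
```

With the bumps identified, steps 2 and 3 (triage and explanation) are where trace data comes in.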

Too often we just panic and start clicking around in hopes that we stumble upon a plausible explanation. Other times we are more disciplined, but our tools only expose bare statistics without context or relevant example transactions.

This article is meant to be about ideas (rather than a particular product), but the only real-world example I can reference is the recently released Live View functionality in LightStep [x]PM. Live View is built around an unsampled, filterable, real-time histogram representation of performance that’s tied directly to distributed tracing for root-cause analysis. To get back to our example, below is the live latency distribution corresponding to the percentile measurements above:

A real-time view of latency for a particular API call in a particular microservice. We can clearly distinguish distinct modes (the “bumps”) in the distribution; if we want to restrict our analysis to traces from the slowest mode, we filter interactively.

The histogram makes it easy to identify the distinct modes of behavior (the “bumps” in the histogram) and to triage them. In this situation, we care most about the high-latency outliers on the right side. Compare this data with the simple statistics from “Phase 1” and “Phase 2” where the modes are indecipherable.

Having identified and triaged the modes in our latency distribution, we now need to explain the concerning high-latency behavior. Since [x]PM has access to all (unsampled) trace data, we can isolate and zoom in on any feature regardless of its size. We filter interactively to home in on an explanation: first by restricting to a narrow latency band, and then further by adding key:value tag restrictions. Here we see how the live latency distribution varies from one project_id to the next (project_id being a high-cardinality tag for this dataset):
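That narrowing workflow (first a latency band, then key:value tags) amounts to composing a pair of filters. The span records and field names below are invented for illustration, not LightStep’s actual data model:

```python
# Hypothetical span records; a real system would stream these from
# tracing ingest rather than hold them in a list.
spans = [
    {"latency_ms": 4.8,   "tags": {"project_id": "7"}},
    {"latency_ms": 130.0, "tags": {"project_id": "36"}},
    {"latency_ms": 5.2,   "tags": {"project_id": "22"}},
    {"latency_ms": 145.5, "tags": {"project_id": "36"}},
]

def filter_spans(spans, min_ms=None, max_ms=None, **tag_filters):
    # Narrow to a latency band, then to exact key:value tag matches.
    out = spans
    if min_ms is not None:
        out = [s for s in out if s["latency_ms"] >= min_ms]
    if max_ms is not None:
        out = [s for s in out if s["latency_ms"] <= max_ms]
    for key, value in tag_filters.items():
        out = [s for s in out if s["tags"].get(key) == value]
    return out

slow_36 = filter_spans(spans, min_ms=100, project_id="36")
print(len(slow_36), "outlier spans for project_id 36")
```

Any surviving span is a candidate example trace for root-cause analysis, which is exactly the hand-off illustrated next.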

Given 100% of the (unsampled) data, we can isolate and zoom in on any feature, no matter how small. Here the user restricts the analysis to project_id 22, then project_id 36 (which have completely different performance characteristics). The same can be done for any other tag, even those with high cardinality: experiment ids, release ids, and so on.

Here we are surprised to learn that project_id 36 experienced consistently slower performance than the aggregate. Again: Why? We restrict our view to project_id=36, filter to examine the latency outliers, and open a trace. Since [x]PM can assemble these traces retroactively, we always find an example, even for rare behavior:

To attempt end-to-end root cause analysis, we need end-to-end transaction traces. Here we filter to outliers for project_id 36, choose a trace from a few seconds ago, and realize it took 109ms to acquire a mutex lock: our smoking gun.

The (rare) trace we isolated shows us the smoking gun: that contention around mutex acquisition dominates the critical path (and explains why this particular project — with its own highly-contended mutex — has inferior performance relative to others). Again, compare against a bare percentile: simply measuring p99 latency is a far cry from effective performance analysis.

Stepping back and looking forward…

As practitioners, we must recognize that countless disconnected timeseries statistics are not enough to explain the behavior of modern applications. While p99 latency can still be a useful statistic, the complexity of today’s microservice architectures warrants a richer and more flexible approach. Our tools must identify, triage, and explain latency issues, even as organizations adopt microservices.

If you made it this far, I hope you’ve learned some new ways to think about latency measurements and how they play a part in diagnostic workflows. LightStep continues to invest heavily in this area: to that end, please share your stories and points of view in the comment section, or reach out to me directly (Twitter, Medium, LinkedIn), either to provide feedback or to nudge us in a particular direction. I love to nerd out along these lines and welcome outside perspectives.

Want to work on this with me and my colleagues? It’s fun! LightStep is hiring.

Want to make your own complex software more comprehensible? We can show you exactly how LightStep [x]PM works.

KubeCon 2017: The Application Layer Strikes Back

You know it’s a special event when it snows in Texas.

Several of my delightful colleagues and I just returned from a remarkably chilly – and remarkably memorable – trip to Austin for KubeCon+CloudNativeCon Americas. We went because we were excited to talk shop about the future of microservices with 4,500 others involved with the larger cloud-native ecosystem. We had high hopes for the conference, as you won’t find a higher-density group of attendees when it comes to strategic, forward-thinking infrastructure people; yet even our lofty expectations were outdone by the buzz and momentum on display at the event.

On Wednesday at 7:55am, I emerged from my hotel room and had the good fortune of running into the inimitable Kelsey Hightower on my way to the elevator. I never miss an opportunity to learn something from Kelsey, so I asked him what was new and special in k8s-land these days. His response, paraphrased, was that “the big feature news is that – finally – we don’t have a big new feature in Kubernetes.” He went on to explain that this newfound stability at the infrastructural layer is a huge milestone for the k8s movement and opens the door to innovation above and around Kubernetes proper.

From an ecosystem standpoint, I was also lucky to speak with Chen Goldberg as part of a dinner that IBM organized. It was fascinating to hear how she and her team have architected the boundaries of Kubernetes to optimize for community. The project nails down the parts of the system that require standardization, while carving out white-space for projects and vendors to innovate around those core primitives.

This Kubernetes technology and project vision, along with its API stability, have led us to the present reality: Kubernetes has won when it comes to container orchestration and scheduling. That was not clear last year and was very far from clear two or three years ago, but with even the likes of AWS going all-in on Kubernetes, we have OSS developers, startup vendors, and all of the big cloud providers bought in on the platform. So now everyone and their dog are going to become a Kubernetes expert, right?

Not really. It’s even better than that: our industry is evolving towards a reality where everyone and their dog are going to depend on Kubernetes and containers, yet nobody will need to care about Kubernetes and containers. This is a huge and much-needed transformation, and reminiscent of how microservice development looked within Google: every service did indeed run in a container which was managed by an orchestration and scheduling system (internally code-named “Borg”), but developers didn’t know or care how the containers were built, nor did they need to know or care how Borg worked.

So what will devs and devops care about? They will care about application-layer primitives, and those primitives are what KubeCon + CloudNativeCon was about this year. As I mentioned in my keynote on Wednesday, this means that devs and devops will be able to take advantage of CNCF technologies like service mesh (e.g., Envoy and Istio) as well as OpenTracing in order to tell coherent stories about the interaction between their microservices and monoliths.

We were humbled to hear existing LightStep customers telling folks who stopped by our booth how our solution has helped them tell clear stories about the most urgent issues affecting their own systems. Because LightStep integrates at the application layer – through OpenTracing, Envoy, transcoded logging data, or in-house tracing systems – it’s easy to connect our explanations for system behavior to the business logic and application semantics, and to steer clear of the poor signal-to-noise ratio of unfiltered container-level data.

Given the momentum behind Kubernetes and microservices in general, KubeCon felt like a glimpse into the future. That future will empower devs/devops to build and ship features faster and with greater independence. With CNCF’s portfolio of member projects fleshing out the stack around and above Kubernetes, we’re all moving to a world where we can stop caring about containers and keep our focus where it belongs: at the application layer where our developers write and debug their own software.

Announcing LightStep: A New Approach for a New Software Paradigm


Today, LightStep emerged from stealth and announced its first product, LightStep [x]PM, as well as its Series A and Series B funding.

With today’s launch, we’re excited to speak more openly about what we’ve been up to here at LightStep. As a company, we focus on delivering deep insights about every aspect of high-stakes production software. With our first product, LightStep [x]PM, we identify and troubleshoot the most impactful performance and reliability issues. This post is about how we got here and why we’re so excited.

I started thinking about this problem in 2004. It began during an impromptu conversation I had with Sharon Perl, a brilliant research scientist who came to Google in the early days. She was mainly working on an object store (à la S3) at the time but also had a few prototype side projects. We talked through five of them, I believe, but one captured my attention in particular: Dapper.

Dapper circa 2004 was not fully baked, though the idea was magical to me: Google was operating thousands of independently-scalable services (they’d be called “microservices” today), and Dapper was able to automatically follow user requests across service boundaries, giving developers and operators a clear picture of why some requests were slow and others ended with an error message. I was so enamored of the idea that I dropped what I was doing at the time, adopted the (orphaned) Dapper prototype, and built a team to get something production-ready deployed across 100% of Google’s services. What we built was (and is still) essential for long-term performance analysis, but in order to contend with the scale of the systems being monitored, Dapper only centrally recorded 0.01% of the performance data; this meant that it was challenging to apply to certain use cases, such as real-time incident response (i.e., “most firefighting”).

Ten years later, Ben Cronin, Spoons (Daniel Spoonhower), and I co-founded LightStep. Enterprises are in the midst of an architectural transformation, and the systems our customers and prospects build look a lot like the ones I grew up with at Google. We visit with enterprise engineering and ops leaders frequently, and what we see are businesses that live (or die) by their software, yet often struggle to stay in control of it given the overwhelming scale and complexity of their own systems.

We built LightStep to help with this, and we started with LightStep [x]PM to focus on performance and reliability in particular. Our platform is not a reimplementation of Dapper, but an evolution and a broadening of its value prop: with LightStep’s unconventional architecture, we can analyze 100.0% of transaction data rather than 0.01% of it like we did with Dapper. This unique – and technically sophisticated – approach gives our customers the freedom to focus on the performance issues that are most vital to their business and jump to the root cause with detailed end-to-end traces, all in real-time.

For instance, Lyft sends us a vast amount of data – LightStep analyzes 100,000,000,000 microservice calls every day. At first glance, that data is all noise and no signal: overwhelming and uncorrelated. Yet by considering the entirety of it, LightStep can measure how performance affects different aspects of Lyft’s business, then explain issues and anomalies using end-to-end traces that extend from their mobile apps to the bottom of their microservices stack. The story is similar for Twilio, GitHub, Yext, DigitalOcean, and the rest of our customers: they run LightStep 100% of the time, in production, and use it to answer pressing questions about the behavior of their own complex software.

The credit for what LightStep has accomplished goes to our team. We value technical skill and motivation, of course; that said, we also value emotional sensitivity, situational awareness, and the ability to prioritize and leverage our limited resources. LightStep will continue to innovate and grow well into the future, and the people here and their relationships with our inspiring customers are the reason why. The company has also benefited in innumerable ways from early investors Aileen Lee and Michael Dearing, the staff at Heavybit, and of course our board members from Redpoint and Sequoia, Satish Dharmaraj and Aaref Hilaly. Our board brings deep company-building experience as well as a humility and humor that we don’t take for granted.

It’s no secret that software is getting more powerful every day. As it does, it becomes more complex. LightStep exists in order to decipher that complexity, and ultimately to deliver insights and information that let our customers get back to innovating. Nothing gets us more excited than the success stories we hear from our customers. As we continue to build towards our larger vision, we look forward to hearing many more.