I had a lovely time at QCon London earlier this month. I had the opportunity to present on a few of my favorite topics (hint: they all involve microservices) and also got to chat with devs/devops building many different flavors of powerful software for companies of all shapes and sizes. (As a side note, the vendor areas at most tech conferences seem to be fluorescent-lit, windowless rooms – not so at QCon London! We all had a beautiful floor-to-ceiling view of Westminster Abbey. Not bad!)
QCon London event staff (and a few LightSteppers) sporting fashionable “Tracing is Fun!” t-shirts.
I’ve been to a number of tech conferences in Europe over the years. Things felt qualitatively different this time around. In the past, it seemed like enterprise software developers in the E.U. were curious about microservices and other distributed architectures, but they were still stuck with their monoliths for various practical reasons. “Tracing” and “serverless” were similarly foreign, at least in production.
Fast forward to 2019: Microservices have gone mainstream. It was remarkable how far microservices – as well as the problems they introduce – have proliferated, especially at older, traditionally more risk-averse companies. This is no doubt due to the strength of the evidence in favor of a transition to microservices; for instance, Sarah Wells gave a wonderful keynote presentation where she documented how the Financial Times increased its release velocity more than 100x by switching to microservices. It’s all very compelling and hard to ignore.
Granted, from a certain perspective, nothing has changed. Teams still need to provide an excellent (and speedy) product experience for their end users, they need to ship code faster, and they need to resolve incidents more quickly. How can we make all of this possible? What can we do to help organizations develop with confidence despite the growing complexity of their modern, distributed systems?
Perhaps we can come to agreement on a few guiding principles:
1. Observability must be service-centric
We can do a much better job transforming signals (spans, traces, etc.) into insights when we have clear objective functions. For example, once a service team declares their SLIs – and clearly states which metrics serve as indicators of the health of their service – our tools have an objective function to work with: p99 latencies, error rates, throughput, etc. This clarity lends itself to meaningful automation. Everything from automatic rollbacks (based on SLI latency thresholds) to dynamic, contextual analysis of spans and traces is suddenly possible.
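To make the rollback idea concrete, here is a minimal Python sketch of an SLI check. Everything in it is an assumption for illustration: the `Request` and `SLITargets` types, the `should_roll_back` function, and the threshold values are hypothetical and do not come from any particular tool. It only shows how declared SLIs (p99 latency and error rate) can become an objective function that automation can evaluate.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    """One observed request (hypothetical shape, for illustration only)."""
    latency_ms: float
    ok: bool


@dataclass
class SLITargets:
    """Thresholds a service team might declare for its SLIs (hypothetical)."""
    p99_latency_ms: float
    max_error_rate: float


def p99(latencies: List[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies)
    # ceil(0.99 * n) computed with integers to avoid floating-point surprises.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def should_roll_back(requests: List[Request], targets: SLITargets) -> bool:
    """Return True when either declared SLI threshold is violated."""
    if not requests:
        return False  # no traffic observed, so no evidence of a violation
    latency = p99([r.latency_ms for r in requests])
    error_rate = sum(1 for r in requests if not r.ok) / len(requests)
    return latency > targets.p99_latency_ms or error_rate > targets.max_error_rate
```

A deploy pipeline could call something like `should_roll_back(recent_requests, targets)` on a window of post-release traffic. A real system would also need to decide on window size and minimum sample counts, which this sketch leaves out.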
2. Tracing isn’t just for microservices
Traces should absolutely achieve coverage of the modern, progressive services in a production deployment – but they should also account for overall time spent in mobile apps, web clients, and monoliths. In fact, end-to-end tracing is the best way to understand their interdependence. An Android dev may think about latency only in terms of the literal end user, whereas a backend engineer’s mind is likely focused on their particular service, but in distributed systems both developers are working on components that depend on each other. Mapping the journey of a transaction – from swipe to servers – is needed if we expect to form a nuanced understanding of systemic issues afflicting modern applications.
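As a rough illustration of what "from swipe to servers" can look like in data, the Python sketch below attributes the time in a single trace to each component. The `Span` type, the component names, and the `self_time_by_component` function are hypothetical, not taken from any tracing library, and the calculation assumes each span's children run sequentially without overlapping.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    """A simplified span (hypothetical shape, for illustration only)."""
    span_id: str
    parent_id: Optional[str]
    component: str  # e.g. "android-app", "api-gateway", "monolith"
    start_ms: float
    end_ms: float


def self_time_by_component(spans: List[Span]) -> Dict[str, float]:
    """Sum each component's self time: span duration minus its children's.

    Assumes the children of a span do not overlap one another in time.
    """
    # Total duration of the direct children of each span.
    child_time: Dict[str, float] = defaultdict(float)
    for s in spans:
        if s.parent_id is not None:
            child_time[s.parent_id] += s.end_ms - s.start_ms

    totals: Dict[str, float] = defaultdict(float)
    for s in spans:
        duration = s.end_ms - s.start_ms
        totals[s.component] += duration - child_time[s.span_id]
    return dict(totals)


# One transaction: a tap in the mobile app, through a gateway, into a monolith.
trace = [
    Span("a", None, "android-app", 0.0, 500.0),
    Span("b", "a", "api-gateway", 50.0, 450.0),
    Span("c", "b", "monolith", 100.0, 400.0),
]
breakdown = self_time_by_component(trace)
```

With this made-up trace, the mobile developer and the backend engineer are looking at the same 500 ms, but most of it is spent in the monolith, which neither would see from their own component alone.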
3. There’s simply “too much signal”
Some say that observability has a “signal-to-noise” problem. I’d say it’s deeper than that: there’s simply too much signal. None of it should be discarded – it is signal, after all! – but we need tools to detect the actionable patterns and surface them for us. Simply discarding outliers because they are infrequent runs contrary to the very purpose of observability: to understand the inner workings of a system by its outputs. Does this mean we need to manually analyze every span? No – we don’t have the time or the brainpower to do so without assistance. But by using tools that ingest the firehose in its entirety, we can begin to understand and build a strong, evidence-based case about the root cause of complex, multifactorial problems in production.
4. Serverless means too many things
It’s problematic that “serverless” has come to mean everything from FaaS in general, to “nanoservices,” to edge compute functions. It’s high time that we choose more self-descriptive terms, or we will inevitably end up talking past each other. ETL processes ported to Lambda and S3 are completely different from latency-sensitive consumer-facing products, even if they’re all “serverless.” As a trend, “serverless” is worth understanding, but it’s so broad that it’s difficult to have a coherent discussion about problems and solutions.
For those who were at QCon and those who weren’t, I’d love to get your feedback on these ideas. You can find me on Twitter @el_bhs or drop me a line over old-fashioned email if you’d like to use more than 280 characters.