Kubernetes Observability for Contrarians
by James Burns
So, maybe you’ve heard of this Kubernetes thing? In June of 2014, Google launched an ambitious effort to provide a best-in-class system for orchestrating container-based workloads as open source. Over the last five years, that effort has led to collaboration among all major cloud providers and many major technology companies, a feat in itself. As Kubernetes continues to pick up steam as the best platform to run your cloud native applications (or, if you listen to Kelsey Hightower, to build your platform for running your cloud native applications), questions continue to be raised about how to make it observable.
Here’s the dirty secret of Kubernetes observability: it’s the same as observability for distributed systems in VMs or even on bare metal.
Whether container, VM, or bare metal, you need to understand how well the users of your service are able to accomplish their goals, whether that’s placing an order for socks, arguing with people on the internet, or just looking at cats. In the era of the monolith, all the information about users' success was available in a single process or on a single machine (and the load balancer in front of it). In the era of services, that information is spread across many machines, VMs, pods, or functions. To piece together the truth about customer experience and to place that in business context, you need to use distributed tracing. By tracking time spent across all of these different places where work can be done, we can answer questions like “why was this order slow?” or “why didn’t the cat picture show up?”
Observability on Kubernetes is necessarily observability for distributed systems, but that is not particular to Kubernetes. It is just that those new to running applications this way are forced to cope with the “no single machine with answers” problem.
When you look at what is different about observing a distributed system on Kubernetes versus VMs, the main difference is context. Context is what allows you to correlate failures with causes, often contention for a shared resource: CPU, storage, network, or database connections. With VMs, your context will likely be an instance ID. As you schedule or “bin pack” more services onto a single VM, you need to add to the context so you can answer questions like: is this failing because of an issue with this pod, this VM, this machine, this rack, this region, or some other shared resource like a NAT gateway?
Kubernetes just adds more context to be included, usually as tags, onto your distributed trace, but it does not fundamentally change how or why you observe.
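As a minimal sketch of what "more context as tags" can look like, the snippet below builds a tag dictionary from environment variables and merges it into a span's tags. The variable names (POD_NAME, POD_NAMESPACE, NODE_NAME) are assumptions: they only exist if your pod spec maps Downward API fields onto them. The tag keys follow OpenTelemetry-style naming, which is also a choice made for illustration, and the span is modeled as a plain dictionary rather than any particular tracer's API.

```python
import os

# Hypothetical environment variable names: they are only present if the pod
# spec maps Downward API fields (such as metadata.name or spec.nodeName)
# onto them. The tag keys are an illustrative naming choice.
ENV_TO_TAG = {
    "POD_NAME": "k8s.pod.name",
    "POD_NAMESPACE": "k8s.namespace.name",
    "NODE_NAME": "k8s.node.name",
}


def kubernetes_tags(environ=None):
    """Return trace tags for whichever context variables are set and non-empty."""
    environ = os.environ if environ is None else environ
    return {tag: environ[var] for var, tag in ENV_TO_TAG.items() if environ.get(var)}


def tag_span(span_tags, environ=None):
    """Merge Kubernetes context into a copy of a span's existing tags."""
    merged = dict(span_tags)
    merged.update(kubernetes_tags(environ))
    return merged
```

With tags like these on every span, a query such as "are the slow orders all on one node?" becomes a group-by rather than a guess.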
As much as I’d like to tell you that distributed tracing is magic dust that you spread across your applications to see what they’re doing, it’s a bit more complicated than that. The most important thing to know is that transactions through your system need to carry the distributed tracing context everywhere. For the usual HTTP request-based applications, this context is carried in standard headers. To make sure you get visibility throughout your system, these headers need to be passed through even by services that don’t support distributed tracing yet. This initially surprises many people: why would applications need to change? Shouldn’t the distributed tracing system just know that they’re the same request by timestamp or whatever?
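Passing the headers through can be as small as the sketch below, which copies tracing headers from an inbound request onto whatever the service sends downstream. The header list is an assumption (W3C Trace Context plus the B3 family and x-request-id); check which headers your tracing system actually propagates. Headers are modeled as plain dictionaries, not any specific HTTP framework's types.

```python
# Assumed header set: W3C Trace Context plus the B3 family. Adjust this to
# whatever your tracing system actually propagates.
TRACE_HEADERS = frozenset({
    "traceparent", "tracestate",
    "x-request-id",
    "x-b3-traceid", "x-b3-spanid", "x-b3-parentspanid",
    "x-b3-sampled", "x-b3-flags",
})


def passthrough_headers(inbound):
    """Pick the tracing headers out of an inbound request's headers.

    Header names are matched case-insensitively; original casing is kept.
    """
    return {name: value for name, value in inbound.items()
            if name.lower() in TRACE_HEADERS}


def outbound_headers(inbound, extra=None):
    """Headers for a downstream call: the caller's own plus the trace context."""
    headers = dict(extra or {})
    headers.update(passthrough_headers(inbound))
    return headers
```

A service that does nothing else for tracing, but calls something like outbound_headers on every downstream request, no longer breaks the chain between the services on either side of it.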
Thinking about it a bit more, though, the answer becomes clear. When a request goes into a service and a request comes out of that service, the process that handles the request is doing something; that’s why it exists. That something could be making multiple requests for a single inbound request, or it could be retrying a request that failed before. Trying to guess what’s happening in that black box may work when everything is fine, but when things are failing, you want to know for sure that a request leaving that service was associated with a particular request to that service, all the way back to the customer. The point is to understand failure in business context, and trying to do that by being lucky is not a strategy.
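To make the "know for sure" part concrete, here is a simplified sketch of how explicit context ties each outbound call, whether a fan-out or a retry, to the inbound request that caused it. It assumes the W3C Trace Context traceparent layout (version, trace ID, parent ID, flags). The function name and the dictionary span record are hypothetical, and a real tracer would also record a span for the service's own work.

```python
import re
import secrets

# W3C Trace Context layout: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def start_outbound_call(inbound_traceparent):
    """Return (span record, traceparent header) for one outbound call.

    The trace ID is carried over unchanged, the caller's span ID becomes the
    parent, and the outbound call gets a fresh span ID of its own.
    """
    match = _TRACEPARENT.match(inbound_traceparent)
    if match is None:
        raise ValueError("not a traceparent value: %r" % (inbound_traceparent,))
    version, trace_id, caller_span_id, flags = match.groups()
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    span = {"trace_id": trace_id, "span_id": span_id, "parent_id": caller_span_id}
    header = "-".join((version, trace_id, span_id, flags))
    return span, header
```

Calling this twice for the same inbound request, say for an original attempt and its retry, yields two distinct span IDs that share one trace ID and one parent, so there is nothing left to infer from timestamps.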
So, how do you observe this newfangled Kubernetes thing? The same way we observe all the other distributed systems, through distributed tracing, associating the work done by services with the context needed to understand failure.
Still, this isn’t much help if you’re new to distributed systems, which is why the Istio project is particularly interesting. It’s a “batteries included” way of running a distributed application on Kubernetes. By checking out the distributed tracing functionality built into Istio, you can start to get a more intuitive sense of how all this works.