So, you’ve gone “cloud native”: you’re running apps in containers, you’re scheduling them with Kubernetes, and now you’re trying to figure out what the heck is going on. It’s the third time in a month that your customers have seen two minutes of error pages, and then it’s just fixed. That’s plenty long enough for them to complain on Twitter, and besides that, it’s just embarrassing. When you were deploying on VMs with Chef, at least you knew what software was running where. Now with services, pods, and deployments, how do you know which software is handling a customer request?
Kubernetes provides a new and powerful way to deploy and manage software, but it also requires new ways of thinking about software as a resource — and making that resource observable.
Pitfall #1: Reconstructing the Customer Experience From Logs
You’ve been doing this software engineering thing for a while. You’ve built log aggregation (and it works). With a standard logging library and structured logging, you’ve been on the cutting edge of getting information out of systems. And now your software is on Kubernetes and nothing makes sense.
Before, you had a few long-running processes that were probably still running when you found something going wrong. Now with pods, by the time you see an issue, there’s a good chance the pod has exited and the resource has been scheduled somewhere else.
When there’s a failure or set of failures, it’s a furious series of distributed greps to try to catch the issue before it changes again, and to figure out what correlates and what doesn’t. And that’s just the application logs; you’ve also got cloud load balancer logs, edge load balancer logs, and service mesh load balancer logs. How are other developers (in the worst case, all the developers) supposed to reason about this when stuff’s on fire?
Tracing — Because Kubernetes Logs Are Not Enough
You’ve fallen into the Pit of Contextual Logging on Ephemeral Systems. PoCLoES for short. PoCLoES (which rhymes with “oh noes”) lies in the path of those who have been able to get by via smart people processing lots of information in their expertise domain, usually by hand. As any particular piece of software becomes more ephemeral, and more varied in terms of what versions are running at the same time, the information changes too quickly to do hypothesis investigation or to maintain accurate mental models of what’s in play.
The shortest way out of PoCLoES is distributed tracing. All of the essential information about how a request was handled, whether it experienced errors or slowdowns, and what software (at what version) it touched is assembled into a coherent relational graph. The trace is not just a mental model of how the systems interacted; it’s what actually happened. Depending on how deep you are in the pit, it can take a while to build out the context propagation and systems integration necessary to see everything that a request touched, but once you have even service-mesh-based views, you can start reasoning much more quickly. If you use something like LightStep Correlations to generate hypotheses to investigate, you can move even faster.
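To make "context propagation" a little more concrete, here is a minimal sketch in plain Python. It is not how any particular tracing library works. The header names (x-trace-id, x-parent-span-id), the Span fields, and the service names and versions are all made up for illustration; a real setup would more likely use a tracing SDK and a standard header format. The idea it shows is that each service records a span, passes the trace ID along on outgoing requests, and the spans are later stitched back into a graph of what the request actually touched.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    """One unit of work done by one service while handling a request."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str
    version: str
    error: bool = False


def start_span(service: str, version: str,
               headers: Optional[Dict[str, str]] = None) -> Span:
    """Start a span, continuing the caller's trace if the headers carry one."""
    headers = headers or {}
    # Hypothetical header names, chosen only for this sketch.
    trace_id = headers.get("x-trace-id") or uuid.uuid4().hex
    parent_id = headers.get("x-parent-span-id")
    return Span(trace_id, uuid.uuid4().hex[:16], parent_id, service, version)


def outgoing_headers(span: Span) -> Dict[str, str]:
    """Headers to attach to any request made while this span is active."""
    return {"x-trace-id": span.trace_id, "x-parent-span-id": span.span_id}


def render_trace(spans: List[Span]) -> List[str]:
    """Return indented lines showing which service called which."""
    children: Dict[Optional[str], List[Span]] = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    lines: List[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for span in children.get(parent_id, []):
            flag = " ERROR" if span.error else ""
            lines.append("  " * depth + f"{span.service}@{span.version}{flag}")
            walk(span.span_id, depth + 1)

    walk(None, 0)  # root spans have no parent
    return lines


# Simulate one request passing through three hypothetical services.
edge = start_span("edge", "1.4.0")
cart = start_span("cart", "2.1.0", outgoing_headers(edge))
payments = start_span("payments", "3.0.1", outgoing_headers(cart))
payments.error = True

print("\n".join(render_trace([edge, cart, payments])))
# edge@1.4.0
#   cart@2.1.0
#     payments@3.0.1 ERROR
```

Even this toy version answers the question from the top of the article: which software, at which version, handled this customer request, and where did it fail? The pod that served the request can be long gone and the answer is still there.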
Pitfall #2: Assuming Responsive Means Healthy
As your one service becomes two services, then ten, then 50, every service ends up making requests to (at least) two or three other services. The service mesh and load balancers require a health check, so you wrote up an endpoint that returns 200. What could go wrong? The same goes for your REST client code: what did the service respond with? It’s a 200, so let’s just insert into memcache and do business logic, right?
And then software starts changing — a lot. Another team deployed a breaking change to their API. Now, they didn’t think it was breaking, but none of your customers can check out, so probably that’s broken. Maybe a key in the response that’s used for monthly batch processing got pulled since “no one’s used it in the last 2 weeks.” Or perhaps, and this happens often, the health check endpoint stays up just fine, while the application itself is unable to reach other services or data stores.
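One cheap defense is to stop treating a 200 as proof that the response is usable. The sketch below assumes a hypothetical order API whose responses must carry an order_id string and a total_cents integer. The contract, the key names, and the ContractError type are invented for illustration, so swap in whatever your client code actually depends on.

```python
import json
from typing import Any, Dict

# Hypothetical contract: the keys this client relies on and their types.
REQUIRED_KEYS = {"order_id": str, "total_cents": int}


class ContractError(Exception):
    """The upstream answered, but not with something we can safely use."""


def parse_order(status: int, body: str) -> Dict[str, Any]:
    """Validate an upstream response before caching or acting on it."""
    if status != 200:
        raise ContractError(f"unexpected status {status}")
    try:
        data = json.loads(body)
    except json.JSONDecodeError as exc:
        raise ContractError(f"body is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ContractError("body is not a JSON object")
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in data:
            raise ContractError(f"missing key {key!r}")
        if not isinstance(data[key], expected_type):
            raise ContractError(f"key {key!r} is not {expected_type.__name__}")
    return data


# A 200 whose body lost a key we depend on is rejected before it is cached.
try:
    parse_order(200, '{"order_id": "a1"}')
except ContractError as err:
    print(f"rejected: {err}")
# rejected: missing key 'total_cents'
```

The point is to fail loudly at the boundary, where the error names the missing key, rather than deep in business logic or in next month’s batch job.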
What Does K8s Service Health Actually Mean?
You’ve fallen into the Pit of Eventually Consistent Networked Software (PoECoNS). Working on Kubernetes means continually working in an environment that is eventually consistent in its software, its configuration, even its networking. There is a lot of smart design that goes into Kubernetes to at least give you the tools to handle this well, but it’s still really easy to bypass a protection, like health checking, because the consequences aren’t obvious.
To get out of PoECoNS, you’ll need to think hard about what it means for a service to be healthy (enough) to receive requests from other services. What happens when it’s degraded? What happens when everything is degraded? What happens when everything is down? How does the health check endpoint keep up with changing dependencies? None of these questions are easy, and, to be honest, it’s easier to just send requests to everything that’s healthy enough to show as running. But, as software changes more and more, as you start running canaries or multiple versions of software at the same time, it becomes necessary to understand what “healthy” or “available” actually means, so that a bad software deploy doesn’t lead to waves of 500s for your customers.
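As a starting point for those questions, here is a minimal sketch of a health check that looks at dependencies instead of just answering 200. The three states, the critical versus non-critical split, and the dependency names are assumptions made for this example, not a Kubernetes convention. The real checks would be whatever cheap probes make sense for your data stores and downstream services.

```python
from enum import Enum
from typing import Callable, Dict, List, Tuple


class Health(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"  # still worth sending traffic to
    DOWN = "down"          # should be taken out of rotation


# Maps a dependency name to (check function, is_critical).
Checks = Dict[str, Tuple[Callable[[], bool], bool]]


def evaluate(checks: Checks) -> Tuple[Health, List[str]]:
    """Run every dependency check and summarize the service's health."""
    failed: Dict[str, bool] = {}
    for name, (check, critical) in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a check that blows up counts as a failure
        if not ok:
            failed[name] = critical
    if any(failed.values()):
        health = Health.DOWN
    elif failed:
        health = Health.DEGRADED
    else:
        health = Health.HEALTHY
    return health, sorted(failed)


def http_status(health: Health) -> int:
    """Status code for the health endpoint: only DOWN fails the probe."""
    return 503 if health is Health.DOWN else 200


# Hypothetical dependencies: the database is critical, the cache is not.
health, failed = evaluate({
    "database": (lambda: True, True),
    "cache": (lambda: False, False),
})
print(health.value, failed, http_status(health))
# degraded ['cache'] 200
```

Which dependencies count as critical is exactly the judgment call described above. One thing to watch: if every service marks every dependency as critical, a single shared outage can take all of them out of rotation at once, so it is worth deciding deliberately what “degraded” is allowed to mean.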