Kubernetes is changing the infrastructure landscape by enabling platform teams to scale much more effectively. Rather than convincing every team to secure VMs properly and manage network devices, and then following up to make sure they’ve done so, platform teams can now hide all of these details behind a K8s abstraction. This lets both application and platform teams move more quickly: application teams because they don’t need to know all the details, and platform teams because they are free to change them.
There are some great tutorials and online courses on the Kubernetes website. If you’re new to Kubernetes, you should definitely check these out. But it’s also helpful to understand what not to do.
Not Budgeting for Maintenance
The first big failure mode with Kubernetes is not budgeting for maintenance. Just because K8s hides a lot of details from application developers doesn’t mean that those details aren’t there. Someone still needs to allocate time for performing upgrades, setting up monitoring, and provisioning new nodes.
You should budget for (at least) quarterly upgrades of master and node infrastructure. At a minimum, upgrade your Kubernetes infrastructure as frequently as you were upgrading your VM images.
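The drain-and-upgrade cycle for a single node can be sketched as follows; the node name is a placeholder, and the actual upgrade step depends on your provider or distribution:

```shell
# Safely evict pods before touching the node (node name is a placeholder).
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data

# ...upgrade the node's OS, container runtime, and kubelet here
# (the exact mechanism is provider- or distribution-specific)...

# Allow the scheduler to place pods on the node again.
kubectl uncordon node-1

# Verify that the node is Ready and running the expected versions.
kubectl get nodes -o wide
```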
Who should do these upgrades? If you’re going to roll out K8s in a way that is standard across your org (which you should be doing!), this needs to be your infrastructure team.
Moving Too Fast
The second big failure mode is that teams move so quickly they forget that adopting a new paradigm for orchestrating services creates new challenges around observability. Not only is a move to K8s often coincident with a move to microservices (which necessitates new observability tools), but pods and other K8s abstractions are often shorter-lived than traditional VMs, meaning that the way telemetry is gathered from the application also needs to change.
The solution is to build in observability from day one. This includes instrumenting code so that you can take both a user-centric and an infrastructure-centric view of requests, and understanding how instrumentation data is transmitted, aggregated, and analyzed. Waiting until you’ve experienced an outage is too late to address these issues: it will be virtually impossible to get the data you need to understand and remediate that outage.
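As a minimal sketch of what “user-centric and infrastructure-centric” means in practice, the toy tracer below (names like `span` and `SPANS` are invented for illustration, not a real tracing API) tags every timed operation with both a request ID and a service name, so the same data can be sliced either way:

```python
import time
import uuid
from contextlib import contextmanager

# Collected spans; a real system would export these to a tracing backend.
SPANS = []

@contextmanager
def span(operation, request_id, **tags):
    """Record one timed operation, keyed by a request-scoped ID so that
    per-request (user-centric) and per-service (infrastructure-centric)
    views can both be derived from the same data."""
    start = time.monotonic()
    try:
        yield
    finally:
        SPANS.append({
            "operation": operation,
            "request_id": request_id,
            "duration_ms": (time.monotonic() - start) * 1000,
            **tags,
        })

# Example: one request passing through two "services".
req = str(uuid.uuid4())
with span("checkout", req, service="frontend"):
    with span("charge_card", req, service="payments"):
        time.sleep(0.01)  # stand-in for real work

user_view = [s for s in SPANS if s["request_id"] == req]       # per request
infra_view = [s for s in SPANS if s["service"] == "payments"]  # per service
```

Real systems would use a standard tracing library rather than hand-rolled spans, but the key design point carries over: every measurement must carry enough context to be aggregated along either axis.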
Not Accounting for Infrastructure
With all the hype around Kubernetes (and, of course, its many, many benefits), it’s easy to assume that it will magically solve all your infrastructure problems. What’s great about K8s is that it goes a long way toward isolating those problems (so that platform teams can solve them more effectively), but they’ll still be there, in need of a solution.
So in addition to managing OS upgrades, vulnerability scans, and patches, your infrastructure team will also need to run, monitor, and upgrade K8s master components (API server, etcd) as well as all of the node components (Docker, kubelet). If you choose a managed K8s solution, then a lot of that work will be taken care of for you, but you still need to initiate master and node upgrades. And even if they are easy, node upgrades can still be disruptive: you’ll want to make sure you have enough capacity to move services around during the upgrade. While it’s good news that application developers no longer need to think about these issues, the platform team (or someone else) still does.
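One concrete way to keep services available while nodes are drained for an upgrade is a PodDisruptionBudget, which bounds how many replicas an eviction can take down at once. A minimal example (the `checkout` names are hypothetical):

```yaml
# Keeps at least 2 replicas of the "checkout" service running
# while nodes are drained during an upgrade.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: checkout
```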
Not Embracing the K8s Community
The K8s community is an incredible resource whose value really can’t be overstated. Kubernetes is certainly not the first open source orchestration tool, but it has a vibrant and quickly growing community, and that community is what powers the continued development of K8s and its steady stream of new features.
The platform boasts thousands of contributors, including all major cloud providers and dozens of tech companies. If you have questions or need help, it’s almost guaranteed that you can find the answer on GitHub or Slack, or find someone who can point you in the right direction.
And last, but certainly not least, contributing to and being a part of the community can be a great way to meet other developers who might one day become members of your team.
Not Thinking Through “Matters of State”
Of course, how you divide your application into smaller services is a critical decision to get right. But for K8s specifically, it’s really important to think about how you are going to handle state: whether it’s using StatefulSets, leveraging your provider’s block storage devices, or moving to a completely managed storage solution, implementing stateful services correctly the first time around is going to save you huge headaches.
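For reference, a StatefulSet with a `volumeClaimTemplates` section gives each replica a stable network identity and its own persistent volume, which is the usual starting point for stateful services on K8s. A minimal sketch (image, sizes, and names are illustrative):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db          # headless Service providing stable pod DNS names
  replicas: 3
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
        - name: db
          image: postgres:16   # example image
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:        # each replica gets its own PersistentVolumeClaim
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```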
It’s all too easy to get burned by a corrupted shard in a database or other storage system, and recovering from these sorts of failures is by definition more complex when running on K8s. Needless to say, make sure you are testing disaster recovery for stateful services as they are deployed on your cluster, not just trusting that everything will work as it did before you moved to K8s.
Not Accounting for Migration
Another important item to address is what your application looked like before you began implementing K8s. Did it already have hundreds of services? If so, your biggest concern should be understanding how to migrate those services in an incremental but seamless way.
Are you just breaking the first few services off of your monolith? Making sure you have the infrastructure to support an influx of services is going to be critical to a successful implementation.
Wrapping Up: The Need for Distributed Tracing
K8s and other orchestration and coordination tools like service mesh are really only half the story. They provide flexibility in how services are deployed as well as the ability to react quickly, but they don’t offer insight into what’s actually happening in your services.
The other half is about building that insight and understanding how performance is impacting your users. That’s where LightStep comes in: by linking request metrics and logs across services, we enable teams to understand how a bad deployment in one service is affecting users 5, 10, or 100 services away.