Exploring What Kubernetes Observability Might Look Like for SRE and Operations Teams in the Future
In case you're not a close reader of Kubernetes release notes, v1.25 includes support for distributed tracing of the kubelet, one of its core components, using OpenTelemetry. Expanding on existing support for tracing in the Kubernetes API Server and container runtimes, it's now possible to explore what Kubernetes observability might look like for SRE and operations teams in the future. This blog will provide an overview of the setup and dive into some workflows enabled by traces and metrics for internal Kubernetes components.
Before we get into the details, we should note that tracing workloads running in Kubernetes clusters has been around for many years. Think (micro)services, databases, and load balancers. But now, tracing is built in for the internal components that power Kubernetes itself. This means that operators who need to diagnose tricky performance issues have some powerful new solutions.
Get Started with Kubernetes Traces and Metrics
If you're a developer who wants to see OpenTelemetry traces for your application alongside some infrastructure or cluster metrics, the process usually looks something like this:
Add an OpenTelemetry library (like Lightstep's OpenTelemetry Java Launcher) to your code.
Run OpenTelemetry Collectors that can ingest infrastructure and cluster metrics from various sources (like Prometheus/OpenMetrics endpoints) and traces from your app.
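To make that second step concrete, here is a minimal sketch of what such a collector configuration might look like. The scrape target and backend endpoint are placeholders you'd swap for your own environment:

```yaml
# otel-collector-config.yaml -- a minimal sketch, not a production config
receivers:
  otlp:                     # traces sent by your instrumented application
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
  prometheus:               # infrastructure/cluster metrics via Prometheus scraping
    config:
      scrape_configs:
        - job_name: kube-apiserver
          static_configs:
            - targets: ["localhost:6443"]   # placeholder; real clusters need TLS/auth here

exporters:
  otlp:
    endpoint: ingest.example.com:443        # placeholder backend endpoint

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
    metrics:
      receivers: [prometheus]
      exporters: [otlp]
```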
For Kubernetes operators on the cutting edge who want to see traces and metrics about internal components, there are some differences:
Run OpenTelemetry collectors that can ingest traces from internal Kubernetes components and metrics exposed by OpenMetrics/Prometheus endpoints.
Toggle Kubernetes v1.25.0 feature gates and configure the API Server, kubelet, etcd, and container runtimes to send traces to the collector configured in step #1.
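As a rough sketch of what step #2 involves, the API Server and kubelet pieces look something like the following. Field names come from the upstream alpha APIs at the time of writing; the collector address and sampling rate are placeholders:

```yaml
# tracing-config.yaml -- passed to the API Server with
#   --feature-gates=APIServerTracing=true --tracing-config-file=/path/to/tracing-config.yaml
apiVersion: apiserver.config.k8s.io/v1alpha1
kind: TracingConfiguration
endpoint: localhost:4317          # OTLP/gRPC endpoint of the collector (placeholder)
samplingRatePerMillion: 1000000   # sample everything -- reasonable for a test cluster
```

```yaml
# KubeletConfiguration fragment enabling the new v1.25 KubeletTracing feature gate
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  KubeletTracing: true
tracing:
  endpoint: localhost:4317        # same collector endpoint (placeholder)
  samplingRatePerMillion: 1000000
```

etcd and the container runtime have their own knobs for sending traces, which we'll touch on below.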
Since the internal Kubernetes tracing features are under active development, we've published a working example on GitHub using minikube to make it easy to test this functionality on your laptop.
Now, for the fun part! Let's dive into what this actually looks like and what you can do with the data.
Investigate Performance with Metrics and Spans
Let's start with something familiar for Kubernetes operators and look at some Prometheus-format metrics that describe various internal Kubernetes systems like the API Server or kubelet.
Using the OpenTelemetry Collector, scrape the endpoints and send the metrics to an OpenTelemetry-compatible endpoint of your choice. Then, visualize and query them. For example, here's a Lightstep notebook that looks at the total number of API Server requests related to namespaces in a test cluster:
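Under the hood, that chart is built on a standard API Server metric; a roughly equivalent PromQL query, assuming the default `apiserver_request_total` counter is being scraped, would be:

```promql
# Request rate for API Server operations on the "namespaces" resource
sum(rate(apiserver_request_total{resource="namespaces"}[5m])) by (verb, code)
```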
With the experimental Kubernetes tracing features enabled, we can go much deeper by looking at the spans collected on the same requests.
This Lightstep Notebook chart uses a span query to visualize latency from the new data that the experimental Kubernetes tracing features generate. Here, we've customized it to visualize the p99 latency of API requests related to namespaces.
The green dots are trace exemplars that go deeper and allow us to understand what contributes to the latency of individual requests. Let's look at how one of the traces can help debug complex performance issues.
Dive into Kubernetes Internals with Traces
Earlier on, we configured multiple Kubernetes components for tracing. A classic use case for tracing is understanding how a database impacts the latency of an external API request. It's possible to do a similar exploration in Kubernetes with its API Server and etcd key-value store.
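(etcd ships its own experimental distributed tracing support in the 3.5 series; to include it in these traces, it's started with flags roughly like the ones below. They're marked experimental upstream, so names may change between releases.)

```sh
# collector endpoint and service name below are placeholders
etcd \
  --experimental-enable-distributed-tracing \
  --experimental-distributed-tracing-address=localhost:4317 \
  --experimental-distributed-tracing-service-name=etcd
```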
If you click a trace exemplar in the Notebook chart, you can now see how requests to etcd impact the latency of the overall request:
Each individual span in the trace has detailed attributes attached for further analysis and querying, like isolating suspicious latency issues to a particular node or cloud availability zone with a single click.
With the new tracing in v1.25, it's also possible to do a similar analysis of interactions between nodes, the kubelet, and container runtimes.
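If your nodes use containerd, for example, the runtime's side of those traces is wired up through its config file; a sketch along these lines (the plugin section names here reflect containerd 1.6-era documentation and may differ in your version):

```toml
# /etc/containerd/config.toml -- tracing-related sections only (sketch)
[plugins."io.containerd.tracing.processor.v1.otlp"]
  endpoint = "localhost:4317"   # OTLP endpoint of the collector (placeholder)

[plugins."io.containerd.internal.v1.tracing"]
  sampling_ratio = 1.0          # trace everything -- fine for a test cluster
  service_name = "containerd"
```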
Easier to Understand Clusters
We're excited about the direction (and hard work) the Kubernetes Instrumentation SIG has put into making Kubernetes itself more observable using OpenTelemetry. While the Kubernetes tracing features were still experimental at the time this blog post was written, they're going to make understanding what's happening in your cluster much easier. (While writing this blog post, we diagnosed a test cluster configuration issue using traces within the first five minutes of receiving telemetry.)
Want to learn more about how Lightstep extends visibility across Kubernetes and other cloud-native technologies to provide a clearer, more accurate picture of your microservices architecture?