
Exploring What Kubernetes Observability Might Look Like for SRE and Operations Teams in the Future

In case you're not a close reader of Kubernetes release notes, v1.25 includes support for distributed tracing of the kubelet, one of its core components, using OpenTelemetry. Together with existing support for tracing in the Kubernetes API Server and container runtimes, it's now possible to explore what Kubernetes observability might look like for SRE and operations teams in the future. This blog will provide an overview of the setup and dive into some workflows enabled by traces and metrics for internal Kubernetes components.

Before we get into the details, we should note that tracing workloads running in Kubernetes clusters has been possible for many years. Think (micro)services, databases, and load balancers. But now, tracing is built in for the internal components that power Kubernetes itself. This means that operators who need to diagnose tricky performance issues have some powerful new solutions.

Get Started with Kubernetes Traces and Metrics

If you're a developer who wants to see OpenTelemetry traces for your application alongside some infrastructure or cluster metrics, the process usually looks something like this:

  1. Add an OpenTelemetry library (like Lightstep's OpenTelemetry Java Launcher) to your code.

  2. Run OpenTelemetry Collectors that can ingest infrastructure and cluster metrics from various sources (like Prometheus/OpenMetrics endpoints) and traces from your app (a minimal Collector configuration is sketched below).
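To make step #2 concrete, here's a minimal sketch of a Collector configuration that receives OTLP traces from an application and scrapes a Prometheus endpoint. The scrape target and backend endpoint are illustrative placeholders, not values from our example:

```yaml
# Minimal OpenTelemetry Collector configuration (sketch).
# The scrape target and export endpoint are illustrative placeholders.
receivers:
  otlp:                 # traces sent by instrumented applications
    protocols:
      grpc:
  prometheus:           # metrics scraped from Prometheus/OpenMetrics endpoints
    config:
      scrape_configs:
        - job_name: 'example-app'
          static_configs:
            - targets: ['app.example.svc:9090']   # placeholder target

exporters:
  otlp:
    endpoint: ingest.example.com:443   # placeholder OTLP-compatible backend

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
    metrics:
      receivers: [prometheus]
      exporters: [otlp]
```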

For Kubernetes operators on the cutting edge who want to see traces and metrics about internal components, there are some differences:

  1. Run OpenTelemetry collectors that can ingest traces from internal Kubernetes components and metrics exposed by OpenMetrics/Prometheus endpoints.

  2. Toggle Kubernetes v1.25.0 feature gates and configure the API Server, kubelet, etcd, and container runtimes to send traces to the collector configured in step #1 (an example configuration is sketched below).
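For step #2, here's a sketch of what the kubelet and API Server configuration can look like as of v1.25. These fields are alpha and may change between releases, and the collector address is a placeholder:

```yaml
# KubeletConfiguration (sketch): enable the alpha KubeletTracing feature gate
# and point the kubelet at an OTLP/gRPC collector. Alpha fields; may change.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  KubeletTracing: true
tracing:
  endpoint: localhost:4317          # placeholder collector address
  samplingRatePerMillion: 1000000   # sample everything; fine for a test cluster
---
# Separate file: TracingConfiguration passed to the API Server via
# --tracing-config-file (the APIServerTracing feature gate must also be on).
apiVersion: apiserver.config.k8s.io/v1alpha1
kind: TracingConfiguration
endpoint: localhost:4317
samplingRatePerMillion: 1000000
```

Container runtimes are configured separately; containerd, for example, has its own OTLP tracing settings in its config file.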

Since the internal Kubernetes tracing features are under active development, we've published a working example on GitHub using minikube to make it easy to test this functionality on your laptop.

Now, for the fun part! Let's dive into what this actually looks like and what you can do with the data.

Investigate Performance with Metrics and Spans

Let's start with something familiar for Kubernetes operators and look at some Prometheus-format metrics that describe various internal Kubernetes systems like the API Server or kubelet.

Using the OpenTelemetry Collector, scrape the endpoints and send the metrics to an OpenTelemetry-compatible endpoint of your choice. Then, visualize and query them. For example, here's a Lightstep notebook that looks at the total number of API Server requests related to namespaces in a test cluster:
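As a sketch, the Collector's Prometheus receiver can scrape the API Server's /metrics endpoint from inside the cluster using the conventional service account credentials; a chart like the one below is built on a counter like apiserver_request_total, filtered to namespace-related requests:

```yaml
# Sketch: scraping the API Server's /metrics endpoint from inside the cluster.
# Uses the standard in-cluster service account token and CA certificate.
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'kubernetes-apiservers'
          scheme: https
          kubernetes_sd_configs:
            - role: endpoints
          tls_config:
            ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
          relabel_configs:
            # Keep only the default/kubernetes HTTPS endpoint, i.e. the API Server.
            - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
              action: keep
              regex: default;kubernetes;https
```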

[Screenshot: Lightstep notebook charting API Server namespace request counts]

With the experimental Kubernetes tracing features enabled, we can go much deeper by looking at the spans collected for the same requests.

[Screenshot: span query chart of API Server namespace request latency]

This Lightstep Notebook chart uses a span query to visualize latency from the new data generated by the experimental Kubernetes tracing features. Here, we've customized it to visualize the p99 latency of API requests related to namespaces.

The green dots are trace exemplars that let us go deeper and understand what contributes to the latency of individual requests. Let's look at how one of these traces can help debug complex performance issues.

Dive into Kubernetes Internals with Traces

Earlier on, we configured multiple Kubernetes components for tracing. A classic use case for tracing is understanding how a database impacts the latency of an external API request. It's possible to do a similar exploration in Kubernetes with its API Server and etcd key-value store.
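As a reference point, etcd (v3.5+) can export OTLP traces via experimental flags. Here's a sketch of the relevant arguments in a static pod manifest; the flag names are experimental, and the collector address and image tag are placeholders:

```yaml
# Sketch: enabling etcd's experimental distributed tracing in a static pod
# manifest (e.g. /etc/kubernetes/manifests/etcd.yaml). Flags are experimental
# in etcd v3.5 and may change in later releases.
apiVersion: v1
kind: Pod
metadata:
  name: etcd
  namespace: kube-system
spec:
  containers:
    - name: etcd
      image: registry.k8s.io/etcd:3.5.4-0   # illustrative image tag
      command:
        - etcd
        # ... existing etcd flags ...
        - --experimental-enable-distributed-tracing=true
        - --experimental-distributed-tracing-address=localhost:4317   # placeholder
        - --experimental-distributed-tracing-service-name=etcd
```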

If you click a trace exemplar in the Notebook chart, you can now see how requests to etcd impact the latency of the overall request:

[Screenshot: trace view showing etcd request spans within an API Server request]

Each individual span in the trace has detailed attributes attached for further analysis and querying, like isolating suspicious latency issues to a particular node or cloud availability zone with a single click.

With the new tracing in v1.25, it's also possible to do a similar analysis of interactions between nodes, the kubelet, and container runtimes.

Easier-to-Understand Clusters

We're excited about the direction of (and hard work put in by) the Kubernetes Instrumentation SIG to make Kubernetes itself more observable using OpenTelemetry. While the Kubernetes tracing features were still experimental at the time this blog post was written, they're going to make understanding what's happening in your cluster much easier. (While writing this post, we diagnosed a test cluster configuration issue using traces within the first five minutes of receiving telemetry.)

Want to learn more about how Lightstep extends visibility across Kubernetes and other cloud-native technologies to provide a clearer, more accurate picture of your microservices architecture?

Interested in joining our team? See our open positions here.
October 19, 2022
4 min read
DevOps Best Practices


About the author

Clay Smith
