With a little over 6 weeks to go until the Observability Practitioners Summit, part of the Day Zero lineup at KubeCon + CloudNativeCon North America, the team at LightStep has been eagerly perusing the schedules of both events. Not only is the overall lineup this year exciting, but there are some topics that we just can’t wait to see covered in a presentation. Here’s a look at some of our top picks for both events.
Talks at the Observability Practitioners Summit (OPS)
SlackTrace: A New Tracing Tool
Suman Karumuri, Slack
Trace data contains very rich information about a request execution. However, current tracing tools only expose that information as a trace view or a service graph, which severely limits the questions we can ask of trace data and diminishes the utility of tracing. From past experience, we found that these limitations arise because, unlike logs or metrics, we can’t query raw trace data.
To query raw trace data easily, we designed a new span format called SpanEvent and built our tracing infrastructure, called SlackTrace, around it. In addition to presenting the trace data as a trace view and a service graph, the SpanEvent format allows us to query raw span data using SQL, which lets us derive rich insights from trace data that are not possible with existing tracing systems. In this talk, I will present the SpanEvent format and an overview of our SlackTrace infrastructure.
When Connections are Magic: Understanding Performance in Serverless
James Burns, LightStep
While working to understand API performance in serverless environments at major cloud providers, we saw that runtimes may modify HTTP request behavior by providing “pre-connected” clients, leading to significant performance differences – as much as 4x in some cases. Integrating information from net/http/httptrace in Go into distributed traces led to invaluable insights into how and why connections (and hence entire transactions) performed differently in many environments.
Working with many modern systems means network connections, many of them. Understanding how those connections impact your customer’s experience can be difficult. Distributed tracing helps isolate what parts of the system are failing, but when it is implemented only at the RPC level, the reasons for and scope of network-induced issues can be lost. We provide specific examples and source code to get this visibility in your applications.
A Picture is Worth 1,000 Traces
Yuri Shkuro, Uber Technologies
Distributed tracing has emerged as the go-to solution for understanding what’s going on in ever-changing cloud native architectures. A single trace can reveal many things: network latencies, time spent in databases, a service spinning idly, etc., but finding the right trace among billions that demonstrates a problem in a large distributed application is very hard. By looking at traces in aggregate, we can eliminate the need to state and validate hypotheses, and instead answers start to emerge naturally, especially when we use creative visualizations that put our visual cortex to work without overloading it with useless information. This talk will present the power of aggregate analysis of distributed traces by highlighting its applications beyond performance troubleshooting.
Reliable Observability at Scale: Error Budgets for 1,000+
Fred Moyer, Zendesk
Observability and reliability engineering have been on a convergent course for several years. Error Budgets joined the reliability lexicon of engineering organizations in 2016 with the release of the SRE book. The intersection of observability and reliability has largely been the domain of specialists for practical implementation. How can one democratize these techniques to put them in the hands of a thousand engineers at once?
At Zendesk we developed simple algorithms and practical approaches for implementing SLIs, SLOs, and Error Budgets at scale using a number of observability tools. This talk will show the approaches developed and how we were able to manage observability instrumentation across dozens of teams quickly in a complex ecosystem (CDN, UI, middleware, backend, queues, dbs, etc.).
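For readers new to the terms, the arithmetic behind an error budget can be sketched in a few lines of Go. This is the generic textbook calculation, not Zendesk's algorithm (the abstract does not describe it); the ErrorBudget type, the ComputeErrorBudget function, and the sample numbers are all hypothetical.

```go
package main

import "fmt"

// ErrorBudget summarizes SLO consumption over one window (hypothetical type).
type ErrorBudget struct {
	Allowed   float64 // bad events the SLO tolerates in the window
	Consumed  int64   // bad events actually observed
	Remaining float64 // fraction of the budget left; negative means overspent
}

// ComputeErrorBudget derives the budget from an SLO target (e.g. 0.999)
// and the total/bad event counts that make up the SLI.
func ComputeErrorBudget(sloTarget float64, total, bad int64) ErrorBudget {
	allowed := (1 - sloTarget) * float64(total)
	b := ErrorBudget{Allowed: allowed, Consumed: bad}
	if allowed > 0 { // avoid dividing by zero for a 100% target or an empty window
		b.Remaining = 1 - float64(bad)/allowed
	}
	return b
}

func main() {
	// Assumed example: 99.9% availability SLO, one million requests, 250 failures.
	b := ComputeErrorBudget(0.999, 1_000_000, 250)
	fmt.Printf("allowed=%.0f consumed=%d remaining=%.0f%%\n",
		b.Allowed, b.Consumed, b.Remaining*100)
}
```

With these sample numbers the SLO tolerates about 1,000 failed requests, so 250 failures leave roughly 75% of the budget unspent.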
Talks at KubeCon
Beyond Getting Started: Using OpenTelemetry to Its Full Potential
Sergey Kanzhelev (Microsoft) and Morgan McLean (Google)
OpenTelemetry is a cloud-native set of APIs and libraries used to generate, collect, and export telemetry from distributed systems. This session goes beyond a basic introduction, and demonstrates how you can customize OpenTelemetry’s components and architecture for the unique needs of your app. Attendees will learn how to set up and configure built-in data collectors, how to write their own instrumentation, how to extend and enrich automatically collected telemetry with app-specific information, and how to send this data to Prometheus and Jaeger for analysis.
Deep Linking Metrics and Traces with OpenTelemetry, OpenMetrics, and M3
Rob Skillington, Chronosphere
Metrics and traces are two pillars of Observability and are often used in a complementary fashion. Metrics can give you a high-level view of an application’s responses and performance, and tracing can give you a detailed view of requests through applications. Often when using metrics in graphs or alerts you want to be able to jump to an example of a request represented by a given metric datapoint, which is difficult to do today. In this talk we show an example of this using an OpenTelemetry exporter to publish trace IDs as exemplars using the OpenMetrics exposition format.
We then walk through configuring Jaeger as a tracing backend and M3 as a metrics backend to store the trace ID alongside a datapoint. We show how it is then possible to go from a metrics graph that visualizes the latency of your application to a trace that fell into a latency bucket using the deep link of the trace ID.
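As a rough illustration of what an exemplar looks like on the wire, the sketch below formats a single histogram bucket sample in the OpenMetrics text format, where an exemplar follows the sample value after a `#`. The exemplarLine helper, the metric name, and the trace ID are hypothetical; the `trace_id` label name follows the convention used in OpenMetrics examples and is an assumption about what the exporter in the talk emits.

```go
package main

import (
	"fmt"
	"strconv"
)

// exemplarLine renders one histogram bucket sample followed by an exemplar
// that carries the trace ID of a request counted in that bucket.
// Hypothetical helper; it assumes label values need no special escaping.
func exemplarLine(metric, le string, count uint64, traceID string, observed float64) string {
	return fmt.Sprintf("%s{le=%q} %d # {trace_id=%q} %s",
		metric, le, count, traceID,
		strconv.FormatFloat(observed, 'g', -1, 64))
}

func main() {
	// Assumed sample: 42 requests at or under 0.5s, one of which took 0.43s.
	fmt.Println(exemplarLine(
		"http_request_duration_seconds_bucket", "0.5", 42, "4bf92f3577b34da6", 0.43))
}
```

A metrics backend that stores the exemplar alongside the datapoint can then turn the trace ID into a deep link to the matching trace in the tracing backend.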
OpenTelemetry: The First Release, What’s Next, and How to Get Involved
Morgan McLean (Google), Tristan Sloughter (Postmates), Sergey Kanzhelev (Microsoft), and Chris Kleinknecht (Google)
Earlier this year, the OpenCensus and OpenTracing communities merged to form OpenTelemetry, the first version of which will be released at KubeCon. OpenTelemetry provides libraries and agents that capture metrics and distributed traces from your applications and send them to backends like Prometheus, Zipkin, and Jaeger. The project is backed by a large community of end-user developers and the majority of cloud and APM vendors. We’re always interested in welcoming more people to the project! In this session we will cover:
- What’s included in the v1 release, the project’s overall status and production readiness
- Community structure, including governance, SIGs, and how to get involved
- Recent integrations with various frameworks, clients, and Kubernetes itself!
- Related projects like W3C TraceContext
- What we’re working on next, including more languages, more integrations, and logs
Application Observability for DevSecOps
Sabree Blackmon, Docker
Technical teams are eagerly “pushing left” – supply-chain concerns and security testing are now first-class members of the development lifecycle. But, what do we do about security as an operational concern? How do we give our operators the data they need to respond to application security events? Luckily, these observability tools already exist.
In this talk, Sabree will demonstrate how we can securely tool our applications with metrics and tracing, using OpenTelemetry, to provide valuable security heuristics and auditing to our operators. First, we’ll ask, what kind of data is useful? Then, he’ll show how to set up Prometheus alerting around metrics and give an overview of anomaly detection and forecasting. Lastly, we’ll exercise a test application, and show how real-time tracing can be used to quickly spot and investigate adversarial and anomalous requests.
If you plan on attending, don’t wait to register for KubeCon + CloudNativeCon North America. If you’re already registered, don’t forget to check out the Observability Practitioners Summit on Day Zero, and add passes by logging in to your KubeCon registration profile.