OpenTelemetry and tracing and logs, oh my!
by Ted Young
Welcome to my OpenTelemetry log blog, wherein I mix the latest updates about the state of logging with good advice and gentle ranting.
Most of the effort to date has been focused on adding log processing as part of our telemetry pipeline. This work is moving along at a healthy clip. OpenTelemetry now has its own log data model, which allows log data to be combined with traces and metrics as part of OTLP, our unified observability protocol. Basic log processing has been added, allowing the Collector to act as a format exchange. Currently, logs can be received as either OTLP or Fluent Forward, and then be exported in a variety of formats. Exporters currently exist for the below list:
There is also experimental support for FluentBit, which you can read about here.
While exporting logs is useful, it is still a fairly simple form of log processing. However, the project received a huge shot in the arm with the donation of Stanza, a robust log processor written in Go. Stanza will be integrated into the Collector over the coming months, which will add a rich toolkit for log processing and turn the OpenTelemetry Collector into a state-of-the-art logging agent.
If you’re interested in log processing, I recommend joining the Logging SIG to track the integration of Stanza with OpenTelemetry. But what else can be done in the meantime? We’re not ready to work on an OpenTelemetry Logging API – honestly, it’s not clear what exactly the world needs on that front. So here’s where things get interesting.
The truth is, you can already log effectively with OpenTelemetry today by using the tracing API. Spans don’t just record the start and finish time of an operation; they also record all of the events which occurred. And “recording all of the events which occurred” is about the most precise definition of logging that you can fit into a single sentence.
Traces are basically logs with all of the context you wish you had, plus extra timing data. I got into tracing myself via logging: working on large systems, I was sick of all the detective work I had to do just to find all of the logs I needed to understand what happened. I wanted to focus on why it was happening. Any time spent collecting the data was wasted time, and time is precious.
Since OpenTelemetry already has a tracing system, and you presumably already have a logging system, there are a number of ways you can make a delicious peanut butter and jelly sandwich out of the two. This is an area where a lot of low-hanging fruit exists, just waiting to be baked into a delicious pie of integrations. (Pretty sure I was hungry when I wrote this.)
Finding the right logs is difficult, even if you’re using a logging tool that can index and search your logs. But search by what index? This is the core annoyance I have with logging - there’s no key you can search by which finds all of the logs in a transaction, and only the logs in a transaction. I know that’s true because if you try to add this transaction ID to your logging system, guess what? You’re building a tracing system. Which, trust me, please trust me, is a lot of work.
But as they say, we’ve already got one. You can just install OpenTelemetry to get your tracing, and thus your TraceIDs and your SpanIDs (not to mention any baggage you might want to add).
Adding these IDs to your existing logging system is easy enough. Just write a logger integration. Every time a log is created, get the current span from the tracer and add its SpanID and TraceID to your log in whatever format suits you best. If you want more context beyond those two IDs – a ProjectID for example – you can add the ProjectID to your Baggage. It’s not even necessary to export the trace data or add any further overhead – you can just leverage the tracing and context propagation to improve your existing log setup.
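To make that concrete, here is a rough sketch of such an integration using Python's standard logging module. To keep it runnable without installing anything, a context variable and a small SpanContext class stand in for "get the current span from the tracer"; both are hypothetical names of mine, not OpenTelemetry APIs, and the log format is just one that suited me.

```python
import contextvars
import io
import logging
from dataclasses import dataclass


@dataclass(frozen=True)
class SpanContext:
    """Hypothetical stand-in for the tracer's view of the current span."""
    trace_id: int  # 128-bit identifier shared by the whole transaction
    span_id: int   # 64-bit identifier of the current operation


# Stand-in for asking the tracer for the current span (an assumption of this sketch).
_current_span = contextvars.ContextVar("current_span", default=None)


class TraceContextFilter(logging.Filter):
    """Stamp every log record with the active TraceID and SpanID."""

    def filter(self, record):
        ctx = _current_span.get()
        # Hex-encode the IDs; use "-" when no span is active.
        record.trace_id = format(ctx.trace_id, "032x") if ctx else "-"
        record.span_id = format(ctx.span_id, "016x") if ctx else "-"
        return True  # never drop the record, only annotate it


def build_logger(stream):
    """Build a logger whose output lines carry trace_id and span_id fields."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"))
    handler.addFilter(TraceContextFilter())
    logger = logging.getLogger("traced-app")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


out = io.StringIO()
log = build_logger(out)

token = _current_span.set(SpanContext(trace_id=0xABC, span_id=0x123))
log.info("charging card")       # logged inside a span
_current_span.reset(token)
log.info("outside any span")    # logged with no active span
```

With a real tracer installed, the only part that should need to change is the lookup inside the filter. Everything downstream of the logger keeps working, except that now every line can be searched by TraceID.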
For example, here’s a log4j integration. I would love to see more of these logging integrations added to OpenTelemetry – that would be a great way to get started as a contributor. Please send me a shout out if you build one.
You can also go the other direction, and add your logs to your traces. Again, just write a logger integration. Only now, when you get the current span, record the log as a span event. This instantly enriches your traces with all of the fine-grained observations you’ve already added to your application.
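Going in this direction might look something like the sketch below: a logging handler that turns each log into an event on the current span. The Span class here is a deliberately tiny, hypothetical stand-in for a real span, and the attribute names ("log.severity", "log.logger") are illustrative choices of mine rather than any official convention.

```python
import logging
import time
from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical, minimal span: a name plus a list of timestamped events."""
    name: str
    events: list = field(default_factory=list)

    def add_event(self, name, attributes=None):
        # Each event is (timestamp in ns, event name, attributes).
        self.events.append((time.time_ns(), name, dict(attributes or {})))


class SpanEventHandler(logging.Handler):
    """Logging handler that records each log as an event on the current span."""

    def __init__(self, get_current_span):
        super().__init__()
        # Callable returning the active Span, or None when there is none.
        self._get_current_span = get_current_span

    def emit(self, record):
        span = self._get_current_span()
        if span is None:
            return  # no active span, so there is nothing to attach the log to
        span.add_event(record.getMessage(), {
            "log.severity": record.levelname,
            "log.logger": record.name,
        })


# A plain dict plays the role of "the current span" for this sketch.
active = {"span": None}

logger = logging.getLogger("orders")
logger.setLevel(logging.DEBUG)
logger.propagate = False
logger.handlers = [SpanEventHandler(lambda: active["span"])]

active["span"] = Span("checkout")
logger.info("cart has %d items", 3)
logger.warning("inventory low")
```

Nothing about the application's logging calls has to change. The existing log statements simply start showing up inside the trace, in order, next to the timing data.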
There is an incredible potential for value here, not just in time but in money. When people take this approach, I’ve noticed a pattern – they eventually find themselves only looking at their tracing system, and not really turning to their logging system any more when they want to ask a question. Given the cost and overhead large scale production log collection can incur, there is the opportunity for a serious improvement to both your system overhead and your bank account – just turn off your logging system.
I’ll leave you with one final thought to ponder. Observation creates a paradox – the more detail you observe, the more overhead you create and the more you affect the performance of your system. When we want to reduce the pressure and resource consumption our telemetry creates, we reduce the amount of detail that we record.
In logging, that meant setting log levels – you see fewer details about each transaction. In tracing, that meant sampling – you see fewer transactions.
Early forms of sampling were simplistic - just roll a 1024-sided die and if it comes up a 1, record the trace. Not a great substitute for logging when rare but potentially critical errors might be missed. But that concern is old news – modern sampling algorithms do a much better job of capturing errors and rare anomalies.
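Here is a toy comparison of the two ideas. The first function is the die roll described above. The second is one simple way to do better, assuming the decision is made after the trace has finished so that errors are visible. It is a sketch of the general idea, not the algorithm used by any particular product, and the dictionary-based trace format is made up for the example.

```python
import random

SAMPLE_ONE_IN = 1024


def die_roll_sampler(trace, rng):
    """Early-style sampling: keep roughly one trace in 1024, blind to content."""
    return rng.randrange(SAMPLE_ONE_IN) == 0


def error_aware_sampler(trace, rng):
    """Keep every trace that contains an error; sample the rest at the base rate."""
    if any(span.get("error") for span in trace):
        return True
    return die_roll_sampler(trace, rng)


# A trace is modeled as a list of span dicts (a made-up shape for this sketch).
rng = random.Random(7)
healthy = [[{"name": "GET /cart"}] for _ in range(10_000)]
failed = [[{"name": "POST /pay", "error": True}] for _ in range(5)]
traces = healthy + failed

kept_blind = [t for t in traces if die_roll_sampler(t, rng)]
kept_aware = [t for t in traces if error_aware_sampler(t, rng)]

# Count how many of the five failing traces survived error-aware sampling.
errors_kept = sum(1 for t in kept_aware if t[0].get("error"))
```

The die roll keeps only a handful of the ten thousand healthy traces and will usually miss all five failures. The error-aware version keeps a similarly small slice of the healthy traffic but never drops a failing trace.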
And if that isn’t enough, you could just trace without sampling at all. People say “but that would be expensive!” Have you looked at your logging bill? It’s already expensive. So that’s a non-argument; obviously the more you store the more it costs. The point is that with trace-based sampling you now have a dial you can turn at operation time. That’s a huge step above running around your codebase deleting logs to save money, hoping that they weren’t important.
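As a rough illustration of that dial, here is a ratio-based sampler whose rate can be changed while the system is running. The class and method names are hypothetical. The decision is derived from the trace ID rather than a fresh random number, on the assumption that you want every service to reach the same verdict for the same transaction.

```python
import random


class RatioSampler:
    """Keep a trace when its ID falls below a threshold derived from the ratio."""

    MAX_ID = 2 ** 64

    def __init__(self, ratio):
        self.set_ratio(ratio)

    def set_ratio(self, ratio):
        """Turn the dial: 0.0 keeps nothing, 1.0 keeps everything."""
        if not 0.0 <= ratio <= 1.0:
            raise ValueError("ratio must be between 0 and 1")
        self._threshold = int(ratio * self.MAX_ID)

    def should_sample(self, trace_id):
        # Look only at the low 64 bits, so the same trace ID always
        # produces the same decision at a given ratio.
        return (trace_id & (self.MAX_ID - 1)) < self._threshold


rng = random.Random(1)
trace_ids = [rng.getrandbits(128) for _ in range(20_000)]

sampler = RatioSampler(1.0)          # trace everything
kept_all = sum(sampler.should_sample(t) for t in trace_ids)

sampler.set_ratio(0.1)               # bill too high? turn the dial down
kept_tenth = sum(sampler.should_sample(t) for t in trace_ids)
```

No code was deleted and nothing was redeployed. The instrumentation stays in place, and the amount of data you pay for becomes an operational setting.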
So please, stop putting tracing and logging in two separate buckets. Just focus on the features that you want. Don’t worry about what they are called!