In this blog post
1- Traces not treated as a first-class citizen
2- The Wall ’o Dashboards
3- Getting someone else to instrument your code
4- Belief that Observability Tooling == Observability
5- Observability theater
Final Thoughts

Have you ever been at a place that claimed to be all in on Observability, only for you to realize that, well, they’re not really following Observability practices? Yeah. Me neither. Just kidding. I’ve been in this space long enough to witness my fair share of anti-patterns, or what I like to refer to as “Crimes Against Observability”. In today’s post, I’ll be calling these out, in the hopes that you can avoid these crimes and be on your merry way towards unlocking Observability’s powers.
Let’s get started!
1- Traces not treated as a first-class citizen
Too many organizations put far too much emphasis on metrics and logs, while either completely disregarding or downplaying traces. Yes, metrics and logs are useful…to a point.
Metrics can give us information about things like CPU levels and the amount of time that it takes to complete a transaction. But they can only provide aggregate information that you can’t drill down into to understand what’s going on with your system.
Logs provide useful point-in-time information; however, by themselves, logs make it pretty damn difficult to troubleshoot. They’re a wall of text that you have to parse through so you can kinda sorta maybe piece together what’s up with your code.
Neither metrics nor logs give you enough context to understand what’s happening with your system at a high level. Thus, the biggest crime against Observability is committed when metrics and logs are treated as the main actors of your Observability story, when in fact they take more of a supporting role. Spoiler alert: traces are the true stars of the show.
So how do we fix this? Take a trace-first approach. Traces give you that end-to-end, system-wide view. They show you not only what’s going on within services, but also across services. How do logs and metrics fit into this?
Make logs more useful by embedding them in your overall story (i.e. traces) as Span Events.
Correlate metrics to traces via a linking attribute. For example, a VM with a given IP address can be correlated to a Trace if you capture the IP address as a Span attribute (see the sketch below).
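Here’s a minimal sketch of what both of those look like with the OpenTelemetry Python API. The service name, function, and attribute keys are made up for illustration; the point is that the log-worthy message rides along on the span as a Span Event, and the host’s IP address is captured as a span attribute so you can correlate back to host-level metrics.

```python
from opentelemetry import trace

# Hypothetical tracer for a made-up service; names are illustrative only.
tracer = trace.get_tracer("payment-service")

def handle_payment(request, host_ip):
    with tracer.start_as_current_span("handle_payment") as span:
        # Correlate with VM/host metrics by capturing the IP as a span attribute.
        span.set_attribute("host.ip", host_ip)
        span.set_attribute("payment.amount", request["amount"])

        # Instead of a free-floating log line, attach the message to the
        # span as a Span Event so it shows up right in the trace waterfall.
        span.add_event(
            "payment validated",
            attributes={"payment.method": request["method"]},
        )
        # ... actual payment logic would go here ...
```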
Moral of the story: take a trace-first approach with your Observability landscape.
2- The Wall ’o Dashboards
Y’all, this one’s nails on a chalkboard for me. I once worked at a place where leadership thought that their production woes would be solved by dashboarding all the things. “I helped set up a wall of dashboards when I worked at XYZ company, and it helped us so much!” Yeah. Like 10 years ago, when there wasn’t much else to work with. Time to get with the times and rethink that wall ’o dashboards, my friends.
Does that mean that dashboards go away entirely? No. Instead, rethink your dashboard situation. Use fewer dashboards. Don’t rely on the Wall ’o Dashboards to guide your Observability journey. A better alternative to metrics dashboards is to use Service-Level Objectives (SLOs). SLOs are actionable. For example, suppose you have an SLO that states that Service X must respond within a target latency 95% of the time. If the service is not meeting that SLO, it triggers an alert via Slack, phone, pager, passenger pigeon, or whatever, to tell your on-call engineers that your system is not behaving within the expected parameters and that you’ve gotta take a closer look at things.
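To make that concrete, here’s a toy sketch of what an SLO check boils down to, independent of any particular vendor or alerting tool: measure the fraction of requests that meet the latency target over a window, and fire an alert when it drops below the objective. The threshold, objective, and notify_on_call callback are all hypothetical.

```python
# Illustrative only: a toy SLO check, not a real alerting pipeline.
LATENCY_TARGET_SECONDS = 0.5   # hypothetical per-request latency target
SLO_OBJECTIVE = 0.95           # 95% of requests must meet the target

def check_latency_slo(latencies_seconds, notify_on_call):
    """Alert if fewer than 95% of requests met the latency target."""
    if not latencies_seconds:
        return
    within_target = sum(1 for l in latencies_seconds if l <= LATENCY_TARGET_SECONDS)
    compliance = within_target / len(latencies_seconds)
    if compliance < SLO_OBJECTIVE:
        # Slack, pager, passenger pigeon... whatever notify_on_call does.
        notify_on_call(
            f"Service X latency SLO breach: only {compliance:.1%} of requests "
            f"were under {LATENCY_TARGET_SECONDS}s (objective: {SLO_OBJECTIVE:.0%})"
        )
```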
For more hot takes on dashboards, check out this great piece by Charity Majors and this short, fun video by Austin Parker.
3- Getting someone else to instrument your code
Say you’re a developer. Would you get someone else to comment your code or write your unit tests? I didn’t think so. Now, say that your team is getting into Observability. This means that you need to instrument your code à la OpenTelemetry. Would you:
A) Instrument your own code
B) Ask someone else (maybe your SREs?) to instrument your code
If you answered B, then, yeaaaaahhhh...“Houston, we have a problem”.
What’s wrong with this picture? Well, for starters, the SREs didn’t write the code. You did. If it would be super weird to have someone else comment YOUR code and write unit tests for YOUR code, why would it be okay to have someone else instrument YOUR code? How in Space do you expect them to know WHAT to instrument?
Look, there’s no shame in not having instrumented your code before. If you’re just getting started with Observability on an existing code base, you bet your pants that you’ll have to go through your code and instrument it. But you can also ensure that you instrument new code as you write it, going forward. Moral of the story: instrument your own code. More specifically, if you focus on instrumenting your home-grown frameworks and libraries, then you have all the coverage you need, as far as tracing is concerned. Whatever you do, please, don’t get someone else to do your dirty work for you.
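If you’re wondering what instrumenting your own home-grown code actually looks like, here’s a bare-bones sketch using the OpenTelemetry Python API. The library name, function, and attribute keys are invented for illustration; the point is that the person who wrote the code is the one who knows which attributes will matter when it misbehaves.

```python
from opentelemetry import trace

# Hypothetical home-grown library; you wrote it, so you instrument it.
tracer = trace.get_tracer("inventory-lib")

def _query_stock(sku, warehouse_id):
    # Stand-in for your existing business logic.
    return 42

def inventory_lookup(sku, warehouse_id):
    with tracer.start_as_current_span("inventory_lookup") as span:
        # Only the author knows which attributes matter when this
        # lookup misbehaves at 3 a.m.
        span.set_attribute("inventory.sku", sku)
        span.set_attribute("inventory.warehouse_id", warehouse_id)

        stock = _query_stock(sku, warehouse_id)
        span.set_attribute("inventory.in_stock", stock > 0)
        return stock
```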
But hey, don’t take just my word for it. I’ll let Liz Fong-Jones have the last word:
“You’re a full grown-up software engineer. Write your own damn tests. Write your own damn comments. Write your own damn Observability annotations. This will help YOU understand your code later."
4- Belief that Observability Tooling == Observability
Oh, mes amis, this couldn’t be further from the truth. Observability is a set of practices supported by tools. These include:
Instrumenting your code properly so that you have enough info to troubleshoot when issues arise, thereby avoiding having to call in the same group of “experts” to troubleshoot
Treating traces as first-class citizens
Keeping an eye on the health of your system after you release to prod
Creating meaningful, SLO-based alerts
Put that into place, and you’ve got yourself some Observability.
5- Observability theater
We’ve seen it before. Companies going all-in on so-called “digital transformations”