What Your Observability Software Should Deliver
by Eric O'Rear
Your observability software should deliver confidence in your distributed systems. On Friday deploys. During a 100x increase in service traffic. When latency starts spiking all over the network. Real observability will give you confidence.
Unfortunately, achieving this level of confidence is not as simple as flipping the switch on new telemetry streams.
What does effective observability software deliver?
Observability can't help you if it's too expensive to use when you need it most.
Because observability solutions handle such a large volume of data, network and storage costs are a significant factor. Many observability solutions have pricing structures that penalize businesses for scaling, or they move too much data across networks to be sustainable from a cost perspective.
Make sure any observability software you are considering can grow with your business and handles your telemetry in a cost-effective way.
The days of relying on a wizard hacker with arcane system knowledge are long behind us. With context-rich trace data, correlation analysis, and developer-focused UIs, observability software should deliver a user-friendly experience that makes regression analysis, understanding service relationships, and inter-team communication easier.
By revealing the critical path of end-to-end requests, and surfacing only the relevant data to resolve an issue, observability enables better workflows for debugging, performance optimization, and crisis management.
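To make "critical path" concrete, here is a minimal sketch in Python. It assumes a simplified heuristic: starting at the root span, follow the child that finishes last at each level. The `Span` class, the service names, and the timings are all hypothetical, and production tools use more careful analysis than this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    name: str
    start_ms: int
    end_ms: int
    parent: Optional[str] = None  # name of the parent span; None for the root


def critical_path(spans):
    """Return span names from the root down, following the last-finishing child."""
    children = {}
    root = None
    for span in spans:
        if span.parent is None:
            root = span
        else:
            children.setdefault(span.parent, []).append(span)

    path = []
    current = root
    while current is not None:
        path.append(current.name)
        kids = children.get(current.name, [])
        # The child that ends last is the one the parent was waiting on.
        current = max(kids, key=lambda s: s.end_ms) if kids else None
    return path


# Hypothetical trace for a single end-to-end request.
trace = [
    Span("gateway", 0, 120),
    Span("auth", 5, 25, parent="gateway"),
    Span("checkout", 30, 115, parent="gateway"),
    Span("inventory", 35, 60, parent="checkout"),
    Span("payments-db", 62, 110, parent="checkout"),
]

print(critical_path(trace))  # ['gateway', 'checkout', 'payments-db']
```

In this made-up trace, the `auth` and `inventory` spans can be ignored when chasing latency, which is the kind of filtering the paragraph above describes.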
If using an observability tool is itself an obstacle to velocity, then keep looking.
To let developers respond to, and stay ahead of, performance issues, an observability solution needs to be as close to real time as possible. Queries may touch thousands of spans from potentially just as many services, so your observability software must keep up rather than become a bottleneck itself.
This isn’t just speed for speed’s sake; it is speed with a purpose. Getting telemetry insights and alerts in front of service owners when it matters can spare your company the cost of missed SLOs and soured customer relationships.
Your observability software needs to be a centralized resource for your system data. Logs, metrics, traces, service relationships: all of this information should be accessible and user-friendly, and should contribute to a larger, coherent picture of system health in a single context.
If observability software requires context-switching between various third-party services and platforms, it is failing your developers. That switching wastes developers’ time, increases the likelihood of oversights, and isn’t necessary.
True observability software provides a single, shared context across roles and organizations. It enables developers, operators, managers, PMs, contractors, and any other approved team members to work with the same views and insights about services, specific customers, SQL queries, and so on.
If an observability tool is pre-sampling all of your data, it isn’t an observability tool.
One of the central tenets of observability is the ability to answer performance questions that you didn’t predict. Pre-sampling involves making assumptions about your data, and this can come back to haunt you when things go wrong. Sometimes unique behaviors are hiding in single traces, and pre-sampling is basically flipping a coin on thousands of traces before ever looking at the data.
Make sure the insights from your observability software are drawn from all of your data. Otherwise, you might lose out on things like outlier detection, correlations, and accurate performance shapes.
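The coin-flip problem can be illustrated with a short Python sketch. It assumes a simple uniform pre-sampler that keeps each trace with a fixed probability before looking at its contents. The workload, the 1% rate, and the function names are hypothetical and are not taken from any particular product.

```python
import random


def head_sample(traces, rate, seed=0):
    """Keep each trace with probability `rate`, decided without inspecting it."""
    rng = random.Random(seed)
    return [t for t in traces if rng.random() < rate]


def chance_outlier_survives(rate, outliers=1):
    """Probability that at least one of `outliers` rare traces is kept."""
    return 1 - (1 - rate) ** outliers


# Hypothetical workload: 9,999 ordinary traces and one pathological 12-second trace.
traces = [{"id": i, "duration_ms": 40} for i in range(9_999)]
traces.append({"id": 9_999, "duration_ms": 12_000})

sampled = head_sample(traces, rate=0.01)
worst_seen = max((t["duration_ms"] for t in sampled), default=0)

print(f"kept {len(sampled)} of {len(traces)} traces")
print(f"slowest trace in the sample: {worst_seen} ms")
print(f"chance the outlier was kept: {chance_outlier_survives(0.01):.0%}")
```

At a 1% sampling rate, the single slow trace has only a 1% chance of being kept, so the sampled data will usually suggest that the slowest request took 40 ms.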
Microservice complexity is a legitimate challenge for development teams. The move away from a monolith tightens the scope of each service and increases release velocity, but it creates a tangle of service dependencies that no single person can, or should, be expected to troubleshoot without assistance.
An observability solution should clarify the nebulous tangle of services and make possible dynamic, reproducible root-cause analysis that doesn’t rely on a preternatural knowledge of the system, which is all but impossible to have in a complex microservice architecture.
This often includes service dependency maps, critical path analysis, automated root-cause identification, and a UI that is easy to use for developers at all levels.
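As a rough illustration of how a service dependency map can be derived from trace data, the Python sketch below assumes each span records its own ID, its parent's ID, and the service that emitted it. The field names and services are hypothetical. An edge is recorded whenever a span's parent belongs to a different service.

```python
from collections import defaultdict


def dependency_map(spans):
    """Map each calling service to the sorted list of services it calls."""
    by_id = {s["span_id"]: s for s in spans}
    edges = defaultdict(set)
    for span in spans:
        parent = by_id.get(span.get("parent_id"))
        # Only cross-service parent/child pairs count as dependencies.
        if parent is not None and parent["service"] != span["service"]:
            edges[parent["service"]].add(span["service"])
    return {service: sorted(deps) for service, deps in edges.items()}


# Hypothetical spans collected from one request.
spans = [
    {"span_id": "a", "parent_id": None, "service": "gateway"},
    {"span_id": "b", "parent_id": "a", "service": "auth"},
    {"span_id": "c", "parent_id": "a", "service": "checkout"},
    {"span_id": "d", "parent_id": "c", "service": "checkout"},
    {"span_id": "e", "parent_id": "d", "service": "payments"},
]

print(dependency_map(spans))
# {'gateway': ['auth', 'checkout'], 'checkout': ['payments']}
```

Aggregating these edges across many traces gives a map that reflects how services actually call each other, rather than how someone remembers them being wired together.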
Robin Whitmore wrote a great article called “Data-Driven Hypotheses with Lightstep: A Step-by-Step Guide,” in which she walks readers through root-cause analysis on our observability software.
For a better idea of how observability can make it easier to resolve incidents and improve system performance, check out Lightstep’s Sandbox!