Observability: “it’s worth it” – or isn’t it?
by Talia Moyal
How do we know whether something is “worth it”? Whether it’s valuable?
For anything we buy – whether a meal, an experience, or a multi-year SaaS observability contract – we need to consider its benefits and subtract its costs. To assess future value, we want the benefits and costs to share the same units: only then is it easy to tell a straightforward ROI story.
This article is about why that’s been a difficult proposition for observability, and what we can do about it.
Observability starts once you’ve gathered your telemetry. Telemetry data — logs, metrics, and traces — is a commodity. It has little to no value on its own, yet most vendors charge a premium for it because it’s easy to measure and it scales with business growth.
There are two overarching flavors of telemetry data:
- Statistics (metrics)
- Events (traces and logs)
For metrics, the cost – whether in vendor bills or local RAM resources – is tightly coupled to cardinality, especially custom metrics. Custom metrics allow for custom tagging, and the “cardinality” of a metric is the number of distinct combinations of those tags.
For example, below we’re looking at a metric called request_counter. Each row represents a time series for a unique set of tags on request_counter. The cardinality of this metric is the number of distinct time series (and also the number of rows), which here is 7.
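As a sketch, cardinality is just the count of distinct tag combinations. The tag names and values below are illustrative stand-ins, not the actual rows from the table above:

```python
# Hypothetical time series for a metric named "request_counter".
# Each dict is one unique tag combination, i.e. one time series.
# (Tag names and values are illustrative, not from the article's table.)
series = [
    {"service": "checkout", "region": "us-east", "status": "200"},
    {"service": "checkout", "region": "us-east", "status": "500"},
    {"service": "checkout", "region": "us-west", "status": "200"},
    {"service": "search",   "region": "us-east", "status": "200"},
    {"service": "search",   "region": "us-east", "status": "500"},
    {"service": "search",   "region": "us-west", "status": "200"},
    {"service": "search",   "region": "us-west", "status": "500"},
]

def cardinality(series):
    # Cardinality = number of distinct tag combinations (distinct rows).
    return len({tuple(sorted(tags.items())) for tags in series})

print(cardinality(series))  # -> 7
```

Note that in the worst case cardinality grows as the *product* of each tag’s distinct values, which is why a single high-cardinality tag (like a customer ID) can explode a metric’s cost.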
Realistically, you live in a world with more than one of these tables: thousands of metrics, each with thousands of unique tag combinations, which is why infrastructure metrics end up with such large cardinality.
Infrastructure metrics providers might charge you $5 a month for 100 tag / value combinations. In common systems with millions of tag / value combinations, it’s easy to see a monthly bill in the hundreds of thousands of dollars. Worse, most of those tag / value combinations will never appear in query results or dashboards. Customers should be able to trade off between fidelity, spend, and query result quality.
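A back-of-the-envelope model makes the scaling problem concrete. The $5-per-100-combinations rate is the hypothetical figure from above, and billing in whole blocks is a simplifying assumption; real vendor rate cards vary:

```python
import math

def monthly_cost(combinations, price_per_block=5.00, block_size=100):
    # Hypothetical rate card: $5/month per block of 100 tag/value
    # combinations, billed in whole blocks (a simplifying assumption).
    return math.ceil(combinations / block_size) * price_per_block

print(monthly_cost(5_000_000))  # 50,000 blocks x $5 -> 250000.0 per month
```

Because the bill is linear in cardinality, a single unbounded tag can multiply it overnight, which is the argument for a fidelity slider rather than all-or-nothing ingestion.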
Want perfect fidelity for specific business-critical metrics? You should be able to “drag the slider” (the gray rectangle on the right side of the graph — yes, our CEO Ben Sigelman drew these himself) all the way to the right. Want to degrade gracefully for a customer_ID tag with millions of values? Drag to the left and save 90% of your bill!
For event data, we have to trade off between the number of transactions (txns / second), the level of detail (bytes / txn) in those transactions, and the distance to the storage and compute that creates value from the data.
Sending the data across the internet to a SaaS for processing costs 100 times more than using local storage and compute, meaning you’ll need to either throw away 99% of the data before processing or accept high costs with doubtful value.
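To see how those knobs interact, here is a rough volume-and-cost sketch. The 100x remote multiplier reflects the claim above; the $0.01/GB local rate is a placeholder, not a real price:

```python
def monthly_volume_gb(txns_per_sec, bytes_per_txn, days=30):
    # Raw event volume generated over a month, in decimal GB.
    return txns_per_sec * bytes_per_txn * days * 24 * 3600 / 1e9

def processing_cost(volume_gb, local_cost_per_gb=0.01, remote_multiplier=100):
    # The 100x remote multiplier reflects the article's claim that shipping
    # data across the internet to a SaaS costs ~100x local storage/compute.
    # The $0.01/GB local rate is a placeholder, not a real price.
    return {
        "local": volume_gb * local_cost_per_gb,
        "remote": volume_gb * local_cost_per_gb * remote_multiplier,
    }

vol = monthly_volume_gb(txns_per_sec=2_000, bytes_per_txn=1_500)
print(vol)                   # 2,000 txns/s x 1,500 B/txn -> 7776.0 GB/month
print(processing_cost(vol))  # remote comes out 100x the local figure
```

Under these assumptions, cutting either transactions per second or bytes per transaction by 10x buys the same savings as moving 10x closer to the data, which is why the trade-off has three dials rather than one.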
This insight drives Lightstep’s architecture, as well as our approach to sampling. Computation must be distributed – close to the data – and that data must stay close to the services themselves. Otherwise, the network cost is simply too disruptive to overall observability ROI. In general, there are different approaches to this when it comes to sampling — head-based sampling can lead to gaps in telemetry and tail-based sampling means holding onto large amounts of data indefinitely. Lightstep’s dynamic sampling, based on context from 100% of the unsampled telemetry data, provides actionable observability insights and won’t miss even the most infrequent failures, without breaking your budget with storage costs.
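The head-based limitation is easy to demonstrate with a toy sampler (this is an illustration of head-based sampling in general, not Lightstep’s algorithm): the keep/drop decision is made when the trace starts, before anyone knows whether the request will fail, so rare failures are dropped at the same rate as everything else.

```python
import random

def head_sample(trace_seed, rate=0.01):
    # Head-based sampling sketch: the keep/drop decision is made at trace
    # start, before we know whether the request will fail.
    # (Toy model: seeding on a trace id makes the decision deterministic.)
    return random.Random(trace_seed).random() < rate

# Simulate a failure that occurs once per 10,000 requests. At a 1% head
# sample rate we expect to keep only ~1 in 100 such failures, so most
# rare failures never reach the backend at all.
failures = [i for i in range(1_000_000) if i % 10_000 == 0]  # 100 failures
kept = sum(1 for trace in failures if head_sample(trace))
print(f"kept {kept} of {len(failures)} rare failures")
```

Tail-based sampling fixes this by deciding after the trace completes, but only by buffering every in-flight trace, which is the storage cost mentioned above.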
Observability is the ability to navigate from effect to cause. It helps developers understand what’s slow, what’s broken, and what needs optimization in a software system.
Observability will help you:
- Optimize performance (lower latencies or error rates)
- Investigate regressions
- Monitor deployments
- Understand and visualize service topology
- Alleviate complexity-driven anxiety
So how should you think about observability and pricing?
Licensing seats and services is the best way to scale predictably. You know that with increased usage of an observability platform in your company, the cost will predictably increase, without the fear of observing “too much data.” This is Lightstep’s model and it serves companies that have deep, complex systems with many instances of many services deployed at the same time, or wider organizations with teams supporting services with large instance counts.
That said, systems with particularly low throughput can see lower value per service as a pricing unit.
For companies running small to medium-sized monolithic infrastructures, the value derived per host as a pricing unit can be higher than in other models. While host-based pricing models impose per-host container limits, architectures with lower instance counts of monoliths or microservices are less likely to be negatively impacted. Additionally, per-host pricing is easy to measure and control, since you know who is authorized to create new instances.
Companies running large numbers of microservices, or with large instance counts in general, can quickly hit overage limits as infrastructure scales up to meet additional demand, or as monolithic applications are broken into smaller, containerized microservices, raising total container counts. Additionally, organizations with many instances of a service off the critical path can see lower value per host as a pricing unit. Finally, per-host pricing means you’re inherently paying more as your business grows or your site sees more traffic, not necessarily when you’re getting more value out of observability.
Pricing models based on ingestion or retention allow organizations to pay to process and store only the data they believe is worth processing. This offers a sense of granular control over spend. Many such tools also allow higher data ingress limits than per-host tools, though of course you’ll pay for that headroom.
In reality, there’s no way to accurately project costs under this model, and there will be friction between developers wanting to add instrumentation and individuals controlling the spend. Further, as instrumentation inevitably increases along with data volume and price, it becomes difficult to identify the source of those cost changes in a distributed architecture. Of course, you’re also paying for retention of the data you send, so if you need to do month-over-month or year-over-year comparisons you’ll have to pay for that as well.
In conclusion? Pay for a unit that adds value, so that you can solve your observability pain points and see ROI increase as your business scales. Vendors want to charge you for telemetry: say, $15 per host per month for a basic infrastructure metrics solution, or $30 per host per month to tack on their APM solution. This is great if you have low data volume forever (said no one ever). But as your business grows, its needs change, and its systems grow more complex with the migration to cloud and microservices, this model destroys CIO and CFO budgets all over the world. Your options are to cut costs by keeping just a metrics solution, a log management solution, or a tracing solution, but those are all still telemetry data tools, not observability platforms.
Make sure that the vendor you’re evaluating is addressing your day-to-day use cases and can scale gracefully without limits. You shouldn’t have to pay for a telemetry data hose you can’t control.
You can try Lightstep for free to see why we are the quickest way to detect and resolve regressions, regardless of system scale or complexity. And we don’t charge based on telemetry 😀.