♪♪ Let’s talk about SL(X) ♫
by Andrew Gardner
Recently Alex Hidalgo joined us as a guest on 99PercentVisible to talk about SLOs and how organizations can make better use of this data to make people happier, including your customers, your engineers, and your business.
I’m not going to recap the whole chat (because you should watch it yourself), but I did want to hit on a few key points that Alex brought up.
SLIs, or Service Level Indicators, are measurements used to define the performance of your service and tell you how it is operating. Your SLI is often calculated via some aggregate or combination of your logs, metrics, and traces. It’s essentially how you define whether a service is operating as intended, and that definition needs to be explicit.
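To make that concrete, here is a minimal sketch of an SLI computed as the share of "good" requests in a batch of request records. The record fields (`status`, `latency_ms`) and the 500 ms latency threshold are assumptions made up for this example, not something from the chat; your own definition of "good" should come from what your users actually care about.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code (assumed field)
    latency_ms: float  # time taken to serve the request (assumed field)


def is_good(req: Request, max_latency_ms: float = 500.0) -> bool:
    """A request is good if it did not fail server-side and was fast enough."""
    return req.status < 500 and req.latency_ms <= max_latency_ms


def sli(requests: list[Request], max_latency_ms: float = 500.0) -> float:
    """Return the fraction of good requests, between 0.0 and 1.0."""
    if not requests:
        return 1.0  # no traffic: treated as meeting the indicator (a judgment call)
    good = sum(is_good(r, max_latency_ms) for r in requests)
    return good / len(requests)


sample = [
    Request(200, 120.0),
    Request(200, 480.0),
    Request(503, 90.0),   # server error: counts as bad
    Request(200, 950.0),  # too slow: counts as bad
]
print(sli(sample))  # 0.5
```

The point of writing it down like this is the "explicit" part: anyone can read `is_good` and see exactly what counts against the service.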
SLIs should be measured from your user’s point of view. If you have meaningful SLIs that are powering good SLOs, you only have to page someone, or treat something as an emergency, when you're violating your SLO. Fewer pages generally make engineers happier.
SLOs, or Service Level Objectives, are the bedrock of how site reliability engineering works. But why do we need SLOs? As Alex talks about in the chat (you should really watch it), everything is a service, whether it is IaaS, PaaS, NaaS (network as a service), or DBaaS (database as a service). Understanding how these services are performing is critical to providing that functionality to your customers.
Let’s talk about how you can use SLOs to actually make people happier, from your customers, to your engineers, to your business. Your SLO is a target percentage for your SLI, and this is where the concept of “error budgets” comes into the conversation: the gap between that target and 100% is the amount of unreliability you can tolerate, which gives you a threshold you can use to judge your performance.
There is a plethora of data available about measuring SLIs and setting SLO targets. But, now that you have this data, what are you actually supposed to do with it? The classic example is to “Ship features when you have an error budget; focus on reliability when you don’t.” In Alex’s view (and a view shared by Lightstep), this is antiquated, too simple, and ignores all of the amazing discussions and decisions you can have with your SLO data.
“If you think something's 100% reliable, you're probably wrong and you're probably measuring the wrong things.”
Error budgets are just a way of calculating how well your SLOs have performed over some kind of time window. Common ones are calendar months, or 30 days, or quarters. And in fact, there's no reason to not measure multiple error budgets, so you can talk about how you've performed over the quarter versus the week and things like that.
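Here is a rough sketch of that calculation over more than one window. The event counts, the window names, and the 99.9% target are invented for illustration; the only real idea is that the budget is the number of bad events the SLO target allows, and "remaining" is how much of that allowance is left.

```python
def error_budget_remaining(good: int, total: int, slo_target: float) -> float:
    """Fraction of the error budget left: 1.0 is untouched, below 0 is overspent."""
    if total == 0:
        return 1.0  # no events, so nothing has been spent
    allowed_bad = (1.0 - slo_target) * total
    actual_bad = total - good
    if allowed_bad == 0:
        # A 100% target leaves no budget at all; any bad event overspends it.
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


# Hypothetical (good, total) event counts for two windows of the same service.
windows = {
    "last 7 days": (998_500, 1_000_000),
    "last 30 days": (4_297_000, 4_300_000),
}

for name, (good, total) in windows.items():
    remaining = error_budget_remaining(good, total, slo_target=0.999)
    print(f"{name}: {remaining:+.1%} of budget remaining")
```

In this made-up data the week is overspent while the 30 days still have budget left, which is exactly the kind of "quarter versus the week" comparison that is worth talking about.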
Error budget numbers help you decide whether you should be zigging instead of zagging. They also make reliability targets easier for humans to grasp. It’s difficult to picture 99.999% availability, but roughly five minutes and 15 seconds of potential downtime per year is a much easier concept to understand. This is where error budgets are really effective in sharing information about performance and reliability with the wider organization.
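Turning an availability percentage into allowed downtime is simple arithmetic. The sketch below assumes a 365-day year and purely time-based availability; the function names are just for this example.

```python
def allowed_downtime_seconds(availability_pct: float, window_days: float = 365.0) -> float:
    """Seconds of downtime permitted by an availability target over a window."""
    window_seconds = window_days * 24 * 60 * 60
    return window_seconds * (100.0 - availability_pct) / 100.0


def humanize(seconds: float) -> str:
    """Format a duration as hours, minutes, and whole seconds."""
    minutes, secs = divmod(round(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}h {minutes}m {secs}s"


for target in (99.9, 99.99, 99.999):
    print(f"{target}% per year -> {humanize(allowed_downtime_seconds(target))}")
# 99.9% per year -> 8h 45m 36s
# 99.99% per year -> 0h 52m 34s
# 99.999% per year -> 0h 5m 15s
```

Passing a different `window_days` (say 30) gives the budget for a shorter window.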
Looking at your error budget burn and your SLO status generally allows you to determine what your risks are. If you burn a certain percentage of error budget every time you perform a new release, for example, maybe that means you need to shore up your release process. Leveraging error budgets to improve your processes can ultimately drive better engagement and make people (customers, engineers, and your business) happier.
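As one possible way to spot that pattern, the sketch below works out how much of a window's budget each release consumed. The release names, bad-event counts, budget size, and the 10% review threshold are all hypothetical; how you attribute bad events to a release is up to your own tooling.

```python
def burn_fraction(bad_events: int, budget_events: int) -> float:
    """Share of the window's error budget consumed by one release."""
    return bad_events / budget_events


# Hypothetical: bad events attributed to each rollout, and the number of bad
# events the SLO allows for the whole window.
releases = {"v1.4.0": 120, "v1.4.1": 90, "v1.5.0": 150}
budget_events = 1_000
review_threshold = 0.10  # assumed cutoff for "worth a conversation"

burns = {name: burn_fraction(bad, budget_events) for name, bad in releases.items()}
average_burn = sum(burns.values()) / len(burns)
needs_review = average_burn > review_threshold

for name, burn in burns.items():
    print(f"{name}: {burn:.0%} of budget")
print(f"average per release: {average_burn:.0%}, review release process: {needs_review}")
```

A consistently high average is a prompt for a discussion about the release process, not an automatic verdict.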
As Alex cautioned, you don't want to be too reliable, but you do want to be reliable enough. Using the example of a streaming service, if a video buffers for 20 seconds once but then plays fine for your user (which is where the focus should be), that’s reliable enough. If it buffers for 20 seconds every single time, that’s not going to cut it for your users and they will begin to look for other options.
It's also important to remember that developing SLOs in the first place is project work. Figuring out how to calculate SLIs, whether you're doing it for the first time or improving on them, is project work. And so is picking the correct SLO thresholds. "Nothing is ever perfect" is one of the primary tenets of an SLO-based approach. And that means your measurements aren't going to be perfect either.
The most important part of an entire SLO-based approach to reliability isn't even SLOs at all, it's SLIs. It's about having measurements that meaningfully measure your service from your user's perspective. Things like SLOs and error budgets are just some pre-done math that makes it easier to look at the numbers, and easier to make these decisions or have the conversations that lead to them. But the root of all of it is that service level indicators actually matter.
The purpose of the whole reliability stack is to provide you with the data points to have better discussions about how to improve your service for your end customer. Take a step back and measure things from your user's perspective. Use SLOs and error budgets to provide you with some pre-done math so you don't have to break out a paper and pencil and figure stuff out. But above all, use this data to ensure that your team, your organization, or your leadership (whoever it needs to be) has better data in front of them to say: here's how we've looked, and here's how we want to look moving into the future.
If you haven’t clicked over to watch it by now, here it is!