Managing SLOs and SLIs in Lightstep
by Ashley Rahimi Syed
Why are service levels important?
Because ultimately, service level objectives, indicators, and agreements (SLOs, SLIs, and SLAs) reflect customer expectations. They help developers manage the kinds of failures that stand in the way of their customers’ success.
Lightstep can help you monitor and meet your Service-Level Objectives (SLOs) and resolve incidents quickly. It’s easy to set custom alerts and notify your team as soon as SLI behavior trends towards a regression. Here’s how it works.
How to track key SLOs with custom alerting
Let’s start with a visual representation of the performance history of a specific service, operation, or query. In Lightstep, we refer to this as a stream.
Here we have a stream for the Krakend API Gateway service in this system. I’ll click the “Create Condition” button in the upper-right corner to open the dialog seen below. I can choose exactly which signal I’d like to monitor (latency, error percentage, or operation rate), along with the threshold and evaluation window. In this instance, I’ve indicated that I’d like to be alerted if the error percentage for this stream exceeds 10% over a ten-minute period.
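To make the condition concrete, here is a minimal Python sketch of the check it describes: the share of errored requests in a ten-minute window compared against a 10% threshold. This is only an illustration of the arithmetic, not Lightstep’s implementation; the `Span` type and the function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Span:
    """Hypothetical record of one request seen on the stream."""
    timestamp: datetime
    is_error: bool


def error_percentage(spans: List[Span], now: datetime, window: timedelta) -> float:
    """Return the percentage of errored spans inside the evaluation window."""
    recent = [s for s in spans if now - window <= s.timestamp <= now]
    if not recent:
        return 0.0  # assumption: no traffic in the window counts as healthy
    errors = sum(1 for s in recent if s.is_error)
    return 100.0 * errors / len(recent)


def condition_breached(
    spans: List[Span],
    now: datetime,
    threshold_pct: float = 10.0,
    window: timedelta = timedelta(minutes=10),
) -> bool:
    """True when the error percentage strictly exceeds the threshold."""
    return error_percentage(spans, now, window) > threshold_pct
```

With 3 errors out of 20 requests in the window, the error percentage is 15%, so the condition is breached; with 2 out of 20 it sits at exactly 10% and, in this sketch, does not fire.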
Now that I’ve set my conditions, I have to add an alerting rule. From here, I’ll select the PagerDuty integration from the list of available options, along with the destination and update interval.
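For a rough sense of what an alerting rule carries, the Python sketch below builds a trigger event shaped like a PagerDuty Events API v2 request body and applies an update interval. Lightstep’s integration handles all of this for you, and its actual payload is not shown here; the routing key value, the `dedup_key` format, the function names, and the reading of “update interval” as a minimum gap between repeat notifications are all assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional


def build_pagerduty_event(
    routing_key: str, stream_name: str, error_pct: float, threshold_pct: float
) -> dict:
    """Build a trigger event in the shape of a PagerDuty Events API v2 body."""
    return {
        "routing_key": routing_key,  # placeholder: the destination's integration key
        "event_action": "trigger",
        # Hypothetical dedup key so repeat events group into one incident.
        "dedup_key": f"error-pct:{stream_name}",
        "payload": {
            "summary": (
                f"{stream_name}: error percentage {error_pct:.1f}% "
                f"exceeds {threshold_pct:.1f}%"
            ),
            "source": stream_name,
            "severity": "error",
        },
    }


def should_notify(
    last_sent: Optional[datetime], now: datetime, update_interval: timedelta
) -> bool:
    """Send on the first breach, then at most once per update interval."""
    return last_sent is None or now - last_sent >= update_interval
```

A caller would check `should_notify` each time the condition evaluates as breached and send the event only when it returns `True`, which keeps an ongoing breach from paging the team on every evaluation.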
Lightstep will automatically review the percentage of errors affecting the service’s performance health over the last ten minutes, and report any findings in the sidebar. It is clear in the image below that the condition I set was breached sometime during the last ten minutes.
As a result, PagerDuty sent a page as soon as the breach was detected.
Now, I can immediately investigate the error using Lightstep, and hopefully identify and implement a solution more quickly.
What’s different about investigations with Lightstep
When you are trying to restore service rapidly, it can be difficult to separate good hypotheses from bad ones. Lightstep can help you avoid the guesswork: with unlimited cardinality and a high-fidelity dataset uncompromised by sampling, Lightstep reveals issues that are invisible to conventional monitoring solutions. It analyzes thousands of traces from your system to produce root-cause insights for performance regressions, so your team can resolve issues quickly and meet SLOs.
Want to see it for yourself? Check out our free interactive sandbox, where you can debug an iOS error or resolve a performance regression using our suite of observability tools.