Why You Need SRE-Driven Incident Response in the Cloud-Native Age
by Darius Koohmarey
Efficient incident response is essential to an organization’s bottom line. Why? Because widespread digital transformation, rapid change, and a growing number of critical services have made reliability a constant concern.
Release velocity has become a modern success benchmark and incident response is key in maintaining reliability through the changes.
Narrowing down to the specific departments responsible for incident response, software development, and deployment, we need to acknowledge that increasing release velocity begins with restructuring the way engineering and operations teams interact.
Engineering and operations teams often had conflicting outcomes before organizations adopted DevOps. Engineering teams prioritized shipping new features and rapid changes, whereas operations teams had to figure out how to ensure stability in production — often adding guardrails to slow the number of changes.
The notion of DevOps brought engineering and operations functions together with a common methodology and shared outcome — to ensure service reliability after deployment — with the SRE acting as the bridge between the two functions.
To align outcomes, error budgets were introduced. Budgets placed limits on the amount of downtime and issues a given team or individual introduced, resulting in self-regulation to make sure quality wasn't lost at the expense of innovation. If an individual or team's budget was spent, they couldn’t push any more changes or features.
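The error-budget mechanic can be sketched in a few lines. This is a minimal illustration, not any particular tool's implementation; the 99.9% SLO and 30-day window are hypothetical figures.

```python
# Hypothetical error-budget check: given an availability SLO over a rolling
# window, compute how much downtime budget remains and whether a team may
# continue shipping changes. All figures are illustrative.

def remaining_error_budget(slo: float, window_minutes: int, downtime_minutes: float) -> float:
    """Return the unspent error budget in minutes (negative if overspent)."""
    total_budget = (1.0 - slo) * window_minutes
    return total_budget - downtime_minutes

def can_ship(slo: float, window_minutes: int, downtime_minutes: float) -> bool:
    """A team may push changes only while budget remains."""
    return remaining_error_budget(slo, window_minutes, downtime_minutes) > 0

MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day window
# A 99.9% SLO allows 43.2 minutes of downtime per 30 days.
print(remaining_error_budget(0.999, MONTH, 20.0))  # ~23.2 minutes left
print(can_ship(0.999, MONTH, 50.0))                # False: budget overspent
```

The point of the sketch is the self-regulation loop: once `can_ship` returns `False`, the team's effort shifts from features to reliability until the window rolls over.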
Here are four ways DevOps and SREs can keep services and incident response teams resilient in an increasingly cloud-native computing era.
Incident response is an attempt to get a degraded service back to expected performance levels. But what defines the service level indicators your environment should operate at?
When is a service actually degraded beyond acceptable or agreed-upon levels? What alert is worth waking up at 2 a.m. for? Was the alert related to an actual impending incident, or was it something ignorable, with the system still healthy?
The answers to these questions generally depend on three factors:
- The metric that's breached and its value (what’s going wrong, and by how much)
- The number of affected users (the impact)
- What the cost of the downtime means to the business (the urgency)
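These three factors can be combined into a simple triage rule. The sketch below is purely illustrative: the thresholds, weights, and page/ticket/log tiers are hypothetical, and real values would come from your own services' SLOs and business context.

```python
# Illustrative triage sketch: combine breach size, user impact, and business
# cost into a single routing decision. All thresholds are hypothetical.

def alert_priority(breach_pct: float, affected_users: int, cost_per_min: float) -> str:
    """Map the three factors to a page/ticket/log decision."""
    score = 0
    if breach_pct >= 50:            # metric far outside its agreed range
        score += 2
    elif breach_pct >= 10:
        score += 1
    if affected_users >= 1000:      # broad impact
        score += 2
    elif affected_users >= 50:
        score += 1
    if cost_per_min >= 100:         # urgent for the business
        score += 2
    if score >= 4:
        return "page"    # worth waking someone at 2 a.m.
    if score >= 2:
        return "ticket"  # handle during business hours
    return "log"         # record with the rest of your observability data

print(alert_priority(60, 5000, 250))  # page
print(alert_priority(12, 10, 0))      # log
```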
These types of details are generally captured within the context of a service, which we recommend you use to categorize and track your alerts and incidents.
Knowing the service(s) affected by the issues your teams are handling will help prioritize and streamline escalation workflows for resolution.
Observability solutions will use a slew of different monitoring approaches, from synthetics and logs to heartbeats and pings to assess a system's current health level in relation to the service it provides.
There are different observable measures to identify service health. How do we measure criteria for the 'health' of services to identify if an observation we collect is healthy (successful) or not?
Steve Mushero summarizes a common framework that includes monitoring:
Rate — Request rate, in requests/sec. We want to make sure this is within the expected range and below the max utilization of a service, similar to traffic for sites.
Errors — Error rate, in errors/sec. To be healthy, we generally want to monitor if this is below a given threshold.
Latency — Response time, including queue/wait time, in milliseconds. This should also be at or below a given threshold.
Saturation — How overloaded something is, which is related to utilization but more directly measured by queue depth (or sometimes concurrency). As a queue measurement, this becomes non-zero when you are saturated and not much before. Usually a counter.
Utilization — How busy the resource or service is. Usually expressed as 0–100% and most useful for predictions (though saturation is probably the more useful alerting signal). Healthy use is generally a range with a buffer below 100%, as use indicates value provided.
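A minimal health check over these five signals might look like the sketch below. The thresholds are hypothetical placeholders; each service would define its own acceptable ranges.

```python
# Minimal health check over the five signals described above.
# Threshold values are illustrative, not recommendations.

from dataclasses import dataclass

@dataclass
class ServiceMetrics:
    request_rate: float   # requests/sec
    error_rate: float     # errors/sec
    latency_ms: float     # response time, incl. queue/wait time
    queue_depth: int      # saturation: non-zero means saturated
    utilization: float    # 0-100%

def is_healthy(m: ServiceMetrics) -> bool:
    return (
        m.request_rate <= 500        # within expected capacity
        and m.error_rate <= 1.0      # below the error threshold
        and m.latency_ms <= 250      # at or below the latency target
        and m.queue_depth == 0       # not saturated
        and m.utilization <= 80.0    # buffer below 100%
    )

print(is_healthy(ServiceMetrics(320, 0.2, 180, 0, 65.0)))   # True
print(is_healthy(ServiceMetrics(320, 0.2, 180, 12, 65.0)))  # False: saturated
```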
Modern implementations roll these technical metrics up into larger, abstracted services, which inform the SLAs, SLOs, and SLIs that set targets for the health of the service.
With achievable health and performance goals defined for your services, the next important step in your journey is to try to make these services as resilient as possible, so that their health doesn’t degrade.
Now that we've discussed the importance of defining and getting visibility into your success metrics for health, let's turn to the first approach you can take to ensure you are producing the expected, favorable output.
One of the ageless desires of any technology or IT organization is to become more proactive in ensuring service availability. The best solution to incident response isn’t to automate the resolution to issues as they occur, but rather to prevent the issues from happening in the first place.
Applying a system theory approach to get the desired output involves improving the resilience and tolerance of the system itself. Given a static input, what changes can be made to the system to improve observed output? In many cases, we cannot enforce or dictate the type of interactions and inputs entering our services. Whether it’s a spike in volume of users or network interruptions, there are variances in the inputs that affect our services. While we can't control these events, we can become more resilient by involving the operations/production lens sooner in the process to become more tolerant to handling the variance.
How do you make your system more immune to failure from the outside? “Get SREs involved earlier in the process. We want them to participate in the design, system design, and architecture building resiliency and reliability in those aspects.” — Ajoy Chattopadhyay, SRE Leader, LinkedIn
And I agree: the best way to stop the phone from ringing is to make the service, and the CI/CD pipeline downstream, resilient enough that it doesn't ring in the first place.
Where we both see opportunity is to include the SRE earlier in the plan, build, run paradigm of software and service delivery — to act as a consultant advising the larger development teams so that code and services are built for the run/production environment from the start. The SRE team can advise engineering with an operations lens to ensure best practices for production and scalability instead of managing the fallout of inefficient systems and design.
In turn, SRE and operations teams have more confidence since they were involved in the design and decision-making process to make sure services perform optimally (rather than having to troubleshoot unstable or low-quality code from the development team after the fact).
Another important piece of proactive ops is to add automated or self-remediation scripts based on the logic of the types of events and alerts your observability is throwing, so that you don’t need to be paged at 2 a.m. If you’re on-call, you don’t want to wake up just to run a simple low risk script that a machine could run on your behalf.
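One hedged sketch of this pattern is a runbook dispatcher: route well-understood, low-risk alert types to scripted fixes and page a human only for everything else. The alert names and remediation actions below are hypothetical placeholders, not any specific tool's API.

```python
# Self-remediation sketch: only alert types with a known-safe, low-risk fix
# are automated; anything unrecognized escalates to on-call. All alert types
# and actions are hypothetical.

def restart_worker(alert: dict) -> str:
    return f"restarted worker on {alert['host']}"

def clear_tmp(alert: dict) -> str:
    return f"cleared /tmp on {alert['host']}"

RUNBOOK = {
    "worker_hung": restart_worker,
    "disk_tmp_full": clear_tmp,
}

def handle(alert: dict) -> str:
    action = RUNBOOK.get(alert["type"])
    if action is not None:
        return "auto: " + action(alert)   # the machine runs the script at 2 a.m.
    return "page: escalate to on-call"    # unknown issue -> human judgment

print(handle({"type": "disk_tmp_full", "host": "db-3"}))
print(handle({"type": "data_corruption", "host": "db-3"}))
```

The design choice worth noting is the allowlist: automation covers only remediations you trust unattended, so a novel failure still reaches a human.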
While you can make systems more resilient regardless of what kind of inputs are going into them, there are also ways to proactively go outside the boundary of your service to control the set of inputs that the system experiences.
In this case, we want to try to modify and understand the inputs to your system to generate a preferred output.
It’s no surprise that the number one threat to your services' resilience is change — planned and unplanned. Getting visibility and control into your development and release pipeline will help anticipate what is going on.
Modern DevOps operating philosophies tend to introduce an error budget, which penalizes low-quality code, alongside similar principles such as defect budgets that incentivize engineers to value quality alongside features and scope. Visibility into your release and change pipeline, as well as GitHub and development history, will help you uncover configuration changes that could be causing instability or adverse effects in your service. Automated decision-based rules and approval policies can streamline what does and does not go into effect on your production service.
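Such decision-based rules can be as simple as the sketch below: gate what reaches production on change risk and remaining error budget. The field names and thresholds are hypothetical, assumed only for illustration.

```python
# Illustrative release-gating rules: combine remaining error budget with
# simple change-risk signals. Fields and thresholds are hypothetical.

def approve_change(change: dict, error_budget_left: float) -> str:
    """Return 'auto-approve', 'manual-review', or 'block'."""
    if error_budget_left <= 0:
        return "block"            # budget spent: no more changes ship
    if change.get("touches_config") and not change.get("tested_in_staging"):
        return "manual-review"    # risky config change without a staging run
    if change.get("lines_changed", 0) > 500:
        return "manual-review"    # large diffs get a human look
    return "auto-approve"

print(approve_change({"lines_changed": 40, "tested_in_staging": True}, 12.5))
# auto-approve
print(approve_change({"lines_changed": 40}, 0.0))
# block
```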
Most outages are caused by changes, so having the change context can expedite identifying the cause. Adequate change context helps focus testing on the highest-probability causes before moving on to more speculative theories. Involving SREs early in the system design and product architecture process streamlines this awareness. The immediate focus of incident response activities should be on restoring service health, followed by identification of the actual root cause and a formal postmortem. The distinction here is that some actions in incident response are intentionally temporary, whether reallocating resources, initiating failovers, or throttling requests. Once the root cause is identified and solved, these temporary ‘bandaids’ can be reverted.
Aaron Merrill, a Sr. Manager IT Ops Center at Sony Playstation, highlights the value of service context, especially with remote employees and teams.
“We have virtual teams and different infrastructure groups that work together. There are new dependencies … on internal technologies as well as external service providers. Knowing where to escalate and having them combined under one service is very useful.”
Understanding the services a team supports and the services they depend on from others helps visualize a service dependency map. This map drives faster remediation by making the changes, alerts, and incidents affecting your services easier to understand.
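A dependency map can be a small graph traversal at heart. The sketch below inverts a "depends on" mapping to answer the key incident question: if this service fails, what else is affected? The service names are hypothetical.

```python
# Service dependency map sketch: given which services depend on which,
# find everything downstream of a failing service. Names are hypothetical.

from collections import deque

# service -> services it depends on
DEPENDS_ON = {
    "checkout": ["payments", "inventory"],
    "payments": ["auth"],
    "inventory": ["auth"],
    "auth": [],
}

def affected_by(failing: str) -> set:
    """Return every service that transitively depends on `failing`."""
    # Invert the edges: for each service, who depends on it?
    dependents = {s: [] for s in DEPENDS_ON}
    for svc, deps in DEPENDS_ON.items():
        for dep in deps:
            dependents[dep].append(svc)
    # Breadth-first walk outward from the failing service.
    seen, queue = set(), deque([failing])
    while queue:
        for svc in dependents[queue.popleft()]:
            if svc not in seen:
                seen.add(svc)
                queue.append(svc)
    return seen

print(sorted(affected_by("auth")))  # ['checkout', 'inventory', 'payments']
```

In practice the map also tells you where to escalate: the teams owning the failing service's direct dependencies are the first call.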
In the world of cybersecurity and Kali Linux, you may have heard the phrase, “The quieter you are, the more you can hear”. This same principle needs to be acknowledged in an incident response methodology since every actionable alert you send your team has a cost. If you send on-call notifications for flappers, duplicate alerts, or alerts that do not cause service impact, you are not just wasting your on-call team’s time, but you're damaging their trust in your alerting service.
Reducing alert fatigue by ensuring each alert is meaningful should be a primary goal of incident response. If an event is not predictive or indicative of service degradation, an alert shouldn’t be sent for incident response; rather, the event should remain as a logged event with the rest of your observability data.
Are alerts too noisy, such that existing monitoring thresholds need to be adjusted to a stricter range? Update your observability metrics and thresholds, and routinely review the alerts entering your incident response service to ensure your metric definitions are sufficient. Or, you may find you have a visibility gap and need to set up additional monitoring. You may also find you want to implement more granular logging and tracing for future troubleshooting. Automated alert grouping is also important here to reduce repeated alerts for the same issues, while simultaneously providing environmental context into potentially related symptoms.
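The grouping idea can be sketched as a time-windowed fold: repeated alerts for the same service and metric within a window collapse into one notification with a count. This is a minimal illustration, assuming a hypothetical five-minute window and alert shape.

```python
# Alert grouping sketch: collapse repeated alerts for the same service/metric
# arriving within a time window, so on-call sees one grouped notification
# instead of a flapping stream. Window size is an illustrative choice.

WINDOW_SECONDS = 300  # group alerts arriving within 5 minutes of the last

def group_alerts(alerts: list) -> list:
    """Each alert: {'service', 'metric', 'ts'}. Returns groups with counts."""
    groups = []
    for a in sorted(alerts, key=lambda x: x["ts"]):
        for g in groups:
            if (g["service"] == a["service"] and g["metric"] == a["metric"]
                    and a["ts"] - g["last_ts"] <= WINDOW_SECONDS):
                g["count"] += 1          # duplicate/flapper: fold into group
                g["last_ts"] = a["ts"]
                break
        else:
            groups.append({**a, "last_ts": a["ts"], "count": 1})
    return groups

raw = [
    {"service": "api", "metric": "latency", "ts": 0},
    {"service": "api", "metric": "latency", "ts": 60},   # duplicate
    {"service": "api", "metric": "latency", "ts": 90},   # duplicate
    {"service": "db", "metric": "errors", "ts": 100},
]
print(len(group_alerts(raw)))  # 2 notifications instead of 4 pages
```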
When we look ahead to ‘what's next’ for SRE, we see the continued maturation and growth of AIOps and touchless site engineering (TLS). TLS is a concept Ajoy Chattopadhyay describes where bots notify response teams of issues identified and mitigated on their behalf, using a combination of modern technologies such as supervised and unsupervised machine learning for classification and aggregation, Robotic Process Automation (RPA), and Natural Language Processing (NLP).
His four recommendations for teams beginning this journey:
- Aggregating data
- Automating triage
- Automating remediation
- Focusing on incident prevention
Continued data aggregation and sharing across SRE organizations in companies means models and algorithms can better detect and provide insights into outages and degradation.
The future of incident response is a world with less manual work, where humans focus on what they do best: creatively solve hard new problems, leaving the repeated and structured tasks to the machines.