Building an Incident Response Playbook
by Molly Star
Incident response is dynamic by nature. Every disruption in operations or unplanned change looks different, and teams must be flexible, communicative, and prepared to adapt and evolve their strategies on the fly. While there is no one-size-fits-all guide for IR management, a standard set of best practices and processes is an essential foundation for any incident response team. This is where your incident response playbook comes in. A service outage or degradation can be chaotic for even the most seasoned IR teams, but a playbook can guard against further chaos. So what goes into creating an effective playbook?
In this article, we will discuss how to build an incident response playbook for your organization. We will cover topics such as identifying critical systems, creating notification procedures, assembling your incident response team, and common playbook scenarios.
An incident response playbook is a document that outlines the steps that should be taken when an organization experiences a security incident. It can be used to help ensure that the right people are notified, the correct systems are checked, and the appropriate actions are taken to mitigate the damage caused by the attack. It essentially captures all the moving parts of incident response management.
You may have heard the phrase “incident response runbook,” which is closely related to a playbook. In contrast to a playbook’s holistic approach, a runbook details specific tasks and automated processes. Playbooks can (and should!) include runbooks.
The start of an incident is the worst time to be figuring out who’s in charge, what to do, and in what order. A playbook ensures that your team already knows the basic protocol for incident response. It also helps you guard against vulnerabilities between incidents, and it reinforces your organization’s culture and attitude around IR management.
1. Define Your Terms
What is an incident for your organization? What requires an emergency response? Your top risks might be very different from other teams' depending on your services, your size, and your customer base. If your organization uses a 4-tier severity system, your playbook should define these levels clearly. A low severity incident at a large company could be a higher severity incident for a small team. During an incident, being on the same page about the nature of the emergency saves precious time.
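As a rough illustration of what "defining these levels clearly" could look like, here is a minimal Python sketch of a 4-tier severity table. The tier names, descriptions, paging rules, and update intervals are all hypothetical placeholders, not recommendations; each organization would fill in its own.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    """Hypothetical 4-tier scale; SEV1 is the most urgent."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


@dataclass(frozen=True)
class SeverityDefinition:
    description: str
    pages_on_call: bool                      # does this tier trigger an emergency response?
    update_interval_minutes: Optional[int]   # None means no scheduled updates


# Example definitions only (assumed values); adjust to your services and customers.
PLAYBOOK_SEVERITIES = {
    Severity.SEV1: SeverityDefinition("Customer-facing service is down", True, 30),
    Severity.SEV2: SeverityDefinition("Major feature degraded for many customers", True, 60),
    Severity.SEV3: SeverityDefinition("Minor degradation with a workaround", False, 240),
    Severity.SEV4: SeverityDefinition("Cosmetic or internal-only issue", False, None),
}


def requires_emergency_response(severity: Severity) -> bool:
    """Look up whether the playbook treats this tier as an emergency."""
    return PLAYBOOK_SEVERITIES[severity].pages_on_call
```

Keeping the definitions in one table means responders look up the answer instead of debating it mid-incident.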
2. Determine Communication Channels
Decide how and when you communicate during an incident. This includes:
- email for internal incident response
- a status page
- any chat channels within your team
- outward-facing communication with clients and customers.
Have a set order of operations for what channels are established and which roles are involved in each step. Define what should be included in every incident response email. Even if it seems repetitive, it will leave you with a transparent record of the incident.
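One way to make "what should be included in every incident response email" concrete is a small template that refuses to render when a field is blank. The sketch below is only an assumption about what such a template might contain; the field names (`incident_id`, `severity`, `status`, `summary`, `next_update`) are hypothetical.

```python
from dataclasses import dataclass, fields
from datetime import datetime, timezone


@dataclass
class IncidentUpdate:
    incident_id: str
    severity: str
    status: str
    summary: str
    next_update: str
    sent_at: datetime


def render_email(update: IncidentUpdate) -> str:
    """Render a plain-text update, rejecting any blank text field."""
    for f in fields(update):
        value = getattr(update, f.name)
        if isinstance(value, str) and not value.strip():
            raise ValueError(f"missing required field: {f.name}")
    subject = f"[{update.severity}] {update.incident_id}: {update.status}"
    body = "\n".join([
        f"Sent: {update.sent_at.isoformat()}",
        f"Summary: {update.summary}",
        f"Next update: {update.next_update}",
    ])
    return f"Subject: {subject}\n\n{body}"


if __name__ == "__main__":
    example = IncidentUpdate(
        incident_id="INC-001",
        severity="SEV2",
        status="Investigating",
        summary="Checkout latency is elevated.",
        next_update="in 30 minutes",
        sent_at=datetime.now(timezone.utc),
    )
    print(render_email(example))
```

Because every update has the same shape, the sent emails double as a timeline when you write the postmortem.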
Take into account how remote/hybrid work affects communication protocol. Your team may work in different time zones or have varying availability on certain platforms. If you are remote and planning to integrate hybrid or in-office work in the future, account for that transition in your playbook.
3. Set Predefined Roles
Defining roles and their responsibilities before an incident happens cuts down on confusion once one is underway. How you define each role depends on the size of your team, the nature of an incident, and how you interface with customers/clients:
The incident manager is the person in charge of the incident response team and is responsible for directing the team's actions. They are typically a senior member of the organization who has experience in managing emergency situations. It’s also important to distinguish what this person doesn’t do. While they lead the process of incident response, manage communication, and delegate other roles, an incident manager is likely not dealing with the tech hands-on. Even if this role needs to be handed over in the course of one incident, your playbook will provide the necessary info to make a smooth transition possible.
The tech lead/on-call engineer serves as a subject matter expert, and you may bring in additional subject matter experts as needed. Consider keeping a ready-made list of experts for your organization’s services. This eliminates a time-consuming hunt for names!
The communications manager is responsible for communications around an incident, whether internal or external. Additional roles can include social media managers to communicate with customers during an incident, or a postmortem lead. Be consistent about what roles mean, but define them as best serves your specific organization’s needs.
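The roles above, including a mid-incident handoff, could be tracked with something as simple as the following sketch. The `RoleRoster` class and the role names in `ROLES` are illustrative assumptions based on the three roles described here; a real playbook might track more roles or use its incident tooling instead.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Role names taken from the descriptions above; extend as your team needs.
ROLES = ("incident manager", "tech lead", "communications manager")


@dataclass
class RoleRoster:
    assignments: Dict[str, str] = field(default_factory=dict)
    # Each handoff is recorded as (role, previous person, new person).
    handoffs: List[Tuple[str, str, str]] = field(default_factory=list)

    def assign(self, role: str, person: str) -> None:
        """Assign a role, logging a handoff if someone else held it."""
        if role not in ROLES:
            raise ValueError(f"unknown role: {role}")
        previous = self.assignments.get(role)
        if previous is not None and previous != person:
            self.handoffs.append((role, previous, person))
        self.assignments[role] = person

    def unfilled(self) -> List[str]:
        """Return the roles nobody holds yet."""
        return [role for role in ROLES if role not in self.assignments]
```

Calling `unfilled()` at the start of an incident gives a quick check that nobody is assuming someone else is in charge.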
4. Delineate Your Process
The last thing you want when an incident is already underway is disagreement about what steps to take. Define your incident response lifecycle: the processes of detection, communication, assessment, escalation, delegating roles, and resolution. With this outline, it’s clear what each stage involves, who is on deck, when to communicate and update, etc. Tooling like Lightstep Incident Response can help streamline and automate this lifecycle and provide a hub for communication and alerts.
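The lifecycle stages named above can be written down as an ordered sequence so that "what comes next" is never a matter of debate. This is a minimal sketch: the strictly linear ordering is an assumption made for illustration, and in practice stages such as communication often run in parallel with the others.

```python
from enum import Enum
from typing import Optional


class Stage(Enum):
    DETECTION = "detection"
    COMMUNICATION = "communication"
    ASSESSMENT = "assessment"
    ESCALATION = "escalation"
    DELEGATING_ROLES = "delegating roles"
    RESOLUTION = "resolution"


# Enum members iterate in definition order, which we treat as the lifecycle order.
LIFECYCLE = list(Stage)


def next_stage(current: Stage) -> Optional[Stage]:
    """Return the stage that follows `current`, or None after resolution."""
    index = LIFECYCLE.index(current)
    if index + 1 < len(LIFECYCLE):
        return LIFECYCLE[index + 1]
    return None
```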
5. Encourage Thorough Postmortems
Good incident response management doesn’t end when the incident itself is resolved. Your playbook should detail your postmortem process, from documentation and discussion to action items. An incident report should include:
- Steps taken to resolve incidents
- Steps to protect against disruption going forward
A blameless postmortem culture lets engineers openly discuss what went wrong and what can be done going forward. Your postmortem process can also help your team prioritize actions after an incident - what needs to be accomplished first, and what can wait.
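To show how an incident report and its prioritized follow-ups might be structured, here is a small sketch. The `IncidentReport` and `ActionItem` types and their fields are hypothetical; they mirror the two report items listed above plus a simple priority number for deciding what needs to be accomplished first.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ActionItem:
    description: str
    priority: int            # 1 = do first; larger numbers can wait
    owner: str = "unassigned"


@dataclass
class IncidentReport:
    title: str
    resolution_steps: List[str] = field(default_factory=list)   # steps taken to resolve
    prevention_steps: List[str] = field(default_factory=list)   # steps to protect going forward
    action_items: List[ActionItem] = field(default_factory=list)

    def prioritized_actions(self) -> List[ActionItem]:
        """Return action items ordered from most to least urgent."""
        return sorted(self.action_items, key=lambda item: item.priority)
```

Note that the report records steps and follow-ups, not names of who caused what, which fits a blameless process.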
6. Flexibility
Even for the most nimble IR management, there will always be unknowns. It may be a worthwhile exercise to think about what parts of a playbook are a “living document.” What is static - e.g. the structure of an incident report email - and what could be collaborative or more frequently updated? Perhaps a recent incident’s postmortem discussions revealed something that changed the way your team thinks about IR processes. Don’t be afraid to be fluid.
An incident response playbook should be used whenever an unexpected or abnormal event occurs that impacts the normal operations of your business. This could include:
- Service outage
- Resource exhaustion
- Phishing attacks
- Security incident
- Ransomware attacks
- Distributed Denial of Service (DDoS)
Having a playbook in place will help you to quickly respond to and resolve the incident. If you don't have an incident response playbook, now is the time to create one!
Taking the basic components of a playbook, you can tailor them to common threats. Let’s look at a few examples of incident response playbook scenarios:
1. Slowdown of service
Your site or app slowing down can have big logistical and financial consequences. Playbooks can put you in a good position for these unplanned degradations. These might involve:
- Preparation: know your backup plans inside and out; detail the SRE metrics you’ll be expected to maintain even in an unplanned event.
- Analyze: determine the severity of the degradation, and whether it is directly impacting customers or is contained internally.
- Mitigate: figure out steps for getting service to a place where customers and clients are not directly affected; prepare for what processes you might pause to triage.
- Resolve: determine the contributing factors of the slowdown and take action to fix it.
- Postmortem: discuss incident logs and where different steps could have been taken, how the issue could have been resolved faster, and what to change to prevent further incidents.
2. Resource exhaustion
Web server resource exhaustion puts you and your customers in a jam. It could prevent users, customers, and even your colleagues from logging into and accessing services, cause serious slowdown, and more. The incident response playbook for resource exhaustion might involve things like:
- Preparation: plan ahead of time for what you will prioritize in case of limiting traffic or pausing an app or function.
- Analyze: contributing factors and fixes can be very diverse here. Just one misconfigured polling process or overlooked memory leak could be quickly consuming resources.
- Mitigate the disruption: if engineers need to repair code, have plans in place to restore or save even partial service, like resorting to a static page.
- Recovery: include metrics for what your baseline performance is in addition to restoring any paused services.
- Postmortem: depending on the source of the problem, a discussion could involve more efficient allocation of memory, fixing a bug in software, changing how you monitor resource usage, or making a plan to check code for any errors that could cause further issues.
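As one example of "changing how you monitor resource usage," the sketch below classifies a usage reading against warning and critical thresholds. The 80% and 95% thresholds are arbitrary assumptions for illustration, and disk space (read with Python's standard `shutil.disk_usage`) stands in for whichever resource your services actually exhaust.

```python
import shutil


def classify_usage(used: float, total: float,
                   warn_ratio: float = 0.80,
                   critical_ratio: float = 0.95) -> str:
    """Classify a resource reading as 'ok', 'warning', or 'critical'."""
    if total <= 0:
        raise ValueError("total must be positive")
    ratio = used / total
    if ratio >= critical_ratio:
        return "critical"
    if ratio >= warn_ratio:
        return "warning"
    return "ok"


def disk_status(path: str = ".") -> str:
    """Classify disk usage for the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return classify_usage(usage.used, usage.total)


if __name__ == "__main__":
    print(disk_status("."))
```

The same `classify_usage` function could be fed memory or connection-pool readings, so the thresholds live in one place in the playbook.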
3. Service outage
It’s overwhelming to think about an essential, customer-facing service like a payment gateway going down. It can seem like a worst case scenario when it comes to communication plans, since customers are already affected when the incident starts!
- Preparation can include simple steps like being extra aware of when you’ll get high traffic with a payment gateway - like a big product launch or pre-order window opening.
- Frequent and open communication with customers is key in this situation, but so is the wellbeing of team members. A fatigued or stressed communications lead or social media manager won’t help stressed out customers.
- Mitigate damage early on: if your service has auto-renewed subscriptions, pausing that function means far fewer failed payments while the outage is happening.
- Postmortem: after a payment gateway goes down and service is restored, your team can strategize calmly about how this incident might be prevented in the future. Is there a software fix that needs to be prioritized? Does your team want to consider using multiple payment gateways?
Most companies do not handle incidents as effectively as they could. In fact, a recent study showed that the average company takes more than 200 hours to resolve an incident. That's where Lightstep comes in. We're the all-in-one platform that enables developers, DevOps, and site reliability engineers to respond to incidents fast. By combining the right people, processes, and tools in one place, teams can recover quickly and prevent future incidents effectively.
We know that you want your team to be prepared for anything - that's why we've built Lightstep Incident Response. With our platform, you'll have everything you need to get your team back on its feet quickly.
An incident response playbook is invaluable for any IR management team. It not only keeps your incident response plans well-oiled, but reinforces your core values.