Incident response doesn’t stop when a service is operational again. Your team may breathe a sigh of relief, but each incident is part of an ongoing incident response lifecycle through which your organization learns from its mistakes.
What is an Incident Response Lifecycle?
An incident response lifecycle is a multi-step procedure that your organization uses to detect and resolve software incidents.
The National Institute of Standards and Technology (NIST) defines the incident response lifecycle in four stages: preparation; detection and analysis; containment, eradication, and recovery; and post-incident activity. Below, we’ll outline each stage and what it entails.
Why a Cycle? Isn’t Incident Response a Linear Activity?
When you think of incident response as a cycle, the resolutions and takeaways from one incident flow into the preparation and response for the next. Over time, your organization gets better at incident response as it becomes more proactive and less reactive.
What is an Incident Response Plan?
An incident response plan is the set of specific, documented protocols a team uses during incident resolution. These protocols define roles and communication procedures. We can think of an incident response plan as fleshing out the details of the broader incident response lifecycle framework.
Preparation
The preparation stage of the incident response lifecycle covers the work your organization does before an incident, during “normal operations” time.
This includes:
Setting up alerts, monitoring, traces, and logs
Setting up on-call schedules
Setting up incident response roles and teams
Setting up incident response playbooks
Setting up incident response communication channels and protocols
While most of these items can be handled with simple tools such as a spreadsheet, a Google Doc, or an email list, a platform like Lightstep Incident Response (LIR) offers considerably more flexibility. For instance, if you manage an on-call rotation in a spreadsheet, someone going on vacation or calling in sick triggers a round of manual back-and-forth. With LIR, you can simply edit the on-call rotation page, and engineers are notified when they go on call. Managers or site reliability engineers (SREs) can also create incident response teams, add team members, and define service ownership and escalation policies. Learn more about the roles and responsibilities of an incident response management team here.
Lightstep Incident Response provides integrations with popular observability and monitoring systems; when metrics look anomalous, an alert is piped through the LIR system and you receive an email, a text message, or a push notification on the LIR app.
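To make the on-call rotation problem described above concrete, here is a minimal sketch of a rotation with per-day overrides for vacation or sick leave. It is not how Lightstep Incident Response implements scheduling; the function name `on_call_for`, the engineer names, and the weekly shift length are all assumptions made for illustration.

```python
from datetime import date


def on_call_for(day, rotation, start, shift_days=7, overrides=None):
    """Return who is on call on `day` (hypothetical helper).

    rotation:  ordered list of engineer names.
    start:     date on which the first engineer's first shift begins.
    overrides: dict mapping a date to a substitute engineer,
               for example to cover vacation or sick leave.
    """
    if day < start:
        raise ValueError("day is before the rotation start")
    # An explicit override always wins over the regular schedule.
    if overrides and day in overrides:
        return overrides[day]
    # Otherwise count whole shifts since the start and wrap around.
    shift_index = (day - start).days // shift_days
    return rotation[shift_index % len(rotation)]


rotation = ["alice", "bob", "carol"]      # illustrative names
start = date(2023, 5, 1)
overrides = {date(2023, 5, 9): "carol"}   # bob is out sick that day

for d in (date(2023, 5, 3), date(2023, 5, 9), date(2023, 5, 10)):
    print(d, on_call_for(d, rotation, start, overrides=overrides))
```

Keeping overrides separate from the base rotation means a one-day swap never disturbs the regular schedule, which is the kind of bookkeeping a spreadsheet makes error-prone.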
Detection
The detection phase of the incident response lifecycle begins with the detection of the incident and the mobilization of the incident response team, which must swiftly do the following:
Diagnose the scope and severity of the incident to determine the appropriate response level.
Create an incident and communicate broadly that it is occurring.
Assemble the incident response team and initiate the needed collaboration across chat and virtual meetings.
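The first step above, mapping scope and severity to a response level, can be sketched as a small rule table. The severity labels, thresholds, and response descriptions below are assumptions chosen for illustration; your organization’s own definitions should replace them.

```python
from dataclasses import dataclass


@dataclass
class Impact:
    """Hypothetical summary of what responders know so far."""
    users_affected_pct: float   # share of users seeing errors, 0-100
    core_service_down: bool     # is a critical path (login, checkout) failing?
    data_loss_suspected: bool


def classify_severity(impact: Impact) -> str:
    """Map an impact summary to a severity label (illustrative thresholds)."""
    if impact.data_loss_suspected:
        return "SEV1"
    if impact.core_service_down and impact.users_affected_pct >= 50:
        return "SEV1"
    if impact.core_service_down or impact.users_affected_pct >= 10:
        return "SEV2"
    return "SEV3"


# Assumed response levels; adjust to your own escalation policy.
RESPONSE_LEVEL = {
    "SEV1": "page the incident commander and all on-call SMEs; open a call",
    "SEV2": "page the owning team's on-call engineer; open a chat channel",
    "SEV3": "file a ticket for the owning team; handle in business hours",
}

impact = Impact(users_affected_pct=15.0, core_service_down=False,
                data_loss_suspected=False)
severity = classify_severity(impact)
print(severity, "->", RESPONSE_LEVEL[severity])
```

Writing the rules down, even this crudely, keeps the severity decision consistent from one responder to the next.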
In the Lightstep Incident Response product, incidents can be created by manually promoting an alert, by automatically promoting an alert via a response rule, or by creating an incident directly from desktop or mobile. Once created, the incident appears with comments, work notes, and incident activity captured in the incident timeline stream. It becomes a dynamic dashboard that captures the current state of the incident and can serve as a beacon for everyone in the org who might have questions.
The incident response team, meanwhile, gathers in the incident response workspace, a part of the LIR application that contains panels with incident details and the available actions. This is where you can kick off or join a Slack channel or Zoom call for incident communications, and access the incident document where information is aggregated to get participants up to speed.
Resolution
In this next phase of the incident response lifecycle, the incident response team works together to resolve the incident.
While every incident is unique and there’s little we can prescribe to solve your exact issue, the following are some best practices:
1. The incident commander is the single source of truth
The incident commander should be making all decisions, and all information should flow through them. No one should be performing any remediation or taking any action unless the incident commander has said that they can. Incidents can be very chaotic, and it’s important that it’s clear who is in charge.
2. SMEs diagnose and recommend fixes
The first item on the incident commander’s to-do list, once they have assembled their team, is to ask their subject matter experts (SMEs) for the information needed to make a diagnosis and a remediation plan. The SMEs should describe the symptoms of the incident (e.g. backend servers are unreachable), followed by their recommendations for how to fix it. The focus here is on restoring service as quickly as possible, which may involve short-term fixes.
3. The incident commander decides which fixes to try
The incident commander should next aggregate the list of proposed repair actions and decide which approach to try, weighing each option by its risk. For instance, restarting all 100 servers at once would probably fix the problem, but it would be much riskier than a rolling restart.
4. Repeat this mini-cycle until the issue is resolved
Throughout this phase, communication protocols that were established during the preparation phase should be followed. Meanwhile, for each one of these mini-cycles, internal updates should be communicated broadly and include:
The issue
The service affected
Time since degradation
The current severity level
How far along you are in response to the issue
Remediation steps taken and their results
Who is managing the coordination and members involved
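One way to keep these updates consistent from one mini-cycle to the next is to fill in a fixed template. The sketch below is a plain Python illustration, not a Lightstep Incident Response feature; the `StatusUpdate` class, its field names, and the sample values are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class StatusUpdate:
    """Hypothetical container for the fields listed above."""
    issue: str
    service: str
    degraded_since: datetime
    severity: str
    progress: str
    steps_taken: list[tuple[str, str]]   # (remediation step, result)
    commander: str
    responders: list[str]

    def render(self, now: datetime) -> str:
        """Format the update as plain text for chat or email."""
        minutes = int((now - self.degraded_since).total_seconds() // 60)
        lines = [
            f"Issue: {self.issue}",
            f"Service affected: {self.service}",
            f"Time since degradation: {minutes} min",
            f"Severity: {self.severity}",
            f"Response progress: {self.progress}",
            "Remediation steps and results:",
        ]
        lines += [f"  - {step}: {result}" for step, result in self.steps_taken]
        lines.append(f"Coordinator: {self.commander}; "
                     f"responders: {', '.join(self.responders)}")
        return "\n".join(lines)


update = StatusUpdate(
    issue="Backend servers are unreachable",
    service="checkout-api",               # illustrative service name
    degraded_since=datetime(2023, 5, 2, 14, 0, tzinfo=timezone.utc),
    severity="SEV2",
    progress="Fix identified, rolling out",
    steps_taken=[("Rolling restart of backend pool", "in progress")],
    commander="alice",
    responders=["bob", "carol"],
)
now = datetime(2023, 5, 2, 14, 45, tzinfo=timezone.utc)
print(update.render(now))
```

Because every update carries the same fields in the same order, readers outside the response team can scan for what changed since the last one.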
Post-Incident Learning
The work isn’t over once service is restored. After the incident, in the post-incident learning phase of the incident response lifecycle, your team should make sure the following occurs:
Incident review meeting takes place
Postmortem is written
Action items are assigned and carried out
In an incident review meeting, the incident is discussed, the chronology reviewed, and the root cause determined. All responders are invited to this meeting so that every action and piece of feedback is captured. Afterwards, a key person, often the incident commander, writes a postmortem to formally document what happened during the incident, what caused it, and what needs to be done to make sure it doesn’t happen again. This is where records like the incident timeline in LIR can be very helpful in recreating the incident and recalling which remediation efforts were made when.
In addition to determining technical action items, this is also when you should review the incident response itself. Was it speedy? Were the right people involved at the right time? Were the alerts accurate or do their thresholds need to be made more sensitive? Do certain systems need more instrumentation? Was the communication around the incident timely and clear? This retrospective analysis provides the feedback loop in the incident response lifecycle. In particular, LIR incident and alert dashboards will help you spot trends and areas to focus on for improvement.
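Questions such as “Was it speedy?” are easier to answer with numbers. As a minimal sketch, assuming you can export detection, acknowledgement, and resolution timestamps for each incident, the snippet below computes mean time to acknowledge and mean time to resolve. The record layout and the sample data are invented for illustration and do not reflect any LIR export format.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records with ISO 8601 timestamps.
incidents = [
    {"detected": "2023-03-01T10:00:00", "acknowledged": "2023-03-01T10:04:00",
     "resolved": "2023-03-01T11:30:00"},
    {"detected": "2023-03-09T22:15:00", "acknowledged": "2023-03-09T22:25:00",
     "resolved": "2023-03-09T22:45:00"},
    {"detected": "2023-04-02T03:00:00", "acknowledged": "2023-04-02T03:01:00",
     "resolved": "2023-04-02T04:00:00"},
]


def minutes_between(earlier: str, later: str) -> float:
    """Elapsed minutes between two ISO 8601 timestamps."""
    delta = datetime.fromisoformat(later) - datetime.fromisoformat(earlier)
    return delta.total_seconds() / 60


def response_metrics(records):
    """Mean time to acknowledge and to resolve, both measured from detection."""
    return {
        "mtta_minutes": mean(minutes_between(r["detected"], r["acknowledged"])
                             for r in records),
        "mttr_minutes": mean(minutes_between(r["detected"], r["resolved"])
                             for r in records),
    }


print(response_metrics(incidents))
```

Tracking these two numbers across review meetings shows whether changes to alerts, on-call schedules, or playbooks are actually shortening the response.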
Conclusion
According to a recent IBM report on cybersecurity, the average time to detect a system breach in 2019 was 206 days, and it took 73 more days on average to contain and remediate the threat, for a combined 279 days. The goal of Lightstep Incident Response is to provide a single, integrated platform that lets your team move seamlessly through the different phases of the incident response lifecycle and, ultimately, learn from mistakes by instilling best practices.
FAQ
What is Incident Response?
Incident response is an organization’s systematic actions to detect and remediate a software outage or cybersecurity breach.
What do Incident Response Lifecycles Teach Us?
Incident response lifecycles can teach us about gaps in our incident response plans, communication protocols, teams, and training.
What is the NIST Incident Response Framework?
The NIST Incident Response Framework is a particular definition of the incident response lifecycle that was developed by the National Institute of Standards and Technology. It includes four main stages: preparation; detection and analysis; containment, eradication, and recovery; and post-incident activity.
Learn more about the roles and responsibilities of an incident response management team.
Sign up for a free trial of Lightstep Incident Response