Understanding the Incident Response Lifecycle


by Michelle Ho

Incident response doesn’t just stop when a service is operational again. Your team may breathe a sigh of relief, but incidents are part of an ongoing incident response lifecycle through which your organization learns from its mistakes.

What is an Incident Response Lifecycle?

An incident response lifecycle is a multi-step procedure that your organization uses to detect and resolve software incidents.

The National Institute of Standards and Technology (NIST) defines the incident response lifecycle in four stages: preparation; detection and analysis; containment, eradication, and recovery; and post-incident activity. Below, we’ll outline each stage and what it entails.

Why a Cycle? Isn’t Incident Response a Linear Activity?

Thinking of incident response as a cycle means that the resolutions and takeaways from one incident flow into the preparation and response for the next. Over time, your organization will get better at incident response as it becomes more proactive and less reactive.

[Figure: The incident response lifecycle]

What is an Incident Response Plan?

An incident response plan is the set of specific, documented protocols a team follows during incident resolution. These protocols define roles and communication procedures. We can think of an incident response plan as fleshing out the details of the broader incident response lifecycle framework.

Preparation

The preparation stage of the incident response lifecycle covers the work your organization does before an incident, during normal operations.

This includes:

  • Setting up alerts, monitoring, traces, and logs
  • Setting up on-call schedules
  • Setting up incident response roles and teams
  • Setting up incident response playbooks
  • Setting up incident response communication channels and protocols

While most of these items can be handled with simple tools like a spreadsheet, a Google Doc, or an email list, a platform like Lightstep Incident Response (LIR) gives you considerably more flexibility. For instance, if you’re managing an on-call rotation in a spreadsheet, covering for someone who goes on vacation or is out sick takes a lot of manual back and forth. With LIR, on the other hand, you can simply edit the on-call rotation page, and engineers will be notified when they go on call. Managers or site reliability engineers (SREs) can also create incident response teams, add team members, and define service ownership and escalation policies.
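To make the vacation example concrete, here is a minimal sketch of a weekly rotation with day-level overrides. The `Rotation` class, its fields, and the engineer names are hypothetical illustrations, not part of Lightstep Incident Response.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class Rotation:
    """Hypothetical on-call rotation; not an LIR data model."""
    engineers: list[str]
    start: date                 # first day of the first shift
    shift_days: int = 7         # assumed weekly shifts
    # Overrides map a calendar day to a substitute engineer.
    overrides: dict[date, str] = field(default_factory=dict)

    def on_call(self, day: date) -> str:
        """Return who is on call on the given day."""
        if day in self.overrides:
            return self.overrides[day]
        shifts_elapsed = (day - self.start).days // self.shift_days
        return self.engineers[shifts_elapsed % len(self.engineers)]

    def cover(self, first: date, last: date, substitute: str) -> None:
        """Assign a substitute for every day from first to last, inclusive."""
        day = first
        while day <= last:
            self.overrides[day] = substitute
            day += timedelta(days=1)


rotation = Rotation(["ana", "ben", "chi"], start=date(2024, 1, 1))
# ben is on vacation during his shift (Jan 8 to Jan 14), so chi covers.
rotation.cover(date(2024, 1, 8), date(2024, 1, 14), "chi")
```

Keeping overrides separate from the base schedule means a vacation never requires rewriting the rotation itself, which is the manual back and forth a spreadsheet tends to create.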

Lightstep Incident Response provides integrations with popular observability and monitoring systems. When metrics look anomalous, an alert is piped through the LIR system, and you receive an email, a text message, or a push notification in the LIR app.

Detection

The detection phase of the incident response lifecycle begins with the detection of the incident and the mobilization of the incident response team, which must swiftly do the following:

  • Diagnose the scope and severity of the incident to determine the appropriate response level.
  • Create an incident and broadly communicate that it is occurring.
  • Assemble the incident response team and initiate the needed collaboration across chat and virtual meetings.

In the Lightstep Incident Response product, incidents can be created by manually promoting an alert, by automatically promoting an alert via a response rule, or by direct manual incident creation from desktop or mobile. Once created, the incident appears with comments, work notes, and incident activity captured in the incident timeline stream. It becomes a dynamic dashboard that captures the current state of the incident and can serve as a beacon for everyone in the org who might have questions.
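The idea of promoting an alert through a response rule can be sketched roughly as follows. This is an illustration only: the `Alert`, `ResponseRule`, and `Incident` types, the severity scale, and the `promote` function are assumptions made for the example and do not reflect how LIR implements response rules.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Alert:
    service: str
    severity: int   # assumed scale: 1 is most severe, 5 is least
    message: str


@dataclass(frozen=True)
class ResponseRule:
    """Hypothetical rule: promote alerts on a service at or above a severity."""
    service: str
    max_severity: int   # promote when alert.severity <= max_severity

    def matches(self, alert: Alert) -> bool:
        return alert.service == self.service and alert.severity <= self.max_severity


@dataclass
class Incident:
    title: str
    severity: int
    timeline: list[str]


def promote(alert: Alert, rules: list[ResponseRule]) -> Optional[Incident]:
    """Return a new Incident if any rule matches; otherwise the alert stays an alert."""
    if any(rule.matches(alert) for rule in rules):
        return Incident(
            title=f"{alert.service}: {alert.message}",
            severity=alert.severity,
            timeline=[f"Promoted automatically from alert: {alert.message}"],
        )
    return None


rules = [ResponseRule(service="checkout", max_severity=2)]
incident = promote(Alert("checkout", 1, "error rate above 5%"), rules)
ignored = promote(Alert("checkout", 4, "disk 70% full"), rules)
```

Here the first alert becomes an incident with a timeline entry recording how it was created, while the low-severity alert is left for manual triage.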

The incident response team, meanwhile, will gather in the incident response workspace, a part of the LIR application that contains panels with incident details and the available actions. This is where you can start or join a Slack channel or Zoom call for incident communications, and access the incident document where information is aggregated to get participants up to speed.

Resolution

In this next phase of the incident response lifecycle, the incident response team works together to resolve the incident.

While every incident is unique and there’s little we can prescribe to solve your exact issue, the following are some best practices:

1. The incident commander is the single source of truth

The incident commander should be making all decisions, and all information should flow through them. No one should be performing any remediation or taking any action unless the incident commander has said that they can. Incidents can be very chaotic, and it’s important that it’s clear who is in charge.

2. SMEs diagnose and recommend fixes

The first item on the incident commander’s to-do list, once they have assembled their team, is to gather information from their subject matter experts (SMEs) so they can make a diagnosis and form a remediation plan. The SMEs should describe the symptoms of the incident (e.g., backend servers are unreachable), followed by their recommendations for how to fix it. The focus here is on restoring service as quickly as possible, which may involve short-term fixes.

3. The incident commander decides which fixes to try

The incident commander should next aggregate the list of proposed repair actions and decide which approach to try, weighing how risky each one is. For instance, restarting all 100 servers at once would probably fix the problem, but it would be much riskier than doing a rolling restart.
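To illustrate why a rolling restart is the lower-risk option, here is a minimal sketch that restarts servers in small batches and stops as soon as a batch fails its health check. The `restart` and `healthy` callables are hypothetical stand-ins for whatever your infrastructure tooling provides, and the simulated failure on `web-25` exists only for the example.

```python
from typing import Callable, Sequence


def rolling_restart(
    servers: Sequence[str],
    batch_size: int,
    restart: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> list[str]:
    """Restart servers batch by batch; halt after the first unhealthy batch.

    Returns the servers that were restarted, so responders know the blast radius.
    """
    restarted: list[str] = []
    for start in range(0, len(servers), batch_size):
        batch = servers[start:start + batch_size]
        for server in batch:
            restart(server)
            restarted.append(server)
        # Verify the batch before touching the next one.
        if not all(healthy(server) for server in batch):
            break
    return restarted


servers = [f"web-{i}" for i in range(100)]
restart_log: list[str] = []
restarted = rolling_restart(
    servers,
    batch_size=10,
    restart=restart_log.append,                   # stand-in for a real restart call
    healthy=lambda server: server != "web-25",    # simulate one bad server
)
```

With a batch size of 10, at most 10 of the 100 servers are out of service at any moment, and a bad restart is caught after 30 servers rather than after all 100.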

4. Repeat this mini-cycle until the issue is resolved

Throughout this phase, communication protocols that were established during the preparation phase should be followed. Meanwhile, for each one of these mini-cycles, internal updates should be communicated broadly and include:

  • The issue
  • The service affected
  • Time since degradation
  • The current severity level
  • How far along you are in response to the issue
  • Remediation steps taken and their results
  • Who is managing the coordination and members involved
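One way to keep these updates consistent is to fill in the same structure every time. The sketch below mirrors the checklist above; the `StatusUpdate` class, its field names, and the sample values are hypothetical and not an LIR format.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class StatusUpdate:
    """Hypothetical internal update; fields mirror the checklist above."""
    issue: str
    service: str
    degraded_since: datetime
    severity: str
    progress: str
    steps_taken: list[tuple[str, str]]   # (remediation step, result)
    commander: str
    responders: list[str]

    def render(self, now: datetime) -> str:
        """Format the update as plain text suitable for chat or email."""
        minutes = int((now - self.degraded_since).total_seconds() // 60)
        lines = [
            f"Issue: {self.issue}",
            f"Service affected: {self.service}",
            f"Time since degradation: {minutes} min",
            f"Severity: {self.severity}",
            f"Progress: {self.progress}",
            "Remediation so far:",
        ]
        lines += [f"  - {step}: {result}" for step, result in self.steps_taken]
        lines.append(f"Incident commander: {self.commander}")
        lines.append(f"Responders: {', '.join(self.responders)}")
        return "\n".join(lines)


update = StatusUpdate(
    issue="Backend servers are unreachable",
    service="checkout",
    degraded_since=datetime(2024, 1, 8, 14, 0, tzinfo=timezone.utc),
    severity="SEV-2",
    progress="Mitigating",
    steps_taken=[("Rolling restart of backend pool", "in progress")],
    commander="ana",
    responders=["ben", "chi"],
)
text = update.render(now=datetime(2024, 1, 8, 14, 45, tzinfo=timezone.utc))
```

Because every field is required, an update cannot go out with the severity or the coordinator missing.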

Post-Incident Learning

The work isn’t over once service is restored. After the incident, in the post-incident learning phase of the incident response lifecycle, your team should make sure the following occurs:

  • Incident review meeting takes place
  • Postmortem is written
  • Action items are assigned and carried out

In an incident review meeting, the incident is discussed, the chronology reviewed, and the root cause determined. All responders should be invited so that every action and piece of feedback is captured. Afterwards, a key person, often the incident commander, will write a postmortem to formally document what happened during the incident, what caused it, and what needs to be done to make sure it doesn’t happen again. This is where records like the incident timeline in LIR can be very helpful in recreating the incident and recalling which remediation efforts were made and when.

In addition to determining technical action items, this is also when you should review the incident response itself. Was it speedy? Were the right people involved at the right time? Were the alerts accurate or do their thresholds need to be made more sensitive? Do certain systems need more instrumentation? Was the communication around the incident timely and clear? This retrospective analysis provides the feedback loop in the incident response lifecycle. In particular, LIR incident and alert dashboards will help you spot trends and areas to focus on for improvement.
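Questions like "Was it speedy?" are easier to answer with numbers taken from the incident timeline. The sketch below derives a few durations from timestamped events; the event names (`started`, `detected`, `team_assembled`, `resolved`) and the sample times are assumptions for illustration, not fields exported by LIR.

```python
from datetime import datetime, timedelta


def response_metrics(timeline: dict[str, datetime]) -> dict[str, timedelta]:
    """Compute basic response durations from hypothetical timeline events."""
    return {
        # How long the problem existed before anyone knew about it.
        "time_to_detect": timeline["detected"] - timeline["started"],
        # How long it took to get the right people together.
        "time_to_assemble": timeline["team_assembled"] - timeline["detected"],
        # How long from detection until service was restored.
        "time_to_resolve": timeline["resolved"] - timeline["detected"],
    }


timeline = {
    "started": datetime(2024, 1, 8, 13, 50),
    "detected": datetime(2024, 1, 8, 14, 0),
    "team_assembled": datetime(2024, 1, 8, 14, 12),
    "resolved": datetime(2024, 1, 8, 15, 30),
}
metrics = response_metrics(timeline)
```

Tracking these durations across incidents makes it possible to see whether changes to alerts, staffing, or playbooks are actually shortening the response.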

Conclusion

According to IBM’s 2019 Cost of a Data Breach Report, organizations took an average of 206 days to detect a breach, and a further 73 days on average to contain and remediate the threat. The goal of Lightstep Incident Response is to provide a single, integrated platform for your team to move seamlessly through the different phases of the incident response lifecycle, and ultimately to learn from mistakes by instilling best practices.

FAQ

What is Incident Response?

Incident response is the set of systematic actions an organization takes to detect and remediate a software outage or cybersecurity breach.

What do Incident Response Lifecycles Teach Us?

Incident response lifecycles can teach us about gaps in our incident response plans, communication protocols, teams, and training.

What is the NIST Incident Response Framework?

The NIST Incident Response Framework is a particular definition of the incident response lifecycle that was developed by the National Institute of Standards and Technology. It includes four main stages: preparation; detection and analysis; containment, eradication, and recovery; and post-incident activity.
