Incident Response Lifecycle

by Michelle Ho

Incident response doesn’t stop when a service is “operational” again. Your team may breathe a sigh of relief, but each incident is part of an ongoing incident response lifecycle through which your organization learns from its mistakes. Specifically, the incident response lifecycle is the multi-step procedure your organization uses to prepare for, detect, resolve, and learn from software incidents.

The goal of Lightstep Incident Response is to provide a single, integrated platform for you to move seamlessly through the different phases of the incident response lifecycle. Below, we’ll outline each of them in turn.

[Figure: the incident response lifecycle]

1. Preparation

The Preparation stage of the incident response lifecycle covers the work your organization does before an incident, during “normal operations,” either to prevent incidents from happening or to mitigate their impact when they do. This includes:

  • Setting up alerts, monitoring, and traces
  • Setting up logs
  • Defining services
  • Setting up on-call schedules
  • Setting up incident response roles
  • Setting up incident response playbooks
  • Setting up incident response communication systems
  • Maintaining up-to-date lists of the subject matter experts who are on call for each system

While most of these items can be handled with simple tools like a spreadsheet, a Google Doc, or an email list, a platform like Lightstep Incident Response (LIR) is considerably more flexible. For instance, if you manage an on-call rotation in a spreadsheet, someone going on vacation or calling in sick triggers a round of manual back-and-forth. With LIR, on the other hand, you can simply edit the on-call rotation page, and engineers will be notified when they go on call. Managers or SREs can also create incident response teams, add team members, and define service ownership and escalation policies.
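To make the vacation example concrete, here is a minimal sketch of a rotation with date-range overrides. The `Rotation` class, its fields, and the engineer names are hypothetical and do not reflect how LIR stores schedules; the sketch only shows why an override list is easier to maintain than a hand-edited spreadsheet.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Rotation:
    """Hypothetical on-call rotation; not the LIR data model."""
    engineers: list[str]
    start: date              # first day of the first shift
    shift_days: int = 7      # assumed weekly shifts
    # (first_day, last_day, substitute) entries for vacations or sick days
    overrides: list[tuple[date, date, str]] = field(default_factory=list)

    def on_call(self, day: date) -> str:
        # An override covering this day wins over the regular schedule.
        for first, last, substitute in self.overrides:
            if first <= day <= last:
                return substitute
        shifts_elapsed = (day - self.start).days // self.shift_days
        return self.engineers[shifts_elapsed % len(self.engineers)]


rotation = Rotation(["ana", "ben", "chi"], start=date(2024, 1, 1))
# ben is away during his shift in the second week, so chi covers it.
rotation.overrides.append((date(2024, 1, 8), date(2024, 1, 14), "chi"))
print(rotation.on_call(date(2024, 1, 10)))
```

Covering an absence is a single appended entry, and the regular schedule resumes on its own once the override's date range has passed.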

LIR integrates with popular observability and monitoring systems like Datadog: when metrics in Datadog look anomalous, an alert is piped through LIR and you receive an email, a text message, or a push notification from the LIR app.

2. Detection

Once a potential issue is detected, you are in the detection phase of the incident response lifecycle. Your team must swiftly do the following:

  • Diagnose the scope and severity to determine the appropriate response level
  • Create an incident and broadly communicate that it is occurring
  • Assemble the incident response team and initiate the needed collaboration across chat and virtual meetings

In the Lightstep Incident Response product, incidents can be created by manually promoting an alert, by automatically promoting an alert via a response rule, or by creating an incident directly from desktop or mobile. Once created, the incident appears with comments, work notes, and incident activity captured in the Incident timeline stream. It becomes a dynamic dashboard that captures the current state of the incident and can serve as a beacon for everyone in the org who might have questions.
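As a rough illustration of automatic promotion, the sketch below matches incoming alerts against rules and creates an incident when one applies. Every name here (`Alert`, `ResponseRule`, `promote`) and the numeric severity scale are assumptions made for the example; they are not LIR's response rule format or API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    service: str
    severity: int   # assumed scale: 1 is the most severe
    summary: str


@dataclass
class Incident:
    title: str
    service: str
    severity: int
    source: str     # "rule" for automatic promotion


@dataclass
class ResponseRule:
    """Hypothetical rule: promote alerts for a service at or above a severity."""
    service: str
    max_severity: int

    def matches(self, alert: Alert) -> bool:
        return (alert.service == self.service
                and alert.severity <= self.max_severity)


def promote(alert: Alert, rules: list[ResponseRule]) -> Optional[Incident]:
    """Return an incident if any rule matches; otherwise leave the alert for manual triage."""
    if any(rule.matches(alert) for rule in rules):
        return Incident(alert.summary, alert.service, alert.severity, source="rule")
    return None


rules = [ResponseRule(service="checkout", max_severity=2)]
page = promote(Alert("checkout", 1, "Error rate above 5%"), rules)
noise = promote(Alert("checkout", 4, "Disk at 70%"), rules)
```

Here `page` becomes an incident while `noise` stays an alert, which a responder could still promote by hand.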

A subset of the responders become the incident response team, led by an incident commander. They are gathered into the incident response workspace, a part of the LIR application that contains panels with incident details and the available actions. This is where you can start or join a Slack channel or Zoom call for incident communications, and access the incident document, where information is aggregated to get participants up to speed.

3. Resolution

In this next phase of the incident response lifecycle, the incident response team works together to resolve the incident. While every incident is unique and there’s little we can prescribe to solve your exact issue, we can provide a few guidelines.

First, the incident commander is the single source of truth. They are making all of the decisions, and all information should flow through them. No one should be performing any remediation or taking any action unless the incident commander has said that they can. Incidents can be very chaotic, and it’s important that it’s clear who is in charge.

The first item on the incident commander’s to-do list, once they have assembled their team, is to ask their subject matter experts (SMEs) for the information needed to make a diagnosis and a remediation plan. The SMEs should describe the symptoms of the incident (e.g. backend servers are unreachable), followed by their recommendations for how to fix it. The focus here is on restoring service as quickly as possible, which may involve short-term fixes.

The incident commander should aggregate the list of proposed repair actions and decide which approach to try, weighing how risky each one is. For instance, restarting all 100 servers at once would probably fix the problem, but it would be much riskier than a rolling restart. Repeat this mini-cycle until the issue is resolved.
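The rolling restart mentioned above can be sketched as follows. This is an illustrative outline only: the `restart` and `healthy` callables stand in for whatever tooling your team actually uses, and the server names and batch size are made up.

```python
from typing import Callable, Iterator


def batches(servers: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most batch_size servers."""
    for i in range(0, len(servers), batch_size):
        yield servers[i:i + batch_size]


def rolling_restart(
    servers: list[str],
    batch_size: int,
    restart: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> list[str]:
    """Restart one batch at a time and stop after the first batch that
    does not come back healthy. Returns the servers that were restarted."""
    restarted: list[str] = []
    for batch in batches(servers, batch_size):
        for server in batch:
            restart(server)
            restarted.append(server)
        if not all(healthy(server) for server in batch):
            break  # limit the blast radius and report back to the commander
    return restarted


servers = [f"web-{n:03d}" for n in range(100)]
# Placeholder callables; real ones would call your deployment tooling.
done = rolling_restart(servers, 10, restart=lambda s: None, healthy=lambda s: True)
```

The lower risk comes from the early exit: if a batch fails its health check, at most `batch_size` servers are affected by the bad step, and the rest are still serving traffic.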

Meanwhile, for each one of these mini-cycles, internal updates should be communicated broadly and include:

  • The issue
  • The service affected
  • Time since degradation
  • The current severity level
  • How far along you are in response to the issue
  • Remediation steps taken and their results
  • Who is managing the coordination and members involved
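One way to keep these updates consistent is to fill in a fixed template on every mini-cycle. The sketch below is a hypothetical structure, not an LIR feature; the field names, the "SEV-2" label, and the sample values are all invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class StatusUpdate:
    """Hypothetical container for one internal update."""
    issue: str
    service: str
    degraded_since: datetime
    severity: str
    progress: str                        # how far along the response is
    steps_taken: list[tuple[str, str]]   # (remediation step, result)
    commander: str
    responders: list[str]

    def render(self, now: datetime) -> str:
        minutes = int((now - self.degraded_since).total_seconds() // 60)
        lines = [
            f"Issue: {self.issue}",
            f"Service affected: {self.service}",
            f"Time since degradation: {minutes} min",
            f"Severity: {self.severity}",
            f"Progress: {self.progress}",
            "Remediation so far:",
        ]
        lines += [f"  - {step}: {result}" for step, result in self.steps_taken]
        lines.append(f"Incident commander: {self.commander}")
        lines.append(f"Responders: {', '.join(self.responders)}")
        return "\n".join(lines)


update = StatusUpdate(
    issue="Backend servers are unreachable",
    service="checkout",
    degraded_since=datetime(2024, 3, 5, 14, 0),
    severity="SEV-2",
    progress="Rolling restart in progress",
    steps_taken=[("Restarted first batch of servers", "error rate dropping")],
    commander="ana",
    responders=["ben", "chi"],
)
print(update.render(now=datetime(2024, 3, 5, 14, 45)))
```

Because every field is required, an update cannot be sent with one of the items above missing.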

4. Post-incident learning

The work isn’t over once service is restored. After the incident, in the post-incident learning phase of the incident response lifecycle, your team should make sure the following occurs:

  • An incident review meeting is held
  • A postmortem is written
  • Action items are assigned and carried out

In an incident review meeting, the incident is discussed, the chronology reviewed, and the root cause determined. All responders are invited to this meeting so that every action and piece of feedback is captured. Afterwards, a key person, often the incident commander, writes a postmortem to formally document what happened during the incident, what caused it, and what needs to be done to make sure it doesn’t happen again. This is where records like the Incident timeline in LIR can be very helpful in reconstructing the incident and recalling which remediation efforts were made when.

In addition to determining technical action items, this is also when you should review the incident response itself. Was it speedy? Were the right people involved at the right time? Were the alerts accurate or do their thresholds need to be made more sensitive? Do certain systems need more instrumentation? Was the communication around the incident timely and clear? This retrospective analysis provides the feedback loop in the incident response lifecycle. In particular, LIR incident and alert dashboards will help you spot trends and areas to focus on for improvement.
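A question like "Was it speedy?" is easier to answer with numbers taken from the timeline. The snippet below is a minimal sketch under the assumption that the timeline can be exported as (event, timestamp) pairs; the event names and times are invented and do not reflect LIR's timeline format.

```python
from datetime import datetime

# Hypothetical timeline export: (event name, timestamp) pairs.
timeline = [
    ("detected", datetime(2024, 3, 5, 14, 0)),
    ("incident_created", datetime(2024, 3, 5, 14, 6)),
    ("team_assembled", datetime(2024, 3, 5, 14, 15)),
    ("resolved", datetime(2024, 3, 5, 15, 30)),
]


def elapsed_minutes(events, start_event: str, end_event: str) -> float:
    """Minutes between two named events in the timeline."""
    times = dict(events)
    return (times[end_event] - times[start_event]).total_seconds() / 60


time_to_declare = elapsed_minutes(timeline, "detected", "incident_created")
time_to_resolve = elapsed_minutes(timeline, "detected", "resolved")
```

Tracking figures like these across incidents gives the review meeting a baseline to compare against rather than relying on recollection.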

Incidents are an inevitable part of building any software business. Our goal with Lightstep Incident Response is to provide a single, unified platform that allows DevOps engineers and SREs to move easily between the various phases of the response lifecycle – and ultimately to learn from mistakes by instilling best practices.

Learn more about the roles and responsibilities of an incident response management team.

Sign up for a free trial of Lightstep Incident Response
