
Incident Response Management Team - Roles and Responsibilities

Incidents happen. When something goes wrong - broken features, degraded service, or outages - your incident management response team needs to respond immediately to restore functionality. But who do you call? How do you organize them? What are their responsibilities? In this article, we will discuss the incident response team's roles and responsibilities.

What does an incident response management team do?

The incident response management team is an ad hoc group of people drawn from different parts of the company with the collective goal of bringing the incident to a quick resolution. The incident could be anything from a data breach to a system outage. The team is responsible for assessing the situation, determining the best course of action, and executing the plan. Incidents get worse when the response team can't communicate or work together. To avoid this, team members should agree on a plan before they start responding, and share whatever they discover with the rest of the team while they execute it.

Incident management response team goals

The goal of the incident response team is to minimize the impact of incidents on the business. This includes minimizing the time it takes to resolve an incident, the financial impact of an incident, and the reputation damage that can occur as a result of an incident.

Each team member's role has different expectations and prerequisites; nevertheless, each role should be fulfillable by more than one person. In fact, some roles will feature more than one person per incident.

Who is on the incident response management team?

The incident response management team is typically composed of members from different departments within a company. Each member of the team has a specific role that they play during an incident. The incident response team may include the following roles:

Incident Manager

Also known as Incident Commander

Of all the incident response management roles, only the incident commander and the SMEs (Subject Matter Experts) are strictly necessary. The incident commander is the undisputed authority during the incident – they even outrank the CEO. Their job is to guide an incident to its resolution, managing the plan, communication, and responsibilities involved. This is your team leader.

In particular, the incident manager/incident commander's responsibilities include:

  • Initially assessing the severity of the incident and assembling the appropriate incident response team members.

  • Gathering information from the tech lead and subject-matter experts (SMEs), including symptoms of the incident, potential fixes, and risks involved in the remediation plans.

  • Making decisions around which remediation plans to pursue.

  • Tracking decisions and changes.

  • Communicating decisions and changes.

  • Leading post-incident review meetings and determining whether a public post-mortem is necessary.

Choosing the right Incident Commander (Incident Manager)

The role requires someone who can command respect and trust. That person is often a senior individual contributor or engineer, but the role does not necessarily require technical knowledge. In fact, the incident manager should avoid working hands-on-keyboard on the technical issue if possible. If they’re trying to fix the problem from the trenches, they lose sight of the bigger picture and run out of bandwidth to communicate effectively with other stakeholders and organize the response.

While some specific technical knowledge can be useful, successful incident commanders rely most on skill and experience in asking good questions, sorting out disagreements, and collaboratively deciding on a path forward. Their top priority is to oversee and coordinate all facets of the incident response effort. After the incident is over, they make sure (sometimes working with a scribe) that the team writes a post-mortem: a document, written with all members involved in the response, that covers what happened, why, and how to prevent it from happening again.

Subject-Matter Experts (SME)

Also known as Technical Lead or On-call Engineer

Subject Matter Experts (SMEs) are the experts on the malfunctioning services. They are responsible for developing theories about what’s broken, why, and how it can be fixed. They are usually senior individual contributors who are familiar with the underlying system, can make recommendations to the incident manager, and will participate in writing the post-mortem.

In particular, subject matter expert responsibilities include:

  • Communicating symptoms of the incident to the incident manager

  • Recommending fixes to the tech lead and incident manager

  • Notifying the incident manager/incident commander of risks associated with the fixes

  • Notifying the incident manager/incident commander of the estimated timeframe for resolution

  • Executing the remediation plan once the incident manager has decided on it

  • Managing the engineers who are actually resolving system vulnerabilities

There can be multiple subject-matter experts on an incident management team, depending on the scope of the incident and how many services it touches. The SME for a service is usually that service's on-call engineer.

Communication Manager

Also known as Communication Officer or Communication Lead

Your communication manager oversees both internal and external incident response communication throughout the incident lifecycle. They determine which communication channels will be used for the respective audiences, ensuring that the team, the organization, external stakeholders, customers, and the public are properly informed. This communication routinely involves updating an external status page or sending mass emails.

In complicated or broader-reaching incidents, you may want to split the role into internal and external communication managers.

If your organization has an active social media presence, part of a communication manager’s responsibilities may include fielding questions from users across your social media accounts. Users expect a much faster response on social media than on other channels. They may complain of a crash or outage there, so following your organization’s related hashtags helps your team keep an ear to the ground. Users’ reposts and retweets can also help share updates and extend your reach.

Business Lead

Your business lead serves as a liaison between the incident team and upper management. If an incident is serious enough, the executive teams need to stay updated on information that may impact the business as a whole, rather than as purely a technical entity.

This role’s responsibilities include fielding upper management’s questions and reports, collaborating with the customer liaison on higher-level public messages (e.g., “From the CEO”), and coordinating any legal reviews, law enforcement cooperation, or regulatory notification processes. If it’s a major legal or security issue, like a hacker stealing sensitive information, your team should consider a separate legal liaison, law enforcement liaison, or regulatory coordinator.

Customer Support Liaison

Also known as Help Desk Lead or Customer Support Agent

In incidents that affect people outside the organization, your customer support liaison acts as the public face of the incident response task force. They’re in charge of the front-line customer support team, handling all incoming support tickets and giving customer updates during the incident. They’ll work closely with the external comms manager to make sure the customers are receiving adequate and consistent information across all channels.

Updating external customers is critical in maintaining their trust. Even though their service has been impacted, they’ll be able to plan their own mitigation strategy and timeline as long as they know you’ll keep them up to date.

Tailor your communications to the customer’s interest level. Enterprise customers may want technical details if the incident affects their operational status, downstream services, or downstream customers. For customers who are mainly interested in knowing when the problem will be fixed, a few sentences per update will suffice.

Problem Manager

Also known as Root Cause Analysis Manager

The problem manager’s job is to find and fix the root cause of the incident so that it doesn’t happen again. They work with the incident commander to determine when it’s appropriate to transition from firefighting mode into problem-solving mode.

This role includes conducting post-incident reviews, analyzing incident data and logs, and developing preventative measures to stop similar incidents from happening in the future. The problem manager also documents the incident for future reference.

Not every organization needs a separate problem manager role. In small organizations or those with limited resources, the incident commander may double as the problem manager.

Legal Liaison

Your legal liaison provides legal guidance, which is especially advisable in security breaches or data loss incidents. They handle the issues surrounding compliance, interactions with law enforcement, legal representation, standards of integrity for forensic evidence, and even the legal implications of the response process and communications.

Not all incidents will need to have all roles filled all the time: if the incident doesn’t have a customer-facing impact, there will be no need for a social media lead or a customer support liaison. Conversely, effective incident response team roles and responsibilities are scoped such that there should be some redundancy in who can fill them.
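As a rough illustration of this scoping, the sketch below maps a few incident characteristics to the roles worth staffing. The `Incident` fields and the `roles_needed` function are hypothetical names invented for this example, and the rules are one possible reading of the role descriptions above, not a prescribed policy.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident description; field names are illustrative only."""
    customer_facing: bool = False
    security_or_data_loss: bool = False
    executive_visibility: bool = False
    affected_services: Tuple[str, ...] = ()


def roles_needed(incident: Incident) -> List[str]:
    """Return the roles to staff, assuming the simple rules described above."""
    # The incident commander and the SMEs are the only strictly necessary roles.
    roles = ["incident commander"]
    # One SME per affected service; fall back to a single SME if none is listed.
    services = incident.affected_services or ("unknown service",)
    roles += [f"SME ({service})" for service in services]
    # Assumption: customer-facing impact calls for comms and support roles.
    if incident.customer_facing:
        roles += ["communication manager", "customer support liaison"]
    # Assumption: serious incidents need a liaison to upper management.
    if incident.executive_visibility:
        roles.append("business lead")
    # Assumption: security breaches and data loss warrant legal guidance.
    if incident.security_or_data_loss:
        roles.append("legal liaison")
    return roles


if __name__ == "__main__":
    outage = Incident(customer_facing=True, affected_services=("checkout",))
    print(roles_needed(outage))
```

An internal-only incident created with `Incident()` yields just the commander and one SME, which matches the point that the remaining roles are optional.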

Especially during lengthy or intense incidents, there should be shifts and rotations. Incident managers should keep an eye on how long members have been working, and decide who will take over their roles next. Planning for substitutions and replacements keeps your team fresh and up to speed, and keeps the response running for as long as the incident requires. Even the incident manager doesn’t have to be the same person for the entire incident, as long as the incident document is kept up to date and the incident is handed off to the next commander in an unequivocal way.
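One way to make rotations and handoffs explicit is to track each role assignment and record every handoff in the incident log. The following is a minimal sketch under stated assumptions: `RoleAssignment`, `MAX_SHIFT`, and the four-hour limit are hypothetical choices made for this example, not values the article prescribes.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional

# Assumed shift limit; pick whatever suits your team.
MAX_SHIFT = timedelta(hours=4)


@dataclass
class RoleAssignment:
    """Hypothetical record of who currently holds a role and who can relieve them."""
    role: str
    holder: str
    started_at: datetime
    backups: List[str] = field(default_factory=list)

    def is_due_for_relief(self, now: datetime) -> bool:
        # True once the current holder has worked a full shift.
        return now - self.started_at >= MAX_SHIFT

    def hand_off(self, now: datetime, log: List[str]) -> Optional[str]:
        """Pass the role to the next backup and record the handoff in the log."""
        if not self.backups:
            return None  # nobody available; the holder stays in place
        previous, self.holder = self.holder, self.backups.pop(0)
        self.started_at = now
        # Writing the handoff down keeps the change of command unequivocal.
        log.append(f"{now.isoformat()} {self.role}: {previous} -> {self.holder}")
        return self.holder


if __name__ == "__main__":
    start = datetime(2022, 7, 7, 9, 0)
    commander = RoleAssignment("incident commander", "alice", start, ["bob"])
    incident_log: List[str] = []
    later = start + MAX_SHIFT
    if commander.is_due_for_relief(later):
        commander.hand_off(later, incident_log)
    print(incident_log)
```

Appending each handoff to the same log the team already maintains means the next commander can see who held the role before them and when it changed hands.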

By assigning incident roles with clear scope, responsibilities, and authority, you avoid confusion during the incident itself, and hopefully set up your team for a speedy resolution.

Learn more about the in-app incident response roles and responsibilities.

Effective Tips for Incident Response Team Members

  • Become familiar with the organization’s incident response plan and procedures.

  • Keep up to date on incident response news and best practices.

  • Attend incident response training courses.

  • Join incident response communities and forums.

  • Be familiar with the organization’s incident response tools.

  • Build an incident response playbook.

  • Participate in incident response exercises and drills.

  • Maintain up-to-date personal contact information.

  • Ensure that backups of all important data are kept in a secure location.

  • Stay calm and focused during an incident.

  • Communicate clearly and concisely with other members of the incident response team.

July 7, 2022
9 min read
Technical

About the author

Michelle Ho
