Incident Response process
When IT services are disrupted, restoring them is a top priority for your business. Incident Response integrates with your existing processes and infrastructure so that you can manage incidents easily and effectively, improving service resilience.
Set up a team and your profile
The account administrator for Incident Response invites other internal or external users. A user profile is created automatically as soon as an invitation is sent, but the user can log in to Incident Response only after activating the profile through the invitation link that is emailed to them. Managers or site reliability engineers (SREs) can create services; a manager might want to select the required services while creating a team. Managers then create their own teams. As part of team creation, they can add team members, create on-call schedules, add escalation policies, and assign services that have associated integrations. The team is responsible for handling issues related to its assigned services.
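The team-creation step can be pictured as assembling a single record that ties members, schedules, escalation policies, and services together. The sketch below is illustrative only: all field names and values are assumptions, not the product's actual data model.

```python
# Sketch: the shape of a team-creation request a manager might submit.
# Every field name and value here is an illustrative assumption.
team = {
    "name": "payments-sre",
    "members": ["alice@example.com", "bob@example.com"],  # invited users
    "schedule": {
        "rotation": "weekly",
        "shifts": ["09:00-17:00", "17:00-09:00"],  # who covers which hours
    },
    "escalation_policy": ["on-call engineer", "team lead"],
    "services": ["checkout-api", "payments-db"],  # services with integrations
}
```

The point is the relationships: a team owns its schedule and escalation policy, and is linked to the services whose alerts it will handle.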
Services and integrations
Services are integrated with various observability or monitoring systems. These systems continuously track the status of a service to provide the earliest possible warning of failures, defects, or problems. Any system, such as a server, app, microservice, or database, can contribute to service degradation. Monitoring tools such as Datadog and Google Cloud Monitoring are configured to observe system health signals like network traffic, latency, saturation, and errors; when they detect a spike in a metric, they send an alert to Incident Response. Alerts can come from these prebuilt monitoring integrations, a generic REST API, a CLI, or an email. Each alert includes information about the type of issue and the affected component or system.
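A monitoring tool that uses the generic REST API route would assemble an alert body carrying the issue type and affected component, then POST it. The following is a minimal sketch; the endpoint URL, token, and field names are assumptions, so consult the product's API reference for the real contract.

```python
# Sketch: building the alert body a monitoring integration would send
# to Incident Response over a generic REST API. Field names, the endpoint,
# and the auth header are illustrative assumptions.
import json
import urllib.request

def build_alert(summary, component, severity="critical"):
    """Assemble the alert payload: what happened, and to which system."""
    return {
        "summary": summary,      # type of issue detected
        "component": component,  # affected component or system
        "severity": severity,
    }

payload = build_alert("p95 latency above 2s", "checkout-api")

# A monitoring tool would then POST the payload, along these lines:
# req = urllib.request.Request(
#     "https://example.com/api/v1/alerts",  # hypothetical endpoint
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json",
#              "Authorization": "Bearer <token>"},  # hypothetical auth
# )
# urllib.request.urlopen(req)
```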
Incident Response flow
When Incident Response receives the alert, response rules are triggered to determine whether any post-processing or another automated action is needed. Automated actions can route an alert to a team or contact an on-call team member. The system determines which team member to notify based on who is currently on call, as defined by the schedules and shifts that were set up when the team was created.
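Resolving "who is on call right now" reduces to matching the current time against the team's shifts. A minimal sketch, assuming a simple weekly shift table (the shift windows and member names are invented for illustration):

```python
# Sketch: resolving who is on call from a team's shifts.
# The schedule shape, times, and names are illustrative assumptions.
from datetime import datetime, time
from typing import Optional

# Each shift: (weekdays covered, start time, end time, member).
# Monday is weekday 0.
SHIFTS = [
    ({0, 1, 2, 3, 4}, time(9, 0), time(17, 0), "alice"),  # weekday days
    ({0, 1, 2, 3, 4}, time(17, 0), time(23, 59), "bob"),  # weekday evenings
    ({5, 6}, time(0, 0), time(23, 59), "carol"),          # weekends
]

def on_call(now: datetime) -> Optional[str]:
    """Return the member whose shift covers the given moment, if any."""
    for days, start, end, member in SHIFTS:
        if now.weekday() in days and start <= now.time() < end:
            return member
    return None
```

For example, `on_call(datetime(2024, 1, 3, 10, 0))` (a Wednesday morning) resolves to "alice"; a real scheduler would also handle rotations, overrides, and time zones.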
The on-call team member can acknowledge the alert through the mobile app, the desktop, an SMS message, the CLI, a Slack integration, or an email notification. If the team member does not acknowledge the alert within a specified time, the escalation policy that is associated with the team notifies the next level of recipients. If the situation warrants it, the on-call team member can promote the alert to an incident. After acknowledging the alert, they can look through logs, debug, log in to a remote system, or take other actions to find the root cause and resolve the issue. If additional expertise is needed, the on-call team member can collaborate with cross-functional teams.
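The escalation behavior above amounts to walking an ordered list of levels, each with its own acknowledgment timeout. A minimal sketch, assuming a three-level policy with invented recipients and timeouts:

```python
# Sketch: walking an escalation policy while an alert goes unacknowledged.
# The levels, recipients, and timeouts are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class EscalationLevel:
    recipients: list      # who is notified at this level
    timeout_minutes: int  # how long to wait before escalating further

POLICY = [
    EscalationLevel(["on-call engineer"], 5),
    EscalationLevel(["team lead"], 10),
    EscalationLevel(["engineering manager"], 15),
]

def level_for(minutes_unacknowledged: int) -> EscalationLevel:
    """Return the escalation level active after the alert has gone
    unacknowledged for the given number of minutes."""
    elapsed = 0
    for level in POLICY:
        elapsed += level.timeout_minutes
        if minutes_unacknowledged < elapsed:
            return level
    return POLICY[-1]  # policy exhausted: stay at the final level
```

For example, after 7 minutes without acknowledgment the first level's 5-minute window has lapsed, so the team lead is notified next.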
Resolution and postmortem
After a root cause is identified, the on-call team member works to remediate the issue and restore service levels. This could involve rolling back a change or reconfiguring a system. While the service is degraded or suffering an outage, stakeholders are kept updated on the progress of the alert or incident. When the service is restored, the team can hold a postmortem review meeting to walk through what happened and capture lessons learned, and can create a postmortem document to formally track the root cause analysis and the actions required to prevent the issue from recurring.
The following diagram illustrates the process to achieve service resilience: