Incident Response For Remote Work
by Michelle Ho
While remote work may seem like a drastic change for something as time-sensitive and high-stakes as incident response, many large tech companies, such as Uber, Google, and Shopify, have long had distributed teams. SREs were often located in remote offices to take advantage of time differences for round-the-clock on-call coverage. This head start on remote-friendly incident response (IR) practices minimized the disruption when the COVID-19 pandemic hit, and in fact led to greater transparency, organization, and accountability in dealing with incidents. In this blog post, we’re sharing an overview of how the incident response process changed for the better with remote work.
- Engineers are largely at the office 9-5, where they have access to their desktops and monitors.
- Onboarding, especially onto an on-call rotation, is done via an apprenticeship model. New engineers are added when they’re “ready”.
- Large monitors at the office show metric dashboards that are visible to everyone.
- Everyone on the team gets notified for every issue and alert.
- Engineers work much more flexible hours and travel more frequently, so they may not always have access to a desktop computer or even a laptop. Being able to do at least some preliminary triage of an incident from a phone becomes crucial.
- Onboarding is much more formalized, with written checklists and steps. Educating a new hire on how to be on-call should be automated, like adding them to the HR system.
- Automatic alerting systems become much more important; alerts must be low-noise, and traceable to root causes, in order to be trustworthy.
- Monitoring and incident response systems are tightly integrated, with built-in intelligence to alert the right people at the right time.
- Ownership and clear definitions are at the core of on-call teams and incident response.
- Operators, product owners, and SREs gather in a physical “war-room” at the office to resolve the incident.
- People or teams are pulled into the war-room on an ad-hoc basis, e.g. when you try to run a command that needs certain permissions, or you want to talk to the last person to have touched some offending code.
- There’s an auto-created Slack channel, Zoom room, or incident workspace page where everything is captured and updated.
- Alerts are automatically routed to the right service owners and teams. This has the nice side effect of logging exactly when someone is pinged, signed on or joined the IR process, for the eventual postmortem.
- Context from CI/CD pipelines, change records, service dependency map views, etc., is easily accessible from a single incident management tool.
- Incident information can be easily synced with systems-of-record like ServiceNow to provide an enterprise-level view.
- Everyone is more or less co-located, so most communication happens verbally. This makes it difficult to keep track of which fixes have been tried when, and who is recommending what.
- Although incident roles have been assigned, the co-location means that roles are fluid, and authority comes from various people in the room.
- Executives who want updates on the situation come by the war room, creating pressure on the situation.
- Subject matter experts and old-hands are relied on to provide a rundown of remediation steps, from memory.
- A person (sometimes called the “scribe”) is put in charge of updating Jira tickets, status pages, etc.; otherwise they simply don’t get updated.
- Since communication mostly happens either over chat or structured incident response UIs, it’s all captured in the system and synced. Video calls can be transcribed and attached to incidents. There is a timestamp for everything. (Hot tip: all timestamps for incident events should be normalized to UTC, since people may be helping from different time zones.)
- There’s more defensiveness against miscommunication, so incident response roles and authority are very explicit. For example, the handle of the IC (incident commander) can be posted at the top of the incident channel, and there are explicit verbal handovers (“John, can you take IC?” “Confirmed, I’m taking IC.”) whenever an IC needs to be relieved.
- Executives joining virtual war-rooms can do so on mute, with the webcam off. By staying anonymous, they avoid making people nervous or being a distraction.
- Remote work forces institutional knowledge out of the heads of individuals and into documents. Instead of waking up that super senior engineer who knows everything, responders can turn to the playbooks that engineer is now more likely to have written.
- ServiceNow records, Jira tickets, GitHub issues, status pages, and the like are automatically updated.
- Writing the postmortem requires high-level detective skills, because so much happened verbally and was not recorded.
- The postmortem is in a Google Doc that gets sent out over email once and is never opened again.
- Most of the information needed to write the postmortem has been logged, either in Slack messages, the incident response tool, alerts, or dashboards. It’s much easier to recreate the situation because everything is online.
- The postmortem is written in an integrated incident response tool that stores prescriptive, structured information (e.g. playbooks) that can be reused for future incidents.
- The postmortem document is templatized and easily published and shared with cross-functional teams and stakeholders.
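To make the UTC tip and the "everything is logged" point concrete, here is a minimal Python sketch of how events from different tools and time zones might be merged into a single postmortem timeline. It is an illustration under stated assumptions, not a description of any particular product: the `IncidentEvent` type, the source labels, and the sample events are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class IncidentEvent:
    """One logged event (hypothetical shape, for illustration only)."""
    source: str            # e.g. "alert", "chat", "pager" (made-up labels)
    occurred_at: datetime  # must carry a time zone
    summary: str


def to_utc(ts: datetime) -> datetime:
    """Convert a timezone-aware timestamp to UTC; reject naive ones."""
    if ts.tzinfo is None or ts.utcoffset() is None:
        raise ValueError("naive timestamp: attach a time zone before logging")
    return ts.astimezone(timezone.utc)


def build_timeline(events):
    """Normalize every event to UTC, then sort chronologically."""
    normalized = [
        IncidentEvent(e.source, to_utc(e.occurred_at), e.summary)
        for e in events
    ]
    return sorted(normalized, key=lambda e: e.occurred_at)


def render_timeline(events):
    """Render the merged timeline as plain text, one event per line."""
    return "\n".join(
        f"{e.occurred_at:%Y-%m-%d %H:%M:%S}Z [{e.source}] {e.summary}"
        for e in build_timeline(events)
    )


def example_events():
    """Sample events logged by responders in two different UTC offsets."""
    west = timezone(timedelta(hours=-7))  # assumed offset of one responder
    east = timezone(timedelta(hours=2))   # assumed offset of another
    return [
        IncidentEvent("pager", datetime(2020, 4, 1, 18, 7, tzinfo=east),
                      "On-call acknowledged the page"),
        IncidentEvent("alert", datetime(2020, 4, 1, 9, 5, tzinfo=west),
                      "High error rate alert fired"),
        IncidentEvent("chat", datetime(2020, 4, 1, 9, 6, tzinfo=west),
                      "IC assigned in incident channel"),
    ]


if __name__ == "__main__":
    print(render_timeline(example_events()))
```

Rejecting naive timestamps instead of guessing a time zone is a deliberate choice in this sketch: a silently wrong offset would reorder the timeline and mislead whoever writes the postmortem.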
I’m a big believer that the move to remote work is the best thing that could have happened to the field of incident response. In accelerating the shift from a culture of talking, to a culture of writing things down, it’s forced companies to make the sort of long-term investments in tooling, automation, and documentation that they’ve been meaning to do for years, but never quite got around to.
Here at Lightstep, we’re building a truly modern incident response product with all the features for a remote work world. Go here if you want to check it out.