How to Get Started with Chaos: A Step-by-Step Guide to Gamedays
by James Burns
When you first start deploying applications in the cloud, it can feel amazing. You just tell the system to do something and suddenly your code is available to everyone. A bit later though, you’ll likely experience failure. It could be failure of the instance running the code, networking to clients, networking to databases, or something else.
After a while, that failure can seem like Chaos: uncontrolled, unpredictable, unwelcome.
It’s often from this place that you may hear about Chaos Engineering and wonder “why would I ever want to do that?!” Chaos Engineering seeks to actively understand the behavior of systems experiencing failure so that developers can decide, design, implement, and test resilience strategies. It grows out of knowing that failure will happen, but you can choose to see it with a clear head at 2 p.m. instead of confused, half awake, and stressed out at 2 a.m.
“Everything fails all the time” - Werner Vogels, VP & CTO at Amazon
Chaos Gamedays are an ideal way to ease into Chaos Engineering. In a Chaos Gameday, a “Master of Disaster” (MoD) decides, often in secret, what sort of failure the system should undergo. The MoD will generally start with something simple like loss of capacity or loss of connectivity. You may find, like we did, that until you can easily and clearly see the simple cases, attempting harder or more complex failures is not a good way to build confidence or spend time.
So, with that said, let’s take a look at how to run a gameday.
With the team gathered in one room (physical or virtual), the MoD declares “start of incident” and then causes the planned failure. One member of the team acts as first on-call and attempts to see, triage, and mitigate whatever failure the MoD has caused. That person is strongly encouraged to “page” other members of the team and bring them in to help understand what’s happening. Ideally the team will find and solve the issue in less than 75% of the allocated time. When that has been done or the time allocated for response has ended, the MoD will reverse the failure and the team will proceed to do a post mortem of the incident.
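The flow above can be sketched as a small timing harness. This is only an illustration, not a tool from this article: `run_gameday`, `GamedayResult`, and the three callbacks (`inject_failure`, `revert_failure`, `is_mitigated`) are hypothetical names, and how you actually cause, detect, and undo a failure depends entirely on your own system.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class GamedayResult:
    mitigated: bool       # did the team mitigate before time ran out?
    elapsed_s: float      # seconds from "start of incident" to mitigation or timeout
    within_target: bool   # mitigated in under 75% of the allocated time


def run_gameday(
    inject_failure: Callable[[], None],   # hypothetical: causes the planned failure
    revert_failure: Callable[[], None],   # hypothetical: reverses it
    is_mitigated: Callable[[], bool],     # hypothetical: has the team mitigated yet?
    budget_s: float,                      # time allocated for the response
    poll_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> GamedayResult:
    start = clock()      # the MoD declares "start of incident"
    inject_failure()     # ...and then causes the planned failure
    mitigated = False
    elapsed = 0.0
    try:
        while True:
            elapsed = clock() - start
            if is_mitigated():
                mitigated = True
                break
            if elapsed >= budget_s:
                break    # allocated response time has ended
            sleep(poll_s)
    finally:
        # The MoD reverses the failure whether or not the team solved it.
        revert_failure()
    within_target = mitigated and elapsed < 0.75 * budget_s
    return GamedayResult(mitigated, elapsed, within_target)
```

The `clock` and `sleep` parameters exist only so the harness can be exercised without waiting in real time. In a real gameday the "is it mitigated?" check is usually a human judgment call rather than a function.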
It is entirely possible that, when starting out, the team will be unable to find or solve the problem. The Master of Disaster can escalate the failure to make it more visible, because often full outages are the only observable failures. Don’t be too worried if this happens: Observability that hasn’t been tested for failure scenarios often does not show them. Knowing this is the first step in fixing your instrumentation and visualization, and ultimately giving your customers a better experience.
The post mortem should follow the usual incident process (if there is one) and/or follow best practices like PagerDuty’s. Running effective post mortems is a broad topic, but I’d encourage you to include sharing perspectives, assumptions that were made during the response, and expectations that didn’t reflect the behavior of the system or observability tooling. Coming out of the post mortem, you should have a set of actions that first fix any gaps in observability for the failure scenario. You will also likely have some ideas about how to improve resilience to that failure.
The key to the Chaos Gameday process is to, at the very least, repeat the failure and validate the specific changes to observability and resilience that were made to the application.
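One way to keep this repeat-and-validate loop honest is to record what each past gameday expected to see and compare it against what a replay actually showed. The sketch below assumes a made-up structure: `Scenario`, `find_gaps`, and outcome keys such as `alert_fired` are hypothetical names chosen for illustration, and the `replay` callable stands in for however your team re-runs a failure and records what it observed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Scenario:
    name: str
    # Hypothetical: re-runs the failure and reports what was observed,
    # e.g. {"alert_fired": True, "dashboard_showed_it": False}.
    replay: Callable[[], Dict[str, bool]]
    # What the post mortem actions were supposed to make true.
    expected: Dict[str, bool]


def find_gaps(scenarios: List[Scenario]) -> Dict[str, List[str]]:
    """Replay each scenario and return the expectations that were not met."""
    gaps: Dict[str, List[str]] = {}
    for scenario in scenarios:
        observed = scenario.replay()
        unmet = [
            key
            for key, wanted in scenario.expected.items()
            if observed.get(key) != wanted
        ]
        if unmet:
            gaps[scenario.name] = unmet
    return gaps
```

An empty result means every change made after the earlier gameday held up when the failure was repeated. Anything left in the result is a candidate action for the next post mortem.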
If you follow this process regularly, you will see a transformation in your team. Being first on-call for Chaos Gamedays, even though it’s not “real”, builds composure under pressure when doing on-call for production outages. Not only do your developers gain confidence in their understanding of the systems and how they fail, but they also get used to feeling and being ok with pressure.
Some concrete benefits:
- A more diverse on-call rotation, inclusive of those who do not feel comfortable with a “thrown in the deep end” learning process.
- Developers encounter failure with up-to-date mental models of how systems behave, instead of models dating from whenever they last happened to be on call during a failure.
- Leaders have confidence that new team members are ready to handle on-call and have clear ways to improve effectiveness.
The transformation in systems is just as dramatic. Developers, since they regularly experience failure as part of their job, start designing for failure. They consider how to make every change and every system observable. They carefully choose resilience strategies because the vocabulary of resilience is now something they simply know and speak.
It’s not just that systems become resilient to the specific failures exercised on a specific system in a Chaos Gameday. They become resilient, by design, to all the scenarios that developers know exist and are likely.
Starting the Journey of Chaos Engineering is as simple as a “sudo halt”. Following the path will grow your team and your systems in ways that are hard to imagine at first, but truly amazing to see become real. If you would like confident on-call, happy developers, and resilient systems, I encourage you to start that journey. We’re happy to help. Feel free to reach out at @1mentat.