Lightstep from ServiceNow Logo






How to Get Started with Chaos: A Step-by-Step Guide to Gamedays

When you first start deploying applications in the cloud, it can feel amazing. You just tell the system to do something and suddenly your code is available to everyone. A bit later though, you’ll likely experience failure. It could be failure of the instance running the code, networking to clients, networking to databases, or something else.

After a while, that failure can seem like Chaos: uncontrolled, unpredictable, unwelcome.

Enter Chaos

It’s often from this place that you may hear about Chaos Engineering and wonder “why would I ever want to do that?!” Chaos Engineering seeks to actively understand the behavior of systems experiencing failure so that developers can decide, design, implement, and test resilience strategies. It grows out of knowing that failure will happen, but you can choose to see it with a clear head at 2 p.m. instead of confused, half awake, and stressed out at 2 a.m.

“Everything fails all the time” - Werner Vogels, VP & CTO at Amazon

Chaos Gamedays

Chaos Gamedays are an ideal way to ease into Chaos Engineering. In a Chaos Gameday, a “Master of Disaster” (MoD) decides, often in secret, what sort of failure the system should undergo. The MoD will generally start with something simple like loss of capacity or loss of connectivity. You may find, like we did, that until you can easily and clearly see the simple cases, attempting harder or more complex failures is neither a good way to build confidence nor a good use of time.

So, with that said, let’s take a look at how to run a gameday.

Chaos Gameday: Planned Failure

With the team gathered in one room (physical or virtual), the MoD declares “start of incident” and then causes the planned failure. One member of the team acts as first on-call and attempts to see, triage, and mitigate whatever failure the MoD has caused. That person is strongly encouraged to “page” other members of the team and bring them in to help understand what’s happening. Ideally the team will find and solve the issue in less than 75% of the allocated time. When that has been done or the time allocated for response has ended, the MoD will reverse the failure and the team will proceed to do a post mortem of the incident.
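As a rough illustration of the time-boxing described above, here is a minimal Python sketch of a record for one gameday run. The `GamedayRun` class, its field names, and the 60-minute time box are assumptions made for this example, not part of any real tool; only the 75% target comes from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class GamedayRun:
    """One time-boxed gameday exercise (hypothetical record, for illustration)."""
    scenario: str                          # e.g. "loss of capacity"
    started_at: datetime                   # when the MoD declared "start of incident"
    time_box: timedelta                    # total time allocated for the response
    mitigated_at: Optional[datetime] = None  # set when the team solves the issue

    def deadline(self) -> datetime:
        """Moment the MoD reverses the failure if it is still unresolved."""
        return self.started_at + self.time_box

    def met_target(self, target_fraction: float = 0.75) -> bool:
        """True if the issue was solved in less than the target share of the time box."""
        if self.mitigated_at is None:
            return False
        elapsed = self.mitigated_at - self.started_at
        return elapsed < self.time_box * target_fraction


# Example: a 60-minute exercise mitigated after 40 minutes (target is 45 minutes).
start = datetime(2019, 4, 10, 14, 0)
run = GamedayRun("loss of capacity", start, timedelta(minutes=60))
run.mitigated_at = start + timedelta(minutes=40)
print(run.met_target())  # True
```

Keeping a record like this for every gameday makes it easy to see whether response times improve as the failure is repeated.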

Chaos Gameday: Escalation

It is entirely possible that, when starting out, the team will be unable to find or solve the problem. The Master of Disaster can escalate the failure to make it more visible, because often full outages are the only observable failures. Don’t be too worried if this happens: observability that hasn’t been tested for failure scenarios often does not show them. Knowing this is the first step in fixing your instrumentation and visualization, and ultimately giving your customers a better experience.
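The escalation idea can be sketched as a simple loop. This is only an illustration: `escalate_until_seen`, the `inject` and `team_noticed` callbacks, and the severity levels are all hypothetical stand-ins for whatever the MoD actually does to the system and however the team signals that they have spotted the problem.

```python
from typing import Callable, Optional, Sequence


def escalate_until_seen(
    levels: Sequence[float],
    inject: Callable[[float], None],
    team_noticed: Callable[[], bool],
) -> Optional[float]:
    """Apply each failure level in order and return the first one the team saw.

    Returns None if even the most severe level went unnoticed.
    """
    for level in levels:
        inject(level)          # hypothetical: e.g. remove this fraction of capacity
        if team_noticed():     # hypothetical: e.g. the first on-call reports a finding
            return level
    return None


# Example with stand-in callbacks: the team only notices a full outage.
applied = []
result = escalate_until_seen(
    levels=[0.1, 0.5, 1.0],
    inject=applied.append,
    team_noticed=lambda: applied[-1] >= 1.0,
)
print(result)  # 1.0
```

The level at which the failure first became visible is a useful input to the post mortem: anything below it is a gap in observability.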

Chaos Gameday: Post Mortem

The post mortem should follow the usual incident process (if there is one) and/or follow best practices like PagerDuty’s. Running effective post mortems is a broad topic, but I’d encourage you to include sharing perspectives, assumptions that were made during the response, and expectations that didn’t reflect the behavior of the system or observability tooling. Coming out of the post mortem, you should have a set of actions that first fix any gaps in observability for the failure scenario. You will also likely have some ideas about how to improve resilience to that failure.

The key to the Chaos Gameday process is to, at the very least, repeat the failure and validate the specific changes to observability and resilience that were made to the application.
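One way to keep track of this is a small follow-up list, sketched below. The `FollowUp` class and the two helper functions are assumptions made for illustration; they order observability fixes first and only treat the loop as closed once every change has been validated by repeating the failure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FollowUp:
    """A post mortem action item (hypothetical structure, for illustration)."""
    description: str
    kind: str                 # "observability" or "resilience"
    validated: bool = False   # set True only after a repeat gameday confirms it


def in_priority_order(actions: List[FollowUp]) -> List[FollowUp]:
    """Observability fixes first, then everything else, keeping original order."""
    return sorted(actions, key=lambda a: a.kind != "observability")


def loop_closed(actions: List[FollowUp]) -> bool:
    """True once every action has been validated by repeating the failure."""
    return all(a.validated for a in actions)


# Example items; the descriptions are invented for this sketch.
actions = [
    FollowUp("retry failed database connections", "resilience"),
    FollowUp("add a dashboard panel for instance count", "observability"),
]
for action in in_priority_order(actions):
    print(action.kind, "-", action.description)
print(loop_closed(actions))  # False
```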

How Chaos Gamedays Can Transform Your Team

If you follow this process regularly, you will see a transformation in your team. Being first on-call for Chaos Gamedays, even though it’s not “real”, builds composure under pressure when doing on-call for production outages. Not only do your developers gain confidence in their understanding of the systems and how they fail, but they also get used to feeling and being ok with pressure.

Some concrete benefits:

  • A more diverse on-call rotation, inclusive of those who do not feel comfortable with a “thrown in the deep end” learning process.

  • Developers encounter failure with up-to-date mental models of how systems behave, instead of models that date from whenever they last happened to be on call during a failure.

  • Leaders have confidence that new team members are ready to handle on-call and have clear ways to improve effectiveness.

The transformation in systems is just as dramatic. Developers, since they regularly experience failure as part of their job, start designing for failure. They consider how to make every change and every system observable. They carefully choose resilience strategies because the vocabulary of resilience is now something they simply know and speak.

It’s not just that systems become resilient to the specific failures injected into a specific system during a Chaos Gameday; they become resilient, by design, to all the scenarios that developers know exist and are likely.

Starting the journey of Chaos Engineering is as simple as a “sudo halt”. Following the path will grow your team and your systems in ways that are hard to imagine at first, but truly amazing to see become real. If you would like confident on-call, happy developers, and resilient systems, I encourage you to start that journey. We’re happy to help. Feel free to reach out at @1mentat.


April 10, 2019
DevOps Best Practices


About the author

James Burns
