Chaos Engineering: A Digital Vaccine
Three things in life are certain: death, taxes, and on-demand streaming content. Well, maybe not the last one, but it is pretty amazing that when we choose a movie on Netflix, it almost always loads.
That level of reliability is not an accident.
In 2010, when Netflix moved to Amazon Web Services, they needed to ensure that outages within the cloud infrastructure would not impact their ability to deliver content to customers.
That’s when the Netflix software team invented Chaos Monkey, the first instance of chaos engineering, now an industry-wide practice. Netflix released Chaos Monkey as open source in 2012.
The idea behind chaos engineering is that, by anticipating failures, engineers can learn how a system will respond to them. This is especially important because distributed systems have so many interdependencies among software, platforms, infrastructure, and APIs. Chaos engineers therefore deliberately introduce failures and observe what actually happens.
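To make the idea concrete, here is a minimal Python sketch of a Chaos Monkey-style failure injection against a simulated cluster. It is not Netflix's implementation: the instance names, the `terminate_random_instance` and `service_available` helpers, and the rule that one healthy instance keeps the service up are all assumptions made for illustration.

```python
import random

def terminate_random_instance(instances, rng):
    """Pick one running instance at random and mark it as terminated."""
    running = [name for name, up in instances.items() if up]
    if not running:
        return None  # nothing left to terminate
    victim = rng.choice(running)
    instances[victim] = False
    return victim

def service_available(instances, min_healthy=1):
    """Assumed rule: the service stays up while enough instances are healthy."""
    return sum(instances.values()) >= min_healthy

rng = random.Random(42)  # fixed seed so the run is repeatable
cluster = {"web-1": True, "web-2": True, "web-3": True}  # hypothetical instances
victim = terminate_random_instance(cluster, rng)
print(f"terminated {victim}; service available: {service_available(cluster)}")
```

Because the simulated service has redundant instances, losing one at random should not take it down, which is the kind of behavior such an experiment is meant to confirm.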
Proponents of chaos engineering often compare it to administering a vaccine. Just like a vaccine builds resilience in the immune system, chaos engineering builds resilience in IT infrastructure.
At the CMS Cyberworks training event on July 20, CISO and ISPG Director Rob Wood explained how chaos engineering could help make software much more resilient in the face of real-world threats. “The premise behind it is that you are building your software such that it can auto-recover and self-heal from random, self-induced adversarial probing,” Wood said.
Despite its name, chaos engineering really isn’t that chaotic. It’s akin to conducting controlled experiments. Engineers hypothesize how a certain type of failure will impact their system, and then they find out what really happens. The following concepts help control the chaos.
- Steady state - the “normal” functioning status of a system. The steady state should be defined by concrete, measurable metrics such as system throughput, error rates, and latency percentiles.
- Fault injection testing (FIT) - “the deliberate introduction of errors and faults to a system to validate and harden its stability and reliability.” Examples include network disruptions, server crashes, hard drive malfunctions, and invalid data (called “fuzzing” or “fuzz testing”).
- Blast radius - the extent of the impact of a chaos engineering experiment. When introducing a failure, engineers must include mechanisms for containing that impact, especially with respect to customers or users.
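The three concepts above can be sketched together as a small simulated experiment in Python. Everything here is hypothetical: the `handle_request`, `measure`, and `run_experiment` functions, the latency figures, the 250 ms error cutoff, and the 10% error budget are assumptions chosen for illustration, not values from any real system or tool.

```python
import random
import statistics

def handle_request(rng, fault_active):
    """Simulated service call returning (ok, latency_ms). Numbers are made up."""
    latency = rng.gauss(100, 10)
    if fault_active:
        latency += 200          # injected fault: extra network delay
    ok = latency < 250          # assumed rule: slower requests count as errors
    return ok, latency

def measure(rng, n_requests, blast_radius):
    """Send n_requests; only a blast_radius fraction of them see the fault."""
    results = [handle_request(rng, rng.random() < blast_radius)
               for _ in range(n_requests)]
    error_rate = sum(not ok for ok, _ in results) / n_requests
    # quantiles(n=20) returns 19 cut points; the last is the 95th percentile
    p95 = statistics.quantiles([lat for _, lat in results], n=20)[-1]
    return {"error_rate": error_rate, "p95_latency_ms": p95}

def run_experiment(seed=0, n_requests=2000, blast_radius=0.05, max_error_rate=0.10):
    """Compare metrics under fault injection against the steady state."""
    rng = random.Random(seed)
    steady_state = measure(rng, n_requests, blast_radius=0.0)  # baseline, no fault
    observed = measure(rng, n_requests, blast_radius)          # fault injected
    # Hypothesis: the error rate stays within budget while the fault is active
    hypothesis_holds = observed["error_rate"] <= max_error_rate
    return steady_state, observed, hypothesis_holds

steady_state, observed, hypothesis_holds = run_experiment()
print("steady state:", steady_state)
print("with fault:  ", observed)
print("hypothesis holds:", hypothesis_holds)
```

Keeping `blast_radius` small limits how many simulated requests are exposed to the fault. Raising it (for example to 0.5) pushes the error rate past the assumed budget, which in a real experiment would be a signal to stop and investigate.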
At Cyberworks, a representative from the chaos engineering firm Kessel Run presented on their work with the Department of Defense. Omar Merrero explained that Kessel Run automates testing with a tool called Bowcaster.
(Kessel Run is a reference to the original “Star Wars” film. For the aficionados among you, Merrero did not indicate whether Bowcaster can complete a run in less than 12 parsecs.)
He said that developers appreciate chaos engineering because it allows them to spend less time fixing problems and more time building.
Whether or not CMS brings on a chaos engineering partner, Wood said that chaos engineering can be adopted as a “design principle.”
In fact, the technology consulting firm Gartner anticipates that 40% of organizations will implement chaos engineering practices as part of DevOps initiatives by 2023, reducing unplanned downtime by 20%.