With the rise of microservices and cloud architecture, distributed software systems are evolving rapidly and growing more complex over time. Even when every individual service in a distributed system is functioning properly, the interactions between those services can cause unpredictable outcomes. These unpredictable outcomes, compounded by rare but disruptive real-world events that affect production environments, make distributed systems inherently chaotic.
Chaos engineering is the practice of running controlled experiments to uncover weaknesses before the chaotic nature of production environments leads to outages, and to help instil confidence in the system’s resiliency.
Principles of Chaos Engineering
From the Principles of Chaos Engineering:
“Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”
Software testing commonly involves implementing and automating unit tests, integration tests and end-to-end tests. Although these tests are critical, they do not cover the broader spectrum of disruptions possible in a distributed system.
Chaos engineering is not meant to replace unit and integration tests. The two are meant to work together to deliver high availability and durability, which means fewer outages and therefore a better customer experience.
So, what is the process involved in Chaos Engineering?
Chaos engineering is carried out in several phases, described below.
Putting the Application in a “Steady State”
The steady state is the regular behaviour of the service, measured by its business metric.
A business metric is a metric that shows the real experience your end users have with your system. Find the feature of the application that users rely on most and track its behaviour week after week until its regular usage pattern is clear. That is the steady state.
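As a rough illustration of the idea, the steady state can be expressed as a band around the historical average of the business metric. The sketch below is only one possible approach: the metric (checkouts per minute), the sample values and the three-standard-deviation tolerance are all assumptions made for the example, not recommendations.

```python
from statistics import mean, stdev


def steady_state_band(samples, tolerance=3.0):
    """Return (low, high) bounds derived from historical metric samples.

    The band is mean +/- tolerance * sample standard deviation.
    The default tolerance of 3.0 is an arbitrary choice for this sketch.
    """
    mu = mean(samples)
    sigma = stdev(samples)
    return mu - tolerance * sigma, mu + tolerance * sigma


def in_steady_state(value, band):
    """True if the current metric value lies inside the steady-state band."""
    low, high = band
    return low <= value <= high


# Hypothetical weekly readings of a business metric,
# e.g. successful checkouts per minute.
weekly_checkouts_per_minute = [118.0, 122.0, 120.0, 119.0, 121.0]
band = steady_state_band(weekly_checkouts_per_minute)
```

With a band like this, a later reading can be checked with `in_steady_state(reading, band)` to decide whether the system is still behaving normally.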
Building a Hypothesis
Once the business metrics are defined and the application’s steady state is established, the next step is to build hypothetical use cases for which we don’t know the exact outcome. For example, what happens if:
- The database stops working?
- Requests increase unexpectedly?
- Latency increases by 100%?
- A container stops working?
- A port becomes inaccessible?
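A hypothesis such as "the database stops working" becomes testable once the failure can be injected on demand. The following is a minimal sketch, assuming the dependency is called through an ordinary Python function; `inject_fault`, `InjectedFault`, `query_database` and `get_profile` are hypothetical names invented for this example and do not belong to any particular chaos tool.

```python
import random
import time
from functools import wraps


class InjectedFault(Exception):
    """Raised in place of a real dependency failure."""


def inject_fault(failure_rate=0.0, extra_latency_s=0.0, rng=random.random):
    """Decorator that adds latency and/or failures to a function call."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if extra_latency_s > 0:
                time.sleep(extra_latency_s)  # simulate a slow dependency
            if rng() < failure_rate:
                raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_fault(failure_rate=1.0)  # hypothesis: the database is completely down
def query_database(user_id):
    return {"user_id": user_id}


def get_profile(user_id):
    """Caller under test: does it degrade gracefully when the database fails?"""
    try:
        return query_database(user_id)
    except InjectedFault:
        return {"user_id": user_id, "degraded": True}
```

Here the hypothesis would be "when the database is down, `get_profile` still returns a degraded response instead of an error", and the experiment either confirms or refutes it.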
Designing the Experiment
Now that we have built hypothetical use cases for our application for which we don’t know the outcome, it’s time to start experimenting. Best practices include:
- Start small (failure injections)
- Spin up a customer/production-like environment
- Set up the groups: experimental and control group(s)
Control group: for example, a small set of users kept under the same conditions as the steady state
Experiment group: the same size as the control group, but it is where the chaos is injected
- Minimise the blast radius: keep the number of users potentially affected by the experiment as small as possible
- Have an emergency stop: an option to stop the experiment immediately, so that any unintended consequences can be halted or rolled back
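The grouping, blast radius and emergency stop described above could be sketched as follows. This is an illustrative sketch only: hashing user IDs into buckets is one common way to get stable, equally sized groups, and the names `assign_group`, `should_inject` and `stop_event`, as well as the 1% default, are assumptions made for the example.

```python
import hashlib
import threading


def assign_group(user_id, blast_radius_percent=1.0):
    """Deterministically place a user in 'experiment', 'control' or 'none'.

    Both groups cover the same share of users (blast_radius_percent each),
    so the control group has the same expected size as the experiment group.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10000 / 100.0  # value in 0.00 .. 99.99
    if bucket < blast_radius_percent:
        return "experiment"
    if bucket < 2 * blast_radius_percent:
        return "control"
    return "none"


# Emergency stop: once set, no further chaos is injected for anyone.
stop_event = threading.Event()


def should_inject(user_id, blast_radius_percent=1.0):
    """True only for experiment-group users while the experiment is running."""
    if stop_event.is_set():
        return False
    return assign_group(user_id, blast_radius_percent) == "experiment"
```

Calling `stop_event.set()` from an operator command or an automated guard halts the injection immediately, without redeploying anything.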
The findings of a chaos experiment come from analysing the differences between the control group, which stays in steady state, and the experimental group. These differences are good indicators of how the system behaved under chaotic conditions. Useful observations include:
- Time to detect the failure
- Time to get notified by alarm systems
- Time taken for graceful degradation
- Time taken for partial/full recovery
- Time taken for the system to go back to steady state
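These observations can be derived from a handful of timestamps recorded during the experiment. The sketch below assumes such timestamps are available from monitoring and alerting tools; the `ExperimentTimeline` class, its field names and the sample times are made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ExperimentTimeline:
    """Timestamps recorded during one chaos experiment (hypothetical fields)."""
    injected_at: datetime      # fault injected
    detected_at: datetime      # monitoring first noticed the failure
    alerted_at: datetime       # alarm reached a person or on-call system
    degraded_at: datetime      # graceful degradation became active
    recovered_at: datetime     # full recovery of the affected component
    steady_state_at: datetime  # business metric back in its normal range

    def report(self) -> dict:
        """Return each observation in seconds, measured from the injection."""
        def since_injection(moment: datetime) -> float:
            return (moment - self.injected_at).total_seconds()

        return {
            "time_to_detect": since_injection(self.detected_at),
            "time_to_alert": since_injection(self.alerted_at),
            "time_to_degrade": since_injection(self.degraded_at),
            "time_to_recover": since_injection(self.recovered_at),
            "time_to_steady_state": since_injection(self.steady_state_at),
        }


start = datetime(2024, 1, 1, 12, 0, 0)  # made-up example data
timeline = ExperimentTimeline(
    injected_at=start,
    detected_at=start + timedelta(seconds=20),
    alerted_at=start + timedelta(seconds=45),
    degraded_at=start + timedelta(seconds=60),
    recovered_at=start + timedelta(minutes=5),
    steady_state_at=start + timedelta(minutes=8),
)
report = timeline.report()
```

Comparing such reports across repeated runs shows whether detection and recovery are getting faster as fixes are applied.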
Continuous Chaos Experiments
Focus on automating the entire chaos experiment process so that the same experiments can be run repeatedly, and set up similar chaos experiments in both non-production (customer/production-like) and actual production environments.
The problems found during chaos experiments need to be fixed, and the fixes verified by re-running all phases of the chaos experiment.
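An automated run might follow the shape sketched below: check the steady state, inject the fault, check again, and always roll back. This is a simplified illustration under stated assumptions; `run_experiment` and the toy `system` dictionary are invented for the example, and a real setup would call actual infrastructure and monitoring instead.

```python
def run_experiment(name, inject, rollback, steady_state_ok):
    """Run one chaos experiment and always roll back the injected fault."""
    if not steady_state_ok():
        # Never inject chaos into a system that is already unhealthy.
        return {"name": name, "status": "skipped"}
    inject()
    try:
        held = steady_state_ok()
    finally:
        rollback()  # runs even if the steady-state check itself raises
    return {"name": name, "status": "passed" if held else "failed"}


# Toy stand-in for a real system (hypothetical).
system = {"db_up": True, "has_fallback": False}


def stop_db():
    system["db_up"] = False


def start_db():
    system["db_up"] = True


def healthy():
    return system["db_up"] or system["has_fallback"]


# First run uncovers a weakness: there is no fallback when the database is down.
first = run_experiment("database-outage", stop_db, start_db, healthy)

# After the problem is fixed, the same experiment is repeated to verify the fix.
system["has_fallback"] = True
second = run_experiment("database-outage", stop_db, start_db, healthy)
```

In this toy run the first result has status "failed" and the second "passed", which mirrors the fix-and-verify loop described above. A scheduler or CI pipeline could call `run_experiment` on a regular basis.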
The Importance of Observability
As systems evolve, they tend to become more complex and, as such, have more chance of failing. The more complex a system, the more difficult it is to debug. Focus on the principles of observability: you need to become more data-driven when debugging systems and use the feedback to improve the software. Use your chaos engineering experiments to build on this observability so that you can pre-emptively uncover issues and bugs in the system.
“Failure is a success if we learn from it.”
Learning from failures makes the system more resilient and increases the confidence in the system’s capabilities.
Chaos engineering could be a game changer in achieving this. Tech giants such as Netflix, Facebook, Google and LinkedIn are using it to ensure their systems can withstand breakdowns by fixing the issues uncovered during chaos experiments.
Senior Test Automation Engineer | Ammeon