In the past, the few social media outages that left a platform unavailable to the public were resolved quickly by experienced engineers using a practice they call “Chaos Engineering.” The practice deliberately creates failures so that teams learn how to manage the system during such an event, and it offers an opportunity to understand the system well, feels Bahaa Al Zubaidi.
The purpose of a chaos engineering exercise is to test how well the system copes with instability and to build confidence in its ability to do so.
Chaos engineering is carried out in seven steps.
1) Approval from the leader
The first step is to obtain permission from the leader to conduct the experiment, and to agree that it will be carried out safely in a controlled setting.
2) Being familiar with the system architecture
Understanding the system’s structure and discussing it with the developers will help you better grasp how it operates and identify any flaws before you take any action.
3) Forming hypotheses
Start forming hypotheses and taking notes about potential weak points and trouble spots, for example failing hard drives or broken network connections. Each time you write these down, you will understand the system a little better.
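The notes described above can be kept in a small, structured form so that each theory is easy to check later. The following is only a minimal sketch: the `Hypothesis` class, its field names, the metric names, and the threshold values are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """One written-down guess about how the system behaves under a failure."""
    failure: str      # what will be broken on purpose
    expected: str     # what we believe the system will do
    metric: str       # hypothetical metric name used to check the guess
    threshold: float  # value the metric must stay at or below


# Example entries based on the failures mentioned above (values are invented).
hypotheses = [
    Hypothesis("one hard drive fails", "reads are served from a replica",
               "error_rate_percent", 1.0),
    Hypothesis("the network link to the database drops",
               "requests are retried and succeed", "p99_latency_ms", 800.0),
]


def describe(h: Hypothesis) -> str:
    """Render a hypothesis as a single 'if ..., then ...' sentence."""
    return f"If {h.failure}, then {h.expected} ({h.metric} <= {h.threshold})."


for h in hypotheses:
    print(describe(h))
```

Writing each entry as an "if this fails, then that happens" statement with a measurable threshold makes it possible to say afterwards whether the guess was right.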
4) Reduce the blast radius
Limiting the blast radius means fewer users are affected. For instance, instead of shutting a service down for everyone, shut it down only for a select number of users.
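One common way to limit the blast radius to a select number of users is to hash each user ID into a bucket and include only a small percentage of buckets. This is a sketch under assumptions, not a prescribed method: the function name `in_blast_radius`, the experiment label, and the user IDs are hypothetical.

```python
import hashlib


def in_blast_radius(user_id: str, percent: float, experiment: str = "exp-1") -> bool:
    """Return True if this user falls inside the experiment's blast radius.

    Hashing the user ID gives a stable, roughly uniform bucket from 0 to 9999,
    so the same user is always in or always out for a given experiment.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10_000
    return bucket < percent * 100  # e.g. 1.0 percent covers buckets 0..99


# Hypothetical user IDs, used only to show the effect of a 1% blast radius.
users = [f"user-{i}" for i in range(10_000)]
affected = [u for u in users if in_blast_radius(u, percent=1.0)]
print(f"{len(affected)} of {len(users)} users are in the experiment")
```

Because the selection depends only on the user ID and the experiment label, the affected group stays the same for the whole experiment, and setting `percent` to 0 switches the experiment off for everyone.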
5) Plan and schedule the experiment
Always have a backup plan on hand. Create a shared communication channel in Teams (or on the business’s communication platform) and post updates there often. Use the same channel to give everyone who needs to know at least a week’s notice. Before beginning your first experiment, it is advisable to assemble your team of developers, testers, DevOps experts, SREs, and others.
6) Conduct your initial experiment
Running your first chaos experiment is like riding an exhilarating roller coaster. Make sure that, with the help of your team, you can stop the experiment and roll back the changes in case something goes wrong. The experiment deliberately compromises your system by making some parts of the infrastructure unavailable.
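The idea of being able to stop the experiment at any moment can be sketched as a small runner that injects a failure, watches a metric, and always undoes the change. Everything here is an assumption for illustration: `run_experiment`, its parameters, and the simulated error-rate readings do not come from any particular chaos engineering tool.

```python
import time
from typing import Callable


def run_experiment(
    inject: Callable[[], None],
    rollback: Callable[[], None],
    read_error_rate: Callable[[], float],
    abort_above: float,
    checks: int = 5,
    interval_s: float = 1.0,
) -> dict:
    """Inject a failure, watch a metric, and always roll back at the end."""
    samples = []
    aborted = False
    inject()
    try:
        for _ in range(checks):
            rate = read_error_rate()
            samples.append(rate)
            if rate > abort_above:
                aborted = True  # stop early: the system is hurting too much
                break
            time.sleep(interval_s)
    finally:
        rollback()  # runs on normal completion, on abort, and on exceptions
    return {"aborted": aborted, "samples": samples}


# Simulated system state and readings (invented values, not real measurements).
state = {"replica_down": False}
readings = iter([0.2, 0.4, 3.5, 0.3])

result = run_experiment(
    inject=lambda: state.update(replica_down=True),
    rollback=lambda: state.update(replica_down=False),
    read_error_rate=lambda: next(readings),
    abort_above=2.0,
    interval_s=0.0,
)
print(result, state)
```

Putting the rollback in a `finally` block is a deliberate choice: the failure is removed even if the monitoring code itself raises an error.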
7) Review & brainstorm the results
Gather all your data in a spreadsheet after the experiment is finished, analyse it, and then decide whether your hypothesis was true. This helps the team understand the choices that were made and address the issues you identified. After addressing the problems, you can run the experiments once more.
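As a rough illustration of the review step, the data collected in the spreadsheet could be summarised with a few lines of code. The column names, the sample values, and the 1.0 threshold below are invented for this sketch and are not real results.

```python
import csv
import io
from statistics import mean

# Hypothetical spreadsheet export (all values invented for illustration).
SHEET = """phase,error_rate_percent
baseline,0.2
baseline,0.3
experiment,0.6
experiment,0.9
experiment,0.7
"""


def summarize(sheet_text: str, threshold: float) -> dict:
    """Group readings by phase and check the worst experiment reading."""
    by_phase = {}
    for row in csv.DictReader(io.StringIO(sheet_text)):
        by_phase.setdefault(row["phase"], []).append(float(row["error_rate_percent"]))
    worst = max(by_phase["experiment"])
    return {
        "baseline_mean": mean(by_phase["baseline"]),
        "experiment_mean": mean(by_phase["experiment"]),
        "worst": worst,
        # The hypothesis holds if the error rate never exceeded the threshold.
        "hypothesis_held": worst <= threshold,
    }


summary = summarize(SHEET, threshold=1.0)
print(summary)
```

Comparing the experiment readings against a baseline taken before the failure was injected helps the team see how much of the change was actually caused by the experiment.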
Thank you for your interest in Bahaa Al Zubaidi blogs. For more stories, please stay tuned to www.bahaaalzubaidi.com