Technology Insights
Is Predictability Overrated? The Case for a ‘Chaos Engineering’ Game Day
By Juan Ramollino / April 13, 2023
Software development practices have changed significantly over the last two decades. Now that DevOps is mainstream, it is no longer possible to build a career on a single technology or a single aspect of the software lifecycle. Teams are multidisciplinary, and the breadth of knowledge they need for their day-to-day work increases their cognitive load.
What is an engineering game day?
A game day is an event designed to build a skill within an Engineering group. Your teams are exposed to so much documentation that they are saturated with it, so don't rely solely on written documentation or recorded videos to build an essential skill within your workforce. Various studies have demonstrated that the brain retains more information when that information is associated with an emotion (e.g., fun or sadness).
A well-engineered game can increase the engagement of your workforce and strengthen critical learnings for your organization's success. Some chaos engineering events involve a real team facing a simulated system failure to build incident response memory. What is chaos engineering, anyway? ChatGPT explains it like this: “Chaos engineering is like throwing a surprise party for your IT systems, but instead of balloons and cake, you bring chaos and disorder.”
Growing pains
At the end of 2022, AppDirect engineering was facing scaling challenges. Volume on the platform was increasing significantly, and teams were in the process of shifting toward DevOps practices. DataDog had just been deployed to simplify the stack, but teams did not yet have enough expertise with it. Finally, some groups had difficulty identifying the root cause of problems and only addressed the symptoms.
At that point, the organization needed to build its DataDog expertise and learn to identify the root cause of issues so that it could address core problems. We designed a game day around those organizational needs.
The game day event: The rules of the game
Over a few hours, a small group of Kubernetes administrators introduced various issues in the continuous integration environment for teams to find and report on. Problems ranged from a complete database outage to more subtle issues like failures in Kafka.
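The faults in our event were introduced at the infrastructure level, and the sketch below is not that tooling. It only illustrates the same idea at the smallest possible scale: wrapping a call so that a chosen fraction of requests fail on purpose. Every name in it (`InjectedFault`, `with_fault_injection`, `fetch_order`, `count_failures`) is hypothetical and exists only for this illustration.

```python
import random


class InjectedFault(Exception):
    """Raised in place of a real dependency failure during a game day."""


def with_fault_injection(func, failure_rate, rng):
    """Wrap func so that a fraction of its calls fail with InjectedFault.

    failure_rate is a probability in [0, 1]. rng is a random.Random
    instance, so a run can be reproduced from a seed.
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")

    def wrapper(*args, **kwargs):
        # rng.random() returns a float in [0, 1), so a rate of 1.0 always
        # fails and a rate of 0.0 never fails.
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def fetch_order(order_id):
    # Hypothetical stand-in for a call to a database or message broker.
    return {"id": order_id, "status": "shipped"}


def count_failures(call, attempts):
    """Call `call` with 0..attempts-1 and count the injected failures."""
    failures = 0
    for order_id in range(attempts):
        try:
            call(order_id)
        except InjectedFault:
            failures += 1
    return failures


if __name__ == "__main__":
    flaky_fetch = with_fault_injection(fetch_order, 0.3, random.Random(42))
    print("injected failures:", count_failures(flaky_fetch, 100))
```

Seeding the random generator means the same failures recur on every run, which should make it easier to compare a team's report against what was actually injected.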
Here’s how we structured our game day:
- We had more than 90 participants on 30 teams in different locations
- Teams could submit at most three incident reports, which gave participants with limited time a fair chance at winning the game
- A judging panel rated reports based on their precision and the proposed mitigation measures
- The engineering department required each team to sign up at least one participant and organized live troubleshooting sessions with an expert before the event
- Prior to the event, the teams also had access to curated Udemy training to level set on DataDog
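For readers who want to automate the bookkeeping, the submission and judging rules above could be sketched as follows. This is a minimal illustration, not the tooling we used: the `IncidentReport` and `ReportBox` names, the 0 to 5 scoring scale, and the choice to add the two scores together are all assumptions made for the example.

```python
from dataclasses import dataclass

MAX_REPORTS_PER_TEAM = 3  # the submission cap from the rules above


@dataclass(frozen=True)
class IncidentReport:
    team: str
    title: str
    precision: int   # judges' score on a hypothetical 0-5 scale
    mitigation: int  # judges' score on a hypothetical 0-5 scale


class ReportBox:
    """Collects reports and enforces the per-team submission cap."""

    def __init__(self):
        self._reports = {}

    def submit(self, report):
        """Store the report; return False if the team has used its quota."""
        team_reports = self._reports.setdefault(report.team, [])
        if len(team_reports) >= MAX_REPORTS_PER_TEAM:
            return False
        team_reports.append(report)
        return True

    def leaderboard(self):
        """Return (team, total score) pairs, highest total first."""
        totals = {
            team: sum(r.precision + r.mitigation for r in reports)
            for team, reports in self._reports.items()
        }
        # Ties are broken alphabetically so the ordering is deterministic.
        return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    box = ReportBox()
    box.submit(IncidentReport("team-a", "Database unreachable", 4, 3))
    box.submit(IncidentReport("team-b", "Consumer lag on a topic", 5, 5))
    print(box.leaderboard())
```

Capping submissions in the collection step, rather than during judging, keeps the panel from having to decide which of a team's extra reports to ignore.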
Results
The chaos engineering game day event received stellar feedback. Overall, it helped build DataDog expertise and reinforced DevOps practices within the company. It also revealed some areas for improvement, which are described below.
Investigation traces
A look into the investigation traces showed that there was room for improvement on the reports. Some didn’t provide enough information for a peer to confirm the problem. Good investigation notes contain the following elements:
- Timeline
- Screenshot or link to the symptoms of the issue
- Cause, if found
Some investigation traces only showed high-level configuration changes, with no evidence of the investigation and no link to the reported symptoms.
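One way to make those elements hard to forget is to give investigation notes a fixed shape and check them before submission. The sketch below is an assumption about how that could look, not a format we mandated: the `InvestigationNote` fields and the `missing_elements` helper are hypothetical names. The cause stays optional because the checklist above only asks for it if it was found.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class InvestigationNote:
    # Ordered entries such as "10:02 error rate rises on checkout".
    timeline: List[str] = field(default_factory=list)
    # Screenshots or links that show the symptoms of the issue.
    evidence: List[str] = field(default_factory=list)
    # Root cause, left as None when the team did not find it.
    cause: Optional[str] = None


def missing_elements(note):
    """Return the names of required elements the note does not contain."""
    missing = []
    if not note.timeline:
        missing.append("timeline")
    if not note.evidence:
        missing.append("evidence")
    return missing


if __name__ == "__main__":
    note = InvestigationNote(timeline=["10:02 error rate rises on checkout"])
    print(missing_elements(note))
```

A note that reports only a configuration change, with no timeline and no evidence, would fail this check, which is the gap a peer reviewer would otherwise have to catch by hand.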
Reported problems
The scoring criteria encouraged teams to report “harder to spot” issues. Even so, an analysis of the incident reports showed that the most visible problems were also the most reported: MySQL and AuthZ outages accounted for 48% of reported issues.
The issues introduced to RabbitMQ might have been spotted, but no team reported them.
Problem cause
Not surprisingly, participants reported on problems that they could explain. In fact, in 47% of the incident reports, the team had identified the exact change that was introduced.
Make it your own
Hosting chaos engineering game days is not new; technology leaders like Amazon recommend running them regularly with your incident response teams.
“40% of companies will adopt chaos engineering as part of their DevOps initiatives in 2023, reducing unplanned downtime by 20%.”
— Gartner
As you plan your own game day, don’t hesitate to enter uncharted territory and build something that suits your needs. For example, an organization facing performance issues could organize an event where teams use load-testing tools to degrade target microservices. Don't rely exclusively on documentation to evolve your organization's practices and culture. Instead, add a game day to your toolbox.
Have you hosted your own engineering game day? Connect with us on LinkedIn and share what worked best for you and your team. Or check out how we improved our platform performance by replacing our message broker and moving from RabbitMQ to Kafka. We'd also like to extend special thanks to Jean-Philippe Boudreault for his contributions in writing this blog with Juan.
Sources:
- https://www.frontiersin.org/articles/10.3389/fpsyg.2021.519729/full
- https://aws.amazon.com/blogs/architecture/chaos-engineering-in-the-cloud/