Netflix Upgrades Chaos Engineering Strategy with Chaos Kong

A few years after Netflix first implemented its Chaos Engineering strategy with the release of Chaos Monkey, the company has significantly advanced its resiliency testing with Chaos Kong. Built on the premise that server failures are inevitable, Chaos Monkey picks servers from Netflix's production environment and kills them during business hours. This tests Netflix's resiliency while engineers are available to observe the results and respond to unexpected failures. Chaos Kong takes Chaos Engineering to the next level: instead of killing a single server, it kills an entire AWS region on which Netflix runs.
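The selection behavior described above can be illustrated with a minimal Python sketch. This is not Netflix's implementation: the function names, the 9-to-17 weekday window, and the instance identifiers are all assumptions made for illustration, and the sketch only chooses a target rather than terminating anything.

```python
import random
from datetime import datetime


def in_business_hours(now, start_hour=9, end_hour=17):
    """Return True on weekdays between start_hour (inclusive) and end_hour (exclusive).

    The 9-17 window is an assumed default, not a documented Netflix setting.
    """
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


def pick_victim(instances, now, rng=random):
    """Pick one instance at random, or return None outside business hours.

    `instances` is a hypothetical list of instance identifiers.
    """
    if not instances or not in_business_hours(now):
        return None
    return rng.choice(list(instances))


if __name__ == "__main__":
    fleet = ["i-aaa", "i-bbb", "i-ccc"]  # made-up identifiers
    print(pick_victim(fleet, datetime.now()))
```

Restricting the selection to business hours mirrors the point made in the article: failures are induced only while engineers are around to watch what happens.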

AWS region failures are much rarer than single-server failures, but they do occur, and Netflix (like other AWS users) needs to be prepared for them. A recent AWS region failure left many major websites unavailable for 6-8 hours. Because Netflix had already prepared for such an event with Chaos Kong, it was minimally affected.

Through Chaos Engineering, Netflix is able to automatically shift traffic from a failed region to a live region when a region fails. Once the failure is resolved, traffic returns to its normal distribution across regions. Chaos Kong intentionally creates region failures at times when Netflix engineers can examine the outages and their effects on Netflix video streaming.
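The idea of shifting traffic away from a failed region can be sketched as a simple reweighting. The article does not describe Netflix's actual mechanism, so the following is only an assumed model: the function name `shift_traffic`, the proportional redistribution rule, and the example traffic shares are all hypothetical.

```python
def shift_traffic(weights, failed_region):
    """Return new traffic shares with the failed region removed.

    `weights` maps region name to its share of traffic. The failed region's
    share is redistributed proportionally across the remaining regions
    (an assumed policy, chosen for illustration).
    """
    healthy = {r: w for r, w in weights.items() if r != failed_region}
    total = sum(healthy.values())
    if total == 0:
        raise ValueError("no healthy region left to absorb traffic")
    return {r: w / total for r, w in healthy.items()}


if __name__ == "__main__":
    # Illustrative shares only; not Netflix's real traffic distribution.
    normal = {"us-east-1": 0.5, "us-west-2": 0.25, "eu-west-1": 0.25}
    print(shift_traffic(normal, "us-east-1"))
```

Returning to the normal pattern after the failure is resolved would, in this model, simply mean reapplying the original `normal` mapping.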

Netflix has clearly benefited from Chaos Engineering, and the company now looks to establish the practice as a discipline that can be applied in a wider set of circumstances. To that end, Netflix has published the Principles of Chaos Engineering. The project is a work in progress, and Netflix hopes other companies and contributors will help improve it and turn Chaos Engineering into a mainstream tool. Netflix has already started experimenting with additional uses of Chaos Engineering and will continue to publish its findings. Check the Netflix Tech Blog for updates, and reach out to the engineering team with questions, comments, or ideas.
