Netflix Open Sources "Resilience Engineering" Code Library

Hystrix: it's the genus name for "Old World" porcupines, and it's also the latest release from Netflix. But you won't see it in their catalog of movie and TV titles, and you can't add it to your queue, because it's not content; it's how Netflix makes sure its content is highly available. Now Netflix has made Hystrix open source, so that anyone using Amazon Web Services (AWS) can implement it in their own cloud applications. Read on for details on this "resilience engineering" code library.

An example of Hystrix dependency isolation

Mention Netflix, and most people will think of the company's DVD-rental-by-mail service or its growing library of "Watch Instantly" streaming video titles. But Netflix has developed internal infrastructure to supplement the AWS cloud, on which many Netflix services run, and has started releasing some of that code under open source licenses for any developer to use.

This week, Netflix added Hystrix to its bag of open-source tricks. Hystrix helps applications using distributed services tolerate the inevitable latencies and failures that occur in even the most reliable systems, by using "circuit breaker" mechanisms, dependency isolation, request collapsing, and request caching to insulate command calls from individual backend service dependencies.

The Hystrix home page on GitHub defines the problem:

[R]unning an application that depends on 30 services that each have 99.99% uptime we get:

99.99^30 = 99.7% uptime
0.3% of 1 billion requests = 3,000,000 failures
2+ hours downtime/month even if all dependencies have excellent uptime.

Reality is generally worse.
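The arithmetic in that quote is easy to check. Here is a minimal Python sketch, assuming that the 30 services fail independently, that every request touches all of them, and that a month has 30 days (none of which the Hystrix page states outright); the function and variable names are made up for illustration:

```python
def combined_uptime(per_service_uptime, service_count):
    """Probability that every one of the services is up at the same time,
    assuming their failures are independent (an assumption of this sketch)."""
    return per_service_uptime ** service_count


uptime = combined_uptime(0.9999, 30)             # 99.99% uptime, 30 services
failure_rate = 1 - uptime                        # fraction of requests that fail
failed_requests = failure_rate * 1_000_000_000   # out of 1 billion requests
downtime_hours = failure_rate * 30 * 24          # per 30-day month (assumed)

print(f"combined uptime:   {uptime:.1%}")
print(f"failed requests:   {failed_requests:,.0f}")
print(f"downtime per month: {downtime_hours:.1f} hours")
```

The combined uptime rounds to 99.7%, the failure count lands just under 3 million, and the monthly downtime comes out a little over two hours, in line with the figures quoted above.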

These issues are discussed at length in the February 29, 2012, Netflix Tech Blog post "Fault Tolerance in a High Volume, Distributed System," which gives much of the background on the system that has now been released as Hystrix.

The Hystrix FAQ notes that Netflix itself uses Hystrix "in many applications, particularly its edge services such as the Netflix API," and that this infrastructure handles "hundreds of billions of semaphore-isolated calls ... every day."

The Hystrix "How It Works" page gives a detailed overview of the system and highlights major features, like the Circuit Breaker architecture, which prevents cascading failures across multiple client services, and the Request Collapsing mechanism, which combines multiple requests into a single backend dependency call to reduce the number of threads and network connections required.
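To make the request collapsing idea concrete, here is a rough Python sketch. It is not Hystrix code (Hystrix is a Java library) and does not mirror its API; the class name RequestCollapser, its methods, and the explicit flush() call are all assumptions made for illustration. A real collapser would typically close each batch on a short timer rather than by hand.

```python
from concurrent.futures import Future


class RequestCollapser:
    """Queues individual lookups and serves them with one batched backend call.
    Hypothetical illustration only; not the Hystrix API."""

    def __init__(self, batch_fetch):
        # batch_fetch: function taking a list of keys and returning {key: value}
        self.batch_fetch = batch_fetch
        self.pending = []  # (key, future) pairs waiting for the next batch

    def submit(self, key):
        """Queue one request and return a Future that is filled in by flush()."""
        future = Future()
        self.pending.append((key, future))
        return future

    def flush(self):
        """Send every queued request to the backend as a single call."""
        batch, self.pending = self.pending, []
        if not batch:
            return
        # Drop duplicate keys while keeping first-seen order.
        keys = list(dict.fromkeys(key for key, _ in batch))
        try:
            results = self.batch_fetch(keys)
        except Exception as exc:
            # One failed backend call fails every request in the batch.
            for _, future in batch:
                future.set_exception(exc)
            return
        for key, future in batch:
            future.set_result(results[key])
```

Three callers asking for two distinct keys would each get their own Future back from submit(), but the backend would see only one call carrying both keys once flush() runs.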

Hystrix circuit breaker flow diagram
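The general circuit breaker pattern can be sketched in a few lines of Python. This is a simplified illustration, not the Hystrix implementation: it trips after a fixed number of consecutive failures, which is an assumption chosen for brevity, and the names CircuitBreaker, failure_threshold, and reset_timeout are hypothetical.

```python
import time


class CircuitBreaker:
    """Fails fast once a dependency looks unhealthy. Illustrative sketch only."""

    def __init__(self, failure_threshold=3, reset_timeout=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before tripping
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, fallback):
        """Run func() through the breaker; return fallback() on failure or when open."""
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()  # open: skip the dependency entirely
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = func()
        except Exception:
            self.failures += 1
            tripped = self.failures >= self.failure_threshold
            if self.opened_at is not None or tripped:
                self.opened_at = self.clock()  # trip, or re-open after a failed trial
            return fallback()
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get the fallback immediately instead of tying up threads waiting on a failing service, which is how the pattern keeps one bad dependency from cascading into the services that call it.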

For more information, see Hystrix on GitHub, and check out Netflix's other open source code libraries, such as the Chaos Monkey resilience testing tool.

(Hat tip: GigaOM)
