Algolia’s goal is to become “the search layer of the internet”, and the company’s real-time search API is a front-runner in the Search-as-a-Service market. In a post on Medium, Algolia CTO Julien Lemoine discussed the 15 steps that were instrumental in building this infrastructure.
To begin with, Lemoine points out that high availability was designed from the start but not yet implemented at launch. Using a bare metal infrastructure, the team launched with a single machine in each of two different locations, assigning CPU time to processes according to their importance.
Implementation of high availability in the architecture was achieved by evolving the two-machine system into a three-machine, master-master setup. This was followed by the official launch of the service, with 10 API clients developed manually, and constant upgrades to system hardware.
Deployment is a big risk for high availability. Stability concerns over agile development were addressed by creating a test suite of over 6,000 unit tests and over 200 non-regression tests, although this still did not stop a new feature from introducing a bug into production. The team’s deployment of machines in AWS to serve Asia then added latency to European search queries. The next challenge was managing high loads of write operations, which the team handled by placing a queue in front of the consensus algorithm that keeps the cluster coherent.
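The idea of queueing writes ahead of a consensus step can be illustrated with a minimal sketch. This is not Algolia's implementation: the `WriteQueue` class, the `submit_batch` callback (standing in for whatever hands a batch to the consensus algorithm) and the `max_batch` parameter are all hypothetical names chosen for illustration. The sketch assumes only that writes are buffered in arrival order and passed on in bounded batches, so a burst of writes does not hit the consensus layer one operation at a time.

```python
from collections import deque
from typing import Callable, Deque, List


class WriteQueue:
    """Buffers write operations and hands them to a consensus step in batches.

    Hypothetical illustration: `submit_batch` stands in for the call that
    would propose a batch to the cluster's consensus algorithm.
    """

    def __init__(self, submit_batch: Callable[[List[str]], None], max_batch: int = 3):
        self._pending: Deque[str] = deque()
        self._submit_batch = submit_batch
        self._max_batch = max_batch

    def enqueue(self, operation: str) -> None:
        # Accepting a write is cheap: it only appends to the local queue.
        self._pending.append(operation)

    def flush(self) -> int:
        """Drain the queue in FIFO order; return the number of batches submitted."""
        batches = 0
        while self._pending:
            size = min(self._max_batch, len(self._pending))
            batch = [self._pending.popleft() for _ in range(size)]
            self._submit_batch(batch)
            batches += 1
        return batches
```

Calling `enqueue` five times and then `flush` with `max_batch=2` would submit three batches, preserving the order in which the writes arrived.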
In April 2014, a car accident broke a pipe containing 120 fibres between Montreal and New York. The rerouting of this signal traffic and the resulting delays served as a reminder that network high availability is close to impossible with one data center. In July, Algolia released their first deployment on two data centers, improving results by important milliseconds. Service for American customers was then improved by adding a presence in the US, with services launched on both east and west coasts.
This increase in the number of machines to manage was handled with automation via Chef. However, the use of the .io TLD was causing the service to be intermittently slow, since that TLD is served from fewer locations than the main TLDs. Using only one DNS provider also meant that DNS was a SPoF (Single Point of Failure) in the architecture.
February 2015 saw the team realise its vision of “worldwide expansion to better serve our users”, with the launch of a synchronised worldwide infrastructure. The US clusters were then spread across two completely independent providers for better high availability per location.
Algolia then hit trouble, experiencing random file corruptions on production machines. Fortunately, triple replication of data minimised the disruption and allowed the issue to be resolved without any data loss. The team then introduced several DNS providers that allowed the configuration of query routing using their API.
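One way a client can benefit from hostnames served by independent DNS providers is to try them in turn. The sketch below is an assumption made for illustration, not a description of Algolia's API clients or routing configuration: the function `query_with_failover`, the `send` callback and the example hostnames are all hypothetical.

```python
from typing import Callable, Optional, Sequence


def query_with_failover(hosts: Sequence[str], send: Callable[[str], str]) -> str:
    """Try each host in order and return the first successful response.

    `send` is a hypothetical callback that performs the request against one
    host and raises ConnectionError if that host cannot be reached.
    """
    last_error: Optional[Exception] = None
    for host in hosts:
        try:
            return send(host)
        except ConnectionError as exc:
            # Remember the failure and move on to the next hostname.
            last_error = exc
    raise ConnectionError("all hosts failed") from last_error


# Example hostnames are placeholders, each imagined as resolved by a
# different DNS provider.
EXAMPLE_HOSTS = ["search.example.net", "search.example.org"]
```

With this shape, an outage at one DNS provider costs the client one failed attempt rather than the whole query.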
In July 2015, Algolia implemented three completely independent providers per cluster, drastically improving the infrastructure’s resilience. However, this system is still vulnerable to route leaks, as well as to links or routers that start producing packet loss.
This journey proves that building a highly available architecture takes time. Algolia’s architecture was designed early on in the process, which greatly eased the implementation as they continuously worked towards their original plan.