The Guardian’s Content API team recently needed to upgrade Elasticsearch, having fallen several major versions behind because they had prioritised stability during the recent release of the new website. In a blog post, two Guardian back-end developers, Luke Taylor and Chris Birchall, discussed the upgrade process, which used a dual-stack strategy and AWS’s Route 53, chosen for its fine-grained control over what percentage of traffic goes where and for its ability to minimise downtime.
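To make the idea of percentage-based routing concrete, here is a minimal Python sketch of weighted DNS answers, in which each record is returned in proportion to its weight divided by the sum of all weights. The stack names and the 90/10 split are hypothetical, and the sketch models only the selection rule; it does not call AWS.

```python
import random


def expected_share(weights):
    """Return each record's expected fraction of DNS answers.

    Assumes the weighted rule: a record's weight / sum of all weights.
    """
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


def simulate_lookups(weights, lookups, seed=0):
    """Count which stack each of `lookups` simulated DNS queries resolves to."""
    rng = random.Random(seed)  # seeded so repeated runs give the same counts
    names = list(weights)
    picks = rng.choices(names, weights=[weights[n] for n in names], k=lookups)
    return {name: picks.count(name) for name in names}


# Hypothetical ramp-up step: roughly 10% of lookups go to the new stack.
weights = {"old-stack": 90, "new-stack": 10}
print(expected_share(weights))
print(simulate_lookups(weights, lookups=10_000))
```

Ramping up then amounts to raising the new stack's weight in steps. The split only reaches clients when they perform a fresh DNS lookup, which is why TTL handling matters.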
As they ramped up the percentage of traffic sent to the replica stack created with the new Elasticsearch version, the internal monitoring tools showed that some clients were not respecting the DNS record TTL (time to live) value broadcast by Route 53, which was set to 60 seconds. This meant that some clients continued to connect to the original stack indefinitely. The team worked out that the problem was not systemic but was limited to specific clients that were using the team’s own Content API Scala client.
Since DNS caching can take place at so many levels of the stack, the team began methodically testing and eliminating levels until they found the problem. They began at the bottom of the stack, testing the OS, then moved on to the Java runtime’s DNS caching functionality, before setting up a simple test harness for the Scala client to see which IP address the client was connecting to.
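A harness of this kind can be quite small. The sketch below is a Python analogue, not the team's Scala harness: it repeatedly resolves a hostname through the OS resolver and records every change in the answer. The function names, the injectable `resolver` and `sleep` parameters, and the `localhost` placeholder are all assumptions made for illustration.

```python
import socket
import time


def resolve(host):
    """Return the sorted set of IP addresses the OS resolver gives for host."""
    infos = socket.getaddrinfo(host, None)
    return sorted({info[4][0] for info in infos})


def observe(host, samples, interval_s=0.0, resolver=resolve, sleep=time.sleep):
    """Resolve host `samples` times and record each change in the answer.

    Returns a list of (sample_index, addresses) pairs, one per change.
    """
    changes = []
    previous = None
    for i in range(samples):
        current = resolver(host)
        if current != previous:
            changes.append((i, current))
            previous = current
        if i < samples - 1:
            sleep(interval_s)
    return changes


if __name__ == "__main__":
    # "localhost" is a placeholder; substitute the hostname under test.
    print(observe("localhost", samples=3, interval_s=0.5))
```

If the answers at this level change within the TTL while the application keeps talking to the old address, the caching must be happening higher up, in the runtime or in the HTTP client.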
After hunting through the source code, the team narrowed the problem down to AsyncHttpClient, which expires an idle connection after 60 seconds but pools and reuses a connection indefinitely if it is used more than once per minute. Because a reused connection never triggers a fresh DNS lookup, busy clients stayed pinned to the original stack. This channel pooling behaviour is configurable, so the fix turned out to be a simple one-line change.
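The effect of this pooling behaviour can be illustrated with a toy model. The Python sketch below is not AsyncHttpClient's API or implementation; the class, the `dns_answer` function, and the switch to the new stack at t=120 seconds are all hypothetical. It assumes only what is described above: a connection expires after 60 idle seconds, and DNS is consulted only when a new connection is opened.

```python
IDLE_TIMEOUT_S = 60  # idle expiry described in the article


class PooledClient:
    """Toy model: resolves DNS only when it has to open a new connection."""

    def __init__(self, resolve, idle_timeout_s=IDLE_TIMEOUT_S, pooling=True):
        self.resolve = resolve  # callable: time in seconds -> target address
        self.idle_timeout_s = idle_timeout_s
        self.pooling = pooling
        self.connected_to = None
        self.last_used = None

    def request(self, now):
        """Return the address this request is sent to at time `now`."""
        idle_expired = (
            self.last_used is None
            or now - self.last_used >= self.idle_timeout_s
        )
        if not self.pooling or idle_expired:
            # A new connection is opened, so a fresh lookup happens.
            self.connected_to = self.resolve(now)
        self.last_used = now
        return self.connected_to


def dns_answer(now):
    """Hypothetical DNS: the old stack until t=120 s, then the new one."""
    return "old-stack" if now < 120 else "new-stack"


def run(client, interval_s, duration_s=600):
    """Send one request every `interval_s` seconds and list the targets."""
    return [client.request(t) for t in range(0, duration_s, interval_s)]


busy = run(PooledClient(dns_answer), interval_s=30)
quiet = run(PooledClient(dns_answer), interval_s=90)
unpooled = run(PooledClient(dns_answer, pooling=False), interval_s=30)

print(set(busy))     # the busy client never leaves its first connection
print(quiet[-1])     # the quiet client reconnects after each idle expiry
print(unpooled[-1])  # without pooling, every request does a fresh lookup
```

In this model the client that sends a request every 30 seconds never lets its connection go idle, so it never looks up the hostname again, whereas a client with pooling disabled follows DNS at the cost of opening a connection for every request.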
The team concluded that AWS’s Route 53 as an out-of-the-box solution may not be the best option for them and that, going forward, they may consider placing a proxy server, such as HAProxy, in front of their stack. This would allow them to route traffic dynamically with instant effect, but it comes with operational overheads and the increased risk of a single point of failure.