Like many dev teams these days, the team at Buffer is working toward building a more service-oriented architecture. As part of this, the team has built a special service to keep track of how many times links have been embedded in Buffer posts. Buffer full-stack dev Harrison Harnisch over at Buffer Overflow takes us through the highs and lows of building the service and how the team went from simple to complex and back to simple again.
The link counting service was one of the main targets of Buffer API traffic. The API could see 400 to 700 requests per second against a db of 600 million link records. The original solution involved a basic PHP app querying MongoDB. Needless to say, this didn't work well. The new service had to offer high availability and high throughput to cope with all this traffic, while maintaining historical data in case the team wanted to use the history of links in posts for new client features.
The first attempt at building a new service was a Node.js backend coupled with a Redis cache for fast reads and Amazon Aurora, a MySQL-based db system that promised auto-scaling and high throughput. To get a count, the team simply ran a select count query for the number of rows containing the relevant link. The bad news: when a link had a count in the tens of millions, this was really slow, up to 20 seconds in some cases.
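The slow path is easy to reproduce. Here's a minimal sketch of the count-on-read approach, using an in-memory SQLite database as a stand-in for Aurora; the table and column names are illustrative, not Buffer's actual schema:

```python
import sqlite3

# In-memory SQLite stands in for Aurora (MySQL-based); schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE link_records (id INTEGER PRIMARY KEY, link TEXT)")
conn.executemany(
    "INSERT INTO link_records (link) VALUES (?)",
    [("https://example.com/a",)] * 5 + [("https://example.com/b",)] * 3,
)

def get_count(link):
    # Counting rows means scanning every matching row (or index entry);
    # with tens of millions of rows per link, that scan is the bottleneck.
    row = conn.execute(
        "SELECT COUNT(*) FROM link_records WHERE link = ?", (link,)
    ).fetchone()
    return row[0]

print(get_count("https://example.com/a"))  # → 5
```

With eight rows this is instant; with tens of millions of rows per link, the same scan is where the 20-second latencies came from.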
On the second iteration, the team tried Elasticsearch. The link data was distributed across shards, with replication for high availability. The idea was that with sharded data the team could run counts in parallel on the various shards and then aggregate the results, map/reduce style. In testing, this seemed promising: 200-millisecond queries even for monster counts. When the team routed more than 50% of production traffic to the new system, though, it crashed. Elasticsearch was the bottleneck.
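The map/reduce idea itself is simple. Here's a toy sketch of it in Python, with shards modeled as in-memory lists (the shard contents are made up for illustration, and real Elasticsearch does this server-side):

```python
from concurrent.futures import ThreadPoolExecutor

# Each "shard" holds a slice of the link records; contents are illustrative.
shards = [
    ["a", "b", "a"],
    ["a", "c"],
    ["a", "a", "b"],
]

def count_on_shard(shard, link):
    # Map step: each shard counts its own matching records independently.
    return sum(1 for record in shard if record == link)

def sharded_count(link):
    # Shard counts run in parallel, one worker per shard.
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda s: count_on_shard(s, link), shards)
    # Reduce step: aggregate the per-shard partial counts.
    return sum(partials)

print(sharded_count("a"))  # → 5
```

Parallelism shrinks the latency of any single count, but every query still does work proportional to the number of matching records, which is why the approach buckled under full production load.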
The team concluded count queries were simply too slow. So, third time's the charm: the old data structure went out the window. Instead, the team built a simple dictionary with links as keys and counts as values, stored it in a Redis cache, and archived the old data structure in S3 for future features. The values in the cache could simply be incremented whenever a count changed.
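The key insight is moving the work from read time to write time. A minimal sketch of the counter-dictionary approach, using a plain Python dict in place of the Redis cache (in production this would be something like an atomic Redis INCR on the link's key):

```python
from collections import defaultdict

# A plain dict stands in for the Redis cache; keys are illustrative.
link_counts = defaultdict(int)

def record_link(link):
    # O(1) increment on write replaces the expensive count-on-read query.
    link_counts[link] += 1

def get_count(link):
    # O(1) lookup: the count is precomputed, never scanned or aggregated.
    return link_counts[link]

for _ in range(3):
    record_link("https://example.com/a")
print(get_count("https://example.com/a"))  # → 3
```

Read latency is now independent of how many times a link has ever been shared, at the cost of giving up the row-level history, which is why the old records were archived in S3 rather than deleted.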
The result? Queries that had taken 20 seconds now returned in under a millisecond. The search for the perfect data structure and architecture was over.