Twitter has announced a brand new search architecture that indexes every public Tweet since 2006. The new system consists of a batched data aggregation and preprocessing pipeline, an inverted index builder that runs on Mesos, Earlybird shards, and Earlybird roots. Twitter first added search functionality to the platform in 2008 through the acquisition of Summize, a Twitter search service that also provided an API. In 2010, in response to massive growth in its user base, the company introduced a new search architecture based on an inverted index, replacing the MySQL indexes.
That 2010 search architecture, a heavily modified version of Lucene optimized for real-time search, was internally code-named "Earlybird." Even though large parts of Lucene's core in-memory data structures had been rewritten, Earlybird still supported Lucene's standard APIs. Supporting the standard APIs allowed Earlybird to use the Lucene search layer with very little modification. In 2011, Twitter added new features to the Earlybird indexer, including image and video search, index compression, and improved relevance search functionality.
In a keynote at Lucene Solr Revolution 2013 in San Diego, Michael Busch, tech lead for Search Infrastructure at Twitter, outlined Twitter's search architecture and discussed the latest technologies in use. In the keynote he explained that "Twitter serves billions of queries per day from different Lucene indexes, while appending more than hundreds of millions of Tweets per day in real time." He also explained that "Lucene and Earlybird use an inverted index tool for fast retrieval. An inverted index takes all documents, finds all unique terms in all these documents, and builds a dictionary."
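To make Busch's description concrete, here is a minimal sketch of an inverted index in Python. It is a toy illustration only, not how Lucene or Earlybird is implemented: the function names `build_inverted_index` and `search`, the whitespace tokenization, and the sample Tweets are all assumptions made for this example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each unique term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenization: lowercase and split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, query):
    """Return IDs of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical sample corpus: document ID -> Tweet text.
tweets = {
    1: "lucene powers search",
    2: "earlybird is built on lucene",
    3: "search at twitter scale",
}
index = build_inverted_index(tweets)
```

With this dictionary in place, a query never scans the Tweets themselves. `search(index, "lucene")` looks up one term and gets back the matching document IDs, and a multi-term query intersects the lists for each term.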
Twitter's new search index infrastructure builds on Earlybird while incorporating design goals such as modularity, scalability, a simple interface, and incremental development. In the new system, Twitter has introduced Earlybird shards and Earlybird roots. In the official announcement post, Yi Zhuang, search infrastructure engineer at Twitter, explains the need for sharding in the new search index infrastructure:
"The inverted index builders produced hundreds of inverted index segments. These segments were then distributed to machines called Earlybirds. Since each Earlybird machine could only serve a small portion of the full Tweet corpus, we had to introduce sharding... With simple hash partitioning, expanding clusters in place involves a non-trivial amount of operational work – data needs to be shuffled around as the number of hash partitions increases. Instead, we created a two-dimensional sharding scheme to distribute index segments onto serving Earlybirds."
According to the announcement post, Twitter introduced Earlybird roots "to abstract away the internal details of tiering and partitioning in the full index." Earlybird roots make it possible for Twitter to provide a simple Search API. The Twitter Search API allows queries that search through recent or popular Tweets, and the results are based on relevance as opposed to completeness.
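The sharding problem and the role of the roots can be illustrated with a small Python sketch. It rests on stated assumptions rather than on Twitter's actual implementation: the two sharding dimensions are taken to be a time tier and a hash partition within that tier, modulo stands in for a real hash function, and the `Root` class, its methods, and the sample data are hypothetical names invented for this example.

```python
import heapq


def hash_partition(tweet_id, num_partitions):
    """Simple hash partitioning; modulo stands in for a real hash function."""
    return tweet_id % num_partitions


def moved_fraction(tweet_ids, old_n, new_n):
    """Fraction of Tweets that change partition when the partition count changes."""
    moved = sum(1 for t in tweet_ids
                if hash_partition(t, old_n) != hash_partition(t, new_n))
    return moved / len(tweet_ids)


class Root:
    """Hypothetical root: hides tiers and partitions behind one search() call."""

    def __init__(self):
        # tier name -> list of partitions; each partition is a list of
        # (score, tweet_id, text) tuples.
        self.tiers = {}

    def add_tier(self, tier, num_partitions):
        """Expand the cluster by adding a tier; existing tiers are untouched."""
        self.tiers[tier] = [[] for _ in range(num_partitions)]

    def locate(self, tier, tweet_id):
        """Two-dimensional address: (tier, hash partition within that tier)."""
        return tier, hash_partition(tweet_id, len(self.tiers[tier]))

    def add(self, tier, tweet_id, score, text):
        _, partition = self.locate(tier, tweet_id)
        self.tiers[tier][partition].append((score, tweet_id, text))

    def search(self, term, limit=10):
        """Fan out to every partition of every tier, then merge by score."""
        hits = []
        for partitions in self.tiers.values():
            for partition in partitions:
                hits.extend((score, tweet_id)
                            for score, tweet_id, text in partition
                            if term in text.split())
        return [tweet_id for _, tweet_id in heapq.nlargest(limit, hits)]


root = Root()
root.add_tier("2006", 2)
root.add("2006", 101, 0.9, "hello twitter")
placement_before = root.locate("2006", 101)

# Growing the index adds a new tier instead of re-hashing old data.
root.add_tier("2007", 4)
root.add("2007", 202, 0.5, "hello again")
root.add("2007", 203, 0.7, "good morning")
placement_after = root.locate("2006", 101)
```

In this sketch, `moved_fraction(range(1000), 4, 5)` shows how many Tweets would have to be shuffled if a single hash-partitioned cluster grew from four partitions to five, which is the operational cost the quote describes. Adding the "2007" tier, by contrast, leaves `placement_before` and `placement_after` identical, and callers of `root.search("hello")` never need to know how many tiers or partitions exist.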
In the past, the only way to search for Tweets dating back to 2006 was through a service with access to the complete Twitter firehose, such as DataSift or Gnip. Earlier this year, Twitter acquired Gnip with the intention of making Twitter data even more accessible.
The new search infrastructure opens up a variety of use cases, particularly when it comes to Twitter hashtag searches. Comprehensive searches for news events like #Ferguson and #Election2014, conferences like #APIconUK and #APIstrat, and other topics of conversation via hashtags are now possible using the Twitter platform.
The new search capability will roll out to Twitter users over the next few days, and the company plans to continue improving search functionality on the Twitter platform.