Historical Architecture: Data Mining Billions of Tweets

This guest post comes from Stewart Townsend, Developer Relations for real-time social data mining company DataSift.

The DataSift Platform allows users to define a "stream" using filtering parameters such as keywords or locations, and to begin receiving matching data in real time as comments are posted on social media sites. With a license to access Twitter's full "Firehose", we offer users the ability to filter on all the metadata contained in a Tweet, making searches far more powerful. Though the search is exposed through the simple DataSift API, there's actually quite a bit that goes into it.
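Conceptually, a stream definition is just a predicate applied to every post as it arrives. Here is a minimal Python sketch of that idea; the field names and matching logic are illustrative stand-ins, not DataSift's actual implementation:

```python
# Illustrative sketch: filtering a live stream of posts by keyword and
# location, the way a stream definition conceptually works. Field names
# ("text", "location") are assumptions for the example.

def matches(post, keywords, locations):
    """True if the post mentions any keyword and comes from a tracked location."""
    text = post.get("text", "").lower()
    place = post.get("location", "")
    return any(k in text for k in keywords) and (not locations or place in locations)

def filter_stream(posts, keywords, locations=()):
    """Yield matching posts as they arrive."""
    for post in posts:
        if matches(post, keywords, locations):
            yield post

# Example: two posts flow through the filter; only one matches.
incoming = [
    {"text": "Loving the new API!", "location": "London"},
    {"text": "Nothing to see here", "location": "Paris"},
]
hits = list(filter_stream(incoming, keywords=["api"], locations=["London"]))
```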

In fact, we're making the technology behind the DataSift platform available for historical data, too. Developers can now run queries against stored data and export the results, all through the same Curated Stream Definition Language (CSDL) and interface.

The technology behind the platform

In addition to the existing live and buffered streaming, DataSift can now record data to its massive storage cluster. The core of the recording platform is an HBase cluster of more than 30 nodes, with over 400TB of storage. Every piece of information is replicated three times for high availability and disaster recovery.

The same infrastructure is used to record the entire Twitter Firehose, along with the other input sources and all the augmentations (including trends, sentiment analysis, Klout authority score, and gender demographics).
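Each interaction is enriched with these augmentations before it is recorded. The sketch below shows the general shape of such an enrichment step; the toy sentiment scorer and the field names are stand-ins, not DataSift's real augmentation pipeline:

```python
# Hedged sketch of attaching augmentations to an interaction before
# recording it. naive_sentiment is a toy lexicon scorer standing in for
# real sentiment analysis.

def naive_sentiment(text):
    positive, negative = {"love", "great", "good"}, {"hate", "bad", "awful"}
    words = text.lower().split()
    return sum(w in positive for w in words) - sum(w in negative for w in words)

def augment(interaction):
    """Return a copy of the interaction with augmentation fields attached."""
    enriched = dict(interaction)
    enriched["augmentations"] = {
        "sentiment": naive_sentiment(interaction["text"]),
        # The real platform also attaches trends, Klout scores, gender, etc.
    }
    return enriched

tweet = {"text": "I love this great platform"}
enriched = augment(tweet)
```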

Recording the raw Firehose (250 million Tweets daily) would require roughly an entire hard drive every day, so the data is compressed and decompressed on the fly using a highly efficient compression library developed at Google for heavy workloads and high-speed compression.
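The compress-on-write, decompress-on-read principle can be sketched in a few lines. Here zlib stands in for the Google library (the description matches Snappy, which trades compression ratio for speed); the round-trip behavior is the point:

```python
import json
import zlib

# Sketch of recording compressed interactions on the fly. zlib is a
# stand-in for the high-speed compression library the platform uses;
# the principle -- compress on write, decompress on read -- is the same.

def record(interactions):
    """Serialize a batch of interactions and compress it for storage."""
    raw = json.dumps(interactions).encode("utf-8")
    return zlib.compress(raw)

def replay(blob):
    """Decompress a stored batch back into interactions on the fly."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))

batch = [{"id": i, "text": "hello world " * 20} for i in range(100)]
blob = record(batch)
assert replay(blob) == batch                # round-trip is lossless
assert len(blob) < len(json.dumps(batch))   # and much smaller on disk
```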

Communication between the website/API and the storage cluster is handled in several languages (Java, Scala, and PHP, via Thrift), and the cluster is continuously monitored, with metrics used to dynamically adapt the workload for maximum efficiency.

Internally, a standalone version of the filtering platform runs on each Hadoop node in parallel, effectively analyzing billions of records against several filters at once. The data is transformed, filtered, checked against each user's license and output mask, and then emitted as a new recording that can be exported (for example, to Amazon S3).
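The per-node step described above, transform, filter, license check, output mask, can be sketched as follows. The record fields and mask semantics are illustrative assumptions, not DataSift's schema:

```python
# Sketch of the per-node filtering step: each node filters its local
# records, enforces the user's source license, and applies the output
# mask (the set of fields the user may receive) before emitting.

def run_filter_task(local_records, predicate, licensed_sources, output_mask):
    emitted = []
    for record in local_records:
        if record["source"] not in licensed_sources:   # license check
            continue
        if not predicate(record):                      # filter
            continue
        # Output mask: keep only the fields the user is allowed to see.
        emitted.append({k: v for k, v in record.items() if k in output_mask})
    return emitted

records = [
    {"source": "twitter", "text": "big news", "user_email": "x@y.z"},
    {"source": "myspace", "text": "big news", "user_email": "a@b.c"},
]
out = run_filter_task(
    records,
    predicate=lambda r: "news" in r["text"],
    licensed_sources={"twitter"},
    output_mask={"source", "text"},
)
```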

The platform had to be partially re-engineered to work as an embeddable library instead of a standalone service.

Moving data around at this volume is not an option, so everything has to be local: the analysis happens where the data is already stored. This is the opposite of the traditional approach in most data centers, where data is moved to the processing unit.

As this diagram shows, data is uniformly distributed across different servers on several data nodes. When a request to filter historical data is received, hundreds of parallel tasks are launched, each filtering one data node within the selected time range and for the chosen data sources. Thanks to the nature of Hadoop and map-reduce, everything runs in parallel, and the results of each task are then collated into a continuous recording in a subsequent step.
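The flow above maps cleanly onto the map-reduce pattern. Here is a simplified single-process stand-in for the Hadoop job: each "map" task filters the partition that lives on its own node (data locality), and a collate step merges the partial results into one time-ordered recording:

```python
# Minimal map-reduce sketch of the historical-filter flow. Partitions,
# timestamps, and record fields are illustrative assumptions.

def map_task(partition, predicate, start, end):
    """Filter one node-local partition within the selected time range."""
    return [r for r in partition if start <= r["ts"] <= end and predicate(r)]

def collate(partial_results):
    """Merge per-node outputs into a continuous, time-ordered recording."""
    merged = [r for part in partial_results for r in part]
    return sorted(merged, key=lambda r: r["ts"])

partitions = [  # data as it might be distributed across three data nodes
    [{"ts": 3, "text": "apple pie"}, {"ts": 9, "text": "banana"}],
    [{"ts": 1, "text": "apple tart"}],
    [{"ts": 5, "text": "apple cake"}, {"ts": 12, "text": "apple late"}],
]
parts = [map_task(p, lambda r: "apple" in r["text"], start=0, end=10)
         for p in partitions]
recording = collate(parts)
```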

What you receive

The full historical data is available for post-processing, so it's possible, for instance, to apply a filter to the entire Twitter Firehose, or to Digg or MySpace, for the past month. This feature is particularly valuable when trends need to be analyzed after the fact. Potential use cases include analyzing the response to an ad campaign or looking for correlations between unanticipated events and the social media comments that followed.

The DataSift recording interface is very simple and abstracts away all the internal complexity. You select the data sources, the time range, and the filter you want to apply, and you're notified when the data is ready to be exported. You don't need to worry about anything else.
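A historical request therefore only needs to carry three things: sources, time range, and filter. The sketch below models that request shape; the class, field names, and validation are hypothetical illustrations, not the actual DataSift API:

```python
from dataclasses import dataclass

# Hypothetical model of a historical-recording request, mirroring the
# interface described above. Names are illustrative only.

@dataclass
class HistoricalQuery:
    sources: tuple        # e.g. ("twitter", "digg")
    start_ts: int         # start of the time range (Unix seconds)
    end_ts: int           # end of the time range (Unix seconds)
    csdl_filter: str      # the CSDL filter to apply

    def validate(self):
        if not self.sources:
            raise ValueError("at least one data source is required")
        if self.end_ts <= self.start_ts:
            raise ValueError("time range must be non-empty")
        return True

query = HistoricalQuery(
    sources=("twitter",),
    start_ts=1_310_000_000,
    end_ts=1_312_600_000,
    csdl_filter='interaction.content contains "olympics"',
)
```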


The platform in numbers

  • 30-node Hadoop cluster
  • 180 hard drives
  • Storing the entire Twitter Firehose of 250+ million Tweets per day
  • 500 GB of compressed data per day

Data Stores

  • MySQL (Percona server) on SSD drives
  • HBase cluster (currently ~30 Hadoop nodes, 400 TB of storage)

Get on the list for early access to Historical Data.
