New York Times Event Shows the Promise of Big Data

The TimesOpen developer network last week gathered thinkers, tinkerers, and innovators grappling with the emerging data deluge. It was the final conference before the first TimesOpen Hack Day this Saturday in Manhattan.

Eric Sammer of Cloudera opened with a concise overview of the characteristics of Big Data:

- Sources of data include server logs, business transactions, network traffic, Twitter, IMDb, Wikipedia, and the New York Times (with its own 13 APIs and contributions to linked open data).

- Typical applications include ad optimization, intrusion detection, fraud detection, capacity planning, matching (finding jobs and sweethearts), and product cross-selling and up-selling.

- Big Data technology typically runs on NoSQL databases, distributed processing across commodity machines, and fault-tolerant, self-healing software architectures.

The base technologies of Hadoop (see this Hadoop tutorial for starters) and MapReduce are augmented with other modules such as Hue, Pig, Hive, and Flume (plus the sturdy awk and grep); more in his slides.
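The MapReduce model underlying Hadoop can be illustrated in a few lines of plain Python: a map phase emits key-value pairs, a shuffle groups them by key, and a reduce phase combines each group. This is a single-machine sketch of the idea, not Hadoop's actual API, and the sample log lines are invented:

```python
from collections import defaultdict

# Map phase: emit (word, 1) pairs from each input record.
def map_phase(records):
    for record in records:
        for word in record.split():
            yield word, 1

# Shuffle phase: group intermediate values by key.
def shuffle_phase(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

# Reduce phase: combine the grouped values for each key.
def reduce_phase(grouped):
    return {key: sum(values) for key, values in grouped.items()}

log_lines = ["GET /home", "GET /about", "GET /home"]  # invented sample data
counts = reduce_phase(shuffle_phase(map_phase(log_lines)))
print(counts)  # {'GET': 3, '/home': 2, '/about': 1}
```

In Hadoop the shuffle is handled by the framework and the map and reduce phases run in parallel across the cluster; only the two user-written functions look like the ones above.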

Jer Thorp (some earlier work covered here) and Mark Hansen of the New York Times R&D group are artists and technologists, and they focused on recent experiments in visualizing influence with data from Twitter. Using the concept of the Event Cascade, which on Twitter is represented primarily by retweets, they produced images of different subsets of event data, culminating in a video of three-dimensional images unfolding over time.
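The Event Cascade idea can be sketched with a toy example: given hypothetical (retweeter, source) edges recording who retweeted from whom, a breadth-first walk from the original author recovers the cascade generation by generation. The usernames and edge format below are invented for illustration:

```python
from collections import defaultdict

# Hypothetical edge list: each pair means `retweeter` retweeted the
# tweet they saw from `source`; "origin" posted the original tweet.
retweet_edges = [
    ("alice", "origin"), ("bob", "origin"),
    ("carol", "alice"), ("dave", "carol"),
]

# Build the cascade tree: source -> users who retweeted from that source.
children = defaultdict(list)
for retweeter, source in retweet_edges:
    children[source].append(retweeter)

# Breadth-first walk from the original author gives the cascade level
# by level -- the "generations" of the event as it spreads.
def cascade_levels(root):
    levels, frontier = [], [root]
    while frontier:
        levels.append(frontier)
        frontier = [u for parent in frontier for u in children[parent]]
    return levels

levels = cascade_levels("origin")
print(levels)  # [['origin'], ['alice', 'bob'], ['carol'], ['dave']]
```

The levels give the time-and-audience structure the visualizations render: how many generations deep a retweet chain runs, and how wide each generation is.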

The graphics (from the visualization tool Processing) showed the influence effect, in time and audience, of Tim O'Reilly's retweeting of a Paul Krugman article (geeks will be pleased to note that the affable Web 2.0 alpha dog far surpasses the Nobel Prize-winning economist, at least on Twitter). Another visualization confirmed the very democratic nature of interest in the bolting flight attendant story, which was picked up and retweeted across a broad range of disparate groups.

In terms of process, Thorp and Hansen emphasized what many Big Data researchers believe: the value of first building an exploratory data tool to see where the interesting parts of the data live.

Hilary Mason, lead scientist at bit.ly, is on the front lines of dealing with vast amounts of public data in a commercial setting. Each time someone clicks on a link on Twitter or other social networks, the data and context are captured and stored for analysis. This translates to 10M URLs per day, 100M clicks, and billions of events per month. This data can be analyzed by geography, social influence, device used, and other factors.
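Slicing one click stream along several dimensions can be sketched with plain Python counters. The event fields and values below are invented stand-ins for whatever schema the real pipeline captures:

```python
from collections import Counter

# Hypothetical click events; field names are illustrative only,
# not any real service's actual schema.
events = [
    {"url": "nyti.ms/a1", "country": "US", "device": "mobile"},
    {"url": "nyti.ms/a1", "country": "US", "device": "desktop"},
    {"url": "nyti.ms/b2", "country": "FR", "device": "mobile"},
]

# The same event stream, aggregated along different dimensions.
by_country = Counter(e["country"] for e in events)
by_device = Counter(e["device"] for e in events)

print(by_country)  # Counter({'US': 2, 'FR': 1})
print(by_device)   # Counter({'mobile': 2, 'desktop': 1})
```

At billions of events per month the aggregation would be sharded and incremental rather than in-memory, but the analytical slicing is the same idea.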

To deal with the deluge, the architecture is message-queue based, so that if one piece is down or can't keep up, the data remains sound. Typical of Big Data practitioners, she relies on and contributes to open-source efforts, with a tech toolbox that includes sharded MySQL, MongoDB, Redis, and this open-source queueing software.
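The decoupling a message queue provides can be sketched with Python's standard-library `queue` module. This toy uses an in-process queue and a sentinel value rather than a real distributed broker, but it shows the core idea: capture keeps running at full speed while a consumer drains the backlog at its own pace.

```python
import queue
import threading

events = queue.Queue()   # buffer between capture and analysis
processed = []

def producer(n):
    # Capture side: enqueue events without waiting for analysis.
    for i in range(n):
        events.put({"click_id": i})  # invented event payload

def consumer():
    # Analysis side: drain the queue at its own pace.
    while True:
        item = events.get()
        if item is None:           # sentinel: no more events
            break
        processed.append(item)     # stand-in for real analysis

worker = threading.Thread(target=consumer)
worker.start()
producer(100)
events.put(None)
worker.join()
print(len(processed))  # 100
```

If the consumer were stopped and restarted, the queued events would still be waiting, which is the "data remains sound" property; a production system would use a durable broker so the buffer survives process crashes too.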

Jake Hofman of Yahoo Research suggested that the major issue in processing large data sets is not the size of the data or the types of algorithms, but rather "To What Question Are These Data the Answer?"

His Nielsen-based data set comprised demographics on 265,000 people and a couple hundred gigabytes of their browsing history. The research began by trying to predict age based on that clickstream. After several forays through the data, with different classifications and visualizations, the researchers ended up with a different result: a comprehensive comparison of diversity online and offline.
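A minimal sketch of the age-prediction starting point, assuming invented age brackets and site categories: build an aggregate visit profile per bracket from labeled users, then assign a new user to the bracket whose profile their history overlaps most. The actual research would have used far richer features and models; this only illustrates the framing.

```python
from collections import Counter

# Toy training data: (age_bracket, visited site categories).
# Brackets and categories are invented for illustration only.
train = [
    ("18-24", ["games", "social", "social"]),
    ("18-24", ["social", "video"]),
    ("55+",   ["news", "finance", "news"]),
    ("55+",   ["news", "health"]),
]

# Build one aggregate visit profile (category counts) per age bracket.
profiles = {}
for age, sites in train:
    profiles.setdefault(age, Counter()).update(sites)

# Predict the bracket whose profile overlaps most with the user's history.
def predict_age(history):
    visits = Counter(history)
    def overlap(profile):
        return sum(min(visits[c], profile[c]) for c in visits)
    return max(profiles, key=lambda age: overlap(profiles[age]))

print(predict_age(["social", "games"]))   # 18-24
print(predict_age(["news", "finance"]))   # 55+
```

Most of the real effort, as Hofman notes below, goes into taming the raw clickstream into clean (label, features) rows before any such classifier can run.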

Hofman emphasized that the bulk of the work is in taming and normalizing the data, and that the tidy Ph.D. paper that emerges from delving into this type of data is often the unpredictable byproduct of the original research intent.

The role of the data scientist is clearly on the rise. Here's a data aggregation on the conference itself, with a tweet-summarized infographic.
