Is Big Data the Next Big Thing?

New tools make for new opportunities.

Much as the availability of connectivity drove the early Internet, social networks spurred all manner of viral expression, and open APIs transformed software development, an advancing technology now has the potential to disrupt online usage and its social and business context.

What is it?

Big Data is an umbrella term that signifies the democratization of access to storage and analysis of large data sets. Just as the kilobyte grew to the gigabyte, from here on in it's tera, peta, and ultimately yotta. The data comes from public online activity, companies' transactional histories, and website server logs, as well as from IMDB, Wikipedia, DBPedia, Gov 2.0 releases, and other open data initiatives.

What can I do with it?

Applications include business intelligence, predictive analytics, recommendation systems, and standard data mining. It can be seen in disparate areas such as box-office predictions, flu tracking, language analysis, and the quantified self. Big Data crosses over with other emerging trends: the Internet of Things, the Open Data movement, the Cloud Software Stack, and Real-Time Data.

Where do I get it?

Although the types of analysis are not new, individuals and small teams can increasingly afford to poke and fiddle with vast amounts of labeled and connected information. Cloud computing has lowered the hardware and engineering costs, with setups available from leading vendors such as Amazon Web Services (AWS) and Cloudera.

The Big Data software tools are largely free and open-source, and centered on Hadoop and MapReduce, with contributions from Python and the open-source statistical language R. Hadoop is a distributed-processing framework (and Apache project), and MapReduce is a parallel programming model that originated at Google and is useful in formulating divide-and-conquer solutions to the analysis of large datasets.
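To give a flavor of the divide-and-conquer idea, here is a minimal single-machine sketch of the MapReduce pattern in plain Python, using the classic word-count example. It does not use Hadoop or any real cluster API; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and a real framework would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values collected for one key into a total."""
    return key, sum(values)


def word_count(documents):
    # Each document is mapped independently; this is the step a cluster
    # could spread across machines (here it simply runs in sequence).
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data is big", "data about data"]
    print(word_count(docs))
```

The point of the split is that no map call needs to know about any other document, and no reduce call needs to know about any other key, which is what makes the work easy to distribute.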

How can I learn more?

A comprehensive overview of the software stack is provided here by Edd Dumbill. The New York Times conference on Big Data brought together innovators and researchers in the field, in advance of the TimesOpen Hack Day this Saturday. And the sessions in the O'Reilly Strata conference next spring give a flavor of the thinking and momentum behind Big Data.

This is all great, right?

There is a downside to the growing openness. Although most participants in research and in Web 2.0 efforts take pains to work with only volunteered or anonymized data, the experience of the last ten years shows that once data is out, it's out for good. A recent article in the excellent Wall Street Journal series on privacy shows that each time you read an online piece on a site you've registered on (or travel, or contribute to a charity), an insurance company might be looking right over your shoulder.

But there is good news for hackers, who started as programmers, became developers, and then system architects: now, by massaging a couple hundred gigs, you can become a data scientist! Cautionary note: you have to be wicked smart to do it for real. It's both an art and a science. Like hacking!


Comments (3)

Karmasphere Studio is a graphical environment to develop, debug, deploy and monitor MapReduce jobs. It accelerates the development process for experienced Hadoop developers and reduces the learning curve for those new to Hadoop.