Is Incomplete Twitter Data Skewing Social Analytics?

Data scientists increasingly rely on social media APIs for market research, but in doing so they leave entire data collection models at the mercy of fluctuating access controls and biased data returns. In an article published by the London School of Economics and Political Science, Sam Kinsley, a researcher at the University of Exeter, argues that harvesting big data through social media APIs such as the Twitter Streaming and Twitter Search APIs is neither easy nor reliable.

The Twitter Streaming API doesn’t allow access to the “firehose,” the total data pool, but only to a particular 1% of the total stream, a sample generally called the “spritzer.” A recent research study found the spritzer to be biased when compared with a random 1% sample taken from the firehose, with no clear explanation of how the spritzer sample is generated.

The Twitter Search API is problematic for additional reasons. Researchers can’t query a specific date in the past; they can only view posts from the previous week. Access is capped at a strict number of permitted calls, with the maximum rate estimated at 450 calls per 15 minutes when using multiple access tokens. Compared with the 20 million posts made on Twitter every hour, this is a minuscule fraction.
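To put those figures side by side, here is a rough back-of-the-envelope calculation in Python. The 450 calls per 15 minutes and the 20 million posts per hour come from the text above; the number of posts returned per call (100) is an assumption made purely for illustration and is not stated in the article.

```python
# Rough coverage estimate for the Search API figures quoted above.
CALLS_PER_WINDOW = 450        # from the article: calls per 15-minute window
WINDOW_MINUTES = 15           # from the article
POSTS_PER_HOUR = 20_000_000   # from the article: posts made on Twitter per hour
POSTS_PER_CALL = 100          # ASSUMPTION: posts returned per call (illustrative only)


def hourly_coverage(calls_per_window, window_minutes, posts_per_call, posts_per_hour):
    """Return (maximum posts harvested per hour, fraction of all posts)."""
    windows_per_hour = 60 // window_minutes
    harvested = calls_per_window * windows_per_hour * posts_per_call
    return harvested, harvested / posts_per_hour


harvested, fraction = hourly_coverage(
    CALLS_PER_WINDOW, WINDOW_MINUTES, POSTS_PER_CALL, POSTS_PER_HOUR
)
print(f"{harvested:,} posts/hour, {fraction:.2%} of all posts")
# prints: 180,000 posts/hour, 0.90% of all posts
```

Under that assumption, even a harvester running flat out sees less than 1% of what is posted. A smaller page size would shrink the share further.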

To maximize the data harvested through the Twitter APIs, a research firm must clear several hurdles. It must acquire additional access tokens, write a program that harvests continuously at the maximum request threshold, and create a structure for storing and retrieving the collected data. ScraperWiki is a tool that used to provide this service, until Twitter revoked its API access, which shows that stable API access is never guaranteed.
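The three hurdles above can be sketched in a few lines of Python. This is a minimal illustration, not a working Twitter client. The `fetch` callable is a hypothetical stand-in for a real Search API request, the token strings are placeholders, and the 450-calls-per-15-minutes figure from the article is treated as the total budget shared across all tokens (one possible reading of that estimate). Storage uses SQLite from the standard library.

```python
import sqlite3
import time
from itertools import cycle

CALLS_PER_WINDOW = 450     # article's estimate, read here as the total budget
WINDOW_SECONDS = 15 * 60   # 15-minute rate-limit window


def harvest(tokens, query, fetch, db, max_calls, sleep=time.sleep):
    """Rotate through tokens, pace the calls, and store the posts in SQLite.

    `fetch(token, query)` is a hypothetical callable that returns a list of
    dicts with "id" and "text" keys; swap in a real API client here.
    """
    db.execute("CREATE TABLE IF NOT EXISTS posts (id TEXT PRIMARY KEY, text TEXT)")
    # Space calls evenly so the whole budget is used without exceeding it.
    delay = WINDOW_SECONDS / CALLS_PER_WINDOW
    for _, token in zip(range(max_calls), cycle(tokens)):
        for post in fetch(token, query):
            # INSERT OR IGNORE skips posts already stored by an earlier call.
            db.execute(
                "INSERT OR IGNORE INTO posts VALUES (?, ?)",
                (post["id"], post["text"]),
            )
        db.commit()
        sleep(delay)
    return db.execute("SELECT COUNT(*) FROM posts").fetchone()[0]


# Usage with a fake fetcher and no real sleeping, so the sketch runs instantly.
ids = iter(range(1_000_000))


def fake_fetch(token, query):
    return [{"id": str(next(ids)), "text": f"{query} via {token}"}]


db = sqlite3.connect(":memory:")
stored = harvest(["token-a", "token-b"], "#api", fake_fetch, db,
                 max_calls=6, sleep=lambda seconds: None)
print(stored)  # prints 6
```

Injecting `fetch` and `sleep` keeps the pacing and storage logic testable without network access. A real harvester would also need to handle revoked tokens and error responses, which is exactly the fragility the ScraperWiki example points to.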

Organizations may choose to partner with Twitter for full data access, but this is a costly endeavor. And although Twitter recently tested its Data Grant project, which awards data access much as a philanthropic fund awards money, only 6 out of 1,300 applicants were given access.

With uncertainty mounting, Kinsley argues that the fad of social media data collection and data visualization may be skewed from the outset, because data science performed by commercial organizations (rather than academic institutions) lacks criticality.

Original Article

A political economy of Twitter data? Conducting research with proprietary data is neither easy nor free.

Bill Doerrfeld

I am a consultant who specializes in API economy research and content creation for developer-centric programs. I study Application Programming Interfaces (APIs) and related tech, and I develop content (eBooks, blogs, whitepapers, graphic design) paired with high-impact publishing strategies. I live and work in Seattle, and spend most of my time as Editor in Chief for Nordic APIs, a blog and knowledge center for API providers. For a time I was a Directory Manager and Associate Editor at ProgrammableWeb, and I still add new APIs to the directory every now and then. Drop me a line at bill@doerrfeld.io. Let's connect on Twitter at @DoerrfeldBill, or follow me on LinkedIn.

Comments

stustu12

You may also be interested in a tool we are working on (Sifter) that gives free estimates of filtered queries against the complete (undeleted) history of Twitter using Gnip PowerTrack rules. 

dennisyu

Twitter does natively report post-level data on your accounts.  The impression counts are perhaps questionable, but certainly it's in Twitter's short-term interest to hide their data. The engagement levels are low and the user quality is suspect (too many bots that we can't easily identify).