Is Incomplete Twitter Data Skewing Social Analytics?

Data scientists increasingly rely on social media APIs for market research, but in doing so they leave entire data collection models at the mercy of fluctuating access controls and biased data returns. In an article published by the London School of Economics and Political Science, Sam Kinsley, a researcher at the University of Exeter, argues that harvesting Big Data through social media APIs such as the Twitter Streaming and Twitter Search APIs is neither easy nor reliable.

The Twitter Streaming API doesn’t allow access to the “firehose,” the complete data pool, but only to a peculiar 1% slice of the total stream, a sample generally called the “spritzer.” A recent study found biased results when it compared the spritzer with a random 1% sample taken from the firehose, and it remains unclear how these samples are generated.

The Twitter Search API is problematic for additional reasons. Researchers can’t query a specific date in the past; they can only retrieve posts from the previous week. Access is also capped at a strict number of permitted calls, with the maximum estimated at 450 calls per 15 minutes when using multiple access tokens. Compared with the 20 million posts made on Twitter every hour, that yields only a minuscule fraction.
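
To put that fraction in rough numbers, here is a back-of-the-envelope sketch in Python. The call limit and hourly post volume come from the figures above; the number of posts returned per call (`TWEETS_PER_CALL = 100`) is an assumption made for illustration and is not stated in the article, so treat the result as an order-of-magnitude estimate only.

```python
# Back-of-the-envelope coverage estimate for the Search API figures quoted above.
CALLS_PER_WINDOW = 450        # from the article
WINDOW_MINUTES = 15           # from the article
POSTS_PER_HOUR = 20_000_000   # from the article
TWEETS_PER_CALL = 100         # ASSUMPTION: posts returned per call, not stated in the article


def hourly_coverage(calls_per_window, window_minutes, tweets_per_call, posts_per_hour):
    """Return (posts retrievable per hour, fraction of all posts made in that hour)."""
    windows_per_hour = 60 / window_minutes
    retrievable = calls_per_window * windows_per_hour * tweets_per_call
    return retrievable, retrievable / posts_per_hour


retrievable, fraction = hourly_coverage(
    CALLS_PER_WINDOW, WINDOW_MINUTES, TWEETS_PER_CALL, POSTS_PER_HOUR
)
print(f"{retrievable:,.0f} posts/hour, {fraction:.2%} of all posts")
# prints: 180,000 posts/hour, 0.90% of all posts
```

Under that assumption, even a harvester running flat out at the call limit would see less than 1% of hourly traffic, and halving the assumed page size halves the coverage.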

To maximize the data harvested through the Twitter APIs, a research firm must clear many hurdles. It must acquire additional access tokens, write a program that harvests continuously at the maximum request threshold, and create a structure for storing and retrieving the collected data. ScraperWiki used to provide this service until Twitter revoked its API access, proof that continued, stable API access is not guaranteed.
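
The harvesting and storage steps described above can be sketched roughly as follows. This is a minimal illustration, not how ScraperWiki or any particular firm did it: `WindowedRateLimiter`, `harvest`, and the `fetch_page` callable are hypothetical names introduced here, and the actual API request (authentication, tokens, paging) is deliberately left out and replaced by a stub. The sketch only shows the two pieces the paragraph names: staying within a fixed call budget per time window, and writing collected posts to a store (here SQLite) that can be queried later.

```python
import json
import sqlite3
import time


class WindowedRateLimiter:
    """Allow at most max_calls per window_seconds, sleeping once the budget is spent."""

    def __init__(self, max_calls, window_seconds, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.clock = clock
        self.sleep = sleep
        self.window_start = clock()
        self.calls_made = 0

    def acquire(self):
        now = self.clock()
        if now - self.window_start >= self.window_seconds:
            # The previous window has elapsed, so the budget resets.
            self.window_start = now
            self.calls_made = 0
        if self.calls_made >= self.max_calls:
            # Budget exhausted: wait out the rest of the window, then start a new one.
            self.sleep(self.window_seconds - (now - self.window_start))
            self.window_start = self.clock()
            self.calls_made = 0
        self.calls_made += 1


def harvest(fetch_page, limiter, conn, max_requests):
    """Call fetch_page() up to max_requests times; return the number of new posts stored.

    fetch_page is a hypothetical callable returning a list of dicts with an "id" key.
    """
    conn.execute("CREATE TABLE IF NOT EXISTS posts (id TEXT PRIMARY KEY, payload TEXT)")
    before = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
    for _ in range(max_requests):
        limiter.acquire()
        for post in fetch_page():
            # The primary key de-duplicates posts that show up in more than one response.
            conn.execute(
                "INSERT OR IGNORE INTO posts (id, payload) VALUES (?, ?)",
                (str(post["id"]), json.dumps(post)),
            )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
    return after - before


# Usage with a stub in place of a real API call (no network access, no waiting).
demo_conn = sqlite3.connect(":memory:")
demo_limiter = WindowedRateLimiter(max_calls=450, window_seconds=15 * 60)
demo_stored = harvest(lambda: [{"id": 1, "text": "example"}], demo_limiter, demo_conn, max_requests=3)
print(demo_stored, "new post(s) stored")
```

Passing the clock and sleep functions into the limiter keeps the sketch testable without waiting out real 15-minute windows. A production harvester would also need error handling, token rotation, and a plan for revoked access, none of which is shown here.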

Organizations may choose to partner with Twitter for full data access, but this is a costly endeavor. And though Twitter recently tested its Data Grant project, which provides data access in a way similar to philanthropic monetary funding, only 6 out of 1300 applicants (under 0.5%) were given access.

With uncertainty mounting, Kinsley argues that the fad of social media data collection and data visualization may be skewed from the outset, because data science performed by commercial organizations (rather than academic institutions) lacks criticality.

Original Article

A political economy of Twitter data? Conducting research with proprietary data is neither easy nor free.