This guest post comes from Adam Green, a Twitter API consultant and author. Adam blogs about Twitter programming at 140dev.com and tweets from @140dev. His latest book, Twitter API Engagement Programming, is available on Amazon in paperback and Kindle editions. This is the second installment in Adam’s four-part series. The first can be seen here.
Controlling a local database is vital for collecting tweets and users from the Twitter API. Once this is accomplished, you no longer have to worry about rate limits, API reliability or speed of access. I do most of my work with MySQL, but any server-based database can be used—even a flat file in CSV format is a viable solution for minimal storage needs.
I’m a long-time relational database guy, so I parse tweets and users into separate entities (names, tags, URLs, @mentions, etc.), then populate a set of tables with related keys. The MySQL schema in my engagement programming book is an example of this relational model, and this sample source code shows how to parse tweets from the search API, then store the separate elements in a database.
There are three basic ways to collect tweets: the streaming API, the search API and user timelines. They vary in volume, rate limits and timeframe of results, as the linked docs explain. Tweets returned via any of these methods can be combined in the same database. After the data is collected, the original source doesn’t matter: They can all be treated as tweets.
All of the API calls that return tweets also include the author’s account information in the same package. As you are parsing tweets to prepare them for database insertion, you can pull out data about users and add this to a separate users table. Getting a user’s data with each tweet is a cheap way of keeping the database copy of that user’s record up to date. Since the tweet contains the current user values, you can insert the user if she’s not already in the database, or update her record with the new values from the tweet for users whom you already have.
Getting new users from tweets and updating existing folks is just part of the work you have to do. Many aspects of a user’s profile change over time, and it’s important to keep this data current. For example, if you want to search for users based on keywords in their bios, you need to keep refreshing the bio text. The approach I use is to refresh a last_updated field in the user record when the user is added or updated through a new tweet. Then I run a cronjob at a regular interval that pulls out all the users who have not been updated recently, and requests their most recent profiles with the /users/lookup API call. This minimizes the number of user profiles that have to be refreshed.
Users who are mentioned in tweets are a good source for engagement. After all, one definition of influence is being included in conversations. Unfortunately, the various tweet-collection APIs only deliver the account data for the tweet’s author. If @fred sends a tweet of “Hey there, @sally,” the API delivers @fred’s profile, but not @sally’s. If you also want to gather the accounts for users who are mentioned in tweets, you have to find them in the database and then look up their profiles with a separate /users/lookup request.
This isn’t as hard as it seems if you use a relational database schema. Here is a SQL query for a list of users who were mentioned in tweets, but haven’t yet been added to the users table:
SELECT DISTINCT target_user_id FROM tweet_mentions WHERE target_user_id NOT IN (SELECT user_id FROM users)
The next installment of this series will go further with this type of data mining to show you how to identify the most influential users in the tweet database.