This guest post comes from Jesse Emery, Co-Founder and Chief Identity Officer of YourTrove, the world's first truly social search engine. You can follow Jesse @ejesse or check out his LinkedIn.
At YourTrove, a lot of what we do involves ingesting social content via APIs and then regularizing that data within our system. For a lot of data, this is pretty straightforward. For example, while Facebook and Flickr might return different metadata, or name fields differently, no one disagrees that a photo is a photo. This is true for essentially all uploaded binary user-generated content.
But things get wildly different for textual content. In an earlier version of YourTrove we were pulling down a lot of Facebook and Twitter data and looking at the best way to regularize all the user updates. What ensued was a passionate debate that's best summarized as: "Is a tweet a status?"
For the human-facing UI, there was almost no disagreement. Most users expect tweets and Facebook statuses to be roughly equivalent, regardless of what techies want to argue they should be called. But from a data standpoint things were far from simple...
At first glance, a Facebook status update and a tweet seem very similar: both are generally short bursts of text (although Facebook permits significantly longer updates than Twitter) that are broadcast to your friends/followers and can contain links, but don't contain formatting.
But the next level of interaction is where everything starts breaking down: replies and/or comments.
On Twitter, a reply is another tweet. On Facebook, a "reply" is a comment, which Facebook treats completely differently than a status update. Now, all of a sudden, Facebook status updates look more like blog posts than like tweets from a data standpoint, even though most humans would think tweets were more like Facebook statuses and that blog posts were closer to news articles.
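To make the mismatch concrete, here is a minimal sketch of the two shapes. The records and field names (`in_reply_to`, `message`, `comments`) are simplified stand-ins invented for illustration, not the real Twitter or Facebook payloads.

```python
# Simplified, made-up records; real API payloads carry many more fields.

# Twitter-style: a reply is a full tweet that points back at its parent.
tweet = {"id": "t1", "text": "Is a tweet a status?"}
tweet_reply = {"id": "t2", "text": "Depends who you ask.", "in_reply_to": "t1"}

# Facebook-style: a reply is a comment nested inside the status it belongs to.
facebook_status = {
    "id": "f1",
    "message": "Is a tweet a status?",
    "comments": [{"id": "c1", "message": "Depends who you ask."}],
}


def replies_to_tweet(tweet_id, all_tweets):
    """Replies are peers, so they have to be found by scanning other tweets."""
    return [t for t in all_tweets if t.get("in_reply_to") == tweet_id]


def replies_to_status(status):
    """Replies are children, so they are read straight off the parent."""
    return status.get("comments", [])
```

The same question ("what are the replies?") needs a lookup across objects in one case and a field read in the other, which is why the two types refuse to line up neatly.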
At this point, our debate started ranging off to services YourTrove hadn't even begun to deal with yet: what about Reddit posts and comments? A Quora question or answer? An Amazon review or wish list? Even poor old email got dragged into the fray. If you really want to melt your brain: what about re-tweets (or reblogs on Tumblr)? And then add into the mix the fact that most of these services change the composition of these objects themselves from time to time.
So what's a social data regularizer to do?
Originally YourTrove had a quite complex object hierarchy that, while being "correct" and quite elegant to work with, rapidly became impractical for almost every part of our system besides, well, us, the developers. Our approach scaled poorly, was bug-prone, and was downright allergic to change.
So we threw all that out. What YourTrove has now could be thought of as "interfaces over abstract classes," although that phrase only describes our implementation in an abstract way, rather than in its literal programmatic meaning.
YourTrove no longer stores any user content in any kind of RDBMS or even a NoSQL database. Everything goes into a search index as JSON. Rather than worrying tremendously about "types" (although YourTrove does still have types), we instead transform everything at the field level and use static methods when we need to perform more complex operations like consolidating different fields for particular views.
For example, whatever a particular external API calls its "date," YourTrove removes that field from the JSON, reinserts it as a "service_date" field, and ensures the value is an ISO UTC date. YourTrove lets lists of comments be called "comments," but we use a static method called "get_replies(obj)" which will retrieve both the comments and any other kind of reply.
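As a rough sketch of what this might look like (not YourTrove's actual code): only "service_date", "comments" and "get_replies(obj)" come from the description above. The service names, the `DATE_FIELDS` mapping, the `Content` class, the "replies" field, and the assumption that incoming dates are already ISO 8601 strings are all made up for illustration.

```python
from datetime import datetime, timezone

# Hypothetical: which field each (made-up) service uses for its date.
DATE_FIELDS = {"service_a": "created_at", "service_b": "created_time"}

# Hypothetical: fields that can hold replies of one kind or another.
REPLY_FIELDS = ("comments", "replies")


def normalize_date(obj, service):
    """Move the service's own date field to 'service_date' as an ISO UTC string."""
    raw = obj.pop(DATE_FIELDS[service])
    # Assumption: the raw value is ISO 8601; real APIs may need per-service parsing.
    parsed = datetime.fromisoformat(raw)
    if parsed.tzinfo is None:
        # Assumption: a date without an offset is already in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    obj["service_date"] = parsed.astimezone(timezone.utc).isoformat()
    return obj


class Content:
    @staticmethod
    def get_replies(obj):
        """Collect comments plus any other kind of reply into one list."""
        replies = []
        for field in REPLY_FIELDS:
            replies.extend(obj.get(field, []))
        return replies
```

For instance, `normalize_date({"created_time": "2012-03-01T10:00:00-05:00"}, "service_b")` drops "created_time" and leaves a "service_date" expressed in UTC, while `Content.get_replies` works the same way whether an object carries "comments", "replies", both, or neither.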
We let content be embedded however it wants to be, in a list of "references". In this way, if a Facebook status contains a link to a Twitpic photo (or a tweet has "entities"), YourTrove can return that status to the user or API consumer as a "photo" because it has a reference to a photo. Essentially we invert the reference on the fly. The same would be true, for example, for a link to a New York Times article.
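One way to picture inverting the reference on the fly is the sketch below. Only the "references" list comes from the description above; the helper names `types_of` and `view_as`, the "type" and "referenced_by" fields, and the sample record are hypothetical.

```python
def types_of(obj):
    """Every type this object can be returned as: its own plus its references'."""
    types = {obj["type"]}
    for ref in obj.get("references", []):
        types.add(ref["type"])
    return types


def view_as(obj, wanted_type):
    """Return each reference of wanted_type, carrying the object that referenced it."""
    # The parent is copied without its references to keep the view flat.
    parent = {k: v for k, v in obj.items() if k != "references"}
    views = []
    for ref in obj.get("references", []):
        if ref.get("type") == wanted_type:
            view = dict(ref)
            view["referenced_by"] = parent
            views.append(view)
    return views


# Made-up record for illustration.
status = {
    "type": "status",
    "text": "Look at this",
    "references": [
        {"type": "photo", "url": "http://example.com/p.jpg"},
        {"type": "link", "url": "http://example.com/article"},
    ],
}
```

Here `types_of(status)` reports that the status can also be served as a photo or a link, and `view_as(status, "photo")` hands back the photo with the original status attached, so a photo search can surface it without the stored object ever changing type.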
And just like that, not only can a tweet be a status, but it can also be a photo, a video, a link, or, in theory, whatever.
So, is a tweet a status? The answer for us was: "We don't care as long as YourTrove can give the user what they're asking for."