Why Forcing LinkedIn to Allow Scrapers Sets a Dangerous Precedent for the API Economy

Last night, I thought my work day was over as I was doing one last scan of the interwebs when I saw it. Usually, a one-word tweet -- "Whoa" -- isn't enough to get my attention. It could be anything. But it came from Sam Ramji, now working in developer relations at Google and formerly of Apigee and Cloud Foundry; someone who I know is not easily surprised. My work day apparently wasn't over. I drilled backwards through the tweetstream to find another friend, RedMonk principal analyst Steve O'Grady, who tweeted "so this was interesting and surprising."

They were responding to news that a federal judge had ordered Microsoft's LinkedIn to remove, within 24 hours, any technical barriers that prevent third parties from scraping the data off of any LinkedIn profile designated as public (which must be nearly all of them). As it sank in, I gasped.

"Scraping" gets its name from the phrase "screen scraping." Back in the PC days, before the Web was hot, some very clever programmers wrote routines that could reverse-engineer whatever data an application was displaying on a PC's screen and then siphon that data off to a structured database. Along came the Web with its highly repetitive structural HTML patterns from one page to the next on data intensive sites like LinkedIn and eBay and now, developers didn't even have to intercept the pixels. They just had to retrieve and parse the HTML with the same kind of Regular Expressions that drove the success of search engines like Google and Yahoo!  

For sites that don't offer an API, making scraped Web data available through an API of your own (sometimes called a "Scrape-P-I") can be an invaluable workaround for remixing a site's data into new and innovative applications. There are even services that will build such an API for you, for sites that permit scraping precisely because they have no API of their own. ProgrammableWeb recently reviewed one (see How to Turn Existing Web Pages into RESTful APIs with import.io).
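As a rough illustration of the Scrape-P-I idea, the sketch below wraps scraped records in a simple JSON endpoint. It assumes Flask, and it hard-codes a couple of records where a real service would plug in a scraper like the one sketched above:

```python
from flask import Flask, jsonify

app = Flask(__name__)

# In a real Scrape-P-I, this list would be populated by a scraper;
# it is hard-coded here purely for illustration.
PROFILES = [
    {"name": "Jane Doe", "title": "Engineer"},
    {"name": "John Smith", "title": "Product Manager"},
]

@app.route("/api/profiles")
def list_profiles():
    # Serve the scraped records as JSON, giving a site that has no API
    # an API-like front end.
    return jsonify(PROFILES)

if __name__ == "__main__":
    app.run(port=5000)
```

A developer could then GET /api/profiles and treat the site as though it had published an API all along.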

However, as with many sites containing valuable data that third parties would like to freely access, there are manual, technical, and legal barriers to scraping LinkedIn. After LinkedIn blocked hiQ Labs from scraping its site, hiQ Labs filed a lawsuit and prevailed on the basis that LinkedIn was "unfairly leveraging its power in the professional networking market for an anticompetitive purpose." The judge likened LinkedIn's argument to the idea of blocking "access by individuals or groups on the basis of race or gender discrimination."

In my opinion, the judge got it wrong, and the implications of this terrible decision for API providers should not be underestimated. Those of you who know me and my history of defending open computing might think I would hail this decision. After all, who is LinkedIn to hoard all that data for itself? However, when it comes to collecting data and organizing it for ease of searching, displaying, and consuming it with a machine (e.g., through an API), my opinion on this matter is deeply affected by the strategic and tactical investments we make in order to provide ProgrammableWeb's API and SDK directories as a free service.

A long, long time ago, whether it had to do with personal profiles or API profiles, the vast majority of related data lived wherever it lived and was (and often still is) both disorganized and unstructured -- highly "disaggregated," as we like to say. Where it was organized or structured, it was only in pockets. For example, your contact data might have been structured and organized according to the contact management system you used, like the one embedded in your email system. But other information, like the list of jobs you held and what you did at each of them, was likely scattered across resumes and other text files, if it was captured anywhere at all.

When the engineers at a service like LinkedIn sit down to think about their data model, they have to think about what sort of experiences they want to offer to their various constituencies, what sort of data is required to enable those experiences, where that data can be found, and, once it is found, how best to store it in a database. This involves the design and construction of schemas, tables, vocabularies, taxonomies, and other important artifacts that, taken together, enable great and performant user experiences. For example, as soon as you discover that one entity type might have two or more of another entity type connected to it, you've got what's called a one-to-many relationship that must be factored into your data model. Think of how one person holds multiple jobs over a career, and each of those jobs is connected to a company.
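Here's a minimal sketch of that one-to-many shape, using invented names and plain Python objects; a production system like LinkedIn's would express the same relationships in database schemas rather than in memory:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Company:
    name: str

@dataclass
class Position:
    title: str
    company: Company          # each job is connected to a company

@dataclass
class Person:
    name: str
    positions: List[Position] = field(default_factory=list)  # one-to-many

# One person, several jobs, each tied to a company.
jane = Person("Jane Doe", [
    Position("Engineer", Company("Acme Corp")),
    Position("Engineering Manager", Company("Globex")),
])
```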

Or, in ProgrammableWeb's case, when we first made a provision in our API profiles for the location and type of an API's public-facing API description file (like an OAS, RAML, or Blueprint-based description), we wrongly assumed there would be only one such description file per API. Just including the fields for the API description in our profile was an important data model decision aimed at serving the needs of both API providers and developers (our primary audience). But when we saw services like Stoplight.io and Microsoft's Azure API Management offering more than one description file per API (Stoplight offers both OAS and RAML; Azure offers both OAS and WADL), we decided to fix our data model to accommodate those use cases.
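To illustrate the kind of change involved (the field and type names here are invented for illustration; they are not ProgrammableWeb's actual schema), the fix amounts to promoting a single description-file field into a one-to-many list:

```python
from dataclasses import dataclass, field
from typing import List

# The original assumption: exactly one description file per API profile.
@dataclass
class ApiProfileV1:
    name: str
    description_url: str     # e.g., a single OAS file
    description_format: str  # "OAS", "RAML", "Blueprint", ...

# The revised model: a profile owns a list of description files, which
# accommodates providers that publish both OAS and RAML (or OAS and WADL).
@dataclass
class DescriptionFile:
    url: str
    format: str

@dataclass
class ApiProfileV2:
    name: str
    description_files: List[DescriptionFile] = field(default_factory=list)
```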
