Why Forcing LinkedIn to Allow Scrapers Sets a Dangerous Precedent for the API Economy

Last night, I thought my work day was over. I was doing one last scan of the interwebs when I saw it. Usually, a one-word tweet -- "Whoa" -- isn't enough to get my attention. It could be anything. But it came from Sam Ramji, now working in developer relations at Google and formerly of Apigee and Cloud Foundry; someone I know is not easily surprised. My work day apparently wasn't over. I drilled backwards through the tweetstream to find another friend, RedMonk principal analyst Steve O'Grady, who tweeted "so this was interesting and surprising."

They were responding to news that a federal judge had ordered Microsoft's LinkedIn to remove, within 24 hours, any technical barriers that prevent third parties from scraping the data off of any LinkedIn profiles designated as public (which must be like, all of them). As it sank in, I gasped.

"Scraping" gets its name from the phrase "screen scraping." Back in the PC days, before the Web was hot, some very clever programmers wrote routines that could reverse-engineer whatever data an application was displaying on a PC's screen and then siphon that data off to a structured database. Along came the Web with its highly repetitive structural HTML patterns from one page to the next on data intensive sites like LinkedIn and eBay and now, developers didn't even have to intercept the pixels. They just had to retrieve and parse the HTML with the same kind of Regular Expressions that drove the success of search engines like Google and Yahoo!  

For sites that don't offer an API, making scraped Web data available through an API (sometimes called a "Scrape-P-I") can be an invaluable workaround for remixing a site's data into new and innovative applications. There are even services that will do it for you for sites that allow scraping but don't offer an API of their own. ProgrammableWeb recently reviewed one (see How to Turn Existing Web Pages into RESTful APIs with import.io).
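As a rough sketch of the "Scrape-P-I" idea, using only Python's standard library (the scrape_jobs() helper is a hypothetical stand-in for real scraping logic like the earlier snippet; services such as import.io do this far more robustly):

```python
# Minimal "Scrape-P-I" sketch: expose scraped data through a JSON endpoint.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def scrape_jobs():
    # Hypothetical stand-in for real scraping logic (see the earlier sketch).
    return ["Editor in Chief", "Industry Analyst"]

class JobsAPI(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/jobs":
            body = json.dumps({"jobs": scrape_jobs()}).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), JobsAPI).serve_forever()
```

A consumer can then GET http://localhost:8000/jobs and receive structured JSON, even though the underlying site never offered an API.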

However, like many sites containing valuable data that third parties would like to freely access, there are manual, technical, and legal barriers to scraping LinkedIn. After LinkedIn blocked hiQ Labs from scraping its site, hiQ Labs filed a lawsuit and prevailed on the basis that LinkedIn was “unfairly leveraging its power in the professional networking market for an anticompetitive purpose.” The judge likened LinkedIn’s argument to the idea of blocking “access by individuals or groups on the basis of race or gender discrimination.”

In my opinion, the judge got it wrong, and the implications of this terrible decision for API providers should not be underestimated. For those of you who know me and my history of defending open computing, you might think I would hail this decision. After all, who is LinkedIn to hoard all that data for itself? However, when it comes to collecting data and organizing it for ease of searching, displaying, and consuming it with a machine (e.g., through an API), my opinion on this matter is deeply affected by the strategic and tactical investments we make in order to provide ProgrammableWeb's API and SDK directories as a free service.

A long, long time ago, whether it had to do with personal profiles or API profiles, the vast majority of related data lived wherever it lived and was (and often still is) both disorganized and unstructured. Highly "disaggregated," as we like to say. Where it was organized or structured, it was only in pockets. For example, your contact data might have been structured and organized according to the contact management system you used, like the one embedded in your email system. But other information, like the list of jobs you held and what you did at each of them, was likely scattered across resumes and other text files, if it was captured anywhere at all.

When the engineers at a service like LinkedIn sit down to think about their data model, they have to think about what sort of experiences they want to offer to their various constituencies, what sort of data is required to enable those experiences, where that data can be found, and, once it is found, how to best store it in a database. This involves the design and construction of schemas, tables, vocabularies, taxonomies, and other important artifacts that, taken together, enable great and performant user experiences. For example, as soon as you discover that one entity type might have two or more of another entity type connected to it, you've got what's called a one-to-many relationship that must be factored into your data model. An example might be how one person has multiple jobs and each job is connected with a company.
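As a sketch of how that one-to-many relationship might look in code (the class and field names here are illustrative, not LinkedIn's actual schema):

```python
# Illustrative one-to-many model: one person holds many jobs, and each
# job is connected to a company. Names and fields are hypothetical.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Company:
    name: str

@dataclass
class Job:
    title: str
    company: Company

@dataclass
class Person:
    name: str
    jobs: List[Job] = field(default_factory=list)  # the "many" side of the relationship

jane = Person("Jane Doe", jobs=[
    Job("Developer Advocate", Company("Example Corp")),
    Job("API Architect", Company("Another Co")),
])
```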

Or, in ProgrammableWeb's case, when we first made a provision in our API profiles for the location and type of an API's public-facing API description file (like an OAS, RAML, or Blueprint-based description), we wrongly assumed there would be only one such description file per API. Just including the fields for the API description in our profile was an important data model decision aimed at serving the needs of both API providers and developers (our primary audience). But then, when we saw services like Stoplight.io and Microsoft's Azure API Management offering more than one description file per API (Stoplight offers both OAS and RAML; Azure offers both OAS and WADL), we decided to fix our data model to accommodate those use cases.
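In rough terms, that fix amounted to turning a single field into a repeating structure, something like this sketch (the field names are illustrative, not ProgrammableWeb's actual schema):

```python
# Before: the profile wrongly assumed exactly one description file per API.
api_profile_v1 = {
    "name": "Example API",
    "description_format": "OAS",
    "description_url": "https://example.com/openapi.json",
}

# After: the same information becomes a list, so a single API can carry
# both an OAS and a RAML (or WADL) description.
api_profile_v2 = {
    "name": "Example API",
    "description_files": [
        {"format": "OAS", "url": "https://example.com/openapi.json"},
        {"format": "RAML", "url": "https://example.com/api.raml"},
    ],
}
```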

David Berlind is the editor-in-chief of ProgrammableWeb.com. You can reach him at david.berlind@programmableweb.com. Connect to David on Twitter at @dberlind or on LinkedIn, put him in a Google+ circle, or friend him on Facebook.
 

Comments (5)

hendy

Your article misses a huge factor, which is Google et al. scrapers. Who defines the line between "good" vs. "bad" scrapers?

Suppose you're a startup search engine and LinkedIn or any other site blocks you, that means search engine competition becomes unfair, right? Discrimination...

PS. Your comment form sucks.. You should display the comment form AFTER requiring the user to log in, instead of requiring me to retype my comment all over again! :(

david_berlind

Hendy, 

Thanks for the feedback. As the inventor of the service, LinkedIn should be able to decide if the use case for crawling is appropriate. What hiQ Labs wants to do with the data is not about searching and potentially constrains LinkedIn's ability to uphold its privacy policies. 

Thank you very much for the feedback on our comment forms. We've been aware of the problem (because we see duplicate comments) and have been looking for feedback just like this to help us diagnose the problem. 

David Berlind

editor-in-chief

ProgrammableWeb.com

Jason-Shinn

Thanks for a well-written technical and policy explanation of this case. While I don't disagree with your analysis, one point in the Judge's ruling that should be looked upon favorably is the discussion about limiting the federal Computer Fraud and Abuse Act (CFAA). Essentially, the question LinkedIn raises in the suit is whether "access" to a publicly viewable site may be deemed "without authorization" under the CFAA when the web site host purports to revoke permission. I've represented a number of clients in CFAA litigation where the expansion of the CFAA has distorted its original purpose. Here is an article link for more about this. Also, I don't think LinkedIn did itself any favors by allowing hiQ Labs to access LinkedIn's data for years and then deciding it was no longer permitted.

david_berlind

Hi Jason,

Thanks for the feedback and the link to your article, which I read. Maybe it raises the question of the best form of defense in situations like this. Perhaps the CFAA is not an ideal shield here. I'm not an attorney, but maybe the site's Terms of Service is a better defense (although I could see the Judge's opinion of hiQ's rights trumping that as well).

Two issues that I did not conflate with my technical arguments are as follows:

1. The judge's specific choice of language that LinkedIn was “unfairly leveraging its power in the professional networking market for an anticompetitive purpose” suggests he bought into the antitrust part of hiQ's complaint. That sounds a lot like antitrust language ... particularly the "professional networking market" part, because of how it defines a market that LinkedIn dominates. As I'm sure you know, antitrust case history is riddled with arbitrary market definitions designed to better support a complaint. We're all monopolists of something; it just depends on how you conveniently define that "something." Why wasn't the market defined as the "social networking market"? Oh, because then LinkedIn could not be accused of antitrust behavior.

2. LinkedIn's promise to its users: LinkedIn has a responsibility to stand up for the rights and privacy of its users. No one signs up for LinkedIn thinking that some company is going to scrape their profile to let their employers know if they're looking for a job. That doesn't fit with my expectations of what the service is going to offer me, and I fully expect the service to protect me, as one of its users, from creeps who want to use my own information against me. That, to me, is a violation of my privacy. So, I fully expect that LinkedIn is going to go to court to protect my expectations as a user. I even wonder if there's any legal exposure there. For example, if LinkedIn doesn't try to stop hiQ and someone then loses their job because an algorithm flagged them as a flight risk, does that person have a case against LinkedIn?

Max-Cherry

If a website can be downloaded, it can be scraped. It's amazing to me that lawsuits like this actually occur in the first place. A good summation of my thoughts here, or in short let me just say that screen scraping is the essence of the programmable web, so these lawsuits really echo perhaps a millennial mindset of "everything must be free!" Which I tend to agree with, but still, a business should have a right to defend its main source of income.