Startups that want to design innovative products and services using data that's accessible via the web now have better data mining tools, thanks in part to startups like the award-winning import.io. Meanwhile, the recent settlement between U.S. startup People+ and AOL around accessing the entire Crunchbase database via API for a Google Glass app means there could be greater clarity for startups around data usage rights when creating commercial ventures. ProgrammableWeb spoke with import.io co-founder and chief data officer Andrew Fogg and Electronic Frontier Foundation staff attorney Mitch Stoltz to discuss commercial trends in data scraping and developer rights in using data via APIs for business product design.
Import.io provides a number of tools to help end users "structure the web," and web scraping is just one technique it provides. This is evident in the product suite, which includes a way to create a dataset from web pages, as well as identify when a website's data is available via API. Fogg explained the service to ProgrammableWeb:
"When a user is browsing the web in Chrome, if they have our Chrome extension installed and they navigate to a website for which we have an existing public API, then a notification icon appears next to the Data Factory button in Chrome."
Fogg demonstrated this with screenshots of the RadioTimes website.
Users can also create APIs by pulling data from a website using import.io's Data Browser tools. "The API always exists in the background. When viewing a dataset, users click the 'Integrate' button. This takes you from the dataset to integration instructions for the API," Fogg explained. The service also enables private APIs to be created by a number of import.io's tools including the Data Browser and Data Factory services.
Monetizing data from the web
Import.io is seeing a number of companies create products by structuring and restructuring web data in this way. "We have a diversity of use cases from lots of industry sectors. Particular examples where we are seeing significant traction include: sales (lead generation), online retail (pricing monitoring), recruitment (jobs data)," Fogg said. A few businesses in particular are already monetizing commercial products by drawing on web data via import.io's API creation tools:
"Examples would be WisePricer, Digital Shadows and ClickMechanic. The way that end users monetize depends on their business models. Some have used affiliate sales programs. Others are using the data to drive other products or services."
Import.io's potential lies beyond just monetizing products directly, according to Fogg. In an import.io presentation at the recent APIDays Paris conference, president Emmanuel Javal discussed how import.io is used internally in business workflows. He gave the example of HP, which wants suppliers to adhere to set retail prices for its hardware. HP uses import.io to monitor online sales prices of its products to ensure it's providing a consistent value across global markets. Fogg added: "There are lots of companies who are not directly monetizing the data but using it to solve problems integral to their businesses: examples would be Storefront (lead generation), and HP (channel partner monitoring)."
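The channel partner monitoring use case can be illustrated with a short Python sketch. It assumes the online prices have already been collected into a simple list by whatever tool is in use. The `Listing` type, the `find_price_deviations` function, the retailer names and the prices are all invented for illustration, and nothing here uses or represents import.io's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Listing:
    """One scraped price observation (hypothetical structure)."""
    retailer: str
    product: str
    price: float


def find_price_deviations(listings, agreed_prices, tolerance=0.0):
    """Return listings whose price differs from the agreed price
    by more than `tolerance`, given as a fraction of the agreed price."""
    deviations = []
    for listing in listings:
        agreed = agreed_prices.get(listing.product)
        if agreed is None:
            continue  # no agreed price known for this product; skip it
        if abs(listing.price - agreed) > tolerance * agreed:
            deviations.append(listing)
    return deviations


# Invented sample data standing in for scraped results.
listings = [
    Listing("shop-a.example", "printer-x", 199.00),
    Listing("shop-b.example", "printer-x", 169.00),
    Listing("shop-c.example", "laptop-y", 899.00),
]
agreed_prices = {"printer-x": 199.00, "laptop-y": 899.00}

# Flag anything more than 5% away from the agreed price.
flagged = find_price_deviations(listings, agreed_prices, tolerance=0.05)
for listing in flagged:
    print(f"{listing.retailer} lists {listing.product} at {listing.price:.2f}")
```

Expressing the tolerance as a fraction rather than a fixed amount lets the same check apply to products at very different price points.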
Data users' rights: AOL vs. People+
End users are monetizing structured data sourced from the web in a way that's reminiscent of the way startup People+ has drawn on data from the Crunchbase API for its forthcoming Google Glass product. The Google Glass app recognizes stakeholders that the wearer meets and automatically feeds information about the person in front of them from Crunchbase's industry database. The idea is that this will give the Glass wearer enough information to start a conversation and create a relationship with the business contact.
AOL, owner of the Crunchbase API, wanted the right to prohibit People+'s access to the database, given the commercial nature of the People+ product. The Crunchbase database had been available under a Creative Commons Attribution license, which allows anyone to use the data as long as they credit Crunchbase as the source. Since raising the dispute, AOL has worked with the Electronic Frontier Foundation to change its API terms of service so that the data is licensed under a Creative Commons Attribution-NonCommercial license. Startups following People+'s lead will need to arrange a separate license with AOL for any monetized products built on Crunchbase data. In the meantime, startups that want to pull in data from the web, whether by scraping or via APIs, are encouraged to check how that data is licensed for external use.
"Once someone uses copyrighted material under a Creative Commons license, the copyright owner can't change its mind and stop that person from using the material," Mitch Stoltz told ProgrammableWeb via email. Stoltz is a staff attorney at the Electronic Frontier Foundation and worked with both People+ and AOL to find a solution to the dispute. "So anyone using material from CrunchBase from before last week can keep using it under the Creative Commons Attribution license, but any material CrunchBase adds in the future will be under the Attribution-Noncommercial license."
The lesson for other startups may well be to store a copy of an open API's terms of service when they first sign up, so that they have a record of their usage rights should those terms later change. Stoltz also adds an important caveat:
"But using the materials and using the API are two different things. Just as a store can usually kick people out for any reason or no reason, an Internet site can stop people from using its API at any time. Startups that want to use open APIs should read the fine print, and get a lawyer's help if necessary."
Data owners' rights
While import.io provides new data access tools for end users, Fogg also recognizes the importance of data owners' rights: "We fully support data owners' rights to restrict the use of data. Crunchbase is a good example: Many people use the data in a way that is supported and encouraged by AOL. They do have a public API, and we can help people use that data and make it easy to combine it with other sources to create unique data sets, which AOL also support. If a specific data use through import.io concerns a data owner, we can support them in solving that problem. Data access is not the issue here, data use is the issue."
Peter Berger, founder of People+, says one of the key lessons of the recent dispute is the need to build relationships between data users and data owners when restructuring web data: "The most important thing for startups to understand is that Terms of Service can change subject to business needs," Berger told ProgrammableWeb. "Alternative data sources can be helpful, but the best hedge against being hurt by a change in someone's terms is to communicate and foster relationships with the people whose data you rely on."
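Because terms can change, it may help to keep a dated copy of the terms a product was built under and to check from time to time whether the published text still matches it. The Python sketch below shows one way to do that. The function names, the record layout and the example URL are illustrative assumptions and do not belong to any service mentioned in this article. Fetching the terms text is left out, and a stored copy is a record, not a substitute for legal advice.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def snapshot_terms(terms_text, source_url, retrieved_at=None):
    """Build a record of the terms text with a timestamp and a SHA-256 digest."""
    retrieved_at = retrieved_at or datetime.now(timezone.utc)
    return {
        "source_url": source_url,
        "retrieved_at": retrieved_at.isoformat(),
        "sha256": hashlib.sha256(terms_text.encode("utf-8")).hexdigest(),
        "text": terms_text,
    }


def save_snapshot(record, directory):
    """Write the record to a JSON file named after its digest; return the path."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"terms-{record['sha256'][:12]}.json"
    path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return path


def terms_changed(record, current_text):
    """Return True if the current text no longer matches the stored digest."""
    current = hashlib.sha256(current_text.encode("utf-8")).hexdigest()
    return current != record["sha256"]


# Hypothetical usage with placeholder text and a placeholder URL.
record = snapshot_terms(
    "Data is licensed under Creative Commons Attribution.",
    "https://api.example.com/terms",
)
print(terms_changed(record, "Data is licensed under Creative Commons Attribution."))
```

Storing the full text alongside the digest keeps the record readable later, while the digest makes a routine comparison cheap.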
For data owners, Stoltz recommends using Creative Commons templates to manage the use of their data via open APIs.
Stoltz says that for content being released on the web, "it depends on the contents of the data, [but] for creative materials that are covered by copyright, a good way is to use one of the Creative Commons Noncommercial licenses. They allow anyone to use the material for non-commercial purposes but require commercial users to negotiate a separate license with the website owner.
"Access to a wide variety of public domain and freely licensed material on the Web is good for everyone, including the people who make it available. The Creative Commons licenses are one of the easiest and most effective ways to make creative work and data available to all. When using Creative Commons, website owners should resist the temptation to add other legal terms or requirements that change the way Creative Commons works - that undermines a lot of the usefulness of Creative Commons and increases legal risks for everyone."
API providers who want to provide their specifications and data model openly can use the tools at API Commons. These tools place a declarative statement about the relevant Creative Commons license directly into the API definition.
For a more detailed look at the issues involved in how third parties are developing products and services using web-scraping tools, watch out for Patricio Robles' forthcoming article on "Unofficial APIs" on ProgrammableWeb.
By Mark Boyd. Mark is a freelance writer focusing on how we use technology to connect and interact. He writes regularly about API business models, open data, smart cities, Quantified Self and e-commerce. He can be contacted via email, on Twitter, or on Google+.