Web Scraping Evolved: APIs for Turning Webpage Content into Valuable Data

This guest post comes from Marc Mezzacca, founder of NextGen Shopping, a company dedicated to creating innovative shopping mashups. Marc's latest venture is a social-media-based coupon code website called CouponFollow that uses the Twitter API.

While adoption of semantic standards is increasing, most of the web still consists of unstructured data. Search engines like Google remain focused on looking at the HTML markup for clues to create richer results for their users. The creation of schema.org and similar movements has made it easier to draw valuable content from webpages.

But even with semantic standards in place, structured data requires a parser to extract information and convert it into a data-interchange format, typically JSON or XML. Libraries for this exist in several popular programming languages. But be warned: most of these parser libraries are far from polished. Most are community-produced and thus may not be complete or up to date, as standards are ever changing. On the flip side, website owners who don't fully follow semantic rules can break the parser. And of course there are sites that contain no structured data formatting at all. This inconsistency causes problems for intelligent data harvesting and can be a roadblock for new business ideas and startups.
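To make the parsing step concrete, here is a minimal sketch in Python that reads schema.org microdata from a page and emits JSON, using only the standard library. The sample markup, the class name MicrodataParser, and the helper extract_microdata are made up for illustration. The sketch assumes a single flat item whose property values contain no nested tags, which is exactly the kind of assumption that real-world pages tend to break.

```python
import json
from html.parser import HTMLParser

# Made-up sample page fragment, for illustration only.
SAMPLE_HTML = """
<div itemscope itemtype="https://schema.org/Article">
  <h1 itemprop="headline">Example Headline</h1>
  <span itemprop="author">Jane Doe</span>
  <meta itemprop="datePublished" content="2020-01-01">
</div>
"""


class MicrodataParser(HTMLParser):
    """Collects flat itemprop/value pairs from a single microdata item."""

    def __init__(self):
        super().__init__()
        self.item = {}         # extracted properties
        self._current = None   # itemprop whose text is being collected
        self._buffer = []      # text chunks for the current property

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemtype" in attrs:
            self.item["@type"] = attrs["itemtype"]
        prop = attrs.get("itemprop")
        if prop is None:
            return
        if "content" in attrs:
            # e.g. <meta itemprop="..." content="...">
            self.item[prop] = attrs["content"]
        else:
            self._current = prop
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Assumes property values hold plain text with no nested tags.
        if self._current is not None:
            self.item[self._current] = "".join(self._buffer).strip()
            self._current = None


def extract_microdata(html):
    """Parse the HTML and return the collected properties as a dict."""
    parser = MicrodataParser()
    parser.feed(html)
    parser.close()
    return parser.item


if __name__ == "__main__":
    print(json.dumps(extract_microdata(SAMPLE_HTML), indent=2))
```

A page that nests tags inside a property, nests one item inside another, or carries no itemprop attributes at all would yield incomplete or empty output here, which illustrates why hand-rolled parsers are fragile.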

Several companies offer API services that make sense of this unstructured data, helping to remove these roadblocks. For example, AlchemyAPI offers a suite of data extraction APIs, including a Structured Content Scraping API that extracts structured data based on both visual and structural traits. Another company, DiffBot, is also taking care of the "dirty work" in the cloud, allowing entrepreneurs and developers to focus on their business instead of the semantics involved in parsing. DiffBot stands out because of their unique approach. Instead of looking at the data as a computer would, they look at the page visually, as a human would. They first classify the type of webpage (e.g., article, blog post, product) and then extract what visually appears to be the relevant data for that page type (article title, most relevant image, etc.).

Currently their website lists APIs for Page Classification (check out their infographic), as well as for parsing Article-type webpages. Much of the web, including discussion boards, events, and e-commerce data, remains open as potential future API offerings, and it will be interesting to see which they go after next.

You can test drive the Article API on their website and see the extraction results instantly, as shown below for this article:

And now in JSON formatting:

This API can be a handy tool for young startup companies looking to avoid the parsing game. Delve, a New York City startup based out of WeWork Labs, can't wait around for the true Semantic Web to get here, so they've been using DiffBot's Article API as a main component of their product. Delve provides an enterprise news reader with a social element. Teams ranging from five to one thousand people can converse and collaborate on relevant news articles.

Delve wants to provide the best information possible to their users, so indexing article content correctly is essential, but for a startup, so is time. Thomas Weingarten, CTO of Delve, says they landed on DiffBot because it provides quality results for elements critical to their product offering, such as article title, author, and publish date, but admits that for high-value targets they still do some scraping themselves to fully ensure accuracy.
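The article does not say how Delve combines the two sources, but one plausible arrangement is to let a hand-written scraper override the generic API result for a short list of high-value domains. The sketch below is purely illustrative: the domain news.example.com, the function scrape_example_news, the CUSTOM_SCRAPERS table, and the field names are all hypothetical and are not taken from Delve or DiffBot.

```python
from urllib.parse import urlparse


def scrape_example_news(url):
    """Hypothetical stand-in for a hand-written, site-specific scraper."""
    return {"title": "Hand-checked title", "author": "Hand-checked author"}


# Hypothetical table mapping high-value hostnames to custom scrapers.
CUSTOM_SCRAPERS = {"news.example.com": scrape_example_news}


def merge_article(url, api_result):
    """Keep the generic API fields, but let a custom scraper override
    them when the URL belongs to a high-value domain."""
    merged = dict(api_result)
    scraper = CUSTOM_SCRAPERS.get(urlparse(url).hostname)
    if scraper is not None:
        # Only non-empty scraped values replace the API's values.
        merged.update({k: v for k, v in scraper(url).items() if v})
    return merged
```

With this arrangement, every article gets the API's fields by default, and only the listed domains pay the maintenance cost of a custom scraper.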

The web still has more revisions to go before we'll be able to fully extract the wealth of knowledge it contains.  But combining semantic markup with intelligent data parsing could be the cocktail needed to turn a mess of interconnected webpage data into a structured knowledge base.   These APIs seem to be a step in the right direction.
