Diffbot Analyze API Enables Automatic Data Extraction

Imagine the possibilities if apps and programs could see the web the way humans do. That is what Palo Alto-based startup Diffbot set out to achieve. Using a combination of crawling software, computer vision and machine learning, the company provides a service that interprets web pages, classifies them, and breaks each one down into its component parts. Diffbot's Analyze API makes this functionality available to developers.

So how does it all work, exactly? An article on Xconomy sums it up nicely:

"Diffbot runs virtual browsers in the cloud that can go to a given URL; suck in the page’s HTML, scripts, and style sheets; and render it just as it would be shown on a desktop monitor or a smartphone screen. Then edge-detection algorithms and computer-vision routines go to work, outlining and measuring each element on the page. Using machine-learning techniques, this geometric data can then be compared to frameworks or 'ontologies'..."

Diffbot makes it possible for users to automatically retrieve the data they need from specific web pages. Users can access content from articles, products, images and other page types using Diffbot's selection of automatic APIs. The Analyze API does just what its name suggests: given a page's URL, it analyzes the page visually and determines what type of page it is. It then selects the appropriate extraction API: the Article API, Image API, Product API or Discussion API. The Article API extracts clean article text and related data from news articles and blog posts; the Image API identifies the primary images on a page and returns detailed information about them; the Product API extracts detailed data from shopping or e-commerce product pages; and the Discussion API (currently in beta) extracts detailed information from discussion pages.
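
To make the flow above concrete, here is a minimal Python sketch of how a client might prepare an Analyze request and route the result to one of the four extraction APIs. It is an illustration under stated assumptions, not Diffbot's documented client: the endpoint URL, the `token` and `url` query parameters, and the lowercase page-type labels are assumptions, and the helper names `build_analyze_url` and `pick_extraction_api` are hypothetical. Check the Diffbot documentation for the actual request format before relying on any of it.

```python
from urllib.parse import urlencode

# Assumed endpoint; verify the current URL and version in Diffbot's docs.
ANALYZE_ENDPOINT = "https://api.diffbot.com/v3/analyze"

# Page types named in the article, mapped to the extraction API for each.
# The lowercase labels are an assumption about how a page type is reported.
EXTRACTION_APIS = {
    "article": "Article API",
    "image": "Image API",
    "product": "Product API",
    "discussion": "Discussion API",
}


def build_analyze_url(token, page_url):
    """Build the request URL for analyzing a single page (hypothetical helper)."""
    query = urlencode({"token": token, "url": page_url})
    return ANALYZE_ENDPOINT + "?" + query


def pick_extraction_api(page_type):
    """Return the extraction API for a page type, or None if none applies."""
    return EXTRACTION_APIS.get(page_type.lower())


if __name__ == "__main__":
    # Prints the URL a client would fetch; no network call is made here.
    print(build_analyze_url("YOUR_TOKEN", "http://example.com/post"))
    print(pick_extraction_api("Product"))
```

The sketch deliberately stops short of sending the request, so it runs without a token or network access. A real client would fetch the built URL and read the page type from the response.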

Further information and API documentation are available on the Diffbot website.
