We have covered Web scraping before. As the amount of data that can be captured from the web increases, developers increasingly need a way to handle large volumes of data downloaded from the internet, both structured and unstructured. The underlying problem in writing custom scrapers is that websites differ widely in structure and format, so a scraper that runs smoothly on some websites may return junk results on others.
Scrape.it, a web scraper API, aims to solve this problem. In an exclusive interview with ProgrammableWeb, Scrape.it founder John Kim talks about a better way to extract data.
Programmable Web (PW) - What is the basic function of your API?
John Kim (JK) - Scrape.it is an automatic and unsupervised data extraction API. You give it a URL and it will return a JSON string with data records without knowing anything about the underlying structure of the page. It discovers data fields on the page and produces columns and rows. Scrape.ly is a URL-based API for writing complete web scrapers that lets you crawl web pages.
PW - What can developers use the Scrape.it API for?
JK - Scrape.it extracts data from an HTML page, and it is useful for web scraping across several different domains. An HTML page is filled with noise that makes it difficult to identify and pull out the relevant fields. You could write a script, but it's a grind when you need to do it again and again, and each domain is very different from the last one you extracted data from.
PW - Can we use it for text mining, statistical analysis and even Big Data analytics?
JK - Before you can do any data mining, you need a way to grab the raw data from web pages. This is a two-part process: downloading the pages and extracting data from those pages. For the former, there are solutions like doing an exhaustive crawl of every link, which is quite inefficient and wasteful of resources. For the latter, you would use element selectors like XPath, CSS and class names to pull the text, an approach that does not survive page layout changes over time. Scrape.ly handles page downloads by defining a crawl path (a sequence of links and elements to click and download pages). Scrape.it solves the data extraction problem without making any assumptions about a page (it will automatically discover and extract data fields).
Scrape.it and Scrape.ly can be used in sequence as a first step before data mining. Together they save you time and money while offering robustness against variations on web pages, so that you can focus on the fun part of discovering and learning new information from the extracted data.
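To illustrate the point about fixed selectors not surviving layout changes, here is a minimal sketch using only Python's standard library. The two HTML snippets, the class names and the `extract_by_class` helper are hypothetical and made up for this example; they are not taken from Scrape.it or any real site. The sketch assumes well-formed HTML without void tags such as `<br>`.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element carrying a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0       # greater than 0 while inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0:
            self.depth += 1          # nested tag inside a match
        elif self.class_name in classes:
            self.depth = 1           # entering a matching element
            self.results.append("")

    def handle_endtag(self, tag):
        if self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.results[-1] += data.strip()


def extract_by_class(html, class_name):
    """Hypothetical helper: return the text of all elements with class_name."""
    parser = ClassTextExtractor(class_name)
    parser.feed(html)
    parser.close()
    return parser.results


# The same data before and after an imagined redesign of the page.
OLD_LAYOUT = ('<ul><li><span class="price">$10</span></li>'
              '<li><span class="price">$12</span></li></ul>')
NEW_LAYOUT = ('<table><tr><td class="cost">$10</td></tr>'
              '<tr><td class="cost">$12</td></tr></table>')

old_prices = extract_by_class(OLD_LAYOUT, "price")
new_prices = extract_by_class(NEW_LAYOUT, "price")
print(old_prices)  # ['$10', '$12']
print(new_prices)  # []
```

The prices are still on the redesigned page, but the selector tied to the old class name silently returns nothing, which is the failure mode described above.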
PW - What is some of the core technology you developed for this API?
JK - The technology was an outcome of reading various scientific papers on the subject. Automatic data extraction is a very hard thing to pull off, and I had to explore what the best and the brightest were thinking. It was natural for me to use that knowledge and translate it into a consumable API. With Scrape.ly, the idea of using a URL to define a web crawl path was inspired by Oxpath, which I came across while reading a scientific paper on it. The problem with Oxpath was that it was very verbose. XPath is good enough for locating elements in order to click on links and forms, but it is fragile when used to extract data from a page. The same goes for fixed element properties such as ID, class name and CSS properties: they are often unreliable and stop working when page layouts change. This is where Scrape.it fills the gap. Scrape.ly and Scrape.it complement each other. Scrape.ly lets you download all the pages by writing a specially crafted URL that defines a sequence of elements to click on, and Scrape.it lets you extract data from the downloaded pages.
PW - Describe your personal journey from student to API creator.
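The difference between a crawl path and an exhaustive crawl can be sketched in a few lines of Python. This is only an illustration of the general idea: the in-memory `SITE` map, the page names and both functions are hypothetical, and the sketch does not reproduce Scrape.ly's actual URL syntax, which the interview does not describe.

```python
from collections import deque

# Hypothetical in-memory site: page name -> {link text: target page}.
SITE = {
    "home": {"Products": "products", "About": "about", "Press": "press"},
    "products": {"Next": "products-2", "Home": "home"},
    "products-2": {"Home": "home"},
    "about": {"Careers": "careers", "Home": "home"},
    "careers": {"Home": "home"},
    "press": {"Home": "home"},
}


def follow_crawl_path(site, start, path):
    """Click the named links in order and return the pages visited."""
    visited = [start]
    current = start
    for link_text in path:
        links = site[current]
        if link_text not in links:
            raise KeyError(f"no link {link_text!r} on page {current!r}")
        current = links[link_text]
        visited.append(current)
    return visited


def exhaustive_crawl(site, start):
    """Breadth-first crawl of every reachable page, for comparison."""
    seen = [start]
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in site[page].values():
            if target not in seen:
                seen.append(target)
                queue.append(target)
    return seen


targeted = follow_crawl_path(SITE, "home", ["Products", "Next"])
everything = exhaustive_crawl(SITE, "home")
print(targeted)         # ['home', 'products', 'products-2']
print(len(everything))  # 6
```

In this toy site the crawl path downloads three pages to reach the product listings, while the exhaustive crawl downloads all six.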
JK - Pretty crazy. I studied Economics at Simon Fraser University, so I did not have any formal computing science training. At the time I was obsessed with automating my day trading system, so it progressed into learning how to program. I realized eventually that this was not sustainable (turn on trading system, sleep all day, profit??). Something was missing. I wasn't solving any problems or bettering humanity by writing trading systems. I wasn't getting rich either like I'd hoped.
One summer in my third year of university, I read a book called Webbots, Spiders and Screen Scrapers by Michael Schrenk. Web scraping was such a fascinating topic, and it still is. I liked the idea of automating something that would otherwise require manual labor, because unlike trading systems, I recognized a value here. I started doing some freelance work specializing in scraping data from websites, and it quickly proved to be cumbersome. I had to figure out what data fields needed to be extracted using the tools of the trade like class names, XPath and CSS, which were quite fragile in nature because websites would change layout and the script would stop working.
Sometimes a site would have 20 different fields and it would take a lot of time mapping the HTML to the data fields, so I became frustrated and I set out to automate this.
One summer later, I started to write a web-based screen scraper that was meant for my own use, to automate the data extraction process. What followed was failure after failure. I also did not know proper software development practices, so I relied extensively on Stack Overflow. Web scraping is a challenging problem to solve because web pages are semi-structured and vary a great deal. To write something that could work across all those edge cases was too much of a challenge.
I open sourced the GUI screen scraper for Scrape.it because I felt like I could no longer look at Java Swing (using '90s technology in 2009 was a mistake). I even took out a student loan to purchase a $1,500 browser component for Java Swing. I didn't even know Java at the time but thought I'd figure it out and learn as I went along. I took a semester of computing science courses, and they were going too slow! In order to learn, I would have to continue to tinker with it. I worked on Scrape.it alone for about two more years before I started working as a software developer. I spent another two years as a software developer in Vancouver, BC and found myself wishing I could tackle Scrape.it again, and here I am.
PW - How have Mashape and other partners (if any) helped you with your API?
JK - Mashape was insanely easy to implement. It was a turnkey solution that avoided all the plumbing work like setting up authentication, billing and user feedback. This is a huge win for developers because it lets you focus on what really matters, the API.
Other APIs in the Programmable Web directory that deal with Web Scraping are ScrapeLogo, Bobik, and Web Scrape Master. Other options for Web Scraping include the Scrapy library in Python and iMacros.
Scrape.it is unsupervised and automatic. It uses an algorithm to detect and extract data records automatically, and it is resistant to layout changes. This means you don't need to know the structure of the HTML document ahead of time, which is otherwise the only way to write a script that parses out data fields. Normally, to extract data from a page you would have to write a lengthy script using a library such as Beautiful Soup or jQuery and manually discover each element's position on the page (using XPath or CSS), which is very time consuming and unreliable. It gets ugly when the underlying page changes layout, because you have to repeat the whole process again and again. Scrape.it automatically detects data fields and knows how to separate them into rows and columns.
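The interview does not describe Scrape.it's algorithm, so the following is not its implementation. It is a deliberately simple sketch of one generic heuristic for automatic record detection: find the element whose children most often repeat the same tag structure, and treat each repeated child as a row whose text fragments are the columns. All names and both sample pages are hypothetical, and the sketch assumes well-formed HTML without void tags and records with at least two fields.

```python
from collections import Counter
from html.parser import HTMLParser


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.texts = []   # text fragments in this subtree, in document order


class TreeBuilder(HTMLParser):
    """Build a minimal element tree from well-formed HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        text = data.strip()
        node = self.current
        while text and node is not None:   # record text on every ancestor
            node.texts.append(text)
            node = node.parent


def signature(node):
    """Tag structure of a subtree; class names and ids are ignored."""
    return (node.tag, tuple(signature(c) for c in node.children))


def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)


def extract_records(html):
    """Return rows from the largest group of same-shaped sibling elements."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    best_rows = []
    for node in walk(builder.root):
        counts = Counter(signature(c) for c in node.children)
        if not counts:
            continue
        shape, n = counts.most_common(1)[0]
        rows = [c.texts for c in node.children if signature(c) == shape]
        # Require at least two rows and two columns; keep the largest group.
        if n >= 2 and len(rows[0]) >= 2 and n > len(best_rows):
            best_rows = rows
    return best_rows


PAGE = """
<html><body><h1>Catalog</h1><div id="nav"><a href="/">Home</a></div>
<div class="listing">
  <div class="item"><h2>Widget</h2><span>$10</span></div>
  <div class="item"><h2>Gadget</h2><span>$12</span></div>
  <div class="item"><h2>Gizmo</h2><span>$8</span></div>
</div></body></html>
"""
REDESIGNED = """
<html><body><ul class="products">
  <li><b>Widget</b><i>$10</i></li>
  <li><b>Gadget</b><i>$12</i></li>
  <li><b>Gizmo</b><i>$8</i></li>
</ul></body></html>
"""

records = extract_records(PAGE)
records_after = extract_records(REDESIGNED)
print(records)  # [['Widget', '$10'], ['Gadget', '$12'], ['Gizmo', '$8']]
print(records == records_after)  # True
```

Because the heuristic looks at repetition rather than at specific tags or class names, the same rows come out of both layouts. Real pages are far messier than these samples, which is why a production-grade extractor needs much more than this.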
You can connect with John Kim directly for more on this service.