No API? Use ScraperWiki to Add One

Tomas Vitvar
Aug. 23 2010, 11:21PM EDT

One of the major purposes of a Web API is to expose structured content that you can use in your own app, perhaps to create a great mashup and share it with your friends. But what do you do if a popular app does not expose an API? If you're a developer, you write code that scrapes the app's content and transforms it into the format you need. It's admittedly murky legal territory, but ScraperWiki makes the process easier by providing a console and an API to access the data you collect.

Several issues can complicate a scraping task. For example, you often have to deal with unclean HTML. Although there are parsers and libraries that handle this nicely, finding the right tools and setting everything up takes time. You also need to figure out how often to run your scraper, implement caching, and host everything online somewhere. That is quite a lot of work for what may be a small task.
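To make the unclean-HTML problem concrete, here is a minimal sketch that uses only Python's standard library. It is not ScraperWiki code: the `TableRowParser` class, the `scrape_rows` helper, and the sample markup are all hypothetical names invented for illustration. The parser tolerates missing `</td>` and `</tr>` tags, which are common in hand-written pages.

```python
from html.parser import HTMLParser
import json


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def _close_cell(self):
        # Finish the current cell, if one is open.
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def _close_row(self):
        # Finish the current row, if one is open and non-empty.
        self._close_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()    # tolerate a missing </tr>
            self._row = []
        elif tag == "td":
            self._close_cell()   # tolerate a missing </td>
            if self._row is None:
                self._row = []
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "td":
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def close(self):
        super().close()
        self._close_row()        # flush anything left open at end of input


def scrape_rows(html):
    """Return the table cells found in html as a list of rows."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical sample page with several closing tags missing.
SAMPLE = "<table><tr><td>Alice<td>42<tr><td>Bob</td><td>17</table>"
print(json.dumps(scrape_rows(SAMPLE)))
```

Even this small case needs state tracking and recovery logic, and it still leaves out fetching, scheduling, and caching, which is the setup work a hosted platform takes off your hands.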

ScraperWiki Coding Window

ScraperWiki significantly simplifies the implementation and deployment of scrapers by giving you a platform on which you can code, test, run, and share your scrapers and the resulting datasets. ScraperWiki currently supports Python and PHP as coding languages and several source formats, including HTML, Excel files, and PDF. You can also use ScraperWiki's API to search for scrapers, serialize a scraper's results to major representations such as JSON and XML, and access ScraperWiki's datastore directly. ScraperWiki also has quite an impressive coding interface. Not only does it highlight your code, but it also lets you run your script and see its intermediate results, the data it scrapes, and the source URLs it uses.
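The idea of serializing scraped results to JSON and XML can be sketched with Python's standard library. This is an illustration of the concept only, not ScraperWiki's API: the `to_json` and `to_xml` helpers, the tag names, and the sample records are assumptions made up for this example.

```python
import json
import xml.etree.ElementTree as ET


def to_json(records):
    """Serialize a list of flat dicts to a JSON string."""
    return json.dumps(records, sort_keys=True)


def to_xml(records, root_tag="records", item_tag="record"):
    """Serialize a list of flat dicts to an XML string.

    Assumes every key is a valid XML element name.
    """
    root = ET.Element(root_tag)
    for record in records:
        item = ET.SubElement(root, item_tag)
        for key, value in record.items():
            ET.SubElement(item, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical scraped records.
RECORDS = [{"name": "Alice", "score": 42}, {"name": "Bob", "score": 17}]
print(to_json(RECORDS))
print(to_xml(RECORDS))
```

Keeping the scraped records as plain dictionaries until the last step means the same dataset can be offered in several representations without touching the scraper itself.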

Although ScraperWiki provides many useful features for screen scraping, other platforms can serve similar purposes. For example, Google Apps Script lets you write a script in JavaScript and run it inside Google Spreadsheets. You can access core Google services such as Contacts, Calendar, or Maps, make REST and SOAP API calls, parse HTML and XML, and use OAuth for authentication. Once your script has populated a sheet with data and you have made the sheet publicly available, you can access the data in a number of formats, including Atom, RSS, or HTML.

Screen scraping is an essential technique if you want to use Web app content that is meant for human eyes only. Apps without an API have perhaps not yet understood the value of structured content, or may have other business reasons for not providing their data via an API. You should check an app's terms and conditions before you scrape it. Assuming you're in the clear, ScraperWiki is an easy way to structure unstructured data.

