APIs are all the rage these days, and with good reason. Gone are the days when people relied mainly on desktop applications, web-based or native, to interact with data that lived on proprietary backends. Today, more often than not, the primary computing device is a cell phone or tablet. The apps on these mobile devices might be browser-based or native, and many consume data from more than one source. Data is coming from everywhere. How are applications accessing this landscape of distributed data sources? Typically via an API.
However, there is still a lot of data on web pages that's not accessible via an API (i.e., the website operator doesn't offer one), including data that's of interest to developers, business analysts, report writers, and other parties. And even in cases where the operator of the site offers an API, accessing it requires a degree of programming know-how that's not typical for most business folks. Yet they still need that data. What's to be done?
This is where using Web Page technology comes into play.
The Case for Using Web Page Adapters to Create Structured Data
Web pages have been publishing structured data since the time when JavaServer Pages (JSP), Active Server Pages (ASP), and Adobe's ColdFusion showed up on the scene. Using web pages to format and display data retrieved from a database was an inevitability, particularly when it came to publishing merchandise catalogs online.
However, as the web has matured, data-driven pages have become meaningful well beyond their consumption by human eyes. There is analytic inspection going on behind the scenes: think web crawlers and Search Engine Optimization (SEO) engines. People want that data for more than viewing purposes. Web page data is also useful for determining other things (e.g., context), provided you can get at it.
For example, imagine that you want a list of the dentists in the Los Angeles area within a certain zip code, along with the number of reviews each dentist has received. Typically you would go to a review site such as Yelp, search for dentists by city, and view the results. (See Figure 1.)
Figure 1: Many web pages have data that can be useful when applied to an abstract structure.
Browsing a Yelp page is fine if all you want to do is view the results. But what if you want to load that data into a spreadsheet, such as the one shown next in Figure 2?
Figure 2: Once data is extracted into an abstract structure, it can be applied to a spreadsheet.
Or what if you want to create a comparative graph of that data (see Figure 3)? To meet your graphing need without the help of tooling, you're in for a lot of cutting and pasting at best or, more likely, a lot of manual keyboard entry.
Figure 3: Spreadsheet data extracted from a web page can be easily converted to a chart.
Figure 4: A good extraction technology can convert data in a web page to JSON.
Getting structured data out of web pages — often referred to as "web scraping" — is a real need, particularly for people whose job it is to prepare and analyze the information that's presently available in web pages. Meeting this need is right up the alley of a data extraction tool, such as Import.io.
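To make the idea concrete, here is a minimal sketch of what turning listing markup into JSON can look like when done by hand. It uses only Python's standard library. The HTML snippet, its class names, the dentist names, and the review counts are all invented for illustration; they do not reflect Yelp's actual markup, and this is not how Import.io or any other extraction tool works internally.

```python
import json
from html.parser import HTMLParser

# Hypothetical listing markup, invented for this example. Real review
# sites use different and far more complex HTML.
SAMPLE_HTML = """
<ul>
  <li class="listing"><span class="name">Dentist A</span><span class="reviews">120 reviews</span></li>
  <li class="listing"><span class="name">Dentist B</span><span class="reviews">45 reviews</span></li>
</ul>
"""


class ListingParser(HTMLParser):
    """Collects a name and review count from each <li class="listing">."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._field = None  # the span we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "li" and cls == "listing":
            # A new listing starts: open a fresh record.
            self.rows.append({})
        elif tag == "span" and cls in ("name", "reviews") and self.rows:
            self._field = cls

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None

    def handle_data(self, data):
        if self._field == "name":
            self.rows[-1]["name"] = data.strip()
        elif self._field == "reviews":
            # Keep only the leading number, e.g. "120 reviews" becomes 120.
            self.rows[-1]["reviews"] = int(data.split()[0])


def extract_listings(html):
    """Return a list of {"name", "reviews"} dicts found in the markup."""
    parser = ListingParser()
    parser.feed(html)
    return parser.rows


if __name__ == "__main__":
    print(json.dumps(extract_listings(SAMPLE_HTML), indent=2))
```

Even for a toy page like this one, the parser has to know the page's exact structure, and it breaks as soon as the markup changes. That fragility is a large part of why dedicated extraction tools exist.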