Import.io allows you to ingest data from a web page and convert it into structured data that you can use in a spreadsheet or express as JSON. In fact, the scenario described above, in which data was extracted from a Yelp page of dentists and converted into a Google spreadsheet and an array of JSON objects, was done with Import.io. (The graph was made using Google Sheets' Insert Chart feature.)
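To make that output concrete, here is a minimal Python sketch of what such an extraction might yield and how the JSON objects flatten into spreadsheet-ready rows. The field names (`name`, `rating`, `review_count`) are hypothetical, not Import.io's actual output schema:

```python
import csv
import io
import json

# Hypothetical records, shaped the way an extractor might emit them.
# The field names here are illustrative only.
extracted = json.loads("""[
  {"name": "Smile Dental", "rating": 4.5, "review_count": 120},
  {"name": "Bright Teeth", "rating": 4.0, "review_count": 87}
]""")

# Flatten the JSON objects into CSV rows, ready for a spreadsheet import.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "rating", "review_count"])
writer.writeheader()
writer.writerows(extracted)
print(buffer.getvalue())
```

The same list of JSON objects feeds both targets: the CSV rows go to a spreadsheet, while the original array is already the JSON representation.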
Import.io has a lot to offer when it comes to data extraction. In this tutorial, you'll learn:
- How Import.io can be "trained" to determine various fields of data on a web page.
- How to organize fields you extract into a consistent data structure.
- How to make these data structures available for display in Google Sheets and within a structured JSON object, which can in turn be made available as an API.
- How, in an advanced section at the end of this piece, to aggregate data from a variety of similar websites into a single, normalized data structure using the Import.io API, and then consume that API with a custom application coded especially for this article.
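As a preview of that last point, consuming an extractor's results as JSON usually boils down to requesting a URL. The endpoint, extractor ID, and API-key parameter below are all placeholders rather than Import.io's documented API; the sketch only shows the general shape of such a call:

```python
from urllib.parse import urlencode

# Placeholder endpoint: the real base URL, extractor ID, and key come
# from your own account, so treat everything here as hypothetical.
BASE = "https://data.example.com/extractor"

def results_url(extractor_id: str, api_key: str) -> str:
    """Build a URL that asks a (hypothetical) extraction API for JSON."""
    query = urlencode({"_apikey": api_key, "format": "json"})
    return f"{BASE}/{extractor_id}/results?{query}"

url = results_url("jobs-demo", "SECRET")
print(url)
# A real client would now fetch this URL (e.g. with urllib.request)
# and parse the response body with json.loads().
```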
To get the full benefit from reading the piece, I'm assuming that you know how to work with structured data and tables. I'm also assuming you have a working knowledge of Google Sheets and that you understand JSON.
Let's start by taking a look at Import.io.
As mentioned previously, Import.io is a service that allows you to convert the information found on web pages into structured data. The service works on a try-before-you-buy basis. To get started, you'll create an account and then access your account by way of the Import.io Dashboard. (Figure 5 next shows the Import.io home page. Notice the Dashboard button in the upper right of the web page.)
Figure 5: Import.io is a platform that allows data to be extracted from a web page in a structured manner.
The Dashboard is where you'll do the work of creating, configuring, and editing extractors. An extractor is the technology that allows you to scrape information off a web page and into a data structure that meets your needs.
As just described, an extractor is an Import.io technology that can inspect a web page and determine the underlying data structures within the page. In most cases, an Import.io extractor can infer a web page's data structures automatically. (See Figure 6.)
Figure 6: An Import.io extractor can determine the data structures within a web page.
However, there are times when you'll want to train and configure an extractor to structure data in a way that's specific to a given need. Customizing an Import.io extractor is a straightforward process that we'll cover soon. But first, let's use Import.io to create a simple extractor. Once the extractor is created, you'll customize its columns to better describe the data at hand, and then use the Export to Google Sheets feature to display the extracted data.
Creating an Extractor
You're going to create an extractor that describes a list of jobs from a jobs site. You're going to create the extractor against the following URL (which lists a bunch of jobs): http://www.jobserve.com/us/en/JobListing.aspx?shid=414ABF05F8664A7B5A
Figure 7 shows the web page associated with the URL. Just by looking at the user interface and scanning from one job listing to the next, you can assume that a fairly standard data structure lives beneath the surface:
Figure 7: A web page well suited for Import.io extraction.
To create an extractor, go to the Import.io dashboard and click the New Extractor button, located on the upper-left side as shown next in Figure 8:
Figure 8: You create a new extractor by clicking the New Extractor button in the Import.io Dashboard.
Clicking the button presents a dialog into which you enter the URL of the web page from which you want to extract data. Enter the URL and then click the Go button on the lower right of the dialog. (Please see Figure 9, next.)
Figure 9: Enter the URL and click the button labeled Go.
After you click Go, Import.io visits the URL and analyzes the web page to determine what data is on it, and to infer the structure of that data. While this analysis is running, you're presented with the message shown next in Figure 10.
Figure 10: Import.io technology does a lot of work behind the scenes figuring out the data in a web page
Upon finishing the web page analysis, Import.io presents the data extracted from the web page in tabular format, as shown next in Figure 11.
Figure 11: Import.io has automation that inspects a web page to determine the data structures within the page.
As you've seen in Figure 11, the extraction intelligence in Import.io discovers the data fields that exist on the web page being inspected. However, just because Import.io retrieves all the data on a page, it does not follow that all of the extracted data is necessary. You might also find that the names Import.io assigns to each field are not to your liking. Don't fret: Import.io lets you customize field names, and format the data displayed in each field, to meet your needs.
Customizing an Extractor
As you saw previously in Figure 11, Import.io structured all the data on the Job Listings web page and displayed its findings accordingly. In fact, Import.io is displaying more data than you need. Let's say that the only information of interest on the web page is Job Title, Location, and Description. Can you adjust the extractor to be that specific? Absolutely.
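The trimming you're about to do happens in the extractor's UI, but its effect on the data is easy to show client-side. Assuming hypothetical field names for the extracted job records, keeping only the three columns of interest looks like this:

```python
# Columns we want to keep; the field names are hypothetical stand-ins
# for whatever the extractor actually names its columns.
KEEP = ("job_title", "location", "description")

# One sample record as an extractor might emit it, with extra fields.
raw_rows = [
    {"job_title": "Java Developer", "location": "Austin, TX",
     "description": "Build services.", "posted": "today", "ref": "J123"},
]

# Drop every field that isn't in KEEP.
trimmed = [{k: row[k] for k in KEEP if k in row} for row in raw_rows]
print(trimmed)
```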