How to Turn Existing Web Pages Into RESTful APIs With Import.io


Job-Track-O-Tron has three components: the Import.io API, the Proxy-Adapter API, and the Proxy-Client. Table 1, shown next, describes each component of the application.

Import.io API: Emits a list of extractors and provides the JSON data for each extractor based on its unique identifier. It requires the Import.io API key for access.

Proxy-Adapter API: Gets extractor information from the Import.io API and transforms the JSON returned from the Import.io API into a simpler JSON format that's easier for the Proxy-Client to consume.

Proxy-Client: Publishes the Single Page Application (SPA), written in Angular, that displays data retrieved from the Proxy-Adapter API.
Table 1: The components that make up the Job-Track-O-Tron application.


Figure 27 illustrates how the components described in Table 1 interact with each other. Notice that your Proxy-Adapter API uses data retrieved from the Import.io API. Also, notice that the Proxy-Adapter API publishes an endpoint that's used by the Proxy-Client. The purpose of the Proxy-Client is to publish an SPA that displays information retrieved by the Proxy-Adapter API.

Figure 27: This article's sample application is made up of three components.


How Aggregation is Implemented

The purpose of the Proxy-Adapter API is to encapsulate your API key and use it to access your extractor data that resides in Import.io. One key function of the Proxy-Adapter API is that it retrieves extractor data via the Import.io API so that you can present the list of extractors for the end user to pick from (as a means of filtering). In that sense, you can think of the Proxy-Adapter API as a "pass through" to the Import.io API. However, the Proxy-Adapter API provides another service: it can aggregate data from all your extractors into a single feed. In this case, the aggregation is a list of all jobs listed in all the extractors.

The way that the Proxy-Adapter API implements the aggregation of job listings from extractors into a master list is by applying a naming convention to the name (description) of an extractor. (When you name an extractor in the Import.io website, you're actually setting the value on the extractor's description attribute in terms of the JSON emitted by the API.) In other words, instead of making it so that the Proxy-Adapter API aggregates all the data from all the extractors under your Import.io account, you add a special suffix to an extractor's name (description) as a way to tell the Proxy-Adapter API to "aggregate only those with the suffix." The special suffix you use is x-aggregate. This suffix is meaningful only to the Proxy-Adapter API.
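To make the convention concrete, here's a minimal sketch (not taken from the Job-Track-O-Tron source; the helper name isAggregated is hypothetical) of how such a suffix check on an extractor's description might look in Node.js:

// Returns true when an extractor's description ends with the x-aggregate suffix.
function isAggregated(description) {
    return /(x-aggregate)$/.test(description);
}

console.log(isAggregated('dice.com x-aggregate')); // true  -> include in the master Job List
console.log(isAggregated('dice.com'));             // false -> leave out of the aggregation

Listing 1, later in this section, shows the actual logic the Proxy-Adapter API uses.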

For example, let's say you have an extractor that is labeled dice.com as shown in Figure 28.

Figure 28: Add the x-aggregate suffix to those Job Sites that you want aggregated into the master Job List.


If you want the information in that extractor to be aggregated into the master Job List, you change the description of the extractor to dice.com x-aggregate. When it comes time to create an array (list) of all jobs, the Proxy-Adapter API gets all of your extractors by making a request to the Import.io API. Then, the Proxy-Adapter API code applies a regular expression to the description attribute of each JSON object returned. If an extractor's description ends with the characters x-aggregate, the Proxy-Adapter API makes a call back to the Import.io API to get the data associated with that extractor, and that extractor's data is added to the master list of all jobs. If you do not add the x-aggregate suffix to the extractor name, the Proxy-Adapter API ignores the extractor; its data is not added to the master job list. Listing 1 (next) shows the Node.js code that implements this logic.

const _ = require('lodash');

// getExtractors() and getExtractor() are helper functions defined elsewhere in the
// Proxy-Adapter API source; they wrap the calls made to the Import.io API.
module.exports.aggregateExtractors = function aggregateExtractors(refresh) {
    // Get the list of all extractors associated with the configured API key
    return getExtractors(null, refresh)
        .then(result => {
            // Keep only the extractors whose description ends with the x-aggregate suffix
            const validObjs = _.filter(result, r => r.description.match(/(x-aggregate)$/g));

            // Start a request for the job data of each eligible extractor
            const getters = _.map(validObjs, r => getExtractor(r.id));

            // Wait for all of the requests to resolve
            return Promise.all(getters);
        })
        .then(result => {
            // Flatten the per-extractor arrays into a single master list of jobs
            return _.flatten(result);
        });
};

Listing 1: The Proxy-Adapter API, written in Node.js, inspects an extractor's description attribute for the x-aggregate suffix, thereby identifying the extractor as part of the master Job List.
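Listing 1 relies on two helper functions, getExtractors and getExtractor, that are defined elsewhere in the Proxy-Adapter API source. The sketch below shows roughly what such helpers might look like; the endpoint URLs are placeholders and the use of Node's global fetch (Node 18+) is an assumption for illustration, so consult the Import.io API documentation and the actual repository for the real details:

// Hypothetical helpers; the URLs below are placeholders, not the documented Import.io endpoints.
const IMPORT_IO_KEY = process.env.IMPORT_IO_KEY;

// Gets the list of extractors associated with the API key. The query and refresh
// parameters mirror the call in Listing 1; this sketch simply ignores them.
function getExtractors(query, refresh) {
    return fetch(`https://example-import-io-host/extractors?_apikey=${IMPORT_IO_KEY}`)
        .then(res => res.json());
}

// Gets the latest extracted data for a single extractor, identified by its unique id.
function getExtractor(id) {
    return fetch(`https://example-import-io-host/extractor/${id}/json/latest?_apikey=${IMPORT_IO_KEY}`)
        .then(res => res.json());
}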

Working with the Proxy-Adapter API

As mentioned above, the Proxy-Adapter API works with the Import.io API to get extractor information. The Proxy-Adapter API publishes a single endpoint, /api, that allows you to get a list of all your extractors, the data for a single extractor, or an aggregation of the data from all eligible extractors.

The /api endpoint provides a variety of parameter options that allow you to retrieve the information you need. Table 2 describes these various options.

/api: Returns a JSON array that describes all extractors associated with the given API key.

/api/:id: Returns the extractor's JSON, which includes listing data and descriptive metadata about the extractor, according to the unique identifier of the given extractor.

/api/:id?refresh=true: Makes the Proxy-Adapter API force Import.io to refresh an extractor's data before returning it.

/api?aggregation=true: Returns an array of JSON objects that aggregates all data from extractors named with the x-aggregate suffix.
Table 2: The various parameter options for using the Proxy-Adapter.
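To illustrate, here's a brief, hypothetical example of exercising those options from Node.js (18+, where fetch is available globally); the base URL is an assumption that depends on where you run the Proxy-Adapter API:

// Hypothetical client calls against the Proxy-Adapter API's /api endpoint.
const BASE_URL = 'http://localhost:3000'; // assumed host and port

async function demo() {
    // Get the list of all extractors for the configured API key
    const extractors = await fetch(`${BASE_URL}/api`).then(res => res.json());

    // Get one extractor's data, forcing Import.io to refresh it first
    const jobs = await fetch(`${BASE_URL}/api/${extractors[0].id}?refresh=true`).then(res => res.json());

    // Get the master Job List aggregated from all x-aggregate extractors
    const allJobs = await fetch(`${BASE_URL}/api?aggregation=true`).then(res => res.json());

    console.log(extractors.length, jobs.length, allJobs.length);
}

demo();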


The Proxy-Client uses the Proxy-Adapter API exclusively to get the data it needs to display.

Listing 2 (next) shows an example of JSON from the array of All Job Listings. All Job Listings is an aggregation of all jobs from all extractors named with the suffix x-aggregate.

{
    "Title": [{
        "text": "Senior Design Engineer",
        "href": "http://www.jobserve.com/us/en/search-jobs-in-Lake-Forest,-California,-USA/SENIOR-DESIGN-ENGINEER-9843F7E77995B5FC/?utm_source=50&utm_medium=job&utm_content=1&utm_campaign=JobSearchLanding"
    }],
    "City": [{
        "text": "Lake Forest"
    }],
    "State": [{
        "text": "California"
    }],
    "Description": [{
        "text": "A very exciting worldwide prosthetic solutions medical device company that is looking for a Sr. Design Engineer with a mechanical engineering background. This will be a 8-month contract to hire position on site in Orange County. This person must be an..."
    }]
}

Listing 2: A row of custom extractor data in JSON format.
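Notice that each attribute in Listing 2 is an array of objects that carry a text property (and, in the case of Title, an href as well). A client consuming this JSON needs to reach into those arrays. Here's a minimal, hypothetical sketch of how that might look; the helper name toDisplayRow is an assumption and is not part of the Proxy-Client source:

// Flattens one extractor row (shaped as in Listing 2) into a simple object for display.
function toDisplayRow(row) {
    const first = (arr) => (Array.isArray(arr) && arr.length ? arr[0] : {});
    return {
        title: first(row.Title).text,
        link: first(row.Title).href,
        city: first(row.City).text,
        state: first(row.State).text,
        description: first(row.Description).text
    };
}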

Making the Import.io API Key Known to the Proxy-Adapter API

As mentioned above, a central concern of the Proxy-Adapter API is to encapsulate your API key and restrict public knowledge of that key. Because the key is treated as a secret, you want to minimize the chances of a hacker breaking into the source code for the Proxy-Adapter API and seeing the API key in clear text. Only the Proxy-Adapter API knows your API key, and only the Proxy-Adapter API passes that information on to the Import.io API.

The way you make your Import.io API key known only to the Proxy-Adapter API is by setting an environment variable named IMPORT_IO_KEY in the server process that runs the Proxy-Adapter API code. Thus, the environment variable IMPORT_IO_KEY is associated with your particular API key. You set the environment variable according to the method prescribed by the operating system of the computer running the Proxy-Adapter API.
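For example, here's a minimal sketch, assuming the Proxy-Adapter API runs under Node.js, of how the code might read the key from the environment and fail fast when it's missing (this is illustrative, not the actual Job-Track-O-Tron source):

// Read the Import.io API key from the environment; never hard-code it in source.
const IMPORT_IO_KEY = process.env.IMPORT_IO_KEY;

if (!IMPORT_IO_KEY) {
    // Refuse to start without the key so a default value can never leak into responses.
    throw new Error('Set the IMPORT_IO_KEY environment variable before starting the Proxy-Adapter API.');
}

On Linux or macOS you might start the process with something like IMPORT_IO_KEY=your-key node server.js (the entry-point filename here is hypothetical); on Windows you'd set the variable through setx or the System Properties dialog.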

Applying Aggregation

Having the ability to aggregate all data from a variety of extractors into a single list is a powerful feature of the Proxy-Adapter API; however, there is a limitation. Every extractor that you want to aggregate into a master list must adhere to the same data structure. In other words, if you want your aggregation to publish the attributes Title, Description, City and State for all job listings, then each extractor must be configured to have the columns Title, Description, City and State. (Please see Figure 29.)

Figure 29: You need to adhere to a standard set of columns if you want to aggregate data from a variety of extractors.


Nothing bad will happen should an extractor have additional columns; however, having a consistent set of columns will make things a lot easier when it comes time for a client to consume the aggregation of all the extractors.

Let's move on to the Proxy Client.

Working with the Proxy Client

The purpose of the Proxy Client is to publish the SPA that displays the various lists of information that you can get from the Proxy-Adapter. As mentioned above, the Proxy-Client (source code) is written in Angular.

The Proxy Client is the front end to the system, so when you add an extractor using the Dashboard in Import.io, that information will appear automatically in the Proxy Client, provided that both the Proxy Client website and the Proxy-Adapter API are up and running. And when you apply the x-aggregate suffix to an extractor's name, that extractor's data will show up when it comes time to "Show All Jobs." (Please see Figure 30.)

Figure 30: The Proxy-Client uses an SPA written in Angular to display extractor data.


Putting It All Together

We've covered a lot in this article. I showed you how to use Import.io to extract data from a web page automatically using an extractor. I showed you how to use Import.io's edit feature to customize the columns in an extractor and how to train those columns as part of the editing process. Also, I showed you how to customize the display of data in a column using a regular expression and a default value.

I showed you the sample application we created for this article, Job-Track-O-Tron. I gave you an introductory explanation of the components of the application. I explained how the Proxy-Adapter API interacts with the Import.io API to get extractor data. Also, I explained at a conceptual level how the Proxy-Adapter API identifies extractors that are to be used for aggregation. Finally, I gave you an overview of the Proxy-Client component and how it uses an SPA written in Angular to display the Job Listing information contained in the various extractors.

Granted, we covered a lot, but we only scratched the surface. The Import.io service by itself is a very big product. It has many capabilities that require some time and attention to master. But each time you learn something new, you'll see an immediate benefit. It took me only 10 minutes to learn to generate my first extractor automatically. Learning customization took a little longer. But the company offers many learning aids, as well as sales and support staff who will go to great lengths to get you up and running with the product.

As far as the sample app, Job-Track-O-Tron, goes, you need to learn a lot of little details to gain a full understanding of the code and to use the example to your benefit. There is not a lot of code in play, but despite the brevity, the code does a lot. Take a look at the code. The documentation is informative and provides a good example of how to work with the Import.io API.

Also, take a look at the screencasts provided here that show you how to create an extractor and then customize it to suit a special need. In this case, you customized extractors to make them compatible for use in the sample job listing application. Of course, you can extend the customization technique to meet your own needs.

Import.io fills a real need in the world of modern data collection. The UI is solid and the API is broad. For those of us who work with APIs for a living, the documentation is informative. You will have to experiment with the product to get the hang of things. Also, remember that Import.io does a lot of inspection, inferencing, and calculation behind the scenes, so you'll want to be aware of latency and performance constraints as you get accustomed to using the API. Still, overall, Import.io is a solid tool for data extraction, and the fact that it can be used in a variety of ways to produce data in a variety of formats only enhances its power. So the next time you find a page that has data you want but can't use in a structured manner, consider Import.io as a tool to use.

As we completed our testing, one major caveat of Import.io revealed itself: it had no built-in way to deal with paginated data. In other words, if a job site can display only 15 jobs at a time out of a total of 500 job listings (and the remaining 485 listings are viewable on subsequent pages), Import.io cannot detect the pagination and extract data from those subsequent pages. The problem revealed itself in two scenarios: the first involving websites where each page in the paginated series has a different URL, and the second involving websites where an unchanging URL has the user move from page to page.

One brute-force solution to the problem is to exercise the target site's option to display all the listings on one page, but sites that offer this are few and far between. After we suggested to Import.io that the ability to deal with paginated data was the most significant opportunity for improvement, the company updated the service in a way that deals with the first scenario. The feature worked as advertised in some light testing by ProgrammableWeb, but further testing is required to ensure that it works across a wide variety of sites.

According to Import.io product manager Nicolas Mondada, the company apparently plans to address the second scenario with an update to the service that's due in early July.

Getting the Code

You can get the code for the Proxy-Adapter API and the Proxy Client on GitHub:

Bob Reselman is a nationally known software developer, system architect and technical writer/journalist. Over a career that spans 30 years, Bob has worked for companies such as Gateway, Cap Gemini, The Los Angeles Weekly, Edmunds.com and the Academy of Recording Arts and Sciences, to name a few. Bob has written 4 books on computer programming and dozens of articles about topics related to software development technologies and techniques, as well as the culture of software development.
 
