Introduction to scrapestack's Real-time, Scalable Proxy & Web Scraping REST API

Welcome to today's tutorial on scrapestack, a powerful, real-time web scraping API service used by more than 2,000 companies. And, it's free to scrape up to 10,000 pages per month before a paid plan is required.

If you're not familiar with the term web scraping, it means downloading the web page at a given URL just as a browser would, except that you capture the page's HTML source code rather than viewing the rendered result.

Why would you scrape the web? There is a wide variety of uses such as identifying breaking news, aggregating content, data collection for machine learning, market analysis, SEO management and data extraction, and more.

Home page of apilayer's scrapestack real-time scalable proxy and web scraping REST API, handles CAPTCHAs and JavaScript rendering.


Not only is scrapestack fast and easy to get started with, but it also has advanced features that ensure consistently high-quality scraping results.

  • JavaScript Rendering: Since many websites load content dynamically, scrapestack can simulate an actual user browsing the page to deliver the dynamic content.
  • CAPTCHAs: scrapestack can dynamically respond to CAPTCHAs to scrape underlying content.
  • Global web scraping: scrapestack can deliver pages as if you browsed a webpage from any area of the world.
  • Proxy IP addresses: proxies provide more successful scraping, geolocated web scraping and virtual anonymity.

And, scrapestack is highly scalable, capable of serving millions of page requests per day.

Since it's a REST API, scrapestack can be used from any programming language. To help you get started quickly, it includes coding examples for PHP, Python, Node.js, jQuery, Go and Ruby.

And, there is a status page where you can review the nearly 100% uptime of the service.

The company behind scrapestack is apilayer. This is my eighth article about its services, and I'm a big fan of their simplicity and accessibility. apilayer's products share similar pricing models, sign-up flows, quickstart guides and clearly structured documentation.

Here are a few of apilayer's services I've written about previously at ProgrammableWeb:

If you've used any of its services before, starting with the scrapestack API will feel familiar. The documentation and REST API structure are consistent across its suite of products, most of which are perfect for strengthening your application or website.

If you're a startup or independent developer, apilayer always offers a generous free plan with each of its services, as it does with the scrapestack API.

Let's start scraping the web!

Getting Started with the scrapestack API

To begin exploring scrapestack, you'll need to register for an account. The free account level is great for exploring the API and initial usage.

Get Your Free scrapestack API Key

Screenshot of scrapestack pricing page. Sign up for free, or choose from four paid plans: Basic, Professional, Business or Enterprise.



The Free plan includes 10,000 page scraping requests per month using standard proxy servers. The paid plans add more advanced features, which we'll describe further below.

By paying yearly, you can save 20 percent on any plan (except, of course, the free plan).

The Sign Up Form

Once you've chosen a plan you'll be asked to complete a Sign Up form. It's very straightforward:

scrapestack sign up and registration form. The example shown is the free plan with email, password, name and address information. The image shows the below the scroll portion of the form for Company Details and Google CAPTCHA at the right.


Once you click Sign Up, a welcome email with links to the documentation will arrive in your inbox.

The welcome email from scrapestack real time scalable web scraping api includes a link to API documentation and their customer support email.


Let's check out the Dashboard, which customers of other apilayer services will immediately recognize. If you use one apilayer service, getting started with any other is quite simple.

The scrapestack API Dashboard

scrapestack's API dashboard provides your API key and a simple 3 step quickstart guide:

Step 1 - Your API Access Key

Your API access key provides access to the scrapestack API and must be included as a parameter in every call. You can also reset the key whenever you wish to secure a new one.

The scrapestack API 3-Step Quickstart Guide for getting started using real-time web scraping API.


Step 2 - Make Your First API Request

Now, let's try scraping our first web page with scrapestack.

scrapestack API Quickstart Make API Request screenshot. It shows the parameters for a REST API request to the service.


To begin, you can try accessing the following URL without optional parameters. You will need to replace the letter x's below with your API access key.

https://api.scrapestack.com/scrape?access_key=xxxx&url=https://apple.com

In my Safari and Opera browsers, the above request returns a partially rendered HTML page from Apple.com, but by switching to View Source, I could see the HTML returned by scrapestack:

scrapestack API example of scraping the HTML from Apple.com. On the left is the Opera presentation of partially rendered HTML. On the right is the view source HTML code.


In the developer version of Safari, I clicked Develop => Show Page Source. In Opera, I found that adding the "view-source:" prefix before my API call takes me straight to the HTML that scrapestack returns from the site I'm requesting. Just paste in the line as shown below to land in the HTML source:

view-source:https://api.scrapestack.com/scrape?access_key=xxx&url=https://www.nytimes.com/2019/10/01/us/politics/trump-impeachment-pompeo.html

This may not work in browsers other than Opera.

Typically, you'll be using the scrapestack API programmatically and won't encounter any of the partially rendered visual elements. And you'll likely be using scrapestack from a backend server.

However, if you commonly use these APIs from JavaScript frontends, it's not a bad idea to change your access key on a regular schedule. You can reset your key from the account dashboard by clicking the black reset button beside your API key.

Step 3 - Integrate into Your Application

To finish the Quickstart and move on, let's dive into the API more closely.

Screenshot of last step of the scrapestack API Quickstart, Integrate into your application.


Scrapestack provides coding examples for six languages: PHP, Python, Node.js, jQuery, Go and Ruby. To take a look at how you might use scrapestack from code, here's a Python example:

import requests

# Query parameters for the scrapestack endpoint: your access key
# and the URL of the page to scrape
params = {
  'access_key': 'YOUR_ACCESS_KEY',
  'url': 'http://scrapestack.com'
}

# Request the page and print the scraped HTML source
api_result = requests.get('http://api.scrapestack.com/scrape', params)
website_content = api_result.content
print(website_content)

I'll show more of these further below.

As I mentioned earlier, this Python example shows a programmatic, server-based scenario where the page is scraped as HTML and printed as source code; it's never rendered in a browser.

Beyond the basics, scrapestack offers a number of important advanced features for building a more powerful and reliable web scraping engine. Let's take a look at these.

Using scrapestack's Advanced Features

It's been a long time since most websites were powered by fairly simple static HTML, and scraping is not as simple as it once was. Many websites use JavaScript to display dynamic content based on location or IP address, and many employ built-in protections to control who can see which kinds of content. scrapestack's advanced features provide powerful assistance for working past these barriers and scraping more complex, dynamic websites.

JavaScript Rendering

scrapestack is capable of scraping pages with dynamic content rendered by JavaScript after the page loads. Just set the render_js parameter to 1 to enable JavaScript rendering as shown below:

https://api.scrapestack.com/scrape?access_key=xxxx&url=https://apple.com&render_js=1

For example, if someone has a Twitter timeline widget on their blog's sidebar, a regular scraper might turn up the empty DIVs, but using JavaScript Rendering, you would see their latest tweets.

The JavaScript Rendering feature requires the Basic plan or higher.
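In code, render_js is just another query parameter. Here's a minimal sketch using only Python's standard library; the helper function name and the commented-out fetch are my own illustrations, not part of any scrapestack SDK:

```python
from urllib.parse import urlencode
from urllib.request import urlopen

API_ENDPOINT = 'https://api.scrapestack.com/scrape'

def build_request_url(access_key, url, render_js=False):
    # render_js=1 asks scrapestack to execute the page's JavaScript
    # before returning the HTML (requires the Basic plan or higher).
    params = {'access_key': access_key, 'url': url}
    if render_js:
        params['render_js'] = 1
    return API_ENDPOINT + '?' + urlencode(params)

# With a real access key, fetch the fully rendered page:
# html = urlopen(build_request_url('YOUR_ACCESS_KEY',
#                                  'https://apple.com', render_js=True)).read()
```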

HTTP Headers

Websites today employ increasingly complex methods to block bots and hackers. A basic web scraper will often get caught up by these tools. By using scrapestack's HTTP headers, you can configure some common request arguments that successfully bypass website validity tests.

You can't use URL parameters to submit HTTP headers. Instead, you can make a curl request with a header string as long as you include keep_headers=1 in your URL request.

curl --header "X-AnyHeader: Test" "https://api.scrapestack.com/scrape?access_key=xxx&url=https://apple.com&keep_headers=1"

Here are some sample headers to include in your requests for best results scraping web pages:

1. User-Agent which provides information to the website about your computer and web browser as well as the renderer that the browser uses. For example:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.1 Safari/605.1.15

2. Accept-Language tells the website your language of choice. For example:

Accept-Language: en-US;

3. Accept-Encoding tells the website which compression algorithms your browser supports. For example:

Accept-Encoding: deflate, gzip;q=1.0, *;q=0.5

4. Accept reports which formats your browser can accept. For example:

Accept: text/html

5. Referer provides the URL of the page from which the request arrived. For example:

Referer: https://scrapestack.com/documentation

Use of HTTP headers is available on all plans. scrapestack does not support the content-encoding and content-length headers.
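The same keep_headers approach can be sketched in Python with the standard library. The header values here are just the samples above, and the helper function is my own naming, not part of any scrapestack SDK:

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

def build_headers_request(access_key, url):
    # keep_headers=1 tells scrapestack to forward these request headers
    # to the target website.
    query = urlencode({'access_key': access_key, 'url': url, 'keep_headers': 1})
    headers = {
        'User-Agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) '
                       'AppleWebKit/605.1.15 (KHTML, like Gecko) '
                       'Version/13.0.1 Safari/605.1.15'),
        'Accept-Language': 'en-US',
        'Accept-Encoding': 'deflate, gzip;q=1.0, *;q=0.5',
        'Accept': 'text/html',
        'Referer': 'https://scrapestack.com/documentation',
    }
    return Request('https://api.scrapestack.com/scrape?' + query, headers=headers)

# html = urlopen(build_headers_request('YOUR_ACCESS_KEY', 'https://apple.com')).read()
```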

Using Proxies for Web Scraping

scrapestack provides two different features for using proxy servers.

Location Based Proxy

The proxy_location parameter allows you to scrape a web page using a proxy web server in any of 77 specific countries. Here's an example for France:

https://api.scrapestack.com/scrape?access_key=xxx&url=https://google.com...

In my rudimentary tests of common websites, I did not see location-based differences in the scraped results, even with render_js. However, certain websites will differ greatly depending on where they are scraped from.

Usage of these location-based standard proxy servers is available in the Basic plan and above.
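In code, proxy_location is appended like any other query parameter. Here's a standard-library Python sketch; I'm assuming the two-letter country code fr for France, matching the dk code scrapestack uses for Denmark:

```python
from urllib.parse import urlencode
from urllib.request import urlopen

def build_proxy_url(access_key, url, proxy_location=None):
    # proxy_location takes a two-letter country code; scrapestack then
    # serves the request from a proxy in that country (Basic plan or above).
    params = {'access_key': access_key, 'url': url}
    if proxy_location:
        params['proxy_location'] = proxy_location
    return 'https://api.scrapestack.com/scrape?' + urlencode(params)

# Scrape Google as seen from France:
# html = urlopen(build_proxy_url('YOUR_ACCESS_KEY', 'https://google.com', 'fr')).read()
```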

scrapestack also offers more exclusive proxy servers for mission-critical web scraping; these are called Premium Proxy servers.

Premium Proxy Servers

When using scrapestack's standard proxy servers, it is possible your attempts at scraping could be blocked. Some websites routinely block scrapers if they suspect they are being deployed inappropriately. The IP addresses regularly used by the location proxies may be known to the sites you are scraping.

scrapestack offers premium proxy servers which are actual residential proxy servers with real residential IP addresses. They are much less likely to be blocked. Premium Proxy servers are available in 38 geolocations.

Here's an example, the premium_proxy parameter is set to 1 and proxy_location Denmark:

https://api.scrapestack.com/scrape?access_key=xxx&url=https://slashdot.com&premium_proxy=1&proxy_location=dk

Use of the premium_proxy feature is restricted to the Professional plan or higher levels. And, each premium proxy request is charged as 25 API requests, whereas all other requests are charged as one.
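In code, premium_proxy is again just a query parameter. A small sketch mirroring the Denmark example above (the helper function is my own illustration):

```python
from urllib.parse import urlencode

def build_premium_url(access_key, url, proxy_location):
    # premium_proxy=1 routes the request through a residential IP address
    # (Professional plan or higher). Note: each premium request is
    # billed as 25 API requests against your monthly quota.
    params = {
        'access_key': access_key,
        'url': url,
        'premium_proxy': 1,
        'proxy_location': proxy_location,
    }
    return 'https://api.scrapestack.com/scrape?' + urlencode(params)

# Slashdot via a Danish residential proxy:
print(build_premium_url('YOUR_ACCESS_KEY', 'https://slashdot.com', 'dk'))
```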

HTTP POST/PUT Requests

scrapestack also supports POST and PUT requests. For example, let's say you need to log in or submit information to reach the page you want to scrape. Here's a curl example of submitting form data using POST:

curl -X POST \
  -H 'Content-Type: application/x-www-form-urlencoded' \
  -d 'username=reifman_abc' \
  -d 'password=writer_7!' \
  "https://api.scrapestack.com/scrape?access_key=xxx&url=https://google.com/login"

I'm not exactly sure why you would use PUT for web scraping.

HTTP POST/PUT is available on all plans.
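For completeness, here's how the same form POST might look from Python's standard library. The credentials are the placeholder values from the curl example, and the helper name is my own:

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

def build_post_request(access_key, url, form_fields):
    # The form fields are urlencoded into the POST body; scrapestack
    # forwards them to the target URL.
    query = urlencode({'access_key': access_key, 'url': url})
    return Request('https://api.scrapestack.com/scrape?' + query,
                   data=urlencode(form_fields).encode('utf-8'),
                   headers={'Content-Type': 'application/x-www-form-urlencoded'},
                   method='POST')

# req = build_post_request('YOUR_ACCESS_KEY', 'https://google.com/login',
#                          {'username': 'reifman_abc', 'password': 'writer_7!'})
# html = urlopen(req).read()
```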

Programming Language Examples

As I mentioned earlier, scrapestack provides programming examples for six of the most popular languages: PHP, Python, Node.js, jQuery, Go and Ruby.

Here's an example for jQuery:

$.get('https://api.scrapestack.com/scrape',
  {
    access_key: 'YOUR_ACCESS_KEY',
    url: 'http://scrapestack.com'
  },
  function(websiteContent) {
    console.log(websiteContent);
  }
);

And, here's an example in PHP using cURL:

<?php
$queryString = http_build_query([
  'access_key' => 'YOUR_ACCESS_KEY',
  'url' => 'http://scrapestack.com',
]);
$ch = curl_init(sprintf('%s?%s', 'http://api.scrapestack.com/scrape', $queryString));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$website_content = curl_exec($ch);
curl_close($ch);
echo $website_content;
?>

And, here's an example using Go:

package main
import (
  "fmt"
  "io/ioutil"
  "net/http"
)
func main() {
  httpClient := http.Client{}
  req, err := http.NewRequest("GET", "http://api.scrapestack.com/scrape", nil)
  if err != nil {
    panic(err)
  }
  q := req.URL.Query()
  q.Add("access_key", "YOUR_ACCESS_KEY")
  q.Add("url", "http://scrapestack.com")
  req.URL.RawQuery = q.Encode()
  res, err := httpClient.Do(req)
  if err != nil {
    panic(err)
  }
  defer res.Body.Close()
  if res.StatusCode == http.StatusOK {
    bodyBytes, err := ioutil.ReadAll(res.Body)
    if err != nil {
        panic(err)
    }
    websiteContent := string(bodyBytes)
    fmt.Println(websiteContent)
  }
}

These examples make it easy to quickly integrate scrapestack into your development platform. Making it easy to get started with an API service is one of apilayer's specialties.

Let's talk about usage levels and your account.

Upgrading Your Account

scrapestack is a subscription-based service and your chosen plan renews automatically each month. You can upgrade, downgrade or cancel anytime.

You might want to upgrade your account for any combination of these reasons:

  • You require HTTPS encryption for your REST API requests (Basic level or above)
  • You require concurrent requests (Basic level or above)
  • You require the JavaScript rendering feature (Basic level or above)
  • You require the proxy_location feature (Basic level or above)
  • You require premium_proxy usage (Professional level or above)
  • You require very high volume, scalability or custom features (contact scrapestack for a custom Enterprise-level solution and a price quote)

To make changes, visit your Subscription Plan page from the Dashboard:

scrapestack plan and upgrade or downgrade pricing plans for real-time web scraping API usage.


In the above image, you can see I'm on the Business plan but can downgrade to the other plans. Similarly, if I needed a custom enterprise solution, I can click the Request Quote button.

And, scrapestack provides a page where you can review your usage statistics for the current period and historically over time. Just visit your Dashboard and click API Usage (I've only recently started using my account, so the usage shown covers a brief timeframe):

The scrapestack Dashboard's API Usage page displays the number of web scraping API requests all time as well as a statistical daily log for your internal tracking.


You can use this log to help you decide whether to upgrade or downgrade subscription levels. For example, Free plans are allowed 10,000 calls per month, whereas Professional plans are allowed 1,000,000 and Business plans 3,000,000. If that isn't enough to meet your requirements, contact scrapestack for a custom enterprise plan.

In Conclusion

I hope you've enjoyed learning about web scraping and the scrapestack API. It's a simple REST API to get started with for basic scraping requirements, and it can also scale to enterprise-level concurrent, distributed proxy web scraping from any geolocation. It's incredibly powerful.

I enjoy writing for the folks behind scrapestack, the apilayer team, and sharing their new projects. They are skillful technologists who provide powerful services at affordable prices, with easy-to-integrate APIs and scalable performance and capacity.

Check out their suite of products and you'll likely find more that interest you.

apilayer and scrapestack appreciate your questions, comments and feedback. You can also follow them on Twitter @apilayer and the apilayer Facebook page.

About apilayer

scrapestack is the latest service from apilayer, an established leader in service APIs. It aims to help developers and businesses automate and outsource complex processes by serving them with dedicated and handy programming interfaces.

Two other apilayer products include weatherstack, the free real-time weather data and forecast API, and userstack, the free user agent lookup and device detection API, both of which I wrote about previously for ProgrammableWeb.

