In modern systems, APIs are a gateway to third-party data for almost any use you can think of. But before APIs became as popular as they are now, a common tactic for retrieving data from external resources such as web pages (in the absence of an RSS feed) was to scrape the page.
In general terms, a scraper is a program that parses the response from an HTTP request, extracts structured data from it, and performs some kind of action on that data. While they aren't used as often as they once were, scrapers still have their uses. In this tutorial on the Dataquest Blog, Vik Paruchuri explains how he used a Craigslist scraper and the Slack API to find an apartment in San Francisco’s notoriously competitive rental market.
After assessing the problem of finding and vetting apartments and getting in touch with the landlord before anybody else took the property, Paruchuri defined four distinct steps to help him find a new place.
Scraping Listings from Craigslist
Craigslist doesn’t offer an API, so the author used the python-craigslist package to scrape the content of the SF apartment listings page, with BeautifulSoup extracting the relevant data and giving it some structure. Each result is a single listing containing seven fields.
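The shape of a single result can be sketched as a plain dictionary. The python-craigslist call below is commented out because it needs network access, and the field names and sample values are illustrative rather than taken from the post:

```python
# Sketch of pulling listings with python-craigslist (the call shape follows
# that package's README; a live run needs network access):
# from craigslist import CraigslistHousing
# cl = CraigslistHousing(site='sfbay', area='sfc', category='apa',
#                        filters={'max_price': 2000})
# results = cl.get_results(sort_by='newest', geotagged=True, limit=20)

# One listing as a plain dict -- fields and values here are illustrative:
sample_result = {
    "id": "12345678",
    "name": "Sunny 1BR near park",
    "url": "https://sfbay.craigslist.org/sfc/apa/example.html",
    "datetime": "2017-01-12 15:09",
    "price": "$2000",
    "where": "mission district",
    "geotag": (37.76, -122.42),
}

def parse_price(price_str):
    """Turn a Craigslist price string like '$2,000' into an int."""
    return int(price_str.lstrip("$").replace(",", ""))

print(parse_price(sample_result["price"]))  # -> 2000
```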
Filtering the Results
Paruchuri had already earmarked several areas where he was interested in living, so he began by outlining these areas on a map using BoundingBox, recording each box’s coordinates to build a dictionary of neighbourhoods and coordinates. He then wrote code to filter the Craigslist results, checking whether each property’s coordinates fall inside any of the boxes. Finally, he created a dictionary of transit station coordinates from Google Maps to ensure there was a transit station within close range of each matching apartment.
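The filtering step can be sketched as below. The neighbourhood boxes, station coordinates, and the 1 km “close range” cutoff are all assumptions for illustration; only the point-in-box and distance checks reflect the approach described:

```python
import math

# Hypothetical neighbourhood boxes as (min_lat, min_lon, max_lat, max_lon),
# the kind of coordinates you would read off BoundingBox.
NEIGHBOURHOODS = {
    "mission": (37.748, -122.426, 37.769, -122.402),
    "noe_valley": (37.745, -122.439, 37.756, -122.421),
}

# Hypothetical transit station coordinates taken from Google Maps.
TRANSIT_STATIONS = {
    "24th_st_bart": (37.7522, -122.4183),
    "church_st_muni": (37.7674, -122.4290),
}

MAX_TRANSIT_KM = 1.0  # "close range" threshold; the exact cutoff is a choice

def in_box(lat, lon, box):
    """Check whether a (lat, lon) point falls inside a bounding box."""
    min_lat, min_lon, max_lat, max_lon = box
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def match_listing(lat, lon):
    """Return (neighbourhood, nearest_station) for a qualifying listing, else None."""
    hood = next((n for n, b in NEIGHBOURHOODS.items() if in_box(lat, lon, b)), None)
    if hood is None:
        return None  # not in any area of interest
    station, dist = min(
        ((s, haversine_km(lat, lon, slat, slon))
         for s, (slat, slon) in TRANSIT_STATIONS.items()),
        key=lambda pair: pair[1],
    )
    return (hood, station) if dist <= MAX_TRANSIT_KM else None

print(match_listing(37.7525, -122.418))  # in the Mission, steps from BART
print(match_listing(37.80, -122.40))     # outside every box -> None
```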
Creating the Slack Bot
Since Paruchuri wasn’t looking on his own, he created a Slack team and channel so that he could collaborate with others to decide which listings were best. After initialising python-slackclient with a Slack API token, he built a message string from each result containing relevant info such as the price and neighbourhood, and posted it to the Slack channel.
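A sketch of that step, assuming the legacy python-slackclient library: the message formatting is plain string building, while the posting call (whose shape follows that library’s README) is guarded behind a token check, and the channel and bot names are assumptions for illustration:

```python
import os

def format_listing(result):
    """Build a one-line Slack message with the fields collaborators care about."""
    return "{} | {} | {} | <{}>".format(
        result.get("where", "unknown"),
        result.get("price", "?"),
        result.get("name", ""),
        result.get("url", ""),
    )

def post_listing(result):
    """Post a listing to Slack; does nothing without credentials."""
    token = os.environ.get("SLACK_TOKEN")
    if not token:
        return  # no credentials in this environment
    from slackclient import SlackClient  # legacy python-slackclient package
    sc = SlackClient(token)
    sc.api_call(
        "chat.postMessage",
        channel="#housing",          # channel name is an assumption
        text=format_listing(result),
        username="apartment-bot",    # bot display name is an assumption
        icon_emoji=":robot_face:",
    )

sample = {"where": "mission district", "price": "$2000",
          "name": "Sunny 1BR", "url": "https://example.org/listing"}
print(format_listing(sample))
# -> mission district | $2000 | Sunny 1BR | <https://example.org/listing>
```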
Operationalising It All
To run this code persistently, Paruchuri stored the listings in a database to avoid duplicates, separated the configuration from the code to simplify adjustments, and created a loop so that the scraper runs continuously, feeding results to Slack in near real time and quickly surfacing possible apartments that match the predefined criteria.
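That run loop might look like the following sketch, with an in-memory SQLite table standing in for the database and stub functions in place of the scraping and Slack-posting steps described above (the polling interval is an assumption):

```python
import sqlite3
import time

# An SQLite table remembers which listing IDs have been posted, so duplicate
# results are skipped. The real version would use a file-backed database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IF NOT EXISTS seen (id TEXT PRIMARY KEY)")

def is_new(listing_id):
    """Record the ID; return True only the first time it is seen."""
    try:
        conn.execute("INSERT INTO seen (id) VALUES (?)", (listing_id,))
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        return False  # already seen -> a duplicate

def run_loop(fetch_listings, post_to_slack, interval_s=20 * 60, cycles=None):
    """Poll forever (or for `cycles` iterations), posting only unseen listings."""
    n = 0
    while cycles is None or n < cycles:
        for listing in fetch_listings():
            if is_new(listing["id"]):
                post_to_slack(listing)
        n += 1
        if cycles is None or n < cycles:
            time.sleep(interval_s)

# Example with stubbed-in fetch/post functions:
posted = []
run_loop(lambda: [{"id": "1"}, {"id": "2"}, {"id": "1"}],
         posted.append, interval_s=0, cycles=2)
print(len(posted))  # only the two unique IDs ever get posted -> 2
```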
While this type of scraper is nothing new and has been built many times before in different languages and on different platforms (as the comments in this Reddit thread attest), many of those versions were for private use. Paruchuri has posted the entire project on GitHub, so anybody can leverage the code, set their own parameters, and find an apartment quickly.