How Indiana’s Legislative Site Foiled Attempts to Scrape It

After Indiana enacted the Religious Freedom Restoration Act earlier this year, the Open States team wanted to make the text of the bill available to the public. However, in a subsequent post on the Sunlight Foundation’s blog, software developer Rachel Shorey described the trouble the team ran into while trying to scrape it.

Bill text is usually provided by the state as a PDF, and Open States provides a link to that specific bill on the state’s website. The organization gets the majority of its information from Indiana through its legislative API, MyIGA, which requires an API key even for PDFs. Because Open States users have no way to download a PDF through the API without a key, the nonprofit resorted to scraping an ungated download link from the bill’s page.

However, the link URL appeared to be generated on the fly and required a document-specific hash value that the team needed to find. Some custom code was able to locate this ID, but the scraper then had to visit a slow site several times for a single bill, and the frequent timeouts left it crashed or hung on most attempts.
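The usual defense against this kind of failure is to bound each request and retry with a growing delay, so that one slow bill is skipped rather than stalling the whole run. The sketch below is not the Open States code. The `fetch_with_retries` name, its parameters, and the assumption that a slow request surfaces as `TimeoutError` are all illustrative.

```python
import time
from typing import Callable, Optional


def fetch_with_retries(
    fetch: Callable[[], bytes],
    attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[bytes]:
    """Call fetch() up to `attempts` times, backing off between failures.

    Returns the payload, or None if every attempt timed out, so the
    caller can skip one bill instead of crashing the whole run.
    `fetch` is a hypothetical zero-argument callable that downloads one
    page and raises TimeoutError when the site is too slow.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except TimeoutError:
            if attempt < attempts - 1:
                # Wait 1x, 2x, 4x ... the base delay before retrying.
                sleep(base_delay * 2 ** attempt)
    return None
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real waiting. A scraper would wrap each of the several page visits a bill requires in its own call.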

Generating link URLs on the fly in this way could be viewed as a preventive measure against website scraping, despite the data being open, and may have applications elsewhere. Attempts to reverse engineer the document IDs revealed nothing about how they were constructed, so the team returned to the API.

The state legislative service did not return Open States’ calls about the matter, but the terms of service allow Open States to use data obtained with its API key to create an app. So the team used the available data to construct a simple proxy URL that retrieves the document for download, circumventing the hash-generated URLs.
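The idea behind such a proxy can be sketched in a few lines: the user requests a stable URL, and the server attaches the key and fetches the PDF on the user's behalf. Everything specific here is an assumption for illustration, since the article does not give the API host, path layout, or header name. `API_BASE`, `API_KEY_HEADER`, and the function names are hypothetical.

```python
from typing import Callable, Optional
from urllib.parse import quote
from urllib.request import Request, urlopen

# Hypothetical values: the real API host, path layout, and header name
# are not stated in the article.
API_BASE = "https://api.example.invalid"
API_KEY_HEADER = "Authorization"


def upstream_request(doc_path: str, api_key: str) -> Request:
    """Build the keyed request the proxy sends on the user's behalf."""
    url = f"{API_BASE}/{quote(doc_path.lstrip('/'))}"
    return Request(url, headers={API_KEY_HEADER: api_key})


def send_over_http(req: Request) -> bytes:
    """Default transport: perform the request and return the body."""
    with urlopen(req, timeout=30) as resp:
        return resp.read()


def proxy_document(
    doc_path: str,
    api_key: str,
    send: Optional[Callable[[Request], bytes]] = None,
) -> bytes:
    """Return the document bytes for doc_path without exposing the key.

    `send` can be swapped out (for example in tests) so that no network
    call is made.
    """
    transport = send if send is not None else send_over_http
    return transport(upstream_request(doc_path, api_key))
```

Because the key is added server-side, the link shown to users never contains it, and the link stays stable even if the state's own download URLs change with each generated hash.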

Original Article

Opening up Indiana’s hard to reach legislative data