Data Science Toolkit Wraps Many Data Services in One API

The Data Science Toolkit is a collection of the best open data sets and open-source tools for data science, wrapped in an easy-to-use REST/JSON API with command line, Python and JavaScript interfaces. It is available as a self-contained Vagrant VM or an EC2 AMI that you can deploy yourself.

The Data Science Toolkit is essentially a specialized Linux distribution with a lot of useful data software pre-installed, exposed through a simple interface. The developer documentation is quite nicely done.

The API has the following sub-components:

  • Geodict - An emulation of Yahoo's Placemaker service, but designed for applications that are intolerant of false positives: Placemaker will flag "New York Times" as a location, for example, whereas Geodict will ignore it.
  • Text to Places - A friendlier interface to the same underlying functionality as Geodict. It uses a much simpler JSON format, taking an array containing a single string as its input and returning an array of the locations found in that text.
  • IP Address to Coordinates - Takes either a single numerical IP address, a comma-separated list, or a JSON-encoded array of addresses, and returns a JSON object with a key for every IP. The value for each key is either null if no information was found for the address, or an object containing location information, including country, region, city and latitude/longitude coordinates.
  • Street Address to Coordinates - Takes either a single string representing a postal address or a JSON-encoded array of addresses, and returns a JSON object with a key for every address. This API uses data from the US Census and OpenStreetMap, along with code from GeoIQ and Schuyler Erle.
  • Google-style Geocoder - Emulates Google's geocoding API. It's designed to take the same arguments and return the same data structure as Google's service, to make it easy to port existing code. It's very similar in functionality to street2coordinates.
  • Coordinates to Politics - Takes either a single string representing a comma-separated latitude/longitude pair, or a JSON-encoded array of objects each containing a latitude key and a longitude key. It returns a JSON array containing an object for every input location: the location member holds the coordinates that were queried, and politics holds an array of the countries, states, provinces, cities, constituencies and neighborhoods the point lies within. This API relies on data gathered by volunteers around the world for OpenHeatMap, along with US census information and neighborhood maps from Zillow.
  • File to Text - If you pass in an image, this API runs an optical character recognition algorithm to extract any words or sentences it can from the picture. If you upload a PDF file, Word document, Excel spreadsheet or HTML file, it returns a plain-text version of the content. This API relies on the Ocropus project for handling images, and on catdoc for pre-XML Word and Excel documents.
  • Text to Sentences - Takes some text and returns any fragments that look like proper sentences, organized into paragraphs. It's most useful for taking documents that may be full of uninteresting boilerplate, like headings and captions, and returning only the descriptive passages for further analysis.
  • HTML to Text - Takes an HTML document and analyzes it to determine what text would be displayed when it is rendered. This includes all headers, ads and other boilerplate. If you want only the main body text (for example, the story section in a news article), you should use html2story instead.
  • HTML to Story - Takes an HTML document and extracts the sections of text that appear to be the main body of a news story, or more generally the long, descriptive passages in any page. This is especially useful when you want to analyze only the unique content of each page, ignoring all the repeated navigation elements.
  • Text to People - Extracts any sequences of words that look like people's names and tries to guess their gender from any first names found. This works best on names that are common in English-speaking countries, though it does cover the most popular foreign-language names too. This API uses a Ruby port of Eamon Daly and Jon Orwant's original GenderFromName Perl module to classify first names.
  • Text to Times - Searches the input for strings that represent dates or times and parses them into the standard form of a Ruby date/time string (e.g. "Mon Feb 01 11:00:00 -0800 2010") as well as seconds since January 1st, 1970. This API uses the Chronic Ruby gem to convert the strings it finds into times.
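The coordinate-style calls above share a simple request/response shape. Here is a minimal Python sketch of the IP Address to Coordinates call; the base URL assumes a self-hosted VM, the endpoint path follows the toolkit's naming, and the sample response values are illustrative, not real lookups:

```python
import json

# Assumed address of a self-hosted Data Science Toolkit VM.
BASE_URL = "http://localhost:8080"

def ip2coordinates_url(ips):
    """Build the request URL; a list of IPs is sent comma-separated,
    per the API description above."""
    if isinstance(ips, (list, tuple)):
        ips = ",".join(ips)
    return f"{BASE_URL}/ip2coordinates/{ips}"

def parse_ip_response(body):
    """The reply is a JSON object keyed by IP; each value is either
    null (nothing found) or an object with location fields. Keep only
    the addresses that resolved."""
    results = json.loads(body)
    return {ip: info for ip, info in results.items() if info is not None}

# A hand-written sample response in the documented shape.
sample = json.dumps({
    "67.169.73.113": {"country_name": "United States", "city": "Berkeley",
                      "latitude": 37.87, "longitude": -122.27},
    "10.0.0.1": None,
})

print(ip2coordinates_url(["67.169.73.113", "10.0.0.1"]))
print(parse_ip_response(sample))
```

The same key-per-input pattern applies to Street Address to Coordinates, so a client can batch many lookups into one request and match results back by key.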
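The text-oriented calls take a JSON-encoded array of strings as the request body. As a sketch, assuming a self-hosted VM at localhost and an endpoint path like text2places (the helper names here are mine, not part of the API):

```python
import json
import urllib.request

# Assumed address of a self-hosted Data Science Toolkit VM.
BASE_URL = "http://localhost:8080"

def batch_body(texts):
    """Text endpoints take a JSON-encoded array of strings as the body."""
    return json.dumps(list(texts)).encode("utf-8")

def call_text_endpoint(endpoint, texts):
    """POST the batch to an endpoint such as text2places and decode the
    JSON reply. Shown for illustration; not executed here."""
    req = urllib.request.Request(
        f"{BASE_URL}/{endpoint}",
        data=batch_body(texts),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

print(batch_body(["The New York Times reported from Baghdad."]))
```

Because every text call shares this shape, the same helper works for text2sentences, text2people and text2times by swapping the endpoint name.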

This project was created by Pete Warden, who has also written two books on data science and big data. Personally, I found the self-contained virtual machine and the Amazon AMI an unexpected treat. I wish more developer advocates and API creators were so thoughtful, particularly those who embrace open sharing.

Data science: just another API call away!

Be sure to read the next API article: Myna A/B Testing API: The Bird That Tests Your Website Fast

