How to Geocode a Dataset With Open Data

This is the first part of our series How to Add Postcode-Based Proximity Search With Open Data. In this part of our series, we will explain how to geocode a dataset using open data.

In this article, the first in a series about adding location to applications, I explain how to geocode a dataset using open data. Geocoding involves placing a thing or place (e.g. a restaurant) in a specific position. By the end of this article, you should be able to take a dataset and add the position of each record to that dataset.

This whole process is made easier by the increasing availability of open data. Governments around the world are realising the importance of opening up their data to help drive the digital economy. In this context, that means they allow new applications and derivative datasets to be developed, unencumbered by previous licensing restrictions.

In this article series, I work with UK data because that’s where I am from. No matter where you are in the world, however, you should be able to find similar resources.

So, in particular: The UK’s Ordnance Survey geocodes all UK postcodes and releases them as open data, with the exception of Northern Irish postcodes, which are under a more restrictive license. The one reuse requirement is to include an attribution statement.

Whichever dataset you use, the open data geocoding maps to a specific place: a destination you could point to on a map or with a GPS coordinate. Typically, the coordinates provided match a building located roughly centrally in the postcode; a single postcode can contain tens of buildings. If you need to accurately geocode a specific building relative to its neighbors, you cannot currently do this with open data alone.

For this how-to, I use Open Postcode Geo, which is optimized for geospace applications and is available as both a CSV file and an API. As the example project, we are making an application to help people find the pub nearest to them. I use Open Postcode Geo to geocode a list of pubs, by which I mean that I find the position of each pub. I then create an application which builds on this geocoded dataset to allow a user to find the nearest pub to a given location. Because… really, do I need to explain why someone would want to find a pub?!

Dataset to Geocode: Start with Requirements

Perhaps it’s obvious, but the first element you need in building a geolocation-based application or feature is a list of locations and data about those locations. In order to add positions, your dataset needs a postcode for each record you want to geocode. Sometimes that’s information you collect yourself, or your company has as proprietary data. It is likely a collection of addresses, such as  the addresses of every branch in your retail company, or all the contacts in your organization.

But open data gives us access to so much more -- with varying amount of detail. Many open datasets provide addresses but no coordinates (for example, easting and northing or latitude and longitude). Here are some examples:

It is worth noting that the data in Open Postcode Geo is updated at least once a quarter, as new postcodes are issued and old ones are retired. That likely is true no matter which data source you use for location-related data, open-data or otherwise. Restaurants open and close all the time; banks open new branches; museums regularly add to their digital collections. Whatever location-based application you design, it always needs to encompass regular data updates.

For the purposes of this how-to, I use a sample dataset that I produced specifically for the purpose. It contains every premise listed in the “Pub/bar/nightclub” category of the Food Standards Agency’s rating data, which is open data released under the UK’s Open Government License.

The how-to therefore solves a very frequent proximity usecase: Where is the nearest pub?

To keep things simple, the sample dataset contains just three fields:

  • The pub’s name
  • Address
  • Postcode

You can download the sample dataset here. If you want more (and up-to-date) pub data for real world applications, look at the Open Pubs open dataset.

Choosing a coordinate system

The Open Postcode Geo dataset uses two different coordinate systems suitable for geocoding.

The first system is eastings and northings, a simple grid reference describing the distance in meters to the east and north from an origin point in the far southwest corner of the grid, the coordinates of the origin being (0,0).

The second is latitude and longitude, an explanation of which is beyond the scope of this how-to.

Latitude and longitude have the benefit of being able to locate a point anywhere in the world; in contrast, eastings and northings are limited to a position within a specified grid. As we are concerned with the UK, eastings and northings both suit our purposes, and they are much easier to understand mathematically. Hence I for this example I focus on eastings and northings.

Choosing CSV or API

Open Postcode Geo is available in two formats: A CSV file which you can import into your database, and an API .

If you need to geocode your data on an ongoing basis (e.g. new records are added to your application each month which need geocoding), you should consider using the API, as it is always up to date. If you have a large dataset and speed or connectivity is an issue, then the CSV file allows you to quickly geocode a large number of records. Alternatively, you can just use an SQL join to join the postcodes in your dataset to the postcodes in Open Postcode Geo, bringing in whatever coordinates you need at the same time.

Database setup

The first thing to do is set up the database as we need some tables in place into which to load our datasets. Our examples use MySQL.

First, load the Open Postcode Geo CSV file, which comes as comes as CSV or formatted for a MySQL import. If you decided to use the CSV file rather than the API, you need to load it into your database.

For this example, I import the CSV, as the instructions are more easily adapted to other databases.

First, create a new database. (You can skip this step if you are working with an existing database.) As our example dataset is a list of pubs, and we’re building a proximity search, call the database “pub_finder”:

mysql> create database pub_finder;
Query OK, 1 row affected (0.00 sec)

mysql> use pub_finder;
Database changed

Next, create a table into which to load the Open Postcode Geo data. Here is the SQL table create statement:

CREATE TABLE `open_postcode_geo` (
 `postcode` char(8) NOT NULL,
 `status` enum('live','terminated') NOT NULL,
 `usertype` enum('small','large') NOT NULL,
 `easting` mediumint(9) DEFAULT NULL,
 `northing` mediumint(9) DEFAULT NULL,
 `positional_quality_indicator` tinyint(3) unsigned NOT NULL,
 `country` enum('England','Wales','Scotland','Northern Ireland','Channel Islands','Isle of Man') NOT NULL,
 `latitude` decimal(9,6) DEFAULT NULL,
 `longitude` decimal(9,6) DEFAULT NULL,
 `postcode_no_space` char(7) NOT NULL,
 `postcode_fixed_width_seven` char(7) NOT NULL,
 `postcode_fixed_width_eight` char(8) NOT NULL,
 `postcode_area` char(2) DEFAULT NULL,
 `postcode_district` char(4) DEFAULT NULL,
 `postcode_sector` char(6) DEFAULT NULL,
 `outcode` char(4) NOT NULL,
 `incode` char(3) NOT NULL,
 UNIQUE KEY `postcode` (`postcode`),
 UNIQUE KEY `postcode_no_space` (`postcode_no_space`),
 UNIQUE KEY `postcode_fixed_width_seven` (`postcode_fixed_width_seven`),
 UNIQUE KEY `postcode_fixed_width_eight` (`postcode_fixed_width_eight`)
);

And here is what you should see at the prompt:

mysql> CREATE TABLE `open_postcode_geo` (
   -> `postcode` char(8) NOT NULL,
   -> `status` enum('live','terminated') NOT NULL,
   -> `usertype` enum('small','large') NOT NULL,
   -> `easting` mediumint(9) DEFAULT NULL,
   -> `northing` mediumint(9) DEFAULT NULL,
   -> `positional_quality_indicator` tinyint(3) unsigned NOT NULL,
   -> `country` enum('England','Wales','Scotland','Northern Ireland','Channel Islands','Isle of Man') NOT NULL,
   -> `latitude` decimal(9,6) DEFAULT NULL,
   -> `longitude` decimal(9,6) DEFAULT NULL,
   -> `postcode_no_space` char(7) NOT NULL,
   -> `postcode_fixed_width_seven` char(7) NOT NULL,
   -> `postcode_fixed_width_eight` char(8) NOT NULL,
   -> `postcode_area` char(2) DEFAULT NULL,
   -> `postcode_district` char(4) DEFAULT NULL,
   -> `postcode_sector` char(6) DEFAULT NULL,
   -> `outcode` char(4) NOT NULL,
   -> `incode` char(3) NOT NULL,
   -> UNIQUE KEY `postcode` (`postcode`),
   -> UNIQUE KEY `postcode_no_space` (`postcode_no_space`),
   -> UNIQUE KEY `postcode_fixed_width_seven` (`postcode_fixed_width_seven`),
   -> UNIQUE KEY `postcode_fixed_width_eight` (`postcode_fixed_width_eight`)
   -> );
Query OK, 0 rows affected (0.29 sec)

Note: This table has an index on all postcode fields. If you know your postcodes all have the same format, you might decide you do not need to index all postcode fields (or even to retain them).

Dan Winchester

Comments