Gone are the days when developers had to code every aspect of their product from scratch. Today, a cacophony of databases and APIs exists with the explicit purpose of enabling developers to build upon existing frameworks and stacks. But as in any menagerie, some birds squawk and squeal while others sing in perfect tune.
Over the past months our team at Panoply.io has implemented over 60 data source integrations into our platform. To achieve this, we developed a data extraction framework designed to handle the different implementations of data sources, one that lets us integrate any future data source with a few hours of coding. The foremost challenge we faced was making this layer robust enough to survive changes over time, as well as feature and version fragmentation, without kicking off an endless maintenance spiral.
Data sources are instruments in the orchestra of your destination database. In our case the destination database is a data warehouse, but this holds true for any product or service pulling data from external sources. Travel startups like Freebird are perfect examples of services dependent on external data. For products to deliver this data, orchestras need to be conducted. Having done this more than once, we've picked up a few best practices for writing such integrations quickly and reliably. I will outline 9 of the most common lessons here.
*Before we begin, please note that I will not be discussing obvious engineering best practices, such as writing tests, reading docs or building generic code. Also, I will not discuss the underlying architecture of such a framework, or the enormously important task of securing and encrypting the data. The last two are articles in their own right.
#1 Avoid Exotic Advanced Features
Code is frequently developed and tested against a specific database version, and can break when run against another: a different Postgres server release, for example, or one with a different set of modules and configurations. Even when developing an internal project where you're best friends with the system admin, there is no way to know in advance which version your code will run against.
Avoid exotic, unique and advanced features of the data source. This may be disappointing, because maybe you really want to use Mongo's Aggregation Framework, or Riak's link-walking. Keep in mind that you don't really need to build sophisticated code that interacts with the database, just a simple extraction layer. It's therefore not only safer but also more efficient to use the simplest CRUD mechanics provided by the database and forgo any extra bells and whistles.
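To make the point concrete, here is a minimal sketch of what "simplest CRUD mechanics" can look like. It uses SQLite so it runs standalone; the table name, batch size and schema are illustrative, not Panoply's actual code. The only mechanic used is a plain paged `SELECT`, which behaves the same across engine versions:

```python
import sqlite3

def extract_rows(conn, table, batch_size=1000):
    # Pull all rows using nothing but a plain paged SELECT.
    # No engine-specific features, so version drift on the
    # source side is far less likely to break extraction.
    offset = 0
    while True:
        cur = conn.execute(
            f"SELECT * FROM {table} LIMIT ? OFFSET ?",  # table name is trusted/internal here
            (batch_size, offset),
        )
        rows = cur.fetchall()
        if not rows:
            break
        yield from rows
        offset += len(rows)

# Demo against an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(i, f"user{i}") for i in range(5)],
)
rows = list(extract_rows(conn, "users", batch_size=2))
print(len(rows))  # 5
```

The same shape works for any source: one dumb read loop, no aggregation pipelines, no link-walking.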
#2 Implement Restrictive Versioning
This is a huge problem when dealing with external data sources, like APIs, as they may change on a daily basis, sometimes without notification. Facebook's APIs are infamous for changing rapidly and breaking often. When you're covering dozens of different data sources, something will change daily, potentially resulting in a quicksand of maintenance work.
Restrict your code to specific versions. Start off with a validation test of the data source version and decline any that don't match a pre-tested set of values. This path is a tad tricky, as you don't want to overshoot and build code that's too restrictive. We suggest testing for a small range of major versions; that hits a good tradeoff between stability and availability. Combine this with the previous lesson, and you have the mix that will keep your code going over time.
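A version gate like this can be just a few lines. The sketch below assumes the source reports a dotted version string (as Postgres or Mongo do); the set of supported major versions is hypothetical and would come from whatever range you have actually tested against:

```python
import re

# Hypothetical pre-tested range: major versions our extraction
# code has been validated against. Anything else is declined up
# front rather than failing halfway through an extraction run.
SUPPORTED_MAJOR_VERSIONS = {9, 10, 11}

def validate_version(version_string):
    """Parse a server version like '10.4.2' and accept only
    pre-tested major versions."""
    match = re.match(r"(\d+)\.", version_string + ".")
    if not match:
        raise ValueError(f"unparsable version: {version_string!r}")
    major = int(match.group(1))
    if major not in SUPPORTED_MAJOR_VERSIONS:
        raise RuntimeError(
            f"untested data source version {version_string}; "
            f"supported major versions: {sorted(SUPPORTED_MAJOR_VERSIONS)}"
        )
    return major

print(validate_version("10.4.2"))  # 10
```

Checking only the major version is the deliberate part: patch and minor releases pass through, so the gate stays loose enough not to block perfectly compatible servers.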
#3 Normalize Special Data Types
Some data sources have unique data types that you need to take into consideration. For example, Mongo's ObjectID data type is an automatically generated unique id that also holds information about the document's creation date. This identifier can be found either as a document id or as a foreign key to a different Mongo collection. Other examples are Postgres' hstore data type, which allows nesting of key-value pairs within a table row, or SQL Server's Geometry data types.
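Normalizing such a type usually means flattening it into plain, warehouse-friendly columns. As a sketch for the ObjectID case: an ObjectID is 12 bytes (24 hex characters) whose first 4 bytes are a Unix timestamp, so it can be normalized into a string id plus a standard datetime without any Mongo driver at all. The field names in the returned dict are illustrative:

```python
from datetime import datetime, timezone

def normalize_object_id(oid_hex):
    # A Mongo ObjectID is 24 hex chars; the first 8 encode the
    # creation time as seconds since the Unix epoch. Flatten it
    # into a plain string id plus a standard UTC datetime.
    if len(oid_hex) != 24:
        raise ValueError("ObjectID must be 24 hex characters")
    ts = int(oid_hex[:8], 16)
    return {
        "id": oid_hex,  # kept as a string, usable as a join key
        "created_at": datetime.fromtimestamp(ts, tz=timezone.utc),
    }

row = normalize_object_id("507f1f77bcf86cd799439011")
print(row["created_at"].isoformat())
```

The same pattern applies to the other examples: hstore becomes nested key-value columns (or JSON), and Geometry becomes well-known text or separate coordinate columns.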