You want to abstract away all data-source-specific types, and normalize the data into a form that’s compatible with your destination database. We learned early on, that while it’s convenient to test against the normal primitive data types, like Strings, Numbers and Tests, it’s far more complex to cover all possible input data from nested objects to domain-specific types, like geo-coordinates. So it’s best to refer to the documentation to make sure all of the supported data types are covered, and even add a little bit of defensive programming - throw an error when an unknown data type has been received (see #7 below).
Nested data structures, like Objects or Arrays, are irregular. This is a bit outside the scope of this article, but the key question you should be asking is, should we flatten the nested structure somehow? Do we parse JSONs? The answer is dependent on the destination database in question: Redshift, for example, requires flattened tables and well-defined relationships between these tables, while Hadoop can get along with unparsed JSON strings.
#4 Avoid Overconfiguration
An early mistake we made in an attempt to build truly powerful data sources, was to expose most of the extraction and transformation code via configuration; basically allowing our users to control everything in the process, every filtering query, every aggregation function, every bit of control available on the data source. Every database or API has tens of different configurations and features for reading the data, and attempting to cover most of them via configuration led to complex code tightly coupled to each specific data source and its internal mechanics, while also defeating the “Avoid Exotic Features” (#1) principle.
You can keep the code clean and maintainable, by using the minimal set of required configurations: the connection details and (optionally) simple filtering configuration. The latter is just a measure for handling scale and parallelization (see #6 below), by allowing the users to select which tables they want to read, and can be avoided in many small-scale data sources like most external APIs.
Finally, any other filtering, aggregation logic or other configuration, can be conducted as a separate step when the data is already stored and normalized in your data warehouse.
#5 Update Incrementally
When dealing with potentially massive data, you have to support a method of running the extraction code incrementally - only fetch records that were modified since the last execution. There are generally two ways to go about doing this and each has its own drawbacks.
The first option, is to use some intrinsic mechanism, like querying Mongo’s oplog, or using triggers to create a log of all changes. This approach may be impractical for most data sources, and it defeats the principle of avoiding exotic features (#1), as it’s not a basic CRUD operation. It opens up a can of worms in terms of complexity and potential bugs.
The second option, is to use a simple incremental key, and query against it. The incremental key can be a modification date attached to each record. Filters are a common enough operation that you can assume exist in all large data sources, and can therefore be implemented generically. However, you can’t always assume that such a key exists, and even when it does, it will require additional configurations (#4). Even if you do all that, remember that this approach will be unaware of any record removals.
From our experience, using the incremental key is marginally superior. Despite its disadvantages, sometimes it is the only option available.
For smaller data sets, preferably less than several hundred Megabytes, there’s a simpler approach. Just read the entire dataset on each execution. In the case of most relatively small data sets, the tradeoff of performance for code clarity and simplicity is well worth it. Supporting incremental processing in small data sets is a micro-optimization, resulting in tiny performance improvement and adds a lot of complexity to the code.
Massive data sets, ones that contain billions of big records, are difficult to chew and consuming such a data source can take days to run. While this may be considered an acceptable run time for data sources of this size, it also spawns a whole set of new problems, For example, now it will be more difficult to deploy new code during a run, and handling errors will result in retrying (see #7 below) the entire data source all over again. The problem is magnified in cases where incremental updates (#5) aren’t an option.
What you need is a way to scale the extraction process vertically. Adding more workers that process data in parallel to cover more of the original data set concurrently. Since we’ve already established that the input of these processes can include simple filters, this can be achieved by first partitioning the data source into several filter ranges, and then running each of these partitions in parallel, just like we would in sequential mode. For example, we could run several workers where each one is limited to only read a billion records.
Keep in mind though that parallelization produces a new set of challenges: most importantly, we’re bound by the parallelization of the underlying data source. If the database we’re reading from can only handle several parallel operations, adding more parallelization than it can handle will just stress the server to a point of potentially crashing it. If this is a production database, serving other purposes, this mistake could be extremely costly, and if it’s an external API, it might be throttled, resulting in subpar performance. We’re also bound by the concurrency of our destination database, as it can handle only so many writes in parallel.