Bottom line: just like in the other lessons, avoid over-engineering. When performance is already acceptable for your project’s needs, it’s best to avoid these challenges altogether rather than struggle to solve them.
#7 Errors and Retries
Errors are bound to happen, anything from temporary network hiccups to logical problems. The general best practices of error handling obviously apply to data sources, except that now you also need a strategy for retrying failed requests in order to keep your code sustainable and alive.
First, you want to differentiate between retryable errors and final ones. Within your error-handling code, determine for each error whether it’s retryable. A retryable error is one that results from a transient issue, like network failure or server unavailability. These errors are often resolved in time, either by manual intervention of the data source owner or automatically by the database’s internal mechanics; most database systems nowadays are designed to survive such failures and self-repair. Refer to the data source documentation and make sure you cover most of the documented errors. An error is not retryable if it’s logical: for example, filtering by a missing column name, or trying to fetch a removed table. In most cases it’s safe to assume that an error is retryable, because even if it’s not, it will just repeat a few times before it breaks. But it’s always best to be explicit.
When you do capture a retryable error, schedule a retry of the last batch after a set time period. Don’t run it immediately, as it may contribute to the underlying problem, like a server crashing under too many queries. You don’t want the retries to go on indefinitely, so setting a limit on the number of attempts is advised; two to three should suffice.
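The classify-then-retry flow above can be sketched roughly as follows. This is a minimal illustration, not production code: the two exception types and the `read_batch` callable are made up for the example, and in real code you’d map your driver’s actual exceptions onto them.

```python
import time

class RetryableError(Exception):
    """Transient failures, e.g. network timeouts or server unavailability."""

class FinalError(Exception):
    """Logical failures, e.g. a missing column or a dropped table."""

def read_batch_with_retries(read_batch, max_retries=3, delay_seconds=30):
    """Run read_batch, retrying transient errors a bounded number of times."""
    for attempt in range(1, max_retries + 1):
        try:
            return read_batch()
        except FinalError:
            raise  # a logical error will never succeed; fail fast
        except RetryableError:
            if attempt == max_retries:
                raise  # give up after the retry budget is exhausted
            # wait before retrying, so we don't pile onto a struggling server
            time.sleep(delay_seconds)
```

Note that the delay is deliberate: retrying immediately can worsen the very outage you’re retrying around.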
#8 Batching
In most cases, you don’t want to read the entire data source in one go. This is true even for small data sets of several hundred MBs, as they would overflow memory when several workers run in parallel. It’s best to read the data in small batches, handling retries for each batch individually.
The naive approach to batching is to read the first batch with a limit of, say, 1,000 records, and then read the next batch with the same limit, this time skipping the rows you’ve already consumed. This approach presents several challenges.
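As a sketch, the naive skip-and-limit loop looks something like this (the table and column names are invented for the example, and `run_query` stands in for whatever your driver’s query function is; real code should use bound parameters rather than string formatting):

```python
def naive_batches(run_query, limit=1000):
    """Yield batches using LIMIT/OFFSET -- simple, but increasingly slow."""
    offset = 0
    while True:
        rows = run_query(
            f"SELECT * FROM events ORDER BY id LIMIT {limit} OFFSET {offset}"
        )
        if not rows:
            break  # no rows left; we've consumed the whole table
        yield rows
        offset += limit  # each new batch re-skips everything read so far
```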
First, new data will come in while you’re reading, resulting in non-deterministic ordering. That means that running the same query twice, with the same skip and limit, can produce different results.
Second, you can’t always rely on the database’s native sorting. Teradata, for example, can produce completely different results for the exact same query due to its non-deterministic native sorting. Most importantly, some databases, like Postgres, don’t handle skips very well: they scan all of the skipped rows without adding them to the result set.
Third and last, this approach produces increasingly slower batches, proportional to the number of skipped rows, amounting to a full table scan towards the final batches. In practice, most of these queries end up scanning the majority of the data over and over again for each batch.
It’s better to implement batching using a primary key filter: sort the results by the primary key, and fetch only rows whose primary key is greater than the last one received in the previous batch. This approach uses the primary key index, allowing the database to skip all of the irrelevant rows instantly, without scanning through them. Primary keys are preferable to other indexes because most databases enforce uniqueness on primary keys (Redshift excluded). Otherwise, it’s possible to implement this logic with any other index, but the solution is far more complex (hint: it’s a combination of this method and the naive one).
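A minimal sketch of this keyset approach, under the same assumptions as before (invented table and column names, a numeric primary key, a stand-in `run_query`, and bound parameters omitted for brevity):

```python
def keyset_batches(run_query, limit=1000):
    """Yield batches by filtering on the primary key instead of skipping."""
    last_id = None
    while True:
        where = f"WHERE id > {last_id}" if last_id is not None else ""
        rows = run_query(
            f"SELECT * FROM events {where} ORDER BY id LIMIT {limit}"
        )
        if not rows:
            break
        yield rows
        last_id = rows[-1]["id"]  # resume right after the last row we saw
```

Because each query filters on an indexed column instead of using OFFSET, every batch costs roughly the same, regardless of how deep into the table you are.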
#9 Debugging and Maintaining
After the data source is ready, your job is not done. As mentioned in the introduction, our main purpose here is to keep the code sustainable and maintainable with as little effort as possible. Sooner or later, you will need to revisit the code, add features or debug it. Here are several best-practices we’ve learned over the past year which will help you maintain your sanity and have a life outside of work:
- Log every bit of raw data: you will eventually want to see exactly what data came through the data source, unmodified in any way. Usually, you’ll need to compare it against the same data point in your destination database. This makes it easy to identify weird bugs, like unsupported data types or non-ASCII characters in the data. You should also log any metadata you have, like the database, collection or filename the data source used to extract that record.
- Add metadata to the result set: You’ll want to have a simple way of tracing data back from your destination database to the data source. For this reason, it’s advised to add some metadata to each record you produce, same as with logging.
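One way to sketch the metadata enrichment from the list above: a small helper that stamps each record before it’s written downstream. The field names (`_source_db`, `_source_table`, `_extracted_at`) are an invented convention for this example; pick whatever naming fits your destination schema.

```python
import datetime

def with_metadata(record, source_db, source_table):
    """Attach tracing metadata to a record before writing it downstream."""
    return {
        **record,
        "_source_db": source_db,          # where the record came from
        "_source_table": source_table,
        "_extracted_at": datetime.datetime.now(
            datetime.timezone.utc
        ).isoformat(),                    # when this pipeline saw it
    }
```

With these fields in place, tracing a suspicious row in the destination back to its origin becomes a simple lookup rather than an archaeology project.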
There’s a myth that every line of code takes 7x more time to maintain than it took to develop. With the ever-changing array of data sources, it gets worse. An obsession with writing resilient code and tests will save your future self a ton of time. Scaling is obviously also an issue, especially with database sources, where the input data could be enormous. But as tempting as it may be, do not over-optimize your queries for the best possible performance, as this will quickly result in brittle code. Approach your performance targets through parallelization, incremental updates and smart batching. Remember that you need to conduct this cacophony, not play every single instrument at once.