How to Use APIs and a Data Integration Layer to Manage SaaS Silos

As business computing migrates to SaaS applications, developers (and those whom they represent) must address the problem of the data silos the software creates. For example, a business might use a marketing automation system, a separate CRM, and a lead-generation system, each of which addresses a specific need but which collectively create a fragmented data model. APIs are the accepted mechanism for moving data in and out of these online applications and ensuring that everything stays in sync. That's great, but it presents a new set of problems: managing data in several applications using several APIs.

One way developers deal with this issue is to use data integration tools acting as middleware between APIs. Data integration tools can provide a lot of benefits, such as creating a central store of information about the forms of data transformation used and reporting for data governance needs. But integration products can add their own burdens, such as increased cost, latency within the application stack, and complexity. This becomes especially troublesome when you need multiple layers of integration tools to address a robust suite of APIs, an infinite regression problem known as "middleware all the way down."

I believe in pragmatic and cost-effective solutions to database problems. For example, I am in favor of breaking a normalized relational model in a SQL schema by duplicating some key values if doing so dramatically speeds up common queries. When it comes to integrating APIs, a small- or medium-sized business (SMB) may be more concerned with software costs and generally doesn't face strict data governance requirements. For my clients, all SMBs, I often adopt the pragmatic architecture of creating what I call a data integration layer (DIL).

A DIL is a combination of a database and custom code that is optimized to aggregate data from multiple sources (both internal and SaaS-based) and to deliver the data as quickly as possible. The specific database and programming language are a matter of personal taste; you could use any language (such as PHP, Python, or Node.js) against the same data store.
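To make the idea concrete, here is a minimal sketch of a DIL table in Python with SQLite standing in for the data store (the table, columns, and source systems are illustrative assumptions, not a prescribed schema):

```python
import sqlite3

# One DIL row per customer, aggregating fields from several hypothetical silos.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dil_customer (
        email TEXT PRIMARY KEY,
        crm_stage TEXT,          -- aggregated from the CRM
        open_tickets INTEGER,    -- aggregated from a support tool
        lifetime_value REAL      -- aggregated from a billing system
    )
""")

def upsert_customer(email, **fields):
    """Merge fields from any source into the customer's single DIL row.
    Column names come from our own code, never from user input."""
    conn.execute("INSERT OR IGNORE INTO dil_customer (email) VALUES (?)", (email,))
    for column, value in fields.items():
        conn.execute(f"UPDATE dil_customer SET {column} = ? WHERE email = ?",
                     (value, email))
    conn.commit()

# Each background collector writes only the columns it owns.
upsert_customer("ada@example.com", crm_stage="negotiation")
upsert_customer("ada@example.com", open_tickets=2, lifetime_value=1240.0)

row = conn.execute("SELECT crm_stage, open_tickets FROM dil_customer "
                   "WHERE email = ?", ("ada@example.com",)).fetchone()
print(row)  # ('negotiation', 2)
```

The point of the design is that each collector only knows about its own silo, while every application query hits one local table.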

Only data used to respond to frequent queries across multiple silos needs to be stored in the DIL, and in most cases this can be aggregated data. When done right, a DIL is lightweight, compared to other integration architectures.


Let’s examine a specific use case as an example of the DIL architecture.

A sales department decides to add a chat window to the company's website to allow interaction with site visitors. But the stakeholders soon realize that this chat should be personalized to the visitor's history with the company. No problem: Many chat products have a callback API that allows the chat application to communicate with an external data source when the page is loaded. This can create a direct connection between the application and a customer relationship management (CRM) tool, such as Salesforce. The result is a personalized message displayed to the site visitor and a user profile shown in the salesperson's chat dashboard.
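As a sketch of what such a callback handler might look like (the payload shape, field names, and in-memory CRM dict are hypothetical; a real handler would query the CRM or DIL):

```python
import json

# Stand-in for CRM data; in production this lookup would hit the DIL database.
CRM = {
    "ada@example.com": {"name": "Ada", "last_order": "2024-11-02", "tier": "gold"},
}

def chat_callback(payload: str) -> str:
    """Handle the chat product's page-load callback: receive the visitor's
    identity, return a personalized greeting plus a profile for the
    salesperson's dashboard."""
    visitor = json.loads(payload)
    profile = CRM.get(visitor.get("email"))
    if profile is None:
        return json.dumps({"greeting": "Welcome! How can we help?"})
    return json.dumps({
        "greeting": f"Welcome back, {profile['name']}!",
        "agent_profile": profile,
    })

response = json.loads(chat_callback('{"email": "ada@example.com"}'))
print(response["greeting"])  # Welcome back, Ada!
```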

But what if you also need data from outside the CRM to fully implement your sales model, such as a customer subscription system? You could upload this external data to the CRM in advance, but that’s asking for trouble. Eventually, you find that you either accept the CRM as the centralized store for all business data (something Salesforce would agree with), or you have to connect multiple APIs to the chat application. This creates inevitable latency and sluggish page delivery.

The problem can be exacerbated by the need to add more processing control over the data delivered to the chat application through custom code running on one of your servers. The latency is multiplied if you need to pull data from one SaaS system for every page view to allow the custom code to decide how to make the next API call to another SaaS product. Another gotcha is that SaaS APIs generally have rate limits that may be well below the rate of pageviews required to serve an active site.
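When you must call a rate-limited API on the live request path, a client-side throttle is one common mitigation. Here is a minimal token-bucket sketch (the limit of 2 calls per 60 seconds is illustrative; real SaaS limits vary by product and plan):

```python
import time

class TokenBucket:
    """Minimal client-side throttle to stay under a SaaS API rate limit."""
    def __init__(self, calls, per_seconds):
        self.capacity = calls
        self.tokens = float(calls)
        self.rate = calls / per_seconds  # tokens refilled per second
        self.updated = time.monotonic()

    def allow(self):
        """Return True if a call may proceed now, consuming one token."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(calls=2, per_seconds=60)
results = [bucket.allow() for _ in range(3)]
print(results)  # [True, True, False]
```

A throttle keeps you under the limit, but it does so by refusing calls, which is exactly why the caching approach described next is usually the better fix for page-view-driven traffic.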

My work with the Twitter API led me to the pattern of caching data I might need to serve any application built on Twitter. My new design philosophy is to build a separate background process for automated data collection from the API, and then build applications on the resulting database. This isolates applications from latency and rate limit issues. I’ve adopted the same data architecture as the solution to integrating APIs from SaaS silos.
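The pattern can be sketched as two decoupled halves: a collector that pulls from the API on its own schedule, and an application path that only ever reads the local cache (the API shape and field names here are stand-ins):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cache "
             "(user_id TEXT PRIMARY KEY, followers INTEGER, fetched_at REAL)")

def fake_api_fetch(user_id):
    # Stand-in for a real SaaS API call, which would be slow and rate-limited.
    return {"user_id": user_id, "followers": 1500}

def collect(user_ids, fetch=fake_api_fetch):
    """Background job: pull from the API and cache locally."""
    for uid in user_ids:
        data = fetch(uid)
        conn.execute("INSERT OR REPLACE INTO cache VALUES (?, ?, ?)",
                     (uid, data["followers"], time.time()))
    conn.commit()

def followers(user_id):
    """Application path: reads only the cache -- no API latency, no rate limits."""
    row = conn.execute("SELECT followers FROM cache WHERE user_id = ?",
                       (user_id,)).fetchone()
    return row[0] if row else None

collect(["@ada"])
print(followers("@ada"))  # 1500
```

The application's response time is now bounded by a local database read rather than by an external API, and the collector can pace itself to respect rate limits.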

Two major benefits of a DIL are improved query performance and centralized logic. APIs often have weak or non-existent query capabilities. When queries are simple, it's no big deal; you can fetch a single entity, such as a customer, or a list of entities. But filtering results down to only those that satisfy a complex query is rarely possible through the API directly.
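Once the silos' data lives in one DIL table, that kind of complex filter becomes a single SQL query. A sketch, with an invented table combining CRM and billing fields:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE dil_customer (
    email TEXT PRIMARY KEY, crm_stage TEXT, mrr REAL, open_tickets INTEGER)""")
conn.executemany("INSERT INTO dil_customer VALUES (?, ?, ?, ?)", [
    ("a@x.com", "negotiation", 500.0, 0),
    ("b@x.com", "closed-won", 1200.0, 3),
    ("c@x.com", "closed-won", 90.0, 0),
])

# A cross-silo filter that most SaaS APIs cannot express in one call:
# paying customers who have closed and also have unresolved support tickets.
hot = conn.execute("""
    SELECT email FROM dil_customer
    WHERE crm_stage = 'closed-won' AND mrr > 100 AND open_tickets > 0
""").fetchall()
print(hot)  # [('b@x.com',)]
```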

I can't blame the API designers. They don't want to allow users to make complex ad-hoc queries that can load down their database servers. It's easier for them to deliver a large, generic payload all at once and let a client break it apart. I agree, but I'd rather build a centralized DIL that acts as the client rather than spreading that processing throughout multiple applications.

The build vs. buy objection is a common reason to dismiss a DIL. I'm often asked, "Don't Hubspot, Salesforce, etc. already do this? Why not just leave this work up to the SaaS tools we already pay for?" Sorry to be cynical, but I've been told too often by a SaaS company's sales or support person that I can do anything I want because the product has an API. The technical reality is that you can use the API to extract data, but then it's up to you to make efficient use of it. Hence the creation of a DIL.

My personal preference for a DIL is that it allows a human brain to be added to a stack of SaaS tools and middleware. Trying to bend someone else's API and database schema to fit a specific application can seem like an endless negotiation. In the end, adding custom logic and an optimized schema to the stack is the most direct path to a solution. The key is not to be too much of a purist, insisting on middleware integration tools as the only solution for every type of organization. Sometimes a small amount of glue code in Python or PHP backed by a MySQL database is the right-sized DIL.

But don't treat the DIL as a silver bullet, either. It can only be taken so far. When the development team that needs to manage integration grows too large, a more robust solution may be needed.
