What We Did Wrong: NPR Improves its API Architecture

This guest post comes from Zach Brand, Sr. Director of Technology at NPR. Zach is responsible for technical strategy and operation for NPR’s Digital Media efforts including the NPR API, npr.org, and mobile apps, and is a contributor to the inside NPR.org blog.

NPR launched our API in July 2008. This API was the technology keystone of our strategy at NPR to solve cross-media challenges by ensuring content could be ported to any presentation layer (websites, mobile apps, etc.). While the API has been a huge success, the code architecture behind it was built on some inaccurate assumptions. NPR recently finished refactoring the code behind the API; we have already seen a significant improvement in performance and are better positioned for the API’s future growth.

Assumptions in our Architecture

APIs are a quick path to having flexible content that embraces the idea of COPE (Create Once, Publish Everywhere) that has been discussed here before. Building APIs allows content producers to be much more nimble in deploying content to multiple platforms. Having an API has allowed us at NPR.org to be highly efficient at building for new platforms such as iPhone, Android, iPad and Chrome apps, because we only have to build the presentation logic: the ‘data’ is already ready to go. In fact, in one 12-month period after launching the API we doubled our online audience with the launch of 11 major products, including a website redesign, all done with limited dev resources. Truly, one of the biggest benefits of flexible content is the ability for your development team to be very efficient and nimble.

Ironically, the actual architecture underneath the NPR API turned out to be somewhat inflexible. So while the API supports nimble changes to our other products, making changes to the API itself was rather difficult.

There were two very reasonable assumptions we made early on that significantly contributed to the rigidity of our API codebase. They were:

Assumption 1) That a serial progression of data, from receiving the request to providing the results, was best. This process was very straightforward and seemed the simplest approach. With this architecture, as shown in the diagram below, we receive a request, parse the request, find the list of matching stories, gather the XML for those stories, transform it based on content/rights exclusions, transform it to the desired output, and provide the response.

NPR's serial API architecture
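
As a rough illustration of the serial flow in the diagram, the sketch below chains one function per stage, each consuming the previous stage's output. It is written in Python rather than NPR's PHP, and every name in it (REPOSITORY, parse_request, handle_request and so on), the query-string format and the sample stories are hypothetical stand-ins, not NPR's actual code or data.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the XML repository, keyed by story id.
REPOSITORY = {
    1: "<story id='1'><title>Morning News</title><rights>public</rights></story>",
    2: "<story id='2'><title>Partner Feature</title><rights>restricted</rights></story>",
}

def parse_request(query_string):
    """Turn a query string such as 'id=1,2&output=text' into parameters."""
    params = dict(pair.split("=", 1) for pair in query_string.split("&"))
    params["id"] = [int(i) for i in params["id"].split(",")]
    return params

def find_matching_stories(params):
    """Return the requested ids that exist in the repository."""
    return [i for i in params["id"] if i in REPOSITORY]

def gather_xml(story_ids):
    """Concatenate the individual story XML into one combined document."""
    return "<list>" + "".join(REPOSITORY[i] for i in story_ids) + "</list>"

def apply_exclusions(combined_xml):
    """Drop stories the caller may not see (assumed rule: public only)."""
    root = ET.fromstring(combined_xml)
    for story in list(root):
        if story.findtext("rights") != "public":
            root.remove(story)
    return root

def transform_output(root, output):
    """Render the surviving stories as plain text or as XML."""
    if output == "text":
        return "\n".join(story.findtext("title") for story in root)
    return ET.tostring(root, encoding="unicode")

def handle_request(query_string):
    """Run every stage strictly in order; no stage can revisit an earlier one."""
    params = parse_request(query_string)
    story_ids = find_matching_stories(params)
    combined = gather_xml(story_ids)
    filtered = apply_exclusions(combined)
    return transform_output(filtered, params.get("output", "xml"))
```

Calling handle_request("id=1,2&output=text") walks all five stages in sequence. Because each stage sees only the output of the stage before it, a stage that ends up with nothing usable has no way to go back and ask for more raw data.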

Assumption 2) That it would be easiest to just use XML throughout the entire API architecture. Our API starts with a NoSQL-like XML repository that closely resembles the data schema in our database. We then used this XML throughout the entire process, creating a super XML document out of the individual story XML from the repository, doing selections with XPath, and only transforming the XML at the end, as necessary, to the appropriate output.
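
To make the "super XML document" idea concrete, here is a minimal Python sketch (not NPR's PHP) that merges individual story documents under one root and then selects from the combined tree with path expressions. The element names and sample stories are invented for illustration, and Python's xml.etree.ElementTree supports only a subset of XPath, so this shows the approach rather than the original queries.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-story documents as they might come out of a repository.
story_docs = [
    "<story id='1'><title>Morning News</title><audio>a1.mp3</audio></story>",
    "<story id='2'><title>Evening Recap</title></story>",
]

# Build one combined document holding every story.
super_doc = ET.Element("list")
for doc in story_docs:
    super_doc.append(ET.fromstring(doc))

# Select from the combined document with path expressions.
titles = [node.text for node in super_doc.findall("./story/title")]
ids_with_audio = [story.get("id") for story in super_doc.findall("./story[audio]")]
```

Every selection here depends on the exact shape of the combined tree, so the queries and the document structure have to evolve together.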

So what was wrong?

First, some API queries are not easily done in a serial process. In a given query we may discover late in the serial process that we’ve filtered out all the results ‘in hand’ and need to start our selection of raw data over. Alternatively, it may be preferable to transform data upstream based on output type.

Second, it turns out that while the XML storage of data has been great for high-performance retrieval, XPath is cumbersome to use on giant collections of data. Navigating the namespace in XPath turned out to be very difficult if we nested too many documents. Further, if we wanted to add or change a tag in the XML in our repository, we had significant overhead ensuring that our pre-existing XPath queries wouldn’t break. Finally, we ran the risk of breaking XPath queries downstream if we stripped superfluous data (i.e., data that wasn’t requested). All of this meant that we had a steep learning curve for new developers and an increased likelihood of mistakes.
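
The fragility described above can be reproduced in a few lines. In this Python sketch the documents, the wrapper tag and the namespace URI are all made up; it only shows that a path written against one shape of the XML quietly returns nothing once the shape changes or a namespace prefix is introduced.

```python
import xml.etree.ElementTree as ET

ORIGINAL = "<story><title>Morning News</title></story>"
# Hypothetical schema change: the title moves inside a new wrapper tag.
CHANGED = "<story><headline><title>Morning News</title></headline></story>"
# Hypothetical namespaced variant of the same story.
NAMESPACED = (
    "<story xmlns:n='http://example.org/ns'>"
    "<n:title>Morning News</n:title></story>"
)

def select_title(xml_text, path="./title", namespaces=None):
    """Return the title text, or None when the path matches nothing."""
    node = ET.fromstring(xml_text).find(path, namespaces)
    return None if node is None else node.text

before = select_title(ORIGINAL)
after_change = select_title(CHANGED)          # same query, new structure
after_namespace = select_title(NAMESPACED)    # same query, namespaced tag
fixed = select_title(NAMESPACED, "./n:title", {"n": "http://example.org/ns"})
```

Neither failing call raises an error. The query simply matches nothing, which is why every existing path has to be rechecked by hand after a schema change.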

Improved Architecture

So while the original architecture was smart based on what we knew, we refactored the core API code to address the above and other concerns. The new code has a number of improvements, including parallel access to the data and easier handling of the data objects themselves. Since our API codebase is in PHP, we simply convert the XML data to PHP data objects for easier manipulation. We have a controller class that oversees all processing of the data. So while the old architecture looked like this:

NPR's old API architecture

The new architecture looks like this:

NPR's new API architecture

We did not modify the data entry, normalized data, flattened data or output layers. The API layer, however, has radically changed. Instead of a linear XML-based process, it is now an asynchronous process using PHP data objects.

Where the code to ensure correct XPath selection used to span several classes and was constantly being tweaked to ensure accurate results, the current equivalent is just 7 lines of code.
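
To give a feel for why selection gets shorter once the XML has been turned into plain data objects, here is a small Python sketch. It is not the 7 lines of PHP mentioned above; the Story class, story_from_xml and select_fields are hypothetical names, and the flat one-level story structure is an assumption made for brevity.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

@dataclass
class Story:
    """A plain data object holding one story's fields (hypothetical shape)."""
    id: str
    fields: dict = field(default_factory=dict)

def story_from_xml(xml_text):
    """Convert one story's XML into a Story, one entry per child tag."""
    node = ET.fromstring(xml_text)
    return Story(id=node.get("id"), fields={child.tag: child.text for child in node})

def select_fields(story, requested):
    """Keep only the requested fields; unknown names are silently skipped."""
    return {name: story.fields[name] for name in requested if name in story.fields}
```

For example, select_fields(story_from_xml("<story id='1'><title>T</title><teaser>x</teaser></story>"), ["title", "byline"]) keeps only the title. Selection becomes a dictionary lookup on a single object instead of a path query that has to know where that story sits inside a larger combined document.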

The process in this new architecture still has many of the same steps as the old architecture. Critically, though, they can run in parallel and have been made into more discrete functions. As shown in the simplified diagram above, the processes include: Request: convert the request as provided by the user into the query or queries to be run. Pass the XML objects to a Model class that creates the data objects. Concurrently, as objects become available, run the Model Rules to select fields as requested, filter based on access rights, and denormalize data where applicable. Additionally, as objects become available, start mapping and filtering to the desired output type using Elements and Document classes that build the tags and the resulting page, respectively.
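
One way to picture stages that work on objects "as they become available" is a set of cooperating tasks connected by queues. The Python sketch below uses asyncio for this. That choice is an assumption made for illustration, since the post does not say which mechanism the PHP code uses, and the function names, the public-only rights rule and the sample stories are all hypothetical.

```python
import asyncio
import xml.etree.ElementTree as ET

# Hypothetical raw story XML as returned by the query step.
RAW = [
    "<story><title>Morning News</title><rights>public</rights></story>",
    "<story><title>Partner Feature</title><rights>restricted</rights></story>",
    "<story><title>Evening Recap</title><rights>public</rights></story>",
]

async def build_objects(raw_docs, out_queue):
    """Model step: turn each XML document into a plain dict."""
    for doc in raw_docs:
        node = ET.fromstring(doc)
        await out_queue.put({child.tag: child.text for child in node})
        await asyncio.sleep(0)  # yield so downstream stages can start early
    await out_queue.put(None)   # sentinel: no more objects

async def apply_rules(in_queue, out_queue, requested):
    """Rules step: filter on rights, then keep only the requested fields."""
    while (story := await in_queue.get()) is not None:
        if story.get("rights") == "public":
            await out_queue.put({k: story[k] for k in requested if k in story})
    await out_queue.put(None)

async def render(in_queue):
    """Output step: map each selected object to a line of text."""
    lines = []
    while (story := await in_queue.get()) is not None:
        lines.append(", ".join(f"{k}: {v}" for k, v in story.items()))
    return lines

async def handle(raw_docs, requested):
    """Controller: start all three stages together and collect the output."""
    objects, selected = asyncio.Queue(), asyncio.Queue()
    _, _, rendered = await asyncio.gather(
        build_objects(raw_docs, objects),
        apply_rules(objects, selected, requested),
        render(selected),
    )
    return rendered

def run(raw_docs, requested=("title",)):
    return asyncio.run(handle(raw_docs, requested))
```

Calling run(RAW) starts all three stages at once, and the first story can be filtered and rendered before the last one has been parsed. Note that asyncio interleaves the stages on a single thread, so this demonstrates the overlapping structure rather than true parallel execution.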

The Results

Development is now much faster on the API, and we can much more easily get new developers up to speed on the code. Ensuring accurate results in the API is also much more straightforward. There were other benefits as well. The new code is more efficient, with the average API response now 22% faster. Further, we have provided cleaner separation of output types and made it easier to accommodate more complex rights-handling rules. As use of our API is doubling approximately every six months, and new feature requirements seemingly come in every week, these changes have been critical for our long-term support and operation of the NPR API.