Content Modularity: More Than Just Data Normalization

Guest Author
Oct. 21 2009, 12:25PM EDT

This guest post comes from Daniel Jacobson, Director of Application Development for NPR. Daniel leads NPR’s content management solutions, is the creator of the NPR API and is a frequent contributor to the Inside NPR.org blog.

As discussed in my previous post, COPE (Create Once, Publish Everywhere) is a fundamental philosophy that drives NPR's digital publishing and distribution strategy and is the foundation of the NPR API. Supporting it all is a single system that manages all incoming content and funnels it out through a single distribution pipe, regardless of content type or destination. A key principle that supports COPE is ensuring that content is stored in a modular way.

Modular storage of content is more than just database normalization. It requires strategic design of the data model to ensure that discreet objects are stored in distinct locations. To create the right design, you must truly understand your system and the assets that it stores. That is, you need to be able to identify and represent the object (or series of objects) that is at the core of your system. For NPR, the core of the system is a story. We then attach "resources" to the story, each of which is its own object in the database (examples of resources include full text with each paragraph stored as distinct records, audio, video, images, related links, and a range of other object types). Then stories get attached to lists, which are essentially a series of taxonomies that help our systems slice through the stories.

npr_entity_diagram1

The diagram above is a basic entity diagram of how NPR manages data for a story, some related resources and the list to which the stories are assigned. This is a conceptual model that represents how these entities relate to each other and does not include all resource or list entities in the system. The physical model, obviously, is much more complex. Click here for a larger and more complete view of this diagram (PDF).

NPR’s system is obviously much more complicated than this, but the breakdown of story/resources/lists is the foundation of it all. Accordingly, storage of this information in the database needs to ensure that all of these objects can be manipulated independently. With this approach, NPR is able to create a list of all images in the system, or all stories that have video, or all stories in the News topic, or any number of other combinations of stories or resources. The power of this modularity is that we have tremendous control over what gets distributed to each destination. And the distribution of content for all of these scenarios is the same simple REST-based API, requiring no special coding to generate the content for the different destinations.

npr_xmlsamp

The above is an excerpt of XML outputted from the NPR API. Clean, effective storage of the content makes it a simpler and more flexible process to manage it differently as it gets distributed to different destinations. Click here to see an expanded view of the XML with annotation detailing how it maps to the entity diagram.

Conversely, WPT's tend to store objects to enable the building of a web page. As a result, the content may be bundled together in database fields, storing the actual references to images, video and audio entirely within the story content text. It is still possible that the WPT's are adhering to some form of data normalization in their storage techniques, but that does not mean that these systems are embracing COPE.

There are two significant problems with the WPT approach of data storage. First, as an example, the image references within the block of text will contain HTML and possibly other markup, making the text block dirty. Any distribution to other platforms could then require special treatment to prepare the content for that destination. More importantly, however, is the fact that these same images are very difficult to repurpose because they are embedded in text. So, it would be quite a challenge to make a feed of images, to identify only those posts that contain images, to resize some or all images in the system, or to consistently restrict distribution of images that do not have the rights cleared.

Building systems that manage the content in a modular way and separates it from display sets it up well to be distributed on a range of platforms. The final piece to the puzzle, however, is content portability. Content portability ensures that the content can actually live and thrive in all platforms to which it gets distributed (even those that do not yet exist). Building a distribution channel, like an API, is simply not enough anymore. Content portability must be applied at the CMS level, which will be the topic of my next article.

Guest Author

Comments

Comments(16)

User HTML

  • Allowed HTML tags: <a> <em> <strong> <cite> <blockquote> <code> <ul> <ol> <li> <dl> <dt> <dd>
  • Lines and paragraphs break automatically.
  • Web page addresses and e-mail addresses turn into links automatically.

[...] Content Modularity: More Than Just Data Normalization &#8211; Modular storage of content is more than just database normalization. It requires strategic design of the data model to ensure that discreet objects are stored in distinct locations. To create the right design, you must truly understand your system and the assets that it stores. That is, you need to be able to identify and represent the object (or series of objects) that is at the core of your system. For NPR, the core of the system is a story. We then attach &ldquo;resources&rdquo; to the story, each of which is its own object in the database (examples of resources include full text with each paragraph stored as distinct records, audio, video, images, related links, and a range of other object types). Then stories get attached to lists, which are essentially a series of taxonomies that help our systems slice through the stories. Post a comment | Trackback URI [...]

I love to see what NPR is doing. One quibble with the article: the acronym "WPT" is referenced several times, but never defined. A search on your site returns no hits for that term. I'm guessing "Web Publishing Tool?"

Mark,

Thanks for the comment. This article is actually the second in a three-part series. WPT is defined in the first article, which can be found at http://blog.programmableweb.com/2009/10/13/cope-create-once-publish-ever....

You are correct, though, that in my previous article, I make the distinction between WPT (Web Publishing Tool) and CMS, where a WPT focuses on publishing content to a single platform and a CMS focuses on platform-agnostic content management.

Rahul,

Thanks for the comment. We haven't focused on PRISM yet, but it is something that we are aware of. Right now, we are actively planning to incorporate PBCore as an output, which is also partially based on the Dublin Core standard. As more people adopt PRISM, however, and as those people request it from the NPR API, we will certainly consider adding it as an output type.

[...] easiest to just use XML throughout the entire API architecture. Our API starts with a NO-SQL like XML repository that closely resembled the data schema in our database. We then used this XML throughout the entire process, creating a super XML document [...]