Semantic Search the US Library of Congress

Raymond Yee
Apr. 29 2008, 01:29AM EDT

As the national library of the United States, the Library of Congress has created vast amounts of metadata to describe books and other documents in its collection. Among this metadata is the Library of Congress Subject Headings (LCSH), a "controlled vocabulary" for classifying documents by subject. In other words, experts at the Library of Congress have come up with a (large) list of subject headers from which catalogers of documents can choose. As an example, if you look at the Library of Congress record for Tim Berners-Lee's book Weaving the Web, you'll see that it is classified under "World Wide Web", specifically "World Wide Web--History".

Update: The website hosting the LCSH API appears to no longer make the API available.

Since the Library of Congress isn't the only entity that classifies documents, you can imagine that other entities (and not just libraries) would be interested in reusing the LCSH vocabulary. But how should the Library of Congress make LCSH available so that it can be easily reused?

That's where the recent release of lcsh.info comes in (see also the lcsh.info ProgrammableWeb Profile):

This is an experimental service that makes the Library of Congress Subject Headings available as linked-data using the SKOS vocabulary. The goal of lcsh.info is to encourage experimentation and use of LCSH on the web with the hopes of informing a similar effort at the Library of Congress to make a continually updated version available. More information about the Linked Data effort can be found on the W3C Wiki.

Let's look at what you can do with lcsh.info through a couple of examples. First, we return to the subject heading World Wide Web, this time accessible from lcsh.info as

http://lcsh.info/sh95000541

Note the form of the URL: http://lcsh.info/{lccn} where lccn refers to the Library of Congress Control Number (LCCN), an identifier of the subject heading. In this case, the LCCN for World Wide Web is sh95000541.
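The URL pattern can be captured in a few lines of Python. This is only a sketch: the function name lcsh_url and its base parameter are made up for illustration and are not part of lcsh.info.

```python
def lcsh_url(lccn: str, base: str = "http://lcsh.info/") -> str:
    """Build the lcsh.info URL for a subject heading from its LCCN.

    The helper name and the `base` default are illustrative assumptions;
    only the http://lcsh.info/{lccn} pattern comes from the article.
    """
    lccn = lccn.strip()
    if not lccn:
        raise ValueError("LCCN must be non-empty")
    return base + lccn


# The LCCN for World Wide Web, as given above.
print(lcsh_url("sh95000541"))  # prints http://lcsh.info/sh95000541
```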

If you drop this URL into your browser, you'll get the default format or representation of the information lcsh.info has about the World Wide Web subject header, including:

  • broader terms (e.g., Hypertext systems)
  • narrower terms (e.g., Semantic Web)
  • related terms (e.g., Internet)
  • alternative labels (e.g., W3 (World Wide Web))

The diagram below illustrates some of these relationships:

[Figure: lcshgraph.png]

To facilitate reuse of the data, lcsh.info offers its data in a variety of formats that can be accessed via content negotiation. That is, you use the Accept HTTP header to specify which of the following content types you want:

  • XHTML (with embedded RDFa), which is the default value (application/xhtml+xml)
  • JSON (application/json)
  • RDF/XML (application/rdf+xml)
  • N3 (text/n3)

For example, you can use curl to get the JSON representation of the World Wide Web subject header:

curl -v -L -H "Accept: application/json" http://lcsh.info/sh95000541
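If you'd rather do the same content negotiation from Python, something like the following should work using only the standard library. Treat it as a sketch: the FORMATS table and the build_request helper are names invented here, and given the update above the request is only constructed, not sent.

```python
import urllib.request

# Content types listed in the article; the short keys are our own labels.
FORMATS = {
    "xhtml": "application/xhtml+xml",
    "json": "application/json",
    "rdfxml": "application/rdf+xml",
    "n3": "text/n3",
}


def build_request(lccn, fmt="json"):
    """Return a Request whose Accept header asks for the given format."""
    url = "http://lcsh.info/" + lccn
    return urllib.request.Request(url, headers={"Accept": FORMATS[fmt]})


req = build_request("sh95000541", "json")
print(req.full_url, req.get_header("Accept"))

# To actually send it (this needs the service to be reachable):
# with urllib.request.urlopen(req) as resp:
#     body = resp.read().decode("utf-8")
```

Like curl's -L flag, urlopen follows HTTP redirects by default, so no extra handling should be needed for that part.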

By looking at the RDF/XML and N3 representations, you can see a concrete example of how semantic web approaches express notions of broader, narrower, and related terms as well as alternative labels using:

  • Simple Knowledge Organization System (SKOS), which is "a model for expressing the basic structure and content of concept schemes such as thesauri, classification schemes, subject heading lists, taxonomies, folksonomies, and other types of controlled vocabulary"
  • design rules for linked data to represent the network of interconnected subject headings
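To make the SKOS idea a bit more concrete, here is a rough Python sketch that models the World Wide Web heading with the SKOS properties prefLabel, altLabel, broader, narrower, and related. The Concept class and to_statements function are hypothetical, not lcsh.info's actual data model. One simplification to flag: in real SKOS data, broader, narrower, and related point at the URIs of other concepts, but the article doesn't give the LCCNs of those headings, so the sketch uses their labels instead.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A subject heading described with SKOS-style properties (illustrative only)."""
    uri: str
    pref_label: str
    alt_labels: list = field(default_factory=list)
    broader: list = field(default_factory=list)   # labels stand in for concept URIs
    narrower: list = field(default_factory=list)
    related: list = field(default_factory=list)


def to_statements(c):
    """Yield (subject, SKOS property, value) tuples for one concept."""
    yield (c.uri, "skos:prefLabel", c.pref_label)
    for prop, values in (
        ("skos:altLabel", c.alt_labels),
        ("skos:broader", c.broader),
        ("skos:narrower", c.narrower),
        ("skos:related", c.related),
    ):
        for value in values:
            yield (c.uri, prop, value)


# Example terms taken from the list earlier in the article.
www = Concept(
    uri="http://lcsh.info/sh95000541",
    pref_label="World Wide Web",
    alt_labels=["W3 (World Wide Web)"],
    broader=["Hypertext systems"],
    narrower=["Semantic Web"],
    related=["Internet"],
)

for statement in to_statements(www):
    print(statement)
```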

This experimental but promising service may soon pave the way for full production level web services from the Library of Congress.
