Using Wikipedia as a Web Database

John Musser
Aug. 09 2007, 12:17AM EDT

Ever want to programmatically query Wikipedia? It's a tempting dataset, with over 1.6 million articles, yet there's no official API. While there's been a rumor that the Wikipedia team will supply an API at some point, for now you can use an API we just listed here: the DBpedia API. The project is headed by a team of German university researchers, who describe it this way: "DBpedia.org is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia and to link other datasets on the Web to Wikipedia data." More from their introduction:

Wikipedia is by far the largest publicly available encyclopedia on the Web. Wikipedia editions are available in over 100 languages, with the English one accounting for more than 1.6 million articles. Wikipedia has the problem that its search capabilities are limited to full-text search, which only allows very limited access to this valuable knowledge base.

Semantic Web technologies enable expressive queries against structured information on the Web. The Semantic Web has the problem that there is not much RDF data online yet and that up-to-date terms and ontologies are missing for many application domains.

The DBpedia.org project approaches both problems by extracting structured information from Wikipedia and by making this information available on the Semantic Web. DBpedia.org allows you to ask sophisticated queries against Wikipedia and to link other datasets on the Web to DBpedia data.

Wikipedia articles consist mostly of free text, but also contain different types of structured information, such as infobox templates, categorisation information, images, geo-coordinates and links to external Web pages. This structured information can be extracted from Wikipedia and can serve as a basis for enabling sophisticated queries against Wikipedia content.
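To make the infobox idea concrete, here is a toy Python sketch of how "key = value" pairs in an infobox template can be turned into RDF-style triples. It is purely illustrative: the article, the field names, and the one-regex parser are hypothetical, and the real DBpedia extraction is far more involved.

```python
import re

# A simplified infobox snippet; the article and field names are hypothetical.
WIKITEXT = """{{Infobox musician
| name       = Johnny Cash
| birth_date = February 26, 1932
| origin     = Kingsland, Arkansas
}}"""

# Namespaces DBpedia uses for resources and extracted properties.
RESOURCE = "http://dbpedia.org/resource/"
PROPERTY = "http://dbpedia.org/property/"

def infobox_to_triples(wikitext: str, article: str):
    """Turn '| key = value' infobox lines into (subject, predicate, object) triples."""
    subject = RESOURCE + article
    for key, value in re.findall(r"\|\s*(\w+)\s*=\s*(.+)", wikitext):
        yield (subject, PROPERTY + key, value.strip())

for s, p, o in infobox_to_triples(WIKITEXT, "Johnny_Cash"):
    print(s, p, repr(o))
```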

The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. We use the SPARQL query language to query this data.
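As a rough example, here is what a query against DBpedia's public SPARQL endpoint (http://dbpedia.org/sparql) might look like from Python. The endpoint URL is real, but the specific predicates in the query are assumptions; the exact property names depend on what the extraction pulled out of the infoboxes.

```python
import json
import urllib.parse
import urllib.request

# DBpedia's public SPARQL endpoint.
ENDPOINT = "http://dbpedia.org/sparql"

# Example query: people born in Berlin. The predicate names here are
# assumptions; the properties actually available depend on the extraction.
QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?person ?name WHERE {
  ?person <http://dbpedia.org/property/birthPlace>
          <http://dbpedia.org/resource/Berlin> .
  ?person rdfs:label ?name .
  FILTER (lang(?name) = "en")
}
LIMIT 10
"""

# Ask for results in the standard SPARQL JSON results format.
params = urllib.parse.urlencode({
    "query": QUERY,
    "format": "application/sparql-results+json",
})

with urllib.request.urlopen(f"{ENDPOINT}?{params}") as resp:
    results = json.load(resp)

for binding in results["results"]["bindings"]:
    print(binding["name"]["value"])
```

Because the endpoint speaks the standard SPARQL protocol over HTTP, any SPARQL client library would work just as well as this hand-rolled GET request.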

The DBpedia dataset currently consists of around 91 million RDF triples, which have been extracted from the English, German, French, Spanish, Italian, Portuguese, Polish, Swedish, Dutch, Japanese and Chinese versions of Wikipedia. The DBpedia dataset describes 1,600,000 concepts, including at least 58,000 persons, 70,000 places, 35,000 music albums, and 12,000 films. It contains 557,000 links to images, 1,300,000 links to relevant external web pages, 207,000 Wikipedia categories and 75,000 YAGO categories.

The project also includes some interesting utilities, such as an integrated online debugger and a tool called the Relationship Finder, which lets you explore the relationship between any two things in the dataset. In the example below, the Relationship Finder traces the degrees of separation between Kevin Bacon and Johnny Cash.

[Screenshot: the Relationship Finder tracing a chain of links between Kevin Bacon and Johnny Cash]
It will be interesting to see what sorts of applications get built on this API, and whether more public SPARQL/RDF APIs start appearing.


Comments


Hi Danny, good to hear from you, and thanks for the pointer to the LinkingOpenData project. Very interesting to see the scale of the linked datasets. And yes, handy diagram to boot! A very good resource to know about.
