Yahoo Term Extraction is Reborn as Content Analysis

The new Yahoo Content Analysis API capitalizes on the success of its term extraction API while adding new functionality. The reborn implementation not only extracts terms and entities but also ranks them by their perceived importance within the article. Peter Levinson, product manager for Yahoo Content Analysis, says, “The new API gives developers actionable metadata about their content. By extracting and then ranking terms on the page, the API separates signal from noise in unstructured content.”

As an added bonus, these extracted entities are matched against Wikipedia articles. Because the Wikipedia URL is unique, it can act as an identifier for the entity. Levinson notes that the Wikipedia IDs add value “by giving developers the Wikipedia IDs for most of these terms, their content becomes even more structured for those entities since now they can use those ids to build relationships among their documents.”
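
To make the identifier idea concrete, here is a minimal sketch in Python. It assumes extraction has already produced a list of Wikipedia URLs for each document. The document names, the URLs, and the `related_documents` helper are all hypothetical and are not part of the Yahoo API or its output; the sketch only shows how a shared URL can link documents to one another.

```python
from collections import defaultdict

# Hypothetical extraction results: document name -> Wikipedia URLs of its entities.
# These values are made up for illustration; they are not real API output.
doc_entities = {
    "post-a": [
        "http://en.wikipedia.org/wiki/Yahoo!",
        "http://en.wikipedia.org/wiki/Wikipedia",
    ],
    "post-b": ["http://en.wikipedia.org/wiki/Wikipedia"],
    "post-c": ["http://en.wikipedia.org/wiki/Apple_Inc."],
}


def related_documents(doc_entities):
    """Map each document to the other documents sharing at least one entity URL."""
    # Invert the mapping: entity URL -> set of documents that mention it.
    by_entity = defaultdict(set)
    for doc, urls in doc_entities.items():
        for url in urls:
            by_entity[url].add(doc)

    # Two documents are related when they appear under the same entity URL.
    related = {doc: set() for doc in doc_entities}
    for docs in by_entity.values():
        for doc in docs:
            related[doc] |= docs - {doc}
    return related


related = related_documents(doc_entities)
```

Because the URL is the key, two documents that refer to the same entity with different wording would still be grouped together, provided the extractor resolves both mentions to the same Wikipedia article.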

My big question is: how do they do it? Semantic analysis is a popular area now, but those services are powered by artificial intelligence. Is there AI at work here? Is the Yahoo! approach somehow simpler than that? I’m hoping that we can catch up with Yahoo! to discuss a few of the ingredients that make up the secret sauce behind entity recognition, ranking, and matching against Wikipedia.

This feature is part of the YQL platform, which I have to admit I only recently discovered. Maybe I’m a bit late to that party, but I’m interested to see where they take it. It reminds me a lot of what our friends over at Datafiniti are doing.

Semantic analysis would be difficult to perform on one's own, so it makes perfect sense to build those smarts into your application via API. This new offering from Yahoo provides valuable analysis and is a great way for Yahoo to endear itself to the developer community.

Garrett Wilkin




Pretty good. Try this query in the YQL console:

> SELECT * FROM contentanalysis.analyze WHERE text="Wikipedia has a lot of interesting discussions about Apple."

Thank you for joining the discussion, Nicolas! Do you have links to any of these articles to share with our readers?

The Yahoo! Content Analysis platform is based on a technology we built and use every day at Yahoo!. It was developed in collaboration with Yahoo! Labs. Several papers have been published about it.

By the way, YQL was introduced in 2008 to provide unified access to APIs and Open Data tables on the web... It's not new, it's widely used among developers, and it has been copied quite a bit :)