PW Interview: Jacob Perkins, Text Processing API, NLP meets API

Ajay Ohri
Apr. 10 2013, 12:30PM EDT

Here is an interview with Jacob Perkins, creator of the text-processing API and author of the Python Text Processing with NLTK Cookbook. Jacob talks about why APIs help, accuracy and domain context in sentiment analysis APIs, using Mashape to monetize APIs, and keeping APIs available to enterprise customers.

Ajay- Describe the journey behind creating the text-processing APIs.

Jacob- The text-processing APIs are an indirect result of writing my book, the Python Text Processing with NLTK Cookbook: http://link.packtpub.com/fzqUNY. After I finished the book, I thought, "wouldn't it be nice if people could see NLTK in action?" So I created some demo pages to display the results of various NLTK functions and trained models, and this became http://text-processing.com/demo/. While I was doing that, I realized that there are probably a lot of programmers who would want access to this functionality, but don't want to or can't program it themselves. Maybe they don't use Python or Java, which are the languages with the most support for NLP. Or maybe they don't want to deal with any of the complexity involved in natural language processing, and just want the results. So shortly after creating the demo pages, I made APIs providing the same functionality. Once there was sufficient demand, I had to monetize the APIs, which led me to Mashape.

Ajay- How do these APIs compare on accuracy and context vis a vis existing APIs, especially sentiment analysis which is offered by Chatterbox and Viralheat (among others)?

Jacob- Accuracy is very domain dependent. An API that's very accurate on tweets might not be good for reviews, and vice-versa. Or even more specific, an API that's great at sentiment analysis for movie reviews could have low accuracy on electronics reviews. So it's hard to compare general accuracy, and there's also no gold-standard sentiment dataset to use for comparison, nor should there be, because domain specific models will always beat generic models within that domain.

What matters to API consumers is whether the API is accurate on their data, and the only way to find that out is to test & validate each API.

But here are some details on the text-processing English sentiment API. It is currently composed of 6 models trained on movie reviews, plus 2 models trained on tweets. So obviously it will be most accurate on text that looks like a movie review, but can also provide decent results for tweet-like text. I tested these models by training on 3/4 of the training corpus, then testing the model against the remaining 1/4. The movie review models were in the 85-90% accuracy range, while the tweet models got to ~80% accuracy.
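(Editor's sketch, not Jacob's code.) The 3/4 train, 1/4 test check described above can be roughly illustrated in plain Python. The names holdout_split, accuracy, and train_majority_baseline are hypothetical, the example data is made up, and the majority-label baseline merely stands in for a real trained model or an API call. To validate an API on your own data, you would pass a function that calls the API as classify.

```python
import random
from collections import Counter


def holdout_split(examples, train_fraction=0.75, seed=0):
    """Shuffle labeled examples and split them into (train, test) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(classify, labeled_examples):
    """Fraction of (text, label) pairs where classify(text) equals label."""
    if not labeled_examples:
        raise ValueError("need at least one labeled example")
    correct = sum(1 for text, label in labeled_examples if classify(text) == label)
    return correct / len(labeled_examples)


def train_majority_baseline(train):
    """Hypothetical stand-in model: always predicts the most common training label."""
    majority_label = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda text: majority_label


if __name__ == "__main__":
    # Made-up examples; a real check needs labeled text from your own domain.
    data = [
        ("loved every minute", "pos"), ("a wonderful film", "pos"),
        ("great acting", "pos"), ("would watch again", "pos"),
        ("fun and moving", "pos"), ("dull and slow", "neg"),
        ("a waste of time", "neg"), ("terrible script", "neg"),
    ]
    train, test = holdout_split(data)
    classify = train_majority_baseline(train)
    print("held-out accuracy:", accuracy(classify, test))
```

A baseline like this is also a useful floor: a sentiment API that cannot beat "always predict the majority label" on your data is not adding much for that domain.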

In addition to sentiment analysis, the text-processing APIs also provide multi-lingual stemming, part-of-speech tagging, phrase extraction, and named entity recognition. Non-English languages are very underserved by NLP applications, so I hope to make a small dent in that problem.

Ajay- What is your plan for getting developers excited about using your APIs to create mashups and applications?

Jacob- Mashape has been very helpful in this regard, by promoting APIs like mine at hackathons and meetups. I also write about natural language processing at http://streamhacker.com, and I think many of my API consumers find it through my articles. But I don't do much marketing because the API is not my full-time business.

Ajay- Apart from Python, what other languages and technologies have you considered for text processing?

Jacob- I know that Java has quite a few libraries for natural language processing, and I suspect that in a few years, newer languages like Go will start to approach Java & Python in terms of having useful libraries for NLP. But the Python ecosystem is anything but stagnant, with projects like pandas & scikit-learn that continue to make Python more attractive for all things related to data processing & machine learning. So I don't think I'll be switching from Python anytime soon.

Ajay- APIs sometimes get deprecated and even closed when they fail to scale up or attract enough users. How do you plan to scale up your API services to offer continued service to enterprises?

Jacob- The nice thing about the API being a side project is that I'm not subject to such commercial pressures. I don't need to have thousands of customers to keep it online. I just need enough to cover server costs, which right now is a relatively small number. And many enterprises don't actually want to use an external API, because their data is private and/or they want unlimited usage & very low latency. For those customers, I can provide a reduced version of the API, which contains only the functions they need, and can run on any number of Linux servers. This is more economical for everyone, and has the added flexibility that they can train & use their own custom models, without requiring any API changes. I think the future of commercial NLP APIs isn't better one-size-fits-all models, it's having a simpler, more automated process for training highly accurate models in a specific domain, with a generic API "shell" around the custom models.
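(Editor's sketch, not Jacob's implementation.) One way to picture a generic API "shell" around custom models is a small registry that keeps the calling interface fixed while the domain-specific models behind it are swapped or retrained. The class name ModelShell, its methods, and the response shape are all hypothetical.

```python
from typing import Callable, Dict

# A "model" here is any callable that maps text to a label.
Model = Callable[[str], str]


class ModelShell:
    """Hypothetical fixed interface in front of swappable per-domain models."""

    def __init__(self) -> None:
        self._models: Dict[str, Model] = {}

    def register(self, domain: str, model: Model) -> None:
        """Add or replace the model for a domain; callers see no API change."""
        self._models[domain] = model

    def classify(self, text: str, domain: str) -> Dict[str, str]:
        """Run the model registered for the domain and return a uniform response."""
        if domain not in self._models:
            raise KeyError("no model registered for domain: " + domain)
        return {"domain": domain, "label": self._models[domain](text)}


if __name__ == "__main__":
    shell = ModelShell()
    # Placeholder rule standing in for a model trained on a customer's own data.
    shell.register("electronics", lambda text: "neg" if "broke" in text else "pos")
    print(shell.classify("it broke after a week", "electronics"))
```

Because callers only ever use classify, retraining a model for one domain amounts to another register call rather than a change to the API.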

You can also read about this API here. I particularly liked the "accuracy is a function of business domain" part of the sentiment analysis API argument, and wonder whether we could have reviews of text mining APIs by business domain rather than by social media channel. Isn't social media analytics more a function of business domain than of the quantity and quality of social media channels alone?

Smarter Natural Language Processing with Pythonic batteries? Just another API call away!

Ajay Ohri is the author of R for Business Analytics and likes to write on Enterprise, Cloud, and Statistical APIs with an emphasis on interviews. Follow Ajay on Google+ and connect on LinkedIn.
