In recent years the tech giants have all released speech recognition APIs, allowing developers to integrate speech-to-text technology into their applications. Developer Daniel Janus puts three of the popular contenders through their paces over at Rebased.
The three APIs in question are Google's Cloud Speech API, Microsoft's Bing Speech API and IBM's Watson Speech to Text. Watson comes out of the starting blocks in front as the most flexible API in terms of the audio formats it accepts: Microsoft takes only WAV files, Google will take FLAC too, while Watson accepts all of those and more.
On the length of audio the APIs will accept, Janus finds, Watson again comes out on top. Microsoft is last, accepting only ten seconds of audio per call; Google will take up to 60 seconds, while Watson lets you post any file under 100MB.
All three APIs share pretty much the same design: you POST the audio file to an endpoint and get JSON back containing the transcribed text and a confidence score between 0 and 1. Google and Watson edge out Bing again, though, in that both also offer alternative transcriptions, each with its own probability. Bing falls short once more in not supporting asynchronous requests, as the others do, although there is a C# library to help you manage this.
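To make the shape of those responses concrete, here is a minimal sketch in Python of picking the best transcription out of a Watson-style JSON reply. The exact field names (`results`, `alternatives`, `transcript`, `confidence`) follow Watson's documented response format, but treat the sample payload itself as illustrative rather than an exact schema for any of the three services.

```python
import json

# Illustrative response: transcribed text plus a confidence between 0 and 1,
# with lower-probability alternatives alongside it.
sample = json.loads("""
{
  "results": [
    {
      "alternatives": [
        {"transcript": "hello world", "confidence": 0.94},
        {"transcript": "hello word",  "confidence": 0.05}
      ]
    }
  ]
}
""")

def best_transcript(response):
    """Join the highest-confidence alternative from each result."""
    parts = []
    for result in response["results"]:
        best = max(result["alternatives"], key=lambda a: a.get("confidence", 0))
        parts.append(best["transcript"])
    return " ".join(parts)

print(best_transcript(sample))  # hello world
```

Longer recordings are typically split into several `results` entries, which is why the sketch joins the best alternative from each one.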
Google comes out on top for once on language support: it covers 80 languages, against Bing's 28 and Watson's lousy eight.
But the real question is how do they perform on real speech? Janus compared the three on five one- to two-sentence excerpts of speech taken from the British National Corpus. Error rate was measured as the Levenshtein distance on words (that is, the number of words you need to add, remove or change to turn the transcription into the right answer). Watson was the winner, with the lowest error rate on every test file, while Microsoft was uniformly the worst.
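The word-level Levenshtein metric is simple enough to sketch in a few lines of Python. This is the standard dynamic-programming formulation applied to words rather than characters; the function names are my own, not Janus's code.

```python
def word_levenshtein(reference, hypothesis):
    """Minimum number of word insertions, deletions and substitutions
    needed to turn the hypothesis into the reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete a word
                           dp[i][j - 1] + 1,        # insert a word
                           dp[i - 1][j - 1] + cost) # substitute (or match)
    return dp[len(ref)][len(hyp)]

def word_error_rate(reference, hypothesis):
    """Distance normalised by reference length, the usual WER convention."""
    return word_levenshtein(reference, hypothesis) / len(reference.split())

# One substitution (quick -> quack) and one insertion (jumps) = distance 2
print(word_levenshtein("the quick brown fox", "the quack brown fox jumps"))  # 2
```

Dividing by the reference length, as `word_error_rate` does, lets you compare error rates across test files of different lengths.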
The lesson Janus draws is: use Watson in your app, unless you need to support several dozen exotic languages.