Google Overhauls Cloud Speech-to-Text

Google has announced a major overhaul of its Cloud Speech-to-Text product (formerly the Google Cloud Speech API). Cloud Speech-to-Text now supports a selection of pre-built models, automatic punctuation, recognition metadata, and a standard service level agreement (SLA). The pre-built models aim to improve transcription accuracy for phone calls and video, and each is tailored to a specific use case, such as transcribing calls to a customer call center or the audio from a televised basketball game. Automatic punctuation adds basic punctuation such as commas, periods, and question marks to transcriptions generated by the Cloud Speech-to-Text service. Recognition metadata is a new mechanism that allows users to tag and group transcription workloads. Google also uses this metadata to prioritize what the company works on next. Finally, the standard SLA includes a commitment to 99.9% availability.
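As a rough illustration of how these features fit together, the sketch below builds a request body that selects a pre-built model, turns on automatic punctuation, and attaches recognition metadata. The field names (model, enableAutomaticPunctuation, metadata, interactionType, originalMediaType) follow one reading of Google's public REST reference and should be verified against the current documentation; the helper name and the storage bucket path are hypothetical, and no request is actually sent.

```python
import json


def build_recognize_request(gcs_uri, model="phone_call"):
    """Return a dict shaped like a Speech-to-Text recognize request body.

    Assumption: field names mirror the public REST reference; check them
    against the current docs before relying on this.
    """
    return {
        "config": {
            "languageCode": "en-US",
            # Pre-built model tailored to the audio source, e.g. "phone_call" or "video".
            "model": model,
            # Ask the service to add commas, periods, and question marks.
            "enableAutomaticPunctuation": True,
            # Recognition metadata used to tag and group the workload.
            "metadata": {
                "interactionType": "PHONE_CALL",
                "originalMediaType": "AUDIO",
            },
        },
        # Hypothetical Cloud Storage location of the recording.
        "audio": {"uri": gcs_uri},
    }


if __name__ == "__main__":
    body = build_recognize_request("gs://example-bucket/support-call.wav")
    print(json.dumps(body, indent=2))
```

Passing model="video" instead would select the video-oriented model for the same request shape.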

The company introduced the Google Cloud Speech API in May 2016, and in 2017 it added several new features, including word-level timestamps and support for long-form audio files up to three hours long. Last month, the company introduced its speech synthesis API, Cloud Text-to-Speech, which features DeepMind WaveNet models. WaveNet is a method of generating synthetic speech that can mimic a human voice. The Cloud Text-to-Speech API uses WaveNet technology to create raw audio data of natural-sounding human speech. The company uses the same technology to generate human-like speech for products such as Google Assistant and Google Translate. The Cloud Text-to-Speech API offers a selection of premium voices that were created with WaveNet technology. The API is currently in beta.
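For readers curious what choosing a WaveNet voice might look like, here is a minimal sketch of a synthesis request body. The field names (input, voice, audioConfig) and the voice name "en-US-Wavenet-A" are taken from one reading of the public Cloud Text-to-Speech REST reference and should be treated as assumptions to confirm; the helper name is hypothetical, and the code only builds the payload without calling the service.

```python
import json


def build_synthesize_request(text, voice_name="en-US-Wavenet-A"):
    """Return a dict shaped like a Text-to-Speech synthesize request body.

    Assumption: field and voice names follow the public REST reference.
    """
    return {
        "input": {"text": text},
        # A WaveNet-based premium voice; the name is an assumed example.
        "voice": {"languageCode": "en-US", "name": voice_name},
        # LINEAR16 requests uncompressed PCM audio.
        "audioConfig": {"audioEncoding": "LINEAR16"},
    }


if __name__ == "__main__":
    print(json.dumps(build_synthesize_request("Hello from WaveNet."), indent=2))
```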

For more information about Google Cloud Speech-to-Text, visit the official company website.
