At work, we’re always trying to be more productive: to get more out of doing less, whether it’s our time, our tools, or our methods of communication. Digital tools like email, collaboration apps, and messengers have helped us do that, but there’s one method of communication everyone uses at work that’s never been optimized: voice. Voice remains the most used, most effective, and fastest form of communication at work, yet it is also the most difficult to store and consume. What if we could “see” what was said in our conversations instead? We exchange information best with voice, but we consume information best with visuals.
Advancements in Automatic Speech Recognition (ASR) from Google, Microsoft, and Baidu have reached a word error rate of almost 4.9%. However, Speech-to-Text transcription for conversations at work requires another level of innovation. First, ASRs are not aware of the context of our conversations, so they cannot interpret the keywords, names, and entities we refer to, and they recognize those words with poor accuracy. Second, ASRs have no concept of who said what in a conversation, which is required to preserve its structure: typically, after a conversation, we want to know what a given person said about a given topic. And even short conversations produce paragraphs of text when transcribed, which are a pain to read through just to extract a few sentences of useful information.
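As a concrete illustration of the metric cited above (a sketch for the curious reader, not a description of any vendor’s system): word error rate is the word-level edit distance between a reference transcript and the ASR’s hypothesis, divided by the number of words in the reference.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for word-level Levenshtein distance.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("alan" -> "allen") and one deletion ("the")
# over a 6-word reference: WER = 2/6, or about 33%.
print(word_error_rate("we met alan for the demo",
                      "we met allen for demo"))
```

Note how a single misrecognized name counts the same as any other word error, even though, in a work conversation, names and domain keywords carry most of the meaning.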
Now that we’re in the era of Infinite Computing and Artificial Intelligence, it’s finally possible to address this by building voice NLP infrastructure. This is the future of voice in the enterprise that we’re building at Alan.