
Why is it hard to build a Voice Interface for daily use?

June 22, 2022 | January 22nd, 2024

Today’s voice platforms do NOT work for mission-critical Apps.

We are entering the era of Voice, which, combined with AI, has the potential to transform every business. There will be 200 billion connected devices by 2025, and Voice will be the primary interface for interacting with them, which hints at the scale of the transformation ahead. Our vision is to lead the Voice AI era by building the world’s most accurate Spoken Language Understanding for mission-critical daily operations in every enterprise. Here are a few of the challenges we are encountering in building our Voice AI service, along with the excitement and the hurdles around its adoption.

The picture above illustrates the five steps in a Voice AI service. Voice input can be picked up from any microphone and sent to the speech recognition service, either on the device or in the cloud. Once the speech is converted to text, natural language understanding extracts the user’s intent, and text is generated for the reply. In the last step, the reply text is synthesized into speech and sent to the user.
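The flow described above can be sketched as a chain of stages. This is a minimal illustration, not the architecture of any particular product: the `VoicePipeline` class and the `fake_*` stand-ins are hypothetical names, and each stage is a trivial stub where a real service would call a speech or language model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    """Chains the stages that follow audio capture (all names are hypothetical)."""
    recognize: Callable[[bytes], str]    # speech -> transcript
    understand: Callable[[str], str]     # transcript -> intent
    respond: Callable[[str], str]        # intent -> reply text
    synthesize: Callable[[str], bytes]   # reply text -> speech

    def handle(self, audio: bytes) -> bytes:
        transcript = self.recognize(audio)
        intent = self.understand(transcript)
        reply = self.respond(intent)
        return self.synthesize(reply)


# Stub stages: the "audio" here is just UTF-8 text so the sketch stays runnable.
def fake_recognize(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_understand(transcript: str) -> str:
    return "get_time" if "time" in transcript.lower() else "unknown"


def fake_respond(intent: str) -> str:
    replies = {"get_time": "It is noon."}
    return replies.get(intent, "Sorry, I did not understand.")


def fake_synthesize(reply: str) -> bytes:
    return reply.encode("utf-8")


pipeline = VoicePipeline(fake_recognize, fake_understand, fake_respond, fake_synthesize)
print(pipeline.handle(b"What time is it?"))
```

Keeping each stage behind a plain function signature means a stage can run on the device or in the cloud without the rest of the chain changing.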

The following are some of the key challenges in developing an effective Voice AI service. 

  1. Spoken Language is different from Written Language. When speaking, people don’t always follow grammar or use punctuation, and they often split their sentences. The neural networks used for Automatic Speech Recognition (ASR) introduce errors of their own. Speakers also tend to use more anaphora (references back to something said earlier, such as “it” or “that one”) to convey their intent. Lastly, a writer can go back and edit a sentence, but a speaker cannot; corrections are simply appended to what was already said. As a result, NLU models trained on written data sets do not work well for spoken language understanding.
  2. Named Entity Recognition (NER) is hard for most Automatic Speech Recognition services. Google Speech Recognition, like the services from Microsoft and Amazon, does not always return the correct result for names. For example, for the spoken name “Ronjon”, Google returns “1 John”, “Call John”, “Long John”, “Ron John”, etc. These responses from AI services have to be treated as “hints” that need to be augmented in order to infer “Ronjon”.
  3. The next challenge with Voice services is delivering a natural conversational experience. Human conversation has interruptions, pauses, and varied sentence structures. Due to privacy concerns, the consumer voice services from Apple, Amazon, Google, and Microsoft have to use a wake word (Siri, Alexa, Google, Cortana) and offer a request-response style of conversation, which limits the scope of these services for extended use. Hence, conversations with current consumer voice products last less than one minute.
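To make the “hints” idea from the second challenge concrete, here is a minimal sketch that scores the ASR alternatives against a list of names the application already knows, such as a contact list. It is an assumption-laden illustration rather than the method any vendor uses: `resolve_name`, `normalize`, the sample name list, and the 0.5 threshold are all hypothetical, and plain string similarity from Python’s standard `difflib` stands in for a real phonetic or contextual model.

```python
from difflib import SequenceMatcher
from typing import Optional, Sequence


def normalize(text: str) -> str:
    """Lowercase and keep letters only, so "Ron John" becomes "ronjohn"."""
    return "".join(ch for ch in text.lower() if ch.isalpha())


def resolve_name(
    hints: Sequence[str],
    known_names: Sequence[str],
    threshold: float = 0.5,  # hypothetical cut-off, would need tuning on real data
) -> Optional[str]:
    """Return the known name most similar, on average, to the ASR hints."""
    if not hints:
        return None
    best_name, best_score = None, 0.0
    for name in known_names:
        target = normalize(name)
        # Average the similarity of every hint to this candidate name.
        score = sum(
            SequenceMatcher(None, normalize(hint), target).ratio() for hint in hints
        ) / len(hints)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


asr_hints = ["1 John", "Call John", "Long John", "Ron John"]
app_contacts = ["Ronjon", "Priya", "Miguel"]  # names supplied by the application
print(resolve_name(asr_hints, app_contacts))
```

The application’s own list is doing most of the work here. If that list also contained “John”, string similarity alone would probably not be enough to separate the two names, and further signals from the conversation would be needed.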

The next generation of Voice AI services will overcome the three shortcomings above. We at Alan AI have developed a unique spoken language understanding technology that is based on the application context, which gives unparalleled accuracy and flexibility. Alan is a complete Voice AI Platform and the market leader for developers to deploy and manage in-app voice assistants and voice interfaces for mobile and web apps. We are looking forward to helping the business world realize ROI by deploying Voice AI solutions. We expect all Apps to pass the “Turing Test” in the near future.
