Voice processing

In Alan AI’s infrastructure, all voice processing is performed in the Alan AI Cloud, the AI backend of the Alan AI Platform. This is where dialog scripts are hosted and where Spoken Language Understanding (SLU) and Natural Language Processing (NLP) tasks are performed.

Under the hood, the Alan AI Cloud engages a combination of voice AI tools and technologies to simulate human-like dialog between the user and the app. Together, they allow Alan AI to interpret human speech, generate responses and perform the necessary actions in the app. The main voice technologies used by Alan AI are:

  • Natural Language Processing (NLP)

  • Spoken Language Understanding (SLU)

  • Automatic Speech Recognition (ASR)

  • Machine Learning (ML)

  • Speech-to-Text (STT) and Text-to-Speech (TTS)

How voice processing works

Alan AI’s main goal is to match what the user says to a specific voice command in the script. To do this, Alan AI needs to convert the unstructured data in the user’s input into a form that can be analyzed and processed at the machine level.

  1. Voice command processing starts on the client side. The Alan AI SDK captures the voice stream from the user’s device and sends the voice data to the Alan AI Cloud for processing.

  2. Alan AI uses the Automatic Speech Recognition (ASR) engine and Speech-to-Text (STT) to get the user input and convert it to text segments.

  3. With the help of Spoken Language Understanding (SLU) and Named Entity Recognition (NER) systems, Alan AI evaluates the phrase patterns and extracts the intent and meaningful words, such as location, date and time, from the phrase. To match phrases with high accuracy, Alan AI uses the Domain Language Model for your app.

  4. Alan AI matches the phrase to a voice command in the dialog script. This is where Alan AI’s Machine Learning (ML) algorithms are used. Each phrase is given a probability score, with ‘1’ being the most accurate match.

  5. If the AI agent is supposed to give a response, it uses Text-to-Speech (TTS) technology to synthesize natural-sounding speech.

Note

You can check the phrase probability score in Alan AI Studio logs. To do this, at the bottom of Alan AI Studio, open the logs pane and click the phrase given by the user.
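
The idea of scoring a phrase against the commands in a script (step 4) can be pictured with a small sketch. The matching model itself is not described here, so the example substitutes plain Jaccard token overlap for the ML algorithms. The command names, patterns and helper functions (COMMANDS, tokens, score, best_match) are all hypothetical and are not part of the Alan AI SDK.

```python
from __future__ import annotations

import re

# Hypothetical command patterns; a real dialog script defines its own.
COMMANDS = {
    "show_weather": "what is the weather",
    "open_cart": "open my cart",
}


def tokens(text: str) -> set[str]:
    """Lowercase the text and return its words as a set."""
    return set(re.findall(r"[a-z']+", text.lower()))


def score(phrase: str, pattern: str) -> float:
    """Jaccard similarity in [0, 1]; 1.0 means identical token sets.

    This is a stand-in for a real matching model, not Alan AI's algorithm.
    """
    a, b = tokens(phrase), tokens(pattern)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_match(phrase: str) -> tuple[str, float]:
    """Return the command name with the highest score, plus that score."""
    scored = ((name, score(phrase, pattern)) for name, pattern in COMMANDS.items())
    return max(scored, key=lambda item: item[1])
```

With these toy patterns, best_match("What is the weather") picks show_weather with a score of 1.0, while a looser phrase such as "open the cart" still picks open_cart but with a lower score.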

Alan AI’s ASR and Domain Language Model

During voice processing, the ASR engine converts speech to text. One of the vital components of ASR is the language model. The language model evaluates the probabilities of word sequences, which allows ASR to distinguish between words that sound similar.
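
As a rough illustration of how a language model can separate similar-sounding words, the sketch below trains a toy bigram model with add-one smoothing. The corpus is made up, and the functions (train, sequence_probability) are hypothetical; a production ASR language model is far larger and is not necessarily a bigram model.

```python
from collections import Counter

# Tiny made-up training corpus; a real language model uses far more text.
CORPUS = [
    "write a note",
    "write a message",
    "turn right",
]


def train(sentences):
    """Count unigrams and bigrams, with <s> marking the sentence start."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def sequence_probability(text, unigrams, bigrams):
    """Bigram probability of a word sequence with add-one smoothing."""
    # Distinct tokens seen in training (the <s> marker is counted too).
    vocab_size = len(unigrams)
    words = ["<s>"] + text.split()
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        prob *= (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
    return prob


UNIGRAMS, BIGRAMS = train(CORPUS)
```

"write a note" and "right a note" sound alike, but sequence_probability assigns the first a higher probability because "write a" appears in the training text and "right a" does not.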

Besides the global language model, Alan AI creates a domain language model (DLM) for every app. The DLM is based on the unique terms, names and phrases used in your company or domain. It lets Alan AI’s ASR predict with high accuracy what the user may say in a particular context and resolve ambiguities.
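
One common way to combine a general model with a domain-specific one is linear interpolation. How the DLM is actually combined with the global model is not stated here, so the sketch below is only an assumption. The probabilities, the shopping-app scenario and the functions (interpolate, pick) are invented for illustration.

```python
# Made-up word probabilities for illustration only.
GLOBAL_LM = {"cart": 0.2, "card": 0.8}
DOMAIN_LM = {"cart": 0.9, "card": 0.1}  # hypothetical shopping app


def interpolate(word, domain_weight=0.7):
    """Linearly interpolate the domain and global probabilities of a word."""
    return (domain_weight * DOMAIN_LM.get(word, 0.0)
            + (1 - domain_weight) * GLOBAL_LM.get(word, 0.0))


def pick(candidates, domain_weight=0.7):
    """Choose the candidate with the highest interpolated probability."""
    return max(candidates, key=lambda word: interpolate(word, domain_weight))
```

With the domain model weighted at 0.7, pick(["cart", "card"]) returns "cart". Setting domain_weight to 0.0 falls back to the global model alone and returns "card", which shows how domain terms can resolve an ambiguity the global model would get wrong.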

To build a DLM, you do not need to provide a large dataset with variants of utterances or to create spoken language models. Alan AI automatically trains on small existing datasets and generates models with phrases and intents for your app. As a result, you get an AI agent that deeply understands the UI, workflow and business logic of your app.

Another factor that can affect the accuracy of the ASR engine is noise. Alan AI uses a probabilistic approach to address this problem: it handles speech recognition errors and helps the AI agent work reliably in noisy environments.