In the Alan infrastructure, all voice processing is performed in the Alan cloud. The Alan cloud is the AI backend of the Alan AI Platform. This is where voice scripts are hosted and where Spoken Language Understanding (SLU) and Natural Language Processing (NLP) tasks are performed.
Under the hood, the Alan cloud engages a combination of voice AI tools and technologies to simulate human-like dialog between the user and the app. Together, they allow Alan to interpret human speech, generate responses and perform the necessary actions in the app. The main voice technologies used by Alan are:
Natural Language Processing (NLP)
Spoken Language Understanding (SLU)
Automatic Speech Recognition (ASR)
Machine Learning (ML)
Speech-to-Text (STT) and Text-to-Speech (TTS)
How voice processing works
Alan’s main goal is to match what the user says to a specific voice command in the script. To do this, Alan needs to rearrange unstructured data in the user’s input so that it can be analyzed and processed at the machine level.
Voice command processing starts on the client side. The Alan Client SDK captures the voice stream from the user’s device and sends the voice data to the Alan cloud for processing.
Alan uses the Automatic Speech Recognition (ASR) engine and Speech-to-Text (STT) to get the user input and convert it to text segments.
With the help of Spoken Language Understanding (SLU) and Named Entity Recognition (NER) systems, Alan evaluates the phrase patterns and extracts the intent and meaningful words, such as location, date and time, from the phrase. To match phrases with high accuracy, Alan uses the Domain Language Model for your app.
Alan matches the phrase to a voice command in the Alan script. This is where Alan AI Machine Learning (ML) algorithms are leveraged. Each phrase is given a probability score, with ‘1’ being the most accurate match.
If Alan is supposed to give a response, it uses the Text-to-Speech (TTS) technology to synthesize speech that sounds natural.
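The entity extraction step described above can be illustrated with a toy sketch. This is not Alan's SLU or NER implementation, which is not described here. It is a minimal illustration that assumes plain regular expressions, and the entity labels and the `extract_entities` helper are hypothetical names chosen for this example.

```python
import re

# Hypothetical entity patterns for illustration only.
# A production NER system is statistical and far more robust than these regexes.
ENTITY_PATTERNS = {
    # A capitalized word sequence following "in", e.g. "in New York"
    "LOCATION": re.compile(r"\bin ([A-Z][a-z]+(?: [A-Z][a-z]+)*)"),
    # A small fixed set of relative dates
    "DATE": re.compile(r"\b(today|tomorrow|tonight)\b", re.IGNORECASE),
    # Times such as "5 pm" or "10:30am"
    "TIME": re.compile(r"\b(\d{1,2}(?::\d{2})?\s?(?:am|pm))\b", re.IGNORECASE),
}


def extract_entities(phrase: str) -> dict:
    """Return the first match for each entity label found in the phrase."""
    entities = {}
    for label, pattern in ENTITY_PATTERNS.items():
        match = pattern.search(phrase)
        if match:
            entities[label] = match.group(1)
    return entities


if __name__ == "__main__":
    print(extract_entities("What is the weather in New York tomorrow at 5 pm"))
```

The structured result (a label-to-value mapping) is the kind of machine-readable form that a matching step could consume alongside the detected intent.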
You can check the phrase probability score in Alan Studio logs. To do this, at the bottom of Alan Studio, open the logs pane and click the phrase given by the user.
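To make the idea of a match score concrete, here is a minimal sketch that ranks a phrase against a list of command patterns and returns a score between 0 and 1. It is an assumption-laden stand-in: Alan's ML-based matching is not documented here, so the sketch uses character-level similarity from Python's standard `difflib` module instead. The `COMMANDS` list, the `best_match` helper and the `0.6` threshold are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical command patterns; a real Alan script defines its own.
COMMANDS = ["show my orders", "what is the weather", "open settings"]


def score(phrase: str, pattern: str) -> float:
    """Similarity in [0, 1]; identical strings score exactly 1.0."""
    return SequenceMatcher(None, phrase.lower(), pattern.lower()).ratio()


def best_match(phrase: str, commands, threshold: float = 0.6):
    """Return (command, score) for the closest pattern, or (None, score)
    when even the best candidate falls below the threshold."""
    best_score, best_command = max((score(phrase, c), c) for c in commands)
    if best_score < threshold:
        return None, best_score
    return best_command, best_score


if __name__ == "__main__":
    for phrase in ["show my orders", "show me my orders", "zzzz"]:
        print(phrase, "->", best_match(phrase, COMMANDS))
```

In this sketch, an exact phrase scores 1.0, a close variant such as "show me my orders" still selects "show my orders" with a score below 1, and an unrelated input is rejected by the threshold.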
Alan ASR and Domain Language Model
During voice processing, the ASR engine converts speech to text. One of the vital components of ASR is the language model. The language model evaluates the probabilities of word sequences, which allows ASR to distinguish between words that sound similar.
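A small sketch can show how scoring word sequences helps separate sound-alike words. The code below is a generic bigram model with add-one smoothing, not Alan's language model. The training sentences in `CORPUS` and the helper names are made up for illustration.

```python
import math
from collections import Counter

# Tiny made-up training corpus; real language models train on far more text.
CORPUS = [
    "book a flight to paris",
    "i want to book a hotel",
    "show me two hotels",
]


def train_bigrams(sentences):
    """Count single tokens and adjacent token pairs, with a start marker."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log_prob(sentence, unigrams, bigrams):
    """Log-probability of a sentence under the bigram model.
    Add-one smoothing keeps unseen word pairs from scoring zero."""
    vocab_size = len(unigrams)
    tokens = ["<s>"] + sentence.split()
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
    return total


if __name__ == "__main__":
    unigrams, bigrams = train_bigrams(CORPUS)
    # "to" and "two" sound alike; the word context tells them apart.
    for candidate in ["book a flight to paris", "book a flight two paris"]:
        print(candidate, log_prob(candidate, unigrams, bigrams))
```

Because "flight to" and "to paris" occur in the training text while "flight two" and "two paris" do not, the first candidate receives the higher log-probability, which is how a recognizer would prefer it over the acoustically similar alternative.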
Besides the global language model, Alan creates a domain language model (DLM) for every app. The DLM is based on the unique terms, names and phrases used in your company or domain. It lets Alan’s ASR predict with high accuracy what the user may say in a particular context and resolve ambiguities.
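One common way to combine a general model with a domain-specific one is linear interpolation, sketched below. This is an assumption for illustration: how Alan actually builds or applies its DLM is not stated here. The word probabilities in `GLOBAL_MODEL` and `DOMAIN_MODEL` are invented numbers, and the helper names are hypothetical.

```python
import math

# Invented unigram probabilities for illustration only.
GLOBAL_MODEL = {"allen": 0.008, "alan": 0.002, "studio": 0.01}
DOMAIN_MODEL = {"alan": 0.2, "studio": 0.15}

FLOOR = 1e-6  # probability assigned to words a model has never seen


def word_prob(word: str, domain_weight: float) -> float:
    """Linear interpolation of the domain and global probabilities."""
    p_global = GLOBAL_MODEL.get(word, FLOOR)
    p_domain = DOMAIN_MODEL.get(word, FLOOR)
    return domain_weight * p_domain + (1.0 - domain_weight) * p_global


def hypothesis_score(hypothesis: str, domain_weight: float) -> float:
    """Product of interpolated word probabilities for one ASR hypothesis."""
    return math.prod(word_prob(w, domain_weight) for w in hypothesis.split())


def pick_hypothesis(hypotheses, domain_weight: float) -> str:
    """Choose the hypothesis with the highest interpolated score."""
    return max(hypotheses, key=lambda h: hypothesis_score(h, domain_weight))


if __name__ == "__main__":
    candidates = ["open allen studio", "open alan studio"]
    print(pick_hypothesis(candidates, domain_weight=0.0))  # global model only
    print(pick_hypothesis(candidates, domain_weight=0.5))  # with domain model
```

With the global model alone, the more common spelling "allen" wins. Once the domain model is given weight, the domain term "alan" is preferred, which mirrors how domain vocabulary can resolve ambiguity between similar-sounding words.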
To build a DLM, you do not need to provide a large dataset with variants of utterances or to create spoken language models. Alan automatically trains on small existing datasets and generates models with phrases and intents for your app. As a result, you get an intelligent voice assistant that deeply understands the UI, workflow and business logic of your app.
Another factor that can affect the accuracy of the ASR engine is noise. Alan uses its probabilistic NLU to address this problem: it handles speech recognition errors and helps the voice assistant work reliably in noisy environments.