Conversational AI

Spoken Language Understanding (SLU) and Intelligent Voice Interfaces

April 28, 2021 (updated January 22, 2024)

It’s no secret that voice tech performs everyday magic for users. By now, basic voice product capabilities and features are well-known to the public, and common knowledge is enough to tell us what this technology does. Yet we rarely consider the factors and mechanisms behind the scenes that enable these products to work. Multiple frameworks govern the different ways people and products communicate. But the frequent lifeblood of the user experience — Spoken Language Understanding (SLU) — is quite concrete as a concept.

As the name hints, SLU takes what someone tells a voice product and tries to understand it. Doing so involves detecting cues in speech, drawing the right inferences, and navigating complexities as an intermediary between human voices and written text. Since typed and spoken language form sentences differently, self-corrections and hesitations recur in speech. SLU systems leverage different tools to route user messages through this traffic.

The most established is Automatic Speech Recognition (ASR), a technology that transcribes user speech at the system’s front end. By tracking audio signals, ASR converts spoken words to text. Like the first listener in an elementary school game of telephone, ASR is the component most likely to hear precisely what the original speaker said. Conversely, Natural Language Understanding (NLU) determines user intent at the back end. ASR and NLU are used in tandem since they typically complement each other well. Meanwhile, an End-to-End SLU cuts corners by deciphering utterances without transcripts.
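The front-end/back-end split described above can be sketched in a few lines. This is a minimal illustration with toy stand-ins for real ASR and NLU models — the function names and the keyword rule are assumptions for this sketch, not any particular product’s API.

```python
def asr_transcribe(audio_signal: str) -> str:
    """Front end: convert an audio signal to text (toy stand-in).

    A real ASR model decodes an acoustic waveform; here we simply
    pretend the signal already carries its transcript.
    """
    return audio_signal.lower().strip()


def nlu_parse(transcript: str) -> dict:
    """Back end: infer what the user meant from the transcript.

    The "book_flight" label and the keyword check are illustrative.
    """
    intent = "book_flight" if "fly" in transcript else "unknown"
    return {"text": transcript, "intent": intent}


def slu_pipeline(audio_signal: str) -> dict:
    """ASR and NLU working in tandem: transcribe first, then interpret."""
    return nlu_parse(asr_transcribe(audio_signal))


result = slu_pipeline("I want to fly from San Francisco to New York")
```

An end-to-end SLU would replace both steps with a single model that maps audio directly to the structured result, skipping the intermediate transcript.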

This Law & Order: SLU series gets juicier once you know all that Spoken Language Understanding is up against. One big challenge SLU systems face is that ASR has a complicated past. The number of transcription errors ASRs have committed is borderline criminal — not in a court of law — but the product repairs and false starts that resulted had a polarizing effect on users. ASRs also often operate slower than the pace of natural conversation. And the icing on the cake? The limited scope of domain knowledge across early SLU systems hampered their appeal to targeted audiences. Relating to different niches was difficult because domain-specific jargon was scarce in their training.

Let’s say you are planning a vacation. After deciding your destination will be New York, you are ready to book a flight. You tell your voice assistant, “I want to fly from San Francisco to New York.” 

That request is sliced and diced into three pieces: Domain, Intent, and Slot Labels.

1. DOMAIN (“Flight”)

Before accurately determining what a user said, SLU systems figure out what subject they talked about. A domain is the predetermined area of expertise that a program specializes in. Since many voice products are designed to appeal broadly, learning algorithms can classify various query subjects by categorizing incoming user data. In the example above, the domain is just “Flight.” No hotels were mentioned. Nor did the user ask to book a cruise. They simply preferred to fly to the Big Apple.    

Domain classification is a double-edged sword SLU must use wisely. Without it, these systems can miss the mark, steering someone in need of one application into another. 

The SLU has to determine whether the user referred to flights or not. Otherwise, the system could produce the wrong list of travel options. No system should prompt the user to accidentally book a rental car for a cross-country road trip they never wanted.
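Domain classification can be pictured as routing a query to one predetermined category. The sketch below uses simple keyword scoring; production systems use trained classifiers, and the domain names and keyword sets here are assumptions made up for the flight example.

```python
# Toy keyword sets per domain -- illustrative only.
DOMAIN_KEYWORDS = {
    "flight": {"fly", "flight", "airport", "airline"},
    "hotel": {"hotel", "room", "suite"},
    "car_rental": {"rental", "car", "drive"},
}


def classify_domain(query: str) -> str:
    """Score each domain by keyword overlap and pick the best match."""
    words = set(query.lower().replace(",", " ").split())
    scores = {domain: len(words & kw) for domain, kw in DOMAIN_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    # Fall back to "unknown" rather than guessing when nothing matches,
    # so the system does not steer the user into the wrong application.
    return best if scores[best] > 0 else "unknown"


print(classify_domain("I want to fly from San Francisco to New York"))  # flight
```

The fallback branch reflects the double-edged-sword point above: misrouting a query to the wrong domain is often worse than admitting the system does not know.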

2. INTENT (“Departure”)

Tracking down the subject a speaker talks about matters. However, if a voice product cannot pin down why that person spoke, how could it solve their problem? Carrying out the right task would then become impossible.

Once the domain is selected, the SLU identifies user intent. Doing so goes one step further and traces why that person communicated with the system. In the example above, “Departure” is the intent. When someone asks about flying, the SLU has enough information to believe the user is likely interested in leaving town. 
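Once the domain is fixed, intent detection narrows down why the user spoke. A hedged sketch, assuming a tiny cue-phrase table within the "Flight" domain — the intent labels and cue phrases below are invented for illustration, not a fixed taxonomy:

```python
# Illustrative cue phrases per intent within the "Flight" domain.
INTENT_CUES = {
    "departure": ["fly from", "leave", "depart"],
    "flight_status": ["status", "delayed", "on time"],
}


def detect_intent(transcript: str) -> str:
    """Return the first intent whose cue phrase appears in the transcript."""
    text = transcript.lower()
    for intent, cues in INTENT_CUES.items():
        if any(cue in text for cue in cues):
            return intent
    return "unknown"
```

For "I want to fly from San Francisco to New York", the phrase "fly from" signals the Departure intent: the user is likely interested in leaving town.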

3. SLOT LABELS (Departure: “San Francisco”, Arrival: “New York”)

Enabling an SLU system to set domains and grasp intent is often not enough. Sure, we already know the user intends to leave on a flight. But the system has yet to officially document where they want to go.

Slots capture the specifics of a query once the subject matter and end goal are both determined. Unlike their high-stakes Vegas counterparts, these slots do not rack up casino winnings. Instead, they attach labels to the specific values the domain and intent call for. Within the same example, departure and arrival locations must be accounted for. The original query includes both: “San Francisco” (departure) and “New York” (arrival).
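A toy slot filler for the flight example can pull both cities out of a “from X to Y” pattern with a regular expression. Real systems use sequence-labeling models (e.g. BIO tagging) rather than patterns; this sketch only illustrates what filled slots look like.

```python
import re


def fill_slots(transcript: str) -> dict:
    """Extract departure and arrival slots from a "from X to Y" phrase."""
    match = re.search(
        r"from\s+(?P<departure>.+?)\s+to\s+(?P<arrival>.+)$",
        transcript,
        re.IGNORECASE,
    )
    # Named groups become the slot labels; return nothing on no match.
    return match.groupdict() if match else {}


slots = fill_slots("I want to fly from San Francisco to New York")
# slots == {"departure": "San Francisco", "arrival": "New York"}
```

Combined with the domain (“Flight”) and intent (“Departure”), these two slot values give the system everything it needs to search for matching flights.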


Spoken Language Understanding (SLU) provides a structure that establishes, stores, and processes the vocabulary that gives voice products their identity. By taking the needed leap to classify and categorize queries, SLU systems collect better data and personalize voice experiences. Products then become smarter and channel more empathy. They are empowered to anticipate user needs and solve problems quickly. Therefore, SLU facilitates efficient workflow design and raises the ceiling on how well people can share and receive information to accomplish more.

If you’re looking for a voice platform to bring voice to your application, get started with the Alan AI platform today.
