What is a Voice User Interface (VUI)?

September 25, 2019 | January 22, 2024

A Voice User Interface (VUI) enables users to interact with a device or application using spoken commands. VUIs give users hands-free control of technology, often without even having to look at the device. A combination of Artificial Intelligence (AI) technologies is used to build VUIs, including Automatic Speech Recognition, Named Entity Recognition, and Speech Synthesis, among others. VUIs can be built into devices or into applications. The backend infrastructure, including the AI technologies behind the VUI’s speech components, is often hosted in a public or private cloud, where the user’s speech is processed. In the cloud, AI components determine the user’s intent and return a response to the device or application where the user is interacting with the VUI.

Well-known VUIs include Amazon Alexa, Apple Siri, Google Assistant, Samsung Bixby, Yandex Alisa, and Microsoft Cortana. For the best user experience, VUIs are often accompanied by visuals from a Graphical User Interface and by sound effects. Each VUI has its own visual and audio cues that let users know when it is active, listening, processing speech, or responding. The benefits of VUIs include hands-free accessibility, productivity, and a better customer experience, all of which are changing how people interact with artificial intelligence.

The Creation of VUI 

Audrey

The history of VUI begins with the first speech recognition system, a device called Audrey, in 1952. Invented by K. H. Davis, R. Biddulph, and S. Balashek, Audrey was known as the “automatic digit recognizer” because it could recognize the digits 0 through 9. Although Audrey’s skill was limited to digits, it was seen as a technological breakthrough. Audrey was also not a small device like those seen today: it stood 6 feet tall and relied on a large and rather complicated analog circuit system.

Audrey followed an input and output procedure similar to that of modern VUI devices. First, a speaker recited one or more digits into a telephone, pausing for 350 milliseconds between words. Next, Audrey listened to the input and sorted the speech sounds and patterns to work out which digit had been spoken. Audrey then responded visibly by flashing a light, much as modern VUI devices do.

Although Audrey could distinguish the digits, it could not understand every voice or speaking style and responded reliably only to a familiar speaker. Unlike modern VUI devices, Audrey was simply not advanced enough and needed a familiar speaker to maintain 97 percent digit recognition accuracy. With a few other designated speakers, Audrey’s accuracy was 70 to 80 percent, and it was far lower with speakers it was unfamiliar with. Why was Audrey created in the first place if manual push-button dialing was cheaper and easier to work with? Recognized speech requires less bandwidth (a narrower range of frequencies for transmitting a signal) than the original sound waves on a telephone line. Recognizing speech was therefore seen as a practical way to reduce the data traveling through wires, and as a foundation for future technology.

Tangora

Nearly two decades after Audrey, the next significant advance in voice technology came in 1971, when the U.S. Department of Defense’s research agency funded five years of a Speech Understanding Research program. Its goal was to reach a minimum vocabulary of 1,000 words with the help of companies such as IBM. In the 1980s, IBM built a voice-activated typewriter called Tangora. Tangora could handle a 20,000-word vocabulary. Today, voice-activated typing has evolved to the point where it is used on smartphones to dictate a text message or draft a research paper.

Over time, computer technology advanced to the point where VUI, Graphical User Interface (GUI), and User Experience (UX) design all fit into a device the size of a palm. Even graphical interfaces are starting to give way in some products, with the quick adoption of voice-only devices that have no screen at all. Speech recognition technology went from understanding ten digits to understanding millions of words and phrases from almost any voice. This advancement was made possible by newer speech technologies such as Automatic Speech Recognition, Named Entity Recognition, and Speech Synthesis.

Technology used to create a VUI

A range of Artificial Intelligence technologies is used to create VUIs, including Automatic Speech Recognition, Named Entity Recognition, and Speech Synthesis.

Automatic Speech Recognition

Automatic Speech Recognition (ASR) is a technology that converts human speech into text. For a given audio input, ASR has to filter out distracting background noise and isolate the human speech. Distortions in the audio and unreliable streaming connectivity can make this a challenge. Several underlying approaches have been used to build ASR, including Gaussian mixture models (a kind of probabilistic model) and deep learning with neural networks. Often, the words recognized by ASR are not an exact match for the entities within a user intent. In these cases, augmented entity matching is used, which takes similar or similar-sounding words and matches them to a predefined entity in the VUI.
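To make the idea of matching near-miss ASR output to a predefined entity more concrete, here is a minimal Python sketch. It is not how any particular VUI implements augmented entity matching: it compares spelling rather than sound, using the standard library's difflib, and the artist list, the function name match_entity, and the 0.6 cutoff are all assumptions made for illustration.

```python
from difflib import get_close_matches

# Hypothetical entity list for a music VUI; a real VUI would load its own.
KNOWN_ARTISTS = ["Taylor Swift", "The Beatles", "Daft Punk", "Adele"]


def match_entity(recognized, entities, cutoff=0.6):
    """Return the predefined entity closest to the ASR text, or None.

    The comparison is on spelling only (assumption); a production system
    would more likely compare phonetic representations.
    """
    # Compare case-insensitively, but return the entity in its canonical form.
    by_lowercase = {entity.lower(): entity for entity in entities}
    hits = get_close_matches(recognized.lower(), list(by_lowercase), n=1, cutoff=cutoff)
    return by_lowercase[hits[0]] if hits else None


print(match_entity("tailor swift", KNOWN_ARTISTS))  # Taylor Swift
```

Returning None when nothing clears the cutoff lets the VUI ask the user to repeat the request instead of guessing.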

Named Entity Recognition

Named Entity Recognition (NER) classifies words according to the entity they represent. For example, in the command “Get directions to New York City”, ‘New York City’ is recognized as a location. In addition to locations, NER can identify entities in semi-structured text such as a person, a subject, or something as specific as a scientific term. NER often uses the surrounding words to determine the value of the entity. In the “Get directions to New York City” example, pre-trained probabilistic models assume that whatever words come after “Get directions to” can be safely classified as a location. A command like “Get directions to the nearest gas station” works for the same reason, with ‘the nearest’ being a defined qualifier that precedes the location.

NER assists ASR in resolving words as their entities. On the basis of voice input alone, “New York City” is recognized as the separate words “new”, “York”, and “city”. NER then identifies these as a single location and adjusts the result to “New York City”. NER is highly contextual and often needs additional input to determine entities confidently. Because NER relies on previous training, it will sometimes be unable to determine an input’s entity with confidence.
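The context rule described above can be sketched in a few lines of Python. This is a toy, rule-based stand-in and not a trained probabilistic model: the regular expression, the qualifier list, the small location table, and the function name tag_location are all hypothetical and exist only to illustrate the idea.

```python
import re

# Hypothetical context rule: text after "get directions to" is a location.
LOCATION_CONTEXT = re.compile(r"^get directions to\s+(?P<location>.+)$", re.IGNORECASE)

# Hypothetical qualifiers that may precede the location itself.
QUALIFIERS = ("the nearest ",)

# Hypothetical table mapping raw ASR words to a canonical entity name.
KNOWN_LOCATIONS = {"new york city": "New York City"}


def tag_location(command):
    """Return a dict describing the location entity, or None if no rule applies."""
    match = LOCATION_CONTEXT.match(command.strip())
    if match is None:
        return None
    value = match.group("location")
    qualifier = None
    for candidate in QUALIFIERS:
        if value.lower().startswith(candidate):
            qualifier = candidate.strip()
            value = value[len(candidate):]
            break
    # Resolve separate ASR words such as "new York city" to one known entity.
    value = KNOWN_LOCATIONS.get(value.lower(), value)
    return {"entity": value, "type": "LOCATION", "qualifier": qualifier}


print(tag_location("get directions to new York city"))
print(tag_location("Get directions to the nearest gas station"))
```

A command that does not fit the context rule returns None, which mirrors the case where NER cannot confidently determine an entity.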

Speech Synthesis

Speech Synthesis produces an artificial human voice from input text. A VUI does its job in three stages: input, processing, and output. Speech Synthesis is the output stage: text-to-speech (TTS), in which a device reads text aloud in a simulated voice through a loudspeaker.

These AI technologies analyze, learn, and mimic human speech patterns and can also adjust the intonation, pitch, and cadence of speech. Intonation is the way a person’s voice rises or falls as they speak. Factors that affect intonation include emotion, accent, and diction. Pitch is how high or low a voice sounds, as in a squeaky or a deep voice. Cadence is the rhythm and flow of the voice as someone speaks or reads. For example, a public speaker may change their cadence by lowering their voice at the end of a declarative sentence to make an impact on their audience.

Once all of this information is stored and analyzed, these technologies use it to improve themselves and the VUI through machine learning. The cloud-based components then determine the intent of the user and return a response through the application or device.

Intents & Entities

Voice commands consist of intents and entities. The intent is the objective of the voice interaction, and there are two kinds: local intents and global intents. A local intent applies when the user is asked a question to which they respond “Yes” or “No”. A global intent applies when the user gives a more complex answer. When designing VUIs, the different ways a command can be phrased need to be taken into consideration in order to recognize the intent and respond correctly. For example, a request for directions might be phrased “Get directions to 1600 Pennsylvania Avenue” or “Take me to 1600 Pennsylvania Avenue”. Entities are variables within intents. Think of them as the blanks to fill in a Mad Libs booklet, such as “Book a hotel in {location} on {date}” or “Play {song}.”
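One way to picture intents and entities is as templates with named blanks. The Python sketch below turns templates like the ones above into regular expressions and pulls the entities out of an utterance. The intent names, the TEMPLATES table, and the parse function are assumptions for illustration; a real VUI platform would use trained models instead of exact templates.

```python
import re

# Hypothetical intent templates; each {slot} is an entity.
TEMPLATES = {
    "book_hotel": ["Book a hotel in {location} on {date}"],
    "play_song": ["Play {song}"],
    "get_directions": ["Get directions to {location}", "Take me to {location}"],
}


def compile_template(template):
    """Turn 'Play {song}' into a regex with one named group per slot."""
    parts = re.split(r"\{(\w+)\}", template)
    pattern = ""
    for index, part in enumerate(parts):
        # Even positions hold literal text, odd positions hold slot names.
        pattern += re.escape(part) if index % 2 == 0 else f"(?P<{part}>.+?)"
    return re.compile(f"^{pattern}$", re.IGNORECASE)


COMPILED = [
    (intent, compile_template(template))
    for intent, templates in TEMPLATES.items()
    for template in templates
]


def parse(utterance):
    """Return (intent, entities) for the first matching template, else (None, {})."""
    text = utterance.strip().rstrip(".!?")
    for intent, regex in COMPILED:
        match = regex.match(text)
        if match:
            return intent, match.groupdict()
    return None, {}


print(parse("Take me to 1600 Pennsylvania Avenue"))
print(parse("Book a hotel in Paris on Friday"))
```

Listing two templates under get_directions shows how different phrasings of the same request can resolve to a single intent.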

VUI vs GUI

User Experience (UX) is the overall experience of using a product such as a website or application, including how aesthetically pleasing it is and how easy it is to navigate. VUI and GUI both play a large role in UX design because together they shape how consumers interact with a product.

Voice User Interface

As explained earlier, a Voice User Interface (VUI) enables users to interact with a device or application using spoken commands. VUIs give users hands-free control of technology, often without even having to look at the device.

Graphical User Interface (GUI)

A Graphical User Interface (GUI) is the graphical layout and design of a device. For example, the screen display and apps on a smartphone or computer are a graphical user interface. A GUI can be used to display visuals for a VUI, such as a graphic of sound waves when a voice assistant on a smartphone responds to its user. Apple Siri and Google Assistant are real-life examples of VUI and GUI working together.

Apple Siri VUI & GUI

Apple Siri is activated by saying “Hey Siri” (VUI) or by pressing and holding the home button of the Apple device. Users know that Siri is active when Siri says “What can I help you with?” through the speaker or shows it on the screen using GUI. While a user speaks to Siri, a colorful waveform moves in time with the sound of speech. This shows users that Siri is actively listening and processing their question. When a user is quiet, Siri prompts “Go ahead, I’m listening…” If the user still does not respond, the screen displays “Some things you can ask me:” with a few examples of what Siri can do, such as calling, making FaceTime calls, emailing, and more.

This GUI feature caters specifically to people who are new to Siri and unsure what to do. The Apple device also displays what the user asked and Siri’s response on the screen to show what was understood from the interaction. Siri also lets users customize the gender, accent, and language of its voice.

Google Assistant VUI & GUI

Google Assistant responds to users when it hears “OK Google” or “Hey Google.” At the bottom of the screen, colorful dots appear to let the user know that Google Assistant has been activated and is ready to listen. While it waits for the user to ask a question, the dots move in a wave formation until it detects speech. Once the user starts speaking, the dots transform into bars that move in time with the sound of speech to let users know it is processing information. Another GUI feature of Google Assistant is that it displays what the user asked along with Google’s responses. As with Apple Siri, this display is another way of showing users what was understood from the interaction. Google Assistant is also customizable in language and accent.

VUI vs Voice AI

The term Voice Artificial Intelligence (Voice AI) is very commonly used alongside VUI. Because the two are closely connected, they are often confused as meaning the same thing. VUI is about the voice user experience on a device. Voice AI is the term for the underlying speech technologies. The technologies that fall under the Voice AI umbrella are Automatic Speech Recognition, Named Entity Recognition, and Speech Synthesis.

Different VUI approaches

Voice command devices, also known as voice assistants, use VUI, and their feedback can be auditory, tactile, or visual. Devices range from a small speaker to a blue light in a car’s stereo that blinks when it hears a command. Common examples of voice command devices are Siri on the iPhone, Alexa, and Google Home. These voice assistants are made to help people with daily tasks. Devices also fall into genres according to what the VUI is used for, which influences how the interaction between the user and device is set up.

VUI Device Genres

  • Smartphones
  • Wearables
    • Smart wrist watches
  • Stationary Connected Devices 
    • Desktop computers
    • Sound systems
    • Smart TVs
  • Non-Stationary Computing Devices
    • Laptops
    • Speakers
  • Internet of Things (IoT)
    • Thermostats
    • Locks 
    • Lights 

Each voice-enabled device has different functionality. A smart TV will respond to a request to change the channel, but not to a request to send a text message as a smartphone would. Users can ask for news and weather information or simply send a text by voice with the power of VUI. Besides devices, there are also VUI-integrated, voice-controlled apps that serve the same purpose. The VUI interacts with an app in a task-oriented workflow, a knowledge-oriented workflow, or both. Task-oriented workflows carry out actions the user asks for, such as setting an alarm or making a phone call. Knowledge-oriented workflows respond to the user by drawing on secondary sources like the internet, for example to answer a question about Mt. Everest’s height.
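The split between the two workflows can be sketched as a small Python dispatcher. Everything here is hypothetical: the intent names, the handlers, and the search_web stub (which only echoes the query and stands in for a real search backend) are assumptions chosen to illustrate the routing, not the behavior of any real assistant.

```python
# Hypothetical task handlers: each carries out an action on the device.
TASK_HANDLERS = {
    "set_alarm": lambda entities: f"Alarm set for {entities['time']}.",
    "call_contact": lambda entities: f"Calling {entities['contact']}.",
}


def search_web(query):
    """Stand-in for a real search backend (assumption); it only echoes the query."""
    return f"Here is what I found for '{query}'."


def respond(intent, entities, utterance):
    """Route to a task handler when one exists, otherwise fall back to a lookup."""
    handler = TASK_HANDLERS.get(intent)
    if handler is not None:
        return "task", handler(entities)
    return "knowledge", search_web(utterance)


print(respond("set_alarm", {"time": "7 am"}, "Set an alarm for 7 am"))
print(respond(None, {}, "How tall is Mt. Everest?"))
```

Treating the knowledge-oriented path as the fallback is only one possible design; an assistant could equally classify the workflow type first and then pick a handler.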

The Benefits of VUIs

The primary benefit of VUIs is that they offer a hands-free experience that users can interact with while focusing on something else. This can save time in daily routines, for example when checking the weather or setting an alarm the night before work.

VUI in Workflows & Lifestyles

VUI supports multitasking and productivity in workplaces ranging from offices to outdoor labor sites. A Voice User Interface can contribute to worker safety by assisting users in hazardous settings such as construction sites, oil refineries, driving, and more. Traditional devices like phones and computers aren’t the only devices connected to the internet or to VUI. Smart light fixtures, thermostats, smart locks, and other Internet of Things (IoT) devices are connected as well. These VUI devices are useful for travelers and busy families, who can control them by voice at home or remotely from a smartphone.

Improving Lives

With individualized experiences, VUI can lead society to a more accessible world and help provide a better quality of life. VUI benefits users with disabilities, such as people who are visually impaired or who cannot use visual interfaces or keyboards. VUI is also becoming popular with seniors who are new to technology. Aging affects abilities such as the senses, movement, and memory, which makes VUI an alternative to hands-on assistance. With the help of VUI, older adults can communicate with loved ones and use devices with less confusion and frustration.

VUI in Education

Educational strategies are constantly being updated for learners of all ages. VUI can be a learning tool: classrooms can interact with a voice assistant to create a new experience and cater to different learning styles. Because VUI requires little training, it is easy to use with almost any audience.

Technology Innovation

As VUI grows, it will change the way that products are designed and create new job demand. VUI design will become a key skill for designers as the user experience evolves. User Experience (UX) designers are typically trained to design experiences for physical input and graphical output. VUI design follows different guidelines and principles, which will encourage designers to focus more on it. In 2019, it was estimated that 111.8 million people in the US would use a voice assistant at least monthly, up 9.5% from the previous year. As people use voice assistants more and more, doing so will become a habit, and voice will become a feature users expect on every device.

Once the habit has formed, users will find it easier to speak to a device than to operate it physically. This will create high demand for designers with VUI knowledge and contribute to changing how devices are designed.

Lastly, another benefit of voice command devices is that they are not limited to what they were originally programmed to do. Over time, the interaction between the user and the voice user interface improves through machine learning, as discussed earlier. The user learns how to better utilize the voice command device, and the device in turn learns how to work with its user.

Solutions With Alan

With the Alan Platform, it is simple to create your own voice interface designed for natural communication and conversation. Signing up for an account with Alan Studio gives you access to the complete Alan IDE, where you can create a VUI and integrate it with any pre-existing app. The Alan Platform lets you create a Voice User Interface entirely within your browser and embed the code into any app, so you only have to write it once and do not need to worry about compatibility issues.

Final Thoughts 

Voice User Interfaces went from recognizing only the digits 0 through 9 to understanding more than a million words in different styles of speaking. VUI has never stopped progressing, and it is creating new job demand and becoming an important focus in User Experience design. As VUI progresses, more voice assistants and solutions are being created to benefit society. Companies and consumers are switching to VUI or combining Graphical User Interfaces with VUI.

Voice assistants come in many shapes, forms, and genres. Each device uses VUI for its own purpose, such as supporting productivity in workflows, lifestyles, and education. What they all have in common is that they help users in their everyday lives with a hands-free user experience. This is done using a range of Artificial Intelligence technologies, including Automatic Speech Recognition, Named Entity Recognition, and Speech Synthesis.

Another reason VUI never stops growing and improving is that it is not limited to what it was originally programmed to do. Over time, the interaction between the user and the voice user interface improves through machine learning. The user learns how to better utilize the voice command device, and the device in turn learns how to work with its user. Together they are working towards a more advanced artificial intelligence and voice user interface.

This article was reposted at dev.to here:
https://dev.to/alanvoiceai/what-is-voice-ui-2ga7
