Voice search is poised to become the next big shift in how users interact with technology. A Voice User Interface (VUI) is therefore no longer expected to follow a linear pattern or flow. If it does, and then fails to interpret the user's instructions correctly, it's highly likely that the user will never come back.
In this talk, Josey Sandoval explains the basics of how a VUI works and the details of the key components involved in its functioning.
Into the Jungle – Amazon Alexa’s Senior PM, Josey Sandoval
Serving as a Senior Product Manager for Amazon Alexa, Josey Sandoval has over 10 years of experience in Product Management, Program Management, and Project Management, including new product introduction, enterprise web applications, and productized machine learning and AI. He is on the Alexa Skills Kit Team and focuses on Custom skills, such as Pizza Hut, My Pet Rock, and Whiskeypedia.
What Are Voice User Interfaces?
A VUI allows people to use voice input to control computers and devices. Voice interfaces are the driving force behind the growing success of Amazon Alexa and Google Home. Even for devices with screens, such as the Echo Show, UX design is voice first. The Graphical User Interface (GUI) supplements the VUI, as opposed to the VUI simply describing what is on the screen. If done well, users can speak naturally and don’t need to follow fixed flows.
Three major VUI components
- Logic: State and Fulfillment (business logic)
- Inputs: SLU – Spoken Language Understanding
- Responses: TTS – Text to Speech
Generally, this is how a VUI works: the user says something to the device; the device picks that up, packs it into an audio stream, and sends it to platform services such as Amazon Alexa or Google Home. The platform runs it through a system called ASR – Automatic Speech Recognition – which takes that audio stream and converts it into tokenized words. These words are then run through a system called NLU – Natural Language Understanding – which tries to interpret their meaning.
After the interpretation, the platform sends the payload, often in JSON format, to the skill, which resides on a server. Next, the business logic runs, and the response is sent back to the platform in the form of TTS – Text to Speech – and some other components. The response is finally converted into audio and shipped back to the device, which plays it for the user.
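The round trip described above can be sketched end to end. This is a minimal, illustrative mock-up: every function name is a hypothetical stand-in for a platform component, not a real Alexa or Google API, and the ASR/NLU stages are stubbed rather than backed by real models.

```python
def asr(audio_stream: bytes) -> list[str]:
    """Automatic Speech Recognition: audio stream -> tokenized words (stubbed)."""
    # A real ASR model would decode the audio; here we fake a plausible result.
    return ["turn", "on", "the", "kitchen", "lights"]

def nlu(tokens: list[str]) -> dict:
    """Natural Language Understanding: tokens -> intent + slots (stubbed)."""
    return {"intent": "TurnOnDevice", "slots": {"device": "kitchen lights"}}

def skill_fulfillment(request: dict) -> dict:
    """The skill's business logic, normally running on the developer's server."""
    device = request["slots"]["device"]
    return {"tts": f"OK, turning on the {device}."}

def tts(text: str) -> bytes:
    """Text to Speech: response text -> audio (stubbed as encoded text)."""
    return text.encode("utf-8")

# The platform glues the stages together:
audio_in = b"..."                      # device ships the user's utterance
request = nlu(asr(audio_in))           # SLU = ASR + NLU
response = skill_fulfillment(request)  # JSON-like payload in, payload out
audio_out = tts(response["tts"])       # converted to audio, shipped to device
print(audio_out.decode("utf-8"))
```

The key structural point the sketch preserves is that the skill only ever sees the structured intent/slot payload, never the raw audio.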
Spoken Language Understanding (SLU)
- The foundation of UX in a voice application
- Consists of two Machine Learning (ML) systems (ASR + NLU)
- SLU Authoring is a developer (customer) facing product
ML models determine the output of ASR and NLU, and the model is different for every skill.
Challenge of VUI
Good ML modeling requires people, processes, tools, and long-term investment, and even then the return on investment is not guaranteed.
Workaround currently implemented
Productized SLU Authoring
- Skill creators provide app-specific model data
- Samples only; not every variation is required
- Platform handles all the SLU/ML science stuff
- Model accuracy is highly dependent on sample quality
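To make "samples only" concrete, here is a sketch of the kind of model data a skill creator might provide. It is loosely modeled on an Alexa-style interaction model, but the field names are simplified and illustrative, not the exact platform schema; the intent and slot names are hypothetical.

```python
import json

# Simplified, illustrative model data for a gift-finding skill.
# The creator supplies a handful of samples; the platform's ML
# generalizes from them rather than requiring every variation.
gift_model = {
    "intents": [
        {
            "name": "GetGiftIntent",
            "slots": [
                {"name": "relationship", "type": "RELATIONSHIP"},
                {"name": "occasion", "type": "OCCASION"},
            ],
            "samples": [
                "I need to get a gift for my {relationship}'s {occasion}",
                "find a {occasion} present for my {relationship}",
                "help me shop for my {relationship}",
            ],
        }
    ],
}

print(json.dumps(gift_model, indent=2))
```

Note how sample quality matters here: samples that cover varied phrasings of the same intent give the platform's models more signal, which is why accuracy depends so heavily on them.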
It's crucial that the ML models correctly identify the user's intent and fill the required slots (variables) in order to interpret the utterance. For example, if a user says "I need to get a gift for my mom's birthday", getting a gift is the intent, and the slots to be filled are relationship (mom ~ mother) and occasion (birthday).
What if the user doesn't provide all the necessary slots in the first utterance? The system then runs follow-up cycles: it executes some business logic and forms the questions needed to fill those slots. For example, it might ask the user "What is the name of your mother?"
A lot of effort can go into SLU authoring tools:
- Built-in Intents
- Custom Intents
- Intent Sample Utterances
- Slot Sample Utterances
- Built-in List Slot Types
- Slot Suggester
- Grammar Base Slot Types
- Custom Slot Types
- Intent History
- Authoring GUI and APIs
The Product Manager’s Role in This
- Product teams define features and tools to democratize creation of great VUI.
- Paving the road to the North Star:
Josey concludes by noting that the voice-enabled devices and apps we see today are not "it". The technology and the market still have a long way to go before becoming truly self-sufficient and robust. The VUI devices on our desks are just the beginning.