“Hi Siri, tell me about yourself.” “Hi! I’m Siri, but enough about me… how can I help?”
Voice Assistants have been around for a while, but it is only recently that they have become truly “smart”.
All major applications and websites support search by voice today. About 27% of all mobile searches in 2018 were voice searches. Voice Assistants take it one step further. Put in simple terms, talking to a Voice Assistant is like having a real conversation with someone whose job is to help you.
To understand why Voice Assistants matter, imagine the following scenarios:
A daughter teaches her 60-year-old father how to order a new shirt online, “Dad, just open this icon and speak about the kind of shirt you want!”
A young student opens his shopping app knowing exactly what he wants, but with no intention of applying four filters manually, “Show me all non-marking shoes by Puma. The size should be UK size 6. Don’t show me anything that costs above Rs. 5000.”
A woman is working on her presentation for work when she remembers she has not ordered groceries for tomorrow, “Repeat my cart from last time and remove grapes.”
To help users shop by talking to the app, we started working on a platform to enable a smart voice assistant at Flipkart. To understand how voice technology works, let us look at its high-level components:
Automatic Speech Recognition or ASR: Understanding what someone is saying and converting that input to text.
This text will be processed to tag it with relevant information for the next step.
Example: “Show me (search) all non-marking shoes (product category) by Puma (brand filter). The size should be UK size 6 (size filter). Don’t show me anything that costs above Rs. 5000 (price filter: under 5000).”
Natural Language Processing or NLP: Understanding the meaning of what was said.
We now take the text from the previous step and use the highlighted tags to “make sense” of the sentences.
There are three main components here:
Intent: What is the main action the Assistant has to take? In our example, it is “search for something”.
Entities: What is the additional information we need to fulfil the intent? In our example, the entities are the details of the products we have to search for: “shoes” that are “from Puma, size UK 6, under Rs. 5000”.
Context: Was there something else the user said before or something the system remembers about this user that can be used here? In our previous example, the system may know that the user is male and apply that filter. Additionally, the system may refer to a previous query and build on top of it. See the flow chart below for such an interaction.
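To make intent, entities and context a little more concrete, here is a minimal Python sketch. It is only an illustration and not how our platform works: the keyword lists, the regular expressions, the parse_query function and the shape of the context dictionary are all assumptions made up for this example. A real assistant would use trained models rather than hand-written rules.

```python
import re
from dataclasses import dataclass, field

# Hypothetical vocabularies, for illustration only.
KNOWN_BRANDS = {"puma", "nike", "adidas"}
KNOWN_CATEGORIES = {"shoes", "shirt"}


@dataclass
class ParsedQuery:
    intent: str
    entities: dict = field(default_factory=dict)


def parse_query(text, context=None):
    """Toy rule-based parser: extracts an intent and entities from ASR text."""
    lowered = text.lower()
    words = set(re.findall(r"[a-z]+", lowered))

    # Intent: the main action. Only "search" is recognised in this sketch.
    intent = "search" if "show me" in lowered else "unknown"

    # Entities: the details needed to fulfil the intent.
    entities = {}
    category = next((w for w in sorted(words) if w in KNOWN_CATEGORIES), None)
    if category:
        entities["category"] = category
    brand = next((w for w in sorted(words) if w in KNOWN_BRANDS), None)
    if brand:
        entities["brand"] = brand
    size = re.search(r"\buk(?: size)? (\d+)", lowered)
    if size:
        entities["size"] = f"UK {size.group(1)}"
    price = re.search(r"above rs\.?\s*(\d+)", lowered)
    if price:
        entities["max_price"] = int(price.group(1))

    # Context: things the system already knows fill gaps the user left open,
    # but never override what the user just said.
    for key, value in (context or {}).items():
        entities.setdefault(key, value)

    return ParsedQuery(intent=intent, entities=entities)


if __name__ == "__main__":
    query = ("Show me all non-marking shoes by Puma. The size should be "
             "UK size 6. Don’t show me anything that costs above Rs. 5000.")
    print(parse_query(query, context={"gender": "male"}))
```

Note the design choice in the last step: context only fills in what the user did not say, so an explicit request always wins over something the system remembers.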
Conversation Design Engine: Rules for how the Voice Assistant will respond back to users.
Responses are built from phrases that are combined according to a set of rules.
For example, for the shoe query above, the Assistant will realise that it is a search command, so the starting phrase would be, say, “Here you go!”. Then it will pull up the query components and add a confirmation: “Here you go! Browse for size UK 6 white non-marking shoes under Rs. 5000 by Puma.” Here, the response is already long, but if it were shorter, the Assistant would add, “I’m here if you need me.”
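The phrase-combining logic described here can be sketched in a few lines of Python. This is a simplified illustration under assumptions of my own: the build_response function, the fallback phrase and the 10-word threshold for "too long" are all hypothetical, and a real conversation design engine would have far richer rules.

```python
# Hypothetical phrase tables and threshold, for illustration only.
OPENERS = {"search": "Here you go!"}
FALLBACK = "Sorry, I didn’t get that."
SIGN_OFF = "I’m here if you need me."
MAX_WORDS_BEFORE_SIGN_OFF = 10


def build_response(intent, entities):
    """Combine an opener, a confirmation and an optional sign-off."""
    if intent not in OPENERS:
        return FALLBACK

    # Confirmation: read the query components back to the user.
    parts = ["Browse for"]
    if "size" in entities:
        parts.append(f"size {entities['size']}")
    parts.append(entities.get("category", "products"))
    if "max_price" in entities:
        parts.append(f"under Rs. {entities['max_price']}")
    if "brand" in entities:
        parts.append(f"by {entities['brand']}")
    response = f"{OPENERS[intent]} {' '.join(parts)}."

    # Only short responses get the friendly sign-off.
    if len(response.split()) <= MAX_WORDS_BEFORE_SIGN_OFF:
        response = f"{response} {SIGN_OFF}"
    return response


if __name__ == "__main__":
    print(build_response("search", {"category": "non-marking shoes",
                                    "brand": "Puma", "size": "UK 6",
                                    "max_price": 5000}))
    print(build_response("search", {"category": "shoes"}))
```

With the full shoe query the response runs past the threshold, so the sign-off is skipped; with a bare "shoes" query it is appended.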
Text-to-speech or TTS: Once we have the response ready, we convert it to a “voice” and the Assistant speaks back to the user.
Optimization:
Based on how users respond to different conversations, the platform optimises the rules.
Each component above will have a certain “error rate”, which is constantly improved based on data.
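To see why every component's error rate matters, consider how errors stack up along the pipeline. The sketch below is a back-of-the-envelope illustration only: the numbers are invented, the end_to_end_success function is hypothetical, and it assumes that components fail independently of each other, which is a simplification.

```python
import math


def end_to_end_success(error_rates):
    """Chance that every stage gets a request right.

    Assumes stage errors are independent, which is a simplification.
    """
    return math.prod(1.0 - rate for rate in error_rates.values())


if __name__ == "__main__":
    # Made-up error rates, not measurements from any real system.
    rates = {"asr": 0.10, "nlp": 0.10, "conversation": 0.05}
    print(f"End-to-end success: {end_to_end_success(rates):.1%}")
```

Under these assumptions, even modest per-component error rates multiply into a noticeably lower end-to-end success rate, which is why each stage is worth improving with data.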
Here is a flow chart to explain all the components better:
How sophisticated a particular Voice Assistant is depends on how robust all the above components are. For example, simpler Voice Search may not include any advanced context management.
While building the voice architecture was one part of the problem, UX was another. Should the microphone itself showcase advanced Assistant features, or should we have a floating icon to make it more distinct and give it a “personality”? Developing an intuition for human-computer interaction design is key when it comes to such AI products.
Before we end this one, I asked Google Assistant to sing a song for me: