Talk To Me: Speech Recognition, NLP, & UX/UI

We all know intelligent virtual assistants (IVAs) have taken the world by storm. Interacting with our devices via voice commands is becoming the norm, and despite the hilarious misunderstandings that sometimes happen, the technology behind it is pretty amazing.

This is software - a set of instructions that tell a computer what to do - understanding human speech. It’s a program that comprehends, analyzes, and then comes up with a response, all from a person speaking to it. Not so long ago, this kind of technological tool would have been considered science fiction. Now it’s something we take for granted whenever we use a smartphone.

This verbal interaction is made possible by a technology known as speech recognition: the ability of software to automatically and accurately recognize human speech. It relies on artificial intelligence (AI) - those smart algorithms that learn from large amounts of data.

Fundamentals

The ability to convert human speech or spoken audio into text is the basis of speech recognition. At its most basic, it’s about running an audio file through a recognizer (an algorithm that has been trained to recognize the words being spoken) and getting a text transcription as a result. This includes dictation tools, where you speak into your device and the software types out what you’re saying into an email or a document.

IVAs take this technology to the next level: thanks to the integrated AI, the software has the ability to act upon what has been spoken. In other words, it recognizes speech as a command and responds accordingly.

How It Works

To start with, a device needs the right tools to recognize speech. A good microphone is essential for speech recognition to function correctly. When someone speaks a command, the microphone transforms the sound vibrations into an electrical signal. The hardware (say, a sound card) then converts this signal into digital data.
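To make the "signal into data" step a little more concrete, here is a minimal sketch of sampling and quantization. It is not how any particular sound card works: the 16 kHz sample rate, the 16-bit sample size, and the names `digitize` and `tone` are all assumptions chosen for illustration, and a pure sine wave stands in for the microphone's output.

```python
import math

SAMPLE_RATE = 16_000  # samples per second (assumed; a common rate for speech)
MAX_INT16 = 32767     # largest value of a signed 16-bit sample


def digitize(signal, duration_s, sample_rate=SAMPLE_RATE):
    """Sample a continuous signal and quantize it to 16-bit integers.

    `signal` is a function of time in seconds that returns an amplitude
    in [-1.0, 1.0]; it stands in for the microphone's electrical output.
    """
    n_samples = round(duration_s * sample_rate)
    samples = []
    for i in range(n_samples):
        t = i / sample_rate                          # time of this sample
        amplitude = max(-1.0, min(1.0, signal(t)))   # clip out-of-range values
        samples.append(round(amplitude * MAX_INT16)) # quantize to an integer
    return samples


def tone(t):
    """Hypothetical input: a 440 Hz tone at half volume."""
    return 0.5 * math.sin(2 * math.pi * 440 * t)


# Ten milliseconds of "audio" becomes a plain list of integers.
pcm = digitize(tone, duration_s=0.01)
print(len(pcm), pcm[:5])
```

The list of integers is the kind of raw data a recognizer would then work on.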

Here’s where the AI in the software kicks in, initiating a matching process. The algorithm compares the data of the spoken command to a pre-existing database of established words, sentences, and phrases. The size and quality of that database largely determine how accurate (or smart) the speech recognition software is. For instance, Google Assistant uses the vast data pools provided by Google. Once the software finds a match for the spoken command, it processes the data and responds by executing the corresponding action.
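The matching idea can be illustrated with a toy example. This sketch is a heavy simplification: it compares text that has already been transcribed against a tiny, made-up list of commands using Python's standard `difflib`, whereas real assistants match audio using trained acoustic and language models. The command list, the `best_match` name, and the 0.6 threshold are all assumptions.

```python
from difflib import SequenceMatcher

# Hypothetical "database" of phrases the software knows about.
KNOWN_COMMANDS = ["call bob", "play music", "set an alarm", "what is the weather"]


def best_match(heard, known=KNOWN_COMMANDS, threshold=0.6):
    """Return the known command most similar to `heard`, or None.

    Similarity is difflib's ratio in [0, 1]; anything scoring below
    `threshold` is treated as "no match found".
    """
    heard = heard.lower().strip()
    scored = [(SequenceMatcher(None, heard, k).ratio(), k) for k in known]
    score, command = max(scored)
    return command if score >= threshold else None


print(best_match("Call Bobb"))  # a slightly misheard command still matches
print(best_match("zzzz"))       # nothing in the database is close
```

A bigger list of known phrases gives the matcher more chances to find something close, which is a rough analogue of why larger data pools tend to help.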

So, for example, when you tell Siri to ‘call Bob’, the assistant sends the digital signal to Apple’s gargantuan servers, which search for a match for ‘call’ and for ‘Bob’ (all of this based on previously gathered information). Once a match is found, the assistant looks up Bob’s number on your phone and dials it.
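Once the words have been recognized, acting on them can be sketched in a few lines. This is only an illustration of the idea, not how Siri is implemented: the contact list, the phone number, and the `handle_command` function are hypothetical, and the "dialing" is just a returned message.

```python
# Hypothetical address book stored on the phone.
CONTACTS = {"bob": "+1-555-0100"}


def handle_command(text, contacts=CONTACTS):
    """Turn recognized text into an action (here, a descriptive string)."""
    words = text.lower().split()
    # Match the verb first ('call'), then treat the rest as the contact name.
    if len(words) >= 2 and words[0] == "call":
        name = " ".join(words[1:])
        number = contacts.get(name)
        if number is None:
            return f"No contact named {name!r}"
        return f"Dialing {number}"
    return "Sorry, I did not understand that"


print(handle_command("call Bob"))
print(handle_command("call Alice"))
```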

Raising the Bar: Natural Language Processing (NLP)

NLP has become an essential part of software that aims to understand words. Since human language is complex, something had to be done to give artificial intelligence a boost. We all know languages work in odd ways. Grammatical rules exist, but they are applied inconsistently; then we have words with different meanings depending on the context, not to mention local slang and colloquial expressions. All of these have to be taught to the AI for it to be able to understand people when they speak naturally.

Hence, NLP: a set of algorithms that provide support to the software by emulating the ability to comprehend language. These vary in complexity and provide different levels of support, but the most advanced versions use cognitive technology - such as semantic technology - as a base. These algorithms are trained by developers to disambiguate meanings by identifying all the structural elements of a language. NLP uses glossaries, dictionaries, and semantic networks as references. It relies on all available information so that it can determine what a word means in a particular context. It’s all about the lexicon.
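One classic way to use a glossary for disambiguation is a simplified version of the Lesk algorithm: pick the sense whose definition shares the most words with the surrounding sentence. The sketch below shows that technique on a hand-made glossary; it is not the method any specific assistant uses, and the glossary entries, stopword list, and function names are assumptions for illustration.

```python
# Hypothetical mini-glossary: each word maps sense names to definitions.
GLOSSARY = {
    "bank": {
        "financial institution": "an institution where people deposit money and take out loans",
        "river edge": "the sloping land beside a river or stream of water",
    }
}

# Very common words that carry little meaning on their own.
STOPWORDS = {"a", "an", "the", "of", "and", "or", "to", "in", "at", "on", "i", "my", "where"}


def tokenize(text):
    """Lowercase the text, strip basic punctuation, and drop stopwords."""
    return {w.strip(".,!?").lower() for w in text.split()} - STOPWORDS


def disambiguate(word, sentence, glossary=GLOSSARY):
    """Pick the sense whose definition overlaps most with the sentence."""
    context = tokenize(sentence) - {word}
    senses = glossary[word]
    return max(senses, key=lambda s: len(context & tokenize(senses[s])))


print(disambiguate("bank", "I need to deposit money at the bank"))
print(disambiguate("bank", "We sat beside the river bank"))
```

Word overlap is a crude signal, but it shows how a reference like a glossary can help software decide what a word means in context.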

NLP can be found at the core of IVAs. This is what allows the latter to understand spoken commands. NLP aims to make the interaction between humans and technology smooth and natural, improving overall user experience.

UX/UI Comes Into Play

A few UX/UI factors need to be taken into account when it comes to providing top-notch user experience for speech recognition and NLP software:

I’m Listening

Sound is essential. The whole point of interacting with software through voice commands is to lessen the need to interact physically with the device. Hence, all responses coming from said device have to be conveyed in a manner that facilitates this - in other words, through sound. This applies particularly to mobile technology; if you’re speaking to your phone from across the room, visual replies showing up on the screen won’t do much good.

An audible response is a key element of the user’s experience and essential to making the interaction feel natural. The voice you give to your software therefore plays a crucial role when it comes to providing great UX. The user should feel as though the replies coming from the device are spoken by a fellow human.

This is where UX design involving speech recognition and NLP software differs from the norm. Other programs and apps may focus solely on the visual aspects of the UI, but with this kind of software, sound becomes much more important. It’s the key to a satisfying user experience.

A Natural Feel

One of the biggest challenges for speech recognition software is making the experience feel seamless. The software is supposed to understand what you say when you speak naturally, just as if you were talking to another person. Nonetheless, those of us who have used an IVA know this interaction between human and device has a tendency to go awry.

Every one of us speaks differently, with our own personal inflections. We express ourselves in a wide array of different ways. We use expressions in our daily lives that shift and mutate depending on lots of factors (from the political atmosphere to the latest trending film). For the software to understand people, it needs a considerable amount of training. This is the main challenge for developers when it comes to providing a great user experience.

Security

Another factor that needs to be taken into consideration when dealing with software that integrates speech recognition is security. Users should be aware that the words they are speaking are often processed on remote servers rather than on the device itself. This is particularly significant when it comes to software that records or stores what the user is saying for subsequent processing. Users need to know that their data is being used.

Most of the time, this is done through Terms of Service agreements, but the UI design should also include clear ways of letting users know that what they are saying is being registered or processed in some way. For example, when speaking to Siri, a series of wavelike undulations moves across the lower part of the screen. This way, the user knows Siri is ‘listening’ to what they are saying. By making it visually evident, the person becomes instantly aware that the software is running and any security issues can be more easily prevented.