Speech Channel

The State of Voice User Interfaces

Tuesday, September 20th, 2011 by Robert Cravotta

While touch interfaces have made a splash in the consumer market, voice-based user interfaces have been quietly showing up in more devices. Interestingly, voice user interfaces were expected to become viable long before touch interfaces. The technical challenges of implementing a successful speech recognition capability far exceeded what research scientists expected. That did not, however, stop story writers and film productions from adopting voice user interfaces in their portrayals of the future. Consider the ship's computer in the Star Trek series. In addition to using proximity sensors that worked uncannily well at knowing when to open and close doors, the ship's computer in Star Trek could tell when a person was issuing a request or command versus merely talking to another person.

Today, the quiet rise of speech recognition in consumer devices is opening up a different way to interact with devices – one that does not require users to focus their eyes on a display to see where to place their fingertips to issue commands. Improving speech recognition technology is also giving people with dyslexia an alternative way to interact with devices. However, a number of subtle challenges face systems that rely on speech recognition and make it difficult to provide a reliable and robust voice user interface.

For a voice interface to be useful, there are a number of ambiguities the system must be able to resolve. In addition to accurately identifying what words are spoken, the system must be able to reliably filter out words that are not issued by the user. It must also be able to distinguish between words from the user that are intended for the system and words intended for another person or device.

One way that systems enable a user to actively help the speech recognition module resolve these types of ambiguity is to require the user to press and/or hold a button to indicate that they are issuing a voice command. By relying on an unambiguous input, such as a button press, the speech recognition module can apply the system's processing capacity at the time a command is most likely being issued. This approach supports lower-power operation because it allows the system to avoid running in an always-on mode that can drain the system's energy store.
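A minimal sketch of this push-to-talk gating in C might look like the following. The button readings and recognizer actions are simulated with print statements purely for illustration; on a real device they would map to a GPIO read and calls into the speech recognition module, neither of which is specified here.

#include <stdbool.h>
#include <stdio.h>

/* Simulated button states, one per loop tick. On real hardware,
   button_is_pressed() would read a GPIO pin instead. */
static const bool button_samples[] = { false, false, true, true, true, false, false };
static int sample_idx = 0;

static bool button_is_pressed(void)
{
    if (sample_idx < (int)(sizeof button_samples / sizeof button_samples[0]))
        return button_samples[sample_idx++];
    return false;
}

int main(void)
{
    bool capturing = false;

    for (int tick = 0; tick < 7; tick++) {
        if (button_is_pressed()) {
            if (!capturing) {
                capturing = true;
                printf("tick %d: button down -> start audio capture\n", tick);
            } else {
                printf("tick %d: feeding audio frame to recognizer\n", tick);
            }
        } else if (capturing) {
            capturing = false;
            printf("tick %d: button up -> end of command, run recognition\n", tick);
        } else {
            /* No command expected: stay asleep instead of running
               the recognizer in an always-on mode. */
            printf("tick %d: idle, remain in low-power sleep\n", tick);
        }
    }
    return 0;
}

The design point is that the expensive recognition path is only ever reached while the unambiguous button input is asserted; everything else is a sleep state.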

The positive button press also prompts the user, even unconsciously, to make accommodations based on the environment in which they are talking. If the environment is noisy, users may move to a quieter location, or position themselves so that the device microphone is shielded from the surrounding noise, such as by cupping the device with their hand or placing it close to their mouth. This helps the system behave more reliably in a noisy environment, but it relies on the user's actions to improve noise immunity. An ideal speech recognition module would have high immunity to noisy environments while consuming little energy, without having to rely on the user.

But detecting when the user is speaking and issuing a command to the device is only the first step in implementing a viable voice user interface. Once the system has determined that the user is speaking a command, it has four more steps to complete to close the loop between the system and the user. Following voice activation, the module needs to perform the actual speech recognition and transcription step. This stage of speech processing also relies on a high level of immunity to noise, but the noise immunity does not need to be as robust as it is for the voice activation stage because this stage is only active when the system has already determined that the user is speaking a command. This stage relies on high accuracy to successfully separate the user's voice from the environmental noise and transcribe the sound waves into symbols that the rest of the speech module can use.
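The voice activation stage contrasted here is typically built around a cheap, always-on detector that gates the heavier recognition stage. A toy sketch using a simple energy threshold follows; real detectors use far more robust features, and the threshold, frame size, and synthetic test signal below are all illustrative assumptions, not taken from the article.

#include <math.h>
#include <stdint.h>
#include <stdio.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define FRAME_SAMPLES 160          /* 10 ms of audio at 16 kHz */
#define ENERGY_THRESHOLD 1000.0    /* illustrative value, tuned per device */

/* Mean squared amplitude of one PCM frame: a crude energy measure. */
static double frame_energy(const int16_t *frame, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += (double)frame[i] * frame[i];
    return sum / n;
}

int main(void)
{
    int16_t quiet[FRAME_SAMPLES] = {0};
    int16_t loud[FRAME_SAMPLES];

    /* Synthesize a "speech-like" frame: a 1 kHz tone at moderate amplitude. */
    for (int i = 0; i < FRAME_SAMPLES; i++)
        loud[i] = (int16_t)(8000.0 * sin(2.0 * M_PI * 1000.0 * i / 16000.0));

    printf("quiet frame energy: %.0f -> %s\n",
           frame_energy(quiet, FRAME_SAMPLES),
           frame_energy(quiet, FRAME_SAMPLES) > ENERGY_THRESHOLD ? "wake recognizer" : "stay asleep");
    printf("loud frame energy:  %.0f -> %s\n",
           frame_energy(loud, FRAME_SAMPLES),
           frame_energy(loud, FRAME_SAMPLES) > ENERGY_THRESHOLD ? "wake recognizer" : "stay asleep");
    return 0;
}

Because this check runs on every frame while the device listens, it must be cheap; the accurate but power-hungry transcription stage only runs once this gate opens.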

The third stage of processing takes the output of the transcribed speech and determines the intent and meaning of the speech so that the system can accurately understand what the user is asking for. This stage may be as simple as comparing the user's input to a constrained set of acceptable words or phrases. If a match is found, the system acts on it. If no acceptable match is found, the system may prompt the user to reissue the command or ask the user to confirm the module's guess at the user's command.
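A constrained command set of this kind reduces to a table lookup over the transcription. Here is a small sketch in C; the command phrases are invented for illustration, and the fallback simply prompts the user to repeat, one of the two recovery options described above.

#include <ctype.h>
#include <stdio.h>
#include <string.h>

static const char *commands[] = { "call home", "play music", "volume up", "volume down" };
#define NUM_COMMANDS (sizeof commands / sizeof commands[0])

/* Case-insensitive comparison so "Volume Up" still matches "volume up". */
static int matches(const char *a, const char *b)
{
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++; b++;
    }
    return *a == *b;  /* match only if both strings ended together */
}

static void handle_transcription(const char *text)
{
    for (size_t i = 0; i < NUM_COMMANDS; i++) {
        if (matches(text, commands[i])) {
            printf("executing: %s\n", commands[i]);
            return;
        }
    }
    /* No acceptable match: ask the user to reissue the command. */
    printf("did not understand \"%s\" -- please repeat the command\n", text);
}

int main(void)
{
    handle_transcription("Volume Up");
    handle_transcription("open the pod bay doors");
    return 0;
}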

For more sophisticated speech recognition, this stage of processing resolves ambiguity in the semantics of the issued command. This may involve considering each part of the speech in the context of the whole message spoken by the user to identify contradictions that could signal an inappropriate interpretation of the user's spoken words. If the system is able to process free-form speech, it may rely on significant knowledge of language structure to improve its ability to properly identify the meaning of the words the user actually spoke.
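As a deliberately tiny illustration of catching a contradiction rather than guessing, the sketch below scans a transcription for the words "on" and "off" and asks for clarification when both appear. The vocabulary and the rule are invented here; real semantic processing is far richer, but the shape, checking parts of the utterance against the whole before acting, is the same.

#include <stdio.h>
#include <string.h>

/* True if `word` appears in `text` as a whole, space-delimited word. */
static int contains_word(const char *text, const char *word)
{
    size_t wlen = strlen(word);
    const char *p = text;
    while ((p = strstr(p, word)) != NULL) {
        int start_ok = (p == text) || (p[-1] == ' ');
        int end_ok   = (p[wlen] == '\0') || (p[wlen] == ' ');
        if (start_ok && end_ok)
            return 1;
        p += 1;
    }
    return 0;
}

static void interpret(const char *text)
{
    int wants_on  = contains_word(text, "on");
    int wants_off = contains_word(text, "off");

    if (wants_on && wants_off)
        printf("\"%s\" -> contradictory, ask the user to clarify\n", text);
    else if (wants_on)
        printf("\"%s\" -> turn the device on\n", text);
    else if (wants_off)
        printf("\"%s\" -> turn the device off\n", text);
    else
        printf("\"%s\" -> no recognized intent\n", text);
}

int main(void)
{
    interpret("turn the lights on");
    interpret("turn the lights on no off");
    return 0;
}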

The next stage of processing involves acting on the issued command. Is the command a request for information? Is it a request to activate a component in the system? The processing performed during this stage is as varied as the tasks a system can perform. The final stage, though, is to ensure that there is appropriate feedback to the user that their command was received and properly interpreted, and that the appropriate actions have started, are in progress, or have even completed. This might involve an audio tone, haptic feedback, an audio acknowledgement, or even a change in the display.
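These last two stages often amount to a dispatch over resolved commands with an explicit acknowledgement on every path, including the failure path. A brief sketch, with an invented command set and feedback hooks standing in for a real tone generator or haptic driver:

#include <stdio.h>

typedef enum { CMD_CALL_HOME, CMD_PLAY_MUSIC, CMD_VOLUME_UP, CMD_UNKNOWN } command_t;

/* Stand-ins for real feedback mechanisms on the target device. */
static void play_ack_tone(void)   { printf("  [beep]\n"); }  /* audio tone   */
static void vibrate_briefly(void) { printf("  [buzz]\n"); }  /* haptic pulse */

static void dispatch(command_t cmd)
{
    switch (cmd) {
    case CMD_CALL_HOME:
        printf("dialing home...\n");
        play_ack_tone();          /* confirm the action has started */
        break;
    case CMD_PLAY_MUSIC:
        printf("starting playback...\n");
        play_ack_tone();
        break;
    case CMD_VOLUME_UP:
        printf("raising volume\n");
        vibrate_briefly();        /* quick confirmation of completion */
        break;
    default:
        /* Feedback matters on failure too: say what went wrong. */
        printf("command not understood -- please try again\n");
        break;
    }
}

int main(void)
{
    dispatch(CMD_VOLUME_UP);
    dispatch(CMD_UNKNOWN);
    return 0;
}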

There are a number of companies providing the technology to implement speech recognition in your designs. Two of them are Sensory and Nuance. Nuance provides software for speech recognition, while Sensory provides both hardware and embedded software. Please share the names and links of any other companies that you know provide tools and resources for speech recognition in the comments.