Entries Tagged ‘Speech Recognition’

The State of Voice User Interfaces

Tuesday, September 20th, 2011 by Robert Cravotta

While touch interfaces have made a splash in the consumer market, voice-based user interfaces have been quietly showing up in more devices. Interestingly, voice user interfaces were expected to become viable long before touch interfaces. The technical challenges of implementing a successful speech recognition capability far exceeded what research scientists expected. That did not, however, stop story writers and film productions from adopting voice user interfaces in their portrayals of the future. Consider the ship's computer in the Star Trek series. In addition to using proximity sensors that worked uncannily well in understanding when to open and close doors, the ship's computer in Star Trek was able to tell when a person was issuing a request or command versus when they were just talking to another person.

Today, the quiet rise of speech recognition in consumer devices is opening up a different way to interact with devices – one that does not require the user to focus their eyes on a display to know where to place their fingertips to issue commands. Improving speech recognition technology is also providing people with dyslexia an alternative way to interact with devices. However, systems that rely on speech recognition face a number of subtle challenges that make it difficult to provide a reliable and robust voice user interface.

For a voice interface to be useful, there are a number of ambiguities the system must be able to resolve. In addition to accurately identifying what words are spoken, the system must be able to reliably filter out words that are not issued by the user. It must also be able to distinguish between words from the user that are intended for the system and those intended for another person or device.

One way that systems enable a user to actively assist the speech recognition module in resolving these types of ambiguity is to require the user to press and/or hold a button indicating that they are issuing a voice command. By relying on an unambiguous input, such as a button press, the speech recognition module is able to leverage the system's processing capacity at the time a command is most likely being issued. This approach supports lower-power operation because it enables the system to avoid operating in an always-on mode that can drain the system's energy store.
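As a rough sketch of this press-to-talk gating, consider the polling loop below, which only runs the recognizer while the button is held. The function names (button_pressed, recognizer_start, recognizer_stop) are hypothetical placeholders rather than any particular vendor's API.

```c
/* Hypothetical press-to-talk gating sketch. The recognizer only runs while
 * the talk button is held, so the audio path does not need to stay powered
 * in an always-on listening mode. All function names are placeholders. */
#include <stdbool.h>

extern bool button_pressed(void);    /* debounced state of the talk button */
extern void recognizer_start(void);  /* power up the mic path and begin capture */
extern void recognizer_stop(void);   /* end capture and power the path back down */

void voice_input_poll(void)          /* called periodically, e.g., from a timer tick */
{
    static bool listening = false;
    bool pressed = button_pressed();

    if (pressed && !listening) {
        recognizer_start();          /* the user has signaled a command is coming */
        listening = true;
    } else if (!pressed && listening) {
        recognizer_stop();           /* releasing the button ends capture, saving energy */
        listening = false;
    }
}
```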

The positive button press also prompts the user, even unconsciously, to make accommodations for the environment they are talking in. If the environment is noisy, users may move to a quieter location, or position themselves so that the device microphone is shielded from the surrounding noise, such as by cupping the device with their hand or placing it close to their mouth. This helps the system behave more reliably in a noisy environment, but it relies on the user's actions to improve the noise immunity. An ideal speech recognition module would have high immunity to noisy environments while consuming little energy, without having to rely on the user.

But detecting when the user is speaking and issuing a command to the device is only the first step in implementing a viable voice user interface. Once the system has determined that the user is speaking a command, it has four more steps to complete to close the loop between the system and the user. Following voice activation, the module needs to perform the actual speech recognition and transcription step. This stage of speech processing also relies on a high level of immunity to noise, but the noise immunity does not need to be as robust as it is for the voice activation stage because this stage of processing is only active when the system has already determined that the user is speaking a command. This stage relies on high accuracy to successfully separate the user's voice from the environmental noise and transcribe the sound waves into symbols that the rest of the speech module can use.

The third stage of processing takes the output of the transcribed speech and determines the intent and meaning of the speech so as to be able to accurately understand what the user is asking for. This stage of processing may be as simple as comparing the user’s input to a constrained set of acceptable words or phrases. If a match is found, the system acts on it. If no acceptable match is found, the system may prompt the user to reissue the command or ask the user to confirm the module’s guess of the user’s command.
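As a concrete, if simplified, illustration of the constrained-match case, a sketch in C might look like the following. The command table and the exact-string comparison are invented for illustration and stand in for a real module's matching logic, which would likely add confidence scores and fuzzy matching.

```c
/* Minimal sketch of matching a transcription against a constrained command
 * set. The command list is hypothetical; a real system would likely score
 * candidates rather than rely on an exact strcmp. */
#include <stddef.h>
#include <string.h>

static const char *accepted_commands[] = {
    "call home",
    "play music",
    "what time is it",
};

/* Returns the index of the matched command, or -1 when no acceptable match
 * is found so the caller can prompt the user to repeat or confirm a guess. */
int match_command(const char *transcription)
{
    for (size_t i = 0; i < sizeof(accepted_commands) / sizeof(accepted_commands[0]); i++) {
        if (strcmp(transcription, accepted_commands[i]) == 0)
            return (int)i;
    }
    return -1;
}
```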

For more sophisticated speech recognition, this stage of processing resolves ambiguity in the semantics of the issued command. This may involve considering each part of the speech in context with the whole message spoken by the user to identify contradictions that could signal an inappropriate way to interpret the user’s spoken words. If the system is able to process free form speech, it may rely on a significant knowledge of language structure to improve its ability to properly identify the meaning of the words the user actually spoke.

The next stage of processing involves acting on the issued command. Is the command a request for information? Is it a request to activate a component in the system? The processing performed during this stage is as varied as the tasks a system can perform. The final stage is to ensure that the user receives appropriate feedback that their command was received, properly interpreted, and that the appropriate actions were started, are in progress, or have completed. This might involve an audio tone, haptic feedback, an audio acknowledgement, or even a change in the display.
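Tying the stages together, the overall flow can be summarized as the sequence sketched below. The enum names and the stage_run() helper are hypothetical placeholders for the processing each stage actually performs; they are meant only to show how the loop between user and system closes.

```c
/* The five stages discussed above expressed as a simple sequence. stage_run()
 * is a hypothetical wrapper that returns false when the cycle should stop
 * early (e.g., no command detected or no acceptable match found). */
#include <stdbool.h>

typedef enum {
    STAGE_ACTIVATION,      /* detect that the user is addressing the device */
    STAGE_TRANSCRIPTION,   /* separate the voice from noise and transcribe it */
    STAGE_INTERPRETATION,  /* resolve what the transcribed words actually mean */
    STAGE_ACTION,          /* carry out the requested task */
    STAGE_FEEDBACK         /* confirm to the user what the system did */
} voice_stage_t;

extern bool stage_run(voice_stage_t stage);

void voice_command_cycle(void)
{
    if (!stage_run(STAGE_ACTIVATION))     return;  /* nothing addressed to the device */
    if (!stage_run(STAGE_TRANSCRIPTION))  return;
    if (!stage_run(STAGE_INTERPRETATION)) return;  /* e.g., prompt the user to retry */
    if (!stage_run(STAGE_ACTION))         return;
    stage_run(STAGE_FEEDBACK);                     /* tone, haptic, audio, or display update */
}
```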

There are a number of companies providing the technology to implement speech recognition in your designs. Two of them are Sensory and Nuance. Nuance provides speech recognition software, while Sensory provides both hardware and embedded software for speech recognition. Please share the names and links of any other companies that you know provide tools and resources for speech recognition in the comments.

User Interfaces: Test Bench and Process for Projects

Tuesday, May 11th, 2010 by Robert Cravotta

[Editor's Note: This was originally posted on Low-Power Design.]

To avoid confusion or the need for repetitious information in later posts, I will now describe the test bench and process I am using for these touch development kit projects. This process is something I have refined over a number of previous hands-on projects that involved using embedded resources from multiple vendors. The goal of the process is to extract the most value from the effort while minimizing the time spent dealing with the inevitable usability issues that arise when working with so many different kits.

I have several goals when I perform this kind of hands-on project. First and foremost, I want to understand the state-of-the-art for touch development kits. Each company is serving this market from a slightly different angle, and their board and software support reflects different design trade-offs. I believe uncovering those trade-offs will provide you with better insight into how each kit can best meet your needs in your own touch projects.

Additionally, each company is at a different point of maturity in supporting touch. Some companies are focusing on providing the best signal-to-noise ratio at the sensor level, and the supported software abstractions may require you to become an expert in the sensor's idiosyncrasies to extract that next new differentiating feature. Alternatively, the company may focus on simplifying the learning curve to implement touch in your design; the software may abstract more of the noise filtering and allow/limit you to treat touch as a simple on/off switch or an abstracted mouse pointer. Or the company's development kit may focus on providing rich filtering capabilities while still allowing you to work with the raw signals for truly innovative features. My experience suggests the kits will run the entire gamut of maturity and abstraction levels.

Another goal is to help each company that participates in this project improve their offering. One way to do this is to work with an application engineer from the company who understands the development kit we will be working with. Working with the application engineer not only permits the company to present their development kit's capabilities in the best possible light and enables me to complete the project more quickly, but it also puts the kit through a set of paces that invariably causes something to not work as expected. This helps the application engineer gain a new understanding of how the touch kit can be used by a developer, which results in direct feedback to the development team and spawns refinements that improve the kit for the entire community. This is especially relevant because many of the kits will have early adopter components – software modules that are "hot off the press" and may not have completely gone through the field validation process yet. This exercise becomes a classic developer and user integration effort that is the embedded equivalent of dogfooding (using your own product).

In addition to the development boards and software that are included in each touch development kit, I will be using a Dell Inspiron 15 laptop computer running Windows 7 Home Premium (64-bit) as the host development system. One reason I am using this laptop is to see how well these development kits support the Windows 7 environment. Experience suggests that at least one kit will have issues that will be solved by downloading or editing a file that is missing from the production installation files.

So in short, I will be installing the development software on a clean host system that is running Windows 7. I will be spending a few hours with an application engineer, either over the phone or face-to-face, as we step through installing the development software, bringing up the board and verifying it is operating properly from the factory, loading a known example test program, building a demonstration application, and making some impromptu tweaks to the demonstration application to find the edges of the system's capabilities. From there, I will post about the experience with a focus on what types of problem spaces the development kit is best suited for, and what opportunities you may have to add new differentiating capabilities to your own touch applications.

If you would like to participate in this project, post here or email me at Embedded Insights.

User Interfaces: Introduction

Tuesday, April 13th, 2010 by Robert Cravotta

[Editor's Note: This was originally posted on Low-Power Design.]

The pursuit of better user interfaces constantly spawns new innovative ideas to make it easier for a user to correctly, consistently, and unambiguously direct the behavior of a machine. For this series, I propose to explore logical and direct as two user interface categories. Both categories are complex enough to warrant a full discussion on their own.

I define logical user interfaces as the subsystems that manage the signals and feedback that exist within the digital world of software after real-world filtering. Logical user interfaces focus on the ease of teaching and learning the communication mechanisms, especially feedback, between user and machine, so that the user can quickly, accurately, and intuitively control a system as they intend, with a minimum of stumbling to find the way to tell the system what they want it to do.

I define direct user interfaces as the subsystems that collect real-world signals at the point where user and machine directly interface with one another. For a keyboard, this would include the physical key switches. For mouse-based interfaces, this would include the actual mouse mechanism, including buttons, wheels, and position sensing components. For touch interfaces, this would include the touch surface and sensing mechanisms. For direct user interface subsystems, recognizing and filtering real-world noise is an essential task.

A constant challenge for direct user interfaces is how to accurately infer a user’s true intent in a noisy world. Jack Ganssle’s “Guide to Debouncing” is a good indication of the complexity that designers still must tame to manage the variable, real-world behavior of a simple mechanical switch with the user’s expectations for simple and accurate operation when the user toggles a switch to communicate with the system.
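As a point of reference for the kind of processing involved, a minimal counter-based debounce routine is sketched below. It is in the general spirit of the techniques Ganssle surveys rather than code taken from his guide, and read_switch_raw() is a hypothetical hardware-access function.

```c
/* Minimal counter-based debounce sketch. read_switch_raw() is a hypothetical
 * function returning the instantaneous (possibly bouncing) switch level;
 * this is a generic illustration, not code from Ganssle's guide. */
#include <stdbool.h>
#include <stdint.h>

extern bool read_switch_raw(void);

#define DEBOUNCE_SAMPLES 10   /* e.g., 10 consecutive samples at a 1 ms poll rate */

/* Call at a fixed rate (e.g., from a timer tick); returns the debounced state. */
bool debounced_switch_state(void)
{
    static bool    stable_state = false;
    static uint8_t count = 0;

    bool raw = read_switch_raw();
    if (raw != stable_state) {
        if (++count >= DEBOUNCE_SAMPLES) {  /* new level has persisted long enough */
            stable_state = raw;
            count = 0;
        }
    } else {
        count = 0;                          /* any agreement resets the counter */
    }
    return stable_state;
}
```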

As systems employ more complex interface components than mere switches, the amount of real-world input variability these systems must accommodate increases. This is especially true for the rapidly evolving types of user interfaces that include touch screens and speech recognition. Similar to the debounce example, these types of interfaces are relying on increasing amounts of software processing to better distinguish real-world signal from real-world noise.

To begin this series, I will be focusing mostly on the latter category of direct user interfaces. I believe understanding the challenges of extracting user intent from within a sea of real-world noise is essential before discussing how to address the types of ambiguity and uncertainty that logical user interfaces are subject to. Another reason to start with direct user interfaces is that over the previous year there has been an explosion of semiconductor companies that have introduced, expanded, or evolved their touch interface offerings.

To encourage a wider range of developers to adopt their touch interface solutions, these companies are offering software development ecosystems around their mechanical and electrical technologies to make it easier for developers to add touch interfaces to their designs. This is the perfect time to examine their touch technologies and evaluate the maturity of their surrounding development ecosystems. I also propose to explore speech recognition development kits in a similar fashion.

Please help me identify touch and speech recognition development kits to try out and report back to you here. My list of companies to approach for touch development kits includes (in alphabetical order) Atmel, Cypress, Freescale, Microchip, Silicon Labs, Synaptics, and Texas Instruments. I plan to explore touch buttons and touch screen projects for the development kits; companies that support both will have the option to support one or both types of project.

My list of companies to approach for speech recognition development kits includes (in alphabetical order) Microsoft, Sensory, and Tigal. I have not scoped the details for a project with these kits just yet, so if you have a suggestion, please share.

Please help me prioritize which development kits you would like to see first. Your responses here or via email will help me to demonstrate to the semiconductor companies how much interest you have in their development kits.

Please suggest vendors and development kits you would like me to explore first in this series by posting here or emailing me at Embedded Insights.