Jazz Up Consumer Products With Speech Recognition

Speech recognition historically has been constrained to PC-based systems, telephone servers, and high-end cell phones and PDAs. But advances in recent years have brought low-cost speech-recognition processors into the realm of consumer electronics.

Today's speech-recognition processors contain more integration on chip, they're more accurate, and they're accompanied by better tools, making it relatively easy to add speech I/O to consumer products. Speech control of environmental lighting for the home represents one potential consumer application.

Types Of Speech Recognition

Speech recognition (sometimes called voice recognition or VR) can be broadly divided into three classes: speaker-independent (SI), speaker-dependent (SD), and speaker verification (SV). Each has advantages that can be exploited in different applications. Products using SI require speech commands that work "out of the box" without user training.

For example, SI typically serves light controllers best. It's also better to use one SI command, called a "trigger," to get the light controller's attention, just as we use names to get the attention of our associates. Once triggered, the product can listen for a set of multiple commands.

Products that incorporate speech recognition usually need a way to let the user know they have heard a command and are ready for further instruction. That is, they must let the user know where the product is in the control flow. The light controller will use a short tone since the flow is simple. This reduces the time spent interacting with the controller and is less intrusive if a false fire occurs.

Since speech is a natural interface for humans, speech recognition adds ease-of-use to products. Speech also extends beyond the user's physical reach. A speech-controlled light switch provides this kind of value. Perhaps the user is sitting down watching TV and the light is out of reach. Or, it may be too dark to see the light control. A simple voice command conveniently solves these dilemmas.

Design Considerations

Because speech recognition is based on probabilistic functions, designers must balance two goals: accepting commands that are in the recognition set and rejecting utterances that aren't. For instance, if the product must be very responsive and an occasional false acceptance (false firing) isn't catastrophic, the application developer may want to favor acceptance. Other applications, such as a voice-operated oven control or a light controller, can't tolerate false firing.

Background noise is the bane of speech recognition. Both detection and recognition assume a reasonable signal-to-noise ratio (SNR) — about 3:1 or better. Managing noise with directional or close-talking microphones is best when the application naturally allows for that approach.

Cost also is a factor. When an end user buys a product, the initial manufacturing cost has been marked up four to five times. Fortunately, the highly integrated speech processors that are now available contain the necessary microphone preamplifier, analog-to-digital converter (ADC), digital filters, core processor, digital-to-analog converter (DAC), and math engines.

They also come bundled with text-to-speaker-independent (T2SI) recognition and synthesis technologies. The chips can double as the primary controller of the consumer product functions, too, and they're priced competitively for consumer electronics. This results in little or no incremental increase in cost to add speech.

Light Controller Theory of Design

The very features that make a VR light controller appealing also contribute to the challenges of speech recognition in this application. Recognizing a command at a distance in a home environment means competing with background noises like speech, TV, music, clattering dishes, and banging doors. Such an application also must work well for adults and children of both sexes.

Speech recognition is only as good as the integrity of the signal being processed, so proper design of the microphone circuit is fundamental. This circuit should be designed so that the combination of microphone, bias resistor, and preamplification stage makes the best use of the ADC bits, using as many bits as possible for resolution without saturating. The design should account for the range of signal power produced by people who speak softly and loudly, and for the distances at which the light controller is likely to be used (typically up to about 10 ft.).

It's best to bias a light controller toward avoiding false fires. (Users occasionally may have to repeat the command in noisy situations.) This can be accomplished with the Quick T2SI tool settings. Keeping the command set's size as small as possible is important to minimize substitutions of the wrong commands. This is particularly true in noisy environments, such as the home. To maximize differentiation, the T2SI commands should differ as much as possible in sound and length.

Finally, the logic flow for the light controller must be simple and natural to use. The steps to navigate from getting the light controller's attention to reaching the active command set should be minimized to avoid user confusion. The trigger word should always be duplicated in the active command set so that the users can always reestablish their place in the flow. The trigger word should be easy to associate with the light controller function, and the active commands should be typical for light control. The flow chart illustrates the flow we will use (Fig. 1).

Hardware Design

The Sensory VR Stamp is used in this example to simplify the development of the light controller. The VR Stamp is a low-cost module containing a Sensory RSC-4128 microprocessor, audio-circuit discrete capacitors and microphone preamplifier, 3.58-MHz crystal, reset circuit, and 128 kbytes of flash memory for program code.

It also has 128 kbits of serial EEPROM memory, which isn't used in the light-controller application (Fig. 2). The VR Stamp Toolkit comes with VR Stamps, Integrated Development Environment (IDE), Quick T2SI, FluentChip Library (a variety of speech recognition and synthesis functions, including T2SI), VR Stamp Programmer board, and supporting documentation.

In the voice-activated light-controller circuit, the VR Stamp module listens for spoken commands from the user. It then supplies a control signal to turn the light on and off and sets the desired lamp brightness by setting the duty cycle (Fig. 3).

A 120-V, 60-Hz ac line source powers the circuit. A transformer (T1) and a diode bridge (D1) step down and rectify the ac line voltage to dc. Because the RSC-4128 operates from 2.4 to 3.6 V, a regulator (U1) then produces a stable 3.3-V supply for the VR Stamp. A 3300-Ω resistor (R1) reduces the ac line current down to a few microamps so the RSC-4128 can detect when the voltage crosses the zero-threshold point.

Internal diodes prevent chip damage from overdriving. A diac/triac pair (U2/Q2) controls the ac line at the output plug (P2). A 100-µF capacitor (C3) must be present to filter out low-frequency ripples on V_DD. Unstable V_DD will couple into the audio circuitry and reduce voice-recognition accuracy.

A microphone (MK1) for voice-recognition input and a speaker (LS1) for tone outputs complete the application's functional blocks. This is a textbook circuit that's used to power the light. By delaying the firing, it also can dim the light's brightness. Four light-switch brightness settings are implemented. Two of the settings, "dimmer high" (fully on) and "power off," use a 100% and 0% duty cycle, respectively. The other two settings, "dimmer medium" and "dimmer low," use approximately a 50% and 10% duty cycle, respectively.
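
The relationship between a brightness setting and the triac firing delay can be sketched as follows. This is a minimal illustration, not the article's firmware: it assumes a 60-Hz line (half-cycle of about 8333 µs) and takes "duty cycle" to mean the fraction of each half-cycle during which the triac conducts. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Half-cycle of a 60-Hz line in microseconds: 1 / (2 * 60 Hz) = 8333 us. */
#define HALF_CYCLE_US 8333u

/* Brightness settings named in the article; the enum itself is hypothetical. */
typedef enum { POWER_OFF, DIMMER_LOW, DIMMER_MEDIUM, DIMMER_HIGH } brightness_t;

/* Approximate conduction duty cycle, in percent, for each setting. */
static unsigned duty_percent(brightness_t level)
{
    switch (level) {
    case DIMMER_HIGH:   return 100u;
    case DIMMER_MEDIUM: return 50u;
    case DIMMER_LOW:    return 10u;
    default:            return 0u;   /* POWER_OFF */
    }
}

/* Delay from the zero crossing to the triac firing point, in microseconds.
 * The triac conducts from the firing point until the next zero crossing,
 * so a longer delay means a dimmer lamp. At 0% duty the result equals the
 * whole half-cycle; the caller should simply never fire in that case. */
static uint32_t firing_delay_us(unsigned duty)
{
    assert(duty <= 100u);
    return (uint32_t)HALF_CYCLE_US * (100u - duty) / 100u;
}
```

Under these assumptions, "dimmer medium" fires about 4.2 ms after each zero crossing and "dimmer low" about 7.5 ms after it. Lamp power is not linear in conduction time, so the perceived brightness will not track the percentages exactly.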

Designers should remember two guidelines when they create pc boards that implement voice recognition:

Stable analog power and ground: A regulator should be used to keep the power and ground signals as stable as possible. The circuit board should be physically organized to keep all analog power and ground signals separate from digital ground. Analog power and grounds should be run separately to the main power and ground source, which for this application is the voltage regulator. This technique is commonly called "star grounding." Place the voltage regulator as close to the MIC_RET pin of the VR Stamp as practical, and use thick wire and pc-board traces for all power and ground signals.
Short distance/shielded microphone wire: It's a good practice to keep all analog pc-board traces as short as possible. In particular, the main audio signal path from the positive side of the microphone to the VR Stamp should be kept as short as practical. The high-impedance audio signal has an amplitude of only a few millivolts peak-to-peak. To avoid an antenna effect from digital noise and electromagnetic interference (EMI), shielded cable must be used to connect the microphone to the circuit.

The VR Stamp is designed to provide superior recognition results with an inexpensive omnidirectional electret microphone. A Panasonic WM-64PKT was selected for this application, but many other manufacturers and models might be used. While electret microphones require an external power source to drive the internal FET buffer, they also behave as a current source when biased. In addition, biasing controls the overall microphone sensitivity. In the dimmer light switch, a microphone with a sensitivity of -44 dB was used. If a microphone with a different sensitivity is used, then the microphone bias resistor (R4) should be modified using the formula:

where Sensitivity is the sensitivity of the microphone you want to use (specified in -dB in the microphone's specification), R is the microphone impedance, and R_S is the microphone bias resistor (R4) required for the specified sensitivity.

Microphone placement also is critical for effective VR design. There are three important guidelines to remember:

Flush mounting: The microphone element should be positioned as close to the mounting surface as possible, and it should be fully seated in the plastic housing. There can't be any airspace between the microphone element and the housing.
No Obstructions, Large Hole: The area in front of the microphone element must be kept clear of obstructions to avoid interference with recognition. The diameter of the hole in the housing in front of the microphone should be at least 5 mm. Any necessary plastic surface in front of the microphone should be as thin as possible, being no more than 0.7 mm, if doable.
Insulation: The microphone should be acoustically isolated from the housing to prevent auditory noises produced by handling or jarring the product from being "picked up" by the microphone.

Software Design

The Sensory VR Stamp runs programs developed using the FluentChip technology firmware tools and libraries. FluentChip programs are created and managed using the IDE tool included with the VR Stamp Toolkit. A program consists of one or more code modules, which can be written in assembly language or C, plus other program resources, including object data files for T2SI recognition sets and SX speech prompts.

T2SI trigger and command sets are created using Quick T2SI, a Windows-based SI-recognition-set creation tool. To use this GUI-based tool, designers type words and phrases to be recognized into text boxes, press the "Build" button, and a custom SI set is created. Triggers should be entered in the trigger text box, and commands in the command text box.

The words and phrases can be tested using the PC, or downloaded into a VR Stamp for testing. If some words seem too hard to recognize, or prone to substitution errors, designers should adjust or "tune" the pronunciation of the recognition words and phrases and quickly retest. The Quick T2SI tool also creates object files that can be linked into any T2SI application.

The "Out of Vocabulary Sensitivity" setting in the Quick T2SI tool should be set to Reject More or Reject Most to reduce false firing. The T2SI words have been carefully chosen for this application to be easily discriminated by VR, and because they're natural to the user. For example, "on" and "off" are poor choices because they sound too similar to each other and have a high chance of confusion, or substitution.

A longer word like "power" is a better choice. Moreover, this single word can be used as a toggle to turn the lamp on or off, depending on its current state. The other command words, "dimmer low," "dimmer medium," "dimmer high," and "light switch," are long enough and dissimilar enough for substitution errors to be unlikely.

FluentChip applications must include a configuration file (config.mca) that includes the microphone distance (hardware gain) setting. This should be set to FAR_MIC. Also, the interrupt-service-routine (ISR) "stubs" can be found in the configuration file. These should be used for the brightness duty-cycle control.

The library contains over 200 useful application-programming-interface (API) calls. However, only one API call is required for T2SI recognition:

uchar _T2SI(long acousticModel, long grammarModel, uchar knob, char timeout, uchar trailing, PARAMETERPASS *pRes);

where:

acousticModel, grammarModel: pointers to the SI recognition acoustic model and grammar tables (created by Quick T2SI)
knob: confidence threshold level, 0 to 4 (0 = loosest, 4 = tightest)
timeout: maximum listening time in 1-second units
trailing: minimum ending silence duration in 0.025-second units
*pRes: pointer to results structure
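
Because the timeout and trailing arguments use different time units, small conversion helpers can make call sites easier to read. The sketch below is illustrative only: the helper names are hypothetical and aren't part of the FluentChip library, and the rounding choices are assumptions.

```c
#include <assert.h>

typedef unsigned char uchar;   /* matches the type name used in the prototype */

/* Convert a listening window in milliseconds to the "timeout" argument,
 * which counts whole seconds. Rounds up so the window is never shortened. */
static char timeout_units(unsigned ms)
{
    unsigned seconds = (ms + 999u) / 1000u;
    assert(seconds <= 127u);   /* must fit in a (possibly signed) char */
    return (char)seconds;
}

/* Convert an ending-silence duration in milliseconds to the "trailing"
 * argument, which counts 0.025-second (25-ms) steps. Rounds to nearest. */
static uchar trailing_units(unsigned ms)
{
    unsigned steps = (ms + 12u) / 25u;
    assert(steps <= 255u);     /* must fit in an unsigned char */
    return (uchar)steps;
}
```

For example, a three-second listening window gives timeout_units(3000) = 3, and a half-second ending silence (a value chosen here purely for illustration) gives trailing_units(500) = 20.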

The T2SI API returns an error-code byte. The error code for the light controller will be one of the following: 0x00 = no error (high confidence); 0x01 = timeout; 0x04 = too soft; 0x12 = low confidence; and 0x13 = medium confidence. Some errors are applicable only in certain recognition events. For example, a "timeout" or "too soft" error cannot occur during trigger recognition. It can only happen during command recognition.

The programmer should decide whether to accept "low confidence" or "medium confidence" errors based on the application requirements. The light controller should accept "no error" (high confidence) for triggers to avoid false fires, and high or medium confidence for commands to improve hit rate at that menu level.
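
The acceptance policy just described can be written as one small decision function. This is a sketch under stated assumptions: the numeric codes are those listed above, but the constant, type, and function names are hypothetical rather than FluentChip definitions.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned char uchar;

/* Error codes as listed in the article; the names are hypothetical. */
enum {
    T2SI_OK          = 0x00,   /* no error (high confidence) */
    T2SI_TIMEOUT     = 0x01,
    T2SI_TOO_SOFT    = 0x04,
    T2SI_LOW_CONF    = 0x12,
    T2SI_MEDIUM_CONF = 0x13
};

typedef enum { STAGE_TRIGGER, STAGE_COMMAND } stage_t;

/* Decide whether to act on a recognition result.
 * Triggers require high confidence to avoid false fires; commands also
 * accept medium confidence to improve the hit rate at that menu level. */
static bool accept_result(stage_t stage, uchar err)
{
    assert(stage == STAGE_TRIGGER || stage == STAGE_COMMAND);
    if (err == T2SI_OK)
        return true;
    if (stage == STAGE_COMMAND && err == T2SI_MEDIUM_CONF)
        return true;
    return false;   /* timeout, too soft, low confidence, or unknown code */
}
```

Keeping the policy in one place makes it simple to tighten or loosen later, for instance if field testing shows too many false fires at the command level.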

Since this application uses a trigger plus command to operate, the user should be prompted that the trigger has been recognized and the system is now waiting for a command. A short tone is selected for the light controller because there's only one simple command set.

The voice-activated light controller continuously listens for the trigger phrase "light switch." When the trigger is recognized, the speaker emits a short sound, and a three-second listening window is opened for one of the five command phrases: "dimmer low," "dimmer medium," "dimmer high," "power," and "light switch."

"Power" acts as a toggle to switch the light controller off if it's on, and to the last selected brightness level if the light controller is off. "Light switch" also is included as a command in case the user didn't hear the tone after saying "light switch" as the trigger, or if the user is confused about which mode the program is in. In any case, if "light switch" is recognized, the program will sound the tone again and open the three-second command listening window.

In the RSC-4128, all recognition is done in the foreground. In the VR light controller, the light control and dimmer functions are done in the background. Two interrupts, GPIO and a 1-ms timer, control the switching of the ac during the cycle.

First, a GPIO interrupt detects a zero-threshold crossover of the ac line voltage. This starts a timer that causes an interrupt approximately once every millisecond. Depending on the desired brightness-level setting, the program switches on the diac/triac at different places along the sine wave. When the next zero-threshold crossover occurs, the diac/triac is switched off and the process repeats.
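
The two-interrupt scheme can be modeled on a host PC as a pair of handler functions that share a tick counter. This is a simplified sketch, not the RSC-4128 firmware: the handler and variable names are hypothetical, and a plain flag stands in for the I/O pin that drives the diac/triac.

```c
#include <assert.h>
#include <stdbool.h>

/* State shared by the two interrupt handlers. */
static volatile unsigned ticks_since_zero;  /* 1-ms ticks since the last zero crossing */
static volatile unsigned fire_tick;         /* tick at which to fire; set by the foreground code */
static volatile bool     lamp_enabled;      /* false corresponds to "power off" */
static volatile bool     gate_on;           /* stands in for the diac/triac drive pin */

/* GPIO interrupt: the ac line has crossed the zero threshold. */
static void on_zero_cross(void)
{
    assert(fire_tick <= 8u);        /* a 60-Hz half-cycle lasts about 8.3 ms */
    gate_on = false;                /* switch off at every zero crossing */
    ticks_since_zero = 0;           /* restart the delay count */
    if (lamp_enabled && fire_tick == 0)
        gate_on = true;             /* full brightness: fire immediately */
}

/* Timer interrupt: fires approximately once per millisecond. */
static void on_timer_tick(void)
{
    ticks_since_zero++;
    if (lamp_enabled && ticks_since_zero >= fire_tick)
        gate_on = true;             /* reached the chosen point on the sine wave */
}
```

With 1-ms ticks and a half-cycle of roughly 8.3 ms, only about eight firing points are available per half-cycle, which is adequate for the four brightness settings used here.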

Testing and Tuning

There are two ways to verify a voice-recognition circuit's quality: so-called "karaoke" and "diagnostic output."

Karaoke is a quick but subjective test that measures the microphone and analog front-end quality by allowing the designer to input speech and listen to the quality received by the processor's internals. It's also simple to use. Just add one line of code to the application:

Karaoke #N ; where N is the number of seconds 1 to 255

This will place the program into a special mode in which the microphone input will be internally connected to the VR Stamp's digital-to-analog converter and pulse-width-modulation (PWM) speaker driver to output speech. Speak into the microphone and listen to the output. The speech should be undistorted, and it shouldn't have any obvious background noise. Listen to the output when not speaking. Try disconnecting the microphone. If any noise is heard from the speaker, there may be an electrical noise problem in the circuit.

Diagnostic output provides detailed data on the characteristics of each recognition event. Each time the program calls any recognition function or macro, diagnostic information is automatically transmitted serially on a dedicated I/O pin. Two I/O pins usually are required for diagnostic output, one for the actual serial output and one to enable or disable the diagnostic during program reset.

A line of diagnostic output for T2SI might look like this:

T 00 00022 01691 20 00023 003 000 00012 00128 001 00263 00512 000 00115 00096

Format of output:

"T" = unique T2SI header byte
2-digit hex error code (00 = no error)
5-digit decimal silence level
5-digit decimal maximum power
2-digit AGC value (Sensory internal use)
5-digit decimal duration
3-digit (Sensory internal use — count of groups of three to follow)
3-digit (Sensory internal use)
5-digit (Sensory internal use)
5-digit (Sensory internal use)
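
When diagnostic lines are captured on a PC, the leading fields can be pulled out with a short parser. The sketch below is a hypothetical host-side helper, not a Sensory tool. It reads only the first five documented fields, ignores the internal-use fields that follow, and assumes the AGC value is printed in decimal.

```c
#include <assert.h>
#include <stdio.h>

/* Leading fields of one T2SI diagnostic line. */
typedef struct {
    unsigned error_code;   /* 2-digit hex error code */
    unsigned silence;      /* 5-digit decimal silence level */
    unsigned max_power;    /* 5-digit decimal maximum power */
    unsigned agc;          /* 2-digit AGC value (assumed decimal) */
    unsigned duration;     /* 5-digit decimal duration */
} t2si_diag_t;

/* Parse the start of a diagnostic line.
 * Returns 1 if all five fields were read, 0 otherwise. */
static int parse_t2si_diag(const char *line, t2si_diag_t *out)
{
    assert(line && out);
    return sscanf(line, "T %2x %5u %5u %2u %5u",
                  &out->error_code, &out->silence, &out->max_power,
                  &out->agc, &out->duration) == 5;
}
```

Applied to the sample line above, this yields an error code of 0, a silence level of 22, a maximum power of 1691, an AGC value of 20, and a duration of 23.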

Of particular importance are the error code, silence level, maximum power, and AGC value. The software design section above describes the nature of the error codes. The "silence level" can indicate the system's inherent noise. When the microphone is unplugged, the silence level value should be as low as possible (usually less than "00050" when AGC value = "21").

Higher numbers may indicate EMI noise in the system. When a typical user speaks into the microphone at a normal volume from a distance appropriate to the hardware gain configuration (e.g., far), the maximum power should be in the range of 02000 ± 500. Numbers consistently outside of this range suggest the microphone is biased for too much or too little gain, or the wrong hardware gain setting was chosen.
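
The maximum-power check lends itself to a simple automated test over several captured utterances. In this sketch, the names are hypothetical, and using the mean of the readings to stand in for "consistently" is an assumption made for illustration.

```c
#include <assert.h>

#define MAX_POWER_TARGET 2000u   /* center of the expected range */
#define MAX_POWER_TOL     500u   /* allowed deviation either way */

typedef enum { GAIN_TOO_LOW, GAIN_OK, GAIN_TOO_HIGH } gain_check_t;

/* Compare the mean maximum power of n recognition events with the
 * 2000 +/- 500 window. */
static gain_check_t check_gain(const unsigned *max_power, unsigned n)
{
    unsigned long sum = 0;
    unsigned long mean;
    unsigned i;

    assert(max_power && n > 0);
    for (i = 0; i < n; i++)
        sum += max_power[i];
    mean = sum / n;

    if (mean < MAX_POWER_TARGET - MAX_POWER_TOL)
        return GAIN_TOO_LOW;     /* too little gain, or wrong gain setting */
    if (mean > MAX_POWER_TARGET + MAX_POWER_TOL)
        return GAIN_TOO_HIGH;    /* too much gain; risk of saturation */
    return GAIN_OK;
}
```

A result other than GAIN_OK suggests revisiting the microphone bias resistor or the hardware gain setting before tuning anything else.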