Speech Chip Listens Well, Talks Clearly, Plays Music

Low-cost, high-quality speech recognition and synthesis have been the holy grail of the electronics industry for several decades. They are here now in the RSC-4x family of speech-recognition and synthesis chips from Sensory Inc. Each chip needs just an...

Dave Bursky

April 1, 2002


Integration, advanced DSP technology, and an innovative architecture all played key roles in the chips' development. All necessary digital-signal processing, data storage, digital control, and analog input and output support are on-chip. As a single-chip solution, the RSC-4x family allows speech I/O to be easily incorporated into a wide range of consumer, automotive, cellular, and portable computing applications, among others.

Some of these chips sell for less than $2.00 in large quantities. They all provide high recognition accuracy for both speaker-dependent and speaker-independent applications, as well as high-quality speech synthesis.

The silicon works in tandem with the company's Sensory Speech 7 firmware, which consists of a suite of tools, algorithms, and libraries. With these tools, a product that combines over five minutes of compressed speech, multiple speaker-dependent and speaker-independent vocabularies, speaker verification, and all application code can be implemented on a single chip.

The chips can perform speaker-independent or speaker-dependent recognition, speaker verification, speaker-adaptive recognition, word spotting, and continuous-listening recognition. When coupled with the speech-synthesis capability, the RSC-4x processors can also perform voice recording and playback. Data rates of less than 14 kbits/s can be achieved while maintaining very high-quality reproduction.

In the synthesis mode, the chips can deliver high-quality speech synthesis based on a proprietary version of the linear-predictive code algorithm that allows the data rate to go as low as 5 kbits/s. The low data rate permits a considerable amount of prerecorded speech to be stored on-chip, or in an external ROM or flash device. A MIDI-like synthesis mode with four "voices" lets multiple instruments harmonize and generate music simultaneously.

Initially, there will be three versions of the chip--the RSC-4000, RSC-4128, and RSC-4256. The first, a ROMless implementation with address and data buses, enables users to store the vocabulary in external ROMs, flash memories, or static RAMs. The other two are ROM-based versions, with 128 and 256 kbytes of ROM, respectively.
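As a rough check on how much speech those ROM sizes could hold at the quoted data rates, the arithmetic can be sketched as follows. The helper name is illustrative. The sketch assumes 1 kbyte = 1024 bytes and 1 kbit = 1000 bits, and that the entire ROM is given over to speech, which a real product would not do because the application code and vocabularies share the same memory.

```python
def speech_seconds(storage_kbytes: float, rate_kbits_per_s: float) -> float:
    """Seconds of compressed speech that fit in the given storage.

    Assumptions (not from the article): 1 kbyte = 1024 bytes,
    1 kbit = 1000 bits, and all of the storage is used for speech.
    """
    storage_bits = storage_kbytes * 1024 * 8
    return storage_bits / (rate_kbits_per_s * 1000)


# ROM sizes and data rates quoted in the article.
for rom_kbytes in (128, 256):
    for rate in (5, 14):
        secs = speech_seconds(rom_kbytes, rate)
        print(f"{rom_kbytes} kbytes at {rate} kbits/s: "
              f"{secs:.0f} s ({secs / 60:.1f} min)")
```

Under these assumptions, 256 kbytes at 5 kbits/s comes to roughly seven minutes, so a budget of "over five minutes" of speech alongside application code looks plausible for the largest ROM version.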

Customized Core Controls All: To keep costs low, Sensory's designers based their signal-processing architecture around an 8051-compatible controller and a vector-processor unit (see the figure). While the controller incorporates an extended instruction set optimized for managing data operations, the custom vector processor handles many of the signal-processing tasks. The vector accelerator includes a single-cycle multiplier and twin DMA-channel controllers. The DMA controllers stream data from the on-chip or external memory to the vector engine.

Also contained on-chip is a 5-kbyte block of static RAM. Most of this RAM can be used to hold word data, but 256 bytes are set aside for the embedded microcontroller to handle the application program parameters.

In the recognition mode, the RSC-4x can execute both hidden-Markov modeling and a neural-network-based algorithm. Speaker-dependent recognition, which uses a dynamic time-warping technique, may require some external memory to store speech information (10 words can be stored on-chip), while speaker-independent vocabularies can be stored on- or off-chip.
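For readers unfamiliar with dynamic time warping, the textbook form of the algorithm is short enough to sketch. This is the generic method, not Sensory's implementation. It compares plain lists of numbers, whereas a real recognizer would compare sequences of per-frame feature vectors, and the sample sequences are invented for illustration.

```python
def dtw_distance(a, b):
    """Textbook dynamic time warping distance between two 1-D sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the cheapest cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the best of: stretch a, stretch b, or advance both.
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m]


# Invented sequences: a stored template, the same shape spoken at half speed,
# and a different shape of the same length.
template = [0, 1, 3, 4, 3, 1, 0]
slow = [0, 0, 1, 1, 3, 3, 4, 4, 3, 3, 1, 1, 0, 0]
other = [4, 3, 1, 0, 1, 3, 4]

print(dtw_distance(template, slow))   # same shape, different speed
print(dtw_distance(template, other))  # different shape
```

The point of the warping step is that the slowed-down utterance still matches its template closely, while a differently shaped utterance does not. That tolerance to speaking rate is what makes the technique suit user-trained words.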

Speaker-independent recognition requires no training and leverages the word libraries developed by Sensory, or customer-developed libraries incorporated into the on-chip or external ROM. The RSC-4x recognizes up to 16 words in an active set. (An active set is a limited list of words that the chip will recognize. In turn, each word in an active list can open another active list, and so on.)

Only the amount of internal or external memory limits the number of active sets. By using cascaded active lists, up to about 1000 words can be stored in the on-chip memories. But off-chip memory can handle an almost unlimited number of words.
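The cascaded active lists described above behave like a small tree of word lists. A minimal sketch of that data structure follows. The class, function, and word names are hypothetical and do not come from Sensory's tools. The rule that a word with no follow-on list returns to the top-level list is also an assumption made for the example.

```python
MAX_ACTIVE_WORDS = 16  # per-set limit quoted in the article


class ActiveSet:
    """One list of words the recognizer listens for at a given moment."""

    def __init__(self, name, words):
        if len(words) > MAX_ACTIVE_WORDS:
            raise ValueError(f"{name}: at most {MAX_ACTIVE_WORDS} words per set")
        self.name = name
        # Map each word to the active set it opens (None means it opens nothing).
        self.next_set = dict.fromkeys(words)

    def link(self, word, following):
        if word not in self.next_set:
            raise KeyError(word)
        self.next_set[word] = following


def follow(root, spoken):
    """Return the set that is active after a sequence of recognized words."""
    current = root
    for word in spoken:
        if word not in current.next_set:
            raise ValueError(f"{word!r} is not in active set {current.name!r}")
        # Assumption: a word that opens no further set returns to the root.
        current = current.next_set[word] or root
    return current


def total_words(root):
    """Count the words in every set reachable from root, visiting each set once."""
    seen, stack, count = set(), [root], 0
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue
        seen.add(id(node))
        count += len(node.next_set)
        stack.extend(n for n in node.next_set.values() if n is not None)
    return count


# Hypothetical vocabulary for a voice-controlled appliance.
main = ActiveSet("main", ["lights", "music"])
lights = ActiveSet("lights", ["on", "off", "dim"])
music = ActiveSet("music", ["play", "stop"])
main.link("lights", lights)
main.link("music", music)

print(follow(main, ["lights"]).name)  # set opened by the word "lights"
print(total_words(main))              # words across all three sets
```

Because only one set is active at a time, the recognizer never has to discriminate among more than 16 words, even though the total vocabulary grows with every set that is added.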

Speaker-dependent recognition lets the user create custom vocabularies. Up to 100 words can be recognized in an active set, but the on-chip RAM holds only six words, so larger vocabularies require off-chip memory. In the continuous-listening mode, the chip listens continuously for a specific word. With this feature, a product can sit in a normal environment and "activate" only when that word, framed by silence, is spoken.

Additionally, the chips have a word-spotting feature that allows them to continuously listen for up to five speaker-independent or five speaker-dependent words at a time. In the word-spotting mode, the word doesn't require framing by silence.

In the speaker-verification mode, the speaker trains the chip on a specific word. After training, the chip can identify whether or not the word is spoken by the original speaker. This enables the RSC-4x to provide biometric security capabilities with programmable security thresholds. Up to 10 speaker-verification words can be stored on-chip (or more with external memory).
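The effect of a programmable security threshold can be illustrated with a toy example. The article does not say how the RSC-4x scores a match. The sketch below assumes that each attempt yields a distance to the trained word, with lower meaning closer, and that the chip accepts when the distance is at or below the threshold. All numbers are made up purely for illustration.

```python
def error_rates(genuine, impostor, threshold):
    """Return (false_reject_rate, false_accept_rate) for one threshold.

    An attempt is accepted when its distance is <= threshold.
    """
    false_rejects = sum(d > threshold for d in genuine)
    false_accepts = sum(d <= threshold for d in impostor)
    return false_rejects / len(genuine), false_accepts / len(impostor)


# Made-up distances for illustration only; not measurements from any device.
genuine = [0.8, 1.1, 1.4, 2.0]    # attempts by the enrolled speaker
impostor = [1.8, 2.6, 3.1, 3.9]   # attempts by other speakers

for threshold in (1.0, 1.5, 2.5):
    frr, far = error_rates(genuine, impostor, threshold)
    print(f"threshold {threshold}: false rejects {frr:.2f}, false accepts {far:.2f}")
```

A tight threshold turns away the enrolled speaker more often, and a loose one starts admitting other voices. A programmable threshold lets each product choose its own point on that trade-off.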

For high-quality speech recognition and synthesis, designers at Sensory incorporated a 16-bit analog-to-digital converter (ADC) on the speech-recognition side and a 10-bit digital-to-analog converter (DAC) on the speech-output side. An on-chip input preamplifier and automatic gain-control circuit scale the microphone signal for the ADC, and a PWM block delivers the output to the speaker. A separate line output is also available for systems that have an audio output amplifier.

Additional resources on the RSC-4x chips include three bidirectional 8-bit I/O ports that permit the processors to interact with external devices. There also is a pair of independent 8-bit counter-timers, a separate watchdog timer, and a pair of comparators (for up to four inputs) that can be used to sense input control levels. Moreover, the microcontroller core includes a fully nested interrupt structure that handles up to six sources.

The RSC-4x architecture was also designed to operate at low power levels. A 32-kHz on-chip oscillator keeps the timers running when the chip goes into standby. During normal operation, the typical operating current is about 10 mA when running at 14.32 MHz and powered by a 3-V supply. In standby, the RSC-4x speech processors consume under 5 µA. Sleep and Idle are the two power-down modes. In the Sleep mode, everything stops. Only an I/O, audio, or reset event can initiate a wakeup. In the Idle mode, the low-frequency oscillator and one timer continue to run. An overflow on the timer can initiate a wakeup.
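The current figures above translate into a simple duty-cycle power budget. The sketch below treats "under 5 µA" as a 5-µA upper bound and considers the chip alone, ignoring the microphone, speaker drive, and any other circuitry. The 1000-mAh battery capacity and the 1% active fraction are hypothetical values chosen for the example.

```python
ACTIVE_MA = 10.0     # typical operating current at 14.32 MHz, 3 V (from the article)
STANDBY_MA = 0.005   # "under 5 uA" in standby, treated here as an upper bound


def average_current_ma(active_fraction):
    """Average chip current for a given fraction of time spent active."""
    if not 0.0 <= active_fraction <= 1.0:
        raise ValueError("active_fraction must be between 0 and 1")
    return active_fraction * ACTIVE_MA + (1.0 - active_fraction) * STANDBY_MA


def battery_life_hours(capacity_mah, active_fraction):
    """Idealized battery life: capacity divided by average current."""
    return capacity_mah / average_current_ma(active_fraction)


# Hypothetical product: 1000-mAh battery, chip active 1% of the time.
avg = average_current_ma(0.01)
hours = battery_life_hours(1000, 0.01)
print(f"average current: {avg:.3f} mA")
print(f"estimated life: {hours:.0f} h ({hours / 24:.0f} days)")
```

Even at a 1% duty cycle the active current dominates the average, so the time spent listening matters far more to battery life than the standby figure does.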

Sensory has a wide range of speaker-independent vocabularies available for both speech-recognition and speech-synthesis applications. In addition to the libraries, Sensory has software tools for custom word-library development and application development.

In partnership with Phyton Inc., a supplier of microcontroller software development tools, Sensory has created a combined tool suite that enables vocabulary and application development. The tools include the PICE-SE in-circuit emulator and an integrated development environment consisting of the MCA-SE assembler and PDS-SE command-set simulator. Also available is the MCC-SE C compiler, which comes bundled with the same assembler and simulator.

Price & Availability
Samples of the RSC-4x speech processors are immediately available. The processors are most often sold as bare chips for use in chip-on-board applications. However, packaged units are available too, including a 48-lead TQFP for the RSC-4128 and -4256, and a 100-lead package for the ROMless RSC-4000.

In quantities of 100,000 units, the unpackaged RSC-4128, which includes an on-chip 128-kbyte ROM, sells for only $1.75 apiece. The price for the RSC-4256 is slightly higher, and the RSC-4000 costs slightly less. Packages add about $1.00 to the device prices.

The Phyton PICE-SE in-circuit emulator costs $2900, while the MCC-SE C compiler and software tools run $1195. A combination package of the two tools goes for $3750.

Sensory Inc., 1991 Russell Ave., Santa Clara, CA 95054; Erik Soule, (408) 327-9000; www.sensoryinc.com.

Phyton Inc., 7206 Bay Parkway, 2nd fl., Brooklyn, NY 11204; (718) 259-3191; www.phyton.com.

About the Author

Dave Bursky

Technologist

Dave Bursky is the founder of New Ideas in Communications, a publication website featuring the blog column Chipnastics – the Art and Science of Chip Design. He is also president of PRN Engineering, a technical writing and market consulting company. Before founding these organizations, he spent about a dozen years as a contributing editor to Chip Design magazine. Concurrent with Chip Design, he was also the technical editorial manager at Maxim Integrated Products. Prior to Maxim, Dave spent over 35 years working as an engineer for the U.S. Army Electronics Command and an editor with Electronic Design Magazine.