On the season premiere of the satirical cartoon comedy South Park, the foul-mouthed main character Cartman asks Amazon’s digital assistant Alexa to set an early morning alarm, tell him jokes, and order weird and disgusting products online.
But the real joke was that if you were watching the episode and your Amazon Echo heard the television, it would follow all the same commands – or at least try to. This has long been one of Alexa’s annoying but amusing quirks.
But it's easy to imagine more nefarious hacks as voice controls move into kitchen appliances and even door locks. Amazon released a bumper crop of new Alexa devices, and it has been trying to move artificial intelligence into things as mundane as an alarm clock and as outlandish as an animatronic talking fish. Google is trying to do the same with its Home digital assistant.
Chips have also become cheaper and more powerful, so all kinds of devices can be controlled with a few words, and software can now clearly make out voices from across a room. But household devices may need more than microphones to react naturally to a person’s requests and not be fooled by the television.
“Conversational awareness is the next big thing,” said Mark Lippett, the chief executive of Xmos, which makes multicore microcontrollers that can capture voice from across a room or kitchen. “Ultimately, I think these interfaces are going to go beyond just audio,” he said in a recent interview.
Xmos is not alone in thinking along those lines. The company recently raised $15 million in a funding round led by Infineon, which announced that it would combine its sensors with Xmos chips to enable devices that only listen to humans. Infineon did not disclose the size of its investment, which it called a minority stake.
By coupling its automotive radar and microphones with Xmos’ voice controllers, Infineon is aiming to give smart speakers and other household devices context to ignore music and voices from the television. “It is adding another dimension to that voice interface and another dimension to the user experience as a result,” Lippett said.
“The microphones would eliminate the television as a source of a voice because televisions don’t behave like human beings, they don’t move around, and they generally don’t have the characteristics of a human,” Lippett said. The radar could provide other assurances that a voice belonged to someone in the room, for example by reading a person’s heart rate and breathing patterns.
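Neither company has published its fusion logic, but the idea Lippett describes can be sketched in a few lines. The following is purely illustrative: the field names, the breathing-rate range, and the simple boolean gate are assumptions, not Infineon’s actual algorithm.

```python
from dataclasses import dataclass

@dataclass
class RadarReading:
    """Hypothetical radar observation of the room (illustrative fields)."""
    motion_detected: bool      # something moved, unlike a wall-mounted TV
    breathing_rate_bpm: float  # 0.0 if no periodic chest movement is seen

def is_live_speaker(radar: RadarReading) -> bool:
    """A TV doesn't move or breathe; require at least one human signature.

    Typical resting breathing falls roughly in the 8-30 breaths/min range.
    """
    breathing = 8.0 <= radar.breathing_rate_bpm <= 30.0
    return radar.motion_detected or breathing

def accept_command(heard_wake_word: bool, radar: RadarReading) -> bool:
    """Obey the wake word only when radar also suggests a live person."""
    return heard_wake_word and is_live_speaker(radar)

print(accept_command(True, RadarReading(False, 0.0)))   # False: likely the TV
print(accept_command(True, RadarReading(False, 16.0)))  # True: someone breathing
```

The point of the gate is that each sensor vetoes the other’s false positives: the microphone hears a convincing voice, but a stationary, non-breathing source gets rejected.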
“The audio reproduction industry is struggling – I mean it is very successful at creating something that comes out of your TV that sounds exactly like a human being,” Lippett said. “That is a tough nut to crack, and that is where I think you need a little more information than just acoustic.”
Other chipmakers are betting on voice control as well. Cirrus Logic, which makes audio codecs and microphones, has released development tools for voice capture with Amazon’s Alexa. Texas Instruments and NXP are also targeting voice controls in the smart home hardware market, which Strategy Analytics predicts will reach $81 billion by 2022.
Touch chipmaker Synaptics recently closed a $300 million deal for Conexant, which makes voice processors that work with both Baidu’s and Amazon’s digital assistants. Like its rivals, Conexant has been improving its chips to separate voice commands from background noise. Using four microphones, its voice software can determine the direction of sounds.
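Conexant’s algorithms are proprietary, but the basic principle behind locating a sound with multiple microphones is time difference of arrival: a sound reaches each microphone at a slightly different instant, and cross-correlating the signals reveals that delay. A minimal two-microphone sketch, assuming a 16 kHz sample rate and 10 cm mic spacing for illustration:

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # metres per second at room temperature

def estimate_angle(sig_a, sig_b, mic_distance, sample_rate):
    """Estimate a sound's bearing from the delay between two microphones.

    Cross-correlate the signals to find the lag (in samples) at which they
    best align, convert that lag to seconds, then map it to an angle using
    the spacing between the microphones.
    """
    corr = np.correlate(sig_a, sig_b, mode="full")
    lag = np.argmax(corr) - (len(sig_b) - 1)   # best-aligning lag in samples
    delay = lag / sample_rate                  # seconds
    sin_theta = np.clip(delay * SPEED_OF_SOUND / mic_distance, -1.0, 1.0)
    return np.degrees(np.arcsin(sin_theta))

# Toy check: an identical signal at both mics (no delay) reads as broadside.
rng = np.random.default_rng(0)
tone = rng.standard_normal(1024)
print(estimate_angle(tone, tone, mic_distance=0.1, sample_rate=16000))  # 0.0
```

With four microphones rather than two, a device can resolve direction in a full circle and combine several pairwise delays for a more robust estimate.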
Xmos is also pressing ahead with investments in sound separation and software that fuses information from multiple sensors. In July, it bought Setem Technologies, whose algorithms can identify the number of people in a room and amplify individual voices in noisy situations like cocktail parties – something that Lippett calls “vocal sorcery.” The software could also give gadgets an intuitive grasp of their surroundings.
For example, a smart speaker could detect music from a television and block it out to listen for voice commands. The speaker could also recall that something in that location played music before, so it will ignore a news broadcaster’s voice, for example, from that direction. “You can use that historical data and build up some knowledge about the soundscape,” Lippett said.
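The soundscape memory Lippett describes can be modelled as a simple map from direction to history. The sketch below is an assumption about how such logic might look, not Xmos’ software: directions (rounded into 10-degree bins) that repeatedly produce media audio get marked as a TV or speaker, and voices from that bearing are then ignored.

```python
from collections import defaultdict

class Soundscape:
    """Toy model of a device that remembers where media sources sit."""

    MEDIA_THRESHOLD = 3  # sightings before a bearing is marked as media

    def __init__(self):
        self.media_counts = defaultdict(int)

    def _bin(self, angle_deg):
        """Round a bearing into a 10-degree bin so nearby angles match."""
        return round(angle_deg / 10) * 10

    def observe_media(self, angle_deg):
        """Record that music or broadcast audio came from this direction."""
        self.media_counts[self._bin(angle_deg)] += 1

    def should_obey(self, angle_deg):
        """Accept a voice only if its bearing isn't a known media source."""
        return self.media_counts[self._bin(angle_deg)] < self.MEDIA_THRESHOLD

scape = Soundscape()
for _ in range(3):
    scape.observe_media(42.0)   # music detected three times near 40 degrees
print(scape.should_obey(41.0))  # False: same bearing as the remembered TV
print(scape.should_obey(-30.0)) # True: a fresh direction
```

This is the “historical data” idea in miniature: the device never has to recognise the news broadcaster’s voice as synthetic, it only has to remember that the same spot on the wall played music yesterday.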
Further along, household devices could initiate conversations to tell users about the weather or traffic. They could roll with the typical quirks of human speech, like following a voice command after the person talking pauses or asks an unrelated question. But that would require lots of contextual clues that might be tough to pick out with a microphone.
“They are going to start looking at different things like how many people are in the room, where they are facing, and their body language – to make intelligent choices about when to speak, the context, and what information might be required in that context,” Lippett said.