During last week’s colossal drop of new products and features, Amazon announced a new whisper mode for its Alexa personal assistant. The feature is centred on the idea that there are times of the day when you’re naturally quiet, such as early in the morning when your partner may still be sleeping, or during the evening when the children are asleep.
However, when others are catching some z’s you may still want to ask Alexa for an update on your commute, or to set an alarm for the morning, without waking fellow members of the household. So now, if you whisper your command at Alexa, she will whisper back in kind.
A week on, Amazon is explaining how the feature works, and it’s a lot more complex than it sounds. Amazon scientist Zeynab Raeesy says whispered speech is typically low energy and unvoiced, meaning it lacks vibration of the vocal cords. That makes it much more difficult for a listening device like an Amazon Echo to pick up the sounds effectively.
Raeesy says Amazon compared the performance of two different neural networks at distinguishing between words spoken normally and those whispered. They found a long short-term memory (LSTM) network performed better than a multilayer perceptron (MLP) network.
She wrote: “The models are trained on two categories of features. One is log filter-bank energies, a fairly direct representation of the speech signal that records the signal energies in different frequency ranges. The other is a set of features specifically engineered to exploit the signal differences between whispered and normal speech.
“We found that an LSTM network that doesn’t use handcrafted features performs as well as an MLP that does, indicating that LSTMs are capable of learning which signal attributes are most useful for whisper detection.”
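For readers curious what “log filter-bank energies” look like in practice, here is a minimal Python sketch. It is not Amazon’s code: the function name, frame sizes and band count are illustrative assumptions, and it uses evenly spaced frequency bands for simplicity, where production speech systems commonly use mel-spaced ones. It slices the audio into short frames, measures how much energy falls in each frequency band, and takes the logarithm.

```python
import numpy as np

def log_filterbank_energies(signal, sample_rate=16000, frame_ms=25,
                            hop_ms=10, n_bands=8):
    """Return an array of shape (n_frames, n_bands) of log band energies.

    Hypothetical helper for illustration only; all parameter values are
    assumptions, not details taken from Amazon's system.
    """
    frame_len = int(sample_rate * frame_ms / 1000)   # samples per frame
    hop_len = int(sample_rate * hop_ms / 1000)       # step between frames
    window = np.hanning(frame_len)                   # taper the frame edges
    n_frames = 1 + (len(signal) - frame_len) // hop_len

    features = np.empty((n_frames, n_bands))
    for i in range(n_frames):
        start = i * hop_len
        frame = signal[start:start + frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2      # power spectrum
        # Split the spectrum into evenly spaced bands and sum each one.
        energies = np.array([b.sum() for b in np.array_split(power, n_bands)])
        features[i] = np.log(energies + 1e-10)       # epsilon avoids log(0)
    return features

# Usage: one second of a 440 Hz tone, then the same tone ten times quieter.
t = np.arange(16000) / 16000
loud = np.sin(2 * np.pi * 440 * t)
quiet = 0.1 * loud
print(log_filterbank_energies(loud).shape)           # (98, 8)
```

In this toy example the quieter signal produces lower values in the band containing the tone, which hints at why low-energy whispered speech looks different from normal speech in this representation.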
Raeesy added there were caveats to this approach, namely that the more data the LSTM network was exposed to, the less improvement the handcrafted features offered. So, the model that now sits within Alexa doesn’t include the handcrafted features at all.
Other problems the team had to overcome included the “end-pointing” process. Usually, Alexa i