During last week’s colossal drop of new products and features, Amazon announced a new whisper mode for its Alexa personal assistant. The feature is centred on the idea that there are times of the day when you’re naturally quiet, such as early in the morning when your partner may still be sleeping, or during the evening when the children are asleep.
However, when others are catching some z’s you may still want to ask Alexa for an update on your commute, or to set an alarm for the morning, without waking fellow members of the household. So now, if you whisper your command at Alexa, she will whisper back in kind.
A week on, Amazon is explaining how the feature works, and it’s a lot more complex than it sounds. Amazon scientist Zeynab Raeesy says whispered speech is typically low energy and unvoiced, meaning it lacks the vocal-cord vibration of normal speech. That makes it much more difficult for a listening device like an Amazon Echo to pick up the sounds effectively.
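To see why those two properties matter, here is a minimal sketch (not Amazon’s code; the frame sizes and the toy signals are illustrative assumptions) comparing two classic frame-level measurements: whispered speech shows low short-time energy and, being unvoiced noise rather than a periodic tone, a high zero-crossing rate.

```python
import numpy as np

def frame_features(signal, frame_len=400, hop=160):
    """Short-time energy and zero-crossing rate per frame.

    Whispered speech tends to show low energy and a high
    zero-crossing rate (no voiced, periodic component), which is
    one reason it is harder for a far-field device to pick up.
    """
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energy = float(np.mean(frame ** 2))
        # Fraction of sample-to-sample transitions that change sign.
        zcr = float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))
        feats.append((energy, zcr))
    return feats

# Toy comparison: a "voiced" 100 Hz tone vs low-amplitude noise
# standing in for a whisper, both one second at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
voiced = np.sin(2 * np.pi * 100 * t)
whisper = 0.05 * np.random.default_rng(0).standard_normal(sr)

e_voiced, zcr_voiced = frame_features(voiced)[0]
e_whisper, zcr_whisper = frame_features(whisper)[0]
```

On this toy input the voiced frame has far higher energy, while the whisper-like frame crosses zero far more often, which is the kind of signal difference a detector has to exploit.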
Raeesy says Amazon compared the performance of two different neural networks at distinguishing between words spoken normally and words whispered. They found a long short-term memory (LSTM) network performed better than a multilayer perceptron (MLP) network.
She wrote: “The models are trained on two categories of features. One is log filter-bank energies, a fairly direct representation of the speech signal that records the signal energies in different frequency ranges. The other is a set of features specifically engineered to exploit the signal differences between whispered and normal speech.
“We found that an LSTM network that doesn’t use handcrafted features performs as well as an MLP that does, indicating that LSTMs are capable of learning which signal attributes are most useful for whisper detection.”
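The “log filter-bank energies” Raeesy describes are a standard speech feature: the energy of the signal in a bank of frequency bands, usually spaced on the mel scale, with a log taken at the end. The sketch below is an illustrative implementation from first principles (the FFT size, filter count and sample rate are assumptions, not Amazon’s settings).

```python
import numpy as np

def log_filterbank_energies(frame, sr=16000, n_fft=512, n_filters=20):
    """Log mel filter-bank energies for one audio frame.

    A sketch of the 'log filter-bank energies' feature described in
    the Amazon post; parameter choices here are illustrative.
    """
    # Power spectrum of the windowed frame.
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2

    # Helpers to move between Hz and the mel scale.
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, centre, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, centre):
            fbank[i, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fbank[i, k] = (right - k) / max(right - centre, 1)

    # Log of each filter's output, floored to avoid log(0).
    return np.log(np.maximum(fbank @ spectrum, 1e-10))

# One 25 ms frame of a 440 Hz tone yields a 20-dimensional feature vector.
frame = np.sin(2 * np.pi * 440 * np.arange(400) / 16000)
feats = log_filterbank_energies(frame)
```

A sequence of such vectors, one per frame, is what a network like the LSTM in Amazon’s experiments would consume, letting it learn for itself which frequency bands separate whispered from normal speech.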
Raeesy added a caveat to this approach: the more data the LSTM network was exposed to, the less improvement the handcrafted features offered. As a result, the model that now sits within Alexa doesn’t include the handcrafted features at all.
Other problems the team had to overcome included “end-pointing”, the process by which Alexa normally detects the end of a command from the period of silence that follows it. This is harder with whispered speech, and the LSTM network’s output was less reliable towards the end of utterances.
She said: “Unexpectedly, averaging the entire signal — including the troublesome final 50 frames — yielded the best results. We suspect, however, that that’s because the samples of whispered speech that we used in our experiments were manually segmented, while the samples of normal speech were automatically segmented, using Alexa’s production end-pointer.
“There could be some consistent difference between manual and automatic segmentation that the system was actually exploiting to distinguish the two types of input, and dropping the final 50 frames made that difference more difficult to detect.”
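The averaging step Raeesy describes can be sketched very simply: the network produces a whisper score per frame, and the utterance-level decision comes from averaging them, either over the whole signal or with the trailing frames dropped. The function and the posterior values below are hypothetical illustrations, not Amazon’s implementation.

```python
import numpy as np

def utterance_score(frame_scores, drop_last=0):
    """Average per-frame whisper posteriors into one utterance score.

    drop_last=50 mimics the variant that discards the troublesome
    frames near the end-point; the blog post reports that averaging
    the entire signal (drop_last=0) actually worked best.
    """
    scores = np.asarray(frame_scores, dtype=float)
    if drop_last:
        scores = scores[:-drop_last]
    return float(scores.mean())

# Hypothetical posteriors: confident mid-utterance, noisier near the end.
posteriors = [0.9] * 150 + [0.5] * 50

full = utterance_score(posteriors)         # averages all 200 frames -> 0.8
trimmed = utterance_score(posteriors, 50)  # ignores the last 50     -> 0.9
```

As the quote explains, the counter-intuitive result (full-signal averaging winning) may say more about how the whispered and normal training samples were segmented than about the final frames themselves.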
Do you think whisper mode is a useful addition to Alexa’s skillset? Drop us a line @TrustedReviews on Twitter.