Separating a single person’s voice from a noisy crowd is something most people do subconsciously — it’s called the cocktail party effect. Smart speakers like Google Home and Amazon’s Echo typically have a tougher time, but thanks to artificial intelligence (AI), they might one day be able to filter out voices as well as any human.
Researchers at Google and the Idiap Research Institute in Switzerland describe a novel solution in a new paper (“VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking”) published on the preprint server arXiv.org. They trained two separate neural networks — a speaker recognition network and a spectrogram masking network — that together “significantly” reduced the speech recognition word error rate (WER) on multispeaker signals.
Their work builds on a paper out of MIT’s Computer Science and Artificial Intelligence Lab earlier this year, which described a system — PixelPlayer — that learned to isolate the sounds of individual instruments from YouTube videos. And it calls to mind an AI system created by researchers at the University of Surrey in 2015, which output vocal spectrograms when fed songs as input.
“[We address] the task of isolating the voices of a subset of speakers of interest from the commonality of all the other speakers and noises,” the researchers wrote. “For example, such subset can be formed by a single target speaker issuing a spoken query to a personal mobile device, or the members of a house talking to a shared home device.”
The researchers’ two-part system, dubbed VoiceFilter, consisted of a long short-term memory (LSTM) model — a type of machine learning algorithm that uses internal memory alongside its inputs to improve prediction accuracy — and a convolutional neural network (with one LSTM layer). The first took preprocessed voice samples as input and output speaker embeddings (i.e., vector representations of a speaker’s voice), while the second predicted a soft mask, or filter, from the embeddings and a magnitude spectrogram computed from noisy audio. The mask was used to generate an enhanced magnitude spectrogram, which, when combined with the phase of the noisy audio and transformed back into the time domain, produced an enhanced waveform.
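The masking step described above can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper’s implementation: the `stft` helper and the random stand-in mask are assumptions; in the real system the mask comes from the trained network.

```python
import numpy as np

def stft(x, frame_len=256, hop=128):
    """Toy short-time Fourier transform: window each frame and FFT it."""
    win = np.hanning(frame_len)
    frames = [np.fft.rfft(x[start:start + frame_len] * win)
              for start in range(0, len(x) - frame_len + 1, hop)]
    return np.array(frames)  # shape: (num_frames, frame_len // 2 + 1), complex

def apply_soft_mask(noisy_spec, mask):
    """Enhanced magnitude = soft mask * noisy magnitude, keeping the
    phase of the noisy audio (as in the VoiceFilter pipeline)."""
    magnitude = np.abs(noisy_spec)
    phase = np.angle(noisy_spec)
    enhanced_mag = mask * magnitude
    return enhanced_mag * np.exp(1j * phase)  # complex spectrogram to invert

# Toy usage: random audio and a random stand-in for the predicted mask.
rng = np.random.default_rng(0)
noisy = rng.standard_normal(1024)
spec = stft(noisy)
mask = rng.random(spec.shape)  # network would predict values in [0, 1]
enhanced = apply_soft_mask(spec, mask)
```

Inverting `enhanced` with an inverse STFT would yield the enhanced waveform.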
The AI system was taught to minimize the difference between the masked magnitude spectrogram and the target magnitude spectrogram computed from clean audio.
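That training objective — minimizing the difference between the masked and target magnitude spectrograms — is naturally read as a squared-error loss over spectrogram bins; the exact formula below is a hedged sketch, not quoted from the paper.

```python
import numpy as np

def masking_loss(masked_mag, target_mag):
    """Mean squared error between the enhanced (masked) magnitude
    spectrogram and the target spectrogram computed from clean audio."""
    return np.mean((masked_mag - target_mag) ** 2)

target = np.array([[1.0, 2.0], [3.0, 4.0]])
perfect = masking_loss(target, target)        # a perfect mask gives zero loss
imperfect = masking_loss(target + 1.0, target)
```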
The team sourced two datasets for training samples: (1) roughly 34 million anonymized voice query logs in English from 138,000 speakers, and (2) a compilation of the open source speech libraries LibriSpeech, VoxCeleb, and VoxCeleb2. The VoiceFilter network trained on speech samples from the CSTR VCTK dataset — a corpus of speech data maintained by the University of Edinburgh — and from 2,338 LibriSpeech speakers, and was evaluated with utterances from 73 speakers. (Each training example consisted of three inputs: clean audio from the target speaker as ground truth, noisy audio containing multiple speakers, and reference audio from the target speaker.)
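Assembling one such training example can be sketched as follows. The plain additive mixing (with no gain adjustment) is an assumption for illustration, not the paper’s exact recipe.

```python
import numpy as np

def make_training_example(clean, interference, reference):
    """Build one (clean, noisy, reference) training example: the noisy
    input is the clean target audio mixed with an interfering speaker."""
    n = min(len(clean), len(interference))
    noisy = clean[:n] + interference[:n]  # simple additive mixture (assumed)
    return {"clean": clean[:n], "noisy": noisy, "reference": reference}

# Toy usage with constant signals standing in for real audio.
clean = np.ones(4)
interference = np.full(4, 0.5)
reference = np.zeros(8)  # another utterance from the target speaker
example = make_training_example(clean, interference, reference)
```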
In tests, VoiceFilter achieved a reduction in word error rate from 55.9 percent to 23.4 percent in two-speaker scenarios.
“We have demonstrated the effectiveness of using a discriminatively-trained speaker encoder to condition the speech separation task,” the researchers wrote. “Such a system is more applicable to real scenarios because it does not require prior knowledge about the number of speakers … Our system purely relies on the audio signal and can easily generalize to unknown speakers by using a highly representative embedding vector for the speaker.”