In a paper published on the preprint server arXiv.org, researchers from Google and the University of Illinois propose mixture invariant training (MixIT), an unsupervised approach to separating, isolating, and enhancing the voices of multiple speakers in a single audio recording. The approach requires only single-channel (i.e., monaural) acoustic features, and the researchers claim that it "significantly" improves speech separation performance by incorporating reverberant mixtures and a large amount of in-the-wild training data.
As the co-authors of the paper point out, audio perception suffers from a fundamental problem: sounds are mixed together in a way that cannot be unraveled without knowledge of the sources' properties. Attempts have been made to design algorithms that can estimate each sound source from single-channel recordings. However, most are supervised — that is, they are trained on audio mixtures created by adding sounds together, with or without simulated environments. As a result, they perform poorly in the presence of acoustic reverberation or when the distribution of sound types does not match the training data. This is due to several factors. First, it is difficult to match the acoustic properties of a real room, which are often unknown. In addition, data for every source type may not be readily available, and realistic acoustics are difficult to simulate accurately.
MixIT aims to solve these challenges by using acoustic mixtures without reference sources. Training examples are created by mixing together existing audio mixtures; the system separates each combined input into a number of sources, which are then remixed so as to approximate the original mixtures.
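The idea above can be sketched as a loss function: given two reference mixtures, a model separates their sum into several estimated sources, and each source is assigned to one of the two mixtures so that the remixes best reconstruct the references. The sketch below is illustrative only — it uses a brute-force search over assignments and a simple mean-squared error rather than the loss used in the paper, and the function name and array shapes are assumptions for this example.

```python
import itertools
import numpy as np

def mixit_loss(est_sources, mix1, mix2):
    """Illustrative mixture invariant loss.

    est_sources: (M, T) array of sources a model estimated from mix1 + mix2.
    mix1, mix2:  (T,) reference mixtures that were summed to form the input.

    Each estimated source is assigned to exactly one reference mixture; we
    search all 2^M binary assignments and keep the one whose remixes best
    reconstruct the references (simple MSE here, for illustration).
    """
    M = est_sources.shape[0]
    best = np.inf
    for assign in itertools.product([0, 1], repeat=M):
        a = np.asarray(assign)[:, None]  # 0 -> assign to mix1, 1 -> to mix2
        remix1 = ((1 - a) * est_sources).sum(axis=0)
        remix2 = (a * est_sources).sum(axis=0)
        loss = np.mean((remix1 - mix1) ** 2) + np.mean((remix2 - mix2) ** 2)
        best = min(best, loss)
    return best
```

If the estimated sources exactly equal the true sources, some assignment remixes them back into the original mixtures and the loss is zero — without the training procedure ever seeing isolated reference sources.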
In experiments, MixIT was trained on four Google Cloud Tensor Processing Units (TPUs) to accomplish three tasks: speech separation, speech enhancement, and universal sound separation. For speech separation, the researchers used the open source datasets WSJ0-2mix and Libri2Mix, which together comprise over 390 hours of recordings from male and female speakers. They applied a reverberation effect before feeding mixtures from the two sets (three-second clips from WSJ0-2mix and ten-second clips from Libri2Mix) to the model.
For the speech enhancement task, they collected non-speech sounds from FreeSound.org to test whether MixIT could be trained to remove noise from mixtures containing LibriSpeech voices. And for the universal sound separation task, they used the recently released Free Universal Sound Separation dataset to train MixIT to separate arbitrary sounds from an acoustic mixture.
The researchers report that unsupervised training was not as helpful for universal sound separation and speech enhancement as existing approaches — presumably because the test sets were "well matched to the supervised training domain". For universal sound separation, however, unsupervised training did seem to help slightly with generalization to the test set compared with supervised training alone. MixIT did not reach supervised levels of performance, but the co-authors claim its performance was "unprecedented" for a system trained without supervision.
"MixIT opens up new lines of research in which huge amounts of previously unused in-the-wild data can be used to train sound separation systems," the researchers wrote. "The ultimate goal is to evaluate separation on real mixture data. However, this remains a challenge due to the lack of ground truth. Depending on the application, future experiments can use detection or human listening as a proxy measure of separation."