Details

Dataset

The audio-visual corpus used for the experiments was collected from YouTube; Mandarin speech accounts for the vast majority of it.

The simulated dataset contains 160,000, 15,000, and 1,200 multi-channel noisy and reverberant mixtures for training, validation, and testing, respectively. The speakers in the training and test sets do not overlap.

Simulation

We use a 9-element non-uniform linear array with spacing 4-3-2-1-1-2-3-4 cm. Multi-channel audio signals are generated by convolving single-channel signals with room impulse responses (RIRs) simulated by the image-source method (ISM).
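A minimal sketch of this kind of simulation with pyroomacoustics is shown below. The room size, absorption, array placement, and source position are illustrative assumptions, not the actual simulation parameters; only the array geometry follows the spacing given above.

```python
import numpy as np
import pyroomacoustics as pra

fs = 16000  # assumed sampling rate

# 9-element non-uniform linear array, spacing 4-3-2-1-1-2-3-4 cm
spacings = np.array([0.04, 0.03, 0.02, 0.01, 0.01, 0.02, 0.03, 0.04])
x = np.concatenate([[0.0], np.cumsum(spacings)])  # mic x-coordinates (m)

# Hypothetical shoebox room; dimensions and absorption are placeholders
room = pra.ShoeBox([6.0, 5.0, 3.0], fs=fs,
                   materials=pra.Material(0.3), max_order=17)

# Place the array at 1.5 m height near one wall (assumption)
mic_locs = np.stack([x + 2.9, np.full(9, 0.5), np.full(9, 1.5)])
room.add_microphone_array(pra.MicrophoneArray(mic_locs, room.fs))

# Add a mono source; `speech` stands in for a single-channel utterance
speech = np.random.randn(fs)  # placeholder waveform for the sketch
room.add_source([3.0, 2.5, 1.6], signal=speech)

# ISM simulation: convolves the source with the simulated RIR of each mic
room.simulate()
multichannel = room.mic_array.signals  # shape: (9, num_samples)
```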

We consider three scenarios for synthetic example generation: 1 speaker, 2 speakers, and 3 speakers, accounting for 49%, 30%, and 21% of the test set, respectively.

Framework

As shown in the figure above, the proposed system is a multi-stream architecture that takes four inputs: (i) the noisy multi-channel mixture waveforms, (ii) the target speaker's direction, calculated via face detection, (iii) video frames of the cropped lip region, and (iv) enrollment audio of the target speaker. The system directly outputs the estimated monaural target speech, with all other interfering signals suppressed.
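A hedged sketch of how such a multi-stream fusion could look is given below. The feature dimensions, the BLSTM separator, and the mask-based output are assumptions for illustration; the actual network configuration is specified in the paper.

```python
import torch
import torch.nn as nn

class MultiStreamFusion(nn.Module):
    """Illustrative fusion of audio, directional, lip, and speaker streams.

    All layer sizes below are placeholders, not the paper's values.
    """
    def __init__(self, audio_dim=769, df_dim=257, lip_dim=512, spk_dim=256,
                 hidden=512, n_bins=257):
        super().__init__()
        fused = audio_dim + df_dim + lip_dim + spk_dim
        self.separator = nn.LSTM(fused, hidden, num_layers=2,
                                 batch_first=True, bidirectional=True)
        self.mask = nn.Linear(2 * hidden, n_bins)  # T-F mask for the target

    def forward(self, audio_feat, df, lip_emb, spk_emb):
        # audio_feat: (B, T, audio_dim)   spectral + IPD features
        # df:         (B, T, df_dim)      directional feature
        # lip_emb:    (B, Tv, lip_dim)    lip embeddings at video frame rate
        # spk_emb:    (B, spk_dim)        utterance-level speaker embedding
        T = audio_feat.shape[1]
        # Upsample video-rate lip embeddings to the audio frame rate
        lip = nn.functional.interpolate(
            lip_emb.transpose(1, 2), size=T).transpose(1, 2)
        # Repeat the utterance-level speaker embedding at every frame
        spk = spk_emb.unsqueeze(1).expand(-1, T, -1)
        x = torch.cat([audio_feat, df, lip, spk], dim=-1)
        h, _ = self.separator(x)
        return torch.sigmoid(self.mask(h))  # applied to the mixture STFT
```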

Features

Audio: 1) Speaker-independent features: a spectral feature (the logarithm power spectrum) and spatial features (inter-channel phase differences, IPDs). 2) Speaker-related feature: a directional feature (DF) computed from the target speaker's direction.
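A minimal sketch of these features for a single microphone pair follows. The pair choice, the speed of sound, and the cosine-distance form of the directional feature are assumptions based on common practice, not necessarily the exact formulation used here.

```python
import numpy as np

def audio_features(stft, mic_x, theta, pair=(0, 8),
                   fs=16000, n_fft=512, c=343.0):
    """stft: complex STFT, shape (num_mics, F, T); mic_x: mic x-coords (m);
    theta: target direction in radians, e.g., from face detection."""
    # Spectral feature: logarithm power spectrum of the reference channel
    lps = np.log(np.abs(stft[0]) ** 2 + 1e-8)                  # (F, T)

    # Spatial feature: inter-channel phase difference (IPD) for one pair
    i, j = pair
    ipd = np.angle(stft[i]) - np.angle(stft[j])                # (F, T)

    # Directional feature (DF): agreement between the observed IPD and
    # the target phase difference (TPD) implied by the direction theta
    freqs = np.arange(stft.shape[1]) * fs / n_fft              # (F,)
    delay = (mic_x[i] - mic_x[j]) * np.cos(theta) / c          # seconds
    tpd = 2 * np.pi * freqs * delay                            # (F,)
    df = np.cos(ipd - tpd[:, None])                            # (F, T)
    return lps, ipd, df
```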

Video: The video stream takes grayscale image frames of the target speaker's lip region. The structure of the lip-reading network is similar to the one proposed in [1]: a spatio-temporal convolution layer and an 18-layer ResNet, followed by several video blocks. The output of the video blocks is the target speaker's lip embeddings.
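A hedged PyTorch sketch of such a front-end is given below. The kernel sizes follow a common lipreading recipe, and the "video blocks" are approximated here by a single temporal convolution; both are assumptions rather than the exact configuration of [1].

```python
import torch
import torch.nn as nn
import torchvision

class LipEncoder(nn.Module):
    """Spatio-temporal conv + ResNet-18 front-end for grayscale lip crops."""
    def __init__(self, emb_dim=512):
        super().__init__()
        # 3D conv over (time, height, width); stride 1 in time keeps
        # one feature vector per video frame
        self.conv3d = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2),
                      padding=(2, 3, 3), bias=False),
            nn.BatchNorm3d(64), nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2),
                         padding=(0, 1, 1)))
        resnet = torchvision.models.resnet18()
        # Keep layer1..layer4 + avgpool; the stem is replaced by the 3D
        # conv above, and the classifier head is dropped
        self.resnet = nn.Sequential(*list(resnet.children())[4:-1])
        # Stand-in for the "video blocks": a temporal conv (assumption)
        self.video_block = nn.Conv1d(512, emb_dim, kernel_size=3, padding=1)

    def forward(self, frames):                    # (B, 1, T, H, W)
        x = self.conv3d(frames)                   # (B, 64, T, H', W')
        B, C, T, H, W = x.shape
        x = x.transpose(1, 2).reshape(B * T, C, H, W)
        x = self.resnet(x).reshape(B, T, 512)     # per-frame features
        x = self.video_block(x.transpose(1, 2))   # (B, emb_dim, T)
        return x.transpose(1, 2)                  # lip embeddings (B, T, emb)
```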

Speaker Embedding: The speaker model is pretrained on a speaker verification task. It takes an enrollment utterance of the target speaker as input and outputs an utterance-level speaker embedding.
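A minimal sketch of the extraction step, assuming a generic frame encoder with mean pooling; the actual speaker model architecture is not specified here, and in practice its weights would come from the pretrained verification model and stay frozen.

```python
import torch
import torch.nn as nn

class SpeakerEmbedder(nn.Module):
    """Placeholder speaker model: frame-level encoder + mean pooling."""
    def __init__(self, feat_dim=40, emb_dim=256):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, emb_dim, num_layers=3,
                               batch_first=True)

    def forward(self, enroll_feats):          # (B, T, feat_dim), e.g. fbanks
        h, _ = self.encoder(enroll_feats)     # frame-level representations
        return h.mean(dim=1)                  # utterance-level embedding
```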

1-speaker samples

[Audio demos: mixture (1st channel), reverberant clean, audio-only (IPD+DF), audio-visual (IPD+DF+lip), and multi-modal (IPD+DF+lip+spk. emb.) outputs.]

2-speaker samples

[Audio demos: mixture (1st channel), reverberant clean, audio-only (IPD+DF), audio-visual (IPD+DF+lip), and multi-modal (IPD+DF+lip+spk. emb.) outputs.]

3-speaker samples

[Audio demos: mixture (1st channel), reverberant clean, audio-only (IPD+DF), audio-visual (IPD+DF+lip), and multi-modal (IPD+DF+lip+spk. emb.) outputs.]

Real recorded samples

[Audio demos: unprocessed mixture, estimated speaker 1, and estimated speaker 2.]