Details

Dataset

The audio-visual corpus used for the experiments was collected from YouTube; Mandarin speech accounts for the vast majority of it.

The simulated dataset contains 160,000, 15,000, and 1,200 multi-channel noisy and reverberant mixtures for training, validation, and testing, respectively. The speakers in the training and test sets do not overlap.

Simulation

We use a 9-element non-uniform linear array with spacing 4-3-2-1-1-2-3-4 cm. Multi-channel audio signals are generated by convolving single-channel signals with room impulse responses (RIRs) simulated by the image-source method (ISM).
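A minimal sketch of this kind of simulation with pyroomacoustics is shown below. The room size, absorption, array placement, and source position are illustrative assumptions, not the actual simulation parameters; only the array geometry follows the spacing given above.

```python
import numpy as np
import pyroomacoustics as pra

fs = 16000  # assumed sampling rate

# 9-element non-uniform linear array, spacing 4-3-2-1-1-2-3-4 cm
spacings = np.array([0.04, 0.03, 0.02, 0.01, 0.01, 0.02, 0.03, 0.04])
x = np.concatenate([[0.0], np.cumsum(spacings)])  # mic x-coordinates (m)

# Hypothetical shoebox room; dimensions and absorption are placeholders
room = pra.ShoeBox([6.0, 5.0, 3.0], fs=fs,
                   materials=pra.Material(0.3), max_order=17)

# Place the array at 1.5 m height near one wall (assumption)
mic_locs = np.stack([x + 2.9, np.full(9, 0.5), np.full(9, 1.5)])
room.add_microphone_array(pra.MicrophoneArray(mic_locs, room.fs))

# Add a mono source; `speech` stands in for a single-channel utterance
speech = np.random.randn(fs)  # placeholder waveform for the sketch
room.add_source([3.0, 2.5, 1.6], signal=speech)

# ISM simulation: convolves the source with the simulated RIR of each mic
room.simulate()
multichannel = room.mic_array.signals  # shape: (9, num_samples)
```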

We consider three scenarios for synthetic example generation: 1 speaker, 2 speakers, and 3 speakers, accounting for 49%, 30%, and 21% of the test set, respectively.

Framework

As shown in the figure above, the proposed system is a multi-stream architecture that takes four inputs: (i) the noisy multi-channel mixture waveforms, (ii) the target speaker's direction, calculated via face detection, (iii) video frames of the cropped lip region, and (iv) enrollment audio of the target speaker. The system directly outputs the estimated monaural target speech, with all other interfering signals suppressed.
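A hedged sketch of how such a multi-stream fusion could look is given below. The feature dimensions, the BLSTM separator, and the mask-based output are assumptions for illustration; the actual network configuration is specified in the paper.

```python
import torch
import torch.nn as nn

class MultiStreamFusion(nn.Module):
    """Illustrative fusion of audio, directional, lip, and speaker streams.

    All layer sizes below are placeholders, not the paper's values.
    """
    def __init__(self, audio_dim=769, df_dim=257, lip_dim=512, spk_dim=256,
                 hidden=512, n_bins=257):
        super().__init__()
        fused = audio_dim + df_dim + lip_dim + spk_dim
        self.separator = nn.LSTM(fused, hidden, num_layers=2,
                                 batch_first=True, bidirectional=True)
        self.mask = nn.Linear(2 * hidden, n_bins)  # T-F mask for the target

    def forward(self, audio_feat, df, lip_emb, spk_emb):
        # audio_feat: (B, T, audio_dim)   spectral + IPD features
        # df:         (B, T, df_dim)      directional feature
        # lip_emb:    (B, Tv, lip_dim)    lip embeddings at video frame rate
        # spk_emb:    (B, spk_dim)        utterance-level speaker embedding
        T = audio_feat.shape[1]
        # Upsample video-rate lip embeddings to the audio frame rate
        lip = nn.functional.interpolate(
            lip_emb.transpose(1, 2), size=T).transpose(1, 2)
        # Repeat the utterance-level speaker embedding at every frame
        spk = spk_emb.unsqueeze(1).expand(-1, T, -1)
        x = torch.cat([audio_feat, df, lip, spk], dim=-1)
        h, _ = self.separator(x)
        return torch.sigmoid(self.mask(h))  # applied to the mixture STFT
```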

Features

Audio: 1) Speaker-independent features: a spectral feature (the logarithm power spectrum) and spatial features (inter-channel phase differences, IPDs). 2) Speaker-related feature: a directional feature (DF) computed from the target speaker's direction.
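A minimal sketch of these features for a single microphone pair follows. The pair choice, the speed of sound, and the cosine-distance form of the directional feature are assumptions based on common practice, not necessarily the exact formulation used here.

```python
import numpy as np

def audio_features(stft, mic_x, theta, pair=(0, 8),
                   fs=16000, n_fft=512, c=343.0):
    """stft: complex STFT, shape (num_mics, F, T); mic_x: mic x-coords (m);
    theta: target direction in radians, e.g., from face detection."""
    # Spectral feature: logarithm power spectrum of the reference channel
    lps = np.log(np.abs(stft[0]) ** 2 + 1e-8)                  # (F, T)

    # Spatial feature: inter-channel phase difference (IPD) for one pair
    i, j = pair
    ipd = np.angle(stft[i]) - np.angle(stft[j])                # (F, T)

    # Directional feature (DF): agreement between the observed IPD and
    # the target phase difference (TPD) implied by the direction theta
    freqs = np.arange(stft.shape[1]) * fs / n_fft              # (F,)
    delay = (mic_x[i] - mic_x[j]) * np.cos(theta) / c          # seconds
    tpd = 2 * np.pi * freqs * delay                            # (F,)
    df = np.cos(ipd - tpd[:, None])                            # (F, T)
    return lps, ipd, df
```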

Video: The video stream takes grayscale image frames of the target speaker's lip region. The structure of the lip-reading network is similar to the one proposed in [1]: a spatio-temporal convolution layer and an 18-layer ResNet, followed by several video blocks. The output of the video blocks is the target speaker's lip embeddings.
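A hedged PyTorch sketch of such a front-end is given below. The kernel sizes follow a common lipreading recipe, and the "video blocks" are approximated here by a single temporal convolution; both are assumptions rather than the exact configuration of [1].

```python
import torch
import torch.nn as nn
import torchvision

class LipEncoder(nn.Module):
    """Spatio-temporal conv + ResNet-18 front-end for grayscale lip crops."""
    def __init__(self, emb_dim=512):
        super().__init__()
        # 3D conv over (time, height, width); stride 1 in time keeps
        # one feature vector per video frame
        self.conv3d = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2),
                      padding=(2, 3, 3), bias=False),
            nn.BatchNorm3d(64), nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2),
                         padding=(0, 1, 1)))
        resnet = torchvision.models.resnet18()
        # Keep layer1..layer4 + avgpool; the stem is replaced by the 3D
        # conv above, and the classifier head is dropped
        self.resnet = nn.Sequential(*list(resnet.children())[4:-1])
        # Stand-in for the "video blocks": a temporal conv (assumption)
        self.video_block = nn.Conv1d(512, emb_dim, kernel_size=3, padding=1)

    def forward(self, frames):                    # (B, 1, T, H, W)
        x = self.conv3d(frames)                   # (B, 64, T, H', W')
        B, C, T, H, W = x.shape
        x = x.transpose(1, 2).reshape(B * T, C, H, W)
        x = self.resnet(x).reshape(B, T, 512)     # per-frame features
        x = self.video_block(x.transpose(1, 2))   # (B, emb_dim, T)
        return x.transpose(1, 2)                  # lip embeddings (B, T, emb)
```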

Speaker Embedding: The speaker model is pretrained on a speaker verification task. It takes an enrollment utterance of the target speaker as input and outputs an utterance-level speaker embedding.
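A minimal sketch of the extraction step, assuming a generic frame encoder with mean pooling; the actual speaker model architecture is not specified here, and in practice its weights would come from the pretrained verification model and stay frozen.

```python
import torch
import torch.nn as nn

class SpeakerEmbedder(nn.Module):
    """Placeholder speaker model: frame-level encoder + mean pooling."""
    def __init__(self, feat_dim=40, emb_dim=256):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, emb_dim, num_layers=3,
                               batch_first=True)

    def forward(self, enroll_feats):          # (B, T, feat_dim), e.g. fbanks
        h, _ = self.encoder(enroll_feats)     # frame-level representations
        return h.mean(dim=1)                  # utterance-level embedding
```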

1-speaker samples

[Audio demos: mixture (1st channel), reverberant clean, audio-only (IPD+DF), audio-visual (IPD+DF+lip), and multi-modal (IPD+DF+lip+spk. emb.) outputs.]

2-speaker samples

[Audio demos: mixture (1st channel), reverberant clean, audio-only (IPD+DF), audio-visual (IPD+DF+lip), and multi-modal (IPD+DF+lip+spk. emb.) outputs.]

3-speaker samples

[Audio demos: mixture (1st channel), reverberant clean, audio-only (IPD+DF), audio-visual (IPD+DF+lip), and multi-modal (IPD+DF+lip+spk. emb.) outputs.]

Real recorded samples

[Audio demos: unprocessed mixture, estimated speaker 1, and estimated speaker 2.]