3D Spatial Features for Multi-channel Target Speech Separation

3D NEURAL BEAMFORMING FOR MULTI-CHANNEL SPEECH SEPARATION AGAINST LOCATION UNCERTAINTY

Anonymous Authors

Abstract: Multi-channel speech separation using the speaker’s directional information has demonstrated significant gains over blind separation. However, it has two limitations. First, substantial performance degradation is observed when the arrival directions of two sounds are close. Second, the result relies heavily on precise estimation of the speaker’s direction. To overcome these issues, this paper proposes 3D features and an associated 3D neural beamformer for speech separation. Previous work in this area is extended in two important directions. First, the traditional 1D directional beam patterns are generalized to 3D. This makes it possible to extract speech from any target region in 3D space. Thus, speakers in the same direction but at different elevations or distances become separable. Second, to handle speaker location uncertainty, the previously proposed spatial feature is extended to a new 3D region feature. The proposed feature and model are evaluated in an in-car scenario. Experimental results demonstrate that the proposed 3D region feature and 3D beamformer achieve performance comparable to that obtained with ground-truth speaker location input.

From 1D to 3D spatial feature


	SF based on azimuth	SF based on azimuth and elevation	SF based on azimuth, elevation and distance
	$d_m(\theta)=\varDelta_m\cos\theta$	$d_m (\theta,\phi)=\varDelta_m \cos\theta\cos\phi$	$d_m(\theta,\phi,d_o)=d_{m_1} - d_{m_2}$ $d^2_{m_1}=d^2_{om_1}+d^2_o-2d_{om_1}d_o\cos\alpha$ $\cos\alpha=\cos\theta\cos\phi$
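The three path-difference formulas in the table can be sketched in a few lines of Python. This is only an illustration under stated assumptions: angles are in radians, and the expression for $d_{m_2}$ is not given in the table, so the code assumes that microphone $m_2$ lies on the opposite side of the array center from $m_1$ along the array axis (which flips the sign of the cosine term). The function names are hypothetical.

```python
import math

def path_diff_azimuth(delta_m, theta):
    """Far-field path difference from azimuth only: Delta_m * cos(theta)."""
    return delta_m * math.cos(theta)

def path_diff_azimuth_elevation(delta_m, theta, phi):
    """Far-field path difference from azimuth and elevation."""
    return delta_m * math.cos(theta) * math.cos(phi)

def path_diff_3d(d_om1, d_om2, theta, phi, d_o):
    """Path difference d_m1 - d_m2 for a source at distance d_o from the array center.

    d_om1, d_om2: distances of the two microphones from the array center.
    ASSUMPTION: m_2 is on the opposite side of the center from m_1, so its
    angle to the source direction is (pi - alpha) and the cosine term changes sign.
    """
    cos_alpha = math.cos(theta) * math.cos(phi)
    d_m1 = math.sqrt(d_om1 ** 2 + d_o ** 2 - 2 * d_om1 * d_o * cos_alpha)
    d_m2 = math.sqrt(d_om2 ** 2 + d_o ** 2 + 2 * d_om2 * d_o * cos_alpha)
    return d_m1 - d_m2
```

Under this assumption, the magnitude of the distance-aware value approaches the far-field value $\varDelta_m\cos\theta\cos\phi$ as $d_o$ grows; its sign depends on which microphone is labelled $m_1$.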

Effects of location uncertainty

We report performance when the spatial features are extracted from a deviated azimuth, elevation, or distance.

It can be observed that the 3D spatial feature based on azimuth, elevation and distance is considerably more sensitive to localization accuracy than the other spatial features. To alleviate this problem, we have explored several strategies, for example data augmentation that introduces localization error during training (the red line in the Azimuth figure).
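A minimal sketch of such augmentation is given here, assuming a uniform random error is added to each coordinate before the spatial feature is computed. The error distribution, the default error ranges, and the function name `perturb_location` are illustrative assumptions and are not taken from the experiments.

```python
import random

def perturb_location(theta, phi, d, max_err=(10.0, 10.0, 0.2), rng=None):
    """Return (theta, phi, d) with a uniform random error added to each value.

    Angles are in degrees and distance in metres. The default error ranges
    are HYPOTHETICAL placeholders, not values reported on this page.
    """
    rng = rng or random.Random()
    e_theta, e_phi, e_d = max_err
    noisy_theta = theta + rng.uniform(-e_theta, e_theta)
    noisy_phi = phi + rng.uniform(-e_phi, e_phi)
    # Keep the distance strictly positive after perturbation.
    noisy_d = max(d + rng.uniform(-e_d, e_d), 1e-3)
    return noisy_theta, noisy_phi, noisy_d
```

During training, the perturbed location would replace the ground-truth location when extracting the spatial feature, so that the model sees the kind of error it will face at test time.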

In this work, we sample several candidate locations in the adjacent 3D region around the given location (θ,φ,d), and design an attention mechanism that selectively focuses on these candidate locations.
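The idea of attending over candidate locations can be illustrated with a weighted sum of per-candidate feature vectors. This sketch assumes softmax-normalized weights and takes the scores as plain inputs; in a trained model the scores would be produced by the network, and the exact form of the attention used here is not specified beyond selective weighting. The names `softmax` and `attend` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(candidate_features, scores):
    """Fuse per-candidate feature vectors into one vector by weighted sum.

    candidate_features: one feature vector (list of floats) per candidate location.
    scores: one unnormalized score per candidate (ASSUMED to come from the model).
    """
    weights = softmax(scores)
    dim = len(candidate_features[0])
    fused = [
        sum(w * feat[k] for w, feat in zip(weights, candidate_features))
        for k in range(dim)
    ]
    return fused, weights
```

With equal scores the result is the plain average of the candidate features; a dominant score makes the fused feature follow a single candidate.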

Azimuth	Elevation	Distance

Experiments on real-recorded data

Scenario

To evaluate the proposed method in a real-life application, we consider an in-car scenario (Figure 1). As shown in Figure 2, there are 4 potential speakers and corresponding regions in the car: the main driver (S1), the co-driver (S2) and two passengers (S3 & S4) sitting in the back. We take the main driver's voice as the target. The top view shows that the azimuths of the main driver (S1) and the passenger in the back seat (S3) are very close. In this case, it is difficult to distinguish these two speakers with a spatial feature based only on azimuth.

Fig. 1: Detailed information of the car. Fig. 2: The 4 speakers and their corresponding 4 regions in the car.

Spatial Features

We assume that the main driver (as well as each passenger) is located within a limited region (a 3D box as shown in the figure below, where each dot indicates a potential speaker position). The length, width and height of this box, and its distance to the microphone array center, are measured for the specific car. With this information, we can calculate the expected center point of the 3D box and treat this center point as the speaker location.

1D Spatial feature: The azimuth-based spatial features that are fed into the separation network are always extracted using the fixed center locations of the 4 regions (boxes). As shown in the figure, the center location of each region is marked with a cross. Only the azimuths of the center points are used to calculate the spatial features.

3D Spatial feature: The fixed center locations (including azimuth, elevation and distance) of the 4 regions are used to calculate the 3D spatial features.
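As a rough illustration, the center of a region box can be converted to (azimuth, elevation, distance) as follows. The coordinate convention is an assumption chosen to agree with $\cos\alpha=\cos\theta\cos\phi$: the origin is the microphone array center, the x-axis runs along the array, azimuth is measured in the horizontal plane from the x-axis, and elevation is measured from the horizontal plane. The function names are hypothetical.

```python
import math

def box_center(box_min, box_max):
    """Center of an axis-aligned 3D box given its two opposite corners."""
    return tuple((lo + hi) / 2 for lo, hi in zip(box_min, box_max))

def to_azimuth_elevation_distance(point):
    """Convert (x, y, z), relative to the array center, to (theta, phi, d) in radians/metres.

    ASSUMED convention: x along the array axis, theta in the horizontal plane
    measured from x, phi measured upward from the horizontal plane.
    The point must not coincide with the array center (d > 0).
    """
    x, y, z = point
    d = math.sqrt(x * x + y * y + z * z)
    theta = math.atan2(y, x)
    phi = math.asin(z / d)
    return theta, phi, d
```

With this convention, $\cos\theta\cos\phi$ equals $x/d$, the cosine of the angle between the source direction and the array axis.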

3D Region feature: We can also sample the 8 vertices of the 3D box, in addition to its center, to obtain a full spatial view of the whole region. As shown in the figure, 4 (regions) × 9 (spatial positions) spatial features, each computed from azimuth, elevation and distance, are extracted and input to the separation model. We design a simple attention mechanism (i.e., weight-and-sum) so that the model can selectively focus on useful spatial positions.
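The 9 spatial positions per region can be enumerated as sketched here, assuming each region is an axis-aligned box described by two opposite corners and that the 9 positions are the center plus the 8 vertices. The function name `region_positions` and the box dimensions in the usage note are hypothetical, not measurements from the car.

```python
from itertools import product

def region_positions(box_min, box_max):
    """Return 9 positions for one region: the box center followed by its 8 vertices.

    box_min, box_max: (x, y, z) of two opposite corners of an axis-aligned box.
    """
    center = tuple((lo + hi) / 2 for lo, hi in zip(box_min, box_max))
    # Each axis contributes its low or high bound, giving 2 * 2 * 2 = 8 vertices.
    vertices = list(product(*zip(box_min, box_max)))
    return [center] + vertices
```

For example, `region_positions((0.2, 0.1, -0.1), (0.6, 0.5, 0.3))` yields 9 points for one region; repeating this for the 4 regions gives the 36 positions from which spatial features are computed.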

Data Preparation

To train a model that separates the voice of the main driver, we simulate a reverberant, noisy dataset. The room size matches that of the car, and the microphone array is placed at the front of the car. We use a two-microphone array with 11.8 cm spacing.

The simulated dataset contains 90,000 noisy and reverberant N-speaker mixtures for training, validation and testing, where N is randomly selected from {2, 3}. The multi-channel audio signals are generated by convolving single-channel signals with RIRs simulated by the image-source method (ISM).
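The mixing step can be sketched as below, assuming the RIRs have already been simulated and are passed in as plain lists of samples. RIR simulation itself and the addition of noise are not shown, and the function names are hypothetical. A direct convolution is used to keep the sketch dependency-free; a practical pipeline would use an FFT-based routine.

```python
def convolve(signal, rir):
    """Full linear convolution of a mono signal with one room impulse response."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(rir):
            out[i + j] += s * h
    return out

def mix_multichannel(sources, rirs):
    """Sum reverberant speaker images into a multi-channel mixture.

    sources: list of N mono signals (lists of floats).
    rirs: rirs[n][c] is the RIR from speaker n to microphone channel c.
    Returns one list of samples per channel, zero-padded to a common length.
    """
    n_channels = len(rirs[0])
    length = max(
        len(src) + len(rir) - 1
        for src, src_rirs in zip(sources, rirs)
        for rir in src_rirs
    )
    mixture = [[0.0] * length for _ in range(n_channels)]
    for src, src_rirs in zip(sources, rirs):
        for c, rir in enumerate(src_rirs):
            for t, value in enumerate(convolve(src, rir)):
                mixture[c][t] += value
    return mixture
```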

Evaluation

On real-recorded data: We recorded 25 minutes of 2-channel overlapped speech in the specific car, where AISHELL speech is replayed in each region according to pre-arranged timestamps. In addition, the car's audio player is playing loud music, which is not simulated in the training data, so there is a mismatch between the training and testing conditions. We calculate the word error rate of each utterance against the AISHELL transcript. Compared to the unprocessed mixture, which has a WER of 108.71%, the WERs for the SF based on azimuth and the SF based on azimuth, elevation and distance are 45.90% and 42.74%, respectively.

Feature	1D (θ_c)	3D (l_c)	1D (θ_c)	3D (l_c)	3D ({l_c}^9)
Target	cRM	cRM	AN-BF	AN-BF	AN-BF
CER (%)	108.7	45.9	42.7	27.4	25.5
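For reference, an error rate of this kind is the edit distance between the recognized and reference token sequences divided by the reference length. The sketch below is a generic Levenshtein-based implementation, not the scoring tool used for the numbers above; passing strings compares characters, and passing lists of words compares words. Because insertions count as errors, the rate can exceed 100%, as it does for the unprocessed mixture.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance (substitutions, deletions, insertions) between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def error_rate(ref, hyp):
    """Edit distance normalized by the reference length (reference must be non-empty)."""
    return edit_distance(ref, hyp) / len(ref)

# One substituted character out of ten in the first sample below.
rate = error_rate("深圳今天适合穿外套吗", "深圳今天适合群外套吗")
```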

Samples on real-recorded data with Echo

Mixture (1st channel)	GT transcript	1D (θ_c)	3D (l_c)	1D (θ_c)	3D ({l_c}^9)

ASR: 下深圳市联合收费号码	深圳今天适合穿外套吗	深圳今天是和小外套吗	深圳今天适合群外套吗	深圳今天适合吃外套吗	深圳今天适合穿外套吗

ASR: 对这些都没有兴趣	语音可以支持哪些功能	语音配置适中	语音呗支持哪些	语音可以支持哪些空的	语音可以支持哪些功能

ASR: 中创业难坚持一次发生	我想听邓紫棋唱的穿越火线	我想听你唱	我想听你唱	我想听邓紫琴唱的首先	我想听邓紫棋唱的穿越火线

ASR: 中创业难坚持一次发生	采菊东篱下的整首古诗	采菊东篱下到郑州	采菊东篱下到郑州古诗	采取公篱下的整首古诗	采菊东篱下的整首古诗