From 1D to 3D spatial features

SF based on azimuth
SF based on azimuth and elevation
SF based on azimuth, elevation and distance


Effects of location uncertainty

We report performance when the spatial features are extracted with deviated azimuth, elevation, or distance estimates.

It can be observed that the 3D spatial feature based on azimuth, elevation and distance is quite sensitive to localization accuracy compared with the other spatial features. To alleviate this problem, we have explored several strategies, e.g., data augmentation that injects location errors during training (the red line in the Azimuth figure).
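As a concrete illustration, this kind of error injection can be as simple as jittering the annotated location before the spatial features are computed. The function below is a minimal sketch; the maximum deviations are assumptions for illustration, not the values used in the experiments.

```python
import random

def perturb_location(theta, phi, d,
                     max_dtheta=10.0, max_dphi=10.0, max_dd=0.3):
    """Jitter an annotated location (azimuth deg, elevation deg, distance m)
    so the model sees localization errors during training.
    The maximum deviations here are illustrative assumptions."""
    return (theta + random.uniform(-max_dtheta, max_dtheta),
            phi + random.uniform(-max_dphi, max_dphi),
            d + random.uniform(-max_dd, max_dd))
```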

In this work, we sample candidate locations in the adjacent 3D region around the given location (θ, φ, d), and design an attention mechanism to selectively focus on these candidate locations.
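A minimal sketch of this candidate-location attention, assuming dot-product scoring followed by weight-and-sum (the grid offsets and the query vector are illustrative placeholders, not the learned parameters of the actual model):

```python
import numpy as np

def attend_candidates(cand_feats, query):
    """Weight-and-sum attention over per-candidate spatial features.
    cand_feats: (K, D) features, one row per candidate location.
    query: (D,) scoring vector (learned in practice; fixed in this sketch)."""
    scores = cand_feats @ query
    w = np.exp(scores - scores.max())
    w /= w.sum()                      # softmax attention weights
    return w @ cand_feats             # fused (D,) feature

# candidate locations on a small grid around the given (theta, phi, d)
theta, phi, d = 30.0, 10.0, 0.5
cands = np.array([[theta + dt, phi + dp, d + dd]
                  for dt in (-5.0, 0.0, 5.0)
                  for dp in (-5.0, 0.0, 5.0)
                  for dd in (-0.1, 0.0, 0.1)])   # (27, 3)
fused = attend_candidates(cands, np.array([0.0, 0.0, 1.0]))
```

In the real model each candidate would be represented by its extracted spatial feature rather than its raw coordinates; the weight-and-sum step is the same.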

Fig. Separation performance under deviated azimuth (left), elevation (middle) and distance (right).

Experiments on real-recorded data

Scenario

In order to evaluate the proposed method in a real-life application, we consider an in-car scenario (Figure 1). As shown in Figure 2, there are 4 potential speakers and corresponding regions in the car: the main driver (S1), the co-driver (S2) and two passengers (S3 & S4) in the back seats. We take the main driver's voice as the target. From the top view, the azimuths of the main driver (S1) and the passenger behind him (S3) are very close, so it is difficult to distinguish these two speakers with a spatial feature based on azimuth only.

Fig.1 The detailed dimensions of the car.
Fig.2 The 4 speakers corresponding to the 4 regions in the car.

Spatial Features

We assume that the main driver (as well as each passenger) is located within a limited region (a 3D box as shown in the figure below; each dot indicates a potential speaker position). The length, width and height of this box, and its distance to the microphone-array center, are measured for the specific car. With this information, we can calculate the center point of the 3D box and treat it as the expected speaker location.
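For illustration, converting a measured box into the (azimuth, elevation, distance) of its center point might look like the following; the box coordinates are made-up numbers, not measurements from the actual car.

```python
import math

def loc_to_aed(x, y, z):
    """Cartesian point (relative to the mic-array center, metres)
    -> (azimuth deg, elevation deg, distance m)."""
    dist = math.sqrt(x * x + y * y + z * z)
    azimuth = math.degrees(math.atan2(y, x))
    elevation = math.degrees(math.asin(z / dist))
    return azimuth, elevation, dist

# hypothetical box for the main-driver region
box_min = (0.4, -0.3, -0.2)            # one corner (x, y, z)
box_max = (0.8,  0.1,  0.2)            # opposite corner
center = tuple((a + b) / 2 for a, b in zip(box_min, box_max))
theta, phi, d = loc_to_aed(*center)    # location used for the spatial features
```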

1D Spatial feature: The azimuth-based spatial features fed into the separation network are always extracted using the fixed center locations of the 4 regions (boxes). As shown in the figure, the center locations of the four regions are marked with crosses; only the azimuths of these center points are used to calculate the spatial features.

3D Spatial feature: The fixed center locations (including azimuth, elevation and distance) of 4 regions are used to calculate 3D spatial features.

3D Region feature: We can also sample the 8 vertexes of the 3D box to obtain a full spatial view of the whole region. As shown in the figure, 4 (regions) × 9 (spatial positions) spatial features computed from azimuth, elevation and distance are extracted and fed to the separation model. We design a simple attention mechanism (i.e., weight-and-sum) so that the model selectively focuses on useful spatial positions.
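Enumerating the 9 sampling positions per region (the center plus the 8 box vertexes) can be sketched as below; the helper name and box values are illustrative assumptions.

```python
import itertools

def region_positions(box_min, box_max):
    """Return the center plus the 8 vertexes of a 3D box region,
    i.e. the 9 sampling positions per region."""
    center = tuple((a + b) / 2 for a, b in zip(box_min, box_max))
    vertexes = list(itertools.product(*zip(box_min, box_max)))  # 8 corners
    return [center] + vertexes

# one hypothetical region; with 4 regions this yields 4 x 9 = 36 features
positions = region_positions((0.4, -0.3, -0.2), (0.8, 0.1, 0.2))
```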

Data Preparation

In order to train a model that separates the main driver's voice, we simulate a reverberant, noisy dataset. The room size matches that of the car, and a dual-microphone array with 11.8 cm spacing is placed at the front of the car.

The simulated dataset contains 90,000 noisy and reverberant N-speaker mixtures for training, validation and testing, where N is randomly selected from {2, 3}. The multi-channel audio signals are generated by convolving single-channel signals with RIRs simulated by the image-source method (ISM).
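The multi-channel generation step can be sketched as follows; the toy exponentially-decaying "RIRs" here merely stand in for real ISM-simulated responses (which would come from an image-source simulator such as pyroomacoustics).

```python
import numpy as np

def spatialize(mono, rirs):
    """Convolve a single-channel signal with one RIR per microphone
    to produce multi-channel audio."""
    return np.stack([np.convolve(mono, h) for h in rirs])

rng = np.random.default_rng(0)
mono = rng.standard_normal(16000)                 # 1 s of noise at 16 kHz
decay = np.exp(-np.arange(800) / 100.0)           # toy exponential decay
rirs = [rng.standard_normal(800) * decay for _ in range(2)]  # 2-channel RIRs
multi = spatialize(mono, rirs)                    # (2, 16000 + 800 - 1)
```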

Evaluation

On real-recorded data: We recorded 25 minutes of 2-channel overlapped speech in the specific car, where AISHELL utterances are replayed according to pre-arranged timestamps in each region. In addition, the car's player plays loud music, which is not simulated in the training data, so there is a mismatch between the training and testing phases. We calculate the character error rate (CER) of each utterance against the AISHELL transcript. Compared with the unprocessed mixture (CER 108.71%), the CERs for the SF based on azimuth and the SF based on azimuth, elevation and distance are 45.90% and 42.74%, respectively.
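For reference, the character error rate is the Levenshtein distance between hypothesis and reference, normalized by reference length (this is the standard definition; the exact scoring tool used is not specified here):

```python
def cer(ref, hyp):
    """Character error rate = edit distance(ref, hyp) / len(ref).
    Can exceed 100% when the hypothesis contains many insertions."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))          # dp[j] = distance(ref[:i], hyp[:j])
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                           # deletion
                        dp[j - 1] + 1,                       # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))   # substitution
            prev = cur
    return dp[n] / m
```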

| Feature | Mixture | 1D (θ_c) | 3D (l_c) | 1D (θ_c) | 3D (l_c) | 3D ({l_c}^9) |
| Target  | –       | cRM      | cRM      | AN-BF    | AN-BF    | AN-BF        |
| CER (%) | 108.7   | 45.9     | 42.7     | –        | 27.4     | 25.5         |

Samples on real-recorded data with Echo

| Mixture ASR (1st channel) | GT transcript | 1D (θ_c) | 3D (l_c) | 1D (θ_c) | 3D ({l_c}^9) |
| 下深圳市联合收费号码 | 深圳今天适合穿外套吗 ("Is it suitable to wear a coat in Shenzhen today?") | 深圳今天是和小外套吗 | 深圳今天适合外套吗 | 深圳今天适合外套吗 | 深圳今天适合穿外套吗 |
| 对这些都没有兴趣 | 语音可以支持哪些功能 ("Which functions does the voice assistant support?") | 语音配置适中 | 语音支持哪些 | 语音可以支持哪些空的 | 语音可以支持哪些功能 |
| 中创业难坚持一次发生 | 我想听邓紫棋唱的穿越火线 ("I want to hear 'Cross Fire' sung by G.E.M.") | 我想听 | 我想听 | 我想听邓紫琴唱的首先 | 我想听邓紫棋唱的穿越火线 |
| 中创业难坚持一次发生 | 采菊东篱下的整首古诗 ("the whole ancient poem containing 'picking chrysanthemums by the eastern fence'") | 采菊东篱下到郑州 | 采菊东篱下到郑州古诗 | 取公篱下的整首古诗 | 采菊东篱下的整首古诗 |