scispace - formally typeset
Open Access · Journal Article

Distant speech separation using predicted time-frequency masks from spatial features

TL;DR: The results show improvements in an instrumental measure of intelligibility and in frequency-weighted SNR over the complex-valued non-negative matrix factorization (CNMF) source separation approach, a spatial sound source separation method, and conventional beamformers such as the delay-and-sum beamformer (DSB) and the minimum variance distortionless response (MVDR) beamformer.
About
This article was published in Speech Communication on 2015-04-01 and is currently open access. It has received 45 citations to date. The article focuses on the topics: Intelligibility (communication) and Source separation.


Citations
Proceedings Article

Multi-Channel Deep Clustering: Discriminative Spectral and Spatial Embeddings for Speaker-Independent Speech Separation

TL;DR: It is found that simply encoding inter-microphone phase patterns as additional input features during deep clustering provides a significant improvement in separation performance, even with random microphone array geometry.
Journal Article

Complex Spectral Mapping for Single- and Multi-Channel Speech Enhancement and Robust ASR

TL;DR: A novel method of time-varying beamforming with estimated complex spectra for single- and multi-channel speech enhancement, where deep neural networks are used to predict the real and imaginary components of the direct-path signal from noisy and reverberant ones.
Journal Article

Combining Spectral and Spatial Features for Deep Learning Based Blind Speaker Separation

TL;DR: This study tightly integrates complementary spectral and spatial features for deep-learning-based multi-channel speaker separation in reverberant environments: individual speakers are first localized, and an enhancement network is then trained on spatial as well as spectral features to extract the speaker from the estimated direction and with specific spectral structures.
Journal Article

Deep Learning Based Target Cancellation for Speech Dereverberation

TL;DR: These models show excellent speech dereverberation and recognition performance on the test set of the REVERB challenge, consistently better than single- and multi-channel weighted prediction error (WPE) algorithms.
Journal Article

Time–Frequency Masking Based Online Multi-Channel Speech Enhancement With Convolutional Recurrent Neural Networks

TL;DR: This paper presents a time–frequency masking based online multi-channel speech enhancement approach that uses a convolutional recurrent neural network to estimate the mask and demonstrates the robustness of the system to different angular positions of the speech source.
References
Book

Pattern Recognition and Machine Learning

TL;DR: Probability distributions, linear models for regression, linear models for classification, neural networks, graphical models, mixture models and EM, sampling methods, continuous latent variables, and sequential data are studied.
Journal Article

Independent component analysis: algorithms and applications

TL;DR: The basic theory and applications of ICA are presented, and the goal is to find a linear representation of non-Gaussian data so that the components are statistically independent, or as independent as possible.
Journal Article

Image method for efficiently simulating small‐room acoustics

TL;DR: Presents the theoretical and practical use of image techniques for simulating the impulse response between two points in a small rectangular room; the impulse response, when convolved with any desired input signal, simulates room reverberation of that signal.
Frequently Asked Questions (15)
Q1. What are the contributions in "Distant speech separation using predicted time-frequency masks from spatial features" (Speech Communication, 2015, vol. 68, pp. 96–107)?

Microphone arrays have long been studied for the processing of distant speech. This work uses a feed-forward neural network to map a microphone array's spatial features into a T-F mask. 
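The mapping described above — spatial feature vectors in, T-F mask values out — can be sketched as a one-hidden-layer feed-forward network. The layer sizes, activations, and random weights below are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_mask(features, W1, b1, W2, b2):
    """Forward pass of a one-hidden-layer feed-forward network.

    features : (n_frames, n_features) spatial feature vectors
    returns  : (n_frames, n_bands) mask values in [0, 1]
    """
    hidden = np.tanh(features @ W1 + b1)   # hidden layer (tanh assumed)
    return sigmoid(hidden @ W2 + b2)       # sigmoid keeps mask in [0, 1]

# Toy dimensions and random weights, for illustration only
rng = np.random.default_rng(0)
n_feat, n_hidden, n_bands = 12, 16, 8
W1 = rng.standard_normal((n_feat, n_hidden)) * 0.1
b1 = np.zeros(n_hidden)
W2 = rng.standard_normal((n_hidden, n_bands)) * 0.1
b2 = np.zeros(n_bands)

mask = predict_mask(rng.standard_normal((5, n_feat)), W1, b1, W2, b2)
```

The sigmoid output layer is what makes the network's prediction directly usable as a soft T-F mask.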

A combination of DNNs and support vector machines (SVMs) for speech enhancement using binary classification of T-F bands is proposed by Wang and Wang (2013). 

The critical distance (Kuttruff, 2009), at which the sound energy of the reverberation equals that of the direct path, is approximated as rc ≈ 0.056·√(gV/T60), where V is the room volume, T60 is the reverberation time, and g = 1.62 is the directivity factor, modeled using the average human speech directivity index of 2.1 dB measured by Monson et al. (2012). 
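As a quick numeric check of the approximation (the room volume and T60 below are arbitrary example values, not from the paper):

```python
import math

def critical_distance(volume_m3, t60_s, g=1.62):
    """Critical distance r_c ≈ 0.056 * sqrt(g * V / T60) (Kuttruff, 2009)."""
    return 0.056 * math.sqrt(g * volume_m3 / t60_s)

# Example: a 50 m^3 room with T60 = 0.5 s
rc = critical_distance(50.0, 0.5)   # ≈ 0.71 m
```

Beyond roughly this distance from the talker, reverberant energy dominates the direct path, which is what makes distant-microphone separation hard.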

The error backpropagation algorithm was used to train the network (8) using stochastic gradient descent with learning rate µ = 0.1. 

Backpropagation was run on the training data until the error on the held-out test data reached a minimum that did not decrease during five successive iterations. 
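The stopping rule amounts to early stopping with a patience of five iterations; a minimal sketch (the `train_step` and `eval_error` callables are stand-ins for the actual SGD pass and held-out evaluation):

```python
def train_with_early_stopping(train_step, eval_error, patience=5, max_epochs=1000):
    """Run training epochs until the held-out error has not improved for
    `patience` successive epochs; return the best epoch and its error."""
    best_err, best_epoch, since_best = float("inf"), -1, 0
    for epoch in range(max_epochs):
        train_step()               # one SGD pass over the training data
        err = eval_error()         # error on the held-out data
        if err < best_err:
            best_err, best_epoch, since_best = err, epoch, 0
        else:
            since_best += 1
            if since_best >= patience:
                break              # patience iterations without a new minimum
    return best_epoch, best_err

# Simulated held-out error curve: minimum at epoch 2, then rising
errs = iter([5.0, 4.0, 3.0, 3.5, 3.6, 3.7, 3.8, 3.9])
best_epoch, best_err = train_with_early_stopping(lambda: None, lambda: next(errs))
```

Tracking the best epoch (rather than the last) is what lets the final model correspond to the error minimum rather than the stopping point.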

Such recordings from microphones placed in the ears are referred to as binaural, and the typical cues are the interaural time delay (ITD) and the interaural level difference (ILD). 
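These two cues can be estimated directly from a pair of ear signals. The sketch below uses a simple cross-correlation peak for the ITD and an energy ratio for the ILD; the test signal, delay, and sampling rate are arbitrary assumptions for illustration:

```python
import numpy as np

def binaural_cues(left, right, fs):
    """Estimate ITD and ILD from two ear signals.

    ITD: lag of the cross-correlation peak, in seconds
         (positive = right channel lags the left).
    ILD: left/right energy ratio in dB.
    """
    xcorr = np.correlate(right, left, mode="full")
    lag = int(np.argmax(xcorr)) - (len(left) - 1)
    ild_db = 10.0 * np.log10(np.sum(left**2) / np.sum(right**2))
    return lag / fs, ild_db

fs = 16000
t = np.arange(1024) / fs
left = np.sin(2 * np.pi * 440 * t)
# Right ear: delayed by 8 samples and attenuated to half amplitude
right = np.concatenate([np.zeros(8), left[:-8]]) * 0.5

itd, ild = binaural_cues(left, right, fs)   # itd = 8/fs s, ild ≈ 6 dB
```

The halved amplitude corresponds to a ~6 dB ILD, since power scales with the square of amplitude.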

Hummersone et al. (2014) argued that IRM may be more closely related to auditory processes than IBM and reviewed studies favoring IRM over IBM in certain ASR tasks and speech intelligibility (SI) measurements. 
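The two mask definitions compared there can be written down directly from the speech and noise power spectrograms. A minimal sketch with a toy 2×2 spectrogram and a 0 dB local criterion (the numbers are arbitrary): the IBM thresholds the local SNR, while the IRM is the soft speech-to-mixture power ratio.

```python
import numpy as np

def ideal_masks(speech_pow, noise_pow, lc_db=0.0):
    """Ideal binary mask (IBM) and ideal ratio mask (IRM) from
    speech and noise power spectrograms of the same shape."""
    snr_db = 10.0 * np.log10(speech_pow / noise_pow)
    ibm = (snr_db > lc_db).astype(float)           # hard 0/1 decision
    irm = speech_pow / (speech_pow + noise_pow)    # soft value in (0, 1)
    return ibm, irm

# Toy 2x2 power spectrograms (arbitrary values)
speech = np.array([[4.0, 1.0], [9.0, 0.25]])
noise = np.array([[1.0, 1.0], [1.0, 1.0]])
ibm, irm = ideal_masks(speech, noise)
```

The soft IRM values preserve T-F regions where speech and interference have comparable energy, which the hard IBM decision discards entirely.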

For the three source case, all speaker permutations using two separate sentences from each speaker were generated to produce 32 three-speaker mixtures. 

100 sentences from the TIMIT database (Garofolo et al., 1993) were captured in both rooms for both distances, resulting in 400 sentences. 

This work extends the speech enhancement work of Pertilä and Nikunen (2014) to source separation by introducing a post-processing stage to include information between sources into the predicted T-F masks. 

The SDR is also highest for MTT-SEP in the more reverberant room, while in the less reverberant room the SDR scores are high for MVDR, CNMF, and MTT-SEP. 

Healy et al. (2013) utilize DNNs to predict IBM, and demonstrate that the method can significantly improve SI for hearing impaired listeners. 

Both modifications decrease the leakage of interfering sources into the target speaker, thus improving the framework’s separation performance. 

The authors analyze different types of non-linear feature transforms, using varying amounts of adjacent T-F points, and propose an evolutionary quantization scheme that achieves a low transmission rate for exchanging feature values between the two hearing aid devices. 

The mentioned spatial cues are channel-pairwise, which poses no issue for two-channel recordings, but in general for M microphones the number of pairs grows as O(M²), which can result in very large spatial cue vectors for microphone arrays.
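The quadratic growth is just the number of unordered channel pairs, M(M−1)/2. For example:

```python
from itertools import combinations

def mic_pairs(m):
    """All unordered microphone pairs; the count is M*(M-1)/2, i.e. O(M^2)."""
    return list(combinations(range(m), 2))

assert len(mic_pairs(2)) == 1      # binaural recording: a single pair
assert len(mic_pairs(8)) == 28     # 8-microphone array: 28 pairwise cues
assert len(mic_pairs(64)) == 2016  # large array: the cue vector explodes
```

Each pair contributes its own cue values per T-F point, so the feature dimensionality multiplies accordingly.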