scispace - formally typeset
Open Access · Journal Article

Distant speech separation using predicted time-frequency masks from spatial features

TL;DR: The results show improvements in an instrumental measure of intelligibility and in frequency-weighted SNR over the complex-valued non-negative matrix factorization (CNMF) source separation approach, a spatial sound source separation method, and conventional beamformers such as the delay-and-sum beamformer (DSB) and the minimum variance distortionless response (MVDR) beamformer.
About
This article was published in Speech Communication on 2015-04-01 and is currently open access. It has received 45 citations to date. The article focuses on the topics: Intelligibility (communication) and Source separation.


Citations
Proceedings Article

Multi-Channel Deep Clustering: Discriminative Spectral and Spatial Embeddings for Speaker-Independent Speech Separation

TL;DR: It is found that simply encoding inter-microphone phase patterns as additional input features during deep clustering provides a significant improvement in separation performance, even with random microphone array geometry.
Journal Article

Complex Spectral Mapping for Single- and Multi-Channel Speech Enhancement and Robust ASR

TL;DR: A novel method of time-varying beamforming with estimated complex spectra for single- and multi-channel speech enhancement, where deep neural networks are used to predict the real and imaginary components of the direct-path signal from noisy and reverberant ones.
Journal Article

Combining Spectral and Spatial Features for Deep Learning Based Blind Speaker Separation

TL;DR: This study tightly integrates complementary spectral and spatial features for deep-learning-based multi-channel speaker separation in reverberant environments: individual speakers are first localized, and an enhancement network is then trained on spatial as well as spectral features to extract the speaker from the estimated direction and with specific spectral structures.
Journal Article

Deep Learning Based Target Cancellation for Speech Dereverberation

TL;DR: These models show excellent speech dereverberation and recognition performance on the test set of the REVERB challenge, consistently better than single- and multi-channel weighted prediction error (WPE) algorithms.
Journal Article

Time–Frequency Masking Based Online Multi-Channel Speech Enhancement With Convolutional Recurrent Neural Networks

TL;DR: This paper presents a time–frequency masking based online multi-channel speech enhancement approach that uses a convolutional recurrent neural network to estimate the mask and demonstrates the robustness of the system to different angular positions of the speech source.
References
Book

Pattern Recognition and Machine Learning

TL;DR: Probability distributions, linear models for regression, linear models for classification, neural networks, graphical models, mixture models and EM, sampling methods, continuous latent variables, and sequential data are studied.
Journal Article

Independent component analysis: algorithms and applications

TL;DR: The basic theory and applications of ICA are presented, and the goal is to find a linear representation of non-Gaussian data so that the components are statistically independent, or as independent as possible.
Journal Article

Image method for efficiently simulating small‐room acoustics

TL;DR: Presents the theoretical and practical use of image techniques for simulating the impulse response between two points in a small rectangular room; the impulse response, when convolved with any desired input signal, simulates room reverberation of that signal.
Frequently Asked Questions (15)
Q1. What are the contributions in "Distant speech separation using predicted time-frequency masks from spatial features" (Speech Communication, 2015, vol. 68, pp. 96–107)?

Microphone arrays have long been studied for the processing of distant speech. This work uses a feed-forward neural network to map a microphone array's spatial features into a T-F mask. 
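The mapping described above — spatial feature vectors in, T-F mask values out — can be sketched as a one-hidden-layer feed-forward network. The layer sizes, activations, and random weights below are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_mask(features, W1, b1, W2, b2):
    """Forward pass of a one-hidden-layer feed-forward network.

    features : (n_frames, n_features) spatial feature vectors
    returns  : (n_frames, n_bands) mask values in [0, 1]
    """
    hidden = np.tanh(features @ W1 + b1)   # hidden layer (tanh assumed)
    return sigmoid(hidden @ W2 + b2)       # sigmoid keeps mask in [0, 1]

# Toy dimensions and random weights, for illustration only
rng = np.random.default_rng(0)
n_feat, n_hidden, n_bands = 12, 16, 8
W1 = rng.standard_normal((n_feat, n_hidden)) * 0.1
b1 = np.zeros(n_hidden)
W2 = rng.standard_normal((n_hidden, n_bands)) * 0.1
b2 = np.zeros(n_bands)

mask = predict_mask(rng.standard_normal((5, n_feat)), W1, b1, W2, b2)
```

The sigmoid output layer is what makes the network's prediction directly usable as a soft T-F mask.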

A combination of DNNs and support vector machines (SVMs) for speech enhancement using binary classification of T-F bands is proposed by Wang and Wang (2013). 

The critical distance (Kuttruff, 2009), at which the sound energy of the reverberation equals that of the direct path, is approximated as rc ≈ 0.056·√(gV/T60), where V is the room volume, T60 is the reverberation time, and g = 1.62 is the directivity factor, modeled using the average human speech directivity index of 2.1 dB measured by Monson et al. (2012). 
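As a quick numeric check of the approximation (the room volume and T60 below are arbitrary example values, not from the paper):

```python
import math

def critical_distance(volume_m3, t60_s, g=1.62):
    """Critical distance r_c ≈ 0.056 * sqrt(g * V / T60) (Kuttruff, 2009)."""
    return 0.056 * math.sqrt(g * volume_m3 / t60_s)

# Example: a 50 m^3 room with T60 = 0.5 s
rc = critical_distance(50.0, 0.5)   # ≈ 0.71 m
```

Beyond roughly this distance from the talker, reverberant energy dominates the direct path, which is what makes distant-microphone separation hard.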

The error backpropagation algorithm was used to train the network (8) using stochastic gradient descent with learning rate µ = 0.1. 

Backpropagation was run on the training data until the error on the held-out test data reached a minimum that did not decrease during five successive iterations. 
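The stopping rule amounts to early stopping with a patience of five iterations; a minimal sketch (the `train_step` and `eval_error` callables are stand-ins for the actual SGD pass and held-out evaluation):

```python
def train_with_early_stopping(train_step, eval_error, patience=5, max_epochs=1000):
    """Run training epochs until the held-out error has not improved for
    `patience` successive epochs; return the best epoch and its error."""
    best_err, best_epoch, since_best = float("inf"), -1, 0
    for epoch in range(max_epochs):
        train_step()               # one SGD pass over the training data
        err = eval_error()         # error on the held-out data
        if err < best_err:
            best_err, best_epoch, since_best = err, epoch, 0
        else:
            since_best += 1
            if since_best >= patience:
                break              # patience iterations without a new minimum
    return best_epoch, best_err

# Simulated held-out error curve: minimum at epoch 2, then rising
errs = iter([5.0, 4.0, 3.0, 3.5, 3.6, 3.7, 3.8, 3.9])
best_epoch, best_err = train_with_early_stopping(lambda: None, lambda: next(errs))
```

Tracking the best epoch (rather than the last) is what lets the final model correspond to the error minimum rather than the stopping point.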

Such recordings from microphones placed in the ears are referred to as binaural, and the typical cues are the interaural time delay (ITD) and the interaural level difference (ILD). 
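These two cues can be estimated directly from a pair of ear signals. The sketch below uses a simple cross-correlation peak for the ITD and an energy ratio for the ILD; the test signal, delay, and sampling rate are arbitrary assumptions for illustration:

```python
import numpy as np

def binaural_cues(left, right, fs):
    """Estimate ITD and ILD from two ear signals.

    ITD: lag of the cross-correlation peak, in seconds
         (positive = right channel lags the left).
    ILD: left/right energy ratio in dB.
    """
    xcorr = np.correlate(right, left, mode="full")
    lag = int(np.argmax(xcorr)) - (len(left) - 1)
    ild_db = 10.0 * np.log10(np.sum(left**2) / np.sum(right**2))
    return lag / fs, ild_db

fs = 16000
t = np.arange(1024) / fs
left = np.sin(2 * np.pi * 440 * t)
# Right ear: delayed by 8 samples and attenuated to half amplitude
right = np.concatenate([np.zeros(8), left[:-8]]) * 0.5

itd, ild = binaural_cues(left, right, fs)   # itd = 8/fs s, ild ≈ 6 dB
```

The halved amplitude corresponds to a ~6 dB ILD, since power scales with the square of amplitude.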

Hummersone et al. (2014) argued that IRM may be more closely related to auditory processes than IBM and reviewed studies favoring IRM over IBM in certain ASR tasks and speech intelligibility (SI) measurements. 
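The two mask definitions compared there can be written down directly from the speech and noise power spectrograms. A minimal sketch with a toy 2×2 spectrogram and a 0 dB local criterion (the numbers are arbitrary): the IBM thresholds the local SNR, while the IRM is the soft speech-to-mixture power ratio.

```python
import numpy as np

def ideal_masks(speech_pow, noise_pow, lc_db=0.0):
    """Ideal binary mask (IBM) and ideal ratio mask (IRM) from
    speech and noise power spectrograms of the same shape."""
    snr_db = 10.0 * np.log10(speech_pow / noise_pow)
    ibm = (snr_db > lc_db).astype(float)           # hard 0/1 decision
    irm = speech_pow / (speech_pow + noise_pow)    # soft value in (0, 1)
    return ibm, irm

# Toy 2x2 power spectrograms (arbitrary values)
speech = np.array([[4.0, 1.0], [9.0, 0.25]])
noise = np.array([[1.0, 1.0], [1.0, 1.0]])
ibm, irm = ideal_masks(speech, noise)
```

The soft IRM values preserve T-F regions where speech and interference have comparable energy, which the hard IBM decision discards entirely.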

For the three source case, all speaker permutations using two separate sentences from each speaker were generated to produce 32 three-speaker mixtures. 

100 sentences from the TIMIT database (Garofolo et al., 1993) were captured in both rooms for both distances, resulting in 400 sentences. 

This work extends the speech enhancement work of Pertilä and Nikunen (2014) to source separation by introducing a post-processing stage to include information between sources into the predicted T-F masks. 

The SDR is also highest for MTT-SEP in the more reverberant room, while in the less reverberant room the SDR scores are high for MVDR, CNMF, and MTT-SEP. 

Healy et al. (2013) utilize DNNs to predict IBM, and demonstrate that the method can significantly improve SI for hearing impaired listeners. 

Both modifications decrease the leakage of interfering sources into the target speaker, thus improving the framework’s separation performance. 

The authors analyze different types of non-linear feature transforms, using varying amounts of adjacent T-F points, and propose an evolutionary quantization scheme that achieves a low transmission rate for exchanging feature values between the two hearing aid devices. 

The mentioned spatial cues are channel-pairwise, which poses no issue for two-channel recordings, but in general for M microphones the number of pairs grows as O(M²), which can result in very large spatial cue vectors for microphone arrays.
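The quadratic growth is just the number of unordered channel pairs, M(M−1)/2. For example:

```python
from itertools import combinations

def mic_pairs(m):
    """All unordered microphone pairs; the count is M*(M-1)/2, i.e. O(M^2)."""
    return list(combinations(range(m), 2))

assert len(mic_pairs(2)) == 1      # binaural recording: a single pair
assert len(mic_pairs(8)) == 28     # 8-microphone array: 28 pairwise cues
assert len(mic_pairs(64)) == 2016  # large array: the cue vector explodes
```

Each pair contributes its own cue values per T-F point, so the feature dimensionality multiplies accordingly.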