Non-negative matrix factorization based compensation of music for automatic speech recognition.
Citations
Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks
An overview of noise-robust automatic speech recognition
Exemplar-Based Sparse Representations for Noise Robust Automatic Speech Recognition
Discriminatively trained recurrent neural networks for single-channel speech separation
Deep NMF for speech separation
References
Suppression of acoustic noise in speech using spectral subtraction
Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator
Monaural Sound Source Separation by Nonnegative Matrix Factorization With Temporal Continuity and Sparseness Criteria
Hidden Markov model decomposition of speech and noise
A vector Taylor series approach for environment-independent speech recognition
Frequently Asked Questions (11)
Q2. What are the future works mentioned in the paper "Non-negative matrix factorization based compensation of music for automatic speech recognition" ?
This and other techniques remain topics for future work.
Q3. What is the key aspect of the magnitude spectrogram?
A key aspect of the magnitude spectrogram representation is that the magnitude spectrogram of the sum of two signals is approximately equal to the sum of the magnitude spectrograms of the individual signals.
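A minimal numpy sketch (not from the paper; the frame length and random test signals are arbitrary) of what this additivity assumption looks like for a single analysis frame. The approximation is tightest in bins where one source dominates.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 512                       # one analysis frame (arbitrary length)
x = rng.standard_normal(n)    # stand-in for a frame of speech
y = rng.standard_normal(n)    # stand-in for a frame of music

# Magnitude spectra of each signal and of their sum
X = np.abs(np.fft.rfft(x))
Y = np.abs(np.fft.rfft(y))
XY = np.abs(np.fft.rfft(x + y))

# |X(f) + Y(f)| <= |X(f)| + |Y(f)| always (triangle inequality); the
# additivity assumption treats the two sides as approximately equal.
rel_dev = np.linalg.norm((X + Y) - XY) / np.linalg.norm(XY)
print(f"relative deviation from additivity: {rel_dev:.2f}")
```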
Q4. What is the reason for the problem of recognizing speech in the presence of non-stationary noise?
Since the authors have only the noisy speech to estimate the instantaneous noise from, they require stronger a priori information about the signals involved, namely the speech and the noise.
Q5. What is the model for the noisy speech spectrogram?
The model for the noisy speech spectrogram Y ≈ S + M can be written as Y ≈ BW (3), where B = [Bs Bm] is a matrix that combines the bases for speech and noise into a single matrix, and W = [Ws⊤ Wm⊤]⊤ combines the weights into a single matrix.
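A brief numpy illustration (with made-up dimensions and random non-negative data, not the paper's) of how stacking the speech and music bases and weights reproduces the additive model Y ≈ BsWs + BmWm = BW.

```python
import numpy as np

rng = np.random.default_rng(1)
F, T = 257, 100        # frequency bins, frames (illustrative sizes)
S_, M_ = 40, 20        # number of speech / music basis vectors

Bs = rng.random((F, S_))   # speech bases (columns are magnitude spectra)
Bm = rng.random((F, M_))   # music bases
Ws = rng.random((S_, T))   # speech activation weights
Wm = rng.random((M_, T))   # music activation weights

B = np.hstack([Bs, Bm])    # B = [Bs Bm]
W = np.vstack([Ws, Wm])    # W = [Ws; Wm]

# The stacked product equals the sum of the two source models
assert np.allclose(B @ W, Bs @ Ws + Bm @ Wm)
```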
Q6. How did the authors learn the bases for the music types?
As speech may be confused with sung vocals, the authors simplify the recognition task by semi-automatically discarding music material containing singing, using simple rules derived from MIDI references to discard shorter segments, and by listening to the rest.
Q7. What is the spectral model for speech?
If the authors represent the set of basis vectors using the matrix Bs = [bs1, . . . , bsS] and the weights using the matrix [Ws]i,t = wsi,t, they can write the model for the speech spectrogram as the product of the matrices Bs and Ws: S = BsWs (2). Similarly, the noise is modeled as a weighted sum of noise basis vectors bmi, i = 1, . . . , M, where M is the number of noise basis vectors.
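As a hedged sketch of how such a speech basis might be obtained in practice, the snippet below factorizes a (here random, stand-in) clean-speech magnitude spectrogram with scikit-learn's NMF; the library, the KL cost, and all sizes are illustrative assumptions, not the paper's setup.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(2)
S = rng.random((257, 400))   # stand-in for a clean-speech magnitude spectrogram (F x T)
K = 40                       # number of speech basis vectors (a free choice)

nmf = NMF(n_components=K, beta_loss='kullback-leibler', solver='mu',
          init='random', max_iter=500, random_state=0)
Bs = nmf.fit_transform(S)    # F x K: learned speech bases
Ws = nmf.components_         # K x T: activations on the training data

print(Bs.shape, Ws.shape,
      np.linalg.norm(S - Bs @ Ws) / np.linalg.norm(S))  # relative reconstruction error
```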
Q8. What is the EM algorithm used to determine the weights of the spectral vectors?
Once a set of bases B is given, the weights with which they must be combined to optimally compose the spectral vectors in Y can be determined using either the EM algorithm from [10] or one of various NMF-based update rules [13].
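The sketch below implements one such NMF-based rule, the standard multiplicative update for the generalized KL divergence with the bases held fixed; it is a generic illustration, not necessarily the exact update or EM variant used in the paper.

```python
import numpy as np

def estimate_weights(Y, B, n_iter=200, eps=1e-10):
    """Find non-negative W such that Y ≈ B @ W with B held fixed,
    using multiplicative KL-divergence updates."""
    rng = np.random.default_rng(0)
    W = rng.random((B.shape[1], Y.shape[1]))       # random non-negative start
    ones = np.ones_like(Y)
    for _ in range(n_iter):
        V = B @ W + eps                            # current model of the mixture
        W *= (B.T @ (Y / V)) / (B.T @ ones + eps)  # KL multiplicative update
    return W
```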
Q9. How can the authors learn the contributions of the individual bases?
Through the application of appropriate constraints, the authors can then estimate the contributions of the individual bases, and thereby reconstitute the individual sources contributing to the mixture.
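One common way to reconstitute the speech from the estimated contributions is a Wiener-style ratio mask built from the per-source reconstructions; this sketch assumes the Bs, Bm and the estimated Ws, Wm from above and is a generic illustration rather than the paper's exact procedure.

```python
import numpy as np

def separate_speech(Y, Bs, Bm, Ws, Wm, eps=1e-10):
    """Estimate the speech magnitude spectrogram from the noisy mixture Y
    by weighting Y with the speech share of the composed model."""
    S_hat = Bs @ Ws                        # speech part of the model
    M_hat = Bm @ Wm                        # music (noise) part of the model
    mask = S_hat / (S_hat + M_hat + eps)   # ratio mask in [0, 1]
    return mask * Y                        # speech estimate, same shape as Y
```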
Q10. What is the main reason for the difficulty of recognizing speech in the presence of nonstationary noise?
However, a simple characterization such as a linear dynamical system (or the coarser Gaussian mixture model) is insufficiently detailed for signals such as music or speech, which have a nearly unlimited range of variation.
Q11. What is the compositional model of speech?
The compositional model represents the magnitude spectrum st of speech in frame t as a weighted linear non-negative combination of basis vectors bsi: st = Σi wsi,t bsi, with all weights wsi,t ≥ 0.