Proceedings ArticleDOI

Non-negative subspace projection during conventional MFCC feature extraction for noise robust speech recognition

TL;DR: An additional feature processing algorithm using Non-negative Matrix Factorization (NMF) is proposed for inclusion during the conventional extraction of Mel-frequency cepstral coefficients (MFCC), to achieve noise robustness in HMM-based speech recognition.
Abstract: An additional feature processing algorithm using Non-negative Matrix Factorization (NMF) is proposed to be included during the conventional extraction of Mel-frequency cepstral coefficients (MFCC) for achieving noise robustness in HMM-based speech recognition. The proposed approach reconstructs log-Mel filterbank outputs of speech data from a set of building blocks that form the bases of a speech subspace. The bases are learned using standard NMF on the training data. A variation of learning the bases is proposed, which uses histogram-equalized activation coefficients during training, to achieve noise robustness. The proposed methods give up to 5.96% absolute improvement in recognition accuracy on the Aurora-2 task over a baseline with standard MFCCs, and up to 13.69% improvement when combined with other feature normalization techniques like Histogram Equalization (HEQ) and Heteroscedastic Linear Discriminant Analysis (HLDA).
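As a rough sketch of the training-time step this abstract describes, the snippet below learns non-negative bases of the log-Mel speech subspace using the standard multiplicative KL-divergence NMF updates; the function name, the rank r = 40, and the flooring of log-Mel energies to non-negative values are illustrative assumptions rather than details taken from the paper. In the proposed pipeline, each frame's log-Mel vector is then reconstructed on these bases before the usual DCT step.

```python
import numpy as np

def learn_speech_bases(V, r=40, n_iter=200, eps=1e-10):
    """Learn non-negative bases W of the speech subspace from a
    (mel_bands x frames) matrix V of log-Mel filterbank outputs of
    clean training speech, via multiplicative KL-divergence updates.
    V must be non-negative, so log-Mel energies are assumed floored."""
    m, n = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((m, r)) + eps
    H = rng.random((r, n)) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W
```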
Citations
Posted Content
TL;DR: An MLLR-based, computationally efficient run-time noise adaptation method in the SPLICE framework is proposed, along with a modification to the training process of the SPLICE algorithm for noise robust speech recognition.
Abstract: Speech recognition system performance degrades in noisy environments. If the acoustic models are built using features of clean utterances, the features of a noisy test utterance will be acoustically mismatched with the trained model. This gives poor likelihoods and poor recognition accuracy. Model adaptation and feature normalisation are two broad areas that address this problem. While the former often gives better performance, the latter involves estimating fewer parameters, making the system feasible for practical implementations. This research focuses on the efficacies of various subspace, statistical and stereo-based feature normalisation techniques. A subspace projection based method has been investigated as a standalone and adjunct technique involving reconstruction of noisy speech features from a precomputed set of clean speech building blocks. The building blocks are learned using non-negative matrix factorisation (NMF) on log-Mel filter bank coefficients, and form a basis for the clean speech subspace. The work provides a detailed study of how the method can be incorporated into the extraction process of Mel-frequency cepstral coefficients. Experimental results show that the new features are robust to noise and achieve better results when combined with existing techniques. The work also proposes a modification to the training process of the SPLICE algorithm for noise robust speech recognition. It is based on feature correlations, and enables this stereo-based algorithm to improve performance in all noise conditions, especially in unseen cases. Further, the modified framework is extended to work for non-stereo datasets, where clean and noisy training utterances are required but not their stereo counterparts. Finally, an MLLR-based, computationally efficient run-time noise adaptation method in the SPLICE framework has been proposed.
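The run-time counterpart described above, reconstructing noisy features from the precomputed clean-speech building blocks, might look like the hedged sketch below: the bases W stay fixed, only the activations H are refined, and conventional MFCC extraction resumes with the DCT. Names and iteration counts are assumptions.

```python
import numpy as np
from scipy.fftpack import dct

def reconstruct_logmel(V, W, n_iter=100, eps=1e-10):
    """Reconstruct (possibly noisy) log-Mel frames V on precomputed
    clean-speech bases W, refining only the activations H with the
    KL-divergence multiplicative rule."""
    H = np.full((W.shape[1], V.shape[1]), 1.0 / W.shape[1])
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
    return W @ H

# Conventional MFCC extraction then resumes from the reconstruction, e.g.:
# mfccs = dct(reconstruct_logmel(V, W), type=2, axis=0, norm='ortho')[:13]
```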

2 citations

Book ChapterDOI
01 Jan 2015
TL;DR: From experiments, it is observed that the stress information of stressed speech is not present in the complement-cosine (1-cosine) multiples of stressed speech on different inner product spaces.
Abstract: In this paper, a similarity measurement on different inner product spaces is proposed for the analysis of stressed speech. The similarity is measured between a neutral speech subspace and a stressed speech subspace, with the cosine between neutral speech and stressed speech taken as the similarity parameter. It is assumed that the speech and stress components of stressed speech are linearly related to each other. The cosine multiples of stressed speech contain its speech information, while the complement-cosine (1-cosine) multiples of stressed speech are taken as its stress component. The neutral speech subspace is created from all neutral speech in the training database, and the stressed speech subspace contains stressed (angry, sad, Lombard, happy) speech. From the experiments, it is observed that the stress information of stressed speech is not present in the complement-cosine (1-cosine) multiples of stressed speech on different inner product spaces; the linear relationship between the speech and stress components of stressed speech exists only for some specific inner product spaces. All experiments are done using the nonlinear TEO-CB-Auto-Env feature.
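A minimal sketch of the cosine split described in this abstract, under the assumption that the neutral-speech subspace is represented by an orthonormal basis (a construction the summary does not specify):

```python
import numpy as np

def cosine_split(stressed, neutral_basis, eps=1e-12):
    """Split a stressed-speech feature vector into an assumed speech part
    (cosine multiples) and stress part (complement-cosine multiples).
    neutral_basis: (dim x k) orthonormal basis of the neutral subspace,
    e.g. from an SVD of neutral training features."""
    proj = neutral_basis @ (neutral_basis.T @ stressed)  # projection onto subspace
    cos = (stressed @ proj) / (np.linalg.norm(stressed) * np.linalg.norm(proj) + eps)
    return cos * stressed, (1.0 - cos) * stressed
```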

Cites methods from "Non-negative subspace projection du..."

  • ...In another work, an orthogonal projection technique is used to remove the stress component from stressed speech [6–9]....

References
Journal ArticleDOI
21 Oct 1999-Nature
TL;DR: An algorithm for non-negative matrix factorization is demonstrated that is able to learn parts of faces and semantic features of text and is in contrast to other methods that learn holistic, not parts-based, representations.
Abstract: Is perception of the whole based on perception of its parts? There is psychological and physiological evidence for parts-based representations in the brain, and certain computational theories of object recognition rely on such representations. But little is known about how brains or computers might learn the parts of objects. Here we demonstrate an algorithm for non-negative matrix factorization that is able to learn parts of faces and semantic features of text. This is in contrast to other methods, such as principal components analysis and vector quantization, that learn holistic, not parts-based, representations. Non-negative matrix factorization is distinguished from the other methods by its use of non-negativity constraints. These constraints lead to a parts-based representation because they allow only additive, not subtractive, combinations. When non-negative matrix factorization is implemented as a neural network, parts-based representations emerge by virtue of two properties: the firing rates of neurons are never negative and synaptic strengths do not change sign.
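For illustration only, a parts-based decomposition in this spirit can be reproduced with scikit-learn's NMF on a small face dataset; the dataset, component count, and solver settings below are arbitrary choices, not the paper's:

```python
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import NMF

# Pixel intensities are non-negative, so each face is approximated as a
# purely additive combination of the learned parts (rows of components_).
faces = fetch_olivetti_faces().data          # (400, 4096) images, values in [0, 1]
model = NMF(n_components=49, init='nndsvda', solver='mu',
            beta_loss='kullback-leibler', max_iter=400, random_state=0)
activations = model.fit_transform(faces)     # per-image encodings
parts = model.components_                    # 49 non-negative face parts
```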

11,500 citations


"Non-negative subspace projection du..." refers background in this paper

  • ...gives the following iterative update rules [4], [5] for refining the matrices $W$ and $H$: $w_{mr} := w_{mr}\,\dfrac{\sum_n h_{rn} v_{mn}/[WH]_{mn}}{\sum_n h_{rn}}$....

Proceedings Article
01 Jan 2000
TL;DR: Two different multiplicative algorithms for non-negative matrix factorization are analyzed and one algorithm can be shown to minimize the conventional least squares error while the other minimizes the generalized Kullback-Leibler divergence.
Abstract: Non-negative matrix factorization (NMF) has previously been shown to be a useful decomposition for multivariate data. Two different multiplicative algorithms for NMF are analyzed. They differ only slightly in the multiplicative factor used in the update rules. One algorithm can be shown to minimize the conventional least squares error while the other minimizes the generalized Kullback-Leibler divergence. The monotonic convergence of both algorithms can be proven using an auxiliary function analogous to that used for proving convergence of the Expectation-Maximization algorithm. The algorithms can also be interpreted as diagonally rescaled gradient descent, where the rescaling factor is optimally chosen to ensure convergence.
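A compact sketch of the two multiplicative algorithms analyzed here, one per objective (variable names are generic, not from the paper):

```python
import numpy as np

def nmf_multiplicative(V, r, objective='ls', n_iter=200, eps=1e-10, seed=0):
    """Multiplicative NMF updates: 'ls' minimizes ||V - WH||^2, 'kl' the
    generalized Kullback-Leibler divergence D(V || WH). Non-negativity is
    preserved because every update factor is non-negative."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], r)) + eps
    H = rng.random((r, V.shape[1])) + eps
    for _ in range(n_iter):
        if objective == 'ls':
            H *= (W.T @ V) / (W.T @ W @ H + eps)
            W *= (V @ H.T) / (W @ (H @ H.T) + eps)
        else:  # 'kl'
            H *= (W.T @ (V / (W @ H + eps))) / (W.sum(axis=0)[:, None] + eps)
            W *= ((V / (W @ H + eps)) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H
```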

7,345 citations


"Non-negative subspace projection du..." refers background in this paper

  • ...gives the following iterative update rules [4], [5] for refining the matrices $W$ and $H$: $w_{mr} := w_{mr}\,\dfrac{\sum_n h_{rn} v_{mn}/[WH]_{mn}}{\sum_n h_{rn}}$....

Journal ArticleDOI
TL;DR: The results show that the hybrid system performed substantially better than source separation or missing data mask estimation at lower signal-to-noise ratios (SNRs), achieving up to 57.1% accuracy at SNR = -5 dB.
Abstract: This paper proposes to use exemplar-based sparse representations for noise robust automatic speech recognition. First, we describe how speech can be modeled as a linear combination of a small number of exemplars from a large speech exemplar dictionary. The exemplars are time-frequency patches of real speech, each spanning multiple time frames. We then propose to model speech corrupted by additive noise as a linear combination of noise and speech exemplars, and we derive an algorithm for recovering this sparse linear combination of exemplars from the observed noisy speech. We describe how the framework can be used for doing hybrid exemplar-based/HMM recognition by using the exemplar-activations together with the phonetic information associated with the exemplars. As an alternative to hybrid recognition, the framework also allows us to take a source separation approach which enables exemplar-based feature enhancement as well as missing data mask estimation. We evaluate the performance of these exemplar-based methods in connected digit recognition on the AURORA-2 database. Our results show that the hybrid system performed substantially better than source separation or missing data mask estimation at lower signal-to-noise ratios (SNRs), achieving up to 57.1% accuracy at SNR = -5 dB. Although not as effective as two baseline recognizers at higher SNRs, the novel approach offers a promising direction of future research on exemplar-based ASR.
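A much-simplified, non-authoritative sketch of the exemplar-based decomposition described above: noisy Mel-spectrogram patches are explained as a sparse non-negative combination of speech and noise exemplars, from which an enhanced speech estimate or a soft missing-data mask can be read off. The sparsity handling and all names below are assumptions, not the paper's exact algorithm.

```python
import numpy as np

def exemplar_decompose(V, A_speech, A_noise, lam=1.0, n_iter=200, eps=1e-10):
    """Decompose noisy patches V (one reshaped time-frequency patch per
    column) over stacked speech and noise exemplars with KL multiplicative
    updates; lam in the denominator shrinks activations toward sparsity."""
    A = np.hstack([A_speech, A_noise])             # exemplar dictionary
    H = np.full((A.shape[1], V.shape[1]), 1.0 / A.shape[1])
    for _ in range(n_iter):
        AH = A @ H + eps
        H *= (A.T @ (V / AH)) / (A.sum(axis=0)[:, None] + lam)
    speech_est = A_speech @ H[:A_speech.shape[1]]  # exemplar-based enhancement
    mask = speech_est / (A @ H + eps)              # soft missing-data mask
    return H, speech_est, mask
```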

388 citations


"Non-negative subspace projection du..." refers methods in this paper

  • ...In [3], log-Mel filterbank features of noisy speech were represented using exemplars (dictionary) of speech and noise bases....

Journal ArticleDOI
TL;DR: The theoretical results are applied to the problem of speech recognition, and word-error reductions are observed in systems that employed both diagonal- and full-covariance heteroscedastic Gaussian models tested on the TI-DIGITS database.

384 citations


"Non-negative subspace projection du..." refers background or methods in this paper

  • ...The HLDA transformation matrix is estimated in a maximum likelihood (ML) framework after building the acoustic models using 39-dimensional MFCCs, as described in [7], and is applied to both the feature vectors and the models in the conventional method....

  • ...When the classes have diagonal covariances, if the feature vectors are correlated within a class, HLDA reduces the correlation [7] and increases the likelihood of the model, thus improving the recognition accuracy.... (A sketch of applying such a transform follows below.)

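As a hedged illustration of the excerpts above, the sketch below applies an already-estimated HLDA matrix to both feature vectors and diagonal-covariance Gaussian parameters; the ML estimation of the transform itself (see [7]) is omitted, and all names and the retained dimensionality are assumptions.

```python
import numpy as np

def apply_hlda(A, feats, means, diag_covs, p=39):
    """Apply a (d x d) HLDA transform A, keeping its first p 'useful' rows,
    to features (N x d), Gaussian means (G x d) and diagonal covariances
    (G x d), so models and features live in the same transformed space."""
    A_p = A[:p]                        # retained rows of the transform
    feats_t = feats @ A_p.T            # transformed feature vectors
    means_t = means @ A_p.T            # transformed Gaussian means
    covs_t = diag_covs @ (A_p ** 2).T  # diag(A_p @ diag(s) @ A_p.T) per Gaussian
    return feats_t, means_t, covs_t
```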