Proceedings ArticleDOI

Improving deep neural networks using state projection vectors of subspace Gaussian mixture model as features

TL;DR: This paper proposes using the state specific vectors of SGMM as features that provide additional phonetic information to the DNN framework; combining them with LDA bottleneck features yields improved performance in the DNN framework.
Abstract: Recent advances in deep neural networks (DNNs) have surpassed the conventional hidden Markov model-Gaussian mixture model (HMM-GMM) framework owing to their efficient training procedure. Providing better phonetic context information at the input improves DNN performance. The state projection vectors (state specific vectors) in the subspace Gaussian mixture model (SGMM) capture phonetic information in a low-dimensional vector space. In this paper, we propose to use the state specific vectors of SGMM as features, thereby providing additional phonetic information to the DNN framework. Each observation vector in the training data is aligned with the corresponding state specific vector of the SGMM to form the state specific vector feature set. The linear discriminant analysis (LDA) feature set is formed by applying LDA to the training data. Since bottleneck features are efficient at extracting useful discriminative information about the phonemes, both the LDA feature set and the state specific vector feature set are converted to bottleneck features. The bottleneck features of both feature sets then act as input features to train a single DNN. A relative improvement of 8.8% on the TIMIT database (core test set) and 9.7% on the WSJ corpus is obtained by using the state specific vector bottleneck feature set, compared to a DNN trained only with the LDA bottleneck feature set. Training a deep belief network-DNN (DBN-DNN) with the proposed feature set attains a WER of 20.46% on the TIMIT core test set, further demonstrating the effectiveness of our method. The state specific vectors, while acting as features, provide additional useful information related to phoneme variation; thus, combining them with LDA bottleneck features yields improved performance in the DNN framework.
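As a reading aid, here is a minimal NumPy sketch of the feature pipeline the abstract describes: each training frame is paired with the state specific vector (SSV) of its aligned HMM state, each feature set is passed through a bottleneck transform, and the two bottleneck outputs are concatenated as DNN input. All dimensions, the random stand-in data, and the fixed-projection bottleneck() are illustrative placeholders, not the paper's trained networks.

    import numpy as np

    # Illustrative sizes: T frames, 40-dim LDA features, 200 HMM states with
    # 50-dim SGMM state specific vectors (SSVs). All values are placeholders.
    T, LDA_DIM, NUM_STATES, SSV_DIM = 1000, 40, 200, 50

    rng = np.random.default_rng(0)
    lda_feats = rng.standard_normal((T, LDA_DIM))               # stand-in LDA feature set
    state_vectors = rng.standard_normal((NUM_STATES, SSV_DIM))  # stand-in SGMM SSVs
    alignment = rng.integers(0, NUM_STATES, size=T)             # frame -> HMM state id

    # Step 1: align each observation frame with the SSV of its HMM state,
    # forming the state specific vector feature set.
    ssv_feats = state_vectors[alignment]                        # shape (T, SSV_DIM)

    # Step 2: in the paper each feature set is passed through a trained
    # bottleneck network; a fixed affine projection + tanh stands in here.
    def bottleneck(x, out_dim=30, seed=0):
        w = np.random.default_rng(seed).standard_normal((x.shape[1], out_dim))
        return np.tanh(x @ (w / np.sqrt(x.shape[1])))

    lda_bn = bottleneck(lda_feats, seed=1)
    ssv_bn = bottleneck(ssv_feats, seed=2)

    # Step 3: concatenate the two bottleneck feature sets per frame; this
    # combined representation is the input to a single DNN acoustic model.
    dnn_input = np.hstack([lda_bn, ssv_bn])                     # shape (T, 60)
    print(dnn_input.shape)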
Citations
Proceedings ArticleDOI
01 Oct 2016
TL;DR: This paper uses the state specific vectors of SGMM as features to provide additional phonetic context information to the DNN framework, and investigates speech recognition performance under different training data selection strategies.
Abstract: Recent advancements and efficient training procedures have allowed deep neural networks (DNNs) to significantly outperform the hidden Markov model-Gaussian mixture model (HMM-GMM). The performance of DNNs can be improved further if they are given better phonetic context information, which is captured by the state specific vectors (SSVs) of the subspace Gaussian mixture model (SGMM). In this paper, we use the state specific vectors of SGMM as features to provide additional phonetic context information to the DNN framework. The state specific vectors are aligned with each observation vector of the training data to form the state specific vector (SSV) feature set. The combination of the linear discriminant analysis (LDA) feature set and the state specific feature set is then used as input to train the DNN. A relative improvement of up to 4.13% is obtained on a Hindi database using a DNN trained with the combined state specific and LDA feature sets, compared to a DNN trained only with the LDA feature set. Since state specific vectors provide extra information about the phonetic context, they yield improved results when combined with the DNN framework. We also investigate speech recognition performance under different training data selection strategies; the idea is to implement an approach that maximizes the information content of the training corpus. The experiments in this paper are carried out on the training data set having maximum information content.
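The abstract does not spell out the selection criterion, only that it "maximizes the information content in the training corpus". One plausible reading, sketched below purely as an assumption, is a greedy selection that keeps adding the utterance whose phone labels most increase the entropy of the selected set's phone distribution; phone_entropy and select_utterances are hypothetical helpers, not the paper's method.

    import numpy as np
    from collections import Counter

    def phone_entropy(counts):
        # Shannon entropy (bits) of the phone distribution in a Counter.
        p = np.array(list(counts.values()), dtype=float)
        p /= p.sum()
        return float(-(p * np.log2(p)).sum())

    def select_utterances(utts, budget):
        # utts: list of (utt_id, phone_label_list); budget: utterances to keep.
        selected, counts, pool = [], Counter(), list(utts)
        for _ in range(min(budget, len(pool))):
            # Greedily add the utterance that maximizes the entropy of the
            # phone distribution of everything selected so far.
            best = max(pool, key=lambda u: phone_entropy(counts + Counter(u[1])))
            pool.remove(best)
            counts += Counter(best[1])
            selected.append(best[0])
        return selected

    utts = [("u1", ["a", "a", "a"]), ("u2", ["a", "b"]), ("u3", ["b", "c", "d"])]
    print(select_utterances(utts, 2))  # picks utterances that spread phone coverage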
References
Proceedings Article
01 Jan 2011
TL;DR: The design of Kaldi is described, a free, open-source toolkit for speech recognition research that provides a speech recognition system based on finite-state automata together with detailed documentation and a comprehensive set of scripts for building complete recognition systems.
Abstract: We describe the design of Kaldi, a free, open-source toolkit for speech recognition research. Kaldi provides a speech recognition system based on finite-state automata (using the freely available OpenFst), together with detailed documentation and a comprehensive set of scripts for building complete recognition systems. Kaldi is written in C++, and the core library supports modeling of arbitrary phonetic-context sizes, acoustic modeling with subspace Gaussian mixture models (SGMM) as well as standard Gaussian mixture models, together with all commonly used linear and affine transforms. Kaldi is released under the Apache License v2.0, which is highly nonrestrictive, making it suitable for a wide community of users.
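The citing paper builds its baselines "by following the Kaldi recipe". As a hedged illustration of what that means in practice, a Kaldi recipe is a shell script driven end to end; the path below follows Kaldi's standard egs layout for TIMIT, and the exact location on a given machine is an assumption.

    import subprocess

    # Kaldi recipes live under egs/<corpus>/<version>/run.sh and chain together
    # data preparation, HMM-GMM and SGMM training, and DNN training stages.
    # Adjust the path to your own Kaldi checkout.
    KALDI_TIMIT_RECIPE = "/path/to/kaldi/egs/timit/s5"
    subprocess.run(["bash", "run.sh"], cwd=KALDI_TIMIT_RECIPE, check=True)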

5,857 citations


"Improving deep neural networks usin..." refers methods in this paper

  • ...Baseline DNN system Conventional GMM-HMM model is built by following the Kaldi recipe for both datasets....

  • ...The Baseline DBN-DNN model is built by following the Kaldi recipe....

  • ...Results of TIMIT baseline system using fMLLR features is replicated to match Kaldi baseline results....

  • ...The baseline result 21.43% is replicated to match standard Kaldi result on TIMIT dataset....

  • ...These parameters are tuned as per Dan’s DNN implementation in Kaldi [10]....

ReportDOI
01 Feb 1993

DARPA TIMIT acoustic-phonetic continuous speech corpus (TIMIT)

1,238 citations


"Improving deep neural networks usin..." refers methods in this paper

  • ...Experiments performed using TIMIT [6] and WSJ [7] substantiates our hypothesis by giving improved performance compared to DNN trained with input (LDA / fMLLR) features....

  • ...TIMIT [6] and Wall street journal (WSJ) [7] corpus are used for our experiments....

Proceedings ArticleDOI
23 Feb 1992
TL;DR: This paper presents the motivating goals, acoustic data design, text processing steps, lexicons, and testing paradigms incorporated into the multi-faceted WSJ CSR Corpus, a corpus containing significant quantities of both speech data and text data.
Abstract: The DARPA Spoken Language System (SLS) community has long taken a leadership position in designing, implementing, and globally distributing significant speech corpora widely used for advancing speech recognition research. The Wall Street Journal (WSJ) CSR Corpus described here is the newest addition to this valuable set of resources. In contrast to previous corpora, the WSJ corpus will provide DARPA its first general-purpose English, large vocabulary, natural language, high perplexity, corpus containing significant quantities of both speech data (400 hrs.) and text data (47M words), thereby providing a means to integrate speech recognition and natural language processing in application domains with high potential practical value. This paper presents the motivating goals, acoustic data design, text processing steps, lexicons, and testing paradigms incorporated into the multi-faceted WSJ CSR Corpus.

1,100 citations


"Improving deep neural networks usin..." refers methods in this paper

  • ...Experiments performed using TIMIT [6] and WSJ [7] substantiates our hypothesis by giving improved performance compared to DNN trained with input (LDA / fMLLR) features....

  • ...WSJ: To make tuning faster WSJ0 SI-84 (84 speakers/ 7240 utterances) is used as train data [7]....

  • ...TIMIT [6] and Wall street journal (WSJ) [7] corpus are used for our experiments....

Proceedings ArticleDOI
01 Dec 2013
TL;DR: This work adapts deep neural network acoustic models to a target speaker by supplying speaker identity vectors (i-vectors) as input features in parallel with the regular acoustic features; the resulting networks are comparable in performance to DNNs trained on speaker-adapted features, with the advantage that only one decoding pass is needed.
Abstract: We propose to adapt deep neural network (DNN) acoustic models to a target speaker by supplying speaker identity vectors (i-vectors) as input features to the network in parallel with the regular acoustic features for ASR. For both training and test, the i-vector for a given speaker is concatenated to every frame belonging to that speaker and changes across different speakers. Experimental results on a 300-hour Switchboard corpus show that DNNs trained on speaker-independent features and i-vectors achieve a 10% relative improvement in word error rate (WER) over networks trained on speaker-independent features only. These networks are comparable in performance to DNNs trained on speaker-adapted features (with VTLN and fMLLR), with the advantage that only one decoding pass is needed. Furthermore, networks trained on speaker-adapted features and i-vectors achieve a 5-6% relative improvement in WER after Hessian-free sequence training over networks trained on speaker-adapted features only.
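A small sketch of the input scheme the abstract describes: one fixed i-vector per speaker is tiled across and concatenated onto every acoustic frame of that speaker. The dimensions and the append_ivectors helper are illustrative placeholders, not the paper's actual configuration.

    import numpy as np

    FEAT_DIM, IVEC_DIM = 40, 100  # placeholder feature and i-vector sizes

    def append_ivectors(frames_by_spk, ivec_by_spk):
        # frames_by_spk: {speaker: (T, FEAT_DIM) array of acoustic frames}
        # ivec_by_spk:   {speaker: (IVEC_DIM,) i-vector for that speaker}
        out = {}
        for spk, frames in frames_by_spk.items():
            tiled = np.tile(ivec_by_spk[spk], (frames.shape[0], 1))
            out[spk] = np.hstack([frames, tiled])  # (T, FEAT_DIM + IVEC_DIM)
        return out

    rng = np.random.default_rng(0)
    frames = {"spkA": rng.standard_normal((5, FEAT_DIM))}
    ivecs = {"spkA": rng.standard_normal(IVEC_DIM)}
    print(append_ivectors(frames, ivecs)["spkA"].shape)  # (5, 140)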

714 citations


"Improving deep neural networks usin..." refers background or methods or result in this paper

  • ...Appending SSV to input features in a similar way as appending i-vectors to input features [5] did not give appreciable results....

  • ...Similar approach as in [5], was experimented by appending SSV features along with input features for each corresponding frame....

  • ...This aspect is considered in [5] using i-vectors which carry information about speakers in a low dimensional vector....
