Proceedings ArticleDOI

Improving deep neural networks using state projection vectors of subspace Gaussian mixture model as features

TL;DR: This paper proposes using the state specific vectors of SGMM as features that provide additional phonetic information to the DNN framework; combining them with LDA bottleneck features yields improved performance in the DNN framework.
Abstract: Recent advances in deep neural networks (DNNs) have surpassed the conventional hidden Markov model-Gaussian mixture model (HMM-GMM) framework owing to their efficient training procedure. Providing better phonetic context information at the input improves DNN performance. The state projection vectors (state specific vectors) in the subspace Gaussian mixture model (SGMM) capture phonetic information in a low-dimensional vector space. In this paper, we propose to use the state specific vectors of SGMM as features, thereby providing additional phonetic information to the DNN framework. Each observation vector in the training data is aligned with the corresponding state specific vector of the SGMM to form the state specific vector feature set. The linear discriminant analysis (LDA) feature set is formed by applying LDA to the training data. Since bottleneck features are efficient at extracting useful discriminative information about the phonemes, both the LDA feature set and the state specific vector feature set are converted to bottleneck features. The bottleneck features of both feature sets then act as input features to train a single DNN. A relative improvement of 8.8% on the TIMIT database (core test set) and 9.7% on the WSJ corpus is obtained by using the state specific vector bottleneck feature set, compared to a DNN trained only with the LDA bottleneck feature set. Training a deep belief network-DNN (DBN-DNN) with the proposed feature set attains a WER of 20.46% on the TIMIT core test set, further demonstrating the effectiveness of our method. The state specific vectors, while acting as features, provide additional useful information related to phoneme variation; thus, combining them with LDA bottleneck features yields improved performance in the DNN framework.
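As a reading aid, here is a minimal NumPy sketch of the feature pipeline the abstract describes: each training frame is paired with the state specific vector (SSV) of its aligned HMM state, each feature set is passed through a bottleneck transform, and the two bottleneck outputs are concatenated as DNN input. All dimensions, the random stand-in data, and the fixed-projection bottleneck() are illustrative placeholders, not the paper's trained networks.

    import numpy as np

    # Illustrative sizes: T frames, 40-dim LDA features, 200 HMM states with
    # 50-dim SGMM state specific vectors (SSVs). All values are placeholders.
    T, LDA_DIM, NUM_STATES, SSV_DIM = 1000, 40, 200, 50

    rng = np.random.default_rng(0)
    lda_feats = rng.standard_normal((T, LDA_DIM))               # stand-in LDA feature set
    state_vectors = rng.standard_normal((NUM_STATES, SSV_DIM))  # stand-in SGMM SSVs
    alignment = rng.integers(0, NUM_STATES, size=T)             # frame -> HMM state id

    # Step 1: align each observation frame with the SSV of its HMM state,
    # forming the state specific vector feature set.
    ssv_feats = state_vectors[alignment]                        # shape (T, SSV_DIM)

    # Step 2: in the paper each feature set is passed through a trained
    # bottleneck network; a fixed affine projection + tanh stands in here.
    def bottleneck(x, out_dim=30, seed=0):
        w = np.random.default_rng(seed).standard_normal((x.shape[1], out_dim))
        return np.tanh(x @ (w / np.sqrt(x.shape[1])))

    lda_bn = bottleneck(lda_feats, seed=1)
    ssv_bn = bottleneck(ssv_feats, seed=2)

    # Step 3: concatenate the two bottleneck feature sets per frame; this
    # combined representation is the input to a single DNN acoustic model.
    dnn_input = np.hstack([lda_bn, ssv_bn])                     # shape (T, 60)
    print(dnn_input.shape)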
Citations
Proceedings ArticleDOI
01 Oct 2016
TL;DR: This paper uses the state specific vectors of SGMM as features to provide additional phonetic context information to the DNN framework, and investigates speech recognition performance under different training data selection strategies.
Abstract: Recent advancements and efficient training procedures have allowed deep neural networks (DNNs) to significantly outperform the hidden Markov model-Gaussian mixture model (HMM-GMM). The performance of DNNs can be improved further if they are given better phonetic context information, which is captured by the state specific vectors (SSVs) of the subspace Gaussian mixture model (SGMM). In this paper, we use the state specific vectors of SGMM as features to provide additional phonetic context information to the DNN framework. The state specific vectors are aligned with each observation vector of the training data to form the state specific vector (SSV) feature set. The combination of the linear discriminant analysis (LDA) feature set and the state specific feature set is then used as input to train the DNN. A relative improvement of up to 4.13% is obtained on a Hindi database using a DNN trained with the combined state specific and LDA feature sets, compared to a DNN trained only with the LDA feature set. Since state specific vectors provide extra information about the phonetic context, they yield improved results when combined with the DNN framework. We also investigate speech recognition performance under different training data selection strategies; the idea is to implement an approach that maximizes the information content of the training corpus. The experiments in this paper are carried out on the training data set having maximum information content.
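The abstract does not spell out the selection criterion, only that it "maximizes the information content in the training corpus". One plausible reading, sketched below purely as an assumption, is a greedy selection that keeps adding the utterance whose phone labels most increase the entropy of the selected set's phone distribution; phone_entropy and select_utterances are hypothetical helpers, not the paper's method.

    import numpy as np
    from collections import Counter

    def phone_entropy(counts):
        # Shannon entropy (bits) of the phone distribution in a Counter.
        p = np.array(list(counts.values()), dtype=float)
        p /= p.sum()
        return float(-(p * np.log2(p)).sum())

    def select_utterances(utts, budget):
        # utts: list of (utt_id, phone_label_list); budget: utterances to keep.
        selected, counts, pool = [], Counter(), list(utts)
        for _ in range(min(budget, len(pool))):
            # Greedily add the utterance that maximizes the entropy of the
            # phone distribution of everything selected so far.
            best = max(pool, key=lambda u: phone_entropy(counts + Counter(u[1])))
            pool.remove(best)
            counts += Counter(best[1])
            selected.append(best[0])
        return selected

    utts = [("u1", ["a", "a", "a"]), ("u2", ["a", "b"]), ("u3", ["b", "c", "d"])]
    print(select_utterances(utts, 2))  # picks utterances that spread phone coverage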
References
Proceedings Article
01 Jan 2011
TL;DR: The design of Kaldi is described, a free, open-source toolkit for speech recognition research that provides a speech recognition system based on finite-state automata together with detailed documentation and a comprehensive set of scripts for building complete recognition systems.
Abstract: We describe the design of Kaldi, a free, open-source toolkit for speech recognition research. Kaldi provides a speech recognition system based on finite-state automata (using the freely available OpenFst), together with detailed documentation and a comprehensive set of scripts for building complete recognition systems. Kaldi is written in C++, and the core library supports modeling of arbitrary phonetic-context sizes, acoustic modeling with subspace Gaussian mixture models (SGMM) as well as standard Gaussian mixture models, together with all commonly used linear and affine transforms. Kaldi is released under the Apache License v2.0, which is highly nonrestrictive, making it suitable for a wide community of users.
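The citing paper builds its baselines "by following the Kaldi recipe". As a hedged illustration of what that means in practice, a Kaldi recipe is a shell script driven end to end; the path below follows Kaldi's standard egs layout for TIMIT, and the exact location on a given machine is an assumption.

    import subprocess

    # Kaldi recipes live under egs/<corpus>/<version>/run.sh and chain together
    # data preparation, HMM-GMM and SGMM training, and DNN training stages.
    # Adjust the path to your own Kaldi checkout.
    KALDI_TIMIT_RECIPE = "/path/to/kaldi/egs/timit/s5"
    subprocess.run(["bash", "run.sh"], cwd=KALDI_TIMIT_RECIPE, check=True)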

5,857 citations


"Improving deep neural networks usin..." refers methods in this paper

  • ...Baseline DNN system Conventional GMM-HMM model is built by following the Kaldi recipe for both datasets....

  • ...The Baseline DBN-DNN model is built by following the Kaldi recipe....

  • ...Results of TIMIT baseline system using fMLLR features is replicated to match Kaldi baseline results....

  • ...The baseline result 21.43% is replicated to match standard Kaldi result on TIMIT dataset....

  • ...These parameters are tuned as per Dan’s DNN implementation in Kaldi [10]....

ReportDOI
01 Feb 1993

DARPA TIMIT acoustic-phonetic continuous speech corpus (TIMIT)

1,238 citations


"Improving deep neural networks usin..." refers methods in this paper

  • ...Experiments performed using TIMIT [6] and WSJ [7] substantiates our hypothesis by giving improved performance compared to DNN trained with input (LDA / fMLLR) features....

  • ...TIMIT [6] and Wall street journal (WSJ) [7] corpus are used for our experiments....

Proceedings ArticleDOI
23 Feb 1992
TL;DR: This paper presents the motivating goals, acoustic data design, text processing steps, lexicons, and testing paradigms incorporated into the multi-faceted WSJ CSR Corpus, a corpus containing significant quantities of both speech data and text data.
Abstract: The DARPA Spoken Language System (SLS) community has long taken a leadership position in designing, implementing, and globally distributing significant speech corpora widely used for advancing speech recognition research. The Wall Street Journal (WSJ) CSR Corpus described here is the newest addition to this valuable set of resources. In contrast to previous corpora, the WSJ corpus will provide DARPA its first general-purpose English, large vocabulary, natural language, high perplexity, corpus containing significant quantities of both speech data (400 hrs.) and text data (47M words), thereby providing a means to integrate speech recognition and natural language processing in application domains with high potential practical value. This paper presents the motivating goals, acoustic data design, text processing steps, lexicons, and testing paradigms incorporated into the multi-faceted WSJ CSR Corpus.

1,100 citations


"Improving deep neural networks usin..." refers methods in this paper

  • ...Experiments performed using TIMIT [6] and WSJ [7] substantiates our hypothesis by giving improved performance compared to DNN trained with input (LDA / fMLLR) features....

  • ...WSJ: To make tuning faster WSJ0 SI-84 (84 speakers/ 7240 utterances) is used as train data [7]....

  • ...TIMIT [6] and Wall street journal (WSJ) [7] corpus are used for our experiments....

Proceedings ArticleDOI
01 Dec 2013
TL;DR: This work adapts deep neural network acoustic models to a target speaker by supplying speaker identity vectors (i-vectors) as input features in parallel with the regular acoustic features; the resulting networks are comparable in performance to DNNs trained on speaker-adapted features, with the advantage that only one decoding pass is needed.
Abstract: We propose to adapt deep neural network (DNN) acoustic models to a target speaker by supplying speaker identity vectors (i-vectors) as input features to the network in parallel with the regular acoustic features for ASR. For both training and test, the i-vector for a given speaker is concatenated to every frame belonging to that speaker and changes across different speakers. Experimental results on a 300-hour Switchboard corpus show that DNNs trained on speaker-independent features and i-vectors achieve a 10% relative improvement in word error rate (WER) over networks trained on speaker-independent features only. These networks are comparable in performance to DNNs trained on speaker-adapted features (with VTLN and fMLLR), with the advantage that only one decoding pass is needed. Furthermore, networks trained on speaker-adapted features and i-vectors achieve a 5-6% relative improvement in WER after Hessian-free sequence training over networks trained on speaker-adapted features only.
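A small sketch of the input scheme the abstract describes: one fixed i-vector per speaker is tiled across and concatenated onto every acoustic frame of that speaker. The dimensions and the append_ivectors helper are illustrative placeholders, not the paper's actual configuration.

    import numpy as np

    FEAT_DIM, IVEC_DIM = 40, 100  # placeholder feature and i-vector sizes

    def append_ivectors(frames_by_spk, ivec_by_spk):
        # frames_by_spk: {speaker: (T, FEAT_DIM) array of acoustic frames}
        # ivec_by_spk:   {speaker: (IVEC_DIM,) i-vector for that speaker}
        out = {}
        for spk, frames in frames_by_spk.items():
            tiled = np.tile(ivec_by_spk[spk], (frames.shape[0], 1))
            out[spk] = np.hstack([frames, tiled])  # (T, FEAT_DIM + IVEC_DIM)
        return out

    rng = np.random.default_rng(0)
    frames = {"spkA": rng.standard_normal((5, FEAT_DIM))}
    ivecs = {"spkA": rng.standard_normal(IVEC_DIM)}
    print(append_ivectors(frames, ivecs)["spkA"].shape)  # (5, 140)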

714 citations


"Improving deep neural networks usin..." refers background or methods or result in this paper

  • ...Appending SSV to input features in a similar way as appending i-vectors to input features [5] did not give appreciable results....

  • ...Similar approach as in [5], was experimented by appending SSV features along with input features for each corresponding frame....

  • ...This aspect is considered in [5] using i-vectors which carry information about speakers in a low dimensional vector....
