Proceedings ArticleDOI

Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code

Ossama Abdel-Hamid, +1 more
pp. 7942-7946
TLDR
A new fast speaker adaptation method for the hybrid NN-HMM speech recognition model that can achieve over 10% relative reduction in phone error rate by using only seven utterances for adaptation.
Abstract
In this paper, we propose a new fast speaker adaptation method for the hybrid NN-HMM speech recognition model. The adaptation method relies on jointly learning a large generic adaptation neural network shared by all speakers together with many small speaker codes (one per speaker). The joint training procedure uses all training data along with speaker labels to update the adaptation NN weights and the speaker codes with the standard back-propagation algorithm. In this way, the learned adaptation NN is capable of transforming each speaker's features into a generic speaker-independent feature space when a small speaker code is given. Adaptation to a new speaker can then be done simply by learning a new speaker code with the same back-propagation algorithm, without changing any NN weights. In this method, a separate speaker code is learned for each speaker, while the large adaptation NN is learned from the whole training set. The main advantage of this method is that the speaker codes are very small. As a result, the hybrid NN/HMM model can be adapted to each speaker very quickly from only a small amount of adaptation data (i.e., just a few utterances). Experimental results on TIMIT show that the method achieves over 10% relative reduction in phone error rate using only seven utterances for adaptation.
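The mechanics described in the abstract can be summarized in a short sketch. This is a minimal illustration and not the authors' implementation: PyTorch is assumed, and all names (AdaptationNet, speaker_codes, adapt_to_new_speaker) and dimensions are hypothetical.

```python
# Sketch of speaker-code adaptation for a hybrid NN/HMM acoustic model.
# An adaptation network maps (acoustic features, speaker code) -> normalized
# features that feed a fixed speaker-independent (SI) classification network.
import torch
import torch.nn as nn

FEAT_DIM, CODE_DIM, HID = 40, 50, 512   # illustrative sizes

class AdaptationNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FEAT_DIM + CODE_DIM, HID), nn.Sigmoid(),
            nn.Linear(HID, FEAT_DIM))
    def forward(self, feats, code):
        # Broadcast the per-speaker code across all frames of the utterance.
        code = code.expand(feats.size(0), -1)
        return self.net(torch.cat([feats, code], dim=1))

adapt_net = AdaptationNet()
# During joint training, one small code per training speaker is updated by
# back-propagation together with the adaptation NN weights.
speaker_codes = nn.ParameterDict()

def adapt_to_new_speaker(si_model, adapt_net, utts, labels, steps=50):
    """Fast adaptation: learn ONLY a new speaker code; all NN weights frozen."""
    code = nn.Parameter(torch.zeros(1, CODE_DIM))
    opt = torch.optim.SGD([code], lr=0.1)   # optimizer sees just the code
    for _ in range(steps):
        for feats, lab in zip(utts, labels):
            logits = si_model(adapt_net(feats, code))
            loss = nn.functional.cross_entropy(logits, lab)
            opt.zero_grad(); loss.backward(); opt.step()
    return code
```

The key property is visible in adapt_to_new_speaker: the optimizer updates only the small code vector, which is why adaptation is fast and needs only a few utterances.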



Citations
Journal ArticleDOI

A regression approach to speech enhancement based on deep neural networks

TL;DR: The proposed DNN approach can effectively suppress highly nonstationary noise, which is difficult to handle in general, and copes well with noisy speech recorded in real-world scenarios without producing the annoying musical artifacts commonly observed with conventional enhancement methods.
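As a rough illustration of the regression approach, the sketch below maps a context window of noisy log-power spectral frames directly to a clean center frame under an MSE objective. The feature type, context size, and layer widths are assumptions, not details taken from the paper.

```python
# Sketch of a regression DNN for speech enhancement: noisy context -> clean frame.
import torch
import torch.nn as nn

N_FREQ, CONTEXT = 257, 11           # spectral bins per frame, frames of context

enhancer = nn.Sequential(
    nn.Linear(N_FREQ * CONTEXT, 2048), nn.ReLU(),
    nn.Linear(2048, 2048), nn.ReLU(),
    nn.Linear(2048, N_FREQ))        # predicts the clean center frame

def train_step(noisy_ctx, clean_frame, opt):
    # Minimize mean squared error between predicted and clean log-spectra.
    loss = nn.functional.mse_loss(enhancer(noisy_ctx), clean_frame)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```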
Proceedings ArticleDOI

Speaker adaptation of neural network acoustic models using i-vectors

TL;DR: This work proposes to adapt deep neural network acoustic models to a target speaker by supplying speaker identity vectors (i-vectors) as input features to the network, in parallel with the regular acoustic features. The adapted models are comparable in performance to DNNs trained on speaker-adapted features, with the advantage that only one decoding pass is needed.
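The core idea reduces to feature augmentation: one fixed-length i-vector per speaker is appended to every frame's features. A minimal sketch, assuming PyTorch; the helper name is hypothetical.

```python
# Sketch of i-vector input adaptation: append a per-speaker i-vector to each frame.
import torch

def augment_with_ivector(feats, ivector):
    """feats: (frames, feat_dim); ivector: (ivec_dim,) for the current speaker."""
    ivec = ivector.unsqueeze(0).expand(feats.size(0), -1)
    return torch.cat([feats, ivec], dim=1)  # DNN input dim = feat_dim + ivec_dim
```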
Journal ArticleDOI

Speech Recognition Using Deep Neural Networks: A Systematic Review

TL;DR: Provides a thorough examination of the studies on deep learning for speech applications conducted since 2006, when deep learning first arose as a new area of machine learning.
Proceedings ArticleDOI

Learning hidden unit contributions for unsupervised speaker adaptation of neural network acoustic models

TL;DR: This paper proposes a simple yet effective model-based neural network speaker adaptation technique that learns speaker-specific hidden unit contributions given adaptation data, without requiring any form of speaker-adaptive training or labelled adaptation data.
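This technique (known as LHUC) amounts to a per-speaker rescaling of hidden unit outputs, where only the scaling parameters are learned at adaptation time. A minimal sketch, assuming PyTorch; the 2*sigmoid parametrization keeping scales in (0, 2) follows the common formulation, and the class name is hypothetical.

```python
# Sketch of learning hidden unit contributions (LHUC): each hidden unit's
# output is rescaled by a speaker-specific amplitude learned from adaptation data.
import torch
import torch.nn as nn

class LHUCLayer(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.r = nn.Parameter(torch.zeros(hidden_dim))  # per-speaker parameters
    def forward(self, hidden):
        # 2*sigmoid(r) constrains each unit's amplitude to (0, 2).
        return 2.0 * torch.sigmoid(self.r) * hidden
```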
Proceedings ArticleDOI

Improving DNN speaker independence with I-vector inputs

TL;DR: Modifications of the basic algorithm are developed that result in significant reductions in word error rate (WER), and the algorithms are shown to combine well with speaker adaptation by backpropagation, yielding a 9% relative WER reduction.
References
Journal ArticleDOI

Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition

TL;DR: A pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output, and that can significantly outperform conventional context-dependent Gaussian mixture model (GMM)-HMMs.
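In such hybrid systems, the DNN's senone posteriors are typically converted to scaled likelihoods before HMM decoding by dividing out the senone priors. A minimal sketch of this standard step, not quoted from the paper:

```python
# Sketch of the hybrid NN/HMM decoding step: posteriors -> scaled likelihoods.
import torch

def scaled_log_likelihoods(log_posteriors, log_priors):
    """log p(o|s) is proportional to log p(s|o) - log p(s).
    Shapes: (frames, senones) and (senones,)."""
    return log_posteriors - log_priors
```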
Journal ArticleDOI

Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models

TL;DR: An important feature of the method is that arbitrary adaptation data can be used (no special enrolment sentences are needed) and that adaptation performance improves as more data is used.
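For reference, the MLLR mean transform takes the standard form below (a textbook formulation, not quoted from this page): one affine transform, shared across many Gaussians, adapts the means and is estimated by maximum likelihood from the adaptation data.

```latex
% MLLR mean adaptation: W = [b, A] acts on the extended mean vector \xi.
\[
  \hat{\mu} = A\mu + b = W\xi, \qquad
  \xi = \begin{bmatrix} 1 \\ \mu \end{bmatrix}
\]
```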
Journal ArticleDOI

Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains

TL;DR: A framework for maximum a posteriori (MAP) estimation of hidden Markov models (HMM) is presented, and Bayesian learning is shown to serve as a unified approach for a wide range of speech recognition applications.
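The MAP mean update makes the Bayesian smoothing explicit: with little data the estimate stays near the prior mean, and with more data it moves toward the sample statistics. A standard formulation, with the usual notation rather than symbols from this page: \mu_0 is the prior mean, \tau the prior weight, and \gamma_t the state occupation probabilities.

```latex
% MAP update of a Gaussian mean: count-weighted interpolation between the
% prior mean and the adaptation-data statistics.
\[
  \hat{\mu} = \frac{\tau \mu_0 + \sum_t \gamma_t\, o_t}{\tau + \sum_t \gamma_t}
\]
```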
Journal ArticleDOI

Acoustic Modeling Using Deep Belief Networks

TL;DR: It is shown that better phone recognition on the TIMIT dataset can be achieved by replacing Gaussian mixture models by deep neural networks that contain many layers of features and a very large number of parameters.
Journal ArticleDOI

Maximum likelihood linear transformations for HMM-based speech recognition

TL;DR: The paper compares the two possible forms of model-based transforms: unconstrained, where any combination of mean and variance transform may be used, and constrained, which requires the variance transform to have the same form as the mean transform.
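The constrained case referred to above ties the two transforms together, which is what makes the model-space transform equivalent to a feature-space transform (up to a Jacobian term in the likelihood). A standard formulation, not quoted from this page:

```latex
% Constrained MLLR: the variance transform shares the mean transform's form,
% so adapting the model equals transforming the features (plus a
% log|A^{-1}| Jacobian term in the log-likelihood).
\[
  \hat{\mu} = A\mu + b, \quad \hat{\Sigma} = A\Sigma A^{\top}
  \;\Longleftrightarrow\;
  \hat{o}_t = A^{-1}(o_t - b)
\]
```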