scispace - formally typeset
Search or ask a question
Author

JJ Odell

Other affiliations: University of Cambridge
Bio: JJ Odell is an academic researcher from Microsoft. The author has contributed to research in topics: Language model & Hidden Markov model. The author has an hindex of 22, co-authored 39 publications receiving 5955 citations. Previous affiliations of JJ Odell include University of Cambridge.

Papers
More filters
16 Sep 1995
TL;DR: The Fundamentals of HTK: General Principles of HMMs, Recognition and Viterbi Decoding, and Continuous Speech Recognition.
Abstract: 1 The Fundamentals of HTK 2 1.1 General Principles of HMMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2 Isolated Word Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3 Output Probability Specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.4 Baum-Welch Re-Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.5 Recognition and Viterbi Decoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 1.6 Continuous Speech Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 1.7 Speaker Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2,095 citations

Proceedings ArticleDOI
08 Mar 1994
TL;DR: This paper describes a method of creating a tied-state continuous speech recognition system using a phonetic decision tree, which is shown to lead to similar recognition performance to that obtained using an earlier data-driven approach but to have the additional advantage of providing a mapping for unseen triphones.
Abstract: The key problem to be faced when building a HMM-based continuous speech recogniser is maintaining the balance between model complexity and available training data. For large vocabulary systems requiring cross-word context dependent modelling, this is particularly acute since many such contexts will never occur in the training data. This paper describes a method of creating a tied-state continuous speech recognition system using a phonetic decision tree. This tree-based clustering is shown to lead to similar recognition performance to that obtained using an earlier data-driven approach but to have the additional advantage of providing a mapping for unseen triphones. State-tying is also compared with traditional model-based tying and shown to be clearly superior. Experimental results are presented for both the Resource Management and Wall Street Journal tasks.

781 citations

Proceedings ArticleDOI
19 Apr 1994
TL;DR: This work has extended the approach from using word-internal gender independent modelling to use decision tree based state clustering, cross-word triphones and gender dependent models, and gave the lowest error rate reported on the 5 k/20 k word bigram and 20 k word trigram "hub" tests.
Abstract: HTK is a portable software toolkit for building speech recognition systems using continuous density hidden Markov models developed by the Cambridge University Speech Group. One particularly successful type of system uses mixture density tied-state triphones. We have used this technique for the 5 k/20 k word ARPA Wall Street Journal (WSJ) task. We have extended our approach from using word-internal gender independent modelling to use decision tree based state clustering, cross-word triphones and gender dependent models. Our current systems can be run with either bigram or trigram language models using a single pass dynamic network decoder. Systems based on these techniques were included in the November 1993 ARPA WSJ evaluation, and gave the lowest error rate reported on the 5 k word bigram, 5 k word trigram and 20 k word bigram "hub" tests and the second lowest error rate on the 20 k word trigram "hub" test. >

298 citations

Journal ArticleDOI
TL;DR: Experimental results show that MMIE optimisation of system structure and parameters can yield useful increases in recognition accuracy and the use of lattices makes MMIe training practicable for very complex recognition systems and large training sets.
Abstract: This paper describes a framework for optimising the structure and parameters of a continuous density HMM-based large vocabulary recognition system using the Maximum Mutual Information Estimation (MMIE) criterion. To reduce the computational complexity of the MMIE training algorithm, confusable segments of speech are identified and stored as word lattices of alternative utterance hypotheses. An iterative mixture splitting procedure is also employed to adjust the number of mixture components in each state during training such that the optimal balance between the number of parameters and the available training data is achieved. Experiments are presented on various test sets from the Wall Street Journal database using up to 66 hours of acoustic training data. These demonstrate that the use of lattices makes MMIE training practicable for very complex recognition systems and large training sets. Furthermore, the experimental results show that MMIE optimisation of system structure and parameters can yield useful increases in recognition accuracy.

203 citations


Cited by
More filters
Proceedings Article
01 Jan 2011
TL;DR: The design of Kaldi is described, a free, open-source toolkit for speech recognition research that provides a speech recognition system based on finite-state automata together with detailed documentation and a comprehensive set of scripts for building complete recognition systems.
Abstract: We describe the design of Kaldi, a free, open-source toolkit for speech recognition research. Kaldi provides a speech recognition system based on finite-state automata (using the freely available OpenFst), together with detailed documentation and a comprehensive set of scripts for building complete recognition systems. Kaldi is written is C++, and the core library supports modeling of arbitrary phonetic-context sizes, acoustic modeling with subspace Gaussian mixture models (SGMM) as well as standard Gaussian mixture models, together with all commonly used linear and affine transforms. Kaldi is released under the Apache License v2.0, which is highly nonrestrictive, making it suitable for a wide community of users.

5,857 citations

Book
01 Jan 2000
TL;DR: This book takes an empirical approach to language processing, based on applying statistical and other machine-learning algorithms to large corpora, to demonstrate how the same algorithm can be used for speech recognition and word-sense disambiguation.
Abstract: From the Publisher: This book takes an empirical approach to language processing, based on applying statistical and other machine-learning algorithms to large corpora.Methodology boxes are included in each chapter. Each chapter is built around one or more worked examples to demonstrate the main idea of the chapter. Covers the fundamental algorithms of various fields, whether originally proposed for spoken or written language to demonstrate how the same algorithm can be used for speech recognition and word-sense disambiguation. Emphasis on web and other practical applications. Emphasis on scientific evaluation. Useful as a reference for professionals in any of the areas of speech and language processing.

3,794 citations

Journal ArticleDOI
TL;DR: A pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output that can significantly outperform the conventional context-dependent Gaussian mixture model (GMM)-HMMs.
Abstract: We propose a novel context-dependent (CD) model for large-vocabulary speech recognition (LVSR) that leverages recent advances in using deep belief networks for phone recognition. We describe a pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output. The deep belief network pre-training algorithm is a robust and often helpful way to initialize deep neural networks generatively that can aid in optimization and reduce generalization error. We illustrate the key components of our model, describe the procedure for applying CD-DNN-HMMs to LVSR, and analyze the effects of various modeling choices on performance. Experiments on a challenging business search dataset demonstrate that CD-DNN-HMMs can significantly outperform the conventional context-dependent Gaussian mixture model (GMM)-HMMs, with an absolute sentence accuracy improvement of 5.8% and 9.2% (or relative error reduction of 16.0% and 23.2%) over the CD-GMM-HMMs trained using the minimum phone error rate (MPE) and maximum-likelihood (ML) criteria, respectively.

3,120 citations

Journal ArticleDOI
TL;DR: An important feature of the method is that arbitrary adaptation data can be used—no special enrolment sentences are needed and that as more data is used the adaptation performance improves.
Abstract: A method of speaker adaptation for continuous density hidden Markov models (HMMs) is presented. An initial speaker-independent system is adapted to improve the modelling of a new speaker by updating the HMM parameters. Statistics are gathered from the available adaptation data and used to calculate a linear regression-based transformation for the mean vectors. The transformation matrices are calculated to maximize the likelihood of the adaptation data and can be implemented using the forward–backward algorithm. By tying the transformations among a number of distributions, adaptation can be performed for distributions which are not represented in the training data. An important feature of the method is that arbitrary adaptation data can be used—no special enrolment sentences are needed. Experiments have been performed on the ARPA RM1 database using an HMM system with cross-word triphones and mixture Gaussian output distributions. Results show that adaptation can be performed using as little as 11 s of adaptation data, and that as more data is used the adaptation performance improves. For example, using 40 adaptation utterances, a 37% reduction in error from the speaker-independent system was achieved with supervised adaptation and a 32% reduction in unsupervised mode.

2,504 citations

Book
09 Feb 2012
TL;DR: A new type of output layer that allows recurrent networks to be trained directly for sequence labelling tasks where the alignment between the inputs and the labels is unknown, and an extension of the long short-term memory network architecture to multidimensional data, such as images and video sequences.
Abstract: Recurrent neural networks are powerful sequence learners. They are able to incorporate context information in a flexible way, and are robust to localised distortions of the input data. These properties make them well suited to sequence labelling, where input sequences are transcribed with streams of labels. The aim of this thesis is to advance the state-of-the-art in supervised sequence labelling with recurrent networks. Its two main contributions are (1) a new type of output layer that allows recurrent networks to be trained directly for sequence labelling tasks where the alignment between the inputs and the labels is unknown, and (2) an extension of the long short-term memory network architecture to multidimensional data, such as images and video sequences.

2,101 citations