Proceedings ArticleDOI

Subspace modeling technique using monophones for speech recognition

TL;DR: The proposed adaptive training method for estimating acoustic model parameters in a speech recognition system matches the performance of the conventional HMM system when training data is plentiful and outperforms it when training examples are scarce.
Abstract: In this paper we propose an adaptive training method for parameter estimation of acoustic models in a speech recognition system. Our technique is inspired by the Cluster Adaptive Training (CAT) method, which is used for rapid speaker adaptation. Instead of adapting the model to a speaker as in CAT, we adapt the parameters of the context-dependent triphone states (tied states) from context-independent states (monophones). This is achieved by finding a global mapping of the tied-state parameters from the parametric subspace of the monophone models. This technique is similar to the Subspace Gaussian Mixture Model (SGMM), but differs in the initialization of parameters and in the update of the Gaussian mixture component weights. We show that the proposed method matches the performance of the conventional HMM system when a large amount of training data is available and outperforms it when training examples are scarce.
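As a rough illustration of the mapping the abstract describes (the paper's actual estimation details are not reproduced here), the sketch below constrains a tied-state mean to the subspace spanned by the monophone means and fits a per-state weight vector by maximum likelihood. Identity covariances, a single Gaussian per state, and the function names are simplifying assumptions.

```python
import numpy as np

def map_tied_state(monophone_means: np.ndarray, frames: np.ndarray):
    """Illustrative only: express one tied state's mean in the monophone subspace.

    monophone_means: (D, P) matrix M whose columns are the D-dimensional
                     monophone means (the canonical parameter set).
    frames:          (N, D) feature frames aligned to this tied state.
    Returns the ML weight vector v and the mapped mean M @ v, assuming a
    single Gaussian per state with identity covariance.
    """
    x_bar = frames.mean(axis=0)  # sufficient statistic for the state
    # Minimize sum_n ||x_n - M v||^2  ==>  normal equations M^T M v = M^T x_bar
    v, *_ = np.linalg.lstsq(monophone_means, x_bar, rcond=None)
    return v, monophone_means @ v
```

In the paper's formulation the mapping is estimated jointly with the canonical monophone parameters in an adaptive-training loop, as in CAT; the closed-form fit above only illustrates the subspace constraint.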
References
Proceedings ArticleDOI
03 Oct 1996
TL;DR: A novel approach to estimating the parameters of continuous density HMMs for speaker-independent (SI) continuous speech recognition that jointly annihilates the inter-speaker variation and estimates the HMM parameters of the SI acoustic models.
Abstract: We formulate a novel approach to estimating the parameters of continuous density HMMs for speaker-independent (SI) continuous speech recognition. It is motivated by the fact that variability in SI acoustic models is attributed to both phonetic variation and variation among the speakers of the training population, the latter being independent of the information content of the speech signal. These two variation sources are decoupled and the proposed method jointly annihilates the inter-speaker variation and estimates the HMM parameters of the SI acoustic models. We compare the proposed training algorithm to the common SI training paradigm within the context of supervised adaptation. We show that the proposed acoustic models are more efficiently adapted to the test speakers, achieving significant overall word error rate reductions of 19% and 25% for 20K and 5K vocabulary tasks, respectively.
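A toy numeric sketch of the decoupling idea (not the paper's algorithm): alternate between estimating a per-speaker offset and re-estimating the canonical speaker-independent mean, so inter-speaker variation is factored out of the SI model. Real speaker-adaptive training uses full linear transforms inside EM over HMM state alignments.

```python
import numpy as np

def sat_toy(speaker_data, n_iters=3):
    """speaker_data: list of (N_s, D) frame arrays, one per training speaker."""
    D = speaker_data[0].shape[1]
    mu = np.zeros(D)                                    # canonical SI mean
    for _ in range(n_iters):
        # Step 1: per-speaker "transform" (here just an offset) given mu
        biases = [x.mean(axis=0) - mu for x in speaker_data]
        # Identifiability: constrain the offsets to average to zero, so the
        # component common to all speakers is absorbed into the canonical model
        center = np.mean(biases, axis=0)
        biases = [b - center for b in biases]
        # Step 2: re-estimate the canonical mean on offset-compensated data
        mu = np.mean([(x - b).mean(axis=0)
                      for x, b in zip(speaker_data, biases)], axis=0)
    return mu, biases
```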

586 citations


"Subspace modeling technique using m..." refers background in this paper

  • ...The Maximum Likelihood training of the Subspace Model is the same as that of other adaptive training techniques such as SAT, CAT, etc....

  • ...CAT is a generalization of Speaker Adaptive Training (SAT) [1]....

Journal ArticleDOI
TL;DR: A new approach to speech recognition, in which all Hidden Markov Model states share the same Gaussian Mixture Model (GMM) structure with the same number of Gaussians in each state, appears to give better results than a conventional model.

304 citations


"Subspace modeling technique using m..." refers background in this paper

  • ...But SGMM uses an exponential transform on v_j to compute the component priors.... (a sketch of this weight computation follows this list)

  • ...SGMM goes one step further and tries to model the correlations among the parameters of GMMs of various tied-states....

  • ...The proposed Subspace Model and SGMM are both acoustic modeling techniques....

  • ...Though the Subspace Model in a theoretical sense is closer to CAT, implementation-wise it borrows a few concepts from SGMM....

  • ...While SGMM exploits the correlations among GMM parameters of tied-states, CSM strives to transform a canonical model to a context-dependent state....

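The first snippet above notes that SGMM derives the mixture weights of state j from its state vector v_j via an exponential (log-linear) transform rather than treating them as free parameters. A minimal sketch of that computation, with hypothetical variable names:

```python
import numpy as np

def sgmm_weights(W: np.ndarray, v_j: np.ndarray) -> np.ndarray:
    """W: (I, S) matrix whose rows are weight-projection vectors w_i;
    v_j: (S,) state-specific vector. Returns the I mixture weights
    w_{ji} = exp(w_i . v_j) / sum_{i'} exp(w_{i'} . v_j)  (a softmax)."""
    logits = W @ v_j
    logits -= logits.max()          # subtract max for numerical stability
    w = np.exp(logits)
    return w / w.sum()
```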
Journal ArticleDOI
M.J.F. Gales
TL;DR: This paper examines an adaptation scheme requiring very few parameters, cluster adaptive training (CAT), which may be viewed as a simple extension to speaker clustering: rather than selecting a single cluster, a linear interpolation of all the cluster means is used as the mean of a particular speaker.
Abstract: When performing speaker adaptation, there are two conflicting requirements. First, the speaker transform must be powerful enough to represent the speaker. Second, the transform must be quickly and easily estimated for any particular speaker. The most popular adaptation schemes have used many parameters to adapt the models to be representative of an individual speaker. This limits how rapidly the models may be adapted to a new speaker or the acoustic environment. This paper examines an adaptation scheme requiring very few parameters, cluster adaptive training (CAT). CAT may be viewed as a simple extension to speaker clustering. Rather than selecting a single cluster as representative of a particular speaker, a linear interpolation of all the cluster means is used as the mean of the particular speaker. This scheme naturally falls into an adaptive training framework. Maximum likelihood estimates of the interpolation weights are given. Furthermore, simple re-estimation formulae for cluster means, represented both explicitly and by sets of transforms of some canonical mean, are given. On a speaker-independent task CAT reduced the word error rate using very little adaptation data. In addition when combined with other adaptation schemes it gave a 5% reduction in word error rate over adapting a speaker-independent model set.
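A minimal sketch of the interpolation described above, under simplifying assumptions (single Gaussian, identity covariance): the adapted speaker mean is a linear interpolation of the cluster means, and the interpolation weights have a closed-form maximum likelihood solution.

```python
import numpy as np

def cat_adapt(cluster_means: np.ndarray, speaker_frames: np.ndarray):
    """cluster_means: (D, P) matrix M, one column per cluster mean.
    speaker_frames: (N, D) adaptation data from one speaker.
    Returns the ML interpolation weights w and the adapted mean M @ w."""
    x_bar = speaker_frames.mean(axis=0)            # sufficient statistic
    # Minimize sum_n ||x_n - M w||^2 over w (normal equations)
    w, *_ = np.linalg.lstsq(cluster_means, x_bar, rcond=None)
    return w, cluster_means @ w

# Example: three clusters in two dimensions, a few adaptation frames
M = np.array([[0.0, 1.0, 0.5],
              [0.0, 0.0, 1.0]])
frames = np.array([[0.6, 0.4], [0.5, 0.6], [0.4, 0.5]])
weights, adapted_mean = cat_adapt(M, frames)
```

Because only the P interpolation weights are estimated per speaker, very little adaptation data suffices, which is the rapid-adaptation property the abstract highlights.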

293 citations


"Subspace modeling technique using m..." refers methods in this paper

  • ...Both these methods have similarities to the Cluster Adaptive Training (CAT) [3], a speaker adaptation technique....

Proceedings ArticleDOI
07 May 1996
TL;DR: A speaker adaptation strategy is described that is based on finding a subset of speakers, from the training set, who are acoustically close to the test speaker, and using only the data from these speakers (rather than the complete training corpus) to re-estimate the system parameters.
Abstract: A speaker adaptation strategy is described that is based on finding a subset of speakers, from the training set, who are acoustically close to the test speaker, and using only the data from these speakers (rather than the complete training corpus) to re-estimate the system parameters. Further, a linear transformation is computed for every one of the selected training speakers to better map the training speaker's data to the test speaker's acoustic space. Finally, the system parameters (Gaussian means) are re-estimated specifically for the test speaker using the transformed data from the selected training speakers. Experiments showed that this scheme is capable of reducing the error rate by 10-15% with the use of as little as 3 sentences of adaptation data.
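A hypothetical sketch of the selection and re-estimation steps described above; real systems use likelihood-based speaker distances and per-speaker linear transforms rather than the Euclidean shortcut shown here.

```python
import numpy as np

def select_close_speakers(test_vec: np.ndarray, train_vecs: np.ndarray, k: int = 5):
    """train_vecs: (S, D) one acoustic summary vector per training speaker;
    test_vec: (D,) the test speaker's summary. Returns indices of the k
    training speakers acoustically closest to the test speaker."""
    dists = np.linalg.norm(train_vecs - test_vec, axis=1)
    return np.argsort(dists)[:k]

def reestimate_mean(selected_data):
    """selected_data: list of (N_i, D) frame arrays from the chosen speakers
    (after any per-speaker mapping transform). Pools them and returns the
    ML Gaussian mean for the test speaker."""
    return np.vstack(selected_data).mean(axis=0)
```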

34 citations


"Subspace modeling technique using m..." refers methods in this paper

  • ...This ensures a form of soft clustering rather than the hard clustering advocated by Speaker Clustering [5]....

Proceedings Article
30 Sep 2010
TL;DR: A general class of model is described in which the context-dependent state parameters are a transformed version of one, or more, canonical states, along with a set of preliminary experiments illustrating some of this model's properties using CMLLR transformations from the canonical state to the context-dependent state.
Abstract: Current speech recognition systems are often based on HMMs with state-clustered Gaussian Mixture Models (GMMs) to represent the context-dependent output distributions. Though highly successful, the standard form of model does not exploit any relationships between the states; each state has its own separate model parameters. This paper describes a general class of model where the context-dependent state parameters are a transformed version of one, or more, canonical states. A number of published models sit within this framework, including semi-continuous HMMs, subspace GMMs and the HMM error model. A set of preliminary experiments illustrating some of this model's properties, using CMLLR transformations from the canonical state to the context-dependent state, is described.
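An illustrative sketch of the canonical-state idea summarized above: each context-dependent state's Gaussian is a linear transform of a shared canonical state, so states no longer carry fully separate parameters. CMLLR formally operates in feature space; the equivalent model-space view is shown here as a simplifying assumption.

```python
import numpy as np

def context_dependent_gaussian(mu_canon: np.ndarray, Sigma_canon: np.ndarray,
                               A: np.ndarray, b: np.ndarray):
    """Derive one context-dependent state's Gaussian from the canonical
    state via an affine transform: mu_cd = A mu + b, Sigma_cd = A Sigma A^T.
    mu_canon: (D,), Sigma_canon: (D, D), A: (D, D), b: (D,)."""
    return A @ mu_canon + b, A @ Sigma_canon @ A.T
```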

26 citations


"Subspace modeling technique using m..." refers background in this paper

  • ...While SGMM exploits the correlations among GMM parameters of tied-states, CSM strives to transform a canonical model to a context-dependent state....

  • ...Subspace Gaussian Mixture Models (SGMM) [6] and Canonical State Models (CSM) [2] are two acoustic modeling techniques which exploit the relationships between the context-dependent phone models (or triphones)....
