Proceedings ArticleDOI

Lexicon-building methods for an acoustic sub-word based speech recognizer

03 Apr 1990, Vol. 1990, pp. 729–732
TL;DR: The use of an acoustic subword unit (ASWU)-based speech recognition system for the recognition of isolated words is discussed and it is shown that the use of a modified k-means algorithm on the likelihoods derived through the Viterbi algorithm provides the best deterministic-type of word lexicon.
Abstract: The use of an acoustic subword unit (ASWU)-based speech recognition system for the recognition of isolated words is discussed. Some methods are proposed for generating the deterministic and the statistical types of word lexicon. It is shown that the use of a modified k-means algorithm on the likelihoods derived through the Viterbi algorithm provides the best deterministic-type of word lexicon. However, the ASWU-based speech recognizer leads to better performance with the statistical type of word lexicon than with the deterministic type. Improving the design of the word lexicon makes it possible to narrow the gap in the recognition performances of the whole word unit (WWU)-based and the ASWU-based speech recognizers considerably. Further improvements are expected by designing the word lexicon better.
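The paper's modified k-means operates on Viterbi-derived likelihoods and is not reproduced here; as a hedged stand-in, the sketch below clusters variable-length acoustic segments by first reducing each segment to its mean frame (the centroid-per-segment representation one citing dissertation attributes to this line of work) and then applying plain k-means with a deterministic farthest-point initialisation. All function names are illustrative.

```python
import numpy as np

def segment_centroids(segments):
    """Represent each variable-length segment (frames x dims) by its mean frame."""
    return np.array([seg.mean(axis=0) for seg in segments])

def kmeans(points, k, iters=20):
    """Lloyd-style k-means with deterministic farthest-point initialisation."""
    # Greedy farthest-point seeding: start from the first point, then
    # repeatedly pick the point farthest from all chosen centroids.
    centroids = [points[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centroids], axis=0)
        centroids.append(points[int(d.argmax())])
    centroids = np.array(centroids, dtype=float)
    for _ in range(iters):
        # Assign each point to its nearest centroid, then recompute means.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return centroids, labels
```

Representing a whole segment by one centroid frame discards temporal structure, which is one reason the paper's likelihood-based lexicon construction can do better than such a purely geometric clustering.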


Citations
18 Nov 2013

Cites methods from "Lexicon-building methods for an aco..."

  • ...Various methods for the development of word and phoneme lexicon using these ASM have been discussed in [14],[15] and [16]....


Posted Content
TL;DR: This work proposes the integration of a simple space management strategy into the iterative process, and shows experimentally that this leads to no loss in performance in terms of F-measure while guaranteeing that a threshold space complexity is not breached.
Abstract: Agglomerative hierarchical clustering (AHC) requires only the similarity between objects to be known. This is attractive when clustering signals of varying length, such as speech, which are not readily represented in fixed-dimensional vector space. However, AHC is characterised by $O(N^2)$ space and time complexity, making it infeasible for partitioning large datasets. This has recently been addressed by an approach based on the iterative re-clustering of independent subsets of the larger dataset. We show that, due to its iterative nature, this procedure can sometimes lead to unchecked growth of individual subsets, thereby compromising its effectiveness. We propose the integration of a simple space management strategy into the iterative process, and show experimentally that this leads to no loss in performance in terms of F-measure while guaranteeing that a threshold space complexity is not breached.
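The paper's full iterative re-clustering with space management is not reproduced here; the sketch below only illustrates the two ingredients the abstract names: a naive average-link AHC with its O(N^2) pairwise cost, and a simplified one-pass variant in which no subset handed to AHC ever exceeds a size threshold. All names are hypothetical.

```python
import numpy as np

def ahc(points, merge_thresh):
    """Naive average-link agglomerative clustering: O(N^2) space/time.
    Returns a list of clusters, each a list of point indices."""
    clusters = [[i] for i in range(len(points))]
    def avg_dist(a, b):
        return np.mean([np.linalg.norm(points[i] - points[j]) for i in a for j in b])
    while len(clusters) > 1:
        # Find the closest pair of clusters; stop once the best merge
        # would exceed the distance threshold.
        d, a, b = min((avg_dist(clusters[a], clusters[b]), a, b)
                      for a in range(len(clusters))
                      for b in range(a + 1, len(clusters)))
        if d > merge_thresh:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

def capped_subset_ahc(points, merge_thresh, max_subset):
    """Cluster independent subsets whose size never exceeds max_subset,
    bounding the quadratic cost of each AHC call."""
    result = []
    for start in range(0, len(points), max_subset):
        idx = np.arange(start, min(start + max_subset, len(points)))
        for cl in ahc(points[idx], merge_thresh):
            result.append([int(idx[i]) for i in cl])
    return result
```

The subset cap is exactly the kind of space-management guarantee the abstract argues for: quality may suffer at subset boundaries, but the working-set size is bounded by construction.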

Cites methods from "Lexicon-building methods for an aco..."

  • ...The clustering of acoustic speech segments was also investigated by Paliwal [13]....


Proceedings ArticleDOI
01 Dec 2016
TL;DR: It is demonstrated that a speech recognition system using these discovered resources can approach the performance of a speech recognizer trained using resources developed by experts.
Abstract: State of the art speech recognition systems use context-dependent phonemes as acoustic units. However, these approaches do not work well for low resourced languages where large amounts of training data or resources such as a lexicon are not available. For such languages, automatic discovery of acoustic units can be important. In this paper, we demonstrate the application of nonparametric Bayesian models to acoustic unit discovery. We show that the discovered units are linguistically meaningful. We also present a semi-supervised learning algorithm that uses a nonparametric Bayesian model to learn a mapping between words and acoustic units. We demonstrate that a speech recognition system using these discovered resources can approach the performance of a speech recognizer trained using resources developed by experts. We show that unsupervised discovery of acoustic units combined with semi-supervised discovery of the lexicon achieved performance (9.8% WER) comparable to other published high complexity systems. This nonparametric approach enables the rapid development of speech recognition systems in low resourced languages.

Cites methods from "Lexicon-building methods for an aco..."

  • ...Paliwal [5] also proposed several methods to discover a lexicon for isolated word speech recognition applications....


  • ...Most approaches to automatic discovery of acoustic units [3]-[5] do this in two steps: segmentation and clustering....


Proceedings Article
01 Jan 2000
TL;DR: In this work, new constraints on the units were introduced: 1) they should contain sufficient statistics of the features and 2) they should contain sufficient statistics of the vocabulary.
Abstract: The choice of units, sub-word or whole-word, is generally based on the size of the vocabulary and the amount of training data. In this work, we have introduced new constraints on the units: 1) they should contain sufficient statistics of the features and 2) they should contain sufficient statistics of the vocabulary. This led to minimization of two cost functions, the first based on the confusion between the features and the units and the second based on the confusion between the units and the words. We minimized the first cost function by forming broad phone classes that were less confusing among themselves than the phones. The second cost function was minimized by coding the word-specific phone sequences. On the continuous digit recognition task, the broad classes performed worse than the phones. The word-specific phone sequences however significantly improved the performance over both the phones and the whole-word units. In this paper we discuss the new constraints, our specific implementation of the cost functions, and the corresponding recognition performance.

Cites background from "Lexicon-building methods for an aco..."

  • ...In general, the choice of units is based on the size of the vocabulary and the amount of training data [2]....


Dissertation
01 Jan 2008
TL;DR: This thesis presents a method for the automatic derivation of a sub-word unit inventory, whose main components are an ergodic hidden Markov model whose complexity is controlled using the Bayesian Information Criterion and an automatic generation of probabilistic dictionaries using joint multigrams.
Abstract: Current automatic speech recognition (ASR) research is focused on recognition of continuous, spontaneous speech. Spontaneous speech contains a lot of variability in the way words are pronounced, and canonical pronunciations of each word are not true to the variation that is seen in real data. Two of the components of an ASR system are acoustic models and pronunciation models. The variation within spontaneous speech must be accounted for by these components. Phones, or context-dependent phones are typically used as the base sub-word unit, and one acoustic model is trained for each sub-word unit. Pronunciation modelling largely takes place in a dictionary, which relates words to sequences of phones. Acoustic modelling and pronunciation modelling overlap, and the two are not clearly separable in modelling pronunciation variation. Techniques that find pronunciation variants in the data and then reflect these in the dictionary have not provided expected gains in recognition. An alternative approach to modelling pronunciations in terms of phones is to derive units automatically: using data-driven methods to determine an inventory of sub-word units, their acoustic models, and their relationship to words. This thesis presents a method for the automatic derivation of a sub-word unit inventory, whose main components are:

  • automatic and simultaneous generation of a sub-word unit inventory and acoustic model set, using an ergodic hidden Markov model whose complexity is controlled using the Bayesian Information Criterion

  • automatic generation of probabilistic dictionaries using joint multigrams

The prerequisites of this approach are fewer than in previous work on unit derivation; notably, the timings of word boundaries are not required here. The approach is language independent since it is entirely data-driven and no linguistic information is required. The dictionary generation method outperforms a supervised method using phonetic data. The automatically derived units and dictionary perform reasonably on a small spontaneous speech task, although not yet outperforming phones.

Cites background or methods from "Lexicon-building methods for an aco..."

  • ...In Paliwal (1990), the lexicon generation is extended to a probabilistic form....


  • ...Paliwal (1990) used k-means to cluster his acoustic segments, by first finding the centroid of each segment....


  • ...This method is used in the segmentation step of Paliwal’s work (see Paliwal 1990), and also in Fukada et al. (1996)....


References
Journal ArticleDOI
Lawrence R. Rabiner
01 Feb 1989
TL;DR: In this paper, the authors provide an overview of the basic theory of hidden Markov models (HMMs) as originated by L.E. Baum and T. Petrie (1966) and give practical details on methods of implementation of the theory along with a description of selected applications of HMMs to distinct problems in speech recognition.
Abstract: This tutorial provides an overview of the basic theory of hidden Markov models (HMMs) as originated by L.E. Baum and T. Petrie (1966) and gives practical details on methods of implementation of the theory along with a description of selected applications of the theory to distinct problems in speech recognition. Results from a number of original sources are combined to provide a single source of acquiring the background required to pursue further this area of research. The author first reviews the theory of discrete Markov chains and shows how the concept of hidden states, where the observation is a probabilistic function of the state, can be used effectively. The theory is illustrated with two simple examples, namely coin-tossing, and the classic balls-in-urns system. Three fundamental problems of HMMs are noted and several practical techniques for solving these problems are given. The various types of HMMs that have been studied, including ergodic as well as left-right models, are described.
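One of the three fundamental HMM problems the tutorial treats is decoding: finding the single most likely state sequence for an observation sequence. A minimal log-domain Viterbi sketch for a discrete-observation HMM (variable names are our own, not the tutorial's notation):

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely state path for a discrete-observation HMM, in log domain.
    pi: (S,) initial probs; A: (S,S) transition probs; B: (S,V) emission probs."""
    S, T = len(pi), len(obs)
    logd = np.log(pi) + np.log(B[:, obs[0]])      # best log-prob ending in each state
    back = np.zeros((T, S), dtype=int)            # back[t, j]: best predecessor of j
    for t in range(1, T):
        scores = logd[:, None] + np.log(A)        # scores[i, j]: reach j via i
        back[t] = scores.argmax(axis=0)
        logd = scores.max(axis=0) + np.log(B[:, obs[t]])
    # Backtrack from the best final state.
    path = [int(logd.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(logd.max())
```

Working in log probabilities avoids the numerical underflow that the tutorial's scaling techniques address for the forward-backward recursions.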

21,819 citations

Journal ArticleDOI
TL;DR: An efficient and intuitive algorithm is presented for the design of vector quantizers based either on a known probabilistic model or on a long training sequence of data.
Abstract: An efficient and intuitive algorithm is presented for the design of vector quantizers based either on a known probabilistic model or on a long training sequence of data. The basic properties of the algorithm are discussed and demonstrated by examples. Quite general distortion measures and long blocklengths are allowed, as exemplified by the design of parameter vector quantizers of ten-dimensional vectors arising in Linear Predictive Coded (LPC) speech compression with a complicated distortion measure arising in LPC analysis that does not depend only on the error vector.
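This is the design procedure commonly known as the LBG algorithm. As a hedged sketch restricted to squared-error distortion (the paper's general distortion measures and the LPC-specific measure are not reproduced), the following shows the split-then-Lloyd structure:

```python
import numpy as np

def lbg(data, codebook_size, eps=0.01, iters=20):
    """Codebook design by centroid splitting plus Lloyd iterations
    (squared-error distortion only)."""
    # Start from the global centroid as a one-word codebook.
    codebook = data.mean(axis=0, keepdims=True)
    while len(codebook) < codebook_size:
        # Split each codeword into a perturbed pair, doubling the codebook.
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        for _ in range(iters):
            # Nearest-neighbour assignment, then centroid update.
            d = np.linalg.norm(data[:, None] - codebook[None], axis=2)
            labels = d.argmin(axis=1)
            for j in range(len(codebook)):
                if np.any(labels == j):
                    codebook[j] = data[labels == j].mean(axis=0)
    return codebook
```

The doubling-by-splitting schedule is what lets the method grow a codebook from a training sequence without a probabilistic model, which is the "long training sequence" mode the abstract describes.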

7,935 citations

Journal ArticleDOI
TL;DR: This paper describes a number of statistical models for use in speech recognition, with special attention to determining the parameters for such models from sparse data, and describes two decoding methods appropriate for constrained artificial languages and one appropriate for more realistic decoding tasks.
Abstract: Speech recognition is formulated as a problem of maximum likelihood decoding. This formulation requires statistical models of the speech production process. In this paper, we describe a number of statistical models for use in speech recognition. We give special attention to determining the parameters for such models from sparse data. We also describe two decoding methods, one appropriate for constrained artificial languages and one appropriate for more realistic decoding tasks. To illustrate the usefulness of the methods described, we review a number of decoding results that have been obtained with them.

1,637 citations

Proceedings ArticleDOI
11 Apr 1988
TL;DR: An automatic technique for constructing Markov word models is described and results are included of experiments with speaker-dependent and speaker-independent models on several isolated-word recognition tasks.
Abstract: The Speech Recognition Group at IBM Research has developed a real-time, isolated-word speech recognizer called Tangora, which accepts natural English sentences drawn from a vocabulary of 20000 words. Despite its large vocabulary, the Tangora recognizer requires only about 20 minutes of speech from each new user for training purposes. The accuracy of the system and its ease of training are largely attributable to the use of hidden Markov models in its acoustic match component. An automatic technique for constructing Markov word models is described and results are included of experiments with speaker-dependent and speaker-independent models on several isolated-word recognition tasks.

245 citations

Journal ArticleDOI
TL;DR: A clustering algorithm based on a standard K-means approach which requires no user parameter specification is presented and experimental data show that this new algorithm performs as well or better than the previously used clustering techniques when tested as part of a speaker-independent isolated word recognition system.
Abstract: Studies of isolated word recognition systems have shown that a set of carefully chosen templates can be used to bring the performance of speaker-independent systems up to that of systems trained to the individual speaker. The earliest work in this area used a sophisticated set of pattern recognition algorithms in a human-interactive mode to create the set of templates (multiple patterns) for each word in the vocabulary. Not only was this procedure time consuming but it was impossible to reproduce exactly because it was highly dependent on decisions made by the experimenter. Subsequent work led to an automatic clustering procedure which, given only a set of clustering parameters, clustered patterns with the same performance as the previously developed supervised algorithms. The one drawback of the automatic procedure was that the specification of the input parameter set was found to be somewhat dependent on the vocabulary type and size of population to be clustered. Since a naive user of such a statistical clustering algorithm could not be expected, in general, to know how to choose the word clustering parameters, even this automatic clustering algorithm was not appropriate for a completely general word recognition system. It is the purpose of this paper to present a clustering algorithm based on a standard K-means approach which requires no user parameter specification. Experimental data show that this new algorithm performs as well or better than the previously used clustering techniques when tested as part of a speaker-independent isolated word recognition system.

218 citations