Proceedings ArticleDOI

Lexicon-building methods for an acoustic sub-word based speech recognizer

03 Apr 1990-Vol. 1990, pp 729-732
TL;DR: The use of an acoustic subword unit (ASWU)-based speech recognition system for the recognition of isolated words is discussed and it is shown that the use of a modified k-means algorithm on the likelihoods derived through the Viterbi algorithm provides the best deterministic-type of word lexicon.
Abstract: The use of an acoustic subword unit (ASWU)-based speech recognition system for the recognition of isolated words is discussed. Some methods are proposed for generating the deterministic and the statistical types of word lexicon. It is shown that the use of a modified k-means algorithm on the likelihoods derived through the Viterbi algorithm provides the best deterministic-type of word lexicon. However, the ASWU-based speech recognizer leads to better performance with the statistical type of word lexicon than with the deterministic type. Improving the design of the word lexicon makes it possible to narrow the gap in the recognition performances of the whole word unit (WWU)-based and the ASWU-based speech recognizers considerably. Further improvements are expected by designing the word lexicon better.
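The clustering step in the abstract can be illustrated with a toy sketch. Note the simplification: the paper clusters on likelihoods derived through the Viterbi algorithm, whereas this sketch clusters plain segment feature vectors under Euclidean distance; the function name and shapes are illustrative, not the authors'.

```python
import numpy as np

def kmeans_lexicon(segments, n_units, n_iter=20, seed=0):
    """Cluster fixed-length acoustic segment vectors into subword units.

    A toy stand-in for the modified k-means step: plain Euclidean
    k-means on segment vectors, not the Viterbi-likelihood clustering
    used in the paper.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(segments, dtype=float)
    # Initialise centroids from randomly chosen segments.
    centroids = X[rng.choice(len(X), n_units, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assign each segment to its nearest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Re-estimate each centroid as the mean of its cluster.
        for k in range(n_units):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return centroids, labels
```

A deterministic-type lexicon would then map each vocabulary word to the sequence of unit labels its training segments fall into.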


Citations
01 Jan 1999
TL;DR: Problems with the phoneme as the basic subword unit in speech recognition are raised, suggesting that finer-grained control is needed to capture the sort of pronunciation variability observed in spontaneous speech.
Abstract: The notion that a word is composed of a sequence of phone segments, sometimes referred to as ‘beads on a string’, has formed the basis of most speech recognition work for over 15 years. However, as more researchers tackle spontaneous speech recognition tasks, that view is being called into question. This paper raises problems with the phoneme as the basic subword unit in speech recognition, suggesting that finer-grained control is needed to capture the sort of pronunciation variability observed in spontaneous speech. We offer two different alternatives – automatically derived subword units and linguistically motivated distinctive feature systems – and discuss current work in these directions. In addition, we look at problems that arise in acoustic modeling when trying to incorporate higher-level structure with these two strategies.

151 citations


Cites background from "Lexicon-building methods for an aco..."

  • ...ASWUs were proposed several years ago [10, 11, 12, 13], but they faded from view as speaker-independent recognition became the primary goal, because of the difficulty of distinguishing speaker variability from real pronunciation differences....


Journal ArticleDOI
TL;DR: A method for combining phonetic and fenonic models is presented and results of experiments with speaker-dependent and speaker-independent models on several isolated-word recognition tasks are reported.
Abstract: A technique for constructing Markov models for the acoustic representation of words is described. Word models are constructed from models of subword units called fenones. Fenones represent very short speech events and are obtained automatically through the use of a vector quantizer. The fenonic baseform for a word-i.e., the sequence of fenones used to represent the word-is derived automatically from one or more utterances of that word. Since the word models are all composed from a small inventory of subword models, training for large-vocabulary speech recognition systems can be accomplished with a small training script. A method for combining phonetic and fenonic models is presented. Results of experiments with speaker-dependent and speaker-independent models on several isolated-word recognition tasks are reported. The results are compared with those for phonetics-based Markov models and template-based dynamic programming (DP) matching.
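The vector-quantization step the abstract describes — turning frames of an utterance into a string of codeword labels from which a fenonic baseform is built — can be sketched minimally as follows; the function and variable names are illustrative, not IBM's.

```python
import numpy as np

def fenonic_labels(frames, codebook):
    """Map each acoustic frame to its nearest codeword index.

    A minimal sketch of the vector-quantization step only; deriving
    the actual fenonic baseform from one or more such label strings
    involves further modelling not shown here.
    """
    frames = np.asarray(frames, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    # Euclidean distance from every frame to every codeword.
    d = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
    return d.argmin(axis=1).tolist()
```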

67 citations

Journal ArticleDOI
TL;DR: A joint solution to the related problems of learning a unit inventory and a corresponding lexicon from data; on a speaker-independent read-speech task with a 1k vocabulary, the proposed algorithm outperforms phone-based systems at both high and low complexities.

66 citations


Cites background or methods from "Lexicon-building methods for an aco..."

  • ...Taking an approach similar to that in (Svendsen and Soong, 1987; Paliwal, 1990), the maximum likelihood segmentation of the training data is found by the use of dynamic programming....



  • ...Cluster centroids therefore directly represent unit models and clustering addresses both the inventory and model design problems, whereas in (Svendsen et al., 1989; Paliwal, 1990; Holter and Svendsen, 1997a) unit model parameters had to be estimated in a separate step from the data partition defined by clustering....


  • ...The clustering algorithm used here differs from that used in (Svendsen et al., 1989; Paliwal, 1990; Holter and Svendsen, 1997a) in that maximum likelihood is used as an objective rather than minimum Euclidean distance....


  • ...The related problem of defining a lexicon in terms of these ASWUs has also received attention (e.g., Paliwal, 1990; Svendsen et al., 1995)....


Journal ArticleDOI
TL;DR: A maximum-likelihood-based algorithm for fully automatic, data-driven modelling of pronunciation, given a set of subword hidden Markov models (HMMs) and acoustic tokens of a word, creating a consistent framework for optimisation of automatic speech recognition systems.

63 citations

Proceedings Article
01 Oct 2013
TL;DR: An unsupervised alternative ‐ requiring no language-specific knowledge ‐ to the conventional manual approach for creating pronunciation dictionaries is proposed, which jointly discovers the phonetic inventory and the Letter-to-Sound mapping rules in a language using only transcribed data.
Abstract: The creation of a pronunciation lexicon remains the most inefficient process in developing an Automatic Speech Recognizer (ASR). In this paper, we propose an unsupervised alternative ‐ requiring no language-specific knowledge ‐ to the conventional manual approach for creating pronunciation dictionaries. We present a hierarchical Bayesian model, which jointly discovers the phonetic inventory and the Letter-to-Sound (L2S) mapping rules in a language using only transcribed data. When tested on a corpus of spontaneous queries, the results demonstrate the superiority of the proposed joint learning scheme over its sequential counterpart, in which the latent phonetic inventory and L2S mappings are learned separately. Furthermore, the recognizers built with the automatically induced lexicon consistently outperform grapheme-based recognizers and even approach the performance of recognition systems trained using conventional supervised procedures.

41 citations


Additional excerpts

  • ...Various algorithms for learning sub-word based pronunciations were proposed in (Lee et al., 1988; Fukada et al., 1996; Bacchiani and Ostendorf, 1999; Paliwal, 1990)....


References
Journal ArticleDOI
Lawrence R. Rabiner1
01 Feb 1989
TL;DR: In this paper, the authors provide an overview of the basic theory of hidden Markov models (HMMs) as originated by L.E. Baum and T. Petrie (1966) and give practical details on methods of implementation of the theory along with a description of selected applications of HMMs to distinct problems in speech recognition.
Abstract: This tutorial provides an overview of the basic theory of hidden Markov models (HMMs) as originated by L.E. Baum and T. Petrie (1966) and gives practical details on methods of implementation of the theory along with a description of selected applications of the theory to distinct problems in speech recognition. Results from a number of original sources are combined to provide a single source of acquiring the background required to pursue further this area of research. The author first reviews the theory of discrete Markov chains and shows how the concept of hidden states, where the observation is a probabilistic function of the state, can be used effectively. The theory is illustrated with two simple examples, namely coin-tossing, and the classic balls-in-urns system. Three fundamental problems of HMMs are noted and several practical techniques for solving these problems are given. The various types of HMMs that have been studied, including ergodic as well as left-right models, are described.
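One of the fundamental HMM problems the tutorial treats — finding the single best state sequence for an observation string — is solved by the Viterbi algorithm, which also produces the Viterbi likelihoods used for segmentation in the main paper. A minimal log-domain sketch for a discrete-output HMM (argument names are illustrative):

```python
import numpy as np

def viterbi(obs, log_A, log_B, log_pi):
    """Most-likely state path through a discrete-output HMM (log domain).

    obs    : sequence of observation symbol indices
    log_A  : log transition matrix, log_A[i, j] = log P(state j | state i)
    log_B  : log emission matrix,   log_B[i, k] = log P(symbol k | state i)
    log_pi : log initial state distribution
    """
    T, N = len(obs), len(log_pi)
    delta = log_pi + log_B[:, obs[0]]     # best log score ending in each state
    psi = np.zeros((T, N), dtype=int)     # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_A   # scores[i, j]: come from i, go to j
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[:, obs[t]]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):         # trace backpointers
        path.append(int(psi[t, path[-1]]))
    return path[::-1], float(delta.max())
```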

21,819 citations

Journal ArticleDOI
TL;DR: An efficient and intuitive algorithm is presented for the design of vector quantizers based either on a known probabilistic model or on a long training sequence of data.
Abstract: An efficient and intuitive algorithm is presented for the design of vector quantizers based either on a known probabilistic model or on a long training sequence of data. The basic properties of the algorithm are discussed and demonstrated by examples. Quite general distortion measures and long blocklengths are allowed, as exemplified by the design of parameter vector quantizers of ten-dimensional vectors arising in Linear Predictive Coded (LPC) speech compression with a complicated distortion measure arising in LPC analysis that does not depend only on the error vector.
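The design loop this abstract describes — grow a codebook by splitting codewords, then refine with nearest-neighbour/centroid iterations — can be sketched as below. This is a simplified version: it uses an additive perturbation and squared-error distortion, whereas the paper allows quite general distortion measures, and it assumes the target codebook size is a power of two.

```python
import numpy as np

def design_codebook(X, n_codewords, eps=0.01, n_iter=10):
    """Design a VQ codebook by repeated centroid splitting and refinement.

    Simplified sketch: additive split perturbation, Euclidean distortion.
    n_codewords should be a power of two (size doubles on each split).
    """
    X = np.asarray(X, dtype=float)
    codebook = X.mean(axis=0, keepdims=True)      # start from the global centroid
    while len(codebook) < n_codewords:
        # Split every codeword into a perturbed pair.
        codebook = np.vstack([codebook + eps, codebook - eps])
        for _ in range(n_iter):                   # Lloyd-style refinement
            d = np.linalg.norm(X[:, None, :] - codebook[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            for k in range(len(codebook)):
                if np.any(labels == k):
                    codebook[k] = X[labels == k].mean(axis=0)
    return codebook
```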

7,935 citations

Journal ArticleDOI
TL;DR: This paper describes a number of statistical models for use in speech recognition, with special attention to determining the parameters for such models from sparse data, and describes two decoding methods appropriate for constrained artificial languages and one appropriate for more realistic decoding tasks.
Abstract: Speech recognition is formulated as a problem of maximum likelihood decoding. This formulation requires statistical models of the speech production process. In this paper, we describe a number of statistical models for use in speech recognition. We give special attention to determining the parameters for such models from sparse data. We also describe two decoding methods, one appropriate for constrained artificial languages and one appropriate for more realistic decoding tasks. To illustrate the usefulness of the methods described, we review a number of decoding results that have been obtained with them.

1,637 citations

Proceedings ArticleDOI
11 Apr 1988
TL;DR: An automatic technique for constructing Markov word models is described and results are included of experiments with speaker-dependent and speaker-independent models on several isolated-word recognition tasks.
Abstract: The Speech Recognition Group at IBM Research has developed a real-time, isolated-word speech recognizer called Tangora, which accepts natural English sentences drawn from a vocabulary of 20000 words. Despite its large vocabulary, the Tangora recognizer requires only about 20 minutes of speech from each new user for training purposes. The accuracy of the system and its ease of training are largely attributable to the use of hidden Markov models in its acoustic match component. An automatic technique for constructing Markov word models is described and results are included of experiments with speaker-dependent and speaker-independent models on several isolated-word recognition tasks.

245 citations

Journal ArticleDOI
TL;DR: A clustering algorithm based on a standard K-means approach which requires no user parameter specification is presented and experimental data show that this new algorithm performs as well or better than the previously used clustering techniques when tested as part of a speaker-independent isolated word recognition system.
Abstract: Studies of isolated word recognition systems have shown that a set of carefully chosen templates can be used to bring the performance of speaker-independent systems up to that of systems trained to the individual speaker. The earliest work in this area used a sophisticated set of pattern recognition algorithms in a human-interactive mode to create the set of templates (multiple patterns) for each word in the vocabulary. Not only was this procedure time consuming but it was impossible to reproduce exactly because it was highly dependent on decisions made by the experimenter. Subsequent work led to an automatic clustering procedure which, given only a set of clustering parameters, clustered patterns with the same performance as the previously developed supervised algorithms. The one drawback of the automatic procedure was that the specification of the input parameter set was found to be somewhat dependent on the vocabulary type and size of population to be clustered. Since a naive user of such a statistical clustering algorithm could not be expected, in general, to know how to choose the word clustering parameters, even this automatic clustering algorithm was not appropriate for a completely general word recognition system. It is the purpose of this paper to present a clustering algorithm based on a standard K-means approach which requires no user parameter specification. Experimental data show that this new algorithm performs as well or better than the previously used clustering techniques when tested as part of a speaker-independent isolated word recognition system.

218 citations