Proceedings ArticleDOI

Lexicon-building methods for an acoustic sub-word based speech recognizer

03 Apr 1990-Vol. 1990, pp 729-732
TL;DR: The use of an acoustic subword unit (ASWU)-based speech recognition system for the recognition of isolated words is discussed and it is shown that the use of a modified k-means algorithm on the likelihoods derived through the Viterbi algorithm provides the best deterministic-type of word lexicon.
Abstract: The use of an acoustic subword unit (ASWU)-based speech recognition system for the recognition of isolated words is discussed. Some methods are proposed for generating the deterministic and the statistical types of word lexicon. It is shown that the use of a modified k-means algorithm on the likelihoods derived through the Viterbi algorithm provides the best deterministic-type of word lexicon. However, the ASWU-based speech recognizer leads to better performance with the statistical type of word lexicon than with the deterministic type. Improving the design of the word lexicon makes it possible to narrow the gap in the recognition performances of the whole word unit (WWU)-based and the ASWU-based speech recognizers considerably. Further improvements are expected by designing the word lexicon better.
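The modified k-means selection step can be illustrated with a toy sketch. Assume we already have a matrix of Viterbi log-likelihoods scoring each training token of a word against each candidate ASWU transcription; picking the candidate that maximises total likelihood is a single-cluster analogue of the paper's modified k-means (the matrix layout and the name `deterministic_lexicon_entry` are illustrative, not from the paper):

```python
import numpy as np

def deterministic_lexicon_entry(loglik, candidates):
    """Pick one ASWU transcription per word as a k-means-style centroid.

    loglik[i, j] = Viterbi log-likelihood of training token i under
    candidate transcription j.  The centroid is the candidate that
    maximises total log-likelihood over all tokens (a simplified,
    single-cluster analogue of the modified k-means step).
    """
    totals = loglik.sum(axis=0)
    return candidates[int(np.argmax(totals))]

# toy example: 3 training tokens scored against 2 candidate transcriptions
loglik = np.array([[-10.0, -12.0],
                   [-11.0, -13.5],
                   [ -9.5, -11.0]])
print(deterministic_lexicon_entry(loglik, ["u3 u7 u1", "u3 u2 u1"]))  # u3 u7 u1
```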


Citations
Proceedings Article
01 Jan 1998
TL;DR: The joint solution to the problems of learning a unit inventory and corresponding lexicon from data is described and the methodology is extended to handle infrequently observed words using a hybrid system that combines automatically-derived units with phone-based units.
Abstract: Although most parameters in a speech recognition system are estimated from data, the unit inventory and lexicon are generally hand crafted and therefore unlikely to be optimal. This paper describes a joint solution to the problems of learning a unit inventory and corresponding lexicon from data. The methodology, which requires multiple training tokens per word, is then extended to handle infrequently observed words using a hybrid system that combines automatically-derived units with phone-based units. The hybrid system outperforms a phone-based system in first-pass decoding experiments on a large vocabulary conversational speech recognition task.

19 citations

01 Jan 1999
TL;DR: This thesis addresses previously unsolved problems in automatic unit design with three main contributions: to make design of a large unit inventory practical, a new approach is described that combines the problems of unit selection and lexicon design, and the algorithm for learning context conditioning groups is successful.
Abstract: In most speech recognition systems today, acoustic modeling and lexical modeling are viewed as separable problems. Currently the most popular approach is to manually define canonical word pronunciations in terms of phonetic units and let the acoustic models capture differences between actual spoken and canonical pronunciations implicitly with Gaussian mixture models. As a result, these models can be very broad, particularly for casual spontaneous speech. An alternative approach, explored in this thesis, is to learn a unit inventory and pronunciation dictionary from training data using a maximum likelihood objective function. In particular, this thesis addresses previously unsolved problems in automatic unit design with three main contributions. First, to make design of a large unit inventory practical, a new approach is described that combines the problems of unit selection and lexicon design. The design of the units is acoustically driven but constrained to guarantee a matched, limited complexity pronunciation model. Instead of using an acoustic unit training algorithm followed by separate pronunciation model design, the algorithm proposed here incorporates a pronunciation constraint within the unit design algorithm. The resulting unit inventory, unit models and lexicon are matched since they are designed by a single joint design step. The second problem addressed involves synthesizing models for unobserved contexts, needed to model contextual variation at word boundaries. As in phone-based systems, decision tree clustering is used, but this requires classes or sets of units that have a similar influence in context. The solution is to learn these classes from data by a parallel context clustering process. Third, the ability to generalize at the word-level, i.e. to handle words not observed in the training data, is provided by a hybrid system design algorithm. 
In the hybrid system, automatically derived units are designed for the most frequent words, and phonetic units are designed for all words in the vocabulary. Using an estimation step, the word models constructed by the independent automatic and phonetic units are evaluated and the most likely model is included in the lexicon. The new automatic unit design algorithm showed improved performance over phonetic units in experiments on a medium vocabulary (1000 words) task (Resource Management) for both small and large unit inventory systems, outperforming an alternative approach to automatic unit design reported on this task. The algorithm for learning context conditioning groups is successful in that the performance of a system derived by decision tree clustering is equivalent to that of the best unconstrained clustering system and an additional gain is observed when modeling contextual effects across word-boundaries. Finally, when automatically derived units were used in experiments on a large vocabulary (20,000 word) conversational speech task (Switchboard), the recognition accuracy improved over the phonetic unit baseline. In summary, the joint unit and lexicon design algorithm gives higher recognition performance or can be configured to give similar performance at lower cost (lower system complexity) than phone-based units for applications where several examples of each vocabulary word can be provided.
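The maximum-likelihood segmentation via dynamic programming that underpins this line of work can be sketched as follows. This is a least-squares stand-in (each segment modelled by its mean) rather than the thesis's actual acoustic likelihood, and `ml_segmentation` is an illustrative name:

```python
import numpy as np

def ml_segmentation(x, n_segments):
    """Segment a 1-D feature sequence into n_segments contiguous pieces.

    Each segment is modelled by its mean; the dynamic program minimises
    total squared deviation, a least-squares stand-in for the maximum-
    likelihood criterion used in unit-inventory design.
    Returns the segment end indices (exclusive).
    """
    x = np.asarray(x, dtype=float)
    T = len(x)
    csum = np.concatenate([[0.0], np.cumsum(x)])
    csum2 = np.concatenate([[0.0], np.cumsum(x * x)])

    def seg_cost(s, t):
        # sum of squared deviations of x[s:t] from its mean
        n = t - s
        tot = csum[t] - csum[s]
        return (csum2[t] - csum2[s]) - tot * tot / n

    INF = float("inf")
    best = [[INF] * (T + 1) for _ in range(n_segments + 1)]
    back = [[0] * (T + 1) for _ in range(n_segments + 1)]
    best[0][0] = 0.0
    for k in range(1, n_segments + 1):
        for t in range(k, T + 1):
            for s in range(k - 1, t):
                c = best[k - 1][s] + seg_cost(s, t)
                if c < best[k][t]:
                    best[k][t] = c
                    back[k][t] = s
    # trace back the optimal boundaries
    bounds, t = [], T
    for k in range(n_segments, 0, -1):
        bounds.append(t)
        t = back[k][t]
    return bounds[::-1]

print(ml_segmentation([0, 0, 0, 5, 5, 5, 9, 9], 3))  # [3, 6, 8]
```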

19 citations


Cites background or methods from "Lexicon-building methods for an aco..."

  • ...Third, my many interactions with Kuldip Paliwal have given me much more insight in the ideas described here....

  • ...Previous work has investigated the use of an automatically learned unit inventory and lexicon but has always approached these as separable problems[63, 50, 64]....

  • ...addresses both the inventory and model design problems, whereas in [63, 50, 28] unit model parameters had to be estimated in a separate step from the data partition defined by clustering....

  • ...Taking an approach similar to that in [50], the maximum likelihood segmentation of the training data is found by use of dynamic programming....

  • ...Another example of this type of model is described by Paliwal [50]....

Proceedings ArticleDOI
03 Oct 1996
TL;DR: The authors propose an ASU-based word model generation method by composing the ASU statistics, that is, their means, variances and durations, and the effectiveness of the proposed method is shown through spontaneous word recognition experiments.
Abstract: The paper describes a new method of word model generation based on acoustically derived segment units (henceforth ASUs). An ASU-based approach has the advantages of growing out of human pre-determined phonemes and of consistently generating acoustic units by using the maximum likelihood (ML) criterion. The former advantage is effective when it is difficult to map acoustics to a phone such as with highly co-articulated spontaneous speech. In order to implement an ASU-based modeling approach in a speech recognition system, one must first solve two points: (1) how does one design an inventory of acoustically-derived segmental units and (2) how does one model the pronunciations of lexical entries in terms of the ASUs. As for the second question, the authors propose an ASU-based word model generation method by composing the ASU statistics, that is, their means, variances and durations. The effectiveness of the proposed method is shown through spontaneous word recognition experiments.
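The composition idea — building a word model by stringing together per-unit statistics — can be sketched as below. The `ASU` record and `compose_word_model` are hypothetical names, and the "word model" here is just a frame-level (mean, variance) trajectory rather than the full statistical composition in the paper:

```python
from dataclasses import dataclass

@dataclass
class ASU:
    """Illustrative acoustically derived segment unit statistics."""
    mean: float
    var: float
    duration: int  # expected length in frames

def compose_word_model(pronunciation, inventory):
    """Compose a word model by concatenating each ASU's statistics,
    repeated for that unit's expected duration."""
    frames = []
    for unit in pronunciation:
        asu = inventory[unit]
        frames.extend([(asu.mean, asu.var)] * asu.duration)
    return frames

inventory = {"u1": ASU(0.0, 1.0, 2), "u2": ASU(3.0, 0.5, 3)}
model = compose_word_model(["u1", "u2"], inventory)
print(len(model))  # 2 + 3 = 5 frames
```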

18 citations


Cites background or methods from "Lexicon-building methods for an aco..."

  • ...As for the second question, if we have a large number of word utterances to be recognized, we can construct an ASU-based statistical word model[5][2]....

  • ...Several techniques have been proposed for the case in which a large number of utterances for each vocabulary word are seen in the training set[2][5]....

  • ...To cope with these mismatches, we combined two advances proposed in previous work [2][3]....

Dissertation
01 Jan 2014
TL;DR: A class of probabilistic models that discover the latent linguistic structures of a language directly from acoustic signals are developed, and this approach contrasts sharply with the typical method of creating such a dictionary by human experts, which can be a time-consuming and expensive endeavor.
Abstract: The ability to infer linguistic structures from noisy speech streams seems to be an innate human capability. However, reproducing the same ability in machines has remained a challenging task. In this thesis, we address this task, and develop a class of probabilistic models that discover the latent linguistic structures of a language directly from acoustic signals. In particular, we explore a nonparametric Bayesian framework for automatically acquiring a phone-like inventory of a language. In addition, we integrate our phone discovery model with adaptor grammars, a nonparametric Bayesian extension of probabilistic context-free grammars, to induce hierarchical linguistic structures, including sub-word and word-like units, directly from speech signals. When tested on a variety of speech corpora containing different acoustic conditions, domains, and languages, these models consistently demonstrate an ability to learn highly meaningful linguistic structures. In addition to learning sub-word and word-like units, we apply these models to the problem of one-shot learning tasks for spoken words, and our results confirm the importance of inducing intrinsic speech structures for learning spoken words from just one or a few examples. We also show that by leveraging the linguistic units our models discover, we can automatically infer the hidden coding scheme between the written and spoken forms of a language from a transcribed speech corpus. Learning such a coding scheme enables us to develop a completely data-driven approach to creating a pronunciation dictionary for the basis of phone-based speech recognition. This approach contrasts sharply with the typical method of creating such a dictionary by human experts, which can be a time-consuming and expensive endeavor. 
Our experiments show that automatically derived lexicons allow us to build speech recognizers that consistently perform closely to supervised speech recognizers, which should enable more rapid development of speech recognition capability for low-resource languages.

14 citations


Cites background from "Lexicon-building methods for an aco..."

  • ...Various algorithms for learning sub-word based pronunciations were proposed in [113, 49, 4, 144]....

Proceedings ArticleDOI
25 Aug 2013
TL;DR: This work introduces a nonparametric Bayesian approach for segmentation, based on Hierarchical Dirichlet Processes (HDP), in which a hidden Markov model (HMM) with an unbounded number of states is used to segment the utterance.
Abstract: Speech recognition systems have historically used contextdependent phones as acoustic units because these units allow linguistic information, such as a pronunciation lexicon, to be leveraged. However, when dealing with a new language for which minimal linguistic resources exist, it is desirable to automatically discover acoustic units. The process of discovering acoustic units usually consists of two stages: segmentation and clustering. In this paper, we focus on the segmentation portion of this problem. We introduce a nonparametric Bayesian approach for segmentation, based on Hierarchical Dirichlet Processes (HDP), in which a hidden Markov model (HMM) with an unbounded number of states is used to segment the utterance. This model is referred to as an HDP-HMM. We compare this algorithm to several popular heuristic methods and demonstrate an 11% improvement in finding boundaries on the TIMIT Corpus. A self-similarity measure over segments shows an 88% improvement compared to manual segmentation with comparable segment length. This work represents the first step in the development of a speech recognition system that is entirely based on nonparametric Bayesian models.
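For contrast with the HDP-HMM, one of the simple heuristic baselines such methods are compared against can be sketched in a few lines: flag a boundary wherever adjacent feature frames differ by more than a threshold (the function name and threshold are illustrative, not from the paper):

```python
import numpy as np

def change_point_boundaries(feats, threshold):
    """Flag a boundary wherever the Euclidean distance between adjacent
    feature frames exceeds a threshold -- a simple heuristic of the kind
    nonparametric Bayesian segmentation is evaluated against."""
    feats = np.asarray(feats, dtype=float)
    d = np.linalg.norm(np.diff(feats, axis=0), axis=1)
    return [i + 1 for i, v in enumerate(d) if v > threshold]

# toy 2-D feature frames with two abrupt changes
frames = [[0, 0], [0.1, 0], [5, 5], [5.1, 5], [0, 0]]
print(change_point_boundaries(frames, 1.0))  # [2, 4]
```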

13 citations


Cites methods from "Lexicon-building methods for an aco..."

  • ...Most approaches to automatic discovery of acoustic units [2]- [4] do this in two steps: segmentation and clustering....

  • ...Previously a dynamic programming method was applied that incorporated a heuristic stopping criterion [2]- [4]....

References
Journal ArticleDOI
Lawrence R. Rabiner1
01 Feb 1989
TL;DR: In this paper, the authors provide an overview of the basic theory of hidden Markov models (HMMs) as originated by L.E. Baum and T. Petrie (1966) and give practical details on methods of implementation of the theory along with a description of selected applications of HMMs to distinct problems in speech recognition.
Abstract: This tutorial provides an overview of the basic theory of hidden Markov models (HMMs) as originated by L.E. Baum and T. Petrie (1966) and gives practical details on methods of implementation of the theory along with a description of selected applications of the theory to distinct problems in speech recognition. Results from a number of original sources are combined to provide a single source of acquiring the background required to pursue further this area of research. The author first reviews the theory of discrete Markov chains and shows how the concept of hidden states, where the observation is a probabilistic function of the state, can be used effectively. The theory is illustrated with two simple examples, namely coin-tossing, and the classic balls-in-urns system. Three fundamental problems of HMMs are noted and several practical techniques for solving these problems are given. The various types of HMMs that have been studied, including ergodic as well as left-right models, are described.
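The first of the three fundamental HMM problems — evaluating the probability of an observation sequence given the model — is solved by the forward algorithm, which can be sketched for a discrete-observation HMM (the toy coin-tossing parameters below are invented for illustration):

```python
import numpy as np

def forward_loglik(pi, A, B, obs):
    """Forward algorithm for a discrete-observation HMM:
    returns log P(O | model) by recursively propagating alpha,
    the joint probability of the prefix of O and the current state."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return float(np.log(alpha.sum()))

# toy coin-tossing HMM: two hidden coins; observations heads=0, tails=1
pi = np.array([0.5, 0.5])                 # initial state distribution
A = np.array([[0.9, 0.1], [0.1, 0.9]])    # state transition matrix
B = np.array([[0.8, 0.2], [0.3, 0.7]])    # emission probabilities
print(forward_loglik(pi, A, B, [0, 0, 1]))
```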

21,819 citations

Journal ArticleDOI
TL;DR: An efficient and intuitive algorithm is presented for the design of vector quantizers based either on a known probabilistic model or on a long training sequence of data.
Abstract: An efficient and intuitive algorithm is presented for the design of vector quantizers based either on a known probabilistic model or on a long training sequence of data. The basic properties of the algorithm are discussed and demonstrated by examples. Quite general distortion measures and long blocklengths are allowed, as exemplified by the design of parameter vector quantizers of ten-dimensional vectors arising in Linear Predictive Coded (LPC) speech compression with a complicated distortion measure arising in LPC analysis that does not depend only on the error vector.
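The splitting-based design procedure (widely known as the LBG or generalized Lloyd algorithm) can be sketched as follows, using squared Euclidean distortion in place of the LPC-specific measure discussed in the paper; `lbg_codebook` and its parameters are illustrative:

```python
import numpy as np

def lbg_codebook(data, size, eps=1e-3, iters=20):
    """LBG / generalized Lloyd design: start from the global centroid,
    repeatedly split every codeword by a small perturbation, and refine
    with nearest-neighbour / centroid updates until the codebook
    reaches the requested size."""
    data = np.asarray(data, dtype=float)
    codebook = data.mean(axis=0, keepdims=True)
    while len(codebook) < size:
        # split each codeword into a perturbed pair
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        for _ in range(iters):
            # assign each vector to its nearest codeword
            d = np.linalg.norm(data[:, None] - codebook[None], axis=2)
            idx = d.argmin(axis=1)
            # centroid update for non-empty cells
            for j in range(len(codebook)):
                if np.any(idx == j):
                    codebook[j] = data[idx == j].mean(axis=0)
    return codebook

data = np.array([[0.0], [0.1], [10.0], [10.1]])
print(lbg_codebook(data, 2))  # codewords near 0.05 and 10.05
```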

7,935 citations

Journal ArticleDOI
TL;DR: This paper describes a number of statistical models for use in speech recognition, with special attention to determining the parameters for such models from sparse data, and describes two decoding methods appropriate for constrained artificial languages and one appropriate for more realistic decoding tasks.
Abstract: Speech recognition is formulated as a problem of maximum likelihood decoding. This formulation requires statistical models of the speech production process. In this paper, we describe a number of statistical models for use in speech recognition. We give special attention to determining the parameters for such models from sparse data. We also describe two decoding methods, one appropriate for constrained artificial languages and one appropriate for more realistic decoding tasks. To illustrate the usefulness of the methods described, we review a number of decoding results that have been obtained with them.

1,637 citations

Proceedings ArticleDOI
11 Apr 1988
TL;DR: An automatic technique for constructing Markov word models is described and results are included of experiments with speaker-dependent and speaker-independent models on several isolated-word recognition tasks.
Abstract: The Speech Recognition Group at IBM Research has developed a real-time, isolated-word speech recognizer called Tangora, which accepts natural English sentences drawn from a vocabulary of 20000 words. Despite its large vocabulary, the Tangora recognizer requires only about 20 minutes of speech from each new user for training purposes. The accuracy of the system and its ease of training are largely attributable to the use of hidden Markov models in its acoustic match component. An automatic technique for constructing Markov word models is described and results are included of experiments with speaker-dependent and speaker-independent models on several isolated-word recognition tasks.

245 citations

Journal ArticleDOI
TL;DR: A clustering algorithm based on a standard K-means approach which requires no user parameter specification is presented and experimental data show that this new algorithm performs as well or better than the previously used clustering techniques when tested as part of a speaker-independent isolated word recognition system.
Abstract: Studies of isolated word recognition systems have shown that a set of carefully chosen templates can be used to bring the performance of speaker-independent systems up to that of systems trained to the individual speaker. The earliest work in this area used a sophisticated set of pattern recognition algorithms in a human-interactive mode to create the set of templates (multiple patterns) for each word in the vocabulary. Not only was this procedure time consuming but it was impossible to reproduce exactly because it was highly dependent on decisions made by the experimenter. Subsequent work led to an automatic clustering procedure which, given only a set of clustering parameters, clustered patterns with the same performance as the previously developed supervised algorithms. The one drawback of the automatic procedure was that the specification of the input parameter set was found to be somewhat dependent on the vocabulary type and size of population to be clustered. Since a naive user of such a statistical clustering algorithm could not be expected, in general, to know how to choose the word clustering parameters, even this automatic clustering algorithm was not appropriate for a completely general word recognition system. It is the purpose of this paper to present a clustering algorithm based on a standard K-means approach which requires no user parameter specification. Experimental data show that this new algorithm performs as well or better than the previously used clustering techniques when tested as part of a speaker-independent isolated word recognition system.

218 citations