Proceedings ArticleDOI

Lexicon-building methods for an acoustic sub-word based speech recognizer

03 Apr 1990 - Vol. 1990, pp 729-732
TL;DR: The use of an acoustic subword unit (ASWU)-based speech recognition system for the recognition of isolated words is discussed, and it is shown that applying a modified k-means algorithm to the likelihoods derived through the Viterbi algorithm provides the best deterministic type of word lexicon.
Abstract: The use of an acoustic subword unit (ASWU)-based speech recognition system for the recognition of isolated words is discussed. Some methods are proposed for generating the deterministic and the statistical types of word lexicon. It is shown that applying a modified k-means algorithm to the likelihoods derived through the Viterbi algorithm provides the best deterministic type of word lexicon. However, the ASWU-based speech recognizer performs better with the statistical type of word lexicon than with the deterministic type. Improving the design of the word lexicon considerably narrows the gap between the recognition performances of the whole-word unit (WWU)-based and the ASWU-based speech recognizers. Further improvements are expected from better word-lexicon design.
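
The lexicon-building step in the abstract can be pictured with a small sketch. This is not the paper's implementation: `viterbi_decode`, `aswu_hmms`, and the utterance features are assumed interfaces, and a plain k-means over the 1-D Viterbi log-likelihoods stands in for the paper's modified variant; the transcription nearest the centre of the largest cluster is kept as the deterministic lexicon entry.

```python
# Hedged sketch, not the paper's code: build a deterministic lexicon entry by
# clustering Viterbi likelihoods of a word's training utterances.
# `viterbi_decode(feats, models) -> (aswu_string, log_likelihood)` is an
# assumed interface, as are `utterances` (feature matrices) and `aswu_hmms`.
import numpy as np

def deterministic_lexicon_entry(utterances, aswu_hmms, viterbi_decode,
                                n_clusters=2, n_iter=10, seed=0):
    decoded = [viterbi_decode(u, aswu_hmms) for u in utterances]
    scores = np.array([ll for _, ll in decoded], dtype=float)

    # Plain k-means on the 1-D likelihood scores (the paper's "modified"
    # variant is not reproduced here).
    rng = np.random.default_rng(seed)
    centres = scores[rng.choice(len(scores), n_clusters, replace=False)]
    for _ in range(n_iter):
        assign = np.abs(scores[:, None] - centres[None, :]).argmin(axis=1)
        for k in range(n_clusters):
            if np.any(assign == k):
                centres[k] = scores[assign == k].mean()

    # Lexicon entry: transcription of the utterance closest to the centre of
    # the most populated cluster.
    big = np.bincount(assign, minlength=n_clusters).argmax()
    members = np.flatnonzero(assign == big)
    best = members[np.abs(scores[members] - centres[big]).argmin()]
    return decoded[best][0]
```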

Citations
Journal ArticleDOI
Ha-Jin Yu, Y.-H. Oh
TL;DR: A non-uniform unit which can model phoneme variations caused by co-articulation spread over several phonemes and between words is introduced to neural networks for speaker-independent continuous speech recognition.
Abstract: We introduce acoustic sub-word units to neural networks for speaker-independent continuous speech recognition. The functions of segmenting input and detecting words are implemented with networks of simple structure. The non-uniform unit introduced in this research can model phoneme variations caused by co-articulation spread over several phonemes and across word boundaries. These units can be segmented by the network according to the stationary and transition parts of speech, without iteration and without considering all possible position shifts. A word lexicon can be trained by the network, which effectively memorizes all transcription variations in the training utterances of the words. Results of speaker-independent word spotting of 520 words on TIMIT data are reported.

3 citations

Proceedings ArticleDOI
04 Sep 2005
TL;DR: Results show that EHMM learns the lexicon distribution over the population of speakers for each word, thereby effectively modeling the inter-speaker pronunciation variability.
Abstract: We propose a stochastic pronunciation model using an ergodic hidden Markov model (EHMM) of automatically derived acoustic sub-word units (SWU). The proposed EHMM discovers the pronunciation structure inherent in the acoustic training data of a word without any a priori phonetic transcriptions. The EHMM is an HMM of HMMs: its states are SWU HMMs, and its state transitions encode the various possible lexicon entries. The EHMM parameters are estimated by an iterative segmental k-means procedure which jointly optimizes the sub-word units (states) and the pronunciation-structure parameters (state transitions). The EHMM-based pronunciation model is evaluated on an English isolated-word recognition task with 70 speakers drawn from 8 widely differing first languages. Results show that the EHMM learns the lexicon distribution over the population of speakers for each word, thereby effectively modeling inter-speaker pronunciation variability. The EHMM offers an improvement of 8% (absolute) in word recognition accuracy over the performance of a single most-likely lexicon.
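
A rough sketch of the iterative segmental k-means loop described above is given below. It is only a skeleton under assumed interfaces (`segment`, `reestimate_units`, and `reestimate_transitions` are hypothetical callables), not the authors' implementation: each pass segments the word's utterances into sub-word units with the current models and then re-estimates the unit HMMs and the pronunciation (state-transition) parameters.

```python
# Hedged sketch of the segmental k-means loop: alternate between segmenting
# utterances with the current models and re-estimating unit HMMs plus the
# ergodic HMM's state-transition (pronunciation) probabilities.
def train_ehmm(utterances, unit_hmms, transitions,
               segment, reestimate_units, reestimate_transitions, n_iter=10):
    """All three helper callables are assumed interfaces, not a library API.

    segment(utt, unit_hmms, transitions) -> list of (unit_id, feature_slice)
    reestimate_units(all_segments, unit_hmms) -> updated unit_hmms
    reestimate_transitions(all_segments) -> updated transition matrix
    """
    for _ in range(n_iter):
        # Segmentation step: jointly segment and label each utterance with
        # the current sub-word models and pronunciation structure.
        all_segments = [segment(u, unit_hmms, transitions) for u in utterances]
        # Re-estimation step: update unit models and transition probabilities.
        unit_hmms = reestimate_units(all_segments, unit_hmms)
        transitions = reestimate_transitions(all_segments)
    return unit_hmms, transitions
```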

2 citations


Cites background or methods from "Lexicon-building methods for an acoustic sub-word based speech recognizer"

  • ...We evaluate the proposed EHMM pronunciation model for a closed set multi-speaker isolated word recognition task....

  • ...There has been a growing interest in the last decade in pronunciation modeling for automatic speech recognition (ASR) systems, whose performance deteriorates due to high pronunciation variability in words arising from intra-speaker and inter-speaker variabilities [1]....

01 Jan 2008
TL;DR: The ICSI 2007 language recognition system constitutes a variant of the classic PPRLM approach, using a combination of frame-by-frame multilayer perceptron (MLP) phone classifiers for English, Arabic, and Mandarin and one open loop hidden Markov Model (HMM) phone recognizer (trained on English data).
Abstract: In this paper, we describe the ICSI 2007 language recognition system. The system constitutes a variant of the classic PPRLM (parallel phone recognizer followed by language modeling) approach. We used a combination of frame-by-frame multilayer perceptron (MLP) phone classifiers for English, Arabic, and Mandarin and one open-loop hidden Markov model (HMM) phone recognizer (trained on English data). The maximum likelihood language modeling is replaced by support vector machines (SVMs) as a more powerful, discriminative classification method. Rank normalization is used as a normalization method superior to mean-variance normalization. Results are presented on the NIST 2005 language recognition evaluation (LRE05) set and a test set taken from the LRE07 training corpus. The average NIST cost of the system on the LRE05 set is 0.0886.
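
Rank normalization is the one component here that is easy to show compactly. The sketch below follows one common reading of the idea, mapping each feature to its rank, scaled to [0, 1], within a background/training set; the ICSI system's exact variant may differ.

```python
# Hedged sketch of rank normalization: each test feature is replaced by its
# rank among the corresponding training-set values, scaled to [0, 1].
import numpy as np

def rank_normalize(train, test):
    """train: (N, D) background features; test: (M, D) features to normalize."""
    train_sorted = np.sort(train, axis=0)
    n = train_sorted.shape[0]
    out = np.empty_like(test, dtype=float)
    for d in range(test.shape[1]):
        # Rank of each test value among the training values for dimension d.
        ranks = np.searchsorted(train_sorted[:, d], test[:, d], side="left")
        out[:, d] = ranks / n
    return out
```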

2 citations

01 Jan 2015
TL;DR: This dissertation discusses nonparametric Bayesian approaches for acoustic modeling of sub-word units in speech recognition and semi-supervised training of DHDPHMM models.
Abstract: (Table of contents of the dissertation.) Introduction: Acoustic Modeling; Nonparametric Bayesian Approaches in Speech Recognition; Dissertation Organization; Dissertation Contributions. Nonparametric Bayesian Basics: The Dirichlet Distribution; Dirichlet Process; Hierarchical Dirichlet Process (Stick-Breaking Construction); HDPHMM (Block Sampler; Learning Hyperparameters); Conclusion. Nonparametric Bayesian Approaches for Acoustic Modeling of Sub-Word Units: Related Work; A Doubly Hierarchical Dirichlet Process Mixture Model; Inference Algorithm for DHDPHMM; DHDPHMM with a Non-Ergodic Structure (Left-to-Right DHDPHMM with Loop Transitions; Left-to-Right DHDPHMM; Strictly Left-to-Right DHDPHMM); Initial and Final Non-Emitting States (Maximum Likelihood Estimation; Bayesian Estimation); An Integrated Model; Experiments (Evaluation Methods; A Computational Analysis of DHDPHMM; HMM-Generated Data; Phoneme Classification on the TIMIT Corpus); Conclusions. Semi-Supervised Training of DHDPHMM: Semi-Supervised Training of DHDPHMM Models (Composite DHDPHMM Model; Approximation of the Generative Model for Semi-Supervised Training).

2 citations


Cites background or methods from "Lexicon-building methods for an acoustic sub-word based speech recognizer"

  • ...Most unsupervised algorithms for speech segmentation rely on changes in the acoustic data or spectrum (Ma et al., 2005; Bacchiani and Ostendorf, 1999; Paliwal, 1990; Wang et al., 2015)....

  • ...Different types of acoustic units such as phonemes (Lee, 1989), syllables (Ganapathiraju et al., 2001) and acoustically inspired units (Paliwal, 1990) have been explored over the years....

Dissertation
01 Jan 2012
TL;DR: This thesis explores unsupervised algorithms for pattern discovery and retrieval in audio and speech data and explores the techniques of searching audio pattern in broadcast audio which consists of diverse content such as speech, music/songs, commercials, sound effects and background noise.
Abstract: This thesis explores unsupervised algorithms for pattern discovery and retrieval in audio and speech data. In this work, an audio pattern is defined as repeating audio content, such as repeating music segments or words/short phrases in speech recordings. The meaning of "pattern" is defined separately for different types of data: repeating-pattern discovery in music extracts segments with similar melody from a music piece; in human speech, the same words/short phrases spoken by single or multiple speakers are defined as speech patterns; in broadcast audio, repeated commercials/logo music are also considered patterns. Previous work on audio pattern discovery focuses either on symbolizing the audio signal into token sequences followed by text-based search, or on brute-force search techniques such as the self-similarity matrix and Dynamic Time Warping. The symbolization process, which relies on Vector Quantization or other modeling techniques, may suffer from misclassification errors, and exhaustive search requires high computation cost and can also be affected by channel distortion and speaker variation in the audio data. Such limitations motivate me to explore more efficient and robust approaches to automatically detect repeating information in audio data. In this thesis, different unsupervised techniques are examined to analyze music and speech separately. For music, an efficient approach which extends Ukkonen's suffix tree construction algorithm is proposed to detect repeating segments. For speech data, an iterative merging approach based on the Acoustic Segment Model (ASM) is proposed to discover recurrent phrases/words in speech. This thesis also explores techniques for searching audio patterns in broadcast audio, which consists of diverse content such as speech, music/songs, commercials, sound effects and background noise. Existing audio pattern retrieval techniques focus only on specific

2 citations


Cites background from "Lexicon-building methods for an acoustic sub-word based speech recognizer"

  • ...These sub-word units have been further explored to create lexicon in [123]....

References
Journal ArticleDOI
Lawrence R. Rabiner
01 Feb 1989
TL;DR: In this paper, the authors provide an overview of the basic theory of hidden Markov models (HMMs) as originated by L.E. Baum and T. Petrie (1966) and give practical details on methods of implementation of the theory along with a description of selected applications of HMMs to distinct problems in speech recognition.
Abstract: This tutorial provides an overview of the basic theory of hidden Markov models (HMMs) as originated by L.E. Baum and T. Petrie (1966) and gives practical details on methods of implementation of the theory, along with a description of selected applications of the theory to distinct problems in speech recognition. Results from a number of original sources are combined to provide a single source for acquiring the background required to pursue this area of research further. The author first reviews the theory of discrete Markov chains and shows how the concept of hidden states, where the observation is a probabilistic function of the state, can be used effectively. The theory is illustrated with two simple examples, namely coin tossing and the classic balls-in-urns system. Three fundamental problems of HMMs are noted and several practical techniques for solving these problems are given. The various types of HMMs that have been studied, including ergodic as well as left-right models, are described.
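
As a reminder of the first of those fundamental problems (evaluating the likelihood of an observation sequence), a minimal forward-algorithm sketch for discrete observations, in the usual (pi, A, B) notation, looks like this:

```python
# Hedged sketch: the forward procedure for HMM problem 1, P(O | model),
# with discrete observation symbols; notation follows common convention.
import numpy as np

def forward_likelihood(pi, A, B, obs):
    """pi: (N,) initial probs, A: (N, N) transition probs,
    B: (N, M) emission probs over discrete symbols, obs: sequence of symbol ids."""
    alpha = pi * B[:, obs[0]]          # initialization
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # induction
    return alpha.sum()                  # termination: P(O | model)
```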

21,819 citations

Journal ArticleDOI
TL;DR: An efficient and intuitive algorithm is presented for the design of vector quantizers based either on a known probabilistic model or on a long training sequence of data.
Abstract: An efficient and intuitive algorithm is presented for the design of vector quantizers based either on a known probabilistic model or on a long training sequence of data. The basic properties of the algorithm are discussed and demonstrated by examples. Quite general distortion measures and long blocklengths are allowed, as exemplified by the design of parameter vector quantizers of ten-dimensional vectors arising in Linear Predictive Coded (LPC) speech compression with a complicated distortion measure arising in LPC analysis that does not depend only on the error vector.
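
The design procedure the abstract refers to is commonly known as the LBG (generalized Lloyd) algorithm. A minimal sketch, using squared error as the distortion measure for brevity (the paper allows far more general distortions and blocklengths), is:

```python
# Hedged sketch of splitting-based codebook design (LBG / generalized Lloyd),
# with squared-error distortion; not the paper's general formulation.
import numpy as np

def design_codebook(train, size, n_iter=20, eps=0.01):
    """train: (N, D) training vectors; size: desired codebook size (power of 2)."""
    codebook = train.mean(axis=0, keepdims=True)  # start with one centroid
    while len(codebook) < size:
        # Split every codeword into a perturbed pair, then run Lloyd iterations.
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        for _ in range(n_iter):
            d = ((train[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
            nearest = d.argmin(axis=1)
            for k in range(len(codebook)):
                members = train[nearest == k]
                if len(members):
                    codebook[k] = members.mean(axis=0)
    return codebook
```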

7,935 citations

Journal ArticleDOI
TL;DR: This paper describes a number of statistical models for use in speech recognition, with special attention to determining the parameters for such models from sparse data, and describes two decoding methods appropriate for constrained artificial languages and one appropriate for more realistic decoding tasks.
Abstract: Speech recognition is formulated as a problem of maximum likelihood decoding. This formulation requires statistical models of the speech production process. In this paper, we describe a number of statistical models for use in speech recognition. We give special attention to determining the parameters for such models from sparse data. We also describe two decoding methods, one appropriate for constrained artificial languages and one appropriate for more realistic decoding tasks. To illustrate the usefulness of the methods described, we review a number of decoding results that have been obtained with them.
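
The maximum likelihood decoding formulation referred to here is conventionally written, with W a word sequence and A the acoustic evidence, as:

```latex
% Standard noisy-channel decoding rule for speech recognition.
\hat{W} = \arg\max_{W} P(W \mid A) = \arg\max_{W} P(A \mid W)\, P(W)
```

so that the acoustic model supplies P(A | W) and the language model supplies P(W).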

1,637 citations

Proceedings ArticleDOI
11 Apr 1988
TL;DR: An automatic technique for constructing Markov word models is described and results are included of experiments with speaker-dependent and speaker-independent models on several isolated-word recognition tasks.
Abstract: The Speech Recognition Group at IBM Research has developed a real-time, isolated-word speech recognizer called Tangora, which accepts natural English sentences drawn from a vocabulary of 20000 words. Despite its large vocabulary, the Tangora recognizer requires only about 20 minutes of speech from each new user for training purposes. The accuracy of the system and its ease of training are largely attributable to the use of hidden Markov models in its acoustic match component. An automatic technique for constructing Markov word models is described and results are included of experiments with speaker-dependent and speaker-independent models on several isolated-word recognition tasks.

245 citations

Journal ArticleDOI
TL;DR: A clustering algorithm based on a standard K-means approach which requires no user parameter specification is presented and experimental data show that this new algorithm performs as well or better than the previously used clustering techniques when tested as part of a speaker-independent isolated word recognition system.
Abstract: Studies of isolated word recognition systems have shown that a set of carefully chosen templates can be used to bring the performance of speaker-independent systems up to that of systems trained to the individual speaker. The earliest work in this area used a sophisticated set of pattern recognition algorithms in a human-interactive mode to create the set of templates (multiple patterns) for each word in the vocabulary. Not only was this procedure time consuming but it was impossible to reproduce exactly because it was highly dependent on decisions made by the experimenter. Subsequent work led to an automatic clustering procedure which, given only a set of clustering parameters, clustered patterns with the same performance as the previously developed supervised algorithms. The one drawback of the automatic procedure was that the specification of the input parameter set was found to be somewhat dependent on the vocabulary type and size of population to be clustered. Since a naive user of such a statistical clustering algorithm could not be expected, in general, to know how to choose the word clustering parameters, even this automatic clustering algorithm was not appropriate for a completely general word recognition system. It is the purpose of this paper to present a clustering algorithm based on a standard K-means approach which requires no user parameter specification. Experimental data show that this new algorithm performs as well or better than the previously used clustering techniques when tested as part of a speaker-independent isolated word recognition system.
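
A toy version of the template-clustering idea can be sketched as follows. This is not the paper's algorithm: the paper clusters DTW-aligned patterns and avoids user-set parameters, whereas the sketch uses plain Euclidean distances over fixed-length feature vectors and takes the cluster count as an explicit argument, keeping the token nearest each cluster centre as a reference template.

```python
# Hedged sketch: group repetitions of one word with standard k-means and keep
# one representative token per cluster as a recognition template.
import numpy as np

def cluster_templates(tokens, n_templates, n_iter=20, seed=0):
    """tokens: (N, D) fixed-length feature vectors of one word's repetitions."""
    rng = np.random.default_rng(seed)
    centres = tokens[rng.choice(len(tokens), n_templates, replace=False)].astype(float)
    for _ in range(n_iter):
        d = ((tokens[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        assign = d.argmin(axis=1)
        for k in range(n_templates):
            if np.any(assign == k):
                centres[k] = tokens[assign == k].mean(axis=0)
    # Use the token nearest each centre as that cluster's reference template.
    d = ((tokens[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return [tokens[d[:, k].argmin()] for k in range(n_templates)]
```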

218 citations