Lexicon-building methods for an acoustic sub-word based speech recognizer
03 Apr 1990, Vol. 1990, pp. 729-732
TL;DR: The use of an acoustic subword unit (ASWU)-based speech recognition system for the recognition of isolated words is discussed and it is shown that the use of a modified k-means algorithm on the likelihoods derived through the Viterbi algorithm provides the best deterministic-type of word lexicon.
Abstract: The use of an acoustic subword unit (ASWU)-based speech recognition system for the recognition of isolated words is discussed. Some methods are proposed for generating the deterministic and the statistical types of word lexicon. It is shown that the use of a modified k-means algorithm on the likelihoods derived through the Viterbi algorithm provides the best deterministic-type of word lexicon. However, the ASWU-based speech recognizer leads to better performance with the statistical type of word lexicon than with the deterministic type. Improving the design of the word lexicon makes it possible to narrow the gap in the recognition performances of the whole word unit (WWU)-based and the ASWU-based speech recognizers considerably. Further improvements are expected by designing the word lexicon better.
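As a rough illustration of the deterministic lexicon-building idea (a sketch, not the authors' exact modified k-means procedure), one can select, for each vocabulary word, the training token whose ASWU transcription maximizes the summed Viterbi likelihood over all tokens of that word. The `likelihood` matrix below is a hypothetical stand-in for scores that a Viterbi decoder would produce.

```python
def pick_reference_token(likelihood):
    """Select the token index whose transcription best explains all tokens.

    likelihood[i][j]: Viterbi log-likelihood of token j scored against the
    ASWU transcription derived from token i (hypothetical values here).
    """
    n = len(likelihood)
    totals = [sum(likelihood[i][j] for j in range(n)) for i in range(n)]
    return max(range(n), key=lambda i: totals[i])

# Three tokens of one word; the chosen token's transcription becomes the
# deterministic lexicon entry for that word.
scores = [
    [-10.0, -14.0, -13.0],
    [-11.0,  -9.0, -10.0],
    [-15.0, -13.0,  -9.5],
]
best = pick_reference_token(scores)
```

A statistical lexicon would instead keep a distribution over several transcriptions rather than a single winner.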
Citations
Proceedings Article
01 Jan 2000
TL;DR: The task of this research is to form phone-like models and a phoneme-like set from spoken word samples without using any transcriptions except for the lexical identification of each word in a vocabulary.
Abstract: The task of our research is to form phone-like models and a phoneme-like set from spoken word samples without using any transcriptions except for the lexical identification of each word in a vocabulary. This framework is derived from two motivations: 1) automatic design of optimal speech recognition units and structures of phone models, and 2) multi-lingual speech recognition based on language-independent intermediate phonetic codes. The procedure consists of two steps: 1) constructing a VQ codebook of sub-phonetic segments from speech samples, and 2) extracting phonological chunks from sequences of the codes. The segment model is represented with the "piecewise linear segment lattice" model, a lattice structure of segments, each of which is represented as regression coefficients of feature vectors within the segment. Phonological chunks are extracted with a criterion based on Kullback-Leibler divergence between the distributions of individual VQ codes. The recognition rate reaches approximately 90% on the 1542-word task with 128 VQ codes.
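The chunk-extraction criterion can be sketched with a symmetrized Kullback-Leibler divergence between two discrete distributions over VQ codes; the smoothing constant `eps` and the merge threshold below are assumptions for illustration, not values from the paper.

```python
import math

def kl(p, q, eps=1e-9):
    """KL divergence D(p || q) between discrete VQ-code distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def symmetric_kl(p, q):
    """Symmetrized divergence: zero iff the code distributions match."""
    return 0.5 * (kl(p, q) + kl(q, p))

# Two code histograms; codes with a small divergence between their
# contextual distributions are candidates to merge into one chunk.
p = [0.50, 0.30, 0.20]
q = [0.45, 0.35, 0.20]
should_merge = symmetric_kl(p, q) < 0.05  # threshold is an assumption
```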
1 citation
Cites background from "Lexicon-building methods for an aco..."
...Related researches seem to be getting more active for example: [2][3][4][5][6]....
Dissertation
01 Jan 2002
TL;DR: This thesis presented a dynamic reliability scoring scheme which can help adjust partial path scores while the recognizer searches through the composed lexical and acoustic-phonetic network, and demonstrated the effectiveness of the dynamic reliability modeling approach.
Abstract: In this thesis, we mainly focused on the integration of knowledge sources within the speech understanding system of a conversational interface. More specifically, we studied the formalization and integration of hierarchical linguistic knowledge at both the sub-lexical level and the supra-lexical level, and proposed a unified framework for integrating hierarchical linguistic knowledge in speech recognition using layered finite-state transducers (FSTs). Within the proposed framework, we developed context-dependent hierarchical linguistic models at both sub-lexical and supra-lexical levels. FSTs were designed and constructed to encode both structure and probability constraints provided by the hierarchical linguistic models. We also studied empirically the feasibility and effectiveness of integrating hierarchical linguistic knowledge into speech recognition using the proposed framework.
Another important aspect of achieving more accurate and robust speech recognition is the integration of acoustic knowledge. Typically, along a recognizer's search path, some acoustic units are modeled more reliably than others, due to differences in their acoustic phonetic features and many other factors. We presented a dynamic reliability scoring scheme which can help adjust partial path scores while the recognizer searches through the composed lexical and acoustic-phonetic network. The reliability models were trained on acoustic scores of a correct arc and its immediate competing arcs extending the current partial path. During recognition, if, according to the trained reliability models, an arc can be more easily distinguished from the competing alternatives, that arc is then more likely to be in the right path, and the partial path score can be adjusted accordingly, to reflect such acoustic model reliability information. We applied this reliability scoring scheme in two weather information domains. The first one is the JUPITER system in English, and the second one is the PANDA system in Mandarin Chinese. We demonstrated the effectiveness of the dynamic reliability modeling approach in both cases. (Copies available exclusively from MIT Libraries, Rm. 14-0551, Cambridge, MA 02139-4307. Ph. 617-253-5668; Fax 617-253-1690.) (Abstract shortened by UMI.)
1 citation
Cites methods from "Lexicon-building methods for an aco..."
...In contrast, the corpus-based approach does not use a priori linguistic knowledge, and the sub-word units are chosen by clustering acoustic data [81]....
01 Jan 2005
TL;DR: In this article, the authors propose the use of acoustic-based units as a means of reducing the restrictions of text-dependent speaker recognition, which has primarily been based around text-independent systems built on GMMs, and ideally increasing overall system performance.
Abstract: The field of speaker recognition has primarily been based around text-independent systems built on GMMs. Text-dependent systems based on linguistic units have long been neglected due to their restrictive nature. The goal of this paper is to propose the use of acoustic-based units as a means of reducing these text-dependent restrictions and, ideally, increasing overall system performance.
1 citation
Cites background from "Lexicon-building methods for an aco..."
...In recent years acoustic units have been quite popular in speech recognition [1]–[4], but have never been applied to speaker recognition....
01 Jan 2017
TL;DR: A data-driven G2P conversion approach is developed in which a probabilistic G2P relationship is learned by matching the acoustic signal with the word hypothesis represented by graphemes, using phones as the latent symbols, and building on the posterior-based formalism to show that different G2P conversion approaches in the literature can be regarded as different estimators of phone class conditional probabilities, and can be combined in a multi-stream fashion to yield better lexicons.
Abstract: State-of-the-art automatic speech recognition (ASR) and text-to-speech systems require a pronunciation lexicon that maps each word to a sequence of phones. Manual development of lexicons is costly as it needs linguistic knowledge and human expertise. To facilitate this process, grapheme-to-phone (G2P) conversion approaches are used, in which given a seed lexicon provided by linguistic experts, the G2P relationship is learned by applying statistical techniques. Despite advances in these approaches, there are two challenges remaining: (1) the seed lexicon development through linguistic expertise incorporates limited acoustic information, which may not necessarily cover all natural phonological variations, and (2) the linguistic expertise required for the development of the seed lexicon may not be available for all languages, particularly under-resourced languages. The goal of this thesis is to address these challenges by developing a framework that effectively integrates linguistic information and acoustic data for pronunciation lexicon development. To achieve that goal, we first study the problem of matching a word hypothesis to the acoustic signal, and show that the hidden Markov model-based ASR approach achieves that match via a latent symbol set. Building on that understanding, we develop a data-driven G2P conversion approach in which a probabilistic G2P relationship is learned by matching the acoustic signal with the word hypothesis represented by graphemes, using phones as the latent symbols. Through a theoretical development, we show that this acoustic G2P conversion approach is a particular case of an abstract posterior-based G2P conversion formalism, which requires estimation of phone class conditional probabilities. 
Through studies on two languages, we show that the acoustic G2P conversion approach yields lexicons that can perform comparable to state-of-the-art G2P conversion methods at the ASR level, despite performing relatively poorly at pronunciation level. We build on the posterior-based formalism to show that different G2P conversion approaches in the literature can be regarded as different estimators of phone class conditional probabilities, and can be combined in a multi-stream fashion to yield better lexicons. We also demonstrate that the multi-stream formulation can be further extended to unify acoustic-to-phone conversion and G2P conversion. We validate the proposed multi-stream formulation on two challenging tasks on English. Finally, we address the issue of developing lexical resources for under-resourced languages by proposing an acoustic subword unit (ASWU)-based lexicon development approach. In this approach, ASWU derivation is cast as the problem of determining a latent symbol space given the word hypothesis and acoustics, and the pronunciations are generated using the proposed acoustic G2P conversion approach. Through experimental studies and analysis on well-resourced and under-resourced languages, we show that the derived ASWUs are "phone-like", and the ASWU-based lexicons yield better ASR systems compared to the alternative grapheme-based lexicons.
Proceedings Article
01 Jan 1998
TL;DR: Experimental results on an isolated-word recognition task show that the recognition rate is significantly improved by blocking the segments and by clustering the segments within a block.
Abstract: The goal of this work is to model phone-like units automatically from spoken word samples without using any transcriptions except for the lexical identification of the words. In order to implement this task, we have proposed the "piecewise linear segment lattice (PLSL)" model for phoneme representation. The structure of this model is a lattice of segments, each of which is represented as regression coefficients of feature vectors within the segment. In order to organize phone models, operations including division, concatenation, blocking and clustering are applied to the models. This paper mainly reports on blocking and clustering. Experimental results on an isolated-word recognition task show that the recognition rate is significantly improved by blocking the segments and by clustering the segments within a block. We obtain sufficient performance for the task with models consisting of at most 128 clusters of segment patterns.
References
01 Feb 1989
TL;DR: In this paper, the authors provide an overview of the basic theory of hidden Markov models (HMMs) as originated by L.E. Baum and T. Petrie (1966) and give practical details on methods of implementation of the theory along with a description of selected applications of HMMs to distinct problems in speech recognition.
Abstract: This tutorial provides an overview of the basic theory of hidden Markov models (HMMs) as originated by L.E. Baum and T. Petrie (1966) and gives practical details on methods of implementation of the theory along with a description of selected applications of the theory to distinct problems in speech recognition. Results from a number of original sources are combined to provide a single source of acquiring the background required to pursue further this area of research. The author first reviews the theory of discrete Markov chains and shows how the concept of hidden states, where the observation is a probabilistic function of the state, can be used effectively. The theory is illustrated with two simple examples, namely coin-tossing, and the classic balls-in-urns system. Three fundamental problems of HMMs are noted and several practical techniques for solving these problems are given. The various types of HMMs that have been studied, including ergodic as well as left-right models, are described.
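Of the three fundamental HMM problems, the evaluation problem (computing the probability of an observation sequence given the model) is solved by the forward algorithm. A minimal discrete-observation sketch, with an assumed two-state coin-tossing model in the spirit of the tutorial's examples:

```python
def forward(obs, pi, A, B):
    """P(obs | model) for a discrete HMM via the forward algorithm.

    pi[i]: initial state probabilities; A[i][j]: transition probabilities;
    B[i][k]: probability of emitting symbol k in state i.
    """
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]           # initialization
    for o in obs[1:]:                                          # induction
        alpha = [sum(alpha[j] * A[j][i] for j in range(n)) * B[i][o]
                 for i in range(n)]
    return sum(alpha)                                          # termination

# Two hidden coins (fair vs. heads-biased); symbols 0 = heads, 1 = tails.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.9, 0.1]]
p = forward([0, 0, 1], pi, A, B)
```

A useful sanity check is that the probabilities of all possible observation sequences of a fixed length sum to one.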
21,819 citations
TL;DR: An efficient and intuitive algorithm is presented for the design of vector quantizers based either on a known probabilistic model or on a long training sequence of data.
Abstract: An efficient and intuitive algorithm is presented for the design of vector quantizers based either on a known probabilistic model or on a long training sequence of data. The basic properties of the algorithm are discussed and demonstrated by examples. Quite general distortion measures and long blocklengths are allowed, as exemplified by the design of parameter vector quantizers of ten-dimensional vectors arising in Linear Predictive Coded (LPC) speech compression with a complicated distortion measure arising in LPC analysis that does not depend only on the error vector.
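This design procedure is commonly known as the LBG (Linde-Buzo-Gray) algorithm: split every codeword into a perturbed pair, then refine with Lloyd iterations. A minimal sketch using squared-error distortion (the paper allows far more general distortion measures; the perturbation factor `eps` here is an assumption):

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(vectors):
    d = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(d)]

def lbg(data, size, iters=20, eps=0.01):
    """Design a `size`-entry codebook by codeword splitting + Lloyd refinement."""
    codebook = [centroid(data)]                 # start from the global mean
    while len(codebook) < size:
        # Split each codeword into two slightly perturbed copies.
        codebook = ([[c * (1 + eps) for c in cw] for cw in codebook] +
                    [[c * (1 - eps) for c in cw] for cw in codebook])
        for _ in range(iters):
            # Lloyd iteration: nearest-neighbour assignment, then re-center.
            cells = [[] for _ in codebook]
            for x in data:
                i = min(range(len(codebook)),
                        key=lambda i: sq_dist(codebook[i], x))
                cells[i].append(x)
            codebook = [centroid(c) if c else cw
                        for cw, c in zip(codebook, cells)]
    return codebook

# Toy training set with two well-separated clusters.
data = [[0.1, 0.0], [0.0, 0.2], [0.2, 0.1],
        [9.9, 10.0], [10.1, 9.8], [10.0, 10.2]]
cb = lbg(data, 2)
```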
7,935 citations
TL;DR: This paper describes a number of statistical models for use in speech recognition, with special attention to determining the parameters for such models from sparse data, and describes two decoding methods appropriate for constrained artificial languages and one appropriate for more realistic decoding tasks.
Abstract: Speech recognition is formulated as a problem of maximum likelihood decoding. This formulation requires statistical models of the speech production process. In this paper, we describe a number of statistical models for use in speech recognition. We give special attention to determining the parameters for such models from sparse data. We also describe two decoding methods, one appropriate for constrained artificial languages and one appropriate for more realistic decoding tasks. To illustrate the usefulness of the methods described, we review a number of decoding results that have been obtained with them.
1,637 citations
TL;DR: An automatic technique for constructing Markov word models is described and results are included of experiments with speaker-dependent and speaker-independent models on several isolated-word recognition tasks.
Abstract: The Speech Recognition Group at IBM Research has developed a real-time, isolated-word speech recognizer called Tangora, which accepts natural English sentences drawn from a vocabulary of 20000 words. Despite its large vocabulary, the Tangora recognizer requires only about 20 minutes of speech from each new user for training purposes. The accuracy of the system and its ease of training are largely attributable to the use of hidden Markov models in its acoustic match component. An automatic technique for constructing Markov word models is described and results are included of experiments with speaker-dependent and speaker-independent models on several isolated-word recognition tasks.
245 citations
TL;DR: A clustering algorithm based on a standard K-means approach which requires no user parameter specification is presented, and experimental data show that this new algorithm performs as well as or better than the previously used clustering techniques when tested as part of a speaker-independent isolated word recognition system.
Abstract: Studies of isolated word recognition systems have shown that a set of carefully chosen templates can be used to bring the performance of speaker-independent systems up to that of systems trained to the individual speaker. The earliest work in this area used a sophisticated set of pattern recognition algorithms in a human-interactive mode to create the set of templates (multiple patterns) for each word in the vocabulary. Not only was this procedure time consuming but it was impossible to reproduce exactly because it was highly dependent on decisions made by the experimenter. Subsequent work led to an automatic clustering procedure which, given only a set of clustering parameters, clustered patterns with the same performance as the previously developed supervised algorithms. The one drawback of the automatic procedure was that the specification of the input parameter set was found to be somewhat dependent on the vocabulary type and size of population to be clustered. Since a naive user of such a statistical clustering algorithm could not be expected, in general, to know how to choose the word clustering parameters, even this automatic clustering algorithm was not appropriate for a completely general word recognition system. It is the purpose of this paper to present a clustering algorithm based on a standard K-means approach which requires no user parameter specification. Experimental data show that this new algorithm performs as well as or better than the previously used clustering techniques when tested as part of a speaker-independent isolated word recognition system.
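Because word templates of differing durations cannot be averaged directly, a K-means-style clustering over templates is often run on a precomputed distance matrix (e.g. pairwise DTW distances), with each cluster represented by a member template. The k-medoids sketch below illustrates that clustering step only; it is not the paper's exact parameter-free algorithm, and the distance matrix is hypothetical.

```python
def k_medoids(dist, k, iters=10):
    """Cluster items given a pairwise distance matrix; each cluster is
    represented by the member minimizing total distance to the cluster."""
    medoids = list(range(k))  # simple deterministic initialization
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for i in range(len(dist)):
            # Assign each template to its nearest medoid.
            clusters[min(range(k), key=lambda m: dist[i][medoids[m]])].append(i)
        # Re-pick each medoid as the member closest to all cluster members.
        new = [min(c, key=lambda i: sum(dist[i][j] for j in c)) if c else medoids[m]
               for m, c in enumerate(clusters)]
        if new == medoids:
            break
        medoids = new
    return medoids

# Hypothetical pairwise DTW distances between five word templates.
D = [
    [0, 1, 9, 9, 8],
    [1, 0, 8, 9, 9],
    [9, 8, 0, 1, 2],
    [9, 9, 1, 0, 1],
    [8, 9, 2, 1, 0],
]
reps = k_medoids(D, 2)  # indices of the representative templates
```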
218 citations