Proceedings ArticleDOI

Lexicon-building methods for an acoustic sub-word based speech recognizer

03 Apr 1990 - Vol. 1990, pp 729-732
TL;DR: The use of an acoustic subword unit (ASWU)-based speech recognition system for the recognition of isolated words is discussed and it is shown that the use of a modified k-means algorithm on the likelihoods derived through the Viterbi algorithm provides the best deterministic-type of word lexicon.
Abstract: The use of an acoustic subword unit (ASWU)-based speech recognition system for the recognition of isolated words is discussed. Some methods are proposed for generating the deterministic and the statistical types of word lexicon. It is shown that the use of a modified k-means algorithm on the likelihoods derived through the Viterbi algorithm provides the best deterministic-type of word lexicon. However, the ASWU-based speech recognizer leads to better performance with the statistical type of word lexicon than with the deterministic type. Improving the design of the word lexicon makes it possible to narrow the gap in the recognition performances of the whole word unit (WWU)-based and the ASWU-based speech recognizers considerably. Further improvements are expected by designing the word lexicon better.
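The abstract does not spell out the modified k-means procedure, so the following is only a minimal sketch of one plausible reading: a k-medoids-style selection in which the "distance" between a candidate ASWU sequence and a training token is replaced by a Viterbi log-likelihood. The function name `select_baseforms` and the precomputed matrix `loglik` are hypothetical and are not taken from the paper.

```python
def select_baseforms(loglik, k, max_iter=20):
    """Pick k representative candidate sequences for one word.

    loglik[c][t] is ASSUMED to hold the Viterbi log-likelihood of training
    token t under candidate ASWU sequence c (computed elsewhere).
    Returns (centers, assign): chosen candidate indices and, for each token,
    the chosen candidate that scores it best.
    """
    n_cand = len(loglik)
    n_tok = len(loglik[0])

    def assign_tokens(centers):
        # Each token goes to the center under which it is most likely.
        return [max(centers, key=lambda c: loglik[c][t]) for t in range(n_tok)]

    # Initialise with the k candidates of highest total log-likelihood.
    order = sorted(range(n_cand), key=lambda c: -sum(loglik[c]))
    centers = order[:k]

    for _ in range(max_iter):
        assign = assign_tokens(centers)
        new_centers = []
        for c in centers:
            members = [t for t in range(n_tok) if assign[t] == c]
            if not members:
                new_centers.append(c)  # keep an empty cluster's center as-is
                continue
            # Medoid-style update: the candidate that best explains the cluster.
            best = max(range(n_cand),
                       key=lambda cand: sum(loglik[cand][t] for t in members))
            new_centers.append(best)
        if new_centers == centers:
            break
        centers = new_centers

    return centers, assign_tokens(centers)
```

With `k = 1` this reduces to choosing the single candidate sequence with the highest total likelihood over all tokens of the word, which is one way to read a deterministic lexicon entry.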

Citations
01 Jan 1999
TL;DR: Problems with the phoneme as the basic subword unit in speech recognition are raised, suggesting that finer-grained control is needed to capture the sort of pronunciation variability observed in spontaneous speech.
Abstract: The notion that a word is composed of a sequence of phone segments, sometimes referred to as ‘beads on a string’, has formed the basis of most speech recognition work for over 15 years. However, as more researchers tackle spontaneous speech recognition tasks, that view is being called into question. This paper raises problems with the phoneme as the basic subword unit in speech recognition, suggesting that finer-grained control is needed to capture the sort of pronunciation variability observed in spontaneous speech. We offer two different alternatives – automatically derived subword units and linguistically motivated distinctive feature systems – and discuss current work in these directions. In addition, we look at problems that arise in acoustic modeling when trying to incorporate higher-level structure with these two strategies.

151 citations


Cites background from "Lexicon-building methods for an aco..."

  • ...ASWUs were proposed several years ago [10, 11, 12, 13], but they faded from view as speaker-independent recognition became the primary goal, because of the difficulty of distinguishing speaker variability from real pronunciation differences....

    [...]

Journal ArticleDOI
TL;DR: A method for combining phonetic and fenonic models is presented and results of experiments with speaker-dependent and speaker-independent models on several isolated-word recognition tasks are reported.
Abstract: A technique for constructing Markov models for the acoustic representation of words is described. Word models are constructed from models of subword units called fenones. Fenones represent very short speech events and are obtained automatically through the use of a vector quantizer. The fenonic baseform for a word (i.e., the sequence of fenones used to represent the word) is derived automatically from one or more utterances of that word. Since the word models are all composed from a small inventory of subword models, training for large-vocabulary speech recognition systems can be accomplished with a small training script. A method for combining phonetic and fenonic models is presented. Results of experiments with speaker-dependent and speaker-independent models on several isolated-word recognition tasks are reported. The results are compared with those for phonetics-based Markov models and template-based dynamic programming (DP) matching.
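As a rough illustration of how a vector quantizer can turn an utterance into a sequence of subword labels, the sketch below labels each frame with its nearest codeword and uses the label sequence as a baseform. It covers only the single-utterance case and assumes squared Euclidean distance; the names `nearest_codeword` and `fenonic_baseform` are hypothetical, and the abstract does not confirm that the authors' procedure works this way.

```python
def nearest_codeword(frame, codebook):
    """Index of the codeword closest to `frame` (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((x - c) ** 2 for x, c in zip(frame, codebook[k])),
    )


def fenonic_baseform(frames, codebook):
    """Label every frame of one utterance with its nearest codeword.

    The resulting label sequence stands in for a fenonic baseform here;
    deriving a baseform from several utterances is not covered.
    """
    return [nearest_codeword(frame, codebook) for frame in frames]


# Toy example: a two-entry codebook and three 2-dimensional frames.
codebook = [[0.0, 0.0], [1.0, 1.0]]
frames = [[0.1, 0.0], [0.9, 1.2], [1.0, 0.8]]
baseform = fenonic_baseform(frames, codebook)
```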

67 citations

Journal ArticleDOI
TL;DR: A joint solution to the related problems of learning a unit inventory and corresponding lexicon from data is proposed; on a speaker-independent read speech task with a 1k vocabulary, the proposed algorithm outperforms phone-based systems at both high and low complexities.
Abstract: Although most parameters in a speech recognition system are estimated from data by the use of an objective function, the unit inventory and lexicon are generally hand crafted and therefore unlikely to be optimal. This paper proposes a joint solution to the related problems of learning a unit inventory and corresponding lexicon from data. On a speaker-independent read speech task with a 1k vocabulary, the proposed algorithm outperforms phone-based systems at both high and low complexities.

66 citations


Cites background or methods from "Lexicon-building methods for an aco..."

  • ...Taking an approach similar to that in (Svendsen and Soong, 1987; Paliwal, 1990), the maximum likelihood segmentation of the training data are found by the use of dynamic programming....

    [...]

  • ...Cluster centroids therefore directly represent unit models and clustering addresses both the inventory and model design problems, whereas in (Svendsen et al., 1989; Paliwal, 1990; Holter and Svendsen, 1997a) unit model parameters had to be estimated in a separate step from the data partition defined by clustering....

    [...]

  • ...The clustering algorithm used here differs from that used in (Svendsen et al., 1989; Paliwal, 1990; Holter and Svendsen, 1997a) in that maximum likelihood is used as an objective rather than minimum Euclidean distance....

    [...]

  • ...The related problem of defining a lexicon in terms of these ASWUs has also received attention (e.g., Paliwal, 1990; Svendsen et al., 1995)....

    [...]

Journal ArticleDOI
TL;DR: A maximum likelihood based algorithm for fully automatic data-driven modelling of pronunciation, given a set of subword hidden Markov models (HMMs) and acoustic tokens of a word, is presented in order to create a consistent framework for optimisation of automatic speech recognition systems.
Abstract: This paper addresses the problem of generating lexical word representations that properly represent natural pronunciation variations for the purpose of improved speech recognition accuracy. In order to create a consistent framework for optimisation of automatic speech recognition systems, we present a maximum likelihood based algorithm for fully automatic data-driven modelling of pronunciation, given a set of subword hidden Markov models (HMMs) and acoustic tokens of a word. We also propose an extension of this formulation in order to achieve optimal modelling of pronunciation variations. Since different words will not in general exhibit the same amount of pronunciation variation, the procedure allows words to be represented by a different number of baseforms. The methods improve the subword description of the vocabulary words and have been shown to improve recognition performance on the DARPA Resource Management task.

63 citations

Proceedings Article
01 Oct 2013
TL;DR: An unsupervised alternative, requiring no language-specific knowledge, to the conventional manual approach for creating pronunciation dictionaries is proposed, which jointly discovers the phonetic inventory and the Letter-to-Sound mapping rules in a language using only transcribed data.
Abstract: The creation of a pronunciation lexicon remains the most inefficient process in developing an Automatic Speech Recognizer (ASR). In this paper, we propose an unsupervised alternative, requiring no language-specific knowledge, to the conventional manual approach for creating pronunciation dictionaries. We present a hierarchical Bayesian model, which jointly discovers the phonetic inventory and the Letter-to-Sound (L2S) mapping rules in a language using only transcribed data. When tested on a corpus of spontaneous queries, the results demonstrate the superiority of the proposed joint learning scheme over its sequential counterpart, in which the latent phonetic inventory and L2S mappings are learned separately. Furthermore, the recognizers built with the automatically induced lexicon consistently outperform grapheme-based recognizers and even approach the performance of recognition systems trained using conventional supervised procedures.

41 citations


Additional excerpts

  • ...Various algorithms for learning sub-word based pronunciations were proposed in (Lee et al., 1988; Fukada et al., 1996; Bacchiani and Ostendorf, 1999; Paliwal, 1990)....

    [...]

References
Journal ArticleDOI
TL;DR: An enhanced analysis feature set consisting of both instantaneous and transitional spectral information is used and the hidden-Markov-model (HMM)-based connected-digit recognizer in speaker-trained, multispeaker, and speaker-independent modes is tested.
Abstract: The authors use an enhanced analysis feature set consisting of both instantaneous and transitional spectral information and test the hidden-Markov-model (HMM)-based connected-digit recognizer in speaker-trained, multispeaker, and speaker-independent modes. For the evaluation, both a 50-talker connected-digit database recorded over local, dialed-up telephone lines, and the Texas Instruments, 225-adult-talker, connected-digits database are used. Using these databases, the performance achieved was 0.35, 1.65, and 1.75% string error rates for known-length strings, for speaker-trained, multispeaker, and speaker-independent modes, respectively, and 0.78, 2.85, and 2.94% string error rates for unknown-length strings of up to seven digits in length for the three modes. Several experiments were carried out to determine the best set of conditions (e.g., training, recognition, parameters, etc.) for recognition of digits. The results and the interpretation of these experiments are described.

205 citations

Proceedings ArticleDOI
06 Apr 1987
TL;DR: Three different approaches for automatically segmenting speech into phonetic units are described, one based on template matching, one based on detecting the spectral changes that occur at the boundaries between phonetic units and one based upon a constrained-clustering vector quantization approach.
Abstract: For large vocabulary and continuous speech recognition, the sub-word-unit-based approach is a viable alternative to the whole-word-unit-based approach. For preparing a large inventory of subword units, an automatic segmentation is preferable to manual segmentation as it substantially reduces the work associated with the generation of templates and gives more consistent results. In this paper we discuss some methods for automatically segmenting speech into phonetic units. Three different approaches are described, one based on template matching, one based on detecting the spectral changes that occur at the boundaries between phonetic units and one based on a constrained-clustering vector quantization approach. An evaluation of the performance of the automatic segmentation methods is given.

156 citations


"Lexicon-building methods for an aco..." refers methods in this paper

  • ...The maximum likelihood (ML) algorithm proposed by Svendsen and Soong [12] uses this criterion for segmentation....

    [...]
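The excerpts on this page mention maximum likelihood segmentation found by dynamic programming. A minimal sketch of that general idea follows, under the assumption that each segment is modelled by its mean vector with a fixed spherical covariance, so that maximising likelihood is equivalent to minimising the within-segment squared error. The function names are hypothetical, and this is not claimed to be Svendsen and Soong's exact formulation.

```python
def segment_cost(frames, start, end):
    """Sum of squared distances of frames[start:end] from their mean vector."""
    seg = frames[start:end]
    dim = len(seg[0])
    mean = [sum(f[d] for f in seg) / len(seg) for d in range(dim)]
    return sum((f[d] - mean[d]) ** 2 for f in seg for d in range(dim))


def ml_segment(frames, n_segments):
    """Split `frames` into `n_segments` contiguous segments of minimum total cost.

    Assumes 1 <= n_segments <= len(frames).
    Returns (bounds, cost); bounds[k] is the exclusive end index of segment k.
    """
    n = len(frames)
    inf = float("inf")
    # best[k][j]: minimum cost of splitting frames[:j] into k segments.
    best = [[inf] * (n + 1) for _ in range(n_segments + 1)]
    back = [[0] * (n + 1) for _ in range(n_segments + 1)]
    best[0][0] = 0.0
    for k in range(1, n_segments + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):  # i is the start of the last segment
                cost = best[k - 1][i] + segment_cost(frames, i, j)
                if cost < best[k][j]:
                    best[k][j] = cost
                    back[k][j] = i
    # Trace the segment boundaries back from the final frame.
    bounds = []
    j = n
    for k in range(n_segments, 0, -1):
        bounds.append(j)
        j = back[k][j]
    return sorted(bounds), best[n_segments][n]


# Toy example: five 1-dimensional frames forming three obvious groups.
toy_frames = [[0.0], [0.1], [5.0], [5.1], [9.0]]
toy_bounds, toy_cost = ml_segment(toy_frames, 3)
```

This version recomputes each segment cost from scratch for clarity; cumulative sums would make the cost lookup constant-time if speed mattered.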

Proceedings ArticleDOI
11 Apr 1988
TL;DR: The proposed segment model was tested on a speaker-trained, isolated word, speech recognition task with a vocabulary of 1109 basic English words and the average word recognition accuracy was 85% and increased to 96% and 98% for the top 3 and top 5 candidates, respectively.
Abstract: Proposes a global acoustic segment model for characterizing fundamental speech sound units and their interactions based upon a general framework of hidden Markov models (HMM). Each segment model represents a class of acoustically similar sounds. The intra-segment variability of each sound class is modeled by an HMM, and the sound-to-sound transition rules are characterized by a probabilistic intersegment transition matrix. An acoustically-derived lexicon is used to construct word models based upon subword segment models. The proposed segment model was tested on a speaker-trained, isolated word, speech recognition task with a vocabulary of 1109 basic English words. In the current study, only 128 segment models were used, and recognition was performed by optimally aligning the test utterance with all acoustic lexicon entries using a maximum likelihood Viterbi decoding algorithm. Based upon a database of three male speakers, the average word recognition accuracy for the top candidate was 85% and increased to 96% and 98% for the top 3 and top 5 candidates, respectively.

132 citations

Proceedings ArticleDOI
23 May 1989
TL;DR: A unified framework is discussed which can be used to accomplish the goal of creating effective basic models of speech and points out the relative advantages of each type of speech unit based on the results of a series of recognition experiments.
Abstract: The problem of how to select and construct a set of fundamental unit statistical models suitable for speech recognition is addressed. A unified framework is discussed which can be used to accomplish the goal of creating effective basic models of speech. The performances of three types of fundamental units, namely whole word, phoneme-like, and acoustic segment units, in a 1109-word vocabulary speech recognition task are compared. The authors point out the relative advantages of each type of speech unit based on the results of a series of recognition experiments.

47 citations

Journal ArticleDOI
TL;DR: A unified system for automatically recognizing fluently spoken digit strings based on whole-word reference units is presented, which can use either hidden Markov model (HMM) technology or template-based technology and contains features from both approaches.
Abstract: Although a great deal of effort has gone into studying large-vocabulary speech-recognition problems, there remain a number of interesting, and potentially exceedingly important, problems which do not require the complexity of these large systems. One such problem is connected-digit recognition, which has applications to telecommunications, order entry, credit-card entry, forms automation, and data-base management, among others. Connected-digit recognition is also an interesting problem for another reason, namely that it is one in which whole-word training patterns are applicable as the basic speech-recognition unit. Thus one can bring to bear all the fundamental speech recognition technology associated with whole-word recognition to solve this problem. As such, several connected digit recognizers have been proposed in the past few years. The performance of these systems has steadily improved to the point where high digit-recognition accuracy is achievable in a speaker-trained mode. In this paper we present a unified system for automatically recognizing fluently spoken digit strings based on whole-word reference units. The system that we will describe can use either hidden Markov model (HMM) technology or template-based technology. In fact the overall system contains features from both approaches. A key factor in the success of the various connected digit recognizers is the ability to derive, via a training procedure, a good set of representations of the behavior of the individual digits in actual connected digit strings. For most applications, isolated digit training does not provide a good enough characterization of the variability of the digits in strings. The "best" training procedure is to derive the digit reference patterns (either templates or statistical models) from connected digit strings. Such a connected word training procedure, based on a segmental k-means loop, has been proposed and was tested on seven experienced users of speech recognizers. For these seven talkers, average string accuracies of greater than 98% for unknown length strings, and greater than 99% for known length strings were obtained on an independent test set of 525 variable length strings (1-7 digits) recorded over local dialed-up telephone lines. To evaluate the performance of the overall connected digit recognizer under more difficult conditions, a set of 50 people (25 men, 25 women), from the non-technical local population, was each asked to record 1200 random digit strings over local dialed-up telephone lines. Both a speaker-trained and a multi-speaker training set were created, and a full performance evaluation was made. Results show that the average string accuracy for unknown- and known-length strings, in the speaker-trained mode, was 98% and 99% respectively; in the multi-speaker mode the average string accuracies were 94% and 96.6% respectively. A complete analysis of these results is given in this paper.

43 citations