Journal ArticleDOI

Joint lexicon, acoustic unit inventory and model design

01 Nov 1999-Speech Communication (Elsevier Science Publishers B. V.)-Vol. 29, Iss: 2, pp 99-114
TL;DR: This paper proposes a joint solution to the related problems of learning a unit inventory and a corresponding lexicon from data; on a speaker-independent read speech task with a 1k vocabulary, the proposed algorithm outperforms phone-based systems at both high and low complexities.
Abstract: Although most parameters in a speech recognition system are estimated from data by the use of an objective function, the unit inventory and lexicon are generally hand-crafted and therefore unlikely to be optimal. This paper proposes a joint solution to the related problems of learning a unit inventory and corresponding lexicon from data. On a speaker-independent read speech task with a 1k vocabulary, the proposed algorithm outperforms phone-based systems at both high and low complexities.
Citations
Journal ArticleDOI
TL;DR: It is shown how pattern discovery can be used to automatically acquire lexical entities directly from an untranscribed audio stream by exploiting the structure of repeating patterns within the speech signal.
Abstract: We present a novel approach to speech processing based on the principle of pattern discovery. Our work represents a departure from traditional models of speech recognition, where the end goal is to classify speech into categories defined by a prespecified inventory of lexical units (i.e., phones or words). Instead, we attempt to discover such an inventory in an unsupervised manner by exploiting the structure of repeating patterns within the speech signal. We show how pattern discovery can be used to automatically acquire lexical entities directly from an untranscribed audio stream. Our approach to unsupervised word acquisition utilizes a segmental variant of a widely used dynamic programming technique, which allows us to find matching acoustic patterns between spoken utterances. By aggregating information about these matching patterns across audio streams, we demonstrate how to group similar acoustic sequences together to form clusters corresponding to lexical entities such as words and short multiword phrases. On a corpus of academic lecture material, we demonstrate that clusters found using this technique exhibit high purity and that many of the corresponding lexical identities are relevant to the underlying audio stream.
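The segmental dynamic programming this abstract refers to is a variant of classic dynamic time warping (DTW). As an illustrative sketch only (not the authors' segmental implementation, and with invented function names), plain DTW between two frame-level feature sequences can be written as:

```python
import math

def dtw_distance(x, y):
    """Dynamic-time-warping distance between two feature sequences.

    x, y: lists of equal-dimension feature vectors (lists of floats),
    e.g. frame-level MFCCs. Returns the cumulative frame distance
    along the best monotonic alignment path.
    """
    def frame_dist(a, b):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

    T1, T2 = len(x), len(y)
    acc = [[math.inf] * (T2 + 1) for _ in range(T1 + 1)]
    acc[0][0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            # Step pattern: diagonal match, insertion, deletion.
            acc[i][j] = frame_dist(x[i - 1], y[j - 1]) + min(
                acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
    return acc[T1][T2]
```

In the pattern-discovery setting, utterance pairs whose subsequences align with low DTW cost become candidates for clustering into word-like units.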

363 citations


Cites methods from "Joint lexicon, acoustic unit invent..."

  • ...Bacchiani proposed a method for breaking words into smaller acoustic segments and clustering those segments to jointly determine acoustic subword units and word pronunciations [21]....


Journal ArticleDOI
TL;DR: This contribution provides an overview of the publications on pronunciation variation modeling in automatic speech recognition, paying particular attention to the papers in this special issue and the papers presented at 'the Rolduc workshop'.
Abstract: The focus in automatic speech recognition (ASR) research has gradually shifted from isolated words to conversational speech. Consequently, the amount of pronunciation variation present in the speech under study has gradually increased. Pronunciation variation will deteriorate the performance of an ASR system if it is not well accounted for. This is probably the main reason why research on modeling pronunciation variation for ASR has increased lately. In this contribution, we provide an overview of the publications on this topic, paying particular attention to the papers in this special issue and the papers presented at 'the Rolduc workshop'. (Whenever we mention 'the Rolduc workshop' we refer to the ESCA Tutorial and Research Workshop "Modeling pronunciation variation for ASR", held in Rolduc from 4 to 6 May 1998; this special issue of Speech Communication contains a selection of papers presented at that workshop.) First, the most important characteristics that distinguish the various studies on pronunciation variation modeling are discussed. Subsequently, the issues of evaluation and comparison are addressed. Particular attention is paid to some of the most important factors that make it difficult to compare the different methods in an objective way. Finally, some conclusions are drawn as to the importance of objective evaluation and the way in which it could be carried out.

259 citations


Cites background or methods or result from "Joint lexicon, acoustic unit invent..."

  • ...In general, these procedures seem to improve the performance of the ASR systems (Aubert and Dugast, 1995; Bacchiani and Ostendorf, 1998, 1999; Finke and Waibel, 1997; Kessens and Wester, 1997; Kessens et al., 1999; Riley et al., 1998, 1999; Schiel et al., 1998; Sloboda and Waibel, 1996; Wester et…...


  • ...Furthermore, in (Bacchiani and Ostendorf, 1998, 1999; Deng and Sun, 1994; Godfrey et al., 1997; Holter, 1997) the observed levels of performance are comparable to those of phone-based ASR systems (usually for limited tasks)....


  • ...In both (Bacchiani and Ostendorf, 1998, 1999) and (Holter, 1997) the optimization is done with a maximum likelihood criterion....


  • ...The idea behind data-driven methods is that information on pronunciation variation has to be obtained directly from the signals (Bacchiani and Ostendorf, 1998, 1999; Blackburn and Young, 1995, 1996; Cremelie and Martens, 1995, 1997, 1998, 1999; Fosler-Lussier and Morgan, 1998, 1999; Fukada and…...


  • ...However, it is also possible to allow an optimization procedure to decide what the optimal pronunciations in the lexicon and the optimal basic units (i.e., both their size and the corresponding acoustic models) are (Bacchiani and Ostendorf, 1998, 1999; Holter, 1997)....


Proceedings ArticleDOI
Samy Bengio, Georg Heigold
14 Sep 2014
TL;DR: This work presents an alternative construction in which words are projected into a continuous embedding space where words that sound alike are nearby in the Euclidean sense, and shows how the embeddings can still score words that were not in the training dictionary.
Abstract: Speech recognition systems have used the concept of states as a way to decompose words into sub-word units for decades. As the number of such states now reaches the number of words used to train acoustic models, it is interesting to consider approaches that relax the assumption that words are made of states. We present here an alternative construction, where words are projected into a continuous embedding space where words that sound alike are nearby in the Euclidean sense. We show how the embeddings still allow scoring of words that were not in the training dictionary. Initial experiments using a lattice rescoring approach and model combination on a large realistic dataset show improvements in word error rate.
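As a hedged illustration of the scoring idea (not the paper's lattice rescoring setup; the function and variable names here are invented for the sketch), ranking candidate words by Euclidean distance in the embedding space could look like this:

```python
import math

def score_words(acoustic_vec, word_embeddings):
    """Rank candidate words by Euclidean distance to an acoustic embedding.

    acoustic_vec: embedding of an audio segment (list of floats).
    word_embeddings: dict mapping word -> embedding. Because the space
    groups words that sound alike, a word absent from the training
    dictionary can still be scored once an embedding is produced for it
    (e.g. from its letter string).
    """
    def dist(e):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(acoustic_vec, e)))

    # Best-scoring (nearest) word first.
    return sorted(word_embeddings, key=lambda w: dist(word_embeddings[w]))
```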

166 citations


Cites background from "Joint lexicon, acoustic unit invent..."

  • ...Examples include grapheme-to-phoneme conversion [2], pronunciation learning [15, 10], and joint learning of phonetic units and word pronunciations [1, 9]....


01 Jan 1999
TL;DR: Problems with the phoneme as the basic subword unit in speech recognition are raised, suggesting that finer-grained control is needed to capture the sort of pronunciation variability observed in spontaneous speech.
Abstract: The notion that a word is composed of a sequence of phone segments, sometimes referred to as ‘beads on a string’, has formed the basis of most speech recognition work for over 15 years. However, as more researchers tackle spontaneous speech recognition tasks, that view is being called into question. This paper raises problems with the phoneme as the basic subword unit in speech recognition, suggesting that finer-grained control is needed to capture the sort of pronunciation variability observed in spontaneous speech. We offer two different alternatives – automatically derived subword units and linguistically motivated distinctive feature systems – and discuss current work in these directions. In addition, we look at problems that arise in acoustic modeling when trying to incorporate higher-level structure with these two strategies.

151 citations


Cites background from "Joint lexicon, acoustic unit invent..."

  • ...However, this problem has recently been addressed by integrating the unit and dictionary design step [14, 15], so that an ASWU system is now a viable option for speaker-independent recognition....


Journal ArticleDOI
TL;DR: It is shown that the model is competitive with state-of-the-art spoken term discovery systems, and analyses exploring the model’s behavior and the kinds of linguistic structures it learns are presented.
Abstract: We present a model of unsupervised phonological lexicon discovery -- the problem of simultaneously learning phoneme-like and word-like units from acoustic input. Our model builds on earlier models of unsupervised phone-like unit discovery from acoustic data (Lee and Glass, 2012), and unsupervised symbolic lexicon discovery using the Adaptor Grammar framework (Johnson et al., 2006), integrating these earlier approaches using a probabilistic model of phonological variation. We show that the model is competitive with state-of-the-art spoken term discovery systems, and present analyses exploring the model's behavior and the kinds of linguistic structures it learns.

112 citations


Cites background from "Joint lexicon, acoustic unit invent..."

  • ...Although some earlier systems have examined various parts of the joint learning problem (Bacchiani and Ostendorf, 1999; De Marcken, 1996b), to our knowledge, the only other system which addresses the entire problem is that of Chung et al....


References
Proceedings ArticleDOI
11 Apr 1988
TL;DR: A database of continuous read speech has been designed and recorded within the DARPA strategic computing speech recognition program for use in designing and evaluating algorithms for speaker-independent, speaker-adaptive and speaker-dependent speech recognition.
Abstract: A database of continuous read speech has been designed and recorded within the DARPA strategic computing speech recognition program. The data is intended for use in designing and evaluating algorithms for speaker-independent, speaker-adaptive and speaker-dependent speech recognition. The data consists of read sentences appropriate to a naval resource management task built around existing interactive database and graphics programs. The 1000-word task vocabulary is intended to be logically complete and habitable. The database, which represents over 21,000 recorded utterances from 160 talkers with a variety of dialects, includes a partition of sentences and talkers for training and for testing purposes.

393 citations


"Joint lexicon, acoustic unit invent..." refers methods in this paper

  • ...Experiments were conducted on the DARPA Resource Management corpus (Price et al., 1988), which is a read speech corpus with a 991 word vocabulary....


Journal ArticleDOI
TL;DR: A method for designing HMM topologies that learn both temporal and contextual variation is described, extending previous work on successive state splitting (SSS) and using a maximum likelihood criterion consistently at each step.
Abstract: Modelling contextual variations of phones is widely accepted as an important aspect of a continuous speech recognition system, and HMM distribution clustering has been successfully used to obtain robust models of context through distribution tying. However, as systems move to the challenge of spontaneous speech, temporal variation also becomes important. This paper describes a method for designing HMM topologies that learn both temporal and contextual variation, extending previous work on successive state splitting (SSS). The new approach uses a maximum likelihood criterion consistently at each step, overcoming the previous SSS limitation to speaker-dependent training. Initial experiments show both performance gains and training cost reduction over SSS with the reformulated algorithm.

157 citations


"Joint lexicon, acoustic unit invent..." refers background in this paper

  • ...Second, even phone-based systems benefit from progressive techniques for increasing the acoustic model complexity, both in terms of contextual and temporal resolution (Woodland and Young, 1993; Takami and Sagayama, 1992; Ostendorf and Singer, 1997)....


Proceedings ArticleDOI
06 Apr 1987
TL;DR: Three different approaches for automatically segmenting speech into phonetic units are described: one based on template matching, one based on detecting the spectral changes that occur at the boundaries between phonetic units, and one based on a constrained-clustering vector quantization approach.
Abstract: For large vocabulary and continuous speech recognition, the sub-word-unit-based approach is a viable alternative to the whole-word-unit-based approach. For preparing a large inventory of subword units, an automatic segmentation is preferable to manual segmentation as it substantially reduces the work associated with the generation of templates and gives more consistent results. In this paper we discuss some methods for automatically segmenting speech into phonetic units. Three different approaches are described, one based on template matching, one based on detecting the spectral changes that occur at the boundaries between phonetic units and one based on a constrained-clustering vector quantization approach. An evaluation of the performance of the automatic segmentation methods is given.
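A minimal sketch of the second idea, hypothesizing boundaries where the spectrum changes sharply, might look as follows (the feature representation and threshold are assumptions for illustration, not the paper's settings):

```python
import math

def boundary_frames(frames, threshold):
    """Candidate phonetic-unit boundaries from spectral change.

    frames: list of spectral feature vectors (lists of floats), one per
    analysis frame. Returns the indices i at which the Euclidean
    distance between frame i-1 and frame i exceeds the threshold,
    i.e. the points of large spectral change.
    """
    def dist(a, b):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

    return [i for i in range(1, len(frames))
            if dist(frames[i - 1], frames[i]) > threshold]
```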

156 citations


"Joint lexicon, acoustic unit invent..." refers methods in this paper

  • ...Taking an approach similar to that in (Svendsen and Soong, 1987; Paliwal, 1990), the maximum likelihood segmentation of the training data is found by the use of dynamic programming....


Proceedings ArticleDOI
23 Mar 1992
TL;DR: The authors propose an algorithm, successive state splitting (SSS), for simultaneously finding an optimal set of phoneme context classes, an optimal topology, and optimal parameters for hidden Markov models (HMMs) commonly using a maximum likelihood criterion.
Abstract: The authors propose an algorithm, successive state splitting (SSS), for simultaneously finding an optimal set of phoneme context classes, an optimal topology, and optimal parameters for hidden Markov models (HMMs) commonly using a maximum likelihood criterion. With this algorithm, a hidden Markov network (HM-Net), which is an efficient representation of phoneme-context-dependent HMMs, can be generated automatically. The authors implemented this algorithm, and tested it on the recognition of six Japanese consonants (/b/, /d/, /g/, /m/, /n/ and /N/). The HM-Net gave better recognition results with a lower number of total output probability density distributions than conventional phoneme-context-independent mixture Gaussian density HMMs.
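The core of SSS is choosing splits by likelihood gain. The toy sketch below uses a simple 2-means partition of one state's 1-D data as a stand-in for the ML split search; it is only a hedged illustration of the criterion, since the real algorithm splits HMM states along contextual or temporal axes and refits the whole network:

```python
import math

def gaussian_loglik(xs):
    """Log-likelihood of 1-D data under its own ML Gaussian fit."""
    n = len(xs)
    mu = sum(xs) / n
    # Variance floor guards against degenerate single-value clusters.
    var = max(sum((x - mu) ** 2 for x in xs) / n, 1e-6)
    return sum(-0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
               for x in xs)

def split_gain(xs, iters=10):
    """Log-likelihood gain from splitting one state's data in two.

    A 2-means partition stands in for the ML split search; a positive
    gain suggests the state is worth splitting.
    """
    c = [min(xs), max(xs)]  # initial centroids at the data extremes
    for _ in range(iters):
        assign = [0 if abs(x - c[0]) <= abs(x - c[1]) else 1 for x in xs]
        for k in (0, 1):
            cluster = [x for x, a in zip(xs, assign) if a == k]
            if cluster:
                c[k] = sum(cluster) / len(cluster)
    split_ll = sum(gaussian_loglik([x for x, a in zip(xs, assign) if a == k])
                   for k in (0, 1) if any(a == k for a in assign))
    return split_ll - gaussian_loglik(xs)
```

Repeatedly splitting the state with the largest gain, then retraining, is the greedy loop that SSS (and its maximum-likelihood reformulation above) builds on.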

148 citations


"Joint lexicon, acoustic unit invent..." refers background in this paper

  • ...Second, even phone-based systems benefit from progressive techniques for increasing the acoustic model complexity, both in terms of contextual and temporal resolution (Woodland and Young, 1993; Takami and Sagayama, 1992; Ostendorf and Singer, 1997)....


Proceedings Article
23 Sep 1993

125 citations


"Joint lexicon, acoustic unit invent..." refers methods in this paper

  • ...These larger inventories are typically obtained using automatic clustering of models with different phonetic contexts, e.g. (Young and Woodland, 1993)....
