Journal ArticleDOI

Joint lexicon, acoustic unit inventory and model design

01 Nov 1999-Speech Communication (Elsevier Science Publishers B. V.)-Vol. 29, Iss: 2, pp 99-114
TL;DR: A joint solution to the related problems of learning a unit inventory and a corresponding lexicon from data; on a speaker-independent read speech task with a 1k vocabulary, the proposed algorithm outperforms phone-based systems at both high and low complexities.
About: This article was published in Speech Communication (Elsevier Science Publishers B. V.) on 1999-11-01 and has received 66 citations to date.
Citations
Journal ArticleDOI
TL;DR: It is shown how pattern discovery can be used to automatically acquire lexical entities directly from an untranscribed audio stream by exploiting the structure of repeating patterns within the speech signal.
Abstract: We present a novel approach to speech processing based on the principle of pattern discovery. Our work represents a departure from traditional models of speech recognition, where the end goal is to classify speech into categories defined by a prespecified inventory of lexical units (i.e., phones or words). Instead, we attempt to discover such an inventory in an unsupervised manner by exploiting the structure of repeating patterns within the speech signal. We show how pattern discovery can be used to automatically acquire lexical entities directly from an untranscribed audio stream. Our approach to unsupervised word acquisition utilizes a segmental variant of a widely used dynamic programming technique, which allows us to find matching acoustic patterns between spoken utterances. By aggregating information about these matching patterns across audio streams, we demonstrate how to group similar acoustic sequences together to form clusters corresponding to lexical entities such as words and short multiword phrases. On a corpus of academic lecture material, we demonstrate that clusters found using this technique exhibit high purity and that many of the corresponding lexical identities are relevant to the underlying audio stream.
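The matching step described in this abstract rests on dynamic time warping. As a minimal sketch, the classic full-sequence DTW alignment cost (not the segmental variant the authors use, and with made-up toy feature sequences) can be written as:

```python
# Classic dynamic time warping (DTW) between two feature sequences.
# This is standard full-sequence alignment; the cited work uses a
# *segmental* variant, which this sketch does not implement.

def dtw_distance(a, b):
    """Alignment cost between two sequences of feature vectors."""
    n, m = len(a), len(b)
    INF = float("inf")
    # cost[i][j] = best cost of aligning a[:i] with b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Euclidean distance between the two frames
            d = sum((x - y) ** 2 for x, y in zip(a[i - 1], b[j - 1])) ** 0.5
            cost[i][j] = d + min(cost[i - 1][j],      # skip a frame of a
                                 cost[i][j - 1],      # skip a frame of b
                                 cost[i - 1][j - 1])  # match both frames
    return cost[n][m]

# Toy 1-D "feature" sequences: the same pattern at different speeds.
u1 = [(0.0,), (1.0,), (2.0,), (1.0,), (0.0,)]
u2 = [(0.0,), (0.0,), (1.0,), (2.0,), (2.0,), (1.0,), (0.0,)]
print(dtw_distance(u1, u2))  # → 0.0 (every frame finds an exact match)
```

A pattern-discovery system would run such alignments between many utterance pairs and cluster the low-cost matching fragments into word-like units.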

363 citations


Cites methods from "Joint lexicon, acoustic unit invent..."

  • ...Bacchiani proposed a method for breaking words into smaller acoustic segments and clustering those segments to jointly determine acoustic subword units and word pronunciations [21]....


Journal ArticleDOI
TL;DR: This contribution provides an overview of the publications on pronunciation variation modeling in automatic speech recognition, paying particular attention to the papers in this special issue and the papers presented at 'the Rolduc workshop'.

259 citations


Cites background or methods or result from "Joint lexicon, acoustic unit invent..."

  • ...In general, these procedures seem to improve the performance of the ASR systems (Aubert and Dugast, 1995; Bacchiani and Ostendorf, 1998, 1999; Finke and Waibel, 1997; Kessens and Wester, 1997; Kessens et al., 1999; Riley et al., 1998, 1999; Schiel et al., 1998; Sloboda and Waibel, 1996; Wester et…...


  • ...Furthermore, in (Bacchiani and Ostendorf, 1998, 1999; Deng and Sun, 1994; Godfrey et al., 1997; Holter, 1997) the observed levels of performance are comparable to those of phone-based ASR systems (usually for limited tasks)....


  • ...In both (Bacchiani and Ostendorf, 1998, 1999) and (Holter, 1997) the optimization is done with a maximum likelihood criterion....


  • ...The idea behind data-driven methods is that information on pronunciation variation has to be obtained directly from the signals (Bacchiani and Ostendorf, 1998, 1999; Blackburn and Young, 1995, 1996; Cremelie and Martens, 1995, 1997, 1998, 1999; Fosler-Lussier and Morgan, 1998, 1999; Fukada and…...


  • ...However, it is also possible to allow an optimization procedure to decide what the optimal pronunciations in the lexicon and the optimal basic units (i.e., both their size and the corresponding acoustic models) are (Bacchiani and Ostendorf, 1998, 1999; Holter, 1997)....


Proceedings ArticleDOI
Samy Bengio1, Georg Heigold1
14 Sep 2014
TL;DR: This work presents an alternative construction, where words are projected into a continuous embedding space in which words that sound alike are nearby in the Euclidean sense, and shows how embeddings can still be used to score words that were not in the training dictionary.
Abstract: Speech recognition systems have used the concept of states as a way to decompose words into sub-word units for decades. As the number of such states now reaches the number of words used to train acoustic models, it is interesting to consider approaches that relax the assumption that words are made of states. We present here an alternative construction, where words are projected into a continuous embedding space in which words that sound alike are nearby in the Euclidean sense. We show how embeddings can still be used to score words that were not in the training dictionary. Initial experiments using a lattice rescoring approach and model combination on a large realistic dataset show improvements in word error rate.
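The "nearby in the Euclidean sense" idea above can be sketched with a toy nearest-neighbor lookup. The embeddings below are invented two-dimensional vectors, not learned ones, and the sketch omits the paper's key contribution of producing embeddings for out-of-dictionary words:

```python
# Sketch of scoring words by Euclidean nearness in an embedding space.
# The embeddings are made-up toy vectors; a real system would learn
# them so that similar-sounding words land close together.

import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical learned word embeddings (2-D for readability).
word_embeddings = {
    "cat": (0.9, 0.1),
    "cap": (0.8, 0.2),   # sounds like "cat", so it sits nearby
    "dog": (0.0, 1.0),
}

def best_word(acoustic_embedding, vocab=word_embeddings):
    """Pick the vocabulary word whose embedding is nearest the input."""
    return min(vocab, key=lambda w: euclidean(vocab[w], acoustic_embedding))

print(best_word((0.88, 0.12)))  # → cat
```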

166 citations


Cites background from "Joint lexicon, acoustic unit invent..."

  • ...Examples include grapheme-to-phoneme conversion [2], pronunciation learning [15, 10], and joint learning of phonetic units and word pronunciations [1, 9]....


01 Jan 1999
TL;DR: Problems with the phoneme as the basic subword unit in speech recognition are raised, suggesting that finer-grained control is needed to capture the sort of pronunciation variability observed in spontaneous speech.
Abstract: The notion that a word is composed of a sequence of phone segments, sometimes referred to as ‘beads on a string’, has formed the basis of most speech recognition work for over 15 years. However, as more researchers tackle spontaneous speech recognition tasks, that view is being called into question. This paper raises problems with the phoneme as the basic subword unit in speech recognition, suggesting that finer-grained control is needed to capture the sort of pronunciation variability observed in spontaneous speech. We offer two different alternatives – automatically derived subword units and linguistically motivated distinctive feature systems – and discuss current work in these directions. In addition, we look at problems that arise in acoustic modeling when trying to incorporate higher-level structure with these two strategies.

151 citations


Cites background from "Joint lexicon, acoustic unit invent..."

  • ...However, this problem has recently been addressed by integrating the unit and dictionary design step [14, 15], so that an ASWU system is now a viable option for speaker-independent recognition....


Journal ArticleDOI
TL;DR: It is shown that the model is competitive with state-of-the-art spoken term discovery systems, and analyses exploring the model’s behavior and the kinds of linguistic structures it learns are presented.
Abstract: We present a model of unsupervised phonological lexicon discovery -- the problem of simultaneously learning phoneme-like and word-like units from acoustic input. Our model builds on earlier models of unsupervised phone-like unit discovery from acoustic data (Lee and Glass, 2012), and unsupervised symbolic lexicon discovery using the Adaptor Grammar framework (Johnson et al., 2006), integrating these earlier approaches using a probabilistic model of phonological variation. We show that the model is competitive with state-of-the-art spoken term discovery systems, and present analyses exploring the model's behavior and the kinds of linguistic structures it learns.

112 citations


Cites background from "Joint lexicon, acoustic unit invent..."

  • ...Although some earlier systems have examined various parts of the joint learning problem (Bacchiani and Ostendorf, 1999; De Marcken, 1996b), to our knowledge, the only other system which addresses the entire problem is that of Chung et al....


References
Proceedings ArticleDOI
11 Apr 1988
TL;DR: A database of continuous read speech has been designed and recorded within the DARPA strategic computing speech recognition program for use in designing and evaluating algorithms for speaker-independent, speaker-adaptive and speaker-dependent speech recognition.
Abstract: A database of continuous read speech has been designed and recorded within the DARPA strategic computing speech recognition program. The data is intended for use in designing and evaluating algorithms for speaker-independent, speaker-adaptive and speaker-dependent speech recognition. The data consists of read sentences appropriate to a naval resource management task built around existing interactive database and graphics programs. The 1000-word task vocabulary is intended to be logically complete and habitable. The database, which represents over 21,000 recorded utterances from 160 talkers with a variety of dialects, includes a partition of sentences and talkers for training and for testing purposes.

393 citations


"Joint lexicon, acoustic unit invent..." refers methods in this paper

  • ...Experiments were conducted on the DARPA Resource Management corpus (Price et al., 1988), which is a read speech corpus with a 991 word vocabulary....


Journal ArticleDOI
TL;DR: A method for designing HMM topologies that learn both temporal and contextual variation, extending previous work on successive state splitting (SSS) and using a maximum likelihood criterion consistently at each step is described.

157 citations


"Joint lexicon, acoustic unit invent..." refers background in this paper

  • ...Second, even phone-based systems benefit from progressive techniques for increasing the acoustic model complexity, both in terms of contextual and temporal resolution (Woodland and Young, 1993; Takami and Sagayama, 1992; Ostendorf and Singer, 1997)....

Proceedings ArticleDOI
06 Apr 1987
TL;DR: Three different approaches for automatically segmenting speech into phonetic units are described: one based on template matching, one based on detecting the spectral changes that occur at the boundaries between phonetic units, and one based on a constrained-clustering vector quantization approach.
Abstract: For large vocabulary and continuous speech recognition, the sub-word-unit-based approach is a viable alternative to the whole-word-unit-based approach. For preparing a large inventory of subword units, an automatic segmentation is preferable to manual segmentation as it substantially reduces the work associated with the generation of templates and gives more consistent results. In this paper we discuss some methods for automatically segmenting speech into phonetic units. Three different approaches are described, one based on template matching, one based on detecting the spectral changes that occur at the boundaries between phonetic units and one based on a constrained-clustering vector quantization approach. An evaluation of the performance of the automatic segmentation methods is given.
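Of the three approaches in this abstract, spectral-change detection is the simplest to illustrate: place a boundary wherever the frame-to-frame feature distance peaks above a threshold. The frames and threshold below are invented toy values, not the paper's features:

```python
# Toy sketch of boundary detection by spectral change: cut wherever the
# distance between consecutive feature frames exceeds a threshold.

def boundaries(frames, threshold):
    """Indices where the change between consecutive frames exceeds threshold."""
    cuts = []
    for i in range(1, len(frames)):
        # L1 distance between consecutive frames as the "spectral change"
        change = sum(abs(a - b) for a, b in zip(frames[i - 1], frames[i]))
        if change > threshold:
            cuts.append(i)
    return cuts

# Three steady regions with abrupt changes between them.
frames = [(1.0, 0.0)] * 3 + [(0.0, 1.0)] * 3 + [(1.0, 1.0)] * 3
print(boundaries(frames, threshold=0.5))  # → [3, 6]
```

A real segmenter would smooth the change function and pick local maxima rather than thresholding raw frame differences.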

156 citations


"Joint lexicon, acoustic unit invent..." refers methods in this paper

  • ...Taking an approach similar to that in (Svendsen and Soong, 1987; Paliwal, 1990), the maximum likelihood segmentation of the training data is found by the use of dynamic programming....


Proceedings ArticleDOI
23 Mar 1992
TL;DR: The authors propose an algorithm, successive state splitting (SSS), for simultaneously finding an optimal set of phoneme context classes, an optimal topology, and optimal parameters for hidden Markov models (HMMs) commonly using a maximum likelihood criterion.
Abstract: The authors propose an algorithm, successive state splitting (SSS), for simultaneously finding an optimal set of phoneme context classes, an optimal topology, and optimal parameters for hidden Markov models (HMMs), commonly using a maximum likelihood criterion. With this algorithm, a hidden Markov network (HM-Net), which is an efficient representation of phoneme-context-dependent HMMs, can be generated automatically. The authors implemented this algorithm and tested it on the recognition of six Japanese consonants (/b/, /d/, /g/, /m/, /n/ and /N/). The HM-Net gave better recognition results with a lower number of total output probability density distributions than conventional phoneme-context-independent mixture Gaussian density HMMs.
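The maximum likelihood criterion that drives SSS can be illustrated in a greatly simplified form: compare the log likelihood of modelling a set of 1-D observations with one Gaussian versus two, split at the best threshold. Real SSS splits HMM states in contextual and temporal directions; this sketch, with invented data, only shows the likelihood comparison:

```python
# Simplified ML splitting criterion: gain in log likelihood from
# modelling sorted 1-D data with two Gaussians instead of one.

import math

def gauss_loglik(xs):
    """Log likelihood of xs under their own ML Gaussian fit."""
    n = len(xs)
    mean = sum(xs) / n
    var = max(sum((x - mean) ** 2 for x in xs) / n, 1e-6)  # variance floor
    return sum(-0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
               for x in xs)

def best_split_gain(xs):
    """Max log-likelihood gain from splitting xs into two Gaussians."""
    xs = sorted(xs)
    base = gauss_loglik(xs)
    best = 0.0
    for k in range(2, len(xs) - 1):          # need >= 2 points per side
        gain = gauss_loglik(xs[:k]) + gauss_loglik(xs[k:]) - base
        best = max(best, gain)
    return best

# Bimodal data: splitting clearly helps, so the gain is large.
data = [0.0, 0.1, -0.1, 0.05, 5.0, 5.1, 4.9, 5.05]
print(round(best_split_gain(data), 1))
```

SSS repeatedly applies this kind of comparison, splitting whichever state yields the largest likelihood gain until the model reaches the desired complexity.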

148 citations


"Joint lexicon, acoustic unit invent..." refers background in this paper

  • ...Second, even phone-based systems benefit from progressive techniques for increasing the acoustic model complexity, both in terms of contextual and temporal resolution (Woodland and Young, 1993; Takami and Sagayama, 1992; Ostendorf and Singer, 1997)....

Proceedings Article
23 Sep 1993

125 citations


"Joint lexicon, acoustic unit invent..." refers methods in this paper

  • ...These larger inventories are typically obtained using automatic clustering of models with different phonetic contexts, e.g. (Young and Woodland, 1993)....
