Journal ArticleDOI

Joint lexicon, acoustic unit inventory and model design

01 Nov 1999-Speech Communication (Elsevier Science Publishers B. V.)-Vol. 29, Iss: 2, pp 99-114
TL;DR: This paper proposes a joint solution to the related problems of learning a unit inventory and a corresponding lexicon from data; on a speaker-independent read speech task with a 1k vocabulary, the proposed algorithm outperforms phone-based systems at both high and low complexities.
Abstract: Although most parameters in a speech recognition system are estimated from data by the use of an objective function, the unit inventory and lexicon are generally hand-crafted and therefore unlikely to be optimal. This paper proposes a joint solution to the related problems of learning a unit inventory and corresponding lexicon from data. On a speaker-independent read speech task with a 1k vocabulary, the proposed algorithm outperforms phone-based systems at both high and low complexities.
Citations
Journal ArticleDOI
TL;DR: It is shown how pattern discovery can be used to automatically acquire lexical entities directly from an untranscribed audio stream by exploiting the structure of repeating patterns within the speech signal.
Abstract: We present a novel approach to speech processing based on the principle of pattern discovery. Our work represents a departure from traditional models of speech recognition, where the end goal is to classify speech into categories defined by a prespecified inventory of lexical units (i.e., phones or words). Instead, we attempt to discover such an inventory in an unsupervised manner by exploiting the structure of repeating patterns within the speech signal. We show how pattern discovery can be used to automatically acquire lexical entities directly from an untranscribed audio stream. Our approach to unsupervised word acquisition utilizes a segmental variant of a widely used dynamic programming technique, which allows us to find matching acoustic patterns between spoken utterances. By aggregating information about these matching patterns across audio streams, we demonstrate how to group similar acoustic sequences together to form clusters corresponding to lexical entities such as words and short multiword phrases. On a corpus of academic lecture material, we demonstrate that clusters found using this technique exhibit high purity and that many of the corresponding lexical identities are relevant to the underlying audio stream.
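The segmental dynamic programming this abstract refers to is a variant of classic dynamic time warping (DTW). As an illustrative sketch only (not the authors' segmental implementation, and with invented function names), plain DTW between two frame-level feature sequences can be written as:

```python
import math

def dtw_distance(x, y):
    """Dynamic-time-warping distance between two feature sequences.

    x, y: lists of equal-dimension feature vectors (lists of floats),
    e.g. frame-level MFCCs. Returns the cumulative frame distance
    along the best monotonic alignment path.
    """
    def frame_dist(a, b):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

    T1, T2 = len(x), len(y)
    acc = [[math.inf] * (T2 + 1) for _ in range(T1 + 1)]
    acc[0][0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            # Step pattern: diagonal match, insertion, deletion.
            acc[i][j] = frame_dist(x[i - 1], y[j - 1]) + min(
                acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
    return acc[T1][T2]
```

In the pattern-discovery setting, utterance pairs whose subsequences align with low DTW cost become candidates for clustering into word-like units.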

363 citations


Cites methods from "Joint lexicon, acoustic unit invent..."

  • ...Bacchiani proposed a method for breaking words into smaller acoustic segments and clustering those segments to jointly determine acoustic subword units and word pronunciations [21]....


Journal ArticleDOI
TL;DR: This contribution provides an overview of the publications on pronunciation variation modeling in automatic speech recognition, paying particular attention to the papers in this special issue and the papers presented at 'the Rolduc workshop'.
Abstract: The focus in automatic speech recognition (ASR) research has gradually shifted from isolated words to conversational speech. Consequently, the amount of pronunciation variation present in the speech under study has gradually increased. Pronunciation variation will deteriorate the performance of an ASR system if it is not well accounted for. This is probably the main reason why research on modeling pronunciation variation for ASR has increased lately. In this contribution, we provide an overview of the publications on this topic, paying particular attention to the papers in this special issue and the papers presented at 'the Rolduc workshop'. (Whenever we mention 'the Rolduc workshop' we refer to the ESCA Tutorial and Research Workshop "Modeling pronunciation variation for ASR", held in Rolduc from 4 to 6 May 1998; this special issue of Speech Communication contains a selection of papers presented at that workshop.) First, the most important characteristics that distinguish the various studies on pronunciation variation modeling are discussed. Subsequently, the issues of evaluation and comparison are addressed. Particular attention is paid to some of the most important factors that make it difficult to compare the different methods in an objective way. Finally, some conclusions are drawn as to the importance of objective evaluation and the way in which it could be carried out.

259 citations


Cites background or methods or result from "Joint lexicon, acoustic unit invent..."

  • ...In general, these procedures seem to improve the performance of the ASR systems (Aubert and Dugast, 1995; Bacchiani and Ostendorf, 1998, 1999; Finke and Waibel, 1997; Kessens and Wester, 1997; Kessens et al., 1999; Riley et al., 1998, 1999; Schiel et al., 1998; Sloboda and Waibel, 1996; Wester et…...


  • ...Furthermore, in (Bacchiani and Ostendorf, 1998, 1999; Deng and Sun, 1994; Godfrey et al., 1997; Holter, 1997) the observed levels of performance are comparable to those of phone-based ASR systems (usually for limited tasks)....


  • ...In both (Bacchiani and Ostendorf, 1998, 1999) and (Holter, 1997) the optimization is done with a maximum likelihood criterion....


  • ...The idea behind data-driven methods is that information on pronunciation variation has to be obtained directly from the signals (Bacchiani and Ostendorf, 1998, 1999; Blackburn and Young, 1995, 1996; Cremelie and Martens, 1995, 1997, 1998, 1999; Fosler-Lussier and Morgan, 1998, 1999; Fukada and…...


  • ...However, it is also possible to allow an optimization procedure to decide what the optimal pronunciations in the lexicon and the optimal basic units (i.e., both their size and the corresponding acoustic models) are (Bacchiani and Ostendorf, 1998, 1999; Holter, 1997)....


Proceedings ArticleDOI
Samy Bengio, Georg Heigold
14 Sep 2014
TL;DR: This work presents an alternative construction in which words are projected into a continuous embedding space where words that sound alike are nearby in the Euclidean sense, and shows how the embeddings can still score words that were not in the training dictionary.
Abstract: Speech recognition systems have used the concept of states as a way to decompose words into sub-word units for decades. As the number of such states now reaches the number of words used to train acoustic models, it is interesting to consider approaches that relax the assumption that words are made of states. We present here an alternative construction, where words are projected into a continuous embedding space where words that sound alike are nearby in the Euclidean sense. We show how the embeddings still allow scoring of words that were not in the training dictionary. Initial experiments using a lattice rescoring approach and model combination on a large realistic dataset show improvements in word error rate.
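As a hedged illustration of the scoring idea (not the paper's lattice rescoring setup; the function and variable names here are invented for the sketch), ranking candidate words by Euclidean distance in the embedding space could look like this:

```python
import math

def score_words(acoustic_vec, word_embeddings):
    """Rank candidate words by Euclidean distance to an acoustic embedding.

    acoustic_vec: embedding of an audio segment (list of floats).
    word_embeddings: dict mapping word -> embedding. Because the space
    groups words that sound alike, a word absent from the training
    dictionary can still be scored once an embedding is produced for it
    (e.g. from its letter string).
    """
    def dist(e):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(acoustic_vec, e)))

    # Best-scoring (nearest) word first.
    return sorted(word_embeddings, key=lambda w: dist(word_embeddings[w]))
```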

166 citations


Cites background from "Joint lexicon, acoustic unit invent..."

  • ...Examples include grapheme-to-phoneme conversion [2], pronunciation learning [15, 10], and joint learning of phonetic units and word pronunciations [1, 9]....


01 Jan 1999
TL;DR: Problems with the phoneme as the basic subword unit in speech recognition are raised, suggesting that finer-grained control is needed to capture the sort of pronunciation variability observed in spontaneous speech.
Abstract: The notion that a word is composed of a sequence of phone segments, sometimes referred to as ‘beads on a string’, has formed the basis of most speech recognition work for over 15 years. However, as more researchers tackle spontaneous speech recognition tasks, that view is being called into question. This paper raises problems with the phoneme as the basic subword unit in speech recognition, suggesting that finer-grained control is needed to capture the sort of pronunciation variability observed in spontaneous speech. We offer two different alternatives – automatically derived subword units and linguistically motivated distinctive feature systems – and discuss current work in these directions. In addition, we look at problems that arise in acoustic modeling when trying to incorporate higher-level structure with these two strategies.

151 citations


Cites background from "Joint lexicon, acoustic unit invent..."

  • ...However, this problem has recently been addressed by integrating the unit and dictionary design step [14, 15], so that an ASWU system is now a viable option for speaker-independent recognition....


Journal ArticleDOI
TL;DR: It is shown that the model is competitive with state-of-the-art spoken term discovery systems, and analyses exploring the model’s behavior and the kinds of linguistic structures it learns are presented.
Abstract: We present a model of unsupervised phonological lexicon discovery -- the problem of simultaneously learning phoneme-like and word-like units from acoustic input. Our model builds on earlier models of unsupervised phone-like unit discovery from acoustic data (Lee and Glass, 2012), and unsupervised symbolic lexicon discovery using the Adaptor Grammar framework (Johnson et al., 2006), integrating these earlier approaches using a probabilistic model of phonological variation. We show that the model is competitive with state-of-the-art spoken term discovery systems, and present analyses exploring the model's behavior and the kinds of linguistic structures it learns.

112 citations


Cites background from "Joint lexicon, acoustic unit invent..."

  • ...Although some earlier systems have examined various parts of the joint learning problem (Bacchiani and Ostendorf, 1999; De Marcken, 1996b), to our knowledge, the only other system which addresses the entire problem is that of Chung et al....


References
Proceedings ArticleDOI
11 Apr 1988
TL;DR: A database of continuous read speech has been designed and recorded within the DARPA strategic computing speech recognition program for use in designing and evaluating algorithms for speaker-independent, speaker-adaptive and speaker-dependent speech recognition.
Abstract: A database of continuous read speech has been designed and recorded within the DARPA strategic computing speech recognition program. The data is intended for use in designing and evaluating algorithms for speaker-independent, speaker-adaptive and speaker-dependent speech recognition. The data consists of read sentences appropriate to a naval resource management task built around existing interactive database and graphics programs. The 1000-word task vocabulary is intended to be logically complete and habitable. The database, which represents over 21,000 recorded utterances from 160 talkers with a variety of dialects, includes a partition of sentences and talkers for training and for testing purposes.

393 citations


"Joint lexicon, acoustic unit invent..." refers methods in this paper

  • ...Experiments were conducted on the DARPA Resource Management corpus (Price et al., 1988), which is a read speech corpus with a 991 word vocabulary....


Journal ArticleDOI
TL;DR: A method for designing HMM topologies that learn both temporal and contextual variation is described, extending previous work on successive state splitting (SSS) and using a maximum likelihood criterion consistently at each step.
Abstract: Modelling contextual variations of phones is widely accepted as an important aspect of a continuous speech recognition system, and HMM distribution clustering has been successfully used to obtain robust models of context through distribution tying. However, as systems move to the challenge of spontaneous speech, temporal variation also becomes important. This paper describes a method for designing HMM topologies that learn both temporal and contextual variation, extending previous work on successive state splitting (SSS). The new approach uses a maximum likelihood criterion consistently at each step, overcoming the previous SSS limitation to speaker-dependent training. Initial experiments show both performance gains and training cost reduction over SSS with the reformulated algorithm.

157 citations


"Joint lexicon, acoustic unit invent..." refers background in this paper

  • ...Second, even phone-based systems benefit from progressive techniques for increasing the acoustic model complexity, both in terms of contextual and temporal resolution (Woodland and Young, 1993; Takami and Sagayama, 1992; Ostendorf and Singer, 1997)....


Proceedings ArticleDOI
06 Apr 1987
TL;DR: Three different approaches for automatically segmenting speech into phonetic units are described: one based on template matching, one based on detecting the spectral changes that occur at the boundaries between phonetic units, and one based on a constrained-clustering vector quantization approach.
Abstract: For large vocabulary and continuous speech recognition, the sub-word-unit-based approach is a viable alternative to the whole-word-unit-based approach. For preparing a large inventory of subword units, an automatic segmentation is preferable to manual segmentation as it substantially reduces the work associated with the generation of templates and gives more consistent results. In this paper we discuss some methods for automatically segmenting speech into phonetic units. Three different approaches are described, one based on template matching, one based on detecting the spectral changes that occur at the boundaries between phonetic units and one based on a constrained-clustering vector quantization approach. An evaluation of the performance of the automatic segmentation methods is given.
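A minimal sketch of the second idea, hypothesizing boundaries where the spectrum changes sharply, might look as follows (the feature representation and threshold are assumptions for illustration, not the paper's settings):

```python
import math

def boundary_frames(frames, threshold):
    """Candidate phonetic-unit boundaries from spectral change.

    frames: list of spectral feature vectors (lists of floats), one per
    analysis frame. Returns the indices i at which the Euclidean
    distance between frame i-1 and frame i exceeds the threshold,
    i.e. the points of large spectral change.
    """
    def dist(a, b):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

    return [i for i in range(1, len(frames))
            if dist(frames[i - 1], frames[i]) > threshold]
```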

156 citations


"Joint lexicon, acoustic unit invent..." refers methods in this paper

  • ...Taking an approach similar to that in (Svendsen and Soong, 1987; Paliwal, 1990), the maximum likelihood segmentation of the training data is found by the use of dynamic programming....


Proceedings ArticleDOI
23 Mar 1992
TL;DR: The authors propose an algorithm, successive state splitting (SSS), for simultaneously finding an optimal set of phoneme context classes, an optimal topology, and optimal parameters for hidden Markov models (HMMs) commonly using a maximum likelihood criterion.
Abstract: The authors propose an algorithm, successive state splitting (SSS), for simultaneously finding an optimal set of phoneme context classes, an optimal topology, and optimal parameters for hidden Markov models (HMMs) commonly using a maximum likelihood criterion. With this algorithm, a hidden Markov network (HM-Net), which is an efficient representation of phoneme-context-dependent HMMs, can be generated automatically. The authors implemented this algorithm, and tested it on the recognition of six Japanese consonants (/b/, /d/, /g/, /m/, /n/ and /N/). The HM-Net gave better recognition results with a lower number of total output probability density distributions than conventional phoneme-context-independent mixture Gaussian density HMMs.
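The core of SSS is choosing splits by likelihood gain. The toy sketch below uses a simple 2-means partition of one state's 1-D data as a stand-in for the ML split search; it is only a hedged illustration of the criterion, since the real algorithm splits HMM states along contextual or temporal axes and refits the whole network:

```python
import math

def gaussian_loglik(xs):
    """Log-likelihood of 1-D data under its own ML Gaussian fit."""
    n = len(xs)
    mu = sum(xs) / n
    # Variance floor guards against degenerate single-value clusters.
    var = max(sum((x - mu) ** 2 for x in xs) / n, 1e-6)
    return sum(-0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
               for x in xs)

def split_gain(xs, iters=10):
    """Log-likelihood gain from splitting one state's data in two.

    A 2-means partition stands in for the ML split search; a positive
    gain suggests the state is worth splitting.
    """
    c = [min(xs), max(xs)]  # initial centroids at the data extremes
    for _ in range(iters):
        assign = [0 if abs(x - c[0]) <= abs(x - c[1]) else 1 for x in xs]
        for k in (0, 1):
            cluster = [x for x, a in zip(xs, assign) if a == k]
            if cluster:
                c[k] = sum(cluster) / len(cluster)
    split_ll = sum(gaussian_loglik([x for x, a in zip(xs, assign) if a == k])
                   for k in (0, 1) if any(a == k for a in assign))
    return split_ll - gaussian_loglik(xs)
```

Repeatedly splitting the state with the largest gain, then retraining, is the greedy loop that SSS (and its maximum-likelihood reformulation above) builds on.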

148 citations


"Joint lexicon, acoustic unit invent..." refers background in this paper

  • ...Second, even phone-based systems benefit from progressive techniques for increasing the acoustic model complexity, both in terms of contextual and temporal resolution (Woodland and Young, 1993; Takami and Sagayama, 1992; Ostendorf and Singer, 1997)....


Proceedings Article
23 Sep 1993

125 citations


"Joint lexicon, acoustic unit invent..." refers methods in this paper

  • ...These larger inventories are typically obtained using automatic clustering of models with different phonetic contexts, e.g. (Young and Woodland, 1993)....
