Proceedings ArticleDOI

Lexicon-building methods for an acoustic sub-word based speech recognizer

03 Apr 1990-Vol. 1990, pp 729-732
TL;DR: The use of an acoustic subword unit (ASWU)-based speech recognition system for the recognition of isolated words is discussed and it is shown that the use of a modified k-means algorithm on the likelihoods derived through the Viterbi algorithm provides the best deterministic-type of word lexicon.
Abstract: The use of an acoustic subword unit (ASWU)-based speech recognition system for the recognition of isolated words is discussed. Some methods are proposed for generating the deterministic and the statistical types of word lexicon. It is shown that the use of a modified k-means algorithm on the likelihoods derived through the Viterbi algorithm provides the best deterministic-type of word lexicon. However, the ASWU-based speech recognizer leads to better performance with the statistical type of word lexicon than with the deterministic type. Improving the design of the word lexicon makes it possible to narrow the gap in the recognition performances of the whole word unit (WWU)-based and the ASWU-based speech recognizers considerably. Further improvements are expected by designing the word lexicon better.
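The clustering step described above can be illustrated with a minimal sketch. This assumes plain squared-Euclidean k-means on fixed-length feature vectors; the paper's specific modification and its Viterbi-derived likelihood features are not reproduced here, so the data below are toy 2-D examples:

```python
# Hypothetical sketch: basic k-means over fixed-length vectors.
# The paper's "modified" k-means variant is not specified here.
import random

def kmeans(vectors, k, iters=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)  # pick k distinct starting points
    for _ in range(iters):
        # Assign each vector to its nearest centroid (squared Euclidean).
        clusters = [[] for _ in range(k)]
        for v in vectors:
            d = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centroids]
            clusters[d.index(min(d))].append(v)
        # Recompute each centroid as its cluster mean; keep it if empty.
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = [sum(x) / len(cl) for x in zip(*cl)]
    return centroids

# Toy usage: two well-separated groups of 2-D "feature" vectors.
data = [[0.1, 0.2], [0.0, 0.1], [5.0, 5.1], [5.2, 4.9]]
cents = kmeans(data, k=2)
```

With two clearly separated groups, the two centroids converge to the group means regardless of which points are sampled as starting centroids.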


Citations
More filters
01 Jan 1999
TL;DR: Problems with the phoneme as the basic subword unit in speech recognition are raised, suggesting that finer-grained control is needed to capture the sort of pronunciation variability observed in spontaneous speech.
Abstract: The notion that a word is composed of a sequence of phone segments, sometimes referred to as ‘beads on a string’, has formed the basis of most speech recognition work for over 15 years. However, as more researchers tackle spontaneous speech recognition tasks, that view is being called into question. This paper raises problems with the phoneme as the basic subword unit in speech recognition, suggesting that finer-grained control is needed to capture the sort of pronunciation variability observed in spontaneous speech. We offer two different alternatives – automatically derived subword units and linguistically motivated distinctive feature systems – and discuss current work in these directions. In addition, we look at problems that arise in acoustic modeling when trying to incorporate higher-level structure with these two strategies.

151 citations


Cites background from "Lexicon-building methods for an aco..."

  • ...ASWUs were proposed several years ago [10, 11, 12, 13], but they faded from view as speaker-independent recognition became the primary goal, because of the difficulty of distinguishing speaker variability from real pronunciation differences....

    [...]

Journal ArticleDOI
TL;DR: A method for combining phonetic and fenonic models is presented and results of experiments with speaker-dependent and speaker-independent models on several isolated-word recognition tasks are reported.
Abstract: A technique for constructing Markov models for the acoustic representation of words is described. Word models are constructed from models of subword units called fenones. Fenones represent very short speech events and are obtained automatically through the use of a vector quantizer. The fenonic baseform for a word-i.e., the sequence of fenones used to represent the word-is derived automatically from one or more utterances of that word. Since the word models are all composed from a small inventory of subword models, training for large-vocabulary speech recognition systems can be accomplished with a small training script. A method for combining phonetic and fenonic models is presented. Results of experiments with speaker-dependent and speaker-independent models on several isolated-word recognition tasks are reported. The results are compared with those for phonetics-based Markov models and template-based dynamic programming (DP) matching.
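The core vector-quantization idea, mapping each acoustic frame to its nearest codebook entry so that an utterance becomes a sequence of fenone labels, can be sketched as follows. The 2-D codebook and frames below are made-up examples, not the paper's actual quantizer:

```python
# Hypothetical sketch: labeling frames with a small VQ codebook, so a
# word utterance becomes a sequence of codeword indices ("fenones").

def nearest_codeword(frame, codebook):
    # Return the index of the closest codebook entry (squared Euclidean).
    dists = [sum((f - c) ** 2 for f, c in zip(frame, code)) for code in codebook]
    return dists.index(min(dists))

codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]  # toy 3-entry codebook
frames = [[0.1, -0.1], [0.9, 1.1], [1.9, 0.2], [1.0, 0.9]]

# The fenonic label sequence for this toy utterance.
baseform = [nearest_codeword(f, codebook) for f in frames]
```

In the paper's scheme such label sequences, derived from one or more utterances of a word, serve as the word's fenonic baseform for Markov-model training.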

67 citations

Journal ArticleDOI
TL;DR: A joint solution to the related problems of learning a unit inventory and a corresponding lexicon from data is presented; on a speaker-independent read-speech task with a 1k vocabulary, the proposed algorithm outperforms phone-based systems at both high and low complexities.

66 citations


Cites background or methods from "Lexicon-building methods for an aco..."

  • ...Taking an approach similar to that in (Svendsen and Soong, 1987; Paliwal, 1990), the maximum likelihood segmentation of the training data are found by the use of dynamic programming....

    [...]


  • ...Cluster centroids therefore directly represent unit models and clustering addresses both the inventory and model design problems, whereas in (Svendsen et al., 1989; Paliwal, 1990; Holter and Svendsen, 1997a) unit model parameters had to be estimated in a separate step from the data partition defined by clustering....

    [...]

  • ...The clustering algorithm used here differs from that used in (Svendsen et al., 1989; Paliwal, 1990; Holter and Svendsen, 1997a) in that maximum likelihood is used as an objective rather than minimum Euclidean distance....

    [...]

  • ...The related problem of defining a lexicon in terms of these ASWUs has also received attention (e.g., Paliwal, 1990; Svendsen et al., 1995)....

    [...]

Journal ArticleDOI
TL;DR: A maximum-likelihood-based algorithm is presented for fully automatic, data-driven modelling of pronunciation, given a set of subword hidden Markov models (HMMs) and acoustic tokens of a word, creating a consistent framework for optimisation of automatic speech recognition systems.

63 citations

Proceedings Article
01 Oct 2013
TL;DR: An unsupervised alternative ‐ requiring no language-specific knowledge ‐ to the conventional manual approach for creating pronunciation dictionaries is proposed, which jointly discovers the phonetic inventory and the Letter-to-Sound mapping rules in a language using only transcribed data.
Abstract: The creation of a pronunciation lexicon remains the most inefficient process in developing an Automatic Speech Recognizer (ASR). In this paper, we propose an unsupervised alternative ‐ requiring no language-specific knowledge ‐ to the conventional manual approach for creating pronunciation dictionaries. We present a hierarchical Bayesian model, which jointly discovers the phonetic inventory and the Letter-to-Sound (L2S) mapping rules in a language using only transcribed data. When tested on a corpus of spontaneous queries, the results demonstrate the superiority of the proposed joint learning scheme over its sequential counterpart, in which the latent phonetic inventory and L2S mappings are learned separately. Furthermore, the recognizers built with the automatically induced lexicon consistently outperform grapheme-based recognizers and even approach the performance of recognition systems trained using conventional supervised procedures.

41 citations


Additional excerpts

  • ...Various algorithms for learning sub-word based pronunciations were proposed in (Lee et al., 1988; Fukada et al., 1996; Bacchiani and Ostendorf, 1999; Paliwal, 1990)....

    [...]

References
More filters
Proceedings ArticleDOI
23 May 1989
TL;DR: The authors describe a system for speaker-dependent speech recognition based on acoustic subword units that showed results comparable to those of whole-word-based systems.
Abstract: The authors describe a system for speaker-dependent speech recognition based on acoustic subword units. Several strategies for automatic generation of an acoustic lexicon are outlined. Preliminary tests have been performed on a small vocabulary. In these tests, the proposed system showed results comparable to those of whole-word-based systems.

42 citations

Proceedings ArticleDOI
01 Apr 1987
TL;DR: An approach to automatic speech recognition is described which attempts to link together ideas from pattern recognition such as dynamic time warping and hidden Markov modeling, with ideas from linguistically motivated approaches.
Abstract: An approach to automatic speech recognition is described which attempts to link together ideas from pattern recognition such as dynamic time warping and hidden Markov modeling, with ideas from linguistically motivated approaches. In this approach, the basic sub-word units are defined acoustically, but not necessarily phonetically. An algorithm was developed which automatically decomposed speech into multiple sub-word segments, based solely upon strict acoustic criteria, without any reference to linguistic content. By repeating this procedure on a large corpus of speech data we obtained an extensive pool of unlabeled sub-word speech segments. Then using well defined clustering techniques, a small set of representative acoustic sub-word units (e.g. an inventory of units) was created. This process is fast, easy to use, and required no human intervention. The interpretation of these sub-word units, in a linguistic sense, in the context of word decoding is an important issue which must be addressed for them to be useful in a large vocabulary system. We have not yet addressed this issue; instead a couple of simple experiments were performed to determine if these acoustic sub-word units had any potential value for speech recognition. For these experiments we used a connected digits database from a single female talker. A 25 sub-word unit codebook of acoustic segments was created from about 1600 segments drawn from 100 connected digit strings. A simple isolated digit recognition system, designed using the statistics of the codewords in the acoustic sub-word unit codebook had a recognition accuracy of 100%. In another experiment a connected digit recognition system was created with representative digit templates created by concatenating the sub-word units in an appropriate manner. The system had a string recognition accuracy of 96%.
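The automatic decomposition step above can be illustrated with a small dynamic-programming sketch. Using 1-D features and a within-segment squared-deviation cost is a simplifying assumption; the paper's strict acoustic criteria are not specified here:

```python
# Hypothetical sketch: split a 1-D feature sequence into m segments by
# dynamic programming, minimizing total within-segment squared deviation.
# Real systems operate on multi-dimensional spectral features.

def seg_cost(x, i, j):
    # Squared deviation of x[i:j] from its own mean.
    seg = x[i:j]
    m = sum(seg) / len(seg)
    return sum((v - m) ** 2 for v in seg)

def segment(x, m):
    n = len(x)
    INF = float("inf")
    # best[k][j]: minimum cost of splitting x[:j] into k segments.
    best = [[INF] * (n + 1) for _ in range(m + 1)]
    back = [[0] * (n + 1) for _ in range(m + 1)]
    best[0][0] = 0.0
    for k in range(1, m + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = best[k - 1][i] + seg_cost(x, i, j)
                if c < best[k][j]:
                    best[k][j], back[k][j] = c, i
    # Recover the end index of each segment by backtracking.
    bounds, j = [], n
    for k in range(m, 0, -1):
        bounds.append(j)
        j = back[k][j]
    return sorted(bounds)

# Toy usage: a sequence with three distinct acoustic "levels".
x = [0.0, 0.1, 0.0, 3.0, 3.1, 2.9, 6.0, 6.1]
bounds = segment(x, 3)
```

Pooling such segments over a large corpus and clustering them is what yields the inventory of acoustic sub-word units described above.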

41 citations

Journal ArticleDOI
TL;DR: A new clustering algorithm based on a K‐means approach which requires no user parameter specification is presented, which performs as well or better than the previously used clustering techniques when tested as part of a speaker independent isolated word recognition system.
Abstract: Recent studies of isolated word recognition systems have shown that a set of carefully chosen templates can be used to bring the performance of speaker‐independent systems up to that of systems trained to the individual speaker. The earliest work in this area used a sophisticated set of pattern recognition algorithms in a human‐interactive mode to create the set of templates (multiple patterns) for each word in the vocabulary. Not only was this procedure time consuming but it was impossible to reproduce exactly, because it was highly dependent on decisions made by the experimenter. Subsequent work led to an automatic clustering procedure which, given only a set of clustering parameters, clustered tokens with the same performance as the previously developed supervised algorithms. The one drawback of the automatic procedure was that the specification of the input parameter set was found to be somewhat dependent on the vocabulary type and size of population to be clustered. Since the user of such a statistical clustering algorithm could not be expected, in general, to know how to choose the word clustering parameters, even this automatic clustering algorithm was not appropriate for a completely general word recognition system. It is the purpose of this paper to present a new clustering algorithm based on a K‐means approach which requires no user parameter specification. Experimental data show that this new algorithm performs as well or better than the previously used clustering techniques when tested as part of a speaker independent isolated word recognition system.

38 citations

Proceedings ArticleDOI
01 Apr 1979
TL;DR: One of these processors, which achieved a 0% error rate on New Raleigh sentences, has been used to decode sentences from the New Raleigh Language without the benefit of syntactic guidance during the decoding process.
Abstract: The statistical training and decoding procedures developed at IBM Research can be used with a wide variety of acoustic processors. We have recently (July and August 1978) achieved error-free or nearly error-free decoding results with several different acoustic processors on sentences from the New Raleigh Language (vocabulary 250 words, perplexity 7.27 words). One of these processors, which achieved a 0% error rate on New Raleigh sentences, has been used to decode sentences from the New Raleigh Language without the benefit of syntactic guidance during the decoding process. On this much more difficult task, it has achieved an error rate of 8.8% at the word level, corresponding to a sentence error rate of 53%. All of these processors are non-segmenting processors which produce output once every 10ms.

19 citations

Proceedings ArticleDOI
11 Apr 1988
TL;DR: The authors have developed a very successful new approach to automatic speech recognition which incorporates speech knowledge into a mathematical framework and does not require a computationally intensive time alignment/dynamic programming scheme.
Abstract: The authors have developed a very successful new approach to automatic speech recognition which incorporates speech knowledge into a mathematical framework and does not require a computationally intensive time alignment/dynamic programming scheme. They transform the speech signal into the spectral domain, segment it into sub-word units and, in turn, perform an additional transformation in the spectral domain to capture the spectral structure within each sub-word unit. The system was shown to perform robustly in hand segmented whole word digit recognition in clean as well as noisy speech. They have now augmented the system with an automatic acoustic sub-word segmentation routine and tested the performance of this integrated system with the TI isolated word database and the confusable E set.

8 citations