
Moving beyond the 'beads-on-a-string' model of speech

01 Jan 1999
TL;DR: Problems with the phoneme as the basic subword unit in speech recognition are raised, suggesting that finer-grained control is needed to capture the sort of pronunciation variability observed in spontaneous speech.
Abstract: The notion that a word is composed of a sequence of phone segments, sometimes referred to as ‘beads on a string’, has formed the basis of most speech recognition work for over 15 years. However, as more researchers tackle spontaneous speech recognition tasks, that view is being called into question. This paper raises problems with the phoneme as the basic subword unit in speech recognition, suggesting that finer-grained control is needed to capture the sort of pronunciation variability observed in spontaneous speech. We offer two different alternatives – automatically derived subword units and linguistically motivated distinctive feature systems – and discuss current work in these directions. In addition, we look at problems that arise in acoustic modeling when trying to incorporate higher-level structure with these two strategies.
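
To make the contrast concrete, here is a minimal Python sketch of the representation the paper questions, assuming a hypothetical one-word lexicon and a fixed three-state-per-phone topology (the names and values are illustrative, not taken from the paper):

    # 'Beads-on-a-string': a word is a linear string of phones, each expanded
    # into a fixed left-to-right block of HMM states, so the word model is just
    # a concatenation of phone-level beads.
    LEXICON = {"speech": ["s", "p", "iy", "ch"]}   # word -> canonical phone string

    def beads_on_a_string(word, states_per_phone=3):
        """Expand a word into a linear sequence of (phone, state index) beads."""
        return [(phone, state)
                for phone in LEXICON[word]
                for state in range(states_per_phone)]

    print(beads_on_a_string("speech"))
    # [('s', 0), ('s', 1), ('s', 2), ('p', 0), ..., ('ch', 2)]

Pronunciation variation then has to be expressed as whole alternative phone strings in the lexicon, which is exactly the coarseness the paper argues against in favour of finer-grained units, whether automatically derived or feature-based.
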
Citations
Book
Li Deng, Dong Yu
12 Jun 2014
TL;DR: This monograph provides an overview of general deep learning methodology and its applications to a variety of signal and information processing tasks, including natural language and text processing, information retrieval, and multimodal information processing empowered by multi-task deep learning.
Abstract: This monograph provides an overview of general deep learning methodology and its applications to a variety of signal and information processing tasks. The application areas are chosen with the following three criteria in mind: (1) expertise or knowledge of the authors; (2) the application areas that have already been transformed by the successful use of deep learning technology, such as speech recognition and computer vision; and (3) the application areas that have the potential to be impacted significantly by deep learning and that have been experiencing research growth, including natural language and text processing, information retrieval, and multimodal information processing empowered by multi-task deep learning.

2,817 citations


Cites background from "Moving beyond the 'beads-on-a-string' model of speech"

  • ...Likewise, it has also been well understood for a long time that the use of phonetic or its finer state sequences, even with contextual dependency, in engineering speech recognition systems, is inadequate in representing such rich structure [86, 273, 355], and thus leaving a promising open direction to improve the speech recognition systems’ performance....


Reference Book
03 Oct 2018
TL;DR: A reference text covering analytical foundations (discrete-time signals and transforms, probability and random processes, linear and dynamic system models, optimization and estimation theory, statistical pattern recognition), the fundamentals of speech science, computational phonology and phonetics, and speech technology in selected areas including recognition, enhancement, and synthesis.
Abstract: Analytical background and techniques: discrete-time signals, systems and transforms; analysis of discrete-time speech signals; probability and random processes; linear model and dynamic system model; optimization methods and estimation theory; statistical pattern recognition. Fundamentals of speech science: phonetic process; phonological process. Computational phonology and phonetics: computational phonology; computational models for speech production; computational models for auditory speech processing. Speech technology in selected areas: speech recognition; speech enhancement; speech synthesis.

244 citations

Journal Article
TL;DR: A survey of a growing body of work in which representations of speech production are used to improve automatic speech recognition is provided.
Abstract: Although much is known about how speech is produced, and research into speech production has resulted in measured articulatory data, feature systems of different kinds, and numerous models, speech production knowledge is almost totally ignored in current mainstream approaches to automatic speech recognition. Representations of speech production allow simple explanations for many phenomena observed in speech which cannot be easily analyzed from either acoustic signal or phonetic transcription alone. In this article, a survey of a growing body of work in which such representations are used to improve automatic speech recognition is provided.

207 citations


Cites background from "Moving beyond the 'beads-on-a-string' model of speech"

  • ...The standard approach to acoustic modeling continues to be the “beads on a string” model (Ostendorf, 1999) in which the speech signal is represented as a concatenation of phones....


Journal Article
TL;DR: An overview of past and present efforts to link human and automatic speech recognition research is provided and an overview of the literature describing the performance difference between machines and human listeners is presented.
Abstract: The fields of human speech recognition (HSR) and automatic speech recognition (ASR) both investigate parts of the speech recognition process and have word recognition as their central issue. Although the research fields appear closely related, their aims and research methods are quite different. Despite these differences, there has lately been a growing interest in possible cross-fertilisation. Researchers from both ASR and HSR are realising the potential benefit of looking at the research field on the other side of the 'gap'. In this paper, we provide an overview of past and present efforts to link human and automatic speech recognition research and present an overview of the literature describing the performance difference between machines and human listeners. The focus of the paper is on the mutual benefits to be derived from establishing closer collaborations and knowledge interchange between ASR and HSR. The paper ends with an argument for more and closer collaborations between researchers of ASR and HSR to further improve research in both fields.

126 citations


Cites methods from "Moving beyond the 'beads-on-a-string' model of speech"

  • ...In the field of ASR, AFs are often put forward as a more flexible alternative (Kirchhoff, 1999; Wester, 2003; Wester et al., 2001) to modelling the variation in speech using the standard ‘beads-on-a-string’ paradigm (Ostendorf, 1999), in which the acoustic signal is described in terms of (linear sequences of) phones, and words as phone sequences....


Journal Article
TL;DR: A dynamic Bayesian network for articulatory feature recognition is described that gives superior recognition of articulatory features from the speech signal compared with a state-of-the-art neural network system.
Abstract: We describe a dynamic Bayesian network for articulatory feature recognition. The model is intended to be a component of a speech recognizer that avoids the problems of conventional ''beads-on-a-string'' phoneme-based models. We demonstrate that the model gives superior recognition of articulatory features from the speech signal compared with a state-of-the-art neural network system. We also introduce a training algorithm that offers two major advances: it does not require time-aligned feature labels and it allows the model to learn a set of asynchronous feature changes in a data-driven manner.
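
As a rough illustration of the factored, asynchronous state structure such a model exploits (not the authors' implementation; the stream names, transition values, and the naive joint-state forward pass below are assumptions made for the sketch):

    import itertools
    import numpy as np

    # Each articulatory feature stream has its own value set and transition
    # matrix, so streams can change value at different frames (asynchronously).
    STREAMS = {
        "voicing":  (["voiced", "voiceless"], np.array([[0.9, 0.1], [0.1, 0.9]])),
        "nasality": (["nasal", "oral"],       np.array([[0.8, 0.2], [0.05, 0.95]])),
    }

    def joint_states():
        """All combinations of per-stream values, e.g. ('voiced', 'oral')."""
        return list(itertools.product(*(vals for vals, _ in STREAMS.values())))

    def joint_transition(prev, cur):
        """Product of per-stream transitions: streams move independently."""
        p = 1.0
        for (vals, A), a, b in zip(STREAMS.values(), prev, cur):
            p *= A[vals.index(a), vals.index(b)]
        return p

    def forward(obs_loglik):
        """Naive forward pass over the joint state space.
        obs_loglik: one dict per frame mapping joint state -> log p(frame | state)."""
        states = joint_states()
        alpha = {s: np.log(1.0 / len(states)) + obs_loglik[0][s] for s in states}
        for frame in obs_loglik[1:]:
            alpha = {s: np.logaddexp.reduce([alpha[r] + np.log(joint_transition(r, s))
                                             for r in states]) + frame[s]
                     for s in states}
        return alpha

    # Example with two frames of made-up observation log-likelihoods:
    frames = [{s: 0.0 for s in joint_states()} for _ in range(2)]
    print(forward(frames))

Because each stream carries its own transition model, the joint state can change one feature at a time, which is the kind of asynchronous feature change the paper's training algorithm is designed to learn from data.
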

109 citations


Cites background from "Moving beyond the 'beads-on-a-string' model of speech"

  • ...the “beads-on-a-string” paradigm [1], makes it extremely difficult to model the variation that is present in spontaneous, conversational speech....


References
Journal Article
TL;DR: The apparently vast number of speech sounds found in the languages of the world turn out to be surface-level realisations of a limited number of combinations of a very small set of such features – some twenty or so, in current analyses.
Abstract: On the notion ‘feature bundle’: The study of the phonological aspect of human speech has advanced greatly over the past decades as a result of one of the fundamental discoveries of modern linguistics – the fact that phonological segments, or phonemes, are not the ultimate constituents of phonological analysis, but factor into smaller, simultaneous properties or features. The apparently vast number of speech sounds found in the languages of the world turn out to be surface-level realisations of a limited number of combinations of a very small set of such features – some twenty or so, in current analyses. This conclusion is strongly supported by the similar patterning of speech sounds in language after language, and by many extragrammatical features of language use, such as patterns of acquisition, language disablement and language change.
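
A toy Python illustration of the 'feature bundle' idea and of a process stated over a single feature; the phoneme inventory and rule below are invented for illustration, not drawn from the chapter:

    # Phonemes as bundles of a small number of features; a process such as nasal
    # place assimilation can then be written as the change of one feature value
    # rather than as the substitution of a whole phoneme.
    PHONEMES = {
        "n": {"nasal": "+", "voice": "+", "place": "alveolar"},
        "m": {"nasal": "+", "voice": "+", "place": "labial"},
        "b": {"nasal": "-", "voice": "+", "place": "labial"},
        "d": {"nasal": "-", "voice": "+", "place": "alveolar"},
    }

    def assimilate_place(seq):
        """Spread the place feature of each consonant onto a preceding nasal."""
        out = [dict(PHONEMES[p]) for p in seq]
        for i in range(len(out) - 1):
            if out[i]["nasal"] == "+":
                out[i]["place"] = out[i + 1]["place"]
        return out

    # /n b/ surfaces with a labial nasal, i.e. the feature bundle of [m]:
    print(assimilate_place(["n", "b"]))
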

1,043 citations


"Moving beyond the 'beads-on-a-strin..." refers background in this paper

  • ...First, it has been observed that certain sets of features tend to spread or modify together in groups that can be characterized by a hierarchical organization [34]....


Proceedings Article
08 Mar 1994
TL;DR: This paper describes a method of creating a tied-state continuous speech recognition system using a phonetic decision tree, which is shown to lead to similar recognition performance to that obtained using an earlier data-driven approach but to have the additional advantage of providing a mapping for unseen triphones.
Abstract: The key problem to be faced when building a HMM-based continuous speech recogniser is maintaining the balance between model complexity and available training data. For large vocabulary systems requiring cross-word context dependent modelling, this is particularly acute since many such contexts will never occur in the training data. This paper describes a method of creating a tied-state continuous speech recognition system using a phonetic decision tree. This tree-based clustering is shown to lead to similar recognition performance to that obtained using an earlier data-driven approach but to have the additional advantage of providing a mapping for unseen triphones. State-tying is also compared with traditional model-based tying and shown to be clearly superior. Experimental results are presented for both the Resource Management and Wall Street Journal tasks.
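
The core of the method can be sketched as a greedy split criterion over yes/no phonetic-context questions. The toy code below uses one-dimensional single-Gaussian statistics and an invented question set, purely to show the shape of the algorithm rather than the actual HTK implementation:

    import math

    QUESTIONS = {                      # question name -> set of phones it asks about
        "L-Nasal": {"m", "n"},
        "L-Stop":  {"p", "b", "t", "d"},
        "R-Vowel": {"aa", "iy", "uw"},
    }

    def loglik(values):
        """Log-likelihood of values under their own Gaussian fit (up to constants)."""
        n = len(values)
        if n < 2:
            return 0.0
        mean = sum(values) / n
        var = max(sum((v - mean) ** 2 for v in values) / n, 1e-4)
        return -0.5 * n * math.log(var)

    def answers_yes(question, triphone):
        left, _, right = triphone      # e.g. ("m", "ae", "t")
        phones = QUESTIONS[question]
        return (left in phones) if question.startswith("L-") else (right in phones)

    def best_split(items):
        """items: (triphone, mean observation value) pairs pooled in one tree node.
        Return the question giving the largest log-likelihood gain over not splitting."""
        base = loglik([v for _, v in items])
        best = None
        for q in QUESTIONS:
            yes = [v for t, v in items if answers_yes(q, t)]
            no = [v for t, v in items if not answers_yes(q, t)]
            if yes and no:
                gain = loglik(yes) + loglik(no) - base
                if best is None or gain > best[1]:
                    best = (q, gain)
        return best

    # States reaching the same leaf share ('tie') one output distribution, and an
    # unseen triphone is mapped by answering the same questions about its context.
    print(best_split([(("m", "ae", "t"), 1.0), (("n", "ae", "t"), 1.1),
                      (("p", "ae", "t"), 3.0), (("b", "ae", "t"), 3.2)]))
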

781 citations


"Moving beyond the 'beads-on-a-strin..." refers background in this paper

  • ...[20], which can learn both contextual and temporal structure (i....


Journal Article
TL;DR: A general stochastic model is described that encompasses most of the models proposed in the literature for speech recognition, pointing out similarities in terms of correlation and parameter tying assumptions, and drawing analogies between segment models and HMMs.
Abstract: Many alternative models have been proposed to address some of the shortcomings of the hidden Markov model (HMM), which is currently the most popular approach to speech recognition. In particular, a variety of models that could be broadly classified as segment models have been described for representing a variable-length sequence of observation vectors in speech recognition applications. Since there are many aspects in common between these approaches, including the general recognition and training problems, it is useful to consider them in a unified framework. The paper describes a general stochastic model that encompasses most of the models proposed in the literature, pointing out similarities of the models in terms of correlation and parameter tying assumptions, and drawing analogies between segment models and HMMs. In addition, we summarize experimental results assessing different modeling assumptions and point out remaining open questions.
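
A minimal numerical sketch of the distinction the paper formalises, assuming illustrative one-dimensional Gaussians (the segment models surveyed are of course far richer):

    import math

    def gauss_logpdf(x, mean, var):
        return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

    def hmm_state_score(frames, state_mean, state_var):
        """Frame-wise, conditionally independent scoring, as in a standard HMM state."""
        return sum(gauss_logpdf(x, state_mean, state_var) for x in frames)

    def segment_model_score(frames, trajectory, var, duration_logprob):
        """Score a whole variable-length segment: a canonical trajectory is stretched
        to the observed duration, and an explicit duration term is added."""
        L = len(frames)
        stretched = [trajectory[min(int(i * len(trajectory) / L), len(trajectory) - 1)]
                     for i in range(L)]
        acoustic = sum(gauss_logpdf(x, m, var) for x, m in zip(frames, stretched))
        return acoustic + duration_logprob(L)

    # Example: score a short segment whose mean rises over time under both views.
    frames = [0.1, 0.4, 0.8, 1.1, 1.3]
    print(hmm_state_score(frames, state_mean=0.74, state_var=0.2))
    print(segment_model_score(frames, trajectory=[0.0, 0.7, 1.4], var=0.05,
                              duration_logprob=lambda L: math.log(0.1)))

The frame-level view is the HMM special case of the general model; the segment-level view makes duration and cross-frame structure explicit, which is the unifying point of the survey.
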

680 citations


"Moving beyond the 'beads-on-a-strin..." refers background in this paper

  • ...Improved acoustic models may require additional layers of hidden states at different time scales, mixed memory Markov models [41], a mixed continuous and discrete hidden state [42], a discrete event model [43], and/or other alternatives....


Journal Article
TL;DR: Evidence that, because syntax does not fully predict the way that spoken utterances are organized, prosody is a significant issue for studies of auditory sentence processing is presented.
Abstract: In this tutorial we present evidence that, because syntax does not fully predict the way that spoken utterances are organized, prosody is a significant issue for studies of auditory sentence processing. We describe the basic elements and principles of current prosodic theory, review the psycholinguistic evidence that supports an active role for prosodic structure in sentence representation, and provide a road map of references that contain more complete arguments about prosodic structure and prominence. Because current theories do not predict the precise prosodic shape that a particular utterance will take, it is important to determine the prosodic choices that a speaker has made for utterances that are used in an auditory sentence processing study. To this end, we provide information about practical tools such as systems for signal display and prosodic transcription, and several caveats which we have found useful to keep in mind.

551 citations


"Moving beyond the 'beads-on-a-strin..." refers background in this paper

  • ...However, such phenomena may be more directly described in terms of prosodic structure [38], i....


Journal Article
TL;DR: Systematic analysis of pronunciation variation in a corpus of spontaneous English discourse (Switchboard) demonstrates that the variation observed is more systematic at the level of the syllable than at the phonetic-segment level, and syllabic onsets are realized in canonical form far more frequently than either coda or nuclear constituents.
Abstract: Current-generation automatic speech recognition (ASR) systems model spoken discourse as a quasi-linear sequence of words and phones. Because it is unusual for every phone within a word to be pronounced in a standard ("canonical") way, ASR systems often depend on a multi-pronunciation lexicon to match an acoustic sequence with a lexical unit. Since there are, in practice, many different ways for a word to be pronounced, this standard approach adds a layer of complexity and ambiguity to the decoding process which, if simplified, could potentially improve recognition performance. Systematic analysis of pronunciation variation in a corpus of spontaneous English discourse (Switchboard) demonstrates that the variation observed is more systematic at the level of the syllable than at the phonetic-segment level. Thus, syllabic onsets are realized in canonical form far more frequently than either coda or nuclear constituents. Prosodic prominence and lexical stress also appear to play an important role in pronunciation variation. The governing mechanism is likely to involve the informational valence associated with syllabic and lexical elements, and for this reason pronunciation variation offers a potential window onto the mechanisms responsible for the production and understanding of spoken language.
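
A toy Python sketch of the kind of tabulation behind such an analysis: given phones aligned between a canonical dictionary form and a hand transcription, tally how often each syllable position is realised canonically. The alignment data below are invented for illustration, not taken from Switchboard:

    from collections import defaultdict

    # (syllable position, canonical phone, realised phone); None marks a deletion
    ALIGNED = [
        ("onset", "s", "s"), ("nucleus", "ah", "ax"), ("coda", "t", None),
        ("onset", "p", "p"), ("nucleus", "iy", "iy"), ("coda", "d", "d"),
        ("onset", "k", "k"), ("nucleus", "ae", "eh"), ("coda", "n", None),
    ]

    def canonical_rate(aligned):
        counts = defaultdict(lambda: [0, 0])      # position -> [canonical, total]
        for position, canon, realised in aligned:
            counts[position][1] += 1
            if realised == canon:
                counts[position][0] += 1
        return {pos: c / t for pos, (c, t) in counts.items()}

    print(canonical_rate(ALIGNED))
    # With the toy data: onsets 3/3 canonical, nuclei 1/3, codas 1/3, mirroring
    # the paper's finding that onsets are far more stable than nuclei or codas.
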

373 citations


"Moving beyond the 'beads-on-a-strin..." refers background in this paper

  • ...Such short segments are quite frequent, as evidenced by distributional data in hand-labeled phonetic transcriptions [7] and by the high percentage of phones mapped to the minimum allowed duration in a forced alignment using a single-pronunciation dictionary (observed in several studies)....


  • ...syllable onsets are most often preserved and codas are most often deleted [7]....
