Proceedings ArticleDOI

An improved sub-word based speech recognizer

TL;DR: The authors describe a system for speaker-dependent speech recognition based on acoustic subword units that showed results comparable to those of whole-word-based systems.
Abstract: The authors describe a system for speaker-dependent speech recognition based on acoustic subword units. Several strategies for automatic generation of an acoustic lexicon are outlined. Preliminary tests have been performed on a small vocabulary. In these tests, the proposed system showed results comparable to those of whole-word-based systems.
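
The general recipe behind such systems is to segment training utterances at points of spectral change, cluster the segments into an inventory of acoustic subword units, and transcribe each vocabulary word as a sequence of unit labels. The sketch below illustrates only that recipe; the threshold-based segmentation, the mean-vector segment representation, and all function names are our assumptions, not the authors' algorithm.

```python
import numpy as np

def segment_utterance(features, threshold=2.0):
    """Split a (frames x dims) feature matrix at frames of large
    spectral change (a crude stand-in for a real segmentation
    criterion) and represent each segment by its mean vector."""
    jumps = np.linalg.norm(np.diff(features, axis=0), axis=1)
    cuts = [0] + [i + 1 for i, d in enumerate(jumps) if d > threshold]
    cuts.append(len(features))
    return [features[a:b].mean(axis=0)
            for a, b in zip(cuts[:-1], cuts[1:]) if b > a]

def build_lexicon(utterances, words, codebook):
    """Transcribe each training token as its sequence of nearest
    subword-unit (codebook) indices; the acoustic lexicon keeps the
    candidate transcriptions observed for each word."""
    lexicon = {}
    for feats, word in zip(utterances, words):
        segments = segment_utterance(np.asarray(feats, dtype=float))
        units = [int(np.argmin(((codebook - seg) ** 2).sum(axis=1)))
                 for seg in segments]
        lexicon.setdefault(word, []).append(units)
    return lexicon
```
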
Citations
Journal ArticleDOI
TL;DR: Outlines current advances in automatic speech recognition (ASR) and spoken language systems, together with their deficiencies in dealing with the variation naturally present in speech.

507 citations


Cites background from "An improved sub-word based speech r..."

  • ...In Tyagi et al. (2005), the usual assumption is made that the piecewise quasi-stationary segments (QSS) of the speech signal can be modeled by a Gaussian autoregressive (AR) process of a fixed order p as in Andre-Obrecht (1988), Svendsen et al. (1989), Svendsen and Soong (1987)....
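
The fixed-order Gaussian AR assumption quoted above can be made concrete: each quasi-stationary segment x(t) is modeled as x(t) = -a_1 x(t-1) - ... - a_p x(t-p) + e(t), with e(t) zero-mean Gaussian noise. The sketch below fits such a model to one segment using the autocorrelation method and a Levinson-Durbin recursion; it is a generic illustration under those assumptions, not code from any of the cited papers.

```python
import numpy as np

def fit_gaussian_ar(segment, p=10):
    """Fit a fixed-order Gaussian AR model to one quasi-stationary
    segment: sample autocorrelation + Levinson-Durbin recursion.
    Returns the AR polynomial [1, a_1, ..., a_p] and the innovation
    variance of the driving Gaussian noise e(t)."""
    x = np.asarray(segment, dtype=float)
    x = x - x.mean()
    n = len(x)
    r = np.array([np.dot(x[:n - k], x[k:]) for k in range(p + 1)]) / n
    a = np.zeros(p + 1)
    a[0] = 1.0
    err = r[0]                    # zeroth-order prediction-error variance
    for i in range(1, p + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err  # reflection coeff.
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= 1.0 - k * k        # shrink prediction-error variance
    return a, err
```
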

01 Jan 1999
TL;DR: Problems with the phoneme as the basic subword unit in speech recognition are raised, suggesting that finer-grained control is needed to capture the sort of pronunciation variability observed in spontaneous speech.
Abstract: The notion that a word is composed of a sequence of phone segments, sometimes referred to as ‘beads on a string’, has formed the basis of most speech recognition work for over 15 years. However, as more researchers tackle spontaneous speech recognition tasks, that view is being called into question. This paper raises problems with the phoneme as the basic subword unit in speech recognition, suggesting that finer-grained control is needed to capture the sort of pronunciation variability observed in spontaneous speech. We offer two different alternatives – automatically derived subword units and linguistically motivated distinctive feature systems – and discuss current work in these directions. In addition, we look at problems that arise in acoustic modeling when trying to incorporate higher-level structure with these two strategies.

151 citations


Cites background from "An improved sub-word based speech r..."

  • ...ASWUs were proposed several years ago [10, 11, 12, 13], but they faded from view as speaker-independent recognition became the primary goal, because of the difficulty of distinguishing speaker variability from real pronunciation differences....


Journal ArticleDOI
TL;DR: This paper presents a complete probabilistic formulation for the automatic design of subword units and dictionary, given only the acoustic data and their transcriptions, and permits easy incorporation of external sources of information, such as the spellings of words in terms of a nonideographic script.
Abstract: Large vocabulary continuous speech recognition (LVCSR) systems traditionally represent words in terms of smaller subword units. Both during training and during recognition, they require a mapping table, called the dictionary, which maps words into sequences of these subword units. The performance of the LVCSR system depends critically on the definition of the subword units and the accuracy of the dictionary. In current LVCSR systems, both these components are manually designed. While manually designed subword units generalize well, they may not be the optimal units of classification for the specific task or environment for which an LVCSR system is trained. Moreover, when human expertise is not available, it may not be possible to design good subword units manually. There is clearly a need for data-driven design of these LVCSR components. In this paper, we present a complete probabilistic formulation for the automatic design of subword units and dictionary, given only the acoustic data and their transcriptions. The proposed framework permits easy incorporation of external sources of information, such as the spellings of words in terms of a nonideographic script.
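
In general terms (the notation here is ours, not necessarily the paper's), such a formulation seeks the unit inventory U and dictionary D that jointly maximize the likelihood of the training acoustics A given the word-level transcriptions W:

\[
(\hat{U}, \hat{D}) \;=\; \arg\max_{U,\,D} \; P(A \mid W, U, D)
\]

where D maps every word of W to a sequence of units from U. The maximization is typically carried out by alternating between re-deriving word baseforms given the current unit models and re-estimating unit models given the current baseforms.
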

82 citations


Cites background from "An improved sub-word based speech r..."

  • ...The problem of automatic identification of subword units has been addressed by several researchers in the past [1]–[6]....


Journal ArticleDOI
TL;DR: Presents a joint solution to the related problems of learning a unit inventory and a corresponding lexicon from data; on a speaker-independent read-speech task with a 1k vocabulary, the proposed algorithm outperforms phone-based systems at both high and low complexities.

66 citations


Cites background or methods from "An improved sub-word based speech r..."

  • ...Cluster centroids therefore directly represent unit models and clustering addresses both the inventory and model design problems, whereas in (Svendsen et al., 1989; Paliwal, 1990; Holter and Svendsen, 1997a) unit model parameters had to be estimated in a separate step from the data partition defined by clustering....


  • ...Over the last decade, a number of researchers have looked into this problem and found algorithms that automatically define model inventories and estimate unit model parameters (Lee et al., 1989; Svendsen et al., 1989; Bahl et al., 1993; Bacchiani et al., 1996)....


  • ...The clustering algorithm used here differs from that used in (Svendsen et al., 1989; Paliwal, 1990; Holter and Svendsen, 1997a) in that maximum likelihood is used as an objective rather than minimum Euclidean distance....


  • ...The two basic steps of any unit inventory design algorithm are an acoustic segmentation step followed by a clustering step (e.g., Lee et al., 1989; Svendsen et al., 1989; Bacchiani et al., 1996; Holter and Svendsen, 1997a)....

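
The contrast drawn in these excerpts, clustering segments under a maximum-likelihood objective rather than minimum Euclidean distance, can be sketched as follows. The per-cluster diagonal Gaussian, the fixed iteration count, and the variance floor are our simplifying assumptions, not details taken from the cited papers.

```python
import numpy as np

def ml_cluster(segments, n_units, n_iter=20, var_floor=1e-3):
    """K-means-style clustering of segment feature vectors, but with a
    diagonal Gaussian per cluster and log-likelihood as the objective
    (instead of minimum Euclidean distance to a centroid)."""
    X = np.asarray(segments, dtype=float)        # (n_segments, dims)
    rng = np.random.default_rng(0)
    mu = X[rng.choice(len(X), n_units, replace=False)]
    var = np.ones_like(mu)
    for _ in range(n_iter):
        # Assignment step: give each segment to the unit that scores
        # it with the highest Gaussian log-likelihood.
        ll = -0.5 * (((X[:, None, :] - mu) ** 2) / var
                     + np.log(2 * np.pi * var)).sum(axis=2)
        assign = ll.argmax(axis=1)
        # Update step: re-estimate each unit's Gaussian from its members.
        for k in range(n_units):
            members = X[assign == k]
            if len(members):
                mu[k] = members.mean(axis=0)
                var[k] = np.maximum(members.var(axis=0), var_floor)
    return mu, var, assign
```

Because each cluster model is itself a distribution, the centroids double as unit models, which is exactly the property the excerpts highlight.
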

Journal ArticleDOI
TL;DR: A maximum-likelihood-based algorithm for fully automatic, data-driven modelling of pronunciation, given a set of subword hidden Markov models (HMMs) and acoustic tokens of a word, which creates a consistent framework for optimisation of automatic speech recognition systems.

63 citations

References
Journal ArticleDOI
TL;DR: An efficient and intuitive algorithm is presented for the design of vector quantizers based either on a known probabilistic model or on a long training sequence of data.
Abstract: An efficient and intuitive algorithm is presented for the design of vector quantizers based either on a known probabilistic model or on a long training sequence of data. The basic properties of the algorithm are discussed and demonstrated by examples. Quite general distortion measures and long blocklengths are allowed, as exemplified by the design of parameter vector quantizers of ten-dimensional vectors arising in Linear Predictive Coded (LPC) speech compression with a complicated distortion measure arising in LPC analysis that does not depend only on the error vector.
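
The procedure described here is the generalized Lloyd algorithm, widely known as LBG after this paper's authors. Below is a minimal sketch under a squared-error distortion (the paper itself admits far more general distortion measures), growing the codebook by splitting; the target size is assumed to be a power of two.

```python
import numpy as np

def lbg(training, codebook_size, eps=1e-3, perturb=0.01):
    """LBG / generalized Lloyd vector-quantizer design on a training
    sequence: split every centroid, then run Lloyd iterations until
    the relative distortion improvement drops below eps."""
    X = np.asarray(training, dtype=float)
    codebook = X.mean(axis=0, keepdims=True)     # start from 1 centroid
    while len(codebook) < codebook_size:
        # Split every centroid into a slightly perturbed pair.
        codebook = np.concatenate([codebook * (1 + perturb),
                                   codebook * (1 - perturb)])
        prev = np.inf
        while True:
            # Nearest-neighbour partition of the training data.
            d = ((X[:, None, :] - codebook) ** 2).sum(axis=2)
            nearest = d.argmin(axis=1)
            distortion = d[np.arange(len(X)), nearest].mean()
            # Centroid condition: each nonempty cell gets its mean.
            for k in range(len(codebook)):
                cell = X[nearest == k]
                if len(cell):
                    codebook[k] = cell.mean(axis=0)
            if prev - distortion < eps * distortion:
                break
            prev = distortion
    return codebook
```
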

7,935 citations

Journal ArticleDOI
TL;DR: The purpose of this tutorial paper is to give an introduction to the theory of Markov models, and to illustrate how they have been applied to problems in speech recognition.
Abstract: The basic theory of Markov chains has been known to mathematicians and engineers for close to 80 years, but it is only in the past decade that it has been applied explicitly to problems in speech processing. One of the major reasons why speech models, based on Markov chains, have not been developed until recently was the lack of a method for optimizing the parameters of the Markov model to match observed signal patterns. Such a method was proposed in the late 1960's and was immediately applied to speech processing in several research institutions. Continued refinements in the theory and implementation of Markov modelling techniques have greatly enhanced the method, leading to a wide range of applications of these models. It is the purpose of this tutorial paper to give an introduction to the theory of Markov models, and to illustrate how they have been applied to problems in speech recognition.
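
A core computation in this framework is the probability of an observation sequence given a model, obtained efficiently with the forward algorithm. A minimal discrete-observation sketch (array shapes and names are ours):

```python
import numpy as np

def forward(A, B, pi, obs):
    """Forward algorithm for a discrete-observation HMM.
    A:  (N, N) transition matrix, A[i, j] = P(state j at t+1 | state i at t)
    B:  (N, M) emission matrix,   B[i, k] = P(symbol k | state i)
    pi: (N,)   initial state distribution
    obs: sequence of observation-symbol indices
    Returns P(obs | model)."""
    alpha = pi * B[:, obs[0]]              # initialization
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]      # induction
    return alpha.sum()                     # termination

# Toy usage: a 2-state, 2-symbol model.
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
print(forward(A, B, pi, [0, 1, 0]))
```
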

4,546 citations

Journal ArticleDOI
TL;DR: This paper has found that a bandpass "liftering" process reduces the variability of the statistical components of LPC-based spectral measurements and hence it is desirable to use such a liftering process in a speech recognizer.
Abstract: In a template-based speech recognition system, distortion measures that compute the distance or dissimilarity between two spectral representations have a strong influence on the performance of the recognizer. Accordingly, extensive comparative studies have been conducted to determine good distortion measures for improved recognition accuracy. Previous studies have shown that the log likelihood ratio measure, the likelihood ratio measure, and the truncated cepstral measures all gave good recognition performance (comparable accuracy) for isolated word recognition tasks. In this paper we extend the interpretation of distortion measures, based upon the observation that measurements of speech spectral envelopes (as normally obtained from standard analysis procedures such as LPC or filter banks) are prone to statistical variations due to window position fluctuations, excitation interference, measurement noise, etc., and may not accurately characterize the true speech spectrum because of analysis model constraints. We have found that these undesirable spectral measurement variations can be partially controlled (i.e., reduced in the level of variation) by appropriate signal processing techniques. In particular, we have found that a bandpass "liftering" process reduces the variability of the statistical components of LPC-based spectral measurements and hence it is desirable to use such a liftering process in a speech recognizer. We have applied this liftering process to several speech recognition tasks: in particular, single frame vowel recognition and isolated word recognition. Using the liftering process, we have been able to achieve an average digit error rate of 1 percent in a speaker-independent isolated digit test. This error rate is about one-half that obtained without the liftering process.
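
The lifter in question is usually given as a raised-sine weighting of the cepstral coefficients, w(n) = 1 + (L/2) sin(pi n / L) for n = 1, ..., L, which de-emphasizes both the low-order coefficients (sensitive to spectral slope and transmission effects) and the high-order ones (dominated by measurement noise). The sketch below applies that weighting to LPC-derived cepstra; the raised-sine form is the one commonly associated with this work, but treat it as an assumption rather than a quotation of the paper.

```python
import numpy as np

def bandpass_lifter(cepstra, L=None):
    """Apply a raised-sine (bandpass) lifter to cepstral coefficients.
    cepstra: (frames, L) array of c_1..c_L (c_0 excluded).
    Assumed weighting: w(n) = 1 + (L/2) * sin(pi * n / L), n = 1..L."""
    c = np.atleast_2d(np.asarray(cepstra, dtype=float))
    L = L or c.shape[1]
    n = np.arange(1, L + 1)
    w = 1.0 + (L / 2.0) * np.sin(np.pi * n / L)
    return c * w
```

Distance computations between two liftered cepstral vectors then weight mid-quefrency structure most heavily, which is the variance-reduction effect the abstract describes.
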

291 citations

Journal ArticleDOI
TL;DR: This paper discusses word recognition as a classical pattern-recognition problem and shows how some fundamental concepts of signal processing, information theory, and computer science can be combined to give us the capability of robust recognition of isolated words and simple connected word sequences.
Abstract: The art and science of speech recognition have been advanced to the state where it is now possible to communicate reliably with a computer by speaking to it in a disciplined manner using a vocabulary of moderate size. It is the purpose of this paper to outline two aspects of speech-recognition research. First, we discuss word recognition as a classical pattern-recognition problem and show how some fundamental concepts of signal processing, information theory, and computer science can be combined to give us the capability of robust recognition of isolated words and simple connected word sequences. We then describe methods whereby these principles, augmented by modern theories of formal language and semantic analysis, can be used to study some of the more general problems in speech recognition. It is anticipated that these methods will ultimately lead to accurate mechanical recognition of fluent speech under certain controlled conditions.
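
Template-based isolated-word recognizers of this era typically compared an input token against stored reference templates with dynamic time warping (DTW). The abstract does not name the method, so the following is a generic sketch of that classical approach rather than this paper's exact algorithm.

```python
import numpy as np

def dtw_distance(template, token):
    """Dynamic time warping distance between two (frames x dims)
    feature sequences, using the standard symmetric local path
    (diagonal match, insertion, deletion)."""
    T, U = len(template), len(token)
    D = np.full((T + 1, U + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, U + 1):
            cost = np.linalg.norm(template[i - 1] - token[j - 1])
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[T, U] / (T + U)    # length-normalized accumulated distance

# Recognition: pick the vocabulary word whose template warps closest
# to the input token, e.g.
#   best = min(templates, key=lambda w: dtw_distance(templates[w], token))
```
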

246 citations

Proceedings ArticleDOI
11 Apr 1988
TL;DR: An automatic technique for constructing Markov word models is described and results are included of experiments with speaker-dependent and speaker-independent models on several isolated-word recognition tasks.
Abstract: The Speech Recognition Group at IBM Research has developed a real-time, isolated-word speech recognizer called Tangora, which accepts natural English sentences drawn from a vocabulary of 20000 words. Despite its large vocabulary, the Tangora recognizer requires only about 20 minutes of speech from each new user for training purposes. The accuracy of the system and its ease of training are largely attributable to the use of hidden Markov models in its acoustic match component. An automatic technique for constructing Markov word models is described and results are included of experiments with speaker-dependent and speaker-independent models on several isolated-word recognition tasks.

245 citations