Journal ArticleDOI

Minimum prediction residual principle applied to speech recognition

F. Itakura
01 Feb 1975-IEEE Transactions on Acoustics, Speech, and Signal Processing (IEEE)-Vol. 23, Iss: 1, pp 154-158
TL;DR: A computer system is described in which isolated words, spoken by a designated talker, are recognized by calculating a minimum prediction residual, obtained by optimally registering the reference LPC onto the input autocorrelation coefficients using a dynamic programming algorithm.
Abstract: A computer system is described in which isolated words, spoken by a designated talker, are recognized through calculation of a minimum prediction residual. A reference pattern for each word to be recognized is stored as a time pattern of linear prediction coefficients (LPC). The total log prediction residual of an input signal is minimized by optimally registering the reference LPC onto the input autocorrelation coefficients using the dynamic programming algorithm (DP). The input signal is recognized as the reference word which produces the minimum prediction residual. A sequential decision procedure is used to reduce the amount of computation in DP. A frequency normalization with respect to the long-time spectral distribution is used to reduce effects of variations in the frequency response of telephone connections. The system has been implemented on a DDP-516 computer for the 200-word recognition experiment. The recognition rate for a designated male talker is 97.3 percent for telephone input, and the recognition time is about 22 times real time.
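For illustration, the sketch below computes the frame-level log prediction residual at the core of this method, assuming the standard autocorrelation-method LPC (Levinson-Durbin recursion) and the log ratio of the reference predictor's residual to the input frame's own minimum residual; the function names are illustrative and this is not the DDP-516 implementation described above.

```python
import numpy as np
from scipy.linalg import toeplitz

def autocorrelation(frame, order):
    """Biased autocorrelation r[0..order] of one windowed speech frame."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    return r[:order + 1]

def lpc_from_autocorrelation(r):
    """Levinson-Durbin recursion: returns a = [1, a1, ..., ap], the predictor
    that minimizes the prediction residual for autocorrelation sequence r."""
    p = len(r) - 1
    a = np.zeros(p + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, p + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        new_a = a.copy()
        new_a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a

def log_prediction_residual(a_ref, r_in):
    """Log residual of the reference predictor a_ref applied to an input frame,
    normalized by the input frame's own minimum residual."""
    R = toeplitz(r_in)                     # Toeplitz autocorrelation matrix
    a_in = lpc_from_autocorrelation(r_in)  # the input frame's own predictor
    return np.log((a_ref @ R @ a_ref) / (a_in @ R @ a_in))
```

Frame-by-frame values of this distance would then be accumulated along an optimal registration path by dynamic programming (a recursion of the kind sketched under the Sakoe and Chiba entry below), and the input is assigned to the reference word giving the smallest total residual.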
Citations
Journal ArticleDOI
TL;DR: An efficient and intuitive algorithm is presented for the design of vector quantizers based either on a known probabilistic model or on a long training sequence of data.
Abstract: An efficient and intuitive algorithm is presented for the design of vector quantizers based either on a known probabilistic model or on a long training sequence of data. The basic properties of the algorithm are discussed and demonstrated by examples. Quite general distortion measures and long blocklengths are allowed, as exemplified by the design of parameter vector quantizers of ten-dimensional vectors arising in Linear Predictive Coded (LPC) speech compression with a complicated distortion measure arising in LPC analysis that does not depend only on the error vector.
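As a rough sketch of the design iteration described above, the code below runs a generalized Lloyd (k-means-style) loop on a training sequence with squared-error distortion and random initialization; the paper itself admits far more general distortion measures (such as the LPC distortion mentioned in the abstract) and is commonly paired with a splitting initialization, neither of which is reproduced here.

```python
import numpy as np

def design_vq(training, codebook_size, iters=50, seed=0):
    """Generalized Lloyd iteration on a training sequence, squared-error distortion."""
    rng = np.random.default_rng(seed)
    # Initialize the codebook from randomly chosen training vectors.
    codebook = training[rng.choice(len(training), codebook_size, replace=False)]
    for _ in range(iters):
        # Nearest-neighbor partition of the training sequence.
        d = ((training[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Centroid update: each codeword becomes the mean of its cell.
        for k in range(codebook_size):
            cell = training[labels == k]
            if len(cell):
                codebook[k] = cell.mean(axis=0)
    return codebook

# Example: ten-dimensional vectors, as in the LPC application mentioned above.
train = np.random.default_rng(1).normal(size=(2000, 10))
cb = design_vq(train, codebook_size=16)
```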

7,935 citations

Journal ArticleDOI
H. Sakoe, S. Chiba
TL;DR: This paper reports on an optimum dynamic programming (DP) based time-normalization algorithm for spoken word recognition, in which the slope of the warping function is restricted so as to improve discrimination between words in different categories.
Abstract: This paper reports on an optimum dynamic programming (DP) based time-normalization algorithm for spoken word recognition. First, a general principle of time-normalization is given using a time-warping function. Then, two time-normalized distance definitions, called symmetric and asymmetric forms, are derived from the principle. These two forms are compared with each other through theoretical discussions and experimental studies, and the superiority of the symmetric form is established. A new technique, called slope constraint, is introduced, in which the slope of the warping function is restricted so as to improve discrimination between words in different categories. The characteristics of the slope constraint are qualitatively analyzed, and the optimum slope constraint condition is determined through experiments. The optimized algorithm is then extensively compared experimentally with various DP algorithms previously applied to spoken word recognition by different research groups. The experiments show that the present algorithm gives no more than about two-thirds of the errors of even the best conventional algorithm.
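A minimal sketch of symmetric-form, time-normalized DP matching is given below, assuming Euclidean local distances and the basic unconstrained step pattern; the slope-constrained step patterns and adjustment window that the paper optimizes are not reproduced.

```python
import numpy as np

def dtw_symmetric(A, B):
    """Symmetric-form DTW distance between feature sequences A (N x d) and B (M x d)."""
    N, M = len(A), len(B)
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)  # local distances
    g = np.full((N, M), np.inf)
    g[0, 0] = 2 * d[0, 0]
    for i in range(N):
        for j in range(M):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0 and j > 0:
                candidates.append(g[i - 1, j - 1] + 2 * d[i, j])  # diagonal step, weight 2
            if i > 0:
                candidates.append(g[i - 1, j] + d[i, j])          # vertical step, weight 1
            if j > 0:
                candidates.append(g[i, j - 1] + d[i, j])          # horizontal step, weight 1
            g[i, j] = min(candidates)
    return g[N - 1, M - 1] / (N + M)  # time-normalized distance
```

The returned value is the accumulated weighted distance divided by N + M, which is the normalization factor of the symmetric form.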

5,906 citations


Cites methods from "Minimum prediction residual principle applied to speech recognition"

  • ...As a further investigation, the optimized algorithm is experimentally compared with several varieties of the DP-algorithm, which have been applied to spoken word recognition by some research groups [3]-[6]. The optimized algorithm superiority is established, indicating the validity of this investigation....


  • ...Four typical ones, including those proposed by Sakoe and Chiba [3], Velichko and Zagoruyko [4], White and Neely [5], and Itakura [6], were subjected to comparison with the algorithms described in this paper....


Journal ArticleDOI
TL;DR: In this article, several parametric representations of the acoustic signal were compared with regard to word recognition performance in a syllable-oriented continuous speech recognition system, and the emphasis was on the ability to retain phonetically significant acoustic information in the face of syntactic and duration variations.
Abstract: Several parametric representations of the acoustic signal were compared with regard to word recognition performance in a syllable-oriented continuous speech recognition system. The vocabulary included many phonetically similar monosyllabic words, therefore the emphasis was on the ability to retain phonetically significant acoustic information in the face of syntactic and duration variations. For each parameter set (based on a mel-frequency cepstrum, a linear frequency cepstrum, a linear prediction cepstrum, a linear prediction spectrum, or a set of reflection coefficients), word templates were generated using an efficient dynamic warping method, and test data were time registered with the templates. A set of ten mel-frequency cepstrum coefficients computed every 6.4 ms resulted in the best performance, namely 96.5 percent and 95.0 percent recognition with each of two speakers. The superior performance of the mel-frequency cepstrum coefficients may be attributed to the fact that they better represent the perceptually relevant aspects of the short-term speech spectrum.
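A minimal sketch of computing mel-frequency cepstrum coefficients for one windowed frame is shown below; the triangular filterbank and DCT formulation follow common practice and are not guaranteed to match the paper's exact filter spacing, so treat the specific constants as assumptions.

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr, fmin=0.0, fmax=None):
    """Triangular filters spaced uniformly on the mel scale."""
    fmax = fmax or sr / 2.0
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for k in range(1, n_filters + 1):
        lo, ctr, hi = bins[k - 1], bins[k], bins[k + 1]
        for b in range(lo, ctr):            # rising edge of the triangle
            fb[k - 1, b] = (b - lo) / max(ctr - lo, 1)
        for b in range(ctr, hi):            # falling edge of the triangle
            fb[k - 1, b] = (hi - b) / max(hi - ctr, 1)
    return fb

def mfcc(frame, sr, n_filters=20, n_coeffs=10):
    """Mel-frequency cepstrum coefficients of one windowed frame."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2                 # power spectrum
    energies = mel_filterbank(n_filters, len(frame), sr) @ spectrum
    return dct(np.log(energies + 1e-10), norm="ortho")[:n_coeffs]

# Example: ten coefficients from a 25.6 ms Hamming-windowed frame at 10 kHz.
sr = 10000
frame = np.hamming(256) * np.random.default_rng(0).normal(size=256)
coeffs = mfcc(frame, sr)
```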

4,822 citations

Journal ArticleDOI
David J. Thomson
01 Sep 1982
TL;DR: In this article, a local eigenexpansion is proposed to estimate the spectrum of a stationary time series from a finite sample of the process; the method is computationally equivalent to using the weighted average of a series of direct-spectrum estimates based on orthogonal data windows to treat both bias and smoothing problems.
Abstract: In the choice of an estimator for the spectrum of a stationary time series from a finite sample of the process, the problems of bias control and consistency, or "smoothing," are dominant. In this paper we present a new method based on a "local" eigenexpansion to estimate the spectrum in terms of the solution of an integral equation. Computationally this method is equivalent to using the weighted average of a series of direct-spectrum estimates based on orthogonal data windows (discrete prolate spheroidal sequences) to treat both the bias and smoothing problems. Some of the attractive features of this estimate are: there are no arbitrary windows; it is a small sample theory; it is consistent; it provides an analysis-of-variance test for line components; and it has high resolution. We also show relations of this estimate to maximum-likelihood estimates, show that the estimation capacity of the estimate is high, and show applications to coherence and polyspectrum estimates.
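A minimal sketch of the multitaper idea follows: average direct spectrum estimates computed with orthogonal discrete prolate spheroidal (Slepian) tapers. Equal weights are used for brevity; Thomson's adaptive weighting and the analysis-of-variance test for line components are omitted, and the parameter choices below are assumptions.

```python
import numpy as np
from scipy.signal.windows import dpss

def multitaper_psd(x, nw=4.0, n_tapers=7, fs=1.0):
    """Equal-weight average of eigenspectra from orthogonal Slepian tapers."""
    n = len(x)
    tapers = dpss(n, nw, n_tapers)                              # shape (n_tapers, n)
    eigenspectra = np.abs(np.fft.rfft(tapers * x, axis=1)) ** 2  # one direct estimate per taper
    psd = eigenspectra.mean(axis=0) / fs
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    return freqs, psd

# Example: spectrum of a noisy sinusoid, 1024 samples at fs = 1 kHz.
rng = np.random.default_rng(0)
t = np.arange(1024) / 1000.0
x = np.sin(2 * np.pi * 60.0 * t) + rng.normal(scale=0.5, size=t.size)
freqs, psd = multitaper_psd(x, fs=1000.0)
```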

3,921 citations

Book
01 Jan 2000
TL;DR: This book takes an empirical approach to language processing, based on applying statistical and other machine-learning algorithms to large corpora, to demonstrate how the same algorithm can be used for speech recognition and word-sense disambiguation.
Abstract: From the Publisher: This book takes an empirical approach to language processing, based on applying statistical and other machine-learning algorithms to large corpora. Methodology boxes are included in each chapter. Each chapter is built around one or more worked examples to demonstrate the main idea of the chapter. Covers the fundamental algorithms of various fields, whether originally proposed for spoken or written language, to demonstrate how the same algorithm can be used for speech recognition and word-sense disambiguation. Emphasis on web and other practical applications. Emphasis on scientific evaluation. Useful as a reference for professionals in any of the areas of speech and language processing.

3,794 citations

References
Journal ArticleDOI
TL;DR: Application of this method to efficient transmission and storage of speech signals, as well as procedures for determining other speech characteristics, such as formant frequencies and bandwidths, the spectral envelope, and the autocorrelation function, are discussed.
Abstract: A method of representing the speech signal by time‐varying parameters relating to the shape of the vocal tract and the glottal‐excitation function is described. The speech signal is first analyzed and then synthesized by representing it as the output of a discrete linear time‐varying filter, which is excited by a suitable combination of a quasiperiodic pulse train and white noise. The output of the linear filter at any sampling instant is a linear combination of the past output samples and the input. The optimum linear combination is obtained by minimizing the mean‐squared error between the actual values of the speech samples and their predicted values based on a fixed number of preceding samples. A 10th‐order linear predictor was found to represent the speech signal band‐limited to 5 kHz with sufficient accuracy. The 10 coefficients of the predictor are shown to determine both the frequencies and bandwidths of the formants. Two parameters relating to the glottal‐excitation function and the pitch period are determined from the prediction error signal. Speech samples synthesized by this method will be demonstrated.
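As a rough sketch of the analysis-synthesis idea described above, the code below fits a 10th-order linear predictor by least squares, forms the prediction-error (residual) signal from which the excitation parameters would be estimated, and resynthesizes by driving the all-pole filter with an excitation. The function names are illustrative, and the paper's pitch and voicing estimation is not reproduced.

```python
import numpy as np
from scipy.signal import lfilter

def lpc_fit(x, order=10):
    """Least-squares linear predictor: each sample is predicted from the
    'order' preceding samples."""
    # Regression x[n] ~ sum_k a[k] * x[n-k], for k = 1..order.
    X = np.column_stack([x[order - k: len(x) - k] for k in range(1, order + 1)])
    a, *_ = np.linalg.lstsq(X, x[order:], rcond=None)
    return a

def prediction_error(x, a):
    """Residual e[n] = x[n] - sum_k a[k] * x[n-k], from which excitation
    parameters (pitch period, voicing) would be estimated."""
    return lfilter(np.concatenate(([1.0], -a)), [1.0], x)

def synthesize(excitation, a):
    """Drive the all-pole filter 1 / (1 - sum_k a[k] z^-k) with an excitation
    (quasiperiodic pulses for voiced speech, white noise for unvoiced)."""
    return lfilter([1.0], np.concatenate(([1.0], -a)), excitation)
```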

1,124 citations

Journal ArticleDOI
TL;DR: The logarithmic characteristics of the acoustic signal in five bands are extracted as features, and the measure of similarity between standard and test words is calculated by maximizing a definite functional using dynamic programming.
Abstract: Experiments on the automatic recognition of 203 Russian words are described. The experimental vocabulary includes terms of the language ALGOL-60 together with others. The logarithmic characteristics of the acoustic signal in five bands are extracted as features. The measure of similarity between standard and test words is calculated by maximizing a definite functional using dynamic programming. The average reliability of recognition for one speaker, obtained in experiments using 5000 words, is 0.95. The computational time for recognition is 2-4 s.

214 citations