
Showing papers on "Dynamic time warping published in 1983"


Journal ArticleDOI
TL;DR: This paper presents an approach to speaker-independent, isolated word recognition in which the well-known techniques of vector quantization and hidden Markov modeling are combined with a linear predictive coding analysis front end in the framework of a standard statistical pattern recognition model.
Abstract: In this paper we present an approach to speaker-independent, isolated word recognition in which the well-known techniques of vector quantization and hidden Markov modeling are combined with a linear predictive coding analysis front end. This is done in the framework of a standard statistical pattern recognition model. Both the vector quantizer and the hidden Markov models need to be trained for the vocabulary being recognized. Such training results in a distinct hidden Markov model for each word of the vocabulary. Classification consists of computing the probability of generating the test word with each word model and choosing the word model that gives the highest probability. There are several factors, in both the vector quantizer and the hidden Markov modeling, that affect the performance of the overall word recognition system, including the size of the vector quantizer, the structure of the hidden Markov model, the ways of handling insufficient training data, etc. The effects, on recognition accuracy, of many of these factors are discussed in this paper. The entire recognizer (training and testing) has been evaluated on a 10-word digits vocabulary. For training, a set of 100 talkers spoke each of the digits one time. For testing, an independent set of 100 tokens of each of the digits was obtained. The overall recognition accuracy was found to be 96.5 percent for the 100-talker test set. These results are comparable to those obtained in earlier work, using a dynamic time-warping recognition algorithm with multiple templates per digit. It is also shown that the computation and storage requirements of the new recognizer were an order of magnitude less than that required for a conventional pattern recognition system using linear prediction with dynamic time warping.
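The classification rule described above amounts to scoring the test word's VQ symbol sequence against each word's hidden Markov model and choosing the maximum. Below is a minimal Python sketch of that decision rule, assuming each word model is a discrete HMM given as (pi, A, B) matrices over the codebook; the scaled forward recursion is a standard way to compute the likelihood and is shown for illustration, not as the authors' exact implementation.

```python
import numpy as np

def forward_log_likelihood(obs, pi, A, B):
    """Scaled forward recursion: log P(obs | model) for a discrete HMM.

    obs : sequence of VQ codebook indices
    pi  : (N,) initial state probabilities
    A   : (N, N) state transition matrix
    B   : (N, M) emission probabilities over the M codebook symbols
    """
    alpha = pi * B[:, obs[0]]
    c = alpha.sum()
    log_lik = np.log(c)
    alpha = alpha / c
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # propagate, then weight by emission
        c = alpha.sum()
        log_lik += np.log(c)            # accumulate log scale factors
        alpha = alpha / c
    return log_lik

def classify(obs, word_models):
    """Choose the vocabulary word whose HMM gives the highest likelihood."""
    return max(word_models,
               key=lambda w: forward_log_likelihood(obs, *word_models[w]))
```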

337 citations



Proceedings ArticleDOI
14 Apr 1983
TL;DR: A new technique for text-independent speaker recognition is proposed which uses a statistical model of the speaker's vector quantized speech, retaining text-independent properties while allowing considerably shorter test utterances than comparable speaker recognition systems.
Abstract: A new technique for text-independent speaker recognition is proposed which uses a statistical model of the speaker's vector quantized speech. The technique retains text-independent properties while allowing considerably shorter test utterances than comparable speaker recognition systems. The frequently occurring vectors, or characters, form a model of multiple points in the n-dimensional speech space instead of the usual single-point models. The speaker recognition depends on the statistical distribution of the distances between the speech frames from the unknown speaker and the closest points in the model. Models were generated with 100 seconds of conversational training speech for each of 11 male speakers. The system was able to identify the 11 speakers with 96%, 87%, and 79% accuracy from sections of unknown speech of durations of 10, 5, and 3 seconds, respectively. Accurate recognition was also obtained even when there were variations in the channels over which the training and testing data were obtained. A real-time demonstration system has been implemented, including both training and recognition processes.
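As a rough illustration of the decision principle, the sketch below scores an unknown utterance against each speaker's codebook by the average distance from its frames to the nearest model point. The paper uses the statistical distribution of these distances rather than only their mean, so this is a simplified, assumption-laden version, not the published system.

```python
import numpy as np

def mean_quantization_distortion(frames, codebook):
    """Average distance from each speech frame to its nearest model point."""
    # frames: (T, d) feature vectors; codebook: (K, d) speaker model points
    d = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=-1)
    return d.min(axis=1).mean()

def identify_speaker(frames, speaker_codebooks):
    """Pick the speaker whose vector-quantized model best fits the frames."""
    return min(speaker_codebooks,
               key=lambda s: mean_quantization_distortion(frames, speaker_codebooks[s]))
```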

66 citations


PatentDOI
TL;DR: In this article, the zero crossing intervals of the input speech are measured and sorted by duration, to provide a rough measure of the frequency distribution within each input frame, which is transformed into a binary feature vector, and compared with each reference template using a modified Hamming distance measure.
Abstract: Speaker-independent word recognition is performed, based on a small acoustically distinct vocabulary, with minimal hardware requirements. After a simple preconditioning filter, the zero crossing intervals of the input speech are measured and sorted by duration, to provide a rough measure of the frequency distribution within each input frame. The distribution of zero crossing intervals is transformed into a binary feature vector, which is compared with each reference template using a modified Hamming distance measure. A dynamic time warping algorithm is used to permit recognition at varying speaking rates and to economize on the reference template storage requirements. A mask vector associated with each reference vector of a template is used to ignore insignificant (or speaker-dependent) features of the words detected.
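A minimal sketch of the masked comparison step, assuming the binary feature vector, reference vector, and mask are packed into Python integers; the actual patented distance measure may differ in detail.

```python
def masked_hamming(test_bits, ref_bits, mask_bits):
    """Count mismatching feature bits, ignoring positions the mask marks as
    insignificant (speaker-dependent) for this reference frame."""
    return bin((test_bits ^ ref_bits) & mask_bits).count("1")

# e.g. masked_hamming(0b10110, 0b10011, 0b11110) -> 1
```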

55 citations


Journal ArticleDOI
Weste, Burr, Ackland
TL;DR: This paper describes the architecture, algorithms, and design of a CMOS integrated processing array used for computing the dynamic time warp algorithm.
Abstract: Dynamic time warping is a well-established technique for time alignment and comparison of speech and image patterns. This paper describes the architecture, algorithms, and design of a CMOS integrated processing array used for computing the dynamic time warp algorithm. Emphasis is placed on speech recognition applications because of the real-time constraints imposed by isolated and continuous speech recognition.
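The recurrence such an array evaluates is the standard DTW one; what the hardware can exploit is that all cells on an anti-diagonal are mutually independent. The sketch below (plain Python with NumPy and an illustrative city-block frame distance) computes the same recurrence in that anti-diagonal order purely to show the data dependence, not to model the CMOS design.

```python
import numpy as np

def dtw_wavefront(ref, test):
    """DTW cost evaluated along anti-diagonals: every cell on one anti-diagonal
    depends only on cells from the two previous anti-diagonals, so a processor
    array can compute a whole anti-diagonal in parallel."""
    n, m = len(ref), len(test)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for k in range(2, n + m + 1):                      # k = i + j indexes the anti-diagonal
        for i in range(max(1, k - m), min(n, k - 1) + 1):
            j = k - i
            c = float(np.abs(ref[i - 1] - test[j - 1]).sum())
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```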

51 citations


Proceedings ArticleDOI
01 Apr 1983
TL;DR: The experiments reported here are the first in which a direct comparison is made between two conceptually different methods of treating the non-stationarity problem in speech recognition by implicitly dividing the speech signal into quasi-stationary intervals.
Abstract: A method for speaker independent isolated digit recognition based on modeling entire words as discrete probabilistic functions of a Markov chain is described. Training is a three part process comprising conventional methods of linear prediction coding (LPC) and vector quantization of the LPCs followed by an algorithm for estimating the parameters of a hidden Markov process. Recognition utilizes linear prediction and vector quantization steps prior to maximum likelihood classification based on the Viterbi algorithm. Vector quantization is performed by a K-means algorithm which finds a codebook of 64 prototypical vectors that minimize the distortion measure (Itakura distance) over the training set. After training based on a 1,000 token set, recognition experiments were conducted on a separate 1,000 token test set obtained from the same talkers. In this test a 3.5% error rate was observed which is comparable to that measured in an identical test of an LPC/DTW (dynamic time warping) system. The computational demand for recognition under the new system is reduced by a factor of approximately 10 in both time and memory compared to that of the LPC/DTW system. It is also of interest that the classification errors made by the two systems are virtually disjoint; thus the possibility exists to obtain error rates near 1% by a combination of the methods. In describing our experiments we discuss several issues of theoretical importance, namely: 1) Alternatives to the Baum-Welch algorithm for model parameter estimation, e.g., Lagrangian techniques; 2) Model combining techniques by means of a bipartite graph matching algorithm providing improved model stability; 3) Methods for treating the finite training data problem by modifications to both the Baum-Welch algorithm and Lagrangian techniques; and 4) Use of non-ergodic Markov chains for isolated word recognition. We note that the experiments reported here are the first in which a direct comparison is made between two conceptually different (i.e. parametric and non-parametric) methods of treating the non-stationarity problem in speech recognition by implicitly dividing the speech signal into quasi-stationary intervals.
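As an illustration of the vector quantization step described above, here is a toy K-means codebook builder. It uses a Euclidean distance in place of the Itakura distortion measure named in the abstract, so it is only a structural sketch of the training step, not the authors' procedure.

```python
import numpy as np

def kmeans_codebook(frames, k=64, iters=20, seed=0):
    """Toy K-means: build a k-entry VQ codebook from training feature frames."""
    rng = np.random.default_rng(seed)
    codebook = frames[rng.choice(len(frames), size=k, replace=False)].copy()
    for _ in range(iters):
        # assign each frame to its nearest codebook entry
        d = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # move each entry to the centroid of its assigned frames
        for j in range(k):
            members = frames[labels == j]
            if len(members):
                codebook[j] = members.mean(axis=0)
    return codebook
```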

36 citations


Journal ArticleDOI
TL;DR: The effects of two major design choices on the performance of an isolated word speech recognition system are examined in detail: 1) the choice of a warping algorithm among the Itakura asymmetric, the Sakoe and Chiba symmetric, and the Sakoe and Chiba asymmetric; and 2) the size of the warping window used to reduce computation time.
Abstract: In this paper, the effects of two major design choices on the performance of an isolated word speech recognition system are examined in detail. They are: 1) the choice of a warping algorithm among the Itakura asymmetric, the Sakoe and Chiba symmetric, and the Sakoe and Chiba asymmetric, and 2) the size of the warping window to reduce computation time. Two vocabularies were used: the digits (zero, one,..., nine) and a highly confusable subset of the alphabet (b, c, d, e, g, p, t, v, z). The Itakura asymmetric warping algorithm appears to be slightly better than the other two for the confusable vocabulary. We discuss the reasons why the performance of the algorithms is vocabulary dependent. Finally, for the data used in our experiments, a warping window of about 100 ms appears to be optimal.
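A warping window simply restricts how far the alignment path may stray from the diagonal. The sketch below uses an illustrative city-block frame distance and a symmetric band of +/- window_frames cells; it is not the exact Itakura or Sakoe and Chiba local path constraints compared in the paper.

```python
import numpy as np

def dtw_windowed(ref, test, window_frames):
    """DTW restricted to a band of +/- window_frames cells around the diagonal.

    Assumes ref and test are (n, d) and (m, d) arrays of roughly equal length.
    """
    n, m = len(ref), len(test)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - window_frames), min(m, i + window_frames) + 1):
            c = float(np.abs(ref[i - 1] - test[j - 1]).sum())
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# With 10 ms frames, the ~100 ms window reported here corresponds to window_frames = 10.
```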

26 citations


Journal ArticleDOI
TL;DR: It is shown that the dynamic time warping procedures used for isolated word recognition apply almost as well to alignment of sentence length utterances, and one must apply caution in using the time alignment contour for synthesis or recognition applications.
Abstract: One way to improve the quality of synthetic speech, and to learn about temporal aspects of speech recognition, is to study the problem of time aligning pairs of spoken sentences. For example, one could evaluate various sets of duration rules for synthesis by comparing the time alignments of speech sounds within synthetic sentences to those of naturally spoken sentences. In this manner, an improved set of sound duration rules could be obtained by applying some objective measure to the alignment scores. For speech recognition applications, one could obtain automatic labeling of continuous speech from a hand-marked prototype to obtain models and/or statistical data on sounds within sentences. A key question in the use of automatic alignment of sentence length utterances is whether the time warping methods, developed for isolated word recognition, could be extended to the problem of time aligning sentence length utterances (up to several seconds long). A second key question is the reliability and accuracy of such an alignment. In this paper we investigate these questions. It is shown that, with some simple modifications, the dynamic time warping procedures used for isolated word recognition apply almost as well to alignment of sentence length utterances. It is also shown that, on the average, the uncertainty in the location of significant events within the sentence is much smaller than the event durations although the largest errors are longer than some event durations. Hence, one must apply caution in using the time alignment contour for synthesis or recognition applications.
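For the automatic-labeling application mentioned above, the alignment path itself is the useful output: hand-marked event times on the prototype sentence are carried across to the new utterance through the path. A small illustrative helper is shown below, assuming the path is already available as (reference_frame, test_frame) pairs from a DTW backtrack; the names and format are assumptions for the sketch.

```python
def transfer_labels(path, ref_labels):
    """Map hand-marked reference frame indices onto the test utterance.

    path       : list of (ref_frame, test_frame) pairs from a DTW alignment,
                 in increasing order
    ref_labels : dict of event name -> reference frame index
    """
    mapped = {}
    for name, ref_frame in ref_labels.items():
        # first test frame aligned at or beyond the labeled reference frame;
        # fall back to the last test frame if the label lies past the path end
        mapped[name] = next((t for r, t in path if r >= ref_frame), path[-1][1])
    return mapped
```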

24 citations


Proceedings ArticleDOI
01 Apr 1983
TL;DR: Results indicate that discrimination between similar sounding words can be greatly improved, and an alternative DTW approach which is able to focus its attention on those parts of a speech pattern which serve to distinguish it from similar patterns is presented.
Abstract: Whole-word pattern matching using dynamic time-warping (DTW) has achieved considerable success as an algorithm for automatic speech recognition. However, the performance of such an algorithm is ultimately limited by its inability to discriminate between similar sounding words. The problem arises because all differences between speech patterns are treated as being equally important, hence the algorithm is particularly susceptible to confusions caused by irrelevant differences. This paper presents an alternative DTW approach which is able to focus its attention on those parts of a speech pattern which serve to distinguish it from similar patterns. A network-type data structure is derived from reference speech patterns, and the separate paths through the network determine the regions where recognition takes place. Results indicate that discrimination between similar sounding words can be greatly improved.

23 citations


Proceedings ArticleDOI
01 Apr 1983
TL;DR: An O(n) dynamic programming (time warping) matching algorithm is proposed for connected spoken word recognition that is computationally more efficient than the conventional methods and also requires far less memory.
Abstract: An O(n) dynamic programming (time warping) matching algorithm is proposed for connected spoken word recognition. The algorithm is based on the same principle as the Two-Level DP, Level Building DP, and Clock-Wise DP matching methods, but is computationally more efficient than those conventional methods and requires far less memory. It recognizes connected spoken words by one-level (one-pass) DP matching over all reference patterns, with an amount of computation equal to that of isolated spoken word recognition by DP matching, and it gives the same results as the Two-Level DP, Level Building DP, and Clock-Wise DP methods. The flexibility of the algorithm is further increased by embedding unconstrained-endpoint DP matching on both the reference patterns and the test pattern, and weighted DP matching. Experiments on connected spoken digit recognition show that the augmented algorithm improves the results.
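To make the one-pass idea concrete, here is a heavily simplified Python sketch: each test frame updates every template in place, a template may be entered from the best word sequence completed so far, and the recognized string is read off at the last frame. The local path constraints, endpoint relaxation, and weighting of the paper are all omitted, so this is an assumption-laden illustration rather than the proposed algorithm.

```python
import numpy as np

def frame_dist(a, b):
    # illustrative city-block frame distance
    return float(np.abs(a - b).sum())

def one_pass_dp(test, references):
    """Very simplified one-pass connected-word DP.

    test       : (T, d) array of feature frames for the connected utterance
    references : dict mapping word -> (L_w, d) template array
    Returns the best word sequence ending exactly at the last test frame.
    """
    INF = float("inf")
    # score[w][i]: best cost of reaching frame i of template w so far
    # hist[w][i] : words already completed on that best path
    score = {w: [INF] * len(r) for w, r in references.items()}
    hist = {w: [()] * len(r) for w, r in references.items()}
    best_end, best_hist = 0.0, ()        # best completed prefix (none yet)

    for frame in test:
        new_score = {w: [INF] * len(r) for w, r in references.items()}
        new_hist = {w: [()] * len(r) for w, r in references.items()}
        for w, ref in references.items():
            for i in range(len(ref)):
                c = frame_dist(ref[i], frame)
                # stay on the same template frame, or advance by one,
                # or (at frame 0) enter this word after a completed prefix
                cands = [(score[w][i], hist[w][i])]
                if i > 0:
                    cands.append((score[w][i - 1], hist[w][i - 1]))
                else:
                    cands.append((best_end, best_hist))
                prev_cost, prev_hist = min(cands, key=lambda x: x[0])
                new_score[w][i] = prev_cost + c
                new_hist[w][i] = prev_hist
        score, hist = new_score, new_hist
        # a word is complete whenever its last template frame has been reached
        best_end, best_hist = min(
            ((score[w][-1], hist[w][-1] + (w,)) for w in references),
            key=lambda x: x[0])
    return best_hist
```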

16 citations


Proceedings ArticleDOI
01 Apr 1983
TL;DR: The averaging procedure, which is based on a purely sequential processing of the tokens, contains additional weighting operations for word boundaries and scaling of the time axis, which improve the robustness of the learning procedure.
Abstract: This paper presents a learning procedure for speaker-dependent word recognition systems which are based on the principle of dynamic time warping. The reference templates are created by averaging word tokens for each class. The averaging procedure, which is based on a purely sequential processing of the tokens, contains additional weighting operations for word boundaries and scaling of the time axis. These operations improve the robustness of the learning procedure. The new learning procedure has been tested with different speech examples, some of which were recorded in extremely noisy conditions with casual speakers. In all cases the learning procedure yields very reliable reference templates.
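The core of such sequential averaging can be sketched as follows: a new token is time-aligned to the current reference by DTW, and each reference frame is updated with the running mean of the token frames mapped onto it. The boundary weighting and time-axis scaling described in the paper are left out, and the path format is an assumption.

```python
import numpy as np

def update_template(template, token, path, count):
    """Average a new token into the reference template along a DTW path.

    template : (L, d) current reference pattern
    token    : (T, d) new training token
    path     : list of (template_frame, token_frame) pairs from a DTW alignment
    count    : number of tokens already averaged into the template
    """
    new = template.copy()
    for i in range(len(template)):
        aligned = [token[j] for ti, j in path if ti == i]
        if aligned:
            tok_mean = np.mean(aligned, axis=0)
            # running mean: the template length stays fixed, only its frames move
            new[i] = (count * template[i] + tok_mean) / (count + 1)
    return new
```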

Proceedings ArticleDOI
01 Apr 1983
TL;DR: Three non-linear compression techniques, which have come under consideration during the study of recognition systems developed at LIMSI, are described and compared to linear compressions in an isolated word recognition framework for different vocabularies.
Abstract: Dynamic time warping is a very efficient technique for dealing with the problem of time distortion between different pronunciations of any given word. However, when time normalisation in a word-based recognition system (isolated or connected words) is based solely on a DP-matching process, considerable processing time is required. Another characteristic is that the general constraints used to optimise the DP-matching algorithm impose a severe limit on acceptable time distortions. In this paper, we evaluate the interesting aspects of non-linear time compression methods which carry out a first time normalisation prior to DP-matching in word-based recognition. We describe three non-linear compression techniques which have come under consideration during the study of recognition systems developed at LIMSI. These non-linear compression methods are compared to linear compressions in an isolated word recognition framework for different vocabularies.
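One well-known non-linear compression of this kind is trace segmentation, which resamples an utterance at equal increments of cumulative spectral change rather than at equal time steps. It is sketched below purely as an illustrative example and is not necessarily one of the three LIMSI techniques evaluated in the paper.

```python
import numpy as np

def trace_segmentation(frames, n_out):
    """Compress a (T, d) feature sequence to n_out frames by cutting the
    cumulative spectral-change 'trace' into equal-length pieces."""
    steps = np.linalg.norm(np.diff(frames, axis=0), axis=1)
    trace = np.concatenate([[0.0], np.cumsum(steps)])
    targets = np.linspace(0.0, trace[-1], n_out)
    idx = np.searchsorted(trace, targets)
    idx = np.clip(idx, 0, len(frames) - 1)
    return frames[idx]
```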

Proceedings ArticleDOI
01 Apr 1983
TL;DR: Results are presented which show that these techniques for incorporating information about timescale variability directly into the Dynamic Time-Warping process can lead to considerable improvements in recognition accuracy, especially if the differences between word classes are mainly due to temporal structure.
Abstract: Dynamic Time-Warping is one of the most important tools available for overcoming timescale variability problems in Automatic Speech Recognition. One of the main problems associated with the technique is to constrain the behaviour of the algorithm in order to avoid unlikely timescale distortion. This paper describes techniques for incorporating information about timescale variability directly into the Dynamic Time-Warping process. Results are presented which show that these techniques can lead to considerable improvements in recognition accuracy, especially if the differences between word classes are mainly due to temporal structure.

Proceedings ArticleDOI
14 Apr 1983
TL;DR: A simple algorithm to perform speaker-independent word recognition with modest performance on a small, acoustically distinct, vocabulary is described and demonstrates implementation potential via simple analog circuitry and an 8-bit microcomputer.
Abstract: A simple algorithm to perform speaker-independent word recognition with modest performance on a small, acoustically distinct, vocabulary is described. The primary measurements are the zero crossing intervals of the acoustic waveform. Robust and reliable performance was achieved by deriving binary-valued features, using dynamic time warping and invoking high level logic. The algorithm demonstrates implementation potential via simple analog circuitry and an 8-bit microcomputer. The performance achieved was 85% correct recognition, 9% rejection and 6% substitution for a set of six words uttered by over 100 speakers.

Proceedings ArticleDOI
01 Apr 1983
TL;DR: This paper addresses the use of linear frequency warping for template normalization and describes both a technique for estimating the long-term distribution of the frequencies of a talker's formants and a techniques for automatically predicting an optimal linear frequency warp.
Abstract: In a template-based, speaker-independent, speech recognition system, stored templates may be used in matching the speech of new users. For optimal results, templates should be carefully selected and proper normalization algorithms should be applied for each new talker. This paper addresses the use of linear frequency warping for template normalization and describes both a technique for estimating the long-term distribution of the frequencies of a talker's formants and a technique for automatically predicting an optimal linear frequency warp.
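Applying a linear frequency warp to a template or test spectrum is itself straightforward: the frequency axis is rescaled by a factor alpha (chosen, in the paper, from the estimated formant distribution) and the spectrum is re-interpolated. A minimal sketch follows, with the alpha-prediction step deliberately left out since the paper's technique is not reproduced here.

```python
import numpy as np

def linear_frequency_warp(spectrum, alpha):
    """Resample a magnitude spectrum onto a linearly scaled frequency axis.

    alpha > 1 shifts spectral features (e.g. formants) upward in frequency,
    alpha < 1 compresses them downward.
    """
    bins = np.arange(len(spectrum))
    # output bin k takes the input value at frequency bin k / alpha
    return np.interp(bins / alpha, bins, spectrum, right=spectrum[-1])
```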

Journal ArticleDOI
TL;DR: Fitch et al. as mentioned in this paper used dynamic time warping to eliminate differences in duration between the template and the unknown speech, and used the relative durations of certain segments in a word to distinguish it from most false alarms.
Abstract: Template matching methods use dynamic time warping to eliminate differences in duration between the template and the unknown speech. These differences may be linguistically relevant. It has been demonstrated for one test word that the relative durations of certain segments in that word can be used to distinguish it from most “false alarms”—stretches of speech other than the test word that score well on the spectrally‐based template match [H. L. Fitch, Proc. ICASSP 82, 1247–1250 (1982)]. Here, a more general procedure is described for segmenting, and testing the relative durations of the segments. This procedure has now been applied to ten words, and shows promising results.

Journal ArticleDOI
TL;DR: A single mathematical description of dynamic time warping is presented that unifies these and other approaches, and highlights their similarities and differences, and it is found that in general, the use of additional levels of computation results in a relatively small decrease in recognition error.
Abstract: The application of dynamic time warping to the problem of connected word recognition has recently received much attention. Several successful approaches have appeared in the literature including the level building algorithm [C. Myers and L. Rabiner, IEEE Trans. Acoust. Speech Signal Process. ASSP‐29, 284–297 (1981)] and the single pass algorithm [J. Bridle, M. Brown, and R. Chamberlain, Proc. 1982 IEEE ICASSP]. A single mathematical description of dynamic time warping is presented that unifies these and other approaches, and highlights their similarities and differences. By testing the algorithms with a database of connected digits, it is found that in general, the use of additional levels of computation results in a relatively small decrease in recognition error.

Proceedings Article
08 Aug 1983
TL;DR: Although the augmented continuous dynamic programming algorithm obtains a near-optimal solution for the recognition principle based on pattern matching, it is computationally more efficient than the conventional methods and also requires far less memory.
Abstract: The technique of dynamic time warping by means of dynamic programming is powerful for isolated word recognition. An augmented continuous dynamic programming algorithm is proposed for connected spoken word recognition with syntactical constraints. The algorithm is based on the same principle as two-level DP and level building DP. Although our algorithm obtains a near-optimal solution for the recognition principle based on pattern matching, it is computationally more efficient than the conventional methods and also requires far less memory. It is therefore useful for connected word recognition with syntactical constraints in a large vocabulary. The amount of computation is almost the same as that for isolated word recognition.

Proceedings ArticleDOI
14 Apr 1983
TL;DR: A speaker independent isolated word speech recognition system is developed based on computer generated phonemes, where, when unvoiced fricatives occur at either the beginning or end of a word, a representative CGP is created.
Abstract: A speaker independent isolated word speech recognition system is developed based on computer generated phonemes (CGP). A CGP is a vector of features that has been generated to represent a region of speech. The CGP creation algorithm looks for stable sounds in the incoming word through the use of a similarity measure. When a stable sound is detected, a CGP is created to represent it. In addition to the creation of CGPs for stable vocal tract sounds, when unvoiced fricatives occur at either the beginning or end of a word, a representative CGP is created. Using a heavily constrained dynamic time warping algorithm, the CGPs of the incoming word are then compared against reference templates, which consist of previously created strings of CGPs. The identity of the reference template which is closest in distance to the incoming test word is chosen as the estimate of the test word.

Journal ArticleDOI
TL;DR: An input signature is validated by elastic matching against a reference specimen signature, checking relational consistency only to the extent that is practicable in a low-cost real-time system.

Proceedings ArticleDOI
01 Apr 1983
TL;DR: This paper proposes an algorithm which takes into consideration nearly invariant properties in the time axis of the transient parts of the speech to solve the speaker independent word recognition problem.
Abstract: In speaker-independent word recognition, the time-normalization problem has not been completely solved at present. In an attempt to solve this problem, this paper proposes an algorithm which takes into consideration nearly invariant properties in the time axis of the transient parts of the speech. Using this algorithm, speaker-independent recognition becomes possible by performing multi-dimensional analysis of the transient parts of the speech after transient detection. This algorithm has been applied to 12-word recognition and 101-monosyllable recognition, showing correct rates of about 98% and 82%, respectively, for non-learning data.