
Showing papers on "Dynamic time warping published in 1993"


Proceedings ArticleDOI
15 Jun 1993
TL;DR: A method for learning, tracking, and recognizing human gestures using a view-based approach to model articulated objects is presented and results showing tracking and recognition of human hand gestures at over 10 Hz are presented.
Abstract: A method for learning, tracking, and recognizing human gestures using a view-based approach to model articulated objects is presented. Objects are represented using sets of view models, rather than single templates. Stereotypical space-time patterns, i.e., gestures, are then matched to stored gesture patterns using dynamic time warping. Real-time performance is achieved by using special purpose correlation hardware and view prediction to prune as much of the search space as possible. Both view models and view predictions are learned from examples. Results showing tracking and recognition of human hand gestures at over 10 Hz are presented.

425 citations
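The template-matching step this gesture system relies on is ordinary dynamic time warping. A minimal sketch of the standard DTW recurrence (illustrative only, not the authors' correlation-hardware implementation; `dtw_distance` is a hypothetical helper name):

```python
import numpy as np

def dtw_distance(a, b):
    """Cumulative-cost dynamic time warping between two 1-D sequences.

    Returns the minimum total alignment cost under the standard
    match / insert / delete step pattern.
    """
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return D[n, m]

# A time-shifted copy of a pattern aligns cheaply under DTW, whereas a
# frame-by-frame Euclidean comparison would penalize the shift.
ref = [0, 0, 1, 2, 1, 0, 0]
test = [0, 1, 2, 1, 0, 0, 0]
print(dtw_distance(ref, test))  # → 0.0: the shapes align exactly
```

Because the warping path may repeat or skip samples, gestures performed at slightly different speeds still match the stored patterns.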


Journal ArticleDOI
TL;DR: A new minimum recognition error formulation and a generalized probabilistic descent (GPD) algorithm are analyzed and used to accomplish discriminative training of a conventional dynamic-programming-based speech recognizer.
Abstract: A new minimum recognition error formulation and a generalized probabilistic descent (GPD) algorithm are analyzed and used to accomplish discriminative training of a conventional dynamic-programming-based speech recognizer. The objective of discriminative training here is to directly minimize the recognition error rate. To achieve this, a formulation that allows controlled approximation of the exact error rate and renders optimization possible is used. The GPD method is implemented in a dynamic-time-warping (DTW)-based system. A linear discriminant function on the DTW distortion sequence is used to replace the conventional average DTW path distance. A series of speaker-independent recognition experiments using the highly confusable English E-set as the vocabulary showed a recognition rate of 84.4% compared to approximately 60% for traditional template training via clustering. The experimental results verified that the algorithm converges to a solution that achieves minimum error rate.

165 citations
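The minimum-error training described here can be illustrated with a toy generalized probabilistic descent step. This sketch replaces the DTW-based discriminant with a plain negative squared distance to a class template; `gpd_step`, the learning rate, and the sigmoid slope `alpha` are illustrative assumptions, not the authors' settings:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gpd_step(templates, x, label, lr=0.5, alpha=2.0):
    """One GPD update on class templates (toy sketch).

    g_k(x) = -||x - t_k||^2 stands in for the paper's DTW-based
    discriminant; the misclassification measure d = -g_label + g_rival
    is squashed by a sigmoid loss and both templates move along its
    gradient.
    """
    g = {k: -np.sum((x - t) ** 2) for k, t in templates.items()}
    rival = max((k for k in g if k != label), key=g.get)
    d = -g[label] + g[rival]        # > 0 means this token is misclassified
    s = sigmoid(alpha * d)          # smoothed error for this token
    grad = alpha * s * (1.0 - s)    # dLoss/dd
    templates[label] += lr * grad * 2 * (x - templates[label])  # pull correct class in
    templates[rival] -= lr * grad * 2 * (x - templates[rival])  # push best rival away
    return s

# Driving the smoothed error down moves the decision boundary directly,
# rather than fitting each class distribution separately.
templates = {'a': np.array([0.0, 0.0]), 'b': np.array([1.0, 1.0])}
x, label = np.array([0.8, 0.8]), 'a'
losses = [gpd_step(templates, x, label) for _ in range(200)]
```

The key property mirrored from the paper is that the quantity being minimized is a smooth surrogate of the recognition error itself, not a likelihood.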


Proceedings ArticleDOI
19 Oct 1993
TL;DR: A segment-based speech recognition scheme is proposed that explicitly models the correlations between successive frames of an acoustic segment with features representing the contours of spectral parameters, captured by several lower-order coefficients of discrete orthonormal polynomial expansions.
Abstract: A segment-based speech recognition scheme is proposed. The basic idea is to explicitly model the correlations between successive frames of an acoustic segment by using features representing the contours of spectral parameters. These segmental features are several lower-order coefficients of discrete orthonormal polynomial expansions. The performance of the proposed scheme was examined by simulations on multi-speaker speech recognition for all 408 highly confusing first-tone Mandarin syllables. A recognition rate of 77.4% was achieved for this case, using five 6-segment reference templates per syllable. This is 13.0% and 6.6% higher than the rates obtained by a conventional dynamic time warping (DTW) method and a conventional hidden Markov model (CHMM) method, respectively.

159 citations
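The segmental features described above are low-order coefficients of an orthogonal-polynomial fit to a parameter contour. A small sketch using NumPy's Legendre fit as a stand-in for the paper's discrete orthonormal expansion (function name and order are assumptions):

```python
import numpy as np

def segment_features(contour, order=2):
    """Compress a spectral-parameter contour into its low-order
    orthogonal-polynomial coefficients (Legendre on a normalized time
    axis; an approximation of the paper's discrete orthonormal
    expansion)."""
    t = np.linspace(-1.0, 1.0, len(contour))
    return np.polynomial.legendre.legfit(t, contour, order)

# A rising parameter track collapses to mean + slope (+ curvature)
# terms, so correlations between successive frames are captured inside
# a single segment instead of being modeled frame by frame.
coeffs = segment_features(np.linspace(0.0, 1.0, 20))
```

For an exactly linear contour the fit recovers only a mean and a slope term, with the curvature coefficient at zero.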


Proceedings ArticleDOI
27 Apr 1993
TL;DR: The authors show how recognition performance in automated speech perception can be significantly improved by additional lipreading, so-called speech-reading, demonstrated on an extension of a state-of-the-art speech recognition system, a modular multi-state time-delay neural network architecture (MS-TDNN).
Abstract: The authors show how recognition performance in automated speech perception can be significantly improved by additional lipreading, so-called speech-reading. They show this on an extension of a state-of-the-art speech recognition system, a modular multi-state time-delay neural network architecture (MS-TDNN). The acoustic and visual speech data are preclassified in two separate front-end phoneme TDNNs and combined into acoustic-visual hypotheses for the dynamic time warping algorithm. This is shown on a connected word recognition problem, the notoriously difficult letter-spelling task. With speech-reading, the error rate could be reduced by up to half of the error rate of pure acoustic recognition.

145 citations


Book ChapterDOI
01 Jan 1993
TL;DR: While this study focuses on the feasibility, validity, and segregated contribution of exclusively continuous OASR, future highly robust recognition systems should combine optical and acoustic information with syntactic, semantic and pragmatic aids.
Abstract: This study describes the design and implementation of a novel continuous speech recognizer that uses optical information from the oral-cavity shadow of a speaker. The system uses hidden Markov models (HMMs) trained to discriminate optical information and achieves a recognition rate of 25.3 percent on 150 test sentences. This is the first system to accomplish continuous optical automatic speech recognition (OASR). This level of performance--without the use of syntactical, semantic, or any other contextual guide to the recognition process--indicates that OASR may be used as a major supplement for robust multi-modal recognition in noisy environments. Additionally, new features important for OASR were discovered, and novel approaches to vector quantization, training, and clustering were utilized. This study contains three major components. First, it hypothesizes 35 static and dynamic optical features to characterize the speaker's oral-cavity shadow. Using the corresponding correlation matrix and a principal component analysis, the study discarded 22 oral-cavity features. The remaining 13 oral-cavity features are mostly dynamic features, unlike the static features used by previous researchers. Second, the study merged phonemes that appear optically similar on the speaker's oral-cavity region into visemes. The visemes were objectively analyzed and discriminated using HMM and clustering algorithms. Most significantly, the visemes for the speaker, obtained through computation, are consistent with the phoneme-to-viseme mapping discussed by most lipreading experts. This similarity, in a sense, verifies the selection of oral-cavity features. Third, the study trained the HMMs to recognize, without a grammar, a set of sentences having a perplexity of 150, using visemes, trisemes (triplets of visemes), and generalized trisemes (clustered trisemes).
The system achieved recognition rates of 2 percent, 12.7 percent, and 25.3 percent using, respectively, viseme HMMs, triseme HMMs, and generalized triseme HMMs. The study concludes that the methodologies used in this investigation demonstrate the need for further research on continuous OASR and on the integration of optical information with other recognition methods. While this study focuses on the feasibility, validity, and segregated contribution of exclusively continuous OASR, future highly robust recognition systems should combine optical and acoustic information with syntactic, semantic, and pragmatic aids.

94 citations


Patent
Delbert D. Bailey1, Carole Dulong1
12 May 1993
TL;DR: A pattern recognition engine is provided that contains five pipelines which operate in parallel and are specially optimized for Dynamic Time Warping and Hidden Markov Models procedures for pattern recognition, especially handwriting recognition.
Abstract: A computer implemented apparatus and method of pattern recognition utilizing a pattern recognition engine coupled with a general purpose computer system. The present invention system provides increased accuracy and performance in handwriting and voice recognition systems and may interface with general purpose computer systems. A pattern recognition engine is provided within the present invention that contains five pipelines which operate in parallel and are specially optimized for Dynamic Time Warping and Hidden Markov Models procedures for pattern recognition, especially handwriting recognition. These pipelines comprise two arithmetic pipelines, one control pipeline and two pointer pipelines. Further, a private memory is associated with each pattern recognition engine for library storage of reference or prototype patterns. Recognition procedures are partitioned across a CPU and the pattern recognition engine. Use of a private memory allows quick access of the library patterns without impeding the performance of programs operating on the main CPU or the host bus. Communication between the CPU and the pattern recognition engine is accomplished over the host bus.

62 citations


Journal ArticleDOI
TL;DR: It is experimentally shown that one can optimize the system and further improve recognition accuracy for speaker-independent recognition by controlling the distance measure's sensitivity to spectral peaks and the spectral tilt and by utilizing the speech dynamic features.
Abstract: Several recently proposed automatic speech recognition (ASR) front-ends are experimentally compared in speaker-dependent and speaker-independent (cross-speaker) recognition. The perceptually based linear predictive (PLP) front-end, with the root-power sums (RPS) distance measure, yields generally the highest accuracies, especially in cross-speaker recognition. It is experimentally shown that one can optimize the system and further improve speaker-independent recognition accuracy by controlling the distance measure's sensitivity to spectral peaks and to the spectral tilt, and by utilizing dynamic speech features. For a digit vocabulary and five reference templates obtained with a clustering algorithm, the optimization improves recognition accuracy from 97% to 98.1% for the PLP-RPS front-end.

31 citations


Proceedings ArticleDOI
28 Mar 1993
TL;DR: It is shown how recognition performance in automated speech perception can be significantly improved by additional lipreading, so-called speech-reading, demonstrated on an extension of an existing state-of-the-art speech recognition system, a modular multi-state time-delay neural network (MS-TDNN).
Abstract: It is shown how recognition performance in automated speech perception can be significantly improved by additional lipreading, so-called speech-reading. This is shown on an extension of an existing state-of-the-art speech recognition system, a modular multi-state time-delay neural network (MS-TDNN). The acoustic and visual speech data are preclassified in two separate front-end phoneme TDNNs and combined into acoustic-visual hypotheses for the dynamic time warping algorithm. This is shown on a connected word recognition problem, the letter-spelling task. With speech-reading, the error rate can be reduced by up to half of the error rate of pure acoustic recognition.

26 citations


Proceedings Article
29 Nov 1993
TL;DR: A view-based representation is used to model aspects of the hand relevant to the trained gestures, and is found with an unsupervised clustering technique that uses normalized correlation networks, with dynamic time warping in the temporal domain, as its distance function.
Abstract: We present a method for learning, tracking, and recognizing human hand gestures recorded by a conventional CCD camera without any special gloves or other sensors. A view-based representation is used to model aspects of the hand relevant to the trained gestures, and is found using an unsupervised clustering technique. We use normalized correlation networks, with dynamic time warping in the temporal domain, as a distance function for unsupervised clustering. Views are computed separably for space and time dimensions; the distributed response of the combination of these units characterizes the input data with a low dimensional representation. A supervised classification stage uses labeled outputs of the spatio-temporal units as training data. Our system can correctly classify gestures in real time with a low-cost image processing accelerator.

24 citations
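The distance function described above combines normalized correlation in space with DTW in time. The spatial part can be sketched as plain zero-mean normalized correlation (an illustrative stand-in for the paper's correlation units; the function name is assumed):

```python
import numpy as np

def normalized_correlation(patch, template):
    """Zero-mean normalized correlation in [-1, 1] between an image
    patch and a stored view template. Invariant to brightness offset
    and contrast scaling of either input."""
    p = np.asarray(patch, float) - np.mean(patch)
    t = np.asarray(template, float) - np.mean(template)
    denom = np.linalg.norm(p) * np.linalg.norm(t)
    return float(p.ravel() @ t.ravel() / denom) if denom else 0.0
```

The offset/gain invariance is what makes such a score usable as a clustering distance across lighting changes; the temporal DTW stage then absorbs differences in gesture speed.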


Proceedings ArticleDOI
27 Apr 1993
TL;DR: The authors propose a continuous speaker-independent speech recognition system based on predictive neural networks for modeling phonemes and dynamic time warping for temporal alignment, which compares well with current systems.
Abstract: The authors propose a continuous speaker-independent speech recognition system based on predictive neural networks for modeling phonemes, and dynamic time warping for temporal alignment. In this system several modules cooperate, which allows incorporation of a grammar model and simple correction rules. The neural networks are trained by using a frame-discriminative criterion. Tests on the TIMIT database show 74.5% correct classification and 68.6% accuracy, which compares well with current systems (the CMU SPHINX system and the Cambridge Recurrent Error Propagation network).

22 citations


Proceedings ArticleDOI
27 Apr 1993
TL;DR: The authors applied an automatic structure optimization (ASO) algorithm to the optimization of multistate time-delay neural networks (MSTDNNs), an extension of the TDNN, which was applied successfully to speech recognition and handwritten character recognition tasks with varying amounts of training data.
Abstract: The authors applied an automatic structure optimization (ASO) algorithm to the optimization of multistate time-delay neural networks (MSTDNNs), an extension of the TDNN. These networks allow the recognition of sequences of ordered events that have to be observed jointly. For example, in many speech recognition systems the recognition of words is decomposed into the recognition of sequences of phonemes or phonemelike units. In handwritten character recognition the recognition of characters can be decomposed into the joint recognition of characteristic strokes, etc. The combination of the proposed ASO algorithm with the MSTDNN was applied successfully to speech recognition and handwritten character recognition tasks with varying amounts of training data.

Journal ArticleDOI
TL;DR: The performance of continuous HMMs using one type of transitional features in speaker-dependent recognition of the highly confusing Mandarin syllables is first evaluated and discussed in detail under the constraint of very limited training data.

Journal ArticleDOI
TL;DR: Results are presented which show that the additional parameters extracted encode further speaker specific information, and can be used to improve upon the speaker verification performance of the baseline systems.

Journal ArticleDOI
Chin-Hui Lee1, Chih-Heng Lin1
TL;DR: Testing on a 39-word English alpha-digit vocabulary, in a speaker trained mode, indicates that the recognition performance of a template-based, dynamic time-warping (DTW) recognizer can be significantly improved in noisy conditions when the robust signal limiter is used as a pre-processor to reduce the variability of the features in strong mismatch conditions.

Proceedings ArticleDOI
27 Apr 1993
TL;DR: It is shown that a dynamic time warping (DTW) comb filter corrects for variations in the vocal tract as well as for the variation in pitch.
Abstract: An attempt is made to enhance speech degraded by added noise by exploiting the periodic nature of voiced speech. A modification of the adaptive comb filter is employed for this purpose. Problems which may arise when using the periodicity of the speech for enhancement include significant distortion caused by comb filtering a time-varying waveform (called temporal smearing) as well as the variation in pitch from period to period (called overload). It is shown that a comb filter based on dynamic time warping (DTW) corrects for variations in the vocal tract as well as for the variation in pitch. A computationally straightforward but suboptimal implementation of the time warping algorithm is used to improve the performance of the comb filter algorithm. Performance is assessed in terms of computational complexity, informal listening tests, and segmental SNR.

Proceedings ArticleDOI
27 Apr 1993
TL;DR: A novel MCE/GPD (minimum classification error/generalized probabilistic descent) loss function that can incorporate word spotting errors and other measures of symbolic distance between correct and incorrect categories is defined.
Abstract: A straightforward application of PBMEC (prototype-based minimum error classifier) training to existing techniques for handling continuous speech is described. A novel MCE/GPD (minimum classification error/generalized probabilistic descent) loss function that can incorporate word spotting errors and other measures of symbolic distance between correct and incorrect categories is defined. Classification consists in a time-synchronous DTW (dynamic time warping) pass through a finite state machine; adaptation makes use of an A* based N-best algorithm and consists in propagating the derivative of the loss over the N best paths through the finite state machine. The key feature is that the loss function being optimized closely reflects the actual recognition performance of the system.

Book ChapterDOI
13 Sep 1993
TL;DR: It is shown that MSTDNNs are a very powerful approach to on-line handwritten character and word recognition and that the ASO algorithm can automatically structure this type of architecture efficiently in a single training run.
Abstract: Highly structured neural networks like the Time-Delay Neural Network (TDNN) can achieve very high recognition accuracies in real world applications like on-line handwritten character and speech recognition systems. Achieving the best possible performance greatly depends on the optimization of all structural parameters for the given task and amount of training data. We propose an Automatic Structure Optimization (ASO) algorithm that avoids time-consuming manual optimization and apply it to Multi State Time-Delay Neural Networks (MSTDNNs), a recent extension of the TDNN. We show that MSTDNNs are a very powerful approach to on-line handwritten character and word recognition and that the ASO algorithm can automatically structure this type of architecture efficiently in a single training run.

Proceedings ArticleDOI
24 Nov 1993
TL;DR: The hybrid system developed by the authors combines self-organizing feature maps with dynamic time warping and the combination has better performance than either of the two methods applied individually.
Abstract: Describes a series of experiments on using Kohonen self-organizing maps and hybrid systems for continuous speech recognition. Experiments with different nonlinear transformations applied to the signal before it is presented to a neural network were carried out and the results compared. The hybrid system developed by the authors combines self-organizing feature maps with dynamic time warping. The experiments suggest that the combination performs better than either of the two methods applied individually.

Journal ArticleDOI
TL;DR: The time-warping network (TWN) is introduced as a generalization of both an HMM-based recognizer and a backpropagation net; results indicate that not only does recognition performance improve, but the separation between classes is enhanced, allowing a rejection criterion to be set up to improve the confidence of the system.
Abstract: Recently, much interest has been generated regarding speech recognition systems based on Hidden Markov Models (HMMs) and neural network (NN) hybrids. Such systems attempt to combine the best features of both models: the temporal structure of HMMs and the discriminative power of neural networks. In this work we establish one more relation between the HMM and the NN paradigms by introducing the time-warping network (TWN), a generalization of both an HMM-based recognizer and a backpropagation net. The basic element of such a network, a time-warping neuron, extends the operation of the formal neuron of a backpropagation network by warping the input pattern to match it optimally to its weights. We show that a single-layer network of TW neurons is equivalent to a Gaussian density HMM-based recognition system. This equivalent neural representation suggests ways to improve the discriminative power of the system by using backpropagation discriminative training, and/or by generalizing the structure of the recognizer to a multi-layer net. The performance of the proposed network was evaluated on a highly confusable, isolated word, multi-speaker recognition task. The results indicate that not only does the recognition performance improve, but the separation between classes is enhanced, allowing us to set up a rejection criterion to improve the confidence of the system.
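The time-warping neuron described above can be sketched directly: instead of a fixed inner product, the input sequence is DTW-aligned to the neuron's weight sequence and the activation reflects the optimal alignment cost (an illustrative reading of the idea, not the paper's exact formulation; `tw_neuron` is a hypothetical name):

```python
import numpy as np

def tw_neuron(x, w):
    """A 'time-warping neuron' sketch: the input sequence is DTW-aligned
    to the weight sequence and the activation is the negated optimal
    alignment cost, so a perfectly warpable input activates maximally
    (at 0)."""
    n, m = len(x), len(w)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = (x[i - 1] - w[j - 1]) ** 2
            D[i, j] = local + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return -D[n, m]

# A slowed-down copy of the weight pattern still activates fully,
# which a fixed-length inner product cannot do.
```

The squared local distance makes a layer of such units behave like negative log-likelihoods of Gaussian-density HMM states, which is the equivalence the paper exploits.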

Proceedings ArticleDOI
TL;DR: The nonlinear behavior of ASTER provides more robust performance than the related dynamic time warping algorithm and is compared with a more common approach wherein a self-organizing feature map is first used to map a sequence of extracted feature vectors onto a lower dimensional trajectory.
Abstract: Two types of artificial neural networks are introduced for the robust classification of spatio-temporal sequences. The first network is the Adaptive Spatio-Temporal Recognizer (ASTER), which adaptively estimates the confidence that a (variable length) signal of a known class is present by continuously monitoring a sequence of feature vectors. If the confidence for any class exceeds a threshold value at some moment, the signal is considered to be detected and classified. The nonlinear behavior of ASTER provides more robust performance than the related dynamic time warping algorithm. ASTER is compared with a more common approach wherein a self-organizing feature map is first used to map a sequence of extracted feature vectors onto a lower dimensional trajectory, which is then identified using a variant of the feedforward time delay neural network. The performance of these two networks is compared using artificial sonograms as well as feature vector strings obtained from short-duration oceanic signals.

Book ChapterDOI
13 Sep 1993
TL;DR: A discriminative neural prediction system for continuous speaker-independent speech recognition that reaches 74.9% accuracy on TIMIT, which compares well with other state-of-the-art systems while being less complex and easier to implement.
Abstract: This paper presents a discriminative neural prediction system for continuous speaker-independent speech recognition. We first compare different neural predictors for modeling speech production. We then propose new criteria for discriminative training. These networks are incorporated into a complete speech recognition system where they cooperate with other modules (grammar model, correction rules, and dynamic time warping). Our best systems reach 74.9% accuracy on TIMIT, which compares well with other state-of-the-art systems, while being less complex and easier to implement.

Journal ArticleDOI
TL;DR: A chain vector-quantization clustering (CVQC) algorithm for real-time speech recognition that delivers faster training and recognition speeds and requires less memory.

Proceedings ArticleDOI
25 Oct 1993
TL;DR: A continuous speech recognition system with a finite set of Chinese words is devised; the precedence relations among the spectral patterns within a token period can be preserved through topology preservation, and the serious nonlinear time warping can be overcome.
Abstract: A continuous speech recognition system with a finite set of Chinese words is devised for selected applications. With proper design of the self-organizing map for the speech signals, the precedence relations among the spectral patterns within a token period can be preserved through topology preservation, and the serious nonlinear time warping can thus be overcome. The 1D hierarchical relations among the sequential spectral patterns can be represented by the topology map developed on the linear array of neurons. We then devise two kinds of perception energies based on the trained map. One of the energies is derived from properly fitting a precedence curve to the sequential excitation patterns of the map during a whole word period. The other energy is obtained from the accumulation of total excitations on the map during a word period. Thresholds for the perception energies are then designed experimentally. A set of 1309 linear array maps is used for representing the total of 1309 standard Chinese word pronunciations. Each linear array contains 100 equally spaced and linearly ordered neurons.

Journal ArticleDOI
TL;DR: A VLSI architecture, which exhibits both SIMD and systolic behaviour for computing the dynamic time-warping (DTW) algorithm is presented, and a 20000-word real-time DTW-based speech recognition system is achievable.
Abstract: A VLSI architecture which exhibits both SIMD and systolic behaviour for computing the dynamic time-warping (DTW) algorithm is presented. Such an architecture is well suited for VLSI implementation because of its regular structure and small number of input/output connections. Currently, based on a 1.2 µm CMOS technology, a SIMD-systolic data-path chip has been designed and fabricated for computing the DTW algorithm. It is functionally correct and packaged as a 68-pin PGA chip. With such a chip, a 20000-word real-time DTW-based speech recognition system is achievable.

Proceedings ArticleDOI
20 Oct 1993
TL;DR: The authors present the implementation of a generic dynamic programming algorithm on array processors, adopting a torus interconnection network, an internal/external dual buffer structure, and a multilevel pipelining design, for a performance of several GOPS per DP chip.
Abstract: The authors present the implementation of a generic dynamic programming algorithm on array processors. A dynamic programming (DP) chip is proposed to speed up the processing of dynamic programming tasks in many applications, including the Viterbi algorithm, the boundary following algorithm, the dynamic time warping algorithm, etc. By adopting a torus interconnection network, an internal/external dual buffer structure, and a multilevel pipelining design, a performance of several GOPS per DP chip is expected. Both the dedicated hardware design and the data flow control of the DP chip are discussed.

Proceedings Article
01 Jan 1993

Proceedings ArticleDOI
14 Sep 1993
TL;DR: The authors' experience to date leads them to recommend the use of a combination of a shift-tolerant, correlation-based measure, such as DTW, and a robust normalized mean squared error measure.
Abstract: A critical problem encountered in evaluating methods that extract event-related potentials (ERPs) from single-trial electroencephalograph (EEG) signals is the inadequacy of available performance measures. Here the authors analyzed two standard performance measures, normalized mean squared error and correlation, and a lesser used measure, dynamic time warping (DTW), and explored the conditions under which they provide misleading results. The authors' experience to date leads them to recommend the use of a combination of a shift-tolerant, correlation-based measure, such as DTW, and a robust normalized mean squared error measure.
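The shift-tolerance argument is easy to demonstrate: a latency-jittered copy of an ERP-like peak scores poorly under normalized mean squared error but aligns almost perfectly under DTW. A sketch with synthetic Gaussian peaks (signal shapes and thresholds are illustrative assumptions, not the authors' data):

```python
import numpy as np

def nmse(x, y):
    """Normalized mean squared error of x against reference y."""
    return float(np.mean((x - y) ** 2) / np.mean(y ** 2))

def dtw_cost(x, y):
    """Plain DTW alignment cost with squared local distances."""
    D = np.full((len(x) + 1, len(y) + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            local = (x[i - 1] - y[j - 1]) ** 2
            D[i, j] = local + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[-1, -1])

# A 3-sample latency jitter on an ERP-like Gaussian peak: large NMSE,
# near-zero DTW cost, because the warping path absorbs the shift.
t = np.arange(40)
ref = np.exp(-0.5 * ((t - 20) / 3.0) ** 2)
shifted = np.exp(-0.5 * ((t - 23) / 3.0) ** 2)
```

This is exactly the failure mode the paper warns about: a pointwise error measure reports a large mismatch for a waveform that is merely delayed.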

Proceedings ArticleDOI
27 Apr 1993
TL;DR: Experimental results show that a neural network can be used as a new speaker-independent feature extractor and is compared with a conventional training algorithm in terms of recognition performance.
Abstract: The authors propose an algorithm using a neural network to normalize features that differ between speakers in speaker-independent speech recognition. The algorithm has three procedures: (1) initially training a neural network, (2) calculating the alignment function between the target signal and the network's output by dynamic time warping, and (3) incrementally training the network for extracting speaker-independent features. The neural network is a fuzzy partition model (FPM) with multiple input-output units to give a probabilistic formulation. The algorithm was evaluated in phrase recognition experiments by FPM-LR recognizers, in which the FPM was directly combined with an LR parser. The algorithm is compared with a conventional training algorithm in terms of recognition performance. The experimental results show that a neural network can be used as a new speaker-independent feature extractor.

01 Jan 1993
TL;DR: This thesis investigates a dynamic programming approach to word hypothesis in the context of a speaker-independent, large-vocabulary, continuous speech recognition system, and attempts to extend the DTW technique to strings of phonetic symbols.
Abstract: This thesis investigates a dynamic programming approach to word hypothesis in the context of a speaker-independent, large-vocabulary, continuous speech recognition system. Using a method known as Dynamic Time Warping, an undifferentiated phonetic string (one without word boundaries) is parsed to produce all possible words contained in a domain-specific lexicon. Dynamic Time Warping is a common method of sequence comparison used in matching the acoustic feature vectors representing an unknown input utterance and some reference utterance. The cumulative least-cost path, when compared with some threshold, can be used as a decision criterion for recognition. This thesis attempts to extend the DTW technique to strings of phonetic symbols instead. Three variables were found to affect the parsing process: (1) the minimum distance threshold, (2) the number of word candidates accepted at any given phonetic index, and (3) the lexical search space used for reference pattern comparisons. The performance of this parser as a function of these variables is discussed, as is the performance of the parser at a variety of input error rates.
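The parsing idea above, hypothesizing every lexicon word whose phone string matches a stretch of the undifferentiated input within a distance threshold, can be sketched as a semi-global DP over symbols. Unit Levenshtein costs stand in for the thesis's symbolic distances; `spot_words` and the tiny lexicon are illustrative, not from the thesis:

```python
def spot_words(phones, lexicon, threshold=1):
    """Hypothesize every lexicon word whose phone string matches a
    substring of the undifferentiated input within `threshold` edit
    operations (substitution / insertion / deletion, unit cost)."""
    hits = []
    for word, ref in lexicon.items():
        # Semi-global DP: the match may start at any input position
        # (row 0 is all zeros), and every end position whose cumulative
        # cost clears the threshold yields a word hypothesis.
        prev = [0] * (len(phones) + 1)
        for i in range(1, len(ref) + 1):
            cur = [i] + [0] * len(phones)
            for j in range(1, len(phones) + 1):
                sub = prev[j - 1] + (ref[i - 1] != phones[j - 1])
                cur[j] = min(sub, prev[j] + 1, cur[j - 1] + 1)
            prev = cur
        for j in range(1, len(phones) + 1):
            if prev[j] <= threshold:
                hits.append((word, j))   # word hypothesized ending at phone index j
    return hits

# Exact matching (threshold 0) spots both words in a boundary-free
# phone string.
phones = ['k', 'ae', 't', 's', 'ih', 't']
lexicon = {'cat': ['k', 'ae', 't'], 'sit': ['s', 'ih', 't']}
```

Raising the threshold trades more word hypotheses (robustness to phone recognition errors) for a larger downstream search, which is the trade-off the three variables in the abstract control.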

Journal ArticleDOI
TL;DR: In this paper, an optical processor consisting of a Helium-Neon laser, optical lenses, photographic film plates and diffusers was used for the analysis and recognition of speech signals.