
Showing papers on "Dynamic time warping published in 1999"


01 Jan 1999
TL;DR: In this paper, a modification of DTW that operates on a Piecewise Aggregate Approximation (PAA) of the data is proposed; it outperforms DTW by one to two orders of magnitude with no loss of accuracy.
Abstract: There has been much recent interest in adapting data mining algorithms to time series databases. Most of these algorithms need to compare time series. Typically some variation of Euclidean distance is used. However, as we demonstrate in this paper, Euclidean distance can be an extremely brittle distance measure. Dynamic time warping (DTW) has been suggested as a technique to allow more robust distance calculations, however it is computationally expensive. In this paper we introduce a modification of DTW which operates on a higher level abstraction of the data, in particular, a Piecewise Aggregate Approximation (PAA). Our approach allows us to outperform DTW by one to two orders of magnitude, with no loss of accuracy.
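
For readers unfamiliar with the two ingredients this abstract combines, the following minimal Python sketch (not the authors' code; the frame counts, squared-error local cost and toy series are illustrative assumptions) shows a classic DTW distance, a Piecewise Aggregate Approximation, and how running DTW on the PAA sketches shrinks the dynamic-programming table.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic DTW: O(len(a) * len(b)) dynamic program with a squared-error local cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(np.sqrt(D[n, m]))

def paa(series, n_segments):
    """Piecewise Aggregate Approximation: the mean of each of n_segments equal-width frames."""
    return np.array([frame.mean() for frame in np.array_split(np.asarray(series, float), n_segments)])

# Toy comparison: a noisy sine against a locally stretched version of itself.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 4.0 * np.pi, 256)
query = np.sin(t) + 0.05 * rng.standard_normal(t.size)
candidate = np.sin(t ** 1.05 / t.max() ** 0.05) + 0.05 * rng.standard_normal(t.size)

# DTW on 16-segment PAA sketches fills a 16 x 16 table instead of 256 x 256,
# which is the source of the one-to-two-orders-of-magnitude speed-up claimed above.
print(f"DTW on raw series:     {dtw_distance(query, candidate):.3f}")
print(f"DTW on 16-segment PAA: {dtw_distance(paa(query, 16), paa(candidate, 16)):.3f}")
```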

670 citations


Book ChapterDOI
15 Sep 1999
TL;DR: This paper introduces a modification of DTW which operates on a higher level abstraction of the data, in particular, a piecewise linear representation and demonstrates that this approach allows us to outperform DTW by one to three orders of magnitude.
Abstract: There has been much recent interest in adapting data mining algorithms to time series databases. Many of these algorithms need to compare time series. Typically some variation or extension of Euclidean distance is used. However, as we demonstrate in this paper, Euclidean distance can be an extremely brittle distance measure. Dynamic time warping (DTW) has been suggested as a technique to allow more robust distance calculations, however it is computationally expensive. In this paper we introduce a modification of DTW which operates on a higher level abstraction of the data, in particular, a piecewise linear representation. We demonstrate that our approach allows us to outperform DTW by one to three orders of magnitude. We experimentally evaluate our approach on medical, astronomical and sign language data.

248 citations


Proceedings ArticleDOI
01 Sep 1999
TL;DR: A new method for comparing planar curves and for performing matching at sub-sampling resolution is presented, and its performance on signature verification is compared with that of the well-known Dynamic Time Warping algorithm.
Abstract: The problem of establishing correspondence and measuring the similarity of a pair of planar curves arises in many applications in computer vision and pattern recognition. This paper presents a new method for comparing planar curves and for performing matching at sub-sampling resolution. The analysis of the algorithm as well as its structural properties are described. The performance of the new technique applied to the problem of signature verification is shown and compared with the performance of the well-known Dynamic Time Warping algorithm.

179 citations


Journal ArticleDOI
TL;DR: The Self-Organizing Map (SOM) and Learning Vector Quantization (LVQ) algorithms are constructed in this work for variable-length and warped feature sequences, and good results have been obtained in speaker-independent speech recognition.
Abstract: The Self-Organizing Map (SOM) and Learning Vector Quantization (LVQ) algorithms are constructed in this work for variable-length and warped feature sequences. The novelty is to associate an entire feature vector sequence, instead of a single feature vector, as a model with each SOM node. Dynamic time warping is used to obtain time-normalized distances between sequences with different lengths. Starting with random initialization, ordered feature sequence maps then ensue, and Learning Vector Quantization can be used to fine tune the prototype sequences for optimal class separation. The resulting SOM models, the prototype sequences, can then be used for the recognition as well as synthesis of patterns. Good results have been obtained in speaker-independent speech recognition.
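
A hedged sketch of the central idea, assuming a plain LVQ1-style update and a length-normalized DTW distance (the learning rate, path backtracking and toy data below are illustrative, not the authors' implementation): each map node stores an entire prototype sequence, DTW gives a time-normalized distance to variable-length inputs, and the winning prototype's frames are nudged along the warping path.

```python
import numpy as np

def dtw_path(a, b):
    """DTW between two feature-vector sequences; returns (length-normalized distance, path)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack the optimal warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return D[n, m] / len(path), path

def lvq_update(prototype, sample, same_class, lr=0.05):
    """LVQ1-style update: pull each prototype frame toward its DTW-aligned sample frame
    if the classes match, push it away otherwise."""
    _, path = dtw_path(prototype, sample)
    proto = prototype.copy()
    sign = 1.0 if same_class else -1.0
    for i, j in path:
        proto[i] += sign * lr * (sample[j] - proto[i])
    return proto

# Toy usage with 2-D "feature vectors" and different sequence lengths.
rng = np.random.default_rng(1)
prototype = rng.standard_normal((12, 2))
sample = rng.standard_normal((17, 2))
dist, _ = dtw_path(prototype, sample)
prototype = lvq_update(prototype, sample, same_class=True)
print(f"time-normalized DTW distance before update: {dist:.3f}")
```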

170 citations


Journal ArticleDOI
TL;DR: Most of the useful linguistic information is in modulation frequency components in the range between 1 and 16 Hz, with the dominant component at around 4 Hz; in some realistic environments, the use of components from the range below 2 Hz or above 16 Hz can degrade the recognition accuracy.

135 citations


Proceedings ArticleDOI
07 Nov 1999
TL;DR: A novel aligned subsequence matching scheme is proposed, in which the number of subsequences to be compared with a query sequence is reduced to linear in the average length of the data sequences, and an indexing technique is presented to speed up the aligned subsequence matching using the modified time warping distance as the similarity measure.
Abstract: Although the Euclidean distance has been the most popular similarity measure in sequence databases, recent techniques prefer to use high-cost distance functions such as the time warping distance and the editing distance for wider applicability. However, if these distance functions are applied to the retrieval of similar subsequences, the number of subsequences to be inspected during the search is quadratic in the average length of the data sequences. We propose a novel subsequence matching scheme, called aligned subsequence matching, in which the number of subsequences to be compared with a query sequence is reduced to linear in that average length. We also present an indexing technique to speed up the aligned subsequence matching using the similarity measure of the modified time warping distance. Experiments on synthetic data sequences demonstrate the effectiveness of our proposed approach; it consistently outperformed sequential scanning and achieved up to a 65-fold speed-up.
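
As a rough illustration of the candidate-count argument only (a schematic sketch under my own assumptions about segment size and window shape, not the paper's aligned matching scheme or its index): enumerating every possible subsequence is quadratic in the data length, while restricting window boundaries to segment multiples keeps the candidate set linear; each surviving candidate would then be compared to the query with the modified time warping distance.

```python
import numpy as np

def all_subsequences(length, min_len):
    """Naive candidate set: every (start, end) window -- O(L^2) candidates."""
    return [(s, e) for s in range(length) for e in range(s + min_len, length + 1)]

def aligned_windows(length, segment, n_segments_per_window):
    """Aligned candidate set: window boundaries restricted to multiples of `segment`,
    so the number of candidates grows only linearly with the data length."""
    width = segment * n_segments_per_window
    return [(s, s + width) for s in range(0, length - width + 1, segment)]

rng = np.random.default_rng(2)
data = rng.standard_normal(200)
query = data[40:56] + 0.1 * rng.standard_normal(16)   # a noisy copy of data[40:56]

naive = all_subsequences(len(data), min_len=8)
aligned = aligned_windows(len(data), segment=8, n_segments_per_window=2)
print(f"naive candidates: {len(naive)}, aligned candidates: {len(aligned)}")

# With a plain Euclidean stand-in for the paper's modified time warping distance,
# the best aligned window still lands on the embedded pattern.
best = min(aligned, key=lambda se: np.linalg.norm(query - data[se[0]:se[1]]))
print(f"best aligned window: {best}")
```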

91 citations


Proceedings ArticleDOI
20 Sep 1999
TL;DR: An adaptive online recognizer for isolated alphanumeric characters, based on the k nearest neighbor rule, is developed; adaptation is carried out during normal use in a self-supervised fashion and thus remains otherwise unnoticed by the user.
Abstract: We have developed an adaptive online recognizer that is suitable for recognizing isolated alphanumeric characters. It is based on the k nearest neighbor rule. Various dissimilarity measures, all based on dynamic time warping (DTW), have been studied. The main focus of this work is on online adaptation. The adaptation is performed by modifying the prototype set of the classifier according to its recognition performance and the user's writing style. These adaptations include: (1) adding new prototypes, (2) inactivating confusing prototypes, and (3) reshaping existing prototypes. The reshaping algorithm is based on learning vector quantization (LVQ). The writers are allowed to use their own natural style of writing, and the adaptation is carried out during normal use in a self-supervised fashion and thus remains otherwise unnoticed by the user.
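
A minimal sketch of the adaptation loop described above, assuming a plain DTW distance over pen-trajectory points and made-up thresholds (the prototype bookkeeping, error counter and k are my assumptions, not the published recognizer): misrecognized characters are added as new prototypes of the correct class, and prototypes that repeatedly cause confusion are inactivated; prototype reshaping via LVQ is omitted here.

```python
import numpy as np
from collections import Counter

def dtw(a, b):
    """DTW distance between two sequences of 2-D points (e.g. resampled pen trajectories)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

class AdaptiveDtwKnn:
    """k-NN character classifier over DTW distances with a self-adapting prototype set."""

    def __init__(self, k=1):
        self.k = k
        self.prototypes = []   # entries: [sequence, label, active, error_count]

    def add_prototype(self, seq, label):
        self.prototypes.append([np.asarray(seq, float), label, True, 0])

    def classify(self, seq):
        seq = np.asarray(seq, float)
        active = [p for p in self.prototypes if p[2]]
        nearest = sorted(((dtw(seq, p[0]), p) for p in active), key=lambda x: x[0])[: self.k]
        label = Counter(p[1] for _, p in nearest).most_common(1)[0][0]
        return label, nearest

    def adapt(self, seq, true_label, max_errors=3):
        """Self-supervised adaptation once the true label becomes known."""
        predicted, nearest = self.classify(seq)
        if predicted != true_label:
            self.add_prototype(seq, true_label)        # (1) add a new prototype
            for _, p in nearest:
                if p[1] != true_label:
                    p[3] += 1
                    if p[3] >= max_errors:
                        p[2] = False                    # (2) inactivate a confusing prototype
        return predicted

# Toy usage with three-point "characters".
clf = AdaptiveDtwKnn(k=1)
clf.add_prototype([[0, 0], [1, 1], [2, 2]], "a")
clf.add_prototype([[0, 0], [1, 0], [2, 0]], "b")
print(clf.adapt([[0, 0], [1, 1], [2, 1.7]], true_label="a"))
```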

37 citations


Proceedings ArticleDOI
20 Sep 1999
TL;DR: Experimental comparisons with rigid matching and local perturbation show the performance superiority of the monotonic and continuous warping in character recognition.
Abstract: In this paper, a handwritten character recognition experiment using a monotonic and continuous two-dimensional warping algorithm is reported. This warping algorithm is based on dynamic programming and searches for the optimal pixel-to-pixel mapping between two given images subject to two-dimensional monotonicity and continuity constraints. Experimental comparisons with rigid matching and local perturbation show the performance superiority of the monotonic and continuous warping in character recognition.

29 citations


Proceedings ArticleDOI
20 Jun 1999
TL;DR: It is shown how, for a harmonic signal segment, the parabolic time warping function can remove the part of the frequency variation which progresses linearly with time, without changing the time duration of that segment.
Abstract: A parabolic time warper designed to enhance the stationarity of voiced speech segments is presented. It is shown how, for a harmonic signal segment, the parabolic time warping function can remove the part of the frequency variation which progresses linearly with time, without changing the time duration of that segment. In the actual implementation of the time warping system, the linear part of the pitch frequency variation in a segment is removed on the basis of maximization of the pitch-related autocorrelation peak of the warped signal. As a by-product, the time warper yields a very reliable pitch estimate. An example on real speech is discussed.
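
The sketch below illustrates the idea on a synthetic linear chirp, under my own assumptions (a simple grid search over the curvature, linear interpolation for resampling, and a normalized autocorrelation peak as the pitch-periodicity score); it is not the paper's implementation. A parabolic warp tau = t + a*t^2 turns a linearly drifting frequency into a nearly constant one, and the curvature a is chosen to maximize the pitch-related autocorrelation peak of the warped segment.

```python
import numpy as np

def parabolic_warp(x, a, fs):
    """Resample x on the warped time axis tau = t + a*t^2 (a in 1/s).
    tau values beyond the segment end are clamped by np.interp."""
    t = np.arange(len(x)) / fs
    tau = t + a * t ** 2
    return np.interp(tau, t, x)

def pitch_autocorr_peak(x, fs, fmin=60.0, fmax=400.0):
    """Largest normalized autocorrelation value in the plausible pitch-lag range."""
    x = x - x.mean()
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    r = r / (r[0] + 1e-12)
    lo, hi = int(fs / fmax), int(fs / fmin)
    return float(r[lo:hi].max())

def best_parabolic_warp(x, fs, curvatures=np.linspace(-2.0, 2.0, 81)):
    """Grid search: keep the curvature whose warped segment is most pitch-periodic."""
    scores = [pitch_autocorr_peak(parabolic_warp(x, a, fs), fs) for a in curvatures]
    return float(curvatures[int(np.argmax(scores))])

# Toy "voiced" segment: 50 ms whose frequency rises linearly from 100 Hz to 110 Hz.
fs = 8000
t = np.arange(int(0.05 * fs)) / fs
x = np.sin(2 * np.pi * (100.0 * t + 0.5 * 200.0 * t ** 2))   # chirp rate 200 Hz/s

a_hat = best_parabolic_warp(x, fs)
print(f"chosen curvature: {a_hat:+.2f} 1/s "
      f"(autocorrelation peak {pitch_autocorr_peak(x, fs):.3f} -> "
      f"{pitch_autocorr_peak(parabolic_warp(x, a_hat, fs), fs):.3f})")
```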

26 citations


Proceedings ArticleDOI
07 Jun 1999
TL;DR: This technique tries to eliminate distortions by the replication of the original signal frequencies, and Malvar wavelets are used to avoid clicking between segment transitions.
Abstract: Describes a technique to obtain a time dilation or contraction of an audio signal. Different computer graphics applications can take advantage of this technique. In real-time networked virtual reality applications, such as teleconferences or games, the audio might be transmitted independently from the rest of the data. These different signals arrive asynchronously and need to be somehow resynchronized on-the-fly. In animation, it can help to automatically fit and merge pre-recorded sound samples to special timed events. It also makes it easier to accomplish special effects, like lip-sync for dubbing or changing the voice of an animated character. Our technique tries to eliminate distortions by the replication of the original signal frequencies. Malvar wavelets are used to avoid clicking between segment transitions.

22 citations


Patent
21 Apr 1999
TL;DR: In this article, the authors proposed an automated dialing method for mobile telephones, where a user enters a telephone number via the keypad of the mobile phone, followed by speaking a corresponding codeword into the handset.
Abstract: The present invention relates to an automated dialing method for mobile telephones. According to the method, a user enters a telephone number via the keypad of the mobile phone, followed by speaking a corresponding codeword into the handset. The voice signal is encoded using the CODEC and vocoder already on board the mobile phone. The speech is divided into frames and each frame analyzed to ascertain its primary spectral features. These features are stored in memory as associated with the numeric keypad sequence. In recognition mode, the user speaks the codeword into the handset, which is analyzed in a like fashion as in training mode. The primary spectral features are compared with those stored in memory. When a match is declared according to preset criteria, the telephone number is automatically dialed by the mobile phone. Time warping techniques may be applied in the analysis to reduce timing variations.

DOI
01 Jan 1999
TL;DR: This PhD thesis tries to understand how to analyse, decompose, model and transform the vocal identity of a human when seen through an automatic speaker recognition application, with a study of the impostors phenomenon.
Abstract: This PhD thesis tries to understand how to analyse, decompose, model and transform the vocal identity of a human when seen through an automatic speaker recognition application. It starts with an introduction explaining the properties of the speech signal and the basis of automatic speaker recognition. Then, the errors of an operating speaker recognition application are analysed. From the deficiencies and mistakes noticed in the running application, some observations can be made which imply a re-evaluation of the characteristic parameters of a speaker and a reconsideration of some parts of the automatic speaker recognition chain. In order to determine the characterising parameters of a speaker, these are extracted from the speech signal with an analysis and synthesis harmonic plus noise model (H+N). The analysis and re-synthesis of the harmonic and noise parts indicate which of them are speech or speaker dependent. It is then shown that the speaker-discriminating information can be found in the residual obtained by subtracting the H+N modeled signal from the original signal. Then, a study of the impostor phenomenon, essential in the tuning of a speaker recognition system, is carried out. The impostors are simulated in two ways: first by a transformation of the speech of a source speaker (the impostor) to the speech of a target speaker (the client) using the parameters extracted from the H+N model. This way of transforming the parameters is efficient, as the false acceptance rate grows from 4% to 23%. Second, an automatic imposture by speech segment concatenation is carried out. In this case the false acceptance rate grows to 30%. A way to become less sensitive to spectral modification impostures is to remove the harmonic part, or even the noise part modeled by the H+N, from the original signal. Using such a subtraction decreases the false acceptance rate to 8% even if transformed impostors are used. To overcome the lack of training data, one of the main causes of modeling errors in speaker recognition, a decomposition of the recognition task into a set of binary classifiers is proposed. A classifier matrix is built and each of its elements has to classify, word by word, the data coming from the client and another speaker (named here an anti-speaker, randomly chosen from an external database). With such an approach it is possible to weight the results according to the vocabulary or the neighbours of the client in the parameter (acoustic) space. The outputs of the matrix classifiers are then weighted and mixed in order to produce a single output score. The weights are estimated on validation data, and if the weighting is done properly, the binary-pair speaker recognition system gives better results than a state-of-the-art HMM-based system. In order to set a point of operation (i.e. a point on the COR curve) for the speaker recognition application, an a priori threshold has to be determined. Theoretically the threshold should be speaker independent when stochastic models are used. However, practical experiments show that this is not the case: due to modeling mismatch the threshold becomes speaker and utterance-length dependent. A theoretical framework showing how to adjust the threshold using the local likelihood ratio is then developed. Finally, a last modeling-error correction method using decision fusion is proposed. Practical experiments show the advantages and drawbacks of the fusion approach in speaker recognition applications.

Proceedings ArticleDOI
20 Dec 1999
TL;DR: A novel vision-based speech analysis system, STODE, is presented for the spoken Chinese training of oral deaf children; it integrates capabilities such as real-time lip tracking and feature extraction, multi-state lip modeling, and a Time-Delay Neural Network (TDNN) for visual speech analysis.
Abstract: This paper presents a novel vision-based speech analysis system, STODE, used in the spoken Chinese training of oral deaf children. Its design goal is to help oral deaf children overcome two major difficulties in speech learning: the confusion of intonations for spoken Chinese characters and timing errors within different words and characters. It integrates capabilities such as real-time lip tracking and feature extraction, multi-state lip modeling, and a Time-Delay Neural Network (TDNN) for visual speech analysis. A desk-mounted camera tracks users in real time. At each frame, the region of interest is identified and key information is extracted. The preprocessed acoustic and visual information is then fed into a modular TDNN and combined for visual speech analysis. Confusion of intonations for spoken Chinese characters can be easily identified, and timing errors within words and characters can also be detected using a DTW (Dynamic Time Warping) algorithm. For visual feedback we have created an artificial talking head directly cloned from the user's own images to generate outputs showing both correct and wrong ways of pronunciation. This system has been successfully used for the spoken Chinese training of oral deaf children in cooperation with the Nanjing Oral School, under grants from the National Natural Science Foundation of China.

Proceedings ArticleDOI
15 Sep 1999
TL;DR: This paper proposes a text-dependent speaker identification system for the Thai language, using isolated digits 0-9 and their concatenations as the spoken text and dynamic time warping to measure distances between reference and evaluated vectors.
Abstract: This paper proposes a text-dependent speaker identification system applied to the Thai language. Isolated digits 0-9 and their concatenations are used as the spoken text. Linear prediction coefficients (LPC) are extracted and formed into feature vectors representing each speech signal. Dynamic time warping (DTW) is used to measure distances between reference and evaluated vectors. These distances, indicating the nearness of unknown vectors to the references, are combined with the K-nearest neighbor (KNN) decision technique to decide which speaker produced those unknown vectors. The experimental results show that the best identification rate for a single digit is 95.83%, and the highest rates for top-3, top-5, and top-7 concatenated digits are 98.75%, 100%, and 99.20%, respectively.
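
A compact sketch of the pipeline just described, assuming autocorrelation-method LPC, a Euclidean local cost inside DTW, and synthetic "utterances" (the frame size, model order and toy speakers are my assumptions, not the paper's setup): LPC vectors are extracted per frame, DTW measures the distance between an evaluated sequence and each reference, and a K-nearest-neighbor vote picks the speaker.

```python
import numpy as np
from collections import Counter

def lpc(frame, order=12):
    """LPC coefficients of one frame via the autocorrelation (Yule-Walker) equations."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1: len(frame) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R + 1e-6 * np.eye(order), r[1: order + 1])

def lpc_sequence(signal, frame_len=240, hop=120, order=12):
    """Feature-vector sequence: one LPC vector per Hamming-windowed frame."""
    win = np.hamming(frame_len)
    starts = range(0, len(signal) - frame_len + 1, hop)
    return np.array([lpc(signal[s:s + frame_len] * win, order) for s in starts])

def dtw(a, b):
    """DTW distance between two LPC-vector sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

def knn_speaker(test_seq, references, k=3):
    """references: list of (lpc_sequence, speaker_id); majority vote over the k nearest."""
    dists = sorted((dtw(test_seq, ref), spk) for ref, spk in references)
    return Counter(spk for _, spk in dists[:k]).most_common(1)[0][0]

# Toy usage: two "speakers" simulated as noise shaped by different one-pole filters.
rng = np.random.default_rng(5)
def toy_utterance(pole, n=4000):
    x = rng.standard_normal(n)
    y = np.zeros(n)
    for i in range(1, n):
        y[i] = pole * y[i - 1] + x[i]
    return y

refs = [(lpc_sequence(toy_utterance(0.9)), "spk_A"), (lpc_sequence(toy_utterance(0.3)), "spk_B")]
print(knn_speaker(lpc_sequence(toy_utterance(0.88)), refs, k=1))   # expected: spk_A
```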

Proceedings Article
01 Jan 1999
TL;DR: Experimental results on a natural language call routing task indicate that the proposed techniques speeded up the search process by a factor of 4 without loss in the recognition accuracy.
Abstract: In this paper, we describe approaches for improving the search efficiency of a dynamic programming based one-pass decoder for dialogue applications. In order to allow the use of long-term language models (LM) and cross-word acoustic models, efficient pruning techniques and fast methods for the calculation of emission probability density functions (pdfs) are required. This is particularly important for real-time and memory-constrained applications such as dialogue systems involving automatic speech recognition (ASR) and natural-language understanding. We propose an effective pruning technique exploiting the LM and cross-word context. We also present a fast distance calculation method to reduce the cost of state likelihood calculations in HMM-based systems. Experimental results on a natural language call routing task indicate that the proposed techniques speeded up the search process by a factor of 4 without loss of recognition accuracy. In addition, we present a technique for generating word graphs incorporating cross-word context.

Proceedings ArticleDOI
10 Jul 1999
TL;DR: A recognition system that enhances its accuracy by continuously adapting to the user's writing style is developed; it uses dynamic time warping (DTW) to match the input characters with the stored prototypes.
Abstract: Subsystems for the online recognition of handwriting are needed in personal digital assistants (PDAs) and other portable handheld devices. We have developed a recognition system which enhances its accuracy by applying continuous adaptation to the user's writing style. The forms of adaptation we have experimented with take place simultaneously with the normal operation of the system, and therefore there is no need for a separate training period. The present implementation uses dynamic time warping (DTW) in matching the input characters with the stored prototypes. The DTW algorithm implemented with dynamic programming (DP) is, however, both time and memory consuming. In our current research we have experimented with methods that transform the elastic templates into pixel images, which can then be recognized using statistical or neural classification. The particular neural classifier we have used is the local subspace classifier (LSC), of which we have developed an adaptive version.

Proceedings ArticleDOI
31 Oct 1999
TL;DR: An algorithm for comparing speech waveforms is presented that decides whether a spoken utterance is part of a given vocabulary of word waveforms and, if it is, chooses the matching word; preliminary results show that the algorithm provides a high probability of correct classification.
Abstract: An algorithm for comparing speech waveforms is presented that decides whether a spoken utterance is part of a given vocabulary of word waveforms and, if it is, chooses the matching word. The algorithm has been implemented in connection with our own vector interpolation alignment algorithm, which matches dynamic time warping in classification rate while requiring significantly less computation, making it much faster than DTW-based algorithms. Both algorithms are presented and compared. An alternative algorithm has also been investigated, in which the two utterances are divided into the same number of intervals, with the interval lengths differing between the two utterances. When appropriate adjustments are made so that the beginnings and ends of the two utterances match, this algorithm has a classification rate comparable to that of dynamic time warping. Furthermore, an alternative to LPC analysis for utterance recognition is presented. Unlike LPC, which is an extrapolation algorithm, ours is an interpolation algorithm. Theoretically it has smaller variance and smaller mean squared error than the LPC algorithm. Preliminary results show that it provides a high probability of correct classification.
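
The vector interpolation alignment above is the authors' own algorithm; the sketch below is only a generic reading of the interval idea (resample both feature sequences onto the same number of positions by linear interpolation and compare them position by position), with the point count and Euclidean frame distance as my assumptions. Unlike DTW, it needs no dynamic-programming table, which is where a speed advantage would come from.

```python
import numpy as np

def resample_sequence(seq, n_points):
    """Linearly interpolate a (frames x dims) feature sequence onto n_points positions."""
    seq = np.asarray(seq, dtype=float)
    if seq.ndim == 1:
        seq = seq[:, None]
    old = np.linspace(0.0, 1.0, len(seq))
    new = np.linspace(0.0, 1.0, n_points)
    return np.column_stack([np.interp(new, old, seq[:, d]) for d in range(seq.shape[1])])

def interpolation_alignment_distance(a, b, n_points=50):
    """Linear-alignment distance: both utterances are mapped to the same number of
    positions and compared position by position -- one pass, no DP table."""
    ra, rb = resample_sequence(a, n_points), resample_sequence(b, n_points)
    return float(np.linalg.norm(ra - rb, axis=1).sum())

# Toy feature sequences: the same "word" spoken at two speeds, plus a different word.
t_fast, t_slow = np.linspace(0, 1, 30), np.linspace(0, 1, 45)
word_a_fast = np.column_stack([np.sin(2 * np.pi * t_fast), np.cos(2 * np.pi * t_fast)])
word_a_slow = np.column_stack([np.sin(2 * np.pi * t_slow), np.cos(2 * np.pi * t_slow)])
word_b = np.column_stack([np.sin(4 * np.pi * t_slow), np.cos(4 * np.pi * t_slow)])

print(f"same word, different speed: {interpolation_alignment_distance(word_a_fast, word_a_slow):.2f}")
print(f"different words:            {interpolation_alignment_distance(word_a_fast, word_b):.2f}")
```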

Proceedings ArticleDOI
10 Jul 1999
TL;DR: The proposed algorithm, which uses a fuzzy logic recognition approach based on the power distribution pattern of a speech segment, allows the implementation of real-time speech recognition.
Abstract: Speech recognition is a major topic in speech signal processing. Many algorithms based on the results of speech analysis have been advanced, among which dynamic time warping and hidden Markov models are the most important. However, these algorithms generally turn out to be too complicated to implement in real-time systems. The algorithm proposed in this paper, which uses a fuzzy logic recognition approach based on the power distribution pattern of a speech segment, allows the implementation of real-time speech recognition.

Proceedings ArticleDOI
23 Aug 1999
TL;DR: Both weight selection methods provide performance close to the optimal point, and it is shown that the optimal combination of three models provides lower error rates than that achievable with two models.
Abstract: We focus on the score combination for three separate modeling approaches as applied to text-dependent speaker verification. The modeling methods that are evaluated consist of the neural tree network (NTN), hidden Markov model (HMM), and dynamic time warping (DTW). One of the main challenges in combining scores of several models is how to select the weight for each model. One method is to use equal weights for all models used in the combination. Another method is to use the Fisher linear discriminant to select the weights that maximize the ratio of the separation of the inter-class means to the sum of the variances. Both methods are evaluated for three separate databases and the results are compared to the optimal performance as obtained by an exhaustive search over the weight space. Overall, both weight selection methods provide performance close to the optimal point. It is also shown that the optimal combination of three models provides lower error rates than that achievable with two models.
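
A small sketch of the Fisher-discriminant weighting under stated assumptions (synthetic score distributions, a pooled within-class covariance, and a crude separation measure; the NTN/HMM/DTW column order is illustrative only): the weight vector is Sw^-1 (mu_true - mu_impostor), so noisier model scores receive smaller weights than under equal weighting.

```python
import numpy as np

def fisher_weights(scores_true, scores_impostor):
    """Fisher linear discriminant direction for combining per-model scores:
    w = Sw^-1 (mu_true - mu_impostor), rescaled to sum to 1 for readability."""
    scores_true = np.asarray(scores_true, float)
    scores_impostor = np.asarray(scores_impostor, float)
    mu_t, mu_i = scores_true.mean(axis=0), scores_impostor.mean(axis=0)
    sw = np.cov(scores_true, rowvar=False) + np.cov(scores_impostor, rowvar=False)
    w = np.linalg.solve(sw + 1e-9 * np.eye(sw.shape[0]), mu_t - mu_i)
    return w / w.sum()

# Toy example: columns are (NTN, HMM, DTW) scores, higher = more target-like.
rng = np.random.default_rng(4)
true_trials = rng.normal([0.8, 0.7, 0.6], [0.10, 0.15, 0.25], size=(200, 3))
impostor_trials = rng.normal([0.4, 0.4, 0.4], [0.10, 0.15, 0.25], size=(200, 3))

w_fisher = fisher_weights(true_trials, impostor_trials)
w_equal = np.full(3, 1 / 3)
print("Fisher weights:", np.round(w_fisher, 3))   # the noisier DTW scores get a smaller weight

def separation(w):
    """Distance between the fused-score class means in units of the pooled fused-score std."""
    ft, fi = true_trials @ w, impostor_trials @ w
    return (ft.mean() - fi.mean()) / np.sqrt(0.5 * (ft.var() + fi.var()))

print(f"separation, equal weights: {separation(w_equal):.2f}, Fisher weights: {separation(w_fisher):.2f}")
```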

Patent
Adoram Erell1
06 Jan 1999
TL;DR: In this article, a speech recognition system includes a token builder, a noise estimator, a template padder, a gain and noise adapter and a dynamic time warping (DTW) unit.
Abstract: A speech recognition system includes a token builder, a noise estimator, a template padder, a gain and noise adapter, and a dynamic time warping (DTW) unit. The token builder produces a widened test token representing an input test utterance and at least one frame before and after the input test utterance. The noise estimator estimates the noise qualities of the widened test token. The template padder pads each of a plurality of reference templates with at least one blank frame at either the beginning or end of the reference template. The gain and noise adapter adapts each padded reference template with the noise and gain qualities, thereby producing adapted reference templates having noise frames wherever a blank frame was originally placed and noise-adapted speech where speech exists. The DTW unit performs a noise-adapted DTW operation comparing the widened token with one of the noise-adapted reference templates.

Proceedings ArticleDOI
31 Oct 1999
TL;DR: An individual verification system for a multimedia environment such as Windows 95 is implemented using DTW (dynamic time warping), and the weighted cepstrum is found to intensify the difference between the customer and the impostor.
Abstract: We implement an individual verification system for a multimedia environment such as Windows 95 by using DTW (dynamic time warping). The conventional method uses a password entered through the keyboard; this paper uses speech instead. The major features of this study are summarized as follows. (1) We make a complete reference pattern by updating it with the new speech pattern using the F1/F0 ratio. This method has a high recognition rate compared with other systems, whose performance degrades rapidly as time goes on. (2) We use the F-ratio values as weights for the cepstral coefficients. We find that the weighted cepstrum intensifies the difference between the customer and the impostor, and the speaker recognition rate is improved by more than 5% over conventional DTW pattern matching with the cepstrum. This shows the possibility that the speech signal can be used as a means of individual verification in a Windows environment.
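
A brief sketch of the F-ratio weighting under my own assumptions (toy cepstral frames, a simple between/within variance ratio per coefficient, and a weighted Euclidean distance that would serve as the local cost inside DTW); it is not the paper's exact procedure.

```python
import numpy as np

def f_ratio_weights(features, labels):
    """Per-dimension F-ratio: variance of the per-speaker means divided by the average
    within-speaker variance. Larger ratio => more speaker-discriminative coefficient."""
    features, labels = np.asarray(features, float), np.asarray(labels)
    classes = np.unique(labels)
    means = np.array([features[labels == c].mean(axis=0) for c in classes])
    within = np.array([features[labels == c].var(axis=0) for c in classes]).mean(axis=0)
    between = means.var(axis=0)
    return between / (within + 1e-12)

def weighted_cepstral_distance(c1, c2, w):
    """Frame-level local distance for DTW: F-ratio-weighted Euclidean distance."""
    return float(np.sqrt(np.sum(w * (np.asarray(c1) - np.asarray(c2)) ** 2)))

# Toy data: 3 "speakers", 20 frames each, 12 cepstral coefficients,
# where only the first 4 coefficients actually differ between speakers.
rng = np.random.default_rng(3)
frames, labels = [], []
for spk in range(3):
    mean = np.zeros(12)
    mean[:4] = rng.standard_normal(4)          # speaker-dependent part
    frames.append(mean + 0.3 * rng.standard_normal((20, 12)))
    labels += [spk] * 20
frames = np.vstack(frames)

w = f_ratio_weights(frames, labels)
print("F-ratio weights (first 6 dims):", np.round(w[:6], 2))
print("weighted distance between two frames:",
      round(weighted_cepstral_distance(frames[0], frames[25], w), 3))
```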

Journal ArticleDOI
TL;DR: In this paper, a memoryless, finite state recognition system with LPC, FFT, and BF-FFT parametrization was applied to evaluate speech transmission quality in analog telephone channels.
Abstract: The preliminary results of applying automatic recognition of isolated words to the objective evaluation of speech transmission quality in analog telephone channels are presented. A memoryless, finite-state recognition system with LPC, FFT, and BF-FFT (where the speech signal was filtered in Bark bands) parametrization was applied. In the classification stage, dynamic time warping and a nearest-neighbor algorithm were utilized. Nonsense word lists consisting of 100 logotoms were recorded in a studio by a professional male speaker and then utilized as the test material. Speech transmission quality was examined in laboratory models of telephone channels with frequency bands of 300–3400, 400–2500, and 100–6000 Hz for speech-to-white-noise ratios in the range of +15 to −15 dB. The results of the objective measurements, expressed as the percentage of logotoms correctly recognized by the recognition system, were compared under the same transmission conditions with subjectively measured logotom intelligibility. The best agreement between the subjective and objective evaluations of speech transmission quality was obtained for automatic speech recognition utilizing BF-FFT parametrization. The results of the objective evaluation of speech transmission quality by means of the presented method are encouraging, and the experiments will be continued for other communication channels (e.g., digital) and different distortions and disturbances.

Book ChapterDOI
01 Jan 1999
TL;DR: This chapter discusses, in an informal manner, some of the successes and a few of the outstanding problems of automatic speech recognition (ASR) and speaker identification — for forensic, business and banking purposes.
Abstract: In this chapter we discuss, in an informal manner, some of the successes and a few of the outstanding problems of automatic speech recognition (ASR) and speaker identification — for forensic, business and banking purposes. ASR can also help the hard-of-hearing by giving them printed text to read, and the wheelchair-bound by allowing them to control their vehicles by voice. Together with speech synthesis from text, human-machine dialogue systems offer attractive possibilities for all manner of information services.