
Showing papers on "Dynamic time warping" published in 1986


Journal ArticleDOI
Oded Ghitza
TL;DR: The model produces a frequency domain representation of the input signal in terms of the ensemble histogram of the inverse of the interspike intervals, measured from firing patterns generated by a simulated nerve-fiber array, which is comparable to a conventional Fourier Transform (FFT)-based front-end.

135 citations


Journal ArticleDOI
TL;DR: This paper focuses on the long-term intra-speaker variability of feature parameters as one of the most crucial problems in speaker recognition, and presents an investigation into methods for reducing the effects of long-term spectral variability on recognition accuracy.

79 citations


Proceedings ArticleDOI
07 Apr 1986
TL;DR: It is found that measurements of speech spectral envelopes are prone to statistical variations due to window position fluctuations, excitation interference, measurement noise, etc. and may possess spurious characteristics because of analysis model constraints and that a statistical model can be established to predict the variances of the cepstral coefficient measurements.
Abstract: In this paper, we extend the interpretation of distortion measures, based upon the observation that measurements of speech spectral envelopes (as normally obtained from analysis procedures) are prone to statistical variations due to window position fluctuations, excitation interference, measurement noise, etc. and may possess spurious characteristics because of analysis model constraints. We have found that these undesirable spectral measurement variations can be controlled (i.e. reduced in the level of variation) through proper cepstral processing and that a statistical model can be established to predict the variances of the cepstral coefficient measurements. The findings lead to the use of a bandpass "liftering" process aimed at reducing the variability of the statistical components of spectral measurements. We have applied this liftering process to various speech recognition problems; in particular, vowel recognition and isolated word recognition. With the liftering process, we have been able to achieve an average digit error rate of 1%, which is about half of the previously reported best results, with dynamic time warping in a speaker-independent isolated digit test.

41 citations
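
The bandpass liftering described in the abstract above amounts to multiplying each cepstral coefficient by an index-dependent weight before computing distances. The abstract does not give the window shape, so the following is a minimal sketch that assumes a raised-sine lifter, w(n) = 1 + (L/2) sin(πn/L) for n = 1..L; the function names and the choice of window are illustrative assumptions, not the authors' stated implementation.

```python
import math


def bandpass_lifter(num_coeffs):
    """Return lifter weights for cepstral indices 1..num_coeffs.

    Assumed form (not given in the abstract): a raised-sine window
    w(n) = 1 + (L / 2) * sin(pi * n / L), which de-emphasizes both the
    lowest and the highest cepstral indices relative to the middle ones.
    """
    L = num_coeffs
    return [1.0 + (L / 2.0) * math.sin(math.pi * n / L) for n in range(1, L + 1)]


def lifter_cepstrum(cepstrum):
    """Weight a cepstral vector c[1..L] (c[0] excluded) by the lifter."""
    weights = bandpass_lifter(len(cepstrum))
    return [w * c for w, c in zip(weights, cepstrum)]
```

With 12 coefficients, the weight peaks at the middle index (n = 6) and falls back toward 1 at n = 12, which is the sense in which the lifter is "bandpass" over the cepstral index.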


Proceedings ArticleDOI
01 Apr 1986
TL;DR: The experimental results show that the weighted cepstral distance measure works substantially better than both the Euclidean cepstral distance and the log likelihood ratio distance measures across two different databases, namely a 10-digit vocabulary and a 129-word airline vocabulary.
Abstract: A weighted cepstral distance measure is proposed and is tested in a speaker-independent isolated word recognition system using standard DTW (Dynamic Time Warping) techniques. The measure is a statistically weighted distance measure with weights equal to the inverse variance of the cepstral coefficients. The experimental results show that the weighted cepstral distance measure works substantially better than both the Euclidean cepstral distance and the log likelihood ratio distance measures across two different databases, namely a 10-digit vocabulary and a 129-word airline vocabulary. The recognition accuracy obtained using the weighted cepstral distance measure was about 99% for digit recognition. This result was more than 3% higher than that obtained using the simple Euclidean cepstral distance measure and about 2% higher than the results using the log likelihood ratio distance measure. The most significant performance characteristic of the weighted cepstral distance was that it tended to equalize the performance of the recognizer across different talkers.

39 citations
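
The inverse-variance weighting described in the abstract above can be sketched in a few lines. This is a minimal illustration, assuming the variances are estimated per coefficient from a set of training frames and that every coefficient has nonzero variance; the function names are hypothetical and the estimation procedure is an assumption, since the abstract does not describe it.

```python
def inverse_variance_weights(training_frames):
    """Estimate one weight per cepstral coefficient as 1 / variance.

    training_frames: list of equal-length cepstral vectors.
    Assumes every coefficient varies across the training set.
    """
    dims = len(training_frames[0])
    count = len(training_frames)
    weights = []
    for k in range(dims):
        values = [frame[k] for frame in training_frames]
        mean = sum(values) / count
        variance = sum((v - mean) ** 2 for v in values) / count
        weights.append(1.0 / variance)
    return weights


def weighted_cepstral_distance(a, b, weights):
    """Squared distance with each coefficient scaled by its weight."""
    return sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b))
```

Setting every weight to 1 recovers the plain Euclidean cepstral distance that the abstract uses as a baseline, so the same function can serve as the local distance inside a DTW recognizer for both cases.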


Proceedings ArticleDOI
J. Marques, L. Almeida
07 Apr 1986
TL;DR: A time warping procedure is presented that reduces the overlap between high frequency spectral lines, with the purpose of improving their estimation.
Abstract: Sinusoid based models have been used in the last few years for high quality representation of voiced speech. In this paper we first review a theoretical basis for this kind of representation, and then introduce a spectral model for varying-frequency sinusoids. This model is used for estimating the sinusoid parameters and it is also intended for accurate representation of the spectrum of voiced speech. A time warping procedure is presented that reduces the overlap between high frequency spectral lines, with the purpose of improving their estimation.

38 citations


Proceedings ArticleDOI
01 Apr 1986
TL;DR: A new method for text-independent speaker recognition, based on template matching, utilizes temporal information to advantage and performs text-dependent recognition as a special case; its performance is compared with that of similar recently developed methods.
Abstract: Text-independent speaker recognition methods have been based on measurements of long-term statistics of individual speech frames. These methods are not capable of modeling speaker-dependent speech dynamics. In this paper, we describe a new method, based on template matching, that utilizes temporal information to advantage. The template-matching method performs text-dependent recognition as a special case. Performance of the template-matching method is compared with that of similar recently-developed methods.

24 citations


ReportDOI
TL;DR: A custom CMOS VLSI DTW processor is presented which can achieve real-time isolated word recognition for dictionaries of up to 2000 words and is both flexible and modular.
Abstract: A custom CMOS VLSI processor is presented which can achieve real-time isolated word recognition for dictionaries of up to 2000 words. The processor is based on the dynamic time warping (DTW) algorithm, an exhaustive search technique which permits nonlinear pattern matching between an unknown utterance and a reference word. Our design differs from previous DTW designs in that (1) all data is represented in signed-digit, base 4 format; (2) digits are passed between processing elements in a most significant digit first, digit serial fashion; and (3) the algorithms are pipelined at the digit level. Because of these features, nine processing elements will fit on one chip using 3 micron feature size devices and an 84 pin package. The VLSI DTW processor presented is both flexible and modular. The design is independent of the number of coefficients per frame and the precision of those coefficients. The design is also easily expandable in the number of frames per word and the warp factor used to achieve the nonlinear matching.

23 citations
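
For readers unfamiliar with the DTW algorithm that the processor above implements in hardware, here is a minimal software sketch of isolated word matching. It assumes a squared Euclidean frame distance and the simple symmetric step pattern (horizontal, vertical, diagonal); the report's actual local constraints, warp factor, and number format are not reproduced, and all names are illustrative.

```python
def frame_distance(a, b):
    """Squared Euclidean distance between two feature frames."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dtw_distance(test, reference):
    """Accumulated cost of the best nonlinear alignment of two frame sequences."""
    m, n = len(test), len(reference)
    inf = float("inf")
    # cost[i][j]: best accumulated cost aligning test[:i] with reference[:j]
    cost = [[inf] * (n + 1) for _ in range(m + 1)]
    cost[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = frame_distance(test[i - 1], reference[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[m][n]


def recognize(test, templates):
    """Return the vocabulary word whose template is closest to the utterance."""
    return min(templates, key=lambda word: dtw_distance(test, templates[word]))
```

The code works for any number of coefficients per frame, and the exhaustive double loop over every (i, j) cell is the workload that motivates dedicated hardware for large dictionaries.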


Journal ArticleDOI
TL;DR: A VLSI architecture based on the space-time domain expansion which can compute the symbol distance and also give the index pairs which correspond to the warp function is proposed.
Abstract: The method of dynamic time warping is a well-established technique for time alignment and comparison of speech and image patterns. It has found extensive application in speech recognition and related areas of pattern matching. Comparing the handwritten symbol to the set of training symbols (called reference symbols), we can recognize the input handwritten symbol by computing the distances among the input symbol and the reference symbols in the training set. In this paper we propose a VLSI architecture based on the space-time domain expansion which can compute the symbol distance and also give the index pairs which correspond to the warp function. The time complexity is O(max(m, n)) by using m × n processing elements array, where m is the length of the input symbol and n is the length of the reference symbol. With a uniprocessor, the matching process will have the time complexity O(m × n). If there are p reference symbols, using the proposed architecture, the recognition problem can be solved in time O(max(m, n, p)). With a uniprocessor, the time complexity will be O(m × n × p). The algorithm partition problems are discussed. Verification of the proposed VLSI architecture is also given.

18 citations
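
The complexity claims in the abstract above follow from the data dependencies of DTW: every cell on one anti-diagonal of the m × n cost grid depends only on the two previous anti-diagonals, so an m × n array can evaluate a whole anti-diagonal per step and finish in m + n - 1 steps, which is O(max(m, n)). The sketch below is a sequential simulation of that wavefront order that also recovers the index pairs of the warp function by backtracking. It assumes scalar samples, an absolute-difference local distance, and the basic three-way step pattern; it is not the paper's architecture.

```python
def dtw_with_path(x, y):
    """Return (distance, warp path as index pairs, number of wavefront steps)."""
    m, n = len(x), len(y)
    inf = float("inf")
    cost = [[inf] * n for _ in range(m)]
    wavefronts = 0
    # Sweep anti-diagonals k = i + j; cells within one sweep are independent.
    for k in range(m + n - 1):
        wavefronts += 1
        for i in range(max(0, k - n + 1), min(m, k + 1)):
            j = k - i
            if i == 0 and j == 0:
                best = 0.0
            else:
                best = min(
                    cost[i - 1][j] if i > 0 else inf,
                    cost[i][j - 1] if j > 0 else inf,
                    cost[i - 1][j - 1] if i > 0 and j > 0 else inf,
                )
            cost[i][j] = abs(x[i] - y[j]) + best
    # Backtrack from the end cell to recover the warp function.
    i, j = m - 1, n - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((cost[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((cost[i - 1][j], i - 1, j))
        if j > 0:
            candidates.append((cost[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return cost[m - 1][n - 1], path, wavefronts
```

On a uniprocessor the inner loop still visits all m × n cells, matching the O(m × n) figure quoted in the abstract; only the number of outer wavefront steps shrinks to m + n - 1.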


Proceedings ArticleDOI
07 Apr 1986
TL;DR: A VLSI processor designed to compute dynamic time warping algorithms for speech recognition with extreme rapidity is presented, designed to give maximum efficiency on continuous speech recognition applications with or without syntax constraints.
Abstract: We present a VLSI processor designed to compute dynamic time warping algorithms for speech recognition with extreme rapidity. This processor works as a coprocessor in a classical system including a standard microprocessor and a digital signal processor. It uses its own local memory for reference utterances and intermediate results. It has been designed to give maximum efficiency on continuous speech recognition applications with or without syntax constraints. Its flexibility permits software optimisation and its use in a large number of different applications. We use a sequential approach for DTW computations and work along the time axis. All the calculations are carried out on each frame of the unknown utterance as soon as it arrives from the DSP, and DTW computations therefore take place in real time. Response time is in hundredths of a second; intermediate results are obtained before the end of the sentence. A system using this chip will be able to carry out continuous speech recognition in real time on a vocabulary of 300 references. Many of these chips can be used in parallel on a single system.

16 citations


Journal ArticleDOI
TL;DR: Systolic arrays are presented for two connected speech recognition methods: dynamic time warping applied directly to acoustic feature patterns, and probabilistic matching, which requires that the input sentence be preprocessed by a phonetic analyzer. The architecture of a 12 000-transistor programmable NMOS prototype IC, which can be used as the basic processor of the probabilistic matching systolic array, is also presented.
Abstract: Systolic arrays for two connected speech recognition methods are presented. The first method is based on the dynamic time warping algorithm which is applied directly on acoustic feature patterns. The second method is the probabilistic matching algorithm which requires that the input sentence be preprocessed by a phonetic analyzer. It is shown that both methods may be implemented on either a two-dimensional or a linear systolic array. Advantages of each of these implementations are discussed. The architecture of a 12 000 transistors programmable NMOS prototype IC, which can be used as the basic processor of the probabilistic matching systolic arrays, is presented.

10 citations


Journal ArticleDOI
TL;DR: A ring array architecture, together with a hardware algorithm and a control scheme, is studied for dynamic time warping (DTW) processing, in order to achieve real-time speech recognition.
Abstract: A ring array architecture, together with a hardware algorithm and a control scheme, is studied for dynamic time warping (DTW) processing, in order to achieve real-time speech recognition. For developing a practical DTW processor, the key factors are to reduce the number of processing elements (PE's) in the array architecture and to maintain highly efficient concurrency and high throughput. Regular data and control flow is achieved by using a ring network, where every constituent PE uses parallel and pipelined operations on the data. Regular and continuous DTW processing, even when the volume of data to be processed varies, is realized with a novel control scheme based on "tags" and "status flags" attached to the data, which indicate data attributes. This control scheme permits a simple control structure to be achieved for the array system. The efficiency and throughput expected for the ring array architecture are then compared to those of an orthogonal array architecture.

Proceedings ArticleDOI
07 Apr 1986
TL;DR: A learning method in which the syllable templates of a speaker-dependent recognition system are automatically optimized is studied; results showed an average syllable recognition accuracy of 71.0% without and 82.5% with automatic learning.
Abstract: In this speaker-dependent recognition system, recognition is based on syllable template matching and each syllable has several templates. In the initial training for each speaker, 590 templates for 111 syllables are made, each including various contextual variations. The authors studied a learning method in which the syllable templates are automatically optimized. It is judged whether or not an input syllable should be learned according to the recent recognition condition. If it should be learned, the input syllable pattern replaces the template that contributes the least to recognition in the templates segmented from the same context and in the same syllable category. Automatic learning was evaluated on recognition of speech data obtained by reading Japanese sentences at a rate of about 4 to 5 syllables per second. The results over eight speakers showed an average syllable recognition accuracy of 71.0% without and 82.5% with automatic learning. Further, by increasing the maximum number of templates to 1024, it rose to 84.8%.

Proceedings ArticleDOI
01 Apr 1986
TL;DR: The possibility of using an automatic speech recognition system as a front end to a computer for Chinese-character processing is explored, and some preliminary experiments are reported which indicate that the syllable inventory of spoken Standard Chinese belongs to the category of "difficult" vocabularies.
Abstract: The possibility of using an automatic speech recognition system as a front end to a computer for Chinese-character processing is explored in this paper. Aspects of the Chinese language are discussed in relation to the capabilities of current state-of-the-art isolated-word recognition systems. Some preliminary experiments are reported which indicate that the syllable inventory of spoken Standard Chinese belongs to the category of "difficult" vocabularies. The vocabulary size is of the order of 350 syllables with a large number of similar word pairs. Recognition rates using linear predictive coding, Itakura distance measures and dynamic time warping are of the order of 25-30%.

Proceedings ArticleDOI
01 Apr 1986
TL;DR: An approach to isolated and connected word recognition using a dynamic time warping algorithm that is treated as a hidden Markov model, in which the approximate (Viterbi) algorithm has almost the same performance as the exact algorithm.
Abstract: In this paper, we present an approach to isolated and connected word recognition using a dynamic time warping algorithm that is treated as a hidden Markov model. The classification consists of computing the a posteriori probability for each word model and choosing the word model that gives the highest probability. The probability is calculated in two different ways: one is the exact algorithm and the other is the approximate (Viterbi) algorithm. In our system, first, the input speech is recognized as a string of monosyllables by the syllable-based O(n) DP matching. Second, the recognized string is matched with the monosyllable string of each lexical model, and the word or word sequence with the highest probability is recognized as the input speech by using O(n) DP matching based on a hidden Markov model. Reference patterns consist of 68 monosyllables, and test patterns consist of 90 isolated words, two connected words and three connected words. We conclude from the results of the experiments that: (1) The results using 3 candidates are much better than those using only the best candidate for each segment. (2) The approximate algorithm has almost the same performance as the exact algorithm. (3) The extended algorithm for connected word recognition works well.
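
The difference between the exact and the approximate (Viterbi) probability mentioned in the abstract above can be illustrated on a small discrete hidden Markov model: the exact score sums over all state paths, while the Viterbi score keeps only the single best path. The sketch below is generic textbook material under assumed inputs (initial, transition, and emission probability tables indexed by integers); it does not reproduce the paper's syllable-based models or its O(n) DP matching.

```python
def forward_probability(obs, init, trans, emit):
    """Exact likelihood of obs: sum over all state paths (forward recursion)."""
    states = range(len(init))
    alpha = [init[s] * emit[s][obs[0]] for s in states]
    for o in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in states) * emit[s][o]
            for s in states
        ]
    return sum(alpha)


def viterbi_probability(obs, init, trans, emit):
    """Approximate likelihood of obs: probability of the single best state path."""
    states = range(len(init))
    delta = [init[s] * emit[s][obs[0]] for s in states]
    for o in obs[1:]:
        delta = [
            max(delta[p] * trans[p][s] for p in states) * emit[s][o]
            for s in states
        ]
    return max(delta)
```

The two recursions differ only in replacing a sum with a max, so the Viterbi score never exceeds the exact one; when one path dominates the sum, ranking word models by either score tends to give the same decision, which is consistent with conclusion (2) of the abstract.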

Proceedings ArticleDOI
01 Apr 1986
TL;DR: Two efficiency enhancements for speaker-independent connected spoken word spotting are described: LESS COST, which uses phoneme concatenation to reduce the cost of local distance computation, and Coarse DP refinement, which reduces the cost of dynamic time warping for only a small error rate penalty.
Abstract: This paper describes two efficiency enhancements for speaker-independent connected spoken word spotting. The first enhancement, LESS COST, uses phoneme concatenation to reduce the cost of computing the local distance between each reference and input pattern point. The second enhancement, Coarse DP refinement, reduces the cost of dynamic time warping for only a small error rate penalty. An experiment confirmed the effectiveness of these techniques.

Proceedings ArticleDOI
01 Apr 1986
TL;DR: This paper describes a speaker-independent isolated word recognition algorithm for telephone voice and its recognition performance, which consists of dynamic time warping and statistical word discrimination.
Abstract: This paper describes a speaker-independent isolated word recognition algorithm for telephone voice and its recognition performance. The recognition algorithm consists of two processes: dynamic time warping and statistical word discrimination. In the first process, input speech is compared with each word template using the dynamic time warping technique. Multiple word templates are used to deal with speech variations among speakers, where each word template is represented by a sequence of phoneme-like templates. To attain high recognition ability, a new technique for generating word templates is proposed. In the second process, statistical word discrimination is carried out for word candidates which have relatively low reliability in the first process. Discrimination functions are calculated based on statistics of transition tendencies of speech characteristics between adjacent frames, and the final word decision is made. The system was trained using utterances from 1305 speakers and tested with utterances from 259 speakers. An average recognition rate of 96.5% was obtained for a 16-word Japanese vocabulary set.

Journal ArticleDOI
John G. Ackenhusen, Syed S. Ali, David J. Bishop, Louis F. Rosa, Reed Thorkildsen
TL;DR: This paper describes a single-board implementation of an isolated word recognizer based on the principles of linear predictive coding (LPC) and dynamic time warping (DTW) that proceeds on one word while LPC measurement on the next is in progress, increasing speech throughput.
Abstract: This paper describes a single-board implementation of an isolated word recognizer based on the principles of linear predictive coding (LPC) and dynamic time warping (DTW). The recognizer requires only a serial (RS-232) terminal, power supply, and microphone for operation, and may be used to add speech input capability to any serial terminal connected to a host computer. Key elements of the recognizer include a custom integrated circuit for DTW-based pattern matching, a single-chip implementation of real-time LPC feature measurement, and a 16-bit microprocessor for control, communication, and decision functions. As a result of the custom integrated circuit and multiple processor architecture, pattern matching speed is increased by a factor of 50 over an earlier design with no custom integrated circuits and without pipeline processing capabilities, and proceeds on one word while LPC measurement on the next is in progress, increasing speech throughput. Comprehensive control/evaluation software for the recognizer has been developed for the AT&T PC6300 personal computer.

Proceedings ArticleDOI
07 Apr 1986
TL;DR: An LSI design for use in a speech recognition system is described, which can perform matching with about 100 reference patterns and can be applied to connected word recognition systems.
Abstract: An LSI design for use in a speech recognition system is described. The Staggered Array Dynamic Programming (SADP) method [1] has been adopted as a high-speed Dynamic Time Warping (DTW) technique. A new LSI architecture has been designed for SADP. The main features of this architecture are look-up tables for address calculation and a parallel processing structure for SADP calculation. The SADP-LSI is designed using a commercially available gate array. Memories and some control logic components are attached externally. Under microprocessor control, this LSI can perform matching with about 100 reference patterns. It can also be applied to connected word recognition systems.

Journal ArticleDOI
TL;DR: New systolic architectures have been evaluated for the Dynamic Time Warping (DTW) algorithm, a non-linear pattern matching technique used in isolated and continuous speech recognition systems, which has led to a systolic architecture which is relatively flexible, compact and easy to test.

Journal ArticleDOI
TL;DR: In this paper, a dynamic time-warping (DTW) algorithm was proposed to maximize the cross-correlation between a template and the recorded block of EPs.

Proceedings ArticleDOI
01 Apr 1986
TL;DR: A new speaker-adaptation method using selective linear prediction (SLP) is developed to recognize connected digits using the reference speech patterns spoken by another speaker and achieves a remarkable improvement of the correct recognition rate.
Abstract: When multi-template methods are used for a speaker-independent speech recognition system, speaker adaptation is useful to reduce the calculation time and the incorrect recognition rate. As one of the simplest and the most stable methods, we developed a new speaker-adaptation method using selective linear prediction (SLP) to realize the frequency expansion. In this system, the frequency expansion ratios of a few vowels are estimated first, then the reference speech patterns are modified based on them, and their modified patterns are used for the continuous DP (Dynamic Programming) matching to recognize the candidate words. We applied it to recognize connected digits using the reference speech patterns spoken by another speaker and achieved a remarkable improvement of the correct recognition rate.

Proceedings ArticleDOI
01 Apr 1986
TL;DR: Non-linear time warping, indispensable for spoken word recognition, was shown to be accomplished by controlling the transmittance function of the windowing plate and close agreement between the frequency spectra obtained and those obtained by the simulation confirms that the system operates as expected.
Abstract: A new optical processing system for spectrum analysis of speech was proposed. The main components of this system are an optical processor and a micro-computer. The processor consists of a He-Ne laser, optical lenses, and photographic film plates. Since optical signal processing is inherently parallel for two-dimensional signals, time-varying spectra of a one-dimensional speech signal can be obtained without shifting a window along the time axis. Based on the results of a computer simulation for the spectrum analysis of speech, the optical processor was designed and constructed. Various /V/, /VV/ and /CV/ utterances were analyzed by using the optical processing system. The close agreement between the frequency spectra obtained by the system and those obtained by the simulation confirms that the system operates as expected. Non-linear time warping, indispensable for spoken word recognition, was shown to be accomplished by controlling the transmittance function of the windowing plate.

Proceedings ArticleDOI
01 Apr 1986
TL;DR: The proposed decomposition leads to compact realizations of complete systolic arrays on a single chip and achieves a matching in real-time with 60 reference words with a clock-frequency of only 100 kHz.
Abstract: New systolic architectures for the Dynamic Time Warping (DTW) algorithm have been evaluated. The DTW algorithm is used as a non-linear pattern matching technique in isolated and continuous speech recognition systems. A non-conventional decomposition of the recursive DTW algorithm, bit-serial computation, extensive pipelining and simultaneous matching of multiple patterns are used in order to execute the DTW algorithm in real time. The proposed decomposition leads to compact realizations of complete systolic arrays on a single chip. A typical realization of an array of 14 systolic processors in a 4 μm 1.5 V CMOS process contains approximately 15,000 transistors on a chip area of 11.5 mm². With a clock frequency of only 100 kHz, we already achieve a matching in real time (0.2 sec) with 60 reference words. One processor element has been integrated.

Proceedings ArticleDOI
Rong Yu, M. Kimura
01 Apr 1986
TL;DR: A relaxation algorithm is shown in this paper to seek the optimal candidate sequence from a large quantity of candidate sequences for speaker-independent recognition of vowels.
Abstract: A relational model for Japanese vowels is defined that describes the relations among vowels using only relative parameters. Based on the model a new approach for vowel recognition is developed in which three candidates are considered for every vowel in a vowel sequence and each candidate sequence is evaluated as a whole by checking the relationships among them. We show a relaxation algorithm in this paper to seek the optimal candidate sequence from a large quantity of candidate sequences. The approach has been applied to a large speech data base consisting of spoken words voiced by 84 speakers. Experimental results show that the technique is highly effective for speaker-independent recognition of vowels.

Proceedings ArticleDOI
23 Mar 1986
TL;DR: Methods for recognizing phonemes in 2-D spectrograms are also expected to lend themselves to optical processing and hidden Markov models for exploiting temporal relations in establishing word and sentence hypotheses appear to be efficiently implementable on the processor.
Abstract: A previously proposed optical crossbar interconnected signal processor is reviewed for application to speech recognition. A traditional recognition approach is described that uses linear prediction, dynamic time warping, and parsing. A wait-and-see rule-based parser is used. The major parts of the system are implemented on the optical processor and show that high performance should be achievable for large vocabularies. Further, hidden Markov models for exploiting temporal relations in establishing word and sentence hypotheses also appear to be efficiently implementable on the processor. Methods for recognizing phonemes in 2-D spectrograms are also expected to lend themselves to optical processing. © 1986 SPIE, The International Society for Optical Engineering.