Showing papers on "Dynamic time warping" published in 1996


Proceedings ArticleDOI
Li Lee, Richard Rose
07 May 1996
TL;DR: An efficient means for estimating a linear frequency warping factor and a simple mechanism for implementing frequency warping by modifying the filter-bank in mel-frequency cepstrum feature analysis are presented.
Abstract: In an effort to reduce the degradation in speech recognition performance caused by variation in vocal tract shape among speakers, a frequency warping approach to speaker normalization is investigated. A set of low complexity, maximum likelihood based frequency warping procedures have been applied to speaker normalization for a telephone based connected digit recognition task. This paper presents an efficient means for estimating a linear frequency warping factor and a simple mechanism for implementing frequency warping by modifying the filter-bank in mel-frequency cepstrum feature analysis. An experimental study comparing these techniques to other well-known techniques for reducing variability is described. The results showed that frequency warping was consistently able to reduce word error rate by 20% even for very short utterances.

344 citations
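
The two ingredients named in the abstract above, warping the filter-bank and estimating a warping factor, can be illustrated with a minimal sketch. Several things here are assumptions and not taken from the paper: the mel formula is one common convention, only filter center frequencies are warped, the factor is chosen by a plain grid search, and `score` is a hypothetical callback standing in for the likelihood of the warped features under the recognizer's models.

```python
import math

def hz_to_mel(f_hz):
    # One common mel-scale convention (an assumption, not from the paper).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def warped_center_frequencies(num_filters, f_low, f_high, alpha):
    """Mel-spaced filter centers, linearly scaled by alpha and clipped to f_high."""
    mel_low, mel_high = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (mel_high - mel_low) / (num_filters + 1)
    centers = [mel_to_hz(mel_low + step * (i + 1)) for i in range(num_filters)]
    return [min(alpha * c, f_high) for c in centers]

def best_warp_factor(candidates, score):
    """Grid search: return the candidate alpha with the highest score.

    `score` is a hypothetical callback, e.g. the likelihood of an utterance
    whose features were computed with the filter-bank warped by alpha.
    """
    return max(candidates, key=score)
```

Warping the filter-bank instead of resampling the signal means each candidate factor only changes where the filters sit, which keeps a search over factors cheap.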


Journal ArticleDOI
TL;DR: Under restricted recording conditions, this technique apparently has general applicability to analysis of a variety of animal vocalizations and can dramatically decrease the amount of time spent on manual identification of vocalizations.
Abstract: The application of dynamic time warping (DTW) to the automated analysis of continuous recordings of animal vocalizations is evaluated. The DTW algorithm compares an input signal with a set of predefined templates representative of categories chosen by the investigator. It directly compares signal spectrograms, and identifies constituents and constituent boundaries, thus permitting the identification of a broad range of signals and signal components. When applied to vocalizations of an indigo bunting (Passerina cyanea) and a zebra finch (Taeniopygia guttata) collected from a low‐clutter, low‐noise environment, the recognizer identifies syllables in stereotyped songs and calls with greater than 97% accuracy. Syllables of the more variable and lower amplitude indigo bunting plastic song are identified with approximately 84% accuracy. Under restricted recording conditions, this technique apparently has general applicability to analysis of a variety of animal vocalizations and can dramatically decrease the amount of time spent on manual identification of vocalizations.

186 citations
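
The template comparison described in the abstract above can be sketched as a textbook DTW distance followed by a nearest-template decision. This is a minimal illustration, not the authors' recognizer: the Euclidean frame distance, the unconstrained step pattern, and the names `dtw_distance` and `classify` are all assumptions.

```python
import math

def dtw_distance(a, b):
    """Cumulative DTW cost between two sequences of equal-length feature vectors."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] holds the best cumulative cost aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # local frame distance
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def classify(signal, templates):
    """Return the label of the template with the smallest DTW distance."""
    return min(templates, key=lambda label: dtw_distance(signal, templates[label]))
```

Here `templates` is assumed to map a category label to one reference sequence, for example spectrogram frames of a syllable chosen by the investigator.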


Patent
Adoram Erell
12 Dec 1996
TL;DR: A keyword recognition system for speaker dependent, dynamic time warping (DTW) recognition systems uses all of the trained word templates in the system (keyword and vocabulary) to determine whether an utterance is a keyword utterance.
Abstract: A keyword recognition system for speaker dependent, dynamic time warping (DTW) recognition systems uses all of the trained word templates in the system, (keyword and vocabulary), to determine if an utterance is a keyword utterance or not. The utterance is selected as the keyword if a keyword score indicates a significant match to the keyword template and if the keyword score indicates a better match than do the entirety of scores to the vocabulary word templates.

147 citations
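
The acceptance rule in the abstract above can be written as a small predicate. This is only a sketch under stated assumptions: scores are taken to be DTW distances (lower is better), and `threshold` is a hypothetical tuning parameter that stands in for what the patent calls a significant match.

```python
def is_keyword(keyword_score, vocabulary_scores, threshold):
    """Accept an utterance as the keyword under two conditions.

    Assumes scores are distances, so smaller means a closer match.
    """
    # Condition 1: the match to the keyword template is significant.
    significant = keyword_score <= threshold
    # Condition 2: the keyword matches better than every vocabulary template.
    best_of_all = all(keyword_score < s for s in vocabulary_scores)
    return significant and best_of_all
```

For instance, `is_keyword(2.0, [3.5, 4.1], threshold=2.5)` accepts the utterance, while a keyword score of 3.8 against the same vocabulary scores is rejected on both conditions.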


Proceedings ArticleDOI
25 Aug 1996
TL;DR: An on-line signature verification system based on dynamic time-warping (DTW) that is able to deal efficiently with the availability of a rather large number of reference patterns, making it possible to determine which parts of a reference signature are important and which are not.
Abstract: In this paper, we discuss an on-line signature verification system based on dynamic time-warping (DTW). The DTW-algorithm originates from the field of speech recognition, and has been applied successfully in the signature verification area more than once. However, until now, few adaptations have been made in order to take the specific characteristics of signature verification into account. According to us, one of the most important differences is the availability of a rather large number of reference patterns, making it possible to determine which parts of a reference signature are important and which are not. By disconnecting the DTW-stage and the feature extraction process we are able to deal efficiently with this extra amount of information. We demonstrate the benefits of our approach by building and evaluating a complete system.

114 citations


Book ChapterDOI
15 Apr 1996
TL;DR: Tests on small isolated-word vocabularies using a dynamic time warping based audio-visual recogniser demonstrate that real-time, contour-based lip tracking can be used to supplement acoustic-only speech recognisers enabling robust recognition of speech in the presence of acoustic noise.
Abstract: Developments in dynamic contour tracking permit sparse representation of the outlines of moving contours. Given the increasing computing power of general-purpose workstations it is now possible to track human faces and parts of faces in real-time without special hardware. This paper describes a real-time lip tracker that uses a Kalman filter based dynamic contour to track the outline of the lips. Two alternative lip trackers, one that tracks lips from a profile view and the other from a frontal view, were developed to extract visual speech recognition features from the lip contour. In both cases, visual features have been incorporated into an acoustic automatic speech recogniser. Tests on small isolated-word vocabularies using a dynamic time warping based audio-visual recogniser demonstrate that real-time, contour-based lip tracking can be used to supplement acoustic-only speech recognisers enabling robust recognition of speech in the presence of acoustic noise.

96 citations


Journal ArticleDOI
TL;DR: Dynamic time warping is proposed as a tool for supervision and fault detection, with particular reference to bioprocess applications; it is expected to be especially advantageous in research applications, where a structured model is lacking.

63 citations


Proceedings ArticleDOI
25 Aug 1996
TL;DR: A new omni-directional sensor for binocular peripheral vision is presented, together with a method based on a dynamic time warping algorithm for reconstructing unknown scenes, which allows omni-directional robot vision to be considered under a new realistic and robust aspect.
Abstract: The observation of an entire 3D space and the reconstruction of an observed unknown scene are very interesting in the field of robot vision. This paper presents a new omni-directional device especially built for binocular peripheral vision. The architecture of the sensor is designed to simplify the computation considerably for real time application. The device needs no calculation of epipolar lines. This paper describes a new method and presents unknown reconstruction scenes based on a dynamic time warping algorithm. The image matching approach exploits the architecture benefits by calculating, in real time, the depth of the image slits of each angular position. The system described allows us to consider omni-directional robotics vision under a new realistic and robust aspect.

50 citations


Patent
01 Feb 1996
TL;DR: In this paper, a pattern recognition system and method is described for Hidden Markov Model (HMM) and Dynamic Time Warping (DTW) scoring units. The method includes the steps of a) providing a noisy test feature set of the input signal, a plurality of reference feature sets of reference templates produced in a quiet environment, and a background noise feature set.
Abstract: A pattern recognition system and method is disclosed. The method includes the steps of a) providing a noisy test feature set of the input signal, a plurality of reference feature sets of reference templates produced in a quiet environment, and a background noise feature set of background noise present in the input signal, b) producing adapted reference templates from the test feature set, the background noise feature set and the reference feature sets and c) determining match scores defining the match between each of the adapted reference templates and the test feature set. The method can also include adapting the scores before accepting a score as the result. The system and method are described for both Hidden Markov Model (HMM) and Dynamic Time Warping (DTW) scoring units. The system performs the steps of the method.

39 citations


Journal ArticleDOI
TL;DR: A novel multi-layer perceptron (MLP) based speech recognition method that is comparable to a well modeled continuous Gaussian mixture density HMM trained with the minimum error criterion and requires fewer trainable parameters than the HMM system, but the former is more convenient for analysing internal features.

33 citations


Journal ArticleDOI
TL;DR: Use of two peak frequencies, time warping in matching calls, and tolerance of frequency shift were found to be the three most important factors in mimicking the bird’s own classification of natural calls, whereas intensity differences of peak frequencies did not play an important role.
Abstract: Dynamic programming (DP) matching was applied to classification of budgerigar contact calls. Our DP‐matching algorithm calculates distances between two calls with time warping. It was compared to other methods including linear matching methods and methods using cross correlation, and was evaluated in classifying calls into natural groups. The DP‐matching method with two peak frequencies with some tolerance in frequency comparison (DP2peak) and the cross‐correlation method using two peak frequency tracks with frequency shift (Corr2shift) were equally effective in classifying obviously different calls. DP2peak proved more effective than any other methods tested in classifying minutely different cagemate calls. Use of two peak frequencies, time warping in matching calls, and tolerance of frequency shift were found to be the three most important factors in mimicking the bird’s own classification of natural calls, whereas intensity differences of peak frequencies did not play an important role. The possibility of similar processes of call discrimination in the bird’s brain, such as simultaneous perception of two frequencies and time warping in comparing calls, was discussed in relation to these results.

28 citations


Book ChapterDOI
01 Jan 1996
TL;DR: Tests on a 40 word vocabulary using a dynamic time warping based audio-visual recogniser demonstrate that the lip outline is a rich source of information for speech recognition and establish dynamic contour tracking as a viable instrument for near real-time speechreading applications.
Abstract: Recent developments in contour tracking permit the outlines of moving, natural objects to be tracked live, at full video-rate. Such a capability can be used to turn parts of the body—for instance, the hands and lips—into input devices. The results presented here were obtained using a real-time lip tracker which utilises a novel Kalman filter based dynamic contour to track the outline of a speaker’s lips. The tracker incorporates predictive dynamics which can be learned from training sequences and automatically tuned to follow typical motions found in speech. The visual data from the tracker is incorporated into an acoustic automatic speech recogniser enabling robust recognition of speech in the presence of acoustic noise. Tests on a 40 word vocabulary using a dynamic time warping based audio-visual recogniser demonstrate that the lip outline is a rich source of information for speech recognition and establish dynamic contour tracking as a viable instrument for near real-time speechreading applications.

Journal ArticleDOI
TL;DR: A stochastic method called the genetic algorithm (GA), which is used to solve the nonlinear time alignment problem, is presented and experimental results show that the GA has a better performance than the DTW.
Abstract: Dynamic time warping (DTW) is a nonlinear time-alignment technique for automatic speech recognition (ASR) systems. It has been widely used in many commercial and industrial products, ranging from electronic diaries/dictionaries to wireless voice digit dialers. DTW has the advantages of fast training and searching times, which makes it more popular than other available ASR techniques. However, there exist some limitations to DTW, such as the stringent rule on slope weighting, the nontrivial computation of the K-best paths, and the significant increase in computational time when the endpoint constraint is relaxed or the variation in pattern length increases. In this paper, a stochastic method called the genetic algorithm (GA), which is used to solve the nonlinear time alignment problem, is presented. Experimental results show that the GA has a better performance than the DTW. In addition, two derivatives of GA, the hybrid GA and the parallel GA, are also presented.

Journal ArticleDOI
01 Aug 1996
TL;DR: A VLSI processor is designed for the small-scale isolated speech recognition applications which detects endpoint, extracts LPC (linear predictive coefficient) cepstral coefficients from the speech signal, and computes the spectral distances using a dynamic time warping (DTW) technique.
Abstract: A VLSI processor is designed for small-scale isolated speech recognition applications. It is a dedicated processor which detects endpoints, extracts LPC (linear predictive coefficient) cepstral coefficients from the speech signal, and computes the spectral distances using a dynamic time warping (DTW) technique. The designed chip can recognize 1000 isolated words per second with an average recognition accuracy of 90.3%. It is designed in a 0.8 μm CMOS technology, includes 66,760 gates, and runs with a 10 MHz clock.

Proceedings ArticleDOI
03 Oct 1996
TL;DR: The paper reports on experiments in non segmental speech analysis and synthesis using parameters derived from a speech database of British English monosyllables, which includes almost every onset, nucleus and coda, and almost all onset nucleus and nucleus consonant combinations occurring in English.
Abstract: The paper reports on experiments in non segmental speech analysis and synthesis using parameters derived from a speech database of British English monosyllables. The database includes almost every onset, nucleus and coda, and almost all onset nucleus and nucleus consonant combinations occurring in English. Acoustic parameters including f0, formant frequencies and bandwidths, and amplitude of voicing were determined for each token in the database. Fine duration differences within minimal pairs are analyzed using dynamic time warping techniques, avoiding the need for manual segmentation. For each parameter, a matrix of distances between all samples of the two words is calculated, together with a minimal path through the matrix (the warp path). The set of warp paths for all parameters identifies the nature and location of acoustic differences between the words, including locations of temporal expansion and compression. Preliminary experiments using dynamic time warping for non segmental synthesis are also discussed.
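
The distance matrix and minimal path mentioned in the abstract above can be sketched for a single scalar parameter track, such as f0. This is a textbook DTW with backtracking, offered as an illustration only: the absolute-difference local distance, the three-way step pattern, and the name `warp_path` are assumptions, not details from the paper.

```python
def warp_path(x, y):
    """Return (total_cost, path) aligning two scalar parameter tracks with DTW."""
    inf = float("inf")
    n, m = len(x), len(y)
    # Matrix of local distances between all samples of the two tracks.
    dist = [[abs(xi - yj) for yj in y] for xi in x]
    acc = [[inf] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                best = 0.0
            else:
                best = min(
                    acc[i - 1][j] if i > 0 else inf,
                    acc[i][j - 1] if j > 0 else inf,
                    acc[i - 1][j - 1] if i > 0 and j > 0 else inf,
                )
            acc[i][j] = dist[i][j] + best
    # Backtrack from the last cell to recover the minimal path.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        steps = []
        if i > 0 and j > 0:
            steps.append((acc[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            steps.append((acc[i - 1][j], i - 1, j))
        if j > 0:
            steps.append((acc[i][j - 1], i, j - 1))
        _, i, j = min(steps)
        path.append((i, j))
    path.reverse()
    return acc[n - 1][m - 1], path
```

Horizontal or vertical runs in the returned path mark where one track is stretched relative to the other, which is how a warp path can locate temporal expansion and compression without manual segmentation.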

Proceedings ArticleDOI
13 Oct 1996
TL;DR: In this paper, the Discrete Time Wavelet Transform of the signal is calculated and the highest scales along with the low-pass residue of the wavelet transform are treated as signals and the spectrogram of each one of them is in turn treated as 2D images.
Abstract: Recognition of pre-defined musical patterns in the context of Greek Traditional Music is very useful to researchers in Musicology and Ethnomusicology. This paper presents an efficient method for recognizing isolated musical patterns played by Greek Traditional Clarinet, in a monophonic environment. The Discrete Time Wavelet Transform of the signal is calculated. The highest scales, along with the lowpass residue of the Wavelet Transform, are treated as signals and the spectrogram of each one of them is calculated. The spectrograms are in turn treated as 2-D images. A number of translation and scaling invariant moments are then computed for the resulting images. These moments are used as features, and turn out to cluster around certain points in the corresponding multidimensional feature space, for the various musical patterns. A tree-like structured classification procedure is then adopted for classification. A few clusters correspond to more than one musical pattern. In such cases a Dynamic Time Warping procedure is employed to determine the specific pattern.

Journal ArticleDOI
TL;DR: A Genetic Time Warping (GTW) algorithm for isolated word recognition was proposed and it demonstrated that GTW performed better or much better than the DTW method for most of the tested words.
Abstract: In this paper, a Genetic Time Warping (GTW) algorithm for isolated word recognition was proposed. Relative representation techniques, fitness techniques and reproduction techniques were described and genetic operators were also discussed in detail. Different from the conventional genetic algorithms with fixed genes, every chromosome has its own number of genes. A modified order-based crossover operator was introduced in order to deal with the chromosomes with a different number of genes. Besides the mutation and crossover operators, a new heuristic local optimum operator was also built and it could alter part of a chromosome based on a function of local distance and average distortion of the paths. Finally, experimental investigations were carried out to test the performance of GTW. Based on Rabiner's normal assumptions [23] on the distributions of the distances, the overall probability of making a word error could be calculated experimentally. Results demonstrated that GTW performed better or much better than the DTW method for most of the tested words.

Proceedings ArticleDOI
03 Oct 1996
TL;DR: A semiautomatic method is proposed that generates unit inventories by segmenting units out of carrier phrases through dynamic time warping alignment with a synthesized utterance; a penalty system using timing constraints selects the best segmentation, and comparison with manually corrected segmentations shows the validity of the approach.
Abstract: In concatenative speech synthesis systems, the generation of a unit inventory is a tedious task. However, some applications demand multiple voices. A semiautomatic method to generate unit inventories is proposed. The units are segmented out of carrier phrases by means of dynamic time warping alignment with a synthesized utterance. This requires at least one existing inventory. The availability of several existing inventories will improve the likelihood of finding one with similar voice characteristics, which will improve the accuracy of results. The method is a bootstrapping procedure. To choose the best segmentation out of a set (e.g. aligned with each voice already implemented), a penalty system was developed that uses timing constraints. The results were compared with manually corrected segmentations and show the validity of this approach.

Proceedings ArticleDOI
01 Sep 1996
TL;DR: A novel method of text-independent speaker recognition which uses only the correlations among MFCCs, computed over selected speech segments of very-short length (approximately 120ms) is proposed.
Abstract: The problem addressed in this paper is related to the fact that classical statistical approach for speaker recognition yields satisfactory results but at the expense of long length training and test utterances. An attempt to reduce the length of speaker samples is of great importance in the field of speaker recognition since the statistical approach, due to its limitations, is usually precluded from use in real-time applications. A novel method of text-independent speaker recognition which uses only the correlations among MFCCs, computed over selected speech segments of very-short length (approximately 120ms) is proposed. Three different neural networks — the Multi-Layer Perceptron (MLP), the Steinbuch's Learnmatrix (SLM) and the Self-Organizing Feature Finder (SOFF) — are evaluated in a speaker recognition task. The ability of dimensionality reduction of the SOFF paradigm is also discussed.

Proceedings ArticleDOI
03 Jun 1996
TL;DR: A learning vector quantization method based on the dynamic time warping scheme is proposed for speech recognition, achieving an optimized speech database and adequate time-alignment vector matching.
Abstract: A learning vector quantization method based on the dynamic time warping scheme is proposed for speech recognition. An optimized speech database and adequate time-alignment vector matching can be achieved. The recognition accuracy for different users is improved by adapting the speech database using the learning vector quantization method. A user-friendly software system of the proposed method is implemented to demonstrate the recognition of 200 voice commands. Various languages and dialects can be realized in our system with a high recognition accuracy. An evaluation board for speech recognition using an 8051 microprocessor has been designed for industrial applications. A pipelined programmable micro-architecture of the speech recognition processor is developed for a high-performance and high-speed recognition system.

Journal ArticleDOI
TL;DR: In this paper, a new collaborative project seeks to evaluate algorithms adapted from human speech recognition to establish a basis for automating the identification of animal vocalizations and recording their occurrence, including dynamic time warping and hybrid hidden Markov models incorporating features of artificial neural networks.
Abstract: Finding and censusing birds and other animals via listening can pose problems because of inaccessibility of habitats, rarity or shyness of animals, or subjectivity of observers. A new collaborative project seeks to evaluate algorithms adapted from human speech recognition to establish a basis for automating the identification of animal vocalizations and recording their occurrence. The algorithms include dynamic time warping and hybrid hidden Markov models incorporating features of artificial neural networks. Probably no single method will work for all species. More than one method may be useful together, in multiple stages. A database of high‐quality, annotated digital field recordings is being collected to supply training and test data on known species and, when possible, known individuals. Both low‐noise and realistic ambient noise situations are important. Field data are supplemented with recordings from laboratory settings. Red‐cockaded woodpecker, other vocal yet threatened species, and species related to them, such as other Picoides woodpeckers, are being studied. Preliminary results are presented. [Research supported by USACERL.]

Proceedings ArticleDOI
18 Nov 1996
TL;DR: This research proposes the use of inductive inference "decision trees" for speech processing applications such as automatic speech recognition, automatic language identification, speech understanding and speaker verification, and attempts to solve the problem of inter- and intra-speaker speech variability.
Abstract: Proposes the use of inductive inference "decision trees" for speech processing applications such as automatic speech recognition, automatic language identification, speech understanding and speaker verification. The aim of this research is to demonstrate that artificial intelligence techniques such as inductive learning can provide an alternative approach to existing speech processing techniques such as dynamic time warping, hidden Markov modelling (HMM) and neural networks. The construction of the decision tree is based on the C4.5 inductive system developed by J.R. Quinlan (1993). The decision tree is generated automatically from the training speech database. The classification is performed using an inference engine to execute the decision tree and classify the firing of the rules. The proposed system has two main advantages. Firstly, it attempts to solve the problem of inter- and intra-speaker speech variability, by the use of a large speech database. Secondly, it has the ability to generate decision trees using any combination of features (parametric or acoustic-phonetic). This allows the integration of features from existing signal processing techniques, that are currently used in HMM stochastic modelling, and acoustic-phonetic features, which have been the cornerstone of traditional knowledge-based techniques.

Proceedings ArticleDOI
03 Oct 1996
TL;DR: The paper presents a new scheme of acoustic modeling for speech recognition based on an idea of statistical phoneme center, which has several properties that are feasible for realizing more reliable phoneme extraction.
Abstract: The paper presents a new scheme of acoustic modeling for speech recognition based on the idea of a statistical phoneme center (SPC). The statistical phoneme center has several properties that are feasible for realizing more reliable phoneme extraction. First, the authors assume that there is a fictitious center point in every phoneme. The center is determined statistically by an iterative procedure to maximize the local likelihood using a large amount of speech data. Next, in order to evaluate the performance of phoneme extraction, phoneme recognition is realized by optimizing the likelihood based on the dynamic time warping technique. As an experimental result, 71.6% recognition accuracy is obtained for speaker independent phoneme recognition. This result demonstrates that the proposed SPC is a new effective concept for obtaining a more stable acoustic model for speaker independent speech recognition.

Proceedings ArticleDOI
Pascale Fung
07 May 1996
TL;DR: It is found that the lengths of context segments of a word are closely correlated to that of the translation, even when the corpus is non-parallel, i.e., monolingual texts which are not translations of each other.
Abstract: We report a new statistical feature relating a bilingual word pair in a non-parallel English-Chinese corpus. It is found that the lengths of context segments of a word are closely correlated to that of the translation, even when the corpus is non-parallel, i.e., monolingual texts which are not translations of each other. The context segment length histogram of a word has a characteristic pattern and corresponds to that of its translation. If a word appears most frequently in long segments, its translation is found to be most likely occurring in long segments. One way to match these histograms is to first extract their salient shape characteristics by space-frequency analysis and then match them against each other using dynamic time warping. The results of matching can be used in combination with other statistical features to bootstrap a word or term translation algorithm from non-parallel corpora.

01 Dec 1996
TL;DR: This paper reports on the implementation of a real-time speaker-independent isolated-word speech recognition program on a PC Windows platform based on the Dynamic Time Warping (DTW) paradigm for computational efficiency.
Abstract: This paper reports on the implementation of a real-time speaker-independent isolated-word speech recognition program on a PC Windows platform. The overall structure of the recognition engine is based on the Dynamic Time Warping (DTW) paradigm for computational efficiency. Furthermore, to decrease the recognition time and increase the recognition accuracy, the dictionary is limited to under 15 words. This severely restricts the vocabulary. To overcome this restriction, a new technique is introduced. Many dictionaries are linked in a hierarchical structure and each word in each dictionary will activate a new dictionary related to that word. This represents a basic form of language modelling which is suited for the menu-driven interface found in many of today's applications. The results show that reasonable performance can be achieved by these methods.
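
The linked-dictionary idea in the abstract above can be sketched as a small state machine. This is an illustration under assumptions: dictionaries are keyed by the word that activates them, a word with no linked dictionary returns control to a hypothetical `root` dictionary, and `recognise` is a placeholder callback for the DTW-based word recogniser, none of which is specified in the abstract.

```python
def recognise_sequence(utterances, dictionaries, root, recognise):
    """Recognise a sequence of utterances, switching dictionaries as words are heard.

    `recognise(utterance, vocabulary)` is a hypothetical callback that returns
    the best-matching word from the currently active (small) vocabulary.
    """
    active = root
    words = []
    for utterance in utterances:
        word = recognise(utterance, dictionaries[active])
        words.append(word)
        # Each recognised word activates the dictionary linked to it, if any;
        # otherwise fall back to the root dictionary (an assumption).
        active = word if word in dictionaries else root
    return words
```

Because only one small dictionary is searched at a time, each recognition step compares against a handful of templates even when the total vocabulary across all dictionaries is large.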

Proceedings ArticleDOI
26 Nov 1996
TL;DR: This paper proposes inductive inference "decision trees" for isolated digit recognition as an alternative approach to existing automatic speech recognition techniques such as dynamic time warping, hidden Markov modelling and neural networks.
Abstract: This paper proposes the use of inductive inference "decision trees" for isolated digit recognition. The aim of this research is to demonstrate that inductive learning can provide an alternative approach to existing automatic speech recognition techniques such as dynamic time warping (DTW), hidden Markov modelling (HMM) and neural networks (NN). The construction of the decision tree is based on the C4.5 inductive system developed by Quinlan (1986, 1993). The decision tree is generated automatically from the training speech database. The database contains labelled examples in the form of a feature vector and its corresponding label, for each frame. The feature vector may consist of any number of different feature sets and the label may be at the word, phonetic class or phoneme level. The recognition is performed at the frame level, using an inference engine to execute the decision tree and classify the firing of the rules. The proposed system has two main advantages. Firstly, it uses the data-driven approach to isolated word classification, thus attempting to solve the problem of inter- and intra-speaker speech variability, by the use of a large speech database. Secondly, it has the ability to generate decision trees using any combination of features (parametric or acoustic-phonetic). This allows the integration of features from existing signal processing techniques, that are currently used in HMM stochastic modelling, and acoustic-phonetic features, which have been the cornerstone of traditional knowledge based techniques. Isolated digit recognition results for the Texas Instruments (TI) digit database, for speaker dependent and independent recognition, are presented.

Book ChapterDOI
01 Jan 1996
TL;DR: This article presents some recent developments in generalizing the standard HMM to incorporate the local dynamic patterns as well as the global non-stationarity for speech signal modeling.
Abstract: The standard hidden Markov models (HMM) assume local or state-conditioned stationarity of the signals being modeled. In this article, we present some recent developments in generalizing the standard HMM to incorporate the local dynamic patterns as well as the global non-stationarity for speech signal modeling. The major component of the proposed non-stationary HMMs is the parametric regression models for individual HMM states. The regression functions are intended for characterizing the dynamic movements of the signals within an HMM state. Both the EM algorithm (or Baum-Welch algorithm) and the segmental K-means algorithms are generalized to accommodate the complex state duration information needed for the estimation of regression parameters. To allow for the flexibility of linear time warping in individual HMM states, an efficient algorithm is developed with the use of token-dependent auxiliary parameters. Although the auxiliary parameters are of no interest in themselves for modeling speech sound patterns, they provide an intermediate tool for achieving maximal accuracy in estimating the parameters of the regression models.

Proceedings ArticleDOI
14 Oct 1996
TL;DR: Compared with other methodologies using HMM or dynamic time warping, the explicit use of Cantonese speech features enhances the results and recognition speed.
Abstract: This paper presents a heuristic methodology to recognize Cantonese finals. A consonant/vowel recognition is first used to segment the initial and final. After segmentation, the final is recognized by first classifying it into an individual final group by a simple distortion measurement and then recognizing it by dynamic time warping within each final group. The feature extraction for final group classification is done by a pseudo-search of the repeating harmony unit in the middle steady part of the final. The tailing consonant is found by a pseudo-search of the segmentation point. The pseudo-search in both procedures is characterised by pitch determination and harmony unit comparison. Using this methodology, an averaged recognition accuracy of 93.44% is obtained for recognizing finals. Compared with other methodologies using HMM or dynamic time warping, the explicit use of Cantonese speech features enhances the results and recognition speed.


01 Jan 1996
TL;DR: An automated method for phonetic labeling of speech data is presented, a method that requires the orthographic transcription (text) of the speech sequence and is aided by a phonetic lexicon.
Abstract: Speech data-bases are an important issue in the study of speech communication. In particular, time-aligned and phonetically labeled speech data-bases are useful during the design of applications such as speech recognizers. The time-alignment and labeling of the speech data is traditionally done manually. This manual procedure is time consuming, tedious, and subjective. Therefore, an automated labeling procedure would be advantageous. In the present thesis, various aspects of the design of a system for automatic labeling of speech data are thoroughly investigated. An automated method for phonetic labeling of speech data is presented, a method that requires the orthographic transcription (text) of the speech sequence. The method is aided by a phonetic lexicon. The procedure handles long speech sequences as it takes advantage of first performing a coarse alignment between the speech signal and the corresponding text. The coarse alignment supplies the phonetic labeling system with short sequences of speech with their corresponding phonetic transcription as given by the phonetic lexicon. The alignment between the speech signal and the phones is found by means of a Viterbi search algorithm. The search algorithm is supported by a distance function based on a speech model that contains sets of phonetic and di-phonetic classes with associated mean vectors and covariance matrices. Furthermore, an extensive evaluation study of the automated labeling procedure is done, applying various signal processing methods, distance functions, and various speech models. Additionally, the behavior of the automated labeling procedure is illustrated to indicate future development of the procedure.
A second order recursive algorithm for adaptive signal processing is proposed. The basic algorithm is derived and analyzed for the ARX case, and then extended to instrumental variables and prediction-error-like algorithms. Furthermore, a similar algorithm is derived for signal subspace tracking. It is shown that the algorithm encompasses both the RLS and the LMS algorithms as special cases. The computational complexity is the same as for the RLS algorithm, but some extra memory storage is required. The associated ordinary differential equation for the ARX case algorithm is proven to be globally exponentially stable. Furthermore, it is demonstrated that the proposed algorithm has a higher ability to track time-varying systems than has the RLS-algorithm. The proposed algorithm especially handles those situations well where there is a simultaneous system change and decrease of signal power.
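
The Viterbi alignment between speech frames and a known phone sequence, as described in the abstract above, can be illustrated with a small dynamic program. The sketch makes several simplifying assumptions that are not from the thesis: features are scalars, each phone is represented by a single mean (no covariance matrices or di-phonetic classes), the distance is the squared difference, and every phone must cover at least one frame. The function name `align` is hypothetical.

```python
def align(frames, phone_means):
    """Return, for each frame, the index of the phone it is assigned to.

    The assignment is monotonic (phones appear in the given order) and
    minimizes the summed squared distance between frames and phone means.
    """
    T, K = len(frames), len(phone_means)
    if K == 0 or T < K:
        raise ValueError("need at least one frame per phone")
    inf = float("inf")
    cost = [[inf] * K for _ in range(T)]
    back = [[0] * K for _ in range(T)]
    cost[0][0] = (frames[0] - phone_means[0]) ** 2
    for t in range(1, T):
        for k in range(min(t + 1, K)):
            stay = cost[t - 1][k]                       # remain in phone k
            move = cost[t - 1][k - 1] if k > 0 else inf  # enter phone k
            if move < stay:
                best, back[t][k] = move, k - 1
            else:
                best, back[t][k] = stay, k
            cost[t][k] = best + (frames[t] - phone_means[k]) ** 2
    # Trace back from the last phone at the last frame.
    path = [K - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


if __name__ == "__main__":
    print(align([0.1, 0.0, 0.9, 1.1, 2.0], [0.0, 1.0, 2.0]))
```

Phone boundaries fall wherever the returned index changes, which is the time-aligned labeling the abstract refers to.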

Proceedings Article
01 Sep 1996
TL;DR: This paper describes follow-on work on the ITS technique where a Multi-layer Perceptron has been used to perform an internal mapping in the original ITS input space in order to provide a tighter set of clusters of the speech sequences.
Abstract: Trace-segmentation (TS) is a method for nonlinear time-normalization of a sequence of speech representation frames prior to recognition of the sequence. It has been shown in a recent work [1] that an Individual Trace-Segmentation (ITS), i.e. a separate segmentation of the trajectory described by each individual coefficient in the speech frame, leads to much improved recognition which exceeds the performance provided by DTW recognition on the same database. This paper describes follow-on work on the ITS technique where a Multi-layer Perceptron has been used to perform an internal mapping in the original ITS input space in order to provide a tighter set of clusters of the speech sequences. This novel technique is called Neural Network Trace-Segmentation (NNTS) and has produced a significant improvement on the original ITS performance.
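
Basic trace-segmentation, as introduced at the start of the abstract above, is commonly described as resampling the feature trajectory at equal steps of path length, so that stationary stretches are compressed and fast transitions are kept. The sketch below follows that common description under stated assumptions: Euclidean distance between successive frames and linear interpolation between them. It illustrates plain TS only, not the ITS or NNTS variants, and the function name is hypothetical.

```python
def trace_segment(frames, num_points):
    """Resample `frames` to `num_points` frames equally spaced along the trace."""
    if len(frames) < 2 or num_points < 2:
        raise ValueError("need at least two frames and two output points")
    # Cumulative length of the trajectory through feature space.
    cum = [0.0]
    for prev, cur in zip(frames, frames[1:]):
        step = sum((c - p) ** 2 for p, c in zip(prev, cur)) ** 0.5
        cum.append(cum[-1] + step)
    total = cum[-1]
    if total == 0.0:
        # The trajectory never moves: every output frame is the same.
        return [list(frames[0]) for _ in range(num_points)]
    out = []
    seg = 0
    for i in range(num_points):
        target = total * i / (num_points - 1)
        # Advance to the input segment that contains the target length.
        while seg < len(frames) - 2 and cum[seg + 1] < target:
            seg += 1
        span = cum[seg + 1] - cum[seg]
        w = 0.0 if span == 0.0 else (target - cum[seg]) / span
        out.append([p + w * (c - p) for p, c in zip(frames[seg], frames[seg + 1])])
    return out


if __name__ == "__main__":
    # Three repeated frames followed by a fast change.
    print(trace_segment([[0.0], [0.0], [0.0], [1.0], [2.0]], 3))
```

Because every utterance is mapped to the same number of frames, the output can be compared frame by frame or fed to a fixed-size classifier without a separate time-alignment step.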