
Showing papers on "TIMIT published in 2006"


Proceedings ArticleDOI
25 Jun 2006
TL;DR: This paper presents a novel method for training RNNs to label unsegmented sequences directly, removing the need both for pre-segmented training data and for post-processing of the network outputs.
Abstract: Many real-world sequence learning tasks require the prediction of sequences of labels from noisy, unsegmented input data. In speech recognition, for example, an acoustic signal is transcribed into words or sub-word units. Recurrent neural networks (RNNs) are powerful sequence learners that would seem well suited to such tasks. However, because they require pre-segmented training data, and post-processing to transform their outputs into label sequences, their applicability has so far been limited. This paper presents a novel method for training RNNs to label unsegmented sequences directly, thereby solving both problems. An experiment on the TIMIT speech corpus demonstrates its advantages over both a baseline HMM and a hybrid HMM-RNN.
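The method introduced here became known as connectionist temporal classification (CTC). Its core is a dynamic program over a blank-augmented label sequence that sums the probabilities of every frame-level alignment consistent with the target labels. Below is a minimal NumPy sketch of that forward pass, assuming per-frame label posteriors from an already trained network; it illustrates the published recursion and is not the authors' code.

```python
import numpy as np

def ctc_log_likelihood(log_probs, labels, blank=0):
    """Log-probability of `labels` under CTC, summing over all
    frame-level alignments. `log_probs` is (T, C) in the log domain;
    `labels` is the target sequence without blanks."""
    ext = [blank]                      # interleave blanks: -a-b-...-
    for l in labels:
        ext += [l, blank]
    T, S = log_probs.shape[0], len(ext)

    alpha = np.full((T, S), -np.inf)
    alpha[0, 0] = log_probs[0, blank]
    alpha[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            cands = [alpha[t - 1, s]]              # stay
            if s > 0:
                cands.append(alpha[t - 1, s - 1])  # advance
            # Skipping a blank is allowed unless it separates repeats.
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[t - 1, s - 2])
            alpha[t, s] = np.logaddexp.reduce(cands) + log_probs[t, ext[s]]
    # A valid path ends on the last label or the trailing blank.
    return np.logaddexp(alpha[-1, -1], alpha[-1, -2])
```

Training maximizes this quantity; its gradient with respect to the per-frame posteriors is obtained with a matching backward pass.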

5,188 citations


Journal ArticleDOI
TL;DR: Experimental results show that T2 FHMMs can effectively handle noise and dialect uncertainties in speech signals while achieving better classification performance than classical HMMs.
Abstract: This paper presents an extension of hidden Markov models (HMMs) based on the type-2 (T2) fuzzy set (FS), referred to as type-2 fuzzy HMMs (T2 FHMMs). Membership functions (MFs) of T2 FSs are three-dimensional, and this third dimension offers additional degrees of freedom with which to evaluate the fuzziness of the HMMs. T2 FHMMs are therefore able to handle both the random and the fuzzy uncertainties that exist universally in sequential data. We derive the T2 fuzzy forward-backward algorithm and Viterbi algorithm using T2 FS operations. To investigate the effectiveness of T2 FHMMs, we apply them to phoneme classification and recognition on the TIMIT speech database. Experimental results show that T2 FHMMs can effectively handle noise and dialect uncertainties in speech signals, in addition to achieving better classification performance than classical HMMs.

146 citations


Journal ArticleDOI
Li Deng, Dong Yu, Alejandro Acero
TL;DR: The new model, which contains only context-independent parameters, is shown to significantly reduce the phone error rate of a standard hidden Markov model (HMM) system under the same experimental conditions.
Abstract: A structured generative model of speech coarticulation and reduction is described with a novel two-stage implementation. At the first stage, the dynamics of formants, or vocal tract resonances (VTRs), in fluent speech are generated using prior information about resonance targets in the phone sequence, in the absence of acoustic data. Bidirectional temporal filtering with a finite impulse response (FIR) is applied to the segmental target sequence as the FIR filter's input, where forward filtering produces anticipatory coarticulation and backward filtering produces regressive coarticulation. The filtering process is also shown to result in realistic resonance-frequency undershooting or reduction for fast-rate and low-effort speech in a contextually assimilated manner. At the second stage, the dynamics of speech cepstra are predicted analytically based on the FIR-filtered and speaker-adapted VTR targets, and the prediction residuals are modeled by Gaussian random variables with trainable parameters. The combined system of these two stages thus generates correlated and causally related VTR and cepstral dynamics, where phonetic reduction is represented explicitly in the hidden resonance space and implicitly in the observed cepstral space. We present details of model simulation demonstrating quantitative effects of speaking rate and segment duration on the magnitude of reduction, agreeing closely with experimental measurements in the acoustic-phonetic literature. This two-stage model is implemented and applied to the TIMIT phonetic recognition task. Using the N-best (N=2000) rescoring paradigm, the new model, which contains only context-independent parameters, is shown to significantly reduce the phone error rate of a standard hidden Markov model (HMM) system under the same experimental conditions.
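The first stage reduces to smoothing a stepwise VTR target sequence with a non-causal FIR filter: taps reaching into the future produce anticipatory coarticulation, taps into the past produce regressive coarticulation, and short segments never reach their targets (undershoot). A schematic NumPy illustration; the exponential tap shape and the values of `gamma` and `span` are display assumptions, not the paper's trained parameters.

```python
import numpy as np

def coarticulate(targets, gamma=0.6, span=7):
    """Bidirectionally FIR-filter a per-frame VTR target sequence.
    Forward taps model anticipatory coarticulation, backward taps
    regressive coarticulation; `gamma` controls the time constant."""
    taps = gamma ** np.abs(np.arange(-span, span + 1))
    taps /= taps.sum()                       # unit DC gain
    # Each frame becomes a weighted average of past and future targets,
    # producing undershoot on short segments.
    pad = np.pad(targets, span, mode="edge")
    return np.convolve(pad, taps, mode="valid")

# Two phone segments with distinct F1 targets (Hz), 10 frames each:
targets = np.array([500.0] * 10 + [700.0] * 10)
smoothed = coarticulate(targets)
```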

37 citations


Proceedings ArticleDOI
17 Sep 2006
TL;DR: A syllable-landmark detector based on periodicity and energy maintains typical total error rates of around 30% across datasets where HMM-based systems degrade severely, and a vowel classifier built on the landmarks matches the performance of HMM-based systems.
Abstract: In this paper, we describe a method to detect syllabic nuclei in continuous speech. It employs two basic and robust acoustic features, periodicity and energy, to detect syllable landmarks. The method is evaluated on the TIMIT, noise-additive TIMIT, and NTIMIT datasets, with typical total error rates of around 30% on all of them except in extremely adverse 0 dB signal-to-noise-ratio environments, whereas HMM-based systems degrade severely. Based on the landmarks, a vowel classifier is further constructed that achieves the same performance as HMM-based systems.
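A rough NumPy/SciPy rendering of the idea: score each frame by its energy gated by an autocorrelation-based periodicity measure, then take local peaks as syllabic nuclei. The frame sizes, pitch range, thresholds, and peak picker below are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from scipy.signal import find_peaks

def syllable_nuclei(x, fs, frame=0.025, hop=0.010):
    """Detect candidate syllabic nuclei from energy and periodicity."""
    n, h = int(frame * fs), int(hop * fs)
    scores = []
    for start in range(0, len(x) - n, h):
        w = x[start:start + n]
        energy = np.sum(w ** 2)
        # Periodicity: peak of the normalized autocorrelation in a
        # plausible pitch-lag range (50-400 Hz).
        ac = np.correlate(w, w, mode="full")[n - 1:]
        lags = slice(int(fs / 400), int(fs / 50))
        periodicity = ac[lags].max() / (ac[0] + 1e-10)
        scores.append(energy * max(periodicity, 0.0))
    scores = np.asarray(scores)
    peaks, _ = find_peaks(scores, height=0.1 * scores.max(),
                          distance=int(0.1 / hop))   # >= 100 ms apart
    return peaks * h / fs   # nucleus times in seconds
```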

37 citations


Journal ArticleDOI
TL;DR: The results indicate that the hybrid use of articulatory, perceptual and prosodic features of speech, combined with a supervised dimensionality-reduction procedure, is able to outperform any individual acoustic model for speech-driven facial animation.

26 citations


Proceedings Article
01 Jan 2006
TL;DR: A new method for phoneme sequence recognition from a speech utterance, not based on HMMs, is described; it uses a discriminative kernel-based training procedure tailored to minimizing the Levenshtein distance between the predicted and correct phoneme sequences.
Abstract: We describe a new method for phoneme sequence recognition given a speech utterance, which is not based on the HMM. In contrast to HMM-based approaches, our method uses a discriminative kernel-based training procedure in which the learning process is tailored to the goal of minimizing the Levenshtein distance between the predicted phoneme sequence and the correct sequence. The phoneme sequence predictor is devised by mapping the speech utterance, along with a proposed phoneme sequence, to a vector space endowed with an inner product that is realized by a Mercer kernel. Building on large-margin techniques for predicting whole sequences, we devise a learning algorithm that reduces to separating the correct phoneme sequence from all other sequences. We describe an iterative algorithm for learning the phoneme sequence recognizer and an efficient implementation of it. We present encouraging initial experimental results on the TIMIT corpus and compare the proposed method to an HMM-based approach.
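The loss that drives training is the Levenshtein (edit) distance between the predicted and reference phoneme sequences, itself a small dynamic program. A standard implementation for reference (not the authors' code):

```python
def levenshtein(pred, ref):
    """Edit distance between two phoneme sequences (lists of symbols)."""
    m, n = len(pred), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                     # deletions
    for j in range(n + 1):
        d[0][j] = j                     # insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if pred[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete
                          d[i][j - 1] + 1,        # insert
                          d[i - 1][j - 1] + sub)  # substitute/match
    return d[m][n]

assert levenshtein("sil k ae t".split(), "sil k ah t".split()) == 1
```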

24 citations


01 Jan 2006
TL;DR: In this article, a structured generative model of speech coarticulation and reduction is described with a novel two-stage implementation, in which the dynamics of formants, or vocal tract resonances (VTRs), in fluent speech are generated using prior information about resonance targets in the phone sequence, in the absence of acoustic data.
Abstract: A structured generative model of speech coarticulation and reduction is described with a novel two-stage implementation. At the first stage, the dynamics of formants, or vocal tract resonances (VTRs), in fluent speech are generated using prior information about resonance targets in the phone sequence, in the absence of acoustic data. Bidirectional temporal filtering with a finite impulse response (FIR) is applied to the segmental target sequence as the FIR filter's input, where forward filtering produces anticipatory coarticulation and backward filtering produces regressive coarticulation. The filtering process is also shown to result in realistic resonance-frequency undershooting or reduction for fast-rate and low-effort speech in a contextually assimilated manner. At the second stage, the dynamics of speech cepstra are predicted analytically based on the FIR-filtered and speaker-adapted VTR targets, and the prediction residuals are modeled by Gaussian random variables with trainable parameters. The combined system of these two stages thus generates correlated and causally related VTR and cepstral dynamics, where phonetic reduction is represented explicitly in the hidden resonance space and implicitly in the observed cepstral space. We present details of model simulation demonstrating quantitative effects of speaking rate and segment duration on the magnitude of reduction, agreeing closely with experimental measurements in the acoustic-phonetic literature. This two-stage model is implemented and applied to the TIMIT phonetic recognition task. Using the N-best (N = 2000) rescoring paradigm, the new model, which contains only context-independent parameters, is shown to significantly reduce the phone error rate of a standard hidden Markov model (HMM) system under the same experimental conditions.

23 citations


Proceedings ArticleDOI
28 Jun 2006
TL;DR: The upper performance limits of automatic syllable segmentation algorithms using single or multiple frequency band envelopes as their primary segmentation feature are explored and it is concluded that a low total error rate requires an algorithm which can reject many candidates or which uses features other than those based on envelope alone.
Abstract: In this paper the upper performance limits of automatic syllable segmentation algorithms using single or multiple frequency band envelopes as their primary segmentation feature are explored. Each algorithm is tested against the TIMIT corpus of continuous read speech. The results show that candidate matching rates as high as 99% can be achieved by segmentation based on a simple envelope, but only at the expense of as many as 13 non-matching candidates per syllable. We conclude that a low total error rate requires an algorithm which can reject many candidates or which uses features other than those based on envelope alone to generate fewer, more accurate candidates.

21 citations


Journal ArticleDOI
Dong Yu, Li Deng, Alex Acero
TL;DR: Improved likelihood score computation in the HTM and a novel A∗-based time-asynchronous lattice-constrained decoding algorithm for HTM evaluation are described, and the new search algorithm is shown to improve recognition accuracy on recognition lattices over the traditional N-best rescoring paradigm.

20 citations


Proceedings ArticleDOI
09 Jul 2006
TL;DR: Three systems for unsupervised speaker change detection, all based on the Bayesian information criterion (BIC), are tested, including a real-time approach that employs line spectral pairs together with the BIC to validate potential speaker change points.
Abstract: This paper addresses the problem of unsupervised speaker change detection. Three systems based on the Bayesian information criterion (BIC) are tested. The first system investigates the AudioSpectrumCentroid and AudioWaveformEnvelope features, implements dynamic thresholding followed by a fusion scheme, and finally applies the BIC. The second is a real-time method that uses a metric-based approach employing line spectral pairs and the BIC to validate a potential speaker change point. The third method consists of three modules: in the first, a measure based on second-order statistics is used; in the second, the Euclidean distance and Hotelling's T2 statistic are applied; and in the third, the BIC is utilized. The experiments are carried out on a dataset created by concatenating speakers from the TIMIT database, referred to as the TIMIT data set. The performance of the three systems is compared using t-statistics.
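All three systems ultimately rest on the same test: a ΔBIC score comparing one Gaussian fitted to a whole analysis window against two Gaussians split at a candidate change point. A compact NumPy version of that standard score; the penalty weight `lam` and the covariance regularization are conventional choices, not values from the paper.

```python
import numpy as np

def delta_bic(X, t, lam=1.0):
    """Delta-BIC for a speaker change at frame t of feature matrix X (N, d).
    Positive values favor the two-speaker (change) hypothesis.
    Assumes at least a few frames on each side of t."""
    N, d = X.shape
    def logdet_cov(Z):
        cov = np.cov(Z, rowvar=False) + 1e-6 * np.eye(d)  # regularize
        return np.linalg.slogdet(cov)[1]
    gain = 0.5 * (N * logdet_cov(X)
                  - t * logdet_cov(X[:t])
                  - (N - t) * logdet_cov(X[t:]))
    # Model-complexity penalty: d means + d(d+1)/2 covariance terms.
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(N)
    return gain - penalty
```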

16 citations


Proceedings ArticleDOI
21 May 2006
TL;DR: This paper addresses unsupervised speaker change detection, a necessary step for several indexing tasks, and demonstrates that the performance of the proposed multiple pass algorithm is better than that of other approaches.
Abstract: This paper addresses unsupervised speaker change detection, a necessary step for several indexing tasks. We assume that there is no prior knowledge of either the number of speakers or their identities. Features included in the MPEG-7 audio prototype, such as the AudioWaveformEnvelope and the AudioSpectrumCentroid, are investigated. The model selection criterion is the Bayesian information criterion (BIC). A multiple-pass algorithm is proposed. It uses dynamic thresholding for scalar features and a fusion scheme to refine the segmentation results. It also models every speaker by a multivariate Gaussian probability density function, and whenever new information is available, the respective model is updated. The experiments are carried out on a dataset created by concatenating speakers from the TIMIT database, referred to as the TIMIT data set. It is demonstrated that the performance of the proposed multiple-pass algorithm is better than that of other approaches.

Proceedings Article
15 Feb 2006
TL;DR: Experiments show that the homogeneity of the speech material may improve the quality of speaker identification, and the broad phonetic groups of nasals and vowels were found to be particularly speaker-specific.
Abstract: The aim of this study is to provide a quantitative assessment of the speaker-discriminating properties of broad phonetic groups. A GMM-based approach to speaker modelling is used in conjunction with a phonetically hand-labelled speech database (TIMIT) to produce a ranking of broad phonetic groups by speaker identification score. The broad phonetic groups of nasals and vowels were found to be particularly speaker-specific. Experiments show that the homogeneity of the speech material may improve the quality of speaker identification.

Book ChapterDOI
11 Sep 2006
TL;DR: A method for speaker verification with limited amount of speech data by computing normalized correlation coefficient values between signal patterns chosen around high SNR regions (corresponding to the instants of significant excitation), without having to extract any further parameters.
Abstract: In this paper, we present a method for speaker verification with a limited amount (2 to 3 s) of speech data. Under the constraint of limited data, the use of traditional vocal tract features in conjunction with statistical models becomes difficult. An estimate of the glottal flow derivative signal, which represents the excitation source information, is used for comparing two signals. Speaker verification is performed by computing normalized correlation coefficients between signal patterns chosen around high-SNR regions (corresponding to the instants of significant excitation), without extracting any further parameters. The high-SNR regions are detected by locating peaks in the Hilbert envelope of the LP residual signal. Speaker verification studies are conducted on clean microphone speech (TIMIT) as well as noisy telephone speech (NTIMIT) to illustrate the effectiveness of the proposed method.
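A schematic SciPy/librosa rendering of the anchor-point step: inverse-filter the signal with LPC to obtain the residual, take the Hilbert envelope of the residual, and pick its peaks as high-SNR instants. The LPC order and peak-picking settings below are illustrative assumptions.

```python
import numpy as np
import librosa
from scipy.signal import hilbert, lfilter, find_peaks

def excitation_instants(y, fs, order=12):
    """High-SNR anchor instants from the Hilbert envelope of the LP residual."""
    a = librosa.lpc(y, order=order)        # inverse (whitening) filter
    residual = lfilter(a, [1.0], y)        # LP residual
    env = np.abs(hilbert(residual))        # Hilbert envelope
    # Peaks at least ~2 ms apart, roughly one per glottal cycle.
    peaks, _ = find_peaks(env, distance=int(0.002 * fs),
                          height=0.3 * env.max())
    return peaks / fs                      # instants in seconds
```

Normalized correlation between residual patterns extracted around matching instants then serves directly as the verification score.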

Book ChapterDOI
13 Dec 2006
TL;DR: A novel framework for HMM-based automatic phonetic segmentation that improves the accuracy of placing phone boundaries according to the minimum boundary error (MBE) criterion, inspired by the recently proposed minimum phone error training approach and the minimum Bayes risk decoding algorithm for automatic speech recognition.
Abstract: This paper presents a novel framework for HMM-based automatic phonetic segmentation that improves the accuracy of placing phone boundaries. In the framework, both training and segmentation approaches are proposed according to the minimum boundary error (MBE) criterion, which tries to minimize the expected boundary errors over a set of possible phonetic alignments. This framework is inspired by the recently proposed minimum phone error (MPE) training approach and the minimum Bayes risk decoding algorithm for automatic speech recognition. To evaluate the proposed MBE framework, we conduct automatic phonetic segmentation experiments on the TIMIT acoustic-phonetic continuous speech corpus. MBE segmentation with MBE-trained models can identify 80.53% of human-labeled phone boundaries within a tolerance of 10 ms, compared to 71.10% identified by conventional ML segmentation with ML-trained models. Moreover, by using the MBE framework, only 7.15% of automatically labeled phone boundaries have errors larger than 20 ms.
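The reported figures are simply the fraction of human-labeled boundaries that have an automatic boundary within a tolerance. A small helper reproducing that measurement; the greedy one-to-one matching is an assumption about the evaluation protocol.

```python
def boundary_accuracy(pred, ref, tol=0.010):
    """Fraction of reference boundaries (seconds) that have a predicted
    boundary within `tol`; each prediction may match one reference."""
    used = set()
    hits = 0
    for r in ref:
        best = None
        for i, p in enumerate(pred):
            if i not in used and abs(p - r) <= tol:
                if best is None or abs(p - r) < abs(pred[best] - r):
                    best = i
        if best is not None:
            used.add(best)
            hits += 1
    return hits / len(ref)

# e.g. boundary_accuracy(auto, human, tol=0.010) -> 0.8053 for MBE models
```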

Journal ArticleDOI
TL;DR: This work compares a variety of novel sub-vector clustering procedures for ASR system parameter quantization, most of which are based on entropy minimization, and others on recognition accuracy maximization on a development set.

Proceedings ArticleDOI
01 Sep 2006
TL;DR: Results are presented for two speech processing tasks for BP: phone classification and grapheme to phoneme (G2P) conversion.
Abstract: Speech processing is a data-driven technology that relies on public corpora and associated resources. In contrast to languages such as English, there are few resources for Brazilian Portuguese (BP). Consequently, there are no publicly available scripts to design baseline BP systems. This work discusses some efforts towards decreasing this gap and presents results for two speech processing tasks for BP: phone classification and grapheme to phoneme (G2P) conversion. The former task used hidden Markov models to classify phones from the Spoltech and TIMIT corpora. The G2P module adopted machine learning methods such as decision trees and was tested on a new BP pronunciation dictionary and the following languages: British English, American English and French.

Journal ArticleDOI
TL;DR: Novel techniques are proposed to enhance time-domain adaptive decorrelation filtering for separation and recognition of cochannel speech in reverberant room conditions and significantly improved ADF convergence rate, target-to-interference ratio, and accuracy of phone recognition.
Abstract: Novel techniques are proposed to enhance time-domain adaptive decorrelation filtering (ADF) for separation and recognition of cochannel speech in reverberant room conditions. The enhancement techniques include whitening filtering of the cochannel speech to improve the conditioning of the adaptive estimation, a block-iterative formulation of ADF to speed up convergence, and integration of multiple ADF outputs through post-filtering to reduce reverberation noise. Experimental data were generated by convolving TIMIT speech with acoustic-path impulse responses measured in a real room environment, with an approximately 2 m microphone-source distance and an initial target-to-interference ratio of about 0 dB. The proposed techniques significantly improved the ADF convergence rate, the target-to-interference ratio, and the accuracy of phone recognition.

31 Mar 2006
TL;DR: This paper addresses the problem of unsupervised speaker change detection by using the Bayesian Information Criterion and a metric-based approach employing line spectral pairs (LSP) and the BIC criterion to validate a potential speaker change point.
Abstract: This paper addresses the problem of unsupervised speaker change detection. We assume that there is no prior knowledge of the number of speakers or their identities. Two methods are tested. The first method uses the Bayesian Information Criterion (BIC), investigates the AudioSpectrumCentroid and AudioWaveformEnvelope features, and implements a dynamic thresholding followed by a fusion scheme. The second method is a real-time one that uses a metric-based approach employing line spectral pairs (LSP) and the BIC criterion to validate a potential speaker change point. The methods are tested on two different datasets. The first set was created by concatenating speakers from the TIMIT database and is referred to as the TIMIT data set. The second set was created by using recordings from the MPEG-7 test set CD1 and broadcast news and is referred to as the INESC dataset.

Dissertation
01 Jan 2006
TL;DR: This work qualitatively and quantitatively validates an elaborate computational model of the primary auditory cortex in the central auditory system, and develops new feature selection and pattern recognition methods to exploit the category-dependent encoding of noise-robust cognitive information in the cortical response.
Abstract: It is well known that machines perform far worse than humans in recognizing speech and audio, especially in noisy environments. One method of addressing this issue of robustness is to study physiological models of the human auditory system and to adopt some of their characteristics in computers. As a first step in studying the potential benefits of an elaborate computational model of the primary auditory cortex (A1) in the central auditory system, we qualitatively and quantitatively validate the model under existing speech processing and recognition methodology. Next, we develop new insights and ideas on how to interpret the model, and reveal some of the advantages of its dimension expansion that may potentially be used to improve existing speech processing and recognition methods. This is done by statistically analyzing the neural responses to various classes of speech signals and forming empirical conjectures on how cognitive information is encoded in a category-dependent manner. We also establish a theoretical framework that shows how noise and signal can be separated in the dimension-expanded cortical space. Finally, we develop new feature selection and pattern recognition methods to exploit the category-dependent encoding of noise-robust cognitive information in the cortical response. Category-dependent features are proposed as features that "specialize" in discriminating specific sets of classes, and as a natural way of incorporating them into a Bayesian decision framework, we propose methods to construct hierarchical classifiers that perform decisions in a two-stage process. Phoneme classification tasks using the TIMIT speech database are performed to quantitatively validate all developments in this work, and the results encourage future work in exploiting high-dimensional data with category- (or class-) dependent features for improved classification or detection.

Journal ArticleDOI
TL;DR: The use of syllables as the acoustic unit for spoken name recognition based on reverse lookup schemes is proposed, and it is shown how syllables can be used to improve recognition performance and reduce system perplexity.

Proceedings ArticleDOI
01 Jan 2006
TL;DR: In this article, a stereo capture system is used to reconstruct 3D models of a speaker producing sentences from the TIMIT corpus, and the data are mapped into a space that maintains the relationships between samples and their temporal derivatives.
Abstract: In this paper we describe a parameterisation of lip movements which maintains the dynamic structure inherent in the task of producing speech sounds. A stereo capture system is used to reconstruct 3D models of a speaker producing sentences from the TIMIT corpus. This data is mapped into a space which maintains the relationships between samples and their temporal derivatives. By incorporating dynamic information within the parameterisation of lip movements we can model the cyclical structure, as well as the causal nature of speech movements as described by an underlying visual speech manifold. It is believed that such a structure will be appropriate to various areas of speech modeling, in particular the synthesis of speech lip movements.

Proceedings ArticleDOI
14 May 2006
TL;DR: An adaptation technique for ANNs is presented that, similarly to the framework of MAP estimation, exploits prior information in the adaptation process, which is particularly useful for dealing with sparse training data.
Abstract: Many techniques for speaker or channel adaptation have been successfully applied to automatic speech recognition. Most of these techniques have been proposed for the adaptation of Hidden Markov Models (HMMs); far fewer proposals have been made for the adaptation of the Artificial Neural Networks (ANNs) used in the hybrid HMM-ANN approach. This paper presents an adaptation technique for ANNs that, similarly to the framework of MAP estimation, exploits prior information in the adaptation process, which is particularly useful for dealing with the problem of sparse training data. We show that the integration of a priori information can be achieved simply by linear interpolation of the weights of an "a priori" network and of a speaker-specific network. Good improvements over the baseline results are reported when evaluating this technique on the Wall Street Journal WSJ0 and WSJ1 databases and on the TIMIT corpus using different amounts of adaptation data.
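The integration step itself is a one-liner: each adapted parameter is a convex combination of the corresponding weights in the "a priori" network and in the speaker-specific network. A sketch; the interpolation weight `lam` and the list-of-arrays parameter layout are assumptions.

```python
import numpy as np

def interpolate_weights(prior, speaker, lam=0.7):
    """MAP-like ANN adaptation: convex combination of an 'a priori'
    network and a speaker-specific network, layer by layer.
    `prior` and `speaker` are lists of weight arrays of equal shapes."""
    return [lam * wp + (1.0 - lam) * ws
            for wp, ws in zip(prior, speaker)]

# With little adaptation data, a larger `lam` keeps the adapted
# network close to the speaker-independent prior.
```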

Proceedings Article
01 Jan 2006
TL;DR: It is shown that a considerable improvement in recognition performance can be achieved if the baseforms are selected properly; preliminary experiments carried out on the TIMIT speech corpus show a considerable improvement over pure monophone/triphone-based systems when the larger-sized units are combined using proper selection of baseforms.
Abstract: A longer-sized sub-word unit is known to be a better candidate in the development of a continuous speech recognition system. However, the basic problem with such units is data sparsity. To overcome this problem, researchers have tried to combine longer-sized sub-word unit models with phoneme models. In this paper, we have considered only frequently occurring syllables and VC (Vowel + Consonant) units, together with phone-sized units (monophones and triphones), for the development of a continuous speech recognition system. In such a case, even for a single pronunciation of a word, there can be multiple representational baseforms in the lexicon, each with different-sized units. We show that a considerable improvement in recognition performance can be achieved if the baseforms are selected properly. Out of all possible baseforms for a given word in the lexicon, only the baseform that maximizes the acoustic likelihood, over the possible sub-word unit concatenations making up the word, is considered. Since, in the baseline systems' word-lexicon (as in pure monophone or triphone-based systems), only the acoustically weaker baseforms are replaced by baseforms with longer-sized units, the resultant performance is guaranteed to be better than that of the baseline systems. The preliminary experiments carried out on the TIMIT speech corpus show a considerable improvement in recognition performance over pure monophone/triphone-based systems when the larger-sized units are combined using proper selection of baseforms.

Proceedings ArticleDOI
17 Sep 2006
TL;DR: Results on the standard TIMIT phone recognition task show this CRF evidence model, even with a relatively simple first-order feature set, is competitive with standard HMMs and DBN variants using static Gaussian mixture models on MFCC features.
Abstract: This paper describes an implementation of a discriminative acoustical model – a Conditional Random Field (CRF) – within a Dynamic Bayes Net (DBN) formulation of a Hierarchic Hidden Markov Model (HHMM) phone recognizer. This CRF-DBN topology accounts for phone transition dynamics in conditional probability distributions over random variables associated with observed evidence, and therefore has less need for hidden variable states corresponding to transitions between phones, leaving more hypothesis space available for modeling higher-level linguistic phenomena such as syntax and semantics. The model also has the interesting property that it explicitly represents likely formant trajectories and formant targets of modeled phones in its random variable distributions, making it more linguistically transparent than models based on traditional HMMs with conditionally independent evidence variables. Results on the standard TIMIT phone recognition task show this CRF evidence model, even with a relatively simple first-order feature set, is competitive with standard HMMs and DBN variants using static Gaussian mixture models on MFCC features.

Book ChapterDOI
13 Dec 2006
TL;DR: The hidden spectral peak trajectory model (HSPTM) is generalized such that both the static and derivative features are derived from a single set of hidden pole trajectories using the well-known relationship between the spectral poles and cepstral coefficients.
Abstract: Most speech models represent the static and derivative cepstral features with separate models that can be inconsistent with each other. In our previous work, we proposed the hidden spectral peak trajectory model (HSPTM), in which the static cepstral trajectories are derived from a set of hidden trajectories of the spectral peaks (captured as spectral poles) in the time-frequency domain. In this work, the HSPTM is generalized such that both the static and derivative features are derived from a single set of hidden pole trajectories, using the well-known relationship between spectral poles and cepstral coefficients. As the pole trajectories represent the resonance frequencies across time, they can be interpreted as formant tracks in voiced speech, which have been shown to contain important cues for phonemic identification. To preserve the common recognition framework, the likelihood functions are still defined in the cepstral domain, with the acoustic models defined by the static and derivative cepstral trajectories. However, these trajectories are no longer separately estimated but jointly derived, and thus are guaranteed to be consistent with each other. Vowel classification experiments were performed on the TIMIT corpus using low-complexity (2-mixture) models and showed a 3% (absolute) reduction in classification error compared to a standard HMM of the same complexity.
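The pole-to-cepstrum relationship in question is classical: for a minimum-phase all-pole spectrum with conjugate pole pairs of magnitude r_k and angle θ_k, the cepstral coefficients are c_n = (2/n) Σ_k r_k^n cos(n θ_k). A direct NumPy transcription mapping formant frequency/bandwidth tracks to cepstral trajectories; the frame layout is an assumption.

```python
import numpy as np

def poles_to_cepstra(freqs, bws, fs, n_ceps=13):
    """Cepstral trajectory from spectral-pole (formant) tracks.
    freqs, bws: (T, K) formant frequencies and bandwidths in Hz."""
    r = np.exp(-np.pi * bws / fs)          # pole magnitudes, (T, K)
    theta = 2.0 * np.pi * freqs / fs       # pole angles, (T, K)
    T = freqs.shape[0]
    c = np.zeros((T, n_ceps))
    for n in range(1, n_ceps):
        # c_n = (2/n) * sum_k r_k^n cos(n theta_k), per frame
        c[:, n] = (2.0 / n) * np.sum(r ** n * np.cos(n * theta), axis=1)
    return c   # c[:, 0] (the gain term) is left at zero here
```

Delta features then follow by differencing these same trajectories, which is why the static and derivative streams are consistent by construction.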

Proceedings Article
19 Sep 2006
TL;DR: A novel time-synchronous decoder, designed specifically for a Hidden Trajectory Model (HTM) whose likelihood score computation depends on long-span phonetic contexts, is presented.
Abstract: A novel time-synchronous decoder, designed specifically for a Hidden Trajectory Model (HTM) whose likelihood score computation depends on long-span phonetic contexts, is presented. The HTM is a recently developed acoustic model aimed at capturing the underlying dynamic structure of speech coarticulation and reduction using a compact set of parameters. The long-span nature of the HTM had posed a great technical challenge for developing efficient search algorithms for full evaluation of the model. Taking on the challenge, the decoding algorithm is developed to deal effectively with the exponentially increased search space using HTM-specific techniques for hypothesis representation, word-ending recombination, and hypothesis pruning. Experimental results obtained on the TIMIT phonetic recognition task are reported, extending our earlier HTM evaluation paradigms based on N-best and A* lattice rescoring. Index Terms: Hidden Trajectory Model, time-synchronous decoding, trace-based hypothesis, TIMIT

Journal ArticleDOI
TL;DR: It is proposed that the GEM approach be applied to the adaptation of hidden Markov models that use non-diagonal covariances, and the necessary update equations are provided.

Proceedings ArticleDOI
01 Sep 2006
TL;DR: The new Gaussian Elimination Algorithm (GEA) is presented for determining the most suitable HMM complexity in continuous speech recognition systems; it is evaluated on a small-vocabulary continuous speech database as well as on the TIMIT corpus.
Abstract: Nowadays, HMM-based speech recognition systems are used in many real-time processing applications, from cell phones to automobile automation. In this context, one important aspect to be considered is the HMM model size, which directly determines the computational load. So, in order to make the system practical, it is worthwhile to optimize the HMM model size subject to a minimum acceptable recognition performance. Furthermore, topology optimization is also important for reliable parameter estimation. This work presents the new Gaussian Elimination Algorithm (GEA) for determining the most suitable HMM complexity in continuous speech recognition systems. The proposed method is evaluated on a small-vocabulary continuous speech (SVCS) database as well as on the TIMIT corpus.

Proceedings ArticleDOI
09 Jul 2006
TL;DR: A model of the PSTN channel is developed in order to train HMMs on the TIMIT database passed through the channel model, with NTIMIT used for testing.
Abstract: By reducing the signal bandwidth, the telephone channel causes a drop in the performance of most recognition systems, whether for speaker identification or continuous speech recognition. Many compensation techniques have been developed to reduce the mismatch between the training and test databases, which is considered the main cause of the degradation. These techniques fall into two categories: (1) feature compensation, in which the representation of the acoustic vectors is adjusted, and (2) model adaptation, in which the HMM parameters are modified to come closer to the testing environment. The method developed here belongs to the second category. The main purpose of this paper is to develop a model of the PSTN channel in order to train HMMs on the TIMIT database passed through the PSTN channel model. NTIMIT is used for testing. Comparisons are made with feature compensation techniques combined with the proposed approach.
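A crude stand-in for such a channel model is to band-limit the wideband TIMIT audio to the nominal 300-3400 Hz telephone band and downsample it to 8 kHz before training. The sketch below is a generic simulation of that band limitation, not the paper's channel model.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, resample_poly

def telephone_channel(y, fs=16000):
    """Approximate PSTN band limitation: 300-3400 Hz bandpass, 8 kHz rate."""
    sos = butter(4, [300, 3400], btype="bandpass", fs=fs, output="sos")
    y_bp = sosfiltfilt(sos, y)          # zero-phase band-limiting
    return resample_poly(y_bp, 1, 2)    # 16 kHz -> 8 kHz
```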

Proceedings ArticleDOI
20 Aug 2006
TL;DR: Experimental results show the effectiveness of the GMM-based method, and its performance on the TIMIT corpus indicates potential applications in speech recognition, synthesis, and coding.
Abstract: Implicit speech segmentation essentially looks for time instants where the spectral distortion is large. The spectral variation function (SVF) is a widely used measure of spectral distortion; however, it is a data-dependent measure. In order to make the measurement data-independent, a likelihood ratio is constructed to measure the spectral distortion. This ratio can be computed efficiently with a Bayesian predictive model, whose prior is estimated from unlabeled data via an unsupervised machine learning technique, the Gaussian mixture model (GMM). The experimental results show the effectiveness of this novel method, and the performance on the TIMIT corpus indicates potential applications in speech recognition, synthesis, and coding.
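For reference, the data-dependent baseline being improved on, the spectral variation function, can be written as a frame-to-frame spectral distance whose local peaks are taken as segment boundaries. A generic NumPy version using cosine distance between averaged spectral frames; the paper's contribution replaces this score with a GMM-based Bayesian likelihood ratio, and the threshold and context length here are placeholders.

```python
import numpy as np

def svf_boundaries(feats, threshold=0.2, context=2):
    """Implicit segmentation via a spectral variation function.
    feats: (T, d) spectral frames (e.g., MFCCs). Returns frame indices
    where the local spectral distortion peaks above `threshold`."""
    T = feats.shape[0]
    svf = np.zeros(T)
    for t in range(context, T - context):
        left = feats[t - context:t].mean(axis=0)
        right = feats[t:t + context].mean(axis=0)
        cos = np.dot(left, right) / (
            np.linalg.norm(left) * np.linalg.norm(right) + 1e-10)
        svf[t] = 1.0 - cos                 # distortion, not similarity
    return [t for t in range(1, T - 1)
            if svf[t] > threshold
            and svf[t] >= svf[t - 1] and svf[t] >= svf[t + 1]]
```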