
Showing papers on "TIMIT published in 2006"


Proceedings ArticleDOI
25 Jun 2006
TL;DR: This paper presents a novel method for training RNNs to label unsegmented sequences directly, removing the need both for pre-segmented training data and for post-processing of the network outputs.
Abstract: Many real-world sequence learning tasks require the prediction of sequences of labels from noisy, unsegmented input data. In speech recognition, for example, an acoustic signal is transcribed into words or sub-word units. Recurrent neural networks (RNNs) are powerful sequence learners that would seem well suited to such tasks. However, because they require pre-segmented training data, and post-processing to transform their outputs into label sequences, their applicability has so far been limited. This paper presents a novel method for training RNNs to label unsegmented sequences directly, thereby solving both problems. An experiment on the TIMIT speech corpus demonstrates its advantages over both a baseline HMM and a hybrid HMM-RNN.
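The method introduced here became known as connectionist temporal classification (CTC). Its core is a dynamic program over a blank-augmented label sequence that sums the probabilities of every frame-level alignment consistent with the target labels. Below is a minimal NumPy sketch of that forward pass, assuming per-frame label posteriors from an already trained network; it illustrates the published recursion and is not the authors' code.

```python
import numpy as np

def ctc_log_likelihood(log_probs, labels, blank=0):
    """Log-probability of `labels` under CTC, summing over all
    frame-level alignments. `log_probs` is (T, C) in the log domain;
    `labels` is the target sequence without blanks."""
    ext = [blank]                      # interleave blanks: -a-b-...-
    for l in labels:
        ext += [l, blank]
    T, S = log_probs.shape[0], len(ext)

    alpha = np.full((T, S), -np.inf)
    alpha[0, 0] = log_probs[0, blank]
    alpha[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            cands = [alpha[t - 1, s]]              # stay
            if s > 0:
                cands.append(alpha[t - 1, s - 1])  # advance
            # Skipping a blank is allowed unless it separates repeats.
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[t - 1, s - 2])
            alpha[t, s] = np.logaddexp.reduce(cands) + log_probs[t, ext[s]]
    # A valid path ends on the last label or the trailing blank.
    return np.logaddexp(alpha[-1, -1], alpha[-1, -2])
```

Training maximizes this quantity; its gradient with respect to the per-frame posteriors is obtained with a matching backward pass.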

5,188 citations


Journal ArticleDOI
TL;DR: Experimental results show that T2 FHMMs can effectively handle noise and dialect uncertainties in speech signals while achieving better classification performance than classical HMMs.
Abstract: This paper presents an extension of hidden Markov models (HMMs) based on the type-2 (T2) fuzzy set (FS), referred to as type-2 fuzzy HMMs (T2 FHMMs). Membership functions (MFs) of T2 FSs are three-dimensional, and this third dimension offers additional degrees of freedom with which to evaluate the fuzziness of the HMMs. T2 FHMMs are therefore able to handle both the random and the fuzzy uncertainties that exist universally in sequential data. We derive the T2 fuzzy forward-backward algorithm and Viterbi algorithm using T2 FS operations. To investigate the effectiveness of T2 FHMMs, we apply them to phoneme classification and recognition on the TIMIT speech database. Experimental results show that T2 FHMMs can effectively handle noise and dialect uncertainties in speech signals, in addition to achieving better classification performance than classical HMMs.

146 citations


Journal ArticleDOI
Li Deng, Dong Yu, Alejandro Acero
TL;DR: The new model, which contains only context-independent parameters, is shown to significantly reduce the phone error rate of a standard hidden Markov model (HMM) system under the same experimental conditions.
Abstract: A structured generative model of speech coarticulation and reduction is described with a novel two-stage implementation. At the first stage, the dynamics of formants, or vocal tract resonances (VTRs), in fluent speech are generated using prior information about resonance targets in the phone sequence, in the absence of acoustic data. Bidirectional temporal filtering with a finite impulse response (FIR) is applied to the segmental target sequence as the FIR filter's input, where forward filtering produces anticipatory coarticulation and backward filtering produces regressive coarticulation. The filtering process is also shown to result in realistic resonance-frequency undershooting or reduction for fast-rate and low-effort speech in a contextually assimilated manner. At the second stage, the dynamics of speech cepstra are predicted analytically based on the FIR-filtered and speaker-adapted VTR targets, and the prediction residuals are modeled by Gaussian random variables with trainable parameters. The combined system of these two stages thus generates correlated and causally related VTR and cepstral dynamics, where phonetic reduction is represented explicitly in the hidden resonance space and implicitly in the observed cepstral space. We present details of model simulation demonstrating quantitative effects of speaking rate and segment duration on the magnitude of reduction, agreeing closely with experimental measurements in the acoustic-phonetic literature. This two-stage model is implemented and applied to the TIMIT phonetic recognition task. Using the N-best (N=2000) rescoring paradigm, the new model, which contains only context-independent parameters, is shown to significantly reduce the phone error rate of a standard hidden Markov model (HMM) system under the same experimental conditions.
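The first stage reduces to smoothing a stepwise VTR target sequence with a non-causal FIR filter: taps reaching into the future produce anticipatory coarticulation, taps into the past produce regressive coarticulation, and short segments never reach their targets (undershoot). A schematic NumPy illustration; the exponential tap shape and the values of `gamma` and `span` are display assumptions, not the paper's trained parameters.

```python
import numpy as np

def coarticulate(targets, gamma=0.6, span=7):
    """Bidirectionally FIR-filter a per-frame VTR target sequence.
    Forward taps model anticipatory coarticulation, backward taps
    regressive coarticulation; `gamma` controls the time constant."""
    taps = gamma ** np.abs(np.arange(-span, span + 1))
    taps /= taps.sum()                       # unit DC gain
    # Each frame becomes a weighted average of past and future targets,
    # producing undershoot on short segments.
    pad = np.pad(targets, span, mode="edge")
    return np.convolve(pad, taps, mode="valid")

# Two phone segments with distinct F1 targets (Hz), 10 frames each:
targets = np.array([500.0] * 10 + [700.0] * 10)
smoothed = coarticulate(targets)
```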

37 citations


Proceedings ArticleDOI
17 Sep 2006
TL;DR: A syllable-landmark detector based on periodicity and energy maintains typical total error rates of around 30% across datasets where HMM-based systems degrade severely, and a vowel classifier built on the landmarks matches the performance of HMM-based systems.
Abstract: In this paper, we describe a method to detect syllabic nuclei in continuous speech. It employs two basic and robust acoustic features, periodicity and energy, to detect syllable landmarks. The method is evaluated on the TIMIT, noise-additive TIMIT, and NTIMIT datasets, with typical total error rates of around 30% on all of them except in extremely adverse 0 dB signal-to-noise-ratio environments, whereas HMM-based systems degrade severely. Based on the landmarks, a vowel classifier is further constructed that achieves the same performance as HMM-based systems.
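A rough NumPy/SciPy rendering of the idea: score each frame by its energy gated by an autocorrelation-based periodicity measure, then take local peaks as syllabic nuclei. The frame sizes, pitch range, thresholds, and peak picker below are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from scipy.signal import find_peaks

def syllable_nuclei(x, fs, frame=0.025, hop=0.010):
    """Detect candidate syllabic nuclei from energy and periodicity."""
    n, h = int(frame * fs), int(hop * fs)
    scores = []
    for start in range(0, len(x) - n, h):
        w = x[start:start + n]
        energy = np.sum(w ** 2)
        # Periodicity: peak of the normalized autocorrelation in a
        # plausible pitch-lag range (50-400 Hz).
        ac = np.correlate(w, w, mode="full")[n - 1:]
        lags = slice(int(fs / 400), int(fs / 50))
        periodicity = ac[lags].max() / (ac[0] + 1e-10)
        scores.append(energy * max(periodicity, 0.0))
    scores = np.asarray(scores)
    peaks, _ = find_peaks(scores, height=0.1 * scores.max(),
                          distance=int(0.1 / hop))   # >= 100 ms apart
    return peaks * h / fs   # nucleus times in seconds
```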

37 citations


Journal ArticleDOI
TL;DR: The results indicate that the hybrid use of articulatory, perceptual and prosodic features of speech, combined with a supervised dimensionality-reduction procedure, is able to outperform any individual acoustic model for speech-driven facial animation.

26 citations


Proceedings Article
01 Jan 2006
TL;DR: A new method for phoneme sequence recognition from a speech utterance, not based on HMMs, is described; it uses a discriminative kernel-based training procedure tailored to minimizing the Levenshtein distance between the predicted and correct phoneme sequences.
Abstract: We describe a new method for phoneme sequence recognition given a speech utterance, which is not based on the HMM. In contrast to HMM-based approaches, our method uses a discriminative kernel-based training procedure in which the learning process is tailored to the goal of minimizing the Levenshtein distance between the predicted phoneme sequence and the correct sequence. The phoneme sequence predictor is devised by mapping the speech utterance, along with a proposed phoneme sequence, to a vector space endowed with an inner product that is realized by a Mercer kernel. Building on large-margin techniques for predicting whole sequences, we devise a learning algorithm that reduces to separating the correct phoneme sequence from all other sequences. We describe an iterative algorithm for learning the phoneme sequence recognizer and an efficient implementation of it. We present encouraging initial experimental results on the TIMIT corpus and compare the proposed method to an HMM-based approach.
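The loss that drives training is the Levenshtein (edit) distance between the predicted and reference phoneme sequences, itself a small dynamic program. A standard implementation for reference (not the authors' code):

```python
def levenshtein(pred, ref):
    """Edit distance between two phoneme sequences (lists of symbols)."""
    m, n = len(pred), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                     # deletions
    for j in range(n + 1):
        d[0][j] = j                     # insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if pred[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete
                          d[i][j - 1] + 1,        # insert
                          d[i - 1][j - 1] + sub)  # substitute/match
    return d[m][n]

assert levenshtein("sil k ae t".split(), "sil k ah t".split()) == 1
```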

24 citations


01 Jan 2006
TL;DR: In this article, a structured generative model of speech coarticulation and reduction is described with a novel two-stage implementation, in which the dynamics of formants, or vocal tract resonances (VTRs), in fluent speech are generated using prior information about resonance targets in the phone sequence, in the absence of acoustic data.
Abstract: A structured generative model of speech coarticulation and reduction is described with a novel two-stage implementation. At the first stage, the dynamics of formants, or vocal tract resonances (VTRs), in fluent speech are generated using prior information about resonance targets in the phone sequence, in the absence of acoustic data. Bidirectional temporal filtering with a finite impulse response (FIR) is applied to the segmental target sequence as the FIR filter's input, where forward filtering produces anticipatory coarticulation and backward filtering produces regressive coarticulation. The filtering process is also shown to result in realistic resonance-frequency undershooting or reduction for fast-rate and low-effort speech in a contextually assimilated manner. At the second stage, the dynamics of speech cepstra are predicted analytically based on the FIR-filtered and speaker-adapted VTR targets, and the prediction residuals are modeled by Gaussian random variables with trainable parameters. The combined system of these two stages thus generates correlated and causally related VTR and cepstral dynamics, where phonetic reduction is represented explicitly in the hidden resonance space and implicitly in the observed cepstral space. We present details of model simulation demonstrating quantitative effects of speaking rate and segment duration on the magnitude of reduction, agreeing closely with experimental measurements in the acoustic-phonetic literature. This two-stage model is implemented and applied to the TIMIT phonetic recognition task. Using the N-best (N = 2000) rescoring paradigm, the new model, which contains only context-independent parameters, is shown to significantly reduce the phone error rate of a standard hidden Markov model (HMM) system under the same experimental conditions.

23 citations


Proceedings ArticleDOI
28 Jun 2006
TL;DR: The upper performance limits of automatic syllable segmentation algorithms using single or multiple frequency band envelopes as their primary segmentation feature are explored and it is concluded that a low total error rate requires an algorithm which can reject many candidates or which uses features other than those based on envelope alone.
Abstract: In this paper the upper performance limits of automatic syllable segmentation algorithms using single or multiple frequency band envelopes as their primary segmentation feature are explored. Each algorithm is tested against the TIMIT corpus of continuous read speech. The results show that candidate matching rates as high as 99% can be achieved by segmentation based on a simple envelope, but only at the expense of as many as 13 non-matching candidates per syllable. We conclude that a low total error rate requires an algorithm which can reject many candidates or which uses features other than those based on envelope alone to generate fewer, more accurate candidates.

21 citations


Journal ArticleDOI
Dong Yu, Li Deng, Alex Acero
TL;DR: Improved likelihood score computation in the HTM and a novel A∗-based time-asynchronous lattice-constrained decoding algorithm for HTM evaluation are described, and the new search algorithm is shown to improve recognition accuracy on recognition lattices over the traditional N-best rescoring paradigm.

20 citations


Proceedings ArticleDOI
09 Jul 2006
TL;DR: Three systems for unsupervised speaker change detection, all based on the Bayesian information criterion (BIC), are tested, including a real-time approach that employs line spectral pairs together with the BIC to validate potential speaker change points.
Abstract: This paper addresses the problem of unsupervised speaker change detection. Three systems based on the Bayesian information criterion (BIC) are tested. The first system investigates the AudioSpectrumCentroid and AudioWaveformEnvelope features, implements dynamic thresholding followed by a fusion scheme, and finally applies the BIC. The second is a real-time method that uses a metric-based approach employing line spectral pairs and the BIC to validate a potential speaker change point. The third method consists of three modules: in the first, a measure based on second-order statistics is used; in the second, the Euclidean distance and Hotelling's T2 statistic are applied; and in the third, the BIC is utilized. The experiments are carried out on a dataset created by concatenating speakers from the TIMIT database, referred to as the TIMIT data set. The performance of the three systems is compared using t-statistics.
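All three systems ultimately rest on the same test: a ΔBIC score comparing one Gaussian fitted to a whole analysis window against two Gaussians split at a candidate change point. A compact NumPy version of that standard score; the penalty weight `lam` and the covariance regularization are conventional choices, not values from the paper.

```python
import numpy as np

def delta_bic(X, t, lam=1.0):
    """Delta-BIC for a speaker change at frame t of feature matrix X (N, d).
    Positive values favor the two-speaker (change) hypothesis.
    Assumes at least a few frames on each side of t."""
    N, d = X.shape
    def logdet_cov(Z):
        cov = np.cov(Z, rowvar=False) + 1e-6 * np.eye(d)  # regularize
        return np.linalg.slogdet(cov)[1]
    gain = 0.5 * (N * logdet_cov(X)
                  - t * logdet_cov(X[:t])
                  - (N - t) * logdet_cov(X[t:]))
    # Model-complexity penalty: d means + d(d+1)/2 covariance terms.
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(N)
    return gain - penalty
```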

16 citations


Proceedings ArticleDOI
21 May 2006
TL;DR: This paper addresses unsupervised speaker change detection, a necessary step for several indexing tasks, and demonstrates that the performance of the proposed multiple pass algorithm is better than that of other approaches.
Abstract: This paper addresses unsupervised speaker change detection, a necessary step for several indexing tasks. We assume that there is no prior knowledge of either the number of speakers or their identities. Features included in the MPEG-7 audio prototype, such as the AudioWaveformEnvelope and the AudioSpectrumCentroid, are investigated. The model selection criterion is the Bayesian information criterion (BIC). A multiple-pass algorithm is proposed. It uses dynamic thresholding for scalar features and a fusion scheme to refine the segmentation results. It also models every speaker by a multivariate Gaussian probability density function, and whenever new information is available, the respective model is updated. The experiments are carried out on a dataset created by concatenating speakers from the TIMIT database, referred to as the TIMIT data set. It is demonstrated that the performance of the proposed multiple-pass algorithm is better than that of other approaches.

Proceedings Article
15 Feb 2006
TL;DR: Experiments show that the homogeneity of the speech material may improve the quality of speaker identification, and the broad phonetic groups of nasals and vowels were found to be particularly speaker-specific.
Abstract: The aim of this study is to provide a quantitative assessment of the speaker-discriminating properties of broad phonetic groups. A GMM-based approach to speaker modelling is used in conjunction with a phonetically hand-labelled speech database (TIMIT) to produce a ranking of broad phonetic groups by speaker identification score. The broad phonetic groups of nasals and vowels were found to be particularly speaker-specific. Experiments show that the homogeneity of the speech material may improve the quality of speaker identification.

Book ChapterDOI
11 Sep 2006
TL;DR: A method for speaker verification with limited amount of speech data by computing normalized correlation coefficient values between signal patterns chosen around high SNR regions (corresponding to the instants of significant excitation), without having to extract any further parameters.
Abstract: In this paper, we present a method for speaker verification with a limited amount (2 to 3 s) of speech data. Under the constraint of limited data, the use of traditional vocal tract features in conjunction with statistical models becomes difficult. An estimate of the glottal flow derivative signal, which represents the excitation source information, is used for comparing two signals. Speaker verification is performed by computing normalized correlation coefficients between signal patterns chosen around high-SNR regions (corresponding to the instants of significant excitation), without extracting any further parameters. The high-SNR regions are detected by locating peaks in the Hilbert envelope of the LP residual signal. Speaker verification studies are conducted on clean microphone speech (TIMIT) as well as noisy telephone speech (NTIMIT) to illustrate the effectiveness of the proposed method.
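A schematic SciPy/librosa rendering of the anchor-point step: inverse-filter the signal with LPC to obtain the residual, take the Hilbert envelope of the residual, and pick its peaks as high-SNR instants. The LPC order and peak-picking settings below are illustrative assumptions.

```python
import numpy as np
import librosa
from scipy.signal import hilbert, lfilter, find_peaks

def excitation_instants(y, fs, order=12):
    """High-SNR anchor instants from the Hilbert envelope of the LP residual."""
    a = librosa.lpc(y, order=order)        # inverse (whitening) filter
    residual = lfilter(a, [1.0], y)        # LP residual
    env = np.abs(hilbert(residual))        # Hilbert envelope
    # Peaks at least ~2 ms apart, roughly one per glottal cycle.
    peaks, _ = find_peaks(env, distance=int(0.002 * fs),
                          height=0.3 * env.max())
    return peaks / fs                      # instants in seconds
```

Normalized correlation between residual patterns extracted around matching instants then serves directly as the verification score.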

Book ChapterDOI
13 Dec 2006
TL;DR: A novel framework for HMM-based automatic phonetic segmentation that improves the accuracy of placing phone boundaries according to the minimum boundary error (MBE) criterion, inspired by the recently proposed minimum phone error training approach and the minimum Bayes risk decoding algorithm for automatic speech recognition.
Abstract: This paper presents a novel framework for HMM-based automatic phonetic segmentation that improves the accuracy of placing phone boundaries. In the framework, both training and segmentation approaches are proposed according to the minimum boundary error (MBE) criterion, which tries to minimize the expected boundary errors over a set of possible phonetic alignments. This framework is inspired by the recently proposed minimum phone error (MPE) training approach and the minimum Bayes risk decoding algorithm for automatic speech recognition. To evaluate the proposed MBE framework, we conduct automatic phonetic segmentation experiments on the TIMIT acoustic-phonetic continuous speech corpus. MBE segmentation with MBE-trained models can identify 80.53% of human-labeled phone boundaries within a tolerance of 10 ms, compared to 71.10% identified by conventional ML segmentation with ML-trained models. Moreover, by using the MBE framework, only 7.15% of automatically labeled phone boundaries have errors larger than 20 ms.
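The reported figures are simply the fraction of human-labeled boundaries that have an automatic boundary within a tolerance. A small helper reproducing that measurement; the greedy one-to-one matching is an assumption about the evaluation protocol.

```python
def boundary_accuracy(pred, ref, tol=0.010):
    """Fraction of reference boundaries (seconds) that have a predicted
    boundary within `tol`; each prediction may match one reference."""
    used = set()
    hits = 0
    for r in ref:
        best = None
        for i, p in enumerate(pred):
            if i not in used and abs(p - r) <= tol:
                if best is None or abs(p - r) < abs(pred[best] - r):
                    best = i
        if best is not None:
            used.add(best)
            hits += 1
    return hits / len(ref)

# e.g. boundary_accuracy(auto, human, tol=0.010) -> 0.8053 for MBE models
```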

Journal ArticleDOI
TL;DR: This work compares a variety of novel sub-vector clustering procedures for ASR system parameter quantization, most of which are based on entropy minimization, and others on recognition accuracy maximization on a development set.

Proceedings ArticleDOI
01 Sep 2006
TL;DR: Results are presented for two speech processing tasks for BP: phone classification and grapheme to phoneme (G2P) conversion.
Abstract: Speech processing is a data-driven technology that relies on public corpora and associated resources. In contrast to languages such as English, there are few resources for Brazilian Portuguese (BP). Consequently, there are no publicly available scripts to design baseline BP systems. This work discusses some efforts towards decreasing this gap and presents results for two speech processing tasks for BP: phone classification and grapheme to phoneme (G2P) conversion. The former task used hidden Markov models to classify phones from the Spoltech and TIMIT corpora. The G2P module adopted machine learning methods such as decision trees and was tested on a new BP pronunciation dictionary and the following languages: British English, American English and French.

Journal ArticleDOI
TL;DR: Novel techniques are proposed to enhance time-domain adaptive decorrelation filtering for separation and recognition of cochannel speech in reverberant room conditions and significantly improved ADF convergence rate, target-to-interference ratio, and accuracy of phone recognition.
Abstract: Novel techniques are proposed to enhance time-domain adaptive decorrelation filtering (ADF) for separation and recognition of cochannel speech in reverberant room conditions. The enhancement techniques include whitening filtering of the cochannel speech to improve the conditioning of the adaptive estimation, a block-iterative formulation of ADF to speed up convergence, and integration of multiple ADF outputs through post-filtering to reduce reverberation noise. Experimental data were generated by convolving TIMIT speech with acoustic-path impulse responses measured in a real room environment, with an approximately 2 m microphone-source distance and an initial target-to-interference ratio of about 0 dB. The proposed techniques significantly improved the ADF convergence rate, the target-to-interference ratio, and the accuracy of phone recognition.

31 Mar 2006
TL;DR: This paper addresses the problem of unsupervised speaker change detection by using the Bayesian Information Criterion and a metric-based approach employing line spectral pairs (LSP) and the BIC criterion to validate a potential speaker change point.
Abstract: This paper addresses the problem of unsupervised speaker change detection. We assume that there is no prior knowledge of the number of speakers or their identities. Two methods are tested. The first method uses the Bayesian Information Criterion (BIC), investigates the AudioSpectrumCentroid and AudioWaveformEnvelope features, and implements a dynamic thresholding followed by a fusion scheme. The second method is a real-time one that uses a metric-based approach employing line spectral pairs (LSP) and the BIC criterion to validate a potential speaker change point. The methods are tested on two different datasets. The first set was created by concatenating speakers from the TIMIT database and is referred to as the TIMIT data set. The second set was created by using recordings from the MPEG-7 test set CD1 and broadcast news and is referred to as the INESC dataset.

Dissertation
01 Jan 2006
TL;DR: This work qualitatively and quantitatively validates an elaborate computational model of the primary auditory cortex in the central auditory system, and develops new feature selection and pattern recognition methods to exploit the category-dependent encoding of noise-robust cognitive information in the cortical response.
Abstract: It is well known that machines perform far worse than humans in recognizing speech and audio, especially in noisy environments. One method of addressing this issue of robustness is to study physiological models of the human auditory system and to adopt some of their characteristics in computers. As a first step in studying the potential benefits of an elaborate computational model of the primary auditory cortex (A1) in the central auditory system, we qualitatively and quantitatively validate the model under existing speech processing and recognition methodology. Next, we develop new insights and ideas on how to interpret the model, and reveal some of the advantages of its dimension expansion that may potentially be used to improve existing speech processing and recognition methods. This is done by statistically analyzing the neural responses to various classes of speech signals and forming empirical conjectures on how cognitive information is encoded in a category-dependent manner. We also establish a theoretical framework that shows how noise and signal can be separated in the dimension-expanded cortical space. Finally, we develop new feature selection and pattern recognition methods to exploit the category-dependent encoding of noise-robust cognitive information in the cortical response. Category-dependent features are proposed as features that "specialize" in discriminating specific sets of classes, and as a natural way of incorporating them into a Bayesian decision framework, we propose methods to construct hierarchical classifiers that perform decisions in a two-stage process. Phoneme classification tasks using the TIMIT speech database are performed to quantitatively validate all developments in this work, and the results encourage future work in exploiting high-dimensional data with category- (or class-) dependent features for improved classification or detection.

Journal ArticleDOI
TL;DR: The use of syllables as the acoustic unit for spoken name recognition based on reverse lookup schemes is proposed, and it is shown how syllables can be used to improve recognition performance and reduce system perplexity.

Proceedings ArticleDOI
01 Jan 2006
TL;DR: In this article, a stereo capture system is used to reconstruct 3D models of a speaker producing sentences from the TIMIT corpus, and the data are mapped into a space that maintains the relationships between samples and their temporal derivatives.
Abstract: In this paper we describe a parameterisation of lip movements which maintains the dynamic structure inherent in the task of producing speech sounds. A stereo capture system is used to reconstruct 3D models of a speaker producing sentences from the TIMIT corpus. This data is mapped into a space which maintains the relationships between samples and their temporal derivatives. By incorporating dynamic information within the parameterisation of lip movements we can model the cyclical structure, as well as the causal nature of speech movements as described by an underlying visual speech manifold. It is believed that such a structure will be appropriate to various areas of speech modeling, in particular the synthesis of speech lip movements.

Proceedings ArticleDOI
14 May 2006
TL;DR: An adaptation technique for ANNs is presented that, similarly to the framework of MAP estimation, exploits prior information in the adaptation process, which is particularly useful for dealing with sparse training data.
Abstract: Many techniques for speaker or channel adaptation have been successfully applied to automatic speech recognition. Most of these techniques have been proposed for the adaptation of Hidden Markov Models (HMMs); far fewer proposals have been made for the adaptation of the Artificial Neural Networks (ANNs) used in the hybrid HMM-ANN approach. This paper presents an adaptation technique for ANNs that, similarly to the framework of MAP estimation, exploits prior information in the adaptation process, which is particularly useful for dealing with the problem of sparse training data. We show that the integration of a priori information can be achieved simply by linear interpolation of the weights of an "a priori" network and of a speaker-specific network. Good improvements over the baseline results are reported when evaluating this technique on the Wall Street Journal WSJ0 and WSJ1 databases and on the TIMIT corpus using different amounts of adaptation data.
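The integration step itself is a one-liner: each adapted parameter is a convex combination of the corresponding weights in the "a priori" network and in the speaker-specific network. A sketch; the interpolation weight `lam` and the list-of-arrays parameter layout are assumptions.

```python
import numpy as np

def interpolate_weights(prior, speaker, lam=0.7):
    """MAP-like ANN adaptation: convex combination of an 'a priori'
    network and a speaker-specific network, layer by layer.
    `prior` and `speaker` are lists of weight arrays of equal shapes."""
    return [lam * wp + (1.0 - lam) * ws
            for wp, ws in zip(prior, speaker)]

# With little adaptation data, a larger `lam` keeps the adapted
# network close to the speaker-independent prior.
```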

Proceedings Article
01 Jan 2006
TL;DR: It is shown that a considerable improvement in recognition performance can be achieved if the baseforms are selected properly; preliminary experiments carried out on the TIMIT speech corpus show a considerable improvement over pure monophone/triphone-based systems when the larger-sized units are combined using proper selection of baseforms.
Abstract: A longer-sized sub-word unit is known to be a better candidate in the development of a continuous speech recognition system. However, the basic problem with such units is data sparsity. To overcome this problem, researchers have tried to combine longer-sized sub-word unit models with phoneme models. In this paper, we have considered only frequently occurring syllables and VC (Vowel + Consonant) units, together with phone-sized units (monophones and triphones), for the development of a continuous speech recognition system. In such a case, even for a single pronunciation of a word, there can be multiple representational baseforms in the lexicon, each with different-sized units. We show that a considerable improvement in recognition performance can be achieved if the baseforms are selected properly. Out of all possible baseforms for a given word in the lexicon, only the baseform that maximizes the acoustic likelihood, over the possible sub-word unit concatenations making up the word, is considered. Since, in the baseline systems' word-lexicon (as in pure monophone or triphone-based systems), only the acoustically weaker baseforms are replaced by baseforms with longer-sized units, the resultant performance is guaranteed to be better than that of the baseline systems. The preliminary experiments carried out on the TIMIT speech corpus show a considerable improvement in recognition performance over pure monophone/triphone-based systems when the larger-sized units are combined using proper selection of baseforms.

Proceedings ArticleDOI
17 Sep 2006
TL;DR: Results on the standard TIMIT phone recognition task show this CRF evidence model, even with a relatively simple first-order feature set, is competitive with standard HMMs and DBN variants using static Gaussian mixture models on MFCC features.
Abstract: This paper describes an implementation of a discriminative acoustical model – a Conditional Random Field (CRF) – within a Dynamic Bayes Net (DBN) formulation of a Hierarchic Hidden Markov Model (HHMM) phone recognizer. This CRF-DBN topology accounts for phone transition dynamics in conditional probability distributions over random variables associated with observed evidence, and therefore has less need for hidden variable states corresponding to transitions between phones, leaving more hypothesis space available for modeling higher-level linguistic phenomena such as syntax and semantics. The model also has the interesting property that it explicitly represents likely formant trajectories and formant targets of modeled phones in its random variable distributions, making it more linguistically transparent than models based on traditional HMMs with conditionally independent evidence variables. Results on the standard TIMIT phone recognition task show this CRF evidence model, even with a relatively simple first-order feature set, is competitive with standard HMMs and DBN variants using static Gaussian mixture models on MFCC features.

Book ChapterDOI
13 Dec 2006
TL;DR: The hidden spectral peak trajectory model (HSPTM) is generalized such that both the static and derivative features are derived from a single set of hidden pole trajectories using the well-known relationship between the spectral poles and cepstral coefficients.
Abstract: Most speech models represent the static and derivative cepstral features with separate models that can be inconsistent with each other. In our previous work, we proposed the hidden spectral peak trajectory model (HSPTM), in which the static cepstral trajectories are derived from a set of hidden trajectories of the spectral peaks (captured as spectral poles) in the time-frequency domain. In this work, the HSPTM is generalized such that both the static and derivative features are derived from a single set of hidden pole trajectories, using the well-known relationship between spectral poles and cepstral coefficients. As the pole trajectories represent the resonance frequencies across time, they can be interpreted as formant tracks in voiced speech, which have been shown to contain important cues for phonemic identification. To preserve the common recognition framework, the likelihood functions are still defined in the cepstral domain, with the acoustic models defined by the static and derivative cepstral trajectories. However, these trajectories are no longer separately estimated but jointly derived, and thus are guaranteed to be consistent with each other. Vowel classification experiments were performed on the TIMIT corpus using low-complexity (2-mixture) models and showed a 3% (absolute) reduction in classification error compared to a standard HMM of the same complexity.
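The pole-to-cepstrum relationship in question is classical: for a minimum-phase all-pole spectrum with conjugate pole pairs of magnitude r_k and angle θ_k, the cepstral coefficients are c_n = (2/n) Σ_k r_k^n cos(n θ_k). A direct NumPy transcription mapping formant frequency/bandwidth tracks to cepstral trajectories; the frame layout is an assumption.

```python
import numpy as np

def poles_to_cepstra(freqs, bws, fs, n_ceps=13):
    """Cepstral trajectory from spectral-pole (formant) tracks.
    freqs, bws: (T, K) formant frequencies and bandwidths in Hz."""
    r = np.exp(-np.pi * bws / fs)          # pole magnitudes, (T, K)
    theta = 2.0 * np.pi * freqs / fs       # pole angles, (T, K)
    T = freqs.shape[0]
    c = np.zeros((T, n_ceps))
    for n in range(1, n_ceps):
        # c_n = (2/n) * sum_k r_k^n cos(n theta_k), per frame
        c[:, n] = (2.0 / n) * np.sum(r ** n * np.cos(n * theta), axis=1)
    return c   # c[:, 0] (the gain term) is left at zero here
```

Delta features then follow by differencing these same trajectories, which is why the static and derivative streams are consistent by construction.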

Proceedings Article
19 Sep 2006
TL;DR: A novel time-synchronous decoder, designed specifically for a Hidden Trajectory Model (HTM) whose likelihood score computation depends on long-span phonetic contexts, is presented.
Abstract: A novel time-synchronous decoder, designed specifically for a Hidden Trajectory Model (HTM) whose likelihood score computation depends on long-span phonetic contexts, is presented. The HTM is a recently developed acoustic model aimed at capturing the underlying dynamic structure of speech coarticulation and reduction using a compact set of parameters. The long-span nature of the HTM had posed a great technical challenge for developing efficient search algorithms for full evaluation of the model. Taking on the challenge, the decoding algorithm is developed to deal effectively with the exponentially increased search space using HTM-specific techniques for hypothesis representation, word-ending recombination, and hypothesis pruning. Experimental results obtained on the TIMIT phonetic recognition task are reported, extending our earlier HTM evaluation paradigms based on N-best and A* lattice rescoring. Index Terms: Hidden Trajectory Model, time-synchronous decoding, trace-based hypothesis, TIMIT

Journal ArticleDOI
TL;DR: It is proposed that the GEM approach be applied to the adaptation of hidden Markov models that use non-diagonal covariances, and the necessary update equations are provided.

Proceedings ArticleDOI
01 Sep 2006
TL;DR: The new Gaussian Elimination Algorithm (GEA) is presented for determining the most suitable HMM complexity in continuous speech recognition systems; it is evaluated on a small-vocabulary continuous speech database as well as on the TIMIT corpus.
Abstract: Nowadays, HMM-based speech recognition systems are used in many real-time processing applications, from cell phones to automobile automation. In this context, one important aspect to be considered is the HMM model size, which directly determines the computational load. So, in order to make the system practical, it is worthwhile to optimize the HMM model size subject to a minimum acceptable recognition performance. Furthermore, topology optimization is also important for reliable parameter estimation. This work presents the new Gaussian Elimination Algorithm (GEA) for determining the most suitable HMM complexity in continuous speech recognition systems. The proposed method is evaluated on a small-vocabulary continuous speech (SVCS) database as well as on the TIMIT corpus.

Proceedings ArticleDOI
09 Jul 2006
TL;DR: A model of the PSTN channel is developed in order to train HMMs on the TIMIT database passed through the channel model, with NTIMIT used for testing.
Abstract: By reducing the signal bandwidth, the telephone channel causes a drop in the performance of most recognition systems, whether for speaker identification or continuous speech recognition. Many compensation techniques have been developed to reduce the mismatch between the training and test databases, which is considered the main cause of the degradation. These techniques fall into two categories: (1) feature compensation, in which the representation of the acoustic vectors is adjusted, and (2) model adaptation, in which the HMM parameters are modified to come closer to the testing environment. The method developed here belongs to the second category. The main purpose of this paper is to develop a model of the PSTN channel in order to train HMMs on the TIMIT database passed through the PSTN channel model. NTIMIT is used for testing. Comparisons are made with feature compensation techniques combined with the proposed approach.
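A crude stand-in for such a channel model is to band-limit the wideband TIMIT audio to the nominal 300-3400 Hz telephone band and downsample it to 8 kHz before training. The sketch below is a generic simulation of that band limitation, not the paper's channel model.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, resample_poly

def telephone_channel(y, fs=16000):
    """Approximate PSTN band limitation: 300-3400 Hz bandpass, 8 kHz rate."""
    sos = butter(4, [300, 3400], btype="bandpass", fs=fs, output="sos")
    y_bp = sosfiltfilt(sos, y)          # zero-phase band-limiting
    return resample_poly(y_bp, 1, 2)    # 16 kHz -> 8 kHz
```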

Proceedings ArticleDOI
20 Aug 2006
TL;DR: Experimental results show the effectiveness of the GMM-based method, and its performance on the TIMIT corpus indicates potential applications in speech recognition, synthesis, and coding.
Abstract: Implicit speech segmentation essentially looks for time instants where the spectral distortion is large. The spectral variation function (SVF) is a widely used measure of spectral distortion; however, it is a data-dependent measure. In order to make the measurement data-independent, a likelihood ratio is constructed to measure the spectral distortion. This ratio can be computed efficiently with a Bayesian predictive model, whose prior is estimated from unlabeled data via an unsupervised machine learning technique, the Gaussian mixture model (GMM). The experimental results show the effectiveness of this novel method, and the performance on the TIMIT corpus indicates potential applications in speech recognition, synthesis, and coding.
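For reference, the data-dependent baseline being improved on, the spectral variation function, can be written as a frame-to-frame spectral distance whose local peaks are taken as segment boundaries. A generic NumPy version using cosine distance between averaged spectral frames; the paper's contribution replaces this score with a GMM-based Bayesian likelihood ratio, and the threshold and context length here are placeholders.

```python
import numpy as np

def svf_boundaries(feats, threshold=0.2, context=2):
    """Implicit segmentation via a spectral variation function.
    feats: (T, d) spectral frames (e.g., MFCCs). Returns frame indices
    where the local spectral distortion peaks above `threshold`."""
    T = feats.shape[0]
    svf = np.zeros(T)
    for t in range(context, T - context):
        left = feats[t - context:t].mean(axis=0)
        right = feats[t:t + context].mean(axis=0)
        cos = np.dot(left, right) / (
            np.linalg.norm(left) * np.linalg.norm(right) + 1e-10)
        svf[t] = 1.0 - cos                 # distortion, not similarity
    return [t for t in range(1, T - 1)
            if svf[t] > threshold
            and svf[t] >= svf[t - 1] and svf[t] >= svf[t + 1]]
```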