Showing papers on "TIMIT published in 2009"

PDF

Open Access

Proceedings Article•DOI•

Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams

[...]

Yaodong Zhang¹, James Glass¹•Institutions (1)

01 Dec 2009

TL;DR: An unsupervised learning framework is presented to address the problem of detecting spoken keywords by using segmental dynamic time warping to compare the Gaussian posteriorgrams between keyword samples and test utterances and obtaining the keyword detection result.

...read moreread less

Abstract: In this paper, we present an unsupervised learning framework to address the problem of detecting spoken keywords. Without any transcription information, a Gaussian Mixture Model is trained to label speech frames with a Gaussian posteriorgram. Given one or more spoken examples of a keyword, we use segmental dynamic time warping to compare the Gaussian posteriorgrams between keyword samples and test utterances. The keyword detection result is then obtained by ranking the distortion scores of all the test utterances. We examine the TIMIT corpus as a development set to tune the parameters in our system, and the MIT Lecture corpus for more substantial evaluation. The results demonstrate the viability and effectiveness of our unsupervised learning framework on the keyword spotting task.

...read moreread less

350 citations

Journal Article•DOI•

Discriminative keyword spotting

[...]

Joseph Keshet¹, David Grangier², Samy Bengio³•Institutions (3)

Idiap Research Institute¹, Princeton University², Google³

01 Apr 2009-Speech Communication

TL;DR: This article proposed a new approach for keyword spotting, which is based on large margin and kernel methods rather than on HMMs, in which the learning phase aims at achieving a high area under the ROC curve, as this quantity is the most common measure to evaluate keyword spotters.

...read moreread less

113 citations

Journal Article•DOI•

Speech Recognition Using Augmented Conditional Random Fields

[...]

Yasser Hifny¹, Steve Renals²•Institutions (2)

IBM¹, University of Edinburgh²

01 Feb 2009-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: A new acoustic modeling paradigm based on augmented conditional random fields (ACRFs) is investigated and developed, which addresses some limitations of HMMs while maintaining many of the aspects which have made them successful.

...read moreread less

Abstract: Acoustic modeling based on hidden Markov models (HMMs) is employed by state-of-the-art stochastic speech recognition systems. Although HMMs are a natural choice to warp the time axis and model the temporal phenomena in the speech signal, their conditional independence properties limit their ability to model spectral phenomena well. In this paper, a new acoustic modeling paradigm based on augmented conditional random fields (ACRFs) is investigated and developed. This paradigm addresses some limitations of HMMs while maintaining many of the aspects which have made them successful. In particular, the acoustic modeling problem is reformulated in a data driven, sparse, augmented space to increase discrimination. Acoustic context modeling is explicitly integrated to handle the sequential phenomena of the speech signal. We present an efficient framework for estimating these models that ensures scalability and generality. In the TIMIT phone recognition task, a phone error rate of 23.0% was recorded on the full test set, a significant improvement over comparable HMM-based systems.

...read moreread less

91 citations

Proceedings Article•

Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification

[...]

Amarnag Subramanya¹, Jeff A. Bilmes¹•Institutions (1)

University of Washington¹

07 Dec 2009

TL;DR: It is proved that the iterative alternating minimization procedure used to minimize the objective converges to the correct solution and deriving a test for convergence and a graph node ordering algorithm is proposed that is cache cognizant and leads to a linear speedup in parallel computations.

...read moreread less

Abstract: We prove certain theoretical properties of a graph-regularized transductive learning objective that is based on minimizing a Kullback-Leibler divergence based loss. These include showing that the iterative alternating minimization procedure used to minimize the objective converges to the correct solution and deriving a test for convergence. We also propose a graph node ordering algorithm that is cache cognizant and leads to a linear speedup in parallel computations. This ensures that the algorithm scales to large data sets. By making use of empirical evaluation on the TIMIT and Switchboard I corpora, we show this approach is able to outperform other state-of-the-art SSL approaches. In one instance, we solve a problem on a 120 million node graph.

...read moreread less

73 citations

Proceedings Article•DOI•

Robust discriminative keyword spotting for emotionally colored spontaneous speech using bidirectional LSTM networks

[...]

Martin Wöllmer¹, Florian Eyben¹, Joseph Keshet², Alex Graves¹, Björn Schuller¹, Gerhard Rigoll¹ - Show less +2 more•Institutions (2)

Technische Universität München¹, Idiap Research Institute²

19 Apr 2009

TL;DR: A new technique for robust keyword spotting that uses bidirectional Long Short-Term Memory (BLSTM) recurrent neural nets to incorporate contextual information in speech decoding and overcomes the drawbacks of generative HMM modeling.

...read moreread less

Abstract: In this paper we propose a new technique for robust keyword spotting that uses bidirectional Long Short-Term Memory (BLSTM) recurrent neural nets to incorporate contextual information in speech decoding. Our approach overcomes the drawbacks of generative HMM modeling by applying a discriminative learning procedure that non-linearly maps speech features into an abstract vector space. By incorporating the outputs of a BLSTM network into the speech features, it is able to make use of past and future context for phoneme predictions. The robustness of the approach is evaluated on a keyword spotting task using the HUMAINE Sensitive Artificial Listener (SAL) database, which contains accented, spontaneous, and emotionally colored speech. The test is particularly stringent because the system is not trained on the SAL database, but only on the TIMIT corpus of read speech. We show that our method prevails over a discriminative keyword spotter without BLSTM-enhanced feature functions, which in turn has been proven to outperform HMM-based techniques.

...read moreread less

71 citations

Journal Article•DOI•

Speaker-independent phoneme alignment using transition-dependent states

[...]

John-Paul Hosom¹•Institutions (1)

Oregon Health & Science University¹

01 Apr 2009-Speech Communication

TL;DR: A baseline forced-alignment system and a proposed system with several modifications to this baseline, including the addition of energy-based features to the standard cepstral feature set, and the computation of probabilities of distinctive phonetic features instead of phoneme-level probabilities are described.

...read moreread less

63 citations

Proceedings Article•DOI•

Speech rhythm guided syllable nuclei detection

[...]

Yaodong Zhang¹, James Glass¹•Institutions (1)

Massachusetts Institute of Technology¹

19 Apr 2009

TL;DR: In this paper, an instantaneous speech rhythm estimator is introduced to predict possible regions where syllable nuclei can appear, and a simple slope based peak counting algorithm is used to get the exact location of each syllable nucleus.

...read moreread less

Abstract: In this paper, we present a novel speech-rhythm-guided syllable-nuclei location detection algorithm. As a departure from conventional methods, we introduce an instantaneous speech rhythm estimator to predict possible regions where syllable nuclei can appear. Within a possible region, a simple slope based peak counting algorithm is used to get the exact location of each syllable nucleus. We verify the correctness of our method by investigating the syllable nuclei interval distribution in TIMIT dataset, and evaluate the performance by comparing with a state-of-the-art syllable nuclei based speech rate detection approach.

...read moreread less

42 citations

Proceedings Article•DOI•

An exploration of large vocabulary tools for small vocabulary phonetic recognition

[...]

Tara N. Sainath¹, Bhuvana Ramabhadran¹, Michael Picheny¹•Institutions (1)

IBM¹

01 Dec 2009

TL;DR: This paper takes the standard Ȝrecipeȝ used in typical LVCSR systems and applies it to the TIMIT phonetic recognition corpus, which provides a standard benchmark to compare methods and finds that at the speaker-independent (SI) level, the results offer comparable performance to other SI HMM systems.

...read moreread less

Abstract: While research in large vocabulary continuous speech recognition (LVCSR) has sparked the development of many state of the art research ideas, research in this domain suffers from two main drawbacks. First, because of the large number of parameters and poorly labeled transcriptions, gaining insight into further improvements based on error analysis is very difficult. Second, LVCSR systems often take a significantly longer time to train and test new research ideas compared to small vocabulary tasks. A small vocabulary task like TIMIT provides a phonetically rich and hand-labeled corpus and offers a good test bed to study algorithmic improvements. However, oftentimes research ideas explored for small vocabulary tasks do not always provide gains on LVCSR systems. In this paper, we address these issues by taking the standard Ȝrecipeȝ used in typical LVCSR systems and applying it to the TIMIT phonetic recognition corpus, which provides a standard benchmark to compare methods. We find that at the speaker-independent (SI) level, our results offer comparable performance to other SI HMM systems. By taking advantage of speaker adaptation and discriminative training techniques commonly used in LVCSR systems, we achieve an error rate of 20%, the best results reported on the TIMIT task to date, moving us closer to the human reported phonetic recognition error rate of 15%. We propose the use of this system as the baseline for future research and believe that it will serve as a good framework to explore ideas that will carry over to LVCSR systems.

...read moreread less

38 citations

Proceedings Article•

Hidden conditional random field with distribution constraints for phone classification.

[...]

Dong Yu¹, Li Deng¹, Alex Acero¹•Institutions (1)

Microsoft¹

01 Sep 2009

TL;DR: It is demonstrated that a 20.8% classification error rate can be achieved on the TIMIT phone classification task using the HCRF-DC model, which is superior to any published single-system result on this heavily evaluated task.

...read moreread less

Abstract: We advance the recently proposed hidden conditional random field (HCRF) model by replacing the moment constraints (MCs) with the distribution constraints (DCs). We point out that the distribution constraints are the same as the traditional moment constraints for the binary features but are able to better regularize the probability distribution of the continuousvalued features than the moment constraints. We show that under the distribution constraints the HCRF model is no longer log-linear but embeds the model parameters in non-linear functions. We provide an effective solution to the resulting more difficult optimization problem by converting it to the traditional log-linear form at a higher-dimensional space of features exploiting cubic spline. We demonstrate that a 20.8% classification error rate (CER) can be achieved on the TIMIT phone classification task using the HCRF-DC model. This result is superior to any published single-system result on this heavily evaluated task including the HCRF-MC model, the discriminatively trained HMMs, and the large-margin HMMs using the same features. Index Terms: hidden conditional random field, maximum entropy, moment constraint, distribution constraint, phone classification, TIMIT, cubic spline

...read moreread less

29 citations

Journal Article•DOI•

Broad phonetic classification using discriminative Bayesian networks

[...]

Franz Pernkopf¹, Tuan Van Pham¹, Jeff A. Bilmes²•Institutions (2)

Graz University of Technology¹, University of Washington²

01 Feb 2009-Speech Communication

TL;DR: An approach to broad phonetic classification, defined as mapping acoustic speech frames into broad (or clustered) phonetic categories, which suggests that discriminative Bayesian networks are the most appropriate approach when missing features are common.

...read moreread less

24 citations

Journal Article•DOI•

Efficient likelihood evaluation and dynamic Gaussian selection for HMM-based speech recognition

[...]

Jun Cai¹, Ghazi Bouselmi², Yves Laprie², Jean-Paul Haton²•Institutions (2)

Xiamen University¹, French Institute for Research in Computer Science and Automation²

01 Apr 2009-Computer Speech & Language

TL;DR: A fast likelihood computation approach called dynamic Gaussian selection (DGS) is proposed, which is a one-pass search technique which generates a dynamic shortlist of Gaussians for each state during the procedure of likelihood computation.

...read moreread less

Proceedings Article•

A Tandem BLSTM-DBN Architecture for Keyword Spotting with Enhanced Context Modeling

[...]

Martin Wöllmer¹, Florian Eyben¹, Alex Graves¹, Björn Schuller¹, Gerhard Rigoll¹ - Show less +1 more•Institutions (1)

Technische Universität München¹

01 Jan 2009

TL;DR: A novel architecture for keyword spotting which is composed of a Dynamic Bayesian Network (DBN) and a bidirectional Long Short-Term Memory (BLSTM) recurrent neural net, based on a phoneme recognizer and uses a hidden garbage variable to discriminate between keywords and arbitrary speech.

...read moreread less

Abstract: We propose a novel architecture for keyword spotting which is composed of a Dynamic Bayesian Network (DBN) and a bidirectional Long Short-Term Memory (BLSTM) recurrent neural net. The DBN is based on a phoneme recognizer and uses a hidden garbage variable as well as the concept of switching parents to discriminate between keywords and arbitrary speech. Contextual information is incorporated by a BLSTM network, providing a discrete phoneme prediction feature for the DBN. Together with continuous acoustic features, the discrete BLSTM output is processed by the DBN which detects keywords in speech sequences. Due to the flexible design of our Tandem BLSTM-DBN recognizer, new keywords can be added to the vocabulary without having to re-train the model. Further, our concept does not require the training of an explicit garbage model. Experiments on the TIMIT corpus show that incorporating a BLSTM network into the DBN architecture can increase the true positive rate by up to 10% at equal false positive rates.

...read moreread less

Proceedings Article•DOI•

Using Artificial Neural Network for Robust Voice Activity Detection Under Adverse Conditions

[...]

Tuan Van Pham, Chien T. Tang, Michael Stadtschnitzer

13 Jul 2009

TL;DR: Results show that the proposed neural network classifier employing MFCC feature provides robustly high scores under different noisy conditions and the robustness of the developed VAD algorithm is still hold in the case of testing it with the completely mismatched environment.

...read moreread less

Abstract: We present an approach to model-based voice activity detection (VAD) for harsh environments. By using mel-frequency cepstral coefficients feature extracted from clean and noisy speech samples, an artificial neural network is trained optimally in order to provide a reliable model. There are three main aspects to this study: First, in addition to the developed model, recent state-of-the-art VAD methods are analyzed extensively. Second, we present an optimization procedure of neural network training, including evaluation of trained network performance with proper measures. Third, a large assortment of empirical results on the noisy TIMIT and SNOW corpuses including different types of noise at different signal-to-noise ratios is provided. We evaluate the built VAD model on the noisy corpuses and compare against the state-of-the-art VAD methods such as the ITU-T Rec. G. 729 Annex B, the ETSI AFE ES 202 050, and recently promising VAD algorithms. Results show that: (i) the proposed neural network classifier employing MFCC feature provides robustly high scores under different noisy conditions; (ii) the invented model is superior to other VAD methods in terms of various classification measures; (iii) the robustness of the developed VAD algorithm is still hold in the case of testing it with the completely mismatched environment.

...read moreread less

Proceedings Article•DOI•

A new wavelet thresholding method for speech enhancement based on symmetric Kullback-Leibler divergence

[...]

Shima Tabibian¹, Ahmad Akbari¹, Babak Nasersharif²•Institutions (2)

Iran University of Science and Technology¹, University of Gilan²

08 Dec 2009

TL;DR: A new method is proposed to determine the threshold value based on the symmetric Kullback-Leibler divergence between the probability distributions of noisy speech and noise wavelet coefficients using segmental SNR.

...read moreread less

Abstract: Performance of wavelet thresholding methods for speech enhancement is dependent on estimating an exact threshold value in the wavelet sub-bands. In this paper, we propose a new method for more exact estimating the threshold value. We proposed to determine the threshold value based on the symmetric Kullback-Leibler divergence between the probability distributions of noisy speech and noise wavelet coefficients. In the next step, we improved this value using segmental SNR. We used some of TIMIT utterances to assess the performance of the proposed threshold. The algorithm is evaluated using the PESQ score and the SNR improvement. In average, we obtain 2db SNR improvement and a PESQ score increase up to 0.7 in comparison to the conventional wavelet thresholding approaches.

...read moreread less

Proceedings Article•DOI•

Static and dynamic modulation spectrum for speech recognition.

[...]

Sriram Ganapathy¹, Samuel Thomas¹, Hynek Hermansky¹•Institutions (1)

Johns Hopkins University¹

06 Sep 2009

TL;DR: A feature extraction technique based on static and dynamic modulation spectrum derived from long-term envelopes in sub-bands that provides relative improvements in phoneme recognition accuracies for TIMIT and conversation telephone speech (CTS).

...read moreread less

Abstract: We present a feature extraction technique based on static and dynamic modulation spectrum derived from long-term envelopes in sub-bands Estimation of the sub-band temporal envelopes is done using Frequency Domain Linear Prediction (FDLP) These sub-band envelopes are compressed with a static (logarithmic) and dynamic (adaptive loops) compression The compressed sub-band envelopes are transformed into modulation spectral components which are used as features for speech recognition Experiments are performed on a phoneme recognition task using a hybrid HMM-ANN phoneme recognition system and an ASR task using the TANDEM speech recognition system The proposed features provide a relative improvements of 38 % and 115 % in phoneme recognition accuracies for TIMIT and conversation telephone speech (CTS) respectively Further, these improvements are found to be consistent for ASR tasks on OGI-Digits database (relative improvement of 135 %)

...read moreread less

Journal Article•DOI•

Time–Frequency Correlation-Based Missing-Feature Reconstruction for Robust Speech Recognition in Band-Restricted Conditions

[...]

Wooil Kim¹, John H. L. Hansen¹•Institutions (1)

University of Texas at Dallas¹

01 Sep 2009-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: An advanced method to effectively utilize the correlation information of the spectral components across time and frequency axes in an effort to increase the performance of missing-feature reconstruction in band-limited conditions is proposed.

...read moreread less

Abstract: Band-limited speech represents one of the most challenging factors for robust speech recognition. This is especially true in supporting audio corpora from sources that have a range of conditions in spoken document retrieval requiring effective automatic speech recognition. The missing-feature reconstruction method has a problem when applied to band-limited speech reconstruction, since it assumes the observations in the unreliable regions are always greater than the latent original clean speech. The approach developed here depends only on reliable components to calculate the posterior probability to mitigate the problem. This study proposes an advanced method to effectively utilize the correlation information of the spectral components across time and frequency axes in an effort to increase the performance of missing-feature reconstruction in band-limited conditions. We employ an F1 area window and cutoff border window in order to include more knowledge on reliable components which are highly correlated with the cutoff frequency band. To detect the cutoff regions for missing-feature reconstruction, blind mask estimation is also presented, which employs the synthesized band-limited speech model without secondary training data. Experiments to evaluate the performance of the proposed methods are accomplished using the SPHINX3 speech recognition engine and the TIMIT corpus. Experimental results demonstrate that the proposed time-frequency (TF) correlation based missing-feature reconstruction method is significantly more effective in improving band-limited speech recognition accuracy. By employing the proposed TF-missing feature reconstruction method, we obtain up to 14.61% of average relative improvement in word error rate (WER) for four available bandwidths with cutoff frequencies 1.0, 1.5, 2.0, and 2.5 kHz, respectively, compared to earlier formulated methods. Experimental results on the National Gallery of the Spoken Word (NGSW) corpus also show the proposed method is effective in improving band-limited speech recognition in real-life spoken document conditions.

...read moreread less

Journal Article•DOI•

A stochastic version of Expectation Maximization algorithm for better estimation of Hidden Markov Model

[...]

Shamsul Huda¹, John Yearwood¹, Roberto Togneri²•Institutions (2)

Federation University Australia¹, University of Western Australia²

30 Oct 2009-Pattern Recognition Letters

TL;DR: A hybrid algorithm, Simulated Annealing Stochastic version of EM (SASEM) is proposed, combining SimulatedAnnealing with EM that reformulates the HMM estimation process using a stochastic step between the EM steps and the SA.

...read moreread less

Proceedings Article•DOI•

Robust vocabulary independent keyword spotting with graphical models

[...]

Martin Wöllmer¹, Florian Eyben¹, Björn Schuller¹, Gerhard Rigoll¹•Institutions (1)

Technische Universität München¹

01 Dec 2009

TL;DR: The design of the model architecture is optimised to reliably detect keywords rather than to decode keyword phoneme sequences as arbitrary speech, while offering a parameter to adjust the operating point on the receiver operating characteristics curve.

...read moreread less

Abstract: This paper introduces a novel graphical model architecture for robust and vocabulary independent keyword spotting which does not require the training of an explicit garbage model. We show how a graphical model structure for phoneme recognition can be extended to a keyword spotter that is robust with respect to phoneme recognition errors. We use a hidden garbage variable together with the concept of switching parents to model keywords as well as arbitrary speech. This implies that keywords can be added to the vocabulary without having to re-train the model. Thereby the design of our model architecture is optimised to reliably detect keywords rather than to decode keyword phoneme sequences as arbitrary speech, while offering a parameter to adjust the operating point on the Receiver Operating Characteristics curve. Experiments on the TIMIT corpus reveal that our graphical model outperforms a comparable Hidden Markov Model based keyword spotter that uses conventional garbage modelling.

...read moreread less

Proceedings Article•DOI•

Text Independent Composite Speaker Identification/Verification Using Multiple Features

[...]

A. Revathi¹, Y. Venkataramani•Institutions (1)

National Institute of Technology, Tiruchirappalli¹

31 Mar 2009

TL;DR: In this work, F-ratio is computed as a theoretical measure to validate the experimental results for both identification and verification of composite speaker identification/verification.

...read moreread less

Abstract: The main objective of this paper is to explore the effectiveness of feature selection for performing composite speaker identification/verification. We propose features such as line spectral frequency (LSF), differential line spectral frequency (DLSF), mel frequency cepstral coefficients (MFCC), discrete cosine transform cepstrum (DCTC), perceptual linear predictive cepstrum (PLP) and mel frequency perceptual linear predictive cepstrum (MF-PLP). These features are captured and training models are developed by K-means clustering procedure. A speaker identification system is evaluated on noise added test speeches and the experimental results reveal the performance of the proposed algorithm in identifying speakers based on minimum distance between test features and clusters and also highlight the best choice of feature set among all the proposed features for 50 speakers chosen randomly from "TIMIT" database. Analysis is performed on the identification results to emphasize the choice of features which produce better results for speaker verification with respect to equal error rate. In this work, F-ratio is computed as a theoretical measure to validate the experimental results for both identification and verification.

...read moreread less

Proceedings Article•DOI•

Matrix updates for perceptron training of continuous density hidden Markov models

[...]

Chih-Chieh Cheng¹, Fei Sha², Lawrence K. Saul¹•Institutions (2)

University of California, San Diego¹, University of Southern California²

14 Jun 2009

TL;DR: A simple, mistake-driven learning algorithm for discriminative training of continuous density hidden Markov models (CD-HMMs), which reparameterize these Gaussian distributions in terms of positive semidefinite matrices that jointly encode their mean and covariance statistics.

...read moreread less

Abstract: In this paper, we investigate a simple, mistake-driven learning algorithm for discriminative training of continuous density hidden Markov models (CD-HMMs). Most CD-HMMs for automatic speech recognition use multivariate Gaussian emission densities (or mixtures thereof) parameterized in terms of their means and covariance matrices. For discriminative training of CD-HMMs, we reparameterize these Gaussian distributions in terms of positive semidefinite matrices that jointly encode their mean and covariance statistics. We show how to explore the resulting parameter space in CDHMMs with perceptron-style updates that minimize the distance between Viterbi decodings and target transcriptions. We experiment with several forms of updates, systematically comparing the effects of different matrix factorizations, initializations, and averaging schemes on phone accuracies and convergence rates. We present experimental results for context-independent CD-HMMs trained in this way on the TIMIT speech corpus. Our results show that certain types of perceptron training yield consistently significant and rapid reductions in phone error rates.

...read moreread less

Proceedings Article•DOI•

Unfolding speaker clustering potential: a biomimetic approach

[...]

Thilo Stadelmann¹, Bernd Freisleben¹•Institutions (1)

University of Marburg¹

19 Oct 2009

TL;DR: It is the stage of modeling that has the highest potential, and information with respect to the temporal succession of frames is crucially missing, and the validity of this biomimetic approach is shown.

...read moreread less

Abstract: Speaker clustering is the task of grouping a set of speech utterances into speaker-specific classes. The basic techniques for solving this task are similar to those used for speaker verification and identification. The hypothesis of this paper is that the techniques originally developed for speaker verification and identification are not sufficiently discriminative for speaker clustering. However, the processing chain for speaker clustering is quite large - there are many potential areas for improvement. The question is: where should improvements be made to improve the final result? To answer this question, this paper takes a biomimetic approach based on a study with human participants acting as an automatic speaker clustering system. Our findings are twofold: it is the stage of modeling that has the highest potential, and information with respect to the temporal succession of frames is crucially missing. Experimental results with our implementation of a speaker clustering system incorporating our findings and applying it on TIMIT data show the validity of our approach.

...read moreread less

Proceedings Article•DOI•

Spectral and Temporal Modulation Features for Phonetic Recognition

[...]

Stephen A. Zahorian¹, Hongbing Hu¹, Zhengqing Chen, Jiang Wu¹•Institutions (1)

Binghamton University¹

06 Sep 2009

TL;DR: A DCT analysis of the log magnitude spectrum combined with a Discrete Cosine Series (DCS) expansion of DCT coefficients over time is proposed as a method for capturing both the spectral and modulation information.

...read moreread less

Abstract: Recently, the modulation spectrum has been proposed and found to be a useful source of speech information. The modulation spectrum represents longer term variations in the spectrum and thus implicitly requires features extracted from much longer speech segments compared to MFCCs and their delta terms. In this paper, a Discrete Cosine Transform (DCT) analysis of the log magnitude spectrum combined with a Discrete Cosine Series (DCS) expansion of DCT coefficients over time is proposed as a method for capturing both the spectral and modulation information. These DCT/DCS features can be computed so as to emphasize frequency resolution or time resolution or a combination of the two factors. Several variations of the DCT/DCS features were evaluated with phonetic recognition experiments using TIMIT and its telephone version (NTIMIT). Best results obtained with a combined feature set are 73.85% for TIMIT and 62.5% for NTIMIT. The modulation features are shown to be far more important than the spectral features for automatic speech recognition and far more noise robust.

...read moreread less

Journal Article•DOI•

Feature Compensation Techniques for ASR on Band-Limited Speech

[...]

Nicolas Morales¹, Doroteo Torre Toledano, John H. L. Hansen², Javier Garrido•Institutions (2)

Nuance Communications¹, University of Texas at Dallas²

01 May 2009-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: This paper argues that in particular cases such as rapidly varying distortions, or limited computational or memory resources, feature compensation is more convenient, and shows that feature-side and model-side approaches may be combined, outperforming any of those approaches alone.

...read moreread less

Abstract: Band-limited speech (speech for which parts of the spectrum are completely lost) is a major cause for accuracy degradation of automatic speech recognition (ASR) systems particularly when acoustic models have been trained with data with a different spectral range. In this paper, we present an extensive study of the problem of ASR of band-limited speech with full-bandwidth acoustic models. Our focus is mainly on band-limited feature compensation, covering even the case of time-varying band-limiting distortions, but we also compare this approach to more common model-side techniques (adaptation and retraining) and explore the combination of feature-based and model-side approaches. The feature compensation algorithms proposed are organized in a unified framework supported by a novel mathematical model of the impact of such distortions on Mel-frequency cepstral coefficient (MFCC) features. A crucial and novel contribution is the analysis made of the relative correlation of different elements in the MFCC feature vector for the cases of full-bandwidth and limited-bandwidth speech, which justifies an important modification in the feature compensation scheme. Furthermore, an intensive experimental analysis is provided. Experiments are conducted on real telephone channels, as well as artificial low-pass and bandpass filters applied over TIMIT data, and results are given for different experimental constraints and variations of the feature compensation method. Results for other well-known robustness approaches, such as cepstral mean normalization (CMN), model retraining, and model adaptation are also given for comparison. ASR performance with our approach is similar or even better than model adaptation, and we argue that in particular cases such as rapidly varying distortions, or limited computational or memory resources, feature compensation is more convenient. Furthermore, we show that feature-side and model-side approaches may be combined, outperforming any of those approaches alone.

...read moreread less

Book Chapter•DOI•

Large Margin Training of Continuous Density Hidden Markov Models

[...]

Fei Sha¹, Lawrence K. Saul²•Institutions (2)

University of Southern California¹, University of California, San Diego²

14 Jan 2009

TL;DR: This chapter investigates a new framework for parameter estimation in CD-HMMs, inspired by recent parallel trends in the fields of ASR and machine learning, and describes how it can be optimized efficiently by simple gradient-based methods.

...read moreread less

Abstract: Continuous density hidden Markov models (CD-HMMs) are an essential component of modern systems for automatic speech recognition (ASR). These models assign probabilities to the sequences of acoustic feature vectors extracted by signal processing of speech waveforms. In this chapter, we investigate a new framework for parameter estimation in CD-HMMs. Our framework is inspired by recent parallel trends in the fields of ASR and machine learning. In ASR, significant improvements in performance have been obtained by discriminative training of acoustic models. In machine learning, significant improvements in performance have been obtained by discriminative training of large margin classifiers. Building on both these lines of work, we show how to train CD-HMMs by maximizing an appropriately defined margin between correct and incorrect decodings of speech waveforms. We start by defining an objective function over a transformed parameter space for CD-HMMs, then describe how it can be optimized efficiently by simple gradient-based methods. Within this framework, we obtain highly competitive results for phonetic recognition on the TIMIT speech corpus. We also compare our framework for large margin training to other popular frameworks for discriminative training of CD-HMMs.

...read moreread less

Journal Article•DOI•

Discriminative Input Stream Combination for Conditional Random Field Phone Recognition

[...]

Ilana Heintz¹, Eric Fosler-Lussier¹, Chris Brew¹•Institutions (1)

Ohio State University¹

01 Nov 2009-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: This paper improves upon previous work to find that several different dimensionality reduction techniques (SVD, PARAFAC2, KLT), followed by a nonlinear transform provided by a multilayer perceptron, provides a significant gain in phone recognition accuracy on the TIMIT task.

...read moreread less

Abstract: In recent studies, we and others have found that conditional random fields (CRFs) can be effectively used to perform phone classification and recognition tasks by combining non-Gaussian distributed representations of acoustic input. In previous work by I. Heintz (latent phonetic analysis: Use of singular value decomposition to determine features for CRF phone recognition, Proc. ICASSP, pp. 4541-4544, 2008), we experimented with combining phonological feature posterior estimators and phone posterior estimators within a CRF framework; we found that treating posterior estimates as terms in a ldquophoneme information retrievalrdquo task allowed for a more effective use of multiple posterior streams than directly feeding these acoustic representations to the CRF recognizer. In this paper, we examine some of the design choices in our previous work, and extend our results to up to six acoustic feature streams. We concentrate on feature design, rather than feature selection, to find the best way of combining features for introduction into a log-linear model. We improve upon our previous work to find that several different dimensionality reduction techniques (SVD, PARAFAC2, KLT), followed by a nonlinear transform provided by a multilayer perceptron, provides a significant gain in phone recognition accuracy on the TIMIT task.

...read moreread less

Proceedings Article•DOI•

Improving multi-lattice alignment based spoken keyword spotting

[...]

Hui Lin¹, Alex Stupakov¹, Jeff A. Bilmes¹•Institutions (1)

University of Washington¹

19 Apr 2009

TL;DR: Several techniques are introduced that further improve the multi-lattice alignment approach, including edit operation modeling and supervised training of the conditional probability table, something which cannot be directly trained by traditional maximum likelihood estimation.

...read moreread less

Abstract: In previous work, we showed that using a lattice instead of the 1-best path to represent both the query and the utterance being searched is beneficial for spoken keyword spotting. In this paper, we introduce several techniques that further improve our multi-lattice alignment approach, including edit operation modeling and supervised training of the conditional probability table, something which cannot be directly trained by traditional maximum likelihood estimation. Experiments on TIMIT show that the proposed methods significantly improve the performance of spoken keyword spotting.

...read moreread less

Book Chapter•DOI•

Automatic classification of regular vs. irregular phonation types

[...]

Tamás M. Bőhm¹, Zoltán Both¹, Géza Németh¹•Institutions (1)

Budapest University of Technology and Economics¹

25 Jun 2009

TL;DR: A classifier that extracts six acoustic cues from vowels and then labels them as regular or irregular by means of a support vector machine is proposed and integrated cues from earlier phonation type classifiers are integrated and improved their performance in five out of the six cases.

...read moreread less

Abstract: Irregular phonation (also called creaky voice, glottalization and laryngealization) may have various communicative functions in speech Thus the automatic classification of phonation type into regular and irregular can have a number of applications in speech technology In this paper, we propose such a classifier that extracts six acoustic cues from vowels and then labels them as regular or irregular by means of a support vector machine We integrated cues from earlier phonation type classifiers and improved their performance in five out of the six cases The classifier with the improved cue set produced a 9885% hit rate and a 347% false alarm rate on a subset of the TIMIT corpus

...read moreread less

Proceedings Article•DOI•

Perceptual Features Based Isolated Digit and Continuous Speech Recognition Using Iterative Clustering Approach

[...]

A. Revathi, Y. Venkataramani

27 Dec 2009

TL;DR: F-ratio is computed as a theoretical measure to validate the experimental results and distribution is used to justify the good experimental results statistically.

...read moreread less

Abstract: The main objective of this paper is to explore the effectiveness of perceptual features for performing isolated digits and continuous speech recognition. The proposed perceptual features are captured and training models are developed by K-means clustering procedure. Speech recognition system is evaluated on clean and noisy test speeches and the experimental results reveal the performance of the proposed algorithm in recognizing digits and continuous speeches based on minimum distance between test features and clusters. Performance of these features is tested on speeches randomly chosen from “TI Digits_1”, “TI Digits_2” and “TIMIT” databases. These perceptual features are evaluated with accuracy of 91% for speaker independent isolated digit recognition, 99.5% for clean continuous speech recognition and 87% for noisy continuous speech recognition. In this work, F-ratio is computed as a theoretical measure to validate the experimental results and distribution is used to justify the good experimental results statistically.

...read moreread less

Journal Article•

Robust Text Independent Speaker Identification Using Hybrid GMM-SVM System.

[...]

Siwar Zribi Boujelbene, Dorra Ben Ayed Mezghanni, Noureddine Ellouze

01 Jan 2009-International Journal of Digital Content Technology and Its Applications

TL;DR: This paper introduces and motivates the use of the statistical method Gaussian Mixture Model (GMM) and Support Vector Machines (SVM) for robust textindependent speaker identification and proves that the hybrid GMM-SVM system is significantly more preferment than the SVM system.

...read moreread less

Abstract: This paper introduces and motivates the use of the statistical method Gaussian Mixture Model (GMM) and Support Vector Machines (SVM) for robust textindependent speaker identification. Features are extracted from the dialect DR1 of the Timit corpus. They are presented by MFCC, energy, Delta and Delta-Delta coefficients. GMM is used to model the feature extractor of the input speech signal and SVM is used for handling the task of decision making. The SVM is trained using inputs, which are the feature vectors presented by the GMM. Our results prove that the hybrid GMM-SVM system is significantly more preferment than the SVM system. We report improvements of 85,37% amelioration in identification rate compared to the SVM identification rate.

...read moreread less

Journal Article•DOI•

Hybrid models based on biological approaches for speech recognition

[...]

Nabil Neggaz¹, Abdelkader Benyettou¹•Institutions (1)

University of Science and Technology of Oran Mohamed-Boudiaf¹

01 Dec 2009-Artificial Intelligence Review

TL;DR: The results obtained show that DE convergence speeds were faster than the ones of multiple population genetic algorithm and genetic algorithms, therefore DE algorithm seems to be a promising approach to engineering optimization problems.

...read moreread less

Abstract: This paper aims to adapt the Clonal Selection Algorithm (CSA) which is usually used to explain the basic features of artificial immune systems to the learning of Neural Networks, instead of Back Propagation. The CSA was first applied to a real world problem (IRIS database) then compared with an artificial immune network. CSA performance was contrasted with other versions of genetic algorithms such as: Differential Evolution (DE), Multiple Populations Genetic Algorithms (MPGA). The tested application in the simulation studies were IRIS (vegetal database) and TIMIT (phonetic database). The results obtained show that DE convergence speeds were faster than the ones of multiple population genetic algorithm and genetic algorithms, therefore DE algorithm seems to be a promising approach to engineering optimization problems. On the other hand, CSA demonstrated good performance at the level of pattern recognition, since the recognition rate was equal to 99.11% for IRIS database and 76.11% for TIMIT. Finally, the MPGA succeeded in generalizing all phonetic classes in a homogeneous way: 60% for the vowels and 63% for the fricatives, 68% for the plosives.

...read moreread less