
Showing papers on "TIMIT published in 2011"


Proceedings Article
12 Dec 2011
TL;DR: This paper introduces an easy-to-implement stochastic variational method (or equivalently, minimum description length loss function) that can be applied to most neural networks and revisits several common regularisers from a variational perspective.
Abstract: Variational methods have been previously explored as a tractable approximation to Bayesian inference for neural networks. However the approaches proposed so far have only been applicable to a few simple network architectures. This paper introduces an easy-to-implement stochastic variational method (or equivalently, minimum description length loss function) that can be applied to most neural networks. Along the way it revisits several common regularisers from a variational perspective. It also provides a simple pruning heuristic that can both drastically reduce the number of network weights and lead to improved generalisation. Experimental results are provided for a hierarchical multidimensional recurrent neural network applied to the TIMIT speech corpus.

1,341 citations
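The core recipe here is reparameterized sampling of weights from a diagonal Gaussian posterior plus a KL penalty against the prior, with pruning by posterior signal-to-noise ratio. Below is a minimal numpy sketch of that recipe on a linear-model stand-in (not the paper's recurrent network); all hyperparameters and sizes are illustrative.

```python
import numpy as np

# Minimal sketch of stochastic variational training: q(w) = N(mu, sigma^2)
# per weight, prior p(w) = N(0, s0^2), sigma parameterized via softplus(rho).
rng = np.random.default_rng(0)
N, D = 200, 5
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
y = X @ w_true + 0.1 * rng.normal(size=N)

mu, rho = np.zeros(D), np.full(D, -3.0)
s0, lr = 1.0, 1e-3
for step in range(2000):
    sigma = np.log1p(np.exp(rho))             # softplus keeps sigma positive
    eps = rng.normal(size=D)
    w = mu + sigma * eps                      # reparameterized weight sample
    resid = X @ w - y
    g_w = X.T @ resid                         # gradient of the data term w.r.t. w
    # KL(q||p) gradients for diagonal Gaussian posterior and prior
    g_mu = g_w + mu / s0**2
    g_sigma = g_w * eps + sigma / s0**2 - 1.0 / sigma
    mu -= lr * g_mu
    rho -= lr * g_sigma * (1.0 / (1.0 + np.exp(-rho)))  # chain rule via softplus

# Pruning heuristic: drop weights whose posterior signal-to-noise ratio is low.
snr = np.abs(mu) / np.log1p(np.exp(rho))
keep = snr > 1.0
print(f"kept {keep.sum()}/{D} weights")
```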


Proceedings ArticleDOI
22 May 2011
TL;DR: Deep Belief Networks work even better when their inputs are speaker adaptive, discriminative features, and on the standard TIMIT corpus, they give phone error rates of 19.6% using monophone HMMs and a bigram language model.
Abstract: Deep Belief Networks (DBNs) are multi-layer generative models. They can be trained to model windows of coefficients extracted from speech and they discover multiple layers of features that capture the higher-order statistical structure of the data. These features can be used to initialize the hidden units of a feed-forward neural network that is then trained to predict the HMM state for the central frame of the window. Initializing with features that are good at generating speech makes the neural network perform much better than initializing with random weights. DBNs have already been used successfully for phone recognition with input coefficients that are MFCCs or filterbank outputs [1, 2]. In this paper, we demonstrate that they work even better when their inputs are speaker adaptive, discriminative features. On the standard TIMIT corpus, they give phone error rates of 19.6% using monophone HMMs and a bigram language model and 19.4% using monophone HMMs and a trigram language model.

321 citations
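The pretraining building block described above is the RBM trained with contrastive divergence. A minimal CD-1 sketch for one binary-binary layer follows; the paper stacks such layers and then fine-tunes the resulting feed-forward net on HMM state targets. Layer sizes and the learning rate are illustrative.

```python
import numpy as np

# One contrastive-divergence (CD-1) update for a single binary RBM layer.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n_vis, n_hid, lr = 39, 64, 0.01
W = 0.01 * rng.normal(size=(n_vis, n_hid))
b_v, b_h = np.zeros(n_vis), np.zeros(n_hid)

def cd1_update(v0):
    """CD-1 on a batch of visible vectors v0 of shape (batch, n_vis)."""
    global W, b_v, b_h
    p_h0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)  # sample hidden units
    p_v1 = sigmoid(h0 @ W.T + b_v)                      # reconstruction
    p_h1 = sigmoid(p_v1 @ W + b_h)
    n = v0.shape[0]
    W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / n
    b_v += lr * (v0 - p_v1).mean(axis=0)
    b_h += lr * (p_h0 - p_h1).mean(axis=0)

batch = (rng.random((32, n_vis)) < 0.3).astype(float)   # toy binary "frames"
for _ in range(100):
    cd1_update(batch)
```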


Proceedings ArticleDOI
22 May 2011
TL;DR: A novel approach for modeling speech sound waves using a Restricted Boltzmann machine (RBM) with a novel type of hidden variable is presented and initial results demonstrate phoneme recognition performance better than the current state-of-the-art for methods based on Mel cepstrum coefficients.
Abstract: State of the art speech recognition systems rely on preprocessed speech features such as Mel cepstrum or linear predictive coding coefficients that collapse high dimensional speech sound waves into low dimensional encodings. While these have been successfully applied in speech recognition systems, such low dimensional encodings may lose some relevant information and express other information in a way that makes it difficult to use for discrimination. Higher dimensional encodings could both improve performance in recognition tasks, and also be applied to speech synthesis by better modeling the statistical structure of the sound waves. In this paper we present a novel approach for modeling speech sound waves using a Restricted Boltzmann machine (RBM) with a novel type of hidden variable and we report initial results demonstrating phoneme recognition performance better than the current state-of-the-art for methods based on Mel cepstrum coefficients.

223 citations


Proceedings Article
01 Jan 2011
TL;DR: This paper introduces a novel pitch tracking database (PTDB) including ground truth signals obtained from a laryngograph, and evaluates two multipitch tracking systems on a subset of speakers to provide a benchmark for further research activities.
Abstract: In this paper, we introduce a novel pitch tracking database (PTDB) including ground truth signals obtained from a laryngograph. The database, referenced as PTDB-TUG, consists of 2342 phonetically rich sentences taken from the TIMIT corpus. Each sentence was recorded at least once by a male and a female native speaker. In total, the database contains 4720 recordings from 10 male and 10 female speakers. Furthermore, we evaluated two multipitch tracking systems on a subset of speakers to provide a benchmark for further research activities. The database can be downloaded at http://www.spsc.tugraz.at/tools.

123 citations


Journal Article
TL;DR: A new objective for graph-based semi-supervised learning based on minimizing the Kullback-Leibler divergence between discrete probability measures that encode class membership probabilities is described and generalized into a form that includes the standard squared-error loss, and a geometric rate of convergence is proved.
Abstract: We describe a new objective for graph-based semi-supervised learning based on minimizing the Kullback-Leibler divergence between discrete probability measures that encode class membership probabilities. We show how the proposed objective can be efficiently optimized using alternating minimization. We prove that the alternating minimization procedure converges to the correct optimum and derive a simple test for convergence. In addition, we show how this approach can be scaled to solve the semi-supervised learning problem on very large data sets, for example, in one instance we use a data set with over 10^8 samples. In this context, we propose a graph node ordering algorithm that is also applicable to other graph-based semi-supervised learning approaches. We compare the proposed approach against other standard semi-supervised learning algorithms on the semi-supervised learning benchmark data sets (Chapelle et al., 2007), and other real-world tasks such as text classification on Reuters and WebKB, speech phone classification on TIMIT and Switchboard, and linguistic dialog-act tagging on Dihana and Switchboard. In each case, the proposed approach outperforms the state-of-the-art. Lastly, we show that our objective can be generalized into a form that includes the standard squared-error loss, and we prove a geometric rate of convergence in that case.

111 citations
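The setup can be pictured with a small sketch: class membership measures sit on graph nodes, labeled nodes are clamped, and information propagates along weighted edges. The simple weighted averaging below is only a baseline illustration of that setup; the paper's method instead alternates closed-form KL-divergence updates between two coupled sets of measures. Graph size and labels are illustrative.

```python
import numpy as np

# Baseline label propagation over class distributions on a weighted graph.
rng = np.random.default_rng(0)
n, n_classes = 8, 3
W = rng.random((n, n)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)  # symmetric weights
labeled = {0: 0, 1: 1, 2: 2}                   # node -> class for the seed set

P = np.full((n, n_classes), 1.0 / n_classes)   # class membership measures
for i, y in labeled.items():
    P[i] = np.eye(n_classes)[y]

for _ in range(50):
    P_new = W @ P / W.sum(axis=1, keepdims=True)  # average neighbors' measures
    for i, y in labeled.items():                  # clamp labeled nodes
        P_new[i] = np.eye(n_classes)[y]
    P = P_new

print(P.round(2))
```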


Journal ArticleDOI
TL;DR: A simple hierarchical architecture consisting of two multilayer perceptron (MLP) classifiers in tandem to estimate the phonetic class conditional probabilities yields higher phoneme recognition accuracies than the conventional single MLP-based system.
Abstract: We analyze a simple hierarchical architecture consisting of two multilayer perceptron (MLP) classifiers in tandem to estimate the phonetic class conditional probabilities. In this hierarchical setup, the first MLP classifier is trained using standard acoustic features. The second MLP is trained using the posterior probabilities of phonemes estimated by the first, but with a long temporal context of around 150-230 ms. Through extensive phoneme recognition experiments, and the analysis of the trained second MLP using Volterra series, we show that 1) the hierarchical system yields higher phoneme recognition accuracies (an absolute improvement of 3.5% and 9.3% on TIMIT and CTS, respectively) over the conventional single MLP-based system, 2) there exists useful information in the temporal trajectories of the posterior feature space, spanning around 230 ms of context, 3) the second MLP learns the phonetic temporal patterns in the posterior features, which include the phonetic confusions at the output of the first MLP as well as the phonotactics of the language as observed in the training data, and 4) the second MLP classifier requires fewer parameters and can be trained with less training data.

86 citations
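The distinctive ingredient is the second MLP's input: first-stage phoneme posteriors stacked over a long temporal context (around 150-230 ms). A sketch of that input construction, assuming a 10 ms frame rate; dimensions are illustrative.

```python
import numpy as np

# Build the second MLP's input by stacking first-MLP posteriors over a
# ~230 ms window (23 frames at a 10 ms frame rate).
def stack_context(posteriors, left=11, right=11):
    """posteriors: (T, n_phones) -> (T, (left + 1 + right) * n_phones)."""
    T, d = posteriors.shape
    padded = np.pad(posteriors, ((left, right), (0, 0)), mode="edge")
    return np.hstack([padded[k:k + T] for k in range(left + 1 + right)])

post = np.random.default_rng(0).random((100, 40))   # fake first-MLP outputs
post /= post.sum(axis=1, keepdims=True)
X2 = stack_context(np.log(post))    # log-posteriors are a common choice
print(X2.shape)                     # (100, 23 * 40)
```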


Journal ArticleDOI
Tara N. Sainath, Bhuvana Ramabhadran, Michael Picheny, David Nahamoo, Dimitri Kanevsky
TL;DR: This paper combines the advantages of using both small and large vocabulary tasks by taking well-established techniques used in LVCSR systems and applying them on TIMIT to establish a new baseline, creating a novel set of exemplar-based sparse representation (SR) features.
Abstract: The use of exemplar-based methods, such as support vector machines (SVMs), k-nearest neighbors (kNNs) and sparse representations (SRs), in speech recognition has thus far been limited. Exemplar-based techniques utilize information about individual training examples and are computationally expensive, making it particularly difficult to investigate these methods on large-vocabulary continuous speech recognition (LVCSR) tasks. While research in LVCSR provides a good testbed to tackle real-world speech recognition problems, research in this area suffers from two main drawbacks. First, the overall complexity of an LVCSR system makes error analysis quite difficult. Second, exploring new research ideas on LVCSR tasks involves training and testing state-of-the-art LVCSR systems, which can involve long turnaround times. This makes a small vocabulary task such as TIMIT more appealing. TIMIT provides a phonetically rich and hand-labeled corpus that allows easy insight into new algorithms. However, research ideas explored for small vocabulary tasks do not always provide gains on LVCSR systems. In this paper, we combine the advantages of using both small and large vocabulary tasks by taking well-established techniques used in LVCSR systems and applying them on TIMIT to establish a new baseline. We then utilize these existing LVCSR techniques in creating a novel set of exemplar-based sparse representation (SR) features. Using these existing LVCSR techniques, we achieve a phonetic error rate (PER) of 19.4% on the TIMIT task. The additional use of SR features reduces the PER to 18.6%. We then explore applying the SR features to a large vocabulary Broadcast News task, where we achieve a 0.3% absolute reduction in word error rate (WER).

74 citations


Proceedings ArticleDOI
22 May 2011
TL;DR: A new set of feature parameters based on the Hilbert envelope of Gammatone filterbank outputs is proposed to improve SID performance in the presence of room reverberation, indicating significant improvement over the baseline system with MFCCs plus cepstral mean subtraction (CMS).
Abstract: It is well known that MFCC-based speaker identification (SID) systems easily break down under mismatched training and test conditions. One such mismatch occurs when a SID system is trained on anechoic speech data, while testing is carried out on reverberant data collected via a distant microphone. In this study, a new set of feature parameters based on the Hilbert envelope of Gammatone filterbank outputs is proposed to improve SID performance in the presence of room reverberation. Considering two distinct perceptual effects of reverberation on speech signals, i.e., coloration and long-term reverberation, two different compensation strategies are integrated within the feature extraction framework to effectively suppress the effects of reverberation. Experimental evaluation is performed using speech material from the TIMIT corpus, four different measured room impulse responses (RIRs) from the Aachen impulse response (AIR) database, and a GMM-based SID system. The obtained results indicate significant improvement over the baseline system with MFCCs plus cepstral mean subtraction (CMS), confirming the effectiveness of the proposed feature parameters for SID under reverberant mismatched conditions.

72 citations
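A sketch of the front end's first two stages (Gammatone filtering, then Hilbert envelope extraction) using scipy; the channel count, center frequencies, and the crude per-frame summary are illustrative, and the paper's reverberation compensation steps are omitted.

```python
import numpy as np
from scipy.signal import gammatone, lfilter, hilbert

# Gammatone filterbank followed by per-channel Hilbert envelopes.
fs = 16000
x = np.random.default_rng(0).normal(size=fs)   # stand-in for a speech segment
centers = np.geomspace(100, 7000, num=24)      # 24-channel filterbank

envelopes = []
for fc in centers:
    b, a = gammatone(fc, "iir", fs=fs)         # 4th-order gammatone (default)
    y = lfilter(b, a, x)
    env = np.abs(hilbert(y))                   # Hilbert envelope per channel
    envelopes.append(env.mean())               # crude per-segment energy summary
feat = np.log(np.asarray(envelopes))           # log-compressed feature vector
print(feat.shape)
```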


Journal ArticleDOI
TL;DR: This paper presents a new feature extraction technique for speaker recognition using Radon transform (RT) and discrete cosine transform (DCT) and highlights the superiority of the proposed method over some of the existing algorithms.

51 citations


Proceedings ArticleDOI
22 May 2011
TL;DR: A new approach for phoneme recognition that aims at minimizing the phoneme error rate is described; the algorithm is derived by finding the gradient of a PAC-Bayesian bound and minimizing it by stochastic gradient descent.
Abstract: We describe a new approach for phoneme recognition which aims at minimizing the phoneme error rate. Building on structured prediction techniques, we formulate the phoneme recognizer as a linear combination of feature functions. We state a PAC-Bayesian generalization bound, which gives an upper-bound on the expected phoneme error rate in terms of the empirical phoneme error rate. Our algorithm is derived by finding the gradient of the PAC-Bayesian bound and minimizing it by stochastic gradient descent. The resulting algorithm is iterative and easy to implement. Experiments on the TIMIT corpus show that our method achieves the lowest phoneme error rate compared to other discriminative and generative models with the same expressive power.

39 citations


Journal ArticleDOI
TL;DR: The results of experiments indicate that with the optimized feature set, the performance of the ASV system is improved and the speed of verification is significantly increased, since ACO reduces the number of features by over 80%, which consequently decreases the complexity of the ASV system.
Abstract: With the growing trend toward remote security verification procedures for telephone banking, biometric security measures and similar applications, automatic speaker verification (ASV) has received a lot of attention in recent years. The complexity of an ASV system and its verification time depend on the number of feature vectors, their dimensionality, the complexity of the speaker models and the number of speakers. In this paper, we concentrate on optimizing the dimensionality of the feature space by selecting relevant features. At present there are several methods for feature selection in ASV systems. To improve the performance of ASV systems we present another method, based on the ant colony optimization (ACO) algorithm. After the feature reduction phase, feature vectors are applied to a Gaussian mixture model universal background model (GMM-UBM), which is a text-independent speaker verification model. The performance of the proposed algorithm is compared to the performance of a genetic algorithm on the task of feature selection on the TIMIT corpus. The results of the experiments indicate that with the optimized feature set, the performance of the ASV system is improved. Moreover, the speed of verification is significantly increased, since by use of ACO the number of features is reduced by over 80%, which consequently decreases the complexity of our ASV system.

Proceedings ArticleDOI
28 Nov 2011
TL;DR: Experimental results show that the use of ASM posteriorgrams leads to consistently better detection performance than conventional GMM posteriorgrams.
Abstract: This paper describes a study on query-by-example spoken term detection (STD) using the acoustic segment modeling technique. Acoustic segment models (ASMs) are a set of hidden Markov models (HMM) that are obtained in an unsupervised manner without using any transcription information. The training of ASMs follows an iterative procedure, which consists of the steps of initial segmentation, segment labeling, and HMM parameter estimation. The ASMs are incorporated into a template-matching framework for query-by-example STD. Both the spoken query examples and the test utterances are represented by frame-level ASM posteriorgrams. Segmental dynamic time warping (DTW) is applied to match the query with the test utterance and locate the possible occurrences. The performance of the proposed approach is evaluated with different DTW local distance measures on the TIMIT and Fisher corpora, respectively. Experimental results show that the use of ASM posteriorgrams leads to consistently better detection performance than conventional GMM posteriorgrams.
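The matching stage can be sketched as DTW between two posteriorgrams with the -log inner-product local distance that is common in posteriorgram-based STD; plain DTW is shown for brevity, whereas the paper uses segmental DTW to locate occurrences inside long utterances. Unit count and sequence lengths are illustrative.

```python
import numpy as np

# DTW between a query posteriorgram Q and a test posteriorgram D.
def dtw_cost(Q, D):
    """Q: (m, k), D: (n, k); rows are posterior vectors summing to 1."""
    m, n = len(Q), len(D)
    dist = -np.log(np.clip(Q @ D.T, 1e-10, None))   # (m, n) local distances
    acc = np.full((m + 1, n + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[m, n] / (m + n)                      # length-normalized cost

rng = np.random.default_rng(0)
Q = rng.dirichlet(np.ones(50), size=30)    # fake query posteriorgram (ASM units)
D = rng.dirichlet(np.ones(50), size=120)   # fake test posteriorgram
print(dtw_cost(Q, D))
```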

Proceedings ArticleDOI
Dong Yu, Li Deng
27 Aug 2011
TL;DR: A set of novel, batch-mode algorithms developed recently is described as one key component in scalable, deep neural network based speech recognition; the essence is to structure the single-hidden-layer neural network so that the upper layer's weights can be written as a deterministic function of the lower layer's weights.
Abstract: We describe a set of novel, batch-mode algorithms we developed recently as one key component in scalable, deep neural network based speech recognition. The essence of these algorithms is to structure the single-hidden-layer neural network so that the upper layer's weights can be written as a deterministic function of the lower layer's weights. This structure is effectively exploited during training by plugging the deterministic function into the least-squares error objective function while calculating the gradients. Accelerating techniques are further exploited to make the weight updates move along the most promising directions. The experiments on TIMIT frame-level phone and phone-state classification show strong results. In particular, the error rate drops strictly monotonically as the minibatch size increases. This demonstrates the potential of the proposed batch-mode algorithms for large-scale speech recognition, since they are easily parallelizable across computers.
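The structural trick can be sketched directly: with hidden activations H = sigmoid(XW), the least-squares-optimal upper layer is U = pinv(H) @ T, a deterministic function of W. The sketch below uses the simpler plug-in gradient that holds U fixed per iteration rather than the paper's exact gradient through the deterministic function; all shapes and data are illustrative.

```python
import numpy as np

# Closed-form upper layer on top of a trainable lower layer.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 39))                    # frames x input dim
T = np.eye(10)[rng.integers(0, 10, size=500)]     # one-hot frame targets
W = 0.1 * rng.normal(size=(39, 128))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(20):
    H = sigmoid(X @ W)
    U = np.linalg.pinv(H) @ T                     # upper layer in closed form
    E = H @ U - T                                 # residual of the LS objective
    G = X.T @ (E @ U.T * H * (1 - H))             # plug-in gradient, U held fixed
    W -= 1e-3 * G
print(float((E ** 2).sum()))                      # squared error after training
```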

Proceedings ArticleDOI
22 May 2011
TL;DR: The sparse multilayer perceptron (SMLP) is introduced, which learns the transformation from the inputs to the targets as in a multilayer perceptron while the outputs of one of the internal hidden layers are forced to be sparse.
Abstract: This paper introduces the sparse multilayer perceptron (SMLP), which learns the transformation from the inputs to the targets as in a multilayer perceptron (MLP) while the outputs of one of the internal hidden layers are forced to be sparse. This is achieved by adding a sparse regularization term to the cross-entropy cost and learning the parameters of the network to minimize the joint cost. On the TIMIT phoneme recognition task, the SMLP-based system trained using perceptual linear prediction (PLP) features performs better than the conventional MLP-based system. Furthermore, their combination yields a phoneme error rate of 21.2%, a relative improvement of 6.2% over the baseline.
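The joint cost is easy to state concretely: cross-entropy plus an L1 sparsity penalty on one internal hidden layer's activations. A minimal sketch, with illustrative network sizes and penalty weight:

```python
import numpy as np

# SMLP-style joint cost: cross-entropy + L1 penalty on hidden activations.
def smlp_cost(X, Y, W1, W2, lam=1e-3):
    """X: (n, d) inputs, Y: (n, k) one-hot targets."""
    H = np.tanh(X @ W1)                      # hidden layer forced toward sparsity
    logits = H @ W2
    logits -= logits.max(axis=1, keepdims=True)
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)
    xent = -np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))
    return xent + lam * np.abs(H).mean()     # joint cost to minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 39))
Y = np.eye(40)[rng.integers(0, 40, size=64)]
W1 = 0.1 * rng.normal(size=(39, 100))
W2 = 0.1 * rng.normal(size=(100, 40))
print(smlp_cost(X, Y, W1, W2))
```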

Proceedings ArticleDOI
24 Mar 2011
TL;DR: Perceptual linear predictive cepstrum yields accuracies of 86% and 93% for speaker-independent isolated digit recognition using VQ and the combination of VQ and HMM speech models, respectively.
Abstract: The main objective of this paper is to explore the effectiveness of perceptual features for isolated digit and continuous speech recognition. The proposed perceptual features are captured and codebook indices are extracted. The expectation-maximization algorithm is used to generate HMM models for the speech data. The speech recognition system is evaluated on clean test utterances, and the experimental results reveal the performance of the proposed algorithm in recognizing isolated digits and continuous speech based on the maximum log-likelihood between the test features and the HMM model for each utterance. The performance of these features is tested on utterances randomly chosen from the "TI Digits_1", "TI Digits_2" and "TIMIT" databases. The algorithm is tested with VQ and with a combination of VQ and HMM speech modeling techniques. Perceptual linear predictive cepstrum yields accuracies of 86% and 93% for speaker-independent isolated digit recognition using VQ and the combination of VQ and HMM speech models, respectively. This feature also gives 99% and 100% accuracy for speaker-independent continuous speech recognition using VQ and the combination of VQ and HMM speech modeling techniques.

Journal ArticleDOI
TL;DR: The proposed fusion scheme, when implemented with SVR-based fusion, improves phone duration prediction accuracy over that of the best individual model by 1.6% and 1.8% on the WCL-1 database.

Journal ArticleDOI
TL;DR: A recursive way to efficiently calculate the modified forward variables of an HSMM is designed, and the positive effect of different robustness-related schemes adopted in the proposed VAD is demonstrated.

Proceedings ArticleDOI
22 May 2011
TL;DR: A novel framework to integrate articulatory features (AFs) into an HMM-based ASR system by using posterior probabilities of different AFs directly as observation features in a Kullback-Leibler divergence based HMM (KL-HMM) system yields the best performance on the TIMIT phoneme recognition task.
Abstract: In this paper, we propose a novel framework to integrate articulatory features (AFs) into an HMM-based ASR system. This is achieved by using posterior probabilities of different AFs (estimated by multilayer perceptrons) directly as observation features in a Kullback-Leibler divergence based HMM (KL-HMM) system. On the TIMIT phoneme recognition task, the proposed framework yields a phoneme recognition accuracy of 72.4%, which is comparable to a KL-HMM system using posterior probabilities of phonemes as features (72.7%). Furthermore, the best performance, a phoneme recognition accuracy of 73.5%, is achieved by jointly modeling AF probabilities and phoneme probabilities as features. This shows the efficacy and flexibility of the proposed approach.
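The KL-HMM emission score can be sketched in a few lines: each state holds a categorical distribution over the posterior-feature dimensions, and a frame is scored by the KL divergence between that distribution and the observed AF/phoneme posterior vector. Dimensions and values here are illustrative.

```python
import numpy as np

# KL-HMM local score between a state's categorical distribution y_s and
# an observed frame posterior z_t; smaller means a better match.
def kl_local_score(y_s, z_t, eps=1e-10):
    """KL(y_s || z_t) for two probability vectors."""
    y = np.clip(y_s, eps, 1.0)
    z = np.clip(z_t, eps, 1.0)
    return float(np.sum(y * (np.log(y) - np.log(z))))

rng = np.random.default_rng(0)
y_state = rng.dirichlet(np.ones(30))   # trained state distribution
z_frame = rng.dirichlet(np.ones(30))   # MLP posterior for one frame
print(kl_local_score(y_state, z_frame))
```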

Proceedings ArticleDOI
22 May 2011
TL;DR: This paper investigates the use of arccosine kernels for speech recognition, using these kernels in a hybrid support vector machine/hidden Markov model recognition system.
Abstract: Neural networks are a useful alternative to Gaussian mixture models for acoustic modeling; however, training multilayer networks involves a difficult, nonconvex optimization that requires some “art” to make work well in practice. In this paper we investigate the use of arccosine kernels for speech recognition, using these kernels in a hybrid support vector machine/hidden Markov model recognition system. Arccosine kernels approximate the computation in a certain class of infinite neural networks using a single kernel function, but can be used in learners that require only a convex optimization for training. Phone recognition experiments on the TIMIT corpus show that arccosine kernels can outperform radial basis function kernels.
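The kernels themselves are compact. Following Cho and Saul's construction, the degree-n arccosine kernel is k_n(x, y) = (1/pi) ||x||^n ||y||^n J_n(theta) with theta the angle between x and y; a sketch for degrees 0 and 1 (degree 1 corresponds to an infinite single-hidden-layer network with ramp activations):

```python
import numpy as np

# Arccosine kernel (Cho & Saul) for degrees n = 0 and n = 1.
def arccos_kernel(x, y, n=1):
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    theta = np.arccos(np.clip(x @ y / (nx * ny), -1.0, 1.0))
    if n == 0:
        J = np.pi - theta                               # step activation
    elif n == 1:
        J = np.sin(theta) + (np.pi - theta) * np.cos(theta)  # ramp activation
    else:
        raise ValueError("only degrees 0 and 1 sketched here")
    return (nx ** n) * (ny ** n) * J / np.pi

rng = np.random.default_rng(0)
x, y = rng.normal(size=39), rng.normal(size=39)
print(arccos_kernel(x, y, n=0), arccos_kernel(x, y, n=1))
```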

Journal ArticleDOI
TL;DR: This research investigates a novel method that combines voiced formant features with standard MFCC features to enhance TIMIT phone recognition, and shows a clear improvement over the existing state of the art.
Abstract: Combining multiple acoustic features has great potential to improve Automatic Speech Recognition (ASR) accuracy. Our contribution in this research is a novel method that combines voiced formant features with standard MFCC features to enhance TIMIT phone recognition. These voiced features provide accurate formant frequencies using a Variable-Order LPC Coding (VO-LPC) algorithm combined with continuity constraints. The estimated formants are concatenated with MFCC features whenever a voiced frame is detected. For feature-level combination, a Heteroscedastic Linear Discriminant Analysis (HLDA) based approach has been used successfully to find an optimal linear combination of successive vectors of a single feature stream. A series of speaker-independent continuous-speech phone recognition experiments was carried out using a subset of the large read-speech TIMIT corpus; the Hidden Markov Model Toolkit (HTK) was used throughout all experiments. Using this feature-level combination, optimized mixture splitting and a bigram language model, a detailed analysis of the combination's performance is presented for Context-Independent (CI) and Context-Dependent (CD) Hidden Markov Models (HMMs). Experimental results from the proposed procedure show that the phone error rate decreases by about 3%; at the phonetic group level, improvements of 8% and 10% are observed for the vowel and liquid groups, respectively. These results demonstrate a clear improvement over the existing state of the art.
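The voiced-feature side can be sketched as classic formant extraction from LPC pole angles; the paper's variable-order LPC with continuity constraints is replaced here by a fixed-order toy example, and the selection thresholds are illustrative heuristics.

```python
import numpy as np

# Formant frequencies from the roots of an LPC polynomial.
def lpc_formants(a, fs=16000):
    """a: LPC polynomial [1, a1, ..., ap] -> sorted formant frequencies in Hz."""
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]               # one of each conjugate pair
    freqs = np.angle(roots) * fs / (2 * np.pi)      # pole angle -> frequency
    bw = -fs / np.pi * np.log(np.abs(roots))        # pole radius -> bandwidth
    keep = (freqs > 90) & (bw < 400)                # typical formant heuristics
    return np.sort(freqs[keep])

# Toy LPC polynomial with resonances near 700 Hz and 1200 Hz at fs = 16 kHz
fs = 16000
poles = [0.98 * np.exp(2j * np.pi * f / fs) for f in (700, 1200)]
a = np.real(np.poly(poles + [np.conj(p) for p in poles]))
print(lpc_formants(a, fs))   # should recover ~[700, 1200]
```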

Proceedings ArticleDOI
15 Jun 2011
TL;DR: This paper defines the keyword spotting problem as a binary classification problem and proposes a discriminative approach for solving it, which exploits an evolutionary algorithm to determine the separating hyperplane between two classes: the class of sentences containing the target keywords and the class of sentences that do not.
Abstract: Keyword spotting refers to the detection of all occurrences of any given word in a speech utterance. In this paper, we define the keyword spotting problem as a binary classification problem and propose a discriminative approach for solving it. Our approach exploits an evolutionary algorithm to determine the separating hyperplane between two classes: the class of sentences containing the target keywords and the class of sentences that do not. The results on TIMIT indicate that the proposed method achieves a good figure of merit (FOM) of 95.7 (the average true detection rate at different false-alarm rates per keyword per hour) and an acceptable speed of 3.3 real-time factor (RTF).

Journal ArticleDOI
TL;DR: An iterative algorithm is derived that exchanges information between the speech enhancement and speaker identification tasks that makes use of speaker dependent speech priors and demonstrates that speaker identification accuracy is improved.
Abstract: We present a variational Bayesian algorithm for joint speech enhancement and speaker identification that makes use of speaker-dependent speech priors. Our work is built on the intuition that speaker-dependent priors would work better than priors that attempt to capture global speech properties. We derive an iterative algorithm that exchanges information between the speech enhancement and speaker identification tasks. With cleaner speech we are able to make better identification decisions, and with the speaker-dependent priors we are able to improve speech enhancement performance. We present experimental results using the TIMIT data set which confirm the speech enhancement performance of the algorithm by measuring signal-to-noise ratio (SNR) improvement and perceptual quality improvement via the Perceptual Evaluation of Speech Quality (PESQ) score. We also demonstrate the ability of the algorithm to perform voice activity detection (VAD). The experimental results also demonstrate that speaker identification accuracy is improved.

Proceedings ArticleDOI
22 May 2011
TL;DR: Experimental results of continuous phoneme recognition on the TIMIT core test set and a Japanese read speech recognition task using monophones showed that HCNF was superior to HCRF and to an HMM trained in the MPE manner.
Abstract: Hidden Conditional Random Fields (HCRFs) are a very promising approach to modeling speech. However, because an HCRF computes the score of a hypothesis by summing up linearly weighted features, it cannot consider non-linearity among features, which is crucial for speech recognition. In this paper, we extend the HCRF by incorporating the gate functions used in neural networks and propose a new model called Hidden Conditional Neural Fields (HCNF). Unlike conventional approaches, HCNF can be trained without any initial model and can incorporate any kind of features. Experimental results of continuous phoneme recognition on the TIMIT core test set and a Japanese read speech recognition task using monophones showed that HCNF was superior to HCRF and to an HMM trained in the MPE manner.

Journal ArticleDOI
TL;DR: It is found that there are cases where a conventional VQ-based system outperforms the modern systems, and that the impact of distance metrics on the performance of the conventional and modern systems depends on the recognition task imposed (verification/identification).

Proceedings ArticleDOI
22 May 2011
TL;DR: This work creates a new dictionary which is a function of the phonetic labels of the original dictionary, and refers to these new features as Spif, which allow for a 2.9% relative reduction in Phonetic Error Rate on the TIMIT phonetic recognition task and a 4.8% relative improvement on a large vocabulary 50 hour Broadcast News task.
Abstract: Exemplar-based techniques, such as k-nearest neighbors (kNNs) and Sparse Representations (SRs), can be used to model a test sample from a few training points in a dictionary set. In past work, we have shown that using a SR approach for phonetic classification allows for a higher accuracy than other classification techniques. These phones are the basic units of speech to be recognized. Motivated by this result, we create a new dictionary which is a function of the phonetic labels of the original dictionary. The SR method now selects relevant samples from this new dictionary to create a new feature representation of the test sample, where the new feature is better linked to the actual units to be recognized. We will refer to these new features as Spif. We present results using these new Spif features in a Hidden Markov Model (HMM) framework for speech recognition. We find that the Spif features allow for a 2.9% relative reduction in Phonetic Error Rate (PER) on the TIMIT phonetic recognition task. Furthermore, we find that the Spif features allow for a 4.8% relative improvement in Word Error Rate (WER) on a large vocabulary 50 hour Broadcast News task.
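A sketch of the sparse-representation step behind such features: decompose a test frame over a dictionary of labeled training exemplars (a generic ISTA lasso solver is used here, not necessarily the authors' solver) and pool the recovered coefficient mass by phone label. All sizes are illustrative.

```python
import numpy as np

# ISTA for the lasso: minimize 0.5 * ||y - H b||^2 + lam * ||b||_1.
def ista(H, y, lam=0.1, n_iter=200):
    L = np.linalg.norm(H, ord=2) ** 2       # Lipschitz constant of the gradient
    b = np.zeros(H.shape[1])
    for _ in range(n_iter):
        g = b - H.T @ (H @ b - y) / L
        b = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)  # soft threshold
    return b

rng = np.random.default_rng(0)
H = rng.normal(size=(39, 200))              # 200 training exemplars as columns
labels = rng.integers(0, 40, size=200)      # phone label of each exemplar
y = H[:, 7] + 0.05 * rng.normal(size=39)    # test frame near exemplar 7
b = ista(H, y)
pooled = np.bincount(labels, weights=np.abs(b), minlength=40)
print(labels[7], pooled.argmax())           # pooled mass should peak at that phone
```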

Proceedings Article
01 Jan 2011
TL;DR: Except for the case of GMM with a single mixture per state, the proposed update rule provides lower error rates, both in terms of frame error rate and phone error rate, than other approaches, including MCE and large margin.
Abstract: We explore discriminative training of HMM parameters that directly minimizes the expected error rate. In discriminative training one is interested in training a system to minimize a desired error function, like word error rate, phone error rate, or frame error rate. We review a recent method (McAllester, Hazan and Keshet, 2010), which introduces an analytic expression for the gradient of the expected error rate. The analytic expression leads to a perceptron-like update rule, which is adapted here for training HMMs in an online fashion. While the proposed method can work with any type of error function used in speech recognition, we evaluated it on phoneme recognition on TIMIT, where the desired error function used for training was the frame error rate. Except for the case of a GMM with a single mixture per state, the proposed update rule provides lower error rates, both in terms of frame error rate and phone error rate, than other approaches, including MCE and large margin.

Book ChapterDOI
14 Jun 2011
TL;DR: It is shown that an MTL MLP can estimate articulatory features compactly and efficiently by learning the inter-feature dependencies through a common hidden layer representation, and that adding phoneme classification as a subtask while estimating articulatory features improves both articulatory feature estimation and phoneme recognition.
Abstract: Speech sounds can be characterized by articulatory features. Articulatory features are typically estimated using a set of multilayer perceptrons (MLPs), i.e., a separate MLP is trained for each articulatory feature. In this paper, we investigate a multitask learning (MTL) approach for joint estimation of articulatory features, with and without phoneme classification as a subtask. Our studies show that an MTL MLP can estimate articulatory features compactly and efficiently by learning the inter-feature dependencies through a common hidden layer representation. Furthermore, adding phoneme classification as a subtask while estimating articulatory features improves both articulatory feature estimation and phoneme recognition. On the TIMIT phoneme recognition task, articulatory feature posterior probabilities obtained by the MTL MLP achieve a phoneme recognition accuracy of 73.2%, while the phoneme posterior probabilities achieve an accuracy of 74.0%.
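The architecture reduces to one shared hidden layer feeding separate softmax heads, one per articulatory feature group plus an optional phoneme head. A forward-pass sketch with illustrative group sizes (training, i.e., backpropagating all heads' losses through the shared layer, is omitted):

```python
import numpy as np

# Multitask MLP: shared hidden layer, one softmax head per task.
def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
d_in, d_hid = 351, 500                        # e.g. 9 stacked 39-dim frames
heads = {"manner": 8, "place": 10, "voicing": 3, "phoneme": 40}
W_shared = 0.05 * rng.normal(size=(d_in, d_hid))
W_heads = {k: 0.05 * rng.normal(size=(d_hid, n)) for k, n in heads.items()}

X = rng.normal(size=(16, d_in))               # a minibatch of frames
H = np.tanh(X @ W_shared)                     # shared representation
outputs = {k: softmax(H @ W) for k, W in W_heads.items()}
for k, P in outputs.items():
    print(k, P.shape)
```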

Proceedings ArticleDOI
22 May 2011
TL;DR: This paper combines three simple refinements proposed recently to improve HMM/ANN hybrid models to apply a hierarchy of two nets, where the second net models the contextual relations of the state posteriors produced by the first network.
Abstract: In this paper we combine three simple refinements proposed recently to improve HMM/ANN hybrid models. The first refinement is to apply a hierarchy of two nets, where the second net models the contextual relations of the state posteriors produced by the first network. The second idea is to train the network on context-dependent units (HMM states) instead of context-independent phones or phone states. As the latter refinement results in a lot of output neurons, combining the two methods directly would be problematic. Hence the third trick is to shrink the output layer of the first net using the bottleneck technique before applying the second net on top of it. The phone recognition results obtained on the TIMIT database demonstrate that both the context-dependent and the 2-stage modeling methods can bring about marked improvements. Using them in combination, however, results in a further significant gain in accuracy. With the bottleneck technique a further improvement can be obtained, especially when the number of context-dependent units is large.

Journal ArticleDOI
01 May 2011
TL;DR: This paper proposes a framework to improve independent feature transformations such as PCA and the HLDA transformation for MFCCs using the minimum classification error criterion, modifying full transformation matrices such that the classification error is minimized for the mapped features.
Abstract: Feature extraction is an important step in pattern classification and speech recognition. Extracted features should discriminate classes from each other while being robust to environmental conditions such as noise. For this purpose, transformations are applied to the features. In this paper, we propose a framework to improve independent feature transformations such as PCA (Principal Component Analysis) and HLDA (Heteroscedastic LDA) for MFCCs using the minimum classification error criterion. In this method, we modify full transformation matrices such that the classification error is minimized for the mapped features. We do not reduce the feature vector dimension in this mapping. The proposed methods are evaluated for continuous phoneme recognition on clean and noisy TIMIT. Experimental results show that our proposed methods improve the performance of the PCA and HLDA transformations for MFCCs in both clean and noisy conditions.

Proceedings ArticleDOI
01 Dec 2011
TL;DR: This paper forms the exemplar-based classification paradigm as a sparse representation (SR) problem, and explores the use of convex hull constraints to enforce both regularization and sparsity, and utilizes the Extended Baum-Welch (EBW) optimization technique to solve the SR problem.
Abstract: In this paper, we propose a novel exemplar based technique for classification problems where for every new test sample the classification model is re-estimated from a subset of relevant samples of the training data.We formulate the exemplar-based classification paradigm as a sparse representation (SR) problem, and explore the use of convex hull constraints to enforce both regularization and sparsity. Finally, we utilize the Extended Baum-Welch (EBW) optimization technique to solve the SR problem. We explore our proposed methodology on the TIMIT phonetic classification task, showing that our proposed method offers statistically significant improvements over common classification methods, and provides an accuracy of 82.9%, the best single-classifier number reported to date.