TIMIT

About: TIMIT is a research topic. Over its lifetime, 1,401 publications have been published within this topic, receiving 59,888 citations. The topic is also known as: TIMIT Acoustic-Phonetic Continuous Speech Corpus.


Papers
Proceedings Article
06 Dec 2010
TL;DR: This work uses the mean-covariance restricted Boltzmann machine (mcRBM) to learn features of speech data that serve as input into a standard DBN, and achieves a phone error rate superior to all published results on speaker-independent TIMIT to date.
Abstract: Straightforward application of Deep Belief Nets (DBNs) to acoustic modeling produces a rich distributed representation of speech data that is useful for recognition and yields impressive results on the speaker-independent TIMIT phone recognition task. However, the first-layer Gaussian-Bernoulli Restricted Boltzmann Machine (GRBM) has an important limitation, shared with mixtures of diagonal-covariance Gaussians: GRBMs treat different components of the acoustic input vector as conditionally independent given the hidden state. The mean-covariance restricted Boltzmann machine (mcRBM), first introduced for modeling natural images, is a much more representationally efficient and powerful way of modeling the covariance structure of speech data. Every configuration of the precision units of the mcRBM specifies a different precision matrix for the conditional distribution over the acoustic space. In this work, we use the mcRBM to learn features of speech data that serve as input into a standard DBN. The mcRBM features combined with DBNs allow us to achieve a phone error rate of 20.5%, which is superior to all published results on speaker-independent TIMIT to date.

326 citations
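The key claim in the abstract above is that every configuration of the mcRBM's binary precision units induces a different full precision matrix for the conditional distribution over the acoustic space. Below is a minimal NumPy sketch of that mechanism; the dimensions, the filter matrix C, the non-negative pooling matrix P, and the added identity term are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (hypothetical, not from the paper).
n_vis, n_factors, n_prec = 39, 64, 32   # acoustic dims, factors, precision units

C = rng.normal(scale=0.1, size=(n_vis, n_factors))  # visible-to-factor filters
P = rng.uniform(size=(n_factors, n_prec))           # factor-to-precision pooling (non-negative)

def conditional_precision(h_prec):
    """Precision matrix of p(v | h_prec) in an mcRBM-style model.

    Each binary configuration of the precision units h_prec gates the
    factors and hence yields a *different* full precision matrix over
    the acoustic space -- the property highlighted in the abstract.
    """
    gate = P @ h_prec                                # per-factor gains, shape (n_factors,)
    return C @ np.diag(gate) @ C.T + np.eye(n_vis)   # + I keeps it positive definite

h = rng.integers(0, 2, size=n_prec)                  # one configuration of precision units
Sigma_inv = conditional_precision(h)
print(Sigma_inv.shape)                               # (39, 39): full covariance structure
```

A diagonal-covariance GRBM, by contrast, fixes a single diagonal precision regardless of the hidden state, which is exactly the limitation the abstract points out.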

Proceedings Article
25 Mar 2012
TL;DR: This paper illustrates how each of three aspects of DBNs (fine-tuning as neural networks, many non-linear hidden layers, and generative pre-training) contributes to good recognition performance, using both phone recognition performance on the TIMIT corpus and a dimensionally reduced visualization of the feature vectors learned by the DBNs that preserves their similarity structure at multiple scales.
Abstract: Deep Belief Networks (DBNs) are a very competitive alternative to Gaussian mixture models for relating states of a hidden Markov model to frames of coefficients derived from the acoustic input. They are competitive for three reasons: DBNs can be fine-tuned as neural networks; DBNs have many non-linear hidden layers; and DBNs are generatively pre-trained. This paper illustrates how each of these three aspects contributes to the DBN's good recognition performance using both phone recognition performance on the TIMIT corpus and a dimensionally reduced visualization of the relationships between the feature vectors learned by the DBNs that preserves the similarity structure of the feature vectors at multiple scales. The same two methods are also used to investigate the most suitable type of input representation for a DBN.

322 citations
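The abstract separates generative pre-training from discriminative fine-tuning. Below is a compact PyTorch sketch of that recipe: greedy CD-1 pre-training of stacked binary RBMs, then unrolling the learned weights into a feed-forward network for fine-tuning. Layer sizes, sweep counts, and the 183-state output (61 TIMIT phones x 3 HMM states, a common convention) are assumptions for illustration, not the paper's exact setup.

```python
import torch
import torch.nn as nn

def cd1_step(W, b_v, b_h, v0, lr=1e-3):
    """One step of contrastive divergence (CD-1) for a binary RBM."""
    p_h0 = torch.sigmoid(v0 @ W + b_h)           # up
    h0 = torch.bernoulli(p_h0)
    p_v1 = torch.sigmoid(h0 @ W.T + b_v)         # down (reconstruction)
    p_h1 = torch.sigmoid(p_v1 @ W + b_h)         # up again
    W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / v0.size(0)
    b_v += lr * (v0 - p_v1).mean(0)
    b_h += lr * (p_h0 - p_h1).mean(0)

sizes = [429, 512, 512]                          # e.g. 11 frames x 39 coeffs -> hidden layers
data = torch.rand(256, sizes[0]).round()         # stand-in binary training batch

weights = []
v = data
for n_in, n_out in zip(sizes, sizes[1:]):        # greedy, layer by layer
    W = 0.01 * torch.randn(n_in, n_out)
    b_v, b_h = torch.zeros(n_in), torch.zeros(n_out)
    for _ in range(10):                          # a few CD-1 sweeps per layer
        cd1_step(W, b_v, b_h, v)
    weights.append((W, b_h))
    v = torch.sigmoid(v @ W + b_h)               # propagate data up for the next RBM

# Fine-tuning: unroll the generatively pre-trained weights into a
# feed-forward net and train discriminatively (e.g. HMM-state targets).
layers = []
for W, b_h in weights:
    lin = nn.Linear(W.size(0), W.size(1))
    with torch.no_grad():
        lin.weight.copy_(W.T)
        lin.bias.copy_(b_h)
    layers += [lin, nn.Sigmoid()]
net = nn.Sequential(*layers, nn.Linear(sizes[-1], 183))
```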

Proceedings Article
22 May 2011
TL;DR: Deep Belief Networks work even better when their inputs are speaker-adaptive, discriminative features; on the standard TIMIT corpus, they give phone error rates of 19.6% using monophone HMMs and a bigram language model.
Abstract: Deep Belief Networks (DBNs) are multi-layer generative models. They can be trained to model windows of coefficients extracted from speech and they discover multiple layers of features that capture the higher-order statistical structure of the data. These features can be used to initialize the hidden units of a feed-forward neural network that is then trained to predict the HMM state for the central frame of the window. Initializing with features that are good at generating speech makes the neural network perform much better than initializing with random weights. DBNs have already been used successfully for phone recognition with input coefficients that are MFCCs or filterbank outputs [1, 2]. In this paper, we demonstrate that they work even better when their inputs are speaker-adaptive, discriminative features. On the standard TIMIT corpus, they give phone error rates of 19.6% using monophone HMMs and a bigram language model and 19.4% using monophone HMMs and a trigram language model.

321 citations
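The pipeline described above feeds a window of per-frame coefficients to the network and trains it to predict the HMM state aligned with the window's central frame. Below is a hedged PyTorch sketch of that input/target pairing; the feature dimensionality, 11-frame context width, state inventory, and network shape are illustrative assumptions.

```python
import torch
import torch.nn as nn

def make_windows(frames, context=5):
    """Stack each frame with +/- `context` neighbours (11-frame windows)."""
    T, d = frames.shape
    padded = torch.cat([frames[:1].repeat(context, 1), frames,
                        frames[-1:].repeat(context, 1)])
    return torch.stack([padded[t:t + 2 * context + 1].reshape(-1)
                        for t in range(T)])       # (T, (2*context+1)*d)

frames = torch.randn(300, 40)                     # stand-in speaker-adaptive features per frame
states = torch.randint(0, 183, (300,))            # forced-alignment HMM state per frame

X = make_windows(frames)                          # (300, 440)
net = nn.Sequential(nn.Linear(X.size(1), 1024), nn.Sigmoid(),
                    nn.Linear(1024, 183))         # posteriors over HMM states
loss = nn.functional.cross_entropy(net(X), states)
loss.backward()                                   # the central-frame targets drive training
```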

Posted Content
TL;DR: Stochastic recurrent neural networks are introduced, which glue a deterministic recurrent neural network and a state space model together to form a stochastic, sequential neural generative model.
Abstract: How can we efficiently propagate uncertainty in a latent state representation with recurrent neural networks? This paper introduces stochastic recurrent neural networks which glue a deterministic recurrent neural network and a state space model together to form a stochastic and sequential neural generative model. The clear separation of deterministic and stochastic layers allows a structured variational inference network to track the factorization of the model's posterior distribution. By retaining both the nonlinear recursive structure of a recurrent neural network and averaging over the uncertainty in a latent path, like a state space model, we improve the state-of-the-art results on the Blizzard and TIMIT speech modeling data sets by a large margin, while achieving comparable performance to competing methods on polyphonic music modeling.

269 citations
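The abstract's central idea is the clean split between a deterministic recurrent layer and a stochastic state-space layer. Below is a minimal PyTorch sketch of one generative time step in that spirit; the Gaussian parameterizations, sizes, and module names are illustrative assumptions, and the structured variational inference network is omitted entirely.

```python
import torch
import torch.nn as nn

class SRNNStep(nn.Module):
    """One generative step: deterministic d_t glued to stochastic z_t (a sketch)."""
    def __init__(self, x_dim=20, d_dim=64, z_dim=16):
        super().__init__()
        self.rnn = nn.GRUCell(x_dim, d_dim)               # deterministic layer
        self.prior = nn.Linear(d_dim + z_dim, 2 * z_dim)  # p(z_t | d_t, z_{t-1})
        self.dec = nn.Linear(d_dim + z_dim, 2 * x_dim)    # p(x_t | d_t, z_t)

    def forward(self, x_prev, d_prev, z_prev):
        d_t = self.rnn(x_prev, d_prev)                    # deterministic recursion
        mu, logvar = self.prior(torch.cat([d_t, z_prev], -1)).chunk(2, -1)
        z_t = mu + (0.5 * logvar).exp() * torch.randn_like(mu)  # reparameterised sample
        out_mu, out_logvar = self.dec(torch.cat([d_t, z_t], -1)).chunk(2, -1)
        return d_t, z_t, (out_mu, out_logvar)             # Gaussian emission parameters

step = SRNNStep()
d = torch.zeros(8, 64); z = torch.zeros(8, 16); x = torch.zeros(8, 20)
for _ in range(5):                                        # unroll a short sequence
    d, z, (x_mu, x_logvar) = step(x, d, z)
    x = x_mu                                              # feed the mean back in (sketch only)
```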

Proceedings Article
26 May 2013
TL;DR: A new fast speaker adaptation method is proposed for the hybrid NN-HMM speech recognition model that achieves over 10% relative reduction in phone error rate using only seven utterances for adaptation.
Abstract: In this paper, we propose a new fast speaker adaptation method for the hybrid NN-HMM speech recognition model. The adaptation method depends on joint learning of a large generic adaptation neural network for all speakers as well as multiple small speaker codes (one per speaker). The joint training method uses all training data along with speaker labels to update adaptation NN weights and speaker codes based on the standard back-propagation algorithm. In this way, the learned adaptation NN is capable of transforming each speaker's features into a generic speaker-independent feature space when a small speaker code is given. Adaptation to a new speaker can be done simply by learning a new speaker code using the same back-propagation algorithm, without changing any NN weights. In this method, a separate speaker code is learned for each speaker while the large adaptation NN is learned from the whole training set. The main advantage of this method is that the size of speaker codes is very small. As a result, it is possible to conduct a very fast adaptation of the hybrid NN/HMM model for each speaker based on only a small amount of adaptation data (i.e., just a few utterances). Experimental results on TIMIT have shown that it can achieve over 10% relative reduction in phone error rate by using only seven utterances for adaptation.

269 citations
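The adaptation step described above updates only a small per-speaker code by back-propagation while all network weights stay frozen. Below is a simplified PyTorch sketch of that step; it appends the code to the input of a single network rather than using a separate adaptation network as in the paper, and all sizes, names, and hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

feat_dim, code_dim, n_states = 440, 50, 183
adapt_net = nn.Sequential(nn.Linear(feat_dim + code_dim, 1024), nn.Sigmoid(),
                          nn.Linear(1024, n_states))      # assumed trained on all speakers

for p in adapt_net.parameters():
    p.requires_grad_(False)                               # freeze the generic network

speaker_code = torch.zeros(code_dim, requires_grad=True)  # the only free parameters
opt = torch.optim.SGD([speaker_code], lr=0.1)

# A few adaptation utterances from the new speaker (stand-in data).
feats = torch.randn(500, feat_dim)
states = torch.randint(0, n_states, (500,))

for _ in range(20):                                       # back-prop into the code only
    inp = torch.cat([feats, speaker_code.expand(len(feats), -1)], dim=1)
    loss = nn.functional.cross_entropy(adapt_net(inp), states)
    opt.zero_grad(); loss.backward(); opt.step()
```

Because only `code_dim` values are updated, a handful of utterances suffices, which is the source of the fast adaptation the abstract reports.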


Network Information
Related Topics (5)
Recurrent neural network: 29.2K papers, 890K citations (76% related)
Feature (machine learning): 33.9K papers, 798.7K citations (75% related)
Feature vector: 48.8K papers, 954.4K citations (74% related)
Natural language: 31.1K papers, 806.8K citations (73% related)
Deep learning: 79.8K papers, 2.1M citations (72% related)
Performance Metrics
No. of papers in the topic in previous years:

Year    Papers
2023    24
2022    62
2021    67
2020    86
2019    77
2018    95