Topic

TIMIT

About: TIMIT is a research topic. Over its lifetime, 1401 publications have been published within this topic, receiving 59888 citations. The topic is also known as: TIMIT Acoustic-Phonetic Continuous Speech Corpus.


Papers
01 Jan 2007
TL;DR: TIMIT is a corpus of phonemically and lexically transcribed speech of American English speakers of different sexes and dialects designed to further acoustic-phonetic knowledge and automatic speech recognition systems.
Abstract: TIMIT is a corpus of phonemically and lexically transcribed speech of American English speakers of different sexes and dialects. Each transcribed element has been delineated in time. TIMIT was designed to further acoustic-phonetic knowledge and automatic speech recognition systems.
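As context for the corpus description above: each TIMIT utterance is distributed with a time-aligned phonetic transcription (.PHN file) whose lines hold a start sample, an end sample, and a phone label, with audio sampled at 16 kHz. Below is a minimal parsing sketch; the file path in the usage example is a hypothetical placeholder, not part of the corpus description.

```python
# Minimal sketch: parse a TIMIT phonetic transcription (.PHN) file.
# Each line holds "start_sample end_sample phone"; audio is sampled at 16 kHz.
from pathlib import Path

SAMPLE_RATE = 16000  # TIMIT sampling rate

def read_phn(path):
    """Return a list of (start_sec, end_sec, phone) tuples."""
    segments = []
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue
        start, end, phone = line.split()
        segments.append((int(start) / SAMPLE_RATE, int(end) / SAMPLE_RATE, phone))
    return segments

if __name__ == "__main__":
    # Hypothetical path into a local copy of the corpus
    for start, end, phone in read_phn("TRAIN/DR1/FCJF0/SA1.PHN"):
        print(f"{phone:>4s}  {start:7.3f} - {end:7.3f} s")
```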

1 citation

Proceedings ArticleDOI
R. Chengalvarayan
09 Sep 1997
TL;DR: Experimental results show that a state-dependent transformation on mel-warped DFT features outperforms mel-frequency cepstral coefficients (MFCCs), with a 15% error rate reduction on a standard 39-class TIMIT phone classification task compared with a conventional MCE-trained HMM.
Abstract: We investigate the interactions of front-end preprocessing and back-end classification techniques in hidden Markov model (HMM) based speech recognition. The proposed model aims at finding an optimal linear transformation on the mel-warped discrete Fourier transform (DFT) with the construction of dynamic feature parameters according to the minimum classification error (MCE) criterion. Experimental results show that state-dependent transformation on mel-warped DFT features is superior in performance to the mel-frequency cepstral coefficients (MFCCs). An error rate reduction of 15% is obtained on a standard 39-class TIMIT phone classification task in comparison with the conventional MCE-trained HMM using MFCCs and delta MFCCs that have not been subject to optimization during training.
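The comparison above is between MFCCs (with deltas) and features taken directly from the mel-warped DFT. As a rough illustration of how the two front-ends relate, here is a sketch using librosa; librosa is an assumption of convenience (the paper predates it and learns its own MCE-optimized transformation), and the synthetic tone merely stands in for a TIMIT utterance.

```python
# Sketch of the two front-ends contrasted above: log mel filterbank energies
# ("mel-warped DFT" features) versus MFCCs (a DCT of those log energies).
import numpy as np
import librosa

sr = 16000
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 220 * t).astype(np.float32)   # 1 s synthetic tone as a stand-in

# Log mel filterbank energies: 25 ms frames, 10 ms hop, 40 mel bands
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=40)
log_mel = np.log(mel + 1e-10)                         # shape: (40, n_frames)

# MFCCs: DCT of the log mel energies, keeping the first 13 coefficients
mfcc = librosa.feature.mfcc(S=log_mel, n_mfcc=13)     # shape: (13, n_frames)

# Delta (dynamic) features, as appended to the MFCCs in the baseline system
delta = librosa.feature.delta(mfcc)
print(log_mel.shape, mfcc.shape, delta.shape)
```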

1 citation

Proceedings ArticleDOI
01 Nov 2010
TL;DR: A detection-by-ranking approach based on the RankNet technique is proposed for the stressed syllable detection problem, and introducing the fractal dimension helps improve detection accuracy.

Abstract: Many time- and frequency-domain speech features have been applied to the problem of English stressed syllable detection, and researchers have shown that combining multiple features is necessary for better performance. Even so, the search for new speech features and for feature fusion approaches remains open. This paper proposes a detection-by-ranking approach to the stressed syllable detection problem based on the RankNet technique. The approach identifies the stressed syllable through pairwise comparison of the feature vectors corresponding to the vowels of the syllables in a multi-syllable word. The paper also introduces the fractal dimension of each vowel as a stress feature. Experiments conducted on the TIMIT corpus show that the proposed feature fusion method achieves high performance and that introducing the fractal dimension helps improve the detection rate.
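The abstract introduces the fractal dimension of each vowel as a stress feature but does not say which estimator is used. The sketch below computes Higuchi's fractal dimension, a common choice for 1-D signals and purely an assumption here, on a raw waveform segment.

```python
# Sketch: Higuchi fractal dimension of a 1-D signal (one possible estimator
# for the per-vowel fractal-dimension feature mentioned above).
import numpy as np

def higuchi_fd(x, k_max=10):
    """Estimate the Higuchi fractal dimension of a 1-D signal."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    ks = np.arange(1, k_max + 1)
    lengths = []
    for k in ks:
        curve_lengths = []
        for m in range(k):
            idx = np.arange(m, n, k)
            if len(idx) < 2:
                continue
            # Normalised mean absolute increment at lag k (Higuchi, 1988)
            lm = np.abs(np.diff(x[idx])).sum() * (n - 1) / ((len(idx) - 1) * k * k)
            curve_lengths.append(lm)
        lengths.append(np.mean(curve_lengths))
    # The fractal dimension is the slope of log L(k) versus log(1/k)
    slope, _ = np.polyfit(np.log(1.0 / ks), np.log(lengths), 1)
    return slope

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(higuchi_fd(rng.standard_normal(2000)))  # white noise: roughly 2.0
```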

1 citation

Proceedings Article
01 Jan 2007
TL;DR: Line spectrum pairs are proposed as the feature representation for modeling the dynamics of a speech frame, and experiments on the TIMIT database show the effectiveness of the proposed features for vowel classification.

Abstract: In this paper, line spectrum pairs are proposed as the feature representation when attempting to model the dynamics of a speech frame. Experiments on the TIMIT database show the effectiveness of the proposed features for the application of vowel classification.
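For readers unfamiliar with the representation: line spectrum pairs are obtained from a frame's LPC polynomial A(z) by forming the symmetric and antisymmetric polynomials P(z) = A(z) + z^-(p+1) A(z^-1) and Q(z) = A(z) - z^-(p+1) A(z^-1); their unit-circle root angles are the line spectral frequencies. A sketch follows; librosa is used only to estimate the LPC coefficients, and the noisy synthetic frame stands in for TIMIT vowel data.

```python
# Sketch: convert the LPC coefficients of a speech frame to line spectral
# frequencies (the angles of the line spectrum pairs).
import numpy as np
import librosa

def lpc_to_lsf(a):
    """Map LPC coefficients a = [1, a1, ..., ap] to p line spectral frequencies (radians)."""
    a = np.asarray(a, dtype=float)
    # Symmetric (P) and antisymmetric (Q) polynomials of order p + 1
    P = np.concatenate([a, [0.0]]) + np.concatenate([[0.0], a[::-1]])
    Q = np.concatenate([a, [0.0]]) - np.concatenate([[0.0], a[::-1]])
    # Roots lie on the unit circle; keep angles in (0, pi) and drop the
    # trivial roots near z = -1 (from P) and z = +1 (from Q).
    angles = []
    for poly in (P, Q):
        angles.extend(w for w in np.angle(np.roots(poly)) if 1e-3 < w < np.pi - 1e-3)
    return np.sort(angles)

sr = 16000
rng = np.random.default_rng(0)
t = np.arange(320) / sr                                # one 20 ms synthetic "vowel" frame
frame = (np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 900 * t)
         + 0.01 * rng.standard_normal(320))
a = librosa.lpc(frame, order=12)                       # LPC polynomial of the frame
print(lpc_to_lsf(a) * sr / (2 * np.pi))                # LSFs in Hz
```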

1 citation

DissertationDOI
01 Jan 2011
TL;DR: This dissertation investigates combining template-based methods with the prosodic features of duration, energy, and pitch to further improve speech recognition accuracy; it proposes minimum distance and maximum log-likelihood template selection and investigates a template compression method on top of template selection to further improve recognition performance.
Abstract: In this dissertation, a novel approach that integrates template matching with statistical modeling is proposed to improve continuous speech recognition. Hidden Markov models (HMMs) have been the dominant approach in statistical speech recognition because they provide a principled way of jointly modeling speech spectral variation and time dynamics. However, HMMs assume that observations are independent within each state, which makes them ineffective at modeling the details of speech temporal evolution that are important for characterizing nonstationary speech sounds. Template-based methods compare a test pattern against templates derived from training data, and therefore capture speech dynamics and the time correlation of speech frames better than HMM-based methods. However, template matching requires large memory and computation, since the feature vectors of the training data must be stored in memory for access at recognition time, which is difficult in large-vocabulary continuous speech recognition (LVCSR). The proposed approach takes advantage of both statistical modeling and template matching, overcoming the weaknesses of conventional template-based methods while remaining feasible for LVCSR.

Each frame of a speech template is represented by multiple Gaussian Mixture Model (GMM) indices, and the template unit is a context-dependent phone segment (triphone context). Phonetic decision trees, borrowed from those commonly used in HMMs, tie triphone templates and predict triphones unseen in the training data. Two local distances, the log likelihood ratio (LLR) and the Kullback-Leibler (KL) divergence, are proposed for dynamic time warping (DTW) based template matching. To reduce computational complexity and storage, minimum distance template selection (MDTS) and maximum log-likelihood template selection (MLTS) are proposed, and a template compression method is investigated on top of template selection to further improve recognition performance.

The template-based methods were used to rescore lattices generated by baseline HMMs on TIMIT continuous phone recognition and telehealth LVCSR tasks, and experimental results demonstrated that integrating template matching with statistical modeling significantly improved recognition performance over the HMM baselines. The template selection methods also provided significant accuracy improvements over the HMM baseline while greatly reducing computation and storage. When all templates or MDTS were used, the LLR local distance outperformed the KL divergence local distance; for MLTS and template compression, the KL divergence local distance performed better than the LLR local distance, and the template compression method improved further over KL-based MLTS. Because the templates were constructed from GMM indices extracted from the HMM baselines, the proposed template methods were also validated on enhanced HMM baselines: the LLR-based all-template method consistently improved TIMIT phone recognition accuracy over four enhanced HMM baselines.

Prosodic features such as duration, energy, and pitch reflect longer-span information than conventional single-frame vectors, but they have commonly been ignored by HMMs. Template-based methods make it convenient to integrate prosodic features into speech recognition, which has not been well studied in the past. This dissertation therefore investigates combining template-based methods with the prosodic features of duration, energy, and pitch to further improve recognition accuracy. Prosodic scores were computed by a GMM-based method and a non-parametric method and combined with the acoustic scores in triphone template matching. Experimental results on the telehealth task showed that prosodic information had positive effects on vowel recognition.
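The core matching step described above is DTW over triphone templates with a frame-level local distance (LLR or KL divergence). Below is a generic DTW sketch with a pluggable local distance; the symmetric KL divergence between single diagonal Gaussians stands in for the dissertation's GMM-index representation and is an illustrative simplification, not the author's exact formulation.

```python
# Sketch: DTW template matching with a pluggable frame-level local distance.
import numpy as np

def sym_kl_diag_gauss(m1, v1, m2, v2):
    """Symmetrised KL divergence between two diagonal Gaussians (means m, variances v)."""
    kl12 = 0.5 * np.sum(np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)
    kl21 = 0.5 * np.sum(np.log(v1 / v2) + (v2 + (m2 - m1) ** 2) / v1 - 1.0)
    return kl12 + kl21

def dtw(test, template, local_dist):
    """Return the DTW alignment cost between two frame sequences."""
    n, m = len(test), len(template)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = local_dist(test[i - 1], template[j - 1])
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def random_sequence(rng, n_frames, dim):
    """Hypothetical data: each frame is a (mean, variance) pair of a diagonal Gaussian."""
    return [(rng.normal(size=dim), np.abs(rng.normal(size=dim)) + 0.1)
            for _ in range(n_frames)]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    test, template = random_sequence(rng, 20, 13), random_sequence(rng, 18, 13)
    cost = dtw(test, template, lambda a, b: sym_kl_diag_gauss(a[0], a[1], b[0], b[1]))
    print(f"DTW cost: {cost:.2f}")
```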

1 citation


Network Information
Related Topics (5)
Recurrent neural network: 29.2K papers, 890K citations, 76% related
Feature (machine learning): 33.9K papers, 798.7K citations, 75% related
Feature vector: 48.8K papers, 954.4K citations, 74% related
Natural language: 31.1K papers, 806.8K citations, 73% related
Deep learning: 79.8K papers, 2.1M citations, 72% related
Performance Metrics
No. of papers in the topic in previous years

Year    Papers
2023    24
2022    62
2021    67
2020    86
2019    77
2018    95