Topic

TIMIT

About: TIMIT is a research topic. Over its lifetime, 1,401 publications have been published within this topic, receiving 59,888 citations. The topic is also known as: TIMIT Acoustic-Phonetic Continuous Speech Corpus.


Papers
Proceedings ArticleDOI
23 Jun 2010
TL;DR: It is concluded that running EM to the optimal point of convergence achieves the best speaker verification performance, but that this optimal point depends on the data and model parameters.
Abstract: An established approach to training Gaussian Mixture Models (GMMs) for speaker verification is the expectation-maximisation (EM) algorithm. The EM algorithm has been shown to be sensitive to initialisation and prone to converging on local maxima. In exploration of these issues, three different initialisation methods are implemented, along with a split-and-merge technique to ‘pull’ the trained GMM out of a local maximum. It is shown that both of these approaches improve the likelihood of a GMM trained on speech data. Results of a verification task on the TIMIT and YOHO databases show that increased model fit does not directly translate into an improved equal error rate (EER). In no case does the split-and-merge procedure improve the EER. TIMIT results show a peak in performance of 4.8% EER at 20 EM iterations with a random GMM initialisation. An EER of 1.41% is achieved on the YOHO database under the same regime. It is concluded that running EM to the optimal point of convergence achieves the best speaker verification performance, but that this optimal point depends on the data and model parameters.
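For readers who want to see the effect the paper studies, the sketch below varies the EM initialisation method and iteration count of a GMM and reports the resulting model fit. This is a minimal illustration, not the authors' code: it assumes scikit-learn is available, and the random feature matrix is a placeholder for MFCC vectors extracted from TIMIT speakers.

```python
# Sketch: effect of EM initialisation and iteration count on GMM fit,
# in the spirit of the paper above. Random features stand in for real
# speaker MFCC vectors; scikit-learn provides the EM training loop.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
features = rng.normal(size=(5000, 13))  # placeholder for MFCCs from one speaker set

for init in ("kmeans", "random"):
    for n_iter in (5, 20, 100):
        gmm = GaussianMixture(n_components=32, covariance_type="diag",
                              init_params=init, max_iter=n_iter,
                              random_state=0)
        gmm.fit(features)
        print(f"init={init:7s} iters={n_iter:3d} "
              f"avg log-likelihood={gmm.score(features):.3f}")
```

As the paper notes, a higher likelihood here does not necessarily translate into a lower EER on a verification task; that has to be measured separately.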

1 citation

Book ChapterDOI
08 Sep 2003
TL;DR: Among the compared techniques, the technique based on TRAP achieves the best results in clean speech, with about a 10% relative improvement over the baseline system; its advantage is also observed in the presence of a mismatch between training and testing conditions.
Abstract: We investigate and compare several techniques for automatic recognition of unconstrained context-independent phoneme strings from the TIMIT and NTIMIT databases. Among the compared techniques, the technique based on TempoRAl Patterns (TRAP) achieves the best results in clean speech, with about a 10% relative improvement over the baseline system. Its advantage is also observed in the presence of a mismatch between training and testing conditions. Issues such as the optimal length of temporal patterns in the TRAP technique and the effectiveness of mean and variance normalisation of the patterns and of the multi-band input to the TRAP estimations are also explored.
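The core of the TRAP idea is a long temporal trajectory of log energy in a single critical band, mean- and variance-normalised before classification. The sketch below illustrates that feature construction only, under assumed dimensions (15 bands, a 101-frame pattern at a 10 ms frame rate); the band-energy matrix is random placeholder data, and the per-band neural classifiers and merger stage of the full system are omitted.

```python
# Sketch of TRAP feature construction: per critical band, take a long
# temporal pattern of log band energy around the current frame and
# normalise it before it would go to a band-level classifier.
import numpy as np

n_frames, n_bands = 1000, 15            # assumed sizes, not from the paper
log_e = np.random.randn(n_frames, n_bands)  # stand-in for log critical-band energies

def trap_vector(band_energy, t, ctx=50):
    """101-point temporal pattern centred on frame t, mean/variance normalised."""
    patt = band_energy[t - ctx : t + ctx + 1]
    return (patt - patt.mean()) / (patt.std() + 1e-8)

# One TRAP per band at frame 500; the full system feeds each to a
# band-level NN and merges the outputs with a second-stage classifier.
traps = [trap_vector(log_e[:, b], 500) for b in range(n_bands)]
print(len(traps), traps[0].shape)       # 15 bands, (101,) each
```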

1 citation

Proceedings ArticleDOI
07 Dec 2015
TL;DR: This paper compares Neural Network (NN) based approaches such as the Subband Autocorrelation Classifier (SAcC) with signal-processing-based methods such as YIN and RAPT, and shows that multi-style training of the NN using the CC+SAcC feature outperforms all the other methods.
Abstract: Pitch, or fundamental frequency, estimation is an important problem in speech processing. Research on pitch extraction has a long history, and numerous algorithms have been developed over the years to improve its accuracy. The task becomes more difficult in the presence of additive noise and reverberation, because noise corrupts the periodicity information that is vital for estimating the pitch. In this paper, we present a quantitative analysis of pitch tracking in the presence of reverberation by different state-of-the-art methods. We compare Neural Network (NN) based approaches such as the Subband Autocorrelation Classifier (SAcC) with signal-processing-based methods such as YIN and RAPT. We enhance the performance of SAcC by introducing a cross-correlogram feature (CC+SAcC). We further show that multi-style training of the NN using the CC+SAcC feature outperforms all the other methods. Experiments were conducted using artificially reverberated Keele and TIMIT databases with room impulse responses of varying T60 values.
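The trackers compared above (YIN, RAPT, and the subband autocorrelation at the heart of SAcC) all exploit the periodicity of voiced speech. The sketch below shows the bare autocorrelation principle on a synthetic 150 Hz tone standing in for a voiced frame; it is an illustration of the shared idea, not an implementation of any of the cited algorithms.

```python
# Sketch: autocorrelation-based pitch estimate on one synthetic frame.
# The peak of the autocorrelation within the plausible lag range gives
# the period; noise and reverberation blur exactly this peak.
import numpy as np

fs = 16000
t = np.arange(int(0.04 * fs)) / fs           # one 40 ms analysis frame
frame = np.sin(2 * np.pi * 150 * t)          # placeholder voiced signal

ac = np.correlate(frame, frame, mode="full")[len(frame) - 1 :]  # lags >= 0
lo, hi = int(fs / 400), int(fs / 60)         # search the 60-400 Hz range
lag = lo + int(np.argmax(ac[lo:hi]))
print(f"estimated f0 = {fs / lag:.1f} Hz")   # close to 150 Hz
```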

1 citation

Proceedings Article
13 Feb 2008
TL;DR: This paper presents an objective comparison between different excitation extension algorithms, with the wideband spectral envelope assumed to be known a priori, and shows that the different extension methods provide roughly the same speech quality.
Abstract: The most common approach to artificial bandwidth extension is based on the source-filter model, such that the estimation of the missing signal components is done in two stages: excitation extension and spectral envelope extension. In this paper we present an objective comparison between different excitation extension algorithms, with the wideband spectral envelope assumed to be known a priori. A variety of methods have been implemented and subjected to several objective quality measures. All methods are tested with the NTIMIT corpus and a bandpass-filtered TIMIT corpus. The results show that: 1) the different extension methods provide roughly the same speech quality, which implies that the choice of method can be made on the grounds of efficient implementation; 2) underestimation of the extended components is preferable to overestimation; 3) PESQ-MOS is the most suitable objective measure for evaluating bandwidth-extended speech signals.
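One classic excitation-extension method of the kind this paper compares is spectral folding: upsampling the narrowband excitation by zero insertion mirrors its 0-4 kHz spectrum into the 4-8 kHz band, producing a flat-envelope wideband excitation. The sketch below demonstrates the effect on white noise standing in for an LPC residual; the paper's exact set of methods is not listed in the abstract, so this is an assumed representative.

```python
# Sketch: excitation extension by spectral folding. Inserting zeros
# between samples doubles the rate and creates a mirror image of the
# narrowband spectrum in the new high band.
import numpy as np

fs_nb = 8000
excitation_nb = np.random.randn(fs_nb)   # placeholder narrowband LPC residual

wb = np.zeros(2 * len(excitation_nb))
wb[::2] = excitation_nb                  # zero insertion -> image at 4-8 kHz

spec = np.abs(np.fft.rfft(wb))
half = len(spec) // 2
print("energy 0-4 kHz :", float(np.sum(spec[:half] ** 2)))
print("energy 4-8 kHz :", float(np.sum(spec[half:] ** 2)))  # roughly equal
```

In a full system, the extended excitation would then be shaped by the (here assumed known) wideband spectral envelope.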

1 citation

Journal ArticleDOI
TL;DR: In this article, hybrid modelling techniques are implemented to recognise spoken words; model efficiency is determined by the word error rate (WER), and the obtained results are assessed against well-known datasets such as TIMIT and Aurora-4.
Abstract: Objectives: The primary goal is to establish a Continuous Speech Recognition (CSR) framework for recognising continuous speech in Kannada. Dealing with a local language such as Kannada, which lacks even a single standard language database, is a difficult challenge. Methods: Modelling techniques such as monophone, triphone, deep neural network (DNN)-hidden Markov model (HMM) and Gaussian Mixture Model (GMM)-HMM based models were implemented in the Kaldi toolkit and used for continuous Kannada speech recognition (CKSR). To extract feature vectors from the speech data, the Mel-frequency cepstral coefficient (MFCC) technique is used. The continuous Kannada speech database consists of 2800 speakers (1680 males and 1120 females) belonging to the age group of 8 to 80 years. The training and testing data are split in the ratio 80:20. In this paper the hybrid modelling techniques are implemented to recognise the spoken words. Findings: Model efficiency is determined by the word error rate (WER), and the obtained results are assessed against well-known datasets such as TIMIT and Aurora-4. This study found that Kaldi-based feature extraction recipes for the monophone, triphone, DNN-HMM and GMM-HMM acoustic models yielded WERs of 8.23%, 5.23%, 4.05% and 4.64% respectively. The experimental results suggest that the recognition rate on the Kannada speech data is higher than that on state-of-the-art databases. Novelty: We propose a novel automatic speech recognition system for the Kannada language. The main reason for developing it is that only limited sources of standard continuous Kannada speech are available. We created a large-vocabulary Kannada database and implemented monophone, triphone, Subspace Gaussian Mixture Model (SGMM) and hybrid modelling techniques to develop the automatic speech recognition system for Kannada. Keywords: DNN; Continuous speech; HMM; Kannada dialect; Kaldi toolkit; monophone; triphone; WER
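The WER metric used to rank the models above is the word-level edit distance between the recognised and reference transcripts, normalised by the reference length. A minimal implementation is sketched below; it is the standard dynamic-programming formulation, not code from the Kaldi recipes the paper uses.

```python
# Sketch of the word error rate (WER) metric:
# WER = (substitutions + deletions + insertions) / reference word count,
# computed via edit-distance dynamic programming.
def wer(ref: str, hyp: str) -> float:
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / len(r)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion -> 0.167
```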

1 citation


Network Information
Related Topics (5)
Recurrent neural network: 29.2K papers, 890K citations (76% related)
Feature (machine learning): 33.9K papers, 798.7K citations (75% related)
Feature vector: 48.8K papers, 954.4K citations (74% related)
Natural language: 31.1K papers, 806.8K citations (73% related)
Deep learning: 79.8K papers, 2.1M citations (72% related)
Performance Metrics
No. of papers in the topic in previous years:

Year    Papers
2023    24
2022    62
2021    67
2020    86
2019    77
2018    95