scispace - formally typeset
Topic

TIMIT

About: TIMIT is a research topic. Over the lifetime, 1,401 publications have been published within this topic, receiving 59,888 citations. The topic is also known as: TIMIT Acoustic-Phonetic Continuous Speech Corpus.


Papers
Proceedings Article
01 Jan 1993
TL;DR: A unified approach to identifying non-linguistic speech features from the recorded signal using phone-based acoustic likelihoods, which has been shown to be effective for text-independent, vocabulary-independent sex, speaker, and language identification, and is promising for a variety of applications.
Abstract: In this paper we have presented a unified approach for the identification of non-linguistic speech features from recorded signals using phone-based acoustic likelihoods. The inclusion of this technique in speech-based systems can broaden the scope of applications of speech technologies and lead to more user-friendly systems. The approach is based on training a set of large phone-based ergodic HMMs for each non-linguistic feature to be identified (language, gender, speaker, ...), and identifying the feature as that associated with the model having the highest acoustic likelihood of the set. The decoding procedure is efficiently implemented by processing all the models in parallel using a time-synchronous beam search strategy. This has been shown to be a powerful technique for sex, language, and speaker identification, and has other possible applications such as dialect identification (including foreign accents) or identification of speech disfluencies. Sex identification for BREF and WSJ was error-free, and 99% accurate for TIMIT with 2 s of speech. Speaker-identification accuracies of 98.8% on TIMIT (168 speakers) and 99.1% on BREF (65 speakers) were obtained with one utterance per speaker, and 100% if 2 utterances were used for identification. This identification accuracy was obtained on the 168 test speakers of TIMIT without making use of the phonetic transcriptions during training, verifying that it is not necessary to have labeled adaptation data. Speaker-independent models can be used to provide the labels used in building the speaker-specific models.
Being independent of the spoken text, and requiring only a small amount of identification speech (on the order of 2.5 s), this technique is promising for a variety of applications, particularly those for which continual, transparent verification is preferable. Tests of two-way language identification of read, laboratory speech show that with 2 s of speech the language is correctly identified as English or French with over 99% accuracy. Simply porting the approach to the conditions of telephone speech, two-way identification of French and English in the OGI multi-language telephone speech corpus was about 76% accurate with 2 s of speech, and increased to 82% with 10 s. The overall 10-language identification accuracy on the designated development test data of the OGI corpus is 59.7%. These results were obtained without the use of phone transcriptions for training, which were used for the experiments with laboratory speech. In conclusion, we propose a unified approach to identifying non-linguistic speech features from the recorded signal using phone-based acoustic likelihoods. This technique has been shown to be effective for text-independent, vocabulary-independent sex, speaker, and language identification. While phone labels have been used to train the speaker-independent seed models, these models can then be used to label unknown speech, thus avoiding the costly process of transcribing the speech data. The ability to accurately identify non-linguistic speech features can lead to more performant spoken language systems, enabling better and more friendly human-machine interaction.
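The decision rule the abstract describes, stripped to its core, is an argmax over per-feature acoustic models: score the utterance under one model per candidate value (speaker, language, gender) and pick the highest-scoring one. A minimal sketch, assuming toy single-Gaussian "models" in place of the paper's large ergodic phone HMMs (the names `identify`, `speaker_A`, etc. are illustrative, not from the paper):

```python
# Toy sketch of identification by maximum acoustic likelihood.
# Each "model" here is a diagonal Gaussian over 2-D frames; a real
# system would use large phone-based ergodic HMMs per feature value,
# decoded in parallel, but the argmax structure is the same.
import numpy as np

def log_likelihood(frames, mean, var):
    """Total log-likelihood of all frames under a diagonal Gaussian."""
    diff = frames - mean
    per_frame = -0.5 * np.sum(diff**2 / var + np.log(2 * np.pi * var), axis=1)
    return per_frame.sum()

def identify(frames, models):
    """Return the label whose model gives the utterance the highest score."""
    scores = {label: log_likelihood(frames, m, v)
              for label, (m, v) in models.items()}
    return max(scores, key=scores.get)

rng = np.random.default_rng(0)
models = {
    "speaker_A": (np.array([0.0, 0.0]), np.array([1.0, 1.0])),
    "speaker_B": (np.array([3.0, 3.0]), np.array([1.0, 1.0])),
}
# 50 frames drawn near speaker_B's model
utterance = rng.normal([3.0, 3.0], 1.0, size=(50, 2))
print(identify(utterance, models))  # speaker_B
```

Because the same machinery works for any label set, swapping "speaker" for "language" or "gender" only changes which models are placed in the dictionary, which is the sense in which the approach is unified.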

39 citations

Proceedings Article (DOI)
13 May 2002
TL;DR: It was found that combining the classical MFCCs with some auditory-based acoustic distinctive cues and the main peaks of the spectrum of a speech signal using a multi-stream paradigm leads to an improvement in the recognition performance.
Abstract: In this paper, a multi-stream paradigm is proposed to improve the performance of automatic speech recognition (ASR) systems. Our goal in this paper is to improve the performance of HMM-based ASR systems by exploiting some features that characterize speech sounds based on the auditory system and one based on the Fourier power spectrum. It was found that combining the classical MFCCs with some auditory-based acoustic distinctive cues and the main peaks of the spectrum of a speech signal using a multi-stream paradigm leads to an improvement in the recognition performance. The Hidden Markov Model Toolkit (HTK) was used throughout our experiments to test the use of the new multi-stream feature vector. A series of experiments on speaker-independent continuous-speech recognition has been carried out using a subset of the large read-speech corpus TIMIT. Using such a multi-stream paradigm, N-mixture mono-/tri-phone models, and a bigram language model, we found that the word error rate was decreased by about 4.01%.
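In HTK-style multi-stream HMMs, each observation is split into independent streams whose log-likelihoods are combined with per-stream weights. A minimal sketch of that fusion rule, with illustrative stream names and weights (not the paper's actual features or values):

```python
# Sketch of multi-stream score fusion: one HMM state holds separate
# Gaussian parameters per stream (MFCCs, auditory cues, spectral peaks),
# and the combined acoustic score is a weighted sum of per-stream
# log-likelihoods. All parameters below are toy values.
import numpy as np

def stream_log_likelihood(obs, mean, var):
    """Diagonal-Gaussian log-likelihood of one stream's observation."""
    diff = obs - mean
    return -0.5 * np.sum(diff**2 / var + np.log(2 * np.pi * var))

def multistream_score(frame, state, weights):
    """Weighted sum of per-stream log-likelihoods for one frame."""
    return sum(
        weights[name] * stream_log_likelihood(frame[name], mean, var)
        for name, (mean, var) in state.items()
    )

state = {  # one HMM state, three streams, toy parameters
    "mfcc": (np.zeros(12), np.ones(12)),
    "auditory_cues": (np.zeros(4), np.ones(4)),
    "spectral_peaks": (np.zeros(3), np.ones(3)),
}
weights = {"mfcc": 1.0, "auditory_cues": 0.5, "spectral_peaks": 0.5}
frame = {name: np.zeros(mean.shape) for name, (mean, _) in state.items()}
score = multistream_score(frame, state, weights)
```

The stream weights let a recognizer trust the classical MFCC stream more while still letting the auxiliary cues shift close decisions, which is one way a combination like the one above can improve recognition.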

39 citations

Proceedings Article (DOI)
22 May 2011
TL;DR: A new approach for phoneme recognition which aims at minimizing the phoneme error rate; the algorithm is derived by finding the gradient of a PAC-Bayesian bound and minimizing it by stochastic gradient descent.
Abstract: We describe a new approach for phoneme recognition which aims at minimizing the phoneme error rate. Building on structured prediction techniques, we formulate the phoneme recognizer as a linear combination of feature functions. We state a PAC-Bayesian generalization bound, which gives an upper bound on the expected phoneme error rate in terms of the empirical phoneme error rate. Our algorithm is derived by finding the gradient of the PAC-Bayesian bound and minimizing it by stochastic gradient descent. The resulting algorithm is iterative and easy to implement. Experiments on the TIMIT corpus show that our method achieves the lowest phoneme error rate compared to other discriminative and generative models with the same expressive power.
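The abstract's setup is a linear structured predictor: the recognizer scores candidates by a weight vector applied to a joint feature map, and the weights are updated by stochastic gradient steps. A hedged sketch, using a structured perceptron-style surrogate update on toy data in place of the paper's PAC-Bayesian gradient (the feature map `phi` and the data are stand-ins, not the paper's):

```python
# Sketch of a linear-in-features recognizer trained by stochastic
# gradient steps. Scores are w . phi(x, y); the update moves the weights
# toward the correct label's features and away from the current best
# wrong guess, a simple surrogate for minimizing the error rate.
import numpy as np

def phi(x, y):
    """Toy joint feature map: outer product of input and one-hot label."""
    onehot = np.zeros(3)
    onehot[y] = 1.0
    return np.outer(x, onehot).ravel()

def predict(w, x, labels=(0, 1, 2)):
    return max(labels, key=lambda y: w @ phi(x, y))

def sgd_step(w, x, y_true, lr=0.1):
    """One stochastic update on a single labeled example."""
    y_hat = predict(w, x)
    if y_hat != y_true:
        w = w + lr * (phi(x, y_true) - phi(x, y_hat))
    return w

rng = np.random.default_rng(1)
centers = np.eye(3)  # three classes with distinct mean vectors
w = np.zeros(9)
for _ in range(200):
    y = int(rng.integers(3))
    x = centers[y] + 0.1 * rng.normal(size=3)
    w = sgd_step(w, x, y)
accuracy = float(np.mean([predict(w, centers[y]) == y for y in range(3)]))
```

The paper's contribution is the specific objective being descended (a PAC-Bayesian bound on the expected phoneme error rate) rather than the generic update loop sketched here.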

39 citations

Proceedings Article (DOI)
13 Oct 2002
TL;DR: The results show that a reconstructed phase space approach is a viable method for classification of phonemes, with the potential for use in a continuous speech recognition system.
Abstract: A novel method for classifying speech phonemes is presented. Unlike traditional cepstral based methods, this approach uses histograms of reconstructed phase spaces. A naive Bayes classifier uses the probability mass estimates for classification. The approach is verified using isolated fricative, vowel, and nasal phonemes from the TIMIT corpus. The results show that a reconstructed phase space approach is a viable method for classification of phonemes, with the potential for use in a continuous speech recognition system.
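The pipeline the abstract describes has three steps: embed each signal in a reconstructed phase space via time-delay vectors, estimate a class-conditional probability mass function with a histogram over that space, and classify by naive-Bayes log-probability. A minimal sketch under toy assumptions (sinusoids stand in for phoneme signals; embedding dimension, delay, and bin counts are illustrative):

```python
# Sketch of phase-space histogram classification: time-delay embedding
# [x(t), x(t - tau)], a smoothed 2-D histogram as the class-conditional
# probability mass estimate, and a naive Bayes decision over bins.
import numpy as np

def embed(signal, tau=1):
    """Time-delay embedding into 2-D phase-space points."""
    return np.column_stack([signal[tau:], signal[:-tau]])

def histogram_pmf(signal, bins=8, lim=(-2, 2)):
    pts = embed(signal)
    hist, _, _ = np.histogram2d(pts[:, 0], pts[:, 1],
                                bins=bins, range=[lim, lim])
    hist += 1.0  # Laplace smoothing so unseen bins keep nonzero mass
    return hist / hist.sum()

def classify(signal, pmfs, bins=8, lim=(-2, 2)):
    """Naive Bayes over phase-space bins: sum of log P(bin | class)."""
    pts = embed(signal)
    edges = np.linspace(lim[0], lim[1], bins + 1)
    ix = np.clip(np.digitize(pts[:, 0], edges) - 1, 0, bins - 1)
    iy = np.clip(np.digitize(pts[:, 1], edges) - 1, 0, bins - 1)
    scores = {c: np.log(p[ix, iy]).sum() for c, p in pmfs.items()}
    return max(scores, key=scores.get)

t = np.arange(400)
slow = np.sin(2 * np.pi * 0.01 * t)  # stand-in for a vowel-like signal
fast = np.sin(2 * np.pi * 0.15 * t)  # stand-in for a fricative-like one
pmfs = {"slow": histogram_pmf(slow), "fast": histogram_pmf(fast)}
print(classify(fast, pmfs))  # fast
```

A slowly varying signal keeps consecutive samples close, so its phase-space points hug the diagonal, while a rapidly varying one spreads out; the histograms capture exactly that difference, with no cepstral features involved.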

39 citations

Proceedings Article
01 Jan 1997
TL;DR: The acoustic segmentation algorithm is replaced with "segmentation by recognition," a probabilistic algorithm that can combine multiple contextual constraints towards hypothesizing only the most likely segments, and a search algorithm is described that can efficiently use multiple models to enforce contextual constraints across all segments in a network.
Abstract: Recently, we have developed a probabilistic framework for segmentbased speech recognition that represents the speech signal as a network of segments and associated feature vectors [2]. Although in general, each path through the network does not traverse all segments, we argued that each path must account for all feature vectors in the network. We then demonstrated an efficient search algorithm that uses a single additional model to account for segments that are not traversed. In this paper, we present two new extensions to our framework. First, we replace our acoustic segmentation algorithm with “segmentation by recognition,” a probabilistic algorithm that can combine multiple contextual constraints towards hypothesizing only the most likely segments. Second, we generalize our framework to “near-miss modeling” and describe a search algorithm that can efficiently use multiple models to enforce contextual constraints across all segments in a network. We report experiments in phonetic recognition on the TIMIT corpus in which we achieve a diphone context-dependent error rate of 26.6% on the NIST core test set over 39 classes. This is a 12.8% reduction in error rate from our best previously reported result.
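The framework's central accounting rule is that every path through the segment network must score all feature vectors: segments the path traverses are scored by their phone models, and the rest are scored by a single additional model. A toy rendering of that rule, with made-up segment scores (the segment names and numbers are illustrative, not from the paper):

```python
# Sketch of segment-network path scoring where every segment is
# accounted for: on-path segments use their phone-model log-probability,
# off-path segments use a single extra "anti" model, so competing paths
# score the same total set of feature vectors and stay comparable.

# segment id -> (log P under phone model, log P under the extra model)
segments = {
    "s1": (-1.0, -3.0),
    "s2": (-2.5, -1.5),
    "s3": (-0.8, -2.8),
}

def path_score(path):
    """Sum phone scores on the path plus off-path scores elsewhere."""
    total = 0.0
    for seg, (phone_lp, off_lp) in segments.items():
        total += phone_lp if seg in path else off_lp
    return total

paths = [("s1", "s3"), ("s1", "s2"), ("s2",)]
best = max(paths, key=path_score)
```

The paper's "near-miss modeling" generalizes this by allowing multiple models, rather than one, to account for the off-path segments, which is what lets contextual constraints reach segments a path never traverses.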

38 citations


Network Information
Related Topics (5)
- Recurrent neural network: 29.2K papers, 890K citations, 76% related
- Feature (machine learning): 33.9K papers, 798.7K citations, 75% related
- Feature vector: 48.8K papers, 954.4K citations, 74% related
- Natural language: 31.1K papers, 806.8K citations, 73% related
- Deep learning: 79.8K papers, 2.1M citations, 72% related
Performance
Metrics
No. of papers in the topic in previous years
Year | Papers
2023 | 24
2022 | 62
2021 | 67
2020 | 86
2019 | 77
2018 | 95