Topic

TIMIT

About: TIMIT is a research topic. Over its lifetime, 1,401 publications have been published within this topic, receiving 59,888 citations. The topic is also known as the TIMIT Acoustic-Phonetic Continuous Speech Corpus.


Papers
Journal ArticleDOI
TL;DR: It is shown that the addition of a hidden dynamic state leads to increases in accuracy over otherwise equivalent static models, and a time-asynchronous decoding strategy suited to recognition with segment models is proposed.
Abstract: The majority of automatic speech recognition systems rely on hidden Markov models, in which Gaussian mixtures model the output distributions associated with sub-phone states. This approach, whilst successful, models consecutive feature vectors (augmented to include derivative information) as statistically independent. Furthermore, spatial correlations present in speech parameters are frequently ignored through the use of diagonal covariance matrices. This paper continues the work of Digalakis and others, who instead proposed a first-order linear state-space model which has the capacity to model underlying dynamics and furthermore gives a model of spatial correlations. This paper examines the assumptions made in applying such a model and shows that the addition of a hidden dynamic state leads to increases in accuracy over otherwise equivalent static models. We also propose a time-asynchronous decoding strategy suited to recognition with segment models. We describe an implementation of decoding for linear dynamic models and present TIMIT phone recognition results.

34 citations
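
To make the modelling idea concrete, here is a minimal Python/NumPy sketch of a first-order linear dynamic (state-space) model of the kind the paper builds on. All dimensions, matrices, and noise levels are illustrative assumptions, not the paper's actual parameters.

import numpy as np

# Sketch: a hidden state x_t evolves linearly over time and emits an
# observed feature vector y_t. Everything here is a stand-in for the
# paper's trained parameters.
rng = np.random.default_rng(0)

state_dim, obs_dim, T = 4, 13, 50              # e.g. 13 MFCC-like features
A = 0.9 * np.eye(state_dim)                    # state transition matrix
C = rng.standard_normal((obs_dim, state_dim))  # observation matrix
Q = 0.1 * np.eye(state_dim)                    # process noise covariance
R = 0.5 * np.eye(obs_dim)                      # observation noise covariance

x = np.zeros(state_dim)
observations = []
for t in range(T):
    x = A @ x + rng.multivariate_normal(np.zeros(state_dim), Q)  # dynamics
    y = C @ x + rng.multivariate_normal(np.zeros(obs_dim), R)    # emission
    observations.append(y)

The hidden state carries information across frames, which is exactly the temporal dependence that a static GMM output model (with its independence assumption) discards.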

Proceedings ArticleDOI
04 May 2014
TL;DR: This work investigates techniques for boosting the accuracy of automatic phonetic segmentation based on HMM acoustic-phonetic models and finds that combining multiple acoustic front-ends gives additional gains in accuracy, and that conditioning the combiner on phonetic context and side information helps.
Abstract: Accurate phone-level segmentation of speech remains an important task for many subfields of speech research. We investigate techniques for boosting the accuracy of automatic phonetic segmentation based on HMM acoustic-phonetic models. In prior work [25] we were able to improve on state-of-the-art alignment accuracy by employing special phone-boundary HMMs, trained on phonetically segmented training data, in conjunction with a simple boundary-time correction model. Here we present further improved results by using more powerful statistical models for boundary correction that are conditioned on phonetic context and duration features. Furthermore, we find that combining multiple acoustic front-ends gives additional gains in accuracy, and that conditioning the combiner on phonetic context and side information helps. Overall, we reduce segmentation errors on the TIMIT corpus by almost one half, from 93.9% to 96.8% boundary accuracy with a 20-ms tolerance.

34 citations
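
As a concrete illustration of the metric quoted above, the following Python sketch computes boundary accuracy under a 20-ms tolerance. The one-to-one pairing of reference and hypothesis boundaries is a simplifying assumption; real alignment evaluation matches boundaries per phone.

# Fraction of hypothesis boundaries within `tol` seconds of the reference.
def boundary_accuracy(ref_boundaries, hyp_boundaries, tol=0.020):
    hits = sum(1 for r, h in zip(ref_boundaries, hyp_boundaries)
               if abs(r - h) <= tol)
    return hits / len(ref_boundaries)

ref = [0.10, 0.25, 0.42, 0.60]   # reference boundary times (seconds)
hyp = [0.11, 0.28, 0.43, 0.59]   # hypothesized boundary times
print(boundary_accuracy(ref, hyp))  # 0.75: the 0.25 vs 0.28 pair misses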

Journal ArticleDOI
Huy Phan, Lars Hertel, Marco Maass, Radoslaw Mazur, Alfred Mertins
TL;DR: This work considers speech patterns as basic acoustic concepts, which embody and represent the target nonspeech signal, and proposes an algorithm to select a sufficient subset, which provides an approximate representation capability of the entire set of available speech patterns.
Abstract: The human auditory system is very well matched to both human speech and environmental sounds. The question therefore arises whether human speech material may provide useful information for training systems to analyze nonspeech audio signals, e.g., in a classification task. To answer this question, we consider speech patterns as basic acoustic concepts which embody and represent the target nonspeech signal. To find out how similar the nonspeech signal is to speech, we classify it with a classifier trained on the speech patterns and use the classification posteriors to represent its closeness to the speech bases. The speech similarities are finally employed as a descriptor to represent the target signal. We further show that a better descriptor can be obtained by learning to organize the speech categories hierarchically with a tree structure. Moreover, these descriptors are generic: once the speech classifier has been learned, it can be employed as a feature extractor for different datasets without retraining. Lastly, we propose an algorithm to select a subset of speech patterns that approximately preserves the representation capability of the entire set. We conduct experiments on audio event analysis. Phone triplets from the TIMIT dataset were used as speech patterns to learn descriptors for audio events of three datasets of different complexity: UPC-TALP, Freiburg-106, and NAR. The experimental results on the event classification task show that good performance can be obtained even with a simple linear classifier. Furthermore, fusing the learned descriptors as an additional source leads to state-of-the-art performance on all three target datasets.

33 citations
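
A minimal Python sketch of the central idea, assuming scikit-learn and random stand-in features: a classifier trained on speech patterns is reused as a feature extractor, and its class posteriors become the descriptor of a nonspeech sound. The classifier choice and all data below are illustrative, not the paper's setup.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_speech, n_classes, dim = 300, 10, 20
X_speech = rng.standard_normal((n_speech, dim))  # stand-in phone-triplet features
y_speech = rng.integers(0, n_classes, n_speech)  # speech-pattern labels

speech_clf = LogisticRegression(max_iter=1000).fit(X_speech, y_speech)

X_event = rng.standard_normal((5, dim))          # stand-in nonspeech audio events
descriptors = speech_clf.predict_proba(X_event)  # posterior per speech class
print(descriptors.shape)                         # (5, n_classes)

Each row is a posterior vector describing the event's closeness to each speech-pattern class; per the abstract, such descriptors can then be fed to a simple linear classifier for event classification.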

Proceedings ArticleDOI
25 Aug 2013
TL;DR: This paper applies graph-based learning to variable-length segments rather than to the fixed-length vector representations that have been used previously, and finds that the best learning algorithms are those that can incorporate prior knowledge.
Abstract: This paper presents several novel contributions to the emerging framework of graph-based semi-supervised learning for speech processing. First, we apply graph-based learning to variable-length segments rather than to the fixed-length vector representations that have been used previously. As part of this work we compare various graph-based learners, and we utilize an efficient feature selection technique for high-dimensional feature spaces that alleviates computational costs and improves the performance of graph-based learners. Finally, we present a method to improve regularization during the learning process. Experimental evaluation on the TIMIT frame and segment classification tasks demonstrates that the graph-based classifiers outperform standard baseline classifiers; furthermore, we find that the best learning algorithms are those that can incorporate prior knowledge.

32 citations
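
As a rough illustration of graph-based semi-supervised learning, the Python sketch below uses scikit-learn's LabelSpreading over a k-NN similarity graph: a few labelled points propagate their labels to many unlabelled ones. This is one standard graph-based learner, not necessarily the paper's, and the data are random stand-ins for frame or segment features.

import numpy as np
from sklearn.semi_supervised import LabelSpreading

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 12))   # stand-in frame/segment feature vectors
y = rng.integers(0, 3, 200)          # true phone-class labels (3 classes)
y_train = y.copy()
y_train[20:] = -1                    # -1 marks unlabelled points

model = LabelSpreading(kernel='knn', n_neighbors=7)  # k-NN similarity graph
model.fit(X, y_train)                # labels spread over the graph
# Fraction of unlabelled points recovered (near chance here, since the
# features are random noise).
print((model.transduction_[20:] == y[20:]).mean())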

Journal ArticleDOI
Sam Kwong, Qianhua He, Kim F. Man, Ke Tang, C. W. Chau
TL;DR: A Parallel Genetic Time Warping (PGTW) algorithm is proposed to address the normalization-factor and K-best-path problems of DTW; it performed better than Tree-Trellis Search (TTS) and matched Sequential Genetic Time Warping (SGTW) while saving about 30% CPU time on a single-processor system.
Abstract: Dynamic Time Warping (DTW) is a technique widely used for nonlinear time normalization of different utterances in many speech recognition systems. Two major problems are usually encountered when DTW is applied to recognizing speech utterances: (i) the normalization factors used in a warping path; and (ii) finding the K-best warping paths. Although DTW can be modified to compute multiple warping paths by using the Tree-Trellis Search (TTS) algorithm, the choice of the actual normalization factor still remains a major problem for DTW. In this paper, a Parallel Genetic Time Warping (PGTW) algorithm is proposed to solve these problems. A database of 95 isolated words extracted from the TIMIT speech database was set up to evaluate the performance of PGTW. In the database, each of the first 15 words had 70 different utterances, and the remaining 80 words had only one utterance each. For each of the 15 words, one utterance was arbitrarily selected as the test template for recognition. Distance measures from each test template to the utterances of the same word and to those of the remaining 80 words were calculated with three different time-warping algorithms: TTS, PGTW, and Sequential Genetic Time Warping (SGTW). A normal distribution model based on Rabiner [23] was used to evaluate the performance of the three algorithms analytically. The results showed that PGTW performed better than TTS and produced results very similar to those of SGTW while saving about 30% of the CPU time on a single-processor system.

32 citations
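
For reference, here is the textbook DTW recurrence that the paper's genetic variants build on, as a self-contained Python sketch. The division by (n + m) at the end is one common normalization choice; the abstract notes that choosing the proper normalization factor, which really depends on the warping path, is one of DTW's open problems.

import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-programming time warping between two 1-D sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])       # local distance
            D[i, j] = cost + min(D[i - 1, j],     # insertion
                                 D[i, j - 1],     # deletion
                                 D[i - 1, j - 1]) # match
    return D[n, m] / (n + m)  # one common, fixed normalization factor

print(dtw_distance([1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 4.0]))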


Network Information
Related Topics (5)
Recurrent neural network: 29.2K papers, 890K citations, 76% related
Feature (machine learning): 33.9K papers, 798.7K citations, 75% related
Feature vector: 48.8K papers, 954.4K citations, 74% related
Natural language: 31.1K papers, 806.8K citations, 73% related
Deep learning: 79.8K papers, 2.1M citations, 72% related
Performance Metrics
No. of papers in the topic in previous years:

Year    Papers
2023    24
2022    62
2021    67
2020    86
2019    77
2018    95