Topic

TIMIT

About: TIMIT is a research topic. Over the lifetime, 1401 publications have been published within this topic receiving 59888 citations. The topic is also known as: TIMIT Acoustic-Phonetic Continuous Speech Corpus.


Papers
Proceedings ArticleDOI
28 Nov 2011
TL;DR: Experimental results show that the use of ASM posteriorgrams leads to consistently better detection performance than conventional GMM posteriorgrams.
Abstract: This paper describes a study on query-by-example spoken term detection (STD) using the acoustic segment modeling technique. Acoustic segment models (ASMs) are a set of hidden Markov models (HMM) that are obtained in an unsupervised manner without using any transcription information. The training of ASMs follows an iterative procedure, which consists of the steps of initial segmentation, segments labeling, and HMM parameter estimation. The ASMs are incorporated into a template-matching framework for query-by-example STD. Both the spoken query examples and the test utterances are represented by frame-level ASM posteriorgrams. Segmental dynamic time warping (DTW) is applied to match the query with the test utterance and locate the possible occurrences. The performance of the proposed approach is evaluated with different DTW local distance measures on the TIMIT and the Fisher Corpora respectively. Experimental results show that the use of ASM posteriorgrams leads to consistently better performance of detection than the conventional GMM posteriorgrams.
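The matching step described above can be sketched minimally: the core of segmental DTW is an ordinary DTW alignment between the query posteriorgram and (a window of) the test utterance, under some local distance between posteriorgram frames. The sketch below is an illustrative assumption, not the paper's implementation; the function name `dtw_distance` is hypothetical, and the negative-log inner product is just one of the local distance measures such systems compare.

```python
import numpy as np

def dtw_distance(query, utt):
    """Plain DTW between two posteriorgram sequences (frames x states).

    Local distance: negative log inner product between posterior
    vectors -- one common choice for posteriorgram matching (an
    assumption here; the paper evaluates several local distances).
    """
    eps = 1e-10
    # pairwise local distances between all query/utterance frame pairs
    D = -np.log(np.clip(query @ utt.T, eps, None))
    n, m = D.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # standard DTW recursion: best of insert/delete/match moves
            acc[i, j] = D[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )
    return acc[n, m] / (n + m)  # length-normalized alignment cost

# toy posteriorgrams: each row is a distribution over 8 ASM states
rng = np.random.default_rng(0)
q = rng.dirichlet(np.ones(8), size=20)   # 20 query frames
u = rng.dirichlet(np.ones(8), size=50)   # 50 utterance frames
print(dtw_distance(q, u))
```

In segmental DTW this alignment would be repeated over sliding windows of the utterance, keeping the lowest-cost window as a detected occurrence.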

28 citations

Proceedings ArticleDOI
31 Mar 2016
TL;DR: A ConvRBM with sampling from noisy rectified linear units (NReLUs) is trained in an unsupervised way to model speech signals of arbitrary length; the weights of the model can represent an auditory-like filterbank.
Abstract: A Convolutional Restricted Boltzmann Machine (ConvRBM) as a model for the speech signal is presented in this paper. We have developed ConvRBM with sampling from noisy rectified linear units (NReLUs). ConvRBM is trained in an unsupervised way to model speech signals of arbitrary length. The weights of the model can represent an auditory-like filterbank. Our learned filterbank is also nonlinear with respect to the center frequencies of its subband filters, similar to standard filterbanks (such as Mel, Bark, and ERB). We have used the proposed model as a front-end to learn features and applied it to a speech recognition task. ConvRBM features improve over MFCC, with relative improvements of 5% on the TIMIT test set and 7% on the WSJ0 database for both Nov'92 test sets using GMM-HMM systems. With DNN-HMM systems, we achieved a relative improvement of 3% on the TIMIT test set over MFCC and the Mel filterbank (FBANK). On the WSJ0 Nov'92 test sets, we achieved relative improvements of 4–14% using ConvRBM features over MFCC features and 3.6–5.6% using the ConvRBM filterbank over FBANK features.
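For context on the "nonlinear with respect to center frequencies" comparison: a standard Mel filterbank spaces its filters uniformly on the Mel scale, so the center frequencies in Hz are nonlinearly (roughly logarithmically) spaced. A minimal sketch, assuming the common 2595·log10(1 + f/700) Mel formula; the helper name `mel_center_freqs` is hypothetical:

```python
import numpy as np

def mel_center_freqs(n_filters, fmin=0.0, fmax=8000.0):
    """Center frequencies (Hz) of a Mel filterbank.

    Filters are equally spaced on the Mel scale, hence increasingly
    far apart in Hz -- the kind of nonlinear spacing a learned
    auditory-like filterbank is compared against.
    """
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(mel(fmin), mel(fmax), n_filters + 2)
    return inv_mel(mels[1:-1])  # drop the two band-edge points

centers = mel_center_freqs(26)
# spacing grows with frequency: low-band gaps are much smaller
print(np.diff(centers)[:3], np.diff(centers)[-3:])
```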

28 citations

Proceedings ArticleDOI
15 Apr 2007
TL;DR: To integrate the two very different modules, Brno's phone recognizer is modified into a phone lattice hypothesizer that produces high-quality phone lattices, which are fed directly into the knowledge-based module for rescoring.
Abstract: This study is the result of a collaboration between two groups, one from Brno University of Technology and the other from the Georgia Institute of Technology (GT). The Brno recognizer is known to outperform many state-of-the-art systems on phone recognition, while the GT knowledge-based lattice rescoring module has been shown to improve system performance on a number of speech recognition tasks. We believe a combination of the two systems results in high-accuracy phone recognition. To integrate the two very different modules, we modify Brno's phone recognizer into a phone lattice hypothesizer that produces high-quality phone lattices, and feed them directly into the knowledge-based module to rescore the lattices. We test the combined system on the TIMIT continuous phone recognition task without retraining the individual subsystems, and observe that the phone error rate is reduced to 19.78% from the 24.41% produced by the Brno phone recognizer alone. To the best of the authors' knowledge, this result represents the lowest error rate reported to date on the TIMIT continuous phone recognition task.

28 citations

Proceedings ArticleDOI
12 Nov 2015
TL;DR: It is shown that this descriptor successfully characterizes complex time-frequency phenomena such as time-varying filters and frequency modulated excitations on the TIMIT dataset.
Abstract: We introduce the joint time-frequency scattering transform, a time shift invariant descriptor of time-frequency structure for audio classification. It is obtained by applying a two-dimensional wavelet transform in time and log-frequency to a time-frequency wavelet scalogram. We show that this descriptor successfully characterizes complex time-frequency phenomena such as time-varying filters and frequency modulated excitations. State-of-the-art results are achieved for signal reconstruction and phone segment classification on the TIMIT dataset.
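As a rough illustration of the idea, not the authors' implementation: one can build a crude wavelet scalogram, convolve it with a single 2-D wavelet over (log-frequency, time), and pool by a global average to get one time-shift-invariant joint coefficient. Everything below (the Gabor atoms, the parameter choices, the helper names) is an assumption made for the sketch:

```python
import numpy as np

def gabor(n, xi, sigma):
    """Complex Gabor atom of length n, center frequency xi (cycles/sample)."""
    t = np.arange(n) - n // 2
    return np.exp(-t**2 / (2 * sigma**2)) * np.exp(2j * np.pi * xi * t)

def scalogram(x, xis):
    """Crude wavelet scalogram: |x * psi_xi| for log-spaced center
    frequencies (rows = log-frequency, columns = time)."""
    return np.array([
        np.abs(np.convolve(x, gabor(len(x), xi, 2.0 / xi), mode="same"))
        for xi in xis
    ])

def joint_coeff(S, wav2d):
    """One joint coefficient: modulus of the 2-D convolution of the
    scalogram with a 2-D wavelet, pooled by a global average (the
    time-shift-invariant step)."""
    F = np.fft.fft2(S) * np.fft.fft2(wav2d, s=S.shape)
    return np.abs(np.fft.ifft2(F)).mean()

# toy frequency-modulated signal: an upward chirp
t = np.linspace(0.0, 1.0, 1024)
x = np.cos(2 * np.pi * (50.0 * t + 200.0 * t**2))
S = scalogram(x, np.geomspace(0.02, 0.4, 12))
# separable 2-D wavelet over the (log-frequency, time) axes
wav2d = np.outer(gabor(6, 0.25, 2.0), gabor(48, 0.05, 12.0))
print(joint_coeff(S, wav2d))
```

A real joint scattering transform uses a full bank of 2-D wavelets at multiple scales and orientations and local (rather than global) averaging; this sketch only shows the shape of one such coefficient.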

28 citations

Journal ArticleDOI
TL;DR: A two-stage approach with two DNN-based methods addresses dereverberation and separation in the monaural source separation problem, outperforming the state of the art, particularly in highly reverberant room environments.
Abstract: Deep neural networks (DNNs) have been used for dereverberation and separation in the monaural source separation problem. However, the performance of current state-of-the-art methods is limited, particularly when applied in highly reverberant room environments. In this paper, we propose a two-stage approach with two DNN-based methods to address this problem. In the first stage, the dereverberation of the speech mixture is achieved with the proposed dereverberation mask (DM). In the second stage, the dereverberant speech mixture is separated with the ideal ratio mask (IRM). To realize this two-stage approach, in the first DNN-based method, the DM is integrated with the IRM to generate the enhanced time-frequency (T-F) mask, namely the ideal enhanced mask (IEM), as the training target for the single DNN. In the second DNN-based method, the DM and the IRM are predicted with two individual DNNs. The IEEE and TIMIT corpora with real room impulse responses and noise from the NOISEX dataset are used to generate speech mixtures for evaluations. The proposed methods outperform the state of the art, specifically in highly reverberant room environments.
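For reference, the ideal ratio mask used in the second stage has a standard closed form. A minimal sketch, assuming magnitude spectrograms and the common beta = 0.5 exponent; the paper's DM and IEM are separate constructions built on masks of this kind, and the function name here is hypothetical:

```python
import numpy as np

def ideal_ratio_mask(S_mag, N_mag, beta=0.5):
    """Ideal ratio mask from clean-speech and interference magnitude
    spectrograms (frequency x frames).

    IRM = (|S|^2 / (|S|^2 + |N|^2))^beta -- a standard T-F mask
    definition; each entry lies in [0, 1] and approaches 1 where
    speech dominates the T-F bin.
    """
    eps = 1e-12  # avoid division by zero in silent bins
    snr = S_mag**2 / (S_mag**2 + N_mag**2 + eps)
    return snr**beta

rng = np.random.default_rng(1)
S = np.abs(rng.normal(size=(257, 100)))  # toy clean-speech magnitudes
N = np.abs(rng.normal(size=(257, 100)))  # toy interference magnitudes
M = ideal_ratio_mask(S, N)
# applying the mask to a (toy, additive-magnitude) mixture attenuates
# the noise-dominated bins
enhanced = M * (S + N)
print(M.min(), M.max())
```

In a trained system the DNN predicts the mask from mixture features; the oracle mask above is only the training target.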

28 citations


Network Information
Related Topics (5)
- Recurrent neural network: 29.2K papers, 890K citations (76% related)
- Feature (machine learning): 33.9K papers, 798.7K citations (75% related)
- Feature vector: 48.8K papers, 954.4K citations (74% related)
- Natural language: 31.1K papers, 806.8K citations (73% related)
- Deep learning: 79.8K papers, 2.1M citations (72% related)
Performance
Metrics
No. of papers in the topic in previous years

Year  Papers
2023    24
2022    62
2021    67
2020    86
2019    77
2018    95