Topic

TIMIT

About: TIMIT is a research topic. Over its lifetime, 1,401 publications have been published within this topic, receiving 59,888 citations. The topic is also known as: TIMIT Acoustic-Phonetic Continuous Speech Corpus.


Papers
Journal Article
TL;DR: As recommendations from the study, mean fusion is found to yield the best overall speaker identification accuracy (SIA) on noisy speech, whereas linear weighted sum fusion performs best on the original database recordings.
Abstract: In this study, a speaker identification system is considered, consisting of a feature extraction stage that utilizes both power normalized cepstral coefficients (PNCCs) and Mel-frequency cepstral coefficients (MFCCs). Normalization is applied by employing cepstral mean and variance normalization (CMVN) and feature warping (FW), together with acoustic modeling using a Gaussian mixture model-universal background model (GMM-UBM). The main contributions are comprehensive evaluations of the effect of both additive white Gaussian noise (AWGN) and non-stationary noise (NSN) (with and without a G.712-type handset) upon identification performance. In particular, three NSN types with varying signal-to-noise ratios (SNRs) were tested, corresponding to street traffic, a bus interior, and a crowded talking environment. The performance evaluation also considered the effect of late fusion techniques based on score fusion, namely mean, maximum, and linear weighted sum fusion. The databases employed were TIMIT, SITW, and NIST 2008; 120 speakers were selected from each database to yield 3600 speech utterances. As recommendations from the study, mean fusion is found to yield the best overall speaker identification accuracy (SIA) on noisy speech, whereas linear weighted sum fusion performs best on the original database recordings.
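As an illustration of the three late-fusion rules compared in the paper, a minimal Python sketch follows, assuming two subsystems (say, MFCC- and PNCC-based GMM-UBM scorers) that each produce one score per enrolled speaker. The fuse_scores helper, the example score vectors, and the weight value are hypothetical, not taken from the paper.

# A minimal sketch of the three late-fusion rules compared in the paper
# (mean, maximum, and linear weighted sum), applied to per-speaker scores
# from two subsystems (say, MFCC- and PNCC-based GMM-UBM scorers). The
# fuse_scores helper, the example scores, and the weight are hypothetical.
import numpy as np

def fuse_scores(scores_a, scores_b, method="mean", w=0.6):
    """Fuse two score vectors (one score per enrolled speaker) and
    return the index of the identified speaker."""
    s1, s2 = np.asarray(scores_a, float), np.asarray(scores_b, float)
    if method == "mean":
        fused = (s1 + s2) / 2.0
    elif method == "max":
        fused = np.maximum(s1, s2)
    elif method == "weighted":                   # linear weighted sum
        fused = w * s1 + (1.0 - w) * s2
    else:
        raise ValueError(f"unknown fusion method: {method}")
    return int(np.argmax(fused))

# Example: scores for four enrolled speakers from each subsystem.
print(fuse_scores([-1.2, 0.4, -0.3, 0.1], [-0.8, 0.2, 0.5, -0.1], "mean"))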

14 citations

Journal Article
TL;DR: In this article, the authors investigate the interactions of front-end feature extraction and back-end classification techniques in nonstationary state hidden Markov model (NSHMM) based speech recognition.
Abstract: In this letter, we investigate the interactions of front-end feature extraction and back-end classification techniques in nonstationary state hidden Markov model (NSHMM) based speech recognition. The proposed model aims at finding an optimal linear transformation on the mel-warped discrete Fourier transform (DFT) features according to the minimum classification error (MCE) criterion. This linear transformation, along with the NSHMM parameters, is automatically trained using the gradient descent method. An error rate reduction of 8% is obtained on a standard 39-class TIMIT phone classification task in comparison with the MCE-trained NSHMM using conventional preprocessing techniques.
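The MCE idea at the heart of this letter can be sketched with a toy stand-in: a linear feature transform and simple per-class linear scorers (standing in for the NSHMM log-likelihoods) trained jointly by gradient descent on a sigmoid-smoothed classification-error loss. All dimensions, learning rates, and data below are illustrative assumptions, not the paper's setup.

# A toy sketch of MCE training of a linear feature transform, assuming
# per-class linear scorers stand in for the NSHMM log-likelihoods of the
# paper. The transform W (acting on mel-warped DFT features) and the class
# parameters M are updated jointly by gradient descent on the
# sigmoid-smoothed classification-error loss. All sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
D_in, D_out, C = 20, 12, 3                       # feature dims, classes
W = rng.normal(scale=0.1, size=(D_out, D_in))    # linear feature transform
M = rng.normal(scale=0.1, size=(C, D_out))       # toy per-class scorers

def mce_step(x, y, W, M, lr=0.05, alpha=1.0):
    """One gradient step on the smoothed MCE loss for sample (x, y)."""
    z = W @ x                                    # transformed feature
    g = M @ z                                    # class discriminant scores
    r = max((c for c in range(C) if c != y), key=lambda c: g[c])
    d = g[r] - g[y]                              # misclassification measure
    s = 1.0 / (1.0 + np.exp(-alpha * d))         # smoothed 0/1 loss
    coeff = alpha * s * (1.0 - s)                # d(loss)/d(d)
    dz = coeff * (M[r] - M[y])                   # chain rule through z = Wx
    W -= lr * np.outer(dz, x)                    # update the transform...
    M[r] -= lr * coeff * z                       # ...and both class models
    M[y] += lr * coeff * z
    return s

for _ in range(200):                             # toy training loop
    y = int(rng.integers(C))
    x = rng.normal(size=D_in) + y                # class-shifted toy feature
    s = mce_step(x, y, W, M)
print(f"smoothed error on last sample: {s:.3f}")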

14 citations

Proceedings Article
01 Dec 2010
TL;DR: A comparative evaluation of speech enhancement algorithms for robust automatic speech recognition on the core test set of the TIMIT speech corpus, reporting mean objective speech quality and ASR correctness scores under two noise conditions.
Abstract: A comparative evaluation of speech enhancement algorithms for robust automatic speech recognition is presented. The evaluation is performed on a core test set of the TIMIT speech corpus. Mean objective speech quality scores as well as ASR correctness scores under two noise conditions are given.
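The ASR correctness measure reported here is conventionally Corr = (N - S - D) / N, where S and D are the substitution and deletion counts from a Levenshtein alignment of the reference and hypothesis word sequences (insertions are ignored, unlike in accuracy). A minimal sketch follows; the wer_counts helper and the toy hypothesis are assumptions, with the reference taken from a fragment of the TIMIT sa1 prompt.

# A minimal sketch of the word "correctness" score, assuming the standard
# definition Corr = (N - S - D) / N from a Levenshtein alignment.
def wer_counts(ref, hyp):
    """Return (hits, substitutions, deletions, insertions) for an
    optimal Levenshtein alignment of the two word lists."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    i, j, hits, subs, dels, ins = n, m, 0, 0, 0, 0
    while i > 0 or j > 0:                        # backtrack an optimal path
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            hits += ref[i - 1] == hyp[j - 1]
            subs += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return hits, subs, dels, ins

ref = "she had your dark suit in greasy wash water".split()
hyp = "she had dark suit in greasy wash water all".split()
h, s, d_, i_ = wer_counts(ref, hyp)
print(f"Corr = {100.0 * h / len(ref):.1f}%")     # (N - S - D) / N = 88.9%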

14 citations

Journal Article
TL;DR: A new pooling technique, multilevel region of interest (RoI) pooling, is proposed, which pools information from multiple ConvNet layers and thereby improves the extracted features with additional information from the multilevel convolutional neural network layers.
Abstract: Efficient and robust automatic speech recognition (ASR) systems are in high demand in the present scenario. ASR systems are generally fed with cepstral features such as mel-frequency cepstral coefficients and perceptual linear prediction. However, some attempts have also been made to shift to simpler features, such as critical band energies or spectrograms, using deep learning models. These approaches claim the ability to train directly on the raw signal. Such systems depend heavily on the discriminative power of the ConvNet layers to separate phonemes with nearly similar accents, yet they do not offer a high recognition rate. The main reason for the limited recognition rate is stride-based pooling, which sharply reduces output dimensionality (by at least 75%). To improve performance, region-based convolutional neural networks (R-CNNs) and Fast R-CNN were proposed, but their performance did not meet the expected level. Therefore, a new pooling technique, multilevel region of interest (RoI) pooling, is proposed, which pools multilevel information from multiple ConvNet layers. The newly proposed architecture is named the multilevel RoI convolutional neural network (MR-CNN). It is designed by simply placing RoI pooling layers after up to four of the coarsest layers. It improves the extracted features using additional information from the multilevel ConvNet layers. Its performance is evaluated on the TIMIT and Wall Street Journal (WSJ) datasets for phoneme recognition. The phoneme error rate offered by this model on raw speech is 16.4% on TIMIT and 17.1% on WSJ, slightly better than with spectral features.
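A minimal sketch of the multilevel pooling idea: a fixed-length descriptor is pooled over the same region of interest in several conv layers at once and the results are concatenated, so later stages see multiple resolutions instead of only the heavily strided top layer. The 1-D layer shapes, the fractional RoI encoding, and the output size below are illustrative assumptions, not the paper's exact MR-CNN configuration.

# A minimal sketch of multilevel RoI pooling over 1-D (time) feature
# maps, with the RoI given as fractions of the utterance. Shapes and
# sizes are illustrative, not the paper's exact MR-CNN configuration.
import torch
import torch.nn.functional as F

def multilevel_roi_pool(feature_maps, roi, out_len=4):
    """Pool one fixed-size descriptor per conv layer over the same RoI
    and concatenate them into a single feature vector."""
    pooled = []
    for fmap in feature_maps:                    # fmap: (channels, time)
        t = fmap.shape[-1]
        a = int(roi[0] * t)
        b = max(int(roi[1] * t), a + 1)          # keep the region non-empty
        region = fmap[:, a:b].unsqueeze(0)       # (1, C, region_len)
        # Adaptive pooling maps any region length to out_len bins,
        # avoiding the sharp stride-based dimensionality reduction.
        pooled.append(F.adaptive_max_pool1d(region, out_len).flatten())
    return torch.cat(pooled)                     # one fixed-size vector

# Example: three conv layers at decreasing time resolution.
maps = [torch.randn(8, 200), torch.randn(16, 100), torch.randn(32, 50)]
vec = multilevel_roi_pool(maps, roi=(0.25, 0.5))
print(vec.shape)                                 # (8 + 16 + 32) * 4 = 224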

14 citations

Proceedings Article
04 Sep 2005
TL;DR: This paper regularizes LDA and heteroscedastic LDA transforms using two methods: (1) using statistical priors for the transform in a MAP formulation and (2) using structural constraints on the transform.
Abstract: Feature extraction is an essential first step in speech recognition applications. In addition to static features extracted from each frame of speech data, it is beneficial to use dynamic features (called Δ and ΔΔ coefficients) that use information from neighboring frames. Linear Discriminant Analysis (LDA) followed by a diagonalizing maximum likelihood linear transform (MLLT) applied to spliced static MFCC features yields important performance gains as compared to MFCC+Δ+ΔΔ features in most tasks. However, since LDA is obtained using statistical averages trained on limited data, it is reasonable to regularize the LDA transform computation by using prior information and experience. In this paper, we regularize LDA and heteroscedastic LDA transforms using two methods: (1) using statistical priors for the transform in a MAP formulation and (2) using structural constraints on the transform. As the prior, we use a transform that computes static+Δ+ΔΔ coefficients. Our structural constraint is in the form of a block-structured LDA transform where each block acts on the same cepstral parameters across frames. The second approach suggests using new coefficients for the static, first-difference, and second-difference operators, as compared to the standard ones, to improve performance. We test the new algorithms on two different tasks, namely TIMIT phone recognition and AURORA2 digit sequence recognition in noise. We obtain consistent improvements in our experiments as compared to MFCC features. In addition, we obtain encouraging results in some AURORA2 tests as compared to LDA+MLLT features.
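The prior used here, the transform computing static+Δ+ΔΔ coefficients from spliced static frames, is itself a linear map and can be written out explicitly. A minimal sketch follows, assuming 13 static MFCCs and HTK-style delta regression weights over a ±2 frame window (both assumptions for illustration, not the paper's exact settings).

# A minimal sketch showing that static+delta+delta-delta extraction is a
# linear map on spliced static frames, i.e. the kind of transform this
# paper uses as the MAP prior for regularized LDA. 13 static MFCCs and
# HTK-style delta weights over a +/-2 frame window are assumptions.
import numpy as np

def delta_weights(K=2):
    """Delta regression weights over a 2K+1 frame window (HTK style)."""
    denom = 2.0 * sum(k * k for k in range(1, K + 1))
    return np.array([k / denom for k in range(-K, K + 1)])

def prior_transform(n_ceps=13, K=2):
    """Linear map from 4K+1 spliced static frames to static+D+DD."""
    span = 4 * K + 1                   # frames touched by the delta-delta
    w = delta_weights(K)
    T = np.zeros((3 * n_ceps, span * n_ceps))
    I = np.eye(n_ceps)
    c = 2 * K                          # index of the centre frame
    T[:n_ceps, c * n_ceps:(c + 1) * n_ceps] = I                 # static
    for k in range(-K, K + 1):                                  # delta
        T[n_ceps:2 * n_ceps, (c + k) * n_ceps:(c + k + 1) * n_ceps] = w[k + K] * I
    for k1 in range(-K, K + 1):                                 # delta-delta
        for k2 in range(-K, K + 1):
            j = c + k1 + k2
            T[2 * n_ceps:, j * n_ceps:(j + 1) * n_ceps] += w[k1 + K] * w[k2 + K] * I
    return T

print(prior_transform().shape)         # (39, 117): 9 frames x 13 -> 39 dims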

14 citations


Network Information
Related Topics (5)
Recurrent neural network: 29.2K papers, 890K citations, 76% related
Feature (machine learning): 33.9K papers, 798.7K citations, 75% related
Feature vector: 48.8K papers, 954.4K citations, 74% related
Natural language: 31.1K papers, 806.8K citations, 73% related
Deep learning: 79.8K papers, 2.1M citations, 72% related
Performance Metrics
No. of papers in the topic in previous years:

Year    Papers
2023    24
2022    62
2021    67
2020    86
2019    77
2018    95