
TIMIT

About: TIMIT is a research topic. Over its lifetime, 1,401 publications on this topic have been published, receiving 59,888 citations. The topic is also known as: TIMIT Acoustic-Phonetic Continuous Speech Corpus.


Papers
Journal ArticleDOI
TL;DR: This study demonstrates the merit of the Gammatone filter-bank in improving robustness to codec-degraded speech at different bit rates; the proposed feature achieves the best verification performance under aggressive compression.
Abstract: The main novelty of this work resides in incorporating a Gammatone filter-bank as a substitute for the Mel filter-bank in the extraction pipeline of the Product Spectrum (PS). The proposed feature is dubbed the Gammatone Product-Spectrum Cepstral Coefficients (GPSCC). Experiments are conducted on the TIMIT and noisy TIMIT corpora using the Gaussian Mixture Model with Universal Background Model (GMM-UBM) recognition algorithm. Performance evaluations indicate that GPSCC yields a drastic reduction in Equal Error Rate compared to other related features, and this gain in performance is more pronounced at low signal-to-noise ratios. Our study also demonstrates the merit of the Gammatone filter-bank in improving robustness to codec-degraded speech at different bit rates. Furthermore, the proposed GPSCC feature achieves the best verification performance under aggressive compression. Interestingly, at 6.60 kbps we observe that GPSCC achieves an absolute error reduction of 12% compared to the Mel Frequency Cepstral Coefficients (MFCC).

7 citations
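
For concreteness, here is a minimal sketch of the product-spectrum feature pipeline the abstract describes, in Python with NumPy/SciPy. It uses the standard identity that the product spectrum (power spectrum times group delay) equals Re{X}Re{Y} + Im{X}Im{Y}, where X is the DFT of x[n] and Y the DFT of n*x[n]. The Gammatone filterbank weight matrix is assumed to be precomputed elsewhere (e.g., on the ERB scale); this is an illustration under those assumptions, not the paper's exact implementation.

    import numpy as np
    from scipy.fftpack import dct

    def product_spectrum(frame, n_fft=512):
        # Product spectrum = power spectrum x group delay, which reduces to
        # Re{X}Re{Y} + Im{X}Im{Y} with X = DFT(x) and Y = DFT(n * x[n]).
        n = np.arange(len(frame))
        X = np.fft.rfft(frame, n_fft)
        Y = np.fft.rfft(n * frame, n_fft)
        return X.real * Y.real + X.imag * Y.imag

    def gpscc(frame, gammatone_fbank, n_ceps=13):
        # GPSCC-style features: product spectrum -> Gammatone filterbank ->
        # log -> DCT. `gammatone_fbank` is an assumed precomputed weight
        # matrix of shape (n_filters, n_fft // 2 + 1).
        ps = np.maximum(product_spectrum(frame), 1e-10)   # floor: PS can dip below zero
        energies = np.maximum(gammatone_fbank @ ps, 1e-10)
        return dct(np.log(energies), type=2, norm='ortho')[:n_ceps]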

Book ChapterDOI
19 Sep 2018
TL;DR: The effectiveness of RNNs for speaker recognition is shown by improving state-of-the-art speaker clustering performance and robustness on the classic TIMIT benchmark, and by experimentally demonstrating a "sweet spot" of segment length for capturing prosodic information that had been theoretically predicted in previous work.
Abstract: Deep neural networks have become a veritable alternative to classic speaker recognition and clustering methods in recent years. However, while the speech signal clearly is a time series, and despite the body of literature on the benefits of prosodic (suprasegmental) features, identifying voices has usually not been approached with sequence learning methods. Only recently has a recurrent neural network (RNN) been successfully applied to this task, while the use of convolutional neural networks (CNNs) (that are not able to capture arbitrary time dependencies, unlike RNNs) still prevails. In this paper, we show the effectiveness of RNNs for speaker recognition by improving state of the art speaker clustering performance and robustness on the classic TIMIT benchmark. We provide arguments why RNNs are superior by experimentally showing a “sweet spot” of the segment length for successfully capturing prosodic information that has been theoretically predicted in previous work.

7 citations
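
As an illustration of the segment-based RNN approach described above, the sketch below (PyTorch; the layer sizes and 128-dimensional embedding are assumptions, not the paper's configuration) embeds fixed-length segments with an LSTM; a clustering step would then group the resulting vectors by speaker.

    import torch
    import torch.nn as nn

    class SpeakerRNN(nn.Module):
        # An LSTM reads a fixed-length segment of spectral frames; its last
        # hidden state, projected down, serves as the speaker embedding.
        def __init__(self, n_feats=40, hidden=256, emb_dim=128):
            super().__init__()
            self.lstm = nn.LSTM(n_feats, hidden, num_layers=2, batch_first=True)
            self.proj = nn.Linear(hidden, emb_dim)

        def forward(self, segments):           # (batch, frames, n_feats)
            _, (h, _) = self.lstm(segments)    # h: (num_layers, batch, hidden)
            return self.proj(h[-1])            # one embedding per segment

    # The segment length is the knob the paper probes: long enough to carry
    # prosodic (suprasegmental) cues, short enough to stay speaker-pure.
    model = SpeakerRNN()
    segments = torch.randn(8, 100, 40)         # e.g., eight 1-second segments
    embeddings = model(segments)               # (8, 128); cluster these vectors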

Book ChapterDOI
01 Jan 2019
TL;DR: This chapter discusses in detail end-to-end acoustic modeling with a convolutional neural network (CNN)-based direct raw speech model, which performs better than traditional cepstral-feature-based systems but uses a high number of parameters.
Abstract: State-of-the-art automatic speech recognition (ASR) systems map speech into its corresponding text. Conventional ASR systems model the speech signal into phones in two steps: feature extraction and classifier training. Traditional ASR systems have been replaced by deep neural network (DNN)-based systems. Today, end-to-end ASR systems are gaining in popularity due to simplified model-building processes and the ability to map speech directly into text without predefined alignments. These models rely on data-driven learning methods and compete with complicated ASR models based on DNNs and linguistic resources. There are three major types of end-to-end architectures for ASR: attention-based methods, connectionist temporal classification, and the convolutional neural network (CNN)-based direct raw speech model. This chapter discusses end-to-end acoustic modeling using CNNs in detail. The CNN establishes the relationship between the raw speech signal and phones in a data-driven manner. Relevant features and classifiers are jointly learned from the raw speech. The first convolutional layer automatically learns a feature representation; this intermediate representation is more discriminative and is further processed by the remaining convolutional layers. The system performs better than traditional cepstral-feature-based systems but uses a high number of parameters. Its performance is evaluated on TIMIT, where it is reported to outperform an MFCC-feature-based GMM/HMM (Gaussian mixture model/hidden Markov model) system.

7 citations
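
A minimal sketch of such a raw-waveform CNN acoustic model follows (PyTorch; the layer sizes are illustrative assumptions, not the chapter's architecture). The first convolution plays the role of a learned filterbank applied directly to samples; later layers refine the representation; a linear head scores phone classes per frame.

    import torch
    import torch.nn as nn

    class RawSpeechCNN(nn.Module):
        def __init__(self, n_phones=39):       # 39 folded TIMIT phone classes
            super().__init__()
            self.features = nn.Sequential(
                # First layer learns a filterbank-like front end from raw
                # samples: ~25 ms windows with a 10 ms hop at 16 kHz.
                nn.Conv1d(1, 80, kernel_size=400, stride=160),
                nn.ReLU(),
                nn.Conv1d(80, 128, kernel_size=5, padding=2),
                nn.ReLU(),
                nn.Conv1d(128, 128, kernel_size=5, padding=2),
                nn.ReLU(),
            )
            self.classifier = nn.Linear(128, n_phones)

        def forward(self, wav):                        # (batch, samples)
            x = self.features(wav.unsqueeze(1))        # (batch, 128, frames)
            return self.classifier(x.transpose(1, 2))  # per-frame phone logits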

Proceedings Article
01 Sep 2008
TL;DR: Tests on the HIWIRE corpus show that multi-accent pronunciation modeling and acoustic adaptation reduce the WER by up to 76% compared to results of canonical models of the target language.
Abstract: In this article we present a study of multi-accent and accent-independent non-native speech recognition. We propose several approaches based on phonetic confusion and acoustic adaptation. The goal of this article is to investigate the feasibility of multi-accent non-native speech recognition without detecting the origin of the speaker. Tests on the HIWIRE corpus show that multi-accent pronunciation modeling and acoustic adaptation reduce the WER by up to 76% compared to the results of canonical models of the target language. We also investigate accent-independent approaches in order to assess the robustness of the proposed methods to unseen foreign accents. Experiments show that our approaches correctly handle unseen accents and give up to 55% WER reduction compared to the models of the target language. Finally, the proposed pronunciation modeling approach maintains recognition accuracy on canonical native speech, as assessed by our experiments on the TIMIT corpus.

7 citations
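
To make the phonetic-confusion idea concrete, here is a hypothetical sketch (the confusion sets below are invented for illustration; the paper derives them from data): phones that non-native speakers tend to confuse spawn extra pronunciation variants in the lexicon, so accented realizations can be matched without detecting the speaker's origin.

    from itertools import product

    confusions = {"th": ["t", "s"], "ih": ["iy"]}    # invented confusion sets

    def expand_pronunciations(phones):
        # Yield the canonical pronunciation plus every confusion variant.
        options = [[p] + confusions.get(p, []) for p in phones]
        return [" ".join(variant) for variant in product(*options)]

    lexicon = {"think": ["th ih ng k"]}              # toy canonical lexicon
    for word, prons in lexicon.items():
        variants = {v for p in prons for v in expand_pronunciations(p.split())}
        print(word, sorted(variants))
    # think -> "th ih ng k" plus variants such as "t iy ng k" and "s ih ng k"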

Proceedings ArticleDOI
07 May 1996
TL;DR: A new statistical speech model in which coarticulation is modeled explicitly is presented, along with a novel parameter-estimation method based on explicitly minimizing a measure of the segmentation error.
Abstract: This paper presents a new statistical speech model in which coarticulation is modeled explicitly. Unlike HMMs, in which the current state depends only on the previous state and the current observation, the proposed model supports dependence on the previous and next states and on the previous and current observations. The degree of coarticulation between adjacent phones is modeled parametrically and can be adjusted according to a parameter representing the speaking rate. The model also incorporates a parameter that represents a frame-by-frame measure of confidence in the speech. We present two methods for estimating the model parameters: one based on the K-means method, and a novel method based on explicitly minimizing a measure of the segmentation error. A new, efficient forward algorithm and the use of top candidates in the search greatly reduce the computational complexity. In an evaluation on the TIMIT database, we achieve a phone recognition rate of 77.1%.

7 citations
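
The 77.1% figure is a phone recognition rate; under the common definition it is one minus the normalized edit distance between the reference and hypothesized phone strings. A self-contained sketch (the example phone strings are made up):

    def phone_accuracy(ref, hyp):
        # Phone recognition rate = 1 - (Levenshtein distance between the
        # reference and hypothesized phone strings) / (reference length).
        m, n = len(ref), len(hyp)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i
        for j in range(n + 1):
            d[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,          # deletion
                              d[i][j - 1] + 1,          # insertion
                              d[i - 1][j - 1] + cost)   # substitution
        return 1.0 - d[m][n] / m

    ref = "sh iy hh ae d".split()                       # made-up phone strings
    hyp = "sh iy hh eh d".split()
    print(f"{phone_accuracy(ref, hyp):.1%}")            # 80.0%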


Network Information
Related Topics (5)
Recurrent neural network: 29.2K papers, 890K citations, 76% related
Feature (machine learning): 33.9K papers, 798.7K citations, 75% related
Feature vector: 48.8K papers, 954.4K citations, 74% related
Natural language: 31.1K papers, 806.8K citations, 73% related
Deep learning: 79.8K papers, 2.1M citations, 72% related
Performance Metrics
No. of papers in the topic in previous years:

Year    Papers
2023    24
2022    62
2021    67
2020    86
2019    77
2018    95