Topic

TIMIT

About: TIMIT is a research topic. Over the lifetime, 1401 publications have been published within this topic receiving 59888 citations. The topic is also known as: TIMIT Acoustic-Phonetic Continuous Speech Corpus.


Papers
Journal ArticleDOI
TL;DR: This paper used the TIMIT corpus of spoken sentences produced by talkers from a number of distinct dialect regions in the United States, and found that several phonetic features distinguish between the dialects.
Abstract: The perception of phonological differences between regional dialects of American English by naive listeners is poorly understood. Using the TIMIT corpus of spoken sentences produced by talkers from a number of distinct dialect regions in the United States, an acoustic analysis conducted in Experiment I confirmed that several phonetic features distinguish between the dialects. In Experiment II recordings of the sentences were played back to naive listeners who were asked to categorize each talker into one of six geographical dialect regions. Results suggested that listeners are able to reliably categorize talkers into three broad dialect clusters, but have more difficulty accurately categorizing talkers into six smaller regions. Correlations between the acoustic measures and both actual dialect affiliation of the talkers and dialect categorization of the talkers by the listeners revealed that the listeners in this study were sensitive to acoustic‐phonetic features of the dialects in categorizing the talker...

5 citations

Proceedings ArticleDOI
20 Mar 2016
TL;DR: This work found that the peaks detected from the data-driven approach significantly improve the speech rate estimation when combined with the traditional TCSSBC approach using a proposed peak-merging strategy.
Abstract: A typical solution for speech rate estimation consists of two stages: first, computing a short-time feature contour such that most of the peaks of the contour correspond to syllable nuclei, and second, detecting those peaks. Temporal correlation selected subband correlation (TCSSBC) is often used as a feature contour for speech rate estimation, in which correlations within and across a few selected sub-band energies are computed. In this work, instead of a fixed set of sub-bands, we learn them in a data-driven manner using a dictionary learning approach. Similarly, instead of the energy contours, we use the activation profile of the learned dictionary elements. We found that the peaks detected from the data-driven approach significantly improve the speech rate estimation when combined with the traditional TCSSBC approach using a proposed peak-merging strategy. Experiments are performed separately on the Switchboard, TIMIT and CTIMIT corpora. For TIMIT and CTIMIT (though not for Switchboard), the correlation coefficient for speech rate estimation using the proposed approach is higher than that of the TCSSBC technique, with 3.1% and 5.2% relative improvements respectively.
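The two-stage pipeline this abstract describes (a peak-bearing feature contour, then peak picking over it) can be sketched in a few lines. This is a toy illustration with a synthetic contour and an arbitrary threshold, not the paper's TCSSBC features or its peak-merging strategy:

```python
def find_peaks(contour, threshold):
    """Indices of local maxima of the contour that exceed `threshold`.
    Each surviving peak is taken as one syllable nucleus."""
    return [i for i in range(1, len(contour) - 1)
            if contour[i] > threshold
            and contour[i - 1] < contour[i] >= contour[i + 1]]

def speech_rate(contour, threshold, duration_s):
    """Syllables per second: detected nuclei / utterance duration."""
    return len(find_peaks(contour, threshold)) / duration_s

# Toy contour with three clear peaks over a 1.5 s utterance.
contour = [0.0, 0.2, 0.9, 0.3, 0.1, 0.8, 0.2, 0.1, 0.7, 0.2, 0.0]
```

The paper's contribution replaces the fixed sub-band energies feeding such a contour with dictionary-learned activations, then merges the two peak sets before the final count.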

5 citations

31 Mar 2006
TL;DR: This paper addresses unsupervised speaker change detection with no prior knowledge of the number of speakers or their identities, testing a Bayesian Information Criterion (BIC) method with dynamic thresholding and a real-time metric-based approach that employs line spectral pairs (LSP) and uses the BIC criterion to validate potential speaker change points.
Abstract: This paper addresses the problem of unsupervised speaker change detection. We assume that there is no prior knowledge of the number of speakers or their identities. Two methods are tested. The first method uses the Bayesian Information Criterion (BIC), investigates the AudioSpectrumCentroid and AudioWaveformEnvelope features, and implements a dynamic thresholding followed by a fusion scheme. The second method is a real-time one that uses a metric-based approach employing line spectral pairs (LSP) and the BIC criterion to validate a potential speaker change point. The methods are tested on two different datasets. The first set was created by concatenating speakers from the TIMIT database and is referred to as the TIMIT data set. The second set was created by using recordings from the MPEG-7 test set CD1 and broadcast news and is referred to as the INESC dataset.
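The BIC test at the heart of both methods compares modelling a window with one Gaussian versus one Gaussian per side of a candidate change point, minus a model-size penalty. A minimal sketch, using a 1-D simplification with an illustrative penalty weight rather than the paper's full-covariance formulation over AudioSpectrumCentroid or LSP features:

```python
import math

def delta_bic(x, split, lam=1.0):
    """Delta-BIC for a candidate speaker change at index `split` in a 1-D
    feature sequence x. Positive values favour two separate Gaussians,
    i.e. a speaker change; lam weights the complexity penalty."""
    def var(seg):
        m = sum(seg) / len(seg)
        return sum((v - m) ** 2 for v in seg) / len(seg)
    n, n1, n2 = len(x), split, len(x) - split
    penalty = 0.5 * 2 * math.log(n)  # d=1: two parameters (mean, variance)
    return (n * math.log(var(x))
            - n1 * math.log(var(x[:split]))
            - n2 * math.log(var(x[split:]))) / 2 - lam * penalty
```

Sliding this test over the stream and keeping points where the statistic exceeds zero (or a dynamic threshold, as in the first method) yields candidate change points.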

5 citations

Posted Content
TL;DR: The results indicate that, given a good E2E model pre-trained on normal or pseudo-whispered speech, a relatively small set of whispered speech may suffice to obtain a reasonably good end-to-end whispered speech recognizer.
Abstract: Whispering is an important mode of human speech, but no end-to-end recognition results for it had been reported, probably due to the scarcity of available whispered speech data. In this paper, we present several approaches for end-to-end (E2E) recognition of whispered speech that account for its special characteristics and the scarcity of data. These include a frequency-weighted SpecAugment policy and a frequency-divided CNN feature extractor for better capturing the high-frequency structures of whispered speech, and a layer-wise transfer learning approach that pre-trains a model with normal or normal-to-whispered converted speech and then fine-tunes it with whispered speech to bridge the gap between whispered and normal speech. We achieve an overall relative reduction of 19.8% in PER and 44.4% in CER on a relatively small whispered TIMIT corpus. The results indicate that, as long as we have a good E2E model pre-trained on normal or pseudo-whispered speech, a relatively small set of whispered speech may suffice to obtain a reasonably good E2E whispered speech recognizer.
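A frequency-weighted masking policy in the spirit of the one this abstract names can be sketched by biasing where SpecAugment's frequency mask lands. The power-law bias toward low bins (so the high-frequency structure of whispered speech is masked less often) is an assumption made here for illustration; the paper's actual weighting may differ:

```python
import random

def freq_mask(spec, max_width, low_freq_bias=2.0, rng=random):
    """Apply one frequency mask to `spec` (a list of frames, each a list
    of mel-bin values). The mask start index is drawn with a bias toward
    LOW bins: u ** low_freq_bias pushes the start toward 0 for bias > 1,
    leaving high-frequency bins masked less often."""
    n_bins = len(spec[0])
    width = rng.randrange(1, max_width + 1)
    start = int((rng.random() ** low_freq_bias) * (n_bins - width))
    return [[0.0 if start <= f < start + width else v
             for f, v in enumerate(frame)]
            for frame in spec]
```

Standard SpecAugment draws the mask position uniformly; only the biased draw of `start` differs here.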

5 citations

Proceedings ArticleDOI
01 Dec 2017
TL;DR: An acoustic model to predict phone labels based on a recurrent neural network (RNN) with bidirectional long short-term memory (BLSTM) units, trained with the CTC technique; the positions of this model's spiky phone outputs are found to be consistent with the landmarks annotated in the TIMIT corpus.
Abstract: Acoustic features extracted in the vicinity of landmarks have demonstrated their usefulness for detecting mispronunciation in our recent work [1, 2]. Traditional approaches to detecting acoustic landmarks rely on annotations by linguists with prior knowledge of speech production mechanisms, which are laborious and expensive. This paper proposes a data-driven approach based on connectionist temporal classification (CTC) that can detect landmarks without any human labels while still maintaining performance consistent with knowledge-based models for stop burst landmarks. We designed an acoustic model to predict phone labels based on a recurrent neural network (RNN) with bidirectional long short-term memory (BLSTM) units, trained with the CTC technique. We found that the positions of the spiky phone outputs of this model are consistent with the landmarks annotated in the TIMIT corpus. Both data-driven and knowledge-based landmark models are applied to detect pronunciation errors of second-language (L2) Chinese learners. Experiments illustrate that the data-driven CTC landmark model is comparable to the knowledge-based model in pronunciation error detection, and fusing the two can further improve performance.
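The landmark idea above rests on a well-known property of CTC-trained models: frame posteriors are dominated by the blank symbol, with sparse, confident non-blank spikes. A minimal sketch of extracting those spike positions from per-frame posteriors (the 0.5 confidence threshold is an illustrative assumption, not the paper's setting):

```python
def ctc_spikes(posteriors, blank=0, threshold=0.5):
    """Return (frame_index, phone_index) pairs where a CTC model emits a
    confident non-blank spike. `posteriors` is a list of per-frame
    probability vectors; the rare frames where a non-blank phone
    dominates serve as landmark candidates."""
    spikes = []
    for t, frame in enumerate(posteriors):
        best = max(range(len(frame)), key=frame.__getitem__)
        if best != blank and frame[best] >= threshold:
            spikes.append((t, best))
    return spikes
```

Converting the returned frame indices to times (via the model's frame shift) gives the landmark positions compared against the TIMIT annotations.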

5 citations


Network Information
Related Topics (5)
Recurrent neural network: 29.2K papers, 890K citations, 76% related
Feature (machine learning): 33.9K papers, 798.7K citations, 75% related
Feature vector: 48.8K papers, 954.4K citations, 74% related
Natural language: 31.1K papers, 806.8K citations, 73% related
Deep learning: 79.8K papers, 2.1M citations, 72% related
Performance
Metrics
No. of papers in the topic in previous years

Year    Papers
2023    24
2022    62
2021    67
2020    86
2019    77
2018    95