
TIMIT

About: TIMIT is a research topic. Over its lifetime, 1,401 publications have been published within this topic, receiving 59,888 citations. The topic is also known as: TIMIT Acoustic-Phonetic Continuous Speech Corpus.


Papers
Journal ArticleDOI
TL;DR: An artificial neural network architecture based on Gaussian kernels, together with an associated network training algorithm, is proposed for phonetic density estimation; its recognition capability is found to be comparable to that of a Bayes classifier.

6 citations
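The architecture above estimates phonetic class densities with Gaussian kernels. As a minimal, illustrative sketch of that idea — a plain fixed-bandwidth Gaussian kernel density estimate, not the paper's network or its training algorithm; the function name and the single shared bandwidth are assumptions:

```python
import numpy as np

def gaussian_kde(train, x, bandwidth=1.0):
    """Class-conditional density estimate with fixed-width Gaussian kernels
    centred on the training vectors of one phone class.
    train: (N, D) training vectors; x: (D,) query vector."""
    d2 = ((train - x) ** 2).sum(axis=1)            # squared distances to all kernels
    D = train.shape[1]
    norm = (2 * np.pi * bandwidth ** 2) ** (-D / 2)
    return norm * np.mean(np.exp(-0.5 * d2 / bandwidth ** 2))
```

Classification in the Bayes-classifier sense then reduces to picking the phone class whose estimated density at the query vector is highest (assuming equal priors).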

Journal ArticleDOI
TL;DR: Fusion strategies combining excitation source information with simple features show that comparable performance can be obtained in both close-talking and far-field microphone scenarios, and obfuscation methods applied to the excitation features yield low phoneme accuracies together with SND performance comparable to that of MFPLP features.
Abstract: The goal of this paper is to investigate features for speech/nonspeech detection (SND) having low linguistic information from the speech signal. Toward this, we present a comprehensive study of privacy-sensitive features for SND in multiparty conversations. Our study investigates three different approaches to privacy-sensitive features. These approaches are based on: 1) simple, instantaneous feature extraction methods; 2) excitation source information based methods; and 3) feature obfuscation methods such as local (within 130 ms) temporal averaging and randomization applied to excitation source information. To evaluate these approaches for SND, we use nearly 450 hours of multiparty conversational meeting data. On this dataset, we evaluate these features and benchmark them against standard spectral-shape-based features such as Mel frequency perceptual linear prediction (MFPLP). Fusion strategies combining excitation source with simple features show that comparable performance can be obtained in both close-talking and far-field microphone scenarios. As one way to objectively evaluate the notion of privacy, we conduct phoneme recognition studies on TIMIT. While excitation source features yield phoneme recognition accuracies between those of the simple features and the MFPLP features, obfuscation methods applied to the excitation features yield low phoneme accuracies together with SND performance comparable to that of MFPLP features.

6 citations
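The obfuscation approach in the abstract (local temporal averaging within ~130 ms, or randomization of excitation information) can be sketched as follows. This is a hypothetical reconstruction, not the authors' code: it assumes a 10 ms frame shift and non-overlapping block-wise windows.

```python
import numpy as np

def obfuscate_features(frames, frame_shift_ms=10.0, window_ms=130.0,
                       mode="average", seed=0):
    """Obfuscate per-frame features either by local temporal averaging or by
    randomizing the frame order within each window.
    frames: (T, D) array of per-frame feature vectors."""
    win = max(1, int(round(window_ms / frame_shift_ms)))    # ~13 frames at 10 ms shift
    out = frames.astype(float).copy()
    rng = np.random.default_rng(seed)
    for start in range(0, len(out), win):
        stop = min(start + win, len(out))
        if mode == "average":
            out[start:stop] = out[start:stop].mean(axis=0)  # local temporal averaging
        else:
            rng.shuffle(out[start:stop])                    # within-window randomization
    return out
```

Either transform destroys the fine temporal detail that phoneme recognition relies on while preserving the coarse energy contour useful for speech/nonspeech detection.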

Journal ArticleDOI
TL;DR: A human-intervened forced alignment method is proposed to obtain labels for t-distributed stochastic neighbor embedding (t-SNE), which are used to better understand the attention mechanism and the recurrent representations; t-SNE is further combined with canonical correlation analysis (CCA) to analyze the training dynamics of phones in the attention-based model.
Abstract: Although attention-based speech recognition has achieved promising performance, the specific explanation of the intermediate representations remains a black box. In this paper, we visually show and explain continuous encoder outputs. We propose a human-intervened forced alignment method to obtain labels for t-distributed stochastic neighbor embedding (t-SNE), and use them to better understand the attention mechanism and the recurrent representations. In addition, we combine t-SNE and canonical correlation analysis (CCA) to analyze the training dynamics of phones in the attention-based model. Experiments are carried out on TIMIT and WSJ. The aligned embeddings of the encoder outputs form sequence manifolds of the ground-truth labels. Figures of t-SNE embeddings visually show what representations the encoder has shaped and how the attention mechanism works for speech recognition. Comparisons between different models, different layers, and different utterance lengths show that the manifolds are clearer in shape when the outputs come from a deeper layer of the encoder, a shorter utterance, or a model with better performance. We also observe that the same symbols from different utterances tend to gather at similar positions, which supports the consistency of our method. Further comparisons are made between different epochs of the model using t-SNE and CCA. The results show that both the plosive and the nasal/flap phones converge quickly, while the long vowel phones converge slowly.

6 citations
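The CCA side of the analysis can be illustrated with a small numpy sketch: the canonical correlations between two paired views (for example, encoder outputs for the same frames at two training epochs) are the singular values of the product of the views' whitened bases. The function name and shapes below are assumptions, not the paper's implementation:

```python
import numpy as np

def cca_corrs(X, Y):
    """Canonical correlations between two paired views of the same frames.
    X: (N, Dx), Y: (N, Dy); rows are paired samples."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # Whiten each view via thin SVD; the canonical correlations are then
    # the singular values of the product of the whitened bases.
    Ux = np.linalg.svd(X, full_matrices=False)[0]
    Uy = np.linalg.svd(Y, full_matrices=False)[0]
    corrs = np.linalg.svd(Ux.T @ Uy, compute_uv=False)
    return np.clip(corrs, 0.0, 1.0)      # clip numerical overshoot
```

High correlations across epochs indicate a phone's representation has stabilized, which is how a "converges quickly vs. slowly" comparison like the one above can be quantified.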

Proceedings ArticleDOI
26 May 2013
TL;DR: A novel alignment technique using joint acoustic-articulatory features, combining dynamic time warping with automatic feature extraction from MRI images, is proposed; the resulting temporal alignment is better (12% relative) than that obtained using acoustic features only.
Abstract: In speech production research, the integration of articulatory data derived from multiple measurement modalities can provide a rich description of vocal tract dynamics by overcoming the limited spatio-temporal representations offered by individual modalities. This paper presents a spatial and temporal alignment method between two promising modalities using a corpus of TIMIT sentences obtained from the same speaker: flesh point tracking from Electromagnetic Articulography (EMA), which offers high temporal resolution but sparse spatial information, and real-time Magnetic Resonance Imaging (MRI), which offers good spatial detail but at lower temporal rates. Spatial alignment is done using palate tracking of EMA, but distortion in MRI audio and articulatory data variability make temporal alignment challenging. This paper proposes a novel alignment technique using joint acoustic-articulatory features which combines dynamic time warping and automatic feature extraction from MRI images. Experimental results show that the temporal alignment obtained using this technique is better (12% relative) than that using acoustic features only.

6 citations
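The core of the temporal alignment step is dynamic time warping. A plain DTW pass over two feature sequences might look like the following simplified sketch (Euclidean frame distance, no joint acoustic-articulatory feature extraction, which is the part the paper actually contributes):

```python
import numpy as np

def dtw_path(A, B):
    """Dynamic time warping between two feature sequences.
    A: (Ta, D), B: (Tb, D). Returns (total cost, warping path) where the
    path is a list of (i, j) frame-index pairs."""
    Ta, Tb = len(A), len(B)
    D = np.full((Ta + 1, Tb + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            d = np.linalg.norm(A[i - 1] - B[j - 1])   # frame-to-frame distance
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack from the end to recover the optimal warping path.
    path, i, j = [], Ta, Tb
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return D[Ta, Tb], path[::-1]
```

In the paper's setting, A and B would hold the joint acoustic-articulatory feature vectors extracted from the EMA and MRI recordings of the same sentence.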

Proceedings ArticleDOI
12 May 2008
TL;DR: Results from experiments on TIMIT and HIWIRE corpora indicate that this approach can speed up the likelihood computation significantly without introducing apparent additional recognition error.
Abstract: A fast likelihood computation approach called dynamic Gaussian selection (DGS) is proposed for HMM-based continuous speech recognition. DGS is a one-pass search technique which generates a dynamic shortlist of Gaussians for each state during likelihood computation; the shortlist consists of the Gaussians that make a prominent contribution to the likelihood. In principle, DGS is an extension of the technique of partial distance elimination, and it requires almost no additional memory for the storage of Gaussian shortlists. The DGS algorithm has been implemented by modifying the likelihood computation module in the HTK 3.4 system. Results from experiments on the TIMIT and HIWIRE corpora indicate that this approach can speed up likelihood computation significantly without introducing apparent additional recognition error.

6 citations
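Partial distance elimination, the baseline that DGS extends, can be sketched for diagonal-covariance Gaussians: accumulation of a component's Mahalanobis distance stops as soon as the partial sum already rules out beating the current best score. The function name and scoring convention below are illustrative, not HTK's implementation:

```python
import numpy as np

def best_gaussian_pde(x, means, inv_vars, log_consts):
    """Find the top-scoring diagonal Gaussian for frame x with partial
    distance elimination. Score = log_const - 0.5 * Mahalanobis distance,
    so a component can be abandoned once its partial distance exceeds
    2 * (log_const - best_score)."""
    best_m, best_score = -1, -np.inf
    for m in range(len(means)):
        bound = 2.0 * (log_consts[m] - best_score)   # distance budget to beat best
        dist, pruned = 0.0, False
        for d in range(len(x)):
            dist += inv_vars[m, d] * (x[d] - means[m, d]) ** 2
            if dist > bound:          # cannot beat the current best: abandon early
                pruned = True
                break
        if not pruned:
            score = log_consts[m] - 0.5 * dist
            if score > best_score:
                best_m, best_score = m, score
    return best_m, best_score
```

DGS goes further by also keeping the few near-best components found this way as a per-state shortlist for the mixture likelihood, which is why it needs almost no extra memory.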


Network Information
Related Topics (5)
- Recurrent neural network: 29.2K papers, 890K citations, 76% related
- Feature (machine learning): 33.9K papers, 798.7K citations, 75% related
- Feature vector: 48.8K papers, 954.4K citations, 74% related
- Natural language: 31.1K papers, 806.8K citations, 73% related
- Deep learning: 79.8K papers, 2.1M citations, 72% related
Performance Metrics
No. of papers in the topic in previous years:
2023: 24
2022: 62
2021: 67
2020: 86
2019: 77
2018: 95