
TIMIT

About: TIMIT is a research topic. Over its lifetime, 1,401 publications have been published within this topic, receiving 59,888 citations. The topic is also known as: TIMIT Acoustic-Phonetic Continuous Speech Corpus.


Papers
Journal ArticleDOI
TL;DR: An artificial neural network architecture based on Gaussian kernels, together with an associated network training algorithm, is proposed for phonetic density estimation; its recognition capability is found to be comparable to that of a Bayes classifier.

6 citations
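The architecture above estimates phonetic class densities with Gaussian kernels. As a minimal, illustrative sketch of that idea — a plain fixed-bandwidth Gaussian kernel density estimate, not the paper's network or its training algorithm; the function name and the single shared bandwidth are assumptions:

```python
import numpy as np

def gaussian_kde(train, x, bandwidth=1.0):
    """Class-conditional density estimate with fixed-width Gaussian kernels
    centred on the training vectors of one phone class.
    train: (N, D) training vectors; x: (D,) query vector."""
    d2 = ((train - x) ** 2).sum(axis=1)            # squared distances to all kernels
    D = train.shape[1]
    norm = (2 * np.pi * bandwidth ** 2) ** (-D / 2)
    return norm * np.mean(np.exp(-0.5 * d2 / bandwidth ** 2))
```

Classification in the Bayes-classifier sense then reduces to picking the phone class whose estimated density at the query vector is highest (assuming equal priors).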

Journal ArticleDOI
TL;DR: Fusion strategies combining excitation source information with simple features show that comparable performance can be obtained in both close-talking and far-field microphone scenarios, and obfuscation methods applied to the excitation features yield low phoneme accuracies together with SND performance comparable to that of MFPLP features.
Abstract: The goal of this paper is to investigate features for speech/nonspeech detection (SND) having low linguistic information from the speech signal. Toward this, we present a comprehensive study of privacy-sensitive features for SND in multiparty conversations. Our study investigates three different approaches to privacy-sensitive features. These approaches are based on: 1) simple, instantaneous feature extraction methods; 2) excitation source information based methods; and 3) feature obfuscation methods such as local (within 130 ms) temporal averaging and randomization applied to excitation source information. To evaluate these approaches for SND, we use nearly 450 hours of multiparty conversational meeting data. On this dataset, we evaluate these features and benchmark them against standard spectral-shape-based features such as Mel frequency perceptual linear prediction (MFPLP). Fusion strategies combining excitation source with simple features show that comparable performance can be obtained in both close-talking and far-field microphone scenarios. As one way to objectively evaluate the notion of privacy, we conduct phoneme recognition studies on TIMIT. While excitation source features yield phoneme recognition accuracies between those of the simple features and the MFPLP features, obfuscation methods applied to the excitation features yield low phoneme accuracies together with SND performance comparable to that of MFPLP features.

6 citations
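The obfuscation approach in the abstract (local temporal averaging within ~130 ms, or randomization of excitation information) can be sketched as follows. This is a hypothetical reconstruction, not the authors' code: it assumes a 10 ms frame shift and non-overlapping block-wise windows.

```python
import numpy as np

def obfuscate_features(frames, frame_shift_ms=10.0, window_ms=130.0,
                       mode="average", seed=0):
    """Obfuscate per-frame features either by local temporal averaging or by
    randomizing the frame order within each window.
    frames: (T, D) array of per-frame feature vectors."""
    win = max(1, int(round(window_ms / frame_shift_ms)))    # ~13 frames at 10 ms shift
    out = frames.astype(float).copy()
    rng = np.random.default_rng(seed)
    for start in range(0, len(out), win):
        stop = min(start + win, len(out))
        if mode == "average":
            out[start:stop] = out[start:stop].mean(axis=0)  # local temporal averaging
        else:
            rng.shuffle(out[start:stop])                    # within-window randomization
    return out
```

Either transform destroys the fine temporal detail that phoneme recognition relies on while preserving the coarse energy contour useful for speech/nonspeech detection.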

Journal ArticleDOI
TL;DR: A human-intervened forced alignment method is proposed to obtain labels for t-distributed stochastic neighbor embedding (t-SNE), which are used to better understand the attention mechanism and the recurrent representations; t-SNE is further combined with canonical correlation analysis (CCA) to analyze the training dynamics of phones in the attention-based model.
Abstract: Although attention-based speech recognition has achieved promising performance, the specific explanation of the intermediate representations remains a black box. In this paper, we visually show and explain continuous encoder outputs. We propose a human-intervened forced alignment method to obtain labels for t-distributed stochastic neighbor embedding (t-SNE), and use them to better understand the attention mechanism and the recurrent representations. In addition, we combine t-SNE and canonical correlation analysis (CCA) to analyze the training dynamics of phones in the attention-based model. Experiments are carried out on TIMIT and WSJ. The aligned embeddings of the encoder outputs form sequence manifolds of the ground-truth labels. Figures of t-SNE embeddings visually show what representations the encoder has shaped and how the attention mechanism works for speech recognition. Comparisons between different models, different layers, and different utterance lengths show that the manifolds are clearer in shape when the outputs come from a deeper layer of the encoder, a shorter utterance, or a model with better performance. We also observe that the same symbols from different utterances tend to gather at similar positions, which supports the consistency of our method. Further comparisons are made between different epochs of the model using t-SNE and CCA. The results show that both the plosive and the nasal/flap phones converge quickly, while the long vowel phones converge slowly.

6 citations
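The CCA side of the analysis can be illustrated with a small numpy sketch: the canonical correlations between two paired views (for example, encoder outputs for the same frames at two training epochs) are the singular values of the product of the views' whitened bases. The function name and shapes below are assumptions, not the paper's implementation:

```python
import numpy as np

def cca_corrs(X, Y):
    """Canonical correlations between two paired views of the same frames.
    X: (N, Dx), Y: (N, Dy); rows are paired samples."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # Whiten each view via thin SVD; the canonical correlations are then
    # the singular values of the product of the whitened bases.
    Ux = np.linalg.svd(X, full_matrices=False)[0]
    Uy = np.linalg.svd(Y, full_matrices=False)[0]
    corrs = np.linalg.svd(Ux.T @ Uy, compute_uv=False)
    return np.clip(corrs, 0.0, 1.0)      # clip numerical overshoot
```

High correlations across epochs indicate a phone's representation has stabilized, which is how a "converges quickly vs. slowly" comparison like the one above can be quantified.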

Proceedings ArticleDOI
26 May 2013
TL;DR: A novel alignment technique using joint acoustic-articulatory features, combining dynamic time warping with automatic feature extraction from MRI images, is proposed; the resulting temporal alignment is better (12% relative) than that obtained using acoustic features only.
Abstract: In speech production research, the integration of articulatory data derived from multiple measurement modalities can provide a rich description of vocal tract dynamics by overcoming the limited spatio-temporal representations offered by individual modalities. This paper presents a spatial and temporal alignment method between two promising modalities using a corpus of TIMIT sentences obtained from the same speaker: flesh point tracking from Electromagnetic Articulography (EMA), which offers high temporal resolution but sparse spatial information, and real-time Magnetic Resonance Imaging (MRI), which offers good spatial detail but at lower temporal rates. Spatial alignment is done using palate tracking of EMA, but distortion in MRI audio and articulatory data variability make temporal alignment challenging. This paper proposes a novel alignment technique using joint acoustic-articulatory features which combines dynamic time warping and automatic feature extraction from MRI images. Experimental results show that the temporal alignment obtained using this technique is better (12% relative) than that using acoustic features only.

6 citations
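The core of the temporal alignment step is dynamic time warping. A plain DTW pass over two feature sequences might look like the following simplified sketch (Euclidean frame distance, no joint acoustic-articulatory feature extraction, which is the part the paper actually contributes):

```python
import numpy as np

def dtw_path(A, B):
    """Dynamic time warping between two feature sequences.
    A: (Ta, D), B: (Tb, D). Returns (total cost, warping path) where the
    path is a list of (i, j) frame-index pairs."""
    Ta, Tb = len(A), len(B)
    D = np.full((Ta + 1, Tb + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            d = np.linalg.norm(A[i - 1] - B[j - 1])   # frame-to-frame distance
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack from the end to recover the optimal warping path.
    path, i, j = [], Ta, Tb
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return D[Ta, Tb], path[::-1]
```

In the paper's setting, A and B would hold the joint acoustic-articulatory feature vectors extracted from the EMA and MRI recordings of the same sentence.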

Proceedings ArticleDOI
12 May 2008
TL;DR: Results from experiments on TIMIT and HIWIRE corpora indicate that this approach can speed up the likelihood computation significantly without introducing apparent additional recognition error.
Abstract: A fast likelihood computation approach called dynamic Gaussian selection (DGS) is proposed for HMM-based continuous speech recognition. DGS is a one-pass search technique which generates a dynamic shortlist of Gaussians for each state during likelihood computation; the shortlist consists of the Gaussians that make a prominent contribution to the likelihood. In principle, DGS is an extension of the technique of partial distance elimination, and it requires almost no additional memory for the storage of Gaussian shortlists. The DGS algorithm has been implemented by modifying the likelihood computation module in the HTK 3.4 system. Results from experiments on the TIMIT and HIWIRE corpora indicate that this approach can speed up likelihood computation significantly without introducing apparent additional recognition error.

6 citations
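Partial distance elimination, the baseline that DGS extends, can be sketched for diagonal-covariance Gaussians: accumulation of a component's Mahalanobis distance stops as soon as the partial sum already rules out beating the current best score. The function name and scoring convention below are illustrative, not HTK's implementation:

```python
import numpy as np

def best_gaussian_pde(x, means, inv_vars, log_consts):
    """Find the top-scoring diagonal Gaussian for frame x with partial
    distance elimination. Score = log_const - 0.5 * Mahalanobis distance,
    so a component can be abandoned once its partial distance exceeds
    2 * (log_const - best_score)."""
    best_m, best_score = -1, -np.inf
    for m in range(len(means)):
        bound = 2.0 * (log_consts[m] - best_score)   # distance budget to beat best
        dist, pruned = 0.0, False
        for d in range(len(x)):
            dist += inv_vars[m, d] * (x[d] - means[m, d]) ** 2
            if dist > bound:          # cannot beat the current best: abandon early
                pruned = True
                break
        if not pruned:
            score = log_consts[m] - 0.5 * dist
            if score > best_score:
                best_m, best_score = m, score
    return best_m, best_score
```

DGS goes further by also keeping the few near-best components found this way as a per-state shortlist for the mixture likelihood, which is why it needs almost no extra memory.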


Network Information
Related Topics (5)
- Recurrent neural network: 29.2K papers, 890K citations, 76% related
- Feature (machine learning): 33.9K papers, 798.7K citations, 75% related
- Feature vector: 48.8K papers, 954.4K citations, 74% related
- Natural language: 31.1K papers, 806.8K citations, 73% related
- Deep learning: 79.8K papers, 2.1M citations, 72% related
Performance Metrics
No. of papers in the topic in previous years:
2023: 24
2022: 62
2021: 67
2020: 86
2019: 77
2018: 95