Topic

TIMIT

About: TIMIT is a research topic. Over its lifetime, 1,401 publications have been published within this topic, receiving 59,888 citations. The topic is also known as: TIMIT Acoustic-Phonetic Continuous Speech Corpus.


Papers
Proceedings ArticleDOI
04 May 2014
TL;DR: A subband hybrid (SBH) feature is developed for multi-stream (MS) speech recognition that significantly enhances the amount of information being extracted from individual subbands and achieves a substantial gain in performance over the single-stream baseline.
Abstract: A subband hybrid (SBH) feature is developed for multi-stream (MS) speech recognition. The fullband speech signal is decomposed into multiple subbands, each covering about 3 Bark along the frequency axis. The speech signal is analyzed by a high-resolution filterbank of 4 filters/Bark and a low-resolution filterbank of 2 filters/Bark to represent both short-term spectral modulation and long-term temporal modulation within a frequency subband. Experiments on the TIMIT corpus for English and the RATS corpus for Levantine Arabic show that the SBH feature significantly enhances the amount of information extracted from individual subbands. The MS system with a performance monitor achieves a substantial gain in performance over the single-stream baseline.
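To make the subband decomposition concrete, here is a minimal sketch (not the authors' code) of partitioning an STFT magnitude spectrogram into roughly 3-Bark subbands. The 4- and 2-filters/Bark analysis filterbanks are omitted, and the Bark conversion uses Traunmüller's approximation; all shapes are illustrative.

```python
import numpy as np

def hz_to_bark(f_hz):
    """Traunmueller's approximation of the Bark scale."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def subband_slices(sample_rate, n_fft, band_width_bark=3.0):
    """Group FFT bins into contiguous subbands of about band_width_bark each."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    barks = hz_to_bark(freqs)
    edges = np.arange(barks[0], barks[-1] + band_width_bark, band_width_bark)
    slices = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        idx = np.where((barks >= lo) & (barks < hi))[0]
        if idx.size:
            slices.append(slice(idx[0], idx[-1] + 1))
    return slices

# Example: 16 kHz speech (TIMIT's sampling rate) with a 512-point FFT
# yields roughly seven ~3-Bark subband streams.
bands = subband_slices(16000, 512)
spectrogram = np.abs(np.random.randn(257, 100))      # placeholder |STFT|, bins x frames
subband_streams = [spectrogram[s, :] for s in bands]
```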

2 citations

Proceedings ArticleDOI
01 Sep 2016
TL;DR: Feature detection with a small corpus shows a path by which continuous speech can be recognized even with an inadequate data repository; for small speech corpora, phonological-feature-based speech attributes can be detected with the bottom-up approach.
Abstract: In this paper, place- and manner-of-articulation based phonological features have been identified with high accuracy using a very minimal amount of training data. In the detection-based, bottom-up speech recognition approach, phonological-feature-based acoustic-phonetic speech attributes are considered a key component. After the features are identified, they are merged to obtain the phonemes. This type of feature detection with a small corpus therefore shows a path by which continuous speech can be recognized even with an inadequate data repository. The experiment covers two languages, Bengali and English. The sentences were trained using a deep neural network, as sketched below. Training was carried out for Bengali using three different corpus sizes of 100, 200, and 500 sentences. The average frame-level accuracies obtained were 87.88%, 88.43%, and 88.96%, respectively, for the CDAC speech corpus, whereas with the same training procedure on the TIMIT corpus the accuracies were 87.97%, 88.84%, and 89.39%, respectively. The average frame-level accuracy is thus almost the same irrespective of the amount of training data, which confirms that, for small speech corpora, phonological-feature-based speech attributes can be detected with the bottom-up approach.
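A rough sketch of a frame-level attribute detector of this kind, assuming PyTorch; the attribute inventory, feature dimensionality (39-dimensional MFCCs plus deltas), and network sizes are illustrative guesses, not the paper's configuration.

```python
import torch
import torch.nn as nn

# Hypothetical place/manner attribute inventory for illustration only.
ATTRIBUTES = ["vowel", "stop", "fricative", "nasal",
              "labial", "coronal", "dorsal", "voiced"]

class AttributeDetector(nn.Module):
    def __init__(self, n_feats=39, n_hidden=256, n_attrs=len(ATTRIBUTES)):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_feats, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_attrs),   # one logit per attribute
        )

    def forward(self, frames):                  # frames: (T, n_feats)
        return torch.sigmoid(self.net(frames))  # independent per-attribute posteriors

detector = AttributeDetector()
frames = torch.randn(100, 39)      # e.g. MFCC+delta features, one row per frame
posteriors = detector(frames)      # (100, 8) attribute scores
detected = posteriors > 0.5        # frame-level attribute detections
```

Detected attributes would then be merged across frames into phoneme hypotheses, the step the abstract describes as bottom-up recognition.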

2 citations

Posted Content
TL;DR: This study proposes an adaptive boosting approach to learning locality sensitive hash codes, which represent audio spectra efficiently, and uses the learned hash codes for single-channel speech denoising tasks as an alternative to a complex machine learning model.
Abstract: Speech enhancement tasks have seen significant improvements with the advance of deep learning technology, but at the cost of increased computational complexity. In this study, we propose an adaptive boosting approach to learning locality-sensitive hash codes, which represent audio spectra efficiently. We use the learned hash codes for single-channel speech denoising as an alternative to a complex machine learning model, particularly for resource-constrained environments. Our adaptive boosting algorithm learns simple logistic regressors as the weak learners. Once trained, their binary classification results transform each spectrum of test noisy speech into a bit string. Simple bitwise operations calculate the Hamming distance to find the K nearest matching frames in the dictionary of training noisy speech spectra, whose associated ideal binary masks are averaged to estimate the denoising mask for that test mixture. Our learning algorithm differs from AdaBoost in that the projections are trained to minimize the distance between the self-similarity matrix of the hash codes and that of the original spectra, rather than the misclassification rate. We evaluate our discriminative hash codes on the TIMIT corpus with various noise types, and show performance comparable to deep learning methods in terms of denoising quality and complexity.
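The test-time procedure lends itself to a short sketch. The following is illustrative only, with random stand-ins for the trained projections and dictionary; it shows the three steps described above: binarize each noisy spectrum with the learned projections, find its K nearest dictionary frames by Hamming distance, and average their ideal binary masks into a soft denoising mask.

```python
import numpy as np

rng = np.random.default_rng(0)
F, L, N, K = 513, 64, 5000, 16        # freq bins, hash bits, dict size, kNN

# Stand-ins for the trained logistic regressors: one projection + bias per bit.
W = rng.standard_normal((F, L))
b = rng.standard_normal(L)

def hash_frames(spectra):
    """Binarize each spectrum with the L learned projections -> bit matrix."""
    return (spectra @ W + b > 0).astype(np.uint8)   # (n_frames, L)

dict_spectra = np.abs(rng.standard_normal((N, F)))  # training noisy spectra
dict_masks = rng.integers(0, 2, (N, F))             # their ideal binary masks
dict_codes = hash_frames(dict_spectra)

test_spectra = np.abs(rng.standard_normal((10, F))) # noisy test frames
test_codes = hash_frames(test_spectra)

# Hamming distance = number of disagreeing bits. (With packed bit strings this
# would be XOR + popcount; plain elementwise comparison keeps the sketch simple.)
ham = (test_codes[:, None, :] != dict_codes[None, :, :]).sum(-1)  # (10, N)
knn = np.argsort(ham, axis=1)[:, :K]             # K nearest dictionary frames
est_masks = dict_masks[knn].mean(axis=1)         # averaged IBMs -> soft mask
denoised = test_spectra * est_masks              # masked noisy spectra
```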

2 citations

Proceedings ArticleDOI
29 May 2020
TL;DR: Experimental results indicate that SECNN outperforms other end-to-end models such as Deep Speaker and achieves an equal error rate (EER) of 5.55% in speaker verification and accuracy of 93.92% in Speaker identification on Librispeech dataset.
Abstract: This paper presents a structure called SECNN, which combines squeeze-and-excitation (SE) components with a simplified residual convolutional neural network (ResNet). The model takes a time-frequency spectrogram as input and measures speaker similarity between an utterance embedding and speaker models by cosine similarity. Speaker models are obtained by averaging the utterance-level features of each enrollment speaker. On the one hand, SECNN can mitigate speaker overfitting in speaker verification through techniques such as regularization and the SE operation. On the other hand, SECNN is a lightweight model with merely 15M parameters. Experimental results indicate that SECNN outperforms other end-to-end models such as Deep Speaker, achieving an equal error rate (EER) of 5.55% in speaker verification and an accuracy of 93.92% in speaker identification on the Librispeech dataset. It also achieves an EER of 2.58% in speaker verification and an accuracy of 95.83% in speaker identification on the TIMIT dataset.
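The enrollment-and-scoring scheme is simple enough to sketch. In the snippet below the SECNN network itself is replaced by a placeholder embedding function, so only the averaging and cosine-scoring logic reflects the description above; names and shapes are assumptions.

```python
import numpy as np

def embed(utterance_spectrogram):
    """Placeholder for the SECNN forward pass -> fixed-size embedding.
    Here a hypothetical stand-in: mean over time of a (bins, frames) input."""
    return utterance_spectrogram.mean(axis=1)

def speaker_model(enrollment_spectrograms):
    """Speaker model = average of that speaker's enrollment embeddings."""
    embs = np.stack([embed(s) for s in enrollment_spectrograms])
    return embs.mean(axis=0)

def cosine_score(test_spectrogram, model):
    """Verification score: cosine similarity of test embedding vs. model."""
    e = embed(test_spectrogram)
    return e @ model / (np.linalg.norm(e) * np.linalg.norm(model) + 1e-8)

rng = np.random.default_rng(1)
enroll = [rng.standard_normal((64, 300)) for _ in range(3)]  # 3 enrollment utterances
model = speaker_model(enroll)
score = cosine_score(rng.standard_normal((64, 250)), model)
# Accept/reject by comparing the score to a threshold tuned for the target EER.
```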

2 citations

Proceedings ArticleDOI
09 Sep 2013
TL;DR: An attempt is made to correct the transcription at the sub-word unit level using acoustic cues available in the waveform; results show that the corrected pronunciations yield higher likelihoods for training utterances in the TIMIT corpus.
Abstract: Accurate transcription of the utterances used in training is critical for recognition performance. Inherent properties of continuous/spontaneous speech across speakers, such as variation in pronunciation and poorly emphasized or over-stressed words and sub-word units, can lead to misalignment of the waveform at the sub-word unit level. The misalignment is caused by the deviation of the actual pronunciation from the one defined in the pronunciation lexicon, which leads to insertion or deletion of sub-word units; this arises primarily because the transcription is not specific to each utterance. In this paper, an attempt is made to correct the transcription at the sub-word unit level using acoustic cues available in the waveform. Starting from sentence-level transcriptions, the transcription of a word is corrected in terms of the phonemes that make up the word. In particular, it is observed that vowels are either inserted or deleted. To support the proposed argument, mispronunciations in continuous speech are substantiated using signal processing and machine learning tools. An automatic data-driven annotator exploiting the inferences drawn from the study is then used to correct transcription errors. The results show that the corrected pronunciations lead to higher likelihoods for training utterances in the TIMIT corpus.
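The underlying selection criterion can be sketched as follows (this is an illustration, not the authors' annotator): among pronunciation variants of a word that differ by an inserted or deleted vowel, keep the one whose forced alignment assigns the utterance the highest log-likelihood. The aligner is a hypothetical placeholder and the phoneme sequences are illustrative.

```python
from typing import Callable, Sequence

def pick_pronunciation(
    utterance_features,
    variants: Sequence[list],   # candidate phoneme sequences for the word
    align_loglik: Callable,     # placeholder: forced aligner returning a log-likelihood
) -> list:
    """Return the variant whose forced alignment scores the utterance highest."""
    return max(variants, key=lambda ph: align_loglik(utterance_features, ph))

# Example: a word with and without an often-deleted unstressed vowel.
variants = [["m", "eh", "m", "er", "iy"],   # lexicon pronunciation
            ["m", "eh", "m", "r", "iy"]]    # vowel-deleted variant
# corrected = pick_pronunciation(feats, variants, align_loglik)
```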

2 citations


Network Information
Related Topics (5)
Recurrent neural network: 29.2K papers, 890K citations, 76% related
Feature (machine learning): 33.9K papers, 798.7K citations, 75% related
Feature vector: 48.8K papers, 954.4K citations, 74% related
Natural language: 31.1K papers, 806.8K citations, 73% related
Deep learning: 79.8K papers, 2.1M citations, 72% related
Performance Metrics
No. of papers in the topic in previous years:

Year    Papers
2023    24
2022    62
2021    67
2020    86
2019    77
2018    95