Topic

TIMIT

About: TIMIT is a research topic. Over its lifetime, 1,401 publications have been published within this topic, receiving 59,888 citations. The topic is also known as: TIMIT Acoustic-Phonetic Continuous Speech Corpus.


Papers
Posted Content
TL;DR: In this article, a fully unsupervised learning algorithm was proposed that alternates between training a phoneme classifier for a given set of phoneme segmentation boundaries and refining the boundaries based on the resulting classifier.
Abstract: We consider the problem of training speech recognition systems without using any labeled data, under the assumption that the learner can only access the input utterances and a phoneme language model estimated from a non-overlapping corpus. We propose a fully unsupervised learning algorithm that alternates between solving two sub-problems: (i) learning a phoneme classifier for a given set of phoneme segmentation boundaries, and (ii) refining the phoneme boundaries based on a given classifier. To solve the first sub-problem, we introduce a novel unsupervised cost function named Segmental Empirical Output Distribution Matching, which generalizes the work in (Liu et al., 2017) to segmental structures. For the second sub-problem, we develop an approximate MAP approach to refining the boundaries obtained from Wang et al. (2017). Experimental results on the TIMIT dataset demonstrate the success of this fully unsupervised phoneme recognition system, which achieves a phone error rate (PER) of 41.6%. Although this is still far from state-of-the-art supervised systems, we show that with oracle boundaries and a matching language model, the PER improves to 32.5%. This performance approaches that of a supervised system with the same model architecture, demonstrating the great potential of the proposed method.
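The alternating structure of the algorithm can be summarized in a short sketch. The outline below is not the authors' code: the two sub-problem solvers are passed in as hypothetical callables, standing in for the paper's Segmental Empirical Output Distribution Matching training step and its approximate-MAP boundary refinement.

```python
# Minimal sketch of the alternating unsupervised training loop described in
# the abstract. `train_segmental_odm` and `refine_boundaries_map` are
# hypothetical callables standing in for the paper's two sub-problem
# solvers; only the control flow is shown.
def unsupervised_phoneme_training(utterances, initial_boundaries, phoneme_lm,
                                  train_segmental_odm, refine_boundaries_map,
                                  num_iterations=10):
    boundaries = initial_boundaries
    classifier = None
    for _ in range(num_iterations):
        # (i) Fit a phoneme classifier to the current segmentation by
        #     matching its output distribution to the phoneme language model.
        classifier = train_segmental_odm(utterances, boundaries, phoneme_lm)
        # (ii) Refine the segment boundaries given the current classifier.
        boundaries = refine_boundaries_map(utterances, classifier, phoneme_lm)
    return classifier, boundaries
```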
Proceedings ArticleDOI
17 Nov 2022
TL;DR: In this article, an end-to-end deep learning model combining Connectionist Temporal Classification (CTC) with an attention-based seq2seq network was proposed for phoneme recognition.
Abstract: A phoneme is the smallest sound unit of a language. Every language has its corresponding phonemes. Phoneme recognition can be used in speech-based applications such as automatic speech recognition and lip sync. This paper proposes an end-to-end deep learning model, a hybrid of Connectionist Temporal Classification (CTC) and an attention-based seq2seq network, that consists of one bi-GRU layer in the encoder and one GRU layer in the decoder, for recognizing the phonemes in speech. Experiments on the TIMIT dataset demonstrate its advantages over other seq2seq networks, with over 50% improvement after applying the attention mechanism.
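A rough idea of such a hybrid model can be sketched in PyTorch. The sketch below is an illustrative reconstruction, not the paper's implementation: it pairs a single bi-GRU encoder, shared by a CTC head, with a single-GRU attention decoder, and the layer sizes and dot-product attention are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CTCAttentionRecognizer(nn.Module):
    def __init__(self, n_feats=40, hidden=256, n_phonemes=40):
        super().__init__()
        # One bidirectional GRU layer in the encoder, as in the paper.
        self.encoder = nn.GRU(n_feats, hidden, batch_first=True,
                              bidirectional=True)
        self.ctc_head = nn.Linear(2 * hidden, n_phonemes + 1)  # +1 for blank
        self.embed = nn.Embedding(n_phonemes + 1, hidden)
        # One GRU layer in the decoder, fed with [embedding; context].
        self.decoder = nn.GRUCell(hidden + 2 * hidden, hidden)
        self.attn_proj = nn.Linear(hidden, 2 * hidden)
        self.out = nn.Linear(hidden, n_phonemes + 1)

    def forward(self, feats, targets):
        enc, _ = self.encoder(feats)                    # (B, T, 2H)
        ctc_logits = self.ctc_head(enc)                 # per-frame CTC scores
        B = feats.size(0)
        h = feats.new_zeros(B, self.decoder.hidden_size)
        step_logits = []
        for t in range(targets.size(1)):                # teacher forcing
            # Dot-product attention over encoder frames.
            scores = torch.bmm(enc, self.attn_proj(h).unsqueeze(2))  # (B,T,1)
            context = (F.softmax(scores, dim=1) * enc).sum(dim=1)    # (B,2H)
            h = self.decoder(
                torch.cat([self.embed(targets[:, t]), context], dim=1), h)
            step_logits.append(self.out(h))
        return ctc_logits, torch.stack(step_logits, dim=1)
```

Training such a model would typically combine nn.CTCLoss on the per-frame CTC logits with cross-entropy on the decoder outputs, weighted as in hybrid CTC/attention systems.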
Proceedings ArticleDOI
17 Mar 1994
TL;DR: In this article, the authors describe a method for the enhancement of the speech of a particular speaker in a noisy multispeaker environment using the minimum variance deconvolution (MVD) algorithm.
Abstract: Describes a novel method for the enhancement of speech of a particular speaker in a noisy multispeaker environment. Many potential applications of the method are possible, including the implementation in a new generation of hearing aids. The system is based on the minimum variance deconvolution (MVD) algorithm. The method was tested using the TIMIT speech database. The utterances of two speakers were first combined to create a multispeaker environment, and then separated using the MVD algorithm. The intelligibility of the separated and enhanced speech was high. Likewise, the frequency spectra of the original speech were very similar to the spectra of the separated and enhanced speech for each of the two speakers.
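The mixing half of this experimental setup is easy to reproduce in outline. In the sketch below the file paths are hypothetical, and since the abstract does not detail the MVD algorithm itself, scipy's generic Wiener filter is used purely as a stand-in for the separation/enhancement stage.

```python
# Sketch of the setup described above: two TIMIT utterances are summed to
# create a multispeaker mixture, then an enhancement step is applied.
# NOTE: the Wiener filter below is only an illustrative stand-in; the
# paper's method is minimum variance deconvolution (MVD).
import numpy as np
import soundfile as sf              # assumed available for reading the audio
from scipy.signal import wiener

speech_a, sr = sf.read("timit/speaker_a.wav")   # hypothetical file paths
speech_b, _ = sf.read("timit/speaker_b.wav")

# Create the two-speaker mixture (trim to the shorter utterance).
n = min(len(speech_a), len(speech_b))
mixture = speech_a[:n] + speech_b[:n]

# Stand-in enhancement step; a real system would apply MVD with a model
# of the target speaker instead.
enhanced = wiener(mixture, mysize=129)
sf.write("enhanced.wav", enhanced / np.max(np.abs(enhanced)), sr)
```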
01 Jan 2003
TL;DR: A method for reducing the training time and the number of networks required to achieve a desired performance level is developed, along with a second method that improves accuracy to compensate for the loss expected from the network reduction.
Abstract: AUTOMATIC SPEAKER IDENTIFICATION USING REUSABLE AND RETRAINABLE BINARY-PAIR PARTITIONED NEURAL NETWORKS. Ashutosh Mishra, Old Dominion University, May 2003. Director: Dr. Stephen A. Zahorian. This thesis presents an extension of the work previously done on speaker identification using Binary-Pair Partitioned (BPP) neural networks. In the previous work, a separate network was used for each pair of speakers in the speaker population. Although the basic BPP approach did perform well and had a simple underlying algorithm, it had the obvious disadvantage of requiring an extremely large number of networks for speaker identification with large speaker populations. It also requires a number of networks proportional to the square of the number of speakers under consideration, leading to a very large number of networks to be trained and correspondingly large training and evaluation times. In the present work, the concepts of clustered speakers and reusable binary networks are investigated. Systematic methods are explored for using a network originally trained to separate two specific speakers to also separate the speakers of other speaker pairs. For example, it seems quite likely that a network trained to separate a particular female speaker from a particular male speaker would also reliably separate many other male speakers from many other female speakers. The focal point of the research is to develop a method for reducing the training time and the number of networks required to achieve a desired performance level. A new method of reducing the network requirement is developed, along with another method to improve accuracy to compensate for the loss expected from the network reduction (compared to the BPP approach). The two methods investigated are reusable binary-pair partitioned neural networks (RBPP) and retrained and reusable binary-pair partitioned neural networks (RRBPP). Both methods, explored and described in this thesis, work very well for clean (studio-quality) speech but do not provide the desired level of performance with bandwidth-limited (telephone-quality) speech. In this thesis, a detailed description of both methods and the experimental results is provided. All experimental results reported are based on either the Texas Instruments Massachusetts Institute of Technology (TIMIT) or Nynex TIMIT (NTIMIT) databases, using 8 sentences (approximately 24 seconds) for training and up to two sentences (approximately 6 seconds) for testing. The best results obtained with TIMIT, using 102 speakers, for BPP, RBPP, and RRBPP respectively (for 2 sentences, i.e. ~6 seconds of test data) are 99.02%, 99.02%, and 99.02% of speakers correctly identified. Corresponding recognition rates for NTIMIT, again using 102 speakers, are 84.3%, 75.5%, and 77.5%. Using all 630 speakers, the accuracy rates for TIMIT are 99%, 97%, and 96%, and the accuracy rates for NTIMIT are ~72%, 48%, and 41%.
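The quadratic growth that motivates the thesis is easy to see in a sketch of the basic BPP scheme: N speakers need N(N-1)/2 pairwise networks, and identification tallies the pairwise votes. The code below is an illustrative reconstruction, using scikit-learn's MLPClassifier as a stand-in for the thesis's networks; feature shapes and network sizes are assumptions.

```python
# Sketch of the basic binary-pair partitioned (BPP) scheme: one binary
# network per speaker pair, identification by majority vote over pairs.
from itertools import combinations
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_bpp(features_by_speaker):
    """features_by_speaker: dict speaker_id -> (n_frames, n_dims) array."""
    pair_nets = {}
    # N speakers require N*(N-1)/2 pairwise networks, which is exactly
    # what the reusable-network variants (RBPP, RRBPP) try to reduce.
    for a, b in combinations(sorted(features_by_speaker), 2):
        X = np.vstack([features_by_speaker[a], features_by_speaker[b]])
        y = np.array([0] * len(features_by_speaker[a]) +
                     [1] * len(features_by_speaker[b]))
        net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500)
        pair_nets[(a, b)] = net.fit(X, y)
    return pair_nets

def identify(pair_nets, test_features):
    votes = {}
    for (a, b), net in pair_nets.items():
        # Each pairwise network votes for one of its two speakers per frame.
        pred = net.predict(test_features)
        votes[a] = votes.get(a, 0) + int(np.sum(pred == 0))
        votes[b] = votes.get(b, 0) + int(np.sum(pred == 1))
    return max(votes, key=votes.get)
```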
Proceedings ArticleDOI
30 Nov 2020
TL;DR: In this paper, the authors proposed an improved system for the detection of end of speech (EOS) events in noisy environments, needed, for example, in voice interfaces of mobile devices.
Abstract: In this paper we propose an improved system for the detection of end-of-speech (EOS) events in noisy environments, needed, for example, in the voice interfaces of mobile devices. Our solution is based on a deep neural network composed of convolutional, feed-forward, and LSTM layers. As input data we use mel-frequency cepstral coefficients (MFCC). The main novelty of our solution is the metric used during the training process of the model: our loss function returns higher values the later the model recognizes the EOS event. We compare this approach with previously used loss functions, in which such delay was not considered. Experiments run on the TIMIT corpus, as well as additional evaluations on other types of audio data, showed that our solution is significantly more robust to noisy and far-field environments than the baseline solution.
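The delay-penalizing loss idea can be illustrated with a small PyTorch sketch. The exact weighting used in the paper is not given, so the version below assumes a linear ramp: frames further past the true EOS contribute more to a per-frame binary cross-entropy.

```python
# Sketch of a delay-aware EOS training loss: late detections are weighted
# more heavily. The linear ramp is an assumption; the paper's exact
# weighting scheme is not specified in the abstract.
import torch
import torch.nn.functional as F

def delay_weighted_eos_loss(logits, eos_frame, ramp=0.1):
    """logits: (B, T) per-frame EOS scores; eos_frame: (B,) true EOS indices."""
    B, T = logits.shape
    t = torch.arange(T, device=logits.device).expand(B, T)
    targets = (t >= eos_frame.unsqueeze(1)).float()      # 1 after true EOS
    # Frames further past the true EOS get linearly larger weights, pushing
    # the model to fire as soon as speech has actually ended.
    delay = torch.clamp(t - eos_frame.unsqueeze(1), min=0).float()
    weights = 1.0 + ramp * delay * targets
    per_frame = F.binary_cross_entropy_with_logits(
        logits, targets, reduction="none")
    return (weights * per_frame).mean()
```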

Network Information
Related Topics (5)
Recurrent neural network: 29.2K papers, 890K citations, 76% related
Feature (machine learning): 33.9K papers, 798.7K citations, 75% related
Feature vector: 48.8K papers, 954.4K citations, 74% related
Natural language: 31.1K papers, 806.8K citations, 73% related
Deep learning: 79.8K papers, 2.1M citations, 72% related
Performance
Metrics
No. of papers in the topic in previous years
Year  Papers
2023  24
2022  62
2021  67
2020  86
2019  77
2018  95