Topic

TIMIT

About: TIMIT is a research topic. Over its lifetime, 1,401 publications have been published within this topic, receiving 59,888 citations. The topic is also known as: TIMIT Acoustic-Phonetic Continuous Speech Corpus.


Papers
Posted Content
TL;DR: In this article, a fully unsupervised learning algorithm was proposed that alternates between training a phoneme classifier for a given set of phoneme segmentation boundaries and refining the boundaries based on the resulting classifier.
Abstract: We consider the problem of training speech recognition systems without using any labeled data, under the assumption that the learner can only access the input utterances and a phoneme language model estimated from a non-overlapping corpus. We propose a fully unsupervised learning algorithm that alternates between solving two sub-problems: (i) learning a phoneme classifier for a given set of phoneme segmentation boundaries, and (ii) refining the phoneme boundaries based on a given classifier. To solve the first sub-problem, we introduce a novel unsupervised cost function named Segmental Empirical Output Distribution Matching, which generalizes the work in (Liu et al., 2017) to segmental structures. For the second sub-problem, we develop an approximate MAP approach to refining the boundaries obtained from Wang et al. (2017). Experimental results on the TIMIT dataset demonstrate the success of this fully unsupervised phoneme recognition system, which achieves a phone error rate (PER) of 41.6%. Although this is still far from state-of-the-art supervised systems, we show that with oracle boundaries and a matching language model, the PER improves to 32.5%. This performance approaches that of a supervised system with the same model architecture, demonstrating the great potential of the proposed method.
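The alternating structure of the algorithm can be summarized in a short sketch. The outline below is not the authors' code: the two sub-problem solvers are passed in as hypothetical callables, standing in for the paper's Segmental Empirical Output Distribution Matching training step and its approximate-MAP boundary refinement.

```python
# Minimal sketch of the alternating unsupervised training loop described in
# the abstract. `train_segmental_odm` and `refine_boundaries_map` are
# hypothetical callables standing in for the paper's two sub-problem
# solvers; only the control flow is shown.
def unsupervised_phoneme_training(utterances, initial_boundaries, phoneme_lm,
                                  train_segmental_odm, refine_boundaries_map,
                                  num_iterations=10):
    boundaries = initial_boundaries
    classifier = None
    for _ in range(num_iterations):
        # (i) Fit a phoneme classifier to the current segmentation by
        #     matching its output distribution to the phoneme language model.
        classifier = train_segmental_odm(utterances, boundaries, phoneme_lm)
        # (ii) Refine the segment boundaries given the current classifier.
        boundaries = refine_boundaries_map(utterances, classifier, phoneme_lm)
    return classifier, boundaries
```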
Proceedings ArticleDOI
17 Nov 2022
TL;DR: In this article, an end-to-end deep learning model combining Connectionist Temporal Classification (CTC) with an attention-based seq2seq network was proposed for phoneme recognition.
Abstract: A phoneme is the smallest sound unit of a language. Every language has its corresponding phonemes. Phoneme recognition can be used in speech-based applications such as automatic speech recognition and lip sync. This paper proposes an end-to-end deep learning model, a hybrid of Connectionist Temporal Classification (CTC) and an attention-based seq2seq network, that consists of one bi-GRU layer in the encoder and one GRU layer in the decoder, for recognizing the phonemes in speech. Experiments on the TIMIT dataset demonstrate its advantages over other seq2seq networks, with over 50% improvement after applying the attention mechanism.
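A rough idea of such a hybrid model can be sketched in PyTorch. The sketch below is an illustrative reconstruction, not the paper's implementation: it pairs a single bi-GRU encoder, shared by a CTC head, with a single-GRU attention decoder, and the layer sizes and dot-product attention are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CTCAttentionRecognizer(nn.Module):
    def __init__(self, n_feats=40, hidden=256, n_phonemes=40):
        super().__init__()
        # One bidirectional GRU layer in the encoder, as in the paper.
        self.encoder = nn.GRU(n_feats, hidden, batch_first=True,
                              bidirectional=True)
        self.ctc_head = nn.Linear(2 * hidden, n_phonemes + 1)  # +1 for blank
        self.embed = nn.Embedding(n_phonemes + 1, hidden)
        # One GRU layer in the decoder, fed with [embedding; context].
        self.decoder = nn.GRUCell(hidden + 2 * hidden, hidden)
        self.attn_proj = nn.Linear(hidden, 2 * hidden)
        self.out = nn.Linear(hidden, n_phonemes + 1)

    def forward(self, feats, targets):
        enc, _ = self.encoder(feats)                    # (B, T, 2H)
        ctc_logits = self.ctc_head(enc)                 # per-frame CTC scores
        B = feats.size(0)
        h = feats.new_zeros(B, self.decoder.hidden_size)
        step_logits = []
        for t in range(targets.size(1)):                # teacher forcing
            # Dot-product attention over encoder frames.
            scores = torch.bmm(enc, self.attn_proj(h).unsqueeze(2))  # (B,T,1)
            context = (F.softmax(scores, dim=1) * enc).sum(dim=1)    # (B,2H)
            h = self.decoder(
                torch.cat([self.embed(targets[:, t]), context], dim=1), h)
            step_logits.append(self.out(h))
        return ctc_logits, torch.stack(step_logits, dim=1)
```

Training such a model would typically combine nn.CTCLoss on the per-frame CTC logits with cross-entropy on the decoder outputs, weighted as in hybrid CTC/attention systems.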
Proceedings ArticleDOI
17 Mar 1994
TL;DR: In this article, the authors describe a method for the enhancement of the speech of a particular speaker in a noisy multispeaker environment using the minimum variance deconvolution (MVD) algorithm.
Abstract: Describes a novel method for the enhancement of speech of a particular speaker in a noisy multispeaker environment. Many potential applications of the method are possible, including the implementation in a new generation of hearing aids. The system is based on the minimum variance deconvolution (MVD) algorithm. The method was tested using the TIMIT speech database. The utterances of two speakers were first combined to create a multispeaker environment, and then separated using the MVD algorithm. The intelligibility of the separated and enhanced speech was high. Likewise, the frequency spectra of the original speech were very similar to the spectra of the separated and enhanced speech for each of the two speakers.
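The mixing half of this experimental setup is easy to reproduce in outline. In the sketch below the file paths are hypothetical, and since the abstract does not detail the MVD algorithm itself, scipy's generic Wiener filter is used purely as a stand-in for the separation/enhancement stage.

```python
# Sketch of the setup described above: two TIMIT utterances are summed to
# create a multispeaker mixture, then an enhancement step is applied.
# NOTE: the Wiener filter below is only an illustrative stand-in; the
# paper's method is minimum variance deconvolution (MVD).
import numpy as np
import soundfile as sf              # assumed available for reading the audio
from scipy.signal import wiener

speech_a, sr = sf.read("timit/speaker_a.wav")   # hypothetical file paths
speech_b, _ = sf.read("timit/speaker_b.wav")

# Create the two-speaker mixture (trim to the shorter utterance).
n = min(len(speech_a), len(speech_b))
mixture = speech_a[:n] + speech_b[:n]

# Stand-in enhancement step; a real system would apply MVD with a model
# of the target speaker instead.
enhanced = wiener(mixture, mysize=129)
sf.write("enhanced.wav", enhanced / np.max(np.abs(enhanced)), sr)
```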
01 Jan 2003
TL;DR: A method for reducing the training time and the number of networks required to achieve a desired performance level is developed, along with a second method that improves accuracy to compensate for the loss expected from the network reduction.
Abstract: AUTOMATIC SPEAKER IDENTIFICATION USING REUSABLE AND RETRAINABLE BINARY-PAIR PARTITIONED NEURAL NETWORKS. Ashutosh Mishra, Old Dominion University, May 2003. Director: Dr. Stephen A. Zahorian. This thesis presents an extension of the work previously done on speaker identification using Binary-Pair Partitioned (BPP) neural networks. In the previous work, a separate network was used for each pair of speakers in the speaker population. Although the basic BPP approach did perform well and had a simple underlying algorithm, it had the obvious disadvantage of requiring an extremely large number of networks for speaker identification with large speaker populations. It also requires a number of networks proportional to the square of the number of speakers under consideration, leading to a very large number of networks to be trained and correspondingly large training and evaluation times. In the present work, the concepts of clustered speakers and reusable binary networks are investigated. Systematic methods are explored for using a network originally trained to separate two specific speakers to also separate the speakers of other speaker pairs. For example, it seems quite likely that a network trained to separate a particular female speaker from a particular male speaker would also reliably separate many other male speakers from many other female speakers. The focal point of the research is to develop a method for reducing the training time and the number of networks required to achieve a desired performance level. A new method of reducing the network requirement is developed, along with another method to improve accuracy to compensate for the loss expected from the network reduction (compared to the BPP approach). The two methods investigated are reusable binary-pair partitioned neural networks (RBPP) and retrained and reusable binary-pair partitioned neural networks (RRBPP). Both methods, explored and described in this thesis, work very well for clean (studio-quality) speech but do not provide the desired level of performance with bandwidth-limited (telephone-quality) speech. In this thesis, a detailed description of both methods and the experimental results is provided. All experimental results reported are based on either the Texas Instruments Massachusetts Institute of Technology (TIMIT) or Nynex TIMIT (NTIMIT) databases, using 8 sentences (approximately 24 seconds) for training and up to two sentences (approximately 6 seconds) for testing. The best results obtained with TIMIT, using 102 speakers, for BPP, RBPP, and RRBPP respectively (for 2 sentences, i.e. ~6 seconds of test data) are 99.02%, 99.02%, and 99.02% of speakers correctly identified. Corresponding recognition rates for NTIMIT, again using 102 speakers, are 84.3%, 75.5%, and 77.5%. Using all 630 speakers, the accuracy rates for TIMIT are 99%, 97%, and 96%, and the accuracy rates for NTIMIT are ~72%, 48%, and 41%.
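The quadratic growth that motivates the thesis is easy to see in a sketch of the basic BPP scheme: N speakers need N(N-1)/2 pairwise networks, and identification tallies the pairwise votes. The code below is an illustrative reconstruction, using scikit-learn's MLPClassifier as a stand-in for the thesis's networks; feature shapes and network sizes are assumptions.

```python
# Sketch of the basic binary-pair partitioned (BPP) scheme: one binary
# network per speaker pair, identification by majority vote over pairs.
from itertools import combinations
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_bpp(features_by_speaker):
    """features_by_speaker: dict speaker_id -> (n_frames, n_dims) array."""
    pair_nets = {}
    # N speakers require N*(N-1)/2 pairwise networks, which is exactly
    # what the reusable-network variants (RBPP, RRBPP) try to reduce.
    for a, b in combinations(sorted(features_by_speaker), 2):
        X = np.vstack([features_by_speaker[a], features_by_speaker[b]])
        y = np.array([0] * len(features_by_speaker[a]) +
                     [1] * len(features_by_speaker[b]))
        net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500)
        pair_nets[(a, b)] = net.fit(X, y)
    return pair_nets

def identify(pair_nets, test_features):
    votes = {}
    for (a, b), net in pair_nets.items():
        # Each pairwise network votes for one of its two speakers per frame.
        pred = net.predict(test_features)
        votes[a] = votes.get(a, 0) + int(np.sum(pred == 0))
        votes[b] = votes.get(b, 0) + int(np.sum(pred == 1))
    return max(votes, key=votes.get)
```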
Proceedings ArticleDOI
30 Nov 2020
TL;DR: In this paper, the authors proposed an improved system for the detection of end of speech (EOS) events in noisy environments, needed, for example, in voice interfaces of mobile devices.
Abstract: In this paper we propose an improved system for the detection of end-of-speech (EOS) events in noisy environments, needed, for example, in the voice interfaces of mobile devices. Our solution is based on a deep neural network composed of convolutional, feed-forward, and LSTM layers. As input data we use mel-frequency cepstral coefficients (MFCC). The main novelty of our solution is the metric used during the training process of the model: our loss function returns higher values the later the model recognizes the EOS event. We compare this approach with previously used loss functions, in which such delay was not considered. Experiments run on the TIMIT corpus, as well as additional evaluations on other types of audio data, showed that our solution is significantly more robust to noisy and far-field environments than the baseline solution.
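The delay-penalizing loss idea can be illustrated with a small PyTorch sketch. The exact weighting used in the paper is not given, so the version below assumes a linear ramp: frames further past the true EOS contribute more to a per-frame binary cross-entropy.

```python
# Sketch of a delay-aware EOS training loss: late detections are weighted
# more heavily. The linear ramp is an assumption; the paper's exact
# weighting scheme is not specified in the abstract.
import torch
import torch.nn.functional as F

def delay_weighted_eos_loss(logits, eos_frame, ramp=0.1):
    """logits: (B, T) per-frame EOS scores; eos_frame: (B,) true EOS indices."""
    B, T = logits.shape
    t = torch.arange(T, device=logits.device).expand(B, T)
    targets = (t >= eos_frame.unsqueeze(1)).float()      # 1 after true EOS
    # Frames further past the true EOS get linearly larger weights, pushing
    # the model to fire as soon as speech has actually ended.
    delay = torch.clamp(t - eos_frame.unsqueeze(1), min=0).float()
    weights = 1.0 + ramp * delay * targets
    per_frame = F.binary_cross_entropy_with_logits(
        logits, targets, reduction="none")
    return (weights * per_frame).mean()
```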

Network Information
Related Topics (5)
Recurrent neural network: 29.2K papers, 890K citations, 76% related
Feature (machine learning): 33.9K papers, 798.7K citations, 75% related
Feature vector: 48.8K papers, 954.4K citations, 74% related
Natural language: 31.1K papers, 806.8K citations, 73% related
Deep learning: 79.8K papers, 2.1M citations, 72% related
Performance
Metrics
No. of papers in the topic in previous years
Year  Papers
2023  24
2022  62
2021  67
2020  86
2019  77
2018  95