Topic

TIMIT

About: TIMIT is a research topic. Over its lifetime, 1,401 publications have been published within this topic, receiving 59,888 citations. The topic is also known as the TIMIT Acoustic-Phonetic Continuous Speech Corpus.

Papers

Proceedings Article

Speech rhythm guided syllable nuclei detection

Yaodong Zhang¹, James Glass¹

Massachusetts Institute of Technology¹

19 Apr 2009

TL;DR: In this paper, an instantaneous speech rhythm estimator is introduced to predict possible regions where syllable nuclei can appear, and a simple slope based peak counting algorithm is used to get the exact location of each syllable nucleus.

Abstract: In this paper, we present a novel speech-rhythm-guided syllable-nuclei location detection algorithm. As a departure from conventional methods, we introduce an instantaneous speech rhythm estimator to predict possible regions where syllable nuclei can appear. Within a possible region, a simple slope-based peak counting algorithm is used to get the exact location of each syllable nucleus. We verify the correctness of our method by investigating the syllable nuclei interval distribution in the TIMIT dataset, and evaluate its performance by comparing it with a state-of-the-art syllable-nuclei-based speech rate detection approach.

42 citations

Posted Content

Multi-task self-supervised learning for Robust Speech Recognition

Mirco Ravanelli¹, Jianyuan Zhong², Santiago Pascual³, Pawel Swietojanski⁴, Joao Monteiro⁵, Jan Trmal⁶, Yoshua Bengio¹

Université de Montréal¹, University of Rochester², Polytechnic University of Catalonia³, University of New South Wales⁴, Institut national de la recherche scientifique⁵, Johns Hopkins University⁶

25 Jan 2020-arXiv: Audio and Speech Processing

TL;DR: The authors propose PASE+, an improved version of PASE for robust speech recognition in noisy and reverberant environments, which employs an online speech distortion module that contaminates the input signals with a variety of random disturbances.

Abstract: Despite the growing interest in unsupervised learning, extracting meaningful knowledge from unlabelled audio remains an open challenge. To take a step in this direction, we recently proposed a problem-agnostic speech encoder (PASE), that combines a convolutional encoder followed by multiple neural networks, called workers, tasked to solve self-supervised problems (i.e., ones that do not require manual annotations as ground truth). PASE was shown to capture relevant speech information, including speaker voice-print and phonemes. This paper proposes PASE+, an improved version of PASE for robust speech recognition in noisy and reverberant environments. To this end, we employ an online speech distortion module, that contaminates the input signals with a variety of random disturbances. We then propose a revised encoder that better learns short- and long-term speech dynamics with an efficient combination of recurrent and convolutional networks. Finally, we refine the set of workers used in self-supervision to encourage better cooperation. Results on TIMIT, DIRHA and CHiME-5 show that PASE+ significantly outperforms both the previous version of PASE as well as common acoustic features. Interestingly, PASE+ learns transferable representations suitable for highly mismatched acoustic conditions.

42 citations

Posted Content

Phoneme recognition in TIMIT with BLSTM-CTC

Santiago Fernández¹, Alex Graves², Jürgen Schmidhuber²

Dalle Molle Institute for Artificial Intelligence Research¹, Technische Universität München²

15 Apr 2008-arXiv: Computation and Language

TL;DR: A single recurrent neural network is compared with the best results published so far on phoneme recognition in the TIMIT database, which were obtained with combinations of classifiers; the single network attains an error rate of 24.6%, not significantly different from those results.

Abstract: We compare the performance of a recurrent neural network with the best results published so far on phoneme recognition in the TIMIT database. These published results have been obtained with a combination of classifiers. However, in this paper we apply a single recurrent neural network to the same task. Our recurrent neural network attains an error rate of 24.6%. This result is not significantly different from that obtained by the other best methods, but they rely on a combination of classifiers for achieving comparable performance.

42 citations

Book Chapter

Audio-to-Visual Conversion Using Hidden Markov Models

Soonkyu Lee¹, Dongsuk Yook¹

Korea University¹

18 Aug 2002

TL;DR: Two approaches to using HMMs (hidden Markov models) to convert audio signals to a sequence of visemes are compared; when similar viseme classes are merged, the error rates can be reduced to 20.5% and 13.9%, respectively.

Abstract: We describe audio-to-visual conversion techniques for efficient multimedia communications. The audio signals are automatically converted to visual images of mouth shape. The visual speech can be represented as a sequence of visemes, which are the generic face images corresponding to particular sounds. Visual images synchronized with audio signals can provide a user-friendly interface for man-machine interactions. They can also be used to help people with impaired hearing. We use HMMs (hidden Markov models) to convert audio signals to a sequence of visemes. In this paper, we compare two approaches to using HMMs. In the first approach, an HMM is trained for each viseme, and the audio signals are directly recognized as a sequence of visemes. In the second approach, each phoneme is modeled with an HMM, and a general phoneme recognizer is utilized to produce a phoneme sequence from the audio signals. The phoneme sequence is then converted to a viseme sequence. We implemented the two approaches and tested them on the TIMIT speech corpus. The viseme recognizer shows a 33.9% error rate, and the phoneme-based approach exhibits a 29.7% viseme recognition error rate. When similar viseme classes are merged, we have found that the error rates can be reduced to 20.5% and 13.9%, respectively.
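
Illustrative sketch (not from the paper): the second approach in the abstract above ends with a many-to-one conversion from a recognized phoneme sequence to a viseme sequence. The Python snippet below shows one way such a conversion could look. The mapping table, the viseme class names, the fallback class, and the optional collapsing of adjacent repeats are all hypothetical choices made for illustration; the abstract does not specify the authors' actual table or post-processing.

```python
# Hypothetical phoneme-to-viseme table (illustrative only; NOT the paper's mapping).
# Several phonemes share one mouth shape, so the mapping is many-to-one.
PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "iy": "spread", "ih": "spread",
    "uw": "rounded", "ow": "rounded",
    "sil": "neutral",
}


def phonemes_to_visemes(phonemes, table=PHONEME_TO_VISEME,
                        unknown="neutral", collapse=True):
    """Convert a phoneme sequence to a viseme sequence.

    Each phoneme is looked up in `table`; phonemes missing from the table
    fall back to `unknown` (an assumption of this sketch). If `collapse`
    is True, runs of identical adjacent visemes are merged into one.
    """
    visemes = []
    for phoneme in phonemes:
        viseme = table.get(phoneme, unknown)
        if collapse and visemes and visemes[-1] == viseme:
            continue  # same mouth shape as the previous step; skip the repeat
        visemes.append(viseme)
    return visemes


if __name__ == "__main__":
    recognized = ["sil", "b", "m", "iy", "uw", "f"]
    print(phonemes_to_visemes(recognized))
```

Merging similar viseme classes, as the abstract mentions, would correspond in this sketch to pointing more phonemes at the same class name in the table.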

42 citations

Posted Content

Z-Forcing: Training Stochastic Recurrent Networks

Anirudh Goyal¹, Alessandro Sordoni², Marc-Alexandre Côté³, Nan Rosemary Ke⁴, Yoshua Bengio¹

Université de Montréal¹, Microsoft², Université de Sherbrooke³, École Polytechnique de Montréal⁴

15 Nov 2017-arXiv: Machine Learning

TL;DR: This paper proposes a stochastic recurrent model in which each step in the sequence is associated with a latent variable that is used to condition the recurrent dynamics for future steps. Training is performed with amortized variational inference, where the approximate posterior is augmented with an RNN that runs backward through the sequence.

Abstract: Many efforts have been devoted to training generative latent variable models with autoregressive decoders, such as recurrent neural networks (RNN). Stochastic recurrent models have been successful in capturing the variability observed in natural sequential data such as speech. We unify successful ideas from recently proposed architectures into a stochastic recurrent model: each step in the sequence is associated with a latent variable that is used to condition the recurrent dynamics for future steps. Training is performed with amortized variational inference where the approximate posterior is augmented with an RNN that runs backward through the sequence. In addition to maximizing the variational lower bound, we ease training of the latent variables by adding an auxiliary cost which forces them to reconstruct the state of the backward recurrent network. This provides the latent variables with a task-independent objective that enhances the performance of the overall model. We found this strategy to perform better than alternative approaches such as KL annealing. Although conceptually simple, our model achieves state-of-the-art results on standard speech benchmarks such as TIMIT and Blizzard and competitive performance on sequential MNIST. Finally, we apply our model to language modeling on the IMDB dataset where the auxiliary cost helps in learning interpretable latent variables.

42 citations

Metrics

Papers: 1,488
Citations: 68,688

No. of papers in the topic in previous years
Year	Papers
2023	24
2022	62
2021	67
2020	86
2019	77
2018	95
