Topic

TIMIT

About: TIMIT is a research topic. Over its lifetime, 1,401 publications have been published within this topic, receiving 59,888 citations. The topic is also known as: TIMIT Acoustic-Phonetic Continuous Speech Corpus.


Papers
Posted Content
TL;DR: This study proposes an SE system that incorporates contextual articulatory information obtained using broad phone class (BPC) end-to-end automatic speech recognition (ASR), and shows that the BPC-based ASR can improve SE performance more effectively under different signal-to-noise ratios (SNRs).
Abstract: Previous studies have confirmed the effectiveness of leveraging articulatory information to attain improved speech enhancement (SE) performance. By augmenting the original acoustic features with the place/manner of articulatory features, the SE process can be guided to consider the articulatory properties of the input speech when performing enhancement. Hence, we believe that the contextual information of articulatory attributes contains useful cues that can further benefit SE. In this study, we propose an SE system that incorporates contextual articulatory information; such information is obtained using broad phone class (BPC) end-to-end automatic speech recognition (ASR). In addition, two training strategies are developed to train the SE system based on the BPC-based ASR: multitask-learning and deep-feature training strategies. Experimental results on the TIMIT dataset confirm that the contextual articulatory information facilitates an SE system in achieving better results. Moreover, in contrast to another SE system that is trained with monophonic ASR, the BPC-based ASR (providing contextual articulatory information) can improve the SE performance more effectively under different signal-to-noise ratios (SNRs).
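A minimal sketch (not the authors' implementation) of the multitask-learning strategy described above: a shared encoder feeds both a speech-enhancement head (mask estimation) and a broad-phone-class classification head, and the two losses are combined with a weight. The layer sizes, the sigmoid-mask formulation, and the loss weight alpha are illustrative assumptions.

```python
# Sketch of multitask SE + broad phone class (BPC) training; sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultitaskSE(nn.Module):
    def __init__(self, n_freq=257, n_bpc=5, hidden=256):
        super().__init__()
        self.encoder = nn.LSTM(n_freq, hidden, num_layers=2, batch_first=True)
        self.se_head = nn.Linear(hidden, n_freq)   # enhancement mask per frame
        self.bpc_head = nn.Linear(hidden, n_bpc)   # broad phone class per frame

    def forward(self, noisy_spec):
        h, _ = self.encoder(noisy_spec)             # (batch, time, hidden)
        mask = torch.sigmoid(self.se_head(h))       # bounded spectral mask
        bpc_logits = self.bpc_head(h)
        return mask * noisy_spec, bpc_logits

def multitask_loss(enhanced, clean, bpc_logits, bpc_labels, alpha=0.1):
    # SE objective (spectral MSE) plus weighted BPC cross-entropy.
    se_loss = F.mse_loss(enhanced, clean)
    bpc_loss = F.cross_entropy(bpc_logits.transpose(1, 2), bpc_labels)
    return se_loss + alpha * bpc_loss

# Toy usage with random tensors, just to show the shapes involved.
model = MultitaskSE()
noisy = torch.rand(4, 100, 257)
clean = torch.rand(4, 100, 257)
labels = torch.randint(0, 5, (4, 100))
enhanced, logits = model(noisy)
multitask_loss(enhanced, clean, logits, labels).backward()
```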

3 citations

27 Aug 2020
TL;DR: This thesis proposes several extensions to the variational autoencoder (VAE) model, a unified probabilistic framework which combines generative modelling and deep neural networks, and imposes biases inspired by articulatory and acoustic theories of speech production to induce a generative model to capture and disentangle meaningful underlying factors.
Abstract: Representations aim to capture significant, high-level information from raw data, most commonly as low dimensional vectors. When considered as input features for a downstream classification task, they reduce classifier complexity, and help in transfer learning and domain adaptation. An interpretable representation captures underlying meaningful factors, and can be used for understanding data, or to solve tasks that need access to these factors. In natural language processing (NLP), representations such as word or sentence embeddings have recently become important components of most natural language understanding models. They are trained without supervision on very large, unannotated corpora, allowing powerful models that capture semantic relations important in many NLP tasks. In speech processing, deep network-based representations such as bottlenecks and x-vectors have had some success, but are limited to supervised or partly supervised settings where annotations are available and are not optimized to separate underlying factors. An unsupervised representation for speech, i.e. one that could be trained directly with large amounts of unlabelled speech recordings, would have a major impact on many speech processing tasks. Annotating speech data requires expensive manual transcription and is often a limiting factor, especially for low-resource languages. Disentangling speaker and phonetic variability in the representation would eliminate major nuisance factors for downstream tasks in speech or speaker recognition. But despite this potential, unsupervised representation has received less attention than its supervised counterpart. In this thesis, we propose an unsupervised generative model that can learn interpretable speech representations. More specifically, we propose several extensions to the variational autoencoder (VAE) model, a unified probabilistic framework which combines generative modelling and deep neural networks. To induce the model to capture and disentangle meaningful underlying factors, we impose biases inspired by articulatory and acoustic theories of speech production. We first propose time filtering as a bias to induce representations at a different time scale for each latent variable. It allows the model to separate several latent variables along a continuous range of time scale properties, as opposed to binary oppositions or hierarchical factorization that have been previously proposed. We also show how to impose a multimodal prior to induce discrete latent variables, and present two new tractable VAE loss functions that apply to discrete variables, using expectation-maximization reestimation with matched divergence, and divergence sampling. In addition, we propose self-attention to add sequence modelling capacity to the VAE model, to our knowledge the first time self-attention is used for learning in an unsupervised speech task. We use simulated data to confirm that the proposed model can accurately recover phonetic and speaker underlying factors. We find that, given only a realistic high dimensional log filterbank signal, the model is able to accurately recover the generating factors, and that both frame and sequence level variables are essential for accurate reconstruction and well-disentangled representation.
On TIMIT, a corpus of read English speech, the proposed biases yield representations that separate phonetic and speaker information, as evidenced by unsupervised results on downstream phoneme and speaker classification tasks using a simple k-means classifier. Jointly optimizing for multiple latent variables, with a distinct bias for each one, makes it possible to disentangle underlying factors that a single latent variable is not able to capture simultaneously. We explored some of the underlying factors potentially useful for applications where annotated data is scarce or non-existent. The approach proposed in this thesis, which induces a generative model to learn disentangled and interpretable representations, opens the way for exploration of new factors and inductive biases.
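As a rough illustration of the frame-level/sequence-level factorization described in the abstract (not the thesis code), the sketch below builds a VAE with a per-frame latent intended for phonetic content and a per-utterance latent intended for speaker identity; the only "time-scale bias" shown is that the utterance latent is pooled over time. The dimensions, the GRU encoder, and the omission of the KL terms are simplifying assumptions.

```python
# Sketch of a VAE with two latent variables at different time scales; sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLatentVAE(nn.Module):
    def __init__(self, n_feat=40, d_frame=16, d_utt=16, hidden=128):
        super().__init__()
        self.enc = nn.GRU(n_feat, hidden, batch_first=True)
        self.frame_mu, self.frame_lv = nn.Linear(hidden, d_frame), nn.Linear(hidden, d_frame)
        self.utt_mu, self.utt_lv = nn.Linear(hidden, d_utt), nn.Linear(hidden, d_utt)
        self.dec = nn.Linear(d_frame + d_utt, n_feat)

    def forward(self, x):                                        # x: (batch, time, n_feat)
        h, _ = self.enc(x)
        z_f = self.reparam(self.frame_mu(h), self.frame_lv(h))   # per-frame latent
        g = h.mean(dim=1)                                        # pool over time
        z_u = self.reparam(self.utt_mu(g), self.utt_lv(g))       # per-utterance latent
        z_u = z_u.unsqueeze(1).expand(-1, x.size(1), -1)
        return self.dec(torch.cat([z_f, z_u], dim=-1))

    @staticmethod
    def reparam(mu, logvar):
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

x = torch.rand(2, 50, 40)                 # toy log-filterbank batch
recon = TwoLatentVAE()(x)
loss = F.mse_loss(recon, x)               # reconstruction term only; KL terms omitted for brevity
```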

3 citations

Proceedings ArticleDOI
01 Jan 2015
TL;DR: The proposed two-stage recognition systems are compared with baseline phone recognition systems developed using spectral features alone, and the results show that they outperform their respective baselines.
Abstract: In this paper, we propose a two-stage phone recognition system using articulatory and spectral features. In the first stage, articulatory features are predicted from spectral features using feedforward neural networks (FFNNs). In the second stage, phone recognition is carried out using the predicted articulatory features and the spectral features together. FFNNs and Hidden Markov Models are explored for developing the phone recognition models in stage 2. In this work, spectral features are represented by Mel-frequency cepstral coefficients. The performance of the proposed phone recognition systems is analyzed using the Bengali (an Indian language) and TIMIT speech databases. The recognition accuracy of the proposed two-stage models is compared with baseline phone recognition systems developed using spectral features alone. From the results, it is observed that the proposed two-stage recognition systems outperform their respective baseline systems, showing improvements of 5.87% and 7.55% for the Bengali and TIMIT speech databases, respectively.
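A minimal sketch, under assumed feature and class counts, of the two-stage pipeline described above: a feedforward network maps MFCCs to articulatory features, and a second network classifies phones from the concatenation of MFCCs and the predicted articulatory features. The layer sizes and the purely frame-level (non-HMM) formulation are illustrative choices, not the authors' configuration.

```python
# Sketch of the two-stage idea: MFCC -> articulatory features -> phone classification.
import torch
import torch.nn as nn

N_MFCC, N_ART, N_PHONES = 39, 10, 48      # assumed dimensions

stage1 = nn.Sequential(                   # stage 1: spectral -> articulatory features
    nn.Linear(N_MFCC, 256), nn.ReLU(), nn.Linear(256, N_ART))
stage2 = nn.Sequential(                   # stage 2: spectral + articulatory -> phone
    nn.Linear(N_MFCC + N_ART, 256), nn.ReLU(), nn.Linear(256, N_PHONES))

mfcc = torch.rand(32, N_MFCC)             # a batch of frames
art_pred = stage1(mfcc)                   # predicted articulatory features
phone_logits = stage2(torch.cat([mfcc, art_pred], dim=-1))
print(phone_logits.shape)                 # torch.Size([32, 48])
```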

3 citations

Proceedings ArticleDOI
09 Jun 1997
TL;DR: A self-configuring neural network is trained to recognize sentences that have been compressed by the LBG clustering algorithm, providing a basis for secure speaker recognition systems that use neural networks.
Abstract: This paper discusses preliminary work on a promising method for recognizing speakers. A self-configuring neural network is trained to recognize sentences that have been compressed by the LBG clustering algorithm. The bias weights of the trained neural networks are adjusted to minimize the false-positive percentage. Recognition accuracy greater than 90% is obtained on the TIMIT speech database, with no false positives. The results presented here provide a basis for the generation of secure speaker recognition systems that use neural networks.
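For reference, a minimal sketch of LBG (Linde-Buzo-Gray) codebook training, the compression step mentioned above. The codebook size, split perturbation, and random stand-in features are assumptions, and the self-configuring network itself is not reproduced here.

```python
# Sketch of LBG vector quantization: split centroids, then refine with k-means.
import numpy as np

def lbg(data, codebook_size=8, eps=0.01, iters=10):
    codebook = data.mean(axis=0, keepdims=True)           # start with one centroid
    while len(codebook) < codebook_size:
        codebook = np.vstack([codebook * (1 + eps),       # split every centroid
                              codebook * (1 - eps)])
        for _ in range(iters):                            # k-means refinement
            d = np.linalg.norm(data[:, None] - codebook[None], axis=-1)
            assign = d.argmin(axis=1)
            for k in range(len(codebook)):
                members = data[assign == k]
                if len(members):
                    codebook[k] = members.mean(axis=0)
    return codebook

feats = np.random.rand(1000, 12)          # e.g. 12-dim cepstral frames (stand-in data)
cb = lbg(feats)
codes = np.linalg.norm(feats[:, None] - cb[None], axis=-1).argmin(axis=1)
print(cb.shape, codes[:10])               # (8, 12) codebook and per-frame indices
```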

3 citations

Proceedings ArticleDOI
01 Dec 2019
TL;DR: This paper concentrates on obtaining the spectro-temporal representation by incorporating a physiologically and psychoacoustically motivated gammatone-filterbank representation, the gammatonegram, into the Gabor filters to better approximate the auditory perception of speech.
Abstract: Spectro-temporal features have recently shown considerable performance improvements for robust Automatic Speech Recognition (ASR) tasks. Gabor filters are well known for extracting the spectro-temporal cues of speech, and the time-frequency representation is an essential ingredient of two-dimensional Gabor-based feature extraction methods. State-of-the-art spectro-temporal features are mostly based on the Mel spectrogram. However, the time-frequency representation based on the Mel scale is not accurate enough to model the human auditory system. This paper concentrates on obtaining the spectro-temporal representation from a physiologically and psychoacoustically motivated gammatone-filterbank representation called the gammatonegram. In the literature, the gammatonegram is found to better approximate the auditory perception of speech. The spectro-temporal features obtained using gammatonegram-based Gabor filters are fed to a hybrid Deep Neural Network (DNN)-Hidden Markov Model (HMM) framework to develop the acoustic model of an ASR system. Experimental analysis is carried out on TIMIT with noise from the NOISEX-92 database. The experimental results show that the proposed features yield a better performance gain than conventional feature extraction methods.
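A rough sketch of the underlying idea of spectro-temporal feature extraction: a 2D Gabor filter (a Gaussian envelope times a spectro-temporal sinusoid) is convolved with a time-frequency map. A random matrix stands in for the gammatonegram, and the filter parameters and simplified Gabor definition are assumptions, not the paper's filterbank.

```python
# Sketch of applying a 2D Gabor filter to a time-frequency map; parameters are assumptions.
import numpy as np
from scipy.signal import convolve2d

def gabor_2d(omega_f, omega_t, size=15):
    # Complex 2D Gabor: Gaussian envelope times a spectro-temporal sinusoid.
    f, t = np.meshgrid(np.arange(size) - size // 2,
                       np.arange(size) - size // 2, indexing="ij")
    envelope = np.exp(-(f**2 + t**2) / (2 * (size / 4) ** 2))
    carrier = np.exp(1j * (omega_f * f + omega_t * t))
    return envelope * carrier

gammatonegram = np.abs(np.random.randn(64, 200))    # (channels, frames) stand-in
gabor = gabor_2d(omega_f=0.5, omega_t=0.25)
features = np.abs(convolve2d(gammatonegram, np.real(gabor), mode="same"))
print(features.shape)                               # (64, 200) filtered spectro-temporal map
```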

3 citations


Network Information
Related Topics (5)
Recurrent neural network: 29.2K papers, 890K citations (76% related)
Feature (machine learning): 33.9K papers, 798.7K citations (75% related)
Feature vector: 48.8K papers, 954.4K citations (74% related)
Natural language: 31.1K papers, 806.8K citations (73% related)
Deep learning: 79.8K papers, 2.1M citations (72% related)
Performance Metrics
No. of papers in the topic in previous years:
2023: 24
2022: 62
2021: 67
2020: 86
2019: 77
2018: 95