TIMIT

About: TIMIT is a research topic. Over its lifetime, 1,401 publications on this topic have been published, receiving 59,888 citations. The topic is also known as: TIMIT Acoustic-Phonetic Continuous Speech Corpus.


Papers
Posted Content
TL;DR: Two novel unsupervised approaches to generating acoustic embeddings by modelling acoustic context are proposed; a contextual joint factor synthesis encoder outperforms phone classification baselines, yielding a classification accuracy of 74.1%.
Abstract: Embedding acoustic information into fixed-length representations is of interest for a whole range of applications in speech and audio technology. Two novel unsupervised approaches to generating acoustic embeddings by modelling acoustic context are proposed. The first approach is a contextual joint factor synthesis encoder, where the encoder in an encoder/decoder framework is trained to extract joint factors from surrounding audio frames to best generate the target output. The second approach is a contextual joint factor analysis encoder, where the encoder is trained to analyse joint factors from the source signal that correlate best with the neighbouring audio. To evaluate the effectiveness of our approaches compared to prior work, two tasks are conducted -- phone classification and speaker recognition -- and tested on different TIMIT data sets. Experimental results show that one of the proposed approaches outperforms phone classification baselines, yielding a classification accuracy of 74.1%. When using additional out-of-domain data for training, an additional 3% improvement can be obtained, for both the phone classification and speaker recognition tasks.
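To make the idea concrete, here is a minimal sketch of such a context-modelling encoder/decoder in PyTorch: the encoder embeds a centre frame into a fixed-length vector and the decoder is trained to reconstruct the surrounding frames, in the spirit of the contextual joint factor synthesis encoder. The feature dimension, layer sizes, and context width are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

FEAT_DIM = 40    # e.g. 40-dim filterbank features per frame (assumption)
EMBED_DIM = 64   # size of the fixed-length acoustic embedding
CONTEXT = 4      # frames of left/right context the decoder must generate

encoder = nn.Sequential(
    nn.Linear(FEAT_DIM, 256), nn.ReLU(),
    nn.Linear(256, EMBED_DIM),
)
decoder = nn.Sequential(
    nn.Linear(EMBED_DIM, 256), nn.ReLU(),
    nn.Linear(256, 2 * CONTEXT * FEAT_DIM),  # left + right neighbour frames
)

params = list(encoder.parameters()) + list(decoder.parameters())
opt = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(centre, context):
    """centre: (B, FEAT_DIM); context: (B, 2*CONTEXT*FEAT_DIM), flattened neighbours."""
    opt.zero_grad()
    z = encoder(centre)      # fixed-length embedding of the target frame
    recon = decoder(z)       # generate the surrounding audio frames from it
    loss = loss_fn(recon, context)
    loss.backward()
    opt.step()
    return loss.item()

# dummy batch just to demonstrate the shapes
centre = torch.randn(8, FEAT_DIM)
context = torch.randn(8, 2 * CONTEXT * FEAT_DIM)
print(train_step(centre, context))
```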

3 citations

Journal ArticleDOI
TL;DR: Two studies show that improvements obtained on read speech do not always transfer to conversational speech, and that the nature of the application data should be taken into account as early as when the basic assumptions of a method are defined.
Abstract: With the growing interest among speech scientists in working with natural conversations, the popularity of articulatory–acoustic features (AFs) as a basic unit has also increased, as they have been shown to be more suitable than purely phone-based approaches. Even though the motivation for AF classification is driven by the properties of conversational speech, most new methods continue to be developed on read speech corpora (e.g., TIMIT). In this paper, we show in two studies that the improvements obtained on read speech do not always transfer to conversational speech. The first study compares four variants of acoustic parameters for AF classification of both read and conversational speech using support vector machines. Our experiments show that the proposed set of acoustic parameters substantially improves AF classification for read speech, but only marginally for conversational speech. The second study investigates whether labeling inaccuracies can be compensated for by a data selection approach. Again, although a substantial improvement was found with the data selection approach for read speech, this was not the case for conversational speech. Overall, these results suggest that we cannot continue to develop methods for one speech style and expect the improvements to transfer to other styles. Instead, the nature of the application data (here: read vs. conversational) should be taken into account as early as when the basic assumptions of a method are defined (here: segmentation into phones), and not only when applying the method to the application data.
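For concreteness, a minimal sketch of frame-level AF classification with a support vector machine, as in the first study's setup, might look as follows; the 39-dimensional features, the manner-class label set, and the random data are illustrative assumptions standing in for real TIMIT frames and forced-alignment labels.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

MANNER_CLASSES = ["vowel", "fricative", "nasal", "stop", "approximant", "silence"]

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 39))                      # e.g. MFCC + deltas per frame
y = rng.integers(0, len(MANNER_CLASSES), size=600)  # labels from forced alignment

# RBF-kernel SVM on standardized frame features
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X[:500], y[:500])
print(f"held-out frame accuracy: {clf.score(X[500:], y[500:]):.2f}")
```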

3 citations

Posted Content
TL;DR: Five neural network (NN) architectures are compared under various adaptation and feature normalization techniques, including feature-space maximum likelihood linear regression, five variants of i-vector adaptation, and two variants of cepstral mean normalization.
Abstract: Recently, recurrent neural networks have become state of the art in acoustic modeling for automatic speech recognition. Long short-term memory (LSTM) units are the most popular, although alternatives such as the gated recurrent unit (GRU) and its modifications have outperformed LSTM in some publications. In this paper, we compare five neural network (NN) architectures with various adaptation and feature normalization techniques. We evaluate feature-space maximum likelihood linear regression, five variants of i-vector adaptation, and two variants of cepstral mean normalization. Most adaptation and normalization techniques were developed for feed-forward NNs and, according to the results in this paper, not all of them also work with RNNs. For the experiments, we chose the well-known and freely available TIMIT phone recognition task. Phone recognition is much more sensitive to the quality of the acoustic model than a large-vocabulary task with a complex language model. We also publish open-source scripts to make the results easy to replicate and to help continue the development.
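As an illustration of one of the normalization families compared here, the following NumPy sketch shows two common cepstral mean normalization (CMN) variants, per-utterance and per-speaker; whether these match the paper's exact two variants is an assumption.

```python
import numpy as np

def cmn_per_utterance(feats):
    """feats: (frames, dim). Subtract the utterance-level cepstral mean."""
    return feats - feats.mean(axis=0, keepdims=True)

def cmn_per_speaker(utterances):
    """utterances: list of (frames, dim) arrays from one speaker.
    Subtract a single mean pooled over all of the speaker's frames."""
    mean = np.concatenate(utterances, axis=0).mean(axis=0, keepdims=True)
    return [u - mean for u in utterances]

utts = [np.random.randn(100, 13) + 5.0, np.random.randn(80, 13) + 5.0]
print(cmn_per_utterance(utts[0]).mean(axis=0)[:3])  # ~0 within the utterance
print(cmn_per_speaker(utts)[0].mean(axis=0)[:3])    # ~0 across the speaker
```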

3 citations

Posted Content
TL;DR: A stage-wise adaptive inference approach with an early-exit mechanism is proposed for progressive speech enhancement: in each stage, once the spectral distance between adjacent stages falls below an empirically preset threshold, inference terminates and the current estimate is output.
Abstract: In real scenarios, it is often necessary and useful to control the inference speed of speech enhancement systems under different conditions. To this end, we propose a stage-wise adaptive inference approach with an early-exit mechanism for progressive speech enhancement. Specifically, in each stage, once the spectral distance between adjacent stages falls below an empirically preset threshold, inference terminates and the estimate is output, which effectively accelerates inference. To further improve the performance of existing speech enhancement systems, we propose PL-CRN++, an improved version of our preliminary work PL-CRN that combines a stage recurrence mechanism with complex spectral mapping. Extensive experiments are conducted on the TIMIT corpus, and the results demonstrate the superiority of our system over state-of-the-art baselines in terms of PESQ, ESTOI, and DNSMOS. Moreover, by adjusting the threshold, we can easily control the inference efficiency while sustaining system performance.
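A minimal sketch of the early-exit control flow described above, with dummy stages standing in for the recurrent enhancement blocks of PL-CRN++; the spectral-distance measure and the threshold value are illustrative assumptions.

```python
import numpy as np

def spectral_distance(a, b):
    """Mean squared distance between two magnitude spectrograms."""
    return float(np.mean((a - b) ** 2))

def progressive_enhance(spec, stages, threshold):
    """Run stages in order; exit early once adjacent estimates agree."""
    est = spec
    for i, stage in enumerate(stages):
        new_est = stage(est)
        if spectral_distance(new_est, est) < threshold:
            return new_est, i + 1   # early exit: remaining stages are skipped
        est = new_est
    return est, len(stages)

# dummy stages whose estimates converge, standing in for enhancement blocks
stages = [lambda s: 0.5 * s + 1.0 for _ in range(5)]
spec = np.full((257, 100), 8.0)     # fake magnitude spectrogram
enhanced, used = progressive_enhance(spec, stages, threshold=0.2)
print(f"exited after {used} of {len(stages)} stages")
```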

3 citations

Proceedings Article
01 May 2010
TL;DR: In the case of wideband telephony, server-side ASR should not be carried out by simply decimating received signals to 8 kHz and applying existing narrowband acoustic models; real-world wideband telephony channel data (such as WTIMIT) provides the best training material for wideband IVR systems.
Abstract: In anticipation of upcoming mobile telephony services with higher speech quality, a wideband (50 Hz to 7 kHz) mobile telephony derivative of TIMIT, called WTIMIT, has been recorded. It opens up various scientific investigations, e.g., on speech quality and intelligibility, as well as on wideband upgrades of network-side interactive voice response (IVR) systems with retrained or bandwidth-extended acoustic models for automatic speech recognition (ASR). Wideband telephony could enable network-side speech recognition applications such as remote dictation or spelling without the need for distributed speech recognition techniques. The WTIMIT corpus was transmitted via two prepared Nokia 6220 mobile phones over T-Mobile's 3G wideband mobile network in The Hague, The Netherlands, employing the Adaptive Multirate Wideband (AMR-WB) speech codec. The paper presents observations of transmission effects and phoneme recognition experiments. It turns out that in the case of wideband telephony, server-side ASR should not be carried out by simply decimating received signals to 8 kHz and applying existing narrowband acoustic models. Nor do we recommend merely simulating the AMR-WB codec for training of wideband acoustic models. Instead, real-world wideband telephony channel data (such as WTIMIT) provides the best training material for wideband IVR systems.
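For reference, the decimation pipeline the paper advises against looks roughly like this: downsample the received 16 kHz wideband signal to 8 kHz (with anti-alias filtering) and feed it to an existing narrowband acoustic model. The sine tone below is a stand-in for real WTIMIT audio.

```python
import numpy as np
from scipy.signal import resample_poly

fs_wide, fs_narrow = 16000, 8000
t = np.arange(fs_wide) / fs_wide               # one second of audio
x_wide = np.sin(2 * np.pi * 440.0 * t)         # placeholder for a WTIMIT utterance

x_narrow = resample_poly(x_wide, fs_narrow, fs_wide)  # 16 kHz -> 8 kHz
print(x_wide.shape, x_narrow.shape)            # (16000,) (8000,)
```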

2 citations


Network Information

Related Topics (5)

Recurrent neural network: 29.2K papers, 890K citations (76% related)
Feature (machine learning): 33.9K papers, 798.7K citations (75% related)
Feature vector: 48.8K papers, 954.4K citations (74% related)
Natural language: 31.1K papers, 806.8K citations (73% related)
Deep learning: 79.8K papers, 2.1M citations (72% related)
Performance

Metrics

No. of papers in the topic in previous years:

2023: 24
2022: 62
2021: 67
2020: 86
2019: 77
2018: 95