Topic

TIMIT

About: TIMIT is a research topic. Over its lifetime, 1,401 publications have been published within this topic, receiving 59,888 citations. The topic is also known as: TIMIT Acoustic-Phonetic Continuous Speech Corpus.


Papers
Proceedings ArticleDOI
Geoffrey Zweig
25 Mar 2012
TL;DR: Initial steps are taken toward using segment-based direct models on their own, first by developing a segment-based maximum entropy phone classifier, and then by utilizing the features in a segmental conditional random field for recognition.
Abstract: Segment-based direct models have recently been used to improve the output of existing state-of-the-art speech recognizers. To date, however, they have relied on an existing HMM system to provide segment boundaries. This paper takes initial steps toward using these models on their own, first by developing a segment-based maximum entropy phone classifier, and then by utilizing the features in a segmental conditional random field for recognition. To produce a feature representation that is independent of segment length, we utilize a set of n-gram features based on vector-quantized representations of the acoustic input. We find that the models are able to integrate information at different granularities and from different streams. Contextual information from around the segment boundaries is particularly important. We obtain competitive results for TIMIT phone classification, and present initial recognition results.
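As a concrete illustration of the length-independent representation, the sketch below builds n-gram count features over vector-quantized acoustic frames. This is not the authors' implementation: the codebook size, the 13-dimensional frame vectors, and the use of scikit-learn's KMeans as the quantizer are all assumptions made for the example.

```python
# Minimal sketch (not the paper's code): fixed-length segment features from
# n-grams of vector-quantized acoustic frames.
import numpy as np
from sklearn.cluster import KMeans

def train_quantizer(frames, codebook_size=256, seed=0):
    """Fit a VQ codebook on acoustic frame vectors (e.g. MFCC frames)."""
    return KMeans(n_clusters=codebook_size, random_state=seed, n_init=10).fit(frames)

def segment_ngram_features(segment_frames, quantizer, n=2):
    """Map a variable-length segment to a fixed-length bag of VQ n-grams."""
    codes = quantizer.predict(segment_frames)        # one discrete symbol per frame
    k = quantizer.n_clusters
    feats = np.zeros(k ** n)                         # one bin per possible n-gram
    for i in range(len(codes) - n + 1):
        idx = 0
        for c in codes[i:i + n]:                     # flatten the n-gram into a bin index
            idx = idx * k + int(c)
        feats[idx] += 1.0
    return feats / max(1, len(codes) - n + 1)        # normalize out the segment length

# Toy usage: train an 8-symbol codebook, then featurize a 20-frame "segment".
rng = np.random.default_rng(0)
quantizer = train_quantizer(rng.normal(size=(300, 13)), codebook_size=8)
print(segment_ngram_features(rng.normal(size=(20, 13)), quantizer).shape)  # (64,)
```

Because the counts are normalized by the number of n-grams in the segment, segments of different durations yield feature vectors of the same dimensionality and comparable scale.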

25 citations

Proceedings ArticleDOI
01 Dec 2007
TL;DR: Regularization effectively prevents overfitting, and HCRFs are able to make use of non-independent features in phone classification, at least with small numbers of mixture components, while HMMs degrade due to their strong independence assumptions.
Abstract: We show a number of improvements in the use of Hidden Conditional Random Fields (HCRFs) for phone classification on the TIMIT and Switchboard corpora. We first show that the use of regularization effectively prevents overfitting, improving over other methods such as early stopping. We then show that HCRFs are able to make use of non-independent features in phone classification, at least with small numbers of mixture components, while HMMs degrade due to their strong independence assumptions. Finally, we successfully apply Maximum a Posteriori adaptation to HCRFs, decreasing the phone classification error rate in the Switchboard corpus by around 1%-5% given only small amounts of adaptation data.
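For background on the regularization the paper discusses, HCRF training is commonly formulated as maximizing the conditional log-likelihood under a Gaussian prior on the parameters (an L2 penalty); the formulation below is a standard one shown for illustration, and its details (e.g. a single shared variance) may differ from the paper's exact setup.

```latex
% Regularized HCRF objective (illustrative form): conditional log-likelihood
% of the labels plus a Gaussian prior (L2 penalty) on the parameters \lambda,
% with hidden state sequences \mathbf{h} summed out.
\mathcal{L}(\lambda) = \sum_{i=1}^{N} \log p(y_i \mid \mathbf{x}_i; \lambda)
  - \frac{\lVert \lambda \rVert^2}{2\sigma^2},
\qquad
p(y \mid \mathbf{x}; \lambda) =
  \frac{\sum_{\mathbf{h}} \exp\big(\lambda^{\top} f(y, \mathbf{h}, \mathbf{x})\big)}
       {\sum_{y'} \sum_{\mathbf{h}} \exp\big(\lambda^{\top} f(y', \mathbf{h}, \mathbf{x})\big)}
```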

25 citations

Proceedings ArticleDOI
01 Oct 2019
TL;DR: This is the first paper to describe these various flavors of TDNN in depth, providing details regarding the speech features used at input, the constituent layers and their dimensionality, regularization techniques, etc.
Abstract: Kaldi NNET3 is at the moment the leading speech recognition toolkit on many well-known tasks such as LibriSpeech, TED-LIUM or TIMIT. Several versions of the time-delay neural network (TDNN) architecture were recently proposed, implemented and evaluated for acoustic modeling with Kaldi: plain TDNN, convolutional TDNN (CNN-TDNN), long short-term memory TDNN (TDNN-LSTM) and TDNN-LSTM with attention. To the best of our knowledge, this is the first paper to describe in-depth these various flavors of TDNN, providing details regarding the speech features used at input, constituent layers and their dimensionality, regularization techniques etc. The various acoustic models were evaluated in conjunction with n-gram and recurrent language models in an automatic speech recognition (ASR) experiment for the Romanian language. We report significantly better results over the previous ASR systems for the same Romanian ASR tasks.
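As a rough sketch of what a single TDNN layer computes (this is not Kaldi NNET3 code; the layer sizes and splicing offsets below are hypothetical), each output frame applies a nonlinearity to a spliced context of input frames at fixed offsets, which is equivalent to a 1-D, possibly dilated, temporal convolution:

```python
# Minimal numpy sketch of a TDNN layer: splice input frames at fixed offsets,
# then apply an affine transform and a ReLU. Offsets that widen with depth
# grow the temporal receptive field, a common design in TDNN recipes.
import numpy as np

def tdnn_layer(x, weight, bias, offsets=(-1, 0, 1)):
    """x: (T, d_in) frames; weight: (len(offsets) * d_in, d_out); bias: (d_out,)."""
    T = x.shape[0]
    spliced = np.stack([
        np.concatenate([x[min(max(t + o, 0), T - 1)] for o in offsets])  # clamp at edges
        for t in range(T)
    ])
    return np.maximum(spliced @ weight + bias, 0.0)  # ReLU activations

# Hypothetical sizes: 40-dim input features, 256 hidden units per layer.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 40))
h1 = tdnn_layer(x,  rng.normal(size=(3 * 40,  256)) * 0.01, np.zeros(256), offsets=(-1, 0, 1))
h2 = tdnn_layer(h1, rng.normal(size=(3 * 256, 256)) * 0.01, np.zeros(256), offsets=(-3, 0, 3))
print(h2.shape)  # (100, 256)
```

The CNN-TDNN and TDNN-LSTM variants mentioned above keep this spliced-context idea but add a convolutional front-end or interleave LSTM layers, respectively.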

25 citations

Journal ArticleDOI
TL;DR: Distribution-scaling-based score normalization techniques are developed specifically for the in-set/out-of-set problem and compared against existing score normalization schemes used in open-set speaker recognition.
Abstract: In this paper, the problem of identifying in-set versus out-of-set speakers using extremely limited enrollment data is addressed. The recognition objective is to form a binary decision regarding an input speaker as being a legitimate member of a set of enrolled speakers or not. Here, the emphasis is on low enrollment (about 5 sec of speech for each enrolled speaker) and test data durations (2-8 sec), in a text-independent scenario. In order to overcome the limited enrollment, data from speakers that are acoustically close to a given in-set speaker are used to form an informative prior (base model) for speaker adaptation. Score normalization for in-set systems is addressed, and the difficulty of using conventional score normalization schemes for in-set speaker recognition is highlighted. Distribution scaling based score normalization techniques are developed specifically for the in-set/out-of-set problem and compared against existing score normalization schemes used in open-set speaker recognition. Experiments are performed using the following three separate corpora: (1) noise-free TIMIT; (2) noisy in-vehicle CU-move; and (3) the NIST-SRE-2006 database. Experimental results show a consistent increase in system performance for the proposed techniques.
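For context on why score normalization matters here, a conventional scheme such as Z-norm scales a raw verification score by impostor-score statistics before thresholding; the sketch below shows only that baseline, not the distribution-scaling variants proposed in the paper, and the threshold and score values are arbitrary.

```python
# Sketch of conventional Z-norm score normalization (background only; the
# paper develops its own distribution-scaling schemes for the in-set task).
import numpy as np

def znorm(raw_score, impostor_scores):
    """Normalize a raw score by the mean and std of impostor scores."""
    mu = np.mean(impostor_scores)
    sigma = np.std(impostor_scores) + 1e-12   # guard against zero variance
    return (raw_score - mu) / sigma

# Toy usage: accept the test speaker as in-set if the normalized score
# clears a decision threshold.
impostor_scores = np.random.default_rng(0).normal(loc=-2.0, scale=1.0, size=500)
print(znorm(1.3, impostor_scores) > 2.5)
```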

25 citations

Proceedings Article
01 Jan 2003
TL;DR: Two novel variants of TRAPS, developed to address some shortcomings of the TRAPS classifiers, are evaluated, and approximately 20 discriminative temporal patterns per critical band are found to be sufficient for good recognition performance.
Abstract: Motivated by the temporal processing properties of human hearing, researchers have explored various methods to incorporate temporal and contextual information in ASR systems. One such approach, TempoRAl PatternS (TRAPS), takes temporal processing to the extreme and analyzes the energy pattern over long periods of time (500 ms to 1000 ms) within separate critical bands of speech. In this paper we extend the work on TRAPS by experimenting with two novel variants of TRAPS developed to address some shortcomings of the TRAPS classifiers. Both the Hidden Activation TRAPS (HATS) and Tonotopic MultiLayer Perceptrons (TMLP) require 84% fewer parameters than TRAPS but can achieve significant phone recognition error reduction when tested on the TIMIT corpus under clean, reverberant, and several noise conditions. In addition, the TMLP performs training in a single stage and does not require critical band level training targets. Using these variants, we find that approximately 20 discriminative temporal patterns per critical band are sufficient for good recognition performance. In combination with a conventional PLP system, these TRAPS variants achieve significant additional performance improvements.
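As a rough sketch of the TRAP input representation (not the authors' code; the roughly 0.5 s window and 10 ms frame hop are assumptions), each per-band classifier sees a long, normalized log-energy trajectory from a single critical band:

```python
# Minimal sketch of TRAP-style inputs: for one critical band, take the
# ~510 ms log-energy trajectory centered on each frame and mean/variance
# normalize it before it is fed to that band's classifier.
import numpy as np

def trap_vectors(band_log_energy, context=51):
    """band_log_energy: (T,) log energies of one critical band at a 10 ms hop;
    context=51 frames corresponds to roughly 510 ms centered on each frame."""
    half = context // 2
    padded = np.pad(band_log_energy, half, mode="edge")       # repeat edge frames
    traps = np.stack([padded[t:t + context] for t in range(len(band_log_energy))])
    traps -= traps.mean(axis=1, keepdims=True)                # per-vector mean norm
    traps /= traps.std(axis=1, keepdims=True) + 1e-12         # per-vector variance norm
    return traps                                              # (T, context)

# Toy usage on 2 s of random "energies" (200 frames at a 10 ms hop).
print(trap_vectors(np.random.default_rng(0).normal(size=200)).shape)  # (200, 51)
```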

25 citations


Network Information
Related Topics (5)
Recurrent neural network: 29.2K papers, 890K citations, 76% related
Feature (machine learning): 33.9K papers, 798.7K citations, 75% related
Feature vector: 48.8K papers, 954.4K citations, 74% related
Natural language: 31.1K papers, 806.8K citations, 73% related
Deep learning: 79.8K papers, 2.1M citations, 72% related
Performance
Metrics
No. of papers in the topic in previous years
Year: Papers
2023: 24
2022: 62
2021: 67
2020: 86
2019: 77
2018: 95