Topic
TIMIT
About: TIMIT is a research topic. Over its lifetime, 1401 publications have been published within this topic, receiving 59888 citations. The topic is also known as: TIMIT Acoustic-Phonetic Continuous Speech Corpus.
Papers
06 Sep 2009
TL;DR: A feature extraction technique based on static and dynamic modulation spectra derived from long-term sub-band envelopes that provides relative improvements in phoneme recognition accuracy for TIMIT and conversational telephone speech (CTS).
Abstract: We present a feature extraction technique based on static and dynamic modulation spectra derived from long-term envelopes in sub-bands. Estimation of the sub-band temporal envelopes is done using Frequency Domain Linear Prediction (FDLP). These sub-band envelopes are compressed with a static (logarithmic) and a dynamic (adaptive loops) compression. The compressed sub-band envelopes are transformed into modulation spectral components, which are used as features for speech recognition. Experiments are performed on a phoneme recognition task using a hybrid HMM-ANN phoneme recognition system and on an ASR task using the TANDEM speech recognition system. The proposed features provide relative improvements of 3.8% and 11.5% in phoneme recognition accuracy for TIMIT and conversational telephone speech (CTS), respectively. Further, these improvements are found to be consistent for ASR tasks on the OGI-Digits database (a relative improvement of 13.5%).
19 citations
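To make the FDLP idea above concrete, here is a minimal NumPy sketch (not the authors' implementation): linear prediction applied to a signal's DCT coefficients yields an all-pole model whose power response approximates the signal's smooth temporal (Hilbert) envelope; taking the log of that envelope corresponds to the static compression the abstract mentions. Function names and parameters are illustrative.

```python
import numpy as np

def dct_ii(x):
    """DCT-II in direct O(N^2) form (fine for a sketch)."""
    n = len(x)
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    return np.cos(np.pi * k * (2 * t + 1) / (2 * n)) @ x

def levinson_durbin(r, order):
    """Solve the Yule-Walker equations; returns LPC coefficients with a[0] = 1."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a_prev = a[1:i].copy()
        a[1:i] = a_prev + k * a_prev[::-1]
        a[i] = k
        err *= (1.0 - k * k)
    return a

def fdlp_envelope(x, order=20, n_points=128):
    """FDLP core trick: LPC on the DCT of x gives an all-pole approximation
    of x's temporal envelope, sampled at n_points along the signal."""
    c = dct_ii(x)
    full = np.correlate(c, c, mode='full')
    r = full[len(c) - 1:len(c) + order]      # autocorrelation lags 0..order
    a = levinson_durbin(r, order)
    spec = np.fft.rfft(a, 2 * n_points)
    return 1.0 / (np.abs(spec[:n_points]) ** 2 + 1e-12)
```

A static modulation-spectrum feature would then be obtained by transforming `np.log(env)` over longer windows; the dynamic (adaptive-loop) compression branch is omitted here.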
21 Oct 2019
TL;DR: This work proposes an utterance-level classification-aided non-intrusive (UCAN) assessment approach that combines the task of quality score classification with the regression task of quality score estimation, and uses a categorical quality ranking as an auxiliary constraint to assist with quality score estimation.
Abstract: Objective metrics, such as the perceptual evaluation of speech quality (PESQ), have become standard measures for evaluating speech. These metrics enable efficient and low-cost evaluations, where ratings are often computed by comparing a degraded speech signal to its underlying clean reference signal. Reference-based metrics, however, cannot be used to evaluate real-world signals whose references are inaccessible. This project develops a non-intrusive framework for evaluating the perceptual quality of noisy and enhanced speech. We propose an utterance-level classification-aided non-intrusive (UCAN) assessment approach that combines the task of quality score classification with the regression task of quality score estimation. Our approach uses a categorical quality ranking as an auxiliary constraint to assist with quality score estimation, where we jointly train a multi-layered convolutional neural network in a multi-task manner. This approach is evaluated using the TIMIT speech corpus and several noises under a wide range of signal-to-noise ratios. The results show that the proposed system significantly improves quality score estimation compared to several state-of-the-art approaches.
19 citations
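The multi-task idea in the abstract (a regression head for the quality score plus an auxiliary classification head over quality categories, trained jointly) can be sketched as a single joint loss. This is an illustrative NumPy sketch, not the paper's CNN; the weighting `lam` and the score-to-rank bucketing edges are assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def joint_quality_loss(feats, w_reg, w_cls, y_score, y_rank, lam=0.5):
    """Multi-task loss: MSE on the estimated quality score plus
    lam * cross-entropy on the auxiliary quality-rank classification."""
    pred_score = feats @ w_reg                        # regression head
    logits = feats @ w_cls                            # classification head
    mse = np.mean((pred_score - y_score) ** 2)
    p = softmax(logits)
    ce = -np.mean(np.log(p[np.arange(len(y_rank)), y_rank] + 1e-12))
    return mse + lam * ce

# The "categorical quality ranking" can be derived by bucketing the
# continuous scores (bucket edges here are an assumption):
scores = np.array([1.2, 3.4, 4.8])
ranks = np.digitize(scores, bins=[2.0, 3.0, 4.0])     # classes 0..3
```

In the paper both heads sit on top of shared convolutional features and are trained jointly; here the shared features are simply an input matrix.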
26 Sep 2010
TL;DR: This paper explores the use of two proposed glottal signatures, derived from the residual signal, for speaker identification; the proposed signatures are shown to outperform other approaches based on glottal features.
Abstract: Most current speaker recognition systems are based on features extracted from the magnitude spectrum of speech. However, the excitation signal produced by the glottis is expected to convey complementary information relevant to the speaker's identity. This paper explores the use of two proposed glottal signatures, derived from the residual signal, for speaker identification. Experiments using these signatures are performed on both the TIMIT and YOHO databases. The promising results are shown to outperform other approaches based on glottal features. Besides, it is highlighted that the signatures can be used for text-independent speaker recognition and that only a few seconds of voiced speech are sufficient for estimating them reliably.
19 citations
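The "residual signal" these signatures are derived from is the output of LPC inverse filtering: speech is passed through the inverse A(z) of its estimated all-pole vocal-tract filter, leaving (approximately) the glottal excitation. A minimal NumPy sketch, assuming the LPC coefficients are already known (real systems estimate them per analysis frame):

```python
import numpy as np

def synth_allpole(excitation, a):
    """Pass an excitation through the all-pole filter 1/A(z),
    A(z) = 1 + a[1] z^-1 + ... + a[p] z^-p (a[0] must be 1)."""
    p = len(a) - 1
    x = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        for k in range(1, p + 1):
            if n - k >= 0:
                acc -= a[k] * x[n - k]
        x[n] = acc
    return x

def lpc_residual(x, a):
    """Inverse filtering: convolving x with A(z) undoes the all-pole
    filter exactly, recovering the excitation (the LPC residual)."""
    return np.convolve(x, a)[:len(x)]
```

For voiced speech the residual is quasi-periodic, with one strong pulse per glottal cycle; the paper's signatures are computed from this signal rather than from the magnitude spectrum.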
14 Sep 2014
TL;DR: A simple but effective way of using multi-frame targets to improve the accuracy of Artificial Neural Network-Hidden Markov Model (ANN-HMM) hybrid systems.
Abstract: We describe a simple but effective way of using multi-frame targets to improve the accuracy of Artificial Neural Network-Hidden Markov Model (ANN-HMM) hybrid systems. In this approach a Deep Neural Network (DNN) is trained to predict the forced-alignment states of multiple frames using a separate softmax unit for each of the frames. This is in contrast to the usual method of training a DNN to predict only the state of the central frame. By itself this is not sufficient to improve the accuracy of the system significantly. However, if we average the predictions for each frame from the different contexts it is associated with, we achieve state-of-the-art results on TIMIT using a fully connected Deep Neural Network without convolutional architectures or dropout training. On a 14-hour subset of Wall Street Journal (WSJ), using a context-dependent DNN-HMM system, it leads to a relative improvement of 6.4% on the dev set (test-dev93) and 9.3% on the test set (test-eval92).
19 citations
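The averaging step described above can be sketched directly: each context window centered on frame c predicts state distributions for frames c-K..c+K, and frame t then averages the predictions it receives from the up-to-2K+1 windows that cover it. Shapes and names here are illustrative, not the paper's notation.

```python
import numpy as np

def average_multiframe(preds, K):
    """preds[c, j] is the state distribution predicted for frame c + (j - K)
    by softmax unit j of the context window centered on frame c.
    preds: (T, 2K+1, S) -> averaged per-frame predictions of shape (T, S)."""
    T, M, S = preds.shape
    assert M == 2 * K + 1
    out = np.zeros((T, S))
    cnt = np.zeros(T)
    for c in range(T):
        for j in range(M):
            t = c + (j - K)
            if 0 <= t < T:                 # edge frames receive fewer votes
                out[t] += preds[c, j]
                cnt[t] += 1
    return out / cnt[:, None]
```

The averaging is a cheap test-time ensemble: each frame's state posterior combines predictions made from 2K+1 different acoustic contexts.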
11 Feb 2020
TL;DR: This work proposes a neural architecture coupled with a parameterized structured loss function to learn segmental representations for phoneme boundary detection, and evaluates the model on a Hebrew corpus to demonstrate that such phonetic supervision can be beneficial in a multi-lingual setting.
Abstract: Phoneme boundary detection is an essential first step for a variety of speech processing applications such as speaker diarization, speech science, and keyword spotting. In this work, we propose a neural architecture coupled with a parameterized structured loss function to learn segmental representations for the task of phoneme boundary detection. First, we evaluate our model when the spoken phonemes are not given as input. Results on the TIMIT and Buckeye corpora suggest that the proposed model is superior to the baseline models and reaches state-of-the-art performance in terms of F1 and R-value. We further explore the use of phonetic transcription as additional supervision and show this yields minor improvements in performance but substantially better convergence rates. We additionally evaluate the model on a Hebrew corpus and demonstrate that such phonetic supervision can be beneficial in a multi-lingual setting.
19 citations
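The F1 and R-value metrics cited above are standard for boundary detection: a predicted boundary counts as a hit if it falls within a small tolerance of a reference boundary, and the R-value additionally penalizes over-segmentation. A sketch with greedy one-to-one matching; the tolerance value and matching details are assumptions, not necessarily those used in the paper.

```python
import numpy as np

def boundary_metrics(pred, ref, tol=0.02):
    """Precision, recall, F1, and R-value for boundary detection with a
    +/- tol (seconds) matching tolerance and greedy one-to-one matching."""
    pred, ref = sorted(pred), sorted(ref)
    used = [False] * len(pred)
    hits = 0
    for rb in ref:
        for i, pb in enumerate(pred):
            if not used[i] and abs(pb - rb) <= tol:
                used[i] = True
                hits += 1
                break
    prec = hits / len(pred) if pred else 0.0
    rec = hits / len(ref) if ref else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    # R-value: combines hit rate with the over-segmentation ratio OS
    os_ = len(pred) / len(ref) - 1 if ref else 0.0
    r1 = np.sqrt((1 - rec) ** 2 + os_ ** 2)
    r2 = (-os_ + rec - 1) / np.sqrt(2)
    r_value = 1 - (abs(r1) + abs(r2)) / 2
    return prec, rec, f1, r_value
```

Unlike F1, the R-value cannot be gamed by predicting a boundary at every frame, which is why it is commonly reported alongside F1 for segmentation tasks.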