Topic
TIMIT
About: TIMIT is a research topic. Over its lifetime, 1401 publications have been published within this topic, receiving 59888 citations. The topic is also known as: TIMIT Acoustic-Phonetic Continuous Speech Corpus.
Papers published on a yearly basis
Papers
TL;DR: In this article, a feature parameter is obtained by applying the Teager energy operator to the WPD (Wavelet Packet Decomposition) coefficients, and a threshold value is derived from the means and standard deviations of nonspeech frames.
Abstract: In this paper, a feature parameter is obtained by applying the Teager energy operator to the WPD (Wavelet Packet Decomposition) coefficients. The threshold value is obtained from the means and standard deviations of nonspeech frames. Experimental results using the TIMIT speech and NOISEX-92 noise databases show that the proposed algorithm is superior to a typical VAD algorithm. ROC (Receiver Operating Characteristic) curves are used to compare the performance of the VADs for SNR values ranging from 10 dB to -10 dB.
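The abstract's core ingredients, the discrete Teager energy operator and a nonspeech-statistics threshold, can be sketched briefly. This is a minimal illustration, not the paper's implementation: function names are mine, the wavelet packet decomposition itself (e.g. via PyWavelets) is omitted, and the Teager operator is applied directly to frame samples.

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager energy operator: psi[n] = x[n]^2 - x[n-1]*x[n+1]."""
    x = np.asarray(x, dtype=float)
    psi = np.empty_like(x)
    psi[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    psi[0], psi[-1] = psi[1], psi[-2]  # replicate endpoints
    return psi

def vad_threshold(nonspeech_frames, k=3.0):
    """Threshold from the mean and standard deviation of the frame-level
    Teager energies of known nonspeech frames (k is an illustrative margin)."""
    energies = [teager_energy(f).mean() for f in nonspeech_frames]
    return np.mean(energies) + k * np.std(energies)
```

A frame would then be labeled speech when its mean Teager energy exceeds the threshold; in the paper this decision operates on WPD subband coefficients rather than raw samples.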
1 citation
TL;DR: This work investigates two novel lattice-constrained Viterbi training strategies for improving sub-word unit (SWU) inventories discovered using an unsupervised sparse coding approach, finding that this lightly supervised approach substantially increases correspondence with the reference phonemes and in this case also improves pronunciation consistency.
1 citation
01 Jan 2009
TL;DR: In this paper, the authors used Fisher's F-ratio to identify the frequency regions containing the most discriminative speaker information and to suppress the phonetic information in the speech.
Abstract: This Master's thesis presents an investigation of the features and models used when constructing a robust speaker identification system using the TIMIT speaker database. Investigations of the k-means clustering algorithm and Gaussian mixture models (GMMs) for speaker modelling show an improvement in the identification rate when using the GMM speaker models.
The features for speaker identification should emphasize the individual differences in the speech while suppressing the phonetic information; the exact opposite is the case for the features used for speech recognition. However, the same features, the MFCCs, have been used for both tasks. Using Fisher's F-ratio to measure the frequency regions containing the most discriminative speaker information, we present a new set of features, the FRFCCs. They emphasize the regions with speaker-discriminative information and suppress the phonetic information in the speech. Fisher's F-ratio shows that the regions around the fundamental frequency (100 Hz) and the third (2500 Hz) and fourth (3500 Hz) formants contain large amounts of speaker information, while the region around the first formant (500 Hz) contains only phonetic information.
By adding noise to the TIMIT database we show that using the FRFCC features yields a better and more robust automatic speaker identification system. Finally, testing on speech from Danish TV, we show that using the FRFCCs instead of the MFCCs gives an improvement of 91%.
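Fisher's F-ratio, as used in the thesis above, compares between-speaker to within-speaker variance per feature dimension (e.g. per frequency bin). A minimal sketch, with the function name and input layout assumed by me rather than taken from the thesis:

```python
import numpy as np

def fisher_f_ratio(features_by_speaker):
    """Per-dimension F-ratio: variance of the per-speaker means (between-speaker)
    divided by the average per-speaker variance (within-speaker).
    features_by_speaker: list of (n_frames_i, n_dims) arrays, one per speaker."""
    means = np.array([f.mean(axis=0) for f in features_by_speaker])
    within = np.array([f.var(axis=0) for f in features_by_speaker]).mean(axis=0)
    between = means.var(axis=0)
    return between / within
```

Dimensions with a high ratio vary a lot across speakers relative to how much they vary for a single speaker, which is the property the FRFCC features are designed to emphasize.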
1 citation
TL;DR: In this article, the authors studied the problem of acoustic feature learning in the setting where they have access to an external, domain-mismatched dataset of paired speech and articulatory measurements, either with or without labels.
Abstract: Previous work has shown that it is possible to improve speech recognition by learning acoustic features from paired acoustic-articulatory data, for example by using canonical correlation analysis (CCA) or its deep extensions. One limitation of this prior work is that the learned feature models are difficult to port to new datasets or domains, and articulatory data is not available for most speech corpora. In this work we study the problem of acoustic feature learning in the setting where we have access to an external, domain-mismatched dataset of paired speech and articulatory measurements, either with or without labels. We develop methods for acoustic feature learning in these settings, based on deep variational CCA and extensions that use both source and target domain data and labels. Using this approach, we improve phonetic recognition accuracies on both TIMIT and Wall Street Journal and analyze a number of design choices.
1 citation
TL;DR: It is suggested that cepstral coefficients are able to model speech in a given environment in finer detail, whereas acoustic-phonetic-based features are more robust to changes in environment, so that combining both types of measurements leads to the best performance.
Abstract: This work classifies voiceless stop consonant place in CV tokens of English using burst release cues for clean (TIMIT) and telephone speech (NTIMIT). We compared the performance of cepstral coefficients to acoustic-phonetics-motivated features such as center of gravity, burst amplitude, and relative difference of formant amplitudes. In clean speech, cepstral coefficients resulted in better classification. However, for test data from NTIMIT, acoustic-phonetic-based features outperformed cepstral coefficients, particularly if models were trained on clean speech. In addition, augmenting cepstral coefficients with acoustic-phonetic-based measurements resulted in the best performance. These findings suggest that cepstral coefficients are able to model speech in a given environment in finer detail, whereas acoustic-phonetic-based features are more robust to changes in environment, so that combining both types of measurements leads to the best performance.
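One of the acoustic-phonetic burst cues named above, spectral center of gravity, is simply the amplitude-weighted mean frequency of the burst spectrum. A minimal sketch (the function name and windowing choice are mine, not the paper's exact measurement procedure):

```python
import numpy as np

def spectral_center_of_gravity(frame, sr):
    """Amplitude-weighted mean frequency (Hz) of a windowed burst frame."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    return (freqs * spec).sum() / spec.sum()
```

For stop bursts this statistic tends to separate places of articulation (e.g. velar bursts concentrate energy lower than alveolar ones), which is why it is useful as a classification cue alongside burst amplitude and formant-amplitude differences.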
1 citation