Topic
TIMIT
About: TIMIT is a research topic. Over its lifetime, 1401 publications have been published within this topic, receiving 59888 citations. The topic is also known as: TIMIT Acoustic-Phonetic Continuous Speech Corpus.
Papers published on a yearly basis
Papers
TL;DR: A waveform-based clipping detection algorithm is proposed for naturalistic audio streams, and the results show that clipping introduces a nonlinear distortion into clean speech data, which reduces speech quality and speaker-recognition performance.
2 citations
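The paper's exact detector is not reproduced in the summary above. As a hedged illustration only, one common waveform-based heuristic flags clipping by measuring the fraction of samples pinned near full scale; the function name, threshold, and test signals below are assumptions for the sketch, not the paper's method:

```python
import numpy as np

def clipped_fraction(x, threshold=0.999):
    """Fraction of samples at or near the waveform's own peak level.
    Hard-clipped audio spends long stretches pinned at the rails, so this
    fraction is far higher than for undistorted speech or tones."""
    x = np.asarray(x, dtype=float)
    peak = np.max(np.abs(x))
    return float(np.mean(np.abs(x) >= threshold * peak))

# Illustrative signals: a clean sine vs. the same sine driven past full scale.
t = np.linspace(0, 1, 8000)
clean = 0.8 * np.sin(2 * np.pi * 5 * t)                   # undistorted
clipped = np.clip(3 * np.sin(2 * np.pi * 5 * t), -1, 1)   # hard-clipped
```

On these signals the clipped sine yields a near-full-scale fraction of roughly three quarters, while the clean sine stays in the low percent range, which is the separation such a heuristic exploits.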
01 Feb 2018
TL;DR: The proposed split lattice structure based on sonority detection decreased phone error rates by nearly 0.9% on the core TIMIT test corpus compared to conventional decoding with state-of-the-art deep neural networks (DNNs).
Abstract: Phoneme lattices have been shown to be a good way to encode, in compact form, alternative decoding hypotheses from a speech recognition system. However, the optimal phoneme sequence is produced by tracing all the phoneme identities in the lattice. This not only makes the decoder's search space huge, but the final phoneme sequence may also contain false substitutions or insertion errors. In this paper, we introduce split lattice structures, generated by splitting the speech frames according to the manner of articulation. The spectral flatness measure (SFM) is exploited to detect the two broad manners of articulation: sonorants and non-sonorants. Sonorants broadly comprise the vowels, semivowels, and nasals, whereas fricatives, stop consonants, and closures are non-sonorants. A conventional speech decoder produces one lattice per test utterance. In our work, we split the speech frames into sonorants and non-sonorants based on SFM knowledge and generate split lattices. The generated split lattices are modified according to the manner of articulation in each split so as to remove irrelevant phoneme identities from the lattice. For instance, the sonorant lattice is forced to exclude non-sonorant phoneme identities, thereby minimizing false substitutions and insertion errors. The proposed split lattice structure based on sonority detection decreased phone error rates by nearly 0.9% when evaluated on the core TIMIT test corpus, compared to the conventional decoding used with state-of-the-art deep neural networks (DNNs).
2 citations
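The sonorant/non-sonorant split above hinges on the spectral flatness measure, which the abstract names but does not define. A minimal sketch of the standard SFM (geometric mean over arithmetic mean of the power spectrum), with the windowing and test frames below chosen for illustration:

```python
import numpy as np

def spectral_flatness(frame):
    """SFM = geometric mean / arithmetic mean of the power spectrum.
    Close to 1 for noise-like frames (flat spectrum) and close to 0 for
    strongly harmonic frames (energy concentrated in a few bins)."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2 + 1e-12
    return float(np.exp(np.mean(np.log(spec))) / np.mean(spec))

# Illustrative frames: a harmonic (sonorant-like) tone vs. white noise
# (fricative-like). The SFM separates the two cleanly.
rng = np.random.default_rng(0)
n = np.arange(512)
tonal = np.sin(2 * np.pi * 0.03 * n)   # harmonic, sonorant-like
noisy = rng.standard_normal(512)       # noise-like, non-sonorant-like
```

Thresholding this value per frame is the kind of decision the paper uses to route frames into the sonorant or non-sonorant split.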
01 Jan 2007
TL;DR: A theoretical validation of the Neural Predictive Coding model is presented under the assumptions of noise-free and Gaussian-noise-corrupted signals, showing that classification rates clearly improve over usual methods, in particular for phonemes considered difficult to process.
Abstract: In this article, we propose to study a speech coding method applied to the recognition of phonemes. The proposed model (Neural Predictive Coding, NPC) and its two variants (NPC-2 and DFE-NPC) are connectionist models (multilayer perceptrons) based on nonlinear prediction of the speech signal. We show that it is possible to improve the discriminant capacities of such an encoder by introducing signal class-membership information from the coding stage onward. As such, it fits into the category of Discriminant Feature Extraction (DFE) encoders already proposed in the literature. In this study we present a theoretical validation of the model under the assumptions of noise-free and Gaussian-noise-corrupted signals. NPC performance is compared to that obtained with traditional speech-processing methods on the DARPA TIMIT and NTIMIT speech corpora. The simulations presented here show that classification rates are clearly improved compared to usual methods, in particular for phonemes considered difficult to process. A small-vocabulary word recognition experiment is provided to show how NPC features can be used in a more conventional ANN-HMM based speech recognition system.
2 citations
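The NPC model's precise architecture and training procedure are not given in the summary above. The following is only a generic sketch of the underlying idea, nonlinear prediction of a signal with a small multilayer perceptron trained by gradient descent, with all layer sizes, learning rates, and the toy signal chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
p, h = 8, 6                                   # prediction order, hidden units
x = np.sin(2 * np.pi * 0.05 * np.arange(400))  # toy stand-in for speech

# Build (context, target) pairs: predict x[t] from the p previous samples.
X = np.stack([x[t - p:t] for t in range(p, len(x))])
y = x[p:]

W1 = 0.1 * rng.standard_normal((p, h)); b1 = np.zeros(h)
W2 = 0.1 * rng.standard_normal(h);      b2 = 0.0

def forward(X):
    """One-hidden-layer MLP: the hidden code H is the learned representation."""
    H = np.tanh(X @ W1 + b1)
    return H, H @ W2 + b2

_, pred0 = forward(X)
err0 = np.mean((pred0 - y) ** 2)              # prediction error before training

lr = 0.1
for _ in range(1000):                         # plain batch gradient descent
    H, pred = forward(X)
    g = 2 * (pred - y) / len(y)               # dMSE/dpred
    gW2 = H.T @ g; gb2 = g.sum()
    gH = np.outer(g, W2) * (1 - H ** 2)       # backprop through tanh
    gW1 = X.T @ gH; gb1 = gH.sum(0)
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2

_, pred1 = forward(X)
err1 = np.mean((pred1 - y) ** 2)              # prediction error after training
```

In a predictive-coding setup of this kind, the learned hidden representation (or the prediction residual) serves as the feature for classification; the discriminant NPC variants additionally inject class information into this training stage.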
01 Sep 1996
TL;DR: Recognition of voiced speech phonemes is addressed using features extracted from the bispectrum of the speech signal, with voiced speech modeled as a superposition of coupled harmonics located at frequencies that are multiples of the pitch and modulated by the vocal tract.
Abstract: Recognition of voiced speech phonemes is addressed in this paper using features extracted from the bispectrum of the speech signal. Voiced speech is modeled as a superposition of coupled harmonics, located at frequencies that are multiples of the pitch and modulated by the vocal tract. For this type of signal, nonzero bispectral values are shown to be guaranteed by the estimation procedure employed. The vocal tract frequency response is reconstructed from the bispectrum on a set of frequency points that are multiples of the pitch. An AR model is next fitted to this transfer function. The AR coefficients are used as the feature vector for the subsequent classification step. Any finite-dimensional vector classifier can be employed at this point. Experiments using the LVQ neural classifier give satisfactory classification scores on real speech data extracted from the DARPA/TIMIT speech corpus.
2 citations
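The bispectrum reconstruction step is specific to the paper, but the final AR-fitting step is standard. Assuming the power response is available on a dense frequency grid (an illustrative simplification; the paper reconstructs it at pitch multiples), one recovers autocorrelation values by inverse FFT and solves the Yule-Walker equations with the Levinson-Durbin recursion:

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve the Yule-Walker equations r[i] + sum_j a[j]*r[i-j] = 0
    for the AR polynomial a[0..order] (with a[0] = 1)."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    e = r[0]
    for i in range(1, order + 1):
        acc = r[i] + a[1:i] @ r[1:i][::-1]
        k = -acc / e                      # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        e *= 1.0 - k * k                  # updated prediction-error power
    return a

# Illustrative check: sample the power response of a known AR(2) filter
# A(z) = 1 - 1.5 z^-1 + 0.7 z^-2 on a grid, get the autocorrelation by
# inverse FFT, and refit. The recursion should recover the coefficients.
N = 512
w = 2 * np.pi * np.arange(N) / N
A_true = 1 - 1.5 * np.exp(-1j * w) + 0.7 * np.exp(-2j * w)
power = 1.0 / np.abs(A_true) ** 2
r = np.fft.ifft(power).real               # autocorrelation (aliasing negligible)
a_hat = levinson_durbin(r, 2)
```

The recovered `a_hat[1:]` are then the kind of AR coefficients the paper feeds to the classifier.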
21 May 2015
TL;DR: An experiment was carried out on the Frame Distance Array (FDA) algorithm with the main goal of tuning its parameters; the best combination of values was chosen based on the observed detection rate, miss rate, and false-boundary rate.
Abstract: This work concerns unsupervised automatic speech segmentation. An experiment was carried out on the Frame Distance Array (FDA) algorithm with the main goal of tuning its parameters. The experiment applied the algorithm to the TIMIT corpus, using MFCCs as the speech signal features. The parameters tuned in this work are the frame length, the frame increment, the number of test frames, and the test-frame step size. The best combination of values was chosen based on the observed detection rate, miss rate, and false-boundary rate. The best values found were 23 ms, 1.5 ms, 9 frames, and 2 frames for the frame length, frame increment, number of test frames, and test-frame step size, respectively.
2 citations
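The FDA algorithm itself is not detailed in the abstract above. As a hedged sketch of the general frame-distance idea using the paper's tuned parameter values, with a simple per-frame RMS feature standing in for the paper's MFCCs:

```python
import numpy as np

sr = 16000                               # TIMIT sampling rate
frame_len = int(0.023 * sr)              # 23 ms frame length (tuned value)
frame_inc = int(0.0015 * sr)             # 1.5 ms frame increment (tuned value)
n_test, step = 9, 2                      # test frames and step size (tuned values)

def frame_rms(x, flen, finc):
    """Per-frame RMS feature (illustrative stand-in for MFCCs)."""
    n = 1 + (len(x) - flen) // finc
    return np.array([[np.sqrt(np.mean(x[i * finc:i * finc + flen] ** 2))]
                     for i in range(n)])

def boundary_scores(feats, n_test, step):
    """Distance between the mean feature of the n_test frames before and
    after each position (evaluated every `step` frames): peaks in this
    score suggest segment boundaries."""
    scores = np.zeros(len(feats))
    for t in range(n_test, len(feats) - n_test, step):
        scores[t] = np.linalg.norm(feats[t - n_test:t].mean(0)
                                   - feats[t:t + n_test].mean(0))
    return scores

# Toy signal with one clear acoustic change: quiet noise then loud noise.
rng = np.random.default_rng(2)
x = np.concatenate([0.1 * rng.standard_normal(4800),
                    1.0 * rng.standard_normal(4800)])
scores = boundary_scores(frame_rms(x, frame_len, frame_inc), n_test, step)
```

On this toy signal the score peaks near the frame index where the amplitude change falls (around frame 190 with these hop settings), which is how candidate boundaries are proposed before scoring detection, miss, and false-boundary rates against reference labels.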