Topic
TIMIT
About: TIMIT is a research topic. Over its lifetime, 1401 publications have been published within this topic, receiving 59888 citations. The topic is also known as: TIMIT Acoustic-Phonetic Continuous Speech Corpus.
Papers
••
06 Sep 2015 TL;DR: It is demonstrated that the bottleneck features preserve trajectory continuity well over time and can provide a suitable representation for the continuous-state hidden Markov model (CS-HMM), which considers speech as a sequence of dwell and transition regions.
Abstract: This paper presents an analysis of a low-dimensional representation of speech for modelling speech dynamics, extracted using bottleneck neural networks. The input to the neural network is a set of spectral feature vectors. We explore the effect of various network designs and training choices, such as varying the size of context in the input layer, the size of the bottleneck and other hidden layers, and using input reconstruction or phone posteriors as targets. Experiments are performed on TIMIT. The bottleneck features are employed in a conventional HMM-based phoneme recognition system, with recognition accuracy of 70.6% on the core test set achieved using only 9-dimensional features. We also analyse how the bottleneck features fit the assumptions of dynamic models of speech. Specifically, we employ the continuous-state hidden Markov model (CS-HMM), which considers speech as a sequence of dwell and transition regions. We demonstrate that the bottleneck features preserve trajectory continuity well over time and can provide a suitable representation for CS-HMM.
16 citations
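The abstract above describes splicing a context window of spectral frames into the input layer and reading features off a narrow bottleneck layer. The following is a minimal sketch of that data flow only, assuming a single tanh hidden layer, a linear 9-dimensional bottleneck, and untrained random weights. The function names (`splice`, `bottleneck_features`) and all layer sizes other than the bottleneck width of 9 are hypothetical and are not taken from the paper.

```python
import math
import random

def splice(frames, context):
    """Stack each frame with `context` neighbours on each side.
    Edge frames are padded by repeating the first or last frame."""
    n = len(frames)
    spliced = []
    for t in range(n):
        window = []
        for k in range(-context, context + 1):
            idx = min(max(t + k, 0), n - 1)
            window.extend(frames[idx])
        spliced.append(window)
    return spliced

def init_layer(rng, n_in, n_out):
    """Random weights and zero biases (untrained, for illustration only)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def dense(x, weights, biases, activation=None):
    """Affine layer followed by an optional element-wise activation."""
    y = [sum(w * xi for w, xi in zip(row, x)) + b
         for row, b in zip(weights, biases)]
    return [activation(v) for v in y] if activation else y

def bottleneck_features(frames, context=5, hidden=32, bottleneck=9, seed=0):
    """Return one bottleneck vector per input frame.
    Assumed layout: spliced input -> tanh hidden layer -> linear bottleneck."""
    rng = random.Random(seed)
    spliced = splice(frames, context)
    w1, b1 = init_layer(rng, len(spliced[0]), hidden)
    w2, b2 = init_layer(rng, hidden, bottleneck)
    return [dense(dense(x, w1, b1, math.tanh), w2, b2) for x in spliced]

# Toy input: 10 frames of 13 spectral coefficients each (made-up values).
frames = [[float(t + d) for d in range(13)] for t in range(10)]
features = bottleneck_features(frames)
```

In a trained system the weights would be learned with input reconstruction or phone posteriors as targets, as the abstract states; the layers after the bottleneck are omitted here because only the bottleneck activations are used as features.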
••
TL;DR: An image analysis-based algorithm is proposed to enhance the binary T–F mask obtained in the initial segmentation stage of CASA-based monaural speech separation systems to improve the speech quality and reduce the noise residue.
Abstract: Monaural speech separation is the process of separating the target speech from a noisy speech mixture recorded using a single microphone. It is a challenging problem in speech signal processing, and computational auditory scene analysis (CASA) has recently offered a reasonable solution to it. This research work proposes an image analysis-based algorithm to enhance the binary T–F mask obtained in the initial segmentation stage of CASA-based monaural speech separation systems to improve the speech quality. The proposed algorithm consists of labeling the initial segmentation mask, boundary extraction, active pixel detection and finally eliminating the noisy non-active pixels. In labeling, the T–F mask obtained from the initial segmentation is labeled as a periodicity pixel matrix and a non-periodicity pixel matrix. Next, boundaries are created by connecting all possible nearby periodicity and non-periodicity pixel matrices into speech boundaries. Some speech boundaries may include noisy T–F units as holes, and these holes are treated using the proposed algorithm to properly classify them as speech-dominant or noise-dominant T–F units in the active pixel detection process. Finally, the noisy T–F units are eliminated. The performance of the proposed algorithm is evaluated using the TIMIT speech database. The experimental results show that the proposed algorithm improves the quality of the separated speech by increasing the signal-to-noise ratio by an average value of 9.64 dB and reduces the noise residue by 25.55% as compared to the noisy speech mixture.
16 citations
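The idea of treating a binary T–F mask as an image can be illustrated with a generic cleanup pass. This sketch is not the paper's algorithm: it swaps in plain 4-connected component labeling, fills every enclosed hole (whereas the abstract says holes are classified as speech-dominant or noise-dominant), and drops small isolated regions. The function names and the `min_size` threshold are hypothetical.

```python
from collections import deque

def connected_components(mask, value):
    """Return the 4-connected regions of cells equal to `value`,
    each as a list of (row, col) positions."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] != value or seen[r][c]:
                continue
            queue, region = deque([(r, c)]), []
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and mask[ny][nx] == value):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append(region)
    return regions

def clean_mask(mask, min_size=3):
    """Fill enclosed holes, then remove isolated regions smaller than min_size."""
    rows, cols = len(mask), len(mask[0])
    out = [row[:] for row in mask]
    # A hole is a zero-region that never touches the border of the mask.
    for region in connected_components(mask, 0):
        on_border = any(y in (0, rows - 1) or x in (0, cols - 1)
                        for y, x in region)
        if not on_border:
            for y, x in region:
                out[y][x] = 1
    # Small one-regions are treated as noise residue (simplifying assumption).
    for region in connected_components(out, 1):
        if len(region) < min_size:
            for y, x in region:
                out[y][x] = 0
    return out

# Toy mask: rows are frequency channels, columns are time frames.
mask = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 1, 0, 1, 0, 0, 1],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
cleaned = clean_mask(mask)
```

In the toy mask, the enclosed zero at row 2, column 2 is filled and the lone unit at row 2, column 6 is removed.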
••
26 Jun 1995 TL;DR: The algorithm is based on using a speech recognition system to discover the surface pronunciations of words in speech corpora and shows the probabilities the system has learned for ten common phonological rules which model reductions and coarticulation effects.
Abstract: This paper presents an algorithm for learning the probabilities of optional phonological rules from corpora. The algorithm is based on using a speech recognition system to discover the surface pronunciations of words in speech corpora; using an automatic system obviates expensive phonetic labeling by hand. We describe the details of our algorithm and show the probabilities the system has learned for ten common phonological rules which model reductions and coarticulation effects. These probabilities were derived from a corpus of 7203 sentences of read speech from the Wall Street Journal, and are shown to be a reasonably close match to probabilities from phonetically hand-transcribed data (TIMIT). Finally, we analyze the probability differences between rule use in male versus female speech, and suggest that the differences are caused by differing average rates of speech.
16 citations
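One simple way to turn canonical and surface pronunciations into a rule probability is a relative-frequency count: the number of sites where the rule applied divided by the number of sites where it could have applied. The sketch below assumes that estimator, which the abstract does not specify, and uses a single hypothetical flapping rule (/t/ becomes a flap `dx` between vowels) with made-up toy data. It handles substitution rules only, so pronunciation pairs of unequal length are skipped.

```python
# Small illustrative vowel inventory; not a complete phone set.
VOWELS = {"aa", "ae", "ah", "ax", "eh", "er", "ih", "iy", "ow", "uw"}

def flapping_sites(canonical):
    """Indices where /t/ sits between two vowels, i.e. where the
    hypothetical flapping rule could apply."""
    return [i for i in range(1, len(canonical) - 1)
            if canonical[i] == "t"
            and canonical[i - 1] in VOWELS
            and canonical[i + 1] in VOWELS]

def rule_probability(pairs):
    """Relative frequency of rule application over all applicable sites.
    `pairs` holds (canonical, surface) phone sequences; returns None
    when the rule is never applicable."""
    applicable = applied = 0
    for canonical, surface in pairs:
        if len(canonical) != len(surface):
            continue  # this sketch does not align insertions or deletions
        for i in flapping_sites(canonical):
            applicable += 1
            if surface[i] == "dx":
                applied += 1
    return applied / applicable if applicable else None

# Made-up toy data: surface forms as a recognizer might output them.
pairs = [
    (["b", "ah", "t", "er"], ["b", "ah", "dx", "er"]),
    (["w", "aa", "t", "er"], ["w", "aa", "t", "er"]),
    (["s", "ih", "t", "iy"], ["s", "ih", "dx", "iy"]),
    (["k", "ae", "t"], ["k", "ae", "t"]),  # no intervocalic /t/
]
probability = rule_probability(pairs)
```

Here three sites are applicable and two show the flap, giving an estimate of 2/3.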
••
01 Jan 2002 TL;DR: This paper attempts to overcome the training cost of standard SVMs on large speech corpora by using the alternative Lagrangian formulation, which only requires the inversion of a matrix whose dimension is proportional to the size of the MFCC sequence of vectors.
Abstract: We study the performance of binary and multi-category SVMs for phoneme classification. The training process of the standard formulation involves the solution of a quadratic programming problem whose complexity depends on the size of the training set. The large size of speech corpora such as TIMIT seriously limits their practical use in continuous speech recognition tasks on off-the-shelf personal computers in a reasonable time. In this paper, we attempt to overcome the above difficulty by using the alternative Lagrangian formulation, which only requires the inversion of a matrix whose dimension is proportional to the size of the MFCC sequence of vectors. We provide computational results for all possible binary classifiers (1830) on the TIMIT database, which are shown to be competitive in terms of recognition rates (96.8%) with those found in the literature (95.6%). The binary classifiers are introduced in the DAGSVM and voting algorithms to perform multi-category classification on some hand-picked subsets from the TIMIT corpus.
16 citations
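The figure of 1830 binary classifiers is consistent with one classifier per unordered pair of 61 classes, since 61 × 60 / 2 = 1830, and 61 is the size of TIMIT's standard phone label set. That the authors used this label set is an inference, not something the abstract states. The sketch below enumerates the pairs and shows a minimal max-wins voting step; the `decide` callback stands in for a trained binary SVM and is hypothetical.

```python
from collections import Counter
from itertools import combinations

def pairwise_tasks(labels):
    """All unordered label pairs, one binary classifier per pair."""
    return list(combinations(labels, 2))

def vote(sample, labels, decide):
    """Max-wins voting: each pairwise classifier casts one vote.
    `decide(a, b, sample)` must return either a or b."""
    tally = Counter(decide(a, b, sample) for a, b in combinations(labels, 2))
    return tally.most_common(1)[0][0]

# Number of pairwise classifiers for an assumed 61-class label set.
n_classifiers = len(pairwise_tasks(range(61)))

# Toy stand-in for a trained binary classifier: pick the nearer label.
def nearest(a, b, sample):
    return a if abs(sample - a) <= abs(sample - b) else b

winner = vote(6, [0, 5, 10], nearest)
```

With the toy `nearest` rule, the sample 6 collects two votes for label 5 and one for label 10, so label 5 wins. DAGSVM, also named in the abstract, evaluates the same pairwise classifiers along a decision graph instead of tallying all votes; it is not shown here.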
••
TL;DR: A two-stage speech activity detection system is presented which first uses a voice activity detector to discard pause segments from the audio signals; this is done even in the presence of stationary background noise.
16 citations