Topic
TIMIT
About: TIMIT is a research topic. Over the lifetime, 1401 publications have been published within this topic receiving 59888 citations. The topic is also known as: TIMIT Acoustic-Phonetic Continuous Speech Corpus.
Papers published on a yearly basis
Papers
TL;DR: Experimental results on the TIMIT phone recognition task and a large vocabulary continuous speech recognition task of telehealth captioning demonstrated that the proposed approach of integrating template matching with statistical modeling significantly improved recognition accuracy over the hidden Markov modeling baselines for both TIMIT and telehealth tasks.
Abstract: We propose a novel approach of integrating exemplar-based template matching with statistical modeling to improve continuous speech recognition. We choose the template unit to be context-dependent phone segments (triphone context) and use multiple Gaussian mixture model (GMM) indices to represent each frame of speech templates. We investigate two different local distances, log likelihood ratio (LLR) and Kullback-Leibler (KL) divergence, for dynamic time warping (DTW)-based template matching. In order to reduce computation and storage complexities, we also propose two methods for template selection: minimum distance template selection (MDTS) and maximum likelihood template selection (MLTS). We further propose to fine-tune the MLTS template representatives by using a GMM merging algorithm so that the GMMs can better represent the frames of the selected template representatives. Experimental results on the TIMIT phone recognition task and a large vocabulary continuous speech recognition (LVCSR) task of telehealth captioning demonstrated that the proposed approach of integrating template matching with statistical modeling significantly improved recognition accuracy over the hidden Markov modeling (HMM) baselines for both TIMIT and telehealth tasks. The template selection methods also provided significant accuracy gains over the HMM baseline while largely reducing the computation and storage complexities. When all templates or MDTS were used, the LLR local distance gave better performance than the KL local distance. For MLTS and template compression, the KL local distance gave better performance than the LLR local distance, and template compression further improved the recognition accuracy on top of MLTS at a lower computational cost.
3 citations
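The DTW-based template matching with a KL local distance described in the abstract above can be illustrated with a small sketch. Note the simplifications: each frame here is modeled by a single diagonal Gaussian rather than the paper's multiple GMM indices, and the function and variable names are illustrative, not the authors' implementation.

```python
import numpy as np

def kl_gauss(m1, v1, m2, v2):
    # Symmetric KL divergence between two diagonal Gaussians
    # (means m1, m2 and variances v1, v2, all 1-D arrays).
    kl12 = 0.5 * np.sum(np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)
    kl21 = 0.5 * np.sum(np.log(v1 / v2) + (v2 + (m2 - m1) ** 2) / v1 - 1.0)
    return 0.5 * (kl12 + kl21)

def dtw(template, query, local_dist):
    # Classic DTW: template and query are lists of (mean, var) frame
    # models; local_dist scores one template frame against one query frame.
    T, Q = len(template), len(query)
    D = np.full((T + 1, Q + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, Q + 1):
            d = local_dist(*template[i - 1], *query[j - 1])
            # Standard step pattern: insertion, deletion, or match.
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[T, Q]
```

The quadratic cost of this alignment over many templates is exactly why the paper's template selection methods (MDTS, MLTS) matter: they shrink the set of templates each query must be warped against.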
11 Apr 2014
TL;DR: An algorithm for spotting fricative consonants in continuous speech that relies only on features extracted directly from the audio signal and on common classification techniques, making it simple to implement and language-invariant.
Abstract: We present an algorithm for spotting fricative consonants in continuous speech. Fricative spotting can be useful in professional audio applications, where excessive accentuation of these phonemes can degrade the aesthetics of voice recordings, or in applications for the hearing-impaired, where certain manipulations can increase their perception. All stages of our algorithm rely only on features extracted directly from the audio signal and on common classification techniques, making it simple to implement and language-invariant. In the first stage, a linear classifier, pre-trained using Fisher's Linear Discriminant Analysis (LDA), is used to detect fricatives inside speech sentences. In the second stage, the detected phonemes are further analyzed using a decision-tree classifier, attempting to reject false detections. Tested on the full corpus of the TIMIT audio database, the algorithm achieved very good detection rates across the entire range of fricative phonemes.
3 citations
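The two-stage detector described above (LDA detection, then decision-tree rejection) can be sketched with scikit-learn. Everything here is an assumption for illustration: the two synthetic features stand in for whatever frame features the authors extract (e.g. high-band energy and zero-crossing rate are plausible but not confirmed), and the cluster parameters are invented.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic 2-D frame features; fricative frames (label 1) cluster at
# higher values but overlap with non-fricative frames (label 0).
X_fric = rng.normal(loc=[0.6, 0.5], scale=0.15, size=(200, 2))
X_other = rng.normal(loc=[0.3, 0.3], scale=0.15, size=(200, 2))
X = np.vstack([X_fric, X_other])
y = np.array([1] * 200 + [0] * 200)

# Stage 1: LDA linear classifier flags candidate fricative frames.
lda = LinearDiscriminantAnalysis().fit(X, y)
candidates = lda.predict(X) == 1

# Stage 2: a decision tree re-examines only the flagged frames,
# attempting to reject stage-1 false detections.
tree = DecisionTreeClassifier(max_depth=3).fit(X[candidates], y[candidates])
final = tree.predict(X[candidates])
```

In practice the second stage would be trained on held-out detections rather than the same frames, but the cascade structure is the point of the sketch.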
01 Jan 1992
TL;DR: This work uses a very detailed, biologically motivated input representation of the speech tokens, Lyon's cochlear model as implemented by Slaney [20], to produce results comparable to those obtained by others without the addition of time normalization.
Abstract: We report results on vowel and stop consonant recognition with tokens extracted from the TIMIT database. Our current system differs from others doing similar tasks in that we do not use any specific time normalization techniques. We use a very detailed biologically motivated input representation of the speech tokens, Lyon's cochlear model as implemented by Slaney [20]. This detailed, high-dimensional representation, known as a cochleagram, is classified by either a back-propagation or a hybrid supervised/unsupervised neural network classifier. The hybrid network is composed of a biologically motivated unsupervised network and a supervised back-propagation network. This approach produces results comparable to those obtained by others without the addition of time normalization.
3 citations
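The general idea of a cochleagram, a per-band energy envelope over time, can be sketched crudely. To be clear about the assumptions: Lyon's model is a detailed cascade of filters with automatic gain control; the code below substitutes a simple log-spaced Butterworth filterbank with half-wave rectification and frame averaging, so it is only a rough stand-in, and all parameter choices are hypothetical.

```python
import numpy as np
from scipy.signal import butter, lfilter

def cochleagram_like(x, sr, n_bands=16, fmin=100.0, fmax=3800.0, frame=256):
    """Crude cochleagram stand-in: log-spaced bandpass filterbank,
    half-wave rectification, then per-frame energy averaging.
    (Lyon's cascade-of-filters cochlear model is far more detailed.)"""
    edges = np.geomspace(fmin, fmax, n_bands + 1)
    out = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Frequencies are normalized to Nyquist for scipy's butter().
        b, a = butter(2, [lo / (sr / 2), hi / (sr / 2)], btype="band")
        band = np.maximum(lfilter(b, a, x), 0.0)  # half-wave rectify
        frames = band[: len(band) // frame * frame].reshape(-1, frame)
        out.append(frames.mean(axis=1))
    return np.array(out)  # shape: (n_bands, n_frames)
```

The result is a high-dimensional time-frequency image of the kind the abstract feeds to its neural classifiers.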
01 Sep 2014
TL;DR: It was observed that vowels are inserted by L1 Bengali speakers to break up consonant clusters or to avoid a syllable-final coda consonant.
Abstract: As the importance of English grows day by day, second-language learners need to acquire the language properly, and proper acquisition involves correct pronunciation. Read speech of “The North Wind and the Sun” from forty native (L1) Bengali speakers was analyzed to find the phonetic and phonological problems in L1 Bengali speakers' English speech. During the study, automatic phoneme alignment was carried out with the HTK toolkit and a modified TIMIT dictionary. The results show that L1 Bengali speakers substitute new English consonant and vowel phonemes with Bengali sounds that are close to those English sounds. As for phonological problems, it was observed that vowels are inserted by L1 Bengali speakers to break up consonant clusters or to avoid a syllable-final coda consonant. The effect of fluency on the phonetic and phonological problems of L1 Bengali speakers is also presented in the paper.
3 citations
22 Jun 2021
TL;DR: In this paper, a hardware-software co-design for efficient sparse deep neural network (DNN) implementation in a regular systolic array for real-time on-device speech processing is presented.
Abstract: This paper presents a hardware-software co-design for efficient sparse deep neural network (DNN) implementation in a regular systolic array for real-time on-device speech processing. The weight pruning format, exploiting pattern-based coordinate-assisted (PICA) sparsity, extends pattern-based pruning to both convolutional neural networks (CNNs) and recurrent neural networks (RNNs). It reduces index storage overhead and avoids accuracy degradation. The proposed systolic accelerator leverages intrinsic data reuse and locality to accommodate PICA-based sparsity without complex data distribution networks, and it supports DNNs with different topologies. While reducing the model size by 16x, PICA sparsification cuts index storage overhead by 6.02x and still achieves 20.7% WER on the TIMIT dataset. For the pruned WaveNet and LSTM, the accelerator achieves 0.62 and 2.69 TOPS/W energy efficiency, 1.7x to 10x higher than the state of the art.
3 citations
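The core idea behind pattern-based pruning, restricting each weight block to one of a few predefined sparsity masks so that only a small pattern index needs storing per block, can be sketched as follows. The 1x4 block width and the four-pattern library below are hypothetical choices for illustration, not the paper's actual PICA pattern set.

```python
import numpy as np

# A small library of allowed sparsity patterns for 1x4 weight blocks;
# each pattern keeps exactly 2 of 4 weights (a hypothetical choice).
PATTERNS = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
], dtype=bool)

def pattern_prune(w):
    """Prune each 4-wide block to the pattern retaining the most weight
    magnitude. Returns the pruned weights plus one pattern index per
    block, which is the only index metadata that must be stored.
    Assumes w.size is divisible by the block width (4)."""
    blocks = w.reshape(-1, 4)
    # Score each pattern by the absolute magnitude it would retain.
    scores = np.abs(blocks) @ PATTERNS.T.astype(float)
    idx = scores.argmax(axis=1)
    pruned = blocks * PATTERNS[idx]
    return pruned.reshape(w.shape), idx
```

Because every block conforms to a known pattern, a systolic array can route the surviving weights with fixed, regular dataflow instead of per-weight coordinate lists, which is the storage and hardware-regularity win the abstract describes.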