
Showing papers on "TIMIT published in 1992"


Journal ArticleDOI
TL;DR: In the approach described, the ANN outputs constitute the sequence of observation vectors for the HMM, and an algorithm is proposed for global optimization of all the parameters.
Abstract: The integration of multilayered and recurrent artificial neural networks (ANNs) with hidden Markov models (HMMs) is addressed. ANNs are suitable for approximating functions that compute new acoustic parameters, whereas HMMs have been proven successful at modeling the temporal structure of the speech signal. In the approach described, the ANN outputs constitute the sequence of observation vectors for the HMM. An algorithm is proposed for global optimization of all the parameters. Results on speaker-independent recognition experiments using this integrated ANN-HMM system on the TIMIT continuous speech database are reported.

234 citations
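The coupling described above, in which ANN outputs serve as the HMM's observation vectors, can be sketched as follows. This is a minimal illustration and not the authors' system: the one-layer network, the diagonal-Gaussian emissions, and all dimensions are invented placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def ann_features(frames, W, b):
    """Toy one-layer ANN: maps raw acoustic frames to observation vectors."""
    return np.tanh(frames @ W + b)

def hmm_log_forward(obs, log_pi, log_A, means, var):
    """Forward algorithm with diagonal-Gaussian emissions; returns log P(obs)."""
    T, d = obs.shape
    # log N(o_t | mean_s, var*I) for every frame/state pair
    diff = obs[:, None, :] - means[None, :, :]              # (T, S, d)
    log_b = -0.5 * (diff**2 / var).sum(-1) - 0.5 * d * np.log(2 * np.pi * var)
    alpha = log_pi + log_b[0]
    for t in range(1, T):
        alpha = log_b[t] + np.logaddexp.reduce(alpha[:, None] + log_A, axis=0)
    return np.logaddexp.reduce(alpha)

# Toy setup: 5 raw dims -> 3-dim observations, scored by a 2-state HMM
frames = rng.normal(size=(20, 5))
W, b = rng.normal(size=(5, 3)), np.zeros(3)
obs = ann_features(frames, W, b)          # ANN outputs = HMM observations
log_pi = np.log([0.6, 0.4])
log_A = np.log([[0.9, 0.1], [0.2, 0.8]])
means = rng.normal(size=(2, 3))
ll = hmm_log_forward(obs, log_pi, log_A, means, var=1.0)
print(ll)  # finite log-likelihood
```

Global optimization in the paper jointly trains the ANN and HMM parameters; the sketch only shows the forward-pass coupling between the two.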


Journal ArticleDOI
TL;DR: A set of phonetic studies based on analysis of the TIMIT speech database detail new results in speaker-dependent variation due to sex and dialect region of the talker including effects on stop release frequency, speaking rate, vowel reduction, flapping, and the use of glottal stop.
Abstract: A set of phonetic studies based on analysis of the TIMIT speech database is presented. Using a database methodological approach, these studies detail new results in speaker‐dependent variation due to sex and dialect region of the talker including effects on stop release frequency, speaking rate, vowel reduction, flapping, and the use of glottal stop. TIMIT was found to be fertile ground for gathering acoustic–phonetic knowledge having relevance to the phonetic classification and recognition goals for which TIMIT was designed, as well as to the linguist attempting to describe regularity and variability in the pronunciation of read English speech.

94 citations


Journal ArticleDOI
TL;DR: A novel method for using the state sequence output of a large hidden Markov model as input to a phonemic recognition system demonstrates that a significant amount of speech information is preserved in the most likely state sequences produced by such a model.
Abstract: The authors present a novel method for using the state sequence output of a large hidden Markov model as input to a phonemic recognition system, thereby demonstrating that a significant amount of speech information is preserved in the most likely state sequences produced by such a model. Two different system formulations are presented, both achieving recognition results equivalent to those achieved by other researchers using systems with similar levels of complexity. The best system formulation achieved a 56.1% recognition rate with 10.8% insertions on a closed-set experiment and a 53.3% recognition rate with 11.8% insertions on a speaker-independent experiment using the TIMIT acoustic-phonetic database. This experiment used 80 male speakers for model training and a separate set of 24 male speakers for model testing.

15 citations
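Extracting the most likely state sequence from an HMM, as the system above does before feeding it to the phonemic recognizer, is the standard Viterbi algorithm. A minimal sketch over a made-up two-state, two-symbol discrete HMM (all probabilities invented):

```python
import numpy as np

def viterbi(log_pi, log_A, log_B, obs):
    """Most-likely state sequence for a discrete-emission HMM.

    log_pi: (S,) initial log-probs; log_A: (S, S) transition log-probs;
    log_B: (S, K) emission log-probs over K symbols; obs: symbol ids.
    """
    S, T = len(log_pi), len(obs)
    delta = log_pi + log_B[:, obs[0]]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A          # (from-state, to-state)
        back[t] = scores.argmax(axis=0)          # best predecessor per state
        delta = scores.max(axis=0) + log_B[:, obs[t]]
    states = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):                # backtrack
        states.append(int(back[t, states[-1]]))
    return states[::-1]

log_pi = np.log([0.5, 0.5])
log_A = np.log([[0.8, 0.2], [0.3, 0.7]])
log_B = np.log([[0.9, 0.1], [0.2, 0.8]])
path = viterbi(log_pi, log_A, log_B, [0, 0, 1, 1, 1])
print(path)  # -> [0, 0, 1, 1, 1]
```

In the paper's setting the decoded state ids themselves, rather than the acoustics, become the input stream for the downstream phonemic recognizer.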


Proceedings ArticleDOI
23 Mar 1992
TL;DR: The results show that the general ergodic background model is as effective as a vocabulary-specific model, however, the MC technique is not effective.
Abstract: Hidden Markov model (HMM) decomposition is used for recognizing speech in the presence of an interfering background speaker. The foreground speech is modeled by a set of left-to-right isolated word HMMs trained on a small isolated word database, and the background speech is modeled by a parallel ergodic HMM trained on a subset of TIMIT. The standard output approximation (OA) method of estimating the output probability distributions is used, and compared with a simple model combination (MC) technique. Recent work in this area has shown the effectiveness of vocabulary-specific background speech models, and hence this is used as a baseline. The results show that the general ergodic background model is as effective as a vocabulary-specific model. However, the MC technique is not effective.

14 citations
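HMM decomposition searches the product state space (i, j) of the foreground and background models. One common form of the output approximation scores a combined state by letting the louder source dominate the frame, i.e. taking the max of the two per-state log-likelihoods; a toy sketch with invented numbers, not necessarily the paper's exact formulation:

```python
import numpy as np

# Hypothetical per-state frame log-likelihoods for a 3-state foreground
# word model and a 2-state ergodic background model.
log_fg = np.log(np.array([0.6, 0.3, 0.1]))
log_bg = np.log(np.array([0.7, 0.3]))

# Output approximation (OA): combined emission score for product state (i, j)
# is dominated by whichever source is more likely to have produced the frame.
oa = np.maximum(log_fg[:, None], log_bg[None, :])        # (3, 2) grid

# A simple additive model-combination (MC) alternative sums the likelihoods.
mc = np.logaddexp(log_fg[:, None], log_bg[None, :])      # (3, 2) grid
print(oa.shape, mc.shape)
```

Decoding would then run Viterbi over this (3 x 2) product state space, with transitions factored into foreground and background components.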


Proceedings ArticleDOI
23 Feb 1992
TL;DR: Phonetic classification algorithms have been developed for wide-band and telephone quality speech, and were tested on subsets of the TIMIT and N-TIMIT databases, and the telephone network seems to increase the error rate.
Abstract: Benchmarking the performance of telephone-network-based speech recognition systems is hampered by two factors: lack of standardized databases for telephone network speech, and insufficient understanding of the impact of the telephone network on recognition systems. The N-TIMIT database was used in the experiments described in this paper in order to "calibrate" the effect of the telephone network on phonetic classification algorithms. Phonetic classification algorithms have been developed for wide-band and telephone quality speech, and were tested on subsets of the TIMIT and N-TIMIT databases. The classifier described in this paper provides accuracy of 75% on wide-band TIMIT data and 66.5% on telephone quality N-TIMIT data. Overall, the telephone network seems to increase the error rate by a factor of 1.3.

14 citations



01 Dec 1992
TL;DR: The resulting Vector Quantized (VQ) distortion based classification indicates the auditory model provides slightly reduced recognition in clean studio quality recordings yet achieves similar performance to the LPC cepstral representation in both degraded environments and in test data recorded over multiple sessions.
Abstract: The TIMIT and KING databases, as well as a ten-day AFIT speaker corpus, are used to compare proven spectral processing techniques to an auditory neural representation for speaker identification. The feature sets compared were Linear Predictive Coding (LPC) cepstral coefficients and auditory nerve firing rates using the Payton model. This auditory model accounts for the mechanisms found in the human middle and inner auditory periphery as well as neural transduction. Clustering algorithms were used to generate speaker-specific codebooks: one statistically based, the other neural. These algorithms are the Linde-Buzo-Gray (LBG) algorithm and a Kohonen self-organizing feature map (SOFM). The LBG algorithm consistently provided optimal codebook designs with correspondingly better classification rates. The resulting vector quantized (VQ) distortion-based classification indicates the auditory model provides slightly reduced recognition in clean studio-quality recordings (LPC 100%, Payton 90%), yet achieves similar performance to the LPC cepstral representation in both degraded environments (both 95%) and in test data recorded over multiple sessions (both over 98%). A variety of normalization techniques, preprocessing procedures, and classifier fusion methods were examined on this biologically motivated feature set. Keywords: speaker identification, auditory models, vector quantization, neural networks, user verification.

5 citations
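The LBG codebook training and VQ-distortion classification described above can be sketched roughly as follows. The features, codebook size, and two synthetic "speakers" are stand-ins, not the paper's setup:

```python
import numpy as np

def lbg_codebook(X, size, eps=0.01, iters=20):
    """Linde-Buzo-Gray: grow a codebook by splitting, refine with k-means."""
    codebook = X.mean(axis=0, keepdims=True)
    while len(codebook) < size:
        # split every codeword into a perturbed pair, then re-estimate
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        for _ in range(iters):
            d = ((X[:, None] - codebook[None]) ** 2).sum(-1)
            labels = d.argmin(axis=1)
            for k in range(len(codebook)):
                if (labels == k).any():
                    codebook[k] = X[labels == k].mean(axis=0)
    return codebook

def vq_distortion(X, codebook):
    """Average squared distance from each frame to its nearest codeword."""
    d = ((X[:, None] - codebook[None]) ** 2).sum(-1)
    return d.min(axis=1).mean()

rng = np.random.default_rng(1)
# Two fake "speakers" with well-separated feature distributions
spk_a = rng.normal(1.0, 1.0, size=(200, 4))
spk_b = rng.normal(4.0, 1.0, size=(200, 4))
books = {"A": lbg_codebook(spk_a, 8), "B": lbg_codebook(spk_b, 8)}

test = rng.normal(1.0, 1.0, size=(50, 4))  # unseen frames from speaker A
best = min(books, key=lambda s: vq_distortion(test, books[s]))
print(best)  # -> A
```

Identification picks the speaker whose codebook yields the lowest average quantization distortion on the test frames, regardless of whether the frames are LPC cepstra or auditory-model firing rates.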


Proceedings Article
01 Jan 1992
TL;DR: This work uses a very detailed biologically motivated input representation of the speech tokens, Lyon's cochlear model as implemented by Slaney [20], to produce results comparable to those obtained by others without the addition of time normalization.
Abstract: We report results on vowel and stop consonant recognition with tokens extracted from the TIMIT database. Our current system differs from others doing similar tasks in that we do not use any specific time normalization techniques. We use a very detailed biologically motivated input representation of the speech tokens: Lyon's cochlear model as implemented by Slaney [20]. This detailed, high-dimensional representation, known as a cochleagram, is classified by either a back-propagation or a hybrid supervised/unsupervised neural network classifier. The hybrid network is composed of a biologically motivated unsupervised network and a supervised back-propagation network. This approach produces results comparable to those obtained by others without the addition of time normalization.

3 citations


Proceedings ArticleDOI
23 Mar 1992
TL;DR: Experimental results indicate that, apart from a rather mild limitation of SM in handling a certain type of vocabulary, SM actually performs better than baseline continuous hidden Markov models (CHMM) in terms of recognition rate as far as isolated word recognition is concerned, and it takes only 60% of the time needed by CHMM in recognition.
Abstract: A static model (SM) in the form of a single vector is proposed to represent the temporal properties of a sequence of speech feature vectors. In contrast to a hidden Markov model, which captures the conditional probabilities of state transitions between consecutive observations x_t and x_{t+1} over time, an SM captures their average joint probabilities of belonging to a pair of phonetic classes ω_i and ω_j without any Markovian assumption. SM is tested with isolated words derived from the TIMIT database as well as artificially created words. The vocabulary is a subset of TIMIT consisting of 21 words derived from the two 'sa' sentences spoken by 420 speakers. The artificial vocabulary of 10 words is designed to study the limitations of SM. Experimental results indicate that, apart from a rather mild limitation of SM in handling a certain type of vocabulary, SM actually performs better than baseline continuous hidden Markov models (CHMM) in terms of recognition rate as far as isolated word recognition is concerned, and it takes only 60% of the time needed by CHMM in recognition.

1 citation
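One way to realize the SM idea, collapsing a whole utterance into a single fixed-size vector of average joint class-membership probabilities over consecutive frames, is sketched below. The per-frame class posteriors are hypothetical random data, not output of the paper's front end:

```python
import numpy as np

def static_model(posteriors):
    """Collapse a (T, C) sequence of per-frame class posteriors into one
    fixed-size C*C vector: the average joint probability that consecutive
    frames belong to class pair (i, j), with no Markovian assumption."""
    p_t, p_next = posteriors[:-1], posteriors[1:]
    # average over time of the outer products p(ω_i | x_t) p(ω_j | x_{t+1})
    joint = np.einsum('ti,tj->ij', p_t, p_next) / (len(posteriors) - 1)
    return joint.ravel()

rng = np.random.default_rng(2)
post = rng.random((30, 4))
post /= post.sum(axis=1, keepdims=True)   # normalize each frame over 4 classes
v = static_model(post)
print(v.shape, v.sum())  # 16-dim vector whose entries sum to ~1
```

Because the representation has fixed size regardless of utterance length, a word can be classified with a single vector comparison instead of a sequence alignment, which is consistent with the reported speed advantage over CHMM decoding.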