
Showing papers on "TIMIT published in 1994"


Journal ArticleDOI
TL;DR: Recognition results are presented for the DARPA TIMIT and Resource Management tasks, and it is concluded that recurrent nets are competitive with traditional means for performing phone probability estimation.
Abstract: This paper presents an application of recurrent networks for phone probability estimation in large vocabulary speech recognition. The need for efficient exploitation of context information is discussed; a role for which the recurrent net appears suitable. An overview of early developments of recurrent nets for phone recognition is given along with the more recent improvements that include their integration with Markov models. Recognition results are presented for the DARPA TIMIT and Resource Management tasks, and it is concluded that recurrent nets are competitive with traditional means for performing phone probability estimation.

497 citations
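In hybrid recurrent-net/HMM systems of this kind, the net's phone posteriors are commonly converted to scaled likelihoods, dividing each posterior by the phone's prior, before Markov-model decoding. A minimal numpy sketch of that conversion, using an invented four-phone inventory and made-up probabilities:

```python
import numpy as np

# Hypothetical 4-phone inventory; in a real system the posteriors
# would come from the recurrent net, one vector per acoustic frame.
phones = ["sil", "aa", "s", "t"]
priors = np.array([0.4, 0.2, 0.2, 0.2])      # phone priors from training data
posteriors = np.array([0.1, 0.6, 0.2, 0.1])  # net output for one frame

# Hybrid decoding uses scaled likelihoods: p(x|q) is proportional to p(q|x) / p(q)
scaled_likelihoods = posteriors / priors
best = phones[int(np.argmax(scaled_likelihoods))]
print(best)
```

In a full system the scaled likelihoods would replace the observation likelihoods inside Viterbi decoding rather than being argmax-ed frame by frame.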


Journal ArticleDOI
TL;DR: This report concerns speaker-dependent effects on certain phonetic characteristics often involved in reduction such as speech rate, stop releases, flapping, central vowels, laryngeal state, syllabic consonants, and palatalization processes.

257 citations


Proceedings ArticleDOI
19 Apr 1994
TL;DR: It is found that other factors beyond a mere decrease in bandwidth cause the observed degradation in recognition accuracy, and that the environmental compensation algorithms RASTA and CDCN fail to compensate completely for degradations introduced by the telephone network.
Abstract: We compare speech recognition accuracy for high-quality speech recorded under controlled conditions with speech as it appears over long-distance telephone lines. In addition to comparing recognition accuracy we use telephone-channel simulation to identify the sources of degradation of speech over telephone lines that have the greatest impact on speech recognition accuracy. We first compare the performance of the CMU SPHINX-I system on the TIMIT and NTIMIT databases. We found that other factors beyond a mere decrease in bandwidth cause the observed degradation in recognition accuracy, and that the environmental compensation algorithms RASTA and CDCN fail to compensate completely for degradations introduced by the telephone network. We identify the most problematic telephone-channel impairments using a commercial telephone channel simulator and the SPHINX-II system. Of the various effects considered, additive noise and linear filtering appear to have the greatest impact on recognition accuracy. Finally, we examined the performance of three cepstral compensation algorithms in the presence of the most damaging conditions. We found the compensation algorithms to be effective except for the worst 1% of the telephone channels.

74 citations
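The two impairments the paper singles out, linear filtering and additive noise, are straightforward to simulate. The sketch below uses a synthetic tone as a stand-in for speech, plus an arbitrary moving-average filter and target SNR; none of these choices come from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 16000
t = np.arange(fs) / fs
speech = np.sin(2 * np.pi * 1000 * t)        # stand-in for a speech signal

# Linear filtering: crude moving-average lowpass mimicking bandwidth loss
h = np.ones(8) / 8
filtered = np.convolve(speech, h, mode="same")

# Additive noise scaled to hit a chosen SNR (in dB)
snr_db = 20
noise = rng.standard_normal(filtered.shape)
scale = np.sqrt(np.mean(filtered**2) / (10**(snr_db / 10) * np.mean(noise**2)))
degraded = filtered + scale * noise

measured_snr = 10 * np.log10(np.mean(filtered**2) / np.mean((scale * noise)**2))
print(round(measured_snr, 1))
```

A real telephone-channel simulator would add bandpass characteristics, nonlinearities, and codec effects on top of these two basic degradations.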


Journal ArticleDOI
TL;DR: A set of studies of some phonetic characteristics of the American English represented in the TIMIT speech database is reported; the results are useful not only to linguistic phoneticians but also for speech recognition lexicons and text-to-speech systems.

48 citations


Proceedings ArticleDOI
19 Apr 1994
TL;DR: Three improvements to the hybrid connectionist-hidden Markov model speech recognition system are described: connectionist model merging; explicit presentation of acoustic context; and improved duration modelling, which provide a significant improvement in the TIMIT phone recognition rate.
Abstract: This paper describes phone modelling improvements to the hybrid connectionist-hidden Markov model speech recognition system developed at Cambridge University. These improvements are applied to phone recognition from the TIMIT task and word recognition from the Wall Street Journal (WSJ) task. A recurrent net is used to map acoustic vectors to posterior probabilities of phone classes. The maximum likelihood phone or word string is then extracted using Markov models. The paper describes three improvements: connectionist model merging; explicit presentation of acoustic context; and improved duration modelling. The first is shown to provide a significant improvement in the TIMIT phone recognition rate and all three provide an improvement in the WSJ word recognition rate.

38 citations


Proceedings ArticleDOI
19 Apr 1994
TL;DR: The phoneme class directed enhancement algorithm is evaluated using TIMIT speech data, and shown to result in substantial improvement in objective speech quality over a range of signal-to-noise ratios and individual phoneme classes.
Abstract: It is known that degrading acoustic noise influences speech quality across phoneme classes in a non-uniform manner. This results in variable quality performance for many speech enhancement algorithms in noisy environments. To address this, a hidden-Markov-model phoneme classification procedure is proposed which directs single channel speech enhancement across individual phoneme classes. The procedure performs broad phoneme class partitioning of noisy speech frames using a continuous-mixture hidden-Markov-model recognizer in conjunction with a cost based decision process. Cost functions are assigned which weigh errors between phoneme classes that are perceptually different (e.g., vowels versus fricatives, etc.). Once noisy speech frames are partitioned, iterative speech enhancement based on all-pole parameter estimation with inter and intra-frame spectral constraints (Auto:I,LSP:T) is employed. The phoneme class directed enhancement algorithm is evaluated using TIMIT speech data, and shown to result in substantial improvement in objective speech quality over a range of signal-to-noise ratios and individual phoneme classes. The algorithm is also shown to possess consistent quality improvement in a speaker independent scenario.

15 citations
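The cost-based decision process described above amounts to a Bayes decision with a non-uniform cost matrix over broad phoneme classes. A small numpy illustration with invented class posteriors and costs (perceptually similar class pairs are given lower confusion cost):

```python
import numpy as np

classes = ["vowel", "fricative", "nasal"]
posterior = np.array([0.5, 0.3, 0.2])    # from a hypothetical HMM classifier

# cost[i, j] = cost of deciding class j when the true class is i;
# perceptually closer pairs (e.g. vowel/nasal) are penalised less.
cost = np.array([[0.0, 1.0, 0.5],
                 [1.0, 0.0, 0.8],
                 [0.5, 0.8, 0.0]])

expected_cost = cost.T @ posterior       # expected cost of each decision
decision = classes[int(np.argmin(expected_cost))]
print(decision)
```

With uniform costs this reduces to picking the maximum-posterior class; the asymmetric matrix is what lets perceptual similarity influence the partitioning.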


Proceedings ArticleDOI
06 Sep 1994
TL;DR: A number of different approaches to connectionist model merging are presented and compared and evaluated on the TIMIT phone recognition and ARPA Wall Street Journal word recognition tasks.
Abstract: Reports in the statistics and neural networks literature have expounded the benefits of merging multiple models to improve classification and prediction performance. The Cambridge University connectionist speech group has developed a hybrid connectionist-hidden Markov model system for large vocabulary talker independent speech recognition. The performance of this system has been greatly enhanced through the merging of connectionist acoustic models. This paper presents and compares a number of different approaches to connectionist model merging and evaluates them on the TIMIT phone recognition and ARPA Wall Street Journal word recognition tasks.

14 citations
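Two common merging schemes for connectionist acoustic models are averaging the posteriors in the linear domain and in the log domain (a geometric mean). A toy numpy comparison with made-up frame posteriors; the paper's exact merging rules may differ:

```python
import numpy as np

# Frame posteriors from two hypothetical acoustic nets over 3 phone classes
p1 = np.array([0.7, 0.2, 0.1])
p2 = np.array([0.5, 0.4, 0.1])

linear_merge = (p1 + p2) / 2
log_merge = np.exp((np.log(p1) + np.log(p2)) / 2)  # geometric mean
log_merge /= log_merge.sum()                       # renormalise

print(np.argmax(linear_merge), np.argmax(log_merge))
```

Log-domain merging tends to be more conservative: a class must be supported by every model to survive, since one near-zero posterior drags the geometric mean down.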


Proceedings ArticleDOI
19 Apr 1994
TL;DR: Vowels, diphthongs and nasals are shown to be the phoneme classes most discriminative of speakers, and a method to predict the most speaker-related phonemes is developed.
Abstract: We investigate speaker-related information in speech. Our investigations are based on the AR-vector model used in the Orphee system: the voice-activated door system we designed. Free-text speaker identification accuracy is an indicator of the ability of any parameter to discriminate speakers. Therefore, through speaker identification performance, we test the influence of speech duration, the effect of the parameters of the AR-vector model, and the importance of phonetic segments. We developed a method to predict the most speaker-related phonemes. The results are the following: vowels, diphthongs and nasals are shown to be the most discriminative of speakers. For a one-second test signal composed of vowels and diphthongs, a speaker identification accuracy of 95.6% was obtained for the 630 TIMIT speakers.

13 citations


Proceedings ArticleDOI
19 Apr 1994
TL;DR: It is found that the simple MLP input transformation, with five frames of context information, can increase the recognition rate significantly compared to just using delta parameters, and it is observed that optimal sentence decoding is essential to achieve maximum recognition rate for models trained by global MMI.
Abstract: This paper deals with speaker-independent continuous speech recognition. Our approach is based on continuous density hidden Markov models with a non-linear input feature transformation performed by a multilayer perceptron. We discuss various optimisation criteria and provide results on a TIMIT phoneme recognition task, using single frame (mutual information or relative entropy) MMI embedded in Viterbi training, and a global MMI criterion. As expected, global MMI is found superior to the frame-based criterion for continuous recognition. We further observe that optimal sentence decoding is essential to achieve maximum recognition rate for models trained by global MMI. Finally, we find that the simple MLP input transformation, with five frames of context information, can increase the recognition rate significantly compared to just using delta parameters.

12 citations
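Presenting the MLP with five frames of context just means stacking each frame with its two neighbours on either side. A small numpy sketch of such context stacking (edge frames are padded by repetition, one of several reasonable choices):

```python
import numpy as np

def stack_context(frames: np.ndarray, context: int = 2) -> np.ndarray:
    """Stack +/-context neighbouring frames onto each frame (5 frames in
    total for context=2), padding the edges by repetition."""
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    n = len(frames)
    return np.hstack([padded[i:i + n] for i in range(2 * context + 1)])

feats = np.arange(12, dtype=float).reshape(6, 2)   # 6 frames, 2 coefficients
stacked = stack_context(feats)
print(stacked.shape)
```

Each stacked vector then feeds the MLP, which learns a transformation of the whole context window rather than relying on fixed delta parameters.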


Proceedings ArticleDOI
13 Apr 1994
TL;DR: This paper compares a mixture-Gaussian vector quantisation method, ergodic continuous hidden Markov models (CHMMs) and phone-level left-to-right CHMMs for text-independent speaker recognition; the three methods represent a progression of phonetic specificity prior to the generation of probabilities against which speakers are compared.
Abstract: This paper compares a mixture-Gaussian vector quantisation (VQ) method, ergodic continuous hidden Markov models (CHMMs) and phone-level left-to-right CHMMs for text-independent speaker recognition. These three methods represent a progression of phonetic specificity prior to the generation of probabilities against which speakers are compared. The mixture-Gaussian VQ uses a single distribution for all phones, the ergodic CHMM uses several distributions which have been shown in a previous text-independent speaker recognition study to represent broad phonetic classes, and the phone-based left-to-right CHMM uses many distributions representing the specific phones in the test utterance. Our experiments with speaker recognition on 40 TIMIT speakers show that the recognition rates of the mixture-Gaussian VQ, ergodic CHMMs and phone-based left-to-right CHMMs are 87.5%, 87.5% and 100% respectively.

7 citations
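The VQ baseline can be caricatured as nearest-codeword scoring: the claimed speaker is the one whose codebook yields the lowest average distortion on the test frames. A toy sketch with invented two-dimensional codebooks:

```python
import numpy as np

def vq_distortion(frames: np.ndarray, codebook: np.ndarray) -> float:
    """Average distance from each frame to its nearest codeword."""
    d = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=-1)
    return float(d.min(axis=1).mean())

# Hypothetical codebooks for two speakers and a short test utterance
cb_a = np.array([[0.0, 0.0], [1.0, 1.0]])
cb_b = np.array([[5.0, 5.0], [6.0, 6.0]])
test = np.array([[0.1, 0.1], [0.9, 1.1]])

scores = {name: vq_distortion(test, cb) for name, cb in [("A", cb_a), ("B", cb_b)]}
print(min(scores, key=scores.get))
```

The CHMM variants in the paper replace this single pooled codebook with state-tied Gaussian mixtures of increasing phonetic specificity.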


Proceedings ArticleDOI
15 Mar 1994
TL;DR: In this article, a text independent, phoneme based speaker identification system is described which uses adaptive wavelets to model the phonemes; classification is achieved using a two layer feed forward neural network.
Abstract: In this paper, we describe a text independent, phoneme based speaker identification system which uses adaptive wavelets to model the phonemes. This system identifies a speaker by modeling a very short segment of phonemes and then by clustering all the phonemes belonging to the same speaker into one class. The classification is achieved by using a two layer feed forward neural network classifier. The performance of this speaker identification system is demonstrated by considering the phonemes that were extracted from various sentences spoken by three speakers in the TIMIT acoustic-phonetic speech corpus. 1. INTRODUCTION Speaker identification systems are mainly used (a) for verifying a person's identity prior to admitting him/her into a secured place or to a telephone transaction and (b) for associating a person with a voice in police work [1]. A linguistic unit is called a phoneme (speech sound). The acoustic characteristics of each phoneme vary based on the manner of articulation (source of excitation) and the place of articulation (shape of the vocal tract). Based on the source of excitation, speech sounds can be broadly classified into

Proceedings ArticleDOI
19 Apr 1994
TL;DR: A novel approach for classifying continuous speech into visible mouth-shape related classes (called visemes) is described, which is a quite promising result considering that the test is applied on continuous multi-speakers and large vocabulary speech.
Abstract: The paper describes a novel approach for classifying continuous speech into visible mouth-shape related classes (called visemes). The selection and comparison of various acoustic speech features and the use of context information in the classification are addressed. Continuous speech is classified into 9 visible mouth-shape related classes on an acoustic frame basis. Some mouth-shape related acoustic speech signal features are selected as the input to a classifier constructed with a recurrent neural network (RNN). 304 training sentences and 88 testing sentences are chosen from the DARPA TIMIT continuous speech database. The average viseme recognition rate for the test set reaches 84.7% on the frame level, which is a quite promising result considering that the test is applied to continuous, multi-speaker, large-vocabulary speech.

Journal ArticleDOI
TL;DR: This talk presents phonetic models that capture both the dynamic characteristics and the statistical dependencies of acoustic attributes in a segment‐based framework that compares favorably with other studies using the timit corpus.
Abstract: This talk presents phonetic models that capture both the dynamic characteristics and the statistical dependencies of acoustic attributes in a segment‐based framework. The approach is based on the creation of a track, Tα, for each phonetic unit α. The track serves as a model of the dynamic trajectories of the acoustic attributes over the segment. The statistical framework for scoring incorporates the auto‐ and cross‐correlation properties of the track error over time, within a segment. On a vowel classification task [W. Goldenthal and J. Glass, ‘‘Modeling Spectra Dynamics for Vowel Classification,’’ Proc. Eurospeech 93, pp. 289–292, Berlin, Germany (1993)], this methodology achieved classification performance of 68.9%. This result compares favorably with other studies using the TIMIT corpus. This talk extends this result by presenting context‐independent and context‐dependent experiments for all the phones. Context‐independent classification performance of 76.8% is demonstrated. The key to implementing the...

Proceedings ArticleDOI
25 Oct 1994
TL;DR: In this paper, a Gaussian mixture speaker model was used for speaker identification and experiments were conducted on the TIMIT and NTIMIT databases, achieving accuracies of 99.5% and 60.7% for clean, wideband speech and telephone speech, respectively.
Abstract: The two largest factors affecting automatic speaker identification performance are the size of the population to be distinguished among and the degradations introduced by noisy communication channels (e.g., telephone transmission). To experimentally examine these two factors, this paper presents text-independent speaker identification results for varying speaker population sizes up to 630 speakers for both clean, wideband speech and telephone speech. A system based on Gaussian mixture speaker models is used for speaker identification and experiments are conducted on the TIMIT and NTIMIT databases. The aims of this study are to (1) establish how well text-independent speaker identification can perform under near ideal conditions for very large populations (using the TIMIT database), (2) gauge the performance loss incurred by transmitting the speech over the telephone network (using the NTIMIT database), and (3) examine the validity of current models of telephone degradations commonly used in developing compensation techniques (using the NTIMIT calibration signals). This is believed to be the first speaker identification experiments on the complete 630 speaker TIMIT and NTIMIT databases and the largest text-independent speaker identification task reported to date. Identification accuracies of 99.5% and 60.7% are achieved on the TIMIT and NTIMIT databases, respectively.
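Gaussian-mixture speaker identification scores each enrolled model by summing per-frame log-likelihoods and picks the highest-scoring speaker. A minimal diagonal-covariance sketch with invented single-component models (real systems use many components per speaker):

```python
import numpy as np

def gmm_loglik(x, weights, means, variances):
    """Total log-likelihood of frames x (n, d) under a diagonal-covariance
    GMM with m components: weights (m,), means/variances (m, d)."""
    diff = x[:, None, :] - means[None, :, :]                       # (n, m, d)
    comp = -0.5 * (np.log(2 * np.pi * variances) + diff**2 / variances).sum(-1)
    comp += np.log(weights)                                        # (n, m)
    mx = comp.max(axis=1, keepdims=True)                           # log-sum-exp
    return float((mx[:, 0] + np.log(np.exp(comp - mx).sum(axis=1))).sum())

# Two invented speaker models; the test frames sit near speaker A's mean
w = np.array([1.0])
mean_a, mean_b = np.array([[0.0, 0.0]]), np.array([[3.0, 3.0]])
var = np.ones((1, 2))
frames = np.array([[0.1, -0.2], [0.2, 0.1]])

score_a = gmm_loglik(frames, w, mean_a, var)
score_b = gmm_loglik(frames, w, mean_b, var)
print("A" if score_a > score_b else "B")
```

Identification over N speakers is just this comparison repeated over N models, which is why population size dominates both accuracy and compute cost.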

Proceedings ArticleDOI
09 Oct 1994
TL;DR: The signal at different scales is modeled by a hierarchical autoregressive moving average (ARMA) model, and the features at coarse scales are extracted from the model without performing expensive filtering operation.
Abstract: In this paper, we consider the classification of speech signals by using stochastic models at different scales. The signal at different scales is modeled by a hierarchical autoregressive moving average (ARMA) model, and the features at coarse scales are extracted from the model without performing expensive filtering operation. The hierarchical modeling can increase the accuracy of speech classification by exploiting features at different scales. For speech classification, model parameters at five different scales obtained by hierarchical modeling are used as features. A minimum distance classifier is implemented, and tested on TIMIT speech data.
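A minimum distance classifier of the kind used here simply assigns the class whose mean feature vector lies nearest the observed feature. A toy sketch with invented two-dimensional class means standing in for the multiscale ARMA features:

```python
import numpy as np

def min_distance_classify(feature: np.ndarray, class_means: dict) -> str:
    """Assign the class whose mean feature vector is nearest (Euclidean)."""
    dists = {c: float(np.linalg.norm(feature - m)) for c, m in class_means.items()}
    return min(dists, key=dists.get)

# Hypothetical class means; real features would be ARMA parameters at 5 scales
means = {"vowel": np.array([1.0, 0.0]), "fricative": np.array([0.0, 1.0])}
print(min_distance_classify(np.array([0.9, 0.2]), means))
```

The hierarchical model's contribution is in the feature vector itself: coarse-scale parameters come from the model directly, avoiding explicit multirate filtering.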

Proceedings ArticleDOI
13 Apr 1994
TL;DR: Preliminary experimental results on the task of sex adaptation for speaker-independent stop consonant discrimination, evaluated on the DARPA TIMIT speech database, demonstrate the effectiveness of the proposed method of speaker normalization by means of input space optimization for continuous density hidden Markov models.
Abstract: This paper proposes a novel method of speaker normalization by means of input space optimization for continuous density hidden Markov models (CDHMM). The parameters of a linear feature transformation function are so determined that, together with the previously trained CDHMM parameters, a mis-classification cost function is minimized for the normalizing data set. Preliminary experimental results on the task of sex adaptation for speaker-independent stop consonant discrimination, evaluated on the DARPA TIMIT speech database, demonstrate the effectiveness of the proposed method.
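The normalisation step can be pictured as an affine map applied to each feature vector before CDHMM scoring. The matrix and offset below are invented for illustration; in the paper they would be optimised to minimise a misclassification cost on the normalising data while the CDHMM parameters stay fixed:

```python
import numpy as np

# Invented affine transformation x' = A @ x + b applied to input features;
# A and b play the role of the trainable normalisation parameters.
A = np.array([[1.1, 0.0],
              [0.0, 0.9]])
b = np.array([-0.2, 0.1])

def normalise(x: np.ndarray) -> np.ndarray:
    """Map a feature vector into the normalised input space."""
    return A @ x + b

print(normalise(np.array([1.0, 1.0])))
```

Because only A and b are adapted, the method needs far less data than retraining the CDHMMs themselves, which is what makes it attractive for sex or speaker adaptation.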

Proceedings ArticleDOI
25 Oct 1994
TL;DR: A novel signal modeling technique is described to compute smoothed time-frequency features for encoding speech information that compactly and accurately model phonetic information, while accounting for the main effects of contextual variations.
Abstract: A novel signal modeling technique is described to compute smoothed time-frequency features for encoding speech information. These time-frequency features compactly and accurately model phonetic information, while accounting for the main effects of contextual variations. These segment-level features are computed such that more emphasis is given to the center of the segment and less to the end regions. For phonetic classification, the features are relatively insensitive to both time and frequency resolution, at least insofar as changes in window length and frame spacing are concerned. A 60-dimensional feature space based on this modeling technique resulted in 70.9% accuracy for classification of 16 vowels extracted from the TIMIT data base in speaker-independent experiments. These results are higher than any other results reported in the literature for the same task.
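Centre-weighted segment features can be sketched as a weighted average of a segment's frames with weights peaking at the middle. The Hann-shaped weighting below is an assumption for illustration; the paper's exact emphasis scheme is not specified here:

```python
import numpy as np

def segment_feature(frames: np.ndarray) -> np.ndarray:
    """Weighted average of a segment's frames, emphasising the centre.
    Hann-shaped weights are an illustrative choice, trimmed so the
    edge frames keep a small nonzero weight."""
    n = len(frames)
    w = np.hanning(n + 2)[1:-1]
    w = w / w.sum()
    return w @ frames

# A 5-frame segment with a single feature dimension
seg = np.array([[0.0], [1.0], [2.0], [1.0], [0.0]])
print(round(float(segment_feature(seg)[0]), 4))
```

The centre frames dominate the average, so transitional regions shared with neighbouring phones contribute less to the segment-level feature.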

Proceedings ArticleDOI
06 Sep 1994
TL;DR: A modular architecture where the interactions among different modules are controlled by proper autoassociators, thus reducing significantly the problems due to interaction of different modules.
Abstract: Proposes a modular architecture where the interactions among different modules are controlled by proper autoassociators. The outputs of these modules are computed by sigma-pi neurons whose inputs come from both a feedforward network performing classification and an autoassociator. The outputs of the autoassociators are used for performing pattern rejection, thus significantly reducing the problems due to interaction of different modules. The proposed architecture is validated by experiments of speaker independent phoneme recognition on continuous speech with the TIMIT data base, with very promising results.

Proceedings ArticleDOI
17 Mar 1994
TL;DR: In this article, the authors describe a method for the enhancement of speech of a particular speaker in a noisy multispeaker environment using minimum variance deconvolution (MVD) algorithm.
Abstract: Describes a novel method for the enhancement of speech of a particular speaker in a noisy multispeaker environment. Many potential applications of the method are possible, including implementation in a new generation of hearing aids. The system is based on the minimum variance deconvolution (MVD) algorithm. The method was tested using the TIMIT speech database. The utterances of two speakers were first combined to create a multispeaker environment, and then separated using the MVD algorithm. The intelligibility of the separated and enhanced speech was high. Likewise, the frequency spectra of the original speech were very similar to the spectra of the separated and enhanced speech for each of the two speakers.

Book ChapterDOI
01 Jan 1994
TL;DR: A telephone speech database suitable for talker identification research (Godfrey, 1992) was not generally available at the time of this research, though clean speech databases such as TIMIT (Garofolo et al., 1988) have been available.
Abstract: It is difficult to implement talker recognition on the telephone network because of normal variation in the channel characteristics. The primary component of variation is due to the different telephone handsets or microphone frequency characteristics (Rosenberg and Soong, 1992). Lack of availability of telephone speech databases has also contributed to slow progress in the solution of these problems, though clean speech databases such as TIMIT (Garofolo et al., 1988) have been available. A telephone speech database suitable for talker identification research (Godfrey, 1992) was not generally available at the time of this research.

01 Jan 1994
TL;DR: A modular architecture where the interactions among different modules are controlled by proper autoassociators; the outputs of these modules are used for performing pattern rejection, thus significantly reducing the problems due to interaction of different modules.
Abstract: In this paper, we propose a modular architecture where the interactions among different modules are controlled by proper autoassociators. The outputs of these modules are computed by sigma-pi neurons whose inputs come from both a feedforward network performing classification and an autoassociator. The outputs of the autoassociators are used for performing pattern rejection, thus significantly reducing the problems due to interaction of different modules. The proposed architecture is validated by experiments of speaker independent phoneme recognition on continuous speech with TIMIT