Showing papers on "TIMIT" published in 1993



Dataset
01 Jan 1993
TL;DR: The TIMIT corpus as mentioned in this paper contains broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences, including time-aligned orthographic, phonetic and word transcriptions as well as a 16-bit, 16 kHz speech waveform file for each utterance.
Abstract: The TIMIT corpus of read speech is designed to provide speech data for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. TIMIT contains broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences. The TIMIT corpus includes time-aligned orthographic, phonetic and word transcriptions as well as a 16-bit, 16 kHz speech waveform file for each utterance. Corpus design was a joint effort among the Massachusetts Institute of Technology (MIT), SRI International (SRI) and Texas Instruments, Inc. (TI). The speech was recorded at TI, transcribed at MIT and verified and prepared for CD-ROM production by the National Institute of Standards and Technology (NIST). The TIMIT corpus transcriptions have been hand verified. Test and training subsets, balanced for phonetic and dialectal coverage, are specified. Tabular computer-searchable information is included as well as written documentation.

2,096 citations
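As a concrete illustration of the corpus layout described above, the sketch below loads one utterance's waveform together with its time-aligned phonetic transcription. It assumes the 16-bit, 16 kHz waveform has already been converted from NIST SPHERE to standard RIFF WAV (e.g., with sph2pipe); the file names are placeholders.

```python
import wave
import numpy as np

def load_timit_utterance(wav_path, phn_path):
    """Load one TIMIT waveform plus its time-aligned phone transcription."""
    with wave.open(wav_path, "rb") as w:
        assert w.getframerate() == 16000 and w.getsampwidth() == 2
        samples = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

    # Each .phn line is "start_sample end_sample phone_label".
    phones = []
    with open(phn_path) as f:
        for line in f:
            start, end, label = line.split()
            phones.append((int(start), int(end), label))
    return samples, phones

# Placeholder paths; real TIMIT files are organized by dialect and speaker.
samples, phones = load_timit_utterance("SA1.WAV", "SA1.PHN")
print(f"{len(samples) / 16000:.2f} s, {len(phones)} phone segments")
```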



Proceedings ArticleDOI
27 Apr 1993
TL;DR: A segmental speech model is used to develop a secondary processing algorithm that rescores putative events hypothesized by a primary HMM word spotter to try to improve performance by discriminating true keywords from false alarms.
Abstract: The authors present a segmental speech model that explicitly models the dynamics in a variable-duration speech segment by using a time-varying trajectory model of the speech features in the segment. Each speech segment is represented by a set of statistics which includes a time-varying trajectory, a residual error covariance around the trajectory, and the number of frames in the segment. These statistics replace the frames in the segment and become the data that are modeled by either HMMs (hidden Markov models) or mixture models. This segment model is used to develop a secondary processing algorithm that rescores putative events hypothesized by a primary HMM word spotter to try to improve performance by discriminating true keywords from false alarms. This algorithm is evaluated on a keyword spotting task using the Road Rally Database, and performance is shown to improve significantly over that of the primary word spotter. The segmental model is also used on a TIMIT vowel classification task to evaluate its modeling capability.

125 citations
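A minimal sketch of the segment statistics described above: a variable-length run of feature frames is reduced to a per-dimension trajectory fit, the residual covariance around that trajectory, and the frame count. The linear (order-1) trajectory and the random frames are illustrative assumptions; the paper's exact parameterization may differ.

```python
import numpy as np

def segment_statistics(frames, order=1):
    """Reduce a variable-length segment of feature frames (T x D) to fixed
    statistics: a polynomial trajectory per dimension, the residual
    covariance around the trajectory, and the number of frames."""
    T, D = frames.shape
    t = np.linspace(0.0, 1.0, T)  # normalized time axis
    # Fit one trajectory per feature dimension (order 1 = linear).
    coeffs = np.stack([np.polyfit(t, frames[:, d], order) for d in range(D)])
    fitted = np.stack([np.polyval(coeffs[d], t) for d in range(D)], axis=1)
    residual = frames - fitted
    cov = np.cov(residual, rowvar=False)  # residual covariance (D x D)
    return coeffs, cov, T

frames = np.random.randn(37, 12)  # e.g., 37 frames of 12 cepstral features
coeffs, cov, duration = segment_statistics(frames)
```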


Proceedings Article
01 Jan 1993
TL;DR: High phone accuracies are reported on three corpora (WSJ0, BREF and TIMIT), and it is shown that it is worthwhile to perform phone recognition experiments rather than focusing attention only on word recognition results.
Abstract: In this paper we report high phone accuracies on three corpora: WSJ0, BREF and TIMIT. The main characteristics of the phone recognizer are: a high-dimensional feature vector (48), context- and gender-dependent phone models with duration distributions, continuous density HMMs with Gaussian mixtures, and n-gram probabilities for the phonotactic constraints. These models are trained on speech data that have either phonetic or orthographic transcriptions using maximum likelihood and maximum a posteriori estimation techniques. On the WSJ0 corpus with a 46-phone set we obtain phone accuracies of 72.4% and 74.4% using 500 and 1600 CD phone units, respectively. Accuracy on BREF with 35 phones is as high as 78.7% with only 428 CD phone units. On TIMIT, using the 61 phone symbols and only 500 CD phone units, we obtain a phone accuracy of 67.2%, which corresponds to 73.4% when the recognizer output is mapped to the commonly used 39-phone set. Making reference to our work on large vocabulary CSR, we show that it is worthwhile to perform phone recognition experiments as opposed to only focusing attention on word recognition results.

105 citations
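The mapping from TIMIT's 61 phone symbols to the commonly used 39-phone set mentioned above is a table-driven folding. The sketch below shows a few representative entries (after the folding commonly attributed to Lee and Hon, 1989); it is illustrative, not the complete table.

```python
# Representative entries of the common folding of TIMIT's 61 phone symbols
# down to 39 classes. The entries below are illustrative, not the complete
# table; the glottal stop 'q' is usually deleted outright.
FOLD_61_TO_39 = {
    "ao": "aa", "ax": "ah", "ax-h": "ah", "axr": "er",
    "hv": "hh", "ix": "ih", "el": "l", "em": "m",
    "en": "n", "eng": "ng", "nx": "n", "ux": "uw", "zh": "sh",
    # Closures, pauses, and silence collapse to one class.
    "pcl": "sil", "tcl": "sil", "kcl": "sil", "bcl": "sil",
    "dcl": "sil", "gcl": "sil", "pau": "sil", "epi": "sil", "h#": "sil",
}

def fold(phones):
    """Map a 61-symbol phone sequence onto the 39-class set."""
    return [FOLD_61_TO_39.get(p, p) for p in phones if p != "q"]

print(fold(["h#", "sh", "ix", "hv", "eh", "dcl", "q", "ae", "h#"]))
```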


Proceedings ArticleDOI
27 Apr 1993
TL;DR: Research on speaker-independent continuous phone recognition for both French and English is presented, and it is found that French is easier to recognize at the phone level (the phone error for BREF is 23.6% vs. 30.1% for WSJ) but harder to recognize at the lexical level due to the larger number of homophones.
Abstract: Research on speaker-independent continuous phone recognition for both French and English is presented. The phone accuracy is assessed on the BREF corpus for French, and on the Wall Street Journal (WSJ) and TIMIT corpora for English. Cross-language differences concerning language properties are presented. It is found that French is easier to recognize at the phone level (the phone error for BREF is 23.6% vs. 30.1% for WSJ), but harder to recognize at the lexical level due to the larger number of homophones. Experiments with signal analysis indicate that a 4 kHz signal bandwidth is sufficient for French, whereas 8 kHz is needed for English. Phone recognition is a powerful technique for language, sex, and speaker identification. With 2 s of speech, the language can be identified with better than 99% accuracy. Sex-identification for BREF and WSJ is error-free. Speaker identification accuracies of 98.2% on TIMIT (462 speakers) and 99.1% on BREF (57 speakers) were obtained with one utterance per speaker. 100% accuracies were obtained with two utterances per speaker.

58 citations


Proceedings Article
01 Jan 1993
TL;DR: A unified approach to identifying non-linguistic speech features from the recorded signal using phone-based acoustic likelihoods, which has been shown to be effective for text-independent, vocabulary-independent sex, speaker, and language identification and is promising for a variety of applications.
Abstract: SUMMARY In this paper we have presented a unified approach for the identification of non-linguistic speech features from recorded signals using phone-based acoustic likelihoods. The inclusion of this technique in speech-based systems can broaden the scope of applications of speech technologies and lead to more user-friendly systems. The approach is based on training a set of large phone-based ergodic HMMs for each non-linguistic feature to be identified (language, gender, speaker, ...), and identifying the feature as that associated with the model having the highest acoustic likelihood of the set. The decoding procedure is efficiently implemented by processing all the models in parallel using a time-synchronous beam search strategy. This has been shown to be a powerful technique for sex, language, and speaker identification, and has other possible applications such as dialect identification (including foreign accents) or identification of speech disfluencies. Sex identification for BREF and WSJ was error-free, and 99% accurate for TIMIT with 2 s of speech. Speaker identification accuracies of 98.8% on TIMIT (168 speakers) and 99.1% on BREF (65 speakers) were obtained with one utterance per speaker, and 100% if 2 utterances were used for identification. This identification accuracy was obtained on the 168 test speakers of TIMIT without making use of the phonetic transcriptions during training, verifying that it is not necessary to have labeled adaptation data. Speaker-independent models can be used to provide the labels used in building the speaker-specific models. Being independent of the spoken text, and requiring only a small amount of identification speech (on the order of 2.5 s), this technique is promising for a variety of applications, particularly those for which continual, transparent verification is preferable. Tests of two-way language identification of read, laboratory speech show that with 2 s of speech the language is correctly identified as English or French with over 99% accuracy. Simply porting the approach to the conditions of telephone speech, accuracy on French and English data in the OGI multi-language telephone speech corpus was about 76% with 2 s of speech, and increased to 82% with 10 s. The overall 10-language identification accuracy on the designated development test data of the OGI corpus is 59.7%. These results were obtained without the use of phone transcriptions for training, which were used for the experiments with laboratory speech. In conclusion, we propose a unified approach to identifying non-linguistic speech features from the recorded signal using phone-based acoustic likelihoods. This technique has been shown to be effective for text-independent, vocabulary-independent sex, speaker, and language identification. While phone labels have been used to train the speaker-independent seed models, these models can then be used to label unknown speech, thus avoiding the costly process of transcribing the speech data. The ability to accurately identify non-linguistic speech features can lead to more performant spoken language systems enabling better and more friendly human-machine interaction.

39 citations
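The decision rule described above reduces to an argmax over per-feature acoustic likelihoods. In the sketch below, a diagonal Gaussian stands in for the paper's phone-based ergodic HMMs; the decision rule is the same argmax, but model names, dimensions, and data are placeholders.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class GaussianModel:
    mean: np.ndarray
    var: np.ndarray

    def log_likelihood(self, frames):
        # Sum of per-frame diagonal-Gaussian log densities.
        z = (frames - self.mean) ** 2 / self.var
        return -0.5 * np.sum(z + np.log(2 * np.pi * self.var))

def identify(frames, models):
    """Pick the candidate whose model scores the utterance highest."""
    scores = {name: m.log_likelihood(frames) for name, m in models.items()}
    return max(scores, key=scores.get), scores

rng = np.random.default_rng(0)
models = {
    "French": GaussianModel(np.zeros(12), np.ones(12)),
    "English": GaussianModel(np.full(12, 0.5), np.ones(12)),
}
utterance = rng.normal(0.5, 1.0, size=(200, 12))  # ~2 s of 10 ms frames
print(identify(utterance, models)[0])             # expected: English
```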


Journal ArticleDOI
TL;DR: This paper presents an artificial neural network (ANN) for speaker-independent isolated word speech recognition consisting of three subnets in concatenation; the architectures of the subnets, the last of which is a multilayer perceptron (MLP), are described, and the associated adaptive learning algorithms are derived.
Abstract: This paper presents an artificial neural network (ANN) for speaker-independent isolated word speech recognition. The network consists of three subnets in concatenation. The static information within one frame of speech signal is processed in the probabilistic mapping subnet, which converts an input vector of acoustic features into a probability vector whose components are estimated probabilities of the feature vector belonging to the phonetic classes that constitute the words in the vocabulary. The dynamics capturing subnet computes the first-order cross-correlation between the components of the probability vectors to serve as the discriminative feature derived from the interframe temporal information of the speech signal. These dynamic features are passed for decision-making to the classification subnet, which is a multilayer perceptron (MLP). The architectures of these three subnets are described, and the associated adaptive learning algorithms are derived. The recognition results for a subset of the DARPA TIMIT speech database are reported. The correct recognition rate of the proposed ANN system is 95.5%, whereas that of the best of the continuous hidden Markov model (HMM)-based systems is only 91.0%.

29 citations
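A minimal sketch of the dynamics-capturing step described above: the first-order cross-correlation between components of successive per-frame probability vectors is collected into a fixed-size feature vector for the MLP classifier. The normalization and the Dirichlet stand-in posteriors are assumptions, not the paper's exact formulation.

```python
import numpy as np

def dynamic_features(prob_frames):
    """First-order cross-correlation between components of successive
    per-frame probability vectors.
    prob_frames: (T, K) rows of class-posterior estimates.
    Returns a K*K vector with c[i, j] ~ mean_t p_i(t) * p_j(t+1)."""
    p, q = prob_frames[:-1], prob_frames[1:]
    corr = p.T @ q / (len(prob_frames) - 1)
    return corr.ravel()

T, K = 80, 10
probs = np.random.dirichlet(np.ones(K), size=T)  # stand-in posteriors
features = dynamic_features(probs)               # 100-dim input to the MLP
```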


Proceedings ArticleDOI
21 Mar 1993
TL;DR: A unified approach to identifying non-linguistic speech features from the recorded signal using phone-based acoustic likelihoods is presented; it is shown to be effective for text-independent language, sex, and speaker identification and can enable better and more friendly human-machine interaction.
Abstract: Over the last decade technological advances have been made which enable us to envision real-world applications of speech technologies. It is possible to foresee applications where the spoken query is to be recognized without even prior knowledge of the language being spoken, for example, information centers in public places such as train stations and airports. Other applications may require accurate identification of the speaker for security reasons, including control of access to confidential information or for telephone-based transactions. Ideally, the speaker's identity can be verified continually during the transaction, in a manner completely transparent to the user. With these views in mind, this paper presents a unified approach to identifying non-linguistic speech features from the recorded signal using phone-based acoustic likelihoods. This technique is shown to be effective for text-independent language, sex, and speaker identification and can enable better and more friendly human-machine interaction. With 2 s of speech, the language can be identified with better than 99% accuracy. Error in sex identification is about 1% on a per-sentence basis, and speaker identification accuracies of 98.5% on TIMIT (168 speakers) and 99.2% on BREF (65 speakers) were obtained with one utterance per speaker, and 100% with 2 utterances for both corpora. An experiment using unsupervised adaptation for speaker identification on the 168 TIMIT speakers yielded the same identification accuracies as those obtained with supervised adaptation.

27 citations


Proceedings ArticleDOI
Yunxin Zhao1
27 Apr 1993
TL;DR: A speaker adaptation technique based on the separation of speech spectra variation sources is developed for improving speaker-independent continuous speech recognition and experiments using short calibration speech have shown substantial performance improvement over the baseline recognition system.
Abstract: A speaker adaptation technique based on the separation of speech spectra variation sources is developed for improving speaker-independent continuous speech recognition. The variation sources include speaker acoustic characteristics, phonologic characteristics, and contextual dependency of allophones. Statistical methods are formulated to normalize speech spectra based on speaker acoustic characteristics and then adapt mixture Gaussian density phone models based on speaker phonologic characteristics. Adaptation experiments using short calibration speech (5 s/speaker) have shown substantial performance improvement over the baseline recognition system. On a TIMIT test set, where the task vocabulary size is 853 and the test set perplexity is 104, the recognition word accuracy has been improved from 86.9% to 90.6% (28.2% error reduction). On a separate test set which contains an additional variation source of recording channel mismatch and with the test set perplexity of 101, the recognition word accuracy has been improved from 65.4% to 85.5% (58.1% error reduction).

21 citations
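A much-simplified sketch of the normalization idea described above: a per-speaker spectral bias is estimated from a few seconds of calibration speech relative to a speaker-pool reference, and test frames are shifted accordingly. The paper's full method also adapts the mixture Gaussian phone models; everything below is a placeholder stand-in.

```python
import numpy as np

def estimate_speaker_shift(calibration_frames, reference_mean):
    """Estimate a per-speaker spectral bias from calibration speech,
    relative to a speaker-pool reference mean. A simplified stand-in
    for the paper's separation of variation sources."""
    return calibration_frames.mean(axis=0) - reference_mean

def normalize(frames, shift):
    return frames - shift

rng = np.random.default_rng(1)
reference = np.zeros(12)                          # speaker-pool reference
calib = rng.normal(0.3, 1.0, size=(500, 12))      # ~5 s of calibration frames
shift = estimate_speaker_shift(calib, reference)
test = normalize(rng.normal(0.3, 1.0, size=(300, 12)), shift)
```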


Proceedings ArticleDOI
28 Mar 1993
TL;DR: The TIMIT and KING databases are used to compare proven spectral processing techniques to an auditory neural representation for speaker identification; the resulting vector-quantized distortion-based classification indicates the auditory model performs statistically equal to the LPC cepstral representation in clean environments and outperforms it in noisy environments and in test data recorded over multiple sessions.
Abstract: The TIMIT and KING databases are used to compare proven spectral processing techniques to an auditory neural representation for speaker identification. The feature sets compared are linear prediction coding (LPC) cepstral coefficients and auditory nerve firing rates using the Payton model (1988). Two clustering algorithms, one statistically based and the other a neural approach, are used to generate speaker-specific codebook vectors. These algorithms are the Linde-Buzo-Gray algorithm and a Kohonen self-organizing feature map. The resulting vector-quantized distortion-based classification indicates the auditory model performs statistically equal to the LPC cepstral representation in clean environments and outperforms the LPC cepstral representation in noisy environments and in test data recorded over multiple sessions (greater intra-speaker distortions).
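A minimal sketch of the distortion-based classification described above: one codebook per speaker (plain k-means below as a stand-in for Linde-Buzo-Gray), with a test utterance assigned to the speaker whose codebook quantizes it with the least average distortion. Features, codebook size, and data are placeholders.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

def train_codebook(frames, size=64):
    # k-means as a stand-in for the Linde-Buzo-Gray algorithm.
    codebook, _ = kmeans(frames, size)
    return codebook

def avg_distortion(frames, codebook):
    _, dists = vq(frames, codebook)  # distance to nearest code vector
    return dists.mean()

def identify_speaker(frames, codebooks):
    return min(codebooks, key=lambda s: avg_distortion(frames, codebooks[s]))

rng = np.random.default_rng(2)
codebooks = {
    "spk_a": train_codebook(rng.normal(0.0, 1.0, (2000, 12))),
    "spk_b": train_codebook(rng.normal(0.8, 1.0, (2000, 12))),
}
test = rng.normal(0.8, 1.0, (300, 12))
print(identify_speaker(test, codebooks))  # expected: spk_b
```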

Book ChapterDOI
13 Sep 1993
TL;DR: A discriminative neural prediction system for continuous speaker-independent speech recognition that reaches 74.9% accuracy on TIMIT, which compares well with other state-of-the-art systems while being less complex and easier to implement.
Abstract: This paper presents a discriminative neural prediction system for continuous speaker-independent speech recognition. We first compare different neural predictors for modeling speech production. We then propose new criteria for discriminative training. These networks are incorporated into a complete speech recognition system where they cooperate with other modules (grammar model, correction rules and dynamic time warping). Our best system reaches 74.9% accuracy on TIMIT, which compares well with other state-of-the-art systems while being less complex and easier to implement.
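A minimal sketch of the prediction-based modeling idea described above: one predictor per phone maps each frame to its successor, and a segment is scored by its accumulated prediction error, with the lowest-error model winning. A least-squares linear predictor stands in for the paper's neural predictors, and the training data below is random placeholder material.

```python
import numpy as np

def fit_predictor(frames):
    """Least-squares linear map from each frame to its successor."""
    X, Y = frames[:-1], frames[1:]
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

def prediction_error(frames, W):
    return float(np.mean((frames[:-1] @ W - frames[1:]) ** 2))

rng = np.random.default_rng(3)
# Random placeholder "training segments" for two phone models.
models = {ph: fit_predictor(rng.normal(size=(400, 12))) for ph in ("aa", "iy")}
segment = rng.normal(size=(30, 12))
best = min(models, key=lambda ph: prediction_error(segment, models[ph]))
print(best)
```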

Proceedings ArticleDOI
M. Galler1, R. De Mori1
27 Apr 1993
TL;DR: The authors explore the use of randomized performance-based search strategies to improve the generalization of hidden Markov models (HMMs) in a speaker-independent automatic speech recognition system and develop models for phoneme classes with which upper bounds for the scores of detailed phoneme models can be obtained.
Abstract: The authors explore the use of randomized performance-based search strategies to improve the generalization of hidden Markov models (HMMs) in a speaker-independent automatic speech recognition system. No language models are used, so that the performance of the unit models themselves can be compared. Simulated annealing and random search are applied to several components of the system, including phoneme model topologies, distribution tying, the clustering of allophonic contexts, and the sizes of mixture densities. By using knowledge of the speech problem to constrain the search appropriately, both reduced numbers of parameters and better phoneme recognition are obtained, as experimentally demonstrated on the TIMIT corpus. The clusters developed here automatically introduced a useful degree of generalization. The performance increase was preserved when the allophone distributions were tied to parallel transitions in a small context-independent model set. In this way models for phoneme classes can be built with which upper bounds for scores of detailed phoneme models can be obtained.
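A minimal sketch of the simulated-annealing loop described above, searching over a discrete model configuration (say, states or mixture components per phone class). The score function is a placeholder for an actual train-and-evaluate step; temperatures and step counts are arbitrary.

```python
import math
import random

def score(config):
    # Placeholder: reward an accuracy proxy, penalize parameter count.
    return -sum((c - 3) ** 2 for c in config) - 0.1 * sum(config)

def perturb(config, rng):
    c = list(config)
    i = rng.randrange(len(c))
    c[i] = max(1, c[i] + rng.choice((-1, 1)))
    return tuple(c)

def anneal(config, steps=2000, t0=2.0, rng=random.Random(0)):
    best = cur = config
    for k in range(steps):
        t = t0 * (1 - k / steps) + 1e-6   # linear cooling schedule
        cand = perturb(cur, rng)
        delta = score(cand) - score(cur)
        # Always accept improvements; accept worse moves with prob exp(d/t).
        if delta > 0 or rng.random() < math.exp(delta / t):
            cur = cand
            if score(cur) > score(best):
                best = cur
    return best

print(anneal((8, 8, 8, 8)))  # converges near (3, 3, 3, 3)
```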

Proceedings ArticleDOI
27 Apr 1993
TL;DR: A framework for the use of variable-width features is presented which employs the N-best algorithm, with the features applied in a postprocessing phase; it allows the features to be used in new ways, with the availability of complete utterance transcriptions providing a useful additional source of information.
Abstract: A framework for the use of variable-width features is presented which employs the N-best algorithm with the features being applied in a postprocessing phase. The framework is flexible and widely applicable, giving greater scope for exploitation of the features than previous approaches. Large-vocabulary speech recognition experiments using TIMIT show that the application of variable-width features has potential benefits. The lack of robustness in some past schemes can be overcome by virtue of the scoring flexibility inherent in the proposed scheme and the use of front-end recognizer output to assist the feature extraction process. The framework also has the advantage of not being tied to a specific front-end recognizer architecture. The method presented allows the features to be used in new ways with, for instance, the availability of complete utterance transcriptions providing a useful additional source of information.
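A minimal sketch of the N-best postprocessing framework described above: each hypothesis in the front-end recognizer's ranked list is rescored with an additional feature-based score and a tunable combination weight, and the list is re-ranked. The hypotheses, scores, and weight are made up for illustration.

```python
def rescore(nbest, feature_score, weight=0.3):
    """nbest: list of (hypothesis, front-end log score) pairs."""
    rescored = [(hyp, s + weight * feature_score(hyp)) for hyp, s in nbest]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)

nbest = [("set the display", -120.4), ("said the display", -121.0)]
feature_score = lambda hyp: 4.0 if hyp.startswith("said") else 0.0
print(rescore(nbest, feature_score)[0][0])  # the second hypothesis now wins
```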

Proceedings ArticleDOI
07 Mar 1993
TL;DR: The authors study several of the more well-known connectionist models, and how they address the time and frequency variability of the multispeaker, voiced-stop-consonant recognition task.
Abstract: The authors study several of the more well-known connectionist models, and how they address the time and frequency variability of the multispeaker, voiced-stop-consonant recognition task. The network architectures reviewed or tested include the self-organizing feature map (SOFM) architecture and its derivatives, the time-delay neural network (TDNN) architecture and its derivatives, and two frequency-and-time-shift-invariant architectures: the frequency-shift-invariant TDNN (FTDNN) and the block-windowed neural network (BWNN). Voiced-stop speech was extracted from up to four dialect regions of the TIMIT continuous speech corpus for subsequent preprocessing and for training and testing of network instances. Various feature representations were tested for their robustness in representing the voiced-stop consonants.
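A minimal sketch of the time-delay idea behind the TDNN variants discussed above: each unit sees a short window of consecutive frames and the same weights are applied at every step, giving time-shift invariance. Layer sizes and the random inputs are placeholders.

```python
import numpy as np

def tdnn_layer(frames, weights, bias):
    """One time-delay layer. frames: (T, D); weights: (context, D, H).
    The same weights slide over time, returning (T - context + 1, H)."""
    context = weights.shape[0]
    T = frames.shape[0] - context + 1
    out = np.stack([
        sum(frames[t + c] @ weights[c] for c in range(context)) + bias
        for t in range(T)
    ])
    return np.tanh(out)

rng = np.random.default_rng(4)
x = rng.normal(size=(50, 16))                       # 50 frames, 16 filterbanks
h = tdnn_layer(x, rng.normal(size=(3, 16, 8)) * 0.1, np.zeros(8))
print(h.shape)                                      # (48, 8)
```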

30 Sep 1993
TL;DR: There is a need for lower-data-rate voice encoders for special applications: improved performance in high bit-error conditions, low-probability-of-intercept (LPI) voice communication, and narrowband integrated voice/data systems.
Abstract: The 2400-b/s linear predictive coder (LPC) is currently being widely deployed to support tactical voice communication over narrowband channels. However, there is a need for lower-data-rate voice encoders for special applications: improved performance in high bit-error conditions, low-probability-of-intercept (LPI) voice communication, and narrowband integrated voice/data systems. An 800-b/s voice encoding algorithm is presented which is an extension of the 2400-b/s LPC. To construct template tables, speech samples of 420 speakers uttering 8 sentences each were excerpted from the Texas Instruments-Massachusetts Institute of Technology (TIMIT) acoustic-phonetic speech database. Speech intelligibility of the 800-b/s voice encoding algorithm measured by the diagnostic rhyme test (DRT) is 91.5 for three male speakers. This score compares favorably with the 2400-b/s LPC of a few years ago.
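A minimal sketch of the template-table idea described above: each LPC parameter vector is replaced by the index of its nearest template, so the channel carries only the index bits. The table size, frame rate, and data below are placeholders, not the paper's actual design.

```python
import numpy as np

def quantize(vec, table):
    """Index of the nearest template under squared Euclidean distance."""
    return int(np.argmin(np.sum((table - vec) ** 2, axis=1)))

rng = np.random.default_rng(5)
table = rng.normal(size=(256, 10))   # 256 templates -> 8 bits per frame
frames = rng.normal(size=(100, 10))  # LPC parameter vectors to encode
indices = [quantize(v, table) for v in frames]

frame_rate = 44.4                    # frames/s (placeholder)
print(f"{8 * frame_rate:.0f} b/s for the spectral parameters")
```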