
Showing papers on "TIMIT published in 1994"


Journal ArticleDOI
TL;DR: Recognition results are presented for the DARPA TIMIT and Resource Management tasks, and it is concluded that recurrent nets are competitive with traditional means for performing phone probability estimation.
Abstract: This paper presents an application of recurrent networks for phone probability estimation in large vocabulary speech recognition. The need for efficient exploitation of context information is discussed; a role for which the recurrent net appears suitable. An overview of early developments of recurrent nets for phone recognition is given along with the more recent improvements that include their integration with Markov models. Recognition results are presented for the DARPA TIMIT and Resource Management tasks, and it is concluded that recurrent nets are competitive with traditional means for performing phone probability estimation.

497 citations
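In hybrid recurrent-net/HMM systems of this kind, the net's phone posteriors are commonly converted to scaled likelihoods, dividing each posterior by the phone's prior, before Markov-model decoding. A minimal numpy sketch of that conversion, using an invented four-phone inventory and made-up probabilities:

```python
import numpy as np

# Hypothetical 4-phone inventory; in a real system the posteriors
# would come from the recurrent net, one vector per acoustic frame.
phones = ["sil", "aa", "s", "t"]
priors = np.array([0.4, 0.2, 0.2, 0.2])      # phone priors from training data
posteriors = np.array([0.1, 0.6, 0.2, 0.1])  # net output for one frame

# Hybrid decoding uses scaled likelihoods: p(x|q) is proportional to p(q|x) / p(q)
scaled_likelihoods = posteriors / priors
best = phones[int(np.argmax(scaled_likelihoods))]
print(best)
```

In a full system the scaled likelihoods would replace the observation likelihoods inside Viterbi decoding rather than being argmax-ed frame by frame.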


Journal ArticleDOI
TL;DR: This report concerns speaker-dependent effects on certain phonetic characteristics often involved in reduction such as speech rate, stop releases, flapping, central vowels, laryngeal state, syllabic consonants, and palatalization processes.

257 citations


Proceedings ArticleDOI
19 Apr 1994
TL;DR: It is found that other factors beyond a mere decrease in bandwidth cause the observed degradation in recognition accuracy, and that the environmental compensation algorithms RASTA and CDCN fail to compensate completely for degradations introduced by the telephone network.
Abstract: We compare speech recognition accuracy for high-quality speech recorded under controlled conditions with speech as it appears over long-distance telephone lines. In addition to comparing recognition accuracy we use telephone-channel simulation to identify the sources of degradation of speech over telephone lines that have the greatest impact on speech recognition accuracy. We first compare the performance of the CMU SPHINX-I system on the TIMIT and NTIMIT databases. We found that other factors beyond a mere decrease in bandwidth cause the observed degradation in recognition accuracy, and that the environmental compensation algorithms RASTA and CDCN fail to compensate completely for degradations introduced by the telephone network. We identify the most problematic telephone-channel impairments using a commercial telephone channel simulator and the SPHINX-II system. Of the various effects considered, additive noise and linear filtering appear to have the greatest impact on recognition accuracy. Finally, we examined the performance of three cepstral compensation algorithms in the presence of the most damaging conditions. We found the compensation algorithms to be effective except for the worst 1% of the telephone channels.

74 citations
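The two impairments the paper singles out, linear filtering and additive noise, are straightforward to simulate. The sketch below uses a synthetic tone as a stand-in for speech, plus an arbitrary moving-average filter and target SNR; none of these choices come from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 16000
t = np.arange(fs) / fs
speech = np.sin(2 * np.pi * 1000 * t)        # stand-in for a speech signal

# Linear filtering: crude moving-average lowpass mimicking bandwidth loss
h = np.ones(8) / 8
filtered = np.convolve(speech, h, mode="same")

# Additive noise scaled to hit a chosen SNR (in dB)
snr_db = 20
noise = rng.standard_normal(filtered.shape)
scale = np.sqrt(np.mean(filtered**2) / (10**(snr_db / 10) * np.mean(noise**2)))
degraded = filtered + scale * noise

measured_snr = 10 * np.log10(np.mean(filtered**2) / np.mean((scale * noise)**2))
print(round(measured_snr, 1))
```

A real telephone-channel simulator would add bandpass characteristics, nonlinearities, and codec effects on top of these two basic degradations.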


Journal ArticleDOI
TL;DR: A set of studies of some phonetic characteristics of the American English represented in the TIMIT speech database is reported; the results are useful not only to linguistic phoneticians but also for speech recognition lexicons and text-to-speech systems.

48 citations


Proceedings ArticleDOI
19 Apr 1994
TL;DR: Three improvements to the hybrid connectionist-hidden Markov model speech recognition system are described: connectionist model merging; explicit presentation of acoustic context; and improved duration modelling, which provide a significant improvement in the TIMIT phone recognition rate.
Abstract: This paper describes phone modelling improvements to the hybrid connectionist-hidden Markov model speech recognition system developed at Cambridge University. These improvements are applied to phone recognition from the TIMIT task and word recognition from the Wall Street Journal (WSJ) task. A recurrent net is used to map acoustic vectors to posterior probabilities of phone classes. The maximum likelihood phone or word string is then extracted using Markov models. The paper describes three improvements: connectionist model merging; explicit presentation of acoustic context; and improved duration modelling. The first is shown to provide a significant improvement in the TIMIT phone recognition rate and all three provide an improvement in the WSJ word recognition rate.

38 citations


Proceedings ArticleDOI
19 Apr 1994
TL;DR: The phoneme class directed enhancement algorithm is evaluated using TIMIT speech data, and shown to result in substantial improvement in objective speech quality over a range of signal-to-noise ratios and individual phoneme classes.
Abstract: It is known that degrading acoustic noise influences speech quality across phoneme classes in a non-uniform manner. This results in variable quality performance for many speech enhancement algorithms in noisy environments. To address this, a hidden-Markov-model phoneme classification procedure is proposed which directs single channel speech enhancement across individual phoneme classes. The procedure performs broad phoneme class partitioning of noisy speech frames using a continuous-mixture hidden-Markov-model recognizer in conjunction with a cost based decision process. Cost functions are assigned which weigh errors between phoneme classes that are perceptually different (e.g., vowels versus fricatives, etc.). Once noisy speech frames are partitioned, iterative speech enhancement based on all-pole parameter estimation with inter and intra-frame spectral constraints (Auto:I,LSP:T) is employed. The phoneme class directed enhancement algorithm is evaluated using TIMIT speech data, and shown to result in substantial improvement in objective speech quality over a range of signal-to-noise ratios and individual phoneme classes. The algorithm is also shown to possess consistent quality improvement in a speaker independent scenario.

15 citations
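The cost-based decision process described above amounts to a Bayes decision with a non-uniform cost matrix over broad phoneme classes. A small numpy illustration with invented class posteriors and costs (perceptually similar class pairs are given lower confusion cost):

```python
import numpy as np

classes = ["vowel", "fricative", "nasal"]
posterior = np.array([0.5, 0.3, 0.2])    # from a hypothetical HMM classifier

# cost[i, j] = cost of deciding class j when the true class is i;
# perceptually closer pairs (e.g. vowel/nasal) are penalised less.
cost = np.array([[0.0, 1.0, 0.5],
                 [1.0, 0.0, 0.8],
                 [0.5, 0.8, 0.0]])

expected_cost = cost.T @ posterior       # expected cost of each decision
decision = classes[int(np.argmin(expected_cost))]
print(decision)
```

With uniform costs this reduces to picking the maximum-posterior class; the asymmetric matrix is what lets perceptual similarity influence the partitioning.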


Proceedings ArticleDOI
06 Sep 1994
TL;DR: A number of different approaches to connectionist model merging are presented and compared and evaluated on the TIMIT phone recognition and ARPA Wall Street Journal word recognition tasks.
Abstract: Reports in the statistics and neural networks literature have expounded the benefits of merging multiple models to improve classification and prediction performance. The Cambridge University connectionist speech group has developed a hybrid connectionist-hidden Markov model system for large vocabulary talker independent speech recognition. The performance of this system has been greatly enhanced through the merging of connectionist acoustic models. This paper presents and compares a number of different approaches to connectionist model merging and evaluates them on the TIMIT phone recognition and ARPA Wall Street Journal word recognition tasks.

14 citations
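Two common merging schemes for connectionist acoustic models are averaging the posteriors in the linear domain and in the log domain (a geometric mean). A toy numpy comparison with made-up frame posteriors; the paper's exact merging rules may differ:

```python
import numpy as np

# Frame posteriors from two hypothetical acoustic nets over 3 phone classes
p1 = np.array([0.7, 0.2, 0.1])
p2 = np.array([0.5, 0.4, 0.1])

linear_merge = (p1 + p2) / 2
log_merge = np.exp((np.log(p1) + np.log(p2)) / 2)  # geometric mean
log_merge /= log_merge.sum()                       # renormalise

print(np.argmax(linear_merge), np.argmax(log_merge))
```

Log-domain merging tends to be more conservative: a class must be supported by every model to survive, since one near-zero posterior drags the geometric mean down.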


Proceedings ArticleDOI
19 Apr 1994
TL;DR: Vowels, diphthongs and nasals are shown to be the phoneme classes most discriminative of speakers, and a method to predict the most speaker-related phonemes is developed.
Abstract: We investigate speaker-related information in speech. Our investigations are based on the AR-vector model used in the Orphee system: the voice-activated door system we designed. Free-text speaker identification accuracy is an indicator of the ability of any parameter to discriminate speakers. Therefore, through speaker identification performance, we test the influence of speech duration, the effect of the parameters of the AR-vector model, and the importance of phonetic segments. We developed a method to predict the most speaker-related phonemes. The results are the following: vowels, diphthongs and nasals are shown to be the most discriminative of speakers. For a one-second test signal composed of vowels and diphthongs, a speaker identification accuracy of 95.6% was obtained for the 630 TIMIT speakers.

13 citations


Proceedings ArticleDOI
19 Apr 1994
TL;DR: It is found that the simple MLP input transformation, with five frames of context information, can increase the recognition rate significantly compared to just using delta parameters, and it is observed that optimal sentence decoding is essential to achieve maximum recognition rate for models trained by global MMI.
Abstract: This paper deals with speaker-independent continuous speech recognition. Our approach is based on continuous density hidden Markov models with a non-linear input feature transformation performed by a multilayer perceptron. We discuss various optimisation criteria and provide results on a TIMIT phoneme recognition task, using single frame (mutual information or relative entropy) MMI embedded in Viterbi training, and a global MMI criterion. As expected, global MMI is found superior to the frame-based criterion for continuous recognition. We further observe that optimal sentence decoding is essential to achieve maximum recognition rate for models trained by global MMI. Finally, we find that the simple MLP input transformation, with five frames of context information, can increase the recognition rate significantly compared to just using delta parameters.

12 citations
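Presenting the MLP with five frames of context just means stacking each frame with its two neighbours on either side. A small numpy sketch of such context stacking (edge frames are padded by repetition, one of several reasonable choices):

```python
import numpy as np

def stack_context(frames: np.ndarray, context: int = 2) -> np.ndarray:
    """Stack +/-context neighbouring frames onto each frame (5 frames in
    total for context=2), padding the edges by repetition."""
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    n = len(frames)
    return np.hstack([padded[i:i + n] for i in range(2 * context + 1)])

feats = np.arange(12, dtype=float).reshape(6, 2)   # 6 frames, 2 coefficients
stacked = stack_context(feats)
print(stacked.shape)
```

Each stacked vector then feeds the MLP, which learns a transformation of the whole context window rather than relying on fixed delta parameters.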


Proceedings ArticleDOI
13 Apr 1994
TL;DR: This paper compares a mixture-Gaussian vector quantisation method, ergodic continuous hidden Markov models (CHMMs) and phone-level left-to-right CHMMs for text-independent speaker recognition; the three methods represent a progression of phonetic specificity prior to the generation of probabilities against which speakers are compared.
Abstract: This paper compares a mixture-Gaussian vector quantisation (VQ) method, ergodic continuous hidden Markov models (CHMMs) and phone-level left-to-right CHMMs for text-independent speaker recognition. These three methods represent a progression of phonetic specificity prior to the generation of probabilities against which speakers are compared. The mixture-Gaussian VQ uses a single distribution for all phones, the ergodic CHMM uses several distributions which have been shown in a previous text-independent speaker recognition study to represent broad phonetic classes, and the phone-based left-to-right CHMM uses many distributions representing the specific phones in the test utterance. Our experiments with speaker recognition on 40 TIMIT speakers show that the recognition rates of the mixture-Gaussian VQ, ergodic CHMMs and phone-based left-to-right CHMMs are 87.5%, 87.5% and 100% respectively.

7 citations
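The VQ baseline can be caricatured as nearest-codeword scoring: the claimed speaker is the one whose codebook yields the lowest average distortion on the test frames. A toy sketch with invented two-dimensional codebooks:

```python
import numpy as np

def vq_distortion(frames: np.ndarray, codebook: np.ndarray) -> float:
    """Average distance from each frame to its nearest codeword."""
    d = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=-1)
    return float(d.min(axis=1).mean())

# Hypothetical codebooks for two speakers and a short test utterance
cb_a = np.array([[0.0, 0.0], [1.0, 1.0]])
cb_b = np.array([[5.0, 5.0], [6.0, 6.0]])
test = np.array([[0.1, 0.1], [0.9, 1.1]])

scores = {name: vq_distortion(test, cb) for name, cb in [("A", cb_a), ("B", cb_b)]}
print(min(scores, key=scores.get))
```

The CHMM variants in the paper replace this single pooled codebook with state-tied Gaussian mixtures of increasing phonetic specificity.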


Proceedings ArticleDOI
15 Mar 1994
TL;DR: In this article, a text independent, phoneme based speaker identification system is described which uses adaptive wavelets to model the phonemes; classification is achieved using a two layer feed forward neural network.
Abstract: In this paper, we describe a text independent, phoneme based speaker identification system which uses adaptive wavelets to model the phonemes. This system identifies a speaker by modeling a very short segment of phonemes and then by clustering all the phonemes belonging to the same speaker into one class. The classification is achieved by using a two layer feed forward neural network classifier. The performance of this speaker identification system is demonstrated by considering the phonemes that were extracted from various sentences spoken by three speakers in the TIMIT acoustic-phonetic speech corpus. 1. INTRODUCTION Speaker identification systems are mainly used (a) for verifying a person's identity prior to admitting him/her into a secured place or to a telephone transaction and (b) for associating a person with a voice in police work [1]. A linguistic unit is called a phoneme (speech sound). The acoustic characteristics of each phoneme vary based on the manner of articulation (source of excitation) and the place of articulation (shape of the vocal tract). Based on the source of excitation, speech sounds can be broadly classified into

Proceedings ArticleDOI
19 Apr 1994
TL;DR: A novel approach for classifying continuous speech into visible mouth-shape related classes (called visemes) is described, which is a quite promising result considering that the test is applied on continuous multi-speakers and large vocabulary speech.
Abstract: The paper describes a novel approach for classifying continuous speech into visible mouth-shape related classes (called visemes). The selection and comparison of various acoustic speech features and the use of context information in the classification are addressed. Continuous speech is classified into 9 visible mouth-shape related classes on an acoustic frame basis. Some mouth-shape related acoustic speech signal features are selected as the input to a classifier constructed with a recurrent neural network (RNN). 304 training sentences and 88 testing sentences are chosen from the DARPA TIMIT continuous speech database. The average viseme recognition rate for the test set reaches 84.7% on the frame level, which is a quite promising result considering that the test is applied to continuous, multi-speaker, large-vocabulary speech.

Journal ArticleDOI
TL;DR: This talk presents phonetic models that capture both the dynamic characteristics and the statistical dependencies of acoustic attributes in a segment‐based framework that compares favorably with other studies using the timit corpus.
Abstract: This talk presents phonetic models that capture both the dynamic characteristics and the statistical dependencies of acoustic attributes in a segment‐based framework. The approach is based on the creation of a track, Tα, for each phonetic unit α. The track serves as a model of the dynamic trajectories of the acoustic attributes over the segment. The statistical framework for scoring incorporates the auto‐ and cross‐correlation properties of the track error over time, within a segment. On a vowel classification task [W. Goldenthal and J. Glass, ‘‘Modeling Spectra Dynamics for Vowel Classification,’’ Proc. Eurospeech 93, pp. 289–292, Berlin, Germany (1993)], this methodology achieved classification performance of 68.9%. This result compares favorably with other studies using the TIMIT corpus. This talk extends this result by presenting context‐independent and context‐dependent experiments for all the phones. Context‐independent classification performance of 76.8% is demonstrated. The key to implementing the...

Proceedings ArticleDOI
25 Oct 1994
TL;DR: In this paper, a Gaussian mixture speaker model was used for speaker identification and experiments were conducted on the TIMIT and NTIMIT databases, achieving accuracies of 99.5% and 60.7% for clean, wideband speech and telephone speech, respectively.
Abstract: The two largest factors affecting automatic speaker identification performance are the size of the population to be distinguished among and the degradations introduced by noisy communication channels (e.g., telephone transmission). To experimentally examine these two factors, this paper presents text-independent speaker identification results for varying speaker population sizes up to 630 speakers for both clean, wideband speech and telephone speech. A system based on Gaussian mixture speaker models is used for speaker identification and experiments are conducted on the TIMIT and NTIMIT databases. The aims of this study are to (1) establish how well text-independent speaker identification can perform under near ideal conditions for very large populations (using the TIMIT database), (2) gauge the performance loss incurred by transmitting the speech over the telephone network (using the NTIMIT database), and (3) examine the validity of current models of telephone degradations commonly used in developing compensation techniques (using the NTIMIT calibration signals). This is believed to be the first speaker identification experiments on the complete 630 speaker TIMIT and NTIMIT databases and the largest text-independent speaker identification task reported to date. Identification accuracies of 99.5% and 60.7% are achieved on the TIMIT and NTIMIT databases, respectively.
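Gaussian-mixture speaker identification scores each enrolled model by summing per-frame log-likelihoods and picks the highest-scoring speaker. A minimal diagonal-covariance sketch with invented single-component models (real systems use many components per speaker):

```python
import numpy as np

def gmm_loglik(x, weights, means, variances):
    """Total log-likelihood of frames x (n, d) under a diagonal-covariance
    GMM with m components: weights (m,), means/variances (m, d)."""
    diff = x[:, None, :] - means[None, :, :]                       # (n, m, d)
    comp = -0.5 * (np.log(2 * np.pi * variances) + diff**2 / variances).sum(-1)
    comp += np.log(weights)                                        # (n, m)
    mx = comp.max(axis=1, keepdims=True)                           # log-sum-exp
    return float((mx[:, 0] + np.log(np.exp(comp - mx).sum(axis=1))).sum())

# Two invented speaker models; the test frames sit near speaker A's mean
w = np.array([1.0])
mean_a, mean_b = np.array([[0.0, 0.0]]), np.array([[3.0, 3.0]])
var = np.ones((1, 2))
frames = np.array([[0.1, -0.2], [0.2, 0.1]])

score_a = gmm_loglik(frames, w, mean_a, var)
score_b = gmm_loglik(frames, w, mean_b, var)
print("A" if score_a > score_b else "B")
```

Identification over N speakers is just this comparison repeated over N models, which is why population size dominates both accuracy and compute cost.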

Proceedings ArticleDOI
09 Oct 1994
TL;DR: The signal at different scales is modeled by a hierarchical autoregressive moving average (ARMA) model, and the features at coarse scales are extracted from the model without performing expensive filtering operation.
Abstract: In this paper, we consider the classification of speech signals by using stochastic models at different scales. The signal at different scales is modeled by a hierarchical autoregressive moving average (ARMA) model, and the features at coarse scales are extracted from the model without performing expensive filtering operation. The hierarchical modeling can increase the accuracy of speech classification by exploiting features at different scales. For speech classification, model parameters at five different scales obtained by hierarchical modeling are used as features. A minimum distance classifier is implemented, and tested on TIMIT speech data.
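A minimum distance classifier of the kind used here simply assigns the class whose mean feature vector lies nearest the observed feature. A toy sketch with invented two-dimensional class means standing in for the multiscale ARMA features:

```python
import numpy as np

def min_distance_classify(feature: np.ndarray, class_means: dict) -> str:
    """Assign the class whose mean feature vector is nearest (Euclidean)."""
    dists = {c: float(np.linalg.norm(feature - m)) for c, m in class_means.items()}
    return min(dists, key=dists.get)

# Hypothetical class means; real features would be ARMA parameters at 5 scales
means = {"vowel": np.array([1.0, 0.0]), "fricative": np.array([0.0, 1.0])}
print(min_distance_classify(np.array([0.9, 0.2]), means))
```

The hierarchical model's contribution is in the feature vector itself: coarse-scale parameters come from the model directly, avoiding explicit multirate filtering.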

Proceedings ArticleDOI
13 Apr 1994
TL;DR: Preliminary experimental results on the task of sex adaptation for speaker-independent stop consonant discrimination, evaluated on the DARPA TIMIT speech database, demonstrate the effectiveness of the proposed method of speaker normalization by means of input space optimization for continuous density hidden Markov models.
Abstract: This paper proposes a novel method of speaker normalization by means of input space optimization for continuous density hidden Markov models (CDHMM). The parameters of a linear feature transformation function are so determined that, together with the previously trained CDHMM parameters, a mis-classification cost function is minimized for the normalizing data set. Preliminary experimental results on the task of sex adaptation for speaker-independent stop consonant discrimination, evaluated on the DARPA TIMIT speech database, demonstrate the effectiveness of the proposed method.
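The normalisation step can be pictured as an affine map applied to each feature vector before CDHMM scoring. The matrix and offset below are invented for illustration; in the paper they would be optimised to minimise a misclassification cost on the normalising data while the CDHMM parameters stay fixed:

```python
import numpy as np

# Invented affine transformation x' = A @ x + b applied to input features;
# A and b play the role of the trainable normalisation parameters.
A = np.array([[1.1, 0.0],
              [0.0, 0.9]])
b = np.array([-0.2, 0.1])

def normalise(x: np.ndarray) -> np.ndarray:
    """Map a feature vector into the normalised input space."""
    return A @ x + b

print(normalise(np.array([1.0, 1.0])))
```

Because only A and b are adapted, the method needs far less data than retraining the CDHMMs themselves, which is what makes it attractive for sex or speaker adaptation.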

Proceedings ArticleDOI
25 Oct 1994
TL;DR: A novel signal modeling technique is described to compute smoothed time-frequency features for encoding speech information that compactly and accurately model phonetic information, while accounting for the main effects of contextual variations.
Abstract: A novel signal modeling technique is described to compute smoothed time-frequency features for encoding speech information. These time-frequency features compactly and accurately model phonetic information, while accounting for the main effects of contextual variations. These segment-level features are computed such that more emphasis is given to the center of the segment and less to the end regions. For phonetic classification, the features are relatively insensitive to both time and frequency resolution, at least insofar as changes in window length and frame spacing are concerned. A 60-dimensional feature space based on this modeling technique resulted in 70.9% accuracy for classification of 16 vowels extracted from the TIMIT data base in speaker-independent experiments. These results are higher than any other results reported in the literature for the same task.
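Centre-weighted segment features can be sketched as a weighted average of a segment's frames with weights peaking at the middle. The Hann-shaped weighting below is an assumption for illustration; the paper's exact emphasis scheme is not specified here:

```python
import numpy as np

def segment_feature(frames: np.ndarray) -> np.ndarray:
    """Weighted average of a segment's frames, emphasising the centre.
    Hann-shaped weights are an illustrative choice, trimmed so the
    edge frames keep a small nonzero weight."""
    n = len(frames)
    w = np.hanning(n + 2)[1:-1]
    w = w / w.sum()
    return w @ frames

# A 5-frame segment with a single feature dimension
seg = np.array([[0.0], [1.0], [2.0], [1.0], [0.0]])
print(round(float(segment_feature(seg)[0]), 4))
```

The centre frames dominate the average, so transitional regions shared with neighbouring phones contribute less to the segment-level feature.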

Proceedings ArticleDOI
06 Sep 1994
TL;DR: A modular architecture where the interactions among different modules are controlled by proper autoassociators, thus reducing significantly the problems due to interaction of different modules.
Abstract: Proposes a modular architecture where the interactions among different modules are controlled by proper autoassociators. The outputs of these modules are computed by sigma-pi neurons whose inputs come from both a feedforward network performing classification and an autoassociator. The outputs of the autoassociators are used for performing pattern rejection, thus significantly reducing the problems due to interaction of different modules. The proposed architecture is validated by experiments of speaker independent phoneme recognition on continuous speech with the TIMIT data base, with very promising results.

Proceedings ArticleDOI
17 Mar 1994
TL;DR: In this article, the authors describe a method for the enhancement of speech of a particular speaker in a noisy multispeaker environment using minimum variance deconvolution (MVD) algorithm.
Abstract: Describes a novel method for the enhancement of speech of a particular speaker in a noisy multispeaker environment. Many potential applications of the method are possible, including implementation in a new generation of hearing aids. The system is based on the minimum variance deconvolution (MVD) algorithm. The method was tested using the TIMIT speech database. The utterances of two speakers were first combined to create a multispeaker environment, and then separated using the MVD algorithm. The intelligibility of the separated and enhanced speech was high. Likewise, the frequency spectra of the original speech were very similar to the spectra of the separated and enhanced speech for each of the two speakers.

Book ChapterDOI
01 Jan 1994
TL;DR: A telephone speech database suitable for talker identification research (Godfrey, 1992) was not generally available at the time of this research, though clean speech databases such as TIMIT (Garofolo et al., 1988) have been available.
Abstract: It is difficult to implement talker recognition on the telephone network because of normal variation in the channel characteristics. The primary component of variation is due to the different telephone handsets or microphone frequency characteristics (Rosenberg and Soong, 1992). Lack of availability of telephone speech databases has also contributed to slow progress in the solution of these problems, though clean speech databases such as TIMIT (Garofolo et al., 1988) have been available. A telephone speech database suitable for talker identification research (Godfrey, 1992) was not generally available at the time of this research.

01 Jan 1994
TL;DR: A modular architecture where the interactions among different modules are controlled by proper autoassociators; the outputs of these modules are used for performing pattern rejection, thus significantly reducing the problems due to interaction of different modules.
Abstract: In this paper, we propose a modular architecture where the interactions among different modules are controlled by proper autoassociators. The outputs of these modules are computed by sigma-pi neurons whose inputs come from both a feedforward network performing classification and an autoassociator. The outputs of the autoassociators are used for performing pattern rejection, thus significantly reducing the problems due to interaction of different modules. The proposed architecture is validated by experiments of speaker independent phoneme recognition on continuous speech with TIMIT