
Showing papers on "TIMIT published in 1999"


Dissertation
01 Jan 1999
TL;DR: Evidence is presented indicating that recognition performance can be significantly improved through a contrasting approach using more detailed and more diverse acoustic measurements, which are referred to as heterogeneous measurements, as well as understanding of the weaknesses of current automatic phonetic classification systems.
Abstract: The acoustic-phonetic modeling component of most current speech recognition systems calculates a small set of homogeneous frame-based measurements at a single, fixed time-frequency resolution. This thesis presents evidence indicating that recognition performance can be significantly improved through a contrasting approach using more detailed and more diverse acoustic measurements, which we refer to as heterogeneous measurements. This investigation has three principal goals. The first goal is to develop heterogeneous acoustic measurements to increase the amount of acoustic-phonetic information extracted from the speech signal. Diverse measurements are obtained by varying the time-frequency resolution, the spectral representation, the choice of temporal basis vectors, and other aspects of the preprocessing of the speech waveform. The second goal is to develop classifier systems for successfully utilizing high-dimensional heterogeneous acoustic measurement spaces. This is accomplished through hierarchical and committee-based techniques for combining multiple classifiers. The third goal is to increase understanding of the weaknesses of current automatic phonetic classification systems. This is accomplished through perceptual experiments on stop consonants which facilitate comparisons between humans and machines. Systems using heterogeneous measurements and multiple classifiers were evaluated in phonetic classification, phonetic recognition, and word recognition tasks. On the TIMIT core test set, these systems achieved error rates of 18.3% and 24.4% for context-independent phonetic classification and context-dependent phonetic recognition, respectively. These results are the best that we have seen reported on these tasks.
Word recognition experiments using the corpus associated with the JUPITER telephone-based weather information system showed 10–16% word error rate reduction, thus demonstrating that these techniques generalize to word recognition in a telephone-bandwidth acoustic environment. (Copies available exclusively from MIT Libraries, Rm. 14-0551, Cambridge, MA 02139-4307. Ph. 617-253-5668; Fax 617-253-1690.)

115 citations
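The committee-based classifier combination the thesis describes can be illustrated with a minimal sketch. The classifier outputs below are hypothetical, and averaging posteriors is just one of several combination schemes, not necessarily the exact rule used in the thesis:

```python
import numpy as np

def combine_committee(posteriors):
    """Average per-class posterior estimates from several classifiers
    and return the winning class index for each frame."""
    stacked = np.stack(posteriors)   # (n_classifiers, n_frames, n_classes)
    avg = stacked.mean(axis=0)       # (n_frames, n_classes)
    return avg.argmax(axis=1)

# Three hypothetical classifiers scoring 2 frames over 3 phone classes.
p1 = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
p2 = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
p3 = np.array([[0.7, 0.2, 0.1], [0.3, 0.3, 0.4]])
print(combine_committee([p1, p2, p3]))  # frame-wise committee decisions
```

The point of the committee is that classifiers trained on heterogeneous measurements make partly uncorrelated errors, so the averaged posterior is more reliable than any single member.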


Journal ArticleDOI
TL;DR: A method for upgrading initially simple pronunciation models to new models that can explain several pronunciation variants of each word, and the introduction of such variants in a segment-based recognizer significantly improves the recognition accuracy.

63 citations


Journal ArticleDOI
TL;DR: An attempt was made to capture segmental transition information for speech recognition tasks using the Principal Curves method and the Generative Topographic Map technique to describe temporal evolution in terms of latent variables.

41 citations


Proceedings ArticleDOI
15 Mar 1999
TL;DR: Neural network based adaptation methods are applied to telephone speech recognition and a new unsupervised model adaptation method is proposed that does not require transcriptions and can be used with the neural networks.
Abstract: The performance of well-trained speech recognizers using high quality full bandwidth speech data is usually degraded when used in real world environments. In particular, telephone speech recognition is extremely difficult due to the limited bandwidth of the transmission channels. In this paper, neural network based adaptation methods are applied to telephone speech recognition and a new unsupervised model adaptation method is proposed. The advantage of the neural network based approach is that the retraining of speech recognizers for telephone speech is avoided. Furthermore, because the multi-layer neural network is able to compute nonlinear functions, it can accommodate the non-linear mapping between full bandwidth speech and telephone speech. The new unsupervised model adaptation method does not require transcriptions and can be used with the neural networks. Experimental results on TIMIT/NTIMIT corpora show that the performance of the proposed methods is comparable to that of recognizers retrained on telephone speech.

29 citations
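The claim that a multi-layer network can learn the non-linear mapping between full-bandwidth and telephone features can be sketched with a toy numpy MLP. The distortion function, network size, and learning rate below are hypothetical stand-ins, not the paper's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in data: "full-bandwidth" features x mapped to "telephone"
# features y by a nonlinear distortion (hypothetical, for illustration).
x = rng.uniform(-1, 1, size=(256, 4))
y = np.tanh(2.0 * x) + 0.05 * rng.normal(size=x.shape)

# One-hidden-layer MLP trained by full-batch gradient descent on MSE.
W1 = 0.5 * rng.normal(size=(4, 16)); b1 = np.zeros(16)
W2 = 0.5 * rng.normal(size=(16, 4)); b2 = np.zeros(4)

def forward(x):
    h = np.tanh(x @ W1 + b1)
    return h, h @ W2 + b2

_, pred = forward(x)
loss0 = ((pred - y) ** 2).mean()   # error before training

lr = 0.1
for _ in range(600):
    h, pred = forward(x)
    g = 2 * (pred - y) / pred.size          # dMSE/dpred
    gW2 = h.T @ g; gb2 = g.sum(axis=0)
    gh = g @ W2.T * (1 - h ** 2)            # backprop through tanh
    gW1 = x.T @ gh; gb1 = gh.sum(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

_, pred = forward(x)
loss1 = ((pred - y) ** 2).mean()   # error after training
print(loss0, loss1)
```

Because tanh hidden units are themselves nonlinear, the network can approximate the distortion without retraining the recognizer on telephone speech, which is the paper's motivation for the feature-mapping approach.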


Proceedings ArticleDOI
10 Jul 1999
TL;DR: Experimental evaluations show that smaller size EBF networks with basis function parameters determined by the EM algorithm outperform the large RBF networks trained by the conventional approach.
Abstract: It is well known that radial basis function (RBF) networks require a large number of function centers if the data to be modeled contain clusters with complicated shape. This paper proposes to overcome this problem by incorporating full covariance matrices into the RBF structure and to use the expectation-maximization (EM) algorithm to estimate the network parameters. The resulting networks, referred to as the elliptical basis function (EBF) networks, are applied to text-independent speaker verification. Experimental evaluations based on 258 speakers of the TIMIT corpus show that smaller size EBF networks with basis function parameters determined by the EM algorithm outperform the large RBF networks trained by the conventional approach.

14 citations
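The core of the EBF idea, fitting full covariance matrices with the EM algorithm so each basis function can take an elliptical shape, can be sketched as a small EM loop over a two-component full-covariance Gaussian mixture. The data and component count are illustrative, not the paper's 258-speaker setup:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two elongated ("elliptical") clusters; a full covariance matrix per
# basis function lets each component match the cluster shape.
a = rng.normal(size=(150, 2)) @ np.array([[2.0, 0.0], [1.5, 0.3]])
b = rng.normal(size=(150, 2)) @ np.array([[0.3, 0.0], [0.0, 2.0]]) + [6, 3]
X = np.vstack([a, b])

K, n, d = 2, *X.shape
pi = np.full(K, 1 / K)
mu = X[rng.choice(n, K, replace=False)]
cov = np.array([np.cov(X.T) for _ in range(K)])

def log_gauss(X, mu, cov):
    diff = X - mu
    inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    quad = np.einsum("ni,ij,nj->n", diff, inv, diff)
    return -0.5 * (quad + logdet + d * np.log(2 * np.pi))

logliks = []
for _ in range(30):
    # E-step: responsibilities of each component for each point
    logp = np.stack([np.log(pi[k]) + log_gauss(X, mu[k], cov[k])
                     for k in range(K)])
    m = logp.max(axis=0)
    lse = m + np.log(np.exp(logp - m).sum(axis=0))
    logliks.append(lse.sum())
    r = np.exp(logp - lse)                  # (K, n)
    # M-step: re-estimate weights, means, and full covariances
    nk = r.sum(axis=1)
    pi = nk / n
    mu = (r @ X) / nk[:, None]
    for k in range(K):
        diff = X - mu[k]
        cov[k] = (r[k, :, None] * diff).T @ diff / nk[k] + 1e-6 * np.eye(d)

print(logliks[0], logliks[-1])  # EM does not decrease the log-likelihood
```

Spherical RBF centers would need many components to tile such elongated clusters; the full covariance lets a single elliptical basis function cover each one, which is the paper's argument for smaller networks.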


Journal ArticleDOI
TL;DR: Model simulation experiments demonstrate that the auditory rate-place code constructed at the output of the network model is capable of reliable representation, with possible modification and/or enhancement, of the prominent spectral characteristics of the utterances displayed in wideband spectrograms.

8 citations


Proceedings Article
01 Jan 1999
TL;DR: A recurrent neural network is trained to estimate ‘velum height’ during continuous speech by analyzing the network’s output for each phonetic segment contained in 50 hand-labelled utterances set aside for testing purposes.
Abstract: This paper reports on present work, in which a recurrent neural network is trained to estimate ‘velum height’ during continuous speech. Parallel acoustic-articulatory data comprising more than 400 read TIMIT sentences is obtained using electromagnetic articulography (EMA). This data is processed and used as training data for a range of neural network sizes. The network demonstrating the highest accuracy is identified. This performance is then evaluated in detail by analysing the network’s output for each phonetic segment contained in 50 hand-labelled utterances set aside for testing purposes.

6 citations


Proceedings Article
01 Jan 1999
TL;DR: This paper presents a text-independent speaker recognition system based on the voiced segments of the speech signal that uses feedforward MLP classification with only a limited amount of training and testing data and gives a comparatively high accuracy.
Abstract: This paper presents a text-independent speaker recognition system based on the voiced segments of the speech signal. The proposed system uses feedforward MLP classification with only a limited amount of training and testing data and gives a comparatively high accuracy. The techniques employed are: Rasta-PLP speech analysis for parameter estimation, a feedforward MLP for voiced/unvoiced segmentation, and a large number (equal to the number of speakers) of simple MLPs for the classification procedure. The system has been trained and tested using the TIMIT and NTIMIT databases. The verification experiments presented a high accuracy rate: above 99% for clean speech (TIMIT) and 74.7% for noisy speech (NTIMIT). Additional experiments compared the proposed voiced-segment approach against using only vowels and against using all phonetic categories, with results favoring the use of voiced segments.

3 citations


01 Jan 1999
TL;DR: Vowels were selected from the TIMIT speech corpus and a simple preprocessing algorithm achieved normalization and the neural network did not perform as well as the Gaussian classifier and only achieved 50% correct classification.
Abstract: We studied vowel classification and speaker normalization performance with neural nets based on Adaptive Resonance Theory (ART). ART was developed by S. Grossberg [4] as a theory of human cognitive information processing. It is the result of an attempt to understand how biological systems are capable of retaining plasticity throughout life, without compromising the stability of previously learned patterns. We have implemented some of these ideas in a supervised neural network called CategoryART [9]. The neural network was trained with formant frequency values extracted at the midpoint of vowels. Vowels were selected from the TIMIT speech corpus [6]. Separate train and test sets were used. Of the 630 speakers in this database, 438 are male and 192 are female. A simple preprocessing algorithm achieved normalization. Only the 13 monophthong vowel categories (iy, ih, ey, eh, ae, aa, ow, ah, ao, uw, uh, ux, er) were used. Formant frequency values were determined by an LPC analysis. To compare formant frequency values for males and females, normalized frequency values were calculated in a preprocessing stage. In addition to the neural net, we also used a Gaussian classifier. This classifier attained on average 57% correct classification. The neural network did not perform as well as the Gaussian classifier and achieved only 50% correct classification.

2 citations
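A Gaussian classifier over formant-frequency features, as used for the baseline above, can be sketched as follows. The formant means, spreads, and three-vowel inventory are hypothetical illustration values, not measurements from TIMIT:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical (F1, F2) means in Hz for three monophthongs -- illustrative
# values only, not TIMIT statistics.
means = {"iy": (300, 2300), "aa": (700, 1200), "uw": (350, 900)}
labels = list(means)

def sample(label, n):
    """Draw synthetic formant pairs around the class mean."""
    return rng.normal(means[label], (60, 120), size=(n, 2))

# Per-class Gaussian: estimate mean and full covariance from training data.
params = {}
for v in labels:
    X = sample(v, 200)
    params[v] = (X.mean(axis=0), np.cov(X.T))

def classify(x):
    def score(v):
        mu, cov = params[v]
        diff = x - mu
        return -0.5 * (diff @ np.linalg.inv(cov) @ diff
                       + np.linalg.slogdet(cov)[1])
    return max(labels, key=score)   # equal priors assumed

test_pairs = [(v, x) for v in labels for x in sample(v, 50)]
acc = np.mean([classify(x) == v for v, x in test_pairs])
print(acc)
```

With class-conditional Gaussians and equal priors this is maximum-likelihood classification, the standard baseline the abstract compares the ART network against.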


01 Apr 1999
TL;DR: Speaker count is the process of automatically identifying segments of speech that contain multiple speakers before attempting to apply any co-channel interference reduction schemes.
Abstract: Speaker count is the process of automatically identifying segments of speech that contain multiple speakers before attempting to apply any co-channel interference reduction schemes. By computing the variance of the pitch estimate, a heuristic approach is used to classify a 30 msec frame of speech as being from a single talker or from multiple talkers. Using TIMIT data, 79% of the frames were classified correctly.

1 citation
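The pitch-variance heuristic can be sketched directly: estimate pitch several times within a frame and flag the frame as multi-talker when the variance of the estimates is large. The threshold and pitch tracks below are hypothetical:

```python
import numpy as np

def frame_is_multitalker(pitch_estimates_hz, var_threshold=400.0):
    """Heuristic from the abstract: a frame whose pitch estimates vary
    widely is flagged as containing multiple talkers. The threshold is
    a hypothetical illustration value."""
    return np.var(pitch_estimates_hz) > var_threshold

# Single talker: pitch drifts slowly around 120 Hz within the frame.
single = 120 + 3 * np.sin(np.linspace(0, 2, 30))
# Two talkers: the pitch estimator jumps between ~120 Hz and ~210 Hz.
multi = np.where(np.arange(30) % 2 == 0, 120.0, 210.0)

print(frame_is_multitalker(single), frame_is_multitalker(multi))
```

A single voice yields locally smooth pitch, so its within-frame variance stays small; overlapping voices make the estimator alternate between fundamentals, inflating the variance.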


Proceedings ArticleDOI
15 Sep 1999
TL;DR: The proposed algorithm aims to reduce the number of frames per speech signal as much as possible without an appreciable reduction in the system's recognition rate, obtaining a frame that characterizes the signal of a consonant/vowel phonetic value in Korean vocalization.
Abstract: This paper investigates a Korean rule-based system and its application to speech segmentation by phonetic value. The discrete wavelet transform is used as part of a front-end processor that extracts non-uniform blocks from the phonetic-value shape. Within these non-uniform blocks, feature parameters are derived by consonant/vowel phonetic-value segmentation based on VQ/HMM. The results were evaluated on a portion of a Korean isolated-word TIMIT corpus comprising ten isolated Korean words from each of 10 male speakers. Ultimately, the proposed algorithm aims to reduce the number of frames per speech signal as much as possible without an appreciable reduction in the system's recognition rate, obtaining a frame that characterizes the signal of a consonant/vowel phonetic value in Korean vocalization.

01 Jan 1999
TL;DR: This post-mortem parsing algorithm combines syntactic parsing rules, morphological recognition, and closed-class lexicon with a method that attempts to parse a sentence first with a limited prediction for unknown words, and later reparse the sentence with a more broad prediction if first attempts fail.
Abstract: We present a parsing system designed to parse sentences containing unknown words as accurately as possible. Our post-mortem parsing algorithm combines syntactic parsing rules, morphological recognition, and a closed-class lexicon with a method that attempts to parse a sentence first with a limited prediction for unknown words, and later reparses the sentence with a broader prediction if the first attempts fail. This allows great flexibility while parsing, and can offer improved accuracy and efficiency for parsing sentences that contain unknown words. Experiments involving hand-created and computer-generated morphological recognizers are performed. We also develop a part-of-speech tagging system designed to accurately tag sentences, including sentences containing unknown words. The system is based on a basic hidden Markov model, but uses second-order approximations for the probability distributions (instead of first-order). The second-order approximations give increased tagging accuracy without increasing asymptotic running time over traditional trigram taggers. A dynamic smoothing technique is used to address sparse data by attaching more weight to events that occur more frequently. Unknown words are predicted using statistical estimation from the training corpus based on word endings only. Information from different-length suffixes is included in a weighted voting scheme, smoothed in a fashion similar to that used for the second-order HMM. This tagging model achieves state-of-the-art accuracies. Finally, the use of syntactic parsing rules to increase tagging accuracy is considered. By allowing a parser to veto possible tag sequences due to violation of syntactic rules, it is shown that tagging errors were reduced by 28% on the TIMIT corpus. This enhancement is useful for corpora that have rule sets defined.
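The suffix-based unknown-word prediction with weighted voting can be sketched as follows. The toy lexicon, suffix lengths, and weights are hypothetical; the paper estimates these counts from the training corpus and smooths them in the same fashion as its second-order HMM:

```python
from collections import Counter, defaultdict

# Toy tagged lexicon (hypothetical); a real system would count tags for
# each word ending over the whole training corpus.
training = [("running", "VBG"), ("talking", "VBG"), ("walking", "VBG"),
            ("quickly", "RB"), ("slowly", "RB"), ("happily", "RB"),
            ("dogs", "NNS"), ("cats", "NNS"), ("tables", "NNS")]

suffix_counts = defaultdict(Counter)
for word, tag in training:
    for k in range(1, 4):               # suffixes of length 1..3
        suffix_counts[word[-k:]][tag] += 1

def predict_tag(word, weights=(1.0, 2.0, 4.0)):
    """Weighted vote over suffix lengths: longer, more specific suffixes
    get more weight (the weights here are illustrative)."""
    votes = Counter()
    for k, w in zip(range(1, 4), weights):
        for tag, count in suffix_counts.get(word[-k:], {}).items():
            votes[tag] += w * count
    return votes.most_common(1)[0][0] if votes else None

print(predict_tag("jumping"), predict_tag("gently"))  # words not in training
```

Because every tag distribution comes from word endings alone, the tagger can assign a sensible tag to a word it has never seen, which is exactly the sparse-data problem the abstract addresses.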

Proceedings Article
01 Jan 1999
TL;DR: An advanced multi-level vowel spotting method is used to achieve minimum vowel loss and accurate detection of the vowel location and duration and showed significant performance improvement compared to similar systems.
Abstract: This paper presents a hybrid ANN/HMM syllable recognition module based on vowel spotting. An advanced multi-level vowel spotting method is used to achieve minimum vowel loss and accurate detection of the vowel location and duration. Discrete Hidden Markov Models (DSHMM), Multi Layer Perceptrons (MLP) and Heuristics (HR) are used for this purpose. A hybrid ANN/HMM technique is then used to recognize the syllables between the detected vowels. We replace the usual DSHMM probability parameters with combined neural network outputs. For this purpose both context dependent (CD) and context independent (CI) neural networks are used. Global normalization is employed on the parameters as opposed to the local normalization used on parameters in standard HMMs. Also, all parameters are estimated simultaneously according to the discriminative conditional maximum likelihood (CML) criterion. The tests were performed on the TIMIT and NTIMIT databases and showed significant performance improvement compared to similar systems.

Book ChapterDOI
01 Jan 1999
TL;DR: An automatic system based on Rasta-PLP and a Time Delay Neural Network has been built to classify stops, fricatives, and nasals from TIMIT, with results that are significant considering that they were obtained using only 30 and 60 msec of the available signal.
Abstract: An automatic system based on Rasta-PLP and a Time Delay Neural Network has been built to classify stops, fricatives, and nasals from TIMIT. It uses a preprocessing algorithm based on a modified Rasta-PLP algorithm and a classification algorithm based on a new Time Delay Neural Network architecture. A new approach, based on the idea of using only a short piece of the available signal, was implemented in the preprocessing phase. This saved memory space and computational time. On the testing data, the results gave 92% correct classification for stops and fricatives, and 82% correct classification for nasals. These results are significant considering that they were obtained using only 30 and 60 msec of the available signal.