
Showing papers on "TIMIT published in 2002"


Book ChapterDOI
18 Aug 2002
TL;DR: Two approaches to using HMMs (hidden Markov models) to convert audio signals to a sequence of visemes are compared, and it is found that the error rates can be reduced to 20.5% and 13.9%, respectively.
Abstract: We describe audio-to-visual conversion techniques for efficient multimedia communications. The audio signals are automatically converted to visual images of mouth shape. The visual speech can be represented as a sequence of visemes, which are the generic face images corresponding to particular sounds. Visual images synchronized with audio signals can provide a user-friendly interface for man-machine interaction. They can also be used to help people with impaired hearing. We use HMMs (hidden Markov models) to convert audio signals to a sequence of visemes. In this paper, we compare two approaches to using HMMs. In the first approach, an HMM is trained for each viseme, and the audio signals are directly recognized as a sequence of visemes. In the second approach, each phoneme is modeled with an HMM, and a general phoneme recognizer is utilized to produce a phoneme sequence from the audio signals. The phoneme sequence is then converted to a viseme sequence. We implemented the two approaches and tested them on the TIMIT speech corpus. The viseme recognizer shows a 33.9% error rate, and the phoneme-based approach exhibits a 29.7% viseme recognition error rate. When similar viseme classes are merged, we have found that the error rates can be reduced to 20.5% and 13.9%, respectively.

42 citations
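
In the second approach above, once the phoneme recognizer has produced its output, the conversion step is a many-to-one lookup. A minimal sketch of that step, assuming an illustrative phoneme-to-viseme table (the paper does not publish its mapping):

```python
# Sketch of the phoneme-based approach: a recognized phoneme sequence is
# collapsed into visemes through a many-to-one table. The mapping below is
# purely illustrative; the paper's actual phoneme-viseme table is not given.

PHONEME_TO_VISEME = {
    "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",   # shared mouth shape
    "f": "V_labiodental", "v": "V_labiodental",
    "t": "V_alveolar", "d": "V_alveolar", "n": "V_alveolar", "s": "V_alveolar",
    "aa": "V_open", "ae": "V_open",
}

def phonemes_to_visemes(phonemes):
    """Map a phoneme sequence to visemes, merging adjacent repeats."""
    visemes = []
    for ph in phonemes:
        v = PHONEME_TO_VISEME.get(ph, "V_neutral")   # fallback for unmapped phones
        if not visemes or visemes[-1] != v:
            visemes.append(v)
    return visemes

print(phonemes_to_visemes(["b", "ae", "t", "m", "ae", "n"]))
# ['V_bilabial', 'V_open', 'V_alveolar', 'V_bilabial', 'V_open', 'V_alveolar']
```

Merging similar viseme classes, as in the paper's final experiment, amounts to coarsening this table so that several visemes share one label.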


Proceedings Article
01 Jan 2002
TL;DR: This paper presents a method for speaker-independent automatic phonetic alignment that is distinguished from standard HMM-based “forced alignment” in three respects: specific acoustic-phonetic features are used, in addition to PLP features, by the phonetic classifier, and the units of classification consist of distinctive phonetic features instead of phonemes.
Abstract: This paper presents a method for speaker-independent automatic phonetic alignment that is distinguished from standard HMM-based “forced alignment” in three respects: (1) specific acoustic-phonetic features are used, in addition to PLP features, by the phonetic classifier; (2) the units of classification consist of distinctive phonetic features instead of phonemes; and (3) observation probabilities depend not only on the current state, but also on the state transition information. This proposed method is compared with a state-of-the-art baseline forced-alignment system on a number of corpora, including telephone speech, microphone speech, and children’s speech. The new method has agreement of 92.57% within 20 msec on the TIMIT corpus, which is a 26% reduction in error over the baseline method (with 89.95% agreement on TIMIT). Average reduction in error over all corpora is 28%.

40 citations
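
The 92.57%-within-20-msec figure is a boundary agreement measure. A hypothetical scorer for that metric, assuming one-to-one aligned boundary lists given in seconds (the paper's evaluation protocol beyond the 20 ms tolerance is not reproduced here):

```python
# Percentage of predicted phone boundaries falling within a tolerance
# (20 ms by default) of the corresponding hand-labeled boundaries.

def boundary_agreement(predicted, reference, tol=0.020):
    assert len(predicted) == len(reference)
    hits = sum(abs(p - r) <= tol for p, r in zip(predicted, reference))
    return 100.0 * hits / len(reference)

ref = [0.10, 0.25, 0.41, 0.60]   # hand-labeled boundary times, in seconds
hyp = [0.11, 0.24, 0.47, 0.59]   # automatic alignment output
print(f"{boundary_agreement(hyp, ref):.2f}% within 20 ms")   # 75.00% within 20 ms
```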


Proceedings ArticleDOI
13 May 2002
TL;DR: It was found that combining the classical MFCCs with some auditory-based acoustic distinctive cues and the main peaks of the spectrum of a speech signal using a multi-stream paradigm leads to an improvement in the recognition performance.
Abstract: In this paper, a multi-stream paradigm is proposed to improve the performance of automatic speech recognition (ASR) systems. Our goal in this paper is to improve the performance of HMM-based ASR systems by exploiting some features that characterize speech sounds based on the auditory system, and one based on the Fourier power spectrum. It was found that combining the classical MFCCs with some auditory-based acoustic distinctive cues and the main peaks of the spectrum of a speech signal using a multi-stream paradigm leads to an improvement in the recognition performance. The Hidden Markov Model Toolkit (HTK) was used throughout our experiments to test the use of the new multi-stream feature vector. A series of experiments on speaker-independent continuous-speech recognition was carried out using a subset of the large read-speech corpus TIMIT. Using such a multi-stream paradigm, N-mixture mono-/tri-phone models and a bigram language model, we found that the word error rate was decreased by about 4.01%.

39 citations
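
One of the streams named in the abstract is the set of main peaks of the signal's spectrum. A sketch of assembling a single multi-stream frame, assuming a simple local-maximum peak picker and a placeholder MFCC vector (the paper's auditory-based cues and exact peak-extraction method are not specified here):

```python
import numpy as np

# Build one multi-stream frame: a conventional MFCC vector concatenated with
# the frequencies of the top-K peaks of the frame's power spectrum.

def main_spectral_peaks(frame, sr, n_peaks=3):
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    # bins strictly greater than both neighbours are local maxima
    peaks = np.where((spectrum[1:-1] > spectrum[:-2]) &
                     (spectrum[1:-1] > spectrum[2:]))[0] + 1
    top = peaks[np.argsort(spectrum[peaks])[::-1][:n_peaks]]
    return np.sort(freqs[top])

sr = 16000
t = np.arange(400) / sr                          # one 25 ms frame
frame = np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 2500 * t)
mfcc = np.zeros(13)                              # stand-in for a real MFCC vector
multi_stream = np.concatenate([mfcc, main_spectral_peaks(frame, sr)])
print(multi_stream.shape)                        # (16,): 13 MFCCs + 3 peak frequencies
```

In an HTK-style multi-stream HMM, each stream would keep its own output distribution and stream weight rather than being modeled jointly.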


Proceedings ArticleDOI
13 Oct 2002
TL;DR: The results show that a reconstructed phase space approach is a viable method for classification of phonemes, with the potential for use in a continuous speech recognition system.
Abstract: A novel method for classifying speech phonemes is presented. Unlike traditional cepstral based methods, this approach uses histograms of reconstructed phase spaces. A naive Bayes classifier uses the probability mass estimates for classification. The approach is verified using isolated fricative, vowel, and nasal phonemes from the TIMIT corpus. The results show that a reconstructed phase space approach is a viable method for classification of phonemes, with the potential for use in a continuous speech recognition system.

39 citations
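
A minimal sketch of the reconstructed-phase-space idea, under stated assumptions (embedding dimension 2, time delay 6, a 16x16 histogram, and toy stand-in signals; the paper's settings may differ):

```python
import numpy as np

# Embed waveforms by time delay, estimate a probability mass function with a
# 2-D histogram, and classify by the naive Bayes (summed log-likelihood) rule.

def phase_space(x, delay=6):
    return np.stack([x[:-delay], x[delay:]], axis=1)     # (N - delay, 2) points

def histogram_pmf(points, bins=16, rng=(-1, 1)):
    h, _, _ = np.histogram2d(points[:, 0], points[:, 1], bins=bins, range=[rng, rng])
    h += 1e-6                                            # floor to avoid log(0)
    return h / h.sum()

def classify(x, class_pmfs, bins=16, rng=(-1, 1)):
    pts = phase_space(x)
    idx = np.clip(((pts - rng[0]) / (rng[1] - rng[0]) * bins).astype(int), 0, bins - 1)
    scores = {c: np.log(p[idx[:, 0], idx[:, 1]]).sum() for c, p in class_pmfs.items()}
    return max(scores, key=scores.get)

gen = np.random.default_rng(0)
vowel = 0.8 * np.sin(2 * np.pi * 140 * np.arange(2000) / 16000)  # periodic stand-in
fric = 0.3 * gen.standard_normal(2000)                           # noise-like stand-in
pmfs = {"vowel": histogram_pmf(phase_space(vowel)),
        "fricative": histogram_pmf(phase_space(fric))}
print(classify(vowel, pmfs), classify(fric, pmfs))               # vowel fricative
```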


Journal ArticleDOI
TL;DR: A novel discriminative objective function for the estimation of hidden Markov model (HMM) parameters is proposed, based on the calculation of overall risk; it minimises the risk of misclassification on the training database and thus maximises recognition accuracy.

31 citations


Proceedings Article
01 Jan 2002
TL;DR: The conventional mel-scaled filter bank is replaced with a speaker-discriminative filter bank, which outperforms traditional MFCC features on the TIMIT corpus.
Abstract: A new filter bank approach for the speaker recognition front-end is proposed. The conventional mel-scaled filter bank is replaced with a speaker-discriminative filter bank. The filter bank is selected from a library on an adaptive, frame-by-frame basis, according to the broad phoneme class of the input frame. Each phoneme class is associated with its own filter bank. Each filter bank is designed in a way that emphasizes discriminative subbands that are characteristic of that phoneme. Experiments on the TIMIT corpus show that the proposed method outperforms traditional MFCC features.

17 citations


Proceedings ArticleDOI
01 Jan 2002
TL;DR: This paper attempts to overcome the above difficulty by using the alternative Lagrangian formulation which only requires the inversion of a matrix whose dimension is proportional to the size of the MFCC sequence of vectors.
Abstract: We study the performance of binary and multi-category SVMs for phoneme classification. The training process of the standard formulation involves the solution of a quadratic programming problem whose complexity depends on the size of the training set. The large size of speech corpora such as TIMIT seriously limits their practical use in continuous speech recognition tasks on off-the-shelf personal computers in a reasonable time. In this paper, we attempt to overcome this difficulty by using the alternative Lagrangian formulation, which only requires the inversion of a matrix whose dimension is proportional to the size of the MFCC sequence of vectors. We provide computational results for all possible binary classifiers (1830) on the TIMIT database, which are shown to be competitive in terms of recognition rates (96.8%) with those found in the literature (95.6%). The binary classifiers are introduced into the DAGSVM and voting algorithms to perform multi-category classification on some hand-picked subsets of the TIMIT corpus.

16 citations
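
The "alternative Lagrangian formulation" is in the spirit of Mangasarian and Musicant's Lagrangian SVM (2001), where training reduces to one inversion of Q = I/nu + HH^T followed by a cheap fixed-point iteration. A toy sketch on synthetic 2-D data standing in for the MFCC vectors; this is not the authors' implementation:

```python
import numpy as np

# Lagrangian SVM: solve the dual via one matrix inversion plus the iteration
# u <- Q^-1 (e + ((Qu - e) - alpha*u)_+), with 0 < alpha < 2/nu.

def lsvm_train(A, d, nu=10.0, iters=200):
    m = A.shape[0]
    H = d[:, None] * np.hstack([A, -np.ones((m, 1))])    # H = D [A, -e]
    Q = np.eye(m) / nu + H @ H.T
    Q_inv = np.linalg.inv(Q)                             # the single matrix inversion
    alpha, e = 1.9 / nu, np.ones(m)
    u = Q_inv @ e
    for _ in range(iters):
        u = Q_inv @ (e + np.maximum(Q @ u - e - alpha * u, 0.0))
    w_gamma = H.T @ u                                    # stacks [w; gamma]
    return w_gamma[:-1], w_gamma[-1]

def lsvm_predict(A, w, gamma):
    return np.sign(A @ w - gamma)

gen = np.random.default_rng(1)
X = np.vstack([gen.normal(-2, 1, (50, 2)), gen.normal(2, 1, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])
w, gamma = lsvm_train(X, y)
print((lsvm_predict(X, w, gamma) == y).mean())           # ~1.0 on separable toy data
```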


01 Jan 2002
TL;DR: The Indiana Speech Project (ISP) as mentioned in this paper collected a large corpus of spoken language samples from a number of different talkers that represent several different regional varieties of American English, which were used for acoustic-phonetic measurements of speech.
Abstract: The goal of the Indiana Speech Project (ISP) was to collect a corpus of spoken language samples from a number of different talkers that represent several different regional varieties of American English. Audio recordings were made of five college-aged women from each of six geographic regions of Indiana while they read isolated words, sentences, and a passage and while engaged in a conversation with an experimenter. The residential histories of the women and those of their parents were strictly controlled to ensure that the talkers were good representatives of each dialect region. The Indiana speech corpus will be used for acoustic-phonetic measurements of speech and perceptual studies on regional language variation by different groups of listeners.

Introduction and Theoretical Motivation

In recent years, researchers in the field of speech perception and spoken language processing have developed a number of speech corpora for conducting acoustic and perceptual experiments on human language. These corpora typically included a number of speakers reading a set of words or sentences [e.g., “Easy-Hard” Word Multi-Talker Speech Database (Torretta, 1995); TIMIT Acoustic-Phonetic Continuous Speech Corpus (Zue, Seneff, & Glass, 1990); Talker Variability Sentence Database (Karl & Pisoni, 1994)]. Although the talkers often included both males and females, other important indexical variables such as socioeconomic status, age, ethnicity, and regional dialect were rarely, if ever, considered in selecting the talkers. One exception to this rule is the TIMIT Acoustic-Phonetic Continuous Speech Corpus. This corpus contains spoken sentence materials from 630 talkers, representing eight different regional dialects of American English (Zue et al., 1990). While the TIMIT database was originally collected for speech recognition research, it has been used in various acoustic-phonetic studies on the role of gender, age, and dialect in linguistic variation (e.g., Byrd, 1992; Byrd, 1994; Keating, Blankenship, Byrd, Flemming, & Todaka, 1992; Keating, Byrd, Flemming, & Todaka, 1994).

While this work was going on in speech perception and speech recognition, sociolinguists have been collecting speech samples from talkers of a variety of ages, socioeconomic statuses, ethnicities, and regional dialects. The emphasis in this research has been on capturing the variability in spoken language as well as collecting extensive demographic information on each talker [e.g., Santa Barbara Corpus of Spoken American English (DuBois, Chafe, Meyer, & Thompson, 2000); CallFriend Telephone Speech Corpus for American English (Linguistic Data Consortium, 1996); TELSUR (Labov, Ash, & Boberg, in press)]. In contrast to the speech stimuli used in speech perception and speech recognition research, these speech samples are typically taken from “natural” language situations such as interviews and telephone calls, and less emphasis is placed on obtaining identical utterances from the same set of talkers. While these corpora are useful for many kinds of sociolinguistic research, they are not adequate for perceptual research in which consistent linguistic content across talkers is highly desirable.

The initial goal of the Indiana Speech Project (ISP) was to collect a large amount of speech from a number of phonologically distinct dialect regions in the state of Indiana for use in perceptual studies and acoustic analyses. We wanted to combine the best aspects of the speech perception corpora with the unique focus of the sociolinguistic corpora. In particular, our goal was to collect a large corpus of utterances that were consistent across all talkers, allowing for better control of the stimulus materials for a wide range of perceptual and acoustic studies. In addition, we also wanted the talkers in our corpus to be

16 citations


01 Jan 2002
TL;DR: The method is evaluated on the TIMIT corpus, using a speech recognizer incorporating context-independent HMMs and a bigram language model, and it appears that reductions of the word error rate can be achieved.
Abstract: In this paper a previously proposed method for the automatic construction of a lexicon with pronunciation variants for ASR is further developed and evaluated. The basic idea is to transform a lexicon of canonical forms by means of rewrite rules that are learned automatically on a training corpus of orthographically transcribed utterances. The method is evaluated on the TIMIT corpus, using a speech recognizer incorporating context-independent HMMs and a bigram language model. Reductions of the word error rate of up to 35% appear achievable; however, much lower gains are more likely in practice.

15 citations
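
The rewriting step itself is easy to picture. A toy sketch, assuming two illustrative rules (schwa deletion and flapping) rather than the rules the authors actually learned from their training corpus:

```python
# Expand a canonical pronunciation into variants by applying learned
# context-dependent rewrite rules over phone-symbol sequences.

RULES = [
    (["ax", "n"], ["n"]),         # schwa deletion: /ax n/ -> syllabic /n/
    (["t", "ax"], ["dx", "ax"]),  # flapping of /t/ before schwa
]

def apply_rule(phones, pattern, repl):
    """Rewrite the first occurrence of `pattern`, if any."""
    for i in range(len(phones) - len(pattern) + 1):
        if phones[i:i + len(pattern)] == pattern:
            return phones[:i] + repl + phones[i + len(pattern):]
    return phones

def variants(phones):
    """The canonical form plus every single-rule rewrite of it."""
    out = {tuple(phones)}
    for pattern, repl in RULES:
        out.add(tuple(apply_rule(phones, pattern, repl)))
    return sorted(out)

for v in variants(["b", "ah", "t", "ax", "n"]):   # canonical "button"
    print(" ".join(v))
# b ah dx ax n
# b ah t ax n
# b ah t n
```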


Proceedings Article
01 May 2002
TL;DR: This paper presents work on phonetically balanced (PB) and phonetically distributed (PD) sentence sets, which are parts of the text prompts for speech recording in a Large Vocabulary Continuous Speech Recognition (LVCSR) corpus for the Thai language.
Abstract: This paper presents work on phonetically balanced (PB) and phonetically distributed (PD) sentence sets, which are parts of the text prompts for speech recording in a Large Vocabulary Continuous Speech Recognition (LVCSR) corpus for the Thai language. Firstly, a protocol for Thai phonetic transcription and some essential rules of phonetic correction after the grapheme-to-phoneme (G2P) process are described. An iterative procedure of PB and PD sentence selection is conducted in order to avoid the tedious work of manual phone correction on all initial sentences. A standard text corpus, ORCHID, was chosen as the initial text. Analyses of several attributes, such as the number of words, syllables, monophones and biphones, the phone distribution, etc., in both the PB and PD sets are reported. Finally, the selected PB set is partially compared to the American English TIMIT PB set (MIT-450) and the Japanese ATR 503-sentence PB set.

15 citations
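
The iterative selection procedure can be pictured as a greedy set-cover loop. The toy version below only maximizes phone-type coverage; the paper's procedure additionally matches a target phone distribution and works over Thai phone inventories:

```python
# Greedily keep the sentence that contributes the most uncovered phone types
# until every phone type in the pool is covered.

def select_pb(sentences):
    """sentences: {id: list of phone symbols}. Returns ids in selection order."""
    all_phones = {p for phones in sentences.values() for p in phones}
    covered, order, remaining = set(), [], dict(sentences)
    while covered != all_phones:
        best = max(remaining, key=lambda s: len(set(remaining[s]) - covered))
        covered |= set(remaining.pop(best))
        order.append(best)
    return order

sents = {
    "s1": ["k", "a", "t"],
    "s2": ["k", "a", "t", "s"],
    "s3": ["m", "u", "n"],
}
print(select_pb(sents))   # ['s2', 's3']: all phone types covered by two sentences
```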


Journal ArticleDOI
TL;DR: A hybrid ANN/HMM syllable recognition system based on vowel spotting is presented, with special care taken over the vowel spotter, which combines three different techniques: discrete hidden Markov models (DHMMs), multilayer perceptrons, and heuristic rules.

Proceedings ArticleDOI
13 May 2002
TL;DR: Analyzing the TIMIT speech data, the intrinsic structures of vowels and consonants are revealed, and the usefulness of the PCA method for speech recognition is demonstrated by performing classification of the /b/, /d/ and /g/ phonemes.
Abstract: In a standard mel-frequency cepstral coefficient-based speech recognizer, it is common to use the same feature dimension and the same number of Gaussian mixtures for all subunits. We propose to use different transformations and a different number of mixtures for each subunit. We obtained the transformations from mel-frequency band energies by using the variational Bayesian principal component analysis (PCA) method. In this method, the hyperparameters of the Gaussian mixtures and the number of mixtures are automatically learned through maximization of a lower bound of the evidence, instead of the likelihood as in the conventional maximum likelihood paradigm. Analyzing the TIMIT speech data, we revealed intrinsic structures of vowels and consonants. We demonstrated the usefulness of the method for speech recognition by performing classification of the /b/, /d/ and /g/ phonemes.

Proceedings Article
01 Jan 2002
TL;DR: A novel approach to integration of formant frequency and conventional MFCC data in phone recognition experiments on TIMIT by exploiting the relationship between formant frequencies and vocal tract geometry and reducing the error rate by 6.1% relative to a conventional representation alone.
Abstract: This paper presents a novel approach to the integration of formant frequency and conventional MFCC data in phone recognition experiments on TIMIT. Naive use of formant data introduces classification errors if formant frequency estimates are poor, resulting in a net drop in performance. However, by exploiting a measure of confidence in the formant frequency estimates, formant data can contribute to classification in parts of a speech signal where it is reliable, and be replaced by conventional MFCC data when it is not. In this way an improvement of 4.7% is achieved. Moreover, by exploiting the relationship between formant frequencies and vocal tract geometry, simple formant-based vocal tract length normalisation reduces the error rate by 6.1% relative to a conventional representation alone.

Journal ArticleDOI
TL;DR: With a view to using an articulatory representation in automatic recognition of conversational speech, two nonlinear methods for mapping from formants to short-term spectra were investigated: multilayered perceptrons (MLPs), and radial basis function (RBF) networks.
Abstract: With a view to using an articulatory representation in automatic recognition of conversational speech, two nonlinear methods for mapping from formants to short-term spectra were investigated: multilayered perceptrons (MLPs), and radial basis function (RBF) networks. Five schemes for dividing the TIMIT data according to their phone class were tested. The r.m.s. error of the RBF networks was 10% less than that of the MLP, and the scheme based on discrete articulatory regions gave the greatest improvements over a single network.
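
A minimal RBF network of the kind compared here: Gaussian basis functions on the inputs (stand-ins for formant frequencies below) with a linear output layer fitted by least squares to the target spectra. The centres, width, dimensions, and toy target function are all assumptions:

```python
import numpy as np

class RBFNet:
    """Gaussian RBF layer plus a least-squares linear output layer."""

    def __init__(self, centres, width):
        self.c, self.w = centres, width                  # (K, d) centres, scalar width

    def _phi(self, X):
        d2 = ((X[:, None, :] - self.c[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * self.w ** 2))

    def fit(self, X, Y):
        self.W, *_ = np.linalg.lstsq(self._phi(X), Y, rcond=None)
        return self

    def predict(self, X):
        return self._phi(X) @ self.W

gen = np.random.default_rng(2)
formants = gen.uniform([300, 900], [900, 2500], size=(200, 2))    # F1, F2 in Hz
spectra = np.stack([np.cos(formants[:, 0] / 300), np.sin(formants[:, 1] / 800)], 1)
net = RBFNet(formants[gen.choice(200, 20, replace=False)], width=300.0)
rmse = np.sqrt(((net.fit(formants, spectra).predict(formants) - spectra) ** 2).mean())
print(f"training r.m.s. error: {rmse:.4f}")
```

Splitting the data by phone class, as in the paper, would simply mean fitting one such network per class.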

Book ChapterDOI
12 Aug 2002
TL;DR: This paper compares two approaches in using HMMs (hidden Markov models) to convert audio signals to a sequence of visemes, which are the generic face images corresponding to particular sounds.
Abstract: Visual images synchronized with audio signals can provide a user-friendly interface for man-machine interaction. The visual speech can be represented as a sequence of visemes, which are the generic face images corresponding to particular sounds. We use HMMs (hidden Markov models) to convert audio signals to a sequence of visemes. In this paper, we compare two approaches to using HMMs. In the first approach, an HMM is trained for each triviseme, which is a viseme with its left and right context, and the audio signals are directly recognized as a sequence of trivisemes. In the second approach, each triphone is modeled with an HMM, and a general triphone recognizer is used to produce a triphone sequence from the audio signals. The triviseme or triphone sequence is then converted to a viseme sequence. The performances of the two viseme recognition systems are evaluated on the TIMIT speech corpus.

Proceedings ArticleDOI
04 Aug 2002
TL;DR: The proposed partial-correlation (PARCOR) coefficient scheme, which models the vocal tract as a series of cylinders with varying cross-sectional areas, yields better identification performance than the conventional approach.
Abstract: In this work, we propose a partial-correlation (PARCOR) coefficient scheme to model the cross-sectional areas of the series of cylinders representing the vocal tract. Using the fact that acoustic impedance is proportional to the reciprocal of the cross-sectional area, the ratios of cross-sectional areas between neighboring cylinders are used to model a speaker's vocal tract. An autoregressive (AR) model is applied to the speech residual signals, which are produced by the PARCOR-based inverse vocal tract transform, to generate features. These features, together with conventional Mel-Frequency Cepstral Coefficient (MFCC) features, are fed to a Gaussian Mixture Model (GMM) identification engine. According to our experiments on the TIMIT speech database, the proposed system yields better identification performance than the conventional approach.
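
PARCOR (reflection) coefficients fall out of the Levinson-Durbin recursion on a frame's autocorrelation; under the lossless-tube model, each k_i relates neighboring cross-sectional areas through A_{i+1}/A_i = (1 - k_i)/(1 + k_i). A sketch of the recursion (the paper's residual/AR feature pipeline is not reproduced):

```python
import numpy as np

def parcor(frame, order=10):
    """Reflection (PARCOR) coefficients via the Levinson-Durbin recursion."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]   # autocorrelation
    a = np.zeros(order + 1)              # forward predictor coefficients (a[0] unused)
    err, ks = r[0], []
    for i in range(1, order + 1):
        k = (r[i] - a[1:i] @ r[i - 1:0:-1]) / err    # next reflection coefficient
        a_new = a.copy()
        a_new[i] = k
        a_new[1:i] = a[1:i] - k * a[i - 1:0:-1]      # Levinson order update
        a, err = a_new, err * (1.0 - k * k)          # prediction error shrinks
        ks.append(k)
    return np.array(ks)

gen = np.random.default_rng(3)
t = np.arange(240) / 8000.0                          # one 30 ms frame at 8 kHz
frame = np.sin(2 * np.pi * 500 * t) + 0.1 * gen.standard_normal(240)
print(np.round(parcor(frame * np.hamming(240), order=8), 3))   # all |k_i| < 1
```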

Proceedings Article
01 Jan 2002
TL;DR: Results of phone-classification experiments demonstrate that, by appropriate choice of intermediate parameterization and mappings, it is possible to achieve close to optimal performance in a simple multilevel segmental HMM.
Abstract: A theoretical and experimental analysis of a simple multilevel segmental HMM is presented in which the relationship between symbolic (phonetic) and surface (acoustic) representations of speech is regulated by an intermediate (articulatory) layer, where speech dynamics are modeled using linear trajectories. Three formant-based parameterizations and measured articulatory positions are considered as intermediate representations, from the TIMIT and MOCHA corpora respectively. The articulatory-to-acoustic mapping was performed by between 1 and 49 linear transformations. Results of phone-classification experiments demonstrate that, by appropriate choice of intermediate parameterization and mappings, it is possible to achieve close to optimal performance.

Journal ArticleDOI
TL;DR: This study investigates various techniques which improve performance and generalization of the MCE algorithm and achieves improvements of up to 10% in relative error rate on the test set.
Abstract: Discriminative training of hidden Markov models (HMMs) using minimum classification error training (MCE) has been shown to work well for certain speech recognition applications. MCE is, however, somewhat prone to overspecialization. This study investigates various techniques which improve performance and generalization of the MCE algorithm. Improvements of up to 10% in relative error rate on the test set are achieved for the TIMIT dataset.

Proceedings ArticleDOI
13 May 2002
TL;DR: A study of the separability of acoustic waveforms of speech at the phoneme level by means of principal component analysis; classification in the time domain proves very robust to additive noise, whereas classification based on spectral magnitudes does not.
Abstract: We present a study of the separability of acoustic waveforms of speech at the phoneme level. The analyzed data consist of 64 ms segments of acoustic waveforms of individual phonemes from the TIMIT database, sampled at 16 kHz. For each phoneme, by means of principal component analysis, we identify subspaces which contain a given proportion of the total energy of the available waveforms in the time domain, and also in the spectral-magnitude domain. In order to assess the separation between phonemes in the two domains, we perform pairwise classification of phonemes on clean data and on data immersed in additive white Gaussian noise down to a 0 dB signal-to-noise ratio. While the classification based on spectral magnitudes exhibits high sensitivity to additive noise, the time-domain classification proves to be very robust.

Proceedings ArticleDOI
13 May 2002
TL;DR: Two measures, phoneme error rate (PER) and phoneme confidence score (PCS), are investigated and show that both PER and PCS can help identify where the degradation from noise occurs as well as give a useful indication of how an NM algorithm may impact ASR performance.
Abstract: A common approach to measuring the impact of noise and the effectiveness of noise mitigation (NM) algorithms for Automatic Speech Recognition (ASR) systems is to compare the word error rates (WERs). However, the WER measure does not give much insight into how an NM algorithm affects phoneme-level acoustic characteristics. Such insight can help in tuning the NM parameters and may also lead to reduced research time because the impact of an NM algorithm on ASR can first be investigated on smaller corpora. In this paper, two measures, phoneme error rate (PER) and phoneme confidence score (PCS), are investigated to assess the impact of NM algorithms on the ASR performance. Experimental results using the TIMIT corpus show that both PER and PCS can help identify where the degradation from noise occurs as well as give a useful indication of how an NM algorithm may impact ASR performance. A diagnostic method based on these two measures is also proposed to assess the NM impact on ASR and help improve the NM algorithm performance.
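
PER is computed like WER, with the Levenshtein distance taken over phoneme strings instead of word strings. A minimal sketch:

```python
# Phoneme error rate: (substitutions + deletions + insertions) / reference length.

def phoneme_error_rate(ref, hyp):
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # deleting everything
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # inserting everything
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / len(ref)

ref = ["sh", "iy", "hh", "ae", "d"]      # reference transcription
hyp = ["sh", "iy", "ae", "d", "d"]       # one deletion plus one insertion
print(f"PER = {phoneme_error_rate(ref, hyp):.2f}")   # PER = 0.40
```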

Proceedings ArticleDOI
13 May 2002
TL;DR: A novel speech beamformer for noisy environments identifies the speech signal in the direction where the signal's spectrum entropy is minimized; the recognition rate increases significantly compared to that obtained with a single microphone.
Abstract: Detection of the speaker position is a crucial task in hands-free speech recognition applications. In this paper we present a novel speech beamformer for noisy environments. Initially, the localization algorithm extracts a set of candidate directions of the signal sources using array signal processing methods in the frequency domain. Then, a minimum variance (MV) beamformer identifies the speech signal in the direction where the signal's spectrum entropy is minimized. The proposed method is evaluated by a phoneme recognition system using noise recordings from an air-conditioning fan and the TIMIT speech corpus. Extended experiments, carried out in the range of 25–0 dB SNR, show almost perfect estimation of the speaker DOA in all cases. As a consequence, the recognition rate increases significantly compared to the rate obtained with a single microphone. The improvement is especially pronounced at very low SNRs.
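
The selection rule rests on spectral entropy: a speech spectrum is peaky (low entropy) while broadband fan noise is flat (high entropy). A sketch of that measurement on toy stand-in signals:

```python
import numpy as np

# Entropy of the normalized power spectrum, in bits: low for harmonic,
# speech-like signals, high for flat, noise-like signals.

def spectral_entropy(x):
    p = np.abs(np.fft.rfft(x)) ** 2
    p = p / p.sum()                                   # normalize to a pmf over bins
    return -(p * np.log2(p + 1e-12)).sum()

gen = np.random.default_rng(4)
n, sr = 4096, 16000
voiced = np.sin(2 * np.pi * 200 * np.arange(n) / sr)  # harmonic, speech-like
noise = gen.standard_normal(n)                        # fan-like broadband noise
print(f"speech-like: {spectral_entropy(voiced):.1f} bits, "
      f"noise: {spectral_entropy(noise):.1f} bits")
# the candidate direction whose signal has the lowest entropy is taken as the DOA
```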

Proceedings ArticleDOI
13 May 2002
TL;DR: It is shown that the speech correlation structure may be used to estimate the communication channel; an efficient algorithm is proposed to compute this estimate, and it is argued that the resulting channel estimate is more accurate because its underlying hypothesis is better satisfied than the original CMN hypothesis.
Abstract: Cepstral mean normalization (CMN) is the standard technique for channel robustness. Despite its good performance, the effectiveness of CMN for short sentences is questionable. The underlying hypothesis of CMN, that the speech cepstral mean is constant, is not valid for short processing windows, which implies the removal of some phonetic information. In this paper we show that the speech correlation structure may be used to estimate the communication channel, and we propose an efficient algorithm to compute this estimate. We argue that the resulting channel estimate is more accurate because its underlying hypothesis is better satisfied than the original CMN hypothesis. Results for the Kai-Fu Lee phone recognition task on NTIMIT, with acoustic models trained on TIMIT (mismatched conditions), show that our method provides an 8% relative error rate reduction compared to CMN.
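
For reference, the CMN baseline being argued against is one line of arithmetic: subtract the per-utterance cepstral mean. The sketch also shows why CMN removes a fixed channel offset; the paper's correlation-based channel estimator is not reproduced here:

```python
import numpy as np

# CMN: subtract the per-utterance mean of each cepstral coefficient. For a
# linear channel (an additive offset in the cepstral domain), the mean absorbs
# the offset; on short utterances it also absorbs phonetic content, which is
# the weakness the paper targets.

def cepstral_mean_normalization(cepstra):
    """cepstra: (frames, coeffs) array; returns mean-normalized features."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

frames = np.random.default_rng(5).standard_normal((12, 13))   # a short utterance
channel = np.linspace(0.5, -0.5, 13)                          # fixed channel offset
print(np.allclose(cepstral_mean_normalization(frames + channel),
                  cepstral_mean_normalization(frames)))       # True: offset removed
```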

Proceedings ArticleDOI
09 Dec 2002
TL;DR: A discriminative classification-based approach for speaker recognition that makes use of regularized least squares regression based input-output hidden Markov models (IOHMM) as the classifier for closed-set, text-independent speaker identification.
Abstract: The purpose of speaker recognition is to determine a speaker's identity from his/her speech utterances. Every speaker has his/her own physiological as well as behavioral characteristics embedded in his/her speech utterances. These characteristics can be extracted from utterances and statistically modeled. Through pattern recognition of unseen test speech against statistically trained models, a speaker's identity can be recognized. In this paper, we present a discriminative classification-based approach for speaker recognition. The system makes use of regularized least squares regression (RLSR) based input-output hidden Markov models (IOHMM) as the classifier for closed-set, text-independent speaker identification. The IOHMM allows us to map input sequences to output sequences, using the same processing style as recurrent neural networks. The RLSR allows the IOHMM to be trained in a more discriminative style. The use of hidden Markov models (HMM) and support vector machines (SVM) has also been studied. The performance of the system is assessed using a set of male and female speakers drawn from the TIMIT corpus.

Proceedings ArticleDOI
13 May 2002
TL;DR: The proposed Signal Bias Removal based GMM (SBR-GMM) reduces environmental variation under mismatched conditions by removing the bias of the distorted input signal and adapting the speaker-dependent characteristics from a clean, text-independent and speaker-independent background GMM.
Abstract: In this paper, we focus on a method combining SBR and GMM-UBM and its capacity for robust speaker recognition. While each method has achieved improvements independently in its own field, the two share a similar framework. The proposed Signal Bias Removal based GMM (SBR-GMM) reduces environmental variation under mismatched conditions by removing the bias of the distorted input signal and adapting the speaker-dependent characteristics from a clean, text-independent and speaker-independent background GMM. In our experiments, we compared closed-set speaker identification with conventional CMS and with the proposed method on the TIMIT and NTIMIT databases. In particular, in the third set of experiments on NTIMIT, we were able to improve the recognition rate by 27.4% over CMS using the robust feature.

Book ChapterDOI
30 May 2002
TL;DR: A real-time wideband speech codec adopting a wavelet packet based methodology, in which the probability model of the quantized coefficients is adapted frame by frame by means of a competitive neural network to better model the speech characteristics of the current speaker.
Abstract: We developed a real-time wideband speech codec adopting a wavelet packet based methodology. The transform-domain coefficients were first quantized by means of a mid-tread uniform quantizer and then encoded with arithmetic coding. In the first step the wavelet coefficients were quantized using a psycho-acoustic model. The second step was carried out by adapting the probability model of the quantized coefficients frame by frame by means of a competitive neural network. The neural network was trained on the TIMIT corpus and its weights were updated in real time during compression in order to better model the speech characteristics of the current speaker. The coding/decoding algorithm was first written in C and then optimised on the TMS320C6000 DSP platform.

Proceedings ArticleDOI
13 May 2002
TL;DR: A 3-state AF model with multiple observation distributions is introduced, giving a better modeling of the articulatory features within a phone and resulting in an improvement of about 1% in phone recognition on the TIMIT task.
Abstract: In this paper, we propose two improvements to the articulatory feature (AF) models. We introduce the use of a 3-state AF model with multiple observation distributions that gives a better modeling of the articulatory features within a phone. This results in an improvement of about 1% in phone recognition on the TIMIT task. Combining the AF model with an acoustic-based HMM achieves an improvement of 1.6% compared to using acoustic features only. We then introduce the asynchronous state combination of the 3-state AF models with the acoustic-based HMM and obtain an additional improvement of 1.7%.

Proceedings ArticleDOI
01 Jul 2002
TL;DR: A novel speech beamformer for moving speakers in noisy environments identifies the speech signal DOA as the direction in which the signal's spectrum entropy is minimized, and shows significant improvement in the recognition rate for moving speakers, especially at very low SNR.
Abstract: In hands-free speech recognition of moving speakers, the time interval over which the source position can be assumed stationary varies. It is very common for the speaker to move rapidly within the exploited data window. In such cases the conventional fixed-window direction of arrival (DOA) estimation may lead to poor tracking performance. In this paper we present a novel speech beamformer for moving speakers in noisy environments. The localization algorithm extracts a set of candidate DOAs of the signal sources using array signal processing methods in the frequency domain. A minimum variance (MV) beamformer identifies the speech signal DOA as the direction in which the signal's spectrum entropy is minimized. The same localization algorithm is used to detect the direction closest to the initial estimate using a smaller window. The proposed method is evaluated using a phoneme recognition system and noise recordings from an air-conditioning fan and the TIMIT speech corpus. Extended experiments, carried out in the range of 25–0 dB SNR, show significant improvement in the recognition rate for moving speakers, especially at very low SNRs.

Journal ArticleDOI
TL;DR: Experimental evaluations based on 258 speakers from the TIMIT and NTIMIT corpora suggest that the feature mappers improve the verification performance remarkably.
Abstract: The performance of speaker verification systems is often compromised in real-world environments. For example, variations in handset characteristics can cause severe performance degradation. This paper presents a novel method to overcome this problem by using a non-linear handset mapper. Under this method, a mapper is constructed by training an elliptical basis function network using distorted speech features as inputs and the corresponding clean features as the desired outputs. During feature recuperation, clean features are recovered by feeding the distorted features to the feature mapper. The recovered features are then presented to a speaker model as if they were derived from clean speech. Experimental evaluations based on 258 speakers from the TIMIT and NTIMIT corpora suggest that the feature mappers improve the verification performance remarkably.