
Showing papers on "TIMIT" published in 2007


Proceedings Article
01 Jan 2007
TL;DR: A new approach for keyword spotting is proposed, which is based on large margin and kernel methods rather than on HMMs, and shows theoretically that it attains high area under the ROC curve.
Abstract: This paper proposes a new approach for keyword spotting, which is not based on HMMs. The proposed method employs a new discriminative learning procedure, in which the learning phase aims at maximizing the area under the ROC curve, as this quantity is the most common measure to evaluate keyword spotters. The keyword spotter we devise is based on non-linearly mapping the input acoustic representation of the speech utterance along with the target keyword into an abstract vector space. Building on techniques used for large margin methods for predicting whole sequences, our keyword spotter distills to a classifier in the abstract vector-space which separates speech utterances in which the keyword is uttered from speech utterances in which the keyword is not uttered. We describe a simple iterative algorithm for learning the keyword spotter and discuss its formal properties. Experiments with the TIMIT corpus show that our method outperforms the conventional HMM-based approach.
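A minimal sketch of the flavor of iterative large-margin update the abstract describes, assuming a precomputed joint feature map phi(utterance, keyword) (random vectors here); the real system maps acoustics non-linearly, but the pairwise hinge update below is the standard way to push the AUC up over (positive, negative) utterance pairs:

```python
import numpy as np

def train_keyword_spotter(pos_feats, neg_feats, dim, epochs=10, C=1.0):
    """pos_feats/neg_feats: lists of phi(utterance, keyword) vectors."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for phi_pos, phi_neg in zip(pos_feats, neg_feats):
            # Hinge loss on the ranking margin between the pair.
            if w @ phi_pos - w @ phi_neg < 1.0:
                w += C * (phi_pos - phi_neg)   # simple subgradient step
    return w

# Toy usage with random "feature vectors" standing in for real acoustics.
rng = np.random.default_rng(0)
pos = [rng.normal(0.5, 1.0, 8) for _ in range(50)]
neg = [rng.normal(-0.5, 1.0, 8) for _ in range(50)]
w = train_keyword_spotter(pos, neg, dim=8)
auc = np.mean([[float(w @ p > w @ n) for n in neg] for p in pos])
print(f"empirical AUC on training pairs: {auc:.2f}")
```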

116 citations


Proceedings ArticleDOI
27 Aug 2007
TL;DR: It is found that well-trained neutral acoustic models can be effectively used as a front-end for emotion recognition, and once trained with MFB, they may work reasonably well regardless of the channel characteristics.
Abstract: Since emotional speech can be regarded as a variation on neutral (non-emotional) speech, it is expected that a robust neutral speech model can be useful in contrasting different emotions expressed in speech. This study explores this idea by creating acoustic models trained with spectral features, using the emotionally-neutral TIMIT corpus. The performance is tested with two emotional speech databases: one recorded with a microphone (acted), and another recorded from a telephone application (spontaneous). It is found that accuracies up to 78% and 65% can be achieved in the binary and category emotion discriminations, respectively. Raw Mel Filter Bank (MFB) output was found to perform better than conventional MFCC, with both broad-band and telephone-band speech. These results suggest that well-trained neutral acoustic models can be effectively used as a front-end for emotion recognition, and once trained with MFB, they may work reasonably well regardless of the channel characteristics. Index Terms: Emotion recognition, Neutral speech, HMMs, Mel filter bank (MFB), TIMIT
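For concreteness, a sketch of the two front-ends being compared, using librosa with an assumed 26-band filterbank and bundled example audio (not the authors' configuration or data): MFB keeps the raw log Mel filter bank outputs, while MFCC adds a DCT on top of the same filterbank:

```python
import librosa

# Placeholder audio shipped with librosa; any 16 kHz speech file would do.
y, sr = librosa.load(librosa.ex("trumpet"))
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=26)
log_mfb = librosa.power_to_db(mel)                 # MFB features
mfcc = librosa.feature.mfcc(S=log_mfb, n_mfcc=13)  # MFCC = DCT of log-MFB
print(log_mfb.shape, mfcc.shape)
```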

94 citations


Proceedings ArticleDOI
15 Apr 2007
TL;DR: This paper compares three frameworks for discriminative training of continuous-density hidden Markov models (CD-HMMs) and proposes a new framework based on margin maximization, which yields significantly lower error rates than both CML and MCE training.
Abstract: In this paper we compare three frameworks for discriminative training of continuous-density hidden Markov models (CD-HMMs). Specifically, we compare two popular frameworks, based on conditional maximum likelihood (CML) and minimum classification error (MCE), to a new framework based on margin maximization. Unlike CML and MCE, our formulation of large margin training explicitly penalizes incorrect decodings by an amount proportional to the number of mislabeled hidden states. It also leads to a convex optimization over the parameter space of CD-HMMs, thus avoiding the problem of spurious local minima. We used discriminatively trained CD-HMMs from all three frameworks to build phonetic recognizers on the TIMIT speech corpus. The different recognizers employed exactly the same acoustic front end and hidden state space, thus enabling us to isolate the effect of different cost functions, parameterizations, and numerical optimizations. Experimentally, we find that our framework for large margin training yields significantly lower error rates than both CML and MCE training.
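A toy sketch of the margin constraint described above, with made-up decoding scores in place of a real CD-HMM: the required separation from a competing state sequence grows with its Hamming distance from the correct one:

```python
import numpy as np

def large_margin_loss(score_correct, candidates, correct_states):
    """candidates: list of (score, state_sequence) for competing decodings."""
    losses = []
    for score, states in candidates:
        hamming = sum(a != b for a, b in zip(correct_states, states))
        # Hinge: penalize when the margin falls short of the Hamming distance.
        losses.append(max(0.0, hamming + score - score_correct))
    return max(losses)  # loss of the worst-offending decoding

correct = [0, 0, 1, 2, 2]
cands = [(-3.1, [0, 1, 1, 2, 2]), (-2.0, [1, 1, 1, 2, 2])]
print(large_margin_loss(-1.5, cands, correct))  # 1.5: second decoding violates
```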

77 citations


Proceedings ArticleDOI
01 Dec 2007
TL;DR: An automatic speech recognizer is developed by training cross-word triphone models based on the TIMIT corpus and an "extended" pronunciation lexicon is developed that incorporates the predicted phonetic confusions to generate additional, erroneous pronunciation variants for each word.
Abstract: This work aims to derive salient mispronunciations made by Chinese (L1 being Cantonese) learners of English (L2 being American English) in order to support the design of pedagogical and remedial instructions. Our approach is grounded in the theory of language transfer and involves systematic phonological comparison between the two languages to predict possible phonetic confusions that may lead to mispronunciations. We collect a corpus of speech recordings from some 21 Cantonese learners of English. We develop an automatic speech recognizer by training cross-word triphone models based on the TIMIT corpus. We also develop an "extended" pronunciation lexicon that incorporates the predicted phonetic confusions to generate additional, erroneous pronunciation variants for each word. The extended pronunciation lexicon is used to produce a confusion network in recognition of the English speech recordings of Cantonese learners. We refer to the statistics of the erroneous recognition outputs to derive salient mispronunciations that corroborate the predictions based on phonological comparison.
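An illustrative sketch of the extended-lexicon idea; the confusion rules below are commonly cited examples of Cantonese-English transfer and stand in for the paper's actual phonological predictions:

```python
# Hypothetical confusion rules (target phone -> likely substitutions).
CONFUSIONS = {"th": ["f"], "r": ["w", "l"], "v": ["w"]}

def extend_lexicon(lexicon):
    """Add one-substitution erroneous variants to each word's pronunciation."""
    extended = {}
    for word, phones in lexicon.items():
        variants = {tuple(phones)}
        for i, p in enumerate(phones):
            for sub in CONFUSIONS.get(p, []):
                variant = list(phones)
                variant[i] = sub
                variants.add(tuple(variant))
        extended[word] = sorted(variants)
    return extended

lexicon = {"three": ["th", "r", "iy"], "very": ["v", "eh", "r", "iy"]}
for word, prons in extend_lexicon(lexicon).items():
    print(word, prons)
```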

74 citations


Proceedings ArticleDOI
15 Apr 2007
TL;DR: Initial analyses show that MMC is a promising method for the automatic detection of sub-phonetic information in the speech signal and is highly competitive with existing unsupervised methods for the automatic detection of phoneme boundaries.
Abstract: Maximum margin clustering (MMC) is a relatively new and promising kernel method. In this paper, we apply MMC to the task of unsupervised speech segmentation. We present three automatic speech segmentation methods based on MMC, which are tested on TIMIT and evaluated on the level of phoneme boundary detection. The results show that MMC is highly competitive with existing unsupervised methods for the automatic detection of phoneme boundaries. Furthermore, initial analyses show that MMC is a promising method for the automatic detection of sub-phonetic information in the speech signal.

63 citations


Journal ArticleDOI
TL;DR: A new phoneme classifier is proposed consisting of a modular arrangement of experts, with one expert assigned to each BPG and focused on discriminating between phonemes within that BPG.
Abstract: In phoneme recognition experiments, it was found that approximately 75% of misclassified frames were assigned labels within the same broad phonetic group (BPG). While the phoneme can be described as the smallest distinguishable unit of speech, phonemes within BPGs contain very similar characteristics and can be easily confused. However, different BPGs, such as vowels and stops, possess very different spectral and temporal characteristics. In order to accommodate the full range of phonemes, acoustic models of speech recognition systems calculate input features from all frequencies over a large temporal context window. A new phoneme classifier is proposed consisting of a modular arrangement of experts, with one expert assigned to each BPG and focused on discriminating between phonemes within that BPG. Due to the different temporal and spectral structure of each BPG, novel feature sets are extracted using mutual information, to select a relevant time-frequency (TF) feature set for each expert. To construct a phone recognition system, the output of each expert is combined with a baseline classifier under the guidance of a separate BPG detector. Considering phoneme recognition experiments using the TIMIT continuous speech corpus, the proposed architecture afforded significant error rate reductions of up to 5% relative.
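A sketch of the mutual-information feature selection step on synthetic data, using scikit-learn's estimator (the paper's exact MI estimator and TF feature set are not specified here):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# For one broad phonetic group (BPG) expert: rank time-frequency features
# by mutual information with the within-BPG phoneme label, keep the best.
rng = np.random.default_rng(0)
n_frames, n_tf_features = 500, 60
X = rng.normal(size=(n_frames, n_tf_features))  # stand-in TF features
y = rng.integers(0, 4, size=n_frames)           # phoneme labels within a BPG
X[:, 7] += y                                    # make feature 7 informative

mi = mutual_info_classif(X, y, random_state=0)
top = np.argsort(mi)[::-1][:10]                 # keep the 10 best TF features
print("selected TF features for this expert:", top)
```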

61 citations


Proceedings ArticleDOI
Li Deng1, Dong Yu1
15 Apr 2007
TL;DR: The earlier version of the hidden trajectory model (HTM) for speech dynamics, which predicts the "static" cepstra as the observed acoustic feature, is generalized to a joint static/delta-cepstra HTM, enabling efficient computation of the joint likelihood for both static and delta cepstral sequences as the acoustic features given the model.
Abstract: The earlier version of the hidden trajectory model (HTM) for speech dynamics which predicts the "static" cepstra as the observed acoustic feature is generalized to one which predicts joint static cepstra and their temporal differentials (i.e., delta cepstra). The formulation of this generalized HTM is presented in the generative-modeling framework, enabling efficient computation of the joint likelihood for both static and delta cepstral sequences as the acoustic features given the model. The parameter estimation techniques for the new model are developed and presented, giving closed-form estimation formulas after the use of vector Taylor series approximation. We show principled generalization from the earlier static-cepstra HTM to the new static/delta-cepstra HTM not only in terms of model formulations but also in terms of their respective analytical forms in (monophone) parameter estimation. Experimental results on the standard TIMIT phonetic recognition task demonstrate recognition accuracy improvement over the earlier best HTM system, both significantly better than state-of-the-art triphone HMM systems.

53 citations


Journal ArticleDOI
TL;DR: The results confirm that an effective in-set/out-of-set speaker recognition system can be formulated using discriminative training for rapid tagging of input speakers from limited training and test data sizes.
Abstract: In this paper, the problem of identifying in-set versus out-of-set speakers for limited training/test data durations is addressed. The recognition objective is to form a decision regarding an input speaker as being a legitimate member of a set of enrolled speakers or outside speakers. The general goal is to perform rapid speaker model construction from limited enrollment and test size resources for in-set testing for input audio streams. In-set detection can help ensure security and proper access to private information, as well as detect and track input speakers. Areas of application of these concepts include rapid speaker tagging and tracking for information retrieval, communication networks, personal device assistants, and location access. We propose an integrated system with emphasis on short-enrollment data (about 5 s of speech for each enrolled speaker) and test data (2-8 s) within a text-independent mode. We present a simple and yet powerful decision rule to accept or reject speakers using a discriminative vector in the decision score space, together with statistical hypothesis testing based on the conventional likelihood ratio test. Discriminative training is introduced to further improve system performance for both decision techniques, by employing minimum classification error and minimum verification error frameworks. Experiments are performed using three separate corpora. Using the YOHO speaker recognition database, the alternative decision rule achieves measurable improvement over the likelihood ratio test, and discriminative training consistently enhances overall system performance with relative improvements ranging from 11.26% to 28.68%. A further extended evaluation using the TIMIT (CORPUS1) and actual noisy aircraft communications data (CORPUS2) shows measurable improvement over the traditional MAP-based scheme using the likelihood ratio test (MAP-LRT), with average EERs of 9%-23% for TIMIT and 13%-32% for noisy aircraft communications. The results confirm that an effective in-set/out-of-set speaker recognition system can be formulated using discriminative training for rapid tagging of input speakers from limited training and test data sizes.
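A sketch of the conventional likelihood ratio test that serves as the baseline decision rule, with synthetic 2-D "features" and scikit-learn GMMs standing in for real speaker and background models:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Accept the claim that a test segment belongs to an enrolled speaker when
# its average log-likelihood under the speaker model beats the background
# (UBM) model by a threshold.
rng = np.random.default_rng(0)
ubm = GaussianMixture(4, random_state=0).fit(rng.normal(0, 2, (2000, 2)))
spk = GaussianMixture(2, random_state=0).fit(rng.normal(1.0, 0.5, (100, 2)))

def in_set(test_frames, threshold=0.0):
    llr = spk.score(test_frames) - ubm.score(test_frames)  # mean log-LR
    return llr > threshold, llr

decision, llr = in_set(rng.normal(1.0, 0.5, (40, 2)))
print(decision, round(llr, 2))
```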

51 citations


Journal ArticleDOI
TL;DR: This work overviews some recently proposed discrete Fourier transform (DFT)- and discrete wavelet packet transform (DWPT)-based speech parameterization methods and compares their performance against traditional techniques, such as the Mel-frequency cepstral coefficients (MFCC) and perceptual linear predictive (PLP), which presently dominate the speech recognition field.
Abstract: In the present work we overview some recently proposed discrete Fourier transform (DFT)- and discrete wavelet packet transform (DWPT)-based speech parameterization methods and evaluate their performance on the speech recognition task. Specifically, in order to assess the practical value of these less studied speech parameterization methods, we evaluate them in a common experimental setup and compare their performance against traditional techniques, such as the Mel-frequency cepstral coefficients (MFCC) and perceptual linear predictive (PLP) cepstral coefficients which presently dominate the speech recognition field. In particular, utilizing the well established TIMIT speech corpus and employing the Sphinx-III speech recognizer, we present comparative results of 8 different speech parameterization techniques.
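A sketch of one DWPT-based parameterization of the kind surveyed, assuming a db4 wavelet, depth-4 decomposition, and a 13-dimensional cepstrum (all illustrative choices, not the paper's):

```python
import numpy as np
import pywt
from scipy.fftpack import dct

def dwpt_features(frame, wavelet="db4", level=4, n_ceps=13):
    """Log subband energies from a wavelet packet tree, DCT-decorrelated."""
    wp = pywt.WaveletPacket(data=frame, wavelet=wavelet, maxlevel=level)
    energies = [np.sum(node.data ** 2) + 1e-10
                for node in wp.get_level(level, order="freq")]
    log_e = np.log(energies)
    return dct(log_e, norm="ortho")[:n_ceps]       # cepstrum-like features

frame = np.random.default_rng(0).normal(size=400)  # one 25 ms frame @ 16 kHz
print(dwpt_features(frame).shape)
```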

47 citations


Proceedings ArticleDOI
01 Dec 2007
TL;DR: A hierarchical large-margin Gaussian mixture modeling framework, trained by alternately updating parameters at different levels in the tree to maximize the joint margin of the overall classification, achieves good performance with fewer parameters than single-level classifiers.
Abstract: In this paper we present a hierarchical large-margin Gaussian mixture modeling framework and evaluate it on the task of phonetic classification. A two-stage hierarchical classifier is trained by alternately updating parameters at different levels in the tree to maximize the joint margin of the overall classification. Since the loss function required in training is convex in the parameter space, the problem of spurious local minima is avoided. The model achieves good performance with fewer parameters than single-level classifiers. In the TIMIT benchmark task of context-independent phonetic classification, the proposed modeling scheme achieves a state-of-the-art phonetic classification error of 16.7% on the core test set. This is an absolute reduction of 1.6% from the best previously reported result on this task, and 4-5% lower than a variety of classifiers that have been recently examined on this task.

37 citations


Journal ArticleDOI
TL;DR: It is shown that the addition of a hidden dynamic state leads to increases in accuracy over otherwise equivalent static models, and a time-asynchronous decoding strategy suited to recognition with segment models is proposed.
Abstract: The majority of automatic speech recognition systems rely on hidden Markov models, in which Gaussian mixtures model the output distributions associated with sub-phone states. This approach, whilst successful, models consecutive feature vectors (augmented to include derivative information) as statistically independent. Furthermore, spatial correlations present in speech parameters are frequently ignored through the use of diagonal covariance matrices. This paper continues the work of Digalakis and others who proposed instead a first-order linear state-space model which has the capacity to model underlying dynamics, and furthermore give a model of spatial correlations. This paper examines the assumptions made in applying such a model and shows that the addition of a hidden dynamic state leads to increases in accuracy over otherwise equivalent static models. We also propose a time-asynchronous decoding strategy suited to recognition with segment models. We describe implementation of decoding for linear dynamic models and present TIMIT phone recognition results.
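A minimal sketch of the first-order linear state-space model underlying this work, with toy parameters: the Kalman filter recursions yield the per-segment log-likelihood that a segment-model decoder would use:

```python
import numpy as np

def kalman_loglik(Y, F, H, Q, R, x0, P0):
    """Log-likelihood of observations Y under x_t = F x_{t-1} + w,
    y_t = H x_t + v, with process/observation noise covariances Q and R."""
    x, P, ll = x0, P0, 0.0
    for y in Y:
        x, P = F @ x, F @ P @ F.T + Q              # predict
        S = H @ P @ H.T + R                        # innovation covariance
        e = y - H @ x                              # innovation
        ll += -0.5 * (e @ np.linalg.solve(S, e)
                      + np.linalg.slogdet(2 * np.pi * S)[1])
        K = P @ H.T @ np.linalg.inv(S)             # update
        x, P = x + K @ e, (np.eye(len(x)) - K @ H) @ P
    return ll

rng = np.random.default_rng(0)
Y = rng.normal(size=(20, 3))                       # one segment of features
F, H = 0.9 * np.eye(2), rng.normal(size=(3, 2))    # toy dynamics / output map
print(kalman_loglik(Y, F, H, 0.1 * np.eye(2), 0.5 * np.eye(3),
                    np.zeros(2), np.eye(2)))
```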

Proceedings ArticleDOI
15 Apr 2007
TL;DR: This system achieves state-of-the-art single classifier performance on the TIMIT phonetic classification task, (slightly) beating other recent systems and showing that in the presence of additive noise, the model is much more robust than a well-trained Gaussian mixture model.
Abstract: We perform phonetic classification with an architecture whose elements are binary classifiers trained via linear regularized least squares (RLS). RLS is a simple yet powerful regularization algorithm with the desirable property that a good value of the regularization parameter can be found efficiently by minimizing leave-one-out error on the training set. Our system achieves state-of-the-art single classifier performance on the TIMIT phonetic classification task, (slightly) beating other recent systems. We also show that in the presence of additive noise, our model is much more robust than a well-trained Gaussian mixture model.
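A sketch of the property the abstract highlights, for a linear kernel on synthetic data: RLS leave-one-out residuals follow from quantities already computed in training, so the regularization parameter can be tuned without any retraining:

```python
import numpy as np

def rls_loo_errors(K, y, lam):
    """LOO residuals y_i - f_i^loo for kernel RLS, in closed form."""
    G = np.linalg.inv(K + lam * np.eye(len(y)))    # (K + lam I)^-1
    alpha = G @ y                                  # dual RLS solution
    return alpha / np.diag(G)                      # classic PRESS-style formula

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.sign(X[:, 0] + 0.3 * rng.normal(size=100))  # binary +-1 labels
K = X @ X.T                                        # linear kernel
for lam in (0.01, 0.1, 1.0, 10.0):
    loo = rls_loo_errors(K, y, lam)
    err = np.mean(np.sign(y - loo) != y)           # LOO classification error
    print(f"lambda={lam:<5} LOO error={err:.3f}")
```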

Proceedings ArticleDOI
03 Mar 2007
TL;DR: This paper presents the UT-SCOPE database and an automatic and perceptual evaluation of Lombard speech in in-set speaker recognition, suggesting that a deeper understanding of the cognitive factors involved in perceptual speaker ID offers meaningful insights for the further development of automated systems.
Abstract: This paper presents the UT-SCOPE database and an automatic and perceptual evaluation of Lombard speech in in-set speaker recognition. The speech used for the analysis forms a part of the UT-SCOPE database and consists of sentences from the well-known TIMIT corpus, spoken in the presence of highway, large-crowd and pink noise. First, the deterioration of the EER of an in-set speaker identification system trained on neutral speech and tested with Lombard speech is illustrated. A clear demarcation between the effect of noise and the Lombard effect is also given by testing with noisy Lombard speech. The effect of test-token duration on system performance under the Lombard condition is addressed. We also report results from in-set speaker recognition tasks performed by human subjects in comparison to the system performance. Overall observations suggest that a deeper understanding of the cognitive factors involved in perceptual speaker ID offers meaningful insights for the further development of automated systems.

Journal ArticleDOI
TL;DR: This paper introduces a genetic programming system to evolve programs capable of speaker verification, evaluates its performance with the publicly available TIMIT corpus, and shows the effect of a simulated telephone network on classification results.
Abstract: Robust automatic speaker verification has become increasingly desirable in recent years with the growing trend toward remote security verification procedures for telephone banking, biometric security measures and similar applications. While many approaches have been applied to this problem, genetic programming offers inherent feature selection and solutions that can be meaningfully analyzed, making it well suited to this task. This paper introduces a genetic programming system to evolve programs capable of speaker verification and evaluates its performance with the publicly available TIMIT corpus. We also show the effect of a simulated telephone network on classification results, which highlights the principal advantage, namely robustness to both additive and convolutive noise.

Proceedings ArticleDOI
01 Aug 2007
TL;DR: Improved HMM/SVM methods are presented for a two-stage phoneme segmentation framework, validated on two speech databases: the TIMIT English database and the MATBN Mandarin Chinese database.
Abstract: This paper presents improved HMM/SVM methods for a two-stage phoneme segmentation framework, which tries to imitate the human phoneme segmentation process. The first stage performs hidden Markov model (HMM) forced alignment according to the minimum boundary error (MBE) criterion. The objective is to align a phoneme sequence of a speech utterance with its acoustic signal counterpart based on MBE-trained HMMs and explicit phoneme duration models. The second stage uses the support vector machine (SVM) method to refine the hypothesized phoneme boundaries derived by HMM-based forced alignment. The efficacy of the proposed framework has been validated on two speech databases: the TIMIT English database and the MATBN Mandarin Chinese database.
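A sketch of the second (SVM refinement) stage only, on synthetic frame features; the search window and the use of scikit-learn's SVC decision values are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import SVC

# Train a frame-level boundary detector, then move each HMM-hypothesized
# boundary to the highest-scoring frame within a small search window.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(400, 10))          # frame context features
y_train = rng.integers(0, 2, size=400)        # 1 = boundary frame
svm = SVC().fit(X_train, y_train)

def refine_boundary(frame_feats, hmm_boundary, window=3):
    lo = max(0, hmm_boundary - window)
    hi = min(len(frame_feats), hmm_boundary + window + 1)
    scores = svm.decision_function(frame_feats[lo:hi])
    return lo + int(np.argmax(scores))        # refined boundary frame index

utt = rng.normal(size=(100, 10))
print(refine_boundary(utt, hmm_boundary=42))
```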

Proceedings ArticleDOI
15 Apr 2007
TL;DR: To integrate the two very different modules, Brno's phone recognizer is modified into a phone lattice hypothesizer that produces high-quality phone lattices, which are fed directly into the knowledge-based module for rescoring.
Abstract: This study is a result of a collaboration project between two groups, one from Brno University of Technology and the other from Georgia Institute of Technology (GT). Recently the Brno recognizer is known to outperform many state-of-the-art systems on phone recognition, while the GT knowledge-based lattice rescoring module has been shown to improve system performance on a number of speech recognition tasks. We believe a combination of the two system results in high-accuracy phone recognition. To integrate the two very different modules, we modify Brno's phone recognizer into a phone lattice hypothesizer to produce high-quality phone lattices, and feed them directly into the knowledge-based module to rescore the lattices. We test the combined system on the TIMIT continuous phone recognition task without retraining the individual subsystems, and we observe that the phone error rate was effectively reduced to 19.78% from 24.41% produced by the Brno phone recognizer. To the best of the authors' knowledge this result represents the lowest ever error rate reported on the TIMIT continuous phone recognition task.

Journal ArticleDOI
TL;DR: This letter proposes a new architecture for voice conversion that is based on a joint neural-wavelet approach and examines the characteristics of many wavelet families to determine the one that best matches the requirements of the proposed system.

Proceedings ArticleDOI
15 Apr 2007
TL;DR: This paper proposes using a support vector machine (SVM) to refine the hypothesized phone transition boundaries given by HMM-based Viterbi forced alignment on the TIMIT speech corpus; the approach performs as well as recent discriminative HMM-based segmentation.
Abstract: In this paper, we propose using a support vector machine (SVM) to refine the hypothesized phone transition boundaries given by HMM-based Viterbi forced alignment. We conducted experiments on the TIMIT speech corpus. The phone transitions were automatically partitioned into 46 clusters according to their acoustic characteristics, using cross-validation on the training data; hence, 46 phone-transition-dependent SVM classifiers were used for phone boundary refinement. The proposed HMM-SVM approach performs as well as the recent discriminative HMM-based segmentation. The best accuracies achieved are 81.23% within a tolerance of 10 ms and 92.47% within a tolerance of 20 ms. The mean boundary distance is 7.73 ms.
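The accuracies above are tolerance-based; a small sketch of that scoring, assuming a one-to-one pairing of reference and hypothesized boundaries:

```python
import numpy as np

def boundary_accuracy(ref_ms, hyp_ms, tol_ms):
    """Fraction of hypothesized boundaries within tol_ms of the reference,
    plus the mean boundary distance."""
    ref, hyp = np.asarray(ref_ms, float), np.asarray(hyp_ms, float)
    hits = np.abs(ref - hyp) <= tol_ms
    return hits.mean(), np.mean(np.abs(ref - hyp))

ref = [120, 260, 395, 540]   # toy reference boundaries (ms)
hyp = [112, 255, 410, 545]   # toy hypothesized boundaries (ms)
for tol in (10, 20):
    acc, dist = boundary_accuracy(ref, hyp, tol)
    print(f"tolerance {tol} ms: accuracy {acc:.2%}, mean distance {dist:.1f} ms")
```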

Proceedings ArticleDOI
01 Dec 2007
TL;DR: The use of regularization effectively prevents overfitting and HCRFs are able to make use of non-independent features in phone classification, at least with small numbers of mixture components, while HMMs degrade due to their strong independence assumptions.
Abstract: We show a number of improvements in the use of Hidden Conditional Random Fields (HCRFs) for phone classification on the TIMIT and Switchboard corpora. We first show that the use of regularization effectively prevents overfitting, improving over other methods such as early stopping. We then show that HCRFs are able to make use of non-independent features in phone classification, at least with small numbers of mixture components, while HMMs degrade due to their strong independence assumptions. Finally, we successfully apply Maximum a Posteriori adaptation to HCRFs, decreasing the phone classification error rate in the Switchboard corpus by around 1%-5% given only small amounts of adaptation data.

Journal ArticleDOI
TL;DR: Distribution-scaling-based score normalization techniques are developed specifically for the in-set/out-of-set problem and compared against existing score normalization schemes used in open-set speaker recognition.
Abstract: In this paper, the problem of identifying in-set versus out-of-set speakers using extremely limited enrollment data is addressed. The recognition objective is to form a binary decision regarding an input speaker as being a legitimate member of a set of enrolled speakers or not. Here, the emphasis is on low enrollment (about 5 sec of speech for each enrolled speaker) and test data durations (2-8 sec), in a text-independent scenario. In order to overcome the limited enrollment, data from speakers that are acoustically close to a given in-set speaker are used to form an informative prior (base model) for speaker adaptation. Score normalization for in-set systems is addressed, and the difficulty of using conventional score normalization schemes for in-set speaker recognition is highlighted. Distribution scaling based score normalization techniques are developed specifically for the in-set/out-of-set problem and compared against existing score normalization schemes used in open-set speaker recognition. Experiments are performed using the following three separate corpora: (1) noise-free TIMIT; (2) noisy in-vehicle CU-move; and (3) the NIST-SRE-2006 database. Experimental results show a consistent increase in system performance for the proposed techniques.
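For contrast, a sketch of the standard cohort-based z-norm that such systems are typically compared against (the proposed distribution-scaling variants differ in how the scaling statistics are obtained):

```python
import numpy as np

def z_norm(raw_score, impostor_scores):
    """Re-express a raw score relative to an impostor score distribution."""
    mu, sigma = np.mean(impostor_scores), np.std(impostor_scores)
    return (raw_score - mu) / sigma

impostors = np.random.default_rng(0).normal(-1.2, 0.4, size=200)
print(z_norm(0.3, impostors))   # large positive value -> likely in-set
```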

Proceedings ArticleDOI
01 Dec 2007
TL;DR: This paper explores applying the EBW gradient steepness metric in the context of Hidden Markov Models for recognition of broad phonetic classes and finds that the gradient metric is able to outperform the baseline likelihood method, and offers improvements in noisy conditions.
Abstract: In many pattern recognition tasks, given some input data and a model, a probabilistic likelihood score is often computed to measure how well the model describes the data. Extended Baum-Welch (EBW) transformations are most commonly used as a discriminative technique for estimating parameters of Gaussian mixtures, though recently they have been used to derive a gradient steepness measurement to evaluate the quality of the model to match the distribution of the data. In this paper, we explore applying the EBW gradient steepness metric in the context of Hidden Markov Models (HMMs) for recognition of broad phonetic classes and present a detailed analysis and results on the use of this gradient metric on the TIMIT corpus. We find that our gradient metric is able to outperform the baseline likelihood method, and offers improvements in noisy conditions.

Proceedings ArticleDOI
01 Nov 2007
TL;DR: A confidence measure has been successfully extracted for a one-versus-one multi-class SVM classifier from the binary classifiers' confidence measures and optimized to model the temporal variations of speech feature vectors using a Viterbi-like decoding.
Abstract: The use of support vector machines for speech recognition purposes has been limited by the static nature of this classifier. In this paper, a confidence measure is proposed and evaluated for sequences of speech feature vectors. The confidence measure has been successfully extracted for a one-versus-one multi-class SVM classifier from the binary classifiers' confidence measures and optimized to model the temporal variations of speech feature vectors using a Viterbi-like decoding. In the decoding procedure, the effects of bigram language modeling and acoustic confidences are balanced to achieve the best result in continuous speech recognition applications. The experiments were arranged on the TIMIT corpus for a continuous phoneme recognition system. The results reveal recognition rates 2.6% higher than classic HMM-based continuous speech recognition methods.
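A sketch of the Viterbi-like decoding step with random confidences in place of real SVM outputs; the acoustic weight plays the balancing role the abstract mentions:

```python
import numpy as np

def viterbi(conf, log_bigram, acoustic_weight=1.0):
    """conf: (T, N) per-frame log-confidences; log_bigram: (N, N) transitions."""
    T, N = conf.shape
    delta = acoustic_weight * conf[0]
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        trans = delta[:, None] + log_bigram        # (prev, cur) scores
        back[t] = np.argmax(trans, axis=0)
        delta = trans[back[t], np.arange(N)] + acoustic_weight * conf[t]
    path = [int(np.argmax(delta))]                 # backtrack best path
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]

rng = np.random.default_rng(0)
conf = np.log(rng.dirichlet(np.ones(5), size=30))  # 30 frames, 5 phonemes
bigram = np.log(rng.dirichlet(np.ones(5), size=5))
print(viterbi(conf, bigram))
```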

Journal ArticleDOI
TL;DR: It is established that the details of the procedure used for initialising the longer-length models have a substantial effect on the speech recognition results, and when initialised appropriately, longer- length acoustic models that borrow their topology from a sequence of triphones cannot capture the pronunciation variation phenomena that hinder recognition performance the most.
Abstract: Recent research on the TIMIT corpus suggests that longer-length acoustic models are more appropriate for pronunciation variation modelling than the context-dependent phones that conventional automatic speech recognisers use. However, the impressive speech recognition results obtained with longer-length models on TIMIT remain to be reproduced on other corpora. To understand the conditions in which longer-length acoustic models result in considerable improvements in recognition performance, we carry out recognition experiments on both TIMIT and the Spoken Dutch Corpus and analyse the differences between the two sets of results. We establish that the details of the procedure used for initialising the longer-length models have a substantial effect on the speech recognition results. When initialised appropriately, longer-length acoustic models that borrow their topology from a sequence of triphones cannot capture the pronunciation variation phenomena that hinder recognition performance the most.

Journal ArticleDOI
TL;DR: This paper presents a transformation-based rapid adaptation technique for robust speech recognition using a linear spectral transformation (LST) and a maximum mutual information (MMI) criterion, providing a relative reduction of 11.1% in the speech recognition error rate using only 0.25 s of adaptation data.
Abstract: This paper presents a transformation-based rapid adaptation technique for robust speech recognition using a linear spectral transformation (LST) and a maximum mutual information (MMI) criterion. Previously, a maximum likelihood linear spectral transformation (ML-LST) algorithm was proposed for fast adaptation in unknown environments. Since the MMI estimation method does not require evenly distributed training data and increases the a posteriori probability of the word sequences of the training data, we combine the linear spectral transformation method and the MMI estimation technique in order to achieve extremely rapid adaptation using only one word of adaptation data. The proposed algorithm, called MMI-LST, was implemented using the extended Baum-Welch algorithm and phonetic lattices, and evaluated on the TIMIT and FFMTIMIT corpora. It provides a relative reduction in the speech recognition error rate of 11.1% using only 0.25 s of adaptation data.

Proceedings ArticleDOI
22 Apr 2007
TL;DR: This paper uses the notion of virtual evidence in a graphical-model based system to reduce the amount of supervisory training data required for sequence learning tasks, and shows that a VE-based training scheme can, relative to a baseline trained with the full segmentation, yield similar results.
Abstract: Collecting supervised training data for automatic speech recognition (ASR) systems is both time consuming and expensive. In this paper we use the notion of virtual evidence in a graphical-model based system to reduce the amount of supervisory training data required for sequence learning tasks. We apply this approach to a TIMIT phone recognition system, and show that our VE-based training scheme can, relative to a baseline trained with the full segmentation, yield similar results with only 15.3% of the frames labeled (keeping the number of utterances fixed).

Proceedings ArticleDOI
27 Aug 2007
TL;DR: A novel data-driven framework is proposed to build the alignment of the frequency axes of two speakers by formulating the frequency domain correspondence of the two speakers as an optimal matching problem.
Abstract: Due to physiological and linguistic differences between speakers, the spectrum pattern for the same phoneme of two speakers can be quite dissimilar. Without appropriate alignment on the frequency axis, the inter-speaker variation will reduce the modeling efficiency and result in performance degradation. In this paper, a novel data-driven framework is proposed to build the alignment of the frequency axes of two speakers. This alignment between two frequency axes is essentially a frequency domain correspondence of these two speakers. To establish the frequency domain correspondence, we formulate the task as an optimal matching problem. The local matching is achieved by comparing the local features of the spectrogram along the frequency bins, capturing the similarity of the local patterns across frequency bins in the spectrogram. After the local matching, dynamic programming is applied to find the globally optimal alignment between the two frequency axes. Experiments on TIDIGITS and TIMIT clearly show the effectiveness of this method.
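A sketch of the global alignment stage, given some matrix of local mismatch costs between frequency bins (how those costs are computed from spectrogram patterns is the paper's local-matching step, elided here):

```python
import numpy as np

def align_frequency_axes(cost):
    """cost[i, j]: local mismatch between bin i of spk A and bin j of spk B.
    Returns the monotonic bin-to-bin correspondence, DTW-style."""
    n, m = cost.shape
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = cost[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1],
                                               D[i - 1, j - 1])
    path, i, j = [], n, m                 # backtrack the optimal warping
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        i, j = (i - 1, j - 1) if step == 0 else (i - 1, j) if step == 1 else (i, j - 1)
    return path[::-1]

rng = np.random.default_rng(0)
cost = np.abs(np.subtract.outer(np.sort(rng.random(32)), np.sort(rng.random(32))))
print(align_frequency_axes(cost)[:5])
```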

Book ChapterDOI
22 May 2007
TL;DR: The manifold learning techniques locally linear embedding and Isomap are considered and it is shown that features resulting from manifold learning are capable of yielding higher classification accuracy than these baseline features.
Abstract: This study aims to investigate approaches for low dimensional speech feature transformation using manifold learning. It has recently been shown that speech sounds may exist on a low dimensional manifold nonlinearly embedded in high dimensional space. A number of manifold learning techniques have been developed in recent years that attempt to discover this type of underlying geometric structure. The manifold learning techniques locally linear embedding and Isomap are considered in this study. The low dimensional representations produced by applying these techniques to MFCC feature vectors are evaluated in several phone classification tasks on the TIMIT corpus. Classification accuracy is analysed and compared to conventional MFCC features and those transformed with PCA, a linear dimensionality reduction method. It is shown that features resulting from manifold learning are capable of yielding higher classification accuracy than these baseline features. The best phone classification accuracy in general is demonstrated by feature transformation with Isomap.
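A sketch of the transformation step with scikit-learn, using random vectors in place of real MFCC frames and an assumed neighborhood size and target dimensionality:

```python
import numpy as np
from sklearn.manifold import Isomap, LocallyLinearEmbedding

# Fit Isomap or LLE on MFCC frames; the embedded coordinates become the
# low-dimensional features fed to the phone classifier.
rng = np.random.default_rng(0)
mfcc = rng.normal(size=(1000, 39))                 # stand-in MFCC+deltas
isomap = Isomap(n_neighbors=10, n_components=10)
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=10)
print(isomap.fit_transform(mfcc).shape)            # (1000, 10)
print(lle.fit_transform(mfcc).shape)
```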

Proceedings ArticleDOI
15 Apr 2007
TL;DR: An acoustic model adaptation method is presented for speaker verification (SV) in environments with additive noise, with spectral subtraction (SS) as the enhancement technique and GMMs as the acoustic models.
Abstract: This work presents an acoustic model adaptation method for speaker verification (SV) in environments with additive noise. In contrast to traditional acoustic model adaptation techniques that adapt the model parameters based on a model of the noise, acoustic model enhancement (AME) belongs to a new scheme in which the models are adapted to the speech enhancement strategy. The theoretical framework is presented for spectral subtraction (SS) as the enhancement technique and GMMs as the acoustic models. In order to study the effect of additive noise only, a modified TIMIT dataset was used. The experimental setup uses two types of noise: one with a fixed spectrum that serves as a proof of concept, and another with a time-varying spectrum as a more realistic performance reference for AME. The results for the latter type show that at 20 dB SNR, the equal error rate (EER) dropped from 17% to around 8.9% when the noisy speech was enhanced with SS, whereas it further dropped to 8.1% with AME.
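A sketch of the spectral subtraction front-end to which the models are adapted; the oversubtraction factor and spectral floor below are common textbook values, not the paper's settings:

```python
import numpy as np

def spectral_subtraction(power_spec, noise_power, alpha=2.0, beta=0.01):
    """power_spec: (frames, bins); noise_power: (bins,) noise estimate,
    typically averaged over non-speech frames."""
    clean = power_spec - alpha * noise_power       # oversubtraction
    return np.maximum(clean, beta * power_spec)    # floor: no negative energy

rng = np.random.default_rng(0)
noisy = rng.random((50, 129)) + 0.2                # toy power spectra
noise = np.full(129, 0.2)
print(spectral_subtraction(noisy, noise).shape)
```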

Journal ArticleDOI
TL;DR: Although these results confirm the utility of M-SHMMs for automatic speech recognition, they provide empirical evidence of the value of nonlinear formant-to-acoustic mappings.
Abstract: Multiple-level segmental hidden Markov models (M-SHMMs) in which the relationship between symbolic and acoustic representations of speech is regulated by a formant-based intermediate representation are considered. New TIMIT phone recognition results are presented, confirming that the theoretical upper-bound on performance is achieved provided that either the intermediate representation or the formant-to-acoustic mapping is sufficiently rich. The way in which M-SHMMs exploit formant-based information is also investigated, using singular value decomposition of the formant-to-acoustic mappings and linear discriminant analysis. The analysis shows that if the intermediate layer contains information which is linearly related to the spectral representation, that information is used in preference to explicit formant frequencies, even though the latter are useful for phone discrimination. In summary, although these results confirm the utility of M-SHMMs for automatic speech recognition, they provide empirical evidence of the value of nonlinear formant-to-acoustic mappings.

Proceedings ArticleDOI
22 Apr 2007
TL;DR: It is found that joint modeling provides superior performance to the independent models on the TIMIT phone recognition task, and it is suggested that in an ASR system, phonological features should be handled as correlated, rather than independent.
Abstract: We compare the effect of joint modeling of phonological features to independent feature detectors in a Conditional Random Fields framework. Joint modeling of features is achieved by deriving phonological feature posteriors from the posterior probabilities of the phonemes. We find that joint modeling provides superior performance to the independent models on the TIMIT phone recognition task. We explore the effects of varying relationships between phonological features, and suggest that in an ASR system, phonological features should be handled as correlated, rather than independent.
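A sketch of how feature posteriors can be derived from phoneme posteriors, with a tiny illustrative feature table: each feature's posterior is the total mass of the phonemes carrying it, so features inherit the phoneme-level correlations:

```python
import numpy as np

PHONEMES = ["p", "b", "m", "s"]
FEATURES = {                       # 1 = phoneme carries the feature (toy table)
    "voiced": [0, 1, 1, 0],
    "nasal":  [0, 0, 1, 0],
    "labial": [1, 1, 1, 0],
}

def feature_posteriors(phoneme_post):
    """phoneme_post: (T, len(PHONEMES)) frame-level phoneme posteriors."""
    return {f: phoneme_post @ np.array(mask, float)
            for f, mask in FEATURES.items()}

post = np.array([[0.7, 0.1, 0.1, 0.1],
                 [0.1, 0.2, 0.6, 0.1]])
for feat, p in feature_posteriors(post).items():
    print(feat, p)
```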