
Showing papers on "TIMIT published in 2004"


Journal ArticleDOI
TL;DR: An algorithm based on group delay processing of the magnitude spectrum is presented that determines segment boundaries in the speech signal, automatically segmenting continuous speech into syllable-like units.

96 citations


Journal ArticleDOI
TL;DR: A new scheme is proposed that removes phase sensitivity by using an analytical description of the ICA-adapted basis functions obtained via the Hilbert transform; because the basis functions are not shift invariant, the scheme is extended with a frequency-based ICA stage that removes redundant time-shift information.

89 citations


Proceedings ArticleDOI
17 May 2004
TL;DR: The modified group delay feature (MODGDF) is used as a front end feature in a Gaussian mixture model (GMM) based speaker identification system and it is shown that the MODGDF has speaker specific properties.
Abstract: In this paper, we explore new methods by which speakers can be identified and discriminated, using features derived from the Fourier transform phase. The modified group delay feature (MODGDF), which is a parameterized form of the modified group delay function, is used as a front end feature in this study. A Gaussian mixture model (GMM) based speaker identification system is built with the MODGDF as the front end feature. The system is tested on both clean (TIMIT) and noisy telephone (NTIMIT) speech. The results obtained are compared with traditional Mel frequency cepstral coefficients (MFCC), which are derived from the Fourier transform magnitude. When both MFCC and MODGDF were combined, the performance improved by about 4%, indicating that phase and magnitude contain complementary information. In an earlier paper (Murthy et al. (2003)), it was shown that the MODGDF does possess phoneme specific characteristics. In this paper, we show that the MODGDF has speaker specific properties. We also make an attempt to understand the speaker-discriminating characteristics of the MODGDF using the nonlinear mapping technique based on Sammon mapping (Sammon (1969)) and find that the MODGDF empirically demonstrates a certain level of linear separability among speakers.
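
For concreteness, here is a minimal numpy sketch of the modified group delay computation, following the standard construction from the DFTs of x(n) and n·x(n) with a cepstrally smoothed denominator; the alpha, gamma, and lifter settings are illustrative assumptions, not the paper's tuned values.

```python
import numpy as np

def modgdf(frame, alpha=0.4, gamma=0.9, lifter=8):
    """Hedged sketch of the modified group delay function (MODGDF).
    Parameter values are illustrative, not taken from the paper."""
    n = np.arange(len(frame))
    X = np.fft.rfft(frame)        # spectrum of x(n)
    Y = np.fft.rfft(n * frame)    # spectrum of n * x(n)
    # Cepstrally smoothed magnitude spectrum tames zeros of X that
    # would otherwise make the group delay spiky.
    cep = np.fft.irfft(np.log(np.abs(X) + 1e-10))
    cep[lifter:len(cep) - lifter] = 0.0      # keep low quefrencies only
    S = np.exp(np.fft.rfft(cep).real)
    tau = (X.real * Y.real + X.imag * Y.imag) / (S ** (2 * gamma) + 1e-10)
    return np.sign(tau) * np.abs(tau) ** alpha

# Cepstrum-like features for the GMM are then typically obtained by
# taking a DCT of the MODGDF curve.
```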

73 citations


Journal ArticleDOI
TL;DR: In this study, a system is developed that extracts linguistically relevant temporal information that can be used in the front end of an automatic speech recognition system and results in the same recognition accuracy achieved when the standard 39 cepstral-based parameters are used.
Abstract: Studies by Shannon et al. [Science, 270, 303-304 (1995)], Van Tasell et al. [J. Acoust. Soc. Am. 82, 1152-1161 (1987)], and others show that human listeners can understand important aspects of the speech signal when spectral shape has been significantly degraded. These experiments suggest that temporal information is particularly important in human speech perception when the speech signal is heavily degraded. In this study, a system is developed that extracts linguistically relevant temporal information that can be used in the front end of an automatic speech recognition system. The parameters targeted include energy onsets and offsets (computed using an adaptive algorithm) and measures of periodic and aperiodic content; together these are used to find abrupt acoustic events which signify landmarks. Overall detection rates for strongly robust events, robust events, and weak events in a portion of the TIMIT test database are 98.9%, 94.7%, and 52.1%, respectively. Error rates increase by less than 5% when the speech signals are spectrally impoverished. Use of the four temporal parameters as the front end of a hidden Markov model (HMM)-based system for the automatic recognition of the manner classes "sonorant," "fricative," "stop," and "silence" results in the same recognition accuracy achieved when the standard 39 cepstral-based parameters are used, 70.1%. The combination of the temporal parameters and cepstral parameters results in an accuracy of 74.8%.
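
A hedged illustration of this kind of event detection (not the paper's adaptive algorithm): flag frames where band energy rises or falls sharply. The band edges and the 9 dB jump threshold below are assumed values.

```python
import numpy as np

def energy_landmarks(x, fs, band=(800.0, 2500.0), win=0.01, jump_db=9.0):
    """Mark abrupt band-energy onsets/offsets; thresholds illustrative."""
    hop = int(win * fs)
    frames = np.array([x[i:i + hop] for i in range(0, len(x) - hop, hop)])
    spec = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(hop, 1.0 / fs)
    sel = (freqs >= band[0]) & (freqs < band[1])
    e_db = 10 * np.log10(np.sum(spec[:, sel] ** 2, axis=1) + 1e-10)
    de = np.diff(e_db)                     # coarse rate of energy change
    onsets = np.where(de > jump_db)[0] * hop       # sample indices
    offsets = np.where(de < -jump_db)[0] * hop
    return onsets, offsets
```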

57 citations


Dissertation
01 Jun 2004
TL;DR: A novel approach to decoding for segment models in the form of a stack decoder with A∗ search is introduced, which allows flexibility in the choice of acoustic and language models since the Viterbi criterion is not integral to the search, and hypothesis generation is independent of the particular language model.
Abstract: The majority of automatic speech recognition (ASR) systems rely on hidden Markov models (HMM), in which the output distribution associated with each state is modelled by a mixture of diagonal covariance Gaussians. Dynamic information is typically included by appending time-derivatives to feature vectors. This approach, whilst successful, makes the false assumption of framewise independence of the augmented feature vectors and ignores the spatial correlations in the parametrised speech signal. This dissertation seeks to address these shortcomings by exploring acoustic modelling for ASR with an application of a form of state-space model, the linear dynamic model (LDM). Rather than modelling individual frames of data, LDMs characterise entire segments of speech. An auto-regressive state evolution through a continuous space gives a Markovian model of the underlying dynamics, and spatial correlations between feature dimensions are absorbed into the structure of the observation process. LDMs have been applied to speech recognition before; however, a smoothed Gauss-Markov form was used which ignored the potential for subspace modelling. The continuous dynamical state means that information is passed along the length of each segment. Furthermore, if the state is allowed to be continuous across segment boundaries, long-range dependencies are built into the system and the assumption of independence of successive segments is loosened. The state provides an explicit model of temporal correlation which sets this approach apart from frame-based and some segment-based models where the ordering of the data is unimportant. The benefits of such a model are examined both within and between segments. LDMs are well suited to modelling smoothly varying, continuous, yet noisy trajectories such as are found in measured articulatory data. Using speaker-dependent data from the MOCHA corpus, the performance of systems which model acoustic, articulatory, and combined acoustic-articulatory features is compared. As well as measured articulatory parameters, experiments use the output of neural networks trained to perform an articulatory inversion mapping. The speaker-independent TIMIT corpus provides the basis for larger scale acoustic-only experiments. Classification tasks provide an ideal means to compare modelling choices without the confounding influence of recognition search errors, and are used to explore issues such as choice of state dimension, front-end acoustic parametrisation and parameter initialisation. Recognition for segment models is typically more computationally expensive than for frame-based models. Unlike frame-level models, it is not always possible to share likelihood calculations for observation sequences which occur within hypothesised segments that have different start and end times. Furthermore, the Viterbi criterion is not necessarily applicable at the frame level. This work introduces a novel approach to decoding for segment models in the form of a stack decoder with A∗ search. Such a scheme allows flexibility in the choice of acoustic and language models since the Viterbi criterion is not integral to the search, and hypothesis generation is independent of the particular language model. Furthermore, the time-asynchronous ordering of the search means that only likely paths are extended, and so a minimum number of models are evaluated. The decoder is used to give full recognition results for feature-sets derived from the MOCHA and TIMIT corpora.
Conventional train/test divisions and choice of language model are used so that results can be directly compared to those in other studies. The decoder is also used to implement Viterbi training, in which model parameters are alternately updated and then used to re-align the training data.
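
The core computation in such a system, the likelihood of a feature segment under an LDM, reduces to a Kalman-filter innovation sum. Below is a minimal sketch assuming the model parameters (F, H, Q, R) have already been estimated, e.g. by EM.

```python
import numpy as np

def ldm_loglik(Y, F, H, Q, R, x0, P0):
    """Segment log-likelihood under x_t = F x_{t-1} + w, y_t = H x_t + v,
    computed with the Kalman filter in innovation form."""
    x, P, ll = x0, P0, 0.0
    for y in Y:                            # Y: (T, obs_dim) segment
        x, P = F @ x, F @ P @ F.T + Q      # time update
        e = y - H @ x                      # innovation
        S = H @ P @ H.T + R                # innovation covariance
        ll += -0.5 * (e @ np.linalg.solve(S, e)
                      + np.linalg.slogdet(S)[1]
                      + len(y) * np.log(2 * np.pi))
        K = P @ H.T @ np.linalg.inv(S)     # Kalman gain
        x, P = x + K @ e, P - K @ H @ P    # measurement update
    return ll

# Classification: score a segment under each phone's LDM and pick the
# model with the highest log-likelihood.
```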

43 citations


Journal ArticleDOI
TL;DR: It is shown that the discriminant capacity of such an encoder can be improved by introducing signal class-membership information from the coding stage onward; the approach fits into the category of discriminant feature extraction (DFE) encoders already proposed in the literature.

24 citations


Proceedings ArticleDOI
23 Aug 2004
TL;DR: This paper presents a novel extension of hidden Markov models (HMMs): type-2 fuzzy HMMs (type-2 FHMMs), which can handle both randomness and fuzziness within the framework of type-2 fuzzy sets and fuzzy logic systems (FLSs).

Abstract: This paper presents a novel extension of hidden Markov models (HMMs): type-2 fuzzy HMMs (type-2 FHMMs). The advantage of this extension is that it can handle both randomness and fuzziness within the framework of type-2 fuzzy sets (FSs) and fuzzy logic systems (FLSs). Membership functions (MFs) of type-2 fuzzy sets are three-dimensional. It is the third dimension that provides the additional degrees of freedom that make it possible to handle both uncertainties. We apply the type-2 FHMM as an acoustic model for phoneme recognition on the TIMIT speech database. Experimental results show that the type-2 FHMM performs comparably to the HMM but is more robust to noise, while retaining almost the same computational complexity.
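
A full type-2 FHMM is beyond a short sketch, but the scaled forward recursion it builds on is standard; the closing comment gives one hedged reading of how interval emission scores could bracket the likelihood.

```python
import numpy as np

def forward_loglik(A, B, pi):
    """Scaled HMM forward pass; B[t, j] is the emission likelihood of
    frame t under state j."""
    alpha = pi * B[0]
    ll = np.log(alpha.sum()); alpha /= alpha.sum()
    for t in range(1, len(B)):
        alpha = (alpha @ A) * B[t]
        ll += np.log(alpha.sum()); alpha /= alpha.sum()
    return ll

# Hedged interval type-2 reading: if membership-function uncertainty
# yields emission intervals [B_lo, B_hi], the recursion is monotone in
# B, so forward_loglik(A, B_lo, pi) and forward_loglik(A, B_hi, pi)
# bracket the score; a crisp decision can then use the interval bounds.
```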

22 citations


Proceedings ArticleDOI
17 May 2004
TL;DR: A detector is described that outputs a stream (sequence of classes) for each articulatory feature given the Mel frequency cepstral coefficient (MFCC) representation of the input speech and the results are compared to a baseline system.
Abstract: In this paper, the states in the speech production process are defined by a number of categorical articulatory features. We describe a detector that outputs a stream (sequence of classes) for each articulatory feature given the Mel frequency cepstral coefficient (MFCC) representation of the input speech. The detector consists of a bank of recurrent neural network (RNN) classifiers, a variable depth lattice generator and Viterbi decoder. A bank of classifiers has been previously used for articulatory feature detection by many researchers. We extend their work first by creating variable depth lattices for each feature and then by combining them into product lattices for rescoring using the Viterbi algorithm. During the rescoring we incorporate language and duration constraints along with the posterior probabilities of classes provided by the RNN classifiers. We present our results for the place and manner features using TIMIT data, and compare the results to a baseline system. We report performance improvements both at the frame and segment levels.
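
The rescoring step is essentially Viterbi decoding over per-frame class posteriors under transition constraints. A minimal single-stream sketch, with the log-transition matrix (encoding the language/duration constraints) assumed given:

```python
import numpy as np

def viterbi(post, logA, log_prior):
    """Decode one articulatory-feature stream from RNN posteriors.
    post: (T, C) class posteriors; logA: (C, C) log transition scores."""
    T, C = post.shape
    emis = np.log(post + 1e-10)
    delta = log_prior + emis[0]
    back = np.zeros((T, C), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logA    # best predecessor per class
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + emis[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]
```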

21 citations


Proceedings ArticleDOI
04 Oct 2004
TL;DR: The results show that significant changes occur at the prosodic, phoneme space, and word levels for dialect analysis, and that effective dialect classification can be achieved using processing strategies from each domain.
Abstract: In this paper, we present our recent work in the analysis and modeling of speech under dialect. Dialect and accent significantly influence automatic speech recognition performance, and therefore it is critical to detect and classify non-native speech. In this study, we consider three areas: (i) prosodic structure (normalized f0, syllable rate, and sentence duration), (ii) phoneme acoustic space modeling and sub-word classification, and (iii) word-level modeling using large vocabulary data. The corpora used in this study include the NATO N-4 corpus (2 accents, 2 dialects of English), TIMIT (7 dialect regions), and American and British English versions of the WSJ corpus. These corpora were selected because they contain audio material from specific dialects/accents of English (N-4), are phonetically balanced and organized across US dialect regions (TIMIT), or contain significant amounts of read audio material from distinct dialects (WSJ). The results show that significant changes occur at the prosodic, phoneme-space, and word levels for dialect analysis, and that effective dialect classification can be achieved using processing strategies from each domain.
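
As a rough sketch of one of the prosodic measurements (the paper's exact normalization is not reproduced here), a normalized f0 contour can be derived from a crude autocorrelation pitch track:

```python
import numpy as np

def f0_autocorr(frame, fs, fmin=60.0, fmax=400.0):
    """Crude autocorrelation pitch estimate for one voiced frame."""
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    return fs / (lo + int(np.argmax(ac[lo:hi])))

def normalized_f0(f0_track):
    """One plausible per-speaker normalization, so dialect comparisons
    are not dominated by individual pitch range."""
    f0 = np.asarray([f for f in f0_track if f > 0])  # voiced frames only
    return (f0 - f0.mean()) / (f0.std() + 1e-10)
```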

20 citations


Proceedings ArticleDOI
25 Jul 2004
TL;DR: An extension of hidden Markov models (HMMs) using interval type-2 fuzzy sets (FSs) and fuzzy logic systems (FLSs) to produce interval type-2 FHMMs, which perform comparably to the HMM but are more robust to speech variation.

Abstract: This paper presents an extension of the hidden Markov models (HMMs) using interval type-2 fuzzy sets (FSs) and fuzzy logic systems (FLSs) to produce interval type-2 FHMMs. The advantage of this extension is that it can handle both randomness and fuzziness. The membership function (MF) of a type-2 FS is three-dimensional; it is this third dimension that provides the additional degrees of freedom to evaluate the HMM's uncertainties. An attractive property of this extension is that if all uncertainties disappear, the interval type-2 FHMM reduces to the classical HMM. We apply our interval type-2 FHMM as an acoustic model for phoneme recognition on the TIMIT speech database. Experimental results show that the type-2 FHMM performs comparably to the HMM but is more robust to speech variation, while retaining almost the same computational complexity.

17 citations


01 Jan 2004
TL;DR: A new multi-scale voice morphing algorithm that enables a user to transform one person's speech pattern into another person's pattern with distinct characteristics, giving it a new identity, while preserving the original content.
Abstract: This paper presents a new multi-scale voice morphing algorithm. This algorithm enables a user to transform one person's speech pattern into another person's pattern with distinct characteristics, giving it a new identity, while preserving the original content. The voice morphing algorithm performs the morphing at different subbands by using the theory of wavelets and models the spectral conversion using the theory of Radial Basis Function Neural Networks. The results obtained on the TIMIT speech database demonstrate effective transformation of the speaker identity.
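
A minimal sketch of the subband machinery, using a hand-rolled Haar DWT to stay self-contained (the paper does not specify this particular wavelet); the learned RBF spectral mapping is left abstract.

```python
import numpy as np

def haar_dwt(x):
    """One Haar DWT level: approximation (low band), detail (high band)."""
    x = x[:len(x) // 2 * 2]
    a = (x[0::2] + x[1::2]) / np.sqrt(2)
    d = (x[0::2] - x[1::2]) / np.sqrt(2)
    return a, d

def subbands(x, levels=3):
    """Multi-scale decomposition; morphing would apply a trained
    RBF-network spectral conversion per subband before resynthesis."""
    bands = []
    for _ in range(levels):
        x, d = haar_dwt(x)
        bands.append(d)
    bands.append(x)                        # final approximation band
    return bands
```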

Proceedings ArticleDOI
04 Oct 2004
TL;DR: The in-set speakers are clustered into smaller subsets without merging speaker models and the Anti-Speaker or Background Model is adapted for each subset which minimizes the identification errors of the pseudo impostors during the training stage.
Abstract: In this paper, we propose an approach to address the problem of text-independent open-set speaker identification. The in-set speakers are clustered into smaller subsets without merging speaker models. The Anti-Speaker or Background Model is then adapted for each subset, which minimizes the identification errors of the pseudo impostors during the training stage. Score normalization is applied to align all the in-set speaker score distributions to share a single scale. Finally, confidence measure processing is used to identify in-set versus out-of-set speakers. Experiments with TIMIT and the CU-Accent corpora show average improvements in Equal Error Rate of 20.28 and 8.35 over the baseline performance, respectively. Finally, a probe experiment is also included that considers prosody for in-set speaker detection.
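
Score normalization of the kind described is commonly realized as z-norm over pseudo-impostor scores; a hedged sketch (the statistics and threshold are assumptions, tuned on development data in practice):

```python
import numpy as np

def znorm(score, impostor_scores):
    """Align a speaker's score distribution to a common scale using
    pseudo-impostor statistics collected during training."""
    mu, sigma = np.mean(impostor_scores), np.std(impostor_scores) + 1e-10
    return (score - mu) / sigma

def accept_in_set(score, impostor_scores, theta=0.0):
    """In-set vs. out-of-set decision on the normalized score; theta is
    an illustrative threshold."""
    return znorm(score, impostor_scores) > theta
```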

Proceedings ArticleDOI
17 May 2004
TL;DR: Maximum model distance (MMD) training is applied to speaker identification together with a new strategy for selecting competitive speakers; the results show that identification performance can be improved greatly when the training data is limited.

Abstract: In this paper we apply maximum model distance (MMD) training to speaker identification and propose a new strategy for selecting competitive speakers. The traditional ML method utilizes only the utterances for each speaker model, which can lead to a locally optimal solution. By maximizing the dissimilarities among similar speaker models, MMD adds discriminative capability to the training procedure and thereby improves identification performance. Based on the TIMIT corpus, we designed word and sentence experiments to evaluate the proposed training approach. The results show that identification performance can be improved greatly when the training data is limited.

Proceedings ArticleDOI
17 May 2004
TL;DR: An improved PSM is introduced, dynamic multi-region PSM, that allows a data-driven alignment between observations and the segment trajectory and outperforms HMM and traditional PSM in both phone classification and phone recognition tasks on the TIMIT corpus.
Abstract: One of the difficulties in using the polynomial segment model (PSM) to capture the temporal correlations within a phonetic segment is the lack of an efficient training algorithm comparable with the Baum-Welch algorithm in HMM. In our previous paper, we introduced a recursive likelihood computation algorithm for PSM recognition which can perform Viterbi-style training. In this paper, we extend the recursive likelihood computation into a fast forward-backward PSM training algorithm that maximizes PSM likelihood. In addition, we introduce an improved PSM, dynamic multi-region PSM, that allows a data-driven alignment between observations and the segment trajectory. The dynamic multi-region PSM model outperforms HMM and traditional PSM in both phone classification and phone recognition tasks on the TIMIT corpus.
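
The basic PSM computation, fitting a polynomial trajectory over normalized time and scoring the Gaussian residuals, can be sketched as follows (single region, uniform alignment; the dynamic multi-region model relaxes exactly that fixed alignment):

```python
import numpy as np

def psm_fit_loglik(Y, order=2):
    """Fit a degree-`order` trajectory to a (T, d) segment Y and return
    the coefficients plus a diagonal-Gaussian residual log-likelihood."""
    T = len(Y)
    Z = np.vander(np.linspace(0, 1, T), order + 1)   # [t^r] design matrix
    B, *_ = np.linalg.lstsq(Z, Y, rcond=None)        # trajectory coeffs
    resid = Y - Z @ B
    var = resid.var(axis=0) + 1e-10                  # residual variances
    ll = -0.5 * np.sum(resid ** 2 / var + np.log(2 * np.pi * var))
    return B, ll
```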

Journal ArticleDOI
TL;DR: It is shown that combining the models during training not only improves performance but also simplifies the fusion process during recognition, particularly for highly constrained fusion schemes such as synchronous model combination.

Proceedings ArticleDOI
17 May 2004
TL;DR: Three significant enhancements to time-domain adaptive decorrelation filtering (ADF) are proposed for effective separation and recognition of simultaneous speech sources in reverberant room conditions and have significantly improved target-to-interference ratio and accuracy of phone recognition.
Abstract: Three significant enhancements to time-domain adaptive decorrelation filtering (ADF) are proposed for effective separation and recognition of simultaneous speech sources in reverberant room conditions. The methods include whitening filtering on cochannel speech prior to ADF to improve condition of adaptive estimation, a novel block-iterative implementation of ADF to speed up convergence rate, and an integration of multiple ADF outputs through optimal post-filtering. Experimental data were generated by convolving TIMIT speech with acoustic path impulse responses measured in a real acoustic environment, with a 2m microphone-source distance and an initial target-to-interference ratio of about 0 dB. The proposed methods are shown to have speeded up the convergence rate of ADF to a level feasible for online applications, and they have significantly improved target-to-interference ratio and accuracy of phone recognition.

Proceedings ArticleDOI
17 May 2004
TL;DR: To prove the concept, guided discriminative training is applied to derive an optimal linear transformation on the mel-filterbank log power spectra to improve TIMIT phoneme classification.
Abstract: In this paper, we investigate guided discriminative training in the context of improving multi-class classification problems. We are interested in applications that require improvement in the classification performance of only a subset of the classes at the possible expense of poorer classification performance of the remaining classes. However, should the classification of the remaining classes deteriorate, it is guaranteed not to be worse than the extent that the user specifies. The problem is formulated as a nonlinear programming problem, which can be translated into an unconstrained nonlinear optimization problem using the barrier method and, in turn, solved by the gradient descent method. To prove the concept, we apply guided discriminative training to derive an optimal linear transformation on the mel-filterbank log power spectra to improve TIMIT phoneme classification. Encouraging results are obtained.
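
A hedged sketch of the formulation described above: gradient descent on the target subset's loss, with a log-barrier keeping the remaining classes' loss within the user-specified budget. Function names and constants are illustrative, not from the paper.

```python
import numpy as np

def num_grad(f, w, h=1e-5):
    """Numerical gradient, to keep the sketch self-contained."""
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w); e[i] = h
        g[i] = (f(w + e) - f(w - e)) / (2 * h)
    return g

def guided_descent(w0, target_loss, others_loss, eps, mu=1.0,
                   lr=1e-3, steps=1000):
    """Minimize target_loss(w) subject to others_loss(w) <= eps via the
    barrier method; assumes w stays strictly feasible throughout."""
    w = w0.copy()
    for _ in range(steps):
        slack = eps - others_loss(w)          # must remain positive
        g = (num_grad(target_loss, w)
             + (mu / slack) * num_grad(others_loss, w))
        w -= lr * g
    return w
```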

Proceedings ArticleDOI
17 May 2004
TL;DR: A combined fixed/adaptive beamforming algorithm (CFA-BF) for speech enhancement with two single channel methods based on speech spectral constrained iterative processing (Auto-LSP), and an auditory masked threshold based method using equivalent rectangular bandwidth filtering (GMMSE-AMTERB).
Abstract: While a number of studies have investigated various speech enhancement and noise suppression schemes, most consider either a single channel or array processing framework. Clearly there are potential advantages in leveraging the strengths of array processing solutions in suppressing noise from a direction other than the speaker, together with those of single channel methods that include speech spectral constraints or psychoacoustically motivated processing. In this paper, we propose to integrate a combined fixed/adaptive beamforming algorithm (CFA-BF) for speech enhancement with two single channel methods based on speech spectral constrained iterative processing (Auto-LSP), and an auditory masked threshold based method using equivalent rectangular bandwidth filtering (GMMSE-AMTERB). After formulating the method, we evaluate performance on a subset of the TIMIT corpus with four real noise sources. We demonstrate a consistent level of noise suppression and voice communication quality improvement using the proposed method as reflected by an overall average 26 dB increase in SegSNR from the original degraded audio corpus.
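
SegSNR, the figure of merit quoted above, is a frame-averaged SNR clamped to a conventional range so that silent frames do not dominate; a standard sketch:

```python
import numpy as np

def seg_snr(clean, enhanced, fs, win=0.02, floor=(-10.0, 35.0)):
    """Mean per-frame SNR in dB between clean reference and output."""
    hop = int(win * fs)
    snrs = []
    for i in range(0, min(len(clean), len(enhanced)) - hop, hop):
        s = clean[i:i + hop]
        e = s - enhanced[i:i + hop]         # residual noise/distortion
        snr = 10 * np.log10((np.sum(s ** 2) + 1e-10)
                            / (np.sum(e ** 2) + 1e-10))
        snrs.append(np.clip(snr, *floor))
    return float(np.mean(snrs))
```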

Proceedings ArticleDOI
16 Mar 2004
TL;DR: A general method for a text-independent speaker identification system using the discrete wavelet transform (DWT) is presented; the method also achieves 100% successful text-dependent recognition for both speaker databases.

Abstract: A general method for a text-independent speaker identification system using the discrete wavelet transform (DWT) is presented. Identification is based on a predefined threshold in conjunction with a properly selected subset of mother wavelets used to extract the speaker's features. The classification process that produces the speaker codebooks uses the vector quantization (VQ) technique with an algorithm that minimizes the time needed to create the data vectors. The performance of the proposed system is demonstrated on two databases: 28 persons (males and females from 7 different regions) from the international TIMIT database down-sampled to 8 kHz, and a group of 5 local females recorded in a semi-quiet environment. TIMIT results for text-independent identification indicate 85.7% for speakers of the same region and 90.5% for speakers of different regions, whereas 88% and 100% identification are obtained for the local group. Moreover, the proposed method achieves 100% successful text-dependent recognition for both speaker databases.
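
A minimal sketch of the VQ codebook stage (plain k-means over a speaker's feature vectors; the codebook size and distortion-based decision rule are illustrative assumptions):

```python
import numpy as np

def train_codebook(feats, k=32, iters=20):
    """K-means VQ codebook over (n, d) feature vectors."""
    rng = np.random.default_rng(0)
    code = feats[rng.choice(len(feats), k, replace=False)].copy()
    for _ in range(iters):
        d = ((feats[:, None, :] - code[None]) ** 2).sum(-1)
        lab = d.argmin(1)
        for j in range(k):
            if np.any(lab == j):
                code[j] = feats[lab == j].mean(0)
    return code

def avg_distortion(feats, code):
    """Identify the speaker whose codebook quantizes the test features
    with the smallest average distortion (thresholded for rejection)."""
    d = ((feats[:, None, :] - code[None]) ** 2).sum(-1)
    return float(d.min(1).mean())
```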

01 Jan 2004
TL;DR: It is observed that optimal sentence decoding is essential to achieve maximum recognition rate for models trained by global MMI, and global MMI is found superior to the frame-based criterion for continuous recognition.

Abstract: This paper deals with speaker-independent continuous speech recognition. Our approach is based on continuous density hidden Markov models with a non-linear input feature transformation performed by a multilayer perceptron. We discuss various optimisation criteria and provide results on a TIMIT phoneme recognition task, using single-frame MMI embedded in Viterbi training, and a global MMI criterion. As expected, global MMI is found superior to the frame-based criterion for continuous recognition. We further observe that optimal sentence decoding is essential to achieve maximum recognition rate for models trained by global MMI. Finally, we find that the simple MLP input transformation, with five frames of context information, can increase the recognition rate significantly compared to just using delta parameters.
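
The "five frames of context" input to the MLP can be illustrated with a simple frame-stacking sketch (the edge-padding strategy is an assumption):

```python
import numpy as np

def stack_context(feats, left=2, right=2):
    """Stack +/-2 neighbouring frames (five in total) into one input
    vector per frame for the MLP transformation."""
    T = len(feats)
    pad = np.pad(feats, ((left, right), (0, 0)), mode='edge')
    return np.concatenate([pad[i:i + T] for i in range(left + right + 1)],
                          axis=1)
```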

Proceedings ArticleDOI
01 Jan 2004
TL;DR: It is shown that while classification accuracy using Mel frequency cepstral coefficients as features does not improve with sub-banding, the accuracy increases from 36.1% to 42.0% using sub-banded reconstructed phase spaces to model the phonemes.

Abstract: This paper examines the use of multi-band reconstructed phase spaces as models for phoneme classification. Sub-banding reconstructed phase spaces combines linear, frequency-based techniques with a nonlinear modeling approach to speech recognition. Experiments comparing the effects of filtering speech signals for both reconstructed phase space and traditional speech recognition approaches are presented. These experiments study the use of two non-overlapping subbands for isolated phoneme classification on the TIMIT corpus. It is shown that while classification accuracy using Mel frequency cepstral coefficients as features does not improve with sub-banding, the accuracy increases from 36.1% to 42.0% using sub-banded reconstructed phase spaces to model the phonemes.
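
A reconstructed phase space is a time-delay embedding of the (possibly band-filtered) signal; a minimal sketch, with illustrative dim/tau (in practice chosen via mutual-information and false-nearest-neighbour heuristics):

```python
import numpy as np

def reconstructed_phase_space(x, dim=3, tau=6):
    """Rows are delay vectors [x(n), x(n+tau), ..., x(n+(dim-1)*tau)].
    Sub-banding: band-filter x first, then embed each band separately."""
    N = len(x) - (dim - 1) * tau
    return np.stack([x[i * tau:i * tau + N] for i in range(dim)], axis=1)
```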

Journal ArticleDOI
TL;DR: The results indicate that the supplementary features contain classification characteristics which can be useful in automatic speech recognition.
Abstract: Traditional speech recognition systems use mel‐frequency cepstral coefficients (MFCCs) as acoustic features. The present research aims to study the classification characteristics and the performance of some supplementary features (SFs), such as periodicity, zero crossing rate, log energy and the ratio of low frequency energy to total energy, in a phone recognition system built using the Hidden Markov Model Toolkit. To demonstrate the performance of the SFs, training is done on a subset of the TIMIT database (DR1 data set) on context independent phones using a single mixture. When only the SFs and their first derivatives (a feature set of dimension 8) are used, the recognition accuracy is found to be 42.96%, as compared to 54.65% when 12 MFCCs and their corresponding derivatives are used. The performance of the system improves to 56.49% when the SFs and their derivatives are used along with the MFCCs. A further improvement to 60.34% is observed when the last 4 MFCCs and their derivatives are replaced by the SFs and their derivatives, respectively. These results indicate that the supplementary features contain classification characteristics which can be useful in automatic speech recognition.
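
A hedged per-frame sketch of the four supplementary features; the paper's exact definitions (windowing, normalization, band edge) may differ.

```python
import numpy as np

def supplementary_features(frame, fs, low_cut=1000.0):
    """Log energy, zero crossing rate, low-to-total energy ratio, and a
    crude autocorrelation periodicity score for one frame."""
    log_e = np.log(np.sum(frame ** 2) + 1e-10)
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2
    spec = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)
    low_ratio = spec[freqs < low_cut].sum() / (spec.sum() + 1e-10)
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    periodicity = ac[int(fs / 400):int(fs / 60)].max() / (ac[0] + 1e-10)
    return np.array([log_e, zcr, low_ratio, periodicity])
```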

Proceedings ArticleDOI
02 Nov 2004
TL;DR: The maximum model distance (MMD) algorithm is applied to the Gaussian mixture model (GMM) training and shows that the equal error rate (EER) could be reduced greatly compared with the traditional ML method.
Abstract: This paper presents the design and implementation of text-independent speaker verification. We apply the maximum model distance (MMD) algorithm to the Gaussian mixture model (GMM) training. The traditional maximum likelihood (ML) method only utilizes the labeled utterances for each speaker model, which probably leads to a local optimization solution. By maximizing the model distance between the target and competing speakers, MMD could add the discriminative capability into the training procedure and then improve the verification performance. Based on the TIMIT corpus, we designed the verification experiments and the results show that the equal error rate (EER) could be reduced greatly compared with the traditional ML method.
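
The reported metric, equal error rate, is the point where the false-accept and false-reject rates cross as the decision threshold is swept; a simple sketch over score lists:

```python
import numpy as np

def eer(target_scores, impostor_scores):
    """EER from arrays of genuine and impostor trial scores."""
    best_gap, best_eer = 1.0, 0.0
    for th in np.sort(np.concatenate([target_scores, impostor_scores])):
        far = np.mean(impostor_scores >= th)   # false acceptance rate
        frr = np.mean(target_scores < th)      # false rejection rate
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2
    return best_eer
```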

Proceedings ArticleDOI
20 Oct 2004
TL;DR: Experimental results indicate that the ARMA lattice model achieves an improved noise-resistant capability on vowel phonemes and fricative phonemes as compared to the conventional mel-frequency cepstral coefficient (MFCC) method.
Abstract: In this paper, the result of a study on phoneme feature extraction, under a noisy environment, using an auto-regressive moving average (ARMA) lattice model, is presented. The phoneme characteristics are modeled and expressed in the form of ARMA lattice reflection coefficients for classification. Experimental results, based on the TIMIT speech database and NoiseX-92 noise database, indicate that the ARMA lattice model achieves an improved noise-resistant capability on vowel phonemes and fricative phonemes as compared to those of the conventional mel-frequency cepstral coefficient (MFCC) method.
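
Reflection coefficients for the all-pole (AR) part of such a model follow from the Levinson-Durbin recursion, sketched below; the paper's ARMA lattice additionally carries a moving-average branch that this sketch omits.

```python
import numpy as np

def reflection_coeffs(frame, order=12):
    """AR lattice reflection coefficients via Levinson-Durbin."""
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:][:order + 1]
    a = np.zeros(order + 1); a[0] = 1.0
    e = r[0] + 1e-10                         # prediction error power
    k = np.zeros(order)
    for m in range(1, order + 1):
        acc = r[m] + np.dot(a[1:m], r[m - 1:0:-1])
        k[m - 1] = -acc / e
        a[1:m + 1] += k[m - 1] * a[m - 1::-1][:m]   # update AR polynomial
        e *= 1.0 - k[m - 1] ** 2
    return k
```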

Proceedings ArticleDOI
26 Aug 2004
TL;DR: This paper analyzes the roles of individual hidden states of the HMM and their associated posterior probabilities that reflect the nature of the components in the observation sequence, and proposes to make a full use of the state-level information.
Abstract: In HMM-based pattern recognition, the structure of the HMM is predetermined according to some prior knowledge. In the recognition process, we usually make our judgment based on the maximum likelihood of the HMM, which unfortunately may lead to incorrect results. In this paper, we analyze the roles of the individual hidden states of the HMM and their associated posterior probabilities, which reflect the nature of the components in the observation sequence and should be taken into consideration. To this end, we propose to make full use of the state-level information, e.g., the distribution of the intersection number of state posterior probability trajectories, in the recognition process. We apply the proposed methods to phoneme classification on the TIMIT speech corpus and show that we are able to achieve about a 2% improvement in recognition rate over the classical HMM.
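
A sketch of the state-level quantities involved: forward-backward posteriors, plus one hedged reading of the "intersection number" of their trajectories (counting crossings between pairs of posterior curves).

```python
import numpy as np

def state_posteriors(A, B, pi):
    """gamma[t, j] = P(state j at frame t | observations), with per-frame
    scaling; B[t, j] is the emission likelihood of frame t in state j."""
    T, N = B.shape
    alpha, beta = np.zeros((T, N)), np.zeros((T, N))
    alpha[0] = pi * B[0]; alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]
        alpha[t] /= alpha[t].sum()
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    g = alpha * beta
    return g / g.sum(axis=1, keepdims=True)

def intersection_count(g):
    """Count crossings between every pair of posterior trajectories."""
    d = g[:, :, None] - g[:, None, :]        # pairwise differences
    return int((np.diff(np.sign(d), axis=0) != 0).sum() // 2)
```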

Journal ArticleDOI
TL;DR: The performance of MRAN is compared with other well-known RBF and Elliptical Basis Function (EBF) based speaker verification methods in terms of error rates and computational complexity on a series of speaker verification experiments.
Abstract: This paper presents a text-independent speaker verification system based on an online Radial Basis Function (RBF) network referred to as Minimal Resource Allocation Network (MRAN). MRAN is a sequential learning RBF, in which hidden neurons are added or removed as training progresses. LP-derived cepstral coefficients are used as feature vectors during training and verification phases. The performance of MRAN is compared with other well-known RBF and Elliptical Basis Function (EBF) based speaker verification methods in terms of error rates and computational complexity on a series of speaker verification experiments. The experiments use data from 258 speakers from the phonetically balanced continuous speech corpus TIMIT. The results show that MRAN produces comparable error rates to other methods with much less computational complexity.

Proceedings Article
01 Jan 2004
TL;DR: An algorithm called “DOLS” (Dynamic Orthogonal Least Square) is developed to solve this type of problem and is presented in this paper.
Abstract: Introduction: A successful speech recognition system has to determine not only features present in the input pattern at one point in time, but also features of the input pattern that change over time (e.g., Berthold, 1994; Benyettou, 1995). In network design, great importance must be attributed to the correct choice of the number of hidden neurons, which helps avoid overfitting and reduces the time required for training without significantly affecting network performance (e.g., Colla, Reyneri & Sgarbi, 1999); however, such designs never let the architecture adapt to the input. Combining the RBF approach with the shift-invariance features of the TDNN yields a new robust model, the temporal radial basis function network, “TRBF” (e.g., Mesbahi & Benyettou, 2003). To be more efficient, we adapt these networks so that they become more dynamic according to their behaviour and the features of the object under study; this is particularly relevant for continuous speech. To obtain an adaptive TRBF, it was therefore necessary to develop an algorithm that solves this type of problem. This algorithm, called “DOLS” (Dynamic Orthogonal Least Square), is presented in this paper.

Proceedings ArticleDOI
04 Oct 2004
TL;DR: The results indicate that by performing feature extraction at the client end, the bitrate can be reduced significantly to 13.6 kbps with 96% recognition performance.

Abstract: Speech recognition systems are gaining increasing importance with the widespread use of mobile and portable devices and other interactive voice response systems. Because of the resource constraints on such devices and the requirements of specific applications, the need to perform speech recognition over a data network becomes inevitable. The requirements of such a system, with a human at one end and a machine at the other, are clearly asymmetric. The major focus of this work is to enable speaker recognition for information access over the network. Assuming that at the client end the device is either a Personal Digital Assistant (PDA) or a cellphone, an attempt is made to perform part of the computation at the client end, thus conserving bandwidth. Experiments have been performed on both TIMIT data and TIMIT data passed through a speech codec. The results indicate that by performing feature extraction at the client end, the bitrate can be reduced significantly to 13.6 kbps with 96% recognition performance.