
Showing papers on "Speaker recognition published in 2006"


Journal ArticleDOI
TL;DR: This work examines the idea of using the GMM supervector in a support vector machine (SVM) classifier and proposes two new SVM kernels based on distance metrics between GMM models that produce excellent classification accuracy in a NIST speaker recognition evaluation task.
Abstract: Gaussian mixture models (GMMs) have proven extremely successful for text-independent speaker recognition. The standard training method for GMM models is to use MAP adaptation of the means of the mixture components based on speech from a target speaker. Recent methods in compensation for speaker and channel variability have proposed the idea of stacking the means of the GMM model to form a GMM mean supervector. We examine the idea of using the GMM supervector in a support vector machine (SVM) classifier. We propose two new SVM kernels based on distance metrics between GMM models. We show that these SVM kernels produce excellent classification accuracy in a NIST speaker recognition evaluation task.
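
As a rough illustration of the pipeline this abstract describes, the sketch below MAP-adapts only the means of a small universal background GMM to an utterance, stacks them into a supervector, and feeds the supervectors to a linear SVM. It uses scikit-learn and synthetic data; the relevance factor, dimensions, and toy features are assumptions, and the paper's specific distance-based kernels are not reproduced.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

def map_adapt_means(ubm, frames, relevance=16.0):
    """MAP-adapt only the UBM means to one utterance (standard relevance-factor rule)."""
    post = ubm.predict_proba(frames)             # (T, C) component responsibilities
    n_c = post.sum(axis=0)                       # soft counts per component
    f_c = post.T @ frames                        # first-order statistics, shape (C, D)
    e_c = f_c / np.maximum(n_c[:, None], 1e-10)  # posterior mean per component
    alpha = (n_c / (n_c + relevance))[:, None]   # adaptation coefficients
    return alpha * e_c + (1.0 - alpha) * ubm.means_

def supervector(ubm, frames):
    """Stack the adapted means into a single GMM mean supervector."""
    return map_adapt_means(ubm, frames).ravel()

# toy example with random "feature" frames (illustration only)
rng = np.random.default_rng(0)
ubm = GaussianMixture(n_components=8, covariance_type='diag', random_state=0)
ubm.fit(rng.normal(size=(2000, 12)))             # background data

X = np.array([supervector(ubm, rng.normal(loc=s % 2, size=(300, 12)))
              for s in range(20)])               # 20 utterances from 2 toy "speakers"
y = np.array([s % 2 for s in range(20)])
clf = SVC(kernel='linear').fit(X, y)             # linear kernel on the supervectors
print(clf.score(X, y))
```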

1,081 citations


Journal ArticleDOI
TL;DR: An overview of the approaches currently used in a key area of audio diarization, namely speaker diarization, is provided, and their relative merits and limitations are discussed.
Abstract: Audio diarization is the process of annotating an input audio channel with information that attributes (possibly overlapping) temporal regions of signal energy to their specific sources. These sources can include particular speakers, music, background noise sources, and other signal source/channel characteristics. Diarization can be used for helping speech recognition, facilitating the searching and indexing of audio archives, and increasing the richness of automatic transcriptions, making them more readable. In this paper, we provide an overview of the approaches currently used in a key area of audio diarization, namely speaker diarization, and discuss their relative merits and limitations. Performances using the different techniques are compared within the framework of the speaker diarization task in the DARPA EARS Rich Transcription evaluations. We also look at how the techniques are being introduced into real broadcast news systems and their portability to other domains and tasks such as meetings and speaker verification.

634 citations


Proceedings ArticleDOI
14 May 2006
TL;DR: A support vector machine kernel is constructed using the GMM supervector and similarities based on this kernel between the method of SVM nuisance attribute projection (NAP) and the recent results in latent factor analysis are shown.
Abstract: Gaussian mixture models with universal backgrounds (UBMs) have become the standard method for speaker recognition. Typically, a speaker model is constructed by MAP adaptation of the means of the UBM. A GMM supervector is constructed by stacking the means of the adapted mixture components. A recent discovery is that latent factor analysis of this GMM supervector is an effective method for variability compensation. We consider this GMM supervector in the context of support vector machines. We construct a support vector machine kernel using the GMM supervector. We show similarities based on this kernel between the method of SVM nuisance attribute projection (NAP) and the recent results in latent factor analysis. Experiments on a NIST SRE 2005 corpus demonstrate the effectiveness of the new technique.
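
The NAP idea mentioned above can be sketched as removing the leading within-speaker (session) directions from the supervectors before SVM training. The estimator below, which takes the top eigenvectors of the within-speaker scatter as the nuisance subspace, is a minimal stand-in for the paper's formulation; the data layout and subspace size are assumptions.

```python
import numpy as np

def nap_projection(supervectors, speaker_ids, n_nuisance=2):
    """Estimate a NAP projection P = I - U U^T from within-speaker variation.

    U spans the leading directions of session variability, found here by an
    eigen-decomposition (SVD) of the within-speaker scatter of the supervectors.
    """
    X = np.asarray(supervectors, dtype=float)
    ids = np.asarray(speaker_ids)
    # remove each speaker's mean so only within-speaker (channel) variation remains
    centered = np.vstack([X[ids == s] - X[ids == s].mean(axis=0)
                          for s in np.unique(ids)])
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    U = vt[:n_nuisance].T                        # (dim, n_nuisance) nuisance subspace
    def project(v):
        return v - U @ (U.T @ v)                 # P v = (I - U U^T) v
    return project

# toy usage: 3 speakers x 4 sessions, 50-dim supervectors sharing one channel direction
rng = np.random.default_rng(1)
channel = rng.normal(size=50)
spk = np.repeat(np.arange(3), 4)
sv = np.array([rng.normal(size=50) + s + rng.normal() * channel for s in spk])
project = nap_projection(sv, spk, n_nuisance=1)
print(np.round(project(sv[0])[:5], 3))
```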

625 citations


Journal ArticleDOI
TL;DR: The metric that is proposed is an information-theoretic one, measuring the effective amount of information that the speaker detector delivers to the user; it is appropriate for the evaluation of application-independent detectors, which output soft decisions in the form of log-likelihood ratios rather than hard decisions.
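
The information-theoretic measure described in this entry is widely known as Cllr (the cost of log-likelihood-ratios); the formula below is its commonly cited form, stated here as an assumption since the entry itself does not spell it out.

```python
import numpy as np

def cllr(target_llrs, nontarget_llrs):
    """Cost of log-likelihood-ratios (Cllr), in bits per trial.

    Measures how much effective information well-calibrated LLRs deliver:
    0 for a perfect detector, 1 for a useless one that always outputs LLR = 0.
    """
    target_llrs = np.asarray(target_llrs, dtype=float)
    nontarget_llrs = np.asarray(nontarget_llrs, dtype=float)
    c_tar = np.mean(np.log2(1.0 + np.exp(-target_llrs)))
    c_non = np.mean(np.log2(1.0 + np.exp(nontarget_llrs)))
    return 0.5 * (c_tar + c_non)

# a detector outputting LLR = 0 for every trial carries no information -> Cllr = 1
print(cllr([0.0, 0.0], [0.0, 0.0]))
# confident, well-calibrated scores drive the cost toward 0
print(cllr([5.0, 6.0], [-5.0, -6.0]))
```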

624 citations


Journal ArticleDOI
TL;DR: An EER of 10.5% is obtained, indicating that speaker-specific excitation information is present in the residual phase, which is useful since it is complementary to that of MFCCs.
Abstract: The objective of this letter is to demonstrate the complementary nature of speaker-specific information present in the residual phase in comparison with the information present in the conventional mel-frequency cepstral coefficients (MFCCs). The residual phase is derived from the speech signal by linear prediction analysis. Speaker recognition studies are conducted on the NIST-2003 database using the proposed residual phase and the existing MFCC features. The speaker recognition system based on the residual phase gives an equal error rate (EER) of 22%, and the system using the MFCC features gives an EER of 14%. By combining the evidence from both the residual phase and the MFCC features, an EER of 10.5% is obtained, indicating that speaker-specific excitation information is present in the residual phase. This information is useful since it is complementary to that of MFCCs.
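
A minimal sketch of the two ingredients named here: the LP residual of a signal and the cosine of its analytic-signal phase (the "residual phase"), plus a simple weighted score fusion. The LP order, frame handling, and fusion weight are illustrative assumptions, not the letter's exact configuration.

```python
import numpy as np
from scipy.signal import hilbert, lfilter

def lp_residual(x, order=12):
    """Linear-prediction residual via the autocorrelation (Yule-Walker) method."""
    x = np.asarray(x, dtype=float)
    r = np.correlate(x, x, mode='full')[len(x) - 1:len(x) + order]
    # solve the Toeplitz normal equations R a = r for the predictor coefficients
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R + 1e-8 * np.eye(order), r[1:order + 1])
    # inverse filtering with A(z) = 1 - sum_k a_k z^-k gives the residual
    return lfilter(np.concatenate(([1.0], -a)), [1.0], x)

def residual_phase(x, order=12):
    """Cosine of the analytic-signal phase of the LP residual (the 'residual phase')."""
    res = lp_residual(x, order)
    analytic = hilbert(res)
    return np.real(analytic) / (np.abs(analytic) + 1e-12)

def fuse_scores(score_mfcc, score_phase, w=0.5):
    """Simple weighted score-level fusion of the two subsystems."""
    return w * score_mfcc + (1 - w) * score_phase

# toy usage on a synthetic voiced-like signal
t = np.arange(4000) / 8000.0
x = np.sin(2 * np.pi * 120 * t) + 0.1 * np.random.default_rng(2).normal(size=t.size)
print(residual_phase(x)[:5])
print(fuse_scores(0.8, 0.4))
```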

601 citations


Proceedings Article
01 Jan 2006
TL;DR: A practical procedure is described for applying WCCN to an SVM-based speaker recognition system where the input feature vectors reside in a high-dimensional space, achieving improvements of up to 22% in EER and 28% in minimum decision cost function (DCF) over the previous baseline.
Abstract: This paper extends the within-class covariance normalization (WCCN) technique described in [1, 2] for training generalized linear kernels. We describe a practical procedure for applying WCCN to an SVM-based speaker recognition system where the input feature vectors reside in a high-dimensional space. Our approach involves using principal component analysis (PCA) to split the original feature space into two subspaces: a low-dimensional “PCA space” and a high-dimensional “PCA-complement space.” After performing WCCN in the PCA space, we concatenate the resulting feature vectors with a weighted version of their PCA-complements. When applied to a state-of-the-art MLLR-SVM speaker recognition system, this approach achieves improvements of up to 22% in EER and 28% in minimum decision cost function (DCF) over our previous baseline. We also achieve substantial improvements over an MLLR-SVM system that performs WCCN in the PCA space but discards the PCA-complement.
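
A hedged sketch of the procedure as described: project features into a low-dimensional PCA space, apply WCCN there (via a Cholesky factor of the inverse within-class covariance), and concatenate a down-weighted PCA-complement. Dimensions and the complement weight are placeholders; the MLLR-SVM front end itself is not reproduced.

```python
import numpy as np
from numpy.linalg import cholesky, inv
from sklearn.decomposition import PCA

def train_wccn(features, labels):
    """Return B with B^T B = W^{-1}, where W is the within-class covariance."""
    X, y = np.asarray(features, float), np.asarray(labels)
    W = np.zeros((X.shape[1], X.shape[1]))
    for c in np.unique(y):
        Xc = X[y == c] - X[y == c].mean(axis=0)
        W += Xc.T @ Xc
    W /= len(X)
    return cholesky(inv(W + 1e-6 * np.eye(W.shape[0]))).T   # feature map x -> B x

def wccn_pca_transform(features, labels, n_pca=10, complement_weight=0.1):
    """Split into PCA space + PCA-complement, apply WCCN in the PCA space,
    then concatenate a down-weighted complement (illustrative weight)."""
    X = np.asarray(features, float)
    pca = PCA(n_components=n_pca).fit(X)
    low = pca.transform(X)                            # PCA space
    complement = X - pca.inverse_transform(low)       # PCA-complement space
    B = train_wccn(low, labels)
    return np.hstack([low @ B.T, complement_weight * complement])

# toy usage: 5 speakers x 20 sessions of 50-dim vectors
rng = np.random.default_rng(3)
labels = np.repeat(np.arange(5), 20)
X = rng.normal(size=(100, 50)) + labels[:, None]
print(wccn_pca_transform(X, labels).shape)            # (100, 60)
```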

461 citations


01 Jan 2006
TL;DR: A full account of the algorithms needed to carry out a joint factor analysis of speaker and session variability in a training set in which each speaker is recorded over many different channels and the practical limitations that will be encountered if these algorithms are implemented on very large data sets are discussed.
Abstract: We give a full account of the algorithms needed to carry out a joint factor analysis of speaker and session variability in a training set in which each speaker is recorded over many different channels, and we discuss the practical limitations that will be encountered if these algorithms are implemented on very large data sets. This article is intended as a companion to (1), where we presented a new type of likelihood ratio statistic for speaker verification which is designed principally to deal with the problem of inter-session variability, that is, the variability among recordings of a given speaker. This likelihood ratio statistic is based on a joint factor analysis of speaker and session variability in a training set in which each speaker is recorded over many different channels (such as one of the Switchboard II databases). Our purpose in the current article is to give detailed algorithms for carrying out such a factor analysis. Although we have only experimented with the applications of this model in speaker recognition, we will also explain how it could serve as an integrated framework for progressive speaker adaptation and on-line channel adaptation of HMM-based speech recognizers operating in situations where speaker identities are known. The joint factor analysis model can be viewed as a Gaussian distribution on speaker- and channel-dependent (or, more accurately, session-dependent) HMM supervectors in which most (but not all) of the variance in the supervector population is assumed to be accounted for by a small number of hidden variables which we refer to as speaker and channel factors. The speaker factors and the channel factors play different roles in that, for a given speaker, the values of the speaker factors are assumed to be the same for all recordings of the speaker, but the channel factors are assumed to vary from one recording to another. For example, the Gaussian distribution on speaker-dependent supervectors used in eigenvoice MAP (2) is a special case of the factor analysis model in which there are no channel factors and all of the variance in the speaker-dependent HMM supervectors is assumed to be accounted for by the speaker factors.
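
For reference, the supervector decomposition described in this overview is usually written as below; the symbols are the standard ones and are an assumption here, since the excerpt describes the factors only in words.

```latex
% Joint factor analysis model for a speaker- and session-dependent supervector
% (standard notation; the symbols are assumptions, not quoted from the article):
%   s        speaker,  h  recording (session)
%   m        speaker-independent UBM mean supervector
%   V, U     low-rank speaker- and channel-loading matrices
%   D        diagonal residual loading matrix
%   y_s, z_s speaker factors (shared by all recordings of speaker s)
%   x_{s,h}  channel factors (different for every recording)
\[
  M_{s,h} \;=\; m \;+\; V\,y_s \;+\; D\,z_s \;+\; U\,x_{s,h},
  \qquad y_s,\; z_s,\; x_{s,h} \sim \mathcal{N}(0, I).
\]
% Eigenvoice MAP corresponds to the special case with no channel factors
% (U = 0), all speaker variability being captured by V.
```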

440 citations


PatentDOI
TL;DR: In this paper, a real-time speech recognition system includes distributed processing across a client and server for recognizing a spoken query by a user, where the partitioning of responsibility for speech recognition operations can be done on a client by client or connection by connection basis.
Abstract: A real-time speech recognition system includes distributed processing across a client and server for recognizing a spoken query by a user. Both the client and server can dedicate a variable number of processing resources for performing speech recognition functions. The partitioning of responsibility for speech recognition operations can be done on a client by client or connection by connection basis.

279 citations


Journal ArticleDOI
TL;DR: This paper focuses on optimizing vector quantization (VQ) based speaker identification, reducing the number of test vectors by pre-quantizing the test sequence prior to matching, and the number of speakers by pruning out unlikely speakers during the identification process.
Abstract: In speaker identification, most of the computation originates from the distance or likelihood computations between the feature vectors of the unknown speaker and the models in the database. The identification time depends on the number of feature vectors, their dimensionality, the complexity of the speaker models and the number of speakers. In this paper, we concentrate on optimizing vector quantization (VQ) based speaker identification. We reduce the number of test vectors by pre-quantizing the test sequence prior to matching, and the number of speakers by pruning out unlikely speakers during the identification process. The best variants are then generalized to Gaussian mixture model (GMM) based modeling. We apply the algorithms also to efficient cohort set search for score normalization in speaker verification. We obtain a speed-up factor of 16:1 in the case of VQ-based modeling with minor degradation in the identification accuracy, and 34:1 in the case of GMM-based modeling. An equal error rate of 7% can be reached in 0.84 s on average when the length of test utterance is 30.4 s.
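
A toy sketch of the two speed-ups described: pre-quantize the test sequence to a small set of representative vectors, then score speakers in chunks while pruning the least likely ones. Codebook sizes, chunk length, and the pruning fraction are illustrative, not the paper's tuned values.

```python
import numpy as np
from sklearn.cluster import KMeans

def vq_distortion(vectors, codebook):
    """Average nearest-codeword distance of the vectors to a speaker codebook."""
    d = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()

def identify(test_vectors, codebooks, pre_q=32, chunk=10, keep=0.5):
    """VQ speaker identification with test-sequence pre-quantization and speaker pruning."""
    # 1) pre-quantize the test sequence to pre_q representative vectors
    reps = KMeans(n_clusters=pre_q, n_init=3, random_state=0).fit(test_vectors).cluster_centers_
    # 2) score in chunks, pruning the worst-scoring speakers after each chunk
    alive = list(codebooks.keys())
    scores = {s: 0.0 for s in alive}
    for start in range(0, len(reps), chunk):
        block = reps[start:start + chunk]
        for s in alive:
            scores[s] += vq_distortion(block, codebooks[s])
        alive = sorted(alive, key=lambda s: scores[s])[:max(1, int(len(alive) * keep))]
    return alive[0]

# toy usage: 4 "speakers" with codebooks trained on shifted Gaussian data
rng = np.random.default_rng(4)
codebooks = {s: KMeans(n_clusters=16, n_init=3, random_state=0)
                .fit(rng.normal(loc=s, size=(500, 12))).cluster_centers_ for s in range(4)}
test = rng.normal(loc=2, size=(300, 12))
print(identify(test, codebooks))    # expected: 2
```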

248 citations


Journal ArticleDOI
TL;DR: This paper describes recent advances in speaker diarization with a multistage segmentation and clustering system that incorporates a speaker identification step and builds upon the baseline audio partitioner used in the LIMSI broadcast news transcription system.
Abstract: This paper describes recent advances in speaker diarization with a multistage segmentation and clustering system, which incorporates a speaker identification step. This system builds upon the baseline audio partitioner used in the LIMSI broadcast news transcription system. The baseline partitioner provides a high cluster purity, but has a tendency to split data from speakers with a large quantity of data into several segment clusters. Several improvements to the baseline system have been made. First, the iterative Gaussian mixture model (GMM) clustering has been replaced by a Bayesian information criterion (BIC) agglomerative clustering. Second, an additional clustering stage has been added, using a GMM-based speaker identification method. Finally, a post-processing stage refines the segment boundaries using the output of a transcription system. On the National Institute of Standards and Technology (NIST) RT-04F and ESTER evaluation data, the multistage system reduces the speaker error by over 70% relative to the baseline system, and gives between 40% and 50% reduction relative to a single-stage BIC clustering system.
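
The BIC agglomerative clustering step mentioned here hinges on a delta-BIC merge criterion; a minimal sketch with full-covariance Gaussians and the standard penalty term is given below (the multistage system's other stages are not reproduced).

```python
import numpy as np

def delta_bic(x, y, penalty_lambda=1.0):
    """Delta-BIC between modeling x and y separately vs. merged (full-covariance Gaussians).

    Negative values favor merging the two segments (same speaker);
    positive values favor keeping them separate (speaker change).
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    z = np.vstack([x, y])
    n_x, n_y, n_z = len(x), len(y), len(z)
    d = z.shape[1]

    def logdet_cov(a):
        cov = np.cov(a, rowvar=False) + 1e-6 * np.eye(d)
        return np.linalg.slogdet(cov)[1]

    # difference in log-likelihood terms: one merged model vs. two separate models
    gain = 0.5 * (n_z * logdet_cov(z) - n_x * logdet_cov(x) - n_y * logdet_cov(y))
    # BIC penalty for the extra parameters of keeping two models
    n_params = d + d * (d + 1) / 2
    return gain - penalty_lambda * 0.5 * n_params * np.log(n_z)

# same distribution -> negative (merge); different means -> positive (split)
rng = np.random.default_rng(5)
a, b = rng.normal(size=(200, 13)), rng.normal(size=(200, 13))
c = rng.normal(loc=3.0, size=(200, 13))
print(delta_bic(a, b) < 0, delta_bic(a, c) > 0)
```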

217 citations


Patent
13 Apr 2006
TL;DR: In this paper, a speaker-independent context-sensitive speech recognition module contains a vocabulary of available speech commands, which are combined with input from a controller device to control actions of a character or characters in the game environment.
Abstract: In a gaming system, a user controls actions of characters in the game environment using speech commands. In a learning mode, available speech commands are displayed in a command menu on a display device. In a non-learning mode, the available speech commands are not displayed. A speaker-independent context-sensitive speech recognition module contains a vocabulary of available speech commands. Use of speech commands is combined with input from a controller device to control actions of a character or characters in the game environment.

Patent
Jung-Eun Kim, Jeong-Su Kim
16 Feb 2006
TL;DR: In this paper, a user adaptive speech recognition method and apparatus is disclosed that controls user confirmation of a recognition candidate using a new threshold value adapted to a user, which includes calculating a confidence score of recognition candidate according to the result of speech recognition.
Abstract: A user adaptive speech recognition method and apparatus is disclosed that controls user confirmation of a recognition candidate using a new threshold value adapted to a user. The user adaptive speech recognition method includes calculating a confidence score of a recognition candidate according to the result of speech recognition, setting a new threshold value adapted to the user based on a result of user confirmation of the recognition candidate and the confidence score of the recognition candidate, and outputting a corresponding recognition candidate as a result of the speech recognition if the calculated confidence score is higher than the new threshold value. Thus, the need for user confirmation of the result of speech recognition is reduced and the probability of speech recognition success is increased.
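
A small sketch of the mechanism the abstract describes: confirm only low-confidence candidates and nudge a per-user threshold from the confirmation outcomes. The update rule and step size below are invented for illustration; the patent does not specify them.

```python
def should_confirm(confidence, threshold):
    """Ask the user to confirm only when the score falls below the current threshold."""
    return confidence < threshold

def update_threshold(threshold, confidence, user_accepted, step=0.05):
    """Nudge the per-user threshold from confirmation outcomes (illustrative rule)."""
    if user_accepted and confidence < threshold:
        return threshold - step          # user keeps accepting low-scoring results: relax
    if not user_accepted and confidence >= threshold:
        return threshold + step          # user rejects results we trusted: tighten
    return threshold

# toy usage: three recognition results with user feedback
threshold = 0.6
for conf, accepted in [(0.55, True), (0.50, True), (0.65, False)]:
    threshold = update_threshold(threshold, conf, accepted)
print(round(threshold, 2))
```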

Patent
26 May 2006
TL;DR: In this article, a method for authenticating a user based on the phrase, the biometric voice print, and the device identifier is presented. But the method is limited to a single user and cannot be used to authenticate multiple users.
Abstract: A method (700) and system (900) for authenticating a user is provided. The method can include receiving one or more spoken utterances from a user (702), recognizing a phrase corresponding to one or more spoken utterances (704), identifying a biometric voice print of the user from one or more spoken utterances of the phrase (706), determining a device identifier associated with the device (708), and authenticating the user based on the phrase, the biometric voice print, and the device identifier (710). A location of the handset or the user can be employed as criteria for granting access to one or more resources (712).

Journal ArticleDOI
TL;DR: Speaker recognition studies on the NIST 2002 database demonstrate that, even though the recognition performance from the excitation information alone is poor, combining it with evidence from vocal tract information yields a significant improvement in performance.

Patent
27 Oct 2006
TL;DR: In this article, a privacy sound may be based on the speaker's own voice or on another voice; a characteristic of the speaker may be used to access a database of the speaker's own or another's voice and to form one or more voice streams that make up the privacy sound.
Abstract: A privacy apparatus adds a privacy sound into the environment, thereby confusing listeners as to which of the sounds is the real source. The privacy sound may be based on the speaker's own voice or may be based on another voice. At least one characteristic of the speaker (such as a characteristic of the speaker's speech) may be identified. The characteristic may then be used to access a database of the speaker's own voice or another's voice, and to form one or more voice streams to form the privacy sound. The privacy sound may thus permit disruption of the ability to understand the source speech of the user by eliminating segregation cues that the auditory system uses to interpret speech.

Journal ArticleDOI
01 Nov 2006
TL;DR: The main components of audio-visual biometric systems are described, existing systems and their performance are reviewed, and future research and development directions in this area are discussed.
Abstract: Biometric characteristics can be utilized in order to enable reliable and robust-to-impostor-attacks person recognition. Speaker recognition technology is commonly utilized in various systems enabling natural human computer interaction. The majority of the speaker recognition systems rely only on acoustic information, ignoring the visual modality. However, visual information conveys correlated and complementary information to the audio information and its integration into a recognition system can potentially increase the system's performance, especially in the presence of adverse acoustic conditions. Acoustic and visual biometric signals, such as the person's voice and face, can be obtained using unobtrusive and user-friendly procedures and low-cost sensors. Developing unobtrusive biometric systems makes biometric technology more socially acceptable and accelerates its integration into everyday life. In this paper, we describe the main components of audio-visual biometric systems, review existing systems and their performance, and discuss future research and development directions in this area.

Journal ArticleDOI
TL;DR: The Bayesian framework for interpretation of evidence when applied to forensic speaker recognition is introduced, and original contributions for the robust estimation of likelihood ratios are fully described, including TDLRA (target dependent likelihood ratio alignment), oriented to guarantee the presumption of innocence of suspected but non-perpetrator speakers.

Book ChapterDOI
13 Dec 2006
TL;DR: The paper describes the design, collection, transcription and analysis of 200 hours of HKUST Mandarin Telephone Speech Corpus (HKUST/MTS), the largest and first of its kind for Mandarin conversational telephone speech, providing abundant and diversified samples for Mandarin speech recognition and other application-dependent tasks.
Abstract: The paper describes the design, collection, transcription and analysis of 200 hours of HKUST Mandarin Telephone Speech Corpus (HKUST/MTS) from over 2100 Mandarin speakers in mainland China under the DARPA EARS framework. The corpus includes speech data, transcriptions and speaker demographic information. The speech data include 1206 ten-minute natural Mandarin conversations between either strangers or friends. Each conversation focuses on a single topic. All calls are recorded over public telephone networks. All calls are manually annotated with standard Chinese characters (GBK) as well as specific mark-ups for spontaneous speech. A file with speaker demographic information is also provided. The corpus is the largest and first of its kind for Mandarin conversational telephone speech, providing abundant and diversified samples for Mandarin speech recognition and other application-dependent tasks, such as topic detection, information retrieval, keyword spotting, speaker recognition, etc. In a 2004 evaluation test by NIST, the corpus is found to improve system performance quite significantly.

Patent
30 Mar 2006
TL;DR: In this article, an architecture is presented that leverages discrepancies between user model predictions and speech recognition results by identifying discrepancies between the predictive data and the speech recognition data and repairing the data based in part on the discrepancy.
Abstract: An architecture is presented that leverages discrepancies between user model predictions and speech recognition results by identifying discrepancies between the predictive data and the speech recognition data and repairing the data based in part on the discrepancy. User model predictions predict what goal or action speech application users are likely to pursue based in part on past user behavior. Speech recognition results indicate what goal speech application users are likely to have spoken based in part on words spoken under specific constraints. Discrepancies between the predictive data and the speech recognition data are identified and a dialog repair is engaged for repairing these discrepancies. By engaging in repairs when there is a discrepancy between the predictive results and the speech recognition results, and utilizing feedback obtained via interaction with a user, the architecture can learn about the reliability of both user model predictions and speech recognition results for future processing.

Proceedings ArticleDOI
17 Sep 2006
TL;DR: An algorithm is proposed for the recognition and separation of speech signals in non-stationary noise, such as another speaker, combining hidden Markov models trained for the speech and noise into a factorial HMM that models the mixture signal.
Abstract: This paper proposes an algorithm for the recognition and separation of speech signals in non-stationary noise, such as another speaker. We present a method to combine hidden Markov models (HMMs) trained for the speech and noise into a factorial HMM to model the mixture signal. Robustness is obtained by separating the speech and noise signals in a feature domain, which discards unnecessary information. We use mel-cepstral coefficients (MFCCs) as features, and estimate the distribution of mixture MFCCs from the distributions of the target speech and noise. A decoding algorithm is proposed for finding the state transition paths and estimating gains for the speech and noise from a mixture signal. Simulations were carried out using speech material where two speakers were mixed at various levels, and even for high noise level (9 dB above the speech level), the method produced relatively good (60% word recognition accuracy) results. Audio demonstrations are available at www.cs.tut.fi/˜tuomasv. Index Terms: speech recognition, speech separation, factorial hidden Markov model.
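
A compact sketch of the factorial-HMM idea: decode jointly over the product of a speech model's and a noise model's state spaces. For simplicity it scores frames with a max-model approximation on log-spectra rather than the paper's estimated mixture-MFCC distributions, and the tiny two-state models are purely illustrative.

```python
import numpy as np

def mix_loglik(obs, mean_a, mean_b, var=1.0):
    """Max-model approximation: a mixture log-spectrum is close to the
    elementwise max of the two sources' log-spectra."""
    pred = np.maximum(mean_a, mean_b)
    return -0.5 * np.sum((obs - pred) ** 2) / var

def factorial_viterbi(obs_seq, means_a, trans_a, means_b, trans_b):
    """Joint Viterbi over the product state space of two HMMs (speech x noise)."""
    Na, Nb, T = len(means_a), len(means_b), len(obs_seq)
    # logA[pi, pj, i, j] = log p(i|pi) + log p(j|pj)
    logA = np.log(trans_a)[:, None, :, None] + np.log(trans_b)[None, :, None, :]
    delta = np.full((T, Na, Nb), -np.inf)
    psi = np.zeros((T, Na, Nb, 2), dtype=int)
    for i in range(Na):
        for j in range(Nb):
            delta[0, i, j] = mix_loglik(obs_seq[0], means_a[i], means_b[j])
    for t in range(1, T):
        for i in range(Na):
            for j in range(Nb):
                scores = delta[t - 1] + logA[:, :, i, j]
                pi, pj = np.unravel_index(np.argmax(scores), scores.shape)
                delta[t, i, j] = scores[pi, pj] + mix_loglik(obs_seq[t], means_a[i], means_b[j])
                psi[t, i, j] = (pi, pj)
    path = [np.unravel_index(np.argmax(delta[-1]), (Na, Nb))]
    for t in range(T - 1, 0, -1):
        path.append(tuple(psi[t, path[-1][0], path[-1][1]]))
    return path[::-1]          # list of (speech_state, noise_state) per frame

# toy usage: 2-state "speech" and 2-state "noise" models over 4-dim log-spectra
means_a = np.array([[0.0, 3.0, 0.0, 0.0], [0.0, 0.0, 3.0, 0.0]])
means_b = np.array([[2.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 2.0]])
trans = np.array([[0.9, 0.1], [0.1, 0.9]])
obs = [np.maximum(means_a[0], means_b[1]), np.maximum(means_a[1], means_b[1])]
print(factorial_viterbi(obs, means_a, trans, means_b, trans))
```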

Proceedings Article
01 Jan 2006
TL;DR: A system for model based speech separation which achieves super-human recognition performance when two talkers speak at similar levels and incorporates a novel method for performing two-talker speaker identification and gain estimation is described.
Abstract: We describe a system for model based speech separation which achieves super-human recognition performance when two talkers speak at similar levels. The system can separate the speech of two speakers from a single channel recording with remarkable results. It incorporates a novel method for performing two-talker speaker identification and gain estimation. We extend the method of model based high resolution signal reconstruction to incorporate temporal dynamics. We report on two methods for introducing dynamics; the first uses dynamics in the acoustic model space, the second incorporates dynamics based on sentence grammar. The addition of temporal constraints leads to dramatic improvements in the separation performance. Once the signals have been separated they are then recognized using speaker dependent labeling.

Journal ArticleDOI
TL;DR: The proposed algorithm is based on speaker adaptation techniques, adapting the conversion parameters derived for a particular pair of speakers to a different pair, for which only a nonparallel corpus is available.
Abstract: The objective of voice conversion algorithms is to modify the speech by a particular source speaker so that it sounds as if spoken by a different target speaker. Current conversion algorithms employ a training procedure, during which the same utterances spoken by both the source and target speakers are needed for deriving the desired conversion parameters. Such a (parallel) corpus is often difficult or impossible to collect. Here, we propose an algorithm that relaxes this constraint, i.e., the training corpus does not necessarily contain the same utterances from both speakers. The proposed algorithm is based on speaker adaptation techniques, adapting the conversion parameters derived for a particular pair of speakers to a different pair, for which only a nonparallel corpus is available. We show that adaptation reduces the error obtained when simply applying the conversion parameters of one pair of speakers to another by a factor that can reach 30%. A speaker identification measure is also employed that more insightfully portrays the importance of adaptation, while listening tests confirm the success of our method. Both the objective and subjective tests employed demonstrate that the proposed algorithm achieves comparable results with the ideal case when a parallel corpus is available.

Proceedings ArticleDOI
28 Jun 2006
TL;DR: A simple global calibration metric is proposed that can be generally applied to a multiple-hypothesis problem and it is demonstrated experimentally on some NIST-LRE-'05 data how this relates to the calibration of some of the derived binary-hypotheses sub-problems.
Abstract: Recent publications have examined the topic of calibration of confidence scores in the field of (binary-hypothesis) speaker detection. We extend this topic to the case of multiple-hypothesis language recognition. We analyze the structure of multiple-hypothesis recognition problems to show that any such problem subsumes a multitude of derived sub-problems and that therefore the calibrations of all of these problems are interrelated. We propose a simple global calibration metric that can be generally applied to a multiple-hypothesis problem and then demonstrate experimentally on some NIST-LRE-'05 data how this relates to the calibration of some of the derived binary-hypotheses sub-problems.

Journal ArticleDOI
TL;DR: Experimental results using a hidden-Markov-model-based recognition system indicate that using explicit lip motion information provides additional performance gains in both applications, and lip motion features prove more valuable in the case of the speech-reading application.
Abstract: There have been several studies that jointly use audio, lip intensity, and lip geometry information for speaker identification and speech-reading applications. This paper proposes using explicit lip motion information, instead of or in addition to lip intensity and/or geometry information, for speaker identification and speech-reading within a unified feature selection and discrimination analysis framework, and addresses two important issues: 1) Is using explicit lip motion information useful, and, 2) if so, what are the best lip motion features for these two applications? The best lip motion features for speaker identification are considered to be those that result in the highest discrimination of individual speakers in a population, whereas for speech-reading, the best features are those providing the highest phoneme/word/phrase recognition rate. Several lip motion feature candidates have been considered including dense motion features within a bounding box about the lip, lip contour motion features, and combination of these with lip shape features. Furthermore, a novel two-stage, spatial, and temporal discrimination analysis is introduced to select the best lip motion features for speaker identification and speech-reading applications. Experimental results using a hidden-Markov-model-based recognition system indicate that using explicit lip motion information provides additional performance gains in both applications, and lip motion features prove more valuable in the case of the speech-reading application.

Journal ArticleDOI
TL;DR: Important aspects of Technical Forensic Speaker Recognition, particularly those associated with evidence, are exemplified and critically discussed, and comparisons are drawn with generic Speaker Recognition.

Journal ArticleDOI
TL;DR: A general maximum likelihood framework is proposed for transform estimation which avoids the need for parallel training data inherent in conventional least mean square approaches and shows that the proposed approaches are capable of effectively transforming speaker identity whilst maintaining high quality.
Abstract: Voice morphing is a technique for modifying a source speaker's speech to sound as if it was spoken by some designated target speaker. The core process in a voice morphing system is the transformation of the spectral envelope of the source speaker to match that of the target speaker, and linear transformations estimated from time-aligned parallel training data are commonly used to achieve this. However, the naive application of envelope transformation combined with the necessary pitch and duration modifications will result in noticeable artifacts. This paper studies the linear transformation approach to voice morphing and investigates these two specific issues. First, a general maximum likelihood framework is proposed for transform estimation which avoids the need for parallel training data inherent in conventional least mean square approaches. Second, the main causes of artifacts are identified as being due to glottal coupling, unnatural phase dispersion and the high spectral variance of unvoiced sounds, and compensation techniques are developed to mitigate these. The resulting voice morphing system is evaluated using both subjective and objective measures. These tests show that the proposed approaches are capable of effectively transforming speaker identity whilst maintaining high quality. Furthermore, they do not require carefully prepared parallel training data.

Dissertation
21 Dec 2006
TL;DR: In this thesis, the existing hierarchical bottom-up mono-channel speaker diarization system is extended with flexible acoustic beamforming, which extracts speaker location information and produces a single enhanced signal from all available microphones; the enhanced signal is then used for speaker segmentation and clustering.
Abstract: This thesis presents research into the topic of speaker diarization for meeting rooms. It looks into the algorithms and the implementation of an offline speaker segmentation and clustering system for a meeting recording where usually more than one microphone is available. The main research and system implementation was done while visiting the International Computer Science Institute (ICSI, Berkeley, California) for a period of two years. Speaker diarization is a well studied topic in the domain of broadcast news recordings. Most of the proposed systems involve some sort of hierarchical clustering of the data into clusters, where the optimum number of speakers or their identities are unknown a priori. A very commonly used method is called bottom-up clustering, where multiple initial clusters are iteratively merged until the optimum number of clusters is reached, according to some stopping criterion. Such systems are based on a single channel input, not allowing a direct application to the meetings domain. Although some efforts have been made to adapt such systems to multichannel data, at the start of this thesis no effective implementation had been proposed. Furthermore, many of these speaker diarization algorithms involve some sort of model training or parameter tuning using external data, which impedes their usability with data different from what they have been adapted to. The implementation proposed in this thesis works towards solving the aforementioned problems. Taking the existing hierarchical bottom-up mono-channel speaker diarization system from ICSI, it first uses a flexible acoustic beamforming to extract speaker location information and obtain a single enhanced signal from all available microphones. It then implements a train-free speech/non-speech detection on this signal and processes the resulting speech segments with an improved version of the mono-channel speaker diarization system. That system has been modified to use speaker location information (now available), and several algorithms have been adapted or newly created to adapt the system behavior to each particular recording by obtaining information directly from the acoustics, making it less dependent on the development data. The resulting system is flexible to any meeting room layout regarding the number of microphones and their placement. It is train-free, making it easy to adapt to different sorts of data and domains of application. Finally, it takes a step forward into the use of parameters that are more robust to changes in the acoustic data. Two versions of the system were submitted, with excellent results, to the RT05s and RT06s NIST Rich Transcription evaluations for meetings, where data from two different subdomains (lectures and conferences) was evaluated. Also, experiments using the RT datasets from all meeting evaluations were used to test the different proposed algorithms, proving their suitability to the task.
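
The multichannel front end described above is an acoustic beamformer; a minimal delay-and-sum sketch with GCC-PHAT delay estimation is shown below as a generic stand-in (the thesis' flexible beamforming and channel weighting are not reproduced, and all sizes are illustrative).

```python
import numpy as np

def gcc_phat_delay(ref, sig, max_delay):
    """Lag (in samples) by which sig should be circularly shifted to best align with ref."""
    n = 1 << int(np.ceil(np.log2(len(ref) + len(sig))))
    R = np.fft.rfft(ref, n) * np.conj(np.fft.rfft(sig, n))
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n)       # PHAT-weighted cross-correlation
    cc = np.concatenate((cc[-max_delay:], cc[:max_delay + 1]))
    return int(np.argmax(np.abs(cc))) - max_delay

def delay_and_sum(channels, ref_index=0, max_delay=400):
    """Align every channel to the reference and average (delay-and-sum beamforming)."""
    ref = channels[ref_index]
    out = np.zeros_like(ref, dtype=float)
    for ch in channels:
        d = gcc_phat_delay(ref, ch, max_delay)
        out += np.roll(ch, d)                            # crude alignment by circular shift
    return out / len(channels)

# toy usage: the same source arrives at 3 microphones with different delays plus noise
rng = np.random.default_rng(6)
src = rng.normal(size=8000)
mics = [np.roll(src, d) + 0.3 * rng.normal(size=8000) for d in (0, 25, -40)]
enhanced = delay_and_sum(mics)
print(np.corrcoef(enhanced, src)[0, 1] > np.corrcoef(mics[1], src)[0, 1])   # True
```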

Journal ArticleDOI
TL;DR: A new algorithm is proposed for audio classification, which is based on weighted GMM Networks (WGN), and a new false alarm compensation procedure is implemented, which can compensate the false alarm rate significantly with little cost to the miss rate.
Abstract: The problem of unsupervised audio classification and segmentation continues to be a challenging research problem which significantly impacts automatic speech recognition (ASR) and spoken document retrieval (SDR) performance. This paper addresses novel advances in 1) audio classification for speech recognition and 2) audio segmentation for unsupervised multispeaker change detection. A new algorithm is proposed for audio classification, which is based on weighted GMM Networks (WGN). Two new extended-time features, variance of the spectrum flux (VSF) and variance of the zero-crossing rate (VZCR), are used to preclassify the audio and supply weights to the output probabilities of the GMM networks. The classification is then implemented using weighted GMM networks. Since historically there have been no features specifically designed for audio segmentation, we evaluate 16 potential features including three new proposed features: perceptual minimum variance distortionless response (PMVDR), smoothed zero-crossing rate (SZCR), and filterbank log energy coefficients (FBLC) in 14 noisy environments to determine the best robust features on the average across these conditions. Next, a new distance metric, T²-mean, is proposed which is intended to improve segmentation for short segment turns (i.e., 1-5 s). A new false alarm compensation procedure is implemented, which can compensate the false alarm rate significantly with little cost to the miss rate. Evaluations on a standard data set, the Defense Advanced Research Projects Agency (DARPA) Hub4 Broadcast News 1997 evaluation data, show that the WGN classification algorithm achieves over a 50% improvement versus the GMM network baseline algorithm, and the proposed compound segmentation algorithm achieves 23%-10% improvement in all metrics versus the baseline Mel-frequency cepstral coefficients (MFCC) and traditional Bayesian information criterion (BIC) algorithm. The new classification and segmentation algorithms also obtain very satisfactory results on the more diverse and challenging National Gallery of the Spoken Word (NGSW) corpus.
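
A small sketch of the two extended-time pre-classification features named here, variance of the spectrum flux (VSF) and variance of the zero-crossing rate (VZCR), computed over a sliding window. Frame and hop sizes are illustrative; the WGN classifier itself is not reproduced.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Slice a 1-D signal into overlapping frames."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n)])

def variance_of_spectrum_flux(x, frame_len=400, hop=160):
    """VSF: variance of the frame-to-frame change of the magnitude spectrum."""
    frames = frame_signal(x, frame_len, hop) * np.hanning(frame_len)
    mag = np.abs(np.fft.rfft(frames, axis=1))
    flux = np.linalg.norm(np.diff(mag, axis=0), axis=1)
    return float(np.var(flux))

def variance_of_zcr(x, frame_len=400, hop=160):
    """VZCR: variance of the per-frame zero-crossing rate."""
    frames = frame_signal(x, frame_len, hop)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return float(np.var(zcr))

# toy comparison of an amplitude-modulated tone vs. stationary noise
rng = np.random.default_rng(7)
t = np.arange(16000) / 16000.0
speechish = np.sin(2 * np.pi * 150 * t) * (1 + np.sign(np.sin(2 * np.pi * 3 * t)))
noise = rng.normal(size=16000)
print(variance_of_spectrum_flux(speechish), variance_of_spectrum_flux(noise))
print(variance_of_zcr(speechish), variance_of_zcr(noise))
```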

Book
01 Jan 2006
TL;DR: This book discusses Speech Recognition with HMMs, Alternative Representations of the LPC Coefficients, Front-end Processing for Robust Feature Extraction, and a Review of Channel Coding Techniques.
Abstract: Forward. Preface. 1 Introduction. 1.1 Introduction. 1.2 RSR over Digital Channels. 1.3 Organization of the Book. 2 Speech Recognition with HMMs. 2.1 Introduction. 2.2 Some General Issues. 2.3 Analysis of Speech Signals. 2.4 Vector Quantization. 2.5 Approaches to ASR. 2.6 Hidden Markov Models. 2.7 Application of HMMs to Speech Recognition. 2.8 Model Adaptation. 2.9 Dealing with Uncertainty. 3 Networks and Degradation. 3.1 Introduction. 3.2 Mobile and Wireless Networks. 3.3 IP Networks. 3.4 The Acoustic Environment. 4 Speech Compression and Architectures for RSR. 4.1 Introduction. 4.2 Speech Coding. 4.3 Recognition from Decoded Speech. 4.4 Recognition from Codec Parameters. 4.5 Distributed Speech Recognition. 4.6 Comparison between NSR and DSR. 5 Robustness Against Transmission Channel Errors. 5.1 Introduction. 5.2 Channel Coding Techniques. 5.3 Error Concealment (EC). 6 Front-end Processing for Robust Feature Extraction. 6.1 Introduction. 6.2 Noise Reduction Techniques. 6.3 Voice Activity Detection. 6.4 Feature Normalization. 7 Standards for Distributed Speech Recognition. 7.1 Introduction. 7.2 Signal Preprocessing. 7.3 Feature Extraction. 7.4 Feature Compression and Encoding. 7.5 Feature Decoding and Postprocessing. A Alternative Representations of the LPC Coefficients. B Basic Digital Modulation Concepts. C Review of Channel Coding Techniques. C.1 Media-independent FEC. C.2 Interleaving. Bibliography. List of Acronyms. Index.

Proceedings ArticleDOI
28 Jun 2006
TL;DR: This paper compares channel variability modeling in the usual Gaussian mixture model domain, and the proposed feature domain compensation technique, and shows that the two approaches lead to similar results on the NIST 2005 speaker recognition evaluation data.
Abstract: The variability of the channel and environment is one of the most important factors affecting the performance of text-independent speaker verification systems. The best techniques for channel compensation are model based. Most of them have been proposed for Gaussian Mixture Models, while in the feature domain typically blind channel compensation is performed. The aim of this work is to explore techniques that allow more accurate channel compensation in the domain of the features. Compensating the features rather than the models has the advantage that the transformed parameters can be used with models of different nature and complexity, and also for different tasks. In this paper we evaluate the effects of the compensation of the channel variability obtained by means of the channel factors approach. In particular, we compare channel variability modeling in the usual Gaussian Mixture model domain, and our proposed feature domain compensation technique. We show that the two approaches lead to similar results on the NIST 2005 Speaker Recognition Evaluation data. Moreover, the quality of the transformed features is also assessed in the Support Vector Machines framework for speaker recognition on the same data, and in preliminary experiments on Language Identification.
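
A rough sketch of the feature-domain compensation idea: estimate a channel offset in a low-rank subspace of the GMM supervector and subtract its posterior-weighted, per-component projection from each frame. The loading matrix and channel-factor estimate below are random placeholders; the paper's actual factor-analysis estimation is not shown.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def feature_domain_compensation(frames, ubm, U, x):
    """Subtract the per-frame projection of the estimated channel offset.

    U:  (C*D, R) channel loading matrix over the stacked supervector.
    x:  (R,)     point estimate of the channel factors for this recording.
    Each frame is shifted by the posterior-weighted, per-component slice of U x.
    """
    C, D = ubm.means_.shape
    offset = (U @ x).reshape(C, D)            # per-component channel offset
    post = ubm.predict_proba(frames)          # (T, C) occupation probabilities
    return frames - post @ offset             # o_t - sum_c gamma_c(t) * offset_c

# toy usage with a random loading matrix and channel-factor estimate (illustrative)
rng = np.random.default_rng(8)
ubm = GaussianMixture(n_components=4, covariance_type='diag', random_state=0)
ubm.fit(rng.normal(size=(1000, 8)))
U = rng.normal(scale=0.1, size=(4 * 8, 2))
x_hat = rng.normal(size=2)                    # would come from factor-analysis estimation
frames = rng.normal(size=(200, 8))
print(feature_domain_compensation(frames, ubm, U, x_hat).shape)   # (200, 8)
```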