
Showing papers on "Speaker recognition published in 2006"


Journal ArticleDOI
TL;DR: This work examines the idea of using the GMM supervector in a support vector machine (SVM) classifier and proposes two new SVM kernels based on distance metrics between GMM models that produce excellent classification accuracy in a NIST speaker recognition evaluation task.
Abstract: Gaussian mixture models (GMMs) have proven extremely successful for text-independent speaker recognition. The standard training method for GMM models is to use MAP adaptation of the means of the mixture components based on speech from a target speaker. Recent methods in compensation for speaker and channel variability have proposed the idea of stacking the means of the GMM model to form a GMM mean supervector. We examine the idea of using the GMM supervector in a support vector machine (SVM) classifier. We propose two new SVM kernels based on distance metrics between GMM models. We show that these SVM kernels produce excellent classification accuracy in a NIST speaker recognition evaluation task.
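
As a rough illustration of the pipeline this abstract describes, the sketch below MAP-adapts only the means of a small universal background GMM to an utterance, stacks them into a supervector, and feeds the supervectors to a linear SVM. It uses scikit-learn and synthetic data; the relevance factor, dimensions, and toy features are assumptions, and the paper's specific distance-based kernels are not reproduced.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

def map_adapt_means(ubm, frames, relevance=16.0):
    """MAP-adapt only the UBM means to one utterance (standard relevance-factor rule)."""
    post = ubm.predict_proba(frames)             # (T, C) component responsibilities
    n_c = post.sum(axis=0)                       # soft counts per component
    f_c = post.T @ frames                        # first-order statistics, shape (C, D)
    e_c = f_c / np.maximum(n_c[:, None], 1e-10)  # posterior mean per component
    alpha = (n_c / (n_c + relevance))[:, None]   # adaptation coefficients
    return alpha * e_c + (1.0 - alpha) * ubm.means_

def supervector(ubm, frames):
    """Stack the adapted means into a single GMM mean supervector."""
    return map_adapt_means(ubm, frames).ravel()

# toy example with random "feature" frames (illustration only)
rng = np.random.default_rng(0)
ubm = GaussianMixture(n_components=8, covariance_type='diag', random_state=0)
ubm.fit(rng.normal(size=(2000, 12)))             # background data

X = np.array([supervector(ubm, rng.normal(loc=s % 2, size=(300, 12)))
              for s in range(20)])               # 20 utterances from 2 toy "speakers"
y = np.array([s % 2 for s in range(20)])
clf = SVC(kernel='linear').fit(X, y)             # linear kernel on the supervectors
print(clf.score(X, y))
```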

1,081 citations


Journal ArticleDOI
TL;DR: An overview of the approaches currently used in a key area of audio diarization, namely speaker diarization, is provided, and their relative merits and limitations are discussed.
Abstract: Audio diarization is the process of annotating an input audio channel with information that attributes (possibly overlapping) temporal regions of signal energy to their specific sources. These sources can include particular speakers, music, background noise sources, and other signal source/channel characteristics. Diarization can be used for helping speech recognition, facilitating the searching and indexing of audio archives, and increasing the richness of automatic transcriptions, making them more readable. In this paper, we provide an overview of the approaches currently used in a key area of audio diarization, namely speaker diarization, and discuss their relative merits and limitations. Performances using the different techniques are compared within the framework of the speaker diarization task in the DARPA EARS Rich Transcription evaluations. We also look at how the techniques are being introduced into real broadcast news systems and their portability to other domains and tasks such as meetings and speaker verification.

634 citations


Proceedings ArticleDOI
14 May 2006
TL;DR: A support vector machine kernel is constructed using the GMM supervector and similarities based on this kernel between the method of SVM nuisance attribute projection (NAP) and the recent results in latent factor analysis are shown.
Abstract: Gaussian mixture models with universal backgrounds (UBMs) have become the standard method for speaker recognition. Typically, a speaker model is constructed by MAP adaptation of the means of the UBM. A GMM supervector is constructed by stacking the means of the adapted mixture components. A recent discovery is that latent factor analysis of this GMM supervector is an effective method for variability compensation. We consider this GMM supervector in the context of support vector machines. We construct a support vector machine kernel using the GMM supervector. We show similarities based on this kernel between the method of SVM nuisance attribute projection (NAP) and the recent results in latent factor analysis. Experiments on a NIST SRE 2005 corpus demonstrate the effectiveness of the new technique.
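
The NAP idea mentioned above can be sketched as removing the leading within-speaker (session) directions from the supervectors before SVM training. The estimator below, which takes the top eigenvectors of the within-speaker scatter as the nuisance subspace, is a minimal stand-in for the paper's formulation; the data layout and subspace size are assumptions.

```python
import numpy as np

def nap_projection(supervectors, speaker_ids, n_nuisance=2):
    """Estimate a NAP projection P = I - U U^T from within-speaker variation.

    U spans the leading directions of session variability, found here by an
    eigen-decomposition (SVD) of the within-speaker scatter of the supervectors.
    """
    X = np.asarray(supervectors, dtype=float)
    ids = np.asarray(speaker_ids)
    # remove each speaker's mean so only within-speaker (channel) variation remains
    centered = np.vstack([X[ids == s] - X[ids == s].mean(axis=0)
                          for s in np.unique(ids)])
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    U = vt[:n_nuisance].T                        # (dim, n_nuisance) nuisance subspace
    def project(v):
        return v - U @ (U.T @ v)                 # P v = (I - U U^T) v
    return project

# toy usage: 3 speakers x 4 sessions, 50-dim supervectors sharing one channel direction
rng = np.random.default_rng(1)
channel = rng.normal(size=50)
spk = np.repeat(np.arange(3), 4)
sv = np.array([rng.normal(size=50) + s + rng.normal() * channel for s in spk])
project = nap_projection(sv, spk, n_nuisance=1)
print(np.round(project(sv[0])[:5], 3))
```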

625 citations


Journal ArticleDOI
TL;DR: The metric that is proposed is an information-theoretic one, measuring the effective amount of information that the speaker detector delivers to the user; it is appropriate for the evaluation of application-independent detectors, which output soft decisions in the form of log-likelihood ratios rather than hard decisions.
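
The information-theoretic measure described in this entry is widely known as Cllr (the cost of log-likelihood-ratios); the formula below is its commonly cited form, stated here as an assumption since the entry itself does not spell it out.

```python
import numpy as np

def cllr(target_llrs, nontarget_llrs):
    """Cost of log-likelihood-ratios (Cllr), in bits per trial.

    Measures how much effective information well-calibrated LLRs deliver:
    0 for a perfect detector, 1 for a useless one that always outputs LLR = 0.
    """
    target_llrs = np.asarray(target_llrs, dtype=float)
    nontarget_llrs = np.asarray(nontarget_llrs, dtype=float)
    c_tar = np.mean(np.log2(1.0 + np.exp(-target_llrs)))
    c_non = np.mean(np.log2(1.0 + np.exp(nontarget_llrs)))
    return 0.5 * (c_tar + c_non)

# a detector outputting LLR = 0 for every trial carries no information -> Cllr = 1
print(cllr([0.0, 0.0], [0.0, 0.0]))
# confident, well-calibrated scores drive the cost toward 0
print(cllr([5.0, 6.0], [-5.0, -6.0]))
```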

624 citations


Journal ArticleDOI
TL;DR: An EER of 10.5% is obtained, indicating that speaker-specific excitation information is present in the residual phase, which is useful since it is complementary to that of MFCCs.
Abstract: The objective of this letter is to demonstrate the complementary nature of speaker-specific information present in the residual phase in comparison with the information present in the conventional mel-frequency cepstral coefficients (MFCCs). The residual phase is derived from the speech signal by linear prediction analysis. Speaker recognition studies are conducted on the NIST-2003 database using the proposed residual phase and the existing MFCC features. The speaker recognition system based on the residual phase gives an equal error rate (EER) of 22%, and the system using the MFCC features gives an EER of 14%. By combining the evidence from both the residual phase and the MFCC features, an EER of 10.5% is obtained, indicating that speaker-specific excitation information is present in the residual phase. This information is useful since it is complementary to that of MFCCs.
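
A minimal sketch of the two ingredients named here: the LP residual of a signal and the cosine of its analytic-signal phase (the "residual phase"), plus a simple weighted score fusion. The LP order, frame handling, and fusion weight are illustrative assumptions, not the letter's exact configuration.

```python
import numpy as np
from scipy.signal import hilbert, lfilter

def lp_residual(x, order=12):
    """Linear-prediction residual via the autocorrelation (Yule-Walker) method."""
    x = np.asarray(x, dtype=float)
    r = np.correlate(x, x, mode='full')[len(x) - 1:len(x) + order]
    # solve the Toeplitz normal equations R a = r for the predictor coefficients
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R + 1e-8 * np.eye(order), r[1:order + 1])
    # inverse filtering with A(z) = 1 - sum_k a_k z^-k gives the residual
    return lfilter(np.concatenate(([1.0], -a)), [1.0], x)

def residual_phase(x, order=12):
    """Cosine of the analytic-signal phase of the LP residual (the 'residual phase')."""
    res = lp_residual(x, order)
    analytic = hilbert(res)
    return np.real(analytic) / (np.abs(analytic) + 1e-12)

def fuse_scores(score_mfcc, score_phase, w=0.5):
    """Simple weighted score-level fusion of the two subsystems."""
    return w * score_mfcc + (1 - w) * score_phase

# toy usage on a synthetic voiced-like signal
t = np.arange(4000) / 8000.0
x = np.sin(2 * np.pi * 120 * t) + 0.1 * np.random.default_rng(2).normal(size=t.size)
print(residual_phase(x)[:5])
print(fuse_scores(0.8, 0.4))
```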

601 citations


Proceedings Article
01 Jan 2006
TL;DR: A practical procedure is described for applying WCCN to an SVM-based speaker recognition system where the input feature vectors reside in a high-dimensional space, achieving improvements of up to 22% in EER and 28% in minimum decision cost function (DCF) over the previous baseline.
Abstract: This paper extends the within-class covariance normalization (WCCN) technique described in [1, 2] for training generalized linear kernels. We describe a practical procedure for applying WCCN to an SVM-based speaker recognition system where the input feature vectors reside in a high-dimensional space. Our approach involves using principal component analysis (PCA) to split the original feature space into two subspaces: a low-dimensional “PCA space” and a high-dimensional “PCA-complement space.” After performing WCCN in the PCA space, we concatenate the resulting feature vectors with a weighted version of their PCA-complements. When applied to a state-of-the-art MLLR-SVM speaker recognition system, this approach achieves improvements of up to 22% in EER and 28% in minimum decision cost function (DCF) over our previous baseline. We also achieve substantial improvements over an MLLR-SVM system that performs WCCN in the PCA space but discards the PCA-complement.
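
A hedged sketch of the procedure as described: project features into a low-dimensional PCA space, apply WCCN there (via a Cholesky factor of the inverse within-class covariance), and concatenate a down-weighted PCA-complement. Dimensions and the complement weight are placeholders; the MLLR-SVM front end itself is not reproduced.

```python
import numpy as np
from numpy.linalg import cholesky, inv
from sklearn.decomposition import PCA

def train_wccn(features, labels):
    """Return B with B^T B = W^{-1}, where W is the within-class covariance."""
    X, y = np.asarray(features, float), np.asarray(labels)
    W = np.zeros((X.shape[1], X.shape[1]))
    for c in np.unique(y):
        Xc = X[y == c] - X[y == c].mean(axis=0)
        W += Xc.T @ Xc
    W /= len(X)
    return cholesky(inv(W + 1e-6 * np.eye(W.shape[0]))).T   # feature map x -> B x

def wccn_pca_transform(features, labels, n_pca=10, complement_weight=0.1):
    """Split into PCA space + PCA-complement, apply WCCN in the PCA space,
    then concatenate a down-weighted complement (illustrative weight)."""
    X = np.asarray(features, float)
    pca = PCA(n_components=n_pca).fit(X)
    low = pca.transform(X)                            # PCA space
    complement = X - pca.inverse_transform(low)       # PCA-complement space
    B = train_wccn(low, labels)
    return np.hstack([low @ B.T, complement_weight * complement])

# toy usage: 5 speakers x 20 sessions of 50-dim vectors
rng = np.random.default_rng(3)
labels = np.repeat(np.arange(5), 20)
X = rng.normal(size=(100, 50)) + labels[:, None]
print(wccn_pca_transform(X, labels).shape)            # (100, 60)
```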

461 citations


01 Jan 2006
TL;DR: A full account of the algorithms needed to carry out a joint factor analysis of speaker and session variability in a training set in which each speaker is recorded over many different channels and the practical limitations that will be encountered if these algorithms are implemented on very large data sets are discussed.
Abstract: We give a full account of the algorithms needed to carry out a joint factor analysis of speaker and session variability in a training set in which each speaker is recorded over many different channels, and we discuss the practical limitations that will be encountered if these algorithms are implemented on very large data sets. This article is intended as a companion to (1), where we presented a new type of likelihood ratio statistic for speaker verification which is designed principally to deal with the problem of inter-session variability, that is, the variability among recordings of a given speaker. This likelihood ratio statistic is based on a joint factor analysis of speaker and session variability in a training set in which each speaker is recorded over many different channels (such as one of the Switchboard II databases). Our purpose in the current article is to give detailed algorithms for carrying out such a factor analysis. Although we have only experimented with the applications of this model in speaker recognition, we will also explain how it could serve as an integrated framework for progressive speaker adaptation and on-line channel adaptation of HMM-based speech recognizers operating in situations where speaker identities are known. The joint factor analysis model can be viewed as a Gaussian distribution on speaker- and channel-dependent (or, more accurately, session-dependent) HMM supervectors in which most (but not all) of the variance in the supervector population is assumed to be accounted for by a small number of hidden variables which we refer to as speaker and channel factors. The speaker factors and the channel factors play different roles in that, for a given speaker, the values of the speaker factors are assumed to be the same for all recordings of the speaker, but the channel factors are assumed to vary from one recording to another. For example, the Gaussian distribution on speaker-dependent supervectors used in eigenvoice MAP (2) is a special case of the factor analysis model in which there are no channel factors and all of the variance in the speaker-dependent HMM supervectors is assumed to be accounted for by the speaker factors.
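
For reference, the supervector decomposition described in this overview is usually written as below; the symbols are the standard ones and are an assumption here, since the excerpt describes the factors only in words.

```latex
% Joint factor analysis model for a speaker- and session-dependent supervector
% (standard notation; the symbols are assumptions, not quoted from the article):
%   s        speaker,  h  recording (session)
%   m        speaker-independent UBM mean supervector
%   V, U     low-rank speaker- and channel-loading matrices
%   D        diagonal residual loading matrix
%   y_s, z_s speaker factors (shared by all recordings of speaker s)
%   x_{s,h}  channel factors (different for every recording)
\[
  M_{s,h} \;=\; m \;+\; V\,y_s \;+\; D\,z_s \;+\; U\,x_{s,h},
  \qquad y_s,\; z_s,\; x_{s,h} \sim \mathcal{N}(0, I).
\]
% Eigenvoice MAP corresponds to the special case with no channel factors
% (U = 0), all speaker variability being captured by V.
```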

440 citations


PatentDOI
TL;DR: In this paper, a real-time speech recognition system includes distributed processing across a client and server for recognizing a spoken query by a user, where the partitioning of responsibility for speech recognition operations can be done on a client by client or connection by connection basis.
Abstract: A real-time speech recognition system includes distributed processing across a client and server for recognizing a spoken query by a user. Both the client and server can dedicate a variable number of processing resources for performing speech recognition functions. The partitioning of responsibility for speech recognition operations can be done on a client by client or connection by connection basis.

279 citations


Journal ArticleDOI
TL;DR: This paper focuses on optimizing vector quantization (VQ) based speaker identification, reducing the number of test vectors by pre-quantizing the test sequence prior to matching, and the number of speakers by pruning out unlikely speakers during the identification process.
Abstract: In speaker identification, most of the computation originates from the distance or likelihood computations between the feature vectors of the unknown speaker and the models in the database. The identification time depends on the number of feature vectors, their dimensionality, the complexity of the speaker models and the number of speakers. In this paper, we concentrate on optimizing vector quantization (VQ) based speaker identification. We reduce the number of test vectors by pre-quantizing the test sequence prior to matching, and the number of speakers by pruning out unlikely speakers during the identification process. The best variants are then generalized to Gaussian mixture model (GMM) based modeling. We apply the algorithms also to efficient cohort set search for score normalization in speaker verification. We obtain a speed-up factor of 16:1 in the case of VQ-based modeling with minor degradation in the identification accuracy, and 34:1 in the case of GMM-based modeling. An equal error rate of 7% can be reached in 0.84 s on average when the length of test utterance is 30.4 s.
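
A toy sketch of the two speed-ups described: pre-quantize the test sequence to a small set of representative vectors, then score speakers in chunks while pruning the least likely ones. Codebook sizes, chunk length, and the pruning fraction are illustrative, not the paper's tuned values.

```python
import numpy as np
from sklearn.cluster import KMeans

def vq_distortion(vectors, codebook):
    """Average nearest-codeword distance of the vectors to a speaker codebook."""
    d = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()

def identify(test_vectors, codebooks, pre_q=32, chunk=10, keep=0.5):
    """VQ speaker identification with test-sequence pre-quantization and speaker pruning."""
    # 1) pre-quantize the test sequence to pre_q representative vectors
    reps = KMeans(n_clusters=pre_q, n_init=3, random_state=0).fit(test_vectors).cluster_centers_
    # 2) score in chunks, pruning the worst-scoring speakers after each chunk
    alive = list(codebooks.keys())
    scores = {s: 0.0 for s in alive}
    for start in range(0, len(reps), chunk):
        block = reps[start:start + chunk]
        for s in alive:
            scores[s] += vq_distortion(block, codebooks[s])
        alive = sorted(alive, key=lambda s: scores[s])[:max(1, int(len(alive) * keep))]
    return alive[0]

# toy usage: 4 "speakers" with codebooks trained on shifted Gaussian data
rng = np.random.default_rng(4)
codebooks = {s: KMeans(n_clusters=16, n_init=3, random_state=0)
                .fit(rng.normal(loc=s, size=(500, 12))).cluster_centers_ for s in range(4)}
test = rng.normal(loc=2, size=(300, 12))
print(identify(test, codebooks))    # expected: 2
```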

248 citations


Journal ArticleDOI
TL;DR: This paper describes recent advances in speaker diarization with a multistage segmentation and clustering system that incorporates a speaker identification step and builds upon the baseline audio partitioner used in the LIMSI broadcast news transcription system.
Abstract: This paper describes recent advances in speaker diarization with a multistage segmentation and clustering system, which incorporates a speaker identification step. This system builds upon the baseline audio partitioner used in the LIMSI broadcast news transcription system. The baseline partitioner provides a high cluster purity, but has a tendency to split data from speakers with a large quantity of data into several segment clusters. Several improvements to the baseline system have been made. First, the iterative Gaussian mixture model (GMM) clustering has been replaced by a Bayesian information criterion (BIC) agglomerative clustering. Second, an additional clustering stage has been added, using a GMM-based speaker identification method. Finally, a post-processing stage refines the segment boundaries using the output of a transcription system. On the National Institute of Standards and Technology (NIST) RT-04F and ESTER evaluation data, the multistage system reduces the speaker error by over 70% relative to the baseline system, and gives between 40% and 50% reduction relative to a single-stage BIC clustering system.
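
The BIC agglomerative clustering step mentioned here hinges on a delta-BIC merge criterion; a minimal sketch with full-covariance Gaussians and the standard penalty term is given below (the multistage system's other stages are not reproduced).

```python
import numpy as np

def delta_bic(x, y, penalty_lambda=1.0):
    """Delta-BIC between modeling x and y separately vs. merged (full-covariance Gaussians).

    Negative values favor merging the two segments (same speaker);
    positive values favor keeping them separate (speaker change).
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    z = np.vstack([x, y])
    n_x, n_y, n_z = len(x), len(y), len(z)
    d = z.shape[1]

    def logdet_cov(a):
        cov = np.cov(a, rowvar=False) + 1e-6 * np.eye(d)
        return np.linalg.slogdet(cov)[1]

    # difference in log-likelihood terms: one merged model vs. two separate models
    gain = 0.5 * (n_z * logdet_cov(z) - n_x * logdet_cov(x) - n_y * logdet_cov(y))
    # BIC penalty for the extra parameters of keeping two models
    n_params = d + d * (d + 1) / 2
    return gain - penalty_lambda * 0.5 * n_params * np.log(n_z)

# same distribution -> negative (merge); different means -> positive (split)
rng = np.random.default_rng(5)
a, b = rng.normal(size=(200, 13)), rng.normal(size=(200, 13))
c = rng.normal(loc=3.0, size=(200, 13))
print(delta_bic(a, b) < 0, delta_bic(a, c) > 0)
```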

217 citations


Patent
13 Apr 2006
TL;DR: In this paper, a speaker-independent context-sensitive speech recognition module contains a vocabulary of available speech commands, which are combined with input from a controller device to control actions of a character or characters in the game environment.
Abstract: In a gaming system, a user controls actions of characters in the game environment using speech commands. In a learning mode, available speech commands are displayed in a command menu on a display device. In a non-learning mode, the available speech commands are not displayed. A speaker-independent context-sensitive speech recognition module contains a vocabulary of available speech commands. Use of speech commands is combined with input from a controller device to control actions of a character or characters in the game environment.

Patent
Jung-Eun Kim, Jeong-Su Kim
16 Feb 2006
TL;DR: In this paper, a user adaptive speech recognition method and apparatus is disclosed that controls user confirmation of a recognition candidate using a new threshold value adapted to a user, which includes calculating a confidence score of recognition candidate according to the result of speech recognition.
Abstract: A user adaptive speech recognition method and apparatus is disclosed that controls user confirmation of a recognition candidate using a new threshold value adapted to a user. The user adaptive speech recognition method includes calculating a confidence score of a recognition candidate according to the result of speech recognition, setting a new threshold value adapted to the user based on a result of user confirmation of the recognition candidate and the confidence score of the recognition candidate, and outputting a corresponding recognition candidate as a result of the speech recognition if the calculated confidence score is higher than the new threshold value. Thus, the need for user confirmation of the result of speech recognition is reduced and the probability of speech recognition success is increased.
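
A small sketch of the mechanism the abstract describes: confirm only low-confidence candidates and nudge a per-user threshold from the confirmation outcomes. The update rule and step size below are invented for illustration; the patent does not specify them.

```python
def should_confirm(confidence, threshold):
    """Ask the user to confirm only when the score falls below the current threshold."""
    return confidence < threshold

def update_threshold(threshold, confidence, user_accepted, step=0.05):
    """Nudge the per-user threshold from confirmation outcomes (illustrative rule)."""
    if user_accepted and confidence < threshold:
        return threshold - step          # user keeps accepting low-scoring results: relax
    if not user_accepted and confidence >= threshold:
        return threshold + step          # user rejects results we trusted: tighten
    return threshold

# toy usage: three recognition results with user feedback
threshold = 0.6
for conf, accepted in [(0.55, True), (0.50, True), (0.65, False)]:
    threshold = update_threshold(threshold, conf, accepted)
print(round(threshold, 2))
```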

Patent
26 May 2006
TL;DR: In this article, a method for authenticating a user based on the phrase, the biometric voice print, and the device identifier is presented. But the method is limited to a single user and cannot be used to authenticate multiple users.
Abstract: A method (700) and system (900) for authenticating a user is provided. The method can include receiving one or more spoken utterances from a user (702), recognizing a phrase corresponding to one or more spoken utterances (704), identifying a biometric voice print of the user from one or more spoken utterances of the phrase (706), determining a device identifier associated with the device (708), and authenticating the user based on the phrase, the biometric voice print, and the device identifier (710). A location of the handset or the user can be employed as criteria for granting access to one or more resources (712).

Journal ArticleDOI
TL;DR: Speaker recognition studies on the NIST 2002 database demonstrate that, even though the recognition performance from the excitation information alone is poor, combining it with evidence from vocal tract information yields a significant improvement in performance.

Patent
27 Oct 2006
TL;DR: In this article, a privacy sound may be based on the speaker's own voice or on another voice; a characteristic of the speaker may be used to access a database of the speaker's own or another's voice and to form one or more voice streams that make up the privacy sound.
Abstract: A privacy apparatus adds a privacy sound into the environment, thereby confusing listeners as to which of the sounds is the real source. The privacy sound may be based on the speaker's own voice or may be based on another voice. At least one characteristic of the speaker (such as a characteristic of the speaker's speech) may be identified. The characteristic may then be used to access a database of the speaker's own voice or another's voice, and to form one or more voice streams to form the privacy sound. The privacy sound may thus permit disruption of the ability to understand the source speech of the user by eliminating segregation cues that the auditory system uses to interpret speech.

Journal ArticleDOI
01 Nov 2006
TL;DR: The main components of audio-visual biometric systems are described, existing systems and their performance are reviewed, and future research and development directions in this area are discussed.
Abstract: Biometric characteristics can be utilized in order to enable reliable and robust-to-impostor-attacks person recognition. Speaker recognition technology is commonly utilized in various systems enabling natural human computer interaction. The majority of the speaker recognition systems rely only on acoustic information, ignoring the visual modality. However, visual information conveys correlated and complementary information to the audio information and its integration into a recognition system can potentially increase the system's performance, especially in the presence of adverse acoustic conditions. Acoustic and visual biometric signals, such as the person's voice and face, can be obtained using unobtrusive and user-friendly procedures and low-cost sensors. Developing unobtrusive biometric systems makes biometric technology more socially acceptable and accelerates its integration into everyday life. In this paper, we describe the main components of audio-visual biometric systems, review existing systems and their performance, and discuss future research and development directions in this area.

Journal ArticleDOI
TL;DR: The Bayesian framework for interpretation of evidence when applied to forensic speaker recognition is introduced, and original contributions for the robust estimation of likelihood ratios are fully described, including TDLRA (target dependent likelihood ratio alignment), oriented to guarantee the presumption of innocence of suspected but non-perpetrator speakers.

Book ChapterDOI
13 Dec 2006
TL;DR: The paper describes the design, collection, transcription and analysis of 200 hours of HKUST Mandarin Telephone Speech Corpus (HKUST/MTS), the largest and first of its kind for Mandarin conversational telephone speech, providing abundant and diversified samples for Mandarin speech recognition and other application-dependent tasks.
Abstract: The paper describes the design, collection, transcription and analysis of 200 hours of HKUST Mandarin Telephone Speech Corpus (HKUST/MTS) from over 2100 Mandarin speakers in mainland China under the DARPA EARS framework. The corpus includes speech data, transcriptions and speaker demographic information. The speech data include 1206 ten-minute natural Mandarin conversations between either strangers or friends. Each conversation focuses on a single topic. All calls are recorded over public telephone networks. All calls are manually annotated with standard Chinese characters (GBK) as well as specific mark-ups for spontaneous speech. A file with speaker demographic information is also provided. The corpus is the largest and first of its kind for Mandarin conversational telephone speech, providing abundant and diversified samples for Mandarin speech recognition and other application-dependent tasks, such as topic detection, information retrieval, keyword spotting, speaker recognition, etc. In a 2004 evaluation test by NIST, the corpus is found to improve system performance quite significantly.

Patent
30 Mar 2006
TL;DR: In this article, an architecture is presented that leverages discrepancies between user model predictions and speech recognition results by identifying discrepancies between the predictive data and the speech recognition data and repairing the data based in part on the discrepancy.
Abstract: An architecture is presented that leverages discrepancies between user model predictions and speech recognition results by identifying discrepancies between the predictive data and the speech recognition data and repairing the data based in part on the discrepancy. User model predictions predict what goal or action speech application users are likely to pursue based in part on past user behavior. Speech recognition results indicate what goal speech application users are likely to have spoken based in part on words spoken under specific constraints. Discrepancies between the predictive data and the speech recognition data are identified and a dialog repair is engaged for repairing these discrepancies. By engaging in repairs when there is a discrepancy between the predictive results and the speech recognition results, and utilizing feedback obtained via interaction with a user, the architecture can learn about the reliability of both user model predictions and speech recognition results for future processing.

Proceedings ArticleDOI
17 Sep 2006
TL;DR: An algorithm is proposed for the recognition and separation of speech signals in non-stationary noise, such as another speaker, combining hidden Markov models trained for the speech and noise into a factorial HMM that models the mixture signal.
Abstract: This paper proposes an algorithm for the recognition and separation of speech signals in non-stationary noise, such as another speaker. We present a method to combine hidden Markov models (HMMs) trained for the speech and noise into a factorial HMM to model the mixture signal. Robustness is obtained by separating the speech and noise signals in a feature domain, which discards unnecessary information. We use mel-cepstral coefficients (MFCCs) as features, and estimate the distribution of mixture MFCCs from the distributions of the target speech and noise. A decoding algorithm is proposed for finding the state transition paths and estimating gains for the speech and noise from a mixture signal. Simulations were carried out using speech material where two speakers were mixed at various levels, and even for high noise level (9 dB above the speech level), the method produced relatively good (60% word recognition accuracy) results. Audio demonstrations are available at www.cs.tut.fi/˜tuomasv. Index Terms: speech recognition, speech separation, factorial hidden Markov model.
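
A compact sketch of the factorial-HMM idea: decode jointly over the product of a speech model's and a noise model's state spaces. For simplicity it scores frames with a max-model approximation on log-spectra rather than the paper's estimated mixture-MFCC distributions, and the tiny two-state models are purely illustrative.

```python
import numpy as np

def mix_loglik(obs, mean_a, mean_b, var=1.0):
    """Max-model approximation: a mixture log-spectrum is close to the
    elementwise max of the two sources' log-spectra."""
    pred = np.maximum(mean_a, mean_b)
    return -0.5 * np.sum((obs - pred) ** 2) / var

def factorial_viterbi(obs_seq, means_a, trans_a, means_b, trans_b):
    """Joint Viterbi over the product state space of two HMMs (speech x noise)."""
    Na, Nb, T = len(means_a), len(means_b), len(obs_seq)
    # logA[pi, pj, i, j] = log p(i|pi) + log p(j|pj)
    logA = np.log(trans_a)[:, None, :, None] + np.log(trans_b)[None, :, None, :]
    delta = np.full((T, Na, Nb), -np.inf)
    psi = np.zeros((T, Na, Nb, 2), dtype=int)
    for i in range(Na):
        for j in range(Nb):
            delta[0, i, j] = mix_loglik(obs_seq[0], means_a[i], means_b[j])
    for t in range(1, T):
        for i in range(Na):
            for j in range(Nb):
                scores = delta[t - 1] + logA[:, :, i, j]
                pi, pj = np.unravel_index(np.argmax(scores), scores.shape)
                delta[t, i, j] = scores[pi, pj] + mix_loglik(obs_seq[t], means_a[i], means_b[j])
                psi[t, i, j] = (pi, pj)
    path = [np.unravel_index(np.argmax(delta[-1]), (Na, Nb))]
    for t in range(T - 1, 0, -1):
        path.append(tuple(psi[t, path[-1][0], path[-1][1]]))
    return path[::-1]          # list of (speech_state, noise_state) per frame

# toy usage: 2-state "speech" and 2-state "noise" models over 4-dim log-spectra
means_a = np.array([[0.0, 3.0, 0.0, 0.0], [0.0, 0.0, 3.0, 0.0]])
means_b = np.array([[2.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 2.0]])
trans = np.array([[0.9, 0.1], [0.1, 0.9]])
obs = [np.maximum(means_a[0], means_b[1]), np.maximum(means_a[1], means_b[1])]
print(factorial_viterbi(obs, means_a, trans, means_b, trans))
```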

Proceedings Article
01 Jan 2006
TL;DR: A system for model based speech separation which achieves super-human recognition performance when two talkers speak at similar levels and incorporates a novel method for performing two-talker speaker identification and gain estimation is described.
Abstract: We describe a system for model based speech separation which achieves super-human recognition performance when two talkers speak at similar levels. The system can separate the speech of two speakers from a single channel recording with remarkable results. It incorporates a novel method for performing two-talker speaker identification and gain estimation. We extend the method of model based high resolution signal reconstruction to incorporate temporal dynamics. We report on two methods for introducing dynamics; the first uses dynamics in the acoustic model space, the second incorporates dynamics based on sentence grammar. The addition of temporal constraints leads to dramatic improvements in the separation performance. Once the signals have been separated they are then recognized using speaker dependent labeling.

Journal ArticleDOI
TL;DR: The proposed algorithm is based on speaker adaptation techniques, adapting the conversion parameters derived for a particular pair of speakers to a different pair, for which only a nonparallel corpus is available.
Abstract: The objective of voice conversion algorithms is to modify the speech by a particular source speaker so that it sounds as if spoken by a different target speaker. Current conversion algorithms employ a training procedure, during which the same utterances spoken by both the source and target speakers are needed for deriving the desired conversion parameters. Such a (parallel) corpus is often difficult or impossible to collect. Here, we propose an algorithm that relaxes this constraint, i.e., the training corpus does not necessarily contain the same utterances from both speakers. The proposed algorithm is based on speaker adaptation techniques, adapting the conversion parameters derived for a particular pair of speakers to a different pair, for which only a nonparallel corpus is available. We show that adaptation reduces the error obtained when simply applying the conversion parameters of one pair of speakers to another by a factor that can reach 30%. A speaker identification measure is also employed that more insightfully portrays the importance of adaptation, while listening tests confirm the success of our method. Both the objective and subjective tests employed demonstrate that the proposed algorithm achieves comparable results with the ideal case when a parallel corpus is available.

Proceedings ArticleDOI
28 Jun 2006
TL;DR: A simple global calibration metric is proposed that can be generally applied to a multiple-hypothesis problem and it is demonstrated experimentally on some NIST-LRE-'05 data how this relates to the calibration of some of the derived binary-hypotheses sub-problems.
Abstract: Recent publications have examined the topic of calibration of confidence scores in the field of (binary-hypothesis) speaker detection. We extend this topic to the case of multiple-hypothesis language recognition. We analyze the structure of multiple-hypothesis recognition problems to show that any such problem subsumes a multitude of derived sub-problems and that therefore the calibrations of all of these problems are interrelated. We propose a simple global calibration metric that can be generally applied to a multiple-hypothesis problem and then demonstrate experimentally on some NIST-LRE-'05 data how this relates to the calibration of some of the derived binary-hypotheses sub-problems.

Journal ArticleDOI
TL;DR: Experimental results using a hidden-Markov-model-based recognition system indicate that using explicit lip motion information provides additional performance gains in both applications, and lip motion features prove more valuable in the case of the speech-reading application.
Abstract: There have been several studies that jointly use audio, lip intensity, and lip geometry information for speaker identification and speech-reading applications. This paper proposes using explicit lip motion information, instead of or in addition to lip intensity and/or geometry information, for speaker identification and speech-reading within a unified feature selection and discrimination analysis framework, and addresses two important issues: 1) Is using explicit lip motion information useful, and, 2) if so, what are the best lip motion features for these two applications? The best lip motion features for speaker identification are considered to be those that result in the highest discrimination of individual speakers in a population, whereas for speech-reading, the best features are those providing the highest phoneme/word/phrase recognition rate. Several lip motion feature candidates have been considered including dense motion features within a bounding box about the lip, lip contour motion features, and combination of these with lip shape features. Furthermore, a novel two-stage, spatial, and temporal discrimination analysis is introduced to select the best lip motion features for speaker identification and speech-reading applications. Experimental results using a hidden-Markov-model-based recognition system indicate that using explicit lip motion information provides additional performance gains in both applications, and lip motion features prove more valuable in the case of the speech-reading application.

Journal ArticleDOI
TL;DR: Important aspects of Technical Forensic Speaker Recognition, particularly those associated with evidence, are exemplified and critically discussed, and comparisons are drawn with generic Speaker Recognition.

Journal ArticleDOI
TL;DR: A general maximum likelihood framework is proposed for transform estimation which avoids the need for parallel training data inherent in conventional least mean square approaches and shows that the proposed approaches are capable of effectively transforming speaker identity whilst maintaining high quality.
Abstract: Voice morphing is a technique for modifying a source speaker's speech to sound as if it was spoken by some designated target speaker. The core process in a voice morphing system is the transformation of the spectral envelope of the source speaker to match that of the target speaker, and linear transformations estimated from time-aligned parallel training data are commonly used to achieve this. However, the naive application of envelope transformation combined with the necessary pitch and duration modifications will result in noticeable artifacts. This paper studies the linear transformation approach to voice morphing and investigates these two specific issues. First, a general maximum likelihood framework is proposed for transform estimation which avoids the need for parallel training data inherent in conventional least mean square approaches. Second, the main causes of artifacts are identified as being due to glottal coupling, unnatural phase dispersion and the high spectral variance of unvoiced sounds, and compensation techniques are developed to mitigate these. The resulting voice morphing system is evaluated using both subjective and objective measures. These tests show that the proposed approaches are capable of effectively transforming speaker identity whilst maintaining high quality. Furthermore, they do not require carefully prepared parallel training data.

Dissertation
21 Dec 2006
TL;DR: In this thesis, the existing hierarchical bottom-up mono-channel speaker diarization system is extended with flexible acoustic beamforming, which extracts speaker location information and produces a single enhanced signal from all available microphones; the enhanced signal is then used for speaker segmentation and clustering.
Abstract: This thesis presents research into the topic of speaker diarization for meeting rooms. It looks into the algorithms and the implementation of an offline speaker segmentation and clustering system for a meeting recording where usually more than one microphone is available. The main research and system implementation was done while visiting the International Computer Science Institute (ICSI, Berkeley, California) for a period of two years. Speaker diarization is a well studied topic in the domain of broadcast news recordings. Most of the proposed systems involve some sort of hierarchical clustering of the data into clusters, where the optimum number of speakers or their identities are unknown a priori. A very commonly used method is called bottom-up clustering, where multiple initial clusters are iteratively merged until the optimum number of clusters is reached, according to some stopping criterion. Such systems are based on a single channel input, not allowing a direct application to the meetings domain. Although some efforts have been made to adapt such systems to multichannel data, at the start of this thesis no effective implementation had been proposed. Furthermore, many of these speaker diarization algorithms involve some sort of model training or parameter tuning using external data, which impedes their usability with data different from what they have been adapted to. The implementation proposed in this thesis works towards solving the aforementioned problems. Taking the existing hierarchical bottom-up mono-channel speaker diarization system from ICSI, it first uses a flexible acoustic beamforming to extract speaker location information and obtain a single enhanced signal from all available microphones. It then implements a train-free speech/non-speech detection on this signal and processes the resulting speech segments with an improved version of the mono-channel speaker diarization system. That system has been modified to use speaker location information (now available), and several algorithms have been adapted or newly created to adapt the system behavior to each particular recording by obtaining information directly from the acoustics, making it less dependent on the development data. The resulting system is flexible to any meeting room layout regarding the number of microphones and their placement. It is train-free, making it easy to adapt to different sorts of data and domains of application. Finally, it takes a step forward into the use of parameters that are more robust to changes in the acoustic data. Two versions of the system were submitted, with excellent results, to the RT05s and RT06s NIST Rich Transcription evaluations for meetings, where data from two different subdomains (lectures and conferences) was evaluated. Also, experiments using the RT datasets from all meeting evaluations were used to test the different proposed algorithms, proving their suitability to the task.
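
The multichannel front end described above is an acoustic beamformer; a minimal delay-and-sum sketch with GCC-PHAT delay estimation is shown below as a generic stand-in (the thesis' flexible beamforming and channel weighting are not reproduced, and all sizes are illustrative).

```python
import numpy as np

def gcc_phat_delay(ref, sig, max_delay):
    """Lag (in samples) by which sig should be circularly shifted to best align with ref."""
    n = 1 << int(np.ceil(np.log2(len(ref) + len(sig))))
    R = np.fft.rfft(ref, n) * np.conj(np.fft.rfft(sig, n))
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n)       # PHAT-weighted cross-correlation
    cc = np.concatenate((cc[-max_delay:], cc[:max_delay + 1]))
    return int(np.argmax(np.abs(cc))) - max_delay

def delay_and_sum(channels, ref_index=0, max_delay=400):
    """Align every channel to the reference and average (delay-and-sum beamforming)."""
    ref = channels[ref_index]
    out = np.zeros_like(ref, dtype=float)
    for ch in channels:
        d = gcc_phat_delay(ref, ch, max_delay)
        out += np.roll(ch, d)                            # crude alignment by circular shift
    return out / len(channels)

# toy usage: the same source arrives at 3 microphones with different delays plus noise
rng = np.random.default_rng(6)
src = rng.normal(size=8000)
mics = [np.roll(src, d) + 0.3 * rng.normal(size=8000) for d in (0, 25, -40)]
enhanced = delay_and_sum(mics)
print(np.corrcoef(enhanced, src)[0, 1] > np.corrcoef(mics[1], src)[0, 1])   # True
```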

Journal ArticleDOI
TL;DR: A new algorithm is proposed for audio classification, which is based on weighted GMM Networks (WGN), and a new false alarm compensation procedure is implemented, which can compensate the false alarm rate significantly with little cost to the miss rate.
Abstract: The problem of unsupervised audio classification and segmentation continues to be a challenging research problem which significantly impacts automatic speech recognition (ASR) and spoken document retrieval (SDR) performance. This paper addresses novel advances in 1) audio classification for speech recognition and 2) audio segmentation for unsupervised multispeaker change detection. A new algorithm is proposed for audio classification, which is based on weighted GMM Networks (WGN). Two new extended-time features, variance of the spectrum flux (VSF) and variance of the zero-crossing rate (VZCR), are used to preclassify the audio and supply weights to the output probabilities of the GMM networks. The classification is then implemented using weighted GMM networks. Since historically there have been no features specifically designed for audio segmentation, we evaluate 16 potential features including three new proposed features: perceptual minimum variance distortionless response (PMVDR), smoothed zero-crossing rate (SZCR), and filterbank log energy coefficients (FBLC) in 14 noisy environments to determine the best robust features on the average across these conditions. Next, a new distance metric, T²-mean, is proposed which is intended to improve segmentation for short segment turns (i.e., 1-5 s). A new false alarm compensation procedure is implemented, which can compensate the false alarm rate significantly with little cost to the miss rate. Evaluations on a standard data set, the Defense Advanced Research Projects Agency (DARPA) Hub4 Broadcast News 1997 evaluation data, show that the WGN classification algorithm achieves over a 50% improvement versus the GMM network baseline algorithm, and the proposed compound segmentation algorithm achieves 23%-10% improvement in all metrics versus the baseline Mel-frequency cepstral coefficients (MFCC) and traditional Bayesian information criterion (BIC) algorithm. The new classification and segmentation algorithms also obtain very satisfactory results on the more diverse and challenging National Gallery of the Spoken Word (NGSW) corpus.
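
A small sketch of the two extended-time pre-classification features named here, variance of the spectrum flux (VSF) and variance of the zero-crossing rate (VZCR), computed over a sliding window. Frame and hop sizes are illustrative; the WGN classifier itself is not reproduced.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Slice a 1-D signal into overlapping frames."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n)])

def variance_of_spectrum_flux(x, frame_len=400, hop=160):
    """VSF: variance of the frame-to-frame change of the magnitude spectrum."""
    frames = frame_signal(x, frame_len, hop) * np.hanning(frame_len)
    mag = np.abs(np.fft.rfft(frames, axis=1))
    flux = np.linalg.norm(np.diff(mag, axis=0), axis=1)
    return float(np.var(flux))

def variance_of_zcr(x, frame_len=400, hop=160):
    """VZCR: variance of the per-frame zero-crossing rate."""
    frames = frame_signal(x, frame_len, hop)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return float(np.var(zcr))

# toy comparison of an amplitude-modulated tone vs. stationary noise
rng = np.random.default_rng(7)
t = np.arange(16000) / 16000.0
speechish = np.sin(2 * np.pi * 150 * t) * (1 + np.sign(np.sin(2 * np.pi * 3 * t)))
noise = rng.normal(size=16000)
print(variance_of_spectrum_flux(speechish), variance_of_spectrum_flux(noise))
print(variance_of_zcr(speechish), variance_of_zcr(noise))
```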

Book
01 Jan 2006
TL;DR: This book discusses Speech Recognition with HMMs, Alternative Representations of the LPC Coefficients, Front-end Processing for Robust Feature Extraction, and a Review of Channel Coding Techniques.
Abstract: Forward. Preface. 1 Introduction. 1.1 Introduction. 1.2 RSR over Digital Channels. 1.3 Organization of the Book. 2 Speech Recognition with HMMs. 2.1 Introduction. 2.2 Some General Issues. 2.3 Analysis of Speech Signals. 2.4 Vector Quantization. 2.5 Approaches to ASR. 2.6 Hidden Markov Models. 2.7 Application of HMMs to Speech Recognition. 2.8 Model Adaptation. 2.9 Dealing with Uncertainty. 3 Networks and Degradation. 3.1 Introduction. 3.2 Mobile and Wireless Networks. 3.3 IP Networks. 3.4 The Acoustic Environment. 4 Speech Compression and Architectures for RSR. 4.1 Introduction. 4.2 Speech Coding. 4.3 Recognition from Decoded Speech. 4.4 Recognition from Codec Parameters. 4.5 Distributed Speech Recognition. 4.6 Comparison between NSR and DSR. 5 Robustness Against Transmission Channel Errors. 5.1 Introduction. 5.2 Channel Coding Techniques. 5.3 Error Concealment (EC). 6 Front-end Processing for Robust Feature Extraction. 6.1 Introduction. 6.2 Noise Reduction Techniques. 6.3 Voice Activity Detection. 6.4 Feature Normalization. 7 Standards for Distributed Speech Recognition. 7.1 Introduction. 7.2 Signal Preprocessing. 7.3 Feature Extraction. 7.4 Feature Compression and Encoding. 7.5 Feature Decoding and Postprocessing. A Alternative Representations of the LPC Coefficients. B Basic Digital Modulation Concepts. C Review of Channel Coding Techniques. C.1 Media-independent FEC. C.2 Interleaving. Bibliography. List of Acronyms. Index.

Proceedings ArticleDOI
28 Jun 2006
TL;DR: This paper compares channel variability modeling in the usual Gaussian mixture model domain, and the proposed feature domain compensation technique, and shows that the two approaches lead to similar results on the NIST 2005 speaker recognition evaluation data.
Abstract: The variability of the channel and environment is one of the most important factors affecting the performance of text-independent speaker verification systems. The best techniques for channel compensation are model based. Most of them have been proposed for Gaussian Mixture Models, while in the feature domain typically blind channel compensation is performed. The aim of this work is to explore techniques that allow more accurate channel compensation in the domain of the features. Compensating the features rather than the models has the advantage that the transformed parameters can be used with models of different nature and complexity, and also for different tasks. In this paper we evaluate the effects of the compensation of the channel variability obtained by means of the channel factors approach. In particular, we compare channel variability modeling in the usual Gaussian Mixture model domain, and our proposed feature domain compensation technique. We show that the two approaches lead to similar results on the NIST 2005 Speaker Recognition Evaluation data. Moreover, the quality of the transformed features is also assessed in the Support Vector Machines framework for speaker recognition on the same data, and in preliminary experiments on Language Identification.
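
A rough sketch of the feature-domain compensation idea: estimate a channel offset in a low-rank subspace of the GMM supervector and subtract its posterior-weighted, per-component projection from each frame. The loading matrix and channel-factor estimate below are random placeholders; the paper's actual factor-analysis estimation is not shown.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def feature_domain_compensation(frames, ubm, U, x):
    """Subtract the per-frame projection of the estimated channel offset.

    U:  (C*D, R) channel loading matrix over the stacked supervector.
    x:  (R,)     point estimate of the channel factors for this recording.
    Each frame is shifted by the posterior-weighted, per-component slice of U x.
    """
    C, D = ubm.means_.shape
    offset = (U @ x).reshape(C, D)            # per-component channel offset
    post = ubm.predict_proba(frames)          # (T, C) occupation probabilities
    return frames - post @ offset             # o_t - sum_c gamma_c(t) * offset_c

# toy usage with a random loading matrix and channel-factor estimate (illustrative)
rng = np.random.default_rng(8)
ubm = GaussianMixture(n_components=4, covariance_type='diag', random_state=0)
ubm.fit(rng.normal(size=(1000, 8)))
U = rng.normal(scale=0.1, size=(4 * 8, 2))
x_hat = rng.normal(size=2)                    # would come from factor-analysis estimation
frames = rng.normal(size=(200, 8))
print(feature_domain_compensation(frames, ubm, U, x_hat).shape)   # (200, 8)
```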