
Showing papers on "Speaker diarisation published in 2007"


Journal ArticleDOI
TL;DR: The use of classic acoustic beamforming techniques together with several novel algorithms is proposed to create a complete frontend for speaker diarization in the meeting room domain; the same techniques also yield improvements in a speech recognition task.
Abstract: When performing speaker diarization on recordings from meetings, multiple microphones of different qualities are usually available and distributed around the meeting room. Although several approaches have been proposed in recent years to take advantage of multiple microphones, they are either too computationally expensive and not easily scalable or they cannot outperform the simpler case of using the best single microphone. In this paper, the use of classic acoustic beamforming techniques is proposed together with several novel algorithms to create a complete frontend for speaker diarization in the meeting room domain. New techniques we are presenting include blind reference-channel selection, two-step time delay of arrival (TDOA) Viterbi postprocessing, and a dynamic output signal weighting algorithm, together with using such TDOA values in the diarization to complement the acoustic information. Tests on speaker diarization show a 25% relative improvement on the test set compared to using a single most centrally located microphone. Additional experimental results show improvements using these techniques in a speech recognition task.
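
As a rough illustration of the two classic ingredients the abstract names, time delay of arrival (TDOA) estimation and acoustic beamforming, here is a minimal Python sketch. This is not the paper's actual frontend, which adds blind reference-channel selection, TDOA Viterbi postprocessing, and dynamic output weighting; it only shows GCC-PHAT delay estimation feeding a delay-and-sum beamformer.

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    # Generalized cross-correlation with phase transform (GCC-PHAT),
    # a standard estimator of the time delay of arrival between two
    # microphone channels.
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    cc = np.fft.irfft(R / (np.abs(R) + 1e-15), n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

def delay_and_sum(channels, fs, ref_idx=0):
    # Align each channel to a reference by its estimated TDOA and
    # average them: the simplest acoustic beamformer (sample
    # wrap-around from np.roll is ignored for brevity).
    ref = channels[ref_idx]
    out = np.zeros(len(ref))
    for ch in channels:
        shift = int(round(gcc_phat(ch, ref, fs) * fs))
        out += np.roll(ch.astype(float), -shift)
    return out / len(channels)
```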

444 citations


01 Jan 2007
TL;DR: A comparative evaluation of the presented MFCC implementations is performed on the task of text-independent speaker verification, by means of the well-known 2001 NIST SRE (speaker recognition evaluation) one-speaker detection database.
Abstract: Making no claim of being exhaustive, a review of the most popular MFCC (Mel Frequency Cepstral Coefficients) implementations is made. These differ mainly in their particular approximation of humans' nonlinear pitch perception, the filter bank design, and the compression of the filter bank output. Then, a comparative evaluation of the presented implementations is performed on the task of text-independent speaker verification, by means of the well-known 2001 NIST SRE (speaker recognition evaluation) one-speaker detection database.
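
For readers unfamiliar with the pipeline being compared, a textbook-style sketch of one MFCC variant follows. The mel formula, filter count, and compression step below are exactly the kinds of choices in which the reviewed implementations differ, so treat these constants as one arbitrary pick among many.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    # One common approximation of the ear's nonlinear pitch scale.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, fs, n_filters=26, n_ceps=13, n_fft=512):
    # Power spectrum of one windowed frame.
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
    # Triangular filters spaced uniformly on the mel scale.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Log compression of the filter bank outputs, then DCT.
    return dct(np.log(fbank @ spec + 1e-10), type=2, norm='ortho')[:n_ceps]
```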

333 citations


Patent
20 Mar 2007
TL;DR: In this article, a method and system for text-to-speech synthesis with personalized voice was presented, which includes receiving an incidental audio input ( 403 ) of speech in the form of an audio communication from an input speaker ( 401 ) and generating a voice dataset ( 404 ) for the input speaker.
Abstract: A method and system are provided for text-to-speech synthesis with personalized voice. The method includes receiving an incidental audio input ( 403 ) of speech in the form of an audio communication from an input speaker ( 401 ) and generating a voice dataset ( 404 ) for the input speaker ( 401 ). The method includes receiving a text input ( 411 ) at the same device as the audio input ( 403 ) and synthesizing ( 312 ) the text from the text input ( 411 ) to synthesized speech including using the voice dataset ( 404 ) to personalize the synthesized speech to sound like the input speaker ( 401 ). In addition, the method includes analyzing ( 316 ) the text for expression and adding the expression ( 315 ) to the synthesized speech. The audio communication may be part of a video communication ( 453 ) and the audio input ( 403 ) may have an associated visual input ( 455 ) of an image of the input speaker. The synthesis from text may include providing a synthesized image personalized to look like the image of the input speaker with expressions added from the visual input ( 455 ).

213 citations


Proceedings ArticleDOI
27 Aug 2007
TL;DR: In this article, a method that integrates phase information into an MFCC-based speaker recognition method was proposed, reducing the speaker identification error rate by about 44%.
Abstract: In conventional speaker recognition methods based on MFCC, the phase information is ignored. In this paper, we propose a method that integrates phase information into a speaker recognition method. The speaker identification experiments were performed using the NTT database, which consists of sentences uttered at normal speed by 35 Japanese speakers (22 males and 13 females) over five sessions spanning ten months. Each speaker uttered only 5 training utterances (about 20 seconds in total). Using the phase information, the speaker recognition error rate was reduced by about 44%. Index Terms: speaker identification, MFCC, phase information, GMM, combination method
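
The abstract does not spell out the combination rule, so the following is only a generic score-level fusion sketch under the assumption that each speaker is scored by both an MFCC-based GMM and a phase-based GMM; alpha is a hypothetical tuning weight.

```python
def fused_score(llk_mfcc, llk_phase, alpha=0.5):
    # Weighted combination of log-likelihoods from an MFCC GMM and a
    # phase GMM for the same speaker; alpha would be tuned on
    # held-out data (assumed, not stated in the paper).
    return (1.0 - alpha) * llk_mfcc + alpha * llk_phase

def identify(scores_by_speaker):
    # Speaker identification: pick the enrolled speaker whose fused
    # score is highest.
    return max(scores_by_speaker, key=scores_by_speaker.get)
```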

204 citations


Journal ArticleDOI
TL;DR: The correlation between signals coming from multiple microphones is analyzed and an improved method for carrying out speaker diarization for meetings with multiple distant microphones is proposed, improving the Diarization Error Rate (DER) by 15% to 20% relative to previous systems.
Abstract: Human-machine interaction in meetings requires the localization and identification of the speakers interacting with the system as well as the recognition of the words spoken. A seminal step toward this goal is the field of rich transcription research, which includes speaker diarization together with the annotation of sentence boundaries and the elimination of speaker disfluencies. The sub-area of speaker diarization attempts to identify the number of participants in a meeting and create a list of speech time intervals for each such participant. In this paper, we analyze the correlation between signals coming from multiple microphones and propose an improved method for carrying out speaker diarization for meetings with multiple distant microphones. The proposed algorithm makes use of acoustic information and information from the delays between signals coming from the different sources. Using this procedure, we were able to achieve state-of-the-art performance in the NIST spring 2006 rich transcription evaluation, improving the Diarization Error Rate (DER) by 15% to 20% relative to previous systems.

91 citations


Journal ArticleDOI
TL;DR: A method of speaker modeling based upon support vector machines is described, with a new kernel derived by linearizing a log likelihood ratio scoring system; generalizations of this method are shown to produce excellent results on a variety of high-level features.
Abstract: High-level characteristics such as word usage, pronunciation, phonotactics, prosody, etc., have seen a resurgence for automatic speaker recognition over the last several years. With the availability of many conversation sides per speaker in current corpora, high-level systems now have the amount of data needed to sufficiently characterize a speaker. Although a significant amount of work has been done in finding novel high-level features, less work has been done on modeling these features. We describe a method of speaker modeling based upon support vector machines. Current high-level feature extraction produces sequences or lattices of tokens for a given conversation side. These sequences can be converted to counts and then to n-gram frequencies for a given conversation side. We use support vector machine modeling of these n-gram frequencies for speaker verification. We derive a new kernel based upon linearizing a log likelihood ratio scoring system. Generalizations of this method are shown to produce excellent results on a variety of high-level features. We demonstrate that our methods produce results significantly better than standard log-likelihood ratio modeling. We also demonstrate that our system can perform well in conjunction with standard cepstral speaker recognition systems.
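
The kernel obtained by linearizing the log likelihood ratio is often written so that each n-gram frequency is scaled by the inverse square root of its background frequency before a plain inner product; the sketch below follows that common reading, not necessarily the paper's exact derivation.

```python
import math

def ngram_frequencies(tokens, n=2):
    # Convert a token sequence from one conversation side into
    # n-gram counts, then relative frequencies.
    counts = {}
    for i in range(len(tokens) - n + 1):
        g = tuple(tokens[i:i + n])
        counts[g] = counts.get(g, 0) + 1
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def llr_kernel(p_a, p_b, p_bg):
    # Inner product of background-normalized n-gram frequencies:
    # the linearized log-likelihood-ratio kernel.
    return sum((p_a.get(g, 0.0) / math.sqrt(f)) *
               (p_b.get(g, 0.0) / math.sqrt(f))
               for g, f in p_bg.items())
```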

87 citations


Journal ArticleDOI
TL;DR: A new modeling approach for speaker recognition that uses the maximum-likelihood linear regression (MLLR) adaptation transforms employed by a speech recognition system as features for support vector machine (SVM) speaker models is presented.
Abstract: We present a new modeling approach for speaker recognition that uses the maximum-likelihood linear regression (MLLR) adaptation transforms employed by a speech recognition system as features for support vector machine (SVM) speaker models. This approach is attractive because, unlike standard frame-based cepstral speaker recognition models, it normalizes for the choice of spoken words in text-independent speaker verification without data fragmentation. We discuss the basics of the MLLR-SVM approach, and show how it can be enhanced by combining transforms relative to multiple reference models, with excellent results on recent English NIST evaluation sets. We then show how the approach can be applied even if no full word-level recognition system is available, which allows its use on non-English data even without matching speech recognizers. Finally, we examine how two recently proposed algorithms for intersession variability compensation perform in conjunction with MLLR-SVM.
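
A minimal sketch of the modeling idea, with hypothetical stand-in data: each conversation side contributes one MLLR transform (A, b) estimated by the recognizer, which is flattened into a fixed-length feature vector for an SVM.

```python
import numpy as np
from sklearn.svm import LinearSVC

def mllr_to_feature(A, b):
    # Flatten one MLLR transform y = A x + b into a vector.
    return np.concatenate([np.ravel(A), np.ravel(b)])

# Hypothetical stand-in transforms: one (A, b) pair per side.
rng = np.random.default_rng(0)
sides = [(rng.normal(size=(39, 39)), rng.normal(size=39))
         for _ in range(40)]
labels = [1] * 20 + [0] * 20   # 1 = target speaker, 0 = impostors

X = np.stack([mllr_to_feature(A, b) for A, b in sides])
svm = LinearSVC(C=1.0).fit(X, labels)
```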

81 citations


PatentDOI
TL;DR: In this paper, a front-end analysis is applied to input speech data to obtain feature vectors; the speech data is then initially segmented and clustered into groups of segments that correspond to different speakers.
Abstract: Systems and methods for unsupervised segmentation of multi-speaker speech or audio data by speaker. A front-end analysis is applied to input speech data to obtain feature vectors. The speech data is initially segmented and then clustered into groups of segments that correspond to different speakers. The clusters are iteratively modeled and resegmented to obtain stable speaker segmentations. The overlap between segmentation sets is checked to ensure successful speaker segmentation. Overlapping segments are combined and remodeled and resegmented. Optionally, the speech data is processed to produce a segmentation lattice to maximize the overall segmentation likelihood.
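
A compact sketch of the iterate-model-and-resegment loop the abstract describes, using sklearn GMMs and fixed-length windows in place of the patent's actual segmentation machinery; details such as the overlap check and the segmentation lattice are omitted.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def resegment(X, models, win=100):
    # Label each fixed-length window with the model that gives it the
    # highest average log-likelihood: a crude stand-in for Viterbi
    # resegmentation.
    labels = np.empty(len(X), dtype=int)
    for s in range(0, len(X), win):
        seg = X[s:s + win]
        labels[s:s + win] = int(np.argmax([m.score(seg) for m in models]))
    return labels

def iterative_diarization(X, n_speakers, n_iter=10, win=100):
    # Start from a uniform chop of the feature stream, then alternate
    # model refitting and resegmentation until the labels stabilize.
    labels = np.arange(len(X)) * n_speakers // len(X)
    models = [GaussianMixture(4, reg_covar=1e-3).fit(X[labels == k])
              for k in range(n_speakers)]
    for _ in range(n_iter):
        new = resegment(X, models, win)
        if np.array_equal(new, labels):
            break
        labels = new
        for k in range(n_speakers):
            Xk = X[labels == k]
            if len(Xk) >= 4:   # skip the refit if a cluster collapses
                models[k] = GaussianMixture(4, reg_covar=1e-3).fit(Xk)
    return labels
```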

79 citations


Patent
12 Feb 2007
TL;DR: A text-dependent speaker verification technique that uses a generic speaker-independent speech recognizer for robust speaker verification and uses the acoustical model of a speaker-independent speech recognizer as a background model is presented in this article.
Abstract: A text-dependent speaker verification technique that uses a generic speaker-independent speech recognizer for robust speaker verification, and uses the acoustical model of a speaker-independent speech recognizer as a background model. Instead of using a likelihood ratio test (LRT) at the utterance level (e.g., the sentence level), which is typical of most speaker verification systems, the present text-dependent speaker verification technique uses a weighted sum of likelihood ratios at the sub-unit level (word, tri-phone, or phone) as well as at the utterance level.
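
The scoring idea reduces to a weighted sum of per-unit log likelihood ratios; a small sketch follows, where the example weights and the zero threshold are hypothetical.

```python
def verification_score(unit_llrs, weights):
    # Weighted sum of log likelihood ratios computed at the sub-unit
    # level (word, tri-phone, or phone), each against the
    # speaker-independent acoustic model as the background.
    return sum(w * llr for w, llr in zip(weights, unit_llrs))

# Hypothetical example: three words with per-word LLRs and weights.
score = verification_score([1.2, -0.3, 0.8], [0.5, 0.2, 0.3])
accept = score > 0.0   # threshold would be tuned on development data
```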

78 citations


Journal ArticleDOI
TL;DR: This paper introduces reverberation compensation as well as feature warping and shows that higher-level features are more robust under mismatched conditions, which suggests that speaker recognition using multilingual phone strings could be successfully applied to any given language.
Abstract: In this paper, we study robust speaker recognition in far-field microphone situations. Two approaches are investigated to improve the robustness of speaker recognition in such scenarios. The first approach applies traditional techniques based on acoustic features. We introduce reverberation compensation as well as feature warping and gain significant improvements, even under mismatched training-testing conditions. In addition, we performed multiple channel combination experiments to make use of information from multiple distant microphones. Overall, we achieved up to 87.1% relative improvement on our Distant Microphone database and found that the gains hold across different data conditions and microphone settings. The second approach makes use of higher-level linguistic features. To capture speaker idiosyncrasies, we apply n-gram models trained on multilingual phone strings and show that higher-level features are more robust under mismatched conditions. Furthermore, we compared the performance of multilingual and multiengine systems, and examined the impact of the number of involved languages on recognition results. Our findings confirm the usefulness of language variety and indicate the language-independent nature of this approach, which suggests that speaker recognition using multilingual phone strings could be successfully applied to any given language.
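
Feature warping, one of the acoustic-level techniques used here, maps each feature dimension to a standard normal distribution over a sliding window; below is a minimal sketch, with the window length set to an assumed typical value.

```python
import numpy as np
from scipy.stats import norm

def feature_warp(feats, win=300):
    # Warp each feature dimension to a standard normal distribution
    # over a sliding window: the rank of the center frame within the
    # window is mapped through the inverse normal CDF.
    T, D = feats.shape
    out = np.empty_like(feats, dtype=float)
    half = win // 2
    for t in range(T):
        lo, hi = max(0, t - half), min(T, t + half + 1)
        for d in range(D):
            w = feats[lo:hi, d]
            rank = (w < feats[t, d]).sum() + 0.5
            out[t, d] = norm.ppf(rank / len(w))
    return out
```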

78 citations


Book ChapterDOI
01 Feb 2007
TL;DR: This article briefly summarizes approaches to using higher-level features for text-independent speaker verification over the last decade in terms of their type, temporal span, and reliance on automatic speech recognition for both feature extraction and feature conditioning.
Abstract: Higher-level features based on linguistic or long-range information have attracted significant attention in automatic speaker recognition. This article briefly summarizes approaches to using higher-level features for text-independent speaker verification over the last decade. To clarify how each approach uses higher-level information, features are described in terms of their type, temporal span, and reliance on automatic speech recognition for both feature extraction and feature conditioning. A subsequent analysis of higher-level features in a state-of-the-art system illustrates that (1) a higher-level cepstral system outperforms standard systems, (2) a prosodic system shows excellent performance individually and in combination, (3) other higher-level systems provide further gains, and (4) higher-level systems provide increasing relative gains as training data increases. Implications for the general field of speaker classification are discussed.

Proceedings ArticleDOI
27 Aug 2007
TL;DR: Experimental results show that the conversation finding method outperforms earlier approaches and that the speaker segmentation method is a significant improvement over the only other known privacy-sensitive method for speaker segmentation.
Abstract: We present privacy-sensitive methods for (1) automatically finding multi-person conversations in spontaneous, situated speech data and (2) segmenting those conversations into speaker turns. The methods protect privacy through a feature set that is rich enough to capture conversational styles and dynamics, but not sufficient for reconstructing intelligible speech. Experimental results show that the conversation finding method outperforms earlier approaches and that the speaker segmentation method is a significant improvement to the only other known privacy-sensitive method for speaker segmentation.
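
The abstract does not list the exact features, so the following is only an illustrative guess at the kind of privacy-safe frame features such a system might use: coarse measures that track energy and voicing dynamics without preserving enough spectral detail to reconstruct words.

```python
import numpy as np

def privacy_safe_features(frame):
    # Coarse per-frame measures: enough to follow conversational
    # dynamics and speaker turns, far too coarse to resynthesize
    # intelligible speech. Illustrative only; not the paper's set.
    x = frame.astype(float)
    energy = float(np.mean(x ** 2))
    zero_crossings = float(np.mean(np.abs(np.diff(np.sign(x))) / 2.0))
    return energy, zero_crossings
```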

Patent
Alex Waibel1
26 Oct 2007
TL;DR: In this article, the authors propose a method for simultaneously translating speech between first and second speakers, where the first speaker speaks in a first language and the second speaker speaks in a second language that is different from the first language.
Abstract: Speech translation systems and methods for simultaneously translating speech between first and second speakers, wherein the first speaker speaks in a first language and the second speaker speaks in a second language that is different from the first language. The speech translation system may comprise a resegmentation unit that merges at least two partial hypotheses and resegments the merged partial hypotheses into a first-language translatable segment, wherein a segment boundary for the first-language translatable segment is determined based on sound from the second speaker.

Patent
09 Aug 2007
TL;DR: In this article, a method and apparatus for identifying a speaker within a captured audio signal from a collection of known speakers is presented, where the representations are grouped into one or more groups according to the indices.
Abstract: A method and apparatus for identifying a speaker within a captured audio signal from a collection of known speakers. The method and apparatus receive or generate voice representations for each known speakers and tag the representations according to meta data related to the known speaker or to the voice. The representations are grouped into one or more groups according to the indices. When a voice to be recognized is introduced, characteristics are determined according to which the groups are prioritized, so that the representations participating only in part of the groups are matched against the o voice to be identified, thus reducing identification time and improving the statistical significance.
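
A sketch of the prioritized matching the patent describes, with assumed cosine similarity and a hypothetical threshold; here groups maps a group id to its voice representations, and group_order is the priority ranking derived from the probe's characteristics.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def prioritized_match(probe, groups, group_order, threshold=0.8):
    # Search groups in priority order and stop at the first group
    # containing a sufficiently similar representation, so only part
    # of the collection is ever compared against the probe.
    for gid in group_order:
        best = max(groups[gid], key=lambda rep: cosine(probe, rep))
        if cosine(probe, best) >= threshold:
            return gid, best
    return None
```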

Patent
23 Aug 2007
TL;DR: In this article, a system for controlling personalized settings in a vehicle (10) is described, which includes a microphone (22) for receiving spoken commands from a person (16A) in the vehicle.
Abstract: A system (20) is provided for controlling personalized settings in a vehicle (10). The system (20) includes a microphone (22) for receiving spoken commands from a person (16A) in the vehicle (10), a location recognizer (80) for identifying location of the speaker (16A), and an identity recognizer (84) for identifying the identity of the speaker (16A). The system (20) also includes a speech recognizer (82) for recognizing the received spoken commands. The system (20) further includes a controller (24) for processing the identified location, identity and commands of the speaker (16A). The controller (24) controls one or more feature settings based on the identified location, identified identity and recognized spoken commands of the speaker (16A). The system (20) also optimizes the grammar comparison for speech recognition and the beamforming microphone array used in the vehicle (10).

Book
01 Feb 2007
TL;DR: In this book, a study of acoustic correlates of speaker age is presented, the impact of visual and auditory cues in age estimation is discussed, and durations of context-dependent phonemes are examined as a new feature in speaker verification.
Abstract: Contents include:
- A Study of Acoustic Correlates of Speaker Age
- The Impact of Visual and Auditory Cues in Age Estimation
- Development of a Femininity Estimator for Voice Therapy of Gender Identity Disorder Clients
- Real-Life Emotion Recognition in Speech
- Automatic Classification of Expressiveness in Speech: A Multi-corpus Study
- Acoustic Impact on Decoding of Semantic Emotion
- Emotion from Speakers to Listeners: Perception and Prosodic Characterization of Affective Speech
- Effects of the Phonological Contents on Perceptual Speaker Identification
- Durations of Context-Dependent Phonemes: A New Feature in Speaker Verification
- Language-Independent Speaker Classification over a Far-Field Microphone
- A Linear-Scaling Approach to Speaker Variability in Poly-segmental Formant Ensembles
- Sound Change and Speaker Identity: An Acoustic Study
- Bayes-Optimal Estimation of GMM Parameters for Speaker Recognition
- Speaker Individualities in Speech Spectral Envelopes and Fundamental Frequency Contours
- Speaker Segmentation for Air Traffic Control
- Detection of Speaker Characteristics Using Voice Imitation
- Reviewing Human Language Identification
- Underpinning /nailon/: Automatic Estimation of Pitch Range and Speaker Relative Pitch
- Automatic Dialect Identification: A Study of British English
- ACCDIST: An Accent Similarity Metric for Accent Recognition and Diagnosis
- Selecting Representative Speakers for a Speech Database on the Basis of Heterogeneous Similarity Criteria
- Speaker Classification by Means of Orthographic and Broad Phonetic Transcriptions of Speech

Journal Article
10 May 2007-CLEaR
TL;DR: The latest version of the SRI-ICSI meeting and lecture recognition system, as used in the NIST RT-07 evaluations, is described, highlighting improvements made over the last year, and results are reported on a new NIST metric designed to evaluate combined speaker diarization and recognition.
Abstract: We describe the latest version of the SRI-ICSI meeting and lecture recognition system, as used in the NIST RT-07 evaluations, highlighting improvements made over the last year. Changes in the acoustic preprocessing include updated beamforming software for processing of multiple distant microphones, and various adjustments to the speech segmenter for close-talking microphones. Acoustic models were improved by the combined use of neural-net-estimated phone posterior features, discriminative feature transforms trained with fMPE-MAP, and discriminative Gaussian estimation using MPE-MAP, as well as model adaptation specifically to nonnative and non-American speakers. The net effect of these enhancements was a 14-16% relative error reduction on distant microphones, and a 16-17% error reduction on close-talking microphones. Also, for the first time, we report results on a new "coffee break" meeting genre, and on a new NIST metric designed to evaluate combined speaker diarization and recognition.

01 Aug 2007
TL;DR: This paper describes an HMM-based speech synthesis system developed by the HTS working group for the Blizzard Challenge 2007, and incorporates new features in the conventional system which underpin a speaker-independent approach: speaker adaptation techniques; adaptive training for HSMMs; and full covariance modeling using the CSMAPLR transforms.
Abstract: This paper describes an HMM-based speech synthesis system developed by the HTS working group for the Blizzard Challenge 2007. To further explore the potential of HMM-based speech synthesis, we incorporate new features in our conventional system which underpin a speaker-independent approach: speaker adaptation techniques; adaptive training for HSMMs; and full covariance modeling using the CSMAPLR transforms.

01 Jan 2007
TL;DR: A speaker segmentation and clustering system aiming at improving the robustness of speaker recognition as well as automatic speech recognition performance in the multiple-speaker scenarios such as telephony conversations and meetings is implemented.
Abstract: Automatic speaker recognition technologies have developed into increasingly important modern technologies required by many speech-aided applications. The main challenge for automatic speaker recognition is to deal with the variability of the environments and channels from which the speech was obtained. In previous work, good results have been achieved for clean high-quality speech with matched training and test acoustic conditions, such as high accuracy of speaker identification and verification using clean wideband speech and Gaussian Mixture Models (GMM). However, under mismatched conditions and noisy environments, often expected in real-world conditions, the performance of GMM-based systems degrades significantly, far from the satisfactory level. Therefore, robustness has become a crucial research issue in the speaker recognition field. In this thesis, our main focus is to improve the robustness of speaker recognition systems on far-field distant microphones. We investigate approaches to improve robustness from two directions. First, we investigate approaches to improve robustness for traditional speaker recognition systems, which are based on low-level spectral information. We introduce a new reverberation compensation approach which, along with feature warping in the feature processing procedure, improves system performance significantly. We propose four multiple channel combination approaches, which utilize information from multiple far-field microphones, to improve robustness under mismatched training-testing conditions. Secondly, we investigate approaches that use high-level speaker information to improve robustness. We propose new techniques to model speaker pronunciation idiosyncrasy along two dimensions: the cross-stream dimension and the time dimension. Such high-level information is expected to be robust under different mismatched conditions. We also built systems that support robust speaker recognition. We implemented a speaker segmentation and clustering system aimed at improving the robustness of speaker recognition as well as automatic speech recognition performance in multiple-speaker scenarios such as telephone conversations and meetings. We also integrate a speaker identification modality with a face recognition modality to build a robust person identification system.

Proceedings ArticleDOI
27 Aug 2007
TL;DR: New language resources designed to support research in speaker recognition are described, including a brief overview of collection protocols, and the shift from the Switchboard protocol to the Mixer protocol is motivated.
Abstract: This paper describes new language resources designed to support research in speaker recognition. It begins with a brief overview of collection protocols, motivates the shift from the Switchboard protocol to the Mixer protocol, summarizes yields from the earliest phase of Mixer collection, and then describes more recent phases, their yields and expected yields, and lessons learned.

Journal ArticleDOI
TL;DR: It is shown in this paper that this approach facilitates the implementation of a progressive unsupervised adaptation strategy which is able to produce an improved model of speaker identity while minimizing the influence of channel variability.
Abstract: This paper addresses the issue of speaker variability and session variability in text-independent Gaussian mixture model (GMM)-based speaker verification. A speaker model adaptation procedure is proposed which is based on a joint factor analysis approach to speaker verification. It is shown in this paper that this approach facilitates the implementation of a progressive unsupervised adaptation strategy which is able to produce an improved model of speaker identity while minimizing the influence of channel variability. The paper also deals with the interaction between this model adaptation approach and score normalization strategies which act to reduce the variation in likelihood ratio scores. This issue is particularly important in establishing decision thresholds in practical speaker verification systems since the variability of likelihood ratio scores can increase as a result of progressive model adaptation. These adaptation methods have been evaluated under the adaptation paradigm defined under the NIST 2005 Speaker Recognition Evaluation Plan, which is based on conversation sides derived from telephone speech utterances. It was found that when target speaker models were trained from a single conversation, an equal error rate (EER) of 4.5% was obtained under the NIST unsupervised speaker adaptation scenario.
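
The paper's point about score variability is usually handled with normalizations such as t-norm; as a hedged illustration (t-norm is a standard strategy of this kind, not necessarily the exact one studied), the trial score is standardized against a cohort of impostor models scored on the same utterance.

```python
import numpy as np

def t_norm(trial_score, cohort_scores):
    # Test normalization: standardize a likelihood-ratio score by the
    # mean and deviation of impostor-cohort scores on the same
    # utterance, stabilizing decision thresholds as models adapt.
    return (trial_score - np.mean(cohort_scores)) / np.std(cohort_scores)
```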

Proceedings ArticleDOI
15 Apr 2007
TL;DR: A new feature extraction technique for speaker recognition based on CMLLR speaker adaptation is described, which operates directly on the recorded signal and, in combination with two cepstral approaches, reduces the performance gap between telephone and auxiliary microphone data.
Abstract: One particularly difficult challenge for speaker recognition is the cross-channel condition introduced in the 2005 and 2006 NIST Speaker Recognition Evaluations, where training uses telephone speech and verification uses speech from multiple auxiliary microphones. This paper describes a new feature extraction technique for speaker recognition based on constrained MLLR (CMLLR) speaker adaptation, which operates directly on the recorded signal and compensates for session effects through latent factor analysis (LFA) and support vector machines (SVM). Results on the NIST data show performance comparable to that obtained with cepstral features, as well as, in combination with two cepstral approaches, a reduction in the performance gap between telephone and auxiliary microphone data.

Proceedings ArticleDOI
01 Dec 2007
TL;DR: With the best features, it is found that detecting overlaps could potentially improve diarization accuracy by 15% relative, using a simple strategy of assigning speaker labels in overlap regions according to the labels of the neighboring segments.
Abstract: Speaker overlap in meetings is thought to be a significant contributor to error in speaker diarization, but it is not clear if overlaps are problematic for speaker clustering and/or if errors could be addressed by assigning multiple labels in overlap regions. In this paper, we look at these issues experimentally, assuming perfect detection of overlaps, to assess the relative importance of these problems and the potential impact of overlap detection. With our best features, we find that detecting overlaps could potentially improve diarization accuracy by 15% relative, using a simple strategy of assigning speaker labels in overlap regions according to the labels of the neighboring segments. In addition, the use of cross-correlation features with MFCCs reduces the performance gap due to overlaps, so that there is little gain from removing overlapped regions before clustering.
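
The simple labeling strategy the authors evaluate can be sketched directly: given a first-pass diarization and (assumed perfect) overlap regions, each overlap inherits the speakers of its neighboring segments.

```python
def label_overlap_regions(segments, overlaps):
    # segments: time-ordered (start, end, speaker) triples from a
    # first-pass diarization; overlaps: (start, end) regions where
    # two speakers talk at once.
    out = []
    for start, end in overlaps:
        before = next((spk for s, e, spk in reversed(segments)
                       if e <= start), None)
        after = next((spk for s, e, spk in segments if s >= end), None)
        out.append((start, end,
                    {spk for spk in (before, after) if spk is not None}))
    return out
```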

Journal ArticleDOI
TL;DR: The results confirm that an effective in-set/out-of-set speaker recognition system can be formulated using discriminative training for rapid tagging of input speakers from limited training and test data sizes.
Abstract: In this paper, the problem of identifying in-set versus out-of-set speakers for limited training/test data durations is addressed. The recognition objective is to form a decision regarding an input speaker as being a legitimate member of a set of enrolled speakers or outside speakers. The general goal is to perform rapid speaker model construction from limited enrollment and test size resources for in-set testing for input audio streams. In-set detection can help ensure security and proper access to private information, as well as detecting and tracking input speakers. Areas of applications of these concepts include rapid speaker tagging and tracking for information retrieval, communication networks, personal device assistants, and location access. We propose an integrated system with emphasis on short-enrollment data (about 5 s of speech for each enrolled speaker) and test data (2-8 s) within a text-independent mode. We present a simple and yet powerful decision rule to accept or reject speakers using a discriminative vector in the decision score space, together with statistical hypothesis testing based on the conventional likelihood ratio test. Discriminative training is introduced to further improve system performance for both decision techniques, by employing minimum classification error and minimum verification error frameworks. Experiments are performed using three separate corpora. Using the YOHO speaker recognition database, the alternative decision rule achieves measurable improvement over the likelihood ratio test, and discriminative training consistently enhances overall system performance with relative improvements ranging from 11.26%-28.68%. A further extended evaluation using the TIMIT (CORPUS1) and actual noisy aircraft communications data (CORPUS2) shows measurable improvement over the traditional MAP-based scheme using the likelihood ratio test (MAP-LRT), with average EERs of 9%-23% for TIMIT and 13%-32% for noisy aircraft communications. The results confirm that an effective in-set/out-of-set speaker recognition system can be formulated using discriminative training for rapid tagging of input speakers from limited training and test data sizes.
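
The conventional baseline the paper starts from is a likelihood ratio test over the enrolled set; a one-line sketch follows, where theta is a tuned threshold (the paper's discriminative decision rule replaces this test).

```python
def in_set_decision(enrolled_llks, background_llk, theta=0.0):
    # Accept the input as an in-set speaker if the best enrolled
    # model beats the background model by more than theta.
    return max(enrolled_llks) - background_llk > theta
```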

Journal ArticleDOI
TL;DR: A robust speaker recognition method based on position-dependent Cepstral Mean Normalization (CMN) to compensate for the channel distortion depending on the speaker position is proposed.
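
Only the TL;DR is shown for this paper, so the sketch below is limited to what it states: plain CMN removes a stationary channel by subtracting the cepstral mean, and the position-dependent variant keeps one mean per speaker position (the data layout here is an assumption).

```python
import numpy as np

def cmn(feats):
    # Plain cepstral mean normalization over an utterance.
    return feats - feats.mean(axis=0, keepdims=True)

def position_dependent_cmn(feats, positions, position_means):
    # Subtract a channel mean estimated for each speaker position
    # rather than a single global mean (assumed formulation).
    return np.stack([f - position_means[p]
                     for f, p in zip(feats, positions)])
```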

Journal Article
01 Jan 2007-CLEaR
TL;DR: The design and results of the Spring 2007 (RT-07) Rich Transcription Meeting Recognition Evaluation are presented; the fifth in a series of community-wide evaluations of language technologies in the meeting domain.
Abstract: We present the design and results of the Spring 2007 (RT-07) Rich Transcription Meeting Recognition Evaluation; the fifth in a series of community-wide evaluations of language technologies in the meeting domain. For 2007, we supported three evaluation tasks: Speech-To-Text (STT) transcription, "Who Spoke When" Diarization (SPKR), and Speaker Attributed Speech-To-Text (SASTT). The SASTT task, which combines STT and SPKR tasks, was a new evaluation task. The test data consisted of three test sets: Conference Meetings, Lecture Meetings, and Coffee Breaks from lecture meetings. The Coffee Break data was included as a new test set this year. Twenty-one research sites materially contributed to the evaluation by providing data or building systems. The lowest STT word error rates with up to four simultaneous speakers in the multiple distant microphone condition were 40.6 %, 49.8 %, and 48.4 % for the conference, lecture, and coffee break test sets respectively. For the SPKR task, the lowest diarization error rates for all speech in the multiple distant microphone condition were 8.5 %, 25.8 %, and 25.5 % for the conference, lecture, and coffee break test sets respectively. For the SASTT task, the lowest speaker attributed word error rates for segments with up to three simultaneous speakers in the multiple distant microphone condition were 40.3 %, 59.3 %, and 68.4 % for the conference, lecture, and coffee break test sets respectively.

Proceedings ArticleDOI
15 Apr 2007
TL;DR: A novel biometric modality based on synchrony measures is introduced in order to improve the overall performance of identity verification, and more specifically its robustness to replay attacks.
Abstract: We investigate the use of audio-visual speech synchrony measures in the framework of identity verification based on talking faces. Two synchrony measures, based on canonical correlation analysis and co-inertia analysis respectively, are introduced, and their performance is evaluated on the specific task of detecting synchronized and non-synchronized audio-visual speech sequences. The notion of high-effort impostor attacks is also introduced as a dangerous threat to current biometric systems based on speaker verification and face recognition. A novel biometric modality based on synchrony measures is introduced in order to improve the overall performance of identity verification, and more specifically its robustness to replay attacks.
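
A minimal sketch of a CCA-based synchrony score between time-aligned audio and visual feature streams; the feature choice and the single-component setup are assumptions, not the paper's exact configuration.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def synchrony_score(audio_feats, visual_feats):
    # Correlation of the first canonical pair between time-aligned
    # audio and visual features; genuine talking-face sequences
    # should score higher than replayed or dubbed ones.
    cca = CCA(n_components=1).fit(audio_feats, visual_feats)
    a, v = cca.transform(audio_feats, visual_feats)
    return float(np.corrcoef(a[:, 0], v[:, 0])[0, 1])
```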

Proceedings ArticleDOI
27 Aug 2007
TL;DR: The Normalized Cross Likelihood Ratio is used as a dissimilarity measure between two Gaussian speaker models in the speaker change detection step, and its contribution to the performance of speaker change detection is compared with those of BIC and Hotelling's T2-statistic measures.
Abstract: In this paper, we present the Normalized Cross Likelihood Ratio (NCLR) and the advantages of using it in a speaker diarization system. First, the NCLR is used as a dissimilarity measure between two Gaussian speaker models in the speaker change detection step, and its contribution to the performance of speaker change detection is compared with those of BIC and Hotelling's T2-statistic measures. Then, the NCLR measure is modified to deal with multi-Gaussian adapted models in the cluster recombination step. This step ends the step-by-step speaker diarization process, after the BIC-based hierarchical clustering and the Viterbi re-segmentation steps. Comparing the NCLR measure with the CLR (Cross Likelihood Ratio) measure, the relative diarization error is reduced by more than 30% on the ESTER evaluation data.
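
For concreteness, the NCLR between models M1 and M2 trained on segments X1 and X2 is commonly given as NCLR(M1, M2) = (1/N1) log[L(X1|M1)/L(X1|M2)] + (1/N2) log[L(X2|M2)/L(X2|M1)]; treating that common formulation as the paper's, and noting that sklearn's GaussianMixture.score() already returns the mean per-frame log-likelihood, the measure is a two-liner:

```python
def nclr(m1, m2, X1, X2):
    # m1, m2: fitted sklearn GaussianMixture speaker models; score()
    # returns the mean per-frame log-likelihood, so the 1/N
    # normalization terms of the NCLR are built in.
    return ((m1.score(X1) - m2.score(X1)) +
            (m2.score(X2) - m1.score(X2)))
```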

Book ChapterDOI
TL;DR: The chapter presents an overview of the physical structures of the human vocal tract used in speech, it introduces the standard phonetic classification system for the description of spoken gestures and it presents a catalogue of the different ways in which individuality can be expressed through speech.
Abstract: As well as conveying a message in words and sounds, the speech signal carries information about the speaker's own anatomy, physiology, linguistic experience and mental state. These speaker characteristics are found in speech at all levels of description: from the spectral information in the sounds to the choice of words and utterances themselves. This chapter presents an introduction to speech production and to the phonetic description of speech to facilitate discussion of how speech can be a carrier for speaker characteristics as well as a carrier for messages. The chapter presents an overview of the physical structures of the human vocal tract used in speech, it introduces the standard phonetic classification system for the description of spoken gestures and it presents a catalogue of the different ways in which individuality can be expressed through speech. The chapter ends with a brief description of some applications which require access to information about speaker characteristics in speech.

Book
01 Feb 2007
TL;DR: This book discusses the many roles of speaker classification in speaker verification and identification, as well as its applications in human-machine dialog systems and the evaluation of speaker recognition systems.
Abstract: Contents include:
Fundamentals:
- How Is Individuality Expressed in Voice? An Introduction to Speech Production and Description for Speaker Classification
- Speaker Classification Concepts: Past, Present and Future
Characteristics:
- Speaker Characteristics
- Foreign Accent
- Acoustic Analysis of Adult Speaker Age
- Speech Under Stress: Analysis, Modeling and Recognition
- Speaker Characteristics and Emotion Classification
- Emotions in Speech: Juristic Implications
Applications:
- Application of Speaker Classification in Human Machine Dialog Systems
- Speaker Classification in Forensic Phonetics and Acoustics
- Forensic Automatic Speaker Classification in the "Coming Paradigm Shift"
- The Many Roles of Speaker Classification in Speaker Verification and Identification
Methods and Features:
- Frame Based Features
- Higher-Level Features in Speaker Recognition
- Enhancing Speaker Discrimination at the Feature Level
- Classification Methods for Speaker Recognition
- Multi-stream Fusion for Speaker Classification
Evaluation:
- Evaluations of Automatic Speaker Classification Systems
- An Introduction to Application-Independent Evaluation of Speaker Recognition Systems