
Showing papers on "Speaker diarisation published in 2007"


Journal ArticleDOI
TL;DR: The use of classic acoustic beamforming techniques together with several novel algorithms is proposed to create a complete frontend for speaker diarization in the meeting room domain; the same techniques also yield improvements in a speech recognition task.
Abstract: When performing speaker diarization on recordings from meetings, multiple microphones of different qualities are usually available and distributed around the meeting room. Although several approaches have been proposed in recent years to take advantage of multiple microphones, they are either too computationally expensive and not easily scalable or they cannot outperform the simpler case of using the best single microphone. In this paper, the use of classic acoustic beamforming techniques is proposed together with several novel algorithms to create a complete frontend for speaker diarization in the meeting room domain. New techniques we are presenting include blind reference-channel selection, two-step time delay of arrival (TDOA) Viterbi postprocessing, and a dynamic output signal weighting algorithm, together with using such TDOA values in the diarization to complement the acoustic information. Tests on speaker diarization show a 25% relative improvement on the test set compared to using a single most centrally located microphone. Additional experimental results show improvements using these techniques in a speech recognition task.
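
As a rough illustration of the two classic ingredients the abstract names, time delay of arrival (TDOA) estimation and acoustic beamforming, here is a minimal Python sketch. This is not the paper's actual frontend, which adds blind reference-channel selection, TDOA Viterbi postprocessing, and dynamic output weighting; it only shows GCC-PHAT delay estimation feeding a delay-and-sum beamformer.

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    # Generalized cross-correlation with phase transform (GCC-PHAT),
    # a standard estimator of the time delay of arrival between two
    # microphone channels.
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    cc = np.fft.irfft(R / (np.abs(R) + 1e-15), n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

def delay_and_sum(channels, fs, ref_idx=0):
    # Align each channel to a reference by its estimated TDOA and
    # average them: the simplest acoustic beamformer (sample
    # wrap-around from np.roll is ignored for brevity).
    ref = channels[ref_idx]
    out = np.zeros(len(ref))
    for ch in channels:
        shift = int(round(gcc_phat(ch, ref, fs) * fs))
        out += np.roll(ch.astype(float), -shift)
    return out / len(channels)
```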

444 citations


01 Jan 2007
TL;DR: A comparative evaluation of the presented MFCC implementations is performed on the task of text-independent speaker verification, by means of the well-known 2001 NIST SRE (speaker recognition evaluation) one-speaker detection database.
Abstract: Making no claim of being exhaustive, a review of the most popular MFCC (Mel Frequency Cepstral Coefficients) implementations is made. These differ mainly in their particular approximation of humans' nonlinear pitch perception, the filter bank design, and the compression of the filter bank output. Then, a comparative evaluation of the presented implementations is performed on the task of text-independent speaker verification, by means of the well-known 2001 NIST SRE (speaker recognition evaluation) one-speaker detection database.
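
For readers unfamiliar with the pipeline being compared, a textbook-style sketch of one MFCC variant follows. The mel formula, filter count, and compression step below are exactly the kinds of choices in which the reviewed implementations differ, so treat these constants as one arbitrary pick among many.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    # One common approximation of the ear's nonlinear pitch scale.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, fs, n_filters=26, n_ceps=13, n_fft=512):
    # Power spectrum of one windowed frame.
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
    # Triangular filters spaced uniformly on the mel scale.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Log compression of the filter bank outputs, then DCT.
    return dct(np.log(fbank @ spec + 1e-10), type=2, norm='ortho')[:n_ceps]
```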

333 citations


Patent
20 Mar 2007
TL;DR: In this article, a method and system for text-to-speech synthesis with personalized voice was presented, which includes receiving an incidental audio input ( 403 ) of speech in the form of an audio communication from an input speaker ( 401 ) and generating a voice dataset ( 404 ) for the input speaker.
Abstract: A method and system are provided for text-to-speech synthesis with personalized voice. The method includes receiving an incidental audio input ( 403 ) of speech in the form of an audio communication from an input speaker ( 401 ) and generating a voice dataset ( 404 ) for the input speaker ( 401 ). The method includes receiving a text input ( 411 ) at the same device as the audio input ( 403 ) and synthesizing ( 312 ) the text from the text input ( 411 ) to synthesized speech including using the voice dataset ( 404 ) to personalize the synthesized speech to sound like the input speaker ( 401 ). In addition, the method includes analyzing ( 316 ) the text for expression and adding the expression ( 315 ) to the synthesized speech. The audio communication may be part of a video communication ( 453 ) and the audio input ( 403 ) may have an associated visual input ( 455 ) of an image of the input speaker. The synthesis from text may include providing a synthesized image personalized to look like the image of the input speaker with expressions added from the visual input ( 455 ).

213 citations


Proceedings ArticleDOI
27 Aug 2007
TL;DR: In this article, a method that integrates phase information into an MFCC-based speaker recognition method was proposed, reducing the speaker identification error rate by about 44%.
Abstract: In conventional speaker recognition methods based on MFCC, the phase information is ignored. In this paper, we propose a method that integrates phase information into a speaker recognition method. The speaker identification experiments were performed using the NTT database, which consists of sentences uttered at normal speed by 35 Japanese speakers (22 males and 13 females) over five sessions spanning ten months. Each speaker uttered only 5 training utterances (about 20 seconds in total). Using the phase information, the speaker recognition error rate was reduced by about 44%. Index Terms: speaker identification, MFCC, phase information, GMM, combination method
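
The abstract does not spell out the combination rule, so the following is only a generic score-level fusion sketch under the assumption that each speaker is scored by both an MFCC-based GMM and a phase-based GMM; alpha is a hypothetical tuning weight.

```python
def fused_score(llk_mfcc, llk_phase, alpha=0.5):
    # Weighted combination of log-likelihoods from an MFCC GMM and a
    # phase GMM for the same speaker; alpha would be tuned on
    # held-out data (assumed, not stated in the paper).
    return (1.0 - alpha) * llk_mfcc + alpha * llk_phase

def identify(scores_by_speaker):
    # Speaker identification: pick the enrolled speaker whose fused
    # score is highest.
    return max(scores_by_speaker, key=scores_by_speaker.get)
```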

204 citations


Journal ArticleDOI
TL;DR: The correlation between signals coming from multiple microphones is analyzed and an improved method for carrying out speaker diarization for meetings with multiple distant microphones is proposed, improving the Diarization Error Rate (DER) by 15% to 20% relative to previous systems.
Abstract: Human-machine interaction in meetings requires the localization and identification of the speakers interacting with the system as well as the recognition of the words spoken. A seminal step toward this goal is the field of rich transcription research, which includes speaker diarization together with the annotation of sentence boundaries and the elimination of speaker disfluencies. The sub-area of speaker diarization attempts to identify the number of participants in a meeting and create a list of speech time intervals for each such participant. In this paper, we analyze the correlation between signals coming from multiple microphones and propose an improved method for carrying out speaker diarization for meetings with multiple distant microphones. The proposed algorithm makes use of acoustic information and information from the delays between signals coming from the different sources. Using this procedure, we were able to achieve state-of-the-art performance in the NIST spring 2006 rich transcription evaluation, improving the Diarization Error Rate (DER) by 15% to 20% relative to previous systems.

91 citations


Journal ArticleDOI
TL;DR: A method of speaker modeling based upon support vector machines is described, with a new kernel derived by linearizing a log likelihood ratio scoring system; generalizations of this method are shown to produce excellent results on a variety of high-level features.
Abstract: High-level characteristics such as word usage, pronunciation, phonotactics, prosody, etc., have seen a resurgence for automatic speaker recognition over the last several years. With the availability of many conversation sides per speaker in current corpora, high-level systems now have the amount of data needed to sufficiently characterize a speaker. Although a significant amount of work has been done in finding novel high-level features, less work has been done on modeling these features. We describe a method of speaker modeling based upon support vector machines. Current high-level feature extraction produces sequences or lattices of tokens for a given conversation side. These sequences can be converted to counts and then to n-gram frequencies for a given conversation side. We use support vector machine modeling of these n-gram frequencies for speaker verification. We derive a new kernel based upon linearizing a log likelihood ratio scoring system. Generalizations of this method are shown to produce excellent results on a variety of high-level features. We demonstrate that our methods produce results significantly better than standard log-likelihood ratio modeling. We also demonstrate that our system can perform well in conjunction with standard cepstral speaker recognition systems.
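
The kernel obtained by linearizing the log likelihood ratio is often written so that each n-gram frequency is scaled by the inverse square root of its background frequency before a plain inner product; the sketch below follows that common reading, not necessarily the paper's exact derivation.

```python
import math

def ngram_frequencies(tokens, n=2):
    # Convert a token sequence from one conversation side into
    # n-gram counts, then relative frequencies.
    counts = {}
    for i in range(len(tokens) - n + 1):
        g = tuple(tokens[i:i + n])
        counts[g] = counts.get(g, 0) + 1
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def llr_kernel(p_a, p_b, p_bg):
    # Inner product of background-normalized n-gram frequencies:
    # the linearized log-likelihood-ratio kernel.
    return sum((p_a.get(g, 0.0) / math.sqrt(f)) *
               (p_b.get(g, 0.0) / math.sqrt(f))
               for g, f in p_bg.items())
```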

87 citations


Journal ArticleDOI
TL;DR: A new modeling approach for speaker recognition that uses the maximum-likelihood linear regression (MLLR) adaptation transforms employed by a speech recognition system as features for support vector machine (SVM) speaker models is presented.
Abstract: We present a new modeling approach for speaker recognition that uses the maximum-likelihood linear regression (MLLR) adaptation transforms employed by a speech recognition system as features for support vector machine (SVM) speaker models. This approach is attractive because, unlike standard frame-based cepstral speaker recognition models, it normalizes for the choice of spoken words in text-independent speaker verification without data fragmentation. We discuss the basics of the MLLR-SVM approach, and show how it can be enhanced by combining transforms relative to multiple reference models, with excellent results on recent English NIST evaluation sets. We then show how the approach can be applied even if no full word-level recognition system is available, which allows its use on non-English data even without matching speech recognizers. Finally, we examine how two recently proposed algorithms for intersession variability compensation perform in conjunction with MLLR-SVM.
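
A minimal sketch of the modeling idea, with hypothetical stand-in data: each conversation side contributes one MLLR transform (A, b) estimated by the recognizer, which is flattened into a fixed-length feature vector for an SVM.

```python
import numpy as np
from sklearn.svm import LinearSVC

def mllr_to_feature(A, b):
    # Flatten one MLLR transform y = A x + b into a vector.
    return np.concatenate([np.ravel(A), np.ravel(b)])

# Hypothetical stand-in transforms: one (A, b) pair per side.
rng = np.random.default_rng(0)
sides = [(rng.normal(size=(39, 39)), rng.normal(size=39))
         for _ in range(40)]
labels = [1] * 20 + [0] * 20   # 1 = target speaker, 0 = impostors

X = np.stack([mllr_to_feature(A, b) for A, b in sides])
svm = LinearSVC(C=1.0).fit(X, labels)
```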

81 citations


PatentDOI
TL;DR: In this paper, a front-end analysis is applied to input speech data to obtain feature vectors; the speech data is then initially segmented and clustered into groups of segments that correspond to different speakers.
Abstract: Systems and methods for unsupervised segmentation of multi-speaker speech or audio data by speaker. A front-end analysis is applied to input speech data to obtain feature vectors. The speech data is initially segmented and then clustered into groups of segments that correspond to different speakers. The clusters are iteratively modeled and resegmented to obtain stable speaker segmentations. The overlap between segmentation sets is checked to ensure successful speaker segmentation. Overlapping segments are combined and remodeled and resegmented. Optionally, the speech data is processed to produce a segmentation lattice to maximize the overall segmentation likelihood.
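
A compact sketch of the iterate-model-and-resegment loop the abstract describes, using sklearn GMMs and fixed-length windows in place of the patent's actual segmentation machinery; details such as the overlap check and the segmentation lattice are omitted.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def resegment(X, models, win=100):
    # Label each fixed-length window with the model that gives it the
    # highest average log-likelihood: a crude stand-in for Viterbi
    # resegmentation.
    labels = np.empty(len(X), dtype=int)
    for s in range(0, len(X), win):
        seg = X[s:s + win]
        labels[s:s + win] = int(np.argmax([m.score(seg) for m in models]))
    return labels

def iterative_diarization(X, n_speakers, n_iter=10, win=100):
    # Start from a uniform chop of the feature stream, then alternate
    # model refitting and resegmentation until the labels stabilize.
    labels = np.arange(len(X)) * n_speakers // len(X)
    models = [GaussianMixture(4, reg_covar=1e-3).fit(X[labels == k])
              for k in range(n_speakers)]
    for _ in range(n_iter):
        new = resegment(X, models, win)
        if np.array_equal(new, labels):
            break
        labels = new
        for k in range(n_speakers):
            Xk = X[labels == k]
            if len(Xk) >= 4:   # skip the refit if a cluster collapses
                models[k] = GaussianMixture(4, reg_covar=1e-3).fit(Xk)
    return labels
```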

79 citations


Patent
12 Feb 2007
TL;DR: A text-dependent speaker verification technique that uses a generic speaker-independent speech recognizer for robust speaker verification and uses the acoustical model of a speaker-independent speech recognizer as a background model is presented in this article.
Abstract: A text-dependent speaker verification technique that uses a generic speaker-independent speech recognizer for robust speaker verification, and uses the acoustical model of a speaker-independent speech recognizer as a background model. Instead of using a likelihood ratio test (LRT) at the utterance level (e.g., the sentence level), which is typical of most speaker verification systems, the present text-dependent speaker verification technique uses a weighted sum of likelihood ratios at the sub-unit level (word, tri-phone, or phone) as well as at the utterance level.
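
The scoring idea reduces to a weighted sum of per-unit log likelihood ratios; a small sketch follows, where the example weights and the zero threshold are hypothetical.

```python
def verification_score(unit_llrs, weights):
    # Weighted sum of log likelihood ratios computed at the sub-unit
    # level (word, tri-phone, or phone), each against the
    # speaker-independent acoustic model as the background.
    return sum(w * llr for w, llr in zip(weights, unit_llrs))

# Hypothetical example: three words with per-word LLRs and weights.
score = verification_score([1.2, -0.3, 0.8], [0.5, 0.2, 0.3])
accept = score > 0.0   # threshold would be tuned on development data
```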

78 citations


Journal ArticleDOI
TL;DR: This paper introduces reverberation compensation as well as feature warping and shows that higher-level features are more robust under mismatched conditions, which suggests that speaker recognition using multilingual phone strings could be successfully applied to any given language.
Abstract: In this paper, we study robust speaker recognition in far-field microphone situations. Two approaches are investigated to improve the robustness of speaker recognition in such scenarios. The first approach applies traditional techniques based on acoustic features. We introduce reverberation compensation as well as feature warping and gain significant improvements, even under mismatched training-testing conditions. In addition, we performed multiple channel combination experiments to make use of information from multiple distant microphones. Overall, we achieved up to 87.1% relative improvement on our Distant Microphone database and found that the gains hold across different data conditions and microphone settings. The second approach makes use of higher-level linguistic features. To capture speaker idiosyncrasies, we apply n-gram models trained on multilingual phone strings and show that higher-level features are more robust under mismatched conditions. Furthermore, we compared the performance of multilingual and multiengine systems, and examined the impact of the number of involved languages on recognition results. Our findings confirm the usefulness of language variety and indicate the language-independent nature of this approach, which suggests that speaker recognition using multilingual phone strings could be successfully applied to any given language.
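
Feature warping, one of the acoustic-level techniques used here, maps each feature dimension to a standard normal distribution over a sliding window; below is a minimal sketch, with the window length set to an assumed typical value.

```python
import numpy as np
from scipy.stats import norm

def feature_warp(feats, win=300):
    # Warp each feature dimension to a standard normal distribution
    # over a sliding window: the rank of the center frame within the
    # window is mapped through the inverse normal CDF.
    T, D = feats.shape
    out = np.empty_like(feats, dtype=float)
    half = win // 2
    for t in range(T):
        lo, hi = max(0, t - half), min(T, t + half + 1)
        for d in range(D):
            w = feats[lo:hi, d]
            rank = (w < feats[t, d]).sum() + 0.5
            out[t, d] = norm.ppf(rank / len(w))
    return out
```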

78 citations


Book ChapterDOI
01 Feb 2007
TL;DR: This article briefly summarizes approaches to using higher-level features for text-independent speaker verification over the last decade in terms of their type, temporal span, and reliance on automatic speech recognition for both feature extraction and feature conditioning.
Abstract: Higher-level features based on linguistic or long-range information have attracted significant attention in automatic speaker recognition. This article briefly summarizes approaches to using higher-level features for text-independent speaker verification over the last decade. To clarify how each approach uses higher-level information, features are described in terms of their type, temporal span, and reliance on automatic speech recognition for both feature extraction and feature conditioning. A subsequent analysis of higher-level features in a state-of-the-art system illustrates that (1) a higher-level cepstral system outperforms standard systems, (2) a prosodic system shows excellent performance individually and in combination, (3) other higher-level systems provide further gains, and (4) higher-level systems provide increasing relative gains as training data increases. Implications for the general field of speaker classification are discussed.

Proceedings ArticleDOI
27 Aug 2007
TL;DR: Experimental results show that the conversation finding method outperforms earlier approaches and that the speaker segmentation method is a significant improvement over the only other known privacy-sensitive method for speaker segmentation.
Abstract: We present privacy-sensitive methods for (1) automatically finding multi-person conversations in spontaneous, situated speech data and (2) segmenting those conversations into speaker turns. The methods protect privacy through a feature set that is rich enough to capture conversational styles and dynamics, but not sufficient for reconstructing intelligible speech. Experimental results show that the conversation finding method outperforms earlier approaches and that the speaker segmentation method is a significant improvement to the only other known privacy-sensitive method for speaker segmentation.
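
The abstract does not list the exact features, so the following is only an illustrative guess at the kind of privacy-safe frame features such a system might use: coarse measures that track energy and voicing dynamics without preserving enough spectral detail to reconstruct words.

```python
import numpy as np

def privacy_safe_features(frame):
    # Coarse per-frame measures: enough to follow conversational
    # dynamics and speaker turns, far too coarse to resynthesize
    # intelligible speech. Illustrative only; not the paper's set.
    x = frame.astype(float)
    energy = float(np.mean(x ** 2))
    zero_crossings = float(np.mean(np.abs(np.diff(np.sign(x))) / 2.0))
    return energy, zero_crossings
```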

Patent
Alex Waibel1
26 Oct 2007
TL;DR: In this article, the authors propose a method for simultaneously translating speech between first and second speakers, where the first speaker speaks in a first language and the second speaker speaks in a second language that is different from the first language.
Abstract: Speech translation systems and methods for simultaneously translating speech between first and second speakers, wherein the first speaker speaks in a first language and the second speaker speaks in a second language that is different from the first language. The speech translation system may comprise a resegmentation unit that merges at least two partial hypotheses and resegments the merged partial hypotheses into a first-language translatable segment, wherein a segment boundary for the first-language translatable segment is determined based on sound from the second speaker.

Patent
09 Aug 2007
TL;DR: In this article, a method and apparatus for identifying a speaker within a captured audio signal from a collection of known speakers is presented, where the representations are grouped into one or more groups according to the indices.
Abstract: A method and apparatus for identifying a speaker within a captured audio signal from a collection of known speakers. The method and apparatus receive or generate voice representations for each known speakers and tag the representations according to meta data related to the known speaker or to the voice. The representations are grouped into one or more groups according to the indices. When a voice to be recognized is introduced, characteristics are determined according to which the groups are prioritized, so that the representations participating only in part of the groups are matched against the o voice to be identified, thus reducing identification time and improving the statistical significance.
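
A sketch of the prioritized matching the patent describes, with assumed cosine similarity and a hypothetical threshold; here groups maps a group id to its voice representations, and group_order is the priority ranking derived from the probe's characteristics.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def prioritized_match(probe, groups, group_order, threshold=0.8):
    # Search groups in priority order and stop at the first group
    # containing a sufficiently similar representation, so only part
    # of the collection is ever compared against the probe.
    for gid in group_order:
        best = max(groups[gid], key=lambda rep: cosine(probe, rep))
        if cosine(probe, best) >= threshold:
            return gid, best
    return None
```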

Patent
23 Aug 2007
TL;DR: In this article, a system for controlling personalized settings in a vehicle (10) is described, which includes a microphone (22) for receiving spoken commands from a person (16A) in the vehicle.
Abstract: A system (20) is provided for controlling personalized settings in a vehicle (10). The system (20) includes a microphone (22) for receiving spoken commands from a person (16A) in the vehicle (10), a location recognizer (80) for identifying location of the speaker (16A), and an identity recognizer (84) for identifying the identity of the speaker (16A). The system (20) also includes a speech recognizer (82) for recognizing the received spoken commands. The system (20) further includes a controller (24) for processing the identified location, identity and commands of the speaker (16A). The controller (24) controls one or more feature settings based on the identified location, identified identity and recognized spoken commands of the speaker (16A). The system (20) also optimizes the grammar comparison for speech recognition and the beamforming microphone array used in the vehicle (10).

Book
01 Feb 2007
TL;DR: In this book, a study of acoustic correlates of speaker age is presented, the impact of visual and auditory cues in age estimation is discussed, and durations of context-dependent phonemes are examined as a new feature in speaker verification.
Abstract: Contents include:
- A Study of Acoustic Correlates of Speaker Age
- The Impact of Visual and Auditory Cues in Age Estimation
- Development of a Femininity Estimator for Voice Therapy of Gender Identity Disorder Clients
- Real-Life Emotion Recognition in Speech
- Automatic Classification of Expressiveness in Speech: A Multi-corpus Study
- Acoustic Impact on Decoding of Semantic Emotion
- Emotion from Speakers to Listeners: Perception and Prosodic Characterization of Affective Speech
- Effects of the Phonological Contents on Perceptual Speaker Identification
- Durations of Context-Dependent Phonemes: A New Feature in Speaker Verification
- Language-Independent Speaker Classification over a Far-Field Microphone
- A Linear-Scaling Approach to Speaker Variability in Poly-segmental Formant Ensembles
- Sound Change and Speaker Identity: An Acoustic Study
- Bayes-Optimal Estimation of GMM Parameters for Speaker Recognition
- Speaker Individualities in Speech Spectral Envelopes and Fundamental Frequency Contours
- Speaker Segmentation for Air Traffic Control
- Detection of Speaker Characteristics Using Voice Imitation
- Reviewing Human Language Identification
- Underpinning /nailon/: Automatic Estimation of Pitch Range and Speaker Relative Pitch
- Automatic Dialect Identification: A Study of British English
- ACCDIST: An Accent Similarity Metric for Accent Recognition and Diagnosis
- Selecting Representative Speakers for a Speech Database on the Basis of Heterogeneous Similarity Criteria
- Speaker Classification by Means of Orthographic and Broad Phonetic Transcriptions of Speech

Journal Article
10 May 2007-CLEaR
TL;DR: The latest version of the SRI-ICSI meeting and lecture recognition system, as used in the NIST RT-07 evaluations, is described, highlighting improvements made over the last year, and results are reported on a new NIST metric designed to evaluate combined speaker diarization and recognition.
Abstract: We describe the latest version of the SRI-ICSI meeting and lecture recognition system, as used in the NIST RT-07 evaluations, highlighting improvements made over the last year. Changes in the acoustic preprocessing include updated beamforming software for processing of multiple distant microphones, and various adjustments to the speech segmenter for close-talking microphones. Acoustic models were improved by the combined use of neural-net-estimated phone posterior features, discriminative feature transforms trained with fMPE-MAP, and discriminative Gaussian estimation using MPE-MAP, as well as model adaptation specifically to nonnative and non-American speakers. The net effect of these enhancements was a 14-16% relative error reduction on distant microphones, and a 16-17% error reduction on close-talking microphones. Also, for the first time, we report results on a new "coffee break" meeting genre, and on a new NIST metric designed to evaluate combined speaker diarization and recognition.

01 Aug 2007
TL;DR: This paper describes an HMM-based speech synthesis system developed by the HTS working group for the Blizzard Challenge 2007, and incorporates new features in the conventional system which underpin a speaker-independent approach: speaker adaptation techniques; adaptive training for HSMMs; and full covariance modeling using the CSMAPLR transforms.
Abstract: This paper describes an HMM-based speech synthesis system developed by the HTS working group for the Blizzard Challenge 2007. To further explore the potential of HMM-based speech synthesis, we incorporate new features in our conventional system which underpin a speaker-independent approach: speaker adaptation techniques; adaptive training for HSMMs; and full covariance modeling using the CSMAPLR transforms.

01 Jan 2007
TL;DR: A speaker segmentation and clustering system aiming at improving the robustness of speaker recognition as well as automatic speech recognition performance in the multiple-speaker scenarios such as telephony conversations and meetings is implemented.
Abstract: Automatic speaker recognition technologies have developed into increasingly important modern technologies required by many speech-aided applications. The main challenge for automatic speaker recognition is to deal with the variability of the environments and channels from which the speech was obtained. In previous work, good results have been achieved for clean high-quality speech with matched training and test acoustic conditions, such as high accuracy of speaker identification and verification using clean wideband speech and Gaussian Mixture Models (GMM). However, under mismatched conditions and noisy environments, often expected in real-world conditions, the performance of GMM-based systems degrades significantly, far from the satisfactory level. Therefore, robustness has become a crucial research issue in the speaker recognition field. In this thesis, our main focus is to improve the robustness of speaker recognition systems on far-field distant microphones. We investigate approaches to improve robustness from two directions. First, we investigate approaches to improve robustness for traditional speaker recognition systems, which are based on low-level spectral information. We introduce a new reverberation compensation approach which, along with feature warping in the feature processing procedure, improves system performance significantly. We propose four multiple channel combination approaches, which utilize information from multiple far-field microphones, to improve robustness under mismatched training-testing conditions. Secondly, we investigate approaches that use high-level speaker information to improve robustness. We propose new techniques to model speaker pronunciation idiosyncrasy along two dimensions: the cross-stream dimension and the time dimension. Such high-level information is expected to be robust under different mismatched conditions. We also built systems that support robust speaker recognition. We implemented a speaker segmentation and clustering system aimed at improving the robustness of speaker recognition as well as automatic speech recognition performance in multiple-speaker scenarios such as telephone conversations and meetings. We also integrate a speaker identification modality with a face recognition modality to build a robust person identification system.

Proceedings ArticleDOI
27 Aug 2007
TL;DR: New language resources designed to support research in speaker recognition are described, including a brief overview of collection protocols, and the shift from the Switchboard protocol to the Mixer protocol is motivated.
Abstract: This paper describes new language resources designed to support research in speaker recognition. It begins with a brief overview of collection protocols, motivates the shift from the Switchboard protocol to the Mixer protocol, summarizes yields from the earliest phase of Mixer collection, and then describes more recent phases, their yields and expected yields, and lessons learned.

Journal ArticleDOI
TL;DR: It is shown in this paper that this approach facilitates the implementation of a progressive unsupervised adaptation strategy which is able to produce an improved model of speaker identity while minimizing the influence of channel variability.
Abstract: This paper addresses the issue of speaker variability and session variability in text-independent Gaussian mixture model (GMM)-based speaker verification. A speaker model adaptation procedure is proposed which is based on a joint factor analysis approach to speaker verification. It is shown in this paper that this approach facilitates the implementation of a progressive unsupervised adaptation strategy which is able to produce an improved model of speaker identity while minimizing the influence of channel variability. The paper also deals with the interaction between this model adaptation approach and score normalization strategies which act to reduce the variation in likelihood ratio scores. This issue is particularly important in establishing decision thresholds in practical speaker verification systems since the variability of likelihood ratio scores can increase as a result of progressive model adaptation. These adaptation methods have been evaluated under the adaptation paradigm defined under the NIST 2005 Speaker Recognition Evaluation Plan, which is based on conversation sides derived from telephone speech utterances. It was found that when target speaker models were trained from a single conversation, an equal error rate (EER) of 4.5% was obtained under the NIST unsupervised speaker adaptation scenario.
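
The paper's point about score variability is usually handled with normalizations such as t-norm; as a hedged illustration (t-norm is a standard strategy of this kind, not necessarily the exact one studied), the trial score is standardized against a cohort of impostor models scored on the same utterance.

```python
import numpy as np

def t_norm(trial_score, cohort_scores):
    # Test normalization: standardize a likelihood-ratio score by the
    # mean and deviation of impostor-cohort scores on the same
    # utterance, stabilizing decision thresholds as models adapt.
    return (trial_score - np.mean(cohort_scores)) / np.std(cohort_scores)
```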

Proceedings ArticleDOI
15 Apr 2007
TL;DR: A new feature extraction technique for speaker recognition based on CMLLR speaker adaptation is described, which operates directly on the recorded signal and, in combination with two cepstral approaches, reduces the performance gap between telephone and auxiliary microphone data.
Abstract: One particularly difficult challenge for speaker recognition is the cross-channel condition introduced in the 2005 and 2006 NIST Speaker Recognition Evaluations, where training uses telephone speech and verification uses speech from multiple auxiliary microphones. This paper describes a new feature extraction technique for speaker recognition based on constrained MLLR (CMLLR) speaker adaptation, which operates directly on the recorded signal and compensates for session effects through latent factor analysis (LFA) and support vector machines (SVM). Results on the NIST data show performance comparable to that obtained with cepstral features, as well as, in combination with two cepstral approaches, a reduction in the performance gap between telephone and auxiliary microphone data.

Proceedings ArticleDOI
01 Dec 2007
TL;DR: With the best features, it is found that detecting overlaps could potentially improve diarization accuracy by 15% relative, using a simple strategy of assigning speaker labels in overlap regions according to the labels of the neighboring segments.
Abstract: Speaker overlap in meetings is thought to be a significant contributor to error in speaker diarization, but it is not clear if overlaps are problematic for speaker clustering and/or if errors could be addressed by assigning multiple labels in overlap regions. In this paper, we look at these issues experimentally, assuming perfect detection of overlaps, to assess the relative importance of these problems and the potential impact of overlap detection. With our best features, we find that detecting overlaps could potentially improve diarization accuracy by 15% relative, using a simple strategy of assigning speaker labels in overlap regions according to the labels of the neighboring segments. In addition, the use of cross-correlation features with MFCCs reduces the performance gap due to overlaps, so that there is little gain from removing overlapped regions before clustering.
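
The simple labeling strategy the authors evaluate can be sketched directly: given a first-pass diarization and (assumed perfect) overlap regions, each overlap inherits the speakers of its neighboring segments.

```python
def label_overlap_regions(segments, overlaps):
    # segments: time-ordered (start, end, speaker) triples from a
    # first-pass diarization; overlaps: (start, end) regions where
    # two speakers talk at once.
    out = []
    for start, end in overlaps:
        before = next((spk for s, e, spk in reversed(segments)
                       if e <= start), None)
        after = next((spk for s, e, spk in segments if s >= end), None)
        out.append((start, end,
                    {spk for spk in (before, after) if spk is not None}))
    return out
```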

Journal ArticleDOI
TL;DR: The results confirm that an effective in-set/out-of-set speaker recognition system can be formulated using discriminative training for rapid tagging of input speakers from limited training and test data sizes.
Abstract: In this paper, the problem of identifying in-set versus out-of-set speakers for limited training/test data durations is addressed. The recognition objective is to form a decision regarding an input speaker as being a legitimate member of a set of enrolled speakers or outside speakers. The general goal is to perform rapid speaker model construction from limited enrollment and test size resources for in-set testing for input audio streams. In-set detection can help ensure security and proper access to private information, as well as detecting and tracking input speakers. Areas of applications of these concepts include rapid speaker tagging and tracking for information retrieval, communication networks, personal device assistants, and location access. We propose an integrated system with emphasis on short-enrollment data (about 5 s of speech for each enrolled speaker) and test data (2-8 s) within a text-independent mode. We present a simple and yet powerful decision rule to accept or reject speakers using a discriminative vector in the decision score space, together with statistical hypothesis testing based on the conventional likelihood ratio test. Discriminative training is introduced to further improve system performance for both decision techniques, by employing minimum classification error and minimum verification error frameworks. Experiments are performed using three separate corpora. Using the YOHO speaker recognition database, the alternative decision rule achieves measurable improvement over the likelihood ratio test, and discriminative training consistently enhances overall system performance with relative improvements ranging from 11.26%-28.68%. A further extended evaluation using the TIMIT (CORPUS1) and actual noisy aircraft communications data (CORPUS2) shows measurable improvement over the traditional MAP-based scheme using the likelihood ratio test (MAP-LRT), with average EERs of 9%-23% for TIMIT and 13%-32% for noisy aircraft communications. The results confirm that an effective in-set/out-of-set speaker recognition system can be formulated using discriminative training for rapid tagging of input speakers from limited training and test data sizes.
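
The conventional baseline the paper starts from is a likelihood ratio test over the enrolled set; a one-line sketch follows, where theta is a tuned threshold (the paper's discriminative decision rule replaces this test).

```python
def in_set_decision(enrolled_llks, background_llk, theta=0.0):
    # Accept the input as an in-set speaker if the best enrolled
    # model beats the background model by more than theta.
    return max(enrolled_llks) - background_llk > theta
```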

Journal ArticleDOI
TL;DR: A robust speaker recognition method based on position-dependent Cepstral Mean Normalization (CMN) to compensate for the channel distortion depending on the speaker position is proposed.
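
Only the TL;DR is shown for this paper, so the sketch below is limited to what it states: plain CMN removes a stationary channel by subtracting the cepstral mean, and the position-dependent variant keeps one mean per speaker position (the data layout here is an assumption).

```python
import numpy as np

def cmn(feats):
    # Plain cepstral mean normalization over an utterance.
    return feats - feats.mean(axis=0, keepdims=True)

def position_dependent_cmn(feats, positions, position_means):
    # Subtract a channel mean estimated for each speaker position
    # rather than a single global mean (assumed formulation).
    return np.stack([f - position_means[p]
                     for f, p in zip(feats, positions)])
```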

Journal Article
01 Jan 2007-CLEaR
TL;DR: The design and results of the Spring 2007 (RT-07) Rich Transcription Meeting Recognition Evaluation are presented; the fifth in a series of community-wide evaluations of language technologies in the meeting domain.
Abstract: We present the design and results of the Spring 2007 (RT-07) Rich Transcription Meeting Recognition Evaluation; the fifth in a series of community-wide evaluations of language technologies in the meeting domain. For 2007, we supported three evaluation tasks: Speech-To-Text (STT) transcription, "Who Spoke When" Diarization (SPKR), and Speaker Attributed Speech-To-Text (SASTT). The SASTT task, which combines STT and SPKR tasks, was a new evaluation task. The test data consisted of three test sets: Conference Meetings, Lecture Meetings, and Coffee Breaks from lecture meetings. The Coffee Break data was included as a new test set this year. Twenty-one research sites materially contributed to the evaluation by providing data or building systems. The lowest STT word error rates with up to four simultaneous speakers in the multiple distant microphone condition were 40.6 %, 49.8 %, and 48.4 % for the conference, lecture, and coffee break test sets respectively. For the SPKR task, the lowest diarization error rates for all speech in the multiple distant microphone condition were 8.5 %, 25.8 %, and 25.5 % for the conference, lecture, and coffee break test sets respectively. For the SASTT task, the lowest speaker attributed word error rates for segments with up to three simultaneous speakers in the multiple distant microphone condition were 40.3 %, 59.3 %, and 68.4 % for the conference, lecture, and coffee break test sets respectively.

Proceedings ArticleDOI
15 Apr 2007
TL;DR: A novel biometric modality based on synchrony measures is introduced in order to improve the overall performance of identity verification, and more specifically its robustness to replay attacks.
Abstract: We investigate the use of audio-visual speech synchrony measures in the framework of identity verification based on talking faces. Two synchrony measures, based on canonical correlation analysis and co-inertia analysis respectively, are introduced, and their performance is evaluated on the specific task of detecting synchronized and non-synchronized audio-visual speech sequences. The notion of high-effort impostor attacks is also introduced as a dangerous threat to current biometric systems based on speaker verification and face recognition. A novel biometric modality based on synchrony measures is introduced in order to improve the overall performance of identity verification, and more specifically its robustness to replay attacks.
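
A minimal sketch of a CCA-based synchrony score between time-aligned audio and visual feature streams; the feature choice and the single-component setup are assumptions, not the paper's exact configuration.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def synchrony_score(audio_feats, visual_feats):
    # Correlation of the first canonical pair between time-aligned
    # audio and visual features; genuine talking-face sequences
    # should score higher than replayed or dubbed ones.
    cca = CCA(n_components=1).fit(audio_feats, visual_feats)
    a, v = cca.transform(audio_feats, visual_feats)
    return float(np.corrcoef(a[:, 0], v[:, 0])[0, 1])
```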

Proceedings ArticleDOI
27 Aug 2007
TL;DR: The Normalized Cross Likelihood Ratio is used as a dissimilarity measure between two Gaussian speaker models in the speaker change detection step, and its contribution to the performance of speaker change detection is compared with those of BIC and Hotelling's T2-statistic measures.
Abstract: In this paper, we present the Normalized Cross Likelihood Ratio (NCLR) and the advantages of using it in a speaker diarization system. First, the NCLR is used as a dissimilarity measure between two Gaussian speaker models in the speaker change detection step, and its contribution to the performance of speaker change detection is compared with those of BIC and Hotelling's T2-statistic measures. Then, the NCLR measure is modified to deal with multi-Gaussian adapted models in the cluster recombination step. This step ends the step-by-step speaker diarization process, after the BIC-based hierarchical clustering and the Viterbi re-segmentation steps. Comparing the NCLR measure with the CLR (Cross Likelihood Ratio) measure, the relative diarization error is reduced by more than 30% on the ESTER evaluation data.
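
For concreteness, the NCLR between models M1 and M2 trained on segments X1 and X2 is commonly given as NCLR(M1, M2) = (1/N1) log[L(X1|M1)/L(X1|M2)] + (1/N2) log[L(X2|M2)/L(X2|M1)]; treating that common formulation as the paper's, and noting that sklearn's GaussianMixture.score() already returns the mean per-frame log-likelihood, the measure is a two-liner:

```python
def nclr(m1, m2, X1, X2):
    # m1, m2: fitted sklearn GaussianMixture speaker models; score()
    # returns the mean per-frame log-likelihood, so the 1/N
    # normalization terms of the NCLR are built in.
    return ((m1.score(X1) - m2.score(X1)) +
            (m2.score(X2) - m1.score(X2)))
```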

Book ChapterDOI
TL;DR: The chapter presents an overview of the physical structures of the human vocal tract used in speech, it introduces the standard phonetic classification system for the description of spoken gestures and it presents a catalogue of the different ways in which individuality can be expressed through speech.
Abstract: As well as conveying a message in words and sounds, the speech signal carries information about the speaker's own anatomy, physiology, linguistic experience and mental state. These speaker characteristics are found in speech at all levels of description: from the spectral information in the sounds to the choice of words and utterances themselves. This chapter presents an introduction to speech production and to the phonetic description of speech to facilitate discussion of how speech can be a carrier for speaker characteristics as well as a carrier for messages. The chapter presents an overview of the physical structures of the human vocal tract used in speech, it introduces the standard phonetic classification system for the description of spoken gestures and it presents a catalogue of the different ways in which individuality can be expressed through speech. The chapter ends with a brief description of some applications which require access to information about speaker characteristics in speech.

Book
01 Feb 2007
TL;DR: This book discusses the many roles of speaker classification in speaker verification and identification, as well as its applications in human-machine dialog systems and the evaluation of speaker recognition systems.
Abstract: Contents include:
Fundamentals:
- How Is Individuality Expressed in Voice? An Introduction to Speech Production and Description for Speaker Classification
- Speaker Classification Concepts: Past, Present and Future
Characteristics:
- Speaker Characteristics
- Foreign Accent
- Acoustic Analysis of Adult Speaker Age
- Speech Under Stress: Analysis, Modeling and Recognition
- Speaker Characteristics and Emotion Classification
- Emotions in Speech: Juristic Implications
Applications:
- Application of Speaker Classification in Human Machine Dialog Systems
- Speaker Classification in Forensic Phonetics and Acoustics
- Forensic Automatic Speaker Classification in the "Coming Paradigm Shift"
- The Many Roles of Speaker Classification in Speaker Verification and Identification
Methods and Features:
- Frame Based Features
- Higher-Level Features in Speaker Recognition
- Enhancing Speaker Discrimination at the Feature Level
- Classification Methods for Speaker Recognition
- Multi-stream Fusion for Speaker Classification
Evaluation:
- Evaluations of Automatic Speaker Classification Systems
- An Introduction to Application-Independent Evaluation of Speaker Recognition Systems