Showing papers on "Speaker diarisation published in 2008"


Journal ArticleDOI
TL;DR: The main findings are that a combination of Gaussian mixture models and monophone HMM models attains near‐100% text‐independent identification accuracy on utterances that are longer than one second, and the sampling rate of 11025 Hz achieves the best performance.
Abstract: This paper reports the results of our experiments on speaker identification in the SCOTUS corpus, which includes oral arguments from the Supreme Court of the United States. Our main findings are as follows: (1) a combination of Gaussian mixture models and monophone HMM models attains near-100% text-independent identification accuracy on utterances that are longer than one second; (2) the sampling rate of 11025 Hz achieves the best performance (higher sampling rates are harmful), and a sampling rate as low as 2000 Hz still achieves more than 90% accuracy; (3) a distance score based on likelihood numbers was used to measure the variability of phones among speakers; we found that the most variable phone is UH (as in good), and that the velar nasal NG is more variable than the other two nasal sounds M and N; (4) our models achieved “perfect” forced alignment on very long speech segments (one hour). These findings and their significance are discussed.
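The GMM half of such a system is compact enough to sketch. The snippet below is a minimal illustration only, assuming synthetic stand-ins for per-frame MFCCs and scikit-learn's GaussianMixture; it is not the authors' SCOTUS setup, which also combines monophone HMMs.

```python
# Minimal sketch of GMM-based text-independent speaker identification.
# Synthetic vectors stand in for MFCC frames; the paper's HMM component is omitted.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Enrollment: one (frames x 13) feature matrix per speaker.
enroll = {spk: rng.normal(loc=spk, scale=1.0, size=(500, 13)) for spk in range(3)}

# One diagonal-covariance GMM per speaker, as in classical speaker identification.
models = {spk: GaussianMixture(n_components=8, covariance_type="diag",
                               random_state=0).fit(feats)
          for spk, feats in enroll.items()}

def identify(test_frames):
    """Return the enrolled speaker whose GMM gives the highest average log-likelihood."""
    scores = {spk: gmm.score(test_frames) for spk, gmm in models.items()}
    return max(scores, key=scores.get)

test_utterance = rng.normal(loc=1, scale=1.0, size=(100, 13))  # frames from "speaker 1"
print(identify(test_utterance))  # expected: 1
```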

585 citations


Journal ArticleDOI
TL;DR: According to event-related potential results, language comprehension takes very rapid account of the social context, and the construction of meaning based on language alone cannot be separated from the social aspects of language use.
Abstract: When do listeners take into account who the speaker is? We asked people to listen to utterances whose content sometimes did not match inferences based on the identity of the speaker (e.g., If only I looked like Britney Spears in a male voice, or I have a large tattoo on my back spoken with an upper-class accent). Event-related brain responses revealed that the speaker's identity is taken into account as early as 200–300 msec after the beginning of a spoken word, and is processed by the same early interpretation mechanism that constructs sentence meaning based on just the words. This finding is difficult to reconcile with standard Gricean models of sentence interpretation in which comprehenders initially compute a local, context-independent meaning for the sentence (semantics) before working out what it really means given the wider communicative context and the particular speaker (pragmatics). Because the observed brain response hinges on voice-based and usually stereotype-dependent inferences about the speaker, it also shows that listeners rapidly classify speakers on the basis of their voices and bring the associated social stereotypes to bear on what is being said. According to our event-related potential results, language comprehension takes very rapid account of the social context, and the construction of meaning based on language alone cannot be separated from the social aspects of language use. The linguistic brain relates the message to the speaker immediately.

368 citations


Book ChapterDOI
01 Jan 2008
TL;DR: The ICSI speaker diarization system as mentioned in this paper automatically performs both speaker segmentation and clustering without any prior knowledge of the identities or the number of speakers, using standard speech processing components and techniques such as HMMs, agglomerative clustering, and the Bayesian Information Criterion.
Abstract: In this paper, we present the ICSI speaker diarization system. This system was used in the 2007 National Institute of Standards and Technology (NIST) Rich Transcription evaluation. The ICSI system automatically performs both speaker segmentation and clustering without any prior knowledge of the identities or the number of speakers. Our system uses "standard" speech processing components and techniques such as HMMs, agglomerative clustering, and the Bayesian Information Criterion. However, we have developed the system with an eye towards robustness and ease of portability. Thus we have avoided the use of any sort of model that requires training on "outside" data, and we have attempted to develop algorithms that require as little tuning as possible. The system is similar to last year's system [1] except for three aspects: we used the most recent available version of the beam-forming toolkit, we implemented a new speech/non-speech detector that does not require models trained on meeting data, and we performed our development on a much larger set of recordings.
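As a rough illustration of the BIC-based merge decision at the heart of agglomerative diarization, the sketch below compares modelling two segments with one full-covariance Gaussian versus two. It is only a toy under a single-Gaussian-per-cluster assumption, not the ICSI system's HMM/GMM cluster models.

```python
# Toy delta-BIC merge test for agglomerative clustering (single full-covariance
# Gaussian per cluster, not the ICSI models): a positive score favours merging.
import numpy as np

def logdet_cov(frames):
    return np.linalg.slogdet(np.cov(frames, rowvar=False))[1]

def merge_score(x, y, lam=1.0):
    """BIC(one Gaussian for x and y pooled) minus BIC(one Gaussian each)."""
    d = x.shape[1]
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Likelihood term: pooling costs log-likelihood when the segments differ.
    gain = 0.5 * (n1 * logdet_cov(x) + n2 * logdet_cov(y)
                  - n * logdet_cov(np.vstack([x, y])))
    # Penalty saved by dropping one Gaussian's parameters (mean + full covariance).
    penalty_saved = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return gain + penalty_saved

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, (300, 12))
b = rng.normal(0.0, 1.0, (300, 12))   # same "speaker" as a
c = rng.normal(3.0, 1.0, (300, 12))   # different "speaker"
print(merge_score(a, b) > 0, merge_score(a, c) > 0)   # expected: True False
```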

224 citations


Proceedings ArticleDOI
12 May 2008
TL;DR: This work presents the initial work toward developing an overlap detection system for improved meeting diarization, and investigates various features, with a focus on high-precision performance for use in the detector, and examines performance results on a subset of the AMI Meeting Corpus.
Abstract: State-of-the-art speaker diarization systems for meetings are now at a point where overlapped speech contributes significantly to the errors made by the system. However, little if any work has yet been done on detecting overlapped speech. We present our initial work toward developing an overlap detection system for improved meeting diarization. We investigate various features, with a focus on high-precision performance for use in the detector, and examine performance results on a subset of the AMI Meeting Corpus. For the high-quality signal case of a single mixed-headset channel signal, we demonstrate a relative improvement of about 7.4% DER over the baseline diarization system, while for the more challenging case of the single far-field channel signal the relative improvement is 3.6%. We also outline steps towards improvement and moving beyond this initial phase.

150 citations


01 Jan 2008
TL;DR: This paper presents the ALIZE/SpkDet open source software packages for text-independent speaker recognition, based on the well-known UBM/GMM approach, which also include the latest speaker recognition developments such as Latent Factor Analysis (LFA) and unsupervised adaptation.
Abstract: This paper presents the ALIZE/SpkDet open source software packages for text-independent speaker recognition. This software is based on the well-known UBM/GMM approach. It also includes the latest speaker recognition developments such as Latent Factor Analysis (LFA) and unsupervised adaptation. Discriminant classifiers such as SVM supervectors are also provided, linked with the Nuisance Attribute Projection (NAP). The software performance is demonstrated within the framework of the NIST'06 SRE evaluation campaign. Several other applications like speaker diarization, embedded speaker recognition, password dependent speaker recognition and pathological voice assessment are also presented.
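Among the techniques listed, Nuisance Attribute Projection has a particularly compact core operation. The sketch below shows only that projection step on toy-sized supervectors, with a random nuisance basis standing in for one estimated from data; it does not reproduce the ALIZE/SpkDet pipeline.

```python
# Sketch of the Nuisance Attribute Projection step on toy GMM supervectors:
# remove the subspace that captures session/channel variability before SVM scoring.
# The nuisance basis U is random here; in practice it is estimated from
# within-speaker (session) scatter.
import numpy as np

rng = np.random.default_rng(0)
dim, nap_rank = 200, 5                       # supervector size, nuisance subspace rank

U, _ = np.linalg.qr(rng.normal(size=(dim, nap_rank)))   # orthonormal nuisance basis

def nap_project(supervector):
    """Project out the nuisance directions: v - U (U^T v)."""
    return supervector - U @ (U.T @ supervector)

v = rng.normal(size=dim)
cleaned = nap_project(v)
print(np.allclose(U.T @ cleaned, 0.0))       # True: no energy left along the nuisance basis
```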

114 citations


Journal ArticleDOI
TL;DR: This paper proposes a new physiological feature that highlights speaker-individual information for text-independent speaker identification, using a non-uniform subband processing strategy to emphasize the physiological information involved in speech production.

111 citations


Journal ArticleDOI
TL;DR: This survey focuses on two challenging speech processing topics, namely speaker segmentation and speaker clustering; model-based, metric-based, and hybrid speaker segmentation algorithms are reviewed.

108 citations


Book ChapterDOI
01 Jan 2008
TL;DR: The Spring 2007 Rich Transcription Meeting Recognition Evaluation (RT-07) as mentioned in this paper was the fifth in a series of community-wide evaluations of language technologies in the meeting domain.
Abstract: We present the design and results of the Spring 2007 (RT-07) Rich Transcription Meeting Recognition Evaluation, the fifth in a series of community-wide evaluations of language technologies in the meeting domain. For 2007, we supported three evaluation tasks: Speech-To-Text (STT) transcription, "Who Spoke When" Diarization (SPKR), and Speaker Attributed Speech-To-Text (SASTT). The SASTT task, which combines STT and SPKR tasks, was a new evaluation task. The test data consisted of three test sets: Conference Meetings, Lecture Meetings, and Coffee Breaks from lecture meetings. The Coffee Break data was included as a new test set this year. Twenty-one research sites materially contributed to the evaluation by providing data or building systems. The lowest STT word error rates with up to four simultaneous speakers in the multiple distant microphone condition were 40.6%, 49.8%, and 48.4% for the conference, lecture, and coffee break test sets respectively. For the SPKR task, the lowest diarization error rates for all speech in the multiple distant microphone condition were 8.5%, 25.8%, and 25.5% for the conference, lecture, and coffee break test sets respectively. For the SASTT task, the lowest speaker attributed word error rates for segments with up to three simultaneous speakers in the multiple distant microphone condition were 40.3%, 59.3%, and 68.4% for the conference, lecture, and coffee break test sets respectively.
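For reference, the SPKR figures above are scored with the NIST diarization error rate, which (in its standard form, not restated in the abstract) sums missed speech, false-alarm speech, and speaker-confusion time over the scored speech time:

\[
\mathrm{DER} \;=\; \frac{T_{\text{miss}} + T_{\text{false alarm}} + T_{\text{speaker error}}}{T_{\text{scored speech}}}.
\]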

108 citations


Journal ArticleDOI
TL;DR: In this paper, an explicit session term in the Gaussian mixture speaker modeling framework is proposed to model mismatch in speaker recognition by combining the true speaker model with an additional session-dependent offset constrained to lie in a low-dimensional subspace representing session variability.
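In supervector notation, the model described in this summary can be written (using generic symbols, not necessarily the authors' notation) as

\[
\boldsymbol{\mu}_{s,h} \;=\; \boldsymbol{\mu}_{s} + \mathbf{U}\,\mathbf{x}_{h},
\]

where \(\boldsymbol{\mu}_{s}\) is the true speaker supervector, \(\mathbf{U}\) is a low-rank matrix spanning the session-variability subspace, and \(\mathbf{x}_{h}\) is the session-dependent offset coordinate for recording \(h\).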

101 citations


Proceedings ArticleDOI
12 May 2008
TL;DR: A stream-based approach for unsupervised multi-speaker conversational speech segmentation that produces segmentation error rates better than the state of the art ones reported in previous work on the segmentation task in the NIST 2000 Speaker Recognition Evaluation (SRE).
Abstract: This paper presents a stream-based approach for unsupervised multi-speaker conversational speech segmentation. The main idea of this work is to exploit prior knowledge about the speaker space to find a low dimensional vector of speaker factors that summarize the salient speaker characteristics. This new approach produces segmentation error rates that are better than the state of the art ones reported in our previous work on the segmentation task in the NIST 2000 Speaker Recognition Evaluation (SRE). We also show how the performance of a speaker recognition system in the core test of the 2006 NIST SRE is affected, comparing the results obtained using single speaker and automatically segmented test data.

99 citations


Book ChapterDOI
01 Jan 2008
TL;DR: The intrinsic dependence that the lexical content of the password phrase has on the accuracy is demonstrated and several research results will be presented and analyzed to show key techniques used in text-dependent speaker recognition systems from different sites.
Abstract: Text-dependent speaker recognition characterizes a speaker recognition task, such as verification or identification, in which the set of words (or lexicon) used during the testing phase is a subset of the ones present during the enrollment phase. The restricted lexicon enables very short enrollment (or registration) and testing sessions to deliver an accurate solution but, at the same time, represents scientific and technical challenges. Because of the short enrollment and testing sessions, text-dependent speaker recognition technology is particularly well suited for deployment in large-scale commercial applications. These are the bases for presenting an overview of the state of the art in text-dependent speaker recognition as well as emerging research avenues. In this chapter, we will demonstrate the intrinsic dependence that the lexical content of the password phrase has on the accuracy. Several research results will be presented and analyzed to show key techniques used in text-dependent speaker recognition systems from different sites. Among these, we mention multichannel speaker model synthesis and continuous adaptation of speaker models with threshold tracking. Since text-dependent speaker recognition is the most widely used voice biometric in commercial deployments, several

Proceedings ArticleDOI
20 Oct 2008
TL;DR: A realtime system for analyzing group meetings is presented that uses a novel omnidirectional camera-microphone system to automatically discover the visual focus of attention (VFOA), i.e. "who is looking at whom", in addition to speaker diarization; new 3-D visualization schemes for meeting scenes and the results of an analysis are also presented.
Abstract: This paper presents a realtime system for analyzing group meetings that uses a novel omnidirectional camera-microphone system. The goal is to automatically discover the visual focus of attention (VFOA), i.e. "who is looking at whom", in addition to speaker diarization, i.e. "who is speaking and when". First, a novel tabletop sensing device for round-table meetings is presented; it consists of two cameras with two fisheye lenses and a triangular microphone array. Second, from high-resolution omnidirectional images captured with the cameras, the position and pose of people's faces are estimated by STCTracker (Sparse Template Condensation Tracker); it realizes realtime robust tracking of multiple faces by utilizing GPUs (Graphics Processing Units). The face position/pose data output by the face tracker is used to estimate the focus of attention in the group. Using the microphone array, robust speaker diarization is carried out by a VAD (Voice Activity Detection) and a DOA (Direction of Arrival) estimation followed by sound source clustering. This paper also presents new 3-D visualization schemes for meeting scenes and the results of an analysis. Using two PCs, one for vision and one for audio processing, the system runs at about 20 frames per second for 5-person meetings.
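The abstract does not spell out its DOA algorithm; a common choice for a microphone pair is GCC-PHAT time-delay estimation, sketched below purely to illustrate that stage of the VAD → DOA → clustering pipeline (the function and toy signal are ours, not the paper's).

```python
# Sketch of time-delay estimation with GCC-PHAT for one microphone pair.
# The delay estimate is what a DOA stage would convert into an arrival direction.
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimated delay (seconds) of `sig` relative to `ref`."""
    n = len(sig) + len(ref)
    cross = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    cross /= np.abs(cross) + 1e-12               # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2 if max_tau is None else int(fs * max_tau)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

fs = 16000
rng = np.random.default_rng(0)
clean = rng.normal(size=fs)                          # 1 s of broadband "speech"
delayed = np.concatenate([np.zeros(8), clean[:-8]])  # second mic hears it 8 samples later
print(gcc_phat(delayed, clean, fs))                  # expected: 8 / 16000 = 5e-4 s
```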

Patent
10 Oct 2008
TL;DR: In this paper, a method that automatically recognizes speech received through an input is presented; it detects whether the received speech input matches a speaker model according to an adaptable predetermined criterion and creates a speaker model assigned to a speaker model set when no match occurs.
Abstract: A method automatically recognizes speech received through an input. The method accesses one or more speaker-independent speaker models. The method detects whether the received speech input matches a speaker model according to an adaptable predetermined criterion. The method creates a speaker model assigned to a speaker model set when no match occurs based on the input.

01 Sep 2008
TL;DR: This paper shows the danger of not using different speakers in the training and test sets and demonstrates that lip-reading visual features, when compared with the MFCCs commonly used for audio speech recognition, have inherently small variation within a single speaker across all classes spoken.
Abstract: In speech recognition, the problem of speaker variability has been well studied. Common approaches to dealing with it include normalising for a speaker's vocal tract length and learning a linear transform that moves the speaker-independent models closer to a new speaker. In pure lip-reading (no audio) the problem has been less well studied. Results are often presented that are based on speaker-dependent (single speaker) or multi-speaker (speakers in the test set are also in the training set) data, situations that are of limited use in real applications. This paper shows the danger of not using different speakers in the training and test sets. Firstly, we present classification results on a new single-word database, AVletters 2, which is a high-definition version of the well-known AVletters database. By careful choice of features, we show that it is possible for the performance of visual-only lip-reading to be very close to that of audio-only recognition for the single-speaker and multi-speaker configurations. However, in the speaker-independent configuration, the performance of the visual-only channel degrades dramatically. By applying multidimensional scaling (MDS) to both the audio features and visual features, we demonstrate that lip-reading visual features, when compared with the MFCCs commonly used for audio speech recognition, have inherently small variation within a single speaker across all classes spoken. However, visual features are highly sensitive to the identity of the speaker, whereas audio features are relatively invariant.
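The MDS comparison itself is easy to mimic with off-the-shelf tools. The sketch below embeds two synthetic feature sets in 2-D and measures how far apart the speakers end up; it is an illustration of the analysis idea only, not the AVletters 2 experiment.

```python
# Sketch of the MDS comparison idea: embed "audio-style" and "visual-style" features
# in 2-D and compare how strongly speaker identity separates them. Data are synthetic.
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
# Two "speakers", 50 utterances each: visual-style features shift strongly with
# speaker identity, audio-style features much less so.
speaker = np.repeat([0, 1], 50)
audio_feats = rng.normal(size=(100, 20)) + 0.3 * speaker[:, None]
visual_feats = rng.normal(size=(100, 20)) + 3.0 * speaker[:, None]

for name, feats in [("audio", audio_feats), ("visual", visual_feats)]:
    emb = MDS(n_components=2, random_state=0).fit_transform(feats)
    gap = np.linalg.norm(emb[speaker == 0].mean(0) - emb[speaker == 1].mean(0))
    print(name, round(gap, 2))   # the visual embedding separates the speakers far more
```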

Patent
20 Aug 2008
TL;DR: In this article, the method of operating a man-machine interface unit includes classifying at least one utterance of a speaker to be of a first type or of a second type, and automatically adding a new speaker to the speaker data base based on utterances of one of the clusters.
Abstract: The method of operating a man-machine interface unit includes classifying at least one utterance of a speaker to be of a first type or of a second type. If the utterance is classified to be of the first type, the utterance belongs to a known speaker of a speaker data base, and if the utterance is classified to be of the second type, the utterance belongs to an unknown speaker that is not included in the speaker data base. The method also includes storing a set of utterances of the second type, clustering the set of utterances into clusters, wherein each cluster comprises utterances having similar features, and automatically adding a new speaker to the speaker data base based on utterances of one of the clusters.

Proceedings Article
01 Jan 2008
TL;DR: In this paper, the authors compared several alternative procedures for this task with a particular focus on training and testing with short utterances and showed that better performance can be obtained when an independent rather than simultaneous optimisation of the two core variability subspaces is used.
Abstract: Training the speaker and session subspaces is an integral problem in developing a joint factor analysis GMM speaker verification system. This work investigates and compares several alternative procedures for this task with a particular focus on training and testing with short utterances. Experiments show that better performance can be obtained when an independent rather than simultaneous optimisation of the two core variability subspaces is used. It is additionally shown that for verification trials on short utterances it is important for the session subspace to be trained with matched length utterances. Conversely, the speaker transform should always be trained with as much data as possible.

Journal ArticleDOI
TL;DR: In this study, 10 types of disguised voices and normal voices from 20 male college students were used as test samples, and each disguised voice was compared with all normal voices in the database to perform speaker identification and speaker verification.

Book
01 Jan 2008
TL;DR: A study of Acoustic Correlates of Speaker Age, and the impact of Visual and Auditory Cues in Age Estimation on Decoding of Semantic Emotion.

Journal ArticleDOI
TL;DR: A novel alternative stopping method for AHC based on information change rate (ICR) is proposed and is demonstrated to be more robust to data source variation than the BIC-based one.
Abstract: Many current state-of-the-art speaker diarization systems exploit agglomerative hierarchical clustering (AHC) as their speaker clustering strategy, due to its simple processing structure and acceptable level of performance. However, AHC is known to suffer from performance robustness under data source variation. In this paper, we address this problem. We specifically focus on the issues associated with the widely used clustering stopping method based on Bayesian information criterion (BIC) and the merging-cluster selection scheme based on generalized likelihood ratio (GLR). First, we propose a novel alternative stopping method for AHC based on information change rate (ICR). Through experiments on several meeting corpora, the proposed method is demonstrated to be more robust to data source variation than the BIC-based one. The average improvement obtained in diarization error rate (DER) by this method is 8.76% (absolute) or 35.77% (relative). We also introduce a selective AHC (SAHC) in the paper, which first runs AHC with the ICR-based stopping method only on speech segments longer than 3 s and then classifies shorter speech segments into one of the clusters given by the initial AHC. This modified version of AHC is motivated by our previous analysis that the proportion of short speech turns (or segments) in a data source is a significant factor contributing to the robustness problem arising in the GLR-based merging-cluster selection scheme. The additional performance improvement obtained by SAHC is 3.45% (absolute) or 14.08% (relative) in terms of averaged DER.
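For context, the GLR-based merging-cluster selection referred to above compares, for each cluster pair, the likelihood of the pooled data under a single model against the product of the per-cluster likelihoods (generic notation, not the authors'):

\[
\mathrm{GLR}(c_i, c_j) \;=\; \frac{p\!\left(x_i \cup x_j \mid \theta_{ij}\right)}{p\!\left(x_i \mid \theta_i\right)\, p\!\left(x_j \mid \theta_j\right)},
\]

and AHC merges the pair with the largest ratio at each iteration.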

Proceedings Article
01 Jan 2008
TL;DR: A UBM-GMM verification system is applied to semi-automatically extracted formant features and speaker comparisons are expressed as likelihood ratios (the ratio of similarity to typicality), enabling speakers to be distinguished with a low error rate.
Abstract: A new method for speaker verification based on formant features is presented. A UBM-GMM verification system is applied to semi-automatically extracted formant features. Speaker-specific vocal tract configurations, including the speakers’ variability, are incorporated in the speaker models. Speaker comparisons are expressed as likelihood ratios (the ratio of similarity to typicality). F1, F2 and F3 values all enable speakers to be distinguished with a low error rate. The corresponding bandwidths further lower the error rate.
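In the UBM-GMM framing used here, the "ratio of similarity to typicality" is the usual likelihood ratio between the speaker model and the background model (written in generic notation):

\[
\mathrm{LR}(X) \;=\; \frac{p\!\left(X \mid \lambda_{\text{speaker}}\right)}{p\!\left(X \mid \lambda_{\text{UBM}}\right)},
\]

where \(X\) is the set of formant feature vectors extracted from the questioned sample.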

Book ChapterDOI
01 Jan 2008
TL;DR: The SRI-ICSI meeting and lecture recognition system as mentioned in this paper was used in the NIST RT-07 evaluations; improvements made over the last year include updated beamforming software for processing of multiple distant microphones and various adjustments to the speech segmenter for close-talking microphones.
Abstract: We describe the latest version of the SRI-ICSI meeting and lecture recognition system, as was used in the NIST RT-07 evaluations, highlighting improvements made over the last year. Changes in the acoustic preprocessing include updated beamforming software for processing of multiple distant microphones, and various adjustments to the speech segmenter for close-talking microphones. Acoustic models were improved by the combined use of neural-net-estimated phone posterior features, discriminative feature transforms trained with fMPE-MAP, and discriminative Gaussian estimation using MPE-MAP, as well as model adaptation specifically to nonnative and non-American speakers. The net effect of these enhancements was a 14-16% relative error reduction on distant microphones, and a 16-17% error reduction on close-talking microphones. Also, for the first time, we report results on a new "coffee break" meeting genre, and on a new NIST metric designed to evaluate combined speech diarization and recognition.

Proceedings ArticleDOI
01 Mar 2008
TL;DR: Experimental results showed that current standard voice transformation techniques are able to fool the GMM-based system but not the Phonetic speaker identification system, implying that future speaker identification systems should include idiosyncratic knowledge in order to successfully distinguish transformed speech from natural speech and thus be armed against imposter attacks.
Abstract: With the development of voice transformation and speech synthesis technologies, speaker identification systems are likely to face attacks from imposters who use voice transformed or synthesized speech to mimic a particular speaker. Therefore, we investigated in this paper how speaker identification systems perform on voice transformed speech. We conducted experiments with two different approaches, the classical GMM-based speaker identification system and the Phonetic speaker identification system. Our experimental results showed that current standard voice transformation techniques are able to fool the GMM-based system but not the Phonetic speaker identification system. These findings imply that future speaker identification systems should include idiosyncratic knowledge in order to successfully distinguish transformed speech from natural speech and thus be armed against imposter attacks.

01 Jan 2008
TL;DR: In his thesis, Valente showed how the speaker clustering problem could be formulated in a principled way in terms of Bayesian model selection.
Abstract: The speaker diarization problem consists in determining how many speakers there are in a given speech file and in partitioning the speech file into intervals each of which is assigned to one of the speakers. The collection of all intervals assigned to a given speaker is known as a cluster. We assume that the given speech file has already been partitioned into segments, that is, intervals each containing the speech of a single speaker. These segments may be of very short duration and the possibility that the same speaker is talking in two successive segments is not excluded. The problem then is how to cluster the segments so that there is a 1–1 correspondence between speakers and clusters. In his thesis, Valente showed how the speaker clustering problem could be formulated in a principled way in terms of Bayesian model selection [1]. The primary problem, namely determining the number of speakers in the given speech file, can be viewed as one of determining the number of components in a mixture distribution where each mixture component is a speaker model; a Bayesian approach formulates this problem more generally, as one of calculating a posterior probability distribution over the number of mixture components. Similarly, a Bayesian approach to the question of which segments should be assigned to which speakers results in a posterior probability distribution on all possible assignments.
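In equation form, the Bayesian formulation described above places a posterior over the number of speakers \(K\) (generic notation, not Valente's exact formulation):

\[
p(K \mid X) \;\propto\; p(K)\, p(X \mid K)
\;=\; p(K) \int p(X \mid \theta, K)\, p(\theta \mid K)\, d\theta,
\]

with each of the \(K\) mixture components playing the role of a speaker model, and an analogous posterior defined over assignments of segments to speakers.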

Journal ArticleDOI
TL;DR: This paper found that participants recognized the speech acts that speakers performed with their utterances, and that this recognition was automatic and occurred for both written and spoken utterances and for both observers and participants; participants performed either a recognition probe task or a lexical decision task after being exposed to utterances that performed specific speech acts.

Proceedings Article
01 Jan 2008
TL;DR: Experiments using the SRI-FRTIV corpus reveal that “furtive” speech poses a significant challenge, but conversations and interviews, despite stylistic differences, are well matched, and high-effort oration, in contrast to high-effort read speech, shares characteristics with conversational and interview styles.
Abstract: We study the question of how intrinsic variations (associated with the speaker rather than the recording environment) affect text-independent speaker verification performance. Experiments using the SRI-FRTIV corpus, which systematically varies both vocal effort and speaking style, reveal that (1) “furtive” speech poses a significant challenge; (2) conversations and interviews, despite stylistic differences, are well matched; (3) high-effort oration, in contrast to high-effort read speech, shares characteristics with conversational and interview styles; and (4) train/test pairings are generally symmetrical. Implications for further work in the area are discussed.

Journal ArticleDOI
TL;DR: The challenges met while designing a speaker detector for the Microsoft RoundTable distributed meeting device are presented, and a novel boosting-based multimodal speaker detection (BMSD) algorithm is proposed that reduces the error rate of SSL-only approach by 24.6%, and the SSL and MPD fusion approach by 20.9%.
Abstract: Identifying the active speaker in a video of a distributed meeting can be very helpful for remote participants to understand the dynamics of the meeting. A straightforward application of such analysis is to stream a high-resolution video of the speaker to the remote participants. In this paper, we present the challenges we met while designing a speaker detector for the Microsoft RoundTable distributed meeting device, and propose a novel boosting-based multimodal speaker detection (BMSD) algorithm. Instead of separately performing sound source localization (SSL) and multiperson detection (MPD) and subsequently fusing their individual results, the proposed algorithm fuses audio and visual information at feature level by using boosting to select features from a combined pool of both audio and visual features simultaneously. The result is a very accurate speaker detector with extremely high efficiency. In experiments that include hundreds of real-world meetings, the proposed BMSD algorithm reduces the error rate of the SSL-only approach by 24.6%, and that of the SSL and MPD fusion approach by 20.9%. To the best of our knowledge, this is the first real-time multimodal speaker detection algorithm that is deployed in commercial products.
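The central idea of fusing at feature level, letting boosting pick freely from a joint audio-plus-visual pool, can be mimicked with off-the-shelf tools. The toy sketch below uses scikit-learn's AdaBoostClassifier with its default decision-stump learner on synthetic features; it is not the BMSD detector itself.

```python
# Toy feature-level fusion: boosting over one pooled audio+visual feature vector,
# so each selected stump may come from either modality (synthetic data, not BMSD).
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)
n = 2000
audio = rng.normal(size=(n, 4))      # e.g. sound-source-localization scores per region
visual = rng.normal(size=(n, 6))     # e.g. person-detector / motion features
X = np.hstack([audio, visual])       # combined pool: columns 0-3 audio, 4-9 visual

# Synthetic "active speaker" label driven by one audio and one visual feature.
y = (0.8 * audio[:, 0] + 0.6 * visual[:, 2]
     + rng.normal(scale=0.5, size=n) > 0).astype(int)

clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print(round(clf.score(X, y), 3))
print(np.argsort(clf.feature_importances_)[-2:])   # expect columns 0 and 6 to dominate
```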

Proceedings ArticleDOI
01 Dec 2008
TL;DR: A branch and bound feature selection algorithm is applied to select a subset of 15 features among the 1379 originally extracted features; the resulting classification results outperform those obtained by state-of-the-art techniques, since perfect classification accuracy is obtained.
Abstract: Gender classification is a challenging problem, which finds applications in speaker indexing, speaker recognition, speaker diarization, annotation and retrieval of multimedia databases, voice synthesis, smart human-computer interaction, biometrics, social robots etc. Although it has been studied for more than thirty years, it is by no means a solved problem. Processing emotional speech in order to identify the speaker's gender makes the problem even more interesting. A large pool of 1379 features is created, including 605 novel features. A branch and bound feature selection algorithm is applied to select a subset of 15 features among the 1379 originally extracted. Support vector machines with various kernels are tested as gender classifiers, when applied to two databases, namely the Berlin database of Emotional Speech and the Danish Emotional Speech database. The reported classification results outperform those obtained by state-of-the-art techniques, since a perfect classification accuracy is obtained.
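A rough analogue of the "select a small subset, then classify with an SVM" recipe is sketched below on synthetic data. scikit-learn has no branch-and-bound selector, so greedy forward selection stands in for it, and the emotional-speech databases and 1379-feature pool are not reproduced.

```python
# Stand-in for the paper's pipeline: feature selection followed by an RBF-SVM
# gender classifier. Forward selection replaces branch-and-bound; data are synthetic.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=40, n_informative=8,
                           random_state=0)                  # toy acoustic feature pool
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

selector = SequentialFeatureSelector(svm, n_features_to_select=15, cv=3)
X_sel = selector.fit_transform(X, y)                         # keep 15 of 40 features

print(cross_val_score(svm, X_sel, y, cv=5).mean())           # accuracy on the kept subset
```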

Proceedings ArticleDOI
30 Dec 2008
TL;DR: This paper explored a cross-lingual speaker adaptation technique for HMM-based speech synthesis, where a source voice model for English is transformed into a target speaker model using Mandarin Chinese speech data from the target speaker.
Abstract: This paper explores a cross-lingual speaker adaptation technique for HMM-based speech synthesis, where a source voice model for English is transformed into a target speaker model using Mandarin Chinese speech data from the target speaker. A phone mapping-based method is adopted to map Chinese Initial/Finals into English phonemes and two types of mapping rules, including one-to-one and one-to-sequence mappings, are compared. In order to avoid having to map prosodic features between languages, the adaptation procedure uses regression classes and transforms that are constructed for triphone models, then used to adapt the phonetic-and-prosodic-context-dependent models. From the experimental results, we found that a one-to-sequence phone mapping is better than a one-to-one mapping, and that the similarity between adapted English speech and the target Chinese speaker is reasonable.
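To make the one-to-one versus one-to-sequence distinction concrete, here is a tiny illustrative mapping; the table entries are invented for illustration and are not the paper's actual rules.

```python
# Hypothetical Mandarin Initial/Final -> English phoneme tables, invented for
# illustration only (the paper's actual mapping rules are not reproduced).
one_to_one = {"b": ["B"], "ang": ["AE"], "sh": ["SH"]}
one_to_sequence = {"b": ["B"], "ang": ["AE", "NG"], "sh": ["SH"]}

def map_phones(units, table):
    """Flatten a sequence of Mandarin units into English phonemes."""
    return [p for unit in units for p in table.get(unit, ["<unk>"])]

print(map_phones(["b", "ang"], one_to_one))        # ['B', 'AE']
print(map_phones(["b", "ang"], one_to_sequence))   # ['B', 'AE', 'NG']
```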

Patent
Sudhir Raman Ahuja, Jingdong Chen, Yiteng Arden Huang, Dong Liu, Qiru Zhou
03 Mar 2008
TL;DR: In this paper, a method and apparatus for active speaker selection in teleconferencing applications illustratively comprises a microphone array module, a speaker recognition system, a user interface, and a speech signal selection module.
Abstract: A method and apparatus for performing active speaker selection in teleconferencing applications illustratively comprises a microphone array module, a speaker recognition system, a user interface, and a speech signal selection module. The microphone array module separates the speech signal from each active speaker from those of other active speakers, providing a plurality of individual speaker's speech signals. The speaker recognition system identifies each currently active speaker using conventional speaker recognition/identification techniques. These identities are then transmitted to a remote teleconferencing location for display to remote participants via a user interface. The remote participants may then select one of the identified speakers, and the speech signal selection module then selects for transmission the speech signal associated with the selected identified speaker, thereby enabling the participants at the remote location to listen to the selected speaker and neglect the speech from other active speakers.

01 Jan 2008
TL;DR: In this article, the authors examined combining both relevance MAP and subspace speaker adaptation processes to train speaker models for use in speaker verification systems with a particular focus on short utterance lengths.
Abstract: This paper examines combining both relevance MAP and subspace speaker adaptation processes to train GMM speaker models for use in speaker verification systems with a particular focus on short utterance lengths. The subspace speaker adaptation method involves developing a speaker GMM mean supervector as the sum of a speaker-independent prior distribution and a speaker-dependent offset constrained to lie within a low-rank subspace, and has been shown to provide improvements in accuracy over ordinary relevance MAP when the amount of training data is limited. It is shown through testing on NIST SRE data that combining the two processes provides speaker models which lead to modest improvements in verification accuracy for limited data situations, in addition to improving the performance of the speaker verification system when a larger amount of training data is available.
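Written as supervectors (generic notation, not necessarily the authors'), the combination examined here adds a full-rank relevance-MAP offset to the constrained subspace offset:

\[
\boldsymbol{\mu}_s \;=\; \mathbf{m} + \mathbf{V}\,\mathbf{y}(s) + \mathbf{D}\,\mathbf{z}(s),
\]

where \(\mathbf{m}\) is the speaker-independent prior (UBM) supervector, \(\mathbf{V}\mathbf{y}(s)\) is the speaker offset constrained to a low-rank subspace, and \(\mathbf{D}\mathbf{z}(s)\) is the diagonal relevance-MAP term.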