
Showing papers on "Speaker diarisation published in 2005"


Proceedings ArticleDOI
18 Mar 2005
TL;DR: An overview of current audio diarization approaches is provided, performance and potential applications are discussed, and the performance of current systems is presented as measured in the DARPA EARS Rich Transcription Fall 2004 (RT-04F) speaker diarization evaluation.
Abstract: Audio diarization is the process of annotating an input audio channel with information that attributes (possibly overlapping) temporal regions of signal energy to their specific sources. These sources can include particular speakers, music, background noise sources, and other signal source/channel characteristics. Diarization has utility in making automatic transcripts more readable and in searching and indexing audio archives. In this paper, we provide an overview of current audio diarization approaches and discuss performance and potential applications. We outline the general framework of diarization systems and present the performance of current systems as measured in the DARPA EARS Rich Transcription Fall 2004 (RT-04F) speaker diarization evaluation. Lastly, we look at future challenges and directions for diarization research.
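
The general framework outlined above can be made concrete as a toy pipeline: crude speech activity detection, fixed-length segmentation, and agglomerative clustering of segment representations. This is a minimal illustrative sketch in Python, assuming per-frame features are already extracted; the energy threshold, 100-frame segments and Ward clustering are stand-ins, not the RT-04F systems surveyed in the paper.

```python
# Toy diarization pipeline: SAD -> segmentation -> clustering.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def toy_diarize(features, energy, n_speakers=2, seg_len=100):
    """features: (T, D) per-frame features; energy: (T,) per-frame energy."""
    speech = energy >= energy.mean()               # crude speech activity detection
    idx = np.flatnonzero(speech)                   # indices of speech frames
    segs = [idx[i:i + seg_len] for i in range(0, len(idx), seg_len)]
    segs = [s for s in segs if len(s) == seg_len]  # drop the short tail segment
    reps = np.stack([features[s].mean(axis=0) for s in segs])
    labels = fcluster(linkage(reps, method="ward"), n_speakers, criterion="maxclust")
    return [(int(s[0]), int(s[-1]), int(lab)) for s, lab in zip(segs, labels)]

# usage: two synthetic "speakers" with shifted feature means
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0, 1, (500, 13)),    # frames of speaker A
                   rng.normal(3, 1, (500, 13))])   # frames of speaker B
print(toy_diarize(feats, energy=np.ones(1000)))
```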

191 citations


Patent
Ilya Skuratovsky
22 Nov 2005
TL;DR: A method of speech synthesis can include automatically identifying spoken passages and non-spoken passages within a text source and converting the text source to speech by applying different voice configurations according to whether each portion of text was identified as a spoken passage or not.
Abstract: A method of speech synthesis can include automatically identifying spoken passages and non-spoken passages within a text source and converting the text source to speech by applying different voice configurations to different portions of text within the text source according to whether each portion of text was identified as a spoken passage or a non-spoken passage. The method further can include identifying the speaker and/or the gender of the speaker and applying different voice configurations according to the speaker identity and/or speaker gender.

160 citations


Proceedings ArticleDOI
04 Sep 2005
TL;DR: The use of adaptation transforms employed in speech recognition systems as features for speaker recognition is explored, and the resulting speaker verification system is competitive with, and in some cases significantly more accurate than, state-of-the-art cepstral Gaussian mixture and SVM systems.
Abstract: We explore the use of adaptation transforms employed in speech recognition systems as features for speaker recognition. This approach is attractive because, unlike standard frame-based cepstral speaker recognition models, it normalizes for the choice of spoken words in text-independent speaker verification. Affine transforms are computed for the Gaussian means of the acoustic models used in a recognizer, using maximum likelihood linear regression (MLLR). The high-dimensional vectors formed by the transform coefficients are then modeled as speaker features using support vector machines (SVMs). The resulting speaker verification system is competitive with, and in some cases significantly more accurate than, state-of-the-art cepstral Gaussian mixture and SVM systems. Further improvements are obtained by combining baseline and MLLR-based systems.
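
A sketch of the modelling step described above: the MLLR transform coefficients of each utterance are flattened into one high-dimensional vector and a linear SVM is trained per target speaker. Estimating real transforms requires a full recognizer, so random matrices stand in for them here; the dimensions and the linear kernel are assumptions, not the paper's exact configuration.

```python
# MLLR transform coefficients as SVM features (synthetic stand-in data).
import numpy as np
from sklearn.svm import LinearSVC

D = 39                                    # acoustic feature dimension (assumed)
def mllr_supervector(A, b):
    """Flatten an affine mean transform (A, b) into a single feature vector."""
    return np.concatenate([A.ravel(), b])

rng = np.random.default_rng(1)
target = [mllr_supervector(np.eye(D) + rng.normal(0, .01, (D, D)),
                           rng.normal(.5, .1, D)) for _ in range(10)]
backgr = [mllr_supervector(np.eye(D) + rng.normal(0, .01, (D, D)),
                           rng.normal(0, .1, D)) for _ in range(50)]
X = np.array(target + backgr)
y = np.array([1] * 10 + [0] * 50)
svm = LinearSVC(C=1.0).fit(X, y)          # one SVM per target speaker
print(float(svm.decision_function(X[:1])[0]))   # verification score for one trial
```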

152 citations


Journal ArticleDOI
TL;DR: A method based on the vowel onset point (VOP) is proposed for locating the end-points of an utterance, and combining the evidence from suprasegmental, source, and spectral features seems to improve the performance of the system significantly.
Abstract: This paper proposes a text-dependent (fixed-text) speaker verification system which uses different types of information for making a decision regarding the identity claim of a speaker. The baseline system uses the dynamic time warping (DTW) technique for matching. Detection of the end-points of an utterance is crucial for the performance of the DTW-based template matching. A method based on the vowel onset point (VOP) is proposed for locating the end-points of an utterance. The proposed method for speaker verification uses the suprasegmental and source features, besides spectral features. The suprasegmental features such as pitch and duration are extracted using the warping path information in the DTW algorithm. Features of the excitation source, extracted using the neural network models, are also used in the text-dependent speaker verification system. Although the suprasegmental and source features individually may not yield good performance, combining the evidence from these features seems to improve the performance of the system significantly. Neural network models are used to combine the evidence from multiple sources of information.
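
The matching step of the baseline system is dynamic time warping. Below is a minimal DTW sketch, assuming cepstral feature matrices as input; the VOP-based end-pointing and the suprasegmental and source feature streams of the paper are not reproduced.

```python
# Path-normalised DTW distance between two feature sequences.
import numpy as np

def dtw_distance(X, Y):
    """Accumulated DTW distance between sequences X (n, d) and Y (m, d)."""
    n, m = len(X), len(Y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(X[i - 1] - Y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)              # normalise by a path-length bound

# usage: a claimant utterance scored against a stored template
rng = np.random.default_rng(2)
template = rng.normal(0, 1, (80, 13))
claimant = template + rng.normal(0, .1, (80, 13))
print(dtw_distance(template, claimant))   # small distance -> accept the claim
```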

117 citations


Proceedings ArticleDOI
04 Sep 2005
TL;DR: Improved multi-stage speaker diarization by incorporating a speaker identification step is described, which provides between 40% and 50% reduction of the speaker error, relative to a standard BIC clustering system.
Abstract: This paper describes recent advances in speaker diarization by incorporating a speaker identification step. This system builds upon the LIMSI baseline data partitioner used in the broadcast news transcription system. This partitioner provides a high cluster purity but has a tendency to split the data from a speaker into several clusters when there is a large quantity of data for the speaker. Several improvements to the baseline system have been made. Firstly, a standard Bayesian information criterion (BIC) agglomerative clustering has been integrated, replacing the iterative Gaussian mixture model (GMM) clustering. Then a second clustering stage has been added, using a speaker identification method with MAP-adapted GMMs. A final post-processing stage refines the segment boundaries using the output of the transcription system. On the RT-04F and ESTER evaluation data, the improved multi-stage system provides between 40% and 50% reduction of the speaker error, relative to a standard BIC clustering system.
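
The BIC clustering stage relies on a delta-BIC merge test: two clusters are merged when modelling them with one full-covariance Gaussian is cheaper, penalty included, than with two. A standard sketch follows; the penalty weight lambda and the exact LIMSI configuration are assumptions.

```python
# Standard full-covariance delta-BIC merge criterion.
import numpy as np

def delta_bic(X, Y, lam=1.0):
    """delta BIC < 0 suggests segments X and Y come from the same speaker."""
    Z = np.vstack([X, Y])
    d = X.shape[1]
    def n_logdet(A):
        return len(A) * np.linalg.slogdet(np.cov(A, rowvar=False))[1]
    gain = 0.5 * (n_logdet(Z) - n_logdet(X) - n_logdet(Y))
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(len(Z))
    return gain - penalty

rng = np.random.default_rng(3)
a, b = rng.normal(0, 1, (200, 12)), rng.normal(0, 1, (200, 12))
c = rng.normal(4, 1, (200, 12))
print(delta_bic(a, b) < 0, delta_bic(a, c) < 0)   # expect: True False
```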

107 citations


Book ChapterDOI
11 Jul 2005
TL;DR: This paper describes the ICSI-SRI entry in the Rich Transcription 2005 Spring Meeting Recognition Evaluation, and adds several features to the baseline clustering system, including a “purification” module that tries to keep the clusters acoustically homogeneous throughout the clustering process, and a delay&sum beamforming algorithm which enhances signal quality for the multiple distant microphones sub-task.
Abstract: In this paper we describe the ICSI-SRI entry in the Rich Transcription 2005 Spring Meeting Recognition Evaluation. The current system is based on the ICSI-SRI clustering system for Broadcast News (BN), with extra modules to process the different meetings tasks in which we participated. Our base system uses agglomerative clustering with a modified Bayesian Information Criterion (BIC) measure to determine when to stop merging clusters and to decide which pairs of clusters to merge. This approach does not require any pre-trained models, thus increasing robustness and simplifying the port from BN to the meetings domain. For the meetings domain, we have added several features to our baseline clustering system, including a “purification” module that tries to keep the clusters acoustically homogeneous throughout the clustering process, and a delay&sum beamforming algorithm which enhances signal quality for the multiple distant microphones (MDM) sub-task. In post-evaluation work we further improved the delay&sum algorithm, experimented with a new speech/non-speech detector and proposed a new system for the lecture room environment.

99 citations


Proceedings ArticleDOI
04 Sep 2005
TL;DR: This system combines techniques used successfully in previous speaker diarisation systems with an additional second clustering stage based on state-of-the-art speaker identification methods to give a diarisation error rate of 6.9% on the RT-04 Fall diarisation evaluation data.
Abstract: This paper describes the speaker diarisation system developed at Cambridge University in March 2005. This system combines techniques used successfully in our previous speaker diarisation systems with an additional second clustering stage based on state-of-the-art speaker identification methods. Several strategies for using the new system are investigated, and the final system gives a diarisation error rate of 6.9% on the RT-04 Fall diarisation evaluation data when processing all the test data together, or 8.6% when processing the test shows independently.

94 citations


Proceedings ArticleDOI
04 Sep 2005
TL;DR: In this article, an approach to modelling session variability for text-independent speaker verification incorporating a constrained session variability component in both the training and testing procedures is presented, which reduces the data labelling requirements and removes discrete categorisation needed by techniques such as feature mapping and H-Norm.
Abstract: Presented is an approach to modelling session variability for GMM-based text-independent speaker verification incorporating a constrained session variability component in both the training and testing procedures. The proposed technique reduces the data labelling requirements and removes discrete categorisation needed by techniques such as feature mapping and H-Norm, while providing superior performance. Experiments on Switchboard-II conversational telephony data show improvements of as much as 48% in detection cost with a single training utterance and 68% with multiple training utterances over a baseline system.

89 citations


Proceedings ArticleDOI
P.Z. Patrick, G. Aversano, Raphaël Blouet, M. Charbit, Gérard Chollet
18 Mar 2005
TL;DR: It is demonstrated that an automatic speaker recognition system could be seriously threatened by a transformation of this kind, using a speaker verification system to calculate the likelihood that the forged voice belongs to the genuine client.
Abstract: The article deals with a technique of voice forgery using the ALISP (automatic language independent speech processing) approach. Such a technique allows the voice of an arbitrary person (the impostor) to be transformed, forging the identity of another person (the client). Our goal is to demonstrate that an automatic speaker recognition system could be seriously threatened by a transformation of this kind. For this purpose, we use a speaker verification system to calculate the likelihood that the forged voice belongs to the genuine client. Experiments on NIST 2004 evaluation data show that the equal error rate for the verification task is significantly increased by our voice transformation.

68 citations


Proceedings ArticleDOI
17 Oct 2005-Scopus
TL;DR: This paper distinguishes between standard American English and Indian Accented English using the second and third formant frequencies of specific accent markers to achieve a suitable classification for these two accent groups.
Abstract: Apart from the word content and identity of a speaker, speech also conveys information about several soft biometric traits such as accent and gender. Accurate classification of these traits can have a direct impact on present speech systems. An accent-specific dictionary or word models can be used to improve the accuracy of speech recognition systems. Gender and accent information can also be used to improve the performance of speaker recognition systems. In this paper, we distinguish between standard American English and Indian Accented English using the second and third formant frequencies of specific accent markers. A GMM classifier is used on the feature set for each accent group. The results show that using just the formant frequencies of these accent markers is sufficient to achieve a suitable classification for these two accent groups.
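
A sketch of the described classifier: one GMM per accent group over (F2, F3) features of the accent markers, with the decision taken by average log-likelihood. The formant values below are synthetic stand-ins for the output of a formant tracker.

```python
# GMM accent classification on second/third formant features.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
f23_us = rng.normal([1400, 2500], [150, 200], (300, 2))  # synthetic US F2/F3 (Hz)
f23_in = rng.normal([1700, 2700], [150, 200], (300, 2))  # synthetic Indian F2/F3

gmm_us = GaussianMixture(n_components=4, random_state=0).fit(f23_us)
gmm_in = GaussianMixture(n_components=4, random_state=0).fit(f23_in)

test = rng.normal([1700, 2700], [150, 200], (20, 2))     # marker tokens of one utterance
accent = "Indian" if gmm_in.score(test) > gmm_us.score(test) else "American"
print(accent)
```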

68 citations


Proceedings ArticleDOI
01 Jan 2005
TL;DR: This paper uses delay-and-sum beamforming techniques to fuse the signals from each of the multiple distant microphones into a single enhanced signal; tests of the approach on the 2004 and 2005 NIST meetings evaluation databases show that the technique performs very well.
Abstract: One of the sub-tasks of the Spring 2004 and Spring 2005 NIST Meetings evaluations requires segmenting multi-party meetings into speaker-homogeneous regions using data from multiple distant microphones (the "MDM" sub-task). One approach to this task is to run a speaker segmentation system on each of the microphone channels separately, and then merge the results. This can be thought of as a many-to-one post-processing approach. In this paper we propose an alternative approach in which we use delay-and-sum beamforming techniques to fuse the signals from each of the multiple distant microphones into a single enhanced signal. This approach can be thought of as a many-to-one pre-processing approach. In the pre-processing approach we propose, the time delay of arrival (TDOA) between each of the multiple distant channels and a reference channel is computed incrementally using a window that steps through the signals from each of the multiple microphones. No information about the locations or setup of the microphones is required. Using the TDOA information, the channels are first aligned and then summed, and the resulting "enhanced" signal is clustered using our standard speaker diarization system. We test our approach on the 2004 and 2005 NIST meetings evaluation databases and show that the technique performs very well.
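
The pre-processing idea can be sketched as delay estimation with GCC-PHAT against a reference channel, followed by alignment and summing. The paper estimates the TDOA incrementally over a sliding window; for brevity this sketch uses one global delay per channel.

```python
# Delay-and-sum beamforming with GCC-PHAT delay estimation.
import numpy as np

def gcc_phat_delay(sig, ref, max_delay):
    """Delay of sig relative to ref, in samples, via GCC-PHAT."""
    n = len(sig) + len(ref)
    S = np.fft.rfft(sig, n) * np.conj(np.fft.rfft(ref, n))
    cc = np.fft.irfft(S / (np.abs(S) + 1e-12), n)
    cc = np.concatenate([cc[-max_delay:], cc[:max_delay + 1]])
    return int(np.argmax(cc)) - max_delay

def delay_and_sum(channels, max_delay=800):
    ref = channels[0]
    out = ref.astype(float).copy()
    for ch in channels[1:]:
        d = gcc_phat_delay(ch, ref, max_delay)
        out += np.roll(ch, -d)                 # align the channel to the reference
    return out / len(channels)

# usage: the second microphone hears the source 40 samples later
rng = np.random.default_rng(5)
src = rng.normal(0, 1, 16000)
mics = [src + 0.1 * rng.normal(0, 1, 16000),
        np.roll(src, 40) + 0.1 * rng.normal(0, 1, 16000)]
enhanced = delay_and_sum(mics)
print(gcc_phat_delay(mics[1], mics[0], 800))   # approximately 40
```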

Book ChapterDOI
11 Jul 2005
TL;DR: The TNO speaker diarization system is based on a standard BIC segmentation and clustering algorithm, and a speech activity detector (SAD) is developed based on decoding the speech signal using two Gaussian Mixture Models trained on silence and speech.
Abstract: The TNO speaker diarization system is based on a standard BIC segmentation and clustering algorithm. Since correct speech detection appears to be essential for the NIST Rich Transcription speaker diarization evaluation measure, we have developed a speech activity detector (SAD) as well. This is based on decoding the speech signal using two Gaussian Mixture Models trained on silence and speech. The SAD was trained on only AMI development test data, and performed quite well in the evaluation on all 5 meeting locations, with a SAD error rate of 5.0%. For the speaker clustering algorithm we optimized the BIC penalty parameter λ to 14, which is quite high with respect to the theoretical value of 1. The final speaker diarization error rate was evaluated at 35.1%.
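
The two-model SAD described above can be sketched as follows: one GMM trained on speech frames, one on silence, with each test frame assigned by log-likelihood. A real system decodes with minimum-duration constraints; plain frame-wise comparison and synthetic stand-in features are used here for brevity.

```python
# Speech activity detection with speech and silence GMMs.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(6)
speech_train = rng.normal(2, 1, (2000, 13))    # stand-in for speech MFCCs
silence_train = rng.normal(0, .5, (2000, 13))  # stand-in for silence MFCCs

gmm_speech = GaussianMixture(n_components=8, random_state=0).fit(speech_train)
gmm_sil = GaussianMixture(n_components=8, random_state=0).fit(silence_train)

test = np.vstack([rng.normal(0, .5, (50, 13)), rng.normal(2, 1, (50, 13))])
is_speech = gmm_speech.score_samples(test) > gmm_sil.score_samples(test)
print(is_speech[:5], is_speech[-5:])           # mostly False, then mostly True
```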

PatentDOI
TL;DR: In this article, a method and apparatus for spotting a target speaker within a call interaction by generating speaker models based on one or more speaker's speech; and by searching for speaker models associated with one or multiple target speaker speech files.
Abstract: A method and apparatus for spotting a target speaker within a call interaction by generating speaker models (98) based on one or more speaker's speech; and by searching for speaker models (110) associated with one or more target speaker speech files.

Patent
24 May 2005
TL;DR: In this paper, a dual-step, text-independent, language-independent speaker voice-print creation and speaker recognition method is presented, where a neural network-based technique is used in the first step and a Markov model-based approach is used for the second step.
Abstract: Disclosed herein is an automatic dual-step, text-independent, language-independent speaker voice-print creation and speaker recognition method, wherein a neural network-based technique is used in a first step and a Markov model-based technique is used in the second step. In particular, the first step uses a neural network-based technique for decoding the content of what is uttered by the speaker in terms of language-independent acoustic-phonetic classes, wherein the second step uses the sequence of language-independent acoustic-phonetic classes from the first step and employs a Markov model-based technique for creating the speaker voice-print and for recognizing the speaker. The combination of the two steps enables improvement in the accuracy and efficiency of the speaker voice-print creation and of the speaker recognition, without setting any constraints on the lexical content of the speaker utterance and on the language thereof.

Journal ArticleDOI
TL;DR: A predetermined generic speaker-independent model set, called the sample speaker models (SSM), is proposed, which can be useful for more accurate speaker modeling and clustering without requiring training models on target speaker data.
Abstract: Unsupervised speaker indexing sequentially detects points where a speaker identity changes in a multispeaker audio stream, and categorizes each speaker segment, without any prior knowledge about the speakers. This paper addresses two challenges: the first relates to sequential speaker change detection; the second relates to speaker modeling in light of the fact that the number/identity of the speakers is unknown. To address this issue, a predetermined generic speaker-independent model set, called the sample speaker models (SSM), is proposed. This set can be useful for more accurate speaker modeling and clustering without requiring training models on target speaker data. Once a speaker-independent model is selected from the generic sample models, it is progressively adapted into a specific speaker-dependent model. Experiments were performed with data from the Speaker Recognition Benchmark NIST Speech corpus (1999) and the HUB-4 Broadcast News Evaluation English Test material (1999). Results showed that our new technique, sampled using the Markov chain Monte Carlo method, gave 92.5% indexing accuracy on two-speaker telephone conversations, 89.6% on four-speaker conversations with telephone speech quality, and 87.2% on broadcast news. The SSMs outperformed the universal background model by up to 29.4% and the universal gender models by up to 22.5% in indexing accuracy in the experiments of this paper.

Journal ArticleDOI
TL;DR: This letter studies feature selection in speaker recognition from an information-theoretic view and closely ties the performance, in terms of the expected classification error probability, to the mutual information between speaker identity and features.
Abstract: This letter studies feature selection in speaker recognition from an information-theoretic view. We closely tie the performance, in terms of the expected classification error probability, to the mutual information between speaker identity and features. Information theory can then help us to make qualitative statements about feature selection and performance. We study various common features used for speaker recognition, such as mel-warped cepstrum coefficients and various parameterizations of linear prediction coefficients. The theory and experiments give valuable insights in feature selection and performance of speaker-recognition applications.
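
The tie between classification error and mutual information invoked here is usually established through Fano's inequality. A standard statement (not necessarily the letter's exact derivation), for a speaker identity S drawn from N speakers and observed features X:

```latex
% Fano's inequality: the binary entropy of the error probability P_e plus
% a scaled error term bounds the conditional entropy of the speaker identity,
% so larger mutual information I(S;X) lowers the achievable error floor.
H_b(P_e) + P_e \log(N - 1) \;\ge\; H(S \mid X) \;=\; H(S) - I(S;X)
```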

Proceedings ArticleDOI
18 Mar 2005
TL;DR: The MIT Lincoln Laboratory submission for the 2004 NIST speaker recognition evaluation (SRE) was built upon seven core systems using speaker information from short-term acoustics, pitch and duration prosodic behavior, and phoneme and word usage.
Abstract: The MIT Lincoln Laboratory submission for the 2004 NIST speaker recognition evaluation (SRE) was built upon seven core systems using speaker information from short-term acoustics, pitch and duration prosodic behavior, and phoneme and word usage. These different levels of information were modeled and classified using Gaussian mixture models, support vector machines and N-gram language models and were combined using a single layer perceptron fuser. The 2004 SRE used a new multi-lingual, multi-channel speech corpus that provided a challenging speaker detection task for the above systems. We describe the core systems used and provide an overview of their performance on the 2004 SRE detection tasks.
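
The final fusion step, a single-layer perceptron over the per-system scores, amounts to a trained linear combination. In the sketch below a logistic-regression fuser on synthetic score vectors stands in for it; the score statistics are invented.

```python
# Linear score fusion over seven subsystem outputs (synthetic scores).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n_sys = 7                                      # seven core systems
target = rng.normal(1.0, 1.0, (500, n_sys))    # scores on target trials
nontgt = rng.normal(-1.0, 1.0, (500, n_sys))   # scores on non-target trials
X = np.vstack([target, nontgt])
y = np.r_[np.ones(500), np.zeros(500)]

fuser = LogisticRegression().fit(X, y)
fused = fuser.decision_function(X)             # one combined score per trial
print(fuser.coef_.round(2))                    # learned per-system weights
```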

Proceedings ArticleDOI
01 Jan 2005
TL;DR: Recent studies on speaker diarization from automatic broadcast news transcripts are described, and the performance using automatic transcripts generated with an LVCSR system is compared with that obtained using manual transcriptions.
Abstract: This paper describes recent studies on speaker diarization from automatic broadcast news transcripts. Linguistic information revealing the true names of who speaks during a broadcast (the next, the previous and the current speaker) is detected by means of linguistic patterns. In order to associate the true speaker names with the speech segments, a set of rules is defined for each pattern. Since the effectiveness of linguistic patterns for diarization depends on the quality of the transcription, the performance using automatic transcripts generated with an LVCSR system is compared with that obtained using manual transcriptions. On about 150 hours of broadcast news data (295 shows), the global ratio of false identity association is about 13% for both the automatic and the manual transcripts.
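
The linguistic-pattern idea can be sketched with regular expressions that capture phrases naming the next or previous speaker, each paired with an association rule. The two English patterns below are invented illustrations; the paper works on French broadcast news with a hand-built pattern set.

```python
# Toy linguistic patterns for detecting speaker names in transcripts.
import re

PATTERNS = [
    (re.compile(r"over to you,? (?P<name>[A-Z]\w+ [A-Z]\w+)", re.I), "next"),
    (re.compile(r"thank you,? (?P<name>[A-Z]\w+ [A-Z]\w+)", re.I), "previous"),
]

def find_speaker_names(transcript):
    """Return (name, association rule, character offset) triples."""
    hits = []
    for pat, rule in PATTERNS:
        for m in pat.finditer(transcript):
            hits.append((m.group("name"), rule, m.start()))
    return hits

text = "... thank you, Jane Smith. Over to you, John Doe, for the weather."
print(find_speaker_names(text))
```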

Book ChapterDOI
11 Jul 2005
TL;DR: Different pre-processing techniques, coupled with three speaker diarization systems in the framework of the NIST 2005 Spring Rich Transcription campaign (RT'05S), aim at providing a signal quality index in order to build a unique “virtual” signal obtained from all the microphone recordings available for a meeting.
Abstract: This paper presents different pre-processing techniques, coupled with three speaker diarization systems, in the framework of the NIST 2005 Spring Rich Transcription campaign (RT'05S). The pre-processing techniques aim at providing a signal quality index in order to build a unique "virtual" signal obtained from all the microphone recordings available for a meeting. This unique virtual signal relies on a weighted sum of the different microphone signals, while the signal quality index is given according to a signal-to-noise ratio. Two methods are used in this paper to compute the instantaneous signal-to-noise ratio: a speech activity detection based approach and a noise spectrum estimate. The speaker diarization task is performed using systems developed by different labs: the LIA, LIUM and CLIPS. Among the different system submissions made by these three labs, the best system obtained 24.5% speaker diarization error for the conference subdomain and 18.4% for the lecture subdomain.
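
The weighting scheme can be sketched as one SNR-derived weight per channel followed by a weighted sum. The noise-floor quantile used below is a crude stand-in for the paper's two estimators (SAD-based and noise-spectrum-based).

```python
# Building a "virtual" signal as an SNR-weighted sum of microphone channels.
import numpy as np

def snr_weighted_sum(channels, noise_frac=0.1, frame=160):
    weights = []
    for ch in channels:
        frame_pow = ch[:len(ch) // frame * frame].reshape(-1, frame).var(axis=1)
        noise = np.quantile(frame_pow, noise_frac)   # crude noise-floor estimate
        weights.append(max(frame_pow.mean() / (noise + 1e-12), 1e-3))
    w = np.array(weights) / sum(weights)
    return sum(wi * ch for wi, ch in zip(w, channels)), w

# usage: an on/off "speech" source picked up by a good and a noisy microphone
rng = np.random.default_rng(8)
clean = rng.normal(0, 1, 16000) * np.repeat(np.tile([0.0, 1.0], 50), 160)
chans = [clean + 0.05 * rng.normal(0, 1, 16000),     # good microphone
         clean + 1.0 * rng.normal(0, 1, 16000)]      # noisy microphone
virtual, w = snr_weighted_sum(chans)
print(w.round(3))                                    # most weight on channel 0
```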

01 Jan 2005
TL;DR: The paper presents a new database called ELSDSR dedicated to speaker recognition applications, and its main characteristics are: English spoken by non-native speakers, a single session of sentence reading, and relatively extensive speech samples suitable for learning person-specific speech characteristics.
In this paper we discuss properties of speech databases used for speaker recognition research and evaluation, and we characterize some popular standard databases. The paper presents a new database called ELSDSR dedicated to speaker recognition applications. The main characteristics of this database are: English spoken by non-native speakers, a single session of sentence reading, and relatively extensive speech samples suitable for learning person-specific speech characteristics.

Book ChapterDOI
14 Sep 2005
TL;DR: How vulnerable a speaker verification system is to conscious effort by impostors to mimic a client of the system is determined and how much closer an impostor can get to another speaker's voice by repeated attempts is explored.
Abstract: The aim of this paper is to determine how vulnerable a speaker verification system is to conscious efforts by impostors to mimic a client of the system. The paper explores systematically how much closer an impostor can get to another speaker's voice through repeated attempts. Experiments on 138 speakers in the YOHO database and six people who played the role of imitators showed that professional linguists could successfully attack the system, and that non-professional imitators could have a good chance if they know which speaker in the database is closest to their own voice.

Patent
27 Sep 2005
TL;DR: In this article, a speech recognition method comprising the steps of storing multiple recognition models for a vocabulary set, each model distinguished from the other models in response to a Lombard characteristic, detecting at least one speaker utterance in a motor vehicle, selecting one of the multiple recognition model and utilizing the selected recognition model to recognize the at least speaker utterances.
Abstract: A speech recognition method comprising the steps of: storing multiple recognition models for a vocabulary set, each model distinguished from the other models in response to a Lombard characteristic, detecting at least one speaker utterance in a motor vehicle, selecting one of the multiple recognition models in response to a Lombard characteristic of the at least one speaker utterance, utilizing the selected recognition model to recognize the at least one speaker utterance; and providing a signal in response to the recognition.

Journal ArticleDOI
Lie Lu, Hong-Jiang Zhang
TL;DR: In this approach, incremental speaker model updating and segmental clustering are proposed, which make unsupervised speaker segmentation and tracking feasible in real-time processing.
Abstract: This paper addresses the problem of real-time speaker segmentation and speaker tracking in audio content analysis, in which no prior knowledge of the number of speakers and the identities of speakers is available. Speaker segmentation detects the speaker change boundaries in a speech stream. It is performed by a two-step algorithm, which includes potential change detection and refinement. Speaker tracking is then performed based on the results of speaker segmentation by identifying the speaker of each segment. In our approach, incremental speaker model updating and segmental clustering are proposed, which make unsupervised speaker segmentation and tracking feasible in real-time processing. A Bayesian fusion method is also proposed to fuse multiple audio features to obtain a more reliable result, and different noise levels are utilized to compensate for background mismatch. Experiments show that the proposed algorithm can recall 89% of speaker change boundaries with 15% false alarms, and that 76% of speakers can be identified without supervision with 20% false alarms. Compared with previous works, the algorithm also has low computational complexity and can run in 15% of real time with a very limited delay in analysis.

Book ChapterDOI
22 Oct 2005
TL;DR: The ongoing work proposes a speech emotion-state conversion approach to improve the performance of a speaker identification system on various affective speech.
Abstract: The performance of a speaker recognition system is easily disturbed by changes in the internal states of humans. The ongoing work proposes a speech emotion-state conversion approach to improve the performance of a speaker identification system on various affective speech. The features of neutral speech are modified according to statistical prosodic parameters of emotional utterances. Speaker models are generated based on the converted speech. Experiments conducted on an emotion corpus with 14 emotion states show promising results, with performance improved by 7.2%.

Journal ArticleDOI
TL;DR: This work proposes a flexible framework in which an optimal speaker model (GMM or VQ) is automatically selected based on the Bayesian Information Criterion according to the amount of training data available, and demonstrates that speaker indexing with this framework is sufficiently accurate for adaptation of the acoustic model.
Abstract: In conventional speaker recognition tasks, the amount of training data is almost the same for each speaker, and the speaker model structure is uniform and specified manually according to the nature of the task and the available size of the training data. In real-world speech data such as telephone conversations and meetings, however, serious problems arise in applying a uniform model because variations in the utterance durations of speakers are large, with numerous short utterances. We therefore propose a flexible framework in which an optimal speaker model (GMM or VQ) is automatically selected based on the Bayesian Information Criterion (BIC) according to the amount of training data available. The framework makes it possible to use a discrete model when the data is sparse, and to seamlessly switch to a continuous model after a large amount of data is obtained. The proposed framework was implemented in unsupervised speaker indexing of a discussion audio. For a real discussion archive with a total duration of 10 hours, we demonstrate that the proposed method has higher indexing performance than that of conventional methods. The speaker index is also used to adapt a speaker-independent acoustic model to each participant for automatic transcription of the discussion. We demonstrate that speaker indexing with our method is sufficiently accurate for adaptation of the acoustic model.
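
The model selection at the heart of the framework can be sketched as fitting both a continuous model (GMM) and a discrete one (VQ codebook) and keeping the lower BIC. The spherical-Gaussian likelihood proxy for VQ below is an assumption for illustration, not the paper's exact formulation.

```python
# BIC-based choice between a GMM and a VQ speaker model.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def pick_model(X, k=8):
    d = X.shape[1]
    gmm = GaussianMixture(n_components=k, covariance_type="diag",
                          random_state=0).fit(X)
    bic_gmm = gmm.bic(X)                       # lower BIC is better
    vq = KMeans(n_clusters=k, n_init=3, random_state=0).fit(X)
    var = vq.inertia_ / (len(X) * d)           # spherical variance around codewords
    loglik = -0.5 * len(X) * d * (np.log(2 * np.pi * var) + 1)
    bic_vq = -2 * loglik + k * d * np.log(len(X))
    return ("GMM", bic_gmm) if bic_gmm < bic_vq else ("VQ", bic_vq)

rng = np.random.default_rng(9)
print(pick_model(rng.normal(0, 1, (2000, 12)))[0])   # ample training data
print(pick_model(rng.normal(0, 1, (60, 12)))[0])     # sparse training data
```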

Proceedings ArticleDOI
23 Mar 2005
TL;DR: A cluster-voting scheme is described which takes the output from two speaker diarisation systems and produces a new output which aims to have a lower speaker diarisation error rate (DER) than either input.
Abstract: Speaker diarisation is the task of automatically segmenting audio data and providing speaker labels for the resulting regions of audio. A cluster-voting scheme is described which takes the output from two speaker diarisation systems and produces a new output which aims to have a lower speaker diarisation error rate (DER) than either input. The scheme works in two stages: the first produces a set of possible outputs which minimise a distance metric based on the DER; the second votes between these alternatives to give the final output. Decisions where the inputs agree are always passed to the output and those where the inputs differ are re-evaluated in the final voting stage. Results are presented on the 6-show RT-03 broadcast news evaluation data; they show that the DER can be reduced by 1.64% and 2.56% absolute using this method when combining the best two Cambridge University and the best two MIT Lincoln Laboratory diarisation systems respectively.

Proceedings ArticleDOI
09 May 2005
TL;DR: Results show that the hybrid approach, with two-level clustering based on a Bayesian information criterion and HMM model scores, significantly outperforms direct metric-based segmentation.
Abstract: We present a hybrid speaker-based segmentation, which combines metric-based and model-based techniques. Without a priori information about the number of speakers and speaker identities, the speech stream is segmented in three stages: (1) the most likely speaker changes are detected; (2) to group segments of identical speakers, a two-level clustering algorithm is performed using a Bayesian information criterion (BIC) and HMM model scores - every cluster is assumed to contain only one speaker; (3) the speaker models are reestimated from each cluster by HMM. Finally a resegmentation step performs a more refined segmentation using these speaker models. To measure the performance, we compare the segmentation results of the proposed hybrid method versus metric-based segmentation. Results show that the hybrid approach using two-level clustering significantly outperforms direct metric-based segmentation.

Proceedings Article
01 Jan 2005
TL;DR: The syllable-based modelling technique is shown to outperform a state-of-the-art baseline GMM system, and a simple selective reduction of the syllable set is also shown to give further improvement in performance.
Abstract: This paper examines the usefulness of a multilingual broad syllable-based framework for text-independent speaker verification. Syllabic segmentation is used in order to obtain a convenient unit for constrained and more detailed model generation. Gaussian mixture models are chosen as a suitable modelling paradigm for initial testing of the framework. Promising results are presented for the NIST 2003 speaker recognition evaluation corpus. The syllable-based modelling technique is shown to outperform a state-of-the-art baseline GMM system. A simple selective reduction of the syllable set is also shown to give a further improvement in performance. Overall, the syllable-based framework presents itself as a valid alternative to text-constrained speaker verification systems, with the advantage of being multilingual. The framework allows for future testing of alternative modelling paradigms, feature sets and qualitative analysis.

Patent
18 Jan 2005
TL;DR: In this article, a method for improving spoken language includes accepting a speech input from by a speaker using a language, identifying the speaker with a predetermined speaker category and correcting an error in the speech input using an error model specific to the speaker category.
Abstract: A method for improving spoken language includes accepting a speech input from by a speaker using a language, identifying the speaker with a predetermined speaker category and correcting an error in the speech input using an error model that is specific to the speaker category.

Proceedings ArticleDOI
18 Mar 2005
TL;DR: The technique of anchor modelling is presented and a new metric to compare speech segments based on the correlation coefficient is introduced, which appears to be more efficient than the classical metrics for the task of speaker detection.
Abstract: This paper presents an approach for speaker tracking in a large audio database. The system described is based on a speaker segmentation procedure consisting of a detection of statistical ruptures in the speech signal followed by a speaker detection procedure using anchor models. The technique of anchor modelling is presented and a new metric to compare speech segments based on the correlation coefficient is introduced. This novel metric is evaluated and compared to the classical Euclidean and angular metrics for the speaker detection task. Evaluation is carried out on the audio database of the ESTER evaluation campaign for the rich transcription of French broadcast news. The new metric appears to be more efficient than the classical metrics for the task of speaker detection.