
Showing papers on "Speaker diarisation published in 2004"


Journal ArticleDOI
TL;DR: An introduction proposes a modular scheme of the training and test phases of a speaker verification system, and the most commonly used speech parameterization in speaker verification, namely cepstral analysis, is detailed.
Abstract: This paper presents an overview of a state-of-the-art text-independent speaker verification system. First, an introduction proposes a modular scheme of the training and test phases of a speaker verification system. Then, the most commonly used speech parameterization in speaker verification, namely cepstral analysis, is detailed. Gaussian mixture modeling, which is the speaker modeling technique used in most systems, is then explained. A few speaker modeling alternatives, namely neural networks and support vector machines, are mentioned. Normalization of scores is then explained, as this is a very important step for dealing with real-world data. The evaluation of a speaker verification system is then detailed, and the detection error trade-off (DET) curve is explained. Several extensions of speaker verification are then enumerated, including speaker tracking and segmentation by speakers. Then, some applications of speaker verification are proposed, including on-site applications, remote applications, applications relative to structuring audio information, and games. Issues concerning the forensic area are then recalled, as we believe it is very important to inform people about the actual performance and limitations of speaker verification systems. This paper concludes by giving a few research trends in speaker verification for the next couple of years.
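
As a rough illustration of the DET-curve evaluation mentioned above (not taken from the paper), the sketch below sweeps a decision threshold over synthetic target and impostor scores and records miss and false-alarm probabilities; the score distributions are purely hypothetical stand-ins.

import numpy as np

def det_points(target_scores, impostor_scores):
    # Return (false-alarm rate, miss rate) pairs for every candidate threshold.
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    points = []
    for t in thresholds:
        p_miss = np.mean(target_scores < t)    # true speakers rejected
        p_fa = np.mean(impostor_scores >= t)   # impostors accepted
        points.append((p_fa, p_miss))
    return np.array(points)

# Toy usage with synthetic scores; for plotting, both axes are usually warped
# with the inverse normal CDF (probit) to obtain the familiar straight DET lines.
rng = np.random.default_rng(0)
curve = det_points(rng.normal(1.0, 1.0, 500), rng.normal(-1.0, 1.0, 500))
eer_idx = np.argmin(np.abs(curve[:, 0] - curve[:, 1]))
print("approximate equal error rate:", curve[eer_idx].mean())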

874 citations


01 Jan 2004
TL;DR: This paper explores techniques that are specific to the SVM framework in order to derive fully non-linear channel compensations, resulting in a system that is less sensitive to specific kinds of labeled channel variations observed in training.
Abstract: One of the major remaining challenges to improving accuracy in state-of-the-art speaker recognition algorithms is reducing the impact of channel and handset variations on system performance. For Gaussian Mixture Model based speaker recognition systems, a variety of channel-adaptation techniques are known and available for adapting models between different channel conditions, but for the much more recent Support Vector Machine (SVM) based approaches to this problem, much less is known about the best way to handle this issue. In this paper we explore techniques that are specific to the SVM framework in order to derive fully non-linear channel compensations. The result is a system that is less sensitive to specific kinds of labeled channel variations observed in training.

156 citations


01 Jun 2004
TL;DR: The variants of the speaker detection task that have been included in the evaluations and the history of the best performance results for this task are documented here.
Abstract: NIST has coordinated annual evaluations of text-independent speaker recognition since 1996. During the course of this series of evaluations there have been notable milestones related to the development of the evaluation paradigm and the performance achievements of state-of-the-art systems. We document here the variants of the speaker detection task that have been included in the evaluations and the history of the best performance results for this task. Finally, we discuss the data collection and protocols for the 2004 evaluation and beyond.

133 citations


Proceedings ArticleDOI
20 Oct 2004
TL;DR: Experiments on 138 speakers in the YOHO database and two people who played the role of imitators have shown that an impostor can attack the system if that impostor knows a registered speaker in the database whose voice is very similar to the impostor's own.
Abstract: We consider mimicry, a simple, low-technology form of attack requiring little expertise, to investigate whether a speaker recognition system is vulnerable to mimicry by an impostor without the assistance of any other technologies. Experiments on 138 speakers in the YOHO database and two people who played the role of imitators have shown that an impostor can attack the system if that impostor knows a registered speaker in the database whose voice is very similar to the impostor's own.

133 citations


Patent
28 Apr 2004
TL;DR: In this article, a speech feature vector for a voice associated with a source of a text message was determined and compared to speaker models, and a speaker model was selected as a preferred match for the voice based on the comparison.
Abstract: A method of generating speech from text messages includes determining a speech feature vector for a voice associated with a source of a text message, and comparing the speech feature vector to speaker models. The method also includes selecting one of the speaker models as a preferred match for the voice based on the comparison, and generating speech from the text message based on the selected speaker model.
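
A minimal sketch of the kind of model selection the patent describes, under the assumption that each stored speaker model can be summarised by a single feature vector; the cosine-similarity choice below is illustrative, not the patent's actual comparison.

import numpy as np

def pick_speaker_model(voice_vector, model_vectors):
    # Return the index of the stored speaker model closest (cosine similarity)
    # to the voice associated with the text message source.
    v = voice_vector / (np.linalg.norm(voice_vector) + 1e-12)
    m = model_vectors / (np.linalg.norm(model_vectors, axis=1, keepdims=True) + 1e-12)
    return int(np.argmax(m @ v))

# Toy usage: three hypothetical speaker models, one query voice vector.
models = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
print(pick_speaker_model(np.array([0.6, 0.8]), models))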

118 citations


Book ChapterDOI
21 Jun 2004
TL;DR: In this paper, the authors present a corpus of audio-visual data called "AV16.3" along with a method for 3-D location annotation based on calibrated cameras; the desired location annotation can be either 2-dimensional (image plane) or 3-dimensional (physical space).
Abstract: Assessing the quality of a speaker localization or tracking algorithm on a few short examples is difficult, especially when the ground-truth is absent or not well defined. One step towards systematic performance evaluation of such algorithms is to provide time-continuous speaker location annotation over a series of real recordings, covering various test cases. Areas of interest include audio, video and audio-visual speaker localization and tracking. The desired location annotation can be either 2-dimensional (image plane) or 3-dimensional (physical space). This paper motivates and describes a corpus of audio-visual data called “AV16.3”, along with a method for 3-D location annotation based on calibrated cameras. “16.3” stands for 16 microphones and 3 cameras, recorded in a fully synchronized manner, in a meeting room. Part of this corpus has already been successfully used to report research results.
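
As an illustration of 3-D annotation from calibrated cameras, the sketch below recovers a 3-D point from its pixel coordinates in two cameras with known 3x4 projection matrices; this is standard linear (DLT) triangulation, not necessarily the exact procedure used for AV16.3.

import numpy as np

def triangulate(P1, P2, xy1, xy2):
    # Each image observation contributes two rows to a homogeneous system
    # whose least-squares solution (SVD null vector) is the 3-D point.
    A = np.vstack([
        xy1[0] * P1[2] - P1[0],
        xy1[1] * P1[2] - P1[1],
        xy2[0] * P2[2] - P2[0],
        xy2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

# Toy usage with two hypothetical projection matrices.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
point = np.array([0.2, 0.1, 2.0, 1.0])
xy1 = (P1 @ point)[:2] / (P1 @ point)[2]
xy2 = (P2 @ point)[:2] / (P2 @ point)[2]
print(triangulate(P1, P2, xy1, xy2))   # recovers approximately [0.2, 0.1, 2.0]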

105 citations


Journal ArticleDOI
TL;DR: The majority of these technological changes, along with many other important techniques not noted here, have been directed toward increasing the robustness of recognition.
Abstract: Speech and speaker recognition technology has made very significant progress in the past 50 years. The progress can be summarized by the following changes: (1) from template matching to corpus-based statistical modeling, e.g., HMM and n-grams, (2) from filter bank/spectral resonance to cepstral features (cepstrum + Δcepstrum + ΔΔcepstrum), (3) from heuristic time-normalization to DTW/DP matching, (4) from "distance"-based to likelihood-based methods, (5) from maximum likelihood to discriminative approaches, e.g., MCE/GPD and MMI, (6) from isolated-word to continuous speech recognition, (7) from small-vocabulary to large-vocabulary recognition, (8) from context-independent units to context-dependent units for recognition, (9) from clean speech to noisy/telephone speech recognition, (10) from single-speaker to speaker-independent/adaptive recognition, (11) from monologue to dialogue/conversation recognition, (12) from read speech to spontaneous speech recognition, (13) from recognition to understanding, (14) from single-modality (audio signal only) to multi-modal (audio/visual) speech recognition, (15) from hardware recognizers to software recognizers, and (16) from no commercial applications to many practical commercial applications. Most of these advances have taken place in both the fields of speech recognition and speaker recognition. The majority of these technological changes, along with many other important techniques not noted above, have been directed toward increasing the robustness of recognition.

101 citations


01 Jan 2004
TL;DR: The ICSI-SRI system is an agglomerative clustering system that uses a BIC-like measure to determine when to stop merging clusters and to decide which pairs of clusters to merge, providing robustness and portability.
Abstract: We describe the ICSI-SRI entry in the Fall 2004 DARPA EARS Metadata Evaluation. The current system was derived from ICSI’s Fall 2003 Speaker-attributed STT system. Our system is an agglomerative clustering system that uses a BIC-like measure to determine when to stop merging clusters and to decide which pairs of clusters to merge. The main advantage of this approach is that it does not require pre-trained acoustic models, providing robustness and portability. Changes for this year’s system include: different front-end features, the addition of SRI’s Broadcast News speech/non-speech detector, and modifications to the segmentation routine. In post-evaluation work, we found further improvement by changing the stopping criterion from the BIC-like measure to a Viterbi measure. Additionally, we have explored issues related to pruning and improved initialization.
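
To make the merge decision concrete, here is a generic, textbook-style BIC merge test between two clusters, each modelled by a single full-covariance Gaussian. It is not the ICSI-SRI code (whose BIC-like measure is tuned so that no penalty threshold is needed), and the penalty weight below is a free parameter.

import numpy as np

def gauss_loglik(x):
    # Log-likelihood of data x under its own maximum-likelihood
    # full-covariance Gaussian.
    n, d = x.shape
    cov = np.cov(x, rowvar=False) + 1e-6 * np.eye(d)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * n * (logdet + d * np.log(2 * np.pi) + d)

def delta_bic(x, y, penalty_weight=1.0):
    # Positive values favour keeping the two clusters separate; merge otherwise.
    n, d = x.shape[0] + y.shape[0], x.shape[1]
    gain = gauss_loglik(x) + gauss_loglik(y) - gauss_loglik(np.vstack([x, y]))
    n_extra_params = d + d * (d + 1) / 2       # one additional Gaussian
    return gain - 0.5 * penalty_weight * n_extra_params * np.log(n)

# Toy usage: two clusters drawn from clearly different distributions.
rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=(200, 13))
b = rng.normal(3.0, 1.0, size=(200, 13))
print(delta_bic(a, b) > 0)   # expected: True (do not merge)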

91 citations


01 Nov 2004
TL;DR: The improved LIMSI speaker diarization system used in the RT-04F evaluation reduces the speaker error time by over 75% on the development data, compared to the best configuration baseline system for this task.
Abstract: This paper describes the LIMSI speaker diarization system used in the RT-04F evaluation. The RT-04F system builds upon the LIMSI baseline data partitioner, which is used in the broadcast news transcription system. This partitioner provides a high cluster purity but has a tendency to split the data from a speaker into several clusters when there is a large quantity of data for the speaker. In the RT-03S evaluation the baseline partitioner had a 24.5% diarization error rate. Several improvements to the baseline diarization system have been made. A standard Bayesian information criterion (BIC) agglomerative clustering has been integrated, replacing the iterative Gaussian mixture model (GMM) clustering; a local BIC criterion is used for comparing single Gaussians with full covariance matrices. A second clustering stage has been added, making use of a speaker identification method: maximum a posteriori adaptation of a reference GMM with 128 Gaussians. A final post-processing stage refines the segment boundaries using the output of the transcription system. Compared to the best configuration baseline system for this task, the improved system reduces the speaker error time by over 75% on the development data. On evaluation data, an 8.5% overall diarization error rate was obtained, a 60% reduction in error compared to the baseline.

85 citations


Proceedings ArticleDOI
04 Oct 2004
TL;DR: Automatic speaker segmentation and clustering for natural, multi-speaker meeting conversations is addressed; the IHM system described requires no prior training and executes in one-fifth real time on modern architectures.
Abstract: This paper describes the issue of automatic speaker segmentation and clustering for natural, multi-speaker meeting conversations. Two systems were developed and evaluated in the NIST RT-04S Meeting Recognition Evaluation, the Multiple Distant Microphone (MDM) system and the Individual Headset Microphone (IHM) system. The MDM system achieved a speaker diarization performance of 28.17%. This system also aims to provide automatic speech segments and speaker grouping information for speech recognition, a necessary prerequisite for subsequent audio processing. A 44.5% word error rate was achieved for speech recognition. The IHM system is based on the short-time crosscorrelation of all personal channel pairs. It requires no prior training and executes in one fifth real time on modern architectures. A 35.7% word error rate was achieved for speech recognition when segmentation was provided by this system.
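
A rough, illustrative per-frame decision in the spirit of the IHM description (not the evaluated system): compare channel energies and use the zero-lag normalized cross-correlation with the loudest channel to discard channels that look like cross-talk; the thresholds and frame size below are arbitrary placeholders.

import numpy as np

def active_channels(frame, energy_thresh=1e-3, corr_thresh=0.5):
    # frame: array of shape (n_channels, frame_len) holding one analysis frame
    # from each personal headset microphone. Returns the channels judged to
    # carry local speech in this frame.
    energy = np.sum(frame ** 2, axis=1)
    loudest = int(np.argmax(energy))
    active = []
    for ch in range(frame.shape[0]):
        if energy[ch] < energy_thresh:
            continue                       # too quiet: silence on this headset
        if ch == loudest:
            active.append(ch)
            continue
        # High zero-lag correlation with the loudest channel suggests the
        # energy on this channel is cross-talk rather than its wearer speaking.
        ncc = frame[ch] @ frame[loudest] / (np.sqrt(energy[ch] * energy[loudest]) + 1e-12)
        if abs(ncc) < corr_thresh:
            active.append(ch)
    return active

# Toy usage: channel 0 speaks, channel 1 picks up attenuated cross-talk.
rng = np.random.default_rng(0)
speech = rng.normal(size=400)
frame = np.vstack([speech, 0.3 * speech + 0.01 * rng.normal(size=400)])
print(active_channels(frame))   # expected: [0]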

81 citations


Proceedings ArticleDOI
17 May 2004
TL;DR: A new kernel based upon standard log likelihood ratio scoring to address limitations of text classification methods is derived and it is shown that the methods achieve significant gains over standard methods for processing high-level features.
Abstract: Recently, high-level features such as word idiolect, pronunciation, phone usage, prosody, etc., have been successfully used in speaker verification. The benefit of these features was demonstrated in the NIST extended data task for speaker verification; with enough conversational data, a recognition system can become "familiar" with a speaker and achieve excellent accuracy. Typically, high-level-feature recognition systems produce a sequence of symbols from the acoustic signal and then perform recognition using the frequency and co-occurrence of symbols. We propose the use of support vector machines for performing the speaker verification task from these symbol frequencies. Support vector machines have been applied to text classification problems with much success. A potential difficulty in applying these methods is that standard text classification methods tend to "smooth" frequencies which could potentially degrade speaker verification. We derive a new kernel based upon standard log likelihood ratio scoring to address limitations of text classification methods. We show that our methods achieve significant gains over standard methods for processing high-level features.
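
A small sketch of the general idea of scoring symbol frequencies with an SVM using a likelihood-ratio-motivated weighting; the 1/sqrt(background frequency) scaling below is one common choice, and the data, sizes and scikit-learn classifier are stand-ins rather than the paper's setup.

import numpy as np
from sklearn.svm import LinearSVC

def llr_weighted_features(counts, background_probs):
    # counts: (n_conversations, n_symbols) symbol counts per conversation.
    # Relative frequencies are scaled by 1/sqrt(background probability), so a
    # linear kernel approximates a log-likelihood-ratio style comparison.
    probs = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    return probs / np.sqrt(background_probs + 1e-12)

# Toy usage with synthetic symbol streams for target and impostor conversations.
rng = np.random.default_rng(1)
target = rng.poisson(lam=np.linspace(1, 5, 50), size=(20, 50))
impostor = rng.poisson(lam=np.linspace(5, 1, 50), size=(20, 50))
counts = np.vstack([target, impostor])
labels = np.repeat([1, 0], 20)
background = counts.sum(axis=0) / counts.sum()
X = llr_weighted_features(counts, background)
clf = LinearSVC(C=1.0, max_iter=5000).fit(X, labels)
print("training accuracy:", clf.score(X, labels))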

Proceedings ArticleDOI
17 May 2004
TL;DR: It is shown how a joint factor analysis of inter-speaker and intra-speakers variability in a training database which contains multiple recordings for each speaker can be used to construct likelihood ratio statistics for speaker verification which take account of intra-Speaker variation and channel variation in a principled way.
Abstract: We show how a joint factor analysis of inter-speaker and intra-speaker variability in a training database which contains multiple recordings for each speaker can be used to construct likelihood ratio statistics for speaker verification which take account of intra-speaker variation and channel variation in a principled way. We report the results of experiments on the NIST 2001 cellular one speaker detection task carried out by applying this type of factor analysis to Switchboard Cellular Part I. The evaluation data for this task is contained in Switchboard Cellular Part I so these results cannot be taken at face value but they indicate that the factor analysis model can perform extremely well if it is perfectly estimated.
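
A toy illustration of the generative decomposition that joint factor analysis rests on, often written s = m + Vy + Ux: an observed speaker- and channel-dependent supervector is the UBM supervector plus a speaker-factor term and a channel-factor term. The sizes below are arbitrary, and the estimation procedure (the hard part of the paper) is not shown.

import numpy as np

rng = np.random.default_rng(2)
sv_dim, n_speaker_factors, n_channel_factors = 1024, 10, 5   # toy dimensions

m = rng.normal(size=sv_dim)                       # UBM mean supervector
V = rng.normal(size=(sv_dim, n_speaker_factors))  # speaker (eigenvoice) loadings
U = rng.normal(size=(sv_dim, n_channel_factors))  # channel (eigenchannel) loadings

y = rng.normal(size=n_speaker_factors)   # speaker factors: fixed per speaker
x1 = rng.normal(size=n_channel_factors)  # channel factors: vary per recording
x2 = rng.normal(size=n_channel_factors)

recording_1 = m + V @ y + U @ x1   # same speaker, two different channels
recording_2 = m + V @ y + U @ x2
print(np.linalg.norm(recording_1 - recording_2))  # differs only via the channel term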

Patent
24 Jun 2004
TL;DR: In this paper, a system and method enrolls a speaker with an enrollment utterance and authenticates a user with a biometric analysis of an authentication utterance, without the need for a PIN (Personal Identification Number).
Abstract: A system and method enrolls a speaker with an enrollment utterance and authenticates a user with a biometric analysis of an authentication utterance, without the need for a PIN (Personal Identification Number). During authentication, the system uses the same authentication utterance to identify who a speaker claims to be with speaker recognition, and to verify whether the speaker is actually the claimed person. Thus, it is not necessary for the speaker to identify biometric data using a PIN. The biometric analysis includes a neural tree network to determine unique aspects of the authentication utterance for comparison to the enrollment utterance. The biometric analysis also leverages a statistical analysis using Hidden Markov Models before authorizing the speaker.


01 Nov 2004
TL;DR: This paper describes the systems developed by MITLL and used in DARPA EARS Rich Transcription Fall 2004 (RT-04F) speaker diarization evaluation and presents experiments analyzing performance of the systems and a cross-cluster recombination approach that significantly improves performance.
Abstract: Audio diarization is the process of annotating an input audio channel with information that attributes (possibly overlapping) temporal regions of signal energy to their specific sources. These sources can include particular speakers, music, background noise sources, and other signal source/channel characteristics. Diarization has utility in making automatic transcripts more readable and in searching and indexing audio archives. In this paper we describe the systems developed by MITLL and used in DARPA EARS Rich Transcription Fall 2004 (RT-04F) speaker diarization evaluation. The primary system is based on a new proxy speaker model approach and the secondary system follows a more standard BIC based clustering approach. We present experiments analyzing performance of the systems and present a cross-cluster recombination approach that significantly improves performance. In addition, we also present results applying our system to a telephone speech, summed channel speaker detection task.

Patent
10 Jun 2004
TL;DR: In this article, a graphical interface is presented that allows a user to select a number of listeners, a number of speakers, and a listening space, and that graphically displays on a display device the one or more listeners and one or more speakers within the listening space.
Abstract: A system for determining speaker spatialization parameters. The system includes a graphical interface allowing a user to select a number of listeners, a number of speakers, and a listening space, and for graphically displaying on a display device the one or more listeners and one or more speakers within the listening space. The system further includes a speaker spatialization module for determining one or more speaker spatialization parameters based upon the relative positions on the display device of the one or more listeners and the one or more speakers within the listening space. The user may reposition the speakers, and for each speaker that is repositioned the speaker spatialization module recalculates the speaker spatialization parameters for that speaker. The user may also graphically reposition one of the listeners, and the speaker spatialization module then recalculates the speaker spatialization parameters for all of the speakers.

Proceedings ArticleDOI
17 May 2004
TL;DR: A new approach toward automatic annotation of meetings in terms of speaker identities and their locations is presented: the audio recordings are segmented using two independent sources of information, magnitude spectrum analysis and sound source localization, combined in an appropriate HMM framework.
Abstract: The paper presents a new approach toward automatic annotation of meetings in terms of speaker identities and their locations. This is achieved by segmenting the audio recordings using two independent sources of information: magnitude spectrum analysis and sound source localization. We combine the two in an appropriate HMM framework. There are three main advantages of this approach. First, it is completely unsupervised, i.e. speaker identities and number of speakers and locations are automatically inferred. Second, it is threshold-free, i.e. the decisions are made without the need of a threshold value which generally requires an additional development dataset. The third advantage is that the joint segmentation improves over the speaker segmentation derived using only acoustic features. Experiments on a series of meetings recorded in the IDIAP smart meeting room demonstrate the effectiveness of this approach.

Proceedings ArticleDOI
17 May 2004
TL;DR: The paper presents the ELISA consortium activities in automatic speaker segmentation, also known as speaker diarization, during the NIST Rich Transcription (RT) 2003 evaluation; two different approaches from the CLIPS and LIA laboratories are described and different possibilities of combining them are investigated.
Abstract: The paper presents the ELISA consortium activities in automatic speaker segmentation, also known as speaker diarization, during the NIST Rich Transcription (RT) 2003 evaluation. The experiments were conducted on real broadcast news data (HUB4). Two different approaches from the CLIPS and LIA laboratories are presented and different possibilities of combining them are investigated, in the framework of the ELISA consortium. The system submitted as the ELISA primary system obtained the second-lowest segmentation error rate compared to the other RT03-participant primary systems. Another ELISA system submitted as a secondary system outperformed the best primary system and obtained the lowest speaker segmentation error rate.

Patent
28 Sep 2004
TL;DR: In this paper, a system and method for synthesizing audio-visual content in a video image processor is presented, in which audio features and video features are extracted from audio-visual input signals representing a speaker who is speaking and used to create a computer-generated animated version of the face of the speaker.
Abstract: A system and method is provided for synthesizing audio-visual content in a video image processor. A content synthesis application processor extracts audio features and video features from audio-visual input signals that represent a speaker who is speaking. The processor uses the extracted visual features to create a computer generated animated version of the face of the speaker. The processor synchronizes facial movements of the animated version of the face of the speaker with a plurality of audio logical units such as phonemes that represent the speaker's speech. In this manner the processor synthesizes an audio-visual representation of the speaker's face that is properly synchronized with the speaker's speech.

Proceedings ArticleDOI
04 Oct 2004
TL;DR: This work investigates the use of the linguistic information present in the audio signal to structure broadcast news data, and in particular to associate speaker identities with audio segments, by developing patterns which can be used to identify the current, previous or next speaker.

Patent
08 Sep 2004
TL;DR: In this article, an audio device and an audio processing method are provided for adjusting the position of a virtual speaker; the device comprises a decoder which separates, from the provided audio data, an audio component for a center speaker and a plurality of audio components corresponding to other speakers disposed with the center speaker interposed between them, a center delay processor for delaying the center-speaker audio component received from the decoder, and a downmixing processor for distributing the delayed center-speaker audio component between the other speakers and merging it with the original audio component of each other speaker.
Abstract: An audio device and an audio processing method are provided for adjusting the position of a virtual speaker. The audio device comprises a decoder which has audio data provided thereto, the audio data including an audio component for a center speaker and a plurality of audio components corresponding to other speakers disposed with the center speaker interposed therewith, and which decodes these audio components to separate them from the audio data, a center delay processor for delaying the audio component for the center speaker received from the decoder, and a downmixing processor for distributing the delayed center speaker audio component between the other speakers and for merging the audio component distributed to each of the other speakers and the original audio component for each other speaker. Audio sounds corresponding to the downmixed audio components are produced from the other speakers.
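
A highly simplified sketch of the downmix described above (illustrative only, not the patent's processing chain): delay the centre-channel component, then distribute it equally to the left and right channels with a fixed gain before merging with their original components.

import numpy as np

def downmix_center(left, right, center, delay_samples=8, gain=0.7071):
    # Delay the centre channel, then fold it into the left and right channels.
    delayed = np.concatenate([np.zeros(delay_samples), center])[:len(center)]
    return left + gain * delayed, right + gain * delayed

# Toy usage with short synthetic signals at 48 kHz.
t = np.arange(0, 480) / 48000.0
left, right = np.sin(2 * np.pi * 440 * t), np.sin(2 * np.pi * 554 * t)
center = np.sin(2 * np.pi * 330 * t)
new_left, new_right = downmix_center(left, right, center)
print(new_left.shape, new_right.shape)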

Proceedings ArticleDOI
17 May 2004
TL;DR: This work proposes a voice conversion method that does not require a parallel corpus for training, and shows that adaptation reduces the error obtained when simply applying the conversion parameters of one pair of speakers to another by a factor that can reach 30% in many cases.
Abstract: The objective of voice conversion methods is to modify the speech characteristics of a particular speaker in such a manner as to sound like speech by a different target speaker. Current voice conversion algorithms are based on deriving a conversion function by estimating its parameters through a corpus that contains the same utterances spoken by both speakers. Such a corpus, usually referred to as a parallel corpus, has the disadvantage that it is often difficult or even impossible to collect. Here, we propose a voice conversion method that does not require a parallel corpus for training, i.e. the spoken utterances by the two speakers need not be the same: speaker adaptation techniques are employed to adapt the conversion parameters derived from a different pair of speakers to a particular pair of source and target speakers. We show that adaptation reduces the error obtained when simply applying the conversion parameters of one pair of speakers to another by a factor that can reach 30% in many cases, with performance comparable to the ideal case when a parallel corpus is available.

DOI
01 Jan 2004
TL;DR: A novel distance metric is proposed in this thesis for the purpose of finding speaker segment boundaries (speaker change detection) and it is shown that the proposed criterion outperforms the use of LLR, when LLR is used with an optimal threshold value.
Abstract: Audio segmentation, in general, is the task of segmenting a continuous audio stream in terms of acoustically homogeneous regions, where the rule of homogeneity depends on the task. This thesis aims at developing and investigating efficient, robust and unsupervised techniques for three important tasks related to audio segmentation, namely speech/music segmentation, speaker change detection and speaker clustering. The speech/music segmentation technique proposed in this thesis is based on the functioning of an HMM/ANN hybrid ASR system where an MLP estimates the posterior probabilities of different phonemes. These probabilities exhibit a particular pattern when the input is a speech signal. This pattern is captured in the form of feature vectors, which are then integrated in an HMM framework. The technique thus segments the audio data in terms of recognizable and non-recognizable segments. The efficiency of the proposed technique is demonstrated by a number of experiments conducted on broadcast news data exhibiting real-life scenarios (different speech and music styles, overlapping speech and music, non-speech sounds other than music, etc.). A novel distance metric is proposed in this thesis for the purpose of finding speaker segment boundaries (speaker change detection). The proposed metric can be seen as a special case of the Log Likelihood Ratio (LLR) or the Bayesian Information Criterion (BIC), where the number of parameters in the two models (or hypotheses) is forced to be equal. However, the advantage of the proposed metric over LLR, BIC and other metric-based approaches is that it achieves comparable performance without requiring an adjustable threshold/penalty term, hence also eliminating the need for a development dataset. Speaker clustering is the task of unsupervised classification of the audio data in terms of speakers. For this purpose, a novel HMM-based agglomerative clustering algorithm is proposed where, starting from a large number of clusters, the closest clusters are merged in an iterative process. A novel merging criterion is proposed for this purpose, which does not require an adjustable threshold value, and hence the stopping criterion is also automatically met when there are no more clusters left for merging. The efficiency of the proposed algorithm is demonstrated with various experiments on broadcast news data, and it is shown that the proposed criterion outperforms the use of LLR, when LLR is used with an optimal threshold value. These tasks obviously play an important role in the pre-processing stages of ASR. For example, correctly identifying non-recognizable segments in the audio stream and excluding them from recognition saves computation time in ASR and results in more meaningful transcriptions. Moreover, researchers have clearly shown the positive impact of further clustering of identified speech segments in terms of speakers (speaker clustering) on the transcription accuracy. However, we note that this processing has various other interesting and practical applications. For example, it provides characteristic information about the data (metadata), which is useful for the indexing of audio documents. One such application is investigated in this thesis, which extracts this metadata and combines it with the ASR output, resulting in Rich Transcription (RT) which is much easier to understand for an end-user.
In a further application, speaker clustering was combined with precise location information available in scenarios like smart meeting rooms to segment the meeting recordings jointly in terms of speakers and their locations in a meeting room. This is useful for automatic meeting summarization as it enables answering questions like "who is speaking and where". This could be used to access, for example, a specific presentation made by a particular speaker or all the speech segments belonging to a particular speaker.
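
A sketch of the parameter-matched merge criterion described above, under simplifying assumptions (diagonal-covariance GMMs fitted with scikit-learn rather than the thesis' HMM framework): the pooled data is modelled with as many Gaussians as the two cluster models combined, so the BIC penalty terms cancel and the comparison needs no tunable threshold.

import numpy as np
from sklearn.mixture import GaussianMixture

def merge_gain(x, y, n_comp=2):
    # Fit a GMM to each cluster and a GMM with the combined number of
    # components to the pooled data; with equal total parameters the
    # likelihoods can be compared directly.
    gx = GaussianMixture(n_comp, covariance_type='diag', random_state=0).fit(x)
    gy = GaussianMixture(n_comp, covariance_type='diag', random_state=0).fit(y)
    pooled = np.vstack([x, y])
    gxy = GaussianMixture(2 * n_comp, covariance_type='diag', random_state=0).fit(pooled)
    ll_separate = gx.score(x) * len(x) + gy.score(y) * len(y)
    ll_merged = gxy.score(pooled) * len(pooled)
    return ll_merged - ll_separate   # larger values favour merging the clusters

# Toy usage: same-distribution clusters versus clearly different clusters.
rng = np.random.default_rng(0)
a, b = rng.normal(0, 1, (300, 12)), rng.normal(0, 1, (300, 12))
c = rng.normal(4, 1, (300, 12))
print(merge_gain(a, b) > merge_gain(a, c))   # expected: True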

01 Jan 2004
TL;DR: This paper describes two systems for audio segmentation developed at CUED and MIT-LL and evaluates their performance using the speaker diarisation score defined in the 2003 Rich Transcription Evaluation.
Abstract: It is often important to be able to automatically label ‘who spoke when’ during some audio data. This paper describes two systems for audio segmentation developed at CUED and MIT-LL and evaluates their performance using the speaker diarisation score defined in the 2003 Rich Transcription Evaluation. A new clustering procedure and BIC-based stopping criterion for the CUED system is introduced which improves both performance and robustness to changes in segmentation. Finally a hybrid ‘Plug and Play’ system is built which combines different parts of the CUED and MIT-LL systems to produce a single system which outperforms both the individual systems.

Patent
27 Dec 2004
TL;DR: In this paper, the authors proposed a speaker clustering and speaker adaptation method using average model variation information over speakers while analyzing the quantity variation amount and the directional variation amount, which can be applied to any speaker adaptation algorithm of MLLR and MAP.
Abstract: A speech recognition method and apparatus perform speaker clustering and speaker adaptation using average model variation information over speakers while analyzing the quantity variation amount and the directional variation amount. In the speaker clustering method, a speaker group model variation is generated based on the model variation between a speaker-independent model and a training speaker ML model. In the speaker adaptation method, the model in which the model variation between a test speaker ML model and a speaker group ML model to which the test speaker belongs which is most similar to a training speaker group model variation is found, and speaker adaptation is performed on the found model. Herein, the model variation in the speaker clustering and the speaker adaptation are calculated while analyzing both the quantity variation amount and the directional variation amount. The present invention may be applied to any speaker adaptation algorithm of MLLR and MAP.

Proceedings ArticleDOI
17 May 2004
TL;DR: This paper presents a new text-independent speaker recognition method that combines a speaker-specific Gaussian mixture model (GMM) with a syllable-based HMM adapted by MLLR or MAP, and shows that the strong result stems from the complementary effect between the speaker-specific GMM and the speaker-adapted syllable-based HMM.
Abstract: We previously presented a new text-independent speaker recognition method combining a speaker-specific Gaussian mixture model (GMM) with a syllable-based HMM adapted by MLLR or MAP (S. Nakagawa et al., Proc. Eurospeech, p.3017-3020, 2003). The robustness of this speaker recognition method to speaking-style changes is evaluated in this paper. A speaker identification experiment was conducted using an NTT database, which consists of sentences uttered at three speed modes (normal, fast and slow) by 35 Japanese speakers (22 males and 13 females) over five sessions spanning ten months. Each speaker uttered only 5 training utterances (about 20 seconds in total). We obtained an accuracy of 98.8% for text-independent speaker identification across the three speaking-style modes (normal, fast, slow) using a short test utterance (about 4 seconds). This result is superior to conventional methods on the same database. We show that this result stems from the complementary effect between the speaker-specific GMM and the speaker-adapted syllable-based HMM.

Proceedings ArticleDOI
R. Turetsky1, Nevenka Dimitrova1
27 Jun 2004
TL;DR: This work investigates the use of screenplay as a source of information for speaker/character identification and finds that the screenplay alignment is able to identify the speaker correctly in 30% of lines of dialogue on average, but with additional automatic statistical labeling for audio speaker ID on the soundtrack, the recognition rate improves significantly.
Abstract: Existing methods for audiovisual and text analysis of videos perform "blind" recovery of metadata from the audiovisual signal. The film production process however, is based on the original screenplay and its versions. Using this information is like using the recipe book for the movie. High-level semantic information that is otherwise very difficult to derive from the audiovisual content can be extracted automatically by enhancing feature extraction with screenplay processing and analysis. As a test-bed of our approach, we investigated the use of screenplay as a source of information for speaker/character identification. Our speaker identification method consists of screenplay parsing, extraction of time-stamped transcript, alignment of the screenplay with the time-stamped transcript, audio segmentation and audio speaker identification. As the screenplay alignment cannot identify all dialogue sections within any film, we use the segments found by alignment as labels to train a statistical model in order to identify unaligned pieces of dialogue. We find that the screenplay alignment is able to identify the speaker correctly in 30% of lines of dialogue on average. However, with additional automatic statistical labeling for audio speaker ID on the soundtrack, our recognition rate improves significantly.

Patent
Ross Cutler1
30 Apr 2004
TL;DR: In this paper, a system and process for highlighting the current speaker on an on-going basis in each frame of a low frame-rate video of an event having multiple people in attendance, such as a video teleconference, is presented.
Abstract: A system and process for highlighting the current speaker on an on-going basis in each frame of a low frame-rate video of an event having multiple people in attendance, such as a video teleconference, is presented. In general, this is accomplished by periodically identifying an attendee that is currently speaking at a rate substantially faster than the video frame rate, and for each frame of the video updating the frame to highlight the attendee currently speaking. More particularly, an A/V source provides video and audio data streams to the client computing device, with current speaker data embedded into the audio stream via audio watermarking techniques. The client device extracts the current speaker data from the audio stream, and then renders and displays the video while using the current speaker data to periodically update the frame being displayed to highlight the current speaker.

Proceedings ArticleDOI
15 Dec 2004
TL;DR: The hypothesized speaker model is derived by adapting the parameters of the UBM using the speaker's training speech and a form of Bayesian adaptation, and the UBM technique is incorporated into the GMM speaker identification system to significantly reduce the time required for recognition.
Abstract: In this paper, we describe a Gaussian mixture model-universal background model (GMM-UBM) speaker identification system. In this GMM-UBM system, we derive the hypothesized speaker model by adapting the parameters of the UBM using the speaker's training speech and a form of Bayesian adaptation. The UBM technique is incorporated into the GMM speaker identification system to significantly reduce the time required for recognition. The paper also presents a new frame-level likelihood score normalization for adjusting the scores of different speaker models to obtain more robust scores in the final decision. Experiments on the 2000 NIST speaker recognition evaluation corpus show that GMM-UBM and frame-level likelihood score normalization yield better performance. Compared to the baseline system, around a 31.2% relative error reduction is obtained from the combination of both techniques.
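
A minimal sketch of mean-only relevance-MAP adaptation of a UBM in the spirit of the GMM-UBM description above, under stated assumptions: diagonal covariances, scikit-learn's GaussianMixture as the UBM, and synthetic features standing in for real speech frames.

import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapt_means(ubm, features, relevance=16.0):
    # Move each UBM mean toward the speaker's data in proportion to the
    # amount of data (soft counts) that component is responsible for.
    post = ubm.predict_proba(features)              # (n_frames, n_components)
    n_k = post.sum(axis=0)                          # soft counts per component
    e_x = (post.T @ features) / np.maximum(n_k[:, None], 1e-10)
    alpha = (n_k / (n_k + relevance))[:, None]      # data-dependent weights
    return alpha * e_x + (1.0 - alpha) * ubm.means_

# Toy usage with synthetic "frames" standing in for cepstral features.
rng = np.random.default_rng(3)
ubm = GaussianMixture(8, covariance_type='diag', random_state=0)
ubm.fit(rng.normal(size=(2000, 12)))                         # background data
speaker_means = map_adapt_means(ubm, rng.normal(0.3, 1.0, (300, 12)))
# A verification or identification score for a test segment would then compare
# the average log-likelihood under the adapted model against that under the UBM.
print(speaker_means.shape)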

PatentDOI
TL;DR: In this article, a matrix of optimum classifiers for the detection of SID usable and SID unusable speech segments is presented, and a decision tree based on fixed thresholds indicates the presence of a speech feature in a given speech segment.
Abstract: A method for improving speaker identification by determining usable speech. Degraded speech is preprocessed in a speaker identification (SID) process to produce SID-usable and SID-unusable segments. Features are extracted and analyzed so as to produce a matrix of optimum classifiers for the detection of SID-usable and SID-unusable speech segments. Optimum classifiers possess a minimum distance from a speaker model. A decision tree based upon fixed thresholds indicates the presence of a speech feature in a given speech segment. Following preprocessing, degraded speech is measured in one or more time, frequency, cepstral or SID usable/unusable domains. The results of the measurements are multiplied by a weighting factor whose value is proportional to the reliability of the corresponding time, frequency, or cepstral measurements performed. The measurements are fused as information, and usable speech segments are extracted for further processing. Such further processing of co-channel speech may include speaker identification, where a segment-by-segment decision is made on each usable speech segment to determine whether it corresponds to speaker #1 or speaker #2. Further processing of co-channel speech may also include constructing the complete utterance of speaker #1 or speaker #2. Speech features such as pitch and formants may be extended back into the unusable segments to form a complete utterance from each speaker.