
Showing papers on "Speaker diarisation published in 2010"


Journal ArticleDOI
TL;DR: A voice conversion approach using an ANN model to capture speaker-specific characteristics of a target speaker is proposed, and it is demonstrated that such a voice conversion approach can perform monolingual as well as cross-lingual voice conversion of an arbitrary source speaker.
Abstract: In this paper, we use artificial neural networks (ANNs) for voice conversion and exploit the mapping abilities of an ANN model to perform mapping of spectral features of a source speaker to those of a target speaker. A comparative study of voice conversion using an ANN model and the state-of-the-art Gaussian mixture model (GMM) is conducted. The results of voice conversion, evaluated using subjective and objective measures, confirm that an ANN-based VC system performs as well as a GMM-based VC system, and that the quality of the transformed speech is intelligible and possesses the characteristics of the target speaker. In this paper, we also address the issue of dependency of voice conversion techniques on parallel data between the source and the target speakers. While there have been efforts to use nonparallel data and speaker adaptation techniques, it is important to investigate techniques which capture speaker-specific characteristics of a target speaker and avoid any need for the source speaker's data, either for training or for adaptation. In this paper, we propose a voice conversion approach using an ANN model to capture speaker-specific characteristics of a target speaker and demonstrate that such a voice conversion approach can perform monolingual as well as cross-lingual voice conversion of an arbitrary source speaker.
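A minimal sketch of the frame-level mapping idea, not the authors' implementation: a small feed-forward network is trained to map time-aligned source spectral frames to target frames. The data, dimensions and network size below are illustrative placeholders.

```python
# Hedged sketch: frame-wise spectral mapping with a small ANN.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# Placeholder parallel data: time-aligned spectral frames (e.g. via DTW),
# 25-dim source frames mapped to 25-dim target frames.
X_source = rng.normal(size=(5000, 25))
Y_target = rng.normal(size=(5000, 25))

ann = MLPRegressor(hidden_layer_sizes=(50, 50), activation="tanh",
                   max_iter=200, random_state=0)
ann.fit(X_source, Y_target)              # learn the source -> target mapping

converted = ann.predict(X_source[:10])   # converted spectral frames
```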

269 citations


Proceedings Article
01 Jan 2010
TL;DR: The discriminative input and output transforms for speaker adaptation in the hybrid NN/HMM systems are compared and further investigated with both structural and data-driven constraints.
Abstract: Speaker variability is one of the major error sources for ASR systems. Speaker adaptation estimates speaker-specific models from the speaker-independent ones to minimize the mismatch between the training and testing conditions arising from speaker variability. One of the commonly adopted approaches is the transformation-based method. In this paper, the discriminative input and output transforms for speaker adaptation in hybrid NN/HMM systems are compared and further investigated with both structural and data-driven constraints. Experimental results show that the data-driven constrained discriminative transforms are much more robust for unsupervised adaptation.
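One way to picture the input-transform variant, sketched here under the assumption of a frozen speaker-independent network with a trainable affine layer prepended (PyTorch; all sizes are illustrative, not the paper's configuration):

```python
# Hedged sketch of a linear input transform for speaker adaptation:
# a trainable affine layer is estimated on adaptation data while the
# speaker-independent network stays fixed.
import torch
import torch.nn as nn

feat_dim, n_states = 39, 2000
si_net = nn.Sequential(nn.Linear(feat_dim, 512), nn.Sigmoid(),
                       nn.Linear(512, n_states))        # speaker-independent net
for p in si_net.parameters():
    p.requires_grad = False                             # keep SI weights frozen

input_transform = nn.Linear(feat_dim, feat_dim)         # speaker-dependent layer
with torch.no_grad():                                   # start at identity
    input_transform.weight.copy_(torch.eye(feat_dim))
    input_transform.bias.zero_()

opt = torch.optim.SGD(input_transform.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

def adapt_step(feats, targets):
    """One discriminative adaptation step on (batch, feat_dim) features."""
    opt.zero_grad()
    loss = loss_fn(si_net(input_transform(feats)), targets)
    loss.backward()
    opt.step()
    return float(loss)
```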

213 citations


01 Jan 2010
TL;DR: An open-source diarization toolkit, mostly dedicated to speaker diarization and developed by LIUM, is presented; it includes hierarchical agglomerative clustering methods using well-known measures such as BIC and CLR.
Abstract: This paper presents an open-source diarization toolkit which is mostly dedicated to speaker diarization and developed by LIUM. This toolkit includes hierarchical agglomerative clustering methods using well-known measures such as BIC and CLR. Two applications for which the toolkit has been used are presented: one is for broadcast news using the ESTER 2 data and the other is for telephone conversations using the MEDIA corpus.
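The BIC-based agglomerative step can be sketched roughly as follows; this is a generic illustration of ΔBIC merging, not the toolkit's actual code (`lam` is the usual tunable penalty weight):

```python
# Hedged sketch of BIC-based hierarchical agglomerative clustering.
import numpy as np

def delta_bic(x, y, lam=1.0):
    """Negative values favour merging segments x and y (rows = frames)."""
    z = np.vstack([x, y])
    n, d = z.shape
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    logdet = lambda m: np.log(np.linalg.det(np.cov(m.T)))
    return (0.5 * (n * logdet(z) - len(x) * logdet(x) - len(y) * logdet(y))
            - penalty)

def agglomerate(clusters, lam=1.0):
    """Greedily merge the closest pair until no merge lowers the BIC."""
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        scores = [delta_bic(clusters[i], clusters[j], lam) for i, j in pairs]
        best = int(np.argmin(scores))
        if scores[best] >= 0:            # no pair passes the BIC merge test
            break
        i, j = pairs[best]
        clusters[i] = np.vstack([clusters[i], clusters[j]])
        del clusters[j]
    return clusters
```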

190 citations


Journal ArticleDOI
TL;DR: Three diarization systems were developed and experiments were conducted using the summed-channel telephone data from the 2008 NIST speaker recognition evaluation, with the Variational Bayes system proving to be the most effective.
Abstract: We report on work on speaker diarization of telephone conversations which was begun at the Robust Speaker Recognition Workshop held at Johns Hopkins University in 2008. Three diarization systems were developed and experiments were conducted using the summed-channel telephone data from the 2008 NIST speaker recognition evaluation. The systems are a Baseline agglomerative clustering system, a Streaming system which uses speaker factors for speaker change point detection and traditional methods for speaker clustering, and a Variational Bayes system designed to exploit a large number of speaker factors as in state of the art speaker recognition systems. The Variational Bayes system proved to be the most effective, achieving a diarization error rate of 1.0% on the summed-channel data. This represents an 85% reduction in errors compared with the Baseline agglomerative clustering system. An interesting aspect of the Variational Bayes approach is that it implicitly performs speaker clustering in a way which avoids making premature hard decisions. This type of soft speaker clustering can be incorporated into other diarization systems (although causality has to be sacrificed in the case of the Streaming system). With this modification, the Baseline system achieved a diarization error rate of 3.5% (a 50% reduction in errors).
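The "soft clustering" idea can be illustrated in a few lines: segments carry posterior responsibilities over speakers instead of hard labels, and model updates are responsibility-weighted. This is a generic sketch, not the paper's Variational Bayes system:

```python
# Hedged sketch of soft speaker clustering with posterior responsibilities.
import numpy as np

def soft_assign(loglik):
    """loglik: (n_segments, n_speakers) log-likelihoods.
    Returns posterior responsibilities q(speaker | segment)."""
    loglik = loglik - loglik.max(axis=1, keepdims=True)   # numerical stability
    q = np.exp(loglik)
    return q / q.sum(axis=1, keepdims=True)

def update_speaker_means(feats, q):
    """Responsibility-weighted speaker model update (one soft re-estimation).
    feats: (n_segments, dim); q: (n_segments, n_speakers)."""
    weights = q.sum(axis=0)                               # soft segment counts
    return (q.T @ feats) / weights[:, None]
```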

144 citations


01 Jan 2010
TL;DR: This paper proposes a new architecture for text-independent speaker verification systems that are satisfactorily trained by virtue of a limited amount of application-specific data, supplemented with a sufficient amount of training data from some other context.
Abstract: It is widely believed that speaker verification systems perform better when there is sufficient background training data to deal with nuisance effects of transmission channels. It is also known that these systems perform at their best when the sound environment of the training data is similar to that of the context of use (test context). For some applications, however, training data from the same type of sound environment is scarce, whereas a considerable amount of data from a different type of environment is available. In this paper, we propose a new architecture for text-independent speaker verification systems that can be satisfactorily trained with a limited amount of application-specific data, supplemented with a sufficient amount of training data from some other context. This architecture is based on the extraction of parameters (i-vectors) from a low-dimensional space (total variability space) proposed by Dehak [1]. Our aim is to extend Dehak's work to speaker recognition on sparse data, namely microphone speech. The main challenge is to overcome the fact that insufficient application-specific data is available to accurately estimate the total variability covariance matrix. We propose a method based on Joint Factor Analysis (JFA) to estimate microphone eigenchannels (sparse data) with telephone eigenchannels (sufficient data). For classification, we experimented with two approaches: Support Vector Machines (SVM) and a Cosine Distance Scoring (CDS) classifier, based on cosine distances. We present recognition results for the female voices in the interview data of the NIST 2008 SRE. The best performance is obtained when our system is fused with the state-of-the-art JFA: we achieve a 13% relative improvement in equal error rate, and the minimum value of the detection cost function decreases from 0.0219 to 0.0164.
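Cosine distance scoring itself reduces to a one-line comparison of i-vectors; a hedged sketch with illustrative variable names:

```python
# Hedged sketch of cosine distance scoring (CDS) between two i-vectors.
import numpy as np

def cds_score(w_enrol, w_test):
    """Cosine similarity between enrollment and test i-vectors."""
    return float(w_enrol @ w_test /
                 (np.linalg.norm(w_enrol) * np.linalg.norm(w_test)))

# usage (illustrative): accept the trial if the score clears a threshold
# accept = cds_score(w_enrol, w_test) > theta
```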

110 citations


Dissertation
01 Dec 2010
TL;DR: A new methodology, based on proper scoring rules, is proposed, allowing for the evaluation of the goodness of pattern recognizers with probabilistic outputs, which are intended to be usefully applied over a wide range of applications, having variable priors and costs.
Abstract: We propose a new methodology, based on proper scoring rules, for the evaluation of the goodness of pattern recognizers with probabilistic outputs. The recognizers of interest take an input, known to belong to one of a discrete set of classes, and output a calibrated likelihood for each class. This is a generalization of the traditional use of proper scoring rules to evaluate the goodness of probability distributions. A recognizer with outputs in well-calibrated probability distribution form can be applied to make cost-effective Bayes decisions over a range of applications, having different cost functions. A recognizer with likelihood output can additionally be employed for a wide range of prior distributions for the to-be-recognized classes. We use automatic speaker recognition and automatic spoken language recognition as prototypes of this type of pattern recognizer. The traditional evaluation methods in these fields, as represented by the series of NIST Speaker and Language Recognition Evaluations, evaluate hard decisions made by the recognizers. This makes these recognizers cost-and-prior-dependent. The proposed methodology generalizes that of the NIST evaluations, allowing for the evaluation of recognizers which are intended to be usefully applied over a wide range of applications, having variable priors and costs. The proposal includes a family of evaluation criteria, where each member of the family is formed by a proper scoring rule. We emphasize two members of this family: (i) A non-strict scoring rule, directly representing error-rate at a given prior. (ii) The strict logarithmic scoring rule which represents information content, or which equivalently represents summarized error-rate, or expected cost, over a wide range of applications. We further show how to form a family of secondary evaluation criteria, which by contrasting with the primary criteria, form an analysis of the goodness of calibration of the recognizer’s likelihoods. Finally, we show how to use the logarithmic scoring rule as an objective function for the discriminative training of fusion and calibration of speaker and language recognizers.
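The logarithmic scoring rule emphasized in (ii) is commonly summarized as Cllr; a small sketch of its standard form, assuming scores are natural-log likelihood ratios:

```python
# Hedged sketch of the logarithmic proper scoring rule (Cllr), in bits.
import numpy as np

def cllr(target_llrs, nontarget_llrs):
    """target_llrs / nontarget_llrs: natural-log likelihood ratios."""
    tar = np.asarray(target_llrs, dtype=float)
    non = np.asarray(nontarget_llrs, dtype=float)
    c_tar = np.mean(np.logaddexp(0.0, -tar)) / np.log(2)   # log2(1 + e^-llr)
    c_non = np.mean(np.logaddexp(0.0, non)) / np.log(2)    # log2(1 + e^+llr)
    return 0.5 * (c_tar + c_non)
```

A handy sanity check: a system that outputs llr = 0 for every trial (perfectly calibrated but uninformative) scores exactly 1 bit.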

109 citations


Patent
06 Dec 2010
TL;DR: In this paper, a personalized text-to-speech synthesizer is used to synthesize personalized speech with the speech characteristics of a specific speaker, based on the personalized speech feature library associated with the specific speaker.
Abstract: A personalized text-to-speech synthesizing device includes: a personalized speech feature library creator, configured to recognize personalized speech features of a specific speaker by comparing a random speech fragment of the specific speaker with preset keywords, thereby to create a personalized speech feature library associated with the specific speaker, and store the personalized speech feature library in association with the specific speaker; and a text-to-speech synthesizer, configured to perform a speech synthesis of a text message from the specific speaker, based on the personalized speech feature library associated with the specific speaker and created by the personalized speech feature library creator, thereby to generate and output a speech fragment having pronunciation characteristics of the specific speaker. A personalized speech feature library of a specific speaker is established without a deliberate training process, and a text is synthesized into personalized speech with the speech characteristics of the speaker.

79 citations


Proceedings Article
01 May 2010
TL;DR: This paper presents the EPAC corpus, which is composed of 100 hours of conversational speech manually transcribed, together with the outputs of automatic tools applied to the entire French ESTER 1 audio corpus: about 1700 hours of audio recordings from radio shows.
Abstract: This paper presents the EPAC corpus, which is composed of 100 hours of conversational speech manually transcribed, together with the outputs of automatic tools (automatic segmentation, transcription, POS tagging, etc.) applied to the entire French ESTER 1 audio corpus: this concerns about 1700 hours of audio recordings from radio shows. This corpus was built during the EPAC project funded by the French Research Agency (ANR) from 2007 to 2010. It significantly increases the amount of manually transcribed French audio recordings that is easily available, and it is now included as part of the ESTER 1 corpus in the ELRA catalog at no additional cost. By providing a large set of automatic outputs of speech processing tools, the EPAC corpus should be useful to researchers who want to work on such data without having to develop and deal with such tools. These automatic annotations are varied: segmentation and speaker diarization, one-best hypotheses from the LIUM automatic speech recognition system with confidence measures, but also word lattices and confusion networks, named entities, part-of-speech tags, chunks, etc. The 100 hours of manually transcribed speech were split into three data sets in order to obtain an official training corpus, an official development corpus and an official test corpus. These data sets were used to develop and evaluate the automatic tools which have been used to process the 1700 hours of audio recordings. For example, on the EPAC test data set our ASR system yields a word error rate of 17.25%.

75 citations


01 Jan 2010
TL;DR: The Symmetric Normalization (S-norm) method is proposed, which takes advantage of the symmetry in cosine similarity scoring and achieves competitive performance to that of the ZT-norm while requiring fewer parameter calculations.
Abstract: This paper proposes a new approach to unsupervised speaker adaptation inspired by the recent success of the factor analysisbased Total Variability Approach to text-independent speaker verification [1]. This approach effectively represents speaker variability in terms of low-dimensional total factor vectors and, when paired alongside the simplicity of cosine similarity scoring, allows for easy manipulation and efficient computation [2]. The development of our adaptation algorithm is motivated by the desire to have a robust method of setting an adaptation threshold, to minimize the amount of required computation for each adaptation update, and to simplify the associated score normalization procedures where possible. To address the final issue, we propose the Symmetric Normalization (S-norm) method, which takes advantage of the symmetry in cosine similarity scoring and achieves competitive performance to that of the ZT-norm while requiring fewer parameter calculations. In subsequent experiments, we also assess an attempt to replace the use of score normalization procedures altogether with a Normalized Cosine Similarity scoring function [3]. We evaluated the performance of our unsupervised speaker adaptation algorithm under various score normalization procedures on the 10sec-10sec and core conditions of the 2008 NIST SRE dataset. Using results without adaptation as our baseline, it was found that the proposed methods are consistent in successfully improving speaker verification performance to achieve state-of-the-art results.
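The proposed S-norm can be written directly from its description: the trial score is normalized symmetrically against cohort scores from both the enrollment and test sides. A sketch, assuming the cohort scores are precomputed:

```python
# Hedged sketch of Symmetric Normalization (S-norm) of a trial score.
import numpy as np

def s_norm(score, enrol_cohort_scores, test_cohort_scores):
    """Average of the two Z-norm-style normalisations, one per trial side."""
    e = np.asarray(enrol_cohort_scores, dtype=float)   # enrol vs. cohort
    t = np.asarray(test_cohort_scores, dtype=float)    # test vs. cohort
    return 0.5 * ((score - e.mean()) / e.std()
                  + (score - t.mean()) / t.std())
```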

73 citations


Proceedings ArticleDOI
14 Mar 2010
TL;DR: The effectiveness of phase information for speaker identification in noisy environments is described, using MFCC integrated with phase information.
Abstract: In conventional speaker recognition methods based on MFCC, the phase information has been ignored. Recently, we proposed a method that integrates MFCC with phase information in a speaker recognition method. Using the phase information, the speaker identification error rate was reduced by 78% for clean speech. In this paper, we describe the effectiveness of phase information for speaker identification in noisy environments. Integrating MFCC with phase information, the speaker identification error rates were reduced by 20%∼70% in comparison with using only MFCC in noisy environments.

68 citations


Journal Article
TL;DR: Speaker diarization is the task of determining "who spoke when" in an audio or video recording that contains an unknown amount of speech and also an unknown number of speakers as discussed by the authors.
Abstract: Speaker diarization is the task of determining "who spoke when?" in an audio or video recording that contains an unknown amount of speech and also an unknown number of speakers. Initially, it was proposed as a research topic related to automatic speech recognition, where speaker diarization serves as an upstream processing step. Over recent years, however, speaker diarization has become an important key technology for many tasks, such as navigation, retrieval, or higher-level inference on audio data. Accordingly, many important improvements in accuracy and robustness have been reported in journals and conferences in the area. The application domains, from broadcast news, to lectures and meetings, vary greatly and pose different problems, such as having access to multiple microphones and multimodal information or overlapping speech. The most recent review of existing technology dates back to 2006 and focuses on the broadcast news domain. In this paper we review the current state-of-the-art, focusing on research developed since 2006 that relates predominantly to speaker diarization for conference meetings. Finally, we present an analysis of speaker diarization performance as reported through the NIST Rich Transcription evaluations on meeting data and identify important areas for future research.

Journal ArticleDOI
TL;DR: The paper presents a novel fragment-based speaker identification approach that allows the target speaker to be reliably identified across a wide range of SNRs; this component is combined with the recognition system to produce significant improvements.

Patent
John Tardif1
17 Jun 2010
TL;DR: In this article, a system and method are disclosed for facilitating speech recognition through the processing of visual speech cues, which include the position of the lips, tongue and/or teeth during speech.
Abstract: A system and method are disclosed for facilitating speech recognition through the processing of visual speech cues. These speech cues may include the position of the lips, tongue and/or teeth during speech. In one embodiment, upon capture of a frame of data by an image capture device, the system identifies a speaker and a location of the speaker. The system then focuses in on the speaker to get a clear image of the speaker's mouth. The system includes a visual speech cues engine which operates to recognize and distinguish sounds based on the captured position of the speaker's lips, tongue and/or teeth. The visual speech cues data may be synchronized with the audio data to ensure the visual speech cues engine is processing image data which corresponds to the correct audio data.

Journal ArticleDOI
TL;DR: Three divide-and-conquer approaches for Bayesian information criterion (BIC)-based speaker segmentation are proposed; the results show that the proposed approaches are more efficient and achieve higher segmentation accuracy than the compared distance-based approaches.
Abstract: In this paper, we propose three divide-and-conquer approaches for Bayesian information criterion (BIC)-based speaker segmentation. The approaches detect speaker changes by recursively partitioning a large analysis window into two sub-windows and recursively verifying the merging of two adjacent audio segments using DeltaBIC, a widely-adopted distance measure of two audio segments. We compare our approaches to three popular distance-based approaches, namely, Chen and Gopalakrishnan's window-growing-based approach, Siegler et al.'s fixed-size sliding window approach, and Delacourt and Wellekens's DISTBIC approach, by performing computational cost analysis and conducting speaker change detection experiments on two broadcast news data sets. The results show that the proposed approaches are more efficient and achieve higher segmentation accuracy than the compared distance-based approaches. In addition, we apply the segmentation approaches discussed in this paper to the speaker diarization task. The experimental results show that a more effective segmentation approach leads to better diarization accuracy.
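The core DeltaBIC test inside one analysis window can be sketched as follows; sign convention: a positive value at a candidate boundary suggests a speaker change. This is a generic illustration, not the paper's recursive partitioning code:

```python
# Hedged sketch of DeltaBIC-based change detection in an analysis window.
import numpy as np

def delta_bic(z, split, lam=1.0):
    """DeltaBIC for splitting frames z at index `split`; > 0 suggests a change."""
    x, y = z[:split], z[split:]
    n, d = z.shape
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    logdet = lambda m: np.log(np.linalg.det(np.cov(m.T)))
    return (0.5 * (n * logdet(z) - len(x) * logdet(x) - len(y) * logdet(y))
            - penalty)

def best_change_point(z, margin=50):
    """Scan candidate boundaries; return the best one if DeltaBIC > 0."""
    cands = list(range(margin, len(z) - margin))
    scores = [delta_bic(z, s) for s in cands]
    i = int(np.argmax(scores))
    return (cands[i], scores[i]) if scores[i] > 0 else (None, scores[i])
```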

Proceedings ArticleDOI
01 Nov 2010
TL;DR: Experiments demonstrate that the difference in Mel-cepstral coefficients (MCEPs) between natural and synthetic speech can be detected using the higher-order MCEP coefficients.
Abstract: With the development of HMM-based parametric speech synthesis algorithms, it is easy for impostors to generate synthetic speech with a specific speaker's characteristics, which is a serious threat to state-of-the-art speaker verification systems. In this paper, we investigate the difference in Mel-cepstral coefficients (MCEPs) between natural and synthetic speech. Experiments demonstrate that we can discriminate synthetic speech from natural speech using the higher-order MCEP coefficients.

01 Jan 2010
TL;DR: A framework for measuring the overall performance of an automatic speaker recognition system using a set of trials from a heterogeneous evaluation such as NIST SRE-2008, which combines several acoustic conditions in one evaluation.
Abstract: In this paper we propose a framework for measuring the overall performance of an automatic speaker recognition system using a set of trials from a heterogeneous evaluation such as NIST SRE-2008, which combines several acoustic conditions in one evaluation. We do this by weighting trials of different conditions according to their relative proportion, and we derive expressions for the basic speaker recognition performance measures Cdet and Cllr, as well as the DET curve, from which the EER and minimum Cdet can be computed. Examples of pooling of conditions are shown on SRE-2008 data, including speaker sex, microphone type and speaking style.
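The trial-weighting idea can be sketched directly: each trial contributes to the miss and false-alarm rates in proportion to its condition's desired share. Illustrative only; the paper's exact weighting scheme may differ:

```python
# Hedged sketch: condition-weighted miss/false-alarm rates and actual Cdet.
import numpy as np

def weighted_miss_fa(scores, labels, conditions, weights, threshold):
    """scores, labels (1 = target), conditions: parallel NumPy arrays;
    weights maps a condition name to its desired share of trial mass."""
    w = np.array([weights[c] for c in conditions], dtype=float)
    tar, non = (labels == 1), (labels == 0)
    p_miss = np.sum(w[tar & (scores < threshold)]) / np.sum(w[tar])
    p_fa = np.sum(w[non & (scores >= threshold)]) / np.sum(w[non])
    return p_miss, p_fa

def cdet(p_miss, p_fa, p_tar=0.01, c_miss=10.0, c_fa=1.0):
    """Standard detection cost with illustrative default parameters."""
    return c_miss * p_tar * p_miss + c_fa * (1 - p_tar) * p_fa
```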

Proceedings ArticleDOI
14 Mar 2010
TL;DR: Enhancements to a state-of-the-art top-down approach to speaker diarization are presented; top-down systems deliver competitive results compared to bottom-up systems and are computationally efficient, but are particularly prone to poor model initialisation and cluster impurities.
Abstract: There are two approaches to speaker diarization: bottom-up and top-down. Our work on top-down systems shows that they can deliver competitive results compared to bottom-up systems and that they are extremely computationally efficient, but also that they are particularly prone to poor model initialisation and cluster impurities. In this paper we present enhancements to our state-of-the-art, top-down approach to speaker diarization that deliver improved stability across three different datasets composed of conference meetings from five standard NIST RT evaluations. We report an improved approach to speaker modelling which, despite having greater chances for cluster impurities, delivers a 35% relative improvement in DER for the MDM condition. We also describe new work to incorporate cluster purification into a top-down system which delivers relative improvements of 44% over the baseline system without compromising computational efficiency.

Book ChapterDOI
13 Dec 2010
TL;DR: This literature survey paper discusses the general architecture of SRS, biometric standards relevant to voice/speech, typical applications of SRS, and current research in Speaker Recognition Systems.
Abstract: Human listeners are capable of identifying a speaker, over the telephone or from an entryway out of sight, by listening to the voice of the speaker. Achieving this intrinsic human capability is a major challenge for voice biometrics. Like human listeners, voice biometrics uses the features of a person's voice to ascertain the speaker's identity. The best-known commercialized form of voice biometrics is the Speaker Recognition System (SRS). Speaker recognition is the computing task of validating a user's claimed identity using characteristics extracted from their voice. This literature survey paper gives a brief introduction to SRS, and then discusses the general architecture of SRS, biometric standards relevant to voice/speech, typical applications of SRS, and current research in Speaker Recognition Systems. We also survey various approaches to SRS.

Proceedings ArticleDOI
14 Mar 2010
TL;DR: Experimental results show the performance of unsupervised cross-lingual speaker adaptation is comparable to that of the supervised case in terms of spectrum adaptation in the EMIME scenario, even though automatically obtained transcriptions have a very high phoneme error rate.
Abstract: The EMIME project aims to build a personalized speech-to-speech translator, such that spoken input of a user in one language is used to produce spoken output that still sounds like the user's voice however in another language. This distinctiveness makes unsupervised cross-lingual speaker adaptation one key to the project's success. So far, research has been conducted into unsupervised and cross-lingual cases separately by means of decision tree marginalization and HMM state mapping respectively. In this paper we combine the two techniques to perform unsupervised cross-lingual speaker adaptation. The performance of eight speaker adaptation systems (supervised vs. unsupervised, intra-lingual vs. cross-lingual) is compared using objective and subjective evaluations. Experimental results show the performance of unsupervised cross-lingual speaker adaptation is comparable to that of the supervised case in terms of spectrum adaptation in the EMIME scenario, even though automatically obtained transcriptions have a very high phoneme error rate.

Journal ArticleDOI
TL;DR: The results obtained in the current experiments show that the identification error rate increases when testing with imitated voices, as well as when using converted voices, especially the crossgender ones.
Abstract: Voices can be deliberately disguised by means of human imitation or voice conversion. The question arises as to what extent they can be modified by using either of the two methods. In the current paper, a set of speaker identification experiments is conducted: first, analysing some prosodic features extracted from voices of professional impersonators attempting to mimic a target voice and, second, using both intragender and crossgender converted voices in a spectral-based speaker recognition system. The results obtained in the current experiments show that the identification error rate increases when testing with imitated voices, as well as when using converted voices, especially the crossgender ones.

Patent
04 Feb 2010
TL;DR: In this article, a method for recognizing a speaker of an utterance in a speech recognition system is disclosed and a likelihood score for each of a plurality of speaker models for different speakers is determined.
Abstract: A method for recognizing the speaker of an utterance in a speech recognition system is disclosed. A likelihood score is determined for each of a plurality of speaker models for different speakers; the likelihood score indicates how well the speaker model corresponds to the utterance. For each of the plurality of speaker models, a probability that the utterance originates from that speaker is determined. The probability is determined based on the likelihood score for the speaker model and requires the estimation of the distribution of likelihood scores expected, based at least in part, on the training state of the speaker model.
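The likelihood-to-probability step described here is essentially Bayes' rule over the enrolled speaker models; a minimal sketch (the patent's modelling of expected score distributions is not reproduced):

```python
# Hedged sketch: posterior probabilities over speakers from likelihood scores.
import numpy as np

def speaker_posteriors(log_likelihoods, priors=None):
    """log_likelihoods: one log-likelihood per enrolled speaker model;
    priors: optional prior probability per speaker (defaults to uniform)."""
    logp = np.asarray(log_likelihoods, dtype=float)
    if priors is not None:
        logp = logp + np.log(np.asarray(priors, dtype=float))
    logp -= logp.max()                 # numerical stabilisation
    p = np.exp(logp)
    return p / p.sum()
```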

Journal ArticleDOI
TL;DR: A system for joint temporal segmentation, speaker localization, and identification is presented, supported by face identification from video data obtained from a steerable camera, which describes the vision of terminal-less, session-less and multi-modal telecommunication with remote partners.
Abstract: For an environment to be perceived as being smart, contextual information has to be gathered to adapt the system's behavior and its interface towards the user. Being a rich source of context information, speech can be acquired unobtrusively by microphone arrays and then processed to extract information about the user and his environment. In this paper, a system for joint temporal segmentation, speaker localization, and identification is presented, which is supported by face identification from video data obtained from a steerable camera. Special attention is paid to latency aspects and online processing capabilities, as they are important for the application under investigation, namely ambient communication: the vision of terminal-less, session-less and multi-modal telecommunication with remote partners, where the user can move freely within his home while the communication follows him. The speaker diarization serves as a context source, which has been integrated in a service-oriented middleware architecture and provided to the application to select the most appropriate I/O device and to steer the camera towards the speaker during ambient communication.

Proceedings ArticleDOI
14 Mar 2010
TL;DR: An improved speaker diarization system for the Single Distant Microphone (SDM) task in the 2007 and 2009 NIST Rich Transcription Meeting Recognition Evaluations is described.
Abstract: This paper describes an improved speaker diarization system for the Single Distant Microphone (SDM) task in the 2007 and 2009 NIST Rich Transcription Meeting Recognition Evaluations. The system includes three main modules: front-end processing, initial speaker clustering and cluster purification/merging. The front-end processing involves the Wiener filtering for the targeted audio channels and a self-adaptation speech activity detection algorithm. A simple but effective energy based segmentation is applied to chunk the meeting data into small segments to construct the initial clusters. An enhanced purification algorithm is proposed to further improve the performance after the preliminary purification, and the BIC criterion is adopted for the cluster merging. The system achieves competitive overall DERs of 15.67% for RT07 SDM speaker diarization task and 17.34% for RT09 SDM speaker diarization task.
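The "simple but effective energy based segmentation" can be caricatured as short-time energy thresholding; a sketch with illustrative frame sizes and thresholds, not the system's actual parameters:

```python
# Hedged sketch: energy-based chunking into candidate segments.
import numpy as np

def energy_splits(y, sr, frame=400, hop=160, thresh_db=-40.0, min_gap=25):
    """Return candidate split times (s) where short-time energy stays at
    least `thresh_db` below the peak for `min_gap` consecutive frames."""
    frames = np.lib.stride_tricks.sliding_window_view(y, frame)[::hop]
    e_db = 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    active = e_db > (e_db.max() + thresh_db)   # threshold relative to peak
    bounds, run = [], 0
    for i, a in enumerate(active):
        run = 0 if a else run + 1
        if run == min_gap:                      # one split per low-energy gap
            bounds.append((i - min_gap + 1) * hop / sr)
    return bounds
```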

Journal ArticleDOI
TL;DR: The results show that using the compensated front-end, SAT models and multiple regression classes bring major performance improvements.
Abstract: In recent years the speaker recognition field has made extensive use of speaker adaptation techniques. Adaptation allows speaker model parameters to be estimated using less speech data than needed for maximum-likelihood (ML) training. The maximum a posteriori (MAP) and maximum-likelihood linear regression (MLLR) techniques have typically been used for adaptation. Recently, MAP and MLLR adaptation have been incorporated in the feature extraction stage of support vector machine (SVM)-based speaker recognition systems. Two approaches to feature extraction use an SVM to classify either the MAP-adapted Gaussian mean vector parameters (GSV-SVM) or the MLLR transform coefficients (MLLR-SVM). In this paper, we provide an experimental analysis of the GSV-SVM and MLLR-SVM approaches. We largely focus on the latter by exploring constrained and unconstrained transforms and different choices of the acoustic model. A channel-compensated front-end is used to prevent the MLLR transforms from adapting to channel components in the speech data. Additional acoustic models were trained using speaker adaptive training (SAT) to better estimate the speaker MLLR transforms. We provide results on the NIST 2005 and 2006 Speaker Recognition Evaluation (SRE) data and fusion results on the SRE 2006 data. The results show that using the compensated front-end, SAT models and multiple regression classes brings major performance improvements.

Journal ArticleDOI
TL;DR: The proposed techniques for speaker identification and verification from encrypted VoIP conversations can correctly identify the actual speaker 70-75% of the time among a group of 10 potential suspects, and achieve a more than 10-fold improvement over random guessing in identifying a perpetrator.

04 Oct 2010
TL;DR: The speech recognition system was improved to save thermal images at three timing positions: just before speaking, and when speaking the phonemes of the first and last vowels; with this method, intentional facial expressions were discriminable with good recognition accuracy.
Abstract: I investigated a method for facial expression recognition for a human speaker by using thermal image processing and a speech recognition system. In this study, we improved our speech recognition system to save thermal images at three timing positions: just before speaking, and when speaking the phonemes of the first and last vowels. With this method, the intentional facial expressions "angry", "happy", "neutral", "sad", and "surprised" were discriminable with good recognition accuracy.

Proceedings Article
01 Jan 2010
TL;DR: An open-set online speaker diarization system based on Gaussian mixture models, which are used as speaker models; the system implicitly performs audio segmentation, speech/non-speech classification, gender recognition and speaker identification.
Abstract: In this paper, we present an open-set online speaker diarization system. The system is based on Gaussian mixture models (GMMs), which are used as speaker models. The system starts with just 3 such models (one for each gender and one for non-speech) and creates models for individual speakers only as those speakers occur. As more and more speakers appear, more models are created. Our system implicitly performs audio segmentation, speech/non-speech classification, gender recognition and speaker identification. The system is tested on the HUB4-1996 radio broadcast news database.
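The open-set logic reduces to: score an incoming segment against all current models and spawn a new speaker when no model scores well enough. The sketch below replaces GMMs with per-speaker mean vectors purely for brevity; threshold and adaptation rate are illustrative:

```python
# Hedged sketch of open-set online speaker assignment.
import numpy as np

def process_segment(feat, models, new_speaker_threshold=-10.0):
    """feat: averaged feature vector of a segment; models: dict name -> mean."""
    if models:
        scores = {name: -np.sum((feat - mu) ** 2)        # crude score proxy
                  for name, mu in models.items()}
        best = max(scores, key=scores.get)
        if scores[best] >= new_speaker_threshold:
            models[best] = 0.9 * models[best] + 0.1 * feat   # adapt the model
            return best
    name = f"spk{len(models)}"                           # unseen speaker
    models[name] = feat.copy()
    return name
```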

Proceedings Article
01 Jan 2010
TL;DR: It is shown that the binary key vector extraction process does not need any threshold and offers the opportunity to set the decision steps in a well defined binary domain where scores and decisions are easy to interpret and implement.
Abstract: The approach presented in this paper represents voice recordings by a novel acoustic key composed only of binary values. Except for the process being used to extract such keys, there is no need for acoustic modeling and processing in the approach proposed, as all the other elements in the system are based on the binary vectors. We show that this binary key is able to effectively model a speaker’s voice and to distinguish it from other speakers. Its main properties are its small size compared to current speaker modeling techniques and its low computational cost when comparing different speakers as it is limited to obtaining a similarity metric between two binary vectors. Furthermore, the binary key vector extraction process does not need any threshold and offers the opportunity to set the decision steps in a well defined binary domain where scores and decisions are easy to interpret and implement. Index Terms: binary key, speaker modeling, biometrics
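Comparing two binary keys then only needs a set-overlap style similarity over bit vectors; a sketch (the paper's exact metric may differ):

```python
# Hedged sketch: similarity between two binary speaker keys in [0, 1].
import numpy as np

def binary_key_similarity(a, b):
    """a, b: boolean arrays of equal length (the binary keys)."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    return np.count_nonzero(a & b) / max(np.count_nonzero(a | b), 1)
```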

Proceedings ArticleDOI
26 Sep 2010
TL;DR: A GMM/SVM-supervector system (Gaussian Mixture Model combined with Support Vector Machine) for speaker age and gender recognition, a technique that is adopted from state-of-the-art speaker recognition research is presented.
Abstract: Car manufacturers are faced with a new challenge. While a new generation of "digital natives" becomes a new customer group, the problem of an aging society is still growing. This emphasizes the need to provide flexible in-car dialogs that take into account the specific needs and preferences of the respective user (group). Along the lines of this year's Interspeech motto "Spoken Language Processing for All", we address the question of how to find out which group the current user belongs to. We present a GMM/SVM-supervector system (Gaussian Mixture Model combined with Support Vector Machine) for speaker age and gender recognition, a technique adopted from state-of-the-art speaker recognition research. We furthermore describe an experimental study with the aim of evaluating the performance of the system as well as exploring the selection of parameters.
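The GMM/SVM-supervector pipeline can be sketched in its standard form: relevance-MAP adaptation of the UBM means per utterance, stacked as an SVM feature vector. Assumptions: scikit-learn, a relevance factor of 16, and placeholder data:

```python
# Hedged sketch of GMM mean supervectors as SVM features.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

def supervector(ubm, feats, r=16.0):
    """feats: (n_frames, dim). Returns stacked MAP-adapted UBM means."""
    post = ubm.predict_proba(feats)               # (n_frames, n_mix)
    n = post.sum(axis=0)                          # soft counts per mixture
    fx = post.T @ feats                           # first-order statistics
    alpha = (n / (n + r))[:, None]                # relevance-MAP weight
    means = (alpha * (fx / np.maximum(n, 1e-8)[:, None])
             + (1 - alpha) * ubm.means_)
    return means.ravel()

# usage (illustrative):
#   ubm = GaussianMixture(n_components=64).fit(background_frames)
#   X = np.stack([supervector(ubm, f) for f in utterance_frame_lists])
#   SVC(kernel="linear").fit(X, age_gender_labels)
```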

Proceedings ArticleDOI
14 Mar 2010
TL;DR: The proposed approach is compared with a conventional state-of-the-art system on the RT06 evaluation data for meeting recordings diarization and shows an improvement of 8.4% relative in terms of speaker error.
Abstract: This paper investigates the use of the Variational Bayesian (VB) framework for speaker diarization of meetings data extending previous related works on Broadcast News audio. VB learning aims at maximizing a bound, known as Free Energy, on the model marginal likelihood and allows joint model learning and model selection according to the same objective function. While the BIC is valid only in the asymptotic limit, the Free Energy is always a valid bound. The paper proposes the use of Free Energy as objective function in speaker diarization. It can be used to select dynamically without any supervision or tuning, elements that typically affect the diarization performance i.e. the inferred number of speakers, the size of the GMM and the initialization. The proposed approach is compared with a conventional state-of-the-art system on the RT06 evaluation data for meeting recordings diarization and shows an improvement of 8.4% relative in terms of speaker error.