
Showing papers on "Speaker diarisation published in 2010"


Journal ArticleDOI
TL;DR: A voice conversion approach using an ANN model to capture speaker-specific characteristics of a target speaker is proposed, and it is demonstrated that such a voice conversion approach can perform monolingual as well as cross-lingual voice conversion of an arbitrary source speaker.
Abstract: In this paper, we use artificial neural networks (ANNs) for voice conversion and exploit the mapping abilities of an ANN model to perform mapping of spectral features of a source speaker to those of a target speaker. A comparative study of voice conversion using an ANN model and the state-of-the-art Gaussian mixture model (GMM) is conducted. The results of voice conversion, evaluated using subjective and objective measures, confirm that an ANN-based VC system performs as well as a GMM-based VC system, and that the quality of the transformed speech is intelligible and possesses the characteristics of the target speaker. In this paper, we also address the issue of dependency of voice conversion techniques on parallel data between the source and the target speakers. While there have been efforts to use nonparallel data and speaker adaptation techniques, it is important to investigate techniques which capture speaker-specific characteristics of a target speaker and avoid any need for the source speaker's data, either for training or for adaptation. In this paper, we propose a voice conversion approach using an ANN model to capture speaker-specific characteristics of a target speaker and demonstrate that such a voice conversion approach can perform monolingual as well as cross-lingual voice conversion of an arbitrary source speaker.
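A minimal sketch of the frame-level mapping idea, not the authors' implementation: a small feed-forward network is trained to map time-aligned source spectral frames to target frames. The data, dimensions and network size below are illustrative placeholders.

```python
# Hedged sketch: frame-wise spectral mapping with a small ANN.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# Placeholder parallel data: time-aligned spectral frames (e.g. via DTW),
# 25-dim source frames mapped to 25-dim target frames.
X_source = rng.normal(size=(5000, 25))
Y_target = rng.normal(size=(5000, 25))

ann = MLPRegressor(hidden_layer_sizes=(50, 50), activation="tanh",
                   max_iter=200, random_state=0)
ann.fit(X_source, Y_target)              # learn the source -> target mapping

converted = ann.predict(X_source[:10])   # converted spectral frames
```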

269 citations


Proceedings Article
01 Jan 2010
TL;DR: The discriminative input and output transforms for speaker adaptation in the hybrid NN/HMM systems are compared and further investigated with both structural and data-driven constraints.
Abstract: Speaker variability is one of the major error sources for ASR systems. Speaker adaptation estimates speaker-specific models from the speaker-independent ones to minimize the mismatch between the training and testing conditions arising from speaker variability. One of the commonly adopted approaches is the transformation-based method. In this paper, the discriminative input and output transforms for speaker adaptation in hybrid NN/HMM systems are compared and further investigated with both structural and data-driven constraints. Experimental results show that the data-driven constrained discriminative transforms are much more robust for unsupervised adaptation.
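One way to picture the input-transform variant, sketched here under the assumption of a frozen speaker-independent network with a trainable affine layer prepended (PyTorch; all sizes are illustrative, not the paper's configuration):

```python
# Hedged sketch of a linear input transform for speaker adaptation:
# a trainable affine layer is estimated on adaptation data while the
# speaker-independent network stays fixed.
import torch
import torch.nn as nn

feat_dim, n_states = 39, 2000
si_net = nn.Sequential(nn.Linear(feat_dim, 512), nn.Sigmoid(),
                       nn.Linear(512, n_states))        # speaker-independent net
for p in si_net.parameters():
    p.requires_grad = False                             # keep SI weights frozen

input_transform = nn.Linear(feat_dim, feat_dim)         # speaker-dependent layer
with torch.no_grad():                                   # start at identity
    input_transform.weight.copy_(torch.eye(feat_dim))
    input_transform.bias.zero_()

opt = torch.optim.SGD(input_transform.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

def adapt_step(feats, targets):
    """One discriminative adaptation step on (batch, feat_dim) features."""
    opt.zero_grad()
    loss = loss_fn(si_net(input_transform(feats)), targets)
    loss.backward()
    opt.step()
    return float(loss)
```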

213 citations


01 Jan 2010
TL;DR: An open-source diarization toolkit, mostly dedicated to speaker diarization and developed by LIUM, is presented; it includes hierarchical agglomerative clustering methods using well-known measures such as BIC and CLR.
Abstract: This paper presents an open-source diarization toolkit which is mostly dedicated to speaker diarization and developed by LIUM. This toolkit includes hierarchical agglomerative clustering methods using well-known measures such as BIC and CLR. Two applications for which the toolkit has been used are presented: one is for broadcast news using the ESTER 2 data and the other is for telephone conversations using the MEDIA corpus.
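The BIC-based agglomerative step can be sketched roughly as follows; this is a generic illustration of ΔBIC merging, not the toolkit's actual code (`lam` is the usual tunable penalty weight):

```python
# Hedged sketch of BIC-based hierarchical agglomerative clustering.
import numpy as np

def delta_bic(x, y, lam=1.0):
    """Negative values favour merging segments x and y (rows = frames)."""
    z = np.vstack([x, y])
    n, d = z.shape
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    logdet = lambda m: np.log(np.linalg.det(np.cov(m.T)))
    return (0.5 * (n * logdet(z) - len(x) * logdet(x) - len(y) * logdet(y))
            - penalty)

def agglomerate(clusters, lam=1.0):
    """Greedily merge the closest pair until no merge lowers the BIC."""
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        scores = [delta_bic(clusters[i], clusters[j], lam) for i, j in pairs]
        best = int(np.argmin(scores))
        if scores[best] >= 0:            # no pair passes the BIC merge test
            break
        i, j = pairs[best]
        clusters[i] = np.vstack([clusters[i], clusters[j]])
        del clusters[j]
    return clusters
```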

190 citations


Journal ArticleDOI
TL;DR: Three diarization systems were developed and experiments were conducted using the summed-channel telephone data from the 2008 NIST speaker recognition evaluation, with the Variational Bayes system proving to be the most effective.
Abstract: We report on work on speaker diarization of telephone conversations which was begun at the Robust Speaker Recognition Workshop held at Johns Hopkins University in 2008. Three diarization systems were developed and experiments were conducted using the summed-channel telephone data from the 2008 NIST speaker recognition evaluation. The systems are a Baseline agglomerative clustering system, a Streaming system which uses speaker factors for speaker change point detection and traditional methods for speaker clustering, and a Variational Bayes system designed to exploit a large number of speaker factors as in state of the art speaker recognition systems. The Variational Bayes system proved to be the most effective, achieving a diarization error rate of 1.0% on the summed-channel data. This represents an 85% reduction in errors compared with the Baseline agglomerative clustering system. An interesting aspect of the Variational Bayes approach is that it implicitly performs speaker clustering in a way which avoids making premature hard decisions. This type of soft speaker clustering can be incorporated into other diarization systems (although causality has to be sacrificed in the case of the Streaming system). With this modification, the Baseline system achieved a diarization error rate of 3.5% (a 50% reduction in errors).
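The "soft clustering" idea can be illustrated in a few lines: segments carry posterior responsibilities over speakers instead of hard labels, and model updates are responsibility-weighted. This is a generic sketch, not the paper's Variational Bayes system:

```python
# Hedged sketch of soft speaker clustering with posterior responsibilities.
import numpy as np

def soft_assign(loglik):
    """loglik: (n_segments, n_speakers) log-likelihoods.
    Returns posterior responsibilities q(speaker | segment)."""
    loglik = loglik - loglik.max(axis=1, keepdims=True)   # numerical stability
    q = np.exp(loglik)
    return q / q.sum(axis=1, keepdims=True)

def update_speaker_means(feats, q):
    """Responsibility-weighted speaker model update (one soft re-estimation).
    feats: (n_segments, dim); q: (n_segments, n_speakers)."""
    weights = q.sum(axis=0)                               # soft segment counts
    return (q.T @ feats) / weights[:, None]
```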

144 citations


01 Jan 2010
TL;DR: This paper proposes a new architecture for text-independent speaker verification systems that are satisfactorily trained by virtue of a limited amount of application-specific data, supplemented with a sufficient amount of training data from some other context.
Abstract: It is widely believed that speaker verification systems perform better when there is sufficient background training data to deal with nuisance effects of transmission channels. It is also known that these systems perform at their best when the sound environment of the training data is similar to that of the context of use (test context). For some applications, however, training data from the same type of sound environment is scarce, whereas a considerable amount of data from a different type of environment is available. In this paper, we propose a new architecture for text-independent speaker verification systems that can be satisfactorily trained with a limited amount of application-specific data, supplemented with a sufficient amount of training data from some other context. This architecture is based on the extraction of parameters (i-vectors) from a low-dimensional space (total variability space) proposed by Dehak [1]. Our aim is to extend Dehak's work to speaker recognition on sparse data, namely microphone speech. The main challenge is to overcome the fact that insufficient application-specific data is available to accurately estimate the total variability covariance matrix. We propose a method based on Joint Factor Analysis (JFA) to estimate microphone eigenchannels (sparse data) with telephone eigenchannels (sufficient data). For classification, we experimented with two approaches: Support Vector Machines (SVM) and a Cosine Distance Scoring (CDS) classifier, based on cosine distances. We present recognition results for the female voices in the interview data of the NIST 2008 SRE. The best performance is obtained when our system is fused with the state-of-the-art JFA: we achieve a 13% relative improvement in equal error rate, and the minimum value of the detection cost function decreases from 0.0219 to 0.0164.
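Cosine distance scoring itself reduces to a one-line comparison of i-vectors; a hedged sketch with illustrative variable names:

```python
# Hedged sketch of cosine distance scoring (CDS) between two i-vectors.
import numpy as np

def cds_score(w_enrol, w_test):
    """Cosine similarity between enrollment and test i-vectors."""
    return float(w_enrol @ w_test /
                 (np.linalg.norm(w_enrol) * np.linalg.norm(w_test)))

# usage (illustrative): accept the trial if the score clears a threshold
# accept = cds_score(w_enrol, w_test) > theta
```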

110 citations


Dissertation
01 Dec 2010
TL;DR: A new methodology, based on proper scoring rules, is proposed, allowing for the evaluation of the goodness of pattern recognizers with probabilistic outputs, which are intended to be usefully applied over a wide range of applications, having variable priors and costs.
Abstract: We propose a new methodology, based on proper scoring rules, for the evaluation of the goodness of pattern recognizers with probabilistic outputs. The recognizers of interest take an input, known to belong to one of a discrete set of classes, and output a calibrated likelihood for each class. This is a generalization of the traditional use of proper scoring rules to evaluate the goodness of probability distributions. A recognizer with outputs in well-calibrated probability distribution form can be applied to make cost-effective Bayes decisions over a range of applications, having different cost functions. A recognizer with likelihood output can additionally be employed for a wide range of prior distributions for the to-be-recognized classes. We use automatic speaker recognition and automatic spoken language recognition as prototypes of this type of pattern recognizer. The traditional evaluation methods in these fields, as represented by the series of NIST Speaker and Language Recognition Evaluations, evaluate hard decisions made by the recognizers. This makes these recognizers cost-and-prior-dependent. The proposed methodology generalizes that of the NIST evaluations, allowing for the evaluation of recognizers which are intended to be usefully applied over a wide range of applications, having variable priors and costs. The proposal includes a family of evaluation criteria, where each member of the family is formed by a proper scoring rule. We emphasize two members of this family: (i) A non-strict scoring rule, directly representing error-rate at a given prior. (ii) The strict logarithmic scoring rule which represents information content, or which equivalently represents summarized error-rate, or expected cost, over a wide range of applications. We further show how to form a family of secondary evaluation criteria, which by contrasting with the primary criteria, form an analysis of the goodness of calibration of the recognizer’s likelihoods. Finally, we show how to use the logarithmic scoring rule as an objective function for the discriminative training of fusion and calibration of speaker and language recognizers.
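The logarithmic scoring rule emphasized in (ii) is commonly summarized as Cllr; a small sketch of its standard form, assuming scores are natural-log likelihood ratios:

```python
# Hedged sketch of the logarithmic proper scoring rule (Cllr), in bits.
import numpy as np

def cllr(target_llrs, nontarget_llrs):
    """target_llrs / nontarget_llrs: natural-log likelihood ratios."""
    tar = np.asarray(target_llrs, dtype=float)
    non = np.asarray(nontarget_llrs, dtype=float)
    c_tar = np.mean(np.logaddexp(0.0, -tar)) / np.log(2)   # log2(1 + e^-llr)
    c_non = np.mean(np.logaddexp(0.0, non)) / np.log(2)    # log2(1 + e^+llr)
    return 0.5 * (c_tar + c_non)
```

A handy sanity check: a system that outputs llr = 0 for every trial (perfectly calibrated but uninformative) scores exactly 1 bit.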

109 citations


Patent
06 Dec 2010
TL;DR: In this paper, a personalized text-to-speech synthesizer is used to synthesize personalized speech with the speech characteristics of a specific speaker, based on the personalized speech feature library associated with the specific speaker.
Abstract: A personalized text-to-speech synthesizing device includes: a personalized speech feature library creator, configured to recognize personalized speech features of a specific speaker by comparing a random speech fragment of the specific speaker with preset keywords, thereby to create a personalized speech feature library associated with the specific speaker, and store the personalized speech feature library in association with the specific speaker; and a text-to-speech synthesizer, configured to perform a speech synthesis of a text message from the specific speaker, based on the personalized speech feature library associated with the specific speaker and created by the personalized speech feature library creator, thereby to generate and output a speech fragment having pronunciation characteristics of the specific speaker. A personalized speech feature library of a specific speaker is established without a deliberate training process, and a text is synthesized into personalized speech with the speech characteristics of the speaker.

79 citations


Proceedings Article
01 May 2010
TL;DR: This paper presents the EPAC corpus, which is composed of 100 hours of conversational speech manually transcribed, together with the outputs of automatic tools applied to the entire French ESTER 1 audio corpus: about 1700 hours of audio recordings from radio shows.
Abstract: This paper presents the EPAC corpus, which is composed of 100 hours of conversational speech manually transcribed, together with the outputs of automatic tools (automatic segmentation, transcription, POS tagging, etc.) applied to the entire French ESTER 1 audio corpus: this concerns about 1700 hours of audio recordings from radio shows. This corpus was built during the EPAC project funded by the French Research Agency (ANR) from 2007 to 2010. It significantly increases the amount of manually transcribed French audio recordings that is easily available, and it is now included as part of the ESTER 1 corpus in the ELRA catalog at no additional cost. By providing a large set of automatic outputs of speech processing tools, the EPAC corpus should be useful to researchers who want to work on such data without having to develop and deal with such tools. These automatic annotations are varied: segmentation and speaker diarization, one-best hypotheses from the LIUM automatic speech recognition system with confidence measures, but also word lattices and confusion networks, named entities, part-of-speech tags, chunks, etc. The 100 hours of manually transcribed speech were split into three data sets in order to obtain an official training corpus, an official development corpus and an official test corpus. These data sets were used to develop and evaluate the automatic tools which have been used to process the 1700 hours of audio recordings. For example, on the EPAC test data set our ASR system yields a word error rate of 17.25%.

75 citations


01 Jan 2010
TL;DR: The Symmetric Normalization (S-norm) method is proposed, which takes advantage of the symmetry in cosine similarity scoring and achieves competitive performance to that of the ZT-norm while requiring fewer parameter calculations.
Abstract: This paper proposes a new approach to unsupervised speaker adaptation inspired by the recent success of the factor analysisbased Total Variability Approach to text-independent speaker verification [1]. This approach effectively represents speaker variability in terms of low-dimensional total factor vectors and, when paired alongside the simplicity of cosine similarity scoring, allows for easy manipulation and efficient computation [2]. The development of our adaptation algorithm is motivated by the desire to have a robust method of setting an adaptation threshold, to minimize the amount of required computation for each adaptation update, and to simplify the associated score normalization procedures where possible. To address the final issue, we propose the Symmetric Normalization (S-norm) method, which takes advantage of the symmetry in cosine similarity scoring and achieves competitive performance to that of the ZT-norm while requiring fewer parameter calculations. In subsequent experiments, we also assess an attempt to replace the use of score normalization procedures altogether with a Normalized Cosine Similarity scoring function [3]. We evaluated the performance of our unsupervised speaker adaptation algorithm under various score normalization procedures on the 10sec-10sec and core conditions of the 2008 NIST SRE dataset. Using results without adaptation as our baseline, it was found that the proposed methods are consistent in successfully improving speaker verification performance to achieve state-of-the-art results.
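The proposed S-norm can be written directly from its description: the trial score is normalized symmetrically against cohort scores from both the enrollment and test sides. A sketch, assuming the cohort scores are precomputed:

```python
# Hedged sketch of Symmetric Normalization (S-norm) of a trial score.
import numpy as np

def s_norm(score, enrol_cohort_scores, test_cohort_scores):
    """Average of the two Z-norm-style normalisations, one per trial side."""
    e = np.asarray(enrol_cohort_scores, dtype=float)   # enrol vs. cohort
    t = np.asarray(test_cohort_scores, dtype=float)    # test vs. cohort
    return 0.5 * ((score - e.mean()) / e.std()
                  + (score - t.mean()) / t.std())
```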

73 citations


Proceedings ArticleDOI
14 Mar 2010
TL;DR: The effectiveness of phase information for speaker identification in noisy environments is described, using MFCC integrated with phase information.
Abstract: In conventional speaker recognition methods based on MFCC, the phase information has been ignored. Recently, we proposed a method that integrates MFCC with phase information in a speaker recognition method. Using the phase information, the speaker identification error rate was reduced by 78% for clean speech. In this paper, we describe the effectiveness of phase information for speaker identification in noisy environments. Integrating MFCC with phase information, the speaker identification error rates were reduced by 20%∼70% in comparison with using only MFCC in noisy environments.

68 citations


Journal Article
TL;DR: Speaker diarization is the task of determining "who spoke when" in an audio or video recording that contains an unknown amount of speech and also an unknown number of speakers as discussed by the authors.
Abstract: Speaker diarization is the task of determining "who spoke when?" in an audio or video recording that contains an unknown amount of speech and also an unknown number of speakers. Initially, it was proposed as a research topic related to automatic speech recognition, where speaker diarization serves as an upstream processing step. Over recent years, however, speaker diarization has become an important key technology for many tasks, such as navigation, retrieval, or higher-level inference on audio data. Accordingly, many important improvements in accuracy and robustness have been reported in journals and conferences in the area. The application domains, from broadcast news, to lectures and meetings, vary greatly and pose different problems, such as having access to multiple microphones and multimodal information or overlapping speech. The most recent review of existing technology dates back to 2006 and focuses on the broadcast news domain. In this paper we review the current state-of-the-art, focusing on research developed since 2006 that relates predominantly to speaker diarization for conference meetings. Finally, we present an analysis of speaker diarization performance as reported through the NIST Rich Transcription evaluations on meeting data and identify important areas for future research.

Journal ArticleDOI
TL;DR: The paper presents a novel fragment-based speaker identification approach that allows the target speaker to be reliably identified across a wide range of SNRs; this component is combined with the recognition system to produce significant improvements.

Patent
John Tardif1
17 Jun 2010
TL;DR: In this article, a system and method are disclosed for facilitating speech recognition through the processing of visual speech cues, which include the position of the lips, tongue and/or teeth during speech.
Abstract: A system and method are disclosed for facilitating speech recognition through the processing of visual speech cues. These speech cues may include the position of the lips, tongue and/or teeth during speech. In one embodiment, upon capture of a frame of data by an image capture device, the system identifies a speaker and a location of the speaker. The system then focuses in on the speaker to get a clear image of the speaker's mouth. The system includes a visual speech cues engine which operates to recognize and distinguish sounds based on the captured position of the speaker's lips, tongue and/or teeth. The visual speech cues data may be synchronized with the audio data to ensure the visual speech cues engine is processing image data which corresponds to the correct audio data.

Journal ArticleDOI
TL;DR: Three divide-and-conquer approaches for Bayesian information criterion (BIC)-based speaker segmentation are proposed; the results show that the proposed approaches are more efficient and achieve higher segmentation accuracy than the compared distance-based approaches.
Abstract: In this paper, we propose three divide-and-conquer approaches for Bayesian information criterion (BIC)-based speaker segmentation. The approaches detect speaker changes by recursively partitioning a large analysis window into two sub-windows and recursively verifying the merging of two adjacent audio segments using DeltaBIC, a widely-adopted distance measure of two audio segments. We compare our approaches to three popular distance-based approaches, namely, Chen and Gopalakrishnan's window-growing-based approach, Siegler et al.'s fixed-size sliding window approach, and Delacourt and Wellekens's DISTBIC approach, by performing computational cost analysis and conducting speaker change detection experiments on two broadcast news data sets. The results show that the proposed approaches are more efficient and achieve higher segmentation accuracy than the compared distance-based approaches. In addition, we apply the segmentation approaches discussed in this paper to the speaker diarization task. The experimental results show that a more effective segmentation approach leads to better diarization accuracy.
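The core DeltaBIC test inside one analysis window can be sketched as follows; sign convention: a positive value at a candidate boundary suggests a speaker change. This is a generic illustration, not the paper's recursive partitioning code:

```python
# Hedged sketch of DeltaBIC-based change detection in an analysis window.
import numpy as np

def delta_bic(z, split, lam=1.0):
    """DeltaBIC for splitting frames z at index `split`; > 0 suggests a change."""
    x, y = z[:split], z[split:]
    n, d = z.shape
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    logdet = lambda m: np.log(np.linalg.det(np.cov(m.T)))
    return (0.5 * (n * logdet(z) - len(x) * logdet(x) - len(y) * logdet(y))
            - penalty)

def best_change_point(z, margin=50):
    """Scan candidate boundaries; return the best one if DeltaBIC > 0."""
    cands = list(range(margin, len(z) - margin))
    scores = [delta_bic(z, s) for s in cands]
    i = int(np.argmax(scores))
    return (cands[i], scores[i]) if scores[i] > 0 else (None, scores[i])
```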

Proceedings ArticleDOI
01 Nov 2010
TL;DR: Experiments demonstrate that the difference in Mel-cepstral coefficients (MCEPs) between natural and synthetic speech can be detected using the higher-order MCEP coefficients.
Abstract: With the development of HMM-based parametric speech synthesis algorithms, it is easy for impostors to generate synthetic speech with a specific speaker's characteristics, which is a serious threat to state-of-the-art speaker verification systems. In this paper, we investigate the difference in Mel-cepstral coefficients (MCEPs) between natural and synthetic speech. Experiments demonstrate that we can discriminate synthetic speech from natural speech using the higher-order MCEP coefficients.

01 Jan 2010
TL;DR: A framework for measuring the overall performance of an automatic speaker recognition system using a set of trials from a heterogeneous evaluation such as NIST SRE-2008, which combines several acoustic conditions in one evaluation.
Abstract: In this paper we propose a framework for measuring the overall performance of an automatic speaker recognition system using a set of trials from a heterogeneous evaluation such as NIST SRE-2008, which combines several acoustic conditions in one evaluation. We do this by weighting trials of different conditions according to their relative proportion, and we derive expressions for the basic speaker recognition performance measures Cdet and Cllr, as well as the DET curve, from which the EER and minimum Cdet can be computed. Examples of pooling of conditions are shown on SRE-2008 data, including speaker sex, microphone type and speaking style.
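The trial-weighting idea can be sketched directly: each trial contributes to the miss and false-alarm rates in proportion to its condition's desired share. Illustrative only; the paper's exact weighting scheme may differ:

```python
# Hedged sketch: condition-weighted miss/false-alarm rates and actual Cdet.
import numpy as np

def weighted_miss_fa(scores, labels, conditions, weights, threshold):
    """scores, labels (1 = target), conditions: parallel NumPy arrays;
    weights maps a condition name to its desired share of trial mass."""
    w = np.array([weights[c] for c in conditions], dtype=float)
    tar, non = (labels == 1), (labels == 0)
    p_miss = np.sum(w[tar & (scores < threshold)]) / np.sum(w[tar])
    p_fa = np.sum(w[non & (scores >= threshold)]) / np.sum(w[non])
    return p_miss, p_fa

def cdet(p_miss, p_fa, p_tar=0.01, c_miss=10.0, c_fa=1.0):
    """Standard detection cost with illustrative default parameters."""
    return c_miss * p_tar * p_miss + c_fa * (1 - p_tar) * p_fa
```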

Proceedings ArticleDOI
14 Mar 2010
TL;DR: Enhancements to a state-of-the-art top-down approach to speaker diarization are presented; top-down systems deliver competitive results compared to bottom-up systems and are computationally efficient, but are particularly prone to poor model initialisation and cluster impurities.
Abstract: There are two approaches to speaker diarization: bottom-up and top-down. Our work on top-down systems shows that they can deliver competitive results compared to bottom-up systems and that they are extremely computationally efficient, but also that they are particularly prone to poor model initialisation and cluster impurities. In this paper we present enhancements to our state-of-the-art, top-down approach to speaker diarization that deliver improved stability across three different datasets composed of conference meetings from five standard NIST RT evaluations. We report an improved approach to speaker modelling which, despite having greater chances for cluster impurities, delivers a 35% relative improvement in DER for the MDM condition. We also describe new work to incorporate cluster purification into a top-down system which delivers relative improvements of 44% over the baseline system without compromising computational efficiency.

Book ChapterDOI
13 Dec 2010
TL;DR: This literature survey paper discusses the general architecture of SRS, biometric standards relevant to voice/speech, typical applications of SRS, and current research in Speaker Recognition Systems.
Abstract: Human listeners are capable of identifying a speaker, over the telephone or from an entryway out of sight, by listening to the voice of the speaker. Achieving this intrinsic human capability is a major challenge for voice biometrics. Like human listeners, voice biometrics uses the features of a person's voice to ascertain the speaker's identity. The best-known commercialized form of voice biometrics is the Speaker Recognition System (SRS). Speaker recognition is the computing task of validating a user's claimed identity using characteristics extracted from their voice. This literature survey paper gives a brief introduction to SRS, and then discusses the general architecture of SRS, biometric standards relevant to voice/speech, typical applications of SRS, and current research in Speaker Recognition Systems. We also survey various approaches to SRS.

Proceedings ArticleDOI
14 Mar 2010
TL;DR: Experimental results show the performance of unsupervised cross-lingual speaker adaptation is comparable to that of the supervised case in terms of spectrum adaptation in the EMIME scenario, even though automatically obtained transcriptions have a very high phoneme error rate.
Abstract: The EMIME project aims to build a personalized speech-to-speech translator, such that spoken input of a user in one language is used to produce spoken output that still sounds like the user's voice however in another language. This distinctiveness makes unsupervised cross-lingual speaker adaptation one key to the project's success. So far, research has been conducted into unsupervised and cross-lingual cases separately by means of decision tree marginalization and HMM state mapping respectively. In this paper we combine the two techniques to perform unsupervised cross-lingual speaker adaptation. The performance of eight speaker adaptation systems (supervised vs. unsupervised, intra-lingual vs. cross-lingual) is compared using objective and subjective evaluations. Experimental results show the performance of unsupervised cross-lingual speaker adaptation is comparable to that of the supervised case in terms of spectrum adaptation in the EMIME scenario, even though automatically obtained transcriptions have a very high phoneme error rate.

Journal ArticleDOI
TL;DR: The results obtained in the current experiments show that the identification error rate increases when testing with imitated voices, as well as when using converted voices, especially the crossgender ones.
Abstract: Voices can be deliberately disguised by means of human imitation or voice conversion. The question arises as to what extent they can be modified by using either of the two methods. In the current paper, a set of speaker identification experiments is conducted: first, analysing some prosodic features extracted from voices of professional impersonators attempting to mimic a target voice and, second, using both intragender and crossgender converted voices in a spectral-based speaker recognition system. The results obtained in the current experiments show that the identification error rate increases when testing with imitated voices, as well as when using converted voices, especially the crossgender ones.

Patent
04 Feb 2010
TL;DR: In this article, a method for recognizing a speaker of an utterance in a speech recognition system is disclosed and a likelihood score for each of a plurality of speaker models for different speakers is determined.
Abstract: A method for recognizing the speaker of an utterance in a speech recognition system is disclosed. A likelihood score is determined for each of a plurality of speaker models for different speakers; the likelihood score indicates how well the speaker model corresponds to the utterance. For each of the plurality of speaker models, a probability that the utterance originates from that speaker is determined. The probability is determined based on the likelihood score for the speaker model and requires the estimation of the distribution of likelihood scores expected, based at least in part, on the training state of the speaker model.
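The likelihood-to-probability step described here is essentially Bayes' rule over the enrolled speaker models; a minimal sketch (the patent's modelling of expected score distributions is not reproduced):

```python
# Hedged sketch: posterior probabilities over speakers from likelihood scores.
import numpy as np

def speaker_posteriors(log_likelihoods, priors=None):
    """log_likelihoods: one log-likelihood per enrolled speaker model;
    priors: optional prior probability per speaker (defaults to uniform)."""
    logp = np.asarray(log_likelihoods, dtype=float)
    if priors is not None:
        logp = logp + np.log(np.asarray(priors, dtype=float))
    logp -= logp.max()                 # numerical stabilisation
    p = np.exp(logp)
    return p / p.sum()
```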

Journal ArticleDOI
TL;DR: A system for joint temporal segmentation, speaker localization, and identification is presented, supported by face identification from video data obtained from a steerable camera, which describes the vision of terminal-less, session-less and multi-modal telecommunication with remote partners.
Abstract: For an environment to be perceived as being smart, contextual information has to be gathered to adapt the system's behavior and its interface towards the user. Being a rich source of context information, speech can be acquired unobtrusively by microphone arrays and then processed to extract information about the user and his environment. In this paper, a system for joint temporal segmentation, speaker localization, and identification is presented, which is supported by face identification from video data obtained from a steerable camera. Special attention is paid to latency aspects and online processing capabilities, as they are important for the application under investigation, namely ambient communication: the vision of terminal-less, session-less and multi-modal telecommunication with remote partners, where the user can move freely within his home while the communication follows him. The speaker diarization serves as a context source, which has been integrated in a service-oriented middleware architecture and provided to the application to select the most appropriate I/O device and to steer the camera towards the speaker during ambient communication.

Proceedings ArticleDOI
14 Mar 2010
TL;DR: An improved speaker diarization system for the Single Distant Microphone (SDM) task in the 2007 and 2009 NIST Rich Transcription Meeting Recognition Evaluations is described.
Abstract: This paper describes an improved speaker diarization system for the Single Distant Microphone (SDM) task in the 2007 and 2009 NIST Rich Transcription Meeting Recognition Evaluations. The system includes three main modules: front-end processing, initial speaker clustering and cluster purification/merging. The front-end processing involves the Wiener filtering for the targeted audio channels and a self-adaptation speech activity detection algorithm. A simple but effective energy based segmentation is applied to chunk the meeting data into small segments to construct the initial clusters. An enhanced purification algorithm is proposed to further improve the performance after the preliminary purification, and the BIC criterion is adopted for the cluster merging. The system achieves competitive overall DERs of 15.67% for RT07 SDM speaker diarization task and 17.34% for RT09 SDM speaker diarization task.
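The "simple but effective energy based segmentation" can be caricatured as short-time energy thresholding; a sketch with illustrative frame sizes and thresholds, not the system's actual parameters:

```python
# Hedged sketch: energy-based chunking into candidate segments.
import numpy as np

def energy_splits(y, sr, frame=400, hop=160, thresh_db=-40.0, min_gap=25):
    """Return candidate split times (s) where short-time energy stays at
    least `thresh_db` below the peak for `min_gap` consecutive frames."""
    frames = np.lib.stride_tricks.sliding_window_view(y, frame)[::hop]
    e_db = 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    active = e_db > (e_db.max() + thresh_db)   # threshold relative to peak
    bounds, run = [], 0
    for i, a in enumerate(active):
        run = 0 if a else run + 1
        if run == min_gap:                      # one split per low-energy gap
            bounds.append((i - min_gap + 1) * hop / sr)
    return bounds
```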

Journal ArticleDOI
TL;DR: The results show that using the compensated front-end, SAT models and multiple regression classes bring major performance improvements.
Abstract: In recent years the speaker recognition field has made extensive use of speaker adaptation techniques. Adaptation allows speaker model parameters to be estimated using less speech data than needed for maximum-likelihood (ML) training. The maximum a posteriori (MAP) and maximum-likelihood linear regression (MLLR) techniques have typically been used for adaptation. Recently, MAP and MLLR adaptation have been incorporated in the feature extraction stage of support vector machine (SVM)-based speaker recognition systems. Two approaches to feature extraction use an SVM to classify either the MAP-adapted Gaussian mean vector parameters (GSV-SVM) or the MLLR transform coefficients (MLLR-SVM). In this paper, we provide an experimental analysis of the GSV-SVM and MLLR-SVM approaches. We largely focus on the latter by exploring constrained and unconstrained transforms and different choices of the acoustic model. A channel-compensated front-end is used to prevent the MLLR transforms from adapting to channel components in the speech data. Additional acoustic models were trained using speaker adaptive training (SAT) to better estimate the speaker MLLR transforms. We provide results on the NIST 2005 and 2006 Speaker Recognition Evaluation (SRE) data and fusion results on the SRE 2006 data. The results show that using the compensated front-end, SAT models and multiple regression classes brings major performance improvements.

Journal ArticleDOI
TL;DR: The proposed techniques for speaker identification and verification from encrypted VoIP conversations can correctly identify the actual speaker 70-75% of the time among a group of 10 potential suspects, and achieve a more than 10-fold improvement over random guessing in identifying a perpetrator.

04 Oct 2010
TL;DR: The speech recognition system was improved to save thermal images at three timing positions: just before speaking, and when speaking the phonemes of the first and last vowels; with this method, intentional facial expressions were discriminable with good recognition accuracy.
Abstract: I investigated a method for facial expression recognition for a human speaker by using thermal image processing and a speech recognition system. In this study, we improved our speech recognition system to save thermal images at three timing positions: just before speaking, and when speaking the phonemes of the first and last vowels. With this method, the intentional facial expressions "angry", "happy", "neutral", "sad", and "surprised" were discriminable with good recognition accuracy.

Proceedings Article
01 Jan 2010
TL;DR: An open-set online speaker diarization system based on Gaussian mixture models, which are used as speaker models; the system implicitly performs audio segmentation, speech/non-speech classification, gender recognition and speaker identification.
Abstract: In this paper, we present an open-set online speaker diarization system. The system is based on Gaussian mixture models (GMMs), which are used as speaker models. The system starts with just 3 such models (one for each gender and one for non-speech) and creates models for individual speakers only as those speakers occur. As more and more speakers appear, more models are created. Our system implicitly performs audio segmentation, speech/non-speech classification, gender recognition and speaker identification. The system is tested on the HUB4-1996 radio broadcast news database.
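The open-set logic reduces to: score an incoming segment against all current models and spawn a new speaker when no model scores well enough. The sketch below replaces GMMs with per-speaker mean vectors purely for brevity; threshold and adaptation rate are illustrative:

```python
# Hedged sketch of open-set online speaker assignment.
import numpy as np

def process_segment(feat, models, new_speaker_threshold=-10.0):
    """feat: averaged feature vector of a segment; models: dict name -> mean."""
    if models:
        scores = {name: -np.sum((feat - mu) ** 2)        # crude score proxy
                  for name, mu in models.items()}
        best = max(scores, key=scores.get)
        if scores[best] >= new_speaker_threshold:
            models[best] = 0.9 * models[best] + 0.1 * feat   # adapt the model
            return best
    name = f"spk{len(models)}"                           # unseen speaker
    models[name] = feat.copy()
    return name
```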

Proceedings Article
01 Jan 2010
TL;DR: It is shown that the binary key vector extraction process does not need any threshold and offers the opportunity to set the decision steps in a well defined binary domain where scores and decisions are easy to interpret and implement.
Abstract: The approach presented in this paper represents voice recordings by a novel acoustic key composed only of binary values. Except for the process being used to extract such keys, there is no need for acoustic modeling and processing in the approach proposed, as all the other elements in the system are based on the binary vectors. We show that this binary key is able to effectively model a speaker’s voice and to distinguish it from other speakers. Its main properties are its small size compared to current speaker modeling techniques and its low computational cost when comparing different speakers as it is limited to obtaining a similarity metric between two binary vectors. Furthermore, the binary key vector extraction process does not need any threshold and offers the opportunity to set the decision steps in a well defined binary domain where scores and decisions are easy to interpret and implement. Index Terms: binary key, speaker modeling, biometrics
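Comparing two binary keys then only needs a set-overlap style similarity over bit vectors; a sketch (the paper's exact metric may differ):

```python
# Hedged sketch: similarity between two binary speaker keys in [0, 1].
import numpy as np

def binary_key_similarity(a, b):
    """a, b: boolean arrays of equal length (the binary keys)."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    return np.count_nonzero(a & b) / max(np.count_nonzero(a | b), 1)
```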

Proceedings ArticleDOI
26 Sep 2010
TL;DR: A GMM/SVM-supervector system (Gaussian Mixture Model combined with Support Vector Machine) for speaker age and gender recognition, a technique that is adopted from state-of-the-art speaker recognition research is presented.
Abstract: Car manufacturers are faced with a new challenge. While a new generation of "digital natives" becomes a new customer group, the problem of an aging society is still growing. This emphasizes the need to provide flexible in-car dialogs that take into account the specific needs and preferences of the respective user (group). Along the lines of this year's Interspeech motto "Spoken Language Processing for All", we address the question of how to find out which group the current user belongs to. We present a GMM/SVM-supervector system (Gaussian Mixture Model combined with Support Vector Machine) for speaker age and gender recognition, a technique adopted from state-of-the-art speaker recognition research. We furthermore describe an experimental study with the aim of evaluating the performance of the system as well as exploring the selection of parameters.
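The GMM/SVM-supervector pipeline can be sketched in its standard form: relevance-MAP adaptation of the UBM means per utterance, stacked as an SVM feature vector. Assumptions: scikit-learn, a relevance factor of 16, and placeholder data:

```python
# Hedged sketch of GMM mean supervectors as SVM features.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

def supervector(ubm, feats, r=16.0):
    """feats: (n_frames, dim). Returns stacked MAP-adapted UBM means."""
    post = ubm.predict_proba(feats)               # (n_frames, n_mix)
    n = post.sum(axis=0)                          # soft counts per mixture
    fx = post.T @ feats                           # first-order statistics
    alpha = (n / (n + r))[:, None]                # relevance-MAP weight
    means = (alpha * (fx / np.maximum(n, 1e-8)[:, None])
             + (1 - alpha) * ubm.means_)
    return means.ravel()

# usage (illustrative):
#   ubm = GaussianMixture(n_components=64).fit(background_frames)
#   X = np.stack([supervector(ubm, f) for f in utterance_frame_lists])
#   SVC(kernel="linear").fit(X, age_gender_labels)
```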

Proceedings ArticleDOI
14 Mar 2010
TL;DR: The proposed approach is compared with a conventional state-of-the-art system on the RT06 evaluation data for meeting recordings diarization and shows an improvement of 8.4% relative in terms of speaker error.
Abstract: This paper investigates the use of the Variational Bayesian (VB) framework for speaker diarization of meetings data extending previous related works on Broadcast News audio. VB learning aims at maximizing a bound, known as Free Energy, on the model marginal likelihood and allows joint model learning and model selection according to the same objective function. While the BIC is valid only in the asymptotic limit, the Free Energy is always a valid bound. The paper proposes the use of Free Energy as objective function in speaker diarization. It can be used to select dynamically without any supervision or tuning, elements that typically affect the diarization performance i.e. the inferred number of speakers, the size of the GMM and the initialization. The proposed approach is compared with a conventional state-of-the-art system on the RT06 evaluation data for meeting recordings diarization and shows an improvement of 8.4% relative in terms of speaker error.