
Showing papers on "Speaker diarisation" published in 2014


Proceedings ArticleDOI
04 May 2014
TL;DR: Experimental results show that the DNN-based speaker verification system achieves good performance compared to a popular i-vector system on a small footprint text-dependent speaker verification task, is more robust to additive noise, and outperforms the i-vector system at low False Rejection operating points.
Abstract: In this paper we investigate the use of deep neural networks (DNNs) for a small footprint text-dependent speaker verification task. At development stage, a DNN is trained to classify speakers at the frame-level. During speaker enrollment, the trained DNN is used to extract speaker specific features from the last hidden layer. The average of these speaker features, or d-vector, is taken as the speaker model. At evaluation stage, a d-vector is extracted for each utterance and compared to the enrolled speaker model to make a verification decision. Experimental results show the DNN based speaker verification system achieves good performance compared to a popular i-vector system on a small footprint text-dependent speaker verification task. In addition, the DNN based system is more robust to additive noise and outperforms the i-vector system at low False Rejection operating points. Finally the combined system outperforms the i-vector system by 14% and 25% relative in equal error rate (EER) for clean and noisy conditions respectively.
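As a rough illustration of the d-vector pipeline described above, the following sketch assumes a frame-level speaker-classification DNN is already trained and exposed through a hypothetical last_hidden_activations function; the cosine scoring and threshold are illustrative choices rather than the paper's exact configuration.

```python
import numpy as np

def extract_dvector(frames, last_hidden_activations):
    """d-vector: average of the last-hidden-layer activations over all frames."""
    activations = last_hidden_activations(frames)   # (num_frames, hidden_dim)
    return activations.mean(axis=0)

def enroll_speaker(utterances, last_hidden_activations):
    """Speaker model: average of the per-utterance d-vectors."""
    dvecs = [extract_dvector(u, last_hidden_activations) for u in utterances]
    return np.mean(dvecs, axis=0)

def verify(test_frames, speaker_model, last_hidden_activations, threshold=0.5):
    """Accept the claim if cosine similarity to the enrolled model exceeds a threshold."""
    d = extract_dvector(test_frames, last_hidden_activations)
    score = float(np.dot(d, speaker_model) /
                  (np.linalg.norm(d) * np.linalg.norm(speaker_model)))
    return score >= threshold, score
```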

1,000 citations


Proceedings ArticleDOI
04 May 2014
TL;DR: A novel framework for speaker recognition in which extraction of sufficient statistics for the state-of-the-art i-vector model is driven by a deep neural network (DNN) trained for automatic speech recognition (ASR) to produce frame alignments.
Abstract: We propose a novel framework for speaker recognition in which extraction of sufficient statistics for the state-of-the-art i-vector model is driven by a deep neural network (DNN) trained for automatic speech recognition (ASR). Specifically, the DNN replaces the standard Gaussian mixture model (GMM) to produce frame alignments. The use of an ASR-DNN system in the speaker recognition pipeline is attractive as it integrates the information from speech content directly into the statistics, allowing the standard backends to remain unchanged. Improvements from the proposed framework compared to a state-of-the-art system are 30% relative at the equal error rate when evaluated on the telephone conditions from the 2012 NIST speaker recognition evaluation (SRE). The proposed framework is a successful way to efficiently leverage transcribed data for speaker recognition, thus opening up a wide spectrum of research directions.
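A minimal sketch of the core substitution described above: senone posteriors from an ASR DNN take the place of GMM component occupancies when accumulating the zeroth- and first-order statistics used by an i-vector extractor. The dnn_senone_posteriors function is a hypothetical stand-in for the trained network.

```python
import numpy as np

def collect_sufficient_stats(features, dnn_senone_posteriors):
    """Accumulate zeroth- and first-order statistics, with DNN senone posteriors
    playing the role of GMM component occupancies."""
    # features: (T, D) acoustic frames; posteriors: (T, C) senone posteriors per frame
    gamma = dnn_senone_posteriors(features)        # (T, C)
    N = gamma.sum(axis=0)                          # zeroth-order stats, shape (C,)
    F = gamma.T @ features                         # first-order stats, shape (C, D)
    return N, F
```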

631 citations


Journal ArticleDOI
TL;DR: The HiLAM system, based on a three-layer acoustic architecture, and an i-vector/PLDA system are evaluated; HiLAM outperforms the state-of-the-art i-vector system in most scenarios, and the work provides a reference evaluation scheme and a reference performance on the RSR2015 database to the research community.

274 citations


Proceedings ArticleDOI
04 May 2014
TL;DR: Modifications of the basic algorithm are developed which result in significant reductions in word error rates (WERs), and the algorithms are shown to combine well with speaker adaptation by backpropagation, resulting in a 9% relative WER reduction.
Abstract: We propose providing additional utterance-level features as inputs to a deep neural network (DNN) to facilitate speaker, channel and background normalization. Modifications of the basic algorithm are developed which result in significant reductions in word error rates (WERs). The algorithms are shown to combine well with speaker adaptation by backpropagation, resulting in a 9% relative WER reduction. We address implementation of the algorithm for a streaming task.

227 citations


Proceedings ArticleDOI
01 Dec 2014
TL;DR: A system that incorporates probabilistic linear discriminant analysis (PLDA) for i-vector scoring and uses unsupervised calibration of the PLDA scores to determine the clustering stopping criterion is proposed, and it is shown that PLDA scoring outperforms the same system with cosine scoring, and that overlapping segments reduce diarization error rate (DER) as well.
Abstract: Speaker diarization via unsupervised i-vector clustering has gained popularity in recent years. In this approach, i-vectors are extracted from short clips of speech segmented from a larger multi-speaker conversation and organized into speaker clusters, typically according to their cosine score. In this paper, we propose a system that incorporates probabilistic linear discriminant analysis (PLDA) for i-vector scoring, a method already frequently utilized in speaker recognition tasks, and uses unsupervised calibration of the PLDA scores to determine the clustering stopping criterion. We also demonstrate that denser sampling in the i-vector space with overlapping temporal segments provides a gain in the diarization task. We test our system on the CALLHOME conversational telephone speech corpus, which includes multiple languages and a varying number of speakers, and we show that PLDA scoring outperforms the same system with cosine scoring, and that overlapping segments reduce diarization error rate (DER) as well.
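The clustering step described above can be sketched as agglomerative clustering driven by a matrix of pairwise PLDA scores, with a calibrated threshold acting as the stopping criterion. In this illustrative sketch, plda_score and stop_threshold stand in for the trained PLDA model and the unsupervised calibration output; the exact linkage and implementation details in the paper may differ.

```python
import numpy as np

def ahc_cluster(ivectors, plda_score, stop_threshold):
    """Agglomerative clustering of segment i-vectors; merging stops when the best
    pairwise (average-linkage) PLDA score falls below a calibrated threshold."""
    clusters = [[i] for i in range(len(ivectors))]
    # Pairwise PLDA log-likelihood-ratio scores between all segments.
    S = np.array([[plda_score(a, b) for b in ivectors] for a in ivectors])
    while len(clusters) > 1:
        best, best_pair = -np.inf, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average linkage over all cross-cluster segment pairs.
                s = np.mean([S[a, b] for a in clusters[i] for b in clusters[j]])
                if s > best:
                    best, best_pair = s, (i, j)
        if best < stop_threshold:       # calibrated threshold = stopping criterion
            break
        i, j = best_pair
        clusters[i] += clusters.pop(j)
    return clusters                     # each cluster = one hypothesized speaker
```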

226 citations


Patent
28 Apr 2014
TL;DR: During an enrollment stage, a client's voice is recorded and characteristics of the recording are used to create and store a voice print; when an enrolled client seeks access to secure information over a network, a sample voice recording is compared to at least one voice print.
Abstract: Systems and methods providing for secure voice print authentication over a network are disclosed herein. During an enrollment stage, a client's voice is recorded and characteristics of the recording are used to create and store a voice print. When an enrolled client seeks access to secure information over a network, a sample voice recording is created. The sample voice recording is compared to at least one voice print. If a match is found, the client is authenticated and granted access to secure information. Systems and methods providing for a dual use voice analysis system are disclosed herein. Speech recognition is achieved by comparing characteristics of words spoken by a speaker to one or more templates of human language words. Speaker identification is achieved by comparing characteristics of a speaker's speech to one or more templates, or voice prints. The system is adapted to increase or decrease matching constraints depending on whether speaker identification or speaker recognition is desired.

192 citations


Journal ArticleDOI
TL;DR: A simple iterative Mean Shift algorithm based on the cosine distance is proposed to perform speaker clustering under speaker diarization conditions, and state-of-the-art results as measured by the Diarization Error Rate and the Number of Detected Speakers are reported on the LDC CallHome telephone corpus.
Abstract: Speaker clustering is a crucial step for speaker diarization. The short duration of speech segments in telephone speech dialogue and the absence of prior information on the number of clusters dramatically increase the difficulty of this problem in diarizing spontaneous telephone speech conversations. We propose a simple iterative Mean Shift algorithm based on the cosine distance to perform speaker clustering under these conditions. Two variants of the cosine distance Mean Shift are compared in an exhaustive practical study. We report state of the art results as measured by the Diarization Error Rate and the Number of Detected Speakers on the LDC CallHome telephone corpus.
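One flat-kernel variant of cosine-distance Mean Shift can be sketched as follows, assuming length-normalized i-vectors; the bandwidth and iteration count are illustrative, and the paper's exact variants may differ.

```python
import numpy as np

def cosine_mean_shift(ivectors, bandwidth=0.4, n_iter=20):
    """Flat-kernel Mean Shift under the cosine distance: each point is repeatedly
    replaced by the (re-normalized) mean of all data points whose cosine distance
    to it is below the bandwidth."""
    X = ivectors / np.linalg.norm(ivectors, axis=1, keepdims=True)
    points = X.copy()
    for _ in range(n_iter):
        for i, p in enumerate(points):
            dists = 1.0 - X @ p                  # cosine distance to all data points
            neighbors = X[dists < bandwidth]
            if len(neighbors):
                m = neighbors.mean(axis=0)
                points[i] = m / np.linalg.norm(m)
    # Points that converge to the same mode are assigned to the same speaker cluster.
    return points
```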

167 citations


Journal ArticleDOI
TL;DR: A general adaptation scheme for DNNs based on discriminant condition codes is proposed, in which the codes are directly fed to various layers of a pre-trained DNN through a new set of connection weights; the methods are quite effective for adapting large DNN models using only a small amount of adaptation data.
Abstract: Fast adaptation of deep neural networks (DNN) is an important research topic in deep learning. In this paper, we have proposed a general adaptation scheme for DNN based on discriminant condition codes, which are directly fed to various layers of a pre-trained DNN through a new set of connection weights. Moreover, we present several training methods to learn connection weights from training data as well as the corresponding adaptation methods to learn new condition code from adaptation data for each new test condition. In this work, the fast adaptation scheme is applied to supervised speaker adaptation in speech recognition based on either frame-level cross-entropy or sequence-level maximum mutual information training criterion. We have proposed three different ways to apply this adaptation scheme based on the so-called speaker codes: i) Nonlinear feature normalization in feature space; ii) Direct model adaptation of DNN based on speaker codes; iii) Joint speaker adaptive training with speaker codes. We have evaluated the proposed adaptation methods in two standard speech recognition tasks, namely TIMIT phone recognition and large vocabulary speech recognition in the Switchboard task. Experimental results have shown that all three methods are quite effective to adapt large DNN models using only a small amount of adaptation data. For example, the Switchboard results have shown that the proposed speaker-code-based adaptation methods may achieve up to 8-10% relative error reduction using only a few dozens of adaptation utterances per speaker. Finally, we have achieved very good performance in Switchboard (12.1% in WER) after speaker adaptation using sequence training criterion, which is very close to the best performance reported in this task ("Deep convolutional neural networks for LVCSR," T. N. Sainath et al., Proc. IEEE Acoust., Speech, Signal Process., 2013).
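The condition-code idea above can be illustrated with a single hidden layer: the code enters the layer's pre-activation through an extra weight matrix, and during adaptation only the small per-speaker code is re-estimated. The names and activation below are hypothetical; this is a sketch of the mechanism, not the paper's exact architecture.

```python
import numpy as np

def layer_with_condition_code(x, W, b, V, code):
    """Hidden layer whose pre-activation receives an extra term driven by the
    condition (speaker) code through connection weights V learned on training data."""
    # x: (in_dim,), W: (hid_dim, in_dim), b: (hid_dim,), V: (hid_dim, code_dim), code: (code_dim,)
    return np.tanh(W @ x + V @ code + b)

# During adaptation, W, b and V stay fixed; only the small per-speaker `code` is
# re-estimated by backpropagating the recognition loss into it, so a handful of
# utterances is enough to adapt a very large network.
```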

157 citations


Proceedings ArticleDOI
04 May 2014
TL;DR: This paper shows how the i-vector representation can be used to perform blind speaker adaptation of a hybrid DNN-HMM speech recognition system and reports excellent results on a French language audio transcription task.
Abstract: State of the art speaker recognition systems are based on the i-vector representation of speech segments. In this paper we show how this representation can be used to perform blind speaker adaptation of a hybrid DNN-HMM speech recognition system and we report excellent results on a French language audio transcription task. The implementation is very simple. An audio file is first diarized and each speaker cluster is represented by an i-vector. Acoustic feature vectors are augmented by the corresponding i-vectors before being presented to the DNN. (The same i-vector is used for all acoustic feature vectors aligned with a given speaker.) This supplementary information improves the DNN's ability to discriminate between phonetic events in a speaker independent way without having to make any modification to the DNN training algorithms. We report results on the ETAPE 2011 transcription task, and show that i-vector based speaker adaptation is effective irrespective of whether cross-entropy or sequence training is used. For cross-entropy training, we obtained a word error rate (WER) reduction from 22.16% to 20.67% whereas for sequence training the WER reduces from 19.93% to 18.40%.
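The feature augmentation described above amounts to appending the same speaker-level i-vector to every acoustic frame of that speaker's cluster, as in this minimal sketch.

```python
import numpy as np

def augment_with_ivector(frames, ivector):
    """Append the speaker-cluster i-vector to every acoustic frame before the DNN."""
    # frames: (T, D); ivector: (K,) -> output: (T, D + K)
    tiled = np.tile(ivector, (frames.shape[0], 1))
    return np.hstack([frames, tiled])
```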

135 citations


Patent
20 Jan 2014
TL;DR: In this paper, an application detecting speaker locations and prompting a user to input rough room boundaries and a desired listener location in the room is used to determine optimum speaker locations/frequency assignations/speaker parameters.
Abstract: In an audio speaker network, setup of speaker location, sound track or channel assignation, and speaker parameters is facilitated by an application detecting speaker locations and prompting a user to input rough room boundaries and a desired listener location in the room. Based on this, optimum speaker locations/frequency assignations/speaker parameters may be determined and output.

117 citations


Proceedings ArticleDOI
01 Dec 2014
TL;DR: This paper explores the use of DNNs to collect SS for the unsupervised domain adaptation task of the Domain Adaptation Challenge (DAC), and shows that collecting SS with a DNN trained on out-of-domain data boosts the speaker recognition performance of an out-of-domain system by more than 25%.
Abstract: Traditional i-vector speaker recognition systems use a Gaussian mixture model (GMM) to collect sufficient statistics (SS). Recently, replacing this GMM with a deep neural network (DNN) has shown promising results. In this paper, we explore the use of DNNs to collect SS for the unsupervised domain adaptation task of the Domain Adaptation Challenge (DAC). We show that collecting SS with a DNN trained on out-of-domain data boosts the speaker recognition performance of an out-of-domain system by more than 25%. Moreover, we integrate the DNN in an unsupervised adaptation framework that uses agglomerative hierarchical clustering with a stopping criterion based on unsupervised calibration, and show that the initial gains of the out-of-domain system carry over to the final adapted system. Despite the fact that the DNN is trained on the out-of-domain data, the final adapted system produces a relative improvement of more than 30% with respect to the best published results on this task.

Proceedings ArticleDOI
Hagai Aronowitz
04 May 2014
TL;DR: This work analyzes the sources of degradation for a particular setup in the context of an i-vector PLDA system and concludes that the main source of degradation is an i-vector dataset shift; as a remedy, inter dataset variability compensation, implemented using the nuisance attribute projection (NAP) method, is introduced.
Abstract: Recently satisfactory results have been obtained in NIST speaker recognition evaluations. These results are mainly due to accurate modeling of a very large development dataset provided by LDC. However, for many realistic scenarios the use of this development dataset is limited due to a dataset mismatch. In such cases, collection of a large enough dataset is infeasible. In this work we analyze the sources of degradation for a particular setup in the context of an i-vector PLDA system and conclude that the main source for degradation is an i-vector dataset shift. As a remedy, we introduce inter dataset variability compensation (IDVC) to explicitly compensate for dataset shift in the i-vector space. This is done using the nuisance attribute projection (NAP) method. Using IDVC we managed to reduce error dramatically by more than 50% for the domain mismatch setup.
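A rough sketch of the IDVC idea: estimate a low-dimensional subspace from per-dataset i-vector means and project it away, NAP-style, before scoring. The subspace dimensionality and the use of SVD here are illustrative assumptions, not necessarily the paper's exact recipe.

```python
import numpy as np

def idvc_projection(ivectors_per_dataset, n_dims=None):
    """Estimate the inter-dataset variability subspace from per-dataset i-vector
    means and return a NAP-style projection that removes it."""
    means = np.array([np.mean(ds, axis=0) for ds in ivectors_per_dataset])
    centered = means - means.mean(axis=0)
    # Principal directions of the dataset means span the nuisance subspace.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    U = Vt.T if n_dims is None else Vt[:n_dims].T
    return np.eye(U.shape[0]) - U @ U.T          # projection away from the subspace

# Usage sketch: P = idvc_projection(dev_sets); compensated = ivectors @ P
# (apply to all i-vectors before PLDA training and scoring).
```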

Proceedings ArticleDOI
04 May 2014
TL;DR: The authors propose to adapt the network parameters of each speaker from a background model, referred to as Universal DBN (UDBN), and to backpropagate class errors up to only one layer for a few iterations before training the network.
Abstract: The use of Deep Belief Networks (DBNs) is proposed in this paper to discriminatively model target and impostor i-vectors in a speaker verification task. The authors propose to adapt the network parameters of each speaker from a background model, which will be referred to as Universal DBN (UDBN). It is also suggested to backpropagate class errors up to only one layer for a few iterations before training the network. Additionally, an impostor selection method is introduced which helps the DBN to outperform the cosine distance classifier. The evaluation is performed on the core test condition of the NIST SRE 2006 corpora, and it is shown that 10% and 8% relative improvements of EER and minDCF can be achieved, respectively.

Proceedings ArticleDOI
01 Jan 2014
TL;DR: The i-vectors are viewed as the weights of a cluster adaptive training (CAT) system, where the underlying models are GMMs rather than HMMs, which allows the factorisation approaches developed for CAT to be directly applied.
Abstract: The use of deep neural networks (DNNs) in a hybrid configuration is becoming increasingly popular and successful for speech recognition. One issue with these systems is how to efficiently adapt them to reflect an individual speaker or noise condition. Recently speaker i-vectors have been successfully used as an additional input feature for unsupervised speaker adaptation. In this work the use of i-vectors for adaptation is extended to incorporate acoustic factorisation. In particular, separate i-vectors are computed to represent speaker and acoustic environment. By ensuring "orthogonality" between the individual factor representations it is possible to represent a wide range of speaker and environment pairs by simply combining i-vectors from a particular speaker and a particular environment. In this paper the i-vectors are viewed as the weights of a cluster adaptive training (CAT) system, where the underlying models are GMMs rather than HMMs. This allows the factorisation approaches developed for CAT to be directly applied. Initial experiments were conducted on a noise distorted version of the WSJ corpus. Compared to standard speaker-based i-vector adaptation, factorised i-vectors showed performance gains.

Journal ArticleDOI
TL;DR: The anatomical and physiological bases for individual differences in the human voice are reviewed, before discussing how recent methodological progress in voice morphing and voice synthesis has promoted research on current theoretical issues, such as how voices are mentally represented in the human brain.
Abstract: While humans use their voice mainly for communicating information about the world, paralinguistic cues in the voice signal convey rich dynamic information about a speaker's arousal and emotional state, and extralinguistic cues reflect more stable speaker characteristics including identity, biological sex and social gender, socioeconomic or regional background, and age. Here we review the anatomical and physiological bases for individual differences in the human voice, before discussing how recent methodological progress in voice morphing and voice synthesis has promoted research on current theoretical issues, such as how voices are mentally represented in the human brain. Special attention is dedicated to the distinction between the recognition of familiar and unfamiliar speakers, in everyday situations or in the forensic context, and on the processes and representational changes that accompany the learning of new voices. We describe how specific impairments and individual differences in voice perception could relate to specific brain correlates. Finally, we consider that voices are produced by speakers who are often visible during communication, and review recent evidence that shows how speaker perception involves dynamic face-voice integration. The representation of para- and extralinguistic vocal information plays a major role in person perception and social communication, could be neuronally encoded in a prototype-referenced manner, and is subject to flexible adaptive recalibration as a result of specific perceptual experience. WIREs Cogn Sci 2014, 5:15-25. doi: 10.1002/wcs.1261

Proceedings ArticleDOI
04 May 2014
TL;DR: This work has evaluated the proposed direct SC-based adaptation method in the large scale 320-hr Switchboard task and shown that the proposed method leads to up to 8% relative reduction in word error rate in Switchboard by using only a very small number of adaptation utterances per speaker.
Abstract: Recently an effective fast speaker adaptation method using discriminative speaker code (SC) has been proposed for the hybrid DNN-HMM models in speech recognition [1]. This adaptation method depends on a joint learning of a large generic adaptation neural network for all speakers as well as multiple small speaker codes using the standard back-propagation algorithm. In this paper, we propose an alternative direct adaptation in model space, where speaker codes are directly connected to the original DNN models through a set of new connection weights, which can be estimated very efficiently from all or part of training data. As a result, the proposed method is more suitable for large scale speech recognition tasks since it eliminates the time-consuming training process of estimating a separate adaptation neural network. In this work, we have evaluated the proposed direct SC-based adaptation method in the large scale 320-hr Switchboard task. Experimental results have shown that the proposed SC-based rapid adaptation method is very effective not only for small recognition tasks but also for very large scale tasks. For example, the proposed method leads to up to 8% relative reduction in word error rate in Switchboard by using only a very small number of adaptation utterances per speaker (from 10 to a few dozen). Moreover, the extra training time required for adaptation is also significantly reduced from the method in [1].

Proceedings ArticleDOI
01 Oct 2014
TL;DR: Experimental results demonstrate that the proposed framework achieves better separation results than a GMM-based approach in the supervised mode; in the semi-supervised mode, which is believed to be the preferred mode in real-world operations, the DNN-based approach even outperforms the GMM-based approach in the supervised mode.
Abstract: This paper proposes a novel data-driven approach based on deep neural networks (DNNs) for single-channel speech separation. DNN is adopted to directly model the highly non-linear relationship of speech features between a target speaker and the mixed signals. Both supervised and semi-supervised scenarios are investigated. In the supervised mode, both identities of the target speaker and the interfering speaker are provided. While in the semi-supervised mode, only the target speaker is given. We propose using multiple speakers to be mixed with the target speaker to train the DNN which is shown to well predict an unseen interferer in the separation stage. Experimental results demonstrate that our proposed framework achieves better separation results than a GMM-based approach in the supervised mode. More significantly, in the semi-supervised mode which is believed to be the preferred mode in real-world operations, the DNN-based approach even outperforms the GMM-based approach in the supervised mode.
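The training-data construction implied by the semi-supervised mode can be sketched as pairing mixture features with the target speaker's clean features across many interferers, so the regression DNN learns a mapping that generalizes to unseen interferers. The feature_fn argument is a hypothetical feature extractor (e.g. log-power spectra).

```python
import numpy as np

def make_training_pairs(target_utts, interferer_utts, feature_fn):
    """Pair mixture features with the target speaker's clean features.
    Mixing the target with many different interferers lets the regression DNN
    generalize to unseen interferers at separation time (semi-supervised mode)."""
    inputs, targets = [], []
    for tgt in target_utts:
        for intf in interferer_utts:
            n = min(len(tgt), len(intf))
            mixture = tgt[:n] + intf[:n]                 # waveform-level mixing
            inputs.append(feature_fn(mixture))           # e.g. log-power spectra
            targets.append(feature_fn(tgt[:n]))
    return np.vstack(inputs), np.vstack(targets)

# A feed-forward DNN is then trained to regress the target features from the
# mixture features, e.g. with a mean-squared-error objective.
```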

Proceedings ArticleDOI
04 May 2014
TL;DR: This paper proposes a novel training scheme that applies SAT to an SI DNN-HMM recognizer, implements the SAT scheme by allocating a Speaker-Dependent module to one of the intermediate layers of a seven-layer DNN, and demonstrates its utility on TED Talks corpus data.
Abstract: Among many speaker adaptation embodiments, Speaker Adaptive Training (SAT) has been successfully applied to a standard Hidden-Markov-Model (HMM) speech recognizer, whose state is associated with Gaussian Mixture Models (GMMs). On the other hand, recent studies on Speaker-Independent (SI) recognizer development have reported that a new type of HMM speech recognizer, which replaces GMMs with Deep Neural Networks (DNNs), outperforms GMM-HMM recognizers. Along these two lines, it is natural to conceive of further improvement to a preset DNN-HMM recognizer by employing SAT. In this paper, we propose a novel training scheme that applies SAT to a SI DNN-HMM recognizer. We then implement the SAT scheme by allocating a Speaker-Dependent (SD) module to one of the intermediate layers of a seven-layer DNN, and elaborate its utility over TED Talks corpus data. Experiment results show that our speaker-adapted SAT-based DNN-HMM recognizer reduces the word error rate by 8.4% more than that of a baseline SI DNN-HMM recognizer, and (regardless of the SD module allocation) outperforms the conventional speaker adaptation scheme. The results also show that the inner layers of DNN are more suitable for the SD module than the outer layers.

Proceedings ArticleDOI
01 Dec 2014
TL;DR: This study proposes an artificial neural network architecture to learn a feature transform that is optimized for speaker diarization, and trains a multi-hidden-layer ANN to judge whether two given speech segments came from the same or different speakers, using a shared transform of the input features.
Abstract: Speaker diarization finds contiguous speaker segments in an audio recording and clusters them by speaker identity, without any a-priori knowledge. Diarization is typically based on short-term spectral features such as Mel-frequency cepstral coefficients (MFCCs). Though these features carry average information about the vocal tract characteristics of a speaker, they are also susceptible to factors unrelated to the speaker identity. In this study, we propose an artificial neural network (ANN) architecture to learn a feature transform that is optimized for speaker diarization. We train a multi-hidden-layer ANN to judge whether two given speech segments came from the same or different speakers, using a shared transform of the input features that feeds into a bottleneck layer. We then use the bottleneck layer activations as features, either alone or in combination with baseline MFCC features in a multistream mode, for speaker diarization on test data. The resulting system is evaluated on various corpora of multi-party meetings. A combination of MFCC and ANN features gives up to 14% relative reduction in diarization error, demonstrating that these features are providing an additional independent source of knowledge.
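A hedged sketch of the architecture described above, using Keras for concreteness: both segments pass through one shared transform ending in a bottleneck, and a binary output judges same versus different speaker; after training, the shared transform provides bottleneck features for diarization. Layer sizes and activations here are assumptions, not the paper's exact settings.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_pairwise_net(feat_dim, hidden=512, bottleneck=64):
    """Two segments share one feature transform ending in a bottleneck; the network
    is trained to decide whether the segments come from the same speaker."""
    shared = keras.Sequential([
        layers.Dense(hidden, activation="tanh"),
        layers.Dense(bottleneck, activation="tanh", name="bottleneck"),
    ])
    seg_a = keras.Input(shape=(feat_dim,))
    seg_b = keras.Input(shape=(feat_dim,))
    merged = layers.Concatenate()([shared(seg_a), shared(seg_b)])
    hidden_out = layers.Dense(hidden, activation="tanh")(merged)
    same_prob = layers.Dense(1, activation="sigmoid")(hidden_out)
    model = keras.Model([seg_a, seg_b], same_prob)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    # `shared` maps raw segment features to bottleneck features at test time,
    # used alone or stacked with MFCCs for clustering.
    return model, shared
```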

Patent
30 Jun 2014
TL;DR: In this patent, audio data is segmented into a plurality of utterances, each utterance is represented as an utterance model representative of a plurality of feature vectors, and the utterance models are clustered.
Abstract: In a method of diarization of audio data, audio data is segmented into a plurality of utterances. Each utterance is represented as an utterance model representative of a plurality of feature vectors. The utterance models are clustered. A plurality of speaker models are constructed from the clustered utterance models. A hidden Markov model is constructed of the plurality of speaker models. A sequence of identified speaker models is decoded.

Journal ArticleDOI
TL;DR: It is demonstrated that only a very small subset of the training pairs is necessary to train the original PSVM model, and two approaches are proposed that allow discarding most of the training pairs that are not essential, without harming the accuracy of the model.
Abstract: State-of-the-art systems for text-independent speaker recognition use as their features a compact representation of a speaker utterance, known as "i-vector." We recently presented an efficient approach for training a Pairwise Support Vector Machine (PSVM) with a suitable kernel for i-vector pairs for a quite large speaker recognition task. Rather than estimating an SVM model per speaker, according to the "one versus all" discriminative paradigm, the PSVM approach classifies a trial, consisting of a pair of i-vectors, as belonging or not to the same speaker class. Training a PSVM with large amount of data, however, is a memory and computational expensive task, because the number of training pairs grows quadratically with the number of training i-vectors. This paper demonstrates that a very small subset of the training pairs is necessary to train the original PSVM model, and proposes two approaches that allow discarding most of the training pairs that are not essential, without harming the accuracy of the model. This allows dramatically reducing the memory and computational resources needed for training, which becomes feasible with large datasets including many speakers. We have assessed these approaches on the extended core conditions of the NIST 2012 Speaker Recognition Evaluation. Our results show that the accuracy of the PSVM trained with a sufficient number of speakers is 10%-30% better compared to the one obtained by a PLDA model, depending on the testing conditions. Since the PSVM accuracy increases with the training set size, but PSVM training does not scale well for large numbers of speakers, our selection techniques become relevant for training accurate discriminative classifiers.

Journal ArticleDOI
TL;DR: This study aims to factorize speaker characteristics, verbal content and expressive behaviors in various acoustic features by proposing a metric to quantify the dependency between acoustic features and communication traits (i.e., speaker, lexical and emotional factors).

Proceedings ArticleDOI
04 May 2014
TL;DR: Using the Audio/Visual Emotion Challenge and Workshop 2013 Depression Dataset, the suitability of i-vectors for reducing variability due to speaker identity and phonetic content when distinguishing between low and high levels of speaker depression is explored.
Abstract: Variations in the acoustic space due to changes in speaker mental state are potentially overshadowed by variability due to speaker identity and phonetic content. Using the Audio/Visual Emotion Challenge and Workshop 2013 Depression Dataset we explore the suitability of i-vectors for reducing these latter sources of variability for distinguishing between low or high levels of speaker depression. In addition we investigate whether supervised variability compensation methods such as Linear Discriminant Analysis (LDA), and Within Class Covariance Normalisation (WCCN), applied in the i-vector domain, could be used to compensate for speaker and phonetic variability. Classification results show that i-vectors formed using an over-sampling methodology outperform a baseline set by KL-means supervectors. However the effect of these two compensation methods does not appear to improve system accuracy. Visualisations afforded by the t-Distributed Stochastic Neighbour Embedding (t-SNE) technique suggest that despite the application of these techniques, speaker variability is still a strong confounding effect.
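As a reference for the compensation step mentioned above, a standard Within Class Covariance Normalisation transform in the i-vector domain can be sketched as follows; the class labels here would be speaker (or phonetic) labels, and the sketch assumes enough examples per class for the within-class covariance to be invertible.

```python
import numpy as np

def wccn_transform(ivectors, labels):
    """Within Class Covariance Normalisation: compute a projection that whitens the
    average within-class covariance, down-weighting nuisance directions."""
    classes = np.unique(labels)
    dim = ivectors.shape[1]
    W = np.zeros((dim, dim))
    for c in classes:
        Xc = ivectors[labels == c]
        Xc = Xc - Xc.mean(axis=0)
        W += Xc.T @ Xc / len(Xc)
    W /= len(classes)
    B = np.linalg.cholesky(np.linalg.inv(W))     # B such that B B^T = W^{-1}
    return B                                      # apply as ivectors @ B
```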

Proceedings ArticleDOI
01 Dec 2014
TL;DR: Different methods are presented to further improve and extend SAT-DNN for tasks including bottleneck feature (BNF) generation, convolutional neural network (CNN) acoustic modeling and multilingual DNN-based feature extraction.
Abstract: Speaker adaptive training (SAT) is a well studied technique for Gaussian mixture acoustic models (GMMs). Recently we proposed to perform SAT for deep neural networks (DNNs), with speaker i-vectors applied in feature learning. The resulting SAT-DNN models significantly outperform DNNs on word error rates (WERs). In this paper, we present different methods to further improve and extend SAT-DNN. First, we conduct detailed analysis to investigate i-vector extractor training and flexible feature fusion. Second, the SAT-DNN approach is extended to improve tasks including bottleneck feature (BNF) generation, convolutional neural network (CNN) acoustic modeling and multilingual DNN-based feature extraction. Third, for transcribing multimedia data, we enrich the i-vector representation with global speaker attributes (age, gender, etc.) obtained automatically from video signals. On a collection of instructional videos, incorporation of the additional visual features is observed to boost the recognition accuracy of SAT-DNN.

Proceedings ArticleDOI
04 May 2014
TL;DR: Spear implements a set of complete speaker recognition toolchains, including all the processing stages from the front-end feature extractor to the final steps of decision and evaluation, and several state-of-the-art modeling techniques are included.
Abstract: In this paper, we introduce Spear, an open source and extensible toolbox for state-of-the-art speaker recognition. This toolbox is built on top of Bob, a free signal processing and machine learning library. Spear implements a set of complete speaker recognition toolchains, including all the processing stages from the front-end feature extractor to the final steps of decision and evaluation. Several state-of-the-art modeling techniques are included, such as Gaussian mixture models, inter-session variability, joint factor analysis and total variability (i-vectors). Furthermore, the toolchains can be easily evaluated on well-known databases such as NIST SRE and MOBIO. As a proof of concept, an experimental comparison of different modeling techniques is conducted on the MOBIO database.

Proceedings ArticleDOI
26 May 2014
TL;DR: A method is proposed by which speakers whose speech has not been used to build voice transformations (for training) can be efficiently de-identified online; it performs similarly to a closed-set de-identification procedure that requires previous enrolment and can efficiently be used for online speaker de-identification.
Abstract: Speaker de-identification is the process by which speech is transformed in a way that the speaker identity is masked, while at the same time the transformed speech preserves acoustic information that contributes to the intelligibility, naturalness and clarity. Systems that perform speech de-identification could be used in voice driven applications (for example in call centres) where the speaker's identity has to be hidden. The paper describes the experiments we have performed in order to de-identify speech using GMM based voice transformation techniques and speaker identification using freely available tools. We propose a method by which speakers whose speech has not been used to build voice transformations (for training) can be efficiently de-identified online. The proposed method is evaluated using a speech database of read speech and a small set of speakers. The results we present show that the proposed de-identification method performs similarly as a closed-set de-identification procedure that requires previous enrolment and can efficiently be used for online speaker de-identification.

Patent
13 Nov 2014
TL;DR: In this patent, a method and system for building a speech database for a text-to-speech (TTS) synthesis system from multiple speakers recorded under diverse conditions is described.
Abstract: A method and system is disclosed for building a speech database for a text-to-speech (TTS) synthesis system from multiple speakers recorded under diverse conditions. For a plurality of utterances of a reference speaker, a set of reference-speaker vectors may be extracted, and for each of a plurality of utterances of a colloquial speaker, a respective set of colloquial-speaker vectors may be extracted. A matching procedure, carried out under a transform that compensates for speaker differences, may be used to match each colloquial-speaker vector to a reference-speaker vector. The colloquial-speaker vector may be replaced with the matched reference-speaker vector. The matching-and-replacing can be carried out separately for each set of colloquial-speaker vectors. A conditioned set of speaker vectors can then be constructed by aggregating all the replaced speaker vectors. The conditioned set of speaker vectors can be used to train the TTS system.

Journal ArticleDOI
TL;DR: A highly discriminative speaker verification framework is constructed through intrinsic and extrinsic back-end algorithm modification, resulting in complementary sub-systems and very competitive performance with reasonable computational cost.
Abstract: This study aims to explore the case of robust speaker recognition with multi-session enrollments and noise, with an emphasis on optimal organization and utilization of speaker information presented in the enrollment and development data. This study has two core objectives. First, we investigate more robust back-ends to address noisy multi-session enrollment data for speaker recognition. This task is achieved by proposing novel back-end algorithms. Second, we construct a highly discriminative speaker verification framework. This task is achieved through intrinsic and extrinsic back-end algorithm modification, resulting in complementary sub-systems. Evaluation of the proposed framework is performed on the NIST SRE2012 corpus. Results not only confirm individual sub-system advancements over an established baseline; the final grand fusion solution also represents a comprehensive overall advancement for the NIST SRE2012 core tasks. Compared with state-of-the-art SID systems on the NIST SRE2012, the novel parts of this study are: 1) exploring a more diverse set of solutions for low-dimensional i-Vector based modeling; and 2) diversifying the information configuration before modeling. These two parts work together, resulting in very competitive performance with reasonable computational cost.

Journal ArticleDOI
TL;DR: Experiments conducted on three corpora show that the proposed method improves the performance of an acoustic feature-based overlap detector on all the corpora, and that the model based on long-term conversational features used to estimate the probability of overlap, learned from the AMI corpus, generalizes to meetings from other corpora.
Abstract: Overlapping speech has been identified as one of the main sources of errors in diarization of meeting room conversations. Therefore, overlap detection has become an important step prior to speaker diarization. Studies on conversational analysis have shown that overlapping speech is more likely to occur at specific parts of a conversation. They have also shown that overlap occurrence is correlated with various conversational features such as speech, silence patterns and speaker turn changes. We use features capturing this higher level information from the structure of a conversation, such as silence and speaker change statistics, to improve an acoustic feature based classifier of overlapping and single-speaker speech classes. The silence and speaker change statistics are computed over a long-term window (around 3-4 seconds) and are used to predict the probability of overlap in the window. These estimates are then incorporated into an acoustic feature based classifier as prior probabilities of the classes. Experiments conducted on three corpora (AMI, NIST-RT and ICSI) have shown that the proposed method improves the performance of the acoustic feature-based overlap detector on all the corpora. They also reveal that the model based on long-term conversational features used to estimate the probability of overlap, which is learned from the AMI corpus, generalizes to meetings from other corpora (NIST-RT and ICSI). Moreover, experiments on the ICSI corpus reveal that the proposed method also improves laughter overlap detection. Consequently, applying overlap handling techniques to speaker diarization using the detected overlap results in a reduction of diarization error rate (DER) on all the three corpora.
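The incorporation of long-term conversational information as class priors can be sketched as a simple log-domain combination of window-level overlap priors with the acoustic classifier's frame-level log-likelihoods; the data layout below is an illustrative assumption.

```python
import numpy as np

def classify_overlap(acoustic_loglik, overlap_prior):
    """Combine frame-level acoustic log-likelihoods for the overlap and
    single-speaker classes with a window-level overlap prior estimated from
    long-term conversational features (silence and speaker-change statistics)."""
    # acoustic_loglik: dict of per-frame log-likelihood arrays for both classes;
    # overlap_prior: scalar prior for the current long-term window.
    log_post_overlap = acoustic_loglik["overlap"] + np.log(overlap_prior)
    log_post_single = acoustic_loglik["single"] + np.log(1.0 - overlap_prior)
    return log_post_overlap > log_post_single     # True where overlap is detected
```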

Journal ArticleDOI
TL;DR: A method for associating audio and video information by using co-occurrence matrices is proposed; experiments show the effectiveness of the overall diarization system and confirm the gains audio information can bring to video indexing and vice versa.
Abstract: Audio-Visual People Diarization (AVPD) is an original framework that simultaneously improves audio, video, and audiovisual diarization results. Following a literature review of people diarization for both audio and video content and their limitations, which includes our own contributions, we describe a proposed method for associating both audio and video information by using co-occurrence matrices and present experiments which were conducted on a corpus containing TV news, TV debates, and movies. Results show the effectiveness of the overall diarization system and confirm the gains audio information can bring to video indexing and vice versa.