
Showing papers on "Speaker diarisation" published in 2013


Proceedings ArticleDOI
01 Dec 2013
TL;DR: This work proposes to adapt deep neural network acoustic models to a target speaker by supplying speaker identity vectors (i-vectors) as input features to the network, in parallel with the regular acoustic features for ASR; the resulting networks are comparable in performance to DNNs trained on speaker-adapted features, with the advantage that only one decoding pass is needed.
Abstract: We propose to adapt deep neural network (DNN) acoustic models to a target speaker by supplying speaker identity vectors (i-vectors) as input features to the network in parallel with the regular acoustic features for ASR. For both training and test, the i-vector for a given speaker is concatenated to every frame belonging to that speaker and changes across different speakers. Experimental results on a 300-hour Switchboard corpus show that DNNs trained on speaker-independent features and i-vectors achieve a 10% relative improvement in word error rate (WER) over networks trained on speaker-independent features only. These networks are comparable in performance to DNNs trained on speaker-adapted features (with VTLN and FMLLR), with the advantage that only one decoding pass is needed. Furthermore, networks trained on speaker-adapted features and i-vectors achieve a 5-6% relative improvement in WER after Hessian-free sequence training over networks trained on speaker-adapted features only.

714 citations
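The adaptation mechanism is simple to picture: the same i-vector is appended to every frame of a speaker's data before it reaches the network. A minimal sketch (not the authors' code), assuming frame-level features and i-vectors are already extracted and with illustrative dimensions:

```python
import numpy as np

def augment_frames_with_ivector(frames, ivector):
    """frames: (num_frames, feat_dim) acoustic features of one speaker.
    ivector: (ivec_dim,) identity vector estimated for that speaker.
    Returns (num_frames, feat_dim + ivec_dim) network inputs."""
    tiled = np.tile(ivector, (frames.shape[0], 1))   # same i-vector on every frame
    return np.concatenate([frames, tiled], axis=1)

# Example: 40-dim filterbank frames + 100-dim i-vector -> 140-dim DNN input.
x = augment_frames_with_ivector(np.random.randn(300, 40), np.random.randn(100))
print(x.shape)  # (300, 140)
```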


Proceedings ArticleDOI
26 May 2013
TL;DR: This paper is intended to be a reference on the 2nd "CHiME" Challenge, an initiative designed to analyze and evaluate the performance of ASR systems in a real-world domestic environment.
Abstract: Distant-microphone automatic speech recognition (ASR) remains a challenging goal in everyday environments involving multiple background sources and reverberation. This paper is intended to be a reference on the 2nd "CHiME" Challenge, an initiative designed to analyze and evaluate the performance of ASR systems in a real-world domestic environment. Two separate tracks have been proposed: a small-vocabulary task with small speaker movements and a medium-vocabulary task without speaker movements. We discuss the rationale for the challenge and provide a detailed description of the datasets, tasks and baseline performance results for each track.

377 citations


Proceedings ArticleDOI
Hank Liao1
26 May 2013
TL;DR: This work explores how deep neural networks may be adapted to speakers by re-training the input layer, the output layer or the entire network, and looks at how L2 regularization, applied as weight decay toward the speaker-independent model, improves generalization.
Abstract: There has been little work on examining how deep neural networks may be adapted to speakers for improved speech recognition accuracy. Past work has examined using a discriminatively trained affine transformation of the input features applied at a frame level or the re-training of the entire shallow network for a specific speaker. This work explores how deep neural networks may be adapted to speakers by re-training the input layer, the output layer or the entire network. We look at how L2 regularization, applied as weight decay toward the speaker-independent model, improves generalization. Other training factors are examined, including the role momentum plays and stochastic mini-batch versus batch training. While improvements are significant for smaller networks, the largest networks show little gain from adaptation on a large-vocabulary mobile speech recognition task.

271 citations
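A hedged sketch of the regularization idea, assuming a PyTorch model being fine-tuned on one speaker's data: the adaptation loss decays the adapted weights toward the speaker-independent (SI) parameters rather than toward zero. Names and the regularization weight are illustrative.

```python
import torch

def adapted_loss(model, si_params, ce_loss, l2_weight=1e-3):
    """ce_loss: cross-entropy on the speaker's adaptation data.
    si_params: dict of SI parameter tensors keyed like model.named_parameters()."""
    reg = torch.zeros(())
    for name, p in model.named_parameters():
        reg = reg + ((p - si_params[name]) ** 2).sum()   # pull weights back toward the SI model
    return ce_loss + 0.5 * l2_weight * reg
```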


Proceedings ArticleDOI
26 May 2013
TL;DR: A new fast speaker adaptation method for the hybrid NN-HMM speech recognition model that can achieve over 10% relative reduction in phone error rate by using only seven utterances for adaptation.
Abstract: In this paper, we propose a new fast speaker adaptation method for the hybrid NN-HMM speech recognition model. The adaptation method depends on a joint learning of a large generic adaptation neural network for all speakers as well as multiple small speaker codes (one per speaker). The joint training method uses all training data along with speaker labels to update adaptation NN weights and speaker codes based on the standard back-propagation algorithm. In this way, the learned adaptation NN is capable of transforming each speaker's features into a generic speaker-independent feature space when a small speaker code is given. Adaptation to a new speaker can be simply done by learning a new speaker code using the same back-propagation algorithm without changing any NN weights. In this method, a separate speaker code is learned for each speaker while the large adaptation NN is learned from the whole training set. The main advantage of this method is that the size of speaker codes is very small. As a result, it is possible to conduct a very fast adaptation of the hybrid NN/HMM model for each speaker based on only a small amount of adaptation data (i.e., just a few utterances). Experimental results on TIMIT have shown that it can achieve over 10% relative reduction in phone error rate by using only seven utterances for adaptation.

269 citations
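The mechanics of the speaker-code scheme are easy to outline. Below is an illustrative (not the authors') PyTorch sketch: a shared adaptation network consumes acoustic features concatenated with a small per-speaker code, and adapting to a new speaker freezes the network and back-propagates only into the code.

```python
import torch
import torch.nn as nn

class AdaptationNet(nn.Module):
    def __init__(self, feat_dim=40, code_dim=50, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + code_dim, hidden), nn.Sigmoid(),
            nn.Linear(hidden, feat_dim))              # maps to a speaker-normalized feature space

    def forward(self, feats, code):
        code = code.expand(feats.size(0), -1)         # one small code shared by all frames
        return self.net(torch.cat([feats, code], dim=1))

# Fast adaptation to a new speaker: freeze the shared network, learn only the code.
adapt_net = AdaptationNet()
for p in adapt_net.parameters():
    p.requires_grad_(False)
speaker_code = torch.zeros(1, 50, requires_grad=True)
optimizer = torch.optim.SGD([speaker_code], lr=0.1)
```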


Proceedings ArticleDOI
26 May 2013
TL;DR: This paper shows how to quantify the uncertainty associated with the i-vector extraction process and propagate it into a PLDA classifier, finding that this leads to substantial improvements in accuracy.
Abstract: The duration of speech segments has traditionally been controlled in the NIST speaker recognition evaluations, so researchers working in this framework have been relieved of the responsibility of dealing with the duration variability that arises in practical applications. The fixed-dimensional i-vector representation of speech utterances is ideal for working under such controlled conditions, and ignoring the fact that i-vectors extracted from short utterances are less reliable than those extracted from long utterances leads to a very simple formulation of the speaker recognition problem. However, a more realistic approach seems to be needed to handle duration variability properly. In this paper, we show how to quantify the uncertainty associated with the i-vector extraction process and propagate it into a PLDA classifier. We evaluated this approach using test sets derived from the NIST 2010 core and extended core conditions by randomly truncating the utterances in the female, telephone-speech trials so that the durations of all enrollment and test utterances lay in the range 3-60 seconds, and we found that it led to substantial improvements in accuracy. Although the likelihood ratio computation for speaker verification is more computationally expensive than in the standard i-vector/PLDA classifier, it is still quite modest, as it reduces to computing the probability density functions of two full-covariance Gaussians (irrespective of the number of utterances used to enroll a speaker).

233 citations
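The abstract's closing remark has a compact interpretation: once the uncertainty has been propagated, the verification score is a log-likelihood ratio between two full-covariance Gaussians. A minimal sketch of that final scoring step only; the construction of the means and covariances follows the paper, not this snippet.

```python
import numpy as np
from scipy.stats import multivariate_normal

def full_cov_llr(x, mu_same, cov_same, mu_diff, cov_diff):
    """x: the (projected) i-vector statistics being scored.
    Returns log p(x | same speaker) - log p(x | different speakers)."""
    return (multivariate_normal.logpdf(x, mean=mu_same, cov=cov_same)
            - multivariate_normal.logpdf(x, mean=mu_diff, cov=cov_diff))
```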


Journal ArticleDOI
TL;DR: An improved clustering method is integrated with an existing re-segmentation algorithm and an iterative optimization scheme is implemented that demonstrates the ability to improve both speaker cluster assignments and segmentation boundaries in an unsupervised manner.
Abstract: In speaker diarization, standard approaches typically perform speaker clustering on some initial segmentation before refining the segment boundaries in a re-segmentation step to obtain a final diarization hypothesis. In this paper, we integrate an improved clustering method with an existing re-segmentation algorithm and, in iterative fashion, optimize both speaker cluster assignments and segmentation boundaries jointly. For clustering, we extend our previous research using factor analysis for speaker modeling. In continuing to take advantage of the effectiveness of factor analysis as a front-end for extracting speaker-specific features (i.e., i-vectors), we develop a probabilistic approach to speaker clustering by applying a Bayesian Gaussian Mixture Model (GMM) to principal component analysis (PCA)-processed i-vectors. We then utilize information at different temporal resolutions to arrive at an iterative optimization scheme that, in alternating between clustering and re-segmentation steps, demonstrates the ability to improve both speaker cluster assignments and segmentation boundaries in an unsupervised manner. Our proposed methods attain results that are comparable to those of a state-of-the-art benchmark set on the multi-speaker CallHome telephone corpus. We further compare our system with a Bayesian nonparametric approach to diarization and attempt to reconcile their differences in both methodology and performance.

181 citations
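A rough sketch of the clustering stage described above, assuming segment-level i-vectors are already extracted: the i-vectors are PCA-processed, then clustered with a Bayesian GMM whose surviving components act as speaker clusters. All parameter choices here are illustrative, not the paper's settings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import BayesianGaussianMixture

def cluster_ivectors(ivectors, max_speakers=10, pca_dim=3):
    """ivectors: (num_segments, ivec_dim). Returns a cluster label per segment."""
    x = PCA(n_components=pca_dim, whiten=True).fit_transform(ivectors)
    bgmm = BayesianGaussianMixture(n_components=max_speakers,
                                   weight_concentration_prior=1e-2,
                                   covariance_type="full",
                                   max_iter=500).fit(x)
    return bgmm.predict(x)

labels = cluster_ivectors(np.random.randn(200, 100))
```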


Proceedings ArticleDOI
25 Aug 2013
TL;DR: This paper presents the LIUM open-source speaker diarization toolbox, mostly dedicated to broadcast news, which includes both Hierarchical Agglomerative Clustering using well-known measures such as BIC and CLR, and the new ILP clustering algorithm using i-vectors.
Abstract: This paper presents the LIUM open-source speaker diarization toolbox, mostly dedicated to broadcast news. This tool includes both Hierarchical Agglomerative Clustering using well-known measures such as BIC and CLR, and the new ILP clustering algorithm using i-vectors. Diarization systems are tested on the French evaluation data from ESTER, ETAPE and REPERE campaigns.

162 citations
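As a reference point for the BIC-based agglomerative clustering mentioned above, here is a hedged sketch of the classic delta-BIC merge test between two clusters modelled by single full-covariance Gaussians; the toolbox's actual implementation and penalty weighting may differ.

```python
import numpy as np

def delta_bic(x, y, lam=1.0):
    """x, y: (n_frames, dim) cepstral features of two candidate clusters.
    Negative values favour merging the clusters into a single speaker."""
    z = np.vstack([x, y])
    n1, n2, n = len(x), len(y), len(x) + len(y)
    d = x.shape[1]
    logdet = lambda m: np.linalg.slogdet(np.cov(m, rowvar=False))[1]
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return 0.5 * (n * logdet(z) - n1 * logdet(x) - n2 * logdet(y)) - penalty
```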


Proceedings ArticleDOI
25 Aug 2013
TL;DR: This paper presents a voice conversion technique using Deep Belief Nets (DBNs) to build high-order eigen spaces of the source/target speakers, where it is easier to convert the source speech to the target speech than in the traditional cepstrum space.
Abstract: This paper presents a voice conversion technique using Deep Belief Nets (DBNs) to build high-order eigen spaces of the source/target speakers, where it is easier to convert the source speech to the target speech than in the traditional cepstrum space. DBNs have a deep architecture that automatically discovers abstractions to maximally express the original input features. If we train the DBNs using only the speech of an individual speaker, it can be considered that there is less phonological information and relatively more speaker individuality in the output features at the highest layer. Training the DBNs for a source speaker and a target speaker, we can then connect and convert the speaker individuality abstractions using Neural Networks (NNs). The converted abstraction of the source speaker is then brought back to the cepstrum space using an inverse process of the DBNs of the target speaker. We conducted speaker voice conversion experiments and confirmed the efficacy of our method with respect to subjective and objective criteria, comparing it with the conventional Gaussian Mixture Model-based method.

140 citations


Proceedings ArticleDOI
26 May 2013
TL;DR: From the synthetic speech detection results, the modulation features provide complementary information to magnitude/phase features, and the best detection performance is obtained by fusing phase modulation features and phase features, yielding an equal error rate of 0.89%.
Abstract: Voice conversion and speaker adaptation techniques present a threat to current state-of-the-art speaker verification systems. To prevent such spoofing attacks and enhance the security of speaker verification systems, the development of anti-spoofing techniques to distinguish synthetic and human speech is necessary. In this study, we continue the quest to discriminate synthetic from human speech. Motivated by the fact that current analysis-synthesis techniques operate at the frame level and make a frame-by-frame independence assumption, we propose to adopt magnitude/phase modulation features to detect synthetic speech. Modulation features derived from the magnitude/phase spectrum carry long-term temporal information of speech, and may be able to detect temporal artifacts caused by the frame-by-frame processing in the synthesis of the speech signal. In our synthetic speech detection results, the modulation features provide complementary information to magnitude/phase features. The best detection performance is obtained by fusing phase modulation features and phase features, yielding an equal error rate of 0.89%, which is significantly lower than the 1.25% of phase features and 10.98% of MFCC features.

136 citations
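To make the notion of modulation features concrete, here is an illustrative sketch (window lengths and feature choice are assumptions, not the paper's exact setup): each frequency bin's trajectory over a longer window is analysed with an FFT along the time axis, exposing long-term temporal structure that frame-by-frame synthesis tends to distort.

```python
import numpy as np

def modulation_features(frame_spectra, mod_win=32, mod_hop=16):
    """frame_spectra: (num_frames, num_bins) frame-level magnitude or phase features.
    Returns one modulation-spectrum vector per analysis block."""
    feats = []
    for start in range(0, frame_spectra.shape[0] - mod_win + 1, mod_hop):
        block = frame_spectra[start:start + mod_win]   # trajectory of each bin over time
        mod = np.abs(np.fft.rfft(block, axis=0))       # modulation spectrum per frequency bin
        feats.append(mod.flatten())
    return np.array(feats)
```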


Proceedings ArticleDOI
26 May 2013
TL;DR: The effect of duration variability on the phoneme distributions of speech utterances and on i-vector length is analyzed; it is demonstrated that, as utterance duration decreases, the number of detected unique phonemes and the i-vector length approach zero in a logarithmic and non-linear fashion, respectively.
Abstract: Speaker recognition systems trained on long-duration utterances are known to perform significantly worse when short test segments are encountered. To address this mismatch, we analyze the effect of duration variability on the phoneme distributions of speech utterances and on i-vector length. We demonstrate that, as utterance duration is decreased, the number of detected unique phonemes and the i-vector length approach zero in a logarithmic and non-linear fashion, respectively. Treating duration variability as an additive noise in the i-vector space, we propose three different strategies for its compensation: i) multi-duration training of the Probabilistic Linear Discriminant Analysis (PLDA) model, ii) score calibration using log duration as a Quality Measure Function (QMF), and iii) multi-duration PLDA training with synthesized short-duration i-vectors. Experiments are designed based on the 2012 National Institute of Standards and Technology (NIST) Speaker Recognition Evaluation (SRE) protocol with varying test-utterance duration. Experimental results demonstrate the effectiveness of the proposed schemes on short-duration test conditions, especially with the QMF calibration approach.

128 citations
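Strategy (ii) above has a particularly compact form. A minimal sketch, assuming the calibration weights have already been trained (for example, by logistic regression on development trials):

```python
import numpy as np

def qmf_calibrated_score(raw_score, duration_sec, w0, w1, w2):
    """Affine score calibration augmented with log duration as a quality measure."""
    return w0 + w1 * raw_score + w2 * np.log(duration_sec)
```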


Proceedings ArticleDOI
21 Oct 2013
TL;DR: This work studies an alternative, likelihood-ratio-based VAD that trains speech and nonspeech models on an utterance-by-utterance basis from mel-frequency cepstral coefficients (MFCCs), and provides an open-source implementation of the method.
Abstract: A voice activity detector (VAD) plays a vital role in robust speaker verification, where energy VAD is most commonly used. Energy VAD works well in noise-free conditions but deteriorates in noisy conditions. One way to tackle this is to introduce speech enhancement preprocessing. We study an alternative, likelihood-ratio-based VAD that trains speech and nonspeech models on an utterance-by-utterance basis from mel-frequency cepstral coefficients (MFCCs). The training labels are obtained from enhanced energy VAD. As the speech and nonspeech models are re-trained for each utterance, minimal assumptions about the background noise are made. According to both VAD error analysis and speaker verification results utilizing a state-of-the-art i-vector system, the proposed method outperforms energy VAD variants by a wide margin. We provide an open-source implementation of the method.
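A rough sketch of the per-utterance likelihood-ratio VAD described above; the GMM sizes, the decision threshold, and the absence of any temporal smoothing are assumptions made for brevity, not the authors' configuration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def likelihood_ratio_vad(mfcc, energy_speech_mask, n_components=4, threshold=0.0):
    """mfcc: (num_frames, dim) features of one utterance.
    energy_speech_mask: boolean initial labels from an (enhanced) energy VAD."""
    speech = GaussianMixture(n_components).fit(mfcc[energy_speech_mask])
    nonspeech = GaussianMixture(n_components).fit(mfcc[~energy_speech_mask])
    llr = speech.score_samples(mfcc) - nonspeech.score_samples(mfcc)
    return llr > threshold   # refined frame-level speech decisions
```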

Patent
25 Feb 2013
TL;DR: In this paper, a method and apparatus employing classifier adaptation based on field data in a deployed voice-based interactive system comprise: collecting representations of voice characteristics, in association with corresponding speakers, the representations being generated by the deployed voice based interactive system; updating parameters of the classifier, used in speaker recognition, based on the representations collected.
Abstract: Typical speaker verification systems usually employ speakers' audio data collected during an enrollment phase when users enroll with the system and provide respective voice samples. Due to technical, business, or other constraints, the enrollment data may not be large enough or rich enough to encompass different inter-speaker and intra-speaker variations. According to at least one embodiment, a method and apparatus employing classifier adaptation based on field data in a deployed voice-based interactive system comprise: collecting representations of voice characteristics, in association with corresponding speakers, the representations being generated by the deployed voice-based interactive system; updating parameters of the classifier, used in speaker recognition, based on the representations collected; and employing the classifier, with the corresponding parameters updated, in performing speaker recognition.

Proceedings ArticleDOI
01 Sep 2013
TL;DR: A novel countermeasure based on the analysis of speech signals using local binary patterns followed by a one-class classification approach is presented, which captures differences in the spectro-temporal texture of genuine and spoofed speech, but relies only on a model of the former.
Abstract: The vulnerability of automatic speaker verification systems to spoofing is now well accepted. While recent work has shown the potential to develop countermeasures capable of detecting spoofed speech signals, existing solutions typically function well only for specific attacks on which they are optimised. Since the exact nature of spoofing attacks can never be known in practice, there is thus a need for generalised countermeasures which can detect previously unseen spoofing attacks. This paper presents a novel countermeasure based on the analysis of speech signals using local binary patterns followed by a one-class classification approach. The new countermeasure captures differences in the spectro-temporal texture of genuine and spoofed speech, but relies only on a model of the former. We report experiments with three different approaches to spoofing and with a state-of-the-art i-vector speaker verification system which uses probabilistic linear discriminant analysis for intersession compensation. While a support vector machine classifier is tuned with examples of converted voice, it delivers reliable detection of spoofing attacks using synthesized speech and artificial signals, attacks for which it is not optimised.
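To convey the flavour of the countermeasure, here is a hedged sketch pairing local binary pattern statistics computed on a time-feature representation with a one-class model trained on genuine speech only; the exact feature plane and classifier settings used in the paper differ from these illustrative choices.

```python
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.svm import OneClassSVM

def lbp_histogram(feature_plane, points=8, radius=1):
    """feature_plane: 2-D time-feature representation (e.g., a cepstrogram)."""
    lbp = local_binary_pattern(feature_plane, points, radius, method="uniform")
    hist, _ = np.histogram(lbp, bins=points + 2, range=(0, points + 2), density=True)
    return hist

# Train on genuine speech only; spoofed recordings should then score as outliers.
genuine = np.stack([lbp_histogram(np.random.randn(60, 100)) for _ in range(50)])
detector = OneClassSVM(nu=0.1).fit(genuine)
```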

Patent
21 Jun 2013
TL;DR: In this article, the authors describe methods and computer systems for providing audio-activated resource access for user devices by storing instructions to cause the processor to perform operations, comprising capturing audio at a user device and receiving a resource corresponding to the identified speaker entry in the server system.
Abstract: This disclosure includes, for example, methods and computer systems for providing audio-activated resource access for user devices. The computer systems may store instructions to cause the processor to perform operations, comprising capturing audio at a user device. The operations may also comprise using a speaker recognition system to identify a speaker in the transmitted audio and/or using a speech-to-text converter to identify text in the captured audio. The speaker identity or a condensed version of the speaker identity or other metadata along with the speaker identity may be transmitted to a server system to determine a corresponding speaker identity entry. The operations may also comprise receiving a resource corresponding to the identified speaker entry in the server system.

Journal ArticleDOI
TL;DR: This article presents frameworks for privacy-preserving speaker verification and speaker identification systems, where the system is able to perform the necessary operations without being able to observe the speech input provided by the user.
Abstract: Speech, being a unique characteristic of an individual, is widely used in speaker verification and speaker identification tasks in applications such as authentication and surveillance, respectively. In this article, we present frameworks for privacy-preserving speaker verification and speaker identification systems, where the system is able to perform the necessary operations without being able to observe the speech input provided by the user. In a speech-based authentication setting, this privacy constraint protects against an adversary who can break into the system and use the speech models to impersonate legitimate users. In surveillance applications, we require the system to first identify if the speech recording belongs to a suspect while preserving the privacy constraints. This prevents the system from listening in on conversations of innocent individuals. In this paper, we formalize the privacy criteria for the speaker verification and speaker identification problems and construct Gaussian mixture model-based protocols. We also report experiments with a prototype implementation of the protocols on a standardized dataset, measuring execution time and accuracy.

Journal ArticleDOI
TL;DR: Results highlight the importance of considering the quality metrics like duration in calibrating the scores for automatic speaker recognition systems and the need for a calibration approach to deal with these effects using quality measure functions (QMFs).
Abstract: This paper investigates the effect of utterance duration on the calibration of a modern i-vector speaker recognition system with probabilistic linear discriminant analysis (PLDA) modeling. A calibration approach that deals with these effects using quality measure functions (QMFs) is proposed to include duration in the calibration transformation. Extensive experiments are performed in order to evaluate the robustness of the proposed calibration approach to conditions unseen in the training of the calibration parameters. Using the latest NIST corpora for evaluation, the results highlight the importance of considering quality metrics such as duration when calibrating the scores of automatic speaker recognition systems.

Proceedings ArticleDOI
26 May 2013
TL;DR: This study presents systems submitted by the Center for Robust Speech Systems from UTDallas to NIST SRE 2018, and investigates three alternative front-end speaker embedding frameworks, finding them to be both complementary and effective in achieving overall improved speaker recognition performance.
Abstract: In this study, we present systems submitted by the Center for Robust Speech Systems (CRSS) from UTDallas to NIST SRE 2018 (SRE18). Three alternative front-end speaker embedding frameworks are investigated: (i) i-vector, (ii) x-vector, and (iii) a modified triplet speaker embedding system (t-vector). As in the previous SRE, language mismatch between training and enrollment/test data, the so-called domain mismatch, remains a major challenge in this evaluation. In addition, SRE18 also introduces a small portion of audio from an unstructured video corpus, in which speaker detection/diarization needs to be effectively integrated into speaker recognition for system robustness. In our system development, we focused on: (i) building novel deep neural network based speaker discriminative embedding systems as utterance-level feature representations, (ii) exploring alternative dimension reduction methods, back-end classifiers, and score normalization techniques which can incorporate unlabeled in-domain data for domain adaptation, (iii) finding improved data set configurations for the speaker embedding network, LDA/PLDA, and score calibration training, and (iv) investigating effective score calibration and fusion strategies. The final resulting systems are shown to be both complementary and effective in achieving overall improved speaker recognition performance.

Proceedings ArticleDOI
26 May 2013
TL;DR: This study assesses the performance of Probabilistic Linear Discriminant Analysis (PLDA) and i-vector normalization for a text-dependent verification task and suggests that such a scoring regime remains to be optimized.
Abstract: The importance of phonetic variability for short duration speaker verification is widely acknowledged. This paper assesses the performance of Probabilistic Linear Discriminant Analysis (PLDA) and i-vector normalization for a text-dependent verification task. We show that using a class definition based on both speaker and phonetic content significantly improves the performance of a state-of-the-art system. We also compare four models for computing the verification scores using multiple enrollment utterances and show that using PLDA intrinsic scoring obtains the best performance in this context. This study suggests that such a scoring regime remains to be optimized.

PatentDOI
Matthew Sharifi1, Dominik Roblek1
01 May 2013
TL;DR: A system based on i-vectors, a current approach for speaker identification, and locality sensitive hashing, an algorithm for fast nearest neighbor search in high dimensions, which is approximately one to two orders of magnitude faster than a linear search while maintaining the identification accuracy of an i-vector-based system is proposed.
Abstract: Speaker identification is one of the main tasks in speech processing. In addition to identification accuracy, large-scale applications of speaker identification give rise to another challenge: fast search in the database of speakers. In this paper, we propose a system based on i-vectors, a current approach for speaker identification, and locality sensitive hashing, an algorithm for fast nearest neighbor search in high dimensions. The connection between the two techniques is the cosine distance: on the one hand, we use the cosine distance to compare i-vectors, on the other hand, locality sensitive hashing allows us to quickly approximate the cosine distance in our retrieval procedure. We evaluate our approach on a realistic data set from YouTube with about 1,000 speakers. The results show that our algorithm is approximately one to two orders of magnitude faster than a linear search while maintaining the identification accuracy of an i-vector-based system.
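The pairing described above is straightforward to sketch: random-hyperplane LSH produces bit signatures whose collisions approximate cosine similarity, so only the colliding i-vectors need to be re-scored exactly. Table and bit counts below are illustrative, not the paper's configuration.

```python
import numpy as np

class CosineLSH:
    def __init__(self, dim, n_bits=16, n_tables=8, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((n_tables, n_bits, dim))  # random hyperplanes
        self.tables = [dict() for _ in range(n_tables)]

    def _keys(self, vec):
        # Sign pattern against each table's hyperplanes approximates cosine similarity.
        return [tuple((p @ vec > 0).astype(int)) for p in self.planes]

    def add(self, vec, speaker_id):
        for table, key in zip(self.tables, self._keys(vec)):
            table.setdefault(key, []).append(speaker_id)

    def query(self, vec):
        cands = set()
        for table, key in zip(self.tables, self._keys(vec)):
            cands.update(table.get(key, []))
        return cands   # candidates to re-score exactly with cosine distance
```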

Proceedings ArticleDOI
26 May 2013
TL;DR: A novel approach for noise-robust speaker recognition, where the model of distortions caused by additive and convolutive noises is integrated into the i-vector extraction framework, based on a vector Taylor series approximation widely successful in noise-robust speech recognition.
Abstract: We propose a novel approach for noise-robust speaker recognition, where the model of distortions caused by additive and convolutive noises is integrated into the i-vector extraction framework. The model is based on a vector Taylor series (VTS) approximation widely successful in noise-robust speech recognition. The model allows for extracting “cleaned-up” i-vectors which can be used in a standard i-vector back end. We evaluate the proposed framework on the PRISM corpus, a NIST-SRE-like corpus, where noisy conditions were created by artificially adding babble noise to clean speech segments. Results show that VTS i-vectors give significant improvements in all noisy conditions compared to a state-of-the-art baseline speaker recognition system. More importantly, the proposed framework is robust to noise, as the improvements are maintained when the system is trained on clean data.

Patent
Ting Lu1
08 Jul 2013
TL;DR: In this article, the authors propose a method for updating a voiceprint feature model, and a terminal. The method comprises: obtaining an original audio stream comprising at least one speaker, and matching the respective audio stream of each speaker with an original voiceprint feature model, so as to obtain the successfully matched audio stream.
Abstract: A method for updating a voiceprint feature model and a terminal. The method comprises: obtaining an original audio stream comprising at least one speaker (S101); obtaining the respective audio stream of each speaker in the at least one speaker in the original audio stream according to a preset speaker segmentation and clustering algorithm (S102); matching the respective audio stream of each speaker in the at least one speaker with an original voiceprint feature model, so as to obtain the successfully-matched audio stream (S103); and using the successfully-matched audio stream as an additional audio stream training sample used for generating the original voiceprint feature model, and updating the original voiceprint feature model (S104). According to the present invention, the valid audio stream during a conversation process is adaptively extracted and used as the additional audio stream training sample, and the additional audio stream training sample is used for dynamically correcting the original voiceprint feature model, thus achieving a purpose of improving the precision of the voiceprint feature model and the accuracy of recognition under the premise of high practicability.

Journal ArticleDOI
TL;DR: It is demonstrated that the discriminative power of i-vectors reaches a plateau quickly as the utterance length increases, suggesting that it is possible to make the best use of a long conversation by partitioning it into a number of sub-utterances so that more i-vectors can be produced for each conversation.
Abstract: The success of the recent i-vector approach to speaker verification relies on the capability of i-vectors to capture speaker characteristics and on the subsequent channel compensation methods to suppress channel variability. Typically, given an utterance, an i-vector is determined from the utterance regardless of its length. This paper investigates how the utterance length affects the discriminative power of i-vectors and demonstrates that the discriminative power of i-vectors reaches a plateau quickly as the utterance length increases. This observation suggests that it is possible to make the best use of a long conversation by partitioning it into a number of sub-utterances so that more i-vectors can be produced for each conversation. To increase the number of sub-utterances without sacrificing the representation power of the corresponding i-vectors, repeated applications of frame-index randomization and utterance partitioning are performed. Results on the NIST 2010 speaker recognition evaluation (SRE) suggest that (1) using more i-vectors per conversation helps to find more robust linear discriminant analysis (LDA) and within-class covariance normalization (WCCN) transformation matrices, especially when the number of conversations per training speaker is limited; and (2) increasing the number of i-vectors per target speaker helps the i-vector based support vector machines (SVMs) to find better decision boundaries, making SVM scoring outperform cosine distance scoring by 19% and 9% in terms of minimum normalized DCF and EER.
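A rough sketch of the utterance partitioning with frame-index randomization step, under the assumption that an i-vector extractor is applied to each returned chunk elsewhere; partition counts are illustrative.

```python
import numpy as np

def partition_with_randomization(frames, n_partitions=4, n_repeats=2, seed=0):
    """frames: (num_frames, dim) acoustic features of one conversation side.
    Returns a list of sub-utterances, each of which yields its own i-vector."""
    rng = np.random.default_rng(seed)
    sub_utterances = []
    for _ in range(n_repeats):
        order = rng.permutation(len(frames))                 # frame-index randomization
        for chunk in np.array_split(frames[order], n_partitions):
            sub_utterances.append(chunk)
    return sub_utterances
```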

Journal ArticleDOI
TL;DR: A new paradigm for unsupervised audiovisual document structuring is introduced, which employs the Kullback-Leibler divergence as a cost function and imposes a temporal smoothness constraint to the activations.
Abstract: This paper introduces a new paradigm for unsupervised audiovisual document structuring. In this paradigm, a novel Nonnegative Matrix Factorization (NMF) algorithm is applied on histograms of counts (relating to a bag of features representation of the content) to jointly discover latent structuring patterns and their activations in time. Our NMF variant employs the Kullback-Leibler divergence as a cost function and imposes a temporal smoothness constraint to the activations. It is solved by a majorization-minimization technique. The approach proposed is meant to be generic and is particularly well suited to applications where the structuring patterns may overlap in time. As such, it is evaluated on two person-oriented video structuring tasks (one using the visual modality and the second the audio). This is done using a challenging database of political debate videos. Our results outperform reference results obtained by a method using Hidden Markov Models. Further, we show the potential that our general approach has for audio speaker diarization.
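For orientation, the sketch below shows plain KL-divergence NMF with the standard multiplicative updates; the paper's variant additionally imposes a temporal smoothness penalty on the activations and is solved by majorization-minimization, both of which are omitted here.

```python
import numpy as np

def kl_nmf(V, rank=10, n_iter=200, eps=1e-9):
    """V: (features, time) nonnegative histogram matrix.
    Returns patterns W and their activations in time H."""
    rng = np.random.default_rng(0)
    W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((rank, V.shape[1])) + eps
    ones = np.ones_like(V)
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.T @ ones + eps)   # multiplicative update for activations
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (ones @ H.T + eps)   # multiplicative update for patterns
    return W, H
```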

01 Jan 2013
TL;DR: This paper proposes to use as a pseudo-ivector extractor a Deep Belief Network (DBN) architecture, trained with the utterances of several hundred speakers, to model the distribution of the output units, given an utterance, by a reduced set of parameters that embed the speaker characteristics.
Abstract: Most state-of-the-art speaker recognition systems are based on Gaussian Mixture Models (GMMs), where a speech segment is represented by a compact representation, referred to as "identity vector" (ivector for short), extracted by means of Factor Analysis. The main advantage of this representation is that the problem of intersession variability is deferred to a second stage, dealing with low-dimensional vectors rather than with the high-dimensional space of the GMM means. In this paper, we propose to use as a pseudo-ivector extractor a Deep Belief Network (DBN) architecture, trained with the utterances of several hundred speakers. In this approach, the DBN performs a non-linear transformation of the input features, which produces the probability that an output unit is on, given the input features. We model the distribution of the output units, given an utterance, by a reduced set of parameters that embed the speaker characteristics. Tested on the dataset exploited for training the systems that have been used for the NIST 2012 Speaker Recognition Evaluation, this approach shows promising results.

Proceedings ArticleDOI
01 Aug 2013
TL;DR: It is shown that the relative benefit of using OOD data varies considerably from speaker to speaker and is only loosely correlated with the severity of a speaker's impairments, and an alternative approach with its focus on the feature extraction stage is investigated.
Abstract: Recently there has been increasing interest in ways of using out-of-domain (OOD) data to improve automatic speech recognition performance in domains where only limited data is available. This paper focuses on one such domain, namely that of disordered speech for which only very small databases exist, but where normal speech can be considered OOD. Standard approaches for handling small data domains use adaptation from OOD models into the target domain, but here we investigate an alternative approach with its focus on the feature extraction stage: OOD data is used to train feature-generating deep belief neural networks. Using AMI meeting and TED talk datasets, we investigate various tandem-based speaker independent systems as well as maximum a posteriori adapted speaker dependent systems. Results on the UAspeech isolated word task of disordered speech are very promising with our overall best system (using a combination of AMI and TED data) giving a correctness of 62.5%; an increase of 15% on previously best published results based on conventional model adaptation. We show that the relative benefit of using OOD data varies considerably from speaker to speaker and is only loosely correlated with the severity of a speaker’s impairments. Index Terms: Speech recognition, Tandem features, Deep belief neural network, Disordered speech

Patent
25 Sep 2013
TL;DR: In this paper, the human speech in the audio data was matched with the pattern of visual features in the image data associated with speaking, and a primary speaker was selected from among matched human speech.
Abstract: An aspect provides a method, including: receiving image data from a visual sensor of an information handling device; receiving audio data from one or more microphones of the information handling device; identifying, using one or more processors, human speech in the audio data; identifying, using the one or more processors, a pattern of visual features in the image data associated with speaking; matching, using the one or more processors, the human speech in the audio data with the pattern of visual features in the image data associated with speaking; selecting, using the one or more processors, a primary speaker from among matched human speech; assigning control to the primary speaker; and performing one or more actions based on audio input of the primary speaker. Other aspects are described and claimed.

Proceedings ArticleDOI
26 May 2013
TL;DR: The speaker identification system developed by the Patrol team for the first phase of the DARPA RATS (Robust Automatic Transcription of Speech) program, which seeks to advance state of the art detection capabilities on audio from highly degraded communication channels, is described.
Abstract: This paper describes the speaker identification (SID) system developed by the Patrol team for the first phase of the DARPA RATS (Robust Automatic Transcription of Speech) program, which seeks to advance state-of-the-art detection capabilities on audio from highly degraded communication channels. We present results using multiple SID systems differing mainly in the algorithm used for voice activity detection (VAD) and feature extraction. We show that (a) unsupervised VAD performs as well as supervised methods in terms of downstream SID performance, (b) noise-robust feature extraction methods such as CFCCs outperform MFCC front-ends on noisy audio, and (c) fusion of multiple systems provides a 24% relative improvement in EER compared to the single best system when using a novel SVM-based fusion algorithm that uses side information such as gender, language, and channel id.

Proceedings ArticleDOI
26 May 2013
TL;DR: The overlapping speech detection systems developed by Orange and LIMSI for the ETAPE evaluation campaign on French broadcast news and debates are described, and it is shown that the proposed strategy improves the diarization error rate in all situations, by up to 26.1% relative in the best configuration.
Abstract: The overlapping speech detection systems developed by Orange and LIMSI for the ETAPE evaluation campaign on French broadcast news and debates are described. Using either cepstral features or a multi-pitch analysis, an F1-measure for overlapping speech detection of up to 59.2% is reported on the TV data of the ETAPE evaluation set, where 6.7% of the speech was measured as overlapping, ranging from 1.2% in the news to 10.4% in the debates. Overlapping speech segments were excluded during the speaker diarization stage, and these segments were further labelled with the two nearest speaker labels, taking into account the temporal distance. We describe the effects of this strategy for various overlapping speech systems and show that it improves the diarization error rate in all situations, by up to 26.1% relative in our best configuration.
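A simple sketch of the post-processing step described above (the segment representation and the tie-breaking rule are assumptions): an overlap region excluded from diarization is assigned the speakers of the two temporally nearest single-speaker segments.

```python
def label_overlap(overlap_start, overlap_end, segments):
    """segments: list of (start, end, speaker) single-speaker segments.
    Returns the two speaker labels nearest in time to the overlap region."""
    def distance(seg):
        start, end, _ = seg
        if end < overlap_start:
            return overlap_start - end       # segment ends before the overlap
        if start > overlap_end:
            return start - overlap_end       # segment starts after the overlap
        return 0.0                           # segment touches the overlap region
    ranked = sorted(segments, key=distance)
    first = ranked[0][2]
    second = next((s[2] for s in ranked if s[2] != first), first)
    return first, second
```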

Proceedings ArticleDOI
26 May 2013
TL;DR: It is demonstrated that, by using different information/data configuration and modeling schemes, performance of the fused system can be significantly improved compared to an individual system using a single front-end and back-end.
Abstract: This study explores various back-end classifiers for robust speaker recognition in multi-session enrollment, with emphasis on optimal utilization and organization of speaker information present in the development data. Our objective is to construct a highly discriminative back-end framework by fusing several back-ends on an i-vector system framework. It is demonstrated that, by using different information/data configuration and modeling schemes, performance of the fused system can be significantly improved compared to an individual system using a single front-end and back-end. Averaged across both genders, we obtain a relative improvement in EER and minDCF by 56.5% and 49.4%, respectively. Consistent performance gains obtained using the proposed strategy validates its effectiveness. This system is part of the CRSS' NIST SRE 2012 submission system.

Proceedings ArticleDOI
25 Aug 2013
TL;DR: Evaluating speaker diarization and ASR systems in the presence of overlapping speech required extending the metric definitions and adapting the algorithmic approaches used to implement them; this paper presents these extensions and the open tools that provide them.
Abstract: Speaker Diarization and Automatic Speech Recognition have been topics of research for decades, and evaluating the developed systems has been required for almost as long. Following the NIST initiatives, a number of metrics have become standard for these evaluations, namely the Diarization Error Rate and the Word Error Rate. The initial definitions of these metrics and, more importantly, their implementations, were designed for single-speaker speech. One of the aims of the OSEO Quaero and the ANR ETAPE projects was to investigate the capabilities of diarization and ASR systems in the presence of overlapping speech. Evaluating said systems required extending the metric definitions and adapting the algorithmic approaches required for their implementation. This paper presents these extensions and adaptations and the open tools that provide them.
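As background to these extensions, the single-speaker Diarization Error Rate standardised by NIST is commonly written as the time-weighted sum of missed speech, false alarm speech, and speaker confusion:

```latex
\mathrm{DER} = \frac{T_{\text{missed speech}} + T_{\text{false alarm}} + T_{\text{speaker confusion}}}{T_{\text{total scored speech}}}
```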