Showing papers on "Speaker recognition published in 2014"


Proceedings ArticleDOI
04 May 2014
TL;DR: Experimental results show the DNN-based speaker verification system achieves good performance compared to a popular i-vector system on a small footprint text-dependent speaker verification task, and is more robust to additive noise, outperforming the i-vector system at low False Rejection operating points.
Abstract: In this paper we investigate the use of deep neural networks (DNNs) for a small footprint text-dependent speaker verification task. At the development stage, a DNN is trained to classify speakers at the frame level. During speaker enrollment, the trained DNN is used to extract speaker-specific features from the last hidden layer. The average of these speaker features, or d-vector, is taken as the speaker model. At the evaluation stage, a d-vector is extracted for each utterance and compared to the enrolled speaker model to make a verification decision. Experimental results show the DNN-based speaker verification system achieves good performance compared to a popular i-vector system on a small footprint text-dependent speaker verification task. In addition, the DNN-based system is more robust to additive noise and outperforms the i-vector system at low False Rejection operating points. Finally, the combined system outperforms the i-vector system by 14% and 25% relative in equal error rate (EER) for clean and noisy conditions respectively.
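A minimal sketch of the d-vector flow described above, assuming a trained frame-level speaker-classification DNN; `dnn_last_hidden`, the 256-unit layer size, and the decision threshold are illustrative placeholders, not the paper's implementation.

```python
import numpy as np

def dnn_last_hidden(frames: np.ndarray) -> np.ndarray:
    """Placeholder for the trained DNN: map (num_frames, feat_dim) frames
    to last-hidden-layer activations (here a random ReLU layer)."""
    rng = np.random.default_rng(0)
    W = rng.standard_normal((frames.shape[1], 256))
    return np.maximum(frames @ W, 0.0)

def d_vector(frames: np.ndarray) -> np.ndarray:
    """Average frame-level activations into one length-normalized vector."""
    v = dnn_last_hidden(frames).mean(axis=0)
    return v / np.linalg.norm(v)

def verify(enroll_utterances, test_frames, threshold=0.5) -> bool:
    """Cosine-compare the test d-vector to the enrolled speaker model."""
    model = np.mean([d_vector(u) for u in enroll_utterances], axis=0)
    model /= np.linalg.norm(model)
    return float(model @ d_vector(test_frames)) >= threshold
```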

1,000 citations


Proceedings ArticleDOI
04 May 2014
TL;DR: A novel framework for speaker recognition in which extraction of sufficient statistics for the state-of-the-art i-vector model is driven by a deep neural network (DNN) trained for automatic speech recognition (ASR) to produce frame alignments.
Abstract: We propose a novel framework for speaker recognition in which extraction of sufficient statistics for the state-of-the-art i-vector model is driven by a deep neural network (DNN) trained for automatic speech recognition (ASR). Specifically, the DNN replaces the standard Gaussian mixture model (GMM) to produce frame alignments. The use of an ASR-DNN system in the speaker recognition pipeline is attractive as it integrates the information from speech content directly into the statistics, allowing the standard backends to remain unchanged. Improvements from the proposed framework over a state-of-the-art system are 30% relative at the equal error rate when evaluated on the telephone conditions from the 2012 NIST speaker recognition evaluation (SRE). The proposed framework is a successful way to efficiently leverage transcribed data for speaker recognition, thus opening up a wide spectrum of research directions.
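As a rough sketch of the alignment swap the abstract describes, the Baum-Welch statistics can be accumulated from DNN senone posteriors in place of GMM component posteriors; `senone_posteriors` below is a hypothetical stand-in for the ASR-DNN, and the dimensions are illustrative.

```python
import numpy as np

def senone_posteriors(frames: np.ndarray, num_senones: int = 2048) -> np.ndarray:
    """Placeholder ASR-DNN: (T, feat_dim) -> (T, num_senones) posteriors."""
    rng = np.random.default_rng(0)
    logits = frames @ rng.standard_normal((frames.shape[1], num_senones))
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)        # row-wise softmax

def baum_welch_stats(frames: np.ndarray):
    """Zeroth-/first-order statistics with DNN frame alignments:
    N_c = sum_t gamma_t(c), F_c = sum_t gamma_t(c) * x_t."""
    gamma = senone_posteriors(frames)              # (T, C) alignments
    N = gamma.sum(axis=0)                          # (C,)
    F = gamma.T @ frames                           # (C, feat_dim)
    return N, F
```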

631 citations


Journal ArticleDOI
TL;DR: The HiLAM system, based on a three-layer acoustic architecture, outperforms a state-of-the-art i-vector/PLDA system in most of the scenarios, and provides a reference evaluation scheme and reference performance on the RSR2015 database to the research community.

274 citations


01 Jan 2014
TL;DR: Although the proposed i-vectors yield inferior performance compared to the standard ones, they are capable of attaining 16% relative improvement when fused with them, meaning that they carry useful complementary information about the speaker’s identity.
Abstract: We examine the use of Deep Neural Networks (DNN) in extracting Baum-Welch statistics for i-vector-based text-independent speaker recognition. Instead of training the universal background model using the standard EM algorithm, the components are predefined and correspond to the set of triphone states, the posterior occupancy probabilities of which are modeled by a DNN. Those assignments are then combined with the standard 60-dim MFCC features to calculate first-order Baum-Welch statistics in order to train the i-vector extractor and extract i-vectors. The DNN-based assignment forces the i-vectors to capture the idiosyncratic way in which each speaker pronounces each particular triphone state, which can enrich the standard short-term spectral representation of the standard i-vectors. After experimenting with Switchboard data and a baseline PLDA classifier, our results showed that although the proposed i-vectors yield inferior performance compared to the standard ones, they are capable of attaining 16% relative improvement when fused with them, meaning that they carry useful complementary information about the speaker's identity. A further experiment with a different DNN configuration attained comparable performance with the baseline i-vectors on NIST 2012 (condition C2, female).
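The 16% gain the abstract reports comes from fusing the two systems' scores; a minimal sketch of linear score-level fusion follows, with the weight `alpha` as an illustrative assumption rather than the paper's tuned value.

```python
import numpy as np

def fuse_scores(scores_standard: np.ndarray, scores_dnn: np.ndarray,
                alpha: float = 0.5) -> np.ndarray:
    """Weighted sum of (calibrated) trial scores from the two i-vector systems."""
    return alpha * scores_standard + (1.0 - alpha) * scores_dnn
```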

241 citations


Proceedings ArticleDOI
04 May 2014
TL;DR: Modifications of the basic algorithm are developed which result in significant reductions in word error rates (WERs), and the algorithms are shown to combine well with speaker adaptation by backpropagation, resulting in a 9% relative WER reduction.
Abstract: We propose providing additional utterance-level features as inputs to a deep neural network (DNN) to facilitate speaker, channel and background normalization. Modifications of the basic algorithm are developed which result in significant reductions in word error rates (WERs). The algorithms are shown to combine well with speaker adaptation by backpropagation, resulting in a 9% relative WER reduction. We address implementation of the algorithm for a streaming task.

227 citations


Proceedings ArticleDOI
01 Dec 2014
TL;DR: A system that incorporates probabilistic linear discriminant analysis (PLDA) for i-vector scoring and uses unsupervised calibration of the PLDA scores to determine the clustering stopping criterion is proposed, and it is shown that PLDA scoring outperforms the same system with cosine scoring, and that overlapping segments reduce diarization error rate (DER) as well.
Abstract: Speaker diarization via unsupervised i-vector clustering has gained popularity in recent years. In this approach, i-vectors are extracted from short clips of speech segmented from a larger multi-speaker conversation and organized into speaker clusters, typically according to their cosine score. In this paper, we propose a system that incorporates probabilistic linear discriminant analysis (PLDA) for i-vector scoring, a method already frequently utilized in speaker recognition tasks, and uses unsupervised calibration of the PLDA scores to determine the clustering stopping criterion. We also demonstrate that denser sampling in the i-vector space with overlapping temporal segments provides a gain in the diarization task. We test our system on the CALLHOME conversational telephone speech corpus, which includes multiple languages and a varying number of speakers, and we show that PLDA scoring outperforms the same system with cosine scoring, and that overlapping segments reduce diarization error rate (DER) as well.
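A compact sketch of the clustering loop described above, assuming a pairwise `plda_score`; the cosine placeholder and the fixed stopping threshold stand in for a trained PLDA model and the paper's unsupervised score calibration.

```python
import numpy as np

def plda_score(a: np.ndarray, b: np.ndarray) -> float:
    """Placeholder similarity; a real system scores pairs with trained PLDA."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cluster_ivectors(ivectors, stop_threshold: float = 0.0):
    """Agglomerative clustering: merge the closest pair of clusters until the
    best merge score falls below the (calibrated) stopping threshold."""
    clusters = [[i] for i in range(len(ivectors))]
    centroid = lambda c: np.mean([ivectors[i] for i in c], axis=0)
    while len(clusters) > 1:
        best, pair = -np.inf, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = plda_score(centroid(clusters[i]), centroid(clusters[j]))
                if s > best:
                    best, pair = s, (i, j)
        if best < stop_threshold:       # stopping criterion from calibration
            break
        i, j = pair
        clusters[i] += clusters.pop(j)  # j > i, so index i stays valid
    return clusters                     # lists of segment indices per speaker
```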

226 citations


Patent
28 Apr 2014
TL;DR: In this paper, a client's voice is recorded and characteristics of the recording are used to create and store a voice print. When an enrolled client seeks access to secure information over a network, a sample voice recording is compared to at least one voice print.
Abstract: Systems and methods providing for secure voice print authentication over a network are disclosed herein. During an enrollment stage, a client's voice is recorded and characteristics of the recording are used to create and store a voice print. When an enrolled client seeks access to secure information over a network, a sample voice recording is created. The sample voice recording is compared to at least one voice print. If a match is found, the client is authenticated and granted access to secure information. Systems and methods providing for a dual-use voice analysis system are disclosed herein. Speech recognition is achieved by comparing characteristics of words spoken by a speaker to one or more templates of human language words. Speaker identification is achieved by comparing characteristics of a speaker's speech to one or more templates, or voice prints. The system is adapted to increase or decrease matching constraints depending on whether speaker identification or speaker recognition is desired.

192 citations


Journal ArticleDOI
TL;DR: A general adaptation scheme for DNNs based on discriminant condition codes is proposed, in which the codes are fed directly to various layers of a pre-trained DNN through a new set of connection weights; the methods are quite effective at adapting large DNN models using only a small amount of adaptation data.
Abstract: Fast adaptation of deep neural networks (DNN) is an important research topic in deep learning. In this paper, we have proposed a general adaptation scheme for DNN based on discriminant condition codes, which are directly fed to various layers of a pre-trained DNN through a new set of connection weights. Moreover, we present several training methods to learn connection weights from training data as well as the corresponding adaptation methods to learn new condition codes from adaptation data for each new test condition. In this work, the fast adaptation scheme is applied to supervised speaker adaptation in speech recognition based on either the frame-level cross-entropy or the sequence-level maximum mutual information training criterion. We have proposed three different ways to apply this adaptation scheme based on the so-called speaker codes: i) nonlinear feature normalization in feature space; ii) direct model adaptation of DNN based on speaker codes; iii) joint speaker adaptive training with speaker codes. We have evaluated the proposed adaptation methods in two standard speech recognition tasks, namely TIMIT phone recognition and large vocabulary speech recognition in the Switchboard task. Experimental results have shown that all three methods are quite effective at adapting large DNN models using only a small amount of adaptation data. For example, the Switchboard results have shown that the proposed speaker-code-based adaptation methods may achieve up to 8-10% relative error reduction using only a few dozen adaptation utterances per speaker. Finally, we have achieved very good performance in Switchboard (12.1% in WER) after speaker adaptation using the sequence training criterion, which is very close to the best performance reported on this task ("Deep convolutional neural networks for LVCSR," T. N. Sainath et al., Proc. IEEE Acoust., Speech, Signal Process., 2013).
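A minimal sketch of the speaker-code idea, under the assumption that the code enters a hidden layer through an extra weight matrix; the names and shapes are illustrative, not the authors' implementation.

```python
import numpy as np

def adapted_hidden(x: np.ndarray, W: np.ndarray, b: np.ndarray,
                   speaker_code: np.ndarray, A: np.ndarray) -> np.ndarray:
    """Hidden layer with the speaker code injected via connection weights A.
    At adaptation time only `speaker_code` (a handful of parameters per
    speaker) is re-estimated by backpropagation; W, b, and A stay fixed."""
    return np.maximum(x @ W + speaker_code @ A + b, 0.0)
```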

157 citations


Proceedings ArticleDOI
01 Dec 2014
TL;DR: This study evaluates the vulnerability of text-dependent speaker verification systems under the replay attack using a standard benchmarking database, and proposes an anti-spoofing technique to safeguard the speaker verification system.
Abstract: Replay, which is to play back a pre-recorded speech sample, presents a genuine risk to automatic speaker verification technology. In this study, we evaluate the vulnerability of text-dependent speaker verification systems under the replay attack using a standard benchmarking database, and also propose an anti-spoofing technique to safeguard the speaker verification systems. The key idea of the spoofing detection technique is to decide whether the presented sample matches any previously stored speech samples based on a similarity score. The experiments conducted on the RSR2015 database showed that the equal error rate (EER) and false acceptance rate (FAR) increased from 2.92% to 25.56% and 78.36%, respectively, as a result of the replay attack. This confirmed the vulnerability of speaker verification to replay attacks. On the other hand, our proposed spoofing countermeasure was able to reduce the FARs from 78.36% and 73.14% to 0.06% and 0.0% for the male and female systems, respectively, in the face of replay spoofing. The experiments confirmed the effectiveness of the proposed anti-spoofing technique.
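A small sketch of the described countermeasure: a trial that is nearly identical to any stored past sample of the claimed speaker is flagged as playback. Both `similarity` and the threshold are illustrative assumptions, not the paper's exact scoring function.

```python
import numpy as np

def similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Placeholder score; any utterance-level similarity measure fits here."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_replay(trial, stored_samples, threshold: float = 0.95) -> bool:
    """A near-exact match to a previously stored recording indicates replay,
    since two live repetitions of a pass-phrase are never identical."""
    return any(similarity(trial, s) >= threshold for s in stored_samples)
```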

155 citations


Proceedings ArticleDOI
04 May 2014
TL;DR: This paper shows how this i-vector based speaker adaptation can be used to perform blind speaker adaptation of a hybrid DNN-HMM speech recognition system and reports excellent results on a French language audio transcription task.
Abstract: State of the art speaker recognition systems are based on the i-vector representation of speech segments. In this paper we show how this representation can be used to perform blind speaker adaptation of a hybrid DNN-HMM speech recognition system, and we report excellent results on a French language audio transcription task. The implementation is very simple. An audio file is first diarized and each speaker cluster is represented by an i-vector. Acoustic feature vectors are augmented by the corresponding i-vectors before being presented to the DNN. (The same i-vector is used for all acoustic feature vectors aligned with a given speaker.) This supplementary information improves the DNN's ability to discriminate between phonetic events in a speaker independent way without having to make any modification to the DNN training algorithms. We report results on the ETAPE 2011 transcription task, and show that i-vector based speaker adaptation is effective irrespective of whether cross-entropy or sequence training is used. For cross-entropy training, we obtained a word error rate (WER) reduction from 22.16% to 20.67%, whereas for sequence training the WER reduces from 19.93% to 18.40%.
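The augmentation itself amounts to concatenating the speaker-cluster i-vector onto every acoustic frame, roughly as sketched below (shapes illustrative).

```python
import numpy as np

def augment_frames(frames: np.ndarray, ivector: np.ndarray) -> np.ndarray:
    """(T, feat_dim) frames + (ivec_dim,) i-vector
    -> (T, feat_dim + ivec_dim) DNN inputs; the same i-vector is
    repeated for every frame of the speaker cluster."""
    return np.hstack([frames, np.tile(ivector, (frames.shape[0], 1))])
```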

135 citations


Proceedings ArticleDOI
04 May 2014
TL;DR: It is observed that the adaptation of the PLDA parameters (i.e. across-class and within-class covariances) produces the largest gains, and length-normalization is also important; whereas using an in-domain UBM and T matrix is not crucial.
Abstract: In this paper, we present a comprehensive study on supervised domain adaptation of PLDA based i-vector speaker recognition systems. After describing the system parameters subject to adaptation, we study the impact of their adaptation on recognition performance. Using the recently designed domain adaptation challenge, we observe that the adaptation of the PLDA parameters (i.e. across-class and within-class covariances) produces the largest gains. Nonetheless, length-normalization is also important, whereas using an in-domain UBM and T matrix is not crucial. For the PLDA adaptation, we compare four approaches. Three of them are proposed in this work, and a fourth one was previously published. Overall, the four techniques are successful at leveraging varying amounts of labeled in-domain data and their performance is quite similar. However, our approaches are less involved, and two of them are applicable to a larger class of models (low-rank across-class).
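One simple way to adapt the PLDA parameters named above is to interpolate the out-of-domain and in-domain covariance estimates; the sketch below shows that idea only, with `alpha` as an illustrative weight, and is not any of the paper's four specific approaches.

```python
import numpy as np

def adapt_plda(ac_out: np.ndarray, wc_out: np.ndarray,
               ac_in: np.ndarray, wc_in: np.ndarray, alpha: float = 0.5):
    """Convex combination of across-class (ac) and within-class (wc)
    covariances estimated on out-of-domain and in-domain data."""
    return (alpha * ac_in + (1 - alpha) * ac_out,
            alpha * wc_in + (1 - alpha) * wc_out)
```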

Journal ArticleDOI
TL;DR: Robust SID is performed with speaker models trained in selected reverberant conditions, on the basis of bounded marginalization and direct masking, which substantially improves SID performance over related systems across a wide range of reverberation times and signal-to-noise ratios.
Abstract: Robustness of speaker recognition systems is crucial for real-world applications, which typically contain both additive noise and room reverberation. However, the combined effects of additive noise and convolutive reverberation have been rarely studied in speaker identification (SID). This paper addresses this issue in two phases. We first remove background noise through binary masking using a deep neural network classifier. Then we perform robust SID with speaker models trained in selected reverberant conditions, on the basis of bounded marginalization and direct masking. Evaluation results show that the proposed system substantially improves SID performance over related systems across a wide range of reverberation times and signal-to-noise ratios.
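The masking front end reduces to an element-wise product between the noisy spectrogram and the DNN classifier's binary speech-dominance decisions; a trivial sketch:

```python
import numpy as np

def apply_binary_mask(spectrogram: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Direct masking: keep time-frequency units the DNN classifier labels
    speech-dominant (mask == 1), zero out the noise-dominant rest."""
    return spectrogram * mask
```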

01 Jan 2014
TL;DR: This paper presents a framework for unsupervised domain adaptation of PLDA based i-vector speaker recognition systems, and explores two versions of agglomerative hierarchical clustering that use the PLDA system.
Abstract: In this paper, we present a framework for unsupervised domain adaptation of PLDA based i-vector speaker recognition systems. Given an existing out-of-domain PLDA system, we use it to cluster unlabeled in-domain data, and then use this data to adapt the parameters of the PLDA system. We explore two versions of agglomerative hierarchical clustering that use the PLDA system. We also study two automatic ways to determine the number of clusters in the in-domain dataset. The proposed techniques are experimentally validated in the recently introduced domain adaptation challenge. This challenge provides a very useful setup to explore domain adaptation since it illustrates a significant performance gap between an in-domain and out-of-domain system. Using agglomerative hierarchical clustering with a stopping criterion based on unsupervised calibration we are able to recover 85% of this gap.

Patent
20 Jan 2014
TL;DR: In this paper, an application detecting speaker locations and prompting a user to input rough room boundaries and a desired listener location in the room is used to determine optimum speaker locations/frequency assignations/speaker parameters.
Abstract: In an audio speaker network, setup of speaker location, sound track or channel assignation, and speaker parameters is facilitated by an application detecting speaker locations and prompting a user to input rough room boundaries and a desired listener location in the room. Based on this, optimum speaker locations/frequency assignations/speaker parameters may be determined and output.

Proceedings ArticleDOI
14 Sep 2014
TL;DR: Experiments show that compared with the baseline DNN, the SAT-DNN model brings 7.5% and 6.0% relative improvement when DNN inputs are speaker-independent and speaker-adapted features respectively.
Abstract: We investigate the concept of speaker adaptive training (SAT) in the context of deep neural network (DNN) acoustic models. Previous studies have shown success of performing speaker adaptation for DNNs in speech recognition. In this paper, we apply SAT to DNNs by learning two types of feature mapping neural networks. Given an initial DNN model, these networks take speaker i-vectors as additional information and project DNN inputs into a speaker-normalized space. The final SAT model is obtained by updating the canonical DNN in the normalized feature space. Experiments on a Switchboard 110-hour setup show that compared with the baseline DNN, the SAT-DNN model brings 7.5% and 6.0% relative improvement when DNN inputs are speaker-independent and speaker-adapted features respectively. Further evaluations on the more challenging BABEL datasets reveal significant word error rate reduction achieved by SAT-DNN.

Journal ArticleDOI
TL;DR: Experimental results based on the NIST 2010 SRE dataset suggest that the proposed VAD outperforms conventional ones whenever interview-style speech is involved, and it is demonstrated that noise reduction is vital for energy-based VAD under low SNR.

Patent
04 Aug 2014
TL;DR: In this paper, a controller for a voice-controlled device is provided, which includes a setting module and a recognition module, and the recognition module compares a confidence score of speech recognition with the threshold to accordingly execute voice control.
Abstract: A controller for a voice-controlled device is provided. The controller includes a setting module and a recognition module. The setting module generates a threshold according to an environmental parameter. The recognition module compares a confidence score of speech recognition with the threshold to accordingly execute voice control.

01 Jan 2014
TL;DR: During late 2013 through mid-2014, NIST coordinated a special machine learning challenge based on the i-vector paradigm widely used by state-of-the-art speaker recognition systems; the challenge saw approximately twice as many participants as the 2012 NIST Speaker Recognition Evaluation, and a nearly two-orders-of-magnitude increase in the number of systems submitted for evaluation.
Abstract: During late 2013 through mid-2014, NIST coordinated a special machine learning challenge based on the i-vector paradigm widely used by state-of-the-art speaker recognition systems. The i-vector challenge was run entirely online and used as source data fixed-length feature vectors projected into a low-dimensional space (i-vectors), rather than audio recordings. These changes made the challenge more readily accessible, enabled system comparison with consistency in the front-end and in the amount and type of training data, and facilitated exploration of many more approaches than would be possible in a single evaluation as traditionally run by NIST. Compared to the 2012 NIST Speaker Recognition Evaluation, the i-vector challenge saw approximately twice as many participants, and a nearly two-orders-of-magnitude increase in the number of systems submitted for evaluation. Initial results indicate that the leading system achieved a relative improvement of approximately 38% over the baseline system.

Proceedings ArticleDOI
01 Dec 2014
TL;DR: This paper explores the use of DNNs to collect SS for the unsupervised domain adaptation task of the Domain Adaptation Challenge (DAC), and shows that collecting SS with a DNN trained on out-of-domain data boosts the speaker recognition performance of an out-of-domain system by more than 25%.
Abstract: Traditional i-vector speaker recognition systems use a Gaussian mixture model (GMM) to collect sufficient statistics (SS). Recently, replacing this GMM with a deep neural network (DNN) has shown promising results. In this paper, we explore the use of DNNs to collect SS for the unsupervised domain adaptation task of the Domain Adaptation Challenge (DAC). We show that collecting SS with a DNN trained on out-of-domain data boosts the speaker recognition performance of an out-of-domain system by more than 25%. Moreover, we integrate the DNN in an unsupervised adaptation framework that uses agglomerative hierarchical clustering with a stopping criterion based on unsupervised calibration, and show that the initial gains of the out-of-domain system carry over to the final adapted system. Despite the fact that the DNN is trained on the out-of-domain data, the final adapted system produces a relative improvement of more than 30% with respect to the best published results on this task.

Proceedings ArticleDOI
04 May 2014
TL;DR: Results presented in this paper indicate that channel concatenation gives similar or better results than beamforming, and augmenting the standard DNN input with the bottleneck feature from a Speaker Aware Deep Neural Network (SADNN) shows a general advantage over the standard DNN based recognition system and yields additional improvements for far field speech recognition.
Abstract: This paper presents an investigation of far field speech recognition using beamforming and channel concatenation in the context of Deep Neural Network (DNN) based feature extraction. While speech enhancement with beamforming is attractive, the algorithms are typically signal-based with no information about the special properties of speech. A simple alternative to beamforming is concatenating multiple channel features. Results presented in this paper indicate that channel concatenation gives similar or better results. On average the DNN front-end yields a 25% relative reduction in Word Error Rate (WER). Further experiments aim at including relevant information in training adapted DNN features. Augmenting the standard DNN input with the bottleneck feature from a Speaker Aware Deep Neural Network (SADNN) shows a general advantage over the standard DNN based recognition system, and yields additional improvements for far field speech recognition.

Proceedings ArticleDOI
Hagai Aronowitz
04 May 2014
TL;DR: This work analyzes the sources of degradation for a particular setup in the context of an i-vector PLDA system and concludes that the main source of degradation is an i-vector dataset shift, for which inter-dataset variability compensation (IDVC), based on the nuisance attribute projection (NAP) method, is introduced.
Abstract: Recently satisfactory results have been obtained in NIST speaker recognition evaluations. These results are mainly due to accurate modeling of a very large development dataset provided by LDC. However, for many realistic scenarios the use of this development dataset is limited due to a dataset mismatch. In such cases, collection of a large enough dataset is infeasible. In this work we analyze the sources of degradation for a particular setup in the context of an i-vector PLDA system and conclude that the main source for degradation is an i-vector dataset shift. As a remedy, we introduce inter dataset variability compensation (IDVC) to explicitly compensate for dataset shift in the i-vector space. This is done using the nuisance attribute projection (NAP) method. Using IDVC we managed to reduce error dramatically by more than 50% for the domain mismatch setup.
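A sketch of IDVC under the stated NAP view: estimate a mean i-vector per dataset, take the subspace those means span as the dataset-shift directions, and project it out of every i-vector. Details such as the subspace rank are illustrative assumptions.

```python
import numpy as np

def idvc_projection(dataset_means: np.ndarray) -> np.ndarray:
    """dataset_means: (num_datasets, ivec_dim). Returns P = I - V V^T,
    where V spans the directions of inter-dataset mean spread."""
    centered = dataset_means - dataset_means.mean(axis=0)
    U, _, _ = np.linalg.svd(centered.T, full_matrices=False)
    V = U[:, : len(dataset_means) - 1]   # dataset-shift subspace
    return np.eye(dataset_means.shape[1]) - V @ V.T

def compensate(ivectors: np.ndarray, P: np.ndarray) -> np.ndarray:
    """Remove the dataset-shift component from each i-vector."""
    return ivectors @ P.T
```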

Proceedings ArticleDOI
04 May 2014
TL;DR: The authors propose to adapt the network parameters of each speaker from a background model, which will be referred to as a Universal DBN (UDBN), and to backpropagate class errors up to only one layer for a few iterations before training the network.
Abstract: The use of Deep Belief Networks (DBNs) is proposed in this paper to model discriminatively target and impostor i-vectors in a speaker verification task. The authors propose to adapt the network parameters of each speaker from a background model, which will be referred to as the Universal DBN (UDBN). It is also suggested to backpropagate class errors up to only one layer for a few iterations before training the network. Additionally, an impostor selection method is introduced which helps the DBN to outperform the cosine distance classifier. The evaluation is performed on the core test condition of the NIST SRE 2006 corpora, and it is shown that 10% and 8% relative improvements of EER and minDCF can be achieved, respectively.

Proceedings ArticleDOI
04 May 2014
TL;DR: The two network architectures, convolution along the frequency axis and time-domain convolution, can be readily combined; the combined model achieves an error rate of 16.7% on the TIMIT phone recognition task, a new record on this dataset.
Abstract: Convolutional neural networks have proved very successful in image recognition, thanks to their tolerance to small translations. They have recently been applied to speech recognition as well, using a spectral representation as input. However, in this case the translations along the two axes - time and frequency - should be handled quite differently. So far, most authors have focused on convolution along the frequency axis, which offers invariance to speaker and speaking style variations. Other researchers have developed a different network architecture that applies time-domain convolution in order to process a longer time-span of input in a hierarchical manner. These two approaches have different background motivations, and both offer significant gains over a standard fully connected network. Here we show that the two network architectures can be readily combined, as can their advantages. With the combined model we report an error rate of 16.7% on the TIMIT phone recognition task, a new record on this dataset.
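Combining the two approaches amounts to convolving over both axes of the time-frequency plane at once; the sketch below shows a plain valid-mode 2-D convolution (as used in neural networks, i.e. without kernel flipping) with illustrative kernel sizes, not the paper's network configuration.

```python
import numpy as np

def conv2d_valid(spectrogram: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Valid-mode 2-D convolution over (time, frequency); a single kernel
    spanning both axes captures time- and frequency-domain structure."""
    T, F = spectrogram.shape
    kt, kf = kernel.shape
    out = np.empty((T - kt + 1, F - kf + 1))
    for t in range(out.shape[0]):
        for f in range(out.shape[1]):
            out[t, f] = np.sum(spectrogram[t:t + kt, f:f + kf] * kernel)
    return out
```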

Proceedings ArticleDOI
01 Oct 2014
TL;DR: Seven emotions are recognized using pitch and prosody features, and a Support Vector Machine (SVM) classifier is used for classifying the emotions.
Abstract: In the past decade a lot of research has gone into Automatic Speech Emotion Recognition (SER). The primary objective of SER is to improve the man-machine interface. It can also be used to monitor the psychophysiological state of a person in lie detectors. In recent times, speech emotion recognition has also found applications in medicine and forensics. In this paper, 7 emotions are recognized using pitch and prosody features. The majority of the speech features used in this work are in the time domain. A Support Vector Machine (SVM) classifier has been used for classifying the emotions. The Berlin emotional database is chosen for the task. A good recognition rate of 81% was obtained. The paper taken as the reference for our work recognized only 4 emotions, obtaining a recognition rate of 94.2%, but it used a hybrid classifier, which increases complexity while recognizing only those 4 emotions.
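A minimal scikit-learn sketch of the described pipeline, assuming pitch/prosody features have already been extracted per utterance; the RBF kernel is an assumption, not necessarily the paper's configuration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_emotion_svm(features: np.ndarray, labels: np.ndarray):
    """features: (num_utterances, feat_dim) pitch/prosody vectors;
    labels: integer ids for the 7 emotion classes."""
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    clf.fit(features, labels)
    return clf   # clf.predict(new_features) yields emotion ids
```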

Proceedings ArticleDOI
01 Jan 2014
TL;DR: The i-vectors are viewed as the weights of a cluster adaptive training (CAT) system, where the underlying models are GMMs rather than HMMs, which allows the factorisation approaches developed for CAT to be directly applied.
Abstract: The use of deep neural networks (DNNs) in a hybrid configuration is becoming increasingly popular and successful for speech recognition. One issue with these systems is how to efficiently adapt them to reflect an individual speaker or noise condition. Recently speaker i-vectors have been successfully used as an additional input feature for unsupervised speaker adaptation. In this work the use of i-vectors for adaptation is extended to incorporate acoustic factorisation. In particular, separate i-vectors are computed to represent speaker and acoustic environment. By ensuring "orthogonality" between the individual factor representations it is possible to represent a wide range of speaker and environment pairs by simply combining i-vectors from a particular speaker and a particular environment. In this paper the i-vectors are viewed as the weights of a cluster adaptive training (CAT) system, where the underlying models are GMMs rather than HMMs. This allows the factorisation approaches developed for CAT to be directly applied. Initial experiments were conducted on a noise distorted version of the WSJ corpus. Compared to standard speaker-based i-vector adaptation, factorised i-vectors showed performance gains.

Journal ArticleDOI
TL;DR: The anatomical and physiological bases for individual differences in the human voice are reviewed, before discussing how recent methodological progress in voice morphing and voice synthesis has promoted research on current theoretical issues, such as how voices are mentally represented in thehuman brain.
Abstract: While humans use their voice mainly for communicating information about the world, paralinguistic cues in the voice signal convey rich dynamic information about a speaker's arousal and emotional state, and extralinguistic cues reflect more stable speaker characteristics including identity, biological sex and social gender, socioeconomic or regional background, and age. Here we review the anatomical and physiological bases for individual differences in the human voice, before discussing how recent methodological progress in voice morphing and voice synthesis has promoted research on current theoretical issues, such as how voices are mentally represented in the human brain. Special attention is dedicated to the distinction between the recognition of familiar and unfamiliar speakers, in everyday situations or in the forensic context, and on the processes and representational changes that accompany the learning of new voices. We describe how specific impairments and individual differences in voice perception could relate to specific brain correlates. Finally, we consider that voices are produced by speakers who are often visible during communication, and review recent evidence that shows how speaker perception involves dynamic face-voice integration. The representation of para- and extralinguistic vocal information plays a major role in person perception and social communication, could be neuronally encoded in a prototype-referenced manner, and is subject to flexible adaptive recalibration as a result of specific perceptual experience. WIREs Cogn Sci 2014, 5:15-25. doi: 10.1002/wcs.1261

01 Jan 2014
TL;DR: This paper proposes a framework that utilizes large-scale clustering algorithms and unlabeled in-domain data to adapt the system for evaluation and presents a system that achieves recognition performance comparable to one that is provided all knowledge of the domain mismatch.
Abstract: In this paper, we motivate and define the domain adaptation challenge task for speaker recognition. Using an i-vector system trained only on out-of-domain data as a starting point, we propose a framework that utilizes large-scale clustering algorithms and unlabeled in-domain data to adapt the system for evaluation. In presenting the results and analyses of an empirical exploration of this problem, our initial findings suggest that, while perfect clustering yields the best results, imperfect clustering can still provide recognition performance within 15% of the optimal. We further present a system that achieves recognition performance comparable to one that is provided all knowledge of the domain mismatch, and lastly, we outline throughout this paper some of the many directions for future work that this new task provides.

Proceedings ArticleDOI
08 May 2014
TL;DR: The developed system uses text-independent speaker verification with MFCC features and i-vector based speaker modeling for authenticating the user, while linear discriminant analysis and within-class covariance normalization are used for normalizing the effects due to session/environment variations.
Abstract: In this paper we present the development and implementation of a speech biometric based attendance system. The users access the system by making a call from a few pre-decided mobile phones. An interactive voice response (IVR) system guides a new user in the enrollment process and an enrolled user in the verification process. The system uses text-independent speaker verification with MFCC features and i-vector based speaker modeling for authenticating the user. Linear discriminant analysis and within-class covariance normalization are used for normalizing the effects due to session/environment variations. A simple cosine distance scoring along with score normalization is used as the classifier, and a fixed threshold is used for making the decision. The developed system has been used by a group of 110 students for about two months on a regular basis. The system performance in terms of recognition rate is found to be 94.2%, and the average response time of the system for test data of duration 50 seconds is noted to be 26 seconds.

Proceedings ArticleDOI
04 May 2014
TL;DR: This work has evaluated the proposed direct SC-based adaptation method in the large scale 320-hr Switchboard task and shown that the proposed method leads to up to 8% relative reduction in word error rate in Switchboard by using only a very small number of adaptation utterances per speaker.
Abstract: Recently an effective fast speaker adaptation method using discriminative speaker codes (SC) has been proposed for the hybrid DNN-HMM models in speech recognition [1]. This adaptation method depends on jointly learning a large generic adaptation neural network for all speakers as well as multiple small speaker codes using the standard back-propagation algorithm. In this paper, we propose an alternative direct adaptation in model space, where speaker codes are directly connected to the original DNN models through a set of new connection weights, which can be estimated very efficiently from all or part of the training data. As a result, the proposed method is more suitable for large scale speech recognition tasks since it eliminates the time-consuming training process of estimating a separate adaptation neural network. In this work, we have evaluated the proposed direct SC-based adaptation method on the large scale 320-hr Switchboard task. Experimental results have shown that the proposed SC-based rapid adaptation method is very effective not only for small recognition tasks but also for very large scale tasks. For example, the proposed method leads to up to 8% relative reduction in word error rate in Switchboard using only a very small number of adaptation utterances per speaker (from 10 to a few dozen). Moreover, the extra training time required for adaptation is also significantly reduced compared with the method in [1].

Proceedings ArticleDOI
01 Oct 2014
TL;DR: Experimental results demonstrate that the proposed framework achieves better separation results than a GMM-based approach in the supervised mode, and in the semi-supervised mode which is believed to be the preferred mode in real-world operations, the DNN- based approach even outperforms the GMM/supervised approach.
Abstract: This paper proposes a novel data-driven approach based on deep neural networks (DNNs) for single-channel speech separation. DNN is adopted to directly model the highly non-linear relationship of speech features between a target speaker and the mixed signals. Both supervised and semi-supervised scenarios are investigated. In the supervised mode, both identities of the target speaker and the interfering speaker are provided. While in the semi-supervised mode, only the target speaker is given. We propose using multiple speakers to be mixed with the target speaker to train the DNN which is shown to well predict an unseen interferer in the separation stage. Experimental results demonstrate that our proposed framework achieves better separation results than a GMM-based approach in the supervised mode. More significantly, in the semi-supervised mode which is believed to be the preferred mode in real-world operations, the DNN-based approach even outperforms the GMM-based approach in the supervised mode.