
Showing papers on "Speaker recognition published in 2019"


Proceedings ArticleDOI
12 May 2019
TL;DR: This paper proposes a powerful speaker recognition deep network, using a ‘thin-ResNet’ trunk architecture, and a dictionary-based NetVLAD or GhostVLAD layer to aggregate features across time, that can be trained end-to-end.
Abstract: The objective of this paper is speaker recognition ‘in the wild’ – where utterances may be of variable length and also contain irrelevant signals. Crucial elements in the design of deep networks for this task are the type of trunk (frame level) network, and the method of temporal aggregation. We propose a powerful speaker recognition deep network, using a ‘thin-ResNet’ trunk architecture, and a dictionary-based NetVLAD or GhostVLAD layer to aggregate features across time, that can be trained end-to-end. We show that our network achieves state of the art performance by a significant margin on the VoxCeleb1 test set for speaker recognition, whilst requiring fewer parameters than previous methods. We also investigate the effect of utterance length on performance, and conclude that for ‘in the wild’ data, a longer length is beneficial.
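To make the aggregation step concrete, here is a minimal NumPy sketch of dictionary-based NetVLAD-style pooling over frame-level features. The cluster centers, the softmax assignment sharpness, and all dimensions are illustrative placeholders rather than the paper's learned parameters; GhostVLAD additionally discards a few "ghost" clusters from the output.

# Minimal NetVLAD-style aggregation sketch (NumPy). Shapes and parameters are
# illustrative; in the paper the assignment weights and centers are learned
# end-to-end on top of a thin-ResNet trunk.
import numpy as np

def netvlad_aggregate(frames, centers, alpha=10.0):
    """frames: (T, D) frame-level features; centers: (K, D) dictionary."""
    # Soft-assign each frame to every cluster via a scaled dot-product softmax.
    logits = alpha * frames @ centers.T              # (T, K)
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    assign = np.exp(logits)
    assign /= assign.sum(axis=1, keepdims=True)      # (T, K)

    # Aggregate residuals between frames and centers, weighted by assignments:
    # vlad[k] = sum_t assign[t, k] * (frames[t] - centers[k])
    vlad = assign.T @ frames - assign.sum(axis=0)[:, None] * centers  # (K, D)

    # Intra-normalize per cluster, then L2-normalize the flattened descriptor.
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12
    vlad = vlad.reshape(-1)
    return vlad / (np.linalg.norm(vlad) + 1e-12)

T, D, K = 300, 512, 8                 # frames, feature dim, dictionary size
emb = netvlad_aggregate(np.random.randn(T, D), np.random.randn(K, D))
print(emb.shape)                      # (K * D,) fixed-length utterance embedding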

308 citations


Proceedings ArticleDOI
12 May 2019
TL;DR: It is found that diarization substantially reduces error rate when there are multiple speakers, while maintaining excellent performance on single-speaker recordings.
Abstract: Recently, deep neural networks that map utterances to fixed-dimensional embeddings have emerged as the state-of-the-art in speaker recognition. Our prior work introduced x-vectors, an embedding that is very effective for both speaker recognition and diarization. This paper combines our previous work and applies it to the problem of speaker recognition on multi-speaker conversations. We measure performance on Speakers in the Wild and report what we believe are the best published error rates on this dataset. Moreover, we find that diarization substantially reduces error rate when there are multiple speakers, while maintaining excellent performance on single-speaker recordings. Finally, we introduce an easily implemented method to remove the domain-sensitive threshold typically used in the clustering stage of a diarization system. The proposed method is more robust to domain shifts, and achieves similar results to those obtained using a well-tuned threshold.
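As an illustration of recognizing an enrolled speaker in a multi-speaker recording, the sketch below clusters segment embeddings into speakers and verifies each resulting cluster against the enrollment embedding. It is only a toy stand-in: the paper's systems use x-vectors with a PLDA backend and a full diarization pipeline rather than cosine scoring with a fixed number of clusters.

# Hypothetical diarize-then-verify sketch: cluster segment embeddings into
# speakers, average each cluster, and take the best match against the enrolled
# speaker. Cosine scoring and a fixed cluster count are simplifications.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def verify_multispeaker(segment_embs, enroll_emb, n_speakers=2):
    """segment_embs: (N, D) embeddings of short segments from one recording."""
    labels = AgglomerativeClustering(n_clusters=n_speakers).fit_predict(segment_embs)
    # Score the enrolled speaker against each diarized cluster; report the best.
    scores = [cosine(segment_embs[labels == k].mean(axis=0), enroll_emb)
              for k in np.unique(labels)]
    return max(scores)

rng = np.random.default_rng(0)
segs = rng.normal(size=(20, 128))          # placeholder segment embeddings
print(verify_multispeaker(segs, rng.normal(size=128)))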

280 citations



Posted Content
TL;DR: The submission of the Brno University of Technology (BUT) team to the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2019 is described: a fusion of 4 Convolutional Neural Network (CNN) topologies whose best systems for the Fixed and Open conditions achieved 1.42% and 1.26% EER on the challenge evaluation set, respectively.
Abstract: In this report, we describe the submission of the Brno University of Technology (BUT) team to the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2019. We also provide a brief analysis of different systems on VoxCeleb-1 test sets. Submitted systems for both Fixed and Open conditions are a fusion of 4 Convolutional Neural Network (CNN) topologies. The first and second networks have ResNet34 topology and use two-dimensional CNNs. The last two networks are one-dimensional CNNs based on the x-vector extraction topology. Some of the networks are fine-tuned using additive margin angular softmax. Kaldi FBanks and Kaldi PLPs were used as features. The difference between the Fixed and Open systems lies in the training data used and the fusion strategy. The best systems for the Fixed and Open conditions achieved 1.42% and 1.26% EER on the challenge evaluation set, respectively.
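The submitted systems are fusions of several subsystems at the score level. The sketch below shows one generic way such a fusion can be implemented, by learning linear weights with logistic regression on development trials; it is not BUT's actual fusion recipe, and the scores and labels are random placeholders.

# Generic score-level fusion sketch: learn linear fusion weights for several
# subsystems with logistic regression on development trials.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_trials, n_systems = 1000, 4
dev_scores = rng.normal(size=(n_trials, n_systems))      # per-system scores
dev_labels = rng.integers(0, 2, size=n_trials)           # 1 = target trial

fuser = LogisticRegression().fit(dev_scores, dev_labels)

def fuse(eval_scores):
    """eval_scores: (M, n_systems) -> fused log-odds scores."""
    return eval_scores @ fuser.coef_.ravel() + fuser.intercept_[0]

print(fuse(rng.normal(size=(5, n_systems))))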

167 citations


Proceedings ArticleDOI
18 Mar 2019
TL;DR: This paper makes hidden voice command attacks practical and model-agnostic by exploiting knowledge of the signal processing algorithms commonly used by VPSes, in particular the fact that multiple source audio samples have similar feature vectors when transformed by acoustic feature extraction algorithms.
Abstract: Voice Processing Systems (VPSes), now widely deployed, have been made significantly more accurate through the application of recent advances in machine learning. However, adversarial machine learning has similarly advanced and has been used to demonstrate that VPSes are vulnerable to the injection of hidden commands - audio obscured by noise that is correctly recognized by a VPS but not by human beings. Such attacks, though, are often highly dependent on white-box knowledge of a specific machine learning model and limited to specific microphones and speakers, making their use across different acoustic hardware platforms (and thus their practicality) limited. In this paper, we break these dependencies and make hidden command attacks more practical through model-agnostic (blackbox) attacks, which exploit knowledge of the signal processing algorithms commonly used by VPSes to generate the data fed into machine learning systems. Specifically, we exploit the fact that multiple source audio samples have similar feature vectors when transformed by acoustic feature extraction algorithms (e.g., FFTs). We develop four classes of perturbations that create unintelligible audio and test them against 12 machine learning models, including 7 proprietary models (e.g., Google Speech API, Bing Speech API, IBM Speech API, Azure Speaker API, etc), and demonstrate successful attacks against all targets. Moreover, we successfully use our maliciously generated audio samples in multiple hardware configurations, demonstrating effectiveness across both models and real systems. In so doing, we demonstrate that domain-specific knowledge of audio signal processing represents a practical means of generating successful hidden voice command attacks.
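The core observation, that many different waveforms map to near-identical magnitude-spectrum features, can be illustrated with the hedged sketch below: it resynthesizes audio from the STFT magnitude with random phase, producing a different time-domain signal whose FFT/MFCC-style features stay close to the original. This is only an illustration of the principle, not one of the paper's four perturbation classes.

# Illustration of the underlying idea, not the paper's perturbations: many
# waveforms share near-identical magnitude-spectrum features. Re-synthesizing
# from the STFT magnitude with random phase gives audio that differs in the
# time domain but keeps similar magnitude-based features.
import numpy as np
from scipy.signal import stft, istft

def random_phase_version(audio, fs=16000, nperseg=512):
    _, _, S = stft(audio, fs=fs, nperseg=nperseg)
    mag = np.abs(S)
    phase = np.exp(1j * np.random.uniform(-np.pi, np.pi, size=mag.shape))
    _, rebuilt = istft(mag * phase, fs=fs, nperseg=nperseg)
    return rebuilt

fs = 16000
t = np.arange(fs) / fs
clean = np.sin(2 * np.pi * 220 * t)          # stand-in for a speech sample
perturbed = random_phase_version(clean, fs)
print(clean.shape, perturbed.shape)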

108 citations


Proceedings ArticleDOI
15 Sep 2019
TL;DR: Very deep x-vector architectures (Extended and Factorized TDNN, and ResNets) clearly outperformed shallower x-vectors and i-vectors in NIST SRE18, and the Extended TDNN x-vector was the best single system.
Abstract: We present a condensed description of the joint effort of JHU-CLSP, JHU-HLTCOE, MIT-LL, MIT CSAIL and LSE-EPITA for NIST SRE18. All the developed systems consisted of x-vector/i-vector embeddings with some flavor of PLDA backend. Very deep x-vector architectures (Extended and Factorized TDNN, and ResNets) clearly outperformed shallower x-vectors and i-vectors. The systems were tailored to the video (VAST) or to the telephone (CMN2) condition. The VAST data was challenging, yielding 4 times worse performance than other video-based datasets like Speakers in the Wild. We were able to calibrate the VAST data with very few development trials by using careful adaptation and score normalization methods. The VAST primary fusion yielded EER=10.18% and Cprimary=0.431. By improving calibration in post-eval, we reached Cprimary=0.369. In CMN2, we used unsupervised SPLDA adaptation based on agglomerative clustering and score normalization to correct the domain shift between English and Tunisian Arabic models. The CMN2 primary fusion yielded EER=4.5% and Cprimary=0.313. The Extended TDNN x-vector was the best single system, obtaining EER=11.1% and Cprimary=0.452 in VAST, and 4.95% and 0.354 in CMN2.
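Score normalization is mentioned as a key ingredient for calibrating these systems. Below is a generic symmetric score normalization (S-norm) sketch over a cohort of embeddings; the exact variant used in the paper (e.g., adaptive S-norm with a top-scoring cohort) may differ, and all embeddings here are random placeholders.

# Generic symmetric score normalization (S-norm) sketch, a technique commonly
# used in speaker verification systems of this kind.
import numpy as np

def s_norm(raw_score, enroll_emb, test_emb, cohort_embs):
    """Normalize a cosine score with enrollment-side and test-side cohort stats."""
    def cos(a, B):
        return (B @ a) / (np.linalg.norm(B, axis=1) * np.linalg.norm(a) + 1e-12)
    ze = cos(enroll_emb, cohort_embs)     # enrollment vs cohort scores
    zt = cos(test_emb, cohort_embs)       # test vs cohort scores
    return 0.5 * ((raw_score - ze.mean()) / (ze.std() + 1e-12)
                  + (raw_score - zt.mean()) / (zt.std() + 1e-12))

rng = np.random.default_rng(1)
e, x, cohort = rng.normal(size=128), rng.normal(size=128), rng.normal(size=(200, 128))
raw = float(e @ x / (np.linalg.norm(e) * np.linalg.norm(x)))
print(s_norm(raw, e, x, cohort))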

101 citations


Posted Content
TL;DR: This paper conducts the first comprehensive and systematic study of the adversarial attacks on SR systems (SRSs) to understand their security weakness in the practical black-box setting, and proposes an adversarial attack, named FakeBob, to craft adversarial samples.
Abstract: Speaker recognition (SR) is widely used in our daily life as a biometric authentication or identification mechanism. The popularity of SR brings in serious security concerns, as demonstrated by recent adversarial attacks. However, the impacts of such threats in the practical black-box setting are still open, since current attacks consider the white-box setting only. In this paper, we conduct the first comprehensive and systematic study of adversarial attacks on SR systems (SRSs) to understand their security weakness in the practical black-box setting. For this purpose, we propose an adversarial attack, named FAKEBOB, to craft adversarial samples. Specifically, we formulate adversarial sample generation as an optimization problem, incorporating the confidence of adversarial samples and maximal distortion to balance the strength and imperceptibility of adversarial voices. One key contribution is a novel algorithm to estimate the score threshold, a feature of SRSs, which is then used within the optimization problem. We demonstrate that FAKEBOB achieves a 99% targeted attack success rate on both open-source and commercial systems. We further demonstrate that FAKEBOB is also effective on both open-source and commercial systems when played over the air in the physical world. Moreover, we have conducted a human study which reveals that it is hard for humans to differentiate the speakers of the original and adversarial voices. Last but not least, we show that four promising defense methods against adversarial attacks from the speech recognition domain become ineffective on SRSs against FAKEBOB, which calls for more effective defense methods. We highlight that our study peeks into the security implications of adversarial attacks on SRSs and provides a realistic basis for improving their security robustness.
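As a purely hypothetical illustration of the threshold-estimation idea (not FakeBob's actual algorithm), the toy sketch below assumes a black-box API that returns a similarity score and an accept/reject decision; the unknown decision threshold must then lie between the largest rejected score and the smallest accepted score observed.

# Hypothetical illustration of black-box threshold estimation. This is only a
# toy stand-in for the estimation algorithm described in the paper.
import numpy as np

TRUE_THRESHOLD = 0.62          # unknown to the attacker; used only to simulate

def query(sample_score):
    """Simulated black-box API: returns (score, accepted?)."""
    return sample_score, sample_score >= TRUE_THRESHOLD

rng = np.random.default_rng(0)
rejected, accepted = [], []
for s in rng.uniform(0.0, 1.0, size=200):      # probe with varied inputs
    score, ok = query(s)
    (accepted if ok else rejected).append(score)

low = max(rejected) if rejected else 0.0
high = min(accepted) if accepted else 1.0
print(f"estimated threshold in [{low:.3f}, {high:.3f}]")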

98 citations


Proceedings ArticleDOI
15 Sep 2019
TL;DR: Initial results indicate that effective use of the development data was essential for the top performing systems, and that domain/channel, language, and duration mismatch had an adverse impact on system performance.
Abstract: In 2016, the National Institute of Standards and Technology (NIST) conducted the most recent in an ongoing series of speaker recognition evaluations (SRE) to foster research in robust text-independent speaker recognition, as well as measure performance of current state-of-the-art systems. Compared to previous NIST SREs, SRE16 introduced several new aspects including: an entirely online evaluation platform, a fixed training data condition, more variability in test segment duration (uniformly distributed between 10s and 60s), the use of non-English (Cantonese, Cebuano, Mandarin and Tagalog) conversational telephone speech (CTS) collected outside North America, and providing labeled and unlabeled development (a.k.a. validation) sets for system hyperparameter tuning and adaptation. The introduction of the new non-English CTS data made SRE16 more challenging due to domain/channel and language mismatches as compared to previous SREs. A total of 66 research organizations from industry and academia registered for SRE16, out of which 43 teams submitted 121 valid system outputs that produced scores. This paper presents an overview of the evaluation and analysis of system performance over all primary evaluation conditions. Initial results indicate that effective use of the development data was essential for the top performing systems, and that domain/channel, language, and duration mismatch had an adverse impact on system performance.

94 citations


Proceedings ArticleDOI
01 May 2019
TL;DR: In this article, the authors optimize a speaker embedding model with prototypical network loss (PNL), a state-of-the-art approach for the few-shot image classification task.
Abstract: Speaker embedding models that utilize neural networks to map utterances to a space where distances reflect similarity between speakers have driven recent progress in the speaker recognition task. However, there is still a significant performance gap between recognizing speakers in the training set and unseen speakers. The latter case corresponds to the few-shot learning task, where a trained model is evaluated on unseen classes. Here, we optimize a speaker embedding model with prototypical network loss (PNL), a state-of-the-art approach for the few-shot image classification task. The resulting embedding model outperforms the state-of-the-art triplet loss based models in both speaker verification and identification tasks, for both seen and unseen speakers.
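A minimal PyTorch sketch of the prototypical network loss used here: support embeddings of each speaker in an episode are averaged into a prototype, and query embeddings are classified by a softmax over negative squared distances to the prototypes. The embedding network itself and all sizes are placeholders.

# Minimal sketch of a prototypical-network-style episode loss for speaker
# embeddings (PyTorch).
import torch
import torch.nn.functional as F

def prototypical_loss(support, query, query_labels):
    """support: (n_spk, n_shot, D); query: (n_query, D); labels index speakers."""
    prototypes = support.mean(dim=1)                          # (n_spk, D)
    dists = torch.cdist(query, prototypes) ** 2               # (n_query, n_spk)
    return F.cross_entropy(-dists, query_labels)              # softmax over -distance

n_spk, n_shot, d = 5, 3, 128
support = torch.randn(n_spk, n_shot, d)
query = torch.randn(10, d)
labels = torch.randint(0, n_spk, (10,))
print(prototypical_loss(support, query, labels).item())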

94 citations


Journal ArticleDOI
TL;DR: The requirements for effective privacy preservation are established, generic cryptography-based solutions are reviewed, followed by specific techniques applicable to speaker characterisation (biometric) and speech characterisation (non-biometric) applications, and common empirical evaluation metrics for the assessment of privacy-preserving technologies for speech data are outlined.

91 citations


Journal ArticleDOI
TL;DR: This work proposes a two-stage strategy to enhance corrupted speech, where denoising and dereverberation are conducted sequentially using deep neural networks, and designs a new objective function that incorporates clean phase during model training to better estimate spectral magnitudes.
Abstract: In real-world situations, speech reaching our ears is commonly corrupted by both room reverberation and background noise. These distortions are detrimental to speech intelligibility and quality, and also pose a serious problem to many speech-related applications, including automatic speech and speaker recognition. In order to deal with the combined effects of noise and reverberation, we propose a two-stage strategy to enhance corrupted speech, where denoising and dereverberation are conducted sequentially using deep neural networks. In addition, we design a new objective function that incorporates clean phase during model training to better estimate spectral magnitudes, which would in turn yield better phase estimates when combined with iterative phase reconstruction. The two-stage model is then jointly trained to optimize the proposed objective function. Systematic evaluations and comparisons show that the proposed algorithm improves objective metrics of speech intelligibility and quality substantially, and significantly outperforms previous one-stage enhancement systems.
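One common way to fold clean phase into a magnitude objective, in the spirit of the objective described above, is a phase-sensitive target |S| * cos(theta_s - theta_y). The NumPy sketch below illustrates this idea; the paper's exact formulation and the iterative phase reconstruction step are not reproduced.

# Phase-sensitive magnitude objective sketch (NumPy): train toward
# |S| * cos(theta_s - theta_y) instead of |S|, so magnitude errors are
# penalized in a phase-aware way.
import numpy as np

def phase_sensitive_target(clean_stft, noisy_stft):
    """Element-wise phase-sensitive magnitude target."""
    phase_diff = np.angle(clean_stft) - np.angle(noisy_stft)
    return np.abs(clean_stft) * np.cos(phase_diff)

def loss(estimated_mag, clean_stft, noisy_stft):
    target = phase_sensitive_target(clean_stft, noisy_stft)
    return float(np.mean((estimated_mag - target) ** 2))

rng = np.random.default_rng(0)
S = rng.normal(size=(257, 100)) + 1j * rng.normal(size=(257, 100))   # clean STFT
Y = S + 0.5 * (rng.normal(size=S.shape) + 1j * rng.normal(size=S.shape))
print(loss(np.abs(Y), S, Y))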

Proceedings ArticleDOI
12 May 2019
TL;DR: This work proposes an adversarial speaker verification (ASV) scheme to learn the condition-invariant deep embedding via adversarial multi-task training and proposes multi-factorial ASV to simultaneously suppress multiple factors that constitute the condition variability.
Abstract: The use of deep networks to extract embeddings for speaker recognition has proven successful. However, such embeddings are susceptible to performance degradation due to mismatches among the training, enrollment, and test conditions. In this work, we propose an adversarial speaker verification (ASV) scheme to learn condition-invariant deep embeddings via adversarial multi-task training. In ASV, a speaker classification network and a condition identification network are jointly optimized to minimize the speaker classification loss and simultaneously mini-maximize the condition loss. The target labels of the condition network can be categorical (environment types) or continuous (SNR values). We further propose multi-factorial ASV to simultaneously suppress multiple factors that constitute the condition variability. Evaluated on a Microsoft Cortana text-dependent speaker verification task, ASV achieves 8.8% and 14.5% relative improvements in equal error rate (EER) for known and unknown conditions, respectively.
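Adversarial multi-task training of this kind is often implemented with a gradient reversal layer, as in the PyTorch sketch below: the speaker head is trained normally while gradients from the condition head are flipped before reaching the shared embedding, pushing it toward condition invariance. Network sizes and label counts are illustrative, not those of the Cortana task.

# Gradient-reversal sketch for adversarial multi-task training (PyTorch).
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_output):
        # Flip (and scale) the gradient flowing back into the embedding.
        return -ctx.lam * grad_output, None

embedder = nn.Sequential(nn.Linear(40, 256), nn.ReLU(), nn.Linear(256, 128))
speaker_head = nn.Linear(128, 500)      # e.g. 500 training speakers
condition_head = nn.Linear(128, 4)      # e.g. 4 environment types

feats = torch.randn(32, 40)
spk_labels = torch.randint(0, 500, (32,))
cond_labels = torch.randint(0, 4, (32,))

emb = embedder(feats)
loss = nn.functional.cross_entropy(speaker_head(emb), spk_labels) \
     + nn.functional.cross_entropy(condition_head(GradReverse.apply(emb, 1.0)), cond_labels)
loss.backward()   # the embedding is pushed to confuse the condition classifier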

Journal ArticleDOI
TL;DR: It is shown that a limited number of real RIRs, carefully selected to match the target environment, provide results comparable to a large number of artificially generated RIRs, and that both sets can be combined to achieve the best ASR results.
Abstract: This paper presents BUT ReverbDB, a dataset of real room impulse responses (RIR), background noises, and retransmitted speech data. The retransmitted data include LibriSpeech test-clean, 2000 HUB5 English evaluation, and part of the 2010 NIST Speaker Recognition Evaluation datasets. We provide a detailed description of the RIR collection (hardware, software, post-processing) that can serve as a "cook-book" for similar efforts. We also validate BUT ReverbDB in two sets of automatic speech recognition (ASR) experiments and draw conclusions for augmenting ASR training data with real and artificially generated RIRs. We show that a limited number of real RIRs, carefully selected to match the target environment, provide results comparable to a large number of artificially generated RIRs, and that both sets can be combined to achieve the best ASR results. The dataset is distributed for free under a non-restrictive license and currently contains data from eight rooms, with more being added. The distribution package also contains a Kaldi-based recipe for augmenting the publicly available AMI close-talk meeting data and testing the results on the AMI single distant microphone set, allowing our experiments to be reproduced.
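A minimal sketch of the kind of augmentation such a dataset supports: convolve clean speech with a measured room impulse response and add background noise at a chosen SNR. The RIR, speech, and noise below are synthetic placeholders standing in for files from the corpus.

# RIR + noise augmentation sketch (NumPy/SciPy) with placeholder signals.
import numpy as np
from scipy.signal import fftconvolve

def reverberate_and_add_noise(speech, rir, noise, snr_db=10.0):
    reverbed = fftconvolve(speech, rir, mode="full")[: len(speech)]
    noise = noise[: len(reverbed)]
    # Scale noise to reach the requested SNR relative to the reverberated speech.
    speech_power = np.mean(reverbed ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    noise_scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10.0)))
    return reverbed + noise_scale * noise

rng = np.random.default_rng(0)
speech = rng.normal(size=16000)                                  # placeholder utterance
rir = np.exp(-np.linspace(0, 8, 4000)) * rng.normal(size=4000)   # toy decaying RIR
noise = rng.normal(size=16000)
augmented = reverberate_and_add_noise(speech, rir, noise, snr_db=10.0)
print(augmented.shape)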

Proceedings ArticleDOI
01 Apr 2019
TL;DR: This paper designs and implements VoicePop, a robust software-only anti-spoofing system on smartphones that leverages the pop noise, which is produced by the user breathing while speaking close to the microphone, to identify legitimate users and defend against spoofing attacks.
Abstract: Voice biometrics is widely adopted for identity authentication in mobile devices. However, voice authentication is vulnerable to spoofing attacks, where an adversary may deceive the voice authentication system with pre-recorded or synthesized samples from the legitimate user or by impersonating the speaking style of the targeted user. In this paper, we design and implement VoicePop, a robust software-only anti-spoofing system on smartphones. VoicePop leverages the pop noise, which is produced by the user breathing while speaking close to the microphone. The pop noise is delicate and subject to user diversity, making it hard to record by replay attacks beyond a certain distance and to imitate precisely by impersonators. We design a novel pop noise detection scheme to pinpoint pop noises at the phonemic level, based on which we establish individually unique relationship between phonemes and pop noises to identify legitimate users and defend against spoofing attacks. Our experimental results with 18 participants and three types of smartphones show that VoicePop achieves over 93.5% detection accuracy at around 5.4% equal error rate. VoicePop requires no additional hardware but only the built-in microphones in virtually all smartphones, which can be readily integrated in existing voice authentication systems for mobile devices.

Proceedings ArticleDOI
01 Dec 2019
TL;DR: Simple classifiers are used to investigate the contents encoded by x-vector embeddings for information related to the speaker, channel, transcription, and meta information about the utterance and compare these with the information encoded by i-vectors across a varying number of dimensions.
Abstract: Deep neural network based speaker embeddings, such as x-vectors, have been shown to perform well in text-independent speaker recognition/verification tasks. In this paper, we use simple classifiers to investigate the contents encoded by x-vector embeddings. We probe these embeddings for information related to the speaker, channel, transcription (sentence, words, phones), and meta information about the utterance (duration and augmentation type), and compare these with the information encoded by i-vectors across a varying number of dimensions. We also study the effect of data augmentation during extractor training on the information captured by x-vectors. Experiments on the RedDots data set show that x-vectors capture spoken content and channel-related information, while performing well on speaker verification tasks.
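The probing methodology itself is simple and can be sketched as follows: freeze the embeddings, train a lightweight classifier to predict a property of interest (e.g., augmentation type or channel), and treat accuracy well above chance as evidence that the property is encoded. The data below are random placeholders rather than actual x-vectors.

# Probing-classifier sketch: test whether a property is linearly decodable
# from fixed embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
xvectors = rng.normal(size=(2000, 512))           # placeholder embeddings
labels = rng.integers(0, 5, size=2000)            # e.g. 5 augmentation types

X_tr, X_te, y_tr, y_te = train_test_split(xvectors, labels, test_size=0.25,
                                          random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# Accuracy well above chance would indicate the property is encoded.
print("probe accuracy:", probe.score(X_te, y_te))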

Posted Content
TL;DR: The "VOiCES from a Distance Challenge 2019" is designed to foster research in the area of speaker recognition and automatic speech recognition with the special focus on single channel distant/far-field audio, under noisy conditions.
Abstract: The "VOiCES from a Distance Challenge 2019" is designed to foster research in the area of speaker recognition and automatic speech recognition (ASR) with the special focus on single channel distant/far-field audio, under noisy conditions. The main objectives of this challenge are to: (i) benchmark state-of-the-art technology in the area of speaker recognition and automatic speech recognition (ASR), (ii) support the development of new ideas and technologies in speaker recognition and ASR, (iii) support new research groups entering the field of distant/far-field speech processing, and (iv) provide a new, publicly available dataset to the community that exhibits realistic distance characteristics.

Posted Content
Xu Xiang, Shuai Wang, Houjun Huang, Yanmin Qian, Kai Yu
TL;DR: Three different margin-based losses, which not only separate classes but also demand a fixed margin between classes, are introduced to deep speaker embedding learning, and it is demonstrated that the margin is the key to obtaining more discriminative speaker embeddings.
Abstract: Recently, speaker embeddings extracted from a speaker-discriminative deep neural network (DNN) yield better performance than conventional methods such as i-vectors. In most cases, the DNN speaker classifier is trained using cross entropy loss with softmax. However, this kind of loss function does not explicitly encourage inter-class separability and intra-class compactness. As a result, the embeddings are not optimal for speaker recognition tasks. In this paper, to address this issue, three different margin-based losses, which not only separate classes but also demand a fixed margin between classes, are introduced to deep speaker embedding learning. We demonstrate that the margin is the key to obtaining more discriminative speaker embeddings. Experiments are conducted on two public text-independent tasks: VoxCeleb1 and Speakers in the Wild (SITW). The proposed approach achieves state-of-the-art performance, with 25% ~ 30% equal error rate (EER) reduction on both tasks compared to strong baselines using cross entropy loss with softmax, obtaining 2.238% EER on the VoxCeleb1 test set and 2.761% EER on the SITW core-core test set, respectively.
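One of the margin-based losses in this family, additive margin softmax, can be sketched in a few lines of PyTorch: logits are cosine similarities to class weight vectors, the target class has a fixed margin m subtracted, and everything is scaled by s before cross entropy. The margin and scale values below are typical choices, not necessarily the ones used in the paper.

# Minimal additive-margin softmax (AM-softmax) sketch in PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AMSoftmax(nn.Module):
    def __init__(self, emb_dim, n_classes, m=0.2, s=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_classes, emb_dim))
        self.m, self.s = m, s

    def forward(self, emb, labels):
        # Cosine similarity between L2-normalized embeddings and class weights.
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))  # (B, C)
        # Subtract the margin only at the target class position.
        margin = torch.zeros_like(cos).scatter_(1, labels.unsqueeze(1), self.m)
        return F.cross_entropy(self.s * (cos - margin), labels)

loss_fn = AMSoftmax(emb_dim=256, n_classes=1211)    # e.g. VoxCeleb1 dev speakers
emb = torch.randn(32, 256)
labels = torch.randint(0, 1211, (32,))
print(loss_fn(emb, labels).item())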

Proceedings ArticleDOI
15 Sep 2019
TL;DR: This work proposes the use of an attention mechanism to obtain a discriminative speaker embedding given non-fixed-length speech utterances, based on a Convolutional Neural Network that encodes short-term speaker features from the spectrogram and a self multi-head attention model that maps these representations into a long-term speaker embedding.
Abstract: Most state-of-the-art Deep Learning (DL) approaches for speaker recognition work at a short utterance level. Given the speech signal, these algorithms extract a sequence of speaker embeddings from short segments and average them to obtain an utterance-level speaker representation. In this work we propose the use of an attention mechanism to obtain a discriminative speaker embedding given non-fixed-length speech utterances. Our system is based on a Convolutional Neural Network (CNN) that encodes short-term speaker features from the spectrogram and a self multi-head attention model that maps these representations into a long-term speaker embedding. The attention model that we propose produces multiple alignments from different subsegments of the CNN encoded states over the sequence. Hence this mechanism works as a pooling layer which selects the most discriminative features over the sequence to obtain an utterance-level representation. We have tested this approach on the verification task for the VoxCeleb1 dataset. The results show that self multi-head attention outperforms both temporal and statistical pooling methods with an 18% relative improvement in EER. The obtained results also show a 58% relative improvement in EER compared to i-vector+PLDA.
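A minimal PyTorch sketch of multi-head attentive pooling over CNN-encoded frames: each head computes its own softmax weighting over time on a slice of the feature vector, and the per-head summaries are concatenated into an utterance-level embedding. The parameterization is illustrative and may differ from the paper's exact model.

# Multi-head attentive pooling sketch (PyTorch) over a sequence of frames.
import torch
import torch.nn as nn

class MultiHeadAttentivePooling(nn.Module):
    def __init__(self, feat_dim, n_heads=4):
        super().__init__()
        assert feat_dim % n_heads == 0
        self.n_heads = n_heads
        self.score = nn.Linear(feat_dim // n_heads, 1, bias=False)

    def forward(self, x):                           # x: (B, T, feat_dim)
        B, T, D = x.shape
        x = x.view(B, T, self.n_heads, D // self.n_heads)
        w = torch.softmax(self.score(x), dim=1)     # attention over time, per head
        pooled = (w * x).sum(dim=1)                 # (B, n_heads, D // n_heads)
        return pooled.reshape(B, D)                 # utterance-level embedding

pool = MultiHeadAttentivePooling(feat_dim=256, n_heads=4)
frames = torch.randn(8, 300, 256)                   # batch of 300-frame encodings
print(pool(frames).shape)                           # torch.Size([8, 256])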

Proceedings ArticleDOI
Xu Xiang, Shuai Wang, Houjun Huang, Yanmin Qian, Kai Yu
18 Jun 2019
TL;DR: In this article, three different margin-based losses which not only separate classes but also demand a fixed margin between classes are introduced to deep speaker embedding learning to obtain more discriminative speaker embeddings.
Abstract: Recently, speaker embeddings extracted from a speaker-discriminative deep neural network (DNN) yield better performance than conventional methods such as i-vectors. In most cases, the DNN speaker classifier is trained using cross entropy loss with softmax. However, this kind of loss function does not explicitly encourage inter-class separability and intra-class compactness. As a result, the embeddings are not optimal for speaker recognition tasks. In this paper, to address this issue, three different margin-based losses, which not only separate classes but also demand a fixed margin between classes, are introduced to deep speaker embedding learning. We demonstrate that the margin is the key to obtaining more discriminative speaker embeddings. Experiments are conducted on two public text-independent tasks: VoxCeleb1 and Speakers in the Wild (SITW). The proposed approach achieves state-of-the-art performance, with 25% ~ 30% equal error rate (EER) reduction on both tasks compared to strong baselines using cross entropy loss with softmax, obtaining 2.238% EER on the VoxCeleb1 test set and 2.761% EER on the SITW core-core test set, respectively.

Proceedings ArticleDOI
15 Sep 2019
TL;DR: The extent to which users can be recognized based on the encoded representation of their speech as obtained by a deep encoder-decoder architecture trained for ASR is studied and adversarial training is proposed to learn representations that perform well in ASR while hiding speaker identity.
Abstract: Automatic speech recognition (ASR) is a key technology in many services and applications. This typically requires user devices to send their speech data to the cloud for ASR decoding. As the speech signal carries a lot of information about the speaker, this raises serious privacy concerns. As a solution, an encoder may reside on each user device which performs local computations to anonymize the representation. In this paper, we focus on the protection of speaker identity and study the extent to which users can be recognized based on the encoded representation of their speech as obtained by a deep encoder-decoder architecture trained for ASR. Through speaker identification and verification experiments on the Librispeech corpus with open and closed sets of speakers, we show that the representations obtained from a standard architecture still carry a lot of information about speaker identity. We then propose to use adversarial training to learn representations that perform well in ASR while hiding speaker identity. Our results demonstrate that adversarial training dramatically reduces the closed-set classification accuracy, but this does not translate into increased open-set verification error, and hence does not increase protection of the speaker identity in practice. We suggest several possible reasons behind this negative result.

Journal ArticleDOI
TL;DR: Experimental results demonstrate the improved performance of the proposed end-to-end multimodal architecture compared to unimodal variants for VAD.
Abstract: Recently, there has been growing use of deep neural networks in many modern speech-based systems such as speaker recognition, speech enhancement, and emotion recognition. Inspired by this success, we propose to address the task of voice activity detection (VAD) by incorporating auditory and visual modalities into an end-to-end deep neural network. We evaluate our proposed system in challenging acoustic environments including high levels of noise and transients, which are common in real-life scenarios. Our multimodal setting includes a speech signal captured by a microphone and a corresponding video signal capturing the speaker's mouth region. Under such difficult conditions, robust features need to be extracted from both modalities in order for the system to accurately distinguish between speech and noise. For this purpose, we utilize a deep residual network, to extract features from the video signal, while for the audio modality, we employ a variant of WaveNet encoder for feature extraction. The features from both modalities are fused using multimodal compact bilinear pooling to form a joint representation of the speech signal. To further encode the temporal information, we feed the fused signal to a long short-term memory network and the system is then trained in an end-to-end supervised fashion. Experimental results demonstrate the improved performance of the proposed end-to-end multimodal architecture compared to unimodal variants for VAD. Upon the publication of this paper, we will make the implementation of our proposed models publicly available at https://github.com/iariav/End-to-End-VAD and https://israelcohen.com .
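Multimodal compact bilinear pooling, the fusion step mentioned above, approximates the outer product of the two modality features via count-sketch projections and FFTs. The NumPy sketch below shows the core operation only; the surrounding WaveNet/ResNet encoders and LSTM are omitted, and the feature vectors are random placeholders.

# Multimodal compact bilinear (MCB) pooling sketch with count sketches + FFTs.
import numpy as np

def count_sketch(x, h, s, d):
    """Project vector x into d dims using fixed hash indices h and signs s."""
    y = np.zeros(d)
    np.add.at(y, h, s * x)
    return y

def mcb_pool(audio_feat, video_feat, d=1024, seed=0):
    rng = np.random.default_rng(seed)
    h_a = rng.integers(0, d, size=audio_feat.shape[0])
    s_a = rng.choice([-1.0, 1.0], size=audio_feat.shape[0])
    h_v = rng.integers(0, d, size=video_feat.shape[0])
    s_v = rng.choice([-1.0, 1.0], size=video_feat.shape[0])
    # Circular convolution of the two sketches == element-wise product in FFT domain.
    fused = np.fft.ifft(np.fft.fft(count_sketch(audio_feat, h_a, s_a, d)) *
                        np.fft.fft(count_sketch(video_feat, h_v, s_v, d)))
    return np.real(fused)

audio = np.random.randn(512)     # e.g. audio-encoder features (placeholder)
video = np.random.randn(256)     # e.g. mouth-region ResNet features (placeholder)
print(mcb_pool(audio, video).shape)   # (1024,)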

Posted Content
TL;DR: This work develops attacks that force mistranscription and misidentification in state of the art systems, with minimal impact on human comprehension, and finds that certain English language phonemes are significantly more susceptible to this attack.
Abstract: Automatic speech recognition and voice identification systems are being deployed in a wide array of applications, from providing control mechanisms to devices lacking traditional interfaces, to the automatic transcription of conversations and authentication of users. Many of these applications have significant security and privacy considerations. We develop attacks that force mistranscription and misidentification in state of the art systems, with minimal impact on human comprehension. Processing pipelines for modern systems are comprised of signal preprocessing and feature extraction steps, whose output is fed to a machine-learned model. Prior work has focused on the models, using white-box knowledge to tailor model-specific attacks. We focus on the pipeline stages before the models, which (unlike the models) are quite similar across systems. As such, our attacks are black-box and transferable, and demonstrably achieve mistranscription and misidentification rates as high as 100% by modifying only a few frames of audio. We perform a study via Amazon Mechanical Turk demonstrating that there is no statistically significant difference between human perception of regular and perturbed audio. Our findings suggest that models may learn aspects of speech that are generally not perceived by human subjects, but that are crucial for model accuracy. We also find that certain English language phonemes (in particular, vowels) are significantly more susceptible to our attack. We show that the attacks are effective when mounted over cellular networks, where signals are subject to degradation due to transcoding, jitter, and packet loss.

Proceedings ArticleDOI
12 May 2019
TL;DR: In this article, an unsupervised probabilistic linear discriminant analysis (PLDA) adaptation algorithm is proposed to learn from a small amount of unlabeled in-domain data, inspired by prior work on a feature-based domain adaptation technique known as correlation alignment (CORAL).
Abstract: State-of-the-art speaker recognition systems comprise an x-vector (or i-vector) speaker embedding front-end followed by a probabilistic linear discriminant analysis (PLDA) backend. The effectiveness of these components relies on the availability of a large collection of labeled training data. In practice, it is common that the domains (e.g., language, demographic) in which the system is deployed differ from those in which it was trained. To close the gap due to this domain mismatch, we propose an unsupervised PLDA adaptation algorithm that learns from a small amount of unlabeled in-domain data. The proposed method was inspired by prior work on a feature-based domain adaptation technique known as correlation alignment (CORAL). We refer to the model-based adaptation technique proposed in this paper as CORAL+. The efficacy of the proposed technique is experimentally validated on the recent NIST 2016 and 2018 Speaker Recognition Evaluation (SRE'16, SRE'18) datasets.
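For reference, the feature-space CORAL idea that inspired CORAL+ can be sketched as whitening out-of-domain embeddings with their own covariance and re-coloring them with the in-domain covariance. CORAL+ itself adapts the PLDA model parameters rather than the features, which this NumPy sketch does not show.

# Feature-space CORAL sketch: align the covariance of out-of-domain embeddings
# to an unlabeled in-domain set.
import numpy as np
from scipy.linalg import sqrtm

def coral_transform(source_embs, target_embs, eps=1e-3):
    """Whiten source embeddings with their covariance, re-color with target's."""
    d = source_embs.shape[1]
    cs = np.cov(source_embs, rowvar=False) + eps * np.eye(d)
    ct = np.cov(target_embs, rowvar=False) + eps * np.eye(d)
    A = np.real(sqrtm(np.linalg.inv(cs)) @ sqrtm(ct))
    return source_embs @ A

rng = np.random.default_rng(0)
out_of_domain = rng.normal(size=(5000, 64))
in_domain = rng.normal(size=(500, 64)) * 2.0          # small unlabeled in-domain set
aligned = coral_transform(out_of_domain, in_domain)
print("cov gap before:", np.linalg.norm(np.cov(out_of_domain, rowvar=False)
                                        - np.cov(in_domain, rowvar=False)))
print("cov gap after: ", np.linalg.norm(np.cov(aligned, rowvar=False)
                                        - np.cov(in_domain, rowvar=False)))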

Journal ArticleDOI
TL;DR: This paper improves on SincNet and proposes a SincNet-based classifier, SincNet-R, which consists of three convolutional layers and three deep neural network (DNN) layers, and tests its classification accuracy and robustness on emotional EEG signals.
Abstract: Deep learning (DL) methods have been used increasingly widely, such as in the fields of speech and image recognition. However, how to design an appropriate DL model to accurately and efficiently classify electroencephalogram (EEG) signals is still a challenge, mainly because EEG signals are characterized by significant differences between subjects and variation over time within a single subject, instability, strong randomness, and a low signal-to-noise ratio. SincNet is an efficient classifier for speaker recognition, but it has some drawbacks in dealing with EEG signal classification. In this paper, we improve on SincNet and propose a SincNet-based classifier, SincNet-R, which consists of three convolutional layers and three deep neural network (DNN) layers. We then use SincNet-R to test classification accuracy and robustness on emotional EEG signals. Comparisons with the original SincNet model and other traditional classifiers such as CNN, LSTM and SVM show that our proposed SincNet-R model has higher classification accuracy and better robustness.
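The defining ingredient of SincNet-style models is a first convolutional layer whose kernels are parameterized band-pass sinc filters, so only the cut-off frequencies are learned. The sketch below builds one such Hamming-windowed kernel; SincNet-R's specific layer sizes and training are not reproduced.

# Single SincNet-style band-pass kernel sketch (NumPy).
import numpy as np

def sinc_bandpass_kernel(f_low, f_high, kernel_size=251, fs=16000):
    """Band-pass FIR kernel between f_low and f_high (Hz), Hamming-windowed."""
    n = np.arange(kernel_size) - (kernel_size - 1) / 2
    f1, f2 = f_low / fs, f_high / fs                  # normalized frequencies
    # Difference of two low-pass sinc filters = band-pass filter.
    kernel = 2 * f2 * np.sinc(2 * f2 * n) - 2 * f1 * np.sinc(2 * f1 * n)
    return kernel * np.hamming(kernel_size)

kernel = sinc_bandpass_kernel(f_low=80.0, f_high=300.0)
signal = np.random.randn(16000)                       # placeholder signal frame
filtered = np.convolve(signal, kernel, mode="same")
print(filtered.shape)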

Posted Content
TL;DR: The VoxCeleb Speaker Recognition Challenge 2019 aimed to assess how well current speaker recognition technology is able to identify speakers in unconstrained or 'in the wild' data and provided its baselines, results and discussions.
Abstract: The VoxCeleb Speaker Recognition Challenge 2019 aimed to assess how well current speaker recognition technology is able to identify speakers in unconstrained or 'in the wild' data. It consisted of: (i) a publicly available speaker recognition dataset from YouTube videos together with ground truth annotation and standardised evaluation software; and (ii) a public challenge and workshop held at Interspeech 2019 in Graz, Austria. This paper outlines the challenge and provides its baselines, results and discussions.

Journal ArticleDOI
19 Mar 2019
TL;DR: A novel pipelined near real-time speaker recognition architecture is presented that enhances performance by exploiting hybrid feature extraction techniques combining Gabor Filter, Convolutional Neural Network, and statistical features into a single matrix set.
Abstract: In this paper, we present a novel pipelined near real-time speaker recognition architecture that enhances the performance of speaker recognition by exploiting the advantages of hybrid feature extraction techniques that combine the features of Gabor Filters (GF), Convolutional Neural Networks (CNN), and statistical parameters into a single matrix set. This architecture has been developed to enable secure access to a voice-based user interface (UI) by enabling speaker-based authentication and integration with an existing Natural Language Processing (NLP) system; gaining secure access to existing NLP systems also served as motivation. Initially, we identify challenges related to real-time speaker recognition and highlight recent research in the field. Further, we analyze the functional requirements of a speaker recognition system and introduce the mechanisms that can address these requirements through our novel architecture. Subsequently, the paper discusses the effect of different techniques such as CNN, GF, and statistical parameters on feature extraction. For classification, standard classifiers such as Support Vector Machine (SVM), Random Forest (RF) and Deep Neural Network (DNN) are investigated. To verify the validity and effectiveness of the proposed architecture, we compared different parameters including accuracy, sensitivity, and specificity with the standard AlexNet architecture.

Proceedings ArticleDOI
15 Sep 2019
TL;DR: This work presents a DNN refinement approach that updates a subset of the DNN parameters with full recordings to reduce this mismatch between training and inference when extracting embeddings for long duration recordings.
Abstract: State-of-the-art text-independent speaker recognition systems for long recordings (a few minutes) are based on deep neural network (DNN) speaker embeddings. Current implementations of this paradigm use short speech segments (a few seconds) to train the DNN. This introduces a mismatch between training and inference when extracting embeddings for long duration recordings. To address this, we present a DNN refinement approach that updates a subset of the DNN parameters with full recordings to reduce this mismatch. At the same time, we also modify the DNN architecture to produce embeddings optimized for cosine distance scoring. This is accomplished using a large-margin strategy with angular softmax. Experimental validation shows that our approach is capable of producing embeddings that achieve record performance on the SITW benchmark.

Proceedings ArticleDOI
12 May 2019
TL;DR: In this article, adversarial domain adaptation is used for addressing the problem of language mismatch between speaker recognition corpora by minimizing certain divergences between the distribution that the utterance-level features follow (i.e. speaker embeddings) when drawn from source and target domains.
Abstract: In this paper we investigate the use of adversarial domain adaptation for addressing the problem of language mismatch between speaker recognition corpora. In the context of speaker verification, adversarial domain adaptation methods aim at minimizing certain divergences between the distribution that the utterance-level features follow (i.e. speaker embeddings) when drawn from source and target domains (i.e. languages), while preserving their capacity in recognizing speakers. Neural architectures for extracting utterance-level representations enable us to apply adversarial adaptation methods in an end-to-end fashion and train the network jointly with the standard cross-entropy loss. We examine several configurations, such as the use of (pseudo-)labels on the target domain as well as domain labels in the feature extractor, and we demonstrate the effectiveness of our method on the challenging NIST SRE16 and SRE18 benchmarks.

Posted Content
TL;DR: In this article, a self multi-head attention mechanism was proposed to obtain a discriminative speaker embedding given non-fixed length speech utterances, which outperformed both temporal and statistical pooling methods with a 18% of relative EER.
Abstract: Most state-of-the-art Deep Learning (DL) approaches for speaker recognition work at a short utterance level. Given the speech signal, these algorithms extract a sequence of speaker embeddings from short segments and average them to obtain an utterance-level speaker representation. In this work we propose the use of an attention mechanism to obtain a discriminative speaker embedding given non-fixed-length speech utterances. Our system is based on a Convolutional Neural Network (CNN) that encodes short-term speaker features from the spectrogram and a self multi-head attention model that maps these representations into a long-term speaker embedding. The attention model that we propose produces multiple alignments from different subsegments of the CNN encoded states over the sequence. Hence this mechanism works as a pooling layer which selects the most discriminative features over the sequence to obtain an utterance-level representation. We have tested this approach on the verification task for the VoxCeleb1 dataset. The results show that self multi-head attention outperforms both temporal and statistical pooling methods with an 18% relative improvement in EER. The obtained results also show a 58% relative improvement in EER compared to i-vector+PLDA.

Journal ArticleDOI
TL;DR: Results identify a reduced subset of relevant features, which are used in a hierarchical-like scenario incorporating information from different speech tasks, opening a discussion about the suitability of transferring these techniques to the clinical setting.