
Showing papers on "Speaker recognition published in 2019"


Proceedings ArticleDOI
12 May 2019
TL;DR: This paper proposes a powerful speaker recognition deep network, using a ‘thin-ResNet’ trunk architecture, and a dictionary-based NetVLAD or GhostVLAD layer to aggregate features across time, that can be trained end-to-end.
Abstract: The objective of this paper is speaker recognition ‘in the wild’ – where utterances may be of variable length and also contain irrelevant signals. Crucial elements in the design of deep networks for this task are the type of trunk (frame level) network, and the method of temporal aggregation. We propose a powerful speaker recognition deep network, using a ‘thin-ResNet’ trunk architecture, and a dictionary-based NetVLAD or GhostVLAD layer to aggregate features across time, that can be trained end-to-end. We show that our network achieves state of the art performance by a significant margin on the VoxCeleb1 test set for speaker recognition, whilst requiring fewer parameters than previous methods. We also investigate the effect of utterance length on performance, and conclude that for ‘in the wild’ data, a longer length is beneficial.
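To make the aggregation step concrete, here is a minimal NumPy sketch of dictionary-based NetVLAD-style pooling over frame-level features. The cluster centers, the softmax assignment sharpness, and all dimensions are illustrative placeholders rather than the paper's learned parameters; GhostVLAD additionally discards a few "ghost" clusters from the output.

# Minimal NetVLAD-style aggregation sketch (NumPy). Shapes and parameters are
# illustrative; in the paper the assignment weights and centers are learned
# end-to-end on top of a thin-ResNet trunk.
import numpy as np

def netvlad_aggregate(frames, centers, alpha=10.0):
    """frames: (T, D) frame-level features; centers: (K, D) dictionary."""
    # Soft-assign each frame to every cluster via a scaled dot-product softmax.
    logits = alpha * frames @ centers.T              # (T, K)
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    assign = np.exp(logits)
    assign /= assign.sum(axis=1, keepdims=True)      # (T, K)

    # Aggregate residuals between frames and centers, weighted by assignments:
    # vlad[k] = sum_t assign[t, k] * (frames[t] - centers[k])
    vlad = assign.T @ frames - assign.sum(axis=0)[:, None] * centers  # (K, D)

    # Intra-normalize per cluster, then L2-normalize the flattened descriptor.
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12
    vlad = vlad.reshape(-1)
    return vlad / (np.linalg.norm(vlad) + 1e-12)

T, D, K = 300, 512, 8                 # frames, feature dim, dictionary size
emb = netvlad_aggregate(np.random.randn(T, D), np.random.randn(K, D))
print(emb.shape)                      # (K * D,) fixed-length utterance embedding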

308 citations


Proceedings ArticleDOI
12 May 2019
TL;DR: It is found that diarization substantially reduces error rate when there are multiple speakers, while maintaining excellent performance on single-speaker recordings.
Abstract: Recently, deep neural networks that map utterances to fixed-dimensional embeddings have emerged as the state-of-the-art in speaker recognition. Our prior work introduced x-vectors, an embedding that is very effective for both speaker recognition and diarization. This paper combines our previous work and applies it to the problem of speaker recognition on multi-speaker conversations. We measure performance on Speakers in the Wild and report what we believe are the best published error rates on this dataset. Moreover, we find that diarization substantially reduces error rate when there are multiple speakers, while maintaining excellent performance on single-speaker recordings. Finally, we introduce an easily implemented method to remove the domain-sensitive threshold typically used in the clustering stage of a diarization system. The proposed method is more robust to domain shifts, and achieves similar results to those obtained using a well-tuned threshold.
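As an illustration of recognizing an enrolled speaker in a multi-speaker recording, the sketch below clusters segment embeddings into speakers and verifies each resulting cluster against the enrollment embedding. It is only a toy stand-in: the paper's systems use x-vectors with a PLDA backend and a full diarization pipeline rather than cosine scoring with a fixed number of clusters.

# Hypothetical diarize-then-verify sketch: cluster segment embeddings into
# speakers, average each cluster, and take the best match against the enrolled
# speaker. Cosine scoring and a fixed cluster count are simplifications.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def verify_multispeaker(segment_embs, enroll_emb, n_speakers=2):
    """segment_embs: (N, D) embeddings of short segments from one recording."""
    labels = AgglomerativeClustering(n_clusters=n_speakers).fit_predict(segment_embs)
    # Score the enrolled speaker against each diarized cluster; report the best.
    scores = [cosine(segment_embs[labels == k].mean(axis=0), enroll_emb)
              for k in np.unique(labels)]
    return max(scores)

rng = np.random.default_rng(0)
segs = rng.normal(size=(20, 128))          # placeholder segment embeddings
print(verify_multispeaker(segs, rng.normal(size=128)))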

280 citations



Posted Content
TL;DR: The submission of the Brno University of Technology (BUT) team to the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2019 is described: a fusion of 4 Convolutional Neural Network (CNN) topologies whose best systems for the Fixed and Open conditions achieved 1.42% and 1.26% EER on the challenge evaluation set, respectively.
Abstract: In this report, we describe the submission of the Brno University of Technology (BUT) team to the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2019. We also provide a brief analysis of different systems on VoxCeleb-1 test sets. Submitted systems for both Fixed and Open conditions are a fusion of 4 Convolutional Neural Network (CNN) topologies. The first and second networks have ResNet34 topology and use two-dimensional CNNs. The last two networks are one-dimensional CNNs based on the x-vector extraction topology. Some of the networks are fine-tuned using additive margin angular softmax. Kaldi FBanks and Kaldi PLPs were used as features. The difference between the Fixed and Open systems lies in the training data used and the fusion strategy. The best systems for the Fixed and Open conditions achieved 1.42% and 1.26% EER on the challenge evaluation set, respectively.
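The submitted systems are fusions of several subsystems at the score level. The sketch below shows one generic way such a fusion can be implemented, by learning linear weights with logistic regression on development trials; it is not BUT's actual fusion recipe, and the scores and labels are random placeholders.

# Generic score-level fusion sketch: learn linear fusion weights for several
# subsystems with logistic regression on development trials.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_trials, n_systems = 1000, 4
dev_scores = rng.normal(size=(n_trials, n_systems))      # per-system scores
dev_labels = rng.integers(0, 2, size=n_trials)           # 1 = target trial

fuser = LogisticRegression().fit(dev_scores, dev_labels)

def fuse(eval_scores):
    """eval_scores: (M, n_systems) -> fused log-odds scores."""
    return eval_scores @ fuser.coef_.ravel() + fuser.intercept_[0]

print(fuse(rng.normal(size=(5, n_systems))))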

167 citations


Proceedings ArticleDOI
18 Mar 2019
TL;DR: This paper makes hidden voice command attacks practical and model-agnostic by exploiting knowledge of the signal processing algorithms commonly used by VPSes, in particular the fact that multiple source audio samples have similar feature vectors when transformed by acoustic feature extraction algorithms.
Abstract: Voice Processing Systems (VPSes), now widely deployed, have been made significantly more accurate through the application of recent advances in machine learning. However, adversarial machine learning has similarly advanced and has been used to demonstrate that VPSes are vulnerable to the injection of hidden commands - audio obscured by noise that is correctly recognized by a VPS but not by human beings. Such attacks, though, are often highly dependent on white-box knowledge of a specific machine learning model and limited to specific microphones and speakers, making their use across different acoustic hardware platforms (and thus their practicality) limited. In this paper, we break these dependencies and make hidden command attacks more practical through model-agnostic (blackbox) attacks, which exploit knowledge of the signal processing algorithms commonly used by VPSes to generate the data fed into machine learning systems. Specifically, we exploit the fact that multiple source audio samples have similar feature vectors when transformed by acoustic feature extraction algorithms (e.g., FFTs). We develop four classes of perturbations that create unintelligible audio and test them against 12 machine learning models, including 7 proprietary models (e.g., Google Speech API, Bing Speech API, IBM Speech API, Azure Speaker API, etc), and demonstrate successful attacks against all targets. Moreover, we successfully use our maliciously generated audio samples in multiple hardware configurations, demonstrating effectiveness across both models and real systems. In so doing, we demonstrate that domain-specific knowledge of audio signal processing represents a practical means of generating successful hidden voice command attacks.
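The core observation, that many different waveforms map to near-identical magnitude-spectrum features, can be illustrated with the hedged sketch below: it resynthesizes audio from the STFT magnitude with random phase, producing a different time-domain signal whose FFT/MFCC-style features stay close to the original. This is only an illustration of the principle, not one of the paper's four perturbation classes.

# Illustration of the underlying idea, not the paper's perturbations: many
# waveforms share near-identical magnitude-spectrum features. Re-synthesizing
# from the STFT magnitude with random phase gives audio that differs in the
# time domain but keeps similar magnitude-based features.
import numpy as np
from scipy.signal import stft, istft

def random_phase_version(audio, fs=16000, nperseg=512):
    _, _, S = stft(audio, fs=fs, nperseg=nperseg)
    mag = np.abs(S)
    phase = np.exp(1j * np.random.uniform(-np.pi, np.pi, size=mag.shape))
    _, rebuilt = istft(mag * phase, fs=fs, nperseg=nperseg)
    return rebuilt

fs = 16000
t = np.arange(fs) / fs
clean = np.sin(2 * np.pi * 220 * t)          # stand-in for a speech sample
perturbed = random_phase_version(clean, fs)
print(clean.shape, perturbed.shape)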

108 citations


Proceedings ArticleDOI
15 Sep 2019
TL;DR: Very deep x-vector architectures (Extended and Factorized TDNN, and ResNets) clearly outperformed shallower x-vectors and i-vectors in NIST SRE18, and the Extended TDNN x-vector was the best single system.
Abstract: We present a condensed description of the joint effort of JHU-CLSP, JHU-HLTCOE, MIT-LL, MIT CSAIL and LSE-EPITA for NIST SRE18. All the developed systems consisted of x-vector/i-vector embeddings with some flavor of PLDA backend. Very deep x-vector architectures (Extended and Factorized TDNN, and ResNets) clearly outperformed shallower x-vectors and i-vectors. The systems were tailored to the video (VAST) or to the telephone (CMN2) condition. The VAST data was challenging, yielding 4 times worse performance than other video-based datasets like Speakers in the Wild. We were able to calibrate the VAST data with very few development trials by using careful adaptation and score normalization methods. The VAST primary fusion yielded EER=10.18% and Cprimary=0.431. By improving calibration in post-eval, we reached Cprimary=0.369. In CMN2, we used unsupervised SPLDA adaptation based on agglomerative clustering and score normalization to correct the domain shift between English and Tunisian Arabic models. The CMN2 primary fusion yielded EER=4.5% and Cprimary=0.313. The Extended TDNN x-vector was the best single system, obtaining EER=11.1% and Cprimary=0.452 in VAST, and 4.95% and 0.354 in CMN2.
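Score normalization is mentioned as a key ingredient for calibrating these systems. Below is a generic symmetric score normalization (S-norm) sketch over a cohort of embeddings; the exact variant used in the paper (e.g., adaptive S-norm with a top-scoring cohort) may differ, and all embeddings here are random placeholders.

# Generic symmetric score normalization (S-norm) sketch, a technique commonly
# used in speaker verification systems of this kind.
import numpy as np

def s_norm(raw_score, enroll_emb, test_emb, cohort_embs):
    """Normalize a cosine score with enrollment-side and test-side cohort stats."""
    def cos(a, B):
        return (B @ a) / (np.linalg.norm(B, axis=1) * np.linalg.norm(a) + 1e-12)
    ze = cos(enroll_emb, cohort_embs)     # enrollment vs cohort scores
    zt = cos(test_emb, cohort_embs)       # test vs cohort scores
    return 0.5 * ((raw_score - ze.mean()) / (ze.std() + 1e-12)
                  + (raw_score - zt.mean()) / (zt.std() + 1e-12))

rng = np.random.default_rng(1)
e, x, cohort = rng.normal(size=128), rng.normal(size=128), rng.normal(size=(200, 128))
raw = float(e @ x / (np.linalg.norm(e) * np.linalg.norm(x)))
print(s_norm(raw, e, x, cohort))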

101 citations


Posted Content
TL;DR: This paper conducts the first comprehensive and systematic study of the adversarial attacks on SR systems (SRSs) to understand their security weakness in the practical black-box setting, and proposes an adversarial attack, named FakeBob, to craft adversarial samples.
Abstract: Speaker recognition (SR) is widely used in our daily life as a biometric authentication or identification mechanism. The popularity of SR brings in serious security concerns, as demonstrated by recent adversarial attacks. However, the impacts of such threats in the practical black-box setting are still open, since current attacks consider the white-box setting only. In this paper, we conduct the first comprehensive and systematic study of adversarial attacks on SR systems (SRSs) to understand their security weakness in the practical black-box setting. For this purpose, we propose an adversarial attack, named FAKEBOB, to craft adversarial samples. Specifically, we formulate adversarial sample generation as an optimization problem, incorporating the confidence of adversarial samples and maximal distortion to balance the strength and imperceptibility of adversarial voices. One key contribution is a novel algorithm to estimate the score threshold, a feature of SRSs, which is then used within the optimization problem. We demonstrate that FAKEBOB achieves a 99% targeted attack success rate on both open-source and commercial systems. We further demonstrate that FAKEBOB is also effective on both open-source and commercial systems when played over the air in the physical world. Moreover, we have conducted a human study which reveals that it is hard for humans to differentiate the speakers of the original and adversarial voices. Last but not least, we show that four promising defense methods against adversarial attacks from the speech recognition domain become ineffective on SRSs against FAKEBOB, which calls for more effective defense methods. We highlight that our study peeks into the security implications of adversarial attacks on SRSs and provides a realistic basis for improving their security robustness.
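As a purely hypothetical illustration of the threshold-estimation idea (not FakeBob's actual algorithm), the toy sketch below assumes a black-box API that returns a similarity score and an accept/reject decision; the unknown decision threshold must then lie between the largest rejected score and the smallest accepted score observed.

# Hypothetical illustration of black-box threshold estimation. This is only a
# toy stand-in for the estimation algorithm described in the paper.
import numpy as np

TRUE_THRESHOLD = 0.62          # unknown to the attacker; used only to simulate

def query(sample_score):
    """Simulated black-box API: returns (score, accepted?)."""
    return sample_score, sample_score >= TRUE_THRESHOLD

rng = np.random.default_rng(0)
rejected, accepted = [], []
for s in rng.uniform(0.0, 1.0, size=200):      # probe with varied inputs
    score, ok = query(s)
    (accepted if ok else rejected).append(score)

low = max(rejected) if rejected else 0.0
high = min(accepted) if accepted else 1.0
print(f"estimated threshold in [{low:.3f}, {high:.3f}]")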

98 citations


Proceedings ArticleDOI
15 Sep 2019
TL;DR: Initial results indicate that effective use of the development data was essential for the top performing systems, and that domain/channel, language, and duration mismatch had an adverse impact on system performance.
Abstract: In 2016, the National Institute of Standards and Technology (NIST) conducted the most recent in an ongoing series of speaker recognition evaluations (SRE) to foster research in robust text-independent speaker recognition, as well as measure performance of current state-of-the-art systems. Compared to previous NIST SREs, SRE16 introduced several new aspects including: an entirely online evaluation platform, a fixed training data condition, more variability in test segment duration (uniformly distributed between 10s and 60s), the use of non-English (Cantonese, Cebuano, Mandarin and Tagalog) conversational telephone speech (CTS) collected outside North America, and providing labeled and unlabeled development (a.k.a. validation) sets for system hyperparameter tuning and adaptation. The introduction of the new non-English CTS data made SRE16 more challenging due to domain/channel and language mismatches as compared to previous SREs. A total of 66 research organizations from industry and academia registered for SRE16, out of which 43 teams submitted 121 valid system outputs that produced scores. This paper presents an overview of the evaluation and analysis of system performance over all primary evaluation conditions. Initial results indicate that effective use of the development data was essential for the top performing systems, and that domain/channel, language, and duration mismatch had an adverse impact on system performance.

94 citations


Proceedings ArticleDOI
01 May 2019
TL;DR: In this article, the authors optimize a speaker embedding model with prototypical network loss (PNL), a state-of-the-art approach for the few-shot image classification task.
Abstract: Speaker embedding models that utilize neural networks to map utterances to a space where distances reflect similarity between speakers have driven recent progress in the speaker recognition task. However, there is still a significant performance gap between recognizing speakers in the training set and unseen speakers. The latter case corresponds to the few-shot learning task, where a trained model is evaluated on unseen classes. Here, we optimize a speaker embedding model with prototypical network loss (PNL), a state-of-the-art approach for the few-shot image classification task. The resulting embedding model outperforms the state-of-the-art triplet loss based models in both speaker verification and identification tasks, for both seen and unseen speakers.
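A minimal PyTorch sketch of the prototypical network loss used here: support embeddings of each speaker in an episode are averaged into a prototype, and query embeddings are classified by a softmax over negative squared distances to the prototypes. The embedding network itself and all sizes are placeholders.

# Minimal sketch of a prototypical-network-style episode loss for speaker
# embeddings (PyTorch).
import torch
import torch.nn.functional as F

def prototypical_loss(support, query, query_labels):
    """support: (n_spk, n_shot, D); query: (n_query, D); labels index speakers."""
    prototypes = support.mean(dim=1)                          # (n_spk, D)
    dists = torch.cdist(query, prototypes) ** 2               # (n_query, n_spk)
    return F.cross_entropy(-dists, query_labels)              # softmax over -distance

n_spk, n_shot, d = 5, 3, 128
support = torch.randn(n_spk, n_shot, d)
query = torch.randn(10, d)
labels = torch.randint(0, n_spk, (10,))
print(prototypical_loss(support, query, labels).item())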

94 citations


Journal ArticleDOI
TL;DR: The requirements for effective privacy preservation are established, generic cryptography-based solutions are reviewed, followed by specific techniques applicable to speaker characterisation (biometric) and speech characterisation (non-biometric) applications, and common empirical evaluation metrics for the assessment of privacy-preserving technologies for speech data are outlined.

91 citations


Journal ArticleDOI
TL;DR: This work proposes a two-stage strategy to enhance corrupted speech, where denoising and dereverberation are conducted sequentially using deep neural networks, and designs a new objective function that incorporates clean phase during model training to better estimate spectral magnitudes.
Abstract: In real-world situations, speech reaching our ears is commonly corrupted by both room reverberation and background noise. These distortions are detrimental to speech intelligibility and quality, and also pose a serious problem to many speech-related applications, including automatic speech and speaker recognition. In order to deal with the combined effects of noise and reverberation, we propose a two-stage strategy to enhance corrupted speech, where denoising and dereverberation are conducted sequentially using deep neural networks. In addition, we design a new objective function that incorporates clean phase during model training to better estimate spectral magnitudes, which would in turn yield better phase estimates when combined with iterative phase reconstruction. The two-stage model is then jointly trained to optimize the proposed objective function. Systematic evaluations and comparisons show that the proposed algorithm improves objective metrics of speech intelligibility and quality substantially, and significantly outperforms previous one-stage enhancement systems.
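One common way to fold clean phase into a magnitude objective, in the spirit of the objective described above, is a phase-sensitive target |S| * cos(theta_s - theta_y). The NumPy sketch below illustrates this idea; the paper's exact formulation and the iterative phase reconstruction step are not reproduced.

# Phase-sensitive magnitude objective sketch (NumPy): train toward
# |S| * cos(theta_s - theta_y) instead of |S|, so magnitude errors are
# penalized in a phase-aware way.
import numpy as np

def phase_sensitive_target(clean_stft, noisy_stft):
    """Element-wise phase-sensitive magnitude target."""
    phase_diff = np.angle(clean_stft) - np.angle(noisy_stft)
    return np.abs(clean_stft) * np.cos(phase_diff)

def loss(estimated_mag, clean_stft, noisy_stft):
    target = phase_sensitive_target(clean_stft, noisy_stft)
    return float(np.mean((estimated_mag - target) ** 2))

rng = np.random.default_rng(0)
S = rng.normal(size=(257, 100)) + 1j * rng.normal(size=(257, 100))   # clean STFT
Y = S + 0.5 * (rng.normal(size=S.shape) + 1j * rng.normal(size=S.shape))
print(loss(np.abs(Y), S, Y))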

Proceedings ArticleDOI
12 May 2019
TL;DR: This work proposes an adversarial speaker verification (ASV) scheme to learn the condition-invariant deep embedding via adversarial multi-task training and proposes multi-factorial ASV to simultaneously suppress multiple factors that constitute the condition variability.
Abstract: The use of deep networks to extract embeddings for speaker recognition has proven successful. However, such embeddings are susceptible to performance degradation due to mismatches among the training, enrollment, and test conditions. In this work, we propose an adversarial speaker verification (ASV) scheme to learn condition-invariant deep embeddings via adversarial multi-task training. In ASV, a speaker classification network and a condition identification network are jointly optimized to minimize the speaker classification loss and simultaneously mini-maximize the condition loss. The target labels of the condition network can be categorical (environment types) or continuous (SNR values). We further propose multi-factorial ASV to simultaneously suppress multiple factors that constitute the condition variability. Evaluated on a Microsoft Cortana text-dependent speaker verification task, ASV achieves 8.8% and 14.5% relative improvements in equal error rate (EER) for known and unknown conditions, respectively.
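Adversarial multi-task training of this kind is often implemented with a gradient reversal layer, as in the PyTorch sketch below: the speaker head is trained normally while gradients from the condition head are flipped before reaching the shared embedding, pushing it toward condition invariance. Network sizes and label counts are illustrative, not those of the Cortana task.

# Gradient-reversal sketch for adversarial multi-task training (PyTorch).
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_output):
        # Flip (and scale) the gradient flowing back into the embedding.
        return -ctx.lam * grad_output, None

embedder = nn.Sequential(nn.Linear(40, 256), nn.ReLU(), nn.Linear(256, 128))
speaker_head = nn.Linear(128, 500)      # e.g. 500 training speakers
condition_head = nn.Linear(128, 4)      # e.g. 4 environment types

feats = torch.randn(32, 40)
spk_labels = torch.randint(0, 500, (32,))
cond_labels = torch.randint(0, 4, (32,))

emb = embedder(feats)
loss = nn.functional.cross_entropy(speaker_head(emb), spk_labels) \
     + nn.functional.cross_entropy(condition_head(GradReverse.apply(emb, 1.0)), cond_labels)
loss.backward()   # the embedding is pushed to confuse the condition classifier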

Journal ArticleDOI
TL;DR: It is shown that a limited number of real RIRs, carefully selected to match the target environment, provide results comparable to a large number of artificially generated RIRs, and that both sets can be combined to achieve the best ASR results.
Abstract: This paper presents BUT ReverbDB, a dataset of real room impulse responses (RIR), background noises, and retransmitted speech data. The retransmitted data include LibriSpeech test-clean, 2000 HUB5 English evaluation, and part of the 2010 NIST Speaker Recognition Evaluation datasets. We provide a detailed description of the RIR collection (hardware, software, post-processing) that can serve as a "cook-book" for similar efforts. We also validate BUT ReverbDB in two sets of automatic speech recognition (ASR) experiments and draw conclusions for augmenting ASR training data with real and artificially generated RIRs. We show that a limited number of real RIRs, carefully selected to match the target environment, provide results comparable to a large number of artificially generated RIRs, and that both sets can be combined to achieve the best ASR results. The dataset is distributed for free under a non-restrictive license and currently contains data from eight rooms, with more being added. The distribution package also contains a Kaldi-based recipe for augmenting the publicly available AMI close-talk meeting data and testing the results on the AMI single distant microphone set, allowing our experiments to be reproduced.
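A minimal sketch of the kind of augmentation such a dataset supports: convolve clean speech with a measured room impulse response and add background noise at a chosen SNR. The RIR, speech, and noise below are synthetic placeholders standing in for files from the corpus.

# RIR + noise augmentation sketch (NumPy/SciPy) with placeholder signals.
import numpy as np
from scipy.signal import fftconvolve

def reverberate_and_add_noise(speech, rir, noise, snr_db=10.0):
    reverbed = fftconvolve(speech, rir, mode="full")[: len(speech)]
    noise = noise[: len(reverbed)]
    # Scale noise to reach the requested SNR relative to the reverberated speech.
    speech_power = np.mean(reverbed ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    noise_scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10.0)))
    return reverbed + noise_scale * noise

rng = np.random.default_rng(0)
speech = rng.normal(size=16000)                                  # placeholder utterance
rir = np.exp(-np.linspace(0, 8, 4000)) * rng.normal(size=4000)   # toy decaying RIR
noise = rng.normal(size=16000)
augmented = reverberate_and_add_noise(speech, rir, noise, snr_db=10.0)
print(augmented.shape)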

Proceedings ArticleDOI
01 Apr 2019
TL;DR: This paper designs and implements VoicePop, a robust software-only anti-spoofing system on smartphones that leverages the pop noise, which is produced by the user breathing while speaking close to the microphone, to identify legitimate users and defend against spoofing attacks.
Abstract: Voice biometrics is widely adopted for identity authentication in mobile devices. However, voice authentication is vulnerable to spoofing attacks, where an adversary may deceive the voice authentication system with pre-recorded or synthesized samples from the legitimate user or by impersonating the speaking style of the targeted user. In this paper, we design and implement VoicePop, a robust software-only anti-spoofing system on smartphones. VoicePop leverages the pop noise, which is produced by the user breathing while speaking close to the microphone. The pop noise is delicate and subject to user diversity, making it hard to record by replay attacks beyond a certain distance and to imitate precisely by impersonators. We design a novel pop noise detection scheme to pinpoint pop noises at the phonemic level, based on which we establish individually unique relationship between phonemes and pop noises to identify legitimate users and defend against spoofing attacks. Our experimental results with 18 participants and three types of smartphones show that VoicePop achieves over 93.5% detection accuracy at around 5.4% equal error rate. VoicePop requires no additional hardware but only the built-in microphones in virtually all smartphones, which can be readily integrated in existing voice authentication systems for mobile devices.

Proceedings ArticleDOI
01 Dec 2019
TL;DR: Simple classifiers are used to investigate the contents encoded by x-vector embeddings for information related to the speaker, channel, transcription, and meta information about the utterance and compare these with the information encoded by i-vectors across a varying number of dimensions.
Abstract: Deep neural network based speaker embeddings, such as x-vectors, have been shown to perform well in text-independent speaker recognition/verification tasks. In this paper, we use simple classifiers to investigate the contents encoded by x-vector embeddings. We probe these embeddings for information related to the speaker, channel, transcription (sentence, words, phones), and meta information about the utterance (duration and augmentation type), and compare these with the information encoded by i-vectors across a varying number of dimensions. We also study the effect of data augmentation during extractor training on the information captured by x-vectors. Experiments on the RedDots data set show that x-vectors capture spoken content and channel-related information, while performing well on speaker verification tasks.
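The probing methodology itself is simple and can be sketched as follows: freeze the embeddings, train a lightweight classifier to predict a property of interest (e.g., augmentation type or channel), and treat accuracy well above chance as evidence that the property is encoded. The data below are random placeholders rather than actual x-vectors.

# Probing-classifier sketch: test whether a property is linearly decodable
# from fixed embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
xvectors = rng.normal(size=(2000, 512))           # placeholder embeddings
labels = rng.integers(0, 5, size=2000)            # e.g. 5 augmentation types

X_tr, X_te, y_tr, y_te = train_test_split(xvectors, labels, test_size=0.25,
                                          random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# Accuracy well above chance would indicate the property is encoded.
print("probe accuracy:", probe.score(X_te, y_te))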

Posted Content
TL;DR: The "VOiCES from a Distance Challenge 2019" is designed to foster research in the area of speaker recognition and automatic speech recognition with the special focus on single channel distant/far-field audio, under noisy conditions.
Abstract: The "VOiCES from a Distance Challenge 2019" is designed to foster research in the area of speaker recognition and automatic speech recognition (ASR) with the special focus on single channel distant/far-field audio, under noisy conditions. The main objectives of this challenge are to: (i) benchmark state-of-the-art technology in the area of speaker recognition and automatic speech recognition (ASR), (ii) support the development of new ideas and technologies in speaker recognition and ASR, (iii) support new research groups entering the field of distant/far-field speech processing, and (iv) provide a new, publicly available dataset to the community that exhibits realistic distance characteristics.

Posted Content
Xu Xiang, Shuai Wang, Houjun Huang, Yanmin Qian, Kai Yu
TL;DR: Three different margin-based losses, which not only separate classes but also demand a fixed margin between classes, are introduced to deep speaker embedding learning, and it is demonstrated that the margin is the key to obtaining more discriminative speaker embeddings.
Abstract: Recently, speaker embeddings extracted from a speaker-discriminative deep neural network (DNN) yield better performance than conventional methods such as i-vectors. In most cases, the DNN speaker classifier is trained using cross entropy loss with softmax. However, this kind of loss function does not explicitly encourage inter-class separability and intra-class compactness. As a result, the embeddings are not optimal for speaker recognition tasks. In this paper, to address this issue, three different margin-based losses, which not only separate classes but also demand a fixed margin between classes, are introduced to deep speaker embedding learning. We demonstrate that the margin is the key to obtaining more discriminative speaker embeddings. Experiments are conducted on two public text-independent tasks: VoxCeleb1 and Speakers in the Wild (SITW). The proposed approach achieves state-of-the-art performance, with 25% ~ 30% equal error rate (EER) reduction on both tasks compared to strong baselines using cross entropy loss with softmax, obtaining 2.238% EER on the VoxCeleb1 test set and 2.761% EER on the SITW core-core test set, respectively.
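One of the margin-based losses in this family, additive margin softmax, can be sketched in a few lines of PyTorch: logits are cosine similarities to class weight vectors, the target class has a fixed margin m subtracted, and everything is scaled by s before cross entropy. The margin and scale values below are typical choices, not necessarily the ones used in the paper.

# Minimal additive-margin softmax (AM-softmax) sketch in PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AMSoftmax(nn.Module):
    def __init__(self, emb_dim, n_classes, m=0.2, s=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_classes, emb_dim))
        self.m, self.s = m, s

    def forward(self, emb, labels):
        # Cosine similarity between L2-normalized embeddings and class weights.
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))  # (B, C)
        # Subtract the margin only at the target class position.
        margin = torch.zeros_like(cos).scatter_(1, labels.unsqueeze(1), self.m)
        return F.cross_entropy(self.s * (cos - margin), labels)

loss_fn = AMSoftmax(emb_dim=256, n_classes=1211)    # e.g. VoxCeleb1 dev speakers
emb = torch.randn(32, 256)
labels = torch.randint(0, 1211, (32,))
print(loss_fn(emb, labels).item())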

Proceedings ArticleDOI
15 Sep 2019
TL;DR: This work proposes the use of an attention mechanism to obtain a discriminative speaker embedding given non-fixed-length speech utterances, based on a Convolutional Neural Network that encodes short-term speaker features from the spectrogram and a self multi-head attention model that maps these representations into a long-term speaker embedding.
Abstract: Most state-of-the-art Deep Learning (DL) approaches for speaker recognition work at a short utterance level. Given the speech signal, these algorithms extract a sequence of speaker embeddings from short segments and average them to obtain an utterance-level speaker representation. In this work we propose the use of an attention mechanism to obtain a discriminative speaker embedding given non-fixed-length speech utterances. Our system is based on a Convolutional Neural Network (CNN) that encodes short-term speaker features from the spectrogram and a self multi-head attention model that maps these representations into a long-term speaker embedding. The attention model that we propose produces multiple alignments from different subsegments of the CNN encoded states over the sequence. Hence this mechanism works as a pooling layer which selects the most discriminative features over the sequence to obtain an utterance-level representation. We have tested this approach on the verification task for the VoxCeleb1 dataset. The results show that self multi-head attention outperforms both temporal and statistical pooling methods with an 18% relative improvement in EER. The obtained results also show a 58% relative improvement in EER compared to i-vector+PLDA.
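A minimal PyTorch sketch of multi-head attentive pooling over CNN-encoded frames: each head computes its own softmax weighting over time on a slice of the feature vector, and the per-head summaries are concatenated into an utterance-level embedding. The parameterization is illustrative and may differ from the paper's exact model.

# Multi-head attentive pooling sketch (PyTorch) over a sequence of frames.
import torch
import torch.nn as nn

class MultiHeadAttentivePooling(nn.Module):
    def __init__(self, feat_dim, n_heads=4):
        super().__init__()
        assert feat_dim % n_heads == 0
        self.n_heads = n_heads
        self.score = nn.Linear(feat_dim // n_heads, 1, bias=False)

    def forward(self, x):                           # x: (B, T, feat_dim)
        B, T, D = x.shape
        x = x.view(B, T, self.n_heads, D // self.n_heads)
        w = torch.softmax(self.score(x), dim=1)     # attention over time, per head
        pooled = (w * x).sum(dim=1)                 # (B, n_heads, D // n_heads)
        return pooled.reshape(B, D)                 # utterance-level embedding

pool = MultiHeadAttentivePooling(feat_dim=256, n_heads=4)
frames = torch.randn(8, 300, 256)                   # batch of 300-frame encodings
print(pool(frames).shape)                           # torch.Size([8, 256])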

Proceedings ArticleDOI
Xu Xiang, Shuai Wang, Houjun Huang, Yanmin Qian, Kai Yu
18 Jun 2019
TL;DR: In this article, three different margin-based losses which not only separate classes but also demand a fixed margin between classes are introduced to deep speaker embedding learning to obtain more discriminative speaker embeddings.
Abstract: Recently, speaker embeddings extracted from a speaker-discriminative deep neural network (DNN) yield better performance than conventional methods such as i-vectors. In most cases, the DNN speaker classifier is trained using cross entropy loss with softmax. However, this kind of loss function does not explicitly encourage inter-class separability and intra-class compactness. As a result, the embeddings are not optimal for speaker recognition tasks. In this paper, to address this issue, three different margin-based losses, which not only separate classes but also demand a fixed margin between classes, are introduced to deep speaker embedding learning. We demonstrate that the margin is the key to obtaining more discriminative speaker embeddings. Experiments are conducted on two public text-independent tasks: VoxCeleb1 and Speakers in the Wild (SITW). The proposed approach achieves state-of-the-art performance, with 25% ~ 30% equal error rate (EER) reduction on both tasks compared to strong baselines using cross entropy loss with softmax, obtaining 2.238% EER on the VoxCeleb1 test set and 2.761% EER on the SITW core-core test set, respectively.

Proceedings ArticleDOI
15 Sep 2019
TL;DR: The extent to which users can be recognized based on the encoded representation of their speech as obtained by a deep encoder-decoder architecture trained for ASR is studied and adversarial training is proposed to learn representations that perform well in ASR while hiding speaker identity.
Abstract: Automatic speech recognition (ASR) is a key technology in many services and applications. This typically requires user devices to send their speech data to the cloud for ASR decoding. As the speech signal carries a lot of information about the speaker, this raises serious privacy concerns. As a solution, an encoder may reside on each user device which performs local computations to anonymize the representation. In this paper, we focus on the protection of speaker identity and study the extent to which users can be recognized based on the encoded representation of their speech as obtained by a deep encoder-decoder architecture trained for ASR. Through speaker identification and verification experiments on the Librispeech corpus with open and closed sets of speakers, we show that the representations obtained from a standard architecture still carry a lot of information about speaker identity. We then propose to use adversarial training to learn representations that perform well in ASR while hiding speaker identity. Our results demonstrate that adversarial training dramatically reduces the closed-set classification accuracy, but this does not translate into increased open-set verification error, and hence does not increase protection of the speaker identity in practice. We suggest several possible reasons behind this negative result.

Journal ArticleDOI
TL;DR: Experimental results demonstrate the improved performance of the proposed end-to-end multimodal architecture compared to unimodal variants for VAD.
Abstract: Recently, there has been growing use of deep neural networks in many modern speech-based systems such as speaker recognition, speech enhancement, and emotion recognition. Inspired by this success, we propose to address the task of voice activity detection (VAD) by incorporating auditory and visual modalities into an end-to-end deep neural network. We evaluate our proposed system in challenging acoustic environments including high levels of noise and transients, which are common in real-life scenarios. Our multimodal setting includes a speech signal captured by a microphone and a corresponding video signal capturing the speaker's mouth region. Under such difficult conditions, robust features need to be extracted from both modalities in order for the system to accurately distinguish between speech and noise. For this purpose, we utilize a deep residual network, to extract features from the video signal, while for the audio modality, we employ a variant of WaveNet encoder for feature extraction. The features from both modalities are fused using multimodal compact bilinear pooling to form a joint representation of the speech signal. To further encode the temporal information, we feed the fused signal to a long short-term memory network and the system is then trained in an end-to-end supervised fashion. Experimental results demonstrate the improved performance of the proposed end-to-end multimodal architecture compared to unimodal variants for VAD. Upon the publication of this paper, we will make the implementation of our proposed models publicly available at https://github.com/iariav/End-to-End-VAD and https://israelcohen.com .
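Multimodal compact bilinear pooling, the fusion step mentioned above, approximates the outer product of the two modality features via count-sketch projections and FFTs. The NumPy sketch below shows the core operation only; the surrounding WaveNet/ResNet encoders and LSTM are omitted, and the feature vectors are random placeholders.

# Multimodal compact bilinear (MCB) pooling sketch with count sketches + FFTs.
import numpy as np

def count_sketch(x, h, s, d):
    """Project vector x into d dims using fixed hash indices h and signs s."""
    y = np.zeros(d)
    np.add.at(y, h, s * x)
    return y

def mcb_pool(audio_feat, video_feat, d=1024, seed=0):
    rng = np.random.default_rng(seed)
    h_a = rng.integers(0, d, size=audio_feat.shape[0])
    s_a = rng.choice([-1.0, 1.0], size=audio_feat.shape[0])
    h_v = rng.integers(0, d, size=video_feat.shape[0])
    s_v = rng.choice([-1.0, 1.0], size=video_feat.shape[0])
    # Circular convolution of the two sketches == element-wise product in FFT domain.
    fused = np.fft.ifft(np.fft.fft(count_sketch(audio_feat, h_a, s_a, d)) *
                        np.fft.fft(count_sketch(video_feat, h_v, s_v, d)))
    return np.real(fused)

audio = np.random.randn(512)     # e.g. audio-encoder features (placeholder)
video = np.random.randn(256)     # e.g. mouth-region ResNet features (placeholder)
print(mcb_pool(audio, video).shape)   # (1024,)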

Posted Content
TL;DR: This work develops attacks that force mistranscription and misidentification in state of the art systems, with minimal impact on human comprehension, and finds that certain English language phonemes are significantly more susceptible to this attack.
Abstract: Automatic speech recognition and voice identification systems are being deployed in a wide array of applications, from providing control mechanisms to devices lacking traditional interfaces, to the automatic transcription of conversations and authentication of users. Many of these applications have significant security and privacy considerations. We develop attacks that force mistranscription and misidentification in state of the art systems, with minimal impact on human comprehension. Processing pipelines for modern systems are comprised of signal preprocessing and feature extraction steps, whose output is fed to a machine-learned model. Prior work has focused on the models, using white-box knowledge to tailor model-specific attacks. We focus on the pipeline stages before the models, which (unlike the models) are quite similar across systems. As such, our attacks are black-box and transferable, and demonstrably achieve mistranscription and misidentification rates as high as 100% by modifying only a few frames of audio. We perform a study via Amazon Mechanical Turk demonstrating that there is no statistically significant difference between human perception of regular and perturbed audio. Our findings suggest that models may learn aspects of speech that are generally not perceived by human subjects, but that are crucial for model accuracy. We also find that certain English language phonemes (in particular, vowels) are significantly more susceptible to our attack. We show that the attacks are effective when mounted over cellular networks, where signals are subject to degradation due to transcoding, jitter, and packet loss.

Proceedings ArticleDOI
12 May 2019
TL;DR: In this article, an unsupervised probabilistic linear discriminant analysis (PLDA) adaptation algorithm is proposed to learn from a small amount of unlabeled in-domain data, inspired by prior work on a feature-based domain adaptation technique known as correlation alignment (CORAL).
Abstract: State-of-the-art speaker recognition systems comprise an x-vector (or i-vector) speaker embedding front-end followed by a probabilistic linear discriminant analysis (PLDA) backend. The effectiveness of these components relies on the availability of a large collection of labeled training data. In practice, it is common that the domains (e.g., language, demographic) in which the system is deployed differ from those in which it was trained. To close the gap due to this domain mismatch, we propose an unsupervised PLDA adaptation algorithm that learns from a small amount of unlabeled in-domain data. The proposed method was inspired by prior work on a feature-based domain adaptation technique known as correlation alignment (CORAL). We refer to the model-based adaptation technique proposed in this paper as CORAL+. The efficacy of the proposed technique is experimentally validated on the recent NIST 2016 and 2018 Speaker Recognition Evaluation (SRE'16, SRE'18) datasets.
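For reference, the feature-space CORAL idea that inspired CORAL+ can be sketched as whitening out-of-domain embeddings with their own covariance and re-coloring them with the in-domain covariance. CORAL+ itself adapts the PLDA model parameters rather than the features, which this NumPy sketch does not show.

# Feature-space CORAL sketch: align the covariance of out-of-domain embeddings
# to an unlabeled in-domain set.
import numpy as np
from scipy.linalg import sqrtm

def coral_transform(source_embs, target_embs, eps=1e-3):
    """Whiten source embeddings with their covariance, re-color with target's."""
    d = source_embs.shape[1]
    cs = np.cov(source_embs, rowvar=False) + eps * np.eye(d)
    ct = np.cov(target_embs, rowvar=False) + eps * np.eye(d)
    A = np.real(sqrtm(np.linalg.inv(cs)) @ sqrtm(ct))
    return source_embs @ A

rng = np.random.default_rng(0)
out_of_domain = rng.normal(size=(5000, 64))
in_domain = rng.normal(size=(500, 64)) * 2.0          # small unlabeled in-domain set
aligned = coral_transform(out_of_domain, in_domain)
print("cov gap before:", np.linalg.norm(np.cov(out_of_domain, rowvar=False)
                                        - np.cov(in_domain, rowvar=False)))
print("cov gap after: ", np.linalg.norm(np.cov(aligned, rowvar=False)
                                        - np.cov(in_domain, rowvar=False)))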

Journal ArticleDOI
TL;DR: This paper improves on SincNet and proposes a SincNet-based classifier, SincNet-R, which consists of three convolutional layers and three deep neural network (DNN) layers, and tests its classification accuracy and robustness on emotional EEG signals.
Abstract: Deep learning (DL) methods have been used increasingly widely, such as in the fields of speech and image recognition. However, how to design an appropriate DL model to accurately and efficiently classify electroencephalogram (EEG) signals is still a challenge, mainly because EEG signals are characterized by significant differences between subjects and variation over time within a single subject, instability, strong randomness, and a low signal-to-noise ratio. SincNet is an efficient classifier for speaker recognition, but it has some drawbacks in dealing with EEG signal classification. In this paper, we improve on SincNet and propose a SincNet-based classifier, SincNet-R, which consists of three convolutional layers and three deep neural network (DNN) layers. We then use SincNet-R to test classification accuracy and robustness on emotional EEG signals. Comparisons with the original SincNet model and other traditional classifiers such as CNN, LSTM and SVM show that our proposed SincNet-R model has higher classification accuracy and better robustness.
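The defining ingredient of SincNet-style models is a first convolutional layer whose kernels are parameterized band-pass sinc filters, so only the cut-off frequencies are learned. The sketch below builds one such Hamming-windowed kernel; SincNet-R's specific layer sizes and training are not reproduced.

# Single SincNet-style band-pass kernel sketch (NumPy).
import numpy as np

def sinc_bandpass_kernel(f_low, f_high, kernel_size=251, fs=16000):
    """Band-pass FIR kernel between f_low and f_high (Hz), Hamming-windowed."""
    n = np.arange(kernel_size) - (kernel_size - 1) / 2
    f1, f2 = f_low / fs, f_high / fs                  # normalized frequencies
    # Difference of two low-pass sinc filters = band-pass filter.
    kernel = 2 * f2 * np.sinc(2 * f2 * n) - 2 * f1 * np.sinc(2 * f1 * n)
    return kernel * np.hamming(kernel_size)

kernel = sinc_bandpass_kernel(f_low=80.0, f_high=300.0)
signal = np.random.randn(16000)                       # placeholder signal frame
filtered = np.convolve(signal, kernel, mode="same")
print(filtered.shape)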

Posted Content
TL;DR: The VoxCeleb Speaker Recognition Challenge 2019 aimed to assess how well current speaker recognition technology is able to identify speakers in unconstrained or 'in the wild' data and provided its baselines, results and discussions.
Abstract: The VoxCeleb Speaker Recognition Challenge 2019 aimed to assess how well current speaker recognition technology is able to identify speakers in unconstrained or 'in the wild' data. It consisted of: (i) a publicly available speaker recognition dataset from YouTube videos together with ground truth annotation and standardised evaluation software; and (ii) a public challenge and workshop held at Interspeech 2019 in Graz, Austria. This paper outlines the challenge and provides its baselines, results and discussions.

Journal ArticleDOI
19 Mar 2019
TL;DR: A novel pipelined near real-time speaker recognition architecture is presented that enhances performance by exploiting hybrid feature extraction techniques combining Gabor Filter, Convolutional Neural Network, and statistical features into a single matrix set.
Abstract: In this paper, we present a novel pipelined near real-time speaker recognition architecture that enhances the performance of speaker recognition by exploiting the advantages of hybrid feature extraction techniques that combine the features of Gabor Filters (GF), Convolutional Neural Networks (CNN), and statistical parameters into a single matrix set. This architecture has been developed to enable secure access to a voice-based user interface (UI) by enabling speaker-based authentication and integration with an existing Natural Language Processing (NLP) system; gaining secure access to existing NLP systems also served as motivation. Initially, we identify challenges related to real-time speaker recognition and highlight recent research in the field. Further, we analyze the functional requirements of a speaker recognition system and introduce the mechanisms that can address these requirements through our novel architecture. Subsequently, the paper discusses the effect of different techniques such as CNN, GF, and statistical parameters on feature extraction. For classification, standard classifiers such as Support Vector Machine (SVM), Random Forest (RF) and Deep Neural Network (DNN) are investigated. To verify the validity and effectiveness of the proposed architecture, we compared different parameters including accuracy, sensitivity, and specificity with the standard AlexNet architecture.

Proceedings ArticleDOI
15 Sep 2019
TL;DR: This work presents a DNN refinement approach that updates a subset of the DNN parameters with full recordings to reduce this mismatch between training and inference when extracting embeddings for long duration recordings.
Abstract: State-of-the-art text-independent speaker recognition systems for long recordings (a few minutes) are based on deep neural network (DNN) speaker embeddings. Current implementations of this paradigm use short speech segments (a few seconds) to train the DNN. This introduces a mismatch between training and inference when extracting embeddings for long duration recordings. To address this, we present a DNN refinement approach that updates a subset of the DNN parameters with full recordings to reduce this mismatch. At the same time, we also modify the DNN architecture to produce embeddings optimized for cosine distance scoring. This is accomplished using a large-margin strategy with angular softmax. Experimental validation shows that our approach is capable of producing embeddings that achieve record performance on the SITW benchmark.

Proceedings ArticleDOI
12 May 2019
TL;DR: In this article, adversarial domain adaptation is used for addressing the problem of language mismatch between speaker recognition corpora by minimizing certain divergences between the distribution that the utterance-level features follow (i.e. speaker embeddings) when drawn from source and target domains.
Abstract: In this paper we investigate the use of adversarial domain adaptation for addressing the problem of language mismatch between speaker recognition corpora. In the context of speaker verification, adversarial domain adaptation methods aim at minimizing certain divergences between the distribution that the utterance-level features follow (i.e. speaker embeddings) when drawn from source and target domains (i.e. languages), while preserving their capacity in recognizing speakers. Neural architectures for extracting utterance-level representations enable us to apply adversarial adaptation methods in an end-to-end fashion and train the network jointly with the standard cross-entropy loss. We examine several configurations, such as the use of (pseudo-)labels on the target domain as well as domain labels in the feature extractor, and we demonstrate the effectiveness of our method on the challenging NIST SRE16 and SRE18 benchmarks.

Posted Content
TL;DR: In this article, a self multi-head attention mechanism was proposed to obtain a discriminative speaker embedding given non-fixed length speech utterances, which outperformed both temporal and statistical pooling methods with a 18% of relative EER.
Abstract: Most state-of-the-art Deep Learning (DL) approaches for speaker recognition work at a short utterance level. Given the speech signal, these algorithms extract a sequence of speaker embeddings from short segments and average them to obtain an utterance-level speaker representation. In this work we propose the use of an attention mechanism to obtain a discriminative speaker embedding given non-fixed-length speech utterances. Our system is based on a Convolutional Neural Network (CNN) that encodes short-term speaker features from the spectrogram and a self multi-head attention model that maps these representations into a long-term speaker embedding. The attention model that we propose produces multiple alignments from different subsegments of the CNN encoded states over the sequence. Hence this mechanism works as a pooling layer which selects the most discriminative features over the sequence to obtain an utterance-level representation. We have tested this approach on the verification task for the VoxCeleb1 dataset. The results show that self multi-head attention outperforms both temporal and statistical pooling methods with an 18% relative improvement in EER. The obtained results also show a 58% relative improvement in EER compared to i-vector+PLDA.

Journal ArticleDOI
TL;DR: Results identify a reduced subset of relevant features, which are used in a hierarchical-like scenario incorporating information from different speech tasks, opening a discussion about the suitability of transferring these techniques to the clinical setting.