Open Access · Posted Content
Utterance-level Aggregation For Speaker Recognition In The Wild
TLDR
In this article, the authors propose a speaker recognition deep network using a "thin-ResNet" trunk architecture and a dictionary-based NetVLAD or GhostVLAD layer to aggregate features across time.
Abstract
The objective of this paper is speaker recognition "in the wild", where utterances may be of variable length and may also contain irrelevant signals. Crucial elements in the design of deep networks for this task are the type of trunk (frame-level) network and the method of temporal aggregation. We propose a powerful speaker recognition deep network, using a "thin-ResNet" trunk architecture and a dictionary-based NetVLAD or GhostVLAD layer to aggregate features across time, that can be trained end-to-end. We show that our network achieves state-of-the-art performance by a significant margin on the VoxCeleb1 test set for speaker recognition, whilst requiring fewer parameters than previous methods. We also investigate the effect of utterance length on performance and conclude that for "in the wild" data, a longer length is beneficial.
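The aggregation step the abstract describes can be sketched in a few lines. Below is a minimal, illustrative NumPy version of a NetVLAD layer; the function name `netvlad` and the parameter names `clusters`, `assign_w`, `assign_b` are assumptions for illustration, not the paper's code. In the actual network these parameters are learned end-to-end, and GhostVLAD extends the scheme with extra "ghost" clusters whose residuals are computed for the soft assignment but discarded from the output.

```python
import numpy as np

def netvlad(features, clusters, assign_w, assign_b):
    """Aggregate T frame-level D-dim features into a fixed K*D descriptor.

    features: (T, D) frame-level features from the trunk network
    clusters: (K, D) learnable cluster centres
    assign_w: (D, K) soft-assignment weights
    assign_b: (K,)  soft-assignment biases
    """
    # Soft-assign each frame to the K clusters (softmax over clusters)
    logits = features @ assign_w + assign_b            # (T, K)
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    a = np.exp(logits)
    a /= a.sum(axis=1, keepdims=True)                  # (T, K)

    # Accumulate assignment-weighted residuals to each cluster centre:
    # V[k] = sum_t a[t, k] * (features[t] - clusters[k])
    V = a.T @ features - a.sum(axis=0)[:, None] * clusters   # (K, D)

    # Intra-normalise each cluster's residual, then L2-normalise the whole vector
    V /= np.linalg.norm(V, axis=1, keepdims=True) + 1e-12
    v = V.ravel()
    return v / (np.linalg.norm(v) + 1e-12)
```

With K clusters and D-dimensional frame features, the output is a fixed K×D descriptor regardless of the number of frames T, which is what makes variable-length "in the wild" utterances tractable.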
Citations
Journal Article (DOI)
VoxCeleb: Large-scale speaker verification in the wild
TL;DR: Introduces a very large-scale audio-visual dataset collected from open-source media using a fully automated pipeline, and develops and compares different CNN architectures with various aggregation methods and training loss functions that can effectively recognise identities from voice under various conditions.
Journal Article (DOI)
Speaker recognition based on deep learning: An overview
Zhongxin Bai, Xiao-Lei Zhang +1 more
TL;DR: In this article, the authors review several major subtasks of speaker recognition, including speaker verification, identification, diarization, and robust speaker recognition with a focus on deep learning-based methods.
Posted Content
Large Margin Softmax Loss for Speaker Verification
TL;DR: In this paper, the authors investigate the large margin softmax loss with different configurations in speaker verification and introduce ring loss and minimum hyperspherical energy criterion to further improve the performance.
Proceedings Article (DOI)
Automatic Speaker Recognition with Limited Data
TL;DR: Proposes an adversarial few-shot learning-based speaker identification framework (AFEASI) that builds robust speaker identification models from only a limited number of training instances, using adversarial examples to enhance generalization and robustness.
Posted Content
Emotionless: privacy-preserving speech analysis for voice assistants
TL;DR: Proposes a privacy-preserving intermediate layer between users and cloud services that sanitizes the voice input; experiments show that identification of the speaker's sensitive emotional state is reduced by ~96%.
References
Automatic differentiation in PyTorch
Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Z. Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, Adam Lerer +9 more
TL;DR: Describes the automatic differentiation module of PyTorch — a library designed to enable rapid research on machine learning models that performs differentiation of purely imperative programs, with an emphasis on extensibility and low overhead.
Posted Content
TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek G. Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul A. Tucker, Vincent Vanhoucke, Vijay K. Vasudevan, Fernanda B. Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, Xiaoqiang Zheng +39 more
TL;DR: Describes the TensorFlow interface and an implementation of that interface built at Google, which has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields.
Proceedings Article (DOI)
MatConvNet: Convolutional Neural Networks for MATLAB
Andrea Vedaldi, Karel Lenc +1 more
TL;DR: MatConvNet exposes the building blocks of CNNs as easy-to-use MATLAB functions, providing routines for computing convolutions with filter banks, feature pooling, normalisation, and much more.
Proceedings Article (DOI)
X-Vectors: Robust DNN Embeddings for Speaker Recognition
TL;DR: This paper uses data augmentation, consisting of added noise and reverberation, as an inexpensive method to multiply the amount of training data and improve robustness of deep neural network embeddings for speaker recognition.
Proceedings Article (DOI)
NetVLAD: CNN Architecture for Weakly Supervised Place Recognition
TL;DR: Develops a convolutional neural network architecture that is trainable end-to-end directly for the place recognition task, together with an efficient training procedure that can be applied to very large-scale weakly labelled tasks.