Utterance-level Aggregation for Speaker Recognition in the Wild

doi:10.1109/ICASSP.2019.8683120

Open Access, Proceedings Article

Weidi Xie, +3 more

- pp 5791-5795

TLDR

This paper proposes a powerful speaker recognition deep network, using a ‘thin-ResNet’ trunk architecture, and a dictionary-based NetVLAD or GhostVLAD layer to aggregate features across time, that can be trained end-to-end.

Abstract:

The objective of this paper is speaker recognition ‘in the wild’ – where utterances may be of variable length and also contain irrelevant signals. Crucial elements in the design of deep networks for this task are the type of trunk (frame level) network, and the method of temporal aggregation. We propose a powerful speaker recognition deep network, using a ‘thin-ResNet’ trunk architecture, and a dictionary-based NetVLAD or GhostVLAD layer to aggregate features across time, that can be trained end-to-end. We show that our network achieves state of the art performance by a significant margin on the VoxCeleb1 test set for speaker recognition, whilst requiring fewer parameters than previous methods. We also investigate the effect of utterance length on performance, and conclude that for ‘in the wild’ data, a longer length is beneficial.
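The aggregation step described in the abstract can be illustrated with a minimal NumPy sketch. It assumes the trunk network has already produced a (T, D) matrix of frame-level features, and it uses random parameters where the paper's layer would learn them end-to-end. The function name `ghostvlad_aggregate` and the sizes `D`, `K`, and `G` are hypothetical, chosen for illustration and not taken from the paper. The sketch follows the general NetVLAD recipe (soft assignment of each frame to clusters, then summed residuals per cluster), with GhostVLAD's extra "ghost" clusters taking part in the assignment but being dropped from the output.

```python
import numpy as np

def ghostvlad_aggregate(frames, centres, weights, biases, num_ghost=0):
    """Aggregate variable-length frame features into one fixed-size vector.

    frames:  (T, D) frame-level features from the trunk network
    centres: (K + G, D) cluster centres (the last G are ghost clusters)
    weights: (D, K + G) and biases: (K + G,) define the soft assignment
    """
    # Soft-assign every frame to every cluster, ghosts included.
    logits = frames @ weights + biases                  # (T, K + G)
    logits = logits - logits.max(axis=1, keepdims=True) # numerical stability
    assign = np.exp(logits)
    assign = assign / assign.sum(axis=1, keepdims=True)

    # Drop the ghost clusters: they absorb weight but add no output.
    num_real = centres.shape[0] - num_ghost
    assign = assign[:, :num_real]                       # (T, K)

    # Weighted sum of residuals to each real cluster centre.
    residuals = frames[:, None, :] - centres[None, :num_real, :]  # (T, K, D)
    vlad = (assign[:, :, None] * residuals).sum(axis=0)           # (K, D)

    # Intra-normalise per cluster, flatten, then L2-normalise the whole vector.
    vlad = vlad / (np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12)
    flat = vlad.reshape(-1)
    return flat / (np.linalg.norm(flat) + 1e-12)

# Illustrative sizes only (assumptions, not the paper's configuration).
rng = np.random.default_rng(0)
D, K, G = 8, 4, 2
centres = rng.normal(size=(K + G, D))
weights = rng.normal(size=(D, K + G))
biases = rng.normal(size=K + G)

short_utt = ghostvlad_aggregate(rng.normal(size=(50, D)), centres, weights, biases, G)
long_utt = ghostvlad_aggregate(rng.normal(size=(300, D)), centres, weights, biases, G)
print(short_utt.shape, long_utt.shape)
```

Both utterances map to a vector of length K × D regardless of how many frames they contain, which is what lets a network of this kind handle variable-length input.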

Citations

Journal Article

Voxceleb: Large-scale speaker verification in the wild

Arsha Nagrani, +4 more

- 01 Mar 2020 -

Computer Speech & Language

TL;DR: Introduces a very large-scale audio-visual dataset collected from open-source media using a fully automated pipeline, and develops and compares different CNN architectures, with various aggregation methods and training loss functions, that can effectively recognise identities from voice under various conditions.

Proceedings Article

Vggsound: A Large-Scale Audio-Visual Dataset

Honglie Chen, +3 more

TL;DR: This work collects a large-scale audio-visual dataset with low label noise from videos ‘in the wild’ using computer vision techniques, and investigates various Convolutional Neural Network architectures and aggregation approaches to establish audio recognition baselines for the new dataset.

Proceedings Article

CN-Celeb: A Challenging Chinese Speaker Recognition Dataset

Yue Fan, +9 more

TL;DR: CN-Celeb is presented, a large-scale speaker recognition dataset collected ‘in the wild’ that contains more than 130,000 utterances from 1,000 Chinese celebrities and covers 11 different real-world genres.

Proceedings Article

Spot the Conversation: Speaker Diarisation in the Wild.

Joon Son Chung, +4 more

TL;DR: This work proposes an automatic audio-visual diarisation method for YouTube videos that consists of active speaker detection using audio-visual methods and speaker verification using self-enrolled speaker models, and integrates this method into a semi-automatic dataset creation pipeline.

Posted Content

Speaker Recognition Based on Deep Learning: An Overview

Zhongxin Bai, +1 more

- 02 Dec 2020 -

arXiv: Audio and Speech Processing

TL;DR: Several major subtasks of speaker recognition are reviewed, including speaker verification, identification, diarization, and robust speaker recognition, with a focus on deep-learning-based methods.

References

Automatic differentiation in PyTorch

Adam Paszke, +9 more

TL;DR: An automatic differentiation module of PyTorch is described — a library designed to enable rapid research on machine learning models that focuses on differentiation of purely imperative programs, with a focus on extensibility and low overhead.

Posted Content

TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

Martín Abadi, +39 more

- 01 Jan 2015 -

arXiv: Distributed, Parallel, and Cluste...

TL;DR: The TensorFlow interface and an implementation of that interface that is built at Google are described, which has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields.

Proceedings Article

MatConvNet: Convolutional Neural Networks for MATLAB

Andrea Vedaldi, +1 more

TL;DR: MatConvNet exposes the building blocks of CNNs as easy-to-use MATLAB functions, providing routines for computing convolutions with filter banks, feature pooling, normalisation, and much more.

Proceedings Article

X-Vectors: Robust DNN Embeddings for Speaker Recognition

David Snyder, +4 more

TL;DR: This paper uses data augmentation, consisting of added noise and reverberation, as an inexpensive method to multiply the amount of training data and improve robustness of deep neural network embeddings for speaker recognition.

Proceedings Article

NetVLAD: CNN Architecture for Weakly Supervised Place Recognition

Relja Arandjelovic, +4 more

TL;DR: A convolutional neural network architecture that is trainable in an end-to-end manner directly for the place recognition task and an efficient training procedure which can be applied on very large-scale weakly labelled tasks are developed.

Related Papers (5)

VoxCeleb2: Deep Speaker Recognition.

VoxCeleb: A Large-Scale Speaker Identification Dataset.

X-Vectors: Robust DNN Embeddings for Speaker Recognition

Deep Residual Learning for Image Recognition

Front-End Factor Analysis for Speaker Verification