Open Access · Posted Content
The SpeakIn System for VoxCeleb Speaker Recognition Challenge 2021
TLDR
This article describes a fusion of 9 models that achieved first place in both tracks of the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2021.
Abstract:
This report describes our submission to tracks 1 and 2 of the VoxCeleb Speaker Recognition Challenge 2021 (VoxSRC 2021). Both tracks share the same speaker verification system, which uses only VoxCeleb2-dev as the training set. The report covers several parts, including data augmentation, network structures, domain-based large-margin fine-tuning, and back-end refinement. Our system is a fusion of 9 models and achieved first place in both tracks of VoxSRC 2021. The minDCF of our submission is 0.1034, with a corresponding EER of 1.8460%.
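The headline numbers (minDCF 0.1034, EER 1.8460%) are standard speaker-verification metrics. As a rough illustration of the second one, here is a minimal NumPy sketch of computing an equal error rate from trial scores; the function name and implementation are illustrative, not the challenge's official scoring tool.

```python
import numpy as np

def compute_eer(scores, labels):
    """Equal error rate: the operating point where the false-accept
    rate equals the false-reject rate. `scores` are similarity scores,
    `labels` are 1 for target (same-speaker) trials, 0 for non-target."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(scores)          # sweep thresholds in score order
    sorted_labels = labels[order]
    n_target = labels.sum()
    n_nontarget = len(labels) - n_target
    # Rejecting the k lowest-scoring trials gives:
    #   FRR = targets among the k rejected / all targets
    #   FAR = non-targets among those still accepted / all non-targets
    frr = np.cumsum(sorted_labels) / n_target
    far = 1.0 - np.cumsum(1 - sorted_labels) / n_nontarget
    idx = np.argmin(np.abs(far - frr))  # closest crossing point
    return (far[idx] + frr[idx]) / 2.0
```

With perfectly separated scores the EER is 0; a single target/non-target crossing in a four-trial list yields 0.5.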
Citations
Posted Content
WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing
Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Michael Zeng, Furu Wei, and 16 more authors
TL;DR: WavLM is a pre-trained model for solving full-stack downstream speech tasks; it achieves state-of-the-art performance on the SUPERB benchmark.
Journal ArticleDOI
WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing
TL;DR: WavLM jointly learns masked speech prediction and denoising during pre-training to solve full-stack downstream speech tasks, and achieves state-of-the-art performance on the SUPERB benchmark.
Posted Content
Large-scale Self-Supervised Speech Representation Learning for Automatic Speaker Verification.
TL;DR: The authors explore the limits of speech representations learned with different self-supervised objectives and datasets for automatic speaker verification (ASV), using the well-recognized state-of-the-art ECAPA-TDNN as the downstream model.
Posted Content
Multi-query multi-head attention pooling and Inter-topK penalty for speaker verification.
TL;DR: This article proposes a multi-query multi-head attention (MQMHA) pooling and an inter-topK penalty method, which achieve state-of-the-art performance on all public VoxCeleb test sets.
Posted Content
Tackling the Score Shift in Cross-Lingual Speaker Verification by Exploiting Language Information.
TL;DR: The authors show that typical training and scoring protocols do not sufficiently compensate for intra-speaker language variability, and propose two techniques to improve the robustness of cross-lingual speaker verification.
References
Proceedings ArticleDOI
Deep Residual Learning for Image Recognition
TL;DR: In this article, the authors propose a residual learning framework that eases the training of networks substantially deeper than those used previously; the approach won first place in the ILSVRC 2015 classification task.
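The residual idea itself is compact: instead of asking stacked layers to learn a target mapping H(x) directly, they learn the residual F(x) = H(x) − x, and the block outputs F(x) + x, so an identity mapping is easy to represent and very deep stacks stay trainable. A toy sketch (names illustrative, not the paper's code):

```python
import numpy as np

def residual_block(x, layer):
    """A skip connection: the block's output is the layer's residual
    F(x) plus the unchanged input x."""
    return layer(x) + x
```

Stacking such blocks lets gradients flow through the identity path even when `layer` contributes little early in training.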
Proceedings Article
PyTorch: An Imperative Style, High-Performance Deep Learning Library
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Z. Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, Soumith Chintala, and 20 more authors
TL;DR: This paper details the principles that drove the implementation of PyTorch, how they are reflected in its architecture, and how the careful, pragmatic implementation of its key runtime components lets them work together to achieve compelling performance.
Proceedings Article
The Kaldi Speech Recognition Toolkit
Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Kumar Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, Karel Vesely, and 12 more authors
TL;DR: The paper describes the design of Kaldi, a free, open-source toolkit for speech recognition research that provides a recognition system based on finite-state transducers, together with detailed documentation and a comprehensive set of scripts for building complete recognition systems.
Proceedings ArticleDOI
ArcFace: Additive Angular Margin Loss for Deep Face Recognition
TL;DR: This paper presents arguably the most extensive experimental evaluation against all recent state-of-the-art face recognition methods on ten face recognition benchmarks, and shows that ArcFace consistently outperforms the state of the art and can be easily implemented with negligible computational overhead.
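ArcFace's additive angular margin can be sketched in a few lines: normalise embeddings and class weights so their dot product is a cosine, add the margin m to the angle at the ground-truth class only, and rescale by s. A minimal NumPy illustration (the function name and the defaults for s and m are illustrative, not necessarily the paper's exact training configuration):

```python
import numpy as np

def arcface_logits(embeddings, weights, labels, s=30.0, m=0.5):
    """Additive angular margin: the margin m (radians) is added to the
    angle between each embedding and its ground-truth class weight
    before rescaling by s. Shapes: embeddings (N, d), weights (C, d),
    labels (N,) integer class indices."""
    # L2-normalise so the dot product equals the cosine of the angle.
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = e @ w.T                              # (N, C) cosine similarities
    theta = np.arccos(np.clip(cos, -1.0, 1.0))
    # Penalise only the ground-truth class by widening its angle.
    theta[np.arange(len(labels)), labels] += m
    return s * np.cos(theta)
```

Because the margin shrinks the target-class cosine, the softmax over these logits forces embeddings of the same class to cluster more tightly on the hypersphere.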
Proceedings ArticleDOI
X-Vectors: Robust DNN Embeddings for Speaker Recognition
TL;DR: This paper uses data augmentation, consisting of added noise and reverberation, as an inexpensive way to multiply the amount of training data and improve the robustness of deep neural network embeddings for speaker recognition.
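The additive-noise half of this augmentation recipe amounts to scaling a noise signal so the mixture hits a chosen signal-to-noise ratio. A minimal sketch on raw waveform arrays (the function name and interface are illustrative, not the actual x-vector/Kaldi augmentation tooling):

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Mix noise into speech at a target SNR in dB, the additive-noise
    style of augmentation used to multiply speaker-recognition
    training data (reverberation would be applied separately)."""
    # Tile/trim the noise to cover the speech length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale so 10*log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

Drawing `snr_db` at random per utterance (e.g. from a 0–20 dB range) turns one clean recording into many distinct training examples.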