Open Access · Posted Content
The SpeakIn System for VoxCeleb Speaker Recognition Challenge 2021
TLDR
This article describes a fusion of 9 models that achieved first place in both tracks of the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2021.
Abstract:
This report describes our submission to tracks 1 and 2 of the VoxCeleb Speaker Recognition Challenge 2021 (VoxSRC 2021). Both tracks share the same speaker verification system, which uses only VoxCeleb2-dev as the training set. The report covers several parts, including data augmentation, network structures, domain-based large-margin fine-tuning, and back-end refinement. Our system is a fusion of 9 models and achieved first place in both tracks of VoxSRC 2021. The minDCF of our submission is 0.1034, with a corresponding EER of 1.8460%.
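The headline numbers (minDCF 0.1034, EER 1.8460%) are standard speaker-verification metrics. As a rough illustration of the second one, here is a minimal NumPy sketch of computing an equal error rate from trial scores; the function name and implementation are illustrative, not the challenge's official scoring tool.

```python
import numpy as np

def compute_eer(scores, labels):
    """Equal error rate: the operating point where the false-accept
    rate equals the false-reject rate. `scores` are similarity scores,
    `labels` are 1 for target (same-speaker) trials, 0 for non-target."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(scores)          # sweep thresholds in score order
    sorted_labels = labels[order]
    n_target = labels.sum()
    n_nontarget = len(labels) - n_target
    # Rejecting the k lowest-scoring trials gives:
    #   FRR = targets among the k rejected / all targets
    #   FAR = non-targets among those still accepted / all non-targets
    frr = np.cumsum(sorted_labels) / n_target
    far = 1.0 - np.cumsum(1 - sorted_labels) / n_nontarget
    idx = np.argmin(np.abs(far - frr))  # closest crossing point
    return (far[idx] + frr[idx]) / 2.0
```

With perfectly separated scores the EER is 0; a single target/non-target crossing in a four-trial list yields 0.5.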
Citations
Posted Content
WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing
Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Michael Zeng, Furu Wei, and 16 more authors
TL;DR: WavLM is a pre-trained model for solving full-stack downstream speech tasks; it achieves state-of-the-art performance on the SUPERB benchmark.
Journal ArticleDOI
WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing
TL;DR: WavLM jointly learns masked speech prediction and denoising during pre-training to solve full-stack downstream speech tasks, and achieves state-of-the-art performance on the SUPERB benchmark.
Posted Content
Large-scale Self-Supervised Speech Representation Learning for Automatic Speaker Verification.
TL;DR: The authors explore the limits of speech representations learned with different self-supervised objectives and datasets for automatic speaker verification (ASV), using the well-recognized state-of-the-art ECAPA-TDNN as the downstream model.
Posted Content
Multi-query multi-head attention pooling and Inter-topK penalty for speaker verification.
TL;DR: This article proposes a multi-query multi-head attention (MQMHA) pooling and an inter-topK penalty method, which achieve state-of-the-art performance on all public VoxCeleb test sets.
Posted Content
Tackling the Score Shift in Cross-Lingual Speaker Verification by Exploiting Language Information.
TL;DR: The authors show that typical training and scoring protocols do not sufficiently compensate for intra-speaker language variability, and propose two techniques to improve the robustness of cross-lingual speaker verification.
References
Proceedings ArticleDOI
Deep Residual Learning for Image Recognition
TL;DR: In this article, the authors propose a residual learning framework that eases the training of networks substantially deeper than those used previously; the approach won first place in the ILSVRC 2015 classification task.
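The residual idea itself is compact: instead of asking stacked layers to learn a target mapping H(x) directly, they learn the residual F(x) = H(x) − x, and the block outputs F(x) + x, so an identity mapping is easy to represent and very deep stacks stay trainable. A toy sketch (names illustrative, not the paper's code):

```python
import numpy as np

def residual_block(x, layer):
    """A skip connection: the block's output is the layer's residual
    F(x) plus the unchanged input x."""
    return layer(x) + x
```

Stacking such blocks lets gradients flow through the identity path even when `layer` contributes little early in training.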
Proceedings Article
PyTorch: An Imperative Style, High-Performance Deep Learning Library
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Z. Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, Soumith Chintala, and 20 more authors
TL;DR: This paper details the principles that drove the implementation of PyTorch, how they are reflected in its architecture, and how the careful, pragmatic implementation of its key runtime components lets them work together to achieve compelling performance.
Proceedings Article
The Kaldi Speech Recognition Toolkit
Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Kumar Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, Karel Vesely, and 12 more authors
TL;DR: The paper describes the design of Kaldi, a free, open-source toolkit for speech recognition research that provides a recognition system based on finite-state transducers, together with detailed documentation and a comprehensive set of scripts for building complete recognition systems.
Proceedings ArticleDOI
ArcFace: Additive Angular Margin Loss for Deep Face Recognition
TL;DR: This paper presents arguably the most extensive experimental evaluation against all recent state-of-the-art face recognition methods on ten face recognition benchmarks, and shows that ArcFace consistently outperforms the state of the art and can be easily implemented with negligible computational overhead.
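ArcFace's additive angular margin can be sketched in a few lines: normalise embeddings and class weights so their dot product is a cosine, add the margin m to the angle at the ground-truth class only, and rescale by s. A minimal NumPy illustration (the function name and the defaults for s and m are illustrative, not necessarily the paper's exact training configuration):

```python
import numpy as np

def arcface_logits(embeddings, weights, labels, s=30.0, m=0.5):
    """Additive angular margin: the margin m (radians) is added to the
    angle between each embedding and its ground-truth class weight
    before rescaling by s. Shapes: embeddings (N, d), weights (C, d),
    labels (N,) integer class indices."""
    # L2-normalise so the dot product equals the cosine of the angle.
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = e @ w.T                              # (N, C) cosine similarities
    theta = np.arccos(np.clip(cos, -1.0, 1.0))
    # Penalise only the ground-truth class by widening its angle.
    theta[np.arange(len(labels)), labels] += m
    return s * np.cos(theta)
```

Because the margin shrinks the target-class cosine, the softmax over these logits forces embeddings of the same class to cluster more tightly on the hypersphere.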
Proceedings ArticleDOI
X-Vectors: Robust DNN Embeddings for Speaker Recognition
TL;DR: This paper uses data augmentation, consisting of added noise and reverberation, as an inexpensive way to multiply the amount of training data and improve the robustness of deep neural network embeddings for speaker recognition.
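The additive-noise half of this augmentation recipe amounts to scaling a noise signal so the mixture hits a chosen signal-to-noise ratio. A minimal sketch on raw waveform arrays (the function name and interface are illustrative, not the actual x-vector/Kaldi augmentation tooling):

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Mix noise into speech at a target SNR in dB, the additive-noise
    style of augmentation used to multiply speaker-recognition
    training data (reverberation would be applied separately)."""
    # Tile/trim the noise to cover the speech length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale so 10*log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

Drawing `snr_db` at random per utterance (e.g. from a 0–20 dB range) turns one clean recording into many distinct training examples.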