Multi-Query Multi-Head Attention Pooling and Inter-Topk Penalty for Speaker Verification
TLDR
The authors propose multi-query multi-head attention (MQMHA) pooling and an inter-top-K penalty method, which achieve state-of-the-art performance on all public VoxCeleb test sets.
Abstract
This paper describes the multi-query multi-head attention (MQMHA) pooling and inter-top-K penalty methods, which were first proposed in our system description submitted to the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2021. Most multi-head attention pooling mechanisms either attend to the whole feature through multiple heads or attend to several split parts of the whole feature. Our proposed MQMHA combines these two mechanisms and gains more diversified information. Margin-based softmax loss functions are commonly adopted to obtain discriminative speaker representations. To further enhance inter-class discriminability, we propose a method that adds an extra inter-top-K penalty on confusable speakers. By adopting both MQMHA and the inter-top-K penalty, we achieved state-of-the-art performance on all of the public VoxCeleb test sets.
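The two ideas in the abstract can be sketched concretely. Below is a minimal NumPy illustration: MQMHA pooling splits the frame-level feature into per-head sub-vectors and attends over time separately for each of several queries, concatenating all summaries; the inter-top-K penalty takes AAM-softmax-style logits and additionally raises the logits of the K most confusable non-target speakers. The function names, the simple dot-product scoring, and the margin values are illustrative assumptions, not the paper's actual parameterization.

```python
import numpy as np

def softmax(z, axis=0):
    # numerically stable softmax over the given axis
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mqmha_pooling(x, query_weights, num_heads):
    """Sketch of multi-query multi-head attention pooling.

    x:             (T, D) frame-level features, D divisible by num_heads
    query_weights: list of (D,) score vectors, one per query
                   (hypothetical stand-in for learned attention params)
    Returns a pooled utterance vector of shape (num_queries * D,).
    """
    T, D = x.shape
    h = D // num_heads
    pooled = []
    for w in query_weights:                      # one pass per query
        parts = []
        for k in range(num_heads):               # heads attend to split parts
            xk = x[:, k * h:(k + 1) * h]         # (T, h) sub-features
            scores = xk @ w[k * h:(k + 1) * h]   # (T,) per-frame scores
            alpha = softmax(scores)              # attention over time
            parts.append(alpha @ xk)             # (h,) weighted mean
        pooled.append(np.concatenate(parts))     # (D,) per-query summary
    return np.concatenate(pooled)                # (num_queries * D,)

def inter_topk_logits(cos_sim, target, m=0.2, m_prime=0.06, k=5):
    """Sketch of an AAM-softmax logit with an extra inter-top-K penalty.

    cos_sim: (C,) cosine similarities to each speaker center
    target:  index of the true speaker
    The target gets the usual additive angular margin m; the K largest
    non-target similarities get an extra margin m_prime that raises
    their logits, pushing confusable speakers apart during training.
    """
    theta = np.arccos(np.clip(cos_sim, -1.0, 1.0))
    logits = cos_sim.copy()
    logits[target] = np.cos(theta[target] + m)          # shrink target logit
    non_target = np.delete(np.arange(len(cos_sim)), target)
    topk = non_target[np.argsort(cos_sim[non_target])[-k:]]
    logits[topk] = np.cos(theta[topk] - m_prime)        # penalize confusables
    return logits
```

In this sketch each query sees every head's split of the feature, which is how MQMHA combines whole-feature multi-query attention with split-feature multi-head attention.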
Citations
Journal ArticleDOI
Temporal feature extraction based on CNN-BLSTM and temporal pooling for language identification
Xiuyan Liu,Chen Chen,Yongjun He +2 more
TL;DR: In this paper, a temporal feature extraction method based on a convolutional neural network with bidirectional long short-term memory (CNN-BLSTM) and temporal pooling (TMPOOL) is proposed for language identification.
Proceedings ArticleDOI
Exploring Binary Classification Loss for Speaker Verification
TL;DR: In this paper, the authors propose a framework that uses several binary classifiers to train the speaker model in a pairwise manner instead of performing multi-class classification, which efficiently alleviates the gap between training and evaluation.
Journal ArticleDOI
Aggregating discriminative embedding by triple-domain feature joint learning with bidirectional sampling for speaker verification
Yunfei Zi,Shengwu Xiong +1 more
TL;DR: TribiNet proposes triple-domain feature joint learning to enhance discriminative embeddings along more dimensions for text-independent speaker verification, using a bidirectional-sampling multi-scale feature aggregation network based on Fisher feature fusion.
References
Proceedings ArticleDOI
Deep Residual Learning for Image Recognition
TL;DR: In this paper, the authors propose a residual learning framework to ease the training of networks substantially deeper than those used previously, which won 1st place in the ILSVRC 2015 classification task.
Proceedings ArticleDOI
ArcFace: Additive Angular Margin Loss for Deep Face Recognition
TL;DR: This paper presents arguably the most extensive experimental evaluation against recent state-of-the-art face recognition methods on ten benchmarks, showing that ArcFace consistently outperforms the state of the art and can be easily implemented with negligible computational overhead.
Proceedings ArticleDOI
X-Vectors: Robust DNN Embeddings for Speaker Recognition
TL;DR: This paper uses data augmentation, consisting of added noise and reverberation, as an inexpensive method to multiply the amount of training data and improve robustness of deep neural network embeddings for speaker recognition.
Proceedings ArticleDOI
VoxCeleb2: Deep Speaker Recognition.
TL;DR: In this paper, the authors present VoxCeleb2, a large-scale audio-visual speaker recognition dataset containing over a million utterances from more than 6,000 speakers.
Journal ArticleDOI
Additive Margin Softmax for Face Verification
TL;DR: In this paper, the authors propose additive margin softmax, a conceptually simple learning objective for face verification that is more intuitive and interpretable.