Open Access Proceedings Article

Multi-Query Multi-Head Attention Pooling and Inter-Topk Penalty for Speaker Verification

TLDR
The authors proposed a multi-query multi-head attention (MQMHA) pooling method and an inter-topK penalty, which achieved state-of-the-art performance on all of the public VoxCeleb test sets.
Abstract
This paper describes the multi-query multi-head attention (MQMHA) pooling and inter-topK penalty methods, which were first proposed in our submitted system description for the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2021. Most multi-head attention pooling mechanisms either attend to the whole feature through multiple heads or attend to several split parts of the whole feature. Our proposed MQMHA combines these two mechanisms and gains more diversified information. Margin-based softmax loss functions are commonly adopted to obtain discriminative speaker representations. To further enhance inter-class discriminability, we propose a method that adds an extra inter-topK penalty on the most easily confused speakers. By adopting both MQMHA and the inter-topK penalty, we achieved state-of-the-art performance on all of the public VoxCeleb test sets.
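To make the pooling idea concrete, the following is a minimal sketch of how an MQMHA-style pooling layer might look, assuming PyTorch and frame-level input features of shape (batch, channels, frames). The grouped 1x1 convolution used as the scoring network, the layer sizes, and the use of weighted mean and standard-deviation statistics are illustrative assumptions, not the authors' exact implementation. A sketch of the inter-topK penalty is given alongside the ArcFace reference below.

```python
# Hypothetical sketch of multi-query multi-head attention (MQMHA) pooling.
# Assumptions (not from the paper): PyTorch, input of shape (batch, channels, frames),
# a grouped 1x1 convolution as the per-head scoring network, and weighted
# mean + standard deviation statistics per (head, query) pair.
import torch
import torch.nn as nn


class MQMHAPooling(nn.Module):
    def __init__(self, channels, num_heads=4, num_queries=2):
        super().__init__()
        assert channels % num_heads == 0
        self.num_heads = num_heads
        self.num_queries = num_queries
        self.head_dim = channels // num_heads
        # Grouped 1x1 convolution: each head's split of the channels
        # produces `num_queries` attention score maps over the frames.
        self.score = nn.Conv1d(channels, num_heads * num_queries,
                               kernel_size=1, groups=num_heads)

    def forward(self, x):
        # x: (batch, channels, frames)
        b, c, t = x.shape
        # Attention weights over frames: (batch, num_heads, num_queries, frames)
        w = self.score(x).view(b, self.num_heads, self.num_queries, t).softmax(dim=-1)
        # Split the features into heads: (batch, num_heads, head_dim, frames)
        h = x.view(b, self.num_heads, self.head_dim, t)
        # Weighted mean and standard deviation per (head, query) pair
        mean = torch.einsum('bhqt,bhdt->bhqd', w, h)
        sq = torch.einsum('bhqt,bhdt->bhqd', w, h ** 2)
        std = (sq - mean ** 2).clamp(min=1e-6).sqrt()
        # Concatenate all statistics into a single utterance-level vector
        return torch.cat([mean, std], dim=-1).flatten(start_dim=1)
```

With the defaults above, an input of shape (8, 512, 200) yields an utterance-level vector of shape (8, 2048): each of the 4 heads contributes 2 query-specific mean/std pairs over its 128-dimensional split.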



Citations
Journal Article

Temporal feature extraction based on CNN-BLSTM and temporal pooling for language identification

TL;DR: In this paper, a temporal feature extraction method based on a convolutional neural network-bidirectional long short-term memory (CNN-BLSTM) and temporal pooling (TMPOOL) is proposed for language identification.
Proceedings Article

Exploring Binary Classification Loss for Speaker Verification

TL;DR: In this article, the authors propose a framework that uses several binary classifiers to train the speaker model in a pair-wise manner instead of performing multi-class classification, which effectively reduces the gap between training and evaluation.
Journal Article

Aggregating discriminative embedding by triple-domain feature joint learning with bidirectional sampling for speaker verification

TL;DR: The authors propose TribiNet, a triple-domain feature joint learning approach that enhances discriminative embeddings for text-independent speaker verification using a bidirectional-sampling multi-scale feature aggregation network based on Fisher feature fusion.
References
Proceedings Article

Deep Residual Learning for Image Recognition

TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and won first place in the ILSVRC 2015 classification task.
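As a reminder of what the residual learning framework amounts to, here is a minimal sketch of a basic residual block, assuming PyTorch; the channel count and layer choices are illustrative rather than the paper's exact architecture.

```python
# Minimal sketch of a basic residual block in the spirit of ResNet.
# Assumptions: PyTorch, equal input/output channel counts (identity shortcut).
import torch
import torch.nn as nn


class BasicResBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Learn a residual F(x) and add it to the identity shortcut: y = F(x) + x.
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)
```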
Proceedings Article

ArcFace: Additive Angular Margin Loss for Deep Face Recognition

TL;DR: This paper presents arguably the most extensive experimental evaluation against all recent state-of-the-art face recognition methods on ten face recognition benchmarks, and shows that ArcFace consistently outperforms the state of the art and can be easily implemented with negligible computational overhead.
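The inter-topK penalty described in the abstract above is applied on top of a margin-based softmax of this kind. The sketch below combines an ArcFace-style additive angular margin with an extra additive penalty on the K highest-scoring non-target classes, assuming PyTorch; the hyper-parameter values (s, m, k, m_top) and the exact form of the penalty are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: ArcFace-style additive angular margin loss with an
# extra inter-topK penalty on the most confusable non-target classes,
# following the abstract's description. Hyper-parameters are placeholders.
import torch
import torch.nn.functional as F


def arcface_intertopk_loss(emb, weight, labels, s=32.0, m=0.2, k=5, m_top=0.06):
    # emb: (batch, dim) embeddings; weight: (num_classes, dim) class centres.
    cos = F.normalize(emb) @ F.normalize(weight).t()            # (batch, num_classes)
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    target = F.one_hot(labels, cos.size(1)).bool()

    # ArcFace: additive angular margin on the target class only.
    logits = torch.where(target, torch.cos(theta + m), cos)

    # Inter-topK: add an extra penalty to the K highest-scoring non-target
    # classes, pushing easily confused speakers further apart.
    non_target = cos.masked_fill(target, float('-inf'))
    topk_idx = non_target.topk(k, dim=1).indices
    penalty = torch.zeros_like(cos).scatter(1, topk_idx, m_top)
    # `penalty` is zero at the target class by construction.
    logits = logits + penalty

    return F.cross_entropy(s * logits, labels)
```

Raising the logits of the top-K non-target classes increases the loss contributed by easily confused speakers, so training pushes their class centres further from the target embedding.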
Proceedings Article

X-Vectors: Robust DNN Embeddings for Speaker Recognition

TL;DR: This paper uses data augmentation, consisting of added noise and reverberation, as an inexpensive method to multiply the amount of training data and improve robustness of deep neural network embeddings for speaker recognition.
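The additive-noise part of the augmentation mentioned above can be sketched as mixing a noise waveform into the speech at a chosen signal-to-noise ratio. The snippet below is a plain-PyTorch illustration under that assumption, not the paper's recipe; the reverberation part (convolving with a room impulse response) is omitted.

```python
# Illustrative sketch of additive-noise augmentation at a target SNR.
import torch


def add_noise(speech, noise, snr_db):
    # speech, noise: 1-D waveforms of the same length.
    speech_power = speech.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp(min=1e-10)
    # Scale the noise so that 10 * log10(speech_power / scaled_noise_power) == snr_db.
    scale = torch.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```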
Proceedings Article

VoxCeleb2: Deep Speaker Recognition.

TL;DR: In this article, a large-scale audio-visual speaker recognition dataset, VoxCeleb2, is presented, which contains over a million utterances from over 6,000 speakers.
Journal Article

Additive Margin Softmax for Face Verification

TL;DR: In this paper, the authors proposed additive margin softmax, a conceptually simple, intuitive, and interpretable learning objective function for face verification.
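For reference, the additive margin softmax objective summarized here is commonly written as below, where s is a scale factor, m the additive cosine margin, and θ_{y_i} the angle between the i-th embedding and its target class weight (notation assumed for illustration, not copied from the paper):

```latex
\mathcal{L}_{\mathrm{AMS}}
  = -\frac{1}{N}\sum_{i=1}^{N}
    \log\frac{e^{\,s\,(\cos\theta_{y_i}-m)}}
             {e^{\,s\,(\cos\theta_{y_i}-m)} + \sum_{j\neq y_i} e^{\,s\cos\theta_{j}}}
```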