Open Access Proceedings Article

Multi-Query Multi-Head Attention Pooling and Inter-Topk Penalty for Speaker Verification

TLDR
The authors proposed a multi-query multi-head attention (MQMHA) pooling method and an inter-topK penalty, which achieved state-of-the-art performance on all of the public VoxCeleb test sets.
Abstract
This paper describes the multi-query multi-head attention (MQMHA) pooling and inter-topK penalty methods, which were first proposed in our submitted system description for the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2021. Most multi-head attention pooling mechanisms either attend to the whole feature through multiple heads or attend to several split parts of the whole feature. Our proposed MQMHA combines these two mechanisms and gains more diversified information. Margin-based softmax loss functions are commonly adopted to obtain discriminative speaker representations. To further enhance inter-class discriminability, we propose a method that adds an extra inter-topK penalty on the most easily confused speakers. By adopting both MQMHA and the inter-topK penalty, we achieved state-of-the-art performance on all of the public VoxCeleb test sets.
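To make the pooling idea concrete, the following is a minimal sketch of how an MQMHA-style pooling layer might look, assuming PyTorch and frame-level input features of shape (batch, channels, frames). The grouped 1x1 convolution used as the scoring network, the layer sizes, and the use of weighted mean and standard-deviation statistics are illustrative assumptions, not the authors' exact implementation. A sketch of the inter-topK penalty is given alongside the ArcFace reference below.

```python
# Hypothetical sketch of multi-query multi-head attention (MQMHA) pooling.
# Assumptions (not from the paper): PyTorch, input of shape (batch, channels, frames),
# a grouped 1x1 convolution as the per-head scoring network, and weighted
# mean + standard deviation statistics per (head, query) pair.
import torch
import torch.nn as nn


class MQMHAPooling(nn.Module):
    def __init__(self, channels, num_heads=4, num_queries=2):
        super().__init__()
        assert channels % num_heads == 0
        self.num_heads = num_heads
        self.num_queries = num_queries
        self.head_dim = channels // num_heads
        # Grouped 1x1 convolution: each head's split of the channels
        # produces `num_queries` attention score maps over the frames.
        self.score = nn.Conv1d(channels, num_heads * num_queries,
                               kernel_size=1, groups=num_heads)

    def forward(self, x):
        # x: (batch, channels, frames)
        b, c, t = x.shape
        # Attention weights over frames: (batch, num_heads, num_queries, frames)
        w = self.score(x).view(b, self.num_heads, self.num_queries, t).softmax(dim=-1)
        # Split the features into heads: (batch, num_heads, head_dim, frames)
        h = x.view(b, self.num_heads, self.head_dim, t)
        # Weighted mean and standard deviation per (head, query) pair
        mean = torch.einsum('bhqt,bhdt->bhqd', w, h)
        sq = torch.einsum('bhqt,bhdt->bhqd', w, h ** 2)
        std = (sq - mean ** 2).clamp(min=1e-6).sqrt()
        # Concatenate all statistics into a single utterance-level vector
        return torch.cat([mean, std], dim=-1).flatten(start_dim=1)
```

With the defaults above, an input of shape (8, 512, 200) yields an utterance-level vector of shape (8, 2048): each of the 4 heads contributes 2 query-specific mean/std pairs over its 128-dimensional split.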



Citations
Journal Article

Temporal feature extraction based on CNN-BLSTM and temporal pooling for language identification

TL;DR: In this paper, a temporal feature extraction method based on a convolutional neural network-bidirectional long short-term memory (CNN-BLSTM) and temporal pooling (TMPOOL) is proposed for language identification.
Proceedings Article

Exploring Binary Classification Loss for Speaker Verification

TL;DR: In this article, the authors propose a framework that uses several binary classifiers to train the speaker model in a pair-wise manner instead of performing multi-class classification, which effectively reduces the gap between training and evaluation.
Journal Article

Aggregating discriminative embedding by triple-domain feature joint learning with bidirectional sampling for speaker verification

TL;DR: The authors propose TribiNet, a triple-domain feature joint learning approach that enhances discriminative embeddings for text-independent speaker verification using a bidirectional-sampling multi-scale feature aggregation network based on Fisher feature fusion.
References
Proceedings Article

Deep Residual Learning for Image Recognition

TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and won first place in the ILSVRC 2015 classification task.
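As a reminder of what the residual learning framework amounts to, here is a minimal sketch of a basic residual block, assuming PyTorch; the channel count and layer choices are illustrative rather than the paper's exact architecture.

```python
# Minimal sketch of a basic residual block in the spirit of ResNet.
# Assumptions: PyTorch, equal input/output channel counts (identity shortcut).
import torch
import torch.nn as nn


class BasicResBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Learn a residual F(x) and add it to the identity shortcut: y = F(x) + x.
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)
```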
Proceedings Article

ArcFace: Additive Angular Margin Loss for Deep Face Recognition

TL;DR: This paper presents arguably the most extensive experimental evaluation against all recent state-of-the-art face recognition methods on ten face recognition benchmarks, and shows that ArcFace consistently outperforms the state of the art and can be easily implemented with negligible computational overhead.
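The inter-topK penalty described in the abstract above is applied on top of a margin-based softmax of this kind. The sketch below combines an ArcFace-style additive angular margin with an extra additive penalty on the K highest-scoring non-target classes, assuming PyTorch; the hyper-parameter values (s, m, k, m_top) and the exact form of the penalty are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: ArcFace-style additive angular margin loss with an
# extra inter-topK penalty on the most confusable non-target classes,
# following the abstract's description. Hyper-parameters are placeholders.
import torch
import torch.nn.functional as F


def arcface_intertopk_loss(emb, weight, labels, s=32.0, m=0.2, k=5, m_top=0.06):
    # emb: (batch, dim) embeddings; weight: (num_classes, dim) class centres.
    cos = F.normalize(emb) @ F.normalize(weight).t()            # (batch, num_classes)
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    target = F.one_hot(labels, cos.size(1)).bool()

    # ArcFace: additive angular margin on the target class only.
    logits = torch.where(target, torch.cos(theta + m), cos)

    # Inter-topK: add an extra penalty to the K highest-scoring non-target
    # classes, pushing easily confused speakers further apart.
    non_target = cos.masked_fill(target, float('-inf'))
    topk_idx = non_target.topk(k, dim=1).indices
    penalty = torch.zeros_like(cos).scatter(1, topk_idx, m_top)
    # `penalty` is zero at the target class by construction.
    logits = logits + penalty

    return F.cross_entropy(s * logits, labels)
```

Raising the logits of the top-K non-target classes increases the loss contributed by easily confused speakers, so training pushes their class centres further from the target embedding.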
Proceedings Article

X-Vectors: Robust DNN Embeddings for Speaker Recognition

TL;DR: This paper uses data augmentation, consisting of added noise and reverberation, as an inexpensive method to multiply the amount of training data and improve robustness of deep neural network embeddings for speaker recognition.
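The additive-noise part of the augmentation mentioned above can be sketched as mixing a noise waveform into the speech at a chosen signal-to-noise ratio. The snippet below is a plain-PyTorch illustration under that assumption, not the paper's recipe; the reverberation part (convolving with a room impulse response) is omitted.

```python
# Illustrative sketch of additive-noise augmentation at a target SNR.
import torch


def add_noise(speech, noise, snr_db):
    # speech, noise: 1-D waveforms of the same length.
    speech_power = speech.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp(min=1e-10)
    # Scale the noise so that 10 * log10(speech_power / scaled_noise_power) == snr_db.
    scale = torch.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```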
Proceedings Article

VoxCeleb2: Deep Speaker Recognition.

TL;DR: In this article, a large-scale audio-visual speaker recognition dataset, VoxCeleb2, is presented, which contains over a million utterances from over 6,000 speakers.
Journal Article

Additive Margin Softmax for Face Verification

TL;DR: In this paper, the authors proposed additive margin softmax, a conceptually simple, intuitive, and interpretable learning objective function for face verification.
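For reference, the additive margin softmax objective summarized here is commonly written as below, where s is a scale factor, m the additive cosine margin, and θ_{y_i} the angle between the i-th embedding and its target class weight (notation assumed for illustration, not copied from the paper):

```latex
\mathcal{L}_{\mathrm{AMS}}
  = -\frac{1}{N}\sum_{i=1}^{N}
    \log\frac{e^{\,s\,(\cos\theta_{y_i}-m)}}
             {e^{\,s\,(\cos\theta_{y_i}-m)} + \sum_{j\neq y_i} e^{\,s\cos\theta_{j}}}
```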