Dr-Vectors: Decision Residual Networks and an Improved Loss for Speaker Recognition

Jason Pelecanos, +2 more

- 05 Apr 2021 -

arXiv: Computation and Language

TLDR

This paper proposes a decision residual network (DRN): a compact 2nd-stage neural network (known as a decision network) that builds on cosine scores and models the remaining residual, so that speaker recognition scoring can capture uncertainty, enroll/test asymmetry and additional non-linear information.

Abstract:

Many neural network speaker recognition systems model each speaker using a fixed-dimensional embedding vector. These embeddings are generally compared using either linear or 2nd-order scoring and, until recently, do not handle utterance-specific uncertainty. In this work we propose scoring these representations in a way that can capture uncertainty, enroll/test asymmetry and additional non-linear information. This is achieved by incorporating a 2nd-stage neural network (known as a decision network) as part of an end-to-end training regimen. In particular, we propose the concept of decision residual networks which involves the use of a compact decision network to leverage cosine scores and to model the residual signal that's needed. Additionally, we present a modification to the generalized end-to-end softmax loss function to target the separation of same/different speaker scores. We observed significant performance gains for the two techniques.
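
The residual idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the class name `ResidualDecisionNet`, the concatenated enroll/test input, the single ReLU hidden layer and its size are all assumptions made for illustration. The only part taken from the abstract is the overall structure, a cosine score plus a correction produced by a compact decision network.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


class ResidualDecisionNet:
    """Hypothetical decision network: score = cosine + learned residual."""

    def __init__(self, dim, hidden=8, seed=0):
        rng = random.Random(seed)
        # Assumption: enroll and test embeddings are concatenated, so the
        # network sees them in distinct positions and can be asymmetric.
        in_dim = 2 * dim
        self.w1 = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)]
                   for _ in range(hidden)]
        self.b1 = [0.0] * hidden
        # Output layer starts at zero, so the untrained residual is zero.
        self.w2 = [0.0] * hidden
        self.b2 = 0.0

    def residual(self, enroll, test):
        x = list(enroll) + list(test)
        # One ReLU hidden layer followed by a linear output unit.
        h = [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        return sum(w * v for w, v in zip(self.w2, h)) + self.b2

    def score(self, enroll, test):
        # The cosine score is passed through unchanged; the network only
        # has to model whatever correction is still needed.
        return cosine(enroll, test) + self.residual(enroll, test)


net = ResidualDecisionNet(dim=3)
enroll_emb = [1.0, 0.0, 0.0]
test_emb = [1.0, 1.0, 0.0]
baseline = net.score(enroll_emb, test_emb)  # equals the cosine score before training
```

Zero-initializing the output layer is a design choice of this sketch: the untrained network reproduces the cosine score exactly, and training only has to learn the residual. Because the enroll and test embeddings occupy different input positions, the learned correction is free to treat them asymmetrically.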

Citations

Posted Content

Personalized Keyphrase Detection using Speaker and Environment Information

Rajeev V. Rikhye, +8 more

- 28 Apr 2021 -

arXiv: Audio and Speech Processing

TL;DR: This article presents a streaming keyphrase detection system that can be easily customized to accurately detect any phrase composed of words from a large vocabulary. The system is implemented with an end-to-end trained automatic speech recognition (ASR) model and a text-independent speaker verification model.

Posted Content

SpeakerStew: Scaling to Many Languages with a Triaged Multilingual Text-Dependent and Text-Independent Speaker Verification System

Roza Chojnacka, +3 more

- 05 Apr 2021 -

arXiv: Audio and Speech Processing

TL;DR: This paper presents SpeakerStew, a hybrid system that performs speaker verification in 46 languages for smart speaker interactions consisting of a wake-up keyword (text-dependent) followed by a speech query (text-independent).

Posted Content

A Conformer-based ASR Frontend for Joint Acoustic Echo Cancellation, Speech Enhancement and Speech Separation

Tom O'Malley, +5 more

- 18 Nov 2021 -

arXiv: Audio and Speech Processing

TL;DR: In this paper, a frontend for improving robustness of automatic speech recognition (ASR) is proposed that jointly implements three modules within a single model: acoustic echo cancellation, speech enhancement, and speech separation.

References

Proceedings Article

Deep Residual Learning for Image Recognition

Kaiming He, +3 more

TL;DR: The authors propose a residual learning framework to ease the training of networks that are substantially deeper than those used previously; the approach won 1st place on the ILSVRC 2015 classification task.

Journal Article

Long short-term memory

Sepp Hochreiter, +1 more

- 01 Nov 1997 -

Neural Computation

TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.

Proceedings Article

FaceNet: A unified embedding for face recognition and clustering

Florian Schroff, +2 more

TL;DR: A system that directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity, and achieves state-of-the-art face recognition performance using only 128 bytes per face.

Proceedings Article

ArcFace: Additive Angular Margin Loss for Deep Face Recognition

Jiankang Deng, +3 more

TL;DR: This paper presents arguably the most extensive experimental evaluation against all recent state-of-the-art face recognition methods on ten face recognition benchmarks, and shows that ArcFace consistently outperforms the state of the art and can be easily implemented with negligible computational overhead.

Journal Article

Front-End Factor Analysis for Speaker Verification

Najim Dehak, +4 more

- 01 May 2011 -

IEEE Transactions on Audio, Speech, and ...

TL;DR: This paper extends the authors' previous work by proposing a new speaker representation for speaker verification. A new low-dimensional speaker- and channel-dependent space is defined using simple factor analysis; it is named the total variability space because it models both speaker and channel variabilities.

Related Papers (5)

End-to-end losses based on speaker basis vectors and all-speaker hard negative mining for speaker verification

Hee-Soo Heo, +5 more

- 07 Feb 2019 -

arXiv: Audio and Speech Processing

Adapting End-to-End Neural Speaker Verification to New Languages and Recording Conditions with Adversarial Training

Gautam Bhattacharya, +2 more

- 07 Nov 2018 -

arXiv: Audio and Speech Processing

Gaussian-constrained Training for Speaker Verification

Generalized End-to-End Loss for Speaker Verification

Generative Adversarial Speaker Embedding Networks for Domain Robust End-to-end Speaker Verification