Open Access, Posted Content

Dr-Vectors: Decision Residual Networks and an Improved Loss for Speaker Recognition

TLDR
This paper proposes decision residual networks (DRNs): a compact 2nd-stage neural network (known as a decision network) that scores speaker embeddings in a way that captures uncertainty, enroll/test asymmetry and additional non-linear information for speaker recognition.
Abstract
Many neural network speaker recognition systems model each speaker using a fixed-dimensional embedding vector. These embeddings are generally compared using either linear or 2nd-order scoring and, until recently, did not handle utterance-specific uncertainty. In this work we propose scoring these representations in a way that can capture uncertainty, enroll/test asymmetry and additional non-linear information. This is achieved by incorporating a 2nd-stage neural network (known as a decision network) as part of an end-to-end training regimen. In particular, we propose the concept of decision residual networks, which involves the use of a compact decision network to leverage cosine scores and to model the residual signal that is needed. Additionally, we present a modification to the generalized end-to-end softmax loss function to target the separation of same/different speaker scores. We observed significant performance gains for both techniques.
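The residual idea in the abstract can be illustrated with a minimal sketch: the final score is the cosine score plus a correction produced by a small network that sees the ordered (enroll, test) pair. Everything specific here is an assumption for illustration and is not taken from the paper: the function names, the single hidden layer, the ReLU activation, the concatenated input, and the zero-initialized output layer (chosen so that the untrained score equals the plain cosine score).

```python
import numpy as np


def cosine_score(enroll, test):
    """Plain cosine similarity between an enrollment and a test embedding."""
    return float(enroll @ test / (np.linalg.norm(enroll) * np.linalg.norm(test)))


def init_decision_net(dim, hidden, rng):
    """Hypothetical compact decision network: one hidden layer, scalar output.

    The output layer starts at zero (an assumption of this sketch), so the
    residual is zero before any training.
    """
    return {
        "w1": rng.normal(0.0, 0.1, size=(2 * dim, hidden)),
        "b1": np.zeros(hidden),
        "w2": np.zeros((hidden, 1)),
        "b2": np.zeros(1),
    }


def decision_residual_score(enroll, test, params):
    """Cosine score plus a learned residual from the decision network."""
    # Concatenating in a fixed (enroll, test) order lets the network treat
    # the two sides differently, i.e. model enroll/test asymmetry.
    x = np.concatenate([enroll, test])
    h = np.maximum(0.0, x @ params["w1"] + params["b1"])  # ReLU hidden layer
    residual = float((h @ params["w2"] + params["b2"])[0])
    return cosine_score(enroll, test) + residual


rng = np.random.default_rng(0)
enroll = rng.normal(size=8)
test = rng.normal(size=8)
params = init_decision_net(dim=8, hidden=4, rng=rng)
score = decision_residual_score(enroll, test, params)
```

In this sketch the network only has to learn what the cosine score misses, which is the sense in which it models a residual; in an end-to-end setup the parameters would be trained jointly with the embedding model.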


Citations
Posted Content

Personalized Keyphrase Detection using Speaker and Environment Information

TL;DR: In this article, a streaming keyphrase detection system that can be easily customized to accurately detect any phrase composed of words from a large vocabulary is presented. The system is implemented with an end-to-end trained automatic speech recognition (ASR) model and a text-independent speaker verification model.
Posted Content

SpeakerStew: Scaling to Many Languages with a Triaged Multilingual Text-Dependent and Text-Independent Speaker Verification System

TL;DR: SpeakerStew is a hybrid system to perform speaker verification on 46 languages using a smart speaker with interactions consisting of a wake-up keyword (text-dependent) followed by a speech query (text-independent).
Posted Content

A Conformer-based ASR Frontend for Joint Acoustic Echo Cancellation, Speech Enhancement and Speech Separation

TL;DR: In this paper, a frontend for improving robustness of automatic speech recognition (ASR) is proposed that jointly implements three modules within a single model: acoustic echo cancellation, speech enhancement, and speech separation.
References
Proceedings Article

Deep Residual Learning for Image Recognition

TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won 1st place on the ILSVRC 2015 classification task.
Journal Article

Long short-term memory

TL;DR: A novel, efficient, gradient-based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
Proceedings Article

FaceNet: A unified embedding for face recognition and clustering

TL;DR: A system that directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity, and achieves state-of-the-art face recognition performance using only 128 bytes per face.
Proceedings Article

ArcFace: Additive Angular Margin Loss for Deep Face Recognition

TL;DR: This paper presents arguably the most extensive experimental evaluation against all recent state-of-the-art face recognition methods on ten face recognition benchmarks, and shows that ArcFace consistently outperforms the state of the art and can be easily implemented with negligible computational overhead.
Journal Article

Front-End Factor Analysis for Speaker Verification

TL;DR: An extension of previous work that proposes a new speaker representation for speaker verification: a new low-dimensional speaker- and channel-dependent space is defined using simple factor analysis and is named the total variability space because it models both speaker and channel variabilities.