Proceedings ArticleDOI

Syllable-Dependent Discriminative Learning for Small Footprint Text-Dependent Speaker Verification

TLDR
This paper proposes a syllable-dependent discriminative speaker embedding learning scheme for small-footprint text-dependent speaker verification systems, designed to suppress undesired syllable variation and enhance the discriminative power inherent in the frame-level features.
Abstract
This study proposes a novel scheme of syllable-dependent discriminative speaker embedding learning for small-footprint text-dependent speaker verification systems. To suppress undesired syllable variation and enhance the discriminative power inherent in the frame-level features, we design a novel syllable-dependent clustering loss to optimize the network. Specifically, this loss function uses syllable labels as auxiliary supervision to explicitly maximize inter-syllable separability and intra-syllable compactness among the learned frame-level features. We then propose two syllable-dependent pooling mechanisms that aggregate the frame-level features into several syllable-level features by averaging the features corresponding to each syllable. The utterance-level speaker embeddings are obtained by concatenating the syllable-level features. Experimental results on the Tencent voice wake-up dataset show that the proposed scheme accelerates network convergence and achieves a significant performance improvement over state-of-the-art methods.
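The pooling step described in the abstract (average the frame-level features of each syllable, then concatenate the syllable-level means) can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names `syllable_mean_pool` and `intra_syllable_scatter`, the per-frame integer syllable labels, and the zero vector used for a syllable with no frames are all assumptions made for the example. The second function only measures intra-syllable compactness as a mean squared distance to the syllable centroid; the abstract does not give the actual form of the clustering loss.

```python
from typing import List, Sequence


def syllable_mean_pool(
    frames: Sequence[Sequence[float]],
    syllable_ids: Sequence[int],
    num_syllables: int,
) -> List[float]:
    """Average frame-level features per syllable, then concatenate the means."""
    if not frames:
        raise ValueError("need at least one frame")
    if len(frames) != len(syllable_ids):
        raise ValueError("need exactly one syllable label per frame")
    dim = len(frames[0])
    sums = [[0.0] * dim for _ in range(num_syllables)]
    counts = [0] * num_syllables
    for frame, syl in zip(frames, syllable_ids):
        counts[syl] += 1
        for d, value in enumerate(frame):
            sums[syl][d] += value
    embedding: List[float] = []
    for syl in range(num_syllables):
        # Assumption: a syllable with no aligned frames contributes zeros.
        n = max(counts[syl], 1)
        embedding.extend(total / n for total in sums[syl])
    return embedding


def intra_syllable_scatter(
    frames: Sequence[Sequence[float]],
    syllable_ids: Sequence[int],
    num_syllables: int,
) -> float:
    """Mean squared distance of each frame to its own syllable centroid.

    Illustrative compactness measure only; not the loss from the paper.
    """
    pooled = syllable_mean_pool(frames, syllable_ids, num_syllables)
    dim = len(frames[0])
    total = 0.0
    for frame, syl in zip(frames, syllable_ids):
        centroid = pooled[syl * dim:(syl + 1) * dim]
        total += sum((v - c) ** 2 for v, c in zip(frame, centroid))
    return total / len(frames)


# Three 2-dimensional frames: the first two belong to syllable 0, the last to syllable 1.
frames = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0]]
syllable_ids = [0, 0, 1]
embedding = syllable_mean_pool(frames, syllable_ids, num_syllables=2)
scatter = intra_syllable_scatter(frames, syllable_ids, num_syllables=2)
```

Because the output has a fixed length of `num_syllables * dim`, the embedding size does not depend on utterance duration, which suits a fixed wake-up phrase. A lower scatter value indicates tighter syllable clusters in the frame-level features.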

Citations
Proceedings ArticleDOI

Deep Speaker Embedding with Long Short Term Centroid Learning for Text-Independent Speaker Verification.

TL;DR: The speaker identity information is introduced to form long-term speaker embedding centroids, which are determined by all the speakers in the training set and employed to construct a loss function, named long short term speaker loss (LSTSL).
Posted Content

SpeechNAS: Towards Better Trade-off between Latency and Accuracy for Large-Scale Speaker Verification

TL;DR: In this paper, optimal architectures, named SpeechNAS, were identified from a time delay neural network (TDNN) based search space using neural architecture search (NAS), achieving an equal error rate (EER) of 1.02% on the standard test set of VoxCeleb1.
Journal ArticleDOI

Augmented-syllabification of n-gram tagger for Indonesian words and named-entities

TL;DR: This article proposes an augmented-syllabification n-gram tagger (ASnGT) model, which is applied at the grapheme level and relies on neither vowel nor diphthong detection.
References
Journal Article

Visualizing Data using t-SNE

TL;DR: A new technique called t-SNE visualizes high-dimensional data by giving each datapoint a location in a two- or three-dimensional map. It is a variation of Stochastic Neighbor Embedding that is much easier to optimize and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map.
Journal ArticleDOI

Front-End Factor Analysis for Speaker Verification

TL;DR: An extension of previous work that proposes a new speaker representation for speaker verification: a new low-dimensional speaker- and channel-dependent space is defined using simple factor analysis and is named the total variability space because it models both speaker and channel variabilities.
Proceedings ArticleDOI

CosFace: Large Margin Cosine Loss for Deep Face Recognition

TL;DR: In this article, the authors proposed a large margin cosine loss (LMCL), which normalizes both features and weight vectors to remove radial variations, based on which a cosine margin term is introduced to further maximize the decision margin in the angular space.
Journal ArticleDOI

Speaker recognition: a tutorial

TL;DR: A tutorial on the design and development of automatic speaker-recognition systems is presented, and a new automatic speaker-recognition system is given that performs with 98.9% correct identification.
Proceedings ArticleDOI

Deep neural networks for small footprint text-dependent speaker verification

TL;DR: Experimental results show the DNN-based speaker verification system achieves good performance compared to a popular i-vector system on a small footprint text-dependent speaker verification task and is more robust to additive noise and outperforms the i-vector system at low False Rejection operating points.