Duration mismatch compensation for i-vector based speaker recognition systems

doi:10.1109/ICASSP.2013.6639154

Open Access, Proceedings Article

- pp 7663-7667

TLDR

The effect of duration variability on the phoneme distributions of speech utterances and on i-vector length is analyzed. As utterance duration decreases, the number of detected unique phonemes and the i-vector length approach zero in a logarithmic and a non-linear fashion, respectively.

Abstract:

Speaker recognition systems trained on long-duration utterances are known to perform significantly worse when short test segments are encountered. To address this mismatch, we analyze the effect of duration variability on the phoneme distributions of speech utterances and on i-vector length. We demonstrate that, as utterance duration decreases, the number of detected unique phonemes and the i-vector length approach zero in a logarithmic and a non-linear fashion, respectively. Treating duration variability as additive noise in the i-vector space, we propose three strategies for its compensation: i) multi-duration training of the Probabilistic Linear Discriminant Analysis (PLDA) model, ii) score calibration using log duration as a Quality Measure Function (QMF), and iii) multi-duration PLDA training with synthesized short-duration i-vectors. Experiments follow the 2012 National Institute of Standards and Technology (NIST) Speaker Recognition Evaluation (SRE) protocol with varying test utterance duration. The results demonstrate the effectiveness of the proposed schemes on short-duration test conditions, especially with the QMF calibration approach.
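The abstract does not state the exact functional form of the QMF calibration, so the following is only a minimal sketch under stated assumptions: the calibrated score is taken to be a linear fusion of the raw verification score and the log of the test duration, with weights fitted by logistic regression on labeled trials. The function names `fit_qmf_calibration` and `apply_qmf_calibration`, the plain gradient-descent optimizer, and the synthetic trials are all hypothetical and are not taken from the paper.

```python
import numpy as np


def fit_qmf_calibration(scores, durations, labels, lr=0.1, n_iter=2000):
    """Fit s_cal = w0 + w1 * s + w2 * log(d) by logistic regression.

    Assumed form (not from the paper): a linear fusion of the raw score s
    and a log-duration quality measure, trained with plain gradient descent.
    labels: 1.0 for target trials, 0.0 for non-target trials.
    """
    scores = np.asarray(scores, dtype=float)
    durations = np.asarray(durations, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Design matrix: bias term, raw score, log duration (the QMF term).
    X = np.column_stack([np.ones_like(scores), scores, np.log(durations)])
    w = np.zeros(3)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))      # predicted target posterior
        grad = X.T @ (p - labels) / len(labels)  # gradient of mean log-loss
        w -= lr * grad
    return w


def apply_qmf_calibration(w, scores, durations):
    """Apply the fitted weights to new trials; returns calibrated scores."""
    scores = np.asarray(scores, dtype=float)
    durations = np.asarray(durations, dtype=float)
    return w[0] + w[1] * scores + w[2] * np.log(durations)


# Synthetic illustration only: target scores shrink as duration shortens.
rng = np.random.default_rng(0)
n_trials = 400
demo_labels = rng.integers(0, 2, n_trials).astype(float)
demo_durations = rng.uniform(5.0, 300.0, n_trials)  # seconds (assumed unit)
demo_scores = (demo_labels * 4.0 * np.log(demo_durations) / np.log(300.0)
               + rng.normal(0.0, 1.0, n_trials))

demo_w = fit_qmf_calibration(demo_scores, demo_durations, demo_labels)
demo_calibrated = apply_qmf_calibration(demo_w, demo_scores, demo_durations)
print("fitted weights (bias, score, log-duration):", demo_w)
```

With the duration term in the fusion, a single decision threshold can be applied across test segments of different lengths; whether this matches the calibration used in the paper cannot be confirmed from the abstract alone.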
