Proceedings Article

Factor analysis methods for joint speaker verification and spoof detection

TL;DR: This paper develops a joint modelling approach that detects spoofing attacks while also performing speaker verification, and proposes a factor modelling approach in which the spoof variability subspace and the speaker variability subspace are jointly trained.
Abstract: The performance of a speaker verification system is severely degraded by spoofing attacks generated from artificial speech synthesizers. Recently, several approaches have been proposed for classifying natural and synthetic speech (spoof detection) which can be used in conjunction with a speaker verification system. In this paper, we attempt to develop a joint modelling approach which can detect the presence of spoofing attacks while also performing the speaker verification task. We propose a factor modelling approach where the spoof variability subspace and the speaker variability subspace are jointly trained. The lower dimensional projections in these subspaces are used for speaker verification as well as spoof detection tasks. We also investigate the benefits of linear discriminant analysis (LDA), widely used in speaker recognition, for the spoof detection task. Several experiments are performed using the speaker and spoofing (SAS) database. For speaker verification, we compare the performance of the proposed method with a baseline method of fusing a conventional speaker verification system and a spoof detection system. In these experiments, the proposed approach provides substantial improvements for spoof detection (relative improvements of 20% in EER over the baseline) as well as speaker verification under spoofing conditions (relative improvements of 40% in EER over the baseline).
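The abstract reports its results as relative improvements in equal error rate (EER). As a rough illustration of what those figures mean, the sketch below approximates the EER by sweeping a decision threshold and computes a relative improvement. The function names and the example scores are hypothetical and are not taken from the paper; the paper's own evaluation tooling is not described in this listing.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping thresholds over all observed scores.

    Assumption: higher scores mean "more likely a genuine target trial".
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection rate: target trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance rate: non-target trials scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_improvement(baseline_eer, new_eer):
    """Relative EER reduction with respect to a baseline system."""
    return (baseline_eer - new_eer) / baseline_eer


# Hypothetical scores, for illustration only.
targets = [0.9, 0.8, 0.7, 0.3]
nontargets = [0.6, 0.4, 0.2, 0.1]
print(equal_error_rate(targets, nontargets))
print(relative_improvement(0.05, 0.04))
```

Under this definition, a system that lowers a baseline EER of 5% to 4% shows a 20% relative improvement, even though the absolute change is one percentage point.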
Citations
Proceedings Article
20 Aug 2017
TL;DR: Inspired by the success of ResNet in image recognition, the effectiveness of using ResNet for automatic spoofing detection is investigated and it is found that if the same feature is used for different fused models, the resulting system can hardly be improved.
Abstract: Speaker verification systems have achieved great progress in recent years. Unfortunately, they are still highly prone to different kinds of spoofing attacks such as speech synthesis, voice conversion, and fake audio recordings etc. Inspired by the success of ResNet in image recognition, we investigated the effectiveness of using ResNet for automatic spoofing detection. Experimental results on the ASVspoof2017 data set show that ResNet performs the best among all the single-model systems. Model fusion is a good way to further improve the system performance. Nevertheless, we found that if the same feature is used for different fused models, the resulting system can hardly be improved. By using different features and models, our best fused model further reduced the Equal Error Rate (EER) by 18% relatively, compared with the best single-model system.

128 citations


Cites background from "Factor analysis methods for joint s..."

  • ...It is therefore very important to develop systems to automatically detect spoofing attacks — either in a joint modeling approach that can detect spoofing attack while at the same time perform the speaker verification task [11], or in a separate system that is used in conjunction with a speaker verification system [12]....

    [...]

Journal Article
TL;DR: A modified version of rVAD is presented where computationally intensive pitch extraction is replaced by computationally efficient spectral flatness calculation, which significantly reduces the computational complexity at the cost of moderately inferior VAD performance, which is an advantage when processing a large amount of data and running on low resource devices.

90 citations


Cites methods from "Factor analysis methods for joint s..."

  • ...0 - has already been made publicly available while no paper has been published to document the rVAD method and a number of studies have used it, covering applications such as voice activity detection in speaker verification [39, 40, 41], age and gender identification [42], emotion detection and recognition [43, 44, 45], and discovering linguistic structures [46]....

    [...]

Journal Article
TL;DR: It is shown that the end-to-end approach based on a raw waveform input can outperform common cepstral features, without the use of context-dependent frame extensions, and that the proposed model is capable of distinguishing device-invariant spoofing attempts.
Abstract: Recent advances in automatic speaker verification (ASV) have led to an increased interest in securing these systems for real-world applications. Malicious spoofing attempts against ASV systems can lead to serious security breaches. A spoofing attack within the context of ASV is a condition in which a (potentially harmful) person successfully masquerades as another person already known to the ASV system by falsifying or manipulating data. While most previous work focuses on enhanced, spoof-aware features, end-to-end models can be a potential alternative. In this paper, we investigate the training of raw waveform front-ends for deep convolutional, long short-term memory (LSTM) and vanilla neural networks, which are analyzed for their suitability toward spoofing detection, regarding the influence of frame size, number of output neurons, and sequence length. A joint convolutional LSTM neural network (CLDNN) is proposed, which outperforms previous attempts on the BTAS2016 dataset (0.82% → 0.19% HTER), placing itself as the current state-of-the-art model for the dataset. We show that end-to-end approaches are appropriate for the important replay detection task and show that the proposed model is capable of distinguishing device-invariant spoofing attempts. Regarding the ASVspoof2015 dataset, the end-to-end solution achieves an equal error rate (EER) of 0.00% for the S1-S9 conditions. We show that the end-to-end approach based on a raw waveform input can outperform common cepstral features, without the use of context-dependent frame extensions. In addition, a cross-database (domain mismatch) scenario is also evaluated, which shows that the proposed CLDNN model trained on the BTAS2016 dataset achieves an EER of 25.7% on the ASVspoof2015 dataset.

38 citations


Cites background or methods from "Factor analysis methods for joint s..."

  • ...joint ASV-Spoof detection systems requires training two vastly different systems with different features and evaluation protocols [2]....

    [...]

  • ...techniques such as joint factor analysis (JFA) [1], [2] as well as i-vector based approaches [1], [3], [4] and deep learning techniques [5]–[8]....

    [...]

Posted Content
TL;DR: In this article, an unsupervised segment-based method for robust voice activity detection (rVAD) is presented, which consists of two passes of denoising followed by a VAD stage, where high-energy segments in a speech signal are detected by using a posteriori signal-to-noise ratio (SNR) weighted energy difference.
Abstract: This paper presents an unsupervised segment-based method for robust voice activity detection (rVAD). The method consists of two passes of denoising followed by a voice activity detection (VAD) stage. In the first pass, high-energy segments in a speech signal are detected by using a posteriori signal-to-noise ratio (SNR) weighted energy difference and, if no pitch is detected within a segment, the segment is considered a high-energy noise segment and set to zero. In the second pass, the speech signal is denoised by a speech enhancement method, for which several methods are explored. Next, neighbouring frames with pitch are grouped together to form pitch segments, and based on speech statistics, the pitch segments are further extended from both ends in order to include both voiced and unvoiced sounds and likely non-speech parts as well. In the end, a posteriori SNR weighted energy difference is applied to the extended pitch segments of the denoised speech signal for detecting voice activity. We evaluate the VAD performance of the proposed method using two databases, RATS and Aurora-2, which contain a large variety of noise conditions. The rVAD method is further evaluated, in terms of speaker verification performance, on the RedDots 2016 challenge database and its noise-corrupted versions. Experimental results show that rVAD compares favourably with a number of existing methods. In addition, we present a modified version of rVAD where computationally intensive pitch extraction is replaced by computationally efficient spectral flatness calculation. The modified version significantly reduces the computational complexity at the cost of moderately inferior VAD performance, which is an advantage when processing a large amount of data and running on low resource devices. The source code of rVAD is made publicly available.

18 citations

Posted Content
TL;DR: A spoofing-robust automatic speaker verification system for diverse attacks based on a multi-task learning architecture that is jointly trained with time-frequency representations from utterances to provide recognition decisions for both tasks simultaneously.
Abstract: Spoofing attacks posed by generating artificial speech can severely degrade the performance of a speaker verification system. Recently, many anti-spoofing countermeasures have been proposed for detecting varying types of attacks from synthetic speech to replay presentations. While there are numerous effective defenses reported on standalone anti-spoofing solutions, the integration for speaker verification and spoofing detection systems has obvious benefits. In this paper, we propose a spoofing-robust automatic speaker verification (SR-ASV) system for diverse attacks based on a multi-task learning architecture. This deep learning based model is jointly trained with time-frequency representations from utterances to provide recognition decisions for both tasks simultaneously. Compared with other state-of-the-art systems on the ASVspoof 2017 and 2019 corpora, a substantial improvement of the combined system under different spoofing conditions can be obtained.

12 citations


Cites methods from "Factor analysis methods for joint s..."

  • ...In [13], a joint modeling approach was introduced to detect spoofing attacks while also performing the speaker verification task....

    [...]

References
Journal Article
TL;DR: Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.
Abstract: LIBSVM is a library for Support Vector Machines (SVMs). We have been actively developing this package since the year 2000. The goal is to help users to easily apply SVM to their applications. LIBSVM has gained wide popularity in machine learning and many other areas. In this article, we present all implementation details of LIBSVM. Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.

40,826 citations


"Factor analysis methods for joint s..." refers methods in this paper

  • ...The spoof detection task is achieved by training a support vector machine (SVM) classifier [12] while the speaker verification is achieved by probabilistic linear discriminant analysis (PLDA) scoring [13]....

    [...]

Journal Article
TL;DR: An extension of previous work that proposes a new speaker representation for speaker verification: a new low-dimensional speaker- and channel-dependent space is defined using a simple factor analysis and named the total variability space because it models both speaker and channel variabilities.
Abstract: This paper presents an extension of our previous work which proposes a new speaker representation for speaker verification. In this modeling, a new low-dimensional speaker- and channel-dependent space is defined using a simple factor analysis. This space is named the total variability space because it models both speaker and channel variabilities. Two speaker verification systems are proposed which use this new representation. The first system is a support vector machine-based system that uses the cosine kernel to estimate the similarity between the input data. The second system directly uses the cosine similarity as the final decision score. We tested three channel compensation techniques in the total variability space, which are within-class covariance normalization (WCCN), linear discriminant analysis (LDA), and nuisance attribute projection (NAP). We found that the best results are obtained when LDA is followed by WCCN. We achieved an equal error rate (EER) of 1.12% and MinDCF of 0.0094 using the cosine distance scoring on the male English trials of the core condition of the NIST 2008 Speaker Recognition Evaluation dataset. We also obtained 4% absolute EER improvement for both-gender trials on the 10 s-10 s condition compared to the classical joint factor analysis scoring.
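The second system in this abstract uses the cosine similarity between two total-variability vectors directly as the decision score. A minimal sketch of that scoring rule follows, assuming both vectors are plain lists of floats that have already been channel-compensated. The function names, the threshold value, and the example vectors are hypothetical and only illustrate the idea.

```python
import math


def cosine_score(w_target, w_test):
    """Cosine similarity between an enrolment vector and a test vector."""
    dot = sum(a * b for a, b in zip(w_target, w_test))
    norm_target = math.sqrt(sum(a * a for a in w_target))
    norm_test = math.sqrt(sum(b * b for b in w_test))
    return dot / (norm_target * norm_test)


def accept(w_target, w_test, threshold):
    """Accept the claimed identity when the score reaches the threshold.

    The threshold is an assumption here; in practice it would be tuned on
    development data.
    """
    return cosine_score(w_target, w_test) >= threshold


# Hypothetical low-dimensional vectors, for illustration only.
enrolled = [1.0, 2.0, 0.5]
same_direction = [2.0, 4.0, 1.0]
orthogonal = [2.0, -1.0, 0.0]
print(cosine_score(enrolled, same_direction))
print(accept(enrolled, orthogonal, threshold=0.5))
```

Because the score depends only on the angle between the vectors, scaling either vector leaves the decision unchanged.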

3,526 citations

Journal Article
TL;DR: This paper starts with the fundamentals of automatic speaker recognition, concerning feature extraction and speaker modeling, and elaborates on advanced computational techniques to address robustness and session variability.

1,433 citations


"Factor analysis methods for joint s..." refers methods in this paper

  • ...Separately, an automatic speaker verification (ASV) system is trained on human speech using the state-of-the-art approaches consisting of ivector with linear discriminant analysis (LDA) and length normalization [19] with probabilistic LDA (PLDA) scoring [20]....

    [...]

Proceedings Article
26 Dec 2007
TL;DR: This paper describes face data as resulting from a generative model which incorporates both within-individual and between-individual variation, and calculates the likelihood that the differences between face images are entirely due to within-individual variability.
Abstract: Many current face recognition algorithms perform badly when the lighting or pose of the probe and gallery images differ. In this paper we present a novel algorithm designed for these conditions. We describe face data as resulting from a generative model which incorporates both within-individual and between-individual variation. In recognition we calculate the likelihood that the differences between face images are entirely due to within-individual variability. We extend this to the non-linear case where an arbitrary face manifold can be described and noise is position-dependent. We also develop a "tied" version of the algorithm that allows explicit comparison across quite different viewing conditions. We demonstrate that our model produces state of the art results for (i) frontal face recognition (ii) face recognition under varying pose.

1,099 citations


"Factor analysis methods for joint s..." refers methods in this paper

  • ...The spoof detection task is achieved by training a support vector machine (SVM) classifier [12] while the speaker verification is achieved by probabilistic linear discriminant analysis (PLDA) scoring [13]....

    [...]

Proceedings Article
01 Jan 2011
TL;DR: The proposed approach deals with the non-Gaussian behavior of i-vectors by performing a simple length normalization, which allows the use of probabilistic models with Gaussian assumptions that yield equivalent performance to that of more complicated systems based on heavy-tailed assumptions.
Abstract: We present a method to boost the performance of probabilistic generative models that work with i-vector representations. The proposed approach deals with the non-Gaussian behavior of i-vectors by performing a simple length normalization. This non-linear transformation allows the use of probabilistic models with Gaussian assumptions that yield equivalent performance to that of more complicated systems based on heavy-tailed assumptions. Significant performance improvements are demonstrated on the telephone portion of NIST SRE 2010.
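Length normalization, as named in this abstract, can be read as scaling each i-vector to unit Euclidean norm. The sketch below shows only that scaling step, assuming the i-vector is a plain list of floats. Any centering or whitening applied beforehand is outside this sketch, and the function name is hypothetical.

```python
import math


def length_normalize(ivector):
    """Scale a vector to unit Euclidean length: w -> w / ||w||."""
    norm = math.sqrt(sum(x * x for x in ivector))
    if norm == 0.0:
        # A zero vector has no direction, so it cannot be normalized.
        raise ValueError("cannot length-normalize a zero vector")
    return [x / norm for x in ivector]


# Hypothetical 2-dimensional vector, for illustration only.
normalized = length_normalize([3.0, 4.0])
print(normalized)
print(math.sqrt(sum(x * x for x in normalized)))
```

After this step every vector lies on the unit sphere, which is the non-linear transformation the abstract credits with making Gaussian modelling assumptions workable.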

1,077 citations


"Factor analysis methods for joint s..." refers methods in this paper

  • ...Separately, an automatic speaker verification (ASV) system is trained on human speech using the state-of-the-art approaches consisting of ivector with linear discriminant analysis (LDA) and length normalization [19] with probabilistic LDA (PLDA) scoring [20]....

    [...]

  • ...A length normalization of the ivectors is also performed before the PLDA training [19]....

    [...]