Anti-spoofing for text-independent speaker verification: an initial database, comparison of countermeasures, and human performance
References
LIBSVM: A library for support vector machines
Face Description with Local Binary Patterns: Application to Face Recognition
Speaker Verification Using Adapted Gaussian Mixture Models
Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds
Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones
Frequently Asked Questions (19)
Q2. What future work is suggested in "Anti-spoofing for text-independent speaker verification: an initial database, comparison of countermeasures, and human performance"?
The authors suggest future work in ASV spoofing and countermeasures along several lines, including more diverse spoofing materials. It would also be interesting in the future to use either ‘super recognisers’ or forensic speech scientists as listeners, if the authors could access sufficient numbers of them. To detect the SS-MARY attack or similar waveform concatenation attacks, the authors suggest further development of pitch pattern-based countermeasures.
Q3. What frame length is assumed when short-time Fourier analysis is applied to transform the signal?
Given a speech signal x(n), short-time Fourier analysis can be applied to transform the signal from the time domain to the frequency domain by assuming the signal is quasi-stationary within a short time frame, e.g., 25 ms.
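The framing idea above can be illustrated with a minimal sketch. Only the 25 ms frame length comes from the text; the 8 kHz sample rate, 10 ms hop, Hamming window, and all function names are assumptions chosen for illustration, and a direct DFT is used instead of an FFT to keep the example dependency-free.

```python
import cmath
import math


def frame_signal(x, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Cut x into overlapping frames (hop_ms is an assumed value)."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop = int(round(sample_rate * hop_ms / 1000.0))
    return [x[s:s + frame_len] for s in range(0, len(x) - frame_len + 1, hop)]


def hamming(n):
    """Symmetric Hamming window (an assumed window choice)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def dft(frame):
    """Direct DFT of a real frame, keeping the non-negative frequency bins."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n // 2 + 1)
    ]


def stft(x, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Short-time Fourier analysis: window each frame, then transform it."""
    frames = frame_signal(x, sample_rate, frame_ms, hop_ms)
    if not frames:
        return []
    window = hamming(len(frames[0]))
    return [dft([s * w for s, w in zip(f, window)]) for f in frames]


# Toy input: 100 ms of a 1 kHz tone sampled at an assumed 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(800)]
spectrum = stft(tone, sample_rate)
magnitudes = [abs(c) for c in spectrum[0]]
peak_bin = magnitudes.index(max(magnitudes))
```

With these assumed settings each frame holds 200 samples, so adjacent bins are 40 Hz apart and the tone's energy concentrates in a single bin of every frame.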
Q4. How are modulation-spectrum features extracted from each MGD trajectory?
In practice, the authors used a 1024-point FFT to extract the modulation spectrum for each MGD trajectory, then applied a DCT to the modulation spectrum, and after that kept the first 32 coefficients as features.
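A rough sketch of this three-step pipeline (1024-point FFT, DCT, keep 32 coefficients) might look as follows. The source does not specify zero-padding, whether the magnitude or log-magnitude is used, which half of the spectrum is kept, or the DCT variant; the choices below (zero-padding, magnitude of the non-negative bins, unnormalised DCT-II) and the function names are assumptions.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def dct_ii(values, n_keep):
    """First n_keep coefficients of an unnormalised DCT-II."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
        for k in range(n_keep)
    ]


def modulation_features(trajectory, n_fft=1024, n_keep=32):
    """FFT of one feature trajectory, then DCT, then truncation."""
    # Assumption: truncate or zero-pad the trajectory to n_fft points.
    padded = list(trajectory[:n_fft]) + [0.0] * max(0, n_fft - len(trajectory))
    spectrum = fft([complex(v) for v in padded])
    # Assumption: use the magnitude of the non-negative frequency bins.
    magnitude = [abs(c) for c in spectrum[: n_fft // 2 + 1]]
    return dct_ii(magnitude, n_keep)


# Toy trajectory standing in for one MGD coefficient over 300 frames.
toy_trajectory = [math.sin(0.1 * t) for t in range(300)]
features = modulation_features(toy_trajectory)
silent_features = modulation_features([0.0] * 300)
```

Each trajectory yields a fixed-length vector regardless of how many frames the utterance has, which is what makes the representation convenient as classifier input.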
Q5. How is phase information handled in speech synthesis and voice conversion systems?
Even though phase information is important in human speech perception [67], most speech synthesis and voice conversion systems use a simplified, minimum phase model which may introduce artefacts into the phase spectrum.
Q6. What are the main issues that should be taken into consideration in the proposed countermeasures?
To make the proposed countermeasures appropriate for practical applications, it would of course be important to take channel and noise issues into consideration.
Q7. What was the input to each voice conversion system?
During the execution of spoofing attacks, the transcript of an impostor trial was used as the textual input to each speech synthesis system, and the speech signal of the impostor trial was the input to each voice conversion system.
Q8. How does the amount of available spoofing material affect the security of ASV systems?
As the amount of spoofing materials increases, ASV systems can access more representative prior information about spoofing, and the security of ASV systems should be enhanced as a result.
Q9. What is needed to make systems suitable for other voice authentication applications?
To make systems suitable for other voice authentication applications, spoofing countermeasures for text-dependent ASV must also be developed.
Q10. What was the main barrier to progress in this area?
As discussed in [8], the lack of a large-scale, standardised dataset and protocol was a fundamental barrier to progress in this area.
Q11. What is the reason why replay attack was not considered in this study?
Replay attack, which does not require any speech processing knowledge on the part of the attacker, was not considered here.
Q12. How well does the pitch pattern countermeasure perform?
The pitch pattern countermeasure detects synthetic speech well, but does not detect some voice conversion speech such as that from VC-C1, VC-FEST, VC-KPLS and VC-LSP.
Q13. How many samples were selected from the evaluation set?
In Task 3, there were 130 samples (65 human, 65 artificial (13 × 5)), and those samples were randomly selected from the evaluation set for each listener.
Q14. What did the authors use to extract acoustic features?
All systems used the same front-end to extract acoustic features: 19-dimensional Mel-Frequency Cepstral Coefficients (MFCCs) plus log-energy with delta and delta-delta coefficients.
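To illustrate how dynamic coefficients extend a static feature vector, here is a minimal sketch. It assumes the common regression-based delta formula with a half-window of 2 frames and edge replication; neither detail is given in the text. Under the reading that deltas and delta-deltas are computed over all 20 static values (19 MFCCs plus log-energy), each frame would end up with (19 + 1) × 3 = 60 values, a total the text itself does not state.

```python
def delta(frames, half_window=2):
    """Regression-based delta over time (half_window is an assumed value)."""
    num_frames = len(frames)
    denom = 2 * sum(n * n for n in range(1, half_window + 1))
    out = []
    for t in range(num_frames):
        row = []
        for d in range(len(frames[t])):
            acc = 0.0
            for n in range(1, half_window + 1):
                # Assumption: replicate the first/last frame at the edges.
                ahead = frames[min(t + n, num_frames - 1)][d]
                behind = frames[max(t - n, 0)][d]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out


def add_dynamics(static_frames):
    """Append delta and delta-delta coefficients to each static frame."""
    d1 = delta(static_frames)
    d2 = delta(d1)
    return [s + a + b for s, a, b in zip(static_frames, d1, d2)]


# Toy static features: 10 frames of 20 values that grow by 1 per frame.
static = [[float(t)] * 20 for t in range(10)]
features = add_dynamics(static)
```

In the toy input every coefficient rises linearly, so away from the edges the delta equals the slope and the delta-delta vanishes.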
Q15. What is the effectiveness of the phase-based features?
It remains unknown whether the phase-based features are also effective in detecting attacks from speech synthesisers using unknown vocoders.
Q16. What is the performance of the MGD-based countermeasure?
In respect of the frame-based features, the MGD-based countermeasure achieves the best overall performance in terms of low FARs and works well at detecting most types of spoofed speech with the notable exception of the SS-MARY attack.
Q17. How many spoofing systems were used during development?
Only five of the available spoofing systems were used during development, with all thirteen spoofing systems (Table I) being run on the evaluation set.
Q18. How many attacks were implemented on the training set?
In the countermeasure evaluation protocol, the authors used a further 25 speakers’ voices as training data and only implemented five attacks (as known attacks) on the training set.
Q19. How can the proposed fused countermeasure be combined with an ASV system?
The results indicate that the fused countermeasure can be effectively integrated with any ASV system without needing additional joint optimisation.
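One way to picture integration without joint optimisation is a simple cascade in which the countermeasure and the ASV system keep independent thresholds. This is only an illustrative sketch: the cascade rule, the function names, the thresholds, and the toy scores below are all hypothetical and are not taken from the paper.

```python
def cascade_accept(cm_score, asv_score, cm_threshold, asv_threshold):
    """Accept only if the trial passes the countermeasure AND the ASV check."""
    return cm_score >= cm_threshold and asv_score >= asv_threshold


def false_acceptance_rate(spoofed_trials, cm_threshold, asv_threshold):
    """Fraction of spoofed trials that the cascade wrongly accepts."""
    accepted = sum(
        cascade_accept(cm, asv, cm_threshold, asv_threshold)
        for cm, asv in spoofed_trials
    )
    return accepted / len(spoofed_trials)


# Made-up (countermeasure score, ASV score) pairs for four spoofed trials.
spoofed_trials = [(0.1, 0.9), (0.2, 0.8), (0.7, 0.95), (0.05, 0.3)]

# ASV alone: the countermeasure threshold is disabled.
far_asv_only = false_acceptance_rate(spoofed_trials, float("-inf"), 0.5)
# Cascade: both thresholds are set independently of each other.
far_cascade = false_acceptance_rate(spoofed_trials, 0.5, 0.5)
```

Because each threshold is tuned on its own subsystem, the countermeasure can be swapped or retrained without touching the ASV back-end.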