
Anti-Spoofing for Text-Independent Speaker
Verification: An Initial Database, Comparison of
Countermeasures, and Human Performance
Zhizheng Wu, Phillip L. De Leon, Senior Member, IEEE, Cenk Demiroglu, Ali Khodabakhsh,
Simon King, Fellow, IEEE, Zhen-Hua Ling, Daisuke Saito, Bryan Stewart, Tomoki Toda,
Mirjam Wester, and Junichi Yamagishi, Senior Member, IEEE
Abstract—In this paper, we present a systematic study of the
vulnerability of automatic speaker verification to a diverse range
of spoofing attacks. We start with a thorough analysis of the
spoofing effects of five speech synthesis and eight voice conversion
systems, and the vulnerability of three speaker verification
systems under those attacks. We then introduce a number of
countermeasures to prevent spoofing attacks from both known
and unknown attackers. Known attackers are spoofing systems
whose output was used to train the countermeasures, whilst an
unknown attacker is a spoofing system whose output was not
available to the countermeasures during training. Finally, we
benchmark automatic systems against human performance on
both speaker verification and spoofing detection tasks.
Index Terms—Speaker verification, speech synthesis, voice con-
version, spoofing attack, anti-spoofing, countermeasure, security
I. INTRODUCTION
The task of automatic speaker verification (ASV), some-
times described as a type of voice biometrics, is to accept
or reject a claimed identity based on a speech sample.
There are two types of ASV system: text-dependent and text-
independent. Text-dependent ASV assumes constrained word
content and is normally used in authentication applications
because it can deliver the high accuracy required. However,
text-independent ASV does not place constraints on word
content, and is normally used in surveillance applications. For
This work was partially supported by EPSRC under Programme Grant EP/I031022/1 (Natural Speech Technology) and EP/J002526/1 (CAF) and by TUBITAK 1001 grant No. 112E160. This article is an expanded version of [1], [2].
Z. Wu is the corresponding author; the remaining authors are listed in alphabetical order to indicate equal contributions.
Z. Wu, S. King, M. Wester and J. Yamagishi are with the Centre for Speech
Technology Research, University of Edinburgh, UK. e-mail: {zhizheng.wu,
simon.king}@ed.ac.uk, {mwester, jyamagis}@inf.ed.ac.uk
P. L. De Leon and B. Stewart are with the Klipsch School of Electrical and
Computer Engineering, New Mexico State University (NMSU), Las Cruces
NM 88003 USA. e-mail: {pdeleon, brystewa}@nmsu.edu
A. Khodabakhsh and C. Demiroglu are with Ozyegin University, Turkey.
e-mail: alikhodabakhsh@gmail.com, cenk.demiroglu@ozyegin.edu.tr
Z.-H. Ling is with the University of Science and Technology of China, China. e-mail: zhling@ustc.edu
D. Saito is with the University of Tokyo, Japan. e-mail: dsk saito@gavo.t.u-tokyo.ac.jp
T. Toda is with the Information Technology Center, Nagoya University, Japan. e-mail: tomoki@icts.nagoya-u.ac.jp
example, in call-center applications (see, e.g., http://www.nuance.com/for-business/customer-service-solutions/voice-biometrics/freespeech/index.htm and https://youtu.be/kyPTGoDyd o), a caller's identity can be verified during the course of a natural conversation without forcing the caller to speak a specific passphrase. Moreover, as such a verification process usually takes place under remote scenarios without any face-to-face contact, a spoofing attack (an attempt to manipulate a verification result by mimicking a target speaker's voice in person or by using computer-based techniques such as voice conversion or speech synthesis) is a fundamental concern. Hence, in this work, we focus on spoofing and anti-spoofing for text-independent ASV.
Due to a number of technical advances, notably channel
and noise compensation techniques, ASV systems are being
widely adopted in security applications [3], [4], [5], [6], [7].
A major concern, however, when deploying an ASV system,
is its resilience to a spoofing attack. As identified in [8], there
are at least four types of spoofing attack: impersonation [9],
[10], [11], replay [12], [13], [14], speech synthesis [15], [16]
and voice conversion [17], [18], [19], [20], [21]. Among the
four types of spoofing attack, replay, speech synthesis, and
voice conversion present the highest risk to ASV systems [8].
Although replay might be the most common spoofing tech-
nique which presents a risk to both text-dependent and text-
independent ASV systems [12], [13], [14], it is not viable
for the generation of utterances of specific content, such as
would be required to maintain a live conversation in a call-
center application. On the other hand, open-source software for state-of-the-art speech synthesis and voice conversion is readily available (e.g., Festival, http://www.cstr.ed.ac.uk/projects/festival/, and Festvox, http://festvox.org/), making these two approaches perhaps the most accessible and effective means to carry out spoofing attacks, and therefore presenting a serious risk to deployed ASV systems [8]. For that reason, the focus in this work is only on these two types of spoofing attack.
A. Speech Synthesis and Voice Conversion Spoofing
Many studies have reported and analysed the vulnerability
of ASV systems to speech synthesis and voice conversion
spoofing. The potential vulnerability of ASV to synthetic
speech was first evaluated in [22], [23], where an HMM-based speech synthesis system was used to spoof an HMM-based, text-prompted ASV system; the false acceptance rate (FAR) was reported to increase from 0% to over 70% under a speech synthesis spoofing attack. In [15], [16], the vulnerability of two ASV systems, a GMM-UBM system (Gaussian mixture models with a universal background model) and an SVM system (support vector machine using a GMM supervector), was assessed using a speaker-adaptive, HMM-based speech synthesizer. Experiments using the Wall Street Journal (WSJ) corpus (283 speakers) [24] showed that FARs increased from 0.28% and 0.00% to 86% and 81% for the GMM-UBM and SVM systems, respectively. These studies confirm
the vulnerability of ASV systems to speech synthesis spoofing
attack.
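In these studies, vulnerability is quantified by the FAR. As a concrete illustration, a minimal sketch of how FAR (and its counterpart, the false rejection rate) is computed from trial scores follows; the score distributions are synthetic stand-ins of our own, not data from any of the cited systems.

```python
import numpy as np

def far(impostor_scores, threshold):
    """False acceptance rate: fraction of impostor (or spoofed) trials
    whose verification score exceeds the decision threshold."""
    return float(np.mean(np.asarray(impostor_scores) > threshold))

def frr(genuine_scores, threshold):
    """False rejection rate: fraction of genuine trials at or below it."""
    return float(np.mean(np.asarray(genuine_scores) <= threshold))

# Synthetic illustration: a threshold tuned against zero-effort impostors
# can pass most spoofed trials, mirroring the FAR jumps reported above.
rng = np.random.default_rng(0)
genuine = rng.normal(2.0, 1.0, 1000)      # hypothetical ASV scores
zero_effort = rng.normal(-2.0, 1.0, 1000)
spoofed = rng.normal(1.5, 1.0, 1000)      # spoofing shifts scores upward

threshold = 0.0
print(f"zero-effort FAR: {far(zero_effort, threshold):.3f}")
print(f"spoofed FAR:     {far(spoofed, threshold):.3f}")
print(f"FRR:             {frr(genuine, threshold):.3f}")
```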
Voice conversion as a spoofing method has also been
attracting increasing attention. The potential risk of voice
conversion to a GMM ASV system was evaluated for the first
time in [25], which used the YOHO database (138 speakers).
In [26], [27], [17], text-independent GMM-UBM systems were
assessed when faced with voice conversion spoofing on NIST
speaker recognition evaluation (SRE) datasets. These studies
showed an increase in FAR from around 10% to over 40% and
confirmed the vulnerability of GMM-UBM systems to voice
conversion spoofing attack.
Recent studies [18], [19] have evaluated more advanced
ASV systems based on joint factor analysis (JFA), i-vectors,
and probabilistic linear discriminant analysis (PLDA), on
the NIST SRE 2006 database. The FARs of these systems
increased five-fold from about 3% to over 17% under attacks
from voice conversion spoofing.
B. Spoofing Countermeasures
The vulnerability of ASV systems to spoofing attacks has
led to the development of anti-spoofing techniques, often
referred to as countermeasures. In [28], a synthetic speech
detector based on the average inter-frame difference (AIFD)
was proposed to discriminate between natural and synthetic
speech. This countermeasure works well if the dynamic vari-
ation of the synthetic speech is different from that of natural
speech; however, if global variance compensation is applied
to the synthetic speech, the countermeasure becomes less
effective [15].
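The intuition can be sketched as a mean frame-to-frame distance over a feature trajectory; this is our paraphrase of the AIFD idea, not necessarily the exact definition used in [28].

```python
import numpy as np

def average_inter_frame_difference(features):
    """AIFD-style statistic for a (num_frames, dim) feature matrix:
    mean Euclidean distance between consecutive frames. Statistically
    averaged (over-smoothed) synthetic speech tends to score lower than
    natural speech, unless global variance compensation is applied."""
    diffs = np.diff(features, axis=0)  # (num_frames - 1, dim)
    return float(np.mean(np.linalg.norm(diffs, axis=1)))
```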
In [29], [30], a synthetic speech detector based on image
analysis of pitch-patterns was proposed for human versus syn-
thetic speech discrimination. This countermeasure was based
on the observation that there can be artefacts in the pitch
contours generated by HMM-based speech synthesis. Experi-
ments showed that features extracted from pitch-patterns can
be used to significantly reduce the FAR for synthetic speech.
The performance of the pitch-pattern countermeasure was not
evaluated for detecting voice conversion spoofing.
In [31], a temporal modulation feature was proposed to
detect synthetic speech generated by copy-synthesis. The
modulation feature captures the long-term temporal distor-
tion caused by independent frame-by-frame operations in
speech synthesis. Experiments conducted on the WSJ database
showed the effectiveness of the modulation feature when
integrated with frame-based features. However, whether the
detector is effective across a variety of speech synthesis and
voice conversion spoofing attacks is unknown. Also using
spectro-temporal information, a feature derived from local
binary patterns [32] was employed to detect voice conversion
and speech synthesis attacks in [33], [34].
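A minimal sketch of an utterance-level modulation feature of this kind is shown below. The configuration (a 1024-point FFT over each frame-level feature trajectory, followed by a DCT keeping the first 32 coefficients) follows what the paper reports for its own modulation features; the remaining packaging is our assumption.

```python
import numpy as np
from scipy.fft import dct

def modulation_feature(trajectory, n_fft=1024, n_keep=32):
    """Modulation feature for one frame-level feature trajectory:
    magnitude spectrum of the trajectory (the modulation spectrum),
    log-compressed and decorrelated with a DCT. Independent
    frame-by-frame synthesis distorts the long-term temporal
    structure that this feature captures."""
    traj = np.asarray(trajectory, dtype=float)
    traj = traj - traj.mean()                      # remove DC offset
    mod_spec = np.abs(np.fft.rfft(traj, n=n_fft))  # modulation spectrum
    return dct(np.log(mod_spec + 1e-8), norm='ortho')[:n_keep]
```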
Phase- and modified group delay-based features have also
been proposed to detect voice conversion spoofing [35]. A
cosine-normalised phase feature was derived from the phase
spectrogram while the modified group delay feature contained
both magnitude and phase information. Evaluation on the
NIST SRE 2006 data confirmed the effectiveness of the
proposed features. However, it remains unknown whether the
phase-based features are also effective in detecting attacks
from speech synthesisers using unknown vocoders. Another
phase-based feature called the relative phase shift was pro-
posed in [16], [36], [37] to detect speech synthesis spoof-
ing, and was reported to achieve promising performance for
vocoders using minimum phase rather than natural phase.
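For reference, the modified group delay function underlying these features can be sketched per frame as follows. The exponents rho and gamma are typical values from the group-delay literature, and the smoothed spectrum is crudely approximated here, so the exact settings in [35] may differ.

```python
import numpy as np

def modified_group_delay(frame, rho=0.4, gamma=0.9, eps=1e-8):
    """Modified group delay for one windowed frame x(n):
        tau(w) = (X_R(w) Y_R(w) + X_I(w) Y_I(w)) / |S(w)|^(2*gamma)
        MGD(w) = sign(tau(w)) * |tau(w)|^rho
    with X = FFT(x), Y = FFT(n * x(n)), and S a smoothed magnitude
    spectrum (approximated by |X| here; cepstral smoothing is usual)."""
    n = np.arange(len(frame))
    X = np.fft.rfft(frame)
    Y = np.fft.rfft(n * frame)
    S = np.abs(X) + eps
    tau = (X.real * Y.real + X.imag * Y.imag) / S ** (2 * gamma)
    return np.sign(tau) * np.abs(tau) ** rho
```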
In [38], an average pair-wise distance (PWD) between
consecutive feature vectors was employed to detect voice-
converted speech, on the basis that the PWD feature is
able to capture short-term variabilities, which might be lost
during statistical averaging when generating converted speech.
Although the PWD was shown to be effective against attacks
from their own voice conversion system, this technique (which
is similar to the AIFD feature proposed in [28]) might not be
an effective countermeasure against systems that apply global
variance enhancement.
In contrast to the above methods focusing on discriminative
features, a probabilistic approach was proposed in [39], [40].
This approach uses the same front-end as ASV, but treats
the synthetic speech as a signal passed through a synthesis
filter. Experiments on the NIST SRE 2006 database showed
comparable performance to feature-based countermeasures. In
this work, we focus on feature-based anti-spoofing techniques,
as they can be optimised independently without rebuilding the
ASV systems.
C. Motivations and Contributions of this Work
In the literature, each study assumes a particular spoofing
type (speech synthesis or voice conversion) and often just one
variant (algorithm) of that type, then designs and evaluates
a countermeasure for that specific, known attack. However,
in practice it may not be possible to know the exact type
of spoofing attack and therefore evaluations of ASV systems
and countermeasures under a broad set of spoofing types are
desirable. Most, if not all, previous studies have been unable to
conduct a broader evaluation because of the lack of a standard,
publicly-available spoofing database that contains a variety of
spoofing attacks. To address this issue, we have previously
developed a spoofing and anti-spoofing (SAS) database in-
cluding both speech synthesis and voice conversion spoofing
attacks [1]. This database includes spoofing speech from two
different speech synthesis systems and seven different voice
conversion systems.

Now, we first broaden the SAS database by including four more variants: three text-to-speech (TTS) synthesisers and one voice conversion system. They will be referred to as SS-SMALL-48, SS-LARGE-48, SS-MARY and VC-LSP, and are described in Section II-A. These four systems are new in this article, while the other systems were published in a conference paper [1]; SS-SMALL-48 and SS-LARGE-48 allow us to analyse the effect of the sampling rate of spoofing materials, and SS-MARY is useful for understanding the effect of waveform concatenation-based speech synthesis spoofing.
We also develop a joint speaker verification and countermeasure evaluation protocol, then refine that evaluation protocol to enable better generalisability of countermeasures developed using the database. We include contributions from both the speech synthesis and speaker verification communities. This database is offered as a resource for researchers investigating generalised spoofing and anti-spoofing methods; based on it, a spoofing and countermeasure challenge [41], [42] has already been successfully organised as a special session of INTERSPEECH 2015. We hope that the availability of a standard database will contribute to reproducible research. The SAS corpus is publicly available at http://dx.doi.org/10.7488/ds/252.
Second, with the SAS database, we conduct a comprehen-
sive analysis of spoofing attacks on six different ASV systems.
From this analysis we are able to determine which spoofing
type and variant currently poses the greatest threat and how
best to counter this threat. To the best of our knowledge, this
study is the first evaluation of the vulnerability of ASV using
such a diverse range of spoofing attacks and the most thorough
analysis of the spoofing effects of speech synthesis and voice
conversion spoofing systems under the same protocol.
Third, we present a comparison of several anti-spoofing
countermeasures to discriminate between human and artificial
speech. In our previous work, we applied cosine-normalised
phase [35], modified group delay [35] and segment-based
modulation features [31] to detect voice converted speech,
and applied pitch pattern based features to detect synthetic
speech [29], [30]. In this work, we evaluate these countermea-
sures against both spoofing types and propose to fuse decisions
at the score level in order to leverage multiple, complementary
sources of information to create stronger countermeasures.
We also extend the segment-based modulation feature to an
utterance-level feature, to account for long-term variations.
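A minimal sketch of the score-level fusion idea follows, assuming z-normalisation followed by a weighted sum; the fusion back-end used in the experiments may differ, and in practice the normalisation statistics would be estimated on development data rather than on the scores being fused.

```python
import numpy as np

def fuse_scores(score_lists, weights=None):
    """Fuse per-trial scores from several countermeasures:
    z-normalise each detector's scores, then combine them with a
    weighted sum (equal weights by default)."""
    score_lists = [np.asarray(s, dtype=float) for s in score_lists]
    if weights is None:
        weights = [1.0 / len(score_lists)] * len(score_lists)
    fused = np.zeros_like(score_lists[0])
    for w, s in zip(weights, score_lists):
        fused += w * (s - s.mean()) / (s.std() + 1e-8)
    return fused
```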
Finally, we perform listening tests to evaluate the ability of human listeners to discriminate between human and artificial speech. (A preliminary version was published at INTERSPEECH 2015 [2], where we focused on human and automatic spoofing detection performance on wideband and narrowband data; the current work benchmarks automatic systems against human performance on both speaker verification and spoofing detection tasks.) Although the vulnerability of ASV systems in the face of spoofing attacks is known, some questions still remain unanswered. These include whether human perceptual ability is important in identifying spoofing and whether humans can achieve better performance than automatic approaches in detecting spoofing attacks. In this work, we attempt to answer these questions through a series of carefully-designed listening tests. In contrast to the human assisted speaker recognition (HASR) evaluation [43], we consider spoofing attacks in
speaker verification and conduct listening tests for spoofing
detection, which was not considered in the HASR evaluation.
II. DATABASE AND PROTOCOL
We extended our SAS database [1] by including additional artificial speech. The database is built from the freely available Voice Cloning Toolkit (VCTK) database of native speakers of British English (http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html). The VCTK database was recorded in a hemi-anechoic chamber using an omni-directional head-mounted microphone (DPA 4035) at a sampling rate of 96 kHz. The sentences are selected from newspapers, and the average duration of each sentence is about 2 seconds.
To design the spoofing database, we took speech data
from VCTK comprising 45 male and 61 female speakers and
divided each speaker's data into five parts:
A: 24 parallel utterances (i.e., same sentences for all
speakers) per speaker: training data for spoofing
systems.
B: 20 non-parallel utterances per speaker: additional
training for spoofing systems.
C: 50 non-parallel utterances per speaker: enrolment
data for client model training in speaker verification,
or training data for speaker-independent countermea-
sures.
D: 100 non-parallel utterances per speaker: development
set for speaker verification and countermeasures.
E: Around 200 non-parallel utterances per speaker: eval-
uation set for speaker verification and countermea-
sures.
In Parts B to E, sentences were randomly selected from
newspapers without any repeating sentence across speakers.
In Parts A and B, we have two versions, downsampled to 48
kHz and 16 kHz respectively, while in Parts C, D and E all
signals are downsampled to 16 kHz. Parts A and B allow us
to analyse the effects of sampling rate on spoofing attacks. For
training the spoofing systems, we designed two training sets.
The small set consists of data only from Part A, while the
large set comprises the data from Parts A and B together.
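For reference, the split can be summarised programmatically as below; the dictionary layout and field names are ours, not an official protocol format.

```python
# Per-speaker data split of the SAS database (counts are utterances
# per speaker; Part E is approximately 200). Parts A and B exist at
# both 48 kHz and 16 kHz; Parts C-E are 16 kHz only.
SAS_SPLITS = {
    "A": dict(utts=24,  parallel=True,  rates_khz=(48, 16),
              use="spoofing-system training (small set)"),
    "B": dict(utts=20,  parallel=False, rates_khz=(48, 16),
              use="extra spoofing training (large set = A + B)"),
    "C": dict(utts=50,  parallel=False, rates_khz=(16,),
              use="ASV enrolment / countermeasure training"),
    "D": dict(utts=100, parallel=False, rates_khz=(16,),
              use="development set for ASV and countermeasures"),
    "E": dict(utts=200, parallel=False, rates_khz=(16,),
              use="evaluation set for ASV and countermeasures"),
}
```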
A. Spoofing Systems
We implemented five speech synthesis (SS) and eight voice
conversion (VC) spoofing systems, as summarised in Table I.
These systems were built using both open-source software (to
facilitate reproducible research) as well as our own state-of-
the-art systems (to provide comprehensive results):
NONE: This is a baseline zero-effort impostor trial in which
the impostor’s own speech is used directly with no attempt to
match the target speaker.
SS-LARGE-16: An HMM-based TTS system built with the statistical parametric speech synthesis framework described in [44]. For speech analysis, the STRAIGHT vocoder with mixed excitation is used, which results in 60-dimensional Bark-cepstral coefficients, log F0 and 25-dimensional band-limited aperiodicity measures [45], [46].

TABLE I
SUMMARY OF THE SPOOFING SYSTEMS USED IN THIS PAPER. MGC, BAP AND F0 DENOTE MEL-GENERALISED CEPSTRAL COEFFICIENTS, BAND APERIODICITY AND FUNDAMENTAL FREQUENCY, RESPECTIVELY.

System      | Algorithm          | Sampling rate | # training utterances | Vocoder  | Features     | Background data required? | Known or unknown? | Open-source toolkit?
SS-LARGE-16 | HMM TTS            | 16k | 40 | STRAIGHT | MGC, BAP, F0 | Yes | Known   | Yes
SS-LARGE-48 | HMM TTS            | 48k | 40 | STRAIGHT | MGC, BAP, F0 | Yes | Unknown | Yes
SS-SMALL-16 | HMM TTS            | 16k | 24 | STRAIGHT | MGC, BAP, F0 | Yes | Known   | Yes
SS-SMALL-48 | HMM TTS            | 48k | 24 | STRAIGHT | MGC, BAP, F0 | Yes | Unknown | Yes
SS-MARY     | Unit selection TTS | 16k | 40 | None     | Waveform     | No  | Unknown | Yes
VC-C1       | C1 VC              | 16k | 24 | STRAIGHT | MGC, BAP, F0 | No  | Known   | No
VC-EVC      | Eigenvoice VC      | 16k | 24 | STRAIGHT | MGC, BAP, F0 | Yes | Unknown | No
VC-FEST     | GMM VC             | 16k | 24 | MLSA     | MGC, F0      | No  | Known   | Yes
VC-FS       | Frame selection VC | 16k | 24 | STRAIGHT | MGC, BAP, F0 | No  | Known   | No
VC-GMM      | GMM VC             | 16k | 24 | STRAIGHT | MGC, BAP, F0 | No  | Unknown | No
VC-KPLS     | KPLS VC            | 16k | 24 | STRAIGHT | MGC, BAP, F0 | No  | Unknown | No
VC-LSP      | GMM VC             | 16k | 24 | STRAIGHT | LSP, F0      | No  | Unknown | No
VC-TVC      | Tensor VC          | 16k | 24 | STRAIGHT | MGC, BAP, F0 | Yes | Unknown | No

Speech data from 257 (115 male and 142 female) native speakers of British English is used to train the average voice model. In the speaker
adaptation phase, the average voice model is transformed
using structural variational Bayesian linear regression [47]
followed by maximum a posteriori (MAP) adaptation, using
the target speaker’s data from Parts A and B. To synthesise
speech, acoustic feature parameters are generated from the
adapted HMMs using a parameter generation algorithm that
considers global variance (GV) [48]. An excitation signal
is generated using mixed excitation and pitch-synchronous
overlap and add [49], and used to excite a Mel-logarithmic
spectrum approximation (MLSA) filter [50] corresponding to
the STRAIGHT Bark cepstrum, to create the final synthetic
speech waveform.
SS-LARGE-48: Same as SS-LARGE-16, except that 48
kHz sample rate waveforms are used for adaptation. The use
of 48 kHz data is motivated by findings in speech synthesis
that speaker similarity can be improved significantly by using
data at a higher sampling rate [51].
SS-SMALL-16: Same as SS-LARGE-16, except that only
Part A of the target speaker data is used for adaptation.
SS-SMALL-48: Same as SS-SMALL-16, except that 48
kHz sample rate waveforms are used to adapt the average
voice.
SS-MARY: Based on the Mary-TTS unit selection synthesis system [52] (http://mary.dfki.de/). Waveform concatenation operates on diphone
sis system [52]. Waveform concatenation operates on diphone
units. Candidate units for each position in the utterance are
found using decision trees that query the linguistic features of
the target diphone. A preselection algorithm is used to prune
candidates that do not fit the context well. The total cost sums linguistic (target) and acoustic (join) costs. Candidate
diphone and target diphone labels and their contexts are used
to compute the linguistic sub-cost. Pitch and duration are used
for the join cost. Dynamic programming is used to find the
sequence of units with the minimum total target plus join
cost. Concatenation takes place in the waveform domain, using
pitch-synchronous overlap-add at unit boundaries.
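The search just described is a standard Viterbi search over the candidate lattice. A generic sketch of that dynamic programme follows; it is not Mary-TTS's actual implementation, and the cost functions are left abstract.

```python
import numpy as np

def select_units(target_costs, join_cost):
    """Dynamic-programming unit selection: target_costs[t] is an array
    of target (linguistic) costs for the candidates at position t, and
    join_cost(t, i, j) returns the acoustic join cost between candidate
    i at position t-1 and candidate j at position t. Returns the
    candidate indices minimising the total target-plus-join cost."""
    T = len(target_costs)
    best = [np.asarray(target_costs[0], dtype=float)]
    back = []
    for t in range(1, T):
        cur = np.asarray(target_costs[t], dtype=float)
        prev = best[-1]
        # total[i, j]: best path ending in candidate i, then moving to j
        joins = np.array([[join_cost(t, i, j) for j in range(len(cur))]
                          for i in range(len(prev))])
        total = prev[:, None] + joins + cur[None, :]
        back.append(np.argmin(total, axis=0))
        best.append(np.min(total, axis=0))
    # backtrack the cheapest unit sequence
    path = [int(np.argmin(best[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t - 1][path[-1]]))
    return path[::-1]
```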
VC-C1: The simplest voice conversion method, which modifies the spectral slope simply by shifting the first Mel-generalised cepstral (MGC) coefficient [53]. No other speaker-specific features are changed. The STRAIGHT vocoder is used to extract MGCs, band aperiodicities (BAPs) and F0.
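Because VC-C1 touches a single coefficient, it is easy to sketch. The shift value, and the reading of "first MGC" as c1 (suggested by the system's name and the spectral-slope interpretation), are our assumptions; [53] gives the exact recipe.

```python
import numpy as np

def vc_c1_convert(mgc, shift):
    """VC-C1-style conversion sketch for a (num_frames, dim) MGC
    matrix: add a constant shift to c1, tilting the spectral slope
    toward the target speaker. BAPs and F0 pass through unchanged.
    The shift would be estimated from target-minus-source training
    statistics (our assumption)."""
    converted = np.array(mgc, dtype=float, copy=True)
    converted[:, 1] += shift   # c1 controls the overall spectral slope
    return converted
```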
VC-EVC: A many-to-many eigenvoice conversion (EVC) system [54]. The eigenvoice GMM (EV-GMM) is constructed from the training data of one pivot speaker in the ATR Japanese speech database [55], and 273 speakers (137 male, 136 female) from the JNAS database (http://www.milab.is.tsukuba.ac.jp/jnas/instruct.html). Settings are the same as in [56]. The 272-dimensional weight vectors are estimated using Part A of the training data. STRAIGHT is used to extract 24-dimensional MGCs, 5 BAPs, and F0. The conversion function is applied only to the MGCs.
VC-FEST: The voice conversion toolkit provided by the
open-source Festvox system. It is based on the algorithm
proposed in [57], which is a joint density Gaussian mix-
ture model with maximum likelihood parameter generation
considering global variance. It is trained on the Part A set
of parallel training data, keeping the default settings of the
toolkit, except that the number of Gaussian components in the
mixture distributions is set to 32.
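The frame-wise core of such a joint-density GMM mapping is the conditional expectation E[y|x]. The sketch below implements only that core, and omits the maximum-likelihood parameter generation and global variance steps that the full algorithm [57] applies on top.

```python
import numpy as np

def jdgmm_convert_frame(x, weights, means, covs):
    """Convert one source frame x with a joint GMM over z = [x; y],
    where means[m] = [mu_x; mu_y] and covs[m] = [[Sxx, Sxy], [Syx, Syy]].
    Returns E[y | x] = sum_m P(m|x) (mu_y + Syx Sxx^{-1} (x - mu_x))."""
    d = len(x)
    post = np.zeros(len(weights))
    cond_means = []
    for m, (w, mu, S) in enumerate(zip(weights, means, covs)):
        mu_x, mu_y = mu[:d], mu[d:]
        Sxx, Syx = S[:d, :d], S[d:, :d]
        diff = x - mu_x
        sol = np.linalg.solve(Sxx, diff)
        # component posterior from the marginal Gaussian on x
        post[m] = w * np.exp(-0.5 * diff @ sol) / \
            np.sqrt(np.linalg.det(2 * np.pi * Sxx))
        cond_means.append(mu_y + Syx @ sol)
    post /= post.sum() + 1e-12
    return sum(p * cm for p, cm in zip(post, cond_means))
```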
VC-FS: A frame selection voice conversion system, which
is a simplified version of exemplar-based unit selection [58],
using a single frame as an exemplar and without a concate-
nation cost. We used the Part A set for training. The same
features as in VC-C1 are used, and once again only the MGCs
are converted.
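A frame-selection converter of this kind reduces to a nearest-neighbour lookup over time-aligned training frames. A minimal sketch, assuming the parallel training frames have already been aligned:

```python
import numpy as np

def frame_selection_convert(source_frames, train_src, train_tgt):
    """For each input frame, pick the nearest source training frame
    (single-frame exemplar, no concatenation cost, as in VC-FS) and
    emit its time-aligned target counterpart."""
    out = np.empty((len(source_frames), train_tgt.shape[1]))
    for t, x in enumerate(np.asarray(source_frames)):
        idx = np.argmin(np.linalg.norm(train_src - x, axis=1))
        out[t] = train_tgt[idx]
    return out
```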
VC-GMM: Another GMM-based voice conversion method, very similar to VC-FEST but with some enhancements, which also uses the parallel training data from Part A. STRAIGHT is used to extract 24-dimensional MGCs, 5 BAPs, and F0. The search range for F0 extraction is automatically optimized speaker by speaker to reduce errors. Two GMMs are trained to separately convert the 1st through 24th MGCs and the 5 BAPs. The number of mixture components is set to 32 for MGCs and 8 for BAPs, respectively. GV-based post-filtering [59] is used to enhance the variance of the converted spectral parameter trajectories.
VC-KPLS: Voice conversion using kernel partial least squares (KPLS) regression [60], trained on the Part A parallel data. Three hundred reference vectors and a Gaussian kernel are used to derive kernel features and 50 latent components
