
Anti-Spoofing for Text-Independent Speaker
Verification: An Initial Database, Comparison of
Countermeasures, and Human Performance
Zhizheng Wu, Phillip L. De Leon, Senior Member, IEEE, Cenk Demiroglu, Ali Khodabakhsh,
Simon King, Fellow, IEEE, Zhen-Hua Ling, Daisuke Saito, Bryan Stewart, Tomoki Toda,
Mirjam Wester, and Junichi Yamagishi, Senior Member, IEEE
Abstract—In this paper, we present a systematic study of the
vulnerability of automatic speaker verification to a diverse range
of spoofing attacks. We start with a thorough analysis of the
spoofing effects of five speech synthesis and eight voice conversion
systems, and the vulnerability of three speaker verification
systems under those attacks. We then introduce a number of
countermeasures to prevent spoofing attacks from both known
and unknown attackers. Known attackers are spoofing systems
whose output was used to train the countermeasures, whilst an
unknown attacker is a spoofing system whose output was not
available to the countermeasures during training. Finally, we
benchmark automatic systems against human performance on
both speaker verification and spoofing detection tasks.
Index Terms—Speaker verification, speech synthesis, voice con-
version, spoofing attack, anti-spoofing, countermeasure, security
I. INTRODUCTION
The task of automatic speaker verification (ASV), some-
times described as a type of voice biometrics, is to accept
or reject a claimed identity based on a speech sample.
There are two types of ASV system: text-dependent and text-
independent. Text-dependent ASV assumes constrained word
content and is normally used in authentication applications
because it can deliver the high accuracy required. However,
text-independent ASV does not place constraints on word
content, and is normally used in surveillance applications. For
This work was partially supported by EPSRC under Programme Grant EP/I031022/1 (Natural Speech Technology) and EP/J002526/1 (CAF) and by TUBITAK 1001 grant No. 112E160. This article is an expanded version of [1], [2].
Z. Wu is the corresponding author; the remaining authors are listed in alphabetical order to indicate equal contributions.
Z. Wu, S. King, M. Wester and J. Yamagishi are with the Centre for Speech
Technology Research, University of Edinburgh, UK. e-mail: {zhizheng.wu,
simon.king}@ed.ac.uk, {mwester, jyamagis}@inf.ed.ac.uk
P. L. De Leon and B. Stewart are with the Klipsch School of Electrical and
Computer Engineering, New Mexico State University (NMSU), Las Cruces
NM 88003 USA. e-mail: {pdeleon, brystewa}@nmsu.edu
A. Khodabakhsh and C. Demiroglu are with Ozyegin University, Turkey.
e-mail: alikhodabakhsh@gmail.com, cenk.demiroglu@ozyegin.edu.tr
Z.-H. Ling is with the University of Science and Technology of China, China. e-mail: zhling@ustc.edu
D. Saito is with the University of Tokyo, Japan. e-mail: dsk saito@gavo.t.u-tokyo.ac.jp
T. Toda is with the Information Technology Center, Nagoya University, Japan. e-mail: tomoki@icts.nagoya-u.ac.jp
example, in call-center applications (see, e.g., http://www.nuance.com/for-business/customer-service-solutions/voice-biometrics/freespeech/index.htm and https://youtu.be/kyPTGoDyd o), a caller's identity can be verified during the course of a natural conversation without forcing the caller to speak a specific passphrase. Moreover, as such a verification process usually takes place under remote scenarios without any face-to-face contact, a spoofing attack (an attempt to manipulate a verification result by mimicking a target speaker's voice in person or by using computer-based techniques such as voice conversion or speech synthesis) is a fundamental concern. Hence, in this work, we focus on spoofing and anti-spoofing for text-independent ASV.
Due to a number of technical advances, notably channel
and noise compensation techniques, ASV systems are being
widely adopted in security applications [3], [4], [5], [6], [7].
A major concern, however, when deploying an ASV system,
is its resilience to a spoofing attack. As identified in [8], there
are at least four types of spoofing attack: impersonation [9],
[10], [11], replay [12], [13], [14], speech synthesis [15], [16]
and voice conversion [17], [18], [19], [20], [21]. Among the
four types of spoofing attack, replay, speech synthesis, and
voice conversion present the highest risk to ASV systems [8].
Although replay might be the most common spoofing tech-
nique which presents a risk to both text-dependent and text-
independent ASV systems [12], [13], [14], it is not viable
for the generation of utterances of specific content, such as
would be required to maintain a live conversation in a call-
center application. On the other hand, open-source software for state-of-the-art speech synthesis and voice conversion is readily available (e.g., Festival, http://www.cstr.ed.ac.uk/projects/festival/, and Festvox, http://festvox.org/), making these two approaches perhaps the most accessible and effective means to carry out spoofing attacks, and therefore presenting a serious risk to deployed ASV systems [8]. For that reason, the focus in this work is only on these two types of spoofing attack.
A. Speech Synthesis and Voice Conversion Spoofing
Many studies have reported and analysed the vulnerability
of ASV systems to speech synthesis and voice conversion
spoofing. The potential vulnerability of ASV to synthetic
speech was first evaluated in [22], [23], where an HMM-based speech synthesis system was used to spoof an HMM-based, text-prompted ASV system; the false acceptance rate (FAR) was reported to increase from 0% to over 70% under a speech synthesis spoofing attack. In [15], [16], the vulnerability of two ASV systems, a GMM-UBM system (Gaussian mixture models with a universal background model) and an SVM system (support vector machine using a GMM supervector), was assessed using a speaker-adaptive, HMM-based speech synthesizer. Experiments using the Wall Street Journal (WSJ) corpus (283 speakers) [24] showed that FARs increased from 0.28% and 0.00% to 86% and 81% for the GMM-UBM and SVM systems, respectively. These studies confirm
the vulnerability of ASV systems to speech synthesis spoofing
attack.
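In these studies, vulnerability is quantified by the FAR. As a concrete illustration, a minimal sketch of how FAR (and its counterpart, the false rejection rate) is computed from trial scores follows; the score distributions are synthetic stand-ins of our own, not data from any of the cited systems.

```python
import numpy as np

def far(impostor_scores, threshold):
    """False acceptance rate: fraction of impostor (or spoofed) trials
    whose verification score exceeds the decision threshold."""
    return float(np.mean(np.asarray(impostor_scores) > threshold))

def frr(genuine_scores, threshold):
    """False rejection rate: fraction of genuine trials at or below it."""
    return float(np.mean(np.asarray(genuine_scores) <= threshold))

# Synthetic illustration: a threshold tuned against zero-effort impostors
# can pass most spoofed trials, mirroring the FAR jumps reported above.
rng = np.random.default_rng(0)
genuine = rng.normal(2.0, 1.0, 1000)      # hypothetical ASV scores
zero_effort = rng.normal(-2.0, 1.0, 1000)
spoofed = rng.normal(1.5, 1.0, 1000)      # spoofing shifts scores upward

threshold = 0.0
print(f"zero-effort FAR: {far(zero_effort, threshold):.3f}")
print(f"spoofed FAR:     {far(spoofed, threshold):.3f}")
print(f"FRR:             {frr(genuine, threshold):.3f}")
```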
Voice conversion as a spoofing method has also been
attracting increasing attention. The potential risk of voice
conversion to a GMM ASV system was evaluated for the first
time in [25], which used the YOHO database (138 speakers).
In [26], [27], [17], text-independent GMM-UBM systems were
assessed when faced with voice conversion spoofing on NIST
speaker recognition evaluation (SRE) datasets. These studies
showed an increase in FAR from around 10% to over 40% and
confirmed the vulnerability of GMM-UBM systems to voice
conversion spoofing attack.
Recent studies [18], [19] have evaluated more advanced
ASV systems based on joint factor analysis (JFA), i-vectors,
and probabilistic linear discriminant analysis (PLDA), on
the NIST SRE 2006 database. The FARs of these systems
increased five-fold from about 3% to over 17% under attacks
from voice conversion spoofing.
B. Spoofing Countermeasures
The vulnerability of ASV systems to spoofing attacks has
led to the development of anti-spoofing techniques, often
referred to as countermeasures. In [28], a synthetic speech
detector based on the average inter-frame difference (AIFD)
was proposed to discriminate between natural and synthetic
speech. This countermeasure works well if the dynamic vari-
ation of the synthetic speech is different from that of natural
speech; however, if global variance compensation is applied
to the synthetic speech, the countermeasure becomes less
effective [15].
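The intuition can be sketched as a mean frame-to-frame distance over a feature trajectory; this is our paraphrase of the AIFD idea, not necessarily the exact definition used in [28].

```python
import numpy as np

def average_inter_frame_difference(features):
    """AIFD-style statistic for a (num_frames, dim) feature matrix:
    mean Euclidean distance between consecutive frames. Statistically
    averaged (over-smoothed) synthetic speech tends to score lower than
    natural speech, unless global variance compensation is applied."""
    diffs = np.diff(features, axis=0)  # (num_frames - 1, dim)
    return float(np.mean(np.linalg.norm(diffs, axis=1)))
```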
In [29], [30], a synthetic speech detector based on image
analysis of pitch-patterns was proposed for human versus syn-
thetic speech discrimination. This countermeasure was based
on the observation that there can be artefacts in the pitch
contours generated by HMM-based speech synthesis. Experi-
ments showed that features extracted from pitch-patterns can
be used to significantly reduce the FAR for synthetic speech.
The performance of the pitch-pattern countermeasure was not
evaluated for detecting voice conversion spoofing.
In [31], a temporal modulation feature was proposed to
detect synthetic speech generated by copy-synthesis. The
modulation feature captures the long-term temporal distor-
tion caused by independent frame-by-frame operations in
speech synthesis. Experiments conducted on the WSJ database
showed the effectiveness of the modulation feature when
integrated with frame-based features. However, whether the
detector is effective across a variety of speech synthesis and
voice conversion spoofing attacks is unknown. Also using
spectro-temporal information, a feature derived from local
binary patterns [32] was employed to detect voice conversion
and speech synthesis attacks in [33], [34].
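A minimal sketch of an utterance-level modulation feature of this kind is shown below. The configuration (a 1024-point FFT over each frame-level feature trajectory, followed by a DCT keeping the first 32 coefficients) follows what the paper reports for its own modulation features; the remaining packaging is our assumption.

```python
import numpy as np
from scipy.fft import dct

def modulation_feature(trajectory, n_fft=1024, n_keep=32):
    """Modulation feature for one frame-level feature trajectory:
    magnitude spectrum of the trajectory (the modulation spectrum),
    log-compressed and decorrelated with a DCT. Independent
    frame-by-frame synthesis distorts the long-term temporal
    structure that this feature captures."""
    traj = np.asarray(trajectory, dtype=float)
    traj = traj - traj.mean()                      # remove DC offset
    mod_spec = np.abs(np.fft.rfft(traj, n=n_fft))  # modulation spectrum
    return dct(np.log(mod_spec + 1e-8), norm='ortho')[:n_keep]
```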
Phase- and modified group delay-based features have also
been proposed to detect voice conversion spoofing [35]. A
cosine-normalised phase feature was derived from the phase
spectrogram while the modified group delay feature contained
both magnitude and phase information. Evaluation on the
NIST SRE 2006 data confirmed the effectiveness of the
proposed features. However, it remains unknown whether the
phase-based features are also effective in detecting attacks
from speech synthesisers using unknown vocoders. Another
phase-based feature called the relative phase shift was pro-
posed in [16], [36], [37] to detect speech synthesis spoof-
ing, and was reported to achieve promising performance for
vocoders using minimum phase rather than natural phase.
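For reference, the modified group delay function underlying these features can be sketched per frame as follows. The exponents rho and gamma are typical values from the group-delay literature, and the smoothed spectrum is crudely approximated here, so the exact settings in [35] may differ.

```python
import numpy as np

def modified_group_delay(frame, rho=0.4, gamma=0.9, eps=1e-8):
    """Modified group delay for one windowed frame x(n):
        tau(w) = (X_R(w) Y_R(w) + X_I(w) Y_I(w)) / |S(w)|^(2*gamma)
        MGD(w) = sign(tau(w)) * |tau(w)|^rho
    with X = FFT(x), Y = FFT(n * x(n)), and S a smoothed magnitude
    spectrum (approximated by |X| here; cepstral smoothing is usual)."""
    n = np.arange(len(frame))
    X = np.fft.rfft(frame)
    Y = np.fft.rfft(n * frame)
    S = np.abs(X) + eps
    tau = (X.real * Y.real + X.imag * Y.imag) / S ** (2 * gamma)
    return np.sign(tau) * np.abs(tau) ** rho
```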
In [38], an average pair-wise distance (PWD) between
consecutive feature vectors was employed to detect voice-
converted speech, on the basis that the PWD feature is
able to capture short-term variabilities, which might be lost
during statistical averaging when generating converted speech.
Although the PWD was shown to be effective against attacks
from their own voice conversion system, this technique (which
is similar to the AIFD feature proposed in [28]) might not be
an effective countermeasure against systems that apply global
variance enhancement.
In contrast to the above methods focusing on discriminative
features, a probabilistic approach was proposed in [39], [40].
This approach uses the same front-end as ASV, but treats
the synthetic speech as a signal passed through a synthesis
filter. Experiments on the NIST SRE 2006 database showed
comparable performance to feature-based countermeasures. In
this work, we focus on feature-based anti-spoofing techniques,
as they can be optimised independently without rebuilding the
ASV systems.
C. Motivations and Contributions of this Work
In the literature, each study assumes a particular spoofing
type (speech synthesis or voice conversion) and often just one
variant (algorithm) of that type, then designs and evaluates
a countermeasure for that specific, known attack. However,
in practice it may not be possible to know the exact type
of spoofing attack and therefore evaluations of ASV systems
and countermeasures under a broad set of spoofing types are
desirable. Most, if not all, previous studies have been unable to
conduct a broader evaluation because of the lack of a standard,
publicly-available spoofing database that contains a variety of
spoofing attacks. To address this issue, we have previously
developed a spoofing and anti-spoofing (SAS) database in-
cluding both speech synthesis and voice conversion spoofing
attacks [1]. This database includes spoofing speech from two
different speech synthesis systems and seven different voice
conversion systems.

Now, we first broaden the SAS database by including four more variants: three text-to-speech (TTS) synthesisers and one voice conversion system. They will be referred to as SS-SMALL-48, SS-LARGE-48, SS-MARY and VC-LSP, and are described in Section II-A. These four systems are new in this article, while the other systems were published in a conference paper [1]; SS-SMALL-48 and SS-LARGE-48 allow us to analyse the effect of the sampling rate of spoofing materials, and SS-MARY is useful for understanding the effect of waveform concatenation-based speech synthesis spoofing.
We also develop a joint speaker verification and countermeasure evaluation protocol, then refine that evaluation protocol to enable better generalisability of countermeasures developed using the database. We include contributions from both the speech synthesis and speaker verification communities. This database is offered as a resource for researchers investigating generalised spoofing and anti-spoofing methods; based on it, a spoofing and countermeasure challenge [41], [42] has already been successfully organised as a special session of INTERSPEECH 2015. We hope that the availability of a standard database will contribute to reproducible research. The SAS corpus is publicly available at http://dx.doi.org/10.7488/ds/252.
Second, with the SAS database, we conduct a comprehen-
sive analysis of spoofing attacks on six different ASV systems.
From this analysis we are able to determine which spoofing
type and variant currently poses the greatest threat and how
best to counter this threat. To the best of our knowledge, this
study is the first evaluation of the vulnerability of ASV using
such a diverse range of spoofing attacks and the most thorough
analysis of the spoofing effects of speech synthesis and voice
conversion spoofing systems under the same protocol.
Third, we present a comparison of several anti-spoofing
countermeasures to discriminate between human and artificial
speech. In our previous work, we applied cosine-normalised
phase [35], modified group delay [35] and segment-based
modulation features [31] to detect voice converted speech,
and applied pitch pattern based features to detect synthetic
speech [29], [30]. In this work, we evaluate these countermea-
sures against both spoofing types and propose to fuse decisions
at the score level in order to leverage multiple, complementary
sources of information to create stronger countermeasures.
We also extend the segment-based modulation feature to an
utterance-level feature, to account for long-term variations.
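A minimal sketch of the score-level fusion idea follows, assuming z-normalisation followed by a weighted sum; the fusion back-end used in the experiments may differ, and in practice the normalisation statistics would be estimated on development data rather than on the scores being fused.

```python
import numpy as np

def fuse_scores(score_lists, weights=None):
    """Fuse per-trial scores from several countermeasures:
    z-normalise each detector's scores, then combine them with a
    weighted sum (equal weights by default)."""
    score_lists = [np.asarray(s, dtype=float) for s in score_lists]
    if weights is None:
        weights = [1.0 / len(score_lists)] * len(score_lists)
    fused = np.zeros_like(score_lists[0])
    for w, s in zip(weights, score_lists):
        fused += w * (s - s.mean()) / (s.std() + 1e-8)
    return fused
```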
Finally, we perform listening tests to evaluate the ability of human listeners to discriminate between human and artificial speech. (A preliminary version was published at INTERSPEECH 2015 [2], where we focused on human and automatic spoofing detection performance on wideband and narrowband data; the current work benchmarks automatic systems against human performance on both speaker verification and spoofing detection tasks.) Although the vulnerability of ASV systems in the face of spoofing attacks is known, some questions still remain unanswered. These include whether human perceptual ability is important in identifying spoofing and whether humans can achieve better performance than automatic approaches in detecting spoofing attacks. In this work, we attempt to answer these questions through a series of carefully-designed listening tests. In contrast to the human assisted speaker recognition (HASR) evaluation [43], we consider spoofing attacks in
speaker verification and conduct listening tests for spoofing
detection, which was not considered in the HASR evaluation.
II. DATABASE AND PROTOCOL
We extended our SAS database [1] by including additional artificial speech. The database is built from the freely available Voice Cloning Toolkit (VCTK) database of native speakers of British English (http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html). The VCTK database was recorded in a hemi-anechoic chamber using an omni-directional head-mounted microphone (DPA 4035) at a sampling rate of 96 kHz. The sentences are selected from newspapers, and the average duration of each sentence is about 2 seconds.
To design the spoofing database, we took speech data
from VCTK comprising 45 male and 61 female speakers and
divided each speaker's data into five parts:
A: 24 parallel utterances (i.e., same sentences for all
speakers) per speaker: training data for spoofing
systems.
B: 20 non-parallel utterances per speaker: additional
training for spoofing systems.
C: 50 non-parallel utterances per speaker: enrolment
data for client model training in speaker verification,
or training data for speaker-independent countermea-
sures.
D: 100 non-parallel utterances per speaker: development
set for speaker verification and countermeasures.
E: Around 200 non-parallel utterances per speaker: eval-
uation set for speaker verification and countermea-
sures.
In Parts B to E, sentences were randomly selected from
newspapers without any repeating sentence across speakers.
In Parts A and B, we have two versions, downsampled to 48
kHz and 16 kHz respectively, while in Parts C, D and E all
signals are downsampled to 16 kHz. Parts A and B allow us
to analyse the effects of sampling rate on spoofing attacks. For
training the spoofing systems, we designed two training sets.
The small set consists of data only from Part A, while the
large set comprises the data from Parts A and B together.
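For reference, the split can be summarised programmatically as below; the dictionary layout and field names are ours, not an official protocol format.

```python
# Per-speaker data split of the SAS database (counts are utterances
# per speaker; Part E is approximately 200). Parts A and B exist at
# both 48 kHz and 16 kHz; Parts C-E are 16 kHz only.
SAS_SPLITS = {
    "A": dict(utts=24,  parallel=True,  rates_khz=(48, 16),
              use="spoofing-system training (small set)"),
    "B": dict(utts=20,  parallel=False, rates_khz=(48, 16),
              use="extra spoofing training (large set = A + B)"),
    "C": dict(utts=50,  parallel=False, rates_khz=(16,),
              use="ASV enrolment / countermeasure training"),
    "D": dict(utts=100, parallel=False, rates_khz=(16,),
              use="development set for ASV and countermeasures"),
    "E": dict(utts=200, parallel=False, rates_khz=(16,),
              use="evaluation set for ASV and countermeasures"),
}
```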
A. Spoofing Systems
We implemented five speech synthesis (SS) and eight voice
conversion (VC) spoofing systems, as summarised in Table I.
These systems were built using both open-source software (to
facilitate reproducible research) as well as our own state-of-
the-art systems (to provide comprehensive results):
NONE: This is a baseline zero-effort impostor trial in which
the impostor’s own speech is used directly with no attempt to
match the target speaker.
SS-LARGE-16: An HMM-based TTS system built with the statistical parametric speech synthesis framework described in [44]. For speech analysis, the STRAIGHT vocoder with mixed excitation is used, which results in 60-dimensional Bark-cepstral coefficients, log F0 and 25-dimensional band-limited aperiodicity measures [45], [46].

TABLE I
SUMMARY OF THE SPOOFING SYSTEMS USED IN THIS PAPER. MGC, BAP AND F0 DENOTE MEL-GENERALISED CEPSTRAL COEFFICIENTS, BAND APERIODICITY AND FUNDAMENTAL FREQUENCY, RESPECTIVELY.

System      | Algorithm          | Sampling rate | # training utterances | Vocoder  | Features     | Background data required? | Known or unknown? | Open-source toolkit?
SS-LARGE-16 | HMM TTS            | 16k | 40 | STRAIGHT | MGC, BAP, F0 | Yes | Known   | Yes
SS-LARGE-48 | HMM TTS            | 48k | 40 | STRAIGHT | MGC, BAP, F0 | Yes | Unknown | Yes
SS-SMALL-16 | HMM TTS            | 16k | 24 | STRAIGHT | MGC, BAP, F0 | Yes | Known   | Yes
SS-SMALL-48 | HMM TTS            | 48k | 24 | STRAIGHT | MGC, BAP, F0 | Yes | Unknown | Yes
SS-MARY     | Unit selection TTS | 16k | 40 | None     | Waveform     | No  | Unknown | Yes
VC-C1       | C1 VC              | 16k | 24 | STRAIGHT | MGC, BAP, F0 | No  | Known   | No
VC-EVC      | Eigenvoice VC      | 16k | 24 | STRAIGHT | MGC, BAP, F0 | Yes | Unknown | No
VC-FEST     | GMM VC             | 16k | 24 | MLSA     | MGC, F0      | No  | Known   | Yes
VC-FS       | Frame selection VC | 16k | 24 | STRAIGHT | MGC, BAP, F0 | No  | Known   | No
VC-GMM      | GMM VC             | 16k | 24 | STRAIGHT | MGC, BAP, F0 | No  | Unknown | No
VC-KPLS     | KPLS VC            | 16k | 24 | STRAIGHT | MGC, BAP, F0 | No  | Unknown | No
VC-LSP      | GMM VC             | 16k | 24 | STRAIGHT | LSP, F0      | No  | Unknown | No
VC-TVC      | Tensor VC          | 16k | 24 | STRAIGHT | MGC, BAP, F0 | Yes | Unknown | No

Speech data from 257 (115 male and 142 female) native speakers of British English is used to train the average voice model. In the speaker
adaptation phase, the average voice model is transformed
using structural variational Bayesian linear regression [47]
followed by maximum a posteriori (MAP) adaptation, using
the target speaker’s data from Parts A and B. To synthesise
speech, acoustic feature parameters are generated from the
adapted HMMs using a parameter generation algorithm that
considers global variance (GV) [48]. An excitation signal
is generated using mixed excitation and pitch-synchronous
overlap and add [49], and used to excite a Mel-logarithmic
spectrum approximation (MLSA) filter [50] corresponding to
the STRAIGHT Bark cepstrum, to create the final synthetic
speech waveform.
SS-LARGE-48: Same as SS-LARGE-16, except that 48
kHz sample rate waveforms are used for adaptation. The use
of 48 kHz data is motivated by findings in speech synthesis
that speaker similarity can be improved significantly by using
data at a higher sampling rate [51].
SS-SMALL-16: Same as SS-LARGE-16, except that only
Part A of the target speaker data is used for adaptation.
SS-SMALL-48: Same as SS-SMALL-16, except that 48
kHz sample rate waveforms are used to adapt the average
voice.
SS-MARY: Based on the Mary-TTS unit selection synthesis system [52] (http://mary.dfki.de/). Waveform concatenation operates on diphone
sis system [52]. Waveform concatenation operates on diphone
units. Candidate units for each position in the utterance are
found using decision trees that query the linguistic features of
the target diphone. A preselection algorithm is used to prune
candidates that do not fit the context well. The total cost sums linguistic (target) and acoustic (join) costs. Candidate
diphone and target diphone labels and their contexts are used
to compute the linguistic sub-cost. Pitch and duration are used
for the join cost. Dynamic programming is used to find the
sequence of units with the minimum total target plus join
cost. Concatenation takes place in the waveform domain, using
pitch-synchronous overlap-add at unit boundaries.
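The search just described is a standard Viterbi search over the candidate lattice. A generic sketch of that dynamic programme follows; it is not Mary-TTS's actual implementation, and the cost functions are left abstract.

```python
import numpy as np

def select_units(target_costs, join_cost):
    """Dynamic-programming unit selection: target_costs[t] is an array
    of target (linguistic) costs for the candidates at position t, and
    join_cost(t, i, j) returns the acoustic join cost between candidate
    i at position t-1 and candidate j at position t. Returns the
    candidate indices minimising the total target-plus-join cost."""
    T = len(target_costs)
    best = [np.asarray(target_costs[0], dtype=float)]
    back = []
    for t in range(1, T):
        cur = np.asarray(target_costs[t], dtype=float)
        prev = best[-1]
        # total[i, j]: best path ending in candidate i, then moving to j
        joins = np.array([[join_cost(t, i, j) for j in range(len(cur))]
                          for i in range(len(prev))])
        total = prev[:, None] + joins + cur[None, :]
        back.append(np.argmin(total, axis=0))
        best.append(np.min(total, axis=0))
    # backtrack the cheapest unit sequence
    path = [int(np.argmin(best[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t - 1][path[-1]]))
    return path[::-1]
```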
VC-C1: The simplest voice conversion method, which modifies the spectral slope simply by shifting the first Mel-generalised cepstral (MGC) coefficient [53]. No other speaker-specific features are changed. The STRAIGHT vocoder is used to extract MGCs, band aperiodicities (BAPs) and F0.
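Because VC-C1 touches a single coefficient, it is easy to sketch. The shift value, and the reading of "first MGC" as c1 (suggested by the system's name and the spectral-slope interpretation), are our assumptions; [53] gives the exact recipe.

```python
import numpy as np

def vc_c1_convert(mgc, shift):
    """VC-C1-style conversion sketch for a (num_frames, dim) MGC
    matrix: add a constant shift to c1, tilting the spectral slope
    toward the target speaker. BAPs and F0 pass through unchanged.
    The shift would be estimated from target-minus-source training
    statistics (our assumption)."""
    converted = np.array(mgc, dtype=float, copy=True)
    converted[:, 1] += shift   # c1 controls the overall spectral slope
    return converted
```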
VC-EVC: A many-to-many eigenvoice conversion (EVC) system [54]. The eigenvoice GMM (EV-GMM) is constructed from the training data of one pivot speaker in the ATR Japanese speech database [55], and 273 speakers (137 male, 136 female) from the JNAS database (http://www.milab.is.tsukuba.ac.jp/jnas/instruct.html). Settings are the same as in [56]. The 272-dimensional weight vectors are estimated using Part A of the training data. STRAIGHT is used to extract 24-dimensional MGCs, 5 BAPs, and F0. The conversion function is applied only to the MGCs.
VC-FEST: The voice conversion toolkit provided by the
open-source Festvox system. It is based on the algorithm
proposed in [57], which is a joint density Gaussian mix-
ture model with maximum likelihood parameter generation
considering global variance. It is trained on the Part A set
of parallel training data, keeping the default settings of the
toolkit, except that the number of Gaussian components in the
mixture distributions is set to 32.
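The frame-wise core of such a joint-density GMM mapping is the conditional expectation E[y|x]. The sketch below implements only that core, and omits the maximum-likelihood parameter generation and global variance steps that the full algorithm [57] applies on top.

```python
import numpy as np

def jdgmm_convert_frame(x, weights, means, covs):
    """Convert one source frame x with a joint GMM over z = [x; y],
    where means[m] = [mu_x; mu_y] and covs[m] = [[Sxx, Sxy], [Syx, Syy]].
    Returns E[y | x] = sum_m P(m|x) (mu_y + Syx Sxx^{-1} (x - mu_x))."""
    d = len(x)
    post = np.zeros(len(weights))
    cond_means = []
    for m, (w, mu, S) in enumerate(zip(weights, means, covs)):
        mu_x, mu_y = mu[:d], mu[d:]
        Sxx, Syx = S[:d, :d], S[d:, :d]
        diff = x - mu_x
        sol = np.linalg.solve(Sxx, diff)
        # component posterior from the marginal Gaussian on x
        post[m] = w * np.exp(-0.5 * diff @ sol) / \
            np.sqrt(np.linalg.det(2 * np.pi * Sxx))
        cond_means.append(mu_y + Syx @ sol)
    post /= post.sum() + 1e-12
    return sum(p * cm for p, cm in zip(post, cond_means))
```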
VC-FS: A frame selection voice conversion system, which
is a simplified version of exemplar-based unit selection [58],
using a single frame as an exemplar and without a concate-
nation cost. We used the Part A set for training. The same
features as in VC-C1 are used, and once again only the MGCs
are converted.
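A frame-selection converter of this kind reduces to a nearest-neighbour lookup over time-aligned training frames. A minimal sketch, assuming the parallel training frames have already been aligned:

```python
import numpy as np

def frame_selection_convert(source_frames, train_src, train_tgt):
    """For each input frame, pick the nearest source training frame
    (single-frame exemplar, no concatenation cost, as in VC-FS) and
    emit its time-aligned target counterpart."""
    out = np.empty((len(source_frames), train_tgt.shape[1]))
    for t, x in enumerate(np.asarray(source_frames)):
        idx = np.argmin(np.linalg.norm(train_src - x, axis=1))
        out[t] = train_tgt[idx]
    return out
```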
VC-GMM: Another GMM-based voice conversion method, very similar to VC-FEST but with some enhancements, which also uses the parallel training data from Part A. STRAIGHT is used to extract 24-dimensional MGCs, 5 BAPs, and F0. The search range for F0 extraction is automatically optimized speaker by speaker to reduce errors. Two GMMs are trained to separately convert the 1st through 24th MGCs and the 5 BAPs. The number of mixture components is set to 32 for MGCs and 8 for BAPs, respectively. GV-based post-filtering [59] is used to enhance the variance of the converted spectral parameter trajectories.
VC-KPLS: Voice conversion using kernel partial least squares (KPLS) regression [60], trained on the Part A parallel data. Three hundred reference vectors and a Gaussian kernel are used to derive kernel features and 50 latent components
