Journal ArticleDOI

Spoofing and countermeasures for speaker verification

TL;DR: A survey of past work and priority research directions for the future is provided, showing that future research should address the lack of standard datasets and the over-fitting of existing countermeasures to specific, known spoofing attacks.
About: This article is published in Speech Communication.The article was published on 2015-02-01 and is currently open access. It has received 433 citations till now. The article focuses on the topics: Spoofing attack & IP address spoofing.

Summary (9 min read)

2. Automatic speaker verification

  • There are two types of ASV systems: text-dependent and text-independent.
  • Text-independent systems operate on arbitrary utterances, possibly spoken in different languages (Campbell Jr, 1997).
  • On account of evaluation sponsorship and dataset availability, text-independent ASV dominates the field and the research tends to place greater emphasis on surveillance applications rather than authentication.

2.1. Feature extraction

  • A speech signal conveys three levels of information: voice timbre, prosody and language content.
  • Correspondingly, speaker individuality can be characterised by short-term spectral, prosodic and high-level idiolectal features.
  • Short-term spectral features are extracted from short frames, typically of 20-30 milliseconds duration.
  • Prosodic features, such as pitch, energy and duration, are less sensitive to channel effects.
  • The extraction of high-level features requires considerably more complex front-ends, such as those which employ automatic speech recognition (Kinnunen and Li, 2010; Li and Ma, 2010).
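As an illustration of the short-term analysis described above, the following sketch frames a signal into overlapping windows and computes per-frame log-magnitude spectra, the starting point for spectral features such as MFCCs. This is a minimal, assumed setup using only NumPy; the frame and hop sizes are typical choices, not values taken from the paper.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a signal into overlapping short-term frames
    (400 samples = 25 ms and 160 samples = 10 ms at 16 kHz)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(frame_len)

def log_spectral_features(x, n_fft=512):
    """Per-frame log-magnitude spectrum; a mel filterbank and DCT
    would follow to obtain MFCCs."""
    frames = frame_signal(x)
    mag = np.abs(np.fft.rfft(frames, n_fft))
    return np.log(mag + 1e-10)

# One second of a synthetic 440 Hz tone at 16 kHz as a stand-in signal
t = np.arange(16000) / 16000.0
feats = log_spectral_features(np.sin(2 * np.pi * 440 * t))
```

With these settings, one second of audio yields 98 frames of 257 spectral bins each.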

2.2. Speaker modeling and classification

  • Approaches to text-independent ASV generally focus on modelling the feature distribution of a target speaker.
  • There are many different ways to implement Eq. (1).
  • With more modern techniques, X can also be high-dimensional i-vectors (Dehak et al., 2011) modelled with probabilistic linear discriminant analysis (PLDA) back-ends (Li et al., 2012) (see below).
  • Even so, GMMs are still needed for i-vector extraction and thus the authors provide a more detailed presentation of the GMM in the following.
  • The target speaker and UBM models are used as the hypothesised and alternative speaker models respectively.
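The GMM-UBM likelihood-ratio test described above can be illustrated on toy data. The sketch below is an assumed simplification: the target model is trained independently with scikit-learn's GaussianMixture, whereas a full system would derive it by MAP adaptation of the UBM; the 2-D "features" are invented for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy 2-D "features": a target speaker and a background population
target_train = rng.normal(loc=2.0, scale=1.0, size=(500, 2))
background   = rng.normal(loc=0.0, scale=2.0, size=(2000, 2))

# Universal background model (UBM) and target model
ubm    = GaussianMixture(n_components=4, random_state=0).fit(background)
target = GaussianMixture(n_components=4, random_state=0).fit(target_train)

def llr_score(X):
    """Average log-likelihood ratio between the hypothesised (target)
    and alternative (UBM) speaker models; .score() is the mean
    per-sample log-likelihood."""
    return target.score(X) - ubm.score(X)

genuine  = llr_score(rng.normal(2.0, 1.0, size=(200, 2)))
impostor = llr_score(rng.normal(0.0, 2.0, size=(200, 2)))
```

Genuine trials score above impostor trials, and a threshold on this ratio yields the accept/reject decision.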

2.3. System fusion

  • In addition to the development of increasingly robust models and classifiers, there is a significant emphasis within the ASV community on the study of classifier fusion.
  • The motivation is based on the assumption that multiple, independently trained recognisers together capture different aspects of the speech signal not covered by a single classifier alone.
  • Fusion also provides a convenient vehicle for large-scale research collaborations promoting independent classifier development and benchmarking (Saeidi et al., 2013).
  • Different sub-systems can involve different features, classifiers, or hyper-parameter training sets (Brümmer et al., 2007; Hautamäki et al., 2013b).
  • A simple yet robust approach to fusion involves the weighted summation of the base classifier scores, where the weights are optimised according to a logistic regression cost function.
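A minimal sketch of score fusion by logistic regression follows; the base-classifier score distributions are invented for illustration, and scikit-learn is assumed for the weight optimisation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
# Invented scores from two base classifiers over 1000 verification trials
labels = rng.integers(0, 2, size=1000)           # 1 = target, 0 = impostor
s1 = labels * 1.5 + rng.normal(0, 1.0, 1000)     # base classifier 1 scores
s2 = labels * 1.0 + rng.normal(0, 1.2, 1000)     # base classifier 2 scores
S = np.column_stack([s1, s2])

# Logistic regression learns the fusion weights; the fused score is
# a weighted sum of the base scores plus a bias
fusion = LogisticRegression().fit(S, labels)
fused = S @ fusion.coef_.ravel() + fusion.intercept_[0]
```

In practice the weights would be trained on a held-out development set rather than on the evaluation trials themselves.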

3.1. Possible attack points

  • A typical ASV system involves two processes: offline enrolment and runtime verification.
  • At verification time, features extracted from a test sample are compared to the enrolled model in order to determine whether or not the speaker matches the claimed identity.
  • The classifier determines a match score which represents the relative similarity of the sample to each of the two models.
  • These components and the links between them all represent possible attack points (Ratha et al., 2001).
  • In past studies of ASV spoofing, impersonation and replay attacks are assumed to apply at the microphone.

3.2. Potential vulnerabilities

  • This section explains the potential for typical ASV systems to be spoofed.
  • The authors focus on two key ASV modules: feature extraction and speaker modelling.

3.2.1. Feature extraction

  • All three feature representations described in Section 2.1 are potentially vulnerable to spoofing attacks.
  • Due to their simplicity and performance, short-term spectral features are the most popular.
  • Ignoring any channel effects, replay attacks which use a pre-recorded speech sample can faithfully reflect the spectral attributes of the original speaker.
  • Voice conversion can also generate speech signals whose spectral envelope reflects that of a target speaker (Matrouf et al., 2006).

3.2.2. Speaker modeling

  • Most approaches to speaker modelling, be they applied to text-independent or text-dependent ASV, have their roots in the standard GMM.
  • Most lack the modelling of temporal sequence information, a key characteristic of human speech, which might otherwise afford some protection from spoofing; the feature distribution models used in typical speech synthesis and voice conversion algorithms likewise assume independence between observations, yet remain effective as spoofing attacks.
  • As shown in (Kons and Aronowitz, 2013), HMM-based systems, which capture temporal information, are more robust to spoofing than GMM-based systems when subject to the same spoofing attack.
  • While preliminary studies of fused ASV system approaches to anti-spoofing were reported in (Riera et al., 2012), some insight into their likely full potential can be gained from related work in fused, multi-modal biometric systems.
  • The authors note, however, that (Rodrigues et al., 2009; Akhtar et al., 2012) suggest it might suffice to spoof only one modality (or sub-system) under a score fusion setting in the case where the spoofing of a single, significantly weighted sub-system is particularly effective.

4. Evaluation protocol

  • Here the authors present a generic experimental protocol which applies to the majority of past work.
  • The authors discuss database design and evaluation metrics with a focus on the comparability of baseline results with those of vulnerability and countermeasure studies.

4.1. Dataset design

  • While past studies of spoofing have used a range of different datasets (Alegre et al., 2014) there are some similarities in the experimental protocols.
  • The diagram illustrates three possible inputs: genuine, zero-effort impostor and spoofed speech.
  • A new dataset suitable for the study of spoofing is derived from the baseline by replacing all impostor trials with spoofed trials.
  • Referring once again to Figure 3, baseline performance is assessed using the pool of M genuine trials (a) and N impostor trials (b), while that under spoofing is assessed with the pool of M genuine trials (a) and N spoofing trials (c).
  • The baseline performance and that under spoofing is thus directly comparable and the difference between them reflects the vulnerability of the system to the particular spoofing attack considered.

4.2. Evaluation metrics

  • The evaluation of ASV systems requires large numbers of two distinct tests: target tests, where the speaker matches the claimed identity, and impostor tests, where the identities differ.
  • There are two possible correct outcomes and two possible incorrect outcomes, namely false acceptance (or false alarm) and false rejection (or miss).
  • The FAR and FRR are complementary in the sense that, for a variable threshold and otherwise fixed system, one can only be reduced at the expense of increasing the other.
  • Equivalently, spoofing attacks will increase the FAR for a fixed decision threshold optimised on the standard baseline ASV dataset.
  • Nonetheless, similar to the decisions of a regular ASV system as illustrated in Figure 4, a practical, stand-alone countermeasure will inevitably lead to some false acceptances, where a spoofing attack remains undetected, in addition to false rejections, where genuine attempts are identified as spoofing attacks.
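The error trade-off described above can be made concrete with a small sketch (NumPy only; the score distributions are invented). It computes FAR and FRR at a threshold, sweeps thresholds for the equal error rate (EER), and shows how replacing impostor scores with higher "spoofed" scores raises the FAR at a fixed threshold.

```python
import numpy as np

def far_frr(target_scores, impostor_scores, threshold):
    """False acceptance rate and false rejection rate at one threshold."""
    far = np.mean(np.asarray(impostor_scores) >= threshold)
    frr = np.mean(np.asarray(target_scores) < threshold)
    return far, frr

def equal_error_rate(target_scores, impostor_scores):
    """Sweep candidate thresholds and return the point where FAR = FRR."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    rates = [far_frr(target_scores, impostor_scores, t) for t in thresholds]
    far, frr = min(rates, key=lambda r: abs(r[0] - r[1]))
    return (far + frr) / 2

rng = np.random.default_rng(2)
targets   = rng.normal(2.0, 1.0, 1000)
impostors = rng.normal(0.0, 1.0, 1000)
# Spoofed trials: invented scores shifted towards the target distribution
spoofed   = rng.normal(1.5, 1.0, 1000)

thr = 1.0  # a threshold fixed near the baseline operating point
far_base, frr_base = far_frr(targets, impostors, thr)
far_spoof, _ = far_frr(targets, spoofed, thr)
```

At the fixed threshold, the spoofed trials produce a much larger FAR than the zero-effort impostors, which is exactly the vulnerability the protocol in Section 4.1 measures.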

5. Spoofing and countermeasures

  • This section reviews past work to evaluate the vulnerability of typical ASV systems to spoofing and parallel efforts to develop anti-spoofing countermeasures.
  • Spoofing implies an attack at either the microphone or transmission level using a manipulated or synthesised speech sample in order to bias the system towards verifying a claimed identity.
  • The authors consider impersonation, replay, speech synthesis and voice conversion while concentrating on three different aspects: (i) the practicality of each spoofing attack; (ii) the vulnerability of ASV systems when subjected to such attacks, and (iii) the design of realistic datasets for experimentation.
  • With regard to countermeasures, the authors focus on: (i) the effectiveness of a countermeasure in preventing specific spoofing attacks, and (ii) the generalisation of countermeasures in protecting against varying attacks.

5.1. Impersonation

  • Impersonation is one of the most obvious approaches to spoofing and refers to attacks using human-altered voices, otherwise referred to as human mimicking.
  • Here, an attacker tries to mimic a target speaker’s voice timbre and prosody without computer-aided technologies.

5.1.1. Spoofing

  • The work in (Lau et al., 2004) showed that non-professional impersonators can readily adapt their voice to overcome ASV, but only when their natural voice is already similar to that of the target speaker (closest targets were selected from the YOHO corpus using a speaker recognition system).
  • One of the key observations was that the change in the vocal space (measured through F1 and F2) under impersonation cannot be described by a simple global transform; formant changes are vowel-specific.
  • Since impersonation is thought to involve mostly the mimicking of prosodic and stylistic cues, it is perhaps considered more effective in fooling human listeners than today’s state-of-the-art ASV systems (Perrot et al., 2005; Hautamäki et al., 2014).
  • The work in (Lau et al., 2005) and (Stoll and Doddington, 2010) showed how ASV systems themselves or even acoustic features alone may be employed to identify ‘similar’ speakers in order to provoke false acceptances.
  • Past studies involving impersonation attacks are summarised in Table 2.

5.1.2. Countermeasures

  • Since the threat of impersonation is not fully understood, it is perhaps not surprising that there is virtually no prior work investigating countermeasures against impersonation.
  • Unlike the spoofing attacks discussed below, all of which can be assumed to leave traces of the physical properties of the recording andplayback devices, or signal processing artefacts from synthesis or conversion systems, impersonators are live human beings who produce entirely natural speech.
  • Interestingly, some related work (Amin et al., 2013, 2014) has addressed the problem of disguise detection.
  • Specifically, the disguise detectors in (Amin et al., 2013, 2014) used a quadratic discriminant on the first two formants to quantify the amount of acoustic variation on a vowel-by-vowel basis.
  • (Footnote: While the spoofing attacks discussed in this article are meant to increase the false acceptance rate, disguise is the opposite problem, where one wishes not to be recognised as oneself, thereby increasing the false rejection (miss) rate.)

5.2. Replay

  • Replay is a form of spoofing whereby an adversary attacks an ASV system using a pre-recorded speech sample collected from a genuine target speaker.
  • The speech sample can be any recording captured surreptitiously, and even concatenated speech samples extracted from a number of shorter segments, for example to overcome text-dependent ASV systems (Villalba and Lleida, 2011b).
  • In addition, due to the availability of high-quality, low-cost recording devices, such as smart phones, replay spoofing attacks are arguably the most accessible and therefore present a significant threat.
  • Here, a smart phone is used to replay a pre-recorded speech sample in order to unlock another smart phone which uses speaker verification technology for logical access authentication.
  • The left phone (black color) is the smart phone with a voice-unlock function for user authentication as reported in (Lee et al., 2013).

5.2.1. Spoofing

  • Even though they are among the simplest and most easily implemented attacks, only a small number of studies have addressed replay attacks.
  • Vulnerabilities to replay attack were first evaluated in (Lindberg et al., 1999).
  • The significant variation between male and female speakers is likely explained by the use of only a single speaker of each gender.
  • It shows that the spectrogram and formant trajectories of the replayed speech (upper images) are highly similar to those of the genuine speech (lower images).

5.2.2. Countermeasures

  • Recently, due to the mass-market adoption of ASV techniques (Lee et al., 2013; Nuance, 2013) and the awareness and simplicity of replay attacks, both industry (Nuance, 2013) and academia (Shang and Stevenson, 2010; Villalba and Lleida, 2011a,b; Wang et al., 2011) have shown an interest in developing replay attack countermeasures.
  • New accesses are identified as replay attacks if they produce a similarity score higher than a pre-defined threshold.
  • The motivation stems from the increase in noise and reverberation which occurs as a result of replaying farfield recordings.
  • Thus, the detection of channel effects beyond those introduced by the recording device of the ASV system serves as an indicator of replay attack.
  • While countermeasures are generally effective in reducing the FARs, they remain significantly higher than those of the respective baselines.
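The similarity-based detection idea above, comparing each new access against previously stored accesses, can be sketched as follows. This is an assumed toy implementation: the unit-norm spectral fingerprint and the 0.99 threshold are illustrative choices, not the published method.

```python
import numpy as np

def spectral_fingerprint(x, n_fft=512):
    """Unit-norm average magnitude spectrum: a crude utterance fingerprint."""
    frames = x[: len(x) // n_fft * n_fft].reshape(-1, n_fft)
    spec = np.abs(np.fft.rfft(frames, n_fft)).mean(axis=0)
    return spec / np.linalg.norm(spec)

def is_replay(new_access, stored_accesses, threshold=0.99):
    """Flag a new access that is nearly identical to a stored recording."""
    f_new = spectral_fingerprint(new_access)
    return any(float(f_new @ spectral_fingerprint(s)) > threshold
               for s in stored_accesses)

# Toy signals: a stored genuine access, a near-exact replay of it,
# and a fresh (different) utterance
t = np.arange(8192) / 16000.0
genuine  = np.sin(2 * np.pi * 440 * t)
replayed = genuine + 0.001 * np.random.default_rng(4).normal(size=8192)
fresh    = np.sin(2 * np.pi * 880 * t)
```

A real countermeasure would additionally look for far-field channel and reverberation effects, as described above, rather than rely on exact-match similarity alone.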

5.3. Speech synthesis

  • Speech synthesis, commonly referred to as text-to-speech (TTS), is a technique for generating intelligible, natural-sounding artificial speech for any arbitrary text.
  • In the 1990s, larger speech databases were collected and used to select more appropriate speech units that match both phonemes and other linguistic contexts, such as lexical stress and pitch accent, in order to generate high-quality, natural-sounding synthetic speech with appropriate prosody.
  • In the late 1990s another data-driven approach emerged.
  • Acoustic parameters generated from HMMs and selected according to the linguistic specification are used to drive a vocoder, a simplified speech production model with which speech is represented by vocal tract parameters and excitation parametersin order to generate a speech waveform.
  • The first three approaches are unlikely to be effective in ASV spoofing.

5.3.1. Spoofing

  • There is a considerable volume of research in the literature which has demonstrated the vulnerability of ASV to synthetic voices generated with a variety of approaches to speech synthesis (Lindberg et al., 1999; Foomany et al., 2009; Villalba and Lleida, 2010).
  • The work used acoustic models adapted to specific human speakers (Masuko et al., 1996, 1997) and was performed using an HMM-based, text-prompted ASV system (Matsui and Furui, 1995).
  • When subjected to spoofing attacks with synthetic speech, the FAR increased to over 70 %.
  • This result is due to the significant overlap in the distribution of ASV scores for genuine and synthetic speech, as shown in Figure 8.
  • All the past work confirms that speech synthesis attacks are able to increase significantly the FAR of all tested ASV systems, including those at the state of the art.

5.3.2. Countermeasures

  • Most approaches to detect synthetic speech rely on processing artefacts specific to a particular synthesis algorithm.
  • While estimates of this variance thus provide a means of discriminating between genuine and synthetic speech, such an approach is based on the full knowledge of a specific HMMbased speech synthesis system.
  • There are some attempts which focus on acoustic differences between vocoders and natural speech.
  • This simplification leads to differences in the phase spectra between human and synthetic speech, differences which can be utilised for discrimination (De Leon et al., 2012a; Wu et al., 2012a).
  • While the countermeasure investigated in (De Leon et al., 2012a) is shown to be effective in protecting both GMM-UBM and SVM systems from spoofing, as discussed above, most exploit prior knowledge of specific spoofing algorithms.

5.4. Voice conversion

  • Voice conversion aims to manipulate the speech of a given speaker so that it resembles in some sense that of another, target speaker (Stylianou, 2009; Evans et al., 2014a).
  • A straightforward approach to spectral mapping based on vector quantisation (VQ) was proposed in (Abe et al., 1988).
  • A mapping codebook is learned from source-target feature pairs and is then used to estimate target features from source features at runtime.
  • Among the most significant aspects of prosody investigated in the context of voice conversion are the fundamental frequency (F0) and duration.
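The VQ mapping-codebook approach of (Abe et al., 1988) can be sketched on synthetic features. This is an assumed toy setup: random 4-dimensional "spectral" vectors, a random-sample codebook in place of true k-means clustering, and a fabricated affine source-to-target relationship.

```python
import numpy as np

rng = np.random.default_rng(3)
# Paired source/target "spectral" features from a parallel corpus;
# here the target is an invented affine warp of the source, plus noise
source = rng.normal(size=(2000, 4))
target = source * 0.8 + 1.0 + rng.normal(scale=0.05, size=(2000, 4))

# Learn the mapping codebook: quantise source frames (k-means would be
# usual; a random-sample codebook keeps the sketch short) and store the
# mean target feature paired with each source centroid
K = 32
centroids = source[rng.choice(len(source), K, replace=False)]
assign = np.argmin(((source[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
codebook = np.array([target[assign == k].mean(axis=0) for k in range(K)])

def convert(frames):
    """Map each source frame to the target entry of its nearest centroid."""
    nearest = np.argmin(((frames[:, None] - centroids[None]) ** 2).sum(-1),
                        axis=1)
    return codebook[nearest]
```

The converted frames land much closer to the target features than the raw source frames do, which is the property a spoofing attacker exploits.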

5.4.1. Spoofing

  • Voice conversion has attracted increasing interest in the context of ASV spoofing for over a decade.
  • The work in (Perrot et al., 2005) evaluated the vulnerability of a GMM-UBM ASV system.
  • The work was performed on the 2006 NIST SRE dataset using both joint-density GMM and unit selection approaches to voice conversion.
  • The work in (Kons and Aronowitz, 2013) examined the vulnerability of several state-of-the-art text-dependent systems, namely, i-vector, GMM-NAP and HMM-NAP systems.
  • Even though some approaches to voice conversion produce speech with clearly audible artefacts (Chen et al., 2003; Toda et al., 2007; Erro et al., 2013), Table 5 shows that all provoke significant increases in the FAR across a variety of different ASV systems.

5.4.2. Countermeasures

  • Voice conversion bears some similarity to speech synthesis in that some voice conversion algorithms employ vocoding techniques similar to those used in statistical parametric speech synthesis (Zen et al., 2009).
  • The work in (Wu et al., 2012a) exploited artefacts introduced by the vocoder as a means of discriminating converted speech from natural speech.
  • Cosine normalised phase (cos-phase) and modified group delay phase (MGD-phase) features were shown to be effective.
  • Interestingly, baseline performance was not affected as a result of integrating spoofing countermeasures.
  • It shows that countermeasures are effective in protecting ASV systems from voice conversion attacks, and that performance with integrated countermeasures is not too dissimilar to baseline performance.
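The phase-based features mentioned above can be sketched as follows. This is a minimal variant with NumPy; the published cos-phase features involve additional phase unwrapping and normalisation steps not reproduced here.

```python
import numpy as np

def cos_phase_features(x, frame_len=512, hop=256):
    """Per-frame cosine of the FFT phase spectrum; cos() maps the
    wrapped phase into [-1, 1], removing 2*pi discontinuities."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    spec = np.fft.rfft(x[idx] * np.hanning(frame_len))
    return np.cos(np.angle(spec))

feats = cos_phase_features(np.random.default_rng(5).normal(size=4096))
```

A detector would then model these features for natural versus vocoded speech, exploiting the fact that many vocoders discard or simplify the original phase.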

6. Discussion

  • As discussed in Section 5, spoofing and countermeasures for ASV have been studied with various approaches to simulate spoofing attacks, different ASV systems, diverse experimental designs, and with a multitude of different datasets, evaluation protocols and metrics.
  • The lack of commonality makes the comparison of vulnerabilities and countermeasure performance extremely challenging.
  • Drawing carefully upon the literature and the authors’ own experience, the authors have nevertheless made such an attempt.

6.1. Spoofing

  • Each attack is compared in terms of accessibility and effectiveness.
  • Accessibility is intended to reflect the ease with which the attack may be performed, i.e. whether the technology is widely known and available or whether it is limited to the technically knowledgeable.
  • Even if the effectiveness is reduced in the case of randomised, phrase-prompted text-dependent systems, replay attacks are the most accessible approach to spoofing, requiring only a recording and playback device such as a tape recorder or a smart phone.
  • Trainable speech synthesis and publicly available voice conversion tools are already in the public domain, e.g. Festival (http://www.cstr.ed.ac.uk/projects/festival/) and Festvox (http://www.festvox.org/index.html), and it has been reported that some speech synthesis systems are able to produce speech comparable in quality to human speech.
  • Among the attacks considered in this paper, speech synthesis and voice conversion spoofing attacks may pose the greatest threat to ASV performance, and thus their effectiveness is high for both text-dependent and text-independent ASV systems.

6.2. Countermeasures

  • The vulnerability of ASV systems to each of the four attacks considered above has been confirmed by several independent studies.
  • Even so, efforts to develop countermeasures are relatively embryonic, lagging far behind the level of effort in the case of some other biometric modalities.
  • Since impersonated speech is entirely natural, there are no processing artefacts which might otherwise be useful for detection purposes.
  • Even if speech synthesis and voice conversion have attracted greater attention, the majority of existing countermeasures make unrealistic use of prior knowledge.
  • Furthermore, these countermeasures might be easily overcome if they are known to spoofing attackers.

6.3. Generalised countermeasures

  • All of the past work described above targets a specific form of spoofing and generally exploits some prior knowledge of a particular spoofing algorithm.
  • Hence, countermeasures based on processing artefacts indicative of a specific approach to spoofing may not generalise well in the face of varying attacks.
  • Recent work has thus investigated the reliability of specific countermeasures in the face of varying attacks.
  • The potential for generalised countermeasures is highlighted in independent studies of spoofing with synthetic speech (De Leon et al., 2012a) and converted voice (Wu et al., 2012a).
  • Longer-term or higher-level features were investigated in (Alegre et al., 2013c) in the form of local binary pattern (LBP) analysis, a technique originally developed for texture analysis in computer vision problems (Pietikäinen et al., 2011).

7. Issues for future research

  • As discussed in Section 5, the spoofing and countermeasure studies reported in the literature were conducted with different datasets, evaluation protocols and metrics.
  • Unfortunately, the lack of standards presents a fundamental barrier to the comparison of different research results.
  • This section discusses the current evaluation protocols and metrics and some weaknesses in the methodology.

7.1. Large-scale standard datasets

  • Past studies of impersonation and replay spoofing attacks were all conducted using small-scale datasets, with only small numbers of speakers.
  • While many of the past studies on speech synthesis and voice conversion spoofing attacks already employ large-scale datasets, e.g. NIST speaker recognition evaluation (SRE) corpora, they all require the use of non-standard speech synthesis and voice conversion algorithms in order to generate spoofed speech.
  • While past work is sufficient to demonstrate the potential of spoofing countermeasures, their performance is probably over-estimated.
  • In addition, most of the past countermeasure studies have been conducted under matched conditions, e.g. where speech samples used to optimise the countermeasure are collected in the same or similar acoustic environment and over the same or similar channel as those used for evaluation.
  • Large-scale, standard datasets are thus also needed in order that countermeasure performance can be evaluated not only with realistic channel or recording environment variability, but also in the absence of a priori knowledge and hence under variable attacks.

7.2. Evaluation metrics

  • While countermeasures can be integrated into existing ASV systems, they are most often implemented as independent modules which allow for the explicit detection of spoofing attacks.
  • Profiles 2 and 4 are dependent on the countermeasure threshold, whereas the comparison of profiles 1 and 4 is potentially misleading; they reflect simultaneous changes to both the system and the dataset.
  • (Footnote: Produced with the TABULA RASA Score toolkit: http://publications.idiap.ch/downloads/reports/2012/Anjos_.)
  • The EPSC explicitly reflects three types of error metrics, the FAR, FRR and SFAR, while still providing a single combined metric with a unique decision threshold.
  • Further work is thus required to design intuitive, universal metrics which represent the performance of spoofing countermeasures when combined with ASV.

7.3. Open-source software packages

  • As reflected throughout this article, spoofing and countermeasure studies involve a broad range of technologies, including ASV, speech synthesis and voice conversion.
  • Version 3.0 of ALIZE includes several state-of-the-art approaches including joint factor analysis (JFA), i-vector modelling and probabilistic linear discriminant analysis (PLDA) (Larcher et al., 2013a).
  • The Bob signal processing and machine learning toolbox is a general-purpose biometric toolkit which also includes ASV functionality (Anjos et al., 2012).
  • Popular solutions for feature extraction include SPro and the Hidden Markov Model Toolkit (HTK), which also includes extensive statistical modelling functionalities.
  • The HMM-based Speech Synthesis System (HTS) can be used to implement HMM-based speech synthesis as well as speaker model adaptation, whereas the Festvox toolkit can be used for voice conversion.

7.4. Future directions

  • On account of dataset availability, the majority of past work involves text-independent ASV, which is arguably more relevant to surveillance applications; text-dependent systems are generally more relevant to authentication.
  • The most obvious, accessible attack involves replay.

8. Conclusions

  • This article reviews the previous work to assess the vulnerability of automatic speaker verification systems to spoofing and the potential to protect them using dedicated countermeasures.
  • Even if there are currently no standard datasets, evaluation protocols or metrics with which to conduct meaningfully comparable or reproducible research, previous studies involving impersonation, replay, speech synthesis and voice conversion all indicate genuine vulnerabilities.
  • Finally, while there is potential for next generation countermeasures to detect varying spoofing attacks, a continuous arms race is likely; efforts to develop more sophisticated countermeasures will likely be accompanied by increased efforts to spoof automatic speaker verification systems.

Did you find this useful? Give us your feedback

Figures (13)
Citations
More filters
Journal ArticleDOI
TL;DR: An efficient and rather robust face spoof detection algorithm based on image distortion analysis (IDA) that outperforms the state-of-the-art methods in spoof detection and highlights the difficulty in separating genuine and spoof faces, especially in cross-database and cross-device scenarios.
Abstract: Automatic face recognition is now widely used in applications ranging from deduplication of identity to authentication of mobile payment. This popularity of face recognition has raised concerns about face spoof attacks (also known as biometric sensor presentation attacks), where a photo or video of an authorized person’s face could be used to gain access to facilities or services. While a number of face spoof detection techniques have been proposed, their generalization ability has not been adequately addressed. We propose an efficient and rather robust face spoof detection algorithm based on image distortion analysis (IDA). Four different features (specular reflection, blurriness, chromatic moment, and color diversity) are extracted to form the IDA feature vector. An ensemble classifier, consisting of multiple SVM classifiers trained for different face spoof attacks (e.g., printed photo and replayed video), is used to distinguish between genuine (live) and spoof faces. The proposed approach is extended to multiframe face spoof detection in videos using a voting-based scheme. We also collect a face spoof database, MSU mobile face spoofing database (MSU MFSD), using two mobile devices (Google Nexus 5 and MacBook Air) with three types of spoof attacks (printed photo, replayed video with iPhone 5S, and replayed video with iPad Air). Experimental results on two public-domain face spoof databases (Idiap REPLAY-ATTACK and CASIA FASD), and the MSU MFSD database show that the proposed approach outperforms the state-of-the-art methods in spoof detection. Our results also highlight the difficulty in separating genuine and spoof faces, especially in cross-database and cross-device scenarios.

716 citations


Cites background from "Spoofing and countermeasures for sp..."

  • ...State of the art Commercial Off-The-Shelf (COTS) face recognition systems are not well designed to differentiate spoof faces from genuine live faces....

    [...]

Journal ArticleDOI
TL;DR: A comparative study of human versus machine speaker recognition is concluded, with an emphasis on prominent speaker-modeling techniques that have emerged in the last decade for automatic systems.
Abstract: Identifying a person by his or her voice is an important human trait most take for granted in natural human-to-human interaction/communication. Speaking to someone over the telephone usually begins by identifying who is speaking and, at least in cases of familiar speakers, a subjective verification by the listener that the identity is correct and the conversation can proceed. Automatic speaker-recognition systems have emerged as an important means of verifying identity in many e-commerce applications as well as in general business interactions, forensics, and law enforcement. Human experts trained in forensic speaker recognition can perform this task even better by examining a set of acoustic, prosodic, and linguistic characteristics of speech in a general approach referred to as structured listening. Techniques in forensic speaker recognition have been developed for many years by forensic speech scientists and linguists to help reduce any potential bias or preconceived understanding as to the validity of an unknown audio sample and a reference template from a potential suspect. Experienced researchers in signal processing and machine learning continue to develop automatic algorithms to effectively perform speaker recognition?with ever-improving performance?to the point where automatic systems start to perform on par with human listeners. In this article, we review the literature on speaker recognition by machines and humans, with an emphasis on prominent speaker-modeling techniques that have emerged in the last decade for automatic systems. We discuss different aspects of automatic systems, including voice-activity detection (VAD), features, speaker models, standard evaluation data sets, and performance metrics. Human speaker recognition is discussed in two parts?the first part involves forensic speaker-recognition methods, and the second illustrates how a na?ve listener performs this task from a neuroscience perspective. 
We conclude this review with a comparative study of human versus machine speaker recognition and attempt to point out strengths and weaknesses of each.

554 citations

Journal ArticleDOI
TL;DR: An approach which combines speech signal analysis using the constant Q transform with traditional cepstral processing and results show that CQCC configuration is sensitive to the general form of spoofing attack and use case scenario suggests that the past single-system pursuit of generalised spoofing detection may need rethinking.

327 citations


Cites background or methods or result from "Spoofing and countermeasures for sp..."

  • ...The most potentially damaging spoofing attacks in this case are voice conversion and speech synthesis (Wu et al., 2015)....

    [...]

  • ...Their utility for spoofing detection was first demonstrated using the ASVspoof 2015 database (Wu et al., 2014, 2015) for which they were shown to outperform the previous best result by 72% relative (Todisco et al., 2016)....

    [...]

  • ...The first two databases, namely ASVspoof 2015 (Wu et al., 2014, 2015) and AVspoof (Ergunay et al., 2015), are publicly available and have already been used for competitive evaluations....

    [...]

  • ...A growing body of work has gauged the vulnerability of ASV systems to a diverse range of spoofing attacks (Evans et al., 2013; Wu et al., 2015)....

    [...]

  • ...This hypothesis is supported by the general findings of the recent ASVspoof 2015 challenge (Wu et al., 2015) and of the BTAS 2016 Speaker Anti-spoofing Competition (Korshunov et al....

    [...]

Proceedings ArticleDOI
06 Sep 2015
TL;DR: Comparative results indicate that features representing spectral information in the high-frequency region, dynamic information of speech, and detailed information related to subband characteristics are considerably more useful for the synthetic speech detection task.
Abstract: The performance of biometric systems based on automatic speaker recognition technology is severely degraded due to spoofing attacks with synthetic speech generated using different voice conversion (VC) and speech synthesis (SS) techniques. Various countermeasures are proposed to detect this type of attack, and in this context, choosing an appropriate feature extraction technique for capturing relevant information from speech is an important issue. This paper presents a concise experimental review of different features for the synthetic speech detection task. A wide variety of features considered in this study include previously investigated features as well as some other potentially useful features for characterizing real and synthetic speech. The experiments are conducted on the recently released ASVspoof 2015 corpus containing speech data from a large number of VC and SS techniques. Comparative results using two different classifiers indicate that features representing spectral information in the high-frequency region, dynamic information of speech, and detailed information related to subband characteristics are considerably more useful in detecting synthetic speech. Index Terms: anti-spoofing, ASVspoof 2015, feature extraction, countermeasures
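The two feature families the abstract singles out, high-frequency subband energies and dynamic (delta) information, can be sketched roughly as follows. This is a toy illustration with assumed frame, hop, and band settings, not the paper's exact front-end.

```python
import numpy as np

def log_subband_energies(signal, sr, frame_len=400, hop=160, n_bands=8):
    """Per-frame log energies in linearly spaced subbands; the upper bands
    capture the high-frequency detail found most discriminative for
    synthetic speech detection."""
    n_fft = 512
    frames = [signal[s:s + frame_len] * np.hanning(frame_len)
              for s in range(0, len(signal) - frame_len + 1, hop)]
    spec = np.abs(np.fft.rfft(np.array(frames), n_fft)) ** 2  # power spectrum
    edges = np.linspace(0, spec.shape[1], n_bands + 1).astype(int)
    feats = np.array([[spec[t, edges[b]:edges[b + 1]].sum()
                       for b in range(n_bands)] for t in range(len(frames))])
    return np.log(feats + 1e-10)

def deltas(feats):
    """First-order dynamic (delta) features via simple frame differencing."""
    return np.diff(feats, axis=0, prepend=feats[:1])
```

A high-frequency tone, for instance, concentrates its energy in the upper bands, which is exactly the region such features expose to the classifier.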

313 citations

Proceedings ArticleDOI
06 Sep 2015
TL;DR: The ASVspoof initiative as discussed by the authors aims to overcome the bottleneck through the provision of standard corpora, protocols and metrics to support a common evaluation, and summarizes the results and discusses directions for future challenges and research.
Abstract: An increasing number of independent studies have confirmed the vulnerability of automatic speaker verification (ASV) technology to spoofing. However, in comparison to that involving other biometric modalities, spoofing and countermeasure research for ASV is still in its infancy. A current barrier to progress is the lack of standards, which impedes the comparison of results generated by different researchers. The ASVspoof initiative aims to overcome this bottleneck through the provision of standard corpora, protocols and metrics to support a common evaluation. This paper introduces the first edition, summarizes the results and discusses directions for future challenges and research.

248 citations

References
More filters
Journal ArticleDOI
TL;DR: This article provides an overview of progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
Abstract: Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models (GMMs) to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feed-forward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output. Deep neural networks (DNNs) that have many hidden layers and are trained using new methods have been shown to outperform GMMs on a variety of speech recognition benchmarks, sometimes by a large margin. This article provides an overview of this progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
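In the hybrid setup the abstract describes, the feed-forward network emits posteriors p(s|x) over HMM states, and the decoder consumes them as scaled likelihoods via Bayes' rule: log p(x|s) + const = log p(s|x) - log p(s). A minimal numpy sketch of that conversion (all names, layer sizes, and dimensions are illustrative, not a real acoustic model):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax along the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dnn_state_posteriors(frames, W1, b1, W2, b2):
    """One-hidden-layer feed-forward pass: stacked acoustic frames in,
    softmax posteriors over HMM states out."""
    h = np.maximum(0.0, frames @ W1 + b1)      # ReLU hidden layer
    return softmax(h @ W2 + b2)

def scaled_log_likelihoods(posteriors, state_priors):
    """Hybrid trick: log p(x|s) + const = log p(s|x) - log p(s);
    these replace GMM likelihoods inside the HMM decoder."""
    return np.log(posteriors) - np.log(state_priors)
```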

9,091 citations


"Spoofing and countermeasures for sp..." refers background in this paper

  • ...In addition to the four major approaches, inspired by advances in deep neural network (DNN)-based speech recognition (Hinton et al., 2012), new data-driven, DNN-based approaches have also been actively investigated (Zen et al., 2013; Ling et al., 2013; Lu et al., 2013; Qian et al., 2014)....

    [...]


01 Jan 2006

5,265 citations

Journal ArticleDOI
TL;DR: The major elements of MIT Lincoln Laboratory's Gaussian mixture model (GMM)-based speaker verification system used successfully in several NIST Speaker Recognition Evaluations (SREs) are described.

4,673 citations


"Spoofing and countermeasures for sp..." refers methods in this paper

  • ...GMMs have been used intensively and their combination with a universal background model (UBM) has become the de facto standard, commonly referred to as the GMM-UBM approach (Reynolds et al., 2000)....

    [...]

Journal ArticleDOI
TL;DR: An extension of the previous work which proposes a new speaker representation for speaker verification, a new low-dimensional speaker- and channel-dependent space is defined using a simple factor analysis, named the total variability space because it models both speaker and channel variabilities.
Abstract: This paper presents an extension of our previous work which proposes a new speaker representation for speaker verification. In this modeling, a new low-dimensional speaker- and channel-dependent space is defined using a simple factor analysis. This space is named the total variability space because it models both speaker and channel variabilities. Two speaker verification systems are proposed which use this new representation. The first system is a support vector machine-based system that uses the cosine kernel to estimate the similarity between the input data. The second system directly uses the cosine similarity as the final decision score. We tested three channel compensation techniques in the total variability space, which are within-class covariance normalization (WCCN), linear discriminant analysis (LDA), and nuisance attribute projection (NAP). We found that the best results are obtained when LDA is followed by WCCN. We achieved an equal error rate (EER) of 1.12% and MinDCF of 0.0094 using the cosine distance scoring on the male English trials of the core condition of the NIST 2008 Speaker Recognition Evaluation dataset. We also obtained 4% absolute EER improvement for both-gender trials on the 10 s-10 s condition compared to the classical joint factor analysis scoring.
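The cosine scoring and WCCN compensation described in the abstract can be sketched as follows. This is a simplified numpy illustration under our own function names and data layout, not the authors' implementation; LDA, applied before WCCN in their best system, is omitted for brevity.

```python
import numpy as np

def cosine_score(w_enrol, w_test):
    """Cosine similarity between two session-compensated i-vectors,
    used directly as the verification score."""
    return float(w_enrol @ w_test /
                 (np.linalg.norm(w_enrol) * np.linalg.norm(w_test)))

def wccn_projection(ivectors, labels):
    """WCCN: learn B with B B^T = W^{-1}, where W is the average
    within-class (within-speaker) covariance of the training i-vectors.
    ivectors: (N, D) array; labels: speaker label per i-vector.
    Compensated vectors are obtained as B.T @ w."""
    D = ivectors.shape[1]
    W = np.zeros((D, D))
    classes = np.unique(labels)
    for c in classes:
        Xc = ivectors[labels == c]
        Xc = Xc - Xc.mean(axis=0)
        W += Xc.T @ Xc / len(Xc)
    W /= len(classes)
    return np.linalg.cholesky(np.linalg.inv(W))  # Cholesky factor B
```

After projection, same-speaker i-vector pairs should score higher than different-speaker pairs, which is the property the verification threshold exploits.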

3,526 citations


"Spoofing and countermeasures for sp..." refers background in this paper

  • ...JFA subsequently evolved into a simplified total variability model or ‘i-vector’ approach which is now the state of the art (Dehak et al., 2011)....

    [...]

  • ...With more modern techniques, X can also be high-dimensional i-vectors (Dehak et al., 2011) modelled with probabilistic linear discriminant analysis (PLDA) back-ends (Li et al., 2012) (see below)....

    [...]

Journal ArticleDOI
TL;DR: The individual Gaussian components of a GMM are shown to represent some general speaker-dependent spectral shapes that are effective for modeling speaker identity and is shown to outperform the other speaker modeling techniques on an identical 16 speaker telephone speech task.
Abstract: This paper introduces and motivates the use of Gaussian mixture models (GMM) for robust text-independent speaker identification. The individual Gaussian components of a GMM are shown to represent some general speaker-dependent spectral shapes that are effective for modeling speaker identity. The focus of this work is on applications which require high identification rates using short utterances from unconstrained conversational speech and robustness to degradations produced by transmission over a telephone channel. A complete experimental evaluation of the Gaussian mixture speaker model is conducted on a 49 speaker, conversational telephone speech database. The experiments examine algorithmic issues (initialization, variance limiting, model order selection), spectral variability robustness techniques, large population performance, and comparisons to other speaker modeling techniques (uni-modal Gaussian, VQ codebook, tied Gaussian mixture, and radial basis functions). The Gaussian mixture speaker model attains 96.8% identification accuracy using 5 second clean speech utterances and 80.8% accuracy using 15 second telephone speech utterances with a 49 speaker population and is shown to outperform the other speaker modeling techniques on an identical 16 speaker telephone speech task.
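The GMM speaker model described above scores a test utterance by its average per-frame log-likelihood under each speaker's mixture and, for closed-set identification, picks the highest-scoring speaker. A minimal diagonal-covariance sketch in numpy (function names, shapes, and the dictionary-based model store are illustrative):

```python
import numpy as np

def gmm_log_likelihood(X, weights, means, variances):
    """Average per-frame log-likelihood of feature frames X (T, D) under a
    diagonal-covariance GMM: weights (M,), means and variances (M, D)."""
    diff = X[:, None, :] - means[None, :, :]                 # (T, M, D)
    log_comp = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                       + np.sum(np.log(2 * np.pi * variances), axis=1))
    # log sum_m w_m N(x | mu_m, var_m) via log-sum-exp for stability
    log_weighted = log_comp + np.log(weights)                # (T, M)
    m = log_weighted.max(axis=1, keepdims=True)
    ll = m[:, 0] + np.log(np.exp(log_weighted - m).sum(axis=1))
    return ll.mean()

def identify(X, speaker_models):
    """Closed-set identification: return the speaker whose GMM scores highest.
    speaker_models maps speaker id -> (weights, means, variances)."""
    scores = {spk: gmm_log_likelihood(X, *params)
              for spk, params in speaker_models.items()}
    return max(scores, key=scores.get)
```

In the GMM-UBM verification variant mentioned in the source article, the same likelihood is instead compared against a universal background model to form a log-likelihood ratio.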

3,134 citations


Additional excerpts

  • ...In the classical approach (Reynolds and Rose, 1995), features X are typically MFCCs and the acoustic models are Gaussian mixture models (GMMs) (see below)....

    [...]