Journal ArticleDOI

Spoofing and countermeasures for speaker verification

TL;DR: A survey of past work and priority research directions for the future is provided, showing that future research should address the lack of standard datasets and the over-fitting of existing countermeasures to specific, known spoofing attacks.
About: This article is published in Speech Communication.The article was published on 2015-02-01 and is currently open access. It has received 433 citations till now. The article focuses on the topics: Spoofing attack & IP address spoofing.

Summary (9 min read)

2. Automatic speaker verification

  • There are two types of ASV systems: text-dependent and text-independent.
  • Text-independent systems operate on arbitrary utterances, possibly spoken in different languages (Campbell Jr, 1997).
  • On account of evaluation sponsorship and dataset availability, text-independent ASV dominates the field and the research tends to place greater emphasis on surveillance applications rather than authentication.

2.1. Feature extraction

  • A speech signal conveys three levels of information: voice timbre, prosody and language content.
  • Correspondingly, speaker individuality can be characterised by short-term spectral, prosodic and high-level idiolectal features.
  • Short-term spectral features are extracted from short frames, typically of 20-30 milliseconds duration.
  • Prosodic features, such as pitch, energy and duration, are less sensitive to channel effects.
  • The extraction of high-level features requires considerably more complex front-ends, such as those which employ automatic speech recognition (Kinnunen and Li, 2010; Li and Ma, 2010).
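As an illustration of the short-term analysis described above, the following sketch frames a signal into overlapping windows and computes per-frame log-magnitude spectra, the starting point for spectral features such as MFCCs. This is a minimal, assumed setup using only NumPy; the frame and hop sizes are typical choices, not values taken from the paper.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a signal into overlapping short-term frames
    (400 samples = 25 ms and 160 samples = 10 ms at 16 kHz)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(frame_len)

def log_spectral_features(x, n_fft=512):
    """Per-frame log-magnitude spectrum; a mel filterbank and DCT
    would follow to obtain MFCCs."""
    frames = frame_signal(x)
    mag = np.abs(np.fft.rfft(frames, n_fft))
    return np.log(mag + 1e-10)

# One second of a synthetic 440 Hz tone at 16 kHz as a stand-in signal
t = np.arange(16000) / 16000.0
feats = log_spectral_features(np.sin(2 * np.pi * 440 * t))
```

With these settings, one second of audio yields 98 frames of 257 spectral bins each.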

2.2. Speaker modeling and classification

  • Approaches to text-independent ASV generally focus on modelling the feature distribution of a target speaker.
  • There are many different ways to implement Eq. (1).
  • With more modern techniques, X can also be high-dimensional i-vectors (Dehak et al., 2011) modelled with probabilistic linear discriminant analysis (PLDA) back-ends (Li et al., 2012) (see below).
  • Even so, GMMs are still needed for i-vector extraction and thus the authors provide a more detailed presentation of the GMM in the following.
  • The target speaker and UBM models are used as the hypothesised and alternative speaker models respectively.
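The GMM-UBM likelihood-ratio test described above can be illustrated on toy data. The sketch below is an assumed simplification: the target model is trained independently with scikit-learn's GaussianMixture, whereas a full system would derive it by MAP adaptation of the UBM; the 2-D "features" are invented for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy 2-D "features": a target speaker and a background population
target_train = rng.normal(loc=2.0, scale=1.0, size=(500, 2))
background   = rng.normal(loc=0.0, scale=2.0, size=(2000, 2))

# Universal background model (UBM) and target model
ubm    = GaussianMixture(n_components=4, random_state=0).fit(background)
target = GaussianMixture(n_components=4, random_state=0).fit(target_train)

def llr_score(X):
    """Average log-likelihood ratio between the hypothesised (target)
    and alternative (UBM) speaker models; .score() is the mean
    per-sample log-likelihood."""
    return target.score(X) - ubm.score(X)

genuine  = llr_score(rng.normal(2.0, 1.0, size=(200, 2)))
impostor = llr_score(rng.normal(0.0, 2.0, size=(200, 2)))
```

Genuine trials score above impostor trials, and a threshold on this ratio yields the accept/reject decision.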

2.3. System fusion

  • In addition to the development of increasingly robust models and classifiers, there is a significant emphasis within the ASV community on the study of classifier fusion.
  • The motivation is based on the assumption that multiple, independently trained recognisers together capture different aspects of the speech signal not covered by a single classifier alone.
  • Fusion also provides a convenient vehicle for large-scale research collaborations promoting independent classifier development and benchmarking (Saeidi et al., 2013).
  • Different sub-systems can involve different features, classifiers, or hyper-parameter training sets (Brümmer et al., 2007; Hautamäki et al., 2013b).
  • A simple yet robust approach to fusion involves the weighted summation of the base classifier scores, where the weights are optimised according to a logistic regression cost function.
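A minimal sketch of score fusion by logistic regression follows; the base-classifier score distributions are invented for illustration, and scikit-learn is assumed for the weight optimisation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
# Invented scores from two base classifiers over 1000 verification trials
labels = rng.integers(0, 2, size=1000)           # 1 = target, 0 = impostor
s1 = labels * 1.5 + rng.normal(0, 1.0, 1000)     # base classifier 1 scores
s2 = labels * 1.0 + rng.normal(0, 1.2, 1000)     # base classifier 2 scores
S = np.column_stack([s1, s2])

# Logistic regression learns the fusion weights; the fused score is
# a weighted sum of the base scores plus a bias
fusion = LogisticRegression().fit(S, labels)
fused = S @ fusion.coef_.ravel() + fusion.intercept_[0]
```

In practice the weights would be trained on a held-out development set rather than on the evaluation trials themselves.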

3.1. Possible attack points

  • A typical ASV system involves two processes: offline enrolment and runtime verification.
  • At verification time, features extracted from a test sample are compared to the enrolled model in order to determine whether or not the speaker matches the claimed identity.
  • The classifier determines a match score which represents the relative similarity of the sample to each of the two models.
  • These components and the links between them all represent possible attack points (Ratha et al., 2001).
  • In past studies of ASV spoofing, impersonation and replay attacks are assumed to apply at the microphone.

3.2. Potential vulnerabilities

  • This section explains the potential for typical ASV systems to be spoofed.
  • The authors focus on two key ASV modules: feature extraction and speaker modelling.

3.2.1. Feature extraction

  • All three feature representations described in Section 2.1 are potentially vulnerable to spoofing attacks.
  • Due to their simplicity and performance, short-term spectral features are the most popular.
  • Ignoring any channel effects, replay attacks which use a pre-recorded speech sample can faithfully reflect the spectral attributes of the original speaker.
  • Voice conversion can also generate speech signals whose spectral envelope reflects that of a target speaker (Matrouf et al., 2006).

3.2.2. Speaker modeling

  • Most approaches to speaker modelling, be they applied to text-independent or text-dependent ASV, have their roots in the standard GMM.
  • Most lack the modelling of temporal sequence information, a key characteristic of human speech, which might otherwise afford some protection from spoofing; the feature distribution models used in typical speech synthesis and voice conversion algorithms likewise assume independence between observations, yet remain effective as spoofing attacks.
  • As shown in (Kons and Aronowitz, 2013), HMM-based systems, which capture temporal information, are more robust to spoofing than GMM-based systems when subject to the same spoofing attack.
  • While preliminary studies of fused ASV system approaches to anti-spoofing were reported in (Riera et al., 2012), some insight into their likely full potential can be gained from related work in fused, multi-modal biometric systems.
  • The authors note, however, that (Rodrigues et al., 2009; Akhtar et al., 2012) suggest it might suffice to spoof only one modality (or sub-system) under a score fusion setting in the case where the spoofing of a single, significantly weighted sub-system is particularly effective.

4. Evaluation protocol

  • Here the authors present a generic experimental protocol which applies to the majority of past work.
  • The authors discuss database design and evaluation metrics with a focus on the comparability of baseline results with those of vulnerability and countermeasure studies.

4.1. Dataset design

  • While past studies of spoofing have used a range of different datasets (Alegre et al., 2014) there are some similarities in the experimental protocols.
  • The diagram illustrates three possible inputs: genuine, zero-effort impostor and spoofed speech.
  • A new dataset suitable for the study of spoofing is derived from the baseline by replacing all impostor trials with spoofed trials.
  • Referring once again to Figure 3, baseline performance is assessed using the pool of M genuine trials (a) and N impostor trials (b), while that under spoofing is assessed with the pool of M genuine trials (a) and N spoofing trials (c).
  • The baseline performance and that under spoofing is thus directly comparable and the difference between them reflects the vulnerability of the system to the particular spoofing attack considered.

4.2. Evaluation metrics

  • The evaluation of ASV systems requires large numbers of two distinct tests: target tests, where the speaker matches the claimed identity, and impostor tests, where the identities differ.
  • There are two possible correct outcomes and two possible incorrect outcomes, namely false acceptance (or false alarm) and false rejection (or miss).
  • The FAR and FRR are complementary in the sense that, for a variable threshold and otherwise fixed system, one can only be reduced at the expense of increasing the other.
  • Equivalently, spoofing attacks will increase the FAR for a fixed decision threshold optimised on the standard baseline ASV dataset.
  • Nonetheless, similar to the decisions of a regular ASV system as illustrated in Figure 4, a practical, stand-alone countermeasure will inevitably lead to some false acceptances, where a spoofing attack remains undetected, in addition to false rejections, where genuine attempts are identified as spoofing attacks.
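The error trade-off described above can be made concrete with a small sketch (NumPy only; the score distributions are invented). It computes FAR and FRR at a threshold, sweeps thresholds for the equal error rate (EER), and shows how replacing impostor scores with higher "spoofed" scores raises the FAR at a fixed threshold.

```python
import numpy as np

def far_frr(target_scores, impostor_scores, threshold):
    """False acceptance rate and false rejection rate at one threshold."""
    far = np.mean(np.asarray(impostor_scores) >= threshold)
    frr = np.mean(np.asarray(target_scores) < threshold)
    return far, frr

def equal_error_rate(target_scores, impostor_scores):
    """Sweep candidate thresholds and return the point where FAR = FRR."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    rates = [far_frr(target_scores, impostor_scores, t) for t in thresholds]
    far, frr = min(rates, key=lambda r: abs(r[0] - r[1]))
    return (far + frr) / 2

rng = np.random.default_rng(2)
targets   = rng.normal(2.0, 1.0, 1000)
impostors = rng.normal(0.0, 1.0, 1000)
# Spoofed trials: invented scores shifted towards the target distribution
spoofed   = rng.normal(1.5, 1.0, 1000)

thr = 1.0  # a threshold fixed near the baseline operating point
far_base, frr_base = far_frr(targets, impostors, thr)
far_spoof, _ = far_frr(targets, spoofed, thr)
```

At the fixed threshold, the spoofed trials produce a much larger FAR than the zero-effort impostors, which is exactly the vulnerability the protocol in Section 4.1 measures.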

5. Spoofing and countermeasures

  • This section reviews past work to evaluate the vulnerability of typical ASV systems to spoofing and parallel efforts to develop anti-spoofing countermeasures.
  • Spoofing implies an attack at either the microphone or transmission level using a manipulated or synthesised speech sample in order to bias the system towards verifying a claimed identity.
  • The authors consider impersonation, replay, speech synthesis and voice conversion while concentrating on three different aspects: (i) the practicality of each spoofing attack; (ii) the vulnerability of ASV systems when subjected to such attacks, and (iii) the design of realistic datasets for experimentation.
  • With regard to countermeasures, the authors focus on: (i) the effectiveness of a countermeasure in preventing specific spoofing attacks, and (ii) the generalisation of countermeasures in protecting against varying attacks.

5.1. Impersonation

  • Impersonation is one of the most obvious approaches to spoofing and refers to attacks using human-altered voices, otherwise referred to as human mimicking.
  • Here, an attacker tries to mimic a target speaker’s voice timbre and prosody without computer-aided technologies.

5.1.1. Spoofing

  • The work in (Lau et al., 2004) showed that non-professional impersonators can readily adapt their voice to overcome ASV, but only when their natural voice is already similar to that of the target speaker (closest targets were selected from the YOHO corpus using a speaker recognition system).
  • One of the key observations was that the change in the vocal space (measured through F1 and F2) under impersonation cannot be described by a simple global transform; formant changes are vowel-specific.
  • Since impersonation is thought to involve mostly the mimicking of prosodic and stylistic cues, it is perhaps considered more effective in fooling human listeners than today’s state-of-the-art ASV systems (Perrot et al., 2005; Hautamäki et al., 2014).
  • The work in (Lau et al., 2005) and (Stoll and Doddington, 2010) showed how ASV systems themselves or even acoustic features alone may be employed to identify ‘similar’ speakers in order to provoke false acceptances.
  • Past studies involving impersonation attacks are summarised in Table 2.

5.1.2. Countermeasures

  • Since the threat of impersonation is not fully understood, it is perhaps not surprising that there is virtually no prior work investigating countermeasures against impersonation.
  • Unlike the spoofing attacks discussed below, all of which can be assumed to leave traces of the physical properties of the recording andplayback devices, or signal processing artefacts from synthesis or conversion systems, impersonators are live human beings who produce entirely natural speech.
  • Interestingly, some related work (Amin et al., 2013, 2014) has addressed the problem of disguise detection.
  • Specifically, the disguise detectors in (Amin et al., 2013, 2014) used a quadratic discriminant on the first two formants to quantify the amount of acoustic variation on a vowel-by-vowel basis.
  • (Footnote: While the spoofing attacks discussed in this article are meant to increase the false acceptance rate, disguise is the opposite problem, where one wishes not to be recognised as oneself, thereby increasing the false rejection (miss) rate.)

5.2. Replay

  • Replay is a form of spoofing whereby an adversary attacks an ASV system using a pre-recorded speech sample collected from a genuine target speaker.
  • The speech sample can be any recording captured surreptitiously, and even concatenated speech samples extracted from a number of shorter segments, for example to overcome text-dependent ASV systems (Villalba and Lleida, 2011b).
  • In addition, due to the availability of high-quality, low-cost recording devices, such as smart phones, replay spoofing attacks are arguably the most accessible and therefore present a significant threat.
  • Here, a smart phone is used to replay a pre-recorded speech sample in order to unlock another smart phone which uses speaker verification technology for logical access authentication.
  • The left phone (black color) is the smart phone with a voice-unlock function for user authentication as reported in (Lee et al., 2013).

5.2.1. Spoofing

  • Even though they are among the simplest and most easily implemented attacks, only a small number of studies have addressed replay attacks.
  • Vulnerabilities to replay attack were first evaluated in (Lindberg et al., 1999).
  • The significant variation between male and female speakers is likely explained by the use of only a single speaker of each gender.
  • It shows that the spectrogram and formant trajectories of the replayed speech (upper images) are highly similar to those of the genuine speech (lower images).

5.2.2. Countermeasures

  • Recently, due to the mass-market adoption of ASV techniques (Lee et al., 2013; Nuance, 2013) and the awareness and simplicity of replay attacks, both industry (Nuance, 2013) and academia (Shang and Stevenson, 2010; Villalba and Lleida, 2011a,b; Wang et al., 2011) have shown an interest in developing replay attack countermeasures.
  • New accesses are identified as replay attacks if they produce a similarity score higher than a pre-defined threshold.
  • The motivation stems from the increase in noise and reverberation which occurs as a result of replaying farfield recordings.
  • Thus, the detection of channel effects beyond those introduced by the recording device of the ASV system serves as an indicator of replay attack.
  • While countermeasures are generally effective in reducing the FARs, they remain significantly higher than those of the respective baselines.
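The similarity-based detection idea above, comparing each new access against previously stored accesses, can be sketched as follows. This is an assumed toy implementation: the unit-norm spectral fingerprint and the 0.99 threshold are illustrative choices, not the published method.

```python
import numpy as np

def spectral_fingerprint(x, n_fft=512):
    """Unit-norm average magnitude spectrum: a crude utterance fingerprint."""
    frames = x[: len(x) // n_fft * n_fft].reshape(-1, n_fft)
    spec = np.abs(np.fft.rfft(frames, n_fft)).mean(axis=0)
    return spec / np.linalg.norm(spec)

def is_replay(new_access, stored_accesses, threshold=0.99):
    """Flag a new access that is nearly identical to a stored recording."""
    f_new = spectral_fingerprint(new_access)
    return any(float(f_new @ spectral_fingerprint(s)) > threshold
               for s in stored_accesses)

# Toy signals: a stored genuine access, a near-exact replay of it,
# and a fresh (different) utterance
t = np.arange(8192) / 16000.0
genuine  = np.sin(2 * np.pi * 440 * t)
replayed = genuine + 0.001 * np.random.default_rng(4).normal(size=8192)
fresh    = np.sin(2 * np.pi * 880 * t)
```

A real countermeasure would additionally look for far-field channel and reverberation effects, as described above, rather than rely on exact-match similarity alone.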

5.3. Speech synthesis

  • Speech synthesis, commonly referred to as text-to-speech (TTS), is a technique for generating intelligible, natural-sounding artificial speech for any arbitrary text.
  • In the 1990s, larger speech databases were collected and used to select more appropriate speech units that match both phonemes and other linguistic contexts, such as lexical stress and pitch accent, in order to generate high-quality, natural-sounding synthetic speech with appropriate prosody.
  • In the late 1990s another data-driven approach emerged.
  • Acoustic parameters generated from HMMs and selected according to the linguistic specification are used to drive a vocoder, a simplified speech production model with which speech is represented by vocal tract parameters and excitation parametersin order to generate a speech waveform.
  • The first three approaches are unlikely to be effective in ASV spoofing.

5.3.1. Spoofing

  • There is a considerable volume of research in the literature which has demonstrated the vulnerability of ASV to synthetic voices generated with a variety of approaches to speech synthesis (Lindberg et al., 1999; Foomany et al., 2009; Villalba and Lleida, 2010).
  • The work used acoustic models adapted to specific human speakers (Masuko et al., 1996, 1997) and was performed using an HMM-based, text-prompted ASV system (Matsui and Furui, 1995).
  • When subjected to spoofing attacks with synthetic speech, the FAR increased to over 70 %.
  • This result is due to the significant overlap in the distribution of ASV scores for genuine and synthetic speech, as shown in Figure 8.
  • All the past work confirms that speech synthesis attacks are able to increase significantly the FAR of all tested ASV systems, including those at the state of the art.

5.3.2. Countermeasures

  • Most approaches to detect synthetic speech rely on processing artefacts specific to a particular synthesis algorithm.
  • While estimates of this variance thus provide a means of discriminating between genuine and synthetic speech, such an approach is based on the full knowledge of a specific HMMbased speech synthesis system.
  • There are some attempts which focus on acoustic differences between vocoders and natural speech.
  • This simplification leads to differences in the phase spectra between human and synthetic speech, differences which can be utilised for discrimination (De Leon et al., 2012a; Wu et al., 2012a).
  • While the countermeasure investigated in (De Leon et al., 2012a) is shown to be effective in protecting both GMM-UBM and SVM systems from spoofing, as discussed above, most exploit prior knowledge of specific spoofing algorithms.

5.4. Voice conversion

  • Voice conversion aims to manipulate the speech of a given speaker so that it resembles in some sense that of another, target speaker (Stylianou, 2009; Evans et al., 2014a).
  • A straightforward approach to spectral mapping based on vector quantisation (VQ) was proposed in (Abe et al., 1988).
  • A mapping codebook is learned from source-target feature pairs and is then used to estimate target features from source features at runtime.
  • Among the most significant aspects of prosody investigated in the context of voice conversion are the fundamental frequency (F0) and duration.
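The VQ mapping-codebook approach of (Abe et al., 1988) can be sketched on synthetic features. This is an assumed toy setup: random 4-dimensional "spectral" vectors, a random-sample codebook in place of true k-means clustering, and a fabricated affine source-to-target relationship.

```python
import numpy as np

rng = np.random.default_rng(3)
# Paired source/target "spectral" features from a parallel corpus;
# here the target is an invented affine warp of the source, plus noise
source = rng.normal(size=(2000, 4))
target = source * 0.8 + 1.0 + rng.normal(scale=0.05, size=(2000, 4))

# Learn the mapping codebook: quantise source frames (k-means would be
# usual; a random-sample codebook keeps the sketch short) and store the
# mean target feature paired with each source centroid
K = 32
centroids = source[rng.choice(len(source), K, replace=False)]
assign = np.argmin(((source[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
codebook = np.array([target[assign == k].mean(axis=0) for k in range(K)])

def convert(frames):
    """Map each source frame to the target entry of its nearest centroid."""
    nearest = np.argmin(((frames[:, None] - centroids[None]) ** 2).sum(-1),
                        axis=1)
    return codebook[nearest]
```

The converted frames land much closer to the target features than the raw source frames do, which is the property a spoofing attacker exploits.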

5.4.1. Spoofing

  • Voice conversion has attracted increasing interest in the context of ASV spoofing for over a decade.
  • The work in (Perrot et al., 2005) evaluated the vulnerability of a GMM-UBM ASV system.
  • The work was performed on the 2006 NIST SRE dataset using both joint-density GMM and unit selection approaches to voice conversion.
  • The work in (Kons and Aronowitz, 2013) examined the vulnerability of several state-of-the-art text-dependent systems, namely, i-vector, GMM-NAP and HMM-NAP systems.
  • Even though some approaches to voice conversion produce speech with clearly audible artefacts (Chen et al., 2003; Toda et al., 2007; Erro et al., 2013), Table 5 shows that all provoke significant increases in the FAR across a variety of different ASV systems.

5.4.2. Countermeasures

  • Voice conversion bears some similarity to speech synthesis in that some voice conversion algorithms employ vocoding techniques similar to those used in statistical parametric speech synthesis (Zen et al., 2009).
  • The work in (Wu et al., 2012a) exploited artefacts introduced by the vocoder as a means of discriminating converted speech from natural speech.
  • Cosine normalised phase (cos-phase) and modified group delay phase (MGD-phase) features were shown to be effective.
  • Interestingly, baseline performance was not affected as a result of integrating spoofing countermeasures.
  • It shows that countermeasures are effective in protecting ASV systems from voice conversion attacks, and that performance with integrated countermeasures is not too dissimilar to baseline performance.
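The phase-based features mentioned above can be sketched as follows. This is a minimal variant with NumPy; the published cos-phase features involve additional phase unwrapping and normalisation steps not reproduced here.

```python
import numpy as np

def cos_phase_features(x, frame_len=512, hop=256):
    """Per-frame cosine of the FFT phase spectrum; cos() maps the
    wrapped phase into [-1, 1], removing 2*pi discontinuities."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    spec = np.fft.rfft(x[idx] * np.hanning(frame_len))
    return np.cos(np.angle(spec))

feats = cos_phase_features(np.random.default_rng(5).normal(size=4096))
```

A detector would then model these features for natural versus vocoded speech, exploiting the fact that many vocoders discard or simplify the original phase.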

6. Discussion

  • As discussed in Section 5, spoofing and countermeasures for ASV have been studied with various approaches to simulate spoofing attacks, different ASV systems, diverse experimental designs, and with a multitude of different datasets, evaluation protocols and metrics.
  • The lack of commonality makes the comparison of vulnerabilities and countermeasure performance extremely challenging.
  • Drawing carefully upon the literature and the authors’ own experience, the authors have nevertheless made such an attempt.

6.1. Spoofing

  • Each attack is compared in terms of accessibility and effectiveness.
  • Accessibility is intended to reflect the ease with which the attack may be performed, i.e. whether the technology is widely known and available or whether it is limited to the technically knowledgeable.
  • Even if the effectiveness is reduced in the case of randomised, phrase-prompted text-dependent systems, replay attacks are the most accessible approach to spoofing, requiring only a recording and playback device such as a tape recorder or a smart phone.
  • Trainable speech synthesis and publicly available voice conversion tools are already in the public domain, e.g. Festival (http://www.cstr.ed.ac.uk/projects/festival/) and Festvox (http://www.festvox.org/index.html), and it has been reported that some speech synthesis systems are able to produce speech comparable in quality to human speech.
  • Among the attacks considered in this paper, speech synthesis and voice conversion spoofing attacks may pose the greatest threat to ASV performance, and thus their effectiveness is high for both text-dependent and text-independent ASV systems.

6.2. Countermeasures

  • The vulnerability of ASV systems to each of the four attacks considered above has been confirmed by several independent studies.
  • Even so, efforts to develop countermeasures are relatively embryonic, lagging far behind the level of effort in the case of some other biometric modalities.
  • Since impersonated speech is entirely natural, there are no processing artefacts which might otherwise be useful for detection purposes.
  • Even if speech synthesis and voice conversion have attracted greater attention, the majority of existing countermeasures make unrealistic use of prior knowledge.
  • Furthermore, these countermeasures might be easily overcome if they are known to spoofing attackers.

6.3. Generalised countermeasures

  • All of the past work described above targets a specific form of spoofing and generally exploits some prior knowledge of a particular spoofing algorithm.
  • Hence, countermeasures based on processing artefacts indicative of a specific approach to spoofing may not generalise well in the face of varying attacks.
  • Recent work has thus investigated the reliability of specific countermeasures in the face of varying attacks.
  • The potential for generalised countermeasures is highlighted in independent studies of spoofing with synthetic speech (De Leon et al., 2012a) and converted voice (Wu et al., 2012a).
  • Longer-term or higher-level features were investigated in (Alegre et al., 2013c) in the form of local binary pattern (LBP) analysis, a technique originally developed for texture analysis in computer vision problems (Pietikäinen et al., 2011).

7. Issues for future research

  • As discussed in Section 5, the spoofing and countermeasure studies reported in the literature were conducted with different datasets, evaluation protocols and metrics.
  • Unfortunately, the lack of standards presents a fundamental barrier to the comparison of different research results.
  • This section discusses the current evaluation protocols and metrics and some weaknesses in the methodology.

7.1. Large-scale standard datasets

  • Past studies of impersonation and replay spoofing attacks were all conducted using small-scale datasets, with only small numbers of speakers.
  • While many of the past studies on speech synthesis and voice conversion spoofing attacks already employ large-scale datasets, e.g. NIST speaker recognition evaluation (SRE) corpora, they all require the use of non-standard speech synthesis and voice conversion algorithms in order to generate spoofed speech.
  • While past work is sufficient to demonstrate the potential of spoofing countermeasures, their performance is probably over-estimated.
  • In addition, most of the past countermeasure studies have been conducted under matched conditions, e.g. where speech samples used to optimise the countermeasure are collected in the same or similar acoustic environment and over the same or similar channel as those used for evaluation.
  • Large-scale, standard datasets are thus also needed in order that countermeasure performance can be evaluated not only with realistic channel or recording environment variability, but also in the absence of a priori knowledge and hence under variable attacks.

7.2. Evaluation metrics

  • While countermeasures can be integrated into existing ASV systems, they are most often implemented as independent modules which allow for the explicit detection of spoofing attacks.
  • Profiles 2 and 4 are dependent on the countermeasure threshold, whereas the comparison of profiles 1 and 4 is potentially misleading; they reflect simultaneous changes to both the system and the dataset.
  • (Footnote: Produced with the TABULA RASA Score toolkit: http://publications.idiap.ch/downloads/reports/2012/Anjos_.)
  • The EPSC explicitly reflects three types of error metrics, the FAR, FRR and SFAR, while still providing a single combined metric with a unique decision threshold.
  • Further work is thus required to design intuitive, universal metrics which represent the performance of spoofing countermeasures when combined with ASV.

7.3. Open-source software packages

  • As reflected throughout this article, spoofing and countermeasure studies involve a broad range of technologies, including ASV, speech synthesis and voice conversion.
  • Version 3.0 of ALIZE includes several state-of-the-art approaches including joint factor analysis (JFA), i-vector modelling and probabilistic linear discriminant analysis (PLDA) (Larcher et al., 2013a).
  • The Bob signal processing and machine learning toolbox is a general-purpose biometric toolkit which also includes ASV functionality (Anjos et al., 2012).
  • Popular solutions for feature extraction include SPro and the Hidden Markov Model Toolkit (HTK), which also includes extensive statistical modelling functionalities.
  • The HMM-based Speech Synthesis System (HTS) can be used to implement HMM-based speech synthesis as well as speaker model adaptation, whereas the Festvox toolkit can be used for voice conversion.

7.4. Future directions

  • On account of dataset availability, the majority of past work involves text-independent ASV, which is arguably more relevant to surveillance applications; text-dependent systems are generally more relevant to authentication.
  • The most obvious, accessible attack involves replay.

8. Conclusions

  • This article reviews the previous work to assess the vulnerability of automatic speaker verification systems to spoofing and the potential to protect them using dedicated countermeasures.
  • Even if there are currently no standard datasets, evaluation protocols or metrics with which to conduct meaningfully comparable or reproducible research, previous studies involving impersonation, replay, speech synthesis and voice conversion all indicate genuine vulnerabilities.
  • Finally, while there is potential for next generation countermeasures to detect varying spoofing attacks, a continuous arms race is likely; efforts to develop more sophisticated countermeasures will likely be accompanied by increased efforts to spoof automatic speaker verification systems.

Did you find this useful? Give us your feedback

Figures (13)
Citations
More filters
Journal ArticleDOI
TL;DR: An efficient and rather robust face spoof detection algorithm based on image distortion analysis (IDA) that outperforms the state-of-the-art methods in spoof detection and highlights the difficulty in separating genuine and spoof faces, especially in cross-database and cross-device scenarios.
Abstract: Automatic face recognition is now widely used in applications ranging from deduplication of identity to authentication of mobile payment. This popularity of face recognition has raised concerns about face spoof attacks (also known as biometric sensor presentation attacks), where a photo or video of an authorized person’s face could be used to gain access to facilities or services. While a number of face spoof detection techniques have been proposed, their generalization ability has not been adequately addressed. We propose an efficient and rather robust face spoof detection algorithm based on image distortion analysis (IDA). Four different features (specular reflection, blurriness, chromatic moment, and color diversity) are extracted to form the IDA feature vector. An ensemble classifier, consisting of multiple SVM classifiers trained for different face spoof attacks (e.g., printed photo and replayed video), is used to distinguish between genuine (live) and spoof faces. The proposed approach is extended to multiframe face spoof detection in videos using a voting-based scheme. We also collect a face spoof database, MSU mobile face spoofing database (MSU MFSD), using two mobile devices (Google Nexus 5 and MacBook Air) with three types of spoof attacks (printed photo, replayed video with iPhone 5S, and replayed video with iPad Air). Experimental results on two public-domain face spoof databases (Idiap REPLAY-ATTACK and CASIA FASD), and the MSU MFSD database show that the proposed approach outperforms the state-of-the-art methods in spoof detection. Our results also highlight the difficulty in separating genuine and spoof faces, especially in cross-database and cross-device scenarios.

716 citations


Cites background from "Spoofing and countermeasures for sp..."

  • ...State of the art Commercial Off-The-Shelf (COTS) face recognition systems are not well designed to differentiate spoof faces from genuine live faces....

    [...]

Journal ArticleDOI
TL;DR: A comparative study of human versus machine speaker recognition is concluded, with an emphasis on prominent speaker-modeling techniques that have emerged in the last decade for automatic systems.
Abstract: Identifying a person by his or her voice is an important human trait most take for granted in natural human-to-human interaction/communication. Speaking to someone over the telephone usually begins by identifying who is speaking and, at least in cases of familiar speakers, a subjective verification by the listener that the identity is correct and the conversation can proceed. Automatic speaker-recognition systems have emerged as an important means of verifying identity in many e-commerce applications as well as in general business interactions, forensics, and law enforcement. Human experts trained in forensic speaker recognition can perform this task even better by examining a set of acoustic, prosodic, and linguistic characteristics of speech in a general approach referred to as structured listening. Techniques in forensic speaker recognition have been developed for many years by forensic speech scientists and linguists to help reduce any potential bias or preconceived understanding as to the validity of an unknown audio sample and a reference template from a potential suspect. Experienced researchers in signal processing and machine learning continue to develop automatic algorithms to effectively perform speaker recognition?with ever-improving performance?to the point where automatic systems start to perform on par with human listeners. In this article, we review the literature on speaker recognition by machines and humans, with an emphasis on prominent speaker-modeling techniques that have emerged in the last decade for automatic systems. We discuss different aspects of automatic systems, including voice-activity detection (VAD), features, speaker models, standard evaluation data sets, and performance metrics. Human speaker recognition is discussed in two parts?the first part involves forensic speaker-recognition methods, and the second illustrates how a na?ve listener performs this task from a neuroscience perspective. 
We conclude this review with a comparative study of human versus machine speaker recognition and attempt to point out strengths and weaknesses of each.

554 citations

Journal ArticleDOI
TL;DR: An approach which combines speech signal analysis using the constant Q transform with traditional cepstral processing and results show that CQCC configuration is sensitive to the general form of spoofing attack and use case scenario suggests that the past single-system pursuit of generalised spoofing detection may need rethinking.

327 citations


Cites background or methods or result from "Spoofing and countermeasures for sp..."

  • ...The most potentially damaging spoofing attacks in this case are voice conversion and speech synthesis (Wu et al., 2015)....

    [...]

  • ...Their utility for spoofing detection was first demonstrated using the ASVspoof 2015 database (Wu et al., 2014, 2015) for which they were shown to outperform the previous best result by 72% relative (Todisco et al., 2016)....

    [...]

  • ...The first two databases, namely ASVspoof 2015 (Wu et al., 2014, 2015) and AVspoof (Ergunay et al., 2015), are publicly available and have already been used for competitive evaluations....

    [...]

  • ...A growing body of work has gauged the vulnerability of ASV systems to a diverse range of spoofing attacks (Evans et al., 2013; Wu et al., 2015)....

    [...]

  • ...This hypothesis is supported by the general findings of the recent ASVspoof 2015 challenge (Wu et al., 2015) and of the BTAS 2016 Speaker Anti-spoofing Competition (Korshunov et al....

    [...]

Proceedings ArticleDOI
06 Sep 2015
TL;DR: Comparative results indicate that features representing spectral information in the high-frequency region, dynamic information of speech, and detailed information related to subband characteristics are considerably more useful for the synthetic speech detection task.
Abstract: The performance of biometric systems based on automatic speaker recognition technology is severely degraded due to spoofing attacks with synthetic speech generated using different voice conversion (VC) and speech synthesis (SS) techniques. Various countermeasures are proposed to detect this type of attack, and in this context, choosing an appropriate feature extraction technique for capturing relevant information from speech is an important issue. This paper presents a concise experimental review of different features for the synthetic speech detection task. A wide variety of features considered in this study include previously investigated features as well as some other potentially useful features for characterizing real and synthetic speech. The experiments are conducted on the recently released ASVspoof 2015 corpus containing speech data from a large number of VC and SS techniques. Comparative results using two different classifiers indicate that features representing spectral information in the high-frequency region, dynamic information of speech, and detailed information related to subband characteristics are considerably more useful in detecting synthetic speech. Index Terms: anti-spoofing, ASVspoof 2015, feature extraction, countermeasures
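The two feature families the abstract singles out, high-frequency subband energies and dynamic (delta) information, can be sketched roughly as follows. This is a toy illustration with assumed frame, hop, and band settings, not the paper's exact front-end.

```python
import numpy as np

def log_subband_energies(signal, sr, frame_len=400, hop=160, n_bands=8):
    """Per-frame log energies in linearly spaced subbands; the upper bands
    capture the high-frequency detail found most discriminative for
    synthetic speech detection."""
    n_fft = 512
    frames = [signal[s:s + frame_len] * np.hanning(frame_len)
              for s in range(0, len(signal) - frame_len + 1, hop)]
    spec = np.abs(np.fft.rfft(np.array(frames), n_fft)) ** 2  # power spectrum
    edges = np.linspace(0, spec.shape[1], n_bands + 1).astype(int)
    feats = np.array([[spec[t, edges[b]:edges[b + 1]].sum()
                       for b in range(n_bands)] for t in range(len(frames))])
    return np.log(feats + 1e-10)

def deltas(feats):
    """First-order dynamic (delta) features via simple frame differencing."""
    return np.diff(feats, axis=0, prepend=feats[:1])
```

A high-frequency tone, for instance, concentrates its energy in the upper bands, which is exactly the region such features expose to the classifier.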

313 citations

Proceedings ArticleDOI
06 Sep 2015
TL;DR: The ASVspoof initiative as discussed by the authors aims to overcome the bottleneck through the provision of standard corpora, protocols and metrics to support a common evaluation, and summarizes the results and discusses directions for future challenges and research.
Abstract: An increasing number of independent studies have confirmed the vulnerability of automatic speaker verification (ASV) technology to spoofing. However, in comparison to that involving other biometric modalities, spoofing and countermeasure research for ASV is still in its infancy. A current barrier to progress is the lack of standards, which impedes the comparison of results generated by different researchers. The ASVspoof initiative aims to overcome this bottleneck through the provision of standard corpora, protocols and metrics to support a common evaluation. This paper introduces the first edition, summarizes the results and discusses directions for future challenges and research.

248 citations

References
More filters
Journal ArticleDOI
TL;DR: This article provides an overview of progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
Abstract: Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models (GMMs) to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feed-forward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output. Deep neural networks (DNNs) that have many hidden layers and are trained using new methods have been shown to outperform GMMs on a variety of speech recognition benchmarks, sometimes by a large margin. This article provides an overview of this progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
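In the hybrid setup the abstract describes, the feed-forward network emits posteriors p(s|x) over HMM states, and the decoder consumes them as scaled likelihoods via Bayes' rule: log p(x|s) + const = log p(s|x) - log p(s). A minimal numpy sketch of that conversion (all names, layer sizes, and dimensions are illustrative, not a real acoustic model):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax along the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dnn_state_posteriors(frames, W1, b1, W2, b2):
    """One-hidden-layer feed-forward pass: stacked acoustic frames in,
    softmax posteriors over HMM states out."""
    h = np.maximum(0.0, frames @ W1 + b1)      # ReLU hidden layer
    return softmax(h @ W2 + b2)

def scaled_log_likelihoods(posteriors, state_priors):
    """Hybrid trick: log p(x|s) + const = log p(s|x) - log p(s);
    these replace GMM likelihoods inside the HMM decoder."""
    return np.log(posteriors) - np.log(state_priors)
```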

9,091 citations


"Spoofing and countermeasures for sp..." refers background in this paper

  • ...In addition to the four major approaches, inspired by advances in deep neural network (DNN)-based speech recognition (Hinton et al., 2012), new data-driven, DNN-based approaches have also been actively investigated (Zen et al., 2013; Ling et al., 2013; Lu et al., 2013; Qian et al., 2014)....

    [...]


01 Jan 2006

5,265 citations

Journal ArticleDOI
TL;DR: The major elements of MIT Lincoln Laboratory's Gaussian mixture model (GMM)-based speaker verification system used successfully in several NIST Speaker Recognition Evaluations (SREs) are described.

4,673 citations


"Spoofing and countermeasures for sp..." refers methods in this paper

  • ...GMMs have been used intensively and their combination with a universal background model (UBM) has become the de facto standard, commonly referred to as the GMM-UBM approach (Reynolds et al., 2000)....

    [...]

Journal ArticleDOI
TL;DR: An extension of the previous work which proposes a new speaker representation for speaker verification, a new low-dimensional speaker- and channel-dependent space is defined using a simple factor analysis, named the total variability space because it models both speaker and channel variabilities.
Abstract: This paper presents an extension of our previous work which proposes a new speaker representation for speaker verification. In this modeling, a new low-dimensional speaker- and channel-dependent space is defined using a simple factor analysis. This space is named the total variability space because it models both speaker and channel variabilities. Two speaker verification systems are proposed which use this new representation. The first system is a support vector machine-based system that uses the cosine kernel to estimate the similarity between the input data. The second system directly uses the cosine similarity as the final decision score. We tested three channel compensation techniques in the total variability space, which are within-class covariance normalization (WCCN), linear discriminant analysis (LDA), and nuisance attribute projection (NAP). We found that the best results are obtained when LDA is followed by WCCN. We achieved an equal error rate (EER) of 1.12% and MinDCF of 0.0094 using the cosine distance scoring on the male English trials of the core condition of the NIST 2008 Speaker Recognition Evaluation dataset. We also obtained 4% absolute EER improvement for both-gender trials on the 10 s-10 s condition compared to the classical joint factor analysis scoring.
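The cosine scoring and WCCN compensation described in the abstract can be sketched as follows. This is a simplified numpy illustration under our own function names and data layout, not the authors' implementation; LDA, applied before WCCN in their best system, is omitted for brevity.

```python
import numpy as np

def cosine_score(w_enrol, w_test):
    """Cosine similarity between two session-compensated i-vectors,
    used directly as the verification score."""
    return float(w_enrol @ w_test /
                 (np.linalg.norm(w_enrol) * np.linalg.norm(w_test)))

def wccn_projection(ivectors, labels):
    """WCCN: learn B with B B^T = W^{-1}, where W is the average
    within-class (within-speaker) covariance of the training i-vectors.
    ivectors: (N, D) array; labels: speaker label per i-vector.
    Compensated vectors are obtained as B.T @ w."""
    D = ivectors.shape[1]
    W = np.zeros((D, D))
    classes = np.unique(labels)
    for c in classes:
        Xc = ivectors[labels == c]
        Xc = Xc - Xc.mean(axis=0)
        W += Xc.T @ Xc / len(Xc)
    W /= len(classes)
    return np.linalg.cholesky(np.linalg.inv(W))  # Cholesky factor B
```

After projection, same-speaker i-vector pairs should score higher than different-speaker pairs, which is the property the verification threshold exploits.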

3,526 citations


"Spoofing and countermeasures for sp..." refers background in this paper

  • ...JFA subsequently evolved into a simplified total variability model or ‘i-vector’ approach which is now the state of the art (Dehak et al., 2011)....

    [...]

  • ...With more modern techniques, X can also be high-dimensional i-vectors (Dehak et al., 2011) modelled with probabilistic linear discriminant analysis (PLDA) back-ends (Li et al., 2012) (see below)....

    [...]

Journal ArticleDOI
TL;DR: The individual Gaussian components of a GMM are shown to represent some general speaker-dependent spectral shapes that are effective for modeling speaker identity and is shown to outperform the other speaker modeling techniques on an identical 16 speaker telephone speech task.
Abstract: This paper introduces and motivates the use of Gaussian mixture models (GMM) for robust text-independent speaker identification. The individual Gaussian components of a GMM are shown to represent some general speaker-dependent spectral shapes that are effective for modeling speaker identity. The focus of this work is on applications which require high identification rates using short utterances from unconstrained conversational speech and robustness to degradations produced by transmission over a telephone channel. A complete experimental evaluation of the Gaussian mixture speaker model is conducted on a 49 speaker, conversational telephone speech database. The experiments examine algorithmic issues (initialization, variance limiting, model order selection), spectral variability robustness techniques, large population performance, and comparisons to other speaker modeling techniques (uni-modal Gaussian, VQ codebook, tied Gaussian mixture, and radial basis functions). The Gaussian mixture speaker model attains 96.8% identification accuracy using 5 second clean speech utterances and 80.8% accuracy using 15 second telephone speech utterances with a 49 speaker population and is shown to outperform the other speaker modeling techniques on an identical 16 speaker telephone speech task.
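The GMM speaker model described above scores a test utterance by its average per-frame log-likelihood under each speaker's mixture and, for closed-set identification, picks the highest-scoring speaker. A minimal diagonal-covariance sketch in numpy (function names, shapes, and the dictionary-based model store are illustrative):

```python
import numpy as np

def gmm_log_likelihood(X, weights, means, variances):
    """Average per-frame log-likelihood of feature frames X (T, D) under a
    diagonal-covariance GMM: weights (M,), means and variances (M, D)."""
    diff = X[:, None, :] - means[None, :, :]                 # (T, M, D)
    log_comp = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                       + np.sum(np.log(2 * np.pi * variances), axis=1))
    # log sum_m w_m N(x | mu_m, var_m) via log-sum-exp for stability
    log_weighted = log_comp + np.log(weights)                # (T, M)
    m = log_weighted.max(axis=1, keepdims=True)
    ll = m[:, 0] + np.log(np.exp(log_weighted - m).sum(axis=1))
    return ll.mean()

def identify(X, speaker_models):
    """Closed-set identification: return the speaker whose GMM scores highest.
    speaker_models maps speaker id -> (weights, means, variances)."""
    scores = {spk: gmm_log_likelihood(X, *params)
              for spk, params in speaker_models.items()}
    return max(scores, key=scores.get)
```

In the GMM-UBM verification variant mentioned in the source article, the same likelihood is instead compared against a universal background model to form a log-likelihood ratio.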

3,134 citations


Additional excerpts

  • ...In the classical approach (Reynolds and Rose, 1995), features X are typically MFCCs and the acoustic models are Gaussian mixture models (GMMs) (see below)....

    [...]