Proceedings ArticleDOI

Performance comparison of speaker recognition systems in presence of duration variability

TL;DR: This study reveals that the relative improvement of the total variability based system drops gradually as the test utterance gets shorter, and that if the speakers are enrolled with a sufficient amount of training data, the GMM-UBM system outperforms the i-vector system for very short test utterances.
Abstract: The performance of a speaker recognition system depends heavily on the amount of speech data used in training and testing. In this paper, we compare the performance of two speaker recognition systems in the presence of utterance duration variability. The first system is based on the state-of-the-art total variability approach (also known as the i-vector system), whereas the other is the classical speaker recognition system based on a Gaussian mixture model with a universal background model (GMM-UBM). We have conducted extensive experiments for different cases of length mismatch on two NIST corpora: NIST SRE 2008 and NIST SRE 2010. Our study reveals that the relative improvement of the total variability based system gradually drops as the test utterance length is reduced. We also observe that if the speakers are enrolled with a sufficient amount of training data, the GMM-UBM system outperforms the i-vector system for very short test utterances.
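
The "relative improvement" mentioned in the abstract can be illustrated with a minimal Python sketch. It assumes the error metric is the equal error rate (EER); the abstract does not name the metric, so this is an assumption, and the function names `equal_error_rate` and `relative_improvement` are hypothetical rather than taken from the paper.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection rate: target trials scoring below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance rate: non-target trials scoring at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2.0
    return eer


def relative_improvement(baseline_error, system_error):
    """Relative error reduction of a system over a baseline, in percent."""
    return 100.0 * (baseline_error - system_error) / baseline_error
```

Under these assumptions, a positive value of `relative_improvement(gmm_ubm_eer, ivector_eer)` means the i-vector system has the lower error, and a negative value means the GMM-UBM system does.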


Citations
Journal ArticleDOI
TL;DR: The authors present an extensive survey of SV with short utterances considering the studies from recent past and include latest research offering various solutions and analyses to address the limited data issue within the scope of SV.
Abstract: Automatic speaker verification (ASV) technology now reports a reasonable level of accuracy in its applications in voice-based biometric systems. However, it requires an adequate amount of speech data for enrolment and verification; otherwise, the performance degrades considerably. For this reason, the trade-off between convenience and security is difficult to maintain in practical scenarios. Utterance duration remains a critical issue when deploying a voice biometric system in real-world applications. A large amount of research work has been carried out to address the limited data issue within the scope of SV. Research activity on mitigating the challenges posed by short utterances has risen significantly in recent times. In this study, the authors present an extensive survey of SV with short utterances, considering studies from the recent past and including the latest research offering various solutions and analyses. The review also summarises the major findings of studies of the duration variability problem in ASV systems. Finally, they discuss a number of possible future directions to promote further research in this field.

93 citations

Proceedings ArticleDOI
01 Dec 2015
TL;DR: A set of novel speech features for detecting spoofing attacks is proposed, using an alternative frequency-warping technique and formant-specific block transformation of filter bank log energies; the features outperform existing approaches on various spoofing attack detection tasks.
Abstract: Nowadays, speech-based biometric systems such as automatic speaker verification (ASV) are highly prone to spoofing attacks by an impostor. With recent developments in various voice conversion (VC) and speech synthesis (SS) algorithms, these spoofing attacks can pose a serious potential threat to current state-of-the-art ASV systems. To impede such attacks and enhance the security of ASV systems, it is essential to develop efficient anti-spoofing algorithms that can differentiate synthetic or converted speech from natural human speech. In this paper, we propose a set of novel speech features for detecting spoofing attacks. The proposed features are computed using an alternative frequency-warping technique and formant-specific block transformation of filter bank log energies. We have evaluated existing and proposed features against several kinds of synthetic speech data from the ASVspoof 2015 corpora. The results show that the proposed techniques outperform existing approaches on various spoofing attack detection tasks. The techniques investigated in this paper can also accurately classify natural and synthetic speech, as equal error rates (EERs) of 0% have been achieved.

21 citations


Cites background from "Performance comparison of speaker r..."

  • ...MFCC [26], [27] feature captures spectral and phonetic information related to speech signal....

    [...]

Journal ArticleDOI
TL;DR: Voice-quality features (i.e., jitter, shimmer, and Glottal-to-Noise Excitation ratio) are proposed and assessed within the framework of speaker diarization and employed together with the state-of-the-art short-term cepstral and long-term prosodic features.
Abstract: Several factors contribute to the performance of speaker diarization systems. For instance, the appropriate selection of speech features is one of the key aspects that affect speaker diarization systems. The other factors include the techniques employed to perform both segmentation and clustering. While the static mel frequency cepstral coefficients are the most widely used features in speech-related tasks including speaker diarization, several studies have shown the benefits of augmenting regular speech features with the static ones. In this work, we have proposed and assessed the use of voice-quality features (i.e., jitter, shimmer, and Glottal-to-Noise Excitation ratio) within the framework of speaker diarization. These acoustic attributes are employed together with the state-of-the-art short-term cepstral and long-term prosodic features. Additionally, the use of delta dynamic features is also explored separately both for segmentation and bottom-up clustering sub-tasks. The combination of the different feature sets is carried out at several levels. At the feature level, the long-term speech features are stacked in the same feature vector. At the score level, the short- and long-term speech features are independently modeled and fused at the score likelihood level. Various feature combinations have been applied both for Gaussian mixture modeling and i-vector-based speaker diarization systems. The experiments have been carried out on Augmented Multi-party Interaction meeting corpus. The best result, in terms of diarization error rate, is reported by using i-vector-based cosine-distance clustering together with a signal parameterization consisting of a combination of static cepstral coefficients, delta, voice-quality, and prosodic features. The best result shows about 24% relative diarization error rate improvement compared to the baseline system which is based on Gaussian mixture modeling and short-term static cepstral coefficients.

15 citations

Journal ArticleDOI
TL;DR: In this article, a class of novel quality measures formulated using the zero-order sufficient statistics used during the i-vector extraction process were introduced for combining multiple systems based on different features and classifiers.

13 citations

Journal ArticleDOI
TL;DR: Gaussian mixture model with universal background model (GMM-UBM)-based and I-vector-based language identification approaches are investigated, and the results show that GMM-UBM is more effective than the I-vector for language identification of short duration test utterances.
Abstract: With the advancement in technology, communication between people around the world from different linguistic backgrounds is increasing gradually, resulting in the requirement of language identification services. Language identification techniques extract distinguishable information as features of a language from the speech corpora to differentiate one language from another. Without publicly available speech corpora, comparisons between different techniques are not very reliable. This paper investigates state-of-the-art features and techniques for language identification of under-resourced and closely related languages, namely Pashto, Punjabi, Sindhi, and Urdu. For language identification, a speech corpus is designed and collected for the mentioned languages. The dataset consists of read speech collected over the telephone network (mobile and landline) from different regions of Pakistan. The speech corpus is annotated at the sentence level using X-SAMPA, its orthographic transcription is also provided, and the verified data are divided into training and evaluation sets. Mel-frequency cepstral coefficients and their shifted delta cepstral features are used to develop a language identification system for the target languages. Gaussian mixture model with universal background model (GMM-UBM)-based and I-vector-based language identification approaches are investigated. The results show that GMM-UBM is more effective than the I-vector for language identification of short duration test utterances.

13 citations

References
Journal ArticleDOI
TL;DR: The major elements of MIT Lincoln Laboratory's Gaussian mixture model (GMM)-based speaker verification system used successfully in several NIST Speaker Recognition Evaluations (SREs) are described.

4,673 citations


"Performance comparison of speaker r..." refers background or methods in this paper

  • ...During the last two decades in speaker recognition research, most of the notable developments in classifier-level are based on the GMM concept [10], [11], [12]....

    [...]

  • ...In GMM-UBM, prior to enrollment phase, a single speaker independent universal background model (UBM) is created by using a large development data [10], [14]....

    [...]

  • ...For this reason, GMM-UBM systems are still popular and widely used, particularly when the amount of suitable development data is inadequate [10], [5], [14]....

    [...]
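
The excerpts above mention the UBM but not how a trial is scored. As a minimal sketch, assuming diagonal-covariance GMMs and the standard average log-likelihood ratio between a speaker model and the UBM, scoring could look like the following. The function names and the `(weights, means, variances)` tuple layout are hypothetical, and the enrollment step that derives the speaker model from the UBM is not shown.

```python
import math


def gmm_log_likelihood(frame, weights, means, variances):
    """Log-likelihood of one feature frame under a diagonal-covariance GMM."""
    component_logs = []
    for w, mu, var in zip(weights, means, variances):
        log_p = math.log(w)
        for x, m, v in zip(frame, mu, var):
            # Log of a univariate Gaussian density, summed over dimensions.
            log_p += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        component_logs.append(log_p)
    # Log-sum-exp over components for numerical stability.
    peak = max(component_logs)
    return peak + math.log(sum(math.exp(c - peak) for c in component_logs))


def llr_score(frames, speaker_gmm, ubm):
    """Average per-frame log-likelihood ratio: speaker model versus UBM."""
    total = 0.0
    for frame in frames:
        total += gmm_log_likelihood(frame, *speaker_gmm)
        total -= gmm_log_likelihood(frame, *ubm)
    return total / len(frames)
```

A score above a chosen threshold would accept the claimed identity. Because the score is an average over frames, fewer frames in a short test utterance give a noisier estimate.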

Journal ArticleDOI
TL;DR: An extension of previous work that proposes a new speaker representation for speaker verification: a new low-dimensional speaker- and channel-dependent space is defined using a simple factor analysis and named the total variability space because it models both speaker and channel variabilities.
Abstract: This paper presents an extension of our previous work which proposes a new speaker representation for speaker verification. In this modeling, a new low-dimensional speaker- and channel-dependent space is defined using a simple factor analysis. This space is named the total variability space because it models both speaker and channel variabilities. Two speaker verification systems are proposed which use this new representation. The first system is a support vector machine-based system that uses the cosine kernel to estimate the similarity between the input data. The second system directly uses the cosine similarity as the final decision score. We tested three channel compensation techniques in the total variability space, which are within-class covariance normalization (WCCN), linear discriminant analysis (LDA), and nuisance attribute projection (NAP). We found that the best results are obtained when LDA is followed by WCCN. We achieved an equal error rate (EER) of 1.12% and MinDCF of 0.0094 using the cosine distance scoring on the male English trials of the core condition of the NIST 2008 Speaker Recognition Evaluation dataset. We also obtained 4% absolute EER improvement for both-gender trials on the 10 s-10 s condition compared to the classical joint factor analysis scoring.
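
The second system in the abstract uses cosine similarity directly as the decision score. A minimal Python sketch of that scoring step follows, assuming the two i-vectors have already been extracted and channel-compensated (the LDA and WCCN projections are not shown). The names `cosine_score` and `verify` and the threshold value are hypothetical.

```python
import math


def cosine_score(w_target, w_test):
    """Cosine similarity between an enrollment i-vector and a test i-vector."""
    dot = sum(a * b for a, b in zip(w_target, w_test))
    norm_target = math.sqrt(sum(a * a for a in w_target))
    norm_test = math.sqrt(sum(b * b for b in w_test))
    return dot / (norm_target * norm_test)


def verify(w_target, w_test, threshold):
    """Accept the claimed identity when the cosine score reaches the threshold."""
    return cosine_score(w_target, w_test) >= threshold
```

For example, `verify([1.0, 2.0], [2.0, 4.0], 0.5)` accepts, because the two vectors point in the same direction and the score does not depend on their lengths.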

3,526 citations


"Performance comparison of speaker r..." refers background or methods in this paper

  • ...Though i-vector based speaker recognition systems are shown to give best recognition accuracy in latest NIST SREs [18], [19], [21], they require huge computational resources as well as massive amount of development data for estimating its parameters and hyper-parameters....

    [...]

  • ...Inspired by the earlier use of JFA, Dehak et al. proposed total-variability based approach for reducing the dimensionality of GMM-supervector [18]....

    [...]

  • ...The i-vector represents the GMM supervector by a single variability space which reduces high dimensional GMM supervector into lower dimensional total variability space [18]....

    [...]


Journal ArticleDOI
TL;DR: The individual Gaussian components of a GMM are shown to represent some general speaker-dependent spectral shapes that are effective for modeling speaker identity, and the Gaussian mixture speaker model is shown to outperform the other speaker modeling techniques on an identical 16 speaker telephone speech task.
Abstract: This paper introduces and motivates the use of Gaussian mixture models (GMM) for robust text-independent speaker identification. The individual Gaussian components of a GMM are shown to represent some general speaker-dependent spectral shapes that are effective for modeling speaker identity. The focus of this work is on applications which require high identification rates using short utterances from unconstrained conversational speech and robustness to degradations produced by transmission over a telephone channel. A complete experimental evaluation of the Gaussian mixture speaker model is conducted on a 49 speaker, conversational telephone speech database. The experiments examine algorithmic issues (initialization, variance limiting, model order selection), spectral variability robustness techniques, large population performance, and comparisons to other speaker modeling techniques (uni-modal Gaussian, VQ codebook, tied Gaussian mixture, and radial basis functions). The Gaussian mixture speaker model attains 96.8% identification accuracy using 5 second clean speech utterances and 80.8% accuracy using 15 second telephone speech utterances with a 49 speaker population and is shown to outperform the other speaker modeling techniques on an identical 16 speaker telephone speech task.

3,134 citations


"Performance comparison of speaker r..." refers methods in this paper

  • ...For classification, various modeling techniques such as vector quantization (VQ) [7], dynamic time warping (DTW) [8], Gaussian mixture model (GMM) [9] were used....

    [...]

Journal ArticleDOI
01 Sep 1997
TL;DR: A tutorial on the design and development of automatic speaker-recognition systems is presented, and a new automatic speaker-recognition system is given that performs with 98.9% correct identification.
Abstract: A tutorial on the design and development of automatic speaker-recognition systems is presented. Automatic speaker recognition is the use of a machine to recognize a person from a spoken phrase. These systems can operate in two modes: to identify a particular person or to verify a person's claimed identity. Speech processing and the basic components of automatic speaker-recognition systems are shown and design tradeoffs are discussed. Then, a new automatic speaker-recognition system is given. This recognizer performs with 98.9% correct identification. Last, the performances of various systems are compared.

1,686 citations


"Performance comparison of speaker r..." refers background or methods in this paper

  • ...Its potential applications include telephone banking system, system access control, providing forensic evidence, call centers and many more [1], [2]....

    [...]

  • ...A TI speaker recognition system includes three fundamental modules [1], [2]: a feature extraction unit, which represents the speech signal in a compact manner, a modeling block to characterize those features using statistical approaches, and lastly, a classification scheme to classify the unknown utterance....

    [...]

  • ...SV system can be broadly categorized as text-dependent (TD) [4] and text-independent (TI) modes depending on the speech content in training and test phase [1], [2]....

    [...]

  • ...Speech signal conveys information regarding the physiological aspects of a speaker because it is affected by the unique shape and size of vocal tract, mouth, nasal cavity, etc [1], [2]....

    [...]

Journal ArticleDOI
TL;DR: This paper starts with the fundamentals of automatic speaker recognition, concerning feature extraction and speaker modeling and elaborate advanced computational techniques to address robustness and session variability.

1,433 citations


"Performance comparison of speaker r..." refers background or methods in this paper

  • ...Its potential applications include telephone banking system, system access control, providing forensic evidence, call centers and many more [1], [2]....

    [...]

  • ...A TI speaker recognition system includes three fundamental modules [1], [2]: a feature extraction unit, which represents the speech signal in a compact manner, a modeling block to characterize those features using statistical approaches, and lastly, a classification scheme to classify the unknown utterance....

    [...]

  • ...SV system can be broadly categorized as text-dependent (TD) [4] and text-independent (TI) modes depending on the speech content in training and test phase [1], [2]....

    [...]

  • ...Speech signal conveys information regarding the physiological aspects of a speaker because it is affected by the unique shape and size of vocal tract, mouth, nasal cavity, etc [1], [2]....

    [...]