
Showing papers on "Speaker recognition" published in 2011


Journal ArticleDOI
TL;DR: An extension of previous work that proposes a new speaker representation for speaker verification: a low-dimensional speaker- and channel-dependent space is defined using a simple factor analysis, named the total variability space because it models both speaker and channel variabilities.
Abstract: This paper presents an extension of our previous work which proposes a new speaker representation for speaker verification. In this modeling, a new low-dimensional speaker- and channel-dependent space is defined using a simple factor analysis. This space is named the total variability space because it models both speaker and channel variabilities. Two speaker verification systems are proposed which use this new representation. The first system is a support vector machine-based system that uses the cosine kernel to estimate the similarity between the input data. The second system directly uses the cosine similarity as the final decision score. We tested three channel compensation techniques in the total variability space, which are within-class covariance normalization (WCCN), linear discriminant analysis (LDA), and nuisance attribute projection (NAP). We found that the best results are obtained when LDA is followed by WCCN. We achieved an equal error rate (EER) of 1.12% and MinDCF of 0.0094 using the cosine distance scoring on the male English trials of the core condition of the NIST 2008 Speaker Recognition Evaluation dataset. We also obtained 4% absolute EER improvement for both-gender trials on the 10 s-10 s condition compared to the classical joint factor analysis scoring.
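
The cosine distance scoring used by the second system is simple enough to sketch. Below is a minimal NumPy version; the 400-dimensional i-vectors and the decision threshold are illustrative assumptions, not values from the paper:

```python
import numpy as np

def cosine_score(w_enroll, w_test):
    """Cosine similarity between enrollment and test i-vectors,
    used directly as the verification score."""
    return float(np.dot(w_enroll, w_test) /
                 (np.linalg.norm(w_enroll) * np.linalg.norm(w_test)))

# Toy trial with hypothetical 400-dimensional i-vectors.
rng = np.random.default_rng(0)
w_target, w_trial = rng.standard_normal(400), rng.standard_normal(400)
accept = cosine_score(w_target, w_trial) > 0.3  # threshold tuned on dev data
```

In practice the i-vectors would first be projected with LDA followed by WCCN, the combination the paper reports as best.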

3,526 citations


Proceedings Article
01 Jan 2011
TL;DR: The proposed approach deals with the non-Gaussian behavior of i-vectors by performing a simple length normalization, which allows the use of probabilistic models with Gaussian assumptions that yield performance equivalent to that of more complicated systems based on heavy-tailed assumptions.
Abstract: We present a method to boost the performance of probabilistic generative models that work with i-vector representations. The proposed approach deals with the non-Gaussian behavior of i-vectors by performing a simple length normalization. This non-linear transformation allows the use of probabilistic models with Gaussian assumptions that yield performance equivalent to that of more complicated systems based on heavy-tailed assumptions. Significant performance improvements are demonstrated on the telephone portion of NIST SRE 2010.
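
The normalization itself is a one-liner. A minimal sketch, assuming the common recipe of centering and whitening on development data before projecting onto the unit sphere (the helper names and the whitening step are illustrative assumptions):

```python
import numpy as np

def length_normalize(w, mean, whiten):
    """Center, whiten, and scale an i-vector to unit length so that
    Gaussian modeling assumptions (e.g., in PLDA) hold better."""
    w = whiten @ (w - mean)      # statistics estimated on a dev set
    return w / np.linalg.norm(w)
```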

1,077 citations


Journal ArticleDOI
TL;DR: The basic phenomena of the last fifteen years are addressed, with commentary on databases, modelling and annotation, the unit of analysis and prototypicality, and automatic processing, including discussions of features, classification, robustness, evaluation, implementation, and system integration.

671 citations


Journal ArticleDOI
TL;DR: The main paradigms for speaker identification, and recent work on missing data methods to increase robustness are presented, and combined approaches involving bottom-up estimation and top-down processing are reviewed.
Abstract: This paper presents the main paradigms for speaker identification, and recent work on missing data methods to increase robustness. Feature extraction, speaker modeling, and classification are discussed. Evaluations of speaker identification performance subject to environmental noise are presented. While performance is impressive in clean speech conditions, there is rapid degradation under mismatched additive noise. Missing data methods can compensate for arbitrary disturbances and remove environmental mismatches. An overview of missing data methods is provided and applications to robust speaker identification are summarized. Finally, combined approaches involving bottom-up estimation and top-down processing are reviewed, and their significance discussed.
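
For a diagonal-covariance GMM speaker model, the marginalization flavor of missing data methods reduces to scoring only the feature dimensions flagged as reliable. A minimal sketch (shapes, names, and the mask source are illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm

def gmm_loglik_marginal(x, reliable, weights, means, stds):
    """Frame log-likelihood under a diagonal GMM, marginalizing out
    unreliable feature dimensions (for diagonal covariances this is
    simply dropping them).

    x: (D,) feature vector; reliable: boolean (D,) missing-data mask;
    weights/means/stds: GMM parameters with shapes (K,), (K, D), (K, D).
    """
    log_comp = (np.log(weights) +
                norm.logpdf(x[reliable], means[:, reliable],
                            stds[:, reliable]).sum(axis=1))
    m = log_comp.max()                  # log-sum-exp over components
    return m + np.log(np.exp(log_comp - m).sum())
```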

269 citations


Book
08 Dec 2011
TL;DR: The "Fundamentals of Speaker Recognition" as mentioned in this paper is a textbook for advanced level students in computer science and engineering, concentrating on biometrics, speech recognition, pattern recognition, signal processing and specifically, speaker recognition.
Abstract: An emerging technology, Speaker Recognition is becoming well-known for providing voice authentication over the telephone for helpdesks, call centres and other enterprise businesses for business process automation. "Fundamentals of Speaker Recognition" introduces Speaker Identification, Speaker Verification, Speaker (Audio Event) Classification, Speaker Detection, Speaker Tracking and more. The technical problems are rigorously defined, and a complete picture is made of the relevance of the discussed algorithms and their usage in building a comprehensive Speaker Recognition System. Designed as a textbook with examples and exercises at the end of each chapter, "Fundamentals of Speaker Recognition" is suitable for advanced-level students in computer science and engineering, concentrating on biometrics, speech recognition, pattern recognition, signal processing and, specifically, speaker recognition. It is also a valuable reference for developers of commercial technology and for speech scientists.

243 citations


Proceedings ArticleDOI
27 Aug 2011
TL;DR: In this paper, a comparison of Joint Factor Analysis (JFA) and i-vector based systems is presented, including various compensation techniques: Within-Class Covariance Normalization (WCCN), LDA, Scatter Difference Nuisance Attribute Projection (SDNAP) and Gaussian Probabilistic Linear Discriminant Analysis (GPLDA).
Abstract: Robust speaker verification on short utterances remains a key consideration when deploying automatic speaker recognition, as many real-world applications often have access to only limited-duration speech data. This paper explores how recent technologies built around total variability modeling behave when training and testing utterance lengths are reduced. Results are presented which provide a comparison of Joint Factor Analysis (JFA) and i-vector based systems including various compensation techniques: Within-Class Covariance Normalization (WCCN), LDA, Scatter Difference Nuisance Attribute Projection (SDNAP) and Gaussian Probabilistic Linear Discriminant Analysis (GPLDA). Speaker verification performance for utterances with as little as 2 sec of data taken from the NIST Speaker Recognition Evaluations is presented to provide a clearer picture of the current performance characteristics of these techniques in short utterance conditions.

239 citations


Proceedings ArticleDOI
01 Dec 2011
TL;DR: This paper explores the performance of DBNs in a state-of-the-art LVCSR system, showing improvements over Multi-Layer Perceptrons (MLPs) and GMM/HMMs across a variety of features on an English Broadcast News task.
Abstract: To date, there has been limited work in applying Deep Belief Networks (DBNs) for acoustic modeling in LVCSR tasks, with past work using standard speech features. However, a typical LVCSR system makes use of both feature and model-space speaker adaptation and discriminative training. This paper explores the performance of DBNs in a state-of-the-art LVCSR system, showing improvements over Multi-Layer Perceptrons (MLPs) and GMM/HMMs across a variety of features on an English Broadcast News task. In addition, we provide a recipe for data parallelization of DBN training, showing that data parallelization can provide linear speed-up in the number of machines, without impacting WER.
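
The data-parallel recipe rests on the fact that the gradient of a minibatch equals the average of gradients computed on its shards. A toy sketch with a logistic-loss model (serial here; each shard would run on its own machine, and all names are illustrative):

```python
import numpy as np

def data_parallel_step(w, X, y, lr=0.1, n_workers=4):
    """One SGD step with the minibatch sharded across workers; averaging
    the shard gradients matches the full-batch gradient up to shard-size
    rounding, which is what yields near-linear speed-up."""
    grads = []
    for X_s, y_s in zip(np.array_split(X, n_workers),
                        np.array_split(y, n_workers)):
        p = 1.0 / (1.0 + np.exp(-X_s @ w))           # worker-local forward
        grads.append(X_s.T @ (p - y_s) / len(y_s))   # worker-local gradient
    return w - lr * np.mean(grads, axis=0)           # all-reduce, then update
```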

237 citations


Proceedings ArticleDOI
22 May 2011
TL;DR: The use of universal background models (UBM) with full-covariance matrices is suggested and thoroughly experimentally tested, and dimensionality reduction of i-vectors before PLDA-HT modeling is investigated.
Abstract: In this paper, we describe recent progress in i-vector based speaker verification. The use of universal background models (UBM) with full-covariance matrices is suggested and thoroughly experimentally tested. The i-vectors are scored using a simple cosine distance and advanced techniques such as Probabilistic Linear Discriminant Analysis (PLDA) and a heavy-tailed variant of PLDA (PLDA-HT). Finally, we investigate dimensionality reduction of i-vectors before entering the PLDA-HT modeling. The results are very competitive: on the NIST 2010 SRE task, the results of a single full-covariance LDA-PLDA-HT system approach those of a complex fused system.

194 citations


Proceedings ArticleDOI
22 May 2011
TL;DR: The speaker verification score for a pair of i-vectors representing a trial is computed with a functional form derived from the successful PLDA generative model, which provides up to 40% relative improvement on the NIST SRE 2010 evaluation task.
Abstract: Recently, i-vector extraction and Probabilistic Linear Discriminant Analysis (PLDA) have proven to provide state-of-the-art speaker verification performance. In this paper, the speaker verification score for a pair of i-vectors representing a trial is computed with a functional form derived from the successful PLDA generative model. In our case, however, parameters of this function are estimated based on a discriminative training criterion. We propose to use the objective function to directly address the task in speaker verification: discrimination between same-speaker and different-speaker trials. Compared with a baseline which uses a generatively trained PLDA model, discriminative training provides up to 40% relative improvement on the NIST SRE 2010 evaluation task.
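
The functional form in question is the symmetric quadratic function that a PLDA log-likelihood ratio reduces to for a pair of i-vectors. A sketch of the scoring step (parameter names are illustrative; per the abstract they are trained discriminatively rather than derived from a generative PLDA model):

```python
import numpy as np

def pair_score(w1, w2, Lambda, Gamma, c, k):
    """Symmetric quadratic trial score with the functional form of a PLDA
    log-likelihood ratio; Lambda, Gamma (matrices), c (vector), and k
    (scalar) are the trainable parameters."""
    return (w1 @ Lambda @ w1 + w2 @ Lambda @ w2
            + 2.0 * (w1 @ Gamma @ w2) + c @ (w1 + w2) + k)
```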

193 citations


Proceedings ArticleDOI
22 May 2011
TL;DR: An HMM-based speech synthesizer, which creates synthetic speech for a targeted speaker through adaptation of a background model, is used to attack both GMM-UBM and support vector machine (SVM) SV systems, and a synthetic speech classifier is developed to reduce the vulnerability of a speaker verification (SV) system to synthetic speech.
Abstract: In this paper, we present new results from our research into the vulnerability of a speaker verification (SV) system to synthetic speech. We use an HMM-based speech synthesizer, which creates synthetic speech for a targeted speaker through adaptation of a background model, and both GMM-UBM and support vector machine (SVM) SV systems. Using 283 speakers from the Wall Street Journal (WSJ) corpus, our SV systems have a 0.35% EER. When the systems are tested with synthetic speech generated from speaker models derived from the WSJ corpus, over 91% of the matched claims are accepted. We propose the use of relative phase shift (RPS) in order to detect synthetic speech and develop a GMM-based synthetic speech classifier (SSC). Using the SSC, we are able to correctly classify human speech in 95% of tests and synthetic speech in 88% of tests, thus significantly reducing the vulnerability.

172 citations


Proceedings ArticleDOI
22 May 2011
TL;DR: Under certain assumptions, the formulas for i-vector extraction—also used in i-vector extractor training—can be simplified, leading to faster and more memory-efficient code.
Abstract: This paper introduces some simplifications to i-vector speaker recognition systems. I-vector extraction as well as training of the i-vector extractor can be an expensive task both in terms of memory and speed. Under certain assumptions, the formulas for i-vector extraction—also used in i-vector extractor training—can be simplified, leading to faster and more memory-efficient code. The first assumption is that the GMM component alignment is constant across utterances and is given by the UBM GMM weights. The second assumption is that the i-vector extractor matrix can be linearly transformed so that its per-Gaussian components are orthogonal. We use PCA and HLDA to estimate this transform.
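
To see why the first assumption helps: the per-utterance precision matrix in standard i-vector extraction depends on the utterance's per-Gaussian occupation counts, so it must be built and inverted for every utterance; with the alignment fixed to the UBM weights, it is shared and can be inverted once. A minimal sketch (shapes are illustrative assumptions: C Gaussians, F-dimensional features, R-dimensional i-vectors, Sigma the stacked diagonal of the UBM covariances):

```python
import numpy as np

def precompute_extractor(T, Sigma, ubm_weights, n_frames, F):
    """Assumption 1: occupation counts are n_frames * UBM weights for
    every utterance, so L = I + T' Sigma^-1 N T is constant and its
    inverse can be cached instead of recomputed per utterance."""
    N = np.repeat(n_frames * ubm_weights, F)   # fixed zeroth-order stats
    L = np.eye(T.shape[1]) + T.T @ (N[:, None] * T / Sigma[:, None])
    return np.linalg.inv(L)

def extract_ivector(L_inv, T, Sigma, f):
    """f: centered first-order statistics stacked into a (C*F,) vector."""
    return L_inv @ (T.T @ (f / Sigma))
```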

Patent
David Rasmussen1
09 May 2011
TL;DR: In this paper, a plurality of speakers may be recorded and associated with identity indicators, and voice prints for each speaker may be created; if the voice print for at least one speaker corresponds to a known user according to the identity indicators, a database entry associating the user with the voice print may be created.
Abstract: Voice print identification may be provided. A plurality of speakers may be recorded and associated with identity indicators. Voice prints for each speaker may be created. If the voice print for at least one speaker corresponds to a known user according to the identity indicators, a database entry associating the user with the voice print may be created. Additional information associated with the user may also be displayed.

Proceedings ArticleDOI
08 Dec 2011
TL;DR: A system for detecting spoofing attacks on speaker verification systems is described; the degradation of speaker verification performance in the presence of such attacks is shown, along with how spoofing detection can be used to mitigate that degradation.
Abstract: In this paper, we describe a system for detecting spoofing attacks on speaker verification systems. We define spoofing as impersonating a legitimate user. We focus on detecting two types of low-technology spoofs. First, we try to expose whether the test segment is a far-field microphone recording of the victim that has been replayed on a telephone handset using a loudspeaker. Second, we want to determine whether the recording has been created by cutting and pasting short recordings to forge the sentence requested by a text-dependent system. Such attacks are of critical importance for security applications like access to bank accounts. To detect the first type of spoof we extract several acoustic features from the speech signal. Spoof and non-spoof segments are classified using a support vector machine (SVM). Cut-and-paste attacks are detected by comparing the pitch and MFCC contours of the enrollment and test segments using dynamic time warping (DTW). We performed experiments using two databases created for this purpose. They include signals from landline and GSM telephone channels of 20 different speakers. We present performance results separately for each spoofing detection system and for the fusion of both. We have achieved error rates under 10% for all the conditions evaluated. We show the degradation in speaker verification performance in the presence of this kind of attack and how to use the spoofing detection to mitigate that degradation.
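
The cut-and-paste detector hinges on dynamic time warping of the enrollment and test contours: a genuine repetition of the prompted sentence aligns closely, while a spliced forgery leaves a larger residual. A minimal DTW sketch over 1-D pitch contours (the length normalization and any threshold are assumptions, not the paper's exact setup):

```python
import numpy as np

def dtw_distance(a, b):
    """Length-normalized dynamic time warping cost between two 1-D
    contours, e.g. per-frame pitch of enrollment vs. test utterances."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)
```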

Proceedings ArticleDOI
04 Mar 2011
TL;DR: The application of Hidden Markov Models, which have already been successfully implemented in speaker recognition systems, is proposed instead; the accelerometer data can be used directly to construct the model and thus form the basis for successful recognition.
Abstract: Biometric gait recognition based on accelerometer data is still a new field of research. It has the merit of offering an unobtrusive and hence user-friendly method for authentication on mobile phones. Most publications in this area are based on extracting cycles (two steps) from the gait data which are later used as features in the authentication process. In this paper the application of Hidden Markov Models is proposed instead. These have already been successfully implemented in speaker recognition systems. The advantage is that no error-prone cycle extraction has to be performed; the accelerometer data can be used directly to construct the model and thus form the basis for successful recognition. Testing this method with accelerometer data of 48 subjects recorded using a commercial off-the-shelf mobile phone, a false non-match rate (FNMR) of 10.42% at a false match rate (FMR) of 10.29% was obtained. This is half the error rate obtained when applying an advanced cycle extraction method to the same data set in previous work.
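
Skipping cycle extraction means the accelerometer stream can be fed to an HMM directly. A sketch using the hmmlearn package; the state count, threshold, and synthetic stand-in data are illustrative assumptions:

```python
import numpy as np
from hmmlearn import hmm

# Stand-in enrollment data: (n_samples, 3) x/y/z accelerometer readings
# from the enrolled user's walk (synthetic here for illustration).
acc = np.random.default_rng(0).standard_normal((2000, 3))

model = hmm.GaussianHMM(n_components=8, covariance_type="diag", n_iter=20)
model.fit(acc)   # no error-prone cycle extraction needed

def accept(probe, threshold=-5.0):
    """Accept if the length-normalized log-likelihood of the probe
    sequence clears a threshold tuned on development data."""
    return model.score(probe) / len(probe) > threshold
```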

Proceedings ArticleDOI
01 Dec 2011
TL;DR: Comparing the performances between MFCC and LFCC in the NIST SRE (Speaker Recognition Evaluation) 2010 extended-core task shows that LFCC consistently outperforms MFCC, mainly due to its better performance in the female trials, and shows some advantage of LFCC over MFCC in reverberant speech.
Abstract: Mel-frequency cepstral coefficients (MFCC) have been dominantly used in speaker recognition as well as in speech recognition. However, based on theories of speech production, some speaker characteristics associated with the structure of the vocal tract, particularly the vocal tract length, are reflected more in the high frequency range of speech. This insight suggests that a linear scale in frequency may provide some advantages in speaker recognition over the mel scale. Based on two state-of-the-art speaker recognition back-end systems (one Joint Factor Analysis system and one Probabilistic Linear Discriminant Analysis system), this study compares the performance of MFCC and LFCC (linear frequency cepstral coefficients) in the NIST SRE (Speaker Recognition Evaluation) 2010 extended-core task. Our results in SRE10 show that, while they are complementary to each other, LFCC consistently outperforms MFCC, mainly due to its better performance in the female trials. This can be explained by the relatively shorter vocal tract in females and the resulting higher formant frequencies in speech. LFCC benefits more in female speech by better capturing the spectral characteristics in the high frequency region. In addition, our results show some advantage of LFCC over MFCC in reverberant speech. LFCC is as robust as MFCC in babble noise, but not in white noise. It is concluded that LFCC should be more widely used, at least for the female trials, by the mainstream of the speaker recognition community.
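
LFCC differs from MFCC only in the spacing of the triangular filters: linear in Hz instead of mel-warped, which preserves resolution in the high-frequency region where vocal tract length differences show up. A minimal sketch (filter counts and FFT size are typical values, not the paper's exact configuration):

```python
import numpy as np
from scipy.fftpack import dct

def lfcc(frames, sr, n_filters=24, n_ceps=13, n_fft=512):
    """LFCC: identical pipeline to MFCC (power spectrum -> triangular
    filterbank -> log -> DCT) with linearly spaced filter edges.
    frames: (n_frames, frame_len) windowed time-domain frames."""
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    edges = np.linspace(0, sr / 2, n_filters + 2)   # the only change vs. mel
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):                      # triangular filters
        lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[i, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    return dct(np.log(power @ fbank.T + 1e-10),
               type=2, axis=1, norm="ortho")[:, :n_ceps]
```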

Journal ArticleDOI
22 May 2011
TL;DR: It is shown that it is possible to train a gender-independent discriminative model that achieves state-of-the-art accuracy, comparable to that of a gender-dependent system, saving memory and execution time both in training and in testing.
Abstract: This work presents a new and efficient approach to discriminative speaker verification in the i-vector space. We illustrate the development of a linear discriminative classifier that is trained to discriminate between the hypothesis that a pair of feature vectors in a trial belong to the same speaker or to different speakers. This approach is an alternative to the usual discriminative setup that discriminates between a speaker and all the other speakers. We use a discriminative classifier based on a Support Vector Machine (SVM) that is trained to estimate the parameters of a symmetric quadratic function approximating a log-likelihood ratio score without explicit modeling of the i-vector distributions as in the generative Probabilistic Linear Discriminant Analysis (PLDA) models. Training these models is feasible because it is not necessary to expand the i-vector pairs, which would be expensive or even impossible even for medium-sized training sets. The results of experiments performed on the tel-tel extended core condition of the NIST 2010 Speaker Recognition Evaluation are competitive with the ones obtained by generative models, in terms of normalized Detection Cost Function and Equal Error Rate. Moreover, we show that it is possible to train a gender-independent discriminative model that achieves state-of-the-art accuracy, comparable to that of a gender-dependent system, saving memory and execution time both in training and in testing.

Journal ArticleDOI
TL;DR: The different aspects of front-end analysis for speech recognition, including sound characteristics, feature extraction techniques, and spectral representations of the speech signal, are discussed.
Abstract: Automatic speech recognition (ASR) has made great strides with the development of digital signal processing hardware and software. But despite all these advances, machines cannot match the performance of their human counterparts in terms of accuracy and speed, especially in the case of speaker-independent speech recognition. Today, a significant portion of speech recognition research is therefore focused on the speaker-independent speech recognition problem. Before recognition, speech processing has to be carried out to obtain feature vectors of the signal, so front-end analysis plays an important role. The reasons are its wide range of applications and the limitations of available speech recognition techniques. In this report we briefly discuss the different aspects of front-end analysis for speech recognition, including sound characteristics, feature extraction techniques, and spectral representations of the speech signal. We also discuss the advantages and disadvantages of each feature extraction technique, along with the suitability of each method to particular applications.

Proceedings ArticleDOI
10 Jul 2011
TL;DR: A channel pattern noise based approach to guard speaker recognition systems against playback attacks; experimental results indicate that, with the designed playback detector, the equal error rate of the speaker recognition system is reduced by 30%.
Abstract: This paper proposes a channel pattern noise based approach to guard speaker recognition systems against playback attacks. For each recording under investigation, the channel pattern noise serves as a unique channel identification fingerprint. A denoising filter and statistical frames are applied to extract the channel pattern noise, and 6 Legendre coefficients and 6 statistical features are extracted. An SVM is used to train a channel noise model to judge whether the input speech is an authentic or a playback recording. The experimental results indicate that, with the designed playback detector, the equal error rate of the speaker recognition system is reduced by 30%.
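
A sketch of the Legendre-coefficient feature idea: fit a degree-5 Legendre expansion to the extracted channel-noise contour and use its six coefficients, alongside six simple statistics, as the SVM input. Which six statistics the paper uses is not specified here, so the ones below are illustrative assumptions:

```python
import numpy as np
from numpy.polynomial import legendre

def playback_features(noise_profile, degree=5):
    """6 Legendre coefficients + 6 summary statistics of a channel
    pattern noise contour, as input features for an SVM detector."""
    x = np.linspace(-1, 1, len(noise_profile))    # Legendre domain
    coeffs = legendre.legfit(x, noise_profile, degree)
    stats = np.array([noise_profile.mean(), noise_profile.std(),
                      noise_profile.min(), noise_profile.max(),
                      np.median(noise_profile), np.ptp(noise_profile)])
    return np.concatenate([coeffs, stats])
```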

Proceedings ArticleDOI
01 Dec 2011
TL;DR: A novel technique for discriminative feature-level adaptation of automatic speech recognition systems, found to be complementary to common adaptation techniques.
Abstract: We present a novel technique for discriminative feature-level adaptation of automatic speech recognition systems. The concept of iVectors, popular in speaker recognition, is used to extract information about the speaker or acoustic environment from a speech segment; an iVector is a low-dimensional fixed-length representation of such information. To utilize iVectors for adaptation, Region Dependent Linear Transforms (RDLT) are discriminatively trained using the MPE criterion on a large amount of annotated data to extract the relevant information from iVectors and to compensate the speech features. The approach was tested on standard CTS data. We found it to be complementary to common adaptation techniques. On a well-tuned RDLT system with standard CMLLR adaptation we reached a 0.8% additive absolute WER improvement.

Proceedings Article
01 Jan 2011
TL;DR: This work proposes a mixture of Probabilistic Linear Discriminant Analysis models (PLDA) as a solution for making systems independent of speaker gender and shows the effectiveness of the mixture model on microphone speech.
Abstract: The Speaker Recognition community that participates in NIST evaluations has concentrated on designing gender- and channel-conditioned systems. In the real world, this conditioning is not feasible. Our main purpose in this work is to propose a mixture of Probabilistic Linear Discriminant Analysis models (PLDA) as a solution for making systems independent of speaker gender. In order to show the effectiveness of the mixture model, we first experiment on 2010 NIST telephone speech (det5), where we prove that there is no loss of accuracy compared with a baseline gender-dependent model. We also successfully test the mixture model in a more realistic situation where there are cross-gender trials. Furthermore, we report results on microphone speech for the det1, det2, det3 and det4 tasks to confirm the effectiveness of the mixture model.

Proceedings Article
01 Jan 2011
TL;DR: A novel approach to flexible control of speaker characteristics using tensor representation of speaker space is described, which can solve an inherent problem of supervector representation, and it improves the performance of voice conversion.
Abstract: This paper describes a novel approach to flexible control of speaker characteristics using tensor representation of speaker space. In voice conversion studies, realization of conversion from/to an arbitrary speaker's voice is one of the important objectives. For this purpose, eigenvoice conversion (EVC) based on an eigenvoice Gaussian mixture model (EV-GMM) was proposed. In the EVC, similarly to speaker recognition approaches, a speaker space is constructed based on GMM supervectors which are high-dimensional vectors derived by concatenating the mean vectors of each of the speaker GMMs. In the speaker space, each speaker is represented by a small number of weight parameters of eigen-supervectors. In this paper, we revisit construction of the speaker space by introducing tensor analysis of the training data set. In our approach, each speaker is represented as a matrix of which the row and the column respectively correspond to the Gaussian component and the dimension of the mean vector, and the speaker space is derived by tensor analysis of the set of the matrices. Our approach can solve an inherent problem of supervector representation, and it improves the performance of voice conversion. Experimental results of one-to-many voice conversion demonstrate the effectiveness of the proposed approach. Index Terms: voice conversion, Gaussian mixture model, eigenvoice, tensor analysis, Tucker decomposition
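
The construction can be sketched with a truncated higher-order SVD, a standard way to compute a Tucker decomposition: stack each speaker's (components x dims) matrix of GMM means into a third-order tensor and extract per-mode factors. Ranks and shapes below are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def mode_dot(T, M, mode):
    """Multiply tensor T by matrix M along the given mode."""
    Tm = np.moveaxis(T, mode, 0)
    out = (M @ Tm.reshape(Tm.shape[0], -1)).reshape((M.shape[0],) + Tm.shape[1:])
    return np.moveaxis(out, 0, mode)

def hosvd(T, ranks):
    """Truncated HOSVD of a (speakers x components x dims) tensor; the
    speaker-mode factor plays the role of per-speaker weights."""
    factors = [np.linalg.svd(np.moveaxis(T, m, 0).reshape(T.shape[m], -1),
                             full_matrices=False)[0][:, :r]
               for m, r in enumerate(ranks)]
    core = T
    for m, U in enumerate(factors):
        core = mode_dot(core, U.T, m)
    return core, factors
```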

Proceedings Article
01 Oct 2011
TL;DR: A new approach to channel compensation in a low-dimensional total factor space, rather than in the GMM supervector space, contributes to a better understanding of the session variability characteristics in the total factor space.
Abstract: The total variability factor space in speaker verification system architectures based on Factor Analysis (FA) has greatly improved speaker recognition performance. Carrying out channel compensation in a low-dimensional total factor space, rather than in the GMM supervector space, allows for the application of new techniques. We propose here new intersession compensation and scoring methods. Furthermore, this new approach contributes to a better understanding of the session variability characteristics in the total factor space.

Journal ArticleDOI
TL;DR: The need to exploit the potential of group delay functions for the development of speech systems is demonstrated through the effectiveness of speech segmentation and of features derived from the modified group delay in applications such as language identification, speech recognition and speaker recognition.
Abstract: Traditionally, the information in speech signals is represented in terms of features derived from short-time Fourier analysis. In this analysis the features extracted from the magnitude of the Fourier transform (FT) are considered, ignoring the phase component. Although the significance of the FT phase was highlighted in several studies over the past three decades, the features of the FT phase were not exploited fully due to the difficulty in computing the phase and in processing the phase function. The information in the short-time FT phase function can be extracted by processing the derivative of the FT phase, i.e., the group delay function. In this paper, the properties of group delay functions are reviewed, highlighting the importance of the FT phase for representing information in the speech signal. Methods to process the group delay function are discussed to capture the characteristics of the vocal-tract system in the form of formants or through a modified group delay function. Applications of group delay functions for speech processing are discussed in some detail. They include segmentation of speech at syllable boundaries, exploiting the additive and high-resolution properties of group delay functions. The effectiveness of speech segmentation, and of the features derived from the modified group delay, is demonstrated in applications such as language identification, speech recognition and speaker recognition. The paper thus demonstrates the need to exploit the potential of group delay functions for the development of speech systems.
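
The group delay function can be computed without explicit (and numerically fragile) phase unwrapping, using the standard identity with Y the transform of the time-weighted signal; the modified group delay additionally replaces the denominator with a cepstrally smoothed spectrum. A minimal sketch of the plain version (frame length, FFT size, and the epsilon guard are illustrative):

```python
import numpy as np

def group_delay(frame, n_fft=512, eps=1e-8):
    """Group delay (negative derivative of the FT phase) via
    tau(w) = (X_R*Y_R + X_I*Y_I) / |X|^2, with Y = FFT(n * x[n])."""
    n = np.arange(len(frame))
    X = np.fft.rfft(frame, n_fft)
    Y = np.fft.rfft(n * frame, n_fft)
    return (X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + eps)
```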

Patent
17 Feb 2011
TL;DR: An enhanced speech recognition system and method are provided that may be used with a voice recognition wireless communication system and that take advantage of group-to-group calling statistics to improve the recognition of names by the speech recognition system.
Abstract: An enhanced speech recognition system and method are provided that may be used with a voice recognition wireless communication system. The enhanced speech recognition system and method take advantage of group-to-group calling statistics to improve the recognition of names by the speech recognition system.

Proceedings Article
12 Dec 2011
TL;DR: A multi-objective loss function is proposed for learning speaker-specific characteristics, with regularization that normalizes the interference of non-speaker-related information and avoids information loss.
Abstract: Speech conveys different yet mixed information ranging from linguistic to speaker-specific components, and each of them should be exclusively used in a specific task. However, it is extremely difficult to extract a specific information component given the fact that nearly all existing acoustic representations carry all types of speech information. Thus, the use of the same representation in both speech and speaker recognition hinders a system from producing better performance due to interference of irrelevant information. In this paper, we present a deep neural architecture to extract speaker-specific information from MFCCs. As a result, a multi-objective loss function is proposed for learning speaker-specific characteristics and regularization via normalizing interference of non-speaker related information and avoiding information loss. With LDC benchmark corpora and a Chinese speech corpus, we demonstrate that a resultant speaker-specific representation is insensitive to text/languages spoken and environmental mismatches and hence outperforms MFCCs and other state-of-the-art techniques in speaker recognition. We discuss relevant issues and relate our approach to previous work.
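
The flavor of such a multi-objective loss can be sketched as a contrastive term on embedding pairs plus a reconstruction term acting as the regularizer. The weighting, margin, and exact terms below are illustrative assumptions, not the paper's precise objective:

```python
import torch
import torch.nn.functional as F

def multi_objective_loss(emb1, emb2, same_speaker, recon, target,
                         alpha=0.5, margin=1.0):
    """Contrastive loss pulls same-speaker embeddings together and pushes
    different-speaker ones apart; reconstruction loss guards against
    information loss."""
    d = F.pairwise_distance(emb1, emb2)
    contrastive = torch.where(same_speaker, d ** 2,
                              torch.clamp(margin - d, min=0.0) ** 2).mean()
    return contrastive + alpha * F.mse_loss(recon, target)
```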

Journal ArticleDOI
TL;DR: In this article, a deep neural architecture (DNA) was proposed for learning speaker-specific characteristics from mel-frequency cepstral coefficients, an acoustic representation commonly used in both speech recognition and speaker recognition, which results in a speaker-specific overcomplete representation.
Abstract: Speech signals convey various yet mixed information ranging from linguistic to speaker-specific information. However, most acoustic representations characterize all the different kinds of information as a whole, which could hinder either a speech or a speaker recognition (SR) system from producing a better performance. In this paper, we propose a novel deep neural architecture (DNA) especially for learning speaker-specific characteristics from mel-frequency cepstral coefficients, an acoustic representation commonly used in both speech recognition and SR, which results in a speaker-specific overcomplete representation. In order to learn intrinsic speaker-specific characteristics, we come up with an objective function consisting of contrastive losses in terms of speaker similarity/dissimilarity and data reconstruction losses used as regularization to normalize the interference of non-speaker-related information. Moreover, we employ a hybrid learning strategy for learning parameters of the deep neural networks: i.e., local yet greedy layerwise unsupervised pretraining for initialization and global supervised learning for the ultimate discriminative goal. With four Linguistic Data Consortium (LDC) benchmarks and two non-English corpora, we demonstrate that our overcomplete representation is robust in characterizing various speakers, no matter whether their utterances have been used in training our DNA, and highly insensitive to text and languages spoken. Extensive comparative studies suggest that our approach yields favorable results in speaker verification and segmentation. Finally, we discuss several issues concerning our proposed approach.

Journal ArticleDOI
TL;DR: It is shown experimentally that increasing the inter-speaker variability in the UBM data while keeping the overall data size constant gradually improves system performance, dispelling the myth of "there's no data like more data" for the purpose of UBM construction.
Abstract: State-of-the-art Gaussian mixture model (GMM)-based speaker recognition/verification systems utilize a universal background model (UBM), which typically requires extensive resources, especially if multiple channel and microphone categories are considered. In this study, a systematic analysis of speaker verification system performance is considered for which the UBM data is selected and purposefully altered in different ways, including variation in the amount of data, sub-sampling structure of the feature frames, and variation in the number of speakers. An objective measure is formulated from the UBM covariance matrix which is found to be highly correlated with system performance when the data amount was varied while keeping the UBM data set constant, and increasing the number of UBM speakers while keeping the data amount constant. The advantages of feature sub-sampling for improving UBM training speed is also discussed, and a novel and effective phonetic distance-based frame selection method is developed. The sub-sampling methods presented are shown to retain baseline equal error rate (EER) system performance using only 1% of the original UBM data, resulting in a drastic reduction in UBM training computation time. This, in theory, dispels the myth of “There's no data like more data” for the purpose of UBM construction. With respect to the UBM speakers, the effect of systematically controlling the number of training (UBM) speakers versus overall system performance is analyzed. It is shown experimentally that increasing the inter-speaker variability in the UBM data while maintaining the overall total data size constant gradually improves system performance. Finally, two alternative speaker selection methods based on different speaker diversity measures are presented. Using the proposed schemes, it is shown that by selecting a diverse set of UBM speakers, the baseline system performance can be retained using less than 30% of the original UBM speakers.
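
The sub-sampling result is easy to sketch: train the UBM on a small random fraction of the pooled frames. The paper's phonetic distance-based selection is smarter than the uniform sampling shown here, and the component count is an illustrative assumption:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm_subsampled(frames, keep=0.01, n_components=512, seed=0):
    """GMM-UBM trained on a random 1% of the pooled feature frames,
    per the finding that EER can be retained at a fraction of the
    training cost."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(frames), size=int(keep * len(frames)), replace=False)
    return GaussianMixture(n_components=n_components,
                           covariance_type="diag").fit(frames[idx])
```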

Journal Article
TL;DR: This paper introduces speaker recognition in general and discusses its relevant parameters in relation to system performance.
Abstract: The explosive growth of information technology in the last decade has made a considerable impact on the design and construction of systems for human-machine communication, which is becoming increasingly important in many aspects of life. Amongst other speech processing tasks, a great deal of attention has been devoted to developing procedures that identify people from their voices, and the design and construction of speaker recognition systems has been a fascinating enterprise pursued over many decades. This paper introduces speaker recognition in general and discusses its relevant parameters in relation to system performance.

Journal ArticleDOI
TL;DR: The proposed CFCC features consistently perform better than the baseline MFCC features under all three mismatched testing conditions and compare favorably to perceptual linear predictive (PLP) and RASTA-PLP features.
Abstract: An auditory-based feature extraction algorithm is presented. We name the new features cochlear filter cepstral coefficients (CFCCs); they are defined based on a recently developed auditory transform (AT) plus a set of modules to emulate the signal processing functions in the cochlea. The CFCC features are applied to a speaker identification task to address the acoustic mismatch problem between training and testing environments. Usually, the performance of acoustic models trained on clean speech drops significantly when tested on noisy speech. The CFCC features have shown strong robustness in this kind of situation. In our experiments, the CFCC features consistently perform better than the baseline MFCC features under all three mismatched testing conditions: white noise, car noise, and babble noise. For example, in clean conditions, both MFCC and CFCC features perform similarly, over 96%, but when the signal-to-noise ratio (SNR) of the input signal is 6 dB, the accuracy of the MFCC features drops to 41.2%, while the CFCC features still achieve an accuracy of 88.3%. The proposed CFCC features also compare favorably to perceptual linear predictive (PLP) and RASTA-PLP features. The CFCC features consistently perform much better than PLP. Under white noise, the CFCC features are significantly better than RASTA-PLP, while under car and babble noise, the CFCC features provide similar performance to RASTA-PLP.

Patent
16 Mar 2011
TL;DR: In this article, a microphone array and a noise elimination method are used to enhance signal-to-noise ratio (SNR) to achieve goals of effectively improving voice communication quality and voice identification rate.
Abstract: A voice noise elimination method for microphone arrays, applicable to voice enhancement and voice identification, is disclosed. A microphone array and a noise elimination method are used to enhance the signal-to-noise ratio (SNR), with the goals of effectively improving voice communication quality and the voice identification rate. The method includes voice and non-voice section detection and two stages of voice enhancement: (1) collecting voice signals in the space with a linear microphone array and processing them to determine voice and non-voice sections; (2) estimating a long-term noise spectrum from the non-voice sections and subtracting it from the voice signal spectrum, the first stage of SNR enhancement; and (3) a second and final stage of SNR enhancement using residual noise estimation and elimination based on local characteristics of the voice signal.
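
The second stage amounts to spectral subtraction: estimate a long-term noise magnitude spectrum from the frames flagged as non-voice, then subtract it across the signal. A minimal sketch, where the over-subtraction factor and spectral floor are common heuristics rather than values from the patent, and noise_mask is assumed to come from the voice/non-voice detector:

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtract(x, fs, noise_mask, over=2.0, floor=0.05):
    """First-stage SNR enhancement via long-term noise spectrum
    subtraction; noise_mask flags the STFT frames judged non-voice."""
    f, t, X = stft(x, fs, nperseg=512)
    mag, phase = np.abs(X), np.angle(X)
    noise = mag[:, noise_mask].mean(axis=1, keepdims=True)  # long-term estimate
    clean = np.maximum(mag - over * noise, floor * mag)     # subtract + floor
    _, y = istft(clean * np.exp(1j * phase), fs, nperseg=512)
    return y
```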