
Showing papers on "Speaker recognition published in 2005"


Journal ArticleDOI
TL;DR: This work derives an exact solution to the problem of maximum likelihood estimation of the supervector covariance matrix used in extended MAP (or EMAP) speaker adaptation and shows how it can be regarded as a new method of eigenvoice estimation.
Abstract: We derive an exact solution to the problem of maximum likelihood estimation of the supervector covariance matrix used in extended MAP (or EMAP) speaker adaptation and show how it can be regarded as a new method of eigenvoice estimation. Unlike other approaches to the problem of estimating eigenvoices in situations where speaker-dependent training is not feasible, our method enables us to estimate as many eigenvoices from a given training set as there are training speakers. In the limit as the amount of training data for each speaker tends to infinity, it is equivalent to cluster adaptive training.
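
For readers who want the shape of the model, here is the standard eigenvoice supervector form this result rests on (a sketch in conventional notation, not the paper's exact derivation):

```latex
% Eigenvoice view of EMAP adaptation (conventional notation).
\[
  M_s = m + V y_s, \qquad y_s \sim \mathcal{N}(0, I)
  \quad\Longrightarrow\quad
  \operatorname{Cov}(M_s - m) = V V^{\top}.
\]
% Maximum likelihood estimation of the low-rank covariance $VV^{\top}$
% from $S$ training speakers yields at most $S$ eigenvoices (the
% columns of $V$), which is the bound stated in the abstract.
```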

523 citations


Journal ArticleDOI
TL;DR: Functional neuroimaging reveals that in the context of speaker recognition, the assessment of person familiarity does not necessarily engage supra-modal cortical substrates but can result from the direct sharing of information between auditory voice and visual face regions.
Abstract: Face and voice processing contribute to person recognition, but it remains unclear how the segregated specialized cortical modules interact. Using functional neuroimaging, we observed cross-modal responses to voices of familiar persons in the fusiform face area, as localized separately using visual stimuli. Voices of familiar persons only activated the face area during a task that emphasized speaker recognition over recognition of verbal content. Analyses of functional connectivity between cortical territories show that the fusiform face region is coupled with the superior temporal sulcus voice region during familiar speaker recognition, but not with any of the other cortical regions normally active in person recognition or in other tasks involving voices. These findings are relevant for models of the cognitive processes and neural circuitry involved in speaker recognition. They reveal that in the context of speaker recognition, the assessment of person familiarity does not necessarily engage supramodal cortical substrates but can result from the direct sharing of information between auditory voice and visual face regions.

310 citations


Proceedings ArticleDOI
18 Mar 2005
TL;DR: This work performs channel compensation in SVM modeling by removing non-speaker nuisance dimensions in the SVM expansion space via projections trained through an eigenvalue problem.
Abstract: Cross-channel degradation is one of the significant challenges facing speaker recognition systems. We study the problem for speaker recognition using support vector machines (SVMs). We perform channel compensation in SVM modeling by removing non-speaker nuisance dimensions in the SVM expansion space via projections. Training to remove these dimensions is accomplished via an eigenvalue problem. The eigenvalue problem attempts to reduce multisession variation for the same speaker, reduce different channel effects, and increase "distance" between different speakers. We apply our methods to a subset of the Switchboard 2 corpus. Experiments show dramatic improvement in performance for the cross-channel case.
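
A minimal sketch of this style of nuisance projection (the scatter-based criterion below is a simplification; the paper's eigenvalue problem also weighs channel effects and between-speaker distances):

```python
import numpy as np

def nap_projection(X, speaker_ids, k):
    """Project out the k directions of largest within-speaker
    (session/channel) variability from SVM expansion vectors.

    X: (n_sessions, dim) array of expansion vectors;
    speaker_ids: (n_sessions,) labels; k: nuisance rank to remove.
    """
    speaker_ids = np.asarray(speaker_ids)
    # Center each speaker's sessions to isolate within-speaker variation.
    W = np.vstack([X[speaker_ids == s] - X[speaker_ids == s].mean(axis=0)
                   for s in np.unique(speaker_ids)])
    # The eigenvalue problem: top eigenvectors of the within-speaker scatter.
    _, vecs = np.linalg.eigh(W.T @ W)
    V = vecs[:, -k:]                      # k largest nuisance directions
    return np.eye(X.shape[1]) - V @ V.T   # projection that removes them

# Usage: P = nap_projection(X_train, ids, k=64); X_clean = X_train @ P
```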

269 citations


Book ChapterDOI
01 Jan 2005
TL;DR: The National Institute of Standards and Technology (NIST) has coordinated annual scientific evaluations of text-independent speaker recognition since 1996, focusing primarily on speaker detection in the context of conversational telephone speech.
Abstract: The National Institute of Standards and Technology (NIST) has coordinated annual scientific evaluations of text-independent speaker recognition since 1996. These evaluations aim to provide important contributions to the direction of research efforts and the calibration of technical capabilities. They are intended to be of interest to all researchers working on the general problem of text-independent speaker recognition. To this end, the evaluations are designed to be simple, fully supported, accessible and focused on core technology issues. The evaluations have focused primarily on speaker detection in the context of conversational telephone speech. More recent evaluations have also included related tasks, such as speaker segmentation, and have used data in addition to conversational telephone speech. The evaluations are designed to foster research progress.

218 citations


Journal ArticleDOI
TL;DR: This paper presents a text-independent speaker verification system using support vector machines (SVMs) with score-space kernels and introduces a technique called spherical normalization that preconditions the Hessian matrix.
Abstract: This paper presents a text-independent speaker verification system using support vector machines (SVMs) with score-space kernels. Score-space kernels generalize Fisher kernels and are based on underlying generative models such as Gaussian mixture models (GMMs). This approach provides direct discrimination between whole sequences, in contrast with the frame-level approaches at the heart of most current systems. The resultant SVM feature space has very high dimensionality, since it is tied to the number of parameters in the underlying generative model. To address problems that arise in the resulting optimization, we introduce a technique called spherical normalization that preconditions the Hessian matrix. We have performed speaker verification experiments using the PolyVar database. The SVM system presented here achieves a 34% relative reduction in error rate compared to a GMM likelihood-ratio system.
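
As a concrete, simplified instance of the score-space idea, here is the classic Fisher-score map for a diagonal-covariance GMM, which turns a whole variable-length utterance into one fixed-length vector for the SVM; the paper's kernels generalize this construction, and spherical normalization is omitted here:

```python
import numpy as np

def fisher_score(frames, weights, means, variances):
    """Map an utterance (frames: T x d) to the gradient of its average
    GMM log-likelihood w.r.t. the component means (dimension K*d)."""
    K = len(weights)
    # Per-frame component log-densities (diagonal covariances).
    logp = np.stack([
        np.log(weights[k])
        - 0.5 * (((frames - means[k]) ** 2 / variances[k]).sum(axis=1)
                 + np.log(2 * np.pi * variances[k]).sum())
        for k in range(K)])                        # shape (K, T)
    # Responsibilities via a numerically stable softmax over components.
    post = np.exp(logp - logp.max(axis=0))
    post /= post.sum(axis=0)
    # Gradient w.r.t. each mean, averaged over frames and stacked: the
    # dimensionality is tied to the generative model's parameter count.
    grads = [(post[k][:, None] * (frames - means[k]) / variances[k]).mean(axis=0)
             for k in range(K)]
    return np.concatenate(grads)
```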

214 citations


Journal ArticleDOI
TL;DR: Overall, it is found that SVM modeling of prosodic feature sequences yields valuable information for automatic speaker recognition and offers rich new opportunities for exploring how speakers differ from each other in voluntary but habitual ways.

211 citations



Proceedings ArticleDOI
18 Mar 2005
TL;DR: The paper focuses on the innovative aspects of ALIZE, illustrating them with examples, and presents an experimental validation of the toolkit in the NIST 2004 speaker recognition evaluation campaign.
Abstract: This paper presents the ALIZE free speaker recognition toolkit. ALIZE is designed and developed within the framework of the ALIZE project, part of the French Research Ministry's Technolangue program. The paper focuses on the innovative aspects of ALIZE and illustrates them with examples. An experimental validation of the toolkit during the NIST 2004 speaker recognition evaluation campaign is also presented.

202 citations


Journal ArticleDOI
TL;DR: The two classifiers are compared in off-line signature verification using random, simple and simulated forgeries, to observe their capability to absorb intrapersonal variability and highlight interpersonal similarity.

199 citations


Patent
Injeong Choi
17 Feb 2005
TL;DR: In this paper, a domain-based speech recognition method and apparatus is proposed, which performs speech recognition by using a first language model and generating a first recognition result including a plurality of first recognition sentences.
Abstract: A domain-based speech recognition method and apparatus, the method including: performing speech recognition by using a first language model and generating a first recognition result including a plurality of first recognition sentences; selecting a plurality of candidate domains by using, as a domain keyword, a word that is included in each of the first recognition sentences and has a confidence score equal to or higher than a predetermined threshold; performing speech recognition on the first recognition result by using an acoustic model specific to each of the candidate domains and a second language model, and generating a plurality of second recognition sentences; and selecting one or more final recognition sentences from the first recognition sentences and the second recognition sentences. According to this method and apparatus, the effect on the final recognition result of a domain extraction error caused by misrecognition of a word can be minimized.
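
The claimed flow is easy to see as pseudocode. A loose sketch (the `Hypothesis` type and the three callables are hypothetical placeholders; the patent does not specify them):

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Hypothesis:
    words: List[Tuple[str, float]]   # (word, confidence) pairs
    score: float

def domain_based_decode(speech,
                        first_pass: Callable[[object], List[Hypothesis]],
                        domains_for: Callable[[List[str]], List[object]],
                        domain_pass: Callable[[object, object], List[Hypothesis]],
                        threshold: float) -> Hypothesis:
    # Pass 1: decode with the first (general) language model.
    hyps1 = first_pass(speech)
    # Domain keywords: sufficiently confident words in the first-pass sentences.
    keywords = [w for h in hyps1 for (w, c) in h.words if c >= threshold]
    # Pass 2: re-decode with each candidate domain's acoustic model
    # and second language model.
    hyps2 = [h for d in domains_for(keywords) for h in domain_pass(speech, d)]
    # The final sentence is chosen across BOTH passes, so a wrong domain
    # guess (from a misrecognized keyword) cannot force a bad result.
    return max(hyps1 + hyps2, key=lambda h: h.score)
```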

187 citations


PatentDOI
TL;DR: A speech processing system is described that includes a speech recognition unit to receive input speech and a natural language processor; an adaptation processor uses the fed-back information to adapt the acoustic models so that the speech recognition unit produces recognition results with higher precision than when the adaptation processor is not used.
Abstract: A speech processing system including a speech recognition unit to receive input speech, and a natural language processor. The speech recognition unit performs speech recognition on input speech using acoustic models to produce a speech recognition result. The natural language processor performs natural language processing on the speech recognition result, and includes: a speech zone detector configured to detect correct zones in the speech recognition result; and a feedback unit to feed information obtained from the natural language processing back to the speech recognition unit. The feedback information includes the detected correct zones. The speech recognition unit includes an adaptation processor that processes the feedback information to adapt the acoustic models, so that the speech recognition unit produces speech recognition results with higher precision than when the adaptation processor is not used.

Patent
Holger Scholl
24 May 2005
TL;DR: An interactive speech recognition system and a corresponding method are presented for determining the performance level of a speech recognition procedure on the basis of recorded background noise, giving the user reliable feedback on the expected performance of the recognition.
Abstract: The present invention provides an interactive speech recognition system and a corresponding method for determining the performance level of a speech recognition procedure on the basis of recorded background noise. The system exploits speech pauses that occur before the user enters the speech that becomes subject to recognition. Preferably, the performance prediction makes use of trained noise classification models. Moreover, predicted performance levels are indicated to the user in order to give reliable feedback on the performance of the speech recognition procedure. In this way the interactive speech recognition system can react to noise conditions that are too adverse for reliable speech recognition.

Patent
Ilya Skuratovsky
22 Nov 2005
TL;DR: A method of speech synthesis can include automatically identifying spoken passages and non-spoken passages within a text source and converting the text source to speech by applying different voice configurations according to whether each portion of text was identified as a spoken passage or not.
Abstract: A method of speech synthesis can include automatically identifying spoken passages and non-spoken passages within a text source and converting the text source to speech by applying different voice configurations to different portions of text within the text source according to whether each portion of text was identified as a spoken passage or a non-spoken passage. The method further can include identifying the speaker and/or the gender of the speaker and applying different voice configurations according to the speaker identity and/or speaker gender.

Proceedings ArticleDOI
04 Sep 2005
TL;DR: The use of adaptation transforms employed in speech recognition systems as features for speaker recognition is explored; the resulting speaker verification system is competitive with, and in some cases significantly more accurate than, state-of-the-art cepstral GMM and SVM systems.
Abstract: We explore the use of adaptation transforms employed in speech recognition systems as features for speaker recognition. This approach is attractive because, unlike standard frame-based cepstral speaker recognition models, it normalizes for the choice of spoken words in text-independent speaker verification. Affine transforms are computed for the Gaussian means of the acoustic models used in a recognizer, using maximum likelihood linear regression (MLLR). The high-dimensional vectors formed by the transform coefficients are then modeled as speaker features using support vector machines (SVMs). The resulting speaker verification system is competitive with, and in some cases significantly more accurate than, state-of-the-art cepstral Gaussian mixture and SVM systems. Further improvements are obtained by combining the baseline and MLLR-based systems.
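
The feature construction itself is simple; a minimal sketch (the dimensions are illustrative for a 39-dimensional front end):

```python
import numpy as np

def mllr_feature(A, b):
    """Flatten one MLLR transform (mu' = A @ mu + b) into a speaker
    feature vector; several regression-class transforms would be
    concatenated in practice and modeled with a (typically linear) SVM."""
    return np.concatenate([np.asarray(A).ravel(), np.asarray(b)])

# e.g. a 39-dim front end gives 39*39 + 39 = 1560 coefficients per transform.
```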

Journal ArticleDOI
TL;DR: A method based on the vowel onset point (VOP) is proposed for locating the end-points of an utterance; combining the evidence from spectral, suprasegmental and source features is found to improve the performance of the system significantly.
Abstract: This paper proposes a text-dependent (fixed-text) speaker verification system which uses different types of information for making a decision regarding the identity claim of a speaker. The baseline system uses the dynamic time warping (DTW) technique for matching. Detection of the end-points of an utterance is crucial for the performance of DTW-based template matching. A method based on the vowel onset point (VOP) is proposed for locating the end-points of an utterance. The proposed method for speaker verification uses suprasegmental and source features, besides spectral features. The suprasegmental features such as pitch and duration are extracted using the warping path information in the DTW algorithm. Features of the excitation source, extracted using neural network models, are also used in the text-dependent speaker verification system. Although the suprasegmental and source features individually may not yield good performance, combining the evidence from these features seems to improve the performance of the system significantly. Neural network models are used to combine the evidence from multiple sources of information.
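
For reference, the DTW template match at the core of the baseline can be sketched in a few lines (a textbook version, not the paper's exact configuration); the accumulated matrix also yields the warping path used for the suprasegmental features:

```python
import numpy as np

def dtw_distance(X, Y):
    """Textbook DTW between feature sequences X (n x d) and Y (m x d)."""
    n, m = len(X), len(Y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(X[i - 1] - Y[j - 1])   # local distance
            D[i, j] = cost + min(D[i - 1, j],            # insertion
                                 D[i, j - 1],            # deletion
                                 D[i - 1, j - 1])        # match
    return D[n, m] / (n + m)   # length-normalized distance
```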

Proceedings ArticleDOI
18 Mar 2005
TL;DR: An extension to the widely used score normalization technique of test normalization (Tnorm) for text-independent speaker verification is presented: speaker-adaptive Tnorm, which offers advantages over standard Tnorm by adjusting the cohort speaker set to the target model.
Abstract: We discuss an extension to the widely used score normalization technique of test normalization (Tnorm) for text-independent speaker verification. A new method, speaker-adaptive Tnorm, which offers advantages over standard Tnorm by adjusting the cohort speaker set to the target model, is presented. Examples of this improvement on the 2004 NIST SRE data are also presented.
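
A minimal sketch of the idea (standard Tnorm plus a speaker-adaptive cohort selection; the similarity criterion used to rank cohort models is a placeholder, as the abstract does not specify it):

```python
import numpy as np

def tnorm(raw_score, cohort_scores):
    """Standard Tnorm: normalize a trial score by the mean/std of the
    same test utterance scored against a cohort of impostor models."""
    cohort_scores = np.asarray(cohort_scores)
    return (raw_score - cohort_scores.mean()) / cohort_scores.std()

def adaptive_tnorm(raw_score, cohort_scores, similarity_to_target, n_best):
    """Adaptive variant: keep only the n_best cohort models most similar
    to the target model before normalizing."""
    keep = np.argsort(similarity_to_target)[-n_best:]
    return tnorm(raw_score, np.asarray(cohort_scores)[keep])
```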

Proceedings ArticleDOI
18 Mar 2005
TL;DR: It is shown how the factor analysis model for speaker verification can be successfully implemented using some fast approximations which result in minor degradations in accuracy and open up the possibility of training the model on very large databases such as the union of all of the Switchboard corpora.
Abstract: We show how the factor analysis model for speaker verification can be successfully implemented using some fast approximations which result in only minor degradations in accuracy and open up the possibility of training the model on very large databases such as the union of all of the Switchboard corpora. We tested our algorithms on the NIST 1999 evaluation set (carbon-button as well as electret handset data). Using warped cepstral features we obtained equal error rates of about 6.3% and minimum detection costs of about 0.022.
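
For context, this line of work usually writes the factor analysis model in the following supervector form (standard notation; a sketch of the structure, not this paper's training recipe):

```latex
% Speaker- and channel-dependent supervector in the factor analysis model:
\[
  M = m + V y + U x + D z,
\]
% m: speaker-independent (UBM) supervector; Vy: speaker factors
% (eigenvoices); Ux: channel factors (eigenchannels); Dz: diagonal
% MAP-style residual. The fast approximations concern estimating
% these terms at scale.
```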

Proceedings ArticleDOI
04 Sep 2005
TL;DR: A new approach that combines accent detection, accent-discriminative acoustic features, acoustic adaptation and model selection for accented Chinese speech recognition is proposed; experimental results show that this approach improves the recognition of accented speech.
Abstract: As speech recognition systems are used in ever more applications, it is crucial for the systems to be able to deal with accented speakers. Various techniques, such as acoustic model adaptation and pronunciation adaptation, have been reported to improve the recognition of non-native or accented speech. In this paper, we propose a new approach that combines accent detection, accent-discriminative acoustic features, acoustic adaptation and model selection for accented Chinese speech recognition. Experimental results show that this approach can improve the recognition of accented speech.

Journal ArticleDOI
TL;DR: An overview of the state-of-the-art in speaker recognition is offered, with special emphasis on the pros and cons of each approach and on current research lines.
Abstract: Recent advances in speech technologies have produced new tools that can be used to improve the performance and flexibility of speaker recognition. While there are few degrees of freedom or alternative methods when using fingerprint or iris identification techniques, speech offers much more flexibility and several levels at which recognition can be performed: the system can force the user to speak in a particular manner, different for each access attempt. Also, with voice input, the system has other degrees of freedom, such as the use of knowledge/codes that only the user knows, or dialectal/semantic traits that are difficult to forge. This paper offers an overview of the state-of-the-art in speaker recognition, with special emphasis on the pros and cons of each approach and on current research lines, which include improved classification systems and the use of high-level information by means of probabilistic grammars. In conclusion, speaker recognition is far from being a technology where all the possibilities have already been explored.

Proceedings ArticleDOI
27 Dec 2005
TL;DR: The potential of perceptive speech analysis and processing in combination with biologically plausible neural network processors is discussed, and illustrated with a preliminary test on recognition of French spoken digits from a small speech database.
Abstract: Speech recognition is very difficult in the context of noisy and corrupted speech. Most conventional techniques need huge databases to estimate speech (or noise) probability densities to perform recognition. We discuss the potential of perceptive speech analysis and processing in combination with biologically plausible neural network processors. We illustrate the potential of such non-linear processing of speech by means of a preliminary test on recognition of French spoken digits from a small speech database.

Journal ArticleDOI
TL;DR: Three traditional ASR parameterizations matched with Hidden Markov Models (HMMs) are compared to humans for speaker-dependent consonant recognition using nonsense syllables degraded by highpass filtering, lowpass filtering, or additive noise.

Proceedings ArticleDOI
04 Sep 2005
TL;DR: This paper proposes an accurate and efficiently computed approximation of the KL-divergence, based on the unscented transform (more commonly used as a better alternative to the extended Kalman filter); experimental results indicate that the proposed approximation outperforms previously suggested methods.
Abstract: This paper proposes a dissimilarity measure between two Gaussian mixture models (GMMs). Computing a distance measure between two GMMs that were learned from speech segments is a key element in speaker verification, speaker segmentation and many other related applications. A natural measure between two distributions is the Kullback-Leibler divergence. However, it cannot be computed analytically in the case of GMMs. We propose an accurate and efficiently computed approximation of the KL-divergence. The method is based on the unscented transform, which is usually used to obtain a better alternative to the extended Kalman filter. The suggested distance is evaluated experimentally on a speaker data set. The results indicate that our proposed approximation outperforms previously suggested methods.
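
A compact sketch of the approximation (using SciPy for the component densities; the sigma-point scaling follows the basic unscented transform with no tuning parameters):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_logpdf(x, weights, means, covs):
    """Log-density of a GMM at the points x (shape (n, d))."""
    comp = np.stack([np.log(w) + multivariate_normal.logpdf(x, m, c)
                     for w, m, c in zip(weights, means, covs)])
    return np.logaddexp.reduce(comp, axis=0)

def unscented_kl(f, g):
    """Approximate KL(f || g) for GMMs f, g = (weights, means, covs).

    KL(f||g) = E_f[log f - log g] has no closed form for GMMs; the
    expectation under each Gaussian component of f is replaced by an
    average over its 2d sigma points, as in the unscented transform.
    """
    weights, means, covs = f
    kl = 0.0
    for w, mu, cov in zip(weights, means, covs):
        d = len(mu)
        L = np.linalg.cholesky(d * np.asarray(cov))
        pts = np.vstack([mu + L.T, mu - L.T])   # the 2d sigma points
        kl += w * (gmm_logpdf(pts, *f) - gmm_logpdf(pts, *g)).mean()
    return kl
```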

Journal ArticleDOI
TL;DR: The proposed adaptive rule is more robust in the presence of unreliable modalities, and outperforms the hard-level max rule and soft-level weighted summation rule, provided that the employed reliability measure is effective in assessment of classifier decisions.
Abstract: We present a multimodal open-set speaker identification system that integrates information coming from audio, face and lip motion modalities. For fusion of multiple modalities, we propose a new adaptive cascade rule that favors reliable modality combinations through a cascade of classifiers. The order of the classifiers in the cascade is adaptively determined based on the reliability of each modality combination. A novel reliability measure that genuinely fits the open-set speaker identification problem is also proposed to assess the accept-or-reject decisions of a classifier. A formal framework based on the probability of correct decision is developed for analytical comparison of the proposed adaptive rule with other classifier combination rules. The proposed adaptive rule is more robust in the presence of unreliable modalities, and outperforms the hard-level max rule and the soft-level weighted summation rule, provided that the employed reliability measure is effective in assessing classifier decisions. Experimental results that support this assertion are provided.
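
The abstract does not give the rule's exact form; the following is only a loose, hypothetical sketch of what a reliability-ordered accept/reject cascade can look like:

```python
def cascade_identify(scores, reliability, margin):
    """Try modality combinations in decreasing order of estimated
    reliability; the first confident accept/reject decision wins.

    scores: dict mapping combination name -> signed classifier score;
    reliability: dict mapping combination name -> reliability estimate;
    margin: score magnitude required to count as a confident decision.
    """
    order = sorted(scores, key=reliability.get, reverse=True)
    for combo in order:
        if abs(scores[combo]) >= margin:          # confident decision
            return ("accept" if scores[combo] > 0 else "reject", combo)
    best = order[0]                               # fall back to most reliable
    return ("accept" if scores[best] > 0 else "reject", best)
```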

Proceedings ArticleDOI
18 Mar 2005
TL;DR: The various sensory modalities are processed both individually and jointly, and it is shown that the multimodal approach results in significantly improved performance in spatial localization, identification and speech activity detection of the participants.
Abstract: Our long-term objective is to create smart room technologies that are aware of the users' presence and behavior and can become an active, but not intrusive, part of the interaction. In this work, we present a multimodal approach for estimating and tracking the location and identity of the participants, including the active speaker. Our smart room design contains three user-monitoring systems: four CCD cameras, an omnidirectional camera and a 16-channel microphone array. The various sensory modalities are processed both individually and jointly, and it is shown that the multimodal approach results in significantly improved performance in spatial localization, identification and speech activity detection of the participants.

PatentDOI
TL;DR: Systems, methods, and computer program products for determining the voice recognition accuracy of a voice recognition system are provided; when a recognition error is identified, a solution is automatically implemented to eliminate its source.
Abstract: Systems, methods, and computer program products for determining voice recognition accuracy of a voice recognition system are provided. In one embodiment, voice recognition information produced by a voice recognition system in response to recognizing a user utterance is analyzed. The voice recognition information comprises a recognized voice command associated with the user utterance and a reference to an audio file that includes the user utterance. Based on the analysis, a recognition error may be identified and the source of the error determined. A solution is then automatically implemented to eliminate the source of the error. As part of the analysis, the user utterance may be transcribed to create a transcribed utterance, if the recognized voice command does not match the user utterance. The transcribed utterance may then be compared to the recognized voice command to identify an error.

Journal ArticleDOI
TL;DR: Results show a dissociation between speech and speaker recognition with primarily temporal cues, highlighting the limitation of current speech processing strategies in cochlear implants.
Abstract: Natural spoken language processing includes not only speech recognition but also identification of the speaker’s gender, age, emotional, and social status. Our purpose in this study is to evaluate whether temporal cues are sufficient to support both speech and speaker recognition. Ten cochlear-implant and six normal-hearing subjects were presented with vowel tokens spoken by three men, three women, two boys, and two girls. In one condition, the subject was asked to recognize the vowel. In the other condition, the subject was asked to identify the speaker. Extensive training was provided for the speaker recognition task. Normal-hearing subjects achieved nearly perfect performance in both tasks. Cochlear-implant subjects achieved good performance in vowel recognition but poor performance in speaker recognition. The level of the cochlear implant performance was functionally equivalent to normal performance with eight spectral bands for vowel recognition but only to one band for speaker recognition. These results show a dissociation between speech and speaker recognition with primarily temporal cues, highlighting the limitation of current speech processing strategies in cochlear implants. Several methods, including explicit encoding of fundamental frequency and frequency modulation, are proposed to improve speaker recognition for current cochlear implant users.

Journal ArticleDOI
TL;DR: Experiments show that the proposed robust MFCC-based feature significantly reduces the recognition error rate over a wide signal-to-noise ratio range.

Proceedings ArticleDOI
Bhiksha Raj, Paris Smaragdis
21 Nov 2005
TL;DR: An algorithm is presented for the separation of multiple speakers from mixed single-channel recordings by latent variable decomposition of the speech spectrogram; results show that it is very effective at separating mixed signals.
Abstract: In this paper we present an algorithm for the separation of multiple speakers from mixed single-channel recordings by latent variable decomposition of the speech spectrogram. We model each magnitude spectral vector in the short-time Fourier transform of a speech signal as the outcome of a discrete random process that generates frequency bin indices. The distribution of the process is modelled as a mixture of multinomial distributions, such that the mixture weights of the component multinomials vary from analysis window to analysis window. The component multinomials are assumed to be speaker specific and are learnt from training signals for each speaker. The distributions representing magnitude spectral vectors of the mixed signal are decomposed into mixtures of the multinomials for all component speakers. The frequency distribution, i.e. the spectrum, for each speaker is then reconstructed from this decomposition. Experimental results show that the proposed method is very effective at separating mixed signals.
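
A minimal sketch of the per-frame decomposition step, assuming speaker-specific multinomial bases have already been learnt (EM over the mixture weights only; training of the bases is analogous):

```python
import numpy as np

def frame_weights(v, B, n_iter=50):
    """Fit mixture weights w so that the multinomial mixture B @ w matches
    the normalized magnitude spectrum v / v.sum(), where B is an (F x K)
    matrix whose columns are the learnt basis spectra of all speakers."""
    K = B.shape[1]
    w = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: posterior of each component at each frequency bin.
        p = B * w
        p /= p.sum(axis=1, keepdims=True) + 1e-12
        # M-step: reweight components by the spectral energy they explain.
        w = (p * v[:, None]).sum(axis=0)
        w /= w.sum()
    return w

# Separation sketch: stack speaker A's and B's bases as B = [B_A | B_B],
# fit w per mixture frame, and reconstruct A's spectrum from B_A and the
# first K_A weights, scaled back to the frame energy.
```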

Journal ArticleDOI
TL;DR: Perception by normal-hearing subjects of the gender and identity of a talker was examined as a function of the number of channels in spectrally reduced speech; results showed that gender and talker identification was better for the sine-wave processor, and that performance with the noise-band processor was more sensitive to the number of channels.
Abstract: Considerable research on speech intelligibility for cochlear-implant users has been conducted using acoustic simulations with normal-hearing subjects. However, some relevant aspects of perception through cochlear implants remain scarcely explored. The present study examined the perception by normal-hearing subjects of the gender and identity of a talker as a function of the number of channels in spectrally reduced speech. Two simulation strategies were compared. They were implemented by two different processors that presented signals as either the sum of sine waves at the channel center frequencies or the sum of noise bands. In Experiment 1, 15 subjects determined the gender of 40 talkers (20 males + 20 females) from a natural utterance processed through 3, 4, 5, 6, 8, 10, 12, and 16 channels with both processors. In Experiment 2, 56 subjects matched a natural sentence uttered by 10 talkers with the corresponding simulation replicas processed through 3, 4, 8, and 16 channels for each processor. In Experiment 3, 72 subjects performed the same task but different sentences were used for natural and processed stimuli. A control Experiment 4 was conducted to equate the processing steps between the two simulation strategies. Results showed that gender and talker identification was better for the sine-wave processor, and that performance with the noise-band processor was more sensitive to the number of channels. Implications and possible explanations for the superiority of sine-wave simulations are discussed.

Journal ArticleDOI
TL;DR: In this paper, it is argued that formants, whose frequencies and dynamics are the product of the interaction of an individual vocal tract with the idiosyncratic articulatory gestures needed to achieve linguistically agreed targets, are so central to speaker identity that they must play a pivotal role in speaker identification.
Abstract: Views differ on the relative importance for forensic speaker identification of different aspects of the speech signal. It is argued here that formants, whose frequencies and dynamics are the product of the interaction of an individual vocal tract with the idiosyncratic articulatory gestures needed to achieve linguistically agreed targets, are so central to speaker identity that they must play a pivotal role in speaker identification. As a practical demonstration, a case is described in which F1/F2 analysis of a vowel and F2 analysis of three diphthongs show a consistent separation between two recordings, thus effectively eliminating a suspect as the maker of obscene telephone calls. Subsequent additional analysis, based on the statistical distribution of formant frequency estimates throughout the samples, confirms the distinctness of the voice of the suspect and that of the obscene caller. The theoretical foundation for several kinds of formant-based analysis is then discussed.