
Showing papers on "Speaker recognition published in 2005"


Journal ArticleDOI
TL;DR: This work derives an exact solution to the problem of maximum likelihood estimation of the supervector covariance matrix used in extended MAP (or EMAP) speaker adaptation and shows how it can be regarded as a new method of eigenvoice estimation.
Abstract: We derive an exact solution to the problem of maximum likelihood estimation of the supervector covariance matrix used in extended MAP (or EMAP) speaker adaptation and show how it can be regarded as a new method of eigenvoice estimation. Unlike other approaches to the problem of estimating eigenvoices in situations where speaker-dependent training is not feasible, our method enables us to estimate as many eigenvoices from a given training set as there are training speakers. In the limit as the amount of training data for each speaker tends to infinity, it is equivalent to cluster adaptive training.
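
For readers who want the shape of the model, here is the standard eigenvoice supervector form this result rests on (a sketch in conventional notation, not the paper's exact derivation):

```latex
% Eigenvoice view of EMAP adaptation (conventional notation).
\[
  M_s = m + V y_s, \qquad y_s \sim \mathcal{N}(0, I)
  \quad\Longrightarrow\quad
  \operatorname{Cov}(M_s - m) = V V^{\top}.
\]
% Maximum likelihood estimation of the low-rank covariance $VV^{\top}$
% from $S$ training speakers yields at most $S$ eigenvoices (the
% columns of $V$), which is the bound stated in the abstract.
```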

523 citations


Journal ArticleDOI
TL;DR: Functional neuroimaging reveals that in the context of speaker recognition, the assessment of person familiarity does not necessarily engage supra-modal cortical substrates but can result from the direct sharing of information between auditory voice and visual face regions.
Abstract: Face and voice processing contribute to person recognition, but it remains unclear how the segregated specialized cortical modules interact. Using functional neuroimaging, we observed cross-modal responses to voices of familiar persons in the fusiform face area, as localized separately using visual stimuli. Voices of familiar persons only activated the face area during a task that emphasized speaker recognition over recognition of verbal content. Analyses of functional connectivity between cortical territories show that the fusiform face region is coupled with the superior temporal sulcus voice region during familiar speaker recognition, but not with any of the other cortical regions normally active in person recognition or in other tasks involving voices. These findings are relevant for models of the cognitive processes and neural circuitry involved in speaker recognition. They reveal that in the context of speaker recognition, the assessment of person familiarity does not necessarily engage supramodal cortical substrates but can result from the direct sharing of information between auditory voice and visual face regions.

310 citations


Proceedings ArticleDOI
18 Mar 2005
TL;DR: This work performs channel compensation in SVM modeling by removing non-speaker nuisance dimensions in the SVM expansion space via projections trained through an eigenvalue problem.
Abstract: Cross-channel degradation is one of the significant challenges facing speaker recognition systems. We study the problem for speaker recognition using support vector machines (SVMs). We perform channel compensation in SVM modeling by removing non-speaker nuisance dimensions in the SVM expansion space via projections. Training to remove these dimensions is accomplished via an eigenvalue problem. The eigenvalue problem attempts to reduce multisession variation for the same speaker, reduce different channel effects, and increase "distance" between different speakers. We apply our methods to a subset of the Switchboard 2 corpus. Experiments show dramatic improvement in performance for the cross-channel case.
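
A minimal sketch of this style of nuisance projection (the scatter-based criterion below is a simplification; the paper's eigenvalue problem also weighs channel effects and between-speaker distances):

```python
import numpy as np

def nap_projection(X, speaker_ids, k):
    """Project out the k directions of largest within-speaker
    (session/channel) variability from SVM expansion vectors.

    X: (n_sessions, dim) array of expansion vectors;
    speaker_ids: (n_sessions,) labels; k: nuisance rank to remove.
    """
    speaker_ids = np.asarray(speaker_ids)
    # Center each speaker's sessions to isolate within-speaker variation.
    W = np.vstack([X[speaker_ids == s] - X[speaker_ids == s].mean(axis=0)
                   for s in np.unique(speaker_ids)])
    # The eigenvalue problem: top eigenvectors of the within-speaker scatter.
    _, vecs = np.linalg.eigh(W.T @ W)
    V = vecs[:, -k:]                      # k largest nuisance directions
    return np.eye(X.shape[1]) - V @ V.T   # projection that removes them

# Usage: P = nap_projection(X_train, ids, k=64); X_clean = X_train @ P
```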

269 citations


Book ChapterDOI
01 Jan 2005
TL;DR: The National Institute of Standards and Technology (NIST) has coordinated annual scientific evaluations of text-independent speaker recognition since 1996, focusing primarily on speaker detection in the context of conversational telephone speech.
Abstract: The National Institute of Standards and Technology (NIST) has coordinated annual scientific evaluations of text-independent speaker recognition since 1996. These evaluations aim to provide important contributions to the direction of research efforts and the calibration of technical capabilities. They are intended to be of interest to all researchers working on the general problem of text-independent speaker recognition. To this end, the evaluations are designed to be simple, fully supported, accessible and focused on core technology issues. The evaluations have focused primarily on speaker detection in the context of conversational telephone speech. More recent evaluations have also included related tasks, such as speaker segmentation, and have used data in addition to conversational telephone speech. The evaluations are designed to foster research progress.

218 citations


Journal ArticleDOI
TL;DR: This paper presents a text-independent speaker verification system using support vector machines (SVMs) with score-space kernels and introduces a technique called spherical normalization that preconditions the Hessian matrix.
Abstract: This paper presents a text-independent speaker verification system using support vector machines (SVMs) with score-space kernels. Score-space kernels generalize Fisher kernels and are based on underlying generative models such as Gaussian mixture models (GMMs). This approach provides direct discrimination between whole sequences, in contrast with the frame-level approaches at the heart of most current systems. The resultant SVM feature space has very high dimensionality, since it is tied to the number of parameters in the underlying generative model. To address problems that arise in the resulting optimization, we introduce a technique called spherical normalization that preconditions the Hessian matrix. We have performed speaker verification experiments using the PolyVar database. The SVM system presented here achieves a 34% relative reduction in error rate compared to a GMM likelihood-ratio system.
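
As a concrete, simplified instance of the score-space idea, here is the classic Fisher-score map for a diagonal-covariance GMM, which turns a whole variable-length utterance into one fixed-length vector for the SVM; the paper's kernels generalize this construction, and spherical normalization is omitted here:

```python
import numpy as np

def fisher_score(frames, weights, means, variances):
    """Map an utterance (frames: T x d) to the gradient of its average
    GMM log-likelihood w.r.t. the component means (dimension K*d)."""
    K = len(weights)
    # Per-frame component log-densities (diagonal covariances).
    logp = np.stack([
        np.log(weights[k])
        - 0.5 * (((frames - means[k]) ** 2 / variances[k]).sum(axis=1)
                 + np.log(2 * np.pi * variances[k]).sum())
        for k in range(K)])                        # shape (K, T)
    # Responsibilities via a numerically stable softmax over components.
    post = np.exp(logp - logp.max(axis=0))
    post /= post.sum(axis=0)
    # Gradient w.r.t. each mean, averaged over frames and stacked: the
    # dimensionality is tied to the generative model's parameter count.
    grads = [(post[k][:, None] * (frames - means[k]) / variances[k]).mean(axis=0)
             for k in range(K)]
    return np.concatenate(grads)
```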

214 citations


Journal ArticleDOI
TL;DR: Overall, it is found that SVM modeling of prosodic feature sequences yields valuable information for automatic speaker recognition and offers rich new opportunities for exploring how speakers differ from each other in voluntary but habitual ways.

211 citations



Proceedings ArticleDOI
18 Mar 2005
TL;DR: The paper focuses on the innovative aspects of ALIZE, illustrating them with examples, and presents an experimental validation of the toolkit in the NIST 2004 speaker recognition evaluation campaign.
Abstract: This paper presents the ALIZE free speaker recognition toolkit. ALIZE is designed and developed within the framework of the ALIZE project, part of the French Research Ministry's Technolangue program. The paper focuses on the innovative aspects of ALIZE and illustrates them with examples. An experimental validation of the toolkit during the NIST 2004 speaker recognition evaluation campaign is also presented.

202 citations


Journal ArticleDOI
TL;DR: The two classifiers are compared in off-line signature verification using random, simple and simulated forgeries, to observe their capability to absorb intrapersonal variability and highlight interpersonal similarity.

199 citations


Patent
Injeong Choi
17 Feb 2005
TL;DR: In this paper, a domain-based speech recognition method and apparatus is proposed, which performs speech recognition by using a first language model and generating a first recognition result including a plurality of first recognition sentences.
Abstract: A domain-based speech recognition method and apparatus, the method including: performing speech recognition by using a first language model and generating a first recognition result including a plurality of first recognition sentences; selecting a plurality of candidate domains by using, as a domain keyword, a word that is included in each of the first recognition sentences and has a confidence score equal to or higher than a predetermined threshold; performing speech recognition on the first recognition result by using an acoustic model specific to each of the candidate domains and a second language model, and generating a plurality of second recognition sentences; and selecting one or more final recognition sentences from the first recognition sentences and the second recognition sentences. According to this method and apparatus, the effect on the final recognition result of a domain extraction error caused by misrecognition of a word can be minimized.
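
The claimed flow is easy to see as pseudocode. A loose sketch (the `Hypothesis` type and the three callables are hypothetical placeholders; the patent does not specify them):

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Hypothesis:
    words: List[Tuple[str, float]]   # (word, confidence) pairs
    score: float

def domain_based_decode(speech,
                        first_pass: Callable[[object], List[Hypothesis]],
                        domains_for: Callable[[List[str]], List[object]],
                        domain_pass: Callable[[object, object], List[Hypothesis]],
                        threshold: float) -> Hypothesis:
    # Pass 1: decode with the first (general) language model.
    hyps1 = first_pass(speech)
    # Domain keywords: sufficiently confident words in the first-pass sentences.
    keywords = [w for h in hyps1 for (w, c) in h.words if c >= threshold]
    # Pass 2: re-decode with each candidate domain's acoustic model
    # and second language model.
    hyps2 = [h for d in domains_for(keywords) for h in domain_pass(speech, d)]
    # The final sentence is chosen across BOTH passes, so a wrong domain
    # guess (from a misrecognized keyword) cannot force a bad result.
    return max(hyps1 + hyps2, key=lambda h: h.score)
```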

187 citations


PatentDOI
TL;DR: A speech processing system is described that includes a speech recognition unit to receive input speech and a natural language processor; an adaptation processor uses the fed-back information to adapt the acoustic models so that the speech recognition unit produces recognition results with higher precision than when the adaptation processor is not used.
Abstract: A speech processing system including a speech recognition unit to receive input speech, and a natural language processor. The speech recognition unit performs speech recognition on input speech using acoustic models to produce a speech recognition result. The natural language processor performs natural language processing on the speech recognition result, and includes: a speech zone detector configured to detect correct zones in the speech recognition result; and a feedback unit to feed information obtained from the natural language processing back to the speech recognition unit. The feedback information includes the detected correct zones. The speech recognition unit includes an adaptation processor that processes the feedback information to adapt the acoustic models, so that the speech recognition unit produces speech recognition results with higher precision than when the adaptation processor is not used.

Patent
Holger Scholl
24 May 2005
TL;DR: An interactive speech recognition system and a corresponding method are presented for determining the performance level of a speech recognition procedure on the basis of recorded background noise, giving the user reliable feedback on the expected performance of the recognition.
Abstract: The present invention provides an interactive speech recognition system and a corresponding method for determining the performance level of a speech recognition procedure on the basis of recorded background noise. The system exploits speech pauses that occur before the user enters the speech that becomes subject to recognition. Preferably, the performance prediction makes use of trained noise classification models. Moreover, predicted performance levels are indicated to the user in order to give reliable feedback on the performance of the speech recognition procedure. In this way the interactive speech recognition system can react to noise conditions that are too adverse for reliable speech recognition.

Patent
Ilya Skuratovsky
22 Nov 2005
TL;DR: A method of speech synthesis can include automatically identifying spoken passages and non-spoken passages within a text source and converting the text source to speech by applying different voice configurations according to whether each portion of text was identified as a spoken passage or not.
Abstract: A method of speech synthesis can include automatically identifying spoken passages and non-spoken passages within a text source and converting the text source to speech by applying different voice configurations to different portions of text within the text source according to whether each portion of text was identified as a spoken passage or a non-spoken passage. The method further can include identifying the speaker and/or the gender of the speaker and applying different voice configurations according to the speaker identity and/or speaker gender.

Proceedings ArticleDOI
04 Sep 2005
TL;DR: The use of adaptation transforms employed in speech recognition systems as features for speaker recognition is explored; the resulting speaker verification system is competitive with, and in some cases significantly more accurate than, state-of-the-art cepstral GMM and SVM systems.
Abstract: We explore the use of adaptation transforms employed in speech recognition systems as features for speaker recognition. This approach is attractive because, unlike standard frame-based cepstral speaker recognition models, it normalizes for the choice of spoken words in text-independent speaker verification. Affine transforms are computed for the Gaussian means of the acoustic models used in a recognizer, using maximum likelihood linear regression (MLLR). The high-dimensional vectors formed by the transform coefficients are then modeled as speaker features using support vector machines (SVMs). The resulting speaker verification system is competitive with, and in some cases significantly more accurate than, state-of-the-art cepstral Gaussian mixture and SVM systems. Further improvements are obtained by combining the baseline and MLLR-based systems.
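
The feature construction itself is simple; a minimal sketch (the dimensions are illustrative for a 39-dimensional front end):

```python
import numpy as np

def mllr_feature(A, b):
    """Flatten one MLLR transform (mu' = A @ mu + b) into a speaker
    feature vector; several regression-class transforms would be
    concatenated in practice and modeled with a (typically linear) SVM."""
    return np.concatenate([np.asarray(A).ravel(), np.asarray(b)])

# e.g. a 39-dim front end gives 39*39 + 39 = 1560 coefficients per transform.
```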

Journal ArticleDOI
TL;DR: A method based on the vowel onset point (VOP) is proposed for locating the end-points of an utterance; combining the evidence from spectral, suprasegmental and source features is found to improve the performance of the system significantly.
Abstract: This paper proposes a text-dependent (fixed-text) speaker verification system which uses different types of information for making a decision regarding the identity claim of a speaker. The baseline system uses the dynamic time warping (DTW) technique for matching. Detection of the end-points of an utterance is crucial for the performance of DTW-based template matching. A method based on the vowel onset point (VOP) is proposed for locating the end-points of an utterance. The proposed method for speaker verification uses suprasegmental and source features, besides spectral features. The suprasegmental features such as pitch and duration are extracted using the warping path information in the DTW algorithm. Features of the excitation source, extracted using neural network models, are also used in the text-dependent speaker verification system. Although the suprasegmental and source features individually may not yield good performance, combining the evidence from these features seems to improve the performance of the system significantly. Neural network models are used to combine the evidence from multiple sources of information.
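
For reference, the DTW template match at the core of the baseline can be sketched in a few lines (a textbook version, not the paper's exact configuration); the accumulated matrix also yields the warping path used for the suprasegmental features:

```python
import numpy as np

def dtw_distance(X, Y):
    """Textbook DTW between feature sequences X (n x d) and Y (m x d)."""
    n, m = len(X), len(Y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(X[i - 1] - Y[j - 1])   # local distance
            D[i, j] = cost + min(D[i - 1, j],            # insertion
                                 D[i, j - 1],            # deletion
                                 D[i - 1, j - 1])        # match
    return D[n, m] / (n + m)   # length-normalized distance
```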

Proceedings ArticleDOI
18 Mar 2005
TL;DR: An extension to the widely used score normalization technique of test normalization (Tnorm) for text-independent speaker verification is presented: speaker-adaptive Tnorm, which offers advantages over standard Tnorm by adjusting the cohort speaker set to the target model.
Abstract: We discuss an extension to the widely used score normalization technique of test normalization (Tnorm) for text-independent speaker verification. A new method, speaker-adaptive Tnorm, which offers advantages over standard Tnorm by adjusting the cohort speaker set to the target model, is presented. Examples of this improvement on the 2004 NIST SRE data are also presented.
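
A minimal sketch of the idea (standard Tnorm plus a speaker-adaptive cohort selection; the similarity criterion used to rank cohort models is a placeholder, as the abstract does not specify it):

```python
import numpy as np

def tnorm(raw_score, cohort_scores):
    """Standard Tnorm: normalize a trial score by the mean/std of the
    same test utterance scored against a cohort of impostor models."""
    cohort_scores = np.asarray(cohort_scores)
    return (raw_score - cohort_scores.mean()) / cohort_scores.std()

def adaptive_tnorm(raw_score, cohort_scores, similarity_to_target, n_best):
    """Adaptive variant: keep only the n_best cohort models most similar
    to the target model before normalizing."""
    keep = np.argsort(similarity_to_target)[-n_best:]
    return tnorm(raw_score, np.asarray(cohort_scores)[keep])
```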

Proceedings ArticleDOI
18 Mar 2005
TL;DR: It is shown how the factor analysis model for speaker verification can be successfully implemented using some fast approximations which result in minor degradations in accuracy and open up the possibility of training the model on very large databases such as the union of all of the Switchboard corpora.
Abstract: We show how the factor analysis model for speaker verification can be successfully implemented using some fast approximations which result in only minor degradations in accuracy and open up the possibility of training the model on very large databases such as the union of all of the Switchboard corpora. We tested our algorithms on the NIST 1999 evaluation set (carbon-button as well as electret handset data). Using warped cepstral features we obtained equal error rates of about 6.3% and minimum detection costs of about 0.022.
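
For context, this line of work usually writes the factor analysis model in the following supervector form (standard notation; a sketch of the structure, not this paper's training recipe):

```latex
% Speaker- and channel-dependent supervector in the factor analysis model:
\[
  M = m + V y + U x + D z,
\]
% m: speaker-independent (UBM) supervector; Vy: speaker factors
% (eigenvoices); Ux: channel factors (eigenchannels); Dz: diagonal
% MAP-style residual. The fast approximations concern estimating
% these terms at scale.
```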

Proceedings ArticleDOI
04 Sep 2005
TL;DR: A new approach that combines accent detection, accent-discriminative acoustic features, acoustic adaptation and model selection for accented Chinese speech recognition is proposed; experimental results show that this approach improves the recognition of accented speech.
Abstract: As speech recognition systems are used in ever more applications, it is crucial for the systems to be able to deal with accented speakers. Various techniques, such as acoustic model adaptation and pronunciation adaptation, have been reported to improve the recognition of non-native or accented speech. In this paper, we propose a new approach that combines accent detection, accent-discriminative acoustic features, acoustic adaptation and model selection for accented Chinese speech recognition. Experimental results show that this approach can improve the recognition of accented speech.

Journal ArticleDOI
TL;DR: An overview of the state-of-the-art in speaker recognition is offered, with special emphasis on the pros and cons of each approach and on current research lines.
Abstract: Recent advances in speech technologies have produced new tools that can be used to improve the performance and flexibility of speaker recognition. While there are few degrees of freedom or alternative methods when using fingerprint or iris identification techniques, speech offers much more flexibility and several levels at which recognition can be performed: the system can force the user to speak in a particular manner, different for each access attempt. Also, with voice input, the system has other degrees of freedom, such as the use of knowledge/codes that only the user knows, or dialectal/semantic traits that are difficult to forge. This paper offers an overview of the state-of-the-art in speaker recognition, with special emphasis on the pros and cons of each approach and on current research lines, which include improved classification systems and the use of high-level information by means of probabilistic grammars. In conclusion, speaker recognition is far from being a technology where all the possibilities have already been explored.

Proceedings ArticleDOI
27 Dec 2005
TL;DR: The potential of perceptive speech analysis and processing in combination with biologically plausible neural network processors is discussed, and illustrated with a preliminary test on recognition of French spoken digits from a small speech database.
Abstract: Speech recognition is very difficult in the context of noisy and corrupted speech. Most conventional techniques need huge databases to estimate speech (or noise) probability densities to perform recognition. We discuss the potential of perceptive speech analysis and processing in combination with biologically plausible neural network processors. We illustrate the potential of such non-linear processing of speech by means of a preliminary test on recognition of French spoken digits from a small speech database.

Journal ArticleDOI
TL;DR: Three traditional ASR parameterizations matched with Hidden Markov Models (HMMs) are compared to humans for speaker-dependent consonant recognition using nonsense syllables degraded by highpass filtering, lowpass filtering, or additive noise.

Proceedings ArticleDOI
04 Sep 2005
TL;DR: This paper proposes an accurate and efficiently computed approximation of the KL-divergence, based on the unscented transform (more commonly used as a better alternative to the extended Kalman filter); experimental results indicate that the proposed approximation outperforms previously suggested methods.
Abstract: This paper proposes a dissimilarity measure between two Gaussian mixture models (GMMs). Computing a distance measure between two GMMs that were learned from speech segments is a key element in speaker verification, speaker segmentation and many other related applications. A natural measure between two distributions is the Kullback-Leibler divergence. However, it cannot be computed analytically in the case of GMMs. We propose an accurate and efficiently computed approximation of the KL-divergence. The method is based on the unscented transform, which is usually used to obtain a better alternative to the extended Kalman filter. The suggested distance is evaluated experimentally on a speaker data set. The results indicate that our proposed approximation outperforms previously suggested methods.
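
A compact sketch of the approximation (using SciPy for the component densities; the sigma-point scaling follows the basic unscented transform with no tuning parameters):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_logpdf(x, weights, means, covs):
    """Log-density of a GMM at the points x (shape (n, d))."""
    comp = np.stack([np.log(w) + multivariate_normal.logpdf(x, m, c)
                     for w, m, c in zip(weights, means, covs)])
    return np.logaddexp.reduce(comp, axis=0)

def unscented_kl(f, g):
    """Approximate KL(f || g) for GMMs f, g = (weights, means, covs).

    KL(f||g) = E_f[log f - log g] has no closed form for GMMs; the
    expectation under each Gaussian component of f is replaced by an
    average over its 2d sigma points, as in the unscented transform.
    """
    weights, means, covs = f
    kl = 0.0
    for w, mu, cov in zip(weights, means, covs):
        d = len(mu)
        L = np.linalg.cholesky(d * np.asarray(cov))
        pts = np.vstack([mu + L.T, mu - L.T])   # the 2d sigma points
        kl += w * (gmm_logpdf(pts, *f) - gmm_logpdf(pts, *g)).mean()
    return kl
```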

Journal ArticleDOI
TL;DR: The proposed adaptive rule is more robust in the presence of unreliable modalities, and outperforms the hard-level max rule and soft-level weighted summation rule, provided that the employed reliability measure is effective in assessment of classifier decisions.
Abstract: We present a multimodal open-set speaker identification system that integrates information coming from audio, face and lip motion modalities. For fusion of multiple modalities, we propose a new adaptive cascade rule that favors reliable modality combinations through a cascade of classifiers. The order of the classifiers in the cascade is adaptively determined based on the reliability of each modality combination. A novel reliability measure that genuinely fits the open-set speaker identification problem is also proposed to assess the accept-or-reject decisions of a classifier. A formal framework based on the probability of correct decision is developed for analytical comparison of the proposed adaptive rule with other classifier combination rules. The proposed adaptive rule is more robust in the presence of unreliable modalities, and outperforms the hard-level max rule and the soft-level weighted summation rule, provided that the employed reliability measure is effective in assessing classifier decisions. Experimental results that support this assertion are provided.
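
The abstract does not give the rule's exact form; the following is only a loose, hypothetical sketch of what a reliability-ordered accept/reject cascade can look like:

```python
def cascade_identify(scores, reliability, margin):
    """Try modality combinations in decreasing order of estimated
    reliability; the first confident accept/reject decision wins.

    scores: dict mapping combination name -> signed classifier score;
    reliability: dict mapping combination name -> reliability estimate;
    margin: score magnitude required to count as a confident decision.
    """
    order = sorted(scores, key=reliability.get, reverse=True)
    for combo in order:
        if abs(scores[combo]) >= margin:          # confident decision
            return ("accept" if scores[combo] > 0 else "reject", combo)
    best = order[0]                               # fall back to most reliable
    return ("accept" if scores[best] > 0 else "reject", best)
```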

Proceedings ArticleDOI
18 Mar 2005
TL;DR: The various sensory modalities are processed both individually and jointly, and it is shown that the multimodal approach results in significantly improved performance in spatial localization, identification and speech activity detection of the participants.
Abstract: Our long-term objective is to create smart room technologies that are aware of the users' presence and behavior and can become an active, but not intrusive, part of the interaction. In this work, we present a multimodal approach for estimating and tracking the location and identity of the participants, including the active speaker. Our smart room design contains three user-monitoring systems: four CCD cameras, an omnidirectional camera and a 16-channel microphone array. The various sensory modalities are processed both individually and jointly, and it is shown that the multimodal approach results in significantly improved performance in spatial localization, identification and speech activity detection of the participants.

PatentDOI
TL;DR: Systems, methods, and computer program products for determining the voice recognition accuracy of a voice recognition system are provided; when a recognition error is identified, a solution is automatically implemented to eliminate its source.
Abstract: Systems, methods, and computer program products for determining voice recognition accuracy of a voice recognition system are provided. In one embodiment, voice recognition information produced by a voice recognition system in response to recognizing a user utterance is analyzed. The voice recognition information comprises a recognized voice command associated with the user utterance and a reference to an audio file that includes the user utterance. Based on the analysis, a recognition error may be identified and the source of the error determined. A solution is then automatically implemented to eliminate the source of the error. As part of the analysis, the user utterance may be transcribed to create a transcribed utterance, if the recognized voice command does not match the user utterance. The transcribed utterance may then be compared to the recognized voice command to identify an error.

Journal ArticleDOI
TL;DR: Results show a dissociation between speech and speaker recognition with primarily temporal cues, highlighting the limitation of current speech processing strategies in cochlear implants.
Abstract: Natural spoken language processing includes not only speech recognition but also identification of the speaker’s gender, age, emotional, and social status. Our purpose in this study is to evaluate whether temporal cues are sufficient to support both speech and speaker recognition. Ten cochlear-implant and six normal-hearing subjects were presented with vowel tokens spoken by three men, three women, two boys, and two girls. In one condition, the subject was asked to recognize the vowel. In the other condition, the subject was asked to identify the speaker. Extensive training was provided for the speaker recognition task. Normal-hearing subjects achieved nearly perfect performance in both tasks. Cochlear-implant subjects achieved good performance in vowel recognition but poor performance in speaker recognition. The level of the cochlear implant performance was functionally equivalent to normal performance with eight spectral bands for vowel recognition but only to one band for speaker recognition. These results show a dissociation between speech and speaker recognition with primarily temporal cues, highlighting the limitation of current speech processing strategies in cochlear implants. Several methods, including explicit encoding of fundamental frequency and frequency modulation, are proposed to improve speaker recognition for current cochlear implant users.

Journal ArticleDOI
TL;DR: Experiments show that the proposed robust MFCC-based feature significantly reduces the recognition error rate over a wide signal-to-noise ratio range.

Proceedings ArticleDOI
Bhiksha Raj, Paris Smaragdis
21 Nov 2005
TL;DR: An algorithm is presented for the separation of multiple speakers from mixed single-channel recordings by latent variable decomposition of the speech spectrogram; results show that it is very effective at separating mixed signals.
Abstract: In this paper we present an algorithm for the separation of multiple speakers from mixed single-channel recordings by latent variable decomposition of the speech spectrogram. We model each magnitude spectral vector in the short-time Fourier transform of a speech signal as the outcome of a discrete random process that generates frequency bin indices. The distribution of the process is modelled as a mixture of multinomial distributions, such that the mixture weights of the component multinomials vary from analysis window to analysis window. The component multinomials are assumed to be speaker specific and are learnt from training signals for each speaker. The distributions representing magnitude spectral vectors of the mixed signal are decomposed into mixtures of the multinomials for all component speakers. The frequency distribution, i.e. the spectrum, for each speaker is then reconstructed from this decomposition. Experimental results show that the proposed method is very effective at separating mixed signals.
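
A minimal sketch of the per-frame decomposition step, assuming speaker-specific multinomial bases have already been learnt (EM over the mixture weights only; training of the bases is analogous):

```python
import numpy as np

def frame_weights(v, B, n_iter=50):
    """Fit mixture weights w so that the multinomial mixture B @ w matches
    the normalized magnitude spectrum v / v.sum(), where B is an (F x K)
    matrix whose columns are the learnt basis spectra of all speakers."""
    K = B.shape[1]
    w = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: posterior of each component at each frequency bin.
        p = B * w
        p /= p.sum(axis=1, keepdims=True) + 1e-12
        # M-step: reweight components by the spectral energy they explain.
        w = (p * v[:, None]).sum(axis=0)
        w /= w.sum()
    return w

# Separation sketch: stack speaker A's and B's bases as B = [B_A | B_B],
# fit w per mixture frame, and reconstruct A's spectrum from B_A and the
# first K_A weights, scaled back to the frame energy.
```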

Journal ArticleDOI
TL;DR: Perception by normal-hearing subjects of the gender and identity of a talker was examined as a function of the number of channels in spectrally reduced speech; results showed that gender and talker identification was better for the sine-wave processor, and that performance with the noise-band processor was more sensitive to the number of channels.
Abstract: Considerable research on speech intelligibility for cochlear-implant users has been conducted using acoustic simulations with normal-hearing subjects. However, some relevant aspects of perception through cochlear implants remain scarcely explored. The present study examined the perception by normal-hearing subjects of the gender and identity of a talker as a function of the number of channels in spectrally reduced speech. Two simulation strategies were compared. They were implemented by two different processors that presented signals as either the sum of sine waves at the channel center frequencies or the sum of noise bands. In Experiment 1, 15 subjects determined the gender of 40 talkers (20 males + 20 females) from a natural utterance processed through 3, 4, 5, 6, 8, 10, 12, and 16 channels with both processors. In Experiment 2, 56 subjects matched a natural sentence uttered by 10 talkers with the corresponding simulation replicas processed through 3, 4, 8, and 16 channels for each processor. In Experiment 3, 72 subjects performed the same task but different sentences were used for natural and processed stimuli. A control Experiment 4 was conducted to equate the processing steps between the two simulation strategies. Results showed that gender and talker identification was better for the sine-wave processor, and that performance with the noise-band processor was more sensitive to the number of channels. Implications and possible explanations for the superiority of sine-wave simulations are discussed.

Journal ArticleDOI
TL;DR: In this paper, it is argued that formants, whose frequencies and dynamics are the product of the interaction of an individual vocal tract with the idiosyncratic articulatory gestures needed to achieve linguistically agreed targets, are so central to speaker identity that they must play a pivotal role in speaker identification.
Abstract: Views differ on the relative importance for forensic speaker identification of different aspects of the speech signal. It is argued here that formants, whose frequencies and dynamics are the product of the interaction of an individual vocal tract with the idiosyncratic articulatory gestures needed to achieve linguistically agreed targets, are so central to speaker identity that they must play a pivotal role in speaker identification. As a practical demonstration, a case is described in which F1/F2 analysis of a vowel and F2 analysis of three diphthongs show a consistent separation between two recordings, thus effectively eliminating a suspect as the maker of obscene telephone calls. Subsequent additional analysis, based on the statistical distribution of formant frequency estimates throughout the samples, confirms the distinctness of the voice of the suspect and that of the obscene caller. The theoretical foundation for several kinds of formant-based analysis is then discussed.