Showing papers on "Speaker recognition published in 2002"


Patent
25 Jan 2002
TL;DR: A system and method is presented that provides universal access to voice-based documents formatted using MIME and HTML standards, with customized extensions for voice information access and navigation.
Abstract: A system and method provides universal access to voice-based documents containing information formatted using MIME and HTML standards using customized extensions for voice information access and navigation. These voice documents are linked using HTML hyper-links that are accessible to subscribers using voice commands, touch-tone inputs and other selection means. These voice documents and components in them are addressable using HTML anchors embedding HTML universal resource locators (URLs) rendering them universally accessible over the Internet. This collection of connected documents forms a voice web. The voice web includes subscriber-specific documents including speech training files for speaker dependent speech recognition, voice print files for authenticating the identity of a user and personal preference and attribute files for customizing other aspects of the system in accordance with a specific subscriber.

983 citations


Proceedings ArticleDOI
13 May 2002
TL;DR: Some of the strengths and weaknesses of current speaker recognition technologies are discussed, and some potential future trends in research, development and applications are outlined.
Abstract: In this paper we provide a brief overview of the area of speaker recognition, describing applications, underlying techniques and some indications of performance. Following this overview we will discuss some of the strengths and weaknesses of current speaker recognition technologies and outline some potential future trends in research, development and applications.

597 citations


Book
01 Jul 2002
TL;DR: A book-length treatment of forensic speaker identification, covering forensic phonetic parameters, how to express the outcome, and a demonstration of the likelihood-ratio method.
Abstract: Introduction. Why Voices are Difficult to Discriminate Forensically. Forensic Phonetic Parameters. Expressing the Outcome. Characterizing Forensic Speaker Identification. The Human Vocal Tract and the Production and Description of Speech Sounds. Phonemics. Speech Acoustics. Speech Perception. What is a Voice? The Likelihood Ratio Revisited: A Demonstration of the Method. Summary and Envoi.

347 citations


Patent
06 Sep 2002
TL;DR: This patent covers speech recognition using selectable recognition modes, the use of choice lists in large-vocabulary speech recognition, enabling users to select word transformations, and speech recognition that automatically turns recognition off in one or more specified ways.
Abstract: The present invention relates to: speech recognition using selectable recognition modes; using choice lists in large-vocabulary speech recognition; enabling users to select word transformations; speech recognition that automatically turns recognition off in one or more specified ways; phone key control of large-vocabulary speech recognition; speech recognition using phone key alphabetic filtering and spelling; speech recognition that enables a user to perform re-utterance recognition; the combination of speech recognition and text-to-speech (TTS) generation; the combination of speech recognition with handwriting and/or character recognition; and the combination of large-vocabulary speech recognition with audio recording and playback.

284 citations


Proceedings ArticleDOI
Ara V. Nefian1, Luhong Liang1, Xiaobo Pi1, Liu Xiaoxiang1, Crusoe Mao1, Kevin Murphy1 
13 May 2002
TL;DR: This paper introduces a novel audio-visual fusion technique that uses a coupled hidden Markov model (HMM) to model the state asynchrony of the audio and visual observation sequences while still preserving their natural correlation over time.
Abstract: In recent years several speech recognition systems that use visual together with audio information showed a significant increase in performance over standard speech recognition systems. The use of visual features is justified by both the bimodality of speech generation and by the need for features that are invariant to acoustic noise perturbation. The audio-visual speech recognition system presented in this paper introduces a novel audio-visual fusion technique that uses a coupled hidden Markov model (HMM). The statistical properties of the coupled HMM allow us to model the state asynchrony of the audio and visual observation sequences while still preserving their natural correlation over time. The experimental results show that the coupled HMM outperforms the multistream HMM in audio-visual speech recognition.
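To make the fusion idea concrete, below is a minimal numerical sketch (not the authors' implementation) of a coupled-HMM forward pass: the joint state is a pair of audio and visual states, each chain's transition depends on both previous states, and the per-stream observation likelihoods are random placeholders standing in for trained Gaussian mixtures.

```python
import numpy as np

rng = np.random.default_rng(0)
Na, Nv, T = 3, 3, 20  # toy sizes: audio states, visual states, frames

def rand_dist(shape):
    """Random stochastic array, normalized over the last axis."""
    x = rng.random(shape)
    return x / x.sum(axis=-1, keepdims=True)

# Coupled transitions: each chain is conditioned on BOTH previous states.
A_audio = rand_dist((Na, Nv, Na))    # P(a_t | a_{t-1}, v_{t-1})
A_visual = rand_dist((Na, Nv, Nv))   # P(v_t | a_{t-1}, v_{t-1})
pi = rand_dist((Na * Nv,)).reshape(Na, Nv)   # initial joint distribution

# Placeholder per-stream observation likelihoods (would come from GMMs).
b_audio = rng.random((T, Na))    # P(audio obs at t | a_t)
b_visual = rng.random((T, Nv))   # P(visual obs at t | v_t)

def coupled_hmm_loglik(pi, A_audio, A_visual, b_audio, b_visual):
    """Scaled forward algorithm over the joint (audio, visual) state space."""
    alpha = pi * b_audio[0][:, None] * b_visual[0][None, :]
    scale = alpha.sum()
    alpha, loglik = alpha / scale, np.log(scale)
    for t in range(1, len(b_audio)):
        # Factored transition lets the two streams stay asynchronous in state
        # while remaining statistically coupled.
        alpha = np.einsum("ij,ija,ijv->av", alpha, A_audio, A_visual)
        alpha = alpha * b_audio[t][:, None] * b_visual[t][None, :]
        scale = alpha.sum()
        alpha, loglik = alpha / scale, loglik + np.log(scale)
    return loglik

print(coupled_hmm_loglik(pi, A_audio, A_visual, b_audio, b_visual))
```

In a word-recognition setting, one such model per word (or phone) would be trained and the candidate with the highest log-likelihood selected.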

252 citations


Journal ArticleDOI
TL;DR: The paper reviews the components of bimodal recognizers, discusses the accuracy of bimodal recognition, and highlights outstanding research issues and possible application domains; the combination of auditory and visual modalities promises higher recognition accuracy and robustness than can be obtained with a single modality.
Abstract: Speech recognition and speaker recognition by machine are crucial ingredients for many important applications such as natural and flexible human-machine interfaces. Most developments in speech-based automatic recognition have relied on acoustic speech as the sole input signal, disregarding its visual counterpart. However, recognition based on acoustic speech alone can be afflicted with deficiencies that preclude its use in many real-world applications, particularly under adverse conditions. The combination of auditory and visual modalities promises higher recognition accuracy and robustness than can be obtained with a single modality. Multimodal recognition is therefore acknowledged as a vital component of the next generation of spoken language systems. The paper reviews the components of bimodal recognizers, discusses the accuracy of bimodal recognition, and highlights some outstanding research issues as well as possible application domains.

244 citations


01 Jan 2002
TL;DR: This thesis shows how to encode stochastic finite-state word models as DBNs, and how to construct DBN models that explicitly model the speech articulators, accent, gender, speaking rate, and other important phenomena.
Abstract: Dynamic Bayesian networks (DBNs) are a powerful and flexible methodology for representing and computing with probabilistic models of stochastic processes. In the past decade, there has been increasing interest in applying them to practical problems, and this thesis shows that they can be used effectively in the field of automatic speech recognition. A principal characteristic of dynamic Bayesian networks is that they can model an arbitrary set of variables as they evolve over time. Moreover, an arbitrary set of conditional independence assumptions can be specified, and this allows the joint distribution to be represented in a highly factored way. Factorization allows for models with relatively few parameters, and computational efficiency. Standardized inference and learning routines allow a variety of model structures to be tested without deriving new formulae, or writing new code. The contribution of this thesis is to show how DBNs can be used in automatic speech recognition. This involves solving problems related to both representation and inference. Representationally, the thesis shows how to encode stochastic finite-state word models as DBNs, and how to construct DBNs that explicitly model the speech articulators, accent, gender, speaking rate, and other important phenomena. Technically, the thesis presents inference routines that are especially tailored to the requirements of speech recognition: efficient inference with deterministic constraints, variable-length utterances, and online inference. Finally, the thesis presents experimental results that indicate that real systems can be built, and that modeling important phenomena with DBNs results in higher recognition accuracy.

237 citations


Journal ArticleDOI
Qi Li1, Jinsong Zheng1, A. Tsai1, Qiru Zhou1
TL;DR: The experiments show that the batch-mode algorithm detects endpoints as accurately as HMM forced alignment, at much lower computational complexity.
Abstract: When automatic speech recognition (ASR) and speaker verification (SV) are applied in adverse acoustic environments, endpoint detection and energy normalization can be crucial to the functioning of both systems. In low signal-to-noise ratio (SNR) and nonstationary environments, conventional approaches to endpoint detection and energy normalization often fail and ASR performances usually degrade dramatically. The purpose of this paper is to address the endpoint problem. For ASR, we propose a real-time approach. It uses an optimal filter plus a three-state transition diagram for endpoint detection. The filter is designed utilizing several criteria to ensure accuracy and robustness. It has almost invariant response at various background noise levels. The detected endpoints are then applied to energy normalization sequentially. Evaluation results show that the proposed algorithm significantly reduces the string error rates in low SNR situations. The reduction rates even exceed 50% in several evaluated databases. For SV, we propose a batch-mode approach. It uses the optimal filter plus a two-mixture energy model for endpoint detection. The experiments show that the batch-mode algorithm detects endpoints as accurately as HMM forced alignment, at much lower computational complexity.
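The filter design and decision logic in the paper are more elaborate; purely as a hedged illustration, the sketch below assumes a hypothetical sequence of frame log-energies, smooths it with a simple moving-average filter, and runs a three-state machine (silence / possible onset / speech) to emit begin and end points.

```python
import numpy as np

def detect_endpoints(log_energy, low_thr=3.0, high_thr=6.0, min_speech=5):
    """Simplified endpoint detection: smooth frame log-energies, then run a
    three-state machine (0 = silence, 1 = possible onset, 2 = speech).
    Thresholds are in dB-like units above the estimated noise floor."""
    kernel = np.ones(5) / 5.0                    # stand-in for the optimal filter
    smooth = np.convolve(log_energy, kernel, mode="same")
    floor = np.percentile(smooth, 10)            # crude noise-floor estimate
    state, start, count, segments = 0, None, 0, []
    for t, e in enumerate(smooth - floor):
        if state == 0 and e > low_thr:           # silence -> possible onset
            state, start, count = 1, t, 0
        elif state == 1:
            if e > high_thr:
                count += 1
                if count >= min_speech:          # confirmed onset
                    state = 2
            elif e < low_thr:                    # false alarm, back to silence
                state, start, count = 0, None, 0
        elif state == 2 and e < low_thr:         # speech -> endpoint
            segments.append((start, t))
            state, start, count = 0, None, 0
    if state == 2:
        segments.append((start, len(smooth)))
    return segments
```

The detected endpoints would then delimit the frames used for sequential energy normalization, as the paper describes.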

225 citations


PatentDOI
TL;DR: A speaker identification system includes a speaker model generator that records training utterances in the background, performs a blind clustering of them based on a predetermined criterion, and trains a speaker model for each cluster; a speaker is then identified by determining the most likely speaker model for an utterance received from that speaker.
Abstract: A speaker identification system includes a speaker model generator 110 for generating a plurality of speaker models. To this end, the generator records training utterances from a plurality of speakers in the background, without prior knowledge of the speakers who spoke the utterances. The generator performs a blind clustering of the training utterances based on a predetermined criterion. For each of the clusters a corresponding speaker model is trained. A speaker identifier 130 identifies a speaker by determining a most likely one of the speaker models for an utterance received from the speaker. The speaker associated with the most likely speaker model is identified as the speaker of the test utterance.

175 citations


Patent
Nevenka Dimitrova1, Dongge Li1
19 Jun 2002
TL;DR: A memory storing computer readable instructions for causing a processor associated with a mega speaker identification (ID) system to instantiate functions including an audio segmentation and classification function receiving general audio data (GAD) and generating segments, a feature extraction function receiving the segments and extracting features based on mel-frequency cepstral coefficients (MFCC), a learning and clustering function receiving the extracted features and reclassifying segments, when required, based on the extracted features, a matching and labeling function assigning a speaker ID to speech signals within the GAD, and a database function correlating the assigned speaker ID to the respective speech signals within the GAD.
Abstract: A memory storing computer readable instructions for causing a processor associated with a mega speaker identification (ID) system to instantiate functions including an audio segmentation and classification function receiving general audio data (GAD) and generating segments, a feature extraction function receiving the segments and extracting features based on mel-frequency cepstral coefficients (MFCC) therefrom, a learning and clustering function receiving the extracted features and reclassifying segments, when required, based on the extracted features, a matching and labeling function assigning a speaker ID to speech signals within the GAD, and a database function for correlating the assigned speaker ID to the respective speech signals within the GAD. The audio segmentation and classification function can assign each segment to one of N audio signal classes including silence, single speaker speech, music, environmental noise, multiple speakers' speech, simultaneous speech and music, and speech and noise. A mega speaker identification (ID) system and corresponding method are also described.
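The claims name MFCC-based feature extraction; a minimal sketch of that step using librosa (an assumed dependency, not something the patent specifies) could look like the following.

```python
import numpy as np
import librosa

def extract_mfcc(wav_path, sr=16000, n_mfcc=13):
    """Compute MFCC features (plus first-order deltas) for one audio file."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=512, hop_length=160)   # 10 ms hop at 16 kHz
    delta = librosa.feature.delta(mfcc)                       # common add-on for speaker ID
    return np.vstack([mfcc, delta]).T                         # shape: (frames, 2 * n_mfcc)
```

These frame-level vectors would then feed the segmentation, clustering, and matching functions described above.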

PatentDOI
TL;DR: A speaker provides a quantity of enrollment data (18), which can be extracted from a short quantity of speech, and the system modifies the base synthesis parameters (12) to more closely resemble those of the new speaker.
Abstract: The speech synthesizer is personalized to sound like or mimic the speech characteristics of an individual speaker. The individual speaker provides a quantity of enrollment data (18), which can be extracted from a short quantity of speech, and the system modifies the base synthesis parameters (12) to more closely resemble those of the new speaker (36). More specifically, the synthesis parameters (12) may be decomposed into speaker dependent parameters (30), such as context-independent parameters, and speaker independent parameters (32), such as context-dependent parameters. The speaker dependent parameters (30) are adapted using enrollment data (18) from the new speaker. After adaptation, the speaker dependent parameters (30) are combined with the speaker independent parameters (32) to provide a set of personalized synthesis parameters (42).

Journal ArticleDOI
TL;DR: This work proposes the use of a polynomial-based classifier which is highly computationally scalable with the number of speakers, and a new training algorithm which is discriminative, handles large data sets, and has low memory usage.
Abstract: Modern speaker recognition applications require high accuracy at low complexity. We propose the use of a polynomial-based classifier to achieve these objectives. This approach has several advantages. First, polynomial classifier scoring yields a system which is highly computationally scalable with the number of speakers. Second, a new training algorithm is proposed which is discriminative, handles large data sets, and has low memory usage. Third, the output of the polynomial classifier is easily incorporated into a statistical framework allowing it to be combined with other techniques such as hidden Markov models. Results are given for the application of the new methods to the YOHO speaker recognition database.
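As an illustration of the scoring idea only (the paper's discriminative, low-memory training algorithm is not reproduced here), the sketch below expands each frame into degree-2 polynomial terms, fits one weight vector per speaker with ordinary least squares against 1-vs-all targets, and scores an utterance by averaging the classifier output over frames.

```python
import numpy as np

def poly_expand(x):
    """Degree-2 polynomial expansion: [1, x_i, x_i * x_j for i <= j]."""
    d = len(x)
    cross = np.concatenate([x[i] * x[i:] for i in range(d)])
    return np.concatenate([np.ones(1), x, cross])

def train_speaker(frames_by_speaker, target_idx):
    """Least-squares stand-in for the paper's training: target 1 on the
    target speaker's frames, 0 on everyone else's."""
    X = np.vstack([np.apply_along_axis(poly_expand, 1, f)
                   for f in frames_by_speaker])
    y = np.concatenate([np.full(len(f), 1.0 if i == target_idx else 0.0)
                        for i, f in enumerate(frames_by_speaker)])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def utterance_score(frames, w):
    """Average polynomial-classifier output over the utterance's frames."""
    P = np.apply_along_axis(poly_expand, 1, frames)
    return float((P @ w).mean())
```

Because scoring is just an inner product per frame, the cost of each additional speaker is one weight vector, which is the scalability property the abstract highlights.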

Proceedings ArticleDOI
13 May 2002
TL;DR: It is shown that one of the recent techniques used for speaker recognition, feature warping, can be formulated within the framework of Gaussianization, and around 20% relative improvement in both equal error rate (EER) and minimum detection cost function (DCF) is obtained on the NIST 2001 cellular phone data evaluation.
Abstract: In this paper, a novel approach for robust speaker verification, namely short-time Gaussianization, is proposed. Short-time Gaussianization is initiated by a global linear transformation of the features, followed by a short-time windowed cumulative distribution function (CDF) matching. First, the linear transformation in the feature space leads to local independence or decorrelation. Then the CDF matching is applied to segments of speech localized in time and tries to warp a given feature so that its CDF matches a normal distribution. It is shown that one of the recent techniques used for speaker recognition, feature warping [1], can be formulated within the framework of Gaussianization. Compared to the baseline system with cepstral mean subtraction (CMS), around 20% relative improvement in both equal error rate (EER) and minimum detection cost function (DCF) is obtained on the NIST 2001 cellular phone data evaluation.
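The short-time CDF-matching step can be sketched compactly (the global linear transform is omitted, and this follows the generic feature-warping recipe rather than the authors' exact implementation): within a sliding window, each feature value is replaced by the standard-normal quantile of its rank. SciPy is assumed.

```python
import numpy as np
from scipy.stats import norm

def short_time_gaussianize(features, win=300):
    """Warp each feature dimension so that, over a sliding window of `win`
    frames, its empirical CDF matches a standard normal CDF."""
    T, D = features.shape
    warped = np.empty_like(features, dtype=float)
    half = win // 2
    for t in range(T):
        lo, hi = max(0, t - half), min(T, t + half + 1)
        block, n = features[lo:hi], hi - lo
        for d in range(D):
            # Rank of the current value within its window, mapped to (0, 1).
            rank = np.sum(block[:, d] < features[t, d]) + 0.5
            warped[t, d] = norm.ppf(rank / n)
    return warped
```

A per-window lookup table makes this far cheaper in practice; the naive double loop above is only meant to show the mapping.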

Journal ArticleDOI
TL;DR: A weighting process adaptive to various background noise situations is developed following a Separate Integration (SI) architecture; a mapping between the measurements and the free parameter of the fusion process is derived, and its applicability is demonstrated.
Abstract: It has been shown that integration of acoustic and visual information especially in noisy conditions yields improved speech recognition results. This raises the question of how to weight the two modalities in different noise conditions. Throughout this paper we develop a weighting process adaptive to various background noise situations. In the presented recognition system, audio and video data are combined following a Separate Integration (SI) architecture. A hybrid Artificial Neural Network/Hidden Markov Model (ANN/HMM) system is used for the experiments. The neural networks were in all cases trained on clean data. Firstly, we evaluate the performance of different weighting schemes in a manually controlled recognition task with different types of noise. Next, we compare different criteria to estimate the reliability of the audio stream. Based on this, a mapping between the measurements and the free parameter of the fusion process is derived and its applicability is demonstrated. Finally, the possibilities and limitations of adaptive weighting are compared and discussed.
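A hedged sketch of the Separate Integration idea: each stream's classifier produces per-class log-likelihoods, and they are combined with a weight lambda derived from an audio-reliability estimate. The SNR-to-weight mapping below is a placeholder; the paper derives its mapping empirically from recognition experiments and compares several reliability criteria.

```python
import numpy as np

def reliability_to_weight(snr_db, lo=0.0, hi=30.0):
    """Map an estimated audio SNR (dB) to the audio stream weight lambda,
    clipped to [0, 1]. Purely illustrative mapping."""
    return float(np.clip((snr_db - lo) / (hi - lo), 0.0, 1.0))

def fuse_streams(audio_loglik, video_loglik, snr_db):
    """Separate Integration: weighted sum of per-class log-likelihoods from
    the audio-only and video-only recognizers."""
    lam = reliability_to_weight(snr_db)
    combined = lam * np.asarray(audio_loglik) + (1 - lam) * np.asarray(video_loglik)
    return int(np.argmax(combined)), combined

# Example: at 5 dB SNR the visual stream dominates the decision.
best_class, scores = fuse_streams([-12.0, -9.5, -14.0], [-7.0, -11.0, -6.5], snr_db=5.0)
```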

PatentDOI
TL;DR: A method is proposed for training or adapting a speech recognition device used to act upon functions of an electrical appliance, for example triggering voice dialing in a mobile telephone terminal.
Abstract: The invention relates to a method for the training or adaptation of a speech recognition device used to act upon functions of an electrical appliance, for example the triggering of voice dialing in a mobile telephone terminal. In order to make the training and/or adaptation of the speech recognition device more comfortable for the user, a method is proposed with the following steps: performance of a speech input; processing of the speech input by means of the speech recognition device to produce a speech recognition result; if the speech recognition result can be allocated to a function of the electrical appliance, action upon the allocatable function of the electrical appliance; and training or adaptation of the speech recognition device on the basis of the speech recognition result associated with the speech input made, if the action upon the allocatable function of the electrical appliance does not cause a user input expressing rejection.

Proceedings ArticleDOI
Lie Lu1, Hong-Jiang Zhang1
01 Dec 2002
TL;DR: A two-step speaker change detection algorithm, including potential change detection and refinement, is proposed, which has low complexity and runs in real-time with a very limited delay in analysis.
Abstract: This paper addresses the problem of real time speaker change detection and speaker tracking in broadcasted news video analysis. In such a case, both speaker identities and number of speakers are assumed unknown. A two-step speaker change detection algorithm, including potential change detection and refinement, is proposed. Speaker tracking is performed based on the results of speaker change detection. A Bayesian Fusion method is used to fuse multiple audio features to get a more reliable result. The algorithm has low complexity and runs in real-time with a very limited delay in analysis. Our experiments show that the algorithms produce very satisfactory results.
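The paper fuses several audio features with a Bayesian method; purely to illustrate the two-step structure (potential detection, then refinement), the sketch below scores adjacent windows of hypothetical frame features with a symmetric Gaussian divergence and keeps thresholded local maxima as change points.

```python
import numpy as np

def gauss_divergence(x, y, eps=1e-6):
    """Symmetric KL divergence between diagonal Gaussians fit to windows x, y."""
    m1, v1 = x.mean(0), x.var(0) + eps
    m2, v2 = y.mean(0), y.var(0) + eps
    return 0.5 * float(np.sum(v1 / v2 + v2 / v1 - 2.0
                              + (m1 - m2) ** 2 * (1.0 / v1 + 1.0 / v2)))

def detect_speaker_changes(feats, win=100, step=10, thr=60.0):
    """Step 1: slide two adjacent windows and record their divergence.
    Step 2 (refinement): keep local maxima that exceed a threshold."""
    positions, scores = [], []
    for t in range(win, len(feats) - win, step):
        positions.append(t)
        scores.append(gauss_divergence(feats[t - win:t], feats[t:t + win]))
    changes = []
    for i in range(1, len(scores) - 1):
        if scores[i] > thr and scores[i] >= scores[i - 1] and scores[i] >= scores[i + 1]:
            changes.append(positions[i])
    return changes
```

Speaker tracking then amounts to comparing the segments delimited by these change points against previously seen speaker models.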

Journal ArticleDOI
TL;DR: This paper introduces a new algorithm for automatically locating the mouth region by using color and motion information and segmenting the lip region by making use of both color and edge information based on Markov random fields, and presents various visual feature performance comparisons to explore their impact on the recognition accuracy.
Abstract: There has been growing interest in introducing speech as a new modality into the human-computer interface (HCI). Motivated by the multimodal nature of speech, the visual component is considered to yield information that is not always present in the acoustic signal and enables improved system performance over acoustic-only methods, especially in noisy environments. In this paper, we investigate the usefulness of visual speech information in HCI related applications. We first introduce a new algorithm for automatically locating the mouth region by using color and motion information and segmenting the lip region by making use of both color and edge information based on Markov random fields. We then derive a relevant set of visual speech parameters and incorporate them into a recognition engine. We present various visual feature performance comparisons to explore their impact on the recognition accuracy, including the lip inner contour and the visibility of the tongue and teeth. By using a common visual feature set, we demonstrate two applications that exploit speechreading in a joint audio-visual speech signal processing task: speech recognition and speaker verification. The experimental results based on two databases demonstrate that the visual information is highly effective for improving recognition performance over a variety of acoustic noise levels.

Journal ArticleDOI
TL;DR: On the German spontaneous speech task Verbmobil, the WSJ task and the German telephone digit string corpus SieTill, the proposed methods for VTN reduce the error rates significantly.
Abstract: This paper presents methods for speaker adaptive modeling using vocal tract normalization (VTN) along with experimental tests on three databases. We propose a new training method for VTN: By using single-density acoustic models per HMM state for selecting the scale factor of the frequency axis, we avoid the problem that a mixture-density tends to learn the scale factors of the training speakers and thus cannot be used for selecting the scale factor. We show that using single Gaussian densities for selecting the scale factor in training results in lower error rates than using mixture densities. For the recognition phase, we propose an improvement of the well-known two-pass strategy: by using a non-normalized acoustic model for the first recognition pass instead of a normalized model, lower error rates are obtained. In recognition tests, this method is compared with a fast variant of VTN. The two-pass strategy is an efficient method, but it is suboptimal because the scale factor and the word sequence are determined sequentially. We found that for telephone digit string recognition this suboptimality reduces the VTN gain in recognition performance by 30% relative. In summary, on the German spontaneous speech task Verbmobil, the WSJ task and the German telephone digit string corpus SieTill, the proposed methods for VTN reduce the error rates significantly.
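The warp-factor selection step can be sketched as a grid search, assuming (as the paper recommends) a single-Gaussian model for scoring; `features_with_warp` below is a hypothetical front-end callback that recomputes features with a rescaled frequency axis and is not shown.

```python
import numpy as np

def single_gaussian_loglik(feats, mean, var):
    """Log-likelihood of frames under a diagonal single-Gaussian model,
    the model type recommended for selecting the warp factor."""
    return float(np.sum(-0.5 * (np.log(2 * np.pi * var)
                                + (feats - mean) ** 2 / var)))

def select_warp_factor(audio, mean, var, features_with_warp,
                       alphas=np.arange(0.88, 1.13, 0.02)):
    """Grid search over frequency scale factors alpha; keep the one whose
    warped features score highest under the reference model."""
    best_alpha, best_ll = 1.0, -np.inf
    for alpha in alphas:
        ll = single_gaussian_loglik(features_with_warp(audio, alpha), mean, var)
        if ll > best_ll:
            best_alpha, best_ll = float(alpha), ll
    return best_alpha
```

In the two-pass strategy discussed above, a first pass with a non-normalized model fixes the transcription used to pick the warp factor, and a second pass re-recognizes with the warped features.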

Journal ArticleDOI
TL;DR: This paper is a tutorial that reviews classifier-based methods used for speaker recognition; both unsupervised and supervised classifiers are described.

Journal ArticleDOI
TL;DR: A method is presented for clustering the speakers in an unlabeled and unsegmented conversation with a known number of speakers, when no a priori knowledge about the identity of the participants is given.
Abstract: We present a method for clustering the speakers from unlabeled and unsegmented conversation (with a known number of speakers), when no a priori knowledge about the identity of the participants is given. Each speaker was modeled by a self-organizing map (SOM). The SOMs were randomly initiated. An iterative algorithm allows the data to move from one model to another and adjusts the SOMs accordingly. The restriction that the data can move only in small groups, and not by moving each and every feature vector separately, forces the SOMs to adjust to speakers (instead of phonemes or other vocal events). This method was applied to high-quality conversations with two to five participants and to two-speaker telephone-quality conversations. The results for two (both high- and telephone-quality) and three speakers were over 80% correct segmentation. The problem becomes even harder when the number of participants is also unknown. Based on the iterative clustering algorithm, a validity criterion was also developed to estimate the number of speakers. In 16 out of 17 high-quality conversations between two and three participants, the estimation of the number of participants was correct. In telephone quality, the results were poorer.
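As a simplified stand-in for the SOM-based models (one centroid per speaker instead of a full self-organizing map), the sketch below keeps the key restriction described above: feature vectors are reassigned between speaker models only in whole segments, never frame by frame.

```python
import numpy as np

def cluster_speakers(segments, n_speakers, n_iter=20, seed=0):
    """Iteratively assign whole segments (2-D arrays of frames) to speaker
    models and refit them; a crude substitute for the SOM models."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(n_speakers, size=len(segments))
    dim = segments[0].shape[1]
    for _ in range(n_iter):
        # Refit one centroid per speaker from the frames currently assigned to it.
        centroids = np.zeros((n_speakers, dim))
        for k in range(n_speakers):
            frames = [s for s, l in zip(segments, labels) if l == k]
            centroids[k] = (np.vstack(frames).mean(0) if frames
                            else rng.standard_normal(dim))
        # Reassign each segment as a unit, by its average distortion to each centroid.
        new_labels = np.array([
            int(np.argmin([np.mean(np.sum((seg - c) ** 2, axis=1)) for c in centroids]))
            for seg in segments])
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```

Estimating the number of speakers, as in the paper's validity criterion, would amount to running this for several candidate counts and comparing a cluster-quality score.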

Proceedings ArticleDOI
13 May 2002
TL;DR: Improvements to an innovative high-performance speaker recognition system are described, incorporating gender-dependent phone models, pre-processing the speech files to remove cross-talk, and developing more sophisticated fusion techniques for the multi-language likelihood scores.
Abstract: This paper describes improvements to an innovative high-performance speaker recognition system. Recent experiments showed that with sufficient training data phone strings from multiple languages are exceptional features for speaker recognition. The prototype phonetic speaker recognition system used phone sequences from six languages to produce an equal error rate of 11.5% on Switchboard-I audio files. The improved system described in this paper reduces the equal error rate to less than 4%. This is accomplished by incorporating gender-dependent phone models, pre-processing the speech files to remove cross-talk, and developing more sophisticated fusion techniques for the multi-language likelihood scores.
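The underlying phonotactic scoring can be sketched as follows (an illustrative simplification, not the authors' system): each language's phone recognizer decodes the test file into a phone string, the string is scored against the claimed speaker's phone bigram model and a background bigram model, and the per-language log-likelihood ratios are fused with weights. Gender-dependent models and cross-talk removal are omitted.

```python
import math

def bigram_logprob(phones, bigram_probs, unk=1e-6):
    """Log-probability of a phone string under a bigram model stored as a
    dict {(prev_phone, phone): probability}."""
    return sum(math.log(bigram_probs.get(bg, unk))
               for bg in zip(phones[:-1], phones[1:]))

def phonetic_speaker_score(phone_strings, speaker_lms, background_lms, weights):
    """Fuse per-language log-likelihood ratios (length-normalized).

    phone_strings:   dict language -> decoded phone list for the test file
    speaker_lms:     dict language -> the claimed speaker's bigram dict
    background_lms:  dict language -> background bigram dict
    weights:         dict language -> fusion weight (learned in the paper)"""
    score = 0.0
    for lang, phones in phone_strings.items():
        llr = (bigram_logprob(phones, speaker_lms[lang])
               - bigram_logprob(phones, background_lms[lang]))
        score += weights[lang] * llr / max(len(phones) - 1, 1)
    return score
```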

Reference BookDOI
01 Oct 2002
TL;DR: An edited reference volume on pattern recognition approaches to speech and language processing, covering minimum classification error and minimum Bayes-risk methods, neural networks, large-vocabulary and spontaneous speech recognition, speaker authentication, statistical language models, spoken language understanding, statistical machine translation, and topic detection and tracking.
Abstract: Minimum Classification Error (MCE) Approach in Pattern Recognition, Wu Chou Minimum Bayes-Risk Methods in Automatic Speech Recognition, Vaibhava Goel and William Byrne A Decision Theoretic Formulation for Adaptive and Robust Automatic Speech Recognition, Qiang Huo Speech Pattern Recognition Using Neural Networks, Shigeru Katagiri Large Vocabulary Speech Recognition Based on Statistical Methods, Jean-Luc Gauvain Toward Spontaneous Speech Recognition and Understanding, Sadaoki Furui Speaker Authentication, Qi Li and Biing-Hwang Juang HMMs for Language Processing Problems, Richard M. Schwartz and John Makhoul Statistical Language Models with Embedded Latent Semantic Knowledge, Jerome R. Bellegarda Semantic Information Processing of Spoken Language - How May I Help You?sm, A.L. Gorin, A. Abella, T. Alonso, G. Riccardi, and J.H. Wright Machine Translation Using Statistical Modeling, H. Ney and F.J. Och Modeling Topics for Detection and Tracking, James Allen

Proceedings ArticleDOI
Luhong Liang1, Xiaoxing Liu1, Yibao Zhao1, Xiaobo Pi1, Ara V. Nefian1 
07 Nov 2002
TL;DR: The speaker-independent audio-visual continuous speech recognition system presented relies on a robust set of visual features obtained from accurate detection and tracking of the mouth region, and integrates the audio and visual observation sequences using a coupled hidden Markov model (CHMM).
Abstract: The increase in the number of multimedia applications that require robust speech recognition systems has generated considerable interest in the study of audio-visual speech recognition (AVSR) systems. The use of visual features in AVSR is justified by both the audio and visual modality of speech generation and the need for features that are invariant to acoustic noise perturbation. The speaker-independent audio-visual continuous speech recognition system presented relies on a robust set of visual features obtained from the accurate detection and tracking of the mouth region. Further, the visual and acoustic observation sequences are integrated using a coupled hidden Markov (CHMM) model. The statistical properties of the CHMM can model the audio and visual state asynchrony while preserving their natural correlation over time. The experimental results show that the current system tested on the XM2VTS database reduces by over 55% the error rate of the audio-only speech recognition system at an SNR of 0 dB.

Proceedings ArticleDOI
13 May 2002
TL;DR: DYPSA is automatic and operates using the speech signal alone without the need for an EGG or Laryngograph signal and incorporates a new technique for estimating GCI candidates and employs dynamic programming to select the most likely candidates according to a defined cost function.
Abstract: We present the DYPSA algorithm for automatic and reliable estimation of glottal closure instants (GCIs) in voiced speech. Reliable GCI estimation is essential for closed-phase speech analysis, from which can be derived features of the vocal tract and, separately, the voice source. It has been shown that such features can be used with significant advantages in applications such as speaker recognition. DYPSA is automatic and operates using the speech signal alone without the need for an EGG or Laryngograph signal. It incorporates a new technique for estimating GCI candidates and employs dynamic programming to select the most likely candidates according to a defined cost function. We review and evaluate three existing methods and compare our new algorithm to them. Results for DYPSA show GCI detection accuracy to within ±0.25ms on 87% of the test database and fewer than 1% false alarms and misses.

Proceedings ArticleDOI
13 May 2002
TL;DR: A technique is proposed that automatically estimates speakers' age using only acoustic, not linguistic, information from their utterances, showing high correlation between speakers' age as estimated subjectively by humans and an automatically calculated score of ‘agedness’.
Abstract: This paper proposes a technique which automatically estimates speakers' age only with acoustic, not linguistic, information of their utterances. This method is based upon speaker recognition techniques. In the current work, we firstly divided speakers of two databases, JNAS and S(senior)-JNAS, into two groups by listening tests. One group has only the speakers whose speech sounds so aged that one should take special care when he/she talks to them. The other group has the remaining speakers of the two databases. After that, each speaker group was modeled with a GMM. Experiments of automatic identification of elderly speakers showed a correct identification rate of 91%. To improve the performance, two prosodic features were considered, i.e., speech rate and local perturbation of power. Using these features, the identification rate was improved to 95%. Finally, using scores calculated by integrating GMMs with prosodic features, experiments were carried out to automatically estimate speakers' age. The results showed high correlation between speakers' age estimated subjectively by humans and an automatically calculated score of ‘agedness’.
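A minimal sketch of the group classifier, assuming scikit-learn GMMs and a placeholder prosodic term (the paper integrates speech rate and local power perturbation; the weighting below is illustrative only):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_group_models(elderly_frames, other_frames, n_components=32, seed=0):
    """Fit one GMM per listener-defined group (elderly-sounding vs. other)."""
    gmm_e = GaussianMixture(n_components, covariance_type="diag",
                            random_state=seed).fit(elderly_frames)
    gmm_o = GaussianMixture(n_components, covariance_type="diag",
                            random_state=seed).fit(other_frames)
    return gmm_e, gmm_o

def agedness_score(frames, speech_rate, gmm_e, gmm_o, w_prosody=0.5):
    """'Agedness' score: acoustic log-likelihood ratio plus a prosodic term
    (slower speech raises the score). Weights are placeholders."""
    acoustic = gmm_e.score(frames) - gmm_o.score(frames)   # mean per-frame LLR
    return acoustic - w_prosody * speech_rate               # e.g. syllables/second
```

Classifying a speaker as elderly then reduces to thresholding this score, and the continuous score itself is what the paper correlates with human age judgments.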

Proceedings ArticleDOI
13 May 2002
TL;DR: It is found that, with sufficient test and training data, suprasegmental information can significantly enhance the performance of traditional speaker ID systems.
Abstract: We investigate the incorporation of larger time-scale information, such as prosody, into standard speaker ID systems. Our study is based on the Extended Data Task of the NIST 2001 Speaker ID evaluation, which provides much more test and training data than has traditionally been available to similar speaker ID investigations. In addition, we have had access to a detailed prosodic feature database of Switchboard-I conversations, including data not previously applied to speaker ID. We describe two baseline acoustic systems, an approach using Gaussian Mixture Models, and an LVCSR-based speaker ID system. These results are compared to and combined with two larger time-scale systems: a system based on an “idiolect” language model, and a system making use of the contents of the prosody database. We find that, with sufficient test and training data, suprasegmental information can significantly enhance the performance of traditional speaker ID systems.

Book
01 Jan 2002
TL;DR: An edited volume on biometric technologies and applications, covering face, voice, signature, fingerprint and palmprint recognition systems and their use in e-world and network security.
Abstract: Foreword. Preface. 1. Biometrics Applications in an E-World D. Zhang. 2. Achievements and Challenges in Fingerprint Recognition A.M. Bazen, S.H. Gerez. 3. Biometrics Electronic Purse E. Chong Tan, et al. 4. Face Recognition and its Application A.W. Senior, R.M. Bolle. 5. Personalize Mobile Access by Speaker Authentication K. Chen. 6. Biometrics on the Internet: Security Application and Services L.L. Ling, M.G. Lizarraga. 7. Forensic Identification Reporting Using Automatic Biometric Systems J. Gonzalez-Rodriguez, et al. 8. Signature Security System for E-Commerce B. Li, D. Zhang. 9. Web Guard: A Biometrics Identification System for Network Security J. You, D. Zhang. 10. Smart Card Application Based on Palmprint Identification G. Lu, D. Zhang. 11. Secure Fingerprint Authentication N.K. Ratha, et al. 12. From Biometrics Technology to Applications Regarding Face, Voice, Signature and Fingerprint Recognition Systems J. Ortega-Garcia, et al. 13. Face Verification for Access Control W. Gao, S. Shan. 14. Voice Biometrics for Securing Your Web-Based Business K. Farrell, et al. 15. Ability to Verify: A Metric for System Performance in Real-World Comparative Biometric Testing S. Nanavati, M. Thieme. 16. Automated Authentication Using Hybrid Biometric System N. Poh, J. Korczak. Index.

Proceedings Article
01 Jan 2002
TL;DR: Alternative methods for performing speaker identification that utilize domain dependent automatic speech recognition (ASR) to provide a phonetic segmentation of the test utterance are described.
Abstract: Traditional text independent speaker recognition systems are based on Gaussian Mixture Models (GMMs) trained globally over all speech from a given speaker. In this paper, we describe alternative methods for performing speaker identification that utilize domain dependent automatic speech recognition (ASR) to provide a phonetic segmentation of the test utterance. When evaluated on YOHO, several of these approaches were able to outperform previously published results on the speaker ID task. On a more difficult conversational speech task, we were able to use a combination of classifiers to reduce identification error rates on single test utterances. Over multiple utterances, the ASR dependent approaches performed significantly better than the ASR independent methods. Using an approach we call speaker adaptive modeling for speaker identification, we were able to reduce speaker identification error rates by 39% over a baseline GMM approach when observing five test utterances from a speaker.
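One ASR-dependent variant can be sketched as follows: given the phonetic segmentation of the test utterance, each segment is scored with the speaker's phone-dependent GMM (backing off to a phone-independent model for unseen phones) and per-frame log-likelihoods are summed per speaker. Training is omitted, and fitted scikit-learn GaussianMixture models are assumed.

```python
def score_speaker(segments, phone_gmms, global_gmm):
    """Total log-likelihood of one speaker for an ASR-segmented utterance.

    segments:   list of (phone_label, frames) pairs from the recognizer
    phone_gmms: dict phone_label -> fitted GaussianMixture for this speaker
    global_gmm: the speaker's phone-independent GMM, used as a back-off"""
    total = 0.0
    for phone, frames in segments:
        model = phone_gmms.get(phone, global_gmm)
        total += model.score_samples(frames).sum()   # per-frame log-likelihoods
    return total

def identify_speaker(segments, speaker_models):
    """speaker_models: dict name -> (phone_gmms, global_gmm). Returns the
    name with the highest total log-likelihood."""
    scores = {name: score_speaker(segments, pg, gg)
              for name, (pg, gg) in speaker_models.items()}
    return max(scores, key=scores.get)
```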

Journal ArticleDOI
TL;DR: By making the advances necessary to implement next-generation speech recognition applications, researchers could develop systems within a decade that match human performance levels.
Abstract: By making the advances necessary to implement next-generation speech recognition applications, researchers could develop systems within a decade that match human performance levels.