
Showing papers on "Speaker diarisation published in 2003"




Proceedings ArticleDOI
06 Apr 2003
TL;DR: The SuperSID project as mentioned in this paper used prosodic dynamics, pitch and duration features, phone streams, and conversational interactions to improve the accuracy of automatic speaker recognition using a defined NIST evaluation corpus and task.
Abstract: The area of automatic speaker recognition has been dominated by systems using only short-term, low-level acoustic information, such as cepstral features. While these systems have indeed produced very low error rates, they ignore other levels of information beyond low-level acoustics that convey speaker information. Recently published work has shown examples that such high-level information can be used successfully in automatic speaker recognition systems and has the potential to improve accuracy and add robustness. For the 2002 JHU CLSP summer workshop, the SuperSID project (http://www.clsp.jhu.edu/ws2002/groups/supersid/) was undertaken to exploit these high-level information sources and dramatically increase speaker recognition accuracy on a defined NIST evaluation corpus and task. The paper provides an overview of the structure, data, task, tools, and accomplishments of this project. Wide ranging approaches using pronunciation models, prosodic dynamics, pitch and duration features, phone streams, and conversational interactions were explored and developed. We show how these novel features and classifiers indeed provide complementary information and can be fused together to drive down the equal error rate on the 2001 NIST extended data task to 0.2% - a 71% relative reduction in error over the previous state of the art.

256 citations


Proceedings ArticleDOI
06 Apr 2003
TL;DR: Two approaches that use the fundamental frequency and energy trajectories to capture long-term information are proposed that can achieve a 77% relative improvement over a system based on short-term pitch and energy features alone.
Abstract: Most current state-of-the-art automatic speaker recognition systems extract speaker-dependent features by looking at short-term spectral information. This approach ignores long-term information that can convey supra-segmental information, such as prosodics and speaking style. We propose two approaches that use the fundamental frequency and energy trajectories to capture long-term information. The first approach uses bigram models to model the dynamics of the fundamental frequency and energy trajectories for each speaker. The second approach uses the fundamental frequency trajectories of a predefined set of words as the speaker templates and then, using dynamic time warping, computes the distance between the templates and the words from the test message. The results presented in this work are on Switchboard I using the NIST Extended Data evaluation design. We show that these approaches can achieve an equal error rate of 3.7%, which is a 77% relative improvement over a system based on short-term pitch and energy features alone.
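Illustrative note: the template matching in the second approach can be pictured with a small dynamic time warping routine. The sketch below is not the authors' implementation; the function name `dtw_distance`, the path-length normalisation and the toy F0 values are assumptions for illustration only.

```python
import numpy as np

def dtw_distance(template, test):
    """Dynamic time warping distance between two 1-D F0 trajectories."""
    n, m = len(template), len(test)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(template[i - 1] - test[j - 1])
            # Allow match, insertion, or deletion steps.
            cost[i, j] = d + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    # Normalise by path length so longer words are not penalised.
    return cost[n, m] / (n + m)

# Hypothetical usage: compare a test word against an enrolled speaker's template.
template_f0 = np.array([120.0, 125.0, 130.0, 128.0, 122.0])    # enrolled speaker
test_f0 = np.array([118.0, 124.0, 131.0, 127.0, 121.0, 119.0]) # same word, test message
print(dtw_distance(template_f0, test_f0))
```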

212 citations


Patent
03 Dec 2003
TL;DR: In this article, a fast on-line automatic speaker/environment adaptation system, method and computer program product suitable for speech/speaker recognition is presented, which consists of a computer system including a processor, a memory coupled with the processor, an input coupled with the processor for receiving acoustic signals, and an output coupled with the processor for outputting recognized words or sounds.
Abstract: A fast on-line automatic speaker/environment adaptation system, method and computer program product suitable for speech/speaker recognition are presented. The system comprises a computer system including a processor, a memory coupled with the processor, an input coupled with the processor for receiving acoustic signals, and an output coupled with the processor for outputting recognized words or sounds. The system includes a model-adaptation system and a recognition system, configured to accurately and efficiently recognize on-line distorted sounds or words spoken with different accents, in the presence of randomly changing environmental conditions. The model-adaptation system quickly adapts standard acoustic training models, available on audio recognition systems, by incorporating distortion parameters representative of the changing environmental conditions or the speaker's accent. By adapting models already available to the new environment, the system does not need separate adaptation training data.

161 citations


Proceedings Article
09 Dec 2003
TL;DR: A new phone-based SVM speaker recognition approach that halves the error rate of conventional phone-based approaches is introduced and a new kernel based upon a linearization of likelihood ratio scoring is derived.
Abstract: A recent area of significant progress in speaker recognition is the use of high level features—idiolect, phonetic relations, prosody, discourse structure, etc. A speaker not only has a distinctive acoustic sound but uses language in a characteristic manner. Large corpora of speech data available in recent years allow experimentation with long term statistics of phone patterns, word patterns, etc. of an individual. We propose the use of support vector machines and term frequency analysis of phone sequences to model a given speaker. To this end, we explore techniques for text categorization applied to the problem. We derive a new kernel based upon a linearization of likelihood ratio scoring. We introduce a new phone-based SVM speaker recognition approach that halves the error rate of conventional phone-based approaches.
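Illustrative note: the term-frequency modelling of phone sequences can be sketched as below. This is a simplified reading of the idea, not the paper's exact kernel; the weighting by the inverse square root of a background bigram frequency is one common linearised likelihood-ratio-style choice, and the helper names and toy phone streams are assumptions.

```python
import numpy as np
from collections import Counter

def bigram_freqs(phones):
    """Relative frequencies of phone bigrams in one conversation side."""
    counts = Counter(zip(phones[:-1], phones[1:]))
    total = sum(counts.values())
    return {bg: c / total for bg, c in counts.items()}

def tf_vector(phones, vocab, background):
    """Term-frequency vector weighted by 1/sqrt(background frequency),
    in the spirit of a linearised likelihood-ratio kernel."""
    freqs = bigram_freqs(phones)
    vec = np.zeros(len(vocab))
    for k, bg in enumerate(vocab):
        vec[k] = freqs.get(bg, 0.0) / np.sqrt(background[bg])
    return vec

# Hypothetical phone streams from an open-loop phone recogniser.
train = ["ah", "t", "ih", "s", "ah", "t"]
test = ["ah", "t", "ih", "t", "ah", "s"]
vocab = sorted(set(zip(train[:-1], train[1:])) | set(zip(test[:-1], test[1:])))
background = {bg: 1.0 / len(vocab) for bg in vocab}  # flat background for illustration

# A linear SVM on these vectors reduces to a dot-product (kernel) score.
score = tf_vector(train, vocab, background) @ tf_vector(test, vocab, background)
print(score)
```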

150 citations


01 Jan 2003
TL;DR: This thesis attempts to see feature extraction as a whole, starting from an understanding of the speech production process and what is known about speaker individuality, and then moving to the methods adopted directly from the speech recognition task.
Abstract: The front-end, or feature extractor, is the first component in an automatic speaker recognition system. Feature extraction transforms the raw speech signal into a compact but effective representation that is more stable and discriminative than the original signal. Since the front-end is the first component in the chain, the quality of the later components (speaker modeling and pattern matching) is strongly determined by the quality of the front-end. In other words, classification can be at most as accurate as the features. Several feature extraction methods have been proposed and successfully exploited in the speaker recognition task. However, almost exclusively, the methods are adopted directly from the speech recognition task. This is somewhat ironic, considering the opposite nature of the two tasks. In speech recognition, speaker variability is one of the major error sources, whereas in speaker recognition it is the information that we wish to extract. The mel-frequency cepstral coefficients (MFCC) are the most evident example of a feature set that is extensively used in speaker recognition but was originally developed for speech recognition purposes. When an MFCC front-end is used in a speaker recognition system, one makes the implicit assumption that the human hearing mechanism is the optimal speaker recognizer. However, this has not been confirmed, and in fact opposite results exist. Although several methods adopted from speech recognition have been shown to work well in practice, they are often used as “black boxes” with fixed parameters. It is not well understood what kind of information the features capture from the speech signal. Understanding the features at some level requires experience from specific areas such as speech physiology, acoustic phonetics, digital signal processing and statistical pattern recognition. According to the author's general impression of the literature, it increasingly seems that, at best, we are currently guessing at the code in the signal that carries our individuality. This thesis has two main purposes. On the one hand, we attempt to see feature extraction as a whole, starting from an understanding of the speech production process and what is known about speaker individuality, and then moving to the methods adopted directly from the speech recognition task.
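Illustrative note: since the thesis singles out MFCCs as the prime example of a front-end inherited from speech recognition, a compact textbook-style sketch of that pipeline is given below. It is not the author's code; the frame length, filterbank size and other settings are illustrative assumptions.

```python
import numpy as np
from scipy.fft import dct

def mfcc(signal, sr=8000, n_fft=256, n_mels=24, n_ceps=13,
         frame_len=0.025, frame_shift=0.010):
    """Minimal MFCC front-end: pre-emphasis, framing, Hamming window,
    mel filterbank, log compression, DCT."""
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])  # pre-emphasis
    flen, fshift = int(frame_len * sr), int(frame_shift * sr)
    n_frames = 1 + max(0, (len(signal) - flen) // fshift)
    frames = np.stack([signal[i * fshift: i * fshift + flen] for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames * np.hamming(flen), n_fft)) ** 2

    # Triangular filters equally spaced on the mel scale.
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = imel(np.linspace(0.0, mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    logmel = np.log(power @ fbank.T + 1e-10)
    return dct(logmel, type=2, axis=1, norm='ortho')[:, :n_ceps]

# Hypothetical usage on one second of synthetic "speech" at 8 kHz.
feats = mfcc(np.random.randn(8000))
print(feats.shape)   # (frames, 13)
```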

138 citations


Proceedings Article
01 Jan 2003
TL;DR: Two approaches for extracting speaker traits are investigated: the first focuses on general acoustic and prosodic features, the second on the choice of words used by the speaker, showing that voice signatures are of practical interest in real-world applications.
Abstract: Most current spoken-dialog systems only extract sequences of words from a speaker's voice. This largely ignores other useful information that can be inferred from speech such as gender, age, dialect, or emotion. These characteristics of a speaker's voice, voice signatures, whether static or dynamic, can be useful for speech mining applications or for the design of a natural spoken-dialog system. This paper explores the problem of automatically and accurately extracting voice signatures from a speaker's voice. We investigate two approaches for extracting speaker traits: the first focuses on general acoustic and prosodic features, the second on the choice of words used by the speaker. In the first approach, we show that standard speech/nonspeech HMMs, conditioned on speaker traits and evaluated on cepstral and pitch features, achieve accuracies well above chance for all examined traits. The second approach, using support vector machines with rational kernels applied to speech recognition lattices, attains an accuracy of about 8.1% in the task of binary classification of emotion. Our results are based on a corpus of speech data collected from a deployed customer-care application (HMIHY 0300). While still preliminary, our results are significant and show that voice signatures are of practical interest in real-world applications.

128 citations


Proceedings Article
01 Jan 2003
TL;DR: It is shown how novel features and classifiers provide complementary information and can be fused together to drive down the equal error rate on the 2001 NIST Extended Data Task to 0.2%—a 71% relative reduction in error over the previous state of the art.
Abstract: The area of automatic speaker recognition has been dominated by systems using only short-term, low-level acoustic information, such as cepstral features. While these systems have produced low error rates, they ignore higher levels of information beyond low-level acoustics that convey speaker information. Recently published works have demonstrated that such high-level information can be used successfully in automatic speaker recognition systems by improving accuracy and potentially increasing robustness. Wide-ranging high-level feature-based approaches using pronunciation models, prosodic dynamics, pitch gestures, phone streams, and conversational interactions were explored and developed under the SuperSID project at the 2002 JHU CLSP Summer Workshop (WS2002): http://www.clsp.jhu.edu/ws2002/groups/supersid/. In this paper, we show how these novel features and classifiers provide complementary information and can be fused together to drive down the equal error rate on the 2001 NIST Extended Data Task to 0.2%—a 71% relative reduction in error over the previous state of the art.
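Illustrative note: the abstract does not specify the fusion rule, so the sketch below stands in with a simple z-normalisation of subsystem scores followed by a logistic-regression combiner; the subsystem scores, labels and weights are invented for illustration and are not the SuperSID fusion.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical development scores: rows = trials, columns = subsystems
# (e.g. cepstral GMM, phone n-gram, prosodic classifier); labels 1 = target speaker.
rng = np.random.default_rng(0)
dev_scores = np.vstack([rng.normal(1.0, 1.0, (200, 3)),    # target trials
                        rng.normal(-1.0, 1.0, (200, 3))])  # impostor trials
dev_labels = np.r_[np.ones(200), np.zeros(200)]

# Z-normalise each subsystem so its scores are comparable before fusion.
mu, sigma = dev_scores.mean(axis=0), dev_scores.std(axis=0)
fuser = LogisticRegression().fit((dev_scores - mu) / sigma, dev_labels)

# Fused score for a new trial is a weighted sum of the normalised subsystem scores.
trial = np.array([[0.8, 1.4, -0.2]])
print(fuser.decision_function((trial - mu) / sigma))
```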

104 citations


Book ChapterDOI
24 Jul 2003
TL;DR: This paper reviews definitions of audio-visual synchrony and examines their empirical behaviour on test sets up to 200 times larger than used by other authors to give new insights into the practical utility of existing synchrony definitions.
Abstract: This paper reviews definitions of audio-visual synchrony and examines their empirical behaviour on test sets up to 200 times larger than used by other authors. The results give new insights into the practical utility of existing synchrony definitions and justify application of audio-visual synchrony techniques to the problem of active speaker localisation in broadcast video. Performance is evaluated using a test set of twelve clips of alternating speakers from the multiple speaker CUAVE corpus. Accuracy of 76% is obtained for the task of identifying the active member of a speaker pair at different points in time, comparable to performance given by two purely video image-based schemes. Accuracy of 65% is obtained on the more challenging task of locating a point within a 100×100 pixel square centered on the active speaker's mouth with no prior face detection; the performance upper bound if perfect face detection were available is 69%. This result is significantly better than two purely video image-based schemes.
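Illustrative note: the paper compares several formal synchrony definitions; the sketch below shows only the simplest correlation-style measure between an audio energy envelope and region-level pixel change, with synthetic inputs, as a rough stand-in rather than any of the reviewed definitions.

```python
import numpy as np

def synchrony_score(audio_energy, pixel_change):
    """Correlation-based audio-visual synchrony between a frame-rate audio
    energy envelope and the mean absolute pixel change in a candidate region."""
    a = audio_energy - audio_energy.mean()
    v = pixel_change - pixel_change.mean()
    return float(a @ v / (np.linalg.norm(a) * np.linalg.norm(v) + 1e-10))

# Hypothetical: two candidate mouth regions over 100 video frames; the one
# whose motion correlates best with the audio is declared the active speaker.
rng = np.random.default_rng(1)
energy = rng.random(100)
region_a = 0.8 * energy + 0.2 * rng.random(100)   # moves with the audio
region_b = rng.random(100)                        # unrelated motion
print(synchrony_score(energy, region_a), synchrony_score(energy, region_b))
```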

100 citations


Proceedings ArticleDOI
06 Apr 2003
TL;DR: A set of new algorithms that perform speaker clustering in an online fashion that enables low-latency incremental speaker adaptation in online speech-to-text systems and gives a speaker tracking and indexing system the ability to label speakers with cluster ID on the fly.
Abstract: This paper describes a set of new algorithms that perform speaker clustering in an online fashion. Unlike typical clustering approaches, the proposed method does not require the presence of all the data before performing clustering. The clustering decision is made as soon as an audio segment is received. Being causal, this method enables low-latency incremental speaker adaptation in online speech-to-text systems. It also gives a speaker tracking and indexing system the ability to label speakers with cluster ID on the fly. We show that the new online speaker clustering method yields better performance compared to the traditional hierarchical speaker clustering. Evaluation metrics for speaker clustering are also discussed.
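Illustrative note: the abstract does not give the distance measure or decision rule, so the sketch below uses a plain nearest-centroid test with a fixed threshold as one possible causal scheme; the class name, threshold value and toy segments are assumptions, not the proposed algorithms.

```python
import numpy as np

class OnlineSpeakerClusterer:
    """Causal clustering: each incoming segment is labelled immediately,
    either with the nearest existing cluster or with a brand-new cluster ID."""

    def __init__(self, threshold=2.0):
        self.threshold = threshold
        self.centroids = []   # running mean feature vector per cluster
        self.counts = []

    def label(self, segment_features):
        x = segment_features.mean(axis=0)          # crude segment-level embedding
        if self.centroids:
            dists = [np.linalg.norm(x - c) for c in self.centroids]
            k = int(np.argmin(dists))
            if dists[k] < self.threshold:
                # Update the winning centroid incrementally and reuse its ID.
                self.counts[k] += 1
                self.centroids[k] += (x - self.centroids[k]) / self.counts[k]
                return k
        self.centroids.append(x.copy())
        self.counts.append(1)
        return len(self.centroids) - 1

# Hypothetical usage on a stream of MFCC matrices, one per audio segment.
clusterer = OnlineSpeakerClusterer()
rng = np.random.default_rng(2)
for seg in [rng.normal(0, 1, (50, 13)), rng.normal(5, 1, (40, 13)), rng.normal(0, 1, (60, 13))]:
    print(clusterer.label(seg))   # expected labels: 0, 1, 0
```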

96 citations


PatentDOI
TL;DR: In this article, a system and method for automatic acoustic speaker adaptation in an automatic speech recognition assisted transcription system is presented, where partial transcripts of audio files are generated by a transcriptionist and a topic language model is generated from the partial transcripts.
Abstract: The invention is a system and method for automatic acoustic speaker adaptation in an automatic speech recognition assisted transcription system. Partial transcripts of audio files are generated by a transcriptionist. A topic language model is generated from the partial transcripts. The topic language model is interpolated with a general language model. Automatic speech recognition is performed on the audio files by a speech recognition engine using a speaker independent acoustic model and the interpolated language model to generate semi-literal transcripts of the audio files. The semi-literal transcripts are then used with the corresponding audio files to generate a speaker dependent acoustic model in an acoustic adaptation engine.
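Illustrative note: the interpolation of a topic language model with a general one is, in its simplest form, a linear mixture. The unigram sketch below illustrates that mixture with made-up probabilities and an assumed weight; a real transcription system would interpolate full n-gram models.

```python
def interpolate_lm(p_topic, p_general, lam=0.3):
    """Linear interpolation of a topic LM with a general LM:
    P(w) = lam * P_topic(w) + (1 - lam) * P_general(w)."""
    vocab = set(p_topic) | set(p_general)
    return {w: lam * p_topic.get(w, 0.0) + (1 - lam) * p_general.get(w, 0.0)
            for w in vocab}

# Hypothetical unigram probabilities estimated from partial transcripts (topic)
# and from a large general corpus.
p_topic = {"radiograph": 0.02, "patient": 0.05, "the": 0.06}
p_general = {"patient": 0.001, "the": 0.07, "hello": 0.01}
print(interpolate_lm(p_topic, p_general)["patient"])
```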

Proceedings ArticleDOI
14 Dec 2003
TL;DR: After applying several conventional VTLN warping functions, the piecewise linear function is extended to several segments, allowing a more detailed warping of the source spectrum.
Abstract: In speech recognition, vocal tract length normalization (VTLN) is a well-studied technique for speaker normalization. As voice conversion aims at the transformation of a source speaker's voice into that of a target speaker, we want to investigate whether VTLN is an appropriate method to adapt the voice characteristics. After applying several conventional VTLN warping functions, we extend the piecewise linear function to several segments, allowing a more detailed warping of the source spectrum. Experiments on voice conversion are performed on three corpora of two languages and both speaker genders.
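Illustrative note: the sketch below shows the conventional two-segment piecewise-linear VTLN warp applied to a magnitude spectrum, as a baseline for the multi-segment extension the paper describes; the warp factor, break frequency and test frame are assumed values.

```python
import numpy as np

def piecewise_linear_warp(freqs, alpha, f_break, f_nyq):
    """Two-segment piecewise-linear VTLN warp: scale by alpha up to f_break,
    then a second linear segment that maps f_nyq onto itself."""
    return np.where(
        freqs <= f_break,
        alpha * freqs,
        alpha * f_break + (f_nyq - alpha * f_break) * (freqs - f_break) / (f_nyq - f_break),
    )

def warp_spectrum(mag, sr, alpha=1.1, break_frac=0.8):
    """Resample a magnitude spectrum along the warped frequency axis."""
    f = np.linspace(0.0, sr / 2, len(mag))
    f_warp = piecewise_linear_warp(f, alpha, break_frac * sr / 2, sr / 2)
    # The warped spectrum at frequency f takes the source spectrum's value at f_warp.
    return np.interp(f_warp, f, mag)

# Hypothetical usage: warp one analysis frame of a source speaker's spectrum.
sr = 16000
mag = np.abs(np.fft.rfft(np.random.randn(512)))
print(warp_spectrum(mag, sr).shape)
```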

Proceedings Article
01 Jan 2003
TL;DR: The use of temporal trajectories of fundamental frequency and short-term energy to segment and label the speech signal into a small set of discrete units that can be used to characterize speaker and/or language is proposed.
Abstract: Current Automatic Speech Recognition systems convert the speech signal into a sequence of discrete units, such as phonemes, and then apply statistical methods on the units to produce the linguistic message. Similar methodology has also been applied to recognize speaker and language, except that the output of the system can be the speaker or language information. Therefore, we propose the use of temporal trajectories of fundamental frequency and short-term energy to segment and label the speech signal into a small set of discrete units that can be used to characterize speaker and/or language. The proposed approach is evaluated using the NIST Extended Data Speaker Detection task and the NIST Language Identification task.

Proceedings ArticleDOI
06 Apr 2003
TL;DR: The described approach confirms the relevance of long phonetic context in phonetic speaker recognition and represents an intermediate stage between short phone context and word-level modeling without the need for any lexical knowledge, which suggests its language independence.
Abstract: Recent work in phonetic speaker recognition has shown that modeling phone sequences using n-grams is a viable and effective approach to speaker recognition, primarily aiming at capturing speaker-dependent pronunciation and also word usage. The paper describes a method involving binary-tree-structured statistical models for extending the phonetic context beyond that of standard n-grams (particularly bigrams) by exploiting statistical dependencies within a longer sequence window without exponentially increasing the model complexity, as is the case with n-grams. Two ways of dealing with data sparsity are also studied; namely, model adaptation and a recursive bottom-up smoothing of symbol distributions. Results obtained under a variety of experimental conditions using the NIST 2001 Speaker Recognition Extended Data Task indicate consistent improvements in equal-error rate performance as compared to standard bigram models. The described approach confirms the relevance of long phonetic context in phonetic speaker recognition and represents an intermediate stage between short phone context and word-level modeling without the need for any lexical knowledge, which suggests its language independence.

Proceedings ArticleDOI
06 Apr 2003
TL;DR: Support vector machines with the Fisher and score-space kernels are used for text independent speaker verification to provide direct discrimination between complete utterances to achieve error-rates that are significantly better than the current state-of-the-art on the PolyVar database.
Abstract: Support vector machines with the Fisher and score-space kernels are used for text independent speaker verification to provide direct discrimination between complete utterances. This is unlike approaches such as discriminatively trained Gaussian mixture models or other discriminative classifiers that discriminate at the frame-level only. Using the sequence-level discrimination approach we are able to achieve error-rates that are significantly better than the current state-of-the-art on the PolyVar database.

Patent
20 Mar 2003
TL;DR: In this article, a speaker authentication system includes a data fuser operable to fuse information to assist in authenticating a speaker providing audio input, including a data store of speaker voiceprints and a voiceprint matching module adapted to receive an audio input.
Abstract: A speaker authentication system includes a data fuser operable to fuse information to assist in authenticating a speaker providing audio input. In other aspects, the system includes a data store of speaker voiceprints and a voiceprint matching module adapted to receive an audio input and operable to attempt to assist in authenticating a speaker by matching the audio input to at least one of the speaker voiceprints.

Proceedings Article
01 Jan 2003
TL;DR: It is shown how eigenvoice MAP can be modified to yield a new model-based channel compensation technique which is called eigenchannel MAP, which was found to reduce speaker identification errors by 50%.
Abstract: We report the results of some experiments which demonstrate that eigenvoice MAP and eigenphone MAP are at least as effective as classical MAP for discriminative speaker modeling on SWITCHBOARD data. We show how eigenvoice MAP can be modified to yield a new model-based channel compensation technique which we call eigenchannel MAP. When compared with multi-channel training, eigenchannel MAP was found to reduce speaker identification errors by 50%.

Proceedings Article
01 Jan 2003
TL;DR: This work presents a method for speaker recognition that uses the duration patterns of speech units to aid speaker classification and finds that this approach yields significant performance improvement when combined with a state-of-the-art speaker recognition system based on standard cepstral features.
Abstract: We present a method for speaker recognition that uses the duration patterns of speech units to aid speaker classification. The approach represents each word and/or phone by a feature vector comprised of either the durations of the individual phones making up the word, or the HMM states making up the phone. We model the vectors using mixtures of Gaussians. The speaker-specific models are obtained through adaptation of a “background” model that is trained on a large pool of speakers. Speaker models are then used to score the test data; they are normalized by subtracting the scores obtained with the background model. We find that this approach yields significant performance improvement when combined with a state-of-the-art speaker recognition system based on standard cepstral features. Furthermore, the improvement persists even after combination with lexical features. Finally, the improvement continues to increase with longer test sample durations, beyond the test duration at which standard system accuracy levels off.


Patent
Damon V. Danieli1
25 Sep 2003
TL;DR: Visually identifying one or more known or anonymous voice speakers to a listener in a computing session is discussed in this article, where the visual indicator is preferably associated with a visual element controlled by the voice speaker, such as an animated game character.
Abstract: Visually identifying one or more known or anonymous voice speakers to a listener in a computing session. For each voice speaker, voice data include a speaker identifier that is associated with a visual indicator displayed to indicate the voice speaker who is currently speaking. The speaker identifier is first used to determine voice privileges before the visual indicator is displayed. The visual indicator is preferably associated with a visual element controlled by the voice speaker, such as an animated game character. Visually identifying a voice speaker enables the listener and/or a moderator of the computing session to control voice communications, such as muting an abusive voice speaker. The visual indicator can take various forms, such as an icon displayed adjacent to the voice speaker's animated character, or a different icon displayed in a predetermined location if the voice speaker's animated character is not currently visible to the listener.

Proceedings ArticleDOI
06 Apr 2003
TL;DR: This paper proposes a new usable speech extraction method to improve the SID performance under the co-channel situation based on the pitch information obtained from a robust multi-pitch tracking algorithm.
Abstract: Recently, usable speech criteria have been proposed to extract minimally corrupted speech for speaker identification (SID) in co-channel speech. In this paper, we propose a new usable speech extraction method to improve the SID performance under the co-channel situation, based on the pitch information obtained from a robust multi-pitch tracking algorithm [2]. The idea is to retain the speech segments that have only one pitch detected and remove the others. The system is evaluated on co-channel speech and results show a significant improvement across various target to interferer ratios (TIR) for speaker identification.
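Illustrative note: the stated selection rule (keep segments where exactly one pitch is detected) can be written in a few lines. The sketch below assumes the multi-pitch tracker returns a list of detected F0 values per frame; it is not the paper's implementation, and the feature frames are synthetic.

```python
import numpy as np

def usable_frame_mask(pitch_tracks):
    """Keep frames in which the multi-pitch tracker found exactly one pitch,
    i.e. frames dominated by a single talker in the co-channel mixture."""
    return np.array([len(p) == 1 for p in pitch_tracks])

def select_usable(frames, pitch_tracks):
    """Return the feature frames flagged as usable for speaker identification."""
    return frames[usable_frame_mask(pitch_tracks)]

# Hypothetical tracker output: per-frame lists of detected F0 values in Hz.
pitch_tracks = [[110.0], [112.0, 205.0], [], [111.0], [108.0]]
frames = np.random.randn(5, 13)   # e.g. MFCC frames aligned with the pitch frames
print(select_usable(frames, pitch_tracks).shape)   # (3, 13): frames 0, 3 and 4 kept
```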

Proceedings ArticleDOI
06 Jul 2003
TL;DR: Experiments testing the system show that the proposed location features can provide greater discrimination than standard cepstral features, and also demonstrate the success of an extension to handle dual-speaker overlap.
Abstract: The paper proposes a technique that segments audio according to speakers and based on their location. In many multi-party conversations, such as meetings, the location of participants is restricted to a small number of regions, such as seats around a table, or at a whiteboard. In such cases, segmentation according to these discrete regions would be a reliable means of determining speaker turns. We propose a system that uses microphone pair time delays as features to represent speaker locations. These features are integrated in a GMM/HMM framework to determine an optimal segmentation of the audio according to location. The HMM framework also allows extensions to recognise more complex structures, such as the presence of two simultaneous speakers. Experiments testing the system on real recordings from a meeting room show that the proposed location features can provide greater discrimination than standard cepstral features, and also demonstrate the success of an extension to handle dual-speaker overlap.
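Illustrative note: the abstract does not say how the microphone-pair time delays are estimated; GCC-PHAT is a standard choice and is sketched below under that assumption, with synthetic two-channel input. The resulting per-pair delays would then serve as the location features fed to the GMM/HMM segmenter.

```python
import numpy as np

def gcc_phat_delay(x1, x2, sr, max_delay=0.001):
    """Estimate the time delay of x2 relative to x1 (seconds) with GCC-PHAT;
    a positive value means x2 lags x1."""
    n = 2 * len(x1)                                  # zero-pad to avoid circular wrap
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    cross = X2 * np.conj(X1)
    cc = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n)   # PHAT weighting
    max_lag = int(max_delay * sr)
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))  # lags -max_lag .. +max_lag
    return (np.argmax(cc) - max_lag) / sr

# Hypothetical usage: channel 2 is channel 1 delayed by 5 samples.
sr = 16000
x1 = np.random.randn(4096)
x2 = np.concatenate((np.zeros(5), x1[:-5]))
print(gcc_phat_delay(x1, x2, sr) * sr)   # about +5 samples
```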

Patent
20 Jun 2003
TL;DR: In this article, an audio processing system and method for classifying speakers in audio data using a discriminatively-trained classifier is presented, where the anchor model outputs are mapped to frame tags to that all speech corresponding to a single frame tag comes from a single speaker.
Abstract: An audio processing system and method for classifying speakers in audio data using a discriminatively-trained classifier. In general, the audio processing system inputs audio data containing unknown speakers and outputs frame tags whereby each tag represents an individual speaker. The audio processing system includes a training system for training a discriminatively-trained classifier (such as a time-delay neural network) and a speaker classification system for using the classifier to segment and classify the speakers. The audio processing method includes two phases. A training phase discriminatively trains the classifier on a speaker training set containing known speakers and produces fixed classifier data. A use phase uses the fixed classifier data in the discriminatively-trained classifier to produce anchor model outputs for every frame of speech in the audio data. The anchor model outputs are mapped to frame tags to that all speech corresponding to a single frame tag comes from a single speaker.

Proceedings ArticleDOI
06 Jul 2003
TL;DR: A bimodal audio-visual speaker identification system that exploits not only the temporal and spatial correlations existing in the speech and video signals of a speaker, but also the cross-correlation between these two modalities.
Abstract: In this paper we present a bimodal audio-visual speaker identification system. The objective is to improve the recognition performance over conventional unimodal schemes. The proposed system exploits not only the temporal and spatial correlations existing in speech and video signals of a speaker, but also the cross-correlation between these two modalities. Lip images extracted for each video frame are transformed onto an eigenspace. The obtained eigenlip coefficients are interpolated to match the rate of the speech signal and fused with mel frequency cepstral coefficients (MFCC) of the corresponding speech signal. The resulting joint feature vectors are used to train and test a hidden Markov model (HMM) based identification system. Experimental results are also included for demonstration of the system performance.

Proceedings Article
George Saon1, Geoffrey Zweig1, Brian Kingsbury1, Lidia Mangu1, Upendra V. Chaudhari1 
01 Jan 2003
TL;DR: The architecture proposed is based on classical HMM Viterbi decoding, but uses an extremely fast initial speaker-independent decoding to estimate VTL warp factors, feature-space and model-space MLLR transformations that are used in a final speaker-adapted decoding.
Abstract: This paper addresses the question of how to design a large vocabulary recognition system so that it can simultaneously handle a sophisticated language model, perform state-of-the-art speaker adaptation, and run in one times real time (1×RT). The architecture we propose is based on classical HMM Viterbi decoding, but uses an extremely fast initial speaker-independent decoding to estimate VTL warp factors, feature-space and model-space MLLR transformations that are used in a final speaker-adapted decoding. We present results on past Switchboard evaluation data that indicate that this strategy compares favorably to published unlimited-time systems (running in several hundred times real-time). Coincidentally, this is the system that IBM fielded in the 2003 EARS Rich Transcription evaluation.

Patent
18 Mar 2003
TL;DR: In this article, individual voice recognition units for each speaker in a conference were used to perform automatic transcription of that speaker's contribution to the conference, which was then merged on a time basis to produce a textual transcription of the entire telecommunication conference call.
Abstract: Utilizing individual voice recognition units for each speaker in a conference to perform automatic transcription of that speaker's contribution to the conference. The output of each of the voice recognition units is then merged on a time basis to produce a textual transcription of the entire telecommunication conference call.
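Illustrative note: the time-based merge of per-speaker recogniser outputs can be sketched as below; the (start time, text) tuple format and the speaker names are assumptions for illustration only.

```python
def merge_transcripts(per_speaker):
    """Merge per-speaker recogniser outputs into one transcript ordered by time.

    per_speaker maps a speaker name to a list of (start_time_seconds, text) pairs,
    as produced by that speaker's dedicated voice recognition unit."""
    merged = []
    for speaker, turns in per_speaker.items():
        merged.extend((t, speaker, text) for t, text in turns)
    merged.sort(key=lambda item: item[0])
    return "\n".join(f"[{t:7.2f}s] {speaker}: {text}" for t, speaker, text in merged)

# Hypothetical conference with two speakers, each with their own recogniser.
print(merge_transcripts({
    "Alice": [(0.0, "Shall we start?"), (12.4, "Agreed, let's move on.")],
    "Bob": [(5.1, "Yes, the agenda is short today.")],
}))
```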

Patent
Delphine Charlet1
22 Jul 2003
TL;DR: In this article, a speech recognition device generates parameters of an acceptance voice model relating to a voice segment spoken by an authorized speaker and a rejection voice model during a learning phase, and uses normalization parameters to normalize a speaker verification score that depends on the likelihood ratio of the voice segment under test given the acceptance and rejection models.
Abstract: During a learning phase, a speech recognition device generates parameters of an acceptance voice model relating to a voice segment spoken by an authorized speaker and a rejection voice model. It uses normalization parameters to normalize a speaker verification score that depends on the likelihood ratio of a voice segment under test given the acceptance model and the rejection model. The speaker obtains access to a service application only if the normalized score is above a threshold. According to the invention, a module updates the normalization parameters as a function of the verification score on each voice segment test, but only if the normalized score is above a second threshold.
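Illustrative note: the patent abstract leaves the normalisation and update formulas unspecified, so the sketch below uses a z-norm-style normalisation with a running update applied only above a second threshold; the class name, thresholds and learning rate are assumptions, not the patented method.

```python
class AdaptiveScoreNormalizer:
    """Normalise speaker-verification log-likelihood ratios and update the
    normalisation statistics only on confidently accepted test segments."""

    def __init__(self, mu=0.0, sigma=1.0, accept_thresh=0.5, update_thresh=1.5, rate=0.05):
        self.mu, self.sigma = mu, sigma
        self.accept_thresh, self.update_thresh, self.rate = accept_thresh, update_thresh, rate

    def verify(self, llr):
        """llr: log p(segment | acceptance model) - log p(segment | rejection model)."""
        norm = (llr - self.mu) / self.sigma
        accepted = norm > self.accept_thresh
        if norm > self.update_thresh:
            # Unsupervised adaptation of the normalisation parameters.
            self.mu += self.rate * (llr - self.mu)
            self.sigma += self.rate * (abs(llr - self.mu) - self.sigma)
        return accepted, norm

# Hypothetical sequence of raw likelihood-ratio scores for one enrolled speaker.
normalizer = AdaptiveScoreNormalizer(mu=2.0, sigma=1.0)
for llr in [2.2, 4.1, 3.8, 1.0]:
    print(normalizer.verify(llr))
```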

Proceedings Article
01 Sep 2003
TL;DR: A new method for the transformation of F0 contours from one speaker to another based on a small linguistically motivated parameter set is presented, and it is shown that in many cases it is much better and almost as good as using the target F0 contour.
Abstract: Voice transformation is the process of transforming the characteristics of speech uttered by a source speaker, such that a listener would believe the speech was uttered by a target speaker. Training F0 contour generation models for speech synthesis requires a large corpus of speech. If it were possible to adapt the F0 contour of one speaker to sound like that of another speaker, using a small, easily obtainable parameter set, this would be extremely valuable. We present a new method for the transformation of F0 contours from one speaker to another based on a small linguistically motivated parameter set. The system performs a piecewise linear mapping using these parameters. A perceptual experiment clearly demonstrates that the presented system is at least as good as an existing technique for all speaker pairs, and that in many cases it is much better and almost as good as using the target F0 contour.

Proceedings ArticleDOI
06 Apr 2003
TL;DR: This paper proposes a flexible framework that selects an optimal speaker model (GMM or VQ) based on the BIC according to the duration of utterances, and demonstrates that the proposed method achieves higher indexing performance than that of conventional methods.
Abstract: The paper addresses unsupervised speaker indexing for discussion audio archives. In discussions, the speaker changes frequently, thus the duration of utterances is very short and its variation is large, which causes significant problems in applying conventional methods such as model adaptation and variance-BIC (Bayesian information criterion) methods. We propose a flexible framework that selects an optimal speaker model (GMM or VQ) based on the BIC according to the duration of utterances. When the speech segment is short, the simple and robust VQ-based method is expected to be chosen, while GMM can be reliably trained for long segments. For a discussion archive having a total duration of 10 hours, it is demonstrated that the proposed method achieves higher indexing performance than that of conventional methods.
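Illustrative note: the duration-dependent choice between a GMM and a VQ codebook can be sketched with scikit-learn as below; the frame threshold, candidate model sizes and BIC-based tie-break are assumptions rather than the paper's exact criterion.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.cluster import KMeans

def select_speaker_model(segment, min_gmm_frames=300, codebook_size=8):
    """Pick a GMM when the segment is long enough for reliable EM training
    (choosing the lowest-BIC candidate), otherwise fall back to a VQ codebook."""
    if len(segment) >= min_gmm_frames:
        candidates = [GaussianMixture(k, covariance_type='diag', random_state=0).fit(segment)
                      for k in (2, 4, 8)]
        best = min(candidates, key=lambda g: g.bic(segment))
        return 'GMM', best
    # Short utterance: a simple VQ codebook (k-means centroids) is more robust.
    return 'VQ', KMeans(n_clusters=codebook_size, n_init=5, random_state=0).fit(segment)

# Hypothetical usage with MFCC-like features for a long and a short utterance.
rng = np.random.default_rng(3)
print(select_speaker_model(rng.normal(0, 1, (500, 13)))[0])   # GMM
print(select_speaker_model(rng.normal(0, 1, (80, 13)))[0])    # VQ
```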

Book ChapterDOI
TL;DR: Experimental results show that the proposed text-dependent audio-visual speaker identification approach improves the accuracy of the audio-only or video-only speaker identification at all levels of acoustic signal-to-noise ratio (SNR) from 5 to 30 dB.
Abstract: In this paper we describe a text-dependent audio-visual speaker identification approach that combines face recognition and audio-visual speech-based identification systems. The temporal sequence of audio and visual observations obtained from the acoustic speech and the shape of the mouth are modeled using a set of coupled hidden Markov models (CHMM), one for each phoneme-viseme pair and for each person in the database. The use of CHMM in our system is justified by the capability of this model to describe the natural audio and visual state asynchrony as well as their conditional dependence over time. Next, the likelihood obtained for each person in the database is combined with the face recognition likelihood obtained using an embedded hidden Markov model (EHMM). Experimental results on the XM2VTS database show that our system improves the accuracy of the audio-only or video-only speaker identification at all levels of acoustic signal-to-noise ratio (SNR) from 5 to 30 dB.