Showing papers on "Viseme" published in 2001


Journal ArticleDOI
TL;DR: Audiovisual speech processing results have shown that, with lip reading, it is possible to enhance the reliability of audio speech recognition, which may result in a computer that can truly understand the user via hands-free natural spoken language even in very noisy environments.
Abstract: We have reported activities in audiovisual speech processing, with emphasis on lip reading and lip synchronization. These research results have shown that, with lip reading, it is possible to enhance the reliability of audio speech recognition, which may result in a computer that can truly understand the user via hands-free natural spoken language even in very noisy environments. Similarly, with lip synchronization, it is possible to render realistic talking heads with lip movements synchronized with the voice, which is very useful for human-computer interaction. We envision that in the near future, advancement in audiovisual speech processing will greatly increase the usability of computers. Once that happens, the camera and the microphone may replace the keyboard and the mouse as better mechanisms for human-computer interaction.

244 citations


Proceedings ArticleDOI
03 Jul 2001
TL;DR: A new technique for expressive and realistic speech animation that uses an optical tracking system that extracts the 3D positions of markers attached at the feature point locations to capture the movements of the face of a talking person and forms a vector space representation that offers insight into improving realism of animated faces.
Abstract: We describe a new technique for expressive and realistic speech animation. We use an optical tracking system that extracts the 3D positions of markers attached at the feature point locations to capture the movements of the face of a talking person. We use the feature points as defined by the MPEG-4 standard. We then form a vector space representation by using the principal component analysis of this data. We call this space "expression and viseme space". Such a representation not only offers insight into improving realism of animated faces, but also gives a new way of generating convincing speech animation and blending between several expressions. As the rigid body movements and deformation constraints on the facial movements have been considered through this analysis, the resulting facial animation is very realistic.

69 citations
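
The paper itself gives no code; the sketch below is a minimal illustration of building a PCA-based "expression and viseme space" from captured 3D marker positions and blending two captured frames inside it. The array layout, component count, and function names are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def build_viseme_space(frames, n_components=10):
    """Build a low-dimensional 'expression and viseme space' via PCA.

    frames: (n_frames, n_markers * 3) array of flattened 3D marker
            positions captured from a talking face (assumed layout).
    Returns the mean frame and the top principal directions.
    """
    mean = frames.mean(axis=0)
    centered = frames - mean
    # SVD of the centered data gives the principal components.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]            # (n_components, n_markers*3)

def project(frame, mean, components):
    """Coordinates of one captured frame in the viseme space."""
    return components @ (frame - mean)

def reconstruct(coords, mean, components):
    """Blend back to marker positions, e.g. to mix two expressions."""
    return mean + coords @ components

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(200, 3 * 30))   # 200 frames, 30 markers (dummy data)
    mean, comps = build_viseme_space(frames)
    a = project(frames[0], mean, comps)
    b = project(frames[1], mean, comps)
    # Halfway blend between two captured shapes, done in the reduced space.
    blended_markers = reconstruct(0.5 * (a + b), mean, comps)
```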


Patent
13 Aug 2001
TL;DR: In this paper, a plurality of visual-facial-animation values are provided based on tracking of facial features in the sequence of facial image frames of the speaking actor, and a plurality of audio-facial-animation values are provided based on visemes detected using the synchronously captured audio voice data of the actor.
Abstract: Facial animation values are generated using a sequence of facial image frames and synchronously captured audio data of a speaking actor. In the technique, a plurality of visual-facial-animation values are provided based on tracking of facial features in the sequence of facial image frames of the speaking actor, and a plurality of audio-facial-animation values are provided based on visemes detected using the synchronously captured audio voice data of the speaking actor. The plurality of visual facial animation values and the plurality of audio facial animation values are combined to generate output facial animation values for use in facial animation.

56 citations
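
The patent abstract does not say how the visual and audio facial-animation values are combined; the sketch below shows one plausible reading, a per-parameter weighted blend of the two streams. The dictionary format, parameter names, and the fixed weight are illustrative assumptions.

```python
def combine_animation_values(visual_values, audio_values, visual_weight=0.6):
    """Blend per-frame facial-animation values from the two sources.

    visual_values, audio_values: dicts mapping a parameter name
    (e.g. 'jaw_open') to a float for the current frame (assumed format).
    Parameters seen by only one source pass through unchanged.
    """
    combined = {}
    for name in set(visual_values) | set(audio_values):
        v = visual_values.get(name)
        a = audio_values.get(name)
        if v is None:
            combined[name] = a
        elif a is None:
            combined[name] = v
        else:
            combined[name] = visual_weight * v + (1.0 - visual_weight) * a
    return combined
```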


Journal ArticleDOI
TL;DR: A three-stage pixel-based visual front end for automatic speechreading (lipreading) that results in significantly improved recognition performance of spoken words or phonemes is proposed and improved audio-visual phonetic classification over the use of a single-stage image transform visual frontend is reported.
Abstract: We propose a three-stage pixel-based visual front end for automatic speechreading (lipreading) that results in significantly improved recognition performance of spoken words or phonemes. The proposed algorithm is a cascade of three transforms applied on a three-dimensional video region-of-interest that contains the speaker's mouth area. The first stage is a typical image compression transform that achieves a high-energy, reduced-dimensionality representation of the video data. The second stage is a linear discriminant analysis-based data projection, which is applied on a concatenation of a small amount of consecutive image transformed video data. The third stage is a data rotation by means of a maximum likelihood linear transform that optimizes the likelihood of the observed data under the assumption of their class-conditional multivariate normal distribution with diagonal covariance. We applied the algorithm to visual-only 52-class phonetic and 27-class visemic classification on a 162-subject, 8-hour long, large vocabulary, continuous speech audio-visual database. We demonstrated significant classification accuracy gains by each added stage of the proposed algorithm which, when combined, can achieve up to 27% improvement. Overall, we achieved a 60% (49%) visual-only frame-level visemic classification accuracy with (without) use of test set viseme boundaries. In addition, we report improved audio-visual phonetic classification over the use of a single-stage image transform visual front end. Finally, we discuss preliminary speech recognition results.

56 citations
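
As a rough sketch of the three-stage cascade described above (an image-compression transform, an LDA projection over a few concatenated frames, and an MLLT rotation), the code below uses a 2D DCT for stage one and scikit-learn's LDA for stage two; estimating the MLLT matrix is beyond this sketch, so it is taken as given. Window sizes, dimensions, and names are assumptions, not the authors' settings.

```python
import numpy as np
from scipy.fft import dctn
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def stage1_compress(mouth_frame, k=6):
    """Stage 1: image-compression transform (a 2D DCT here), keeping the
    k-by-k block of low-frequency coefficients of one mouth-ROI frame."""
    return dctn(mouth_frame, norm="ortho")[:k, :k].ravel()

def stage2_window(compressed, t, context=7):
    """Stage 2 input: concatenate `context` consecutive compressed frames
    centred on time t (clamped at the sequence ends)."""
    lo = min(max(0, t - context // 2), len(compressed) - context)
    return np.concatenate(compressed[lo:lo + context])

def build_front_end(train_vectors, train_labels, mllt):
    """Fit the LDA projection (stage 2) and wrap the given MLLT rotation
    (stage 3, assumed precomputed) into a single feature transform."""
    lda = LinearDiscriminantAnalysis(n_components=30).fit(train_vectors, train_labels)
    def transform(vector):
        return mllt @ lda.transform(vector[None, :])[0]
    return transform
```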


Proceedings ArticleDOI
15 Nov 2001
TL;DR: A perceptual transformation of the speech spectral envelope is proposed, which is shown to capture the dynamics of articulatory movements and suggests a new way to approach the modeling of synthetic lip motion of a given speaker driven by his/her speech.
Abstract: The results reported in this article are an integral part of a larger project aimed at achieving perceptually realistic animations, including the individualized nuances, of three-dimensional human faces driven by speech. The audiovisual system that has been developed for learning the spatio-temporal relationship between speech acoustics and facial animation is described, including video and speech processing, pattern analysis, and MPEG-4 compliant facial animation for a given speaker. In particular, we propose a perceptual transformation of the speech spectral envelope, which is shown to capture the dynamics of articulatory movements. An efficient nearest-neighbor algorithm is used to predict novel articulatory trajectories from the speech dynamics. The results are very promising and suggest a new way to approach the modeling of synthetic lip motion of a given speaker driven by his/her speech. This would also provide clues toward a more general cross-speaker realistic animation.

39 citations
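
A minimal sketch of the nearest-neighbour prediction step described above, assuming the perceptually transformed speech features and the time-aligned articulatory (facial-animation) trajectories have already been computed; the class name, feature layout, and the simple k-neighbour averaging are illustrative assumptions rather than the authors' exact algorithm.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

class SpeechToArticulation:
    """Predict articulatory (facial-animation) trajectories from acoustic
    feature trajectories by nearest-neighbour lookup in training data."""

    def __init__(self, acoustic_train, articulatory_train, k=5):
        # acoustic_train:     (n_frames, n_acoustic_dims) perceptual speech features
        # articulatory_train: (n_frames, n_fap_dims), time-aligned with them
        self.targets = articulatory_train
        self.index = NearestNeighbors(n_neighbors=k).fit(acoustic_train)

    def predict(self, acoustic_frames):
        """Average the articulatory frames of the k closest acoustic frames."""
        _, idx = self.index.kneighbors(acoustic_frames)
        return self.targets[idx].mean(axis=1)   # (n_query_frames, n_fap_dims)
```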


PatentDOI
TL;DR: In this paper, the authors present a method for the animation of a synthesized model of a human face in relation to an audio signal. The method is not language dependent and provides a very natural animated synthetic model, being based on the simultaneous analysis of voice and facial movements, tracked on real speakers, and on the extraction of suitable visemes.
Abstract: The method permits the animation of a synthesised model of a human face in relation to an audio signal. The method is not language dependent and provides a very natural animated synthetic model, being based on the simultaneous analysis of voice and facial movements, tracked on real speakers, and on the extraction of suitable visemes. The subsequent animation consists of transforming the sequence of visemes corresponding to the phonemes of the driving text into the sequence of movements applied to the model of the human face.

29 citations


PatentDOI
TL;DR: In this article, reference speech information representative of hierarchical-level skipping is prepared in a predetermined speech recognition dictionary so that, when an input corresponding to that reference speech information is recognized, speech recognition is carried out by extracting the part of the speech recognition dictionaries belonging to a lower hierarchical level than the reference speech information being compared.
Abstract: This concerns speech recognition in which reference speech information is extracted from a plurality of speech recognition dictionaries arranged in a hierarchical structure and compared with an input speech, thereby recognizing the speech. Reference speech information representative of hierarchical-level skipping is prepared in a predetermined speech recognition dictionary so that, when an input corresponding to the reference speech information representative of hierarchical-level skipping is recognized, speech recognition is carried out by extracting the part of the speech recognition dictionaries belonging to a lower hierarchical level than the reference speech information being compared.

25 citations
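
One plausible reading of the hierarchical dictionaries with level-skipping entries is sketched below, with text matching standing in for the acoustic comparison. The data structure and the behaviour of the skip entries are assumptions made for illustration, not the patented implementation.

```python
class Dictionary:
    """One speech-recognition dictionary in the hierarchy.

    `entries` maps reference words (stand-ins for reference speech
    information) to the lower-level Dictionary they lead to, or to None
    for leaf entries. `skip_entries` are the special entries that
    represent hierarchical-level skipping.
    """
    def __init__(self, entries, skip_entries=()):
        self.entries = entries
        self.skip_entries = set(skip_entries)

def recognize(root, utterance_words):
    """Match the input word by word against the active dictionary level."""
    active = root
    recognized = []
    for word in utterance_words:
        if word in active.skip_entries and active.entries.get(word) is not None:
            # Level skipping: only the part of the dictionaries belonging
            # to the level below this entry is extracted and searched next.
            active = active.entries[word]
            continue
        if word in active.entries:
            recognized.append(word)
            lower = active.entries[word]
            if lower is not None:
                active = lower
        # Unmatched words are ignored in this simplified sketch.
    return recognized
```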


Patent
16 Feb 2001
TL;DR: In this article, words that are probable as the result of speech recognition are selected on the basis of an acoustic score and a linguistic score, while word selection is also performed on the basis of measures different from the acoustic score, such as the number of phonemes being small, the part of speech being a pre-set one, inclusion in past results of speech recognition, or the linguistic score being not less than a pre-set value.
Abstract: A speech recognition apparatus that improves speech recognition accuracy while preventing resource usage from increasing. Words that are probable as the result of speech recognition are selected on the basis of an acoustic score and a linguistic score, while word selection is also performed on the basis of measures different from the acoustic score, such as the number of phonemes being small, the part of speech being a pre-set one, inclusion in past results of speech recognition, or the linguistic score being not less than a pre-set value. The words so selected are subjected to matching processing.

23 citations
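
A minimal sketch of the described word selection, keeping words that rank highly on the combined acoustic and linguistic score as well as words passing the auxiliary measures listed in the abstract. The candidate record format, thresholds, and favoured parts of speech are illustrative assumptions.

```python
def select_words(candidates, score_threshold, top_n=10, max_phonemes=3,
                 favoured_pos=("particle", "pronoun"), history=frozenset()):
    """Select candidate words for the subsequent matching step.

    `candidates` is a list of dicts with keys 'word', 'acoustic_score',
    'linguistic_score', 'phonemes', 'pos' (an assumed record format).
    Words are kept either because their combined score ranks highly, or
    because of an auxiliary measure: few phonemes, a pre-set part of
    speech, presence in past recognition results, or a linguistic score
    not less than a pre-set value.
    """
    by_score = sorted(candidates,
                      key=lambda c: c["acoustic_score"] + c["linguistic_score"],
                      reverse=True)
    selected = {c["word"]: c for c in by_score[:top_n]}

    for c in candidates:
        keep = (len(c["phonemes"]) <= max_phonemes
                or c["pos"] in favoured_pos
                or c["word"] in history
                or c["linguistic_score"] >= score_threshold)
        if keep:
            selected.setdefault(c["word"], c)
    return list(selected.values())
```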


01 Jan 2001
TL;DR: A coarticulation model is developed that defines visemes as curves instead of a single position and treats a viseme as a dynamic shaping of the vocal tract and not as a static shape.
Abstract: Creating animated speech requires a facial model capable of representing the myriad shapes the human face experiences during speech and a method to produce the correct shape at the correct time. We present a facial model designed to support animated speech. Our model has a highly deformable lip model that is grafted onto the input facial geometry providing the necessary geometric complexity for creating lip shapes and high-quality lip renderings. We provide a highly deformable tongue model that can represent the shapes the tongue experiences during speech. We add teeth, gums, and upper palate geometry to complete the inner mouth. For more realistic movement of the skin we consider the underlying soft and hard tissue. To decrease the processing time we hierarchically deform the facial surface. We also present a method to animate the facial model over time to create animated speech. We use a track-based animation system that has one facial model parameter per track with possibly more than one track per parameter. The tracks contain control points for a curve that describes the value of the parameter over time. We allow many different types and orders of curves that can be combined in different manners. For more realistic speech we develop a coarticulation model that defines visemes as curves instead of a single position. This treats a viseme as a dynamic shaping of the vocal tract and not as a static shape.

23 citations
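
The sketch below illustrates the track idea: one parameter curve evaluated from control points over time, with a viseme expressed as a curve rather than a single target position. Piecewise-linear interpolation stands in for the several curve types and orders the thesis supports; the parameter name and control-point values are made up for illustration.

```python
import bisect

class Track:
    """One animation track: control points (time, value) for a single
    facial-model parameter, evaluated here with piecewise-linear
    interpolation (the thesis allows several curve types and orders)."""
    def __init__(self, control_points):
        self.points = sorted(control_points)     # [(t0, v0), (t1, v1), ...]

    def value(self, t):
        times = [p[0] for p in self.points]
        i = bisect.bisect_right(times, t)
        if i == 0:
            return self.points[0][1]
        if i == len(self.points):
            return self.points[-1][1]
        (t0, v0), (t1, v1) = self.points[i - 1], self.points[i]
        return v0 + (v1 - v0) * (t - t0) / (t1 - t0)

# A viseme defined as a curve rather than a single target shape:
# e.g. a jaw-opening parameter that rises, holds, and relaxes over the phone.
jaw_open_for_viseme = Track([(0.00, 0.1), (0.05, 0.8), (0.12, 0.7), (0.20, 0.1)])
frame_value = jaw_open_for_viseme.value(0.08)
```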


Proceedings ArticleDOI
Tanveer A. Faruquie, Ashish Kapoor, Rohit J. Kate, Nitendra Rajput, L. V. Subramaniam
22 Aug 2001
TL;DR: A morphing based automated audio driven facial animation system where a face image is animated with full lip synchronization and expression and new viseme-expression combinations are synthesized to be able to generate animations with new facial expressions.
Abstract: In this paper, we demonstrate a morphing based automated audio driven facial animation system. Based on an incoming audio stream, a face image is animated with full lip synchronization and expression. An animation sequence using optical flow between visemes is constructed, given an incoming audio stream and still pictures of a face speaking different visemes. Rules are formulated based on coarticulation and the duration of a viseme to control the continuity in terms of shape and extent of lip opening. In addition, new viseme-expression combinations are synthesized to be able to generate animations with new facial expressions. Finally, various applications of this system are discussed in the context of creating audio-visual reality.

11 citations
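
A minimal sketch of synthesizing in-between frames from two viseme images by warping along dense optical flow and cross-dissolving, a common approximation of the morphing step described above; the coarticulation and duration rules from the paper are not reproduced, and the Farneback parameters are generic defaults rather than the authors' choices.

```python
import cv2
import numpy as np

def morph(img_a, img_b, alpha):
    """Synthesize an in-between frame between two viseme images by warping
    each toward the other along dense optical flow and cross-dissolving."""
    gray_a = cv2.cvtColor(img_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(img_b, cv2.COLOR_BGR2GRAY)
    flow_ab = cv2.calcOpticalFlowFarneback(gray_a, gray_b, None,
                                           0.5, 3, 15, 3, 5, 1.2, 0)
    flow_ba = cv2.calcOpticalFlowFarneback(gray_b, gray_a, None,
                                           0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = gray_a.shape
    grid_x, grid_y = np.meshgrid(np.arange(w, dtype=np.float32),
                                 np.arange(h, dtype=np.float32))
    # Warp A toward B: at alpha=1 this samples A where B's content came from.
    warped_a = cv2.remap(img_a,
                         grid_x + alpha * flow_ba[..., 0],
                         grid_y + alpha * flow_ba[..., 1],
                         cv2.INTER_LINEAR)
    # Warp B toward A symmetrically.
    warped_b = cv2.remap(img_b,
                         grid_x + (1 - alpha) * flow_ab[..., 0],
                         grid_y + (1 - alpha) * flow_ab[..., 1],
                         cv2.INTER_LINEAR)
    return cv2.addWeighted(warped_a, 1 - alpha, warped_b, alpha, 0)

# Example: a ten-frame transition between two still viseme pictures.
# transition = [morph(img_m, img_a, t / 9.0) for t in range(10)]
```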


01 Jan 2001
TL;DR: In this paper, a 3D reconstruction of a speaking face is produced for each video frame, and a topological mask of the lower half of the face is fitted to the motion.
Abstract: We are all experts in the perception and interpretation of faces and their dynamics. This makes facial animation a particularly demanding area of graphics. Increasingly, computer vision is brought to bear and 3D models and their motions are learned from observations. The paper subscribes to this strand for the 3D modeling of human speech. The approach follows a kind of bootstrap procedure. First, 3D shape statistics are learned from faces with a few markers. A 3D reconstruction of a speaking face is produced for each video frame. A topological mask of the lower half of the face is fitted to the motion. The 3D shape statistics are extracted and principal component analysis (PCA) reduces the dimension of the mask space. The final speech tracker can work without markers, as it is only allowed to roam this constrained space of masks. Upon the representation of the different visemes in this space, speech or text can be used as input for animation.

Journal ArticleDOI
TL;DR: A new and efficient method for facial expression generation on cloned synthetic head models that has real-time performance, is less computationally expensive than physically based models, and has greater anatomical correspondence than rational free-form deformation or spline-based techniques.
Abstract: This paper describes a new and efficient method for facial expression generation on cloned synthetic head models. The system uses abstract facial muscles called action units (AUs) based on both anatomical muscles and the facial action coding system. The facial expression generation method has real-time performance, is less computationally expensive than physically based models, and has greater anatomical correspondence than rational free-form deformation or spline-based techniques. Automatic cloning of a real human head is done by adapting a generic facial and head mesh to Cyberware laser scanned data. The conformation of the generic head to the individual data and the fitting of texture onto it are based on a fully automatic feature extraction procedure. Individual facial animation parameters are also automatically estimated during the conformation process. The entire animation system is hierarchical; emotions and visemes (the visual mouth shapes that occur during speech) are defined in terms of the AUs, and higher-level gestures are defined in terms of AUs, emotions, and visemes as well as the temporal relationships between them. The main emphasis of the paper is on the abstract muscle model, along with limited discussion of the automatic cloning process and higher-level animation control aspects.
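
The hierarchical structure described above (visemes and emotions defined in terms of action units, and gestures in terms of visemes, emotions, and timing) might be represented along the lines of the sketch below. The AU names, weights, and timing format are invented for illustration and do not come from the paper.

```python
# Hypothetical AU names and weights; only the hierarchy (gestures built from
# emotions/visemes, which are built from action units) follows the paper.
VISEMES = {
    "viseme_a": {"jaw_drop": 0.8, "upper_lip_raiser": 0.2},
    "viseme_u": {"lip_pucker": 0.9, "jaw_drop": 0.3},
}
EMOTIONS = {
    "smile": {"lip_corner_puller": 0.7},
}
GESTURES = {
    # Higher-level gestures: (unit name, start time, end time) in seconds.
    "greeting": [("smile", 0.0, 1.0), ("viseme_a", 0.2, 0.4)],
}

def resolve(unit_name, intensity=1.0):
    """Flatten a viseme or emotion into scaled action-unit activations."""
    definition = VISEMES.get(unit_name) or EMOTIONS.get(unit_name) or {}
    return {au: w * intensity for au, w in definition.items()}

def gesture_at(gesture_name, t):
    """Active AUs of a gesture at time t, merging overlapping units."""
    aus = {}
    for unit, start, end in GESTURES[gesture_name]:
        if start <= t <= end:
            for au, w in resolve(unit).items():
                aus[au] = max(aus.get(au, 0.0), w)
    return aus
```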

Proceedings Article
07 May 2001
TL;DR: The first results from applying a recently proposed novel algorithm for the robust and reliable automatic extraction of lip feature points to an audio-video speech data corpus show that there is a correlation between width and height of the mouth opening as well as between the protrusion parameters of upper and lower lips.
Abstract: We present the first results from applying a recently proposed novel algorithm for the robust and reliable automatic extraction of lip feature points to an audio-video speech data corpus. This corpus comprises 10 native speakers uttering sequences that cover the range of phonemes and visemes in Australian English. The lip-tracking algorithm is based on stereo vision which has the advantage of measurements being in real-world (3D) coordinates, instead of image (2D) coordinates. Certain lip feature points on the inner lip contour such as the lip corners and the mid-points of upper and lower lip are automatically tracked. Parameters describing the shape of the mouth are derived from these points. The results obtained so far show that there is a correlation between width and height of the mouth opening as well as between the protrusion parameters of upper and lower lips.
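
A small sketch of deriving mouth-shape parameters from the four tracked inner-lip points and measuring the reported correlations; the axis conventions and function names are assumptions, and the actual stereo lip-tracking algorithm is not reproduced.

```python
import numpy as np

def mouth_parameters(left, right, upper_mid, lower_mid):
    """Shape parameters from the four tracked inner-lip points (3D, in
    real-world coordinates): width, opening height, and the protrusion
    (depth) of the upper and lower mid-lip points. Axis conventions are
    assumptions: x across the face, y vertical, z toward the camera."""
    width = np.linalg.norm(np.asarray(right) - np.asarray(left))
    height = abs(upper_mid[1] - lower_mid[1])
    return width, height, upper_mid[2], lower_mid[2]

def lip_correlations(frames):
    """Pearson correlations across an utterance: width vs. height of the
    mouth opening, and upper vs. lower lip protrusion. `frames` is a list
    of per-frame (left, right, upper_mid, lower_mid) tuples."""
    params = np.array([mouth_parameters(*f) for f in frames])
    width_height = np.corrcoef(params[:, 0], params[:, 1])[0, 1]
    protrusion = np.corrcoef(params[:, 2], params[:, 3])[0, 1]
    return width_height, protrusion
```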


01 Jan 2001
TL;DR: An approach for speech animation by smooth viseme transition is described, based on the Principal Component Analysis of facial capture data extracted using an optical tracking system; the resulting viseme space is used to generate convincing speech animation and to make smooth transitions from one viseme to another.
Abstract: For realistic speech animation, smooth viseme and expression transitions, blending, and co-articulation have so far been studied and experimented with widely. In this paper, we describe an approach for speech animation by smooth viseme transition. Though this method is not a substitute for modeling the co-articulation phenomenon, it certainly takes us a step nearer to realistic speech animation. The approach is devised as a result of the Principal Component Analysis of facial capture data extracted using an optical tracking system. The system extracts the 3D positions of markers attached at specific feature point locations on the face to capture the facial movements of a talking person. We form a vector space representation by using the Principal Component Analysis of this data. We call this space the "viseme space". We use the viseme space to generate convincing speech animation and to make smooth transitions from one viseme to another. As the analysis and the resulting viseme space automatically consider the dynamics of and the deformation constraints on the facial movements, the resulting facial animation is very realistic.


01 Jan 2001
TL;DR: In this article, Hidden Markov Models (HMMs) are estimated for each viseme present in stored video data, and models are generated for each triseme (a viseme plus the previous and following visemes) in the training set.
Abstract: This research presents a new approach for estimating control points used in visual speech synthesis. First, Hidden Markov Models (HMMs) are estimated for each viseme present in stored video data. Second, models are generated for each triseme (a viseme plus the previous and following visemes) in the training set. Next, a decision tree clusters and relates states in the HMMs that are similar in a contextual and statistical sense. The tree also estimates HMMs for trisemes not present in the stored video data. Finally, the HMMs are used to generate sequences of visual speech control points for trisemes not occurring in the stored data. Statistical analysis indicates that the mean squared error between the desired and estimated control point locations is lowest when the process is conducted with certain HMMs trained using short-duration dynamic features, a high log-likelihood threshold, and a low outlier threshold. Also, comparisons of mouth shapes generated from the artificially generated control points and the control points estimated from video not used to train the HMMs indicate that the process estimates accurate control points. The research presented here thus establishes a practical method for improving audio-driven visual speech synthesis quality.
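
A rough sketch of the triseme-HMM idea using the hmmlearn package (assumed available): one small Gaussian HMM is fitted per observed triseme, and control points are sampled from the model. The decision-tree state clustering the paper uses for unseen trisemes is replaced here by a crude back-off to a model sharing the central viseme, so this is only an approximation of the described process.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM   # assumed available; any HMM library works

def train_triseme_models(sequences):
    """`sequences` maps a triseme key such as ('p', 'a', 't') to a list of
    control-point trajectories, each an (n_frames, n_points * 2) array
    (assumed layout). One small HMM is fitted per observed triseme."""
    models = {}
    for triseme, trajs in sequences.items():
        X = np.vstack(trajs)
        lengths = [len(t) for t in trajs]
        models[triseme] = GaussianHMM(n_components=3,
                                      covariance_type="diag",
                                      n_iter=50).fit(X, lengths)
    return models

def control_points(models, triseme, n_frames):
    """Generate a control-point sequence for a triseme. Unseen trisemes
    here back off to any model sharing the central viseme; the paper
    instead shares HMM states via decision-tree clustering."""
    model = models.get(triseme)
    if model is None:
        model = next(m for (l, c, r), m in models.items() if c == triseme[1])
    points, _ = model.sample(n_frames)
    return points
```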

Patent
20 Jun 2001
TL;DR: In this article, a method for eliminating synchronisation errors using speech recognition is proposed, which identifies visemes, or visual cues which are indicative of articulatory type, in the video content, and identifies phones and their articulatory types in the audio content.
Abstract: A method for eliminating synchronisation errors using speech recognition. Using separate audio and visual speech recognition techniques, the method identifies visemes (110), or visual cues which are indicative of articulatory type, in the video content, and identifies phones (120) and their articulatory types in the audio content. Once the two recognition techniques have been applied, the outputs are compared (130) to determine the relative alignment and, if not aligned, a synchronisation algorithm is applied to time-adjust one or both of the audio and the visual streams in order to achieve synchronisation. Facial features, such as mouth movements, are used to provide visual cues in the video content.
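
A minimal sketch of the comparison step: given per-frame viseme labels from the video and per-frame phone labels from the audio (assumed here to be already resampled to the video frame rate), search for the shift that best aligns the two streams. The patent's actual synchronisation algorithm is not reproduced; the names and the matching criterion are assumptions.

```python
def estimate_offset(viseme_track, phone_track, viseme_of, max_shift=30):
    """Estimate the audio/video misalignment in frames.

    viseme_track: per-video-frame viseme labels from visual recognition.
    phone_track:  per-frame phone labels from audio recognition, assumed
                  resampled to the video frame rate.
    viseme_of:    mapping from a phone to its articulatory viseme class.
    Returns the shift (in frames) at which the two streams agree most,
    which a synchronisation step would then apply to one stream.
    """
    n = min(len(viseme_track), len(phone_track))
    best_shift, best_matches = 0, -1
    for shift in range(-max_shift, max_shift + 1):
        matches = sum(
            1 for t in range(n)
            if 0 <= t + shift < n
            and viseme_track[t] == viseme_of.get(phone_track[t + shift])
        )
        if matches > best_matches:
            best_shift, best_matches = shift, matches
    return best_shift
```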

Proceedings Article
01 Jan 2001
TL;DR: A technique for the extraction of the five main visemes produced in natural speech for German is presented; the method belongs to the LDA (Linear Discriminant Analysis) family, and using many features in the recognition maximizes the recognition rate.
Abstract: In this paper, we present a technique for the extraction of the five main visemes produced in natural speech for German. The method belongs to the LDA (Linear Discriminant Analysis) family. The intensity, the edges, and the line segments are used to locate the lips automatically and for viseme classification. Using many features in the recognition maximizes the recognition rate. The corners of the mouth are used in the case of small rotations and scale changes. An experiment has been carried out with different people to understand which parts of speech human beings use; the participants grouped the phonemes into five different visemes. The number of distinguished visemes is not the same for each speaker, and every speaker expresses speech with different visemes. A good recognition rate has been achieved on different speakers.
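
A minimal sketch of the LDA-based viseme classification described above, using scikit-learn's LinearDiscriminantAnalysis over per-frame lip features (the intensity, edge, and line-segment measurements are assumed to be precomputed); the feature layout and label set are assumptions, not the authors' exact representation.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def train_viseme_classifier(feature_vectors, labels):
    """Fit an LDA classifier over per-frame lip features.

    feature_vectors: (n_frames, n_features) array of intensity, edge, and
    line-segment measurements around the located lips (assumed layout).
    labels: the viseme class of each frame, e.g. one of five German viseme
    classes as described in the paper.
    """
    clf = LinearDiscriminantAnalysis()
    clf.fit(np.asarray(feature_vectors), labels)
    return clf

def classify_frame(clf, features):
    """Return the predicted viseme class for one frame's feature vector."""
    return clf.predict(np.asarray(features).reshape(1, -1))[0]
```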

Proceedings ArticleDOI
07 May 2001
TL;DR: A method for synthesizing photo-realistic visual speech using a parametric model based on quadtree splines that can minimize the number of motion parameters for a given synthesis error is presented.
Abstract: In this paper, we present a method for synthesizing photo-realistic visual speech using a parametric model based on quadtree splines. In an image-based visual speech synthesis system, visemes are used for generating an arbitrary new image sequence. The images between visemes are usually synthesized using a certain mapping. Such a mapping can be characterized by motion parameters estimated from the training data. With the quadtree splines, we can minimize the number of motion parameters for a given synthesis error. The feasibility of the proposed method is demonstrated by experiments.

Book ChapterDOI
12 Sep 2001
TL;DR: The intensity, the edges, and the line segments are used to locate the lips automatically, and to discriminate between the desired viseme classes in German spoken language analysis from images.
Abstract: In this paper, we present a technique for the extraction of the five main visemes for German spoken language analysis from images. The intensity, the edges, and the line segments are used to locate the lips automatically and to discriminate between the desired viseme classes. A good recognition rate has been achieved on different speakers.

01 Jan 2001
TL;DR: A robust, accurate, and inexpensive approach to estimate human facial motion from mirror-reflected videos, which takes advantage of the relationship between the original and mirrored images and can be more robust than most general-purpose stereo-vision approaches for motion analysis of mirror-reflected videos.
Abstract: The goal of our project is to collect a dataset of 3D facial motion parameters for the synthesis of a talking head. However, the capture of human facial motion is usually an expensive task in related research, since special devices such as optical or electronic trackers must be applied. In this paper, we propose a robust, accurate, and inexpensive approach to estimate human facial motion from mirror-reflected videos. The approach takes advantage of the relationship between the original and mirrored images, and can be more robust than most general-purpose stereo-vision approaches for motion analysis of mirror-reflected videos. A preliminary dataset of MPEG-4 facial motion parameters and French visemes, together with voice data, has been acquired, and the estimated data have also been applied to our facial animation system.

01 Jan 2001
TL;DR: Comparisons of mouth shapes generated from the artificially generated control points and the control points estimated from video not used to train the HMMs indicate that the process estimated accurate control points for the trisemes tested.
Abstract: This paper addresses a problem often encountered when estimating control points used in visual speech synthesis. First, Hidden Markov Models (HMMs) are estimated for each viseme present in stored video data. Second, models are generated for each triseme (a viseme in context with the previous and following visemes) in the training set. Next, a decision tree is used to cluster and relate states in the HMMs that are similar in a contextual and statistical sense. The tree is also used to estimate HMMs for any trisemes that are not present in the stored video data when control points for such trisemes are required for synthesizing the lip motion for a sentence. Finally, the HMMs are used to generate sequences of visual speech control points for those trisemes not occurring in the stored data. Comparisons of mouth shapes generated from the artificially generated control points and the control points estimated from video not used to train the HMMs indicate that the process estimated accurate control points for the trisemes tested. This paper thus establishes a useful method for synthesizing realistic audio-synchronized video facial features.

Journal ArticleDOI
TL;DR: This paper provides a method using polymorphing to incorporate co-articulation during speech in VTalk, a system for synthesizing text-to-audiovisual speech (TTAVS), where the input text is converted into an audiovisual speech stream incorporating head and eye movements.
Abstract: This paper describes VTalk, a system for synthesizing text-to-audiovisual speech (TTAVS), where the input text is converted into an audiovisual speech stream incorporating head and eye movements. It is an image-based system, where the face is modeled using a set of images of a human subject. A concatenation of visemes (the corresponding lip shapes for phonemes) can be used for modeling visual speech. A smooth transition between visemes is achieved using morphing along the correspondence between the visemes obtained by optical flow. The phonemes and timing parameters given by the text-to-speech synthesizer determine the corresponding visemes to be used for the synthesis of the visual stream. We provide a method using polymorphing to incorporate co-articulation during speech in our TTAVS. We also include nonverbal mechanisms in visual speech communication, such as eye blinks and head nods, which make the talking head model more lifelike. For eye movement, a simple mask-based approach is employed, and view morphing is used to generate the intermediate images for the movement of the head. All these features are integrated into a single system, which takes text, head, and eye movement parameters as input and produces the complete audiovisual stream.