
Showing papers on "Viseme" published in 2016


Journal ArticleDOI
11 Jul 2016
TL;DR: A system that, given an input audio soundtrack and speech transcript, automatically generates expressive lip-synchronized facial animation that is amenable to further artistic refinement, and that is comparable with both performance capture and professional animator output is presented.
Abstract: The rich signals we extract from facial expressions impose high expectations on the science and art of facial animation. While the advent of high-resolution performance capture has greatly improved realism, the utility of procedural animation warrants a prominent place in the facial animation workflow. We present a system that, given an input audio soundtrack and speech transcript, automatically generates expressive lip-synchronized facial animation that is amenable to further artistic refinement, and that is comparable with both performance capture and professional animator output. Because of the diversity of ways we produce sound, the mapping from phonemes to visual depictions as visemes is many-valued. We draw from psycholinguistics to capture this variation using two visually distinct anatomical actions: Jaw and Lip, where sound is primarily controlled by jaw articulation and lower-face muscles, respectively. We describe the construction of a transferable template JALI 3D facial rig, built upon the popular facial muscle action unit representation FACS. We show that acoustic properties in a speech signal map naturally to the dynamic degree of jaw and lip in visual speech. We provide an array of compelling animation clips, compare against performance capture and existing procedural animation, and report on a brief user study.
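Below is a minimal, hypothetical sketch of the general idea of driving separate jaw and lip intensities from simple acoustic measures (frame energy and spectral centroid). The names and mappings are illustrative assumptions only and do not correspond to the actual JALI rig controls or the paper's signal analysis.

```python
# Hypothetical sketch: derive rough "jaw" and "lip" animation curves from audio.
# Not the JALI method; loudness-driven jaw and a crude spectral lip proxy only.
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n)])

def jaw_lip_curves(x, sr=16000):
    frames = frame_signal(x)
    energy = np.sqrt((frames ** 2).mean(axis=1))            # loudness ~ jaw opening
    spectrum = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(frames.shape[1], 1.0 / sr)
    centroid = (spectrum * freqs).sum(axis=1) / (spectrum.sum(axis=1) + 1e-8)
    jaw = energy / (energy.max() + 1e-8)                     # 0..1 jaw activation
    lip = 1.0 - centroid / (centroid.max() + 1e-8)           # crude lip-rounding proxy
    return jaw, lip

# jaw, lip = jaw_lip_curves(waveform)  # feed as per-frame animation curves to a rig
```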

118 citations


Proceedings ArticleDOI
20 Mar 2016
TL;DR: It is shown that error-rates for speaker-independent lip-reading can be very significantly reduced and that there is no need to map phonemes to visemes for context-dependent visual speech transcription.
Abstract: Recent improvements in tracking and feature extraction mean that speaker-dependent lip-reading of continuous speech using a medium size vocabulary (around 1000 words) is realistic. However, the recognition of previously unseen speakers has been found to be a very challenging task, because of the large variation in lip-shapes across speakers and the lack of large, tracked databases of visual features, which are very expensive to produce. By adapting a technique that is established in speech recognition but has not previously been used in lip-reading, we show that error-rates for speaker-independent lip-reading can be very significantly reduced. Furthermore, we show that error-rates can be even further reduced by the additional use of Deep Neural Networks (DNN). We also find that there is no need to map phonemes to visemes for context-dependent visual speech transcription.
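As a loose illustration of the general idea of reducing speaker variation before classification, the sketch below normalises visual features per speaker. The paper's actual adaptation technique is not specified here, so this is an assumed stand-in, not the authors' method.

```python
# Assumed illustration: per-speaker mean/variance normalisation of visual features
# before training a speaker-independent classifier. Not the paper's adaptation method.
import numpy as np

def speaker_normalise(features_by_speaker):
    """features_by_speaker: dict speaker_id -> (n_frames, dim) array."""
    out = {}
    for spk, feats in features_by_speaker.items():
        mu, sigma = feats.mean(axis=0), feats.std(axis=0) + 1e-8
        out[spk] = (feats - mu) / sigma   # remove per-speaker shape/scale bias
    return out
```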

63 citations


Patent
23 Nov 2016
TL;DR: In this paper, the entire pipeline of hand-engineered components is replaced with neural networks, and the end-to-end learning allows handling a diverse variety of speech, including noisy environments, accents, and different languages.
Abstract: Embodiments of end-to-end deep learning systems and methods are disclosed to recognize speech of vastly different languages, such as English or Mandarin Chinese. In embodiments, the entire pipeline of hand-engineered components is replaced with neural networks, and the end-to-end learning allows handling a diverse variety of speech, including noisy environments, accents, and different languages. Using a trained embodiment and an embodiment of a batch dispatch technique with GPUs in a data center, an end-to-end deep learning system can be inexpensively deployed in an online setting, delivering low latency when serving users at scale.
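The sketch below shows a generic end-to-end acoustic model trained with a CTC criterion in PyTorch, in the spirit of replacing hand-engineered pipelines with a single network. It is an assumed illustration with placeholder sizes, not the system described in the patent.

```python
# Generic end-to-end speech model with a CTC loss (illustrative sizes only).
import torch
import torch.nn as nn

class SpeechRNN(nn.Module):
    def __init__(self, n_mels=80, hidden=256, n_chars=29):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, num_layers=3,
                          bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_chars)      # characters + CTC blank

    def forward(self, x):                               # x: (batch, time, n_mels)
        h, _ = self.rnn(x)
        return self.out(h).log_softmax(dim=-1)

model = SpeechRNN()
ctc = nn.CTCLoss(blank=0)
x = torch.randn(4, 200, 80)                             # dummy spectrogram batch
logp = model(x).transpose(0, 1)                         # (time, batch, classes) for CTCLoss
targets = torch.randint(1, 29, (4, 30))
loss = ctc(logp, targets,
           torch.full((4,), 200, dtype=torch.long),
           torch.full((4,), 30, dtype=torch.long))
```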

46 citations


Book ChapterDOI
20 Nov 2016
TL;DR: This paper tackles ALR as a classification task using an end-to-end neural network based on a convolutional neural network and long short-term memory architecture, and shows that additional view information helps to improve the performance of ALR with a neural network architecture.
Abstract: It is well known that automatic lip-reading (ALR), also known as visual speech recognition (VSR), enhances the performance of speech recognition in a noisy environment and also has applications in its own right. However, ALR is a challenging task due to various lip shapes and the ambiguity of visemes (the basic units of visual speech information). In this paper, we tackle ALR as a classification task using an end-to-end neural network based on a convolutional neural network and long short-term memory architecture. We conduct single-, cross-, and multi-view experiments in a speaker-independent setting with various network configurations to integrate the multi-view data. We achieve 77.9%, 83.8%, and 78.6% classification accuracy on average for the single-, cross-, and multi-view settings, respectively. This result is better than the best score (76%) of the preliminary single-view results given by the ACCV 2016 workshop on multi-view lip-reading/audio-visual challenges. It also shows that additional view information helps to improve the performance of ALR with a neural network architecture.
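A hedged sketch of a CNN + LSTM clip classifier of the kind described, assuming fixed-length clips of grayscale 64x64 mouth images; the layer sizes and clip format are assumptions, not the authors' configuration.

```python
# Illustrative CNN + LSTM classifier over mouth-region image sequences.
import torch
import torch.nn as nn

class LipClassifier(nn.Module):
    def __init__(self, n_words=10):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.lstm = nn.LSTM(64 * 16 * 16, 256, batch_first=True)
        self.fc = nn.Linear(256, n_words)

    def forward(self, clips):                 # clips: (batch, time, 1, 64, 64)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.reshape(b * t, 1, 64, 64)).reshape(b, t, -1)
        h, _ = self.lstm(feats)
        return self.fc(h[:, -1])              # classify from the last time step

logits = LipClassifier()(torch.randn(2, 29, 1, 64, 64))   # dummy batch of clips
```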

46 citations


Proceedings ArticleDOI
20 Mar 2016
TL;DR: This paper presented a two-pass method of training phoneme classifiers which used previously trained visemes in the first pass and showed classification performance which significantly improved on previous lip-reading results.
Abstract: To undertake machine lip-reading, we try to recognise speech from a visual signal. Current work often uses viseme classification supported by language models with varying degrees of success. A few recent works suggest phoneme classification, in the right circumstances, can outperform viseme classification. In this work we present a novel two-pass method of training phoneme classifiers which uses previously trained visemes in the first pass. With our new training algorithm, we show classification performance which significantly improves on previous lip-reading results.
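The sketch below illustrates one possible reading of a two-pass scheme: train a viseme classifier first, then feed its posteriors as extra features to phoneme classifiers. It assumes a phoneme_to_viseme dictionary is available and is a simplified stand-in, not the paper's training algorithm.

```python
# Simplified two-pass illustration: viseme classifier first, then phoneme classifier
# that also sees the viseme posteriors. Not the authors' exact algorithm.
import numpy as np
from sklearn.linear_model import LogisticRegression

def two_pass_train(X, phoneme_labels, phoneme_to_viseme):
    viseme_labels = np.array([phoneme_to_viseme[p] for p in phoneme_labels])
    pass1 = LogisticRegression(max_iter=1000).fit(X, viseme_labels)
    X_aug = np.hstack([X, pass1.predict_proba(X)])       # append viseme posteriors
    pass2 = LogisticRegression(max_iter=1000).fit(X_aug, phoneme_labels)
    return pass1, pass2
```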

41 citations


Journal ArticleDOI
TL;DR: An approach to ALR is proposed that acknowledges that information is missing from the visual signal but assumes it is substituted or deleted in a systematic way that can be modelled; a system is described that learns such a model and then incorporates it into decoding, realised as a cascade of weighted finite-state transducers.

32 citations


Book ChapterDOI
28 Nov 2016
TL;DR: This paper presents a data-driven approach to estimating audio speech acoustics using only temporal visual information, without relying on linguistic features such as phonemes and visemes, showing that given a sequence of prior visual frames, a reasonably accurate estimate of the corresponding audio frame can be produced.
Abstract: The concept of using visual information as part of audio speech processing has been of significant recent interest. This paper presents a data-driven approach to estimating audio speech acoustics using only temporal visual information, without relying on linguistic features such as phonemes and visemes. Audio (log filterbank) and visual (2D-DCT) features are extracted, and various configurations of MLPs and datasets are used to identify the best results, showing that given a sequence of prior visual frames, a reasonably accurate estimate of the corresponding audio frame can be produced.
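A minimal sketch of the regression setup: stack K prior visual feature frames and regress the corresponding log-filterbank audio frame with an MLP. The window size and layer widths are assumptions, not the paper's settings.

```python
# Illustrative visual-to-audio regression: K prior visual frames -> next audio frame.
import numpy as np
from sklearn.neural_network import MLPRegressor

def stack_context(visual, K=5):
    # visual: (n_frames, vis_dim) -> (n_frames - K, K * vis_dim)
    return np.hstack([visual[i:len(visual) - K + i] for i in range(K)])

def train_visual_to_audio(visual, audio_fbank, K=5):
    X = stack_context(visual, K)
    y = audio_fbank[K:]                      # predict the frame after the window
    return MLPRegressor(hidden_layer_sizes=(256, 256), max_iter=500).fit(X, y)
```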

12 citations


Proceedings ArticleDOI
11 Mar 2016
TL;DR: The proposed algorithm is applied to the recognition of ten Hindi words, can easily be extended to include more words and other languages, and is fast as well as robust to various occlusions.
Abstract: Lip reading, also known as visual speech processing, is the recognition of spoken words based on the pattern of lip movements while speaking. Audio speech recognition systems have been popular for many decades and have achieved great success, but recently visual speech recognition has drawn the interest of researchers towards lip reading. Lip reading has the added advantages of high accuracy and independence from acoustic noise. This paper presents an algorithm for automatic lip reading. The algorithm consists of two main steps: feature extraction and classification for word recognition. Lip information is extracted using lip geometric and lip appearance features. Word recognition is performed by a Learning Vector Quantization neural network. The accuracy achieved by the proposed approach is 97%. The proposed algorithm is applied to the recognition of ten Hindi words and can easily be extended to include more words and other languages. The presented approach will be helpful for people with hearing or speech impairments to communicate with humans or machines. The proposed algorithm is fast as well as robust to various occlusions.
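A minimal LVQ1 sketch (one prototype per class, simple winner update) to illustrate the kind of classifier used; the feature extraction and parameters here are illustrative, not the paper's.

```python
# Minimal LVQ1: attract the winning prototype on correct class, repel otherwise.
import numpy as np

def train_lvq1(X, y, lr=0.05, epochs=20):
    classes = np.unique(y)
    protos = np.array([X[y == c].mean(axis=0) for c in classes])   # init prototypes
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            w = np.argmin(np.linalg.norm(protos - xi, axis=1))     # winning prototype
            sign = 1.0 if classes[w] == yi else -1.0
            protos[w] += sign * lr * (xi - protos[w])
    return classes, protos

def predict_lvq1(classes, protos, X):
    dists = np.linalg.norm(protos[None] - X[:, None], axis=2)      # (N, n_classes)
    return classes[np.argmin(dists, axis=1)]
```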

9 citations


Proceedings ArticleDOI
21 Jul 2016
TL;DR: This paper surveys the field of lip reading, provides a detailed discussion of the trade-offs between various approaches, and gives a reverse-chronological, topic-wise listing of developments in lip-reading systems in recent years.
Abstract: Automated lip reading is the process of converting movements of the lips, face and tongue to speech in real time with enhanced accuracy. Although the performance of lip-reading systems is still far below that of audio speech recognition, recent developments in processor technology, the massive growth and ubiquity of computing devices, and increased research in this field have reduced the ambiguities of lip-based language, bringing free speech-to-text conversion closer. This paper surveys the field of lip reading and provides a detailed discussion of the trade-offs between various approaches. It gives a reverse-chronological, topic-wise listing of developments in lip-reading systems in recent years. With advances in computer vision and pattern recognition tools, the efficacy of real-time, effective conversion has increased. The major goal of this paper is to provide a comprehensive reference source for researchers involved in lip reading, not just for academia but for anyone interested in this field regardless of particular application area.

9 citations


Posted Content
TL;DR: This paper presents an end-to-end audiovisual speech recognizer (AVSR), based on recurrent neural networks (RNN) with a connectionist temporal classification (CTC) loss function, which outperform previously published approaches on phone accuracy in clean and noisy conditions.
Abstract: Speech is one of the most effective ways of communication among humans. Even though audio is the most common way of transmitting speech, very important information can be found in other modalities, such as vision. Vision is particularly useful when the acoustic signal is corrupted. Multi-modal speech recognition however has not yet found wide-spread use, mostly because the temporal alignment and fusion of the different information sources is challenging. This paper presents an end-to-end audiovisual speech recognizer (AVSR), based on recurrent neural networks (RNN) with a connectionist temporal classification (CTC) loss function. CTC creates sparse "peaky" output activations, and we analyze the differences in the alignments of output targets (phonemes or visemes) between audio-only, video-only, and audio-visual feature representations. We present the first such experiments on the large vocabulary IBM ViaVoice database, which outperform previously published approaches on phone accuracy in clean and noisy conditions.
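A simple feature-fusion sketch: upsample the visual stream to the audio frame rate and concatenate the two feature streams before a CTC-trained network, as in a generic audio-visual condition. The frame rates are assumptions, not the IBM ViaVoice setup.

```python
# Assumed illustration of early audio-visual feature fusion before sequence training.
import numpy as np

def fuse_audio_visual(audio_feats, visual_feats):
    # assume audio at 100 fps and video at 25 fps -> repeat each video frame 4 times
    visual_up = np.repeat(visual_feats, 4, axis=0)
    T = min(len(audio_feats), len(visual_up))
    return np.hstack([audio_feats[:T], visual_up[:T]])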

8 citations


Proceedings ArticleDOI
01 Oct 2016
TL;DR: The development of speech corpora for different Malayalam speech recognition tasks is presented, along with the creation of a pronunciation dictionary and transcription file.
Abstract: A speech corpus is the backbone of an automatic speech recognition system. This paper presents the development of speech corpora for different Malayalam speech recognition tasks. A pronunciation dictionary and transcription file, the other two essential resources for building a speech recognizer, are also created. A speech corpus of about 18 hours has been collected for the different recognition tasks.

Journal ArticleDOI
TL;DR: This doctoral research aims to design and evaluate a visualisation technique that displays textual representations of a speaker's phonemes to a speech-reader, so speech-readers should be able to disambiguate confusing viseme-to-phoneme mappings without shifting their focus from the speaker's face.
Abstract: Speech-reading is an invaluable technique for people with hearing loss or those in adverse listening conditions (e.g., in a noisy restaurant, near children playing loudly). However, speech-reading is often difficult because identical mouth shapes (visemes) can produce several speech sounds (phonemes); there is a one-to-many mapping from visemes to phonemes. This decreases comprehension, causing confusion and frustration during conversation. My doctoral research aims to design and evaluate a visualisation technique that displays textual representations of a speaker's phonemes to a speech-reader. By combining my visualisation with their pre-existing speech-reading ability, speech-readers should be able to disambiguate confusing viseme-to-phoneme mappings without shifting their focus from the speaker's face. This will result in an improved level of comprehension, supporting natural conversation.


Proceedings ArticleDOI
01 Aug 2016
TL;DR: Experimental results show that small changes in the allophone set may provide better speech recognition quality than the phoneme-based approach.
Abstract: The article presents studies on automatic whispery speech recognition. In the research, a new corpus of whispery speech was used. It was checked whether an extended set of articulatory units (allophones used instead of phonemes) improves the quality of whispery speech recognition. Experimental results show that small changes in the allophone set may provide better speech recognition quality than the phoneme-based approach. The authors also made available a trained g2p (grapheme-to-phoneme) model of the Polish language for the Sequitur toolkit.

Posted Content
TL;DR: This paper is the first to show that 3D feature extraction methods can be applied to continuous sequence recognition tasks despite the unknown start positions and durations of each phoneme, and confirms that 3D feature extraction methods improve accuracy compared to 2D feature extraction methods.
Abstract: Visual speech recognition aims to identify the sequence of phonemes from continuous speech. Unlike the traditional approach of using 2D image feature extraction methods to derive features of each video frame separately, this paper proposes a new approach using a 3D (spatio-temporal) Discrete Cosine Transform to extract features of each feasible sub-sequence of an input video, which are subsequently classified individually using Support Vector Machines and combined to find the most likely phoneme sequence using a tailor-made Hidden Markov Model. The algorithm is trained and tested on the VidTimit database to recognise sequences of phonemes as well as visemes (visual speech units). Furthermore, the system is extended with training on phoneme or viseme pairs (biphones) to counteract the human speech ambiguity of co-articulation. The test set accuracy for the recognition of phoneme sequences is 20%, and the accuracy for viseme sequences is 39%. Both results improve the best values reported in other papers by approximately 2%. The contribution of the result is three-fold: firstly, this paper is the first to show that 3D feature extraction methods can be applied to continuous sequence recognition tasks despite the unknown start positions and durations of each phoneme. Secondly, the result confirms that 3D feature extraction methods improve the accuracy compared to 2D feature extraction methods. Thirdly, the paper is the first to specifically compare an otherwise identical method with and without using biphones, verifying that the usage of biphones has a positive impact on the result.
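A sketch of extracting a 3D (spatio-temporal) DCT descriptor from a mouth-region sub-sequence and classifying it with an SVM; the number of retained coefficients and the clip format are assumptions for illustration.

```python
# Illustrative 3D-DCT descriptor: keep a low-frequency cube of coefficients.
import numpy as np
from scipy.fft import dctn
from sklearn.svm import SVC

def dct3_features(clip, keep=8):
    # clip: (frames, height, width) grayscale sub-sequence with at least `keep` frames
    coeffs = dctn(clip, norm='ortho')
    return coeffs[:keep, :keep, :keep].ravel()       # low-frequency cube

# X = np.stack([dct3_features(c) for c in clips])
# clf = SVC(probability=True).fit(X, labels)         # per-sub-sequence classifier
```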

Journal ArticleDOI
TL;DR: This special issue was inspired by the 2013 Workshop on Speech Production in Automatic Speech Recognition in Lyon, France, and this introduction provides an overview of the included papers in the context of the current research landscape.

Proceedings ArticleDOI
08 Sep 2016
TL;DR: This paper examines methods to improve visual speech synthesis from a text input using a deep neural network (DNN) and reveals the importance of the frame level information which is able to avoid discontinuities in the visual feature sequence and produces a smooth and realistic output.
Abstract: This paper examines methods to improve visual speech synthesis from a text input using a deep neural network (DNN). Two representations of the input text are considered, namely phoneme sequences and dynamic viseme sequences. From these sequences, contextual features are extracted that include information at varying linguistic levels, from the frame level to the utterance level. These are extracted from a broad sliding window that captures context and produces features that are input into the DNN to estimate visual features. Experiments first compare the accuracy of these visual features against an HMM baseline method, which establishes that both the phoneme and dynamic viseme systems perform better, with the best performance obtained by a combined phoneme-dynamic viseme system. An investigation into the features then reveals the importance of the frame-level information, which is able to avoid discontinuities in the visual feature sequence and produces a smooth and realistic output.
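An illustrative sketch of building frame-level input features from a phoneme (or dynamic viseme) label sequence with a centred sliding window, then regressing visual parameters with a DNN; the window width and one-hot encoding are assumptions, not the paper's feature set.

```python
# Illustrative sliding-window context features from per-frame unit labels.
import numpy as np
from sklearn.neural_network import MLPRegressor

def window_features(frame_labels, n_units, half_width=5):
    onehot = np.eye(n_units)[frame_labels]                     # (n_frames, n_units)
    padded = np.pad(onehot, ((half_width, half_width), (0, 0)))
    return np.hstack([padded[i:i + len(frame_labels)]
                      for i in range(2 * half_width + 1)])

# model = MLPRegressor(hidden_layer_sizes=(512, 512)).fit(
#     window_features(labels, n_units=40), visual_params)
```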

DOI
13 Dec 2016
TL;DR: This work defines an Indonesian viseme set and the associated mouth shapes within a text-input segmentation system, with a selected affect type as an additional input, to generate a viseme sequence for natural speech of Indonesian sentences with affect.
Abstract: In communication using text input, visemes (visual phonemes) are derived from groups of phonemes having similar visual appearances. The hidden Markov model (HMM) has been a popular mathematical approach for sequence classification tasks such as speech recognition. For speech emotion recognition, an HMM is trained for each emotion and an unknown sample is classified according to the model that best explains the derived feature sequence. The Viterbi algorithm is used with the HMM to estimate the most likely sequence of hidden states from the observations. In this work, the first stage defines an Indonesian viseme set and the associated mouth shapes, i.e. a text-input segmentation system. The second stage selects one affect type as an input to the system. The last stage uses trigram HMMs to generate the viseme sequence used for synchronised mouth shapes and lip movements. The whole system is connected in sequence. The final system produces a viseme sequence for natural speech of Indonesian sentences with affect. We show through various experiments that the proposed approach yields about 82.19% relative improvement in classification accuracy.
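A generic Viterbi decoding sketch for an HMM over viseme states; it uses a first-order HMM for brevity (the paper uses trigram HMMs), and the transition and emission tables are placeholders, not the Indonesian models.

```python
# Generic Viterbi decoding over hidden states given log-probability tables.
import numpy as np

def viterbi(log_init, log_trans, log_emit, observations):
    """log_init: (S,), log_trans: (S, S), log_emit: (S, V), observations: list of ints."""
    S, T = len(log_init), len(observations)
    delta = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    delta[0] = log_init + log_emit[:, observations[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans            # (S_prev, S_cur)
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emit[:, observations[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]                                          # most likely state sequence
```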

Proceedings ArticleDOI
01 Apr 2016
TL;DR: The variation of lip texture features for recognising visemes is explored, and the proposed approach is applied to Hindi word recognition, achieving high accuracy at the cost of computation time.
Abstract: Visual speech processing is a key concern of researchers working in speech processing and computer vision. Audio speech processing was earlier widely used for speech recognition, but its performance deteriorates in the presence of noise. Moreover, variation in accent is another challenge that affects the performance of such systems. In this paper, we explore the variation of lip texture features for the recognition of visemes. The variation in the temporal behaviour of lip texture features is encoded using Local Binary Pattern features in three orthogonal planes. Classification is carried out using a back-propagation neural network, a network with a hidden layer. The added advantage of the hidden layer for lip reading is that it accounts for the nonlinear variation of lip features while speaking. The proposed approach is applied to Hindi word recognition and achieves high accuracy at the cost of computation time.
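A simplified sketch of LBP features on three orthogonal planes (XY, XT, YT) of a mouth-region volume, concatenated into one descriptor; taking only the central slice of each plane is a simplification relative to a full LBP-TOP implementation.

```python
# Simplified three-orthogonal-planes LBP descriptor from a mouth-region volume.
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_top(volume, P=8, R=1):
    # volume: (T, H, W) uint8 grayscale; take the central slice of each plane
    T, H, W = volume.shape
    planes = [volume[T // 2], volume[:, H // 2, :], volume[:, :, W // 2]]
    hists = []
    for plane in planes:
        codes = local_binary_pattern(plane, P, R, method='uniform')   # values 0..P+1
        hist, _ = np.histogram(codes, bins=P + 2, range=(0, P + 2), density=True)
        hists.append(hist)
    return np.concatenate(hists)
```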


Dissertation
01 Jun 2016
TL;DR: The results show there is not a high variability of visual cues, but there is high variability in trajectory between visual cues of an individual speaker with the same ground truth, and the dependence of phoneme-to-viseme maps between speakers is investigated.
Abstract: This thesis is about improving machine lip-reading, that is, the classification of speech from only visual cues of a speaker. Machine lip-reading is a niche research problem in both areas of speech processing and computer vision. Current challenges for machine lip-reading fall into two groups: the content of the video, such as the rate at which a person is speaking; or the parameters of the video recording, for example the video resolution. We begin our work with a literature review to understand how current technology limits machine lip-reading recognition and conduct an experiment into the effects of resolution. We show that high-definition video is not needed to successfully lip-read with a computer. The term "viseme" is used in machine lip-reading to represent a visual cue or gesture which corresponds to a subgroup of phonemes where the phonemes are indistinguishable in the visual speech signal. Whilst a viseme is yet to be formally defined, we use the common working definition 'a viseme is a group of phonemes with identical appearance on the lips'. A phoneme is the smallest acoustic unit a human can utter. Because there are more phonemes per viseme, mapping between the units creates a many-to-one relationship. Many mappings have been presented, and we conduct an experiment to determine which mapping produces the most accurate classification. Our results show Lee's [82] is best. Lee's classification also outperforms machine lip-reading systems which use the popular Fisher [48] phoneme-to-viseme map. Further to this, we propose three methods of deriving speaker-dependent phoneme-to-viseme maps and compare our new approaches to Lee's. Our results show the sensitivity of phoneme clustering, and we use our new knowledge for our first suggested augmentation to the conventional lip-reading system. Speaker independence in machine lip-reading classification is another unsolved obstacle. It has been observed, in the visual domain, that classifiers need training on the test subject to achieve the best classification. Thus machine lip-reading is highly dependent upon the speaker. Speaker independence is the opposite of this, or in other words, the classification of a speaker not present in the classifier's training data. We investigate the dependence of phoneme-to-viseme maps between speakers. Our results show there is not a high variability of visual cues, but there is high variability in trajectory between visual cues of an individual speaker with the same ground truth. This implies a dependency upon the number of visemes within each set for each individual. Finally, we investigate how many visemes is the optimum number within a set. We show the phoneme-to-viseme maps in the literature rarely have enough visemes, and the optimal number, which varies by speaker, ranges from 11 to 35. The last difficulty we address is decoding from visemes back to phonemes and into words. Traditionally this is completed using a language model. The language model unit is either the same as the classifier's, e.g. visemes or phonemes, or the language model unit is words. In a novel approach we use these optimum-range viseme sets within hierarchical training of phoneme-labelled classifiers. This new method of classifier training demonstrates a significant increase in classification with a word language network.
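As an illustration of the many-to-one phoneme-to-viseme mapping discussed above, the sketch below applies a toy map to a phoneme sequence; the map fragment is illustrative and is not Lee's or Fisher's actual mapping.

```python
# Toy many-to-one phoneme-to-viseme mapping (fragment for illustration only).
TOY_P2V = {"p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
           "f": "V_labiodental", "v": "V_labiodental",
           "s": "V_alveolar", "z": "V_alveolar", "t": "V_alveolar", "d": "V_alveolar"}

def phonemes_to_visemes(phonemes, p2v=TOY_P2V):
    return [p2v.get(p, "V_other") for p in phonemes]

print(phonemes_to_visemes(["b", "a", "t"]))   # ['V_bilabial', 'V_other', 'V_alveolar']
```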

Proceedings ArticleDOI
01 Oct 2016
TL;DR: A system is developed which recognizes silently uttered words and performs specific tasks based on the recognized lip movements; the method used is noninvasive.
Abstract: The Silent Speech Interface (SSI) has given rise to the possibility of speech processing even in the absence of an acoustic signal, and it has been used as an aid for people with speech impairments. A Silent Speech Interface is a device that enables communication to take place when voice is not present. In noisy environments, the SSI holds a promising approach for speech processing. In this project, a system is developed which recognizes silently uttered words and performs specific tasks based on the recognized lip movements of the words. The method used is noninvasive. The system records video of the lip movements of the words or letters uttered silently. The recorded signal is analysed by extracting features using the local binary pattern (LBP) operator and then classified using a kNN classifier. The system is trained and tested to recognize the word 'WATER'. When the word 'WATER' is uttered, the lip movements are analysed and a control signal is generated to initiate the robot to pick a water bottle from a specified location and place it at a predefined destination.
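A sketch of the recognition step: classify a clip's LBP histogram with a kNN classifier and trigger an action on a recognised keyword; the value of k and the control signal name are assumptions for illustration.

```python
# Illustrative kNN keyword recogniser over LBP histograms with a simple trigger.
from sklearn.neighbors import KNeighborsClassifier

def train_word_recogniser(histograms, word_labels, k=3):
    return KNeighborsClassifier(n_neighbors=k).fit(histograms, word_labels)

def maybe_trigger_robot(clf, clip_histogram):
    word = clf.predict([clip_histogram])[0]
    if word == "WATER":
        return "fetch_bottle"          # placeholder control signal, not the paper's
    return None
```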

Proceedings ArticleDOI
01 Nov 2016
TL;DR: This paper investigates face and mouth localization in viseme images using an enhanced Viola-Jones algorithm to ensure the face and mouth are accurately detected, and shows a promising result with a high acceptance rate.
Abstract: Face and mouth localization are vital phases of visual speech recognition. These tasks refer to the detection of the face and mouth region within viseme images. The main problems in face and mouth localization are constraints on the image, such as rotation, homogeneous colour intensity within the images, and parts of the object that cannot be detected. This paper investigates face and mouth localization in viseme images using an enhanced Viola-Jones algorithm to ensure the face and mouth are accurately detected. The enhanced Viola-Jones algorithm was tested on 80 viseme images from recorded video data of 500 Bahasa Melayu common words, with test images including female and male adults with no speech problems. The proposed method showed a promising result with a high acceptance rate. Future work may focus on overcoming the detection of similar colour intensity and illumination problems within the images, as well as on mouth segmentation and temporal extraction of mouth features.
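A sketch of standard Viola-Jones face detection with OpenCV, restricting the mouth search to the lower half of each detected face; the cascade file is the stock OpenCV one, not the enhanced cascade described in the paper.

```python
# Standard Viola-Jones face detection with a lower-face mouth region of interest.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_face_and_mouth_roi(gray_image):
    faces = face_cascade.detectMultiScale(gray_image, scaleFactor=1.1, minNeighbors=5)
    rois = []
    for (x, y, w, h) in faces:
        mouth_roi = gray_image[y + h // 2: y + h, x: x + w]   # lower half of the face
        rois.append(((x, y, w, h), mouth_roi))
    return rois
```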

Journal ArticleDOI
TL;DR: The contribution of this study is to develop a real-time automatic talking system for the English language based on the concatenation of visemes, followed by results evaluated against the phoneme-to-viseme table using Prophone.
Abstract: Lip-syncing is a process of speech assimilation with the lip motions of a virtual character. A virtual talking character is a challenging task because it should provide control over all articulatory movements and must be synchronized with the speech signal. This study presents a virtual talking character system aimed at speeding up and easing the visual talking process compared to previous techniques using the blend-shapes approach. The system constructs the lip-syncing using a set of visemes for a reduced phoneme set with a new method named Prophone. Prophone depends on the probability of a phoneme appearing in English sentences. The contribution of this study is to develop a real-time automatic talking system for the English language based on the concatenation of visemes, followed by results evaluated against the phoneme-to-viseme table using Prophone.
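A hedged sketch of viseme concatenation for lip-sync: map a timed phoneme sequence to viseme keyframes; the mapping and timing scheme are illustrative, not the Prophone method itself.

```python
# Illustrative viseme keyframe track built from a timed phoneme sequence.
def build_viseme_track(timed_phonemes, p2v):
    """timed_phonemes: list of (phoneme, start_sec, end_sec) from forced alignment."""
    track = []
    for phoneme, start, end in timed_phonemes:
        track.append((start, p2v.get(phoneme, "rest")))    # key in the viseme pose
    if timed_phonemes:
        track.append((timed_phonemes[-1][2], "rest"))      # return to rest at the end
    return track
```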


Proceedings Article
14 Dec 2016
TL;DR: A method using polymorphing to incorporate co-articulation in a TTAVS, a system for converting input text to a video-realistic audio-visual sequence, with added temporal smoothing for viseme transitions to avoid jerky animation.
Abstract: This paper presents our approach to co-articulation for a text-to-audiovisual speech synthesizer (TTAVS), a system for converting input text to a video-realistic audio-visual sequence. It is an image-based system, where the face is modeled using a set of images of a human subject. A concatenation of visemes, the corresponding lip shapes for phonemes, can be used for modeling visual speech. However, in actual speech production there is overlap in the production of syllables and phonemes, which form a sequence of discrete units of speech. Due to this overlap, the boundaries between these discrete speech units are blurred, i.e., vocal tract motions associated with producing one phonetic segment overlap the motions for producing surrounding phonetic segments. This overlap is called co-articulation. The lack of parameterization in the image-based model makes it difficult to use the techniques employed in 3D facial animation models for co-articulation. We introduce a method using polymorphing to incorporate co-articulation in our TTAVS. Further, we add temporal smoothing for viseme transitions to avoid jerky animation.
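A sketch of the temporal-smoothing idea for viseme transitions: linearly cross-fade between consecutive viseme images over a short overlap window; the overlap length and linear weighting are assumptions, not the paper's polymorphing algorithm.

```python
# Illustrative linear cross-fade between two viseme images for a smooth transition.
import numpy as np

def crossfade_frames(img_a, img_b, n_frames=5):
    """img_a, img_b: float arrays of the same shape; returns the transition frames."""
    alphas = np.linspace(0.0, 1.0, n_frames)
    return [(1.0 - a) * img_a + a * img_b for a in alphas]
```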


Proceedings ArticleDOI
28 Nov 2016
TL;DR: A speech animation can easily be produced by choosing a suitable action from a database of actions captured beforehand and pairing it with the speech, provided the action is unrelated to the speech content.
Abstract: When an action is not tied to the content of the speech, the speech can be paired with any unrelated action. In such cases, a speech animation can easily be produced by choosing a suitable action from a database of actions captured beforehand.

Book ChapterDOI
01 Jan 2016
TL;DR: The aim of this chapter is to give a comprehensive overview of current state-of-the-art parametric methods for realistic facial modelling and animation.
Abstract: Facial modelling is a fundamental technique in a variety of applications in computer graphics, computer vision and pattern recognition areas. As 3D technologies evolved over the years, the quality of facial modelling greatly improved. To enhance the modelling quality and controllability of the model further, parametric methods, which represent or manipulate facial attributes (e.g. identity, expression, viseme) with a set of control parameters, have been proposed in recent years. The aim of this chapter is to give a comprehensive overview of current state-of-the-art parametric methods for realistic facial modelling and animation.

Dissertation
01 Mar 2016
TL;DR: This study integrates emotions by considering both the Ekman model and Plutchik's wheel, together with emotive eye movements implemented via the Emotional Eye Movements Markup Language, to produce a realistic 3D face model.
Abstract: Lip synchronization of a 3D face model is now being used in a multitude of important fields. It brings a more human and dramatic reality to computer games, films and interactive multimedia, and is growing in use and importance. A high level of realism is demanded in applications such as computer games and cinema. Authoring lip syncing with complex and subtle expressions is still difficult and fraught with problems in terms of realism. Thus, this study proposes a lip-syncing method for a realistic, expressive 3D face model. Animated lips require a 3D face model capable of representing the movement of face muscles during speech and a method to produce the correct lip shape at the correct time. The 3D face model is designed based on the MPEG-4 facial animation standard to support lip syncing aligned with an input audio file. It deforms using a Raised Cosine Deformation function grafted onto the input facial geometry. This study also proposes a method to animate the 3D face model over time to create animated lip syncing, using a canonical set of visemes for all pairwise combinations of a reduced phoneme set called ProPhone. Finally, this study integrates emotions by considering both the Ekman model and Plutchik's wheel, together with emotive eye movements implemented via the Emotional Eye Movements Markup Language, to produce a realistic 3D face model. The experimental results show that the proposed model can generate visually satisfactory animations with a Mean Square Error of 0.0020 for the neutral expression, 0.0024 for happy, 0.0020 for angry, 0.0030 for fear, 0.0026 for surprise, 0.0010 for disgust, and 0.0030 for sad.
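A sketch of a raised-cosine deformation kernel of the kind named above: vertices within a radius of a control point move with a smooth cosine falloff. The exact formulation in the thesis may differ; this is a standard illustration.

```python
# Raised-cosine falloff deformation around a control point on a mesh.
import numpy as np

def raised_cosine_deform(vertices, centre, displacement, radius):
    """vertices: (N, 3); centre, displacement: (3,); radius: float."""
    d = np.linalg.norm(vertices - centre, axis=1)
    w = np.where(d < radius, 0.5 * (1.0 + np.cos(np.pi * d / radius)), 0.0)
    return vertices + w[:, None] * displacement      # weight 1 at centre, 0 at radius
```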