
Showing papers on "Viseme published in 2018"


Proceedings Article
27 Apr 2018
TL;DR: A hierarchical-LSTM (HLSTM) encoder-decoder model with visual content and word embedding for SLT exhibits promising performance on signer-independent tests with seen sentences and also outperforms the comparison algorithms on unseen sentences.
Abstract: Continuous Sign Language Translation (SLT) is a challenging task due to its specific linguistics under sequential gesture variation without word alignment. Current hybrid HMM and CTC (Connectionist Temporal Classification) based models are proposed to solve frame- or word-level alignment. They may fail to tackle cases where the word order does not match the visual content of the sentence. To solve this issue, this paper proposes a hierarchical-LSTM (HLSTM) encoder-decoder model with visual content and word embedding for SLT. It tackles different granularities by conveying spatio-temporal transitions among frames, clips and viseme units. It first explores spatio-temporal cues of video clips by 3D CNN and packs appropriate visemes by online key-clip mining with adaptive variable length. After pooling on the recurrent outputs of the top layer of the HLSTM, a temporal attention-aware weighting mechanism is proposed to balance the intrinsic relationship among viseme source positions. Finally, another two LSTM layers are used to separately recurse viseme vectors and translate semantics. By preserving the original visual content with the 3D CNN and the top layer of the HLSTM, it shortens the encoding time step of the bottom two LSTM layers with less computational complexity while attaining more nonlinearity. Our proposed model exhibits promising performance on signer-independent tests with seen sentences and also outperforms the comparison algorithms on unseen sentences.

135 citations
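
As a rough illustration of the architecture sketched in the abstract, the following is a minimal PyTorch sketch (not the authors' code) of a hierarchical LSTM encoder-decoder: clip features from a 3D CNN are encoded, pooled into variable-length viseme units, attention-weighted, and decoded into a sentence. The viseme boundaries, layer sizes and decoding scheme are assumptions for illustration only.

```python
# Minimal sketch, assuming clip features are produced by a separate 3D CNN.
import torch
import torch.nn as nn

class HLSTMTranslator(nn.Module):
    def __init__(self, clip_dim=512, hidden=256, vocab_size=1000):
        super().__init__()
        self.clip_encoder = nn.LSTM(clip_dim, hidden, batch_first=True)   # top layer: clip stream
        self.viseme_encoder = nn.LSTM(hidden, hidden, batch_first=True)   # recurses pooled viseme vectors
        self.attn = nn.Linear(hidden, 1)                                  # temporal attention weights
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)          # translates to a sentence
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, clip_feats, viseme_boundaries, max_len=20):
        # clip_feats: (B, T, clip_dim) features of short clips from a 3D CNN
        enc, _ = self.clip_encoder(clip_feats)
        # pool clip outputs into variable-length viseme units (hypothetical boundaries)
        visemes = torch.stack([enc[:, s:e].mean(dim=1) for s, e in viseme_boundaries], dim=1)
        vis_enc, (h, c) = self.viseme_encoder(visemes)
        # attention-aware weighting over viseme source positions
        w = torch.softmax(self.attn(vis_enc), dim=1)
        context = (w * vis_enc).sum(dim=1, keepdim=True)
        # decode a fixed number of steps for illustration
        dec, _ = self.decoder(context.repeat(1, max_len, 1), (h, c))
        return self.out(dec)                                              # (B, max_len, vocab)

# Toy usage
feats = torch.randn(2, 12, 512)
logits = HLSTMTranslator()(feats, [(0, 4), (4, 8), (8, 12)])
```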


Journal ArticleDOI
TL;DR: A novel deep-learning based approach to producing animator-centric speech motion curves that drive a JALI or standard FACS-based production face-rig, directly from input audio, that integrates seamlessly into existing animation pipelines.
Abstract: We present a novel deep-learning based approach to producing animator-centric speech motion curves that drive a JALI or standard FACS-based production face-rig, directly from input audio. Our three-stage Long Short-Term Memory (LSTM) network architecture is motivated by psycho-linguistic insights: segmenting speech audio into a stream of phonetic groups is sufficient for viseme construction; speech styles like mumbling or shouting are strongly correlated with the motion of facial landmarks; and animator style is encoded in viseme motion curve profiles. Our contribution is an automatic, real-time lip-synchronization-from-audio solution that integrates seamlessly into existing animation pipelines. We evaluate our results by: cross-validation against ground-truth data; animator critique and edits; visual comparison to recent deep-learning lip-synchronization solutions; and showing our approach to be resilient to diversity in speaker and language.

115 citations
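
To make the three-stage idea concrete, here is a hedged PyTorch sketch loosely mirroring the described pipeline (audio to phonetic groups, audio to landmark/style motion, fused into rig curves). Layer sizes, feature dimensions and the fusion scheme are assumptions, not the paper's values.

```python
# Sketch only: audio features -> phonetic groups and landmark motion -> rig curves.
import torch
import torch.nn as nn

class AudioToCurves(nn.Module):
    def __init__(self, audio_dim=26, n_phone_groups=20, n_landmarks=38, n_curves=34, hidden=128):
        super().__init__()
        self.stage1 = nn.LSTM(audio_dim, hidden, batch_first=True)            # audio -> phonetic groups
        self.phone_head = nn.Linear(hidden, n_phone_groups)
        self.stage2 = nn.LSTM(audio_dim, hidden, batch_first=True)            # audio -> landmark motion (style)
        self.landmark_head = nn.Linear(hidden, n_landmarks)
        self.stage3 = nn.LSTM(n_phone_groups + n_landmarks, hidden, batch_first=True)
        self.curve_head = nn.Linear(hidden, n_curves)                         # viseme motion curve values

    def forward(self, audio):
        p, _ = self.stage1(audio)
        phones = torch.softmax(self.phone_head(p), dim=-1)
        l, _ = self.stage2(audio)
        landmarks = self.landmark_head(l)
        fused, _ = self.stage3(torch.cat([phones, landmarks], dim=-1))
        return self.curve_head(fused)      # per-frame activation curves for the face rig

curves = AudioToCurves()(torch.randn(1, 200, 26))   # 200 frames of MFCC-like audio features
print(curves.shape)
```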


Posted Content
TL;DR: In this article, a three-stage Long Short-Term Memory (LSTM) network architecture is proposed to produce animator-centric speech motion curves that drive a JALI or standard FACS-based production face-rig, directly from input audio.
Abstract: We present a novel deep-learning based approach to producing animator-centric speech motion curves that drive a JALI or standard FACS-based production face-rig, directly from input audio. Our three-stage Long Short-Term Memory (LSTM) network architecture is motivated by psycho-linguistic insights: segmenting speech audio into a stream of phonetic groups is sufficient for viseme construction; speech styles like mumbling or shouting are strongly correlated with the motion of facial landmarks; and animator style is encoded in viseme motion curve profiles. Our contribution is an automatic, real-time lip-synchronization-from-audio solution that integrates seamlessly into existing animation pipelines. We evaluate our results by: cross-validation against ground-truth data; animator critique and edits; visual comparison to recent deep-learning lip-synchronization solutions; and showing our approach to be resilient to diversity in speaker and language.

29 citations


Journal ArticleDOI
TL;DR: The aim of the presented research is a review of various approaches to the problem, the implementation of algorithms proposed in the literature and a comparative study of their effectiveness.
Abstract: An elementary visual unit, the viseme, is considered in the paper in the context of preparing the feature vector as the main visual input component of Audio-Visual Speech Recognition systems. The aim of the presented research is a review of various approaches to the problem, the implementation of algorithms proposed in the literature and a comparative study of their effectiveness. In the course of the study, an optimal feature vector construction and an appropriate selection of the classifier were sought. The experimental research was conducted on a spoken corpus in which speech was represented both acoustically and visually. The extracted features were of three types: geometrical, textural and mixed. The features were processed using classification algorithms based on Hidden Markov Models and Sequential Minimal Optimization. Tests were carried out on video material recorded with native English speakers who read a specially prepared list of commands. The obtained results are discussed in the paper.

21 citations
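
The sketch below illustrates the kind of mixed geometric-plus-textural viseme feature vector described above, classified with an SVM (an SMO-style classifier). Landmark indices, the DCT coefficient count and the toy data are assumptions, not the paper's configuration.

```python
# Illustrative sketch only, assuming mouth landmarks and a grayscale mouth patch per frame.
import numpy as np
from scipy.fftpack import dct
from sklearn.svm import SVC

def geometric_features(lip_landmarks):
    """lip_landmarks: (N, 2) array of mouth contour points."""
    width = np.linalg.norm(lip_landmarks[0] - lip_landmarks[6])
    height = np.linalg.norm(lip_landmarks[3] - lip_landmarks[9])
    return np.array([width, height, height / (width + 1e-6)])

def textural_features(mouth_roi, n_coeffs=20):
    """mouth_roi: 2-D grayscale patch; keep low-order 2-D DCT coefficients."""
    coeffs = dct(dct(mouth_roi, axis=0, norm='ortho'), axis=1, norm='ortho')
    return coeffs[:5, :5].ravel()[:n_coeffs]

def mixed_vector(lip_landmarks, mouth_roi):
    return np.concatenate([geometric_features(lip_landmarks), textural_features(mouth_roi)])

# Toy usage with random data standing in for extracted video frames
rng = np.random.default_rng(0)
X = np.stack([mixed_vector(rng.random((12, 2)), rng.random((32, 48))) for _ in range(100)])
y = rng.integers(0, 5, size=100)          # 5 hypothetical viseme classes
clf = SVC(kernel='rbf').fit(X, y)
print(clf.score(X, y))
```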


Patent
07 Jun 2018
TL;DR: In this article, a computer-implemented method and system for incorporating emotional and contextual visualization into an electronic communication comprises: creating a 2D texture map of a user's face from a series of photos; comparing the user's texture map with 2D texture maps of samples from a reference database to find the closest matches and create a photorealistic composite 3D mesh model of the user's head that can be modified to present different emotions and phonemes; and using the emotion ID and phoneme ID to retrieve, from databases loaded on the receiving device, the corresponding 3D mesh models and 2D textures so that a fully animated video message can be displayed without transmitting video data.
Abstract: A computer-implemented method and system for incorporating emotional and contextual visualization into an electronic communication comprises: creating a 2D texture map of a user's face from a series of photos; comparing the user's 2D texture map with 2D texture maps of samples from a reference database to find the closest matches and create a photorealistic composite 3D mesh model of the user's head that can be modified to present different emotions and phonemes; during an electronic communication between the user's sending device and the receiving device, determining an emotional state and a current phoneme (viseme) from the user's voice, text or camera data and transmitting an emotion identifier and a phoneme identifier to the receiving device; and using the emotion ID and phoneme ID to retrieve, from databases loaded on the receiving device, the corresponding 3D mesh models and 2D textures to create and display a fully animated video message on the receiving device without requiring video data transmission over the communication channel.

19 citations
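
A hypothetical sketch of the lightweight message the patent describes sending in place of video frames: only identifiers travel over the channel, and the receiving device looks up the matching 3D mesh and 2D texture locally. All field and function names here are invented for illustration.

```python
# Sketch under assumptions; not the patented implementation.
from dataclasses import dataclass
import json

@dataclass
class AnimationFrameMessage:
    timestamp_ms: int
    emotion_id: int     # index into the receiver's emotion mesh database
    phoneme_id: int     # index into the receiver's viseme mesh database

    def encode(self) -> bytes:
        return json.dumps(self.__dict__).encode()

def render_frame(msg: AnimationFrameMessage, mesh_db: dict, texture_db: dict):
    """Look up locally stored assets instead of decoding transmitted video."""
    mesh = mesh_db[(msg.emotion_id, msg.phoneme_id)]
    texture = texture_db[msg.emotion_id]
    return mesh, texture   # handed to the local renderer

# A frame message costs tens of bytes rather than a compressed video frame:
print(len(AnimationFrameMessage(timestamp_ms=40, emotion_id=3, phoneme_id=17).encode()))
```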


Posted Content
TL;DR: The DCT is found to outperform AAM by more than 6% on a viseme recognition task with 56 speakers, and it is concluded that a fundamental rethink of the modelling of visual features may be needed for this task.
Abstract: Automatic lipreading has major potential impact for speech recognition, supplementing and complementing the acoustic modality. Most attempts at lipreading have been performed on small vocabulary tasks, due to a shortfall of appropriate audio-visual datasets. In this work we use the publicly available TCD-TIMIT database, designed for large vocabulary continuous audio-visual speech recognition. We compare the viseme recognition performance of the most widely used features for lipreading, Discrete Cosine Transform (DCT) and Active Appearance Models (AAM), in a traditional Hidden Markov Model (HMM) framework. We also exploit recent advances in AAM fitting. We found the DCT to outperform AAM by more than 6% for a viseme recognition task with 56 speakers. The overall accuracy of the DCT is quite low (32-34%). We conclude that a fundamental rethink of the modelling of visual features may be needed for this task.

13 citations
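
For readers unfamiliar with the DCT-plus-HMM setup compared above, here is a rough sketch: one Gaussian HMM per viseme class is trained on DCT feature sequences and test sequences are assigned to the highest-scoring model. The hmmlearn usage, state count and feature dimensions are assumptions, not the paper's pipeline.

```python
# Hedged sketch of per-class HMM viseme recognition over DCT feature sequences.
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_viseme_hmms(train_seqs, n_states=3):
    """train_seqs: dict viseme_label -> list of (T_i, D) DCT feature sequences."""
    models = {}
    for label, seqs in train_seqs.items():
        X = np.vstack(seqs)
        lengths = [len(s) for s in seqs]
        models[label] = GaussianHMM(n_components=n_states, covariance_type='diag',
                                    n_iter=20).fit(X, lengths)
    return models

def classify(seq, models):
    # pick the class whose HMM gives the highest log-likelihood
    return max(models, key=lambda label: models[label].score(seq))

# Toy data standing in for per-frame DCT vectors of the mouth region
rng = np.random.default_rng(1)
train = {v: [rng.random((30, 20)) + v for _ in range(5)] for v in range(3)}
hmms = train_viseme_hmms(train)
print(classify(rng.random((30, 20)) + 2, hmms))
```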


Book ChapterDOI
18 Sep 2018
TL;DR: This paper investigates the issue in more detail, proposes advanced geometry-based visual features for an automatic Russian lip-reading system and tests the main state-of-the-art methods for visual speech recognition.
Abstract: The use of video information plays an increasingly important role in automatic speech recognition. Nowadays, audio-only systems have reached a certain accuracy threshold, and many researchers see a solution to the problem in the use of the visual modality to obtain better results. Although the audio modality of speech is much more informative than the video modality, their proper fusion can improve both the quality and the robustness of the entire recognition system, as has been shown in practice by many studies. However, no agreement among researchers on the optimal set of visual features has been reached. In this paper, we investigate this issue in more detail and propose advanced geometry-based visual features for an automatic Russian lip-reading system. The experiments were conducted using the collected HAVRUS audio-visual speech database. The average viseme recognition accuracy of our system trained on the entire corpus is 40.62%. We also tested the main state-of-the-art methods for visual speech recognition, applying them to continuous Russian speech with high-speed recordings (200 frames per second).

10 citations
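
As an illustration of what geometry-based visual features can look like, the sketch below computes a few scale-normalized mouth measurements per frame and augments them with temporal deltas. The specific distances, the normalization by inter-eye distance and the landmark layout are assumptions; the authors' exact feature set is not reproduced here.

```python
# Sketch under assumptions: per-frame geometric lip features plus temporal deltas.
import numpy as np

def geometry_features(mouth_pts, left_eye, right_eye):
    """mouth_pts: (20, 2) mouth landmarks; eye centers used for scale normalization."""
    scale = np.linalg.norm(left_eye - right_eye) + 1e-6
    width  = np.linalg.norm(mouth_pts[0] - mouth_pts[6]) / scale
    height = np.linalg.norm(mouth_pts[3] - mouth_pts[9]) / scale
    # shoelace formula for mouth-contour area, normalized by scale squared
    area   = 0.5 * np.abs(np.dot(mouth_pts[:, 0], np.roll(mouth_pts[:, 1], 1))
                          - np.dot(mouth_pts[:, 1], np.roll(mouth_pts[:, 0], 1))) / scale**2
    return np.array([width, height, height / width, area])

def with_deltas(frame_features):
    """Stack per-frame features with their first-order temporal differences."""
    deltas = np.diff(frame_features, axis=0, prepend=frame_features[:1])
    return np.hstack([frame_features, deltas])

frames = np.stack([geometry_features(np.random.rand(20, 2),
                                     np.array([0.3, 0.4]), np.array([0.7, 0.4]))
                   for _ in range(200)])   # e.g. 1 s of video at 200 fps
print(with_deltas(frames).shape)           # (200, 8)
```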


Proceedings ArticleDOI
29 Mar 2018
TL;DR: This paper presents a detailed study of machine learning approaches for the real-time visual recognition of spoken words; nine different classifiers have been implemented and tested, and their confusion matrices among different groups of words are reported.
Abstract: Lipreading is the process of interpreting spoken words by observing lip movement. It plays a vital role in human communication and speech understanding, especially for hearing-impaired individuals. Automated lipreading approaches have recently been used in such applications as biometric identification, silent dictation, forensic analysis of surveillance camera footage, and communication with autonomous vehicles. However, lipreading is a difficult process that poses several challenges to human- and machine-based approaches alike. This is due to the large number of phonemes in human language being visually represented by a smaller number of lip movements (visemes). Consequently, the same viseme may represent several phonemes, which confuses any lipreader. In this paper, we present a detailed study of machine learning approaches for the real-time visual recognition of spoken words. Our focus on real-time performance is motivated by the recent trend of using lipreading in autonomous vehicles. Nine different classifiers were implemented and tested, and their confusion matrices among different groups of words are reported. The three best-performing classifiers were Gradient Boosting, Support Vector Machine (SVM) and logistic regression, with accuracies of 64.7%, 63.5% and 59.4%, respectively.

10 citations
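
The comparison described above maps directly onto standard scikit-learn classifiers. The sketch below evaluates the three best-performing models named in the abstract on placeholder data; the feature extraction step and the data themselves are stand-ins, not the paper's dataset.

```python
# Illustrative classifier comparison; random data replaces the lip-feature extraction step.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

rng = np.random.default_rng(42)
X = rng.random((300, 40))              # stand-in for per-word visual feature vectors
y = rng.integers(0, 6, size=300)       # stand-in for 6 word classes
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

classifiers = {
    'GradientBoosting': GradientBoostingClassifier(),
    'SVM': SVC(kernel='rbf'),
    'LogisticRegression': LogisticRegression(max_iter=1000),
}
for name, clf in classifiers.items():
    pred = clf.fit(X_tr, y_tr).predict(X_te)
    print(name, accuracy_score(y_te, pred))
    print(confusion_matrix(y_te, pred))
```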


Journal ArticleDOI
TL;DR: The proposed research integrates emotions, based on the Ekman model and Plutchik's wheel, with emotive eye movements implemented via the Emotional Eye Movements Markup Language (EEMML) to produce a realistic 3D face model.
Abstract: Lip synchronization of 3D face models is now used in a multitude of important fields. It brings a more human, social and dramatic reality to computer games, films and interactive multimedia, and is growing in use and importance. A high level of realism is required in demanding applications such as computer games and cinema. Authoring lip syncing with complex and subtle expressions is still difficult and fraught with problems in terms of realism. This research proposes a lip-syncing method for a realistic, expressive 3D face model. Animating lips requires a 3D face model capable of representing the myriad shapes the human face assumes during speech and a method to produce the correct lip shape at the correct time. The paper presents a 3D face model designed to support lip syncing aligned with an input audio file. It deforms using a Raised Cosine Deformation (RCD) function that is grafted onto the input facial geometry. The face model is based on the MPEG-4 Facial Animation (FA) standard. The paper proposes a method to animate the 3D face model over time to create animated lip syncing, using a canonical set of visemes for all pairwise combinations of a reduced phoneme set called ProPhone. The proposed research integrates emotions, based on the Ekman model and Plutchik's wheel, with emotive eye movements implemented via the Emotional Eye Movements Markup Language (EEMML) to produce a realistic 3D face model.

9 citations
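
To illustrate the general idea of a raised-cosine deformation driving lip shapes, the sketch below blends neutral mouth geometry toward a viseme target with a time-varying raised-cosine weight. This is a minimal sketch of the concept only; the paper's exact RCD formulation, mesh and timing model may differ.

```python
# Minimal sketch: raised-cosine blend of neutral geometry toward a viseme shape.
import numpy as np

def raised_cosine_weight(t, start, duration):
    """Smoothly ramps 0 -> 1 -> 0 over [start, start + duration] using a raised cosine."""
    x = np.clip((t - start) / duration, 0.0, 1.0)
    return 0.5 * (1.0 - np.cos(2.0 * np.pi * x)) if 0.0 < x < 1.0 else 0.0

def deform(neutral_vertices, viseme_offsets, t, start, duration):
    """Blend neutral geometry toward a viseme shape by the time-varying weight."""
    w = raised_cosine_weight(t, start, duration)
    return neutral_vertices + w * viseme_offsets

neutral = np.zeros((468, 3))                      # hypothetical face-mesh vertices
offsets = np.random.rand(468, 3) * 0.01           # displacement toward an /o/ viseme target
frame = deform(neutral, offsets, t=0.10, start=0.05, duration=0.20)
print(frame.shape)
```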


Journal ArticleDOI
TL;DR: The findings suggest that the perceptual processes underlying asymmetries in unimodal visual vowel discrimination are sensitive to speech-specific motion and configural properties and raise foundational questions concerning the role of specialized and general processes in vowel perception.
Abstract: Masapollo, Polka, and Menard (2017) recently reported a robust directional asymmetry in unimodal visual vowel perception: Adult perceivers discriminate a change from an English /u/ viseme to a French /u/ viseme significantly better than a change in the reverse direction. This asymmetry replicates a frequent pattern found in unimodal auditory vowel perception that points to a universal bias favoring more extreme vocalic articulations, which lead to acoustic signals with increased formant convergence. In the present article, the authors report five experiments designed to investigate whether this asymmetry in the visual realm reflects a speech-specific or general processing bias. They successfully replicated the directional effect using Masapollo et al.'s dynamically articulating faces but failed to replicate the effect when the faces were shown under static conditions. Asymmetries also emerged during discrimination of canonically oriented point-light stimuli that retained the kinematics and configuration of the articulating mouth. In contrast, no asymmetries emerged during discrimination of rotated point-light stimuli or Lissajous patterns that retained the kinematics, but not the canonical orientation or spatial configuration, of the labial gestures. These findings suggest that the perceptual processes underlying asymmetries in unimodal visual vowel discrimination are sensitive to speech-specific motion and configural properties and raise foundational questions concerning the role of specialized and general processes in vowel perception.

9 citations


Journal ArticleDOI
TL;DR: The results show that the synthesized visual speech achieves a good level of synchronization and naturalness; therefore, the system can display visualizations of phoneme pronunciation to support learning Indonesian pronunciation.
Abstract: This study aims to build realistic visual speech synthesis for Indonesian so that it can be used to learn Indonesian pronunciation. We used a combination of viseme morphing and syllable concatenation. Viseme morphing deforms one viseme into another so that the animation of the mouth shape looks smoother; it is used to create transitions between visemes. Syllable concatenation is used to assemble visemes according to particular syllable patterns. We built a syllable-based voice database as a basis for synchronization between syllables, speech and viseme models. The method proposed in this study consists of several stages, namely the formation of Indonesian viseme models, the design of a facial animation character, the development of a speech database, a synchronization process and subjective testing of the resulting application. Subjective tests were conducted with 30 respondents who assessed the suitability and naturalness of the mouth movements when uttering Indonesian texts. The MOS (Mean Opinion Score) method was used to calculate the average of the respondents' scores. The MOS results for the synchronization and naturalness criteria are 4.283 and 4.107 on a scale of 1 to 5. These results show that the synthesized visual speech achieves a good level of synchronization and naturalness; therefore, the system can display visualizations of phoneme pronunciation to support learning Indonesian pronunciation.
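
The viseme morphing and syllable concatenation steps can be illustrated with a minimal sketch: mouth keyframes for consecutive visemes are linearly interpolated, then chained per syllable. The keyframe shapes, frame counts and linear interpolation are assumptions; the authors' implementation may use a different deformation model.

```python
# Sketch (assumed shapes, not the authors' implementation) of viseme morphing
# and syllable concatenation into one animation track.
import numpy as np

def morph(viseme_a, viseme_b, n_frames):
    """Linearly interpolate between two mouth-shape keyframes."""
    alphas = np.linspace(0.0, 1.0, n_frames)[:, None, None]
    return (1.0 - alphas) * viseme_a + alphas * viseme_b

def concatenate_syllables(syllable_visemes, frames_per_transition=5):
    """Chain viseme keyframes for a syllable sequence into one animation track."""
    track = []
    for a, b in zip(syllable_visemes[:-1], syllable_visemes[1:]):
        track.append(morph(a, b, frames_per_transition))
    return np.concatenate(track, axis=0)

# Toy keyframes: 3 visemes for the syllables of a word, 20 mouth landmarks each
keyframes = [np.random.rand(20, 2) for _ in range(3)]
animation = concatenate_syllables(keyframes)
print(animation.shape)   # (10, 20, 2)
```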

Proceedings ArticleDOI
15 May 2018
TL;DR: Two deep learning speech-driven structures to integrate speech articulation and emotional cues are provided, based on multitask learning (MTL) strategies, where related secondary tasks are jointly solved when synthesizing orofacial movements.
Abstract: The orofacial area conveys a range of information, including speech articulation and emotions. These two factors add constraints to facial movements, creating non-trivial integrations and interplays. To generate more expressive and naturalistic movements for conversational agents (CAs), the relationship between these factors should be carefully modeled. Data-driven models are more appropriate for this task than rule-based systems. This paper provides two deep-learning speech-driven structures to integrate speech articulation and emotional cues. The proposed approaches rely on multitask learning (MTL) strategies, where related secondary tasks are jointly solved when synthesizing orofacial movements. In particular, we evaluate emotion recognition and viseme recognition as secondary tasks. The approach creates shared representations that generate behaviors that are not only closer to the original orofacial movements but are also perceived as more natural than the results from single-task learning.
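
A hedged PyTorch sketch of the multitask idea: a shared speech encoder drives an orofacial regression head plus auxiliary emotion and viseme classification heads, trained jointly. Dimensions, loss weights and the encoder type are assumptions, not the paper's configuration.

```python
# Sketch of shared-representation multitask learning for speech-driven orofacial motion.
import torch
import torch.nn as nn

class MTLOrofacial(nn.Module):
    def __init__(self, audio_dim=40, hidden=256, n_landmarks=30, n_emotions=4, n_visemes=14):
        super().__init__()
        self.shared = nn.LSTM(audio_dim, hidden, batch_first=True)
        self.face_head = nn.Linear(hidden, n_landmarks * 2)    # primary task: 2-D landmark motion
        self.emotion_head = nn.Linear(hidden, n_emotions)      # secondary task 1
        self.viseme_head = nn.Linear(hidden, n_visemes)        # secondary task 2

    def forward(self, audio):
        h, _ = self.shared(audio)
        return self.face_head(h), self.emotion_head(h), self.viseme_head(h)

model = MTLOrofacial()
audio = torch.randn(8, 100, 40)                       # 8 utterances, 100 frames of features
face, emo, vis = model(audio)
loss = (nn.functional.mse_loss(face, torch.randn_like(face))
        + 0.3 * nn.functional.cross_entropy(emo.mean(dim=1), torch.randint(0, 4, (8,)))
        + 0.3 * nn.functional.cross_entropy(vis.reshape(-1, 14), torch.randint(0, 14, (8 * 100,))))
loss.backward()
```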

Proceedings ArticleDOI
23 Jul 2018
TL;DR: This work addresses the problem of recognizing visemes, the visual equivalent of phonemes (the smallest distinguishable sound units in a spoken word), by creating a large-scale, automatically labelled synthetic 2D dataset based on realistic 3D facial models.
Abstract: Recently, deep-learning-based methods have obtained high accuracy on the problem of Visual Speech Recognition. However, while good results have been reported for words and sentences, recognizing shorter segments of speech, such as phones, has proven much more challenging due to the lack of temporal and contextual information. In this work, we address the problem of recognizing visemes, the visual equivalent of phonemes (the smallest distinguishable sound units in a spoken word). Viseme recognition has applications in tasks such as lip synchronization, but acquiring and labeling a viseme dataset is complex and time-consuming. We tackle this problem by creating a large-scale, automatically labelled synthetic 2D dataset based on realistic 3D facial models. Then, we extract real viseme images from the GRID corpus, using audio data to locate phonemes via forced phonetic alignment and the registered video to extract the corresponding visemes, and evaluate the applicability of the synthetic dataset for recognizing real-world data.
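
The forced-alignment step described above can be sketched as follows: each aligned phone interval is mapped to a viseme class and to the video frames it covers. The phoneme-to-viseme map shown is a toy example and the alignment tuples are placeholders, not data from the paper.

```python
# Sketch under assumptions: turning forced-alignment output into labelled viseme frames.
PHONEME_TO_VISEME = {'p': 'bilabial', 'b': 'bilabial', 'm': 'bilabial',
                     'f': 'labiodental', 'v': 'labiodental',
                     'aa': 'open', 'ae': 'open', 'iy': 'spread'}

def alignment_to_viseme_frames(alignment, fps=25.0):
    """alignment: list of (phone, start_sec, end_sec) produced by a forced aligner."""
    samples = []
    for phone, start, end in alignment:
        viseme = PHONEME_TO_VISEME.get(phone)
        if viseme is None:
            continue                       # phone not covered by this toy map
        first, last = int(start * fps), int(end * fps)
        samples.extend((frame_idx, viseme) for frame_idx in range(first, last + 1))
    return samples

# e.g. a fragment of an alignment for a GRID-style utterance
alignment = [('b', 0.10, 0.16), ('iy', 0.16, 0.30), ('n', 0.30, 0.38)]
print(alignment_to_viseme_frames(alignment))
```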

Proceedings ArticleDOI
01 Jun 2018
TL;DR: This work proposes a novel methodology to tackle the problem of recognizing visemes, the visual equivalent of phonemes, using a GAN to artificially lock the face view into a perfect frontal view, reducing the view-angle variability and simplifying the recognition task performed by the classification CNN.
Abstract: Deep learning methods have become the standard for Visual Speech Recognition problems due to the high accuracy reported in the literature. However, while successful works have been reported for words and sentences, recognizing shorter segments of speech, such as phones, has proven much more challenging due to the lack of temporal and contextual information. In addition, head-pose variation remains a known issue for facial analysis, with a direct impact on this problem. In this context, we propose a novel methodology to tackle the problem of recognizing visemes, the visual equivalent of phonemes, using a GAN to artificially lock the face view into a perfect frontal view, reducing the view-angle variability and simplifying the recognition task performed by our classification CNN. The GAN is trained using a large-scale synthetic 2D dataset based on realistic 3D facial models, automatically labelled for different visemes, mapping a slightly random view to a perfect frontal view. We evaluate our method using the GRID corpus, which was processed to extract viseme images and their corresponding synthetic frontal views to be further classified by our CNN model. Our results demonstrate that the additional synthetic frontal view improves accuracy by 5.9% compared with classification using the original image only.

Journal ArticleDOI
TL;DR: Speakers have the same repertoire of mouth gestures; where they differ is in how they use those gestures. A phoneme-clustering method is used to form new phoneme-to-viseme maps for both individual and multiple speakers.

Book ChapterDOI
01 Nov 2018
TL;DR: A preliminary experiment indicated that MobileNets is the most suitable algorithm for smartphone apps.
Abstract: This paper describes our preliminary study towards a new type of speech enhancement system. To avoid the use of an odd-looking electrolarynx, we employ a lip-reading function. Our ultimate goal is to use a smartphone with a camera and audio output to convert lip motion into speech output. We tested MLP, CNN, and MobileNets image recognition methods. Datasets of 3,000 images for training and testing were recorded from five persons. The preliminary experiment indicated that MobileNets is the most suitable algorithm for smartphone apps in terms of recognition accuracy and computational cost.

Proceedings ArticleDOI
01 Oct 2018
TL;DR: This work creates two large-scale synthetic 2D datasets based on realistic 3D facial models, with near-frontal and multi-view mouth images, and performs experiments indicating that a transfer learning approach using synthetic data can achieve higher accuracy than training from scratch on real data only, in both scenarios.
Abstract: Visual Speech Recognition is the ability to interpret spoken text using video information only. To address such a task automatically, recent works have employed Deep Learning and obtained high accuracy on the recognition of words and sentences uttered in controlled environments with limited head-pose variation. However, the accuracy drops for multi-view datasets, and when it comes to interpreting isolated mouth shapes, such as visemes, the reported values are considerably lower, as shorter segments of speech lack temporal and contextual information. In this work, we evaluate the applicability of synthetic datasets for assisting the recognition of visemes in real-world data acquired under controlled and uncontrolled environments, using the GRID and AVICAR datasets, respectively. We create two large-scale synthetic 2D datasets based on realistic 3D facial models, with near-frontal and multi-view mouth images. We perform experiments indicating that a transfer learning approach using synthetic data can achieve higher accuracy than training from scratch on real data only, in both scenarios.
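
A rough transfer-learning sketch in the spirit of the approach described above: pretrain a small CNN on synthetic viseme images, then fine-tune it on real data. The architecture, image size and training loop are assumptions for illustration, not the paper's model.

```python
# Sketch: pretrain on synthetic mouth crops, then fine-tune on real ones.
import torch
import torch.nn as nn

def make_model(n_visemes=14):
    return nn.Sequential(
        nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Flatten(), nn.Linear(32 * 16 * 16, n_visemes))

def train(model, images, labels, epochs=1, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(model(images), labels)
        loss.backward()
        opt.step()
    return model

model = make_model()
# 1) pretrain on cheap, automatically labelled synthetic mouth crops (random stand-ins here)
train(model, torch.randn(64, 1, 64, 64), torch.randint(0, 14, (64,)))
# 2) fine-tune on the scarcer real-world crops, e.g. extracted from GRID or AVICAR
train(model, torch.randn(16, 1, 64, 64), torch.randint(0, 14, (16,)), lr=1e-4)
```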

Proceedings ArticleDOI
28 May 2018
TL;DR: An algorithm is developed that automatically carries out spatio-temporal tracking of the tongue contour in ultrasound images recorded from one speaker who read randomized lists of VCV utterances containing the vowels /e, a/ and CVC utterances containing the consonants /k, t/ in all possible combinations, covering all phonemes.
Abstract: Visual speech plays a significant role in speech perception, especially for deaf and hard-of-hearing people or for normally hearing people in noisy environments. Lip reading relies on visible articulators to improve speech perception. However, while the movements of the lips and face provide part of the phonetic information, the motion of the tongue, which is generally not entirely visible, carries an important part of the articulatory information that is not accessible through lip reading. First, static viseme (the visual representation of a phoneme) classification of the Chinese Shaanxi Xi'an dialect was performed following the method used to classify Mandarin (Standard Chinese) static visemes. We then carried out an experiment to study both the timing and position properties of the articulatory movements of the tongue in VCV and CVC utterances of the Shaanxi Xi'an dialect at different tempos. The speech materials, consisting of VCV and CVC sequences, were recorded with the 'Micro' ultrasound system, yielding JPG images and MP4 videos with the help of the Assistant Advanced software. We developed an algorithm that automatically carries out spatio-temporal tracking of the tongue contour in the recorded ultrasound images of one speaker, who read randomized lists of VCV utterances containing the vowels /e, a/ and CVC utterances containing the consonants /k, t/ in all possible combinations, covering all phonemes (26 consonants and 41 vowels) of the Shaanxi Xi'an dialect. The extracted visual information is then classified to determine the uttered viseme and is used to create a dynamic viseme system of the tongue for the Shaanxi Xi'an dialect. This dynamic viseme system forms the basis of a talking head, an animated articulation model for Shaanxi Xi'an dialect speech. The talking head will be used in a speech assistant system supporting speech improvement for hard-of-hearing children and pronunciation training for second-language learners.
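
A deliberately simplified sketch of tongue-contour tracking in ultrasound frames, not the authors' algorithm: take the brightest pixel in each image column above a threshold as the surface estimate, then smooth the contour over time. The image dimensions, threshold and smoothing factor are assumptions.

```python
# Simplified sketch of spatio-temporal tongue-contour tracking in ultrasound frames.
import numpy as np

def tongue_contour(frame, threshold=0.5):
    """frame: 2-D array of normalized ultrasound intensities. Returns one y per column."""
    ys = np.argmax(frame, axis=0).astype(float)
    ys[frame.max(axis=0) < threshold] = np.nan     # drop columns with no clear surface
    return ys

def smooth_track(contours, alpha=0.3):
    """Exponentially smooth the contour across frames for spatio-temporal tracking."""
    track = [contours[0]]
    for c in contours[1:]:
        prev = track[-1]
        track.append(np.where(np.isnan(c), prev, alpha * c + (1 - alpha) * prev))
    return np.stack(track)

frames = [np.random.rand(128, 256) for _ in range(10)]           # stand-ins for ultrasound images
print(smooth_track([tongue_contour(f) for f in frames]).shape)   # (10, 256)
```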

Posted ContentDOI
TL;DR: Three units, visemes (visual speech units), phonemes (audible speech units), and words, are compared to determine which would best optimize a lipreading (visual speech) language model and to observe their limitations.
Abstract: Language models (LMs) are very powerful in lipreading systems. Language models built upon the ground-truth utterances of datasets learn the grammar and structure rules of words and sentences (the latter in the case of continuous speech). However, visual co-articulation effects in visual speech signals damage the performance of visual speech LMs because, visually, people do not utter what the language model expects. These models are commonplace, but while higher-order N-gram LMs may improve classification rates, their cost is disproportionate to the common goal of developing more accurate classifiers. We therefore compare which unit would best optimize a lipreading (visual speech) LM and observe their limitations. We compare three units: visemes (visual speech units), phonemes (audible speech units), and words.
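
To illustrate what comparing LM units means in practice, here is a toy sketch that trains a bigram model over each unit type (visemes, phonemes, words) and compares perplexity on held-out transcriptions. The unit sequences and the add-one smoothing are placeholders for illustration, not the paper's data or LM setup.

```python
# Toy bigram-perplexity comparison across lipreading LM units.
import math
from collections import Counter

def bigram_perplexity(train_seqs, test_seqs):
    unigrams, bigrams = Counter(), Counter()
    for seq in train_seqs:
        unigrams.update(seq)
        bigrams.update(zip(seq[:-1], seq[1:]))
    vocab = len(unigrams)
    log_prob, n = 0.0, 0
    for seq in test_seqs:
        for prev, cur in zip(seq[:-1], seq[1:]):
            # add-one smoothed bigram probability
            p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab)
            log_prob += math.log(p)
            n += 1
    return math.exp(-log_prob / max(n, 1))

units = {
    'words':    [['bin', 'blue', 'at', 'f', 'two', 'now']],
    'phonemes': [['b', 'ih', 'n', 'b', 'l', 'uw']],
    'visemes':  [['bilabial', 'spread', 'tongue', 'bilabial', 'tongue', 'round']],
}
for name, seqs in units.items():
    print(name, bigram_perplexity(seqs, seqs))
```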


Journal ArticleDOI
15 Feb 2018
TL;DR: A Text-to-Audio-Visual Indonesian system can visualize the pronunciation of Indonesian sentences synchronized with speech signals; evaluation shows that the correspondence between the visualized syllable pronunciation and the spoken voice is good.
Abstract: This paper aims to develop a Text-to-Audio-Visual Indonesian system, based on a syllable-based speech database, to support learning Indonesian pronunciation. The system can visualize the pronunciation of Indonesian sentences synchronized with speech signals. The work comprises several research stages, namely forming the Indonesian viseme models, creating the syllable-based speech database, converting text into syllables and synchronizing. The synchronization process compiles the viseme models and the speech signal based on the input text. The system was evaluated by 30 respondents who rated it based on lip-reading. Each respondent assessed 10 Indonesian sentences for the level of compatibility between the visualized syllables and the speech spoken from the text input. The MOS (Mean Opinion Score) method was used to calculate the average of the respondents' ratings. The MOS result is 4.24, which shows that the correspondence between the visualized syllable pronunciation and the spoken voice is good.

Posted Content
TL;DR: This paper uses a phoneme-clustering method to form new phoneme-to-viseme maps for both individual and multiple speakers, and uses signed-rank tests to measure the distance between individuals.
Abstract: Visual lip gestures observed whilst lipreading have a few working definitions; the two most common are 'the visual equivalent of a phoneme' and 'phonemes which are indistinguishable on the lips'. To date there is no formal definition, in part because we have not yet established a two-way relationship, or mapping, between visemes and phonemes. Some evidence suggests that visual speech is highly dependent upon the speaker. Here, we use a phoneme-clustering method to form new phoneme-to-viseme maps for both individual and multiple speakers. We test these phoneme-to-viseme maps to examine how similarly speakers talk visually, and we use signed-rank tests to measure the distance between individuals. We conclude that, broadly speaking, speakers have the same repertoire of mouth gestures; where they differ is in their use of those gestures.
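
A hedged illustration of the general idea: cluster phonemes whose visual confusion patterns are similar to form a speaker's phoneme-to-viseme map, and compare two speakers with a Wilcoxon signed-rank test. The confusion data are random stand-ins and the hierarchical clustering shown is one plausible choice; the authors' exact clustering procedure may differ.

```python
# Sketch: phoneme clustering from a visual confusion matrix, plus a signed-rank test.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
from scipy.stats import wilcoxon

phonemes = ['p', 'b', 'm', 'f', 'v', 'th', 's', 'z', 'k', 'g']
rng = np.random.default_rng(0)

def phoneme_to_viseme_map(confusion, n_visemes=4):
    """Group phonemes with similar rows of the visual confusion matrix."""
    dist = pdist(confusion, metric='correlation')
    labels = fcluster(linkage(dist, method='average'), n_visemes, criterion='maxclust')
    return dict(zip(phonemes, labels))

speaker_a = phoneme_to_viseme_map(rng.random((10, 10)))
speaker_b = phoneme_to_viseme_map(rng.random((10, 10)))
print(speaker_a)

# Distance between speakers, e.g. over paired per-phoneme recognition scores
scores_a, scores_b = rng.random(10), rng.random(10)
print(wilcoxon(scores_a, scores_b))
```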