
Showing papers on "Viseme" published in 2020


Posted ContentDOI
TL;DR: A marker-less approach for facial motion capture based on multi-view video is presented, which learns a neural representation of facial expressions, which is used to seamlessly concatenate facial performances during the animation procedure.
Abstract: Creating realistic animations of human faces with computer graphic models is still a challenging task. It is often solved either with tedious manual work or with motion capture based techniques that require specialised and costly hardware. Example-based animation approaches circumvent these problems by re-using captured data of real people. This data is split into short motion samples that can be looped or concatenated in order to create novel motion sequences. The obvious advantages of this approach are the simplicity of use and the high realism, since the data exhibits only real deformations. Rather than tuning weights of a complex face rig, the animation task is performed on a higher level by arranging typical motion samples in a way such that the desired facial performance is achieved. Two difficulties with example-based approaches, however, are high memory requirements as well as the creation of artefact-free and realistic transitions between motion samples. We solve these problems by combining the realism and simplicity of example-based animations with the advantages of neural face models. Our neural face model is capable of synthesising high quality 3D face geometry and texture according to a compact latent parameter vector. This latent representation reduces memory requirements by a factor of 100 and helps create seamless transitions between concatenated motion samples. In this paper, we present a marker-less approach for facial motion capture based on multi-view video. Based on the captured data, we learn a neural representation of facial expressions, which is used to seamlessly concatenate facial performances during the animation procedure. We demonstrate the effectiveness of our approach by synthesising mouthings for Swiss-German sign language based on viseme query sequences.

107 citations


Proceedings ArticleDOI
14 Jun 2020
TL;DR: This work describes a technique to detect manipulated videos by exploiting the fact that the dynamics of the mouth shape – visemes – are occasionally inconsistent with a spoken phoneme, and demonstrates the efficacy and robustness of this approach in detecting different types of deep-fake videos, including in-the-wild deep fakes.
Abstract: Recent advances in machine learning and computer graphics have made it easier to convincingly manipulate video and audio. These so-called deep-fake videos range from complete full-face synthesis and replacement (face-swap), to complete mouth and audio synthesis and replacement (lip-sync), and partial word-based audio and mouth synthesis and replacement. Detection of deep fakes with only a small spatial and temporal manipulation is particularly challenging. We describe a technique to detect such manipulated videos by exploiting the fact that the dynamics of the mouth shape - visemes - are occasionally inconsistent with a spoken phoneme. We focus on the visemes associated with words having the sound M (mama), B (baba), or P (papa) in which the mouth must completely close in order to pronounce these phonemes. We observe that this is not the case in many deep-fake videos. Such phoneme-viseme mismatches can, therefore, be used to detect even spatially small and temporally localized manipulations. We demonstrate the efficacy and robustness of this approach to detect different types of deep-fake videos, including in-the-wild deep fakes.
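As a concrete illustration of the cue described above, here is a minimal sketch (not the authors' implementation; the landmark indices, aperture threshold, and aligner output format are assumptions) that flags frames where an M/B/P phoneme is active but the lips are visibly open:

```python
# Minimal sketch: flag frames where a bilabial phoneme (M, B, P) is being
# spoken but the mouth is not closed. Inputs are assumed to come from a
# forced aligner (phoneme intervals) and a facial landmark tracker
# (inner-lip landmarks per frame).

import numpy as np

BILABIALS = {"M", "B", "P"}

def mouth_aperture(inner_lips: np.ndarray) -> float:
    """Vertical gap between upper and lower inner-lip landmarks,
    normalised by mouth width so the score is scale invariant."""
    top, bottom = inner_lips[2], inner_lips[6]      # assumed landmark indices
    left, right = inner_lips[0], inner_lips[4]
    width = np.linalg.norm(right - left) + 1e-8
    return np.linalg.norm(bottom - top) / width

def viseme_mismatch_frames(phone_intervals, landmarks_per_frame, fps=25.0,
                           closed_thresh=0.08):
    """Return frame indices where an M/B/P phoneme is active but the lips
    are clearly open (a possible lip-sync manipulation cue)."""
    suspicious = []
    for phone, start_s, end_s in phone_intervals:
        if phone.upper() not in BILABIALS:
            continue
        for f in range(int(start_s * fps), int(end_s * fps) + 1):
            if f < len(landmarks_per_frame):
                if mouth_aperture(landmarks_per_frame[f]) > closed_thresh:
                    suspicious.append(f)
    return suspicious

# Toy usage with fabricated landmarks: 8 inner-lip points per frame.
rng = np.random.default_rng(0)
frames = [rng.uniform(0, 1, size=(8, 2)) for _ in range(50)]
phones = [("P", 0.4, 0.6), ("AA", 0.6, 0.9)]
print(viseme_mismatch_frames(phones, frames))
```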

90 citations


Journal ArticleDOI
TL;DR: It is demonstrated that the changing frequencies of oral resonances—which are used to discriminate between speech sounds—can be predicted with unexpectedly high precision from the changing shape of the mouth during speech, and that listeners exploit this relationship to extract acoustic information from visual speech.
Abstract: Visual speech facilitates auditory speech perception, but the visual cues responsible for these benefits and the information they provide remain unclear. Low-level models emphasize basic temporal cues provided by mouth movements, but these impoverished signals may not fully account for the richness of auditory information provided by visual speech. High-level models posit interactions among abstract categorical (i.e., phonemes/visemes) or amodal (e.g., articulatory) speech representations, but require lossy remapping of speech signals onto abstracted representations. Because visible articulators shape the spectral content of speech, we hypothesized that the perceptual system might exploit natural correlations between midlevel visual (oral deformations) and auditory speech features (frequency modulations) to extract detailed spectrotemporal information from visual speech without employing high-level abstractions. Consistent with this hypothesis, we found that the time–frequency dynamics of oral resonances (formants) could be predicted with unexpectedly high precision from the changing shape of the mouth during speech. When isolated from other speech cues, speech-based shape deformations improved perceptual sensitivity for corresponding frequency modulations, suggesting that listeners could exploit this cross-modal correspondence to facilitate perception. To test whether this type of correspondence could improve speech comprehension, we selectively degraded the spectral or temporal dimensions of auditory sentence spectrograms to assess how well visual speech facilitated comprehension under each degradation condition. Visual speech produced drastically larger enhancements during spectral degradation, suggesting a condition-specific facilitation effect driven by cross-modal recovery of auditory speech spectra. The perceptual system may therefore use audiovisual correlations rooted in oral acoustics to extract detailed spectrotemporal information from visual speech.
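The core claim that formant trajectories can be predicted from mouth shape can be illustrated with a simple regression sketch; the features and synthetic data below are invented for demonstration and do not reproduce the study's analysis:

```python
# Illustrative sketch: predict formant trajectories (F1, F2) from
# oral-shape features (lip height, width, area) with a plain linear
# least-squares fit, then report how much variance the visual features
# explain. All data here is synthetic.

import numpy as np

rng = np.random.default_rng(1)
T = 500                                  # time frames
lip_height = rng.uniform(0.0, 1.0, T)
lip_width = rng.uniform(0.0, 1.0, T)
lip_area = lip_height * lip_width
X = np.column_stack([lip_height, lip_width, lip_area, np.ones(T)])

# Synthetic "formants" correlated with mouth shape plus noise.
F1 = 300 + 500 * lip_height + 30 * rng.standard_normal(T)
F2 = 2200 - 900 * lip_width + 60 * rng.standard_normal(T)
Y = np.column_stack([F1, F2])

coef, *_ = np.linalg.lstsq(X, Y, rcond=None)   # least-squares fit
Y_hat = X @ coef

ss_res = ((Y - Y_hat) ** 2).sum(axis=0)
ss_tot = ((Y - Y.mean(axis=0)) ** 2).sum(axis=0)
r2 = 1 - ss_res / ss_tot
print(f"variance explained: F1={r2[0]:.2f}, F2={r2[1]:.2f}")
```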

24 citations


Journal ArticleDOI
TL;DR: A neural network-based lip reading system, designed to lip read sentences covering a wide range of vocabulary and to recognise words that may not be included in system training, achieves a significantly improved performance with a 15% lower word error rate.
Abstract: In this paper, a neural network-based lip reading system is proposed. The system is lexicon-free and uses purely visual cues. With only a limited number of visemes as classes to recognise, the system is designed to lip read sentences covering a wide range of vocabulary and to recognise words that may not be included in system training. The system has been evaluated on the challenging BBC Lip Reading Sentences 2 (LRS2) benchmark dataset. Compared with the state-of-the-art works in lip reading sentences, the system has achieved a significantly improved performance with a 15% lower word error rate. In addition, experiments with videos of varying illumination have shown that the proposed model has good robustness to varying levels of lighting. The main contributions of this paper are: 1) The classification of visemes in continuous speech using a specially designed transformer with a unique topology; 2) The use of visemes as a classification schema for lip reading sentences; and 3) The conversion of visemes to words using perplexity analysis. All the contributions serve to enhance the accuracy of lip reading sentences. The paper also provides an essential survey of the research area.
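The viseme classification schema rests on the fact that several phonemes map to one viseme, which is why distinct words can become visually identical; the toy mapping below (a common textbook-style grouping, not necessarily the paper's schema) shows this collapse:

```python
# Toy illustration: several phonemes share one viseme class, so distinct
# words can collapse to identical viseme strings -- the ambiguity the
# paper resolves with perplexity analysis over a language model.

PHONEME_TO_VISEME = {
    "P": "V_bilabial", "B": "V_bilabial", "M": "V_bilabial",
    "F": "V_labiodental", "V": "V_labiodental",
    "T": "V_alveolar", "D": "V_alveolar", "S": "V_alveolar", "Z": "V_alveolar",
    "AE": "V_open", "AA": "V_open",
    "IY": "V_spread", "IH": "V_spread",
}

def to_visemes(phonemes):
    return [PHONEME_TO_VISEME[p] for p in phonemes]

# "pat" vs "bad": different phonemes, identical viseme sequence.
print(to_visemes(["P", "AE", "T"]))
print(to_visemes(["B", "AE", "D"]))
```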

22 citations


Journal ArticleDOI
TL;DR: The proposed model creates a 128-dimensional subspace to represent the feature vectors for speech signals and corresponding lip movements (focused viseme sequences), and can thus tackle lip reading of unconstrained natural speech in video sequences.
Abstract: Spoken keyword recognition and its localization are fundamental aspects of speech recognition, known as keyword spotting. In automatic keyword spotting systems, Lip-reading (LR) methods play a broader role when audio data is absent or corrupted. The available works from the literature have focused on recognizing a limited number of words or phrases and require a cropped region of the face or lips, whereas the proposed model does not require cropping of the video frames and is recognition-free. The proposed model utilizes Convolutional Neural Networks and Long Short-Term Memory networks to improve the overall performance. The model creates a 128-dimensional subspace to represent the feature vectors for speech signals and corresponding lip movements (focused viseme sequences). Thus the proposed model can tackle lip reading of unconstrained natural speech in video sequences. In the experiments, standard datasets such as LRW (Oxford-BBC), MIRACL-VC1, OuluVS, GRID, and CUAVE are used for evaluation. The experiments also include a comparative analysis of the proposed model with current state-of-the-art methods for the lip-reading and keyword spotting tasks. The proposed model obtains excellent results for all datasets under consideration.
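A rough sketch of the kind of joint audio-visual embedding described above is given below; the layer sizes and branch design are assumptions rather than the paper's architecture, and it only shows how two modalities can be projected into a shared 128-dimensional space and compared:

```python
# Rough sketch: two branches embed audio features and lip-region frames
# into a shared 128-dimensional space, so a spoken keyword and its viseme
# sequence can be matched by cosine similarity.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoBranch(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.cnn = nn.Sequential(                    # per-frame features
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.lstm = nn.LSTM(32, dim, batch_first=True)

    def forward(self, frames):                       # (B, T, 1, H, W)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).flatten(1).view(b, t, -1)
        _, (h, _) = self.lstm(feats)
        return F.normalize(h[-1], dim=-1)            # (B, 128)

class AudioBranch(nn.Module):
    def __init__(self, n_mels=40, dim=128):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, dim, batch_first=True)

    def forward(self, mels):                         # (B, T, n_mels)
        _, (h, _) = self.lstm(mels)
        return F.normalize(h[-1], dim=-1)

video, audio = VideoBranch(), AudioBranch()
v = video(torch.randn(2, 20, 1, 48, 48))             # 20 lip frames
a = audio(torch.randn(2, 80, 40))                     # 80 mel frames
print(F.cosine_similarity(v, a))                      # match scores
```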

8 citations


Journal ArticleDOI
TL;DR: The results conclusively demonstrate that computer-generated speech stimuli are judicious, and that they can supplement natural speech with higher control over stimulus timing and content.
Abstract: Natural speech is processed in the brain as a mixture of auditory and visual features. An example of the importance of visual speech is the McGurk effect and related perceptual illusions that result from mismatching auditory and visual syllables. Although the McGurk effect has widely been applied to the exploration of audio-visual speech processing, it relies on isolated syllables, which severely limits the conclusions that can be drawn from the paradigm. In addition, the extreme variability and the quality of the stimuli usually employed prevents comparability across studies. To overcome these limitations, we present an innovative methodology using 3D virtual characters with realistic lip movements synchronized on computer-synthesized speech. We used commercially accessible and affordable tools to facilitate reproducibility and comparability, and the set-up was validated on 24 participants performing a perception task. Within complete and meaningful French sentences, we paired a labiodental fricative viseme (i.e. /v/) with a bilabial occlusive phoneme (i.e. /b/). This audiovisual mismatch is known to induce the illusion of hearing /v/ in a proportion of trials. We tested the rate of the illusion while varying the magnitude of background noise and audiovisual lag. Overall, the effect was observed in 40% of trials. The proportion rose to about 50% with added background noise and up to 66% when controlling for phonetic features. Our results conclusively demonstrate that computer-generated speech stimuli are judicious, and that they can supplement natural speech with higher control over stimulus timing and content.

6 citations


Posted Content
25 Apr 2020
TL;DR: This work focuses on interpreting how a neural audiovisual speech enhancement model uses visual cues to improve the quality of the target speech signal and demonstrates the effectiveness of the learned visual representations for classifying visemes (the visual analogy to phonemes).
Abstract: We present an introspection of an audiovisual speech enhancement model. In particular, we focus on interpreting how a neural audiovisual speech enhancement model uses visual cues to improve the quality of the target speech signal. We show that visual features provide not only high-level information about speech activity, i.e. speech vs. no speech, but also fine-grained visual information about the place of articulation. An interesting byproduct of this finding is that the learned visual embeddings can be used as features for other visual speech applications. We demonstrate the effectiveness of the learned visual representations for classifying visemes (the visual analogy to phonemes). Our results provide insight into important aspects of audiovisual speech enhancement and demonstrate how such models can be used for self-supervision tasks for visual speech applications.
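The probing idea, training a simple classifier on the frozen visual embeddings to check how much viseme information they carry, can be sketched as follows (synthetic embeddings stand in for the enhancement model's real ones):

```python
# Minimal sketch of linear probing: train a logistic-regression classifier
# on frozen visual embeddings and read off how well visemes can be
# recovered. The embeddings below are synthetic.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_classes, dim, n_per_class = 6, 256, 200

# Pretend each viseme class occupies a slightly shifted region of the
# embedding space learned by the enhancement model.
centers = rng.standard_normal((n_classes, dim))
X = np.vstack([c + 0.8 * rng.standard_normal((n_per_class, dim)) for c in centers])
y = np.repeat(np.arange(n_classes), n_per_class)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
print(f"viseme probe accuracy: {probe.score(X_te, y_te):.2f}")
```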

5 citations


Posted Content
TL;DR: A view-temporal attention mechanism is proposed to model both the view dependence and the visemic importance in speech recognition and understanding, and a strong correlation is shown between the model's understanding of multi-view speech and human perception.
Abstract: Speech as a natural signal is composed of three parts - visemes (the visual part of speech), phonemes (the spoken part of speech), and language (the imposed structure). However, video as a medium for the delivery of speech and as a multimedia construct has mostly ignored the cognitive aspects of speech delivery. For example, video applications such as transcoding and compression have so far ignored how speech is delivered and heard. To close the gap between speech understanding and multimedia video applications, in this paper we present initial experiments that model the perception of visual speech and show its use case in video compression. On the other hand, in the visual speech recognition domain, existing studies have mostly modeled it as a classification problem, while ignoring the correlations between views, phonemes, visemes, and speech perception. This results in solutions that are further away from how human perception works. To bridge this gap, we propose a view-temporal attention mechanism to model both the view dependence and the visemic importance in speech recognition and understanding. We conduct experiments on three public visual speech recognition datasets. The experimental results show that our proposed method outperforms existing work by 4.99% in terms of viseme error rate. Moreover, we show that there is a strong correlation between our model's understanding of multi-view speech and human perception. This characteristic benefits downstream applications such as video compression and streaming, where a significant number of less important frames can be compressed or eliminated while maximally preserving human speech understanding and a good user experience.
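A conceptual sketch of a view-temporal attention layer follows; it is not the paper's exact design, only an illustration of attending over camera views first and over frames second so that visemically important frames receive higher weight:

```python
# Conceptual sketch: per-frame features from several camera views are
# combined with learned attention over views, then attention over time
# weights the visemically important frames.

import torch
import torch.nn as nn

class ViewTemporalAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.view_score = nn.Linear(dim, 1)       # scores each view
        self.time_score = nn.Linear(dim, 1)       # scores each frame

    def forward(self, x):                          # (B, T, V, D)
        w_view = torch.softmax(self.view_score(x), dim=2)      # over views
        fused = (w_view * x).sum(dim=2)                         # (B, T, D)
        w_time = torch.softmax(self.time_score(fused), dim=1)   # over time
        clip = (w_time * fused).sum(dim=1)                      # (B, D)
        return clip, w_time.squeeze(-1)    # temporal weights show frame importance

attn = ViewTemporalAttention(dim=64)
features = torch.randn(2, 30, 3, 64)               # 2 clips, 30 frames, 3 views
clip_repr, frame_importance = attn(features)
print(clip_repr.shape, frame_importance.shape)      # (2, 64) (2, 30)
```

The temporal weights are the part that would feed a compression policy: frames with low attention can be dropped or compressed more aggressively.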

4 citations


Journal ArticleDOI
TL;DR: A lip matching scheme based on vowel priority and a similarity evaluation model based on the Manhattan distance over computer-vision lip features are proposed; the model quantifies lip shape similarity on a 0–1 scale and provides an effective evaluation standard.
Abstract: At present, the significance of humanoid robots has dramatically increased, yet such robots rarely enter human life because of their immature development. The lip shape of humanoid robots is crucial in the speech process, since it makes humanoid robots look like real humans. Many studies show that vowels are the essential elements of pronunciation in all languages of the world. Building on traditional viseme research, we increase the priority of smooth lip transitions between vowels and propose a lip matching scheme based on vowel priority. Additionally, we designed a similarity evaluation model based on the Manhattan distance over computer-vision lip features, which quantifies lip shape similarity on a 0–1 scale and provides an effective evaluation standard. Notably, this model compensates for the shortcomings of existing lip shape similarity evaluation criteria in this field. We applied this lip-matching scheme to the Ren-Xin humanoid robot and performed robot teaching experiments, as well as a similarity comparison experiment on 20 sentences with two male speakers, two female speakers, and the robot. All the experiments achieved excellent results.
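The evaluation model reduces to a normalised L1 comparison of lip feature vectors; a small sketch follows, in which the specific feature set and normalisation are assumptions:

```python
# Small sketch of the scoring idea: compare robot and human lip-feature
# vectors with the Manhattan (L1) distance and map the result to a 0-1
# similarity score.

import numpy as np

def lip_similarity(feat_a: np.ndarray, feat_b: np.ndarray) -> float:
    """Both inputs are lip feature vectors (e.g. mouth width, inner/outer
    height, corner angles) scaled to [0, 1] per dimension beforehand."""
    l1 = np.abs(feat_a - feat_b).sum()
    max_l1 = feat_a.shape[0]                 # worst case when features are in [0, 1]
    return 1.0 - l1 / max_l1                 # 1.0 = identical, 0.0 = maximally different

human = np.array([0.62, 0.35, 0.18, 0.50])
robot = np.array([0.58, 0.40, 0.22, 0.47])
print(f"lip shape similarity: {lip_similarity(human, robot):.3f}")
```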

4 citations


Proceedings ArticleDOI
07 Dec 2020
TL;DR: A marker-less approach for facial motion capture based on multi-view video is presented, which learns a neural representation of facial expressions, which is used to seamlessly concatenate facial performances during the animation procedure.
Abstract: Creating realistic animations of human faces with computer graphic models is still a challenging task. It is often solved either with tedious manual work or with motion capture based techniques that require specialised and costly hardware. Example-based animation approaches circumvent these problems by re-using captured data of real people. This data is split into short motion samples that can be looped or concatenated in order to create novel motion sequences. The obvious advantages of this approach are the simplicity of use and the high realism, since the data exhibits only real deformations. Rather than tuning weights of a complex face rig, the animation task is performed on a higher level by arranging typical motion samples in a way such that the desired facial performance is achieved. Two difficulties with example-based approaches, however, are high memory requirements as well as the creation of artefact-free and realistic transitions between motion samples. We solve these problems by combining the realism and simplicity of example-based animations with the advantages of neural face models. Our neural face model is capable of synthesising high quality 3D face geometry and texture according to a compact latent parameter vector. This latent representation reduces memory requirements by a factor of 100 and helps create seamless transitions between concatenated motion samples. In this paper, we present a marker-less approach for facial motion capture based on multi-view video. Based on the captured data, we learn a neural representation of facial expressions, which is used to seamlessly concatenate facial performances during the animation procedure. We demonstrate the effectiveness of our approach by synthesising mouthings for Swiss-German sign language based on viseme query sequences.

3 citations


Journal ArticleDOI
09 Sep 2020
TL;DR: The results show that viseme mapping preceded by allophonic pre-processing yields more accurate performance than other maps.
Abstract: The lip synchronization technology of animation can run automatically through a phoneme-to-viseme map. Since the complexity of facial muscles causes the shape of the mouth to vary greatly, phoneme-to-viseme mapping always presents challenging problems. One of them is the allophone vowel problem: the resemblance between allophones leads many researchers to cluster them into one class. This paper discusses the certainty of allophone vowels as a variable of the phoneme-to-viseme map. The proposed vowel-allophone pre-processing extracts formant frequency features, which are then compared with a t-test to determine the significance of their differences. The results of pre-processing are then used as initial reference data when building phoneme-to-viseme maps. This research was conducted on maps and allophones of the Indonesian language. The maps that have been built are then compared with other maps using an HMM method, in terms of word correctness and accuracy. The results show that viseme mapping preceded by allophonic pre-processing is more accurate than other maps.
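The pre-processing step, testing whether two vowel allophones differ significantly in their formant features before deciding to merge them into one viseme class, can be sketched with synthetic data as follows:

```python
# Illustrative sketch (synthetic formant values, not the paper's data):
# decide whether two vowel allophones differ significantly in their
# first-formant frequency with an independent-samples t-test; if they do
# not, they can be merged into one viseme class.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
allophone_a_f1 = rng.normal(loc=420, scale=35, size=60)   # Hz, hypothetical
allophone_b_f1 = rng.normal(loc=430, scale=35, size=60)

t_stat, p_value = stats.ttest_ind(allophone_a_f1, allophone_b_f1)
merge = p_value >= 0.05
print(f"t = {t_stat:.2f}, p = {p_value:.3f} -> "
      f"{'merge into one viseme class' if merge else 'keep as separate classes'}")
```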

Proceedings ArticleDOI
24 Sep 2020
TL;DR: The 3D talking head produces animated movements that are easily understood by deaf people and has the potential to be developed further.
Abstract: This research aims to build an Android application that helps deaf people learn the Indonesian Sign Language System (SIBI). The application is packaged in the form of a 3D animated talking head. In this research, the talking head is developed using the dynamic viseme method. Dynamic visemes have the advantage of creating more natural animated movement because an original human model is used in the building process. Using the Dirichlet Free-Form Deformation (DFFD) method as control, the concatenative method is able to produce animated movements that are easily understood by deaf people. Linear interpolation was also used so that the displacement of the speaker's mouth becomes more natural. To build the animated movements, four different human models were used. Experimental testing was carried out using five surveys, and the results were calculated using the Mean Opinion Score (MOS) method. The purpose of the surveys was to determine which model's mouth movements are easiest to understand. The test results on Indonesian sentences show a score of 4.25. These results indicate that the 3D talking head has the potential to be developed so that it can be understood by deaf people.

Proceedings ArticleDOI
26 Sep 2020
TL;DR: A lip-reading method is examined that recognizes words registered by the user in advance and is optimized for that user with a small amount of data; it is appropriate for embedding in mobile devices in consideration of both usability and small-vocabulary recognition accuracy.
Abstract: We have been developing a practical speech enhancement system that supports laryngectomees. By interviewing users, we captured essential issues, such as "utilization of existing devices", "the appearance needs to be inconspicuous", and "the device should be easy to use". Considering those users' needs, we plan to use a smartphone platform and develop a speech enhancement application so that users look just ordinary and there is no need to buy any additional device. To realize such a system, the key concept of our proposed system is to perform lip-reading and speech synthesis. In this study, we examined a lip-reading method that recognizes words the user registers in advance and that is optimized for the user using a small amount of data. 36 viseme images were converted into very compact representations using a VAE (Variational Autoencoder), and the training data for the word recognition model was then generated. A viseme is a group of phonemes with identical appearance on the lips. Our VAE-based viseme sequence representation was used so that the system can adapt to users with a very small training data set. A word recognition experiment using the VAE encoder and a CNN was performed with 20 Japanese words. The experimental result showed 65% recognition accuracy, and 100% when including the first and second candidates. Lip-reading-based speech enhancement thus seems appropriate for embedding in mobile devices, in consideration of both usability and small-vocabulary recognition accuracy.
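A very compact sketch of the pipeline (a VAE encoder compressing viseme frames, followed by a small CNN over the code sequence for word recognition) is given below; layer sizes, image resolution, and latent dimensionality are guesses rather than the paper's settings:

```python
# Compact sketch: a VAE encoder compresses each viseme image to a small
# latent code, and a lightweight CNN over the per-frame codes recognises
# one of the registered words, so user adaptation needs little data.

import torch
import torch.nn as nn

class VisemeEncoder(nn.Module):            # VAE encoder half
    def __init__(self, latent=8):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.ReLU(), nn.Flatten())
        self.mu = nn.Linear(32 * 16 * 16, latent)
        self.logvar = nn.Linear(32 * 16 * 16, latent)

    def forward(self, x):                  # (B, 1, 64, 64)
        h = self.conv(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return z                            # tiny per-frame code

class WordClassifier(nn.Module):            # small CNN over the code sequence
    def __init__(self, latent=8, n_words=20):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(latent, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, n_words))

    def forward(self, codes):                # (B, n_frames, latent)
        return self.net(codes.transpose(1, 2))   # Conv1d expects (B, latent, n_frames)

enc, clf = VisemeEncoder(), WordClassifier()
frames = torch.randn(4, 20, 1, 64, 64)       # 4 utterances, 20 mouth frames each
codes = enc(frames.flatten(0, 1)).view(4, 20, -1)
print(clf(codes).shape)                       # (4, 20): logits over 20 registered words
```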

Journal ArticleDOI
TL;DR: This study tested whether speech sounds can be predicted on the basis of visemic information only, and to what extent interfering with orofacial articulatory effectors can affect these predictions, and found that interfering with the motor articulatory system strongly disrupted cross-modal predictions.
Abstract: The human brain generates predictions about future events. During face-to-face conversations, visemic information is used to predict upcoming auditory input. Recent studies suggest that the speech motor system plays a role in these cross-modal predictions, however, usually only audio-visual paradigms are employed. Here we tested whether speech sounds can be predicted on the basis of visemic information only, and to what extent interfering with orofacial articulatory effectors can affect these predictions. We registered EEG and employed N400 as an index of such predictions. Our results show that N400's amplitude was strongly modulated by visemic salience, coherent with cross-modal speech predictions. Additionally, N400 ceased to be evoked when syllables' visemes were presented backwards, suggesting that predictions occur only when the observed viseme matched an existing articuleme in the observer's speech motor system (i.e., the articulatory neural sequence required to produce a particular phoneme/viseme). Importantly, we found that interfering with the motor articulatory system strongly disrupted cross-modal predictions. We also observed a late P1000 that was evoked only for syllable-related visual stimuli, but whose amplitude was not modulated by interfering with the motor system. The present study provides further evidence of the importance of the speech production system for speech sounds predictions based on visemic information at the pre-lexical level. The implications of these results are discussed in the context of a hypothesized trimodal repertoire for speech, in which speech perception is conceived as a highly interactive process that involves not only your ears but also your eyes, lips and tongue.

Proceedings ArticleDOI
12 Oct 2020
TL;DR: This work proposes a method for exaggerated visual-speech feedback in computer-assisted pronunciation training (CAPT) that outperforms the non-exaggerated version in helping learners with pronunciation identification and pronunciation improvement.
Abstract: To provide more discriminative feedback for second language (L2) learners to better identify their mispronunciations, we propose a method for exaggerated visual-speech feedback in computer-assisted pronunciation training (CAPT). The speech exaggeration is realized by an emphatic speech generation neural network based on Tacotron, while the visual exaggeration is accomplished by ADC Viseme Blending, namely increasing the Amplitude of movement, extending the phone's Duration, and enhancing the color Contrast. User studies show that the exaggerated feedback outperforms the non-exaggerated version in helping learners with pronunciation identification and pronunciation improvement.
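The ADC idea can be illustrated with three small transformations; the parameter values below are invented and the curves are toy data, but they show what increasing Amplitude, extending Duration, and enhancing Contrast mean operationally:

```python
# Toy sketch of the ADC idea: exaggerate a viseme by increasing the
# Amplitude of mouth movement, extending its Duration, and enhancing the
# image Contrast of the mouth region.

import numpy as np

def exaggerate_amplitude(mouth_open_curve, gain=1.4):
    """Scale mouth-opening values about the neutral (closed) position."""
    return np.clip(mouth_open_curve * gain, 0.0, 1.0)

def exaggerate_duration(mouth_open_curve, stretch=1.5):
    """Resample the curve onto a longer time axis (linear interpolation)."""
    n = len(mouth_open_curve)
    new_t = np.linspace(0, n - 1, int(n * stretch))
    return np.interp(new_t, np.arange(n), mouth_open_curve)

def enhance_contrast(mouth_pixels, factor=1.3):
    """Push pixel values away from mid-grey to make articulation clearer."""
    return np.clip(0.5 + (mouth_pixels - 0.5) * factor, 0.0, 1.0)

curve = np.array([0.0, 0.2, 0.6, 0.9, 0.5, 0.1, 0.0])   # one viseme's opening
print(exaggerate_amplitude(curve))
print(exaggerate_duration(curve).round(2))
```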

Posted Content
28 Nov 2020
TL;DR: This paper proposed a method to tackle the one-to-many mapping problem when performing automated lip reading using solely visual cues in two separate scenarios: the first scenario is where the word boundary, that is, the beginning and the ending of a word, is unknown; and the second scenario was where the boundary is known.
Abstract: The performance of automated lip reading using visemes as a classification schema has achieved less success compared with the use of ASCII characters and words, largely due to the problem of different words sharing identical visemes. The Generative Pre-trained Transformer (GPT) is an effective autoregressive language model used for many tasks in Natural Language Processing, including sentence prediction and text classification. This paper proposes a new application for this model and applies it in the context of lip reading, where it serves as a language model to convert visual speech in the form of visemes to language in the form of words and sentences. The network searches for the optimal perplexity to perform the viseme-to-word mapping and is thus a solution to the one-to-many mapping problem whereby various words that sound different when spoken look identical. This paper proposes a method to tackle this one-to-many mapping problem when performing automated lip reading using solely visual cues in two separate scenarios: the first scenario is where the word boundary, that is, the beginning and the ending of a word, is unknown; and the second scenario is where the boundary is known. Sentences from the benchmark BBC dataset "Lip Reading Sentences in the Wild" (LRS2) are classified with a character error rate of 10.7% and a word error rate of 18.0%. The main contribution of this paper is to propose a method of predicting words through the use of perplexity analysis when only visual cues are present, using an autoregressive language model.
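The selection rule, choosing among word sequences consistent with an ambiguous viseme string by minimising language-model perplexity, is sketched below; a tiny add-one-smoothed bigram model stands in for the GPT model used in the paper:

```python
# Schematic only: among word sequences consistent with the observed
# viseme string, pick the one the language model finds most plausible,
# i.e. the one with the lowest perplexity.

import math
from collections import Counter

corpus = "we met at the bet we bet on the pet he met the vet".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
vocab = len(unigrams)

def perplexity(words):
    """Add-one smoothed bigram perplexity of a candidate sentence."""
    log_p = 0.0
    for prev, cur in zip(words, words[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab)
        log_p += math.log(p)
    return math.exp(-log_p / max(len(words) - 1, 1))

# "bet", "met", "pet" share the same viseme string; context decides.
candidates = [["we", "met", "at", "the", "bet"],
              ["we", "bet", "at", "the", "met"],
              ["we", "pet", "at", "the", "bet"]]
best = min(candidates, key=perplexity)
print(" ".join(best), f"(perplexity {perplexity(best):.1f})")
```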

Posted ContentDOI
31 Jul 2020-bioRxiv
TL;DR: Listeners were asked to detect pitch modulation in target speech with upright and inverted faces that matched either the target or the masker speech, such that performance differences could be explained by binding, an early multisensory integration mechanism distinct from traditional late integration.
Abstract: When listening is difficult, seeing the face of the talker aids speech comprehension. Faces carry both temporal (low-level physical correspondence of mouth movement and auditory speech) and linguistic (learned physical correspondences of mouth shape (viseme) and speech sound (phoneme)) cues. Listeners participated in two experiments investigating how these cues may be used to process sentences when maskers are present. In Experiment I, faces were rotated to disrupt linguistic but not temporal cue correspondence. Listeners suffered a deficit in speech comprehension when the faces were rotated, indicating that visemes are processed in a rotation-dependent manner, and that linguistic cues aid comprehension. In Experiment II, listeners were asked to detect pitch modulation in the target speech with upright and inverted faces that either matched the target or masker speech such that performance differences could be explained by binding, an early multisensory integration mechanism distinct from traditional late integration. Performance in this task replicated previous findings that temporal integration induces binding, but there was no behavioral evidence for a role of linguistic cues in binding. Together these experiments point to temporal cues providing a speech processing benefit through binding and linguistic cues providing a benefit through late integration.

Proceedings ArticleDOI
04 Nov 2020
TL;DR: In this article, the authors presented a study on how to map from acoustic speech to visual speech with the goal of generating perceptually natural speech animation, which achieved 68.8% rating accuracy and a 70.8% ranking accuracy.
Abstract: Lip synchronization, also known as visual speech animation, is the process of matching speech with lip movements. Visual speech animation has a great impact on the gaming and animation film industries, because it provides a realistic experience to users. Furthermore, this technology also supports better communication for deaf people. For most European languages, lip synchronization models have been developed and used widely in the entertainment industries. However, no research experiments have yet been conducted towards speech animation of the Sinhala language; limited research activity and the unavailability of resources have been the main reasons for this. This research focuses on the problem of achieving a lip synchronization model for the Sinhala language. The project presents a study on how to map from acoustic speech to visual speech with the goal of generating perceptually natural speech animation. The experiments on developing a viseme alphabet were carried out using a static viseme approach on a video data set created by the author. The implemented lip synchronization model was evaluated using a subjective evaluation based on six different categories. The generated model using the static viseme approach achieved a 68.8% rating accuracy and a 70.8% ranking accuracy. The model performs well for individual words and short sentences rather than long sentences and sentences that are uttered at different speed levels.

Patent
17 Feb 2020
TL;DR: In this paper, an avatar visual conversion device and a message conversion method capable of providing a new user experience by expressing a text message in V-moji in a messenger service is presented.
Abstract: The present invention relates to an avatar visual conversion device and a message conversion method capable of providing a new user experience by expressing a text message in V-moji in a messenger service. Specifically, when text is input to a caller terminal, the visual conversion device analyzes meaning to set emotion coordinates and uses the emotion coordinates to generate an animation code for expressing an emotion of an avatar. The visual conversion device also extracts text for text-to-speech (TTS) to generate voice data and generates a viseme code for expressing a viseme for each phoneme. A receiver terminal controls a visual image of the avatar based on the viseme code and the animation code while outputting the voice data.

Journal ArticleDOI
01 Jun 2020-Heliyon
TL;DR: Lip shapes during movement are more uniform between individuals and resting morphological lip shape does not influence movement of the lips.

Journal ArticleDOI
TL;DR: This paper presents a new approach to creating speech animation with emotional expressions using a small set of example models; it can be applied to diverse types of digital content and applications that use facial animation, with high accuracy (over 90%) in speech recognition.
Abstract: In this paper, we present a new approach to creating speech animation with emotional expressions using a small set of example models. To generate realistic facial animation, two example models called key visemes and expressions are used for lip-synchronization and facial expressions, respectively. The key visemes represent lip shapes of phonemes such as vowels and consonants while the key expressions represent basic emotions of a face. Our approach utilizes a text-to-speech (TTS) system to create a phonetic transcript for the speech animation. Based on a phonetic transcript, a sequence of speech animation is synthesized by interpolating the corresponding sequence of key visemes. Using an input parameter vector, the key expressions are blended by a method of scattered data interpolation. During the synthesizing process, an importance-based scheme is introduced to combine both lip-synchronization and facial expressions into one animation sequence in real time (over 120Hz). The proposed approach can be applied to diverse types of digital content and applications that use facial animation with high accuracy (over 90%) in speech recognition.
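The synthesis loop, interpolating key visemes along a phonetic transcript and then blending a key expression with an importance weight, can be sketched as follows; the parameter vectors, timings, and blending rule are simplified assumptions rather than the paper's scattered-data interpolation:

```python
# Simplified sketch: synthesise a mouth-parameter track by interpolating
# key visemes along a phonetic transcript, then blend in a key expression
# with an importance weight so emotion does not overwrite lip-sync.

import numpy as np

KEY_VISEMES = {                     # tiny mouth parameter vectors [open, wide, round]
    "sil": np.array([0.0, 0.2, 0.0]),
    "AA":  np.array([0.9, 0.4, 0.1]),
    "M":   np.array([0.0, 0.3, 0.0]),
    "UW":  np.array([0.3, 0.1, 0.9]),
}
SMILE = np.array([0.1, 0.9, 0.0])   # key expression

def viseme_track(transcript, fps=30):
    """transcript: list of (viseme, start_s, end_s); returns per-frame params."""
    end = transcript[-1][2]
    keys = [(0.5 * (s + e), KEY_VISEMES[v]) for v, s, e in transcript]
    times = np.array([t for t, _ in keys])
    values = np.stack([v for _, v in keys])
    frames = []
    for f in range(int(end * fps)):
        t = f / fps
        frames.append(np.array([np.interp(t, times, values[:, d])
                                for d in range(values.shape[1])]))
    return np.stack(frames)

def blend_expression(track, expression, importance=0.3):
    """Low importance keeps lip-sync dominant; the expression fills the rest."""
    return (1 - importance) * track + importance * expression

track = viseme_track([("sil", 0.0, 0.1), ("M", 0.1, 0.25), ("AA", 0.25, 0.5)])
print(blend_expression(track, SMILE).shape)     # (frames, 3)
```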


06 Jan 2020
TL;DR: A novel and efficient Mandarin Chinese CS system is proposed, satisfying the main criterion that the hand coding constitutes a complement to the lips' movements; vowels [i, u, y] are coded as semiconsonants when they are followed by other Mandarin finals, which reduces the number of Mandarin finals to be coded from 36 to 16.
Abstract: Cued Speech (CS) is a communication system developed for deaf people, which exploits hand cues to complement speechreading at the phonetic level. Currently, it is estimated that CS has been adapted to over 60 languages; however, no official CS system is available for Mandarin Chinese. This article proposes a novel and efficient Mandarin Chinese CS system, satisfying the main criterion that the hand coding constitutes a complement to the lips’ movements. We propose to code vowels [i, u, y] as semiconsonants when they are followed by other Mandarin finals, which reduces the number of Mandarin finals to be coded from 36 to 16. We establish a coherent similarity between Mandarin Chinese and French vowels for the remaining 16 finals/vowels, which allows us to take advantage of the French CS system. Furthermore, by investigating the lips viseme distribution based on a new corpus, an optimal allocation of the 16 Mandarin vowels to different hand positions is obtained. A Gaussian classifier was used to evaluate the average separability of different allocated vowel groups, which gives 92.08%, 92.33%, and 92.73% for the three speakers, respectively. The consonants are mainly designed according to their similarities with the French CS system, as well as some considerations on the special Mandarin consonants. In our system, the tones of Mandarin are coded with head movements.
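The separability measurement can be approximated by cross-validating a Gaussian classifier on lip features grouped by hand position; the sketch below uses synthetic features rather than the paper's corpus:

```python
# Rough analogue: estimate how separable the vowel groups assigned to
# different hand positions are by cross-validating a Gaussian classifier
# on lip features; high accuracy means lips alone distinguish the groups.

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
n_groups, n_samples, n_feats = 5, 120, 6        # e.g. lip width/height/area...

X = np.vstack([rng.normal(loc=g, scale=1.2, size=(n_samples, n_feats))
               for g in range(n_groups)])
y = np.repeat(np.arange(n_groups), n_samples)

scores = cross_val_score(GaussianNB(), X, y, cv=5)
print(f"average separability (accuracy): {scores.mean():.2%}")
```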

Proceedings ArticleDOI
Jayanth Shreekumar, Ganesh K Shet, Vijay P N, Preethi S J, Niranjana Krupa
16 Nov 2020
TL;DR: The authors use GANs to generate synthetic images for data augmentation in visual speech recognition (VSR), with VGG16 used for classification both before and after augmentation.
Abstract: The proliferation of convolutional neural networks (CNN) has resulted in increased interest in the field of visual speech recognition (VSR). However, while VSR for word-level and sentence-level classification has received much of this attention, recognition of visemes has remained relatively unexplored. This paper focuses on the visemic approach for VSR as it can be used to build language-independent models. Our method employs generative adversarial networks (GANs) to create synthetic images that are used for data augmentation. VGG16 is used for classification both before and after augmentation. The results obtained prove that data augmentation using GANs is a viable technique for improving the performance of VSR models. Augmenting the dataset with images generated using the Progressive Growing Generative Adversarial Network (PGGAN) model led to an average increase in test accuracy of 3.695% across speakers. An average increase in test accuracy of 2.59% was achieved by augmenting the dataset using images generated by the conditional Deep Convolutional Generative Adversarial Network (DCGAN) model.

Patent
12 Mar 2020
TL;DR: In this paper, a method was proposed to extract a plurality of viseme features using the first images of a human speaking an utterance, wherein each first image has depth information.
Abstract: In an embodiment, a method includes receiving a plurality of first images including at least a mouth-related portion of a human speaking an utterance, wherein each first image has depth information; extracting a plurality of viseme features using the first images, wherein one of the viseme features is obtained using depth information of a tongue of the human in the depth information of a first image of the first images; determining a sequence of words corresponding to the utterance using the viseme features, wherein the sequence of words comprises at least one word; and outputting, by a human-machine interface (HMI) outputting module, a response using the sequence of words.

Posted Content
TL;DR: In this paper, the authors present an introspection of an audiovisual speech enhancement model and demonstrate the effectiveness of the learned visual embeddings for classifying visemes (the visual analogy to phonemes).
Abstract: We present an introspection of an audiovisual speech enhancement model. In particular, we focus on interpreting how a neural audiovisual speech enhancement model uses visual cues to improve the quality of the target speech signal. We show that visual cues provide not only high-level information about speech activity, i.e., speech/silence, but also fine-grained visual information about the place of articulation. One byproduct of this finding is that the learned visual embeddings can be used as features for other visual speech applications. We demonstrate the effectiveness of the learned visual embeddings for classifying visemes (the visual analogy to phonemes). Our results provide insight into important aspects of audiovisual speech enhancement and demonstrate how such models can be used for self-supervision tasks for visual speech applications.

Patent
05 Mar 2020
TL;DR: In this article, a detection system assesses whether a person viewed by a computer-based system is a live person or not, using an interface configured to receive a video stream; a word, letter, character or digit generator subsystem configured to generate and output one or more words, letters, characters or digits to an end-user; and a computer vision subsystem.
Abstract: A detection system assesses whether a person viewed by a computer-based system is a live person or not. The system has an interface configured to receive a video stream; a word, letter, character or digit generator subsystem configured to generate and output one or more words, letters, characters or digits to an end-user; and a computer vision subsystem. The computer vision subsystem is configured to analyse the video stream received, and to determine, using a lip reading or viseme processing subsystem, if the end-user has spoken or mimed the or each word, letter, character or digit, and to output a confidence score that the end-user is a "live" person or not.

Patent
21 May 2020
TL;DR: In this article, a computing device accesses video frames depicting a person performing gestures usable for generating a layered puppet, including a viseme gesture corresponding to a target sound or phoneme.
Abstract: Certain embodiments involve automatically detecting video frames that depict visemes and that are usable for generating an animatable puppet. For example, a computing device accesses video frames depicting a person performing gestures usable for generating a layered puppet, including a viseme gesture corresponding to a target sound or phoneme. The computing device determines that audio data including the target sound or phoneme aligns with a particular video frame from the video frames that depicts the person performing the viseme gesture. The computing device creates, from the video frames, a puppet animation of the gestures, including an animation of the viseme corresponding to the target sound or phoneme that is generated from the particular video frame. The computing device outputs the puppet animation to a presentation device.

Patent
18 May 2020
TL;DR: A method of generating a head animation model from a speech signal, and an electronic computing device implementing the method, are presented, capable of providing head animation from a real-time speech signal with low delay and high image quality.
Abstract: FIELD: computer engineering. SUBSTANCE: the invention relates to a method for generating a head animation model from a speech signal: receiving a speech signal; converting the speech signal into a set of speech signal attributes; extracting speech signal attributes from the set; obtaining a phoneme sequence and a viseme sequence; calculating animation curves by trained artificial intelligence means; merging the phoneme and viseme sequences by superimposing them while taking the calculated animation curves into account; and forming an animation of the head model by animating the visemes in the combined phoneme and viseme sequence using the calculated animation curves. EFFECT: the technical result is a method of generating a head animation model from a speech signal, and an electronic computing device implementing the method, capable of providing head animation from a real-time speech signal with low delay and high image quality. 9 claims, 2 drawings.

Posted Content
12 Sep 2020
TL;DR: This work proposes a method for exaggerated visual-speech feedback in computer-assisted pronunciation training (CAPT) that outperforms the non-exaggerated version in helping learners with pronunciation identification and pronunciation improvement.
Abstract: To provide more discriminative feedback for second language (L2) learners to better identify their mispronunciations, we propose a method for exaggerated visual-speech feedback in computer-assisted pronunciation training (CAPT). The speech exaggeration is realized by an emphatic speech generation neural network based on Tacotron, while the visual exaggeration is accomplished by ADC Viseme Blending, namely increasing the Amplitude of movement, extending the phone's Duration, and enhancing the color Contrast. User studies show that the exaggerated feedback outperforms the non-exaggerated version in helping learners with pronunciation identification and pronunciation improvement.