
Showing papers on "Viseme published in 2022"


Proceedings ArticleDOI
23 Feb 2022
TL;DR: In this article, the authors use external text data (for viseme-to-character mapping) by dividing video-to-character conversion into two stages, namely converting video to visemes and then converting visemes to characters using separate models.
Abstract: Lip reading is the task of recognizing speech from lip movements. It is difficult because the lip movements for some words are similar when they are pronounced. Visemes are used to describe lip movements during a conversation. This paper shows how to use external text data (for viseme-to-character mapping) by dividing video-to-character conversion into two stages, namely converting video to visemes and then converting visemes to characters using separate models. Our proposed method improves the word error rate by an absolute 4% compared to a typical sequence-to-sequence lip-reading model on the BBC-Oxford Lip Reading dataset (LRS2).

2 citations
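As a rough illustration of the two-stage design described in this abstract, the sketch below chains a video-to-viseme model and a viseme-to-character model. The viseme inventory, class names, and stub predictions are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of a two-stage lip-reading pipeline (video -> visemes -> characters).
from typing import List, Sequence

# Hypothetical viseme inventory; the paper's actual viseme set is not given here.
VISEMES = ["p", "f", "t", "s", "k", "ch", "a", "e", "i", "o", "u", "sil"]

class VideoToVisemeModel:
    """Stage 1: maps a sequence of lip-region frames to a viseme sequence (stub)."""
    def predict(self, frames: Sequence) -> List[str]:
        # A real model would run a CNN + sequence decoder over the frames.
        return ["sil"] * len(frames)

class VisemeToCharModel:
    """Stage 2: maps visemes to characters; trainable on external text alone."""
    def predict(self, visemes: List[str]) -> str:
        # A real model would be a seq2seq trained on text converted to visemes,
        # which is where the external text data becomes useful.
        return "".join("?" for _ in visemes)

def lipread(frames: Sequence) -> str:
    return VisemeToCharModel().predict(VideoToVisemeModel().predict(frames))

if __name__ == "__main__":
    print(lipread(frames=[None] * 10))  # dummy frames stand in for lip-region images
```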


Journal ArticleDOI
TL;DR: In this article, skilled readers' brain responses to a spoken word presented alone were compared with responses to the same word presented synchronously with a static image of a viseme or a grapheme of the word's onset. Neither visual input induced audiovisual integration on the N1 acoustic component, whereas both led to supra-additive integration on P2, with stronger integration between speech and graphemes on left-anterior electrodes.

2 citations


Proceedings ArticleDOI
09 May 2022
TL;DR: The concept of a dynamic tongue model is introduced, which represents the entire process of vocal-organ movement during the articulation of a given phoneme in the Chinese Shaanxi Xi'an Dialect.
Abstract: Xi'an is the capital city of Shaanxi Province, the core city of the Guanzhong Plain urban agglomeration, an important central city in western China approved by the State Council, and an important national base for scientific research, education and industry, so creating a Chinese Shaanxi Xi'an Dialect talking head has significant research value. In addition, the Chinese Shaanxi Xi'an Dialect is a precious cultural heritage with great research value not only in linguistics but also in speech analysis. We decided to take up the challenge of developing a facial animation system with lip and tone synchronization for this dialect. A static viseme can be displayed as a stationary human facial image; however, when a phoneme is articulated, the movement of the vocal organs, such as the tongue or lips, is a dynamic process rather than a static state. We therefore introduce the concept of a dynamic tongue model, which represents the entire process of vocal-organ movement during the articulation of a given phoneme. The dynamic viseme model is a combination of dominance and parameter values based on dominance grade. We have created a dynamic viseme model, a dynamic articulation model based on a dominance classification of visemes for the Chinese Shaanxi Xi'an Dialect. It can be used in speech-assistance systems for hard-of-hearing children or for people who want to learn this dialect, and it provides a method that can be applied to speech-assistance systems for other languages.

1 citation
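The combination of dominance and parameter values mentioned in this abstract can be illustrated with a small dominance-blending sketch in the spirit of classic dominance-function coarticulation models; the exponential dominance shape, the parameter name, and the numeric grades are assumptions, not the paper's model.

```python
import math
from typing import Dict, List, Tuple

def dominance(t: float, center: float, grade: float, width: float = 0.08) -> float:
    """Exponentially decaying dominance of a viseme at time t (assumed shape)."""
    return grade * math.exp(-abs(t - center) / width)

def blend_parameter(t: float, segments: List[Tuple[float, float, Dict[str, float]]],
                    param: str) -> float:
    """Weighted average of a lip/tongue parameter over overlapping visemes.

    Each segment is (center_time, dominance_grade, target_parameters).
    """
    num, den = 0.0, 0.0
    for center, grade, targets in segments:
        w = dominance(t, center, grade)
        num += w * targets.get(param, 0.0)
        den += w
    return num / den if den > 0 else 0.0

# Two hypothetical overlapping visemes in a CV syllable, each with a target value.
segments = [(0.10, 1.0, {"tongue_height": 0.9}),   # consonant, higher dominance grade
            (0.22, 0.6, {"tongue_height": 0.3})]   # following vowel
for t in (0.08, 0.15, 0.22):
    print(t, round(blend_parameter(t, segments, "tongue_height"), 3))
```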


Proceedings ArticleDOI
01 Jul 2022
TL;DR: ACTA 2.0 is an automated tool that relies on Argument Mining methods to analyse the abstracts of clinical trials, extracting argument components and relations to support evidence-based clinical decision making.
Abstract: Evidence-based medicine aims at making decisions about the care of individual patients based on the explicit use of the best available evidence in the patient's clinical history and the medical literature. Argumentation represents a natural way of addressing this task by (i) identifying evidence and claims in text, and (ii) reasoning upon the extracted arguments and their relations to make a decision. ACTA 2.0 is an automated tool that relies on Argument Mining methods to analyse the abstracts of clinical trials, extracting argument components and relations to support evidence-based clinical decision making. ACTA 2.0 also allows for the identification of PICO (Patient, Intervention, Comparison, Outcome) elements, and the analysis of the effects of an intervention on the outcomes of the study. A REST API is also provided to exploit the tool's functionalities.

1 citation
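A client for such a REST API might look roughly like the sketch below. The base URL, endpoint path, payload fields, and response keys are hypothetical placeholders; the actual ACTA 2.0 API is not specified in this abstract.

```python
# Hypothetical client sketch for an argument-mining REST service such as ACTA 2.0.
# The base URL, endpoint path, payload fields, and response keys are placeholders;
# consult the tool's documentation for the real API.
import requests

BASE_URL = "https://example.org/acta"  # placeholder, not the real service address

def analyse_abstract(text: str) -> dict:
    """Send a clinical-trial abstract and return the parsed JSON response."""
    resp = requests.post(f"{BASE_URL}/analyse", json={"text": text}, timeout=30)
    resp.raise_for_status()
    # Expected (assumed) to contain argument components, relations, and PICO elements.
    return resp.json()

if __name__ == "__main__":
    result = analyse_abstract("Randomized trial abstract text ...")
    print(result)
```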


Proceedings ArticleDOI
01 Jul 2022
TL;DR: In this paper, a text/speech-driven full-body animation synthesis system is presented that synthesizes face and body animations simultaneously; these are then skinned and rendered to obtain a video stream output.
Abstract: Due to increasing demand from films and games, synthesizing 3D avatar animation has recently attracted much attention. In this work, we present a production-ready text/speech-driven full-body animation synthesis system. Given the text and corresponding speech, our system synthesizes face and body animations simultaneously, which are then skinned and rendered to obtain a video stream output. We adopt a learning-based approach for synthesizing facial animation and a graph-based approach to animate the body, which generates high-quality avatar animation efficiently and robustly. Our results demonstrate that the generated avatar animations are realistic, diverse, and highly correlated with the text and speech.

1 citation
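At a system level, the described pipeline combines a learned facial-animation module with a graph-based body-animation module and then skins and renders the result. The sketch below shows one way such a composition could be wired up; the interfaces, frame counts, and parameter sizes are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AnimationFrame:
    face_blendshapes: List[float]   # e.g. viseme/expression weights (size assumed)
    body_pose: List[float]          # e.g. joint rotation parameters (size assumed)

def synthesize_face(text: str, speech_samples: List[float], n_frames: int) -> List[List[float]]:
    # Placeholder for the learning-based facial animation model.
    return [[0.0] * 52 for _ in range(n_frames)]

def synthesize_body(text: str, n_frames: int) -> List[List[float]]:
    # Placeholder for the graph-based body motion synthesis.
    return [[0.0] * 63 for _ in range(n_frames)]

def synthesize_avatar(text: str, speech_samples: List[float], n_frames: int = 120) -> List[AnimationFrame]:
    faces = synthesize_face(text, speech_samples, n_frames)
    bodies = synthesize_body(text, n_frames)
    # In the described system these frames would then be skinned and rendered to a video stream.
    return [AnimationFrame(f, b) for f, b in zip(faces, bodies)]

if __name__ == "__main__":
    clip = synthesize_avatar("Hello there", speech_samples=[0.0] * 16000)
    print(len(clip), len(clip[0].face_blendshapes), len(clip[0].body_pose))
```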


Posted ContentDOI
04 Apr 2022
TL;DR: In this paper, the authors propose a visual context attention module that encodes global representations from the local visual features and provides the desired global visual context, corresponding to the given coarse speech representation, to the generator through audio-visual attention.
Abstract: In this paper, we propose a novel lip-to-speech generative adversarial network, Visual Context Attentional GAN (VCA-GAN), which can jointly model local and global lip movements during speech synthesis. Specifically, the proposed VCA-GAN synthesizes speech from local lip visual features by finding a viseme-to-phoneme mapping function, while global visual context is embedded into the intermediate layers of the generator to clarify the ambiguity in the mapping induced by homophenes. To achieve this, a visual context attention module is proposed that encodes global representations from the local visual features and provides the desired global visual context, corresponding to the given coarse speech representation, to the generator through audio-visual attention. In addition to the explicit modelling of local and global visual representations, synchronization learning is introduced as a form of contrastive learning that guides the generator to synthesize speech in sync with the given input lip movements. Extensive experiments demonstrate that the proposed VCA-GAN outperforms the existing state of the art and can effectively synthesize speech in the multi-speaker setting, which has barely been handled in previous works.
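The audio-visual attention idea, in which a coarse speech representation queries a global visual context derived from local lip features, can be sketched roughly as below using standard PyTorch modules. The dimensions, the GRU context encoder, and the overall layout are illustrative assumptions, not the VCA-GAN implementation.

```python
import torch
import torch.nn as nn

class VisualContextAttention(nn.Module):
    """Rough sketch: coarse speech features query a global visual context."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.encode_context = nn.GRU(dim, dim, batch_first=True)      # local -> global visual context
        self.av_attention = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, local_visual: torch.Tensor, coarse_speech: torch.Tensor) -> torch.Tensor:
        # local_visual: (B, T_video, dim) lip features; coarse_speech: (B, T_audio, dim)
        global_context, _ = self.encode_context(local_visual)
        fused, _ = self.av_attention(query=coarse_speech, key=global_context, value=global_context)
        # In the described model, this fused context would feed the generator's intermediate layers.
        return fused

if __name__ == "__main__":
    module = VisualContextAttention()
    out = module(torch.randn(2, 75, 256), torch.randn(2, 300, 256))
    print(out.shape)  # torch.Size([2, 300, 256])
```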

Proceedings ArticleDOI
21 Apr 2022
TL;DR: In this article, the authors propose a lip-reading framework for hearing-impaired people: the lip-movement video is transformed into image frames, each frame passes through a trained CNN architecture, and the frames are then classified into visemes.
Abstract: Lip reading is the process of recognizing words or speech through a visual interpretation of lip movements. Automated lip reading has broad development prospects and is helpful for hearing-impaired people. The lip-movement video is transformed into several image frames as input; each frame passes through a trained CNN architecture, and the frames are then segmented into visemes. The produced visemes pass through a dense layer and a Long Short-Term Memory (LSTM) layer, and the output of the LSTM layer becomes the input to the following dense layer. Finally, we obtain a sequence of visemes; the classified visemes are labeled by the LSTM softmax activation function. The extracted viseme features are evaluated using a classifier scheme known as viseme-to-phoneme mapping. Based on this mapping, candidate words are detected with a word detector, and the one-to-many mapping from visemes to words is resolved with perplexity analysis to find the predicted word. The performance of the framework is assessed by comparing the sentences predicted by the lip-reading system with the ground truth of the spoken sentences, using a confusion matrix for the results.
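A rough sketch of the frame-wise CNN followed by an LSTM and a softmax over viseme classes is given below; the layer sizes, number of viseme classes, and input resolution are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

NUM_VISEMES = 14  # assumed size of the viseme inventory

class VisemeSequenceClassifier(nn.Module):
    """Per-frame CNN features -> LSTM -> dense layer -> softmax over viseme classes."""
    def __init__(self, hidden: int = 128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
            nn.Flatten(),                     # -> 32 * 4 * 4 = 512 features per frame
        )
        self.lstm = nn.LSTM(512, hidden, batch_first=True)
        self.head = nn.Linear(hidden, NUM_VISEMES)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, 1, H, W) grayscale mouth crops
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
        seq, _ = self.lstm(feats)
        return torch.softmax(self.head(seq), dim=-1)   # per-frame viseme probabilities

if __name__ == "__main__":
    model = VisemeSequenceClassifier()
    probs = model(torch.randn(2, 20, 1, 64, 64))
    print(probs.shape)  # torch.Size([2, 20, 14])
```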

Proceedings ArticleDOI
16 Oct 2022
TL;DR: HyProGAN, presented in this paper, introduces a hybrid and progressive training strategy that expands unidirectional translation between two domains into bidirectional intra-domain and inter-domain translation.
Abstract: Image translation from human faces to anime faces offers a low-cost, efficient way to create animated characters for the animation industry. However, due to the significant inter-domain difference between anime images and human photos, existing image-to-image translation approaches cannot address this task well. To solve this dilemma, we propose HyProGAN, an exemplar-guided image-to-image translation model without paired data. The key contribution of HyProGAN is that it introduces a novel hybrid and progressive training strategy that expands unidirectional translation between two domains into bidirectional intra-domain and inter-domain translation. To enhance the consistency between input and output, we further propose a local masking loss to align the facial features between the human face and the generated anime face. Extensive experiments demonstrate the superiority of HyProGAN against state-of-the-art models.
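The local masking loss can be illustrated as a masked reconstruction term over facial regions, as in the sketch below; the L1 form, the mask source, and the normalization are assumptions for illustration rather than HyProGAN's exact formulation.

```python
import torch

def local_masking_loss(generated: torch.Tensor, source: torch.Tensor,
                       face_mask: torch.Tensor) -> torch.Tensor:
    """Masked L1 between the generated anime face and the input human face.

    generated, source: (B, 3, H, W) images; face_mask: (B, 1, H, W) in [0, 1],
    e.g. covering eye/nose/mouth regions obtained from a landmark or parsing model.
    """
    diff = (generated - source).abs() * face_mask
    return diff.sum() / face_mask.sum().clamp(min=1.0)

if __name__ == "__main__":
    g, s = torch.rand(2, 3, 128, 128), torch.rand(2, 3, 128, 128)
    m = (torch.rand(2, 1, 128, 128) > 0.7).float()   # dummy facial-region mask
    print(local_masking_loss(g, s, m).item())
```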

Journal ArticleDOI
TL;DR: In this article, the authors use facial motion capture technology to obtain dynamic lip viseme feature data during a stop's forming-block, continuing-block, and removing-block phases, and during co-articulation with vowels in the CV structure.
Abstract: In the study of articulatory phonetics, lip shape and tongue position are the focus of linguists. In order to reveal the physiological characteristics of lip shape during pronunciation, the author takes the Tibetan Xiahe dialect as the research object and defines the facial parameter feature points of the speaker according to the MPEG-4 international standard. Most importantly, the author uses facial motion capture technology to obtain dynamic lip viseme feature data during the stop's forming-block, continuing-block, and removing-block phases, and during co-articulation with vowels in the CV structure. Through research and analysis, it is found that the distribution of lip-shape change characteristics differs across places of articulation during the stop's forming-block. In co-articulation with [a], the reverse effect is greater than the forward effect, which is consistent with the conclusions obtained for many languages by other scholars using other experimental methods. The study also found that, in the process of pronunciation, the movement of each speaker's lip physiological characteristics is random to a certain extent, but when different speakers pronounce the same sound, the trend of change in lip-shape characteristics remains consistent.
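The kind of dynamic lip viseme features discussed here, such as lip opening and lip width over time, can be computed from tracked feature points as in the sketch below; the landmark names and the two measures are illustrative and do not correspond to the specific MPEG-4 feature points used in the paper.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]

def lip_features(landmarks: Dict[str, Point]) -> Dict[str, float]:
    """Simple geometric lip measures from tracked points (illustrative keys)."""
    def dist(a: Point, b: Point) -> float:
        return math.hypot(a[0] - b[0], a[1] - b[1])
    return {
        "lip_opening": dist(landmarks["upper_lip_mid"], landmarks["lower_lip_mid"]),
        "lip_width":   dist(landmarks["left_corner"], landmarks["right_corner"]),
    }

def feature_trajectory(frames: List[Dict[str, Point]]) -> List[Dict[str, float]]:
    """Per-frame features, i.e. the dynamic viseme trajectory across a CV syllable."""
    return [lip_features(f) for f in frames]

if __name__ == "__main__":
    frame = {"upper_lip_mid": (0.0, 1.0), "lower_lip_mid": (0.0, -1.0),
             "left_corner": (-2.0, 0.0), "right_corner": (2.0, 0.0)}
    print(feature_trajectory([frame, frame]))
```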

Proceedings ArticleDOI
12 Oct 2022
TL;DR: In this article, a corpus of homovisemes (words with similar articulation patterns as seen on the speaker's lips) has been developed to address certain difficulties in learning to lip-read, and the main advantages of the corpus approach to this study are discussed.
Abstract: In the article, we investigate how a person visually perceives and distinguishes words in oral speech that have similar articulation patterns (homovisemes) on the speaker's lips. These words are defined by the term "homovisemes", introduced by the authors. To carry out our research, we have developed a corpus of homovisemes based on prepared material. This material is grouped into separate chains according to the principle of identical articulatory shells of words, with the exception of pseudowords. If such words are interchangeable, a person visually perceives them differently. The purpose of this research is to analyze whether it is possible and expedient to use the homoviseme corpus to address certain difficulties in learning to read from the lips. In the article, we consider the main advantages of the corpus approach to this study. We present examples of possible variants of recognized words with similar visemes based on the content of the developed corpus. Our results allow us to conclude that the chosen topic is relevant, and they will help us predict uncertainty when choosing the correct variant. Work in this area will provide new results, create useful applications, and help people with hearing disabilities, which is an important social task.

Posted ContentDOI
31 May 2022
TL;DR: In this paper, a text/speech-driven full-body animation synthesis system is presented that synthesizes face and body animations simultaneously; these are then skinned and rendered to obtain a video stream output.
Abstract: Due to increasing demand from films and games, synthesizing 3D avatar animation has recently attracted much attention. In this work, we present a production-ready text/speech-driven full-body animation synthesis system. Given the text and corresponding speech, our system synthesizes face and body animations simultaneously, which are then skinned and rendered to obtain a video stream output. We adopt a learning-based approach for synthesizing facial animation and a graph-based approach to animate the body, which generates high-quality avatar animation efficiently and robustly. Our results demonstrate that the generated avatar animations are realistic, diverse, and highly correlated with the text and speech.

Proceedings ArticleDOI
27 May 2022
TL;DR: In this article, the authors propose a methodology for Sinhala static viseme classification and establish a phoneme-viseme mapping model for this low-resource language, which belongs to the Indo-European sub-family.
Abstract: Speech perception is often considered a purely auditory process, but vision also has a significant influence on it. In synthesized vocal systems, especially for robotic applications, inaccurate synchronization between voice and lip movements substantially decreases speech understanding and the naturalness of face-to-face communication. Phoneme-viseme mapping is one of the most important approaches in visual speech recognition and visual speech synthesis applications. Although there are many phoneme-viseme mapping models for languages such as English, Indonesian, Arabic, and German, no adequate phoneme-viseme mapping model is available for the Sinhala language. This research proposes a methodology for Sinhala static viseme classification and establishes a phoneme-viseme mapping model for the Sinhala language. Sinhala is a low-resource language belonging to the Indo-European sub-family and has some similarities to languages such as Hindi, Marathi, and Bengali. The traditional Sinhala phonetic alphabet consists of 40 phonemes, including 14 vowels and 26 consonants. This paper outlines an analysis of the geometrical lip movements and features of speakers pronouncing Sinhala word sequences recorded under optimal conditions. Viseme classes are obtained through a static viseme approach in which K-means clustering techniques and Sinhala linguistic features are considered. The proposed model was validated through a subjective analysis method and is expected to grow into a reference model for future research, as well as for developing an instructional robotic face that will form the visual interface for Sinhala-speaking healthcare seekers.
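The static viseme classification step, clustering geometric lip features of phonemes into viseme classes with K-means, might look roughly like the sketch below; the phoneme labels, feature values, and number of clusters are placeholders rather than the paper's data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Each row: mean geometric lip features (e.g. lip width, lip opening, protrusion)
# for one phoneme, averaged over speakers. Values and phoneme labels are placeholders.
phonemes = ["a", "i", "u", "p", "b", "m", "t", "d", "s"]
features = np.array([
    [0.9, 0.8, 0.1], [0.7, 0.3, 0.1], [0.3, 0.4, 0.9],
    [0.5, 0.0, 0.2], [0.5, 0.0, 0.2], [0.5, 0.0, 0.2],
    [0.6, 0.2, 0.1], [0.6, 0.2, 0.1], [0.6, 0.25, 0.1],
])

# Group phonemes whose lip shapes look alike into viseme classes.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(features)

viseme_map = {}  # cluster id -> list of phonemes sharing that viseme class
for phoneme, cluster in zip(phonemes, kmeans.labels_):
    viseme_map.setdefault(int(cluster), []).append(phoneme)

for viseme_id, members in sorted(viseme_map.items()):
    print(f"viseme class {viseme_id}: {members}")
```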