
Showing papers on "Viseme" published in 2021


Posted ContentDOI
24 Feb 2021-bioRxiv
TL;DR: The visual system is shown to produce its own specialized representation of speech that is well described by categorical linguistic units ("visemes") and predictive of lipreading ability, contradicting the long-held view that visual speech processing co-opts auditory cortex after early visual processing stages.
Abstract: There is considerable debate over how visual speech is processed in the absence of sound and whether neural activity supporting lipreading occurs in visual brain areas. Surprisingly, much of this ambiguity stems from a lack of behaviorally grounded neurophysiological findings. To address this, we conducted an experiment in which human observers rehearsed audiovisual speech for the purpose of lipreading silent versions during testing. Using a combination of computational modeling, electroencephalography, and simultaneously recorded behavior, we show that the visual system produces its own specialized representation of speech that is 1) well-described by categorical linguistic units (“visemes”), 2) dissociable from lip movements, and 3) predictive of lipreading ability. These findings contradict a long-held view that visual speech processing co-opts auditory cortex after early visual processing stages. Consistent with hierarchical accounts of visual and audiovisual speech perception, our findings show that visual cortex performs at least a basic level of linguistic processing.
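The notion of a viseme as a categorical visual unit can be made concrete with a toy phoneme-to-viseme grouping. The mapping below is a generic textbook-style example, not the grouping used in this study; it only illustrates why distinct phonemes can collapse onto the same visual category.

```python
# Illustrative sketch: visemes are many-to-one groupings of phonemes that look
# alike on the lips. The grouping below is a common textbook-style example,
# not the exact mapping used in the paper.
VISEME_CLASSES = {
    "bilabial": {"p", "b", "m"},        # lips pressed together
    "labiodental": {"f", "v"},          # lower lip against upper teeth
    "rounded": {"w", "uw", "ow"},       # rounded lips
    "open": {"aa", "ae", "ah"},         # open jaw
}

def phoneme_to_viseme(phoneme: str) -> str:
    """Collapse a phoneme label onto its visually distinctive class."""
    for viseme, members in VISEME_CLASSES.items():
        if phoneme in members:
            return viseme
    return "other"   # phonemes with no distinct visual signature

if __name__ == "__main__":
    # "map" and "bap" collapse onto the same viseme sequence: they are
    # indistinguishable from lip movements alone.
    for word in (["m", "ae", "p"], ["b", "ae", "p"]):
        print(word, "->", [phoneme_to_viseme(p) for p in word])
```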

10 citations


Proceedings ArticleDOI
06 Jun 2021
TL;DR: In this article, the authors present an introspection of an audiovisual speech enhancement model and demonstrate the effectiveness of the learned visual embeddings for classifying visemes (the visual analogy to phonemes).
Abstract: We present an introspection of an audiovisual speech enhancement model. In particular, we focus on interpreting how a neural audiovisual speech enhancement model uses visual cues to improve the quality of the target speech signal. We show that visual cues provide not only high-level information about speech activity, i.e., speech/silence, but also fine-grained visual information about the place of articulation. One byproduct of this finding is that the learned visual embeddings can be used as features for other visual speech applications. We demonstrate the effectiveness of the learned visual embeddings for classifying visemes (the visual analogy to phonemes). Our results provide insight into important aspects of audiovisual speech enhancement and demonstrate how such models can be used for self-supervision tasks for visual speech applications.
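The probing idea described here, reusing the learned visual embeddings as features for viseme classification, can be sketched as a small linear classifier trained on frozen embeddings. The code below is a minimal illustration under assumed sizes (embedding dimension, number of viseme classes); it is not the authors' implementation.

```python
# Minimal probing sketch: freeze visual embeddings taken from a pretrained
# audiovisual enhancement model and fit a light classifier that predicts
# viseme labels. Dimensions and class count are placeholders.
import torch
import torch.nn as nn

EMB_DIM, N_VISEMES = 256, 13           # assumed sizes, not from the paper

probe = nn.Linear(EMB_DIM, N_VISEMES)  # linear probe on frozen features
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(frozen_embeddings: torch.Tensor, viseme_labels: torch.Tensor):
    """One probing step: embeddings come from the (frozen) enhancement model."""
    logits = probe(frozen_embeddings.detach())   # no gradient into the encoder
    loss = loss_fn(logits, viseme_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Dummy batch standing in for per-frame embeddings and viseme labels.
emb = torch.randn(32, EMB_DIM)
labels = torch.randint(0, N_VISEMES, (32,))
print(train_step(emb, labels))
```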

5 citations


Journal ArticleDOI
TL;DR: A lightweight auto-regressive network is introduced to transform an example-based animation database into a parametric model, allowing realistic, example-based synthesis of novel facial animation sequences such as visual speech as well as general facial expressions.
Abstract: This article presents a hybrid animation approach that combines example-based and neural animation methods to create a simple, yet powerful animation regime for human faces. Example-based methods usually employ a database of prerecorded sequences that are concatenated or looped in order to synthesize novel animations. In contrast to this traditional example-based approach, we introduce a lightweight auto-regressive network to transform our animation database into a parametric model. During training, our network learns the dynamics of facial expressions, which enables the replay of annotated sequences from our animation database as well as their seamless concatenation in a new order. This representation is especially useful for the synthesis of visual speech, where coarticulation creates interdependencies between adjacent visemes that affect their appearance. Instead of creating an exhaustive database that contains all viseme variants, we use our animation network to predict the correct appearance. This allows realistic, example-based synthesis of novel facial animation sequences such as visual speech as well as general facial expressions.
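The core autoregressive idea, predicting the next frame of animation parameters from the preceding ones so that database sequences can be replayed and concatenated smoothly, can be roughly sketched as follows. Parameter dimensionality and network size are assumptions, not values from the article.

```python
# Sketch of the autoregressive idea: predict the next frame of facial-animation
# parameters from the previous ones, so annotated sequences can be replayed and
# concatenated without seams. Parameter count and network size are assumptions.
import torch
import torch.nn as nn

PARAM_DIM = 64   # e.g. blendshape weights per frame (assumed)

class AutoRegressiveFace(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(PARAM_DIM, hidden, batch_first=True)
        self.out = nn.Linear(hidden, PARAM_DIM)

    def forward(self, frames):                 # frames: (B, T, PARAM_DIM)
        h, _ = self.rnn(frames)
        return self.out(h)                     # per-step prediction of the next frame

    @torch.no_grad()
    def rollout(self, seed, steps):
        """Generate new frames by feeding predictions back in."""
        frames = [seed]                        # seed: (B, 1, PARAM_DIM)
        for _ in range(steps):
            nxt = self.forward(torch.cat(frames, dim=1))[:, -1:, :]
            frames.append(nxt)
        return torch.cat(frames, dim=1)

model = AutoRegressiveFace()
print(model.rollout(torch.zeros(1, 1, PARAM_DIM), steps=5).shape)
```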

5 citations


Proceedings ArticleDOI
17 Oct 2021
TL;DR: Wang et al. propose CALLip, a novel lipreading framework that employs attribute learning and contrastive learning; a speaker recognition branch extracts speaker identity-aware features that normalize lip shapes to eliminate cross-speaker variations.
Abstract: Lipreading, aiming at interpreting speech by watching the lip movements of the speaker, has great significance in human communication and speech understanding. Despite having reached feasible performance, lipreading still faces two crucial challenges: 1) the considerable variation in lip movements across different persons when they utter the same words; 2) the similar lip movements of people when they utter confusable phonemes. To tackle these two problems, we propose a novel lipreading framework, CALLip, which employs attribute learning and contrastive learning. The attribute learning extracts speaker identity-aware features through a speaker recognition branch, which are able to normalize the lip shapes and eliminate cross-speaker variations. Considering that audio signals are intrinsically more distinguishable than visual signals, contrastive learning is devised between visual and audio signals to enhance the discrimination of visual features and alleviate the viseme confusion problem. Experimental results show that CALLip does learn better features of lip movements. Comparisons on both English and Chinese benchmark datasets, GRID and CMLR, demonstrate that CALLip outperforms six state-of-the-art lipreading methods without using any additional data.
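The cross-modal contrastive component can be illustrated with a generic InfoNCE-style loss that pulls paired audio and visual embeddings together and pushes mismatched pairs apart. This is only a sketch of the general technique, not CALLip's exact formulation.

```python
# Sketch of cross-modal contrastive learning: paired audio/visual embeddings are
# attracted and mismatched pairs repelled, so confusable visemes inherit some of
# the audio signal's discriminability. Generic InfoNCE-style loss, not CALLip's.
import torch
import torch.nn.functional as F

def cross_modal_contrastive(visual, audio, temperature=0.07):
    """visual, audio: (B, D) embeddings for the same B utterances."""
    v = F.normalize(visual, dim=1)
    a = F.normalize(audio, dim=1)
    logits = v @ a.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(v.size(0))          # the diagonal pairs are the matches
    return F.cross_entropy(logits, targets)

print(cross_modal_contrastive(torch.randn(8, 128), torch.randn(8, 128)))
```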

3 citations


Posted Content
TL;DR: The authors propose a method that uses external text data (for viseme-to-character mapping) by dividing video-to-character conversion into two stages, namely converting video to visemes and then converting visemes to characters, using separate models.
Abstract: Lip-reading is the task of recognizing speech from lip movements. It is difficult because the lip movements for some words are similar when pronounced. A viseme describes lip movements during a conversation. This paper shows how to use external text data (for viseme-to-character mapping) by dividing video-to-character conversion into two stages, namely converting video to visemes and then converting visemes to characters, using separate models. The proposed method improves word error rate by 4% compared to a standard sequence-to-sequence lip-reading model on the BBC-Oxford Lip Reading Sentences 2 (LRS2) dataset.
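The benefit of the two-stage split is that the viseme-to-character stage can be trained from text alone, since text can be converted to viseme sequences via phoneme mappings. The toy mappings below are illustrative, not the ones used in the paper.

```python
# Sketch of why the two-stage split helps: the second stage (visemes -> characters)
# only needs text, so plain text corpora can be turned into training pairs by
# mapping characters to phonemes and phonemes to visemes. Toy mappings only.
CHAR_TO_PHONEME = {"b": "b", "m": "m", "a": "aa", "t": "t", "d": "d"}
PHONEME_TO_VISEME = {"b": "V_bilabial", "m": "V_bilabial",
                     "aa": "V_open", "t": "V_alveolar", "d": "V_alveolar"}

def text_to_viseme_pairs(word: str):
    """Create a (viseme sequence, character sequence) training pair from text alone."""
    visemes = [PHONEME_TO_VISEME[CHAR_TO_PHONEME[c]] for c in word]
    return visemes, list(word)

# "bat" and "mad" map onto the same viseme sequence; the text-trained
# viseme-to-character model is what resolves that ambiguity.
for w in ("bat", "mad"):
    print(w, text_to_viseme_pairs(w))
```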

2 citations


Book ChapterDOI
01 Jan 2021
TL;DR: This chapter discusses the primary task of identifying visemes and the number of frames required to encode the temporal evolution of vowel and consonant phonemes, and analyzes three phoneme-to-viseme mappings.
Abstract: Knowledge of phonemes and visemes in a language is a vital component of speech-based applications. A phoneme is the nuclear sound unit necessary to symbolize all words in a particular language. The present definition of a viseme is a visual language unit that describes the state of different speech articulators. This chapter discusses the primary task of identifying visemes and the number of frames required to encode the temporal evolution of vowel and consonant phonemes. For this work, an audio-visual Malayalam speech database is created from 23 native speakers of Kerala (18 female and 5 male). The tongue plays a vital role in the utterance of Malayalam, regarding flexibility and speed, which makes it distinct from other languages. The appearance of the teeth and the oral cavity and the shape of the lips can be modeled using geometric features of the lips obtained from the hue, saturation, value (HSV) color space, and deformation in the appearance of the lips and tongue can be modeled using the discrete cosine transform (DCT) feature. A linguistically involved, data-driven approach combines the modeling of individual perception from linguistics with the computational ease of a data-driven method. The visual speech attributes are then clustered to identify the visual equivalents of the phonemes, employing K-means clustering and the Gap statistic. To study the temporal variation, we analyzed three phoneme-to-viseme mappings and compared them with the linguistic mapping and visual speech duration.
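The clustering step, grouping lip-feature vectors with K-means and choosing the number of viseme clusters with the Gap statistic, might be sketched as follows. Feature extraction is omitted and random data stands in for the real HSV/DCT features.

```python
# Sketch of the clustering step: lip-feature vectors are grouped with K-means and
# the number of viseme clusters is chosen with a basic Gap-statistic estimate.
# Random data stands in for the chapter's geometric/DCT features.
import numpy as np
from sklearn.cluster import KMeans

def within_dispersion(X, k):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    return km.inertia_

def gap_statistic(X, k, n_refs=5, rng=np.random.default_rng(0)):
    """Gap(k) = E[log W_k(reference)] - log W_k(data), reference = uniform box."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    ref_disp = [np.log(within_dispersion(rng.uniform(lo, hi, X.shape), k))
                for _ in range(n_refs)]
    return float(np.mean(ref_disp) - np.log(within_dispersion(X, k)))

X = np.random.default_rng(1).normal(size=(200, 10))   # stand-in lip features
for k in range(2, 6):
    print(k, round(gap_statistic(X, k), 3))
```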

1 citation


Proceedings ArticleDOI
18 May 2021
TL;DR: Recognition of bilabial consonants within a single viseme category, based on lip shape and using the CLM algorithm for lip detection, is presented; the results can have a positive effect on speech systems.
Abstract: According to previous research, Persian consonants have been divided into seven categories based on visemes, which places several consonants in the same category. Distinguishing consonants within one category is hard because their places of production are the same and the shape of the lips barely changes during production, so these consonants are hardly distinguishable. The major challenge is to recognize the differences between lip shapes within one category. The purpose of this study is to recognize differences between bilabial consonants such as /p/, /b/, and /m/ in consonant/vowel (CV) words by computer vision; this study is the first attempt to distinguish these consonants. Proper pronunciation of the words is required to identify consonants, so a database was formed from videos of speech therapists. In general, such a process includes (1) lip detection, (2) lip feature extraction, and (3) a classification system for the diagnosis of consonants. In this paper, recognition of consonants within a category based on lip shape is presented, using the CLM algorithm for lip detection, geometric algorithms for feature extraction, and DTW with an equalizer as the classification system. Although the study remains open, since differences could be identified among consonants in only one class, remarkable results on CV videos were reached for the first time, with acceptable accuracy for bilabial consonant detection. The principal purpose of this study is to improve lip-reading systems for security applications and to help hearing-impaired people interact with their surroundings. The results can have a positive effect on speech systems.
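The classification stage, comparing a test utterance's lip-feature trajectory against per-consonant references with dynamic time warping, can be sketched as below. The CLM landmark extraction and geometric features are omitted; short synthetic trajectories stand in for them.

```python
# Sketch of the DTW classification idea: compare a test utterance's lip-feature
# trajectory against reference trajectories for /p/, /b/, /m/ and pick the
# closest one. Synthetic 1-D "lip opening" curves stand in for real features.
import numpy as np

def dtw_distance(a, b):
    """Classic DTW on 1-D feature sequences (e.g. lip opening over time)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

references = {"p": np.array([0.0, 0.1, 0.8, 0.3]),   # toy lip-opening curves
              "b": np.array([0.0, 0.2, 0.9, 0.5]),
              "m": np.array([0.0, 0.0, 0.1, 0.1])}
test = np.array([0.0, 0.1, 0.7, 0.4])
print(min(references, key=lambda c: dtw_distance(test, references[c])))
```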

Book ChapterDOI
01 Jan 2021
TL;DR: Preliminary results of applying rough sets to pre-process video frames (with lip markers) of a spoken corpus, in an effort to label the phonemes spoken by the speakers, show promise for the application of a granular computing method to pre-processing large audio-video datasets.
Abstract: Machine learning algorithms are increasingly effective in algorithmic viseme recognition, which is a main component of audio-visual speech recognition (AVSR). A viseme is the smallest recognizable unit correlated with a particular realization of a given phoneme. Labelling phonemes and assigning them to viseme classes is a challenging problem in AVSR. In this paper, we present preliminary results of applying rough sets to pre-process video frames (with lip markers) of a spoken corpus in an effort to label the phonemes spoken by the speakers. The problem addressed here is to detect and remove frames in which the shape of the lips does not represent a phoneme completely. Our results demonstrate that the silhouette score improves with rough set-based pre-processing when using unsupervised K-means clustering. In addition, features extracted with an unsupervised CNN model were used as input to the K-means clustering. The results show promise for the application of a granular computing method to pre-processing large audio-video datasets.
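The evaluation described, clustering frame features with K-means and comparing silhouette scores with and without the filtered frames, can be sketched as follows. The rough-set filtering itself is replaced by a placeholder mask, since only the comparison procedure is illustrated.

```python
# Sketch of the evaluation loop: cluster per-frame features with K-means and
# compare the silhouette score before and after dropping frames flagged as
# incomplete lip shapes. A trivial placeholder mask stands in for the
# rough-set filtering; only the scoring comparison is shown.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
features = rng.normal(size=(300, 16))          # stand-in per-frame lip features
keep_mask = rng.random(300) > 0.2              # placeholder for rough-set filtering

def clustering_quality(X, k=6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    return silhouette_score(X, labels)

print("all frames:     ", round(clustering_quality(features), 3))
print("filtered frames:", round(clustering_quality(features[keep_mask]), 3))
```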

Patent
18 Mar 2021
TL;DR: In this article, the authors provide techniques for generating a viseme and corresponding intensity pair based on one of a clean vocal track or corresponding transcription, where the viseme is scheduled to align with a corresponding phoneme.
Abstract: Aspects of this disclosure provide techniques for generating a viseme and corresponding intensity pair. In some embodiments, the method includes generating, by a server, a viseme and corresponding intensity pair based at least on one of a clean vocal track or corresponding transcription. The method may include generating, by the server, a compressed audio file based at least on one of the viseme, the corresponding intensity, music, or visual offset. The method may further include generating, by the server or a client end application, a buffer of raw pulse-code modulated (PCM) data based on decoding at least a part of the compressed audio file, where the viseme is scheduled to align with a corresponding phoneme.
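The viseme-and-intensity pairing the claims describe can be pictured as a simple timestamped record scheduled against phoneme timings. Field names and values below are illustrative, not the patent's actual format.

```python
# Sketch of the data the claim describes: each viseme carries an intensity and a
# timestamp so it can be scheduled to align with the matching phoneme.
# Field names, mapping, and timings are illustrative placeholders.
from dataclasses import dataclass

@dataclass
class VisemeEvent:
    viseme: str
    intensity: float   # 0.0 (relaxed) .. 1.0 (fully articulated)
    start_ms: int      # scheduled to coincide with the matching phoneme

# Phoneme timings as they might come from an alignment of the clean vocal track.
phoneme_timeline = [("m", 0), ("aa", 120), ("p", 260)]
PHONEME_TO_VISEME = {"m": "bilabial", "aa": "open", "p": "bilabial"}

schedule = [VisemeEvent(PHONEME_TO_VISEME[p], intensity=0.8, start_ms=t)
            for p, t in phoneme_timeline]
for event in schedule:
    print(event)
```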

Proceedings ArticleDOI
06 Jun 2021
TL;DR: Despite high performance on conventional metrics such as MSE, PSNR, and SSIM, state-of-the-art frame interpolation models are found to fail at producing faithful speech interpolation, motivating a new set of linguistically informed metrics targeted explicitly at the problem of speech video interpolation.
Abstract: Here we explore the problem of speech video interpolation. With close to 70% of web traffic, such content today forms the primary form of online communication and entertainment. Despite high performance on conventional metrics like MSE, PSNR, and SSIM, we find that state-of-the-art frame interpolation models fail to produce faithful speech interpolation. For instance, we observe that the lips stay static for most interpolated frames while the person is still speaking. With this motivation, using information about words, sub-words, and visemes, we provide a new set of linguistically informed metrics targeted explicitly at the problem of speech video interpolation. We release several datasets to test video interpolation models for their speech understanding. We also design linguistically informed deep learning video interpolation algorithms to generate the missing frames.
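A viseme-level metric in the spirit of the abstract might compare, frame by frame, the viseme of the interpolated output with the viseme of the ground truth, so that an interpolator with static lips scores poorly even when its pixel error is small. The sketch below assumes viseme labels are already available (in practice they would come from a lip classifier); it is not the paper's metric.

```python
# Sketch of a viseme-level metric: instead of pixel error, compare the viseme
# predicted for each interpolated frame with the viseme of the corresponding
# ground-truth frame. Labels are given directly here for illustration.
def viseme_agreement(predicted_frames, reference_frames):
    """Fraction of frames whose viseme label matches the reference."""
    assert len(predicted_frames) == len(reference_frames)
    matches = sum(p == r for p, r in zip(predicted_frames, reference_frames))
    return matches / len(reference_frames)

# An interpolator that leaves the lips static scores poorly even if its
# pixel-level error is small.
reference   = ["bilabial", "open", "open", "bilabial", "rounded"]
static_lips = ["open"] * 5
print(viseme_agreement(static_lips, reference))   # 0.4
```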

Patent
10 Jun 2021
TL;DR: A method for generating a head model animation from a voice signal using an artificial intelligence model, and an electronic device implementing it, are presented; the method comprises acquiring characteristic information from the voice signal, using the artificial intelligence model to acquire a phoneme stream corresponding to the voice signal and a viseme stream corresponding to the phoneme stream, and generating the head model animation by applying an animation curve to the visemes of the merged phoneme and viseme streams.
Abstract: Disclosed are: a method for generating a head model animation from a voice signal using an artificial intelligence model; and an electronic device for implementing same. The disclosed method for generating a head model animation from a voice signal, carried out by the electronic device, comprises the steps of: acquiring characteristics information of a voice signal from the voice signal; by using the artificial intelligence model, acquiring, from the characteristics information, a phoneme stream corresponding to the voice signal, and a viseme stream corresponding to the phoneme stream; by using the artificial intelligence model, acquiring an animation curve of visemes included in the viseme stream; merging the phoneme stream with the viseme stream; and generating a head model animation by applying the animation curve to the visemes of the merged phoneme and viseme stream.
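The phoneme-stream-to-viseme-stream step and the per-viseme animation curve can be sketched with a toy mapping and a triangular weight curve. Both the mapping and the curve shape are illustrative; the patent leaves them to the trained models.

```python
# Sketch of the pipeline the claims outline: a phoneme stream is mapped onto a
# viseme stream, and a simple animation curve (ramp up, ramp down of viseme
# weight) is evaluated per frame for the head model. Illustrative values only.
PHONEME_TO_VISEME = {"h": "open", "eh": "open", "l": "alveolar", "ow": "rounded"}

def viseme_weight(t_ms, start_ms, duration_ms):
    """Triangular animation curve: ramp up to full weight, then back down."""
    x = (t_ms - start_ms) / duration_ms
    return max(0.0, 1.0 - abs(2.0 * x - 1.0)) if 0.0 <= x <= 1.0 else 0.0

# (phoneme, start_ms, duration_ms) as a toy alignment of the word "hello".
phoneme_stream = [("h", 0, 80), ("eh", 80, 120), ("l", 200, 90), ("ow", 290, 150)]
viseme_stream = [(PHONEME_TO_VISEME[p], s, d) for p, s, d in phoneme_stream]

for t in range(0, 440, 100):   # sample the curves at a few frame times
    weights = [(v, round(viseme_weight(t, s, d), 2)) for v, s, d in viseme_stream]
    print(t, weights)
```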

Patent
14 Jan 2021
TL;DR: An avatar visual transformation device and a message transformation method are presented that provide a novel user experience by expressing a text message as a V-moji in a messenger service.
Abstract: The present invention relates to an avatar visual transformation device capable of providing a novel user experience by expressing a text message as a V-moji in a messenger service, and to a message transformation method. More specifically, once text is input to a sender terminal, the visual transformation device analyzes the meaning of the text, sets emotion coordinates, and uses them to generate an animation code for an emotional expression of an avatar. In addition, text is extracted for TTS to generate voice data, and a viseme code for the viseme expression of each phoneme is generated. A receiver terminal outputs the voice data and, on the basis of the viseme code and the animation code, controls a visual image of the avatar.
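The data exchanged between sender and receiver terminals can be pictured as a small payload combining the synthesized voice, a per-phoneme viseme code, and an emotion-driven animation code. The field names below are hypothetical, not from the patent.

```python
# Sketch of the message payload the claims imply: the sender packages the
# synthesized voice, per-phoneme viseme codes, and an emotion-driven animation
# code; the receiver plays the audio and drives the avatar from the two codes.
# All field names are hypothetical.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class VmojiMessage:
    text: str
    emotion_coords: Tuple[float, float]            # e.g. valence / arousal
    animation_code: str                            # emotional expression to play
    viseme_codes: List[Tuple[str, int]] = field(default_factory=list)  # (viseme, ms)
    tts_audio: bytes = b""                         # synthesized voice data

msg = VmojiMessage(text="hello", emotion_coords=(0.7, 0.4),
                   animation_code="smile",
                   viseme_codes=[("open", 0), ("alveolar", 180), ("rounded", 260)])
print(msg.animation_code, msg.viseme_codes)
```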

Journal ArticleDOI
01 Jan 2021
TL;DR: A viseme database for the Chinese Shaanxi Xi’an dialect is created through lip viseme analysis, which is of great significance for creating realistic 3D facial animation of this dialect.
Abstract: Recently, multimedia, especially animation, has played an important role in language learning. It has made a significant contribution to the language learning process of the hearing impaired for learners of all ages, especially the 3D animation of virtual teachers in computer-assisted language learning applications. Researches on viseme in various official languages and 3D facial animations have emerged in various countries. This paper has done lots of work in order to created Chinese Shaanxi Xi’an Dialect viseme data base by lip viseme analysis. Series of lip visemes frame data was gained by processing this dialect images and videos taken by camera. The classification of the lip viseme of this dialect also made based on the lip viseme classification of Mandarin. This work has great significance to create realistic 3D facial animation of this dialect.