
Showing papers on "Viseme published in 2019"


Proceedings ArticleDOI
01 Oct 2019
TL;DR: A Temporal Focal block to sufficiently describe short-range dependencies and a Spatio-Temporal Fusion Module (STFM) to maintain the local spatial information and to reduce the feature dimensions as well is proposed.
Abstract: Current state-of-the-art approaches for lip reading are based on sequence-to-sequence architectures that were designed for neural machine translation and audio speech recognition. Hence, these methods do not fully exploit the characteristics of lip dynamics, which causes two main drawbacks. First, the short-range temporal dependencies, which are critical to the mapping from lip images to visemes, receive no extra attention. Second, local spatial information is discarded in existing sequence models due to the use of global average pooling (GAP). To address these drawbacks, we propose a Temporal Focal block to sufficiently describe short-range dependencies and a Spatio-Temporal Fusion Module (STFM) to maintain local spatial information while reducing the feature dimensions. The experimental results demonstrate that our method achieves performance comparable to the state-of-the-art approach while using much less training data and a much lighter convolutional feature extractor. The training time is reduced by 12 days thanks to the convolutional structure and the local self-attention mechanism.
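The paper's exact Temporal Focal block and STFM designs are not detailed in this summary, so the following is only a minimal PyTorch sketch of the two underlying ideas, with assumed sizes: short-range temporal dependencies modelled by small-kernel 1D convolutions over per-frame features, and a learned projection of a small spatial grid used in place of global average pooling so that local spatial information is not discarded.

```python
# Minimal sketch (not the authors' exact architecture): short-range temporal
# modelling with 1D convolutions, and a learned spatial reduction instead of GAP.
import torch
import torch.nn as nn

class TemporalFocalBlockSketch(nn.Module):
    """Small-kernel temporal convolutions capture dependencies between
    neighbouring lip frames; a residual connection keeps the original features."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2),
        )
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):                                   # x: (batch, time, channels)
        y = self.conv(x.transpose(1, 2)).transpose(1, 2)    # convolve along the time axis
        return self.norm(x + y)

class SpatialFusionSketch(nn.Module):
    """Flatten a small spatial grid and project it, instead of averaging it away,
    so the local spatial layout of the mouth region is preserved."""
    def __init__(self, in_channels: int, grid: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_channels * grid * grid, out_dim)

    def forward(self, feat):                                # feat: (batch*time, C, grid, grid)
        return self.proj(feat.flatten(1))

# Toy usage: 4 clips, 20 frames, 256-dim per-frame features; 4x4 spatial maps.
frames = torch.randn(4, 20, 256)
print(TemporalFocalBlockSketch(256)(frames).shape)          # torch.Size([4, 20, 256])
print(SpatialFusionSketch(512, 4, 256)(torch.randn(80, 512, 4, 4)).shape)  # torch.Size([80, 256])
```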

66 citations


Journal ArticleDOI
TL;DR: This work presents a procedural audio‐driven speech animation method for interactive virtual characters that automatically generates lip‐synchronized speech animation that could drive any three‐dimensional virtual character.
Abstract: We present a procedural audio‐driven speech animation method for interactive virtual characters. Given any audio with its respective speech transcript, we automatically generate lip‐synchronized speech animation that could drive any three‐dimensional virtual character. The realism of the animation is enhanced by studying the emotional features of the audio signal and its effect on mouth movements. We also propose a coarticulation model that takes into account various linguistic rules. The generated animation is configurable by the user by modifying the control parameters, such as viseme types, intensities, and coarticulation curves. We compare our approach against two lip‐synchronized speech animation generators. Our results show that our method surpasses them in terms of user preference.
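The coarticulation model and its control parameters are not specified in this abstract; the Python sketch below only illustrates the general mechanism such systems rely on, blending overlapping per-viseme intensity curves into per-frame animation weights. The curve shape, overlap value, and viseme names are assumptions for illustration.

```python
# Sketch: blend overlapping viseme intensity curves into per-frame weights.
import numpy as np

def raised_cosine(t, onset, peak, offset):
    """Illustrative intensity curve: rises to 1.0 at `peak`, falls back to 0."""
    w = np.zeros_like(t)
    up = (t >= onset) & (t <= peak)
    down = (t > peak) & (t <= offset)
    w[up] = 0.5 - 0.5 * np.cos(np.pi * (t[up] - onset) / max(peak - onset, 1e-6))
    w[down] = 0.5 + 0.5 * np.cos(np.pi * (t[down] - peak) / max(offset - peak, 1e-6))
    return w

def blend_visemes(timeline, fps=30.0, overlap=0.08):
    """timeline: list of (viseme, start_sec, end_sec).  Overlapping curves give a
    crude approximation of coarticulation between neighbouring visemes."""
    end = max(stop for _, _, stop in timeline)
    t = np.arange(0.0, end, 1.0 / fps)
    weights = {}
    for viseme, start, stop in timeline:
        curve = raised_cosine(t, start - overlap, (start + stop) / 2, stop + overlap)
        weights[viseme] = np.maximum(weights.get(viseme, 0.0), curve)
    total = np.maximum(sum(weights.values()), 1e-6)
    return {v: w / total for v, w in weights.items()}        # normalised blend weights

weights = blend_visemes([("PP", 0.00, 0.10), ("AA", 0.08, 0.25), ("FF", 0.22, 0.35)])
print({v: np.round(w[:5], 2) for v, w in weights.items()})
```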

13 citations


Posted Content
TL;DR: A deep learning based interactive system that automatically generates live lip sync for layered 2D characters using a Long Short Term Memory (LSTM) model that takes streaming audio as input and produces viseme sequences with less than 200ms of latency.
Abstract: The emergence of commercial tools for real-time performance-based 2D animation has enabled 2D characters to appear on live broadcasts and streaming platforms. A key requirement for live animation is fast and accurate lip sync that allows characters to respond naturally to other actors or the audience through the voice of a human performer. In this work, we present a deep learning based interactive system that automatically generates live lip sync for layered 2D characters using a Long Short Term Memory (LSTM) model. Our system takes streaming audio as input and produces viseme sequences with less than 200ms of latency (including processing time). Our contributions include specific design decisions for our feature definition and LSTM configuration that provide a small but useful amount of lookahead to produce accurate lip sync. We also describe a data augmentation procedure that allows us to achieve good results with a very small amount of hand-animated training data (13-20 minutes). Extensive human judgement experiments show that our results are preferred over several competing methods, including those that only support offline (non-live) processing. Video summary and supplementary results at GitHub link: this https URL
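The specific feature definition and LSTM configuration are not reproduced here; the PyTorch sketch below only shows the streaming pattern described, where the LSTM carries its hidden state across incoming frames and each prediction waits for a small fixed number of lookahead frames whose features are concatenated with the current frame. The feature dimension, viseme inventory, and lookahead size are assumptions.

```python
# Sketch of streaming viseme prediction with a small lookahead (assumed sizes).
import torch
import torch.nn as nn

N_FEATS, N_VISEMES, LOOKAHEAD = 26, 12, 2   # assumed feature dim, viseme classes, lookahead frames

class StreamingVisemeLSTM(nn.Module):
    """Each step sees the current frame plus LOOKAHEAD future frames, so
    predictions lag the stream by a few frames but stay low-latency."""
    def __init__(self, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(N_FEATS * (LOOKAHEAD + 1), hidden, batch_first=True)
        self.head = nn.Linear(hidden, N_VISEMES)

    def forward(self, x, state=None):          # x: (1, 1, N_FEATS*(LOOKAHEAD+1))
        out, state = self.lstm(x, state)
        return self.head(out), state            # per-frame logits + carried state

model = StreamingVisemeLSTM().eval()
state, buffer = None, []
with torch.no_grad():
    for step in range(10):                      # stand-in for a live audio stream
        buffer.append(torch.randn(N_FEATS))     # one new feature frame arrives
        if len(buffer) == LOOKAHEAD + 1:        # current frame plus its lookahead
            x = torch.cat(buffer).view(1, 1, -1)
            logits, state = model(x, state)
            print(f"frame {step - LOOKAHEAD}: viseme {int(logits.argmax(-1))}")
            buffer.pop(0)                       # slide the window by one frame
```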

11 citations


Journal ArticleDOI
TL;DR: A detailed objective evaluation shows that a combined dynamic viseme-phoneme speech unit, used with a many-to-many encoder-decoder architecture, models visual co-articulation effectively and significantly outperforms conventional phoneme-driven speech animation systems.

10 citations


Journal ArticleDOI
TL;DR: A structured approach to create speaker-dependent visemes with a fixed number of visemes within each set, based upon clustering phonemes, which significantly improves on previous lipreading results with RMAV speakers.
Abstract: Lipreading is understanding speech from observed lip movements. An observed series of lip motions is an ordered sequence of visual lip gestures. These gestures are commonly known, but as yet are not formally defined, as `visemes'. In this article, we describe a structured approach which allows us to create speaker-dependent visemes with a fixed number of visemes within each set. We create sets of visemes for sizes two to 45. Each set of visemes is based upon clustering phonemes, thus each set has a unique phoneme-to-viseme mapping. We first present an experiment using these maps and the Resource Management Audio-Visual (RMAV) dataset which shows the effect of changing the viseme map size in speaker-dependent machine lipreading and demonstrate that word recognition with phoneme classifiers is possible. Furthermore, we show that there are intermediate units between visemes and phonemes which are better still. Second, we present a novel two-pass training scheme for phoneme classifiers. This approach uses our new intermediary visual units from our first experiment in the first pass as classifiers; before using the phoneme-to-viseme maps, we retrain these into phoneme classifiers. This method significantly improves on previous lipreading results with RMAV speakers.
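The paper derives its speaker-dependent phoneme-to-viseme maps by clustering phonemes; the sketch below shows one generic way a map of a chosen size can be produced, by hierarchically clustering a phoneme confusion matrix, and is not the authors' exact procedure (the confusion values here are random stand-ins).

```python
# Illustrative sketch: derive a phoneme-to-viseme map of a chosen size by
# clustering a phoneme confusion matrix, so visually confusable phonemes
# end up sharing a viseme.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

phonemes = ["p", "b", "m", "f", "v", "t", "d", "s"]       # toy inventory
rng = np.random.default_rng(0)
confusion = rng.random((len(phonemes), len(phonemes)))     # stand-in for real confusion counts
confusion = (confusion + confusion.T) / 2                  # symmetrise
np.fill_diagonal(confusion, 1.0)

distance = 1.0 - confusion                                 # more confusion -> smaller distance
np.fill_diagonal(distance, 0.0)
Z = linkage(squareform(distance, checks=False), method="average")

n_visemes = 3                                              # desired viseme set size (2..45 in the paper)
labels = fcluster(Z, t=n_visemes, criterion="maxclust")
p2v = {p: f"V{l}" for p, l in zip(phonemes, labels)}
print(p2v)                                                 # unique phoneme-to-viseme mapping
```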

9 citations


Journal ArticleDOI
TL;DR: Findings show that mapping sounds to categories aids speech perception in "cocktail party" environments; visual cues help lattice formation of auditory-phonetic categories to enhance and refine speech identification, and additional viseme cues largely counteracted noise-related decrements in performance and stabilized classification speeds.
Abstract: Speech perception requires grouping acoustic information into meaningful linguistic-phonetic units via categorical perception (CP). Beyond shrinking observers' perceptual space, CP might aid degraded speech perception if categories are more resistant to noise than surface acoustic features. Combining audiovisual (AV) cues also enhances speech recognition, particularly in noisy environments. This study investigated the degree to which visual cues from a talker (i.e., mouth movements) aid speech categorization amidst noise interference by measuring participants' identification of clear and noisy speech (0 dB signal-to-noise ratio) presented in auditory-only or combined AV modalities (i.e., A, A+noise, AV, AV+noise conditions). Auditory noise expectedly weakened (i.e., shallower identification slopes) and slowed speech categorization. Interestingly, additional viseme cues largely counteracted noise-related decrements in performance and stabilized classification speeds in both clear and noise conditions, suggesting more precise acoustic-phonetic representations with multisensory information. Results are parsimoniously described under a signal detection theory framework and by a reduction (visual cues) and increase (noise) in the precision of perceptual object representation, which were not due to lapses of attention or guessing. Collectively, findings show that (i) mapping sounds to categories aids speech perception in “cocktail party” environments; (ii) visual cues help lattice formation of auditory-phonetic categories to enhance and refine speech identification.
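As an illustration of the "shallower identification slope" measure mentioned above, the sketch below fits a logistic psychometric function to toy identification data for a clear and a noisy condition and compares the fitted slopes; the data points and parameter values are fabricated and not taken from the study.

```python
# Sketch: quantify the "shallower identification slope under noise" effect by
# fitting a logistic psychometric function.  The data here are fabricated.
import numpy as np
from scipy.optimize import curve_fit

def psychometric(x, x0, k):
    """Proportion of 'category B' responses along a 7-step speech continuum."""
    return 1.0 / (1.0 + np.exp(-k * (x - x0)))

steps = np.arange(1, 8)
p_clear = np.array([0.02, 0.05, 0.12, 0.50, 0.88, 0.95, 0.98])   # toy clear-speech responses
p_noise = np.array([0.10, 0.20, 0.35, 0.50, 0.65, 0.80, 0.90])   # toy noisy-speech responses

(x0_c, k_clear), _ = curve_fit(psychometric, steps, p_clear, p0=[4.0, 1.0])
(x0_n, k_noise), _ = curve_fit(psychometric, steps, p_noise, p0=[4.0, 1.0])
print(f"slope clear = {k_clear:.2f}, slope noise = {k_noise:.2f}")  # noise -> shallower slope
```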

8 citations


Proceedings ArticleDOI
09 Apr 2019
TL;DR: The success of automated lip reading has been constrained by the inability to distinguish between homopheme words, which are words that have different characters and produce the same lip movements, despite being intrinsically different.
Abstract: The success of automated lip reading has been constrained by the inability to distinguish between homopheme words, which are words that have different characters and produce the same lip movements (e.g. "time" and "some"), despite being intrinsically different. One word can often have different phonemes (units of sound) producing exactly the same viseme or visual equivalent of a phoneme for a unit of sound. Through the use of a Long-Short Term Memory Network with word embeddings, we can distinguish between homopheme words or words that produce identical lip movements. The neural network architecture achieved a character accuracy rate of 77.1% and a word accuracy rate of 72.2%.
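The architecture is only named in the abstract (a Long-Short Term Memory network with word embeddings), so the sketch below is merely one plausible shape of the task under assumed sizes: sequences of viseme tokens are embedded, passed through an LSTM, and classified into words, so that sequence context can separate homophemes that share identical lip movements.

```python
# Sketch (assumed sizes and vocabularies, not the paper's exact model): classify
# a viseme-token sequence into a word using an embedding layer and an LSTM.
import torch
import torch.nn as nn

N_VISEME_TOKENS, N_WORDS = 14, 500     # assumed viseme inventory and word vocabulary

class VisemeToWord(nn.Module):
    def __init__(self, embed=32, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(N_VISEME_TOKENS, embed)
        self.lstm = nn.LSTM(embed, hidden, batch_first=True)
        self.head = nn.Linear(hidden, N_WORDS)

    def forward(self, tokens):                 # tokens: (batch, seq_len) of viseme ids
        out, _ = self.lstm(self.embed(tokens))
        return self.head(out[:, -1])           # classify from the final time step

logits = VisemeToWord()(torch.randint(0, N_VISEME_TOKENS, (8, 6)))
print(logits.shape)                            # torch.Size([8, 500])
```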

7 citations


Journal ArticleDOI
TL;DR: In this paper, a structured approach was proposed to create speaker-dependent visemes with a fixed number of visemes within each set, each set having a unique phoneme-to-viseme mapping.
Abstract: Lipreading is understanding speech from observed lip movements. An observed series of lip motions is an ordered sequence of visual lip gestures. These gestures are commonly known, but as yet are not formally defined, as `visemes’. In this article, we describe a structured approach which allows us to create speaker-dependent visemes with a fixed number of visemes within each set. We create sets of visemes for sizes two to 45. Each set of visemes is based upon clustering phonemes, thus each set has a unique phoneme-to-viseme mapping. We first present an experiment using these maps and the Resource Management Audio-Visual (RMAV) dataset which shows the effect of changing the viseme map size in speaker-dependent machine lipreading and demonstrate that word recognition with phoneme classifiers is possible. Furthermore, we show that there are intermediate units between visemes and phonemes which are better still. Second, we present a novel two-pass training scheme for phoneme classifiers. This approach uses our new intermediary visual units from our first experiment in the first pass as classifiers; before using the phoneme-to-viseme maps, we retrain these into phoneme classifiers. This method significantly improves on previous lipreading results with RMAV speakers.

5 citations


Patent
16 May 2019
TL;DR: In this article, the authors proposed a method for generating multimedia output that comprises receiving a text input and receiving an animated character input corresponding to an animated character including at least one movement characteristic.
Abstract: A method for generating multimedia output. The method comprises receiving a text input and receiving an animated character input corresponding to an animated character including at least one movement characteristic. The method includes analyzing the text input to determine at least one text characteristic of the text input. The method includes generating a viseme timeline by applying at least one viseme characteristic to each of the at least one text characteristic. Based on the viseme timeline, the method includes generating a multimedia output coordinating the at least one character movement of the animated character with the at least one viseme characteristic.
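The patent describes the viseme timeline only abstractly; the sketch below shows one simple way such a timeline could be built from text: words are looked up in a pronunciation dictionary, phonemes are mapped to visemes, and each viseme is given a nominal duration. The dictionary, the phoneme-to-viseme map, and the durations are all assumptions for illustration.

```python
# Sketch of the general idea: text -> phonemes -> visemes with nominal timings.
# The tiny pronunciation dictionary and durations are assumptions for illustration.
PRONUNCIATIONS = {"hello": ["HH", "AH", "L", "OW"], "world": ["W", "ER", "L", "D"]}
PHONEME_TO_VISEME = {"HH": "sil", "AH": "aa", "L": "l", "OW": "ow",
                     "W": "uw", "ER": "er", "D": "dd"}
PHONEME_DURATION = 0.09   # seconds per phoneme, nominal

def viseme_timeline(text):
    """Return a list of (viseme, start_sec, end_sec) entries for the input text."""
    timeline, t = [], 0.0
    for word in text.lower().split():
        for ph in PRONUNCIATIONS.get(word, []):
            viseme = PHONEME_TO_VISEME.get(ph, "sil")
            timeline.append((viseme, round(t, 2), round(t + PHONEME_DURATION, 2)))
            t += PHONEME_DURATION
    return timeline

print(viseme_timeline("hello world"))
```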

4 citations


Journal ArticleDOI
18 Apr 2019
TL;DR: Analysis shows that bimodal speech recognition exceeds unimodal recognition in accuracy, especially for low SNR values below 15 dB, while at very low SNR values below 5 dB the best results are achieved by a unimodal visual speech recognition system.
Abstract: Introduction: The effectiveness of modern automatic speech recognition systems in quiet acoustic conditions is quite high and reaches 90-95%. However, in noisy uncontrolled environment, acoustic signals are often distorted, which greatly reduces the resulting recognition accuracy. In adverse conditions, it seems appropriate to use the visual information about the speech, as it is not affected by the acoustic noise. At the moment, there are no studies which objectively reflect the dependence of visual speech recognition accuracy on the video frame rate, and there are no relevant audio-visual databases for model training. Purpose : Improving the reliability and accuracy of the automatic audio-visual Russian speech recognition system; collecting representative audio-visual database and developing an experimental setup. Methods : For audio-visual speech recognition, we used coupled hidden Markov model architectures. For parametric representation of audio and visual features, we used mel-frequency cepstral coefficients and principal component analysis-based pixel features. Results: In the experiments, we studied 5 different rates of video data: 25, 50, 100, 150, and 200 fps. Experiments have shown a positive effect from the use of a high-speed video camera: we achieved an absolute increase in accuracy of 1.48% for a bimodal system and 3.10% for a unimodal one, as compared to the standard recording speed of 25 fps. During the experiments, test data for all speakers were added with two types of noise: wide-band white noise and “babble noise”. Analysis shows that bimodal speech recognition exceeds unimodal in accuracy, especially for low SNR values <15 dB. At very low SNR values <5 dB, the acoustic information becomes non-informative, and the best results are achieved by a unimodal visual speech recognition system. Practical relevance : The use of a high-speed camera can improve the accuracy and robustness of a continuous audio-visual Russian speech recognition system.
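Only the visual feature side of the described system is sketched below: PCA-based pixel features computed from flattened mouth-region frames and concatenated frame by frame with audio features, standing in for the MFCC + PCA front end that feeds the coupled HMMs. The array shapes and component count are assumptions.

```python
# Sketch of the feature side only: PCA over flattened mouth-region frames,
# then frame-level concatenation with audio features.  Shapes are placeholders.
import numpy as np
from sklearn.decomposition import PCA

n_frames = 200
mouth_rois = np.random.rand(n_frames, 32, 32)        # stand-in for grayscale mouth crops
mfccs = np.random.rand(n_frames, 13)                 # stand-in for per-frame MFCC vectors

pca = PCA(n_components=32)                           # assumed number of pixel components
visual_feats = pca.fit_transform(mouth_rois.reshape(n_frames, -1))

bimodal_feats = np.hstack([mfccs, visual_feats])     # one fused vector per frame
print(bimodal_feats.shape)                           # (200, 45)
```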

3 citations


Book ChapterDOI
01 Jan 2019
TL;DR: The proposed approach (combining normal, throat, and visual features) shows 94% recognition accuracy, which is better than the unimodal and bimodal ASR systems.

Abstract: Building an ASR system in adverse conditions is a challenging task. The performance of an ASR system is high in clean environments. However, variabilities such as speaker effects, transmission effects, and environmental conditions degrade the recognition performance of the system. One way to enhance the robustness of an ASR system is to use multiple sources of information about speech. In this work, two additional sources of information about speech are used to build a multimodal ASR system. Throat microphone speech and visual lip reading, which are less susceptible to noise, act as alternate sources of information. Mel-frequency cepstral features are extracted from the throat signal and modeled by HMMs. Pixel-based transformation methods (DCT and DWT) are used to extract features from the visemes of the video data, which are also modeled by HMMs. Throat and visual features are combined at the feature level. The proposed system has improved recognition accuracy compared to the unimodal systems. A digit database for the English language is used for the study. The experiments are carried out for both the unimodal systems and the combined systems. The combined features from the normal and throat microphones give 86.5% recognition accuracy. Visual speech features combined with the normal microphone produce 84% accuracy. The proposed approach (combining normal, throat, and visual features) shows 94% recognition accuracy, which is better than the unimodal and bimodal ASR systems.
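As a small illustration of the feature-level fusion described above, the sketch below computes 2D DCT coefficients of a mouth image, keeps the low-frequency top-left block, and concatenates them with a throat-microphone feature vector; the image size, block size, and feature dimensions are assumptions, and the HMM modelling step is omitted.

```python
# Sketch: 2D DCT features from a mouth image (keep the low-frequency top-left
# block) concatenated with an audio feature vector.  Sizes are illustrative.
import numpy as np
from scipy.fft import dct

def dct2_features(image, block=6):
    """Type-II 2D DCT; the top-left block holds most of the low-frequency energy."""
    coeffs = dct(dct(image, axis=0, norm="ortho"), axis=1, norm="ortho")
    return coeffs[:block, :block].ravel()

mouth = np.random.rand(32, 32)          # stand-in for one grayscale mouth crop
throat_mfcc = np.random.rand(13)        # stand-in for one throat-microphone MFCC frame

fused = np.concatenate([throat_mfcc, dct2_features(mouth)])
print(fused.shape)                      # (49,) -> one fused observation per frame
```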

Proceedings ArticleDOI
09 Apr 2019
TL;DR: Through the use of a Long-Short Term Memory Network with word embeddings, this work can distinguish between homopheme words or words that produce identical lip movements.
Abstract: The success of automated lip reading has been constrained by the inability to distinguish between homopheme words, which are words that have different characters and produce the same lip movements (e.g. "time" and "some"), despite being intrinsically different. One word can often have different phonemes (units of sound) producing exactly the same viseme or visual equivalent of a phoneme for a unit of sound. Through the use of a Long-Short Term Memory Network with word embeddings, we can distinguish between homopheme words or words that produce identical lip movements. The neural network architecture achieved a character accuracy rate of 77.1% and a word accuracy rate of 72.2%.

Patent
08 Aug 2019
TL;DR: In this paper, a method to interactively convert a source language video/audio stream into one or more target languages in high definition video format using a computer is presented, where spoken words in the converted language are synchronized with synthesized movements of a rendered mouth.
Abstract: A method to interactively convert a source language video/audio stream into one or more target languages in high-definition video format using a computer. The spoken words in the converted language are synchronized with synthesized movements of a rendered mouth. Original audio and video streams from pre-recorded or live sermons are synthesized into another language with the original emotional and tonal characteristics. The original sermon could be in any language and be translated into any other language. The mouth and jaw are digitally rendered with viseme and phoneme morphing targets that are pre-generated for lip synching with the synthesized target-language audio. Each video image frame has the simulated lips and jaw inserted over the original. The new audio and video are then encoded and uploaded for internet viewing or recording to a storage medium.

Journal ArticleDOI
TL;DR: The coarticulation effect in visual speech is studied by creating a many-to-many allophone-to-viseme mapping based on the data-driven approach only; both mapping methods make use of the K-means data clustering algorithm.
Abstract: Knowledge about the phonemes and visemes of a language is a vital component in the making of any speech-based application in that language. A phoneme is an atomic unit of acoustic speech that can differentiate meaning. A viseme is the equivalent atomic unit in the visual realm, describing distinct dynamic visual speech gestures. The initial phase of the paper introduces a many-to-one phoneme-to-viseme mapping for the Malayalam language based on linguistic knowledge and a data-driven approach. In the next stage, the coarticulation effect in visual speech is studied by creating a many-to-many allophone-to-viseme mapping based on the data-driven approach only. Since linguistic history in the visual realm has been less explored for the Malayalam language, both mapping methods make use of the K-means data clustering algorithm. The optimum number of clusters is determined using the Gap statistic method with prior knowledge about the range of clusters. This work was carried out on a Malayalam audio-visual speech database created by the authors of this paper, which consists of 50 isolated phonemes and 106 connected words. From the 50 isolated Malayalam phonemes, 14 visemes were linguistically identified and compared with the results obtained from the data-driven approach for whole phonemes and consonant phonemes. The many-to-many mapping was studied for whole allophones, vowel allophones, and consonant allophones. Geometric and DCT-based parameters are extracted and examined to find the parametric phoneme and allophone clustering in the visual domain.
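The optimum cluster count is selected with the Gap statistic in the paper; the sketch below implements a basic form of that criterion over K-means on toy lip-feature vectors. The feature values, candidate range, and the simple "largest gap" selection rule are illustrative assumptions rather than the authors' exact settings.

```python
# Sketch: pick the number of viseme clusters with a basic gap statistic over
# K-means.  The toy features are random stand-ins for geometric/DCT lip parameters.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.normal(size=(60, 8))                  # stand-in lip feature vectors

def gap_statistic(X, k, n_refs=10):
    """Gap(k) = E[log W_k(reference)] - log W_k(data), with W_k = K-means inertia."""
    logw = np.log(KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_)
    ref_logw = []
    for _ in range(n_refs):
        ref = rng.uniform(X.min(axis=0), X.max(axis=0), size=X.shape)
        ref_logw.append(np.log(KMeans(n_clusters=k, n_init=10).fit(ref).inertia_))
    return np.mean(ref_logw) - logw

gaps = {k: gap_statistic(features, k) for k in range(2, 8)}   # candidate cluster counts
best_k = max(gaps, key=gaps.get)                               # simplified selection rule
print(gaps, "-> chosen number of visemes:", best_k)
```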


Proceedings ArticleDOI
01 May 2019
TL;DR: This work addresses viseme recognition in still images and explores the synthetic generation of additional views to improve overall accuracy, using Generative Adversarial Networks trained with synthetic data to map from mouth images acquired in a single arbitrary view to frontal and side views.
Abstract: Speech recognition technologies in the visual domain currently can only identify words and sentences in still images. Identifying visemes (i.e., the smallest visual units of spoken text) is useful when there are no language models or dictionaries available, which is often the case for languages besides English; however, it is a challenge, as temporal information cannot be extracted. In parallel, previous works demonstrated that exploring data acquired simultaneously under multiple views can improve the recognition accuracy in comparison to single-view data. For many different applications, however, most of the available audio-visual datasets are obtained from a single view, essentially due to acquisition limitations. In this work, we address viseme recognition in still images and explore the synthetic generation of additional views to improve overall accuracy. For that, we use Generative Adversarial Networks (GANs) trained with synthetic data and map from mouth images acquired in a single arbitrary view to frontal and side views – in which the face is rotated vertically at approximately 30°, 45°, and 60°. Then, we use a state-of-the-art Convolutional Neural Network for classifying the visemes and compare its performance when training only with the original single-view images versus training with the additional views artificially generated by the GANs. We run experiments using three audiovisual corpora acquired under different conditions (GRID, AVICAR, and OuluVS2 datasets) and our results indicate that the additional views synthesized by the GANs are able to improve the viseme recognition accuracy on all tested scenarios.
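The GAN training itself is beyond the scope of a short example; the sketch below only illustrates the augmentation step that follows it, merging original single-view images with synthesized frontal and side views into a single training set for a viseme classifier. All tensors and labels are random stand-ins.

```python
# Sketch of the augmentation idea only: merge original single-view mouth images
# with synthesized extra views into one training set.  Tensors are random stand-ins.
import torch
from torch.utils.data import TensorDataset, ConcatDataset, DataLoader

def fake_images(n):                      # stand-in for 64x64 grayscale mouth crops
    return torch.rand(n, 1, 64, 64)

n_visemes = 12
original = TensorDataset(fake_images(100), torch.randint(0, n_visemes, (100,)))
synthesized_frontal = TensorDataset(fake_images(100), torch.randint(0, n_visemes, (100,)))  # would come from the GAN
synthesized_side = TensorDataset(fake_images(100), torch.randint(0, n_visemes, (100,)))     # e.g. rotated views

loader = DataLoader(ConcatDataset([original, synthesized_frontal, synthesized_side]),
                    batch_size=32, shuffle=True)
images, labels = next(iter(loader))
print(images.shape, labels.shape)        # torch.Size([32, 1, 64, 64]) torch.Size([32])
```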

Journal ArticleDOI
TL;DR: A completely automatic system is designed by selecting acoustic features discriminative to both emotion and phoneme tags under a multi-task learning framework and associating each phoneme + emotion tuple with a key facial frame.

Patent
26 Feb 2019
TL;DR: In this paper, a system and method for training a set of expression and neutral convolutional neural networks using a single performance mapped to the set of known phonemes and visemes in the form of predetermined sentences and facial expressions is described.
Abstract: There is disclosed a system and method for training a set of expression and neutral convolutional neural networks using a single performance mapped to a set of known phonemes and visemes in the form of predetermined sentences and facial expressions. Then, subsequent training of the convolutional neural networks can occur using temporal data derived from audio data within the original performance mapped to a set of professionally-created three-dimensional animations. Thereafter, with sufficient training, the expression and neutral convolutional neural networks can generate facial animations from facial image data in real time without individual-specific training.

Patent
31 May 2019
TL;DR: In this paper, a dual-viseme mouth shape synthesis method is proposed to solve the technical problem of the low fidelity of the pronunciation mouth shapes reproduced by existing mouth shape synthesis techniques.
Abstract: The embodiment of the invention discloses a dual-viseme mouth shape synthesis method. The method is intended to solve the technical problem of the low fidelity of the pronunciation mouth shapes reproduced by existing mouth shape synthesis techniques. The method comprises the following steps: the pronunciations of standard Chinese are classified into 13 viseme categories; corresponding mouth shape videos are recorded according to the viseme classification; a basic mouth shape viseme library is established from the original mouth shape videos; the basic mouth shape viseme library is subjected to dual-viseme treatment to obtain a basic mouth shape dual-viseme library; speech recognition is used to transcribe newly input speech into text, and after initial consonant and simple or compound vowel recognition of the text, the mouth shape viseme corresponding to each initial consonant or simple or compound vowel is looked up in the basic mouth shape dual-viseme library; the mouth shape visemes are inserted at the corresponding points in time to form a discrete mouth shape sequence; and the discrete mouth shape sequence is smoothed to obtain a continuous mouth shape sequence.
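The patent gives the pipeline only at a high level; the sketch below illustrates its final steps: a key mouth-shape value is looked up for each recognized unit, placed at its time point to form a discrete sequence, and smoothed into a continuous curve by interpolation. The viseme classes and openness values are placeholders.

```python
# Sketch of the final steps only: place a key mouth-shape value at each
# recognized unit's time point, then smooth the discrete sequence into a
# continuous curve.  Viseme classes and parameter values are placeholders.
import numpy as np

VISEME_OPENNESS = {"b": 0.0, "a": 0.9, "o": 0.6, "i": 0.3}    # toy mouth-openness per viseme class

def mouth_curve(events, fps=25.0):
    """events: list of (viseme, time_sec).  Returns a per-frame openness curve
    obtained by linear interpolation between the discrete key poses."""
    times = np.array([t for _, t in events])
    values = np.array([VISEME_OPENNESS[v] for v, _ in events])
    frames = np.arange(0.0, times[-1] + 1.0 / fps, 1.0 / fps)
    return frames, np.interp(frames, times, values)            # smoothing by interpolation

frames, openness = mouth_curve([("b", 0.0), ("a", 0.15), ("o", 0.35), ("i", 0.55)])
print(np.round(openness[:8], 2))
```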