
Showing papers on "Viseme published in 2015"


Journal ArticleDOI
TL;DR: A multiple-input-driven realistic facial animation system based on a 3-D virtual head for human–machine interfaces, combining a parameterized model with a muscular model, is proposed, and objective and subjective experiments show that the system is suitable for human–machine interaction.
Abstract: A multiple-input-driven realistic facial animation system based on a 3-D virtual head for human–machine interfaces is proposed. The system can be driven independently by video, text, and speech, and thus can interact with humans through diverse interfaces. The combination of a parameterized model and a muscular model is used to obtain a tradeoff between computational efficiency and high realism of 3-D facial animation. The online appearance model is used to track 3-D facial motion from video in the framework of particle filtering, and multiple measurements, i.e., the pixel color values of the input image and the Gabor wavelet coefficients of the illumination ratio image, are fused to reduce the influence of lighting and person dependence in the construction of the online appearance model. The tri-phone model is used to reduce the computational cost of visual co-articulation in speech-synchronized viseme synthesis without sacrificing performance. The objective and subjective experiments show that the system is suitable for human–machine interaction.
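To make the tri-phone idea concrete, here is a small illustrative Python sketch (not the authors' system) that caches a co-articulated viseme target per (previous, current, next) phoneme context so that each context is computed only once; the phoneme table and blend weights are invented for the example.

```python
# Illustrative sketch: caching co-articulated viseme targets by tri-phone context.
from functools import lru_cache

# Hypothetical phoneme-to-base-viseme table for illustration only.
BASE_VISEME = {"p": "bilabial", "b": "bilabial", "m": "bilabial",
               "f": "labiodental", "v": "labiodental", "aa": "open"}

@lru_cache(maxsize=None)
def triphone_viseme(prev_ph: str, cur_ph: str, next_ph: str) -> tuple:
    """Return (viseme, weight) targets blended with the neighbouring phonemes.

    The fixed blend weights stand in for a co-articulation rule; a real system
    would derive them from articulation data rather than constants.
    """
    centre = BASE_VISEME.get(cur_ph, "neutral")
    left = BASE_VISEME.get(prev_ph, "neutral")
    right = BASE_VISEME.get(next_ph, "neutral")
    return ((centre, 0.6), (left, 0.2), (right, 0.2))

def synthesize(phonemes):
    """Map a phoneme string to viseme targets using tri-phone context."""
    padded = ["sil"] + list(phonemes) + ["sil"]
    return [triphone_viseme(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]

print(synthesize(["p", "aa", "m"]))
```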

32 citations


Journal ArticleDOI
TL;DR: It is hypothesized that perceivers can discriminate the phonemes within typical viseme groups, and that discrimination measured with d-prime (d’) and response latency is related to visual stimulus dissimilarities between consonant segments.
Abstract: From phonetic features to connected discourse, every level of psycholinguistic structure including prosody can be perceived through viewing the talking face. Yet a longstanding notion in the literature is that visual speech perceptual categories comprise groups of phonemes (referred to as visemes), such as /p, b, m/ and /f, v/, whose internal structure is not informative to the visual speech perceiver. This conclusion has not to our knowledge been evaluated using a psychophysical discrimination paradigm. We hypothesized that perceivers can discriminate the phonemes within typical viseme groups, and that discrimination measured with d-prime (d') and response latency is related to visual stimulus dissimilarities between consonant segments. In Experiment 1, participants performed speeded discrimination for pairs of consonant-vowel (CV) spoken nonsense syllables that were predicted to be same, near, or far in their perceptual distances, and that were presented as natural or synthesized video. Near pairs were within-viseme consonants. Natural within-viseme stimulus pairs were discriminated significantly above chance (except for /k/-/h/). Sensitivity (d') increased and response times decreased with distance. Discrimination and identification were superior with natural stimuli, which comprised more phonetic information. We suggest that the notion of the viseme as a unitary perceptual category is incorrect. Experiment 2 probed the perceptual basis for visual speech discrimination by inverting the stimuli. Overall reductions in d' with inverted stimuli but a persistent pattern of larger d' for far than for near stimulus pairs are interpreted as evidence that visual speech is represented by both its motion and configural attributes. The methods and results of this investigation open up avenues for understanding the neural and perceptual bases for visual and audiovisual speech perception and for development of practical applications such as visual speech synthesis.
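For readers unfamiliar with the sensitivity measure, the following minimal Python example shows how a d' value of the kind reported above can be computed from hit and false-alarm counts; the counts are invented, and the simple yes/no formula is used here rather than the paper's exact same-different analysis.

```python
# Worked example of the d-prime (d') sensitivity measure:
# d' = z(hit rate) - z(false-alarm rate), with a standard correction
# that keeps the rates away from 0 and 1.
from scipy.stats import norm

def d_prime(hits, misses, false_alarms, correct_rejections):
    n_signal = hits + misses
    n_noise = false_alarms + correct_rejections
    hit_rate = (hits + 0.5) / (n_signal + 1.0)        # log-linear correction
    fa_rate = (false_alarms + 0.5) / (n_noise + 1.0)
    return norm.ppf(hit_rate) - norm.ppf(fa_rate)

# Hypothetical counts for a within-viseme pair such as /p/-/b/:
print(round(d_prime(hits=62, misses=38, false_alarms=20, correct_rejections=80), 2))
```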

25 citations


Journal ArticleDOI
TL;DR: This study investigated the relationship between clearly produced and plain citation form speech styles and motion of visible articulators, and found significant effects of speech style as well as speaker gender and saliency of visual speech cues.

23 citations


OtherDOI
24 Apr 2015

18 citations


Journal ArticleDOI
TL;DR: An active appearance model (AAM) is used to extract the visual features, as it finely represents the shape and appearance information of the jaw and lip region, in a phonetic and visemic information-based audio-visual speech recognizer (AVSR).

16 citations


15 Sep 2015
TL;DR: This paper used a phoneme-clustering method to form new phoneme-to-viseme maps for both individual and multiple speakers, and used these maps to examine how similarly speakers talk visually.
Abstract: In machine lip-reading, which is the identification of speech from visual-only information, there is evidence to show that visual speech is highly dependent upon the speaker (Cox et al., 2008). Here, we use a phoneme-clustering method to form new phoneme-to-viseme maps for both individual and multiple speakers. We use these maps to examine how similarly speakers talk visually. We conclude that, broadly speaking, speakers have the same repertoire of mouth gestures; where they differ is in their use of those gestures.
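A rough sketch of the general phoneme-clustering idea (not the authors' exact procedure): cluster phonemes whose recognition-confusion profiles are similar and read viseme classes off the clusters. The confusion matrix below is invented for illustration.

```python
# Hierarchical clustering of phoneme confusion profiles into viseme classes.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

phonemes = ["p", "b", "m", "f", "v", "k"]
confusion = np.array([            # rows: spoken phoneme, columns: recognised phoneme
    [30, 25, 20, 2, 2, 1],
    [24, 32, 18, 3, 2, 1],
    [22, 20, 35, 1, 2, 0],
    [2,  3,  1, 40, 30, 4],
    [1,  2,  2, 28, 42, 5],
    [3,  2,  1,  5,  6, 63],
], dtype=float)

rows = confusion / confusion.sum(axis=1, keepdims=True)       # confusion profiles
labels = fcluster(linkage(pdist(rows, metric="cosine"), method="average"),
                  t=2, criterion="maxclust")                   # ask for two visemes

for cluster in sorted(set(labels)):
    members = [p for p, l in zip(phonemes, labels) if l == cluster]
    print(f"viseme V{cluster}: {members}")
```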

15 citations


Patent
05 Feb 2015
TL;DR: In this article, a method and apparatus for speech recognition and for generating a speech recognition engine, together with the speech recognition engine itself, are presented, in which the engine obtains a phoneme sequence from the speech input and provides the recognition result based on the phonetic distance of the phoneme sequence.
Abstract: A method and apparatus for speech recognition and for generation of speech recognition engine, and a speech recognition engine are provided. The method of speech recognition involves receiving a speech input, transmitting the speech input to a speech recognition engine, and receiving a speech recognition result from the speech recognition engine, in which the speech recognition engine obtains a phoneme sequence from the speech input and provides the speech recognition result based on a phonetic distance of the phoneme sequence.
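As an illustration of what a "phonetic distance" between phoneme sequences can look like, here is a hedged Python sketch of a weighted edit distance; the substitution costs are invented and are not taken from the patent.

```python
# Weighted Levenshtein distance where substituting similar phonemes is cheaper.
def phonetic_distance(seq_a, seq_b, sub_cost=None, indel=1.0):
    sub_cost = sub_cost or {}
    def cost(a, b):
        if a == b:
            return 0.0
        return sub_cost.get((a, b), sub_cost.get((b, a), 1.0))

    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i * indel
    for j in range(cols):
        d[0][j] = j * indel
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(d[i - 1][j] + indel,                 # deletion
                          d[i][j - 1] + indel,                 # insertion
                          d[i - 1][j - 1] + cost(seq_a[i - 1], seq_b[j - 1]))
    return d[-1][-1]

# Acoustically/visually close phonemes get a reduced substitution cost (made up here).
print(phonetic_distance(["b", "ae", "t"], ["p", "ae", "t"],
                        sub_cost={("b", "p"): 0.3}))           # -> 0.3
```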

13 citations


15 Sep 2015
TL;DR: The authors use a structured approach for devising speaker-dependent viseme classes, which enables the creation of a set of phoneme-to-viseme maps where each has a different quantity of visemes ranging from two to 45.
Abstract: In machine lip-reading there is continued debate and research around the correct classes to be used for recognition. In this paper we use a structured approach for devising speaker-dependent viseme classes, which enables the creation of a set of phoneme-to-viseme maps where each has a different quantity of visemes ranging from two to 45. Viseme classes are based upon the mapping of articulated phonemes, which have been confused during phoneme recognition, into viseme groups. Using these maps, with the LiLIR dataset, we show the effect of changing the viseme map size in speaker-dependent machine lip-reading, measured by word recognition correctness and so demonstrate that word recognition with phoneme classifiers is not just possible, but often better than word recognition with viseme classifiers. Furthermore, there are intermediate units between visemes and phonemes which are better still.
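The following toy Python sketch shows what applying phoneme-to-viseme maps of different sizes looks like in practice; the maps below are invented stand-ins, not the paper's LiLIR-derived maps.

```python
# Collapsing a phoneme transcription into viseme classes for maps of two sizes.
TOY_MAPS = {
    2: {"p": "V1", "b": "V1", "m": "V1", "f": "V1", "aa": "V2", "iy": "V2"},
    4: {"p": "V1", "b": "V1", "m": "V1", "f": "V2", "aa": "V3", "iy": "V4"},
}

def to_visemes(phoneme_seq, n_visemes):
    """Relabel a phoneme transcription using a map with n_visemes classes."""
    mapping = TOY_MAPS[n_visemes]
    return [mapping.get(p, "V?") for p in phoneme_seq]

transcription = ["b", "aa", "m", "iy"]
for size in (2, 4):
    print(size, to_visemes(transcription, size))
# Larger maps keep more phoneme identity; smaller maps merge more classes,
# which is the trade-off the paper measures via word correctness.
```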

13 citations


Patent
06 Aug 2015
TL;DR: In this paper, a system for generating visually consistent alternative audio for redubbing visual speech using a processor configured to sample a dynamic viseme sequence corresponding to a given utterance by a speaker in a video was presented.
Abstract: There are provided systems and methods for generating a visually consistent alternative audio for redubbing visual speech using a processor configured to sample a dynamic viseme sequence corresponding to a given utterance by a speaker in a video, identify a plurality of phonemes corresponding to the dynamic viseme sequence, construct a graph of the plurality of phonemes that synchronize with a sequence of lip movements of a mouth of the speaker in the dynamic viseme sequence, and use the graph to generate an alternative phrase that substantially matches the sequence of lip movements of the mouth of the speaker in the video.

11 citations


Proceedings ArticleDOI
19 Apr 2015
TL;DR: This paper introduces a method for automatic redubbing of video that exploits the many-to-many mapping of phoneme sequences to lip movements modelled as dynamic visemes, and explores the natural ambiguity in visual speech.
Abstract: This paper introduces a method for automatic redubbing of video that exploits the many-to-many mapping of phoneme sequences to lip movements modelled as dynamic visemes [1]. For a given utterance, the corresponding dynamic viseme sequence is sampled to construct a graph of possible phoneme sequences that synchronize with the video. When composed with a pronunciation dictionary and language model, this produces a vast number of word sequences that are in sync with the original video, literally putting plausible words into the mouth of the speaker. We demonstrate that traditional, many-to-one, static visemes lack flexibility for this application as they produce significantly fewer word sequences. This work explores the natural ambiguity in visual speech and offers insight for automatic speech recognition and the importance of language modeling.
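A minimal Python sketch of the core combinatorial step (an assumption, not taken from the paper): each dynamic viseme admits several phoneme sub-sequences, and concatenations of those candidates that a pronunciation dictionary accepts become alternative words in sync with the video. The candidate lists and tiny lexicon are invented.

```python
# Enumerate phoneme paths through per-viseme candidates and keep dictionary words.
from itertools import product

candidates = [                       # candidate phoneme strings per dynamic viseme
    [("p",), ("b",), ("m",)],
    [("aa",), ("ae",)],
    [("t",), ("d",), ("n",)],
]

lexicon = {("b", "ae", "d"): "bad", ("p", "ae", "t"): "pat",
           ("m", "ae", "n"): "man", ("b", "ae", "t"): "bat"}

phrases = []
for path in product(*candidates):                       # one phoneme string per path
    phones = tuple(p for segment in path for p in segment)
    if phones in lexicon:                               # keep paths the dictionary accepts
        phrases.append(lexicon[phones])

print(phrases)   # several different words all fit the same lip movements
```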

9 citations


Patent
01 Dec 2015
TL;DR: In this article, phoneme context is defined as a phoneme that is adjacent to the one or more complete phonemes that correspond to a given viseme unit, and potential sets of viseme units that correspond to individual phoneme string portions may be determined.
Abstract: Speech animation may be performed using visemes with phonetic boundary context. A viseme unit may comprise an animation that simulates lip movement of an animated entity. Individual ones of the viseme units may correspond to one or more complete phonemes and phoneme context of the one or more complete phonemes. Phoneme context may include a phoneme that is adjacent to the one or more complete phonemes that correspond to a given viseme unit. Potential sets of viseme units that correspond with individual phoneme string portions may be determined. One of the potential sets of viseme units may be selected for individual ones of the phoneme string portions based on a fit metric that conveys a match between individual ones of the potential sets and the corresponding phoneme string portion.
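As a hedged illustration of selecting viseme units with a fit metric, the sketch below scores hypothetical candidate units against phoneme string portions; the units, contexts, and scoring weights are invented and not taken from the patent.

```python
# Pick, for each phoneme-string portion, the candidate viseme unit that fits best.
def fit_metric(unit, portion):
    """Score how well a unit's phonemes and boundary context match a portion."""
    core_match = sum(a == b for a, b in zip(unit["phonemes"], portion)) / len(portion)
    context_bonus = 0.2 if unit.get("context") == portion[-1] else 0.0
    return core_match + context_bonus

def select_units(portions, candidate_sets):
    chosen = []
    for portion, candidates in zip(portions, candidate_sets):
        chosen.append(max(candidates, key=lambda u: fit_metric(u, portion)))
    return chosen

portions = [("p", "aa"), ("aa", "t")]
candidate_sets = [
    [{"name": "V_pa", "phonemes": ("p", "aa"), "context": "aa"},
     {"name": "V_ba", "phonemes": ("b", "aa"), "context": "t"}],
    [{"name": "V_at", "phonemes": ("aa", "t"), "context": "t"},
     {"name": "V_ad", "phonemes": ("aa", "d"), "context": "d"}],
]
print([u["name"] for u in select_units(portions, candidate_sets)])
```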

Journal ArticleDOI
TL;DR: The present research aims to develop a text-to-audiovisual synthesizer for Indonesian language based on inputted Indonesian text, called TTAVI (Text-To-AudioVisual synthesizer for Indonesian language), and the morphing viseme algorithm makes the phoneme pronunciation of the virtual character produced by the TTAVI synthesizer smoother.
Abstract: There are many researches held on the text-to-audiovisual, but only a few are applied on Indonesian language. The results of the present research can be applied to a very wide field, e.g. gaming industry, animation industry, human computer interaction systems, etc. The correspondence among speech, mouth movements (visual phoneme/viseme) and the phoneme spoken is needed to produce a realistic text-to-audiovisual. This research aims to develop a text-to-audiovisual synthesizer for Indonesian language based on inputted Indonesian text called TTAVI (Text-To-AudioVisual synthesizer for Indonesian language). The method consists of four major parts, namely, building the models of Indonesian’s visemes, converting text to speech, the synchronization process, and stringing the visemes by using the morphing viseme algorithm. The morphing viseme algorithm makes the virtual character's pronunciation of phonemes, as produced by the TTAVI synthesizer, smoother. Ten Indonesian texts input to the TTAVI synthesizer were evaluated by 30 users. The appraisal results of users were calculated by applying Mean Opinion Score (MOS) methods. The average MOS is 4.106 on a scale from 1 to 5. This shows that the TTAVI synthesizer is considered good, and the morphing viseme algorithm is able to make the result of the TTAVI synthesizer smoother.
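For reference, the MOS aggregation reported above is a simple average of 1-5 user ratings; the ratings in this tiny example are invented, only the procedure matches.

```python
# Mean Opinion Score (MOS): the arithmetic mean of per-user 1-5 ratings.
ratings = [4, 5, 4, 3, 5, 4, 4, 5, 3, 4]     # hypothetical scores from ten users
mos = sum(ratings) / len(ratings)
print(round(mos, 3))                          # e.g. 4.1 on the 1-5 scale
```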

19 Dec 2015
TL;DR: A feature-fusion audio-visual speech recognition system that extracts lip geometry from the mouth region using a combination of skin color filter, border following and convex hull, and classification using a Hidden Markov Model is described.
Abstract: Humans are often able to compensate for noise degradation and uncertainty in speech information by augmenting the received audio with visual information. Such bimodal perception generates a rich combination of information that can be used in the recognition of speech. However, due to wide variability in the lip movement involved in articulation, not all speech can be substantially improved by audio-visual integration. This paper describes a feature-fusion audio-visual speech recognition (AVSR) system that extracts lip geometry from the mouth region using a combination of skin color filter, border following and convex hull, and classification using a Hidden Markov Model. The comparison of the new approach with a conventional audio-only system is made when operating under simulated ambient noise conditions that affect the spoken phrases. The experimental results demonstrate that, in the presence of audio noise, the audio-visual approach significantly improves speech recognition accuracy compared with the audio-only approach.
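A rough OpenCV-based sketch of the geometry-extraction stage described above (skin-colour filtering, border following, convex hull); the thresholds and file name are placeholders, and the HMM classification stage is omitted.

```python
# Extract simple lip-geometry features from a cropped mouth image with OpenCV.
import cv2
import numpy as np

frame = cv2.imread("mouth_roi.png")          # placeholder path to a cropped mouth region
if frame is None:
    raise SystemExit("provide a cropped mouth-region image first")

hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
# Keep lip-coloured pixels; these HSV bounds are illustrative, not tuned.
mask = cv2.inRange(hsv, np.array([160, 40, 60]), np.array([180, 255, 255]))

# Border following (cv2.findContours), then a convex hull around the largest blob.
contours = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]
if contours:
    lip = max(contours, key=cv2.contourArea)
    hull = cv2.convexHull(lip)
    x, y, w, h = cv2.boundingRect(hull)
    print([w, h, cv2.contourArea(hull)])     # simple geometric features for the HMM stage
```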

01 Jan 2015
TL;DR: A decision tree-based viseme clustering technique that allows visual speech synthesis after training on a small dataset of phonetically-annotated audiovisual speech and shows that this approach leads to a clear improvement over a comparable baseline in perceptual tests.
Abstract: We present a decision tree-based viseme clustering technique that allows visual speech synthesis after training on a small dataset of phonetically-annotated audiovisual speech. The decision trees allow improved viseme grouping by incorporating k-means clustering into the training algorithm. The use of overlapping dynamic visemes, defined by tri-phone time-varying oral pose boundaries, allows improved modelling of coarticulation effects. We show that our approach leads to a clear improvement over a comparable baseline in perceptual tests. The avatar is based on the freely available MakeHuman and Blender software components.

01 Jan 2015
TL;DR: It is shown using objective and subjective testing that a HMM synthesizer trained using dynamic visemes can generate better visual speech than HMM synthesizers trained using either phone or traditional viseme units.
Abstract: In this paper we incorporate dynamic visemes into hidden Markov model (HMM)-based visual speech synthesis. Dynamic visemes represent intuitive visual gestures identified automatically by clustering purely visual speech parameters. They have the advantage of spanning multiple phones and so they capture the effects of visual coarticulation explicitly within the unit. The previous application of dynamic visemes to synthesis used a sample-based approach, where cluster centroids were concatenated to form parameter trajectories corresponding to novel visual speech. In this paper we generalize the use of these units to create more flexible and dynamic animation using a HMM-based synthesis framework. We show using objective and subjective testing that a HMM synthesizer trained using dynamic visemes can generate better visual speech than HMM synthesizers trained using either phone or traditional viseme units.

Proceedings ArticleDOI
27 May 2015
TL;DR: This study proposes the introduction of co-articulation in speech synchronization to produce a more realistic animation, and viseme mapping based on the consonant-vowel (CV) syllable pattern in the Indonesian language, resulting in a more specific viseme group, so it supports the development of realistic speech synchronization.
Abstract: Speech synchronization is one of the most widely studied topics in the field of facial animation, resulting in speech animation, but there are still many challenges that have not been met, one of which is realistic speech synchronization. Because of differences in visual phonemes (visemes) in the pronunciation of each language, it is very difficult to make speech synchronization tools that are applicable to all languages, and at present there are no speech synchronization tools that provide good results for the Indonesian language. This study proposes the introduction of co-articulation in speech synchronization to produce a more realistic animation, and viseme mapping based on the consonant-vowel (CV) syllable pattern in the Indonesian language, resulting in more specific viseme groups that support the development of realistic speech synchronization, referred to as Bahasa Speech Sync. Co-articulation is calculated using the Kochanek-Bartels spline interpolation approach, which adds tension, bias, and continuity parameters, using four control points taken from real human videos to accommodate the concept of co-articulation. Viseme mapping is done by comparing the difference in distance between 12 crucial points and a reference point for each syllable. Based on the results of our proposed viseme grouping procedure, we have simplified viseme generation from a combination of 21 consonants and 5 vowels into 24 viseme groups, 18 of which represent the start position while 6 represent the end position. Tests of the similarity of movement between the generated animation and real human videos achieved an 89% “realistic” perception based on our proposed distance criteria.
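To illustrate the Kochanek-Bartels interpolation named above, here is a small Python sketch that computes TCB tangents and a cubic Hermite blend between two lip key poses; the key values and parameter settings are invented for the example.

```python
# Kochanek-Bartels (tension/continuity/bias) interpolation between key poses.
import numpy as np

def tcb_tangents(p_prev, p, p_next, tension, continuity, bias):
    """Outgoing and incoming tangents at the middle control point."""
    d_out = ((1 - tension) * (1 + bias) * (1 + continuity) / 2) * (p - p_prev) + \
            ((1 - tension) * (1 - bias) * (1 - continuity) / 2) * (p_next - p)
    d_in = ((1 - tension) * (1 + bias) * (1 - continuity) / 2) * (p - p_prev) + \
           ((1 - tension) * (1 - bias) * (1 + continuity) / 2) * (p_next - p)
    return d_out, d_in

def hermite(p0, p1, m0, m1, s):
    """Cubic Hermite blend between p0 and p1 with tangents m0, m1, s in [0, 1]."""
    h00 = 2 * s**3 - 3 * s**2 + 1
    h10 = s**3 - 2 * s**2 + s
    h01 = -2 * s**3 + 3 * s**2
    h11 = s**3 - s**2
    return h00 * p0 + h10 * m0 + h01 * p1 + h11 * m1

# Four lip-opening key values from consecutive key frames (illustrative only).
p = np.array([0.1, 0.6, 0.4, 0.2])
t, c, b = 0.0, 0.0, 0.0                       # tension, continuity, bias
m1_out, _ = tcb_tangents(p[0], p[1], p[2], t, c, b)
_, m2_in = tcb_tangents(p[1], p[2], p[3], t, c, b)

curve = [hermite(p[1], p[2], m1_out, m2_in, s) for s in np.linspace(0, 1, 5)]
print([round(v, 3) for v in curve])
```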

Proceedings ArticleDOI
19 Apr 2015
TL;DR: A novel non-blind speech enhancement procedure based on visual speech recognition (VSR) is introduced; the VSR relies on a generative process that analyzes sequences of talking faces and classifies them into visual speech units known as visemes, and the combined approach clearly outperforms baseline blind methods as well as related work.
Abstract: We introduce in this paper a novel non-blind speech enhancement procedure based on visual speech recognition (VSR). The latter is based on a generative process that analyzes sequences of talking faces and classifies them into visual speech units known as visemes. We use an effective graphical model able to segment and label a given sequence of talking faces into a sequence of visemes. Our model captures unary potential as well as pairwise interaction; the former models visual appearance of speech units while the latter models their interactions using boundary and visual language model activations. Experiments conducted on a standard challenging dataset, show that when feeding the results of VSR to the speech enhancement procedure, it clearly outperforms baseline blind methods as well as related work.
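As an assumed, simplified stand-in for the graphical model described above, the sketch below decodes a viseme sequence with Viterbi over unary (appearance) and pairwise (transition) potentials; the scores are toy values, not the paper's learned potentials.

```python
# Viterbi decoding of a viseme label sequence from unary and pairwise potentials.
import numpy as np

visemes = ["sil", "bilabial", "open"]
T, K = 4, len(visemes)

unary = np.log(np.array([              # appearance score per frame and viseme
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
    [0.6, 0.2, 0.2],
]))
pairwise = np.log(np.array([           # transition score between visemes
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.3, 0.3, 0.4],
]))

delta = np.zeros((T, K))
back = np.zeros((T, K), dtype=int)
delta[0] = unary[0]
for t in range(1, T):
    scores = delta[t - 1][:, None] + pairwise + unary[t][None, :]
    back[t] = scores.argmax(axis=0)    # best previous viseme for each current viseme
    delta[t] = scores.max(axis=0)

path = [int(delta[-1].argmax())]
for t in range(T - 1, 0, -1):
    path.append(int(back[t][path[-1]]))
print([visemes[i] for i in reversed(path)])
```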


Journal ArticleDOI
TL;DR: A new lip synchronization algorithm for realistic applications is proposed, which can be employed to generate facial movements synchronized with audio generated from natural speech or through a text-to-speech engine.
Abstract: Speech is one of the most important interaction methods between humans. Therefore, most avatar research focuses on this area with significant attention. Creating animated speech requires a facial model capable of representing the myriad shapes the human face expresses during speech. Moreover, a method to produce the correct shape at the correct time is also needed. One of the main challenges is to create precise lip movements of the avatar and synchronize them with recorded audio. This paper proposes a new lip synchronization algorithm for realistic applications, which can be employed to generate facial movements synchronized with audio generated from natural speech or through a text-to-speech engine. This method requires an animator to construct animations using a canonical set of visemes for all pairwise combinations of a reduced phoneme set. These animations are then stitched together smoothly to construct the final animation.

Journal ArticleDOI
TL;DR: The simulation results show that each individual word can be represented by a mathematical expression or visual speech signal whereas the sample sets can also be derived from the same mathematical expression, and this is a significant improvement over the popular statistical methods.
Abstract: In this article, visual speech information modeling analysis by explicit mathematical expressions coupled with words’ phonemic structure is presented. The visual information is obtained from deformation of the lips’ dimensions during articulation of a set of words that is called the visual speech sample set. The continuous interpretation of the lips’ movement has been provided using Barycentric Lagrange Interpolation, producing a unique mathematical expression named the visual speech signal. Hierarchical analysis of the phoneme sequences has been applied for words’ categorization to organize the database properly. The visual samples were extracted from three visual feature points chosen on the lips via an experiment in which two individuals pronounced the aforementioned words. The simulation results show that each individual word can be represented by a mathematical expression or visual speech signal whereas the sample sets can also be derived from the same mathematical expression, and this is a significant improvement over the popular statistical methods.
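For concreteness, a minimal implementation of barycentric Lagrange interpolation applied to a lip-feature trajectory is sketched below; the sample values are invented, only the interpolation technique matches the article.

```python
# Barycentric Lagrange interpolation of a sampled lip-feature trajectory.
import numpy as np

def barycentric_weights(x):
    """w_j = 1 / prod_{k != j} (x_j - x_k)."""
    x = np.asarray(x, dtype=float)
    w = np.ones_like(x)
    for j in range(len(x)):
        w[j] = 1.0 / np.prod(x[j] - np.delete(x, j))
    return w

def barycentric_eval(x_nodes, y_nodes, w, x):
    """Evaluate the interpolant at x using the barycentric formula."""
    diff = x - np.asarray(x_nodes, dtype=float)
    if np.any(diff == 0):                        # query lands exactly on a node
        return float(np.asarray(y_nodes)[diff == 0][0])
    terms = w / diff
    return float(np.sum(terms * np.asarray(y_nodes)) / np.sum(terms))

frames = [0, 1, 2, 3, 4]                         # video frame indices
lip_width = [2.0, 2.6, 3.1, 2.4, 2.1]            # illustrative measurements
w = barycentric_weights(frames)
print(round(barycentric_eval(frames, lip_width, w, 1.5), 3))
```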

Proceedings ArticleDOI
06 Sep 2015
TL;DR: A comparison between auditory and visual space for the same speech utterance in English, as spoken by an Indian and a Croatian national is presented, showing a clear correlation between distances in the visual and auditory domain at viseme level.
Abstract: Human interaction through speech is a multisensory activity, wherein the spoken audio is perceived using both auditory and visual cues. However, in the absence of auditory stimulus, speech content can be perceived through lip reading, using the dynamics of the social context. In our earlier work [1], we presented a tool enabling the hearing impaired to understand spoken speech in videos through lip reading. During evaluation it was found that a hearing-impaired person trained to lip read Indian English was unable to lip read speech in other accents of English. We hypothesize that this difficulty can be attributed to a difference in viseme formation arising from underlying phonetic characteristics. In this paper, we present a comparison between the auditory and visual space for the same speech utterance in English, as spoken by an Indian and a Croatian national. Results show a clear correlation between distances in the visual and auditory domains at the viseme level. We then evaluate the feasibility of building visual subtitles through viseme adaptation from an unknown accent to a known accent.

Journal ArticleDOI
29 Jul 2015
TL;DR: Simulation of lip synching in real-time and offline applications can be applied in various areas such as entertainment, education, tutoring, and animation, as well as live performances such as theater, broadcasting, and live presentation.
Abstract: Real-time lip sync animation is an approach to making a virtual computer-generated character talk by synchronizing accurate lip movement with sound, live. Based on the literature review, the creation of lip sync animation in real time is particularly challenging in mapping lip movements and sounds so that they are synchronized. The fluidity and accuracy of natural speech are among the most difficult things to achieve convincingly in facial animation; people are very sensitive when it is wrong because we are all focused on faces. Especially in real-time applications, the visual impact needs to be immediate, commanding, and convincing to the audience. Research on viseme-based human speech was conducted to develop a lip synchronization platform in order to achieve accurate lip motion synchronized with sound and to increase the visual performance of the facial animation. Through this research, a usable, automated digital speech system for lip sync animation was developed. It is designed with simple synchronization tricks that generally improve accuracy and realistic visual impression, and implements advanced features in a lip synchronization application. This study allows simulation of lip synching in real-time and offline applications. Hence, it can be applied in various areas such as entertainment, education, tutoring, and animation, as well as live performances such as theater, broadcasting, and live presentation.

Proceedings ArticleDOI
31 Aug 2015
TL;DR: A lip-reading method that recognizes the content of an utterance from images is studied, which could be used for utterance training in Japanese and English.
Abstract: Speech recognition technology is spreading with personal digital assistants such as smart phones. However, we are concerned about the decline in the recognition rate at places with multiple voices and considerable noise. Therefore, we have been studying a lip operation that would recognize the content of an utterance by reading from an image. Based on this research, we created a database of utterances by Japanese television announcers and English teachers for utterance training in Japanese and English. Furthermore, applying the technology we developed, we propose a method of utterance training using specific equipment. First, we compared the student's utterance with data in the lip movement database. Second, we evaluated the effectiveness of the utterance training equipment.