
Showing papers on "Viseme" published in 2016


Journal ArticleDOI
11 Jul 2016
TL;DR: A system that, given an input audio soundtrack and speech transcript, automatically generates expressive lip-synchronized facial animation that is amenable to further artistic refinement, and that is comparable with both performance capture and professional animator output is presented.
Abstract: The rich signals we extract from facial expressions impose high expectations on the science and art of facial animation. While the advent of high-resolution performance capture has greatly improved realism, the utility of procedural animation warrants a prominent place in the facial animation workflow. We present a system that, given an input audio soundtrack and speech transcript, automatically generates expressive lip-synchronized facial animation that is amenable to further artistic refinement, and that is comparable with both performance capture and professional animator output. Because of the diversity of ways we produce sound, the mapping from phonemes to visual depictions as visemes is many-valued. We draw from psycholinguistics to capture this variation using two visually distinct anatomical actions: Jaw and Lip, where sound is primarily controlled by jaw articulation and lower-face muscles, respectively. We describe the construction of a transferable template JALI 3D facial rig, built upon the popular facial muscle action unit representation FACS. We show that acoustic properties in a speech signal map naturally to the dynamic degree of jaw and lip in visual speech. We provide an array of compelling animation clips, compare against performance capture and existing procedural animation, and report on a brief user study.
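Below is a minimal, hypothetical sketch of the general idea of driving separate jaw and lip intensities from simple acoustic measures (frame energy and spectral centroid). The names and mappings are illustrative assumptions only and do not correspond to the actual JALI rig controls or the paper's signal analysis.

```python
# Hypothetical sketch: derive rough "jaw" and "lip" animation curves from audio.
# Not the JALI method; loudness-driven jaw and a crude spectral lip proxy only.
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n)])

def jaw_lip_curves(x, sr=16000):
    frames = frame_signal(x)
    energy = np.sqrt((frames ** 2).mean(axis=1))            # loudness ~ jaw opening
    spectrum = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(frames.shape[1], 1.0 / sr)
    centroid = (spectrum * freqs).sum(axis=1) / (spectrum.sum(axis=1) + 1e-8)
    jaw = energy / (energy.max() + 1e-8)                     # 0..1 jaw activation
    lip = 1.0 - centroid / (centroid.max() + 1e-8)           # crude lip-rounding proxy
    return jaw, lip

# jaw, lip = jaw_lip_curves(waveform)  # feed as per-frame animation curves to a rig
```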

118 citations


Proceedings ArticleDOI
20 Mar 2016
TL;DR: It is shown that error-rates for speaker-independent lip-reading can be very significantly reduced and that there is no need to map phonemes to visemes for context-dependent visual speech transcription.
Abstract: Recent improvements in tracking and feature extraction mean that speaker-dependent lip-reading of continuous speech using a medium size vocabulary (around 1000 words) is realistic. However, the recognition of previously unseen speakers has been found to be a very challenging task, because of the large variation in lip-shapes across speakers and the lack of large, tracked databases of visual features, which are very expensive to produce. By adapting a technique that is established in speech recognition but has not previously been used in lip-reading, we show that error-rates for speaker-independent lip-reading can be very significantly reduced. Furthermore, we show that error-rates can be even further reduced by the additional use of Deep Neural Networks (DNN). We also find that there is no need to map phonemes to visemes for context-dependent visual speech transcription.
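As a loose illustration of the general idea of reducing speaker variation before classification, the sketch below normalises visual features per speaker. The paper's actual adaptation technique is not specified here, so this is an assumed stand-in, not the authors' method.

```python
# Assumed illustration: per-speaker mean/variance normalisation of visual features
# before training a speaker-independent classifier. Not the paper's adaptation method.
import numpy as np

def speaker_normalise(features_by_speaker):
    """features_by_speaker: dict speaker_id -> (n_frames, dim) array."""
    out = {}
    for spk, feats in features_by_speaker.items():
        mu, sigma = feats.mean(axis=0), feats.std(axis=0) + 1e-8
        out[spk] = (feats - mu) / sigma   # remove per-speaker shape/scale bias
    return out
```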

63 citations


Patent
23 Nov 2016
TL;DR: In this paper, the entire pipeline of hand-engineered components is replaced with neural networks, and the end-to-end learning allows handling a diverse variety of speech, including noisy environments, accents, and different languages.
Abstract: Embodiments of end-to-end deep learning systems and methods are disclosed to recognize speech of vastly different languages, such as English or Mandarin Chinese. In embodiments, the entire pipeline of hand-engineered components is replaced with neural networks, and the end-to-end learning allows handling a diverse variety of speech, including noisy environments, accents, and different languages. Using a trained embodiment and an embodiment of a batch dispatch technique with GPUs in a data center, an end-to-end deep learning system can be inexpensively deployed in an online setting, delivering low latency when serving users at scale.
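The sketch below shows a generic end-to-end acoustic model trained with a CTC criterion in PyTorch, in the spirit of replacing hand-engineered pipelines with a single network. It is an assumed illustration with placeholder sizes, not the system described in the patent.

```python
# Generic end-to-end speech model with a CTC loss (illustrative sizes only).
import torch
import torch.nn as nn

class SpeechRNN(nn.Module):
    def __init__(self, n_mels=80, hidden=256, n_chars=29):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, num_layers=3,
                          bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_chars)      # characters + CTC blank

    def forward(self, x):                               # x: (batch, time, n_mels)
        h, _ = self.rnn(x)
        return self.out(h).log_softmax(dim=-1)

model = SpeechRNN()
ctc = nn.CTCLoss(blank=0)
x = torch.randn(4, 200, 80)                             # dummy spectrogram batch
logp = model(x).transpose(0, 1)                         # (time, batch, classes) for CTCLoss
targets = torch.randint(1, 29, (4, 30))
loss = ctc(logp, targets,
           torch.full((4,), 200, dtype=torch.long),
           torch.full((4,), 30, dtype=torch.long))
```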

46 citations


Book ChapterDOI
20 Nov 2016
TL;DR: This paper tackles ALR as a classification task using an end-to-end neural network based on a convolutional neural network and long short-term memory architecture, and shows that additional view information helps to improve the performance of ALR with a neural network architecture.
Abstract: It is well known that automatic lip-reading (ALR), also known as visual speech recognition (VSR), enhances the performance of speech recognition in a noisy environment and also has applications in its own right. However, ALR is a challenging task due to various lip shapes and the ambiguity of visemes (the basic units of visual speech information). In this paper, we tackle ALR as a classification task using an end-to-end neural network based on a convolutional neural network and long short-term memory architecture. We conduct single-, cross-, and multi-view experiments in a speaker-independent setting with various network configurations to integrate the multi-view data. We achieve 77.9%, 83.8%, and 78.6% classification accuracy on average for the single-, cross-, and multi-view settings, respectively. This result is better than the best score (76%) of the preliminary single-view results given by the ACCV 2016 workshop on multi-view lip-reading/audio-visual challenges. It also shows that additional view information helps to improve the performance of ALR with a neural network architecture.
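A hedged sketch of a CNN + LSTM clip classifier of the kind described, assuming fixed-length clips of grayscale 64x64 mouth images; the layer sizes and clip format are assumptions, not the authors' configuration.

```python
# Illustrative CNN + LSTM classifier over mouth-region image sequences.
import torch
import torch.nn as nn

class LipClassifier(nn.Module):
    def __init__(self, n_words=10):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.lstm = nn.LSTM(64 * 16 * 16, 256, batch_first=True)
        self.fc = nn.Linear(256, n_words)

    def forward(self, clips):                 # clips: (batch, time, 1, 64, 64)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.reshape(b * t, 1, 64, 64)).reshape(b, t, -1)
        h, _ = self.lstm(feats)
        return self.fc(h[:, -1])              # classify from the last time step

logits = LipClassifier()(torch.randn(2, 29, 1, 64, 64))   # dummy batch of clips
```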

46 citations


Proceedings ArticleDOI
20 Mar 2016
TL;DR: This paper presented a two-pass method of training phoneme classifiers which used previously trained visemes in the first pass and showed classification performance which significantly improved on previous lip-reading results.
Abstract: To undertake machine lip-reading, we try to recognise speech from a visual signal. Current work often uses viseme classification supported by language models with varying degrees of success. A few recent works suggest phoneme classification, in the right circumstances, can outperform viseme classification. In this work we present a novel two-pass method of training phoneme classifiers which uses previously trained visemes in the first pass. With our new training algorithm, we show classification performance which significantly improves on previous lip-reading results.
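The sketch below illustrates one possible reading of a two-pass scheme: train a viseme classifier first, then feed its posteriors as extra features to phoneme classifiers. It assumes a phoneme_to_viseme dictionary is available and is a simplified stand-in, not the paper's training algorithm.

```python
# Simplified two-pass illustration: viseme classifier first, then phoneme classifier
# that also sees the viseme posteriors. Not the authors' exact algorithm.
import numpy as np
from sklearn.linear_model import LogisticRegression

def two_pass_train(X, phoneme_labels, phoneme_to_viseme):
    viseme_labels = np.array([phoneme_to_viseme[p] for p in phoneme_labels])
    pass1 = LogisticRegression(max_iter=1000).fit(X, viseme_labels)
    X_aug = np.hstack([X, pass1.predict_proba(X)])       # append viseme posteriors
    pass2 = LogisticRegression(max_iter=1000).fit(X_aug, phoneme_labels)
    return pass1, pass2
```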

41 citations


Journal ArticleDOI
TL;DR: An approach to ALR is proposed that acknowledges that information is missing from the visual signal but assumes it is substituted or deleted in a systematic way that can be modelled; a system is described that learns such a model and then incorporates it into decoding, realised as a cascade of weighted finite-state transducers.

32 citations


Book ChapterDOI
28 Nov 2016
TL;DR: This paper presents a data-driven approach to estimating audio speech acoustics using only temporal visual information, without relying on linguistic features such as phonemes and visemes, showing that given a sequence of prior visual frames, a reasonably accurate estimate of the corresponding audio frame can be produced.
Abstract: The concept of using visual information as part of audio speech processing has been of significant recent interest. This paper presents a data-driven approach to estimating audio speech acoustics using only temporal visual information, without relying on linguistic features such as phonemes and visemes. Audio (log filterbank) and visual (2D-DCT) features are extracted, and various configurations of MLPs and datasets are used to identify the best results, showing that given a sequence of prior visual frames, a reasonably accurate estimate of the corresponding audio frame can be produced.
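A minimal sketch of the regression setup: stack K prior visual feature frames and regress the corresponding log-filterbank audio frame with an MLP. The window size and layer widths are assumptions, not the paper's settings.

```python
# Illustrative visual-to-audio regression: K prior visual frames -> next audio frame.
import numpy as np
from sklearn.neural_network import MLPRegressor

def stack_context(visual, K=5):
    # visual: (n_frames, vis_dim) -> (n_frames - K, K * vis_dim)
    return np.hstack([visual[i:len(visual) - K + i] for i in range(K)])

def train_visual_to_audio(visual, audio_fbank, K=5):
    X = stack_context(visual, K)
    y = audio_fbank[K:]                      # predict the frame after the window
    return MLPRegressor(hidden_layer_sizes=(256, 256), max_iter=500).fit(X, y)
```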

12 citations


Proceedings ArticleDOI
11 Mar 2016
TL;DR: The proposed algorithm is applied to the recognition of ten Hindi words, can easily be extended to include more words and other languages, and is fast as well as robust to various occlusions.
Abstract: Lip reading, also known as visual speech processing, is the recognition of spoken words based on the pattern of lip movements while speaking. Audio speech recognition systems have been popular for many decades and have achieved great success, but recently visual speech recognition has drawn the interest of researchers towards lip reading. Lip reading has the added advantages of high accuracy and independence from acoustic noise. This paper presents an algorithm for automatic lip reading. The algorithm consists of two main steps: feature extraction and classification for word recognition. Lip information is extracted using lip geometric and lip appearance features. Word recognition is performed by a Learning Vector Quantization neural network. The accuracy achieved by the proposed approach is 97%. The proposed algorithm is applied to the recognition of ten Hindi words and can easily be extended to include more words and other languages. The presented approach will be helpful for people with hearing or speech impairments to communicate with humans or machines. The proposed algorithm is fast as well as robust to various occlusions.
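A minimal LVQ1 sketch (one prototype per class, simple winner update) to illustrate the kind of classifier used; the feature extraction and parameters here are illustrative, not the paper's.

```python
# Minimal LVQ1: attract the winning prototype on correct class, repel otherwise.
import numpy as np

def train_lvq1(X, y, lr=0.05, epochs=20):
    classes = np.unique(y)
    protos = np.array([X[y == c].mean(axis=0) for c in classes])   # init prototypes
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            w = np.argmin(np.linalg.norm(protos - xi, axis=1))     # winning prototype
            sign = 1.0 if classes[w] == yi else -1.0
            protos[w] += sign * lr * (xi - protos[w])
    return classes, protos

def predict_lvq1(classes, protos, X):
    dists = np.linalg.norm(protos[None] - X[:, None], axis=2)      # (N, n_classes)
    return classes[np.argmin(dists, axis=1)]
```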

9 citations


Proceedings ArticleDOI
21 Jul 2016
TL;DR: This paper surveys the field of lip reading, provides a detailed discussion of the trade-offs between various approaches, and gives a reverse-chronological, topic-wise listing of developments in lip-reading systems in recent years.
Abstract: Automated lip reading is the process of converting movements of the lips, face and tongue to speech in real time with enhanced accuracy. Although the performance of lip-reading systems is still far below that of audio speech recognition, recent developments in processor technology, the massive growth and ubiquity of computing devices, and increased research in this field have reduced the ambiguities of lip-based language, bringing free speech-to-text conversion closer. This paper surveys the field of lip reading and provides a detailed discussion of the trade-offs between various approaches. It gives a reverse-chronological, topic-wise listing of developments in lip-reading systems in recent years. With advances in computer vision and pattern recognition tools, the efficacy of real-time, effective conversion has increased. The major goal of this paper is to provide a comprehensive reference source for researchers involved in lip reading, not just for academia but for anyone interested in this field regardless of particular application area.

9 citations


Posted Content
TL;DR: This paper presents an end-to-end audiovisual speech recognizer (AVSR), based on recurrent neural networks (RNN) with a connectionist temporal classification (CTC) loss function, which outperform previously published approaches on phone accuracy in clean and noisy conditions.
Abstract: Speech is one of the most effective ways of communication among humans. Even though audio is the most common way of transmitting speech, very important information can be found in other modalities, such as vision. Vision is particularly useful when the acoustic signal is corrupted. Multi-modal speech recognition however has not yet found wide-spread use, mostly because the temporal alignment and fusion of the different information sources is challenging. This paper presents an end-to-end audiovisual speech recognizer (AVSR), based on recurrent neural networks (RNN) with a connectionist temporal classification (CTC) loss function. CTC creates sparse "peaky" output activations, and we analyze the differences in the alignments of output targets (phonemes or visemes) between audio-only, video-only, and audio-visual feature representations. We present the first such experiments on the large vocabulary IBM ViaVoice database, which outperform previously published approaches on phone accuracy in clean and noisy conditions.
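A simple feature-fusion sketch: upsample the visual stream to the audio frame rate and concatenate the two feature streams before a CTC-trained network, as in a generic audio-visual condition. The frame rates are assumptions, not the IBM ViaVoice setup.

```python
# Assumed illustration of early audio-visual feature fusion before sequence training.
import numpy as np

def fuse_audio_visual(audio_feats, visual_feats):
    # assume audio at 100 fps and video at 25 fps -> repeat each video frame 4 times
    visual_up = np.repeat(visual_feats, 4, axis=0)
    T = min(len(audio_feats), len(visual_up))
    return np.hstack([audio_feats[:T], visual_up[:T]])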

8 citations


Proceedings ArticleDOI
01 Oct 2016
TL;DR: The development of speech corpora for different Malayalam speech recognition tasks is presented, along with the creation of a pronunciation dictionary and transcription file.
Abstract: A speech corpus is the backbone of an automatic speech recognition system. This paper presents the development of speech corpora for different Malayalam speech recognition tasks. A pronunciation dictionary and transcription file, the other two essential resources for building a speech recognizer, are also created. A speech corpus of about 18 hours has been collected for the different recognition tasks.

Journal ArticleDOI
TL;DR: This doctoral research aims to design and evaluate a visualisation technique that displays textual representations of a speaker's phonemes to a speech-reader, so speech-readers should be able to disambiguate confusing viseme-to-phoneme mappings without shifting their focus from the speaker's face.
Abstract: Speech-reading is an invaluable technique for people with hearing loss or those in adverse listening conditions (e.g., in a noisy restaurant, near children playing loudly). However, speech-reading is often difficult because identical mouth shapes (visemes) can produce several speech sounds (phonemes); there is a one-to-many mapping from visemes to phonemes. This decreases comprehension, causing confusion and frustration during conversation. My doctoral research aims to design and evaluate a visualisation technique that displays textual representations of a speaker's phonemes to a speech-reader. By combining my visualisation with their pre-existing speech-reading ability, speech-readers should be able to disambiguate confusing viseme-to-phoneme mappings without shifting their focus from the speaker's face. This will result in an improved level of comprehension, supporting natural conversation.


Proceedings ArticleDOI
01 Aug 2016
TL;DR: Experimental results show that small changes in the allophone set may provide better speech recognition quality than the phoneme-based approach.
Abstract: The article presents studies on automatic whispery speech recognition. In the research, a new corpus of whispery speech was used. It was checked whether an extended set of articulatory units (allophones used instead of phonemes) improves the quality of whispery speech recognition. Experimental results show that small changes in the allophone set may provide better speech recognition quality than the phoneme-based approach. The authors also made available a trained g2p (grapheme-to-phoneme) model of the Polish language for the Sequitur toolkit.

Posted Content
TL;DR: This paper is the first to show that 3D feature extraction methods can be applied to continuous sequence recognition tasks despite the unknown start positions and durations of each phoneme, and confirms that 3D feature extraction methods improve accuracy compared to 2D feature extraction methods.
Abstract: Visual speech recognition aims to identify the sequence of phonemes from continuous speech. Unlike the traditional approach of using 2D image feature extraction methods to derive features of each video frame separately, this paper proposes a new approach using a 3D (spatio-temporal) Discrete Cosine Transform to extract features of each feasible sub-sequence of an input video, which are subsequently classified individually using Support Vector Machines and combined to find the most likely phoneme sequence using a tailor-made Hidden Markov Model. The algorithm is trained and tested on the VidTimit database to recognise sequences of phonemes as well as visemes (visual speech units). Furthermore, the system is extended with training on phoneme or viseme pairs (biphones) to counteract the human speech ambiguity of co-articulation. The test set accuracy for the recognition of phoneme sequences is 20%, and the accuracy for viseme sequences is 39%. Both results improve the best values reported in other papers by approximately 2%. The contribution of the result is three-fold: firstly, this paper is the first to show that 3D feature extraction methods can be applied to continuous sequence recognition tasks despite the unknown start positions and durations of each phoneme. Secondly, the result confirms that 3D feature extraction methods improve the accuracy compared to 2D feature extraction methods. Thirdly, the paper is the first to specifically compare an otherwise identical method with and without using biphones, verifying that the usage of biphones has a positive impact on the result.
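A sketch of extracting a 3D (spatio-temporal) DCT descriptor from a mouth-region sub-sequence and classifying it with an SVM; the number of retained coefficients and the clip format are assumptions for illustration.

```python
# Illustrative 3D-DCT descriptor: keep a low-frequency cube of coefficients.
import numpy as np
from scipy.fft import dctn
from sklearn.svm import SVC

def dct3_features(clip, keep=8):
    # clip: (frames, height, width) grayscale sub-sequence with at least `keep` frames
    coeffs = dctn(clip, norm='ortho')
    return coeffs[:keep, :keep, :keep].ravel()       # low-frequency cube

# X = np.stack([dct3_features(c) for c in clips])
# clf = SVC(probability=True).fit(X, labels)         # per-sub-sequence classifier
```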

Journal ArticleDOI
TL;DR: This special issue was inspired by the 2013 Workshop on Speech Production in Automatic Speech Recognition in Lyon, France, and this introduction provides an overview of the included papers in the context of the current research landscape.

Proceedings ArticleDOI
08 Sep 2016
TL;DR: This paper examines methods to improve visual speech synthesis from a text input using a deep neural network (DNN) and reveals the importance of the frame level information which is able to avoid discontinuities in the visual feature sequence and produces a smooth and realistic output.
Abstract: This paper examines methods to improve visual speech synthesis from a text input using a deep neural network (DNN). Two representations of the input text are considered, namely phoneme sequences and dynamic viseme sequences. From these sequences, contextual features are extracted that include information at varying linguistic levels, from the frame level to the utterance level. These are extracted from a broad sliding window that captures context and produces features that are input into the DNN to estimate visual features. Experiments first compare the accuracy of these visual features against an HMM baseline method, which establishes that both the phoneme and dynamic viseme systems perform better, with the best performance obtained by a combined phoneme-dynamic viseme system. An investigation into the features then reveals the importance of the frame-level information, which is able to avoid discontinuities in the visual feature sequence and produces a smooth and realistic output.
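An illustrative sketch of building frame-level input features from a phoneme (or dynamic viseme) label sequence with a centred sliding window, then regressing visual parameters with a DNN; the window width and one-hot encoding are assumptions, not the paper's feature set.

```python
# Illustrative sliding-window context features from per-frame unit labels.
import numpy as np
from sklearn.neural_network import MLPRegressor

def window_features(frame_labels, n_units, half_width=5):
    onehot = np.eye(n_units)[frame_labels]                     # (n_frames, n_units)
    padded = np.pad(onehot, ((half_width, half_width), (0, 0)))
    return np.hstack([padded[i:i + len(frame_labels)]
                      for i in range(2 * half_width + 1)])

# model = MLPRegressor(hidden_layer_sizes=(512, 512)).fit(
#     window_features(labels, n_units=40), visual_params)
```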

DOI
13 Dec 2016
TL;DR: This work defines an Indonesian viseme set and the associated mouth shapes within a text-input segmentation system, with a selected affect type as an additional input, to generate a viseme sequence for natural speech of Indonesian sentences with affect.
Abstract: In communication using text input, visemes (visual phonemes) are derived from groups of phonemes having similar visual appearances. The hidden Markov model (HMM) has been a popular mathematical approach for sequence classification tasks such as speech recognition. For speech emotion recognition, an HMM is trained for each emotion and an unknown sample is classified according to the model that best explains the derived feature sequence. The Viterbi algorithm is used with the HMM to estimate the most likely sequence of hidden states from the observations. In this work, the first stage defines an Indonesian viseme set and the associated mouth shapes, i.e. a text-input segmentation system. The second stage selects one affect type as an input to the system. The last stage uses trigram HMMs to generate the viseme sequence used for synchronised mouth shapes and lip movements. The whole system is connected in sequence. The final system produces a viseme sequence for natural speech of Indonesian sentences with affect. We show through various experiments that the proposed approach yields about 82.19% relative improvement in classification accuracy.
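A generic Viterbi decoding sketch for an HMM over viseme states; it uses a first-order HMM for brevity (the paper uses trigram HMMs), and the transition and emission tables are placeholders, not the Indonesian models.

```python
# Generic Viterbi decoding over hidden states given log-probability tables.
import numpy as np

def viterbi(log_init, log_trans, log_emit, observations):
    """log_init: (S,), log_trans: (S, S), log_emit: (S, V), observations: list of ints."""
    S, T = len(log_init), len(observations)
    delta = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    delta[0] = log_init + log_emit[:, observations[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans            # (S_prev, S_cur)
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emit[:, observations[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]                                          # most likely state sequence
```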

Proceedings ArticleDOI
01 Apr 2016
TL;DR: The variation of lip texture features for recognising visemes is explored, and the proposed approach is applied to Hindi word recognition, achieving high accuracy at the cost of computation time.
Abstract: Visual speech processing is a key concern of researchers working in speech processing and computer vision. Audio speech processing was earlier widely used for speech recognition, but its performance deteriorates in the presence of noise. Moreover, variation in accent is another challenge that affects the performance of such systems. In this paper, we explore the variation of lip texture features for the recognition of visemes. The variation in the temporal behaviour of lip texture features is encoded using Local Binary Pattern features in three orthogonal planes. Classification is carried out using a back-propagation neural network, a network with a hidden layer. The added advantage of the hidden layer for lip reading is that it accounts for the nonlinear variation of lip features while speaking. The proposed approach is applied to Hindi word recognition and achieves high accuracy at the cost of computation time.
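A simplified sketch of LBP features on three orthogonal planes (XY, XT, YT) of a mouth-region volume, concatenated into one descriptor; taking only the central slice of each plane is a simplification relative to a full LBP-TOP implementation.

```python
# Simplified three-orthogonal-planes LBP descriptor from a mouth-region volume.
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_top(volume, P=8, R=1):
    # volume: (T, H, W) uint8 grayscale; take the central slice of each plane
    T, H, W = volume.shape
    planes = [volume[T // 2], volume[:, H // 2, :], volume[:, :, W // 2]]
    hists = []
    for plane in planes:
        codes = local_binary_pattern(plane, P, R, method='uniform')   # values 0..P+1
        hist, _ = np.histogram(codes, bins=P + 2, range=(0, P + 2), density=True)
        hists.append(hist)
    return np.concatenate(hists)
```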


Dissertation
01 Jun 2016
TL;DR: The results show there is not a high variability of visual cues, but there is high variability in trajectory between visual cues of an individual speaker with the same ground truth, and the dependence of phoneme-to-viseme maps between speakers is investigated.
Abstract: This thesis is about improving machine lip-reading, that is, the classification of speech from only visual cues of a speaker. Machine lip-reading is a niche research problem in both areas of speech processing and computer vision. Current challenges for machine lip-reading fall into two groups: the content of the video, such as the rate at which a person is speaking; or the parameters of the video recording, for example the video resolution. We begin our work with a literature review to understand how current technology limits machine lip-reading recognition and conduct an experiment into the effects of resolution. We show that high-definition video is not needed to successfully lip-read with a computer. The term "viseme" is used in machine lip-reading to represent a visual cue or gesture which corresponds to a subgroup of phonemes where the phonemes are indistinguishable in the visual speech signal. Whilst a viseme is yet to be formally defined, we use the common working definition 'a viseme is a group of phonemes with identical appearance on the lips'. A phoneme is the smallest acoustic unit a human can utter. Because there are more phonemes per viseme, mapping between the units creates a many-to-one relationship. Many mappings have been presented, and we conduct an experiment to determine which mapping produces the most accurate classification. Our results show Lee's [82] is best. Lee's classification also outperforms machine lip-reading systems which use the popular Fisher [48] phoneme-to-viseme map. Further to this, we propose three methods of deriving speaker-dependent phoneme-to-viseme maps and compare our new approaches to Lee's. Our results show the sensitivity of phoneme clustering, and we use our new knowledge for our first suggested augmentation to the conventional lip-reading system. Speaker independence in machine lip-reading classification is another unsolved obstacle. It has been observed, in the visual domain, that classifiers need training on the test subject to achieve the best classification. Thus machine lip-reading is highly dependent upon the speaker. Speaker independence is the opposite of this, or in other words, the classification of a speaker not present in the classifier's training data. We investigate the dependence of phoneme-to-viseme maps between speakers. Our results show there is not a high variability of visual cues, but there is high variability in trajectory between visual cues of an individual speaker with the same ground truth. This implies a dependency upon the number of visemes within each set for each individual. Finally, we investigate how many visemes is the optimum number within a set. We show the phoneme-to-viseme maps in the literature rarely have enough visemes, and the optimal number, which varies by speaker, ranges from 11 to 35. The last difficulty we address is decoding from visemes back to phonemes and into words. Traditionally this is completed using a language model. The language model unit is either the same as the classifier's, e.g. visemes or phonemes, or the language model unit is words. In a novel approach we use these optimum-range viseme sets within hierarchical training of phoneme-labelled classifiers. This new method of classifier training demonstrates a significant increase in classification with a word language network.
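As an illustration of the many-to-one phoneme-to-viseme mapping discussed above, the sketch below applies a toy map to a phoneme sequence; the map fragment is illustrative and is not Lee's or Fisher's actual mapping.

```python
# Toy many-to-one phoneme-to-viseme mapping (fragment for illustration only).
TOY_P2V = {"p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
           "f": "V_labiodental", "v": "V_labiodental",
           "s": "V_alveolar", "z": "V_alveolar", "t": "V_alveolar", "d": "V_alveolar"}

def phonemes_to_visemes(phonemes, p2v=TOY_P2V):
    return [p2v.get(p, "V_other") for p in phonemes]

print(phonemes_to_visemes(["b", "a", "t"]))   # ['V_bilabial', 'V_other', 'V_alveolar']
```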

Proceedings ArticleDOI
01 Oct 2016
TL;DR: A system is developed which recognizes silently uttered words and performs specific tasks based on the recognized lip movements; the method used is noninvasive.
Abstract: The Silent Speech Interface (SSI) has given rise to the possibility of speech processing even in the absence of an acoustic signal, and it has been used as an aid for people with speech impairments. A Silent Speech Interface is a device that enables communication to take place when voice is not present. In noisy environments, the SSI holds a promising approach for speech processing. In this project, a system is developed which recognizes silently uttered words and performs specific tasks based on the recognized lip movements of the words. The method used is noninvasive. The system records video of the lip movements of the words or letters uttered silently. The recorded signal is analysed by extracting features using the local binary pattern (LBP) operator and then classified using a kNN classifier. The system is trained and tested to recognize the word 'WATER'. When the word 'WATER' is uttered, the lip movements are analysed and a control signal is generated to initiate the robot to pick a water bottle from a specified location and place it at a predefined destination.
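A sketch of the recognition step: classify a clip's LBP histogram with a kNN classifier and trigger an action on a recognised keyword; the value of k and the control signal name are assumptions for illustration.

```python
# Illustrative kNN keyword recogniser over LBP histograms with a simple trigger.
from sklearn.neighbors import KNeighborsClassifier

def train_word_recogniser(histograms, word_labels, k=3):
    return KNeighborsClassifier(n_neighbors=k).fit(histograms, word_labels)

def maybe_trigger_robot(clf, clip_histogram):
    word = clf.predict([clip_histogram])[0]
    if word == "WATER":
        return "fetch_bottle"          # placeholder control signal, not the paper's
    return None
```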

Proceedings ArticleDOI
01 Nov 2016
TL;DR: This paper investigates face and mouth localization in viseme images using an enhanced Viola-Jones algorithm to ensure the face and mouth are accurately detected, and shows a promising result with a high acceptance rate.
Abstract: Face and mouth localization are vital phases of visual speech recognition. These tasks refer to the detection of the face and mouth region within viseme images. The main problems in face and mouth localization are constraints on the image, such as rotation, homogeneous colour intensity within the images, and parts of the object that cannot be detected. This paper investigates face and mouth localization in viseme images using an enhanced Viola-Jones algorithm to ensure the face and mouth are accurately detected. The enhanced Viola-Jones algorithm was tested on 80 viseme images from recorded video data of 500 Bahasa Melayu common words, with test images including female and male adults with no speech problems. The proposed method showed a promising result with a high acceptance rate. Future work may focus on overcoming the detection of similar colour intensity and illumination problems within the images, as well as on mouth segmentation and temporal extraction of mouth features.
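A sketch of standard Viola-Jones face detection with OpenCV, restricting the mouth search to the lower half of each detected face; the cascade file is the stock OpenCV one, not the enhanced cascade described in the paper.

```python
# Standard Viola-Jones face detection with a lower-face mouth region of interest.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_face_and_mouth_roi(gray_image):
    faces = face_cascade.detectMultiScale(gray_image, scaleFactor=1.1, minNeighbors=5)
    rois = []
    for (x, y, w, h) in faces:
        mouth_roi = gray_image[y + h // 2: y + h, x: x + w]   # lower half of the face
        rois.append(((x, y, w, h), mouth_roi))
    return rois
```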

Journal ArticleDOI
TL;DR: The contribution of this study is to develop a real-time automatic talking system for the English language based on the concatenation of visemes, followed by results evaluated against the phoneme-to-viseme table using Prophone.
Abstract: Lip-syncing is a process of speech assimilation with the lip motions of a virtual character. A virtual talking character is a challenging task because it should provide control over all articulatory movements and must be synchronized with the speech signal. This study presents a virtual talking character system aimed at speeding up and easing the visual talking process compared to previous techniques using the blend-shapes approach. The system constructs the lip-syncing using a set of visemes for a reduced phoneme set with a new method named Prophone. Prophone depends on the probability of a phoneme appearing in English sentences. The contribution of this study is to develop a real-time automatic talking system for the English language based on the concatenation of visemes, followed by results evaluated against the phoneme-to-viseme table using Prophone.
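A hedged sketch of viseme concatenation for lip-sync: map a timed phoneme sequence to viseme keyframes; the mapping and timing scheme are illustrative, not the Prophone method itself.

```python
# Illustrative viseme keyframe track built from a timed phoneme sequence.
def build_viseme_track(timed_phonemes, p2v):
    """timed_phonemes: list of (phoneme, start_sec, end_sec) from forced alignment."""
    track = []
    for phoneme, start, end in timed_phonemes:
        track.append((start, p2v.get(phoneme, "rest")))    # key in the viseme pose
    if timed_phonemes:
        track.append((timed_phonemes[-1][2], "rest"))      # return to rest at the end
    return track
```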


Proceedings Article
14 Dec 2016
TL;DR: A method using polymorphing to incorporate co-articulation in a TTAVS, a system for converting input text to a video-realistic audio-visual sequence, with added temporal smoothing for viseme transitions to avoid jerky animation.
Abstract: This paper presents our approach to co-articulation for a text-to-audiovisual speech synthesizer (TTAVS), a system for converting input text to a video-realistic audio-visual sequence. It is an image-based system, where the face is modeled using a set of images of a human subject. A concatenation of visemes, the corresponding lip shapes for phonemes, can be used for modeling visual speech. However, in actual speech production there is overlap in the production of syllables and phonemes, which form a sequence of discrete units of speech. Due to this overlap, the boundaries between these discrete speech units are blurred, i.e., vocal tract motions associated with producing one phonetic segment overlap the motions for producing surrounding phonetic segments. This overlap is called co-articulation. The lack of parameterization in the image-based model makes it difficult to use the techniques employed in 3D facial animation models for co-articulation. We introduce a method using polymorphing to incorporate co-articulation in our TTAVS. Further, we add temporal smoothing for viseme transitions to avoid jerky animation.
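A sketch of the temporal-smoothing idea for viseme transitions: linearly cross-fade between consecutive viseme images over a short overlap window; the overlap length and linear weighting are assumptions, not the paper's polymorphing algorithm.

```python
# Illustrative linear cross-fade between two viseme images for a smooth transition.
import numpy as np

def crossfade_frames(img_a, img_b, n_frames=5):
    """img_a, img_b: float arrays of the same shape; returns the transition frames."""
    alphas = np.linspace(0.0, 1.0, n_frames)
    return [(1.0 - a) * img_a + a * img_b for a in alphas]
```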


Proceedings ArticleDOI
28 Nov 2016
TL;DR: A speech animation can easily be produced by choosing a suitable action from a database of actions captured beforehand and pairing it with the speech, provided the action is unrelated to the speech content.
Abstract: When an action is not tied to the content of the speech, the speech can be paired with any unrelated action. In such cases, a speech animation can easily be produced by choosing a suitable action from a database of actions captured beforehand.

Book ChapterDOI
01 Jan 2016
TL;DR: The aim of this chapter is to give a comprehensive overview of current state-of-the-art parametric methods for realistic facial modelling and animation.
Abstract: Facial modelling is a fundamental technique in a variety of applications in computer graphics, computer vision and pattern recognition areas. As 3D technologies evolved over the years, the quality of facial modelling greatly improved. To enhance the modelling quality and controllability of the model further, parametric methods, which represent or manipulate facial attributes (e.g. identity, expression, viseme) with a set of control parameters, have been proposed in recent years. The aim of this chapter is to give a comprehensive overview of current state-of-the-art parametric methods for realistic facial modelling and animation.

Dissertation
01 Mar 2016
TL;DR: This study integrates emotions by considering both the Ekman model and Plutchik's wheel, together with emotive eye movements implemented via the Emotional Eye Movements Markup Language, to produce a realistic 3D face model.
Abstract: Lip synchronization of a 3D face model is now being used in a multitude of important fields. It brings a more human and dramatic reality to computer games, films and interactive multimedia, and is growing in use and importance. A high level of realism is demanded in applications such as computer games and cinema. Authoring lip syncing with complex and subtle expressions is still difficult and fraught with problems in terms of realism. Thus, this study proposes a lip-syncing method for a realistic, expressive 3D face model. Animated lips require a 3D face model capable of representing the movement of face muscles during speech and a method to produce the correct lip shape at the correct time. The 3D face model is designed based on the MPEG-4 facial animation standard to support lip syncing aligned with an input audio file. It deforms using a Raised Cosine Deformation function grafted onto the input facial geometry. This study also proposes a method to animate the 3D face model over time to create animated lip syncing, using a canonical set of visemes for all pairwise combinations of a reduced phoneme set called ProPhone. Finally, this study integrates emotions by considering both the Ekman model and Plutchik's wheel, together with emotive eye movements implemented via the Emotional Eye Movements Markup Language, to produce a realistic 3D face model. The experimental results show that the proposed model can generate visually satisfactory animations with a Mean Square Error of 0.0020 for the neutral expression, 0.0024 for happy, 0.0020 for angry, 0.0030 for fear, 0.0026 for surprise, 0.0010 for disgust, and 0.0030 for sad.
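A sketch of a raised-cosine deformation kernel of the kind named above: vertices within a radius of a control point move with a smooth cosine falloff. The exact formulation in the thesis may differ; this is a standard illustration.

```python
# Raised-cosine falloff deformation around a control point on a mesh.
import numpy as np

def raised_cosine_deform(vertices, centre, displacement, radius):
    """vertices: (N, 3); centre, displacement: (3,); radius: float."""
    d = np.linalg.norm(vertices - centre, axis=1)
    w = np.where(d < radius, 0.5 * (1.0 + np.cos(np.pi * d / radius)), 0.0)
    return vertices + w[:, None] * displacement      # weight 1 at centre, 0 at radius
```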