
Showing papers on "Viseme published in 2017"


Journal ArticleDOI
TL;DR: This work uses models based on the unheard acoustic envelope, the motion signal and categorical visual speech features to predict EEG activity during silent lipreading and finds that the model incorporating all three types of features (EMV) outperforms the individual models, as well as both the EV and MV models, while it performs similarly to the EM model.
Abstract: Speech is a multisensory percept, comprising an auditory and visual component. While the content and processing pathways of audio speech have been well characterized, the visual component is less well understood. In this work, we expand current methodologies using system identification to introduce a framework that facilitates the study of visual speech in its natural, continuous form. Specifically, we use models based on the unheard acoustic envelope (E), the motion signal (M) and categorical visual speech features (V) to predict EEG activity during silent lipreading. Our results show that each of these models performs similarly at predicting EEG in visual regions and that respective combinations of the individual models (EV, MV, EM, and EMV) provide an improved prediction of the neural activity over their constituent models. In comparing these different combinations, we find that the model incorporating all three types of features (EMV) outperforms the individual models, as well as both the EV and MV models, while it performs similarly to the EM model. Importantly, EM does not outperform EV and MV, which, considering the higher dimensionality of the V model, suggests that more data is needed to clarify this finding. Nevertheless, the performance of EMV, and comparisons of the subject performances for the three individual models, provides further evidence to suggest that visual regions are involved in both low-level processing of stimulus dynamics and categorical speech perception. This framework may prove useful for investigating modality-specific processing of visual speech under naturalistic conditions.
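To make the modeling framework concrete, here is a minimal sketch of how lagged stimulus features (envelope E, motion M, viseme features V) can be combined to predict multichannel EEG with ridge regression, in the spirit of a forward encoding model. This is not the authors' code: the feature dimensions, lag window, regularization strength, and random data are all illustrative assumptions.

```python
# Sketch: predict EEG from time-lagged stimulus features (E, M, V) with
# ridge regression, in the spirit of a forward encoding / TRF-style model.
# Feature arrays and dimensions are illustrative, not from the paper.
import numpy as np
from sklearn.linear_model import Ridge

def lag_features(X, n_lags):
    """Stack time-lagged copies of X (time x features) -> (time x features*n_lags)."""
    T, F = X.shape
    lagged = np.zeros((T, F * n_lags))
    for k in range(n_lags):
        lagged[k:, k * F:(k + 1) * F] = X[:T - k]
    return lagged

T = 6000                       # samples (e.g. 60 s at 100 Hz)
E = np.random.randn(T, 1)      # unheard acoustic envelope
M = np.random.randn(T, 1)      # lip-motion signal
V = np.random.randn(T, 12)     # categorical viseme features (one row per frame)
eeg = np.random.randn(T, 128)  # 128-channel EEG

X_emv = lag_features(np.hstack([E, M, V]), n_lags=30)   # ~300 ms of lags
model = Ridge(alpha=1e3).fit(X_emv[:5000], eeg[:5000])
pred = model.predict(X_emv[5000:])

# Model quality: per-channel correlation between predicted and recorded EEG.
r = [np.corrcoef(pred[:, c], eeg[5000:, c])[0, 1] for c in range(eeg.shape[1])]
print("mean prediction correlation:", np.mean(r))
```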

48 citations


Journal ArticleDOI
TL;DR: It is shown that there is a definite difference in performance between viseme-to-phoneme mappings and why some maps appear to work better than others, and a new algorithm for constructing phoneme-to-viseme mappings from labeled speech data is devised.

46 citations


Posted Content
TL;DR: This work presents a novel two-pass method of training phoneme classifiers, which uses previously trained visemes in the first pass and achieves classification performance that significantly improves on previous lip-reading results.
Abstract: Machine lipreading (MLR) is speech recognition from visual cues and a niche research problem in speech processing and computer vision. Current challenges fall into two groups: the content of the video, such as the rate of speech, and the parameters of the video recording, e.g. video resolution. We show that HD video is not needed to successfully lipread with a computer. The term "viseme" is used in machine lipreading to represent a visual cue or gesture which corresponds to a subgroup of phonemes where the phonemes are visually indistinguishable. A phoneme is the smallest sound one can utter; because there are more phonemes than visemes, maps between the two units show a many-to-one relationship. Many maps have been presented; we compare these and our results show that Lee's is best. We propose a new method of constructing speaker-dependent phoneme-to-viseme maps and compare these to Lee's. Our results show the sensitivity of phoneme clustering, and we use our new knowledge to augment a conventional MLR system. It has been observed in MLR that classifiers need training on test subjects to achieve accuracy; thus machine lipreading is highly speaker-dependent. Conversely, speaker independence is robust classification of non-training speakers. We investigate the dependence of phoneme-to-viseme maps between speakers and show that there is not high variability of visemes, but there is high variability in trajectory between visemes of individual speakers with the same ground truth. This implies a dependency upon the number of visemes within each set for each individual. We show that prior phoneme-to-viseme maps rarely have enough visemes and that the optimal size, which varies by speaker, ranges from 11 to 35. Finally, we decode from visemes back to phonemes and into words. Our novel approach uses the optimum-range visemes within hierarchical training of phoneme classifiers and demonstrates a significant increase in classification accuracy.
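As an illustration of the many-to-one relationship described above, here is a minimal sketch of applying a phoneme-to-viseme map to a phoneme sequence. The groupings below are invented for the example; they are not Lee's map or any of the maps compared in the paper.

```python
# Sketch: applying a many-to-one phoneme-to-viseme map to a phoneme sequence.
# The groupings here are illustrative only, not Lee's published map.
PHONEME_TO_VISEME = {
    "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
    "f": "V_labiodental", "v": "V_labiodental",
    "t": "V_alveolar", "d": "V_alveolar", "s": "V_alveolar", "z": "V_alveolar",
    "aa": "V_open", "ae": "V_open",
    "iy": "V_spread", "ih": "V_spread",
}

def phonemes_to_visemes(phonemes):
    """Collapse a phoneme sequence onto its (fewer) viseme classes."""
    return [PHONEME_TO_VISEME.get(p, "V_other") for p in phonemes]

print(phonemes_to_visemes(["b", "ae", "t", "m", "ae", "p"]))
# ['V_bilabial', 'V_open', 'V_alveolar', 'V_bilabial', 'V_open', 'V_bilabial']
```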

27 citations


Posted Content
TL;DR: A new database is designed in which the speakers are aware of being lip-read and aim to facilitate lip-reading; in the associated tests, hearing-impaired participants outperformed the normal-hearing participants but without reaching statistical significance.
Abstract: Speech is the most used communication method between humans and it involves the perception of both auditory and visual channels. Automatic speech recognition focuses on interpreting the audio signals, although the video can provide information that is complementary to the audio. Exploiting the visual information, however, has proven challenging. On one hand, researchers have reported that the mapping between phonemes and visemes (visual units) is one-to-many because there are phonemes which are visually similar and indistinguishable from one another. On the other hand, it is known that some people are very good lip-readers (e.g., deaf people). We study the limit of visual-only speech recognition in controlled conditions. With this goal, we designed a new database in which the speakers are aware of being lip-read and aim to facilitate lip-reading. In the literature, there are discrepancies on whether hearing-impaired people are better lip-readers than normal-hearing people. We therefore analyze whether there are differences between the lip-reading abilities of 9 hearing-impaired and 15 normal-hearing people. Finally, human abilities are compared with the performance of a visual automatic speech recognition system. In our tests, hearing-impaired participants outperformed the normal-hearing participants, but without reaching statistical significance. Human observers were able to decode 44% of the spoken message. In contrast, the visual-only automatic system achieved a word recognition rate of 20%. However, if we repeat the comparison in terms of phonemes, both obtained very similar recognition rates, just above 50%. This suggests that the gap between human lip-reading and automatic speech-reading might be more related to the use of context than to the ability to interpret mouth appearance.
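For readers who want to reproduce this style of scoring (word versus phoneme recognition rates), the sketch below computes a simple Levenshtein-based recognition rate. It is a generic scorer applied to illustrative data, not the evaluation code used in the study.

```python
# Sketch: scoring word- or phoneme-level recognition rates with a standard
# Levenshtein alignment. Sequences and numbers are illustrative only.
def edit_distance(ref, hyp):
    """Minimum insertions/deletions/substitutions to turn ref into hyp."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[-1][-1]

def recognition_rate(ref, hyp):
    return 1.0 - edit_distance(ref, hyp) / max(len(ref), 1)

ref_words = "place blue at f two now".split()
hyp_words = "place blue at m two now".split()
print("word recognition rate:", recognition_rate(ref_words, hyp_words))  # 5/6 correct
```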

24 citations


Proceedings ArticleDOI
20 Aug 2017
TL;DR: This paper proposes introducing Auditory Attention to integrate input from multiple microphones directly within an End-to-End speech recognition model, leveraging the attention mechanism to dynamically tune the model’s attention to the most reliable input sources.
Abstract: End-to-End speech recognition is a recently proposed approach that directly transcribes input speech to text using a single model. End-to-End speech recognition methods including Connectionist Temporal Classification and Attention-based Encoder Decoder Networks have been shown to obtain state-of-the-art performance on a number of tasks and significantly simplify the modeling, training and decoding procedures for speech recognition. In this paper, we extend our prior work on End-to-End speech recognition focusing on the effectiveness of these models in far-field environments. Specifically, we propose introducing Auditory Attention to integrate input from multiple microphones directly within an End-to-End speech recognition model, leveraging the attention mechanism to dynamically tune the model’s attention to the most reliable input sources. We evaluate our proposed model on the CHiME-4 task, and show substantial improvement compared to a model optimized for a single microphone input.
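A rough sketch of the idea of attending over multiple microphone channels is given below: encoder states from each channel are scored against a decoder query and fused with softmax weights. The shapes, pooling, and scoring function are assumptions chosen for illustration and do not reproduce the authors' architecture.

```python
# Sketch: softmax attention that weights encoder states from several
# microphone channels, in the spirit of the paper's auditory attention.
# Shapes and the scoring function are illustrative, not the authors' model.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_attention(channel_states, query, W):
    """channel_states: (C, T, D) per-channel encoder outputs.
    query: (D,) current decoder state. W: (D, D) learned projection."""
    pooled = channel_states.mean(axis=1)             # (C, D) summary per channel
    scores = pooled @ W @ query                      # (C,) match against the query
    weights = softmax(scores)                        # estimated channel reliability
    fused = np.tensordot(weights, channel_states, axes=(0, 0))  # (T, D) fused states
    return fused, weights

C, T, D = 4, 50, 32
states = np.random.randn(C, T, D)
fused, w = channel_attention(states, np.random.randn(D), np.random.randn(D, D))
print("channel weights:", np.round(w, 3))
```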

23 citations


Journal Article
TL;DR: The phoneme lipreading system's word accuracy outperforms the viseme-based system's word accuracy; however, the phoneme system achieved lower accuracy at the unit level, which shows the importance of the dictionary for decoding classification outputs into words.
Abstract: There is debate over whether phoneme or viseme units are the most effective for a lipreading system. Some studies use phoneme units even though phonemes describe unique short sounds; other studies tried to improve lipreading accuracy by focusing on visemes, with varying results. We compare the performance of a lipreading system by modeling visual speech using either 13 viseme or 38 phoneme units. We report the accuracy of our system at both word and unit levels. The evaluation task is large vocabulary continuous speech using the TCD-TIMIT corpus. We complete our visual speech modeling via hybrid DNN-HMMs and our visual speech decoder is a Weighted Finite-State Transducer (WFST). We use DCT and Eigenlips as representations of the mouth ROI image. The phoneme lipreading system's word accuracy outperforms the viseme-based system's word accuracy. However, the phoneme system achieved lower accuracy at the unit level, which shows the importance of the dictionary for decoding classification outputs into words.
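As a concrete example of the DCT front-end mentioned above, the sketch below extracts low-frequency 2-D DCT coefficients from a mouth ROI. The ROI size and the number of retained coefficients are illustrative choices, not those used in the paper.

```python
# Sketch: 2-D DCT features of a mouth ROI, keeping only the low-frequency
# coefficients, as is common in lipreading front-ends.
import numpy as np
from scipy.fftpack import dct

def dct_features(roi, keep=8):
    """roi: 2-D grayscale mouth image. Returns the top-left keep x keep
    block of DCT coefficients, flattened into a feature vector."""
    coeffs = dct(dct(roi, axis=0, norm="ortho"), axis=1, norm="ortho")
    return coeffs[:keep, :keep].ravel()

roi = np.random.rand(32, 64)        # stand-in for a cropped mouth image
feat = dct_features(roi)
print(feat.shape)                   # (64,) low-frequency DCT features per frame
```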

22 citations


Proceedings ArticleDOI
01 May 2017
TL;DR: In this article, the authors designed a new database in which the speakers are aware of being lip-read and aim to facilitate lip-reading, and analyzed whether there are differences between the lip-reading abilities of 9 hearing-impaired and 15 normal-hearing people.
Abstract: Speech is the most used communication method between humans and it involves the perception of both auditory and visual channels. Automatic speech recognition focuses on interpreting the audio signals, although the video can provide information that is complementary to the audio. Exploiting the visual information, however, has proven challenging. On one hand, researchers have reported that the mapping between phonemes and visemes (visual units) is one-to-many because there are phonemes which are visually similar and indistinguishable from one another. On the other hand, it is known that some people are very good lip-readers (e.g., deaf people). We study the limit of visual-only speech recognition in controlled conditions. With this goal, we designed a new database in which the speakers are aware of being lip-read and aim to facilitate lip-reading. In the literature, there are discrepancies on whether hearing-impaired people are better lip-readers than normal-hearing people. We therefore analyze whether there are differences between the lip-reading abilities of 9 hearing-impaired and 15 normal-hearing people. Finally, human abilities are compared with the performance of a visual automatic speech recognition system. In our tests, hearing-impaired participants outperformed the normal-hearing participants, but without reaching statistical significance. Human observers were able to decode 44% of the spoken message. In contrast, the visual-only automatic system achieved a word recognition rate of 20%. However, if we repeat the comparison in terms of phonemes, both obtained very similar recognition rates, just above 50%. This suggests that the gap between human lip-reading and automatic speech-reading might be more related to the use of context than to the ability to interpret mouth appearance.

19 citations


Patent
21 Feb 2017
TL;DR: In this paper, a system and method for training a set of expression and neutral convolutional neural networks using a single performance mapped to the set of known phonemes and visemes in the form of predetermined sentences and facial expressions is described.
Abstract: There is disclosed a system and method for training a set of expression and neutral convolutional neural networks using a single performance mapped to a set of known phonemes and visemes in the form of predetermined sentences and facial expressions. Then, subsequent training of the convolutional neural networks can occur using temporal data derived from audio data within the original performance mapped to a set of professionally created three-dimensional animations. Thereafter, with sufficient training, the expression and neutral convolutional neural networks can generate facial animations from facial image data in real time without individual-specific training.

17 citations


Book ChapterDOI
12 Sep 2017
TL;DR: Some of the developments in speech recognition and keyword-spotting during the lifetime of the IARPA Babel program are described, with a focus on techniques developed at Cambridge University.
Abstract: The IARPA Babel program ran from March 2012 to November 2016. The aim of the program was to develop agile and robust speech technology that can be rapidly applied to any human language in order to provide effective search capability on large quantities of real world data. This paper will describe some of the developments in speech recognition and keyword-spotting during the lifetime of the project. Two technical areas will be briefly discussed with a focus on techniques developed at Cambridge University: the application of deep learning for low-resource speech recognition; and efficient approaches for keyword spotting. Finally a brief analysis of the Babel speech language characteristics and language performance will be presented.

16 citations


Posted Content
TL;DR: It is concluded that, broadly speaking, speakers have the same repertoire of mouth gestures; where they differ is in their use of those gestures.
Abstract: In machine lip-reading, which is identification of speech from visual-only information, there is evidence to show that visual speech is highly dependent upon the speaker [1]. Here, we use a phoneme-clustering method to form new phoneme-to-viseme maps for both individual and multiple speakers. We use these maps to examine how similarly speakers talk visually. We conclude that, broadly speaking, speakers have the same repertoire of mouth gestures; where they differ is in their use of those gestures.
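A minimal sketch of one way to realize such phoneme clustering is shown below: phonemes that are frequently confused by a recognizer are grouped into viseme classes by agglomerative clustering on a confusion-derived distance. The confusion matrix, phoneme set, cluster count, and use of scikit-learn are illustrative assumptions, not the paper's data or exact method.

```python
# Sketch: deriving viseme groups by clustering phonemes that are confused
# with one another during recognition. The confusion matrix here is random;
# in practice it would come from a phoneme recognizer's output.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

phonemes = ["p", "b", "m", "f", "v", "s", "z", "t", "d"]
rng = np.random.default_rng(0)
confusion = rng.random((len(phonemes), len(phonemes)))
confusion = (confusion + confusion.T) / 2           # symmetrise
np.fill_diagonal(confusion, confusion.max())        # self-confusion is maximal

# Phonemes confused often are "visually close"; convert similarity to distance.
distance = confusion.max() - confusion
labels = AgglomerativeClustering(
    n_clusters=4, metric="precomputed", linkage="average"
).fit_predict(distance)

for k in range(4):
    print(f"viseme {k}:", [p for p, l in zip(phonemes, labels) if l == k])
```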

15 citations


Book ChapterDOI
12 Sep 2017
TL;DR: The evaluation experiments show an increase in absolute recognition accuracy of up to 3% and demonstrate that using the high-speed JAI Pulnix camera at 200 fps achieves better recognition results under different acoustically noisy conditions.
Abstract: The purpose of this study is to develop a robust audio-visual speech recognition system and to investigate the influence of high-speed video data on the recognition accuracy of continuous Russian speech under different noisy conditions. The developed experimental setup and collected multimodal database allow us to explore the impact of high-speed video recordings with various frame rates, from the standard 25 fps up to a high-speed 200 fps. At the moment there is no research objectively reflecting the dependence of speech recognition accuracy on the video frame rate. There are also no relevant audio-visual databases for model training. In this paper, we try to fill in this gap for continuous Russian speech. Our evaluation experiments show an increase in absolute recognition accuracy of up to 3% and demonstrate that using the high-speed JAI Pulnix camera at 200 fps achieves better recognition results under different acoustically noisy conditions.

Proceedings ArticleDOI
01 Jan 2017
TL;DR: The 12th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2017) was held in Porto, Portugal in 2017 as discussed by the authors.
Abstract: Paper presented at the 12th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2017), held from 27 February to 1 March 2017 in Porto, Portugal.

Journal ArticleDOI
TL;DR: The best multimodal system, which combines the two acoustic cues as well as the visual cue, improves the recognition of the POA and MOA categories by 3% and of vowels by 2%.

Posted Content
TL;DR: The authors use a structured approach for devising speaker-dependent viseme classes, which enables the creation of a set of phoneme-to-viseme maps where each has a different quantity of visemes ranging from two to 45.
Abstract: In machine lip-reading there is continued debate and research around the correct classes to be used for recognition. In this paper we use a structured approach for devising speaker-dependent viseme classes, which enables the creation of a set of phoneme-to-viseme maps where each has a different quantity of visemes ranging from two to 45. Viseme classes are based upon the mapping of articulated phonemes, which have been confused during phoneme recognition, into viseme groups. Using these maps, with the LiLIR dataset, we show the effect of changing the viseme map size in speaker-dependent machine lip-reading, measured by word recognition correctness and so demonstrate that word recognition with phoneme classifiers is not just possible, but often better than word recognition with viseme classifiers. Furthermore, there are intermediate units between visemes and phonemes which are better still.

Posted Content
TL;DR: This paper showed that visemes, which were defined over a century ago, are unlikely to be optimal for a modern computer lip-reading system and showed that computer lip reading is not heavily constrained by video resolution, pose, lighting and other practical factors.
Abstract: In the quest for greater computer lip-reading performance there are a number of tacit assumptions which are either present in the datasets (high resolution for example) or in the methods (recognition of spoken visual units called visemes for example). Here we review these and other assumptions and show the surprising result that computer lip-reading is not heavily constrained by video resolution, pose, lighting and other practical factors. However, the working assumption that visemes, which are the visual equivalent of phonemes, are the best unit for recognition does need further examination. We conclude that visemes, which were defined over a century ago, are unlikely to be optimal for a modern computer lip-reading system.

Patent
03 Mar 2017
TL;DR: In this article, a system and method for animated lip synchronization is presented, which includes capturing speech input, parsing the speech input into phenomes, aligning the phonemes to the corresponding portions of the speech inputs, mapping the phonemees to visemes, synchronizing the viseme into viseme action units and outputting the viseme actions.
Abstract: A system and method for animated lip synchronization. The method includes: capturing speech input; parsing the speech input into phonemes; aligning the phonemes to the corresponding portions of the speech input; mapping the phonemes to visemes; synchronizing the visemes into viseme action units, the viseme action units comprising jaw and lip contributions for each of the phonemes; and outputting the viseme action units.
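A toy sketch of such a pipeline stage is given below, mapping time-aligned phonemes to viseme action units with separate jaw and lip contributions. The mappings, weights, and timings are invented for illustration and are not taken from the patent.

```python
# Sketch: toy mapping from time-aligned phonemes to viseme action units with
# jaw and lip contributions. All mappings and weights are illustrative only.
PHONEME_TO_VISEME = {"p": "bilabial", "aa": "open", "t": "alveolar"}
VISEME_ACTION_UNITS = {              # (jaw_open, lip_round) contributions
    "bilabial": (0.05, 0.2),
    "open": (0.9, 0.1),
    "alveolar": (0.3, 0.1),
}

def lip_sync(aligned_phonemes):
    """aligned_phonemes: list of (phoneme, start_s, end_s) tuples."""
    units = []
    for phone, start, end in aligned_phonemes:
        viseme = PHONEME_TO_VISEME.get(phone, "open")
        jaw, lip = VISEME_ACTION_UNITS[viseme]
        units.append({"viseme": viseme, "start": start, "end": end,
                      "jaw": jaw, "lip": lip})
    return units

for u in lip_sync([("p", 0.00, 0.08), ("aa", 0.08, 0.30), ("t", 0.30, 0.38)]):
    print(u)
```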

Patent
Rama Doddipatla1
17 Jan 2017
TL;DR: In this paper, a test-speaker-specific adaptive system for recognizing sounds in speech spoken by a test speaker is proposed, which is trained using training data from training speakers.
Abstract: A method for generating a test-speaker-specific adaptive system for recognising sounds in speech spoken by a test speaker; the method employing: (i) training data comprising speech items spoken by the test speaker; and (ii) an input network component and a speaker adaptive output network, the input network component and speaker adaptive output network having been trained using training data from training speakers; the method comprising: (a) using the training data to train a test-speaker-specific adaptive model component of an adaptive model comprising the input network component, and the test-speaker-specific adaptive model component, and (b) providing the test-speaker-specific adaptive system comprising the input network component, the trained test-speaker-specific adaptive model component, and the speaker-adaptive output network.

Posted Content
TL;DR: This work constructs an automatic system that uses local appearance descriptors to extract the main characteristics of the mouth region and HMMs to model the statistical relations of both viseme and phoneme sequences, and tests the system on a Spanish corpus of continuous speech.
Abstract: Speech is the most common communication method between humans and involves the perception of both auditory and visual channels. Automatic speech recognition focuses on interpreting the audio signals, but it has been demonstrated that video can provide information that is complementary to the audio. Thus, the study of automatic lip-reading is important and is still an open problem. One of the key challenges is the definition of the visual elementary units (the visemes) and their vocabulary. Many researchers have analyzed the importance of the phoneme-to-viseme mapping and have proposed viseme vocabularies with lengths between 11 and 15 visemes. These viseme vocabularies have usually been manually defined by their linguistic properties and in some cases using decision trees or clustering techniques. In this work, we focus on the automatic construction of an optimal viseme vocabulary based on the association of phonemes with similar appearance. To this end, we construct an automatic system that uses local appearance descriptors to extract the main characteristics of the mouth region and HMMs to model the statistical relations of both viseme and phoneme sequences. To compare the performance of the system, different descriptors (PCA, DCT and SIFT) are analyzed. We test our system on a Spanish corpus of continuous speech. Our results indicate that we are able to recognize approximately 58% of the visemes, 47% of the phonemes and 23% of the words in a continuous speech scenario, and that the optimal viseme vocabulary for Spanish is composed of 20 visemes.
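The sketch below shows the general shape of an HMM-based viseme recognizer of this kind: one Gaussian HMM per class over per-frame appearance features, with classification by maximum log-likelihood (using the hmmlearn library). The synthetic features, class set, and model sizes are assumptions for illustration, not the authors' configuration.

```python
# Sketch: one Gaussian HMM per viseme class over appearance features, with
# classification by maximum log-likelihood. Data here is synthetic; a real
# system would use DCT/SIFT descriptors of the mouth region per frame.
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)

def train_class_hmm(sequences, n_states=3):
    X = np.vstack(sequences)
    lengths = [len(s) for s in sequences]
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                            n_iter=20, random_state=0)
    model.fit(X, lengths)
    return model

# Two toy viseme classes with slightly different feature statistics.
train = {
    "V_bilabial": [rng.normal(0.0, 1.0, (20, 8)) for _ in range(5)],
    "V_open":     [rng.normal(2.0, 1.0, (20, 8)) for _ in range(5)],
}
models = {v: train_class_hmm(seqs) for v, seqs in train.items()}

test_seq = rng.normal(2.0, 1.0, (20, 8))   # should look like "V_open"
scores = {v: m.score(test_seq) for v, m in models.items()}
print(max(scores, key=scores.get))
```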

Journal Article
TL;DR: Three areas are highlighted with the intention of improving communication between those researching lipreading: the effects of interchanging between speech reading and lipreading; speaker dependence across train, validation, and test splits; and the use of accuracy, correctness, errors, and varying units to measure system performance.
Abstract: We are at an exciting time for machine lipreading. Traditional research stemmed from the adaptation of audio recognition systems, but now the computer vision community is also participating. This joining of two previously disparate areas with different perspectives on computer lipreading is creating opportunities for collaboration, but in doing so the literature is experiencing challenges in knowledge sharing due to multiple uses of terms and phrases and the range of methods for scoring results. In particular, we highlight three areas with the intention of improving communication between those researching lipreading: the effects of interchanging between speech reading and lipreading; speaker dependence across train, validation, and test splits; and the use of accuracy, correctness, errors, and varying units (phonemes, visemes, words, and sentences) to measure system performance. We make recommendations as to how we can be more consistent.

Posted Content
TL;DR: A new method for constructing speaker-dependent phoneme-to-viseme maps is proposed, which uses the optimum range of visemes within hierarchical training of phoneme classifiers and demonstrates a significant increase in classification accuracy.
Abstract: Machine lipreading (MLR) is speech recognition from visual cues and a niche research problem in speech processing and computer vision. Current challenges fall into two groups: the content of the video, such as the rate of speech, and the parameters of the video recording, e.g. video resolution. We show that HD video is not needed to successfully lipread with a computer. The term "viseme" is used in machine lipreading to represent a visual cue or gesture which corresponds to a subgroup of phonemes where the phonemes are visually indistinguishable. A phoneme is the smallest sound one can utter; because there are more phonemes than visemes, maps between the two units show a many-to-one relationship. Many maps have been presented; we compare these and our results show that Lee's is best. We propose a new method of constructing speaker-dependent phoneme-to-viseme maps and compare these to Lee's. Our results show the sensitivity of phoneme clustering, and we use our new knowledge to augment a conventional MLR system. It has been observed in MLR that classifiers need training on test subjects to achieve accuracy; thus machine lipreading is highly speaker-dependent. Conversely, speaker independence is robust classification of non-training speakers. We investigate the dependence of phoneme-to-viseme maps between speakers and show that there is not high variability of visemes, but there is high variability in trajectory between visemes of individual speakers with the same ground truth. This implies a dependency upon the number of visemes within each set for each individual. We show that prior phoneme-to-viseme maps rarely have enough visemes and that the optimal size, which varies by speaker, ranges from 11 to 35. Finally, we decode from visemes back to phonemes and into words. Our novel approach uses the optimum-range visemes within hierarchical training of phoneme classifiers and demonstrates a significant increase in classification accuracy.

Journal Article
TL;DR: Benchmarked against SD results and the isolated-word performance, it is observed that with continuous speech, the trajectory between visemes has a greater negative effect on speaker differentiation.
Abstract: The recent adoption of deep learning methods in machine lipreading research gives us two options for improving system performance. Either we develop end-to-end systems holistically or we experiment to further our understanding of the visual speech signal. The latter option is more difficult, but this knowledge would enable researchers both to improve systems and to apply the new knowledge to other domains such as speech therapy. One challenge in lipreading systems is the correct labeling of the classifiers. These labels map an estimated function between visemes on the lips and the phonemes uttered. Here we ask whether such maps are speaker-dependent. Prior work investigated isolated-word recognition from speaker-dependent (SD) visemes; we extend this to continuous speech. Benchmarked against SD results and the isolated-word performance, we test with RMAV dataset speakers and observe that with continuous speech, the trajectory between visemes has a greater negative effect on speaker differentiation.

Proceedings ArticleDOI
25 Aug 2017
TL;DR: In this paper, the authors compare the performance of the most widely used features for lipreading, Discrete Cosine Transform (DCT) and Active Appearance Models (AAM), in a traditional Hidden Markov Model (HMM) framework.
Abstract: Automatic lipreading has major potential impact for speech recognition, supplementing and complementing the acoustic modality. Most attempts at lipreading have been performed on small vocabulary tasks, due to a shortfall of appropriate audio-visual datasets. In this work we use the publicly available TCD-TIMIT database, designed for large vocabulary continuous audio-visual speech recognition. We compare the viseme recognition performance of the most widely used features for lipreading, Discrete Cosine Transform (DCT) and Active Appearance Models (AAM), in a traditional Hidden Markov Model (HMM) framework. We also exploit recent advances in AAM fitting. We found the DCT to outperform AAM by more than 6% for a viseme recognition task with 56 speakers. The overall accuracy of the DCT is quite low (32-34%). We conclude that a fundamental rethink of the modelling of visual features may be needed for this task.

Proceedings ArticleDOI
01 Aug 2017
TL;DR: This research proposes the establishment of a natural Indonesian viseme sequence influenced by emotional expression, using the Viterbi algorithm on a Hidden Markov Model to convert the text of an Indonesian sentence into a viseme sequence shaped by the expressed emotion.
Abstract: Every language has different characteristics, one of which is how the language is pronounced. Pronunciation accompanied by emotional expression makes these characteristics even more distinctive. This research proposes the establishment of a natural Indonesian viseme sequence influenced by emotional expression. The system converts the text of an Indonesian sentence into a sequence of Indonesian visemes shaped by the expressed emotion. Animations of mouth shape and lip movements are created to complement the visualization of natural speech resulting from the Indonesian viseme model. The natural Indonesian viseme sequence is generated with a statistical approach: the Viterbi algorithm applied to a Hidden Markov Model.
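Since the method rests on Viterbi decoding of a Hidden Markov Model, a generic Viterbi implementation is sketched below. The states, observations, and probabilities are illustrative and unrelated to the Indonesian viseme model itself.

```python
# Sketch: a generic Viterbi decoder for an HMM, the algorithm named in the
# abstract. States, observations, and probabilities are illustrative.
import numpy as np

def viterbi(obs, start_p, trans_p, emit_p):
    """obs: list of observation indices. Returns the most likely state path."""
    n_states = len(start_p)
    T = len(obs)
    logp = np.full((T, n_states), -np.inf)
    back = np.zeros((T, n_states), dtype=int)
    logp[0] = np.log(start_p) + np.log(emit_p[:, obs[0]])
    for t in range(1, T):
        for s in range(n_states):
            cand = logp[t - 1] + np.log(trans_p[:, s])
            back[t, s] = np.argmax(cand)
            logp[t, s] = cand[back[t, s]] + np.log(emit_p[s, obs[t]])
    path = [int(np.argmax(logp[-1]))]
    for t in range(T - 1, 0, -1):        # backtrack through the pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]

start = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3], [0.4, 0.6]])
emit = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi([0, 1, 2, 2], start, trans, emit))
```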

Book ChapterDOI
27 Feb 2017
TL;DR: This work constructs an automatic system that uses DCT and SIFT descriptors to extract the main characteristics of the mouth region and HMMs to model the statistical relations of both viseme and phoneme sequences, and builds a phoneme-to-viseme mapping based on visual similarities between phonemes to maximize word recognition.
Abstract: Speech is the most used communication method between humans, and it is considered a multisensory process. Even though there is a popular belief that speech is something that we hear, there is overwhelming evidence that the brain treats speech as something that we both hear and see. Much of the research has focused on Automatic Speech Recognition (ASR) systems, treating speech primarily as an acoustic form of communication. In recent years, there has been increasing interest in systems for Automatic Lip-Reading (ALR), although exploiting the visual information has proved to be challenging. One of the main problems in ALR is how to make the system robust to the visual ambiguities that appear at the word level. These ambiguities make the definition of the minimum distinguishable unit of the video domain confusing and imprecise. In contrast to the audio domain, where the phoneme is the standard minimum auditory unit, there is no consensus on the definition of the minimum visual unit (the viseme). In this work, we focus on the automatic construction of a phoneme-to-viseme mapping based on visual similarities between phonemes to maximize word recognition. We investigate the usefulness of different phoneme-to-viseme mappings, obtaining the best results for intermediate vocabulary lengths. We construct an automatic system that uses DCT and SIFT descriptors to extract the main characteristics of the mouth region and HMMs to model the statistical relations of both viseme and phoneme sequences. We test our system on two Spanish corpora with continuous speech (AV@CAR and VLRF) containing 19 and 24 speakers, respectively. Our results indicate that we are able to recognize 47% (resp. 51%) of the phonemes and 23% (resp. 21%) of the words, for AV@CAR and VLRF. We also show additional results that support the usefulness of visemes. Experiments on a comparable ALR system trained exclusively using phonemes at all its stages confirm the existence of strong visual ambiguities between groups of phonemes. This fact, together with the higher word accuracy obtained when using phoneme-to-viseme mappings, justifies the usefulness of visemes instead of the direct use of phonemes for ALR.

Journal ArticleDOI
TL;DR: In this article, the authors present the Korean Standard Monosyllabic Word List for Adults (KS-MWL-A).
Abstract: Background and Purpose: Existing studies analyzing the Korean viseme system have been conducted by combining a limited set of phonemes, and so have failed to reflect the diverse characteristics of everyday conversational speech. Accordingly, this study used the Korean Standard Monosyllabic Word List for Adults (KS-MWL-A), which reflects the phoneme occurrence rates of everyday conversational speech...

01 Jan 2017
TL;DR: Facial Capture Lip-Sync, a method for lip-sync in facial capture, as discussed by the authors.
Abstract: Facial Capture Lip-Sync

Patent
11 Jan 2017
TL;DR: In this paper, a Uygur language phoneme-viseme parameter conversion method and system is presented, which consists of adding 41 features and the visibility features of teeth and a tongue, carrying out the clustering of vowel mouth shape data, and obtaining a vowel basic static viseme set.
Abstract: The invention relates to a Uygur language phoneme-viseme parameter conversion method and system, and belongs to the technical field of voice-human face animation information processing. The method comprises the steps: adding 41 features and the visibility features of teeth and a tongue, carrying out the clustering of vowel mouth shape data, and obtaining a vowel basic static viseme set; respectively carrying out the clustering of consonants and mouth shape data combined with different vowels, and obtaining a consonant basic static viseme set; proposing a composite viseme concept based on the above, and building a Uygur language basic dynamic viseme set; giving a composite dynamic viseme model and a dynamic viseme model parameter estimation method based on a linear regression algorithm, thereby achieving the Uygur language phoneme-viseme conversion. According to the invention, the method carries out the text analysis of a to-be-converted Uygur language text according to the basic dynamic viseme set and the model parameters thereof, obtains a basic dynamic viseme sequence in the text, and can generate a human face and lip portion visual voice animation consistent with the content of the text.
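A minimal sketch of the clustering step described above (grouping vowel mouth-shape data into a basic static viseme set) might look like the following; the feature vectors, the number of clusters, and the use of scikit-learn are illustrative assumptions, not the patent's actual parameters.

```python
# Sketch: clustering vowel mouth-shape feature vectors to obtain a basic
# static viseme set, as the abstract describes. Features (lip geometry,
# teeth/tongue visibility, etc.) and counts are illustrative.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Rows: mouth-shape samples for vowels; columns: illustrative geometric
# features plus teeth/tongue visibility values.
vowel_features = rng.random((200, 6))

kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(vowel_features)
print("static viseme centroids:", kmeans.cluster_centers_.shape)  # (8, 6)
```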

Posted Content
TL;DR: The influence of speaker individuality is explained, and it is demonstrated how one can use visemes to boost lipreading; this work has applications beyond machine lipreading.
Abstract: For machines to lipread, or understand speech from lip movement, they decode lip-motions (known as visemes) into the spoken sounds. We investigate the visual speech channel to further our understanding of visemes. This has applications beyond machine lipreading; speech therapists, animators, and psychologists can benefit from this work. We explain the influence of speaker individuality, and demonstrate how one can use visemes to boost lipreading.

Journal Article
TL;DR: This thesis report is submitted in partial fulfilment of the requirements for the degree of Master of Science in Computer Science and Engineering, 2016.
Abstract: This thesis report is submitted in partial fulfilment of the requirements for the degree of Master of Science in Computer Science and Engineering, 2016.

Posted Content
TL;DR: A limited lip-reading algorithm for a subset of the English language that uses Hidden Markov Models to predict the words the speaker is saying from sequences of classified phonemes and visemes.
Abstract: The goal of this project is to develop a limited lip reading algorithm for a subset of the English language. We consider a scenario in which no audio information is available. The raw video is processed and the position of the lips in each frame is extracted. We then prepare the lip data for processing and classify the lips into visemes and phonemes. Hidden Markov Models are used to predict the words the speaker is saying based on the sequences of classified phonemes and visemes. The GRID audiovisual sentence corpus [10][11] database is used for our study.