Topic

Viseme

About: Viseme is a research topic. Over its lifetime, 865 publications have been published within this topic, receiving 17,889 citations.


Papers
Patent
23 Nov 2016
TL;DR: In this paper, the entire pipeline of hand-engineered components is replaced with neural networks, and end-to-end learning allows handling a diverse variety of speech, including noisy environments, accents, and different languages.
Abstract: Embodiments of end-to-end deep learning systems and methods are disclosed to recognize speech of vastly different languages, such as English or Mandarin Chinese. In embodiments, the entire pipelines of hand-engineered components are replaced with neural networks, and the end-to-end learning allows handling a diverse variety of speech including noisy environments, accents, and different languages. Using a trained embodiment and an embodiment of a batch dispatch technique with GPUs in a data center, an end-to-end deep learning system can be inexpensively deployed in an online setting, delivering low latency when serving users at scale.

46 citations
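The patent above describes replacing a hand-engineered speech-recognition pipeline with a single network trained end to end. Below is a minimal PyTorch sketch of that style of system (spectrogram in, per-frame character probabilities out, trained with CTC); the layer sizes, vocabulary, and training step are illustrative assumptions, not the patent's actual configuration.

```python
# Minimal sketch of an end-to-end speech recognizer in the spirit of the
# description above: spectrogram in, character probabilities out, trained
# with CTC so no hand-engineered alignment or pronunciation lexicon is needed.
# Layer sizes and the vocabulary are illustrative assumptions.
import torch
import torch.nn as nn

class EndToEndASR(nn.Module):
    def __init__(self, n_mels=80, hidden=256, n_chars=29):  # 26 letters + space + apostrophe + CTC blank
        super().__init__()
        # Convolutional front end over the spectrogram (time x frequency)
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(11, 41), stride=(2, 2), padding=(5, 20)),
            nn.ReLU(),
        )
        # Recurrent layers model temporal context; bidirectional as in many end-to-end systems
        self.rnn = nn.GRU(32 * (n_mels // 2), hidden, num_layers=3,
                          bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, n_chars)

    def forward(self, spec):                     # spec: (batch, time, n_mels)
        x = self.conv(spec.unsqueeze(1))         # -> (batch, 32, time/2, n_mels/2)
        x = x.permute(0, 2, 1, 3).flatten(2)     # -> (batch, time/2, 32 * n_mels/2)
        x, _ = self.rnn(x)
        return self.fc(x).log_softmax(-1)        # per-frame character log-probs for CTC

# Training step with CTC loss (blank index 0 here by assumption)
model = EndToEndASR()
ctc = nn.CTCLoss(blank=0)
spec = torch.randn(4, 200, 80)                   # dummy batch of spectrograms
targets = torch.randint(1, 29, (4, 20))          # dummy character targets
log_probs = model(spec)                          # (batch, frames, chars)
input_lens = torch.full((4,), log_probs.size(1), dtype=torch.long)
target_lens = torch.full((4,), 20, dtype=torch.long)
loss = ctc(log_probs.transpose(0, 1), targets, input_lens, target_lens)
loss.backward()
```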

Journal ArticleDOI
TL;DR: It is shown that there is a definite difference in performance between viseme-to-phoneme mappings and why some maps appear to work better than others, and a new algorithm for constructing phoneme-to-viseme mappings from labeled speech data is devised.

46 citations
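The paper above compares phoneme-to-viseme mappings and derives new ones from labeled speech data. The sketch below shows one generic way such a mapping can be built, by clustering phonemes whose visual realizations are frequently confused; it is not necessarily the algorithm the paper devises, and the confusion counts are toy data.

```python
# Hedged sketch of one common way to derive a phoneme-to-viseme mapping from
# labeled data: cluster phonemes that are often visually confused with one
# another. Generic confusion-clustering, with toy counts.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def phoneme_to_viseme_map(confusion, phonemes, n_visemes=12):
    """confusion[i, j] = how often phoneme i was visually recognized as phoneme j."""
    # Symmetrize and normalize to a similarity matrix in [0, 1]
    sim = (confusion + confusion.T).astype(float)
    sim /= sim.max()
    dist = 1.0 - sim                       # visually confusable -> small distance
    np.fill_diagonal(dist, 0.0)
    # Agglomerative clustering on the condensed distance matrix
    condensed = dist[np.triu_indices(len(phonemes), k=1)]
    labels = fcluster(linkage(condensed, method="average"),
                      t=n_visemes, criterion="maxclust")
    return {p: int(v) for p, v in zip(phonemes, labels)}

# Toy example: /p/, /b/, /m/ typically fall into one bilabial viseme class
phonemes = ["p", "b", "m", "f", "v", "s"]
confusion = np.array([
    [30,  8,  7,  1,  1,  0],
    [ 9, 28,  6,  1,  0,  1],
    [ 8,  7, 29,  0,  1,  1],
    [ 1,  1,  0, 25, 10,  2],
    [ 0,  1,  1, 11, 24,  2],
    [ 1,  0,  1,  2,  2, 30],
])
print(phoneme_to_viseme_map(confusion, phonemes, n_visemes=3))
```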

Proceedings ArticleDOI
17 Sep 2006
TL;DR: This paper describes initial work in developing a real-time audio-visual Chinese speech synthesizer with a 3D expressive avatar, and extends the dominance blending approach to effect animations for coarticulated visemes superposed with expression changes.
Abstract: This paper describes our initial work in developing a real-time audio-visual Chinese speech synthesizer with a 3D expressive avatar. The avatar model is parameterized according to the MPEG-4 facial animation standard [1]. This standard offers a compact set of facial animation parameters (FAPs) and feature points (FPs) to enable realization of 20 Chinese visemes and 7 facial expressions (i.e. 27 target facial configurations). The Xface [2] open source toolkit enables us to define the influence zone for each FP and the deformation function that relates them. Hence we can easily animate a large number of coordinates in the 3D model by specifying values for a small set of FAPs and their FPs. FAP values for 27 target facial configurations were estimated from available corpora. We extended the dominance blending approach to effect animations for coarticulated visemes superposed with expression changes. We selected six sentiment-carrying text messages and synthesized expressive visual speech (for all expressions, in randomized order) with neutral audio speech. A perceptual experiment involving 11 subjects shows that they can identify the facial expression that matches the text message’s sentiment 85% of the time.

46 citations
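The paper above extends dominance blending to animate coarticulated visemes. The sketch below illustrates the basic idea of dominance blending (in the spirit of the Cohen-Massaro model): each viseme exerts an exponentially decaying dominance around its center, and a facial animation parameter (FAP) at any instant is the dominance-weighted average of the viseme targets. The constants and FAP targets are illustrative, not values from the paper.

```python
# Sketch of dominance blending for viseme coarticulation. Each viseme segment
# exerts an exponentially decaying "dominance" around its center time; a FAP
# value at time t is the dominance-weighted average of the viseme targets.
import math

def dominance(t, center, alpha=1.0, theta=6.0):
    """Dominance of a viseme centered at `center`, evaluated at time t (seconds)."""
    return alpha * math.exp(-theta * abs(t - center))

def blend_fap(t, segments):
    """segments: list of (center_time, fap_target) for consecutive visemes."""
    weights = [dominance(t, c) for c, _ in segments]
    total = sum(weights) or 1e-9
    return sum(w * target for w, (_, target) in zip(weights, segments)) / total

# Toy example: one FAP (e.g. lip opening) across three visemes
segments = [(0.10, 0.8),   # open viseme
            (0.25, 0.1),   # closed (bilabial) viseme
            (0.40, 0.6)]   # mid-open viseme
for t in (0.10, 0.175, 0.25, 0.325, 0.40):
    print(f"t={t:.3f}s  FAP={blend_fap(t, segments):.3f}")
```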

Book ChapterDOI
20 Nov 2016
TL;DR: This paper tackles ALR as a classification task using an end-to-end neural network based on a convolutional neural network and long short-term memory architecture, and shows that additional view information helps to improve the performance of ALR.
Abstract: It is well known that automatic lip-reading (ALR), also known as visual speech recognition (VSR), enhances the performance of speech recognition in a noisy environment and also has applications of its own. However, ALR is a challenging task due to various lip shapes and the ambiguity of visemes (the basic unit of visual speech information). In this paper, we tackle ALR as a classification task using an end-to-end neural network based on a convolutional neural network and long short-term memory architecture. We conduct single-, cross-, and multi-view experiments in a speaker-independent setting with various network configurations to integrate the multi-view data. We achieve 77.9%, 83.8%, and 78.6% classification accuracies on average for single, cross, and multi-view respectively. This result is better than the best score (76%) of the preliminary single-view results given by the ACCV 2016 workshop on multi-view lip-reading/audio-visual challenges. It also shows that additional view information helps to improve the performance of ALR with a neural network architecture.

46 citations
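The paper above treats lip-reading as utterance classification with a CNN plus LSTM. Below is a minimal PyTorch sketch of such an architecture: a small CNN encodes each mouth-region frame, an LSTM models the frame sequence, and a linear head classifies the clip. All shapes, layer sizes, and the class count are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of a CNN + LSTM lip-reading classifier: per-frame CNN
# features, an LSTM over the frame sequence, and a linear classification head.
import torch
import torch.nn as nn

class LipReader(nn.Module):
    def __init__(self, n_classes=10, feat=128, hidden=256):
        super().__init__()
        # Per-frame CNN encoder for 64x64 grayscale mouth crops
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # -> 16x32x32
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # -> 32x16x16
            nn.AdaptiveAvgPool2d(4),                               # -> 32x4x4
            nn.Flatten(),
            nn.Linear(32 * 4 * 4, feat), nn.ReLU(),
        )
        # LSTM over the sequence of frame features
        self.lstm = nn.LSTM(feat, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, frames):                   # frames: (batch, time, 1, 64, 64)
        b, t = frames.shape[:2]
        f = self.cnn(frames.flatten(0, 1))       # -> (batch*time, feat)
        f = f.view(b, t, -1)
        out, _ = self.lstm(f)
        return self.head(out[:, -1])             # classify from the last time step

# Dummy forward pass: 2 clips of 30 frames each
model = LipReader(n_classes=10)
clips = torch.randn(2, 30, 1, 64, 64)
logits = model(clips)                            # -> (2, 10)
print(logits.shape)
```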

Journal ArticleDOI
01 Jan 1968

45 citations


Network Information
Related Topics (5)
Vocabulary
44.6K papers, 941.5K citations
78% related
Feature vector
48.8K papers, 954.4K citations
76% related
Feature extraction
111.8K papers, 2.1M citations
75% related
Feature (computer vision)
128.2K papers, 1.7M citations
74% related
Unsupervised learning
22.7K papers, 1M citations
73% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    7
2022    12
2021    13
2020    39
2019    19
2018    22