
Viseme

About: Viseme is a research topic. Over its lifetime, 865 publications have been published on this topic, receiving 17,889 citations.


Papers
Proceedings ArticleDOI
04 Oct 2004
TL;DR: Leveraging the synchrony between speech and finger tapping provides a 46% relative improvement in connected digit recognition experiments and a 1% absolute improvement in LVCSR experiments.
Abstract: Behavioral synchronization between speech and finger tapping provides a novel approach to improving speech recognition accuracy. We combine a sequence of finger-tapping timings, recorded alongside an utterance, with the speech signal using two distinct methods: in the first method, HMM state transition probabilities at word boundaries are controlled by the timing of the finger tapping; in the second, the probability (relative frequency) of the finger tapping is used as a 'feature' and combined with MFCCs in an HMM recognition system. We evaluate these methods through connected digit recognition under different noise conditions (AURORA-2J) and LVCSR tasks. Leveraging the synchrony between speech and finger tapping provides a 46% relative improvement in the connected digit recognition experiments and a 1% absolute improvement in the LVCSR experiments.

4 citations
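As a rough illustration of the second fusion method above, the sketch below appends a tap-probability stream to per-frame MFCC features. This is a minimal Python sketch assuming NumPy, a precomputed MFCC matrix, and illustrative frame-rate and window values; it is not the paper's implementation.

```python
# Feature-level fusion sketch: the relative frequency of finger taps near
# each frame is appended to the MFCC vectors as one extra dimension.
# Frame rate, window size, and array shapes are illustrative assumptions.
import numpy as np

def tap_probability(tap_times, n_frames, frame_rate=100.0, window=0.2):
    """Relative frequency of taps in a window around each frame centre."""
    centres = np.arange(n_frames) / frame_rate          # frame centres in seconds
    taps = np.asarray(tap_times)
    # count taps falling inside +/- window/2 of each frame centre
    counts = np.array([np.sum(np.abs(taps - t) <= window / 2) for t in centres])
    return counts / max(len(taps), 1)                   # normalise to a frequency

def fuse_features(mfcc, tap_times, frame_rate=100.0):
    """Append the tap-probability stream to an (n_frames, n_mfcc) MFCC matrix."""
    p = tap_probability(tap_times, mfcc.shape[0], frame_rate)
    return np.hstack([mfcc, p[:, None]])                # (n_frames, n_mfcc + 1)

# Example: 300 frames of 13-dim MFCCs plus taps every 0.5 s.
fused = fuse_features(np.random.randn(300, 13), tap_times=np.arange(0.0, 3.0, 0.5))
print(fused.shape)   # (300, 14)
```

The fused vectors would then be fed to a standard HMM recogniser in place of the plain MFCCs.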

Proceedings ArticleDOI
01 Aug 2016
TL;DR: Experimental results show that small changes to the allophone set may provide better speech recognition quality than the phoneme-based approach.
Abstract: The article presents studies on automatic whispery speech recognition. The research uses a new corpus of whispery speech. It was examined whether an extended set of articulatory units (allophones used instead of phonemes) improves the quality of whispery speech recognition. Experimental results show that small changes to the allophone set may provide better speech recognition quality than the phoneme-based approach. The authors have also made available a trained g2p (grapheme-to-phoneme) model of the Polish language for the Sequitur toolkit.

4 citations
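The allophone idea above can be pictured as a context-dependent rewrite of a phoneme-level lexicon before acoustic model training. The sketch below is a toy Python illustration with hypothetical rules and symbols; it does not reproduce the paper's Polish allophone inventory.

```python
# Toy context-dependent phoneme-to-allophone rewrite. The rules and the
# allophone symbols ("N", "t'") are hypothetical examples, not the
# paper's actual inventory.
def to_allophones(phonemes):
    """Replace selected phonemes with context-dependent allophone symbols."""
    out = []
    for i, p in enumerate(phonemes):
        nxt = phonemes[i + 1] if i + 1 < len(phonemes) else None
        if p == "n" and nxt in {"k", "g"}:
            out.append("N")        # velar nasal allophone before velar stops
        elif p == "t" and nxt == "i":
            out.append("t'")       # palatalised variant before a front vowel
        else:
            out.append(p)
    return out

print(to_allophones(["b", "a", "n", "k"]))   # ['b', 'a', 'N', 'k']
```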

01 Jan 2003
TL;DR: A probabilistic and statistical framework for connected-word ASR based on acoustic-phonetic knowledge, which could overcome the disadvantages of the early acoustic-phonetic knowledge-based systems, the disadvantages that led the ASR community to switch to systems highly dependent on statistical pattern analysis methods.
Abstract: In spite of decades of research, Automatic Speech Recognition (ASR) is far from reaching the goal of performance close to Human Speech Recognition (HSR). One of the reasons for the unsatisfactory performance of state-of-the-art ASR systems, which are based largely on Hidden Markov Models (HMMs), is the inferior acoustic modeling of low-level, or phonetic-level, linguistic information in the speech signal. An acoustic-phonetic approach to ASR, on the other hand, explicitly targets linguistic information in the speech signal. But no acoustic-phonetic system exists that carries out large speech recognition tasks, for example, connected word or continuous speech recognition. We propose a probabilistic and statistical framework for connected word ASR based on the knowledge of acoustic phonetics. The proposed system is based on the idea of representing speech sounds by bundles of binary-valued articulatory phonetic features. The probabilistic framework requires only binary classifiers of phonetic features and the knowledge-based acoustic correlates of the features for the purpose of connected word speech recognition. We explore the use of Support Vector Machines (SVMs) for binary phonetic feature classification because SVMs offer properties well suited to our recognition task. In the proposed method, probabilistic segmentation of speech is obtained using SVM-based classifiers of manner phonetic features. The linguistically motivated landmarks obtained in each segmentation are used for classification of source and place phonetic features. Probabilistic segmentation paths are constrained using Finite State Automata (FSA) for isolated or connected word recognition. The proposed method could overcome the disadvantages encountered by the early acoustic-phonetic knowledge-based systems, the disadvantages that led the ASR community to switch to ASR systems highly dependent on statistical pattern analysis methods.

4 citations
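The core building block of the framework above, a bank of binary phonetic-feature classifiers, can be sketched with off-the-shelf SVMs. The Python example below trains one binary SVM per manner feature using scikit-learn; the feature names, the synthetic data, and the 12-dimensional acoustic correlates are illustrative stand-ins, not the paper's actual front end.

```python
# One binary SVM per manner phonetic feature; posteriors from these
# classifiers would drive the probabilistic segmentation. Data and
# feature names here are synthetic placeholders.
import numpy as np
from sklearn.svm import SVC

MANNER_FEATURES = ["sonorant", "continuant", "syllabic"]   # illustrative subset

def train_feature_classifiers(X, feature_labels):
    """Train one binary SVM per manner feature.
    feature_labels maps feature name -> array of {0,1} labels, one per frame."""
    return {name: SVC(kernel="rbf", probability=True).fit(X, y)
            for name, y in feature_labels.items()}

# Synthetic data: 200 frames of 12-dim knowledge-based acoustic correlates.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))
labels = {f: rng.integers(0, 2, size=200) for f in MANNER_FEATURES}

clfs = train_feature_classifiers(X, labels)
# Per-frame posterior of each binary feature, later combined into
# probabilistic segmentation paths constrained by an FSA.
posteriors = {f: c.predict_proba(X)[:, 1] for f, c in clfs.items()}
```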

Posted Content
TL;DR: This paper is the first to show that 3D feature extraction methods can be applied to continuous sequence recognition tasks despite the unknown start positions and durations of each phoneme, and it confirms that 3D feature extraction methods improve accuracy compared to 2D feature extraction methods.
Abstract: Visual speech recognition aims to identify the sequence of phonemes from continuous speech. Unlike the traditional approach of using 2D image feature extraction methods to derive features of each video frame separately, this paper proposes a new approach using a 3D (spatio-temporal) Discrete Cosine Transform to extract features of each feasible sub-sequence of an input video; these are subsequently classified individually using Support Vector Machines and combined to find the most likely phoneme sequence using a tailor-made Hidden Markov Model. The algorithm is trained and tested on the VidTimit database to recognise sequences of phonemes as well as visemes (visual speech units). Furthermore, the system is extended with training on phoneme or viseme pairs (biphones) to counteract the co-articulation ambiguity of human speech. The test set accuracy for the recognition of phoneme sequences is 20%, and the accuracy for viseme sequences is 39%. Both results improve on the best values reported in other papers by approximately 2%. The contribution of the result is threefold. Firstly, this paper is the first to show that 3D feature extraction methods can be applied to continuous sequence recognition tasks despite the unknown start positions and durations of each phoneme. Secondly, the result confirms that 3D feature extraction methods improve accuracy compared to 2D feature extraction methods. Thirdly, the paper is the first to specifically compare an otherwise identical method with and without biphones, verifying that the usage of biphones has a positive impact on the result.

4 citations
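The 3D feature extraction step described above can be approximated with a spatio-temporal DCT over a block of mouth-region frames. The Python sketch below uses scipy.fft.dctn and keeps a low-frequency corner of the coefficient cube as the feature vector; the block size and coefficient counts are illustrative assumptions, not the paper's settings.

```python
# 3D (spatio-temporal) DCT features for one candidate sub-sequence:
# transform the whole (T, H, W) block at once and keep the low-frequency
# corner as a compact feature vector. Sizes here are assumptions.
import numpy as np
from scipy.fft import dctn

def dct3d_features(frames, keep=(4, 8, 8)):
    """frames: (T, H, W) grayscale mouth-region sub-sequence.
    Returns the lowest keep[0] x keep[1] x keep[2] DCT coefficients,
    flattened into a single feature vector."""
    coeffs = dctn(frames, norm="ortho")                 # 3D DCT over (T, H, W)
    t, h, w = keep
    return coeffs[:t, :h, :w].ravel()

# Example: a 12-frame, 32x48-pixel sub-sequence -> 256-dim feature vector.
feats = dct3d_features(np.random.rand(12, 32, 48))
print(feats.shape)   # (256,)
```

Each feasible sub-sequence would be encoded this way, classified by an SVM, and the per-sub-sequence scores combined by the HMM into a phoneme or viseme sequence.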

Proceedings ArticleDOI
27 Jun 2007
TL;DR: This paper addresses human lip synchronization in one of the most popular 3D modelling packages, Autodesk Maya, and describes an automatic way to analyse voice using Adobe After Effects and pass the result to Maya.
Abstract: Realistic facial synthesis is one of the most fundamental problems in computer graphics, and one of the most difficult. Lip synchronization is a very important part of this. This paper addresses human lip synchronization in one of the most popular 3D modelling packages, Autodesk Maya. An automatic way to analyse voice using Adobe After Effects, and how its result is passed to Maya, is described; this pipeline is still in progress, so the voice is currently analyzed and classified manually. Using Maya's scripting language, MEL (Maya Embedded Language), the recognized phonemes are associated with mouth positions to provide visemes for computer animation of speech. The different mouth positions are created using blend shape deformers, which let you deform a surface into the shapes of other surfaces. Lip animation is facilitated by activating facial muscles and the jaw on a facial model created specially for animation. High-speed, natural-looking synchronized lip animation is achieved. It is planned to model the expressive visual features of expressive speech. Informal user testing suggests that the addition of detailed internal mouth structures, such as the tongue, would improve phrase recognition.

4 citations
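The blend-shape viseme setup described above can be sketched with Maya's Python commands (maya.cmds) rather than MEL. In the sketch below, the mesh and target names are hypothetical, and the timed viseme list is assumed to come from the (currently manual) voice-analysis step; it only runs inside a Maya session.

```python
# Hedged sketch: key blend-shape weights so each recognized viseme ramps
# in and out around its frame. Mesh/target names are hypothetical and the
# (frame, viseme_index) list stands in for the paper's analysis output.
import maya.cmds as cmds

BASE = "faceMesh"                                            # hypothetical base head mesh
TARGETS = ["visemeAA", "visemeEE", "visemeOO", "visemeMM"]   # sculpted mouth shapes

def key_visemes(timed_visemes, overlap=2):
    """Create a blendShape deformer on BASE and key one target weight
    per viseme. timed_visemes: list of (frame, target_index) pairs."""
    node = cmds.blendShape(TARGETS, BASE, name="visemeShapes")[0]
    for frame, idx in timed_visemes:
        attr = "%s.weight[%d]" % (node, idx)
        # ramp the active viseme in and out around its frame
        cmds.setKeyframe(attr, time=frame - overlap, value=0.0)
        cmds.setKeyframe(attr, time=frame, value=1.0)
        cmds.setKeyframe(attr, time=frame + overlap, value=0.0)
    return node

# e.g. "mama": MM at frame 10, AA at 14, MM at 18, AA at 22
key_visemes([(10, 3), (14, 0), (18, 3), (22, 0)])
```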


Network Information
Related Topics (5)
Vocabulary: 44.6K papers, 941.5K citations (78% related)
Feature vector: 48.8K papers, 954.4K citations (76% related)
Feature extraction: 111.8K papers, 2.1M citations (75% related)
Feature (computer vision): 128.2K papers, 1.7M citations (74% related)
Unsupervised learning: 22.7K papers, 1M citations (73% related)
Performance Metrics
Number of papers in the topic in previous years:

Year    Papers
2023    7
2022    12
2021    13
2020    39
2019    19
2018    22