Topic

Viseme

About: Viseme is a research topic. Over its lifetime, 865 publications have been published within this topic, receiving 17,889 citations.


Papers
Posted Content
12 Sep 2020
TL;DR: This work proposes a method for exaggerated visual-speech feedback in computer-assisted pronunciation training (CAPT) that outperforms the non-exaggerated version in helping learners identify and improve their pronunciation.
Abstract: To provide more discriminative feedback that helps second language (L2) learners better identify their mispronunciations, we propose a method for exaggerated visual-speech feedback in computer-assisted pronunciation training (CAPT). The speech exaggeration is realized by an emphatic speech generation neural network based on Tacotron, while the visual exaggeration is accomplished by ADC Viseme Blending, namely increasing the Amplitude of movement, extending the phone's Duration, and enhancing the color Contrast. User studies show that the exaggerated feedback outperforms the non-exaggerated version in helping learners identify and improve their pronunciation.
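Note: a minimal sketch of how the ADC-style exaggeration could be parameterized is given below. The VisemeFrame fields and the gain factors are illustrative assumptions, not the paper's published implementation.

```python
from dataclasses import dataclass, replace

@dataclass
class VisemeFrame:
    """Illustrative viseme keyframe: lip-movement amplitude (0-1),
    phone duration in seconds, and lip-region color contrast (0-1)."""
    amplitude: float
    duration: float
    contrast: float

def adc_exaggerate(frame: VisemeFrame,
                   amp_gain: float = 1.5,
                   dur_gain: float = 1.3,
                   con_gain: float = 1.2) -> VisemeFrame:
    """ADC-style exaggeration: scale Amplitude, Duration, and Contrast,
    clamping normalized values to [0, 1]. The gains are hypothetical;
    the paper does not publish specific factors."""
    return replace(
        frame,
        amplitude=min(1.0, frame.amplitude * amp_gain),
        duration=frame.duration * dur_gain,
        contrast=min(1.0, frame.contrast * con_gain),
    )

# Example: exaggerate a viseme frame for a single phone
plain = VisemeFrame(amplitude=0.6, duration=0.12, contrast=0.5)
print(adc_exaggerate(plain))
```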
Patent
Xu Haitan, K. K. Chin
07 Apr 2010
TL;DR: In this paper, a method and apparatus are presented for adapting a pattern recognition model, specifically a speech recognition model, between a first and a second environment: a model for performing pattern recognition on an inputted sequence of observations is initially provided, having been trained to recognise a pattern in a second, clean noise environment.
Abstract: The invention provides a method and apparatus for adapting a pattern recognition model, specifically a speech recognition model, between first and second environments. A model for performing pattern recognition on an inputted sequence of observations is initially provided, the model having been trained to recognise a pattern in a second, clean noise environment. The model has a plurality of parameters relating to the probability distribution of a component of a pattern being related to an observation. The model is then adapted (S59) to a first noise environment using inputted observations (S51) from the first noise environment. Adapting the model trained in the second environment to the first comprises using second-order or higher Taylor expansion coefficients derived for a group of probability distributions, wherein the same expansion is used for the whole group. The groups may be regression classes. To recognise speech, a language model is also used, with the combined likelihoods of the adapted model and the language model used to output a sequence of words identified from the input signal of the first noise environment.
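Note: the group-wise Taylor expansion described above resembles standard vector Taylor series (VTS) model compensation. The sketch below uses that textbook form; the mismatch function g and the class-shared expansion point are assumed from common VTS practice rather than copied from the patent.

```latex
% VTS-style adaptation sketch (standard form; not the patent's exact
% equations). Clean speech x and noise n combine, in the cepstral
% domain with DCT matrix C, as
%   y = x + g(x, n),   g(x, n) = C \log\!\left(1 + e^{C^{-1}(n - x)}\right).
% Expanding g around the point \mu_x^{(r)} shared by regression class r:
\mu_y \approx \mu_x + g\!\left(\mu_x^{(r)}, \mu_n\right)
      + J\left(\mu_x - \mu_x^{(r)}\right) + h\!\left(\mu_x - \mu_x^{(r)}\right),
\qquad J = \left.\frac{\partial g}{\partial x}\right|_{\mu_x^{(r)}}
% where h collects the second-order terms: per output dimension i,
%   h_i(\Delta) = \tfrac{1}{2}\,\Delta^{\top} H_i\,\Delta,
% with H_i the Hessian of g_i at the same shared expansion point.
```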
Book Chapter
01 Jan 2021
TL;DR: Preliminary results of applying rough sets in pre-processing video frames (with lip markers) of a spoken corpus, in an effort to label the phonemes spoken by the speakers, show promise in the application of a granular computing method for pre-processing large audio-video datasets.
Abstract: Machine learning algorithms are increasingly effective in algorithmic viseme recognition, which is a main component of audio-visual speech recognition (AVSR). A viseme is the smallest recognizable unit correlated with a particular realization of a given phoneme. Labelling phonemes and assigning them to viseme classes is a challenging problem in AVSR. In this paper, we present preliminary results of applying rough sets in pre-processing video frames (with lip markers) of a spoken corpus in an effort to label the phonemes spoken by the speakers. The problem addressed here is to detect and remove frames in which the shape of the lips does not fully represent a phoneme. Our results demonstrate that the silhouette score improves with rough-set-based pre-processing using unsupervised K-means clustering. In addition, features extracted by an unsupervised CNN model were used as input to the K-means clustering. The results show promise in the application of a granular computing method for pre-processing large audio-video datasets.
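Note: the clustering-and-evaluation step can be illustrated with a short sketch. The CNN features and rough-set filtering are mocked with random data here, and the cluster count is an assumed viseme-class number, not the paper's configuration.

```python
# Cluster lip-frame feature vectors with K-means and score the result
# with the silhouette coefficient, as in the evaluation described above.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
features = rng.normal(size=(500, 64))  # placeholder for CNN lip features

kmeans = KMeans(n_clusters=14, n_init=10, random_state=0)  # assumed viseme count
labels = kmeans.fit_predict(features)

# A higher silhouette indicates better-separated viseme clusters; the
# paper reports this score improving after rough-set frame filtering.
print(f"silhouette: {silhouette_score(features, labels):.3f}")
```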
Posted Content
TL;DR: This work states that using appropriate speech production data could significantly improve the quality of articulatory animation for AV synthesis.
Abstract: The importance of modeling speech articulation for high-quality audiovisual (AV) speech synthesis is widely acknowledged. Nevertheless, while state-of-the-art, data-driven approaches to facial animation can make use of sophisticated motion capture techniques, the animation of the intraoral articulators (viz. the tongue, jaw, and velum) typically makes use of simple rules or viseme morphing, in stark contrast to the otherwise high quality of facial modeling. Using appropriate speech production data could significantly improve the quality of articulatory animation for AV synthesis.
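Note: for readers unfamiliar with the rule-based baseline criticized above, here is a minimal sketch of viseme morphing: each phone maps to a static articulator pose, and consecutive poses are linearly blended. The pose vectors and phone table are purely illustrative.

```python
import numpy as np

VISEME_POSES = {                       # hypothetical 3-DOF poses:
    "sil": np.array([0.0, 0.0, 0.0]),  # [jaw opening, lip rounding, tongue height]
    "AA":  np.array([0.9, 0.1, 0.2]),
    "UW":  np.array([0.3, 0.9, 0.6]),
}

def morph(prev: str, nxt: str, t: float) -> np.ndarray:
    """Blend between consecutive viseme poses; t in [0, 1] across the
    transition. Real systems layer co-articulation rules on top."""
    return (1.0 - t) * VISEME_POSES[prev] + t * VISEME_POSES[nxt]

print(morph("sil", "AA", 0.5))  # halfway between rest and the open-jaw pose
```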
Proceedings Article
24 Aug 2009
TL;DR: This study investigates the use of Cued Speech not only for perception but also for speech production by speech- or hearing-impaired individuals, and proposes an automatic recognition method based on hidden Markov models (HMMs).
Abstract: This study focuses on alternative speech communication based on Cued Speech. Cued Speech is a visual mode of communication that uses handshapes and placements in combination with the mouth movements of speech to make the phonemes of a spoken language look different from each other and clearly understandable to deaf and hearing-impaired people. Originally, the aim of Cued Speech was to overcome the problems of lip reading and thus enable deaf children and adults to fully understand spoken language. In this study, we investigate the use of Cued Speech not only for perception but also for speech production in the case of speech- or hearing-impaired individuals. The proposed method is based on hidden Markov model (HMM) automatic recognition. Automatic recognition of Cued Speech can serve as an alternative communication method for individuals with speech or hearing impairments. This article presents vowel, consonant, and isolated-word recognition experiments for Cued Speech for French. The results obtained are promising and comparable to those obtained when using the audio signal.
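Note: a minimal sketch of isolated-unit recognition with one HMM per class, in the spirit of the approach above. It assumes the hmmlearn library and mocks the hand-shape and lip features used in the paper with random vectors; model sizes are illustrative.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(0)

def mock_utterances(n: int, shift: float) -> list:
    """Stand-in for per-frame Cued Speech feature sequences of one class."""
    return [rng.normal(loc=shift, size=(20, 8)) for _ in range(n)]

train = {"word_a": mock_utterances(10, 0.0), "word_b": mock_utterances(10, 1.0)}

# Train one HMM per word class (Baum-Welch).
models = {}
for word, utts in train.items():
    X = np.vstack(utts)
    lengths = [len(u) for u in utts]
    m = GaussianHMM(n_components=3, covariance_type="diag", n_iter=20)
    m.fit(X, lengths)
    models[word] = m

# Classify a test utterance by maximum log-likelihood across class models.
test = rng.normal(loc=1.0, size=(20, 8))   # should match "word_b"
best = max(models, key=lambda w: models[w].score(test))
print("recognized:", best)
```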

Network Information
Related Topics (5)
Vocabulary: 44.6K papers, 941.5K citations (78% related)
Feature vector: 48.8K papers, 954.4K citations (76% related)
Feature extraction: 111.8K papers, 2.1M citations (75% related)
Feature (computer vision): 128.2K papers, 1.7M citations (74% related)
Unsupervised learning: 22.7K papers, 1M citations (73% related)
Performance Metrics
No. of papers in the topic in previous years:

Year    Papers
2023    7
2022    12
2021    13
2020    39
2019    19
2018    22