Book ChapterDOI

Audio-to-Visual Conversion Using Hidden Markov Models

Soonkyu Lee, +1 more
- pp 563-570
TLDR
Two approaches to using HMMs (hidden Markov models) to convert audio signals to a sequence of visemes are compared, and it is found that the error rates can be reduced to 20.5% and 13.9%, respectively.
Abstract
We describe audio-to-visual conversion techniques for efficient multimedia communications. The audio signals are automatically converted to visual images of mouth shape. The visual speech can be represented as a sequence of visemes, which are the generic face images corresponding to particular sounds. Visual images synchronized with audio signals can provide a user-friendly interface for man-machine interaction, and they can also help people with impaired hearing. We use HMMs (hidden Markov models) to convert audio signals to a sequence of visemes. In this paper, we compare two approaches to using HMMs. In the first approach, an HMM is trained for each viseme, and the audio signals are directly recognized as a sequence of visemes. In the second approach, each phoneme is modeled with an HMM, and a general phoneme recognizer is used to produce a phoneme sequence from the audio signals; the phoneme sequence is then converted to a viseme sequence. We implemented the two approaches and tested them on the TIMIT speech corpus. The viseme recognizer shows a 33.9% error rate, and the phoneme-based approach exhibits a 29.7% viseme recognition error rate. When similar viseme classes are merged, we have found that the error rates can be reduced to 20.5% and 13.9%, respectively.
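As a rough sketch of the second approach's final step, the snippet below maps a recognized phoneme sequence to a viseme sequence through a lookup table and collapses consecutive phonemes that share a mouth shape. The grouping shown is an illustrative, hypothetical table loosely following common articulatory classes, not the mapping used in the paper, and phonemes_to_visemes is a made-up helper name.

```python
# Sketch of the phoneme-based approach: a recognized phoneme sequence is
# mapped to a viseme sequence through a lookup table.
# NOTE: this grouping is a hypothetical example, not the paper's table.
PHONEME_TO_VISEME = {
    # bilabials share one mouth shape
    "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
    # labiodentals
    "f": "V_labiodental", "v": "V_labiodental",
    # alveolars
    "t": "V_alveolar", "d": "V_alveolar", "s": "V_alveolar", "z": "V_alveolar",
    # a few vowels
    "aa": "V_open", "ae": "V_open", "iy": "V_spread", "uw": "V_round",
}

def phonemes_to_visemes(phonemes):
    """Convert a phoneme sequence to a viseme sequence, merging consecutive
    phonemes that map to the same mouth shape."""
    visemes = []
    for ph in phonemes:
        v = PHONEME_TO_VISEME.get(ph, "V_neutral")  # fallback for unmapped phones
        if not visemes or visemes[-1] != v:
            visemes.append(v)
    return visemes

print(phonemes_to_visemes(["p", "aa", "m", "p"]))
# -> ['V_bilabial', 'V_open', 'V_bilabial']
```

The paper's further step of merging similar viseme classes, which reduces the reported error rates to 20.5% and 13.9%, would amount to coarsening such a lookup table so that confusable visemes share a label.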


Citations
Journal ArticleDOI

Lipreading With Local Spatiotemporal Descriptors

TL;DR: Local spatiotemporal descriptors are presented to represent and recognize spoken isolated phrases based solely on visual input; the descriptors offer local processing and robustness to monotonic gray-scale changes.
Journal ArticleDOI

Audiovisual Fusion: Challenges and New Approaches

TL;DR: This review addresses issues in AV fusion in the context of AV speech processing, and especially speech recognition, where one issue is that the modalities interact but also sometimes appear to desynchronize from each other.
Proceedings ArticleDOI

Decision level combination of multiple modalities for recognition and analysis of emotional expression

TL;DR: This work models face, voice, and head movement cues for emotion recognition, fuses the classifiers using a Bayesian framework, and suggests a positive correlation between the number of classifiers that performed well and the perceptual salience of the expressed emotion.
Journal ArticleDOI

A coupled HMM approach to video-realistic speech animation

TL;DR: The proposed coupled hidden Markov model (CHMM) approach indicates that explicitly modelling audio-visual speech is promising for video-realistic speech animation.
Proceedings Article

Phoneme-to-viseme mapping for visual speech recognition

TL;DR: These initial experiments demonstrate that the choice of visual unit requires more careful attention in audio-visual speech recognition system development, and the best visual-only recognition on the VidTIMIT database is achieved using a linguistically motivated viseme set.
References
More filters
Journal ArticleDOI

A tutorial on hidden Markov models and selected applications in speech recognition

TL;DR: In this paper, the authors provide an overview of the basic theory of hidden Markov models (HMMs) as originated by L.E. Baum and T. Petrie (1966) and give practical details on methods of implementation of the theory along with a description of selected applications of HMMs to distinct problems in speech recognition.
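The viseme- and phoneme-level recognizers in the paper above rely on the decoding machinery this tutorial covers. As a minimal, generic sketch (not the paper's implementation), a log-domain Viterbi decoder for a discrete-observation HMM can be written as follows; the array names are assumptions for illustration.

```python
import numpy as np

def viterbi(log_A, log_B, log_pi, obs):
    """Return the most likely HMM state path for an observation sequence.

    log_A:  (N, N) log transition probabilities
    log_B:  (N, M) log emission probabilities over M discrete symbols
    log_pi: (N,)   log initial state probabilities
    obs:    list of observation symbol indices
    """
    N, T = log_A.shape[0], len(obs)
    delta = np.empty((T, N))            # best log-probability ending in each state
    psi = np.zeros((T, N), dtype=int)   # backpointers to the best predecessor
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # scores[i, j]: from state i to j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    path = [int(delta[-1].argmax())]             # backtrace from the best final state
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]
```

In a viseme- or phoneme-based recognizer, each model's states would be decoded this way, typically with continuous emission densities rather than a discrete symbol table.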
Journal ArticleDOI

Error bounds for convolutional codes and an asymptotically optimum decoding algorithm

TL;DR: The upper bound is obtained for a specific probabilistic nonsequential decoding algorithm which is shown to be asymptotically optimum for rates above R_0 and whose performance bears certain similarities to that of sequential decoding algorithms.
Journal ArticleDOI

Phoneme recognition using time-delay neural networks

TL;DR: In this article, the authors presented a time-delay neural network (TDNN) approach to phoneme recognition characterized by two important properties: (1) using a three-layer arrangement of simple computing units, a hierarchy can be constructed that allows for the formation of arbitrary nonlinear decision surfaces, which the TDNN learns automatically using error backpropagation; and (2) the time-delay arrangement enables the network to discover acoustic-phonetic features and the temporal relationships between them independently of position in time, so they are not blurred by temporal shifts in the input.