Book ChapterDOI

Audio-to-Visual Conversion Using Hidden Markov Models

Soonkyu Lee, +1 more
- pp 563-570
TLDR
Two approaches to using HMMs (hidden Markov models) to convert audio signals to a sequence of visemes are compared, and it is found that the error rates can be reduced to 20.5% and 13.9%, respectively.
Abstract
We describe audio-to-visual conversion techniques for efficient multimedia communications. The audio signals are automatically converted to visual images of mouth shape. The visual speech can be represented as a sequence of visemes, which are the generic face images corresponding to particular sounds. Visual images synchronized with audio signals can provide a user-friendly interface for man-machine interaction, and they can also help people with impaired hearing. We use HMMs (hidden Markov models) to convert audio signals to a sequence of visemes. In this paper, we compare two approaches to using HMMs. In the first approach, an HMM is trained for each viseme, and the audio signals are directly recognized as a sequence of visemes. In the second approach, each phoneme is modeled with an HMM, and a general phoneme recognizer is used to produce a phoneme sequence from the audio signals; the phoneme sequence is then converted to a viseme sequence. We implemented the two approaches and tested them on the TIMIT speech corpus. The viseme recognizer shows a 33.9% error rate, and the phoneme-based approach exhibits a 29.7% viseme recognition error rate. When similar viseme classes are merged, we have found that the error rates can be reduced to 20.5% and 13.9%, respectively.
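As a rough sketch of the second approach's final step, the snippet below maps a recognized phoneme sequence to a viseme sequence through a lookup table and collapses consecutive phonemes that share a mouth shape. The grouping shown is an illustrative, hypothetical table loosely following common articulatory classes, not the mapping used in the paper, and phonemes_to_visemes is a made-up helper name.

```python
# Sketch of the phoneme-based approach: a recognized phoneme sequence is
# mapped to a viseme sequence through a lookup table.
# NOTE: this grouping is a hypothetical example, not the paper's table.
PHONEME_TO_VISEME = {
    # bilabials share one mouth shape
    "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
    # labiodentals
    "f": "V_labiodental", "v": "V_labiodental",
    # alveolars
    "t": "V_alveolar", "d": "V_alveolar", "s": "V_alveolar", "z": "V_alveolar",
    # a few vowels
    "aa": "V_open", "ae": "V_open", "iy": "V_spread", "uw": "V_round",
}

def phonemes_to_visemes(phonemes):
    """Convert a phoneme sequence to a viseme sequence, merging consecutive
    phonemes that map to the same mouth shape."""
    visemes = []
    for ph in phonemes:
        v = PHONEME_TO_VISEME.get(ph, "V_neutral")  # fallback for unmapped phones
        if not visemes or visemes[-1] != v:
            visemes.append(v)
    return visemes

print(phonemes_to_visemes(["p", "aa", "m", "p"]))
# -> ['V_bilabial', 'V_open', 'V_bilabial']
```

The paper's further step of merging similar viseme classes, which reduces the reported error rates to 20.5% and 13.9%, would amount to coarsening such a lookup table so that confusable visemes share a label.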


Citations
Journal ArticleDOI

Lipreading With Local Spatiotemporal Descriptors

TL;DR: Local spatiotemporal descriptors are presented to represent and recognize spoken isolated phrases based solely on visual input; the descriptors offer local processing and robustness to monotonic gray-scale changes.
Journal ArticleDOI

Audiovisual Fusion: Challenges and New Approaches

TL;DR: This review addresses issues in AV fusion in the context of AV speech processing, and especially speech recognition, where one issue is that the modalities interact but also sometimes appear to desynchronize from each other.
Proceedings ArticleDOI

Decision level combination of multiple modalities for recognition and analysis of emotional expression

TL;DR: This work models face, voice, and head movement cues for emotion recognition, fuses the classifiers using a Bayesian framework, and suggests a positive correlation between the number of classifiers that performed well and the perceptual salience of the expressed emotion.
Journal ArticleDOI

A coupled HMM approach to video-realistic speech animation

TL;DR: The proposed coupled hidden Markov model (CHMM) approach indicates that explicitly modelling audio-visual speech is promising for video-realistic speech animation.
Proceedings Article

Phoneme-to-viseme mapping for visual speech recognition

TL;DR: These initial experiments demonstrate that the choice of visual unit requires more careful attention in audio-visual speech recognition system development, and the best visual-only recognition on the VidTIMIT database is achieved using a linguistically motivated viseme set.
References
More filters
Journal ArticleDOI

A tutorial on hidden Markov models and selected applications in speech recognition

TL;DR: In this paper, the authors provide an overview of the basic theory of hidden Markov models (HMMs) as originated by L.E. Baum and T. Petrie (1966) and give practical details on methods of implementation of the theory along with a description of selected applications of HMMs to distinct problems in speech recognition.
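The viseme- and phoneme-level recognizers in the paper above rely on the decoding machinery this tutorial covers. As a minimal, generic sketch (not the paper's implementation), a log-domain Viterbi decoder for a discrete-observation HMM can be written as follows; the array names are assumptions for illustration.

```python
import numpy as np

def viterbi(log_A, log_B, log_pi, obs):
    """Return the most likely HMM state path for an observation sequence.

    log_A:  (N, N) log transition probabilities
    log_B:  (N, M) log emission probabilities over M discrete symbols
    log_pi: (N,)   log initial state probabilities
    obs:    list of observation symbol indices
    """
    N, T = log_A.shape[0], len(obs)
    delta = np.empty((T, N))            # best log-probability ending in each state
    psi = np.zeros((T, N), dtype=int)   # backpointers to the best predecessor
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # scores[i, j]: from state i to j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    path = [int(delta[-1].argmax())]             # backtrace from the best final state
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]
```

In a viseme- or phoneme-based recognizer, each model's states would be decoded this way, typically with continuous emission densities rather than a discrete symbol table.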
Journal ArticleDOI

Error bounds for convolutional codes and an asymptotically optimum decoding algorithm

TL;DR: The upper bound is obtained for a specific probabilistic nonsequential decoding algorithm which is shown to be asymptotically optimum for rates above R_0 and whose performance bears certain similarities to that of sequential decoding algorithms.
Journal ArticleDOI

Phoneme recognition using time-delay neural networks

TL;DR: In this article, the authors presented a time-delay neural network (TDNN) approach to phoneme recognition characterized by two important properties: (1) using a three-layer arrangement of simple computing units, a hierarchy can be constructed that allows for the formation of arbitrary nonlinear decision surfaces, which the TDNN learns automatically using error backpropagation; and (2) the time-delay arrangement enables the network to discover acoustic-phonetic features and the temporal relationships between them independently of position in time, so they are not blurred by temporal shifts in the input.