Visual speech recognition with loosely synchronized feature streams

doi:10.1109/ICCV.2005.251

Proceedings Article

Kate Saenko, +5 more

- Vol. 2, pp 1424-1431

TLDR

The paper presents a novel dynamic Bayesian network with a multi-stream structure whose observations are articulatory feature classifier scores; the model can represent varying degrees of co-articulation in a principled way.

Abstract:

We present an approach to detecting and recognizing spoken isolated phrases based solely on visual input. We adopt an architecture that first employs discriminative detection of visual speech and articulatory features, and then performs recognition using a model that accounts for the loose synchronization of the feature streams. Discriminative classifiers detect the subclass of lip appearance corresponding to the presence of speech, and further decompose it into features corresponding to the physical components of articulatory production. These components often evolve in a semi-independent fashion, and conventional viseme-based approaches to recognition fail to capture the resulting co-articulation effects. We present a novel dynamic Bayesian network with a multi-stream structure and observations consisting of articulatory feature classifier scores, which can model varying degrees of co-articulation in a principled way. We evaluate our visual-only recognition system on a command utterance task. We show comparative results on lip detection and speech/non-speech classification, as well as recognition performance against several baseline systems.
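The idea of loosely synchronized streams can be illustrated with a small sketch. This is not the authors' dynamic Bayesian network; it is a simplified stand-in that assumes two left-to-right streams sharing the same number of states, per-frame log scores for each stream (standing in for the classifier outputs), and a hypothetical `max_async` parameter that limits how many states the streams may drift apart. The function name, the transition log-probabilities, and the toy scores are all assumptions made for illustration.

```python
import math

NEG_INF = float("-inf")

def best_path_score(scores_a, scores_b, max_async,
                    log_stay=math.log(0.5), log_advance=math.log(0.5)):
    """Viterbi log score for two left-to-right streams.

    scores_a[t][i] is the log score of stream A being in state i at frame t
    (likewise scores_b). Joint states (i, j) with |i - j| > max_async are
    forbidden, so max_async=0 forces the streams to move in lockstep.
    """
    n_frames = len(scores_a)
    n_states = len(scores_a[0])
    # delta[i][j]: best log score of a path ending in joint state (i, j)
    delta = [[NEG_INF] * n_states for _ in range(n_states)]
    delta[0][0] = scores_a[0][0] + scores_b[0][0]
    for t in range(1, n_frames):
        new = [[NEG_INF] * n_states for _ in range(n_states)]
        for i in range(n_states):
            for j in range(n_states):
                if abs(i - j) > max_async:
                    continue  # asynchrony limit exceeded
                best = NEG_INF
                # Each stream independently stays (0) or advances by one (1).
                for di, log_pa in ((0, log_stay), (1, log_advance)):
                    for dj, log_pb in ((0, log_stay), (1, log_advance)):
                        pi, pj = i - di, j - dj
                        if pi < 0 or pj < 0:
                            continue
                        best = max(best, delta[pi][pj] + log_pa + log_pb)
                if best > NEG_INF:
                    new[i][j] = best + scores_a[t][i] + scores_b[t][j]
        delta = new
    # Both streams must finish in their last state.
    return delta[n_states - 1][n_states - 1]

# Toy input: stream A changes state one frame before stream B.
scores_a = [[0.0, -5.0], [-5.0, 0.0], [-5.0, 0.0]]
scores_b = [[0.0, -5.0], [0.0, -5.0], [-5.0, 0.0]]
loose = best_path_score(scores_a, scores_b, max_async=1)
strict = best_path_score(scores_a, scores_b, max_async=0)
```

In this toy input the lockstep model (`max_async=0`) has to mismatch one stream for a frame, while the loosely synchronized model (`max_async=1`) can follow both streams exactly, so `loose` scores higher than `strict`. A recognizer built this way would compute one such score per candidate phrase and pick the best.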

Citations

"Hello! My name is... Buffy" - Automatic Naming of Characters in TV Video

Hidden Conditional Random Fields for Gesture Recognition

Lipreading With Local Spatiotemporal Descriptors

Taking the bite out of automated naming of characters in TV video

Speech production knowledge in automatic speech recognition.
