
Showing papers by "Patrick Lucey published in 2006"


01 Jan 2006
TL;DR: The experiments show that AVASR is possible from profile views with moderate performance degradation compared to frontal video data; to the authors' knowledge, this constitutes the first real attempt to attack this problem.
Abstract: Visual information from a speaker's mouth region is known to improve automatic speech recognition robustness. However, the vast majority of audio-visual automatic speech recognition (AVASR) studies assume frontal images of the speaker's face. In contrast, this paper investigates extracting visual speech information from the speaker's profile view, and, to our knowledge, constitutes the first real attempt to attack this problem. As with any AVASR system, the overall recognition performance depends heavily on the visual front end. This is especially the case with profile-view data, as the facial features are heavily compacted compared to the frontal scenario. In this paper, we particularly describe our visual front end approach, and report experiments on a multi-subject, small-vocabulary, bimodal, multisensory database that contains synchronously captured audio with frontal and profile face video. Our experiments show that AVASR is possible from profile views with moderate performance degradation compared to frontal video data.

61 citations
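The abstract above centres on the visual front end: locating the mouth region in profile-view video and turning it into features that can be fused with the audio stream. The paper's exact front end is not described here, so the following is only a minimal sketch of a generic appearance-based mouth feature extractor (grayscale mouth ROI, fixed-size resampling, 2-D DCT, low-order coefficients); the ROI coordinates, target size, and coefficient count are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of an appearance-based visual front end for AVASR.
# Assumes the mouth region of interest (ROI) has already been located
# (e.g. by a face/profile tracker); coordinates and sizes are illustrative.
import numpy as np
from scipy.fft import dctn


def mouth_roi_features(frame_gray, roi_box, roi_size=(32, 32), n_coeffs=30):
    """Crop the mouth ROI, normalise its size, and keep low-order 2-D DCT
    coefficients as a per-frame visual feature vector."""
    x, y, w, h = roi_box
    roi = frame_gray[y:y + h, x:x + w].astype(np.float64)

    # Simple nearest-neighbour resampling to a fixed grid (keeps the sketch
    # dependency-free; a real system would use proper interpolation).
    rows = np.linspace(0, roi.shape[0] - 1, roi_size[0]).astype(int)
    cols = np.linspace(0, roi.shape[1] - 1, roi_size[1]).astype(int)
    roi = roi[np.ix_(rows, cols)]

    # 2-D DCT; low-frequency coefficients carry most of the appearance info.
    coeffs = dctn(roi, norm="ortho")

    # Zig-zag-style selection approximated by sorting on (row + col) index.
    order = np.argsort(np.add.outer(np.arange(roi_size[0]),
                                    np.arange(roi_size[1])), axis=None)
    return coeffs.flatten()[order][:n_coeffs]


# Example: one 240x320 grayscale frame with a hypothetical mouth box.
frame = np.random.rand(240, 320)
features = mouth_roi_features(frame, roi_box=(140, 150, 60, 40))
print(features.shape)  # (30,) per-frame visual feature vector
```

In a full system, such per-frame vectors would typically be normalised, augmented with temporal derivatives, and modelled together with the audio features; the profile-view case mainly changes how the ROI is located and shaped, since the facial features are more compacted than in the frontal view.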


Proceedings ArticleDOI
01 Oct 2006
TL;DR: In this article, the authors investigated extracting visual speech information from the speaker's profile view, and reported that AVASR is possible from profile views with moderate performance degradation compared to frontal video data.
Abstract: Visual information from a speaker's mouth region is known to improve automatic speech recognition robustness. However, the vast majority of audio-visual automatic speech recognition (AVASR) studies assume frontal images of the speaker's face. In contrast, this paper investigates extracting visual speech information from the speaker's profile view, and, to our knowledge, constitutes the first real attempt to attack this problem. As with any AVASR system, the overall recognition performance depends heavily on the visual front end. This is especially the case with profile-view data, as the facial features are heavily compacted compared to the frontal scenario. In this paper, we particularly describe our visual front end approach, and report experiments on a multi-subject, small-vocabulary, bimodal, multi-sensory database that contains synchronously captured audio with frontal and profile face video. Our experiments show that AVASR is possible from profile views with moderate performance degradation compared to frontal video data.

52 citations


01 Jan 2006
TL;DR: By representing the mouth region as an ensemble of image patches, more speech information can be extracted from the visual domain, improving the relative word error rate on the CUAVE audio-visual corpus.
Abstract: Visual information from a speaker's mouth region is known to improve automatic speech recognition robustness, especially in the presence of acoustic noise. To date, the vast majority of work in this field has viewed these visual features in a holistic manner, which may not take into account the various changes that occur within articulation (the process of changing the shape of the vocal tract using the articulators, i.e. the lips and jaw). Motivated by the work being conducted in the fields of audio-visual automatic speech recognition (AVASR) and face recognition using articulatory features (AFs) and patches respectively, we present a proof-of-concept paper which represents the mouth region as an ensemble of image patches. Our experiments show that by dealing with the mouth region in this manner, we are able to extract more speech information from the visual domain. For the task of visual-only speaker-independent isolated digit recognition, we were able to improve the relative word error rate by more than 23% on the CUAVE audio-visual corpus.

23 citations
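The abstract above contrasts a holistic mouth representation with an ensemble of image patches. As an illustration only (the patch grid, per-patch features, and combination scheme are not specified in the abstract), the sketch below splits a fixed-size mouth ROI into a regular grid of patches and builds one feature vector per patch; it also spells out the relative word-error-rate reduction the result refers to, with made-up numbers.

```python
# Sketch: representing the mouth ROI as an ensemble of image patches
# rather than one holistic block. Grid size and per-patch features are
# illustrative assumptions, not the paper's actual configuration.
import numpy as np


def patch_ensemble(mouth_roi, grid=(3, 3)):
    """Split an (H, W) mouth ROI into grid[0] x grid[1] non-overlapping
    patches and return a simple per-patch feature (mean and variance)."""
    h, w = mouth_roi.shape
    ph, pw = h // grid[0], w // grid[1]
    feats = []
    for r in range(grid[0]):
        for c in range(grid[1]):
            patch = mouth_roi[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
            feats.append([patch.mean(), patch.var()])
    return np.asarray(feats)               # shape: (num_patches, feat_dim)


# Each patch can then feed its own stream/classifier, with the patch
# decisions combined afterwards, instead of one holistic feature vector.
roi = np.random.rand(48, 72)
print(patch_ensemble(roi).shape)            # (9, 2)

# The reported gain is a *relative* word error rate reduction:
# rel_improvement = (wer_holistic - wer_patches) / wer_holistic
wer_holistic, wer_patches = 0.40, 0.30      # hypothetical numbers
print((wer_holistic - wer_patches) / wer_holistic)  # 0.25, i.e. a 25% relative gain
```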


Proceedings ArticleDOI
26 Dec 2006
TL;DR: In this article, an audio-visual automatic speech recognition (AVASR) system that operates on profile face views and its comparison with a traditional frontal-view AVASR system is presented.
Abstract: Visual information from a speaker's mouth region is known to improve automatic speech recognition robustness. However, the vast majority of audio-visual automatic speech recognition (AVASR) studies assume frontal images of the speaker's face, which is not always the case in realistic human computer interaction (HCI) scenarios. One such case of interest is HCI inside smart rooms, equipped with pan-tilt-zoom (PTZ) cameras that closely track the subject's head. However, since these cameras are fixed in space, they cannot necessarily obtain frontal views of the speaker. Clearly, AVASR from non-frontal views is required, as well as fusion of multiple camera views, if available. In this paper, we report our very preliminary work on this subject. In particular, we concentrate on two topics: first, the design of an AVASR system that operates on profile face views and its comparison with a traditional frontal-view AVASR system, and second, the fusion of the two systems into a multi-view frontal/profile system. We in particular describe our visual front end approach for the profile view system, and report experiments on a multi-subject, small-vocabulary, bimodal, multi-sensory database that contains synchronously captured audio with frontal and profile face video, recorded inside the IBM smart room as part of the CHIL project. Our experiments demonstrate that AVASR is possible from profile views; however, the visual modality benefit is decreased compared to frontal video data.

11 citations
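The abstract above adds a fusion step, combining the frontal and profile recognisers into one multi-view system. The combination method is not detailed in the abstract, so the snippet below is only a generic decision-fusion sketch: per-word scores from a frontal stream and a profile stream are combined with stream weights before picking the best hypothesis. The vocabulary, weights, and score values are made up for illustration.

```python
# Generic decision-fusion sketch for a multi-view (frontal + profile)
# AVASR system: weighted combination of per-hypothesis stream scores.
# Stream weights and scores below are illustrative, not from the paper.
import numpy as np

WORDS = ["zero", "one", "two"]                 # toy vocabulary

# Log-likelihood of each word hypothesis under each single-view system.
frontal_scores = np.array([-110.0, -95.0, -120.0])
profile_scores = np.array([-105.0, -102.0, -118.0])

# Stream weights reflect the relative reliability of each view
# (frontal video usually carries more visual speech information).
w_frontal, w_profile = 0.7, 0.3

fused = w_frontal * frontal_scores + w_profile * profile_scores
print(WORDS[int(np.argmax(fused))])            # fused best hypothesis: "one"
```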


01 Nov 2006
TL;DR: In this paper, the mouth region is represented as an ensemble of image patches, which can be used to extract more speech information from the visual domain for speaker-independent isolated digit recognition.
Abstract: Visual information from a speaker's mouth region is known to improve automatic speech recognition robustness, especially in the presence of acoustic noise. To date, the vast majority of work in this field has viewed these visual features in a holistic manner, which may not take into account the various changes that occur within articulation (the process of changing the shape of the vocal tract using the articulators, i.e. the lips and jaw). Motivated by the work being conducted in the fields of audio-visual automatic speech recognition (AVASR) and face recognition using articulatory features (AFs) and patches respectively, we present a proof-of-concept paper which represents the mouth region as an ensemble of image patches. Our experiments show that by dealing with the mouth region in this manner, we are able to extract more speech information from the visual domain. For the task of visual-only speaker-independent isolated digit recognition, we were able to improve the relative word error rate by more than 23% on the CUAVE audio-visual corpus.

01 Jan 2006
TL;DR: This paper investigates extracting visual speech information from the speaker's profile view and, to the authors' knowledge, constitutes the first real attempt to attack this problem.
Abstract: Visual information from a speaker's mouth region is known to improve automatic speech recognition robustness. However, the vast majority of audio-visual automatic speech recognition (AVASR) studies assume frontal images of the speaker's face. In contrast, this paper investigates extracting visual speech information from the speaker's profile view, and, to our knowledge, constitutes the first real attempt to attack this problem. As with any AVASR system, the overall recognition performance depends heavily on the visual front end. This is especially the case with profile-view data, as the facial features are heavily compacted compared to the frontal scenario. In this paper, we particularly describe our visual front end approach, and report experiments on a multi-subject, small-vocabulary, bimodal, multi-sensory database that contains synchronously captured audio with frontal and profile face video. Our experiments show that AVASR is possible from profile views with moderate performance degradation compared to frontal video data.