Showing papers by "Patrick Lucey" published in 2007


Proceedings ArticleDOI
27 Aug 2007
TL;DR: This paper proposes the use of linear regression to find a transformation matrix based on synchronous frontal and profile visual speech data, which is used to normalize the visual speech in each viewpoint into a single uniform view.
Abstract: The vast majority of studies in the field of audio-visual automatic speech recognition (AVASR) assumes frontal images of a speaker's face, but this cannot always be guaranteed in practice. Hence our recent research efforts have concentrated on extracting visual speech information from non-frontal faces, in particular the profile view. The introduction of additional views to an AVASR system increases the complexity of the system, as it has to deal with the different visual features associated with the various views. In this paper, we propose the use of linear regression to find a transformation matrix based on synchronous frontal and profile visual speech data, which is used to normalize the visual speech in each viewpoint into a single uniform view. In our experiments for the task of multi-speaker lipreading, we show that this "pose-invariant" technique reduces train/test mismatch between visual speech features of different views, and is of particular benefit when there is more training data for one viewpoint over another (e.g. frontal over profile).

26 citations
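
The pose-normalisation idea described in the abstract above amounts to solving a least-squares problem on time-synchronised feature pairs. Below is a minimal sketch of that idea in Python/NumPy; the function names, the added bias term, and the use of plain least squares are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch (not the authors' code): learn a linear map that projects
# profile-view visual speech features into the frontal-view feature space using
# synchronised frontal/profile training pairs, as the abstract describes.
import numpy as np

def fit_pose_normaliser(profile_feats: np.ndarray, frontal_feats: np.ndarray) -> np.ndarray:
    """Least-squares estimate of W such that frontal ~= [profile, 1] @ W.

    profile_feats: (N, d_p) synchronous profile-view feature vectors
    frontal_feats: (N, d_f) synchronous frontal-view feature vectors
    Returns W with shape (d_p + 1, d_f); the extra row is a bias term (an assumption).
    """
    X = np.hstack([profile_feats, np.ones((profile_feats.shape[0], 1))])
    W, *_ = np.linalg.lstsq(X, frontal_feats, rcond=None)
    return W

def normalise_to_frontal(profile_feats: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Project profile-view features into the single 'uniform' frontal view."""
    X = np.hstack([profile_feats, np.ones((profile_feats.shape[0], 1))])
    return X @ W
```

Once profile-view features are projected into the frontal space in this way, a single set of frontal-view statistical models can be trained and tested on data pooled from both views, which is what reduces the train/test mismatch the abstract refers to.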


01 Jan 2007
TL;DR: In this paper, the authors proposed two approaches for lipreading from non-frontal views (poses): direct statistical modeling of the features and normalization of such features by their projection onto the "space" of frontal-view visual features.
Abstract: In recent work, we have concentrated on the problem of lipreading from non-frontal views (poses). In particular, we have focused on the use of profile views, and proposed two approaches for lipreading on the basis of visual features extracted from such views: (a) direct statistical modeling of the features, namely the use of view-dependent statistical models; and (b) normalization of such features by their projection onto the "space" of frontal-view visual features, which allows employing one set of statistical models for all available views. The latter approach has previously been considered for only two poses (frontal and profile views), and for visual features of a specific dimensionality. In this paper, we further extend this work by investigating its applicability to the case where data from three views are available (frontal, left- and right-profile). In addition, we examine the effect of visual feature dimensionality on the pose-normalization approach. Our experiments demonstrate that the results generalize well to three views, but also that feature dimensionality is crucial to the effectiveness of the approach. In particular, feature dimensionality larger than 30 is detrimental to multi-pose visual speech recognition performance.

20 citations
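
As a rough illustration of how the pose-normalisation approach above extends to three views, the sketch below (an assumption-laden simplification, not the authors' code) learns one projection per non-frontal view and truncates the features to a chosen dimensionality first, reflecting the abstract's finding that dimensionalities above about 30 hurt multi-pose performance. The dictionary interface, view names, and truncation scheme are illustrative.

```python
# Illustrative sketch (assumed details, not the authors' code): extend the
# pose-normalisation idea to three views by learning one projection per
# non-frontal view, after truncating features to a chosen dimensionality D.
import numpy as np

def truncate(feats: np.ndarray, dim: int) -> np.ndarray:
    """Keep only the first `dim` coefficients (assumes an energy-ordered
    representation such as a DCT- or PCA-based visual feature vector)."""
    return feats[:, :dim]

def fit_multiview_normalisers(synced: dict, dim: int = 30) -> dict:
    """synced maps view name -> (N, d) features, time-aligned with synced['frontal'].
    Returns one least-squares projection matrix per non-frontal view."""
    frontal = truncate(synced["frontal"], dim)
    projections = {}
    for view, feats in synced.items():
        if view == "frontal":
            continue
        X = np.hstack([truncate(feats, dim), np.ones((feats.shape[0], 1))])
        W, *_ = np.linalg.lstsq(X, frontal, rcond=None)
        projections[view] = W
    return projections
```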


Proceedings ArticleDOI
27 Aug 2007
TL;DR: In this paper, a fused hidden Markov model (FHMM) adaptation method is proposed to adapt multi-stream models from single-stream audio HMMs and, in the process, better model the video speech in the final model compared to jointly-trained MSHMMs.
Abstract: A technique known as fused hidden Markov models (FHMMs) was recently proposed as an alternative multi-stream modelling technique for audio-visual speaker recognition. In this paper we show that for audio-visual speech recognition (AVSR), FHMMs can be adopted as a novel method of training synchronous MSHMMs. MSHMMs, as proposed by several authors for use in AVSR, are jointly trained on both the audio and visual modalities. In contrast, our proposed FHMM adaptation method can be used to adapt the multi-stream models from single-stream audio HMMs and, in the process, better model the video speech in the final model when compared to jointly-trained MSHMMs. Through experiments conducted on the XM2VTS database, we show that the improved video performance of the FHMM-adapted MSHMMs results in an improvement in AVSR performance over jointly-trained MSHMMs at all levels of audio noise, and provides a significant advantage in high-noise environments.

15 citations
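
The sketch below is only a loose conceptual illustration of the adaptation idea described above, not the paper's FHMM algorithm: it assumes a frame-level state alignment from the audio-only HMM is available (e.g. via forced alignment), fits a diagonal-Gaussian video emission per audio state, and then scores the two streams with a weighted synchronous combination. All function names and modelling choices here are assumptions.

```python
# Conceptual sketch (assumed simplification, not the paper's implementation):
# build a synchronous two-stream model by adapting from a single-stream audio
# HMM, reusing its per-frame state alignment to estimate video emission densities.
import numpy as np

def fit_video_emissions(video_feats, audio_state_alignment, n_states):
    """Estimate a diagonal Gaussian over video features for each audio HMM state.

    video_feats: (T, d_v) video features, frame-synchronous with the audio stream
    audio_state_alignment: length-T array of state indices from the audio HMM
    """
    emissions = []
    for s in range(n_states):
        frames = video_feats[audio_state_alignment == s]
        mean = frames.mean(axis=0)
        var = frames.var(axis=0) + 1e-6          # variance floor for stability
        emissions.append((mean, var))
    return emissions

def diag_gauss_loglik(x, mean, var):
    """Log-density of a diagonal Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def multistream_state_loglik(audio_ll, video_frame, video_emission, w_audio=0.7):
    """Weighted synchronous-stream score for one state: combine the audio HMM's
    state log-likelihood with the adapted video emission log-likelihood."""
    mean, var = video_emission
    return w_audio * audio_ll + (1.0 - w_audio) * diag_gauss_loglik(video_frame, mean, var)
```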


01 Jan 2007
TL;DR: The authors examined the effect of varying the stream weights in synchronous multi-stream hidden Markov models (HMMs) for audio-visual speech recognition and found that the final performance is primarily a function of the choice of stream weight used in testing.
Abstract: In this paper, we examine the effect of varying the stream weights in synchronous multi-stream hidden Markov models (HMMs) for audio-visual speech recognition. Rather than considering the stream weights to be the same for training and testing, we examine the effect of different stream weights for each task on the final speech-recognition performance. Evaluating our system under varying levels of audio and video degradation on the XM2VTS database, we show that the final performance is primarily a function of the choice of stream weight used in testing, and that the choice of stream weight used for training has only a very minor effect. By varying the value of the testing stream weights, we show that the best average speech recognition performance occurs with the streams weighted at around 80% audio and 20% video. However, by examining the distribution of frame-by-frame scores for each stream on a left-out section of the database, we show that the chosen testing weights primarily serve to normalise the two stream score distributions, rather than indicating the dependence of the final performance on either stream. By using a novel adaptation of zero-normalisation to normalise each stream's models before performing the weighted fusion, we show that the actual contribution of the audio and video scores to the best-performing speech system is closer to equal than is indicated by the un-normalised stream weighting parameters alone.

3 citations
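
A weighted-fusion and zero-normalisation step of the kind discussed above can be sketched as follows. The held-out statistics, the 0.8/0.2 default weighting, and the per-utterance score interface are illustrative assumptions rather than the paper's exact procedure.

```python
# Illustrative sketch (assumptions noted, not the paper's code): weighted fusion
# of audio and video stream scores, with an optional zero-normalisation step that
# standardises each stream's score distribution before the weights are applied.
import numpy as np

def zero_normalise(scores: np.ndarray, heldout_scores: np.ndarray) -> np.ndarray:
    """Standardise scores using the mean/std of a held-out score distribution."""
    mu, sigma = heldout_scores.mean(), heldout_scores.std()
    return (scores - mu) / (sigma + 1e-12)

def fuse(audio_scores, video_scores, w_audio=0.8,
         heldout_audio=None, heldout_video=None):
    """Synchronous-stream style fusion: w_audio * audio + (1 - w_audio) * video.

    Without normalisation, a weight of ~0.8 audio / 0.2 video mainly compensates
    for the streams' different score ranges; after zero-normalisation the
    effective contributions are much closer to equal, as the abstract argues.
    """
    a, v = np.asarray(audio_scores, float), np.asarray(video_scores, float)
    if heldout_audio is not None and heldout_video is not None:
        a = zero_normalise(a, np.asarray(heldout_audio, float))
        v = zero_normalise(v, np.asarray(heldout_video, float))
    return w_audio * a + (1.0 - w_audio) * v
```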


01 Jan 2007
TL;DR: In this article, the mouth region is represented as an ensemble of image patches, which can be used to extract more speech information from the visual domain for speaker-independent isolated digit recognition.
Abstract: Visual information from a speaker's mouth region is known to improve automatic speech recognition robustness, especially in the presence of acoustic noise. To date, the vast majority of work in this field has viewed these visual features in a holistic manner, which may not take into account the various changes that occur during articulation (the process of changing the shape of the vocal tract using the articulators, i.e. the lips and jaw). Motivated by work in the fields of audio-visual automatic speech recognition (AVASR) and face recognition using articulatory features (AFs) and patches respectively, we present a proof-of-concept paper which represents the mouth region as an ensemble of image patches. Our experiments show that by dealing with the mouth region in this manner, we are able to extract more speech information from the visual domain. For the task of visual-only speaker-independent isolated digit recognition, we were able to improve the relative word error rate by more than 23% on the CUAVE audio-visual corpus.

1 citation
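
To make the patch-ensemble idea above concrete, here is a small illustrative sketch (not the authors' feature extractor): the mouth region of interest is split into a grid of patches and a few low-order 2D-DCT coefficients are kept per patch. The grid size, coefficient count, and coefficient ordering are assumptions.

```python
# Proof-of-concept sketch (illustrative only): represent the mouth ROI as an
# ensemble of image patches rather than a single holistic block, keeping a few
# low-order 2D-DCT coefficients per patch as that patch's visual speech feature.
import numpy as np
from scipy.fft import dctn

def mouth_patches(roi: np.ndarray, grid=(3, 3)):
    """Split a (H, W) grey-level mouth ROI into a grid of equal patches."""
    rows, cols = grid
    h, w = roi.shape[0] // rows, roi.shape[1] // cols
    return [roi[r * h:(r + 1) * h, c * w:(c + 1) * w]
            for r in range(rows) for c in range(cols)]

def patch_features(patch: np.ndarray, n_coeffs: int = 10) -> np.ndarray:
    """Low-order 2D-DCT coefficients of one patch (zig-zag ordering omitted for brevity)."""
    coeffs = dctn(patch.astype(float), norm="ortho")
    return coeffs.flatten()[:n_coeffs]

def ensemble_features(roi: np.ndarray, grid=(3, 3), n_coeffs: int = 10):
    """Per-patch feature vectors; each can feed its own classifier in the ensemble,
    or they can be concatenated into a single observation vector."""
    return [patch_features(p, n_coeffs) for p in mouth_patches(roi, grid)]
```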