Speaker Association With Signal-Level Audiovisual Fusion
John W. Fisher, III, Member, IEEE, and Trevor Darrell, Member, IEEE
Abstract—Audio and visual signals arriving from a common source are detected using a signal-level fusion technique. A probabilistic multimodal generation model is introduced and used to derive an information theoretic measure of cross-modal correspondence. Nonparametric statistical density modeling techniques can characterize the mutual information between signals from different domains. By comparing the mutual information between different pairs of signals, it is possible to identify which person is speaking a given utterance and discount errant motion or audio from other utterances or nonspeech events.
Index Terms—Audiovisual correspondence, multimodal data association, mutual information.
I. INTRODUCTION
CONVERSATIONAL dialog systems have become practically useful in many application domains, including travel reservations, traffic information, and database access. However, most existing conversational speech systems require tethered interaction and work primarily for a single user. Users must wear an attached microphone or speak into a telephone handset, and do so one at a time. This limits the range of use of dialog systems, since in many applications users might expect to freely approach and interact with a device. Worse, they may wish to arrive as a group and talk among themselves while interacting with the system. To date it has been difficult for speech recognition systems to handle such conditions and correctly recognize the utterances intended for the device. We are interested in facilitating untethered and casual conversational interaction, and address the problem of how to temporally segregate the speech of multiple users interacting with a system.
With a single modality, properly associating speech from multiple unknown speakers is quite difficult. However, if other modalities are available they can often provide disambiguating information. In particular, visual information can be valuable for deciding whether an individual user is speaking a particular utterance. We wish to solve a conversational audiovisual correspondence problem: given sets of audio and visual signals, decide which audiovisual pairs are consistent and could have come from a single speaker. We approach this problem from a signal-processing perspective, and develop a statistical measure of whether two signals come from a common source. We make no assumptions about the content of the audio signal or the visual appearance, and use a general information-theoretic approach. Our method works without learning a specific lip or language model, and is therefore robust to a range of appearances and acoustic environments.

Manuscript received December 1, 2002; revised November 15, 2003. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Jun Ohya. The authors are with the Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139 USA (e-mail: trevor@ai.mit.edu). Digital Object Identifier 10.1109/TMM.2004.827503
The core of our approach is a technique for jointly modeling audio and video variation to identify cross-modal correspondences. It is driven by the simple hypothesis of whether a region of interest in an image sequence (perhaps the entire image) is associated with a separately measured audio signal. We formulate the problem within a nonparametric hypothesis testing framework, from which information theoretic quantities naturally arise as the measure of association between two high-dimensional signals. We show how this approach can detect which user is speaking when several are facing a device and distracting motion is present. This allows the segregation of users' utterances from each other's speech, and from background noise events.
II. RELATED WORK
Humans routinely perform tasks in which ambiguous auditory and visual data are combined in order to support accurate perception. In contrast, automated approaches for statistical processing of multimodal data sources lag far behind. This is primarily due to the fact that few methods adequately model the complexity of the audio/visual relationship. Classical approaches to multimodal fusion at a signal processing level often either assume a statistical relationship which is too simple (e.g., jointly Gaussian) or defer fusion to the decision level, when many of the joint (and useful) properties have been lost. While such pragmatic choices may lead to simple statistical measures, they do so at the cost of modeling capacity.
An information theoretic approach motivates fusion at the measurement level without regard to specific parametric densities. The idea of using information-theoretic principles in an adaptive framework is not new (e.g., see [1] for an overview), with many approaches suggested over the last 30 years. Critical distinctions in most information theoretic approaches lie in how densities are modeled (either explicitly or implicitly), how entropy (and by extension mutual information) is approximated or estimated, and the types of mappings which are used (e.g., linear versus nonlinear). Early approaches used a Gaussian assumption, e.g., Plumbley [2], [3] and Becker [4].
There has been substantial progress on feature-level integration of speech and vision. For example, Meier et al. [5], Stork [6], and others have built visual speech reading systems that can improve speech recognition results dramatically. Our goal is not recognition, but to be able to detect and disambiguate cases where audio and video signals are coming from different
sources. Hershey and Movellan [7] addressed this problem using the per-pixel correlation relative to the energy of an audio track as a measure of their dependence. An inherent assumption of this method was that the joint statistics were Gaussian. As this is a per-pixel measure, there is no straightforward way to integrate the measure over an image region for purposes of association without making simplifying assumptions which will not hold in practice (e.g., that pixels are independent of each other conditioned on the speech signal). We should note that the objective of their work was to locate the source of an audio signal in an image sequence; association is an implicit step. A more general approach was taken by Slaney and Covell [8], which looked specifically at optimizing temporal alignment between audio and video tracks using canonical correlations, which are equivalent to the maximum mutual information projection in the jointly Gaussian case. They did not address the problem of detecting whether two signals came from the same person, although their method could be adapted to do so.
Nock et al. [9] consider two mutual information approaches and one HMM-based approach for assessing face and speech consistency. The mutual information approaches compare a histogram-based estimate over vector-quantized codebooks to a Gaussian estimate over feature vectors. They report that the Gaussian method gave superior results when using a cepstral representation of the audio and a discrete cosine transform representation of the video. All three methods utilize a training corpus in order to estimate a prior model; thereafter, associations and/or likelihoods are computed under the trained model. A time-delay neural network approach was suggested in [10], demonstrating location detection for a single visual appearance on a small test set. Each of [8]-[10] requires training data in order to estimate model parameters. Here, and in contrast to the previous methods, we develop a methodology for testing audio-video association in the absence of a prior model and without the requirement of training data with which to construct one.
III. SIGNAL-LEVEL AUDIOVISUAL ASSOCIATION
We propose an independent cause model to capture the relationship between generated signals in each individual modality. Using principles from information theory and nonparametric statistics, we show how an approach for learning maximally informative joint subspaces can find cross-modal correspondences. We first show how the audiovisual association problem can be formulated as a hypothesis test and give its relationship to mutual information based association methods (see [11] for an extensive treatment). Following that, we present an information theoretic analysis of a graphical model of multimodal signal generation which gives some insight into the relationship between data association and learning a generative audiovisual model.
Given an audio-video sequence, let us denote the sequence of images (or a region within each image) as $\{x_V(t)\}$, where $t$ indicates (discrete) time. Similarly, denote the audio measurements as $\{x_A(t)\}$; for our purposes these will be vectors of spectral measurements. Treating the audio and video measurements as i.i.d. samples from the random variables $X_V$ and $X_A$, respectively, allows us to cast the audiovisual association problem as a simple hypothesis test:

$$H_0: p(x_V, x_A) = p(x_V)\,p(x_A) \quad \text{versus} \quad H_1: p(x_V, x_A) \neq p(x_V)\,p(x_A) \quad (1)$$

where $H_0$ states that the measurements are statistically independent (i.e., their joint density is expressed as a product of marginal densities) and $H_1$ states that the measurements are statistically dependent (or, equivalently, associated). Perceptual grouping problems, in which there are multiple sources of both video and audio, can be stated in a similar, albeit more complicated, fashion [12], [13]. Plugging the measurements into a (normalized) log-likelihood ratio statistic, using a consistent probability density estimator for $p(x_V, x_A)$, $p(x_V)$, $p(x_A)$, and taking the expectation with respect to the joint probability density of $X_V$ and $X_A$ yields

$$\frac{1}{N}\sum_{t=1}^{N}\log\frac{\hat p\big(x_V(t), x_A(t)\big)}{\hat p\big(x_V(t)\big)\,\hat p\big(x_A(t)\big)} \quad (2)$$
$$\approx E\!\left[\log\frac{p(X_V, X_A)}{p(X_V)\,p(X_A)}\right] \quad (3)$$
$$= I(X_V; X_A) \quad (4)$$

where $I(X_V; X_A)$ is the mutual information between the random variables $X_V$ and $X_A$. Mutual information can be expressed as a combination of the differential entropy terms $h(X_V)$, $h(X_A)$, $h(X_V, X_A)$ [14]. Consequently, estimating the mutual information between signals is, in this sense, equivalent to computing the log-likelihood ratio statistic for the hypothesis test of (1). For more complex perceptual grouping hypotheses consisting of only pairwise relationships, it has been shown [12], [13] that the sufficient statistics involve pairwise mutual information estimates. We elaborate on this in the empirical section. A significant issue, and what distinguishes our approach from others, is how one models the probability density terms of (2). Another important issue, which we address later, arises when direct density estimation is infeasible, as is the case when measurements are of high dimension (e.g., audio and video measurements).
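To make the relationship in (2)-(4) concrete, the following minimal sketch (an editorial illustration, not the authors' implementation) computes the normalized log-likelihood ratio of the independence test using plug-in kernel density estimates on synthetic one-dimensional features; its sample average is a plug-in estimate of the mutual information. The toy data and variable names are assumptions made only for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Synthetic 1-D "video" and "audio" features that share a common source (H1 is true).
n = 2000
common = rng.normal(size=n)
x_v = common + 0.5 * rng.normal(size=n)            # video-derived feature
x_a = np.tanh(common) + 0.5 * rng.normal(size=n)   # audio-derived feature (nonlinear link)

# Consistent (kernel) density estimates of the joint and the marginals.
p_joint = gaussian_kde(np.vstack([x_v, x_a]))
p_v = gaussian_kde(x_v)
p_a = gaussian_kde(x_a)

# Normalized log-likelihood ratio of Eq. (2); its expectation is I(X_V; X_A), Eq. (4).
llr = np.mean(np.log(p_joint(np.vstack([x_v, x_a]))) - np.log(p_v(x_v)) - np.log(p_a(x_a)))
print(f"plug-in MI estimate (nats): {llr:.3f}")
```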
Nonparametric density estimators, such as the Parzen kernel density estimator [15], are useful for capturing complex statistical dependencies between random variables. The resulting models can then be used to measure the degree of mutual information in complex phenomena [16], which we apply to audio/visual data. This technique simultaneously learns projections of images in the video sequence and projections of sequences of periodograms taken from the audio sequence. The projections are computed adaptively such that the video and audio projections have maximum mutual information (MI).
We now review our basic method for audiovisual fusion and information theoretic adaptive methods. We then present a probabilistic model for cross-modal signal generation, and show how audiovisual correspondences can be found by identifying components with maximal mutual information. In an experiment comparing the audio and video of every
combination of a group of eight users, our technique was able to perfectly match the corresponding audio and video for each user. These results are based purely on the instantaneous cross-modal mutual information between the projections of the two signals, and do not rely on any prior experience or model of users' speech or appearance.
IV. PROBABILISTIC MODELS OF AUDIOVISUAL FUSION
We consider multimodal scenes which can be modeled probabilistically with one joint audiovisual source and distinct background interference sources for each modality. Each observation is a combination of information from the joint source and information from the background interferer for that channel. We use a graphical model (Fig. 1) to represent this relationship. In the diagrams, $A$ represents the joint source, while $B$ and $C$ represent single-modality background interference. Recall that the test of association is formulated as a measure of dependence between the measurements $X_V$ and $X_A$. By conjecturing this latent variable structure, measurement dependence is explained solely through the hidden cause $A$. Our purpose here is to analyze under which conditions and in what sense our methodology uncovers the underlying cause of our observations without explicitly defining $A$ or its exact relationship to $X_V$ and $X_A$.
Fig. 1(a) shows an independent cause model for our typical case, where $A$, $B$, and $C$ are unobserved random variables representing the causes of our (high-dimensional) observations in each modality, $X_V$ and $X_A$. In general there may be more causes and more measurements, but this simple case can be used to illustrate our algorithm. An important aspect is that the measurements have a common dependence on a single cause. The joint statistical model consistent with the graph of Fig. 1(a) is

$$p(X_V, X_A, A, B, C) = p(X_V \mid A, B)\, p(X_A \mid A, C)\, p(A)\, p(B)\, p(C).$$

Given the independent cause model, a simple application of Bayes' rule (or the equivalent graphical manipulation) yields the graph of Fig. 1(b), which is consistent with

$$p(X_V, X_A, A, B, C) = p(X_A \mid A, C)\, p(C)\, p(A, B \mid X_V)\, p(X_V)$$

which shows that information about $X_A$ contained in $X_V$ is conveyed through the joint statistics of $A$ and $B$. The consequence is that, in general, we cannot disambiguate the influences that $A$ and $B$ have on the measurements. A similar graph is obtained by conditioning on $X_A$. Suppose, however, that decompositions of the measurements $X_V$ and $X_A$ exist such that the following joint densities can be written:

$$p(X_V^A, X_V^B, A, B) = p(X_V^A \mid A)\, p(X_V^B \mid B)\, p(A)\, p(B)$$
$$p(X_A^A, X_A^C, A, C) = p(X_A^A \mid A)\, p(X_A^C \mid C)\, p(A)\, p(C)$$

where $X_V = [X_V^A, X_V^B]$ and $X_A = [X_A^A, X_A^C]$. An example for our specific application would be segmenting the video image (or filtering the audio signal). In this case we get the graph of Fig. 1(c), and from that graph we can extract the Markov chain

$$X_V^A \leftrightarrow A \leftrightarrow X_A^A$$

which contains elements related only to $A$. Fig. 1(d) shows equivalent graphs of the extracted Markov chain. As a consequence, there is no influence due to $B$ or $C$.

Fig. 1. Graphs illustrating the various statistical models exploited by the algorithm: (a) the independent cause model, in which $X_V$ and $X_A$ are independent of each other conditioned on $\{A, B, C\}$; (b) information about $X_A$ contained in $X_V$ is conveyed through the joint statistics of $A$ and $B$; (c) the graph implied by the existence of a separating function; and (d) two equivalent Markov chains which can be extracted from the graphs if the separating functions can be found.
Of course, we are still left with the formidable task of finding a decomposition, but given the decomposition it can be shown, using the data processing inequality [14], that the following inequality holds:

$$I(X_V^A; X_A^A) \le \min\left\{ I(X_V^A; A),\; I(X_A^A; A) \right\}.$$

More importantly, these inequalities hold for any functions of $X_V^A$ and $X_A^A$ (e.g., $f_V(X_V^A)$ and $f_A(X_A^A)$). That is,

$$I\big(f_V(X_V^A); f_A(X_A^A)\big) \le I\big(f_V(X_V^A); A\big) \quad (5)$$
$$I\big(f_V(X_V^A); f_A(X_A^A)\big) \le I\big(f_A(X_A^A); A\big) \quad (6)$$

and finally one can show (see [12]) that

$$I\big(f_V(X_V^A); f_A(X_A^A)\big) \le I(X_V; X_A). \quad (7)$$

The inequalities of (5) and (6) show that by maximizing the mutual information between $f_V(X_V^A)$ and $f_A(X_A^A)$ we necessarily increase the mutual information between $f_V(X_V^A)$ and $A$, and between $f_A(X_A^A)$ and $A$. The implication is that fusion in such a manner discovers the underlying cause of the observations; that is, the joint density of the projected features is strongly related to $A$ and in that sense captures elements of the generative model of audio and video. Note that this is the case without ever specifying the exact form of $A$ or its relationship to the measurements. Additionally, the inequality of (7) shows that by maximizing $I\big(f_V(X_V^A); f_A(X_A^A)\big)$ we are also maximizing a lower bound on the likelihood statistic, (3), of the association hypothesis test. Finally, with an approximation we describe shortly, we can optimize this criterion without estimating the separating function directly. In the event that a perfect decomposition does not exist, it can be shown that the method will approach a good solution in the Kullback-Leibler sense.
From the perspective of information theory, estimating separate projections of the audio-video measurements which have high mutual information has intuitive appeal, as such features will be predictive of each other. An additional advantage is that the form of those statistics is not subject to the strong parametric assumptions (e.g., joint Gaussianity) which we wish to avoid.
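As a quick numerical illustration of the independent cause model (an editorial toy sketch with made-up distributions, not an experiment from the paper), the snippet below draws a joint cause $A$ and per-modality interferers $B$ and $C$, builds measurement components from them, and checks that only the components sharing the cause $A$ carry appreciable mutual information.

```python
import numpy as np
from scipy.stats import gaussian_kde

def mi_kde(x, y):
    """Plug-in mutual information estimate (nats) between two 1-D samples."""
    pj = gaussian_kde(np.vstack([x, y]))
    px, py = gaussian_kde(x), gaussian_kde(y)
    return np.mean(np.log(pj(np.vstack([x, y]))) - np.log(px(x)) - np.log(py(y)))

rng = np.random.default_rng(1)
n = 3000
A, B, C = rng.normal(size=(3, n))           # hidden causes: joint source and interferers
x_v_A = A + 0.3 * rng.normal(size=n)        # video component driven by A
x_v_B = B + 0.3 * rng.normal(size=n)        # video component driven by B
x_a_A = A + 0.3 * rng.normal(size=n)        # audio component driven by A
x_a_C = C + 0.3 * rng.normal(size=n)        # audio component driven by C

print("I(x_v^A ; x_a^A):", round(mi_kde(x_v_A, x_a_A), 3))   # large: shared cause A
print("I(x_v^B ; x_a^C):", round(mi_kde(x_v_B, x_a_C), 3))   # near zero: no shared cause
```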
V. MAXIMALLY INFORMATIVE PROJECTIONS
We now describe a method for learning maximally informative projections. The method uses a technique that maximizes the mutual information between the projections of the audiovisual measurements. Following [17], we use a nonparametric model of the joint density for which an analytic gradient of the mutual information with respect to the projection parameters is available. In principle the method may be applied to any function of the measurements which is differentiable in its parameters (e.g., as shown in [17]). Here, we restrict ourselves to linear functions of the measurements, resulting in a significant computational savings at a minimal cost to the representational power. Note that while the projections are linear, the joint density is estimated nonparametrically, allowing for more complex joint dependencies than can be captured by Gaussian assumptions. We parameterize the projections as

$$f_V(t) = h_V^{\mathsf T}\, x_V(t) \quad (8)$$
$$f_A(t) = h_A^{\mathsf T}\, x_A(t) \quad (9)$$

where $x_V(t)$ and $x_A(t)$ are lexicographic samples of images and periodograms, respectively, from an A/V sequence. The linear projections defined by $h_V$ and $h_A$ map A/V samples to low-dimensional features $f_V(t)$ and $f_A(t)$. Treating the $f_V(t)$'s and $f_A(t)$'s as samples from a random variable, our goal is to choose $h_V$ and $h_A$ to maximize the mutual information, $I(f_V; f_A)$, of the derived measurements.
Mutual information for continuous random variables can be expressed in several ways as a combination of differential entropy terms [14]

$$I(f_V; f_A) = h(f_V) + h(f_A) - h(f_V, f_A). \quad (10)$$

Mutual information indicates the amount of information that one random variable conveys on average about another. The usual difficulty of MI as a criterion for adaptation is that it is an integral function of probability densities. Furthermore, in general we are not given the densities themselves, but samples from which they must be inferred. To overcome this problem, we replace each entropy term in (10) with a second-order Taylor-series approximation as in [16], [18]

$$\hat h(f_V) = -\int_{\Omega_V} \big(\hat p(f_V) - p_u(f_V)\big)^2\, df_V \quad (11)$$
$$\hat I(f_V; f_A) = \hat h(f_V) + \hat h(f_A) - \hat h(f_V, f_A) \quad (12)$$

where $\Omega_V$ is the support of one feature output, $\Omega_A$ is the support of the other, $p_u$ is the uniform density over that support, and $\hat p$ is a Parzen density [15] estimated over the projected samples. The Parzen density estimate is defined as

$$\hat p(f) = \frac{1}{N}\sum_{i=1}^{N} K_\sigma(f - f_i) \quad (13)$$

where $K_\sigma(\cdot)$ is a Gaussian kernel (in our case) and $\sigma$ is the standard deviation. The Parzen density estimate has the capacity to capture relationships with more complex structure than typical parametric families of densities.
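A direct implementation of the Parzen estimate in (13) for one-dimensional projected features might look as follows; the bandwidth value is an arbitrary choice for illustration, not one taken from the paper.

```python
import numpy as np

def parzen_pdf(u, samples, sigma):
    """Evaluate the Parzen estimate of Eq. (13) at points u, from 1-D samples."""
    u = np.atleast_1d(u)[:, None]                 # (M, 1) evaluation points
    diffs = u - samples[None, :]                  # (M, N) pairwise differences
    kern = np.exp(-0.5 * (diffs / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)
    return kern.mean(axis=1)                      # average of Gaussian kernels

samples = np.random.randn(500) * 0.7 + 0.2
grid = np.linspace(-3.0, 3.0, 200)
p_hat = parzen_pdf(grid, samples, sigma=0.15)
print(float(p_hat.sum() * (grid[1] - grid[0])))   # integrates to roughly 1
```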
Note that this is essentially an integrated squared error comparison between the density of the projections and the uniform density (which has maximum entropy over a finite region). An advantage of this particular combination of second-order entropy approximation and nonparametric density estimator is that the gradient terms (appropriately combined to approximate mutual information as in (12)) with respect to the projection coefficients can be computed exactly by evaluating a finite number of functions at a finite number of sample locations in the output space, as shown in [16], [18]. The update term for the individual entropy terms in (12) (note the negative sign on the third term) of the $i$th feature vector at iteration $k$, as a function of the value of the feature vector at iteration $k-1$, is (where the sample in question is drawn from $f_V$, $f_A$, or their concatenation, depending on which term of (12) is being computed)
(14)
(15)
(16)
where the output support is $\Omega_V$, $\Omega_A$, or their Cartesian product, depending on the entropy term. Both functions appearing in the update are vector-valued, and the support of the output is a hypercube of finite volume; the subscript $i$ indicates the $i$th element of the corresponding vector. Adaptation consists of the update rule above followed by a modified least squares solution for $h_V$ and $h_A$ until a local maximum is reached. In the experiments that follow, 150 to 300 iterations were used.
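The exact update equations (14)-(16) are not reproduced above, but the overall structure of the adaptation, alternating between adjusting the projections and re-evaluating the statistical dependence of the projected features, can be sketched with a crude stand-in. The toy below uses random hill climbing on a plug-in kernel MI estimate instead of the analytic gradient of the integrated-squared-error criterion; the data sizes and step sizes are illustrative assumptions.

```python
import numpy as np
from scipy.stats import gaussian_kde

def mi_kde(x, y):
    """Plug-in kernel estimate of the mutual information between two 1-D samples."""
    pj = gaussian_kde(np.vstack([x, y]))
    px, py = gaussian_kde(x), gaussian_kde(y)
    return np.mean(np.log(pj(np.vstack([x, y]))) - np.log(px(x)) - np.log(py(y)))

rng = np.random.default_rng(0)
T, Dv, Da = 200, 50, 30
common = rng.normal(size=T)                                   # shared cause
X_v = np.outer(common, rng.normal(size=Dv)) + 0.5 * rng.normal(size=(T, Dv))
X_a = np.outer(common, rng.normal(size=Da)) + 0.5 * rng.normal(size=(T, Da))

h_v, h_a = rng.normal(size=Dv), rng.normal(size=Da)           # initial projections
best = mi_kde(X_v @ h_v, X_a @ h_a)
for _ in range(200):                                          # a few hundred iterations
    cand_v = h_v + 0.05 * rng.normal(size=Dv)                 # perturb the projections
    cand_a = h_a + 0.05 * rng.normal(size=Da)
    score = mi_kde(X_v @ cand_v, X_a @ cand_a)
    if score > best:                                          # greedy accept if MI improves
        h_v, h_a, best = cand_v, cand_a, score
print(f"plug-in MI of projected features after adaptation: {best:.3f}")
```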
A. Capacity Control
In [17], early results were demonstrated using this method for the video-based localization of a speaking user. However, the technique lacked robustness, as the projection coefficients were under-determined. To improve on the method, we thus introduce a capacity control mechanism in the form of a prior bias toward small weights. The method of [16] requires that the projection be differentiable, which it is in this case. The specific means of capacity control that we utilize is to impose an $L_2$ penalty on the projection coefficients of $h_V$ and $h_A$. Furthermore, we impose the criterion that, if we consider the projection $h_V$ as a filter, it has low output energy (on average) when convolved with images in the sequence. This constraint is the same as that proposed by Mahalanobis et al. [19] for designing optimized correlators, the difference being that in their case the projection output was designed explicitly, while in our case it is derived from the MI optimization in the output space.
The adaptation criterion, which we maximize in practice, is then a combination of the approximation to MI in (11) and (12) and the regularization terms:

$$J = \hat I(f_V; f_A) - \lambda_1 \|h_V\|^2 - \lambda_2 \|h_A\|^2 - \lambda_3\, h_V^{\mathsf T} \bar R\, h_V \quad (17)$$

where the last term derives from the output energy constraint and $\bar R$ is the average autocorrelation function (taken over all images in the sequence). This term is more easily computed in the frequency domain (see [19]) and is equivalent to prewhitening the images using the inverse of the average power spectrum. The scalar weighting terms $\lambda_1$, $\lambda_2$, $\lambda_3$ were set using a data dependent heuristic for all experiments. Note that there is a straightforward probabilistic interpretation of each of the terms: the MI term relates to the hypothesis test, and the remaining terms represent Gaussian priors on the coefficients of the projections (but not on the resulting projections of the measurements).
Computing $h_V$ can be decomposed into three stages:
1) Prewhiten the images once (using the average spectrum of the images), followed by iterations of
2) updating the feature values (the $f_V(t)$'s) using (14), and
3) solving for the projection coefficients using least squares and the $L_2$ penalty (see the sketch below).
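Stage 3 is an ordinary regularized least-squares (ridge) solve. A minimal sketch, assuming target feature values are already available from the update step, is given below; the regularization weight is a placeholder for the data-dependent heuristic mentioned earlier.

```python
import numpy as np

def solve_projection(X, f, lam=1e-2):
    """Ridge solve for projection coefficients h such that X @ h approximates f.

    X: (T, D) lexicographic samples; f: (T,) target feature values; lam: L2 weight.
    """
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ f)

X_v = np.random.rand(64, 24 * 32)     # 64 frames of 24 x 32 pixels, flattened
f_v = np.random.randn(64)             # target feature values from the update step
h_v = solve_projection(X_v, f_v)
print(h_v.shape)                      # (768,)
```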
The prewhitening interpretation has intuitive appeal for the images, as it accentuates edges in the input image. It is the moving edges (lips, chin, etc.) which we expect to convey the most information about the audio. Furthermore, by including a prewhitening filter as a preprocessing step one can exclude the final term of (17), which is what we do in practice.
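A hedged sketch of the prewhitening step, filtering each frame by the inverse of the sequence's average power spectrum, is shown below; the frame sizes and the small epsilon guarding against division by zero are illustrative choices.

```python
import numpy as np

def prewhiten(frames, eps=1e-6):
    """Filter each frame by the inverse of the average power spectrum of the sequence."""
    F = np.fft.fft2(frames, axes=(-2, -1))
    avg_power = np.mean(np.abs(F) ** 2, axis=0)     # average power spectrum over the sequence
    whitened = F / np.sqrt(avg_power + eps)         # emphasizes edges / fine structure
    return np.real(np.fft.ifft2(whitened, axes=(-2, -1)))

frames = np.random.rand(16, 60, 80)    # stand-in image sequence
print(prewhiten(frames).shape)         # (16, 60, 80)
```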
The projection coefficients related to the audio signal, $h_A$, are solved for in a similar fashion (simultaneously), without the initial prewhitening step.
VI. EXPERIMENTS
Our motivating scenario for this application is a group of users interacting with an anonymous handheld device or kiosk using spoken commands. Given a received audio signal, we would like to verify whether the person speaking the command is in the field of view of the camera on the device, and if so to localize which person is speaking. Simple techniques which check only for the presence of a face (or a moving face) would fail when two people were looking at their individual devices and one spoke a command. Since interaction may be anonymous, we presume that no prior model of the voice or appearance of users is available to perform the verification and localization.

Fig. 2. Video sequence containing one speaker and a monitor which is flickering: (a) one image from the sequence, (b) pixel-wise image of standard deviations taken over the entire sequence, (c) image of the learned projection, $h_V$, and (d) image of $h_V$ for incorrect audio.

Fig. 3. Video sequence containing one speaker (person on left) and one person who is randomly moving their mouth/head (but not speaking): (a) one image from the sequence, (b) pixel-wise image of standard deviations taken over the entire sequence, (c) image of the learned projection, $h_V$, and (d) image of $h_V$ for incorrect audio.
In the first experiment¹ we collected audio-video data from eight subjects. In all cases, the video data was collected at 29.97 frames per second at a resolution of 360 × 240. The audio signal was collected at 48 kHz, but only 10 kHz of frequency content was used. All subjects were asked to utter a specific phrase.
This typically yielded 2-2.5 s of data. Video frames were processed as is, while the audio signal was transformed to a series of periodograms.
¹A portion of these results appears in [20].
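As an illustration of the preprocessing implied by this setup (an editorial sketch, with window and FFT choices that are not specified in the text above), the audio can be segmented so that one periodogram is produced per video frame and only content below 10 kHz is retained:

```python
import numpy as np

fs = 48_000                  # audio sampling rate (Hz)
fps = 29.97                  # video frame rate
samples_per_frame = int(round(fs / fps))        # about 1602 audio samples per video frame

audio = np.random.randn(fs * 2)                 # stand-in for roughly 2 s of speech
n_frames = len(audio) // samples_per_frame

frames = audio[: n_frames * samples_per_frame].reshape(n_frames, samples_per_frame)
window = np.hanning(samples_per_frame)
spectra = np.abs(np.fft.rfft(frames * window, axis=1)) ** 2   # one periodogram per frame

freqs = np.fft.rfftfreq(samples_per_frame, d=1.0 / fs)
periodograms = spectra[:, freqs <= 10_000]      # keep only content below 10 kHz
print(periodograms.shape)
```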

References
[14] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: Wiley, 1991.
[15] E. Parzen, "On estimation of a probability density function and mode," Annals of Mathematical Statistics, vol. 33, pp. 1065-1076, 1962.
[19] A. Mahalanobis, B. V. K. Vijaya Kumar, and D. Casasent, "Minimum average correlation energy filters," Applied Optics, vol. 26, no. 17, pp. 3633-3640, 1987.