Speaker Association With Signal-Level Audiovisual Fusion
John W. Fisher, III, Member, IEEE, and Trevor Darrell, Member, IEEE
Abstract—Audio and visual signals arriving from a common source are detected using a signal-level fusion technique. A probabilistic multimodal generation model is introduced and used to derive an information theoretic measure of cross-modal correspondence. Nonparametric statistical density modeling techniques can characterize the mutual information between signals from different domains. By comparing the mutual information between different pairs of signals, it is possible to identify which person is speaking a given utterance and discount errant motion or audio from other utterances or nonspeech events.
Index Terms—Audiovisual correspondence, multimodal data association, mutual information.
I. INTRODUCTION
CONVERSATIONAL dialog systems have become practically useful in many application domains, including travel reservations, traffic information, and database access. However, most existing conversational speech systems require tethered interaction and work primarily for a single user. Users must wear an attached microphone or speak into a telephone handset, and do so one at a time. This limits the range of use of dialog systems, since in many applications users might expect to freely approach and interact with a device. Worse, they may wish to arrive as a group and talk among themselves while interacting with the system. To date it has been difficult for speech recognition systems to handle such conditions and correctly recognize the utterances intended for the device. We are interested in facilitating untethered and casual conversational interaction, and address the problem of how to temporally segregate the speech of multiple users interacting with a system.
With a single modality, properly associating speech from multiple unknown speakers is quite difficult. However, if other modalities are available they can often provide disambiguating information. In particular, visual information can be valuable for deciding whether an individual user is speaking a particular utterance. We wish to solve a conversational audiovisual correspondence problem: given sets of audio and visual signals, decide which audiovisual pairs are consistent and could have come from a single speaker. We approach this problem from a signal-processing perspective, and develop a statistical measure of whether two signals come from a common source. We make no assumptions about the content of the audio signal or the visual appearance, and use a general information-theoretic approach. Our method works without learning a specific lip or language model, and is therefore robust to a range of appearances and acoustic environments.

Manuscript received December 1, 2002; revised November 15, 2003. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Jun Ohya. The authors are with the Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139 USA (e-mail: trevor@ai.mit.edu). Digital Object Identifier 10.1109/TMM.2004.827503
The core of our approach is a technique for jointly modeling audio and video variation to identify cross-modal correspondences. It is driven by the simple hypothesis of whether a region of interest in an image sequence (perhaps the entire image) is associated with a separately measured audio signal. We formulate the problem within a nonparametric hypothesis testing framework, from which information theoretic quantities naturally arise as the measure of association between two high-dimensional signals. We show how this approach can detect which user is speaking when several are facing a device and distracting motion is present. This allows the segregation of users' utterances from each other's speech, and from background noise events.
II. RELATED WORK
Humans routinely perform tasks in which ambiguous auditory and visual data are combined in order to support accurate perception. In contrast, automated approaches for statistical processing of multimodal data sources lag far behind. This is primarily due to the fact that few methods adequately model the complexity of the audio/visual relationship. Classical approaches to multimodal fusion at a signal processing level often either assume a statistical relationship which is too simple (e.g., jointly Gaussian) or defer fusion to the decision level, when many of the joint (and useful) properties have been lost. While such pragmatic choices may lead to simple statistical measures, they do so at the cost of modeling capacity.
An information theoretic approach motivates fusion at the measurement level without regard to specific parametric densities. The idea of using information-theoretic principles in an adaptive framework is not new (e.g., see [1] for an overview), with many approaches suggested over the last 30 years. Critical distinctions in most information theoretic approaches lie in how densities are modeled (either explicitly or implicitly), how entropy (and by extension mutual information) is approximated or estimated, and the types of mappings which are used (e.g., linear versus nonlinear). Early approaches used a Gaussian assumption, e.g., Plumbley [2], [3] and Becker [4].
There has been substantial progress on feature-level integration of speech and vision. For example, Meier et al. [5], Stork [6], and others have built visual speech reading systems that can improve speech recognition results dramatically. Our goal is not recognition, but to be able to detect and disambiguate cases where audio and video signals are coming from different
sources. Hershey and Movellan [7] addressed this problem using the per-pixel correlation relative to the energy of an audio track as a measure of their dependence. An inherent assumption of this method was that the joint statistics were Gaussian. As this is a per-pixel measure, there is no straightforward way to integrate the measure over an image region for purposes of association without making simplifying assumptions which will not hold in practice (e.g., that pixels are independent of each other conditioned on the speech signal). We should note that the objective of their work was to locate the source of an audio signal in an image sequence; association is an implicit step. A more general approach was taken by Slaney and Covell [8], which looked specifically at optimizing temporal alignment between audio and video tracks using canonical correlations, which are equivalent to the maximum mutual information projection in the jointly Gaussian case. They did not address the problem of detecting whether two signals came from the same person, although their method could be adapted to do so.
Nock et al. [9] consider two mutual information approaches and one HMM-based approach for assessing face and speech consistency. The mutual information approaches compare a histogram-based estimate over vector-quantized codebooks to a Gaussian estimate over feature vectors. They report that the Gaussian method gave superior results when using a cepstral representation of the audio and a discrete cosine transform representation of the video. All three methods utilize a training corpus in order to estimate a prior model; thereafter, associations and/or likelihoods are computed under the trained model. A time-delay neural network approach was suggested in [10], demonstrating location detection for a single visual appearance on a small test set. Each of [8]-[10] requires training data in order to estimate model parameters. Here, and in contrast to the previous methods, we develop a methodology for testing audio-video association in the absence of a prior model and without the requirement of training data with which to construct one.
III. SIGNAL-LEVEL AUDIOVISUAL ASSOCIATION
We propose an independent cause model to capture the relationship between generated signals in each individual modality. Using principles from information theory and nonparametric statistics, we show how an approach for learning maximally informative joint subspaces can find cross-modal correspondences. We first show how the audiovisual association problem can be formulated as a hypothesis test and give its relationship to mutual information based association methods (see [11] for an extensive treatment). Following that, we present an information theoretic analysis of a graphical model of multimodal signal generation which gives some insight into the relationship between data association and learning a generative audiovisual model.
Given an audio-video sequence, let us denote the sequence of images (or a region within each image) as $\{x_V(t)\}$, where $t$ indicates (discrete) time. Similarly, denote the audio measurements as $\{x_A(t)\}$; for our purposes these will be vectors of spectral measurements. Treating the audio and video measurements as i.i.d. samples from the random variables $X_V$ and $X_A$, respectively, allows us to cast the audiovisual association problem as a simple hypothesis test:

$$H_0: p(x_V, x_A) = p(x_V)\,p(x_A) \quad \text{versus} \quad H_1: p(x_V, x_A) \neq p(x_V)\,p(x_A) \quad (1)$$

where $H_0$ states that the measurements are statistically independent (i.e., their joint density is expressed as a product of marginal densities) and $H_1$ states that the measurements are statistically dependent (or, equivalently, associated). Perceptual grouping problems, in which there are multiple sources of both video and audio, can be stated in a similar, albeit more complicated, fashion [12], [13]. Plugging the measurements into a (normalized) log-likelihood ratio statistic, using a consistent probability density estimator for $p(x_V, x_A)$, $p(x_V)$, $p(x_A)$, and taking the expectation with respect to the joint probability density of $X_V$ and $X_A$ yields

$$\frac{1}{N}\sum_{t=1}^{N}\log\frac{\hat p\big(x_V(t), x_A(t)\big)}{\hat p\big(x_V(t)\big)\,\hat p\big(x_A(t)\big)} \quad (2)$$
$$\approx E\!\left[\log\frac{p(X_V, X_A)}{p(X_V)\,p(X_A)}\right] \quad (3)$$
$$= I(X_V; X_A) \quad (4)$$

where $I(X_V; X_A)$ is the mutual information between the random variables $X_V$ and $X_A$. Mutual information can be expressed as a combination of the differential entropy terms $h(X_V)$, $h(X_A)$, $h(X_V, X_A)$ [14]. Consequently, estimating the mutual information between signals is, in this sense, equivalent to computing the log-likelihood ratio statistic for the hypothesis test of (1). For more complex perceptual grouping hypotheses consisting of only pairwise relationships, it has been shown [12], [13] that the sufficient statistics involve pairwise mutual information estimates. We elaborate on this in the empirical section. A significant issue, and what distinguishes our approach from others, is how one models the probability density terms of (2). Another important issue, which we address later, arises when direct density estimation is infeasible, as is the case when measurements are of high dimension (e.g., audio and video measurements).
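To make the relationship in (2)-(4) concrete, the following minimal sketch (an editorial illustration, not the authors' implementation) computes the normalized log-likelihood ratio of the independence test using plug-in kernel density estimates on synthetic one-dimensional features; its sample average is a plug-in estimate of the mutual information. The toy data and variable names are assumptions made only for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Synthetic 1-D "video" and "audio" features that share a common source (H1 is true).
n = 2000
common = rng.normal(size=n)
x_v = common + 0.5 * rng.normal(size=n)            # video-derived feature
x_a = np.tanh(common) + 0.5 * rng.normal(size=n)   # audio-derived feature (nonlinear link)

# Consistent (kernel) density estimates of the joint and the marginals.
p_joint = gaussian_kde(np.vstack([x_v, x_a]))
p_v = gaussian_kde(x_v)
p_a = gaussian_kde(x_a)

# Normalized log-likelihood ratio of Eq. (2); its expectation is I(X_V; X_A), Eq. (4).
llr = np.mean(np.log(p_joint(np.vstack([x_v, x_a]))) - np.log(p_v(x_v)) - np.log(p_a(x_a)))
print(f"plug-in MI estimate (nats): {llr:.3f}")
```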
Nonparametric density estimators, such as the Parzen kernel density estimator [15], are useful for capturing complex statistical dependencies between random variables. The resulting models can then be used to measure the degree of mutual information in complex phenomena [16], which we apply to audio/visual data. This technique simultaneously learns projections of images in the video sequence and projections of sequences of periodograms taken from the audio sequence. The projections are computed adaptively such that the video and audio projections have maximum mutual information (MI).
We now review our basic method for audiovisual fusion and information theoretic adaptive methods. We then present a probabilistic model for cross-modal signal generation, and show how audiovisual correspondences can be found by identifying components with maximal mutual information. In an experiment comparing the audio and video of every
combination of a group of eight users, our technique was able to perfectly match the corresponding audio and video for each user. These results are based purely on the instantaneous cross-modal mutual information between the projections of the two signals, and do not rely on any prior experience or model of users' speech or appearance.
IV. PROBABILISTIC MODELS OF AUDIOVISUAL FUSION
We consider multimodal scenes which can be modeled probabilistically with one joint audiovisual source and distinct background interference sources for each modality. Each observation is a combination of information from the joint source and information from the background interferer for that channel. We use a graphical model (Fig. 1) to represent this relationship. In the diagrams, $A$ represents the joint source, while $B$ and $C$ represent single-modality background interference. Recall that the test of association is formulated as a measure of dependence between the measurements $X_V$ and $X_A$. By conjecturing this latent variable structure, measurement dependence is explained solely through the hidden cause $A$. Our purpose here is to analyze under which conditions and in what sense our methodology uncovers the underlying cause of our observations without explicitly defining $A$ or its exact relationship to $X_V$ and $X_A$.
Fig. 1(a) shows an independent cause model for our typical case, where $A$, $B$, and $C$ are unobserved random variables representing the causes of our (high-dimensional) observations in each modality, $X_V$ and $X_A$. In general there may be more causes and more measurements, but this simple case can be used to illustrate our algorithm. An important aspect is that the measurements have a common dependence on a single cause. The joint statistical model consistent with the graph of Fig. 1(a) is

$$p(X_V, X_A, A, B, C) = p(X_V \mid A, B)\, p(X_A \mid A, C)\, p(A)\, p(B)\, p(C).$$

Given the independent cause model, a simple application of Bayes' rule (or the equivalent graphical manipulation) yields the graph of Fig. 1(b), which is consistent with

$$p(X_V, X_A, A, B, C) = p(X_A \mid A, C)\, p(C)\, p(A, B \mid X_V)\, p(X_V)$$

which shows that information about $X_A$ contained in $X_V$ is conveyed through the joint statistics of $A$ and $B$. The consequence is that, in general, we cannot disambiguate the influences that $A$ and $B$ have on the measurements. A similar graph is obtained by conditioning on $X_A$. Suppose, however, that decompositions of the measurements $X_V$ and $X_A$ exist such that the following joint densities can be written:

$$p(X_V^A, X_V^B, A, B) = p(X_V^A \mid A)\, p(X_V^B \mid B)\, p(A)\, p(B)$$
$$p(X_A^A, X_A^C, A, C) = p(X_A^A \mid A)\, p(X_A^C \mid C)\, p(A)\, p(C)$$

where $X_V = [X_V^A, X_V^B]$ and $X_A = [X_A^A, X_A^C]$. An example for our specific application would be segmenting the video image (or filtering the audio signal). In this case we get the graph of Fig. 1(c), and from that graph we can extract the Markov chain

$$X_V^A \leftrightarrow A \leftrightarrow X_A^A$$

which contains elements related only to $A$. Fig. 1(d) shows equivalent graphs of the extracted Markov chain. As a consequence, there is no influence due to $B$ or $C$.

Fig. 1. Graphs illustrating the various statistical models exploited by the algorithm: (a) the independent cause model, in which $X_V$ and $X_A$ are independent of each other conditioned on $\{A, B, C\}$; (b) information about $X_A$ contained in $X_V$ is conveyed through the joint statistics of $A$ and $B$; (c) the graph implied by the existence of a separating function; and (d) two equivalent Markov chains which can be extracted from the graphs if the separating functions can be found.
Of course, we are still left with the formidable task of finding a decomposition, but given the decomposition it can be shown, using the data processing inequality [14], that the following inequality holds:

$$I(X_V^A; X_A^A) \le \min\left\{ I(X_V^A; A),\; I(X_A^A; A) \right\}.$$

More importantly, these inequalities hold for any functions of $X_V^A$ and $X_A^A$ (e.g., $f_V(X_V^A)$ and $f_A(X_A^A)$). That is,

$$I\big(f_V(X_V^A); f_A(X_A^A)\big) \le I\big(f_V(X_V^A); A\big) \quad (5)$$
$$I\big(f_V(X_V^A); f_A(X_A^A)\big) \le I\big(f_A(X_A^A); A\big) \quad (6)$$

and finally one can show (see [12]) that

$$I\big(f_V(X_V^A); f_A(X_A^A)\big) \le I(X_V; X_A). \quad (7)$$

The inequalities of (5) and (6) show that by maximizing the mutual information between $f_V(X_V^A)$ and $f_A(X_A^A)$ we necessarily increase the mutual information between $f_V(X_V^A)$ and $A$, and between $f_A(X_A^A)$ and $A$. The implication is that fusion in such a manner discovers the underlying cause of the observations; that is, the joint density of the projected features is strongly related to $A$ and in that sense captures elements of the generative model of audio and video. Note that this is the case without ever specifying the exact form of $A$ or its relationship to the measurements. Additionally, the inequality of (7) shows that by maximizing $I\big(f_V(X_V^A); f_A(X_A^A)\big)$ we are also maximizing a lower bound on the likelihood statistic, (3), of the association hypothesis test. Finally, with an approximation we describe shortly, we can optimize this criterion without estimating the separating function directly. In the event that a perfect decomposition does not exist, it can be shown that the method will approach a good solution in the Kullback-Leibler sense.
From the perspective of information theory, estimating separate projections of the audio-video measurements which have high mutual information has intuitive appeal, as such features will be predictive of each other. An additional advantage is that the form of those statistics is not subject to the strong parametric assumptions (e.g., joint Gaussianity) which we wish to avoid.
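As a quick numerical illustration of the independent cause model (an editorial toy sketch with made-up distributions, not an experiment from the paper), the snippet below draws a joint cause $A$ and per-modality interferers $B$ and $C$, builds measurement components from them, and checks that only the components sharing the cause $A$ carry appreciable mutual information.

```python
import numpy as np
from scipy.stats import gaussian_kde

def mi_kde(x, y):
    """Plug-in mutual information estimate (nats) between two 1-D samples."""
    pj = gaussian_kde(np.vstack([x, y]))
    px, py = gaussian_kde(x), gaussian_kde(y)
    return np.mean(np.log(pj(np.vstack([x, y]))) - np.log(px(x)) - np.log(py(y)))

rng = np.random.default_rng(1)
n = 3000
A, B, C = rng.normal(size=(3, n))           # hidden causes: joint source and interferers
x_v_A = A + 0.3 * rng.normal(size=n)        # video component driven by A
x_v_B = B + 0.3 * rng.normal(size=n)        # video component driven by B
x_a_A = A + 0.3 * rng.normal(size=n)        # audio component driven by A
x_a_C = C + 0.3 * rng.normal(size=n)        # audio component driven by C

print("I(x_v^A ; x_a^A):", round(mi_kde(x_v_A, x_a_A), 3))   # large: shared cause A
print("I(x_v^B ; x_a^C):", round(mi_kde(x_v_B, x_a_C), 3))   # near zero: no shared cause
```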
V. MAXIMALLY INFORMATIVE PROJECTIONS
We now describe a method for learning maximally informative projections. The method uses a technique that maximizes the mutual information between the projections of the audiovisual measurements. Following [17], we use a nonparametric model of the joint density for which an analytic gradient of the mutual information with respect to the projection parameters is available. In principle the method may be applied to any function of the measurements which is differentiable in its parameters (e.g., as shown in [17]). Here, we restrict ourselves to linear functions of the measurements, resulting in a significant computational savings at a minimal cost to the representational power. Note that while the projections are linear, the joint density is estimated nonparametrically, allowing for more complex joint dependencies than can be captured by Gaussian assumptions. We parameterize the projections as

$$f_V(t) = h_V^{\mathsf T}\, x_V(t) \quad (8)$$
$$f_A(t) = h_A^{\mathsf T}\, x_A(t) \quad (9)$$

where $x_V(t)$ and $x_A(t)$ are lexicographic samples of images and periodograms, respectively, from an A/V sequence. The linear projections defined by $h_V$ and $h_A$ map A/V samples to low-dimensional features $f_V(t)$ and $f_A(t)$. Treating the $f_V(t)$'s and $f_A(t)$'s as samples from a random variable, our goal is to choose $h_V$ and $h_A$ to maximize the mutual information, $I(f_V; f_A)$, of the derived measurements.
Mutual information for continuous random variables can be expressed in several ways as a combination of differential entropy terms [14]

$$I(f_V; f_A) = h(f_V) + h(f_A) - h(f_V, f_A). \quad (10)$$

Mutual information indicates the amount of information that one random variable conveys on average about another. The usual difficulty of MI as a criterion for adaptation is that it is an integral function of probability densities. Furthermore, in general we are not given the densities themselves, but samples from which they must be inferred. To overcome this problem, we replace each entropy term in (10) with a second-order Taylor-series approximation as in [16], [18]

$$\hat h(f_V) = -\int_{\Omega_V} \big(\hat p(f_V) - p_u(f_V)\big)^2\, df_V \quad (11)$$
$$\hat I(f_V; f_A) = \hat h(f_V) + \hat h(f_A) - \hat h(f_V, f_A) \quad (12)$$

where $\Omega_V$ is the support of one feature output, $\Omega_A$ is the support of the other, $p_u$ is the uniform density over that support, and $\hat p$ is a Parzen density [15] estimated over the projected samples. The Parzen density estimate is defined as

$$\hat p(f) = \frac{1}{N}\sum_{i=1}^{N} K_\sigma(f - f_i) \quad (13)$$

where $K_\sigma(\cdot)$ is a Gaussian kernel (in our case) and $\sigma$ is the standard deviation. The Parzen density estimate has the capacity to capture relationships with more complex structure than typical parametric families of densities.
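A direct implementation of the Parzen estimate in (13) for one-dimensional projected features might look as follows; the bandwidth value is an arbitrary choice for illustration, not one taken from the paper.

```python
import numpy as np

def parzen_pdf(u, samples, sigma):
    """Evaluate the Parzen estimate of Eq. (13) at points u, from 1-D samples."""
    u = np.atleast_1d(u)[:, None]                 # (M, 1) evaluation points
    diffs = u - samples[None, :]                  # (M, N) pairwise differences
    kern = np.exp(-0.5 * (diffs / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)
    return kern.mean(axis=1)                      # average of Gaussian kernels

samples = np.random.randn(500) * 0.7 + 0.2
grid = np.linspace(-3.0, 3.0, 200)
p_hat = parzen_pdf(grid, samples, sigma=0.15)
print(float(p_hat.sum() * (grid[1] - grid[0])))   # integrates to roughly 1
```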
Note that this is essentially an integrated squared error comparison between the density of the projections and the uniform density (which has maximum entropy over a finite region). An advantage of this particular combination of second-order entropy approximation and nonparametric density estimator is that the gradient terms (appropriately combined to approximate mutual information as in (12)) with respect to the projection coefficients can be computed exactly by evaluating a finite number of functions at a finite number of sample locations in the output space, as shown in [16], [18]. The update term for the individual entropy terms in (12) (note the negative sign on the third term) of the $i$th feature vector at iteration $k$, as a function of the value of the feature vector at iteration $k-1$, is (where the sample in question is drawn from $f_V$, $f_A$, or their concatenation, depending on which term of (12) is being computed)
(14)
(15)
(16)
where the output support is $\Omega_V$, $\Omega_A$, or their Cartesian product, depending on the entropy term. Both functions appearing in the update are vector-valued, and the support of the output is a hypercube of finite volume; the subscript $i$ indicates the $i$th element of the corresponding vector. Adaptation consists of the update rule above followed by a modified least squares solution for $h_V$ and $h_A$ until a local maximum is reached. In the experiments that follow, 150 to 300 iterations were used.
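The exact update equations (14)-(16) are not reproduced above, but the overall structure of the adaptation, alternating between adjusting the projections and re-evaluating the statistical dependence of the projected features, can be sketched with a crude stand-in. The toy below uses random hill climbing on a plug-in kernel MI estimate instead of the analytic gradient of the integrated-squared-error criterion; the data sizes and step sizes are illustrative assumptions.

```python
import numpy as np
from scipy.stats import gaussian_kde

def mi_kde(x, y):
    """Plug-in kernel estimate of the mutual information between two 1-D samples."""
    pj = gaussian_kde(np.vstack([x, y]))
    px, py = gaussian_kde(x), gaussian_kde(y)
    return np.mean(np.log(pj(np.vstack([x, y]))) - np.log(px(x)) - np.log(py(y)))

rng = np.random.default_rng(0)
T, Dv, Da = 200, 50, 30
common = rng.normal(size=T)                                   # shared cause
X_v = np.outer(common, rng.normal(size=Dv)) + 0.5 * rng.normal(size=(T, Dv))
X_a = np.outer(common, rng.normal(size=Da)) + 0.5 * rng.normal(size=(T, Da))

h_v, h_a = rng.normal(size=Dv), rng.normal(size=Da)           # initial projections
best = mi_kde(X_v @ h_v, X_a @ h_a)
for _ in range(200):                                          # a few hundred iterations
    cand_v = h_v + 0.05 * rng.normal(size=Dv)                 # perturb the projections
    cand_a = h_a + 0.05 * rng.normal(size=Da)
    score = mi_kde(X_v @ cand_v, X_a @ cand_a)
    if score > best:                                          # greedy accept if MI improves
        h_v, h_a, best = cand_v, cand_a, score
print(f"plug-in MI of projected features after adaptation: {best:.3f}")
```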
A. Capacity Control
In [17], early results were demonstrated using this method for the video-based localization of a speaking user. However, the technique lacked robustness, as the projection coefficients were under-determined. To improve on the method, we thus introduce a capacity control mechanism in the form of a prior bias toward small weights. The method of [16] requires that the projection be differentiable, which it is in this case. The specific means of capacity control that we utilize is to impose an $L_2$ penalty on the projection coefficients of $h_V$ and $h_A$. Furthermore, we impose the criterion that, if we consider the projection $h_V$ as a filter, it has low output energy (on average) when convolved with images in the sequence. This constraint is the same as that proposed by Mahalanobis et al. [19] for designing optimized correlators, the difference being that in their case the projection output was designed explicitly, while in our case it is derived from the MI optimization in the output space.
The adaptation criterion, which we maximize in practice, is then a combination of the approximation to MI in (11) and (12) and the regularization terms:

$$J = \hat I(f_V; f_A) - \lambda_1 \|h_V\|^2 - \lambda_2 \|h_A\|^2 - \lambda_3\, h_V^{\mathsf T} \bar R\, h_V \quad (17)$$

where the last term derives from the output energy constraint and $\bar R$ is the average autocorrelation function (taken over all images in the sequence). This term is more easily computed in the frequency domain (see [19]) and is equivalent to prewhitening the images using the inverse of the average power spectrum. The scalar weighting terms $\lambda_1$, $\lambda_2$, $\lambda_3$ were set using a data dependent heuristic for all experiments. Note that there is a straightforward probabilistic interpretation of each of the terms: the MI term relates to the hypothesis test, and the remaining terms represent Gaussian priors on the coefficients of the projections (but not on the resulting projections of the measurements).
Computing $h_V$ can be decomposed into three stages:
1) Prewhiten the images once (using the average spectrum of the images), followed by iterations of
2) updating the feature values (the $f_V(t)$'s) using (14), and
3) solving for the projection coefficients using least squares and the $L_2$ penalty (see the sketch below).
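Stage 3 is an ordinary regularized least-squares (ridge) solve. A minimal sketch, assuming target feature values are already available from the update step, is given below; the regularization weight is a placeholder for the data-dependent heuristic mentioned earlier.

```python
import numpy as np

def solve_projection(X, f, lam=1e-2):
    """Ridge solve for projection coefficients h such that X @ h approximates f.

    X: (T, D) lexicographic samples; f: (T,) target feature values; lam: L2 weight.
    """
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ f)

X_v = np.random.rand(64, 24 * 32)     # 64 frames of 24 x 32 pixels, flattened
f_v = np.random.randn(64)             # target feature values from the update step
h_v = solve_projection(X_v, f_v)
print(h_v.shape)                      # (768,)
```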
The prewhitening interpretation has intuitive appeal for the images, as it accentuates edges in the input image. It is the moving edges (lips, chin, etc.) which we expect to convey the most information about the audio. Furthermore, by including a prewhitening filter as a preprocessing step one can exclude the final term of (17), which is what we do in practice.
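A hedged sketch of the prewhitening step, filtering each frame by the inverse of the sequence's average power spectrum, is shown below; the frame sizes and the small epsilon guarding against division by zero are illustrative choices.

```python
import numpy as np

def prewhiten(frames, eps=1e-6):
    """Filter each frame by the inverse of the average power spectrum of the sequence."""
    F = np.fft.fft2(frames, axes=(-2, -1))
    avg_power = np.mean(np.abs(F) ** 2, axis=0)     # average power spectrum over the sequence
    whitened = F / np.sqrt(avg_power + eps)         # emphasizes edges / fine structure
    return np.real(np.fft.ifft2(whitened, axes=(-2, -1)))

frames = np.random.rand(16, 60, 80)    # stand-in image sequence
print(prewhiten(frames).shape)         # (16, 60, 80)
```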
The projection coefficients related to the audio signal, $h_A$, are solved for in a similar fashion (simultaneously), without the initial prewhitening step.
VI. EXPERIMENTS
Our motivating scenario for this application is a group of users interacting with an anonymous handheld device or kiosk using spoken commands. Given a received audio signal, we would like to verify whether the person speaking the command is in the field of view of the camera on the device, and if so to localize which person is speaking. Simple techniques which check only for the presence of a face (or a moving face) would fail when two people were looking at their individual devices and one spoke a command. Since interaction may be anonymous, we presume that no prior model of the voice or appearance of users is available to perform the verification and localization.

Fig. 2. Video sequence containing one speaker and a monitor which is flickering: (a) one image from the sequence, (b) pixel-wise image of standard deviations taken over the entire sequence, (c) image of the learned projection, $h_V$, and (d) image of $h_V$ for incorrect audio.

Fig. 3. Video sequence containing one speaker (person on left) and one person who is randomly moving their mouth/head (but not speaking): (a) one image from the sequence, (b) pixel-wise image of standard deviations taken over the entire sequence, (c) image of the learned projection, $h_V$, and (d) image of $h_V$ for incorrect audio.
In the first experiment¹ we collected audio-video data from eight subjects. In all cases, the video data was collected at 29.97 frames per second at a resolution of 360 × 240. The audio signal was collected at 48 kHz, but only 10 kHz of frequency content was used. All subjects were asked to utter a specific phrase.
This typically yielded 2-2.5 s of data. Video frames were processed as is, while the audio signal was transformed to a series of periodograms.
¹A portion of these results appears in [20].
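As an illustration of the preprocessing implied by this setup (an editorial sketch, with window and FFT choices that are not specified in the text above), the audio can be segmented so that one periodogram is produced per video frame and only content below 10 kHz is retained:

```python
import numpy as np

fs = 48_000                  # audio sampling rate (Hz)
fps = 29.97                  # video frame rate
samples_per_frame = int(round(fs / fps))        # about 1602 audio samples per video frame

audio = np.random.randn(fs * 2)                 # stand-in for roughly 2 s of speech
n_frames = len(audio) // samples_per_frame

frames = audio[: n_frames * samples_per_frame].reshape(n_frames, samples_per_frame)
window = np.hanning(samples_per_frame)
spectra = np.abs(np.fft.rfft(frames * window, axis=1)) ** 2   # one periodogram per frame

freqs = np.fft.rfftfreq(samples_per_frame, d=1.0 / fs)
periodograms = spectra[:, freqs <= 10_000]      # keep only content below 10 kHz
print(periodograms.shape)
```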

References
[14] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: Wiley, 1991.
[15] E. Parzen, "On estimation of a probability density function and mode," Annals of Mathematical Statistics, vol. 33, pp. 1065-1076, 1962.
[19] A. Mahalanobis, B. V. K. Vijaya Kumar, and D. Casasent, "Minimum average correlation energy filters," Applied Optics, vol. 26, no. 17, pp. 3633-3640, 1987.