
Sequential Monte Carlo Fusion of Sound and Vision for Speaker Tracking
J. Vermaak, M. Gangnet, A. Blake and P. Pérez
Microsoft Research Cambridge, Cambridge CB2 3NH, UK
Web: http://www.research.microsoft.com/vision
Abstract
Video telephony could be considerably enhanced by provision of a tracking system that allows freedom of movement to the speaker, while maintaining a well-framed image, for transmission over limited bandwidth. Already commercial multi-microphone systems exist which track speaker direction in order to reject background noise. Stereo sound and vision are complementary modalities in that sound is good for initialisation (where vision is expensive) whereas vision is good for localisation (where sound is less precise). Using generative probabilistic models and particle filtering, we show that stereo sound and vision can indeed be fused effectively, to make a system more capable than with either modality on its own.
1. Introduction
We establish design principles and demonstrate a working system that fuses stereophonic sound localisation with active contour tracking. Where more ambitious systems with several microphone pairs and several cameras, possibly steerable, could potentially handle free-format multi-speaker interactions, here we aim at something more modest. A single, fixed camera with a single collocated microphone pair is well suited to video telephony, serving one or perhaps two speakers in a simple, closed environment. The setup of camera and microphones is illustrated in figure 1.
The processing of stereo sound is based on cross-correlation of the signal pair as a means of analysing Time Delay of Arrival (TDOA). In acoustic environments with relatively low noise and reverberation, triangulation based on the TDOA of measurements at a microphone pair [5, 12] is effective. In even moderately reverberant conditions, problems arise in that no unique TDOA can be determined. Some heuristic modifications to reduce the effects of reverberation have been proposed in e.g. [4, 6, 14], but these are reliant on either specific array configurations, or rather strong assumptions about the source signals and acoustic environment, and are far from robust in general scenarios. The alternative pursued here is to acknowledge that the TDOA $D$ cannot uniquely be determined, to record a sequence $\{D_i\}$ of candidate TDOAs, and to model them jointly as probabilistic observations in clutter.

Figure 1. Audiovisual setup. A single microphone pair is positioned laterally and symmetrically with respect to the camera's optical axis, with its baseline in a horizontal plane.

The visual tracking uses a standard approach, based on a generative model for motion in a suitable contour state-space, together with a likelihood based on one-dimensional feature searches against a background of image clutter [8]. The probabilistic modelling of visual observations along a line is analogous to the processing of sound-signal cross-correlation peaks along the time-delay axis, in that both deal with linear search against a background of random clutter.

A particle filter is applied to fuse predictions from the generative model with aural and visual observations. This results in a tracking capability whose robustness is enhanced relative to vision alone. The sound information provides for initialisation, and helps considerably with recovery from loss of lock, as we demonstrate.
2. Observation Model for Sound
The sound measurement system consists of a pair of omnidirectional microphones as in figure 1, situated at positions $\mathbf{m}_1$ and $\mathbf{m}_2$ in the horizontal plane $y = 0$.
2.1. Time Delay of Arrival (TDOA)
The maximum TDOA that can be measured is $D_{\max} = c^{-1}\,\|\mathbf{m}_1 - \mathbf{m}_2\|$, with $c$ the speed of sound (normally taken to be 342 m s$^{-1}$), and $\|\cdot\|$ the Euclidean norm. The true TDOA is given by

$$D = c^{-1}\left(\|\mathbf{R} - \mathbf{m}_1\| - \|\mathbf{R} - \mathbf{m}_2\|\right), \qquad (1)$$

where $\mathbf{R} = (x, y, z)$ is the source location. Apart from the true source, "ghost sources" due to reverberation lead to additional correspondences between left and right audio signals. These show up as additional peaks in the generalised cross-correlation function (GCCF) [9], as in figure 2.
Rather than trying to eliminate the spurious peaks, for example by further signal processing, they are acknowledged explicitly, knowing that the particle filter mechanism used for tracking and fusion is quite capable of assimilating them. The audio observation vector is therefore defined to be $\mathbf{z}_A = (D_1, \ldots, D_N)$, the $N$ candidate TDOA measurements corresponding to the delay timings of the peaks of the GCCF. In what follows $N$ is also considered unknown. Peaks not due to the true source are regarded as clutter.
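For concreteness, a minimal sketch of extracting such a candidate list is given below. The PHAT weighting of the GCCF and the fixed cap on the number of peaks are illustrative assumptions, not choices specified in the paper.

```python
import numpy as np

def candidate_tdoas(left, right, fs, d_max, n_peaks=4):
    """Candidate TDOAs z_A = (D_1, ..., D_N) from the peaks of a GCCF [9].
    PHAT weighting and n_peaks are illustrative assumptions."""
    n = len(left) + len(right)
    spec = np.fft.rfft(left, n) * np.conj(np.fft.rfft(right, n))
    spec /= np.abs(spec) + 1e-12                 # PHAT weighting (assumed)
    gccf = np.fft.fftshift(np.fft.irfft(spec, n))
    lags = (np.arange(n) - n // 2) / fs          # lag of each sample, seconds
    g = np.where(np.abs(lags) <= d_max, gccf, -np.inf)  # admissible delays only
    # interior local maxima of the admissible GCCF
    peaks = np.where((g[1:-1] > g[:-2]) & (g[1:-1] > g[2:]))[0] + 1
    best = peaks[np.argsort(g[peaks])[::-1][:n_peaks]]
    return np.sort(lags[best])
```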
2.2. Likelihood Model
A complete derivation and discussion of the likelihood model for the TDOA $D$ can be found in [13]. The resulting likelihood follows from a multi-hypothesis analysis [2] under the assumption of mutual independence of the TDOA measurements, and is given by

$$L(\mathbf{z}_A \mid D) \;\propto\; \frac{q_0}{2 D_{\max}} \;+\; c \sum_{i=1}^{N} q_i\, \mathcal{N}\!\left(D_i;\, D,\, \sigma_D^2\right) I_{\mathcal{D}}(D) \qquad (2)$$
if speech is present, and $L(\mathbf{z}_A \mid D) \propto 1$ if it is absent, so that no influence is exerted by the audio stream in this case. In (2), $q_0$ is the prior probability of all measurements being due to clutter, $q_i$, $i = 1, \ldots, N$, is the prior probability of the $i$-th measurement corresponding to the true TDOA, $c$ is a normalising constant, and $I_{\mathcal{D}}(\cdot)$ is the indicator function for the set $\mathcal{D} = [-D_{\max}, D_{\max}]$. The variance $\sigma_D^2$ depends on the signal-to-noise ratio and the reverberation time of the acoustic environment, and can be set empirically. However, the performance of the tracking algorithm proved to be robust to the accuracy of the value chosen for this parameter.

Figure 2. Reverberation generates multiple correspondences. A speech signal of around 3 seconds duration (top) gives a correlogram with multiple peaks (middle), shown by dark blobs. The true delay trajectory is shown overlaid. A slice of the correlogram (bottom) at $t = 0.5$ s shows multiple peaks, and the peak of highest magnitude is not in fact the true peak (marked by a vertical line).
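The likelihood (2) is straightforward to evaluate. The sketch below covers the speech-present case, absorbs the constant $c$ into the proportionality, and assumes, for illustration, equal true-peak priors $q_i = (1 - q_0)/N$; the paper does not fix these priors.

```python
import numpy as np

def sound_likelihood(z_a, d, d_max, sigma_d, q0=0.1):
    """Evaluate equation (2) up to proportionality (speech present).
    q0 and the equal priors q_i = (1 - q0)/N are illustrative choices."""
    z_a = np.asarray(z_a, dtype=float)
    clutter = q0 / (2.0 * d_max)
    if len(z_a) == 0 or abs(d) > d_max:   # I_D(D) vanishes outside [-D_max, D_max]
        return clutter
    qi = (1.0 - q0) / len(z_a)
    gauss = np.exp(-0.5 * ((z_a - d) / sigma_d) ** 2) / (np.sqrt(2 * np.pi) * sigma_d)
    return clutter + qi * gauss.sum()     # constant c absorbed
```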
3. Observation Model for Vision
The observation model for vision is summarised here, although we do not go into full detail as the model follows largely established practice for active contour tracking.
3.1. Image Processing
Following standard practice in visual search algorithms (see e.g. [10]), visual measurements are taken along lines normal to an outline curve $C$, as in figure 3.

Figure 3. Image observations in clutter. Image measurements are made along lines normal to a hypothesised head contour $C$, as shown.
Point features along normals $\{\mathbf{n}^{(j)},\ j = 1, \ldots, M\}$ are defined by sampling the intensity function regularly along the line using bilinear interpolation of pixels, convolving with the gradient of a Gaussian (with width $w$ approximately 1 pixel), and marking those maxima of response that exceed a gradient threshold $g$. The resulting features have offsets $\nu^{(j)} = \{\nu^{(j)}_i,\ i = 1, \ldots, N_j\}$, where an offset $\nu = 0$ indicates a feature lying on the hypothesised contour $C$. The combined image measurement is then $\mathbf{z}_I = \{\nu^{(j)},\ j = 1, \ldots, M\}$. It is important for inter-frame stability that $g$ is defined relative to global image statistics, rather than local statistics gathered along one normal, or from the normals of one outline curve (which would be economical, computationally). Instead, independent samples of image gradient $|\nabla I|$ are taken distributed evenly over the image, and $g$ is set to retain a proportion $k_g$ of the strongest responses. Typically $g$ is set to give $k_g = 30\%$. This imparts a measure of invariance to global illumination changes and drift in camera gain.
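A minimal sketch of this one-dimensional feature search follows, using SciPy for the bilinear interpolation and derivative-of-Gaussian filtering. The sampling density and search length are illustrative parameters; `p0` and `normal` are a contour point and its unit normal in (row, col) coordinates.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d, map_coordinates

def normal_features(image, p0, normal, half_len, g, m_samples=41):
    """Feature offsets nu_i along one measurement line normal to the contour."""
    s = np.linspace(-half_len, half_len, m_samples)   # signed offsets nu
    pts = p0[:, None] + normal[:, None] * s           # (2, m) sample coordinates
    intensity = map_coordinates(image, pts, order=1)  # bilinear interpolation
    # derivative-of-Gaussian response, width ~1 pixel as in the paper
    response = np.abs(gaussian_filter1d(intensity, sigma=1.0, order=1))
    is_max = (response[1:-1] >= response[:-2]) & (response[1:-1] >= response[2:])
    idx = np.where(is_max & (response[1:-1] > g))[0] + 1
    return s[idx]                                     # offsets on this normal
```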
3.2. Image Likelihood
Likelihood modelling for observations along an individual normal $\mathbf{n}^{(j)}$ is straightforward, following similar reasoning as in the case of delay measurements for the sound signals above, to give a likelihood

$$L^{(j)}_C \;\propto\; q_0 \;+\; \frac{1 - q_0}{N_j} \sum_{i=1}^{N_j} \mathcal{N}\!\left(\nu^{(j)}_i;\, 0,\, \sigma_I^2\right),$$

where $q_0$ is the non-detection probability for the visual contour (independent of $q_0$ for sound), and which is typically set to $q_0 = 0.2$ for reasonable behaviour. The variance $\sigma_I^2$ for valid contour measurements, assumed Gaussian, is determined from the residuals of contour fits on a few training images. Finally, the global image likelihood is computed as a product

$$L(\mathbf{z}_I \mid C) \;\propto\; \prod_{j=1}^{M} L^{(j)}_C,$$

assuming joint independence of the $\nu^{(j)}$, and this is something for which some experimental justification has recently been claimed [11].
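In code, the (log of the) global image likelihood might be evaluated along these lines, taking as input the per-normal offset lists from the feature search above; this is a sketch of the formulas just given, not a definitive implementation.

```python
import numpy as np

def contour_log_likelihood(offsets_per_normal, sigma_i, q0=0.2):
    """log L(z_I | C): clutter term plus Gaussian mixture over offsets
    on each normal, multiplied (summed in log) over the M normals."""
    log_l = 0.0
    for nu in offsets_per_normal:          # nu: offsets found on one normal
        nu = np.asarray(nu, dtype=float)
        term = q0                          # non-detection probability
        if len(nu) > 0:
            gauss = np.exp(-0.5 * (nu / sigma_i) ** 2) / (np.sqrt(2 * np.pi) * sigma_i)
            term += (1.0 - q0) / len(nu) * gauss.sum()
        log_l += np.log(term)
    return log_l
```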
4. State-Space Model
In this study an image-based configuration space is used, on the camera plane $(x, y)$, rather than in 3D coordinates. This does not allow head rotation to be fully modelled, but that is outside the scope of a system that tracks only a bounding contour, as reported here, and the audiovisual calibration of an image-based system is more straightforward than the full 3D case.
4.1. Configuration Space
The image-based configuration $\mathbf{X} = (\mathbf{r}, T)$ consists of the image coordinates $\mathbf{r} = (x, y)$ of the centroid of a head-outline template, and the template itself as a curve $\mathbf{r}_0(s)$, obtained by drawing around a single-frame head, and which is perturbed affinely by $T$ (a $2 \times 2$ matrix):

$$\mathbf{r}_{\mathbf{X}}(s) = \mathbf{r} + (T + I)\,\mathbf{r}_0(s).$$

Further variability could easily be introduced using key-frames [3], but affine variability suffices for the experiments reported here. The head-outline template $\mathbf{r}_{\mathbf{X}}$ is exactly the curve $C$ used to obtain the visual measurements in section 3. The image-based configuration used here does not allow the direct use of (1) to compute the TDOA $D$, since the 3D position $\mathbf{R}$ corresponding to a hypothesised configuration $\mathbf{X}$ is not uniquely determined. However, the geometry of the setup allows $D$ to be computed from $x$ using the Fraunhofer approximation $D = D_{\max} \cos(\arctan(f/x))$, where $f$ is the focal length of the camera, for which a pinhole model is adopted.
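A sketch of this mapping follows, using a sign-safe `arctan2` in place of $\arctan(f/x)$ so that sources left of the optical axis yield negative delays. As a sanity check, a source on the optical axis ($x = 0$) gives $D = 0$, and $x \to \pm\infty$ gives $D \to \pm D_{\max}$.

```python
import numpy as np

def tdoa_from_x(x, f, d_max):
    """Predicted TDOA under the Fraunhofer (far-field) approximation.
    x: image x-coordinate relative to the optical axis; f: focal length
    in the same units (pinhole camera model)."""
    return d_max * np.cos(np.arctan2(f, x))   # D = D_max cos(arctan(f / x))
```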
4.2. Dynamical Model
A stratified dynamical model was used to reflect the different kinds and degrees of variability that are appropriate to the head-tracking task. The greatest variability is in horizontal motion ($x$-coordinate), followed by vertical motion ($y$), neither of which should be drawn towards any particular origin in the image, but which should remain within the field of view. Shape variability ($T$) is more constrained, of smaller magnitude, and with a restoring tendency towards the home template $\mathbf{r}_0$. Dynamical models that reflect this are as follows, expressed discretely with respect to a sampling time interval $\tau$ (the video frame-rate).
The displacement process is modelled as Langevin motion [1], as for a free particle in a liquid:

$$\ddot{x}(t) + \beta^{(x)} \dot{x}(t) = w(t),$$

with thermal excitation $w(t)$. The parameters of such a process are most naturally specified in terms of continuous-time parameters which have clear physical interpretations: the rate constant $\beta^{(x)}$ s$^{-1}$, and the steady-state root-mean-square velocity $\bar{v}^{(x)}$ m s$^{-1}$. It corresponds to a discrete process

$$u_t = a^{(x)} u_{t-1} + b^{(x)} w^{(x)}_t, \qquad x_t = x_{t-1} + \tau u_t,$$

in which the $w^{(x)}_t$ are $\mathcal{N}(0, 1)$ variables and

$$a^{(x)} = \exp(-\beta^{(x)} \tau), \qquad b^{(x)} = \bar{v}^{(x)} \sqrt{1 - (a^{(x)})^2}.$$

In experiments, we fixed $\beta^{(x)} = \beta^{(y)} = 10$ s$^{-1}$, and $\bar{v}^{(x)}$ and $\bar{v}^{(y)}$ to 10% and 5% of the field of view in the respective directions, per second.
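One step of this discrete displacement process might look as follows. The 25 Hz frame-rate, and hence $\tau = 1/25$ s, is an assumption for illustration (the paper does not state the frame-rate); $x$ and $\bar{v}$ are expressed in field-of-view fractions, as above.

```python
import numpy as np

def langevin_step(x, u, beta=10.0, v_bar=0.1, tau=1.0 / 25.0):
    """One step of the discretised Langevin displacement model.
    tau = 1/25 s assumes a 25 Hz video frame-rate (illustrative)."""
    a = np.exp(-beta * tau)
    b = v_bar * np.sqrt(1.0 - a * a)
    u_new = a * u + b * np.random.randn(*np.shape(u))  # AR(1) velocity update
    x_new = x + tau * u_new                            # integrate position
    return x_new, u_new
```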
For the affine matrix, the model follows a stable, critically damped 2nd-order autoregressive process, whose parameters are specified by a temporal rate constant $\beta^{(T)}$ and a steady-state root-mean-square magnitude $\rho^{(T)}$ (dimensionless). It takes the discrete form

$$T_t = a^{(T)}_1 T_{t-1} + a^{(T)}_2 T_{t-2} + b^{(T)} w^{(T)}_t,$$

in which the $w^{(T)}_t$ are $\mathcal{N}(0, 1)$ variables, and with $a^{(T)}_1$, $a^{(T)}_2$, $b^{(T)}$ set in terms of $\beta^{(T)}$, $\rho^{(T)}$ according to well-known rules [3, p. 206]. We set $\beta^{(T)} = 10$ s$^{-1}$ and $\rho^{(T)}$ to 10%.
5. Particle Filter Tracking Algorithm
The general tracking problem involves the recursive estimation of the filtering distribution $p(\mathbf{X}_k \mid \mathbf{z}_{1:k})$, with $\mathbf{z} = (\mathbf{z}_A, \mathbf{z}_I)$ and the subscript $1\!:\!k$ denoting all the observations from time 1 to time $k$, from which estimates of the configuration $\mathbf{X}$ can be obtained. The general recursions to compute the filtering distribution are given by

$$p(\mathbf{X}_k \mid \mathbf{z}_{1:k-1}) = \int p(\mathbf{X}_k \mid \mathbf{X}_{k-1})\, p(d\mathbf{X}_{k-1} \mid \mathbf{z}_{1:k-1})$$
$$p(\mathbf{X}_k \mid \mathbf{z}_{1:k}) \;\propto\; L(\mathbf{z}_{A,k} \mid D_k)\, L(\mathbf{z}_{I,k} \mid C_k)\, p(\mathbf{X}_k \mid \mathbf{z}_{1:k-1}),$$

where the first, or prediction, step uses the dynamical model and the filtering distribution at the previous time step to compute the one-step-ahead prediction distribution, which then acts as the prior for the configuration in the second, or update, step, where it is combined with the likelihood to obtain the filtering distribution.
Due to the non-linearity and multi-modality inherent in the problem, the recursions above are analytically intractable. Under these conditions sequential Monte Carlo, or particle filtering, methods [7, 8] provide an attractive solution strategy. The particular particle filter architecture adopted here deviates from the standard particle filter, and makes the best use of the properties of the model. Since the sound likelihood depends only on $D$, which in turn depends on the configuration $\mathbf{X}$ only via the image $x$ coordinate, the sampling is separated into two stages by partitioned sampling [11]. In the first stage samples for the $x$ coordinate are generated from a proposal distribution which is a mixture of the dynamics for $x$ and the sound likelihood, viewed as a distribution in $x$. These samples are then properly reweighted with the sound likelihood, and resampled to populate $x$ regions with high probability under the sound likelihood. In the second stage the remaining components $y$ and $T$ are proposed from their corresponding dynamics, reweighted with the image likelihood, and resampled. Since the broadest variations in $\mathbf{X}$ are due to $x$ and $y$, separating $x$ and $y$ leads to a considerable improvement in sampling efficiency, palpable as a reduction in the number of particles needed per time step.
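A skeleton of one such two-stage fusion step is sketched below. It is deliberately simplified: the $x$ proposal comes from the dynamics alone rather than the mixture of dynamics and sound likelihood described above, and the function arguments merely stand in for the models of sections 2 to 4; their signatures are illustrative, not from the paper.

```python
import numpy as np

def resample(particles, weights):
    """Multinomial resampling; weights need not be normalised."""
    p = weights / weights.sum()
    idx = np.random.choice(len(particles), size=len(particles), p=p)
    return particles[idx].copy()

def fusion_step(particles, propagate_x, sound_lik_x, propagate_yT, image_lik):
    """One simplified partitioned-sampling step; particles is an (N, d)
    array with the x coordinate in column 0 and (y, T) in the rest."""
    # Stage 1: x coordinate only -- propose, weight by sound, resample.
    particles[:, 0] = propagate_x(particles[:, 0])
    w = np.array([sound_lik_x(x) for x in particles[:, 0]])
    particles = resample(particles, w)
    # Stage 2: remaining components -- propose from their dynamics,
    # weight by the image likelihood, resample.
    particles[:, 1:] = propagate_yT(particles[:, 1:])
    w = np.array([image_lik(p) for p in particles])
    return resample(particles, w)
```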
6. Results
We illustrate our system on two test sequences. The first sequence starts with a subject moving slowly from the centre of the image to the left, while all the time being quiet. The subject then moves rapidly to the right, where he pauses and speaks, and then progresses back to the centre. The second sequence involves two subjects (A and B), both appearing in the image at the same time. The subjects take turns to speak, while all the time moving their heads around to some degree. In what follows we will refer to the first sequence as the motion sequence, and to the second as the ping-pong sequence.
We performed two particle filtering experiments, using 20 particles in all cases, on each of the sequences. In the first experiment only the visual measurements were used to perform tracking, while visual and sound measurements were combined in the second experiment. The results of the first experiment on the motion sequence are summarised by the key frames at the top of figure 4. The particle filter successfully tracks the subject during the period of normal motion to the left, but loses track during the rapid motion to the right. The particles latch on to prominent features in the background and never recover the subject again. However, in the second experiment, where the visual measurements are combined with the sound measurements, the particle filter is able to reinitialise on the subject immediately as soon as he speaks, and the subsequent tracking is successful, as illustrated by the key frames at the bottom of figure 4.
Similar results were obtained on the ping-pong sequence, and are summarised by the key frames in figure 5. In the case where only visual measurements are used, the particles remain focussed on subject A, where they were initialised, regardless of which subject is speaking. When, on the other hand, the sound measurements are also used, the particles jump back and forth between the two subjects as they take turns in the conversation. Thus, the algorithm could be integrated into a teleconferencing system to determine the speaker on whom a steerable camera should focus.
Notably, these results were achieved with low-cost, off-the-shelf equipment. The system was only very roughly calibrated, proved to be robust to the exact values chosen for the intrinsic parameters of the camera, and did not require extremely careful placement of the microphones relative to the camera. Furthermore, no attempt was made to compensate for reverberation and background noise, of which there was a fair amount due to fan and air-conditioner noise. Also, as is evident from the result sequences, tracking was performed against a cluttered background with many objects that could potentially distract a vision-only tracking algorithm. Thus, the combination of sound and vision achieves far more robust tracking performance, at low computational cost, than either modality on its own.
7. Conclusions
Further investigations are looking at the following issues:

- Ultimately a full 3D system may be desirable, so head rotation can be fully modelled, including displacement of the mouth relative to the centre of the head image.
- So far studies have been based on stored audiovisual sequences, but preliminary indications (based on software profiling) suggest that a real-time system should be quite feasible without special hardware, and work is currently in progress to achieve this.
- A more powerful system, on a teleconferencing scale, would use several microphones, or microphone pairs, distributed widely, not just at the camera centre. This would require full three-dimensional calibration, which would be somewhat facilitated by the use of more than one camera.
- It may be possible, and beneficial, to cut out the intermediate stage of marking correlation maxima for sound-signal pairs, and evaluate a likelihood computed directly from the instantaneous value of the correlation (i.e. for one fixed delay $D$), if such a likelihood could satisfactorily be defined.
References
[1] K. Åström. Introduction to Stochastic Control Theory. Academic Press, 1970.
[2] Y. Bar-Shalom and T. Fortmann. Tracking and Data Association. Academic Press, 1988.
[3] A. Blake and M. Isard. Active Contours. Springer, 1998.
[4] M. S. Brandstein. Time-delay estimation of reverberant speech exploiting harmonic structure. Journal of the Acoustical Society of America, 105(5):2914–2919, 1999.
[5] M. S. Brandstein and H. F. Silverman. A practical methodology for speech source localization with microphone arrays. Computer Speech and Language, 11(2):91–126, 1997.
[6] M. S. Brandstein and H. F. Silverman. A robust method for speech signal time-delay estimation in reverberant rooms. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 375–378, 1997.
[7] N. Gordon, D. Salmond, and A. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F, 140(2):107–113, 1993.
[8] M. Isard and A. Blake. Visual tracking by stochastic propagation of conditional density. In Proc. 4th European Conf. Computer Vision, pages 343–356, Cambridge, England, April 1996.
[9] C. H. Knapp and G. C. Carter. The generalized correlation method for estimation of time delay. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-24(4):320–327, 1976.
[10] D. Lowe. Robust model-based motion tracking through the integration of search and estimation. Int. J. Computer Vision, 8(2):113–122, 1992.
[11] J. MacCormick and A. Blake. A probabilistic exclusion principle for tracking multiple objects. In Proc. Int. Conf. on Computer Vision, pages 572–578, 1999.
[12] H. F. Silverman and E. Kirtman. A two-stage algorithm for determining talker location from linear microphone array data. Computer Speech and Language, 6:129–152, 1992.
[13] J. Vermaak and A. Blake. Nonlinear filtering for speaker tracking in noisy and reverberant environments. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2001.
[14] H. Wang and P. Chu. Voice source localization for automatic camera pointing system in videoconferencing. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 187–190, 1997.
