
Sequential Monte Carlo Fusion of Sound and Vision for Speaker Tracking
J. Vermaak, M. Gangnet, A. Blake and P. Pérez
Microsoft Research Cambridge, Cambridge CB2 3NH, UK
Web: http://www.research.microsoft.com/vision
Abstract
Video telephony could be considerably enhanced by provision of a tracking system that allows freedom of movement to the speaker, while maintaining a well-framed image, for transmission over limited bandwidth. Already commercial multi-microphone systems exist which track speaker direction in order to reject background noise. Stereo sound and vision are complementary modalities in that sound is good for initialisation (where vision is expensive) whereas vision is good for localisation (where sound is less precise). Using generative probabilistic models and particle filtering, we show that stereo sound and vision can indeed be fused effectively, to make a system more capable than with either modality on its own.
1. Introduction
We establish design principles and demonstrate a working system that fuses stereophonic sound localisation with active contour tracking. Where more ambitious systems with several microphone pairs and several cameras, possibly steerable, could potentially handle free-format multi-speaker interactions, here we aim at something more modest. A single, fixed camera with a single collocated microphone pair is well suited to video telephony, serving one or perhaps two speakers in a simple, closed environment. The setup of camera and microphones is illustrated in figure 1.
The processing of stereo sound is based on cross-correlation of the signal pair as a means of analysing Time Delay of Arrival (TDOA). In acoustic environments with relatively low noise and reverberation, triangulation based on the TDOA of measurements at a microphone pair [5, 12] is effective. In even moderately reverberant conditions, problems arise in that no unique TDOA can be determined. Some heuristic modifications to reduce the effects of reverberation have been proposed in e.g. [4, 6, 14], but these are reliant on either specific array configurations, or rather strong assumptions about the source signals and acoustic environment, and are far from robust in general scenarios. The alternative pursued here is to acknowledge that the TDOA $D$ cannot uniquely be determined, to record a sequence $\{D_i\}$ of candidate TDOAs, and to model them jointly as probabilistic observations in clutter.

Figure 1. Audiovisual setup. A single microphone pair is positioned laterally and symmetrically with respect to the camera's optical axis, with its baseline in a horizontal plane.

The visual tracking uses a standard approach, based on a generative model for motion in a suitable contour state-space, together with a likelihood based on one-dimensional feature searches against a background of image clutter [8]. The probabilistic modelling of visual observations along a line is analogous to the processing of sound-signal cross-correlation peaks along the time-delay axis, in that both deal with linear search against a background of random clutter.

A particle filter is applied to fuse predictions from the generative model with aural and visual observations. This results in a tracking capability whose robustness is enhanced relative to vision alone. The sound information provides for initialisation, and helps considerably with recovery from loss of lock, as we demonstrate.
2. Observation Model for Sound
The sound measurement system consists of a pair of omnidirectional microphones as in figure 1, situated at positions $\mathbf{m}_1$ and $\mathbf{m}_2$ in the horizontal plane $y = 0$.
2.1. Time Delay of Arrival (TDOA)
The maximum TDOA that can be measured is $D_{\max} = c^{-1}\,\|\mathbf{m}_1 - \mathbf{m}_2\|$, with $c$ the speed of sound (normally taken to be 342 m s$^{-1}$), and $\|\cdot\|$ the Euclidean norm. The true TDOA is given by

$$D = c^{-1}\left(\|\mathbf{R} - \mathbf{m}_1\| - \|\mathbf{R} - \mathbf{m}_2\|\right), \qquad (1)$$

where $\mathbf{R} = (x, y, z)$ is the source location. Apart from the true source, "ghost sources" due to reverberation lead to additional correspondences between left and right audio signals. These show up as additional peaks in the generalised cross-correlation function (GCCF) [9], as in figure 2.
Rather than trying to eliminate the spurious peaks, for example by further signal processing, they are acknowledged explicitly, knowing that the particle filter mechanism used for tracking and fusion is quite capable of assimilating them. The audio observation vector is therefore defined to be $\mathbf{z}_A = (D_1, \ldots, D_N)$, the $N$ candidate TDOA measurements corresponding to the delay timings of the peaks of the GCCF. In what follows $N$ is also considered unknown. Peaks not due to the true source are regarded as clutter.
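For concreteness, a minimal sketch of extracting such a candidate list is given below. The PHAT weighting of the GCCF and the fixed cap on the number of peaks are illustrative assumptions, not choices specified in the paper.

```python
import numpy as np

def candidate_tdoas(left, right, fs, d_max, n_peaks=4):
    """Candidate TDOAs z_A = (D_1, ..., D_N) from the peaks of a GCCF [9].
    PHAT weighting and n_peaks are illustrative assumptions."""
    n = len(left) + len(right)
    spec = np.fft.rfft(left, n) * np.conj(np.fft.rfft(right, n))
    spec /= np.abs(spec) + 1e-12                 # PHAT weighting (assumed)
    gccf = np.fft.fftshift(np.fft.irfft(spec, n))
    lags = (np.arange(n) - n // 2) / fs          # lag of each sample, seconds
    g = np.where(np.abs(lags) <= d_max, gccf, -np.inf)  # admissible delays only
    # interior local maxima of the admissible GCCF
    peaks = np.where((g[1:-1] > g[:-2]) & (g[1:-1] > g[2:]))[0] + 1
    best = peaks[np.argsort(g[peaks])[::-1][:n_peaks]]
    return np.sort(lags[best])
```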
2.2. Likelihood Model
A complete derivation and discussion of the likelihood model for the TDOA $D$ can be found in [13]. The resulting likelihood follows from a multi-hypothesis analysis [2] under the assumption of mutual independence of the TDOA measurements, and is given by

$$L(\mathbf{z}_A \mid D) \;\propto\; \frac{q_0}{2 D_{\max}} \;+\; c \sum_{i=1}^{N} q_i\, \mathcal{N}\!\left(D_i;\, D,\, \sigma_D^2\right) I_{\mathcal{D}}(D) \qquad (2)$$
if speech is present, and $L(\mathbf{z}_A \mid D) \propto 1$ if it is absent, so that no influence is exerted by the audio stream in this case. In (2), $q_0$ is the prior probability of all measurements being due to clutter, $q_i$, $i = 1, \ldots, N$, is the prior probability of the $i$-th measurement corresponding to the true TDOA, $c$ is a normalising constant, and $I_{\mathcal{D}}(\cdot)$ is the indicator function for the set $\mathcal{D} = [-D_{\max}, D_{\max}]$. The variance $\sigma_D^2$ depends on the signal-to-noise ratio and the reverberation time of the acoustic environment, and can be set empirically. However, the performance of the tracking algorithm proved to be robust to the accuracy of the value chosen for this parameter.

Figure 2. Reverberation generates multiple correspondences. A speech signal of around 3 seconds duration (top) gives a correlogram with multiple peaks (middle), shown by dark blobs. The true delay trajectory is shown overlaid. A slice of the correlogram (bottom) at $t = 0.5$ s shows multiple peaks, and the peak of highest magnitude is not in fact the true peak (marked by a vertical line).
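The likelihood (2) is straightforward to evaluate. The sketch below covers the speech-present case, absorbs the constant $c$ into the proportionality, and assumes, for illustration, equal true-peak priors $q_i = (1 - q_0)/N$; the paper does not fix these priors.

```python
import numpy as np

def sound_likelihood(z_a, d, d_max, sigma_d, q0=0.1):
    """Evaluate equation (2) up to proportionality (speech present).
    q0 and the equal priors q_i = (1 - q0)/N are illustrative choices."""
    z_a = np.asarray(z_a, dtype=float)
    clutter = q0 / (2.0 * d_max)
    if len(z_a) == 0 or abs(d) > d_max:   # I_D(D) vanishes outside [-D_max, D_max]
        return clutter
    qi = (1.0 - q0) / len(z_a)
    gauss = np.exp(-0.5 * ((z_a - d) / sigma_d) ** 2) / (np.sqrt(2 * np.pi) * sigma_d)
    return clutter + qi * gauss.sum()     # constant c absorbed
```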
3. Observation Model for Vision
The observation model for vision is summarised here, although we do not go into full detail as the model follows largely established practice for active contour tracking.
3.1. Image Processing
Following standard practice in visual search algorithms (see e.g. [10]), visual measurements are taken along lines normal to an outline curve $C$, as in figure 3.

Figure 3. Image observations in clutter. Image measurements are made along lines normal to a hypothesised head contour $C$, as shown.
Point features along normals $\{\mathbf{n}^{(j)},\ j = 1, \ldots, M\}$ are defined by sampling the intensity function regularly along the line using bilinear interpolation of pixels, convolving with the gradient of a Gaussian (with width $w$ approximately 1 pixel), and marking those maxima of response that exceed a gradient threshold $g$. The resulting features have offsets $\nu^{(j)} = \{\nu^{(j)}_i,\ i = 1, \ldots, N_j\}$, where an offset $\nu = 0$ indicates a feature lying on the hypothesised contour $C$. The combined image measurement is then $\mathbf{z}_I = \{\nu^{(j)},\ j = 1, \ldots, M\}$. It is important for inter-frame stability that $g$ is defined relative to global image statistics, rather than local statistics gathered along one normal, or from the normals of one outline curve (which would be economical, computationally). Instead, independent samples of image gradient $|\nabla I|$ are taken distributed evenly over the image, and $g$ is set to retain a proportion $k_g$ of the strongest responses. Typically $g$ is set to give $k_g = 30\%$. This imparts a measure of invariance to global illumination changes and drift in camera gain.
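A minimal sketch of this one-dimensional feature search follows, using SciPy for the bilinear interpolation and derivative-of-Gaussian filtering. The sampling density and search length are illustrative parameters; `p0` and `normal` are a contour point and its unit normal in (row, col) coordinates.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d, map_coordinates

def normal_features(image, p0, normal, half_len, g, m_samples=41):
    """Feature offsets nu_i along one measurement line normal to the contour."""
    s = np.linspace(-half_len, half_len, m_samples)   # signed offsets nu
    pts = p0[:, None] + normal[:, None] * s           # (2, m) sample coordinates
    intensity = map_coordinates(image, pts, order=1)  # bilinear interpolation
    # derivative-of-Gaussian response, width ~1 pixel as in the paper
    response = np.abs(gaussian_filter1d(intensity, sigma=1.0, order=1))
    is_max = (response[1:-1] >= response[:-2]) & (response[1:-1] >= response[2:])
    idx = np.where(is_max & (response[1:-1] > g))[0] + 1
    return s[idx]                                     # offsets on this normal
```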
3.2. Image Likelihood
Likelihood modelling for observations along an individual normal $\mathbf{n}^{(j)}$ is straightforward, following similar reasoning as in the case of delay measurements for the sound signals above, to give a likelihood

$$L^{(j)}_C \;\propto\; q_0 \;+\; \frac{1 - q_0}{N_j} \sum_{i=1}^{N_j} \mathcal{N}\!\left(\nu^{(j)}_i;\, 0,\, \sigma_I^2\right),$$

where $q_0$ is the non-detection probability for the visual contour (independent of $q_0$ for sound), and which is typically set to $q_0 = 0.2$ for reasonable behaviour. The variance $\sigma_I^2$ for valid contour measurements, assumed Gaussian, is determined from the residuals of contour fits on a few training images. Finally, the global image likelihood is computed as a product

$$L(\mathbf{z}_I \mid C) \;\propto\; \prod_{j=1}^{M} L^{(j)}_C,$$

assuming joint independence of the $\nu^{(j)}$, and this is something for which some experimental justification has recently been claimed [11].
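In code, the (log of the) global image likelihood might be evaluated along these lines, taking as input the per-normal offset lists from the feature search above; this is a sketch of the formulas just given, not a definitive implementation.

```python
import numpy as np

def contour_log_likelihood(offsets_per_normal, sigma_i, q0=0.2):
    """log L(z_I | C): clutter term plus Gaussian mixture over offsets
    on each normal, multiplied (summed in log) over the M normals."""
    log_l = 0.0
    for nu in offsets_per_normal:          # nu: offsets found on one normal
        nu = np.asarray(nu, dtype=float)
        term = q0                          # non-detection probability
        if len(nu) > 0:
            gauss = np.exp(-0.5 * (nu / sigma_i) ** 2) / (np.sqrt(2 * np.pi) * sigma_i)
            term += (1.0 - q0) / len(nu) * gauss.sum()
        log_l += np.log(term)
    return log_l
```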
4. State-Space Model
In this study an image-based configuration space is used, on the camera plane $(x, y)$, rather than in 3D coordinates. This does not allow head rotation to be fully modelled, but that is outside the scope of a system that tracks only a bounding contour, as reported here, and the audiovisual calibration of an image-based system is more straightforward than the full 3D case.
4.1. Configuration Space
The image-based configuration $\mathbf{X} = (\mathbf{r}, T)$ consists of the image coordinates $\mathbf{r} = (x, y)$ of the centroid of a head-outline template, and the template itself as a curve $\mathbf{r}_0(s)$, obtained by drawing around a single-frame head, and which is perturbed affinely by $T$ (a $2 \times 2$ matrix):

$$\mathbf{r}_{\mathbf{X}}(s) = \mathbf{r} + (T + I)\,\mathbf{r}_0(s).$$

Further variability could easily be introduced using key-frames [3], but affine variability suffices for the experiments reported here. The head-outline template $\mathbf{r}_{\mathbf{X}}$ is exactly the curve $C$ used to obtain the visual measurements in section 3. The image-based configuration used here does not allow the direct use of (1) to compute the TDOA $D$, since the 3D position $\mathbf{R}$ corresponding to a hypothesised configuration $\mathbf{X}$ is not uniquely determined. However, the geometry of the setup allows $D$ to be computed from $x$ using the Fraunhofer approximation $D = D_{\max} \cos(\arctan(f/x))$, where $f$ is the focal length of the camera, for which a pinhole model is adopted.
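A sketch of this mapping follows, using a sign-safe `arctan2` in place of $\arctan(f/x)$ so that sources left of the optical axis yield negative delays. As a sanity check, a source on the optical axis ($x = 0$) gives $D = 0$, and $x \to \pm\infty$ gives $D \to \pm D_{\max}$.

```python
import numpy as np

def tdoa_from_x(x, f, d_max):
    """Predicted TDOA under the Fraunhofer (far-field) approximation.
    x: image x-coordinate relative to the optical axis; f: focal length
    in the same units (pinhole camera model)."""
    return d_max * np.cos(np.arctan2(f, x))   # D = D_max cos(arctan(f / x))
```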
4.2. Dynamical Model
A stratified dynamical model was used to reflect the different kinds and degrees of variability that are appropriate to the head-tracking task. The greatest variability is in horizontal motion ($x$-coordinate), followed by vertical motion ($y$), neither of which should be drawn towards any particular origin in the image, but which should remain within the field of view. Shape variability ($T$) is more constrained, of smaller magnitude, and with a restoring tendency towards the home template $\mathbf{r}_0$. Dynamical models that reflect this are as follows, expressed discretely with respect to a sampling time interval $\tau$ (the video frame-rate).
The displacement process is modelled as Langevin motion [1], as for a free particle in a liquid:

$$\ddot{x}(t) + \beta^{(x)} \dot{x}(t) = w(t),$$

with thermal excitation $w(t)$. The parameters of such a process are most naturally specified in terms of continuous-time parameters which have clear physical interpretations: the rate constant $\beta^{(x)}$ s$^{-1}$, and the steady-state root-mean-square velocity $\bar{v}^{(x)}$ m s$^{-1}$. It corresponds to a discrete process

$$u_t = a^{(x)} u_{t-1} + b^{(x)} w^{(x)}_t, \qquad x_t = x_{t-1} + \tau u_t,$$

in which the $w^{(x)}_t$ are $\mathcal{N}(0, 1)$ variables and

$$a^{(x)} = \exp(-\beta^{(x)} \tau), \qquad b^{(x)} = \bar{v}^{(x)} \sqrt{1 - (a^{(x)})^2}.$$

In experiments, we fixed $\beta^{(x)} = \beta^{(y)} = 10$ s$^{-1}$, and $\bar{v}^{(x)}$ and $\bar{v}^{(y)}$ to 10% and 5% of the field of view in the respective directions, per second.
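One step of this discrete displacement process might look as follows. The 25 Hz frame-rate, and hence $\tau = 1/25$ s, is an assumption for illustration (the paper does not state the frame-rate); $x$ and $\bar{v}$ are expressed in field-of-view fractions, as above.

```python
import numpy as np

def langevin_step(x, u, beta=10.0, v_bar=0.1, tau=1.0 / 25.0):
    """One step of the discretised Langevin displacement model.
    tau = 1/25 s assumes a 25 Hz video frame-rate (illustrative)."""
    a = np.exp(-beta * tau)
    b = v_bar * np.sqrt(1.0 - a * a)
    u_new = a * u + b * np.random.randn(*np.shape(u))  # AR(1) velocity update
    x_new = x + tau * u_new                            # integrate position
    return x_new, u_new
```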
For the affine matrix, the model follows a stable, critically damped 2nd-order autoregressive process, whose parameters are specified by a temporal rate constant $\beta^{(T)}$ and a steady-state root-mean-square magnitude $\rho^{(T)}$ (dimensionless). It takes the discrete form

$$T_t = a^{(T)}_1 T_{t-1} + a^{(T)}_2 T_{t-2} + b^{(T)} w^{(T)}_t,$$

in which the $w^{(T)}_t$ are $\mathcal{N}(0, 1)$ variables, and with $a^{(T)}_1$, $a^{(T)}_2$, $b^{(T)}$ set in terms of $\beta^{(T)}$, $\rho^{(T)}$ according to well-known rules [3, p. 206]. We set $\beta^{(T)} = 10$ s$^{-1}$ and $\rho^{(T)}$ to 10%.
5. Particle Filter Tracking Algorithm
The general tracking problem involves the recursive estimation of the filtering distribution $p(\mathbf{X}_k \mid \mathbf{z}_{1:k})$, with $\mathbf{z} = (\mathbf{z}_A, \mathbf{z}_I)$ and the subscript $1\!:\!k$ denoting all the observations from time 1 to time $k$, from which estimates of the configuration $\mathbf{X}$ can be obtained. The general recursions to compute the filtering distribution are given by

$$p(\mathbf{X}_k \mid \mathbf{z}_{1:k-1}) = \int p(\mathbf{X}_k \mid \mathbf{X}_{k-1})\, p(d\mathbf{X}_{k-1} \mid \mathbf{z}_{1:k-1})$$
$$p(\mathbf{X}_k \mid \mathbf{z}_{1:k}) \;\propto\; L(\mathbf{z}_{A,k} \mid D_k)\, L(\mathbf{z}_{I,k} \mid C_k)\, p(\mathbf{X}_k \mid \mathbf{z}_{1:k-1}),$$

where the first, or prediction, step uses the dynamical model and the filtering distribution at the previous time step to compute the one-step-ahead prediction distribution, which then acts as the prior for the configuration in the second, or update, step, where it is combined with the likelihood to obtain the filtering distribution.
Due to the non-linearity and multi-modality inherent in the problem, the recursions above are analytically intractable. Under these conditions sequential Monte Carlo, or particle filtering, methods [7, 8] provide an attractive solution strategy. The particular particle filter architecture adopted here deviates from the standard particle filter, and makes the best use of the properties of the model. Since the sound likelihood depends only on $D$, which in turn depends on the configuration $\mathbf{X}$ only via the image $x$ coordinate, the sampling is separated into two stages by partitioned sampling [11]. In the first stage samples for the $x$ coordinate are generated from a proposal distribution which is a mixture of the dynamics for $x$ and the sound likelihood, viewed as a distribution in $x$. These samples are then properly reweighted with the sound likelihood, and resampled to populate $x$ regions with high probability under the sound likelihood. In the second stage the remaining components $y$ and $T$ are proposed from their corresponding dynamics, reweighted with the image likelihood, and resampled. Since the broadest variations in $\mathbf{X}$ are due to $x$ and $y$, separating $x$ and $y$ leads to a considerable improvement in sampling efficiency, palpable as a reduction in the number of particles needed per time step.
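A skeleton of one such two-stage fusion step is sketched below. It is deliberately simplified: the $x$ proposal comes from the dynamics alone rather than the mixture of dynamics and sound likelihood described above, and the function arguments merely stand in for the models of sections 2 to 4; their signatures are illustrative, not from the paper.

```python
import numpy as np

def resample(particles, weights):
    """Multinomial resampling; weights need not be normalised."""
    p = weights / weights.sum()
    idx = np.random.choice(len(particles), size=len(particles), p=p)
    return particles[idx].copy()

def fusion_step(particles, propagate_x, sound_lik_x, propagate_yT, image_lik):
    """One simplified partitioned-sampling step; particles is an (N, d)
    array with the x coordinate in column 0 and (y, T) in the rest."""
    # Stage 1: x coordinate only -- propose, weight by sound, resample.
    particles[:, 0] = propagate_x(particles[:, 0])
    w = np.array([sound_lik_x(x) for x in particles[:, 0]])
    particles = resample(particles, w)
    # Stage 2: remaining components -- propose from their dynamics,
    # weight by the image likelihood, resample.
    particles[:, 1:] = propagate_yT(particles[:, 1:])
    w = np.array([image_lik(p) for p in particles])
    return resample(particles, w)
```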
6. Results
We illustrate our system on two test sequences. The first sequence starts with a subject moving slowly from the centre of the image to the left, while all the time being quiet. The subject then moves rapidly to the right, where he pauses and speaks, and then progresses back to the centre. The second sequence involves two subjects (A and B), both appearing in the image at the same time. The subjects take turns to speak, while all the time moving their heads around to some degree. In what follows we will refer to the first sequence as the motion sequence, and to the second as the ping-pong sequence.
We performed two particle filtering experiments, using 20 particles in all cases, on each of the sequences. In the first experiment only the visual measurements were used to perform tracking, while visual and sound measurements were combined in the second experiment. The results of the first experiment on the motion sequence are summarised by the key frames at the top of figure 4. The particle filter successfully tracks the subject during the period of normal motion to the left, but loses track during the rapid motion to the right. The particles latch on to prominent features in the background and never recover the subject again. However, in the second experiment, where the visual measurements are combined with the sound measurements, the particle filter is able to reinitialise on the subject immediately as soon as he speaks, and the subsequent tracking is successful, as illustrated by the key frames at the bottom of figure 4.
Similar results were obtained on the ping-pong sequence, and are summarised by the key frames in figure 5. In the case where only visual measurements are used, the particles remain focussed on subject A, where they were initialised, regardless of which subject is speaking. When, on the other hand, the sound measurements are also used, the particles jump back and forth between the two subjects as they take turns in the conversation. Thus, the algorithm could be integrated into a teleconferencing system to determine the speaker on whom a steerable camera should focus.
Notably, these results were achieved with low-cost, off-the-shelf equipment. The system was only very roughly calibrated, proved to be robust to the exact values chosen for the intrinsic parameters of the camera, and did not require extremely careful placement of the microphones relative to the camera. Furthermore, no attempt was made to compensate for reverberation and background noise, of which there was a fair amount due to fan and air-conditioner noise. Also, as is evident from the result sequences, tracking was performed against a cluttered background with many objects that could potentially distract a vision-only tracking algorithm. Thus, the combination of sound and vision achieves far more robust tracking performance, at low computational cost, than either modality on its own.
7. Conclusions
Further investigations are looking at the following issues:

- Ultimately a full 3D system may be desirable, so head rotation can be fully modelled, including displacement of the mouth relative to the centre of the head image.
- So far studies have been based on stored audiovisual sequences, but preliminary indications (based on software profiling) suggest that a real-time system should be quite feasible without special hardware, and work is currently in progress to achieve this.
- A more powerful system, on a teleconferencing scale, would use several microphones, or microphone pairs, distributed widely, not just at the camera centre. This would require full three-dimensional calibration, which would be somewhat facilitated by the use of more than one camera.
- It may be possible, and beneficial, to cut out the intermediate stage of marking correlation maxima for sound-signal pairs, and evaluate a likelihood computed directly from the instantaneous value of the correlation (i.e. for one fixed delay $D$), if such a likelihood could satisfactorily be defined.
References
[1] K. Åström. Introduction to Stochastic Control Theory. Academic Press, 1970.
[2] Y. Bar-Shalom and T. Fortmann. Tracking and Data Association. Academic Press, 1988.
[3] A. Blake and M. Isard. Active Contours. Springer, 1998.
[4] M. S. Brandstein. Time-delay estimation of reverberant speech exploiting harmonic structure. Journal of the Acoustical Society of America, 105(5):2914–2919, 1999.
[5] M. S. Brandstein and H. F. Silverman. A practical methodology for speech source localization with microphone arrays. Computer Speech and Language, 11(2):91–126, 1997.
[6] M. S. Brandstein and H. F. Silverman. A robust method for speech signal time-delay estimation in reverberant rooms. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 375–378, 1997.
[7] N. Gordon, D. Salmond, and A. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F, 140(2):107–113, 1993.
[8] M. Isard and A. Blake. Visual tracking by stochastic propagation of conditional density. In Proc. 4th European Conf. Computer Vision, pages 343–356, Cambridge, England, April 1996.
[9] C. H. Knapp and G. C. Carter. The generalized correlation method for estimation of time delay. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-24(4):320–327, 1976.
[10] D. Lowe. Robust model-based motion tracking through the integration of search and estimation. Int. J. Computer Vision, 8(2):113–122, 1992.
[11] J. MacCormick and A. Blake. A probabilistic exclusion principle for tracking multiple objects. In Proc. Int. Conf. on Computer Vision, pages 572–578, 1999.
[12] H. F. Silverman and E. Kirtman. A two-stage algorithm for determining talker location from linear microphone array data. Computer Speech and Language, 6:129–152, 1992.
[13] J. Vermaak and A. Blake. Nonlinear filtering for speaker tracking in noisy and reverberant environments. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2001.
[14] H. Wang and P. Chu. Voice source localization for automatic camera pointing system in videoconferencing. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 187–190, 1997.
