
© [2009] IEEE. Reprinted, with permission, from [Alempijevic, A.; Kodagoda,
S.; Dissanayake, G. Cross-modal localization through mutual information. Intelligent
Robots and Systems, 2009. IROS 2009. IEEE/RSJ International Conference]. This
material is posted here with permission of the IEEE. Such permission of the IEEE
does not in any way imply IEEE endorsement of any of the University of Technology,
Sydney's products or services. Internal or personal use of this material is permitted.
However, permission to reprint/republish this material for advertising or promotional
purposes or for creating new collective works for resale or redistribution must be
obtained from the IEEE by writing to pubs-permissions@ieee.org. By choosing to
view this document, you agree to all provisions of the copyright laws protecting it.

Cross-Modal Localization Through Mutual Information
Alen Alempijevic, Sarath Kodagoda and Gamini Dissanayake
Abstract: Relating information originating from disparate sensors observing a given scene is a challenging task, particularly when an appropriate model of the environment or the behaviour of any particular object within it is not available. One possible strategy to address this task is to examine whether the sensor outputs contain information which can be attributed to a common cause. In this paper, we present an approach to localise this embedded common information through an indirect method of estimating mutual information between all signal sources. The ability of $L_1$ regularization to enforce sparseness of the solution is exploited to identify a subset of signals that are related to each other from among a large number of sensor outputs. As opposed to the conventional $L_2$ regularization, the proposed method leads to faster convergence with far fewer spurious associations. Simulation and experimental results are presented to validate the findings.
I. INTRODUCTION
The world market for sensors and wireless communication technologies is ever growing, prompting the rapid deployment of wireless sensor networks [1]. It is therefore not unreasonable to assume that sensors will be omnipresent in the near future. With the presence of large numbers of sensors and signals, there is a growing interest in cross-modal signal analysis. The objective is not necessarily to geometrically relate the sensors; the emphasis is rather placed on relating parts of the sensor signals. The following fundamental concept in perception is exploited extensively in this paper: motion has, in principle, greater power to specify properties of an object than purely spatial information. Thus, relating signals can generally be carried out by comparing vectors of signals that have been monitored over time. One important aspect of such signal processing is to localize the components of a particular signal that best correlate with another signal originating from the same source.
This type of analysis is reported in various fields including biomedical engineering, climatology, network analysis and economics. In biomedical research, heart rate fluctuations are examined against several interacting physiological mechanisms, including visual cortex activity and respiratory rate [10], in order to determine the neurological status of infants. In climatology, dynamic weather patterns in a particular location are correlated with synoptic meteorological data gathered over time [13]. In economics, the revenue performance of a market is correlated with a large set of economic and social criteria [15].
A. Alempijevic, S. Kodagoda and G. Dissanayake are with the ARC Centre of Excellence for Autonomous Systems (CAS), University of Technology, Sydney, Australia. {a.alempijevic, s.kodagoda, g.dissanayake}@cas.edu.au
There are a number of techniques that are suitable for detecting the statistical dependence of signals. Techniques such as Canonical Correlation Analysis and Principal Components Analysis rely on correlation, a second order statistic. Alternative non-parametric techniques are Kendall's tau, cross correlograms, Mutual Information (MI) and Independent Component Analysis. The selected metric is required to identify non-linear, higher (than second) order statistical dependence between signals. The measure of statistical dependence should be valid without any assumptions about an underlying probability density function, and should be extendible to input signals of high dimensionality. Mutual information is identified as the most promising metric, fulfilling all of these requirements.
The methods for mutual information (MI) estimation can be classified into two broad categories, based on whether mutual information is computed directly, or the condition for maximum MI is obtained indirectly through an optimization process that does not involve computing MI [2], [7]. The most natural way of estimating MI via the direct method is to use a nonparametric density estimator together with the theoretical expression for entropy. However, the definition of entropy requires an integration of the underlying PDF over the set of all possible outcomes. In practice, there is no closed form solution for this integral. Combining the nonparametric density estimator with an approximation of the theoretical entropy has been widely described in the literature to overcome this problem [16]. However, this requires pairwise comparisons of all permutations of input signals to find the most informative statistically dependent pairings, which is not feasible for a large number of signals, such as images.
The indirect MI estimation method determines the most mutually informative signal pairings by mapping the signals into a two dimensional space. The key to obtaining the most informative mapping is a technique that computes the effect of the mapping parameters on the information content in the lower dimensional space. Fisher et al. [8] demonstrate a linear mapping of the signals that maximises MI by defining an objective function that operates on the resulting two dimensional space.
This paper builds upon Fisher's work [8] and our previous research on indirect MI estimation [2] by introducing the $L_1$ norm to obtain a sparse linear mapping. The $L_1$ norm has recently found extensive use in solving convex optimisation problems that recover arbitrary signals from an incomplete set of measurements corrupted by noise [5], and it also exhibits a very useful property: preservation of the sparsity of the relationship between the multidimensional random variables. The $L_1$ norm as a penalty function on the magnitudes of the mapping coefficients is shown to be well suited to the applications examined in this paper, where the mutually informative signals are usually embedded in a large number of non-informative signals.
The remainder of this document is organised as follows. Section II outlines an indirect estimation algorithm for MI. Section III describes the process of finding the maximum MI with the $L_1$ penalty norm and the optimization parameters. Experimental results are presented in Section IV. Section V concludes the paper, providing future research directions.
II. INDIRECT ESTIMATION OF MUTUAL INFORMATION THROUGH NON-LINEAR MAPPINGS

Mutual information between two random vectors $X_1, X_2$ can be defined as follows:

$$I(X_1; X_2) = H(X_1) + H(X_2) - H(X_1, X_2) \qquad (1)$$

where $H(X_1)$ and $H(X_2)$ are the entropies of $X_1$ and $X_2$ respectively, and $H(X_1, X_2)$ is the joint entropy term. Direct estimation of MI requires calculation of the entropy terms in (1). The entropy $H(X_1)$, also referred to as Shannon's entropy, of a random variable $X_1$ with density $p(x_1)$ is given by

$$H(X_1) = -\int_{\Omega} p(x_1) \log(p(x_1))\, dx_1 \qquad (2)$$

where $\Omega$ is the set of possible outcomes.
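To make (1) and (2) concrete, the following minimal sketch (a hypothetical illustration, not the authors' code) estimates the entropy terms with a simple histogram density estimator and combines them via the identity (1). This is the direct method discussed above; note that it must be run for every candidate signal pairing, which is exactly what becomes infeasible for large signal sets.

```python
import numpy as np

def entropy_hist(x, bins=16):
    """Histogram approximation of the differential entropy (2), in nats."""
    counts, edges = np.histogram(x, bins=bins)
    p = counts / counts.sum()            # empirical bin probabilities
    p = p[p > 0]                         # convention: 0 log 0 = 0
    width = edges[1] - edges[0]
    return -np.sum(p * np.log(p)) + np.log(width)

def mutual_info_hist(x1, x2, bins=16):
    """MI via the identity (1): I = H(X1) + H(X2) - H(X1, X2)."""
    counts, ex, ey = np.histogram2d(x1, x2, bins=bins)
    p = counts / counts.sum()
    p = p[p > 0]
    cell = (ex[1] - ex[0]) * (ey[1] - ey[0])
    h12 = -np.sum(p * np.log(p)) + np.log(cell)
    return entropy_hist(x1, bins) + entropy_hist(x2, bins) - h12

rng = np.random.default_rng(0)
s = rng.uniform(-1, 1, 100)                   # a common-cause signal
x1 = s + 0.1 * rng.standard_normal(100)       # sensor 1 observes the cause
x2 = rng.uniform(-1, 1, 100)                  # sensor 2 output, unrelated
print(mutual_info_hist(x1, s))                # high: shared information
print(mutual_info_hist(x1, x2))               # near zero: no common cause
```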
There are two distinct problems that need addressing when calculating entropy in this form: first, estimating the unknown underlying PDF of the random variable to obtain $p(x_1)$ over the entire space $\Omega$, and second, integrating over the set of all possible outcomes. Both are addressed through indirect estimation.
Mutual information between two high dimensional signals $X_1$ and $X_2$ can be indirectly estimated by mapping the signals into a lower dimensional space, exploiting the data processing inequality [6], which defines a lower bound on mutual information. The inequality states

$$I(g(\alpha_1, X_1); g(\alpha_2, X_2)) \leq I(X_1; X_2) \qquad (3)$$

for any random vectors $X_1$ and $X_2$ and any functions $g(\alpha, \cdot)$ defined on the ranges of $X_1$ and $X_2$ respectively. The generality of the data processing inequality implies that there are no constraints on the choice of transformations $g(\cdot)$. Furthermore, as the functions $g(\alpha, \cdot)$ map the input data into a lower dimensional space, computing the information content $I(g(\alpha_1, X_1); g(\alpha_2, X_2))$ is significantly easier.
The mappings $Y_1 = g(\alpha_1, X_1)$ and $Y_2 = g(\alpha_2, X_2)$ can be achieved through any differentiable function, such as the hyperbolic tangent [11] or multilayer perceptrons [8]. However, linear projections are preferred, because the linear projection coefficients themselves can be used as a measure of the contribution of each individual signal in the random vectors $X_1, X_2$ to the mutual information of the resulting lower dimensional $Y_1, Y_2$. We now present how to select the parameters of the linear mappings $Y_1 = \alpha_1 X_1$ and $Y_2 = \alpha_2 X_2$, thus selecting a subset of the most mutually informative signals from the sets of signals $X_1$ and $X_2$ without the need to estimate MI on all permutations of the signal sets.
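As a concrete sketch of this setup (shapes and values are hypothetical, loosely mirroring Simulation 1 in Section IV): each column of $X$ is one time sample of the sensor output, and a projection row vector $\alpha$ maps it to a low-dimensional $Y$ on which MI is far cheaper to evaluate.

```python
import numpy as np

rng = np.random.default_rng(1)
N1, N2, n = 5, 1, 100                      # signal counts and sample count
X1 = rng.uniform(-1, 1, (N1, n))           # sensor 1: 5 signals, 100 samples
X2 = X1[:1] + 0.05 * rng.standard_normal((N2, n))  # sensor 2 shares signal 1

alpha1 = rng.standard_normal((1, N1))      # projection coefficients to optimise
alpha2 = rng.standard_normal((1, N2))

Y1 = alpha1 @ X1                           # lower dimensional mapping, 1 x n
Y2 = alpha2 @ X2
# By (3), I(Y1; Y2) <= I(X1; X2). The optimisation adjusts alpha1, alpha2 to
# tighten this bound, and the magnitude of alpha1[0, i] then indicates how
# informative signal X1[i] is about the other sensor's output.
```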
III. OPTIMIZATION OF MAPPINGS VIA THE INFORMATION MAXIMISATION PRINCIPLE

Finding the optimal projections $\alpha_1$ and $\alpha_2$ requires solving a complex non-linear optimization problem. It is generally not feasible to obtain a closed form solution to this problem without numerical methods such as Powell's direction set method [3]. However, the high cost of computing MI, together with the fact that the parameter vector $\alpha$ has the dimension of the input signals in the case of a linear map, makes direct optimization intractable.
An entropy estimation measure proposed by Fisher et al. [8] allows the gradient of the measure with respect to the mapping parameters to be obtained. They proposed an unsupervised learning method by which the mappings $g_1(\cdot)$ and $g_2(\cdot)$ can be estimated indirectly, without computing mutual information. The maximisation of MI is achieved by maximising the entropies $H(Y_1)$ and $H(Y_2)$ and minimising the joint entropy $H(Y_1, Y_2)$ in (1). The entropies $H(Y_1)$ and $H(Y_2)$ can be maximised by selecting the mapping parameters to make the data in the lower dimensional space resemble a uniform distribution. Likewise, the joint entropy $H(Y_1, Y_2)$ can be minimised by selecting the mapping parameters so that the joint distribution of $(Y_1, Y_2)$ is furthest away from a uniform distribution.
Thus, maximisation of MI can be achieved by maximising the objective function $J$,

$$J = J_{Y_1} + J_{Y_2} - J_{Y_{1,2}} \qquad (4)$$

where each of $J_{Y_1}$, $J_{Y_2}$, $J_{Y_{1,2}}$ is of the form

$$\frac{1}{2} \int_{\Omega} \left( f(u) - \hat{f}(y_u) \right)^2 du \qquad (5)$$

where $\Omega$ indicates the nonzero region over which the integration is evaluated. Therefore (5) is the integrated squared distance between the output distribution (evaluated by a Parzen density estimator $\hat{f}(y_u)$ at a point $u$ over a set of observations $y$) and the desired output distribution $f(u)$.
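The element (5) can be evaluated numerically. The sketch below assumes a 1-D output with support $d = 2$ (so the desired $f(u)$ is uniform with density $1/d$), a Gaussian Parzen kernel, and grid-based integration; all of these choices are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def parzen(u, y, sigma=0.1):
    """Gaussian Parzen density estimate f_hat(u) from output samples y."""
    k = np.exp(-0.5 * ((u[:, None] - y[None, :]) / sigma) ** 2)
    return k.mean(axis=1) / (np.sqrt(2 * np.pi) * sigma)

def J_element(y, d=2.0, grid=200, sigma=0.1):
    """Integrated squared distance (5) from the uniform target on [-d/2, d/2]."""
    u = np.linspace(-d / 2, d / 2, grid)
    f_desired = np.full(grid, 1.0 / d)    # uniform desired distribution f(u)
    return 0.5 * np.trapz((f_desired - parzen(u, y, sigma)) ** 2, u)

rng = np.random.default_rng(2)
print(J_element(rng.uniform(-1, 1, 100)))         # small: close to uniform
print(J_element(0.1 * rng.standard_normal(100)))  # larger: peaked, far from uniform
```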
It can be shown that the gradient of each element of $J$ with respect to the mapping parameters $\alpha$ can be computed as follows [8]:

$$\frac{\partial J}{\partial \alpha} = \frac{\partial J}{\partial \hat{f}}\,\frac{\partial \hat{f}}{\partial g(\alpha, x)}\,\frac{\partial g(\alpha, x)}{\partial \alpha} = \frac{1}{N} \sum_i \epsilon_i\, \frac{\partial}{\partial \alpha} g(\alpha, x)$$

Note that $\partial g(\alpha, x)/\partial \alpha$ is a constant, as we have assumed $g(\cdot)$ is a linear projection. The term $\epsilon_i$ is [8]
$$\epsilon_i^{(k)} = b_r\!\left(y_i^{(k-1)}\right) - \frac{1}{N} \sum_{j \neq i} \kappa_a\!\left(y_i^{(k-1)} - y_j^{(k-1)},\, \Sigma\right) \qquad (6)$$

$$b_r(y_i)_j \triangleq \frac{1}{d}\left[ \kappa_a\!\left(y_i + \frac{d}{2},\, \Sigma\right)_{j} - \kappa_a\!\left(y_i - \frac{d}{2},\, \Sigma\right)_{j} \right] \qquad (7)$$

$$\kappa_a(y, \Sigma) = G(y, \Sigma) \ast G'(y, \Sigma) \qquad (8)$$

Expanding $G$ and $G'$,

$$\kappa_a(y, \Sigma) = \frac{1}{2^{M+1}\, \pi^{M/2}\, \Sigma^{M+2}} \exp\!\left( -\frac{y^T \Sigma^{-2} y}{4} \right) y \qquad (9)$$
where $\kappa_a(\cdot)$ is a kernel; a Gaussian PDF with $\Sigma = \sigma^2 I$ is assumed here. $y_i$ symbolises a sample of either $Y_1$ or $Y_2$, or of the concatenation $Y_{1,2} = [Y_1; Y_2]$ for $J_{Y_{1,2}}$; $M$ is the dimensionality of the output space and is $M_1$, $M_2$ or $M_1 + M_2$ according to the term of (4) that is considered. The $j$th element of $b_r(y_i)$ in (7) is denoted $b_r(y_i)_j$, $d$ is the support of the output space and $N$ is the number of samples.
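The following sketch implements (6), (7) and (9) for an isotropic kernel $\Sigma = \sigma^2 I$ and a linear map. The vectorised shapes, parameter values, and treatment of $\Sigma$ as the scalar $\sigma$ in the normalising constant are assumptions made for illustration.

```python
import numpy as np

def kappa_a(y, sigma, M):
    """Kernel (9), treating Sigma as the scalar sigma with Sigma = sigma^2 I."""
    coef = 1.0 / (2 ** (M + 1) * np.pi ** (M / 2) * sigma ** (M + 2))
    quad = np.sum(y * y, axis=-1, keepdims=True) / sigma ** 4  # y^T Sigma^-2 y
    return coef * np.exp(-quad / 4.0) * y

def b_r(Y, d, sigma, M):
    """Boundary term (7), evaluated elementwise over the M output dimensions."""
    return (kappa_a(Y + d / 2.0, sigma, M) - kappa_a(Y - d / 2.0, sigma, M)) / d

def epsilon(Y, d=2.0, sigma=0.1):
    """Per-sample gradient term (6); Y holds N output samples of dimension M."""
    N, M = Y.shape
    diffs = Y[:, None, :] - Y[None, :, :]      # pairwise y_i - y_j
    K = kappa_a(diffs, sigma, M)               # shape (N, N, M)
    K[np.arange(N), np.arange(N)] = 0.0        # drop the j == i terms
    return b_r(Y, d, sigma, M) - K.sum(axis=1) / N

# Gradient of one element of J for the linear map Y = alpha @ X:
# dJ/dalpha = (1/N) sum_i eps_i x_i^T, since dg/dalpha is constant.
rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, (5, 100))               # 5 input signals, 100 samples
alpha = rng.standard_normal((2, 5))            # map to a 2-D output space
Y = (alpha @ X).T                              # (N, M) = (100, 2)
grad = epsilon(Y).T @ X.T / X.shape[1]         # shape (2, 5), matches alpha
```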
For systems where the dimensionality of the input space $N$ is greater than the number of samples $n$, the mapping can be arbitrary. To obtain a single solution, a penalty on the projection coefficients $\alpha_1$ and $\alpha_2$ can be imposed. The minimal energy solution can be obtained by imposing the $L_2$ penalty, while the $L_1$ norm is known to lead to the sparsest solution. The fact that the $L_1$ penalty leads to a vector with the fewest nonzero elements, for both overdetermined and underdetermined systems, has been demonstrated [14].
A. Optimizing Linear Mappings via $L_2$ Regularisation

Projection coefficients that maximise the objective function can now be found using the algorithm given in Fig. 1, which includes the update rule (6) for each entropy term in (1) and the imposition of an $L_2$ penalty ($L_{2(\alpha_1)}$, $L_{2(\alpha_2)}$) on the projection coefficients $\alpha_1$ and $\alpha_2$:

$$J = J_{Y_1} + J_{Y_2} - J_{Y_{1,2}} - \beta\left( L_{2(\alpha_1)} + L_{2(\alpha_2)} \right) \qquad (10)$$
where the $L_2$ penalty is derived from

$$L_{2(\alpha_1)} = \frac{\partial\, \alpha_1 \alpha_1^T}{\partial \alpha_1} \qquad (11)$$

therefore

$$L_{2(\alpha_1)} = 2\, Y_1 X_1^{-1} \left( X_1^{-1} \right)^T \qquad (12)$$

$$L_{2(\alpha_2)} = 2\, Y_2 X_2^{-1} \left( X_2^{-1} \right)^T \qquad (13)$$

where $X^{-1}$ is the pseudo-inverse of the matrix $X$.
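Assuming, as in the sketches above, that $X_1^{-1}$ in (12) denotes the Moore-Penrose pseudo-inverse and that the penalty gradient is applied in the output space, (12) can be evaluated as follows (hypothetical shapes):

```python
import numpy as np

rng = np.random.default_rng(5)
X1 = rng.uniform(-1, 1, (5, 100))        # 5 input signals, 100 samples
Y1 = rng.standard_normal((2, 100))       # current 2-D output samples

X1_pinv = np.linalg.pinv(X1)             # X1^-1 in (12): pseudo-inverse, (100, 5)
# L2 penalty term of eq. (12); same shape as Y1, so it can be combined
# directly with the per-sample gradient from (6).
dL2_dY1 = 2.0 * Y1 @ X1_pinv @ X1_pinv.T
```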
B. Optimizing Linear Mappings via $L_1$ Regularisation

The $L_2$ criterion seeks to spread the energy of $\alpha_1$ and $\alpha_2$ over many small valued components, rather than concentrating the energy on a few dominant ones. The applications examined in this paper require identifying a few dominant components in the input signal space that are related to each other. Hence, the solution for the parameter vectors $\alpha_1$ and $\alpha_2$ should be sparse, with the minimum number of nonzero elements, naturally suggesting the use of the $L_1$ norm as an appropriate penalty function. In addition, the number of samples and the dimensionality of the signals can vary between applications, producing either an underdetermined or an overdetermined system of equations $Y_1 = \alpha_1 X_1$ and $Y_2 = \alpha_2 X_2$. The $L_1$ norm performs equally well as the $L_2$ norm on overdetermined systems of equations, while outperforming the $L_2$ norm on underdetermined problems [9], especially where the solution is expected to have fewer nonzeros than 1/8 of the number of equations.
The update equation for the gradient descent method when using the $L_1$ penalty is

$$J = J_{Y_1} + J_{Y_2} - J_{Y_{1,2}} - \beta\left( L_{1(\alpha_1)} + L_{1(\alpha_2)} \right) \qquad (14)$$
The equations for the $L_1$ norm penalty are derived from

$$\min \|\alpha_1\|_1 \ \text{subject to} \ Y_1 = \alpha_1 X_1, \qquad \min \|\alpha_2\|_1 \ \text{subject to} \ Y_2 = \alpha_2 X_2 \qquad (15)$$

where $\|\cdot\|_1$ represents the $L_1$ norm. Since the projections $\alpha_1, \alpha_2$ may be of very high dimensionality, it is assumed that

$$\min \|\alpha_1\|_1 = |\alpha_{1_1}| + |\alpha_{1_2}| + \cdots + |\alpha_{1_n}| \qquad (16)$$
Therefore the $L_1$ penalty is

$$\frac{\partial \min \|\alpha_1\|_1}{\partial Y_1} \qquad (17)$$

and further,

$$\frac{\partial |\alpha_1|}{\partial Y_{1_1}} = \sum_{i=1}^{n} \frac{\partial |\alpha_{1_i}|}{\partial Y_{1_1}} = \sum |X_1^{-1}|_{\mathrm{row}\,1}\, \mathrm{sign}|Y_{1_1}|$$
$$\vdots$$
$$\frac{\partial |\alpha_1|}{\partial Y_{1_i}} = \sum_{i=1}^{n} \frac{\partial |\alpha_{1_i}|}{\partial Y_{1_i}} = \sum |X_1^{-1}|_{\mathrm{row}\,i}\, \mathrm{sign}|Y_{1_i}|$$

resulting in

$$\frac{\partial \min \|\alpha_1\|_1}{\partial Y_1} = \sum |X_1^{-1}| \,\mathrm{sign}|Y_1| \qquad (18)$$
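A corresponding sketch of (18), under the same assumed shapes; the per-sample row sums of $|X_1^{-1}|$ and the sign pattern of $Y_1$ follow the derivation above, with the interpretation of the row index as the sample index being an assumption.

```python
import numpy as np

rng = np.random.default_rng(6)
X1 = rng.uniform(-1, 1, (5, 100))
Y1 = rng.standard_normal((2, 100))

X1_pinv = np.linalg.pinv(X1)                   # (100, 5)
# Eq. (18): sum the absolute pseudo-inverse entries row-wise (one row per
# output sample) and apply the sign pattern of the output values.
row_sums = np.abs(X1_pinv).sum(axis=1)         # shape (100,)
dL1_dY1 = row_sums[None, :] * np.sign(Y1)      # shape (2, 100), same as Y1
```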
[Fig. 1. Block diagram of the proposed method. $\eta$ is the learning rate; $\beta$ is the normaliser on the $L_1$/$L_2$ penalties applied to the projection coefficients $\alpha_1$ and $\alpha_2$.]

C. Stopping Criteria

All iterative optimization methods require stopping criteria to indicate the successful completion of the process. Consider

$$\delta = \frac{\max(\Delta_{NN}) - \min(\Delta_{NN})}{\max(\Delta)} \qquad (19)$$

where the term $\Delta_{NN}$ is the nearest neighbour distance in the resulting output distribution, $\Delta$ is the distance between any two samples in the output distribution, and $\max(\cdot)$ and $\min(\cdot)$ are the maximum and minimum distances between samples in the output space. The numerator is a measure of the uniformity of the output space and the denominator is a measure of how well the output space is filled. Therefore, (19) can be used as a convergence criterion. However, $\delta$ depends on the number of samples obtained from the signal, $n$, the dimensionality $N$, and the size of the output space $d$. As the numerator approaches zero for uniformly distributed samples, for a given required threshold $\gamma$, $\delta$ may be determined by $\gamma\, d^{N}/n$.
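A minimal sketch of the convergence measure (19), computed from pairwise distances in the output space; the Euclidean metric and array shapes are assumptions.

```python
import numpy as np

def delta_criterion(Y):
    """Convergence measure (19) for output samples Y of shape (n, M)."""
    diffs = Y[:, None, :] - Y[None, :, :]
    D = np.sqrt((diffs ** 2).sum(axis=-1))     # all pairwise distances, Delta
    np.fill_diagonal(D, np.inf)                # exclude zero self-distances
    nn = D.min(axis=1)                         # nearest-neighbour distances
    d_max = D[np.isfinite(D)].max()            # max(Delta) over real pairs
    return (nn.max() - nn.min()) / d_max

Y = np.random.default_rng(7).uniform(-1, 1, (100, 2))
converged = delta_criterion(Y) < 0.035         # threshold used in Section IV
```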
For all experiments in this paper the following parameter values have been chosen.

TABLE I
OPTIMIZATION LEARNING RATE COEFFICIENTS

$\eta = \dfrac{M_1 M_2}{N_1 N_2}$
$\beta = \max\left( \max(X_2)\dfrac{N_2}{N_1},\ \max(X_1)\dfrac{N_1}{N_2} \right)$
IV. SIMULATION AND EXPERIMENTAL RESULTS

For the simulation and experimental study, the output space dimensionality is chosen to be $d = 2$. For a sample size $n = 100$, the stopping criterion from equation (19) is calculated to be $\delta < 0.035$. In order to detect that the optimization has reached a local minimum, the variation of $\delta$ should be contained within a $1.5 \times 10^{-3}$ limit for a minimal convergence span of at least 5 iterations.
A. Simulation Results
Two simulations are performed to evaluate the proposed method. Simulation 1: the purpose is to detect identical signal pairings embedded within a number of unrelated signals. Simulation 2: the purpose is to identify non-informative signals. We have utilised Johnson's method [12] of generating signals with an arbitrarily high order of dependency. Signals generated for the purpose of the simulations are scaled to $[-1, 1]$.
Simulation 1: Identical Signals: One hundred signals are generated, each containing 100 samples. Five signals are selected and supplied as the sensor 1 output $\{1, 2, 3, 4, 5\}$ and one signal is selected as the sensor 2 output $\{1\}$; thus $N_1 = 5$ and $N_2 = 1$, with one signal in common.

In order to determine the most informative signal we examine the vector of $\alpha_1$ coefficients, where each $\alpha_{1_i}$ corresponds to an $X_{1_i}$. Results are presented in Fig. 2, with the mapping coefficients $\alpha_{1_i}$, $i \in \{1, \ldots, 5\}$, shown in blue, red, green, cyan and yellow respectively. The convergence criterion $\delta$ is plotted as the dashed gray line. The results show the highest coefficient for $\alpha_{1_1}$, confirming that signal 1 is common between the sensors.

[Fig. 2. Results of indirect estimation of mutual information for signals with underlying linear dependency: (a) $L_1$ penalty; (b) $L_2$ penalty.]
[Fig. 3. Results of indirect estimation of MI for non-informative signals, $\alpha_1$ and $\delta$ versus iteration: (a) $L_1$ penalty; (b) $L_2$ penalty.]
Applying the $L_1$ norm penalty to the optimization produced faster convergence, occurring at iteration 38 compared with iteration 142 for the $L_2$ norm penalty. It is to be noted that, ideally, the only non-zero mapping parameter should be $\alpha_{1_1}$ and all others should be zero. However, due to the approximations in the objective function and the presence of local minima, the other mapping parameters have smaller non-zero values.
Simulation 2: Non-Informative Signals: In this simulation, signals $\{1, 2, 3, 4, 5\}$ are selected as the sensor 1 output and signal $\{6\}$ is chosen as the sensor 2 output; clearly there are no common signals. Fig. 3 shows that neither the $L_1$ nor the $L_2$ norm penalty produced convergence in 200 iterations. In fact, the solution based on the $L_1$ regularization diverges from an optimized solution, verifying that there is no common signal.
B. Experiments
Two experiments are performed to evaluate the proposed method in establishing the relationship between multi-modal sensory data by identifying informative signals without any prior knowledge of geometric parameters. Experiment 1: the purpose is to localise the audio source in the video data sequence. Experiment 2: the purpose is to identify the common source in a laser and video data stream.

Experiment 1: Audio and Video Signals: A microphone and camera were used to capture activity in an office environment consisting of a person (left in the image) reading a sequence of numbers and another person (right in the image) mimicking unscripted sentences (see Fig. 4(a)). Video data was captured at 15 Hz, while the audio signal was captured at 48 kHz with only 10 kHz of content used. Both video and audio data streams were synchronised in time. The colour images acquired were transformed to grey scale, and the pixel intensity values (consisting of $640 \times 480 = 307200$ pixels per frame) of 100 frames were analyzed using raw pixel values.
