Max-Planck-Institut für biologische Kybernetik
Max Planck Institute for Biological Cybernetics
Technical Report No. 140
Measuring Statistical Dependence with Hilbert-Schmidt Norms
Arthur Gretton,¹ Olivier Bousquet,² Alexander Smola,³ Bernhard Schölkopf¹
June 2005
¹ Department Schölkopf, email: firstname.lastname@tuebingen.mpg.de; ² Pertinence, 32, Rue des Jeûneurs, 75002 Paris, France, email: olivier.bousquet@pertinence.com; ³ NICTA, Canberra, Australia, email: alex.smola@anu.edu.au
This report is available in PDF–format via anonymous ftp at ftp://ftp.kyb.tuebingen.mpg.de/pub/mpi-memos/pdf/techrepTitle.pdf.
The complete series of Technical Reports is documented at: http://www.kyb.tuebingen.mpg.de/techreports.html

Measuring Statistical Dependence with
Hilbert-Schmidt Norms
Arthur Gretton, Olivier Bousquet, Alexander Smola, and Bernhard Schölkopf
Abstract. We propose an independence criterion based on the eigenspectrum of covariance operators in re-
producing kernel Hilbert spaces (RKHSs), consisting of an empirical estimate of the Hilbert-Schmidt norm of
the cross-covariance operator (we term this a Hilbert-Schmidt Independence Criterion, or HSIC). This approach
has several advantages, compared with previous kernel-based independence criteria. First, the empirical estimate
is simpler than any other kernel dependence test, and requires no user-defined regularisation. Second, there is a
clearly defined population quantity which the empirical estimate approaches in the large sample limit, with ex-
ponential convergence guaranteed between the two: this ensures that independence tests based on HSIC do not
suffer from slow learning rates. Finally, we show in the context of independent component analysis (ICA) that the
performance of HSIC is competitive with that of previously published kernel-based criteria, and of other recently
published ICA methods.

1 Introduction
Methods for detecting dependence using kernel-based approaches have recently found
application in a wide variety of areas. Examples include independent component analysis
[Bach and Jordan, 2002, Gretton et al., 2003], gene selection [Yamanishi et al., 2004],
descriptions of gait in terms of hip and knee trajectories [Leurgans et al., 1993], feature
selection [Fukumizu et al., 2004], and dependence detection in fMRI signals [Gretton
et al., 2005]. The principle underlying these algorithms is that we may define covariance
and cross-covariance operators in RKHSs, and derive statistics from these operators suited
to measuring the dependence between functions in these spaces.
In the method of Bach and Jordan [2002], a regularised correlation operator was de-
rived from the covariance and cross-covariance operators, and its largest singular value
(the kernel canonical correlation, or KCC) was used as a statistic to test independence.
The approach of Gretton et al. [2005] was to use the largest singular value of the cross-
covariance operator, which behaves identically to the correlation operator at indepen-
dence, but is easier to define and requires no regularisation; the resulting test is called
the constrained covariance (COCO). Both these quantities fall within the framework set
out by Rényi [1959], namely that for sufficiently rich function classes, the functional cor-
relation (or, alternatively, the cross-covariance) serves as an independence test, being zero
only when the random variables tested are independent. Various empirical kernel quanti-
ties (derived from bounds on the mutual information that hold near independence)¹ were
also proposed based on the correlation and cross-covariance operators by Bach and Jor-
dan [2002] and Gretton et al. [2003]; however, their connection to the population covariance
operators remains to be established (indeed, the population quantities to which these
approximations converge are not yet known). Gretton et al. [2005] showed that these
various quantities are guaranteed to be zero for independent random variables only when
the associated RKHSs are universal [Steinwart, 2001].
The present study extends the concept of COCO by using the entire spectrum of the
cross-covariance operator to determine when all its singular values are zero, rather than
looking only at the largest singular value; the idea being to obtain a more robust indication
of independence. To this end, we use the sum of the squared singular values of the cross-
covariance operator (i.e., its squared Hilbert-Schmidt norm) to measure dependence;
we call the resulting quantity the Hilbert-Schmidt Independence Criterion (HSIC).² It
turns out that the empirical estimate of HSIC is identical to the quadratic dependence
measure of Achard et al. [2003], although we shall see that their derivation approaches this
criterion in a completely different way. Thus, the present work resolves the open question
in [Achard et al., 2003] regarding the link between the quadratic dependence measure
and kernel dependence measures based on RKHSs, and generalises this measure to metric
spaces (as opposed to subsets of the reals). More importantly, however, we believe our
proof assures that HSIC is indeed a dependence criterion under all circumstances (i.e.,
HSIC is zero if and only if the random variables are independent), which is not necessarily
guaranteed by Achard et al. [2003]. We give a more detailed analysis of Achard's proof
in Appendix B.

¹ Respectively, the Kernel Generalised Variance (KGV) and the Kernel Mutual Information (KMI).
² The possibility of using a Hilbert-Schmidt norm was suggested by Fukumizu et al. [2004], although the idea was not pursued further in that work.
Compared with previous kernel independence measures, HSIC has several advantages:
- The empirical estimate is much simpler (just the trace of a product of Gram matrices; see the sketch after this list) and, unlike the canonical correlation or kernel generalised variance of Bach and Jordan [2002], HSIC does not require extra regularisation terms for good finite sample behaviour.
- The empirical estimate converges to the population quantity at rate $1/\sqrt{m}$, where $m$ is the sample size, and thus independence tests based on HSIC do not suffer from slow learning rates [Devroye et al., 1996]. In particular, as the sample size increases, we are guaranteed to detect any existing dependence with high probability. Of the alternative kernel dependence tests, this result is proved only for the constrained covariance [Gretton et al., 2005].
- The finite sample bias of the estimate is $O(m^{-1})$, and is therefore negligible compared to the finite sample fluctuations (which underlie the convergence rate in the previous point). This is currently proved for no other kernel dependence test, including COCO.
- Experimental results on an ICA problem show that the new independence test is superior to the previous ones, and competitive with the best existing specialised ICA methods. In particular, kernel methods are substantially more resistant to outliers than other specialised ICA algorithms.
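To make the first point concrete, here is a minimal numpy sketch (not code from the report) of the biased empirical HSIC, $\mathrm{tr}(KHLH)/(m-1)^2$, where $K$ and $L$ are Gram matrices on the two samples and $H = I - \frac{1}{m}\mathbf{1}\mathbf{1}^\top$ is the centering matrix; the Gaussian kernel and its bandwidths are illustrative choices, not prescriptions from the paper.

```python
import numpy as np

def rbf_gram(X, sigma=1.0):
    # Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic_biased(X, Y, sigma_x=1.0, sigma_y=1.0):
    # Biased empirical HSIC: tr(K H L H) / (m - 1)^2, with H = I - (1/m) 1 1^T
    m = X.shape[0]
    K, L = rbf_gram(X, sigma_x), rbf_gram(Y, sigma_y)
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(K @ H @ L @ H) / (m - 1) ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
y_dep = x + 0.1 * rng.normal(size=(200, 1))   # dependent on x
y_ind = rng.normal(size=(200, 1))             # independent of x
print(hsic_biased(x, y_dep), hsic_biased(x, y_ind))  # the dependent pair gives the larger value
```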
We begin our discussion in Section 2, in which we define the cross-covariance operator
between RKHSs, and give its Hilbert-Schmidt (HS) norm (this being the population
HSIC). In Section 3, we give an empirical estimate of the HS norm, and establish the link
between the population and empirical HSIC by determining the bias of the finite sample
estimate. In Section 4, we demonstrate exponential convergence between the population
HSIC and empirical HSIC. As a consequence of this fast convergence, we show in Section
5 that dependence tests formulated using HSIC do not suffer from slow learning rates.
Also in this section, we describe an efficient approximation to the empirical HSIC based
on the incomplete Cholesky decomposition. Finally, in Section 6, we apply HSIC to the
problem of independent component analysis (ICA).
2 Cross-Covariance Operators
In this section, we provide the functional analytic background necessary in describing
cross-covariance operators between RKHSs, and introduce the Hilbert-Schmidt norm of
these operators. Our presentation follows Zwald et al. [2004] and Hein and Bousquet
[2004], the main difference being that we deal with cross-covariance operators rather
than the covariance operators.³ We also draw on [Fukumizu et al., 2004], which uses
covariance and cross-covariance operators as a means of defining conditional covariance
operators, but does not investigate the Hilbert-Schmidt norm; and on [Baker, 1973], which
characterises the covariance and cross-covariance operators for general Hilbert spaces.
³ Briefly, a cross-covariance operator maps from one space to another, whereas a covariance operator maps from
a space to itself. In the linear algebraic case, the covariance is $C_{xx} := \mathbf{E}_x[xx^\top] - \mathbf{E}_x[x]\,\mathbf{E}_x[x^\top]$, while the
cross-covariance is $C_{xy} := \mathbf{E}_{x,y}[xy^\top] - \mathbf{E}_x[x]\,\mathbf{E}_y[y^\top]$.
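For orientation, the following is a small numpy sketch (illustrative only, not from the report) of this linear-algebraic cross-covariance, estimated from samples stored row-wise:

```python
import numpy as np

def empirical_cross_covariance(X, Y):
    # C_xy = E[x y^T] - E[x] E[y]^T, plug-in estimate from row-wise samples
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    return (X.T @ Y) / X.shape[0] - np.outer(mx, my)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
Y = X[:, :2] + 0.1 * rng.normal(size=(500, 2))
print(empirical_cross_covariance(X, Y).shape)  # (3, 2)
```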

2.1 RKHS theory
Consider a Hilbert space $\mathcal{F}$ of functions from $\mathcal{X}$ to $\mathbb{R}$. Then $\mathcal{F}$ is a reproducing kernel
Hilbert space if for each $x \in \mathcal{X}$, the Dirac evaluation operator $\delta_x : \mathcal{F} \to \mathbb{R}$, which maps
$f \in \mathcal{F}$ to $f(x) \in \mathbb{R}$, is a bounded linear functional. To each point $x \in \mathcal{X}$, there corresponds
an element $\phi(x) \in \mathcal{F}$ such that $\langle \phi(x), \phi(x') \rangle_{\mathcal{F}} = k(x, x')$, where $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a unique
positive definite kernel. We will require in particular that $\mathcal{F}$ be separable (it must have
a complete orthonormal system). As pointed out by Hein and Bousquet [2004, Theorem 7],
any continuous kernel on a separable $\mathcal{X}$ (e.g. $\mathbb{R}^n$) induces a separable RKHS.⁴ We
likewise define a second separable RKHS, $\mathcal{G}$, with kernel $l(\cdot, \cdot)$ and feature map $\psi$, on the
separable space $\mathcal{Y}$.
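As a finite-dimensional illustration of the feature map / kernel correspondence (a sketch, not taken from the report), the homogeneous second-order polynomial kernel on $\mathbb{R}^2$ admits an explicit three-dimensional feature map $\phi$ with $\langle \phi(x), \phi(x') \rangle = (x^\top x')^2$:

```python
import numpy as np

def phi(x):
    # explicit feature map for k(x, x') = (x^T x')^2 on R^2
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2.0) * x[0] * x[1]])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(phi(x), phi(xp)), float(np.dot(x, xp)) ** 2)  # both equal 1.0
```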
Hilbert-Schmidt Norm Denote by $C : \mathcal{G} \to \mathcal{F}$ a linear operator. Then provided the sum
converges, the Hilbert-Schmidt (HS) norm of $C$ is defined as
$$\|C\|_{\mathrm{HS}}^2 := \sum_{i,j} \langle C v_i, u_j \rangle_{\mathcal{F}}^2, \qquad (1)$$
where $u_i$ and $v_j$ are orthonormal bases of $\mathcal{F}$ and $\mathcal{G}$ respectively. It is easy to see that this
generalises the Frobenius norm on matrices.
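The matrix case makes this connection explicit; a quick numerical check (an illustrative sketch, with the standard bases of $\mathbb{R}^3$ and $\mathbb{R}^4$ playing the role of $v_i$ and $u_j$):

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.normal(size=(4, 3))          # operator from R^3 (role of G) to R^4 (role of F)
V, U = np.eye(3), np.eye(4)          # orthonormal bases v_i and u_j

# ||C||_HS^2 = sum_{i,j} <C v_i, u_j>^2, as in (1)
hs_sq = sum((U[:, j] @ (C @ V[:, i])) ** 2 for i in range(3) for j in range(4))
print(np.isclose(hs_sq, np.linalg.norm(C, "fro") ** 2))  # True
```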
Hilbert-Schmidt Operator A linear operator $C : \mathcal{G} \to \mathcal{F}$ is called a Hilbert-Schmidt
operator if its HS norm exists. The set of Hilbert-Schmidt operators $\mathrm{HS}(\mathcal{G}, \mathcal{F}) : \mathcal{G} \to \mathcal{F}$
is a separable Hilbert space with inner product
$$\langle C, D \rangle_{\mathrm{HS}} := \sum_{i,j} \langle C v_i, u_j \rangle_{\mathcal{F}} \langle D v_i, u_j \rangle_{\mathcal{F}}.$$
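In the matrix case this HS inner product reduces to the Frobenius inner product $\mathrm{tr}(C^\top D)$; a short check under the same finite-dimensional assumptions as above:

```python
import numpy as np

rng = np.random.default_rng(1)
C, D = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))
V, U = np.eye(3), np.eye(4)

# sum_{i,j} <C v_i, u_j> <D v_i, u_j> equals tr(C^T D)
hs_inner = sum((U[:, j] @ (C @ V[:, i])) * (U[:, j] @ (D @ V[:, i]))
               for i in range(3) for j in range(4))
print(np.isclose(hs_inner, np.trace(C.T @ D)))  # True
```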
Tensor Product Let $f \in \mathcal{F}$ and $g \in \mathcal{G}$. Then the tensor product operator $f \otimes g : \mathcal{G} \to \mathcal{F}$
is defined as
$$(f \otimes g)h := f \langle g, h \rangle_{\mathcal{G}} \quad \text{for all } h \in \mathcal{G}. \qquad (2)$$
Moreover, by the definition of the HS norm, we can compute the HS norm of $f \otimes g$ via
$$\|f \otimes g\|_{\mathrm{HS}}^2 = \langle f \otimes g, f \otimes g \rangle_{\mathrm{HS}} = \langle f, (f \otimes g) g \rangle_{\mathcal{F}} = \langle f, f \rangle_{\mathcal{F}} \langle g, g \rangle_{\mathcal{G}} = \|f\|_{\mathcal{F}}^2 \|g\|_{\mathcal{G}}^2. \qquad (3)$$
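In finite dimensions the tensor product $f \otimes g$ is just the outer product $f g^\top$, and (2) and (3) can be checked directly (an illustrative sketch, not from the report):

```python
import numpy as np

f = np.array([1.0, -2.0, 0.5])   # element of the "F" space
g = np.array([3.0, 4.0])         # element of the "G" space
h = np.array([0.2, -1.0])

T = np.outer(f, g)               # finite-dimensional f (tensor) g
print(np.allclose(T @ h, f * np.dot(g, h)))                                          # (2): (f ⊗ g) h = f <g, h>
print(np.isclose(np.linalg.norm(T, "fro"), np.linalg.norm(f) * np.linalg.norm(g)))   # (3): ||f ⊗ g||_HS = ||f|| ||g||
```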
2.2 The Cross-Covariance Operator
Mean We assume that $(\mathcal{X}, \Gamma)$ and $(\mathcal{Y}, \Lambda)$ are furnished with probability measures $p_x$, $p_y$
respectively ($\Gamma$ being the Borel sets on $\mathcal{X}$, and $\Lambda$ the Borel sets on $\mathcal{Y}$). We may now
define the mean elements with respect to these measures as those members of $\mathcal{F}$ and $\mathcal{G}$
respectively for which
$$\langle \mu_x, f \rangle_{\mathcal{F}} := \mathbf{E}_x[\langle \phi(x), f \rangle_{\mathcal{F}}] = \mathbf{E}_x[f(x)],$$
$$\langle \mu_y, g \rangle_{\mathcal{G}} := \mathbf{E}_y[\langle \psi(y), g \rangle_{\mathcal{G}}] = \mathbf{E}_y[g(y)]. \qquad (4)$$
⁴ For more detail on separable RKHSs and their properties, see [Hein and Bousquet, 2004] and references therein.
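A minimal sketch of the empirical counterpart of $\mu_x$ (an illustration under assumptions, not the report's construction): choosing $f = \phi(z)$ in (4) gives $\langle \mu_x, \phi(z) \rangle_{\mathcal{F}} = \mathbf{E}_x[k(x, z)]$, which can be estimated by averaging kernel evaluations over the sample; the Gaussian kernel here is an illustrative choice.

```python
import numpy as np

def rbf(X, Z, sigma=1.0):
    # cross-Gram matrix k(x_i, z_j) for a Gaussian kernel
    d2 = (np.sum(X ** 2, axis=1)[:, None] + np.sum(Z ** 2, axis=1)[None, :]
          - 2.0 * X @ Z.T)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mean_element_at(X_sample, Z_query, sigma=1.0):
    # <mu_x, phi(z)> ~ (1/m) sum_i k(x_i, z), one value per query point z
    return rbf(X_sample, Z_query, sigma).mean(axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1))
print(mean_element_at(X, np.array([[0.0], [3.0]])))  # larger near the bulk of the sample
```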

References

Amari, S., Cichocki, A., and Yang, H. H. A new learning algorithm for blind signal separation. In Advances in Neural Information Processing Systems 8, 1996.
Bell, A. J. and Sejnowski, T. J. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129–1159, 1995.
Devroye, L., Györfi, L., and Lugosi, G. A Probabilistic Theory of Pattern Recognition. Springer, New York, 1996.
Hoeffding, W. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13–30, 1963.
Hyvärinen, A., Karhunen, J., and Oja, E. Independent Component Analysis. John Wiley and Sons, New York, 2001.