Max-Planck-Institut für biologische Kybernetik
Max Planck Institute for Biological Cybernetics
Technical Report No. 140
Measuring Statistical Dependence with Hilbert-Schmidt Norms
Arthur Gretton,¹ Olivier Bousquet,² Alexander Smola,³ Bernhard Schölkopf¹
June 2005
¹ Department Schölkopf, email: firstname.lastname@tuebingen.mpg.de; ² Pertinence, 32, Rue des Jeûneurs, 75002 Paris, France, email: olivier.bousquet@pertinence.com; ³ NICTA, Canberra, Australia, email: alex.smola@anu.edu.au
This report is available in PDF–format via anonymous ftp at ftp://ftp.kyb.tuebingen.mpg.de/pub/mpi-memos/pdf/techrepTitle.pdf.
The complete series of Technical Reports is documented at: http://www.kyb.tuebingen.mpg.de/techreports.html

Measuring Statistical Dependence with
Hilbert-Schmidt Norms
Arthur Gretton, Olivier Bousquet, Alexander Smola, and Bernhard Schölkopf
Abstract. We propose an independence criterion based on the eigenspectrum of covariance operators in re-
producing kernel Hilbert spaces (RKHSs), consisting of an empirical estimate of the Hilbert-Schmidt norm of
the cross-covariance operator (we term this a Hilbert-Schmidt Independence Criterion, or HSIC). This approach
has several advantages, compared with previous kernel-based independence criteria. First, the empirical estimate
is simpler than any other kernel dependence test, and requires no user-defined regularisation. Second, there is a
clearly defined population quantity which the empirical estimate approaches in the large sample limit, with ex-
ponential convergence guaranteed between the two: this ensures that independence tests based on HSIC do not
suffer from slow learning rates. Finally, we show in the context of independent component analysis (ICA) that the
performance of HSIC is competitive with that of previously published kernel-based criteria, and of other recently
published ICA methods.

1 Introduction
Methods for detecting dependence using kernel-based approaches have recently found
application in a wide variety of areas. Examples include independent component analysis
[Bach and Jordan, 2002, Gretton et al., 2003], gene selection [Yamanishi et al., 2004],
descriptions of gait in terms of hip and knee trajectories [Leurgans et al., 1993], feature
selection [Fukumizu et al., 2004], and dependence detection in fMRI signals [Gretton
et al., 2005]. The principle underlying these algorithms is that we may define covariance
and cross-covariance operators in RKHSs, and derive statistics from these operators suited
to measuring the dependence between functions in these spaces.
In the method of Bach and Jordan [2002], a regularised correlation operator was de-
rived from the covariance and cross-covariance operators, and its largest singular value
(the kernel canonical correlation, or KCC) was used as a statistic to test independence.
The approach of Gretton et al. [2005] was to use the largest singular value of the cross-
covariance operator, which behaves identically to the correlation operator at indepen-
dence, but is easier to define and requires no regularisation; the resulting test is called
the constrained covariance (COCO). Both these quantities fall within the framework set
out by Rényi [1959], namely that for sufficiently rich function classes, the functional cor-
relation (or, alternatively, the cross-covariance) serves as an independence test, being zero
only when the random variables tested are independent. Various empirical kernel quanti-
ties (derived from bounds on the mutual information that hold near independence)¹ were
also proposed based on the correlation and cross-covariance operators by Bach and Jor-
dan [2002] and Gretton et al. [2003]; however, their connection to the population covariance
operators remains to be established (indeed, the population quantities to which these
approximations converge are not yet known). Gretton et al. [2005] showed that these
various quantities are guaranteed to be zero for independent random variables only when
the associated RKHSs are universal [Steinwart, 2001].
The present study extends the concept of COCO by using the entire spectrum of the
cross-covariance operator to determine when all its singular values are zero, rather than
looking only at the largest singular value; the idea being to obtain a more robust indication
of independence. To this end, we use the sum of the squared singular values of the cross-
covariance operator (i.e., its squared Hilbert-Schmidt norm) to measure dependence;
we call the resulting quantity the Hilbert-Schmidt Independence Criterion (HSIC).² It
turns out that the empirical estimate of HSIC is identical to the quadratic dependence
measure of Achard et al. [2003], although we shall see that their derivation approaches this
criterion in a completely different way. Thus, the present work resolves the open question
in [Achard et al., 2003] regarding the link between the quadratic dependence measure
and kernel dependence measures based on RKHSs, and generalises this measure to metric
spaces (as opposed to subsets of the reals). More importantly, however, we believe our
proof assures that HSIC is indeed a dependence criterion under all circumstances (i.e.,
HSIC is zero if and only if the random variables are independent), which is not necessarily
guaranteed by Achard et al. [2003]. We give a more detailed analysis of Achard's proof
in Appendix B.

¹ Respectively, the Kernel Generalised Variance (KGV) and the Kernel Mutual Information (KMI).
² The possibility of using a Hilbert-Schmidt norm was suggested by Fukumizu et al. [2004], although the idea was not pursued further in that work.
Compared with previous kernel independence measures, HSIC has several advantages:
- The empirical estimate is much simpler (just the trace of a product of Gram matrices; see the sketch after this list) and, unlike the canonical correlation or kernel generalised variance of Bach and Jordan [2002], HSIC does not require extra regularisation terms for good finite sample behaviour.
- The empirical estimate converges to the population quantity at rate $1/\sqrt{m}$, where $m$ is the sample size, and thus independence tests based on HSIC do not suffer from slow learning rates [Devroye et al., 1996]. In particular, as the sample size increases, we are guaranteed to detect any existing dependence with high probability. Of the alternative kernel dependence tests, this result is proved only for the constrained covariance [Gretton et al., 2005].
- The finite sample bias of the estimate is $O(m^{-1})$, and is therefore negligible compared to the finite sample fluctuations (which underlie the convergence rate in the previous point). This is currently proved for no other kernel dependence test, including COCO.
- Experimental results on an ICA problem show that the new independence test is superior to the previous ones, and competitive with the best existing specialised ICA methods. In particular, kernel methods are substantially more resistant to outliers than other specialised ICA algorithms.
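To make the first point concrete, here is a minimal numpy sketch (not code from the report) of the biased empirical HSIC, $\mathrm{tr}(KHLH)/(m-1)^2$, where $K$ and $L$ are Gram matrices on the two samples and $H = I - \frac{1}{m}\mathbf{1}\mathbf{1}^\top$ is the centering matrix; the Gaussian kernel and its bandwidths are illustrative choices, not prescriptions from the paper.

```python
import numpy as np

def rbf_gram(X, sigma=1.0):
    # Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic_biased(X, Y, sigma_x=1.0, sigma_y=1.0):
    # Biased empirical HSIC: tr(K H L H) / (m - 1)^2, with H = I - (1/m) 1 1^T
    m = X.shape[0]
    K, L = rbf_gram(X, sigma_x), rbf_gram(Y, sigma_y)
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(K @ H @ L @ H) / (m - 1) ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
y_dep = x + 0.1 * rng.normal(size=(200, 1))   # dependent on x
y_ind = rng.normal(size=(200, 1))             # independent of x
print(hsic_biased(x, y_dep), hsic_biased(x, y_ind))  # the dependent pair gives the larger value
```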
We begin our discussion in Section 2, in which we define the cross-covariance operator
between RKHSs, and give its Hilbert-Schmidt (HS) norm (this being the population
HSIC). In Section 3, we give an empirical estimate of the HS norm, and establish the link
between the population and empirical HSIC by determining the bias of the finite sample
estimate. In Section 4, we demonstrate exponential convergence between the population
HSIC and empirical HSIC. As a consequence of this fast convergence, we show in Section
5 that dependence tests formulated using HSIC do not suffer from slow learning rates.
Also in this section, we describe an efficient approximation to the empirical HSIC based
on the incomplete Cholesky decomposition. Finally, in Section 6, we apply HSIC to the
problem of independent component analysis (ICA).
2 Cross-Covariance Operators
In this section, we provide the functional analytic background necessary in describing
cross-covariance operators between RKHSs, and introduce the Hilbert-Schmidt norm of
these operators. Our presentation follows Zwald et al. [2004] and Hein and Bousquet
[2004], the main difference being that we deal with cross-covariance operators rather
than the covariance operators.³ We also draw on [Fukumizu et al., 2004], which uses
covariance and cross-covariance operators as a means of defining conditional covariance
operators, but does not investigate the Hilbert-Schmidt norm; and on [Baker, 1973], which
characterises the covariance and cross-covariance operators for general Hilbert spaces.
³ Briefly, a cross-covariance operator maps from one space to another, whereas a covariance operator maps from
a space to itself. In the linear algebraic case, the covariance is $C_{xx} := \mathbf{E}_x[xx^\top] - \mathbf{E}_x[x]\,\mathbf{E}_x[x^\top]$, while the
cross-covariance is $C_{xy} := \mathbf{E}_{x,y}[xy^\top] - \mathbf{E}_x[x]\,\mathbf{E}_y[y^\top]$.
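For orientation, the following is a small numpy sketch (illustrative only, not from the report) of this linear-algebraic cross-covariance, estimated from samples stored row-wise:

```python
import numpy as np

def empirical_cross_covariance(X, Y):
    # C_xy = E[x y^T] - E[x] E[y]^T, plug-in estimate from row-wise samples
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    return (X.T @ Y) / X.shape[0] - np.outer(mx, my)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
Y = X[:, :2] + 0.1 * rng.normal(size=(500, 2))
print(empirical_cross_covariance(X, Y).shape)  # (3, 2)
```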

2.1 RKHS theory
Consider a Hilbert space $\mathcal{F}$ of functions from $\mathcal{X}$ to $\mathbb{R}$. Then $\mathcal{F}$ is a reproducing kernel
Hilbert space if for each $x \in \mathcal{X}$, the Dirac evaluation operator $\delta_x : \mathcal{F} \to \mathbb{R}$, which maps
$f \in \mathcal{F}$ to $f(x) \in \mathbb{R}$, is a bounded linear functional. To each point $x \in \mathcal{X}$, there corresponds
an element $\phi(x) \in \mathcal{F}$ such that $\langle \phi(x), \phi(x') \rangle_{\mathcal{F}} = k(x, x')$, where $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a unique
positive definite kernel. We will require in particular that $\mathcal{F}$ be separable (it must have
a complete orthonormal system). As pointed out by Hein and Bousquet [2004, Theorem 7],
any continuous kernel on a separable $\mathcal{X}$ (e.g. $\mathbb{R}^n$) induces a separable RKHS.⁴ We
likewise define a second separable RKHS, $\mathcal{G}$, with kernel $l(\cdot, \cdot)$ and feature map $\psi$, on the
separable space $\mathcal{Y}$.
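As a finite-dimensional illustration of the feature map / kernel correspondence (a sketch, not taken from the report), the homogeneous second-order polynomial kernel on $\mathbb{R}^2$ admits an explicit three-dimensional feature map $\phi$ with $\langle \phi(x), \phi(x') \rangle = (x^\top x')^2$:

```python
import numpy as np

def phi(x):
    # explicit feature map for k(x, x') = (x^T x')^2 on R^2
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2.0) * x[0] * x[1]])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(phi(x), phi(xp)), float(np.dot(x, xp)) ** 2)  # both equal 1.0
```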
Hilbert-Schmidt Norm Denote by $C : \mathcal{G} \to \mathcal{F}$ a linear operator. Then provided the sum
converges, the Hilbert-Schmidt (HS) norm of $C$ is defined as
$$\|C\|_{\mathrm{HS}}^2 := \sum_{i,j} \langle C v_i, u_j \rangle_{\mathcal{F}}^2, \qquad (1)$$
where $u_i$ and $v_j$ are orthonormal bases of $\mathcal{F}$ and $\mathcal{G}$ respectively. It is easy to see that this
generalises the Frobenius norm on matrices.
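The matrix case makes this connection explicit; a quick numerical check (an illustrative sketch, with the standard bases of $\mathbb{R}^3$ and $\mathbb{R}^4$ playing the role of $v_i$ and $u_j$):

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.normal(size=(4, 3))          # operator from R^3 (role of G) to R^4 (role of F)
V, U = np.eye(3), np.eye(4)          # orthonormal bases v_i and u_j

# ||C||_HS^2 = sum_{i,j} <C v_i, u_j>^2, as in (1)
hs_sq = sum((U[:, j] @ (C @ V[:, i])) ** 2 for i in range(3) for j in range(4))
print(np.isclose(hs_sq, np.linalg.norm(C, "fro") ** 2))  # True
```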
Hilbert-Schmidt Operator A linear operator $C : \mathcal{G} \to \mathcal{F}$ is called a Hilbert-Schmidt
operator if its HS norm exists. The set of Hilbert-Schmidt operators $\mathrm{HS}(\mathcal{G}, \mathcal{F}) : \mathcal{G} \to \mathcal{F}$
is a separable Hilbert space with inner product
$$\langle C, D \rangle_{\mathrm{HS}} := \sum_{i,j} \langle C v_i, u_j \rangle_{\mathcal{F}} \langle D v_i, u_j \rangle_{\mathcal{F}}.$$
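In the matrix case this HS inner product reduces to the Frobenius inner product $\mathrm{tr}(C^\top D)$; a short check under the same finite-dimensional assumptions as above:

```python
import numpy as np

rng = np.random.default_rng(1)
C, D = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))
V, U = np.eye(3), np.eye(4)

# sum_{i,j} <C v_i, u_j> <D v_i, u_j> equals tr(C^T D)
hs_inner = sum((U[:, j] @ (C @ V[:, i])) * (U[:, j] @ (D @ V[:, i]))
               for i in range(3) for j in range(4))
print(np.isclose(hs_inner, np.trace(C.T @ D)))  # True
```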
Tensor Product Let $f \in \mathcal{F}$ and $g \in \mathcal{G}$. Then the tensor product operator $f \otimes g : \mathcal{G} \to \mathcal{F}$
is defined as
$$(f \otimes g)h := f \langle g, h \rangle_{\mathcal{G}} \quad \text{for all } h \in \mathcal{G}. \qquad (2)$$
Moreover, by the definition of the HS norm, we can compute the HS norm of $f \otimes g$ via
$$\|f \otimes g\|_{\mathrm{HS}}^2 = \langle f \otimes g, f \otimes g \rangle_{\mathrm{HS}} = \langle f, (f \otimes g) g \rangle_{\mathcal{F}} = \langle f, f \rangle_{\mathcal{F}} \langle g, g \rangle_{\mathcal{G}} = \|f\|_{\mathcal{F}}^2 \|g\|_{\mathcal{G}}^2. \qquad (3)$$
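In finite dimensions the tensor product $f \otimes g$ is just the outer product $f g^\top$, and (2) and (3) can be checked directly (an illustrative sketch, not from the report):

```python
import numpy as np

f = np.array([1.0, -2.0, 0.5])   # element of the "F" space
g = np.array([3.0, 4.0])         # element of the "G" space
h = np.array([0.2, -1.0])

T = np.outer(f, g)               # finite-dimensional f (tensor) g
print(np.allclose(T @ h, f * np.dot(g, h)))                                          # (2): (f ⊗ g) h = f <g, h>
print(np.isclose(np.linalg.norm(T, "fro"), np.linalg.norm(f) * np.linalg.norm(g)))   # (3): ||f ⊗ g||_HS = ||f|| ||g||
```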
2.2 The Cross-Covariance Operator
Mean We assume that $(\mathcal{X}, \Gamma)$ and $(\mathcal{Y}, \Lambda)$ are furnished with probability measures $p_x$, $p_y$
respectively ($\Gamma$ being the Borel sets on $\mathcal{X}$, and $\Lambda$ the Borel sets on $\mathcal{Y}$). We may now
define the mean elements with respect to these measures as those members of $\mathcal{F}$ and $\mathcal{G}$
respectively for which
$$\langle \mu_x, f \rangle_{\mathcal{F}} := \mathbf{E}_x[\langle \phi(x), f \rangle_{\mathcal{F}}] = \mathbf{E}_x[f(x)],$$
$$\langle \mu_y, g \rangle_{\mathcal{G}} := \mathbf{E}_y[\langle \psi(y), g \rangle_{\mathcal{G}}] = \mathbf{E}_y[g(y)]. \qquad (4)$$
⁴ For more detail on separable RKHSs and their properties, see [Hein and Bousquet, 2004] and references therein.
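A minimal sketch of the empirical counterpart of $\mu_x$ (an illustration under assumptions, not the report's construction): choosing $f = \phi(z)$ in (4) gives $\langle \mu_x, \phi(z) \rangle_{\mathcal{F}} = \mathbf{E}_x[k(x, z)]$, which can be estimated by averaging kernel evaluations over the sample; the Gaussian kernel here is an illustrative choice.

```python
import numpy as np

def rbf(X, Z, sigma=1.0):
    # cross-Gram matrix k(x_i, z_j) for a Gaussian kernel
    d2 = (np.sum(X ** 2, axis=1)[:, None] + np.sum(Z ** 2, axis=1)[None, :]
          - 2.0 * X @ Z.T)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mean_element_at(X_sample, Z_query, sigma=1.0):
    # <mu_x, phi(z)> ~ (1/m) sum_i k(x_i, z), one value per query point z
    return rbf(X_sample, Z_query, sigma).mean(axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1))
print(mean_element_at(X, np.array([[0.0], [3.0]])))  # larger near the bulk of the sample
```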

References

Amari, S., Cichocki, A., and Yang, H. H. A new learning algorithm for blind signal separation. In Advances in Neural Information Processing Systems 8, 1996.
Bell, A. J. and Sejnowski, T. J. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129–1159, 1995.
Devroye, L., Györfi, L., and Lugosi, G. A Probabilistic Theory of Pattern Recognition. Springer, New York, 1996.
Hoeffding, W. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13–30, 1963.
Hyvärinen, A., Karhunen, J., and Oja, E. Independent Component Analysis. John Wiley and Sons, New York, 2001.