An information-geometric characterization of Chernoff information
Frank Nielsen, Senior Member, IEEE
Sony Computer Science Laboratories, Inc.
3-14-13 Higashi Gotanda
141-0022 Shinagawa-ku, Tokyo, Japan
nielsen@csl.sony.co.jp
Abstract
The Chernoff information was originally introduced for bounding the probability of error of the Bayesian decision rule in binary hypothesis testing. Nowadays, it is often used as a notion of symmetric distance in statistical signal processing, or as a way to define a middle distribution in information fusion. Computing the Chernoff information requires solving an optimization problem that is numerically approximated in practice. We consider the Chernoff distance for distributions belonging to the same exponential family, which includes the Gaussian and multinomial families. By considering the geometry of the underlying statistical manifold, we characterize exactly the solution of the optimization problem as the unique intersection of a geodesic with a dual hyperplane. Furthermore, we prove analytically that the Chernoff distance amounts to calculating an equivalent but simpler Bregman divergence defined on the distribution parameters. This yields a closed-form formula for singly-parametric distributions, and an efficient geodesic bisection search for multi-parametric distributions. Finally, based on this information-geometric characterization, we propose three novel information-theoretic symmetric distances and middle distributions, two of which always admit closed-form expressions.
Index Terms
Chernoff information, exponential families, information geometry, Bregman divergence, information fusion.

Copyright (c) 2012 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@ieee.org.

Nielsen, F., "An Information-Geometric Characterization of Chernoff Information," IEEE Signal Processing Letters (SPL), vol. 20, no. 3, pp. 269-272, March 2013. doi: 10.1109/LSP.2013.2243726
I. INTRODUCTION
Let $(\mathcal{X}, \mathcal{E})$ be a measurable space with $\mathcal{X} \subseteq \mathbb{R}^d$ and $\mathcal{E}$ a $\sigma$-algebra on the set $\mathcal{X}$. The Chernoff information $C(P, Q)$ between two probability measures $P$ and $Q$, with $p$ and $q$ denoting their Radon-Nikodym densities with respect to a dominating measure$^1$ $\nu$, is defined as [2], [3]:

$$ C(P, Q) = -\log \min_{\alpha \in (0,1)} \int p^{\alpha}(x)\, q^{1-\alpha}(x)\, \mathrm{d}\nu(x). \qquad (1) $$
This notion of information was first introduced by Chernoff [2] (1952) for bounding the probability of error of a binary classification task. Namely, the Chernoff information is well known in information theory as the best achievable exponent for a Bayesian probability of error in binary hypothesis testing (see [3], Chapter 11). Nowadays, the Chernoff information is often used as a statistical distance for various applications of signal processing, ranging from sensor networks [4] to image processing tasks like image segmentation [5] or edge detection [6]. In fact, this notion of Chernoff distance can be understood as a generalization of the earlier Bhattacharyya distance [7], [8] (1943): let $c_\alpha(P : Q) = \int p^{\alpha}(x)\, q^{1-\alpha}(x)\, \mathrm{d}\nu(x) \in [0, 1)$ denote the $\alpha$-Chernoff coefficient of similarity, generalizing the Bhattacharyya coefficient (obtained for $\alpha = \frac{1}{2}$). The $\alpha$-Chernoff divergence$^2$

$$ C_\alpha(P : Q) = -\log c_\alpha(P : Q) \qquad (2) $$

generalizes the symmetric Bhattacharyya distance ($\alpha = \frac{1}{2}$). Thus we can interpret the Chernoff information as a maximization of the $\alpha$-Chernoff divergence over the range $\alpha \in (0, 1)$: $C(P, Q) = \max_{\alpha \in (0,1)} C_\alpha(P : Q)$. By construction, the Chernoff distance is symmetric:

$$ C(P, Q) = \max_{\alpha \in (0,1)} C_\alpha(P : Q) = \max_{\alpha \in (0,1)} C_{1-\alpha}(Q : P) = \max_{\beta \in (0,1)} C_\beta(Q : P) = C(Q, P), $$

making it attractive for information retrieval (IR). In information fusion [4], the Chernoff information $C(P, Q) = C_{\alpha^*}(P : Q)$ (where $\alpha^*$ denotes the optimal value) is used to define a middle distribution $m^*$ with density

$$ m^*(x) = \frac{p^{\alpha^*}(x)\, q^{1-\alpha^*}(x)}{c_{\alpha^*}(P : Q)}. $$

Merging probability distributions allows one to efficiently "compress" statistical models (e.g., simplify mixtures [10]).

Footnotes:
1. We use the measure-theoretic framework [1] to handle both continuous distributions (e.g., Gaussian, Beta, etc.) and discrete distributions (e.g., Bernoulli, Poisson, multinomial, etc.).
2. In information geometry [9], the $\alpha$-Chernoff divergence is also related to Amari's $\alpha$-divergence, $A_\alpha(P : Q) = \frac{4}{\alpha(1-\alpha)}\left(1 - \int p^{\alpha}(x)\, q^{1-\alpha}(x)\, \mathrm{d}\nu(x)\right) = \frac{4}{\alpha(1-\alpha)}\,(1 - c_\alpha(P : Q))$, or to the Rényi divergences.
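To make Eqs. (1)-(2) concrete, here is a minimal numerical sketch (not from the letter): it evaluates the $\alpha$-Chernoff divergence of two univariate Gaussians by brute-force quadrature and maximizes it over $\alpha \in (0,1)$ with a bounded one-dimensional search. The Gaussian parameters, the quadrature, and the SciPy optimizer are illustrative assumptions; closed forms exist for Gaussians, and the letter develops a far more efficient route below.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar


def norm_pdf(x, mu, sigma):
    """Density of a univariate normal N(mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))


def chernoff_alpha_divergence(alpha, p, q):
    """C_alpha(P:Q) = -log int p^alpha(x) q^(1-alpha)(x) dnu(x), by quadrature (Eq. (2))."""
    coeff, _ = quad(lambda x: p(x) ** alpha * q(x) ** (1.0 - alpha), -np.inf, np.inf)
    return -np.log(coeff)


def chernoff_information(p, q):
    """C(P,Q) = max over alpha in (0,1) of C_alpha(P:Q) (Eq. (1)), via a bounded 1D search."""
    res = minimize_scalar(lambda a: -chernoff_alpha_divergence(a, p, q),
                          bounds=(1e-6, 1.0 - 1e-6), method="bounded")
    return -res.fun, res.x  # (Chernoff information, optimal alpha*)


if __name__ == "__main__":
    p = lambda x: norm_pdf(x, 0.0, 1.0)   # illustrative parameters
    q = lambda x: norm_pdf(x, 2.0, 1.5)
    c_pq, a_pq = chernoff_information(p, q)
    c_qp, a_qp = chernoff_information(q, p)
    print(f"C(P,Q) = {c_pq:.6f} at alpha* = {a_pq:.4f}")
    print(f"C(Q,P) = {c_qp:.6f} at alpha* = {a_qp:.4f}")
```

Running both orders prints the same value with the optimal argument swapped to $1-\alpha^*$, illustrating the symmetry noted above.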
This letter is organized as follows: Section II considers distributions belonging to the same exponential family, reports a closed-form formula for the α-Chernoff divergences, and shows that the Chernoff information amounts to computing an equivalent Bregman divergence. Section III gives a geometric interpretation of the Chernoff distribution (the distribution achieving the Chernoff information) as the intersection of a primal geodesic with a dual hyperplane. Section IV presents three other types of Chernoff information and Chernoff middle distributions, with two of them admitting closed-form expressions. Finally, Section V concludes this work.
II. CHERNOFF INFORMATION AS A BREGMAN DIVERGENCE
A. Basics of exponential families
Let $\langle x, y\rangle$ denote the inner product for $x, y \in \mathcal{X}$, taken as the scalar product for vector spaces $\mathcal{X}$: $\langle x, y\rangle = x^\top y$. An exponential family [1] $\mathcal{F}_F$ is a set of probability measures $\mathcal{F}_F = \{P_\theta\}_\theta$ dominated by a measure $\nu$, having their Radon-Nikodym densities $p_\theta$ expressed canonically as:

$$ p_\theta(x) = \exp(\langle t(x), \theta\rangle - F(\theta) + k(x)), \qquad (3) $$

for $\theta$ belonging to the natural parameter space $\Theta = \{\theta \in \mathbb{R}^D \mid \int p_\theta(x)\, \mathrm{d}\nu(x) = 1\}$. Since $\log \int_{x \in \mathcal{X}} p_\theta(x)\, \mathrm{d}\nu(x) = \log 1 = 0$, it follows that:

$$ F(\theta) = \log \int \exp(\langle t(x), \theta\rangle + k(x))\, \mathrm{d}\nu(x). \qquad (4) $$

For full regular families [1], it can be proved that the function $F$ is strictly convex and differentiable over the open convex set $\Theta$. The function $F$ characterizes the family and bears different names in the literature (partition function, log-normalizer, or cumulant function), and the parameter $\theta$ (the natural parameter) defines the member $P_\theta$ of the family $\mathcal{F}_F$. Let $D = \dim(\Theta)$ denote the dimension of $\Theta$, the order of the family. The map $k(x) : \mathcal{X} \rightarrow \mathbb{R}$ is an auxiliary function defining a carrier measure $\xi$ with $\mathrm{d}\xi(x) = e^{k(x)}\, \mathrm{d}\nu(x)$. In practice, we often consider the Lebesgue measure $\nu_L$ defined over the Borel $\sigma$-algebra $\mathcal{E} = \mathcal{B}(\mathbb{R}^d)$ of $\mathbb{R}^d$ for continuous distributions (e.g., Gaussian), or the counting measure $\nu_C$ defined on the power set $\sigma$-algebra $\mathcal{E} = 2^{\mathcal{X}}$ for discrete distributions (e.g., the Poisson or multinomial families). The term $t(x)$ is a measure mapping called the sufficient statistic [1]. Many usual families of distributions $\{P_\lambda \mid \lambda \in \Lambda\}$ are exponential families [1] in disguise once an invertible mapping $\theta(\lambda) : \Lambda \rightarrow \Theta$ is elucidated and the density is written in the canonical form of Eq. 3. We refer to [1] for such decompositions for the Poisson, Gaussian, multinomial, and other distributions. Besides those well-known distributions, exponential families provide a generic framework in statistics. Indeed, any smooth density can be arbitrarily approximated by a member of an exponential family [11], although the cumulant function $F$ may only be defined implicitly (using Eq. 4).
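As a small illustration of the canonical form (3)-(4) (a sketch, not part of the letter), the Poisson family with rate $\lambda$ has the standard decomposition $t(x) = x$, $\theta = \log\lambda$, $F(\theta) = \exp(\theta)$, and $k(x) = -\log x!$; the check against scipy.stats below is an added convenience, not something the letter relies on.

```python
import numpy as np
from math import lgamma
from scipy.stats import poisson


def poisson_canonical_density(lmbda):
    """Poisson family in the canonical form of Eq. (3):
    t(x) = x, theta = log(lambda), F(theta) = exp(theta), k(x) = -log(x!)."""
    theta = np.log(lmbda)
    F = np.exp(theta)                   # log-normalizer evaluated at theta
    k = lambda x: -lgamma(x + 1.0)      # carrier term, -log(x!)
    return lambda x: np.exp(x * theta - F + k(x))


if __name__ == "__main__":
    lmbda = 3.5
    p_theta = poisson_canonical_density(lmbda)
    xs = np.arange(0, 60)
    # The canonical form reproduces the usual pmf, and the mass sums to ~1
    # under the counting measure (support truncated here for the check).
    assert np.allclose([p_theta(x) for x in xs], poisson.pmf(xs, lmbda))
    print("total mass over {0,...,59}:", sum(p_theta(x) for x in xs))
```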
B. Chernoff α-distance for exponential family members
For distributions $P_1$ and $P_2$ of the same exponential family $\mathcal{F}_F$, indexed by respective natural parameters $\theta_1$ and $\theta_2$, the $\alpha$-Chernoff coefficient can be expressed analytically [12] as:

$$ c_\alpha(P_1 : P_2) = \int p_1^{\alpha}(x)\, p_2^{1-\alpha}(x)\, \mathrm{d}\nu(x) = \exp\!\left(-J_F^{(\alpha)}(\theta_1 : \theta_2)\right), \qquad (5) $$

where $J_F^{(\alpha)}(\theta_1 : \theta_2)$ is a skew Jensen divergence defined for $F$ on the natural parameter space as:

$$ J_F^{(\alpha)}(\theta_1 : \theta_2) = \alpha F(\theta_1) + (1-\alpha) F(\theta_2) - F\!\left(\theta_{12}^{(\alpha)}\right), \qquad (6) $$

where $\theta_{12}^{(\alpha)} = \alpha\theta_1 + (1-\alpha)\theta_2 = \theta_2 - \alpha\Delta\theta$, with $\Delta\theta = \theta_2 - \theta_1$.
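A quick numerical sanity check of Eq. (5) (again a sketch; the Poisson family and the particular rates below are assumptions chosen for illustration): the closed-form coefficient $\exp(-J_F^{(\alpha)}(\theta_1 : \theta_2))$ is compared against a direct truncated sum of $p_1^\alpha p_2^{1-\alpha}$ over the support.

```python
import numpy as np
from math import lgamma


def poisson_pmf(x, lmbda):
    return np.exp(x * np.log(lmbda) - lmbda - lgamma(x + 1.0))


def skew_jensen_poisson(alpha, theta1, theta2):
    """J_F^(alpha)(theta1:theta2) for the Poisson log-normalizer F(theta) = exp(theta) (Eq. (6))."""
    F = np.exp
    return alpha * F(theta1) + (1.0 - alpha) * F(theta2) - F(alpha * theta1 + (1.0 - alpha) * theta2)


if __name__ == "__main__":
    lmbda1, lmbda2, alpha = 2.0, 9.0, 0.3
    theta1, theta2 = np.log(lmbda1), np.log(lmbda2)
    # Closed form of the alpha-Chernoff coefficient via Eq. (5) ...
    c_closed = np.exp(-skew_jensen_poisson(alpha, theta1, theta2))
    # ... versus a direct (truncated) sum of p1^alpha * p2^(1-alpha) over the support.
    xs = np.arange(0, 200)
    c_direct = sum(poisson_pmf(x, lmbda1) ** alpha * poisson_pmf(x, lmbda2) ** (1.0 - alpha) for x in xs)
    print(c_closed, c_direct)  # the two values should agree to high precision
```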
C. Chernoff distance for exponential family members
It follows that maximizing the $\alpha$-Chernoff divergence amounts equivalently to maximizing the skew Jensen divergence with respect to $\alpha$. The directional derivative of $F$ at $x$ in direction $u$ is defined (see [13], page 213) as $\mathrm{d}F(x; u) = \lim_{\tau \to 0} \frac{F(x + \tau u) - F(x)}{\tau}$. Since by definition $F(\theta) < \infty$ for all $\theta \in \Theta$, the limit always exists and $F$ is Gâteaux differentiable with:

$$ \mathrm{d}F(x; u) = \langle \nabla F(x), u\rangle. \qquad (7) $$

Therefore, we have:

$$ \frac{\mathrm{d}J_F^{(\alpha)}(\theta_1 : \theta_2)}{\mathrm{d}\alpha} = F(\theta_1) - F(\theta_2) - \mathrm{d}F\!\left(\theta_{12}^{(\alpha)}; \theta_1 - \theta_2\right) = F(\theta_1) - F(\theta_2) - \left\langle \nabla F\!\left(\theta_{12}^{(\alpha)}\right), \theta_1 - \theta_2\right\rangle. $$

Thus we need to find $\alpha^*$ such that:

$$ F(\theta_1) - F(\theta_2) - \left\langle \nabla F\!\left(\theta_{12}^{(\alpha^*)}\right), \theta_1 - \theta_2\right\rangle = 0. \qquad (8) $$

Since the Hessian of the cumulant function is positive definite [1] ($\nabla^2 F \succ 0$), the second derivative of the skew Jensen divergence, $-\left\langle \Delta\theta, \nabla^2 F\!\left(\theta_{12}^{(\alpha)}\right) \Delta\theta\right\rangle$, is always negative for $\theta_1 \neq \theta_2$. Therefore there is a unique solution $\alpha^*$, provided the members are distinct (if not, the Chernoff distance is obviously 0).
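Because $\mathrm{d}J_F^{(\alpha)}/\mathrm{d}\alpha$ is strictly decreasing in $\alpha$, the root $\alpha^*$ of Eq. (8) is bracketed on $(0,1)$ and can be located by simple bisection; for exponential families this one-dimensional search corresponds to the geodesic bisection mentioned in the abstract. The sketch below (not the letter's implementation) assumes a family specified by its log-normalizer $F$ and gradient $\nabla F$, here a hypothetical product of two independent Poisson distributions with $F(\theta) = \sum_i \exp(\theta_i)$; the function names and parameter values are illustrative.

```python
import numpy as np


def alpha_star_bisection(F, gradF, theta1, theta2, tol=1e-12):
    """Solve Eq. (8): <gradF(theta_alpha), theta1 - theta2> = F(theta1) - F(theta2)
    by bisection on alpha in (0,1); dJ/dalpha is strictly decreasing, so the root is unique."""
    target = F(theta1) - F(theta2)
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        theta_mid = mid * theta1 + (1.0 - mid) * theta2
        deriv = target - np.dot(gradF(theta_mid), theta1 - theta2)  # dJ/dalpha at mid
        if deriv > 0.0:
            lo = mid  # J is still increasing: alpha* lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Hypothetical example: product of independent Poissons, F(theta) = sum(exp(theta)).
    F = lambda th: np.sum(np.exp(th))
    gradF = lambda th: np.exp(th)
    theta1 = np.log(np.array([2.0, 5.0]))
    theta2 = np.log(np.array([7.0, 1.5]))
    print("alpha* =", alpha_star_bisection(F, gradF, theta1, theta2))
```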
D. Chernoff distance as a Bregman divergence
Our first result states that the Chernoff information between any two distributions belonging to the same exponential family amounts to calculating, equivalently, a Bregman divergence defined on the natural parameter space, where the Bregman divergence [14] between $\theta$ and $\theta'$ is defined by setting the generator $F$ to the log-normalizer of the exponential family:

$$ B_F(\theta : \theta') = F(\theta) - F(\theta') - \langle \theta - \theta', \nabla F(\theta')\rangle. \qquad (9) $$

Theorem 1: The Chernoff distance between two distinct distributions $P_1$ and $P_2$ of the same exponential family, with respective natural parameters $\theta_1$ and $\theta_2$, amounts to calculating a Bregman divergence: $C(P_1, P_2) = B_F\!\left(\theta_1 : \theta_{12}^{(\alpha^*)}\right)$, where $\alpha^*$ is the unique value of $\alpha$ satisfying $\left\langle \nabla F\!\left(\theta_{12}^{(\alpha)}\right), \theta_1 - \theta_2\right\rangle = F(\theta_1) - F(\theta_2)$, and $\theta_{12}^{(\alpha)} = \alpha\theta_1 + (1-\alpha)\theta_2$.
Proof: Once the optimal value $\alpha^*$ has been computed, we calculate the Chernoff distance using Eq. 2, which reduces for exponential families to a skew Jensen divergence: $C(P_1, P_2) = -\log c_{\alpha^*}(P_1 : P_2) = J_F^{(\alpha^*)}(\theta_1 : \theta_2)$. This skew Jensen divergence at the optimal value $\alpha^*$ yields, in turn, a Bregman divergence:

$$ J_F^{(\alpha^*)}(\theta_1 : \theta_2) = B_F\!\left(\theta_1 : \theta_{12}^{(\alpha^*)}\right) = B_F\!\left(\theta_2 : \theta_{12}^{(\alpha^*)}\right). \qquad (10) $$

Indeed, from the definition of the Bregman divergence and the fact that $\theta_1 - \theta_{12}^{(\alpha^*)} = (1 - \alpha^*)(\theta_1 - \theta_2)$, it follows that $B_F\!\left(\theta_1 : \theta_{12}^{(\alpha^*)}\right) = F(\theta_1) - F\!\left(\theta_{12}^{(\alpha^*)}\right) - (1 - \alpha^*)\left\langle \theta_1 - \theta_2, \nabla F\!\left(\theta_{12}^{(\alpha^*)}\right)\right\rangle$. Furthermore, since $\left\langle \nabla F\!\left(\theta_{12}^{(\alpha^*)}\right), \theta_1 - \theta_2\right\rangle = F(\theta_1) - F(\theta_2)$, it follows that $B_F\!\left(\theta_1 : \theta_{12}^{(\alpha^*)}\right) = F(\theta_1) - F\!\left(\theta_{12}^{(\alpha^*)}\right) - (1 - \alpha^*) F(\theta_1) + (1 - \alpha^*) F(\theta_2) = \alpha^* F(\theta_1) + (1 - \alpha^*) F(\theta_2) - F\!\left(\theta_{12}^{(\alpha^*)}\right) = J_F^{(\alpha^*)}(\theta_1 : \theta_2)$.
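The identity of Eq. (10) can be cross-checked numerically. The following sketch (an illustration under the same hypothetical product-of-Poissons assumption used above, with SciPy's brentq solving Eq. (8)) prints the skew Jensen divergence at $\alpha^*$ alongside the two Bregman divergences of Theorem 1; the three values should agree up to numerical precision.

```python
import numpy as np
from scipy.optimize import brentq

# Hypothetical family: product of two independent Poissons, F(theta) = sum(exp(theta)).
F = lambda th: np.sum(np.exp(th))
gradF = lambda th: np.exp(th)

theta1 = np.log(np.array([2.0, 5.0]))
theta2 = np.log(np.array([7.0, 1.5]))


def skew_jensen(alpha):
    """J_F^(alpha)(theta1:theta2), Eq. (6)."""
    return alpha * F(theta1) + (1 - alpha) * F(theta2) - F(alpha * theta1 + (1 - alpha) * theta2)


def bregman(theta, theta_prime):
    """B_F(theta:theta_prime), Eq. (9)."""
    return F(theta) - F(theta_prime) - np.dot(theta - theta_prime, gradF(theta_prime))


# alpha* is the unique root of dJ/dalpha (Eq. (8)), bracketed on (0,1).
djda = lambda a: F(theta1) - F(theta2) - np.dot(gradF(a * theta1 + (1 - a) * theta2), theta1 - theta2)
alpha_star = brentq(djda, 1e-9, 1 - 1e-9)
theta_star = alpha_star * theta1 + (1 - alpha_star) * theta2

# Theorem 1 / Eq. (10): the three quantities below should coincide.
print(skew_jensen(alpha_star), bregman(theta1, theta_star), bregman(theta2, theta_star))
```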
Note that for singly-parametric distributions ($D = 1$), we get a closed-form expression for the Chernoff distance since

$$ \alpha^* = \frac{(F')^{-1}\!\left(\frac{F(\theta_1) - F(\theta_2)}{\theta_1 - \theta_2}\right) - \theta_2}{\theta_1 - \theta_2}. $$

To illustrate the formula, consider the Poisson family.
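As a hedged illustration of the closed-form formula (a sketch, not the letter's own worked example, which is not reproduced in this extract): the Poisson family has $F(\theta) = \exp(\theta)$ with $\theta = \log\lambda$, hence $(F')^{-1}(y) = \log y$, which gives $\alpha^*$ and the Chernoff distance in closed form. The rates and the final numerical cross-check are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Poisson family in canonical form: theta = log(lambda), F(theta) = exp(theta),
# so F'(theta) = exp(theta) and (F')^{-1}(y) = log(y).
F = np.exp


def alpha_star_poisson(lmbda1, lmbda2):
    """Closed-form optimal alpha for two Poisson distributions (D = 1 case)."""
    t1, t2 = np.log(lmbda1), np.log(lmbda2)
    return (np.log((F(t1) - F(t2)) / (t1 - t2)) - t2) / (t1 - t2)


def chernoff_poisson(lmbda1, lmbda2):
    """C(P1,P2) = J_F^(alpha*)(theta1:theta2) in closed form."""
    t1, t2 = np.log(lmbda1), np.log(lmbda2)
    a = alpha_star_poisson(lmbda1, lmbda2)
    return a * F(t1) + (1 - a) * F(t2) - F(a * t1 + (1 - a) * t2)


if __name__ == "__main__":
    l1, l2 = 2.0, 9.0
    print("closed-form alpha* =", alpha_star_poisson(l1, l2), " C =", chernoff_poisson(l1, l2))
    # Sanity check against a direct numerical maximization of the skew Jensen divergence.
    t1, t2 = np.log(l1), np.log(l2)
    J = lambda a: a * F(t1) + (1 - a) * F(t2) - F(a * t1 + (1 - a) * t2)
    res = minimize_scalar(lambda a: -J(a), bounds=(1e-6, 1 - 1e-6), method="bounded")
    print("numeric alpha* =", res.x, " numeric C =", -res.fun)
```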

References (as listed in this extract)
[2] H. Chernoff, "A Measure of Asymptotic Efficiency for Tests of a Hypothesis Based on the Sum of Observations," Annals of Mathematical Statistics, 1952.
[3] T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley.