An information-geometric characterization of Chernoff information
Frank Nielsen, Senior Member, IEEE
Sony Computer Science Laboratories, Inc.
3-14-13 Higashi Gotanda
141-0022 Shinagawa-ku, Tokyo, Japan
nielsen@csl.sony.co.jp
Abstract
The Chernoff information was originally introduced for bounding the probability of error of the Bayesian decision rule in binary hypothesis testing. Nowadays, it is often used as a notion of symmetric distance in statistical signal processing, or as a way to define a middle distribution in information fusion. Computing the Chernoff information requires solving an optimization problem that is numerically approximated in practice. We consider the Chernoff distance for distributions belonging to the same exponential family, which includes the Gaussian and multinomial families. By considering the geometry of the underlying statistical manifold, we characterize exactly the solution of the optimization problem as the unique intersection of a geodesic with a dual hyperplane. Furthermore, we prove analytically that the Chernoff distance amounts to calculating an equivalent but simpler Bregman divergence defined on the distribution parameters. This yields a closed-form formula for singly-parametric distributions, and an efficient geodesic bisection search for multi-parametric distributions. Finally, based on this information-geometric characterization, we propose three novel information-theoretic symmetric distances and middle distributions, two of which always admit closed-form expressions.
Index Terms
Chernoff information, exponential families, information geometry, Bregman divergence, information fusion.

Copyright (c) 2012 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@ieee.org.

Nielsen, F., "An Information-Geometric Characterization of Chernoff Information," IEEE Signal Processing Letters (SPL), vol. 20, no. 3, pp. 269-272, March 2013. doi: 10.1109/LSP.2013.2243726
I. INTRODUCTION
Let $(\mathcal{X}, \mathcal{E})$ be a measurable space with $\mathcal{X} \subseteq \mathbb{R}^d$ and $\mathcal{E}$ a $\sigma$-algebra on the set $\mathcal{X}$. The Chernoff information $C(P, Q)$ between two probability measures $P$ and $Q$, with $p$ and $q$ denoting their Radon-Nikodym densities with respect to a dominating measure$^1$ $\nu$, is defined as [2], [3]:

$$ C(P, Q) = -\log \min_{\alpha \in (0,1)} \int p^{\alpha}(x)\, q^{1-\alpha}(x)\, \mathrm{d}\nu(x). \qquad (1) $$
This notion of information was first introduced by Chernoff [2] (1952) for bounding the probability of error of a binary classification task. Namely, the Chernoff information is well known in information theory as the best achievable exponent for a Bayesian probability of error in binary hypothesis testing (see [3], Chapter 11). Nowadays, the Chernoff information is often used as a statistical distance for various applications of signal processing, ranging from sensor networks [4] to image processing tasks like image segmentation [5] or edge detection [6]. In fact, this notion of Chernoff distance can be understood as a generalization of the earlier Bhattacharyya distance [7], [8] (1943): let $c_\alpha(P : Q) = \int p^{\alpha}(x)\, q^{1-\alpha}(x)\, \mathrm{d}\nu(x) \in [0, 1)$ denote the $\alpha$-Chernoff coefficient of similarity, generalizing the Bhattacharyya coefficient (obtained for $\alpha = \frac{1}{2}$). The $\alpha$-Chernoff divergence$^2$

$$ C_\alpha(P : Q) = -\log c_\alpha(P : Q) \qquad (2) $$

generalizes the symmetric Bhattacharyya distance ($\alpha = \frac{1}{2}$). Thus we can interpret the Chernoff information as a maximization of the $\alpha$-Chernoff divergence over the range $\alpha \in (0, 1)$: $C(P, Q) = \max_{\alpha \in (0,1)} C_\alpha(P : Q)$. By construction, the Chernoff distance is symmetric:

$$ C(P, Q) = \max_{\alpha \in (0,1)} C_\alpha(P : Q) = \max_{\alpha \in (0,1)} C_{1-\alpha}(Q : P) = \max_{\beta \in (0,1)} C_\beta(Q : P) = C(Q, P), $$

making it attractive for information retrieval (IR). In information fusion [4], the Chernoff information $C(P, Q) = C_{\alpha^*}(P : Q)$ (where $\alpha^*$ denotes the optimal value) is used to define a middle distribution $m^*$ with density

$$ m^*(x) = \frac{p^{\alpha^*}(x)\, q^{1-\alpha^*}(x)}{c_{\alpha^*}(P : Q)}. $$

Merging probability distributions allows one to efficiently "compress" statistical models (e.g., simplify mixtures [10]).

Footnotes:
1. We use the measure-theoretic framework [1] to handle both continuous distributions (e.g., Gaussian, Beta, etc.) and discrete distributions (e.g., Bernoulli, Poisson, multinomial, etc.).
2. In information geometry [9], the $\alpha$-Chernoff divergence is also related to Amari's $\alpha$-divergence, $A_\alpha(P : Q) = \frac{4}{\alpha(1-\alpha)}\left(1 - \int p^{\alpha}(x)\, q^{1-\alpha}(x)\, \mathrm{d}\nu(x)\right) = \frac{4}{\alpha(1-\alpha)}\,(1 - c_\alpha(P : Q))$, or to the Rényi divergences.
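To make Eqs. (1)-(2) concrete, here is a minimal numerical sketch (not from the letter): it evaluates the $\alpha$-Chernoff divergence of two univariate Gaussians by brute-force quadrature and maximizes it over $\alpha \in (0,1)$ with a bounded one-dimensional search. The Gaussian parameters, the quadrature, and the SciPy optimizer are illustrative assumptions; closed forms exist for Gaussians, and the letter develops a far more efficient route below.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar


def norm_pdf(x, mu, sigma):
    """Density of a univariate normal N(mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))


def chernoff_alpha_divergence(alpha, p, q):
    """C_alpha(P:Q) = -log int p^alpha(x) q^(1-alpha)(x) dnu(x), by quadrature (Eq. (2))."""
    coeff, _ = quad(lambda x: p(x) ** alpha * q(x) ** (1.0 - alpha), -np.inf, np.inf)
    return -np.log(coeff)


def chernoff_information(p, q):
    """C(P,Q) = max over alpha in (0,1) of C_alpha(P:Q) (Eq. (1)), via a bounded 1D search."""
    res = minimize_scalar(lambda a: -chernoff_alpha_divergence(a, p, q),
                          bounds=(1e-6, 1.0 - 1e-6), method="bounded")
    return -res.fun, res.x  # (Chernoff information, optimal alpha*)


if __name__ == "__main__":
    p = lambda x: norm_pdf(x, 0.0, 1.0)   # illustrative parameters
    q = lambda x: norm_pdf(x, 2.0, 1.5)
    c_pq, a_pq = chernoff_information(p, q)
    c_qp, a_qp = chernoff_information(q, p)
    print(f"C(P,Q) = {c_pq:.6f} at alpha* = {a_pq:.4f}")
    print(f"C(Q,P) = {c_qp:.6f} at alpha* = {a_qp:.4f}")
```

Running both orders prints the same value with the optimal argument swapped to $1-\alpha^*$, illustrating the symmetry noted above.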
This letter is organized as follows: Section II considers distributions belonging to the same exponential family, reports a closed-form formula for the α-Chernoff divergences, and shows that the Chernoff information amounts to computing an equivalent Bregman divergence. Section III gives a geometric interpretation of the Chernoff distribution (the distribution achieving the Chernoff information) as the intersection of a primal geodesic with a dual hyperplane. Section IV presents three other types of Chernoff information and Chernoff middle distributions, with two of them admitting closed-form expressions. Finally, Section V concludes this work.
II. CHERNOFF INFORMATION AS A BREGMAN DIVERGENCE
A. Basics of exponential families
Let $\langle x, y\rangle$ denote the inner product for $x, y \in \mathcal{X}$, taken as the scalar product for vector spaces $\mathcal{X}$: $\langle x, y\rangle = x^\top y$. An exponential family [1] $\mathcal{F}_F$ is a set of probability measures $\mathcal{F}_F = \{P_\theta\}_\theta$ dominated by a measure $\nu$, having their Radon-Nikodym densities $p_\theta$ expressed canonically as:

$$ p_\theta(x) = \exp(\langle t(x), \theta\rangle - F(\theta) + k(x)), \qquad (3) $$

for $\theta$ belonging to the natural parameter space $\Theta = \{\theta \in \mathbb{R}^D \mid \int p_\theta(x)\, \mathrm{d}\nu(x) = 1\}$. Since $\log \int_{x \in \mathcal{X}} p_\theta(x)\, \mathrm{d}\nu(x) = \log 1 = 0$, it follows that:

$$ F(\theta) = \log \int \exp(\langle t(x), \theta\rangle + k(x))\, \mathrm{d}\nu(x). \qquad (4) $$

For full regular families [1], it can be proved that the function $F$ is strictly convex and differentiable over the open convex set $\Theta$. The function $F$ characterizes the family and bears different names in the literature (partition function, log-normalizer, or cumulant function), and the parameter $\theta$ (the natural parameter) defines the member $P_\theta$ of the family $\mathcal{F}_F$. Let $D = \dim(\Theta)$ denote the dimension of $\Theta$, the order of the family. The map $k(x) : \mathcal{X} \rightarrow \mathbb{R}$ is an auxiliary function defining a carrier measure $\xi$ with $\mathrm{d}\xi(x) = e^{k(x)}\, \mathrm{d}\nu(x)$. In practice, we often consider the Lebesgue measure $\nu_L$ defined over the Borel $\sigma$-algebra $\mathcal{E} = \mathcal{B}(\mathbb{R}^d)$ of $\mathbb{R}^d$ for continuous distributions (e.g., Gaussian), or the counting measure $\nu_C$ defined on the power set $\sigma$-algebra $\mathcal{E} = 2^{\mathcal{X}}$ for discrete distributions (e.g., the Poisson or multinomial families). The term $t(x)$ is a measure mapping called the sufficient statistic [1]. Many usual families of distributions $\{P_\lambda \mid \lambda \in \Lambda\}$ are exponential families [1] in disguise once an invertible mapping $\theta(\lambda) : \Lambda \rightarrow \Theta$ is elucidated and the density is written in the canonical form of Eq. 3. We refer to [1] for such decompositions for the Poisson, Gaussian, multinomial, and other distributions. Besides those well-known distributions, exponential families provide a generic framework in statistics. Indeed, any smooth density can be arbitrarily approximated by a member of an exponential family [11], although the cumulant function $F$ may only be defined implicitly (using Eq. 4).
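As a small illustration of the canonical form (3)-(4) (a sketch, not part of the letter), the Poisson family with rate $\lambda$ has the standard decomposition $t(x) = x$, $\theta = \log\lambda$, $F(\theta) = \exp(\theta)$, and $k(x) = -\log x!$; the check against scipy.stats below is an added convenience, not something the letter relies on.

```python
import numpy as np
from math import lgamma
from scipy.stats import poisson


def poisson_canonical_density(lmbda):
    """Poisson family in the canonical form of Eq. (3):
    t(x) = x, theta = log(lambda), F(theta) = exp(theta), k(x) = -log(x!)."""
    theta = np.log(lmbda)
    F = np.exp(theta)                   # log-normalizer evaluated at theta
    k = lambda x: -lgamma(x + 1.0)      # carrier term, -log(x!)
    return lambda x: np.exp(x * theta - F + k(x))


if __name__ == "__main__":
    lmbda = 3.5
    p_theta = poisson_canonical_density(lmbda)
    xs = np.arange(0, 60)
    # The canonical form reproduces the usual pmf, and the mass sums to ~1
    # under the counting measure (support truncated here for the check).
    assert np.allclose([p_theta(x) for x in xs], poisson.pmf(xs, lmbda))
    print("total mass over {0,...,59}:", sum(p_theta(x) for x in xs))
```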
B. Chernoff α-distance for exponential family members
For distributions $P_1$ and $P_2$ of the same exponential family $\mathcal{F}_F$, indexed by respective natural parameters $\theta_1$ and $\theta_2$, the $\alpha$-Chernoff coefficient can be expressed analytically [12] as:

$$ c_\alpha(P_1 : P_2) = \int p_1^{\alpha}(x)\, p_2^{1-\alpha}(x)\, \mathrm{d}\nu(x) = \exp\!\left(-J_F^{(\alpha)}(\theta_1 : \theta_2)\right), \qquad (5) $$

where $J_F^{(\alpha)}(\theta_1 : \theta_2)$ is a skew Jensen divergence defined for $F$ on the natural parameter space as:

$$ J_F^{(\alpha)}(\theta_1 : \theta_2) = \alpha F(\theta_1) + (1-\alpha) F(\theta_2) - F\!\left(\theta_{12}^{(\alpha)}\right), \qquad (6) $$

where $\theta_{12}^{(\alpha)} = \alpha\theta_1 + (1-\alpha)\theta_2 = \theta_2 - \alpha\Delta\theta$, with $\Delta\theta = \theta_2 - \theta_1$.
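A quick numerical sanity check of Eq. (5) (again a sketch; the Poisson family and the particular rates below are assumptions chosen for illustration): the closed-form coefficient $\exp(-J_F^{(\alpha)}(\theta_1 : \theta_2))$ is compared against a direct truncated sum of $p_1^\alpha p_2^{1-\alpha}$ over the support.

```python
import numpy as np
from math import lgamma


def poisson_pmf(x, lmbda):
    return np.exp(x * np.log(lmbda) - lmbda - lgamma(x + 1.0))


def skew_jensen_poisson(alpha, theta1, theta2):
    """J_F^(alpha)(theta1:theta2) for the Poisson log-normalizer F(theta) = exp(theta) (Eq. (6))."""
    F = np.exp
    return alpha * F(theta1) + (1.0 - alpha) * F(theta2) - F(alpha * theta1 + (1.0 - alpha) * theta2)


if __name__ == "__main__":
    lmbda1, lmbda2, alpha = 2.0, 9.0, 0.3
    theta1, theta2 = np.log(lmbda1), np.log(lmbda2)
    # Closed form of the alpha-Chernoff coefficient via Eq. (5) ...
    c_closed = np.exp(-skew_jensen_poisson(alpha, theta1, theta2))
    # ... versus a direct (truncated) sum of p1^alpha * p2^(1-alpha) over the support.
    xs = np.arange(0, 200)
    c_direct = sum(poisson_pmf(x, lmbda1) ** alpha * poisson_pmf(x, lmbda2) ** (1.0 - alpha) for x in xs)
    print(c_closed, c_direct)  # the two values should agree to high precision
```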
C. Chernoff distance for exponential family members
It follows that maximizing the $\alpha$-Chernoff divergence amounts equivalently to maximizing the skew Jensen divergence with respect to $\alpha$. The directional derivative of $F$ at $x$ in direction $u$ is defined (see [13], page 213) as $\mathrm{d}F(x; u) = \lim_{\tau \to 0} \frac{F(x + \tau u) - F(x)}{\tau}$. Since by definition $F(\theta) < \infty$ for all $\theta \in \Theta$, the limit always exists and $F$ is Gâteaux differentiable with:

$$ \mathrm{d}F(x; u) = \langle \nabla F(x), u\rangle. \qquad (7) $$

Therefore, we have:

$$ \frac{\mathrm{d}J_F^{(\alpha)}(\theta_1 : \theta_2)}{\mathrm{d}\alpha} = F(\theta_1) - F(\theta_2) - \mathrm{d}F\!\left(\theta_{12}^{(\alpha)}; \theta_1 - \theta_2\right) = F(\theta_1) - F(\theta_2) - \left\langle \nabla F\!\left(\theta_{12}^{(\alpha)}\right), \theta_1 - \theta_2\right\rangle. $$

Thus we need to find $\alpha^*$ such that:

$$ F(\theta_1) - F(\theta_2) - \left\langle \nabla F\!\left(\theta_{12}^{(\alpha^*)}\right), \theta_1 - \theta_2\right\rangle = 0. \qquad (8) $$

Since the Hessian of the cumulant function is positive definite [1] ($\nabla^2 F \succ 0$), the second derivative of the skew Jensen divergence, $-\left\langle \Delta\theta, \nabla^2 F\!\left(\theta_{12}^{(\alpha)}\right) \Delta\theta\right\rangle$, is always negative for $\theta_1 \neq \theta_2$. Therefore there is a unique solution $\alpha^*$, provided the members are distinct (if not, the Chernoff distance is obviously 0).
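Because $\mathrm{d}J_F^{(\alpha)}/\mathrm{d}\alpha$ is strictly decreasing in $\alpha$, the root $\alpha^*$ of Eq. (8) is bracketed on $(0,1)$ and can be located by simple bisection; for exponential families this one-dimensional search corresponds to the geodesic bisection mentioned in the abstract. The sketch below (not the letter's implementation) assumes a family specified by its log-normalizer $F$ and gradient $\nabla F$, here a hypothetical product of two independent Poisson distributions with $F(\theta) = \sum_i \exp(\theta_i)$; the function names and parameter values are illustrative.

```python
import numpy as np


def alpha_star_bisection(F, gradF, theta1, theta2, tol=1e-12):
    """Solve Eq. (8): <gradF(theta_alpha), theta1 - theta2> = F(theta1) - F(theta2)
    by bisection on alpha in (0,1); dJ/dalpha is strictly decreasing, so the root is unique."""
    target = F(theta1) - F(theta2)
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        theta_mid = mid * theta1 + (1.0 - mid) * theta2
        deriv = target - np.dot(gradF(theta_mid), theta1 - theta2)  # dJ/dalpha at mid
        if deriv > 0.0:
            lo = mid  # J is still increasing: alpha* lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Hypothetical example: product of independent Poissons, F(theta) = sum(exp(theta)).
    F = lambda th: np.sum(np.exp(th))
    gradF = lambda th: np.exp(th)
    theta1 = np.log(np.array([2.0, 5.0]))
    theta2 = np.log(np.array([7.0, 1.5]))
    print("alpha* =", alpha_star_bisection(F, gradF, theta1, theta2))
```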
D. Chernoff distance as a Bregman divergence
Our first result states that the Chernoff information between any two distributions belonging to the same exponential family amounts to calculating, equivalently, a Bregman divergence defined on the natural parameter space, where the Bregman divergence [14] between $\theta$ and $\theta'$ is defined by setting the generator $F$ to the log-normalizer of the exponential family:

$$ B_F(\theta : \theta') = F(\theta) - F(\theta') - \langle \theta - \theta', \nabla F(\theta')\rangle. \qquad (9) $$

Theorem 1: The Chernoff distance between two distinct distributions $P_1$ and $P_2$ of the same exponential family, with respective natural parameters $\theta_1$ and $\theta_2$, amounts to calculating a Bregman divergence: $C(P_1, P_2) = B_F\!\left(\theta_1 : \theta_{12}^{(\alpha^*)}\right)$, where $\alpha^*$ is the unique value of $\alpha$ satisfying $\left\langle \nabla F\!\left(\theta_{12}^{(\alpha)}\right), \theta_1 - \theta_2\right\rangle = F(\theta_1) - F(\theta_2)$, and $\theta_{12}^{(\alpha)} = \alpha\theta_1 + (1-\alpha)\theta_2$.
Proof: Once the optimal value $\alpha^*$ has been computed, we calculate the Chernoff distance using Eq. 2, which reduces for exponential families to a skew Jensen divergence: $C(P_1, P_2) = -\log c_{\alpha^*}(P_1 : P_2) = J_F^{(\alpha^*)}(\theta_1 : \theta_2)$. This skew Jensen divergence at the optimal value $\alpha^*$ yields, in turn, a Bregman divergence:

$$ J_F^{(\alpha^*)}(\theta_1 : \theta_2) = B_F\!\left(\theta_1 : \theta_{12}^{(\alpha^*)}\right) = B_F\!\left(\theta_2 : \theta_{12}^{(\alpha^*)}\right). \qquad (10) $$

Indeed, from the definition of the Bregman divergence and the fact that $\theta_1 - \theta_{12}^{(\alpha^*)} = (1 - \alpha^*)(\theta_1 - \theta_2)$, it follows that $B_F\!\left(\theta_1 : \theta_{12}^{(\alpha^*)}\right) = F(\theta_1) - F\!\left(\theta_{12}^{(\alpha^*)}\right) - (1 - \alpha^*)\left\langle \theta_1 - \theta_2, \nabla F\!\left(\theta_{12}^{(\alpha^*)}\right)\right\rangle$. Furthermore, since $\left\langle \nabla F\!\left(\theta_{12}^{(\alpha^*)}\right), \theta_1 - \theta_2\right\rangle = F(\theta_1) - F(\theta_2)$, it follows that $B_F\!\left(\theta_1 : \theta_{12}^{(\alpha^*)}\right) = F(\theta_1) - F\!\left(\theta_{12}^{(\alpha^*)}\right) - (1 - \alpha^*) F(\theta_1) + (1 - \alpha^*) F(\theta_2) = \alpha^* F(\theta_1) + (1 - \alpha^*) F(\theta_2) - F\!\left(\theta_{12}^{(\alpha^*)}\right) = J_F^{(\alpha^*)}(\theta_1 : \theta_2)$.
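The identity of Eq. (10) can be cross-checked numerically. The following sketch (an illustration under the same hypothetical product-of-Poissons assumption used above, with SciPy's brentq solving Eq. (8)) prints the skew Jensen divergence at $\alpha^*$ alongside the two Bregman divergences of Theorem 1; the three values should agree up to numerical precision.

```python
import numpy as np
from scipy.optimize import brentq

# Hypothetical family: product of two independent Poissons, F(theta) = sum(exp(theta)).
F = lambda th: np.sum(np.exp(th))
gradF = lambda th: np.exp(th)

theta1 = np.log(np.array([2.0, 5.0]))
theta2 = np.log(np.array([7.0, 1.5]))


def skew_jensen(alpha):
    """J_F^(alpha)(theta1:theta2), Eq. (6)."""
    return alpha * F(theta1) + (1 - alpha) * F(theta2) - F(alpha * theta1 + (1 - alpha) * theta2)


def bregman(theta, theta_prime):
    """B_F(theta:theta_prime), Eq. (9)."""
    return F(theta) - F(theta_prime) - np.dot(theta - theta_prime, gradF(theta_prime))


# alpha* is the unique root of dJ/dalpha (Eq. (8)), bracketed on (0,1).
djda = lambda a: F(theta1) - F(theta2) - np.dot(gradF(a * theta1 + (1 - a) * theta2), theta1 - theta2)
alpha_star = brentq(djda, 1e-9, 1 - 1e-9)
theta_star = alpha_star * theta1 + (1 - alpha_star) * theta2

# Theorem 1 / Eq. (10): the three quantities below should coincide.
print(skew_jensen(alpha_star), bregman(theta1, theta_star), bregman(theta2, theta_star))
```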
Note that for singly-parametric distributions ($D = 1$), we get a closed-form expression for the Chernoff distance since

$$ \alpha^* = \frac{(F')^{-1}\!\left(\frac{F(\theta_1) - F(\theta_2)}{\theta_1 - \theta_2}\right) - \theta_2}{\theta_1 - \theta_2}. $$

To illustrate the formula, consider the Poisson family.
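As a hedged illustration of the closed-form formula (a sketch, not the letter's own worked example, which is not reproduced in this extract): the Poisson family has $F(\theta) = \exp(\theta)$ with $\theta = \log\lambda$, hence $(F')^{-1}(y) = \log y$, which gives $\alpha^*$ and the Chernoff distance in closed form. The rates and the final numerical cross-check are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Poisson family in canonical form: theta = log(lambda), F(theta) = exp(theta),
# so F'(theta) = exp(theta) and (F')^{-1}(y) = log(y).
F = np.exp


def alpha_star_poisson(lmbda1, lmbda2):
    """Closed-form optimal alpha for two Poisson distributions (D = 1 case)."""
    t1, t2 = np.log(lmbda1), np.log(lmbda2)
    return (np.log((F(t1) - F(t2)) / (t1 - t2)) - t2) / (t1 - t2)


def chernoff_poisson(lmbda1, lmbda2):
    """C(P1,P2) = J_F^(alpha*)(theta1:theta2) in closed form."""
    t1, t2 = np.log(lmbda1), np.log(lmbda2)
    a = alpha_star_poisson(lmbda1, lmbda2)
    return a * F(t1) + (1 - a) * F(t2) - F(a * t1 + (1 - a) * t2)


if __name__ == "__main__":
    l1, l2 = 2.0, 9.0
    print("closed-form alpha* =", alpha_star_poisson(l1, l2), " C =", chernoff_poisson(l1, l2))
    # Sanity check against a direct numerical maximization of the skew Jensen divergence.
    t1, t2 = np.log(l1), np.log(l2)
    J = lambda a: a * F(t1) + (1 - a) * F(t2) - F(a * t1 + (1 - a) * t2)
    res = minimize_scalar(lambda a: -J(a), bounds=(1e-6, 1 - 1e-6), method="bounded")
    print("numeric alpha* =", res.x, " numeric C =", -res.fun)
```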

References (as listed in this extract)
[2] H. Chernoff, "A Measure of Asymptotic Efficiency for Tests of a Hypothesis Based on the Sum of Observations," Annals of Mathematical Statistics, 1952.
[3] T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley.