IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. IT-27, NO. 4, JULY 1981

Maximum Entropy and Conditional Probability

JAN M. VAN CAMPENHOUT AND THOMAS M. COVER, FELLOW, IEEE

Manuscript received August 22, 1978; revised October 2, 1980. This work was supported by the National Science Foundation under Grant ENG 76-03684 and by JSEP under DAAG 29-79-C-0047. It was presented at the IEEE International Symposium on Information Theory, Grignano, Italy, June 25-29, 1979.

J. M. Van Campenhout is with the Electronics Laboratory, State University of Ghent, Ghent, Belgium.

T. M. Cover is with the Departments of Electrical Engineering and Statistics, Stanford University, Stanford, CA 94305.
Abstract: It is well-known that maximum entropy distributions, subject to appropriate moment constraints, arise in physics and mathematics. In an attempt to find a physical reason for the appearance of maximum entropy distributions, the following theorem is offered. The conditional distribution of $X_1$, given the empirical observation $(1/n)\sum_{i=1}^{n} h(X_i) = \alpha$, where $X_1, X_2, \cdots$ are independent identically distributed random variables with common density $g$, converges to $f_\lambda(x) = e^{\lambda^t h(x)} g(x)$ (suitably normalized), where $\lambda$ is chosen to satisfy $\int f_\lambda(x) h(x)\,dx = \alpha$. Thus the conditional distribution of a given random variable $X$ is the (normalized) product of the maximum entropy distribution and the initial distribution. This distribution is the maximum entropy distribution when $g$ is uniform. The proof of this and related results relies heavily on the work of Zabell and Lanford.
I. INTRODUCTION

The differential entropy $H(X)$ of a random variable $X$ with density function $f(x)$ (with respect to Lebesgue measure) is defined by $H(X) = -\int_{-\infty}^{\infty} f(x) \ln f(x)\, dx$.
All of the well-known distributions in statistics are maximum entropy distributions given appropriate simple moment constraints. For example, the maximum entropy distribution under the constraint $EX^2 = \sigma^2$ is the normal distribution with mean 0 and variance $\sigma^2$. The maximum entropy nonnegative random variable with mean $m$ is exponentially distributed with parameter $\lambda = 1/m$. Even the Cauchy distribution is a maximum entropy distribution over all distributions satisfying $E \ln(1 + X^2) = \alpha$. In general, the maximum entropy density $f(x)$ under the constraint $\int h(x) f(x)\, dx = \alpha$, where $h$ is a vector-valued function of $x$, is of the form

$$f(x) = \exp\big(\lambda_0 + \lambda^t h(x)\big). \qquad (1)$$

The constants $\lambda_0, \lambda$ are chosen so that $f(x)$ is normalized and satisfies the moment constraint. An easy proof of (1) based on a convexity argument can be found in Kagan et al. [1, Theorem 13.2.1, p. 409].
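As a concrete illustration of (1), the following short Python sketch (my own illustration, not from the paper; the finite support and the target mean are arbitrary choices) computes the maximum entropy probability mass function on a finite set under a single mean constraint, locating the exponent by bisection on the tilted mean.

    import math

    def maxent_pmf(support, target_mean, lo=-50.0, hi=50.0, iters=200):
        """Maximum entropy pmf p(x) proportional to exp(lam*x) on `support`, with E[X] = target_mean."""
        support = list(support)
        def tilted(lam):
            shift = max(lam * x for x in support)          # subtract the largest exponent for stability
            w = [math.exp(lam * x - shift) for x in support]
            z = sum(w)
            return [wi / z for wi in w]
        for _ in range(iters):                             # bisection; the tilted mean increases with lam
            mid = 0.5 * (lo + hi)
            p = tilted(mid)
            if sum(x * px for x, px in zip(support, p)) < target_mean:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi), tilted(0.5 * (lo + hi))

    # Example: counts on {0, 1, ..., 10} constrained to have mean 2.
    lam, p = maxent_pmf(range(11), 2.0)
    print(round(lam, 4), [round(px, 4) for px in p])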
The entropy $H(X)$ is closely related to the disorder or uncertainty associated with making a realization of $X$. For that reason, maximizing the entropy is a method for finding distributions that represent high uncertainty or, equivalently, a state of high ignorance. For instance, in statistical mechanics, Boltzmann and others found the three-variate normal distribution of velocities in gases as a maximum entropy distribution under an energy constraint. Similarly, one can derive the $p(h) = \lambda e^{-\lambda h}$, $h \ge 0$, distribution of air density as a function of height in the earth's atmosphere under the mean potential energy constraint $\int h\, p(h)\, dh = E$.
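The exponential form can be recovered by a standard Lagrange multiplier calculation (a routine verification, not part of the paper's argument): maximize $-\int_0^\infty p(h)\ln p(h)\,dh$ subject to $\int_0^\infty p(h)\,dh = 1$ and $\int_0^\infty h\,p(h)\,dh = E$. Stationarity of the Lagrangian in $p(h)$ gives

$$-\ln p(h) - 1 + \lambda_0 + \lambda_1 h = 0, \qquad\text{so}\qquad p(h) = e^{\lambda_0 - 1}\, e^{\lambda_1 h},$$

and imposing the two constraints forces $\lambda_1 = -1/E$ and $e^{\lambda_0 - 1} = 1/E$, i.e., $p(h) = (1/E)\, e^{-h/E}$, the stated exponential law with $\lambda = 1/E$.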
In statistics, the principle of maximum entropy has been used to obtain uninformative prior distributions in Bayesian inference. A paper by Jaynes [2] discusses precisely this technique. Although the use of the maximum entropy principle for these purposes may seem ad hoc, maximum entropy distributions have some desirable properties. Jaynes [2] comments, "... the probability distribution which maximizes the entropy is numerically identical with the frequency distribution which can be realized in the greatest number of ways," thus associating maximum entropy with a definite frequency (or maximum likelihood) interpretation.
This note attaches another concrete meaning to the maximum entropy distribution. It characterizes such a distribution as the limit of a sequence of conditional distributions. It is shown that, under certain regularity conditions, the conditional distribution of the first random variable $X_1$ in a sequence of independent identically distributed (i.i.d.) random variables $X_1, X_2, \cdots$, given the empirical average $(1/n)\sum_1^n h(X_i)$, converges to a maximum entropy distribution. More precisely, the limiting distribution $f$ maximizes $H_g(X) = -\int f(x)\ln\big(f(x)/g(x)\big)\,dx$, the entropy relative to the initial distribution $g$ of $X_1$, subject to the constraint that $\int h(x)f(x)\,dx$ equals the observed average. The quantity $-H_g(X)$ is also known as the Kullback-Leibler information number of $f$ relative to $g$. Thus among all distributions satisfying the above moment constraint on $h(X_1)$, the limiting conditional distribution $f$ of $X_1$ minimizes the Kullback-Leibler number with respect to $g$. It follows that $f$ is closest to $g$ in a certain hypothesis-testing sense.
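This statement is easy to probe numerically. The sketch below is my own illustration, not from the paper: the exponential initial density, $n = 15$, the window half-width, and the number of trials are arbitrary choices, and exact pointwise conditioning is replaced by conditioning on a small window around $\alpha$.

    import numpy as np

    rng = np.random.default_rng(0)
    n, alpha, delta, trials = 15, 2.0, 0.05, 500_000

    # X_i ~ Exponential(1) (initial density g), h(x) = x, and we keep only the blocks
    # whose empirical average lands in a small window around alpha = 2, far from the
    # true mean 1.  The limit described above is e^{lambda x} g(x) (normalized) with
    # mean alpha, i.e. an exponential distribution with mean alpha.
    x = rng.exponential(1.0, size=(trials, n))
    accept = np.abs(x.mean(axis=1) - alpha) < delta
    x1 = x[accept, 0]

    print("accepted blocks:", x1.size)
    print("mean of X_1 given the observation:", float(x1.mean()), "(limit:", alpha, ")")
    print("std  of X_1 given the observation:", float(x1.std()), "(approaches", alpha, "as n grows)")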
The convergence problem of conditional distributions has a long history. As early as 1922, in the then fully developing field of statistical mechanics, Darwin and Fowler [11] established their method to derive the energy distribution in large systems of particles with a given total energy. Through the computation of the average occupancy of the discrete energy levels, these authors arrived naturally at the classical energy distributions. Jaynes [12] relates Darwin and Fowler's work to the Shannon maximum entropy principle.

In an attempt to formalize statistical limits in statistical mechanics, Lanford [3] considers the convergence problem
of conditional distributions when an empirical average is conditioned in an interval that may or may not contain the mean of the underlying distribution. The same problem is considered by Bartfai [9] and Vincze [10] from a statistical point of view. Zabell [4], on the other hand, studies primarily the convergence of conditional expectations when the conditioning is pointwise, but the points are in the neighborhood of the true mean. We extend Zabell's work to conditioning at points "far" from the mean, and we will reinterpret the work of the above authors from a maximum entropy viewpoint.

In Section II, we study the convergence properties of the special case in which the random variables $X_1, X_2, \cdots$ take values in the set $\{1,2,\cdots,m\}$. For pointwise conditioning far from the mean, the use of Chernoff's tilting idea [5], [6] is clearly illustrated by this example. The idea will be used again in Section III, where we generalize Zabell's result to conditioning at points far from the mean. The convergence of conditional distributions, under the condition that $X$ has only a finite number of mass points, can be obtained by application of Stirling's inequalities, as shown in the work of Vasicek [13]. In Section IV, we provide some examples in the case where the random variables have densities, and in Section V we review and interpret Lanford's work and its implications.
II. A SPECIAL CASE

In this section we consider the case of discrete bounded random variables, and we give a direct proof of the following convergence theorem.

Theorem 1: Let $X_1, X_2, \cdots$ be i.i.d. discrete random variables with uniform probability mass function $p(x)$ on the range $x \in \{1,2,\cdots,m\}$. Then, for $1 \le \alpha \le m$, and for all $x_1 \in \{1,2,\cdots,m\}$, we have

$$\lim_{\substack{n \to \infty \\ n\alpha \text{ an integer}}} P\Big\{X_1 = x_1 \,\Big|\, \frac{1}{n}\sum_{i=1}^{n} X_i = \alpha\Big\} = p^*(x_1), \qquad (2)$$

where

$$p^*(x_1) = e^{\lambda x_1} \Big/ \sum_{i=1}^{m} e^{\lambda i} \qquad (3)$$

is the maximum entropy probability mass function under the constraint $\sum x\, p^*(x) = \alpha$, and $\lambda$ is chosen to satisfy this constraint.
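Before turning to the proof, the convergence in (2) can be checked by exact computation for small $n$. The following self-contained sketch (my own check, not from the paper; the die example $m = 6$, $\alpha = 4.5$ is an arbitrary choice) counts sequences to obtain the exact conditional probability mass function and compares it with $p^*$ found by bisection.

    from math import exp

    def count_sequences(n, m):
        """counts[s] = number of sequences (x_1, ..., x_n) in {1..m}^n with x_1 + ... + x_n = s."""
        counts = {0: 1}
        for _ in range(n):
            new = {}
            for s, c in counts.items():
                for x in range(1, m + 1):
                    new[s + x] = new.get(s + x, 0) + c
            counts = new
        return counts

    def conditional_pmf(n, m, alpha):
        """Exact P{X_1 = j | (1/n) sum X_i = alpha} for i.i.d. uniform X_i on {1..m}; cf. (5)."""
        t = round(n * alpha)
        c_n, c_nm1 = count_sequences(n, m), count_sequences(n - 1, m)
        return [c_nm1.get(t - j, 0) / c_n[t] for j in range(1, m + 1)]

    def p_star(m, alpha, lo=-50.0, hi=50.0):
        """The maximum entropy pmf (3) on {1..m} with mean alpha, found by bisection on lambda."""
        tilt = lambda lam: [exp(lam * j) for j in range(1, m + 1)]
        for _ in range(200):
            mid = 0.5 * (lo + hi)
            w = tilt(mid)
            mean = sum(j * wj for j, wj in zip(range(1, m + 1), w)) / sum(w)
            lo, hi = (mid, hi) if mean < alpha else (lo, mid)
        w = tilt(0.5 * (lo + hi))
        return [wj / sum(w) for wj in w]

    m, alpha = 6, 4.5
    print("p*      ", [round(p, 4) for p in p_star(m, alpha)])
    for n in (4, 16, 64):
        print("n = %3d " % n, [round(p, 4) for p in conditional_pmf(n, m, alpha)])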
Proof: First we will prove that for any probability mass function $q(x) > 0$ on the range $\{1,2,\cdots,m\}$ with $\alpha = E_q X_1 = \sum x_1 q(x_1)$ we have

$$\lim_{\substack{n \to \infty \\ n\alpha \text{ an integer}}} q\Big\{x_1 \,\Big|\, \frac{1}{n}\sum_{i=1}^{n} X_i = \alpha\Big\} = q(x_1). \qquad (4)$$

Thus conditioning on the expected outcome has an asymptotically negligible effect. This proves (2) in the case where $\alpha = EX_1 = (m+1)/2$. The limiting distribution $p^*$ is obtained by setting $\lambda = 0$ in (3).

We then consider (2) in the case $\alpha \ne (m+1)/2$. We use Chernoff's tilting idea to modify $p(x)$ so that we are again conditioning on the expected outcome. Since the conditional distributions in (2) are invariant under tilting, Theorem 1 will follow.
Let us turn to the proof of (4). Let $n\alpha$ be an integer such that $P\{(1/n)\sum_{i=1}^{n} X_i = \alpha\} > 0$. Letting $S_n = X_1 + X_2 + \cdots + X_n$, we have

$$P\{X_1 = j \mid S_n = n\alpha\} = \frac{P\{X_1 = j,\, S_n = n\alpha\}}{P\{S_n = n\alpha\}} = \frac{q(j)\, P\{X_2 + \cdots + X_n = n\alpha - j\}}{P\{S_n = n\alpha\}} = q(j)\, \frac{P\{S_{n-1} = n\alpha - j\}}{P\{S_n = n\alpha\}}. \qquad (5)$$

Therefore, if we can prove that $P\{S_{n-1} = n\alpha - j\}$ is asymptotically independent of $j \in \{1,2,\cdots,m\}$ as $n \to \infty$, it will follow that $P\{X_1 = j \mid S_n = n\alpha\} \to q(j)$. The desired result is contained in the following lemma.
Lemma 1: Let $X_1, X_2, \cdots$ be i.i.d. random variables with probability mass function $q(x) > 0$ on $x \in \{1,2,\cdots,m\}$. With $\alpha = EX_1$ and $S_n = X_1 + \cdots + X_n$, we have $P\{S_n = k\}/P\{S_n = k+1\} \to 1$ for all integers $k$ satisfying $|n\alpha - k| < A$ for some constant $A$.

The proof of this lemma follows from a slight generalization of a problem in Chung [7, exercise 24, p. 177], and is, in fact, a form of the Chung-Erdős strong ratio limit theorem.
In the case that $\alpha \ne (m+1)/2$, let $\tilde p(x)$ be the tilted probability mass function derived from $p(x)$ as follows:

$$\tilde p(x) = c\, e^{\lambda x} p(x), \qquad x \in \{1,2,\cdots,m\}, \qquad (6)$$

where $c$ and $\lambda$ are chosen to satisfy

$$\sum \tilde p(x) = 1, \qquad \sum x\, \tilde p(x) = \alpha. \qquad (7)$$

Then clearly $\tilde p(x) > 0$ for all $x \in \{1,2,\cdots,m\}$; thus (4) is applicable. The properties of the tilting operation allow us to reconnect (4) with the original statement (2) as follows. First we observe that

$$p(x_1,\cdots,x_n) = \prod_{i=1}^{n} \big(e^{-\lambda x_i}/c\big)\, \tilde p(x_i) = c^{-n} e^{-\lambda \Sigma x_i}\, \tilde p(x_1,\cdots,x_n),$$

and thus that

$$P\{S_n = n\alpha\} = \sum_{x_1 + \cdots + x_n = n\alpha} p(x_1,\cdots,x_n) = c^{-n} e^{-\lambda n\alpha} \sum_{x_1 + \cdots + x_n = n\alpha} \tilde p(x_1,\cdots,x_n) = c^{-n} e^{-\lambda n\alpha}\, \tilde P\{S_n = n\alpha\}. \qquad (8)$$

From (8) it follows easily that the tilting transformation leaves the conditional distributions in (2) invariant.
This can be seen by using (5) and (8) as follows:

$$P\{X_1 = x_1 \mid S_n = n\alpha\} = \frac{p(x_1)\, P\{S_{n-1} = n\alpha - x_1\}}{P\{S_n = n\alpha\}} = \frac{c^{-1} e^{-\lambda x_1}\, \tilde p(x_1)\; c^{-(n-1)} e^{-\lambda(n\alpha - x_1)}\, \tilde P\{S_{n-1} = n\alpha - x_1\}}{c^{-n} e^{-\lambda n\alpha}\, \tilde P\{S_n = n\alpha\}} = \tilde P\{X_1 = x_1 \mid S_n = n\alpha\}. \qquad (9)$$

But $\tilde P\{X_1 = x_1 \mid S_n = n\alpha\} \to \tilde p(x_1) = e^{\lambda x_1}\big/\sum_i e^{\lambda i}$ by (4), thus proving Theorem 1.
Remark: The smooth behavior of the probability distribution of $S_n$ at small deviations from its mean is crucial to the convergence in (4). These ideas are also borne out by the restrictions on the random variables $X_i$ imposed by Zabell in the more general case (see Section III). Unlike the central limit theorem or the law of large numbers, the additional conditions deal with the fine structure of $S_n/n$ in the sense that deviations of the order $1/n$ from the mean are considered.
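The ratio-limit property used in Lemma 1 is also easy to see numerically. The short sketch below (my own check, not from the paper; the mass function $q$ and the values of $n$ are arbitrary) computes the exact distribution of $S_n$ by repeated convolution and prints $P\{S_n = k\}/P\{S_n = k+1\}$ at an integer $k$ near $n\alpha$.

    # Y_i i.i.d. with q = (0.2, 0.5, 0.3) on {1, 2, 3}; the ratio should tend to 1
    # for k within a bounded distance of n*alpha.
    q = {1: 0.2, 2: 0.5, 3: 0.3}
    alpha = sum(x * p for x, p in q.items())          # = 2.1

    def pmf_of_sum(n):
        """Exact pmf of S_n = X_1 + ... + X_n by repeated convolution."""
        pmf = {0: 1.0}
        for _ in range(n):
            new = {}
            for s, ps in pmf.items():
                for x, px in q.items():
                    new[s + x] = new.get(s + x, 0.0) + ps * px
            pmf = new
        return pmf

    for n in (10, 100, 1000):
        pmf = pmf_of_sum(n)
        k = round(n * alpha)                          # an integer near n*alpha
        print(n, pmf[k] / pmf[k + 1])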
III. A LIMIT THEOREM FOR POINTWISE CONDITIONING

We proceed with the generalization of the special case studied in Section II to lattice random variables that may be unbounded and to random variables with density functions. We start by reminding the reader of Zabell's results [4] concerning the convergence of conditional expectations. We then apply the tilting transformation to these results to obtain the desired generalization.

Zabell considers a sequence $U, X_1, X_2, \cdots$ of random variables where $U$ has finite expectation and the pair $(U, X_1)$ is independent of $X_2, X_3, \cdots$. He derives a set of sufficient conditions under which it is true that

$$E(U \mid X_1 + \cdots + X_n = A_n) \to E(U), \qquad (10)$$

as $n \to \infty$. These conditions can be summarized as follows.

1) The random variables $X_1, X_2, \cdots$ take values in the same additive subgroup of $\mathbb{R}$ (or, more generally, in the same coset of an additive subgroup of $\mathbb{R}$).

2) Consider the normalized sums $Y_n = (X_1 + \cdots + X_n - A_n)/B_n$. Then there exist sequences $\{A_n\}$, $\{B_n\}$ with $B_n \to \infty$ such that $Y_n$ converges in distribution to a (nondegenerate) random variable $Y$.

3) Let $\psi_n(t) = E(\exp(itY_n))$ and $\psi(t) = E(\exp(itY))$ denote the characteristic functions of $Y_n$ and $Y$, respectively. Then either a) $\psi_n$ is periodic and $P\{Y_n = 0\} > 0$ for $n$ sufficiently large; or b) $\psi$ is absolutely integrable, $\psi_n$ is absolutely integrable for $n$ sufficiently large, $\psi_n \to \psi$ in $L_1$, and $\int \psi > 0$.
Condition 1) is the generalization of the regularity properties of $S_n$ derived in Section II. For i.i.d. random variables $X_i$, $i = 1,2,\cdots$, this condition together with condition 3) implies that the random variables are either of the lattice type (i.e., the $X_i$ take values in $\{a + kb:\ k = 0, \pm 1, \pm 2, \cdots\}$) or are real-valued and have a density function $f(x)$ with respect to Lebesgue measure.

For example, these conditions preclude the case $X_i \in \{0, \pi, 5\}$. In this case the event $\{X_1 + \cdots + X_n = n\pi\}$ implies the event $\{X_1 = \pi\}$, for all probability mass functions $p(x_i)$. Thus clearly for all $n$,

$$P\{X_1 = x_1 \mid X_1 + \cdots + X_n = n\pi\} = \begin{cases} 1, & x_1 = \pi, \\ 0, & \text{otherwise}, \end{cases}$$

and Theorem 1 fails to hold even if $\pi$ were the true mean of $p(x_1)$.

Letting $U$ range over all bounded continuous functions of $X_1$, we see that Zabell's work implies the convergence of the conditional probability distribution of $X_1$ to its unconditional distribution (see, for example, Chung [7, Theorem 4.4.2, p. 89]).
We now limit our attention to i.i.d. random variables $X_1, X_2, \cdots$ having a density function $f(x)$ with respect to Lebesgue measure. Rather than conditioning on $\sum_1^n X_i$, we consider a (Borel-measurable) function $h: \mathbb{R} \to \mathbb{R}$ and condition on $S_n = \sum_1^n h(X_i)$. We assume that $h(X_1)$ has a density with mean $\mu$. The case of discrete conditioning variables is completely analogous and was covered to some extent in Section II.

It follows from (10) and conditions 1)-3) that the centering constants $A_n$ must be "close" to $n\mu$ in order to ensure the nondegeneracy of the limit of $(S_n - A_n)/B_n$. Conditioning on the event $\{(1/n)S_n = \alpha\}$, $\alpha \ne \mu$, results in centering constants $n\alpha$ that are too far from $n\mu$ for (10) to hold. Under certain restrictions, an application of Chernoff's tilting idea allows us to move the mean of $h(X_1)$ to the conditioning point $\alpha$, rendering (10) applicable. Again, the tilting leaves the conditional distributions invariant and thus provides us with the limit of these conditional distributions, conditioned at points off the true mean.
More concretely, let $f(x)$ and $g(t)$ denote the probability densities of $X_1$ and $h(X_1)$, respectively. Consider the exponential family $\mathcal{G}$ of densities indexed by $\lambda$,

$$\mathcal{G} = \Big\{ e^{\lambda t} g(t)/c(\lambda):\ c(\lambda) = \int e^{\lambda t} g(t)\, dt < \infty \Big\}. \qquad (11)$$

Assume that $\mathcal{G}$ contains a density $g^*(t) = e^{\lambda t} g(t)/c(\lambda)$ with mean $\alpha$. The desired tilting operation, then, changes the underlying probability measure $P$ to a measure $P^*$, under which $h(X_1)$ has the density $g^*(t)$. One can easily verify that changing the density of $X_1$ to

$$f^*(x) = e^{\lambda h(x)} f(x)/c(\lambda) \qquad (12)$$

induces the density $g^*(t)$ on $h(X_1)$.

Thus, applying (10) under the measure $P^*$ induced by $f^*$, we have for all bounded continuous functions $U(\cdot)$,

$$E^*\Big( U(X_1) \,\Big|\, \sum_{i=1}^{n} h(X_i) = n\alpha \Big) \to E^*(U(X_1)), \qquad (13)$$

where $E^*(U(X_1)) = \big( \int U(x)\, e^{\lambda h(x)} f(x)\, dx \big)/c(\lambda)$. The conditional expectation in (13) can be written as

$$E^*(U(X_1) \mid S_n = n\alpha) = \int U(x)\, P^*\{X_1 \in dx \mid S_n = n\alpha\}. \qquad (14)$$

On the support set of $S_n$, the conditional distribution of $X_1$ can be written in terms of the conditional density of $X_1$ as

$$P^*\{X_1 \in dx \mid S_n = t\} = f^*(x)\, g^*_{n-1}\big(t - h(x)\big)\, dx \big/ g^*_n(t), \qquad (15)$$

where $g^*_n(t)$ denotes the tilted density of $S_n = \sum_1^n h(X_i)$. Equation (15) can be directly verified by the defining relation for conditional distributions,

$$P^*\{X_1 \in dx,\ S_n \in dt\} = P^*\{X_1 \in dx \mid S_n = t\}\, g^*_n(t)\, dt. \qquad (16)$$

As in Section II, it is easy to verify that the tilting transformation and convolutions commute, i.e.,

$$g^*_n(t) = e^{\lambda t} g_n(t)\big/ c^n(\lambda). \qquad (17)$$

From (12), (15), and (17) it follows readily that

$$P^*\{X_1 \in dx \mid S_n = n\alpha\} = P\{X_1 \in dx \mid S_n = n\alpha\}, \qquad (18)$$

and thus that

$$E^*\{U(X_1) \mid S_n = n\alpha\} = E\{U(X_1) \mid S_n = n\alpha\}, \qquad (19)$$

from which it follows that

$$E\{U(X_1) \mid S_n = n\alpha\} \to \Big( \int U(x)\, e^{\lambda h(x)} f(x)\, dx \Big)\Big/ c(\lambda), \qquad (20)$$

the desired extension of Zabell's result (10). Thus we have proved the following.

Theorem 2: Let $X_1, X_2, \cdots$ be i.i.d. random variables with density $f(x)$, and let $h: \mathbb{R} \to \mathbb{R}$ be a Borel measurable function. Let the random variables $h(X_1), h(X_2), \cdots$ have a density $g(t)$. If there exists a real number $\lambda$ such that

$$c(\lambda) = E \exp\big(\lambda h(X_1)\big) < \infty,$$

and

$$\alpha = \Big( \int h(x)\, e^{\lambda h(x)} f(x)\, dx \Big)\Big/ c(\lambda),$$

and furthermore, if $g_\lambda(t) = e^{\lambda t} g(t)/c(\lambda)$ satisfies Zabell's conditions 1)-3), then as $n \to \infty$,

$$P\Big\{X_1 \in dx \,\Big|\, \sum_{i=1}^{n} h(X_i) = n\alpha\Big\} \to e^{\lambda h(x)} f(x)\, dx \big/ c(\lambda).$$
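The only computational step in applying Theorem 2 is locating the tilt parameter $\lambda$ for a given conditioning value $\alpha$. The following numerical sketch is my own illustration (not from the paper); the choice $f$ = standard normal, $h(x) = x^2$, and $\alpha = 0.25$ is arbitrary, and the integrals are approximated on a fixed grid.

    import numpy as np

    x = np.linspace(-12.0, 12.0, 40_001)              # grid, wide enough for the tilts used below
    f = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)        # initial density of X_1 (standard normal)
    h = x**2                                          # conditioning statistic h(x)
    alpha = 0.25                                      # target value of the empirical average of h

    def tilted_mean(lam):
        w = np.exp(lam * h) * f                       # unnormalized tilted density e^{lam h(x)} f(x)
        return float((h * w).sum() / w.sum())         # grid spacing cancels in the ratio

    lo, hi = -20.0, 0.49                              # here c(lam) < infinity requires lam < 1/2
    for _ in range(100):                              # bisection; the tilted mean is increasing in lam
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if tilted_mean(mid) < alpha else (lo, mid)
    print(0.5 * (lo + hi))  # analytically -1.5, since E[X^2] under the tilt equals 1/(1 - 2*lam)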
IV. SOME EXAMPLES

We now give a few easy examples to illustrate Theorem 2. Note that Examples 1, 2, and 3 are established by direct calculation rather than by using Theorem 2.

Example 1: Gaussian Random Variables

Let $\phi(x; \mu, \sigma^2)$ denote the normal density with mean $\mu$ and variance $\sigma^2$. Let $X_1, X_2, \cdots$ be i.i.d. with density $\phi(x; 0, 1)$ and let $f(x \mid \sum_1^n X_i = n\alpha)$ denote the conditional density of $X_1$ given $\sum_1^n X_i = n\alpha$. The sum $S_n = X_1 + \cdots + X_n$ has density $\phi(x; 0, n)$ and, according to (15), we have

$$f(x \mid S_n = n\alpha) = \phi(x; 0, 1)\, \phi(n\alpha - x;\, 0,\, n-1)\big/\phi(n\alpha;\, 0,\, n) = \phi\big(x;\ \alpha,\ (n-1)/n\big).$$

Thus

$$f(x \mid S_n = n\alpha) \to \phi(x; \alpha, 1) = e^{x\alpha}\, \phi(x; 0, 1)\big/ e^{\alpha^2/2}.$$
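The algebra in Example 1 is easy to confirm numerically. The check below is my own (not from the paper); $\alpha = 0.5$ and the grid are arbitrary choices. It evaluates the exact conditional density and compares it both with $\phi(x; \alpha, (n-1)/n)$ and with the tilted limit $\phi(x; \alpha, 1)$.

    import numpy as np

    def phi(x, mean, var):
        """Normal density with the given mean and variance."""
        return np.exp(-(x - mean)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

    alpha = 0.5
    xs = np.linspace(-4.0, 5.0, 1801)
    tilted_limit = np.exp(xs * alpha) * phi(xs, 0, 1) / np.exp(alpha**2 / 2)   # equals phi(xs, alpha, 1)

    for n in (2, 10, 100, 1000):
        conditional = phi(xs, 0, 1) * phi(n * alpha - xs, 0, n - 1) / phi(n * alpha, 0, n)
        print(n,
              float(np.max(np.abs(conditional - phi(xs, alpha, (n - 1) / n)))),  # identity: ~0
              float(np.max(np.abs(conditional - tilted_limit))))                 # gap shrinks with n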
Example 2: Exponential Random Variables

Let $X_1, X_2, \cdots$ be i.i.d. exponential random variables with parameter $\lambda$. Let $f_\Gamma(x; n, \lambda)$ denote the gamma density with parameters $n, \lambda$. Since $S_n$ has a gamma $(n, \lambda)$ distribution, we have

$$f(x \mid S_n = n\alpha) = \frac{f_\Gamma(x; 1, \lambda)\, f_\Gamma(n\alpha - x;\, n-1,\, \lambda)}{f_\Gamma(n\alpha;\, n,\, \lambda)} = \frac{n-1}{n\alpha}\Big(1 - \frac{x}{n\alpha}\Big)^{n-2}.$$

Thus

$$f(x \mid S_n = n\alpha) \to \alpha^{-1}\exp(-x/\alpha) = \exp\big((\lambda - 1/\alpha)x\big)\, f_\Gamma(x; 1, \lambda)\big/\lambda\alpha.$$

Example 3: An Exception (Cauchy Random Variables ($E|X| = \infty$))

Let $X_1, X_2, \cdots$ be i.i.d. with Cauchy density $f_C(x; 0, 1)$, where $f_C(x; \alpha, \beta) = \beta\big/\pi\big(\beta^2 + (x - \alpha)^2\big)$. It is well known that $S_n$ has density $f_C(x; 0, n)$. Thus we find

$$f(x \mid S_n = n\alpha) = f_C(x; 0, 1)\, f_C(n\alpha - x;\, 0,\, n-1)\big/f_C(n\alpha;\, 0,\, n) = f_C(x; 0, 1)\, \frac{(n-1)\big(n^2 + (n\alpha)^2\big)}{n\big((n-1)^2 + (n\alpha - x)^2\big)}.$$

It follows that for every finite value of $\alpha$ we have

$$f(x \mid S_n = n\alpha) \to f_C(x; 0, 1),$$

pointwise in $x$. Thus conditioning on any $\alpha$ has an asymptotically negligible effect.

The reason for this exceptional behavior is that $X_1$ has no mean and thus Theorem 2 is not applicable. It should be noted that even if $E|X_1|^k < \infty$ for some $k > 1$ but $P\{X_1 < -A\}$ and $P\{X_1 > A\}$ approach zero less than exponentially fast, Zabell's result is applicable, but not its extension in Theorem 2. In this case the exponential family $\mathcal{G}$ in (11) contains only one density, corresponding to $\lambda = 0$. No tilted density $g^*(t)$ with mean $\alpha$ can be found.

Example 4: The Maxwell-Boltzmann Distribution

Let the velocities $V_1, V_2, \cdots$ be i.i.d. vector-valued random variables (r.v.), each drawn according to a uniform
distribution over the cube $[-A, A]^3$. Then, by Theorem 2,

$$f\Big(v \,\Big|\, \frac{1}{n}\sum_{i=1}^{n} \|V_i\|^2 = E\Big) \to c\, e^{-\|v\|^2/2E}, \qquad v \in [-A, A]^3.$$

Thus the limiting density is the multivariate normal density truncated to the prior range.
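A small Monte Carlo sketch of this example (my own illustration, not from the paper; the cube edge, $n = 6$, the target energy $E = 0.6$, and the acceptance window are arbitrary, and the equality constraint is replaced by a small window) shows the conditioned velocity components concentrating toward zero, as the truncated Gaussian limit predicts.

    import numpy as np

    # Velocities uniform on the cube [-1, 1]^3, conditioned on the empirical mean of
    # ||V_i||^2 lying near E = 0.6 (below the unconditional mean 1).  The limiting
    # density is proportional to exp(lambda ||v||^2) on the cube, so each component
    # squared should have conditional mean E/3 instead of the unconditional 1/3.
    rng = np.random.default_rng(1)
    n, E, delta, trials = 6, 0.6, 0.05, 200_000

    v = rng.uniform(-1.0, 1.0, size=(trials, n, 3))
    energy = (v**2).sum(axis=2).mean(axis=1)          # (1/n) sum_i ||V_i||^2 for each block
    accept = np.abs(energy - E) < delta
    v1x = v[accept, 0, 0]                             # x-component of the first velocity

    print("accepted blocks       :", v1x.size)
    print("E[V_1x^2 | condition] :", float((v1x**2).mean()), " (theory: E/3 =", E / 3, ")")
    print("unconditional E[V_x^2]:", 1.0 / 3.0)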
V. CONDITIONING ON INTERVALS

In this section we review the limiting behavior of the conditional distribution of a random variable $X_1$, given that the empirical average of $n$ independent observations $h(X_i)$, $i = 1,2,\cdots,n$, lies in an interval $(a, b)$. Although the results presented here have the same flavor as the results discussed in Section III, they are quite distinct. For instance, the rather strong regularity conditions on $S_n$ imposed by Zabell are absent here. (Essentially what is left is the additional condition allowing tilting imposed in Theorem 2.) Thus Zabell's result cannot be obtained from the results established in this section. Conversely, one might be tempted to find the limit of $P\{X_1 \le x \mid a < n^{-1}\sum h(X_i) < b\}$ through an integration of $P\{X_1 \le x \mid n^{-1}\sum h(X_i) = t\}$ over $t \in (a, b)$. In order to do so, however, one would have to know the limiting distribution of $n^{-1}\sum h(X_i)$ on the interval $(a, b)$, and furthermore one would have to verify the interchange of limits and integration as $n \to \infty$. Thus, again, the result in Theorem 2 is insufficient to provide the solution.

In this section, a direct approach is taken toward the identification of $\lim_{n\to\infty} P\{X_1 \le x \mid a < n^{-1}\sum h(X_i) < b\}$. In contrast to the seemingly arbitrary way in which tilting was introduced in the previous sections, this operation will now appear quite naturally in a much different context.

Let the function $h: \mathbb{R} \to \mathbb{R}$ be a bounded Borel measurable function, and let $A$ and $B$ denote the (essential) infimum and supremum of $h(X_1)$, respectively. Since the more general case of unbounded and vector-valued $h$-functions is discussed in Lanford's work, we shall limit ourselves to a simple case of bounded scalar $h$ functions. We have the following result.
Theorem 3 (Lanford, 1973): Let $X_1, X_2, \cdots$ be i.i.d. random variables and let $h: \mathbb{R} \to \mathbb{R}$ be a bounded Borel measurable function. For $\operatorname{ess\,inf} h(X_1) < a < b < \operatorname{ess\,sup} h(X_1)$ define the distribution function $F_\lambda(x)$ by

$$F_\lambda(x) = \Big( \int_{-\infty}^{x} e^{\lambda h(y)}\, P\{X_1 \in dy\} \Big)\Big/ c(\lambda), \qquad (21)$$

where $c(\lambda) = E e^{\lambda h(X_1)}$ and $\lambda$ is chosen so that

$$\int_{-\infty}^{\infty} h(x)\, dF_\lambda(x) = \begin{cases} b, & b < Eh(X_1), \\ Eh(X_1), & a \le Eh(X_1) \le b, \\ a, & a > Eh(X_1). \end{cases}$$

Then, as $n \to \infty$, and for all continuity points $x$ of $F_\lambda(x)$,

$$P\Big\{X_1 \le x \,\Big|\, a < \frac{1}{n}\sum_{i=1}^{n} h(X_i) < b\Big\} \to F_\lambda(x). \qquad (22)$$
Thus Theorem 3 implies that if $a < Eh(X_1) < b$, then $\lambda = 0$ and $F_\lambda(x) = F(x) = P\{X_1 \le x\}$. That is, the conditioning on the interval $(a, b)$ has an asymptotically negligible effect. This statement can be directly verified, since by the law of large numbers we have

$$P\Big\{a < \frac{1}{n}\sum_{i=1}^{n} h(X_i) < b\Big\} \to 1$$

as $n \to \infty$. Since furthermore $P(B \mid A) = P(B \cap A)/P(A) \to P(B)$ as $P(A) \to 1$, the theorem follows.

However, if $Eh(X_1) \notin (a, b)$, then this reasoning is not applicable. Theorem 3 asserts that the conditional distribution of $X_1$ still converges and identifies the limiting distribution $F_\lambda(x)$ as belonging to the exponential family associated with $F(x)$. The distribution $F_\lambda$ is the closest to $F$, in the Kullback-Leibler sense, of all distributions $F^*(x)$ absolutely continuous with respect to $F(x)$ ($F^* \ll F$) and agreeing with the asymptotic evidence $a < (1/n)\sum h(X_i) < b$. More precisely, $F_\lambda$ maximizes the $F$-relative entropy, or minimizes the Kullback-Leibler number

$$K(F^*, F) = \int \log\frac{dF^*(x)}{dF(x)}\, dF^*(x) \qquad (23)$$

over all distributions $F^* \ll F$ for which $\int h(x)\, dF^*(x) \in [a, b]$.
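The construction in Theorem 3 is mechanical once $Eh(X_1)$ is compared with the interval. The sketch below is my own illustration (not from the paper): $F$ uniform on $\{1,\ldots,6\}$, $h(x) = x$, and the interval $(a, b) = (4.2, 4.8)$ lies above $Eh(X_1) = 3.5$, so $\lambda$ is tuned to make the tilted mean equal $a$; the resulting Kullback-Leibler number (23) is also printed.

    import math

    xs = range(1, 7)
    a, b = 4.2, 4.8
    mean_h = sum(xs) / 6.0                                   # Eh(X_1) = 3.5

    def tilted(lam):
        w = [math.exp(lam * x) / 6.0 for x in xs]            # e^{lam h(x)} dF(x)
        z = sum(w)                                           # c(lambda)
        return [wi / z for wi in w]

    def tilted_mean(lam):
        return sum(x * px for x, px in zip(xs, tilted(lam)))

    # Clamp the target as in Theorem 3: b if b < Eh, Eh if a <= Eh <= b, a if a > Eh.
    target = b if b < mean_h else (mean_h if a <= mean_h <= b else a)

    lo, hi = -50.0, 50.0
    for _ in range(200):                                     # bisection; tilted mean increases with lambda
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if tilted_mean(mid) < target else (lo, mid)
    lam = 0.5 * (lo + hi)
    p_lam = tilted(lam)

    kl = sum(px * math.log(px / (1.0 / 6.0)) for px in p_lam)   # K(F_lambda, F) as in (23)
    print("lambda =", round(lam, 4), " tilted pmf =", [round(px, 4) for px in p_lam])
    print("K(F_lambda, F) =", round(kl, 4))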
We will now outline a proof of Theorem 3. The arguments presented here are extracted from Lanford's work [3] and its extension by Bahadur and Zabell [8], and we refer to this work for details. The proof of Theorem 3 rests essentially on an extension of the asymptotic theory of tail probabilities to the probabilities of arbitrary open convex sets. The exponential decay of these probabilities is established in the following lemma.

Lemma 2: Let $Y_1, Y_2, \cdots$ be i.i.d. bounded random variables taking values in $\mathbb{R}^k$. Let $J$ be a finite union of open convex sets of $\mathbb{R}^k$. Then

1) $S(Y, J) = \lim_n n^{-1}\log P\{n^{-1}\sum_{i=1}^{n} Y_i \in J\}$ exists (possibly infinite);

2) with $s(Y; x) = \inf\{S(Y, J):\ x \in J,\ J \text{ open convex}\}$ we have $S(Y, J) = \sup_{x \in J} s(Y; x)$.

The set function $\sup_{x \in J} s(Y; x)$ is known as the Lanford entropy of $J$.

Let $p$ denote the measure on $\mathbb{R}^k$ induced by $Y_1$, and define the function $u: \mathbb{R}^k \to \mathbb{R}$ by

$$u(y) = -\inf\Big\{ K(\nu; p):\ \int_{\mathbb{R}^k} t\, \nu(dt) = y,\ \nu \ll p \Big\}, \qquad (24)$$

where, as before, $K(\nu; p) = \int \log\, (d\nu/dp)\, d\nu$ is the Kullback-Leibler number between $\nu$ and $p$. Thus $-u(y)$ is the minimum Kullback-Leibler number between the measure $p$ and any measure $\nu \ll p$ that has expectation $y$. It is well known that $u(y) \le 0$ and $u(EY_1) = 0$.
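A one-dimensional numerical illustration of Lemma 2 and of $u(y)$ (my own check, not from the paper; the Bernoulli parameter and the interval are arbitrary choices) is sketched below. For a two-point distribution, the only measure $\nu \ll p$ with mean $y$ is Bernoulli($y$), so $-u(y)$ is an explicit Kullback-Leibler number, and the exact probabilities can be summed directly.

    import math

    # Y_i ~ Bernoulli(q) and J = (a, b), which excludes the mean q.  Then
    # (1/n) log P{(1/n) sum Y_i in J} should approach sup_{y in J} u(y), with
    # -u(y) = K(Bernoulli(y); Bernoulli(q)); the supremum is approached at y = a.
    q, a, b = 0.3, 0.5, 0.7

    def neg_u(y):
        """Kullback-Leibler number K(Bernoulli(y); Bernoulli(q)) = -u(y)."""
        return y * math.log(y / q) + (1 - y) * math.log((1 - y) / (1 - q))

    print("sup of u over J:", -neg_u(a))
    for n in (50, 200, 800):
        p = sum(math.comb(n, k) * q**k * (1 - q)**(n - k)
                for k in range(n + 1) if a < k / n < b)
        print(n, math.log(p) / n)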
The function $u(y)$ is useful in that it allows us to compute the Lanford point entropy $s(x)$ as a Kullback-Leibler number. Letting $C$ denote the convex closure of the

REFERENCES

R. G. Gallager, Information Theory and Reliable Communication.
H. Chernoff, "A Measure of Asymptotic Efficiency for Tests of a Hypothesis Based on the Sum of Observations."
K. L. Chung, A Course in Probability Theory.
D. A. Bell, "Information Theory and Reliable Communication."
E. T. Jaynes, "Prior Probabilities."