IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. IT-27, NO. 4, JULY 1981

Maximum Entropy and Conditional Probability

JAN M. VAN CAMPENHOUT AND THOMAS M. COVER, FELLOW, IEEE

Manuscript received August 22, 1978; revised October 2, 1980. This work was supported by the National Science Foundation under Grant ENG 76-03684 and by JSEP under DAAG 29-79-C-0047. It was presented at the IEEE International Symposium on Information Theory, Grignano, Italy, June 25-29, 1979.

J. M. Van Campenhout is with the Electronics Laboratory, State University of Ghent, Ghent, Belgium.

T. M. Cover is with the Departments of Electrical Engineering and Statistics, Stanford University, Stanford, CA 94305.
Abstract: It is well-known that maximum entropy distributions, subject to appropriate moment constraints, arise in physics and mathematics. In an attempt to find a physical reason for the appearance of maximum entropy distributions, the following theorem is offered. The conditional distribution of $X_1$, given the empirical observation $(1/n)\sum_{i=1}^{n} h(X_i) = \alpha$, where $X_1, X_2, \cdots$ are independent identically distributed random variables with common density $g$, converges to $f_\lambda(x) = e^{\lambda^t h(x)} g(x)$ (suitably normalized), where $\lambda$ is chosen to satisfy $\int f_\lambda(x) h(x)\,dx = \alpha$. Thus the conditional distribution of a given random variable $X$ is the (normalized) product of the maximum entropy distribution and the initial distribution. This distribution is the maximum entropy distribution when $g$ is uniform. The proof of this and related results relies heavily on the work of Zabell and Lanford.
I. INTRODUCTION

The differential entropy $H(X)$ of a random variable $X$ with density function $f(x)$ (with respect to Lebesgue measure) is defined by $H(X) = -\int_{-\infty}^{\infty} f(x) \ln f(x)\, dx$.
All of the well-known distributions in statistics are maximum entropy distributions given appropriate simple moment constraints. For example, the maximum entropy distribution under the constraint $EX^2 = \sigma^2$ is the normal distribution with mean 0 and variance $\sigma^2$. The maximum entropy nonnegative random variable with mean $m$ is exponentially distributed with parameter $\lambda = 1/m$. Even the Cauchy distribution is a maximum entropy distribution over all distributions satisfying $E \ln(1 + X^2) = \alpha$. In general, the maximum entropy density $f(x)$ under the constraint $\int h(x) f(x)\, dx = \alpha$, where $h$ is a vector-valued function of $x$, is of the form

$$f(x) = \exp\big(\lambda_0 + \lambda^t h(x)\big). \qquad (1)$$

The constants $\lambda_0, \lambda$ are chosen so that $f(x)$ is normalized and satisfies the moment constraint. An easy proof of (1) based on a convexity argument can be found in Kagan et al. [1, Theorem 13.2.1, p. 409].
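As a concrete illustration of (1), the following short Python sketch (my own illustration, not from the paper; the finite support and the target mean are arbitrary choices) computes the maximum entropy probability mass function on a finite set under a single mean constraint, locating the exponent by bisection on the tilted mean.

    import math

    def maxent_pmf(support, target_mean, lo=-50.0, hi=50.0, iters=200):
        """Maximum entropy pmf p(x) proportional to exp(lam*x) on `support`, with E[X] = target_mean."""
        support = list(support)
        def tilted(lam):
            shift = max(lam * x for x in support)          # subtract the largest exponent for stability
            w = [math.exp(lam * x - shift) for x in support]
            z = sum(w)
            return [wi / z for wi in w]
        for _ in range(iters):                             # bisection; the tilted mean increases with lam
            mid = 0.5 * (lo + hi)
            p = tilted(mid)
            if sum(x * px for x, px in zip(support, p)) < target_mean:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi), tilted(0.5 * (lo + hi))

    # Example: counts on {0, 1, ..., 10} constrained to have mean 2.
    lam, p = maxent_pmf(range(11), 2.0)
    print(round(lam, 4), [round(px, 4) for px in p])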
The entropy $H(X)$ is closely related to the disorder or uncertainty associated with making a realization of $X$. For that reason, maximizing the entropy is a method for finding distributions that represent high uncertainty or, equivalently, a state of high ignorance. For instance, in statistical mechanics, Boltzmann and others found the three-variate normal distribution of velocities in gases as a maximum entropy distribution under an energy constraint. Similarly, one can derive the $p(h) = \lambda e^{-\lambda h}$, $h \ge 0$, distribution of air density as a function of height in the earth's atmosphere under the mean potential energy constraint $\int h\, p(h)\, dh = E$.
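The exponential form can be recovered by a standard Lagrange multiplier calculation (a routine verification, not part of the paper's argument): maximize $-\int_0^\infty p(h)\ln p(h)\,dh$ subject to $\int_0^\infty p(h)\,dh = 1$ and $\int_0^\infty h\,p(h)\,dh = E$. Stationarity of the Lagrangian in $p(h)$ gives

$$-\ln p(h) - 1 + \lambda_0 + \lambda_1 h = 0, \qquad\text{so}\qquad p(h) = e^{\lambda_0 - 1}\, e^{\lambda_1 h},$$

and imposing the two constraints forces $\lambda_1 = -1/E$ and $e^{\lambda_0 - 1} = 1/E$, i.e., $p(h) = (1/E)\, e^{-h/E}$, the stated exponential law with $\lambda = 1/E$.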
In statistics, the principle of maximum entropy has been used to obtain uninformative prior distributions in Bayesian inference. A paper by Jaynes [2] discusses precisely this technique. Although the use of the maximum entropy principle for these purposes may seem ad hoc, maximum entropy distributions have some desirable properties. Jaynes [2] comments, "... the probability distribution which maximizes the entropy is numerically identical with the frequency distribution which can be realized in the greatest number of ways," thus associating maximum entropy with a definite frequency (or maximum likelihood) interpretation.
This note attaches another concrete meaning to the maximum entropy distribution. It characterizes such a distribution as the limit of a sequence of conditional distributions. It is shown that, under certain regularity conditions, the conditional distribution of the first random variable $X_1$ in a sequence of independent identically distributed (i.i.d.) random variables $X_1, X_2, \cdots$, given the empirical average $(1/n)\sum_1^n h(X_i)$, converges to a maximum entropy distribution. More precisely, the limiting distribution $f$ maximizes $H_g(X) = -\int f(x)\ln\big(f(x)/g(x)\big)\,dx$, the entropy relative to the initial distribution $g$ of $X_1$, subject to the constraint that $\int h(x)f(x)\,dx$ equals the observed average. The quantity $-H_g(X)$ is also known as the Kullback-Leibler information number of $f$ relative to $g$. Thus among all distributions satisfying the above moment constraint on $h(X_1)$, the limiting conditional distribution $f$ of $X_1$ minimizes the Kullback-Leibler number with respect to $g$. It follows that $f$ is closest to $g$ in a certain hypothesis-testing sense.
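This statement is easy to probe numerically. The sketch below is my own illustration, not from the paper: the exponential initial density, $n = 15$, the window half-width, and the number of trials are arbitrary choices, and exact pointwise conditioning is replaced by conditioning on a small window around $\alpha$.

    import numpy as np

    rng = np.random.default_rng(0)
    n, alpha, delta, trials = 15, 2.0, 0.05, 500_000

    # X_i ~ Exponential(1) (initial density g), h(x) = x, and we keep only the blocks
    # whose empirical average lands in a small window around alpha = 2, far from the
    # true mean 1.  The limit described above is e^{lambda x} g(x) (normalized) with
    # mean alpha, i.e. an exponential distribution with mean alpha.
    x = rng.exponential(1.0, size=(trials, n))
    accept = np.abs(x.mean(axis=1) - alpha) < delta
    x1 = x[accept, 0]

    print("accepted blocks:", x1.size)
    print("mean of X_1 given the observation:", float(x1.mean()), "(limit:", alpha, ")")
    print("std  of X_1 given the observation:", float(x1.std()), "(approaches", alpha, "as n grows)")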
The convergence problem of conditional distributions has a long history. As early as 1922, in the then fully developing field of statistical mechanics, Darwin and Fowler [11] established their method to derive the energy distribution in large systems of particles with a given total energy. Through the computation of the average occupancy of the discrete energy levels, these authors arrived naturally at the classical energy distributions. Jaynes [12] relates Darwin and Fowler's work to the Shannon maximum entropy principle.

In an attempt to formalize statistical limits in statistical mechanics, Lanford [3] considers the convergence problem
of conditional distributions when an empirical average is conditioned in an interval that may or may not contain the mean of the underlying distribution. The same problem is considered by Bartfai [9] and Vincze [10] from a statistical point of view. Zabell [4], on the other hand, studies primarily the convergence of conditional expectations when the conditioning is pointwise, but the points are in the neighborhood of the true mean. We extend Zabell's work to conditioning at points "far" from the mean, and we will reinterpret the work of the above authors from a maximum entropy viewpoint.

In Section II, we study the convergence properties of the special case in which the random variables $X_1, X_2, \cdots$ take values in the set $\{1,2,\cdots,m\}$. For pointwise conditioning far from the mean, the use of Chernoff's tilting idea [5], [6] is clearly illustrated by this example. The idea will be used again in Section III, where we generalize Zabell's result to conditioning at points far from the mean. The convergence of conditional distributions, under the condition that $X$ has only a finite number of mass points, can be obtained by application of Stirling's inequalities, as shown in the work of Vasicek [13]. In Section IV, we provide some examples in the case where the random variables have densities, and in Section V we review and interpret Lanford's work and its implications.
II. A SPECIAL CASE

In this section we consider the case of discrete bounded random variables, and we give a direct proof of the following convergence theorem.

Theorem 1: Let $X_1, X_2, \cdots$ be i.i.d. discrete random variables with uniform probability mass function $p(x)$ on the range $x \in \{1,2,\cdots,m\}$. Then, for $1 \le \alpha \le m$, and for all $x_1 \in \{1,2,\cdots,m\}$, we have

$$\lim_{\substack{n \to \infty \\ n\alpha \text{ an integer}}} P\Big\{X_1 = x_1 \,\Big|\, \frac{1}{n}\sum_{i=1}^{n} X_i = \alpha\Big\} = p^*(x_1), \qquad (2)$$

where

$$p^*(x_1) = e^{\lambda x_1} \Big/ \sum_{i=1}^{m} e^{\lambda i} \qquad (3)$$

is the maximum entropy probability mass function under the constraint $\sum x\, p^*(x) = \alpha$, and $\lambda$ is chosen to satisfy this constraint.
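Before turning to the proof, the convergence in (2) can be checked by exact computation for small $n$. The following self-contained sketch (my own check, not from the paper; the die example $m = 6$, $\alpha = 4.5$ is an arbitrary choice) counts sequences to obtain the exact conditional probability mass function and compares it with $p^*$ found by bisection.

    from math import exp

    def count_sequences(n, m):
        """counts[s] = number of sequences (x_1, ..., x_n) in {1..m}^n with x_1 + ... + x_n = s."""
        counts = {0: 1}
        for _ in range(n):
            new = {}
            for s, c in counts.items():
                for x in range(1, m + 1):
                    new[s + x] = new.get(s + x, 0) + c
            counts = new
        return counts

    def conditional_pmf(n, m, alpha):
        """Exact P{X_1 = j | (1/n) sum X_i = alpha} for i.i.d. uniform X_i on {1..m}; cf. (5)."""
        t = round(n * alpha)
        c_n, c_nm1 = count_sequences(n, m), count_sequences(n - 1, m)
        return [c_nm1.get(t - j, 0) / c_n[t] for j in range(1, m + 1)]

    def p_star(m, alpha, lo=-50.0, hi=50.0):
        """The maximum entropy pmf (3) on {1..m} with mean alpha, found by bisection on lambda."""
        tilt = lambda lam: [exp(lam * j) for j in range(1, m + 1)]
        for _ in range(200):
            mid = 0.5 * (lo + hi)
            w = tilt(mid)
            mean = sum(j * wj for j, wj in zip(range(1, m + 1), w)) / sum(w)
            lo, hi = (mid, hi) if mean < alpha else (lo, mid)
        w = tilt(0.5 * (lo + hi))
        return [wj / sum(w) for wj in w]

    m, alpha = 6, 4.5
    print("p*      ", [round(p, 4) for p in p_star(m, alpha)])
    for n in (4, 16, 64):
        print("n = %3d " % n, [round(p, 4) for p in conditional_pmf(n, m, alpha)])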
Proof: First we will prove that for any probability mass function $q(x) > 0$ on the range $\{1,2,\cdots,m\}$ with $\alpha = E_q X_1 = \sum x_1 q(x_1)$ we have

$$\lim_{\substack{n \to \infty \\ n\alpha \text{ an integer}}} q\Big\{x_1 \,\Big|\, \frac{1}{n}\sum_{i=1}^{n} X_i = \alpha\Big\} = q(x_1). \qquad (4)$$

Thus conditioning on the expected outcome has an asymptotically negligible effect. This proves (2) in the case where $\alpha = EX_1 = (m+1)/2$. The limiting distribution $p^*$ is obtained by setting $\lambda = 0$ in (3).

We then consider (2) in the case $\alpha \ne (m+1)/2$. We use Chernoff's tilting idea to modify $p(x)$ so that we are again conditioning on the expected outcome. Since the conditional distributions in (2) are invariant under tilting, Theorem 1 will follow.
Let us turn to the proof of (4). Let $n\alpha$ be an integer such that $P\{(1/n)\sum_{i=1}^{n} X_i = \alpha\} > 0$. Letting $S_n = X_1 + X_2 + \cdots + X_n$, we have

$$P\{X_1 = j \mid S_n = n\alpha\} = \frac{P\{X_1 = j,\, S_n = n\alpha\}}{P\{S_n = n\alpha\}} = \frac{q(j)\, P\{X_2 + \cdots + X_n = n\alpha - j\}}{P\{S_n = n\alpha\}} = q(j)\, \frac{P\{S_{n-1} = n\alpha - j\}}{P\{S_n = n\alpha\}}. \qquad (5)$$

Therefore, if we can prove that $P\{S_{n-1} = n\alpha - j\}$ is asymptotically independent of $j \in \{1,2,\cdots,m\}$ as $n \to \infty$, it will follow that $P\{X_1 = j \mid S_n = n\alpha\} \to q(j)$. The desired result is contained in the following lemma.
Lemma 1: Let $X_1, X_2, \cdots$ be i.i.d. random variables with probability mass function $q(x) > 0$ on $x \in \{1,2,\cdots,m\}$. With $\alpha = EX_1$ and $S_n = X_1 + \cdots + X_n$, we have $P\{S_n = k\}/P\{S_n = k+1\} \to 1$ for all integers $k$ satisfying $|n\alpha - k| < A$ for some constant $A$.

The proof of this lemma follows from a slight generalization of a problem in Chung [7, exercise 24, p. 177], and is, in fact, a form of the Chung-Erdős strong ratio limit theorem.
In the case that $\alpha \ne (m+1)/2$, let $\tilde p(x)$ be the tilted probability mass function derived from $p(x)$ as follows:

$$\tilde p(x) = c\, e^{\lambda x} p(x), \qquad x \in \{1,2,\cdots,m\}, \qquad (6)$$

where $c$ and $\lambda$ are chosen to satisfy

$$\sum \tilde p(x) = 1, \qquad \sum x\, \tilde p(x) = \alpha. \qquad (7)$$

Then clearly $\tilde p(x) > 0$ for all $x \in \{1,2,\cdots,m\}$; thus (4) is applicable. The properties of the tilting operation allow us to reconnect (4) with the original statement (2) as follows. First we observe that

$$p(x_1,\cdots,x_n) = \prod_{i=1}^{n} \big(e^{-\lambda x_i}/c\big)\, \tilde p(x_i) = c^{-n} e^{-\lambda \Sigma x_i}\, \tilde p(x_1,\cdots,x_n),$$

and thus that

$$P\{S_n = n\alpha\} = \sum_{x_1 + \cdots + x_n = n\alpha} p(x_1,\cdots,x_n) = c^{-n} e^{-\lambda n\alpha} \sum_{x_1 + \cdots + x_n = n\alpha} \tilde p(x_1,\cdots,x_n) = c^{-n} e^{-\lambda n\alpha}\, \tilde P\{S_n = n\alpha\}. \qquad (8)$$

From (8) it follows easily that the tilting transformation leaves the conditional distributions in (2) invariant.
This can be seen by using (5) and (8) as follows:

$$P\{X_1 = x_1 \mid S_n = n\alpha\} = \frac{p(x_1)\, P\{S_{n-1} = n\alpha - x_1\}}{P\{S_n = n\alpha\}} = \frac{c^{-1} e^{-\lambda x_1}\, \tilde p(x_1)\; c^{-(n-1)} e^{-\lambda(n\alpha - x_1)}\, \tilde P\{S_{n-1} = n\alpha - x_1\}}{c^{-n} e^{-\lambda n\alpha}\, \tilde P\{S_n = n\alpha\}} = \tilde P\{X_1 = x_1 \mid S_n = n\alpha\}. \qquad (9)$$

But $\tilde P\{X_1 = x_1 \mid S_n = n\alpha\} \to \tilde p(x_1) = e^{\lambda x_1}\big/\sum_i e^{\lambda i}$ by (4), thus proving Theorem 1.
Remark: The smooth behavior of the probability distribution of $S_n$ at small deviations from its mean is crucial to the convergence in (4). These ideas are also borne out by the restrictions on the random variables $X_i$ imposed by Zabell in the more general case (see Section III). Unlike the central limit theorem or the law of large numbers, the additional conditions deal with the fine structure of $S_n/n$ in the sense that deviations of the order $1/n$ from the mean are considered.
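The ratio-limit property used in Lemma 1 is also easy to see numerically. The short sketch below (my own check, not from the paper; the mass function $q$ and the values of $n$ are arbitrary) computes the exact distribution of $S_n$ by repeated convolution and prints $P\{S_n = k\}/P\{S_n = k+1\}$ at an integer $k$ near $n\alpha$.

    # Y_i i.i.d. with q = (0.2, 0.5, 0.3) on {1, 2, 3}; the ratio should tend to 1
    # for k within a bounded distance of n*alpha.
    q = {1: 0.2, 2: 0.5, 3: 0.3}
    alpha = sum(x * p for x, p in q.items())          # = 2.1

    def pmf_of_sum(n):
        """Exact pmf of S_n = X_1 + ... + X_n by repeated convolution."""
        pmf = {0: 1.0}
        for _ in range(n):
            new = {}
            for s, ps in pmf.items():
                for x, px in q.items():
                    new[s + x] = new.get(s + x, 0.0) + ps * px
            pmf = new
        return pmf

    for n in (10, 100, 1000):
        pmf = pmf_of_sum(n)
        k = round(n * alpha)                          # an integer near n*alpha
        print(n, pmf[k] / pmf[k + 1])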
III. A LIMIT THEOREM FOR POINTWISE CONDITIONING

We proceed with the generalization of the special case studied in Section II to lattice random variables that may be unbounded and to random variables with density functions. We start by reminding the reader of Zabell's results [4] concerning the convergence of conditional expectations. We then apply the tilting transformation to these results to obtain the desired generalization.

Zabell considers a sequence $U, X_1, X_2, \cdots$ of random variables where $U$ has finite expectation and the pair $(U, X_1)$ is independent of $X_2, X_3, \cdots$. He derives a set of sufficient conditions under which it is true that

$$E(U \mid X_1 + \cdots + X_n = A_n) \to E(U), \qquad (10)$$

as $n \to \infty$. These conditions can be summarized as follows.

1) The random variables $X_1, X_2, \cdots$ take values in the same additive subgroup of $\mathbb{R}$ (or, more generally, in the same coset of an additive subgroup of $\mathbb{R}$).

2) Consider the normalized sums $Y_n = (X_1 + \cdots + X_n - A_n)/B_n$. Then there exist sequences $\{A_n\}$, $\{B_n\}$ with $B_n \to \infty$ such that $Y_n$ converges in distribution to a (nondegenerate) random variable $Y$.

3) Let $\psi_n(t) = E(\exp(itY_n))$ and $\psi(t) = E(\exp(itY))$ denote the characteristic functions of $Y_n$ and $Y$, respectively. Then either a) $\psi_n$ is periodic and $P\{Y_n = 0\} > 0$ for $n$ sufficiently large; or b) $\psi$ is absolutely integrable, $\psi_n$ is absolutely integrable for $n$ sufficiently large, $\psi_n \to \psi$ in $L_1$, and $\int \psi > 0$.
Condition 1) is the generalization of the regularity properties of $S_n$ derived in Section II. For i.i.d. random variables $X_i$, $i = 1,2,\cdots$, this condition together with condition 3) implies that the random variables are either of the lattice type (i.e., the $X_i$ take values in $\{a + kb:\ k = 0, \pm 1, \pm 2, \cdots\}$) or are real-valued and have a density function $f(x)$ with respect to Lebesgue measure.

For example, these conditions preclude the case $X_i \in \{0, \pi, 5\}$. In this case the event $\{X_1 + \cdots + X_n = n\pi\}$ implies the event $\{X_1 = \pi\}$, for all probability mass functions $p(x_i)$. Thus clearly for all $n$,

$$P\{X_1 = x_1 \mid X_1 + \cdots + X_n = n\pi\} = \begin{cases} 1, & x_1 = \pi, \\ 0, & \text{otherwise}, \end{cases}$$

and Theorem 1 fails to hold even if $\pi$ were the true mean of $p(x_1)$.

Letting $U$ range over all bounded continuous functions of $X_1$, we see that Zabell's work implies the convergence of the conditional probability distribution of $X_1$ to its unconditional distribution (see, for example, Chung [7, Theorem 4.4.2, p. 89]).
We now limit our attention to i.i.d. random variables $X_1, X_2, \cdots$ having a density function $f(x)$ with respect to Lebesgue measure. Rather than conditioning on $\sum_1^n X_i$, we consider a (Borel-measurable) function $h: \mathbb{R} \to \mathbb{R}$ and condition on $S_n = \sum_1^n h(X_i)$. We assume that $h(X_1)$ has a density with mean $\mu$. The case of discrete conditioning variables is completely analogous and was covered to some extent in Section II.

It follows from (10) and conditions 1)-3) that the centering constants $A_n$ must be "close" to $n\mu$ in order to ensure the nondegeneracy of the limit of $(S_n - A_n)/B_n$. Conditioning on the event $\{(1/n)S_n = \alpha\}$, $\alpha \ne \mu$, results in centering constants $n\alpha$ that are too far from $n\mu$ for (10) to hold. Under certain restrictions, an application of Chernoff's tilting idea allows us to move the mean of $h(X_1)$ to the conditioning point $\alpha$, rendering (10) applicable. Again, the tilting leaves the conditional distributions invariant and thus provides us with the limit of these conditional distributions, conditioned at points off the true mean.
More concretely, let $f(x)$ and $g(t)$ denote the probability densities of $X_1$ and $h(X_1)$, respectively. Consider the exponential family $\mathcal{G}$ of densities indexed by $\lambda$,

$$\mathcal{G} = \Big\{ e^{\lambda t} g(t)/c(\lambda):\ c(\lambda) = \int e^{\lambda t} g(t)\, dt < \infty \Big\}. \qquad (11)$$

Assume that $\mathcal{G}$ contains a density $g^*(t) = e^{\lambda t} g(t)/c(\lambda)$ with mean $\alpha$. The desired tilting operation, then, changes the underlying probability measure $P$ to a measure $P^*$, under which $h(X_1)$ has the density $g^*(t)$. One can easily verify that changing the density of $X_1$ to

$$f^*(x) = e^{\lambda h(x)} f(x)/c(\lambda) \qquad (12)$$

induces the density $g^*(t)$ on $h(X_1)$.

Thus, applying (10) under the measure $P^*$ induced by $f^*$, we have for all bounded continuous functions $U(\cdot)$,

$$E^*\Big( U(X_1) \,\Big|\, \sum_{i=1}^{n} h(X_i) = n\alpha \Big) \to E^*(U(X_1)), \qquad (13)$$

where $E^*(U(X_1)) = \big( \int U(x)\, e^{\lambda h(x)} f(x)\, dx \big)/c(\lambda)$. The conditional expectation in (13) can be written as

$$E^*(U(X_1) \mid S_n = n\alpha) = \int U(x)\, P^*\{X_1 \in dx \mid S_n = n\alpha\}. \qquad (14)$$

On the support set of $S_n$, the conditional distribution of $X_1$ can be written in terms of the conditional density of $X_1$ as

$$P^*\{X_1 \in dx \mid S_n = t\} = f^*(x)\, g^*_{n-1}\big(t - h(x)\big)\, dx \big/ g^*_n(t), \qquad (15)$$

where $g^*_n(t)$ denotes the tilted density of $S_n = \sum_1^n h(X_i)$. Equation (15) can be directly verified by the defining relation for conditional distributions,

$$P^*\{X_1 \in dx,\ S_n \in dt\} = P^*\{X_1 \in dx \mid S_n = t\}\, g^*_n(t)\, dt. \qquad (16)$$

As in Section II, it is easy to verify that the tilting transformation and convolutions commute, i.e.,

$$g^*_n(t) = e^{\lambda t} g_n(t)\big/ c^n(\lambda). \qquad (17)$$

From (12), (15), and (17) it follows readily that

$$P^*\{X_1 \in dx \mid S_n = n\alpha\} = P\{X_1 \in dx \mid S_n = n\alpha\}, \qquad (18)$$

and thus that

$$E^*\{U(X_1) \mid S_n = n\alpha\} = E\{U(X_1) \mid S_n = n\alpha\}, \qquad (19)$$

from which it follows that

$$E\{U(X_1) \mid S_n = n\alpha\} \to \Big( \int U(x)\, e^{\lambda h(x)} f(x)\, dx \Big)\Big/ c(\lambda), \qquad (20)$$

the desired extension of Zabell's result (10). Thus we have proved the following.

Theorem 2: Let $X_1, X_2, \cdots$ be i.i.d. random variables with density $f(x)$, and let $h: \mathbb{R} \to \mathbb{R}$ be a Borel measurable function. Let the random variables $h(X_1), h(X_2), \cdots$ have a density $g(t)$. If there exists a real number $\lambda$ such that

$$c(\lambda) = E \exp\big(\lambda h(X_1)\big) < \infty,$$

and

$$\alpha = \Big( \int h(x)\, e^{\lambda h(x)} f(x)\, dx \Big)\Big/ c(\lambda),$$

and furthermore, if $g_\lambda(t) = e^{\lambda t} g(t)/c(\lambda)$ satisfies Zabell's conditions 1)-3), then as $n \to \infty$,

$$P\Big\{X_1 \in dx \,\Big|\, \sum_{i=1}^{n} h(X_i) = n\alpha\Big\} \to e^{\lambda h(x)} f(x)\, dx \big/ c(\lambda).$$
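The only computational step in applying Theorem 2 is locating the tilt parameter $\lambda$ for a given conditioning value $\alpha$. The following numerical sketch is my own illustration (not from the paper); the choice $f$ = standard normal, $h(x) = x^2$, and $\alpha = 0.25$ is arbitrary, and the integrals are approximated on a fixed grid.

    import numpy as np

    x = np.linspace(-12.0, 12.0, 40_001)              # grid, wide enough for the tilts used below
    f = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)        # initial density of X_1 (standard normal)
    h = x**2                                          # conditioning statistic h(x)
    alpha = 0.25                                      # target value of the empirical average of h

    def tilted_mean(lam):
        w = np.exp(lam * h) * f                       # unnormalized tilted density e^{lam h(x)} f(x)
        return float((h * w).sum() / w.sum())         # grid spacing cancels in the ratio

    lo, hi = -20.0, 0.49                              # here c(lam) < infinity requires lam < 1/2
    for _ in range(100):                              # bisection; the tilted mean is increasing in lam
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if tilted_mean(mid) < alpha else (lo, mid)
    print(0.5 * (lo + hi))  # analytically -1.5, since E[X^2] under the tilt equals 1/(1 - 2*lam)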
IV. SOME EXAMPLES

We now give a few easy examples to illustrate Theorem 2. Note that Examples 1, 2, and 3 are established by direct calculation rather than by using Theorem 2.

Example 1: Gaussian Random Variables

Let $\phi(x; \mu, \sigma^2)$ denote the normal density with mean $\mu$ and variance $\sigma^2$. Let $X_1, X_2, \cdots$ be i.i.d. with density $\phi(x; 0, 1)$ and let $f(x \mid \sum_1^n X_i = n\alpha)$ denote the conditional density of $X_1$ given $\sum_1^n X_i = n\alpha$. The sum $S_n = X_1 + \cdots + X_n$ has density $\phi(x; 0, n)$ and, according to (15), we have

$$f(x \mid S_n = n\alpha) = \phi(x; 0, 1)\, \phi(n\alpha - x;\, 0,\, n-1)\big/\phi(n\alpha;\, 0,\, n) = \phi\big(x;\ \alpha,\ (n-1)/n\big).$$

Thus

$$f(x \mid S_n = n\alpha) \to \phi(x; \alpha, 1) = e^{x\alpha}\, \phi(x; 0, 1)\big/ e^{\alpha^2/2}.$$
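The algebra in Example 1 is easy to confirm numerically. The check below is my own (not from the paper); $\alpha = 0.5$ and the grid are arbitrary choices. It evaluates the exact conditional density and compares it both with $\phi(x; \alpha, (n-1)/n)$ and with the tilted limit $\phi(x; \alpha, 1)$.

    import numpy as np

    def phi(x, mean, var):
        """Normal density with the given mean and variance."""
        return np.exp(-(x - mean)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

    alpha = 0.5
    xs = np.linspace(-4.0, 5.0, 1801)
    tilted_limit = np.exp(xs * alpha) * phi(xs, 0, 1) / np.exp(alpha**2 / 2)   # equals phi(xs, alpha, 1)

    for n in (2, 10, 100, 1000):
        conditional = phi(xs, 0, 1) * phi(n * alpha - xs, 0, n - 1) / phi(n * alpha, 0, n)
        print(n,
              float(np.max(np.abs(conditional - phi(xs, alpha, (n - 1) / n)))),  # identity: ~0
              float(np.max(np.abs(conditional - tilted_limit))))                 # gap shrinks with n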
Example 2: Exponential Random Variables

Let $X_1, X_2, \cdots$ be i.i.d. exponential random variables with parameter $\lambda$. Let $f_\Gamma(x; n, \lambda)$ denote the gamma density with parameters $n, \lambda$. Since $S_n$ has a gamma $(n, \lambda)$ distribution, we have

$$f(x \mid S_n = n\alpha) = \frac{f_\Gamma(x; 1, \lambda)\, f_\Gamma(n\alpha - x;\, n-1,\, \lambda)}{f_\Gamma(n\alpha;\, n,\, \lambda)} = \frac{n-1}{n\alpha}\Big(1 - \frac{x}{n\alpha}\Big)^{n-2}.$$

Thus

$$f(x \mid S_n = n\alpha) \to \alpha^{-1}\exp(-x/\alpha) = \exp\big((\lambda - 1/\alpha)x\big)\, f_\Gamma(x; 1, \lambda)\big/\lambda\alpha.$$

Example 3: An Exception (Cauchy Random Variables ($E|X| = \infty$))

Let $X_1, X_2, \cdots$ be i.i.d. with Cauchy density $f_C(x; 0, 1)$, where $f_C(x; \alpha, \beta) = \beta\big/\pi\big(\beta^2 + (x - \alpha)^2\big)$. It is well known that $S_n$ has density $f_C(x; 0, n)$. Thus we find

$$f(x \mid S_n = n\alpha) = f_C(x; 0, 1)\, f_C(n\alpha - x;\, 0,\, n-1)\big/f_C(n\alpha;\, 0,\, n) = f_C(x; 0, 1)\, \frac{(n-1)\big(n^2 + (n\alpha)^2\big)}{n\big((n-1)^2 + (n\alpha - x)^2\big)}.$$

It follows that for every finite value of $\alpha$ we have

$$f(x \mid S_n = n\alpha) \to f_C(x; 0, 1),$$

pointwise in $x$. Thus conditioning on any $\alpha$ has an asymptotically negligible effect.

The reason for this exceptional behavior is that $X_1$ has no mean and thus Theorem 2 is not applicable. It should be noted that even if $E|X_1|^k < \infty$ for some $k > 1$ but $P\{X_1 < -A\}$ and $P\{X_1 > A\}$ approach zero less than exponentially fast, Zabell's result is applicable, but not its extension in Theorem 2. In this case the exponential family $\mathcal{G}$ in (11) contains only one density, corresponding to $\lambda = 0$. No tilted density $g^*(t)$ with mean $\alpha$ can be found.

Example 4: The Maxwell-Boltzmann Distribution

Let the velocities $V_1, V_2, \cdots$ be i.i.d. vector-valued random variables (r.v.), each drawn according to a uniform
distribution over the cube $[-A, A]^3$. Then, by Theorem 2,

$$f\Big(v \,\Big|\, \frac{1}{n}\sum_{i=1}^{n} \|V_i\|^2 = E\Big) \to c\, e^{-\|v\|^2/2E}, \qquad v \in [-A, A]^3.$$

Thus the limiting density is the multivariate normal density truncated to the prior range.
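A small Monte Carlo sketch of this example (my own illustration, not from the paper; the cube edge, $n = 6$, the target energy $E = 0.6$, and the acceptance window are arbitrary, and the equality constraint is replaced by a small window) shows the conditioned velocity components concentrating toward zero, as the truncated Gaussian limit predicts.

    import numpy as np

    # Velocities uniform on the cube [-1, 1]^3, conditioned on the empirical mean of
    # ||V_i||^2 lying near E = 0.6 (below the unconditional mean 1).  The limiting
    # density is proportional to exp(lambda ||v||^2) on the cube, so each component
    # squared should have conditional mean E/3 instead of the unconditional 1/3.
    rng = np.random.default_rng(1)
    n, E, delta, trials = 6, 0.6, 0.05, 200_000

    v = rng.uniform(-1.0, 1.0, size=(trials, n, 3))
    energy = (v**2).sum(axis=2).mean(axis=1)          # (1/n) sum_i ||V_i||^2 for each block
    accept = np.abs(energy - E) < delta
    v1x = v[accept, 0, 0]                             # x-component of the first velocity

    print("accepted blocks       :", v1x.size)
    print("E[V_1x^2 | condition] :", float((v1x**2).mean()), " (theory: E/3 =", E / 3, ")")
    print("unconditional E[V_x^2]:", 1.0 / 3.0)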
V. CONDITIONING ON INTERVALS

In this section we review the limiting behavior of the conditional distribution of a random variable $X_1$, given that the empirical average of $n$ independent observations $h(X_i)$, $i = 1,2,\cdots,n$, lies in an interval $(a, b)$. Although the results presented here have the same flavor as the results discussed in Section III, they are quite distinct. For instance, the rather strong regularity conditions on $S_n$ imposed by Zabell are absent here. (Essentially what is left is the additional condition allowing tilting imposed in Theorem 2.) Thus Zabell's result cannot be obtained from the results established in this section. Conversely, one might be tempted to find the limit of $P\{X_1 \le x \mid a < n^{-1}\sum h(X_i) < b\}$ through an integration of $P\{X_1 \le x \mid n^{-1}\sum h(X_i) = t\}$ over $t \in (a, b)$. In order to do so, however, one would have to know the limiting distribution of $n^{-1}\sum h(X_i)$ on the interval $(a, b)$, and furthermore one would have to verify the interchange of limits and integration as $n \to \infty$. Thus, again, the result in Theorem 2 is insufficient to provide the solution.

In this section, a direct approach is taken toward the identification of $\lim_{n\to\infty} P\{X_1 \le x \mid a < n^{-1}\sum h(X_i) < b\}$. In contrast to the seemingly arbitrary way in which tilting was introduced in the previous sections, this operation will now appear quite naturally in a much different context.

Let the function $h: \mathbb{R} \to \mathbb{R}$ be a bounded Borel measurable function, and let $A$ and $B$ denote the (essential) infimum and supremum of $h(X_1)$, respectively. Since the more general case of unbounded and vector-valued $h$-functions is discussed in Lanford's work, we shall limit ourselves to a simple case of bounded scalar $h$ functions. We have the following result.
Theorem 3 (Lanford, 1973): Let $X_1, X_2, \cdots$ be i.i.d. random variables and let $h: \mathbb{R} \to \mathbb{R}$ be a bounded Borel measurable function. For $\operatorname{ess\,inf} h(X_1) < a < b < \operatorname{ess\,sup} h(X_1)$ define the distribution function $F_\lambda(x)$ by

$$F_\lambda(x) = \Big( \int_{-\infty}^{x} e^{\lambda h(y)}\, P\{X_1 \in dy\} \Big)\Big/ c(\lambda), \qquad (21)$$

where $c(\lambda) = E e^{\lambda h(X_1)}$ and $\lambda$ is chosen so that

$$\int_{-\infty}^{\infty} h(x)\, dF_\lambda(x) = \begin{cases} b, & b < Eh(X_1), \\ Eh(X_1), & a \le Eh(X_1) \le b, \\ a, & a > Eh(X_1). \end{cases}$$

Then, as $n \to \infty$, and for all continuity points $x$ of $F_\lambda(x)$,

$$P\Big\{X_1 \le x \,\Big|\, a < \frac{1}{n}\sum_{i=1}^{n} h(X_i) < b\Big\} \to F_\lambda(x). \qquad (22)$$
Thus Theorem 3 implies that if $a < Eh(X_1) < b$, then $\lambda = 0$ and $F_\lambda(x) = F(x) = P\{X_1 \le x\}$. That is, the conditioning on the interval $(a, b)$ has an asymptotically negligible effect. This statement can be directly verified, since by the law of large numbers we have

$$P\Big\{a < \frac{1}{n}\sum_{i=1}^{n} h(X_i) < b\Big\} \to 1$$

as $n \to \infty$. Since furthermore $P(B \mid A) = P(B \cap A)/P(A) \to P(B)$ as $P(A) \to 1$, the theorem follows.

However, if $Eh(X_1) \notin (a, b)$, then this reasoning is not applicable. Theorem 3 asserts that the conditional distribution of $X_1$ still converges and identifies the limiting distribution $F_\lambda(x)$ as belonging to the exponential family associated with $F(x)$. The distribution $F_\lambda$ is the closest to $F$, in the Kullback-Leibler sense, of all distributions $F^*(x)$ absolutely continuous with respect to $F(x)$ ($F^* \ll F$) and agreeing with the asymptotic evidence $a < (1/n)\sum h(X_i) < b$. More precisely, $F_\lambda$ maximizes the $F$-relative entropy, or minimizes the Kullback-Leibler number

$$K(F^*, F) = \int \log\frac{dF^*(x)}{dF(x)}\, dF^*(x) \qquad (23)$$

over all distributions $F^* \ll F$ for which $\int h(x)\, dF^*(x) \in [a, b]$.
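The construction in Theorem 3 is mechanical once $Eh(X_1)$ is compared with the interval. The sketch below is my own illustration (not from the paper): $F$ uniform on $\{1,\ldots,6\}$, $h(x) = x$, and the interval $(a, b) = (4.2, 4.8)$ lies above $Eh(X_1) = 3.5$, so $\lambda$ is tuned to make the tilted mean equal $a$; the resulting Kullback-Leibler number (23) is also printed.

    import math

    xs = range(1, 7)
    a, b = 4.2, 4.8
    mean_h = sum(xs) / 6.0                                   # Eh(X_1) = 3.5

    def tilted(lam):
        w = [math.exp(lam * x) / 6.0 for x in xs]            # e^{lam h(x)} dF(x)
        z = sum(w)                                           # c(lambda)
        return [wi / z for wi in w]

    def tilted_mean(lam):
        return sum(x * px for x, px in zip(xs, tilted(lam)))

    # Clamp the target as in Theorem 3: b if b < Eh, Eh if a <= Eh <= b, a if a > Eh.
    target = b if b < mean_h else (mean_h if a <= mean_h <= b else a)

    lo, hi = -50.0, 50.0
    for _ in range(200):                                     # bisection; tilted mean increases with lambda
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if tilted_mean(mid) < target else (lo, mid)
    lam = 0.5 * (lo + hi)
    p_lam = tilted(lam)

    kl = sum(px * math.log(px / (1.0 / 6.0)) for px in p_lam)   # K(F_lambda, F) as in (23)
    print("lambda =", round(lam, 4), " tilted pmf =", [round(px, 4) for px in p_lam])
    print("K(F_lambda, F) =", round(kl, 4))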
We will now outline a proof of Theorem 3. The arguments presented here are extracted from Lanford's work [3] and its extension by Bahadur and Zabell [8], and we refer to this work for details. The proof of Theorem 3 rests essentially on an extension of the asymptotic theory of tail probabilities to the probabilities of arbitrary open convex sets. The exponential decay of these probabilities is established in the following lemma.

Lemma 2: Let $Y_1, Y_2, \cdots$ be i.i.d. bounded random variables taking values in $\mathbb{R}^k$. Let $J$ be a finite union of open convex sets of $\mathbb{R}^k$. Then

1) $S(Y, J) = \lim_n n^{-1}\log P\{n^{-1}\sum_{i=1}^{n} Y_i \in J\}$ exists (possibly infinite);

2) with $s(Y; x) = \inf\{S(Y, J):\ x \in J,\ J \text{ open convex}\}$ we have $S(Y, J) = \sup_{x \in J} s(Y; x)$.

The set function $\sup_{x \in J} s(Y; x)$ is known as the Lanford entropy of $J$.

Let $p$ denote the measure on $\mathbb{R}^k$ induced by $Y_1$, and define the function $u: \mathbb{R}^k \to \mathbb{R}$ by

$$u(y) = -\inf\Big\{ K(\nu; p):\ \int_{\mathbb{R}^k} t\, \nu(dt) = y,\ \nu \ll p \Big\}, \qquad (24)$$

where, as before, $K(\nu; p) = \int \log\, (d\nu/dp)\, d\nu$ is the Kullback-Leibler number between $\nu$ and $p$. Thus $-u(y)$ is the minimum Kullback-Leibler number between the measure $p$ and any measure $\nu \ll p$ that has expectation $y$. It is well known that $u(y) \le 0$ and $u(EY_1) = 0$.
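A one-dimensional numerical illustration of Lemma 2 and of $u(y)$ (my own check, not from the paper; the Bernoulli parameter and the interval are arbitrary choices) is sketched below. For a two-point distribution, the only measure $\nu \ll p$ with mean $y$ is Bernoulli($y$), so $-u(y)$ is an explicit Kullback-Leibler number, and the exact probabilities can be summed directly.

    import math

    # Y_i ~ Bernoulli(q) and J = (a, b), which excludes the mean q.  Then
    # (1/n) log P{(1/n) sum Y_i in J} should approach sup_{y in J} u(y), with
    # -u(y) = K(Bernoulli(y); Bernoulli(q)); the supremum is approached at y = a.
    q, a, b = 0.3, 0.5, 0.7

    def neg_u(y):
        """Kullback-Leibler number K(Bernoulli(y); Bernoulli(q)) = -u(y)."""
        return y * math.log(y / q) + (1 - y) * math.log((1 - y) / (1 - q))

    print("sup of u over J:", -neg_u(a))
    for n in (50, 200, 800):
        p = sum(math.comb(n, k) * q**k * (1 - q)**(n - k)
                for k in range(n + 1) if a < k / n < b)
        print(n, math.log(p) / n)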
The function $u(y)$ is useful in that it allows us to compute the Lanford point entropy $s(x)$ as a Kullback-Leibler number. Letting $C$ denote the convex closure of the

REFERENCES

R. G. Gallager, Information Theory and Reliable Communication.
H. Chernoff, "A Measure of Asymptotic Efficiency for Tests of a Hypothesis Based on the Sum of Observations."
K. L. Chung, A Course in Probability Theory.
D. A. Bell, "Information Theory and Reliable Communication."
E. T. Jaynes, "Prior Probabilities."