
Some equivalences between Shannon entropy and Kolmogorov complexity

S. Leung-Yan-Cheong, +1 more
- 01 May 1978 - 
- Vol. 24, Iss: 3, pp 331-338
TLDR
It is shown that, for all computable probability distributions, the universal prefix codes associated with the conditional Chaitin complexity have expected codeword length within a constant of the Shannon entropy.
Abstract
It is known that the expected codeword length L_{UD} of the best uniquely decodable (UD) code satisfies H(X) \leq L_{UD} \leq H(X)+1. Let X be a random variable which can take on n values. Then it is shown that the average codeword length L_{1:1} for the best one-to-one (not necessarily uniquely decodable) code for X is shorter than the average codeword length L_{UD} for the best uniquely decodable code by no more than (\log_{2} \log_{2} n) + 3. Let Y be a random variable taking on a finite or countable number of values and having entropy H. Then it is proved that L_{1:1} \geq H - \log_{2}(H+1) - \log_{2}\log_{2}(H+1) - \cdots - 6. Some relations are established among the Kolmogorov, Chaitin, and extension complexities. Finally it is shown that, for all computable probability distributions, the universal prefix codes associated with the conditional Chaitin complexity have expected codeword length within a constant of the Shannon entropy.


Some Equivalences Between Shannon Entropy
and Kolmogorov Complexity
SIK K. LEUNG-YAN-CHEONG,
MEMBER, IEEE,
AND THOMAS M. COVER,
FELLOW, IEEE
Abstract—It is known that the expected codeword length $L_{UD}$ of the best uniquely decodable (UD) code satisfies $H(X) \leq L_{UD} \leq H(X) + 1$. Let $X$ be a random variable which can take on $n$ values. Then it is shown that the average codeword length $L_{1:1}$ for the best one-to-one (not necessarily uniquely decodable) code for $X$ is shorter than the average codeword length $L_{UD}$ for the best uniquely decodable code by no more than $(\log_2 \log_2 n) + 3$. Let $Y$ be a random variable taking on a finite or countable number of values and having entropy $H$. Then it is proved that $L_{1:1} \geq H - \log_2(H+1) - \log_2\log_2(H+1) - \cdots - 6$. Some relations are established among the Kolmogorov, Chaitin, and extension complexities. Finally it is shown that, for all computable probability distributions, the universal prefix codes associated with the conditional Chaitin complexity have expected codeword length within a constant of the Shannon entropy.
Manuscript received September 16, 1975; revised September 6, 1977. This work was supported in part by the National Science Foundation under Grants GK-3250, ENG-10173, and ENG 76-03684, and in part by the Air Force Office of Scientific Research under Contract F44620-74-C-0068. This paper was previously presented at the IEEE International Symposium on Information Theory, Ithaca, NY, October 1977.
S. K. Leung-Yan-Cheong was with Stanford University. He is now with the Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA.
T. M. Cover is with the Department of Electrical Engineering and Statistics, Stanford University, Stanford, CA.

I. INTRODUCTION

SHANNON has shown that the minimal expected length $L$ of a prefix code for a random variable $X$ satisfies
$$H(X) \leq L < H(X) + 1 \qquad (1)$$
where $H$ is the entropy of the random variable. Shannon's restriction of the encoding or description of $X$ to prefix codes is highly motivated by the implicit assumption that the descriptions will be concatenated and thus must be uniquely decodable. Since the set of allowed codeword lengths is the same for the uniquely decodable and instantaneous codes [1], [2], the expected codeword length $L$ is the same for both sets of codes. Shannon's result follows by assigning codeword length $l_i = \lceil \log 1/p_i \rceil$ to the $i$th outcome of the random variable, where $p_i$ is the probability of the $i$th outcome.
a fundamental role and may be interpreted as the minimal
expected length of the description of X. The intuition
behind the entropy H is so compelling that it would be
disconcerting if H did not figure prominently in a descrip-
tion of the most efficient coding with respect to other less
constrained coding schemes. In particular we have in
mind one-to-one (1: 1) codes, i.e., codes which assign a
distinct binary codeword to each outcome of the random
variable, without regard to the constraint that concatena-
tions of these descriptions be uniquely decodable. It will
be shown here that H is also a first order approximation
to the minimal expected length of one-to-one codes.
Throughout this paper we use $L_{1:1}$ and $L_{UD}$ to denote the average codeword lengths for the best 1:1 code and uniquely decodable code, respectively. Since the class of 1:1 codes contains the class of uniquely decodable codes, it follows that $L_{1:1} \leq L_{UD}$. We show that $L_{1:1} \geq H - \log\log n - 3$, where $n$ is the number of values that the random variable $X$ can take on. Perhaps more to the point, we also show that $L_{1:1} \geq H - \log(H+1) - O(\log\log(H+1))$. Thus, to first order, a 1:1 code allows no more compression than a uniquely decodable or prefix code.
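To make the comparison concrete, here is a small illustration (not from the paper; the Zipf-like distribution is an arbitrary choice) that builds the best one-to-one code by giving the shortest unused binary strings to the most probable outcomes and compares its average length with the entropy.

```python
import math

def best_one_to_one_lengths(n):
    """Lengths of the n shortest binary codewords 0, 1, 00, 01, 10, 11, 000, ...
    assigned to outcomes listed in order of decreasing probability."""
    lengths, k, remaining = [], 1, 2
    for _ in range(n):
        if remaining == 0:          # move on to the next codeword length
            k += 1
            remaining = 2 ** k
        lengths.append(k)
        remaining -= 1
    return lengths

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# Illustrative distribution: Zipf-like over n outcomes, sorted decreasing.
n = 1 << 12
w = [1.0 / (i + 1) for i in range(n)]
Z = sum(w)
p = sorted((wi / Z for wi in w), reverse=True)

L11 = sum(pi * li for pi, li in zip(p, best_one_to_one_lengths(n)))
H = entropy(p)
print(f"H(X)          = {H:.3f} bits")
print(f"best 1:1 code = {L11:.3f} bits")
print(f"H - L_1:1     = {H - L11:.3f}  (the bound claimed above allows at most "
      f"log2 log2 n + 3 = {math.log2(math.log2(n)) + 3:.3f})")
```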
As a consequence of the work of Kolmogorov and
Chaitin, a notion of the intrinsic descriptive complexity of
a finite object has been developed. This is closely related
to the work of Shannon in which the complexity of a class
of objects is defined in terms of the probability distribu-
tion over that class. The complexity measures of
Kolmogorov and Chaitin, together with a new complexity
measure which we call the extension complexity, have
associated with them universal coding schemes. We shall
establish that the universal encoding associated with the
complexity of Chaitin [3] and Willis [6] has an expected
codeword length with respect to any computable probabil-
ity distribution on the set of possible outcomes which is
within a constant of the Shannon entropy, thus connect-
ing the individual complexity measure of Chaitin and Kolmogorov with the average statistical complexity measure of Shannon.

In Section II, we consider a random variable which can take on only a finite number of values, and we maximize $(L_{UD} - L_{1:1})$. In Section III we derive lower bounds on $L_{1:1}$ in terms of the entropy of a random variable taking values in a countable set. In Section IV we recall the definitions of the Kolmogorov and Chaitin complexities of binary sequences and introduce the notion of an extension complexity. We then derive some relationships among these quantities. Finally, in Section V we show that, for all computable probability distributions, the universal prefix codes associated with the conditional Chaitin complexity have expected codeword length within a constant of the Shannon entropy.

II. MAXIMIZATION OF $(L_{UD} - L_{1:1})$

Let $X$ be a random variable (RV) taking on a finite number of values, i.e.,
$$\Pr\{X = x_i\} = p_i, \qquad i = 1, 2, \cdots, n. \qquad (2)$$
With no loss of generality, assume $p_1 \geq p_2 \geq \cdots \geq p_n$. Let $l_i$, $i = 1, 2, \cdots, n$, be the lengths of the codewords in the best 1:1 code for encoding the RV $X$, where $l_i$ is the length of the codeword assigned to $x_i$.

Remark: Unless otherwise stated, all logarithms are to the base 2. The set of available codewords is $\{0, 1, 00, 01, 10, 11, 000, 001, \cdots\}$.

It is clear that the best 1:1 code must have $l_1 \leq l_2 \leq l_3 \leq \cdots$. Thus, by inspection, we have precisely $l_1 = 1$, $l_2 = 1$, $l_3 = 2, \cdots$, and
$$L_{1:1} = \sum_{i=1}^{n} p_i l_i = \sum_{i=1}^{n} p_i \left\lceil \log\left(\frac{i}{2}+1\right) \right\rceil. \qquad (3)$$

We now prove the following theorem, which gives an upper bound on $(L_{UD} - L_{1:1})$.

Theorem 1:
$$L_{1:1} \geq L_{UD} - \log\log n - 3. \qquad (4)$$

Proof: From (1) we have $L_{UD} \leq H(X) + 1$. Therefore
$$\max(L_{UD} - L_{1:1}) \leq 1 + \max(H(X) - L_{1:1}). \qquad (5)$$
Noting from (3) that
$$\left\lceil \log\left(\frac{i}{2}+1\right) \right\rceil \geq \log\left(\frac{i}{2}+1\right), \qquad (6)$$
we can write
$$H(X) - L_{1:1} \leq \sum_{i=1}^{n} p_i\left(\log\frac{1}{p_i} - \log\left(\frac{i}{2}+1\right)\right). \qquad (7)$$
We then use the method of Lagrange multipliers to maximize the right side of (7). The proof is completed by using (5). Details of the proof are given in Appendix A.

III. LOWER BOUNDS ON $L_{1:1}$ IN TERMS OF THE ENTROPY $H$

The objective in this section is to obtain lower bounds on $L_{1:1}$ in terms of the entropy $H$ of the random variable. As a first step, we consider transformations of 1:1 to UD codes. The random variables considered may take on a countable number of values.

Some Possible Transformations from 1:1 to UD Codes

The aim here is to find efficient means of transforming 1:1 codes to UD codes. Let $l_1, l_2, \cdots$ be the lengths of the codewords for the best 1:1 code; assume $l_1 \leq l_2 \leq \cdots$. Let $f$ be any function such that $\sum_i 2^{-f(l_i)} \leq 1$. Then from Kraft's inequality, the set of lengths $\{\lceil f(l_i) \rceil\}$ yields acceptable word lengths for a prefix (or UD) code. If $f$ is integer-valued and $\sum_i 2^{-f(l_i)} > 1$, $\{f(l_i)\}$ cannot yield a prefix code.

Theorem 2: The following functions represent possible transformations from 1:1 to UD codes:
i) $f(l_i) = l_i + a\lceil \log l_i \rceil + \log\frac{2^a - 1}{2^a - 2}$, where $a > 1$; (8)
ii) $f(l_i) = l_i + 2\lceil \log(l_i + 1) \rceil$; (9)
iii) $f(l_i) = l_i + \lceil \log l_i + \log(\log l_i) + \cdots \rceil + 4$. (10)

The proof of Theorem 2 follows from verification of the Kraft inequality for $f(l_i)$ and is given in Appendix B.
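The three transformations in Theorem 2 work because their Kraft sums stay at most one. Since the best 1:1 code has exactly $2^l$ codewords of each length $l$, the relevant Kraft sum is $\sum_l 2^l \, 2^{-f(l)}$; the sketch below (an illustration, not part of the paper) evaluates truncated versions of these sums numerically.

```python
import math

def log_star(x):
    """log x + log log x + ..., base 2, stopping at the last positive term."""
    s, t = 0.0, math.log2(x)
    while t > 0:
        s += t
        t = math.log2(t)
    return s

a = 1.5                                    # any a > 1
c = math.log2((2**a - 1) / (2**a - 2))     # additive constant in transformation i)

def f1(l): return l + a * math.ceil(math.log2(l)) + c
def f2(l): return l + 2 * math.ceil(math.log2(l + 1))
def f3(l): return l + math.ceil(log_star(l)) + 4

def kraft_sum(f, L=2000):
    # 2^l codewords of each length l in the best 1:1 code, so the
    # Kraft sum is sum_l 2^l * 2^{-f(l)} = sum_l 2^{l - f(l)}.
    return sum(2 ** (l - f(l)) for l in range(1, L + 1))

for name, f in (("i)", f1), ("ii)", f2), ("iii)", f3)):
    print(f"transformation {name:4s} truncated Kraft sum = {kraft_sum(f):.6f}  (<= 1)")
```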

We now make use of Theorem 2 to prove some lower bounds on $L_{1:1}$ in terms of the entropy $H$.

Theorem 3: The expected length $L_{1:1}$ of the best 1:1 code satisfies the following lower bounds:
i) $L_{1:1} \geq H - a(1 + \log(H+1)) - \log\frac{2^a - 1}{2^a - 2}$, where $a > 1$; (11)
ii) $L_{1:1} \geq H - 2\log(H+2)$; (12)
iii) $L_{1:1} \geq H - \log(H+1) - \log\log(H+1) - \cdots - 6$. (13)

Proof: i) From Theorem 2 i) and the fact that the expected length for a UD code is at least $H(X)$, we can write
$$E(l + a\lceil \log l \rceil + c) \geq H,$$
where $a > 1$ and $c = \log\frac{2^a - 1}{2^a - 2}$. Therefore $El + a(1 + E\log l) + c \geq H$, where $El = L_{1:1}$. From Jensen's inequality and the concavity of $\log l$, we have $El + a + a\log El + c \geq H$. But $El \leq H + 1$, since $l$ corresponds to the best 1:1 code, which is certainly better than the best prefix code, and we know that the expected length for the best prefix code is less than $H + 1$. Thus
$$El \geq H - a(1 + \log(H+1)) - \log\frac{2^a - 1}{2^a - 2}.$$

ii) From Theorem 2 ii) and the fact that $L_{UD} \geq H$, we have
$$E(l + 2\lceil \log(l+1) \rceil) \geq H,$$
$$El + 2E\log(l+1) \geq H.$$
By Jensen's inequality, $El + 2\log(El + 1) \geq H$. But $El \leq H + 1$ as before. Thus
$$El + 2\log(H+2) \geq H,$$
$$L_{1:1} \geq H - 2\log(H+2).$$

iii) From Theorem 2 iii) and the fact that $L_{UD} \geq H$, we have
$$E(l + \lceil \log l + \log(\log l) + \cdots \rceil + 4) \geq H. \qquad (14)$$
Thus
$$E(l + \log l + \log(\log l) + \cdots + 4) \geq H. \qquad (15)$$

Definition: For convenience we will define the function $\log^* n$ by
$$\log^* n \triangleq \log n + \log\log n + \cdots, \qquad (16)$$
stopping at the last positive term. Then
$$E(l + \log^* l + 4) \geq H. \qquad (17)$$
Although $\log^* l$ is not concave, we prove in Appendix C that there exists a (piecewise-linear) concave function $F^*(l)$ such that $F^*(l) \leq \log^* l \leq F^*(l) + 2$. Thus $E\log^* l \leq EF^*(l) + 2 \leq F^*(El) + 2 \leq \log^*(El) + 2$, yielding, from (17),
$$El + \log^*(El) + 6 \geq H. \qquad (18)$$
But $El \leq H + 1$ as before. Therefore
$$L_{1:1} \geq H - \log(H+1) - \log\log(H+1) - \cdots - 6. \qquad (19)$$

IV. SOME RELATIONS BETWEEN KOLMOGOROV, CHAITIN, AND EXTENSION COMPLEXITIES

Let $\{0,1\}^*$ denote the set of all binary finite length sequences, including the empty sequence. For any $x = (x_1, x_2, \cdots) \in \{0,1\}^* \cup \{0,1\}^\infty$, let $x(n) = (x_1, x_2, \cdots, x_n)$ denote the first $n$ bits of $x$.

Definition: A subset $S$ of $\{0,1\}^*$ is said to have the prefix property if and only if no sequence in $S$ is the proper prefix of any other sequence in $S$. For example, $\{00, 100\}$ has the prefix property, but $\{00, 001\}$ does not.

Definition: The Kolmogorov complexity of a binary sequence $x(n) \in \{0,1\}^n$ with respect to a partial recursive function $A: \{0,1\}^* \times N \to \{0,1\}^*$ is defined to be
$$K_A(x(n)|n) = \min_{p:\, A(p, n) = x(n)} l(p) \qquad (20)$$
where $l(p)$ is the length of the sequence $p$, and $N$ denotes the set of natural numbers.

Here $A$ may be considered to be a computer, $p$ its program, and $x$ its output. We shall use interchangeably the recursive function theoretic terminology and computer terminology. (See, for example, Chaitin [3] for a discussion of the equivalence of the two.)

Definition: Let $U: \{0,1\}^* \to \{0,1\}^*$ be a partial recursive function with a prefix domain. Then the Chaitin complexity of a binary sequence $x$ with respect to $U$ is given by
$$C_U(x) = \min_{p:\, U(p) = x} l(p). \qquad (21)$$

We now introduce a new complexity measure that is useful in prediction and inference.

Definition: Let $U: \{0,1\}^* \to \{0,1\}^*$ be a partial recursive function with a prefix domain. Then the extension complexity of a binary sequence $x$ with respect to $U$ is defined by
$$E_U(x) = \min_{p:\, U(p) \supseteq x} l(p) \qquad (22)$$
where $U(p) \supseteq x$ means that $U(p)$ is an extension of $x$, or equivalently that $x$ is a prefix of $U(p)$.

Definition: Given a complexity measure $C^*_B: \Omega \to N$, where $\Omega$ is countable and $B$ is a partial recursive function, we say that $C^*$ is universal if there exists a partial recursive function $U$ such that for any other partial recursive function $A$ there exists a constant $c$ such that for all $\omega \in \Omega$,
$$C^*_U(\omega) \leq C^*_A(\omega) + c. \qquad (23)$$

It has been shown [3], [4] that the Kolmogorov and Chaitin complexity measures are universal. The same result can be shown to hold for the extension complexity measure. Thus from now on we will assume that the complexities are measured with respect to some fixed appropriate universal function, and the subscripts will be dropped. We shall denote the Chaitin, Kolmogorov, and extension complexities of a binary sequence $x \in \{0,1\}^*$ by $C(x)$, $K(x|l(x))$, and $E(x)$, respectively.

Theorem 4: There exist constants $c_0$ and $c_1$ such that for all $x \in \{0,1\}^*$,
$$E(x) + c_0 \leq C(x) \leq E(x) + \log l(x) + \log\log l(x) + \cdots + c_1 = E(x) + \log^* l(x) + c_1. \qquad (24)$$

Proof: The first inequality follows directly from the definitions of $E(x)$ and $C(x)$. To prove the second inequality, note that a Chaitin complexity program can be constructed from the extension complexity program $p$ as follows. Let $s$ be the shortest program (from a set having the prefix property) for calculating $l(x)$. Then the Chaitin program is the concatenation $qsp$, where $q$ consists of a few bits to tell the computer to expect two programs and interpret them appropriately. So we have
$$C(x) \leq E(x) + C(l(x)) + c_2. \qquad (25)$$
From Theorem 2 iii),
$$C(l(x)) \leq \log l(x) + \log\log l(x) + \cdots + c_3. \qquad (26)$$
Combining (25) and (26) yields Theorem 4.
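The proofs above repeatedly pay about $\log l + \log\log l + \cdots$ extra bits for a prefix-free description of the length $l(x)$. The Elias gamma and delta codes are standard constructions of this kind (they are not the construction used in the paper; this sketch is only an illustration of the idea): gamma spends about $2\log n$ bits and delta about $\log n + 2\log\log n$ bits on the integer $n$, in the spirit of transformations ii) and iii) of Theorem 2.

```python
import math

def elias_gamma(n):
    """Prefix-free code for an integer n >= 1: floor(log n) zeros, then n in binary."""
    b = bin(n)[2:]                     # binary representation, leading bit is 1
    return "0" * (len(b) - 1) + b

def elias_delta(n):
    """Prefix-free code for n >= 1: gamma(number of bits of n), then the
    remaining bits of n.  Codeword length is about log n + 2 log log n."""
    b = bin(n)[2:]
    return elias_gamma(len(b)) + b[1:]

def is_prefix_free(codes):
    codes = sorted(codes)
    return all(not codes[i + 1].startswith(codes[i]) for i in range(len(codes) - 1))

print("prefix-free:", is_prefix_free([elias_delta(n) for n in range(1, 2000)]))
for n in (5, 100, 10**6):
    approx = math.log2(n) + 2 * math.log2(math.log2(n))
    print(f"n = {n:>7}: |delta(n)| = {len(elias_delta(n)):2d} bits,"
          f"  log n + 2 log log n = {approx:5.2f}")
```

The zero-run announces how many bits follow, which is exactly the self-delimiting property a prefix-domain computer needs when two programs are concatenated.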

Let
$$C(x(n)|n^*) = \min_{p:\, U(p,\, n^*) = x(n)} l(p) \qquad (27)$$
be the (conditional) Chaitin complexity of $x(n)$ given $n$, where $n^*$ is the shortest length binary program for $n$ (see Chaitin [3] for definitions of conditional complexities). As before, the domain of $U(\cdot, n^*)$ has the prefix property for each $n$.

The conditional Chaitin complexity of $x$ given its length $l(x)$ and the unconditional Chaitin complexity of $x$ are closely related in the following sense.

Theorem 5: There exist constants $c_0$ and $c_1$ such that for all $x \in \{0,1\}^*$,
$$C(x|l(x)) + c_0 \leq C(x) \leq C(x|l(x)) + \log^* l(x) + c_1. \qquad (28)$$

Proof: The lower bound follows from Chaitin [3, Theorem 3.1.e]. The upper bound follows from Chaitin [3, Theorems 3.1.d, 3.1.f], where it is shown that
$$C(x) \leq C(x, l(x)) + O(1) \leq C(x|l(x)) + C(l(x)) + O(1).$$
But from Theorem 2 iii), $C(l(x)) \leq \log^* l(x) + O(1)$. Hence the theorem is proved.

Theorem 6: There exist constants $c_0$ and $c_1$ such that for all $x \in \{0,1\}^*$,
$$K(x|l(x)) + c_0 \leq C(x) \leq K(x|l(x)) + \log K(x|l(x)) + \cdots + \log l(x) + \log\log l(x) + \cdots + c_1. \qquad (29)$$

Proof: The first inequality is a direct consequence of the definitions. To prove the second inequality, we first note that the Chaitin complexity measure is defined with respect to a computer whose programs belong to a set with the prefix property. From Theorem 2 iii), we know that we can transform the domain of a Kolmogorov complexity measure computer into one which has the prefix property by extending the length of the Kolmogorov complexity program from $K(x|l(x))$ to $K(x|l(x)) + \log K(x|l(x)) + \cdots + c_2$. Let us denote this extended program by $p$. From the proof of Theorem 4, we also know that a program $s$ (belonging to a set with the prefix property) which describes the length of $x$ need not be longer than $\log l(x) + \log\log l(x) + \cdots + c_3$. The Chaitin complexity program can be the concatenation $qsp$, where $q$ consists of a few bits to tell the computer to expect two programs and interpret them appropriately. So
$$C(x) \leq K(x|l(x)) + \log K(x|l(x)) + \cdots + \log l(x) + \log\log l(x) + \cdots + c.$$
This completes the proof of Theorem 6.

V. RELATION OF CHAITIN CODE LENGTH TO SHANNON CODE LENGTH

Let $\{X_i\}_1^\infty$ be a stationary binary stochastic process with marginals $p(x(n))$, $x(n) \in \{0,1\}^n$, $n = 1, 2, \cdots$, and Shannon entropy
$$H(X) = \lim_{n \to \infty} H(X_1, X_2, \cdots, X_n)/n. \qquad (30)$$
The Shannon entropy $H(X_1, \cdots, X_n)$ is a real number, while the Chaitin complexity $C(X_1, \cdots, X_n|n)$ is a random variable equal to the length of the shortest codeword (program) assigned to $(X_1, \cdots, X_n)$ by $U$. The prefix set of codewords so defined may be thought of as a universal prefix encoding of $n$-sequences for each $n$. Note in particular that the prefix encoding induced by $U$ is completely oblivious to the true underlying statistics $p(x_1, \cdots, x_n)$. We shall show, however, that this universal encoding has an expected word length equal to first order to the optimal Shannon bound $H(X_1, \cdots, X_n)$.

First we remark that Levin [7] has asserted (the proof does not appear) that for any finite alphabet ergodic process (with computable probability distribution) $(1/n)K(X_1 \cdots X_n|n) \to H(X)$ with probability one. Thus from Theorem 5 it follows that $(1/n)C(X_1, X_2, \cdots, X_n|n) \to H(X)$ with probability one. We shall show that the behavior of $C$ is good for finite $n$, for all $n$.

Theorem 7: For every computable probability measure $p: \{0,1\}^* \to [0,1]$ for a stochastic process, there exists a constant $c$ such that for all $n$
$$H(X_1, \cdots, X_n) \leq E_p C(X_1, \cdots, X_n|n) \leq H(X_1, \cdots, X_n) + c. \qquad (31)$$

Proof: For each $n$, $C(x(n)|n)$, $x(n) \in \{0,1\}^n$, must satisfy the Kraft inequality. So we have
$$H(X_1, \cdots, X_n) \leq E_p C(X_1, \cdots, X_n|n). \qquad (32)$$
For the right half of the inequality, we must use a theorem of Chaitin and Willis relating $C$ and a certain universal probability measure $P^*$. We then relate $P^*$ to the true distribution $p$ to achieve the desired proof. We define, for some universal computer $U$,
$$P^*(x(n)|n) = \sum_{p:\, U(p,\, n^*) = x(n)} 2^{-l(p)}. \qquad (33)$$
Chaitin has shown [3, Theorem 3.5] (see also Willis [6, Theorem 16]) that there exists a constant $c$ such that
$$C(x(n)|n) \leq \log\frac{1}{P^*(x(n)|n)} + c \qquad (34)$$
for all $n$. In addition, he has shown that for any other prefix domain computer $A$, there exists a constant $c_1$ such that
$$P^*(x(n)|n) \geq c_1 P_A(x(n)|n) \qquad (35)$$
for all $n$, where $P_A(\cdot)$ is defined as in (33).

In Lemma 1 below we show that, for the given computable probability mass function $p: \{0,1\}^* \to [0,1]$ for a stochastic process, there exists a prefix domain computer $A$ such that $P_A(x(n)|n) = p(x(n))$ for all $n$. The proof can then be completed as follows:
$$E_p C(x(n)|n) = \sum_{x(n) \in \{0,1\}^n} p(x(n))\, C(x(n)|n) \qquad (36)$$
$$\leq \sum_{x(n) \in \{0,1\}^n} p(x(n)) \left( \log\frac{1}{P^*(x(n)|n)} + c \right), \quad \text{using (34),} \qquad (37)$$
$$\leq \sum_{x(n) \in \{0,1\}^n} p(x(n)) \left( \log\frac{1}{c_1 P_A(x(n)|n)} + c \right), \quad \text{using (35),} \qquad (38)$$
$$= \sum_{x(n) \in \{0,1\}^n} p(x(n)) \log\frac{1}{p(x(n))} + c', \quad \text{using Lemma 1,} \qquad (39)$$
$$= H(X_1, \cdots, X_n) + c', \quad \text{for all } n, \qquad (40)$$
where $c' = c + \log(1/c_1)$.
Q.E.D.
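Theorem 7's upper bound ultimately rests on exhibiting a prefix code whose induced measure matches $p$, so that the expected codeword length exceeds the entropy by only a constant. The sketch below is illustrative only: Shannon-Fano-Elias coding, a standard textbook construction, is used here as a stand-in for the prefix-domain computer $A$ of Lemma 1. It builds such a code for a small computable distribution on $\{0,1\}^3$ and checks that $H \leq E[\text{length}] < H + 2$.

```python
import math
from fractions import Fraction
from itertools import product

def sfe_code(pmf):
    """Shannon-Fano-Elias code: the codeword for x is the first
    ceil(log2(1/p(x))) + 1 bits of Fbar(x) = sum_{y < x} p(y) + p(x)/2."""
    code, F = {}, Fraction(0)
    for x in sorted(pmf):                  # lexicographic order of sequences
        p = pmf[x]
        Fbar, L = F + p / 2, math.ceil(math.log2(1 / p)) + 1
        bits, v = "", Fbar
        for _ in range(L):                 # first L bits of the binary expansion
            v *= 2
            bits += "1" if v >= 1 else "0"
            v -= int(v)
        code[x] = bits
        F += p
    return code

# A computable distribution on {0,1}^3: an i.i.d. Bernoulli(1/3) source.
n, q = 3, Fraction(1, 3)
pmf = {x: q**sum(x) * (1 - q)**(n - sum(x)) for x in product((0, 1), repeat=n)}

code = sfe_code(pmf)
words = sorted(code.values())
assert all(not words[i + 1].startswith(words[i]) for i in range(len(words) - 1))
H = -sum(float(p) * math.log2(float(p)) for p in pmf.values())
EL = sum(float(p) * len(code[x]) for x, p in pmf.items())
print(f"H = {H:.3f} bits, expected codeword length = {EL:.3f} bits (< H + 2)")
```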

Lemma 1: For any computable probability mass function $p: \{0,1\}^* \to [0,1]$ for a stochastic process, there exists a prefix domain computer $A$ such that $P_A(x(n)|n) = p(x(n))$ for all $n$.

Remark 1: Willis [6, Theorem 12] has proved a similar lemma under the constraint that $p(\cdot)$ be $r$-computable, i.e., that $p(x_1, \cdots, x_n)$ have a finite base-$r$ expansion for every $x_1, x_2, \cdots, x_n$.

Remark 2: Here we define a number to be computable if we can calculate its $n$th bit in finite time for all finite $n$. An analogous result can be proved if by a computable number we mean instead a number which we can approximate arbitrarily closely.
Proof: Let $p^{(k)}(x(n))$ denote $p(x(n))$ truncated after $k$ bits. For example, if $p(x(n)) = 0.001011001\cdots$, then $p^{(5)}(x(n)) = 0.00101$. Define
$$F^{(k)}(x(n)) = \sum_{x'(n) < x(n)} p^{(k)}(x'(n)) \qquad (41)$$
where $x'(n) < x(n)$ means $x'(n)$ precedes $x(n)$ in a lexicographic ordering of the $n$-sequences. Note that $p(x(n))$ being computable does not guarantee that $F(x(n))$ is computable.

Let $A$ be a computer that has $n^*$ on its work tape. It also has at its disposal for inspection a random program $p = p_1 p_2 p_3 p_4 \cdots \in \{0,1\}^\infty$. We now describe how $A$ operates.

Step 1: Calculate $n$.
Step 2: Set $m = 1$.
Step 3: Compute $F^{(m)}(x(n))$ for all $x(n) \in \{0,1\}^n$.
Step 4: The error in summing $2^n$ binary terms, each in $[0,1]$ and each truncated after $m$ places, is bounded above by $2^{n-m}$. Using this crude bound on the difference between $F^{(m)}(x(n))$ and the true distribution function $F(x(n)) \triangleq \sum_{x'(n) < x(n)} p(x'(n))$, and between $.p^{(m)} = .p_1 p_2 \cdots p_m$ and $.p$, decide if at this stage it can be guaranteed that
$$.p \in \left( F(x^*(n)),\; F(x^*(n) + 00\cdots01) \right] \qquad (42)$$
for some $x^*(n) \in \{0,1\}^n$. Here $x(n) + 00\cdots01$ means the sequence obtained by adding $.x(n)$ and $(1/2)^n$ and reinterpreting the result as a sequence. If (42) can be decided, proceed to Step 6.
Step 5: Increment $m$ by 1. Go back to Step 3.
Step 6: Print out $x^*(n)$ and stop.

It is easily seen that
$$\Pr\left\{ .p \in \left( F(x(n)),\; F(x(n) + 00\cdots01) \right] \right\} = p(x(n)) \qquad (43)$$
for all $x(n) \in \{0,1\}^n$. Since $\lim_{m \to \infty} .p^{(m)} = .p$ and $\lim_{m \to \infty} F^{(m)}(x(n)) = F(x(n))$, $A$ will fail to halt only if $.p = F(x(n))$ for some $x(n) \in \{0,1\}^n$. This event has probability zero. Thus there exists a computer $A$ such that a Bernoulli random program $p$ will induce the stochastic process $\{X_i\}$ as its output. Q.E.D.
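For concreteness, here is a sketch of the computer $A$ of Lemma 1 in Python. It is a simplification, not the paper's exact procedure: the probabilities are assumed to be available exactly as rationals, so the truncation bookkeeping of Steps 3-5 is replaced by exact interval arithmetic, and reading one more program bit plays the role of incrementing $m$. Given $n$ and a stream of fair program bits $p_1 p_2 \cdots$, it halts with the unique $x^*(n)$ whose interval $(F(x^*(n)), F(x^*(n)) + p(x^*(n))]$ is guaranteed to contain $.p$, so the output is distributed according to $p$.

```python
import random
from fractions import Fraction
from itertools import product

def computer_A(pmf, program_bits):
    """Map a random bit 'program' to an n-sequence distributed according to pmf."""
    xs = sorted(pmf)                          # lexicographic ordering of sequences
    F, cum = {}, Fraction(0)
    for x in xs:                              # F(x) = sum of p over predecessors of x
        F[x] = cum
        cum += pmf[x]
    lo, width = Fraction(0), Fraction(1)      # .p is known to lie in [lo, lo + width)
    while True:
        for x in xs:                          # halt once the interval for .p fits
            if F[x] < lo and lo + width <= F[x] + pmf[x]:
                return x                      # inside (F(x), F(x) + p(x)]
        width /= 2                            # otherwise read one more program bit
        if next(program_bits) == 1:
            lo += width

# An i.i.d. Bernoulli(1/3) source on {0,1}^3 (illustrative choice).
n, q = 3, Fraction(1, 3)
pmf = {x: q**sum(x) * (1 - q)**(n - sum(x)) for x in product((0, 1), repeat=n)}

random.seed(0)
def fair_bits():
    while True:
        yield random.getrandbits(1)

trials, counts = 20000, {}
for _ in range(trials):
    x = computer_A(pmf, fair_bits())
    counts[x] = counts.get(x, 0) + 1
for x in sorted(pmf):
    print(x, f"empirical {counts.get(x, 0) / trials:.3f}", f"p(x) {float(pmf[x]):.3f}")
```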
VI. CONCLUSIONS
This study can be perceived in three parts. First, the
minimal average code length with respect to a known
distribution has been shown to be equal to the Shannon
entropy H to first order under different coding con-
straints. Second, the individual complexity measures of
Kolmogorov, Chaitin, and others have been shown to be
equivalent to one another, also to first order. Finally, the
expected code length of the individual algorithmic code
has been shown to be equal to first order to the Shannon
entropy, thus identifying the statistical and the logical
definitions of entropy.
ACKNOWLEDGMENT
The authors would like to thank Professor John T. Gill
for suggesting the method used for lower bounding $L_{1:1}$ in
Section III. They also wish to thank both referees for aid
in improving the proofs and making the concepts more
precise.
APPENDIX A: PROOF OF THEOREM 1

Theorem 1:
$$L_{1:1} \geq L_{UD} - \log\log n - 3.$$

Proof: From (1), $L_{UD} \leq H(X) + 1$. We now proceed to find $\max(H(X) - L_{1:1})$. Let $\Delta \triangleq H(X) - L_{1:1}$. Then
$$\Delta = \sum_{i=1}^{n} p_i \log\frac{1}{p_i} - \sum_{i=1}^{n} p_i \left\lceil \log\left(\frac{i}{2}+1\right) \right\rceil \qquad (A3)$$
$$\leq \sum_{i=1}^{n} p_i \left( \log\frac{1}{p_i} - \log\left(\frac{i}{2}+1\right) \right),$$
$$\max \Delta \leq \max \sum_{i=1}^{n} p_i \left( \log\frac{1}{p_i} - \log\left(\frac{i}{2}+1\right) \right). \qquad (A4)$$
Let $c_i = \ln\left(\frac{i}{2}+1\right)$ and let
$$J(p_1, \cdots, p_n) = \sum_{i=1}^{n} p_i \left( \ln\frac{1}{p_i} - c_i \right) + \lambda \sum_{i=1}^{n} p_i. \qquad (A6)$$
Differentiating $J(p_1, \cdots, p_n)$ with respect to $p_i$, we obtain
$$\frac{\partial J}{\partial p_i} = -c_i + \lambda - 1 + \ln\frac{1}{p_i}. \qquad (A7)$$
Setting $\partial J/\partial p_i = 0$, we obtain
$$\ln p_i = \lambda - (c_i + 1), \qquad (A8)$$
i.e.,
$$p_i = e^{\lambda - (c_i + 1)} = \alpha e^{-c_i} \qquad (A9)$$
where $\alpha$ is some constant.

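As a numerical sanity check (not from the paper), the stationarity condition $p_i = \alpha e^{-c_i}$ above suggests trying distributions of the form $p_i \propto 1/(i/2 + 1)$; the sketch below computes $H - L_{1:1}$ for this family and confirms that it stays below the $\log_2 \log_2 n + 3$ ceiling that Theorem 1 (together with $L_{UD} \geq H$) guarantees.

```python
import math

def gap(n):
    """H - L_{1:1} for p_i proportional to 1/(i/2 + 1), i = 1, ..., n
    (the form suggested by the Lagrange condition p_i = alpha * e^{-c_i})."""
    w = [1.0 / (i / 2 + 1) for i in range(1, n + 1)]
    Z = sum(w)
    p = [wi / Z for wi in w]
    H = -sum(pi * math.log2(pi) for pi in p)
    L11 = sum(pi * math.ceil(math.log2(i / 2 + 1)) for i, pi in enumerate(p, start=1))
    return H - L11

for n in (2**8, 2**12, 2**16):
    print(f"n = {n:>6}: H - L_1:1 = {gap(n):4.2f}  <=  "
          f"log2 log2 n + 3 = {math.log2(math.log2(n)) + 3:4.2f}")
```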
Citations
Journal ArticleDOI

A universal prior for integers and estimation by minimum description length

TL;DR: In this article, the minimum description length (MDL) criterion is used to estimate the total number of binary digits required to rewrite the observed data, when each observation is given with some precision.
Journal ArticleDOI

Universal coding, information, prediction, and estimation

TL;DR: A connection between universal codes and the problems of prediction and statistical estimation is established, and a known lower bound for the mean length of universal codes is sharpened and generalized, and optimum universal codes constructed.
Journal ArticleDOI

Universal modeling and coding

TL;DR: A general class of so-called first-in first-out (FIFO) arithmetic codes is described which require no alphabet extension devices and which therefore can be used in conjunction with the best models.
Posted Content

Shannon Information and Kolmogorov Complexity

TL;DR: The basic notions of both theories are discussed, and the relation of both to universal coding, Shannon mutual information versus Kolmogorov (`algorithmic') mutual information, probabilistic sufficient statistic versus algorithmic sufficient statistic (related to lossy compression in the Shannon theory versus meaningful information in the Kolmogsorov theory) are related.
Journal ArticleDOI

Zero-error network coding for acyclic networks

TL;DR: The results in this paper can be regarded as zero-error network coding theorems for acyclic communication networks, and inner and outer bounds on the zero-error admissible coding rate region are obtained in terms of the regions Γ*_N and Γ̄*_N, which are fundamental regions in the entropy space defined by Yeung.
References
Journal ArticleDOI

A Theory of Program Size Formally Identical to Information Theory

TL;DR: A new definition of program-size complexity is made, which has precisely the formal properties of the entropy concept of information theory.
Journal ArticleDOI

The complexity of finite objects and the development of the concepts of information and randomness by means of the theory of algorithms

TL;DR: The present article is a survey of the fundamental results connected with the concept of complexity as the minimum number of binary signs containing all the information about a given object that are sufficient for its recovery (decoding).
Journal ArticleDOI

Computational Complexity and Probability Constructions

TL;DR: Using any universal Turing machine as a basis, it is possible to construct an infinite number of increasingly accurate computable probability measures which are independent of any probability assumptions.
Frequently Asked Questions (1)
Q1. What have the authors contributed in "Some equivalences between shannon entropy and kolmogorov complexity" ?

The authors show that the best one-to-one code for a random variable taking n values is shorter on average than the best uniquely decodable code by at most (log2 log2 n) + 3, and that L_1:1 ≥ H − log2(H+1) − log2 log2(H+1) − ⋯ − 6. They establish relations among the Kolmogorov, Chaitin, and extension complexities, and show that, for all computable probability distributions, the universal prefix codes associated with the conditional Chaitin complexity have expected codeword length within a constant of the Shannon entropy.