IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 40, NO. 4, JULY 1994
A General Formula for Channel Capacity
Sergio Verdú, Fellow, IEEE, and Te Sun Han, Fellow, IEEE
Abstract—A formula for the capacity of arbitrary single-user channels without feedback (not necessarily information stable, stationary, etc.) is proved. Capacity is shown to equal the supremum, over all input processes, of the input-output inf-information rate defined as the liminf in probability of the normalized information density. The key to this result is a new converse approach based on a simple new lower bound on the error probability of m-ary hypothesis tests among equiprobable hypotheses. A necessary and sufficient condition for the validity of the strong converse is given, as well as general expressions for ε-capacity.

Index Terms—Shannon theory, channel capacity, channel coding theorem, channels with memory, strong converse.
I. INTRODUCTION

SHANNON'S formula [1] for channel capacity (the supremum of all rates R for which there exist sequences of codes with vanishing error probability and whose size grows with the block length n as exp(nR)),
$$C = \max_X I(X;Y), \qquad (1.1)$$
holds for memoryless channels. If the channel has memory, then (1.1) generalizes to the familiar limiting expression
$$C = \lim_{n\to\infty} \sup_{X^n} \frac{1}{n} I(X^n; Y^n). \qquad (1.2)$$
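As a quick illustration of (1.1) (a minimal sketch of ours, not part of the paper), the following snippet brute-forces the single-letter maximization for a binary symmetric channel with an arbitrarily chosen crossover probability; the result matches the closed form 1 − h(p).

```python
import numpy as np

def mutual_information(p_x, W):
    """I(X;Y) in bits for input pmf p_x and channel matrix W[x, y] = P(y|x)."""
    p_xy = p_x[:, None] * W                 # joint distribution
    p_y = p_xy.sum(axis=0)                  # output marginal
    mask = p_xy > 0
    return float((p_xy[mask] * np.log2(p_xy[mask] / (p_x[:, None] * p_y)[mask])).sum())

p = 0.11                                    # illustrative crossover probability
W = np.array([[1 - p, p], [p, 1 - p]])
# Brute-force the maximization in (1.1) over Bernoulli(q) input distributions.
grid = np.linspace(0.001, 0.999, 999)
C = max(mutual_information(np.array([1 - q, q]), W) for q in grid)
h = lambda t: -t * np.log2(t) - (1 - t) * np.log2(1 - t)
print(C, 1 - h(p))                          # both are approximately 0.5 bit/channel use
```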
However, the capacity formula (1.2) does not hold in full
generality; its validity was proved by Dobrushin [2] for the
class of
information stable
channels. Those channels can
be roughly described as having the property that the input
that maximizes mutual information and its corresponding
output behave ergodically. That ergodic behavior is the
key to generalize the use of the law of large numbers in
the proof of the direct part of the memoryless channel
coding theorem. Information stability is not a superfluous
sufficient condition for the validity of (1.2).¹ Consider a binary channel where the output codeword is equal to the transmitted codeword with probability 1/2 and independent of the transmitted codeword with probability 1/2. The capacity of this channel is equal to 0 because arbitrarily small error probability is unattainable. However, the right-hand side of (1.2) is equal to 1/2 bit/channel use.

Manuscript received December 15, 1992; revised June 12, 1993. This work was supported in part by the National Science Foundation under PYI Award ECSE-8857689 and by a grant from NEC. This paper was presented in part at the 1993 IEEE Workshop on Information Theory, Shizuoka, Japan, June 1993.
S. Verdú is with the Department of Electrical Engineering, Princeton University, Princeton, NJ 08544.
T. S. Han is with the Graduate School of Information Systems, University of Electro-Communications, Tokyo 182, Japan.
IEEE Log Number 9402452.
¹ In fact, it was shown by Hu [3] that information stability is essentially equivalent to the validity of formula (1.2).
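The bimodal behavior responsible for this gap is easy to see numerically. The sketch below (ours; it assumes i.i.d. equiprobable inputs and that the useless realization produces an equiprobable output independent of the input) samples the normalized information density of the binary channel example above: roughly half of the mass sits near 1 bit and half near 0, so its liminf in probability, and hence the capacity, is 0, even though the normalized mutual information tends to 1/2.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 100, 20000
density = np.empty(trials)
for t in range(trials):
    x = rng.integers(0, 2, size=n)                 # i.i.d. equiprobable input block
    if rng.random() < 0.5:
        y = x                                      # noiseless realization
    else:
        y = rng.integers(0, 2, size=n)             # output independent of the input
    # P_{Y^n|X^n}(y|x) = 0.5*1{y=x} + 0.5*2^{-n} and P_{Y^n}(y) = 2^{-n}, so
    # (1/n) i(x;y) = (1/n) log2(0.5*1{y=x}*2^n + 0.5)
    match = float(np.array_equal(y, x))
    density[t] = np.log2(0.5 * match * 2.0 ** n + 0.5) / n
print(np.quantile(density, [0.05, 0.25, 0.75, 0.95]))
# About half of the samples are near 0 and half near 1, so the liminf in
# probability is 0 while (1/n) I(X^n;Y^n) tends to 1/2.
```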
The immediate question is whether there exists a com-
pletely general formula for channel capacity, which does
not require any assumption such as memorylessness, in-
formation stability, stationarity, causality, etc. Such a for-
mula is found in this paper.
Finding expressions for channel capacity in terms of the
probabilistic description of the channel is the purpose of
channel coding theorems. The literature on coding theo-
rems for single-user channels is vast (cf., e.g., [4]). Since
Dobrushin's information stability condition is not always
easy to check for specific channels, a large number of
works have been devoted to showing the validity of (1.2)
for classes of channels characterized by their memory
structure, such as finite-memory and asymptotically mem-
oryless conditions. The first example of a channel for
which formula (1.2) fails to hold was given in 1957 by
Nedoma [5]. In order to go beyond (1.2) and obtain
capacity formulas for
information unstable
channels, re-
searchers typically considered averages of stationary er-
godic channels, i.e., channels which, conditioned on the
initial choice of a parameter, are information stable. A
formula for averaged discrete memoryless channels was
obtained by Ahlswede [6] where he realized that the Fano
inequality fell short of providing a tight converse for those
channels. Another class of channels that are not necessarily
information stable was studied by Winkelbauer [7]: sta-
tionary discrete regular decomposable channels with finite
input memory. Using the ergodic decomposition theorem,
Winkelbauer arrived at a formula for ε-capacity that holds
for all but a countable number of values of ε. Nedoma [8]
had shown that some stationary nonergodic channels can-
not be represented as a mixture of ergodic channels;
however, the use of the ergodic decomposition theorem
was circumvented by Kieffer [9] who showed that
Winkelbauer's capacity formula applies to all discrete
stationary nonanticipatory channels. This was achieved by
a converse whose proof involves Fano's and Chebyshev's
inequalities plus a generalized Shannon-McMillan Theo-
rem for periodic measures. The stationarity of the channel
is a crucial assumption in that argument.
Using the Fano inequality, it can be easily shown (cf.
Section III) that the capacity of every channel (defined in
the conventional way, cf. Section II) satisfies
$$C \le \liminf_{n\to\infty} \sup_{X^n} \frac{1}{n} I(X^n; Y^n). \qquad (1.3)$$
To establish equality in (1.3), the direct part of the coding
theorem needs to assume information stability of the
channel. Thus, the main existing results that constitute
our starting point are a converse theorem (i.e., an upper
bound on capacity) which holds in full generality and a
direct theorem which holds for information stable chan-
nels. At first glance, this may lead one to conclude that
the key to a general capacity formula is a new direct
theorem which holds without assumptions. However, the
foregoing example shows that the converse (1.3) is not
tight in that case. Thus, what is needed is a new converse
which is tight for every channel. Such a converse is the
main result of this paper. It is obtained without recourse
to the Fano inequality which, as we will see, cannot lead
to the desired result. The proof that the new converse is
tight (i.e., a general direct theorem) follows from the
conventional argument once the right definition is made.
The capacity formula proved in this paper is
$$C = \sup_X \underline{I}(X;Y). \qquad (1.4)$$
In (1.4), X denotes an input process in the form of a sequence of finite-dimensional distributions X = {X^n = (X_1^{(n)}, ..., X_n^{(n)})}_{n=1}^∞. We denote by Y = {Y^n = (Y_1^{(n)}, ..., Y_n^{(n)})}_{n=1}^∞ the corresponding output sequence of finite-dimensional distributions induced by X via the channel² W = {W^n = P_{Y^n|X^n} : A^n → B^n}_{n=1}^∞, which is an arbitrary sequence of n-dimensional conditional output distributions from A^n to B^n, where A and B are the input and output alphabets, respectively. The symbol \underline{I}(X;Y) appearing in (1.4) is the inf-information rate between X and Y, which is defined in [10] as the liminf in probability³ of the sequence of normalized information densities (1/n) i_{X^n W^n}(X^n; Y^n), where
$$i_{X^n W^n}(a^n; b^n) = \log \frac{P_{Y^n|X^n}(b^n|a^n)}{P_{Y^n}(b^n)}. \qquad (1.5)$$
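As a concrete reading of (1.5) (our illustration, restricted to a memoryless channel used with a product input distribution), the normalized information density can be computed as follows; for a discrete memoryless channel with i.i.d. inputs it concentrates around the mutual information by the law of large numbers, which is how (1.4) recovers (1.1).

```python
import numpy as np

def normalized_information_density(x_seq, y_seq, p_x, W):
    """(1/n) i_{X^n W^n}(x^n; y^n) in bits for a memoryless channel.

    p_x : input pmf on the finite alphabet A (used i.i.d.)
    W   : channel matrix with W[a, b] = P_{Y|X}(b|a); the n-fold channel is the product.
    """
    p_y = p_x @ W                                  # output marginal induced by p_x
    num = np.log2(W[x_seq, y_seq])                 # log P_{Y^n|X^n} factorizes
    den = np.log2(p_y[y_seq])                      # log P_{Y^n} factorizes for product inputs
    return float(np.mean(num - den))

# Example: BSC(0.11) with i.i.d. equiprobable inputs.
rng = np.random.default_rng(1)
p, n = 0.11, 10000
W = np.array([[1 - p, p], [p, 1 - p]])
x = rng.integers(0, 2, size=n)
y = np.where(rng.random(n) < p, 1 - x, x)
print(normalized_information_density(x, y, np.array([0.5, 0.5]), W))  # close to 1 - h(0.11) = 0.5
```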
For ease of notation and to highlight the simplicity of the proofs, we have assumed in (1.5) and throughout the paper that the input and output alphabets are finite. However, it will be apparent from our proofs that the results of this paper do not depend on that assumption. They can be shown for channels with abstract alphabets by working with a general information density defined in the conventional way [11] as the log derivative of the conditional output measure with respect to the unconditional output measure.

² The methods of this paper allow the study, with routine modifications, of even more abstract channels defined by arbitrary sequences of conditional output distributions, which need not map Cartesian products of the input/output alphabets. The only requirement is that the index of the sequence be the parameter that divides the amount of information in the definition of rate.

³ If A_n is a sequence of random variables, its liminf in probability is the supremum of all the reals α for which P[A_n ≤ α] → 0 as n → ∞. Similarly, its limsup in probability is the infimum of all the reals β for which P[A_n ≥ β] → 0 as n → ∞.
The notion of inf/sup-information/entropy rates and the recognition of their key role in dealing with nonergodic/nonstationary sources are due to [10]. In particular, that paper shows that the minimum achievable source coding rate for any finite-alphabet source X = {X^n}_{n=1}^∞ is equal to its sup-entropy rate H̄(X), defined as the limsup in probability of (1/n) log 1/P_{X^n}(X^n). In contrast to the general capacity formula presented in this paper, the general source coding result can be shown by generalizing existing proofs.
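For intuition about the sup-entropy rate (an illustration of ours, not an example from [10]): take a source that is Bernoulli(0.11) for the entire block with probability 1/2 and Bernoulli(0.4) with probability 1/2. The normalized self-information then concentrates near h(0.11) or h(0.4) depending on the realized mode, so the sup-entropy rate, and hence the minimum source coding rate, is h(0.4) ≈ 0.97 bit.

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials = 5000, 2000
h = lambda t: -t * np.log2(t) - (1 - t) * np.log2(1 - t)
vals = np.empty(trials)
for t in range(trials):
    q = 0.11 if rng.random() < 0.5 else 0.4       # mode drawn once per block
    x = (rng.random(n) < q).astype(int)
    k = int(x.sum())
    # Mixture probability of the realized block, combining both modes in log domain.
    logp = np.logaddexp(np.log(0.5) + k * np.log(0.11) + (n - k) * np.log(0.89),
                        np.log(0.5) + k * np.log(0.4) + (n - k) * np.log(0.6))
    vals[t] = -logp / (n * np.log(2))             # (1/n) log2 1/P_{X^n}(X^n)
print(np.quantile(vals, [0.1, 0.9]), h(0.11), h(0.4))
# The mass splits between h(0.11) ~ 0.50 and h(0.4) ~ 0.97; the limsup in
# probability (the sup-entropy rate) is h(0.4).
```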
The definition of channel as a sequence of finite-
dimensional conditional distributions can be found in
well-known contributions to the Shannon-theoretic litera-
ture (e.g., Dobrushin [2], Wolfowitz [12, ch. 7], and Csiszár
and Körner [13, p. 100]), although, as we saw, previous
coding theorems imposed restrictions on the allowable
class of conditional distributions. Essentially the same
general channel model was analyzed in [26] arriving at a
capacity formula which is not quite correct. A different
approach has been followed in the ergodic-theoretic liter-
ature, which defines a channel as a conditional distribu-
tion between spaces of doubly infinite sequences.⁴ In that
setting (and within the domain of block coding [14]),
codewords are preceded by a prehistory (a left-sided infi-
nite sequence) and followed by a posthistory (a right-sided
infinite sequence); the error probability may be defined in
a worst case sense over all possible input pre- and posthis-
tories. The channel definition adopted in this paper,
namely, a sequence of finite-dimensional distributions,
captures the physical situation to be modeled where block
codewords are transmitted through the channel. It is
possible to encompass physical models that incorporate
anticipation, unlimited memory, nonstationarity, etc., be-
cause we avoid placing restrictions on the sequence of
conditional distributions. Instead of taking the worst case
error probability over all possible pre- and posthistories,
whatever statistical knowledge is available about those
sequences can be incorporated by averaging the condi-
tional transition probabilities (and, thus, averaging the
error probability) over all possible pre- and posthistories.
For example, consider a simple channel with memory:
$$y_i = x_i + x_{i-1} + n_i,$$
where {n_i} is an i.i.d. sequence with distribution P_N. The posthistory to any n-block codeword is irrelevant since this channel is causal. The conditional output distribution takes the form
$$P_{Y^n|X^n}(y^n|x^n) = P_{Y_1|X_1}(y_1|x_1)\prod_{i=2}^{n} P_N(y_i - x_i - x_{i-1}),$$
where the statistical information about the prehistory (summarized by the distribution of the initial state) only affects P_{Y_1|X_1}:
$$P_{Y_1|X_1}(y_1|x_1) = \sum_{x_0} P_N(y_1 - x_1 - x_0)\, P_{X_0}(x_0).$$

⁴ Or occasionally semi-infinite sequences, as in [9].
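A minimal sketch of the averaging just described (the function name, noise law, and initial-state distribution below are ours and purely illustrative):

```python
def conditional_output_prob(y, x, P_N, P_X0):
    """P_{Y^n|X^n}(y^n | x^n) for the channel y_i = x_i + x_{i-1} + n_i.

    P_N  : dict mapping noise values to probabilities (i.i.d. noise)
    P_X0 : dict for the initial state x_0; the prehistory enters only through it.
    """
    # First output symbol: average the transition probability over the unknown x_0.
    prob = sum(P_X0[x0] * P_N.get(y[0] - x[0] - x0, 0.0) for x0 in P_X0)
    # Remaining symbols are determined by the codeword itself (causal, finite memory).
    for i in range(1, len(x)):
        prob *= P_N.get(y[i] - x[i] - x[i - 1], 0.0)
    return prob

# Example with assumed values: ternary additive noise and an equiprobable initial state.
P_N = {0: 0.8, 1: 0.1, -1: 0.1}
P_X0 = {0: 0.5, 1: 0.5}
print(conditional_output_prob([1, 1, 1, 2], [1, 0, 1, 1], P_N, P_X0))
```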

In this case, the choice of P_{X_0}(x_0) does not affect the
value of the capacity. In general, if a worst-case approach
is desired, an alternative to the aforementioned approach
is to adopt a compound channel model [12] defined as a
family of sequences of finite-dimensional distributions
parametrized by the unknown initial state which belongs
to an uncertainty set. That model, or the more general
arbitrarily varying channel, incorporates nonprobabilistic
modeling of uncertainty, and is thus outside the scope of
this paper.
In Section II, we show the direct part of the capacity formula C ≥ sup_X \underline{I}(X;Y). This result follows in a straightforward fashion from Feinstein's lemma [15] and the definition of inf-information rate. Section III is devoted to the proof of the converse C ≤ sup_X \underline{I}(X;Y). It presents a new approach to the converse of the coding theorem based on a simple lower bound on error probability that can be seen as a natural counterpart to the upper bound provided by Feinstein's lemma. That new lower bound, along with the upper bound in Feinstein's lemma, are shown to lead to tight results on the ε-capacity of arbitrary channels in Section IV. Another application of the new lower bound is given in Section V: a necessary and sufficient condition for the validity of the strong converse. Section VI shows that many of the familiar properties of mutual information are satisfied by the inf-information rate, thereby facilitating the evaluation of the general formula (1.4). Examples of said evaluation for channels that are not encompassed by previous formulas can be found in Section VII.
II. DIRECT CODING THEOREM: C ≥ sup_X \underline{I}(X;Y)

The conventional definition of channel capacity is (e.g., [13]) the following.

Definition 1: An (n, M, ε) code has block length n, M codewords, and error probability⁵ not larger than ε. R ≥ 0 is an ε-achievable rate if, for every δ > 0, there exist, for all sufficiently large n, (n, M, ε) codes with rate
$$\frac{\log M}{n} > R - \delta.$$
The maximum ε-achievable rate is called the ε-capacity C_ε. The channel capacity C is defined as the maximal rate that is ε-achievable for all 0 < ε < 1. It follows immediately from the definition that C = lim_{ε↓0} C_ε.

The basis to prove the desired lower and upper bounds on capacity are respective upper and lower bounds on the error probability of a code as a function of its size. The following classical result (Feinstein's lemma) [15] shows the existence of a code with a guaranteed error probability as a function of its size.

Theorem 1: Fix a positive integer n and 0 < ε < 1. For every γ > 0 and input distribution P_{X^n} on A^n, there exists an (n, M, ε) code for the transition probability W^n = P_{Y^n|X^n} that satisfies
$$\varepsilon \le P\!\left[\frac{1}{n}\, i_{X^n W^n}(X^n; Y^n) \le \frac{1}{n}\log M + \gamma\right] + \exp(-\gamma n). \qquad (2.1)$$

Note that Theorem 1 applies to arbitrary fixed block length and, moreover, to general random transformations from input to output, not necessarily only to transformations between nth Cartesian products of sets. However, we have chosen to state Theorem 1 in that setting for the sake of concreteness.

Armed with Theorem 1 and the definitions of capacity and inf-information rate, it is now straightforward to prove the direct part of the coding theorem.

Theorem 2:⁶
$$C \ge \sup_X \underline{I}(X;Y). \qquad (2.2)$$

Proof: Fix arbitrary 0 < ε < 1 and X. We shall show that \underline{I}(X;Y) is an ε-achievable rate by demonstrating that, for every δ > 0 and all sufficiently large n, there exist (n, M, exp(−nδ/4) + ε/2) codes with rate
$$\underline{I}(X;Y) - \delta < \frac{\log M}{n} < \underline{I}(X;Y) - \frac{\delta}{2}. \qquad (2.3)$$
If, in Theorem 1, we choose γ = δ/4, then the probability in (2.1) becomes
$$P\!\left[\frac{1}{n}\, i_{X^n W^n}(X^n;Y^n) \le \frac{1}{n}\log M + \frac{\delta}{4}\right] \le P\!\left[\frac{1}{n}\, i_{X^n W^n}(X^n;Y^n) \le \underline{I}(X;Y) - \frac{\delta}{4}\right] \le \frac{\varepsilon}{2}, \qquad (2.4)$$
where the second inequality holds for all sufficiently large n because of the definition of \underline{I}(X;Y). In view of (2.4), Theorem 1 guarantees the existence of the desired codes. □
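The way Theorem 2 extracts the direct part from Theorem 1 can be seen numerically. The sketch below (ours; base-2 logarithms, a BSC with i.i.d. equiprobable inputs, and illustrative parameter values) evaluates the right-hand side of (2.1) exactly as a binomial tail and shows it vanishing with n whenever the rate is below capacity.

```python
import numpy as np
from math import exp, lgamma, log

def feinstein_guarantee(n, R, gamma, p):
    """Right-hand side of (2.1), in base 2, for a BSC(p) with i.i.d. equiprobable
    inputs and rate R = (log2 M)/n (an illustrative evaluation, not the paper's notation).

    With equiprobable inputs the information density is a sum of i.i.d. terms:
    log2(2(1-p)) on correct symbols and log2(2p) on flipped ones, so the probability
    in (2.1) reduces to a binomial tail over the number of flips k.
    """
    prob = 0.0
    for k in range(n + 1):
        i_n = (k * np.log2(2 * p) + (n - k) * np.log2(2 * (1 - p))) / n
        if i_n <= R + gamma:
            log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
                       + k * log(p) + (n - k) * log(1 - p))
            prob += exp(log_pmf)
    return prob + 2.0 ** (-gamma * n)

# At a rate below capacity (1 - h(0.11) ~ 0.5 bit) the guarantee vanishes with n.
for n in (100, 1000, 5000):
    print(n, feinstein_guarantee(n, R=0.4, gamma=0.02, p=0.11))
```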
III. CONVERSE CODING THEOREM: C ≤ sup_X \underline{I}(X;Y)
This section is devoted to our main result: a tight
converse that holds in full generality. To that end, we
need to obtain for any arbitrary code a lower bound on its
error probability as a function of its size or, equivalently,
an upper bound on its size as a function of its error
probability. One such bound is the standard one resulting
from the Fano inequality.
Theorem 3: Every (n, M, ε) code satisfies
$$\log M \le \frac{1}{1-\varepsilon}\left( I(X^n; Y^n) + h(\varepsilon) \right) \qquad (3.1)$$
where h is the binary entropy function, X^n is the input distribution that places probability mass 1/M on each of the input codewords, and Y^n is its corresponding output distribution.
⁵ We work throughout with average error probability. It is well known that the capacity of a single-user channel with known statistical description remains the same under the maximal error probability criterion.
⁶ Whenever we omit the set over which the supremum is taken, it is understood that it is equal to the set of all sequences of finite-dimensional distributions on input strings.

Using Theorem 3, it is evident that if R ≥ 0 is ε-achievable, then for every δ > 0,
$$R - \delta < \frac{1}{1-\varepsilon}\left( \sup_{X^n} \frac{1}{n} I(X^n; Y^n) + \frac{h(\varepsilon)}{n} \right) \qquad (3.2)$$
which, in turn, implies
$$R \le \frac{1}{1-\varepsilon}\, \liminf_{n\to\infty} \sup_{X^n} \frac{1}{n} I(X^n; Y^n). \qquad (3.3)$$
Thus, the general converse in (1.3) follows by letting
ε → 0. But, as we illustrated in Section I, (1.3) is not
always tight. The standard bound in Theorem 3 falls short
of leading to the desired tight converse because it de-
pends on the channel through the input-output
mutual
information
(expectation of information density) achieved
by the code. Instead, we need a finer bound that depends
on the distribution of the information density achieved by
the code, rather than on just its expectation. The follow-
ing basic result provides such a bound in a form which is
pleasingly dual to the Feinstein bound. As for the
Feinstein bound, Theorem 4 holds not only for arbitrary
fixed block length, but for an arbitrary random transfor-
mation.
Theorem 4: Every (n, M, ε) code satisfies
$$\varepsilon \ge P\!\left[\frac{1}{n}\, i_{X^n W^n}(X^n;Y^n) \le \frac{1}{n}\log M - \gamma\right] - \exp(-\gamma n) \qquad (3.4)$$
for every γ > 0, where X^n places probability mass 1/M on each codeword.
Proof: Denote β = exp(−γn). Note first that the event whose probability appears in (3.4) is equal to the set of atypical input-output pairs
$$L = \{(a^n, b^n) \in A^n \times B^n : P_{X^n|Y^n}(a^n|b^n) \le \beta\}. \qquad (3.5)$$
This is because the information density can be written as
$$i_{X^n W^n}(a^n; b^n) = \log \frac{P_{X^n|Y^n}(a^n|b^n)}{P_{X^n}(a^n)} \qquad (3.6)$$
and P_{X^n}(c_i) = 1/M for each of the M codewords c_i ∈ A^n. We need to show that
$$P_{X^n Y^n}[L] \le \varepsilon + \beta. \qquad (3.7)$$
Now, denoting the decoding set corresponding to c_i by D_i and
$$B_i = \{b^n \in B^n : P_{X^n|Y^n}(c_i|b^n) \le \beta\}, \qquad (3.8)$$
we can write
$$\begin{aligned}
P_{X^n Y^n}[L] &= \sum_{i=1}^{M} P_{X^n Y^n}\big[(\{c_i\}, B_i)\big] \\
&= \sum_{i=1}^{M} P_{X^n Y^n}\big[(\{c_i\}, B_i \cap D_i^c)\big] + \sum_{i=1}^{M} P_{X^n Y^n}\big[(\{c_i\}, B_i \cap D_i)\big] \\
&\le \varepsilon + \sum_{i=1}^{M} \sum_{b^n \in B_i \cap D_i} P_{Y^n}(b^n)\, P_{X^n|Y^n}(c_i|b^n) \\
&\le \varepsilon + \beta \qquad (3.9)
\end{aligned}$$
where the first inequality holds because the average error probability of the code equals \sum_{i=1}^{M} P_{X^n Y^n}[(\{c_i\}, D_i^c)] \le \varepsilon, and the second inequality is due to (3.8) and the disjointness of the decoding sets. □
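As a sanity check of Theorem 4 (our construction; the blocklength, code size, and crossover probability are arbitrary, and everything is stated in base 2), the following exact computation compares the right-hand side of (3.4) with the true average error probability of maximum-likelihood decoding for a small random code on a BSC.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)
n, M, p = 8, 64, 0.11                                   # illustrative parameters (rate 0.75 bit)
codebook = rng.integers(0, 2, size=(M, n))              # random codewords, equiprobable messages
outputs = np.array(list(product([0, 1], repeat=n)))     # every possible output string

# lik[i, j] = P_{Y^n|X^n}(outputs[j] | codebook[i]) for a BSC(p)
dist = (codebook[:, None, :] != outputs[None, :, :]).sum(axis=2)
lik = p ** dist * (1 - p) ** (n - dist)
p_y = lik.mean(axis=0)                                  # P_{Y^n} when X^n has mass 1/M per codeword

eps = 1.0 - lik.max(axis=0).sum() / M                   # exact ML average error probability

info = np.log2(lik / p_y) / n                           # (1/n) i(c_i; y) for every pair
joint = lik / M                                         # P_{X^n Y^n}(c_i, y)
rate = np.log2(M) / n
for gamma in (0.1, 0.25, 0.4):
    bound = joint[info <= rate - gamma].sum() - 2.0 ** (-gamma * n)
    print(f"gamma={gamma}: lower bound {bound:+.4f} <= error probability {eps:.4f}")
```

Whatever the realized codebook, the printed lower bounds never exceed the error probability, as Theorem 4 guarantees.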
Theorems 3 and 4 hold for arbitrary random transformations, in which general setting they are nothing but lower bounds on the minimum error probability of M-ary equiprobable hypothesis testing. If, in that general setting, we denote the observations by Y and the true hypothesis by X (equiprobable on {1, ..., M}), the M hypothesized distributions are the conditional distributions {P_{Y|X=i}, i = 1, ..., M}. The bound in Theorem 3 yields
$$\varepsilon \ge 1 - \frac{I(X;Y) + \log 2}{\log M}.$$
A slightly weaker result is known in statistical inference as Fano's lemma [16]. The bound in Theorem 4 can easily be seen to be equivalent to the more general version
$$\varepsilon \ge P\big[P_{X|Y}(X|Y) \le \alpha\big] - \alpha$$
for arbitrary 0 ≤ α ≤ 1. A stronger bound which holds without the assumption of equiprobable hypotheses has been found recently in [17].
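This more general version is easy to verify exactly for any finite test with equiprobable hypotheses. The sketch below (ours, with randomly generated hypothesized distributions) compares the minimum (MAP) error probability with P[P_{X|Y}(X|Y) ≤ α] − α for a few values of α.

```python
import numpy as np

rng = np.random.default_rng(4)
M, K = 5, 12                                        # 5 equiprobable hypotheses, |Y| = 12 (illustrative)
P_y_given_x = rng.dirichlet(np.ones(K), size=M)     # the M hypothesized distributions
p_xy = P_y_given_x / M                              # joint law under the uniform prior
p_y = p_xy.sum(axis=0)
post = p_xy / p_y                                   # posterior P_{X|Y}(i|y)

map_error = 1.0 - (p_y * post.max(axis=0)).sum()    # exact minimum error probability (MAP rule)
for alpha in (0.1, 0.2, 0.4, 0.6):
    bound = p_xy[post <= alpha].sum() - alpha       # P[P_{X|Y}(X|Y) <= alpha] - alpha
    print(f"alpha={alpha}: bound {bound:+.4f} <= MAP error {map_error:.4f}")
```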
Theorem 4 gives a family (parametrized by γ) of lower bounds on the error probability. To obtain the best bound, we simply maximize the right-hand side of (3.4) over γ. However, a judicious, if not optimum, choice of γ is sufficient for the purposes of proving the general converse.
Theorem 5:
$$C \le \sup_X \underline{I}(X;Y). \qquad (3.10)$$
Proof: The intuition behind the use of Theorem 4 to prove the converse is very simple. As a shorthand, let us refer to a sequence of codes with vanishingly small error probability (i.e., a sequence of (n, M, ε_n) codes such that ε_n → 0) as a reliable code sequence. Also, we will say that the information spectrum of a code (a term coined in [10]) is the distribution of the normalized information density evaluated with the input distribution X^n that places equal probability mass on each of the codewords of the code. Theorem 4 implies that if a reliable code sequence has rate R, then the mass of its information spectrum lying strictly to the left of R must be asymptotically negligible.

In other words, R ≤ \underline{I}(X;Y), where X corresponds to the sequence of input distributions generated by the sequence of codebooks.
To formalize this reasoning, let us argue by contradiction and assume that for some β > 0,
$$C = \sup_X \underline{I}(X;Y) + 3\beta. \qquad (3.11)$$
By definition of capacity, there exists a reliable code sequence with rate
$$\frac{\log M}{n} > C - \beta. \qquad (3.12)$$
Now, letting X^n be the distribution that places probability mass 1/M on the codewords of that code, Theorem 4 (choosing γ = β), (3.11), and (3.12) imply that the error probability must be lower bounded by
$$\varepsilon_n \ge P\!\left[\frac{1}{n}\, i_{X^n W^n}(X^n; Y^n) \le \sup_X \underline{I}(X;Y) + \beta\right] - \exp(-n\beta). \qquad (3.13)$$
But, by definition of \underline{I}(X;Y), the probability on the right-hand side of (3.13) cannot vanish asymptotically, thereby contradicting the fact that ε_n → 0. □
Besides the behavior of the information spectrum of a
reliable code sequence revealed in the proof of Theorem
5, it is worth pointing out that the information spectrum
of any code places no probability mass above its rate. To
see this, simply note that (3.6) implies
$$P\!\left[\frac{1}{n}\, i_{X^n W^n}(X^n; Y^n) \le \frac{1}{n}\log M\right] = 1. \qquad (3.14)$$
Thus, we can conclude that the normalized information
density of a reliable code sequence converges in probabil-
ity to its rate. For finite-input channels, this implies [10,
Lemma 1] the same behavior for the sequence of normal-
ized mutual informations, thereby yielding the classical
bound (1.3). However, that bound is not tight for informa-
tion unstable channels because, in that case, the mutual
information is maximized by input distributions whose
information spectrum does not converge to a single point
mass (unlike the behavior of the information spectrum of
a reliable code sequence).
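A direct numerical reading of (3.14) (ours; any codebook and channel would do, the ones below are illustrative): over the entire support of codeword/output pairs, the normalized information density never exceeds the rate of the code.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(5)
n, M, p = 6, 8, 0.2                                     # illustrative parameters
codebook = rng.integers(0, 2, size=(M, n))
outputs = np.array(list(product([0, 1], repeat=n)))
dist = (codebook[:, None, :] != outputs[None, :, :]).sum(axis=2)
lik = p ** dist * (1 - p) ** (n - dist)                 # P_{Y^n|X^n}(y | c_i)
p_y = lik.mean(axis=0)                                  # output law for equiprobable codewords
info = np.log2(lik / p_y) / n                           # (1/n) i(c_i; y) over the whole support
print(info.max(), np.log2(M) / n)                       # the maximum never exceeds the rate, cf. (3.14)
```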
Upon reflecting on the proofs of the general direct and
converse theorems presented in Sections II and III, we
can see that those results follow from asymptotically tight
upper and lower bounds on error probability, and are
decoupled from ergodic results such as the law of large
numbers or the asymptotic equipartition property. Those
ergodic results enter the picture only as a way to
particularize the general capacity formula to special classes
of channels (such as memoryless or information stable
channels) so that capacity can be written in terms of the
mutual information rate.
Unlike the conventional approach to the converse coding theorem (Theorem 3), Theorem 4 can be used to provide a formula for ε-capacity, as we show in Section IV. Another problem where Theorem 4 proves to be the key result is that of combined source/channel coding [18]. It turns out that when dealing with arbitrary sources and channels, the separation theorem may not hold because, in general, it could happen that a source is transmissible over a channel even if the minimum achievable source coding rate (sup-entropy rate) exceeds the channel capacity. Necessary and sufficient conditions for the transmissibility of a source over a channel are obtained in [18].
Definition 1 is the conventional definition of channel
capacity (cf. [15] and [13]) where codes are required to be
reliable for all sufficiently large block length. An alterna-
tive, more optimistic, definition of capacity can be consid-
ered where codes are required to be reliable only in-
finitely often. This definition is less appealing in many
practical situations because of the additional uncertainty
in the favorable block lengths. Both definitions turn out to
lead to the same capacity formula for specific channel
classes such as discrete memoryless channels [13]. How-
ever, in general, both quantities need not be equal, and
the optimistic definition does not appear to admit a sim-
ple general formula such as the one in (1.4) for the
conventional definition. In particular, the optimistic ca-
pacity need not be equal to the supremum of sup-infor-
mation rates. See [18] for further characterization of this
quantity.
The conventional definition of capacity may be faulted
for being too conservative in those rare situations where
the maximum amount of reliably transmissible informa-
tion does not grow linearly with block length, but, rather,
as O(b(n)). For example, consider the case b(n) = n + n sin(an). This can be easily taken into account by seasonal adjusting: substitution of n by b(n) in the definition of rate and in all previous results.
IV. ε-CAPACITY
The fundamental tools (Theorems 1 and 4) we used in
Section III to prove the general capacity formula are used
in this section to find upper and lower bounds on C_ε, the ε-capacity of the channel, for 0 < ε < 1. These bounds coincide at the points where the ε-capacity is a continuous function of ε.
Theorem 6: For 0 < ε < 1, the ε-capacity C_ε satisfies
$$C_\varepsilon \le \sup_X \sup\{R : F_X(R) \le \varepsilon\} \qquad (4.1)$$
$$C_\varepsilon \ge \sup_X \sup\{R : F_X(R) < \varepsilon\} \qquad (4.2)$$
where F_X(R) denotes the limsup of the cumulative distribution functions
$$F_X(R) = \limsup_{n\to\infty} P\!\left[\frac{1}{n}\, i_{X^n W^n}(X^n, Y^n) < R\right]. \qquad (4.3)$$
The bounds (4.1) and (4.2) hold with equality, except possibly at the points of discontinuity of C_ε, of which there are, at most, countably many.
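As an illustration of how Theorem 6 is read (a finite-n stand-in of ours for (4.3), using the mixed binary channel of Section I with i.i.d. equiprobable inputs), the empirical distribution function of the normalized information density can be computed and the quantities sup{R : F(R) ≤ ε} read off: they are near 0 bits for ε < 1/2 and near 1 bit for ε > 1/2, matching C_ε = 0 and C_ε = 1, respectively, for that channel.

```python
import numpy as np

rng = np.random.default_rng(6)
n, trials = 200, 50000
noiseless = rng.random(trials) < 0.5
# Normalized information density of the mixed channel with i.i.d. equiprobable
# inputs: about 1 bit on the noiseless realization and about 0 otherwise (the
# event y^n = x^n under the independent realization has probability 2^{-n}).
samples = np.where(noiseless, np.log2(2.0 ** (n - 1) + 0.5) / n, np.log2(0.5) / n)

grid = np.linspace(0.0, 1.2, 1201)
F = np.array([(samples < R).mean() for R in grid])      # finite-n stand-in for (4.3)
for eps in (0.2, 0.4, 0.6, 0.8):
    attained = grid[F <= eps]
    print(eps, attained.max() if attained.size else 0.0)
# For eps < 1/2 the largest such R is ~ 0; for eps > 1/2 it is ~ 1 bit.
```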

References

C. E. Shannon, "A mathematical theory of communication," Bell System Technical Journal, 1948.
T. M. Cover and J. A. Thomas, Elements of Information Theory.
R. M. Gray, Entropy and Information Theory.
T. M. Cover, "Broadcast channels," IEEE Transactions on Information Theory, 1972.
L. Le Cam, Asymptotic Methods in Statistical Decision Theory.