Z. Wahrscheinlichkeitstheorie verw. Gebiete
57, 453-476 (1981)

Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete

© Springer-Verlag 1981

On the Histogram as a Density Estimator: L2 Theory

David Freedman 1,* and Persi Diaconis 2,**

1 Statistics Department, University of California, Berkeley, CA 94720, USA
2 Statistics Department, Stanford University, Stanford, CA 94305, USA
1. Introduction
Let f be a probability density on an interval I, finite or infinite: I includes its finite endpoints, if any; and f vanishes outside of I. Let X_1, ..., X_k be independent random variables, with common density f. The empirical histogram for the X's is often used to estimate f. To define this object, choose a reference point x_0 ∈ I and a cell width h. Let N_j be the number of X's falling in the j-th class interval:

[x_0 + jh, x_0 + (j+1)h).

On this interval the height of the histogram H(x) is defined as

N_j / (kh).

This definition forces the area under H to be 1. The dependence of H on k and h is suppressed in the notation.
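The estimator can be sketched directly from this definition. A minimal illustration, assuming numpy; `histogram_estimate` and its argument names are choices made here, not from the paper:

```python
import numpy as np

def histogram_estimate(xs, x0, h):
    """Empirical histogram for the X's: on the j-th class interval
    [x0 + j*h, x0 + (j+1)*h), the height is N_j / (k*h)."""
    xs = np.asarray(xs, dtype=float)
    k = len(xs)
    j = np.floor((xs - x0) / h).astype(int)   # class-interval index of each X_i
    counts = dict(zip(*np.unique(j, return_counts=True)))

    def H(x):
        jx = int(np.floor((x - x0) / h))
        return counts.get(jx, 0) / (k * h)    # N_j / (k h); zero on empty cells

    return H
```

Summing height times width over the occupied cells recovers total area 1, as the definition forces.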
On the average, how close does H come to f? A standard measure of
discrepancy is the mean square difference:
(1.1)  δ² = E{ ∫_I [H(x) - f(x)]² dx }.

This quantity is analyzed on the following assumptions:

(1.2)  f ∈ L² and f is absolutely continuous on I, with a.e. derivative f';
(1.3)  f' ∈ L² and f' is absolutely continuous on I, with a.e. derivative f'';
(1.4)  f'' ∈ L^p for some p with 1 ≤ p ≤ 2.
* Research partially supported by NSF Grant MCS-80-02535
** Research partially supported by NSF Grant MCS-80-24649
0044- 3719/81/0057/0453/$04.80

454 D. Freedman and P. Diaconis
Conditions (1.3) and (1.4) have the (non-obvious) consequence that f' is continuous and vanishes at ∞. In particular, f' is bounded; see (2.21) below. Also, f' is in fact the ordinary (everywhere) derivative of f. Likewise, f is continuous and vanishes at ∞. It will also be assumed that

(1.5)  I is the union of class intervals.

For instance, if I = [0, 1] and x_0 = 0, condition (1.5) requires that h = 1/N for some positive integer N. By present conditions, if I = [0, 1], then f and f' are continuous on I, even at 0 and 1.
(1.6) Theorem. Assume (1.1-1.5). Let

γ = ∫_I f'(x)² dx > 0,
β = (3/4)^{2/3} γ^{1/3},
α = 6^{1/3} γ^{-1/3}.

Then the cell width h which minimizes the δ² of (1.1) is α k^{-1/3} + O(k^{-1/2}), and at such h's,

δ² = β k^{-2/3} + O(k^{-1}).
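The constants α and β can be read off heuristically from the leading terms of the decomposition in (1.10) together with the h²/12 bias approximation of (2.7); a sketch of the calculus:

```latex
% Leading-order discrepancy (sampling error plus bias):
%   \delta^2(h) \approx \frac{1}{kh} + \frac{\gamma}{12}\,h^2 .
% Setting the derivative in h to zero:
-\frac{1}{kh^2} + \frac{\gamma h}{6} = 0
\;\Longrightarrow\;
h^* = \Bigl(\frac{6}{\gamma k}\Bigr)^{1/3}
    = 6^{1/3}\gamma^{-1/3}\,k^{-1/3} = \alpha\,k^{-1/3},
% and substituting h^* back in:
\delta^2(h^*) = \Bigl(6^{-1/3} + \tfrac{1}{12}\,6^{2/3}\Bigr)\gamma^{1/3}k^{-2/3}
             = \tfrac{3}{2}\,6^{-1/3}\,\gamma^{1/3}\,k^{-2/3}
             = \bigl(\tfrac{3}{4}\bigr)^{2/3}\gamma^{1/3}\,k^{-2/3} = \beta\,k^{-2/3}.
```

The theorem sharpens this heuristic by controlling the error terms.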
The technique developed to prove (1.6) can be used to give a result under weaker conditions.

(1.7) Theorem. Suppose f ∈ L² is absolutely continuous with a.e. derivative f' ∈ L², and ∫ f'(x)² dx > 0. Suppose (1.5) as well. Define α and β as in (1.6). Then the cell width which minimizes the δ² of (1.1) is

α k^{-1/3} + o(k^{-1/3}),

and at such h's,

δ² = β k^{-2/3} + o(k^{-2/3}).
Such results suggest that the discrepancy δ² can be made small by choosing the cell width h as α k^{-1/3}. Of course, this depends on γ, which will be unknown in general cases. In principle, γ can be estimated from the data, as in Woodroofe (1968). However, numerical computations, which will be reported elsewhere, suggest that the following simple, robust rule for choosing the cell width h often gives quite reasonable results.
(1.8) Rule: Choose the cell width as twice the interquartile range of the data,
divided by the cube root of the sample size.
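Rule (1.8) is straightforward to implement; a sketch assuming numpy (the function name is an illustrative choice):

```python
import numpy as np

def rule_cell_width(xs):
    """Rule (1.8): cell width = 2 * IQR(data) / k^(1/3)."""
    xs = np.asarray(xs, dtype=float)
    q75, q25 = np.percentile(xs, [75, 25])   # interquartile range endpoints
    return 2.0 * (q75 - q25) / len(xs) ** (1.0 / 3.0)
```

This is the rule now commonly called the Freedman-Diaconis rule; numpy, for instance, exposes a variant of it via `np.histogram_bin_edges(data, bins='fd')`.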
Rough versions of (1.6) and (1.7) seem part of the folklore. Two recent references providing formal computations are Tapia and Thompson (1978), and Scott (1979).

We hope to study the random variable Δ² = ∫_I [H(x) - f(x)]² dx in a future paper. The standard deviation of Δ² is of smaller order than E(Δ²) = δ². Thus, choosing h to minimize δ² is a sensible way to get a small Δ². To be a bit more precise, the standard deviation of Δ² is of order k^{-1} h^{-1/2} ≈ k^{-5/6} for the optimal h ≈ k^{-1/3}. Using (1.6), the minimal discrepancy Δ² is of order k^{-2/3}, give or take a nearly normal random variable of the smaller order k^{-5/6}.
The histogram may be considered a very old-fashioned way of estimating densities. However, histograms are easy to draw; and, unlike kernel estimators, are very widely used in applied work. Mathematical aspects of density estimation are surveyed by Rosenblatt (1971), Cover (1972), Wegman (1972), Tarter and Kronmal (1976), Fryer (1977), Wertz and Schneider (1979), and references listed therein. These papers report a great deal of careful work on discrepancy at a point, and on global results for kernel estimates and other
"generalized" histograms. The results show that the mean square error of kernel estimates tends to zero like a constant times k^{-4/5}, while (1.6) implies that the mean square error of histograms tends to zero like a constant times k^{-2/3}. Asymptotically, this rate is worse, a fact which seems to have stopped further work on the mathematics of histograms. However, for finite sample sizes, the constants determine everything. For example, take k = 500: then k^{-4/5} ≈ 0.007 while k^{-2/3} ≈ 0.016. The asymptotic rate of k^{-4/5} can be achieved using another old-fashioned object: the frequency polygon. This is provable with the techniques of this paper.
Before describing our results more carefully, it is helpful to separate the
discrepancy (1.1) into sampling error and bias components. To this end, let
(1.9)  f_h(x) = (1/h) ∫_{x_0+nh}^{x_0+(n+1)h} f(u) du   for   x_0 + nh ≤ x < x_0 + (n+1)h.
(1.10) Proposition. Suppose f ∈ L², and (1.5). Then

E{ ∫_I [H(x) - f(x)]² dx } = 1/(kh) - (1/k) ∫_I f_h(x)² dx + ∫_I [f_h(x) - f(x)]² dx.

Proof. Suppose x_0 + nh ≤ x < x_0 + (n+1)h. Then H(x) = N_n/(kh), and N_n is binomial with number of trials k and success probability p_{nh} = h f_h(x). In particular,

E{H(x)} = f_h(x),

and

Var{H(x)} = (1/(kh)) f_h(x) [1 - h f_h(x)].

Now integrate in x over I. □
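The decomposition in (1.10) can be checked by simulation. A sketch assuming numpy, using the test density f(x) = 2x on I = [0, 1] with x_0 = 0 and h = 1/4 (the density, sample sizes, and names are illustrative choices, not from the paper); on the j-th cell f_h = (2j+1)h, and since ∫(f_h - f) vanishes cell by cell, ∫(H - f)² can be computed exactly for each sample:

```python
import numpy as np

rng = np.random.default_rng(0)
k, h, reps = 200, 0.25, 1000
cells = 4
fh = (2 * np.arange(cells) + 1) * h          # cell averages of f(x) = 2x
int_fh2 = h * np.sum(fh ** 2)                # integral of f_h^2 (= 21/16)
int_bias = 4.0 / 3.0 - int_fh2               # integral of (f_h - f)^2 = int f^2 - int f_h^2

mc = 0.0
for _ in range(reps):
    xs = np.sqrt(rng.random(k))              # inverse-CDF sample from f(x) = 2x
    idx = np.minimum((xs / h).astype(int), cells - 1)
    Nj = np.bincount(idx, minlength=cells)   # cell counts N_j
    H = Nj / (k * h)                         # histogram heights
    # Exact integral of (H - f)^2 for this sample: the cross term vanishes,
    # leaving sum_j h*(H_j - f_h,j)^2 plus the deterministic bias integral.
    mc += h * np.sum((H - fh) ** 2) + int_bias
mc /= reps

theory = 1.0 / (k * h) - int_fh2 / k + int_bias   # right side of (1.10)
```

With these choices the Monte Carlo average agrees with the right side of (1.10) to within sampling noise.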
The term ∫ (f_h - f)² in (1.10) represents the bias in using discrete intervals of width h. Reducing h diminishes this bias, at the expense of increasing the sampling error term 1/(kh), for the number of observations per cell will decrease as h gets smaller. The tension between these two is resolved by (1.6) and (1.7).
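This tension can be previewed numerically. A sketch assuming numpy, taking the standard normal density as an illustrative f, so that γ = ∫ f'(x)² dx = 1/(4√π) (computed below rather than assumed):

```python
import numpy as np

# Leading-order discrepancy: delta^2(h) ~ 1/(k h) + (gamma/12) h^2,
# sampling error plus bias, with gamma = integral of f'(x)^2.
k = 500
xs = np.linspace(-8.0, 8.0, 100001)
dx = xs[1] - xs[0]
phi = np.exp(-xs**2 / 2.0) / np.sqrt(2.0 * np.pi)   # standard normal density
dphi = -xs * phi                                     # its derivative
gamma = np.sum(dphi**2) * dx                         # ~ 1/(4 sqrt(pi))

hs = np.linspace(0.05, 1.5, 2000)
delta2 = 1.0 / (k * hs) + (gamma / 12.0) * hs**2
h_grid = hs[np.argmin(delta2)]                       # grid minimizer
h_theory = (6.0 / (gamma * k)) ** (1.0 / 3.0)        # alpha * k^{-1/3} from (1.6)
```

The grid minimizer matches α k^{-1/3} up to the grid resolution: shrinking h past that point buys less bias than it costs in sampling error.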
Section 2 of this paper is about the bias term ∫ (f_h - f)²; Sect. 3 gives examples to show what happens when regularity conditions like (1.3) and (1.4) are relaxed. In particular, (1.7) fails for some beta and chi-squared densities. Section 4 gives the proof of (1.6) and (1.7). Clearly, the uniform density requires special treatment, since the optimal number of class intervals is one. This density is excluded by the condition that ∫ f'² > 0, which surfaces in Lemma (4.5) of Sect. 4.

2. The Bias Term

To begin with, assume only that

(2.1)  f is an L² function on the interval I.

Define f_h by (1.9). Let J be a union of class intervals. Clearly,

(2.2)  ∫_J f_h(x) dx = ∫_J f(x) dx;
(2.3)  ∫_J f_h(x)² dx ≤ ∫_J f(x)² dx;
(2.4)  the f_h are square integrable uniformly in h.

Also, f_h converges to f in L²:

(2.5)  ∫_I (f_h - f)² → 0   as h → 0.

For the proof of (2.5), approximate f in L² by a continuous function with compact support. Estimates on the rate of convergence in (2.5) will be helpful. For this, additional assumptions are needed. One such is:

(2.6)  f is an L² function on the interval I, and f is absolutely continuous with a.e. derivative f', and f' ∈ L².

Under (2.6), the bias term on the left of (2.5) tends to zero like h². More precisely:

(2.7) Proposition. Suppose (2.6) and (1.5). Let

r(h) = ∫_I [f_h(x) - f(x)]² dx - (1/12) h² ∫_I f'(x)² dx.

Then r(h) = o(h²).
Proof. To ease the notation, write g for f', and set x_0 = 0. Focus on a specific class interval, for instance, [0, h]. Clearly,

f(x) = a + ∫_0^x g(u) du,

where a = f(0). In computing ∫ (f_h - f)², the constant a will cancel, so it is harmless to set a = 0. Of course,

∫_0^h (f_h - f)² = ∫_0^h f² - h f_h².

In what follows, u ∨ v = max(u, v) and u ∧ v = min(u, v). Because a = 0,

∫_0^h f² = ∫_0^h ∫_0^x ∫_0^x g(u) g(v) du dv dx
         = ∫_0^h ∫_0^h ∫_{u∨v}^h g(u) g(v) dx du dv
         = ∫_0^h ∫_0^h (h - u ∨ v) g(u) g(v) du dv.

Likewise,

f_h = (1/h) ∫_0^h (h - u) g(u) du,

so

h f_h² = (1/h) ∫_0^h ∫_0^h (h - u)(h - v) g(u) g(v) du dv,

and

∫_0^h (f_h - f)² = ∫_0^h ∫_0^h φ_h(u, v) g(u) g(v) du dv,

where

φ_h(u, v) = (h - u ∨ v) - (1/h)(h - u)(h - v)
          = (u + v) - (u ∨ v) - (1/h) uv
          = u ∧ v - (1/h) uv.

This defines φ_h as a function on 0 ≤ u, v ≤ h. Note that φ_h(u, 0) = φ_h(u, h) = φ_h(0, v) = φ_h(h, v) = 0. Define φ_h on the whole plane by periodic continuation.
Let

δ_{nh}(g) = ∫_{nh}^{(n+1)h} ∫_{nh}^{(n+1)h} φ_h(u, v) g(u) g(v) du dv - (1/12) h² ∫_{nh}^{(n+1)h} g(u)² du.

The argument thus far shows that

r(h) = Σ_n δ_{nh}(g).

It will now be shown that

(1/h²) Σ_n δ_{nh}(g) → 0   as h → 0.

If g is constant on [nh, (n+1)h], a direct computation shows that δ_{nh}(g) = 0.
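The direct computation for constant g amounts to ∫_0^h ∫_0^h φ_h(u, v) du dv = h³/12, which a midpoint-rule check confirms; a sketch assuming numpy, with h and the grid size chosen arbitrarily here:

```python
import numpy as np

# phi_h(u, v) = min(u, v) - u*v/h on the square [0, h]^2.
h, n = 0.5, 1000
u = (np.arange(n) + 0.5) * h / n         # midpoint grid on (0, h)
U, V = np.meshgrid(u, u)
phi = np.minimum(U, V) - U * V / h
integral = phi.sum() * (h / n) ** 2      # double integral over [0, h]^2
# For g constant (= c) on the cell, delta_nh(g) = c^2 * (integral - h^3/12) = 0.
```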
But g may be approximated closely in L² by a function g_0 which is constant on each class interval: for instance, apply (2.5) to g. It remains to show that

Σ_n δ_{nh}(g) - Σ_n δ_{nh}(g_0)
