Z. Wahrscheinlichkeitstheorie verw. Gebiete
57, 453-476 (1981)

Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete

© Springer-Verlag 1981

On the Histogram as a Density Estimator: L2 Theory

David Freedman 1,* and Persi Diaconis 2,**

1 Statistics Department, University of California, Berkeley, CA 94720, USA
2 Statistics Department, Stanford University, Stanford, CA 94305, USA
1. Introduction
Let f be a probability density on an interval I, finite or infinite: I includes its finite endpoints, if any; and f vanishes outside of I. Let X_1, ..., X_k be independent random variables, with common density f. The empirical histogram for the X's is often used to estimate f. To define this object, choose a reference point x_0 ∈ I and a cell width h. Let N_j be the number of X's falling in the j-th class interval:

[x_0 + jh, x_0 + (j+1)h).

On this interval the height of the histogram H(x) is defined as

N_j / (kh).

This definition forces the area under H to be 1. The dependence of H on k and h is suppressed in the notation.
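The estimator can be sketched directly from this definition. A minimal illustration, assuming numpy; `histogram_estimate` and its argument names are choices made here, not from the paper:

```python
import numpy as np

def histogram_estimate(xs, x0, h):
    """Empirical histogram for the X's: on the j-th class interval
    [x0 + j*h, x0 + (j+1)*h), the height is N_j / (k*h)."""
    xs = np.asarray(xs, dtype=float)
    k = len(xs)
    j = np.floor((xs - x0) / h).astype(int)   # class-interval index of each X_i
    counts = dict(zip(*np.unique(j, return_counts=True)))

    def H(x):
        jx = int(np.floor((x - x0) / h))
        return counts.get(jx, 0) / (k * h)    # N_j / (k h); zero on empty cells

    return H
```

Summing height times width over the occupied cells recovers total area 1, as the definition forces.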
On the average, how close does H come to f? A standard measure of
discrepancy is the mean square difference:
(1.1)  δ² = E{ ∫_I [H(x) - f(x)]² dx }.

This quantity is analyzed on the following assumptions:

(1.2)  f ∈ L² and f is absolutely continuous on I, with a.e. derivative f';
(1.3)  f' ∈ L² and f' is absolutely continuous on I, with a.e. derivative f'';
(1.4)  f'' ∈ L^p for some p with 1 ≤ p ≤ 2.
* Research partially supported by NSF Grant MCS-80-02535
** Research partially supported by NSF Grant MCS-80-24649
0044- 3719/81/0057/0453/$04.80

454 D. Freedman and P. Diaconis
Conditions (1.3) and (1.4) have the (non-obvious) consequence that f' is continuous and vanishes at ∞. In particular, f' is bounded; see (2.21) below. Also, f' is in fact the ordinary (everywhere) derivative of f. Likewise, f is continuous and vanishes at ∞. It will also be assumed that

(1.5)  I is the union of class intervals.

For instance, if I = [0, 1] and x_0 = 0, condition (1.5) requires that h = 1/N for some positive integer N. By present conditions, if I = [0, 1], then f and f' are continuous on I, even at 0 and 1.
(1.6) Theorem. Assume (1.1-1.5). Let

γ = ∫_I f'(x)² dx > 0,
β = (3/4)^{2/3} γ^{1/3},
α = 6^{1/3} γ^{-1/3}.

Then the cell width h which minimizes the δ² of (1.1) is α k^{-1/3} + O(k^{-1/2}), and at such h's,

δ² = β k^{-2/3} + O(k^{-1}).
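The constants α and β can be read off heuristically from the leading terms of the decomposition in (1.10) together with the h²/12 bias approximation of (2.7); a sketch of the calculus:

```latex
% Leading-order discrepancy (sampling error plus bias):
%   \delta^2(h) \approx \frac{1}{kh} + \frac{\gamma}{12}\,h^2 .
% Setting the derivative in h to zero:
-\frac{1}{kh^2} + \frac{\gamma h}{6} = 0
\;\Longrightarrow\;
h^* = \Bigl(\frac{6}{\gamma k}\Bigr)^{1/3}
    = 6^{1/3}\gamma^{-1/3}\,k^{-1/3} = \alpha\,k^{-1/3},
% and substituting h^* back in:
\delta^2(h^*) = \Bigl(6^{-1/3} + \tfrac{1}{12}\,6^{2/3}\Bigr)\gamma^{1/3}k^{-2/3}
             = \tfrac{3}{2}\,6^{-1/3}\,\gamma^{1/3}\,k^{-2/3}
             = \bigl(\tfrac{3}{4}\bigr)^{2/3}\gamma^{1/3}\,k^{-2/3} = \beta\,k^{-2/3}.
```

The theorem sharpens this heuristic by controlling the error terms.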
The technique developed to prove (1.6) can be used to give a result under weaker conditions.

(1.7) Theorem. Suppose f ∈ L² is absolutely continuous with a.e. derivative f' ∈ L², and ∫ f'(x)² dx > 0. Suppose (1.5) as well. Define α and β as in (1.6). Then the cell width which minimizes the δ² of (1.1) is

α k^{-1/3} + o(k^{-1/3}),

and at such h's,

δ² = β k^{-2/3} + o(k^{-2/3}).
Such results suggest that the discrepancy δ² can be made small by choosing the cell width h as α k^{-1/3}. Of course, this depends on γ, which will be unknown in general cases. In principle, γ can be estimated from the data, as in Woodroofe (1968). However, numerical computations, which will be reported elsewhere, suggest that the following simple, robust rule for choosing the cell width h often gives quite reasonable results.
(1.8) Rule: Choose the cell width as twice the interquartile range of the data,
divided by the cube root of the sample size.
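Rule (1.8) is straightforward to implement; a sketch assuming numpy (the function name is an illustrative choice):

```python
import numpy as np

def rule_cell_width(xs):
    """Rule (1.8): cell width = 2 * IQR(data) / k^(1/3)."""
    xs = np.asarray(xs, dtype=float)
    q75, q25 = np.percentile(xs, [75, 25])   # interquartile range endpoints
    return 2.0 * (q75 - q25) / len(xs) ** (1.0 / 3.0)
```

This is the rule now commonly called the Freedman-Diaconis rule; numpy, for instance, exposes a variant of it via `np.histogram_bin_edges(data, bins='fd')`.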
Rough versions of (1.6) and (1.7) seem part of the folklore. Two recent references providing formal computations are Tapia and Thompson (1978), and Scott (1979).

We hope to study the random variable Δ² = ∫_I [H(x) - f(x)]² dx in a future paper. The standard deviation of Δ² is of smaller order than E(Δ²) = δ². Thus, choosing h to minimize δ² is a sensible way to get a small Δ². To be a bit more precise, the standard deviation of Δ² is of order k^{-1} h^{-1/2} ≈ k^{-5/6} for the optimal h ≈ k^{-1/3}. Using (1.6), the minimal discrepancy Δ² is of order k^{-2/3}, give or take a nearly normal random variable of the smaller order k^{-5/6}.
The histogram may be considered a very old-fashioned way of estimating densities. However, histograms are easy to draw; and, unlike kernel estimators, are very widely used in applied work. Mathematical aspects of density estimation are surveyed by Rosenblatt (1971), Cover (1972), Wegman (1972), Tarter and Kronmal (1976), Fryer (1977), Wertz and Schneider (1979), and references listed therein. These papers report a great deal of careful work on discrepancy at a point, and on global results for kernel estimates and other
"generalized" histograms. The results show that the mean square error of kernel estimates tends to zero like a constant times k^{-4/5}, while (1.6) implies that the mean square error of histograms tends to zero like a constant times k^{-2/3}. Asymptotically, this rate is worse, a fact which seems to have stopped further work on the mathematics of histograms. However, for finite sample sizes, the constants determine everything. For example, take k = 500: then k^{-4/5} ≈ 0.007 while k^{-2/3} ≈ 0.016. The asymptotic rate of k^{-4/5} can be achieved using another old-fashioned object: the frequency polygon. This is provable with the techniques of this paper.
Before describing our results more carefully, it is helpful to separate the
discrepancy (1.1) into sampling error and bias components. To this end, let
(1.9)  f_h(x) = (1/h) ∫_{x_0+nh}^{x_0+(n+1)h} f(u) du   for   x_0 + nh ≤ x < x_0 + (n+1)h.
(1.10) Proposition. Suppose f ∈ L², and (1.5). Then

E{ ∫_I [H(x) - f(x)]² dx } = 1/(kh) - (1/k) ∫_I f_h(x)² dx + ∫_I [f_h(x) - f(x)]² dx.

Proof. Suppose x_0 + nh ≤ x < x_0 + (n+1)h. Then H(x) = N_n/(kh), and N_n is binomial with number of trials k and success probability p_{nh} = h f_h(x). In particular,

E{H(x)} = f_h(x),

and

Var{H(x)} = (1/(kh)) f_h(x) [1 - h f_h(x)].

Now integrate in x over I. □
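The decomposition in (1.10) can be checked by simulation. A sketch assuming numpy, using the test density f(x) = 2x on I = [0, 1] with x_0 = 0 and h = 1/4 (the density, sample sizes, and names are illustrative choices, not from the paper); on the j-th cell f_h = (2j+1)h, and since ∫(f_h - f) vanishes cell by cell, ∫(H - f)² can be computed exactly for each sample:

```python
import numpy as np

rng = np.random.default_rng(0)
k, h, reps = 200, 0.25, 1000
cells = 4
fh = (2 * np.arange(cells) + 1) * h          # cell averages of f(x) = 2x
int_fh2 = h * np.sum(fh ** 2)                # integral of f_h^2 (= 21/16)
int_bias = 4.0 / 3.0 - int_fh2               # integral of (f_h - f)^2 = int f^2 - int f_h^2

mc = 0.0
for _ in range(reps):
    xs = np.sqrt(rng.random(k))              # inverse-CDF sample from f(x) = 2x
    idx = np.minimum((xs / h).astype(int), cells - 1)
    Nj = np.bincount(idx, minlength=cells)   # cell counts N_j
    H = Nj / (k * h)                         # histogram heights
    # Exact integral of (H - f)^2 for this sample: the cross term vanishes,
    # leaving sum_j h*(H_j - f_h,j)^2 plus the deterministic bias integral.
    mc += h * np.sum((H - fh) ** 2) + int_bias
mc /= reps

theory = 1.0 / (k * h) - int_fh2 / k + int_bias   # right side of (1.10)
```

With these choices the Monte Carlo average agrees with the right side of (1.10) to within sampling noise.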
The term ∫ (f_h - f)² in (1.10) represents the bias in using discrete intervals of width h. Reducing h diminishes this bias, at the expense of increasing the sampling error term 1/(kh), for the number of observations per cell will decrease as h gets smaller. The tension between these two is resolved by (1.6) and (1.7).
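This tension can be previewed numerically. A sketch assuming numpy, taking the standard normal density as an illustrative f, so that γ = ∫ f'(x)² dx = 1/(4√π) (computed below rather than assumed):

```python
import numpy as np

# Leading-order discrepancy: delta^2(h) ~ 1/(k h) + (gamma/12) h^2,
# sampling error plus bias, with gamma = integral of f'(x)^2.
k = 500
xs = np.linspace(-8.0, 8.0, 100001)
dx = xs[1] - xs[0]
phi = np.exp(-xs**2 / 2.0) / np.sqrt(2.0 * np.pi)   # standard normal density
dphi = -xs * phi                                     # its derivative
gamma = np.sum(dphi**2) * dx                         # ~ 1/(4 sqrt(pi))

hs = np.linspace(0.05, 1.5, 2000)
delta2 = 1.0 / (k * hs) + (gamma / 12.0) * hs**2
h_grid = hs[np.argmin(delta2)]                       # grid minimizer
h_theory = (6.0 / (gamma * k)) ** (1.0 / 3.0)        # alpha * k^{-1/3} from (1.6)
```

The grid minimizer matches α k^{-1/3} up to the grid resolution: shrinking h past that point buys less bias than it costs in sampling error.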
Section 2 of this paper is about the bias term ∫ (f_h - f)²; Sect. 3 gives examples to show what happens when regularity conditions like (1.3) and (1.4) are relaxed. In particular, (1.7) fails for some beta and chi-squared densities. Section 4 gives the proof of (1.6) and (1.7). Clearly, the uniform density requires special treatment, since the optimal number of class intervals is one. This density is excluded by the condition that ∫ f'² > 0, which surfaces in Lemma (4.5) of Sect. 4.

2. The Bias Term

To begin with, assume only that

(2.1)  f is an L² function on the interval I.

Define f_h by (1.9). Let J be a union of class intervals. Clearly,

(2.2)  ∫_J f_h(x) dx = ∫_J f(x) dx;
(2.3)  ∫_J f_h(x)² dx ≤ ∫_J f(x)² dx;
(2.4)  the f_h are square integrable uniformly in h.

Also, f_h converges to f in L²:

(2.5)  ∫_I (f_h - f)² → 0   as h → 0.

For the proof of (2.5), approximate f in L² by a continuous function with compact support. Estimates on the rate of convergence in (2.5) will be helpful. For this, additional assumptions are needed. One such is:

(2.6)  f is an L² function on the interval I, and f is absolutely continuous with a.e. derivative f', and f' ∈ L².

Under (2.6), the bias term on the left of (2.5) tends to zero like h². More precisely:

(2.7) Proposition. Suppose (2.6) and (1.5). Let

r(h) = ∫_I [f_h(x) - f(x)]² dx - (1/12) h² ∫_I f'(x)² dx.

Then r(h) = o(h²).
Proof. To ease the notation, write g for f', and set x_0 = 0. Focus on a specific class interval, for instance, [0, h]. Clearly,

f(x) = a + ∫_0^x g(u) du,

where a = f(0). In computing ∫ (f_h - f)², the constant a will cancel, so it is harmless to set a = 0. Of course,

∫_0^h (f_h - f)² = ∫_0^h f² - h f_h².

In what follows, u ∨ v = max(u, v) and u ∧ v = min(u, v). Because a = 0,

∫_0^h f² = ∫_0^h ∫_0^x ∫_0^x g(u) g(v) du dv dx
         = ∫_0^h ∫_0^h ∫_{u∨v}^h g(u) g(v) dx du dv
         = ∫_0^h ∫_0^h (h - u ∨ v) g(u) g(v) du dv.

Likewise,

f_h = (1/h) ∫_0^h (h - u) g(u) du,

so

h f_h² = (1/h) ∫_0^h ∫_0^h (h - u)(h - v) g(u) g(v) du dv,

and

∫_0^h (f_h - f)² = ∫_0^h ∫_0^h φ_h(u, v) g(u) g(v) du dv,

where

φ_h(u, v) = (h - u ∨ v) - (1/h)(h - u)(h - v)
          = (u + v) - (u ∨ v) - (1/h) uv
          = u ∧ v - (1/h) uv.

This defines φ_h as a function on 0 ≤ u, v ≤ h. Note that φ_h(u, 0) = φ_h(u, h) = φ_h(0, v) = φ_h(h, v) = 0. Define φ_h on the whole plane by periodic continuation.
Let

δ_{nh}(g) = ∫_{nh}^{(n+1)h} ∫_{nh}^{(n+1)h} φ_h(u, v) g(u) g(v) du dv - (1/12) h² ∫_{nh}^{(n+1)h} g(u)² du.

The argument thus far shows that

r(h) = Σ_n δ_{nh}(g).

It will now be shown that

(1/h²) Σ_n δ_{nh}(g) → 0   as h → 0.

If g is constant on [nh, (n+1)h], a direct computation shows that δ_{nh}(g) = 0.
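The direct computation for constant g amounts to ∫_0^h ∫_0^h φ_h(u, v) du dv = h³/12, which a midpoint-rule check confirms; a sketch assuming numpy, with h and the grid size chosen arbitrarily here:

```python
import numpy as np

# phi_h(u, v) = min(u, v) - u*v/h on the square [0, h]^2.
h, n = 0.5, 1000
u = (np.arange(n) + 0.5) * h / n         # midpoint grid on (0, h)
U, V = np.meshgrid(u, u)
phi = np.minimum(U, V) - U * V / h
integral = phi.sum() * (h / n) ** 2      # double integral over [0, h]^2
# For g constant (= c) on the cell, delta_nh(g) = c^2 * (integral - h^3/12) = 0.
```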
But g may be approximated closely in L² by a function g_0 which is constant on each class interval: for instance, apply (2.5) to g. It remains to show that

Σ_n δ_{nh}(g) - Σ_n δ_{nh}(g_0)
