Journal ArticleDOI

# Ideal spatial adaptation by wavelet shrinkage

01 Sep 1994-Biometrika (Oxford University Press)-Vol. 81, Iss: 3, pp 425-455
TL;DR: In this article, the authors develop a spatially adaptive method, RiskShrink, which works by shrinkage of empirical wavelet coefficients, and achieves performance within a factor $\log^2 n$ of the ideal performance of piecewise polynomial and variable-knot spline methods.

• The authors are particularly interested in a variety of spatially adaptive methods which have been proposed in the statistical literature, such as CART (Breiman, Friedman, Olshen and Stone, 1983), Turbo (Friedman and Silverman, 1989), MARS (Friedman, 1991), and variable-bandwidth kernel methods (Müller and Stadtmüller, 1987).
• Informal conversations with Leo Breiman and Jerome Friedman have confirmed this assumption.
• The authors now describe a simple framework which encompasses the most important spatially adaptive methods, and allows us to develop their main theme efficiently.
• The reconstruction formula is $T_{PC}(y; \lambda)(t) = \sum_{\ell=1}^{L} \mathrm{Ave}(y_i : t_i \in I_\ell)\,1_{I_\ell}(t)$: piecewise constant reconstruction using the mean of the data within each piece to estimate the pieces.
• The kernel method $T_{K,2}$ equipped with the variable bandwidth selector described in Brockmann, Gasser and Herrmann (1992) results in the "Heidelberg" variable bandwidth smoothing method.
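The piecewise-constant reconstruction $T_{PC}$ above is easy to sketch in code. The following numpy implementation (function and variable names are ours, not from the paper) averages the data within each piece of a given partition of $[0, 1]$:

```python
import numpy as np

def t_pc(y, t, cuts):
    # T_PC(y; lambda): piecewise-constant fit, using the mean of the data
    # whose design points fall in each piece of the partition of [0, 1]
    # defined by the interior cut points `cuts`.
    edges = np.concatenate(([0.0], np.asarray(cuts, dtype=float), [1.0]))
    fhat = np.empty(len(y), dtype=float)
    for a, b in zip(edges[:-1], edges[1:]):
        # half-open pieces [a, b), with the last piece closed at 1
        mask = (t >= a) & ((t < b) | (b == 1.0))
        fhat[mask] = y[mask].mean()
    return fhat

t = np.arange(1, 9) / 8.0
y = np.where(t < 0.5, 1.0, 3.0)
print(t_pc(y, t, [0.5]))   # each piece is replaced by its own mean
```

A noiseless step function whose jump sits on a cut point is reproduced exactly, which is the sense in which the partition is the "right" smoothing parameter for such an $f$.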

### 1.2 Ideal Adaptation with Oracles

• To avoid messy questions, the authors abandon the study of specific $\lambda$-selectors and instead study ideal adaptation.
• For us, ideal adaptation is the performance which can be achieved from smoothing with the aid of an oracle.
• The risk of ideally adaptive piecewise polynomial fits is essentially $\sigma^2 L(D+1)/n$.
• Indeed, an oracle could supply the information that one should use $I_1, \ldots, I_L$ rather than some other partition.
• No better performance than this can be expected, since $n^{-1}$ is the usual "parametric rate" for estimating finite-dimensional parameters.
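The $\sigma^2 L(D+1)/n$ figure follows from classical least-squares theory and can be checked by simulation. The sketch below is our construction, not from the paper: it fits degree-$D$ polynomials on the two true pieces of a broken-line $f$, the partition playing the role of the oracle's answer:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma, D = 512, 1.0, 1
t = np.arange(1, n + 1) / n
f = np.where(t < 0.5, 2.0 + t, -1.0 + 3.0 * t)   # piecewise linear, L = 2 pieces

def oracle_pp_fit(y):
    # Least-squares polynomial of degree D on each piece of the TRUE
    # partition -- exactly the information an oracle would supply.
    fhat = np.empty_like(y)
    for mask in (t < 0.5, t >= 0.5):
        coef = np.polyfit(t[mask], y[mask], D)
        fhat[mask] = np.polyval(coef, t[mask])
    return fhat

risks = [np.mean((oracle_pp_fit(f + sigma * rng.standard_normal(n)) - f) ** 2)
         for _ in range(400)]
print(np.mean(risks), sigma**2 * 2 * (D + 1) / n)   # both close to 4/512
```

The Monte Carlo average matches $\sigma^2 L(D+1)/n$ because, with the partition known, the fit is an ordinary linear least-squares problem with $L(D+1)$ parameters.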

### 1.3 Selective Wavelet Reconstruction as a Spatially Adaptive Method

• A new principle for spatially adaptive estimation can be based on recently developed "wavelets" ideas.
• This version yields an exactly orthogonal transformation between data and wavelet coefficient domains.
• This approximation improves with increasing $n$ and increasing $j_1$.
• For their purposes, the only details the authors need are [W1].
• Figure 1 displays four functions (Bumps, Blocks, HeaviSine and Doppler) which have been chosen because they caricature spatially variable functions arising in imaging, spectroscopy and other scientific signal processing.

### 1.4 Near-Ideal Spatial Adaptation by Wavelets

• Of course, calculations of ideal risk which point to the benefit of ideal spatial adaptation prompt the question: can such performance be approached using the data alone?
• The benefit of the wavelet framework is that the authors can answer such questions precisely.
• The result, while slightly noisier than the ideal estimate, is still of good quality, and requires no oracle.

### 1.5 Universality of Wavelets as a Spatially Adaptive Procedure

• This last calculation is not essentially limited to piecewise polynomials; something like it holds for all $f$.
• We interpret this result as saying that selective wavelet reconstruction is essentially as powerful as variable-partition piecewise constant fits, variable-knot least-squares splines, or piecewise polynomial fits.
• The authors know of no proof that existing procedures for fitting piecewise polynomials and variable-knot splines, such as those current in the statistical literature, can attain anything like the performance of ideal methods.
• And wavelet selection with an oracle offers the advantages of other spatially-variable methods.
• The main assertion of this paper is therefore that, from this perspective, it is cleaner and more elegant to abandon the ideal of fitting piecewise polynomials with optimal partitions, and turn instead to RiskShrink, about which the authors have results, and an order-$O(n)$ algorithm.

### 1.6 Contents

• Section 2 discusses the problem of mimicking ideal wavelet selection; Section 3 shows why wavelet selection offers the same advantages as piecewise polynomial fits; Section 4 discusses variations and relations to other work.
• Related manuscripts by the authors, currently under publication review, are available as LaTeX files by anonymous ftp from playfair.

### 2.1 Oracles for Diagonal Linear Projection

• Consider the following problem from multivariate normal decision theory.
• Suppose the authors had available an oracle which would supply the coefficients optimal for use in the diagonal projection scheme.
• Motivated by the idea that only very few wavelet coefficients contribute signal, the authors consider threshold rules, which retain only observed data that exceed a multiple of the noise level in magnitude.
• The authors give the result here and outline the approach in Section 2.4.
• However it is worth mentioning that a more traditional hard threshold estimator (11) exhibits the same asymptotic performance.
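Both threshold rules mentioned here have simple closed forms. A minimal numpy sketch (function names are ours):

```python
import numpy as np

def soft_threshold(w, lam):
    # eta_S(w) = sign(w) (|w| - lam)_+ : kill small coefficients, shrink the rest
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def hard_threshold(w, lam):
    # eta_H(w) = w 1{|w| > lam} : keep-or-kill, survivors pass through untouched
    return w * (np.abs(w) > lam)

w = np.array([-3.0, -0.5, 0.2, 1.8])
print(soft_threshold(w, 1.0))   # survivors are pulled toward zero by lam
print(hard_threshold(w, 1.0))   # survivors are kept at their observed values
```

Soft thresholding biases the retained coefficients toward zero by $\lambda$; hard thresholding does not, which is why the two rules need slightly different threshold sequences for the same asymptotic performance.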

• The authors now apply the preceding results to function estimation.
• Let $n = 2^{J+1}$, and let $W$ denote the wavelet transform mentioned in Section 1.3.
• Now let $(y_i)$ be data as in model (1) and let $w = W y$ be the discrete wavelet transform.
• Hence, the authors have achieved, by very simple means, essentially the best spatial adaptation possible via wavelets.
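The pipeline just described (transform, threshold, invert) can be sketched end to end. The snippet below is an illustrative stand-in, not the authors' RiskShrink implementation: it uses an orthonormal Haar matrix in place of the paper's Symmlet transform, and the threshold $\sigma\sqrt{2\log n}$:

```python
import numpy as np

def haar_matrix(n):
    # Orthonormal Haar transform matrix (n a power of 2); an illustrative
    # stand-in for the paper's Symmlet wavelet transform matrix W.
    if n == 1:
        return np.array([[1.0]])
    h = haar_matrix(n // 2)
    return np.vstack([np.kron(h, [1.0, 1.0]),
                      np.kron(np.eye(n // 2), [1.0, -1.0])]) / np.sqrt(2.0)

def wavelet_shrink(y, sigma, lam):
    W = haar_matrix(len(y))
    w = W @ y                                                    # w = W y
    eta = np.sign(w) * np.maximum(np.abs(w) - lam * sigma, 0.0)  # soft threshold
    return W.T @ eta                                             # invert: W^T eta(w)

rng = np.random.default_rng(0)
n, sigma = 256, 1.0
t = np.arange(1, n + 1) / n
f = np.where(t < 0.5, 0.0, 8.0)                 # a Blocks-style step function
y = f + sigma * rng.standard_normal(n)
fhat = wavelet_shrink(y, sigma, np.sqrt(2 * np.log(n)))
print(np.mean((y - f) ** 2), np.mean((fhat - f) ** 2))   # shrinkage cuts the error
```

Because a step function has only a handful of nonzero Haar coefficients, nearly all coordinates are pure noise and are zeroed by the threshold, which is the mechanism behind the near-ideal risk.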

### 2.3 Implementation

• The authors have developed a computer software package which runs in the numerical computing environment Matlab.
• The name RiskShrink for the estimator emphasises that shrinkage of wavelet coefficients is performed by soft thresholding, and that a mean-squared-error, or "risk", approach has been taken to specify the threshold.
• The rationale behind this rule is as follows.
• Hence, those coefficients (a fixed number, independent of $n$) should not be shrunk towards zero.
• Let $\widetilde{SW}$ denote the selective wavelet reconstruction in which the levels below $j_0$ are never shrunk.
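Leaving the coarse levels below $j_0$ unshrunk is a small indexing exercise once a coefficient layout is fixed. A sketch under the assumption that coefficients are stored as $[w_{-1,0}, w_{0,0}, w_{1,0}, w_{1,1}, \ldots]$, so that level $j$ occupies indices $2^j$ through $2^{j+1}-1$ (the layout and function name are ours):

```python
import numpy as np

def shrink_above_j0(w, lam_sigma, j0):
    # Levels below j0 (the first 2**j0 coefficients in this layout) pass
    # through unshrunk; all finer-level coefficients are soft-thresholded.
    out = w.astype(float).copy()
    k = 2 ** j0
    out[k:] = np.sign(w[k:]) * np.maximum(np.abs(w[k:]) - lam_sigma, 0.0)
    return out

w = np.array([10.0, 5.0, 1.0, -1.0, 0.5, -2.0, 0.3, 0.1])
print(shrink_above_j0(w, 0.4, 1))   # first 2 coefficients are kept exactly
```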

### 3 Piecewise Polynomials are not more powerful than Wavelets

• The authors now show that wavelet selection using an oracle can closely mimic piecewise polynomial fitting using an oracle.
• Hence for every function, wavelets supplied with an oracle have an ideal risk that differs by at most a logarithmic factor from the ideal risk of the piecewise polynomial estimate.
• Since variable-knot splines of order $D$ are piecewise polynomials of order $D$, the authors also have $R_{n,\lambda}(SW, f) \le (C_1 + C_2 J)\,R_{n,\lambda}(\mathrm{Spl}(D), f)$ (25). Note that the constants are not necessarily the same at each appearance: see the proof below.
• Suppose that this optimal partition contains L elements.

### 4.1 Variations on Choice of Oracle

• There is an oracle inequality for diagonal shrinkage also.
• (ii) More generally, the asymptotic inequality (28) continues to hold for soft threshold sequences $(\lambda_n)$ and hard threshold estimators with threshold sequences satisfying, respectively, conditions (29) and (30), which require $\lambda_n^2 - 2\log n$ to be bounded below on the order of $\log\log n$ and above by $o(\log n)$. (iii) Theorem 3 continues to hold, a fortiori, if the denominator $\sigma^2 + \sum_i \min(\theta_i^2, \sigma^2)$ is replaced by the smaller ideal risk for diagonal shrinkage.
• So oracles for diagonal shrinkage can be mimicked to within a factor $2\log n$ and not more closely.
• These results are carried over to adaptive wavelet shrinkage just as in Section 2.2, by defining wavelet shrinkage in this case by the analog of (18), $T_{WS} = W^T \circ T_{DS} \circ W$. Corollary 1 extends immediately to this case.
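The diagonal-shrinkage oracle is easy to tabulate: the ideal coordinatewise multiplier $c_i = \theta_i^2/(\theta_i^2+\sigma^2)$ gives per-coordinate risk $\theta_i^2\sigma^2/(\theta_i^2+\sigma^2)$, which never exceeds the projection oracle's $\min(\theta_i^2, \sigma^2)$, so the shrinkage oracle is the more powerful of the two. A small check (names ours):

```python
import numpy as np

def ideal_shrink_risk(theta, sigma):
    # Ideal diagonal shrinkage uses c_i = theta_i^2 / (theta_i^2 + sigma^2),
    # attaining per-coordinate risk theta_i^2 sigma^2 / (theta_i^2 + sigma^2).
    return np.sum(theta**2 * sigma**2 / (theta**2 + sigma**2))

def ideal_projection_risk(theta, sigma):
    # Ideal diagonal projection keeps coordinate i iff theta_i^2 > sigma^2,
    # attaining per-coordinate risk min(theta_i^2, sigma^2).
    return np.sum(np.minimum(theta**2, sigma**2))

theta = np.array([0.0, 1.0, 3.0])
print(ideal_shrink_risk(theta, 1.0), ideal_projection_risk(theta, 1.0))
```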

### 4.2 Variations on Choice of Threshold

• In Theorem 1 the authors have studied $\lambda_n$, the minimax threshold for the soft threshold nonlinearity, with comparison to a projection oracle.
• A drawback of using optimal thresholds is that the threshold which is precisely optimal for one of the four combinations (of threshold nonlinearity and oracle) may not be even asymptotically optimal for another of the four combinations.
• If a sample that in the noiseless case ought to be zero is in the noisy case nonzero, and that character is preserved in the reconstruction, the reconstruction will have an annoying visual appearance: it will contain small blips against an otherwise clean background.
• Not only is the method better in visual quality than RiskShrink; the asymptotic risk bounds are no worse: $R(\tilde f_{\nu_n}, f) \le (2\log n + 1)\{\sigma^2/n + R_{n,\lambda}(\widetilde{SW}, f)\}$. This estimator is discussed further in their report [asymp.tex].
• In their experience, the empirical wavelet coefficients at the finest scale are, with a small fraction of exceptions, essentially pure noise.

### 4.4 Numerical measures of fit

• Table 2 contains the average (over location) squared error of the various estimates for their four test functions, for the noise realisation and the reconstructions shown in Figures 2-10.
• It is apparent that the ideal wavelet reconstruction dominates ideal Fourier, and that the genuine estimate using soft thresholding at $\lambda_n$ comes well within the factor 6.824 of the ideal error predicted for n = 2048 by Table 1.
• It has uniformly worse squared error than thresholding at $\lambda_n$, which reflects the well-known divergence between the usual numerical and visual assessments of quality of fit.
• Table 3 shows the results of a very small simulation comparison of the same four techniques as the sample size is varied dyadically from n = 256 through 8192, using 10 replications in each case.
• The same features noted in Table 2 extend to the other sample sizes.

• The estimator proposed here has a number of optimality properties in minimax decision theory.
• RiskShrink is adaptive in the sense that it achieves, within a logarithmic factor, the best risk bounds that could be had if the class were known; and the logarithmic factor is necessary when the class is unknown, by work of Brown and Low (1993) and Lepskii (1990).
• Other near-minimax properties are described in detail in their report [asymp.tex].

### 4.6 Boundary correction

• As described in the Introduction, Cohen, Daubechies, Jawerth and Vial (1993) have introduced separate "boundary filters" to correct the non-orthogonality on $[0,1]$ of the restriction to $[0,1]$ of basis functions that intersect $[0,1]^c$.
• Thus, the transform may be represented as $W = U \cdot P$, where $U$ is the orthogonal transformation built from the quadrature mirror filters and their boundary versions via the cascade algorithm.
• Thus all the ideal risk inequalities in the paper remain valid, with only an additional dependence of the constants on the boundary correction.
• In particular, the conclusions concerning logarithmic mimicking of oracles are unchanged.

### 4.7 Relation to Model Selection

• The authors' results show that the method gives almost the same performance in mean-squared error as one could attain if one knew in advance which model provided the minimum mean-squared error.
• The authors' results apply equally well in orthogonal regression.
• George and Foster (1990) have proved two results about model selection which it is interesting to compare with their Theorem 4.
• The authors' results here differ because the authors attempt to mimic more powerful oracles, which attain optimal mean-squared errors.
• The authors are also most grateful to Carl Taswell, who carried out the simulations reported in Table 3.

### 5.5 Theorem 3

• The main idea is to make the estimand a random variable, with prior distribution chosen so that a randomly selected subset of about $\log n$ coordinates are each of size roughly $(2\log n)^{1/2}$, and to derive information from the Bayes risk of such a prior.
• Let $\tilde\delta_n$ denote the Bayes rule for this prior with respect to the loss $\tilde L_n$.

### 5.6 Theorems 4 and 6

• The authors give a proof that covers both soft and hard thresholding, and both DP and DS oracles.
• The expansion (23) shows that this range includes $\lambda_n$ and hence $\hat\lambda$.


##### Figures (3)


David L. Donoho
Iain M. Johnstone
Department of Statistics, Stanford University, Stanford, CA, 94305-4065, U.S.A.
June 1992; revised April 1993.

#### Abstract

With ideal spatial adaptation, an oracle furnishes information about how best to adapt a spatially variable estimator, whether piecewise constant, piecewise polynomial, variable knot spline, or variable bandwidth kernel, to the unknown function. Estimation with the aid of an oracle offers dramatic advantages over traditional linear estimation; it is a priori unclear whether such performance can be obtained by a procedure relying on the data alone. We describe a new principle for spatially-adaptive estimation: selective wavelet reconstruction. We show that variable-knot spline fits and piecewise-polynomial fits, when equipped with an oracle to select the knots, are not dramatically more powerful than selective wavelet reconstruction with an oracle. We develop a practical spatially adaptive method, RiskShrink, which works by shrinkage of empirical wavelet coefficients. RiskShrink mimics the performance of an oracle for selective wavelet reconstruction as well as it is possible to do so. A new inequality in multivariate normal decision theory, which we call the oracle inequality, shows that attained performance differs from ideal performance by at most a factor $\approx 2\log n$, where $n$ is the sample size. Moreover, no estimator can give a better guarantee than this. Within the class of spatially adaptive procedures, RiskShrink is essentially optimal. Relying only on the data, it comes within a factor $\log^2 n$ of the performance of piecewise polynomial and variable-knot spline methods equipped with an oracle. In contrast, it is unknown how or if piecewise polynomial methods could be made to function this well when denied access to an oracle and forced to rely on data alone.

*Keywords:* Minimax estimation subject to doing well at a point; Orthogonal Wavelet Bases of Compact Support; Piecewise-Polynomial Fitting; Variable-Knot Spline.

### 1 Introduction

Suppose we are given data

$$y_i = f(t_i) + e_i, \qquad i = 1, \ldots, n, \qquad (1)$$

$t_i = i/n$, where the $e_i$ are independently distributed as $N(0, \sigma^2)$, and $f(\cdot)$ is an unknown function which we would like to recover. We measure performance of an estimate $\hat f(\cdot)$ in terms of quadratic loss at the sample points. In detail, let $f = (f(t_i))_{i=1}^n$ and $\hat f = (\hat f(t_i))_{i=1}^n$ denote the vectors of true and estimated sample values, respectively. Let $\|v\|_{2,n}^2 = \sum_{i=1}^n v_i^2$ denote the usual squared $\ell_n^2$ norm; we measure performance by the risk

$$R(\hat f, f) = n^{-1}\, E\, \|\hat f - f\|_{2,n}^2,$$

which we would like to make as small as possible. Although the notation $f$ suggests a function of a real variable $t$, in this paper we work only with the equally spaced sample points $t_i$.

We are particularly interested in a variety of spatially adaptive methods which have been proposed in the statistical literature, such as CART (Breiman, Friedman, Olshen and Stone, 1983), Turbo (Friedman and Silverman, 1989), MARS (Friedman, 1991), and variable-bandwidth kernel methods (Müller and Stadtmüller, 1987).

Such methods have presumably been introduced because they were expected to do a better job in recovery of the functions actually occurring with real data than do traditional methods based on a fixed spatial scale, such as Fourier series methods, fixed-bandwidth kernel methods, and linear spline smoothers. Informal conversations with Leo Breiman and Jerome Friedman have confirmed this assumption.

We now describe a simple framework which encompasses the most important spatially adaptive methods, and allows us to develop our main theme efficiently. We consider estimates $\hat f$ defined as

$$\hat f(\cdot) = T(y; \hat\lambda(y))(\cdot) \qquad (2)$$

where $T(y; \lambda)$ is a *reconstruction formula* with "spatial smoothing" parameter $\lambda$, and $\hat\lambda(y)$ is a data-adaptive choice of the spatial smoothing parameter $\lambda$. A clearer picture of what we intend emerges from five examples.

[1]. Piecewise Constant Reconstruction $T_{PC}(y; \lambda)$. Here $\lambda$ is a finite list of, say, $L$ real numbers defining a partition $(I_1, \ldots, I_L)$ of $[0,1]$ via $I_1 = [0, \lambda_1),\ I_2 = [\lambda_1, \lambda_1 + \lambda_2),\ \ldots,\ I_L = [\lambda_1 + \cdots + \lambda_{L-1},\ \lambda_1 + \cdots + \lambda_L]$, so that $\sum_1^L \lambda_i = 1$. Note that $L$ is a variable. The reconstruction formula is

$$T_{PC}(y; \lambda)(t) = \sum_{\ell=1}^{L} \mathrm{Ave}(y_i : t_i \in I_\ell)\, 1_{I_\ell}(t),$$

piecewise constant reconstruction using the mean of the data within each piece to estimate the pieces.

[2]. Piecewise Polynomials $T_{PP(D)}(y; \lambda)$. Here the interpretation of $\lambda$ is the same as in [1], only the reconstruction uses polynomials of degree $D$:

$$T_{PP(D)}(y; \lambda)(t) = \sum_{\ell=1}^{L} \hat p_\ell(t)\, 1_{I_\ell}(t),$$

where $\hat p_\ell(t) = \sum_{k=0}^{D} a_k t^k$ is determined by applying the least squares principle to the data arising for interval $I_\ell$:

$$\sum_{t_i \in I_\ell} (\hat p_\ell(t_i) - y_i)^2 = \min!$$

[3]. Variable-Knot Splines $T_{\mathrm{Spl},D}(y; \lambda)$. Here $\lambda$ defines a partition as above, and on each interval of the partition the reconstruction formula is a polynomial of degree $D$, but now the reconstruction must be continuous and have continuous derivatives through order $D-1$. In detail, let $\tau_\ell$ be the left endpoint of $I_\ell$, $\ell = 1, \ldots, L$. The reconstruction is chosen from among those piecewise polynomials $s(t)$ satisfying

$$\left(\frac{d}{dt}\right)^{k} s(\tau_\ell-) = \left(\frac{d}{dt}\right)^{k} s(\tau_\ell+), \qquad k = 0, \ldots, D-1, \quad \ell = 2, \ldots, L;$$

subject to this constraint, one solves

$$\sum_{i=1}^{n} (s(t_i) - y_i)^2 = \min!$$

[4]. Variable Bandwidth Kernel Methods $T_{VK,2}(y; \lambda)$. Now $\lambda$ is a *function* $h(t)$ on $[0,1]$; $h(t)$ represents the "bandwidth of the kernel at $t$"; the smoothing kernel $K$ is a $C^2$ function of compact support which is also a probability density, and if $\hat f = T_{VK,2}(y; h)$ then

$$\hat f(t) = \frac{1}{n} \sum_{i=1}^{n} y_i\, K\!\left(\frac{t - t_i}{h(t)}\right) \Big/ h(t). \qquad (3)$$

More refined versions of this formula would adjust $K$ for boundary effects near $t = 0$ and $t = 1$.

[5]. Variable-Bandwidth High-Order Kernels $T_{VK,D}(y; h)$, $D > 2$. Here $h$ is again the local bandwidth, and the reconstruction formula is as in (3), only $K(\cdot)$ is a $C^D$ function integrating to 1, with vanishing intermediate moments

$$\int t^j K(t)\, dt = 0, \qquad j = 1, \ldots, D-1.$$

As $D > 2$, $K(\cdot)$ cannot be nonnegative.

These reconstruction techniques, when equipped with appropriate selectors of the spatial smoothing parameter $\lambda$, duplicate essential features of certain well-known methods.

[1] The piecewise constant reconstruction formula $T_{PC}$, equipped with choice of partition by recursive partitioning and cross-validatory choice of "pruning constant" as described by Breiman, Friedman, Olshen and Stone (1983), results in the method CART applied to 1-dimensional data.

[2] The spline reconstruction formula $T_{\mathrm{Spl},D}$, equipped with a backwards deletion scheme, models the methods of Friedman and Silverman (1989) and Friedman (1991) applied to 1-dimensional data.

[3] The kernel method $T_{K,2}$ equipped with the variable bandwidth selector described in Brockmann, Gasser and Herrmann (1992) results in the "Heidelberg" variable bandwidth smoothing method. Compare also Terrell and Scott (1992).

These schemes are computationally feasible and intuitively appealing. However, very little is known about the theoretical performance of these adaptive schemes, at the level of uniformity in $f$ and $n$ that we would like.

### 1.2 Ideal Adaptation with Oracles

To avoid messy questions, we abandon the study of specific $\lambda$-selectors and instead study *ideal* adaptation. For us, ideal adaptation is the performance which can be achieved from smoothing with the aid of an *oracle*. Such an oracle will not tell us $f$, but will tell us, for our method $T(y; \lambda)$, the "best" choice of $\lambda$ for the true underlying $f$. The oracle's response is conceptually a selection $\Lambda(f)$ which satisfies

$$R(T(y; \Lambda(f)), f) = R_{n,\lambda}(T, f)$$

where $R_{n,\lambda}$ denotes the *ideal risk*

$$R_{n,\lambda}(T, f) = \inf_{\lambda} R(T(y; \lambda), f).$$

As $R_{n,\lambda}$ measures performance with a selection $\Lambda(f)$ based on full knowledge of $f$ rather than a data-dependent selection $\hat\lambda(y)$, it represents an ideal we cannot expect to attain. Nevertheless it is the target we shall consider.

Ideal adaptation can offer dramatic advantages over traditional nonadaptive linear smoothers. Consider the case of a function $f$ which is a piecewise polynomial of degree $D$, with a finite number of pieces $I_1, \ldots, I_L$, say:

$$f = \sum_{\ell=1}^{L} p_\ell(t)\, 1_{I_\ell}(t). \qquad (4)$$

Assume that $f$ has discontinuities at some of the break-points $\tau_2, \ldots, \tau_L$.

The risk of ideally adaptive piecewise polynomial fits is essentially $\sigma^2 L(D+1)/n$. Indeed, an oracle could supply the information that one should use $I_1, \ldots, I_L$ rather than some other partition. Traditional least-squares theory says that, for data from the traditional linear model $Y = X\beta + E$, with noise $E_i$ independently distributed as $N(0, \sigma^2)$, the least-squares estimator $\hat\beta$ satisfies

$$E\, \|X\beta - X\hat\beta\|_2^2 = (\text{number of parameters in } \beta) \cdot (\text{variance of noise}).$$

Applying this to our setting, fitting a function of the form (4) requires fitting $(\#\text{ pieces}) \cdot (\text{degree} + 1)$ parameters, so for the risk $R(\hat f, f) = n^{-1} E\,\|\hat f - f\|_{2,n}^2$ we get $L(D+1)\sigma^2/n$.

On the other hand, the risk of a spatially non-adaptive procedure is far worse. Consider kernel smoothing. Because $f$ has discontinuities, no kernel smoother with fixed, non-spatially-varying bandwidth attains a risk $R(\hat f, f)$ tending to zero faster than $C n^{-1/2}$, $C = C(f, \text{kernel})$. The same result holds for estimates in orthogonal series of polynomials or sinusoids, for smoothing splines with knots at the sample points, and for least squares smoothing splines with knots equispaced.

Most strikingly, even for piecewise polynomial fits with equal-width pieces, we have that $R(\hat f, f)$ is of size $n^{-1/2}$ unless the breakpoints of $f$ form a subset of the breakpoints of $\hat f$. But this can happen only for very special $n$, so in any event

$$\limsup_{n \to \infty} R(\hat f, f)\, n^{1/2} \ge C > 0.$$

In short, oracles offer an improvement, ideally, from risk of order $n^{-1/2}$ to order $n^{-1}$. No better performance than this can be expected, since $n^{-1}$ is the usual "parametric rate" for estimating finite-dimensional parameters.

Can we approach this ideal performance with estimators using the data alone?

### 1.3 Selective Wavelet Reconstruction as a Spatially Adaptive Method

A new principle for spatially adaptive estimation can be based on recently developed "wavelets" ideas. Introductions, historical accounts and references to much recent work may be found in the books by Daubechies (1992), Meyer (1990), Chui (1992) and Frazier, Jawerth and Weiss (1991). Orthonormal bases of compactly supported wavelets provide a powerful complement to traditional Fourier methods: they permit an analysis of a signal or image into localised oscillating components. In a statistical regression context, this spatially varying decomposition can be used to build algorithms that adapt their effective "window width" to the amount of local oscillation in the data. Since the decomposition is in terms of an orthogonal basis, analytic study in closed form is possible.

For the purposes of this paper, we discuss a *finite, discrete, wavelet transform*. This transform, along with a careful treatment of boundary correction, has been described by Cohen, Daubechies, Jawerth, and Vial (1993), with related work in Meyer (1991) and Malgouyres (1991). To focus attention on our main themes, we employ a simpler *periodised* version of the finite discrete wavelet transform in the main exposition. This version yields an *exactly* orthogonal transformation between data and wavelet coefficient domains. Brief comments on the minor changes needed for the boundary corrected version are made in Section 4.6.

Suppose we have data $y = (y_i)_{i=1}^n$, with $n = 2^{J+1}$. For various combinations of parameters $M$ (number of vanishing moments), $S$ (support width), and $j_0$ (low-resolution cutoff), one may construct an $n$-by-$n$ orthogonal matrix $W$, the finite wavelet transform matrix. Actually there are many such matrices, depending on special filters: in addition to the original Daubechies wavelets there are the Coiflets and Symmlets of Daubechies (1993). For the figures in this paper we use the Symmlet with parameter $N = 8$. This has $M = 7$ vanishing moments and support length $S = 15$.

This matrix yields a vector $w$ of the *wavelet coefficients* of $y$ via

$$w = W y,$$

and because the matrix is orthogonal we have the inversion formula $y = W^T w$. The vector $w$ has $n = 2^{J+1}$ elements; we index $n - 1 = 2^{J+1} - 1$ of the elements following the scheme

$$w_{j,k}: \quad j = 0, \ldots, J; \quad k = 0, \ldots, 2^j - 1,$$

and the remaining element we label $w_{-1,0}$. To interpret these coefficients let $W_{jk}$ denote the $(j,k)$-th row of $W$. The inversion formula $y = W^T w$ becomes

$$y_i = \sum_{j,k} w_{j,k}\, W_{jk}(i),$$

expressing $y$ as a sum of basis elements $W_{jk}$ with coefficients $w_{j,k}$. We call the $W_{jk}$ *wavelets*.

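The exact orthogonality and the inversion formula above can be checked numerically with a small orthonormal wavelet matrix; for brevity we use the Haar filter here rather than the Symmlet used for the paper's figures:

```python
import numpy as np

def haar_matrix(n):
    # Orthonormal periodised Haar analogue of the transform matrix W
    # (n a power of 2); built recursively level by level.
    if n == 1:
        return np.array([[1.0]])
    h = haar_matrix(n // 2)
    return np.vstack([np.kron(h, [1.0, 1.0]),
                      np.kron(np.eye(n // 2), [1.0, -1.0])]) / np.sqrt(2.0)

n = 16                                    # n = 2^(J+1) with J = 3
W = haar_matrix(n)
y = np.sin(2 * np.pi * np.arange(1, n + 1) / n)
w = W @ y                                 # wavelet coefficients  w = W y
assert np.allclose(W @ W.T, np.eye(n))    # exact orthogonality
assert np.allclose(W.T @ w, y)            # inversion formula  y = W^T w
```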
##### Citations
More filters
Journal ArticleDOI
TL;DR: A new method for estimation in linear models called the lasso, which minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant, is proposed.
Abstract: SUMMARY We propose a new method for estimation in linear models. The 'lasso' minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant. Because of the nature of this constraint it tends to produce some coefficients that are exactly 0 and hence gives interpretable models. Our simulation studies suggest that the lasso enjoys some of the favourable properties of both subset selection and ridge regression. It produces interpretable models like subset selection and exhibits the stability of ridge regression. There is also an interesting relationship with recent work in adaptive function estimation by Donoho and Johnstone. The lasso idea is quite general and can be applied in a variety of statistical models: extensions to generalized regression models and tree-based models are briefly described.

40,785 citations

### Cites background or methods from "Ideal spatial adaptation by wavelet..."

• ...from equation (11) we may derive the formula $\hat R\{\tilde\beta(\gamma)\} \approx \tau^2\{p - 2\,\#(j: |\hat\beta_j/\tau| < \gamma) + \sum \max(|\hat\beta_j/\tau|, \gamma)^2\}$ as an approximately unbiased estimate of the risk or mean-square error $E\{\tilde\beta(\gamma) - \beta\}^2$, where $\tilde\beta(\gamma) = \mathrm{sign}(\hat\beta)(|\hat\beta| - \gamma)_+$. Donoho and Johnstone (1994) gave a similar formula in the function estimation setting....

[...]

• ...Donoho and Johnstone (1994) proved that the hard threshold (subset selection) estimator...

[...]

• ...This is called a 'soft threshold' estimator by Donoho and Johnstone (1994); they applied this estimator to the coefficients of a wavelet transform of a function measured with noise....

[...]

• ...Donoho and Johnstone (1994) gave a similar formula in the function gstimation setting....

[...]

• ...Interestingly, this has exactly the same form as the soft shrinkage proposals of Donoho and Johnstone (1994) and Donoho et al....

[...]

Journal ArticleDOI
TL;DR: In comparative timings, the new algorithms are considerably faster than competing methods and can handle large problems and can also deal efficiently with sparse features.
Abstract: We develop fast algorithms for estimation of generalized linear models with convex penalties. The models include linear regression, two-class logistic regression, and multinomial regression problems while the penalties include l(1) (the lasso), l(2) (ridge regression) and mixtures of the two (the elastic net). The algorithms use cyclical coordinate descent, computed along a regularization path. The methods can handle large problems and can also deal efficiently with sparse features. In comparative timings we find that the new algorithms are considerably faster than competing methods.

13,656 citations

### Cites background or methods from "Ideal spatial adaptation by wavelet..."

• ...Simple calculus shows (Donoho and Johnstone 1994) that the coordinate-wise update has the form $\tilde\beta_j \leftarrow S\big(\tfrac{1}{N}\sum_{i=1}^N x_{ij}(y_i - \tilde y_i^{(j)}),\ \lambda\alpha\big) / \big(1 + \lambda(1-\alpha)\big)$ (5), where $\tilde y_i^{(j)} = \tilde\beta_0 + \sum_{\ell \ne j} x_{i\ell}\tilde\beta_\ell$ is the fitted value excluding the contribution from $x_{ij}$, and hence $y_i - \tilde y_i^{(j)}$ the partial residual for...

[...]

• ...We would like to compute the gradient at $\beta_j = \tilde\beta_j$, which only exists if $\tilde\beta_j \ne 0$....

[...]
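The coordinate-wise update this paper describes reduces, for the pure lasso ($\alpha = 1$) with columns scaled so that $\|x_j\|^2/n = 1$, to a single soft-threshold step per coordinate. A minimal sketch (function names are ours, not the glmnet API):

```python
import numpy as np

def soft(z, gam):
    # soft-threshold operator S(z, gam) = sign(z)(|z| - gam)_+
    return np.sign(z) * np.maximum(np.abs(z) - gam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    # Cyclical coordinate descent for (1/2n)||y - Xb||^2 + lam ||b||_1,
    # assuming each column of X satisfies ||x_j||^2 / n = 1.
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]     # partial residual excluding j
            b[j] = soft(X[:, j] @ r / n, lam)  # soft-thresholded univariate fit
    return b

X = 2.0 * np.eye(4)                     # columns satisfy ||x_j||^2 / n = 1
b = lasso_cd(X, np.array([4.0, 0.4, -4.0, 0.0]), lam=0.5)
print(b)                                # small coefficients are zeroed out
```

With an orthogonal design the coordinates decouple and the algorithm converges in one sweep, recovering exactly the soft-threshold solution that connects the lasso to wavelet shrinkage.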

Journal ArticleDOI
TL;DR: In this article, a step-by-step guide to wavelet analysis is given, with examples taken from time series of the El Nino-Southern Oscillation (ENSO).
Abstract: A practical step-by-step guide to wavelet analysis is given, with examples taken from time series of the El Nino–Southern Oscillation (ENSO). The guide includes a comparison to the windowed Fourier transform, the choice of an appropriate wavelet basis function, edge effects due to finite-length time series, and the relationship between wavelet scale and Fourier frequency. New statistical significance tests for wavelet power spectra are developed by deriving theoretical wavelet spectra for white and red noise processes and using these to establish significance levels and confidence intervals. It is shown that smoothing in time or scale can be used to increase the confidence of the wavelet spectrum. Empirical formulas are given for the effect of smoothing on significance levels and confidence intervals. Extensions to wavelet analysis such as filtering, the power Hovmoller, cross-wavelet spectra, and coherence are described. The statistical significance tests are used to give a quantitative measure of change...

12,803 citations

### Cites background from "Ideal spatial adaptation by wavelet..."

• ...A more complete description including examples is given in Donoho and Johnstone (1994)....

[...]

Journal ArticleDOI
Jianqing Fan, Runze Li
TL;DR: In this article, penalized likelihood approaches are proposed to handle variable selection problems, and it is shown that the newly proposed estimators perform as well as the oracle procedure in variable selection; namely, they work as well if the correct submodel were known.
Abstract: Variable selection is fundamental to high-dimensional statistical modeling, including nonparametric regression. Many approaches in use are stepwise selection procedures, which can be computationally expensive and ignore stochastic errors in the variable selection process. In this article, penalized likelihood approaches are proposed to handle these kinds of problems. The proposed methods select variables and estimate coefficients simultaneously. Hence they enable us to construct confidence intervals for estimated parameters. The proposed approaches are distinguished from others in that the penalty functions are symmetric, nonconcave on (0, ∞), and have singularities at the origin to produce sparse solutions. Furthermore, the penalty functions should be bounded by a constant to reduce bias and satisfy certain conditions to yield continuous solutions. A new algorithm is proposed for optimizing penalized likelihood functions. The proposed ideas are widely applicable. They are readily applied to a variety of ...

8,314 citations

### Cites background or methods from "Ideal spatial adaptation by wavelet..."

• ...In language similar to Donoho and Johnstone (1994a), the resulting estimator performs as well as the oracle estimator, which knows in advance that $\beta_{20} = 0$....

[...]

• ...Figure 5(a) depicts the Bayes risk as a function of $a$ under the squared loss, for the universal thresholding $\lambda = \sqrt{2\log(d)}$ (see Donoho and Johnstone, 1994a) with $d = 20, 40, 60$, and 100; and Figure 5(b) is for $d = 512$, 1024, 2048, and 4096....

[...]

• ...Figure 5(a) depicts the Bayes risk as a function of $a$ under the squared loss, for the universal thresholding $\lambda = \sqrt{2\log(d)}$ (see Donoho and Johnstone, 1994a) with $d = 20, 40, 60$, and 100; and Figure 5(b) is for $d = 512$, 1024, 2048, and 4096. From Figures 5(a) and 5(b), it can be seen that the Bayes risks are not very sensitive to the values of $a$. It can be seen from Figure 5(a) that the Bayes risks achieve their minimums at $a \approx 3.7$ when the value of $d$ is less than 100. This choice gives pretty good practical performance for various variable selection problems. Indeed, based on the simulations in Section 4.3, the choice of $a = 3.7$ works similarly to that chosen by the generalized cross-validation (GCV) method. We now compare the performance of the four previously stated thresholding rules. Marron, Adak, Johnstone, Neumann, and Patil (1998) applied the tool of risk analysis to understand the small sample behavior of the hard and soft thresholding rules. The closed forms for the $L_2$ risk functions $R(\hat\theta, \theta) = E(\hat\theta - \theta)^2$ were derived under the Gaussian model for hard and soft thresholding rules by Donoho and Johnstone (1994b). The risk function of the SCAD thresholding rule can be found in Li (2000)....

[...]

• ...In wavelet approximations, Donoho and Johnstone (1994a) selected significant subbases (terms in the wavelet expansion) via thresholding....

[...]

• ...Marron, Adak, Johnstone, Neumann, and Patil (1998) applied the tool of risk analysis to understand the small-sample behavior of the hard and soft thresholding rules. The closed forms for the L2 risk functions R(θ̂, θ) = E(θ̂ − θ)² were derived under the Gaussian model Z ∼ N(θ, σ²) for hard and soft thresholding rules by Donoho and Johnstone (1994b). The risk function of the SCAD thresholding rule can be found in Li (2000). To gauge the performance of the four thresholding rules, Figure 5(c) depicts their L2 risk functions under the Gaussian model Z ∼ N(θ, 1)....

[...]
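The universal threshold λ = √(2 log d) quoted in the excerpts above is easy to exercise numerically. The sketch below applies soft and hard thresholding to a sparse mean vector observed in unit-variance Gaussian noise; the signal, sparsity level, and seed are illustrative assumptions, not values from either paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1024
theta = np.zeros(d)
theta[:20] = 5.0                       # sparse mean vector (illustrative)
z = theta + rng.standard_normal(d)     # observations Z_i ~ N(theta_i, 1)

lam = np.sqrt(2 * np.log(d))           # universal threshold

# Soft thresholding shrinks every coordinate toward zero by lam;
# hard thresholding keeps a coordinate unchanged or kills it outright.
soft = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)
hard = z * (np.abs(z) > lam)

# Almost all pure-noise coordinates are set exactly to zero,
# and both losses stay far below the unthresholded loss (about d).
print((soft[20:] != 0).mean())
print(np.sum((soft - theta) ** 2), np.sum((hard - theta) ** 2))
```

With d = 1024 the threshold is about 3.72 standard deviations, so a pure-noise coordinate survives with probability on the order of 2 × 10⁻⁴, which is what makes the reconstruction sparse.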

Journal ArticleDOI

TL;DR: A publicly available algorithm that requires only the same order of magnitude of computational effort as ordinary least squares applied to the full set of covariates is described.
Abstract: The purpose of model selection algorithms such as All Subsets, Forward Selection and Backward Elimination is to choose a linear model on the basis of the same set of data to which the model will be applied. Typically we have available a large collection of possible covariates from which we hope to select a parsimonious set for the efficient prediction of a response variable. Least Angle Regression (LARS), a new model selection algorithm, is a useful and less greedy version of traditional forward selection methods. Three main properties are derived: (1) A simple modification of the LARS algorithm implements the Lasso, an attractive version of ordinary least squares that constrains the sum of the absolute regression coefficients; the LARS modification calculates all possible Lasso estimates for a given problem, using an order of magnitude less computer time than previous methods. (2) A different LARS modification efficiently implements Forward Stagewise linear regression, another promising new model selection method; this connection explains the similar numerical results previously observed for the Lasso and Stagewise, and helps us understand the properties of both methods, which are seen as constrained versions of the simpler LARS algorithm. (3) A simple approximation for the degrees of freedom of a LARS estimate is available, from which we derive a Cp estimate of prediction error; this allows a principled choice among the range of possible LARS estimates. LARS and its variants are computationally efficient: the paper describes a publicly available algorithm that requires only the same order of magnitude of computational effort as ordinary least squares applied to the full set of covariates.

7,828 citations
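The Lasso estimates that the LARS modification computes can be illustrated without the LARS machinery itself. The sketch below is a minimal coordinate-descent Lasso solver (an assumed stand-in for LARS, numpy only, with illustrative data); with orthonormal design columns the Lasso has a known closed form, soft thresholding of Xᵀy, which the solver should reproduce.

```python
import numpy as np

def lasso_cd(X, y, alpha, n_iter=100):
    """Coordinate descent for (1/2)||y - Xb||^2 + alpha * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual with coordinate j removed
            r = y - X @ b + X[:, j] * b[j]
            z = X[:, j] @ r
            # soft-threshold update for coordinate j
            b[j] = np.sign(z) * max(abs(z) - alpha, 0.0) / col_sq[j]
    return b

rng = np.random.default_rng(1)
n, p = 50, 10
X, _ = np.linalg.qr(rng.standard_normal((n, p)))   # orthonormal columns
y = X[:, 0] * 3.0 + 0.1 * rng.standard_normal(n)
alpha = 0.5

b_cd = lasso_cd(X, y, alpha)
z = X.T @ y
b_closed = np.sign(z) * np.maximum(np.abs(z) - alpha, 0.0)  # soft threshold
print(np.allclose(b_cd, b_closed))
```

LARS itself traces the whole solution path in α at roughly the cost of one least-squares fit; the point here is only that the ℓ₁ penalty produces soft-thresholded, sparse coefficients.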

##### References
More filters
Book
01 May 1992
TL;DR: This book introduces the continuous and discrete wavelet transforms, multiresolution analysis, and the construction of orthonormal bases of compactly supported wavelets.
Abstract: Introduction. Preliminaries and notation. The what, why, and how of wavelets. The continuous wavelet transform. Discrete wavelet transforms: frames. Time-frequency density and orthonormal bases. Orthonormal bases of wavelets and multiresolution analysis. Orthonormal bases of compactly supported wavelets. More about the regularity of compactly supported wavelets. Symmetry for compactly supported wavelet bases. Characterization of functional spaces by means of wavelets. Generalizations and tricks for orthonormal wavelet bases. References. Indexes.

16,073 citations

### "Ideal spatial adaptation by wavelet..." refers background in this paper

• ...Introductions, historical accounts and references to much recent work may be found in the books by Daubechies (1992), Meyer (1990), Chui (1992) and Frazier, Jawerth and Weiss (1991)....

[...]

Journal ArticleDOI
TL;DR: The regularity of compactly supported wavelets and the symmetry of wavelet bases are discussed, with a focus on orthonormal bases of wavelets and multiresolution analysis rather than the continuous wavelet transform.
Abstract: Introduction. Preliminaries and notation. The what, why, and how of wavelets. The continuous wavelet transform. Discrete wavelet transforms: frames. Time-frequency density and orthonormal bases. Orthonormal bases of wavelets and multiresolution analysis. Orthonormal bases of compactly supported wavelets. More about the regularity of compactly supported wavelets. Symmetry for compactly supported wavelet bases. Characterization of functional spaces by means of wavelets. Generalizations and tricks for orthonormal wavelet bases. References. Indexes.

14,157 citations

Journal ArticleDOI
TL;DR: This work construct orthonormal bases of compactly supported wavelets, with arbitrarily high regularity, by reviewing the concept of multiresolution analysis as well as several algorithms in vision decomposition and reconstruction.
Abstract: We construct orthonormal bases of compactly supported wavelets, with arbitrarily high regularity. The order of regularity increases linearly with the support width. We start by reviewing the concept of multiresolution analysis as well as several algorithms in vision decomposition and reconstruction. The construction then follows from a synthesis of these different approaches.

8,588 citations

### "Ideal spatial adaptation by wavelet..." refers background in this paper

• ...Daubechies (1988) described a particular construction with S = 2M + 1 for which the smoothness (number of derivatives) of ψ is proportional to M....

[...]

• ...For j and k bounded away from extreme cases by the conditions j₀ ≤ j < J − j₁, S < k < 2^j − S, we have the approximation n^{1/2} W_{jk}(i) ≈ 2^{j/2} ψ(2^j t − k), t = i/n, where ψ is a fixed "wavelet" in the sense of the usual wavelet transform on ℝ (Meyer, 1990; Daubechies, 1988)....

[...]

Journal ArticleDOI
TL;DR: In this article, a new method is presented for flexible regression modeling of high dimensional data, which takes the form of an expansion in product spline basis functions, where the number of basis functions as well as the parameters associated with each one (product degree and knot locations) are automatically determined by the data.
Abstract: A new method is presented for flexible regression modeling of high dimensional data. The model takes the form of an expansion in product spline basis functions, where the number of basis functions as well as the parameters associated with each one (product degree and knot locations) are automatically determined by the data. This procedure is motivated by the recursive partitioning approach to regression and shares its attractive properties. Unlike recursive partitioning, however, this method produces continuous models with continuous derivatives. It has more power and flexibility to model relationships that are nearly additive or involve interactions in at most a few variables. In addition, the model can be represented in a form that separately identifies the additive contributions and those associated with the different multivariable interactions.

6,651 citations

Book
01 Jan 1992
TL;DR: An Overview: From Fourier Analysis to Wavelet Analysis, Multiresolution Analysis, Splines, and Wavelets.
Abstract: An Overview: From Fourier Analysis to Wavelet Analysis. The Integral Wavelet Transform and Time-Frequency Analysis. Inversion Formulas and Duals. Classification of Wavelets. Multiresolution Analysis, Splines, and Wavelets. Wavelet Decompositions and Reconstructions. Fourier Analysis: Fourier and Inverse Fourier Transforms. Continuous-Time Convolution and the Delta Function. Fourier Transform of Square-Integrable Functions. Fourier Series. Basic Convergence Theory and Poisson's Summation Formula. Wavelet Transforms and Time-Frequency Analysis: The Gabor Transform. Short-Time Fourier Transforms and the Uncertainty Principle. The Integral Wavelet Transform. Dyadic Wavelets and Inversions. Frames. Wavelet Series. Cardinal Spline Analysis: Cardinal Spline Spaces. B-Splines and Their Basic Properties. The Two-Scale Relation and an Interpolatory Graphical Display Algorithm. B-Net Representations and Computation of Cardinal Splines. Construction of Spline Approximation Formulas. Construction of Spline Interpolation Formulas. Scaling Functions and Wavelets: Multiresolution Analysis. Scaling Functions with Finite Two-Scale Relations. Direct-Sum Decompositions of L2(R). Wavelets and Their Duals. Linear-Phase Filtering. Compactly Supported Wavelets. Cardinal Spline-Wavelets: Interpolaratory Spline-Wavelets. Compactly Supported Spline-Wavelets. Computation of Cardinal Spline-Wavelets. Euler-Frobenius Polynomials. Error Analysis in Spline-Wavelet Decomposition. Total Positivity, Complete Oscillation, Zero-Crossings. Orthogonal Wavelets and Wavelet Packets: Examples of Orthogonal Wavelets. Identification of Orthogonal Two-Scale Symbols. Construction of Compactly Supported Orthogonal Wavelets. Orthogonal Wavelet Packets. Orthogonal Decomposition of Wavelet Series. Notes. References. Subject Index. Appendix.

3,992 citations

### "Ideal spatial adaptation by wavelet..." refers background in this paper

• ...Introductions, historical accounts and references to much recent work may be found in the books by Daubechies (1992), Meyer (1990), Chui (1992) and Frazier, Jawerth and Weiss (1991)....

[...]

###### Q1. What are the contributions mentioned in the paper "Ideal spatial adaptation by wavelet shrinkage" ?

The authors describe a new principle for spatially-adaptive estimation: selective wavelet reconstruction. The authors show that variable-knot spline fits and piecewise-polynomial fits, when equipped with an oracle to select the knots, are not dramatically more powerful than selective wavelet reconstruction with an oracle. A new inequality in multivariate normal decision theory, which the authors call the oracle inequality, shows that attained performance differs from ideal performance by at most a factor 2 log n, where n is the sample size.

Because f has discontinuities, no kernel smoother with fixed, non-spatially-varying bandwidth attains a risk R(f̂, f) tending to zero faster than Cn^{−1/2}, C = C(f, kernel).

To preserve the important property [W1] of orthogonality to polynomials of degree M, a further 'preconditioning' transformation P of the data y is necessary.

The preconditioning transformation affects only the N = M + 1 left-most and the N right-most elements of y: it has the block-diagonal structure P = diag(P_L | I | P_R).

For various combinations of the parameters M (number of vanishing moments), S (support width), and j₀ (low-resolution cutoff), one may construct an n-by-n orthogonal matrix W, the finite wavelet transform matrix.

A total of four minimax quantities may be defined, by considering the combinations of threshold type (soft, hard) and oracle type (projection, shrinkage).

Figure 1 displays four functions (Bumps, Blocks, HeaviSine and Doppler) which have been chosen because they caricature spatially variable functions arising in imaging, spectroscopy and other scientific signal processing.
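Two of these test functions are easy to reproduce. The sketch below uses the formulas commonly associated with the Donoho–Johnstone test suite for HeaviSine and Doppler; the constants follow the usual convention and are an assumption to be checked against the paper itself.

```python
import numpy as np

n = 2048
t = np.arange(1, n + 1) / n

# HeaviSine: a sinusoid with jumps at t = 0.3 and t = 0.72
heavisine = 4 * np.sin(4 * np.pi * t) - np.sign(t - 0.3) - np.sign(0.72 - t)

# Doppler: an oscillation whose frequency blows up as t -> 0,
# modulated by an envelope sqrt(t(1 - t)) that vanishes at both ends
eps = 0.05
doppler = np.sqrt(t * (1 - t)) * np.sin(2 * np.pi * (1 + eps) / (t + eps))

print(heavisine.shape, doppler.shape)
```

Both signals are spatially inhomogeneous in exactly the way the paper targets: HeaviSine is smooth except at two jump points, while Doppler needs fine resolution near t = 0 and almost none near t = 1.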

However, it is natural and more revealing to look for 'optimal' thresholds λ*_n which yield the smallest possible constant Λ*_n in place of 2 log n + 1 among soft threshold estimators.

This matrix yields a vector w of the wavelet coefficients of y via w = Wy, and because the matrix is orthogonal the authors have the inversion formula y = Wᵀw.
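The orthogonality behind w = Wy and y = Wᵀw can be checked concretely in the simplest case. The sketch below builds a plain Haar transform matrix (an assumption for illustration; the paper's W uses boundary-corrected Daubechies wavelets, not Haar) and verifies perfect reconstruction.

```python
import numpy as np

def haar_matrix(n):
    """Orthonormal Haar wavelet transform matrix for n a power of 2."""
    if n == 1:
        return np.array([[1.0]])
    h = haar_matrix(n // 2)
    # coarse-scale rows: local averages, recursively transformed
    top = np.kron(h, [1.0, 1.0]) / np.sqrt(2)
    # fine-scale rows: local differences
    bot = np.kron(np.eye(n // 2), [1.0, -1.0]) / np.sqrt(2)
    return np.vstack([top, bot])

n = 8
W = haar_matrix(n)
y = np.arange(1.0, n + 1)
w = W @ y                  # wavelet coefficients: w = Wy
y_back = W.T @ w           # inversion via orthogonality: y = W^T w

print(np.allclose(W @ W.T, np.eye(n)), np.allclose(y_back, y))
```

Orthogonality is what makes the whole program work: white noise in y stays white noise in w, so coordinatewise thresholding of w is statistically well founded.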

In their language, they show that one can mimic the "nonzeroness" oracle ζ(δ, θ) = θ² 1{θ ≠ 0} to within Λₙ = 1 + 2 log(n + 1) by hard thresholding with λₙ = (2 log(n + 1))^{1/2}.
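This oracle inequality can be checked by simulation. The Monte Carlo sketch below (numpy only; the sparse mean, noise level, and seed are illustrative assumptions) compares the risk of hard thresholding at λₙ = (2 log(n + 1))^{1/2} with the bound Λₙ(σ² + Σᵢ min(θᵢ², σ²)).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 255
theta = np.zeros(n)
theta[:10] = 3.0                       # sparse mean (illustrative)
sigma = 1.0

lam = np.sqrt(2 * np.log(n + 1))       # threshold lambda_n
Lam = 1 + 2 * np.log(n + 1)            # oracle-inequality constant Lambda_n

# ideal risk: an oracle keeps coordinate i only when theta_i^2 > sigma^2
ideal = np.sum(np.minimum(theta ** 2, sigma ** 2))

reps = 2000
losses = np.empty(reps)
for r in range(reps):
    z = theta + sigma * rng.standard_normal(n)
    est = z * (np.abs(z) > lam * sigma)          # hard thresholding
    losses[r] = np.sum((est - theta) ** 2)

risk = losses.mean()
bound = Lam * (sigma ** 2 + ideal)
print(risk <= bound)
```

In this configuration the empirical risk sits well inside the bound, illustrating that the log n factor in the inequality is a worst-case price, with slack at typical signals.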

In addition, an implementation by G. P. Nason in the S language is available by anonymous ftp from StatLib at lib.stat.cmu.edu; other implementations are also in development.