Ideal spatial adaptation by wavelet shrinkage
Summary (4 min read)
1.1 Spatially Adaptive Methods
- The authors are particularly interested in a variety of spatially adaptive methods which have been proposed in the statistical literature, such as CART (Breiman, Friedman, Olshen and Stone, 1983), Turbo (Friedman and Silverman, 1989), MARS (Friedman, 1991), and variable-bandwidth kernel methods (Müller and Stadtmüller, 1987).
- Informal conversations with Leo Breiman and Jerome Friedman have confirmed this assumption.
- The authors now describe a simple framework which encompasses the most important spatially adaptive methods, and allows us to develop their main theme efficiently.
- The reconstruction formula is T_PC(y; I)(t) = Σ_{ℓ=1}^{L} Ave(y_i : t_i ∈ I_ℓ) · 1_{I_ℓ}(t): a piecewise constant reconstruction using the mean of the data within each piece to estimate that piece. [2]
- The kernel method T_{K,2} equipped with the variable bandwidth selector described in Brockmann, Gasser and Herrmann (1992) results in the "Heidelberg" variable bandwidth smoothing method.
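The piecewise-constant reconstruction in the formula above can be sketched in a few lines. The function name, partition edges, and data below are ours, chosen for illustration; the code simply averages the observations falling in each interval I_ℓ:

```python
import numpy as np

def piecewise_constant_fit(t, y, breakpoints):
    """Fit by the mean of the data within each piece I_l.

    breakpoints: sorted edges [b_0, ..., b_L] defining L intervals.
    Returns the fitted value at each sample point t_i.
    (Illustrative sketch only; not the authors' code.)
    """
    t, y = np.asarray(t, float), np.asarray(y, float)
    fit = np.empty_like(y)
    pieces = list(zip(breakpoints[:-1], breakpoints[1:]))
    for i, (lo, hi) in enumerate(pieces):
        # half-open intervals, except the last piece keeps its right endpoint
        if i == len(pieces) - 1:
            mask = (t >= lo) & (t <= hi)
        else:
            mask = (t >= lo) & (t < hi)
        if mask.any():
            fit[mask] = y[mask].mean()
    return fit
```

With an oracle-chosen partition, this is exactly the ideal T_PC estimate; in practice the partition must itself be selected from the data.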
1.2 Ideal Adaptation with Oracles
- To avoid messy questions, the authors abandon the study of specific selectors and instead study ideal adaptation.
- For us, ideal adaptation is the performance which can be achieved from smoothing with the aid of an oracle.
- The risk of ideally adaptive piecewise polynomial fits is essentially σ²L(D+1)/n.
- Indeed, an oracle could supply the information that one should use I1; : : : ; IL rather than some other partition.
- No better performance than this can be expected, since n⁻¹ is the usual "parametric rate" for estimating finite-dimensional parameters.
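The risk bound above is a parameter count: an oracle supplies the optimal partition, so essentially only estimation variance remains, σ²/n per fitted coefficient. A heuristic sketch (not the paper's proof):

```latex
% L pieces, each fit by a polynomial with D+1 coefficients;
% with the oracle's partition, bias is negligible and each
% coefficient contributes variance sigma^2/n.
R\bigl(\hat f_{\mathrm{ideal}}, f\bigr)
  \;\approx\; \underbrace{L\,(D+1)}_{\#\text{ coefficients}}
  \times \frac{\sigma^2}{n}
  \;=\; \frac{\sigma^2 L (D+1)}{n}.
```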
1.3 Selective Wavelet Reconstruction as a Spatially Adaptive Method
- A new principle for spatially adaptive estimation can be based on recently developed "wavelet" ideas.
- This version yields an exactly orthogonal transformation between data and wavelet coefficient domains.
- This approximation improves with increasing n and increasing j1.
- For their purposes, the only details the authors need are [W1].
- Figure 1 displays four functions, Bumps, Blocks, HeaviSine and Doppler, which have been chosen because they caricature spatially variable functions arising in imaging, spectroscopy and other scientific signal processing.
1.4 Near-Ideal Spatial Adaptation by Wavelets
- Of course, calculations of ideal risk which point to the benefit of ideal spatial adaptation prompt the question of whether such performance can be attained without an oracle.
- The benefit of the wavelet framework is that the authors can answer such questions precisely.
- The result, while slightly noisier than the ideal estimate, is still of good quality, and requires no oracle.
1.5 Universality of Wavelets as a Spatially Adaptive Procedure
- This last calculation is not essentially limited to piecewise polynomials; something like it holds for all f.
- We interpret this result as saying that selective wavelet reconstruction is essentially as powerful as variable-partition piecewise constant fits, variable-knot least-squares splines, or piecewise polynomial fits.
- The authors know of no proof that existing procedures for fitting piecewise polynomials and variable-knot splines, such as those current in the statistical literature, can attain anything like the performance of ideal methods.
- And wavelet selection with an oracle offers the advantages of other spatially-variable methods.
- The main assertion of this paper is therefore that, from this perspective, it is cleaner and more elegant to abandon the ideal of fitting piecewise polynomials with optimal partitions, and turn instead to RiskShrink, about which the authors have results, and an order O(n) algorithm.
1.6 Contents
- Section 2 discusses the problem of mimicking ideal wavelet selection; Section 3 shows why wavelet selection offers the same advantages as piecewise polynomial fits; Section 4 discusses variations and relations to other work.
- Related manuscripts by the authors are currently under publication review and available as LaTeX files by anonymous ftp from playfair.
2.1 Oracles for Diagonal Linear Projection
- Consider the following problem from multivariate normal decision theory.
- Suppose the authors had available an oracle which would supply for us the coefficients optimal for use in the diagonal projection (DP) scheme.
- Motivated by the idea that only very few wavelet coefficients contribute signal, the authors consider threshold rules, which retain only observed data exceeding a multiple of the noise level.
- The authors give the result here and outline the approach in Section 2.4.
- However, it is worth mentioning that a more traditional hard threshold estimator (11) exhibits the same asymptotic performance.
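The two threshold rules compared here, soft shrinkage and the more traditional hard ("keep-or-kill") rule, can be written down directly; the function names below are ours:

```python
import numpy as np

def soft_threshold(w, lam):
    """Soft thresholding: shrink toward zero by lam, kill anything smaller.

    T(w) = sign(w) * max(|w| - lam, 0)
    """
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def hard_threshold(w, lam):
    """Hard thresholding: keep-or-kill.

    T(w) = w * 1{|w| > lam}
    """
    return np.where(np.abs(w) > lam, w, 0.0)
```

Soft thresholding is continuous in the data, which is one reason the authors build RiskShrink on it; hard thresholding keeps retained coefficients unbiased but jumps at the threshold.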
2.2 Adaptive Wavelet Shrinkage
- The authors now apply the preceding results to function estimation.
- Let n = 2^{J+1}, and let W denote the wavelet transform mentioned in Section 1.3.
- Now let (y_i) be data as in model (1) and let w = Wy be the discrete wavelet transform.
- Hence, the authors have achieved, by very simple means, essentially the best spatial adaptation possible via wavelets.
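The transform-threshold-invert recipe of this section (w = Wy, shrink the coefficients, invert with Wᵀ) can be illustrated end to end. In this sketch, a recursively built orthonormal Haar matrix stands in for the paper's wavelet transform W, and the universal threshold σ√(2 log n) stands in for the minimax threshold; all names are ours, and this is not the authors' implementation:

```python
import numpy as np

def haar_matrix(n):
    """Orthonormal Haar wavelet transform matrix, n a power of 2.

    Any orthogonal wavelet matrix would do here; Haar is simply
    the easiest to construct.
    """
    if n == 1:
        return np.array([[1.0]])
    h = haar_matrix(n // 2)
    top = np.kron(h, [1.0, 1.0])                 # coarse averages
    bot = np.kron(np.eye(n // 2), [1.0, -1.0])   # fine-scale details
    return np.vstack([top, bot]) / np.sqrt(2.0)

def shrinkage_sketch(y, sigma):
    """Wavelet shrinkage: transform, soft-threshold, invert."""
    n = len(y)
    W = haar_matrix(n)
    w = W @ y                                    # w = W y
    lam = sigma * np.sqrt(2.0 * np.log(n))       # universal threshold (stand-in)
    w_shr = np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)
    return W.T @ w_shr                           # y_hat = W^T w_shr
```

Because W is orthogonal, white noise in y stays white in w, so thresholding coefficient by coefficient is the natural operation; the inverse transform then assembles the spatially adaptive fit.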
2.3 Implementation
- The authors have developed a computer software package which runs in the numerical computing environment Matlab.
- The name RiskShrink for the estimator emphasises that shrinkage of wavelet coefficients is performed by soft thresholding, and that a mean squared error, or "risk", approach has been taken to specify the threshold.
- The rationale behind this rule is as follows.
- Hence, those coefficients (a fixed number, independent of n) should not be shrunken towards zero.
- Let g_SW denote the selective wavelet reconstruction where the levels below j₀ are never shrunk.
3 Piecewise Polynomials are not more powerful than Wavelets
- The authors now show that wavelet selection using an oracle can closely mimic piecewise polynomial fitting using an oracle.
- Hence for every function, wavelets supplied with an oracle have an ideal risk that differs by at most a logarithmic factor from the ideal risk of the piecewise polynomial estimate.
- Since variable-knot splines of order D are piecewise polynomials of order D, the authors also have R_{n,σ}(SW, f) ≤ (C₁ + C₂J) R_{n,σ}(Spl(D), f) (25). Note that the constants are not necessarily the same at each appearance; see the proof below.
- Suppose that this optimal partition contains L elements.
4.1 Variations on Choice of Oracle
- There is an oracle inequality for diagonal shrinkage also.
- (ii) More generally, the asymptotic inequality (28) continues to hold for soft threshold sequences (λ_n) and hard threshold sequences (ℓ_n) satisfying conditions (29) and (30) respectively, which require λ_n² and ℓ_n² to stay within the stated log log n and o(log n) terms of 2 log n. (iii) Theorem 3 continues to hold, a fortiori, if the denominator is enlarged.
- So oracles for diagonal shrinkage can be mimicked to within a factor 2 log n and not more closely.
- These results are carried over to adaptive wavelet shrinkage just as in Section 2.2 by defining wavelet shrinkage in this case by the analog of (18), T_WS = Wᵀ T_DS W. Corollary 1 extends immediately to this case.
4.2 Variations on Choice of Threshold
- In Theorem 1 the authors have studied λ*_n, the minimax threshold for the soft threshold nonlinearity, with comparison to a projection oracle.
- A drawback of using optimal thresholds is that the threshold which is precisely optimal for one of the four combinations may not be even asymptotically optimal for another of the four combinations.
- If a sample that in the noiseless case ought to be zero is in the noisy case nonzero, and that character is preserved in the reconstruction, the reconstruction will have an annoying visual appearance: it will contain small blips against an otherwise clean background.
- Not only is the method better in visual quality than RiskShrink, the asymptotic risk bounds are no worse: R(f̃_{v_n}; f) ≤ (2 log n + 1){σ²/n + R_{n,σ}(g_SW, f)}. This estimator is discussed further in their report [asymp.tex].
- In their experience, the empirical wavelet coefficients at the finest scale are, with a small fraction of exceptions, essentially pure noise.
4.4 Numerical measures of fit
- Table 2 contains the average (over location) squared error of the various estimates from their four test functions for the noise realisation and the reconstructions shown in Figures 2-10.
- It is apparent that the ideal wavelet reconstruction dominates ideal Fourier and that the genuine estimate using the soft threshold λ*_n comes well within the factor 6.824 of the ideal error predicted for n = 2048 by Table 1.
- It has uniformly worse squared error than λ*_n, which reflects the well-known divergence between the usual numerical and visual assessments of quality of fit.
- Table 3 shows the results of a very small simulation comparison of the same four techniques as sample size is varied dyadically from n = 256 through 8192, and using 10 replications in each case.
- The same features noted in Table 2 extend to the other sample sizes.
4.5 Other Adaptive Properties
- The estimator proposed here has a number of optimality properties in minimax decision theory.
- RiskShrink is adaptive in the sense that it achieves, within a logarithmic factor, the best risk bounds that could be had if the class were known; and the logarithmic factor is necessary when the class is unknown, by work of Brown and Low (1993) and Lepskii (1990).
- Other near-minimax properties are described in detail in their report [asymp.tex].
4.6 Boundary correction
- As described in the Introduction, Cohen, Daubechies, Jawerth and Vial (1993) have introduced separate 'boundary filters' to correct the non-orthogonality on [0, 1] of the restriction to [0, 1] of basis functions that intersect [0, 1]ᶜ.
- Thus, the transform may be represented as W = U·P, where U is the orthogonal transformation built from the quadrature mirror filters and their boundary versions via the cascade algorithm.
- Thus all the ideal risk inequalities in the paper remain valid, with only an additional dependence of the constants on two further parameters.
- In particular, the conclusions concerning logarithmic mimicking of oracles are unchanged.
4.7 Relation to Model Selection
- The authors' results show that the method gives almost the same performance in mean-square error as one could attain if one knew in advance which model provided the minimum mean-square error.
- The authors' results apply equally well in orthogonal regression.
- George and Foster (1990) have proved two results about model selection which it is interesting to compare with their Theorem 4.
- The authors' results here differ because the authors attempt to mimic more powerful oracles, which attain optimal mean-squared errors.
- The authors are also most grateful to Carl Taswell, who carried out the simulations reported in Table 3.
5.5 Theorem 3
- The main idea is to make a random variable, with prior distribution chosen so that a randomly selected subset of about log n coordinates are each of size roughly (2 log n)^{1/2}, and to derive information from the Bayes risk of such a prior.
- Let θ̃_n denote the Bayes rule for this prior with respect to the loss L̃_n.
5.6 Theorems 4 and 6
- The authors give a proof that covers both soft and hard thresholding, and both DP and DS oracles.
- The expansion (23) shows that this range includes λ*_n and hence λ̂.
Frequently Asked Questions (11)
Q2. What is the risk of a piecewise polynomial fit?
Because f has discontinuities, no kernel smoother with fixed, non-spatially-varying bandwidth attains a risk R(f̂; f) tending to zero faster than Cn^{−1/2}, C = C(f, kernel).
Q3. What is the way to preserve the important property of orthogonality to polynomials?
To preserve the important property [W1] of orthogonality to polynomials of degree M , a further `preconditioning' transformation P of the data y is necessary.
Q4. What does the preconditioning transformation affect?
The preconditioning transformation affects only the N = M + 1 left-most and the N right-most elements of y: it has block diagonal structure P = diag(P_L | I | P_R).
Q5. What is the finite wavelet transform matrix?
For various combinations of parameters M (number of vanishing moments), S (support width), and j₀ (low-resolution cutoff), one may construct an n-by-n orthogonal matrix W, the finite wavelet transform matrix.
Q6. How many minimax quantities can be defined?
A total of 4 minimax quantities may be defined, by considering various combinations of threshold type (soft, hard) and oracle type (projection, shrinkage).
Q7. What are the four functions chosen for the figure?
Figure 1 displays four functions, Bumps, Blocks, HeaviSine and Doppler, which have been chosen because they caricature spatially variable functions arising in imaging, spectroscopy and other scientific signal processing.
Q8. What is the way to estimate the thresholds?
However, it is natural and more revealing to look for 'optimal' thresholds λ*_n which yield the smallest possible constant Λ*_n in place of 2 log n + 1 among soft threshold estimators.
Q9. What is the inverse of the finite discrete wavelet transform?
This matrix yields a vector w of the wavelet coefficients of y via w = Wy, and because the matrix is orthogonal the authors have the inversion formula y = Wᵀw.
Q10. How can one mimic the nonzeroness oracle?
In their language, they show that one can mimic the "nonzeroness" oracle Z(·, θ) = σ²·1{θ ≠ 0} to within Λ_n = 1 + 2 log(n + 1) by hard thresholding with λ_n = (2 log(n + 1))^{1/2}.
Q11. Where is an implementation available?
In addition, an implementation by G.P. Nason in the S language is available by anonymous ftp from Statlib at lib.stat.cmu.edu; other implementations are also in development.