
A Tour of Modern Image Filtering: New Insights and Methods, Both Practical and Theoretical

TLDR
A practical and accessible framework is presented to understand some of the basic underpinnings of algorithms in wide use such as block-matching and three-dimensional filtering (BM3D), and methods for their iterative improvement (or nonexistence thereof) are discussed.
Abstract
In this article, the author presents a practical and accessible framework to understand some of the basic underpinnings of these methods, with the intention of leading the reader to a broad understanding of how they interrelate. The author also illustrates connections between these techniques and more classical (empirical) Bayesian approaches. The proposed framework is used to arrive at new insights and methods, both practical and theoretical. In particular, several novel optimality properties of algorithms in wide use such as block-matching and three-dimensional (3-D) filtering (BM3D), and methods for their iterative improvement (or nonexistence thereof) are discussed. A general approach is laid out to enable the performance analysis and subsequent improvement of many existing filtering algorithms. While much of the material discussed is applicable to the wider class of linear degradation models beyond noise (e.g., blur), to keep matters focused, we consider the problem of denoising here.



Recent developments in computational imaging and restoration have heralded the arrival and convergence of several powerful methods for adaptive processing of multidimensional data. Examples include moving least squares (from graphics), the bilateral filter (BF) and anisotropic diffusion (from computer vision), boosting, kernel, and spectral methods (from machine learning), nonlocal means (NLM) and its variants (from signal processing), Bregman iterations (from applied math), and kernel regression and iterative scaling (from statistics). While these approaches found their inspirations in diverse fields of nascence, they are deeply connected.
In this article, I present a practical and accessible framework to understand some of the basic underpinnings of these methods, with the intention of leading the reader to a broad understanding of how they interrelate. I also illustrate connections between these techniques and more classical (empirical) Bayesian approaches.

The proposed framework is used to arrive at new insights and methods, both practical and theoretical. In particular, several novel optimality properties of algorithms in wide use such as block-matching and three-dimensional (3-D) filtering (BM3D), and methods for their iterative improvement (or nonexistence thereof) are discussed.
A general approach is laid out to enable the performance analysis and subsequent improvement of many existing filtering algorithms. While much of the material discussed is applicable to the wider class of linear degradation models beyond noise (e.g., blur), to keep matters focused, we consider the problem of denoising here.

Peyman Milanfar

Digital Object Identifier 10.1109/MSP.2011.2179329
Date of publication: 5 December 2012
INTRODUCTION
Multidimensional filtering is the most fundamental operation in image and video processing, and low-level computer vision. In particular, the most widely used canonical filtering operation is one that removes or attenuates the effect of noise. As such, the basic design and analysis of image filtering operations form a very large part of the image processing literature, with the resulting techniques often quickly spreading to the wider range of restoration and reconstruction problems in imaging. Over the years, many approaches have been tried, but only recently, in the last decade or so, has a great leap forward in performance been realized. While largely unacknowledged in our community, this phenomenal progress has been mostly thanks to the adoption and development of nonparametric point estimation procedures adapted to the local structure of the given multidimensional data. Viewed through the lens of the denoising application, here we develop a general framework for understanding the basic science and engineering behind these techniques and their generalizations. Surely this is not the first article to attempt such an ambitious overview, and it will likely not be the last; but the aim here is to provide a self-contained presentation that distills, generalizes, and puts into proper context many other excellent earlier works such as [1]–[5], and, more recently, [6]. It is fair to say that this article is, by necessity, not completely tutorial. Indeed it does contain several novel results; yet these are largely novel interpretations, formalizations, or generalizations of ideas already known or empirically familiar to the community. Hence, I hope that the enterprising reader will find this article not only a good overview, but, as should be the case with any useful presentation, a source of new insights and food for thought.
So to begin, let us consider the measurement model for the denoising problem

$$y_i = z_i + e_i, \quad \text{for } i = 1, \ldots, n, \qquad (1)$$

where $z_i = z(x_i)$ is the underlying latent signal of interest at a position $x_i = [x_{1,i}, x_{2,i}]^T$, $y_i$ is the noisy measured signal (pixel value), and $e_i$ is zero-mean, white noise with variance $\sigma^2$. We make no other distributional assumptions for the noise. The problem of interest then is to recover the complete set of samples of $z(x)$, which we denote vectorially as $z = [z(x_1), z(x_2), \ldots, z(x_n)]^T$, from the corresponding data set $y$. To restate the problem more concisely, the complete measurement model in vector notation is given by (surely a similar analysis to what follows can and should be carried out for more general inverse problems such as deconvolution, interpolation, etc.)

$$y = z + e. \qquad (2)$$
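To make the setup concrete, here is a minimal NumPy sketch of the model in (1) and (2) on a one-dimensional signal; the sinusoidal signal and noise level are illustrative choices of mine, not taken from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 256
x = np.arange(n)                      # sample positions x_i
z = np.sin(2 * np.pi * x / 64.0)      # latent signal z(x_i) (illustrative choice)
sigma = 0.3                           # noise standard deviation
e = sigma * rng.standard_normal(n)    # zero-mean white noise with variance sigma^2
y = z + e                             # measurement model (2): y = z + e
```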
It has been realized for some time now that effective restoration of signals will require methods that either model the signal a priori (i.e., are Bayesian) or learn the underlying characteristics of the signal from the given data (i.e., learning, nonparametric, or empirical Bayes methods). Most recently, the latter category of approaches has become exceedingly popular. Perhaps the most striking recent example is the popularity of patch-based methods [7]–[10]. This new generation of algorithms exploits both local and nonlocal redundancies or “self-similarities” in the signals being treated. Earlier on, the BF [8] was developed with very much the same idea in mind, as were its spiritually close predecessors: the SUSAN filter [11], normalized convolution [12], and the filters of Yaroslavsky [13]. The common philosophy among these and related techniques is the notion of measuring and making use of affinities between a given data point (or, more generally, patch or region) of interest and others in the given measured signal y. These similarities are then used in a filtering context to give higher weights to contributions from more similar data values, and to properly discount data points that are less similar. The pattern recognition literature has also been a source of parallel ideas. In particular, the celebrated mean-shift algorithm [14], [15] is in principle an iterated version of point-wise regression, as also described in [1] and [2]. In the machine learning community, the general regression problem has been carefully studied, and deep connections between regularization, least-squares regression, and the support vector formalism have also been established [16]–[19].
Despite the voluminous recent literature on techniques based on these ideas, simply put, the key differences between the resulting practical filtering methods have been relatively minor, yet rather poorly understood. In particular, the underlying framework for each of these methods is distinct only to the extent that the weights assigned to different data points are decided upon differently. To be more concrete and mathematically precise, let us consider the denoising problem (2) again. The estimate of the signal $z(x)$ at the position $x$ is found using a (nonparametric) point estimation framework; specifically, the weighted least squares problem

$$\hat{z}(x_j) = \arg\min_{z(x_j)} \sum_{i=1}^{n} [y_i - z(x_j)]^2\, K(x_i, x_j, y_i, y_j), \qquad (3)$$

where the weight (or kernel) function $K(\cdot)$ is a symmetric function with respect to the indices i and j. $K(\cdot)$ is also a positive-valued and unimodal function that measures the “similarity” between the samples $y_i$ and $y_j$ at respective positions $x_i$ and $x_j$. If the kernel function is restricted to be only a function of the spatial locations $x_i$ and $x_j$, then the resulting formulation is what is known as (classical, or not data-adaptive) kernel regression in the nonparametric statistics literature [20], [21]. Perhaps more importantly, the key difference between local and nonlocal patch-based methods lies essentially in the definition of the range of the sum in (3). Specifically, indices covering a small spatial region around a pixel of interest define local methods, and vice versa.
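Because (3) is quadratic in the scalar $z(x_j)$, its minimizer is simply a kernel-weighted average of the data, as derived formally in (9)–(11) below. The following is a minimal NumPy sketch of that estimator, with the similarity kernel left as a plug-in; the function names are mine, not the article's.

```python
import numpy as np

def kernel_denoise(y, x, kernel):
    """Pointwise weighted least squares (3): for each position x_j,
    z_hat(x_j) = sum_i K(x_i, x_j, y_i, y_j) * y_i / sum_i K(x_i, x_j, y_i, y_j)."""
    n = len(y)
    z_hat = np.empty(n)
    for j in range(n):
        # similarity of every sample i to the sample of interest j
        w = np.array([kernel(x[i], x[j], y[i], y[j]) for i in range(n)])
        z_hat[j] = w @ y / w.sum()   # normalized weights sum to one
    return z_hat
```

Restricting the inner loop to a window around j yields the local methods; letting it range over the whole signal yields the nonlocal ones.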
Interestingly, in the early 1980s, the essentially identical concept of moving least squares emerged independently [22], [23] in the graphics community. This idea has since been widely adopted in computer graphics [24] as a very effective tool for smoothing and interpolation of data in three dimensions. Surprisingly, despite the obvious connections between moving least squares and the adaptive filters based on similarity, their kinship has remained largely hidden so far.
EXISTING ALGORITHMS
Over the years, the measure of similarity $K(x_i, x_j, y_i, y_j)$ has been defined in a number of different ways, leading to a cacophony of filters, including some of the most well-known recent approaches to image denoising [7]–[9]. Figure 1 gives a graphical illustration of how different choices of similarity kernels lead to different classes of filters, some of which we discuss next.

[FIG1] Similarity metrics and the resulting filters.
CLASSICAL REGRESSION FILTERS
Naturally, the most naive way to measure the “distance” between two pixels is to simply consider their spatial Euclidean distance; specifically, using a Gaussian kernel,

$$K(x_i, x_j, y_i, y_j) = \exp\left( \frac{-\| x_i - x_j \|^2}{h_x^2} \right).$$

Such filters essentially lead to (possibly space-varying) Gaussian filters, which are quite familiar from traditional image processing [13], [20], [21], [25]. It is possible to adapt the variance (or bandwidth) parameter $h_x$ to the local image statistics, and obtain a relatively modest improvement in performance. But the lack of stronger adaptivity to the underlying structure of the signal of interest is a major drawback of these classical approaches.
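As an illustrative plug-in for the sketch above (one-dimensional positions for simplicity; the bandwidth value is an assumption of mine), the classical kernel depends only on spatial distance:

```python
import numpy as np

def classic_kernel(xi, xj, yi, yj, hx=2.0):
    # classical (non-data-adaptive) kernel: similarity depends only on |x_i - x_j|
    return np.exp(-np.abs(xi - xj) ** 2 / hx ** 2)
```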
THE BILATERAL FILTER
Another manifestation of the formulation in (3) is the BF [8], [13], where the spatial and photometric distances between two pixels are taken into account in separable fashion as follows:

$$K(x_i, x_j, y_i, y_j) = \exp\left( \frac{-\| x_i - x_j \|^2}{h_x^2} \right) \exp\left( \frac{-(y_i - y_j)^2}{h_y^2} \right) = \exp\left\{ \frac{-\| x_i - x_j \|^2}{h_x^2} + \frac{-(y_i - y_j)^2}{h_y^2} \right\}. \qquad (4)$$

As can be observed in the exponent on the right-hand side, and in Figure 1, the similarity metric here is a weighted Euclidean distance between the vectors $(x_i, y_i)$ and $(x_j, y_j)$. This approach has several advantages. Specifically, while the kernel is easy to construct and computationally simple to calculate, it yields useful local adaptivity to the given data. In addition, it has only two control parameters, $(h_x, h_y)$, which make it very convenient to use. However, as is well known, this filter does not provide effective performance in low signal-to-noise scenarios [3].
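A corresponding sketch of the bilateral similarity kernel (4), again for one-dimensional positions; the bandwidth values are illustrative assumptions:

```python
import numpy as np

def bilateral_kernel(xi, xj, yi, yj, hx=2.0, hy=0.5):
    # separable product of spatial and photometric Gaussian terms, as in (4)
    spatial = np.exp(-np.abs(xi - xj) ** 2 / hx ** 2)
    photometric = np.exp(-(yi - yj) ** 2 / hy ** 2)
    return spatial * photometric
```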
NONLOCAL MEANS
The NLM algorithm [7], [26], [27], originally proposed in [28] and [29], has stirred a great deal of interest in the community in recent years. At its core, however, it is a relatively simple generalization of the BF; specifically, the photometric term in the bilateral similarity kernel, which is measured point-wise, is simply replaced with one that is patch-wise. A second difference is that (at least in theory) the geometric distance between the patches (corresponding to the first term in the bilateral similarity kernel) is essentially ignored, leading to strong contributions from patches that may not be physically near the pixel of interest (hence, the name nonlocal). To summarize, the NLM kernel is

$$K(x_i, x_j, y_i, y_j) = \exp\left( \frac{-\| x_i - x_j \|^2}{h_x^2} \right) \exp\left( \frac{-\| y_i - y_j \|^2}{h_y^2} \right) \qquad (5)$$

with $h_x \to \infty$, where $y_i$ and $y_j$ now refer to patches of pixels centered at positions $x_i$ and $x_j$, respectively. In practice, two implementation details should be observed. First, the patch-wise photometric distance $\| y_i - y_j \|^2$ in the above is in fact measured as $(y_i - y_j)^T G (y_i - y_j)$, where G is a fixed diagonal matrix containing Gaussian weights, which give higher importance to the center of the respective patches. Second, it is rather computationally impractical to compare all the patches $y_i$ to $y_j$, so although the NLM approach in Buades et al. [28] theoretically forces $h_x$ to be infinite, in practice the search is typically limited to a reasonable spatial neighborhood of $y_j$. Consequently, in effect, the NLM filter too is more or less local; or, said another way, $h_x$ is never infinite in practice. The method in Awate et al. [29], on the other hand, proposes a Gaussian-distributed sample that comes closer to the exponential weighting on Euclidean distances in (5).
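Below is a rough one-dimensional sketch of the NLM weights in (5) incorporating the two implementation details just mentioned: a Gaussian-weighted patch distance G and a limited search window. The patch handling and parameter values are illustrative assumptions, not the reference implementation.

```python
import numpy as np

def nlm_weights(y, j, half=3, hy=0.5, search=10):
    """Illustrative 1-D NLM weights for position j, per (5):
    patch-wise photometric distance (y_i - y_j)^T G (y_i - y_j),
    with G a fixed diagonal matrix of Gaussian weights."""
    n = len(y)
    pad = np.pad(y, half, mode='edge')
    patch = lambda i: pad[i:i + 2 * half + 1]         # patch centered at sample i
    g = np.exp(-np.arange(-half, half + 1) ** 2 / (2.0 * half ** 2))
    G = g / g.sum()                                   # diagonal entries of G
    # in practice the search is limited to a spatial neighborhood of j
    idx = range(max(0, j - search), min(n, j + search + 1))
    w = np.zeros(n)
    pj = patch(j)
    for i in idx:
        d2 = (patch(i) - pj) @ (G * (patch(i) - pj))  # (y_i - y_j)^T G (y_i - y_j)
        w[i] = np.exp(-d2 / hy ** 2)
    return w / w.sum()                                # normalized weights
```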
Despite its popularity, the performance of the NLM filter leaves much to be desired. The true potential of this filtering scheme was demonstrated only later with the optimal spatial adaptation (OSA) approach of Boulanger and Kervrann [26]. In their approach, the photometric distance was refined to include estimates of the local noise variances within each patch. Specifically, they computed a local diagonal covariance matrix $V_j$, and defined the locally adaptive photometric distance as $(y_i - y_j)^T V_j^{-1} (y_i - y_j)$, in such a way as to minimize an estimate of the local mean squared error (MSE). Furthermore, they considered iterative application of the filter, as discussed in the section “Improving the Estimate by Iteration.”
LOCALLY ADAPTIVE REGRESSION (STEERING) KERNELS
The key idea behind this measure of similarity, originally proposed in [9], is to robustly obtain the local structure of images by analyzing the photometric (pixel value) differences based on estimated gradients, and to use this structure information to adapt the shape and size of a canonical kernel. The locally adaptive regression kernel (LARK) is defined as follows:

$$K(x_i, x_j, y_i, y_j) = \exp\{ -(x_i - x_j)^T C_i (x_i - x_j) \}, \qquad (6)$$

where the matrix $C_i = C(y_i, y_j)$ is estimated from the given data as

$$C_i = \sum_{x_j} \begin{bmatrix} z_{x_1}^2(x_j) & z_{x_1}(x_j)\, z_{x_2}(x_j) \\ z_{x_1}(x_j)\, z_{x_2}(x_j) & z_{x_2}^2(x_j) \end{bmatrix}.$$

Specifically, $z_{x_1}(x_j)$ and $z_{x_2}(x_j)$ are the estimated gradients of the underlying signal at point $x_i$, computed from the given measurements $y_j$ in a patch around the point of interest. In particular, the gradients used in the above expression can be estimated from the given noisy image by applying classical (i.e., nonadaptive) locally linear kernel regression. Details of this estimation procedure are given in [30]. The reader may recognize the above matrix as the well-studied “structure tensor” [31]. The advantage of the LARK descriptor is that it is exceedingly robust to noise and perturbations of the data. The formulation is also theoretically well motivated, since the quadratic exponent in (6) essentially encodes the local geodesic distance between the points $(x_i, y_i)$ and $(x_j, y_j)$ on the graph of the function $z(x, y)$, thought of as a two-dimensional (2-D) surface (a manifold) embedded in three dimensions. The geodesic distance was also used in the context of the Beltrami-flow kernel in [32] and [33] in an analogous fashion.
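A minimal sketch of the structure tensor computation behind (6). For simplicity it uses finite-difference gradients, whereas the article recommends estimating the gradients by classical kernel regression [30]; the patch size is an illustrative choice, and the smoothing and regularization details of [9] are omitted.

```python
import numpy as np

def structure_tensors(z, half=2):
    """Per-pixel structure tensor C_i of (6), summed over a local patch.
    Gradients here come from simple finite differences (illustrative only)."""
    zx1, zx2 = np.gradient(z)                 # gradient estimates z_x1, z_x2
    H, W = z.shape
    C = np.zeros((H, W, 2, 2))
    for i1 in range(H):
        for i2 in range(W):
            s1 = slice(max(0, i1 - half), min(H, i1 + half + 1))
            s2 = slice(max(0, i2 - half), min(W, i2 + half + 1))
            g1, g2 = zx1[s1, s2].ravel(), zx2[s1, s2].ravel()
            # sum over the patch of [zx1^2, zx1*zx2; zx1*zx2, zx2^2]
            C[i1, i2] = np.array([[g1 @ g1, g1 @ g2],
                                  [g1 @ g2, g2 @ g2]])
    return C

def lark_kernel(xi, xj, Ci):
    # LARK kernel (6): exp{ -(x_i - x_j)^T C_i (x_i - x_j) }
    d = np.asarray(xi, float) - np.asarray(xj, float)
    return np.exp(-d @ Ci @ d)
```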
GENERALIZATIONS AND PROPERTIES
The above discussion can be naturally generalized by defining the augmented data variable $t_i = [x_i^T, y_i^T]^T$ and a general Gaussian kernel as follows:

$$K(t_i, t_j) = \exp\{ -(t_i - t_j)^T Q (t_i - t_j) \}, \qquad (7)$$

$$Q = \begin{bmatrix} Q_x & 0 \\ 0 & Q_y \end{bmatrix}, \qquad (8)$$

where Q is symmetric positive definite (SPD). Setting $Q_x = \frac{1}{h_x^2} I$ and $Q_y = 0$, we have classical kernel regression, whereas one obtains the BF framework when $Q_x = \frac{1}{h_x^2} I$ and $Q_y = \frac{1}{h_y^2} \mathrm{diag}[0, \ldots, 0, 1, 0, \ldots, 0]$. The latter diagonal matrix picks out the center pixel in the element-wise difference of patches $t_i - t_j$. When $Q_x = 0$ and $Q_y = \frac{1}{h_y^2} G$, we have the NLM filter and its variants. Finally, the LARK kernel in (6) is obtained when $Q_x = C_i$ and $Q_y = 0$. More generally, the matrix Q can be selected so that it has nonzero off-diagonal blocks. However, no practical algorithms with this choice have been proposed so far. As detailed below, with an SPD Q, this general approach results in valid SPD kernels, a property that is used throughout the rest of our discussion. The definition of t given here is only one of many possible choices. Our treatment in this article is equally valid when, for instance, $t = T(x, y)$ is any feature derived from a convenient linear or nonlinear transformation of the original data.
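The sketch below evaluates the unified kernel (7) for a block-diagonal Q as in (8), and spells out the block choices that recover the filters above. The patch-layout conventions (patch length p, center index c) and parameter values are my own assumptions for illustration.

```python
import numpy as np

def general_kernel(ti, tj, Qx, Qy):
    """Unified Gaussian kernel (7) on t = [x^T, y^T]^T with
    block-diagonal Q = diag(Qx, Qy) as in (8)."""
    dx = ti[0] - tj[0]        # spatial part of t_i - t_j
    dy = ti[1] - tj[1]        # photometric (patch) part of t_i - t_j
    return np.exp(-(dx @ Qx @ dx + dy @ Qy @ dy))

# Example block choices (p = patch length, center index c = p // 2):
p, c, hx, hy = 7, 3, 2.0, 0.5
Q_classic = (np.eye(2) / hx**2, np.zeros((p, p)))          # classical kernel regression
e_c = np.zeros(p); e_c[c] = 1.0
Q_bilateral = (np.eye(2) / hx**2, np.diag(e_c) / hy**2)    # BF: center pixel only
G = np.diag(np.exp(-np.arange(-c, c + 1)**2 / (2.0 * c**2)))
Q_nlm = (np.zeros((2, 2)), G / hy**2)                      # NLM: whole patch, Qx = 0
# LARK: Qx = C_i (the structure tensor), Qy = 0
```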
The above concepts can be further extended using the powerful theory of reproducing kernels, originated in functional analysis (and later successfully adopted in machine learning [34], [35]), to present a significantly more general framework for selecting the similarity functions. This will help us identify the wider class of admissible similarity kernels more formally, and to understand how to produce new kernels from ones already defined [35]. Formally, a scalar-valued function $K(t, s)$ over a compact region of its domain $\mathbb{R}^n$ is called an admissible kernel if
• K is symmetric: $K(t, s) = K(s, t)$
• K is positive definite; that is, for any collection of points $t_i$, $i = 1, \ldots, n$, the Gram matrix with elements $K_{ij} = K(t_i, t_j)$ is positive definite.
Such kernels satisfy some useful properties, such as positivity, $K(t, t) \ge 0$, and the Cauchy–Schwartz inequality $K^2(t, s) \le K(t, t)\, K(s, s)$.
With the above definitions in place, there are numerous ways to construct new valid kernels from existing ones. We list some of the most useful ones below, without proof [35]; a small numerical sanity check follows the list. Given two valid kernels $K_1(t, s)$ and $K_2(t, s)$, the following constructions yield admissible kernels:
1) $K(t, s) = \alpha K_1(t, s) + \beta K_2(t, s)$ for any pair $\alpha, \beta \ge 0$
2) $K(t, s) = K_1(t, s)\, K_2(t, s)$
3) $K(t, s) = k(t)\, k(s)$, where $k(\cdot)$ is a scalar-valued function
4) $K(t, s) = p(K_1(t, s))$, where $p(\cdot)$ is a polynomial with positive coefficients
5) $K(t, s) = \exp(K_1(t, s))$.
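As a quick numerical sanity check (not a proof) of these construction rules, one can form Gram matrices from sample points and verify that their smallest eigenvalues are nonnegative up to floating-point error. The base kernels chosen here are standard admissible examples of mine:

```python
import numpy as np

rng = np.random.default_rng(1)
t = rng.standard_normal((8, 3))                      # 8 sample points t_i in R^3

def gram(kernel):
    return np.array([[kernel(a, b) for b in t] for a in t])

k1 = lambda a, b: np.exp(-np.sum((a - b) ** 2))      # Gaussian kernel: admissible
k2 = lambda a, b: (1.0 + a @ b) ** 2                 # polynomial kernel: admissible

candidates = {
    "sum":     lambda a, b: 2.0 * k1(a, b) + 3.0 * k2(a, b),   # rule 1
    "product": lambda a, b: k1(a, b) * k2(a, b),               # rule 2
    "outer":   lambda a, b: np.tanh(a[0]) * np.tanh(b[0]),     # rule 3: k(t)k(s)
    "poly":    lambda a, b: k1(a, b) ** 3 + 2.0 * k1(a, b),    # rule 4
    "exp":     lambda a, b: np.exp(k1(a, b)),                  # rule 5
}
for name, k in candidates.items():
    eigmin = np.linalg.eigvalsh(gram(k)).min()
    print(f"{name}: smallest Gram eigenvalue = {eigmin:.2e}")  # >= 0 up to round-off
```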
Regardless of the choice of the kernel function, the weighted least-squares optimization problem (3) has a simple solution. In matrix notation, we can write

$$\hat{z}(x_j) = \arg\min_{z(x_j)} [y - 1_n z(x_j)]^T K_j [y - 1_n z(x_j)], \qquad (9)$$

where $1_n = [1, 1, \ldots, 1]^T$ and $K_j = \mathrm{diag}[K(x_1, x_j, y_1, y_j), K(x_2, x_j, y_2, y_j), \ldots, K(x_n, x_j, y_n, y_j)]$. The closed-form solution to the above is

$$\hat{z}(x_j) = (1_n^T K_j 1_n)^{-1}\, 1_n^T K_j y \qquad (10)$$

$$= \left( \sum_i K(x_i, x_j, y_i, y_j) \right)^{-1} \sum_i K(x_i, x_j, y_i, y_j)\, y_i = \sum_i \frac{K_{ij}}{\sum_i K_{ij}}\, y_i = \sum_i W_{ij}\, y_i = w_j^T y. \qquad (11)$$
So in general, the estimate $\hat{z}(x_j)$ of the signal at position $x_j$ is given by a weighted sum of all the given data points $y(x_i)$, each contributing a weight commensurate with its similarity, as indicated by $K(\cdot)$, with the measurement $y(x_j)$ at the position of interest. Furthermore, as should be apparent in (10), the weights sum to one. To control computational complexity, or to design local versus nonlocal filters, we may choose to set the weight for some “sufficiently far-away” pixels to zero or a small value, leading to a weighted sum involving a relatively small number of data points in a properly defined vicinity of the sample of interest. This is essentially the only distinction between locally adaptive processing methods (such as the BF and LARK) and so-called nonlocal methods such as NLM. It is worth noting that, in the formulation above, despite the simple form of (10), in general we have a nonlinear estimator, since the weights $W_{ij} = W(x_i, x_j, y_i, y_j)$ depend on the noisy data. The nonparametric approach in (3) can be further extended to include a more general expansion of the signal z(x) in a desired basis. We briefly discuss this case in “Generalization to Arbitrary Bases,” but leave its full treatment for future research.
To summarize the discussion so far, we have presented a general framework that absorbs many existing algorithms as special cases. This was done in several ways, including a general description of the set of admissible similarity kernels, which allows the construction of a wide variety of new kernels not considered before in the image processing literature. Next, we turn our attention to the matrix formulation of the nonparametric filtering approach. As we shall see, this provides a framework for more in-depth and intuitive understanding of the resulting filters, their subsequent improvement, and their respective asymptotic and numerical properties.

Before we end this section, it is worth saying a few words about computational complexity. In general, patch-based methods are quite computationally intensive. Recently, many works have aimed at both efficiently searching for similar patches and more cleverly computing the resulting weights. Among these, notable recent work appears in [36] and [37].
THE MATRIX FORMULATION AND ITS PROPERTIES
In this section, we analyze the filtering problems posed earlier in the language of linear algebra and make several theoretical and practical observations. In particular, we are able not only to study the numerical/algebraic properties of the resulting filters, but also to analyze some of their fundamental statistical properties.

To begin, recall the convenient vector form of the filters:

$$\hat{z}(x_j) = w_j^T y, \qquad (12)$$

where $w_j = [W(x_1, x_j, y_1, y_j), W(x_2, x_j, y_2, y_j), \ldots, W(x_n, x_j, y_n, y_j)]^T$ is a vector of weights for each j. Writing the above at once for all j, we have

$$\hat{z} = \begin{bmatrix} w_1^T \\ w_2^T \\ \vdots \\ w_n^T \end{bmatrix} y = W y. \qquad (13)$$
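A short sketch assembling the row-stochastic weight matrix W of (13); it simply stacks the normalized weight vectors $w_j^T$ of (10) as rows. The kernel argument can be any of the similarity functions sketched earlier (e.g., the illustrative bilateral_kernel above); the function name is mine.

```python
import numpy as np

def weight_matrix(y, x, kernel):
    """Filter matrix W of (13): row j holds the normalized
    weights w_j^T, so that z_hat = W @ y."""
    n = len(y)
    W = np.empty((n, n))
    for j in range(n):
        k = np.array([kernel(x[i], x[j], y[i], y[j]) for i in range(n)])
        W[j] = k / k.sum()          # weights in each row sum to one, per (10)
    return W

# usage (with a kernel defined as above):
#   z_hat = weight_matrix(y, x, bilateral_kernel) @ y
#   np.allclose(weight_matrix(y, x, bilateral_kernel).sum(axis=1), 1.0)  # True
```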
As such, the filters defined by the above process can be analyzed as the product of a (square, $n \times n$) matrix of weights W with the vector of the given data y. First, a notational matter: W is in general a function of the data, so, strictly speaking, the notation W(y) would be more descriptive. But as we will describe later in more detail, the typical process for computing these weights in practice involves first computing a preliminary denoised “pilot,” or “prefiltered,” version of the image, from which the weights are then calculated. This preprocessing, done only for the purposes of computing the parameters

Citations
Journal ArticleDOI

Graph Signal Processing: Overview, Challenges, and Applications

TL;DR: An overview of core ideas in GSP and their connection to conventional digital signal processing is provided, along with a brief historical perspective to highlight how concepts recently developed build on top of prior research in other areas.
Proceedings ArticleDOI

MemNet: A Persistent Memory Network for Image Restoration

TL;DR: A very deep persistent memory network (MemNet) is proposed that introduces a memory block, consisting of a recursive unit and a gate unit, to explicitly mine persistent memory through an adaptive learning process.
Journal ArticleDOI

Trainable Nonlinear Reaction Diffusion: A Flexible Framework for Fast and Effective Image Restoration

TL;DR: This work proposes a dynamic nonlinear reaction diffusion model with time-dependent parameters, which preserves the structural simplicity of diffusion models and takes only a small number of diffusion steps, making the inference procedure extremely fast.
Proceedings Article

Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections

TL;DR: This paper proposes to symmetrically link convolutional and de-convolutional layers with skip-layer connections, with which the training converges much faster and attains a higher-quality local optimum, making training deep networks easier and achieving restoration performance gains consequently.
Proceedings ArticleDOI

Plug-and-Play priors for model based reconstruction

TL;DR: This paper demonstrates with some simple examples how Plug-and-Play priors can be used to mix and match a wide variety of existing denoising models with a tomographic forward model, thus greatly expanding the range of possible problem solutions.
References
Journal ArticleDOI

Scale-space and edge detection using anisotropic diffusion

TL;DR: A new definition of scale-space is suggested, and a class of algorithms used to realize a diffusion process is introduced, chosen to vary spatially in such a way as to encourage intraregion smoothing rather than interregion smoothing.
Journal ArticleDOI

Mean shift: a robust approach toward feature space analysis

TL;DR: It is proved the convergence of a recursive mean shift procedure to the nearest stationary point of the underlying density function and, thus, its utility in detecting the modes of the density.
Journal ArticleDOI

A tutorial on support vector regression

TL;DR: This tutorial gives an overview of the basic ideas underlying Support Vector (SV) machines for function estimation, and includes a summary of currently used algorithms for training SV machines, covering both the quadratic programming part and advanced methods for dealing with large datasets.
Journal ArticleDOI

Matching pursuits with time-frequency dictionaries

TL;DR: The authors introduce an algorithm, called matching pursuit, that decomposes any signal into a linear expansion of waveforms that are selected from a redundant dictionary of functions, chosen in order to best match the signal structures.
Journal ArticleDOI

K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation

TL;DR: A novel algorithm for adapting dictionaries in order to achieve sparse signal representations, the K-SVD algorithm, an iterative method that alternates between sparse coding of the examples based on the current dictionary and a process of updating the dictionary atoms to better fit the data.
Frequently Asked Questions (9)
Q1. What are the contributions in this paper?

In this article, the author presents a practical and accessible framework to understand some of the basic underpinnings of these methods, with the intention of leading the reader to a broad understanding of how they interrelate. In particular, several novel optimality properties of algorithms in wide use such as block-matching and three-dimensional (3-D) filtering (BM3D), and methods for their iterative improvement (or nonexistence thereof) are discussed.

Other excerpts surfaced from the paper:

If the orthonormal basis V contains a constant vector $v_1 = 1_n/\sqrt{n}$, one can easily make W doubly stochastic by setting its corresponding shrinkage factor $\lambda_1 = 1$.

One way to remedy the lack of optimality of the choice of kernel is to apply the resulting filters iteratively, and that is the subject of this section.

Expanding the regression function $z(x)$ in a desired basis $\varphi_l$, we can formulate the following optimization problem:

$$\hat{\beta}(x_j) = \arg\min_{\beta} \sum_{i=1}^{n} \left[ y_i - \sum_{l=0}^{N} \beta_l\, \varphi_l(x_i, x_j) \right]^2 K(x_i, x_j, y_i, y_j), \qquad (S1)$$

where N is the model (or regression) order.

From a practical point of view, and insofar as the computation of the matrix W is concerned, it is always reasonable to assume that the noise variance is relatively small, because in practice the authors typically compute W on a “prefiltered” version of the noisy image y anyway.

On the other hand, one may legitimately worry that the effect of noise on the computation of these weights may be dramatic, resulting in too much sensitivity for the resulting filters to be effective.

Their symmetrized versions, computed by Monte Carlo simulations, are also shown, where in each simulation 100 independent noise realizations are averaged.