Video Denoising, Deblocking and Enhancement
Through Separable 4-D Nonlocal Spatiotemporal
Transforms
Matteo Maggioni, Giacomo Boracchi, Alessandro Foi, Karen Egiazarian
Abstract—We propose a powerful video filtering algorithm that
exploits temporal and spatial redundancy characterizing natural
video sequences. The algorithm implements the paradigm of
nonlocal grouping and collaborative filtering, where a higher-
dimensional transform-domain representation of the observations
is leveraged to enforce sparsity and thus regularize the data:
3-D spatiotemporal volumes are constructed by tracking blocks
along trajectories defined by the motion vectors. Mutually similar
volumes are then grouped together by stacking them along an
additional fourth dimension, thus producing a 4-D structure,
termed group, where different types of data correlation exist
along the different dimensions: local correlation along the two
dimensions of the blocks, temporal correlation along the motion
trajectories, and nonlocal spatial correlation (i.e. self-similarity)
along the fourth dimension of the group. Collaborative filtering is
then realized by transforming each group through a decorrelating
4-D separable transform and then by shrinkage and inverse
transformation. In this way, the collaborative filtering provides
estimates for each volume stacked in the group, which are then
returned and adaptively aggregated to their original positions
in the video. The proposed filtering procedure addresses several
video processing applications, such as denoising, deblocking, and
enhancement of both grayscale and color data. Experimental
results prove the effectiveness of our method in terms of both subjective and objective visual quality, and show that it outperforms the state of the art in video denoising.
Index Terms—Video filtering, video denoising, video deblock-
ing, video enhancement, nonlocal methods, adaptive transforms,
motion estimation.
I. INTRODUCTION
Several factors, such as noise, blur, blocking, ringing, and other acquisition or compression artifacts, typically
impair digital video sequences. The large number of practical
applications involving digital videos has motivated a signifi-
cant interest in restoration or enhancement solutions, and the
literature contains a plethora of such algorithms (see [3], [4]
for a comprehensive overview).
At the moment, the most effective approach in restoring
images or video sequences exploits the redundancy given by
the nonlocal similarity between patches at different locations within the data [5], [6].

Matteo Maggioni, Alessandro Foi and Karen Egiazarian are with the Department of Signal Processing, Tampere University of Technology, Finland. Giacomo Boracchi is with the Dipartimento di Elettronica e Informazione, Politecnico di Milano, Italy. This paper is based on and extends the authors' preliminary conference publications [1], [2]. This work was supported by the Academy of Finland (project no. 213462, Finnish Programme for Centres of Excellence in Research 2006-2011; project no. 252547, Academy Research Fellow 2011-2016; and project no. 129118, Postdoctoral Researchers Project 2009-2011), and by the Tampere Graduate School in Information Science and Engineering (TISE).

Algorithms based on this approach have been proposed for various signal-processing problems, and mainly for image denoising [4], [6], [7], [8], [9], [10], [11],
[12], [13], [14], [15]. Specifically, in [7] an adaptive pointwise image filtering strategy, called nonlocal means, was introduced, where the estimate of each pixel x_i is obtained as a weighted average of, in principle, all the pixels x_j of the noisy image, using a family of weights proportional to the similarity between two neighborhoods centered at x_i and x_j. So far, the most effective image-denoising algorithm is
BM3D [10], [6], which relies on the so-called grouping and
collaborative filtering paradigm: the observation is processed
in a blockwise manner and mutually similar 2-D image blocks
are stacked into a 3-D group (grouping), which is then filtered
through a transform-domain shrinkage (collaborative filtering),
simultaneously providing different estimates for each grouped
block. These estimates are then returned to their respective
locations and eventually aggregated resulting in the denoised
image. In doing so, BM3D leverages the spatial correlation
of natural images both at the nonlocal and local level, due
to the abundance of mutually similar patches and to the high
correlation of image data within each patch, respectively. The
BM3D filtering scheme has been successfully applied to video
denoising in our previous work, V-BM3D [11], as well as to
several other applications including image and video super-
resolution [14], [15], [16], image sharpening [13], and image
deblurring [17].
In V-BM3D, groups are 3-D arrays of mutually similar
blocks extracted from a set of consecutive frames of the
video sequence. A group may include multiple blocks from
the same frame, naturally exploiting in this way the nonlocal
similarity characterizing images. However, it is typically along
the temporal dimension that most mutually similar blocks
can be found. It is well known that motion-compensated
videos [18] are extremely smooth along the temporal axis
and this fact is exploited by nearly all modern video-coding
techniques. Furthermore, experimental analysis in [12] shows
that, even when fast motion is present, the similarity along
the motion trajectories is much stronger than the nonlocal
similarity existing within an individual frame. In spite of this,
in V-BM3D the blocks are grouped regardless of whether their
similarity comes from the motion tracking over time or the
nonlocal spatial content. Consequently, during the filtering, V-
BM3D is not able to distinguish between temporal and spatial
nonlocal similarity. We recognize this as a conceptual as well

as practical weakness of the algorithm. As a matter of fact,
the simple experiments reported in Section VIII demonstrate that the denoising quality does not necessarily increase with the number of spatially self-similar blocks in each group; in contrast, the performance is always improved by exploiting the temporal correlation of the video.
This work proposes V-BM4D, a novel video-filtering ap-
proach that, to overcome the above weaknesses, separately
exploits the temporal and spatial redundancy of the video
sequences. The core element of V-BM4D is the spatiotemporal
volume, a 3-D structure formed by a sequence of blocks
of the video following a specific trajectory (obtained, for
example, by concatenating motion vectors along time) [19],
[20]. Thus, contrary to V-BM3D, V-BM4D does not group
blocks, but mutually similar spatiotemporal volumes according
to a nonlocal search procedure. Hence, groups in V-BM4D
are 4-D stacks of 3-D volumes, and the collaborative filtering
is then performed via a separable 4-D spatiotemporal trans-
form. The transform leverages the following three types of
correlation that characterize natural video sequences: local
spatial correlation between pixels in each block of a volume,
local temporal correlation between blocks of each volume, and
nonlocal spatial and temporal correlation between volumes of
the same group. The 4-D group spectrum is thus highly sparse,
which makes the shrinkage more effective than in V-BM3D,
yielding superior performance of V-BM4D in terms of noise
reduction.
In this work we extend the basic implementation of V-BM4D as a grayscale denoising filter, introduced in the conference paper [1], by presenting its modifications for the deblocking and deringing of compressed videos, as well as for the enhancement (sharpening) of low-contrast videos. Then,
leveraging the approach presented in [10], [21], we generalize
V-BM4D to perform collaborative filtering of color (multi-
channel) data. An additional, and fundamental, contribution
of this paper is an experimental analysis of the different types
of correlation characterizing video data, and how these affect
the filtering quality.
The paper is organized as follows. Section II introduces the
observation model, the formal definitions, and describes the
fundamental steps of V-BM4D, while Section III discusses
the implementation aspects, with particular emphasis on the
computation of motion vectors. The application of V-BM4D
to deblocking and deringing is given in Section IV, where it is
shown how to compute the thresholds used in the filtering from
the compression parameters of a video; video enhancement
(sharpening) is presented in Section V. Before the conclusions,
we provide a comprehensive collection of experiments and a
discussion of the V-BM4D performance in Section VI, and a
detailed analysis of its computational complexity in Section
VII.
II. BASIC ALGORITHM
The aim of the proposed algorithm is to provide an estimate
of the original video from the observed data. For the algorithm
design, we assume the common additive white Gaussian noise
model.
Fig. 1. Illustration of a trajectory and the associated volume (left), and a
group of mutually similar volumes (right). These have been calculated from
the sequence Tennis corrupted by white Gaussian noise with σ = 20.
A. Observation Model
We consider the observed video as a noisy image sequence z : X × T → R defined as

z(x, t) = y(x, t) + η(x, t),   x ∈ X, t ∈ T,   (1)

where y is the original (unknown) video, η(·, ·) ∼ N(0, σ^2) is i.i.d. white Gaussian noise, and (x, t) are the 3-D spatiotemporal coordinates belonging to the spatial domain X ⊂ Z^2 and the time domain T ⊂ Z, respectively. The frame of the video z at time t is denoted by z(X, t).
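As a concrete illustration of the observation model (1), the following NumPy sketch corrupts a clip with i.i.d. white Gaussian noise; the clip, its shape, and the helper name are placeholders rather than anything prescribed by the paper.

```python
import numpy as np

def add_awgn(y, sigma, seed=0):
    """Observation model (1): z = y + eta, with eta ~ N(0, sigma^2) i.i.d."""
    rng = np.random.default_rng(seed)
    return y + rng.normal(0.0, sigma, size=y.shape)

# y: a grayscale video stored as a (T, H, W) array with intensities in [0, 255]
y = np.zeros((30, 144, 176))      # placeholder clip (QCIF-sized frames)
z = add_awgn(y, sigma=20.0)       # noisy observation z(x, t)
```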
The V-BM4D algorithm comprises three fundamental steps
inherited from the BM3D paradigm, specifically grouping
(Section II-C), collaborative filtering (Section II-D) and ag-
gregation (Section II-E). These steps are performed for each
spatiotemporal volume of the video (Section II-B).
B. Spatiotemporal Volumes
Let B_z(x_0, t_0) denote a square block of fixed size N × N extracted from the noisy video z; without loss of generality, the coordinates (x_0, t_0) identify the top-left pixel of the block in the frame z(X, t_0). A spatiotemporal volume is a 3-D sequence of blocks built following a specific trajectory along time, which is supposed to follow the motion in the scene. Formally, the trajectory associated to (x_0, t_0) is defined as

Traj(x_0, t_0) = { (x_j, t_0 + j) }_{j = -h^-}^{h^+},   (2)

where the elements (x_j, t_0 + j) are time-consecutive coordinates, each of which represents the position of the reference block B_z(x_0, t_0) within the neighboring frame z(X, t_0 + j), j = -h^-, ..., h^+. For the sake of simplicity, in this section it is assumed that h^- = h^+ = h for all (x, t) ∈ X × T.
The trajectories can be either directly computed from the noisy video or, when a coded video is given, obtained by concatenating motion vectors. In what follows we assume that, for each (x_0, t_0) ∈ X × T, a trajectory Traj(x_0, t_0) is given, and thus the 3-D spatiotemporal volume associated to (x_0, t_0) can be determined as

V_z(x_0, t_0) = { B_z(x_i, t_i) : (x_i, t_i) ∈ Traj(x_0, t_0) },   (3)

where the subscript z specifies that the volumes are extracted from the noisy video.
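To make the data structures of (2)-(3) concrete, the sketch below stacks the N × N blocks visited by a trajectory into a 3-D spatiotemporal volume. The (x, y, t) coordinate layout and the helper name are illustrative choices, not taken from the paper.

```python
import numpy as np

def extract_volume(video, trajectory, N):
    """Stack the N x N blocks visited by a trajectory into a 3-D spatiotemporal volume.
    video:      (T, H, W) array
    trajectory: list of (x, y, t), with (x, y) the top-left column/row of the block in frame t
    """
    return np.stack([video[t, y:y + N, x:x + N] for (x, y, t) in trajectory], axis=0)

# toy example: a static trajectory over three consecutive frames
video = np.arange(3 * 8 * 8, dtype=float).reshape(3, 8, 8)
traj = [(2, 2, 0), (2, 2, 1), (2, 2, 2)]
V = extract_volume(video, traj, N=4)    # shape (3, 4, 4): (time, block rows, block cols)
```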
C. Grouping
Groups are stacks of mutually similar volumes and consti-
tute the nonlocal element of V-BM4D. Mutually similar vol-
umes are determined by a nonlocal search procedure as in [10].

Specifically, let Ind(x_0, t_0) be the set of indices identifying those volumes that, according to a distance operator δ_v, are similar to V_z(x_0, t_0):

Ind(x_0, t_0) = { (x_i, t_i) : δ_v(V_z(x_0, t_0), V_z(x_i, t_i)) < τ_match }.

The parameter τ_match > 0 controls the minimum degree of similarity among volumes with respect to the distance δ_v, which is typically the ℓ2-norm of the difference between two volumes.
The group associated to the reference volume V_z(x_0, t_0) is then

G_z(x_0, t_0) = { V_z(x_i, t_i) : (x_i, t_i) ∈ Ind(x_0, t_0) }.   (4)

In (4) we implicitly assume that the 3-D volumes are stacked along a fourth dimension; hence the groups are 4-D data structures. The order of the spatiotemporal volumes in the 4-D stacks is based on their similarity with the reference volume. Note that, since δ_v(V_z, V_z) = 0, every group G_z(x_0, t_0) contains at least its reference volume V_z(x_0, t_0). Figure 1 shows an example of trajectories and volumes belonging to a group.
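A minimal sketch of the grouping step, assuming the candidate volumes have already been extracted and share the same length: distances δ_v to the reference volume are computed, thresholded by τ_match, and the surviving volumes are stacked along a fourth dimension in order of increasing distance. All names are illustrative.

```python
import numpy as np

def group_volumes(volumes, ref_idx, tau_match):
    """Indices of volumes whose length-normalized squared l2 distance from the
    reference volume is below tau_match, sorted by increasing distance."""
    ref = volumes[ref_idx]
    dists = [np.sum((ref - v) ** 2) / ref.shape[0] for v in volumes]
    return [i for i in np.argsort(dists) if dists[i] < tau_match]  # reference (distance 0) comes first

volumes = [np.random.rand(5, 8, 8) for _ in range(20)]             # same-length candidate volumes
idx = group_volumes(volumes, ref_idx=0, tau_match=2.0)
group = np.stack([volumes[i] for i in idx], axis=0)                # 4-D group: (volume, time, row, col)
```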
D. Collaborative Filtering
According to the general formulation of the grouping and
collaborative-filtering approach for a d-dimensional signal
[10], groups are (d + 1)-dimensional structures of similar
d-dimensional elements, which are then jointly filtered. In
particular, each of the grouped elements influences the filtered
output of all the other elements of the group: this is the basic
idea of collaborative filtering. It is typically realized through
the following steps: firstly a (d + 1)-dimensional separable
linear transform is applied to the group, then the transformed
coefficients are shrunk, for example by hard thresholding or by
Wiener filtering, and finally the (d+1)-dimensional transform
is inverted to obtain an estimate for each grouped element.
The core elements of V-BM4D are the spatiotemporal volumes (d = 3), and thus the collaborative filtering performs a 4-D separable linear transform T_4D on each 4-D group G_z(x_0, t_0), and provides an estimate for each grouped volume V_z:

Ĝ_y(x_0, t_0) = T_4D^{-1}( Υ( T_4D(G_z(x_0, t_0)) ) ),

where Υ denotes a generic shrinkage operator. The filtered 4-D group Ĝ_y(x_0, t_0) is composed of volumes V̂_y(x, t),

Ĝ_y(x_0, t_0) = { V̂_y(x_i, t_i) : (x_i, t_i) ∈ Ind(x_0, t_0) },

with each V̂_y being an estimate of the corresponding unknown volume V_y in the original video y.
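The sketch below illustrates collaborative filtering of one 4-D group with hard thresholding as the shrinkage operator Υ. Since this section does not fix the particular separable transform, a plain 4-D DCT stands in for T_4D; the example is only meant to show the transform-shrink-invert pattern.

```python
import numpy as np
from scipy.fft import dctn, idctn

def collaborative_filter(group, threshold):
    """Shrinkage of a 4-D group in a separable 4-D DCT domain.
    group: (M, L, N, N) stack of M spatiotemporal volumes of length L."""
    spectrum = dctn(group, norm='ortho')            # separable transform along all four axes
    spectrum[np.abs(spectrum) < threshold] = 0.0    # hard thresholding (generic shrinkage)
    return idctn(spectrum, norm='ortho')            # estimates for all grouped volumes at once

group = np.random.rand(4, 5, 8, 8)                  # 4 volumes of 5 blocks of size 8x8
filtered_group = collaborative_filter(group, threshold=0.5)
```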
E. Aggregation
The groups Ĝ_y constitute a very redundant representation of the video, because in general the volumes V̂_y overlap and, within the overlapping parts, the collaborative filtering provides multiple estimates at the same coordinates (x, t). For this reason, the estimates are aggregated through a convex combination with adaptive weights. In particular, the estimate ŷ of the original video is computed as

ŷ = ( Σ_{(x_0,t_0) ∈ X×T} Σ_{(x_i,t_i) ∈ Ind(x_0,t_0)} w_{(x_0,t_0)} V̂_y(x_i, t_i) ) / ( Σ_{(x_0,t_0) ∈ X×T} Σ_{(x_i,t_i) ∈ Ind(x_0,t_0)} w_{(x_0,t_0)} χ_{(x_i,t_i)} ),   (5)

where we assume V̂_y(x_i, t_i) to be zero-padded outside its domain, χ_{(x_i,t_i)} : X × T → {0, 1} is the characteristic function (indicator) of the support of the volume V̂_y(x_i, t_i), and the aggregation weights w_{(x_0,t_0)} are different for different groups. The aggregation weights may depend on the result of the shrinkage in the collaborative filtering, and are typically defined to be inversely proportional to the total sample variance of the estimate of the corresponding group [10]. Intuitively, the sparser the shrunk 4-D spectrum Ĝ_y(x_0, t_0), the larger the corresponding weight w_{(x_0,t_0)}. Such aggregation is a well-established procedure to obtain a global estimate from different overlapping local estimates [22], [23].
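In practice, (5) is usually realized with two accumulation buffers, one for the weighted estimates and one for the weights multiplied by the support indicators; the sketch below shows this pattern with hypothetical names and toy slices.

```python
import numpy as np

def aggregate(shape, estimates):
    """Convex combination (5). `estimates` is a list of (volume_estimate, slices, weight),
    where `slices` locates the volume inside the (T, H, W) video array."""
    num = np.zeros(shape)
    den = np.zeros(shape)
    for vol, slices, w in estimates:
        num[slices] += w * vol           # weighted estimate, zero-padded outside its support
        den[slices] += w                 # weight times the indicator of the support
    return num / np.maximum(den, 1e-12)  # pixels never covered by any volume stay zero

video_shape = (10, 64, 64)
est = [(np.ones((3, 8, 8)), (slice(0, 3), slice(8, 16), slice(8, 16)), 0.7),
       (np.zeros((3, 8, 8)), (slice(1, 4), slice(10, 18), slice(8, 16)), 0.3)]
y_hat = aggregate(video_shape, est)
```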
III. IMPLEMENTATION ASPECTS
A. Computation of the Trajectories
In our implementation of V-BM4D, we construct trajectories
by concatenating motion vectors which are defined as follows.
1) Location prediction: When two consecutive spatiotemporal locations (x_{i-1}, t_i - 1) and (x_i, t_i) of a block are known, we can define the corresponding motion vector (velocity) as v(x_i, t_i) = x_{i-1} - x_i. Hence, under the assumption of smooth motion, we can predict the position x̂_i(t_i + 1) of the block in the frame z(X, t_i + 1) as

x̂_i(t_i + 1) = x_i + γ_p · v(x_i, t_i),   (6)

where γ_p ∈ [0, 1] is a weighting factor of the prediction. In case (x_{i-1}, t_i - 1) is not available, we consider the lack of motion as the most likely situation and set x̂_i(t_i + 1) = x_i. Analogous predictions can be made when looking for preceding blocks in the sequence.
2) Similarity criterion: The motion of a block is generally tracked by identifying the most similar block in the subsequent or preceding frame. However, since we deal with noisy signals, it is advisable to enforce motion-smoothness priors to improve the tracking. In particular, given the predicted future x̂_i(t_i + 1) or past x̂_i(t_i - 1) position of the block B_z(x_i, t_i), we define the similarity between B_z(x_i, t_i) and B_z(x_j, t_i ± 1) through a penalized quadratic difference

δ_b( B_z(x_i, t_i), B_z(x_j, t_i ± 1) ) = ||B_z(x_i, t_i) - B_z(x_j, t_i ± 1)||_2^2 / N^2 + γ_d ||x̂_i(t_i ± 1) - x_j||_2,   (7)

where x̂_i(t_i ± 1) is defined as in (6), and γ_d ∈ R^+ is the penalization parameter. Observe that the tracking is performed separately forward in time, towards t_i + 1, and backward, towards t_i - 1.
V-BM4D constructs the trajectory (2) by repeatedly minimizing (7). Formally, the motion of B_z(x_i, t_i) from time t_i to t_i ± 1 is determined by the position x_{i±1} that minimizes (7):

x_{i±1} = arg min_{x_k ∈ N_i} δ_b( B_z(x_i, t_i), B_z(x_k, t_i ± 1) ),

where N_i is an adaptive spatial search neighborhood in the frame z(X, t_i ± 1) (further details are given in Section III-A3). Even though such an x_{i±1} can always be found, we stop the trajectory construction whenever the corresponding minimum distance δ_b exceeds a fixed parameter τ_traj ∈ R^+, which imposes a minimum amount of similarity along the spatiotemporal volumes. This allows V-BM4D to effectively deal with those situations, such as occlusions and changes of scene, where consistent blocks (in terms of both similarity and motion smoothness) cannot be found.

Fig. 2. Effect of different penalties γ_d = 0.025 (left) and γ_d = 0 (right) on the background textures of the sequence Tennis corrupted by Gaussian noise with σ = 20. The block positions at time t = 1 are the same in both experiments.
Figure 2 illustrates two trajectories estimated using different penalization parameters γ_d. Observe that the penalization term becomes essential when blocks are tracked within flat areas or homogeneous textures in the scene. In fact, the right image of Figure 2 shows that, without a position-dependent distance metric, the trajectories would be mainly determined by the noise. As a consequence, the collaborative filtering would be less effective because of the badly conditioned temporal correlation of the data within the volumes.
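A compact sketch of one tracking step combining the prediction (6) with the penalized distance (7): the candidate minimizing δ_b inside a small window around the predicted position is selected, and the caller interrupts the trajectory when the minimum exceeds τ_traj. The (row, col) coordinate convention and the exhaustive search are illustrative choices.

```python
import numpy as np

def next_position(video, x_i, t_i, v, N, gamma_p, gamma_d, radius):
    """One forward tracking step from frame t_i to t_i + 1.
    x_i: (row, col) of the block's top-left pixel; v: previous motion vector (rows, cols)."""
    T, H, W = video.shape
    ref = video[t_i, x_i[0]:x_i[0] + N, x_i[1]:x_i[1] + N]
    pred = (int(round(x_i[0] + gamma_p * v[0])), int(round(x_i[1] + gamma_p * v[1])))
    best, best_d = None, np.inf
    for r in range(max(0, pred[0] - radius), min(H - N, pred[0] + radius) + 1):
        for c in range(max(0, pred[1] - radius), min(W - N, pred[1] + radius) + 1):
            cand = video[t_i + 1, r:r + N, c:c + N]
            d = np.sum((ref - cand) ** 2) / N**2 \
                + gamma_d * np.hypot(r - pred[0], c - pred[1])   # penalized distance (7)
            if d < best_d:
                best, best_d = (r, c), d
    return best, best_d   # the caller stops the trajectory if best_d exceeds tau_traj

video = np.random.rand(2, 32, 32)
pos, dist = next_position(video, x_i=(8, 8), t_i=0, v=(0, 0), N=8,
                          gamma_p=0.3, gamma_d=0.025, radius=5)
```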
3) Search neighborhood: Because of the penalty term γ_d ||x̂_i(t_i ± 1) - x_j||_2, the minimizer of (7) is likely to be close to x̂_i(t_i ± 1). Thus, we can restrict the minimization of (7) to a spatial search neighborhood N_i centered at x̂_i(t_i ± 1). We found it convenient to make the search-neighborhood size, N_PR × N_PR, adaptive to the velocity of the tracked block (the magnitude of the motion vector) by setting

N_PR = N_S · ( 1 - γ_w · e^{ -||v(x_i, t_i)||_2^2 / (2·σ_w^2) } ),

where N_S is the maximum size of N_i, γ_w ∈ [0, 1] is a scaling factor, and σ_w > 0 is a tuning parameter. As the velocity v increases, N_PR approaches N_S according to σ_w; conversely, when the velocity is zero, N_PR = N_S·(1 - γ_w). By setting a proper value of σ_w we can control the decay rate of the exponential term as a function of v or, in other words, how permissive the window contraction is with respect to the velocity of the tracked block.
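The velocity-adaptive window size can be evaluated directly from the expression above; a minimal sketch with placeholder parameter values:

```python
import numpy as np

def window_size(v, N_S, gamma_w, sigma_w):
    """Velocity-adaptive side of the search neighborhood: close to N_S*(1 - gamma_w)
    for still blocks, approaching N_S as the block moves faster."""
    speed_sq = float(np.dot(v, v))                       # ||v(x_i, t_i)||^2
    return N_S * (1.0 - gamma_w * np.exp(-speed_sq / (2.0 * sigma_w ** 2)))

print(window_size(np.array([0.0, 0.0]), N_S=11, gamma_w=0.6, sigma_w=1.0))   # contracted window
print(window_size(np.array([4.0, 3.0]), N_S=11, gamma_w=0.6, sigma_w=1.0))   # close to N_S
```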
B. Sub-volume Extraction
So far, the number of frames spanned by all the trajectories has been assumed fixed and equal to h. However, because of occlusions, scene changes, or heavy noise, any trajectory Traj(x_i, t_i) can be interrupted at any time, i.e., whenever the distance between consecutive blocks exceeds the threshold τ_traj. Thus, given a temporal extent [t_i - h_i^-, t_i + h_i^+] for the trajectory Traj(x_i, t_i), we have that in general 0 ≤ h_i^- ≤ h and 0 ≤ h_i^+ ≤ h, where h denotes the maximum forward and backward extent of the trajectories (thus of the volumes) allowed in the algorithm.

As a result, in principle, V-BM4D may stack together volumes having different lengths. However, in practice, because of the separability of the transform T_4D, every group G_z(x_i, t_i) has to be composed of volumes having the same length. Thus, for each reference volume V_z(x_0, t_0), we only consider the volumes V_z(x_i, t_i) such that t_i = t_0, h_i^- ≥ h_0^- and h_i^+ ≥ h_0^+. Then, we extract from each V_z(x_i, t_i) the sub-volume having temporal extent [t_0 - h_0^-, t_0 + h_0^+], denoted as E_{L_0}(V_z(x_i, t_i)). Among all the possible criteria for extracting a sub-volume of length L_0 = h_0^- + h_0^+ + 1 from a longer volume, our choice aims at limiting the complexity while maintaining a high correlation within the grouped volumes, because we can reasonably assume that similar objects at different positions are represented by similar volumes along time.
In the grouping, we set as distance operator δ_v the ℓ2-norm of the difference between time-synchronous volumes, normalized with respect to their length:

δ_v( V_z(x_0, t_0), V_z(x_i, t_i) ) = || V_z(x_0, t_0) - E_{L_0}(V_z(x_i, t_i)) ||_2^2 / L_0.   (8)
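A sketch of the distance (8), assuming the candidate volume is at least as long as the reference one on both sides (h_i^- ≥ h_0^-, h_i^+ ≥ h_0^+) so that the time-synchronous sub-volume E_{L_0}(V_z(x_i, t_i)) can simply be cropped out; the indexing convention is our own.

```python
import numpy as np

def volume_distance(V_ref, V_other, h_minus_ref, h_plus_ref, h_minus_other):
    """Length-normalized squared l2 distance (8) between the reference volume and the
    time-synchronous sub-volume cropped from a (possibly longer) candidate volume.
    Volumes are (length, N, N) arrays whose first index runs over time."""
    L0 = h_minus_ref + h_plus_ref + 1
    start = h_minus_other - h_minus_ref       # align the candidate with [t0 - h0-, t0 + h0+]
    sub = V_other[start:start + L0]
    return np.sum((V_ref - sub) ** 2) / L0

V_ref = np.random.rand(5, 8, 8)               # h0- = h0+ = 2, so L0 = 5
V_cand = np.random.rand(7, 8, 8)              # longer candidate with hi- = hi+ = 3
d = volume_distance(V_ref, V_cand, h_minus_ref=2, h_plus_ref=2, h_minus_other=3)
```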
C. Two-Stage Implementation with Collaborative Wiener Filtering
The general procedure described in Section II is implemented in two cascaded stages, each composed of the grouping, collaborative filtering, and aggregation steps.
1) Hard-thresholding stage: In the first stage, volumes are extracted from the noisy video z, and groups are then formed using the δ_v-operator (8) and the predefined threshold τ_match^ht. Collaborative filtering is realized by hard thresholding each group G_z(x_0, t_0) in the 4-D transform domain:

Ĝ_y^ht(x_0, t_0) = (T_4D^ht)^{-1}( Υ^ht( T_4D^ht(G_z(x_0, t_0)) ) ),   (x_0, t_0) ∈ X × T,

where T_4D^ht is the 4-D transform and Υ^ht is the hard-threshold operator with threshold σλ_4D.
The outcome of the hard-thresholding stage, ŷ^ht, is obtained by aggregating with a convex combination all the estimated groups Ĝ_y^ht, as defined in (5). The adaptive weights used in this combination are inversely proportional to the number N_{(x_0,t_0)}^ht of non-zero coefficients of the corresponding hard-thresholded group Ĝ_y^ht(x_0, t_0), that is, w_{(x_0,t_0)}^ht = 1/N_{(x_0,t_0)}^ht, which provides an estimate of the total variance of Ĝ_y^ht(x_0, t_0). In this way, we assign larger weights to the volumes belonging to groups having a sparser representation in the T_4D domain.
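A sketch of the first-stage shrinkage together with its aggregation weight w^ht = 1/N^ht, i.e., the reciprocal of the number of retained coefficients. The 4-D DCT and the numerical value of λ_4D are placeholders, since neither is fixed in this section.

```python
import numpy as np
from scipy.fft import dctn, idctn

def hard_threshold_estimate(group, sigma, lambda_4d):
    """Hard-threshold a 4-D group and return its estimate with the weight 1/N_nonzero,
    so that groups with sparser spectra receive larger aggregation weights."""
    spectrum = dctn(group, norm='ortho')
    mask = np.abs(spectrum) >= sigma * lambda_4d     # coefficients surviving the threshold
    n_retained = max(int(mask.sum()), 1)
    estimate = idctn(spectrum * mask, norm='ortho')
    return estimate, 1.0 / n_retained

group = np.random.rand(4, 5, 8, 8)
estimate, weight = hard_threshold_estimate(group, sigma=0.1, lambda_4d=2.7)  # lambda_4d is a placeholder value
```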
2) Wiener-filtering stage: In the second stage, the motion estimation is improved by extracting new trajectories Traj_{ŷ^ht} from the basic estimate ŷ^ht, and the grouping is performed on the new volumes V_{ŷ^ht}. Volume matching is still performed through the δ_v-distance, but using a different threshold τ_match^wie. The indices identifying similar volumes, Ind_{ŷ^ht}(x, t), are used to construct both the groups G_z and G_{ŷ^ht}, composed of volumes extracted from the noisy video z and from the estimate ŷ^ht, respectively.
Collaborative filtering is hence performed using an empirical Wiener filter in the T_4D^wie transform domain. Shrinkage is realized by scaling the 4-D transform coefficients of each group G_z(x_0, t_0), extracted from the noisy video z, with the Wiener attenuation coefficients

W(x_0, t_0) = | T_4D^wie( G_{ŷ^ht}(x_0, t_0) ) |^2 / ( | T_4D^wie( G_{ŷ^ht}(x_0, t_0) ) |^2 + σ^2 ),

which are computed from the energy of the 4-D spectrum of the group G_{ŷ^ht}(x_0, t_0). Eventually, the group estimate is obtained by inverting the 4-D transform as

Ĝ_y^wie(x_0, t_0) = (T_4D^wie)^{-1}( W(x_0, t_0) · T_4D^wie( G_z(x_0, t_0) ) ),

where · denotes the element-wise product. The final global estimate ŷ^wie is computed by the aggregation (5), using the weights w_{(x_0,t_0)}^wie = ||W(x_0, t_0)||_2^{-2}, which follow from considerations similar to those underlying the adaptive weights used in the first stage.

Fig. 3. V-BM4D two-stage denoising of the sequence Coastguard. From left to right: original video y, noisy video z (σ = 40), result of the first stage ŷ^ht (frame PSNR 28.58 dB), and final estimate ŷ^wie (frame PSNR 29.38 dB).
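The Wiener stage can be sketched as follows, assuming the groups from the noisy video and from the basic estimate have already been built with the same indices Ind_{ŷ^ht}; a 4-D DCT again stands in for T_4D^wie.

```python
import numpy as np
from scipy.fft import dctn, idctn

def wiener_stage(group_noisy, group_basic, sigma):
    """Empirical Wiener shrinkage: attenuation computed from the basic estimate's spectrum,
    applied to the noisy group's spectrum. Also returns the aggregation weight ||W||_2^(-2)."""
    S_basic = dctn(group_basic, norm='ortho')
    W = S_basic ** 2 / (S_basic ** 2 + sigma ** 2)          # Wiener attenuation coefficients
    filtered = idctn(W * dctn(group_noisy, norm='ortho'), norm='ortho')
    weight = 1.0 / np.sum(W ** 2)
    return filtered, weight

g_noisy = np.random.rand(4, 5, 8, 8)                         # group from the noisy video z
g_basic = g_noisy + 0.01 * np.random.rand(4, 5, 8, 8)        # stand-in for the basic estimate's group
filtered, w = wiener_stage(g_noisy, g_basic, sigma=0.1)
```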
D. Settings
The parameters involved in the motion estimation and in the grouping, namely γ_d, τ_traj and τ_match, depend on the noise standard deviation σ. Intuitively, in order to compensate for the effects of the noise, the larger σ is, the larger the thresholds controlling block and volume matching become. For the sake of simplicity, we model such dependencies as second-order polynomials in σ: γ_d(σ), τ_traj(σ) and τ_match(σ). The nine coefficients required to describe the three polynomials are jointly optimized using the Nelder-Mead simplex direct-search algorithm [24], [25]. As optimization criterion, we maximize the sum of the restoration performance (PSNR) of V-BM4D applied over a collection of test videos corrupted by synthetic noise having different values of σ. Namely, we considered Salesman, Tennis, Flower Garden, Miss America, Coastguard, Foreman, Bus, and Bicycle corrupted by white Gaussian noise with σ ranging from 5 to 70. The resulting polynomials are

γ_d(σ) = 0.0005 · σ^2 - 0.0059 · σ + 0.0400,   (9)
τ_traj(σ) = 0.0047 · σ^2 + 0.0676 · σ + 0.4564,   (10)
τ_match(σ) = 0.0171 · σ^2 + 0.4520 · σ + 47.9294.   (11)

The solid lines in Figure 4 show the above functions. We also plot, using different markers, the best values of the three parameters obtained by unconstrained and independent optimizations of V-BM4D for each test video and value of σ. Empirically, the polynomials provide a good approximation of the optimum (γ_d, τ_traj, τ_match). Within the considered σ range, the curve (9) is practically monotone increasing despite its negative first-degree coefficient. We refrain from introducing additional constraints on the polynomials, as well as from considering σ values smaller than 5, because the resulting sequences would be mostly affected by the noise and quantization artifacts intrinsic in the original test data.
During the second stage (namely, the Wiener filtering), the parameters γ_d, τ_traj and τ_match can be considered constants, independent of σ, because in the processed sequence ŷ^ht the noise has been considerably reduced with respect to the observation z; this is evident when comparing the second and third images of Figure 3. Moreover, since in this stage both the trajectories and the groups are determined from the basic estimate ŷ^ht, there is no straightforward relation with σ, the noise standard deviation in z.
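The fitted polynomials (9)-(11) can be evaluated directly; a minimal sketch:

```python
def vbm4d_parameters(sigma):
    """Second-order polynomials (9)-(11) mapping the noise level sigma to the
    first-stage matching parameters."""
    gamma_d   = 0.0005 * sigma**2 - 0.0059 * sigma + 0.0400
    tau_traj  = 0.0047 * sigma**2 + 0.0676 * sigma + 0.4564
    tau_match = 0.0171 * sigma**2 + 0.4520 * sigma + 47.9294
    return gamma_d, tau_traj, tau_match

print(vbm4d_parameters(20))   # parameters for sigma = 20
```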
IV. DEBLOCKING
Most video compression techniques, such as MPEG-4 [26]
or H.264 [27], make use of block-transform coding and thus
may suffer, especially at low bitrates, from several com-
pression artifacts such as blocking, ringing, mosquito noise,
and flickering. These artifacts are mainly due to the coarse
quantization of the block-transform coefficients and to the
motion compensation. Moreover, since each block is processed
separately, the correlation between pixels at the borders of
neighboring blocks is typically lost during the compression,
resulting in false discontinuities in the decoded video (such as
those shown in the blocky frames in Figure 8).
A large number of deblocking filters have been proposed
in the last decade; among them we mention frame-based en-
hancement using a linear low-pass filter in spatial or transform
domain [28], projection onto convex sets (POCS) methods
[29], spatial block boundary filter [30], statistical modeling
methods [31] or shifted thresholding [32]. Additionally, most modern block-based video coding techniques, such as H.264 or MPEG-4, embed an in-loop deblocking filter as an additional processing step in the decoder [26].
Inspired by [33], we treat the blocking artifacts as additive
noise. This choice allows us to model the compressed video
z as in (1), where y now corresponds to the original uncom-
pressed video, and η represents the compression artifacts. In
what follows, we focus our attention on MPEG-4 compressed
videos. In this way, the proposed filter can be applied reliably
over different types of data degradations with little need of
adjustment or user intervention.
In order to use V-BM4D as a deblocking filter, we need
to determine a suitable value of σ to handle the artifacts
in a compressed video. To this purpose, we proceed as in
the previous section and we identify the optimum value of
σ for a set of test sequences compressed at various rates.
Figure 5 shows these optimum values plotted against the
average bit-per-pixel (bpp) rate of the compressed video and
the parameter q that controls the quantization of the block-
transform coefficients [26] (Figure 5(a)). Let us observe that
both the bpp and q parameters are easily accessible from
any given MPEG-4 coded video. These plots suggest that a
power law may conveniently explain the relation between the
optimum value of σ and both the bpp rate and q. Hence, we fit
such a bivariate function to the optimum values via least-squares regression, obtaining the adaptive value of σ for the V-BM4D deblocking filter as

σ(bpp, q) = 0.09 · q^{1.11} · bpp^{-0.46} + 3.37.   (12)
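For a given MPEG-4 stream, (12) can be evaluated directly from the bpp rate and the quantization parameter q; a minimal sketch with placeholder inputs:

```python
def deblocking_sigma(bpp, q):
    """Adaptive noise level (12) used when running V-BM4D as a deblocking filter."""
    return 0.09 * q ** 1.11 * bpp ** (-0.46) + 3.37

print(deblocking_sigma(bpp=0.2, q=10))   # coarser quantization / lower bpp -> larger sigma
```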
The function σ(bpp, q) is shown in Figure 5 (right). Note that
in MPEG-4 the parameter q ranges from 2 to 31, where higher
values correspond to a coarser quantization and consequently
lower bitrates. As a matter of fact, when q increases and/or
bpp decreases, the optimum σ increases, in order to effectively
cope with stronger blocking artifacts. Clearly, a much larger

References (partial list)
J. A. Nelder and R. Mead, "A simplex method for function minimization," The Computer Journal, 1965.
T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Transactions on Circuits and Systems for Video Technology, 2003.
K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, "Image denoising by sparse 3-D transform-domain collaborative filtering," IEEE Transactions on Image Processing, 2007.
J. C. Lagarias, J. A. Reeds, M. H. Wright, and P. E. Wright, "Convergence properties of the Nelder-Mead simplex method in low dimensions," SIAM Journal on Optimization, 1998.
A. Buades, B. Coll, and J.-M. Morel, "A review of image denoising algorithms, with a new one," Multiscale Modeling & Simulation, 2005.