Video Denoising, Deblocking and Enhancement
Through Separable 4-D Nonlocal Spatiotemporal
Transforms
Matteo Maggioni, Giacomo Boracchi, Alessandro Foi, Karen Egiazarian
Abstract—We propose a powerful video filtering algorithm that
exploits temporal and spatial redundancy characterizing natural
video sequences. The algorithm implements the paradigm of
nonlocal grouping and collaborative filtering, where a higher-
dimensional transform-domain representation of the observations
is leveraged to enforce sparsity and thus regularize the data:
3-D spatiotemporal volumes are constructed by tracking blocks
along trajectories defined by the motion vectors. Mutually similar
volumes are then grouped together by stacking them along an
additional fourth dimension, thus producing a 4-D structure,
termed group, where different types of data correlation exist
along the different dimensions: local correlation along the two
dimensions of the blocks, temporal correlation along the motion
trajectories, and nonlocal spatial correlation (i.e. self-similarity)
along the fourth dimension of the group. Collaborative filtering is
then realized by transforming each group through a decorrelating
4-D separable transform and then by shrinkage and inverse
transformation. In this way, the collaborative filtering provides
estimates for each volume stacked in the group, which are then
returned and adaptively aggregated to their original positions
in the video. The proposed filtering procedure addresses several
video processing applications, such as denoising, deblocking, and
enhancement of both grayscale and color data. Experimental
results prove the effectiveness of our method in terms of both subjective and objective visual quality, and show that it outperforms the state of the art in video denoising.
Index Terms—Video filtering, video denoising, video deblock-
ing, video enhancement, nonlocal methods, adaptive transforms,
motion estimation.
I. INTRODUCTION
Several factors, such as noise, blur, blocking, ringing, and other acquisition or compression artifacts, typically
impair digital video sequences. The large number of practical
applications involving digital videos has motivated a signifi-
cant interest in restoration or enhancement solutions, and the
literature contains a plethora of such algorithms (see [3], [4]
for a comprehensive overview).
At the moment, the most effective approach in restoring
images or video sequences exploits the redundancy given by
the nonlocal similarity between patches at different locations within the data [5], [6].

Matteo Maggioni, Alessandro Foi and Karen Egiazarian are with the Department of Signal Processing, Tampere University of Technology, Finland. Giacomo Boracchi is with the Dipartimento di Elettronica e Informazione, Politecnico di Milano, Italy. This paper is based on and extends the authors' preliminary conference publications [1], [2]. This work was supported by the Academy of Finland (project no. 213462, Finnish Programme for Centres of Excellence in Research 2006-2011; project no. 252547, Academy Research Fellow 2011-2016; and project no. 129118, Postdoctoral Researchers Project 2009-2011), and by the Tampere Graduate School in Information Science and Engineering (TISE).

Algorithms based on this approach have been proposed for various signal-processing problems, and mainly for image denoising [4], [6], [7], [8], [9], [10], [11],
[12], [13], [14], [15]. Specifically, in [7] an adaptive pointwise image filtering strategy, called nonlocal means, was introduced, where the estimate of each pixel x_i is obtained as a weighted average of, in principle, all the pixels x_j of the noisy image, using a family of weights proportional to the similarity between two neighborhoods centered at x_i and x_j. So far, the most effective image-denoising algorithm is
BM3D [10], [6], which relies on the so-called grouping and
collaborative filtering paradigm: the observation is processed
in a blockwise manner and mutually similar 2-D image blocks
are stacked into a 3-D group (grouping), which is then filtered
through a transform-domain shrinkage (collaborative filtering),
simultaneously providing different estimates for each grouped
block. These estimates are then returned to their respective
locations and eventually aggregated resulting in the denoised
image. In doing so, BM3D leverages the spatial correlation
of natural images both at the nonlocal and local level, due
to the abundance of mutually similar patches and to the high
correlation of image data within each patch, respectively. The
BM3D filtering scheme has been successfully applied to video
denoising in our previous work, V-BM3D [11], as well as to
several other applications including image and video super-
resolution [14], [15], [16], image sharpening [13], and image
deblurring [17].
In V-BM3D, groups are 3-D arrays of mutually similar
blocks extracted from a set of consecutive frames of the
video sequence. A group may include multiple blocks from
the same frame, naturally exploiting in this way the nonlocal
similarity characterizing images. However, it is typically along
the temporal dimension that most mutually similar blocks
can be found. It is well known that motion-compensated
videos [18] are extremely smooth along the temporal axis
and this fact is exploited by nearly all modern video-coding
techniques. Furthermore, experimental analysis in [12] shows
that, even when fast motion is present, the similarity along
the motion trajectories is much stronger than the nonlocal
similarity existing within an individual frame. In spite of this,
in V-BM3D the blocks are grouped regardless of whether their
similarity comes from the motion tracking over time or the
nonlocal spatial content. Consequently, during the filtering, V-
BM3D is not able to distinguish between temporal and spatial
nonlocal similarity. We recognize this as a conceptual as well

as practical weakness of the algorithm. As a matter of fact,
the simple experiments reported in Section VIII demonstrate that the denoising quality does not necessarily increase with the number of spatially self-similar blocks in each group; in contrast, the performance is always improved by exploiting the temporal correlation of the video.
This work proposes V-BM4D, a novel video-filtering ap-
proach that, to overcome the above weaknesses, separately
exploits the temporal and spatial redundancy of the video
sequences. The core element of V-BM4D is the spatiotemporal
volume, a 3-D structure formed by a sequence of blocks
of the video following a specific trajectory (obtained, for
example, by concatenating motion vectors along time) [19],
[20]. Thus, contrary to V-BM3D, V-BM4D does not group
blocks, but mutually similar spatiotemporal volumes according
to a nonlocal search procedure. Hence, groups in V-BM4D
are 4-D stacks of 3-D volumes, and the collaborative filtering
is then performed via a separable 4-D spatiotemporal trans-
form. The transform leverages the following three types of
correlation that characterize natural video sequences: local
spatial correlation between pixels in each block of a volume,
local temporal correlation between blocks of each volume, and
nonlocal spatial and temporal correlation between volumes of
the same group. The 4-D group spectrum is thus highly sparse,
which makes the shrinkage more effective than in V-BM3D,
yielding superior performance of V-BM4D in terms of noise
reduction.
In this work we extend the basic implementation of V-BM4D as a grayscale denoising filter, introduced in the conference paper [1], by presenting its modifications for the deblocking and deringing of compressed videos, as well as for the enhancement (sharpening) of low-contrast videos. Then,
leveraging the approach presented in [10], [21], we generalize
V-BM4D to perform collaborative filtering of color (multi-
channel) data. An additional, and fundamental, contribution
of this paper is an experimental analysis of the different types
of correlation characterizing video data, and how these affect
the filtering quality.
The paper is organized as follows. Section II introduces the
observation model, the formal definitions, and describes the
fundamental steps of V-BM4D, while Section III discusses
the implementation aspects, with particular emphasis on the
computation of motion vectors. The application of V-BM4D
to deblocking and deringing is given in Section IV, where it is
shown how to compute the thresholds used in the filtering from
the compression parameters of a video; video enhancement
(sharpening) is presented in Section V. Before the conclusions,
we provide a comprehensive collection of experiments and a
discussion of the V-BM4D performance in Section VI, and a
detailed analysis of its computational complexity in Section
VII.
II. BASIC ALGORITHM
The aim of the proposed algorithm is to provide an estimate
of the original video from the observed data. For the algorithm
design, we assume the common additive white Gaussian noise
model.
Fig. 1. Illustration of a trajectory and the associated volume (left), and a
group of mutually similar volumes (right). These have been calculated from
the sequence Tennis corrupted by white Gaussian noise with σ = 20.
A. Observation Model
We consider the observed video as a noisy image sequence z : X × T → R defined as

z(x, t) = y(x, t) + η(x, t),   x ∈ X, t ∈ T,   (1)

where y is the original (unknown) video, η(·, ·) ∼ N(0, σ^2) is i.i.d. white Gaussian noise, and (x, t) are the 3-D spatiotemporal coordinates belonging to the spatial domain X ⊂ Z^2 and the time domain T ⊂ Z, respectively. The frame of the video z at time t is denoted by z(X, t).
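As a concrete illustration of the observation model (1), the following NumPy sketch corrupts a clip with i.i.d. white Gaussian noise; the clip, its shape, and the helper name are placeholders rather than anything prescribed by the paper.

```python
import numpy as np

def add_awgn(y, sigma, seed=0):
    """Observation model (1): z = y + eta, with eta ~ N(0, sigma^2) i.i.d."""
    rng = np.random.default_rng(seed)
    return y + rng.normal(0.0, sigma, size=y.shape)

# y: a grayscale video stored as a (T, H, W) array with intensities in [0, 255]
y = np.zeros((30, 144, 176))      # placeholder clip (QCIF-sized frames)
z = add_awgn(y, sigma=20.0)       # noisy observation z(x, t)
```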
The V-BM4D algorithm comprises three fundamental steps
inherited from the BM3D paradigm, specifically grouping
(Section II-C), collaborative filtering (Section II-D) and ag-
gregation (Section II-E). These steps are performed for each
spatiotemporal volume of the video (Section II-B).
B. Spatiotemporal Volumes
Let B_z(x_0, t_0) denote a square block of fixed size N × N extracted from the noisy video z; without loss of generality, the coordinates (x_0, t_0) identify the top-left pixel of the block in the frame z(X, t_0). A spatiotemporal volume is a 3-D sequence of blocks built following a specific trajectory along time, which is supposed to follow the motion in the scene. Formally, the trajectory associated to (x_0, t_0) is defined as

Traj(x_0, t_0) = { (x_j, t_0 + j) }_{j = -h^-}^{h^+},   (2)

where the elements (x_j, t_0 + j) are time-consecutive coordinates, each of which represents the position of the reference block B_z(x_0, t_0) within the neighboring frame z(X, t_0 + j), j = -h^-, ..., h^+. For the sake of simplicity, in this section it is assumed that h^- = h^+ = h for all (x, t) ∈ X × T.
The trajectories can be either directly computed from the noisy video or, when a coded video is given, obtained by concatenating motion vectors. In what follows we assume that, for each (x_0, t_0) ∈ X × T, a trajectory Traj(x_0, t_0) is given, and thus the 3-D spatiotemporal volume associated to (x_0, t_0) can be determined as

V_z(x_0, t_0) = { B_z(x_i, t_i) : (x_i, t_i) ∈ Traj(x_0, t_0) },   (3)

where the subscript z specifies that the volumes are extracted from the noisy video.
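To make the data structures of (2)-(3) concrete, the sketch below stacks the N × N blocks visited by a trajectory into a 3-D spatiotemporal volume. The (x, y, t) coordinate layout and the helper name are illustrative choices, not taken from the paper.

```python
import numpy as np

def extract_volume(video, trajectory, N):
    """Stack the N x N blocks visited by a trajectory into a 3-D spatiotemporal volume.
    video:      (T, H, W) array
    trajectory: list of (x, y, t), with (x, y) the top-left column/row of the block in frame t
    """
    return np.stack([video[t, y:y + N, x:x + N] for (x, y, t) in trajectory], axis=0)

# toy example: a static trajectory over three consecutive frames
video = np.arange(3 * 8 * 8, dtype=float).reshape(3, 8, 8)
traj = [(2, 2, 0), (2, 2, 1), (2, 2, 2)]
V = extract_volume(video, traj, N=4)    # shape (3, 4, 4): (time, block rows, block cols)
```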
C. Grouping
Groups are stacks of mutually similar volumes and consti-
tute the nonlocal element of V-BM4D. Mutually similar vol-
umes are determined by a nonlocal search procedure as in [10].

Specifically, let Ind(x_0, t_0) be the set of indices identifying those volumes that, according to a distance operator δ_v, are similar to V_z(x_0, t_0):

Ind(x_0, t_0) = { (x_i, t_i) : δ_v(V_z(x_0, t_0), V_z(x_i, t_i)) < τ_match }.

The parameter τ_match > 0 controls the minimum degree of similarity among volumes with respect to the distance δ_v, which is typically the ℓ2-norm of the difference between two volumes.
The group associated to the reference volume V_z(x_0, t_0) is then

G_z(x_0, t_0) = { V_z(x_i, t_i) : (x_i, t_i) ∈ Ind(x_0, t_0) }.   (4)

In (4) we implicitly assume that the 3-D volumes are stacked along a fourth dimension; hence the groups are 4-D data structures. The order of the spatiotemporal volumes in the 4-D stacks is based on their similarity with the reference volume. Note that, since δ_v(V_z, V_z) = 0, every group G_z(x_0, t_0) contains at least its reference volume V_z(x_0, t_0). Figure 1 shows an example of trajectories and volumes belonging to a group.
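A minimal sketch of the grouping step, assuming the candidate volumes have already been extracted and share the same length: distances δ_v to the reference volume are computed, thresholded by τ_match, and the surviving volumes are stacked along a fourth dimension in order of increasing distance. All names are illustrative.

```python
import numpy as np

def group_volumes(volumes, ref_idx, tau_match):
    """Indices of volumes whose length-normalized squared l2 distance from the
    reference volume is below tau_match, sorted by increasing distance."""
    ref = volumes[ref_idx]
    dists = [np.sum((ref - v) ** 2) / ref.shape[0] for v in volumes]
    return [i for i in np.argsort(dists) if dists[i] < tau_match]  # reference (distance 0) comes first

volumes = [np.random.rand(5, 8, 8) for _ in range(20)]             # same-length candidate volumes
idx = group_volumes(volumes, ref_idx=0, tau_match=2.0)
group = np.stack([volumes[i] for i in idx], axis=0)                # 4-D group: (volume, time, row, col)
```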
D. Collaborative Filtering
According to the general formulation of the grouping and
collaborative-filtering approach for a d-dimensional signal
[10], groups are (d + 1)-dimensional structures of similar
d-dimensional elements, which are then jointly filtered. In
particular, each of the grouped elements influences the filtered
output of all the other elements of the group: this is the basic
idea of collaborative filtering. It is typically realized through
the following steps: firstly a (d + 1)-dimensional separable
linear transform is applied to the group, then the transformed
coefficients are shrunk, for example by hard thresholding or by
Wiener filtering, and finally the (d+1)-dimensional transform
is inverted to obtain an estimate for each grouped element.
The core elements of V-BM4D are the spatiotemporal volumes (d = 3), and thus the collaborative filtering performs a 4-D separable linear transform T_4D on each 4-D group G_z(x_0, t_0), and provides an estimate for each grouped volume V_z:

Ĝ_y(x_0, t_0) = T_4D^{-1}( Υ( T_4D(G_z(x_0, t_0)) ) ),

where Υ denotes a generic shrinkage operator. The filtered 4-D group Ĝ_y(x_0, t_0) is composed of volumes V̂_y(x, t),

Ĝ_y(x_0, t_0) = { V̂_y(x_i, t_i) : (x_i, t_i) ∈ Ind(x_0, t_0) },

with each V̂_y being an estimate of the corresponding unknown volume V_y in the original video y.
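The sketch below illustrates collaborative filtering of one 4-D group with hard thresholding as the shrinkage operator Υ. Since this section does not fix the particular separable transform, a plain 4-D DCT stands in for T_4D; the example is only meant to show the transform-shrink-invert pattern.

```python
import numpy as np
from scipy.fft import dctn, idctn

def collaborative_filter(group, threshold):
    """Shrinkage of a 4-D group in a separable 4-D DCT domain.
    group: (M, L, N, N) stack of M spatiotemporal volumes of length L."""
    spectrum = dctn(group, norm='ortho')            # separable transform along all four axes
    spectrum[np.abs(spectrum) < threshold] = 0.0    # hard thresholding (generic shrinkage)
    return idctn(spectrum, norm='ortho')            # estimates for all grouped volumes at once

group = np.random.rand(4, 5, 8, 8)                  # 4 volumes of 5 blocks of size 8x8
filtered_group = collaborative_filter(group, threshold=0.5)
```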
E. Aggregation
The groups Ĝ_y constitute a very redundant representation of the video, because in general the volumes V̂_y overlap and, within the overlapping parts, the collaborative filtering provides multiple estimates at the same coordinates (x, t). For this reason, the estimates are aggregated through a convex combination with adaptive weights. In particular, the estimate ŷ of the original video is computed as

ŷ = ( Σ_{(x_0,t_0) ∈ X×T} Σ_{(x_i,t_i) ∈ Ind(x_0,t_0)} w_{(x_0,t_0)} V̂_y(x_i, t_i) ) / ( Σ_{(x_0,t_0) ∈ X×T} Σ_{(x_i,t_i) ∈ Ind(x_0,t_0)} w_{(x_0,t_0)} χ_{(x_i,t_i)} ),   (5)

where we assume V̂_y(x_i, t_i) to be zero-padded outside its domain, χ_{(x_i,t_i)} : X × T → {0, 1} is the characteristic function (indicator) of the support of the volume V̂_y(x_i, t_i), and the aggregation weights w_{(x_0,t_0)} are different for different groups. The aggregation weights may depend on the result of the shrinkage in the collaborative filtering, and are typically defined to be inversely proportional to the total sample variance of the estimate of the corresponding group [10]. Intuitively, the sparser the shrunk 4-D spectrum Ĝ_y(x_0, t_0), the larger the corresponding weight w_{(x_0,t_0)}. Such aggregation is a well-established procedure to obtain a global estimate from different overlapping local estimates [22], [23].
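In practice, (5) is usually realized with two accumulation buffers, one for the weighted estimates and one for the weights multiplied by the support indicators; the sketch below shows this pattern with hypothetical names and toy slices.

```python
import numpy as np

def aggregate(shape, estimates):
    """Convex combination (5). `estimates` is a list of (volume_estimate, slices, weight),
    where `slices` locates the volume inside the (T, H, W) video array."""
    num = np.zeros(shape)
    den = np.zeros(shape)
    for vol, slices, w in estimates:
        num[slices] += w * vol           # weighted estimate, zero-padded outside its support
        den[slices] += w                 # weight times the indicator of the support
    return num / np.maximum(den, 1e-12)  # pixels never covered by any volume stay zero

video_shape = (10, 64, 64)
est = [(np.ones((3, 8, 8)), (slice(0, 3), slice(8, 16), slice(8, 16)), 0.7),
       (np.zeros((3, 8, 8)), (slice(1, 4), slice(10, 18), slice(8, 16)), 0.3)]
y_hat = aggregate(video_shape, est)
```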
III. IMPLEMENTATION ASPECTS
A. Computation of the Trajectories
In our implementation of V-BM4D, we construct trajectories
by concatenating motion vectors which are defined as follows.
1) Location prediction: When two consecutive spatiotemporal locations (x_{i-1}, t_i - 1) and (x_i, t_i) of a block are known, we can define the corresponding motion vector (velocity) as v(x_i, t_i) = x_{i-1} - x_i. Hence, under the assumption of smooth motion, we can predict the position x̂_i(t_i + 1) of the block in the frame z(X, t_i + 1) as

x̂_i(t_i + 1) = x_i + γ_p · v(x_i, t_i),   (6)

where γ_p ∈ [0, 1] is a weighting factor of the prediction. In case (x_{i-1}, t_i - 1) is not available, we consider the lack of motion as the most likely situation and set x̂_i(t_i + 1) = x_i. Analogous predictions can be made when looking for preceding blocks in the sequence.
2) Similarity criterion: The motion of a block is generally tracked by identifying the most similar block in the subsequent or preceding frame. However, since we deal with noisy signals, it is advisable to enforce motion-smoothness priors to improve the tracking. In particular, given the predicted future x̂_i(t_i + 1) or past x̂_i(t_i - 1) position of the block B_z(x_i, t_i), we define the similarity between B_z(x_i, t_i) and B_z(x_j, t_i ± 1) through a penalized quadratic difference

δ_b( B_z(x_i, t_i), B_z(x_j, t_i ± 1) ) = ||B_z(x_i, t_i) - B_z(x_j, t_i ± 1)||_2^2 / N^2 + γ_d ||x̂_i(t_i ± 1) - x_j||_2,   (7)

where x̂_i(t_i ± 1) is defined as in (6), and γ_d ∈ R^+ is the penalization parameter. Observe that the tracking is performed separately forward in time, towards t_i + 1, and backward, towards t_i - 1.
V-BM4D constructs the trajectory (2) by repeatedly minimizing (7). Formally, the motion of B_z(x_i, t_i) from time t_i to t_i ± 1 is determined by the position x_{i±1} that minimizes (7):

x_{i±1} = arg min_{x_k ∈ N_i} δ_b( B_z(x_i, t_i), B_z(x_k, t_i ± 1) ),

where N_i is an adaptive spatial search neighborhood in the frame z(X, t_i ± 1) (further details are given in Section III-A3). Even though such an x_{i±1} can always be found, we stop the trajectory construction whenever the corresponding minimum distance δ_b exceeds a fixed parameter τ_traj ∈ R^+, which imposes a minimum amount of similarity along the spatiotemporal volumes. This allows V-BM4D to effectively deal with those situations, such as occlusions and changes of scene, where consistent blocks (in terms of both similarity and motion smoothness) cannot be found.

Fig. 2. Effect of different penalties γ_d = 0.025 (left) and γ_d = 0 (right) on the background textures of the sequence Tennis corrupted by Gaussian noise with σ = 20. The block positions at time t = 1 are the same in both experiments.
Figure 2 illustrates two trajectories estimated using different penalization parameters γ_d. Observe that the penalization term becomes essential when blocks are tracked within flat areas or homogeneous textures in the scene. In fact, the right image of Figure 2 shows that, without a position-dependent distance metric, the trajectories would be mainly determined by the noise. As a consequence, the collaborative filtering would be less effective because of the badly conditioned temporal correlation of the data within the volumes.
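A compact sketch of one tracking step combining the prediction (6) with the penalized distance (7): the candidate minimizing δ_b inside a small window around the predicted position is selected, and the caller interrupts the trajectory when the minimum exceeds τ_traj. The (row, col) coordinate convention and the exhaustive search are illustrative choices.

```python
import numpy as np

def next_position(video, x_i, t_i, v, N, gamma_p, gamma_d, radius):
    """One forward tracking step from frame t_i to t_i + 1.
    x_i: (row, col) of the block's top-left pixel; v: previous motion vector (rows, cols)."""
    T, H, W = video.shape
    ref = video[t_i, x_i[0]:x_i[0] + N, x_i[1]:x_i[1] + N]
    pred = (int(round(x_i[0] + gamma_p * v[0])), int(round(x_i[1] + gamma_p * v[1])))
    best, best_d = None, np.inf
    for r in range(max(0, pred[0] - radius), min(H - N, pred[0] + radius) + 1):
        for c in range(max(0, pred[1] - radius), min(W - N, pred[1] + radius) + 1):
            cand = video[t_i + 1, r:r + N, c:c + N]
            d = np.sum((ref - cand) ** 2) / N**2 \
                + gamma_d * np.hypot(r - pred[0], c - pred[1])   # penalized distance (7)
            if d < best_d:
                best, best_d = (r, c), d
    return best, best_d   # the caller stops the trajectory if best_d exceeds tau_traj

video = np.random.rand(2, 32, 32)
pos, dist = next_position(video, x_i=(8, 8), t_i=0, v=(0, 0), N=8,
                          gamma_p=0.3, gamma_d=0.025, radius=5)
```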
3) Search neighborhood: Because of the penalty term γ_d ||x̂_i(t_i ± 1) - x_j||_2, the minimizer of (7) is likely to be close to x̂_i(t_i ± 1). Thus, we can restrict the minimization of (7) to a spatial search neighborhood N_i centered at x̂_i(t_i ± 1). We found it convenient to make the search-neighborhood size, N_PR × N_PR, adaptive to the velocity of the tracked block (the magnitude of the motion vector) by setting

N_PR = N_S · ( 1 - γ_w · e^{ -||v(x_i, t_i)||_2^2 / (2·σ_w^2) } ),

where N_S is the maximum size of N_i, γ_w ∈ [0, 1] is a scaling factor, and σ_w > 0 is a tuning parameter. As the velocity v increases, N_PR approaches N_S according to σ_w; conversely, when the velocity is zero, N_PR = N_S·(1 - γ_w). By setting a proper value of σ_w we can control the decay rate of the exponential term as a function of v or, in other words, how permissive the window contraction is with respect to the velocity of the tracked block.
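The velocity-adaptive window size can be evaluated directly from the expression above; a minimal sketch with placeholder parameter values:

```python
import numpy as np

def window_size(v, N_S, gamma_w, sigma_w):
    """Velocity-adaptive side of the search neighborhood: close to N_S*(1 - gamma_w)
    for still blocks, approaching N_S as the block moves faster."""
    speed_sq = float(np.dot(v, v))                       # ||v(x_i, t_i)||^2
    return N_S * (1.0 - gamma_w * np.exp(-speed_sq / (2.0 * sigma_w ** 2)))

print(window_size(np.array([0.0, 0.0]), N_S=11, gamma_w=0.6, sigma_w=1.0))   # contracted window
print(window_size(np.array([4.0, 3.0]), N_S=11, gamma_w=0.6, sigma_w=1.0))   # close to N_S
```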
B. Sub-volume Extraction
So far, the number of frames spanned by all the trajectories has been assumed fixed and equal to h. However, because of occlusions, scene changes, or heavy noise, any trajectory Traj(x_i, t_i) can be interrupted at any time, i.e., whenever the distance between consecutive blocks exceeds the threshold τ_traj. Thus, given a temporal extent [t_i - h_i^-, t_i + h_i^+] for the trajectory Traj(x_i, t_i), we have that in general 0 ≤ h_i^- ≤ h and 0 ≤ h_i^+ ≤ h, where h denotes the maximum forward and backward extent of the trajectories (thus of the volumes) allowed in the algorithm.

As a result, in principle, V-BM4D may stack together volumes having different lengths. However, in practice, because of the separability of the transform T_4D, every group G_z(x_i, t_i) has to be composed of volumes having the same length. Thus, for each reference volume V_z(x_0, t_0), we only consider the volumes V_z(x_i, t_i) such that t_i = t_0, h_i^- ≥ h_0^- and h_i^+ ≥ h_0^+. Then, we extract from each V_z(x_i, t_i) the sub-volume having temporal extent [t_0 - h_0^-, t_0 + h_0^+], denoted as E_{L_0}(V_z(x_i, t_i)). Among all the possible criteria for extracting a sub-volume of length L_0 = h_0^- + h_0^+ + 1 from a longer volume, our choice aims at limiting the complexity while maintaining a high correlation within the grouped volumes, because we can reasonably assume that similar objects at different positions are represented by similar volumes along time.
In the grouping, we set as distance operator δ_v the ℓ2-norm of the difference between time-synchronous volumes, normalized with respect to their length:

δ_v( V_z(x_0, t_0), V_z(x_i, t_i) ) = || V_z(x_0, t_0) - E_{L_0}(V_z(x_i, t_i)) ||_2^2 / L_0.   (8)
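A sketch of the distance (8), assuming the candidate volume is at least as long as the reference one on both sides (h_i^- ≥ h_0^-, h_i^+ ≥ h_0^+) so that the time-synchronous sub-volume E_{L_0}(V_z(x_i, t_i)) can simply be cropped out; the indexing convention is our own.

```python
import numpy as np

def volume_distance(V_ref, V_other, h_minus_ref, h_plus_ref, h_minus_other):
    """Length-normalized squared l2 distance (8) between the reference volume and the
    time-synchronous sub-volume cropped from a (possibly longer) candidate volume.
    Volumes are (length, N, N) arrays whose first index runs over time."""
    L0 = h_minus_ref + h_plus_ref + 1
    start = h_minus_other - h_minus_ref       # align the candidate with [t0 - h0-, t0 + h0+]
    sub = V_other[start:start + L0]
    return np.sum((V_ref - sub) ** 2) / L0

V_ref = np.random.rand(5, 8, 8)               # h0- = h0+ = 2, so L0 = 5
V_cand = np.random.rand(7, 8, 8)              # longer candidate with hi- = hi+ = 3
d = volume_distance(V_ref, V_cand, h_minus_ref=2, h_plus_ref=2, h_minus_other=3)
```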
C. Two-Stage Implementation with Collaborative Wiener Filtering
The general procedure described in Section II is implemented in two cascaded stages, each composed of the grouping, collaborative filtering, and aggregation steps.
1) Hard-thresholding stage: In the first stage, volumes are extracted from the noisy video z, and groups are then formed using the δ_v-operator (8) and the predefined threshold τ_match^ht. Collaborative filtering is realized by hard thresholding each group G_z(x_0, t_0) in the 4-D transform domain:

Ĝ_y^ht(x_0, t_0) = (T_4D^ht)^{-1}( Υ^ht( T_4D^ht(G_z(x_0, t_0)) ) ),   (x_0, t_0) ∈ X × T,

where T_4D^ht is the 4-D transform and Υ^ht is the hard-threshold operator with threshold σλ_4D.
The outcome of the hard-thresholding stage, ŷ^ht, is obtained by aggregating with a convex combination all the estimated groups Ĝ_y^ht, as defined in (5). The adaptive weights used in this combination are inversely proportional to the number N_{(x_0,t_0)}^ht of non-zero coefficients of the corresponding hard-thresholded group Ĝ_y^ht(x_0, t_0), that is, w_{(x_0,t_0)}^ht = 1/N_{(x_0,t_0)}^ht, which provides an estimate of the total variance of Ĝ_y^ht(x_0, t_0). In this way, we assign larger weights to the volumes belonging to groups having a sparser representation in the T_4D domain.
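A sketch of the first-stage shrinkage together with its aggregation weight w^ht = 1/N^ht, i.e., the reciprocal of the number of retained coefficients. The 4-D DCT and the numerical value of λ_4D are placeholders, since neither is fixed in this section.

```python
import numpy as np
from scipy.fft import dctn, idctn

def hard_threshold_estimate(group, sigma, lambda_4d):
    """Hard-threshold a 4-D group and return its estimate with the weight 1/N_nonzero,
    so that groups with sparser spectra receive larger aggregation weights."""
    spectrum = dctn(group, norm='ortho')
    mask = np.abs(spectrum) >= sigma * lambda_4d     # coefficients surviving the threshold
    n_retained = max(int(mask.sum()), 1)
    estimate = idctn(spectrum * mask, norm='ortho')
    return estimate, 1.0 / n_retained

group = np.random.rand(4, 5, 8, 8)
estimate, weight = hard_threshold_estimate(group, sigma=0.1, lambda_4d=2.7)  # lambda_4d is a placeholder value
```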
2) Wiener-filtering stage: In the second stage, the motion estimation is improved by extracting new trajectories Traj_{ŷ^ht} from the basic estimate ŷ^ht, and the grouping is performed on the new volumes V_{ŷ^ht}. Volume matching is still performed through the δ_v-distance, but using a different threshold τ_match^wie. The indices identifying similar volumes, Ind_{ŷ^ht}(x, t), are used to construct both the groups G_z and G_{ŷ^ht}, composed of volumes extracted from the noisy video z and from the estimate ŷ^ht, respectively.
Collaborative filtering is hence performed using an empirical Wiener filter in the T_4D^wie transform domain. Shrinkage is realized by scaling the 4-D transform coefficients of each group G_z(x_0, t_0), extracted from the noisy video z, with the Wiener attenuation coefficients

W(x_0, t_0) = | T_4D^wie( G_{ŷ^ht}(x_0, t_0) ) |^2 / ( | T_4D^wie( G_{ŷ^ht}(x_0, t_0) ) |^2 + σ^2 ),

which are computed from the energy of the 4-D spectrum of the group G_{ŷ^ht}(x_0, t_0). Eventually, the group estimate is obtained by inverting the 4-D transform as

Ĝ_y^wie(x_0, t_0) = (T_4D^wie)^{-1}( W(x_0, t_0) · T_4D^wie( G_z(x_0, t_0) ) ),

where · denotes the element-wise product. The final global estimate ŷ^wie is computed by the aggregation (5), using the weights w_{(x_0,t_0)}^wie = ||W(x_0, t_0)||_2^{-2}, which follow from considerations similar to those underlying the adaptive weights used in the first stage.

Fig. 3. V-BM4D two-stage denoising of the sequence Coastguard. From left to right: original video y, noisy video z (σ = 40), result of the first stage ŷ^ht (frame PSNR 28.58 dB), and final estimate ŷ^wie (frame PSNR 29.38 dB).
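The Wiener stage can be sketched as follows, assuming the groups from the noisy video and from the basic estimate have already been built with the same indices Ind_{ŷ^ht}; a 4-D DCT again stands in for T_4D^wie.

```python
import numpy as np
from scipy.fft import dctn, idctn

def wiener_stage(group_noisy, group_basic, sigma):
    """Empirical Wiener shrinkage: attenuation computed from the basic estimate's spectrum,
    applied to the noisy group's spectrum. Also returns the aggregation weight ||W||_2^(-2)."""
    S_basic = dctn(group_basic, norm='ortho')
    W = S_basic ** 2 / (S_basic ** 2 + sigma ** 2)          # Wiener attenuation coefficients
    filtered = idctn(W * dctn(group_noisy, norm='ortho'), norm='ortho')
    weight = 1.0 / np.sum(W ** 2)
    return filtered, weight

g_noisy = np.random.rand(4, 5, 8, 8)                         # group from the noisy video z
g_basic = g_noisy + 0.01 * np.random.rand(4, 5, 8, 8)        # stand-in for the basic estimate's group
filtered, w = wiener_stage(g_noisy, g_basic, sigma=0.1)
```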
D. Settings
The parameters involved in the motion estimation and in the grouping, namely γ_d, τ_traj and τ_match, depend on the noise standard deviation σ. Intuitively, in order to compensate for the effects of the noise, the larger σ is, the larger the thresholds controlling block and volume matching become. For the sake of simplicity, we model such dependencies as second-order polynomials in σ: γ_d(σ), τ_traj(σ) and τ_match(σ). The nine coefficients required to describe the three polynomials are jointly optimized using the Nelder-Mead simplex direct-search algorithm [24], [25]. As optimization criterion, we maximize the sum of the restoration performance (PSNR) of V-BM4D applied over a collection of test videos corrupted by synthetic noise having different values of σ. Namely, we considered Salesman, Tennis, Flower Garden, Miss America, Coastguard, Foreman, Bus, and Bicycle corrupted by white Gaussian noise with σ ranging from 5 to 70. The resulting polynomials are

γ_d(σ) = 0.0005 · σ^2 - 0.0059 · σ + 0.0400,   (9)
τ_traj(σ) = 0.0047 · σ^2 + 0.0676 · σ + 0.4564,   (10)
τ_match(σ) = 0.0171 · σ^2 + 0.4520 · σ + 47.9294.   (11)

The solid lines in Figure 4 show the above functions. We also plot, using different markers, the best values of the three parameters obtained by unconstrained and independent optimizations of V-BM4D for each test video and value of σ. Empirically, the polynomials provide a good approximation of the optimum (γ_d, τ_traj, τ_match). Within the considered σ range, the curve (9) is practically monotone increasing despite its negative first-degree coefficient. We refrain from introducing additional constraints on the polynomials, as well as from considering σ values smaller than 5, because the resulting sequences would be mostly affected by the noise and quantization artifacts intrinsic in the original test data.
During the second stage (namely, the Wiener filtering), the parameters γ_d, τ_traj and τ_match can be considered constants, independent of σ, because in the processed sequence ŷ^ht the noise has been considerably reduced with respect to the observation z; this is evident when comparing the second and third images of Figure 3. Moreover, since in this stage both the trajectories and the groups are determined from the basic estimate ŷ^ht, there is no straightforward relation with σ, the noise standard deviation in z.
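The fitted polynomials (9)-(11) can be evaluated directly; a minimal sketch:

```python
def vbm4d_parameters(sigma):
    """Second-order polynomials (9)-(11) mapping the noise level sigma to the
    first-stage matching parameters."""
    gamma_d   = 0.0005 * sigma**2 - 0.0059 * sigma + 0.0400
    tau_traj  = 0.0047 * sigma**2 + 0.0676 * sigma + 0.4564
    tau_match = 0.0171 * sigma**2 + 0.4520 * sigma + 47.9294
    return gamma_d, tau_traj, tau_match

print(vbm4d_parameters(20))   # parameters for sigma = 20
```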
IV. DEBLOCKING
Most video compression techniques, such as MPEG-4 [26]
or H.264 [27], make use of block-transform coding and thus
may suffer, especially at low bitrates, from several com-
pression artifacts such as blocking, ringing, mosquito noise,
and flickering. These artifacts are mainly due to the coarse
quantization of the block-transform coefficients and to the
motion compensation. Moreover, since each block is processed
separately, the correlation between pixels at the borders of
neighboring blocks is typically lost during the compression,
resulting in false discontinuities in the decoded video (such as
those shown in the blocky frames in Figure 8).
A large number of deblocking filters have been proposed
in the last decade; among them we mention frame-based en-
hancement using a linear low-pass filter in spatial or transform
domain [28], projection onto convex sets (POCS) methods
[29], spatial block boundary filter [30], statistical modeling
methods [31] or shifted thresholding [32]. Additionally, most modern block-based video coding techniques, such as H.264 or MPEG-4, embed an in-loop deblocking filter as an additional processing step in the decoder [26].
Inspired by [33], we treat the blocking artifacts as additive
noise. This choice allows us to model the compressed video
z as in (1), where y now corresponds to the original uncom-
pressed video, and η represents the compression artifacts. In
what follows, we focus our attention on MPEG-4 compressed
videos. In this way, the proposed filter can be applied reliably
over different types of data degradations with little need of
adjustment or user intervention.
In order to use V-BM4D as a deblocking filter, we need
to determine a suitable value of σ to handle the artifacts
in a compressed video. To this purpose, we proceed as in
the previous section and we identify the optimum value of
σ for a set of test sequences compressed at various rates.
Figure 5 shows these optimum values plotted against the
average bit-per-pixel (bpp) rate of the compressed video and
the parameter q that controls the quantization of the block-
transform coefficients [26] (Figure 5(a)). Let us observe that
both the bpp and q parameters are easily accessible from
any given MPEG-4 coded video. These plots suggest that a
power law may conveniently explain the relation between the
optimum value of σ and both the bpp rate and q. Hence, we fit
such a bivariate function to the optimum values via least-squares regression, obtaining the adaptive value of σ for the V-BM4D deblocking filter as

σ(bpp, q) = 0.09 · q^{1.11} · bpp^{-0.46} + 3.37.   (12)
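For a given MPEG-4 stream, (12) can be evaluated directly from the bpp rate and the quantization parameter q; a minimal sketch with placeholder inputs:

```python
def deblocking_sigma(bpp, q):
    """Adaptive noise level (12) used when running V-BM4D as a deblocking filter."""
    return 0.09 * q ** 1.11 * bpp ** (-0.46) + 3.37

print(deblocking_sigma(bpp=0.2, q=10))   # coarser quantization / lower bpp -> larger sigma
```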
The function σ(bpp, q) is shown in Figure 5 (right). Note that
in MPEG-4 the parameter q ranges from 2 to 31, where higher
values correspond to a coarser quantization and consequently
lower bitrates. As a matter of fact, when q increases and/or
bpp decreases, the optimum σ increases, in order to effectively
cope with stronger blocking artifacts. Clearly, a much larger

References (partial list)
J. A. Nelder and R. Mead, "A simplex method for function minimization," The Computer Journal, 1965.
T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Transactions on Circuits and Systems for Video Technology, 2003.
K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, "Image denoising by sparse 3-D transform-domain collaborative filtering," IEEE Transactions on Image Processing, 2007.
J. C. Lagarias, J. A. Reeds, M. H. Wright, and P. E. Wright, "Convergence properties of the Nelder-Mead simplex method in low dimensions," SIAM Journal on Optimization, 1998.
A. Buades, B. Coll, and J.-M. Morel, "A review of image denoising algorithms, with a new one," Multiscale Modeling & Simulation, 2005.