Restoring An Image Taken Through a Window Covered with Dirt or Rain
David Eigen Dilip Krishnan Rob Fergus
Dept. of Computer Science, Courant Institute, New York University
{deigen,dilip,fergus}@cs.nyu.edu
Abstract
Photographs taken through a window are often compro-
mised by dirt or rain present on the window surface. Com-
mon cases of this include pictures taken from inside a ve-
hicle, or outdoor security cameras mounted inside a pro-
tective enclosure. At capture time, defocus can be used to
remove the artifacts, but this relies on achieving a shallow
depth-of-field and placement of the camera close to the win-
dow. Instead, we present a post-capture image processing
solution that can remove localized rain and dirt artifacts
from a single image. We collect a dataset of clean/corrupted
image pairs which are then used to train a specialized form
of convolutional neural network. This learns how to map
corrupted image patches to clean ones, implicitly capturing
the characteristic appearance of dirt and water droplets in
natural images. Our models demonstrate effective removal
of dirt and rain in outdoor test conditions.
1. Introduction
There are many situations in which images or video
might be captured through a window. A person may be
inside a car, train or building and wish to photograph the
scene outside. Indoor situations include exhibits in muse-
ums displayed behind protective glass. Such scenarios have
become increasingly common with the widespread use of
smartphone cameras. Beyond consumer photography, many
cameras are mounted outside, e.g. on buildings for surveil-
lance or on vehicles to prevent collisions. These cameras
are protected from the elements by an enclosure with a
transparent window.
Such images are affected by many factors including re-
flections and attenuation. However, in this paper we address
the particular situation where the window is covered with
dirt or water drops, resulting from rain. As shown in Fig. 1,
these artifacts significantly degrade the quality of the cap-
tured image.
The classic approach to removing occluders from an im-
age is to defocus them to the point of invisibility at the time
of capture. This requires placing the camera right up against
Figure 1. A photograph taken through a glass pane covered in rain,
along with the output of our neural network model, trained to re-
move this type of corruption. The irregular size and appearance of
the rain makes it difficult to remove with existing methods. This
figure is best viewed in electronic form.
the glass and using a large aperture to produce small depth-
of-field. However, in practice it can be hard to move the
camera sufficiently close, and aperture control may not be
available on smartphone cameras or webcams. Correspond-
ingly, many shots with smartphone cameras through dirty or
rainy glass still have significant artifacts, as shown in Fig. 9.
In this paper we instead restore the image after capture,
treating the dirt or rain as a structured form of image noise.
Our method relies only on the artifacts being spatially compact, and is thus aided by the rain/dirt being in focus; hence the shots need not be taken close to the window.
Image denoising is a very well studied problem, with
current approaches such as BM3D [3] approaching theo-
retical performance limits [13]. However, the vast majority
of this literature is concerned with additive white Gaussian
noise, which is quite different from the image artifacts resulting from
dirt or water drops. Our problem is closer to shot-noise re-
moval, but differs in that the artifacts are not constrained to
single pixels and have characteristic structure. Classic ap-
proaches such as median or bilateral filtering have no way

of leveraging this structure, thus cannot effectively remove
the artifacts (see Section 5).
Our approach is to use a specialized convolutional neural
network to predict clean patches, given dirty or clean ones
as input. By asking the network to produce a clean output,
regardless of the corruption level of the input, it implicitly
must both detect the corruption and, if present, in-paint over
it. Integrating both tasks simplifies and speeds test-time op-
eration, since separate detection and in-painting stages are
avoided.
Training the models requires a large set of patch pairs
to adequately cover the space of inputs and corruption; the
gathering of this data was non-trivial and required the devel-
opment of new techniques. However, although training is
somewhat complex, test-time operation is simple: a new
image is presented to the neural network and it directly out-
puts a restored image.
1.1. Related Work
Learning-based methods have found widespread use in
image denoising, e.g. [23, 14, 16, 24]. These approaches
remove additive white Gaussian noise (AWGN) by building
a generative model of clean image patches. In this paper,
however, we focus on more complex structured corruption,
and address it using a neural network that directly maps cor-
rupt images to clean ones; this obviates the slow inference
procedures used by most generative models.
Neural networks have previously been explored for de-
noising natural images, mostly in the context of AWGN,
e.g. Jain and Seung [10], and Zhang and Salari [21]. Al-
gorithmically, the closest work to ours is that of Burger
et al. [2], which applies a large neural network to a range of
non-AWGN denoising tasks, such as salt-and-pepper noise
and JPEG quantization artifacts. Although more challeng-
ing than AWGN, the corruption is still significantly easier
than the highly variable dirt and rain drops that we address.
Furthermore, our network has important architectural dif-
ferences that are crucial for obtaining good performance on
these tasks.
Removing localized corruption can be considered a form
of blind inpainting, where the position of the corrupted re-
gions is not given (unlike traditional inpainting [5]). Dong
et al. [4] show how salt-and-pepper noise can be removed,
but the approach does not extend to multi-pixel corruption.
Recently, Xie et al. [20] showed how a neural network can
perform blind inpainting, demonstrating the removal of text
synthetically placed in an image. This work is close to ours,
but the solid-color text has quite different statistics to natu-
ral images, thus is easier to remove than rain or dirt which
vary greatly in appearance and can resemble legitimate im-
age structures. Jancsary et al. [11] denoise images with a
Gaussian conditional random field, constructed using deci-
sion trees on local regions of the input; however, they too
consider only synthetic corruptions.
Several papers explore the removal of rain from images.
Garg and Nayar [7] and Barnum et al. [1] address air-
borne rain. The former uses defocus, while the latter uses
frequency-domain filtering. Both require video sequences
rather than a single image, however. Roser and Geiger
[17] detect raindrops in single images; although they do not
demonstrate removal, their approach could be paired with
a standard inpainting algorithm. As discussed above, our
approach combines detection and inpainting.
Closely related to our application is Gu et al. [9], who
show how lens dust and nearby occluders can be removed,
but their method requires extensive calibration or a video se-
quence, as opposed to a single frame. Wilson et al. [19] and
Zhou and Lin [22] demonstrate dirt and dust removal. The
former removes defocused dust for a Mars Rover camera,
while the latter removes sensor dust using multiple images
and a physics model.
2. Approach
To restore an image from a corrupt input, we predict a
clean output using a specialized form of convolutional neu-
ral network [12]. The same network architecture is used
for all forms of corruption; however, a different network is
trained for dirt and for rain. This allows the network to tai-
lor its detection capabilities for each task.
2.1. Network Architecture
Given a noisy image x, our goal is to predict a clean image y that is close to the true clean image y*. We accomplish this using a multilayer convolutional network, y = F(x). The network F is composed of a series of layers F_l, each of which applies a linear convolution to its input, followed by an element-wise sigmoid (implemented using hyperbolic tangent). Concretely, if the number of layers in the network is L, then

    F_0(x) = x
    F_l(x) = tanh(W_l F_{l-1}(x) + b_l),   l = 1, ..., L-1
    F(x)   = (1/m) (W_L F_{L-1}(x) + b_L)
Here, x is the RGB input image, of size N × M × 3. If n_l is the output dimension at layer l, then W_l applies n_l convolutions with kernels of size p_l × p_l × n_{l-1}, where p_l is the spatial support. b_l is a vector of size n_l containing the output bias (the same bias is used at each spatial location).

While the first and last layer kernels have a nontrivial spatial component, we restrict the middle layers (2 ≤ l ≤ L-1) to use p_l = 1, i.e. they apply a linear map at each spatial location. We also element-wise divide the final output by the overlap mask^1 m to account for different amounts of kernel overlap near the image boundary. The first layer uses a "valid" convolution, while the last layer uses a "full" one (these are the same for the middle layers, since their kernels have 1 × 1 support).

^1 m = 1_K * 1_I, where 1_K is a kernel of size p_L × p_L filled with ones, and 1_I is a 2D array of ones with as many pixels as the last layer's input.
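For illustration, the overlap mask m can be computed by convolving an all-ones p_L × p_L kernel with an all-ones array the size of the last layer's input. A minimal NumPy/SciPy sketch (function name ours), assuming a 49 × 49 last-layer input and p_L = 8:

```python
import numpy as np
from scipy.signal import convolve2d

def overlap_mask(last_input_h, last_input_w, p_L=8):
    """Count, for each output pixel, how many p_L x p_L kernel placements
    of the final 'full' convolution cover it (m = 1_K * 1_I)."""
    ones_kernel = np.ones((p_L, p_L))                    # 1_K
    ones_input = np.ones((last_input_h, last_input_w))   # 1_I
    # 'full' convolution: output is (H + p_L - 1) x (W + p_L - 1)
    return convolve2d(ones_input, ones_kernel, mode="full")

# Example: a 49x49 last-layer input with p_L = 8 gives a 56x56 mask
m = overlap_mask(49, 49)
assert m.shape == (56, 56)
assert m.max() == 64   # interior pixels are covered by all 8x8 kernel placements
```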

Figure 2. A subset of rain model network weights, sorted by l2-norm. Left: first-layer filters, which act as detectors for the rain drops. Right: top-layer filters used to reconstruct the clean patch.
In our system, the input kernels' support is p_1 = 16, and the output support is p_L = 8. We use two hidden layers (i.e. L = 3), each with 512 units. As stated earlier, the middle layer kernel has support p_2 = 1. Thus, W_1 applies 512 kernels of size 16 × 16 × 3, W_2 applies 512 kernels of size 1 × 1 × 512, and W_3 applies 3 kernels of size 8 × 8 × 512.
Fig. 2 shows examples of weights learned for the rain data.
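As an illustrative sketch of this architecture (the paper's implementation was in Matlab; this PyTorch-style version, the class name, and the padding-based "full" convolution are our choices, not the authors' code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as nnF

class DirtRainNet(nn.Module):
    """Sketch of the three-layer network: 16x16 'valid' conv -> tanh,
    1x1 conv -> tanh, 8x8 'full' conv, divided element-wise by the overlap mask m."""
    def __init__(self, hidden=512, p1=16, pL=8):
        super().__init__()
        self.p1, self.pL = p1, pL
        self.conv1 = nn.Conv2d(3, hidden, kernel_size=p1)      # W_1, 'valid'
        self.conv2 = nn.Conv2d(hidden, hidden, kernel_size=1)  # W_2, 1x1
        self.conv3 = nn.Conv2d(hidden, 3, kernel_size=pL)      # W_3, applied as 'full'

    def forward(self, x):
        h = torch.tanh(self.conv1(x))
        h = torch.tanh(self.conv2(h))
        # 'full' convolution: pad by pL-1 on every side, then a valid convolution
        y = self.conv3(nnF.pad(h, [self.pL - 1] * 4))
        # overlap mask m: number of kernel placements covering each output pixel
        ones = torch.ones(1, 1, h.shape[2], h.shape[3], device=x.device)
        m = nnF.conv_transpose2d(ones, torch.ones(1, 1, self.pL, self.pL, device=x.device))
        return y / m

# With a 64x64 input this yields a 56x56 prediction, matching Section 2.2:
# net = DirtRainNet(); net(torch.rand(1, 3, 64, 64)).shape == (1, 3, 56, 56)
```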
2.2. Training
We train the weights W_l and biases b_l by minimizing the mean squared error over a dataset D = {(x_i, y*_i)} of corresponding noisy and clean image pairs. The loss is

    J(θ) = 1/(2|D|) Σ_{i∈D} ||F(x_i) − y*_i||^2

where θ = (W_1, ..., W_L, b_1, ..., b_L) are the model parameters. The pairs in the dataset D are random 64 × 64 pixel subregions of training images with and without corruption (see Fig. 4 for samples). Because the input and output kernel sizes of our network differ, the network F produces a 56 × 56 pixel prediction y_i, which is compared against the middle 56 × 56 pixels of the true clean subimage y*_i.
We minimize the loss using Stochastic Gradient Descent (SGD). The update for a single step at time t is

    θ_{t+1} ← θ_t − η_t (F(x_i) − y*_i)^T ∇_θ F(x_i)

where η_t is the learning rate hyper-parameter and i is a randomly drawn index from the training set. The gradient is further backpropagated through the network F.

We initialize the weights at all layers by randomly drawing from a normal distribution with mean 0 and standard deviation 0.001. The biases are initialized to 0. The learning rate is 0.001 with decay, so that η_t = 0.001/(1 + 5t · 10^{-7}).
We use no momentum or weight regularization.
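A minimal sketch of the corresponding initialization and plain SGD update, assuming the DirtRainNet sketch above and PyTorch autograd for ∇_θ F(x_i):

```python
import torch

def init_weights(module):
    """Weights ~ N(0, 0.001), biases 0, as described above."""
    if isinstance(module, torch.nn.Conv2d):
        torch.nn.init.normal_(module.weight, mean=0.0, std=0.001)
        torch.nn.init.zeros_(module.bias)

def sgd_step(net, x, y_star, t):
    """One plain SGD step on 1/2 ||F(x) - y*||^2 (averaged over the batch),
    with the decaying learning rate eta_t = 0.001 / (1 + 5e-7 * t)."""
    eta_t = 0.001 / (1 + 5e-7 * t)
    pred = net(x)                          # 56x56 prediction from a 64x64 input
    target = y_star[:, :, 4:-4, 4:-4]      # middle 56x56 of the clean 64x64 patch
    loss = 0.5 * ((pred - target) ** 2).mean()
    net.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in net.parameters():
            p -= eta_t * p.grad            # no momentum, no weight regularization
    return loss.item()

# net = DirtRainNet(); net.apply(init_weights)
# for t, (x, y_star) in enumerate(loader): sgd_step(net, x, y_star, t)
```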
Figure 3. Denoising near a piece of noise. (a) shows a 64 × 64 image region with dirt occluders (top), and the target ground truth clean image (bottom). (b) and (c) show the results obtained using non-convolutionally and convolutionally trained networks, respectively. The top row shows the full output after averaging. The bottom row shows the signed error of each individual patch prediction for all 8 × 8 patches obtained using a sliding window in the boxed area, displayed as a montage. The errors from the convolutionally trained network (c) are less correlated with one another compared to (b), and cancel to produce a better average.
2.3. Effect of Convolutional Architecture
A key improvement of our method over [2] is that we
minimize the error of the final image prediction, whereas [2]
minimizes the error only of individual patches. We found
this difference to be crucial to obtain good performance on
the corruption we address.
Since the middle layer convolution in our network has 1 × 1 spatial support, the network can be viewed as first patchifying the input, applying a fully-connected neural network to each patch, and averaging the resulting output patches. More explicitly, we can split the input image x into stride-1 overlapping patches {x_p} = patchify(x), and predict a corresponding clean patch y_p = f(x_p) for each x_p using a fully-connected multilayer network f. We then form the predicted image y = depatchify({y_p}) by taking the average of the patch predictions at pixels where they overlap. In this context, the convolutional network F can be expressed in terms of the patch-level network f as F(x) = depatchify({f(x_p) : x_p ∈ patchify(x)}).
In contrast to [2], our method trains the full network F, including patchification and depatchification. This drives a decorrelation of the individual predictions, which helps both to remove occluders as well as to reduce blur in the final output. To see this, consider two adjacent patches y_1 and y_2 with overlap regions y_{o1} and y_{o2}, and desired output y_o. If we were to train according to the individual predictions, the loss would minimize (y_{o1} − y_o)^2 + (y_{o2} − y_o)^2, the sum of their errors. However, if we minimize the error of their average, the loss becomes

    ||(y_{o1} + y_{o2})/2 − y_o||^2 = 1/4 [(y_{o1} − y_o)^2 + (y_{o2} − y_o)^2 + 2(y_{o1} − y_o)(y_{o2} − y_o)].

The new mixed term pushes the individual patch errors in
opposing directions, encouraging them to decorrelate.
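A quick numerical check of this identity (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
y_o = rng.normal(size=100)                       # desired overlap values
y_o1 = y_o + rng.normal(scale=0.1, size=100)     # two overlapping patch predictions
y_o2 = y_o + rng.normal(scale=0.1, size=100)

separate = np.mean((y_o1 - y_o) ** 2 + (y_o2 - y_o) ** 2)   # sum of individual errors
averaged = np.mean(((y_o1 + y_o2) / 2 - y_o) ** 2)          # error of the average
cross = np.mean(2 * (y_o1 - y_o) * (y_o2 - y_o))            # the mixed term

# identity: averaged loss = (separate errors + cross term) / 4
assert np.allclose(averaged, (separate + cross) / 4)
```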
Fig. 3 depicts this for a real example. When trained at the
patch level, as in the system described by [2], each predic-
tion leaves the same residual trace of the noise, which their
average then maintains (b). When trained with our convolu-
tional network, however, the predictions decorrelate where
not perfect, and average to a better output (c).
2.4. Test-Time Evaluation
By restricting the middle layer kernels to have 1 × 1 spa-
tial support, our method requires no synchronization un-
til the final summation in the last layer convolution. This
makes our method natural to parallelize, and it can eas-
ily be run in sections on large input images by adding
the outputs from each section into a single image output
buffer. Our Matlab GPU implementation is able to restore a
3888 × 2592 color image in 60 s using an NVIDIA GTX 580,
and a 1280 × 720 color image in 7 s.
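A sketch of such sectioned evaluation, assuming a run_net function that maps a tile to its restored interior (2 × 4 pixels smaller per axis for p_1 = 16, p_L = 8) and images at least one tile in size; names and tile size are ours:

```python
import numpy as np

def restore_in_sections(x, run_net, tile=512, border=4):
    """Restore a large H x W x 3 image by running the network on overlapping
    tiles and writing each tile's output into one output buffer.
    Assumes run_net(t) returns a tile 2*border smaller than t on each axis,
    aligned with the tile's interior (as with p1=16, pL=8 -> border=4)."""
    H, W, _ = x.shape
    out = np.zeros((H, W, 3))
    step = tile - 2 * border
    for i in range(0, H, step):
        for j in range(0, W, step):
            ti, tj = min(i, H - tile), min(j, W - tile)   # clamp tiles to image bounds
            restored = run_net(x[ti:ti + tile, tj:tj + tile])
            out[ti + border:ti + tile - border,
                tj + border:tj + tile - border] = restored
    return out   # note: the outermost `border` pixels are left untouched in this sketch
```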
3. Training Data Collection
The network has 753,664 weights and 1,216 biases
which need to be set during training. This requires a large
number of training patches to avoid over-fitting. We now
describe the procedures used to gather the corrupted/clean patch pairs^2 used to train each of the dirt and rain models.

^2 The corrupt patches still have many unaffected pixels; thus even without clean/clean patch pairs in the training set, the network will still learn to preserve clean input regions.
3.1. Dirt
To train our network to remove dirt noise, we gener-
ated clean/noisy image pairs by synthesizing dirt on im-
ages. Similarly to [9], we also found that dirt noise was
well-modeled by an opacity mask and additive component,
which we extract from real dirt-on-glass panes in a lab
setup. Once we have the masks, we generate noisy images
according to
    I' = p α D + (1 − α) I

Here, I and I' are the original clean and generated noisy image, respectively. α is a transparency mask the same size as the image, and D is the additive component of the dirt, also the same size as the image. p is a random perturbation vector in RGB space, and the factors pαD are multiplied together element-wise. p is drawn from a uniform distribution over (0.9, 1.1) for each of red, green and blue, then multiplied by another random number between 0 and 1 to vary brightness. These random perturbations are necessary to capture natural variation in the corruption and to make the network robust to these changes.
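A minimal sketch of this synthesis step, assuming α and αD have already been extracted as described next (function name and the [0, 1] image range are our conventions):

```python
import numpy as np

def add_synthetic_dirt(I, alpha, alphaD, rng):
    """I' = p*alpha*D + (1 - alpha)*I, with a random RGB perturbation p.
    I: H x W x 3 clean image in [0, 1]; alpha: H x W x 1 opacity mask;
    alphaD: H x W x 3 additive dirt component (the extracted alpha*D)."""
    # per-channel perturbation in (0.9, 1.1), scaled by a random brightness in (0, 1)
    p = rng.uniform(0.9, 1.1, size=3) * rng.uniform(0.0, 1.0)
    return p * alphaD + (1.0 - alpha) * I
```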
Figure 4. Examples of clean (top row) and corrupted (bottom row) patches used for training. The dirt (left column) was added synthetically, while the rain (right column) was obtained from real image pairs.

To find α and αD, we took pictures of several slide-projected backgrounds, both with and without a dirt-on-glass pane placed in front of the camera. We then solved a linear least-squares system for α and αD at each pixel; further details are included in the supplementary material.
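The exact system is given in the supplementary material; one plausible per-pixel, per-channel formulation of this least-squares step (our reading, not the authors' code) is:

```python
import numpy as np

def estimate_dirt_masks(clean_stack, dirty_stack):
    """Per-pixel least squares for alpha and alpha*D from K aligned image pairs.
    clean_stack, dirty_stack: K x H x W arrays (one channel) of the projected
    backgrounds photographed without / with the dirt-on-glass pane.
    Model per pixel: dirty_k = (1 - alpha) * clean_k + alpha*D."""
    K, H, W = clean_stack.shape
    alpha = np.zeros((H, W))
    alphaD = np.zeros((H, W))
    for y in range(H):
        for x in range(W):
            b = clean_stack[:, y, x]
            o = dirty_stack[:, y, x]
            # o - b = alpha * (-b) + alphaD * 1  ->  solve for [alpha, alphaD]
            A = np.stack([-b, np.ones(K)], axis=1)
            sol, *_ = np.linalg.lstsq(A, o - b, rcond=None)
            alpha[y, x], alphaD[y, x] = sol
    return np.clip(alpha, 0, 1), np.clip(alphaD, 0, None)
```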
3.2. Water Droplets
Unlike the dirt, water droplets refract light around them
and are not well described by a simple additive model. We
considered using the more sophisticated rendering model
of [8], but accurately simulating outdoor illumination made
this inviable. Thus, instead of synthesizing the effects of
water, we built a training set by taking photographs of mul-
tiple scenes with and without the corruption present. For
corrupt images, we simulated the effect of rain on a window
by spraying water on a pane of anti-reflective MgF
2
-coated
glass, taking care to produce drops that closely resemble
real rain. To limit motion differences between clean and
rainy shots, all scenes contained only static objects. Further
details are provided in the supplementary material.
4. Baseline Methods
We compare our convolutional network against a non-
convolutional patch-level network similar to [2], as well as
three baseline approaches: median filtering, bilateral fil-
tering [18, 15], and BM3D [3]. In each case, we tuned
the algorithm parameters to yield the best qualitative per-
formance in terms of visibly reducing noise while keeping
clean parts of the image intact. On the dirt images, we used
an 8 × 8 window for the median filter, parameters σ_s = 3 and σ_r = 0.3 for the bilateral filter, and σ = 0.15 for BM3D. For the rain images, we used similar parameters, but adjusted for the fact that the images were downsampled by half: a 5 × 5 window for the median filter, σ_s = 2 and σ_r = 0.3 for the bilateral filter, and σ = 0.15 for BM3D.
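For reference, a sketch of the dirt-image baseline settings using off-the-shelf implementations as stand-ins (SciPy's median filter and OpenCV's bilateral filter; mapping σ_s to sigmaSpace and σ_r to sigmaColor is our interpretation, and BM3D at σ = 0.15 via its reference implementation is omitted):

```python
import numpy as np
from scipy.ndimage import median_filter
import cv2

def dirt_baselines(img):
    """img: float32 RGB image scaled to [0, 1]."""
    med = median_filter(img, size=(8, 8, 1))   # 8x8 median window, applied per channel
    bil = cv2.bilateralFilter(img, d=-1, sigmaColor=0.3, sigmaSpace=3)  # sigma_r=0.3, sigma_s=3
    return med, bil
```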

Figure 5. Example image containing dirt, and the restoration produced by our network (panels: original, our output; detail crops: original, ours, nonconvolutional, median). Note the detail preserved in high-frequency areas like the branches. The nonconvolutional network leaves behind much of the noise, while the median filter causes substantial blurring.
5. Experiments
5.1. Dirt
We tested dirt removal by running our network on pic-
tures of various scenes taken behind dirt-on-glass panes.
Neither the scenes nor the glass panes were present in the training set, ensuring that the network did not simply memorize and match exact patterns.
both real and synthetic corruption. Although the training
set was composed entirely of synthetic dirt, it was represen-
tative enough for the network to perform well in both cases.
The network was trained using 5.8 million examples
of 64 × 64 image patches with synthetic dirt, paired with
ground truth clean patches. We trained only on examples
where the variance of the clean 64 × 64 patch was at least
0.001, and also required that at least 1 pixel in the patch
had a dirt α-mask value of at least 0.03. To compare to [2],
we trained a non-convolutional patch-based network with
patch sizes corresponding to our convolution kernel sizes,
using 20 million 16 × 16 patches.
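The patch-selection rule can be summarized in a few lines (a sketch; names are ours):

```python
def keep_patch(clean_patch, alpha_patch, var_thresh=1e-3, alpha_thresh=0.03):
    """Selection rule for training examples: the clean 64x64 patch must have
    variance >= 0.001, and at least one pixel must have dirt alpha >= 0.03."""
    return clean_patch.var() >= var_thresh and alpha_patch.max() >= alpha_thresh
```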
5.1.1 Synthetic Dirt Results
We first measure quantitative performance using synthetic
dirt. The results are shown in Table 1. Here, we generated
test examples using images and dirt masks held out from the
training set, using the process described in Section 3.1. Our
convolutional network substantially outperforms its patch-based counterpart. Both neural networks are much better than the three baselines, which do not make use of the structure in the corruption that the networks learn.

    PSNR       Input   Ours    Nonconv   Median   Bilateral   BM3D
    Mean       28.93   35.43   34.52     31.47    29.97       29.99
    Std.Dev.    0.93    1.24    1.04      1.45     1.18        0.96
    Gain          -     6.50    5.59      2.53     1.04        1.06

Table 1. PSNR for our convolutional neural network, nonconvolutional patch-based network, and baselines on a synthetically generated test set of 16 images (8 scenes with 2 different dirt masks). Our approach significantly outperforms the other methods.
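PSNR here is computed in the usual way; a sketch for images scaled to [0, 1]:

```python
import numpy as np

def psnr(pred, target, peak=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, peak]."""
    mse = np.mean((pred - target) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

# "Gain" in Table 1 is a method's output PSNR minus the PSNR of the corrupt input.
```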
We also applied our network to two types of artificial
noise absent from the training set: synthetic “snow” made
from small white line segments, and “scratches” of random
cubic splines. An example region is shown in Fig. 6. In
contrast to the gain of +6.50 dB for dirt, the network leaves
these corruptions largely intact, producing near-zero PSNR
gains of -0.10 and +0.30 dB, respectively, over the same
set of images. This demonstrates that the network learns to
remove dirt specifically.
5.1.2 Dirt Results
Fig. 5 shows a real test image along with our output and the
output of the patch-based network and median filter. Be-
cause of illumination changes and movement in the scenes,
we were not able to capture ground truth images for quanti-
tative evaluation. Our method is able to remove most of the
corruption while retaining details in the image, particularly
around the branches and shutters. The non-convolutional
