ETH Zurich Research Collection (Conference Paper)
Author(s): Ignatov, Andrey; Kobyshev, Nikolay; Vanhoey, Kenneth; Timofte, Radu; Van Gool, Luc
Publication date: 2017
Permanent link: https://doi.org/10.3929/ethz-b-000203254
Rights / license: In Copyright - Non-Commercial Use Permitted
Originally published in: https://doi.org/10.1109/ICCV.2017.355
DSLR-Quality Photos on Mobile Devices with Deep Convolutional Networks
Andrey Ignatov, Nikolay Kobyshev, Kenneth Vanhoey, Radu Timofte, Luc Van Gool
ETH Zurich
andrey.ignatoff@gmail.com, {nk, vanhoey, timofter, vangool}@vision.ee.ethz.ch
Figure 1: iPhone 3GS photo enhanced to DSLR-quality by our method. Best zoomed on screen.
Abstract
Despite a rapid rise in the quality of built-in smartphone cameras, their physical limitations – small sensor size, compact lenses and the lack of specific hardware – prevent them from achieving the quality of DSLR cameras. In this work we present an end-to-end deep learning approach that bridges this gap by translating ordinary photos into DSLR-quality images. We propose learning the translation function using a residual convolutional neural network that improves both color rendition and image sharpness. Since the standard mean squared loss is not well suited for measuring perceptual image quality, we introduce a composite perceptual error function that combines content, color and texture losses. The first two losses are defined analytically, while the texture loss is learned in an adversarial fashion. We also present DPED, a large-scale dataset of real photos captured by three different phones and one high-end reflex camera. Our quantitative and qualitative assessments show that the enhanced image quality is comparable to that of DSLR-taken photos, while the method generalizes to any type of digital camera.
1 Introduction
During the last several years there has been a significant improvement in compact camera sensor quality, which has brought mobile photography to a substantially new level. Even low-end devices can now take reasonably good photos in appropriate lighting conditions, thanks to their advanced software and hardware tools for post-processing. However, when it comes to artistic quality, mobile devices still fall behind their DSLR counterparts. Larger sensors and high-aperture optics yield better photo resolution, color rendition and less noise, while additional sensors help to fine-tune shooting parameters. These physical differences create strong obstacles that make DSLR camera quality unattainable for compact mobile devices.
While a number of tools for automatic image enhancement exist, they usually focus on adjusting only global parameters such as contrast or brightness, without improving texture quality or taking image semantics into account. Besides that, they are usually based on a pre-defined set of rules that do not always consider the specifics of a particular device. Therefore, the dominant approach to photo post-processing is still manual image correction using specialized retouching software.
1.1 Related work
The problem of automatic image quality enhancement has not been addressed in its entirety in the area of computer vision, though a number of sub-tasks and related problems have already been successfully solved using deep learning techniques. Such tasks usually deal with image-to-image translation problems, and their common property is that they are targeted at removing artificially added artifacts from the original images. Among the related problems are the following:
Image super-resolution aims at restoring the original image from its downscaled version. In [4] a CNN architecture and an MSE loss are used to directly learn the low-to-high resolution mapping. It is the first CNN-based solution to achieve top performance in single-image super-resolution, comparable with non-CNN methods [20]. Subsequent works developed deeper and more complex CNN architectures (e.g., [10, 18, 16]). Currently, the best photo-realistic results on this task are achieved using a VGG-based loss function [9] and adversarial networks [12], which turned out to be efficient at recovering plausible high-frequency components.
Image deblurring/dehazing tries to remove artificially added haze or blur from images. Usually, MSE is used as the target loss function, and the proposed CNN architectures consist of 3 to 15 convolutional layers [14, 2, 6] or are bi-channel CNNs [17].
Image denoising/sparse inpainting similarly targets the removal of noise and artifacts from pictures. In [28] the authors proposed a weighted MSE together with a 3-layer CNN, while in [19] it was shown that an 8-layer residual CNN performs better when using a standard mean squared error. Among other solutions are a bi-channel CNN [29], a 17-layer CNN [26] and a recurrent CNN [24] that was reapplied several times to the produced results.
Image colorization. Here the goal is to recover colors that were removed from the original image. The baseline approach for this problem is to predict new values for each pixel based on its local description, which consists of various hand-crafted features [3]. Considerably better performance on this task was obtained using generative adversarial networks [8] or a 16-layer CNN with a multinomial cross-entropy loss function [27].
Image adjustment. A few works considered the problem of image color/contrast/exposure adjustment. In [25] the authors proposed an algorithm for automatic exposure correction using hand-designed features and predefined rules. In [23], a more general algorithm was proposed that, similarly to [3], uses a local description of image pixels to reproduce various photographic styles. A different approach was considered in [13], where images with similar content are retrieved from a database and their styles are applied to the target picture. All of these adjustments are implicitly included in our end-to-end transformation learning approach by design.
1.2 Contributions
The key challenge we face is dealing with all the aforementioned enhancements at once. Even advanced tools cannot notably improve image sharpness, texture details or small color variations that were lost by the camera sensor, so we cannot generate target enhanced photos from the existing ones. Corrupting DSLR photos and training an algorithm on the corrupted images does not work either: the solution would not generalize to real-world and very complex artifacts unless they are modeled and applied as corruptions, which is infeasible. To tackle this problem, we take a different approach: we propose to learn the transformation that modifies photos taken by a given camera into DSLR-quality ones.
Table 1: DPED camera characteristics.

Camera              | Sensor | Image size  | Photo quality
iPhone 3GS          | 3 MP   | 2048 × 1536 | Poor
BlackBerry Passport | 13 MP  | 4160 × 3120 | Mediocre
Sony Xperia Z       | 13 MP  | 2592 × 1944 | Average
Canon 70D DSLR      | 20 MP  | 3648 × 2432 | Excellent
Figure 2: The rig with the four DPED cameras from Table 1.
Thus, the goal is to learn a cross-distribution translation function, where the input distribution is defined by a given mobile camera sensor, and the target distribution by a DSLR sensor. To supervise the learning process, we create and leverage a dataset of images capturing the same scene with different cameras. Once the function is learned, it can be applied to unseen photos at will.
Our main contributions are:
- A novel approach¹ for the photo enhancement task based on learning a mapping function between photos from mobile devices and a DSLR camera. The target model is trained in an end-to-end fashion without using any additional supervision or handcrafted features.
- A new large-scale dataset of over 6K photos taken synchronously by a DSLR camera and three low-end smartphone cameras in a wide variety of conditions.
- A multi-term loss function composed of color, texture and content terms, allowing an efficient image quality estimation.
- Experiments measuring objective and subjective quality, demonstrating the advantage of the enhanced photos over the originals and, at the same time, their comparable quality with the DSLR counterparts.
The remainder of the paper is structured as follows. In Section 2 we describe the new DPED dataset. Section 3 presents our architecture and the chosen loss functions. Section 4 shows and analyzes the experimental results. Finally, Section 5 concludes the paper.

¹ https://github.com/aiff22/DPED

Figure 3: Example quadruplets of images taken synchronously by the four DPED cameras (iPhone, BlackBerry, Sony, Canon).
2 DSLR Photo Enhancement Dataset
In order to tackle the problem of translating poor-quality images captured by smartphone cameras into the superior-quality images achieved by a professional DSLR camera, we introduce a large-scale real-world dataset, namely the “DSLR Photo Enhancement Dataset” (DPED)², which can be used for the general photo quality enhancement task. DPED consists of photos taken in the wild synchronously by three smartphones and one DSLR camera. The devices used to collect the data are described in Table 1 and example quadruplets can be seen in Figure 3.
To ensure that all cameras were capturing photos simultaneously, the devices were mounted on a tripod and activated remotely by a wireless control system (see Figure 2). In total, over 22K photos were collected during 3 weeks, including 4549 photos from the Sony smartphone, 5727 from the iPhone, and 6015 photos from each of the Canon and BlackBerry cameras. The photos were taken during the daytime in a wide variety of places and in various illumination and weather conditions. They were captured in automatic mode, and we used default settings for all cameras throughout the whole collection procedure.
Matching algorithm. The synchronously captured images are not perfectly aligned since the cameras have different viewing angles and positions, as can be seen in Figure 3. To address this, we performed additional non-linear transformations resulting in a fixed-resolution image that our network takes as an input. The algorithm goes as follows (see Figure 4). First, for each (phone, DSLR) image pair, we compute and match SIFT keypoints [15] across the images. These are used to estimate a homography using RANSAC [21]. We then crop both images to the intersection part and downscale the DSLR image crop to the size of the phone crop.
² http://dped-photos.vision.ee.ethz.ch
Figure 4: Matching algorithm: an overlapping region is determined by SIFT descriptor matching, followed by a non-linear transform and a crop, resulting in two images of the same resolution representing the same scene. Here: Canon and BlackBerry images, respectively.
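The alignment step described above can be sketched with standard OpenCV primitives. This is a minimal illustration, not the authors' implementation: it warps the DSLR image onto the phone view instead of cropping both images to their intersection, and the matching thresholds are illustrative choices.

```python
import cv2
import numpy as np

def align_pair(phone_img, dslr_img):
    """Roughly align a (phone, DSLR) image pair: match SIFT keypoints,
    estimate a homography with RANSAC, and warp the DSLR image into the
    phone view so that both images cover the same scene region."""
    gray_p = cv2.cvtColor(phone_img, cv2.COLOR_BGR2GRAY)
    gray_d = cv2.cvtColor(dslr_img, cv2.COLOR_BGR2GRAY)

    sift = cv2.SIFT_create()
    kp_p, des_p = sift.detectAndCompute(gray_p, None)
    kp_d, des_d = sift.detectAndCompute(gray_d, None)

    # Match descriptors and keep reliable correspondences (Lowe's ratio test).
    matches = cv2.BFMatcher().knnMatch(des_d, des_p, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]

    src = np.float32([kp_d[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_p[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)

    # Homography from DSLR to phone coordinates, estimated with RANSAC.
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)

    h, w = phone_img.shape[:2]
    dslr_aligned = cv2.warpPerspective(dslr_img, H, (w, h))
    return phone_img, dslr_aligned
```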
Training a CNN on the aligned high-resolution images is infeasible, thus patches of size 100×100 px were extracted from these photos. Our preliminary experiments revealed that larger patch sizes do not lead to better performance, while requiring considerably more computational resources. We extracted patches using a non-overlapping sliding window. The window moved in parallel along both images from each phone-DSLR image pair, and its position on the phone image was additionally adjusted by shifts and rotations based on the cross-correlation metric. To avoid significant displacements, only patches with cross-correlation greater than 0.9 were included in the dataset. Around 100 original images were reserved for testing; the rest of the photos were used for training and validation. This procedure resulted in 139K, 160K and 162K training patches and 2.4-4.3K test patches for the BlackBerry-Canon, iPhone-Canon and Sony-Canon pairs, respectively. It should be emphasized that both training and test patches are precisely matched; the potential shifts do not exceed 5 pixels. In the following we assume that these patches of size 3×100×100 constitute the input data to our CNNs.
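A minimal sketch of this patch-selection step is given below. It only implements the non-overlapping window and the cross-correlation threshold of 0.9; the additional shift and rotation refinement mentioned above is omitted, and all names are illustrative.

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation between two equally sized patches."""
    a = a.astype(np.float64).ravel() - a.mean()
    b = b.astype(np.float64).ravel() - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def extract_patch_pairs(phone, dslr, size=100, threshold=0.9):
    """Slide a non-overlapping window over an aligned (phone, DSLR) pair and
    keep only the patch pairs whose cross-correlation exceeds the threshold."""
    pairs = []
    h, w = phone.shape[:2]
    for y in range(0, h - size + 1, size):
        for x in range(0, w - size + 1, size):
            p = phone[y:y + size, x:x + size]
            d = dslr[y:y + size, x:x + size]
            if ncc(p, d) > threshold:
                pairs.append((p, d))
    return pairs
```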
3 Method
Given a low-quality photo $I_s$ (source image), the goal of the considered enhancement task is to reproduce the image $I_t$ (target image) taken by a DSLR camera. A deep residual CNN $F_W$ parameterized by weights $W$ is used to learn the underlying translation function. Given the training set $\{I_s^j, I_t^j\}_{j=1}^{N}$ consisting of $N$ image pairs, it is trained to minimize:

$$W^* = \arg\min_W \frac{1}{N} \sum_{j=1}^{N} \mathcal{L}\big(F_W(I_s^j),\, I_t^j\big), \qquad (1)$$

where $\mathcal{L}$ denotes a multi-term loss function we detail in Section 3.1. We then define the system architecture of our solution in Section 3.2.
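To make the objective in Eq. (1) concrete, here is a schematic gradient step in PyTorch-style Python. It is a sketch under the assumption that a generator network, an optimizer and the multi-term loss of Section 3.1 have already been constructed; it is not the authors' original implementation.

```python
def training_step(generator, optimizer, phone_batch, dslr_batch, loss_fn):
    """One gradient step on the weights W for the objective in Eq. (1).
    `loss_fn` plays the role of the multi-term loss L from Section 3.1."""
    optimizer.zero_grad()
    enhanced = generator(phone_batch)      # F_W(I_s^j) for a batch of patches
    loss = loss_fn(enhanced, dslr_batch)   # L(F_W(I_s^j), I_t^j), averaged over the batch
    loss.backward()
    optimizer.step()
    return loss.item()
```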

Figure 5: Fragments from the original and blurred images taken by the phone (two left-most) and DSLR (two right-most) cameras. Blurring removes high frequencies and makes color comparison easier.
3.1 Loss function
The main difficulty of the image enhancement task is that input and target photos cannot be matched densely (i.e., pixel-to-pixel): different optics and sensors cause specific local non-linear distortions and aberrations, leading to a non-constant shift of pixels between each image pair even after precise alignment. Hence, the standard per-pixel losses, besides being doubtful as a perceptual quality metric, are not applicable in our case. We build our loss function under the assumption that the overall perceptual image quality can be decomposed into three independent parts: i) color quality, ii) texture quality and iii) content quality. We now define loss functions for each component, and ensure invariance to local shifts by design.
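Since the overall loss combines these three terms, a schematic composition might look as follows; the weights are placeholders, not values from the paper, and each component loss is assumed to be a callable taking the enhanced and target batches:

```python
def total_loss(enhanced, target, color_loss, texture_loss, content_loss,
               w_color=1.0, w_texture=1.0, w_content=1.0):
    """Multi-term loss L: weighted sum of the color, texture and content
    terms defined in Sections 3.1.1-3.1.3. The weights are illustrative."""
    return (w_color * color_loss(enhanced, target)
            + w_texture * texture_loss(enhanced, target)
            + w_content * content_loss(enhanced, target))
```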
3.1.1 Color loss
To measure the color difference between the enhanced and target images, we propose applying a Gaussian blur (see Figure 5) and computing the Euclidean distance between the obtained representations. In the context of CNNs, this is equivalent to using one additional convolutional layer with a fixed Gaussian kernel followed by the mean squared error (MSE) function. Color loss can be written as:

$$\mathcal{L}_{\text{color}}(X, Y) = \|X_b - Y_b\|_2^2, \qquad (2)$$

where $X_b$ and $Y_b$ are the blurred images of $X$ and $Y$, respectively:

$$X_b(i, j) = \sum_{k,l} X(i + k, j + l) \cdot G(k, l), \qquad (3)$$

and the 2D Gaussian blur operator is given by

$$G(k, l) = A \exp\left(-\frac{(k - \mu_x)^2}{2\sigma_x} - \frac{(l - \mu_y)^2}{2\sigma_y}\right), \qquad (4)$$

where we defined $A = 0.053$, $\mu_{x,y} = 0$, and $\sigma_{x,y} = 3$.
The idea behind this loss is to evaluate the difference in brightness, contrast and major colors between the images while eliminating texture and content comparison. Hence, we fixed a constant $\sigma$ by visual inspection as the smallest value that ensures that texture and content are dropped. The crucial property of this loss is its invariance to small distortions. Figure 6 demonstrates the MSE and color losses for image pairs $(X, Y)$, where $Y$ equals $X$ shifted in a random direction by $n$ pixels. As one can see, the color loss is nearly insensitive to small distortions (≤ 2 pixels). For higher shifts (3-5 px), it is still about 5-10 times smaller than the MSE, whereas for larger displacements it demonstrates similar magnitude and behavior. As a result, the color loss forces the enhanced image to have the same color distribution as the target one, while being tolerant to small mismatches.

Figure 6: Comparison between MSE and color loss as a function of the magnitude of shift between images. Results were averaged over 50K images.
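A minimal sketch of this color loss, assuming PyTorch and a kernel built directly from Eq. (4); the kernel size of 21 is an illustrative choice, since the paper only fixes A, μ and σ:

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(size=21, sigma=3.0, amp=0.053):
    """Blur kernel following Eq. (4): G(k, l) = A·exp(-k²/(2σ) - l²/(2σ)),
    with A = 0.053, μ = 0 and σ = 3 as specified in the paper."""
    ax = torch.arange(size, dtype=torch.float32) - (size - 1) / 2.0
    g = torch.exp(-(ax ** 2) / (2.0 * sigma))
    kernel = amp * torch.outer(g, g)
    return kernel.view(1, 1, size, size)

def color_loss(x, y, kernel):
    """Blur both images channel-wise with the fixed Gaussian kernel,
    then compare them with MSE, as in Eqs. (2)-(3)."""
    c = x.shape[1]
    k = kernel.repeat(c, 1, 1, 1)        # one copy of the kernel per channel
    pad = kernel.shape[-1] // 2
    xb = F.conv2d(x, k, padding=pad, groups=c)
    yb = F.conv2d(y, k, padding=pad, groups=c)
    return F.mse_loss(xb, yb)
```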
3.1.2 Texture loss
Instead of using a pre-defined loss function, we build upon generative adversarial networks (GANs) [5] to directly learn a suitable metric for measuring texture quality. The discriminator CNN is applied to grayscale images so that it is targeted specifically on texture processing. It observes both fake (improved) and real (target) images, and its goal is to predict whether the input image is real or not. It is trained to minimize the cross-entropy loss function, and the texture loss is defined as a standard generator objective:

$$\mathcal{L}_{\text{texture}} = -\sum_i \log D\big(F_W(I_s),\, I_t\big), \qquad (5)$$

where $F_W$ and $D$ denote the generator and discriminator networks, respectively. The discriminator is pre-trained on the {phone, DSLR} image pairs, and then trained jointly with the proposed network as is conventional for GANs. It should be noted that this loss is shift-invariant by definition since no alignment is required in this case.
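As an illustration, a standard GAN formulation of this texture loss might look as follows. This is a sketch in PyTorch assuming a discriminator network is defined elsewhere; the grayscale conversion weights and the way the discriminator consumes its inputs are simplifications, not the authors' exact setup.

```python
import torch
import torch.nn.functional as F

def to_grayscale(img):
    """Luminance-style grayscale conversion of an RGB batch (N, 3, H, W)."""
    r, g, b = img[:, 0:1], img[:, 1:2], img[:, 2:3]
    return 0.299 * r + 0.587 * g + 0.114 * b

def texture_losses(discriminator, enhanced, target):
    """Adversarial texture loss on grayscale patches (Section 3.1.2).
    Returns the generator term of Eq. (5) and the discriminator's
    cross-entropy objective; D is assumed to output probabilities."""
    fake_gray = to_grayscale(enhanced)
    real_gray = to_grayscale(target)

    # Discriminator objective: distinguish real DSLR patches from enhanced ones
    # (the enhanced batch is detached so this term only updates D).
    d_fake = discriminator(fake_gray.detach())
    d_real = discriminator(real_gray)
    d_loss = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))

    # Generator (texture) loss of Eq. (5): -sum log D(F_W(I_s)).
    g_loss = -torch.log(discriminator(fake_gray) + 1e-8).sum()
    return g_loss, d_loss
```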
3.1.3 Content loss
Inspired by [9, 12], we define our content loss based on the activation maps produced by the ReLU layers of the pre-trained VGG-19 network. Instead of measuring the per-pixel difference between the images, this loss encourages them to have similar feature representations that comprise various aspects of their content and perceptual quality. In our case it is used to preserve image semantics, since the other losses do not consider it. Let $\psi_j(\cdot)$ be the feature map obtained after the $j$-th convolutional layer of the VGG-19 network.
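A common instantiation of such a VGG-based content loss is the Euclidean distance between the feature maps $\psi_j(\cdot)$ of the enhanced and target images. Below is a minimal sketch under that assumption; the specific layer index and the torchvision weights identifier are illustrative choices, not taken from the paper.

```python
import torch
import torchvision

class VGGFeatures(torch.nn.Module):
    """Fixed feature extractor over a pre-trained VGG-19 (torchvision)."""
    def __init__(self, layer_index=35):
        super().__init__()
        vgg = torchvision.models.vgg19(weights="IMAGENET1K_V1").features
        self.slice = torch.nn.Sequential(*list(vgg.children())[:layer_index + 1]).eval()
        for p in self.slice.parameters():
            p.requires_grad = False  # the extractor stays frozen

    def forward(self, x):
        return self.slice(x)

def content_loss(features, enhanced, target):
    """Euclidean (MSE) distance between VGG-19 feature maps of both images."""
    return torch.nn.functional.mse_loss(features(enhanced), features(target))
```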

References (selected)
- D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. ICLR, 2015.
- D. G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. IJCV, 2004.
- Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Transactions on Image Processing, 2004.
- I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative Adversarial Nets. NIPS, 2014.
- P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-Image Translation with Conditional Adversarial Networks. arXiv preprint, 2016.