ETH Zurich Research Collection (Conference Paper)
Author(s): Ignatov, Andrey; Kobyshev, Nikolay; Vanhoey, Kenneth; Timofte, Radu; Van Gool, Luc
Publication date: 2017
Permanent link: https://doi.org/10.3929/ethz-b-000203254
Rights / license: In Copyright - Non-Commercial Use Permitted
Originally published in: https://doi.org/10.1109/ICCV.2017.355
DSLR-Quality Photos on Mobile Devices with Deep Convolutional Networks
Andrey Ignatov, Nikolay Kobyshev, Kenneth Vanhoey, Radu Timofte, Luc Van Gool
ETH Zurich
andrey.ignatoff@gmail.com, {nk, vanhoey, timofter, vangool}@vision.ee.ethz.ch
Figure 1: iPhone 3GS photo enhanced to DSLR-quality by our method. Best zoomed on screen.
Abstract
Despite a rapid rise in the quality of built-in smartphone cameras, their physical limitations – small sensor size, compact lenses and the lack of specific hardware – prevent them from achieving the quality of DSLR cameras. In this work we present an end-to-end deep learning approach that bridges this gap by translating ordinary photos into DSLR-quality images. We propose learning the translation function using a residual convolutional neural network that improves both color rendition and image sharpness. Since the standard mean squared loss is not well suited for measuring perceptual image quality, we introduce a composite perceptual error function that combines content, color and texture losses. The first two losses are defined analytically, while the texture loss is learned in an adversarial fashion. We also present DPED, a large-scale dataset of real photos captured by three different phones and one high-end reflex camera. Our quantitative and qualitative assessments show that the enhanced image quality is comparable to that of DSLR-taken photos, while the method generalizes to any type of digital camera.
1 Introduction
During the last several years there has been a significant improvement in compact camera sensor quality, which has brought mobile photography to a substantially new level. Even low-end devices can now take reasonably good photos in appropriate lighting conditions, thanks to their advanced software and hardware tools for post-processing. However, when it comes to artistic quality, mobile devices still fall behind their DSLR counterparts. Larger sensors and high-aperture optics yield better photo resolution, color rendition and less noise, while additional sensors help to fine-tune shooting parameters. These physical differences create strong obstacles that make DSLR camera quality unattainable for compact mobile devices.
While a number of tools for automatic image enhancement exist, they usually focus on adjusting only global parameters such as contrast or brightness, without improving texture quality or taking image semantics into account. Besides that, they are usually based on a pre-defined set of rules that do not always consider the specifics of a particular device. Therefore, the dominant approach to photo post-processing is still manual image correction using specialized retouching software.
1.1 Related work
The problem of automatic image quality enhancement has not been addressed in its entirety in the area of computer vision, though a number of sub-tasks and related problems have already been successfully solved using deep learning techniques. Such tasks usually deal with image-to-image translation problems, and their common property is that they are targeted at removing artificially added artifacts from the original images. Among the related problems are the following:
Image super-resolution aims at restoring the original image from its downscaled version. In [4] a CNN architecture and an MSE loss are used to directly learn the low-to-high resolution mapping. It is the first CNN-based solution to achieve top performance in single-image super-resolution, comparable with non-CNN methods [20]. Subsequent works developed deeper and more complex CNN architectures (e.g., [10, 18, 16]). Currently, the best photo-realistic results on this task are achieved using a VGG-based loss function [9] and adversarial networks [12], which turned out to be efficient at recovering plausible high-frequency components.
Image deblurring/dehazing tries to remove artificially added haze or blur from images. Usually, MSE is used as the target loss function, and the proposed CNN architectures consist of 3 to 15 convolutional layers [14, 2, 6] or are bi-channel CNNs [17].
Image denoising/sparse inpainting similarly targets the removal of noise and artifacts from pictures. In [28] the authors proposed a weighted MSE together with a 3-layer CNN, while in [19] it was shown that an 8-layer residual CNN performs better when using a standard mean squared error. Among other solutions are a bi-channel CNN [29], a 17-layer CNN [26] and a recurrent CNN [24] that was reapplied several times to the produced results.
Image colorization. Here the goal is to recover colors that were removed from the original image. The baseline approach for this problem is to predict new values for each pixel based on its local description, which consists of various hand-crafted features [3]. Considerably better performance on this task was obtained using generative adversarial networks [8] or a 16-layer CNN with a multinomial cross-entropy loss function [27].
Image adjustment. A few works considered the problem of image color/contrast/exposure adjustment. In [25] the authors proposed an algorithm for automatic exposure correction using hand-designed features and predefined rules. In [23], a more general algorithm was proposed that, similarly to [3], uses a local description of image pixels to reproduce various photographic styles. A different approach was considered in [13], where images with similar content are retrieved from a database and their styles are applied to the target picture. All of these adjustments are implicitly included in our end-to-end transformation learning approach by design.
1.2 Contributions
The key challenge we face is dealing with all the aforementioned enhancements at once. Even advanced tools cannot notably improve image sharpness, texture details or small color variations that were lost by the camera sensor, so we cannot generate target enhanced photos from the existing ones. Corrupting DSLR photos and training an algorithm on the corrupted images does not work either: the solution would not generalize to real-world and very complex artifacts unless they are modeled and applied as corruptions, which is infeasible. To tackle this problem, we take a different approach: we propose to learn the transformation that modifies photos taken by a given camera into DSLR-quality ones.
Table 1: DPED camera characteristics.

Camera              | Sensor | Image size  | Photo quality
iPhone 3GS          | 3 MP   | 2048 × 1536 | Poor
BlackBerry Passport | 13 MP  | 4160 × 3120 | Mediocre
Sony Xperia Z       | 13 MP  | 2592 × 1944 | Average
Canon 70D DSLR      | 20 MP  | 3648 × 2432 | Excellent
Figure 2: The rig with the four DPED cameras from Table 1.
Thus, the goal is to learn a cross-distribution translation function, where the input distribution is defined by a given mobile camera sensor, and the target distribution by a DSLR sensor. To supervise the learning process, we create and leverage a dataset of images capturing the same scene with different cameras. Once the function is learned, it can be applied to unseen photos at will.
Our main contributions are:
- A novel approach¹ for the photo enhancement task based on learning a mapping function between photos from mobile devices and a DSLR camera. The target model is trained in an end-to-end fashion without using any additional supervision or handcrafted features.
- A new large-scale dataset of over 6K photos taken synchronously by a DSLR camera and three low-end smartphone cameras in a wide variety of conditions.
- A multi-term loss function composed of color, texture and content terms, allowing an efficient image quality estimation.
- Experiments measuring objective and subjective quality, demonstrating the advantage of the enhanced photos over the originals and, at the same time, their comparable quality with the DSLR counterparts.
The remainder of the paper is structured as follows. In Section 2 we describe the new DPED dataset. Section 3 presents our architecture and the chosen loss functions. Section 4 shows and analyzes the experimental results. Finally, Section 5 concludes the paper.

¹ https://github.com/aiff22/DPED

Figure 3: Example quadruplets of images taken synchronously by the four DPED cameras (iPhone, BlackBerry, Sony, Canon).
2 DSLR Photo Enhancement Dataset
In order to tackle the problem of translating poor-quality images captured by smartphone cameras into the superior-quality images achieved by a professional DSLR camera, we introduce a large-scale real-world dataset, namely the “DSLR Photo Enhancement Dataset” (DPED)², which can be used for the general photo quality enhancement task. DPED consists of photos taken in the wild synchronously by three smartphones and one DSLR camera. The devices used to collect the data are described in Table 1 and example quadruplets can be seen in Figure 3.
To ensure that all cameras were capturing photos simultaneously, the devices were mounted on a tripod and activated remotely by a wireless control system (see Figure 2). In total, over 22K photos were collected during 3 weeks, including 4549 photos from the Sony smartphone, 5727 from the iPhone, and 6015 photos from each of the Canon and BlackBerry cameras. The photos were taken during the daytime in a wide variety of places and in various illumination and weather conditions. They were captured in automatic mode, and we used default settings for all cameras throughout the whole collection procedure.
Matching algorithm. The synchronously captured images are not perfectly aligned since the cameras have different viewing angles and positions, as can be seen in Figure 3. To address this, we performed additional non-linear transformations resulting in a fixed-resolution image that our network takes as an input. The algorithm goes as follows (see Figure 4). First, for each (phone, DSLR) image pair, we compute and match SIFT keypoints [15] across the images. These are used to estimate a homography using RANSAC [21]. We then crop both images to the intersection part and downscale the DSLR image crop to the size of the phone crop.
² http://dped-photos.vision.ee.ethz.ch
Figure 4: Matching algorithm: an overlapping region is determined by SIFT descriptor matching, followed by a non-linear transform and a crop, resulting in two images of the same resolution representing the same scene. Here: Canon and BlackBerry images, respectively.
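The alignment step described above can be sketched with standard OpenCV primitives. This is a minimal illustration, not the authors' implementation: it warps the DSLR image onto the phone view instead of cropping both images to their intersection, and the matching thresholds are illustrative choices.

```python
import cv2
import numpy as np

def align_pair(phone_img, dslr_img):
    """Roughly align a (phone, DSLR) image pair: match SIFT keypoints,
    estimate a homography with RANSAC, and warp the DSLR image into the
    phone view so that both images cover the same scene region."""
    gray_p = cv2.cvtColor(phone_img, cv2.COLOR_BGR2GRAY)
    gray_d = cv2.cvtColor(dslr_img, cv2.COLOR_BGR2GRAY)

    sift = cv2.SIFT_create()
    kp_p, des_p = sift.detectAndCompute(gray_p, None)
    kp_d, des_d = sift.detectAndCompute(gray_d, None)

    # Match descriptors and keep reliable correspondences (Lowe's ratio test).
    matches = cv2.BFMatcher().knnMatch(des_d, des_p, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]

    src = np.float32([kp_d[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_p[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)

    # Homography from DSLR to phone coordinates, estimated with RANSAC.
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)

    h, w = phone_img.shape[:2]
    dslr_aligned = cv2.warpPerspective(dslr_img, H, (w, h))
    return phone_img, dslr_aligned
```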
Training a CNN on the aligned high-resolution images is infeasible, thus patches of size 100×100 px were extracted from these photos. Our preliminary experiments revealed that larger patch sizes do not lead to better performance, while requiring considerably more computational resources. We extracted patches using a non-overlapping sliding window. The window moved in parallel along both images from each phone-DSLR image pair, and its position on the phone image was additionally adjusted by shifts and rotations based on the cross-correlation metric. To avoid significant displacements, only patches with cross-correlation greater than 0.9 were included in the dataset. Around 100 original images were reserved for testing; the rest of the photos were used for training and validation. This procedure resulted in 139K, 160K and 162K training patches and 2.4-4.3K test patches for the BlackBerry-Canon, iPhone-Canon and Sony-Canon pairs, respectively. It should be emphasized that both training and test patches are precisely matched; the potential shifts do not exceed 5 pixels. In the following we assume that these patches of size 3×100×100 constitute the input data to our CNNs.
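A minimal sketch of this patch-selection step is given below. It only implements the non-overlapping window and the cross-correlation threshold of 0.9; the additional shift and rotation refinement mentioned above is omitted, and all names are illustrative.

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation between two equally sized patches."""
    a = a.astype(np.float64).ravel() - a.mean()
    b = b.astype(np.float64).ravel() - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def extract_patch_pairs(phone, dslr, size=100, threshold=0.9):
    """Slide a non-overlapping window over an aligned (phone, DSLR) pair and
    keep only the patch pairs whose cross-correlation exceeds the threshold."""
    pairs = []
    h, w = phone.shape[:2]
    for y in range(0, h - size + 1, size):
        for x in range(0, w - size + 1, size):
            p = phone[y:y + size, x:x + size]
            d = dslr[y:y + size, x:x + size]
            if ncc(p, d) > threshold:
                pairs.append((p, d))
    return pairs
```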
3 Method
Given a low-quality photo $I_s$ (source image), the goal of the considered enhancement task is to reproduce the image $I_t$ (target image) taken by a DSLR camera. A deep residual CNN $F_W$ parameterized by weights $W$ is used to learn the underlying translation function. Given the training set $\{I_s^j, I_t^j\}_{j=1}^{N}$ consisting of $N$ image pairs, it is trained to minimize:

$$W^* = \arg\min_W \frac{1}{N} \sum_{j=1}^{N} \mathcal{L}\big(F_W(I_s^j),\, I_t^j\big), \qquad (1)$$

where $\mathcal{L}$ denotes a multi-term loss function we detail in Section 3.1. We then define the system architecture of our solution in Section 3.2.
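To make the objective in Eq. (1) concrete, here is a schematic gradient step in PyTorch-style Python. It is a sketch under the assumption that a generator network, an optimizer and the multi-term loss of Section 3.1 have already been constructed; it is not the authors' original implementation.

```python
def training_step(generator, optimizer, phone_batch, dslr_batch, loss_fn):
    """One gradient step on the weights W for the objective in Eq. (1).
    `loss_fn` plays the role of the multi-term loss L from Section 3.1."""
    optimizer.zero_grad()
    enhanced = generator(phone_batch)      # F_W(I_s^j) for a batch of patches
    loss = loss_fn(enhanced, dslr_batch)   # L(F_W(I_s^j), I_t^j), averaged over the batch
    loss.backward()
    optimizer.step()
    return loss.item()
```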

Figure 5: Fragments from the original and blurred images taken by the phone (two left-most) and DSLR (two right-most) cameras. Blurring removes high frequencies and makes color comparison easier.
3.1 Loss function
The main difficulty of the image enhancement task is that input and target photos cannot be matched densely (i.e., pixel-to-pixel): different optics and sensors cause specific local non-linear distortions and aberrations, leading to a non-constant shift of pixels between each image pair even after precise alignment. Hence, the standard per-pixel losses, besides being doubtful as a perceptual quality metric, are not applicable in our case. We build our loss function under the assumption that the overall perceptual image quality can be decomposed into three independent parts: i) color quality, ii) texture quality and iii) content quality. We now define loss functions for each component, and ensure invariance to local shifts by design.
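Since the overall loss combines these three terms, a schematic composition might look as follows; the weights are placeholders, not values from the paper, and each component loss is assumed to be a callable taking the enhanced and target batches:

```python
def total_loss(enhanced, target, color_loss, texture_loss, content_loss,
               w_color=1.0, w_texture=1.0, w_content=1.0):
    """Multi-term loss L: weighted sum of the color, texture and content
    terms defined in Sections 3.1.1-3.1.3. The weights are illustrative."""
    return (w_color * color_loss(enhanced, target)
            + w_texture * texture_loss(enhanced, target)
            + w_content * content_loss(enhanced, target))
```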
3.1.1 Color loss
To measure the color difference between the enhanced and target images, we propose applying a Gaussian blur (see Figure 5) and computing the Euclidean distance between the obtained representations. In the context of CNNs, this is equivalent to using one additional convolutional layer with a fixed Gaussian kernel followed by the mean squared error (MSE) function. Color loss can be written as:

$$\mathcal{L}_{\text{color}}(X, Y) = \|X_b - Y_b\|_2^2, \qquad (2)$$

where $X_b$ and $Y_b$ are the blurred images of $X$ and $Y$, respectively:

$$X_b(i, j) = \sum_{k,l} X(i + k, j + l) \cdot G(k, l), \qquad (3)$$

and the 2D Gaussian blur operator is given by

$$G(k, l) = A \exp\left(-\frac{(k - \mu_x)^2}{2\sigma_x} - \frac{(l - \mu_y)^2}{2\sigma_y}\right), \qquad (4)$$

where we defined $A = 0.053$, $\mu_{x,y} = 0$, and $\sigma_{x,y} = 3$.
The idea behind this loss is to evaluate the difference in brightness, contrast and major colors between the images while eliminating texture and content comparison. Hence, we fixed a constant $\sigma$ by visual inspection as the smallest value that ensures that texture and content are dropped. The crucial property of this loss is its invariance to small distortions. Figure 6 demonstrates the MSE and color losses for image pairs $(X, Y)$, where $Y$ equals $X$ shifted in a random direction by $n$ pixels. As one can see, the color loss is nearly insensitive to small distortions (≤ 2 pixels). For higher shifts (3-5 px), it is still about 5-10 times smaller than the MSE, whereas for larger displacements it demonstrates similar magnitude and behavior. As a result, the color loss forces the enhanced image to have the same color distribution as the target one, while being tolerant to small mismatches.

Figure 6: Comparison between MSE and color loss as a function of the magnitude of shift between images. Results were averaged over 50K images.
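A minimal sketch of this color loss, assuming PyTorch and a kernel built directly from Eq. (4); the kernel size of 21 is an illustrative choice, since the paper only fixes A, μ and σ:

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(size=21, sigma=3.0, amp=0.053):
    """Blur kernel following Eq. (4): G(k, l) = A·exp(-k²/(2σ) - l²/(2σ)),
    with A = 0.053, μ = 0 and σ = 3 as specified in the paper."""
    ax = torch.arange(size, dtype=torch.float32) - (size - 1) / 2.0
    g = torch.exp(-(ax ** 2) / (2.0 * sigma))
    kernel = amp * torch.outer(g, g)
    return kernel.view(1, 1, size, size)

def color_loss(x, y, kernel):
    """Blur both images channel-wise with the fixed Gaussian kernel,
    then compare them with MSE, as in Eqs. (2)-(3)."""
    c = x.shape[1]
    k = kernel.repeat(c, 1, 1, 1)        # one copy of the kernel per channel
    pad = kernel.shape[-1] // 2
    xb = F.conv2d(x, k, padding=pad, groups=c)
    yb = F.conv2d(y, k, padding=pad, groups=c)
    return F.mse_loss(xb, yb)
```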
3.1.2 Texture loss
Instead of using a pre-defined loss function, we build upon generative adversarial networks (GANs) [5] to directly learn a suitable metric for measuring texture quality. The discriminator CNN is applied to grayscale images so that it is targeted specifically on texture processing. It observes both fake (improved) and real (target) images, and its goal is to predict whether the input image is real or not. It is trained to minimize the cross-entropy loss function, and the texture loss is defined as a standard generator objective:

$$\mathcal{L}_{\text{texture}} = -\sum_i \log D\big(F_W(I_s),\, I_t\big), \qquad (5)$$

where $F_W$ and $D$ denote the generator and discriminator networks, respectively. The discriminator is pre-trained on the {phone, DSLR} image pairs, and then trained jointly with the proposed network as is conventional for GANs. It should be noted that this loss is shift-invariant by definition since no alignment is required in this case.
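As an illustration, a standard GAN formulation of this texture loss might look as follows. This is a sketch in PyTorch assuming a discriminator network is defined elsewhere; the grayscale conversion weights and the way the discriminator consumes its inputs are simplifications, not the authors' exact setup.

```python
import torch
import torch.nn.functional as F

def to_grayscale(img):
    """Luminance-style grayscale conversion of an RGB batch (N, 3, H, W)."""
    r, g, b = img[:, 0:1], img[:, 1:2], img[:, 2:3]
    return 0.299 * r + 0.587 * g + 0.114 * b

def texture_losses(discriminator, enhanced, target):
    """Adversarial texture loss on grayscale patches (Section 3.1.2).
    Returns the generator term of Eq. (5) and the discriminator's
    cross-entropy objective; D is assumed to output probabilities."""
    fake_gray = to_grayscale(enhanced)
    real_gray = to_grayscale(target)

    # Discriminator objective: distinguish real DSLR patches from enhanced ones
    # (the enhanced batch is detached so this term only updates D).
    d_fake = discriminator(fake_gray.detach())
    d_real = discriminator(real_gray)
    d_loss = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))

    # Generator (texture) loss of Eq. (5): -sum log D(F_W(I_s)).
    g_loss = -torch.log(discriminator(fake_gray) + 1e-8).sum()
    return g_loss, d_loss
```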
3.1.3 Content loss
Inspired by [9, 12], we define our content loss based on the activation maps produced by the ReLU layers of the pre-trained VGG-19 network. Instead of measuring the per-pixel difference between the images, this loss encourages them to have similar feature representations that comprise various aspects of their content and perceptual quality. In our case it is used to preserve image semantics, since the other losses do not consider it. Let $\psi_j(\cdot)$ be the feature map obtained after the $j$-th convolutional layer of the VGG-19 network.
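A common instantiation of such a VGG-based content loss is the Euclidean distance between the feature maps $\psi_j(\cdot)$ of the enhanced and target images. Below is a minimal sketch under that assumption; the specific layer index and the torchvision weights identifier are illustrative choices, not taken from the paper.

```python
import torch
import torchvision

class VGGFeatures(torch.nn.Module):
    """Fixed feature extractor over a pre-trained VGG-19 (torchvision)."""
    def __init__(self, layer_index=35):
        super().__init__()
        vgg = torchvision.models.vgg19(weights="IMAGENET1K_V1").features
        self.slice = torch.nn.Sequential(*list(vgg.children())[:layer_index + 1]).eval()
        for p in self.slice.parameters():
            p.requires_grad = False  # the extractor stays frozen

    def forward(self, x):
        return self.slice(x)

def content_loss(features, enhanced, target):
    """Euclidean (MSE) distance between VGG-19 feature maps of both images."""
    return torch.nn.functional.mse_loss(features(enhanced), features(target))
```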

References (selected)
- D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. ICLR, 2015.
- D. G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. IJCV, 2004.
- Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Transactions on Image Processing, 2004.
- I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative Adversarial Nets. NIPS, 2014.
- P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-Image Translation with Conditional Adversarial Networks. arXiv preprint, 2016.