
Exploiting Spatial Structure for Localizing Manipulated Image Regions

TLDR
A high confidence detection framework which can localize manipulated regions in an image by learning the boundary discrepancy between manipulated and non-manipulated regions with the combination of LSTM and convolution layers.
Abstract
The advent of high-tech journaling tools makes it possible to manipulate an image in a way that can easily evade state-of-the-art image tampering detection approaches. The recent success of deep learning approaches in different recognition tasks inspires us to develop a high confidence detection framework which can localize manipulated regions in an image. Unlike semantic object segmentation, where all meaningful regions (objects) are segmented, the localization of image manipulation focuses only on the possibly tampered region, which makes the problem even more challenging. In order to formulate the framework, we employ a hybrid CNN-LSTM model to capture discriminative features between manipulated and non-manipulated regions. One of the key properties of manipulated regions is that they exhibit discriminative features in the boundaries shared with neighboring non-manipulated pixels. Our motivation is to learn the boundary discrepancy, i.e., the spatial structure, between manipulated and non-manipulated regions with the combination of LSTM and convolution layers. We perform end-to-end training of the network to learn the parameters through back-propagation given ground-truth mask information. The overall framework is capable of detecting different types of image manipulation, including copy-move, removal and splicing. Our model shows promising results in localizing manipulated regions, which is demonstrated through rigorous experimentation on three diverse datasets.


UC Santa Barbara
UC Santa Barbara Previously Published Works
Title
Exploiting Spatial Structure for Localizing Manipulated Image Regions
Permalink
https://escholarship.org/uc/item/4s13z9qm
ISBN
978-1-5386-1032-9
Authors
Bappy, Jawadul H
Roy-Chowdhury, Amit K
Bunk, Jason
et al.
Publication Date
2017
DOI
10.1109/ICCV.2017.532
Peer reviewed

Exploiting Spatial Structure for Localizing Manipulated Image Regions
Jawadul H. Bappy¹, Amit K. Roy-Chowdhury¹, Jason Bunk², Lakshmanan Nataraj², and B.S. Manjunath²,³
¹ Department of Electrical and Computer Engineering, University of California, Riverside, USA
² Mayachitra Inc., Santa Barbara, California, USA
³ Department of Electrical and Computer Engineering, University of California, Santa Barbara, USA
1. Introduction
With the availability of digital image editing tools, digitally altering or tampering with an image has become very easy. In contrast, identifying tampered images is a very challenging problem due to the strong resemblance of a forged image to its original. Certain types of manipulation, such as copy-move, splicing, and removal, can easily deceive the human perceptual system. Digital image forensics is an emerging and important topic in diverse scientific and security/surveillance applications. Most existing methods focus on classifying whether an image is manipulated or not. However, few methods [51, 24, 13] localize manipulated regions in an image. Some recent works address the localization problem by classifying patches as manipulated. In this paper, we propose a novel detection framework which is capable of locating manipulation at the patch level as well as the pixel level.

[Figure 1: panels (a) and (b), each with images labeled Manipulated Image, Ground-truth, Proposed Model, and CRF_RNN.]
Figure 1. The figure demonstrates the challenge of segmenting manipulated regions from an image. We consider two types of manipulation: (a) copy-clone and (b) removal. In (a), a semantic segmentation method such as CRF-RNN [60] tries to segment both seals in the image, whereas the proposed method segments only the copied (manipulated) seal. In (b), detecting the manipulated region is even harder: part of the image has been removed and filled in from the neighboring regions. The deep learning based segmentation method [60] is not able to segment removed objects, whereas our model is capable of localizing them.
In image forensics, most state-of-the-art image tamper detection approaches exploit the frequency domain characteristics and/or statistical properties of an image. Some of the common methods are DWT [34], SVD [41], PCA [43], and DCT [56]. The analysis of artifacts from multiple JPEG compressions is also utilized in [18, 56] to detect manipulated images, but it is applicable only to JPEG formats. Recently, deep learning has become popular due to its promising performance in different visual recognition tasks such as object detection [26, 8], scene classification [62], and semantic segmentation [40]. A few recent works exploit stacked auto-encoders (SAE) [59] and convolutional neural networks (CNN) [50, 9, 19] in order to detect tampered images. Even though CNNs have shown very promising performance in understanding visual concepts such as object detection and recognition, detecting manipulated regions with CNNs may not be the best strategy, because well-manipulated images usually do not leave any visual clue of alteration [50] and resemble genuine images.
In semantic segmentation, deep learning models [40, 60, 7] exhibit good performance by learning hierarchical features of different objects in an image. Recent advances in semantic segmentation involve coarse image representations, which are recovered by upsampling. However, a coarse representation introduces a significant loss of information which might be important for learning manipulated regions. In contrast to the objects targeted by semantic segmentation, manipulated regions could be removed objects, or objects copied from another part of the image. Fig. 1 explains the challenge of segmenting manipulated regions in an image. In Fig. 1(a), the image is tampered in such a way that the manipulated and non-manipulated regions contain the same object (a seal). Existing segmentation approaches will segment both objects. In addition, an existing segmentation network fails to catch the object removed from the image shown in Fig. 1(b). However, our proposed model is able to segment the manipulated regions with high accuracy, as shown in the last column of Fig. 1.
An image can be manipulated in many ways: removing objects from an image, splicing, and copy-clone. Most existing forgery detection approaches focus on identifying a specific tampering method (such as copy-move [17, 29, 35] or splicing [45]). Thus, these approaches might not do well on other types of tampering. Moreover, it is infeasible and unrealistic to assume that the type of manipulation will be known beforehand. In real life, an image tamper detector should be able to detect all types of manipulation rather than focusing on a specific type.
Towards this goal of detecting and localizing manipulated image regions, we present a unified deep learning framework that jointly learns the patch labels (manipulated vs. non-manipulated) and the pixel-wise segmentation. These two tasks are intricately tied together, since patch classification can inform us about which pixels are manipulated, and segmentation determines whether a patch is manipulated or not. Our multi-task learning framework exploits convolutional layers along with long short-term memory (LSTM) cells. We perform end-to-end training to learn the joint tasks through back-propagation using ground-truth patch labels and mask information. The proposed model shows promising results in localizing manipulated regions at the pixel level, as well as in patch classification, which is demonstrated on different challenging datasets.
Framework Overview: In this paper, our goal is to localize the manipulated regions in an image. Given an image, we first extract patches by sliding a window across the image. Our framework takes an image patch as input and produces a patch label (manipulated or not) and a segmentation mask as output. The overall framework consists of a total of 5 convolutional layers and an LSTM network with 3 stacked layers, as shown in Fig. 2. In the network, the first two convolutional layers are used to learn low-level features, such as edges and textures. After passing through these two consecutive convolutional layers, we have a 2D feature map which is divided into 8 by 8 blocks. These blocks are then fed into the LSTM network discussed in the following paragraph.
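The patch extraction step can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function name `extract_patches` and the stride of 32 are assumptions, since the text only states that a window is slid across the image. The 64 × 64 patch size is the one given in Section 3.1.1.

```python
def extract_patches(image, patch_size=64, stride=32):
    """Slide a square window over an image stored as nested lists (rows of pixels).

    Returns a list of (top, left, patch) tuples. The stride is an assumed
    value; the text does not state how far the window moves at each step.
    """
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch_size + 1, stride):
        for left in range(0, width - patch_size + 1, stride):
            patch = [row[left:left + patch_size]
                     for row in image[top:top + patch_size]]
            patches.append((top, left, patch))
    return patches


# Toy 128 x 128 single-channel image whose pixel value is its row index.
image = [[r for _ in range(128)] for r in range(128)]
patches = extract_patches(image)
print(len(patches))  # 9 patches: 3 window positions along each axis
```

Keeping the top-left coordinates with each patch makes it possible to place the predicted masks back into the full image later.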
In computer vision, LSTMs are generally used to learn the temporal context of a video or any other sequence of data. In this work, we use an LSTM to model the spatial relationships between neighboring pixels, because manipulation breaks the natural statistics of an image in the manipulated boundary region. We send the blocks of low-level features obtained from the second convolutional layer to the LSTM cells sequentially, e.g., the first block goes to the first cell, the second block to the second cell, and so on. The 3 stacked LSTM layers produce the correlation features between blocks. These features are then used to classify patches with a softmax classifier, and are also passed to the subsequent series of convolutional layers.
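To make the sequential processing of blocks concrete, the sketch below runs a standard LSTM cell over a sequence of flattened blocks in plain Python. It is illustrative only and rests on several assumptions not taken from the paper: a single LSTM layer instead of three stacked layers, a hidden size of 4, random untrained weights, and the hypothetical helper names `make_lstm_params`, `lstm_step`, and `run_lstm`.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_lstm_params(input_size, hidden_size, seed=0):
    """Random weights for the four gates (input, forget, output, candidate)."""
    rng = random.Random(seed)
    rows = 4 * hidden_size
    cols = input_size + hidden_size
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    biases = [0.0] * rows
    return weights, biases


def lstm_step(x, h_prev, c_prev, weights, biases):
    """One standard LSTM step; returns the new hidden and cell states."""
    n = len(h_prev)
    z = x + h_prev  # concatenate the input block and the previous hidden state
    pre = [sum(w * v for w, v in zip(row, z)) + b
           for row, b in zip(weights, biases)]
    i = [sigmoid(p) for p in pre[0:n]]            # input gate
    f = [sigmoid(p) for p in pre[n:2 * n]]        # forget gate
    o = [sigmoid(p) for p in pre[2 * n:3 * n]]    # output gate
    g = [math.tanh(p) for p in pre[3 * n:]]       # candidate cell state
    c = [f_t * c_p + i_t * g_t for f_t, c_p, i_t, g_t in zip(f, c_prev, i, g)]
    h = [o_t * math.tanh(c_t) for o_t, c_t in zip(o, c)]
    return h, c


def run_lstm(blocks, hidden_size=4, seed=0):
    """Feed the flattened blocks to the LSTM one after another."""
    weights, biases = make_lstm_params(len(blocks[0]), hidden_size, seed)
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    outputs = []
    for block in blocks:
        h, c = lstm_step(block, h, c, weights, biases)
        outputs.append(h)
    return outputs


# Toy input: 64 blocks, each flattened to 64 values.
blocks = [[((b + k) % 7) / 7.0 for k in range(64)] for b in range(64)]
features = run_lstm(blocks)
print(len(features), len(features[0]))
```

Because the hidden state is carried from one block to the next, the output for each block depends on the blocks that came before it, which is how a sequence model can pick up changes across block boundaries.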
Finally, we obtain a 2D map with a confidence score for each pixel using three consecutive convolutional layers on top of the LSTM network. With the ground-truth mask of manipulated regions, we perform end-to-end training to classify each pixel. We compute the joint loss obtained at the patch classification layer and the final segmentation layer, which is then minimized using the back-propagation algorithm.
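One way such a joint loss could be formed is sketched below. The exact formulation is not given in this part of the paper, so the sketch assumes that both terms are cross-entropy losses and that they are added with equal weight; the function names `cross_entropy` and `joint_loss` are hypothetical.

```python
import math


def cross_entropy(probs, label):
    """Negative log-likelihood of the true class given predicted probabilities."""
    return -math.log(max(probs[label], 1e-12))


def joint_loss(patch_probs, patch_label, pixel_probs, pixel_labels):
    """Patch classification loss plus the mean pixel-wise segmentation loss.

    Assumption: both terms are cross-entropy and are added with equal weight.
    """
    patch_term = cross_entropy(patch_probs, patch_label)
    pixel_terms = [cross_entropy(p, y) for p, y in zip(pixel_probs, pixel_labels)]
    return patch_term + sum(pixel_terms) / len(pixel_terms)


# Toy example: one patch with four pixels and two classes
# (0 = non-manipulated, 1 = manipulated).
patch_probs = [0.2, 0.8]
pixel_probs = [[0.9, 0.1], [0.3, 0.7], [0.4, 0.6], [0.8, 0.2]]
pixel_labels = [0, 1, 1, 0]
loss = joint_loss(patch_probs, 1, pixel_probs, pixel_labels)
print(round(loss, 4))
```

Since both terms depend on the shared layers, minimizing their sum lets the gradient from each task influence the features used by the other.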
Main Contributions. Our main contributions are as follows.
In this paper, we propose a unified network for the patch classification and segmentation tasks using convolutional layers along with an LSTM network. To the best of our knowledge, there is no prior work on joint pixel-wise segmentation of manipulated regions and patch tamper classification. The intricate relationship between the two, as explained above, justifies this integrated approach.
In the proposed network, both patch classification and segmentation (pixel-wise classification) exploit the interdependence between them in order to improve both recognition tasks. Our framework is capable of localizing a manipulated region with high confidence, which is demonstrated on three datasets.

[Figure 2 diagram labels: Input Image, Extracted Patch, Conv Layer 1, Conv Layer 2, Blocks from Output Feature Map, LSTM Network (3 Stacked Layers), Patch Labels, Reconstructed 2D Map from LSTM, Max Pooling, Conv Layer 3, Conv Layer 4, Conv Layer 5, Manipulated Mask. Numeric labels in the diagram: 16, 1, 32, 2, 2.]
Figure 2. Overview of the proposed framework for the joint tasks of patch classification and manipulated region segmentation.
2. Related Work
The field of image forensics comprises diverse areas for detecting manipulation, including resampling detection, JPEG artefacts, detection of copy-move operations, splicing, and object removal. We briefly discuss some of them below.
In the past decades, several techniques have been proposed to detect resampling in digital images [52, 47, 23]. In most cases, the resampling is assumed to be done using linear or cubic interpolation. In [52], the authors exploit periodic properties of interpolation through the second derivative of the transformed image to detect image manipulation. To detect resampling in JPEG compressed images, the authors of [47] added noise before passing the image through the resampling detector and showed that adding noise aids in detecting resampling. In [22, 23], a feature is derived from the normalized energy density and an SVM is then used to robustly detect resampled images. Some recent approaches [27, 33] have been proposed to reduce the JPEG artefacts left by compression. In [5, 54], feature based forensic approaches have been presented in order to detect manipulation in an image.
In order to detect copy-move forgeries, an image is first divided into overlapping blocks, and some sort of distance measure or correlation is used to determine which blocks have been cloned. Some recent works [35, 31, 30, 4] tackle the problem of identifying and localizing copy-move manipulation. In [35], the authors used an interesting segmentation based approach to detect copy-move forgeries. They first divided an image into semantically independent patches and then performed keypoint matching among these patches. In [20], a patch-match algorithm is used to efficiently compute an approximate nearest neighbor field over an image. The authors further use invariant features such as Circular Harmonic transforms and show robustness to duplicated blocks that have undergone geometrical transformations.
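The block-based idea described at the start of this paragraph can be illustrated with a deliberately simplified sketch. It is not the method of any cited work: it uses exact equality of pixel values in place of a distance measure or correlation, and the function name `find_cloned_blocks`, the block size of 4, and the minimum offset are assumptions chosen for the toy example.

```python
from collections import defaultdict


def find_cloned_blocks(image, block_size=4, min_offset=4):
    """Return pairs of positions whose overlapping blocks are pixel-identical.

    Exact matching stands in for the distance or correlation measure used in
    practice. min_offset (Manhattan distance between block corners) skips
    pairs of nearby blocks, which often match trivially in flat regions.
    """
    height, width = len(image), len(image[0])
    seen = defaultdict(list)
    for top in range(height - block_size + 1):
        for left in range(width - block_size + 1):
            key = tuple(tuple(image[r][left:left + block_size])
                        for r in range(top, top + block_size))
            seen[key].append((top, left))
    matches = []
    for positions in seen.values():
        for a in range(len(positions)):
            for b in range(a + 1, len(positions)):
                (r1, c1), (r2, c2) = positions[a], positions[b]
                if abs(r1 - r2) + abs(c1 - c2) >= min_offset:
                    matches.append((positions[a], positions[b]))
    return matches


# Toy 12 x 12 image with unique pixel values, then copy a 4 x 4 region
# from (row 1, col 1) to (row 6, col 7) to simulate a copy-move forgery.
image = [[r * 12 + c for c in range(12)] for r in range(12)]
for dr in range(4):
    for dc in range(4):
        image[6 + dr][7 + dc] = image[1 + dr][1 + dc]

matches = find_cloned_blocks(image)
print(matches)  # [((1, 1), (6, 7))]
```

Exact matching breaks down as soon as the copied region is recompressed, rescaled, or rotated, which is why the works cited above rely on keypoints or invariant features instead.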
In [45], a technique for detecting image splicing has been proposed using visual artifacts. A novel image forgery detection method is presented in [46] based on the steerable pyramid transform (SPT) and the local binary pattern (LBP). The paper [28] covers recent advances in image manipulation and discusses the process of restoring missing or damaged areas in an image. In [6], the authors review the different image forgery detection techniques in the image forensics literature. However, in computer vision, there has been growing interest in detecting image manipulation by applying different computer vision and machine learning algorithms.
Many methods have been proposed to detect seam carving [53, 25, 39] and inpainting based object removal [58, 18, 37]. Several approaches exploit JPEG blocking artifacts to detect tampered regions [38, 21, 42, 12, 13]. In computer vision, deep learning shows outstanding performance in different visual recognition tasks such as image classification [62] and semantic segmentation [40]. In [40], two fully convolutional layers have been exploited to segment different objects in an image. The segmentation task has been further improved in [60, 7]. These models extract hierarchical features to represent the visual concept, which is useful in object segmentation. Since manipulation does not exhibit any visual change with respect to genuine images, these models do not perform well in segmenting manipulated regions.
Recent efforts in the manipulation detection task, including [9, 10, 50, 15], exploit deep learning based models. These tasks include detection of generic manipulations [9, 10], resampling [11], splicing [50], and bootleg [14]. In [49], the authors propose a Gaussian-Neuron CNN (GNCNN) for steganalysis. A deep learning approach to identify facial retouching was proposed in [1]. In [59], image region forgery detection has been performed using a stacked auto-encoder model. In [9], a new form of convolutional layer is proposed to learn the manipulated features from an image. Unlike most deep learning based image tampering detection methods, which use only convolutional layers, we present a unique network exploiting convolutional layers along with an LSTM network.

3. Network Architecture Overview
Image manipulation techniques such as copy-clone, splicing, and removal are very common, and the resulting images are very difficult to authenticate because they closely resemble their genuine counterparts. The main goal of this work is to recognize these manipulations at the pixel and patch level. Localization of manipulated regions is a different problem from object segmentation, as tampered regions are not visually apparent. For example, if an object is removed, the region may visually blend into the background, but it still needs to be identified as manipulated. As another example, copy-move is a kind of manipulation where one object is copied to another region of the same image, leading to two similar objects, one originally present and the other manipulated. However, only the latter needs to be identified.
Fig. 3 shows the boundary regions of manipulated and non-manipulated blocks in a patch. From Fig. 3, we can see that the boundary regions of the manipulated patches are affected, e.g., they are smoother, when compared to non-manipulated regions. When we zoom into the small cropped regions shown in Fig. 3, we can see the difference between the boundary of a manipulated block (smoothed) and that of a non-manipulated region. The boundaries shared between non-manipulated and manipulated regions are sometimes intentionally made smoother so that the artefacts cannot be noticed by looking at the image. Next, we discuss the details of our proposed architecture for recognizing and localizing manipulated regions.
3.1. Model for Localizing Manipulated Regions
Here, we perform two tasks: (1) patch classification (manipulated vs. non-manipulated), and (2) segmentation of manipulated regions from the patches. The proposed framework is shown in Fig. 2. The network exploits convolutional layers along with an LSTM network to classify patches and to segment manipulated regions.
3.1.1 Convolutional Layers
Convolutional layers consist of different filters which have learnable weights and biases. In the first layer, the network takes a patch as input. Each patch has R, G, B values with dimensions of 64 × 64 × 3 (width, height, color channels). In [61], it is shown that convolutional layers are capable of extracting different features from an image such as edges, textures, objects, and scenes. As discussed above, manipulation is better captured at the boundary of manipulated regions. Thus, the low-level features are critical for identifying manipulated regions. The filters in a convolutional layer create feature maps that are connected to a local region of the previous layer. In the convolutional layers, we use a kernel size of 5 × 5 × D, where D is the depth of a filter. D has different values for different layers in the network. An element-wise activation is also applied in the form of the ReLU function, max(0, x).

[Figure 3: columns (a) to (d), with labels Non-manipulated Region, Manipulated Region, Manipulated Block, and Non-manipulated Block.]
Figure 3. The figure illustrates the boundary region of a manipulated block (red) and a non-manipulated block (green) in column (a). Column (b) shows the corresponding ground-truth masks for the manipulated images in column (a). Columns (c) and (d) are zoomed-in versions of the red (manipulated) and green (non-manipulated) blocks, respectively, shown in (a). Here, we can see that the boundary formation is different for non-manipulated (sharp) and manipulated (smooth) regions.
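The 5 × 5 convolution followed by the ReLU activation described in this subsection can be sketched in plain Python as follows. This is a minimal single-channel illustration (filter depth D = 1), not the authors' implementation; zero padding that preserves the spatial size and the function name `conv2d_relu` are assumptions, since the padding scheme is not stated in the text.

```python
def relu(x):
    """Element-wise activation max(0, x)."""
    return max(0.0, x)


def conv2d_relu(feature_map, kernel, bias=0.0):
    """Apply one k x k filter with zero padding, then ReLU.

    Assumption: the output keeps the input's height and width; positions
    outside the input contribute zero to the weighted sum.
    """
    k = len(kernel)
    pad = k // 2
    height, width = len(feature_map), len(feature_map[0])
    out = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            total = bias
            for i in range(k):
                for j in range(k):
                    rr, cc = r + i - pad, c + j - pad
                    if 0 <= rr < height and 0 <= cc < width:
                        total += kernel[i][j] * feature_map[rr][cc]
            out[r][c] = relu(total)
    return out


# Toy 8 x 8 input of ones with a 5 x 5 averaging filter and a negative filter.
ones = [[1.0] * 8 for _ in range(8)]
mean_kernel = [[1.0 / 25] * 5 for _ in range(5)]
smoothed = conv2d_relu(ones, mean_kernel)
negative = conv2d_relu(ones, [[-1.0] * 5 for _ in range(5)])
print(round(smoothed[4][4], 2), round(smoothed[0][0], 2), negative[4][4])
```

The interior response stays at 1 while the corner drops because only 9 of the 25 filter taps overlap the image there, and the negative filter is clipped to zero by the ReLU.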
The first convolutional layer creates 16 feature maps. These feature maps are then combined in the next convolutional layer. We keep one feature map, which is divided into blocks that are sent to the LSTM cells. The reason for using one feature map is to reduce the network complexity, but this could be changed depending on the size of the dataset. The feature map is divided into 8 by 8 blocks, which are taken as input to the LSTM cells. In Fig. 2, we can see that the second convolutional layer provides a two-dimensional feature map, which can be denoted as $F_{c_2}$. The 8 by 8 blocks of this feature map are fed into the LSTM cells in order to learn the boundary transformation, which is discussed in Section 3.1.2.
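The division of the feature map into blocks, and the reverse step that Fig. 2 labels as a reconstructed 2D map, can be sketched as follows. The sketch assumes a 64 × 64 feature map split into blocks of 8 × 8 values (which also yields an 8 × 8 grid of blocks), taken in row-major order; the block ordering and the function names `split_into_blocks` and `merge_blocks` are assumptions, not details from the paper.

```python
def split_into_blocks(feature_map, block_size=8):
    """Cut a 2D map into non-overlapping blocks, each flattened row by row.

    Assumes the height and width are multiples of block_size.
    """
    height, width = len(feature_map), len(feature_map[0])
    blocks = []
    for top in range(0, height, block_size):
        for left in range(0, width, block_size):
            blocks.append([feature_map[r][c]
                           for r in range(top, top + block_size)
                           for c in range(left, left + block_size)])
    return blocks


def merge_blocks(blocks, height, width, block_size=8):
    """Inverse of split_into_blocks: place each flattened block back in 2D."""
    feature_map = [[0.0] * width for _ in range(height)]
    blocks_per_row = width // block_size
    for index, block in enumerate(blocks):
        top = (index // blocks_per_row) * block_size
        left = (index % blocks_per_row) * block_size
        for k, value in enumerate(block):
            feature_map[top + k // block_size][left + k % block_size] = value
    return feature_map


# Toy 64 x 64 feature map with a unique value at every position.
feature_map = [[float(r * 64 + c) for c in range(64)] for r in range(64)]
blocks = split_into_blocks(feature_map)
restored = merge_blocks(blocks, 64, 64)
print(len(blocks), len(blocks[0]), restored == feature_map)
```

Each flattened block is one element of the sequence handed to the LSTM, so a 64 × 64 map becomes a sequence of 64 vectors of length 64 under these assumptions.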
The output features from the LSTM network are used as input to the later convolutional layers. These convolutional layers learn the mapping between the features of the boundary transformation from the LSTM and the tampered pixels using the ground-truth mask. Unlike conventional CNNs, we do not use a pooling mechanism in every convolutional layer, as it can cause loss of information. We only use max pooling in the third convolutional layer.
Motivated by the segmentation work presented in [40], we also utilize two fully convolutional layers (conv layers 4 and 5, as shown in Fig. 2) at the end. In [55, 40], segmentation networks represent features coarsely, which is finally compensated for by an upsampling operation to match the dimensions of the ground-truth mask. In contrast to these approaches, we do not use an upsampling operation, as it might create additional distortion. In our network, the size
