What is the key insight of using LSTM?

The key insight of using LSTM is to learn the boundary transformation between different blocks, which provides discriminative features between manipulated and non-manipulated regions.

What is the method for minimizing the loss of the network?

The authors use adaptive moment estimation (Adam) [32] optimization technique in order to minimize the total loss of the network, shown in Eqn.

What is the purpose of this paper?

In this paper, the authors present a unified framework for joint patch classification and segmentation to localize manipulated regions from an image.

How many patches are used in the training dataset?

In data preparation, the authors first split the whole image dataset into three subsets: training (65%), validation (10%), and testing (25%).

How many feature maps do the authors use in their proposed network?

The authors also experiment with varying numbers of feature maps, such as (1) Conv1-8f: conv1 with 8 feature maps, and (2) Conv1-32f: conv1 with 32 feature maps.

(Open Access) Exploiting Spatial Structure for Localizing Manipulated Image Regions (2017) | Jawadul H. Bappy

Q: What are the contributions mentioned in the paper "Exploiting spatial structure for localizing manipulated image regions" ?

In order to formulate the framework, the authors employ a hybrid CNN-LSTM model to capture discriminative features between manipulated and non-manipulated regions. The authors perform end-to-end training of the network to learn the parameters through back-propagation given groundtruth mask information. Their model shows promising results in localizing manipulated regions, which is demonstrated through rigorous experimentation on three diverse datasets.

Q: What is the main reason why deep learning is popular?

Deep learning has become popular due to its promising performance in different visual recognition tasks such as object detection [26, 8], scene classification [62], and semantic segmentation [40].

Q: How do the authors learn the patch tasks?

The authors perform end-to-end training to learn the joint tasks through back-propagation using ground-truth patch labels and mask information.

Q: What is the main topic of the paper?

The paper [28] includes the recent advances in image manipulation and discusses the process of restoring missing or damaged areas in an image.

Q: What is the definition of copy-move forgeries?

In order to detect copy-move forgeries, an image is first divided into overlapping blocks and some sort of distance measure or correlation is used to determine blocks that have been cloned.

UC Santa Barbara

UC Santa Barbara Previously Published Works

Title

Exploiting Spatial Structure for Localizing Manipulated Image Regions

Permalink

https://escholarship.org/uc/item/4s13z9qm

ISBN

978-1-5386-1032-9

Authors

Bappy, Jawadul H

Roy-Chowdhury, Amit K

Bunk, Jason

et al.

Publication Date

2017

DOI

10.1109/ICCV.2017.532

Peer reviewed


Exploiting Spatial Structure for Localizing Manipulated Image Regions

Jawadul H. Bappy (1), Amit K. Roy-Chowdhury (1), Jason Bunk (2), Lakshmanan Nataraj (2), and B.S. Manjunath (2,3)

(1) Department of Electrical and Computer Engineering, University of California, Riverside, USA

(2) Mayachitra Inc., Santa Barbara, California, USA

(3) Department of Electrical and Computer Engineering, University of California, Santa Barbara, USA

Abstract

The advent of high-tech journaling tools makes it possible to manipulate an image in a way that can easily evade state-of-the-art image tampering detection approaches. The recent success of deep learning approaches in different recognition tasks inspires us to develop a high-confidence detection framework that can localize manipulated regions in an image. Unlike semantic object segmentation, where all meaningful regions (objects) are segmented, the localization of image manipulation focuses only on the possibly tampered region, which makes the problem even more challenging. In order to formulate the framework, we employ a hybrid CNN-LSTM model to capture discriminative features between manipulated and non-manipulated regions. One of the key properties of manipulated regions is that they exhibit discriminative features in boundaries shared with neighboring non-manipulated pixels. Our motivation is to learn the boundary discrepancy, i.e., the spatial structure, between manipulated and non-manipulated regions with the combination of LSTM and convolution layers. We perform end-to-end training of the network to learn the parameters through back-propagation given ground-truth mask information. The overall framework is capable of detecting different types of image manipulations, including copy-move, removal, and splicing. Our model shows promising results in localizing manipulated regions, which is demonstrated through rigorous experimentation on three diverse datasets.

1. Introduction

With the availability of digital image editing tools, digitally altering or tampering with an image has become very easy. In contrast, identifying tampered images is a very challenging problem due to the strong resemblance of a forged image to its original. Certain types of manipulation, such as copy-move, splicing, and removal, can easily deceive the human perceptual system. Digital image forensics is an emerging and important topic in diverse scientific and security/surveillance applications. Most of the existing methods have focused on classifying whether an image is manipulated or not. However, there are few methods [51, 24, 13] that localize manipulated regions in an image. Some recent works address the localization problem by classifying patches as manipulated. In this paper, we propose a novel detection framework which is capable of locating manipulation at the patch as well as the pixel level.

Figure 1. The figure demonstrates the challenge of segmenting manipulated regions from an image. We consider two types of manipulation: (a) copy-clone and (b) removal. In (a), a semantic segmentation method such as CRF-RNN [60] tries to segment both seals in the image, whereas the proposed method segments only the copied (manipulated) seal. In (b), detecting the manipulated region is even harder: some part of the image has been removed and filled with the neighboring regions. The deep learning based segmentation method [60] is not able to segment removed objects, whereas our model is capable of localizing them. Panels in each row: Manipulated Image, Ground-truth, Proposed Model, CRF-RNN.

In image forensics, most state-of-the-art image tamper detection approaches exploit the frequency domain characteristics and/or statistical properties of an image. Some of the common methods are DWT [34], SVD [41], PCA [43], and DCT [56]. The analysis of artifacts caused by multiple JPEG compressions is also utilized in [18, 56] to detect manipulated images, but these methods are applicable only to JPEG formats. Recently, deep learning has become popular due to its promising performance in different visual recognition tasks such as object detection [26, 8], scene classification [62], and semantic segmentation [40]. A few recent works exploit stacked auto-encoders (SAE) [59] and convolutional neural networks (CNN) [50, 9, 19] in order to detect tampered images. Even though CNNs have shown very promising performance in understanding visual concepts such as object detection and recognition, detecting manipulated regions with CNNs may not be the best strategy, because well-manipulated images usually do not leave any visual clue of alteration [50] and resemble genuine images.

In semantic segmentation, deep learning models [40, 60, 7] exhibit good performance by learning hierarchical features of different objects in an image. Recent advances in semantic segmentation involve coarse image representations, which are recovered by upsampling. However, a coarse representation introduces a significant loss of information which might be important for learning manipulated regions. In contrast to semantic segmentation, manipulated regions could be removed objects, or objects copied from another part of the image. Fig. 1 explains the challenge of segmenting manipulated regions in an image. In Fig. 1(a), the image is tampered in such a way that the manipulated and non-manipulated regions contain the same object (a seal). Existing segmentation approaches will segment both objects. In addition, the existing segmentation network fails to catch the object removed from the image, as shown in Fig. 1(b). However, our proposed model is able to segment the manipulated regions with high accuracy, as shown in the last column of Fig. 1.

An image can be manipulated in many ways: removing objects from an image, splicing, and copy-clone. Most of the existing forgery detection approaches focus on identifying a specific tampering method (such as copy-move [17, 29, 35] or splicing [45]). Thus, these approaches might not do well for other types of tampering. Moreover, it is infeasible and unrealistic to assume that the type of manipulation will be known beforehand. In real life, an image tamper detector should be able to detect all types of manipulation rather than focusing on a specific type.

Towards this goal of detecting and localizing manipulated image regions, we present a unified deep learning framework that learns the patch labels (manipulated vs. non-manipulated) and pixel-wise segmentation jointly. These two tasks are intricately tied together, since patch classification can inform us about which pixels are manipulated, and segmentation will determine whether a patch is manipulated or not. Our multi-task learning framework exploits convolutional layers along with long short-term memory (LSTM) cells. We perform end-to-end training to learn the joint tasks through back-propagation using ground-truth patch labels and mask information. The proposed model shows promising results in localizing manipulated regions at the pixel level, as well as in patch classification, which is demonstrated on different challenging datasets.

Framework Overview: In this paper, our goal is to localize the manipulated regions in an image. Given an image, we first extract patches by sliding a window across the image. Our framework takes an image patch as input and produces a patch label (manipulated or not) and a segmentation mask as output. The overall framework consists of a total of 5 convolutional layers and an LSTM network with 3 stacked layers. The proposed framework is shown in Fig. 2. In the network, the first two convolutional layers are used to learn low-level features, such as edges and textures. After passing through two consecutive convolutional layers, we have a 2D feature map which is divided into 8 by 8 blocks. These blocks are then fed into the LSTM network discussed in the following paragraph.
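The patch extraction step can be illustrated with a minimal NumPy sketch. The 64 × 64 patch size matches the input dimension given in Section 3.1.1; the stride of 32 and the function name `extract_patches` are assumptions made for illustration and are not specified in the text.

```python
import numpy as np

def extract_patches(image, patch_size=64, stride=32):
    """Slide a square window over an H x W x C image and collect patches.

    patch_size=64 follows the 64 x 64 x 3 input stated in the paper;
    stride=32 is an assumption (the stride is not given in this section).
    """
    height, width = image.shape[:2]
    patches, origins = [], []
    for top in range(0, height - patch_size + 1, stride):
        for left in range(0, width - patch_size + 1, stride):
            patches.append(image[top:top + patch_size, left:left + patch_size])
            origins.append((top, left))  # remember where each patch came from
    return np.stack(patches), origins

# Usage: a blank 128 x 192 RGB image yields a 3 x 5 grid of patches.
image = np.zeros((128, 192, 3), dtype=np.uint8)
patches, origins = extract_patches(image)
print(patches.shape)  # (15, 64, 64, 3)
```

Keeping the patch origins alongside the patches makes it possible to place each predicted patch mask back into the full image afterwards.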

In computer vision, LSTMs are generally used to learn the temporal context of a video or any sequence of data. In this work, we use an LSTM to model the spatial relationships between neighboring pixels. This is because manipulation breaks the natural statistics of an image in the manipulated boundary region. We send the blocks of low-level features obtained from the second convolution layer to the LSTM cells sequentially, e.g., the first block goes to the first cell, the second block to the second cell, and so on. The 3-stacked LSTM layers produce the correlation features between blocks. These features are then used to classify patches using a softmax classifier, and are passed to the series of convolution layers.

Finally, we obtain the 2D map with the confidence score of each pixel using three consecutive convolutional layers on top of the LSTM network. With the ground-truth mask of manipulated regions, we perform end-to-end training to classify each pixel. We compute the joint loss obtained at the patch classification layer and the final segmentation layer, which is then minimized using the back-propagation algorithm.
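To make the idea of a joint loss concrete, the sketch below sums a patch-level and a pixel-level softmax cross-entropy in NumPy. The exact form of the loss is not given in this section, so the equal weighting of the two terms, the two-class output, and the function names are all assumptions for illustration.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy for integer labels; logits has shape (..., num_classes)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    picked = np.take_along_axis(log_probs, labels[..., None], axis=-1)
    return -picked.mean()

def joint_loss(patch_logits, patch_labels, pixel_logits, pixel_labels):
    """Patch loss plus pixel loss (equal weighting is an assumption)."""
    patch_term = softmax_cross_entropy(patch_logits, patch_labels)
    pixel_term = softmax_cross_entropy(pixel_logits, pixel_labels)
    return patch_term + pixel_term

# Usage: a batch of 4 patches with uninformative (all-zero) logits.
patch_logits = np.zeros((4, 2))
patch_labels = np.zeros(4, dtype=np.int64)
pixel_logits = np.zeros((4, 64, 64, 2))
pixel_labels = np.zeros((4, 64, 64), dtype=np.int64)
loss = joint_loss(patch_logits, patch_labels, pixel_logits, pixel_labels)
print(loss)  # each term equals log(2) for uniform two-class predictions
```

In a real training loop the gradient of this scalar would be propagated through both output heads, which is what couples the two tasks.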

Main Contributions. Our main contributions are as follows.

• In this paper, we propose a unified network for the patch classification and segmentation tasks using convolution layers along with an LSTM network. To the best of our knowledge, there is no prior work on joint pixel-wise segmentation of manipulated regions and patch tamper classification. The intricate relationship between the two, as explained above, justifies this integrated approach.

• In the proposed network, both patch classification and segmentation (pixel-wise classification) exploit the interdependence between them in order to improve both of the recognition tasks. Our framework is capable of localizing a manipulated region with high confidence, which is demonstrated on three datasets.

Figure 2. Overview of the proposed framework for the joint tasks of patch classification and manipulated region segmentation. Diagram labels: Input Image, Extracted Patch, Conv Layer 1, Conv Layer 2, Blocks from Output Feature Map, LSTM Network (3 Stacked Layers), Patch Labels, Reconstructed 2D Map from LSTM, Conv Layer 3, Max Pooling, Conv Layer 4, Conv Layer 5, Manipulated Mask.

2. Related Work

The field of image forensics comprises diverse areas of manipulation detection, including resampling detection, JPEG artefacts, detection of copy-move operations, splicing, and object removal. We will briefly discuss some of them below.

In the past decades, several techniques have been proposed to detect resampling in digital images [52, 47, 23]. In most cases, resampling is assumed to be done using linear or cubic interpolation. In [52], the authors exploit periodic properties of interpolation through the second derivative of the transformed image for detecting image manipulation. To detect resampling on JPEG compressed images, the authors of [47] added noise before passing the image through the resampling detector and showed that adding noise aids in detecting resampling. In [22, 23], a feature is derived from the normalized energy density and then an SVM is used to robustly detect resampled images. Some recent approaches [27, 33] have been proposed to reduce JPEG artefacts left by compression. In [5, 54], feature-based forensic approaches have been presented in order to detect manipulation in an image.

In order to detect copy-move forgeries, an image is first divided into overlapping blocks, and some sort of distance measure or correlation is used to determine blocks that have been cloned. Some recent works [35, 31, 30, 4] tackle the problem of identifying and localizing copy-move manipulation. In [35], the authors used an interesting segmentation-based approach to detect copy-move forgeries. They first divided an image into semantically independent patches and then performed keypoint matching among these patches. In [20], a patch-match algorithm is used to efficiently compute an approximate nearest neighbor field over an image. The authors further use invariant features such as Circular Harmonic transforms and show robustness over duplicated blocks that have undergone geometrical transformations.
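The generic block-matching idea behind copy-move detection (overlapping blocks compared with a distance measure) can be sketched as follows. This is a naive illustration and not the method of any cited work: the block size, stride, Euclidean distance, thresholds, and the function name `find_cloned_blocks` are all assumptions, and the brute-force pairwise search is only practical for small images.

```python
import numpy as np

def find_cloned_blocks(gray, block=8, stride=4, max_dist=1e-6, min_offset=8):
    """Return pairs of block origins whose contents are (nearly) identical."""
    height, width = gray.shape
    origins, vectors = [], []
    # Step 1: divide the image into overlapping blocks.
    for top in range(0, height - block + 1, stride):
        for left in range(0, width - block + 1, stride):
            origins.append((top, left))
            vectors.append(gray[top:top + block, left:left + block].ravel())
    vectors = np.asarray(vectors, dtype=np.float64)
    # Step 2: compare every block with all later blocks by Euclidean distance.
    matches = []
    for i in range(len(origins)):
        dists = np.linalg.norm(vectors[i + 1:] - vectors[i], axis=1)
        for j in np.nonzero(dists <= max_dist)[0] + i + 1:
            dy = origins[j][0] - origins[i][0]
            dx = origins[j][1] - origins[i][1]
            if max(abs(dy), abs(dx)) >= min_offset:  # ignore overlapping neighbours
                matches.append((origins[i], origins[j]))
    return matches

# Usage: copy an 8 x 8 region of a random image to another location.
rng = np.random.default_rng(0)
gray = rng.random((32, 32))
gray[16:24, 16:24] = gray[0:8, 0:8]
matches = find_cloned_blocks(gray)
print(matches)
```

Practical detectors replace the exact comparison with robust features and faster search, since real forgeries are usually recompressed, rescaled, or rotated after copying.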

In [45], an image splicing technique has been proposed using visual artifacts. A novel image forgery detection method based on the steerable pyramid transform (SPT) and the local binary pattern (LBP) is presented in [46]. The paper [28] covers recent advances in image manipulation and discusses the process of restoring missing or damaged areas in an image. In [6], the authors review the different image forgery detection techniques in the image forensics literature. In computer vision, however, there has been a growing interest in detecting image manipulation by applying different computer vision and machine learning algorithms.

Many methods have been proposed to detect seam carving [53, 25, 39] and inpainting-based object removal [58, 18, 37]. Several approaches exploit JPEG blocking artifacts to detect tampered regions [38, 21, 42, 12, 13]. In computer vision, deep learning shows outstanding performance in different visual recognition tasks such as image classification [62] and semantic segmentation [40]. In [40], two fully convolutional layers have been exploited to segment different objects in an image. The segmentation task has been further improved in [60, 7]. These models extract hierarchical features to represent the visual concept, which is useful in object segmentation. Since manipulation does not exhibit any visual change with respect to genuine images, these models do not perform well in segmenting manipulated regions.

Recent efforts in the manipulation detection task, including [9, 10, 50, 15], exploit deep learning based models. These tasks include detection of generic manipulations [9, 10], resampling [11], splicing [50], and bootleg [14]. In [49], the authors propose a Gaussian-Neuron CNN (GNCNN) for steganalysis. A deep learning approach to identify facial retouching was proposed in [1]. In [59], image region forgery detection has been performed using a stacked auto-encoder model. In [9], a new form of convolutional layer is proposed to learn the manipulated features from an image. Unlike most of the deep learning based image tampering detection methods, which use convolution layers, we present a unique network exploiting convolution layers along with an LSTM network.


3. Network Architecture Overview

Image manipulation techniques such as copy-clone, splicing, and removal are very common, and the resulting images are very difficult to authenticate due to their resemblance to genuine images. The main goal of this work is to recognize these manipulations at the pixel and patch level. Localization of manipulated regions is a different problem from object segmentation, as tampered regions are not visually apparent. For example, if an object is removed, the region may visually blend into the background, but it needs to be identified as manipulated. As another example, copy-move is a kind of manipulation where one object is copied to another region of the same image, leading to two similar objects, one originally present and the other manipulated. However, only the latter needs to be identified.

Fig. 3 shows the boundary regions of manipulated and non-manipulated blocks in a patch. From Fig. 3, we can see that the boundary regions of the manipulated patches are affected, e.g., have a smoother boundary, when compared to non-manipulated regions. When we zoom into the small cropped region as shown in Fig. 3, we can see the difference between the boundary of the manipulated block (smoothed) and that of the non-manipulated region. The boundary shared between non-manipulated and manipulated regions is sometimes intentionally made smoother so that no one can visually perceive the artefacts when looking at an image. Next, we will discuss the details of our proposed architecture for recognizing and localizing manipulated regions.

3.1. Model for Localizing Manipulated Regions

Here, we perform two tasks: (1) patch classification (manipulated vs. non-manipulated), and (2) segmentation of manipulated regions from the patches. The proposed framework is shown in Fig. 2. The network exploits convolutional layers along with an LSTM network to classify patches and to segment manipulated regions.

3.1.1 Convolutional Layers

Figure 3. The figure illustrates the boundary regions of a manipulated block (red) and a non-manipulated block (green) in column (a). Column (b) shows the corresponding ground-truth masks for the manipulated images in column (a). Columns (c) and (d) are zoomed-in versions of the red (manipulated) and green (non-manipulated) blocks, respectively, shown in (a). Here, we can see that the boundary formation is different for non-manipulated (sharp) and manipulated (smooth) regions.

Convolutional layers consist of different filters which have learnable weights and biases. In the first layer, the network takes a patch as input. Each patch has R, G, B values with dimensions of 64 × 64 × 3 (width, height, color channels). In [61], it is shown that convolutional layers are capable of extracting different features from an image such as edges, textures, objects, and scenes. As discussed above, manipulation is better captured at the boundary of manipulated regions. Thus, the low-level features are critical for identifying manipulated regions. The filters in a convolutional layer create feature maps that are connected to a local region of the previous layer. In the convolutional layers, we use a kernel size of 5 × 5 × D, where D is the depth of a filter. D has different values for different layers in the network. An element-wise activation is also utilized in the form of the ReLU function, max(0, x).
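A convolutional layer with 5 × 5 × D kernels followed by ReLU can be written naively in NumPy as below. This is a sketch for illustration only: zero padding, stride 1, and the names `relu` and `conv_layer` are assumptions, since the text fixes only the kernel size and the activation.

```python
import numpy as np

def relu(x):
    """Element-wise activation max(0, x)."""
    return np.maximum(0.0, x)

def conv_layer(patch, kernels, biases):
    """Naive zero-padded convolution (cross-correlation) followed by ReLU.

    patch:   (H, W, D) input
    kernels: (num_maps, 5, 5, D) learnable weights
    biases:  (num_maps,) learnable biases
    Zero padding and stride 1 are assumptions, not taken from the paper.
    """
    num_maps, k, _, _ = kernels.shape
    pad = k // 2
    padded = np.pad(patch, ((pad, pad), (pad, pad), (0, 0)))
    height, width = patch.shape[:2]
    out = np.empty((height, width, num_maps))
    for y in range(height):
        for x in range(width):
            window = padded[y:y + k, x:x + k, :]  # local 5 x 5 x D region
            out[y, x] = np.tensordot(kernels, window, axes=3) + biases
    return relu(out)

# Usage: a 64 x 64 x 3 patch mapped to 16 feature maps (D = 3 in the first layer).
rng = np.random.default_rng(0)
patch = rng.random((64, 64, 3))
kernels = rng.normal(size=(16, 5, 5, 3))
feature_maps = conv_layer(patch, kernels, np.zeros(16))
print(feature_maps.shape)  # (64, 64, 16)
```

Each output value depends only on a 5 × 5 neighborhood of the input, which is why such layers respond to local boundary structure.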

The first convolution layer creates 16 feature maps. These feature maps are then combined in the next convolution layer. We keep one feature map, which is divided into blocks that are sent to the LSTM cells. The reason for using one feature map is to reduce the network complexity, but this could be changed depending on the size of the dataset. The feature map is divided into 8 by 8 blocks, which are taken as input to the LSTM cells. In Fig. 2, we can see that the second convolutional layer provides a two-dimensional feature map, which can be denoted as F. The 8 by 8 blocks of this feature map are fed into the LSTM cells in order to learn the boundary transformation, which will be discussed in Section 3.1.2.
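The division of the feature map into blocks can be sketched in NumPy as follows. The sketch assumes that the feature map keeps the 64 × 64 size of the input patch, so that an 8 by 8 grid gives 64 blocks of 8 × 8 values each, and that blocks are ordered left to right, top to bottom; neither assumption is confirmed by the text above, and the function name is hypothetical.

```python
import numpy as np

def feature_map_to_block_sequence(feature_map, grid=8):
    """Split a 2D feature map into grid x grid blocks, one flattened block per row.

    Returns an array of shape (grid * grid, block_h * block_w). Blocks are
    ordered left to right, top to bottom (the ordering is an assumption).
    """
    height, width = feature_map.shape
    block_h, block_w = height // grid, width // grid
    blocks = feature_map.reshape(grid, block_h, grid, block_w)
    blocks = blocks.transpose(0, 2, 1, 3)  # (grid, grid, block_h, block_w)
    return blocks.reshape(grid * grid, block_h * block_w)

# Usage: a 64 x 64 feature map becomes a sequence of 64 block vectors.
feature_map = np.arange(64 * 64, dtype=np.float32).reshape(64, 64)
sequence = feature_map_to_block_sequence(feature_map)
print(sequence.shape)  # (64, 64)
```

Each row of `sequence` would then be the input to one LSTM step, so neighboring blocks in the patch become neighboring elements in the sequence.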

The output feature from the LSTM network is used as input to the later convolutional layers. These convolutional layers learn the mapping between the boundary transformation features from the LSTM and the tampered pixels using the ground-truth mask. Unlike conventional CNNs, we do not use a pooling mechanism in every convolution layer, as it may cause loss of information. We only use max pooling in the third convolution layer.

Motivated by the segmentation work presented in [40], we also utilize two fully convolutional layers (conv layers 4 and 5, as shown in Fig. 2) at the end. In [55, 40], segmentation networks represent features coarsely, which is finally compensated for by an upsampling operation to match the dimensions of the ground-truth mask. In contrast to these approaches, we do not use an upsampling operation, as it might create additional distortion. In our network, the size
