
Weakly Supervised Cascaded Convolutional Networks

Ali Diba¹, Vivek Sharma²,⋆, Ali Pazandeh³, Hamed Pirsiavash⁴ and Luc Van Gool¹,⁵
¹ESAT-PSI, KU Leuven, ²CV:HCI, Karlsruhe Institute of Technology, ³Sharif University, ⁴University of Maryland Baltimore County, ⁵CVL, ETH Zürich
ali.diba@kuleuven.be, vivek.sharma@kit.edu, pazandeh@ee.sharif.edu, hpirsiav@umbc.edu
Abstract
Object detection is a challenging task in the visual understanding domain, and even more so when the supervision is weak. Recently, a few efforts have been made to handle the task without expensive human annotations, building on promising deep neural networks. A new architecture of cascaded networks is proposed to learn a convolutional neural network (CNN) under such conditions. We introduce two such architectures, with either two or three cascade stages, which are trained in an end-to-end pipeline. The first stage of both architectures extracts the best candidate class-specific region proposals by training a fully convolutional network. In the case of the three-stage architecture, the middle stage provides object segmentation, using the activation maps output by the first stage. The final stage of both architectures is a part of a convolutional neural network that performs multiple instance learning on the proposals extracted in the previous stage(s). Our experiments on the PASCAL VOC 2007, 2010 and 2012 datasets and on the large-scale ILSVRC 2013 and 2014 object datasets show improvements in weakly supervised object detection, classification and localization.
1. Introduction
The ability to train a system that detects objects in cluttered scenes by only naming the objects in the training images, without specifying their number or their bounding boxes, is understood to be of major importance. It then becomes possible to annotate very large datasets or to collect them automatically from the web.
Most current methods to train object detection systems assume strong supervision [12, 26, 19]. Providing both the bounding boxes and their labels as annotations for each object still renders such methods more powerful than their weakly supervised counterparts. Although the availability of larger sets of training data is advantageous for the training of convolutional neural networks (CNNs), weak supervision as a means of producing those has only been embraced to a limited degree.

⋆ This work was carried out while he was at ESAT-PSI, KU Leuven.

[Figure 1 diagram labels: CONV, Primary Stage, Secondary Stage, cat]
Figure 1. Weakly Supervised Cascaded Deep CNN: Overview of the proposed cascaded weakly supervised object detection and classification method. Our cascaded networks take images and existing object labels to find the best location of object samples in each image. Networks trained on these locations are capable of detecting and classifying objects in images under weak supervision.
The weak supervision methods proposed so far come in several flavors. One of the most common approaches [7] consists of the following steps. The first step generates object proposals. The second step extracts features from the proposals. The final step applies multiple instance learning (MIL) to the features and finds the box labels from the weak bag (image) labels. This approach can thus be improved by enhancing any of its steps. For instance, it would be advantageous if the first step were to produce more reliable (and therefore fewer) object proposals.
Our weak supervision algorithm also follows the aforementioned approach. To improve the detection performance, object proposal generation, feature extraction, and MIL are trained in a cascaded manner, end to end. We propose two architectures. The first is a two-stage network. Its first stage extracts class-specific object proposals using a fully convolutional network followed by a global average (or max) pooling layer. Its last stage extracts features from the object proposals with an ROI pooling layer and performs MIL. Given the importance of getting better object proposals, we add a middle stage to this architecture to obtain our three-stage network. This middle stage performs a class-specific segmentation using the input images and the objectness extracted by the first stage. This results in more reliable object proposals and better detection.
The proposed architecture improves both initial object proposal extraction and final object detection. In the forward sense, less noisy proposals lead to improved object detection, because the cost function is non-convex and therefore sensitive to the quality of its starting point. In the reverse, backward sense, due to the weight sharing between the first layers of both stages, training the MIL on the extracted proposals improves the feature extraction in the first convolutional layers and, as a result, produces more reliable proposals.
Next, we review related work in Section 2 and describe our proposed method in Section 3. In Section 4 we explain the details of our experiments, including the datasets and the complete set of experiments and results.
2. Related works
Weakly supervised detection: In the last decade, several weakly supervised object detection methods have been studied using multiple instance learning algorithms [4, 5, 29, 30]. To do so, they define an image as a bag of regions, and assume that an image labeled positive contains at least one object instance of a certain category, while an image labeled negative does not contain any object from the category of interest. Most weakly supervised learning methods work by selecting the candidate positive object instances in the positive bags, and then learning a model of the object appearance from them. Because the training phase of the MIL problem alternates between extracting objects from the bags and training classifiers, the problem is non-convex and, as a result, the solution is sensitive to the initialization. In practice, a bad initialization tends to leave the solution stuck in a local optimum instead of the global optimum. To alleviate this shortcoming, several methods try to improve the initialization [31, 9, 28, 29], while others focus on regularizing the optimization strategies [4, 5, 7]. Kumar et al. [17] employ an iterative self-learning strategy that adds harder samples to a small set of initial samples during training. Joulin et al. [15] use a convex relaxation of the soft-max loss in order to reduce the tendency to get stuck in local minima. Deselaers et al. [9] initialize the object locations via the objectness score. Cinbis et al. [7] split the training data in a multi-fold manner to avoid getting trapped in local minima. In order to be more robust to poor initialization, Song et al. [30] apply Nesterov's smoothing technique to the latent SVM formulation [10]. In [31], the same authors initialize the object locations based on a sub-modular clustering method. Bilen et al. [4] formulate MIL to softly label the object instances by regularizing the latent object locations, penalizing unlikely configurations. Further, in [5], the authors extend their work [4] by enforcing similarity between object windows via a regularization technique. Wang et al. [35] employ probabilistic latent semantic analysis on the windows of positive samples to select the most discriminative clusters that represent the object category. In fact, the majority of previous works [25, 32] use a large collection of noisy object proposals to train their object detector. In contrast, our method focuses on a small, clean collection of object proposals that are far more reliable and robust, are computationally efficient to process, and give better performance.
Object proposal generation: In [20, 23], Nguyen et al. and Pandey et al. extract dense regions of candidate proposals from an image using an initial bounding box. Because a fixed shape and size prevents generating enough candidate proposals, approaches based on object saliency [9, 28, 29] were proposed to extract region proposals. Following this, a generic objectness measure [1] was employed to extract region proposals. The selective search algorithm [33], a segmentation-based object proposal generation method, was then proposed and is currently among the most promising techniques used for proposal generation. Recently, Ghodrati et al. [11] proposed an inverse cascade method using various CNN feature maps to localize object proposals in a coarse-to-fine manner.
CNN based weakly supervised object detection: In view of the promising results of CNNs for visual recognition, some recent efforts in weakly supervised classification have been based on CNNs. Oquab et al. [21] improved feature discrimination based on a pre-trained CNN. In [22], the same authors improved the performance further by incorporating both localization and classification in a new CNN architecture. Bilen et al. [4] proposed a CNN-based convex optimization method to avoid getting stuck in local minima. Their soft similarity between possible regions and clusters was helpful in improving the optimization. Li et al. [18] introduced class-specific object proposal generation based on the mask-out strategy of [2], in order to obtain a reliable initialization. They also proposed a two-stage algorithm consisting of classification adaptation and detection adaptation.
3. Proposed Method
This section introduces our weakly supervised cascaded convolutional networks (WCCN) for object detection and classification with weak supervision. Our networks are designed to learn multiple different but related tasks jointly. The tasks are classification, localization, and multiple instance learning. We show that learning these tasks jointly in an end-to-end fashion results in better object detection and localization. The goal is to learn good appearance models from images with multiple objects where the only manual supervision signal is image-level labels. Our main contribution is improving multiple object detection with such weak annotation. To this end, we propose two different cascaded network architectures. The first one is a 2-stage cascade network that first localizes the objects and then learns to detect them in a multiple instance learning framework. Our second architecture is a 3-stage cascade network where the new middle stage performs semantic segmentation with pseudo ground truth in a weakly supervised setting.

[Figure 2 diagram labels: Image, Shared Convs, Conv5, Global Pooling, Multi-Class Loss, Class Activation Map, Convs, ROI Pooling, FCs, MIL Loss, Stage 1 (LocNet, Loss1), Stage 2 (MilNet, Loss2)]
Figure 2. WCCN (2-stage): The pipeline of the end-to-end 2-stage cascaded CNN for weakly supervised object detection. Inputs to the network are images, labels and unsupervised object proposals. The first stage learns to create a class activation map based on object categories to produce candidate boxes for each object instance. The second stage picks the best bounding box among the candidates to represent the specific category using a multiple instance learning loss.
3.1. Two-stage Cascade
As mentioned earlier, there are only a few end-to-end frameworks with deep CNNs for weakly supervised object detection. In particular, there is not much prior art on object localization without location-level supervision. Suppose we have a dataset $\mathcal{I}$ of $N$ training images in $C$ classes. The set is given as $\mathcal{I} = \{(I_1, y_1), \ldots, (I_N, y_N)\}$, where $I_k$ is an image and $y_k = [y_1, \ldots, y_C] \in \{0, 1\}^C$ is a vector of labels indicating the presence or absence of each class in image $I_k$.
In the proposed cascaded network, the initial fully convolutional stage learns to infer object location maps based on the object labels in the given images. This stage produces candidate object boxes as input to the next stage. The last stage selects the best boxes through end-to-end multiple instance learning.
First stage (Location network): The first stage of our cascaded model is a fully convolutional CNN with a global average pooling (GAP) or global maximum pooling (GMP) layer, inspired by [36]. The training yields the object location or 'class activation' maps, which provide candidate bounding boxes. Since multiple categories can exist in a single image [22], we use an independent loss function for each class in this branch of the CNN architecture, so the loss function is the sum of $C$ binary logistic regression loss functions.
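The pooling and loss of the location network can be illustrated with a minimal pure-Python sketch. It assumes the network has already produced one score map per class; the function names and the toy values are hypothetical and are not taken from the authors' implementation.

```python
import math


def global_average_pool(class_maps):
    """Average each class map (a 2-D list of floats) into one score per class."""
    return [
        sum(sum(row) for row in cmap) / (len(cmap) * len(cmap[0]))
        for cmap in class_maps
    ]


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def multilabel_logistic_loss(scores, labels):
    """Sum of C independent binary logistic regression losses.

    For a present class (y = 1) the term is -log(sigmoid(s)) = softplus(-s);
    for an absent class (y = 0) it is -log(1 - sigmoid(s)) = softplus(s).
    """
    return sum(softplus(-s) if y == 1 else softplus(s)
               for s, y in zip(scores, labels))


# Toy input (hypothetical values): two 2x2 class maps, only class 0 is present.
class_maps = [
    [[2.0, 4.0], [0.0, 2.0]],
    [[-3.0, -1.0], [-2.0, -2.0]],
]
scores = global_average_pool(class_maps)
loss = multilabel_logistic_loss(scores, [1, 0])
```

Because each class contributes its own binary term, several classes can be present in the same image without competing for probability mass, which is the reason given above for preferring this loss over a single soft-max.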
Last stage (MIL network): The goal of the last stage is to select the best candidate boxes for each class from the outputs of the first stage using multiple instance learning (MIL). To obtain an end-to-end framework, we incorporate an MIL loss function into our network. Assume $x = \{x_j \mid j = 1, 2, \ldots, n\}$ is a bag of instances for image $I$, where $x_j$ is a candidate box, and assume $f \in \mathbb{R}^{C \times n}$ holds the scores, with $f_{cj}$ the score of box $x_j$ belonging to category $c$. We use an ROI-pooling layer [12] to obtain $f_{cj}$. We define the probabilities and loss as:

$$P_c(x, I) = \frac{\exp(\max_j f_{cj})}{\sum_{k=1}^{C} \exp(\max_j f_{kj})}, \qquad L_{MIL}(y, x, I) = -\sum_{c=1}^{C} y_c \log(P_c(x, I)) \tag{1}$$
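As a rough illustration of Eq. 1, the sketch below computes the MIL loss for one image from a score table. It reads the loss as a negative log-likelihood, so that minimizing it raises the probability of the labeled classes; that sign convention, the function name `mil_loss` and the toy scores are assumptions made for this example.

```python
import math


def mil_loss(box_scores, labels):
    """MIL loss for one image.

    box_scores[c][j] is the score f_cj of candidate box j for class c,
    labels[c] is 1 if class c is present in the image and 0 otherwise.
    Returns the loss and, per class, the index of the highest-scoring box.
    """
    # Max over the boxes in the bag gives one score per class.
    class_scores = [max(row) for row in box_scores]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(class_scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in class_scores))
    log_probs = [s - log_z for s in class_scores]
    loss = -sum(y * lp for y, lp in zip(labels, log_probs))
    best_boxes = [max(range(len(row)), key=row.__getitem__) for row in box_scores]
    return loss, best_boxes


# Toy bag (hypothetical values): 2 classes, 3 candidate boxes, class 0 present.
f = [
    [0.0, 2.0, 1.0],
    [0.0, -1.0, 0.0],
]
loss, best = mil_loss(f, [1, 0])
```

The max over boxes means that only the top-scoring box of each class receives gradient in a given step, which is how the bag-level label ends up selecting a single representative box.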
The weights of conv1 through conv5 are shared between the two stages. For the last stage, we have two additional fully connected layers and a score layer for learning the MIL task.
End-to-End Training: The whole cascade with two loss functions is learned jointly by end-to-end stochastic gradient descent optimization. The total loss function of the cascaded network is:

$$L_{Total} = L_{GAP}(y, I) + \lambda L_{MIL}(y, x, I) \tag{2}$$

where $\lambda$ is the hyper-parameter balancing the two loss functions. In the experiments, we set $\lambda = 1$. We suspect that cross-validation on this hyper-parameter could improve the results.

[Figure 3 diagram labels: Image, Shared Convs, Conv5, Global Pooling, Multi-Class Loss, Class Activation Map, Convs, Weakly supervised segmentation, Segmentation Loss, ROI Pooling, FCs, MIL Loss, Stage 1 (LocNet, Loss1), Stage 2 (SegNet, Loss2), Stage 3 (MilNet, Loss3)]
Figure 3. WCCN (3-stage): The pipeline of the end-to-end 3-stage cascaded CNN for weakly supervised object detection. For this cascaded network, we designed a new architecture with weakly supervised segmentation as the middle stage, so the first and last stages are identical to the stages of the previous cascade. The new stage improves the selection of candidate bounding boxes by providing more accurate object regions.
Generating bag of instances: We use EdgeBoxes [37] to generate an initial set of object proposals. Then we threshold the class activation map [36] to obtain a mask. Finally, we choose the initial boxes with the largest overlap with the mask.
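The three steps above can be sketched as follows. The text does not specify the threshold or the overlap measure, so this example assumes a threshold at a fraction of the map's peak value and measures overlap as intersection-over-union between the box and the mask pixels; `rel_threshold`, `top_k` and the function names are hypothetical.

```python
def threshold_mask(cam, rel_threshold=0.5):
    """Binary mask of pixels whose activation reaches a fraction of the peak."""
    peak = max(max(row) for row in cam)
    return [[value >= rel_threshold * peak for value in row] for row in cam]


def box_mask_iou(box, mask):
    """IoU between a box (x0, y0, x1, y1), inclusive pixel coordinates
    inside the image, and the set of True pixels in the mask."""
    x0, y0, x1, y1 = box
    inter = sum(1 for y in range(y0, y1 + 1)
                for x in range(x0, x1 + 1) if mask[y][x])
    box_area = (x1 - x0 + 1) * (y1 - y0 + 1)
    mask_area = sum(value for row in mask for value in row)
    union = box_area + mask_area - inter
    return inter / union if union else 0.0


def select_boxes(proposals, cam, top_k=2, rel_threshold=0.5):
    """Keep the top_k proposals that overlap most with the thresholded map."""
    mask = threshold_mask(cam, rel_threshold)
    ranked = sorted(proposals, key=lambda b: box_mask_iou(b, mask), reverse=True)
    return ranked[:top_k]


# Toy class activation map (hypothetical values) with a 2x2 hot region.
cam = [
    [0.0, 0.1, 0.1, 0.0],
    [0.1, 0.9, 1.0, 0.0],
    [0.0, 0.8, 0.7, 0.1],
    [0.0, 0.0, 0.1, 0.0],
]
proposals = [(0, 0, 3, 3), (1, 1, 2, 2), (0, 0, 1, 1), (1, 1, 3, 3)]
bag = select_boxes(proposals, cam, top_k=2)
```

Using IoU rather than raw intersection penalizes very large boxes that cover the mask only because they cover most of the image; a different overlap measure would change the ranking but not the structure of the procedure.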
3.2. Three-stage Cascade
In this section, we extend our 2-stage cascaded model with another stage that adds object segmentation as another task. We believe that more information about the objects' boundaries, learned in a segmentation task, can lead to a better appearance model and thus better object localization. For this purpose, our new stage uses another form of weak supervision to learn a segmentation model, embedded in the cascaded network and trained along with the other stages. This extra stage helps the multi-loss CNN to have better initial locations for choosing the candidate bounding boxes passed to the next stage. So this new cascade has three stages: the first stage, as in the previous cascade, is a CNN with a global pooling layer; the middle stage is a fully convolutional network with a segmentation loss; the last stage performs multiple instance learning with the corresponding loss.
Middle stage (Segmentation Loss): Inspired by [3, 24], we propose to use a weakly supervised segmentation network that uses a point location on the object and the image label as supervisory signals. Incorporating the initial object location from the previous stage (the location network) into the segmentation stage yields a more meaningful object location map. The weak segmentation network uses the results of the first stage as its supervision signal (i.e., pseudo ground truth) and learns jointly with the MIL stage to further improve the object localization results.
In the middle stage, we add a fully convolutional CNN similar to the one in [3] to our network. The final layer is a pixel-wise softmax that outputs $S \in \mathbb{R}^{C \times m}$, where $m$ is the number of pixels in the image. With $H_c$ denoting the heatmap for class $c$, we define $\alpha_c = \max(H_c)$ across the whole image and $I_c$ to be the neighborhood around $\arg\max(H_c)$. In the experiments, we use a neighborhood of $3 \times 3$ pixels. Note that our formulation closely follows the one in [3], except that our point-wise annotation is provided by the automatically generated heatmap rather than by manual annotation.
Considering $y$ as the label set for image $I$, the loss function for the weakly supervised segmentation network is given by:

$$L_{Seg}(S, H, y) = -\sum_{c=1}^{C} \Big[ y_c \log(S_{t_c c}) + \sum_{i \in I_c} \alpha_c \log(S_{ic}) \Big] \tag{3}$$

where $t_c = \arg\max_{i \in I} S_{ic}$. The first term is used for image-level label supervision and the second term is for the set of pixels that the heatmap confidently predicted to be a point on the object. Note that $\alpha_c$ in the second term places more emphasis on the more confident categories.
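A small pure-Python sketch may help make the two terms of the segmentation loss concrete. It assumes the loss is a negative log-likelihood, that the pixel-wise term is applied to every class with weight $\alpha_c$, and that images are stored as flattened row-major lists with `S[c][i]` holding the softmax output of class c at pixel i; these conventions, the function names and the toy values are assumptions for illustration only.

```python
import math


def neighborhood(index, height, width, radius=1):
    """Flattened indices of the (2*radius+1)^2 window centred on `index`,
    clipped at the image border (radius=1 gives the 3x3 case)."""
    r, c = divmod(index, width)
    return [rr * width + cc
            for rr in range(max(0, r - radius), min(height, r + radius + 1))
            for cc in range(max(0, c - radius), min(width, c + radius + 1))]


def seg_loss(S, H, y, height, width, eps=1e-12):
    """Weakly supervised segmentation loss for one image.

    S[c][i]: softmax output for class c at pixel i.
    H[c][i]: heatmap of class c from the location stage.
    y[c]:    1 if class c is present in the image, else 0.
    """
    loss = 0.0
    for c in range(len(y)):
        # Image-level term: the most confident pixel t_c of a present class.
        loss -= y[c] * math.log(max(S[c]) + eps)
        # Point-level term: pixels around the heatmap peak, weighted by alpha_c.
        alpha = max(H[c])
        peak = H[c].index(alpha)
        for i in neighborhood(peak, height, width):
            loss -= alpha * math.log(S[c][i] + eps)
    return loss


# Toy 2x2 image (hypothetical values): 2 classes, class 0 present.
S = [[0.5] * 4, [0.5] * 4]
H = [[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
loss = seg_loss(S, H, [1, 0], height=2, width=2)
```

Restricting the point-level term to classes with $y_c = 1$ would be a one-line change inside the loop; the source formula does not settle which reading is intended.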
Due to the additional supervision from the pseudo ground truth provided by the heatmap, the middle stage provides a better segmentation map than the original heatmap. Hence, we pass the resulting segmentation map to the final MIL stage, where it is used to find the candidate boxes that overlap with it, and then calculate the MIL loss.
The output of this middle stage is a set of candidate object bounding boxes that is pushed to the next stage of the CNN cascade, which uses multiple instance learning to choose the most accurate box as the representative of the object category. In the experiments, we show that learning this extra task as another stage of the cascade can improve the performance of the whole network as a weakly supervised classifier.
End-to-End Training: Similar to the previous cascade, the total loss in Eq. 4 is calculated by simply adding all three loss terms. We learn all parameters of the network jointly in an end-to-end fashion.

$$L_{Total} = L_{GAP}(y, I) + \gamma L_{Seg}(y, I) + \lambda L_{MIL}(y, x, I) \tag{4}$$

In the experiments, we set $\lambda = 1$ and $\gamma = 1$.
3.3. Object Detection Training
Since we are interested in weakly supervised object detection, we propose to use the output of our network as pseudo ground truth in a standard object detection framework, e.g., Fast-RCNN [12]. There are two ways of doing this: we can either train a standard Fast-RCNN without our trained model, or we can transfer our learned model into the Fast-RCNN framework and fine-tune it. For the latter case, we use the shared early convolutional layers along with the fully connected layers in the last stage of our model. In both cases, at test time, we extract object proposals with EdgeBoxes [37], use the trained Fast-RCNN to detect objects among the pool of proposals, and perform non-maximum suppression.
4. Experiments
In this section, we discuss the details of our methods and of the experiments in which we applied them to weakly supervised object detection and classification. We introduce the datasets and analyze the performance of our approaches on them under different aspects of evaluation.
4.1. Datasets and metrics
The experiments for our proposed methods are extensively done on the PASCAL VOC 2007, 2010 and 2012 datasets and also on ILSVRC 2013 and 2014, which are large-scale object datasets. PASCAL VOC is the more common dataset for evaluating weakly supervised object detection approaches. The VOC datasets have 20 object categories, while the ILSVRC dataset has 200 categories, which we also targeted for weakly supervised object classification and localization. For all of the mentioned datasets, we use the standard train, validation and test sets.
Experimental metrics: To measure the object detection performance, average precision (AP) and correct localization (CorLoc) are used. Average precision is the standard metric from PASCAL VOC, which counts a bounding box as a true detection when it has an intersection-over-union (IoU) of more than 50% with a ground-truth box. CorLoc is the fraction of positive images in which the method's most confident detection box correctly localizes at least one object instance of the target category. For object classification, we also use the standard PASCAL VOC average precision.
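The IoU criterion and the CorLoc measure described above can be sketched in a few lines. The box format `(x0, y0, x1, y1)` with continuous coordinates, the function names and the toy boxes are assumptions for this example; the official PASCAL VOC evaluation code may differ in details such as pixel-inclusive box areas.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def corloc(top_boxes, gt_boxes, iou_threshold=0.5):
    """Fraction of positive images whose most confident box hits an object.

    top_boxes[k]: most confident detection for the target class in image k.
    gt_boxes[k]:  list of ground-truth boxes of that class in image k.
    """
    hits = sum(
        any(iou(box, gt) > iou_threshold for gt in gts)
        for box, gts in zip(top_boxes, gt_boxes)
    )
    return hits / len(top_boxes)


# Toy example (hypothetical boxes): the first image is localized, the second is not.
top = [(0, 0, 10, 10), (0, 0, 10, 10)]
gts = [[(1, 1, 11, 11)], [(8, 8, 20, 20), (30, 30, 40, 40)]]
score = corloc(top, gts)
```

CorLoc only looks at the single most confident box per positive image, so unlike AP it does not penalize additional false detections.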
4.2. Experimental and implementation details
We have evaluated both of our proposed cascaded CNNs with two architectures: AlexNet [16] and VGG-16 [27]. In each case, the network has been pre-trained on the ImageNet dataset [8]. Since the multiple stages of the cascades contain different CNN losses, in the following we explain the details of each part separately to give a better overview of the implementation.
CNN architectures:
1. Loc Net: Inspired by [36], we removed the fully connected layers from each of AlexNet and VGG-16 and replaced them with two convolutional layers and one global pooling layer. So for AlexNet, the layers after conv5 have been removed, and for VGG-16 those after conv5-3. For the global pooling layer, we tested average and max pooling and found that global average pooling performs better than max pooling. For the training loss of this part of the network, we use a simple sum of $C$ (number of classes) binary logistic regression losses, similar to [22].
2. Seg Net: This part of the network is the middle stage in the 3-stage cascaded network and is the well-known fully convolutional network for segmentation [3]. The convolutional part is shared with the other stages and comes from the first stage; additional fully connected layers and a deconvolutional layer are used to produce the segmentation map.
References
ImageNet Classification with Deep Convolutional Neural Networks
Very Deep Convolutional Networks for Large-Scale Image Recognition
ImageNet: A large-scale hierarchical image database
SSD: Single Shot MultiBox Detector
Fast R-CNN