What contributions have the authors mentioned in the paper "Weakly supervised cascaded convolutional networks" ?

The authors introduce two such architectures, with either two cascade stages or three which are trained in an end-to-end pipeline. In the case of the three stage architecture, the middle stage provides object segmentation, using the output of the activation maps of first stage.

What is the way to train the multiple instance learning loss?

Using the selected candidate bounding boxes from the previous stage, the network trains the multiple instance learning loss to select the best sample for each object present in an image.

What is the importance of getting better object proposals?

Given the importance of getting better object proposals the authors added a middle stage to the previous architecture in their three stage network.

What is the total loss function of the cascaded network?

The total loss function of the cascaded network is $L_{Total} = L_{GAP}(y, I) + \lambda L_{MIL}(y, x, I)$ (2), where λ is the hyper-parameter balancing the two loss functions.

How does the performance of the proposed cascaded network differ from other approaches?

For instance, when the segmentation stage is used with the Alexnet architecture, the cascaded network improves by almost 2.5% on detection and 2% on classification on PASCAL VOC 2007.

(Open Access) Weakly Supervised Cascaded Convolutional Networks (2017) | Ali Diba

Q: What is the loss function for the class activation map?

Since multiple categories can exist in a single image [22], the authors use an independent loss function for each class in this branch of the CNN architecture, so the loss function is the sum of C binary logistic regression loss functions.

Q: What is the common way of weakly supervised learning methods?

Most weakly supervised learning methods work by selecting the candidate positive object instances in the positive bags, and then learning a model of the object appearance from them.

Q: What are the main components of the proposed CNN?

CNN architectures: 1. Loc Net: Inspired by [36], the authors removed the fully-connected layers from each of Alexnet and VGG-16 and replaced them with two convolutional layers and one global pooling layer.

Q: What are the main datasets used for the proposed methods?

The experiments for their proposed methods are extensively done on the PASCAL VOC 2007, 2010, 2012 datasets and also ILSVRC 2013, 2014 which are large scale datasets for objects.

Q: What is the main reason why Wang et al. use probabilistic latent semantic analysis?

Wang et al. [35] employ probabilistic latent semantic analysis on the windows of positive samples to select the most discriminative clusters that represent the object category.

Weakly Supervised Cascaded Convolutional Networks

Ali Diba¹, Vivek Sharma²,⋆, Ali Pazandeh³, Hamed Pirsiavash⁴ and Luc Van Gool¹,⁵

¹ESAT-PSI, KU Leuven, ²CV:HCI, Karlsruhe Institute of Technology, ³Sharif University, ⁴University of Maryland Baltimore County, ⁵CVL, ETH Zürich

ali.diba@kuleuven.be, vivek.sharma@kit.edu, pazandeh@ee.sharif.edu, hpirsiav@umbc.edu

Abstract

Object detection is a challenging task in the visual understanding domain, and even more so if the supervision is to be weak. Recently, a few efforts to handle the task without expensive human annotations have been made using promising deep neural networks. A new architecture of cascaded networks is proposed to learn a convolutional neural network (CNN) under such conditions. We introduce two such architectures, with either two or three cascade stages, which are trained in an end-to-end pipeline. The first stage of both architectures extracts the best candidates of class-specific region proposals by training a fully convolutional network. In the case of the three-stage architecture, the middle stage provides object segmentation, using the output of the activation maps of the first stage. The final stage of both architectures is a part of a convolutional neural network that performs multiple instance learning on proposals extracted in the previous stage(s). Our experiments on the PASCAL VOC 2007, 2010 and 2012 datasets and on the large-scale object datasets ILSVRC 2013 and 2014 show improvements in the areas of weakly supervised object detection, classification and localization.

1. Introduction

The ability to train a system that detects objects in cluttered scenes by only naming the objects in the training images, without specifying their number or their bounding boxes, is understood to be of major importance. It then becomes possible to annotate very large datasets or to automatically collect them from the web.

Most current methods to train object detection systems assume strong supervision [12, 26, 19]. Providing both the bounding boxes and their labels as annotations for each object still renders such methods more powerful than their weakly supervised counterparts. Although the availability of larger sets of training data is advantageous for the training of convolutional neural networks (CNNs), weak supervision as a means of producing those has only been embraced to a limited degree.

⋆ This work was carried out while he was at ESAT-PSI, KU Leuven.

Figure 1. Weakly Supervised Cascaded Deep CNN: Overview of the proposed cascaded weakly supervised object detection and classification method. Our cascaded networks take images and existing object labels to find the best locations of object samples in each image. Networks trained on these locations are capable of detecting and classifying objects in images under weak supervision.

The proposed weak supervision methods have come in several flavors. One of the most common approaches [7] consists of the following steps. The first step generates object proposals. The second step extracts features from the proposals. The final step applies multiple instance learning (MIL) to the features and finds the box labels from the weak bag (image) labels. This approach can thus be improved by enhancing any of its steps. For instance, it would be advantageous if the first step were to produce more reliable, and therefore fewer, object proposals.

It is the aforementioned approach that our weak supervision algorithm also follows. To improve the detection performance, object proposal generation, feature extraction, and MIL are trained in a cascaded manner, in an end-to-end way. We propose two architectures. The first is a two-stage network. Its first stage extracts class-specific object proposals using a fully convolutional network followed by a global average (max) pooling layer. Its last stage extracts features from the object proposals by a ROI pooling layer and performs MIL. Given the importance of getting better object proposals, we added a middle stage to the previous architecture in our three-stage network. This middle stage performs a class-specific segmentation using the input images and the extracted objectness of the first stage. This results in more reliable object proposals and better detection.

The proposed architecture improves both initial object proposal extraction and final object detection. In the forward sense, less noisy proposals indeed lead to improved object detection, due to the non-convexity of the cost function. In the reverse, backward sense, due to the weight sharing between the first layers of both stages, training the MIL on the extracted proposals will improve the performance of feature extraction in the first convolutional layers and as a result will produce more reliable proposals.

Next, we review related works in section 2 and discuss our proposed method in section 3. In section 4 we explain the details of our experiments, including the datasets and the complete set of experiments and results.

2. Related works

Weakly supervised detection: In the last decade, several weakly supervised object detection methods have been studied using multiple instance learning algorithms [4, 5, 29, 30]. To do so, they define an image as a bag of regions, wherein they assume that an image labeled positive contains at least one object instance of a certain category and an image labeled negative does not contain an object from the category of interest. Most weakly supervised learning methods work by selecting the candidate positive object instances in the positive bags, and then learning a model of the object appearance from them. Because the training phase of the MIL problem alternates between out of bag object extraction and training classifiers, the solutions are non-convex and, as a result, sensitive to the initialization. In practice, a bad initialization is prone to getting the solution stuck in a local optimum instead of the global optimum. To alleviate this shortcoming, several methods try to improve the initialization [31, 9, 28, 29], as the solution strongly depends on it, while some others focus on regularizing the optimization strategies [4, 5, 7]. Kumar et al. [17] employ an iterative self-learning strategy that adds harder samples to a small set of initial samples at the training stage. Joulin et al. [15] use a convex relaxation of the soft-max loss in order to reduce the tendency to get stuck in local minima. Deselaers et al. [9] initialize the object locations via the objectness score. Cinbis et al. [7] split the training data in a multi-fold manner to avoid getting trapped in local minima. In order to be more robust to poor initialization, Song et al. [30] apply Nesterov's smoothing technique to the latent SVM formulation [10]. In [31], the same authors initialize the object locations based on a sub-modular clustering method. Bilen et al. [4] formulate the MIL to softly label the object instances by regularizing the latent object locations based on penalizing unlikely configurations. Further, in [5], the authors extend their work [4] by enforcing similarity between object windows via a regularization technique. Wang et al. [35] employ probabilistic latent semantic analysis on the windows of positive samples to select the most discriminative clusters that represent the object category. As a matter of fact, the majority of the previous works [25, 32] use a large collection of noisy object proposals to train their object detector. In contrast, our method focuses on a small, clean collection of object proposals that is far more reliable, robust and computationally efficient, and gives better performance.

Object proposal generation: In [20, 23], Nguyen et al. and Pandey et al. extract dense regions of candidate proposals from an image using an initial bounding box. To handle the problem of not being able to generate enough candidate proposals because of the fixed shape and size, object saliency based approaches [9, 28, 29] were proposed to extract region proposals. Following this, a generic objectness measure [1] was employed to extract region proposals. The selective search algorithm [33], a segmentation-based object proposal generation method, was then proposed and is currently among the most promising techniques used for proposal generation. Recently, Ghodrati et al. [11] proposed an inverse cascade method using various CNN feature maps to localize object proposals in a coarse-to-fine manner.

CNN based weakly supervised object detection: In view of the promising results of CNNs for visual recognition, some recent efforts in weakly supervised classification have been based on CNNs. Oquab et al. [21] improved feature discrimination based on a pre-trained CNN. In [22], the same authors improved the performance further by incorporating both localization and classification in a new CNN architecture. Bilen et al. [4] proposed a CNN-based convex optimization method to avoid getting stuck in local minima. Their soft similarity between possible regions and clusters was helpful in improving the optimization. Li et al. [18] introduced a class-specific object proposal generation based on the mask-out strategy of [2], in order to have a reliable initialization. They also proposed a two-stage algorithm consisting of classification adaptation and detection adaptation.

3. Proposed Method

This section introduces our weak cascaded convolutional networks (WCCN) for object detection and classification with weak supervision. Our networks are designed to learn multiple different but related tasks jointly. The tasks are classification, localization, and multiple instance learning. We show that learning these tasks jointly in an end-to-end fashion results in better object detection and localization. The goal is to learn good appearance models from images with multiple objects where the only manual supervision signal is image-level labels. Our main contribution is improving multiple object detection with such weak annotation. To this end, we propose two different cascaded network architectures. The first one is a 2-stage cascade network that first localizes the objects and then learns to detect them in a multiple instance learning framework. Our second architecture is a 3-stage cascade network where the new middle stage performs semantic segmentation with pseudo ground truth in a weakly supervised setting.

Figure 2. WCCN (2-stage): The pipeline of the end-to-end 2-stage cascaded CNN for weakly supervised object detection. Inputs to the network are images, labels and unsupervised object proposals. The first stage (LocNet) learns to create a class activation map based on object categories to produce candidate boxes for each object instance. The second stage (MilNet) picks the best bounding box among the candidates to represent the specific category using a multiple instance learning loss.

3.1. Two-stage Cascade

As mentioned earlier, there are only a few end-to-end frameworks with deep CNNs for weakly supervised object detection. In particular, there is not much prior art on object localization without localization-level supervision. Suppose we have a dataset $\mathcal{I}$ of $N$ training images in $C$ classes. The set is given as $\mathcal{I} = \{(I_1, y_1), \ldots, (I_N, y_N)\}$, where $I_k$ is an image and $y_k = [y_k^1, \ldots, y_k^C] \in \{0, 1\}^C$ is a vector of labels indicating the presence or absence of each class in image $I_k$.

In the proposed cascaded network, the initial fully-convolutional stage learns to infer object location maps based on the object labels in the given images. This stage produces candidate object boxes as input to the next stage. The last stage selects the best boxes through end-to-end multiple instance learning.

First stage (Location network): The first stage of our cascaded model is a fully-convolutional CNN with a global average pooling (GAP) or global maximum pooling (GMP) layer, inspired by [36]. The training yields the object location or 'class activation' maps, which provide candidate bounding boxes. Since multiple categories can exist in a single image [22], we use an independent loss function for each class in this branch of the CNN architecture, so the loss function is the sum of C binary logistic regression loss functions.
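The first-stage loss described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `sigmoid` and `multilabel_logistic_loss`, the `eps` guard, and the assumption that the global pooling layer yields one real-valued score per class are all ours.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def multilabel_logistic_loss(scores, labels):
    """Sum of C independent binary logistic regression losses.

    scores: list of C real-valued class scores (assumed to come from
            the global pooling layer; this interface is hypothetical).
    labels: list of C binary labels (1 = class present in the image).
    """
    eps = 1e-12  # guards against log(0)
    loss = 0.0
    for s, y in zip(scores, labels):
        p = sigmoid(s)
        loss -= y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
    return loss
```

Because each class has its own binary term, an image containing several categories is penalized independently for each one, which a single soft-max over classes would not allow.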

Last stage (MIL network): The goal of the last stage is to select the best candidate boxes for each class from the outputs of the first stage using multiple instance learning (MIL). To obtain an end-to-end framework, we incorporate an MIL loss function into our network. Assume $x = \{x_j \mid j = 1, 2, \ldots, n\}$ is a bag of instances for image $I$, where $x_j$ is a candidate box, and assume $f \in \mathbb{R}^{C \times n}$, where $f_{cj}$ is the score of box $x_j$ belonging to category $c$. We use a ROI-pooling layer [12] to obtain $f_{cj}$. We define the probabilities and loss as:

$$P_c(x, I) = \frac{\exp\left(\max_j f_{cj}\right)}{\sum_{k=1}^{C} \exp\left(\max_j f_{kj}\right)}, \qquad L_{MIL}(y, x, I) = -\sum_{c=1}^{C} y_c \log\left(P_c(x, I)\right) \qquad (1)$$
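As a rough illustration of the MIL loss, the sketch below takes the maximum box score per class, applies a soft-max over classes, and sums the negative log-probabilities of the classes present in the image. The function name `mil_loss`, the list-of-lists layout of the scores, and the reading that only present classes contribute to the loss are assumptions on our part.

```python
import math

def mil_loss(box_scores, labels):
    """MIL loss over a bag of candidate boxes.

    box_scores[c][j]: score of candidate box j for class c (C x n).
    labels[c]: 1 if class c is present in the image, else 0.
    Returns (loss, per-class log-probabilities, best box index per class).
    """
    # Max over boxes: each class is represented by its best-scoring box.
    class_max = [max(row) for row in box_scores]
    # Log soft-max over classes, shifted by the largest value for stability.
    shift = max(class_max)
    log_total = math.log(sum(math.exp(m - shift) for m in class_max))
    log_probs = [(m - shift) - log_total for m in class_max]
    # Only classes present in the image contribute (assumed y_c weighting).
    loss = -sum(lp for y, lp in zip(labels, log_probs) if y)
    best_box = [row.index(max(row)) for row in box_scores]
    return loss, log_probs, best_box
```

The index returned in `best_box` is the box that would be selected as the representative of each class, which is the quantity the last stage is meant to learn.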

The weights for conv1 through conv5 are shared between the two stages. For the last stage, we have two additional fully connected layers and a score layer for learning the MIL task.

End-to-End Training: The whole cascade with two loss functions is learned jointly by end-to-end stochastic gradient descent optimization. The total loss function of the cascaded network is:

$$L_{Total} = L_{GAP}(y, I) + \lambda L_{MIL}(y, x, I) \qquad (2)$$

where λ is the hyper-parameter balancing the two loss functions. In the experiments, we set λ = 1. We suspect that cross-validation on this hyper-parameter could improve the results.

Figure 3. WCCN (3-stage): The pipeline of the end-to-end 3-stage cascaded CNN for weakly supervised object detection. For this cascaded network, we designed a new architecture with weakly supervised segmentation (SegNet) as the middle stage, so the first and last stages are identical to the stages of the previous cascade. The new stage improves the selection of candidate bounding boxes by providing more accurate object regions.

Generating bag of instances: We use EdgeBoxes [37] to generate an initial set of object proposals. Then we threshold the class activation map [36] to obtain a mask. Finally, we choose the initial boxes with the largest overlap with the mask.
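The three steps for building the bag of instances can be sketched as follows. The text does not state the threshold, the overlap measure, or how many boxes are kept, so the relative threshold `ratio`, the intersection-over-union between box area and mask cells, and `top_k` are all hypothetical choices. The activation map is assumed to be non-negative with a positive peak, and boxes are inclusive cell coordinates on the map's grid.

```python
def threshold_mask(activation, ratio=0.5):
    """Binary mask of cells whose activation is at least ratio * peak."""
    peak = max(max(row) for row in activation)
    return [[1 if v >= ratio * peak else 0 for v in row] for row in activation]

def mask_overlap(box, mask):
    """Intersection-over-union between a box and the mask's nonzero cells.

    box: (x1, y1, x2, y2), inclusive cell coordinates.
    """
    x1, y1, x2, y2 = box
    inside = sum(mask[y][x] for y in range(y1, y2 + 1) for x in range(x1, x2 + 1))
    box_area = (x2 - x1 + 1) * (y2 - y1 + 1)
    mask_area = sum(sum(row) for row in mask)
    return inside / (box_area + mask_area - inside)

def select_boxes(proposals, activation, top_k=2, ratio=0.5):
    """Keep the top_k proposals that overlap most with the thresholded map."""
    mask = threshold_mask(activation, ratio)
    ranked = sorted(proposals, key=lambda b: mask_overlap(b, mask), reverse=True)
    return ranked[:top_k]
```

For example, with a 4 × 4 map whose high activations form a 2 × 2 block, a proposal covering exactly that block ranks above one covering the whole map.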

3.2. Three-stage Cascade

In this section, we extend our 2-stage cascaded model with another stage that adds object segmentation as another task. We believe that more information about the objects' boundaries learned in a segmentation task can lead to a better appearance model and then better object localization. For this purpose, our new stage uses another form of weak supervision to learn a segmentation model, embedded in the cascaded network and trained along with the other stages. This extra stage helps the multi-loss CNN to have better initial locations for choosing candidate bounding boxes to pass to the next stage. So this new cascade has three stages: the first stage, similar to the previous cascade, is a CNN with a global pooling layer; the middle stage is a fully convolutional network with a segmentation loss; the last stage is multiple instance learning with the corresponding loss.

Middle stage (Segmentation Loss): Inspired by [3, 24], we propose to use a weakly supervised segmentation network which uses a point location on the object together with its label as supervisory signals. Incorporating the initial object location from the previous stage (location network) in the segmentation stage yields a more meaningful object location map. The weak segmentation network uses the results of the first stage as supervision signal (i.e., pseudo ground truth) and learns jointly with the MIL stage to further improve the object localization results.

In the middle stage, we add a fully convolutional CNN similar to the one in [3] to our network. The final layer is a pixel-wise softmax that outputs $S \in \mathbb{R}^{C \times m}$, where $m$ is the number of pixels in the image. Denoting by $H_c$ the heatmap for class $c$, we define $\alpha_c = \max(H_c)$ across the whole image and $I_c$ to be the neighborhood around $\arg\max(H_c)$. In the experiments, we use a neighborhood of 3 × 3 pixels. Note that our formulation closely follows the one in [3], except that our point-wise annotation is provided by the automatically generated heatmap rather than manual annotation.

Considering $y$ as the label set for image $I$, the loss function for the weakly supervised segmentation network is given by:

$$L_{Seg}(S, H, y) = -\sum_{c=1}^{C} y_c \left( \log(S_{c t_c}) + \sum_{i \in I_c} \alpha_c \log(S_{ci}) \right) \qquad (3)$$

where $t_c = \arg\max_{i \in I} S_{ci}$. The first term is used for image-level label supervision and the second term is for the set of pixels that the heatmap confidently predicted to be a point on the object. Note that $\alpha_c$ in the second term puts more emphasis on the more confident categories.
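A hedged sketch of the segmentation loss follows, under our reading of the two terms: for each class present in the image, the first term takes the log of the highest soft-max probability for that class over the whole image, and the second term sums the log-probabilities over the 3 × 3 neighborhood around the heatmap peak, weighted by the peak value. The function name `seg_loss`, the 2-D grid layout, the clipping of the neighborhood at image borders, and the `eps` guard are assumptions.

```python
import math

def seg_loss(S, H, labels):
    """Weakly supervised segmentation loss for one image.

    S[c][r][k]: soft-max probability of class c at pixel (r, k).
    H[c][r][k]: heatmap value of class c at pixel (r, k).
    labels[c]: 1 if class c is present in the image, else 0.
    """
    eps = 1e-12  # guards against log(0)
    loss = 0.0
    for c, y in enumerate(labels):
        if not y:
            continue  # absent classes do not contribute (assumed y_c weighting)
        rows, cols = len(H[c]), len(H[c][0])
        # Peak heatmap value (the class weight) and its pixel location.
        alpha, pr, pk = max(
            (H[c][r][k], r, k) for r in range(rows) for k in range(cols)
        )
        # First term: the most confident pixel for class c in S.
        term = math.log(max(max(row) for row in S[c]) + eps)
        # Second term: 3x3 neighborhood around the heatmap peak.
        for r in range(max(0, pr - 1), min(rows, pr + 2)):
            for k in range(max(0, pk - 1), min(cols, pk + 2)):
                term += alpha * math.log(S[c][r][k] + eps)
        loss -= term
    return loss
```

Since the weight is the heatmap peak, a class that the first stage localizes confidently pulls harder on the segmentation output than one with a weak heatmap.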

Due to the additional supervision from the pseudo ground truth provided by the heatmap, the middle stage provides a better segmentation map than the original heatmap. Hence, we pass the resulting segmentation map to the final MIL stage to find candidate boxes by overlap and then calculate the MIL loss.

The output of this middle stage is a set of candidate object bounding boxes that are passed to the next stage of the CNN cascade, which uses multiple instance learning to choose the most accurate box as the representative of the object category. In the experiments, we show that learning this extra task as another stage of the cascade can improve the performance of the whole network as a weakly supervised classifier.

End-to-End Training: Similar to the previous cascade, the total loss in Eq. 4 is calculated by simply adding all three loss terms. We learn all parameters of the network jointly in an end-to-end fashion.

$$L_{Total} = L_{GAP}(y, I) + \gamma L_{Seg}(y, I) + \lambda L_{MIL}(y, x, I) \qquad (4)$$

In the experiments, we set λ = 1 and γ = 1.
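To make the weighting of the stage losses concrete, here is a minimal sketch. The function name `total_loss` and the numeric values in the usage note are hypothetical; only the three loss terms and the weights λ and γ come from the text. The 2-stage cascade is treated as the case with no segmentation term.

```python
def total_loss(l_gap, l_mil, l_seg=None, lam=1.0, gamma=1.0):
    """Weighted sum of the stage losses.

    l_gap: loss of the first (global pooling) stage.
    l_mil: loss of the last (MIL) stage, weighted by lam.
    l_seg: loss of the middle (segmentation) stage, weighted by gamma;
           None for the 2-stage cascade, which has no such term.
    """
    total = l_gap + lam * l_mil
    if l_seg is not None:
        total += gamma * l_seg
    return total
```

With the weights used in the experiments (both equal to 1), the three-stage total is the plain sum of the three terms, for example `total_loss(0.5, 0.25, l_seg=1.0)` gives 1.75.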

3.3. Object Detection Training

Since we are interested in weakly supervised object detection, we propose to use the output of our network as pseudo ground truth in a standard object detection framework, e.g., Fast-RCNN [12]. There are two ways of doing this: we can either train a standard Fast-RCNN without our trained model, or we can transfer our learned model into the Fast-RCNN framework and fine-tune it. For the latter case, we use the shared early convolutional layers along with the fully connected layers in the last stage of our model. In both cases, at testing time, we extract object proposals with EdgeBoxes [37], use the trained Fast-RCNN to detect objects among the pool of proposals, and perform non-maximum suppression.
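The final non-maximum suppression step can be illustrated with the usual greedy procedure: keep the highest-scoring detection, drop any remaining detection that overlaps a kept one too much, and repeat. This is a generic sketch rather than the code used in the experiments; the function names and the threshold of 0.3 are assumptions, as the text does not give a value.

```python
def box_iou(a, b):
    """IoU of two boxes (x1, y1, x2, y2) with continuous coordinates."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, iou_threshold=0.3):
    """Greedy NMS; returns indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap any already-kept box too much.
        if all(box_iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep
```

In a detection pipeline this would typically be applied per class, so that overlapping detections of different categories do not suppress each other.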

4. Experiments

In this section, we discuss the details of our methods and the experiments in which we applied them to weakly supervised object detection and classification. We introduce the datasets and analyze the performance of our approaches on them under different aspects of evaluation.

4.1. Datasets and metrics

The experiments for our proposed methods are done extensively on the PASCAL VOC 2007, 2010 and 2012 datasets and also on ILSVRC 2013 and 2014, which are large-scale object datasets. PASCAL VOC is the more common dataset for evaluating weakly supervised object detection approaches. The VOC datasets have 20 object categories, while the ILSVRC dataset has 200 categories, which we also targeted for weakly supervised object classification and localization. For all of the mentioned datasets, we use the standard train, validation and test sets.

Experimental metrics: To measure the object detection performance, average precision (AP) and correct localization (CorLoc) are used. Average precision is the standard metric from PASCAL VOC, which counts a bounding box as a true detection if it has an intersection-over-union (IoU) of more than 50% with a ground-truth box. CorLoc is the fraction of positive images in which the most confident detection box correctly localizes at least one object instance of the target category. For object classification, we also use the PASCAL VOC standard average precision.
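The IoU criterion and CorLoc can be written down directly from these definitions. The sketch below assumes boxes given as (x1, y1, x2, y2) with continuous coordinates and one most-confident detection per positive image; the function names and the data layout are ours.

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def corloc(top_boxes, gt_boxes, threshold=0.5):
    """Fraction of positive images whose top detection hits an object.

    top_boxes[i]: most confident detection for the target class in image i.
    gt_boxes[i]: list of ground-truth boxes of that class in image i.
    """
    if not top_boxes:
        return 0.0
    hits = sum(
        1
        for det, gts in zip(top_boxes, gt_boxes)
        if any(iou(det, gt) > threshold for gt in gts)
    )
    return hits / len(top_boxes)
```

Unlike average precision, CorLoc looks at a single box per positive image, so it measures localization on the training-style images rather than full detection performance.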

4.2. Experimental and implementation details

We have evaluated both of our proposed cascaded CNNs with two architectures: Alexnet [16] and VGG-16 [27]. In each case, the network has been pre-trained on the ImageNet dataset [8]. Since the multiple stages of the cascades contain different CNN losses, in the following we explain the details of each part separately to give a better overview of the implementation.

CNN architectures:

1. Loc Net: Inspired by [36], we removed the fully-connected layers from each of Alexnet and VGG-16 and replaced them with two convolutional layers and one global pooling layer. So for Alexnet, the layers after the conv5 layer have been removed, and for VGG-16 those after conv5-3. For the global pooling layer, we tested average and max pooling and found that global average pooling performs better than maximum pooling. For the training loss of this part of the network, we use a simple sum of C (number of classes) binary logistic regression losses, similar to [22].

2. Seg Net: This part of the network is the middle stage in the 3-stage cascaded network and is the well-known fully convolutional network for the segmentation task [3]. The convolutional part, which comes from the first stage, is shared with the other stages, and additional fully-connected layers and a deconvolutional layer are used to produce the segmentation map.


References

ImageNet Classification with Deep Convolutional Neural Networks

Very Deep Convolutional Networks for Large-Scale Image Recognition

ImageNet: A large-scale hierarchical image database

SSD: Single Shot MultiBox Detector

Fast R-CNN

