Weakly Supervised Object Localization
with Multi-Fold Multiple Instance Learning
Ramazan Gokberk Cinbis, Jakob Verbeek, and Cordelia Schmid, Fellow, IEEE
Abstract—Object category localization is a challenging problem in computer vision. Standard supervised training requires bounding
box annotations of object instances. This time-consuming annotation process is sidestepped in weakly supervised learning. In this
case, the supervised information is restricted to binary labels that indicate the absence/presence of object instances in the image,
without their locations. We follow a multiple-instance learning approach that iteratively trains the detector and infers the object locations
in the positive training images. Our main contribution is a multi-fold multiple instance learning procedure, which prevents training from
prematurely locking onto erroneous object locations. This procedure is particularly important when using high-dimensional
representations, such as Fisher vectors and convolutional neural network features. We also propose a window refinement method,
which improves the localization accuracy by incorporating an objectness prior. We present a detailed experimental evaluation using
the PASCAL VOC 2007 dataset, which verifies the effectiveness of our approach.
Index Terms—Weakly supervised learning, object detection
1 INTRODUCTION
Over the last decade significant progress has been made
in object category localization, as witnessed by the
PASCAL VOC challenges [20]. Training state-of-the-art
object detectors, however, requires bounding box annota-
tions of object instances, which are costly to acquire.
Weakly supervised learning (WSL) refers to methods that
rely on training data with incomplete ground-truth infor-
mation to learn recognition models. For object detection,
WSL from image-wide labels that indicate the presence of
instances of a category in images has recently been inten-
sively studied as a way to remove the need for bounding
box annotations, see e.g., [4], [8], [12], [15], [17], [35], [37],
[38], [40], [43], [45], [46], [47], [53]. Such methods can poten-
tially leverage the large amount of tagged images on the
internet as a data source to train object detectors. We give
an overview of the most relevant related work in Section 2.
Other examples of WSL include learning face recognition
models from image captions [6], or subtitle and script
information [19]. Yet another example is learning semantic
segmentation models from image-wide category labels [51].
Most WSL approaches are based on latent variable models
to account for the missing ground-truth information. Multi-
ple instance learning (MIL) [18] handles cases where the
weak supervision indicates that at least one positive instance
is present in a set of examples. More advanced inference and
learning methods are used in cases where the latent variable
structure is more complex, see e.g., [17], [40], [51]. Besides
weakly supervised training, mixed fully and weakly super-
vised [9], active [52], and semi-supervised [40] learning and
unsupervised object discovery [11] methods have also been
explored to reduce the amount of labeled training data for
object detector training. In active learning bounding box
annotations are used, but requested only for images where
the annotation is expected to be most effective. Semi-super-
vised learning, on the other hand, leverages unlabeled
images by automatically detecting objects in them, and uses
those to better model the object appearance variations.
In this paper we consider WSL to learn object detectors
from image-wide labels. We follow an MIL approach that
interleaves training of the detector with re-localization of
object instances on the positive training images. Following
recent state-of-the-art work in fully supervised detection
[13], [22], [50], we represent (tentative) detection windows
using Fisher vectors (FVs) [39] and convolutional neural net-
work (CNN) features [29]. As we explain in Section 3, when
used in an MIL framework, the high-dimensionality of the
window features makes MIL quickly converge to poor
local optima after initialization. Our main contribution is a
multi-fold training procedure for MIL, which avoids this
rapid convergence to poor local optima. A second novelty
of our approach is the use of a “contrastive” background
descriptor that is defined as the difference of a descriptor of
the object window and a descriptor of the remaining image
area. The score for this descriptor of a linear classifier can be
interpreted as the difference of scores for the foreground and
background. In this manner we direct the detector to learn
the difference between foreground and background appear-
ances. Finally, inspired by the objectness prior in [17], we
propose a window refinement method that improves the
weakly supervised localization accuracy by incorporating a
category-independent objectness measure.
We present a detailed evaluation using the VOC 2007
dataset in Section 4. The experimental results show that our
multi-fold MIL training improves performance for both FV
R.G. Cinbis is with the Department of Computer Engineering, Bilkent University, Ankara, Turkey. E-mail: gcinbis@cs.bilkent.edu.tr.
J. Verbeek and C. Schmid are with the LEAR team, Inria Grenoble Rhône-Alpes, Laboratoire Jean Kuntzmann, CNRS, University Grenoble Alpes, France. E-mail: {Jakob.Verbeek, cordelia.schmid}@inria.fr.
Manuscript received 25 Dec. 2014; revised 8 Dec. 2015; accepted 15 Jan. 2016. Date of publication 25 Feb. 2016; date of current version 12 Dec. 2016.
Recommended for acceptance by G. Mori.
For information on obtaining reprints of this article, please send e-mail to: reprints@ieee.org, and reference the Digital Object Identifier below.
Digital Object Identifier no. 10.1109/TPAMI.2016.2535231

and CNN features. We also show that WSL performance
can be further improved by combining the two descriptor
types and applying our window refinement method. The
evaluation shows that our system obtains state-of-the-art
results on VOC 2007. We also present results for VOC 2010,
which had not yet been used in previous work.
Part of the material presented here appeared in [14].
Besides a more detailed presentation and discussion of the
most recent related work, the current paper extends it in
several ways. We enhanced our WSL method by introduc-
ing a window refinement method. We also added
experiments using CNN features, and their combination
with FV features. Finally, we included experiments on
training in a mixed supervision setting, where part of the
images are weakly supervised and others are labeled with
full bounding-box annotations.
2 RELATED WORK
The majority of related work treats WSL for object detection
as a multiple instance learning [18] problem. Each image is
considered as a “bag” of examples given by tentative object
windows. Positive images are assumed to contain at least one
positive object instance window, while negative images only
contain negative windows. The object detector is then
obtained by alternating detector training, and using the detec-
tor to select the most likely object instances in positive images.
In many MIL problems, such as those for weakly
supervised face recognition [6], [19], the number of exam-
ples per bag is limited to a few dozen at most. In contrast,
there is a vast number of examples per bag in the case of
object detector training since the number of possible object
bounding boxes is quadratic in the number of image pixels.
Candidate window generation methods, e.g., [1], [24], [49],
[56], can be used to make MIL approaches to WSL for object
localization manageable, and make it possible to use power-
ful and computationally expensive object models.
Although candidate window generation methods can
significantly reduce the search space per image, the selec-
tion of windows across a large number of images is inher-
ently a challenging problem, where an iterative WSL
method can typically find only a local optimum depending
on the initial windows. Therefore, in this section, we first
overview the initialization methods proposed in the litera-
ture, and then summarize the iterative WSL approaches.
2.1 Initialization Methods
A number of different strategies to initialize the MIL detector
training have been proposed in the literature. A simple strat-
egy, e.g., taken in [28], [35], [38], is to initialize by taking large
windows in positive images that (nearly) cover the entire
image. This strategy exploits the inclusion structure of the
MIL problem for object detection. That is: although large
windows may contain a significant amount of background
features, they are likely to include positive object instances.
Another strategy is to utilize a class-independent saliency
measure that aims to predict whether a given image region
belongs to an object or not. For example, Deselaers et al. [17]
generate candidate windows using the objectness method
[2] and assign per-window weights using a saliency model
trained on a small set of non-target classes. Siva et al. [44]
instead estimate an unsupervised patch-level saliency map
for a given image by measuring the average similarity of
each patch to the other patches in a retrieved set of similar
images. In each image, an initial window is found by sam-
pling from the corresponding saliency map.
Alternatively, a class-specific initialization method can
be used. For example, Chum and Zisserman [12] select the
visual words that predominantly appear in the positive
training images and initialize WSL by finding the bounding
box of these visual words in each image. Siva and Xiang [45]
propose to initially select one of the candidate windows
sampled using the objectness method at each image such
that an objective function based on intra-class and inter-
class pairwise similarities is maximized. However, this for-
mulation leads to a difficult combinatorial optimization
problem. Siva et al. [43] propose a simplified approach
where a candidate window is selected for a given image
such that the distance from the selected window to its near-
est neighbor among windows from negative images is maxi-
mal. Relying only on negative windows not only avoids the
difficult combinatorial optimization problem, but also has
the advantage that their labels are certain, and there is a
larger number of negative windows available, which makes
the pairwise comparisons more robust.
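As an illustration, the selection rule of [43] can be sketched as follows; the descriptor layout and function name are our own assumptions, not taken from the paper:

import numpy as np

def select_initial_window(pos_windows, neg_windows):
    # pos_windows: (n_pos, d) descriptors of the candidate windows of one positive image
    # neg_windows: (n_neg, d) descriptors of windows sampled from negative images
    d2 = ((pos_windows[:, None, :] - neg_windows[None, :, :]) ** 2).sum(axis=-1)
    nn_dist = d2.min(axis=1)        # squared distance to the nearest negative window
    return int(np.argmax(nn_dist))  # pick the candidate farthest from all negatives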
Shi et al. [40] propose to estimate a per-patch class
distribution by using an extended version of the Latent
Dirichlet Allocation (LDA) [10] topic model. Their approach
assigns object class labels across different object categories
concurrently, which allows it to benefit from explaining-away
effects, i.e., an image region cannot be identified as an
instance for multiple categories. The initial windows are
then localized by sampling from the saliency maps.
Song et al. [46] propose a graph-based initialization
method. The main idea is to select a subset of the candidate
windows such that the nearest neighbors of the selected
windows correspond to the candidate windows in the posi-
tive images, rather than the ones in the negative images.
The approach is formulated as a discriminative submodular
cover problem on the similarity graph of the windows. In a
follow-up work, Song et al. [47] extend this approach to find
multiple non-overlapping regions corresponding to object
parts. The initial object windows are then generated by find-
ing frequent part configurations and their bounding boxes.
2.2 Iterative Learning Methods
Once the initial windows are localized, typically an iterative
learning approach is employed in order to improve the ini-
tial localizations in the training images.
One of the early examples of WSL for object detector
training is proposed by Crandall and Huttenlocher [15].
In their work, object and part locations are treated as
latent variables in a probabilistic model. These variables
are automatically inferred and utilized during training
using an Expectation Maximization (EM) algorithm. The
main focus of their work, however, is on training a part-
based object detector without using manual part annota-
tions, rather than training in terms of image labels. Their
approach is evaluated on datasets containing images with
uncluttered backgrounds and little variance in terms of
object locations, which is an unrealistic testbed for WSL
of object detectors.

Several WSL methods aim to localize objects via selecting
a subset of candidate windows based on pairwise similari-
ties. For example, Kim and Torralba [28] use a link analysis
based clustering approach. Chum and Zisserman [12] itera-
tively select windows and update the similarity measure
that is used to compare windows. The window selection is
done by updating one image at a time such that the average
pairwise similarity across the positive images is maximized.
The similarity measure, which is defined in terms of bag-of-
word (BoW) descriptors [16], is updated by selecting the
visual words that predominantly appear in the selected
windows rather than the negative images.
Deselaers et al. [17] propose a CRF-based model that
jointly infers the object hypotheses across all positive train-
ing images, by exploiting a fully-connected graphical model
that encourages visual similarity across all selected object
hypotheses. Unlike the methods of [28] and [12], the CRF-
based model additionally utilizes a unary potential function
that scores candidate windows individually based on their
window descriptors and objectness scores. The parameters
of the pairwise and unary potential functions are updated,
and the positive windows are selected in an iterative fash-
ion. Prest et al. [37] extend these ideas to weakly supervised
detector training from videos by extracting candidate spa-
tio-temporal tubes based on motion cues and by defining
WSL potential functions over tubes instead of windows.
Our window refinement method is inspired by the use
of an objectness model as a class-independent prior in [17].
While Deselaers et al. [17] use the objectness prior in all
training iterations, we update the coordinates of the top-
scoring final localizations, using the local greedy search pro-
cedure from [56]. In addition, instead of using the objectness
model in [2], we use the edge-driven objectness measure
[56], which evaluates the alignment between each window
and the edges around it.
Most recent work is predominantly based on iteratively
selecting the highest scoring detections as the positive train-
ing examples and training the detection models. We refer to
this approach as standard MIL. Using this approach, an off-
the-shelf detector can be trained in a weakly supervised set-
ting. For example, Nguyen et al. [34] and Blaschko et al. [9]
train the branch-and-bound localization [31] based detectors
over BoW descriptors in this manner. Blaschko et al. also
investigate the use of object-center annotations as an alter-
native WSL setting.
The DPM model [21] has been utilized with standard MIL
based training approaches by a number of other WSL
approaches, see e.g., [35], [40], [43], [44], [45]. The majority of
the works use the standard DPM training procedure and dif-
fer in terms of their initialization procedures. One exception
is that Siva and Xiang [45] propose a method to detect when
the iterative training procedure drifts to background regions.
In addition, Pandey and Lazebnik [35] carefully study how
to tune DPM training for WSL purposes. They propose to
restrict each re-localization stage such that the bounding
boxes between two iterations must meet a minimum overlap
threshold, which avoids big fluctuations across the itera-
tions. Moreover, they propose a heuristic to automatically
crop windows with near-uniform backgrounds.
Russakovsky et al. [38] use a similar approach based on
Locality-constrained Linear Coding descriptors [54] over
the candidate windows generated using the Selective Search
method [49]. They use a background descriptor computed
over features outside the window, which helps to better
localize the objects as compared to only modeling the win-
dows themselves.
Song et al. [46] develop a smoothed version of the stan-
dard MIL approach using Nesterov’s smoothing technique
[33]. The main motivation is to increase robustness against
incorrectly selected windows, particularly in early itera-
tions, by training with multiple windows per positive
image. The candidate windows are generated using selec-
tive search [49] and the window descriptors are extracted
using the CNN model of [29].
Bilen et al. [8] propose an alternative smoothed version of
standard MIL. Instead of selecting the top scoring window
in a positive image, they propose to train over all windows
that are weighted by a soft-max function over the classifica-
tion scores. In addition, they utilize additional regulariza-
tion terms that aim to (i) enforce that positive training
windows and their horizontal mirrors score similarly, and
(ii) avoid obtaining high classification scores for multiple
classes for a single window. They also utilize selective
search candidate windows [49] and CNN features [29].
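The weighting step described above can be sketched as follows; the actual objective and regularizers of [8] are more involved, and the temperature parameter is our own addition:

import numpy as np

def softmax_window_weights(scores, temperature=1.0):
    # scores: classification scores of all candidate windows in a positive image
    z = scores / temperature
    z = z - z.max()            # subtract the maximum for numerical stability
    w = np.exp(z)
    return w / w.sum()         # per-window weights used instead of a hard arg-max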
Recently, Wang et al. [53] propose a two-step method,
which first groups selective search candidate windows [49]
from the positive images of a class into visual clusters and
then chooses the most discriminative cluster of windows. In
the first step, the CNN features [29] are clustered using
probabilistic latent semantic analysis (PLSA) [25]. In the sec-
ond step, for each visual cluster, image descriptors are
extracted from the CNN-based window descriptors of the
windows associated with the cluster. Finally, one visual
cluster for each class is selected based on the image classifi-
cation performance of the corresponding image descriptors.
Our approach is most related to that of Russakovsky
et al. [38]. We also rely on the selective search windows [49],
and use a similar initialization strategy. A critical difference
from [38] and other WSL approaches based on iterative
detector training, however, is our multi-fold MIL training
procedure which we describe in the next section. Our multi-
fold MIL approach is also related to the work of Singh et al.
[42] on unsupervised vocabulary learning for image classifi-
cation. Starting from an unsupervised clustering of local
patches, they iteratively train SVM classifiers on a subset of
the data, and evaluate them on another set to update the train-
ing data from the second set.
We note that avoiding poor local optima in training of
models with non-convex objectives is a fundamental prob-
lem in machine learning, and there are many aspects of it.
For example, curriculum learning (CL) [5], which is a con-
ceptual framework, suggests that training can be improved
by initializing a model with easy examples, and then, grad-
ually utilizing more complex ones. Kumar et al. [30] pro-
pose a CL formulation for latent variable models by
considering the loss function as a measure of example dif-
ficulty, which excludes low-scoring examples in early
training iterations. Progressively increasing the latent
search space can also be interpreted as a CL approach to
avoid making unstable inferences in early iterations, see
e.g., [7], [38]. Although our work is related, our focus is
different in the sense that we target the problem of
degenerate latent variable inference due to the use of high-
dimensional descriptors.
3 WEAKLY SUPERVISED OBJECT LOCALIZATION
Below, we present our multi-fold MIL approach in Section
3.2 and our window refinement method in Section 3.3, but first
briefly describe our FV and CNN based object appearance
descriptors.
3.1 Features and Detection Window Representation
In our experiments we rely on FV and CNN based represen-
tations. In either case, we use the selective search method of
Uijlings et al. [49]. It generates a limited set of around 1,500
candidate windows per image. This speeds up detector
training and evaluation, while filtering out the most implau-
sible object locations.
The FV-based representation is based on our previous
work [13] for fully supervised detection. In particular, we
aggregate local SIFT descriptors into an FV representation
to which we apply ℓ2 and power normalization [39]. We
concatenate the FV computed over the full detection win-
dow, and 16 FVs computed over the cells in a 4 × 4 grid
over the window, inspired by the spatial pyramid represen-
tation of Lazebnik et al. [32]. Using PCA to project the SIFTs
to 64 dimensions, and a mixture of Gaussians (MoG) of 64
components, this yields a descriptor of 140,352 dimensions.
We reduce the memory footprint, and speed up our iterative
training procedure, by using the PQ and Blosc feature com-
pression [3], [26].
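The stated dimensionality can be checked with a short computation; the sketch below assumes the FV stores gradients with respect to the mixture weights as well as the means and variances, which is consistent with the figure of 140,352:

sift_dim = 64                                        # PCA-reduced SIFT dimension
num_gauss = 64                                       # number of MoG components
fv_dim_per_region = num_gauss * (2 * sift_dim + 1)   # 8,256 dims (weights, means, variances)
num_regions = 1 + 4 * 4                              # full window plus the 4 x 4 grid cells
print(fv_dim_per_region * num_regions)               # 140,352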
Similar to Russakovsky et al. [38], we add contextual
information from the part of the image not covered by the
window. Full-image descriptors, or image classification
scores, are commonly used for fully supervised object detec-
tion, see e.g., [13], [48]. For WSL, however, it is important to
use the complement of the object window rather than the
full image, to ensure that the context descriptor also
depends on the window location. This prevents learning
degenerate detection models, since otherwise the context
descriptor can be used to perfectly separate the training
images regardless of the object localization.
To enhance the effectiveness of the context descriptor we
propose a “contrastive” version, defined as the difference
between the background FV x_b and the 1 × 1 foreground FV
x_f. Since we use linear classifiers, the contribution of this
descriptor to the window score, given by w⊤(x_b − x_f), can
be decomposed into a background score w⊤x_b and a fore-
ground score −w⊤x_f. Because the foreground and back-
ground descriptors have the same weight vector, up to a sign
flip, we effectively force features to either score positively on
the foreground and negatively on the background, or vice-
versa, within the contrastive descriptor. This prevents the
detector from scoring the same features positively on both
the foreground and the background.
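A toy numerical check of this decomposition (random vectors; the dimension and names are arbitrary, only the linearity of the classifier matters):

import numpy as np

def contrastive_descriptor(x_f, x_b):
    # x_f: foreground FV of the window, x_b: FV of the remaining image area
    return x_b - x_f

rng = np.random.default_rng(0)
w, x_f, x_b = rng.standard_normal((3, 8256))
score = w @ contrastive_descriptor(x_f, x_b)
assert np.isclose(score, w @ x_b - w @ x_f)  # background score minus foreground score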
To ensure that we have enough SIFT descriptors for the
background FV, we filter the detection windows to respect
a margin of at least 4 percent from the image border, i.e., for
a 100 × 100 pixel image, windows closer than four pixels to
the image border are suppressed. This filtering step
removes about half of the windows. We initialize the MIL
training with the window that covers the image, up to a 4
percent margin, so that all instances are captured by the ini-
tial windows.
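A sketch of this filtering and of the corresponding MIL initialization, under the assumption that candidate windows are given as (x1, y1, x2, y2) pixel coordinates:

def filter_windows(boxes, img_w, img_h, margin=0.04):
    # keep only windows that respect the 4 percent margin from the image border
    mx, my = margin * img_w, margin * img_h
    return [b for b in boxes
            if b[0] >= mx and b[1] >= my and b[2] <= img_w - mx and b[3] <= img_h - my]

def initial_window(img_w, img_h, margin=0.04):
    # the initial MIL window: the image up to the 4 percent margin
    return (margin * img_w, margin * img_h, (1 - margin) * img_w, (1 - margin) * img_h)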
We extract the CNN features using the CNN architecture
of Krizhevsky et al. [29]. We utilize the first seven layers of
the CNN model, which consists of five convolutional and
two fully-connected layers. The CNN model is pre-trained
on the ImageNet ILSVRC 2012 dataset using the Caffe
framework [27]. Following Girshick et al. [22], we crop and
resize the mean-subtracted regions corresponding to the
candidate windows to images of size 224 × 224, as required
by the CNN model. Finally, we apply ℓ2 normalization to
the resulting 4,096 dimensional descriptors.
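The final normalization step amounts to the following (the network forward pass itself is not shown; the epsilon guard is our own addition):

import numpy as np

def l2_normalize(features, eps=1e-12):
    # features: (num_windows, 4096) array of CNN window descriptors
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    return features / np.maximum(norms, eps)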
An important advantage of the CNN features is that
some of the feature dimensions correspond to higher level
image structures, such as certain animal faces and bodies
[22], which can simplify the WSL problem. Our experimen-
tal results show that the CNN features perform better than
the FV features, but that they are complementary since best
performance is obtained when combining both features.
3.2 Weakly Supervised Object Detector Training
The dominant method for weakly supervised training
of object detectors is the standard MIL approach, which
is based on iterating between the training and the re-
localization stages, as described in Section 2.2. Note that
in this approach, the detector used for re-localization in
positive images is trained using positive samples that are
extracted from the very same images. Therefore, there is a
bias towards re-localizing on the same windows; in par-
ticular when high capacity classifiers are used, which are
likely to separate the detector’s training data. For exam-
ple, when a nearest neighbor classifier is used, the re-
localization will be degenerate and not move away from
its initialization, since the same window will be found as
its nearest neighbor.
The same phenomenon occurs when using powerful and
high-dimensional image representations to train linear clas-
sifiers. We illustrate this in Fig. 1, which shows the distribu-
tion of the window scores in a typical standard MIL
iteration. We observe that the windows used in SVM train-
ing score significantly higher than the other ones, including
those with a significant spatial overlap with the most recent
training windows, especially when the high-dimensional
FV descriptors are used.
Fig. 1. Distribution of the window scores in the positive training images
after the fifth iteration of standard MIL training on VOC 2007 for FVs
(left) and CNNs (right). In each plot, the right-most curve corresponds
to the windows chosen in the most recent re-localization step and used
for training the detector; the middle curve corresponds to the other
windows that overlap more than 50 percent with the training windows;
and the left-most curve corresponds to the windows that overlap less
than 50 percent. Each curve is obtained by averaging all per-class score
distributions, and the surrounding regions show the standard deviation.
As a result, standard MIL typically results in degenerate re-
localization. This problem is related to the dimensionality of
the window descriptors. We illustrate this in Fig. 2, where we
show the distribution of inner products between the descrip-
tors of different windows. In Fig. 2a, we use random window
pairs within and across images. In Fig. 2b, we use only within-
image pairs, which are more likely to be similar, and therefore
the histograms are shifted slightly to larger values.
We show the distributions using our 140,352 dimen-
sional FVs, 516 dimensional FVs obtained using four Gaus-
sians without a spatial grid, and 4,096 dimensional CNN-based
descriptors.¹
Unlike in the case of low-dimensional FVs or
CNN-based descriptors, almost all window descriptors are
near orthogonal in the high-dimensional FV case even when
we use within-image pairs only. Also, recall that the weight
vector of a standard linear SVM classifier can be written as a
linear combination of training samples, w = Σ_i α_i x_i. There-
fore, the training windows are likely to score significantly
higher than the other windows in positive images in the high-
dimensional case, resulting in degenerate re-localization
behavior. In Section 4, we verify this hypothesis experimen-
tally by comparing the localization behavior using the low-
dimensional versus the high-dimensional descriptors.
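The effect of dimensionality can be illustrated with synthetic descriptors; this is not the data of Fig. 2, it only shows why independent high-dimensional descriptors tend to be near orthogonal:

import numpy as np

rng = np.random.default_rng(0)
for d in (516, 4096, 140352):
    x = rng.standard_normal((100, d), dtype=np.float32)
    x /= np.linalg.norm(x, axis=1, keepdims=True)   # l2-normalize the descriptors
    sims = (x @ x.T)[np.triu_indices(100, k=1)]     # pairwise inner products
    print(d, float(np.abs(sims).mean()))            # concentrates near zero as d grows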
Note that increasing the regularization weight in SVM train-
ing does not remedy this problem. The ℓ2 regularization
term with weight λ restricts the linear combination weights
such that |α_i| ≤ 1/λ. Therefore, although we can reduce the
influence of individual training samples via regularization,
the resulting classifier remains biased towards the training
windows since the classifier is a linear combination of the
window descriptors. In Section 4, we verify this hypothesis
by evaluating the regularization weight’s effect on the local-
ization performance.
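For reference, the bound on the coefficients follows from the standard SVM dual. Assuming a hinge-loss primal with regularization weight λ (a sketch, not reproduced from the paper), the minimizer admits the representation

\[
\min_{w}\;\frac{\lambda}{2}\lVert w\rVert^{2} + \sum_{i}\max\bigl(0,\,1 - y_i\,w^{\top}x_i\bigr),
\qquad
w^{*}=\sum_{i}\alpha_i\,y_i\,x_i
\;\;\text{with}\;\;
0\le\alpha_i\le\tfrac{1}{\lambda},
\]

where absorbing the label sign y_i into α_i gives the form w = Σ_i α_i x_i with |α_i| ≤ 1/λ used above.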
To address this issue—without sacrificing the descriptor
dimensionality, which would limit its descriptive power—
we propose to train the detector using a multi-fold proce-
dure, reminiscent of cross-validation, within the MIL itera-
tions. We divide the positive training images into K disjoint
folds, and re-localize the images in each fold using a detec-
tor trained using windows from positive images in the other
folds. In this manner the re-localization detectors never
use training windows from the images to which they are
applied. Once re-localization is performed in all positive
training images, we train another detector using all selected
windows. This detector is used for hard-negative mining on
negative training images, and returned as the final detector.
We summarize our multi-fold MIL training procedure in
Algorithm 1. The standard MIL algorithm that does not use
multi-fold training does not execute steps 2(a) and 2(b), and
re-localizes based on the detector learned in step 2(c).
Algorithm 1. Multi-fold Weakly Supervised Training
1) Initialization: positive and negative examples are set to
entire images up to a 4% border
2) For iteration t ¼ 1 to T
a) Divide positive images randomly into K folds
b) For k ¼ 1 to K
i) Train using positive examples in all folds but k, and all
negative examples
ii) Re-localize positives by selecting the top scoring win-
dow in each image of fold k using this detector
c) Train detector using re-localized positives and all
negatives
d) Add new negative windows by hard-negative mining
3) Return final detector and object windows in train data
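To make steps 2(a)-2(d) of Algorithm 1 concrete, a minimal sketch of the training loop follows. The helpers train_svm and hard_negative_mining are hypothetical stand-ins for the SVM training and hard-negative mining used in practice, and all data layouts are assumptions:

import numpy as np

def multifold_mil(candidates, init_positive, negatives, train_svm,
                  hard_negative_mining, K=10, T=10, seed=0):
    # candidates[i]: (n_i, d) descriptors of the candidate windows of positive image i
    # init_positive[i]: descriptor of the initial near full-image window of image i
    # negatives: list of (d,) descriptors of windows from negative images
    pos = list(init_positive)
    neg = list(negatives)
    rng = np.random.default_rng(seed)
    for t in range(T):
        folds = np.array_split(rng.permutation(len(pos)), K)            # step 2a
        for fold in folds:                                              # step 2b
            held_out = set(fold.tolist())
            train_pos = [pos[i] for i in range(len(pos)) if i not in held_out]
            model = train_svm(np.vstack(train_pos + neg),
                              np.r_[np.ones(len(train_pos)), -np.ones(len(neg))])
            for i in fold:                                              # re-localize fold k
                scores = model.decision_function(candidates[i])
                pos[i] = candidates[i][int(np.argmax(scores))]
        model = train_svm(np.vstack(pos + neg),                         # step 2c
                          np.r_[np.ones(len(pos)), -np.ones(len(neg))])
        neg.extend(hard_negative_mining(model))                         # step 2d: new negatives
    return model, pos                                                   # step 3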
The number of folds used in our multi-fold MIL train-
ing procedure should be set to strike a good trade-off
between two competing factors. On the one hand, using
more folds increases the number of training samples per
fold, and is therefore likely to improve re-localization per-
formance. On the other hand, using more folds increases
the computational cost. We experimentally analyze this
trade-off in Section 4.
3.3 Window Refinement
We now explain our window refinement method. It updates
the localizations obtained by the last multi-fold MIL itera-
tion. The final detector is then re-trained based on these
refinements.
An inherent difficulty for weakly supervised object local-
ization is that WSL labels only allow determining the most
repeatable and discriminative patterns for each class. There-
fore, even though the windows found by WSL are likely to
Fig. 2. Distribution of inner products, scaled to the interval [−1, +1], of
pairs of 25,000 windows sampled from 250 images using our high-
dimensional FV (top), a low-dimensional FV (middle), and CNN features
(bottom). (a) uses all window pairs and (b) uses only within-image pairs,
which are more likely to be similar.
1. To make the histograms comparable, we make all descriptors zero
mean before ℓ2 normalization and computing the inner products.

Because the foreground and background descriptor have the same weight vector, up to a sign flip, the authors effectively force features to either score positively on the foreground and negatively on the background, or vice-versa within the contrastive descriptor.