Weakly Supervised Object Localization
with Multi-Fold Multiple Instance Learning
Ramazan Gokberk Cinbis, Jakob Verbeek, and Cordelia Schmid, Fellow, IEEE
Abstract—Object category localization is a challenging problem in computer vision. Standard supervised training requires bounding
box annotations of object instances. This time-consuming annotation process is sidestepped in weakly supervised learning. In this
case, the supervised information is restricted to binary labels that indicate the absence/presence of object instances in the image,
without their locations. We follow a multiple-instance learning approach that iteratively trains the detector and infers the object locations
in the positive training images. Our main contribution is a multi-fold multiple instance learning procedure, which prevents training from
prematurely locking onto erroneous object locations. This procedure is particularly important when using high-dimensional
representations, such as Fisher vectors and convolutional neural network features. We also propose a window refinement method,
which improves the localization accuracy by incorporating an objectness prior. We present a detailed experimental evaluation using
the PASCAL VOC 2007 dataset, which verifies the effectiveness of our approach.
Index Terms—Weakly supervised learning, object detection
1 INTRODUCTION
Over the last decade significant progress has been made
in object category localization, as witnessed by the
PASCAL VOC challenges [20]. Training state-of-the-art
object detectors, however, requires bounding box annota-
tions of object instances, which are costly to acquire.
Weakly supervised learning (WSL) refers to methods that
rely on training data with incomplete ground-truth infor-
mation to learn recognition models. For object detection,
WSL from image-wide labels that indicate the presence of
instances of a category in images has recently been inten-
sively studied as a way to remove the need for bounding
box annotations, see e.g., [4], [8], [12], [15], [17], [35], [37],
[38], [40], [43], [45], [46], [47], [53]. Such methods can poten-
tially leverage the large amount of tagged images on the
internet as a data source to train object detectors. We give
an overview of the most relevant related work in Section 2.
Other examples of WSL include learning face recognition
models from image captions [6], or subtitle and script
information [19]. Yet another example is learning semantic
segmentation models from image-wide category labels [51].
Most WSL approaches are based on latent variable models
to account for the missing ground-truth information. Multi-
ple instance learning (MIL) [18] handles cases where the
weak supervision indicates that at least one positive instance
is present in a set of examples. More advanced inference and
learning methods are used in cases where the latent variable
structure is more complex, see e.g., [17], [40], [51]. Besides
weakly supervised training, mixed fully and weakly super-
vised [9], active [52], and semi-supervised [40] learning and
unsupervised object discovery [11] methods have also been
explored to reduce the amount of labeled training data for
object detector training. In active learning bounding box
annotations are used, but requested only for images where
the annotation is expected to be most effective. Semi-super-
vised learning, on the other hand, leverages unlabeled
images by automatically detecting objects in them, and uses
those to better model the object appearance variations.
In this paper we consider WSL to learn object detectors
from image-wide labels. We follow an MIL approach that
interleaves training of the detector with re-localization of
object instances on the positive training images. Following
recent state-of-the-art work in fully supervised detection
[13], [22], [50], we represent (tentative) detection windows
using Fisher vectors (FVs) [39] and convolutional neural net-
work (CNN) features [29]. As we explain in Section 3, when
used in an MIL framework, the high-dimensionality of the
window features makes MIL quickly converge to poor
local optima after initialization. Our main contribution is a
multi-fold training procedure for MIL, which avoids this
rapid convergence to poor local optima. A second novelty
of our approach is the use of a “contrastive” background
descriptor that is defined as the difference of a descriptor of
the object window and a descriptor of the remaining image
area. The score for this descriptor of a linear classifier can be
interpreted as the difference of scores for the foreground and
background. In this manner we direct the detector to learn
the difference between foreground and background appear-
ances. Finally, inspired by the objectness prior in [17], we
propose a window refinement method that improves the
weakly supervised localization accuracy by incorporating a
category-independent objectness measure.
We present a detailed evaluation using the VOC 2007
dataset in Section 4. The experimental results show that our
multi-fold MIL training improves performance for both FV
R.G. Cinbis is with the Department of Computer Engineering, Bilkent University, Ankara, Turkey. E-mail: gcinbis@cs.bilkent.edu.tr.
J. Verbeek and C. Schmid are with the LEAR team, Inria Grenoble Rhône-Alpes, Laboratoire Jean Kuntzmann, CNRS, University Grenoble Alpes, France. E-mail: {Jakob.Verbeek, cordelia.schmid}@inria.fr.
Manuscript received 25 Dec. 2014; revised 8 Dec. 2015; accepted 15 Jan. 2016. Date of publication 25 Feb. 2016; date of current version 12 Dec. 2016.
Recommended for acceptance by G. Mori.
For information on obtaining reprints of this article, please send e-mail to: reprints@ieee.org, and reference the Digital Object Identifier below.
Digital Object Identifier no. 10.1109/TPAMI.2016.2535231

and CNN features. We also show that WSL performance
can be further improved by combining the two descriptor
types and applying our window refinement method. The
evaluation shows that our system obtains state-of-the-art
results on VOC 2007. We also present results for VOC 2010,
which had not yet been used in previous work.
Part of the material presented here appeared in [14].
Besides a more detailed presentation and discussion of the
most recent related work, the current paper extends it in
several ways. We enhanced our WSL method by introduc-
ing a window refinement method. We also added
experiments using CNN features, and their combination
with FV features. Finally, we included experiments on
training in a mixed supervision setting, where part of the
images are weakly supervised and others are labeled with
full bounding-box annotations.
2 RELATED WORK
The majority of related work treats WSL for object detection
as a multiple instance learning [18] problem. Each image is
considered as a “bag” of examples given by tentative object
windows. Positive images are assumed to contain at least one
positive object instance window, while negative images only
contain negative windows. The object detector is then
obtained by alternating detector training, and using the detec-
tor to select the most likely object instances in positive images.
In many MIL problems, such as those for weakly
supervised face recognition [6], [19], the number of exam-
ples per bag is limited to a few dozen at most. In contrast,
there is a vast number of examples per bag in the case of
object detector training since the number of possible object
bounding boxes is quadratic in the number of image pixels.
Candidate window generation methods, e.g., [1], [24], [49],
[56], can be used to make MIL approaches to WSL for object
localization manageable, and make it possible to use power-
ful and computationally expensive object models.
Although candidate window generation methods can
significantly reduce the search space per image, the selec-
tion of windows across a large number of images is inher-
ently a challenging problem, where an iterative WSL
method can typically find only a local optimum depending
on the initial windows. Therefore, in this section, we first
overview the initialization methods proposed in the litera-
ture, and then summarize the iterative WSL approaches.
2.1 Initialization Methods
A number of different strategies to initialize the MIL detector
training have been proposed in the literature. A simple strat-
egy, e.g., taken in [28], [35], [38], is to initialize by taking large
windows in positive images that (nearly) cover the entire
image. This strategy exploits the inclusion structure of the
MIL problem for object detection. That is: although large
windows may contain a significant amount of background
features, they are likely to include positive object instances.
Another strategy is to utilize a class-independent saliency
measure that aims to predict whether a given image region
belongs to an object or not. For example, Deselaers et al. [17]
generate candidate windows using the objectness method
[2] and assign per-window weights using a saliency model
trained on a small set of non-target classes. Siva et al. [44]
instead estimate an unsupervised patch-level saliency map
for a given image by measuring the average similarity of
each patch to the other patches in a retrieved set of similar
images. In each image, an initial window is found by sam-
pling from the corresponding saliency map.
Alternatively, a class-specific initialization method can
be used. For example, Chum and Zisserman [12] select the
visual words that predominantly appear in the positive
training images and initialize WSL by finding the bounding
box of these visual words in each image. Siva and Xiang [45]
propose to initially select one of the candidate windows
sampled using the objectness method at each image such
that an objective function based on intra-class and inter-
class pairwise similarities is maximized. However, this for-
mulation leads to a difficult combinatorial optimization
problem. Siva et al. [43] propose a simplified approach
where a candidate window is selected for a given image
such that the distance from the selected window to its near-
est neighbor among windows from negative images is maxi-
mal. Relying only on negative windows not only avoids the
difficult combinatorial optimization problem, but also has
the advantage that their labels are certain, and there is a
larger number of negative windows available, which makes
the pairwise comparisons more robust.
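As an illustration, the selection rule of [43] can be sketched as follows; the descriptor layout and function name are our own assumptions, not taken from the paper:

import numpy as np

def select_initial_window(pos_windows, neg_windows):
    # pos_windows: (n_pos, d) descriptors of the candidate windows of one positive image
    # neg_windows: (n_neg, d) descriptors of windows sampled from negative images
    d2 = ((pos_windows[:, None, :] - neg_windows[None, :, :]) ** 2).sum(axis=-1)
    nn_dist = d2.min(axis=1)        # squared distance to the nearest negative window
    return int(np.argmax(nn_dist))  # pick the candidate farthest from all negatives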
Shi et al. [40] propose to estimate a per-patch class
distribution by using an extended version of the Latent
Dirichlet Allocation (LDA) [10] topic model. Their approach
assigns object class labels across different object categories
concurrently, which allows it to benefit from explaining-away
effects, i.e., an image region cannot be identified as an
instance for multiple categories. The initial windows are
then localized by sampling from the saliency maps.
Song et al. [46] propose a graph-based initialization
method. The main idea is to select a subset of the candidate
windows such that the nearest neighbors of the selected
windows correspond to the candidate windows in the posi-
tive images, rather than the ones in the negative images.
The approach is formulated as a discriminative submodular
cover problem on the similarity graph of the windows. In a
follow-up work, Song et al. [47] extend this approach to find
multiple non-overlapping regions corresponding to object
parts. The initial object windows are then generated by find-
ing frequent part configurations and their bounding boxes.
2.2 Iterative Learning Methods
Once the initial windows are localized, typically an iterative
learning approach is employed in order to improve the ini-
tial localizations in the training images.
One of the early examples of WSL for object detector
training is proposed by Crandall and Huttenlocher [15].
In their work, object and part locations are treated as
latent variables in a probabilistic model. These variables
are automatically inferred and utilized during training
using an Expectation Maximization (EM) algorithm. The
main focus of their work, however, is on training a part-
based object detector without using manual part annota-
tions, rather than training in terms of image labels. Their
approach is evaluated on datasets containing images with
uncluttered backgrounds and little variance in terms of
object locations, which is an unrealistic testbed for WSL
of object detectors.

Several WSL methods aim to localize objects via selecting
a subset of candidate windows based on pairwise similari-
ties. For example, Kim and Torralba [28] use a link analysis
based clustering approach. Chum and Zisserman [12] itera-
tively select windows and update the similarity measure
that is used to compare windows. The window selection is
done by updating one image at a time such that the average
pairwise similarity across the positive images is maximized.
The similarity measure, which is defined in terms of bag-of-
word (BoW) descriptors [16], is updated by selecting the
visual words that predominantly appear in the selected
windows rather than the negative images.
Deselaers et al. [17] propose a CRF-based model that
jointly infers the object hypotheses across all positive train-
ing images, by exploiting a fully-connected graphical model
that encourages visual similarity across all selected object
hypotheses. Unlike the methods of [28] and [12], the CRF-
based model additionally utilizes a unary potential function
that scores candidate windows individually based on their
window descriptors and objectness scores. The parameters
of the pairwise and unary potential functions are updated,
and the positive windows are selected in an iterative fash-
ion. Prest et al. [37] extend these ideas to weakly supervised
detector training from videos by extracting candidate spa-
tio-temporal tubes based on motion cues and by defining
WSL potential functions over tubes instead of windows.
Our window refinement method is inspired by the use
of an objectness model as a class-independent prior in [17].
While Deselaers et al. [17] use the objectness prior in all
training iterations, we update the coordinates of the top-
scoring final localizations, using the local greedy search pro-
cedure from [56]. In addition, instead of using the objectness
model in [2], we use the edge-driven objectness measure
[56], which evaluates the alignment between each window
and the edges around it.
Most recent work is predominantly based on iteratively
selecting the highest scoring detections as the positive train-
ing examples and training the detection models. We refer to
this approach as standard MIL. Using this approach, an off-
the-shelf detector can be trained in a weakly supervised set-
ting. For example, Nguyen et al. [34] and Blaschko et al. [9]
train the branch-and-bound localization [31] based detectors
over BoW descriptors in this manner. Blaschko et al. also
investigate the use of object-center annotations as an alter-
native WSL setting.
The DPM model [21] has been utilized with standard MIL
based training approaches by a number of other WSL
approaches, see e.g., [35], [40], [43], [44], [45]. The majority of
the works use the standard DPM training procedure and dif-
fer in terms of their initialization procedures. One exception
is that Siva and Xiang [45] propose a method to detect when
the iterative training procedure drifts to background regions.
In addition, Pandey and Lazebnik [35] carefully study how
to tune DPM training for WSL purposes. They propose to
restrict each re-localization stage such that the bounding
boxes between two iterations must meet a minimum overlap
threshold, which avoids big fluctuations across the itera-
tions. Moreover, they propose a heuristic to automatically
crop windows with near-uniform backgrounds.
Russakovsky et al. [38] use a similar approach based on
Locality-constrained Linear Coding descriptors [54] over
the candidate windows generated using the Selective Search
method [49]. They use a background descriptor computed
over features outside the window, which helps to better
localize the objects as compared to only modeling the win-
dows themselves.
Song et al. [46] develop a smoothed version of the stan-
dard MIL approach using Nesterov’s smoothing technique
[33]. The main motivation is to increase robustness against
incorrectly selected windows, particularly in early itera-
tions, by training with multiple windows per positive
image. The candidate windows are generated using selec-
tive search [49] and the window descriptors are extracted
using the CNN model of [29].
Bilen et al. [8] propose an alternative smoothed version of
standard MIL. Instead of selecting the top scoring window
in a positive image, they propose to train over all windows
that are weighted by a soft-max function over the classifica-
tion scores. In addition, they utilize additional regulariza-
tion terms that aim to (i) enforce that positive training
windows and their horizontal mirrors score similarly, and
(ii) avoid obtaining high classification scores for multiple
classes for a single window. They also utilize selective
search candidate windows [49] and CNN features [29].
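The weighting step described above can be sketched as follows; the actual objective and regularizers of [8] are more involved, and the temperature parameter is our own addition:

import numpy as np

def softmax_window_weights(scores, temperature=1.0):
    # scores: classification scores of all candidate windows in a positive image
    z = scores / temperature
    z = z - z.max()            # subtract the maximum for numerical stability
    w = np.exp(z)
    return w / w.sum()         # per-window weights used instead of a hard arg-max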
Recently, Wang et al. [53] propose a two-step method,
which first groups selective search candidate windows [49]
from the positive images of a class into visual clusters and
then chooses the most discriminative cluster of windows. In
the first step, the CNN features [29] are clustered using
probabilistic latent semantic analysis (PLSA) [25]. In the sec-
ond step, for each visual cluster, image descriptors are
extracted from the CNN-based window descriptors of the
windows associated with the cluster. Finally, one visual
cluster for each class is selected based on the image classifi-
cation performance of the corresponding image descriptors.
Our approach is most related to that of Russakovsky
et al. [38]. We also rely on the selective search windows [49],
and use a similar initialization strategy. A critical difference
from [38] and other WSL approaches based on iterative
detector training, however, is our multi-fold MIL training
procedure which we describe in the next section. Our multi-
fold MIL approach is also related to the work of Singh et al.
[42] on unsupervised vocabulary learning for image classifi-
cation. Starting from an unsupervised clustering of local
patches, they iteratively train SVM classifiers on a subset of
the data, and evaluate them on another set to update the train-
ing data from the second set.
We note that avoiding poor local optima in training of
models with non-convex objectives is a fundamental prob-
lem in machine learning, and there are many aspects of it.
For example, curriculum learning (CL) [5], which is a con-
ceptual framework, suggests that training can be improved
by initializing a model with easy examples, and then, grad-
ually utilizing more complex ones. Kumar et al. [30] pro-
pose a CL formulation for latent variable models by
considering the loss function as a measure of example dif-
ficulty, which excludes low-scoring examples in early
training iterations. Progressively increasing the latent
search space can also be interpreted as a CL approach to
avoid making unstable inferences in early iterations, see
e.g., [7], [38]. Although our work is related, our focus is
different in the sense that we target the problem of
degenerate latent variable inference due to the use of high-
dimensional descriptors.
3 WEAKLY SUPERVISED OBJECT LOCALIZATION
Below, we present our multi-fold MIL approach in Section
3.2 and our window refinement method in Section 3.3, but first
briefly describe our FV and CNN based object appearance
descriptors.
3.1 Features and Detection Window Representation
In our experiments we rely on FV and CNN based represen-
tations. In either case, we use the selective search method of
Uijlings et al. [49]. It generates a limited set of around 1,500
candidate windows per image. This speeds up detector
training and evaluation, while filtering out the most implau-
sible object locations.
The FV-based representation is based on our previous
work [13] for fully supervised detection. In particular, we
aggregate local SIFT descriptors into an FV representation
to which we apply ℓ2 and power normalization [39]. We
concatenate the FV computed over the full detection win-
dow, and 16 FVs computed over the cells in a 4 × 4 grid
over the window, inspired by the spatial pyramid represen-
tation of Lazebnik et al. [32]. Using PCA to project the SIFTs
to 64 dimensions, and a mixture of Gaussians (MoG) of 64
components, this yields a descriptor of 140,352 dimensions.
We reduce the memory footprint, and speed up our iterative
training procedure, by using the PQ and Blosc feature com-
pression [3], [26].
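The stated dimensionality can be checked with a short computation; the sketch below assumes the FV stores gradients with respect to the mixture weights as well as the means and variances, which is consistent with the figure of 140,352:

sift_dim = 64                                        # PCA-reduced SIFT dimension
num_gauss = 64                                       # number of MoG components
fv_dim_per_region = num_gauss * (2 * sift_dim + 1)   # 8,256 dims (weights, means, variances)
num_regions = 1 + 4 * 4                              # full window plus the 4 x 4 grid cells
print(fv_dim_per_region * num_regions)               # 140,352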
Similar to Russakovsky et al. [38], we add contextual
information from the part of the image not covered by the
window. Full-image descriptors, or image classification
scores, are commonly used for fully supervised object detec-
tion, see e.g., [13], [48]. For WSL, however, it is important to
use the complement of the object window rather than the
full image, to ensure that the context descriptor also
depends on the window location. This prevents learning
degenerate detection models, since otherwise the context
descriptor can be used to perfectly separate the training
images regardless of the object localization.
To enhance the effectiveness of the context descriptor we
propose a “contrastive” version, defined as the difference
between the background FV x_b and the 1 × 1 foreground FV
x_f. Since we use linear classifiers, the contribution of this
descriptor to the window score, given by w⊤(x_b − x_f), can
be decomposed into a background score w⊤x_b and a fore-
ground score −w⊤x_f. Because the foreground and back-
ground descriptors have the same weight vector, up to a sign
flip, we effectively force features to either score positively on
the foreground and negatively on the background, or vice-
versa, within the contrastive descriptor. This prevents the
detector from scoring the same features positively on both
the foreground and the background.
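A toy numerical check of this decomposition (random vectors; the dimension and names are arbitrary, only the linearity of the classifier matters):

import numpy as np

def contrastive_descriptor(x_f, x_b):
    # x_f: foreground FV of the window, x_b: FV of the remaining image area
    return x_b - x_f

rng = np.random.default_rng(0)
w, x_f, x_b = rng.standard_normal((3, 8256))
score = w @ contrastive_descriptor(x_f, x_b)
assert np.isclose(score, w @ x_b - w @ x_f)  # background score minus foreground score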
To ensure that we have enough SIFT descriptors for the
background FV, we filter the detection windows to respect
a margin of at least 4 percent from the image border, i.e., for
a 100 × 100 pixel image, windows closer than four pixels to
the image border are suppressed. This filtering step
removes about half of the windows. We initialize the MIL
training with the window that covers the image, up to a 4
percent margin, so that all instances are captured by the ini-
tial windows.
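A sketch of this filtering and of the corresponding MIL initialization, under the assumption that candidate windows are given as (x1, y1, x2, y2) pixel coordinates:

def filter_windows(boxes, img_w, img_h, margin=0.04):
    # keep only windows that respect the 4 percent margin from the image border
    mx, my = margin * img_w, margin * img_h
    return [b for b in boxes
            if b[0] >= mx and b[1] >= my and b[2] <= img_w - mx and b[3] <= img_h - my]

def initial_window(img_w, img_h, margin=0.04):
    # the initial MIL window: the image up to the 4 percent margin
    return (margin * img_w, margin * img_h, (1 - margin) * img_w, (1 - margin) * img_h)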
We extract the CNN features using the CNN architecture
of Krizhevsky et al. [29]. We utilize the first seven layers of
the CNN model, which consists of five convolutional and
two fully-connected layers. The CNN model is pre-trained
on the ImageNet ILSVRC 2012 dataset using the Caffe
framework [27]. Following Girshick et al. [22], we crop and
resize the mean-subtracted regions corresponding to the
candidate windows to images of size 224 × 224, as required
by the CNN model. Finally, we apply ℓ2 normalization to
the resulting 4,096 dimensional descriptors.
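The final normalization step amounts to the following (the network forward pass itself is not shown; the epsilon guard is our own addition):

import numpy as np

def l2_normalize(features, eps=1e-12):
    # features: (num_windows, 4096) array of CNN window descriptors
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    return features / np.maximum(norms, eps)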
An important advantage of the CNN features is that
some of the feature dimensions correspond to higher level
image structures, such as certain animal faces and bodies
[22], which can simplify the WSL problem. Our experimen-
tal results show that the CNN features perform better than
the FV features, but that they are complementary since best
performance is obtained when combining both features.
3.2 Weakly Supervised Object Detector Training
The dominant method for weakly supervised training
of object detectors is the standard MIL approach, which
is based on iterating between the training and the re-
localization stages, as described in Section 2.2. Note that
in this approach, the detector used for re-localization in
positive images is trained using positive samples that are
extracted from the very same images. Therefore, there is a
bias towards re-localizing on the same windows; in par-
ticular when high capacity classifiers are used, which are
likely to separate the detector’s training data. For exam-
ple, when a nearest neighbor classifier is used, the re-
localization will be degenerate and not move away from
its initialization, since the same window will be found as
its nearest neighbor.
The same phenomenon occurs when using powerful and
high-dimensional image representations to train linear clas-
sifiers. We illustrate this in Fig. 1, which shows the distribu-
tion of the window scores in a typical standard MIL
iteration. We observe that the windows used in SVM train-
ing score significantly higher than the other ones, including
those with a significant spatial overlap with the most recent
training windows, especially when the high-dimensional
FV descriptors are used.
Fig. 1. Distribution of the window scores in the positive training images
after the fifth iteration of standard MIL training on VOC 2007 for FVs
(left) and CNNs (right). In each plot, the right-most curve corresponds
to the windows chosen in the most recent re-localization step and used
for training the detector; the middle curve corresponds to the other
windows that overlap more than 50 percent with the training windows;
and the left-most curve corresponds to the windows that overlap less
than 50 percent. Each curve is obtained by averaging all per-class score
distributions, and the surrounding regions show the standard deviation.
As a result, standard MIL typically results in degenerate re-
localization. This problem is related to the dimensionality of
the window descriptors. We illustrate this in Fig. 2, where we
show the distribution of inner products between the descrip-
tors of different windows. In Fig. 2a, we use random window
pairs within and across images. In Fig. 2b, we use only within-
image pairs, which are more likely to be similar, and therefore
the histograms are shifted slightly to larger values.
We show the distributions using our 140,352 dimen-
sional FVs, 516 dimensional FVs obtained using four Gaus-
sians without a spatial grid, and 4,096 dimensional CNN-based
descriptors.¹
Unlike in the case of low-dimensional FVs or
CNN-based descriptors, almost all window descriptors are
near orthogonal in the high-dimensional FV case even when
we use within-image pairs only. Also, recall that the weight
vector of a standard linear SVM classifier can be written as a
linear combination of training samples, w = Σ_i α_i x_i. There-
fore, the training windows are likely to score significantly
higher than the other windows in positive images in the high-
dimensional case, resulting in degenerate re-localization
behavior. In Section 4, we verify this hypothesis experimen-
tally by comparing the localization behavior using the low-
dimensional versus the high-dimensional descriptors.
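The effect of dimensionality can be illustrated with synthetic descriptors; this is not the data of Fig. 2, it only shows why independent high-dimensional descriptors tend to be near orthogonal:

import numpy as np

rng = np.random.default_rng(0)
for d in (516, 4096, 140352):
    x = rng.standard_normal((100, d), dtype=np.float32)
    x /= np.linalg.norm(x, axis=1, keepdims=True)   # l2-normalize the descriptors
    sims = (x @ x.T)[np.triu_indices(100, k=1)]     # pairwise inner products
    print(d, float(np.abs(sims).mean()))            # concentrates near zero as d grows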
Note that increasing the regularization weight in SVM train-
ing does not remedy this problem. The ℓ2 regularization
term with weight λ restricts the linear combination weights
such that |α_i| ≤ 1/λ. Therefore, although we can reduce the
influence of individual training samples via regularization,
the resulting classifier remains biased towards the training
windows since the classifier is a linear combination of the
window descriptors. In Section 4, we verify this hypothesis
by evaluating the regularization weight’s effect on the local-
ization performance.
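For reference, the bound on the coefficients follows from the standard SVM dual. Assuming a hinge-loss primal with regularization weight λ (a sketch, not reproduced from the paper), the minimizer admits the representation

\[
\min_{w}\;\frac{\lambda}{2}\lVert w\rVert^{2} + \sum_{i}\max\bigl(0,\,1 - y_i\,w^{\top}x_i\bigr),
\qquad
w^{*}=\sum_{i}\alpha_i\,y_i\,x_i
\;\;\text{with}\;\;
0\le\alpha_i\le\tfrac{1}{\lambda},
\]

where absorbing the label sign y_i into α_i gives the form w = Σ_i α_i x_i with |α_i| ≤ 1/λ used above.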
To address this issue—without sacrificing the descriptor
dimensionality, which would limit its descriptive power—
we propose to train the detector using a multi-fold proce-
dure, reminiscent of cross-validation, within the MIL itera-
tions. We divide the positive training images into K disjoint
folds, and re-localize the images in each fold using a detec-
tor trained using windows from positive images in the other
folds. In this manner the re-localization detectors never
use training windows from the images to which they are
applied. Once re-localization is performed in all positive
training images, we train another detector using all selected
windows. This detector is used for hard-negative mining on
negative training images, and returned as the final detector.
We summarize our multi-fold MIL training procedure in
Algorithm 1. The standard MIL algorithm that does not use
multi-fold training does not execute steps 2(a) and 2(b), and
re-localizes based on the detector learned in step 2(c).
Algorithm 1. Multi-fold Weakly Supervised Training
1) Initialization: positive and negative examples are set to
entire images up to a 4% border
2) For iteration t ¼ 1 to T
a) Divide positive images randomly into K folds
b) For k ¼ 1 to K
i) Train using positive examples in all folds but k, and all
negative examples
ii) Re-localize positives by selecting the top scoring win-
dow in each image of fold k using this detector
c) Train detector using re-localized positives and all
negatives
d) Add new negative windows by hard-negative mining
3) Return final detector and object windows in train data
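To make steps 2(a)-2(d) of Algorithm 1 concrete, a minimal sketch of the training loop follows. The helpers train_svm and hard_negative_mining are hypothetical stand-ins for the SVM training and hard-negative mining used in practice, and all data layouts are assumptions:

import numpy as np

def multifold_mil(candidates, init_positive, negatives, train_svm,
                  hard_negative_mining, K=10, T=10, seed=0):
    # candidates[i]: (n_i, d) descriptors of the candidate windows of positive image i
    # init_positive[i]: descriptor of the initial near full-image window of image i
    # negatives: list of (d,) descriptors of windows from negative images
    pos = list(init_positive)
    neg = list(negatives)
    rng = np.random.default_rng(seed)
    for t in range(T):
        folds = np.array_split(rng.permutation(len(pos)), K)            # step 2a
        for fold in folds:                                              # step 2b
            held_out = set(fold.tolist())
            train_pos = [pos[i] for i in range(len(pos)) if i not in held_out]
            model = train_svm(np.vstack(train_pos + neg),
                              np.r_[np.ones(len(train_pos)), -np.ones(len(neg))])
            for i in fold:                                              # re-localize fold k
                scores = model.decision_function(candidates[i])
                pos[i] = candidates[i][int(np.argmax(scores))]
        model = train_svm(np.vstack(pos + neg),                         # step 2c
                          np.r_[np.ones(len(pos)), -np.ones(len(neg))])
        neg.extend(hard_negative_mining(model))                         # step 2d: new negatives
    return model, pos                                                   # step 3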
The number of folds used in our multi-fold MIL train-
ing procedure should be set to strike a good trade-off
between two competing factors. On the one hand, using
more folds increases the number of training samples per
fold, and is therefore likely to improve re-localization per-
formance. On the other hand, using more folds increases
the computational cost. We experimentally analyze this
trade-off in Section 4.
3.3 Window Refinement
We now explain our window refinement method. It updates
the localizations obtained by the last multi-fold MIL itera-
tion. The final detector is then re-trained based on these
refinements.
An inherent difficulty for weakly supervised object local-
ization is that WSL labels only allow determining the most
repeatable and discriminative patterns for each class. There-
fore, even though the windows found by WSL are likely to
Fig. 2. Distribution of inner products, scaled to the interval [−1, +1], of
pairs of 25,000 windows sampled from 250 images using our high-
dimensional FV (top), a low-dimensional FV (middle), and CNN features
(bottom). (a) uses all window pairs and (b) uses only within-image pairs,
which are more likely to be similar.
1. To make the histograms comparable, we make all descriptors zero
mean before ℓ2 normalization and computing the inner products.

Because the foreground and background descriptor have the same weight vector, up to a sign flip, the authors effectively force features to either score positively on the foreground and negatively on the background, or vice-versa within the contrastive descriptor.