Weakly Supervised Object Localization with Multi-Fold Multiple Instance Learning
Citations
Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization
Learning Deep Features for Discriminative Localization
Deep Learning for Generic Object Detection: A Survey
References
ImageNet Classification with Deep Convolutional Neural Networks
ImageNet: A large-scale hierarchical image database
Latent Dirichlet Allocation
Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation
Frequently Asked Questions (17)
Q2. What can be done to make the object detection process manageable?
Candidate window generation methods, e.g., [1], [24], [49], [56], can be used to make MIL approaches to WSL for object localization manageable, and make it possible to use powerful and computationally expensive object models.
Q3. What is the contribution to the window score of this descriptor?
Since the authors use linear classifiers, the contribution to the window score of this descriptor, given by $w^\top (x_b - x_f)$, can be decomposed as a sum of a foreground and a background score: $-w^\top x_f$ and $w^\top x_b$, respectively.
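The decomposition above follows directly from the linearity of the classifier. The sketch below checks it numerically on toy vectors. The function names are hypothetical, and the sign convention (background minus foreground) is an assumption that follows the order in which the two descriptors appear in the answer above; with the opposite convention both terms simply change sign.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def contrastive_score(w, x_f, x_b):
    """Score of the difference descriptor, w . (x_b - x_f).

    The sign convention is an assumption made for this sketch.
    """
    diff = [b - f for f, b in zip(x_f, x_b)]
    return dot(w, diff)

def decomposed_score(w, x_f, x_b):
    """Same score written as a foreground term plus a background term."""
    foreground_term = -dot(w, x_f)  # same weights as the background, sign flipped
    background_term = dot(w, x_b)
    return foreground_term + background_term

# Toy example: both forms give the same value.
w = [1, -2, 3]
x_f = [4, 0, 1]
x_b = [1, 1, 1]
print(contrastive_score(w, x_f, x_b), decomposed_score(w, x_f, x_b))
```

Because a single weight vector is shared by both terms, a feature that raises the score when it appears in the foreground necessarily lowers it when it appears in the background.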
Q4. What methods have been explored to reduce the amount of labeled training data for object detector training?
Besides weakly supervised training, mixed fully and weakly supervised [9], active [52], and semi-supervised [40] learning, as well as unsupervised object discovery [11] methods, have also been explored to reduce the amount of labeled training data for object detector training.
Q5. What is the method used for re-localizing the images in each fold?
The authors divide the positive training images into K disjoint folds, and re-localize the images in each fold using a detector trained using windows from positive images in the other folds.
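The fold-wise re-localization described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `train_fn` and `score_fn` are hypothetical callables standing in for detector training and window scoring, the random fold assignment is an assumption, and negative training windows are omitted for brevity.

```python
import random

def make_folds(image_ids, k, seed=0):
    """Split image ids into k disjoint folds of near-equal size."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::k] for i in range(k)]

def multi_fold_relocalize(image_ids, windows, selected, k, train_fn, score_fn):
    """One re-localization pass over all positive images.

    windows:  dict image_id -> list of candidate windows
    selected: dict image_id -> currently selected window
    train_fn(list_of_windows) -> detector  (hypothetical)
    score_fn(detector, window) -> float    (hypothetical)
    """
    new_selected = {}
    for held_out in make_folds(image_ids, k):
        held = set(held_out)
        # Train only on windows from positive images in the other folds.
        train_windows = [selected[i] for i in image_ids if i not in held]
        detector = train_fn(train_windows)
        # Re-localize the held-out images with that detector.
        for i in held_out:
            new_selected[i] = max(windows[i], key=lambda w: score_fn(detector, w))
    return new_selected
```

The point of the split is that the detector used to re-localize an image has never been trained on a window from that same image.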
Q6. What is the advantage of using only negative windows?
Relying only on negative windows not only avoids the difficult combinatorial optimization problem, but also has the advantage that their labels are certain, and there is a larger number of negative windows available which makes the pairwise comparisons more robust.
Q7. What was conjectured to cause the degenerate re-localization observed for standard MIL training?
The degenerate re-localization observed for standard MIL training was conjectured to be due to the trivial separability obtained for high-dimensional descriptors.
Q8. What is the dominant method for weakly supervised training of object detectors?
The dominant method for weakly supervised training of object detectors is the standard MIL approach, which is based on iterating between the training and the relocalization stages, as described in Section 2.2.
Q9. What is the common use of full-image descriptors?
Full-image descriptors, or image classification scores, are commonly used for fully supervised object detection, see e.g., [13], [48].
Q10. What is the approach to assigning object class labels?
Their approach assigns object class labels across different object categories concurrently, which makes it possible to benefit from explaining-away effects, i.e., an image region cannot be identified as an instance of multiple categories.
Q11. How do the authors first update the candidate detection windows?
The authors first use the local search procedure to update and score the candidate detection windows based on the objectness measure, without updating the classification scores.
Q12. How do the authors make the classification and objectness scores comparable?
To make the classification and objectness scores comparable, the authors scale each score channel to the range [0, 1] for all windows in the positive training images.
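One way to realize such a scaling is a per-channel min-max normalization, sketched below. The choice of min-max scaling and the function name are assumptions; the answer above only states that each channel is mapped to the range [0, 1].

```python
def scale_to_unit_range(scores):
    """Min-max scale one score channel to [0, 1].

    `scores` holds the values of a single channel (e.g. classification or
    objectness) over all windows in the positive training images.
    """
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Degenerate channel: every window has the same score.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

# Each channel is scaled independently of the other.
classification = scale_to_unit_range([2.0, 4.0, 6.0])
objectness = scale_to_unit_range([0.1, 0.9])
```

After this step the two channels share a common range, so neither dominates purely because of its original scale.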
Q13. What is the common strategy for initializing the object detector?
A simple strategy, e.g., taken in [28], [35], [38], is to initialize by taking large windows in positive images that (nearly) cover the entire image.
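As an illustration of this initialization, the sketch below builds a centered window that covers nearly the whole image. The function name and the `margin` parameter, including its default value, are hypothetical and not taken from the cited works.

```python
def initial_window(width, height, margin=0.04):
    """Return (x0, y0, x1, y1) for a centered window that nearly covers the image.

    `margin` is the fraction of each dimension trimmed from every side;
    its default is an arbitrary value chosen for illustration.
    """
    dx = int(round(margin * width))
    dy = int(round(margin * height))
    return (dx, dy, width - dx, height - dy)

# Example: a 100 x 50 image with a 10 percent margin on every side.
print(initial_window(100, 50, margin=0.1))
```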
Q14. What is the effect of using weakly supervised examples during training?
Utilizing weakly supervised examples during training can sometimes deteriorate the detection performance because of the imperfect localizations provided by the WSL methods.
Q15. What is the effect of combining fully supervised images with weakly supervised ones?
The authors observe that the benefit of combining fully supervised images with weakly supervised ones is particularly significant when the ratio of fully supervised images is up to 50 percent for FV features.
Q16. Why does standard MIL get stuck after the first few iterations?
Whereas multi-fold MIL is able to localize the most discriminative subregions of the object categories, standard MIL tends to get stuck after the first few iterations, resulting in bounding box estimates that are too large.
Q17. What is the difference between the foreground and background descriptor?
Because the foreground and background descriptors share the same weight vector, up to a sign flip, the authors effectively force features to either score positively on the foreground and negatively on the background, or vice versa, within the contrastive descriptor.