Large-Scale Image Retrieval with Attentive Deep Local Features

doi:10.1109/ICCV.2017.374

Home
/
Papers
/
Large-Scale Image Retrieval with Attentive Deep Local Features

Proceedings Article•DOI•

Large-Scale Image Retrieval with Attentive Deep Local Features

Hyeonwoo Noh¹, Andre Araujo², Jack Sim³, Tobias Weyand³, Bohyung Han¹ - Show less +1 more•Institutions (3)

Pohang University of Science and Technology¹, Stanford University², Google³

01 Oct 2017-pp 3476-3485

TL;DR: An attentive local feature descriptor suitable for large-scale image retrieval, referred to as DELE (DEep Local Feature), based on convolutional neural networks, which are trained only with image-level annotations on a landmark image dataset.

read less

Abstract: We propose an attentive local feature descriptor suitable for large-scale image retrieval, referred to as DELE (DEep Local Feature). The new feature is based on convolutional neural networks, which are trained only with image-level annotations on a landmark image dataset. To identify semantically useful local features for image retrieval, we also propose an attention mechanism for key point selection, which shares most network layers with the descriptor. This frame-work can be used for image retrieval as a drop-in replacement for other keypoint detectors and descriptors, enabling more accurate feature matching and geometric verification. Our system produces reliable confidence scores to reject false positives–in particular, it is robust against queries that have no correct match in the database. To evaluate the proposed descriptor, we introduce a new large-scale dataset, referred to as Google-Landmarks dataset, which involves challenges in both database and query such as background clutter, partial occlusion, multiple landmarks, objects in variable scales, etc. We show that DELE outperforms the state-of-the-art global and local descriptors in the large-scale setting by significant margins.

...read moreread less

Citations

PDF

Open Access

More filters

Journal Article•DOI•

Fine-Tuning CNN Image Retrieval with No Human Annotation

[...]

Filip Radenovic¹, Giorgos Tolias¹, Ondrej Chum¹•Institutions (1)

Czech Technical University in Prague¹

01 Jul 2019-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: It is shown that both hard-positive and hard-negative examples, selected by exploiting the geometry and the camera positions available from the 3D models, enhance the performance of particular-object retrieval.

...read moreread less

Abstract: Image descriptors based on activations of Convolutional Neural Networks (CNNs) have become dominant in image retrieval due to their discriminative power, compactness of representation, and search efficiency. Training of CNNs, either from scratch or fine-tuning, requires a large amount of annotated data, where a high quality of annotation is often crucial. In this work, we propose to fine-tune CNNs for image retrieval on a large collection of unordered images in a fully automated manner. Reconstructed 3D models obtained by the state-of-the-art retrieval and structure-from-motion methods guide the selection of the training data. We show that both hard-positive and hard-negative examples, selected by exploiting the geometry and the camera positions available from the 3D models, enhance the performance of particular-object retrieval. CNN descriptor whitening discriminatively learned from the same training data outperforms commonly used PCA whitening. We propose a novel trainable Generalized-Mean (GeM) pooling layer that generalizes max and average pooling and show that it boosts retrieval performance. Applying the proposed method to the VGG network achieves state-of-the-art performance on the standard benchmarks: Oxford Buildings, Paris, and Holidays datasets.

...read moreread less

611 citations

Cites methods from "Large-Scale Image Retrieval with At..."

...[47] to learn global image descriptors using a saliency mask....
[...]

Posted Content•

Fine-tuning CNN Image Retrieval with No Human Annotation

[...]

Filip Radenovic¹, Giorgos Tolias¹, Ondrej Chum¹•Institutions (1)

Czech Technical University in Prague¹

03 Nov 2017-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this article, the authors proposed to fine-tune CNNs for image retrieval on a large collection of unordered images in a fully automated manner, using Reconstructed 3D models obtained by the state-of-the-art retrieval and structure-from-motion methods.

...read moreread less

607 citations

Proceedings Article•DOI•

D2-Net: A Trainable CNN for Joint Description and Detection of Local Features

[...]

Mihai Dusmanu¹, Ignacio Rocco², Tomas Pajdla³, Marc Pollefeys¹, Josef Sivic³, Akihiko Torii⁴, Torsten Sattler⁵ - Show less +3 more•Institutions (5)

ETH Zurich¹, PSL Research University², Czech Technical University in Prague³, Tokyo Institute of Technology⁴, Chalmers University of Technology⁵

15 Jun 2019

TL;DR: This work proposes an approach where a single convolutional neural network plays a dual role: It is simultaneously a dense feature descriptor and a feature detector, and shows that this model can be trained using pixel correspondences extracted from readily available large-scale SfM reconstructions, without any further annotations.

...read moreread less

Abstract: In this work we address the problem of finding reliable pixel-level correspondences under difficult imaging conditions. We propose an approach where a single convolutional neural network plays a dual role: It is simultaneously a dense feature descriptor and a feature detector. By postponing the detection to a later stage, the obtained keypoints are more stable than their traditional counterparts based on early detection of low-level structures. We show that this model can be trained using pixel correspondences extracted from readily available large-scale SfM reconstructions, without any further annotations. The proposed method obtains state-of-the-art performance on both the difficult Aachen Day-Night localization dataset and the InLoc indoor localization benchmark, as well as competitive performance on other benchmarks for image matching and 3D reconstruction.

...read moreread less

594 citations

Cites background or methods or result from "Large-Scale Image Retrieval with At..."

...Several of these methods start by dense descriptor extraction [3,37,60,61] and later aggregate these descriptors into a compact image-level descriptor for retrieval....
[...]
...Works most related to our approach are [37,60]: [37] develops an approach similar to ours, where an attention module is added on top of the dense description stage to perform keypoint selection....
[...]
...We compare against upright RootSIFT descriptors extracted from DoG keypoints [29], HardNet++ descriptors with HesAffNet features [34, 35], DELF [37], SuperPoint [13] and DenseSfM [45]....
[...]
...As in other previous work [37,43,58], the most straightforward interpretation of the 3D tensor F is as a dense set of descriptor vectors d:...
[...]
...DELF does not refine its keypoint positions - thus, detecting the same pixel positions at feature map level yields perfect accuracy for strict thresholds....
[...]

Proceedings Article•DOI•

From Coarse to Fine: Robust Hierarchical Localization at Large Scale

[...]

Paul-Edouard Sarlin¹, Cesar Cadena², Roland Siegwart¹, Marcin Dymczyk•Institutions (2)

Institute of Robotics and Intelligent Systems¹, ETH Zurich²

15 Jun 2019

TL;DR: HF-Net is proposed, a hierarchical localization approach based on a monolithic CNN that simultaneously predicts local features and global descriptors for accurate 6-DoF localization and sets a new state-of-the-art on two challenging benchmarks for large-scale localization.

...read moreread less

Abstract: Robust and accurate visual localization is a fundamental capability for numerous applications, such as autonomous driving, mobile robotics, or augmented reality. It remains, however, a challenging task, particularly for large-scale environments and in presence of significant appearance changes. State-of-the-art methods not only struggle with such scenarios, but are often too resource intensive for certain real-time applications. In this paper we propose HF-Net, a hierarchical localization approach based on a monolithic CNN that simultaneously predicts local features and global descriptors for accurate 6-DoF localization. We exploit the coarse-to-fine localization paradigm: we first perform a global retrieval to obtain location hypotheses and only later match local features within those candidate places. This hierarchical approach incurs significant runtime savings and makes our system suitable for real-time operation. By leveraging learned descriptors, our method achieves remarkable localization robustness across large variations of appearance and sets a new state-of-the-art on two challenging benchmarks for large-scale localization.

...read moreread less

378 citations

Cites methods from "Large-Scale Image Retrieval with At..."

...The images from both Google Landmarks [36] and Berkeley Deep Drive [58] are resized to 640×480 and converted to grayscale....
[...]
...We thus train on 185k images from the Google Landmarks dataset [36], containing a wide variety of day-time urban scenes, and 37k images from the night and dawn sequences of the Berkeley Deep Drive dataset [58], composed of road scenes with motion blur....
[...]
...The images from both Google Landmarks [11] and Berkeley Deep Drive [21] are resized to 640×480 and converted to grayscale....
[...]

Book Chapter•DOI•

PPF-FoldNet: Unsupervised Learning of Rotation Invariant 3D Local Descriptors

[...]

Haowen Deng¹, Tolga Birdal¹, Slobodan Ilic¹•Institutions (1)

Technische Universität München¹

08 Sep 2018

TL;DR: It is demonstrated that despite having six degree-of-freedom invariance and lack of training labels, PPF-FoldNet achieves state of the art results in standard benchmark datasets and outperforms its competitors when rotations and varying point densities are present.

...read moreread less

Abstract: We present PPF-FoldNet for unsupervised learning of 3D local descriptors on pure point cloud geometry Based on the folding-based auto-encoding of well known point pair features, PPF-FoldNet offers many desirable properties: it necessitates neither supervision, nor a sensitive local reference frame, benefits from point-set sparsity, is end-to-end, fast, and can extract powerful rotation invariant descriptors Thanks to a novel feature visualization, its evolution can be monitored to provide interpretable insights Our extensive experiments demonstrate that despite having six degree-of-freedom invariance and lack of training labels, our network achieves state of the art results in standard benchmark datasets and outperforms its competitors when rotations and varying point densities are present PPF-FoldNet achieves 9% higher recall on standard benchmarks, 23% higher recall when rotations are introduced into the same datasets and finally, a margin of >35% is attained when point density is significantly decreased

...read moreread less

330 citations

Cites background from "Large-Scale Image Retrieval with At..."

...Already in 2D, learned descriptors significantly outperform their engineered counterparts [49, 28]....
[...]

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138

Collapse

References

PDF

Open Access

More filters

Proceedings Article•DOI•

Deep Residual Learning for Image Recognition

[...]

Kaiming He¹, Xiangyu Zhang¹, Shaoqing Ren¹, Jian Sun¹•Institutions (1)

Microsoft¹

27 Jun 2016

TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.

...read moreread less

Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions1, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

...read moreread less

123,388 citations

Proceedings Article•

Very Deep Convolutional Networks for Large-Scale Image Recognition

[...]

Karen Simonyan¹, Andrew Zisserman¹•Institutions (1)

University of Oxford¹

01 Jan 2015

TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.

...read moreread less

Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.

...read moreread less

49,914 citations

Journal Article•DOI•

Distinctive Image Features from Scale-Invariant Keypoints

[...]

David G. Lowe¹•Institutions (1)

University of British Columbia¹

01 Nov 2004-International Journal of Computer Vision

TL;DR: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene and can robustly identify objects among clutter and occlusion while achieving near real-time performance.

...read moreread less

Abstract: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.

...read moreread less

46,906 citations

Journal Article•DOI•

ImageNet Large Scale Visual Recognition Challenge

[...]

Olga Russakovsky¹, Jia Deng², Hao Su¹, Jonathan Krause¹, Sanjeev Satheesh¹, Sean Ma¹, Zhiheng Huang¹, Andrej Karpathy¹, Aditya Khosla³, Michael S. Bernstein¹, Alexander C. Berg⁴, Li Fei-Fei¹ - Show less +8 more•Institutions (4)

Stanford University¹, University of Michigan², Massachusetts Institute of Technology³, University of North Carolina at Chapel Hill⁴

01 Dec 2015-International Journal of Computer Vision

TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) as mentioned in this paper is a benchmark in object category classification and detection on hundreds of object categories and millions of images, which has been run annually from 2010 to present, attracting participation from more than fifty institutions.

...read moreread less

Abstract: The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the 5 years of the challenge, and propose future directions and improvements.

...read moreread less

30,811 citations

Journal Article•DOI•

Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography

[...]

Martin A. Fischler¹, Robert C. Bolles¹•Institutions (1)

SRI International¹

01 Jun 1981-Communications of The ACM

TL;DR: New results are derived on the minimum number of landmarks needed to obtain a solution, and algorithms are presented for computing these minimum-landmark solutions in closed form that provide the basis for an automatic system that can solve the Location Determination Problem under difficult viewing.

...read moreread less

Abstract: A new paradigm, Random Sample Consensus (RANSAC), for fitting a model to experimental data is introduced. RANSAC is capable of interpreting/smoothing data containing a significant percentage of gross errors, and is thus ideally suited for applications in automated image analysis where interpretation is based on the data provided by error-prone feature detectors. A major portion of this paper describes the application of RANSAC to the Location Determination Problem (LDP): Given an image depicting a set of landmarks with known locations, determine that point in space from which the image was obtained. In response to a RANSAC requirement, new results are derived on the minimum number of landmarks needed to obtain a solution, and algorithms are presented for computing these minimum-landmark solutions in closed form. These results provide the basis for an automatic system that can solve the LDP under difficult viewing

...read moreread less

23,396 citations