Comparative Evaluation of Hand-Crafted and Learned Local Features

doi:10.1109/CVPR.2017.736

Proceedings Article•DOI•

Johannes L. Schonberger¹, Hans Hardmeier¹, Torsten Sattler¹, Marc Pollefeys²•Institutions (2)

ETH Zurich¹, Microsoft²

01 Jul 2017-pp 6959-6968

TL;DR: An extensive experimental evaluation of hand-crafted and learned local features under a single protocol that ensures comparable results, assessing the descriptors on standard matching criteria and in the context of image-based reconstruction.

Abstract: Matching local image descriptors is a key step in many computer vision applications. For more than a decade, hand-crafted descriptors such as SIFT have been used for this task. Recently, multiple new descriptors learned from data have been proposed and shown to improve on SIFT in terms of discriminative power. This paper is dedicated to an extensive experimental evaluation of learned local features to establish a single evaluation protocol that ensures comparable results. In terms of matching performance, we evaluate the different descriptors regarding standard criteria. However, considering matching performance in isolation only provides an incomplete measure of a descriptor's quality. For example, finding additional correct matches between similar images does not necessarily lead to a better performance when trying to match images under extreme viewpoint or illumination changes. Besides pure descriptor matching, we thus also evaluate the different descriptors in the context of image-based reconstruction. This enables us to study the descriptor performance on a set of more practical criteria including image retrieval, the ability to register images under strong viewpoint and illumination changes, and the accuracy and completeness of the reconstructed cameras and scenes. To facilitate future research, the full evaluation pipeline is made publicly available.

Citations

Proceedings Article•DOI•

D2-Net: A Trainable CNN for Joint Description and Detection of Local Features

Mihai Dusmanu¹, Ignacio Rocco², Tomas Pajdla³, Marc Pollefeys¹, Josef Sivic³, Akihiko Torii⁴, Torsten Sattler⁵•Institutions (5)

ETH Zurich¹, PSL Research University², Czech Technical University in Prague³, Tokyo Institute of Technology⁴, Chalmers University of Technology⁵

15 Jun 2019

TL;DR: This work proposes an approach where a single convolutional neural network plays a dual role: It is simultaneously a dense feature descriptor and a feature detector, and shows that this model can be trained using pixel correspondences extracted from readily available large-scale SfM reconstructions, without any further annotations.

Abstract: In this work we address the problem of finding reliable pixel-level correspondences under difficult imaging conditions. We propose an approach where a single convolutional neural network plays a dual role: It is simultaneously a dense feature descriptor and a feature detector. By postponing the detection to a later stage, the obtained keypoints are more stable than their traditional counterparts based on early detection of low-level structures. We show that this model can be trained using pixel correspondences extracted from readily available large-scale SfM reconstructions, without any further annotations. The proposed method obtains state-of-the-art performance on both the difficult Aachen Day-Night localization dataset and the InLoc indoor localization benchmark, as well as competitive performance on other benchmarks for image matching and 3D reconstruction.

594 citations

Journal Article•DOI•

Image Matching from Handcrafted to Deep Features: A Survey

Jiayi Ma¹, Xingyu Jiang¹, Aoxiang Fan¹, Junjun Jiang², Junchi Yan³•Institutions (3)

Wuhan University¹, Harbin Institute of Technology², Shanghai Jiao Tong University³

01 Jan 2021-International Journal of Computer Vision

TL;DR: This survey introduces feature detection, description, and matching techniques from handcrafted methods to trainable ones and provides an analysis of the development of these methods in theory and practice, and briefly introduces several typical image matching-based applications.

Abstract: As a fundamental and critical task in various visual applications, image matching can identify then correspond the same or similar structure/content from two or more images. Over the past decades, growing amount and diversity of methods have been proposed for image matching, particularly with the development of deep learning techniques over the recent years. However, it may leave several open questions about which method would be a suitable choice for specific applications with respect to different scenarios and task requirements and how to design better image matching methods with superior performance in accuracy, robustness and efficiency. This encourages us to conduct a comprehensive and systematic review and analysis for those classical and latest techniques. Following the feature-based image matching pipeline, we first introduce feature detection, description, and matching techniques from handcrafted methods to trainable ones and provide an analysis of the development of these methods in theory and practice. Secondly, we briefly introduce several typical image matching-based applications for a comprehensive understanding of the significance of image matching. In addition, we also provide a comprehensive and objective comparison of these classical and latest techniques through extensive experiments on representative datasets. Finally, we conclude with the current status of image matching technologies and deliver insightful discussions and prospects for future works. This survey can serve as a reference for (but not limited to) researchers and engineers in image matching and related fields.

474 citations

Cites background or methods from "Comparative Evaluation of Hand-Craf..."

...Descriptors with deep learning techniques can be regarded as an extension of those based on classical learning (Schonberger et al. 2017)....
[...]
...From the abovementioned, we can know that several comprehensive and thorough evaluation of feature detectors and descriptors can be found in Komorowski et al. (2018), Lenc and Vedaldi (2014), Heinly et al. (2012) and Schonberger et al. (2017)....
[...]
...D reconstruction task, including the works of Fan et al. (2019) and Schonberger et al. (2017)....
[...]
...…a single part of image matching community, either focus on detectors (Huang et al. 2018; Lenc and Vedaldi 2014) or descriptors (Balntas et al. 2017; Schonberger et al. 2017) or specific matching tasks (Ferrante and Paragios 2017; Haskins et al. 2020; Yan et al. 2016b; Maiseli et al. 2017), and…...
[...]
...2018; Lenc and Vedaldi 2014) or descriptors (Balntas et al. 2017; Schonberger et al. 2017) or specific matching tasks (Ferrante and Paragios 2017; Haskins et al....
[...]

Proceedings Article•DOI•

Learning to Find Good Correspondences

Kwang Moo Yi¹, Eduard Trulls², Yuki Ono³, Vincent Lepetit⁴, Mathieu Salzmann², Pascal Fua²•Institutions (4)

University of Victoria¹, École Polytechnique Fédérale de Lausanne², Sony Broadcast & Professional Research Laboratories³, Graz University of Technology⁴

01 Jan 2018

TL;DR: In this paper, a multi-layer perceptron operating on pixel coordinates rather than directly on the image is proposed to learn to find good correspondences for wide-baseline stereo.

Abstract: We develop a deep architecture to learn to find good correspondences for wide-baseline stereo. Given a set of putative sparse matches and the camera intrinsics, we train our network in an end-to-end fashion to label the correspondences as inliers or outliers, while simultaneously using them to recover the relative pose, as encoded by the essential matrix. Our architecture is based on a multi-layer perceptron operating on pixel coordinates rather than directly on the image, and is thus simple and small. We introduce a novel normalization technique, called Context Normalization, which allows us to process each data point separately while embedding global information in it, and also makes the network invariant to the order of the correspondences. Our experiments on multiple challenging datasets demonstrate that our method is able to drastically improve the state of the art with little training data.
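
The Context Normalization step described in this abstract can be illustrated with a minimal sketch. It assumes the normalization subtracts the mean and divides by the standard deviation of each feature channel, computed across all correspondences of one image pair; the function name, array layout, and `eps` value are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def context_normalize(features, eps=1e-5):
    """Normalize per-correspondence features using statistics of the whole set.

    features: array of shape (N, C), one row per putative correspondence.
    The layout and eps are assumptions made for this sketch.
    """
    # Mean and standard deviation of each channel over all N correspondences.
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    # Every row is shifted and scaled by the same global statistics, so each
    # point is still processed on its own while carrying context information.
    return (features - mean) / (std + eps)
```

Because the mean and standard deviation do not depend on the order of the rows, permuting the correspondences only permutes the output rows in the same way, which is the order-independence the abstract refers to.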

456 citations

Proceedings Article•DOI•

From Coarse to Fine: Robust Hierarchical Localization at Large Scale

Paul-Edouard Sarlin¹, Cesar Cadena², Roland Siegwart¹, Marcin Dymczyk•Institutions (2)

Institute of Robotics and Intelligent Systems¹, ETH Zurich²

15 Jun 2019

TL;DR: HF-Net is proposed, a hierarchical localization approach based on a monolithic CNN that simultaneously predicts local features and global descriptors for accurate 6-DoF localization and sets a new state-of-the-art on two challenging benchmarks for large-scale localization.

Abstract: Robust and accurate visual localization is a fundamental capability for numerous applications, such as autonomous driving, mobile robotics, or augmented reality. It remains, however, a challenging task, particularly for large-scale environments and in presence of significant appearance changes. State-of-the-art methods not only struggle with such scenarios, but are often too resource intensive for certain real-time applications. In this paper we propose HF-Net, a hierarchical localization approach based on a monolithic CNN that simultaneously predicts local features and global descriptors for accurate 6-DoF localization. We exploit the coarse-to-fine localization paradigm: we first perform a global retrieval to obtain location hypotheses and only later match local features within those candidate places. This hierarchical approach incurs significant runtime savings and makes our system suitable for real-time operation. By leveraging learned descriptors, our method achieves remarkable localization robustness across large variations of appearance and sets a new state-of-the-art on two challenging benchmarks for large-scale localization.

378 citations

Cites background from "Comparative Evaluation of Hand-Craf..."

...the number of observation per 3D point, as defined by [16]....
[...]

Proceedings Article•

Working hard to know your neighbor's margins: Local descriptor learning loss

Anastasiia Mishchuk¹, Dmytro Mishkin², Filip Radenovic², Jiri Matas²•Institutions (2)

École Polytechnique Fédérale de Lausanne¹, Czech Technical University in Prague²

30 May 2017

TL;DR: This paper introduces a metric-learning loss that maximizes the distance between the closest positive and closest negative examples in the batch; it works well for both shallow and deep convolutional network architectures and, applied to L2Net, yields the HardNet descriptor.

Abstract: We introduce a loss for metric learning, which is inspired by the Lowe's matching criterion for SIFT. We show that the proposed loss, that maximizes the distance between the closest positive and closest negative example in the batch, is better than complex regularization methods; it works well for both shallow and deep convolution network architectures. Applying the novel loss to the L2Net CNN architecture results in a compact descriptor named HardNet. It has the same dimensionality as SIFT (128) and shows state-of-art performance in wide baseline stereo, patch verification and instance retrieval benchmarks.
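
One way to read the loss described in this abstract is as a triplet margin loss in which each matching pair is compared against the closest non-matching descriptor in the batch. The NumPy sketch below follows that reading; the function name, the margin of 1.0, and the choice to search for negatives along both rows and columns of the distance matrix are assumptions for illustration, not the released HardNet code.

```python
import numpy as np

def hardest_in_batch_triplet_loss(anchors, positives, margin=1.0):
    """anchors[i] and positives[i] describe the same patch; every other
    descriptor in the batch is treated as a negative (assumed setup)."""
    # Pairwise Euclidean distances between every anchor and every positive.
    diff = anchors[:, None, :] - positives[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1) + 1e-8)
    # Distances of the matching pairs sit on the diagonal.
    pos = np.diag(dist)
    # Push the diagonal far away so a matching pair is never picked as a negative.
    masked = dist + np.eye(len(dist)) * 1e6
    # Closest non-matching descriptor, searched along rows and along columns.
    hardest_neg = np.minimum(masked.min(axis=1), masked.min(axis=0))
    # Hinge: the positive distance should be smaller than the hardest
    # negative distance by at least `margin`.
    return np.maximum(0.0, margin + pos - hardest_neg).mean()
```

The loss is zero once every matching pair is closer than its hardest negative by the margin, which mirrors the nearest versus second-nearest comparison in Lowe's matching criterion.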

321 citations

References

Proceedings Article•

ImageNet Classification with Deep Convolutional Neural Networks

Alex Krizhevsky¹, Ilya Sutskever¹, Geoffrey E. Hinton¹•Institutions (1)

University of Toronto¹

03 Dec 2012

TL;DR: A large, deep convolutional neural network with five convolutional layers, some followed by max-pooling layers, and three fully-connected layers ending in a 1000-way softmax achieved state-of-the-art classification results on ImageNet.

Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

73,978 citations

Journal Article•DOI•

Distinctive Image Features from Scale-Invariant Keypoints

David G. Lowe¹•Institutions (1)

University of British Columbia¹

01 Nov 2004-International Journal of Computer Vision

TL;DR: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene and can robustly identify objects among clutter and occlusion while achieving near real-time performance.

Abstract: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.
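
The nearest-neighbor matching of individual features mentioned in this abstract is commonly paired with the distance-ratio check from the same paper, which keeps a match only when the nearest descriptor is clearly closer than the second nearest. The sketch below uses brute-force search in place of the fast approximate nearest-neighbor algorithm the abstract mentions; the function name and array layout are assumptions, and 0.8 is the threshold usually attributed to Lowe.

```python
import numpy as np

def ratio_test_matches(desc_a, desc_b, ratio=0.8):
    """Match rows of desc_a to rows of desc_b (desc_b needs at least two rows).

    Returns a list of (index_in_a, index_in_b) pairs that pass the ratio test.
    """
    # Brute-force Euclidean distances between all descriptor pairs.
    dists = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=-1)
    # Indices of desc_b sorted by distance for every descriptor in desc_a.
    order = np.argsort(dists, axis=1)
    rows = np.arange(len(desc_a))
    best, second = order[:, 0], order[:, 1]
    # Keep a match only if the nearest neighbor is distinctly closer
    # than the second-nearest neighbor.
    keep = dists[rows, best] < ratio * dists[rows, second]
    return [(int(i), int(best[i])) for i in np.flatnonzero(keep)]
```

Ambiguous features, whose two nearest neighbors are almost equally far away, are dropped rather than matched, trading some recall for precision.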

46,906 citations

Journal Article•DOI•

Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography

Martin A. Fischler¹, Robert C. Bolles¹•Institutions (1)

SRI International¹

01 Jun 1981-Communications of The ACM

TL;DR: New results are derived on the minimum number of landmarks needed to obtain a solution, and algorithms are presented for computing these minimum-landmark solutions in closed form that provide the basis for an automatic system that can solve the Location Determination Problem under difficult viewing.

Abstract: A new paradigm, Random Sample Consensus (RANSAC), for fitting a model to experimental data is introduced. RANSAC is capable of interpreting/smoothing data containing a significant percentage of gross errors, and is thus ideally suited for applications in automated image analysis where interpretation is based on the data provided by error-prone feature detectors. A major portion of this paper describes the application of RANSAC to the Location Determination Problem (LDP): Given an image depicting a set of landmarks with known locations, determine that point in space from which the image was obtained. In response to a RANSAC requirement, new results are derived on the minimum number of landmarks needed to obtain a solution, and algorithms are presented for computing these minimum-landmark solutions in closed form. These results provide the basis for an automatic system that can solve the LDP under difficult viewing
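
The sample-and-verify idea behind RANSAC can be sketched on a toy problem. The example below fits a 2D line to points containing gross errors instead of solving the Location Determination Problem from the paper; the function name, iteration count, and inlier tolerance are illustrative assumptions.

```python
import numpy as np

def ransac_line(points, n_iters=200, inlier_tol=0.1, seed=0):
    """Fit y = slope * x + intercept to an (N, 2) float array with outliers.

    Returns (slope, intercept, inlier_mask). Assumes the line is not vertical.
    """
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(points), dtype=bool)
    for _ in range(n_iters):
        # A line is determined by two points, so the minimal sample size is 2.
        i, j = rng.choice(len(points), size=2, replace=False)
        p, q = points[i], points[j]
        direction = q - p
        norm = np.linalg.norm(direction)
        if norm == 0:
            continue  # degenerate sample: identical points
        # Perpendicular distance of every point to the hypothesized line.
        normal = np.array([-direction[1], direction[0]]) / norm
        residuals = np.abs((points - p) @ normal)
        inliers = residuals < inlier_tol
        # Keep the hypothesis supported by the largest consensus set.
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refine with least squares on the consensus set only.
    slope, intercept = np.polyfit(points[best_inliers, 0],
                                  points[best_inliers, 1], 1)
    return slope, intercept, best_inliers
```

Only the final least-squares refit uses many points at once; the gross errors never enter it because they fail the consensus check.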

23,396 citations

"Comparative Evaluation of Hand-Craf..." refers background in this paper

...Due to the exponential complexity in the number of outliers [10], it is practically more important to have good precision for manageable runtimes of geometric verification....
[...]
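
The link between outliers and runtime in the excerpt above can be made concrete with the standard RANSAC iteration estimate: the number of random samples needed to draw at least one all-inlier sample with a given confidence. The function name and the default confidence of 0.99 are assumptions for this sketch.

```python
import math

def ransac_iterations(inlier_ratio, sample_size, confidence=0.99):
    """Samples needed so that, with probability `confidence`, at least one
    minimal sample of `sample_size` points contains only inliers."""
    # Probability that a single random minimal sample is outlier-free.
    p_good_sample = inlier_ratio ** sample_size
    if p_good_sample <= 0.0:
        raise ValueError("inlier_ratio must be greater than zero")
    if p_good_sample >= 1.0:
        return 1
    # Solve (1 - p_good_sample) ** n <= 1 - confidence for n.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_good_sample))
```

As the inlier ratio drops, the required number of samples grows rapidly, which is why match precision matters for keeping geometric verification affordable.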

Distinctive Image Features from Scale-Invariant Keypoints

Matthijs Dorst

01 Jan 2011

TL;DR: The Scale-Invariant Feature Transform (or SIFT) algorithm is a highly robust method to extract and consequently match distinctive invariant features from images that can then be used to reliably match objects in differing images.

Abstract: The Scale-Invariant Feature Transform (or SIFT) algorithm is a highly robust method to extract and consequently match distinctive invariant features from images. These features can then be used to reliably match objects in differing images. The algorithm was first proposed by Lowe [12] and further developed to increase performance resulting in the classic paper [13] that served as foundation for SIFT which has played an important role in robotic and machine vision in the past decade.

14,708 citations

Book Chapter•DOI•

SURF: speeded up robust features

Herbert Bay¹, Tinne Tuytelaars², Luc Van Gool¹•Institutions (2)

ETH Zurich¹, Katholieke Universiteit Leuven²

07 May 2006

TL;DR: A novel scale- and rotation-invariant interest point detector and descriptor, coined SURF (Speeded Up Robust Features), which approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster.

Abstract: In this paper, we present a novel scale- and rotation-invariant interest point detector and descriptor, coined SURF (Speeded Up Robust Features). It approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster. This is achieved by relying on integral images for image convolutions; by building on the strengths of the leading existing detectors and descriptors (in casu, using a Hessian matrix-based measure for the detector, and a distribution-based descriptor); and by simplifying these methods to the essential. This leads to a combination of novel detection, description, and matching steps. The paper presents experimental results on a standard evaluation set, as well as on imagery obtained in the context of a real-life object recognition application. Both show SURF's strong performance.
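
The speed-up attributed to integral images in this abstract comes from the fact that the sum over any axis-aligned box can be read from four table entries, regardless of the box size. The sketch below shows only that building block, not the SURF detector or descriptor; the function names and the zero-padded table layout are assumptions.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero row and column prepended, so that
    ii[r, c] equals the sum of img[:r, :c]."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.float64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom, left:right] using four lookups."""
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]
```

Box filters of any size therefore cost the same number of operations, which is what makes approximating the image convolutions with box filters cheap.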

13,011 citations