Proceedings ArticleDOI

Deep Keypoint-Based Camera Pose Estimation with Geometric Constraints

24 Oct 2020, pp. 4950-4957
TL;DR: In this article, an end-to-end trainable framework is proposed, consisting of learnable modules for detection, feature extraction, matching, and outlier rejection, while directly optimizing for the geometric pose objective.
Abstract: Estimating relative camera poses from consecutive frames is a fundamental problem in visual odometry (VO) and simultaneous localization and mapping (SLAM), where classic methods consisting of hand-crafted features and sampling-based outlier rejection have been a dominant choice for over a decade. Although multiple works propose to replace these modules with learning-based counterparts, most have not yet been as accurate, robust and generalizable as conventional methods. In this paper, we design an end-to-end trainable framework consisting of learnable modules for detection, feature extraction, matching and outlier rejection, while directly optimizing for the geometric pose objective. We show both quantitatively and qualitatively that pose estimation performance on par with the classic pipeline can be achieved. Moreover, we show that through end-to-end training, the key components of the pipeline can be significantly improved, which leads to better generalizability to unseen datasets compared to existing learning-based methods.
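To ground the comparison, below is a minimal sketch (assuming OpenCV and NumPy) of the classic hand-crafted pipeline that the paper replaces with learnable modules: SIFT detection and description, ratio-test matching, RANSAC outlier rejection, and essential-matrix pose recovery. The function name and thresholds are illustrative assumptions, not the paper's implementation.

```python
import cv2
import numpy as np

def relative_pose(img1, img2, K):
    """Classic relative-pose estimation between two consecutive frames."""
    sift = cv2.SIFT_create()                                       # detection + description
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)
    pairs = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des1, des2, k=2)   # matching
    good = [m for m, n in pairs if m.distance < 0.8 * n.distance]  # Lowe ratio test
    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])
    E, inliers = cv2.findEssentialMat(pts1, pts2, K,               # outlier rejection
                                      method=cv2.RANSAC, prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)   # geometric pose
    return R, t
```

Each stage above is the hand-crafted counterpart of a module the paper makes learnable and trains jointly against the pose objective.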
Citations
Book ChapterDOI
TL;DR: In this paper, the performance of well-known local features is studied on a medical dataset captured during routine colonoscopy procedures, motivated by the often significant gap between research results and applicability in routine medical practice.
Abstract: There is often a significant gap between research results and applicability in routine medical practice. This work studies the performance of well-known local features on a medical dataset captured during routine colonoscopy procedures. Local feature extraction and matching is a key step for many computer vision applications, specially regarding 3D modelling. In the medical domain, handcrafted local features such as SIFT, with public pipelines such as COLMAP, are still a predominant tool for this kind of tasks. We explore the potential of the well known self-supervised approach SuperPoint, present an adapted variation for the endoscopic domain and propose a challenging evaluation framework. SuperPoint based models achieve significantly higher matching quality than commonly used local features in this domain. Our adapted model avoids features within specularity regions, a frequent and problematic artifact in endoscopic images, with consequent benefits for matching and reconstruction results.
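The adapted model learns to avoid specular regions during training; purely as an illustration of the effect, here is a hypothetical post-hoc filter (assuming OpenCV and NumPy) that masks keypoints on near-saturated pixels. The function name and thresholds are assumptions, not the authors' method.

```python
import cv2
import numpy as np

def drop_specular_keypoints(gray, keypoints, sat_thresh=240, dilate_px=5):
    """Discard keypoints that fall on near-saturated (specular) pixels,
    with the mask slightly dilated to cover highlight borders."""
    _, spec = cv2.threshold(gray, sat_thresh, 255, cv2.THRESH_BINARY)
    spec = cv2.dilate(spec, np.ones((dilate_px, dilate_px), np.uint8))
    return [kp for kp in keypoints
            if spec[int(round(kp.pt[1])), int(round(kp.pt[0]))] == 0]
```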

2 citations

Book ChapterDOI
06 Jan 2022
TL;DR: This work uses a convolutional neural network to extract features of the vessel structure in multi-modal retinal images using a keypoint detection and description network, and demonstrates the best registration performance on a public multi-modal dataset in comparison to competing methods.

2 citations

Proceedings ArticleDOI
18 Jul 2022
TL;DR: CorrNet, as discussed by the authors, learns to detect repeatable keypoints and extract discriminative descriptions via unsupervised contrastive learning under spatial constraints, obtaining competitive results under viewpoint changes and state-of-the-art performance under illumination changes.
Abstract: Learnable keypoint detectors and descriptors are beginning to outperform classical hand-crafted feature extraction methods. Recent studies on self-supervised learning of visual representations have driven the increasing performance of learnable models based on deep networks. By leveraging traditional data augmentations and homography transformations, these networks learn to detect corners under adverse conditions such as extreme illumination changes. However, their generalization capabilities are limited to corner-like features detected a priori by classical methods or synthetically generated data. In this paper, we propose the Correspondence Network (CorrNet) that learns to detect repeatable keypoints and extract discriminative descriptions via unsupervised contrastive learning under spatial constraints. Our experiments show that CorrNet is not only able to detect low-level features such as corners, but also high-level features that represent similar objects present in a pair of input images through our proposed joint guided backpropagation of their latent space. Our approach obtains competitive results under viewpoint changes and achieves state-of-the-art performance under illumination changes.
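CorrNet's exact loss and spatial constraints are not given here; the following is a generic InfoNCE-style contrastive loss (PyTorch) of the kind such descriptor-learning methods use, with the function name and temperature chosen for illustration.

```python
import torch
import torch.nn.functional as F

def contrastive_descriptor_loss(desc_a, desc_b, temperature=0.07):
    """InfoNCE-style loss: row i of desc_a and desc_b describe the same
    keypoint in two views (positive pair); every other row is a negative."""
    a = F.normalize(desc_a, dim=1)                 # (N, D) unit-length descriptors
    b = F.normalize(desc_b, dim=1)
    logits = a @ b.t() / temperature               # (N, N) cosine similarities
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)        # pull diagonal, push off-diagonal
```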

1 citation

Journal ArticleDOI
TL;DR: In this paper, an end-to-end trainable deep learning method for multi-modal retinal image registration is proposed, which extracts convolutional features from the vessel structure for keypoint detection and description and uses a graph neural network for feature matching.
Abstract: In ophthalmological imaging, multiple imaging systems, such as color fundus, infrared, fluorescein angiography, optical coherence tomography (OCT) or OCT angiography, are often involved to make a diagnosis of retinal disease. Multi-modal retinal registration techniques can assist ophthalmologists by providing a pixel-based comparison of aligned vessel structures in images from different modalities or acquisition times. To this end, we propose an end-to-end trainable deep learning method for multi-modal retinal image registration. Our method extracts convolutional features from the vessel structure for keypoint detection and description and uses a graph neural network for feature matching. The keypoint detection and description network and graph neural network are jointly trained in a self-supervised manner using synthetic multi-modal image pairs and are guided by synthetically sampled ground truth homographies. Our method demonstrates higher registration accuracy than competing methods on our synthetic retinal dataset and generalizes well to our real macula dataset and a public fundus dataset.
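As a sketch of the self-supervision signal described above, the snippet below (assuming OpenCV and NumPy) samples a ground-truth homography by jittering the image corners; warping an image with it yields a synthetic training pair. This is an illustrative sampler, not the authors' exact scheme.

```python
import cv2
import numpy as np

def random_homography(h, w, max_shift=0.15, rng=None):
    """Sample a ground-truth homography by randomly perturbing the four
    image corners by up to max_shift of the image size."""
    rng = rng or np.random.default_rng()
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    jitter = rng.uniform(-max_shift, max_shift, (4, 2)) * np.float32([w, h])
    dst = (src + jitter).astype(np.float32)
    return cv2.getPerspectiveTransform(src, dst)

# H = random_homography(*img.shape[:2])
# warped = cv2.warpPerspective(img, H, (img.shape[1], img.shape[0]))
```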

1 citation

Proceedings ArticleDOI
01 Jun 2022
TL;DR: In this paper, a learned point selection module prunes the points in each cluster, taking into account the final pose estimation accuracy, and the features of the selected points are then further compressed using learned quantization.
Abstract: Standard visual localization methods build a priori 3D model of a scene which is used to establish correspondences against the 2D keypoints in a query image. Storing these pre-built 3D scene models can be prohibitively expensive for large-scale environments, especially on mobile devices with limited storage and communication bandwidth. We design a novel framework that compresses a scene while still maintaining localization accuracy. The scene is compressed in three stages: first, the database frames are clustered using pairwise co-visibility information. Then, a learned point selection module prunes the points in each cluster taking into account the final pose estimation accuracy. In the final stage, the features of the selected points are further compressed using learned quantization. Query image registration is done using only the compressed scene points. To the best of our knowledge, we are the first to propose learned scene compression for visual localization. We also demonstrate the effectiveness and efficiency of our method on various outdoor datasets where it can perform accurate localization with low memory consumption.
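The paper's quantization is learned end-to-end; as a classical stand-in that shows where the storage saving comes from, here is a product-quantization encoder (NumPy) that shrinks a D-float descriptor to M one-byte codes. Names and shapes are illustrative assumptions.

```python
import numpy as np

def pq_encode(desc, codebooks):
    """Product quantization: split each D-dim descriptor into M subvectors
    and keep only the index of the nearest centroid per sub-codebook.
    desc: (N, D) float; codebooks: (M, K, D // M) with K <= 256.
    A 128-float descriptor shrinks to M bytes."""
    M, K, d = codebooks.shape
    codes = np.empty((desc.shape[0], M), dtype=np.uint8)
    for m in range(M):
        sub = desc[:, m * d:(m + 1) * d]                         # (N, d) subvectors
        dists = ((sub[:, None, :] - codebooks[m]) ** 2).sum(-1)  # (N, K) distances
        codes[:, m] = dists.argmin(1)                            # nearest centroid id
    return codes
```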

1 citation

References
Journal ArticleDOI
TL;DR: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene and can robustly identify objects among clutter and occlusion while achieving near real-time performance.
Abstract: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.
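For illustration, a compressed sketch of this recognition recipe (assuming OpenCV and NumPy): nearest-neighbor ratio-test matching followed by a geometric consistency check. Lowe's full method clusters matches with a Hough transform before least-squares pose verification; this sketch substitutes a single RANSAC homography check, and the names and thresholds are assumed.

```python
import cv2
import numpy as np

def recognize_object(query, model_img, min_inliers=10):
    """Match SIFT features from a query image against one known object
    and verify that enough matches agree on a single geometric model."""
    sift = cv2.SIFT_create()
    kq, dq = sift.detectAndCompute(query, None)
    km, dm = sift.detectAndCompute(model_img, None)
    pairs = cv2.BFMatcher().knnMatch(dq, dm, k=2)
    good = [m for m, n in pairs if m.distance < 0.75 * n.distance]  # ratio test
    if len(good) < min_inliers:
        return False
    src = np.float32([kq[m.queryIdx].pt for m in good])
    dst = np.float32([km[m.trainIdx].pt for m in good])
    H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)         # verification
    return mask is not None and int(mask.sum()) >= min_inliers
```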

46,906 citations

Journal ArticleDOI
TL;DR: New results are derived on the minimum number of landmarks needed to obtain a solution, and algorithms are presented for computing these minimum-landmark solutions in closed form; these provide the basis for an automatic system that can solve the Location Determination Problem under difficult viewing conditions.
Abstract: A new paradigm, Random Sample Consensus (RANSAC), for fitting a model to experimental data is introduced. RANSAC is capable of interpreting/smoothing data containing a significant percentage of gross errors, and is thus ideally suited for applications in automated image analysis where interpretation is based on the data provided by error-prone feature detectors. A major portion of this paper describes the application of RANSAC to the Location Determination Problem (LDP): Given an image depicting a set of landmarks with known locations, determine that point in space from which the image was obtained. In response to a RANSAC requirement, new results are derived on the minimum number of landmarks needed to obtain a solution, and algorithms are presented for computing these minimum-landmark solutions in closed form. These results provide the basis for an automatic system that can solve the LDP under difficult viewing and analysis conditions.
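A minimal, generic sketch of the RANSAC scheme (NumPy), assuming user-supplied fit and residual functions; the fixed iteration count is a simplification, whereas the paper derives the required number of trials from the expected proportion of gross errors.

```python
import numpy as np

def ransac(data, fit, residual, n_min, thresh, iters=1000, seed=0):
    """Generic RANSAC: repeatedly fit a model to a minimal random sample,
    keep the model explaining the largest consensus (inlier) set, then
    refit on all inliers. data: (N, ...) array; fit(sample) -> model;
    residual(model, data) -> (N,) errors."""
    rng = np.random.default_rng(seed)
    best_model, best_inliers = None, np.zeros(len(data), dtype=bool)
    for _ in range(iters):
        sample = data[rng.choice(len(data), n_min, replace=False)]
        model = fit(sample)
        inliers = residual(model, data) < thresh
        if inliers.sum() > best_inliers.sum():
            best_model, best_inliers = model, inliers
    if best_inliers.any():
        best_model = fit(data[best_inliers])  # final least-squares refit
    return best_model, best_inliers
```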

23,396 citations

Proceedings ArticleDOI
21 Jul 2017
TL;DR: This paper designs a novel type of neural network that directly consumes point clouds, which well respects the permutation invariance of points in the input and provides a unified architecture for applications ranging from object classification, part segmentation, to scene semantic parsing.
Abstract: Point cloud is an important type of geometric data structure. Due to its irregular format, most researchers transform such data to regular 3D voxel grids or collections of images. This, however, renders data unnecessarily voluminous and causes issues. In this paper, we design a novel type of neural network that directly consumes point clouds, which well respects the permutation invariance of points in the input. Our network, named PointNet, provides a unified architecture for applications ranging from object classification, part segmentation, to scene semantic parsing. Though simple, PointNet is highly efficient and effective. Empirically, it shows strong performance on par or even better than state of the art. Theoretically, we provide analysis towards understanding of what the network has learnt and why the network is robust with respect to input perturbation and corruption.
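A minimal PointNet-style classifier (PyTorch) illustrating the core idea: a shared per-point MLP followed by a symmetric max-pool makes the output invariant to point ordering. The input and feature alignment networks (T-Nets) of the full architecture are omitted, and layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Shared per-point MLP + symmetric max-pool over the point dimension,
    so permuting the input points leaves the prediction unchanged."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 256), nn.ReLU(),
        )
        self.head = nn.Linear(256, num_classes)

    def forward(self, pts):                       # pts: (B, N, 3)
        feat = self.mlp(pts)                      # (B, N, 256), weights shared per point
        global_feat = feat.max(dim=1).values      # (B, 256), permutation-invariant
        return self.head(global_feat)
```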

9,457 citations

Proceedings ArticleDOI
06 Nov 2011
TL;DR: This paper proposes a very fast binary descriptor based on BRIEF, called ORB, which is rotation invariant and resistant to noise, and demonstrates through experiments that ORB is two orders of magnitude faster than SIFT while performing as well in many situations.
Abstract: Feature matching is at the base of many computer vision problems, such as object recognition or structure from motion. Current methods rely on costly descriptors for detection and matching. In this paper, we propose a very fast binary descriptor based on BRIEF, called ORB, which is rotation invariant and resistant to noise. We demonstrate through experiments that ORB is two orders of magnitude faster than SIFT, while performing as well in many situations. The efficiency is tested on several real-world applications, including object detection and patch-tracking on a smart phone.
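A short sketch (assuming OpenCV) of ORB matching: binary descriptors are compared with Hamming distance, essentially an XOR plus popcount per pair, which is what makes matching so cheap. The parameter values and cross-check are illustrative choices.

```python
import cv2

def orb_match(img1, img2, n_features=1000):
    """Detect ORB keypoints and brute-force match their binary descriptors
    with Hamming distance; cross-checking keeps only mutual nearest
    neighbours, and matches are returned best-first."""
    orb = cv2.ORB_create(nfeatures=n_features)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    return kp1, kp2, sorted(matcher.match(des1, des2), key=lambda m: m.distance)
```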

8,702 citations

Journal ArticleDOI
TL;DR: A novel dataset captured from a VW station wagon for use in mobile robotics and autonomous driving research, using a variety of sensor modalities such as high-resolution color and grayscale stereo cameras and a high-precision GPS/IMU inertial navigation system.
Abstract: We present a novel dataset captured from a VW station wagon for use in mobile robotics and autonomous driving research. In total, we recorded 6 hours of traffic scenarios at 10-100 Hz using a variety of sensor modalities such as high-resolution color and grayscale stereo cameras, a Velodyne 3D laser scanner and a high-precision GPS/IMU inertial navigation system. The scenarios are diverse, capturing real-world traffic situations, and range from freeways through rural areas to inner-city scenes with many static and dynamic objects. Our data is calibrated, synchronized and timestamped, and we provide the rectified and raw image sequences. Our dataset also contains object labels in the form of 3D tracklets, and we provide online benchmarks for stereo, optical flow, object detection and other tasks. This paper describes our recording platform, the data format and the utilities that we provide.

7,153 citations