scispace - formally typeset
Open AccessJournal ArticleDOI

EPnP: An Accurate O(n) Solution to the PnP Problem

Reads0
Chats0
TLDR
A non-iterative solution to the PnP problem—the estimation of the pose of a calibrated camera from n 3D-to-2D point correspondences—whose computational complexity grows linearly with n, which can be done in O(n) time by expressing these coordinates as weighted sum of the eigenvectors of a 12×12 matrix.
Abstract
We propose a non-iterative solution to the PnP problem--the estimation of the pose of a calibrated camera from n 3D-to-2D point correspondences--whose computational complexity grows linearly with n This is in contrast to state-of-the-art methods that are O(n 5) or even O(n 8), without being more accurate Our method is applicable for all n?4 and handles properly both planar and non-planar configurations Our central idea is to express the n 3D points as a weighted sum of four virtual control points The problem then reduces to estimating the coordinates of these control points in the camera referential, which can be done in O(n) time by expressing these coordinates as weighted sum of the eigenvectors of a 12×12 matrix and solving a small constant number of quadratic equations to pick the right weights Furthermore, if maximal precision is required, the output of the closed-form solution can be used to initialize a Gauss-Newton scheme, which improves accuracy with negligible amount of additional time The advantages of our method are demonstrated by thorough testing on both synthetic and real-data

read more

Content maybe subject to copyright    Report

EPnP: Accurate Non-Iterative O(n) Solution to the PnP Problem
Vincent Lepetit Francesc Moreno-Noguer Pascal Fua
Computer Vision Laboratory
École Polytechnique Fédérale de Lausanne (EPFL)
CH-1015 Lausanne, Switzerland
http://cvlab.epfl.ch
Abstract
We propose a non-iterative solution to the PnP problem—the estimation of the pose of a calibrated
camera from n 3D-to-2D point correspondences—whose computational complexity grows linearly with
n. This is in contrast to state-of-the-art methods that are O(n
5
) or even O(n
8
), without being more
accurate. Our method is applicable for all n 4 and handles properly both planar and non-planar
configurations. Our central idea is to express the n 3D points as a weighted sum of four virtual control
points. The problem then reduces to estimating the coordinates of these control points in the camera
referential, which can be done in O(n) time by expressing these coordinates as weighted sum of the
eigenvectors of a 12 × 12 matrix and solving a small constant number of quadratic equations to pick
the right weights. Furthermore, if maximal precision is required, the output of the closed-form solution
can be used to initialize a Gauss-Newton scheme, which improves accuracy with negligible amount of
additional time. The advantages of our method are demonstrated by thorough testing on both synthetic
and real-data.
1
Index Terms
Pose estimation, Perspective-n-Point, Absolute orientation.
1
The Matlab and C++ implementations of the algorithm presented in this paper are available online at
http://cvlab.epfl.ch/software/EPnP/

1
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0
20
40
60
80
100
rotation error (%)
Gaussian image noise (pixels)
AD
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0
20
40
60
80
100
rotation error (%)
Gaussian image noise (pixels)
Clamped DLT
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0
20
40
60
80
100
rotation error (%)
Gaussian image noise (pixels)
EPnP
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0
20
40
60
80
100
rotation error (%)
Gaussian image noise (pixels)
LHM
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0
20
40
60
80
100
rotation error (%)
Gaussian image noise (pixels)
EPnP+LHM
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0
20
40
60
80
100
rotation error (%)
Gaussian image noise (pixels)
EPnP+GN
Fig. 1. Comparing the accuracy of our method against state-of-the-art ones. We use the boxplot representation: The blue boxes
denote the first and third quartiles of the errors, the lines extending from each end of the box depict the statistical extent of the
data, and the crosses indicate observations that fall out of it. Top row. Accuracy of non-iterative methods as a function of noise
when using n =63D-to-2D correspondences: AD is the method of Ansar and Daniilidis [2]; Clamped DLT is the DLT algorithm
after clamping the internal parameters with their known values; and EPnP is our method. Bottom row. Accuracy of iterative
methods using n =6: LHM is Lu et al.s method [20] initialized with a weak perspective assumption; EPnP+LHM is Lu et
al.s algorithm initialized with the output of our algorithm; EPnP+GN, our method followed by a Gauss-Newton optimization.
I. INTRODUCTION
The aim of the Perspective-n-Point problem—PnP in short—is to determine the position
and orientation of a camera given its intrinsic parameters and a set of n correspondences
between 3D points and their 2D projections. It has many applications in Computer Vision,
Robotics, Augmented Reality and has received much attention in both the Photogrammetry [21]
and Computer Vision [12] communities. In particular, applications such as feature point-based
camera tracking [27], [18] require dealing with hundreds of noisy feature points in real-time,
which requires computationally efficient methods.
In this paper, we introduce a non-iterative solution with better accuracy and much lower com-
February 18, 2009 DRAFT

2
putational complexity than non-iterative state-of-the-art methods, and much faster than iterative
ones with little loss of accuracy. Our approach is O(n) for n 4 whereas all other methods we
know of are either specialized for small fixed values of n, very sensitive to noise, or much slower.
The specialized methods include those designed to solve the P3P problem [9], [24]. Among those
that handle arbitrary values of n [8], [6], [13], [11], [24], [29], [7], [2], [9], the lowest-complexity
one [7] is O(n
2
) but has been shown to be unstable for noisy 2D locations [2]. This is currently
addressed by algorithms that are O(n
5
) [24]orevenO(n
8
) [2] for better accuracy whereas our
O(n) approach achieves even better accuracy and reduced sensitivity to noise, as depicted by
Fig. 1 in the n =6case and demonstrated for larger values of n in the result section.
A natural alternative to non-iterative approaches are iterative ones [19], [5], [14], [17], [20]
that rely on minimizing an appropriate criterion. They can deal with arbitrary numbers of
correspondences and achieve excellent precision when they converge properly. In particular,
Lu et al. [20] introduced a very accurate algorithm, which is fast in comparison with other
iterative ones but slow compared to non-iterative methods. As shown in Fig. 1 and Fig. 2, our
method achieves an accuracy that is almost as good, and is much faster and without requiring
an initial estimate. This is significant because iterative methods are prone to failure if poorly
initialized. For instance, Lu et al.s approach relies on an initial estimation of the camera pose
based on a weak-perspective assumption, which can lead to instabilities when the assumption
is not satisfied. This happens when the points of the object are projected onto a small region
on the side of the image and our solution performs more robustly under these circumstances.
Furthermore, if maximal precision is required our output can be used to initialize Lu et al.’s,
yielding both higher stability and faster convergence. Similarly, we can run a Gauss-Newton
scheme that improves our closed-form solution to the point where it is as accurate as the one
produced by Lu et al.s method when it is initialized by our method. Remarkably, this can be
done with only very little extra computation, which means that even with this extra step, our
method remains much faster. In fact, the optimization is performed in constant time, and hence,
the overall solution still remains O(n).
Our central idea is to write the coordinates of the n 3D points as a weighted sum of four
virtual control points. This reduces the problem to estimating the coordinates of the control points
in the camera referential, which can be done in O(n) time by expressing these coordinates as
weighted sum of the eigenvectors of a 12 × 12 matrix and solving a small constant number of
February 18, 2009 DRAFT

3
0 50 100 150
0
0.01
0.02
0.03
0.04
0.05
0.06
0.07
number of points used to estimate pose
computation time (sec)
AD
Clamped DLT
LHM
EPnP
EPnP+LHM
EPnP+Gauss−Newton
Fig. 2. Comparing computation times of our method against the state-of-the-art ones introduced in Fig. 1. The computation
times of a MATLAB implementation on a standard PC, are plotted as a function of the number of correspondences. Our method
is both more accurate—see Fig. 1—and faster than the other non-iterative ones, especially for large amounts of noise, and is
almost as accurate as the iterative LHM. Furthermore, if maximal precision is required, the output of our algorithm can be used
to initialize a Gauss-Newton optimization procedure which requires a negligible amount of additional time.
quadratic equations to pick the right weights. Our approach also extends to planar configurations,
which cause problems for some methods as discussed in [23], [25], by using three control points
instead of four.
In the remainder of the paper, we first discuss related work focusing on accuracy and com-
putational complexity. We then introduce our new formulation and derive our system of linear
and quadratic equations. Finally, we compare our method against the state-of-the-art ones using
synthetic data and demonstrate it using real data. This paper is an expanded version of that in [22],
where a final Gauss-Newton optimization is added to the original algorithm. In Section IV we
show that optimizing over a reduced number of parameters, the accuracy of the closed-solution
proposed in [22] is considerably improved with almost no additional computational cost.
II. R
ELATED WORK
There is an immense body of literature on pose estimation from point correspondences and,
here, we focus on non-iterative approaches since our method falls in this category. In addition,
we will also introduce the Lu et al. [20] iterative method, which yields very good results and
against which we compare our own approach.
February 18, 2009 DRAFT

4
Most of the non-iterative approaches, if not all of them, proceed by first estimating the points
3D positions in the camera coordinate system by solving for the points depths. It is then easy to
retrieve the camera position and orientation as the Euclidean motion that aligns these positions
on the given coordinates in the world coordinate system [15], [3], [30].
The P3P case has been extensively studied in the literature, and many closed form solutions
have been proposed such as [6], [8], [9], [11], [24]. It typically involves solving for the roots
of an eight-degree polynomial with only even terms, yielding up to four solutions in general, so
that a fourth point is needed for disambiguation. Fisher and Bolles [8] reduced the P4P problem
to the P3P one by taking subsets of three points and checking consistency. Similarly, Horaud
et al. [13] reduced the P4P to a 3-line problem. For the 4 and 5 points problem, Triggs [29]
derived a system of quadratic polynomials, which solves using multiresultant theory. However,
as pointed out in [2], this does not perform well for larger number of points.
Even if four correspondences are sufficient in general to estimate the pose, it is nonetheless
desirable to consider larger point sets to introduce redundancy and reduce the sensitivity to
noise. To do so, Quan and Lan [24] consider triplets of points and for each one derive four-
degree polynomials in the unknown point depths. The coefficients of these polynomials are then
arranged in a
(n1)(n2)
2
× 5 matrix and singular value decomposition (SVD) is used to estimate
the unknown depths. This method is repeated for all of the n points and therefore involves
O(n
5
) operations.
2
It should be noted that, even if it is not done in [24], this complexity could
be reduced to O(n
3
) by applying the same trick as we do when performing the SVD, but even
then, it would remain slower than our method. Ansar and Daniilidis [2] derive a set of quadratic
equations arranged in a
n(n1)
2
×
n(n+1)
2
+1
linear system, which, as formulated in the paper,
requires O(n
8
) operations to be solved. They show their approach performs better than [24].
The complexity of the previous two approaches stems from the fact that quadratic terms
are introduced from the inter-point distances constraints. The linearization of these equations
produces additional parameters, which increase the complexity of the system. Fiore’s method [7]
avoids the need for these constraints: He initially forms a set of linear equations from which the
world to camera rotation and translation parameters are eliminated, allowing the direct recovery
of the point depths without considering the inter-point distances. This procedure allows the
2
Following [10], we consider that the SVD for a m × n matrix can be computed by a O(4m
2
n +8mn
2
+9n
3
) algorithm.
February 18, 2009 DRAFT

Figures
Citations
More filters
Proceedings ArticleDOI

ORB: An efficient alternative to SIFT or SURF

TL;DR: This paper proposes a very fast binary descriptor based on BRIEF, called ORB, which is rotation invariant and resistant to noise, and demonstrates through experiments how ORB is at two orders of magnitude faster than SIFT, while performing as well in many situations.
Journal ArticleDOI

ORB-SLAM: A Versatile and Accurate Monocular SLAM System

TL;DR: ORB-SLAM as discussed by the authors is a feature-based monocular SLAM system that operates in real time, in small and large indoor and outdoor environments, with a survival of the fittest strategy that selects the points and keyframes of the reconstruction.
Journal ArticleDOI

ORB-SLAM: a Versatile and Accurate Monocular SLAM System

TL;DR: A survival of the fittest strategy that selects the points and keyframes of the reconstruction leads to excellent robustness and generates a compact and trackable map that only grows if the scene content changes, allowing lifelong operation.
Proceedings ArticleDOI

Structure-from-Motion Revisited

TL;DR: This work proposes a new SfM technique that improves upon the state of the art to make a further step towards building a truly general-purpose pipeline.
Journal ArticleDOI

VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator

TL;DR: In this article, a robust and versatile monocular visual-inertial state estimator is presented, which is the minimum sensor suite (in size, weight, and power) for the metric six degrees of freedom (DOF) state estimation.
References
More filters
Book

Matrix computations

Gene H. Golub
Journal ArticleDOI

Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography

TL;DR: New results are derived on the minimum number of landmarks needed to obtain a solution, and algorithms are presented for computing these minimum-landmark solutions in closed form that provide the basis for an automatic system that can solve the Location Determination Problem under difficult viewing.
Book

Multiple view geometry in computer vision

TL;DR: In this article, the authors provide comprehensive background material and explain how to apply the methods and implement the algorithms directly in a unified framework, including geometric principles and how to represent objects algebraically so they can be computed and applied.

Multiple View Geometry in Computer Vision.

TL;DR: This book is referred to read because it is an inspiring book to give you more chance to get experiences and also thoughts and it will show the best book collections and completed collections.
Journal ArticleDOI

Least-Squares Fitting of Two 3-D Point Sets

TL;DR: An algorithm for finding the least-squares solution of R and T, which is based on the singular value decomposition (SVD) of a 3 × 3 matrix, is presented.
Related Papers (5)
Frequently Asked Questions (14)
Q1. What are the future works in "Epnp: accurate non-iterative o(n) solution to the pnp problem" ?

This is what the authors will focus on in future research. 

The authors propose a non-iterative solution to the PnP problem—the estimation of the pose of a calibrated camera from n 3D-to-2D point correspondences—whose computational complexity grows linearly with n. Furthermore, if maximal precision is required, the output of the closed-form solution can be used to initialize a Gauss-Newton scheme, which improves accuracy with negligible amount of additional time. 

The authors refine the four values β = [β1, β2, β3, β4] of the coefficients in Eq. 8 by choosing the values that minimize the change in distance between control points. 

The coefficients of these polynomials are then arranged in a (n−1)(n−2) 2× 5 matrix and singular value decomposition (SVD) is used to estimate the unknown depths. 

(4)The unknown parameters of this linear system are the 12 control point coordinates { (xcj, y c j , z c j) } j=1,...,4and the n projective parameters {wi}i=1,...,n. 

A way to exploit their knowledge of the intrinsic parameters is to clamp the retrieved values to the known ones, but the accuracy still remains low. 

Given that the solution can be expressed as a linear combination of the null eigenvectors of M M, finding it amounts to computing the appropriate values for the {βi}i=1,...,N coefficients of Eq. 8. 

For instance, for n = 6 points, their algorithm is about 10 times faster than LHM, and about 200 times faster than AD.2) The Planar Case: Schweighofer and Pinz [25] prove that when the reference points lie on a plane, camera pose suffers from an ambiguity that results in significant instability. 

More precisely, it is a weighted sum of the null eigenvectors of M. Given that the correct linear combination is the one that yields 3D camera coordinates for the control points that preserve their distances, the authors can find the appropriate weights by solving small systems of quadratic equations, which can be done at a negligible computational cost. 

as the focal length increases and the camera becomes closer to being orthographic, all four smallest eigenvalues approach zero. 

4. In this section, the authors show that the fact that the distances between control points must be preserved can be expressed in terms of a small number of quadratic equations, which can be efficiently solved to compute {βi}i=1,...,N for N = 1, 2, 3 and 4. 

Even if four correspondences are sufficient in general to estimate the pose, it is nonetheless desirable to consider larger point sets to introduce redundancy and reduce the sensitivity to noise. 

in practice, the authors have found that the stability of their method is increased by taking the centroid of the reference points as one, and to select the rest in such a way that they form a basis aligned with the principal directions of the data. 

The authors omit as well the EPnP+GN, because for the planar case the closed-form solution for the non-ambiguous cases was already very accurate, and the Gauss-Newton optimization could not help to resolve the ambiguity in the rest of cases.