Open AccessJournal ArticleDOI

EPnP: An Accurate O(n) Solution to the PnP Problem

- 01 Feb 2009 -

International Journal of Computer Vision

- Vol. 81, Iss: 2, pp 155-166

Chats0

TLDR

A non-iterative solution to the PnP problem—the estimation of the pose of a calibrated camera from n 3D-to-2D point correspondences—whose computational complexity grows linearly with n, which can be done in O(n) time by expressing these coordinates as weighted sum of the eigenvectors of a 12×12 matrix.

Abstract:

We propose a non-iterative solution to the PnP problem--the estimation of the pose of a calibrated camera from n 3D-to-2D point correspondences--whose computational complexity grows linearly with n This is in contrast to state-of-the-art methods that are O(n 5) or even O(n 8), without being more accurate Our method is applicable for all n?4 and handles properly both planar and non-planar configurations Our central idea is to express the n 3D points as a weighted sum of four virtual control points The problem then reduces to estimating the coordinates of these control points in the camera referential, which can be done in O(n) time by expressing these coordinates as weighted sum of the eigenvectors of a 12×12 matrix and solving a small constant number of quadratic equations to pick the right weights Furthermore, if maximal precision is required, the output of the closed-form solution can be used to initialize a Gauss-Newton scheme, which improves accuracy with negligible amount of additional time The advantages of our method are demonstrated by thorough testing on both synthetic and real-data

Content maybe subject to copyright Report

EPnP: Accurate Non-Iterative O(n) Solution to the PnP Problem

Vincent Lepetit Francesc Moreno-Noguer Pascal Fua

Computer Vision Laboratory

École Polytechnique Fédérale de Lausanne (EPFL)

CH-1015 Lausanne, Switzerland

http://cvlab.epfl.ch

Abstract

We propose a non-iterative solution to the PnP problem—the estimation of the pose of a calibrated

camera from n 3D-to-2D point correspondences—whose computational complexity grows linearly with

n. This is in contrast to state-of-the-art methods that are O(n

) or even O(n

), without being more

accurate. Our method is applicable for all n ≥ 4 and handles properly both planar and non-planar

conﬁgurations. Our central idea is to express the n 3D points as a weighted sum of four virtual control

points. The problem then reduces to estimating the coordinates of these control points in the camera

referential, which can be done in O(n) time by expressing these coordinates as weighted sum of the

eigenvectors of a 12 × 12 matrix and solving a small constant number of quadratic equations to pick

the right weights. Furthermore, if maximal precision is required, the output of the closed-form solution

can be used to initialize a Gauss-Newton scheme, which improves accuracy with negligible amount of

additional time. The advantages of our method are demonstrated by thorough testing on both synthetic

and real-data.

Index Terms

Pose estimation, Perspective-n-Point, Absolute orientation.

The Matlab and C++ implementations of the algorithm presented in this paper are available online at

http://cvlab.epfl.ch/software/EPnP/

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

100

rotation error (%)

Gaussian image noise (pixels)

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

100

rotation error (%)

Gaussian image noise (pixels)

Clamped DLT

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

100

rotation error (%)

Gaussian image noise (pixels)

EPnP

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

100

rotation error (%)

Gaussian image noise (pixels)

LHM

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

100

rotation error (%)

Gaussian image noise (pixels)

EPnP+LHM

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

100

rotation error (%)

Gaussian image noise (pixels)

EPnP+GN

Fig. 1. Comparing the accuracy of our method against state-of-the-art ones. We use the boxplot representation: The blue boxes

denote the ﬁrst and third quartiles of the errors, the lines extending from each end of the box depict the statistical extent of the

data, and the crosses indicate observations that fall out of it. Top row. Accuracy of non-iterative methods as a function of noise

when using n =63D-to-2D correspondences: AD is the method of Ansar and Daniilidis [2]; Clamped DLT is the DLT algorithm

after clamping the internal parameters with their known values; and EPnP is our method. Bottom row. Accuracy of iterative

methods using n =6: LHM is Lu et al.’s method [20] initialized with a weak perspective assumption; EPnP+LHM is Lu et

al.’s algorithm initialized with the output of our algorithm; EPnP+GN, our method followed by a Gauss-Newton optimization.

I. INTRODUCTION

The aim of the Perspective-n-Point problem—PnP in short—is to determine the position

and orientation of a camera given its intrinsic parameters and a set of n correspondences

between 3D points and their 2D projections. It has many applications in Computer Vision,

Robotics, Augmented Reality and has received much attention in both the Photogrammetry [21]

and Computer Vision [12] communities. In particular, applications such as feature point-based

camera tracking [27], [18] require dealing with hundreds of noisy feature points in real-time,

which requires computationally efﬁcient methods.

In this paper, we introduce a non-iterative solution with better accuracy and much lower com-

February 18, 2009 DRAFT

putational complexity than non-iterative state-of-the-art methods, and much faster than iterative

ones with little loss of accuracy. Our approach is O(n) for n ≥ 4 whereas all other methods we

know of are either specialized for small ﬁxed values of n, very sensitive to noise, or much slower.

The specialized methods include those designed to solve the P3P problem [9], [24]. Among those

that handle arbitrary values of n [8], [6], [13], [11], [24], [29], [7], [2], [9], the lowest-complexity

one [7] is O(n

) but has been shown to be unstable for noisy 2D locations [2]. This is currently

addressed by algorithms that are O(n

) [24]orevenO(n

) [2] for better accuracy whereas our

O(n) approach achieves even better accuracy and reduced sensitivity to noise, as depicted by

Fig. 1 in the n =6case and demonstrated for larger values of n in the result section.

A natural alternative to non-iterative approaches are iterative ones [19], [5], [14], [17], [20]

that rely on minimizing an appropriate criterion. They can deal with arbitrary numbers of

correspondences and achieve excellent precision when they converge properly. In particular,

Lu et al. [20] introduced a very accurate algorithm, which is fast in comparison with other

iterative ones but slow compared to non-iterative methods. As shown in Fig. 1 and Fig. 2, our

method achieves an accuracy that is almost as good, and is much faster and without requiring

an initial estimate. This is signiﬁcant because iterative methods are prone to failure if poorly

initialized. For instance, Lu et al.’s approach relies on an initial estimation of the camera pose

based on a weak-perspective assumption, which can lead to instabilities when the assumption

is not satisﬁed. This happens when the points of the object are projected onto a small region

on the side of the image and our solution performs more robustly under these circumstances.

Furthermore, if maximal precision is required our output can be used to initialize Lu et al.’s,

yielding both higher stability and faster convergence. Similarly, we can run a Gauss-Newton

scheme that improves our closed-form solution to the point where it is as accurate as the one

produced by Lu et al.’s method when it is initialized by our method. Remarkably, this can be

done with only very little extra computation, which means that even with this extra step, our

method remains much faster. In fact, the optimization is performed in constant time, and hence,

the overall solution still remains O(n).

Our central idea is to write the coordinates of the n 3D points as a weighted sum of four

virtual control points. This reduces the problem to estimating the coordinates of the control points

in the camera referential, which can be done in O(n) time by expressing these coordinates as

weighted sum of the eigenvectors of a 12 × 12 matrix and solving a small constant number of

February 18, 2009 DRAFT

0 50 100 150

0.01

0.02

0.03

0.04

0.05

0.06

0.07

number of points used to estimate pose

computation time (sec)

Clamped DLT

LHM

EPnP

EPnP+LHM

EPnP+Gauss−Newton

Fig. 2. Comparing computation times of our method against the state-of-the-art ones introduced in Fig. 1. The computation

times of a MATLAB implementation on a standard PC, are plotted as a function of the number of correspondences. Our method

is both more accurate—see Fig. 1—and faster than the other non-iterative ones, especially for large amounts of noise, and is

almost as accurate as the iterative LHM. Furthermore, if maximal precision is required, the output of our algorithm can be used

to initialize a Gauss-Newton optimization procedure which requires a negligible amount of additional time.

quadratic equations to pick the right weights. Our approach also extends to planar conﬁgurations,

which cause problems for some methods as discussed in [23], [25], by using three control points

instead of four.

In the remainder of the paper, we ﬁrst discuss related work focusing on accuracy and com-

putational complexity. We then introduce our new formulation and derive our system of linear

and quadratic equations. Finally, we compare our method against the state-of-the-art ones using

synthetic data and demonstrate it using real data. This paper is an expanded version of that in [22],

where a ﬁnal Gauss-Newton optimization is added to the original algorithm. In Section IV we

show that optimizing over a reduced number of parameters, the accuracy of the closed-solution

proposed in [22] is considerably improved with almost no additional computational cost.

II. R

ELATED WORK

There is an immense body of literature on pose estimation from point correspondences and,

here, we focus on non-iterative approaches since our method falls in this category. In addition,

we will also introduce the Lu et al. [20] iterative method, which yields very good results and

against which we compare our own approach.

February 18, 2009 DRAFT

Most of the non-iterative approaches, if not all of them, proceed by ﬁrst estimating the points

3D positions in the camera coordinate system by solving for the points depths. It is then easy to

retrieve the camera position and orientation as the Euclidean motion that aligns these positions

on the given coordinates in the world coordinate system [15], [3], [30].

The P3P case has been extensively studied in the literature, and many closed form solutions

have been proposed such as [6], [8], [9], [11], [24]. It typically involves solving for the roots

of an eight-degree polynomial with only even terms, yielding up to four solutions in general, so

that a fourth point is needed for disambiguation. Fisher and Bolles [8] reduced the P4P problem

to the P3P one by taking subsets of three points and checking consistency. Similarly, Horaud

et al. [13] reduced the P4P to a 3-line problem. For the 4 and 5 points problem, Triggs [29]

derived a system of quadratic polynomials, which solves using multiresultant theory. However,

as pointed out in [2], this does not perform well for larger number of points.

Even if four correspondences are sufﬁcient in general to estimate the pose, it is nonetheless

desirable to consider larger point sets to introduce redundancy and reduce the sensitivity to

noise. To do so, Quan and Lan [24] consider triplets of points and for each one derive four-

degree polynomials in the unknown point depths. The coefﬁcients of these polynomials are then

arranged in a

(n−1)(n−2)

× 5 matrix and singular value decomposition (SVD) is used to estimate

the unknown depths. This method is repeated for all of the n points and therefore involves

O(n

) operations.

It should be noted that, even if it is not done in [24], this complexity could

be reduced to O(n

) by applying the same trick as we do when performing the SVD, but even

then, it would remain slower than our method. Ansar and Daniilidis [2] derive a set of quadratic

equations arranged in a

n(n−1)



n(n+1)



linear system, which, as formulated in the paper,

requires O(n

) operations to be solved. They show their approach performs better than [24].

The complexity of the previous two approaches stems from the fact that quadratic terms

are introduced from the inter-point distances constraints. The linearization of these equations

produces additional parameters, which increase the complexity of the system. Fiore’s method [7]

avoids the need for these constraints: He initially forms a set of linear equations from which the

world to camera rotation and translation parameters are eliminated, allowing the direct recovery

of the point depths without considering the inter-point distances. This procedure allows the

Following [10], we consider that the SVD for a m × n matrix can be computed by a O(4m

n +8mn

+9n

) algorithm.

February 18, 2009 DRAFT

HTML Viewer

Figures

Fig. 4. Effective number N of null singular values in M M. Each vertical bar represents the distributions of N for a total of 300 experiments. On the left, we plot the results for a fixed image noise of σ = 10 pixels and an increasing number of reference points, and the results on the right correspond to a fixed n = 6 number of reference points and increasing level of noise in the 2D projections.

Fig. 5. Non Planar case. Mean and median rotation and translation errors for different experiments.

Fig. 1. Comparing the accuracy of our method against state-of-the-art ones. We use the boxplot representation: The blue boxes denote the first and third quartiles of the errors, the lines extending from each end of the box depict the statistical extent of the data, and the crosses indicate observations that fall out of it. Top row. Accuracy of non-iterative methods as a function of noise when using n = 6 3D-to-2D correspondences: AD is the method of Ansar and Daniilidis [2]; Clamped DLT is the DLT algorithm after clamping the internal parameters with their known values; and EPnP is our method. Bottom row. Accuracy of iterative methods using n = 6: LHM is Lu et al.’s method [20] initialized with a weak perspective assumption; EPnP+LHM is Lu et al.’s algorithm initialized with the output of our algorithm; EPnP+GN, our method followed by a Gauss-Newton optimization.

Fig. 7. Planar case. Errors as a function of the image noise when using n = 6 reference points. For Tilt=0o, there are no pose ambiguities and the results represent the mean over all the experiments. For Tilt=30o, the errors are only averaged among the solutions not affected by the pose ambiguity. The right-most figure represents the number of solutions considered as outliers, which are defined as those for which the average error in the estimated 3D position of the reference points is larger than a threshold.

Fig. 3. Left: Singular values of M M for different focal lengths. Each curve averages 100 synthetic trials. Right: Zooming in on the smallest eigenvalues. For small focal lengths, the camera is perspective and only one eigenvalue is zero, which reflects the scale ambiguity. As the focal length increases and the camera becomes orthographic, all four smallest eigenvalues approach zero.

Fig. 2. Comparing computation times of our method against the state-of-the-art ones introduced in Fig. 1. The computation times of a MATLAB implementation on a standard PC, are plotted as a function of the number of correspondences. Our method is both more accurate—see Fig. 1—and faster than the other non-iterative ones, especially for large amounts of noise, and is almost as accurate as the iterative LHM. Furthermore, if maximal precision is required, the output of our algorithm can be used to initialize a Gauss-Newton optimization procedure which requires a negligible amount of additional time.

Citations

PDF

Open Access

More filters

Proceedings ArticleDOI

ORB: An efficient alternative to SIFT or SURF

Ethan Rublee, +3 more

TL;DR: This paper proposes a very fast binary descriptor based on BRIEF, called ORB, which is rotation invariant and resistant to noise, and demonstrates through experiments how ORB is at two orders of magnitude faster than SIFT, while performing as well in many situations.

...read moreread less

Journal ArticleDOI

ORB-SLAM: A Versatile and Accurate Monocular SLAM System

Raul Mur-Artal, +2 more

- 24 Aug 2015 -

IEEE Transactions on Robotics

TL;DR: ORB-SLAM as discussed by the authors is a feature-based monocular SLAM system that operates in real time, in small and large indoor and outdoor environments, with a survival of the fittest strategy that selects the points and keyframes of the reconstruction.

...read moreread less

Journal ArticleDOI

ORB-SLAM: a Versatile and Accurate Monocular SLAM System

Raul Mur-Artal, +2 more

- 03 Feb 2015 -

arXiv: Robotics

TL;DR: A survival of the fittest strategy that selects the points and keyframes of the reconstruction leads to excellent robustness and generates a compact and trackable map that only grows if the scene content changes, allowing lifelong operation.

...read moreread less

Proceedings ArticleDOI

Structure-from-Motion Revisited

Johannes L. Schonberger, +1 more

TL;DR: This work proposes a new SfM technique that improves upon the state of the art to make a further step towards building a truly general-purpose pipeline.

...read moreread less

Journal ArticleDOI

VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator

Tong Qin, +2 more

- 01 Aug 2018 -

IEEE Transactions on Robotics

TL;DR: In this article, a robust and versatile monocular visual-inertial state estimator is presented, which is the minimum sensor suite (in size, weight, and power) for the metric six degrees of freedom (DOF) state estimation.

...read moreread less

Collapse

References

PDF

Open Access

More filters

Book

Matrix computations

Gene H. Golub

Journal ArticleDOI

Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography

Martin A. Fischler, +1 more

- 01 Jun 1981 -

Communications of The ACM

TL;DR: New results are derived on the minimum number of landmarks needed to obtain a solution, and algorithms are presented for computing these minimum-landmark solutions in closed form that provide the basis for an automatic system that can solve the Location Determination Problem under difficult viewing.

...read moreread less

Book

Multiple view geometry in computer vision

Richard Hartley, +1 more

TL;DR: In this article, the authors provide comprehensive background material and explain how to apply the methods and implement the algorithms directly in a unified framework, including geometric principles and how to represent objects algebraically so they can be computed and applied.

...read moreread less

Multiple View Geometry in Computer Vision.

Bernhard P. Wrobel

TL;DR: This book is referred to read because it is an inspiring book to give you more chance to get experiences and also thoughts and it will show the best book collections and completed collections.

...read moreread less

Journal ArticleDOI

Least-Squares Fitting of Two 3-D Point Sets

K. S. Arun, +2 more

- 01 May 1987 -

IEEE Transactions on Pattern Analysis an...

TL;DR: An algorithm for finding the least-squares solution of R and T, which is based on the singular value decomposition (SVD) of a 3 × 3 matrix, is presented.

...read moreread less

Collapse

Frequently Asked Questions (14)

Q1. What are the future works in "Epnp: accurate non-iterative o(n) solution to the pnp problem" ?

This is what the authors will focus on in future research.

Q2. What contributions have the authors mentioned in the paper "Epnp: accurate non-iterative o(n) solution to the pnp problem" ?

The authors propose a non-iterative solution to the PnP problem—the estimation of the pose of a calibrated camera from n 3D-to-2D point correspondences—whose computational complexity grows linearly with n. Furthermore, if maximal precision is required, the output of the closed-form solution can be used to initialize a Gauss-Newton scheme, which improves accuracy with negligible amount of additional time.

Q3. How do the authors refine the coefficients in Eq. 8?

The authors refine the four values β = [β1, β2, β3, β4] of the coefficients in Eq. 8 by choosing the values that minimize the change in distance between control points.

Q4. How do the authors estimate the unknown depths of the polynomials?

The coefficients of these polynomials are then arranged in a (n−1)(n−2) 2× 5 matrix and singular value decomposition (SVD) is used to estimate the unknown depths.

Q5. What are the unknown parameters of this linear system?

(4)The unknown parameters of this linear system are the 12 control point coordinates { (xcj, y c j , z c j) } j=1,...,4and the n projective parameters {wi}i=1,...,n.

Q6. What is the way to exploit the intrinsic parameters?

A way to exploit their knowledge of the intrinsic parameters is to clamp the retrieved values to the known ones, but the accuracy still remains low.

Q7. How can the authors find the correct value of the ii=1?

Given that the solution can be expressed as a linear combination of the null eigenvectors of M M, finding it amounts to computing the appropriate values for the {βi}i=1,...,N coefficients of Eq. 8.

Q8. How fast is the ambiguity in the camera pose?

For instance, for n = 6 points, their algorithm is about 10 times faster than LHM, and about 200 times faster than AD.2) The Planar Case: Schweighofer and Pinz [25] prove that when the reference points lie on a plane, camera pose suffers from an ambiguity that results in significant instability.

Q9. What is the weighted sum of the null eigenvectors of M?

More precisely, it is a weighted sum of the null eigenvectors of M. Given that the correct linear combination is the one that yields 3D camera coordinates for the control points that preserve their distances, the authors can find the appropriate weights by solving small systems of quadratic equations, which can be done at a negligible computational cost.

Q10. How many smallest eigenvalues do the camera have?

as the focal length increases and the camera becomes closer to being orthographic, all four smallest eigenvalues approach zero.

Q11. How many eigenvalues can be used to compute the distances between control points?

4. In this section, the authors show that the fact that the distances between control points must be preserved can be expressed in terms of a small number of quadratic equations, which can be efficiently solved to compute {βi}i=1,...,N for N = 1, 2, 3 and 4.

Q12. How many points are needed to estimate the pose?

Even if four correspondences are sufficient in general to estimate the pose, it is nonetheless desirable to consider larger point sets to introduce redundancy and reduce the sensitivity to noise.

Q13. How do the authors find the stability of their method?

in practice, the authors have found that the stability of their method is increased by taking the centroid of the reference points as one, and to select the rest in such a way that they form a basis aligned with the principal directions of the data.

Q14. Why did Schweighofer and Pinz omit the EPnP+GN?

The authors omit as well the EPnP+GN, because for the planar case the closed-form solution for the non-ambiguous cases was already very accurate, and the Gauss-Newton optimization could not help to resolve the ambiguity in the rest of cases.

EPnP: An Accurate O(n) Solution to the PnP Problem

Figures

Citations

ORB: An efficient alternative to SIFT or SURF

ORB-SLAM: A Versatile and Accurate Monocular SLAM System

ORB-SLAM: a Versatile and Accurate Monocular SLAM System

Structure-from-Motion Revisited

VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator

References

Matrix computations

Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography

Multiple view geometry in computer vision

Multiple View Geometry in Computer Vision.

Least-Squares Fitting of Two 3-D Point Sets

Related Papers (5)

Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography

Distinctive Image Features from Scale-Invariant Keypoints

Multiple view geometry in computer vision

A flexible new technique for camera calibration

ORB: An efficient alternative to SIFT or SURF

Frequently Asked Questions (14)

Q1. What are the future works in "Epnp: accurate non-iterative o(n) solution to the pnp problem" ?

Q2. What contributions have the authors mentioned in the paper "Epnp: accurate non-iterative o(n) solution to the pnp problem" ?

Q3. How do the authors refine the coefficients in Eq. 8?

Q4. How do the authors estimate the unknown depths of the polynomials?

Q5. What are the unknown parameters of this linear system?

Q6. What is the way to exploit the intrinsic parameters?

Q7. How can the authors find the correct value of the ii=1?

Q8. How fast is the ambiguity in the camera pose?

Q9. What is the weighted sum of the null eigenvectors of M?

Q10. How many smallest eigenvalues do the camera have?

Q11. How many eigenvalues can be used to compute the distances between control points?

Q12. How many points are needed to estimate the pose?

Q13. How do the authors find the stability of their method?

Q14. Why did Schweighofer and Pinz omit the EPnP+GN?