To cite this version:
Eric Marchand, Hideaki Uchiyama, Fabien Spindler. Pose Estimation for Augmented Reality: A Hands-On Survey. IEEE Transactions on Visualization and Computer Graphics, Institute of Electrical and Electronics Engineers, 2016, 22 (12), pp. 2633-2651. DOI: 10.1109/TVCG.2015.2513408. HAL: hal-01246370.

Pose estimation for augmented reality:
a hands-on survey
Eric Marchand, Hideaki Uchiyama and Fabien Spindler

E. Marchand is with Université de Rennes 1, IRISA, Inria Rennes-Bretagne Atlantique, Rennes, France. E-mail: Eric.Marchand@irisa.fr
H. Uchiyama is with Kyushu University, Japan.
F. Spindler is with Inria Rennes-Bretagne Atlantique, Rennes, France.
Abstract—Augmented reality (AR) allows virtual objects to be seamlessly inserted into an image sequence. In order to accomplish this goal, it is important that synthetic elements are rendered and aligned in the scene in an accurate and visually acceptable way. The solution of this problem can be related to a pose estimation or, equivalently, a camera localization process. This paper aims at presenting a brief but almost self-contained introduction to the most important approaches dedicated to vision-based camera localization, along with a survey of several extensions proposed in recent years. For most of the presented approaches, we also provide links to the code of short examples. This should allow readers to easily bridge the gap between theoretical aspects and practical implementations.
Index Terms—Survey, augmented reality, vision-based camera localization, pose estimation, PnP, SLAM, motion estimation, homography, keypoint matching, code examples.
1 INTRODUCTION
Augmented reality (AR) allows virtual objects to be seamlessly inserted into an image sequence. A widely acknowledged definition of augmented reality is due to Azuma in the first survey dedicated to the subject [7]: an AR system should combine real and virtual objects, be interactive in real time, and register real and virtual objects. It has to be noted that this definition does not focus on specific technologies for localization and visualization. Back in 1997, registration was considered "one of the most basic problems currently limiting augmented reality" [7].
Pose estimation: a "basic problem" for augmented reality.
AR is intrinsically a multidisciplinary and long-standing research area. It is clear that real and virtual world registration issues have received a large amount of interest. From a broader point of view, this is a motion tracking issue. To achieve this task, many sensors have been considered: mechanical devices, ultrasonic devices, magnetic sensors, inertial devices, GPS, compass, and obviously, optical sensors [146]. To paraphrase [146], there was no silver bullet to solve this problem, but vision-based techniques rapidly emerged.
Indeed, with respect to other sensors, a camera combined with a display is an appealing configuration. As pointed out in [9], such a setup provides vision-based feedback that makes it possible to effectively close the loop between the localization process and the display. This also reduces the need for heavy calibration procedures. Nevertheless, when Azuma's survey [7] was published, only a few vision-based techniques meeting his definition existed.
Until the early 2000s, almost all vision-based registration techniques relied on markers. Then various markerless approaches quickly emerged in the literature. On the one hand, markerless model-based tracking techniques clearly improve on (but are in line with) marker-based methods. On the other hand, with the ability to easily match keypoints such as SIFT, and the now well-established knowledge of multi-view geometry, new approaches based on an image model and on the estimation of the displacement of the camera [122] arose. Finally, the late 2000s saw the introduction of keyframe-based Simultaneous Localization and Mapping (SLAM) [57] that, as a sequel of structure-from-motion approaches (widely used in off-line compositing for the movie industry), makes it possible to get rid of a model of the scene.
Although vision-based registration is still a difficult problem, mature solutions may now be proposed to end-users, and real-world or industrial applications can be foreseen (if not already seen). Meanwhile, many open source software libraries (OpenCV, ViSP, Vuforia,...) and commercial SDKs (Metaio (now with Apple), Wikitude, AugmentedPro, Diotasoft,...) have been released, providing developers with easy-to-use interfaces and efficient registration processes. This allows fast prototyping of AR systems.
Rationale.
Unfortunately, when using such libraries, end-users may largely consider the underlying technologies and methodological aspects as black boxes. Our goal is therefore to present, in the remainder of the paper, a brief but almost self-contained introduction to the most important approaches dedicated to camera localization, along with a survey of the extensions that have been proposed in recent years. We also try to link these methodological concepts to the main libraries and SDKs available on the market.
The aim of this paper is then to provide researchers and practitioners with an almost comprehensive and consolidated introduction to effective tools for facilitating research in augmented reality. It is also dedicated to academics involved in teaching augmented reality at the undergraduate and graduate levels. For most of the presented approaches, we also provide links to the code of short examples. This should allow readers to easily bridge the gap between theoretical aspects and practice. These examples have been written using both OpenCV and the ViSP library [79] developed at Inria.

Choices have to be made.
A comprehensive description of all the existing vision-based localization techniques used in AR is, at least in a journal paper, out of reach, and choices have to be made. For example, we disregard Bayesian frameworks (Extended Kalman Filter). Although such methods were widely used in the early 2000s, it appears that the EKF is less and less used nowadays in favor of deterministic approaches (to mitigate this assertion, it is acknowledged that they are still useful when considering sensor fusion). Not considering display technologies (e.g., optical see-through HMDs), we also disregard eye/head/display calibration issues. As pointed out in [146], many other sensors exist and can be jointly used with cameras. We acknowledge that this provides robustness to the localization process. Nevertheless, as stated, we clearly focus in this paper only on the image-based pose estimation process.
Related work.
In the past, two surveys related to AR (in general) have been published, in 1997 [7] and 2001 [8]. These surveys were complemented in 2008 by an analysis of 10 years of publications in ISMAR [151]. Demonstrating the interest in vision-based localization, it appears that more than 20% of the papers are related to "tracking" and hence to vision-based registration (and they are also among the most cited papers). In [146] the use of other sensors and hybrid systems is explored. Dealing more precisely with 3D tracking, a short monograph was proposed in [65].
To help students, engineers, and researchers pursue further research and development in this very active research area, we explain and discuss the various classes of approaches that have been considered in the literature and that we found important for vision-based AR. We hope this article will be accessible and interesting to experts and students alike.
2 OVERVIEW OF THE PROBLEM
The goal of augmented reality is to insert virtual information into the real world, providing the end-user with additional knowledge about the scene. The added information, usually virtual objects, must be precisely aligned with the real world. Figure 1 shows how these two worlds can be combined into a single and coherent image.
Fig. 1. AR principle and the considered coordinate systems: to achieve a coherent compositing, the computer graphics (CG) camera and the real one should be located at the very same position and have the same parameters.
From the real world side, we have the scene and the camera. Let us denote $\mathcal{F}_c$ the camera frame and $\mathcal{F}_w$ the scene frame (or world frame). On the virtual side, we have a virtual world with various virtual objects whose positions are expressed in the virtual world frame $\mathcal{F}_{CGw}$ (computer graphics (CG) frame). To render the virtual scene, a virtual (CG) camera is added to the system. Let us denote $\mathcal{F}_{CGc}$ the virtual camera frame. For simplicity and without loss of generality, let us assume that the world frame and the virtual world frame are the same ($\mathcal{F}_{CGw} = \mathcal{F}_w$). To create an image of the virtual world that is consistent with the real camera's current view, the CG camera and the real one should be located at the very same position and have the same parameters (focal length, viewing angle, etc.). Once the real and CG cameras are perfectly aligned, a compositing step simply provides the resulting augmented image.
Within this process, the only unknown is the real camera position in the world frame (we denote ${}^c\mathbf{T}_w$ the transformation that fully defines the position of $\mathcal{F}_w$ with respect to $\mathcal{F}_c$). Vision-based AR is thus restricted to a camera pose estimation problem. Any error in the estimation of the camera position in the world reference frame appears to the user as an inconsistency.
Pose estimation is a problem which found its origin in photogrammetry, where it is known as space resection. A simple definition could be: "given a set of correspondences between 3D features and their projections in the image plane, pose estimation consists in computing the position and orientation of the camera". There are many ways to present the solutions to this inverse problem. We made the choice to divide the paper according to the available data: do we have 3D models (or can we acquire them?), or do we restrict ourselves to planar scenes? The paper is then organized as follows:
• In Section 3, we chose to consider first the general case where 3D models are available or can be built on-line. We first review in Section 3.1 the solutions based on classical pose estimation methods (known as PnP). We then show in Section 3.2 a generalization of the previous method to handle far more complex 3D models. When 3D models are not available a priori, they can be estimated on-line thanks to Simultaneous Localization and Mapping (SLAM) techniques (see Section 3.3). Finally, when 3D data can be directly measured, registration with the 3D model can be done directly in 3D space. This is the objective of Section 3.4.
• It appears that the problem can be greatly simplified when the scene is planar. This is the subject of Section 4. In that case, pose estimation can be handled as a camera motion estimation process.
• From a practical point of view, the development of actual AR applications raises the question of feature extraction and of matching between image features. This issue is discussed in Section 5.
Overall, whatever the method chosen, it will be seen that pose estimation is an optimization problem. The quality of the estimated pose is highly dependent on the quality of the measurements. We therefore also introduce, in Section 3.1.3, robust estimation processes able to deal with spurious data (outliers), which is fundamental for real-life applications.
3 POSE ESTIMATION RELYING ON A 3D MODEL
In this section we assume that a 3D model of the scene is available
or can be estimated on-line. As stated in the previous section, the
pose should be estimated knowing the correspondences between

2D measurements in the image and 3D features of the model. It is first necessary to properly state the problem. We will consider here that these features are 3D points and their 2D projections (as pixels) in the image.
Let us denote $\mathcal{F}_c$ the camera frame and ${}^c\mathbf{T}_w$ the transformation that fully defines the position of $\mathcal{F}_w$ with respect to $\mathcal{F}_c$ (see Figure 2). ${}^c\mathbf{T}_w$ is a homogeneous matrix defined such that:

$$
{}^c\mathbf{T}_w = \begin{pmatrix} {}^c\mathbf{R}_w & {}^c\mathbf{t}_w \\ \mathbf{0}_{3\times 1}^\top & 1 \end{pmatrix} \qquad (1)
$$

where ${}^c\mathbf{R}_w$ and ${}^c\mathbf{t}_w$ are the rotation matrix and translation vector that define the position of the camera in the world frame (note that ${}^c\mathbf{R}_w$, being a rotation matrix, should respect the orthogonality constraints).
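To make equation (1) concrete, the short NumPy/OpenCV sketch below (ours, not taken from the paper's supplementary code) builds ${}^c\mathbf{T}_w$ from an arbitrary axis-angle rotation and translation, and checks the orthogonality constraint mentioned above; all numerical values are placeholders.

```python
# Minimal sketch (assumed values): building the homogeneous matrix cTw of eq. (1).
import numpy as np
import cv2

theta_u = np.array([0.1, -0.2, 0.05])    # axis-angle rotation (rad), arbitrary
c_t_w   = np.array([0.02, -0.05, 0.80])  # translation of Fw expressed in Fc, arbitrary

c_R_w, _ = cv2.Rodrigues(theta_u)        # 3x3 rotation matrix

c_T_w = np.eye(4)
c_T_w[:3, :3] = c_R_w
c_T_w[:3, 3]  = c_t_w

# cRw must satisfy the orthogonality constraints of a rotation matrix
assert np.allclose(c_R_w @ c_R_w.T, np.eye(3), atol=1e-9)
assert np.isclose(np.linalg.det(c_R_w), 1.0)
```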
Fig. 2. Rigid transformation ${}^c\mathbf{T}_w$ between the world frame $\mathcal{F}_w$ and the camera frame $\mathcal{F}_c$, and perspective projection.
The perspective projection $\bar{\mathbf{x}} = (u, v, 1)^\top$ of a point ${}^w\mathbf{X} = ({}^wX, {}^wY, {}^wZ, 1)^\top$ is given by (see Figure 2):

$$
\bar{\mathbf{x}} = \mathbf{K}\,\boldsymbol{\Pi}\,{}^c\mathbf{T}_w\,{}^w\mathbf{X} \qquad (2)
$$

where $\bar{\mathbf{x}}$ are the coordinates, expressed in pixels, of the point in the image; $\mathbf{K}$ is the camera intrinsic parameter matrix, defined by:

$$
\mathbf{K} = \begin{pmatrix} p_x & 0 & u_0 \\ 0 & p_y & v_0 \\ 0 & 0 & 1 \end{pmatrix}
$$

where $(u_0, v_0, 1)^\top$ are the coordinates of the principal point (the intersection of the optical axis with the image plane) and $p_x$ (resp. $p_y$) is the ratio between the focal length of the lens $f$ and the size of a pixel $l_x$: $p_x = f / l_x$ (resp. $p_y = f / l_y$, $l_y$ being the height of a pixel). The projection matrix $\boldsymbol{\Pi}$ is given, in the case of a perspective projection model, by:

$$
\boldsymbol{\Pi} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}
$$

The intrinsic parameters can be easily obtained through an off-line calibration step (e.g. [20], [149]). Therefore, when considering the AR problem, we shall consider image coordinates expressed in the normalized metric space $\mathbf{x} = \mathbf{K}^{-1}\bar{\mathbf{x}}$. Let us note that we consider here only a pure perspective projection model, but it is clear that any model with distortion can easily be considered and handled. From now on, we will always consider that the camera is calibrated and that the coordinates are expressed in the normalized space.
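As an illustration of the projection model (2) and of the normalization $\mathbf{x} = \mathbf{K}^{-1}\bar{\mathbf{x}}$, the following sketch (ours, with arbitrary intrinsic values and an arbitrary pose) projects a 3D point to pixel coordinates and converts them back to the normalized metric space.

```python
# Minimal sketch of eq. (2) and of the normalization x = K^-1 x_bar.
# Intrinsics (px, py, u0, v0) and the pose are placeholders for illustration only.
import numpy as np

K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
Pi = np.hstack([np.eye(3), np.zeros((3, 1))])   # perspective projection matrix

c_T_w = np.eye(4)                               # toy pose: small translation along z
c_T_w[:3, 3] = [0.0, 0.0, 0.5]

wX = np.array([0.1, -0.05, 2.0, 1.0])           # 3D point in Fw (homogeneous)

x_bar = K @ Pi @ c_T_w @ wX                     # pixel coordinates (homogeneous)
x_bar = x_bar / x_bar[2]                        # (u, v, 1)

x = np.linalg.inv(K) @ x_bar                    # normalized metric coordinates (x, y, 1)
print(x_bar[:2], x[:2])
```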
If we have $N$ points ${}^w\mathbf{X}_i$, $i = 1..N$, whose coordinates expressed in $\mathcal{F}_w$ are given by ${}^w\mathbf{X}_i = ({}^wX_i, {}^wY_i, {}^wZ_i, 1)^\top$, the projection $\mathbf{x}_i = (x_i, y_i, 1)^\top$ of these points in the image plane is then given by:

$$
\mathbf{x}_i = \boldsymbol{\Pi}\,{}^c\mathbf{T}_w\,{}^w\mathbf{X}_i. \qquad (3)
$$

Knowing the 2D-3D point correspondences, $\mathbf{x}_i$ and ${}^w\mathbf{X}_i$, pose estimation consists in solving the system given by the set of equations (3) for ${}^c\mathbf{T}_w$. This is an inverse problem that is known as the Perspective from N Points problem or PnP (Perspective-n-Point).
3.1 Pose estimation from a known 3D model
In this paragraph, we review methods allowing to solve the set of equations (3) for the pose ${}^c\mathbf{T}_w$. Among various solutions, we explain in more depth two classical algorithms widely considered in augmented reality: one method that does not require any initialization of the pose (the Direct Linear Transform) and a method based on a gradient approach that needs an initial pose but which can be considered the "gold standard" solution [48]. We will also discuss more complex, but also more efficient, solutions to the pose estimation issue. The optimization procedure in the presence of spurious data (outliers) is also considered. In each case, a comprehensive description of each method is given.
3.1.1 P3P: solving pose estimation with the smallest subset of
correspondences
P3P is an important and old problem for which many solutions
have been proposed. Theoretically, since the pose can be rep-
resented by six independent parameters, three points should be
sufficient to solve this problem.
Most P3P approaches rely on a two-step solution. First, an estimation of the unknown depth ${}^cZ_i$ of each point (in the camera frame) is obtained thanks to constraints (law of cosines) given by the triangle $C\mathbf{X}_i\mathbf{X}_j$, for which the distance between $\mathbf{X}_i$ and $\mathbf{X}_j$ and the angle between the two directions $C\mathbf{X}_i$ and $C\mathbf{X}_j$ are known and measured. The estimation of the point depths is usually done by solving a fourth-order polynomial equation [39] [105] [41] [5]. Once the coordinates of the three points are known in the camera frame, the second step consists in estimating the rigid transformation ${}^c\mathbf{T}_w$ that maps the coordinates expressed in the camera frame to the coordinates expressed in the world frame (3D-3D registration, see Section 3.4). The rotation, represented by quaternions, can be obtained using a closed-form solution [49]. Alternatively, a least-squares solution that uses the Singular Value Decomposition (SVD) [5] can also be considered. Since a fourth-order polynomial equation has to be solved, the problem features up to four possible solutions. It is then necessary to have at least a fourth point to disambiguate the obtained results [39] [48].
More recently, Kneip et al. [62] proposed a novel closed-form solution that directly computes the rigid transformation between the camera and world frames, ${}^c\mathbf{T}_w$. This is made possible by first introducing a new intermediate camera frame centered in $C$ whose $x$ axis is aligned with the direction of the first point $\mathbf{X}_1$, and secondly a new world frame centered in $\mathbf{X}_1$ whose $x$ axis is aligned with the direction of the second point $\mathbf{X}_2$. Their relative position and orientation can be represented using only two parameters. These parameters can then be computed by solving a fourth-order polynomial equation. A final substitution allows computing ${}^c\mathbf{T}_w$. The proposed algorithm is much faster than the other solutions since it avoids both the estimation of the 3D point depths in the camera frame and the 3D-3D registration step. Kneip's P3P implementation is available in OpenGV [59].
Although P3P is a well-known solution to the pose estimation problem, other PnP approaches that use more points (n > 3) have usually been preferred. Indeed, pose accuracy usually increases with the number of points. Nevertheless, within an outlier rejection process such as RANSAC, being fast to compute and requiring only three point correspondences, a fast P3P such as [59] is the solution to choose (see Section 3.1.3). P3P is also an interesting solution to bootstrap a non-linear optimization process that minimizes the reprojection error, as will be seen in Section 3.1.2.
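As an illustration of this strategy (a fast minimal solver used inside RANSAC), the sketch below (ours, not from the paper) relies on OpenCV's solvePnPRansac with its P3P solver; the correspondences are synthetic and a few of them are deliberately corrupted to play the role of outliers.

```python
# Sketch (assumed data): P3P inside RANSAC to reject outliers, via OpenCV.
import numpy as np
import cv2

rng = np.random.default_rng(0)
K = np.array([[800, 0, 320], [0, 800, 240], [0, 0, 1]], dtype=np.float64)

# Synthetic scene: 3D points in front of the camera, projected with a known pose
object_points = rng.uniform([-0.5, -0.5, 2.0], [0.5, 0.5, 4.0], size=(30, 3))
rvec_gt = np.array([0.1, -0.2, 0.05])
tvec_gt = np.array([0.02, -0.05, 0.3])
image_points, _ = cv2.projectPoints(object_points, rvec_gt, tvec_gt, K, None)
image_points = image_points.reshape(-1, 2)
image_points[:5] += rng.uniform(30, 60, size=(5, 2))   # corrupt a few matches (outliers)

ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    object_points, image_points, K, None,
    flags=cv2.SOLVEPNP_P3P,        # minimal P3P solver on each random sample
    reprojectionError=2.0,         # inlier threshold (pixels)
    iterationsCount=100)
if ok:
    print("pose (axis-angle, translation):", rvec.ravel(), tvec.ravel())
    print("number of inliers:", None if inliers is None else len(inliers))
```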
3.1.2 PnP: pose estimation from N point correspondences
PnP is an over-constrained and generic solution to the pose estimation problem from 2D-3D point correspondences. Here again, as for P3P, one can consider multi-stage methods that estimate the coordinates of the points [105] or of virtual points [67] in the camera frame and then achieve a 3D-3D registration process [105]. On the other hand, direct or one-stage minimization approaches have been proposed.
Among the former, [105] extended their P3P algorithm to P4P, P5P and finally to PnP. In the EPnP approach [67], the 3D point coordinates are expressed as a weighted sum of four virtual control points. The pose problem is then reduced to the estimation of the coordinates of these control points in the camera frame. The main advantage of this latter approach is its reduced computational complexity, which is linear with respect to the number of points.
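EPnP is directly exposed in OpenCV; a brief usage sketch (ours), reusing the synthetic object_points, image_points and K of the previous sketch, is given below.

```python
# Sketch (assumed inputs from the previous example): EPnP via OpenCV.
import cv2

ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, None,
                              flags=cv2.SOLVEPNP_EPNP)
# rvec (axis-angle) and tvec define the estimated pose; they can then be refined by the
# non-linear minimization of the reprojection error described below.
```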
Within the latter one-step approaches, the Direct Linear Transform (DLT) is certainly the oldest one [48], [129]. Although not very accurate, this solution and its sequels have historically been widely considered in AR applications. PnP is intrinsically a non-linear problem; nevertheless, a solution relying on the resolution of a linear system can be considered. It consists in solving the homogeneous linear system built from equations (3) for the 12 parameters of the matrix ${}^c\mathbf{T}_w$. Indeed, considering that the homogeneous matrix to be estimated is defined by:

$$
{}^c\mathbf{T}_w = \begin{pmatrix} \mathbf{r}_1 & t_x \\ \mathbf{r}_2 & t_y \\ \mathbf{r}_3 & t_z \\ \mathbf{0}_{3\times 1}^\top & 1 \end{pmatrix}
$$

where $\mathbf{r}_1$, $\mathbf{r}_2$ and $\mathbf{r}_3$ are the rows of the rotation matrix ${}^c\mathbf{R}_w$ and ${}^c\mathbf{t}_w = (t_x, t_y, t_z)$, developing (3) leads to the system:

$$
\mathbf{A}\mathbf{h} = \begin{pmatrix} \vdots \\ \mathbf{A}_i \\ \vdots \end{pmatrix} \mathbf{h} = \mathbf{0} \qquad (4)
$$

with $\mathbf{A}_i$ given by [129]:

$$
\mathbf{A}_i = \begin{pmatrix}
{}^wX_i & {}^wY_i & {}^wZ_i & 1 & 0 & 0 & 0 & 0 & -x_i{}^wX_i & -x_i{}^wY_i & -x_i{}^wZ_i & -x_i \\
0 & 0 & 0 & 0 & {}^wX_i & {}^wY_i & {}^wZ_i & 1 & -y_i{}^wX_i & -y_i{}^wY_i & -y_i{}^wZ_i & -y_i
\end{pmatrix} \qquad (5)
$$

and

$$
\mathbf{h} = \left(\mathbf{r}_1, t_x, \mathbf{r}_2, t_y, \mathbf{r}_3, t_z\right)^\top
$$

is a vector representation of ${}^c\mathbf{T}_w$. The solution of this homogeneous system is the eigenvector of $\mathbf{A}^\top\mathbf{A}$ corresponding to its minimal eigenvalue, computed in practice through a Singular Value Decomposition of $\mathbf{A}$. An orthonormalization of the obtained rotation matrix is then necessary (the source code of the DLT algorithm is proposed as supplementary material of this paper).
Obviously and unfortunately, being over-parameterized, this solution is very sensitive to noise, and a solution that explicitly considers the non-linear constraints of the system should be preferred.
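The paper provides its own DLT code as supplementary material; the compact NumPy sketch below is our own illustration of equations (4)-(5), including the SVD solution and the final re-orthonormalization of the rotation, and assumes at least six correspondences with non-coplanar 3D points.

```python
# Sketch (ours) of the DLT of eqs. (4)-(5): x are normalized image coordinates,
# wX are 3D points in Fw. Returns an approximate cTw (rotation re-orthonormalized).
import numpy as np

def pose_dlt(wX, x):
    """wX: (N,3) 3D points, x: (N,2) normalized image points, N >= 6 (non-coplanar)."""
    A = []
    for (X, Y, Z), (xi, yi) in zip(wX, x):
        A.append([X, Y, Z, 1, 0, 0, 0, 0, -xi * X, -xi * Y, -xi * Z, -xi])
        A.append([0, 0, 0, 0, X, Y, Z, 1, -yi * X, -yi * Y, -yi * Z, -yi])
    A = np.asarray(A)

    # h is the right singular vector associated with the smallest singular value of A
    _, _, Vt = np.linalg.svd(A)
    h = Vt[-1]
    # h = (r1, tx, r2, ty, r3, tz): rebuild the 3x4 pose matrix [R | t]
    M = h.reshape(3, 4)
    R, t = M[:, :3], M[:, 3]

    # Fix the global scale and sign, then orthonormalize R
    scale = np.linalg.norm(R[0])       # rows of a rotation matrix have unit norm
    if np.linalg.det(R) < 0:
        scale = -scale
    R, t = R / scale, t / scale
    U, _, Vt2 = np.linalg.svd(R)
    R = U @ Vt2                        # closest rotation matrix in the Frobenius sense

    cTw = np.eye(4)
    cTw[:3, :3], cTw[:3, 3] = R, t
    return cTw
```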
An alternative and very elegant solution, which takes these non-linear constraints into account, has been proposed in [28] [93]. Considering that the pose estimation problem is linear under the scaled orthographic projection model (weak perspective projection) [48] [28], Dementhon proposed to iteratively go back from the scaled orthographic projection model to the perspective one. POSIT is a standard approach used to solve the PnP problem. An advantage of this approach is that it does not require any initialization. It inherently enforces the non-linear constraints and is computationally cheap. A drawback is that POSIT is not directly suited for coplanar points; nevertheless, an extension of POSIT has been proposed in [93]. Its implementation is available in OpenCV [20] and in ViSP [79], and it has been widely used in AR applications (see Section 3.1.4).
In our opinion, the "gold-standard" solution to the PnP problem consists in estimating the six parameters of the transformation ${}^c\mathbf{T}_w$ by minimizing the norm of the reprojection error using a non-linear minimization approach such as a Gauss-Newton or a Levenberg-Marquardt technique. Minimizing this reprojection error provides the Maximum Likelihood estimate when a Gaussian noise is assumed on the measurements (i.e., on the point coordinates $\mathbf{x}_i$). Another advantage of this approach is that it easily allows integrating the non-linear correlations induced by the PnP problem and provides an optimal solution to the problem. The results corresponding to this example are shown in Figure 4. Denoting $\mathbf{q} \in se(3)$ a minimal representation of ${}^c\mathbf{T}_w$ ($\mathbf{q} = ({}^c\mathbf{t}_w, \theta\mathbf{u})^\top$ where $\theta$ and $\mathbf{u}$ are the angle and the axis of the rotation ${}^c\mathbf{R}_w$), the problem can be formulated as:

$$
\widehat{\mathbf{q}} = \arg\min_{\mathbf{q}} \sum_{i=1}^{N} d\left(\mathbf{x}_i, \boldsymbol{\Pi}\,{}^c\mathbf{T}_w\,{}^w\mathbf{X}_i\right)^2 \qquad (6)
$$

where $d(\mathbf{x}, \mathbf{x}')$ is the Euclidean distance between two points $\mathbf{x}$ and $\mathbf{x}'$. The solution of this problem relies on an iterative minimization process such as a Gauss-Newton method.

Solving equation (6) consists in minimizing the cost function $E(\mathbf{q}) = \|\mathbf{e}(\mathbf{q})\|$ defined by:

$$
E(\mathbf{q}) = \mathbf{e}(\mathbf{q})^\top \mathbf{e}(\mathbf{q}), \quad \text{with} \quad \mathbf{e}(\mathbf{q}) = \mathbf{x}(\mathbf{q}) - \mathbf{x} \qquad (7)
$$

where $\mathbf{x}(\mathbf{q}) = (\ldots, \pi({}^c\mathbf{T}_w\,{}^w\mathbf{X}_i), \ldots)^\top$ and $\mathbf{x} = (\ldots, \tilde{\mathbf{x}}_i, \ldots)^\top$, where $\tilde{\mathbf{x}}_i = (x_i, y_i)$ is a Euclidean 2D point and $\pi(\mathbf{X})$ is the projection function that projects a 3D point $\mathbf{X}$ into $\tilde{\mathbf{x}}$. The solution consists in linearizing $\mathbf{e}(\mathbf{q}) = 0$. A first-order Taylor expansion of the error is given by:

$$
\mathbf{e}(\mathbf{q} + \delta\mathbf{q}) \approx \mathbf{e}(\mathbf{q}) + \mathbf{J}(\mathbf{q})\,\delta\mathbf{q} \qquad (8)
$$

where $\mathbf{J}(\mathbf{q})$ is the Jacobian of $\mathbf{e}(\mathbf{q})$ in $\mathbf{q}$. With the Gauss-Newton method, the solution consists in minimizing $E(\mathbf{q} + \delta\mathbf{q})$ where:

$$
E(\mathbf{q} + \delta\mathbf{q}) = \|\mathbf{e}(\mathbf{q} + \delta\mathbf{q})\| \approx \|\mathbf{e}(\mathbf{q}) + \mathbf{J}(\mathbf{q})\,\delta\mathbf{q}\| \qquad (9)
$$

This minimization problem can be solved by an iterative least-squares approach (ILS), see Figure 3, and we have:

$$
\delta\mathbf{q} = -\mathbf{J}(\mathbf{q})^{+}\,\mathbf{e}(\mathbf{q}) \qquad (10)
$$
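As a concrete illustration of equations (6)-(10), the sketch below (ours, not the paper's supplementary code) refines a pose by Gauss-Newton iterations on the reprojection error. For simplicity it parameterizes the pose as $\mathbf{q} = (\mathbf{t}, \theta\mathbf{u})$, uses a numerical Jacobian and applies an additive update, whereas a full implementation (e.g. in ViSP) would use the analytical Jacobian and an update on $se(3)$. An initial pose obtained from the DLT, P3P or EPnP solutions above can serve as the starting point q0.

```python
# Sketch (ours) of the Gauss-Newton minimization of the reprojection error, eqs. (6)-(10).
# q = (tx, ty, tz, rx, ry, rz) with an axis-angle rotation; Jacobian computed numerically;
# additive update as a simplification of the exponential-map update on se(3).
import numpy as np
import cv2

def residual(q, wX, x_meas):
    """Stacked reprojection error e(q) = x(q) - x, in normalized coordinates."""
    R, _ = cv2.Rodrigues(q[3:6])
    Xc = (R @ wX.T).T + q[0:3]                 # points expressed in the camera frame
    x_proj = Xc[:, :2] / Xc[:, 2:3]            # perspective projection pi(.)
    return (x_proj - x_meas).ravel()

def gauss_newton_pose(q0, wX, x_meas, n_iter=20, eps=1e-6):
    """wX: (N,3) 3D points in Fw, x_meas: (N,2) normalized image points, q0: initial pose."""
    q = q0.astype(float).copy()
    for _ in range(n_iter):
        e = residual(q, wX, x_meas)
        # Numerical Jacobian J(q) of e(q), one column per pose parameter
        J = np.empty((e.size, 6))
        for j in range(6):
            dq = np.zeros(6); dq[j] = eps
            J[:, j] = (residual(q + dq, wX, x_meas) - e) / eps
        delta = -np.linalg.pinv(J) @ e          # eq. (10): delta_q = -J(q)^+ e(q)
        q += delta
        if np.linalg.norm(delta) < 1e-10:
            break
    return q
```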
