To cite this version:
Eric Marchand, Hideaki Uchiyama, Fabien Spindler. Pose Estimation for Augmented Reality: A Hands-On Survey. IEEE Transactions on Visualization and Computer Graphics, Institute of Electrical and Electronics Engineers, 2016, 22 (12), pp. 2633-2651. DOI: 10.1109/TVCG.2015.2513408. HAL: hal-01246370.

Pose estimation for augmented reality:
a hands-on survey
Eric Marchand, Hideaki Uchiyama and Fabien Spindler

E. Marchand is with Université de Rennes 1, IRISA, Inria Rennes-Bretagne Atlantique, Rennes, France. E-mail: Eric.Marchand@irisa.fr
H. Uchiyama is with Kyushu University, Japan.
F. Spindler is with Inria Rennes-Bretagne Atlantique, Rennes, France.
Abstract—Augmented reality (AR) allows virtual objects to be seamlessly inserted into an image sequence. In order to accomplish this goal, it is important that synthetic elements are rendered and aligned in the scene in an accurate and visually acceptable way. The solution of this problem can be related to a pose estimation or, equivalently, a camera localization process. This paper aims at presenting a brief but almost self-contained introduction to the most important approaches dedicated to vision-based camera localization, along with a survey of several extensions proposed in recent years. For most of the presented approaches, we also provide links to the code of short examples. This should allow readers to easily bridge the gap between theoretical aspects and practical implementations.
Index Terms—Survey, augmented reality, vision-based camera localization, pose estimation, PnP, SLAM, motion estimation, homography, keypoint matching, code examples.
1 INTRODUCTION
Augmented reality (AR) allows virtual objects to be seamlessly inserted into an image sequence. A widely acknowledged definition of augmented reality is due to Azuma in the first survey dedicated to the subject [7]: an AR system should combine real and virtual objects, be interactive in real time, and register real and virtual objects. It has to be noted that this definition does not focus on specific technologies for localization and visualization. Back in 1997, registration was considered "one of the most basic problems currently limiting augmented reality" [7].
Pose estimation: a "basic problem" for augmented reality.
AR is intrinsically a multidisciplinary and long-standing research area. It is clear that real and virtual world registration issues have received a large amount of interest. From a broader point of view, this is a motion tracking issue. To achieve this task, many sensors have been considered: mechanical devices, ultrasonic devices, magnetic sensors, inertial devices, GPS, compass, and obviously, optical sensors [146]. To paraphrase [146], there was no silver bullet to solve this problem, but vision-based techniques rapidly emerged.
Indeed, with respect to other sensors, a camera combined with a display is an appealing configuration. As pointed out in [9], such a setup provides vision-based feedback that makes it possible to effectively close the loop between the localization process and the display. This also reduces the need for heavy calibration procedures. Nevertheless, when Azuma's survey [7] was published, only a few vision-based techniques meeting his definition existed.
Until the early 2000s, almost all vision-based registration techniques relied on markers. Then various markerless approaches quickly emerged in the literature. On the one hand, markerless model-based tracking techniques clearly improve on (but are in line with) marker-based methods. On the other hand, with the ability to easily match keypoints such as SIFT, and the now well-established knowledge of multi-view geometry, new approaches based on an image model and on the estimation of the displacement of the camera [122] arose. Finally, the late 2000s saw the introduction of keyframe-based Simultaneous Localization and Mapping (SLAM) [57] that, as a sequel of structure-from-motion approaches (widely used in off-line compositing for the movie industry), makes it possible to get rid of a model of the scene.
Although vision-based registration is still a difficult problem, mature solutions may now be proposed to end-users, and real-world or industrial applications can be foreseen (if not already seen). Meanwhile, many open source software libraries (OpenCV, ViSP, Vuforia,...) and commercial SDKs (Metaio (now with Apple), Wikitude, AugmentedPro, Diotasoft,...) have been released, providing developers with easy-to-use interfaces and efficient registration processes. This allows fast prototyping of AR systems.
Rationale.
Unfortunately, when using such libraries, end-users may largely consider the underlying technologies and methodological aspects as black boxes. Our goal is therefore to present, in the remainder of the paper, a brief but almost self-contained introduction to the most important approaches dedicated to camera localization, along with a survey of the extensions that have been proposed in recent years. We also try to link these methodological concepts to the main libraries and SDKs available on the market.
The aim of this paper is then to provide researchers and practitioners with an almost comprehensive and consolidated introduction to effective tools for facilitating research in augmented reality. It is also dedicated to academics involved in teaching augmented reality at the undergraduate and graduate levels. For most of the presented approaches, we also provide links to the code of short examples. This should allow readers to easily bridge the gap between theoretical aspects and practice. These examples have been written using both OpenCV and the ViSP library [79] developed at Inria.

Choices have to be made.
A comprehensive description of all the existing vision-based localization techniques used in AR is, at least in a journal paper, out of reach, and choices have to be made. For example, we disregard Bayesian frameworks (Extended Kalman Filter). Although such methods were widely used in the early 2000s, it appears that the EKF is less and less used nowadays in favor of deterministic approaches (to mitigate this assertion, it is acknowledged that they are still useful when considering sensor fusion). Not considering display technologies (e.g., optical see-through HMDs), we also disregard eye/head/display calibration issues. As pointed out in [146], many other sensors exist and can be jointly used with cameras. We acknowledge that this provides robustness to the localization process. Nevertheless, as stated, we clearly focus in this paper only on the image-based pose estimation process.
Related work.
In the past, two surveys related to AR (in general) have been published, in 1997 [7] and 2001 [8]. These surveys were complemented in 2008 by an analysis of 10 years of publications in ISMAR [151]. Demonstrating the interest in vision-based localization, it appears that more than 20% of the papers are related to "tracking" and hence to vision-based registration (and they are also among the most cited papers). In [146] the use of other sensors and hybrid systems is explored. Dealing more precisely with 3D tracking, a short monograph was proposed in [65].
To help students, engineers, and researchers pursue further research and development in this very active research area, we explain and discuss the various classes of approaches that have been considered in the literature and that we found important for vision-based AR. We hope this article will be accessible and interesting to experts and students alike.
2 OVERVIEW OF THE PROBLEM
The goal of augmented reality is to insert virtual information into the real world, providing the end-user with additional knowledge about the scene. The added information, usually virtual objects, must be precisely aligned with the real world. Figure 1 shows how these two worlds can be combined into a single and coherent image.
Fig. 1. AR principle and the considered coordinate systems: to achieve a coherent compositing, the computer graphics (CG) camera and the real one should be located at the very same position and have the same parameters.
From the real world side, we have the scene and the camera. Let us denote $\mathcal{F}_c$ the camera frame and $\mathcal{F}_w$ the scene frame (or world frame). On the virtual side, we have a virtual world with various virtual objects whose positions are expressed in the virtual world frame $\mathcal{F}_{CGw}$ (computer graphics (CG) frame). To render the virtual scene, a virtual (CG) camera is added to the system. Let us denote $\mathcal{F}_{CGc}$ the virtual camera frame. For simplicity and without loss of generality, let us assume that the world frame and the virtual world frame are the same ($\mathcal{F}_{CGw} = \mathcal{F}_w$). To create an image of the virtual world that is consistent with the real camera's current view, the CG camera and the real one should be located at the very same position and have the same parameters (focal length, viewing angle, etc.). Once the real and CG cameras are perfectly aligned, a compositing step simply provides the resulting augmented image.
Within this process, the only unknown is the real camera position in the world frame (we denote ${}^c\mathbf{T}_w$ the transformation that fully defines the position of $\mathcal{F}_w$ with respect to $\mathcal{F}_c$). Vision-based AR is thus restricted to a camera pose estimation problem. Any error in the estimation of the camera position in the world reference frame appears to the user as an inconsistency.
Pose estimation is a problem which found its origin in photogrammetry, where it is known as space resection. A simple definition could be: "given a set of correspondences between 3D features and their projections in the image plane, pose estimation consists in computing the position and orientation of the camera". There are many ways to present the solutions to this inverse problem. We made the choice to divide the paper according to the available data: do we have 3D models (or can we acquire them?), or do we restrict ourselves to planar scenes? The paper is then organized as follows:
• In Section 3, we chose to consider first the general case where 3D models are available or can be built on-line. We first review in Section 3.1 the solutions based on classical pose estimation methods (known as PnP). We then show in Section 3.2 a generalization of the previous method to handle far more complex 3D models. When 3D models are not available a priori, they can be estimated on-line thanks to Simultaneous Localization and Mapping (SLAM) techniques (see Section 3.3). Finally, when 3D data can be directly measured, registration with the 3D model can be done directly in 3D space. This is the objective of Section 3.4.
• It appears that the problem can be greatly simplified when the scene is planar. This is the subject of Section 4. In that case, pose estimation can be handled as a camera motion estimation process.
• From a practical point of view, the development of actual AR applications raises the question of feature extraction and of matching between image features. This issue is discussed in Section 5.
Overall, whatever the method chosen, it will be seen that pose estimation is an optimization problem. The quality of the estimated pose is highly dependent on the quality of the measurements. We therefore also introduce, in Section 3.1.3, robust estimation processes able to deal with spurious data (outliers), which is fundamental for real-life applications.
3 POSE ESTIMATION RELYING ON A 3D MODEL
In this section we assume that a 3D model of the scene is available
or can be estimated on-line. As stated in the previous section, the
pose should be estimated knowing the correspondences between

2D measurements in the image and 3D features of the model. It is first necessary to properly state the problem. We will consider here that these features are 3D points and their 2D projections (as pixels) in the image.
Let us denote $\mathcal{F}_c$ the camera frame and ${}^c\mathbf{T}_w$ the transformation that fully defines the position of $\mathcal{F}_w$ with respect to $\mathcal{F}_c$ (see Figure 2). ${}^c\mathbf{T}_w$ is a homogeneous matrix defined such that:

$$
{}^c\mathbf{T}_w = \begin{pmatrix} {}^c\mathbf{R}_w & {}^c\mathbf{t}_w \\ \mathbf{0}_{3\times 1}^\top & 1 \end{pmatrix} \qquad (1)
$$

where ${}^c\mathbf{R}_w$ and ${}^c\mathbf{t}_w$ are the rotation matrix and translation vector that define the position of the camera in the world frame (note that ${}^c\mathbf{R}_w$, being a rotation matrix, should respect the orthogonality constraints).
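To make equation (1) concrete, the short NumPy/OpenCV sketch below (ours, not taken from the paper's supplementary code) builds ${}^c\mathbf{T}_w$ from an arbitrary axis-angle rotation and translation, and checks the orthogonality constraint mentioned above; all numerical values are placeholders.

```python
# Minimal sketch (assumed values): building the homogeneous matrix cTw of eq. (1).
import numpy as np
import cv2

theta_u = np.array([0.1, -0.2, 0.05])    # axis-angle rotation (rad), arbitrary
c_t_w   = np.array([0.02, -0.05, 0.80])  # translation of Fw expressed in Fc, arbitrary

c_R_w, _ = cv2.Rodrigues(theta_u)        # 3x3 rotation matrix

c_T_w = np.eye(4)
c_T_w[:3, :3] = c_R_w
c_T_w[:3, 3]  = c_t_w

# cRw must satisfy the orthogonality constraints of a rotation matrix
assert np.allclose(c_R_w @ c_R_w.T, np.eye(3), atol=1e-9)
assert np.isclose(np.linalg.det(c_R_w), 1.0)
```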
Fig. 2. Rigid transformation ${}^c\mathbf{T}_w$ between the world frame $\mathcal{F}_w$ and the camera frame $\mathcal{F}_c$, and perspective projection.
The perspective projection $\bar{\mathbf{x}} = (u, v, 1)^\top$ of a point ${}^w\mathbf{X} = ({}^wX, {}^wY, {}^wZ, 1)^\top$ is given by (see Figure 2):

$$
\bar{\mathbf{x}} = \mathbf{K}\,\boldsymbol{\Pi}\,{}^c\mathbf{T}_w\,{}^w\mathbf{X} \qquad (2)
$$

where $\bar{\mathbf{x}}$ are the coordinates, expressed in pixels, of the point in the image; $\mathbf{K}$ is the camera intrinsic parameter matrix, defined by:

$$
\mathbf{K} = \begin{pmatrix} p_x & 0 & u_0 \\ 0 & p_y & v_0 \\ 0 & 0 & 1 \end{pmatrix}
$$

where $(u_0, v_0, 1)^\top$ are the coordinates of the principal point (the intersection of the optical axis with the image plane) and $p_x$ (resp. $p_y$) is the ratio between the focal length of the lens $f$ and the size of a pixel $l_x$: $p_x = f / l_x$ (resp. $p_y = f / l_y$, $l_y$ being the height of a pixel). The projection matrix $\boldsymbol{\Pi}$ is given, in the case of a perspective projection model, by:

$$
\boldsymbol{\Pi} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}
$$

The intrinsic parameters can be easily obtained through an off-line calibration step (e.g. [20], [149]). Therefore, when considering the AR problem, we shall consider image coordinates expressed in the normalized metric space $\mathbf{x} = \mathbf{K}^{-1}\bar{\mathbf{x}}$. Let us note that we consider here only a pure perspective projection model, but it is clear that any model with distortion can easily be considered and handled. From now on, we will always consider that the camera is calibrated and that the coordinates are expressed in the normalized space.
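As an illustration of the projection model (2) and of the normalization $\mathbf{x} = \mathbf{K}^{-1}\bar{\mathbf{x}}$, the following sketch (ours, with arbitrary intrinsic values and an arbitrary pose) projects a 3D point to pixel coordinates and converts them back to the normalized metric space.

```python
# Minimal sketch of eq. (2) and of the normalization x = K^-1 x_bar.
# Intrinsics (px, py, u0, v0) and the pose are placeholders for illustration only.
import numpy as np

K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
Pi = np.hstack([np.eye(3), np.zeros((3, 1))])   # perspective projection matrix

c_T_w = np.eye(4)                               # toy pose: small translation along z
c_T_w[:3, 3] = [0.0, 0.0, 0.5]

wX = np.array([0.1, -0.05, 2.0, 1.0])           # 3D point in Fw (homogeneous)

x_bar = K @ Pi @ c_T_w @ wX                     # pixel coordinates (homogeneous)
x_bar = x_bar / x_bar[2]                        # (u, v, 1)

x = np.linalg.inv(K) @ x_bar                    # normalized metric coordinates (x, y, 1)
print(x_bar[:2], x[:2])
```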
If we have $N$ points ${}^w\mathbf{X}_i$, $i = 1..N$, whose coordinates expressed in $\mathcal{F}_w$ are given by ${}^w\mathbf{X}_i = ({}^wX_i, {}^wY_i, {}^wZ_i, 1)^\top$, the projection $\mathbf{x}_i = (x_i, y_i, 1)^\top$ of these points in the image plane is then given by:

$$
\mathbf{x}_i = \boldsymbol{\Pi}\,{}^c\mathbf{T}_w\,{}^w\mathbf{X}_i. \qquad (3)
$$

Knowing the 2D-3D point correspondences, $\mathbf{x}_i$ and ${}^w\mathbf{X}_i$, pose estimation consists in solving the system given by the set of equations (3) for ${}^c\mathbf{T}_w$. This is an inverse problem that is known as the Perspective from N Points problem or PnP (Perspective-n-Point).
3.1 Pose estimation from a known 3D model
In this paragraph, we review methods allowing to solve the set of equations (3) for the pose ${}^c\mathbf{T}_w$. Among various solutions, we explain in more depth two classical algorithms widely considered in augmented reality: one method that does not require any initialization of the pose (the Direct Linear Transform) and a method based on a gradient approach that needs an initial pose but which can be considered the "gold standard" solution [48]. We will also discuss more complex, but also more efficient, solutions to the pose estimation issue. The optimization procedure in the presence of spurious data (outliers) is also considered. In each case, a comprehensive description of each method is given.
3.1.1 P3P: solving pose estimation with the smallest subset of
correspondences
P3P is an important and old problem for which many solutions
have been proposed. Theoretically, since the pose can be rep-
resented by six independent parameters, three points should be
sufficient to solve this problem.
Most P3P approaches rely on a two-step solution. First, an estimation of the unknown depth ${}^cZ_i$ of each point (in the camera frame) is obtained thanks to constraints (law of cosines) given by the triangle $C\mathbf{X}_i\mathbf{X}_j$, for which the distance between $\mathbf{X}_i$ and $\mathbf{X}_j$ and the angle between the two directions $C\mathbf{X}_i$ and $C\mathbf{X}_j$ are known and measured. The estimation of the point depths is usually done by solving a fourth-order polynomial equation [39] [105] [41] [5]. Once the coordinates of the three points are known in the camera frame, the second step consists in estimating the rigid transformation ${}^c\mathbf{T}_w$ that maps the coordinates expressed in the camera frame to the coordinates expressed in the world frame (3D-3D registration, see Section 3.4). The rotation, represented by quaternions, can be obtained using a closed-form solution [49]. Alternatively, a least-squares solution that uses the Singular Value Decomposition (SVD) [5] can also be considered. Since a fourth-order polynomial equation has to be solved, the problem features up to four possible solutions. It is then necessary to have at least a fourth point to disambiguate the obtained results [39] [48].
More recently, Kneip et al. [62] proposed a novel closed-form solution that directly computes the rigid transformation between the camera and world frames, ${}^c\mathbf{T}_w$. This is made possible by first introducing a new intermediate camera frame centered in $C$ whose $x$ axis is aligned with the direction of the first point $\mathbf{X}_1$, and secondly a new world frame centered in $\mathbf{X}_1$ whose $x$ axis is aligned with the direction of the second point $\mathbf{X}_2$. Their relative position and orientation can be represented using only two parameters. These parameters can then be computed by solving a fourth-order polynomial equation. A final substitution allows computing ${}^c\mathbf{T}_w$. The proposed algorithm is much faster than the other solutions since it avoids both the estimation of the 3D point depths in the camera frame and the 3D-3D registration step. Kneip's P3P implementation is available in OpenGV [59].
Although P3P is a well-known solution to the pose estimation problem, other PnP approaches that use more points (n > 3) have usually been preferred. Indeed, pose accuracy usually increases with the number of points. Nevertheless, within an outlier rejection process such as RANSAC, being fast to compute and requiring only three point correspondences, a fast P3P such as [59] is the solution to choose (see Section 3.1.3). P3P is also an interesting solution to bootstrap a non-linear optimization process that minimizes the reprojection error, as will be seen in Section 3.1.2.
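As an illustration of this strategy (a fast minimal solver used inside RANSAC), the sketch below (ours, not from the paper) relies on OpenCV's solvePnPRansac with its P3P solver; the correspondences are synthetic and a few of them are deliberately corrupted to play the role of outliers.

```python
# Sketch (assumed data): P3P inside RANSAC to reject outliers, via OpenCV.
import numpy as np
import cv2

rng = np.random.default_rng(0)
K = np.array([[800, 0, 320], [0, 800, 240], [0, 0, 1]], dtype=np.float64)

# Synthetic scene: 3D points in front of the camera, projected with a known pose
object_points = rng.uniform([-0.5, -0.5, 2.0], [0.5, 0.5, 4.0], size=(30, 3))
rvec_gt = np.array([0.1, -0.2, 0.05])
tvec_gt = np.array([0.02, -0.05, 0.3])
image_points, _ = cv2.projectPoints(object_points, rvec_gt, tvec_gt, K, None)
image_points = image_points.reshape(-1, 2)
image_points[:5] += rng.uniform(30, 60, size=(5, 2))   # corrupt a few matches (outliers)

ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    object_points, image_points, K, None,
    flags=cv2.SOLVEPNP_P3P,        # minimal P3P solver on each random sample
    reprojectionError=2.0,         # inlier threshold (pixels)
    iterationsCount=100)
if ok:
    print("pose (axis-angle, translation):", rvec.ravel(), tvec.ravel())
    print("number of inliers:", None if inliers is None else len(inliers))
```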
3.1.2 PnP: pose estimation from N point correspondences
PnP is an over-constrained and generic solution to the pose estimation problem from 2D-3D point correspondences. Here again, as for P3P, one can consider multi-stage methods that estimate the coordinates of the points [105] or of virtual points [67] in the camera frame and then achieve a 3D-3D registration process [105]. On the other hand, direct or one-stage minimization approaches have been proposed.
Among the former, [105] extended their P3P algorithm to P4P, P5P and finally to PnP. In the EPnP approach [67], the 3D point coordinates are expressed as a weighted sum of four virtual control points. The pose problem is then reduced to the estimation of the coordinates of these control points in the camera frame. The main advantage of this latter approach is its reduced computational complexity, which is linear with respect to the number of points.
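EPnP is directly exposed in OpenCV; a brief usage sketch (ours), reusing the synthetic object_points, image_points and K of the previous sketch, is given below.

```python
# Sketch (assumed inputs from the previous example): EPnP via OpenCV.
import cv2

ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, None,
                              flags=cv2.SOLVEPNP_EPNP)
# rvec (axis-angle) and tvec define the estimated pose; they can then be refined by the
# non-linear minimization of the reprojection error described below.
```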
Within the latter one-step approaches, the Direct Linear Transform (DLT) is certainly the oldest one [48], [129]. Although not very accurate, this solution and its sequels have historically been widely considered in AR applications. PnP is intrinsically a non-linear problem; nevertheless, a solution relying on the resolution of a linear system can be considered. It consists in solving the homogeneous linear system built from equations (3) for the 12 parameters of the matrix ${}^c\mathbf{T}_w$. Indeed, considering that the homogeneous matrix to be estimated is defined by:

$$
{}^c\mathbf{T}_w = \begin{pmatrix} \mathbf{r}_1 & t_x \\ \mathbf{r}_2 & t_y \\ \mathbf{r}_3 & t_z \\ \mathbf{0}_{3\times 1}^\top & 1 \end{pmatrix}
$$

where $\mathbf{r}_1$, $\mathbf{r}_2$ and $\mathbf{r}_3$ are the rows of the rotation matrix ${}^c\mathbf{R}_w$ and ${}^c\mathbf{t}_w = (t_x, t_y, t_z)$, developing (3) leads to the system:

$$
\mathbf{A}\mathbf{h} = \begin{pmatrix} \vdots \\ \mathbf{A}_i \\ \vdots \end{pmatrix} \mathbf{h} = \mathbf{0} \qquad (4)
$$

with $\mathbf{A}_i$ given by [129]:

$$
\mathbf{A}_i = \begin{pmatrix}
{}^wX_i & {}^wY_i & {}^wZ_i & 1 & 0 & 0 & 0 & 0 & -x_i{}^wX_i & -x_i{}^wY_i & -x_i{}^wZ_i & -x_i \\
0 & 0 & 0 & 0 & {}^wX_i & {}^wY_i & {}^wZ_i & 1 & -y_i{}^wX_i & -y_i{}^wY_i & -y_i{}^wZ_i & -y_i
\end{pmatrix} \qquad (5)
$$

and

$$
\mathbf{h} = \left(\mathbf{r}_1, t_x, \mathbf{r}_2, t_y, \mathbf{r}_3, t_z\right)^\top
$$

is a vector representation of ${}^c\mathbf{T}_w$. The solution of this homogeneous system is the eigenvector of $\mathbf{A}^\top\mathbf{A}$ corresponding to its minimal eigenvalue, computed in practice through a Singular Value Decomposition of $\mathbf{A}$. An orthonormalization of the obtained rotation matrix is then necessary (the source code of the DLT algorithm is proposed as supplementary material of this paper).
Obviously and unfortunately, being over-parameterized, this solution is very sensitive to noise, and a solution that explicitly considers the non-linear constraints of the system should be preferred.
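The paper provides its own DLT code as supplementary material; the compact NumPy sketch below is our own illustration of equations (4)-(5), including the SVD solution and the final re-orthonormalization of the rotation, and assumes at least six correspondences with non-coplanar 3D points.

```python
# Sketch (ours) of the DLT of eqs. (4)-(5): x are normalized image coordinates,
# wX are 3D points in Fw. Returns an approximate cTw (rotation re-orthonormalized).
import numpy as np

def pose_dlt(wX, x):
    """wX: (N,3) 3D points, x: (N,2) normalized image points, N >= 6 (non-coplanar)."""
    A = []
    for (X, Y, Z), (xi, yi) in zip(wX, x):
        A.append([X, Y, Z, 1, 0, 0, 0, 0, -xi * X, -xi * Y, -xi * Z, -xi])
        A.append([0, 0, 0, 0, X, Y, Z, 1, -yi * X, -yi * Y, -yi * Z, -yi])
    A = np.asarray(A)

    # h is the right singular vector associated with the smallest singular value of A
    _, _, Vt = np.linalg.svd(A)
    h = Vt[-1]
    # h = (r1, tx, r2, ty, r3, tz): rebuild the 3x4 pose matrix [R | t]
    M = h.reshape(3, 4)
    R, t = M[:, :3], M[:, 3]

    # Fix the global scale and sign, then orthonormalize R
    scale = np.linalg.norm(R[0])       # rows of a rotation matrix have unit norm
    if np.linalg.det(R) < 0:
        scale = -scale
    R, t = R / scale, t / scale
    U, _, Vt2 = np.linalg.svd(R)
    R = U @ Vt2                        # closest rotation matrix in the Frobenius sense

    cTw = np.eye(4)
    cTw[:3, :3], cTw[:3, 3] = R, t
    return cTw
```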
An alternative and very elegant solution, which takes these non-linear constraints into account, has been proposed in [28] [93]. Considering that the pose estimation problem is linear under the scaled orthographic projection model (weak perspective projection) [48] [28], Dementhon proposed to iteratively go back from the scaled orthographic projection model to the perspective one. POSIT is a standard approach used to solve the PnP problem. An advantage of this approach is that it does not require any initialization. It inherently enforces the non-linear constraints and is computationally cheap. A drawback is that POSIT is not directly suited for coplanar points; nevertheless, an extension of POSIT has been proposed in [93]. Its implementation is available in OpenCV [20] and in ViSP [79], and it has been widely used in AR applications (see Section 3.1.4).
In our opinion, the "gold-standard" solution to the PnP problem consists in estimating the six parameters of the transformation ${}^c\mathbf{T}_w$ by minimizing the norm of the reprojection error using a non-linear minimization approach such as a Gauss-Newton or a Levenberg-Marquardt technique. Minimizing this reprojection error provides the Maximum Likelihood estimate when a Gaussian noise is assumed on the measurements (i.e., on the point coordinates $\mathbf{x}_i$). Another advantage of this approach is that it easily allows integrating the non-linear correlations induced by the PnP problem and provides an optimal solution to the problem. The results corresponding to this example are shown in Figure 4. Denoting $\mathbf{q} \in se(3)$ a minimal representation of ${}^c\mathbf{T}_w$ ($\mathbf{q} = ({}^c\mathbf{t}_w, \theta\mathbf{u})^\top$ where $\theta$ and $\mathbf{u}$ are the angle and the axis of the rotation ${}^c\mathbf{R}_w$), the problem can be formulated as:

$$
\widehat{\mathbf{q}} = \arg\min_{\mathbf{q}} \sum_{i=1}^{N} d\left(\mathbf{x}_i, \boldsymbol{\Pi}\,{}^c\mathbf{T}_w\,{}^w\mathbf{X}_i\right)^2 \qquad (6)
$$

where $d(\mathbf{x}, \mathbf{x}')$ is the Euclidean distance between two points $\mathbf{x}$ and $\mathbf{x}'$. The solution of this problem relies on an iterative minimization process such as a Gauss-Newton method.

Solving equation (6) consists in minimizing the cost function $E(\mathbf{q}) = \|\mathbf{e}(\mathbf{q})\|$ defined by:

$$
E(\mathbf{q}) = \mathbf{e}(\mathbf{q})^\top \mathbf{e}(\mathbf{q}), \quad \text{with} \quad \mathbf{e}(\mathbf{q}) = \mathbf{x}(\mathbf{q}) - \mathbf{x} \qquad (7)
$$

where $\mathbf{x}(\mathbf{q}) = (\ldots, \pi({}^c\mathbf{T}_w\,{}^w\mathbf{X}_i), \ldots)^\top$ and $\mathbf{x} = (\ldots, \tilde{\mathbf{x}}_i, \ldots)^\top$, where $\tilde{\mathbf{x}}_i = (x_i, y_i)$ is a Euclidean 2D point and $\pi(\mathbf{X})$ is the projection function that projects a 3D point $\mathbf{X}$ into $\tilde{\mathbf{x}}$. The solution consists in linearizing $\mathbf{e}(\mathbf{q}) = 0$. A first-order Taylor expansion of the error is given by:

$$
\mathbf{e}(\mathbf{q} + \delta\mathbf{q}) \approx \mathbf{e}(\mathbf{q}) + \mathbf{J}(\mathbf{q})\,\delta\mathbf{q} \qquad (8)
$$

where $\mathbf{J}(\mathbf{q})$ is the Jacobian of $\mathbf{e}(\mathbf{q})$ in $\mathbf{q}$. With the Gauss-Newton method, the solution consists in minimizing $E(\mathbf{q} + \delta\mathbf{q})$ where:

$$
E(\mathbf{q} + \delta\mathbf{q}) = \|\mathbf{e}(\mathbf{q} + \delta\mathbf{q})\| \approx \|\mathbf{e}(\mathbf{q}) + \mathbf{J}(\mathbf{q})\,\delta\mathbf{q}\| \qquad (9)
$$

This minimization problem can be solved by an iterative least-squares approach (ILS), see Figure 3, and we have:

$$
\delta\mathbf{q} = -\mathbf{J}(\mathbf{q})^{+}\,\mathbf{e}(\mathbf{q}) \qquad (10)
$$
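As a concrete illustration of equations (6)-(10), the sketch below (ours, not the paper's supplementary code) refines a pose by Gauss-Newton iterations on the reprojection error. For simplicity it parameterizes the pose as $\mathbf{q} = (\mathbf{t}, \theta\mathbf{u})$, uses a numerical Jacobian and applies an additive update, whereas a full implementation (e.g. in ViSP) would use the analytical Jacobian and an update on $se(3)$. An initial pose obtained from the DLT, P3P or EPnP solutions above can serve as the starting point q0.

```python
# Sketch (ours) of the Gauss-Newton minimization of the reprojection error, eqs. (6)-(10).
# q = (tx, ty, tz, rx, ry, rz) with an axis-angle rotation; Jacobian computed numerically;
# additive update as a simplification of the exponential-map update on se(3).
import numpy as np
import cv2

def residual(q, wX, x_meas):
    """Stacked reprojection error e(q) = x(q) - x, in normalized coordinates."""
    R, _ = cv2.Rodrigues(q[3:6])
    Xc = (R @ wX.T).T + q[0:3]                 # points expressed in the camera frame
    x_proj = Xc[:, :2] / Xc[:, 2:3]            # perspective projection pi(.)
    return (x_proj - x_meas).ravel()

def gauss_newton_pose(q0, wX, x_meas, n_iter=20, eps=1e-6):
    """wX: (N,3) 3D points in Fw, x_meas: (N,2) normalized image points, q0: initial pose."""
    q = q0.astype(float).copy()
    for _ in range(n_iter):
        e = residual(q, wX, x_meas)
        # Numerical Jacobian J(q) of e(q), one column per pose parameter
        J = np.empty((e.size, 6))
        for j in range(6):
            dq = np.zeros(6); dq[j] = eps
            J[:, j] = (residual(q + dq, wX, x_meas) - e) / eps
        delta = -np.linalg.pinv(J) @ e          # eq. (10): delta_q = -J(q)^+ e(q)
        q += delta
        if np.linalg.norm(delta) < 1e-10:
            break
    return q
```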
