
SoftPOSIT: Simultaneous Pose and Correspondence Determination

TL;DR: A new algorithm, called SoftPOSIT, determines the pose of a 3D object from a single 2D image when correspondences between object points and image points are not known; empirical evidence suggests its asymptotic run-time complexity is better than that of previous methods by a factor of the number of image points.
Abstract: The problem of pose estimation arises in many areas of computer vision, including object recognition, object tracking, site inspection and updating, and autonomous navigation when scene models are available. We present a new algorithm, called SoftPOSIT, for determining the pose of a 3D object from a single 2D image when correspondences between object points and image points are not known. The algorithm combines the iterative softassign algorithm (Gold and Rangarajan, 1996; Gold et al., 1998) for computing correspondences and the iterative POSIT algorithm (DeMenthon and Davis, 1995) for computing object pose under a full-perspective camera model. Our algorithm, unlike most previous algorithms for pose determination, does not have to hypothesize small sets of matches and then verify the remaining image points. Instead, all possible matches are treated identically throughout the search for an optimal pose. The performance of the algorithm is extensively evaluated in Monte Carlo simulations on synthetic data under a variety of levels of clutter, occlusion, and image noise. These tests show that the algorithm performs well in a variety of difficult scenarios, and empirical evidence suggests that the algorithm has an asymptotic run-time complexity that is better than previous methods by a factor of the number of image points. The algorithm is being applied to a number of practical autonomous vehicle navigation problems including the registration of 3D architectural models of a city to images, and the docking of small robots onto larger robots.

Summary (4 min read)

1 Introduction

  • This paper presents an algorithm for solving the model-to-image registration problem, which is the task of determining the position and orientation (the pose) of a three-dimensional object with respect to a camera coordinate system, given a model of the object consisting of 3D reference points and a single 2D image of these points.
  • Solving the correspondence problem consists of finding matching image features and model features.
  • Projecting the model in the known pose into the original image, one can identify matches among the model features that project sufficiently close to an image feature (a minimal sketch of this verification step follows this list).
  • A global objective function is defined that captures the nature of the problem in terms of both pose and correspondence and combines the formalisms of both iterative techniques.
  • In the following sections, the authors examine each step of the method.
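
As noted above, the verification step is small enough to state concretely; below is a minimal sketch (Python with NumPy; the simple pinhole model and all names are illustrative assumptions, not the paper's code):

    import numpy as np

    def count_matches(R, t, object_pts, image_pts, f=1.0, tol=2.0):
        # Put object points into the camera frame and project them.
        cam = object_pts @ R.T + t               # (M, 3) camera coordinates
        proj = f * cam[:, :2] / cam[:, 2:3]      # (M, 2) pinhole projection
        # A projected point "matches" if it lands within tol of some
        # image point.
        d = np.linalg.norm(proj[:, None, :] - image_pts[None, :, :], axis=2)
        return int((d.min(axis=1) < tol).sum())

A hypothesized pose is accepted when this count is high enough; the same test reappears in the hypothesize-and-test and RANSAC discussions later on.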

2 POSIT Algorithm

  • One of the building blocks of the new algorithm is the POSIT algorithm, presented in detail in [DeMenthon 1995], which determines pose from known correspondences.
  • With a factor w different from 1, this image is scaled and approximates a perspective image, because the scaling is inversely proportional to the distance Tz from the camera center of projection to the object origin M0 (s = f/Tz).
  • These vectors can be found by singular value decomposition (SVD) (see the Matlab code in [DeMenthon 2001]).
  • Then the authors can solve the system of equations (4) again to obtain a refined pose.
  • This process is repeated, and the iteration is stopped when the process becomes stationary.
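
The bullets above compress the whole loop; the sketch below spells out one common POSIT formulation (Python with NumPy; Q1 and Q2 are the pose vectors solved from equations (4), and the normalization details should be checked against [DeMenthon 1995]):

    import numpy as np

    def posit(object_pts, image_pts, f, n_iter=20):
        # object_pts: (N, 3) points relative to the object origin M0;
        # image_pts:  (N, 2) corresponding image points; f: focal length.
        N = object_pts.shape[0]
        A = np.hstack([object_pts, np.ones((N, 1))])  # homogeneous points
        Ainv = np.linalg.pinv(A)      # pseudoinverse, computed via SVD
        w = np.ones(N)                # w = 1: start from scaled orthography
        for _ in range(n_iter):       # stop instead when w stops changing
            Q1 = Ainv @ (w * image_pts[:, 0])  # least-squares solve of (4)
            Q2 = Ainv @ (w * image_pts[:, 1])
            n1, n2 = np.linalg.norm(Q1[:3]), np.linalg.norm(Q2[:3])
            s = np.sqrt(n1 * n2)      # scale of the scaled orthography
            R1, R2 = Q1[:3] / n1, Q2[:3] / n2
            R3 = np.cross(R1, R2)     # third row completes the rotation
            Tz = f / s                # distance from camera to M0
            w = 1.0 + object_pts @ R3 / Tz     # refined corrections
        R = np.stack([R1, R2, R3])
        t = np.array([Q1[3] / s, Q2[3] / s, Tz])
        return R, t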

3 Geometry and Objective Function

  • In other words, the left-hand side of equation (4) represents the corrected image point vector (w x, w y) in the image plane.
  • In other words, the right-hand side of equation (4) represents the scaled orthographic projection (Q1 · P, Q2 · P) of the object point, also a vector in the image plane.
  • The least squares solution of equations (4) for pose enforces these constraints.
  • Therefore minimizing this objective function consists of minimizing the scaled sum of squared distances of object points to lines of sight, when distances are taken along directions parallel to the image plane.
  • The next iteration step finds the pose such that the scaled orthographic projection of each object point is as close as possible to its corrected image point.
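
In this notation, the squared image-plane distance between an object point P (homogeneous coordinates P̃, correction w) and the image point (x, y) it is compared against can be written as follows (a reconstruction consistent with the bullets above; the paper's exact symbols may differ):

    d² = (Q1 · P̃ − w·x)² + (Q2 · P̃ − w·y)²

Both differences lie in the image plane, which is exactly why the distance to a line of sight is measured along directions parallel to that plane.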

4 Pose Calculation with Unknown Correspondences

  • The m_jk are correspondence variables that define the assignments between image and object feature points; these must satisfy a number of correspondence constraints.
  • Note that when all the assignments are well-defined, this objective function becomes equivalent to the objective function defined in equation (5).
  • Compute the correction terms w using the pose vectors Q1 and Q2 just computed (as described in the previous section).
  • In EM, given a guess for the unknown parameters (the pose in their problem) and a set of observed data (the image points in their problem), the expected value of the unobserved variables (the correspondence matrix in their problem) is estimated.
  • This process is repeated until these estimates converge.
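
Concretely, with m_jk the assignment weight between image point j and object point k, and d²_jk the squared image-plane distance of Section 3 for that pair, the global objective has the form (this rendering is a reconstruction; α is a gain term that makes a match worthwhile only when d²_jk < α):

    E = Σ_j Σ_k m_jk (d²_jk − α)

Fixing the m_jk and minimizing over pose is the POSIT-like step; fixing the pose and updating the m_jk is the softassign step of Section 4.2. That alternation is the EM-style iteration described in the last two bullets.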

4.2 Correspondence Problem

  • Assume that the correspondence variables m_jk in the expression for the objective function E are known and fixed.
  • The assignment matrix must satisfy the constraint that each image point match at most one object point, and vice versa (i.e., Σ_k m_jk ≤ 1 for all j, and Σ_j m_jk ≤ 1 for all k).
  • The exponentiation has the effect of ensuring that all elements of the assignment matrix are positive.
  • See [Gold 1998] for an analytical justification.
  • This combination of deterministic annealing and Sinkhorn’s technique in an iteration loop was called softassign by Gold and Rangarajan [Gold 1996, Gold 1998].
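
A compact sketch of one softassign update as described above (Python with NumPy; β is the annealing parameter, α the gain from the objective function, and initializing the slack row and column to 1 = exp(0) is an assumption):

    import numpy as np

    def softassign(d2, beta, alpha, n_sinkhorn=30):
        # d2[j, k]: squared distance between image point j and object point k.
        J, K = d2.shape
        m = np.ones((J + 1, K + 1))                # slack row/column = exp(0)
        m[:J, :K] = np.exp(-beta * (d2 - alpha))   # exponentiation: all > 0
        for _ in range(n_sinkhorn):                # Sinkhorn normalization
            m[:J, :] /= m[:J, :].sum(axis=1, keepdims=True)  # real rows
            m[:, :K] /= m[:, :K].sum(axis=0, keepdims=True)  # real columns
        return m[:J, :K]

As β grows under deterministic annealing, the normalized rows and columns sharpen toward a 0-1 assignment, while at small β every pairing keeps some weight; this is why no hard hypothesis about matches is ever required.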

4.3 Pseudocode for SoftPOSIT

  • Initialize the pose vectors Q1 and Q2 using the expected pose or a random pose within the expected range.
  • Compute the squared distances d²_jk between the list of image points and the list of object points of the input.
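
Assembled from the pieces above, the loop body looks roughly like this (a sketch, not the paper's exact pseudocode; project_scaled_ortho and pose_from_soft_assignments are assumed helper names for computing the projections with their corrections w, and for the weighted least-squares solve of equations (4)):

    def softposit(object_pts, image_pts, f, pose0,
                  beta0=4e-4, beta_max=0.5, rate=1.05, alpha=1.0):
        R, t = pose0                          # initial guess (Section 5)
        beta = beta0
        while beta < beta_max:                # deterministic annealing loop
            proj, w = project_scaled_ortho(object_pts, R, t, f)
            # distances between corrected image points and projected points
            corrected = w[None, :, None] * image_pts[:, None, :]  # (J, K, 2)
            d2 = ((corrected - proj[None, :, :]) ** 2).sum(axis=-1)
            m = softassign(d2, beta, alpha)   # update correspondences
            R, t = pose_from_soft_assignments(object_pts, image_pts, m, w, f)
            beta *= rate                      # sharpen the assignments
        return R, t, m

The annealing constants are the same assumed values that reproduce the 147-iteration bound quoted in Section 5.3.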

5 Random Start SoftPOSIT

  • The SoftPOSIT algorithm described above performs a deterministic annealing search starting from an initial guess at the object’s pose.
  • The probability of finding the globally optimal object pose and correspondences starting from an initial guess depends on a number of factors including the accuracy of the initial guess, the number of object points, the number of image points, the amount of object occlusion, the amount of clutter in the image, and the image measurement noise.
  • A common method of searching for a global optimum, and the one used here, is to run the search algorithm starting from a number of different initial guesses, and keep the first solution that meets a specified termination criterion.
  • The authors' initial guesses range over [−π, π] for the three Euler rotation angles, and over a 3D space of translations known to contain the true translation.
  • The authors describe their procedure for generating initial guesses for pose when no knowledge of the correct pose is available, and then they discuss their termination criterion.

5.1 Generating Initial Guesses

  • Given an initial pose that lies in a valley of the cost function in the parameter space, the authors expect the algorithm to converge to the minimum associated with that valley.
  • To examine other valleys, the authors must start with points that lie in them.
  • These points are scaled to cover the expected ranges of translation and rotation.
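
A minimal sampler for such starting points (plain uniform sampling for simplicity; t_lo and t_hi bound the translation box and are application inputs):

    import numpy as np

    rng = np.random.default_rng(seed=0)

    def random_initial_pose(t_lo, t_hi):
        # Euler angles uniform over [-pi, pi]; translation uniform over a
        # 3D box known to contain the true translation.
        angles = rng.uniform(-np.pi, np.pi, size=3)
        trans = rng.uniform(np.asarray(t_lo), np.asarray(t_hi))
        return angles, trans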

5.2 Search Termination

  • Ideally, one would like to repeat the search from a new starting point whenever the number of object-to-image correspondences determined by the search is not maximal.
  • With real data, however, one usually does not know what this maximal number is.
  • Instead, the authors repeat the search when the number of object points that match image points is less than some threshold t.
  • Let the fraction of detected object features be f_d = (number of object points detected as image features) / (total number of object points).
  • A factor ρ accounts for measurement noise that typically prevents some detected object features from being matched even when a good pose is found.
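
Putting these two bullets together, the restart test can be sketched as follows (the symbols are reconstructions of garbled ones in the source, and ρ = 0.8 is purely illustrative):

    def search_done(num_matched, num_object_pts, f_d, rho=0.8):
        # Stop restarting once the matched count reaches rho * f_d * M:
        # the fraction f_d of object points expected to be detected,
        # discounted by rho for matches lost to measurement noise.
        return num_matched >= rho * f_d * num_object_pts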

5.3 Early Search Termination

  • The deterministic annealing loop of the SoftPOSIT algorithm iterates over a range of values of the annealing parameter β.
  • This means that the annealing loop can run for up to 147 iterations.
  • This measure is commonly used [Grimson 1991] at the end of a local search to determine if the current solution for correspondence and pose is good enough to end the search for the global optimum.
  • For each test, the values of the match ratio r computed at each iteration are recorded.
  • Once a SoftPOSIT iteration is completed, ground truth information is used to determine whether or not the correct pose was found.
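
The 147-iteration figure is consistent with a geometric schedule for β; the constants below are assumptions chosen to reproduce that count, and the early-exit test mirrors the match-ratio check just described:

    import math

    beta0, beta_max, rate = 4e-4, 0.5, 1.05       # assumed schedule
    n_iters = math.ceil(math.log(beta_max / beta0) / math.log(rate))
    assert n_iters == 147                         # the bound quoted above

    def keep_searching(match_ratio, iteration, r_min):
        # r_min[i]: the smallest match ratio observed at iteration i over
        # trials that ended in a correct pose (learned with ground truth);
        # dropping below it means this start can be abandoned early.
        return match_ratio >= r_min[iteration]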

6.1 Monte Carlo Evaluation

  • The random-start SoftPOSIT algorithm has been extensively evaluated in Monte Carlo simulations.
  • For more than 92% of the different combinations of simulation parameters, a good pose is found in 90% or more of the associated trials.
  • Again, except for the two highest occlusion and clutter cases, the mean number of starts is about constant or increases very slowly as the number of image points increases.
  • The RANSAC algorithm [Fischler 1981] is the best-known algorithm that computes an object's pose given non-corresponding 3D object points and 2D image points.
  • The authors compare the expected run time of SoftPOSIT to that of RANSAC for each of the simulated data sets discussed in Section 6.1.

6.3 Algorithm Complexity

  • The run-time complexity of a single invocation of SoftPOSIT is O(MN), where M is the number of object points and N is the number of image points; this is because the number of iterations of all of the loops in the pseudocode in Section 4.3 is bounded by a constant, and each line inside a loop is computed in time at most O(MN).
  • Then the run-time complexity of SoftPOSIT with random starts is O(MN²), since the number of starts required was observed to grow at most linearly with the number of image points.
  • This is a factor of N, the number of image points, better than the complexity of any published algorithm that solves the simultaneous pose and correspondence problem under a full-perspective camera model.
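
Assembled as a back-of-envelope check (the O(N) bound on the number of random starts is a conservative reading of the empirical observations above):

    one SoftPOSIT start:            O(MN)
    random starts needed:           O(N)
    SoftPOSIT with random starts:   O(MN) × O(N) = O(MN²)
    RANSAC (Appendix A):            O(MN³)
    improvement:                    O(MN³) / O(MN²) = N, the number of image points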

6.4.1 Autonomous Navigation Application

  • The SoftPOSIT algorithm is being applied to the problem of autonomous vehicle navigation through a city where a 3D architectural model of the city is registered to images obtained from an on-board video camera.
  • Thus far, the algorithm has been applied only to imagery generated by a commercial virtual reality system.
  • Figure 14 shows an image generated by this system and a world model projected into that image using the pose computed by SoftPOSIT.
  • Image feature points are automatically located in the image by detecting corners along the boundary of bright sky regions (a toy version of this detector is sketched after this list).
  • Then the object points that do fall into this estimated field are further culled by keeping only those that project near the detected skyline.
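
A toy version of that detector (thresholds are illustrative; a real implementation would smooth the skyline and suppress clustered responses):

    import numpy as np

    def skyline_corners(gray, sky_thresh=200, turn_thresh=0.5):
        sky = gray >= sky_thresh                  # bright pixels = sky
        skyline = np.argmax(~sky, axis=0).astype(float)  # first non-sky row
        slope = np.gradient(skyline)              # local skyline direction
        turn = np.abs(np.diff(slope))             # change of direction
        cols = np.where(turn > turn_thresh)[0] + 1
        return np.stack([cols, skyline[cols]], axis=-1)  # (x, y) corners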

6.4.2 Robot Docking Application

  • The robot docking application requires that a small robot drive onto a docking platform that is mounted on a larger robot.
  • To detect the corresponding points in the image, lines are first detected using a combination of the Canny edge detector, the Hough transform, and a sorting procedure used to rank the lines produced by the Hough transform; a sketch of this pipeline follows the list.
  • Figure 16 shows the lines and corner points detected in one image of the large robot.
  • In this test, 58% of the image points are clutter.
  • Figure 17a shows the initial guess generated by SoftPOSIT which led to the correct pose being found, and Figure 17b shows this correct pose.
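
A sketch of that line-and-corner pipeline with OpenCV (parameter values are illustrative, and Hough vote order stands in for the paper's sorting procedure):

    import cv2
    import numpy as np

    def dock_corner_points(gray, n_lines=10):
        edges = cv2.Canny(gray, 50, 150)          # edge map
        segs = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=60,
                               minLineLength=40, maxLineGap=5)
        segs = [] if segs is None else [s[0].astype(float) for s in segs[:n_lines]]
        corners = []
        for i in range(len(segs)):
            for j in range(i + 1, len(segs)):
                x1, y1, x2, y2 = segs[i]
                x3, y3, x4, y4 = segs[j]
                d = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
                if abs(d) < 1e-9:
                    continue                      # (near-)parallel lines
                a = x1 * y2 - y1 * x2             # intersect the two
                b = x3 * y4 - y3 * x4             # infinite lines
                corners.append(((a * (x3 - x4) - (x1 - x2) * b) / d,
                                (a * (y3 - y4) - (y1 - y2) * b) / d))
        return corners                            # candidate corner points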

7 Conclusions

  • The authors have developed and evaluated the SoftPOSIT algorithm for determining the poses of objects from images.
  • The correspondence and pose calculation combines into one efficient iterative process the softassign algorithm for determining correspondences and the POSIT algorithm for determining pose.
  • The algorithm has been tested on synthetic data for an autonomous navigation application, and the authors are currently collecting real imagery for further tests with this application.
  • The complexity of SoftPOSIT has been empirically determined to be O(MN²).
  • More data should be collected to further validate this claim.


SoftPOSIT: Simultaneous Pose and Correspondence
Determination
Philip David, Daniel DeMenthon, Ramani Duraiswami, and Hanan Samet
University of Maryland Institute for Advanced Computer Studies, College Park, MD 20742
Army Research Laboratory, 2800 Powder Mill Road, Adelphi, MD 20783-1197
Abstract
The problem of pose estimation arises in many areas of computer vision, including object recognition,
object tracking, site inspection and updating, and autonomous navigation when scene models are avail-
able. We present a new algorithm, called SoftPOSIT, for determining the pose of a 3D object from a
single 2D image when correspondences between model points and image points are not known. The
algorithm combines Gold’s iterative softassign algorithm [Gold 1996, Gold 1998] for computing corre-
spondences and DeMenthon’s iterative POSIT algorithm [DeMenthon 1995] for computing object pose
under a full-perspective camera model. Our algorithm, unlike most previous algorithms for pose de-
termination, does not have to hypothesize small sets of matches and then verify the remaining image
points. Instead, all possible matches are treated identically throughout the search for an optimal pose.
The performance of the algorithm is extensively evaluated in Monte Carlo simulations on synthetic data
under a variety of levels of clutter, occlusion, and image noise. These tests show that the algorithm per-
forms well in a variety of difficult scenarios, and empirical evidence suggests that the algorithm has an
asymptotic run-time complexity that is better than previous methods by a factor of the number of image
points. The algorithm is being applied to a number of practical autonomous vehicle navigation problems
including the registration of 3D architectural models of a city to images, and the docking of small robots
onto larger robots.

The support of NSF grants EAR-99-05844 and IIS-00-86116 is gratefully acknowledged.
1 Introduction
This paper presents an algorithm for solving the model-to-image registration problem, which is the task
of determining the position and orientation (the pose) of a three-dimensional object with respect to a
camera coordinate system, given a model of the object consisting of 3D reference points and a single 2D
image of these points. We assume that no additional information is available with which to constrain the
pose of the object or to constrain the correspondence of model features to image features. This is also
known as the simultaneous pose and correspondence problem.
Automatic registration of 3D models to images is an important problem. Applications include object
recognition, object tracking, site inspection and updating, and autonomous navigation when scene mod-
els are available. It is a difficult problem because it comprises two coupled problems, the correspondence
problem and the pose problem, each easy to solve only if the other has been solved first:
1. Solving the pose (or exterior orientation) problem consists of finding the rotation and translation
of the object with respect to the camera coordinate system. Given matching model and image fea-
tures, one can easily determine the pose that best aligns those matches. For three to five matches,
the pose can be found in closed form by solving sets of polynomial equations [Fischler 1981,
Haralick 1991, Horaud 1989, Yuan 1989]. For six or more matches, linear and nonlinear ap-
proximate methods are generally used [DeMenthon 1995, Fiore 2001, Hartley 2000, Horn 1986,
Lu 2000].
2. Solving the correspondence problem consists of finding matching image features and model fea-
tures. If the object pose is known, one can relatively easily determine the matching features. Pro-
jecting the model in the known pose into the original image, one can identify matches among
the model features that project sufficiently close to an image feature. This approach is typi-
cally used for pose verification, which attempts to determine how good a hypothesized pose is
[Grimson 1991].
The classic approach to solving these coupled problems is the hypothesize-and-test approach [Grimson 1990].
In this approach, a small set of image feature to model feature correspondences are first hypothesized.
Based on these correspondences, the pose of the object is computed. Using this pose, the model points
are back-projected into the image. If the original and back-projected images are sufficiently similar,
then the pose is accepted; otherwise, a new hypothesis is formed and this process is repeated. Perhaps
the best known example of this approach is the RANSAC algorithm [Fischler 1981] for the case that
no information is available to constrain the correspondences of model points to image points. When
three correspondences are used to determine a pose, a high probability of success can be achieved by the
RANSAC algorithm in O(MN³) time when there are N image points and M object points (see
Appendix A for details).
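
A sketch of this hypothesize-and-test loop (pose_from_3_points stands in for one of the closed-form three-point solvers cited above, returning up to four candidate poses; count_matches is the back-projection test sketched in the Introduction summary; both names are illustrative):

    import numpy as np

    def ransac_pose(object_pts, image_pts, f, n_trials=10000,
                    tol=2.0, accept_count=10):
        rng = np.random.default_rng(seed=0)
        best = (None, None, -1)
        for _ in range(n_trials):
            # hypothesize three random object-to-image correspondences
            obj = rng.choice(len(object_pts), size=3, replace=False)
            img = rng.choice(len(image_pts), size=3, replace=False)
            for R, t in pose_from_3_points(object_pts[obj], image_pts[img], f):
                # test: back-project all object points and count matches
                score = count_matches(R, t, object_pts, image_pts, f, tol)
                if score > best[2]:
                    best = (R, t, score)
                if score >= accept_count:         # sufficiently similar
                    return best
        return best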
The problem addressed here is one that is encountered when taking a model-based approach to the
object recognition problem, and as such has received considerable attention. (The other main approach
to object recognition is the appearance-based approach [Murase 1995] in which multiple views of the
object are compared to the image. However, since 3D models are not used, this approach doesn’t pro-
vide accurate object pose.) Many investigators (e.g., [Cass 1994, Cass 1998, Ely 1995, Jacobs 1992,
Lamdan 1988, Procter 1997]) approximate the nonlinear perspective projection via linear affine ap-
proximations. This is accurate when the relative depths of object features are small compared to the
distance of the object from the camera. Among the pioneer contributions were Baird’s tree-pruning
method [Baird 1985], with exponential time complexity for unequal point sets, and Ullman’s alignment
method [Ullman 1989], whose time complexity is a high-order polynomial in the numbers of object and
image points.
The geometric hashing method [Lamdan 1988] determines an object’s identity and pose using a hash-
ing metric computed from a set of image features. Because the hashing metric must be invariant to
camera viewpoint, and because there are no view-invariant image features for general 3D point sets (for
either perspective or affine cameras) [Burns 1993], this method can only be applied to planar scenes.
In [DeMenthon 1993], we proposed an approach using binary search by bisection of pose boxes in
two 4D spaces, extending the research of [Baird 1985, Cass 1992, Breuel 1992] on affine transforms,
but it had high-order complexity. The approach taken by Jurie [Jurie 1999] was inspired by our work
and belongs to the same family of methods. An initial volume of pose space is guessed, and all of
the correspondences compatible with this volume are first taken into account. Then the pose volume is
recursively reduced until it can be viewed as a single pose. As a Gaussian error model is used, boxes
of pose space are pruned not by counting the number of correspondences that are compatible with the
box as in [DeMenthon 1993], but on the basis of the probability of having an object model in the image
within the range of poses defined by the box.
Among the researchers who have addressed the full perspective problem, Wunsch and Hirzinger
[Wunsch 1996] formalize the abstract problem in a way similar to the approach advocated here as the
optimization of an objective function combining correspondence and pose constraints. However, the
correspondence constraints are not represented analytically. Instead, each model feature is explicitly
matched to the closest lines of sight of the image features. The closest 3D points on the lines of sight
are found for each model feature, and the pose that brings the model features closest to these 3D points
is selected; this allows an easier 3D to 3D pose problem to be solved. The process is repeated until a
minimum of the objective function is reached.
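
One iteration of that scheme can be sketched as follows (camera center at the origin, so each image feature defines a line of sight through the origin; the SVD/Procrustes solver for the 3D-to-3D step is an assumption about the implementation):

    import numpy as np

    def closest_on_ray(p, ray):
        d = ray / np.linalg.norm(ray)         # unit line-of-sight direction
        return d * np.dot(d, p)               # foot of p on that ray

    def procrustes(src, dst):
        # Least-squares rotation and translation taking src onto dst.
        cs, cd = src.mean(axis=0), dst.mean(axis=0)
        U, _, Vt = np.linalg.svd((src - cs).T @ (dst - cd))
        S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
        R = Vt.T @ S @ U.T
        return R, cd - R @ cs

    def wunsch_iteration(object_pts, rays, R, t):
        posed = object_pts @ R.T + t          # model under the current pose
        targets = []
        for p in posed:
            # nearest point on each line of sight; keep the closest line
            cand = [closest_on_ray(p, r) for r in rays]
            targets.append(min(cand, key=lambda q: np.linalg.norm(p - q)))
        dR, dt = procrustes(posed, np.asarray(targets))
        return dR @ R, dR @ t + dt            # composed, refined pose

Iterating this until the objective stops decreasing reproduces the behavior described above; like ICP, it can stall in local minima, which is one motivation for treating the correspondence constraints analytically as SoftPOSIT does.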
The object recognition approach of Beis [Beis 1999] uses view-variant 2D image features to index
3D object models. Off-line training is performed to learn 2D feature groupings associated with large
numbers of views of the objects. Then, the on-line recognition stage uses new feature groupings to
index into a database of learned model-to-image correspondence hypotheses, and these hypotheses are
used for pose estimation and verification.
The pose clustering approach to model-to-image registration is similar to the classic hypothesize-and-
test approach. Instead of testing each hypothesis as it is generated, all hypotheses are generated and
clustered in a pose space before any back-projection and testing takes place. This latter step is performed
only on poses associated with high-probability clusters. The idea is that hypotheses including only
correct correspondences should form larger clusters in pose space than hypotheses that include incorrect
correspondences. Olson [Olson 1997] gives a randomized algorithm for pose clustering whose time
complexity is O(MN³).
The method of Beveridge and Riseman [Beveridge 1992, Beveridge 1995] is also related to our ap-
proach. Random-start local search is combined with a hybrid pose estimation algorithm employing
both full-perspective and weak-perspective camera models. A steepest descent search in the space of
model-to-image line segment correspondences is performed. A weak-perspective pose algorithm is used
to rank neighboring points in this search space, and a full-perspective pose algorithm is used to update
the model’s pose after making a move to a new set of correspondences. The time complexity of this
algorithm was empirically determined to be polynomial in the numbers of model and image features.
When there are M object points and N image points, the dimension of the solution space for this
problem is M + 6, since there are M correspondence variables and 6 pose variables. Each correspon-
dence variable has the domain {1, 2, ..., N, ∅}, representing a match of an object point to one of the N
image points or to no image point (represented by ∅), and each pose variable has a continuous domain
determined by the allowed range of object translations and rotations. Most algorithms don't explicitly
search this (M + 6)-dimensional space, but instead assume that pose is determined by correspondences
or that correspondences are determined by pose, and so search either an M-dimensional or a
6-dimensional space. The SoftPOSIT approach is different in that its search alternates between these
two spaces.
The SoftPOSIT approach to solving the model-to-image registration problem applies the formalism
proposed by Gold, Rangarajan and others [Gold 1996, Gold 1998] when they solved the correspon-
dence and pose problem in matching two images or two 3D models. We extend it to the more difficult
problem of registration between a 3D model and its perspective image, which they did not address.
The SoftPOSIT algorithm integrates an iterative pose technique called POSIT (Pose from Orthography
and Scaling with ITerations) [DeMenthon 1995], and an iterative correspondence assignment technique
called softassign [Gold 1996, Gold 1998] into a single iteration loop. A global objective function is
defined that captures the nature of the problem in terms of both pose and correspondence and combines
the formalisms of both iterative techniques.

Citations
Book
30 Sep 2010
TL;DR: Computer Vision: Algorithms and Applications explores the variety of techniques commonly used to analyze and interpret images and takes a scientific approach to basic vision problems, formulating physical models of the imaging process before inverting them to produce descriptions of a scene.
Abstract: Humans perceive the three-dimensional structure of the world with apparent ease. However, despite all of the recent advances in computer vision research, the dream of having a computer interpret an image at the same level as a two-year old remains elusive. Why is computer vision such a challenging problem and what is the current state of the art? Computer Vision: Algorithms and Applications explores the variety of techniques commonly used to analyze and interpret images. It also describes challenging real-world applications where vision is being successfully used, both for specialized applications such as medical imaging, and for fun, consumer-level tasks such as image editing and stitching, which students can apply to their own personal photos and videos. More than just a source of recipes, this exceptionally authoritative and comprehensive textbook/reference also takes a scientific approach to basic vision problems, formulating physical models of the imaging process before inverting them to produce descriptions of a scene. These problems are also analyzed using statistical models and solved using rigorous engineering techniques Topics and features: structured to support active curricula and project-oriented courses, with tips in the Introduction for using the book in a variety of customized courses; presents exercises at the end of each chapter with a heavy emphasis on testing algorithms and containing numerous suggestions for small mid-term projects; provides additional material and more detailed mathematical topics in the Appendices, which cover linear algebra, numerical techniques, and Bayesian estimation theory; suggests additional reading at the end of each chapter, including the latest research in each sub-field, in addition to a full Bibliography at the end of the book; supplies supplementary course material for students at the associated website, http://szeliski.org/Book/. Suitable for an upper-level undergraduate or graduate-level course in computer science or engineering, this textbook focuses on basic techniques that work under real-world conditions and encourages students to push their creative boundaries. Its design and exposition also make it eminently suitable as a unique reference to the fundamental techniques and current research literature in computer vision.

4,146 citations

Proceedings Article
01 Jan 1989
TL;DR: A scheme is developed for classifying the types of motion perceived by a humanlike robot and equations, theorems, concepts, clues, etc., relating the objects, their positions, and their motion to their images on the focal plane are presented.
Abstract: A scheme is developed for classifying the types of motion perceived by a humanlike robot. It is assumed that the robot receives visual images of the scene using a perspective system model. Equations, theorems, concepts, clues, etc., relating the objects, their positions, and their motion to their images on the focal plane are presented.

2,000 citations

Proceedings ArticleDOI
07 Dec 2015
TL;DR: A scalable and overfit-resistant image synthesis pipeline, together with a novel CNN specifically tailored for the viewpoint estimation task, is proposed that can significantly outperform state-of-the-art methods on PASCAL 3D+ benchmark.
Abstract: Object viewpoint estimation from 2D images is an essential task in computer vision. However, two issues hinder its progress: scarcity of training data with viewpoint annotations, and a lack of powerful features. Inspired by the growing availability of 3D models, we propose a framework to address both issues by combining render-based image synthesis and CNNs (Convolutional Neural Networks). We believe that 3D models have the potential of generating a large number of images of high variation, which can be well exploited by deep CNNs with a high learning capacity. Towards this goal, we propose a scalable and overfit-resistant image synthesis pipeline, together with a novel CNN specifically tailored for the viewpoint estimation task. Experimentally, we show that the viewpoint estimation from our pipeline can significantly outperform state-of-the-art methods on the PASCAL 3D+ benchmark.

795 citations


Cites methods from "SoftPOSIT: Simultaneous Pose and Co..."

  • ...3D Object Detection Most 3D object detection methods are based on representing objects with discriminative features for points [4], patches [6] and parts [19, 29, 34], or by exploring topological structures [15, 2, 3]....


01 Jan 2010
TL;DR: A new voting-based object pose extraction algorithm that does not rely on 2D/3D feature correspondences and thus reduces the early-commitment problem plaguing the generality of traditional vision-based pose extraction algorithms is shown.
Abstract: Society is becoming more automated with robots beginning to perform most tasks in factories and starting to help out in home and office environments. One of the most important functions of robots is the ability to manipulate objects in their environment. Because the space of possible robot designs, sensor modalities, and target tasks is huge, researchers end up having to manually create many models, databases, and programs for their specific task, an effort that is repeated whenever the task changes. Given a specification for a robot and a task, the presented framework automatically constructs the necessary databases and programs required for the robot to reliably execute manipulation tasks. It includes contributions in three major components that are critical for manipulation tasks. The first is a geometric-based planning system that analyzes all necessary modalities of manipulation planning and offers efficient algorithms to formulate and solve them. This allows identification of the necessary information needed from the task and robot specifications. Using this set of analyses, we build a planning knowledge-base that allows informative geometric reasoning about the structure of the scene and the robot's goals. We show how to efficiently generate and query the information for planners. The second is a set of efficient algorithms considering the visibility of objects in cameras when choosing manipulation goals. We show results with several robot platforms using grippers cameras to boost accuracy of the detected objects and to reliably complete the tasks. Furthermore, we use the presented planning and visibility infrastructure to develop a completely automated extrinsic camera calibration method and a method for detecting insufficient calibration data. The third is a vision-centric database that can analyze a rigid object's surface for stable and discriminable features to be used in pose extraction programs. Furthermore, we show work towards a new voting-based object pose extraction algorithm that does not rely on 2D/3D feature correspondences and thus reduces the early-commitment problem plaguing the generality of traditional vision-based pose extraction algorithms. In order to reinforce our theoric contributions with a solid implementation basis, we discuss the open-source planning environment OpenRAVE, which began and evolved as a result of the work done in this thesis. We present an analysis of its architecture and provide insight for successful robotics software environments.

540 citations

Journal ArticleDOI
Xinyu Huang, Peng Wang, Cheng Xinjing, Dingfu Zhou, Qichuan Geng, Ruigang Yang
TL;DR: This paper provides a sensor fusion scheme integrating camera videos, consumer-grade motion sensors (GPS/IMU), and a 3D semantic map in order to achieve robust self-localization and semantic segmentation for autonomous driving.
Abstract: Autonomous driving has attracted tremendous attention especially in the past few years. The key techniques for a self-driving car include solving tasks like 3D map construction, self-localization, parsing the driving road and understanding objects, which enable vehicles to reason and act. However, large scale data set for training and system evaluation is still a bottleneck for developing robust perception models. In this paper, we present the ApolloScape dataset [1] and its applications for autonomous driving. Compared with existing public datasets from real scenes, e.g., KITTI [2] or Cityscapes [3], ApolloScape contains much larger and richer labelling including holistic semantic dense point cloud for each site, stereo, per-pixel semantic labelling, lanemark labelling, instance segmentation, 3D car instance, and highly accurate location for every frame in various driving videos from multiple sites, cities and daytimes. For each task, it contains at least 15x larger amount of images than SOTA datasets. To label such a complete dataset, we develop various tools and algorithms specified for each task to accelerate the labelling process, such as joint 3D-2D segment labeling, active labelling in videos etc. Based on ApolloScape, we are able to develop algorithms that jointly consider the learning and inference of multiple tasks. In this paper, we provide a sensor fusion scheme integrating camera videos, consumer-grade motion sensors (GPS/IMU), and a 3D semantic map in order to achieve robust self-localization and semantic segmentation for autonomous driving. We show that practically, sensor fusion and joint learning of multiple tasks are beneficial to achieve a more robust and accurate system. We expect our dataset and proposed relevant algorithms can support and motivate researchers for further development of multi-sensor fusion and multi-task learning in the field of computer vision.

396 citations


Cites background from "SoftPOSIT: Simultaneous Pose and Co..."

  • ...Usually in a large environment, a pose prior is required in order to obtain good estimation [43], [44]....


References
Journal ArticleDOI
TL;DR: New results are derived on the minimum number of landmarks needed to obtain a solution, and algorithms are presented for computing these minimum-landmark solutions in closed form that provide the basis for an automatic system that can solve the Location Determination Problem under difficult viewing.
Abstract: A new paradigm, Random Sample Consensus (RANSAC), for fitting a model to experimental data is introduced. RANSAC is capable of interpreting/smoothing data containing a significant percentage of gross errors, and is thus ideally suited for applications in automated image analysis where interpretation is based on the data provided by error-prone feature detectors. A major portion of this paper describes the application of RANSAC to the Location Determination Problem (LDP): Given an image depicting a set of landmarks with known locations, determine that point in space from which the image was obtained. In response to a RANSAC requirement, new results are derived on the minimum number of landmarks needed to obtain a solution, and algorithms are presented for computing these minimum-landmark solutions in closed form. These results provide the basis for an automatic system that can solve the LDP under difficult viewing

23,396 citations

Book
01 Jan 2000
TL;DR: In this article, the authors provide comprehensive background material and explain how to apply the methods and implement the algorithms directly in a unified framework, including geometric principles and how to represent objects algebraically so they can be computed and applied.
Abstract: From the Publisher: A basic problem in computer vision is to understand the structure of a real world scene given several images of it. Recent major developments in the theory and practice of scene reconstruction are described in detail in a unified framework. The book covers the geometric principles and how to represent objects algebraically so they can be computed and applied. The authors provide comprehensive background material and explain how to apply the methods and implement the algorithms directly.

15,558 citations

01 Jan 2001
Multiple View Geometry in Computer Vision

14,282 citations


"SoftPOSIT: Simultaneous Pose and Co..." refers methods in this paper

  • ...For six or more matches, linear and approximate nonlinear methods are generally used (DeMenthon and Davis, 1995; Fiore, 2001; Hartley and Zisserman, 2000; Horn, 1986; Lu et al., 2000)....


Book
01 Mar 1986
TL;DR: Robot Vision as discussed by the authors is a broad overview of the field of computer vision, using a consistent notation based on a detailed understanding of the image formation process, which can provide a useful and current reference for professionals working in the fields of machine vision, image processing, and pattern recognition.
Abstract: From the Publisher: This book presents a coherent approach to the fast-moving field of computer vision, using a consistent notation based on a detailed understanding of the image formation process. It covers even the most recent research and will provide a useful and current reference for professionals working in the fields of machine vision, image processing, and pattern recognition. An outgrowth of the author's course at MIT, Robot Vision presents a solid framework for understanding existing work and planning future research. Its coverage includes a great deal of material that is important to engineers applying machine vision methods in the real world. The chapters on binary image processing, for example, help explain and suggest how to improve the many commercial devices now available. And the material on photometric stereo and the extended Gaussian image points the way to what may be the next thrust in commercialization of the results in this area. Chapters in the first part of the book emphasize the development of simple symbolic descriptions from images, while the remaining chapters deal with methods that exploit these descriptions. The final chapter offers a detailed description of how to integrate a vision system into an overall robotics system, in this case one designed to pick parts out of a bin. The many exercises complement and extend the material in the text, and an extensive bibliography will serve as a useful guide to current research.

3,783 citations

Journal ArticleDOI
TL;DR: In this paper, it was shown that given an integer k ≥ 1, (1 + ϵ)-approximation to the k nearest neighbors of q can be computed in additional O(kd log n) time.
Abstract: Consider a set S of n data points in real d-dimensional space, R^d, where distances are measured using any Minkowski metric. In nearest neighbor searching, we preprocess S into a data structure, so that given any query point q ∈ R^d, the closest point of S to q can be reported quickly. Given any positive real ϵ, data point p is a (1 + ϵ)-approximate nearest neighbor of q if its distance from q is within a factor of (1 + ϵ) of the distance to the true nearest neighbor. We show that it is possible to preprocess a set of n points in R^d in O(dn log n) time and O(dn) space, so that given a query point q ∈ R^d and ϵ > 0, a (1 + ϵ)-approximate nearest neighbor of q can be computed in O(c_{d,ϵ} log n) time, where c_{d,ϵ} ≤ d⌈1 + 6d/ϵ⌉^d is a factor depending only on dimension and ϵ. In general, we show that given an integer k ≥ 1, (1 + ϵ)-approximations to the k nearest neighbors of q can be computed in additional O(kd log n) time.

2,813 citations