
SoftPOSIT: Simultaneous Pose and Correspondence Determination

TL;DR: A new algorithm, called SoftPOSIT, determines the pose of a 3D object from a single 2D image when correspondences between object points and image points are not known; empirical evidence suggests its asymptotic run-time complexity is better than that of previous methods by a factor of the number of image points.
Abstract: The problem of pose estimation arises in many areas of computer vision, including object recognition, object tracking, site inspection and updating, and autonomous navigation when scene models are available. We present a new algorithm, called SoftPOSIT, for determining the pose of a 3D object from a single 2D image when correspondences between object points and image points are not known. The algorithm combines the iterative softassign algorithm (Gold and Rangarajan, 1996; Gold et al., 1998) for computing correspondences and the iterative POSIT algorithm (DeMenthon and Davis, 1995) for computing object pose under a full-perspective camera model. Our algorithm, unlike most previous algorithms for pose determination, does not have to hypothesize small sets of matches and then verify the remaining image points. Instead, all possible matches are treated identically throughout the search for an optimal pose. The performance of the algorithm is extensively evaluated in Monte Carlo simulations on synthetic data under a variety of levels of clutter, occlusion, and image noise. These tests show that the algorithm performs well in a variety of difficult scenarios, and empirical evidence suggests that the algorithm has an asymptotic run-time complexity that is better than previous methods by a factor of the number of image points. The algorithm is being applied to a number of practical autonomous vehicle navigation problems including the registration of 3D architectural models of a city to images, and the docking of small robots onto larger robots.

Summary (4 min read)

1 Introduction

  • This paper presents an algorithm for solving the model-to-image registration problem, which is the task of determining the position and orientation (the pose) of a three-dimensional object with respect to a camera coordinate system, given a model of the object consisting of 3D reference points and a single 2D image of these points.
  • Solving the correspondence problem consists of finding matching image features and model features.
  • Projecting the model in the known pose into the original image, one can identify matches among the model features that project sufficiently close to an image feature (a minimal sketch of this verification step follows this list).
  • A global objective function is defined that captures the nature of the problem in terms of both pose and correspondence and combines the formalisms of both iterative techniques.
  • In the following sections, the authors examine each step of the method.
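
As noted above, the verification step is small enough to state concretely; below is a minimal sketch (Python with NumPy; the simple pinhole model and all names are illustrative assumptions, not the paper's code):

    import numpy as np

    def count_matches(R, t, object_pts, image_pts, f=1.0, tol=2.0):
        # Put object points into the camera frame and project them.
        cam = object_pts @ R.T + t               # (M, 3) camera coordinates
        proj = f * cam[:, :2] / cam[:, 2:3]      # (M, 2) pinhole projection
        # A projected point "matches" if it lands within tol of some
        # image point.
        d = np.linalg.norm(proj[:, None, :] - image_pts[None, :, :], axis=2)
        return int((d.min(axis=1) < tol).sum())

A hypothesized pose is accepted when this count is high enough; the same test reappears in the hypothesize-and-test and RANSAC discussions later on.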

2 POSIT Algorithm

  • One of the building blocks of the new algorithm is the POSIT algorithm, presented in detail in [DeMenthon 1995], which determines pose from known correspondences.
  • With a factor w different from 1, this image is scaled and approximates a perspective image, because the scaling is inversely proportional to the distance Tz from the camera center of projection to the object origin M0 (s = f/Tz).
  • These vectors can be found by singular value decomposition (SVD) (see the Matlab code in [DeMenthon 2001]).
  • Then the authors can solve the system of equations (4) again to obtain a refined pose.
  • This process is repeated, and the iteration is stopped when the process becomes stationary.
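
The bullets above compress the whole loop; the sketch below spells out one common POSIT formulation (Python with NumPy; Q1 and Q2 are the pose vectors solved from equations (4), and the normalization details should be checked against [DeMenthon 1995]):

    import numpy as np

    def posit(object_pts, image_pts, f, n_iter=20):
        # object_pts: (N, 3) points relative to the object origin M0;
        # image_pts:  (N, 2) corresponding image points; f: focal length.
        N = object_pts.shape[0]
        A = np.hstack([object_pts, np.ones((N, 1))])  # homogeneous points
        Ainv = np.linalg.pinv(A)      # pseudoinverse, computed via SVD
        w = np.ones(N)                # w = 1: start from scaled orthography
        for _ in range(n_iter):       # stop instead when w stops changing
            Q1 = Ainv @ (w * image_pts[:, 0])  # least-squares solve of (4)
            Q2 = Ainv @ (w * image_pts[:, 1])
            n1, n2 = np.linalg.norm(Q1[:3]), np.linalg.norm(Q2[:3])
            s = np.sqrt(n1 * n2)      # scale of the scaled orthography
            R1, R2 = Q1[:3] / n1, Q2[:3] / n2
            R3 = np.cross(R1, R2)     # third row completes the rotation
            Tz = f / s                # distance from camera to M0
            w = 1.0 + object_pts @ R3 / Tz     # refined corrections
        R = np.stack([R1, R2, R3])
        t = np.array([Q1[3] / s, Q2[3] / s, Tz])
        return R, t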

3 Geometry and Objective Function

  • In other words, the left-hand side of equation (4) represents the corrected image point vector (w x, w y) in the image plane.
  • In other words, the right-hand side of equation (4) represents the scaled orthographic projection (Q1 · P, Q2 · P) of the object point, also a vector in the image plane.
  • The least squares solution of equations (4) for pose enforces these constraints.
  • Therefore minimizing this objective function consists of minimizing the scaled sum of squared distances of object points to lines of sight, when distances are taken along directions parallel to the image plane.
  • The next iteration step finds the pose such that the scaled orthographic projection of each object point is as close as possible to its corrected image point.
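
In this notation, the squared image-plane distance between an object point P (homogeneous coordinates P̃, correction w) and the image point (x, y) it is compared against can be written as follows (a reconstruction consistent with the bullets above; the paper's exact symbols may differ):

    d² = (Q1 · P̃ − w·x)² + (Q2 · P̃ − w·y)²

Both differences lie in the image plane, which is exactly why the distance to a line of sight is measured along directions parallel to that plane.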

4 Pose Calculation with Unknown Correspondences

  • The m_jk are correspondence variables that define the assignments between image and object feature points; these must satisfy a number of correspondence constraints.
  • Note that when all the assignments are well-defined, this objective function becomes equivalent to the objective function defined in equation (5).
  • Compute the correction terms w using the pose vectors Q1 and Q2 just computed (as described in the previous section).
  • In EM, given a guess for the unknown parameters (the pose in their problem) and a set of observed data (the image points in their problem), the expected value of the unobserved variables (the correspondence matrix in their problem) is estimated.
  • This process is repeated until these estimates converge.
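
Concretely, with m_jk the assignment weight between image point j and object point k, and d²_jk the squared image-plane distance of Section 3 for that pair, the global objective has the form (this rendering is a reconstruction; α is a gain term that makes a match worthwhile only when d²_jk < α):

    E = Σ_j Σ_k m_jk (d²_jk − α)

Fixing the m_jk and minimizing over pose is the POSIT-like step; fixing the pose and updating the m_jk is the softassign step of Section 4.2. That alternation is the EM-style iteration described in the last two bullets.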

4.2 Correspondence Problem

  • Assume that the correspondence variables m_jk in the expression for the objective function E are known and fixed.
  • The assignment matrix must satisfy the constraint that each image point match at most one object point, and vice versa (i.e., Σ_k m_jk ≤ 1 for all j, and Σ_j m_jk ≤ 1 for all k).
  • The exponentiation has the effect of ensuring that all elements of the assignment matrix are positive.
  • See [Gold 1998] for an analytical justification.
  • This combination of deterministic annealing and Sinkhorn’s technique in an iteration loop was called softassign by Gold and Rangarajan [Gold 1996, Gold 1998].
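
A compact sketch of one softassign update as described above (Python with NumPy; β is the annealing parameter, α the gain from the objective function, and initializing the slack row and column to 1 = exp(0) is an assumption):

    import numpy as np

    def softassign(d2, beta, alpha, n_sinkhorn=30):
        # d2[j, k]: squared distance between image point j and object point k.
        J, K = d2.shape
        m = np.ones((J + 1, K + 1))                # slack row/column = exp(0)
        m[:J, :K] = np.exp(-beta * (d2 - alpha))   # exponentiation: all > 0
        for _ in range(n_sinkhorn):                # Sinkhorn normalization
            m[:J, :] /= m[:J, :].sum(axis=1, keepdims=True)  # real rows
            m[:, :K] /= m[:, :K].sum(axis=0, keepdims=True)  # real columns
        return m[:J, :K]

As β grows under deterministic annealing, the normalized rows and columns sharpen toward a 0-1 assignment, while at small β every pairing keeps some weight; this is why no hard hypothesis about matches is ever required.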

4.3 Pseudocode for SoftPOSIT

  • Initialize the pose vectors Q1 and Q2 using the expected pose or a random pose within the expected range.
  • Compute the squared distances d²_jk between the list of image points and the list of object points of the input.
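
Assembled from the pieces above, the loop body looks roughly like this (a sketch, not the paper's exact pseudocode; project_scaled_ortho and pose_from_soft_assignments are assumed helper names for computing the projections with their corrections w, and for the weighted least-squares solve of equations (4)):

    def softposit(object_pts, image_pts, f, pose0,
                  beta0=4e-4, beta_max=0.5, rate=1.05, alpha=1.0):
        R, t = pose0                          # initial guess (Section 5)
        beta = beta0
        while beta < beta_max:                # deterministic annealing loop
            proj, w = project_scaled_ortho(object_pts, R, t, f)
            # distances between corrected image points and projected points
            corrected = w[None, :, None] * image_pts[:, None, :]  # (J, K, 2)
            d2 = ((corrected - proj[None, :, :]) ** 2).sum(axis=-1)
            m = softassign(d2, beta, alpha)   # update correspondences
            R, t = pose_from_soft_assignments(object_pts, image_pts, m, w, f)
            beta *= rate                      # sharpen the assignments
        return R, t, m

The annealing constants are the same assumed values that reproduce the 147-iteration bound quoted in Section 5.3.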

5 Random Start SoftPOSIT

  • The SoftPOSIT algorithm described above performs a deterministic annealing search starting from an initial guess at the object’s pose.
  • The probability of finding the globally optimal object pose and correspondences starting from an initial guess depends on a number of factors including the accuracy of the initial guess, the number of object points, the number of image points, the amount of object occlusion, the amount of clutter in the image, and the image measurement noise.
  • A common method of searching for a global optimum, and the one used here, is to run the search algorithm starting from a number of different initial guesses, and keep the first solution that meets a specified termination criterion.
  • The authors' initial guesses range over [−π, π] for the three Euler rotation angles, and over a 3D space of translations known to contain the true translation.
  • The authors describe their procedure for generating initial guesses for pose when no knowledge of the correct pose is available, and then they discuss their termination criterion.

5.1 Generating Initial Guesses

  • Given an initial pose that lies in a valley of the cost function in the parameter space, the authors expect the algorithm to converge to the minimum associated with that valley.
  • To examine other valleys, the authors must start with points that lie in them.
  • These points are scaled to cover the expected ranges of translation and rotation.
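
A minimal sampler for such starting points (plain uniform sampling for simplicity; t_lo and t_hi bound the translation box and are application inputs):

    import numpy as np

    rng = np.random.default_rng(seed=0)

    def random_initial_pose(t_lo, t_hi):
        # Euler angles uniform over [-pi, pi]; translation uniform over a
        # 3D box known to contain the true translation.
        angles = rng.uniform(-np.pi, np.pi, size=3)
        trans = rng.uniform(np.asarray(t_lo), np.asarray(t_hi))
        return angles, trans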

5.2 Search Termination

  • Ideally, one would like to repeat the search from a new starting point whenever the number of object-to-image correspondences determined by the search is not maximal.
  • With real data, however, one usually does not know what this maximal number is.
  • Instead, the authors repeat the search when the number of object points that match image points is less than some threshold t.
  • Let the fraction of detected object features be f_d = (number of object points detected as image features) / (total number of object points).
  • A factor ρ accounts for measurement noise that typically prevents some detected object features from being matched even when a good pose is found.
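
Putting these two bullets together, the restart test can be sketched as follows (the symbols are reconstructions of garbled ones in the source, and ρ = 0.8 is purely illustrative):

    def search_done(num_matched, num_object_pts, f_d, rho=0.8):
        # Stop restarting once the matched count reaches rho * f_d * M:
        # the fraction f_d of object points expected to be detected,
        # discounted by rho for matches lost to measurement noise.
        return num_matched >= rho * f_d * num_object_pts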

5.3 Early Search Termination

  • The deterministic annealing loop of the SoftPOSIT algorithm iterates over a range of values of the annealing parameter β.
  • This means that the annealing loop can run for up to 147 iterations.
  • This measure is commonly used [Grimson 1991] at the end of a local search to determine if the current solution for correspondence and pose is good enough to end the search for the global optimum.
  • For each test, the values of the match ratio r computed at each iteration are recorded.
  • Once a SoftPOSIT iteration is completed, ground truth information is used to determine whether or not the correct pose was found.
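
The 147-iteration figure is consistent with a geometric schedule for β; the constants below are assumptions chosen to reproduce that count, and the early-exit test mirrors the match-ratio check just described:

    import math

    beta0, beta_max, rate = 4e-4, 0.5, 1.05       # assumed schedule
    n_iters = math.ceil(math.log(beta_max / beta0) / math.log(rate))
    assert n_iters == 147                         # the bound quoted above

    def keep_searching(match_ratio, iteration, r_min):
        # r_min[i]: the smallest match ratio observed at iteration i over
        # trials that ended in a correct pose (learned with ground truth);
        # dropping below it means this start can be abandoned early.
        return match_ratio >= r_min[iteration]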

6.1 Monte Carlo Evaluation

  • The random-start SoftPOSIT algorithm has been extensively evaluated in Monte Carlo simulations.
  • For more than 92% of the different combinations of simulation parameters, a good pose is found in 90% or more of the associated trials.
  • Again, except for the two highest occlusion and clutter cases, the mean number of starts is about constant or increases very slowly as the number of image points increases.
  • The RANSAC algorithm [Fischler 1981] is the best-known algorithm that computes an object's pose given non-corresponding 3D object points and 2D image points.
  • The authors compare the expected run time of SoftPOSIT to that of RANSAC for each of the simulated data sets discussed in Section 6.1.

6.3 Algorithm Complexity

  • The run-time complexity of a single invocation of SoftPOSIT is O(MN), where M is the number of object points and N is the number of image points; this is because the number of iterations of all of the loops in the pseudocode in Section 4.3 is bounded by a constant, and each line inside a loop is computed in time at most O(MN).
  • Then the run-time complexity of SoftPOSIT with random starts is O(MN²), since the number of starts required was observed to grow at most linearly with the number of image points.
  • This is a factor of N, the number of image points, better than the complexity of any published algorithm that solves the simultaneous pose and correspondence problem under a full-perspective camera model.
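
Assembled as a back-of-envelope check (the O(N) bound on the number of random starts is a conservative reading of the empirical observations above):

    one SoftPOSIT start:            O(MN)
    random starts needed:           O(N)
    SoftPOSIT with random starts:   O(MN) × O(N) = O(MN²)
    RANSAC (Appendix A):            O(MN³)
    improvement:                    O(MN³) / O(MN²) = N, the number of image points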

6.4.1 Autonomous Navigation Application

  • The SoftPOSIT algorithm is being applied to the problem of autonomous vehicle navigation through a city where a 3D architectural model of the city is registered to images obtained from an on-board video camera.
  • Thus far, the algorithm has been applied only to imagery generated by a commercial virtual reality system.
  • Figure 14 shows an image generated by this system and a world model projected into that image using the pose computed by SoftPOSIT.
  • Image feature points are automatically located in the image by detecting corners along the boundary of bright sky regions (a toy version of this detector is sketched after this list).
  • Then the object points that do fall into this estimated field are further culled by keeping only those that project near the detected skyline.
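
A toy version of that detector (thresholds are illustrative; a real implementation would smooth the skyline and suppress clustered responses):

    import numpy as np

    def skyline_corners(gray, sky_thresh=200, turn_thresh=0.5):
        sky = gray >= sky_thresh                  # bright pixels = sky
        skyline = np.argmax(~sky, axis=0).astype(float)  # first non-sky row
        slope = np.gradient(skyline)              # local skyline direction
        turn = np.abs(np.diff(slope))             # change of direction
        cols = np.where(turn > turn_thresh)[0] + 1
        return np.stack([cols, skyline[cols]], axis=-1)  # (x, y) corners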

6.4.2 Robot Docking Application

  • The robot docking application requires that a small robot drive onto a docking platform that is mounted on a larger robot.
  • To detect the corresponding points in the image, lines are first detected using a combination of the Canny edge detector, the Hough transform, and a sorting procedure used to rank the lines produced by the Hough transform; a sketch of this pipeline follows the list.
  • Figure 16 shows the lines and corner points detected in one image of the large robot.
  • In this test, 58% of the image points are clutter.
  • Figure 17a shows the initial guess generated by SoftPOSIT which led to the correct pose being found, and Figure 17b shows this correct pose.
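
A sketch of that line-and-corner pipeline with OpenCV (parameter values are illustrative, and Hough vote order stands in for the paper's sorting procedure):

    import cv2
    import numpy as np

    def dock_corner_points(gray, n_lines=10):
        edges = cv2.Canny(gray, 50, 150)          # edge map
        segs = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=60,
                               minLineLength=40, maxLineGap=5)
        segs = [] if segs is None else [s[0].astype(float) for s in segs[:n_lines]]
        corners = []
        for i in range(len(segs)):
            for j in range(i + 1, len(segs)):
                x1, y1, x2, y2 = segs[i]
                x3, y3, x4, y4 = segs[j]
                d = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
                if abs(d) < 1e-9:
                    continue                      # (near-)parallel lines
                a = x1 * y2 - y1 * x2             # intersect the two
                b = x3 * y4 - y3 * x4             # infinite lines
                corners.append(((a * (x3 - x4) - (x1 - x2) * b) / d,
                                (a * (y3 - y4) - (y1 - y2) * b) / d))
        return corners                            # candidate corner points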

7 Conclusions

  • The authors have developed and evaluated the SoftPOSIT algorithm for determining the poses of objects from images.
  • The correspondence and pose calculation combines into one efficient iterative process the softassign algorithm for determining correspondences and the POSIT algorithm for determining pose.
  • The algorithm has been tested on synthetic data for an autonomous navigation application, and the authors are currently collecting real imagery for further tests with this application.
  • The complexity of SoftPOSIT has been empirically determined to be O(MN²).
  • More data should be collected to further validate this claim.


SoftPOSIT: Simultaneous Pose and Correspondence
Determination
Philip David, Daniel DeMenthon, Ramani Duraiswami, and Hanan Samet
University of Maryland Institute for Advanced Computer Studies, College Park, MD 20742
Army Research Laboratory, 2800 Powder Mill Road, Adelphi, MD 20783-1197
Abstract
The problem of pose estimation arises in many areas of computer vision, including object recognition,
object tracking, site inspection and updating, and autonomous navigation when scene models are avail-
able. We present a new algorithm, called SoftPOSIT, for determining the pose of a 3D object from a
single 2D image when correspondences between model points and image points are not known. The
algorithm combines Gold’s iterative softassign algorithm [Gold 1996, Gold 1998] for computing corre-
spondences and DeMenthon’s iterative POSIT algorithm [DeMenthon 1995] for computing object pose
under a full-perspective camera model. Our algorithm, unlike most previous algorithms for pose de-
termination, does not have to hypothesize small sets of matches and then verify the remaining image
points. Instead, all possible matches are treated identically throughout the search for an optimal pose.
The performance of the algorithm is extensively evaluated in Monte Carlo simulations on synthetic data
under a variety of levels of clutter, occlusion, and image noise. These tests show that the algorithm per-
forms well in a variety of difficult scenarios, and empirical evidence suggests that the algorithm has an
asymptotic run-time complexity that is better than previous methods by a factor of the number of image
points. The algorithm is being applied to a number of practical autonomous vehicle navigation problems
including the registration of 3D architectural models of a city to images, and the docking of small robots
onto larger robots.

The support of NSF grants EAR-99-05844 and IIS-00-86116 is gratefully acknowledged.
1 Introduction
This paper presents an algorithm for solving the model-to-image registration problem, which is the task
of determining the position and orientation (the pose) of a three-dimensional object with respect to a
camera coordinate system, given a model of the object consisting of 3D reference points and a single 2D
image of these points. We assume that no additional information is available with which to constrain the
pose of the object or to constrain the correspondence of model features to image features. This is also
known as the simultaneous pose and correspondence problem.
Automatic registration of 3D models to images is an important problem. Applications include object
recognition, object tracking, site inspection and updating, and autonomous navigation when scene mod-
els are available. It is a difficult problem because it comprises two coupled problems, the correspondence
problem and the pose problem, each easy to solve only if the other has been solved first:
1. Solving the pose (or exterior orientation) problem consists of finding the rotation and translation
of the object with respect to the camera coordinate system. Given matching model and image fea-
tures, one can easily determine the pose that best aligns those matches. For three to five matches,
the pose can be found in closed form by solving sets of polynomial equations [Fischler 1981,
Haralick 1991, Horaud 1989, Yuan 1989]. For six or more matches, linear and nonlinear ap-
proximate methods are generally used [DeMenthon 1995, Fiore 2001, Hartley 2000, Horn 1986,
Lu 2000].
2. Solving the correspondence problem consists of finding matching image features and model fea-
tures. If the object pose is known, one can relatively easily determine the matching features. Pro-
jecting the model in the known pose into the original image, one can identify matches among
the model features that project sufficiently close to an image feature. This approach is typi-
cally used for pose verification, which attempts to determine how good a hypothesized pose is
[Grimson 1991].
The classic approach to solving these coupled problems is the hypothesize-and-test approach [Grimson 1990].
In this approach, a small set of image feature to model feature correspondences are first hypothesized.
Based on these correspondences, the pose of the object is computed. Using this pose, the model points
are back-projected into the image. If the original and back-projected images are sufficiently similar,
then the pose is accepted; otherwise, a new hypothesis is formed and this process is repeated. Perhaps
the best known example of this approach is the RANSAC algorithm [Fischler 1981] for the case that
no information is available to constrain the correspondences of model points to image points. When
three correspondences are used to determine a pose, a high probability of success can be achieved by the
RANSAC algorithm in O(MN³) time when there are N image points and M object points (see
Appendix A for details).
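
A sketch of this hypothesize-and-test loop (pose_from_3_points stands in for one of the closed-form three-point solvers cited above, returning up to four candidate poses; count_matches is the back-projection test sketched in the Introduction summary; both names are illustrative):

    import numpy as np

    def ransac_pose(object_pts, image_pts, f, n_trials=10000,
                    tol=2.0, accept_count=10):
        rng = np.random.default_rng(seed=0)
        best = (None, None, -1)
        for _ in range(n_trials):
            # hypothesize three random object-to-image correspondences
            obj = rng.choice(len(object_pts), size=3, replace=False)
            img = rng.choice(len(image_pts), size=3, replace=False)
            for R, t in pose_from_3_points(object_pts[obj], image_pts[img], f):
                # test: back-project all object points and count matches
                score = count_matches(R, t, object_pts, image_pts, f, tol)
                if score > best[2]:
                    best = (R, t, score)
                if score >= accept_count:         # sufficiently similar
                    return best
        return best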
The problem addressed here is one that is encountered when taking a model-based approach to the
object recognition problem, and as such has received considerable attention. (The other main approach
to object recognition is the appearance-based approach [Murase 1995] in which multiple views of the
object are compared to the image. However, since 3D models are not used, this approach doesn’t pro-
vide accurate object pose.) Many investigators (e.g., [Cass 1994, Cass 1998, Ely 1995, Jacobs 1992,
Lamdan 1988, Procter 1997]) approximate the nonlinear perspective projection via linear affine ap-
proximations. This is accurate when the relative depths of object features are small compared to the
distance of the object from the camera. Among the pioneer contributions were Baird’s tree-pruning
method [Baird 1985], with exponential time complexity for unequal point sets, and Ullman’s alignment
method [Ullman 1989], whose time complexity is a high-order polynomial in the numbers of object and
image points.
The geometric hashing method [Lamdan 1988] determines an object’s identity and pose using a hash-
ing metric computed from a set of image features. Because the hashing metric must be invariant to
camera viewpoint, and because there are no view-invariant image features for general 3D point sets (for
either perspective or affine cameras) [Burns 1993], this method can only be applied to planar scenes.
In [DeMenthon 1993], we proposed an approach using binary search by bisection of pose boxes in
two 4D spaces, extending the research of [Baird 1985, Cass 1992, Breuel 1992] on affine transforms,
but it had high-order complexity. The approach taken by Jurie [Jurie 1999] was inspired by our work
and belongs to the same family of methods. An initial volume of pose space is guessed, and all of
the correspondences compatible with this volume are first taken into account. Then the pose volume is
recursively reduced until it can be viewed as a single pose. As a Gaussian error model is used, boxes
of pose space are pruned not by counting the number of correspondences that are compatible with the
box as in [DeMenthon 1993], but on the basis of the probability of having an object model in the image
within the range of poses defined by the box.
Among the researchers who have addressed the full perspective problem, Wunsch and Hirzinger
[Wunsch 1996] formalize the abstract problem in a way similar to the approach advocated here as the
optimization of an objective function combining correspondence and pose constraints. However, the
correspondence constraints are not represented analytically. Instead, each model feature is explicitly
matched to the closest lines of sight of the image features. The closest 3D points on the lines of sight
are found for each model feature, and the pose that brings the model features closest to these 3D points
is selected; this allows an easier 3D to 3D pose problem to be solved. The process is repeated until a
minimum of the objective function is reached.
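
One iteration of that scheme can be sketched as follows (camera center at the origin, so each image feature defines a line of sight through the origin; the SVD/Procrustes solver for the 3D-to-3D step is an assumption about the implementation):

    import numpy as np

    def closest_on_ray(p, ray):
        d = ray / np.linalg.norm(ray)         # unit line-of-sight direction
        return d * np.dot(d, p)               # foot of p on that ray

    def procrustes(src, dst):
        # Least-squares rotation and translation taking src onto dst.
        cs, cd = src.mean(axis=0), dst.mean(axis=0)
        U, _, Vt = np.linalg.svd((src - cs).T @ (dst - cd))
        S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
        R = Vt.T @ S @ U.T
        return R, cd - R @ cs

    def wunsch_iteration(object_pts, rays, R, t):
        posed = object_pts @ R.T + t          # model under the current pose
        targets = []
        for p in posed:
            # nearest point on each line of sight; keep the closest line
            cand = [closest_on_ray(p, r) for r in rays]
            targets.append(min(cand, key=lambda q: np.linalg.norm(p - q)))
        dR, dt = procrustes(posed, np.asarray(targets))
        return dR @ R, dR @ t + dt            # composed, refined pose

Iterating this until the objective stops decreasing reproduces the behavior described above; like ICP, it can stall in local minima, which is one motivation for treating the correspondence constraints analytically as SoftPOSIT does.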
The object recognition approach of Beis [Beis 1999] uses view-variant 2D image features to index
3D object models. Off-line training is performed to learn 2D feature groupings associated with large
numbers of views of the objects. Then, the on-line recognition stage uses new feature groupings to
index into a database of learned model-to-image correspondence hypotheses, and these hypotheses are
used for pose estimation and verification.
The pose clustering approach to model-to-image registration is similar to the classic hypothesize-and-
test approach. Instead of testing each hypothesis as it is generated, all hypotheses are generated and
clustered in a pose space before any back-projection and testing takes place. This latter step is performed
only on poses associated with high-probability clusters. The idea is that hypotheses including only
correct correspondences should form larger clusters in pose space than hypotheses that include incorrect
correspondences. Olson [Olson 1997] gives a randomized algorithm for pose clustering whose time
complexity is O(MN³).
The method of Beveridge and Riseman [Beveridge 1992, Beveridge 1995] is also related to our ap-
proach. Random-start local search is combined with a hybrid pose estimation algorithm employing
both full-perspective and weak-perspective camera models. A steepest descent search in the space of
model-to-image line segment correspondences is performed. A weak-perspective pose algorithm is used
to rank neighboring points in this search space, and a full-perspective pose algorithm is used to update
the model’s pose after making a move to a new set of correspondences. The time complexity of this
algorithm was empirically determined to be polynomial in the numbers of model and image features.
When there are M object points and N image points, the dimension of the solution space for this
problem is M + 6, since there are M correspondence variables and 6 pose variables. Each correspon-
dence variable has the domain {1, 2, ..., N, ∅}, representing a match of an object point to one of the N
image points or to no image point (represented by ∅), and each pose variable has a continuous domain
determined by the allowed range of object translations and rotations. Most algorithms don't explicitly
search this (M + 6)-dimensional space, but instead assume that pose is determined by correspondences
or that correspondences are determined by pose, and so search either an M-dimensional or a
6-dimensional space. The SoftPOSIT approach is different in that its search alternates between these
two spaces.
The SoftPOSIT approach to solving the model-to-image registration problem applies the formalism
proposed by Gold, Rangarajan and others [Gold 1996, Gold 1998] when they solved the correspon-
dence and pose problem in matching two images or two 3D models. We extend it to the more difficult
problem of registration between a 3D model and its perspective image, which they did not address.
The SoftPOSIT algorithm integrates an iterative pose technique called POSIT (Pose from Orthography
and Scaling with ITerations) [DeMenthon 1995], and an iterative correspondence assignment technique
called softassign [Gold 1996, Gold 1998] into a single iteration loop. A global objective function is
defined that captures the nature of the problem in terms of both pose and correspondence and combines
the formalisms of both iterative techniques.

Citations
Book
30 Sep 2010
TL;DR: Computer Vision: Algorithms and Applications explores the variety of techniques commonly used to analyze and interpret images and takes a scientific approach to basic vision problems, formulating physical models of the imaging process before inverting them to produce descriptions of a scene.
Abstract: Humans perceive the three-dimensional structure of the world with apparent ease. However, despite all of the recent advances in computer vision research, the dream of having a computer interpret an image at the same level as a two-year old remains elusive. Why is computer vision such a challenging problem and what is the current state of the art? Computer Vision: Algorithms and Applications explores the variety of techniques commonly used to analyze and interpret images. It also describes challenging real-world applications where vision is being successfully used, both for specialized applications such as medical imaging, and for fun, consumer-level tasks such as image editing and stitching, which students can apply to their own personal photos and videos. More than just a source of recipes, this exceptionally authoritative and comprehensive textbook/reference also takes a scientific approach to basic vision problems, formulating physical models of the imaging process before inverting them to produce descriptions of a scene. These problems are also analyzed using statistical models and solved using rigorous engineering techniques Topics and features: structured to support active curricula and project-oriented courses, with tips in the Introduction for using the book in a variety of customized courses; presents exercises at the end of each chapter with a heavy emphasis on testing algorithms and containing numerous suggestions for small mid-term projects; provides additional material and more detailed mathematical topics in the Appendices, which cover linear algebra, numerical techniques, and Bayesian estimation theory; suggests additional reading at the end of each chapter, including the latest research in each sub-field, in addition to a full Bibliography at the end of the book; supplies supplementary course material for students at the associated website, http://szeliski.org/Book/. Suitable for an upper-level undergraduate or graduate-level course in computer science or engineering, this textbook focuses on basic techniques that work under real-world conditions and encourages students to push their creative boundaries. Its design and exposition also make it eminently suitable as a unique reference to the fundamental techniques and current research literature in computer vision.

4,146 citations

Proceedings Article
01 Jan 1989
TL;DR: A scheme is developed for classifying the types of motion perceived by a humanlike robot and equations, theorems, concepts, clues, etc., relating the objects, their positions, and their motion to their images on the focal plane are presented.
Abstract: A scheme is developed for classifying the types of motion perceived by a humanlike robot. It is assumed that the robot receives visual images of the scene using a perspective system model. Equations, theorems, concepts, clues, etc., relating the objects, their positions, and their motion to their images on the focal plane are presented.

2,000 citations

Proceedings ArticleDOI
07 Dec 2015
TL;DR: A scalable and overfit-resistant image synthesis pipeline, together with a novel CNN specifically tailored for the viewpoint estimation task, is proposed that can significantly outperform state-of-the-art methods on PASCAL 3D+ benchmark.
Abstract: Object viewpoint estimation from 2D images is an essential task in computer vision. However, two issues hinder its progress: scarcity of training data with viewpoint annotations, and a lack of powerful features. Inspired by the growing availability of 3D models, we propose a framework to address both issues by combining render-based image synthesis and CNNs (Convolutional Neural Networks). We believe that 3D models have the potential of generating a large number of images of high variation, which can be well exploited by deep CNNs with a high learning capacity. Towards this goal, we propose a scalable and overfit-resistant image synthesis pipeline, together with a novel CNN specifically tailored for the viewpoint estimation task. Experimentally, we show that the viewpoint estimation from our pipeline can significantly outperform state-of-the-art methods on the PASCAL 3D+ benchmark.

795 citations


Cites methods from "SoftPOSIT: Simultaneous Pose and Co..."

  • ...3D Object Detection Most 3D object detection methods are based on representing objects with discriminative features for points [4], patches [6] and parts [19, 29, 34], or by exploring topological structures [15, 2, 3]....


01 Jan 2010
TL;DR: A new voting-based object pose extraction algorithm that does not rely on 2D/3D feature correspondences and thus reduces the early-commitment problem plaguing the generality of traditional vision-based pose extraction algorithms is shown.
Abstract: Society is becoming more automated with robots beginning to perform most tasks in factories and starting to help out in home and office environments. One of the most important functions of robots is the ability to manipulate objects in their environment. Because the space of possible robot designs, sensor modalities, and target tasks is huge, researchers end up having to manually create many models, databases, and programs for their specific task, an effort that is repeated whenever the task changes. Given a specification for a robot and a task, the presented framework automatically constructs the necessary databases and programs required for the robot to reliably execute manipulation tasks. It includes contributions in three major components that are critical for manipulation tasks. The first is a geometric-based planning system that analyzes all necessary modalities of manipulation planning and offers efficient algorithms to formulate and solve them. This allows identification of the necessary information needed from the task and robot specifications. Using this set of analyses, we build a planning knowledge-base that allows informative geometric reasoning about the structure of the scene and the robot's goals. We show how to efficiently generate and query the information for planners. The second is a set of efficient algorithms considering the visibility of objects in cameras when choosing manipulation goals. We show results with several robot platforms using grippers cameras to boost accuracy of the detected objects and to reliably complete the tasks. Furthermore, we use the presented planning and visibility infrastructure to develop a completely automated extrinsic camera calibration method and a method for detecting insufficient calibration data. The third is a vision-centric database that can analyze a rigid object's surface for stable and discriminable features to be used in pose extraction programs. Furthermore, we show work towards a new voting-based object pose extraction algorithm that does not rely on 2D/3D feature correspondences and thus reduces the early-commitment problem plaguing the generality of traditional vision-based pose extraction algorithms. In order to reinforce our theoric contributions with a solid implementation basis, we discuss the open-source planning environment OpenRAVE, which began and evolved as a result of the work done in this thesis. We present an analysis of its architecture and provide insight for successful robotics software environments.

540 citations

Journal ArticleDOI
Xinyu Huang, Peng Wang, Cheng Xinjing, Dingfu Zhou, Qichuan Geng, Ruigang Yang
TL;DR: This paper provides a sensor fusion scheme integrating camera videos, consumer-grade motion sensors (GPS/IMU), and a 3D semantic map in order to achieve robust self-localization and semantic segmentation for autonomous driving.
Abstract: Autonomous driving has attracted tremendous attention especially in the past few years. The key techniques for a self-driving car include solving tasks like 3D map construction, self-localization, parsing the driving road and understanding objects, which enable vehicles to reason and act. However, large scale data set for training and system evaluation is still a bottleneck for developing robust perception models. In this paper, we present the ApolloScape dataset [1] and its applications for autonomous driving. Compared with existing public datasets from real scenes, e.g., KITTI [2] or Cityscapes [3], ApolloScape contains much larger and richer labelling including holistic semantic dense point cloud for each site, stereo, per-pixel semantic labelling, lanemark labelling, instance segmentation, 3D car instance, and highly accurate location for every frame in various driving videos from multiple sites, cities and daytimes. For each task, it contains at least 15x larger amount of images than SOTA datasets. To label such a complete dataset, we develop various tools and algorithms specified for each task to accelerate the labelling process, such as joint 3D-2D segment labeling, active labelling in videos etc. Based on ApolloScape, we are able to develop algorithms that jointly consider the learning and inference of multiple tasks. In this paper, we provide a sensor fusion scheme integrating camera videos, consumer-grade motion sensors (GPS/IMU), and a 3D semantic map in order to achieve robust self-localization and semantic segmentation for autonomous driving. We show that practically, sensor fusion and joint learning of multiple tasks are beneficial to achieve a more robust and accurate system. We expect our dataset and proposed relevant algorithms can support and motivate researchers for further development of multi-sensor fusion and multi-task learning in the field of computer vision.

396 citations


Cites background from "SoftPOSIT: Simultaneous Pose and Co..."

  • ...Usually in a large environment, a pose prior is required in order to obtain good estimation [43], [44]....


References
Journal ArticleDOI
TL;DR: New results are derived on the minimum number of landmarks needed to obtain a solution, and algorithms are presented for computing these minimum-landmark solutions in closed form that provide the basis for an automatic system that can solve the Location Determination Problem under difficult viewing.
Abstract: A new paradigm, Random Sample Consensus (RANSAC), for fitting a model to experimental data is introduced. RANSAC is capable of interpreting/smoothing data containing a significant percentage of gross errors, and is thus ideally suited for applications in automated image analysis where interpretation is based on the data provided by error-prone feature detectors. A major portion of this paper describes the application of RANSAC to the Location Determination Problem (LDP): Given an image depicting a set of landmarks with known locations, determine that point in space from which the image was obtained. In response to a RANSAC requirement, new results are derived on the minimum number of landmarks needed to obtain a solution, and algorithms are presented for computing these minimum-landmark solutions in closed form. These results provide the basis for an automatic system that can solve the LDP under difficult viewing

23,396 citations

Book
01 Jan 2000
TL;DR: In this article, the authors provide comprehensive background material and explain how to apply the methods and implement the algorithms directly in a unified framework, including geometric principles and how to represent objects algebraically so they can be computed and applied.
Abstract: From the Publisher: A basic problem in computer vision is to understand the structure of a real world scene given several images of it. Recent major developments in the theory and practice of scene reconstruction are described in detail in a unified framework. The book covers the geometric principles and how to represent objects algebraically so they can be computed and applied. The authors provide comprehensive background material and explain how to apply the methods and implement the algorithms directly.

15,558 citations

01 Jan 2001
Multiple View Geometry in Computer Vision

14,282 citations


"SoftPOSIT: Simultaneous Pose and Co..." refers methods in this paper

  • ...For six or more matches, linear and approximate nonlinear methods are generally used (DeMenthon and Davis, 1995; Fiore, 2001; Hartley and Zisserman, 2000; Horn, 1986; Lu et al., 2000)....


Book
01 Mar 1986
TL;DR: Robot Vision as discussed by the authors is a broad overview of the field of computer vision, using a consistent notation based on a detailed understanding of the image formation process, which can provide a useful and current reference for professionals working in the fields of machine vision, image processing, and pattern recognition.
Abstract: From the Publisher: This book presents a coherent approach to the fast-moving field of computer vision, using a consistent notation based on a detailed understanding of the image formation process. It covers even the most recent research and will provide a useful and current reference for professionals working in the fields of machine vision, image processing, and pattern recognition. An outgrowth of the author's course at MIT, Robot Vision presents a solid framework for understanding existing work and planning future research. Its coverage includes a great deal of material that is important to engineers applying machine vision methods in the real world. The chapters on binary image processing, for example, help explain and suggest how to improve the many commercial devices now available. And the material on photometric stereo and the extended Gaussian image points the way to what may be the next thrust in commercialization of the results in this area. Chapters in the first part of the book emphasize the development of simple symbolic descriptions from images, while the remaining chapters deal with methods that exploit these descriptions. The final chapter offers a detailed description of how to integrate a vision system into an overall robotics system, in this case one designed to pick parts out of a bin. The many exercises complement and extend the material in the text, and an extensive bibliography will serve as a useful guide to current research.

3,783 citations

Journal ArticleDOI
TL;DR: In this paper, it was shown that given an integer k ≥ 1, (1 + ϵ)-approximation to the k nearest neighbors of q can be computed in additional O(kd log n) time.
Abstract: Consider a set S of n data points in real d-dimensional space, R^d, where distances are measured using any Minkowski metric. In nearest neighbor searching, we preprocess S into a data structure, so that given any query point q ∈ R^d, the closest point of S to q can be reported quickly. Given any positive real ϵ, data point p is a (1 + ϵ)-approximate nearest neighbor of q if its distance from q is within a factor of (1 + ϵ) of the distance to the true nearest neighbor. We show that it is possible to preprocess a set of n points in R^d in O(dn log n) time and O(dn) space, so that given a query point q ∈ R^d and ϵ > 0, a (1 + ϵ)-approximate nearest neighbor of q can be computed in O(c_{d,ϵ} log n) time, where c_{d,ϵ} ≤ d⌈1 + 6d/ϵ⌉^d is a factor depending only on dimension and ϵ. In general, we show that given an integer k ≥ 1, (1 + ϵ)-approximations to the k nearest neighbors of q can be computed in additional O(kd log n) time.

2,813 citations