Journal ArticleDOI

Dense Omnidirectional RGB-D Mapping of Large-scale Outdoor Environments for Real-time Localization and Autonomous Navigation

TL;DR: An efficient and accurate 3D world representation that allows us to extend the boundaries of state‐of‐the‐art dense visual mapping to large scales via an omnidirectional key‐frame representation of the environment, which is able to synthesize photorealistic views of captured environments at arbitrary locations.
Abstract: This paper presents a novel method and innovative apparatus for building three-dimensional (3D) dense visual maps of large-scale unstructured environments for autonomous navigation and real-time localization. The main contribution of the paper is focused on proposing an efficient and accurate 3D world representation that allows us to extend the boundaries of state-of-the-art dense visual mapping to large scales. This is achieved via an omnidirectional key-frame representation of the environment, which is able to synthesize photorealistic views of captured environments at arbitrary locations. Locally, the representation is image-based (egocentric) and is composed of accurate augmented spherical panoramas combining photometric information (RGB), depth information (D), and saliency for all viewing directions at a particular point in space (i.e., a point in the light field). The spheres are related by a graph of six degree-of-freedom (DOF) poses (3 DOF translation and 3 DOF rotation) that are estimated through multiview spherical registration. It is shown that this world representation can be used to perform robust real-time localization in 6 DOF of any configuration of visual sensors within their environment, whether they be monocular, stereo, or multiview. Contrary to feature-based approaches, an efficient direct image registration technique is formulated. This approach directly exploits the advantages of the spherical representation by minimizing a photometric error between a current image and a reference sphere. Two novel multicamera acquisition systems have been developed and calibrated to acquire this information, and this paper reports for the first time the second system. Given the robustness and efficiency of this representation, field experiments demonstrating autonomous navigation and large-scale mapping will be reported in detail for challenging unstructured environments, containing vegetation, pedestrians, varying illumination conditions, trams, and dense traffic.

Summary (8 min read)

1 Introduction

  • Autonomous navigation in complex and dynamic unstructured environments over large distances is an active field in robotics.
  • This representation makes it possible to efficiently perform online localisation and navigation using state-of-the-art dense registration techniques (Comport et al., 2007; Meilland et al., 2010; Meilland et al., 2012).
  • This paper has been structured in such a way that the relevant literature review is included at the beginning of each section.
  • Finally, Section 6 will detail several field experiments.

2 Visual localisation

  • This feature extraction and matching process is often badly conditioned, noisy, and not robust; it must therefore rely on higher-level robust estimation techniques and on filtering.
  • Direct approaches (image-based), however, do not rely on this feature extraction and matching process.
  • The camera motion is directly estimated by minimising a non-linear intensity error between images, via a parametric warping function.
  • In (Comport et al., 2007) direct approaches were generalised to use the full-image densely and track 6 dof pose using stereo cameras whilst mapping the environment through dense stereo matching.

2.1 Direct 3D registration

  • Under full brightness consistency assumption, the current image intensities can be warped via novel view synthesis (Avidan and Shashua, 1997) onto the reference view such that I∗(P∗) = I (w(T;Z∗,P∗)) , (2) where the warping function w(·) transfers the current image intensities onto the reference pixels P∗ via the depth-map Z∗.
  • This warping function depends on the acquisition sensor, and can be, for example, a perspective projection, an omnidirectional projection or a spherical projection.
  • Now suppose that only a close approximation T̂ of the true pose is available; the goal is then to estimate the remaining incremental pose that corrects it.
  • A robust weighting function rejects outliers such as moving objects and local illumination changes between the images (a minimal sketch of such a robustly weighted error is given below).
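To make this concrete, the following Python/NumPy sketch (not the authors' code; the Huber-style weighting and all names are illustrative assumptions) evaluates a robustly weighted photometric error between the reference intensities and the warped current intensities:

    import numpy as np

    def huber_weights(residual, k=1.345):
        # M-estimator weights: 1 inside the inlier band, decaying as k/|r| outside.
        sigma = 1.4826 * np.median(np.abs(residual - np.median(residual))) + 1e-12
        r = np.abs(residual) / sigma
        w = np.ones_like(r)
        w[r > k] = k / r[r > k]
        return w

    def robust_photometric_error(I_ref, I_cur_warped):
        # e = I*(P*) - I(w(T; Z*, P*)), evaluated on the reference pixels, with
        # per-pixel weights that down-weight moving objects and local lighting changes.
        e = (np.asarray(I_ref, float) - np.asarray(I_cur_warped, float)).ravel()
        w = huber_weights(e)
        return e, w, float(np.sum(w * e**2))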

2.2 Non-linear minimisation

  • An efficient second-order minimisation (ESM) is employed (Malis, 2004), which allows most parts of the minimisation to be pre-computed directly on the reference image.
  • The pose estimate is updated at each step by a homogeneous transformation: T̂ ← T̂T(x), (7) and the minimisation is iterated until the increments on x are sufficiently small (a sketch of this iteration is given below).
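A minimal sketch of that update loop, under the simplifying assumptions of a first-order pose increment and a user-supplied callback returning residuals, Jacobian and robust weights (this is a generic Gauss-Newton-style loop, not the ESM formulation itself):

    import numpy as np

    def se3_increment(x):
        # First-order approximation of the exponential map: adequate for the small
        # increments x = (translation, rotation) produced by the inner loop.
        wx, wy, wz = x[3], x[4], x[5]
        T = np.eye(4)
        T[:3, :3] += np.array([[0.0, -wz, wy], [wz, 0.0, -wx], [-wy, wx, 0.0]])
        T[:3, 3] = x[:3]
        return T

    def iterative_pose_estimation(residual_and_jacobian, T_init, max_iters=30, tol=1e-8):
        # Repeatedly solve the weighted normal equations and apply T <- T * T(x),
        # stopping when the increment x becomes sufficiently small.
        T = T_init.copy()
        for _ in range(max_iters):
            e, J, w = residual_and_jacobian(T)    # residuals (N,), Jacobian (N,6), weights (N,)
            H = J.T @ (w[:, None] * J)            # 6x6 weighted (approximate) Hessian
            g = J.T @ (w * e)                     # weighted gradient
            x = -np.linalg.solve(H, g)
            T = T @ se3_increment(x)
            if np.linalg.norm(x) < tol:
                break
        return T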

2.3 Multiple key-frame registration

  • In Section 2.1, only one reference image was considered.
  • This approach is highly incremental and prone to drift and linearisation approximation.
  • If more than one reference image is available then it is possible to consider global bundle adjustment approaches (Triggs et al., 2000), however, these approaches are not computationally efficient for real-time applications.
  • Furthermore, in large scale mapping approaches, the connectivity between the various camera poses is relatively sparse meaning that performing global bundle adjustment is not efficient.
  • Alternatively, it is possible to consider sliding window bundle adjustment (Mouragnon et al., 2006; Sibley et al., 2008), which improves performance at little cost.

2.4 Multi-resolution approach

  • One of the major drawbacks of direct approaches, and more generally of iterative approaches, is that the initial pose T̂ must be close to the solution T̃ to converge.
  • The convergence domain and the algorithm performance can be improved using a coarse-to-fine optimisation strategy.
  • This is achieved using multi-resolution image pyramids (e.g. constructed by Gaussian filtering and sub-sampling (Burt and Adelson, 1983)).
  • The minimisation begins at the lowest resolution and the result is used to initialise the next level repeatedly until the highest resolution is reached.
  • This greatly improves the convergence domain and speed, and some local minima can be avoided (a sketch of the coarse-to-fine strategy is given below).
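The coarse-to-fine strategy can be sketched as follows (illustrative only; the 5-tap binomial kernel and the registration callback are assumptions):

    import numpy as np

    def gaussian_pyramid(image, levels=3):
        # Build a pyramid by separable Gaussian-like blurring and sub-sampling by two,
        # returned coarsest level first.
        kernel = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0
        pyramid = [np.asarray(image, float)]
        for _ in range(levels - 1):
            img = pyramid[-1]
            blurred = np.apply_along_axis(np.convolve, 0, img, kernel, mode='same')
            blurred = np.apply_along_axis(np.convolve, 1, blurred, kernel, mode='same')
            pyramid.append(blurred[::2, ::2])
        return pyramid[::-1]

    def coarse_to_fine(register, ref_pyramid, cur_pyramid, T_init):
        # Run the registration at every resolution, initialising each level with the
        # pose estimated at the previous (coarser) one.
        T = T_init
        for I_ref, I_cur in zip(ref_pyramid, cur_pyramid):
            T = register(I_ref, I_cur, T)
        return T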

3.1 Parametric models

  • Many approaches in the literature use parametric models to efficiently represent the environment and store prior information about the scene.
  • Several studies have been conducted to improve localisation techniques for larger scale environments using parametric models.
  • This type of representation is heavily influenced by 3D modelling tools provided by the computer graphics community, which are aimed at rendering visually pleasing scenes rather than at representing the sensor data as faithfully as possible.

3.2 Image-based models

  • Key-frames, acquired during a learning phase, are stored in a database, and the sensor data remains in its raw, unmodified form.
  • The local information contained in the key-frames is then used to estimate the position of a camera navigating within the pre-mapped environment.
  • Unlike the parametric methods, these maps are much simpler to acquire and they provide greater accuracy and robustness since the raw sensor data has not been approximated and a direct image error can be minimised.
  • In the literature, the proposed quantitative localisation methods using key-frames all rely on feature-based techniques (Royer et al., 2005; Courbon et al., 2008).

3.3 Local dense representation: augmented visual sphere

  • In the representation proposed in this paper, a local area in space is defined by an augmented spherical key-frame which captures the light field of all viewing directions from a particular point in 3D space along with its depth map.
  • This extends robust and accurate state-of-the-art dense full-image approaches (Comport et al., 2007) to their most general form (i.e., all viewing directions are densely modelled).
  • This image is obtained from the custom camera system presented in Section 4.3 by warping multiple images onto the sphere.
  • These points have been sampled uniformly on the sphere as in (Meilland et al., 2010).
  • ZS denotes the depths associated with each pixel, which have been obtained from dense stereo matching (a sketch of the resulting key-frame data structure is given below).
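A plausible (hypothetical) data layout for such an augmented key-frame, together with a regular stand-in for the sphere sampling described above, might look like this:

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class AugmentedSphere:
        # One key-frame of the graph: an ego-centric augmented spherical panorama.
        intensities: np.ndarray   # I_S : photometric values of the sphere samples
        depths: np.ndarray        # Z_S : depth along each viewing ray (dense stereo)
        saliency: np.ndarray      # W_S : per-pixel saliency used for pixel selection
        pose: np.ndarray          # 4x4 pose of the sphere centre (a vertex of the graph)

    def sphere_directions(n_rows=64, n_cols=128):
        # Viewing directions on the unit sphere in polar coordinates; the paper uses a
        # uniform sampling, this regular latitude/longitude grid is only a stand-in.
        theta = np.linspace(0.0, np.pi, n_rows)                     # polar angle
        phi = np.linspace(-np.pi, np.pi, n_cols, endpoint=False)    # azimuth
        T, P = np.meshgrid(theta, phi, indexing='ij')
        return np.stack([np.sin(T) * np.cos(P),
                         np.sin(T) * np.sin(P),
                         np.cos(T)], axis=-1)                        # (n_rows, n_cols, 3)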

3.4 Global topologic representation: omnidirectional key-frame graph

  • The main advantages of this spherical representation are the following:
  • This enhances the performance of direct registration techniques (accuracy, robustness, convergence speed and domain of convergence).
  • The map can be accessed in constant time independently of the graph size.
  • A spherical representation provides all local view directions and therefore provides a general repre- sentation that can encode data from different kinds of sensors like perspective cameras, multi-view cameras or omnidirectional cameras and laser range finders.
  • Full-view sensors condition the observability of 3D motion well (Baker et al., 2001) which greatly improves robustness.

3.5 Global spherical key-frame positioning

  • Considering S∗, an augmented sphere defined in Section 3.3, the objective is to compute the pose between a reference sphere S∗ and the next one S.
  • The localisation problem is solved using a dense visual odometry approach which uses a direct 3D image registration technique that provides accurate pose estimation.
  • Since this is a local optimisation approach it is assumed that the camera frame-rate is high (30Hz) and that inter-frame displacements are small.
  • The direct minimisation procedure was given in Section 2.
  • The main difference with respect to equation (4) is that the sphere is sampled, uniformly if possible, in polar spherical coordinates.

3.5.1 Spherical node selection

  • Indeed, the vertices of the graph should be carefully placed in the world so as to represent the environment with little redundancy.
  • In practice, the trajectory of the acquisition system along a sequence is computed by integrating elementary displacements estimated from successive spherical registration.
  • Therefore a new reference sphere is placed according to the Median Absolute Deviation (MAD): λ < Median(|e − Median(e)|), (12) where e is the vector of intensity errors between the current sphere and the previous set of reference spheres defined in (11) at the end of the minimisation (a sketch of this decision rule is given below).
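A sketch of this decision rule (the threshold value is hypothetical and depends on how intensities are normalised):

    import numpy as np

    def needs_new_reference_sphere(intensity_error, lam=0.1):
        # Criterion (12): a new reference sphere is placed once the Median Absolute
        # Deviation of the registration error exceeds the threshold lambda.
        e = np.asarray(intensity_error, float)
        mad = np.median(np.abs(e - np.median(e)))
        return mad > lam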

3.5.2 Loop closure detection and drift compensation

  • Since a dense spherical visual odometry technique was used, small amounts of drift are integrated along the sequence (typically 1%), which could cause the final graph to become inconsistent or redundant (i.e., a route mapped twice in opposite directions).
  • To correct the drift and remove redundant spheres, the spherical loop closing technique proposed by (Chapoulie et al., 2011) was used.
  • This method is based on SIFT descriptors augmented by their distribution across the sphere which is represented by histograms.
  • The corresponding figure shows the corrected trajectory taking loop closures into account, and details a place where loop closures are detected from perpendicular points of view.

3.6 Information selection

  • The essence of direct approaches is to minimise pixel intensities directly between images.
  • One practical issue is that if spheres are constructed from a panoramic multi-camera system at high resolution, the number of pixels to warp at each iteration of the real-time minimisation process could be extremely large.
  • The aim being to select points which best condition the six degrees of freedom of the pose.
  • The first three images (|J1|, |J2| and |J3|) show the translational part: close 3D measurements in the scene have a strong gradient.
  • During the learning phase, J(x̃) is pre-computed for each reference image and all the pixels are sorted according to (15) (a sketch of this selection is given below).
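A simplified sketch of this selection step, where the best-conditioning pixels are taken per column of the Nx6 photometric Jacobian (the paper's criterion (15) combines the columns differently; this per-dof top-k is an approximation):

    import numpy as np

    def select_salient_pixels(J, n_select):
        # Keep the pixels that best constrain the six pose parameters: for each column
        # of the Nx6 Jacobian, retain the entries of largest magnitude.
        n_per_dof = max(1, n_select // J.shape[1])
        selected = set()
        for dof in range(J.shape[1]):
            order = np.argsort(-np.abs(J[:, dof]))
            selected.update(order[:n_per_dof].tolist())
        return np.array(sorted(selected))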

4 Spherical sensors

  • The spherical key-frame model presented in the previous section requires an adequate system for acquiring these types of measurements.
  • As such, several classic omnidirectional and panoramic camera systems are analysed for this purpose and two new multi-view RGB-D camera systems are presented.

4.1 Photometric sphere

  • Capturing omnidirectional views of the environment can be achieved using either omnidirectional catadioptric cameras (Nayar, 1997) or with multiple perspective cameras.
  • It, however, suffers from a low and non-uniform spatial resolution, and a limited vertical field of view (half-sphere, < 90°).
  • This kind of sensor is therefore not efficient for outdoor mapping.
  • Multi-view stitching techniques allow high-resolution panoramic images to be built by stitching and blending perspective images together (Szeliski, 2006).
  • Those images can be acquired using several cameras (Baker et al., 2001), or from a sequence of images (Lovegrove and Davison, 2010).

4.2 Depth sphere acquisition

  • Passive systems are capable of simultaneously capturing a dense depth-map from cameras alone without any additional system.
  • Multi-view spherical systems have been also widely studied in the literature, however, it is generally assumed that all cameras view the same object and that the cameras are all converging to that point (Kutulakos and Seitz, 2000).
  • Active systems require projecting laser or light patterns onto the environment to obtain dense depth measurements.
  • Active RGB-D sensors such as the Microsoft Kinect and Asus Xtion are currently popular in the robotics community.
  • This kind of sensor is, however, sensitive to sunlight and is therefore limited to indoor mapping.

4.3 Spherical acquisition system

  • To acquire augmented visual spheres, for large-scale environment mapping, a first multi-camera system has been initially proposed in (Meilland et al., 2011a).
  • This system will not be detailed here but the interested reader can refer to that article.
  • This compact configuration allowed spherical photometric panoramas to be built, with depth information associated with each pixel of the sphere.
  • Dense matching is much more robust since the sensor is composed of three classic fronto-parallel stereo pairs back to back.
  • In practice those effects are only visible for very close objects (e.g., < 1 m), which is not an issue when mapping outdoor environments.

5 Real-time localisation

  • This section aims at presenting several techniques that have been used to improve the real-time performance of 6 dof pose estimation with respect to the previously described dense key-frame graph of augmented spherical images.
  • First of all, it will be shown how the graph is used for real-time monocular pose estimation in an asymmetric manner (i.e. a single perspective image is compared to the closest augmented spherical image).
  • Following that a multiple reference sphere localisation that provides a smooth localisation will be proposed along with a technique that improves robustness in dynamic environments.
  • Finally, an approach that first estimates only the rotation before refining the full pose will be given; this greatly improves convergence speed.

5.1.1 Online sphere selection

  • It is assumed that during online navigation, a current image I, captured by a generic camera (e.g. monocular, stereo or omnidirectional) and an initial guess T̂G of the current camera position within the graph are available.
  • Contrary to non-spherical approaches, a sphere provides all viewing directions and therefore it is not necessary to consider the rotational distance (to ensure image overlap).
  • In particular this avoids choosing a reference sphere that has a similar rotation but a large translational difference, which induces self-occlusions of buildings and also differences in image resolution caused by distance (which affects direct registration methods); a sketch of this selection is given below.
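The selection therefore reduces to a nearest-neighbour query on the translational part of the graph poses, for example (names are hypothetical):

    import numpy as np

    def closest_reference_sphere(T_guess, sphere_poses):
        # Choose the sphere whose centre is nearest in translation to the current pose
        # guess; rotation is ignored because a sphere covers all viewing directions.
        p = T_guess[:3, 3]
        distances = [np.linalg.norm(p - S[:3, 3]) for S in sphere_poses]
        return int(np.argmin(distances))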

5.1.2 Efficient minimisation

  • The warping function w(·) transfers the current image intensities onto the reference sphere via both the perspective and spherical warping functions given in equations (26) and (29), using the depth information ZS of the reference sphere.
  • The function s(·), introduced in Section 3.6, selects only informative pixels, with respect to a saliency map WS which is already pre-computed on the reference sphere and stored in the graph.
  • This selection speeds up the tracking algorithm without degrading either the observability of 3D motion or the accuracy.

5.1.3 Smooth localisation

  • During this localisation phase, the live current image is continuously registered onto the closest reference sphere, which is retained for localisation until a new sphere is selected according to criterion (17).
  • This can create discontinuities in the trajectory when a new reference sphere is chosen.
  • Note that (8) can easily be expanded to more reference spheres, but it appears experimentally that two spheres are sufficient for smooth and robust localisation.
  • Indeed, using reference images that are too distant from the current view introduces additional occlusions, which reduces the efficiency of the minimisation.
  • Figure 9 compares a single node localisation (red) with a multi-node localisation (blue).

5.2 Efficient estimation of local 3D rotations

  • Indeed, for a pure rotation, the pixel’s motion is independent of the scene geometry.
  • For a vision-based localisation algorithm, large rotations between successive frames are often a cause of failure.
  • Since the coarsest pyramid images are generated by successive Gaussian filtering and sub-sampling, rotations are clearly dominant in them and translations can be neglected.
  • This error can be efficiently minimised as described in Section 2.2, and allows large angular motions to be recovered quickly (a sketch of the rotation-only warp is given below).
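A sketch of the rotation-only warp this relies on (nearest-neighbour sampling, and the latitude/longitude direction grid from the earlier sphere sketch, are simplifying assumptions):

    import numpy as np

    def warp_rotation_only(I_sphere, directions, R):
        # Warp a spherical image by a pure rotation: rotate every viewing ray and sample
        # the image at the rotated direction. No depth is needed, since pixel motion
        # under a pure rotation is independent of the scene geometry.
        h, w = I_sphere.shape
        d = directions.reshape(-1, 3) @ R.T                      # rotated rays
        theta = np.arccos(np.clip(d[:, 2], -1.0, 1.0))           # polar angle of each ray
        phi = np.arctan2(d[:, 1], d[:, 0])                       # azimuth of each ray
        rows = np.clip(np.round(theta / np.pi * (h - 1)).astype(int), 0, h - 1)
        cols = (((phi + np.pi) / (2.0 * np.pi)) * w).astype(int) % w
        return I_sphere[rows, cols].reshape(h, w)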

5.3 Dynamic environment mapping

  • In practice, outdoor environments exhibit very challenging dynamic effects that have up until now been handled as outliers; however, when both extreme illumination variations and dynamic moving objects are present in the scene, the robust estimator breaks down.
  • The current image is therefore registered both with the model I∗ and with the last registered image Iwt−1.
  • In this case the model-based error (eMB) criterion ensures that the pose estimation does not drift but is not able to handle illumination change, while the visual odometry error (eVO) is invariant to the large illumination change (the inter-frame variation at 45 Hz is considered negligible).
  • The scene contains both global illumination changes and local illumination changes, as well as dynamic occlusions (e.g., moving cars, bicycles, pedestrians).
  • The associated M-estimator weights DMB reject most of the information.

5.4 Initialisation and tracking failure

  • To handle initialisation and tracking failure, a simple technique is employed.
  • The current camera pose with respect to the reference sphere is assumed to be close to the identity.
  • Dense registration is then performed with all the spheres of the graph, using the smallest image size in the pyramid (see the sketch below).
  • Even if this initialisation step does not scale well to large databases, it needs less than 30 ms per reference sphere, i.e., 9 seconds for a 300-sphere graph.
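A sketch of this brute-force initialisation, assuming a coarse registration callback that returns a pose and a residual error:

    import numpy as np

    def global_initialisation(register_coarse, I_cur_coarse, spheres):
        # Register the coarsest current image against every sphere of the graph and
        # keep the one yielding the lowest residual error.
        best_index, best_error, best_pose = -1, np.inf, None
        for i, sphere in enumerate(spheres):
            T, error = register_coarse(I_cur_coarse, sphere, np.eye(4))  # start at identity
            if error < best_error:
                best_index, best_error, best_pose = i, error, T
        return best_index, best_pose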

6 Field experiments

  • Several large scale field experiments involving each step of the proposed framework are presented.
  • Firstly dense spherical environment mapping experiments are presented.
  • Following this, real-time localisation and autonomous navigation experiments are detailed in the context of a national urban autonomous navigation challenge.
  • Due to the overall complexity of the system and the breakup into a map learning phase and an online real-time navigation phase, it is suggested that the reader view the video associated with this paper before proceeding.
  • The video illustrating these experiments can be found with the paper submission or alternatively at the following address.

6.1 Implementation

  • A real-time implementation of the online localisation algorithm has been made in C++.
  • Most parts of the code are SSE optimised (Streaming SIMD Extensions), in order to take advantage of the native vector instructions of modern processors.
  • It involves computing a Gaussian pyramid of the current image, warping the current image, computing robust statistics (median, MAD) using histograms, and computing the weighted Gauss-Newton Hessian and gradient approximations (see Section 2.2).
  • The pose increment is then obtained using a Cholesky factorisation of the 6 × 6 matrix, and the minimisation is iterated until convergence (a sketch of this solve is given below).
  • Since the localisation algorithm is based on an iterative minimisation, the number of iterations can be adjusted directly to the desired camera frame-rate.
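The per-iteration solve can be sketched as follows (weights, Jacobian and residuals as produced by the robust minimisation; the explicit Cholesky factorisation mirrors the description above):

    import numpy as np

    def pose_increment(J, e, w):
        # Weighted Gauss-Newton step: solve (J^T W J) x = -J^T W e for the 6-vector x
        # through an explicit Cholesky factorisation of the 6x6 normal matrix.
        JW = J * w[:, None]
        H = JW.T @ J                       # 6x6 approximate Hessian
        g = JW.T @ e                       # gradient
        L = np.linalg.cholesky(H)          # H = L L^T
        y = np.linalg.solve(L, -g)         # forward substitution
        return np.linalg.solve(L.T, y)     # back substitution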

6.2 Storage cost and memory management

  • For each sphere a 3 level multi-resolution pyramid is also pre-computed.
  • A better memory compression could be achieved using a lossless image compression algorithm.
  • Whilst the camera moves in the environment, a parallel CPU thread is used to stream out the farthest spheres and to stream in new spheres with respect to the current camera pose.
  • This allows seamless navigation in very large maps (a sketch of this streaming policy is given below).
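A sketch of such a distance-based streaming policy (the 50 m radius is a hypothetical value):

    import numpy as np

    def update_working_set(current_pose, sphere_poses, loaded, radius=50.0):
        # Keep only the spheres within `radius` metres of the camera: the parallel
        # thread streams in the spheres that entered this range and streams out the
        # ones that left it.
        p = current_pose[:3, 3]
        nearby = {i for i, S in enumerate(sphere_poses)
                  if np.linalg.norm(p - S[:3, 3]) < radius}
        return nearby - loaded, loaded - nearby   # (to stream in, to stream out)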

6.3.2 Photo-realistic virtual navigation

  • The final spherical representation can be used to synthesise photo-realistic virtual images, using a technique similar to those developed in image-based rendering ((Gortler et al., 1996; Levoy and Hanrahan, 1996; Debevec et al., 1996)).
  • This basic way of interacting with the proposed world representation illustrates the richness and quality of the model and also suggests possible alternative applications.
  • In order to generate images, a virtual camera is controlled manually by the user (keyboard and mouse) to define the 6 dof pose within the graph.
  • The closest RGB-D sphere to the camera is then used to generate the view of the virtual camera using novel view synthesis.
  • It can be noted that each sphere is valid for a local domain (in practice around 5 m³) until errors in the depth map begin to create visual distortion.

6.3.3 Asymmetric real-time localisation

  • One of the main advantages of a spherical key-frame graph is that any type of online sensor can be used to estimate the relative pose with respect to this environment model.
  • It can be seen that using a spherical image allows the camera to be localised in all viewing directions.
  • The algorithm was then validated on a subset graph of 12 spheres extracted from the results presented in Section 6.3.1.
  • The real-time 6 dof localisation method was also validated on large-scale stereo sequences captured in uncontrolled urban environments in the XIIth district of Paris for the purpose of the French national ANR-CityVIP project.
  • The path is locally different from the learnt trajectory (the car overtakes, the turns are taken much more widely, etc.).

6.4 CityVIP autonomous driving challenge

  • This section reports the autonomous navigation results obtained during the final challenge of the French ANR CityVIP project.
  • This project was conducted by several leading French research teams from June 2008 to December 2011 and its aim was to develop autonomous transportation vehicles for urban environments.
  • The goal of the challenge was to autonomously follow a reference trajectory over large scales in a completely uncontrolled environment.
  • The entry presented here was based on a learnt environment map and trajectory acquired during a manual teaching phase.
  • First, the platform will be presented, before introducing the control law that was used for trajectory following.

6.4.1 Platform

  • The platform used for the experiments is an electric Cycab vehicle, which can be controlled using longitudinal velocities and steering angle.
  • A single Firewire camera, placed on the top of the vehicle, is used for online localisation.
  • All the computations and control commands are performed on a Core i7 standard laptop, embedding the pre-computed graph of augmented spheres.

6.4.2 Control aspects

  • For autonomous driving, the aim is to automatically follow a reference trajectory U generated locally around the learnt graph.
  • In the proposed case, a virtual vehicle is followed and used to generate an error in translation and orientation that can be regulated using state feedback.
  • Longitudinal velocity is controlled using a proportional feedback on the longitudinal error and steering angle depends on yaw and transversal errors.
  • These errors are obtained by projecting the current vehicle position onto the closest edge of the reference trajectory.
  • The reference position is then selected by translating the projected position along the trajectory by a distance d (a sketch of this control law is given below).
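A sketch of this control law with hypothetical gains and saturation; only the structure (proportional longitudinal feedback plus state feedback on the transversal and yaw errors) follows the description above:

    import numpy as np

    def trajectory_following_control(longitudinal_error, transversal_error, yaw_error,
                                     v_ref=1.4, k_long=0.5, k_y=1.0, k_yaw=1.5,
                                     steer_limit=0.4):
        # Longitudinal velocity: proportional feedback on the longitudinal error so the
        # vehicle tracks the virtual reference vehicle. Steering angle: state feedback
        # on the transversal and yaw errors, saturated to the actuator limit.
        velocity = v_ref + k_long * longitudinal_error
        steering = -(k_y * transversal_error + k_yaw * yaw_error)
        return velocity, float(np.clip(steering, -steer_limit, steer_limit))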

6.4.3 Challenge preparation and testing

  • Preliminary autonomous navigation results were conducted at INRIA Sophia Antipolis in order to test the entire autonomous navigation pipeline.
  • The following result reports autonomous driving on a small trajectory of 100 meters, which was manually generated around a pre-learnt key-frame graph.
  • The reference vehicle longitudinal velocity was set to 1.4 m/s. Figure 18 plots the longitudinal and transversal errors with respect to time.
  • It can be seen that the vehicle was able to follow the reference trajectory, keeping a longitudinal error close to zero, whilst the transversal error is less than 25cm.
  • Note that the peaks in transversal error correspond to the 90-degree turns, and could be avoided with a better tuning of the control gains of equation (25).

6.4.4 Final CityVIP challenge

  • The following section will present one of the many experiments conducted during the final CityVIP challenge that lasted one whole week at Place Jaude in the city centre of Clermont Ferrand, France.
  • Over the course of one week, various trials were conducted in a large range of highly varying situations, allowing validation of the proposed approach.
  • Figure 19 shows the desired trajectory in black, and the trajectory followed autonomously by the vehicle in red.
  • As can be seen in Figures (a), (b) and (c), the accuracy of the localisation method allows navigation in narrow corridors, whilst the M-estimator ensures robustness to occlusions such as pedestrians.
  • These effects can be seen on the red trajectory around the landmark (e).

6.4.5 Discussions on the experiments

  • As mentioned in the previous section, visual navigation in large open-spaces is challenging since translations are not accurately estimated.
  • It could be interesting to combine a wide angle camera for robustness with a long focal camera for accuracy.
  • A second scenario has shown one limitation of the visual localisation approach.
  • It can be seen that the strong gradients generated by the shadows are selected by the saliency selection algorithm.


HAL Id: hal-01010429
https://hal.inria.fr/hal-01010429
Submitted on 19 Jun 2014
Dense omnidirectional RGB-D mapping of large scale outdoor environments for real-time localisation and autonomous navigation
Maxime Meilland, Andrew Ian Comport, Patrick Rives
To cite this version:
Maxime Meilland, Andrew Ian Comport, Patrick Rives. Dense omnidirectional RGB-D mapping of large scale outdoor environments for real-time localisation and autonomous navigation. Journal of Field Robotics, Wiley, 2015, 32 (4), pp. 474-503. DOI: 10.1002/rob.21531. HAL: hal-01010429.

Dense omnidirectional RGB-D mapping of large scale
outdoor environments for real-time localisation and
autonomous navigation
Maxime Meilland
INRIA, I3S/CNRS
University of Nice Sophia Antipolis
maxime.meilland@i3s.unice.fr
Andrew I. Comport
I3S/CNRS
University of Nice Sophia Antipolis
andrew.comport@cnrs.fr
Patrick Rives
INRIA Sophia Antipolis
patrick.rives@inria.fr
Abstract
This paper presents a novel method and innovative apparatus for building 3D dense visual maps of large-scale unstructured environments for autonomous navigation and real-time localisation. The main contribution of the paper is focused on proposing an efficient and accurate 3D world representation that allows us to extend the boundaries of state-of-the-art dense visual mapping to large scales. This is achieved via an omni-directional key-frame representation of the environment which is able to synthesise photo-realistic views of captured environments at arbitrary locations. Locally the representation is image-based (ego-centric) and is composed of accurate augmented spherical panoramas combining photometric information (RGB), depth information (D) and saliency for all viewing directions at a particular point in space (i.e. a point in the light field). The spheres are related by a graph of 6 degrees of freedom poses (3 dof translation and 3 dof rotation) which are estimated through multi-view spherical registration. It is shown that this world representation can be used to perform robust real-time localisation (in 6 dof) of any configuration of visual sensors within their environment, whether they be monocular, stereo or multi-view. Contrary to feature-based approaches, an efficient direct image registration technique is formulated. This approach directly exploits the advantages of the spherical representation by minimising a photometric error between a current image and a reference sphere. Two novel multi-camera acquisition systems have been developed and calibrated to acquire this information and this paper reports for the first time the second system. Given the robustness and efficiency of this representation, field experiments demonstrating autonomous navigation and large-scale mapping will be reported in detail for challenging unstructured environments, containing vegetation, pedestrians, varying illumination conditions, trams and dense traffic.
1 Introduction
Autonomous navigation in complex and dynamic unstructured environments over large-scale distances is an active field in robotics. One of the major difficulties in this objective is precise localisation of the vehicle within its environment so that autonomous navigation techniques can be employed. Indeed, robust localisation, particularly in heavily occluded areas (such as canyons, tunnels or forest areas), is a non-trivial problem due to GPS masking and poor precision of low-cost units. Classical dead reckoning methods like odometry, typically performed by inertial sensors and wheel encoders, are prone to drift and therefore are not suited to large distances.
Whilst very powerful platforms exist such as the Google car (Guizzo, 2011), they rely on a plethora of sensors which are fused together in a probabilistic framework. Perhaps the most important sensor is the 3D Velodyne Lidar, which is an active sensor that is extremely expensive when compared to cheap off-the-shelf cameras. Relatively recent performance improvements in computing hardware have, however, made real-time computer vision suitable for autonomous navigation of mobile robots. Visual odometry, which estimates the full 6 degrees of freedom (dof) of vehicle motion from image sequences, produces very precise results and has lower drift than the most expensive IMUs (Howard, 2008).
Visual odometry methods (Nistér et al., 2004; Comport et al., 2007; Konolige and Agrawal, 2008) are, however, incremental and prone to small drifts, which, when integrated over time, become increasingly significant over large distances. In a similar way, simultaneous localisation and mapping approaches (Montemerlo et al., 2002; Davison and Murray, 2002; Thrun, 2002; Durrant-Whyte and Bailey, 2006; Klein and Murray, 2007) allow both localisation and cartography, giving alternative and promising solutions. Classically based on the Extended Kalman filter, these techniques have proven to work over very long trajectories; however, their robustness and accuracy are highly dependent on tuning low-level feature extraction and matching, and the filtering approach often smooths out important non-linear information.
A solution to minimise drift is to use a pre-computed 3D model of the environment obtained during a learning phase. This kind of approach has several advantages compared to pure SLAM approaches:
  • Map building is performed via a learning phase which can be computed off-line using powerful yet computationally expensive techniques.
  • More computational effort can be dedicated to robust real-time pose estimation online.
  • Localisation with respect to the 3D model remains drift-free.
  • If the 3D model is at scale, monocular localisation is at scale too.
In visual navigation, two main classes of approach exist: global parametric 3D models and image-based key-frame models. In the first case, 3D parametric models have been extensively used for object tracking by minimising the re-projection error between the model and salient features in the images (Marchand et al., 2001; Drummond et al., 2002; Vacchetti et al., 2004; Comport et al., 2006; Lothe et al., 2010). This kind of approach represents the 3D data in a single global reference frame and has difficulty in representing local data accurately. Even if automatic techniques which build 3D parametric models of large-scale environments are becoming more and more accurate (Hammoudi et al., 2010; Craciun et al., 2010; Lafarge and Mallet, 2011), they rely heavily on the structure of urban environments and are therefore not well adapted to unstructured environments.
On the other hand, key-frame techniques aim at representing the world with a set of images positioned in 2D or 3D, without performing an explicit 3D reconstruction of the environment. This allows raw (unmodified) local sensor data to be stored in the representation for high accuracy, and maintains a topological framework at large scale that is accurate enough to ensure the connectivity between locally precise key-frames. In the literature, several approaches have successively performed localisation using an image-based memory. In (Royer et al., 2005; Konolige and Agrawal, 2008), a database of key-frames containing geometric point features (Harris) is used for online camera localisation. A similar approach is proposed in (Courbon et al., 2009) with an omnidirectional camera. In (Jogan and Leonardis, 2000) a cylindrical panorama memory is proposed, and in (Cobzas et al., 2003) a panoramic image memory combined with depth information extracted from a laser scan is used for localisation.

Those approaches, however, all rely on feature extraction and matching, and therefore do not take full
advantage of the dense information contained in the images. Our main contribution is to propose an accurate
and robust image-based representation which allows to map large-scale unstructured environments as densely
as possible. The aim is to provide a world model that provides maximal robustness and maximal flexibility
for online localisation and navigation purposes. Online real-time localisation eciency is achieved, not by
extracting and matching features, but by proposing a novel non-linear saliency measure for individual pixels.
In particular, the proposed representation is composed of a graph of augmented visual spheres containing
omnidirectional RGB-D information (colour and depth) for many spatial locations. This representation
allows to eciently perform online localisation and navigation using state of the art dense registration
techniques (
Comport et al., 2007; Meilland et al., 2010; Meilland et al., 2012).
This present publication is the culmination of many previously published articles which were carried out in
the context of the French national CityVIP project aimed at autonomous vehicle navigation. The individual
contributions have never been published together and the aim of this article is to provide the global scope of
this complex system whilst demonstrating new experimental results for several large-scale field experiments:
In (
Comport et al., 2010) a dense direct approach was proposed for visual odometry.
In (Meilland et al., 2010) a spherical representation was proposed along with a saliency selection
approach.
In (Meilland et al., 2011a) a spherical sensor and real-time localisation with respect to a spherical
database was proposed.
In (
Meilland et al., 2011b) a robust approach to illumination variations was proposed.
In (
Meilland et al., 2012) autonomous navigation in challenging environments was demonstrated.
This paper has been structured in such a way that the relevant literature review is included at the beginning of each section. As such, in Section 2 the global non-linear optimisation criterion that is minimised will be introduced and formalised. In Section 3 a dense spherical key-frame model will be introduced to represent the 3D world. Section 4 will detail a novel spherical acquisition system that was built for the purpose of this project to acquire streams of photometric colour and depth panoramas at 45 Hz. In Section 5 we will provide computational tools to allow real-time localisation of different types of cameras with respect to this world model. Finally, Section 6 will detail several field experiments. Initial results for the acquisition and mapping of large-scale environments will be given. Results from several sites including the XIIth district of Paris, INRIA Sophia-Antipolis and Place Jaude in Clermont-Ferrand will be given. In a last part a fully
Citations
Journal ArticleDOI
03 Oct 2020
TL;DR: It is evident that the more concentrated deployment of AI and robotics in smart city development is currently in the Global North, although countries in the global South are also increasingly represented.
Abstract: Smart city strategies developed by cities around the world provide a useful resource for insights into the future of smart development. This study examines such strategies to identify plans for the explicit deployment of artificial intelligence (AI) and robotics. A total of 12 case studies emerged from an online keyword search representing cities of various sizes globally. The search was based on the keywords of “artificial intelligence” (or “AI”), and “robot,” representing robotics and associated terminology. Based on the findings, it is evident that the more concentrated deployment of AI and robotics in smart city development is currently in the Global North, although countries in the Global South are also increasingly represented. Multiple cities in Australia and Canada actively seek to develop AI and robotics, and Moscow has one of the most in-depth elaborations for this deployment. The ramifications of these plans are discussed as part of cyber–physical systems alongside consideration given to the social and ethical implications.

37 citations

Journal ArticleDOI
TL;DR: A portable mobile mapping research platform with a strong focus on acquiring accurate 3D imagery that provides for completely GNSS-independent data acquisition and camera pose estimation using LiDAR-based SLAM and applies a novel image-based georeferencing approach for further improving camera poses.
Abstract: . The rapid progression in digitalization in the construction industry and in facility management creates an enormous demand for the efficient and accurate reality capturing of indoor spaces. Cloud-based services based on georeferenced metric 3D imagery are already extensively used for infrastructure management in outdoor environments. The goal of our research is to enable such services for indoor applications as well. For this purpose, we designed a portable mobile mapping research platform with a strong focus on acquiring accurate 3D imagery. Our system consists of a multi-head panorama camera in combination with two multi-profile LiDAR scanners and a MEMS-based industrial grade IMU for LiDAR-based online and offline SLAM. Our modular implementation based on the Robot Operating System enables rapid adaptations of the sensor configuration and the acquisition software. The developed workflow provides for completely GNSS-independent data acquisition and camera pose estimation using LiDAR-based SLAM. Furthermore, we apply a novel image-based georeferencing approach for further improving camera poses. First performance evaluations show an improvement from LiDAR-based SLAM to image-based georeferencing by an order of magnitude: from 10–13 cm to 1.3–1.8 cm in absolute 3D point accuracy and from 8–12 cm to sub-centimeter in relative 3D point accuracy.

26 citations

Journal ArticleDOI
TL;DR: This paper investigates quality aspects of the urban model, namely the obtainable georeferencing accuracy and the quality of the depth map extraction, and presents a scalable cloud-based framework for generating 3D image spaces of entire cities or states and a client architecture for their web-based exploitation.
Abstract: In this paper, we introduce the concept and an implementation of geospatial 3D image spaces as new type of native urban models. 3D image spaces are based on collections of georeferenced RGB-D imagery. This imagery is typically acquired using multi-view stereo mobile mapping systems capturing dense sequences of street level imagery. Ideally, image depth information is derived using dense image matching. This delivers a very dense depth representation and ensures the spatial and temporal coherence of radiometric and depth data. This results in a high-definition WYSIWYG (“what you see is what you get”) urban model, which is intuitive to interpret and easy to interact with, and which provides powerful augmentation and 3D measuring capabilities. Furthermore, we present a scalable cloud-based framework for generating 3D image spaces of entire cities or states and a client architecture for their web-based exploitation. The model and the framework strongly support the smart city notion of efficiently connecting the urban environment and its processes with experts and citizens alike. In the paper we particularly investigate quality aspects of the urban model, namely the obtainable georeferencing accuracy and the quality of the depth map extraction. We show that our image-based georeferencing approach is capable of improving the original direct georeferencing accuracy by an order of magnitude and that the presented new multi-image matching approach is capable of providing high accuracies along with a significantly improved completeness of the depth maps.

23 citations


Cites background from "Dense Omnidirectional RGB-D Mapping..."

  • ...[14], for example, distinguish between 3D parametric models and image-based key-frame models, but they do not cover point cloud-based models, which play an important role in urban modeling....

    [...]

  • ...A recently published and similar approach uses collections of spherical RGB-D panoramas of large-scale urban environments for autonomous navigation and real-time localization [14]....

    [...]

  • ...In outdoor urban environments, the requirement of dense and temporally-coherent RGB-D values with a sufficiently high spatial resolution can currently only be met by stereo or trinocular camera setups [6,7,14,20] and a subsequent depth extraction using dense image matching (see Section 6)....

    [...]

  • ...By considering all sensors of a panoramic head to have a unique center of projection [14], both approaches reduce the relative and absolute 3D measurement accuracy from a potential cm-level to the dm-level or even lower....

    [...]

  • ...These include: monoscopic, stereoscopic [7], panoramic [23] or RGB-D imagery [7,8,14], 3D point clouds [15] or combinations thereof [24]....

    [...]

Posted Content
TL;DR: This work crafts the proposed feature voting method into three state-of-the-art visual localization networks and modify their architectures properly so that they can be applied for vehicular operation and indicates that the approach can predict location robustly even in challenging inner-city settings.
Abstract: Accurate localization is a foundational capacity, required for autonomous vehicles to accomplish other tasks such as navigation or path planning. It is a common practice for vehicles to use GPS to acquire location information. However, the application of GPS can result in severe challenges when vehicles run within the inner city where different kinds of structures may shadow the GPS signal and lead to inaccurate location results. To address the localization challenges of urban settings, we propose a novel feature voting technique for visual localization. Different from the conventional front-view-based method, our approach employs views from three directions (front, left, and right) and thus significantly improves the robustness of location prediction. In our work, we craft the proposed feature voting method into three state-of-the-art visual localization networks and modify their architectures properly so that they can be applied for vehicular operation. Extensive field test results indicate that our approach can predict location robustly even in challenging inner-city settings. Our research sheds light on using the visual localization approach to help autonomous vehicles to find accurate location information in a city maze, within a desirable time constraint.

17 citations


Cites background or methods from "Dense Omnidirectional RGB-D Mapping..."

  • ...However, [21] [4] use panorama or spherical views for feature storage and matching which requires a large computation load; [31] [28] require expensive high dimensional image data, such as 3D point maps and 6 degree-of-freedom (6DOF) camera inputs for training and prediction....

    [...]

  • ...There are existing methods from [21] [4] [28] [31] that also leverage camera images to locate the mobile agent....

    [...]

Journal ArticleDOI
TL;DR: A hybrid maps enhanced localization system which mainly consists of a global localization method and a pose tracking method to achieve real-time localization reliably and accurately is presented.
Abstract: With excellent mobility and flexibility, mobile manipulators have great potential for loading and unloading tasks of numerical control machine tools (CNC) in manufacturing workshops. However, because of the rough and oily ground, dynamic obstacles and the convex plate of a CNC, harsh manufacturing workshop poses a huge challenge to the localization system of an autonomous mobile manipulator. To address the above problem, this paper presents a hybrid maps enhanced localization system which mainly consists of a global localization method and a pose tracking method. Hybrid maps including hybrid grid map, multi-resolution likelihood fields (MLFs) and hybrid point map are constructed to efficaciously model the harsh environment and to improve localization performance. Our global localization method employs the convex hull sampling to spares dense Lidar data and the MLFs based branch and bound (BnB) search to speed up global search. To achieve real-time localization reliably and accurately, our pose tracking method seamlessly combines the BnB search and the adaptive Monte Carlo localization, and the Iterative Closest Point (ICP) based scan matching using the hybrid point map is adopted for higher accuracy. In addition, a distance filter improved by unscented transform is integrated into the pose tracking process to mitigate the influence of dynamic obstacles. The developed localization system is evaluated through different experiments including two weeks of loading and unloading tasks in a real manufacturing scenario, resulting in superior localization performance.

15 citations


Cites background from "Dense Omnidirectional RGB-D Mapping..."

  • ...Visual sensors such as mono-camera [6] and RGB-D camera [7] can obtain abundant information for robot localization....

    [...]

References
Journal ArticleDOI
TL;DR: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene and can robustly identify objects among clutter and occlusion while achieving near real-time performance.
Abstract: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.

46,906 citations

Proceedings ArticleDOI
01 Jan 1988
TL;DR: The problem the authors are addressing in Alvey Project MMI149 is that of using computer vision to understand the unconstrained 3D world, in which the viewed scenes will in general contain too wide a diversity of objects for topdown recognition techniques to work.
Abstract: The problem we are addressing in Alvey Project MMI149 is that of using computer vision to understand the unconstrained 3D world, in which the viewed scenes will in general contain too wide a diversity of objects for topdown recognition techniques to work. For example, we desire to obtain an understanding of natural scenes, containing roads, buildings, trees, bushes, etc., as typified by the two frames from a sequence illustrated in Figure 1. The solution to this problem that we are pursuing is to use a computer vision system based upon motion analysis of a monocular image sequence from a mobile camera. By extraction and tracking of image features, representations of the 3D analogues of these features can be constructed.

13,993 citations


"Dense Omnidirectional RGB-D Mapping..." refers background or methods in this paper

  • ...…and Murray, 2002; Nistér et al., 2004; Mouragnon et al., 2006; Mei et al., 2010; Kitt et al., 2010; Tardif et al., 2010)) rely on an intermediary estimation process based on thresholding (Harris and Stephens, 1988; Lowe, 2004) before requiring matching between frames to recover camera motion....

    [...]

  • ...The learning phase also allows to estimate the transformations between the key-frames and store their position in global world coordinates....

    [...]

Proceedings Article
24 Aug 1981
TL;DR: In this paper, the spatial intensity gradient of the images is used to find a good match using a type of Newton-Raphson iteration, which can be generalized to handle rotation, scaling and shearing.
Abstract: Image registration finds a variety of applications in computer vision. Unfortunately, traditional image registration techniques tend to be costly. We present a new image registration technique that makes use of the spatial intensity gradient of the images to find a good match using a type of Newton-Raphson iteration. Our technique is faster because it examines far fewer potential matches between the images than existing techniques. Furthermore, this registration technique can be generalized to handle rotation, scaling and shearing. We show how our technique can be adapted for use in a stereo vision system.

12,944 citations

Proceedings ArticleDOI
21 Jun 1994
TL;DR: A feature selection criterion that is optimal by construction because it is based on how the tracker works, and a feature monitoring method that can detect occlusions, disocclusions, and features that do not correspond to points in the world are proposed.
Abstract: No feature-based vision system can work unless good features can be identified and tracked from frame to frame. Although tracking itself is by and large a solved problem, selecting features that can be tracked well and correspond to physical points in the world is still hard. We propose a feature selection criterion that is optimal by construction because it is based on how the tracker works, and a feature monitoring method that can detect occlusions, disocclusions, and features that do not correspond to points in the world. These methods are based on a new tracking algorithm that extends previous Newton-Raphson style search methods to work under affine image transformations. We test performance with several simulations and experiments.

8,432 citations


"Dense Omnidirectional RGB-D Mapping..." refers background in this paper

  • ...Classically direct approaches have focused on region-of-interest tracking whether they be modelled by Affine (Hager and Belhumeur, 1998), planar (Lucas and Kanade, 1981; Irani and Anandan, 1998; Irani and Anandan, 2000; Baker and Matthews, 2001; Malis, 2004; Dame and Marchand, 2010), or multiple-plane tracking (Mei et al., 2006; Silveira et al., 2008; Shi and Tomasi, 1994; Caron et al., 2011)....

    [...]

  • ...…by Affine (Hager and Belhumeur, 1998), planar (Lucas and Kanade, 1981; Irani and Anandan, 1998; Irani and Anandan, 2000; Baker and Matthews, 2001; Malis, 2004; Dame and Marchand, 2010), or multiple-plane tracking (Mei et al., 2006; Silveira et al., 2008; Shi and Tomasi, 1994; Caron et al., 2011)....

    [...]

Book
01 Jan 2005
TL;DR: This research presents a novel approach to planning and navigation algorithms that exploit statistics gleaned from uncertain, imperfect real-world environments to guide robots toward their goals and around obstacles.
Abstract: Planning and navigation algorithms exploit statistics gleaned from uncertain, imperfect real-world environments to guide robots toward their goals and around obstacles.

6,425 citations


"Dense Omnidirectional RGB-D Mapping..." refers background in this paper

  • ...In a similar way, simultaneous localisation and mapping approaches (Montemerlo et al., 2002; Davison and Murray, 2002; Thrun, 2002; Durrant-Whyte and Bailey, 2006; Klein and Murray, 2007) allow both localisation and cartography, giving alternative and promising solutions....

    [...]

  • ...In a similar way, simultaneous localization and mapping (SLAM) approaches (Davison & Murray, 2002; Durrant-Whyte & Bailey, 2006; Klein & Murray, 2007; Montemerlo, Thrun, Koller, & Wegbreit, 2002; Thrun, 2002) allow both localization and cartography, giving alternative and promising solutions....

    [...]

Frequently Asked Questions (16)
Q1. What are the contributions mentioned in the paper "Dense omnidirectional rgb-d mapping of large scale outdoor environments for real-time localisation and autonomous navigation" ?

This paper presents a novel method and innovative apparatus for building 3D dense visual maps of large-scale unstructured environments for autonomous navigation and real-time localisation. The main contribution of the paper is focused on proposing an efficient and accurate 3D world representation that allows to extend the boundaries of state-of-the-art dense visual mapping to large-scales. Two novel multi-camera acquisition systems have been developed and calibrated to acquire this information and this paper reports for the first time the second system. Given the robustness and efficiency of this representation, field experiments demonstrating autonomous navigation and large-scale mapping will be reported in detail for challenging unstructured environments, containing vegetation, pedestrians, varying illumination conditions, trams and dense traffic. 

Future work will aim at improving the database construction by geo-referencing the spherical graph in a GIS (Geographic Information System), which may contain higher-level information such as free space. This geolocation will allow the use of advanced path-planning algorithms. To further improve the autonomous navigation solution, it would be interesting to fuse vision-based localisation with GPS signals and inertial measurements. Another important feature that will be studied is life-long mapping: since the environment is not static, it is necessary to update the database with new geometric and photometric information gathered during the online phase.

To accurately recover the position of the spheres of the graph with respect to one another, a 6 degrees of freedom spherical localisation model is used, based on an accurate dense localisation (Comport et al., 2010; Meilland et al., 2010).

Realistic computer graphics models are obtained by texture mapping onto 3D planar façades which have been extracted from the environment. 

Since the localisation algorithm is based on an iterative minimisation, the number of iterations can be directly adjusted with the desired camera frame-rate. 

Due to the overall complexity of the system and the breakup into a map learning phase and an online real-time navigation phase, it is suggested that the reader view the video associated with this paper before proceeding. 

The 6 dof localisation is performed using a 3-level image pyramid with respectively 3, 5 and 8 iterations for each level of the pyramid (from coarse to fine).

When the environment is far from the camera viewpoint (0 < ∆ < 8 and 16 < ∆ < 25), the MAD increases slowly, due to very small changes between the images. 

Using a small fraction of salient pixels (i.e., 60k pixels), the localisation algorithm runs flawlessly at 45 Hz with an input image of dimensions 800×600 pixels.

in large scale mapping approaches, the connectivity between the various camera poses is relatively sparse meaning that performing global bundle adjustment is not efficient. 

In particular, the proposed representation is composed of a graph of augmented visual spheres containing omnidirectional RGB-D information (colour and depth) for many spatial locations. 

As it was highlighted in (Handa et al., 2012), a high camera frame-rate reduces the inter-frame camera motion, allowing to perform less iterations until convergence. 

In practice, the trajectory of the acquisition system along a sequence is computed by integrating elementary displacements estimated from successive spherical registration. 

The procedure to build an augmented sphere with this system can be summarized as: 1. Calibration of the extrinsic and intrinsic parameters of each camera with a checker-board, via bundle adjustment: this calibration step is only performed once since the system is rigid.

the vertices of the graph should be carefully placed in the world so as to represent the environment with little redundancy. 

To ensure a better initialisation, a direct registration technique is employed to estimate the local inter-frame 3D rotation of the camera as in (Mei et al., 2010; Lovegrove and Davison, 2010; Newcombe et al., 2011b).