
HAL Id: hal-01237850
https://hal.inria.fr/hal-01237850
Submitted on 3 Dec 2015
Hybrid Metric-Topological-Semantic Mapping in
Dynamic Environments
Romain Drouilly, Patrick Rives, Benoit Morisset
To cite this version:
Romain Drouilly, Patrick Rives, Benoit Morisset. Hybrid Metric-Topological-Semantic Mapping in
Dynamic Environments. IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, IROS’15, Sep
2015, Hamburg, Germany. ⟨hal-01237850⟩

Hybrid Metric-Topological-Semantic Mapping in Dynamic
Environments
Romain Drouilly¹,², Patrick Rives¹, Benoit Morisset²
Abstract— Mapping evolving environments requires an update mechanism to efficiently deal with dynamic objects. In this context, we propose a new approach to update maps of large-scale dynamic environments with semantics. While previous works mainly rely on large amounts of observations, the proposed framework is able to build a stable representation with only two observations of the environment. To do this, scene understanding is used to detect dynamic objects and to recover the labels of the occluded parts of the scene through an inference process which takes into account both spatial context and a class occlusion model. Our method was evaluated on a database acquired at two different times, with an interval of three years, in a large dynamic outdoor environment. The results point out the ability to retrieve the hidden classes with a precision score of 0.98. The performance in terms of localisation is also improved.
I. INTRODUCTION
Lifelong mapping has received an increasing amount of attention in recent years, largely motivated by the growing need to integrate robots into the real world, wherein dynamic objects constantly change the appearance of the scene. A mobile robot operating in such a dynamic world should not only be able to build a map of the observed environment at a specific moment, but also to keep this map consistent over a long period of time. It has to deal with dynamic changes that can cause the navigation process to fail. However, updating the map is particularly challenging in large-scale environments. To identify changes, robots have to keep a memory of the previous states of the environment; the more dynamic it is, the higher the number of states to manage and the more computationally intensive the updating process. Mapping large-scale dynamic environments is thus particularly difficult, as the map size can be arbitrarily large. Additionally, mapping the whole environment many times is not always possible or convenient, so methods using only a small number of observations are advantageous. The idea exploited in this paper is to use scene understanding to retrieve a stable world model with only two acquisition sequences.
Previous mapping strategies developed for dynamic envi-
ronments can be grouped into a few categories. The first group
concerns methods that remove dynamic objects in order to
achieve a stable representation [1]. A strategy to identify
dynamic objects in the scene and map them in a separate
occupancy grid is proposed in [2].

*This work was supported by ECA Robotics.
¹The authors are with team Lagadic at INRIA Méditerranée, France. romain.drouilly@inria.fr, patrick.rives@inria.fr
²The authors are with ECA Robotics. bmo@eca.fr

Fig. 1. Map structure. Spherical images augmented with depth are captured while the robot explores its environment. Then each spherical image is automatically annotated. Finally, a semantic graph is extracted from the annotated image.

Those methods require the identification of moving objects, which makes them better suited
for fast dynamic changes. Another approach that does not
require explicitly identifying moving objects is presented in [3]. It consists of maintaining several maps acquired at different time scales and selecting the most suitable one at a given
time for navigation by checking its consistency with the
current observations. The main drawback of this approach is
the large amount of data needed, five times as big as a single
map, which prevents its use for large-scale environments
where the size of the map is already a problem. A third group
of methods assumes that mapping is a never-ending process and continuously updates the data. The biologically inspired
algorithm RATSLAM is used in [4] to perform persistent
mapping at the cost of an increasing map size. Similarly,
a hybrid metric-topological map is proposed in [5], where a model based on human memory is used to update
a feature-based description of spherical views constituting
the map. The update process consists of adding information about stable features and removing features that no longer
exist in spherical images. Once again, the major issue is to
deal with the large amount of data. A last approach consists in transposing the problem into another space where changes are easier to handle and to model. An example, introduced in [6], is the use of spectral analysis to model the dynamics of the environment. However, identifying dynamic behavior once again requires a large amount of observations, which is the main problem of most of the previous approaches, making them unsuitable for large-scale environments where the amount of data for a single mapping session is already significant.

Fig. 2. Example of a semantic graph extracted from an annotated image.
In this paper we propose a framework adapted to large-scale outdoor environments that relies on the hybrid Metric-Topological-Semantic (MTS) map introduced in our previous work [7]. We show how our mapping approach can naturally handle changes in the environment for both mapping and localisation through the use of semantic maps. Instead of directly updating the low-level layers of the map, we update semantic data and generalise changes observed over space to changes over time using the ergodic assumption.
II. PROPOSED APPROACH
In this work we propose a new scheme to update the map
of large-scale dynamic environments using a stable represen-
tation based on semantic information. While many mapping systems rely on a representation based on low-level feature descriptors, our method only requires a semantic description of a set of reference images [7], [8]. Consequently, the changes occurring in the scene will be coded directly in the semantic layer of the map. In the presented approach, instead of stacking all perceptions at a very high computational cost, a compact representation of changes is built. More precisely, the set of possible classes is split into two groups, namely dynamic and static classes. Then, changes are modeled in terms of static-class occlusions due to dynamic objects. This model takes the form of a probability distribution that encodes the risk that a given dynamic class occludes a static class. It is built from the observation of occlusions over space and time and is used in conjunction with contextual information to infer the semantics of occluded areas. For the sake of completeness, we briefly review in the next two sections our MTS map structure, illustrated in figure 1, and
our localisation strategy, previously introduced in [7]. Then
the proposed map update framework is detailed.
A. MTS Map Structure
Our map consists of a set of spherical RGB images augmented by depth and semantic data, as illustrated in fig. 1. Those images, so-called reference images, are multi-layer local submaps of the environment perceived from a particular viewpoint. The metric layer is built from data acquired with
Fig. 3. Illustration of the registration process (spheres I_s, I*_s and pose T̂T(x)). The current sphere is in blue and the reference spheres are in red.
a multi-camera stereovision system previously described in [9]. The sensor consists of two superimposed rings of three cameras, allowing both photometric and geometric data to be captured over a 360° field of view. Each pixel is assigned a semantic label through a two-step process, as presented in [7]. First, Random Forests are used to classify SIFT-based descriptors extracted densely from images. Then a Fully Connected Conditional Random Field is used to model the neighborhood, and an efficient inference method [10] allows us to correct the labels using spatial context.
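As an illustration of this two-step labelling, the sketch below shows only the first, classification stage, using scikit-learn's RandomForestClassifier; the feature arrays, image size, and class count are placeholders of our own, and the fully connected CRF refinement of [10] is merely indicated in a comment, not implemented.

```python
# Minimal sketch of the per-pixel labelling stage (not the authors' code).
# Dense per-pixel descriptors are assumed precomputed; random arrays
# stand in for the SIFT-based features used in the paper.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

H, W, D = 50, 100, 128                        # placeholder image size, descriptor dim
train_feats = np.random.rand(2000, D)         # placeholder training descriptors
train_labels = np.random.randint(0, 9, 2000)  # 9 classes, as in Sec. III

rf = RandomForestClassifier(n_estimators=50).fit(train_feats, train_labels)

pixel_feats = np.random.rand(H * W, D)        # descriptors of one test image
unary = rf.predict_proba(pixel_feats)         # per-pixel class probabilities
labels = rf.classes_[unary.argmax(axis=1)].reshape(H, W)
# A fully connected CRF (e.g. the inference of [10]) would then refine
# `labels` using these unary potentials plus spatial/appearance context.
```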
From these annotated images, semantic graphs are extracted, as illustrated in figure 2. Let A_i be a group of contiguous pixels with the same label in the annotated image, called a semantic area. A semantic graph is denoted as g = {A, E}, where A is the set of semantic areas and E is the set of edges encoding their adjacency in the annotated image. Each A_i ∈ A is characterized by a fitted ellipsoid envelope f_i = {x_i, y_i, h_i, w_i, α_i}, where (x_i, y_i) is the position of the ellipse, (h_i, w_i) its main axes and α_i its orientation.
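A minimal sketch of this graph structure is given below (illustrative Python, not the authors' implementation): nodes store the label and the fitted-ellipse parameters f_i, and edges store adjacency between areas.

```python
# Illustrative semantic-graph structure: nodes are semantic areas with a
# fitted ellipse f_i = {x_i, y_i, h_i, w_i, alpha_i}; edges encode
# adjacency between areas in the annotated image.
from dataclasses import dataclass

@dataclass
class SemanticArea:
    label: str        # e.g. "road", "building"
    x: float          # ellipse centre (image coordinates)
    y: float
    h: float          # main-axis lengths
    w: float
    alpha: float      # orientation

class SemanticGraph:
    def __init__(self):
        self.areas = []        # set A of semantic areas
        self.edges = set()     # set E of adjacency pairs (node indices)

    def add_area(self, area):
        self.areas.append(area)
        return len(self.areas) - 1

    def connect(self, i, j):
        self.edges.add((min(i, j), max(i, j)))

    def neighbors(self, i):
        return [b if a == i else a for a, b in self.edges if i in (a, b)]
```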
Semantic graphs are powerful local representations of the environment, as they encode both the scene structure and a high-level description of the context in a very compact way. The localisation strategy strongly relies on those graphs. At a larger scale, all submaps are positioned in the scene thanks to a dense visual odometry method presented in [11] and constitute a global graph of the environment.
B. Localization in MTS Map
Localization in the MTS map is a coarse-to-fine two-step process. For a given image of the current scene, a similar submap is efficiently retrieved in the global graph using semantics. The semantic graph g_cur extracted from the current annotated image is compared to the N semantic graphs g_i of the map using an Interpretation Tree. This algorithm allows graphs to be compared efficiently using both node appearance and neighborhoods. Nodes of the semantic graphs with similar labels are matched pairwise using unary constraints, capturing their intrinsic properties (h_i, w_i, α_i), and pairwise constraints, capturing the context formed by the nearest neighbors in the graph.
Matching graphs allows a similarity score, denoted as σ, to be computed between two semantic graphs G_1 and G_2. It is measured as follows:

σ(G_1, G_2) = exp(N_m / N)    (1)

where N_m is the number of nodes matched between the two graphs, denoted as A_12 = A_{G_1} ∩ A_{G_2}, and N is the total number of nodes in the current semantic graph. The submap with the highest score σ corresponds to the most probable closest location.
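Reading Eq. (1) as σ = exp(N_m / N), a minimal sketch of the score is shown below; the node matching itself is assumed already done by the interpretation tree.

```python
# Sketch of the similarity score of Eq. (1), assuming the interpretation
# tree has already produced the set of matched node pairs A_12.
import math

def similarity(n_matched, n_total):
    """sigma(G1, G2) = exp(N_m / N): N_m matched nodes, N nodes in g_cur."""
    return math.exp(n_matched / n_total)

# e.g. 8 of the 10 nodes of the current graph matched a candidate submap:
print(similarity(8, 10))   # ~2.23; the submap with the highest score wins
```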
Once the submap corresponding to the closest position is
retrieved, a dense registration method between the submap
and the current spherical image, described in [12], is applied
to refine the pose estimate locally (see figure 3). Pose esti-
mation between a current spherical image I_s and a retrieved spherical image I*_s is done using robust minimization techniques. Following the formulation of [13], the cost function for optimising intensity errors between spheres {I_s, I*_s} is given as:

F_I = (1/2) Σ_i^k Ψ_hub( ‖ I_s(ω(T̂T(x); P_i)) − I*_s(ω(I; P_i)) ‖² ),    (2)

where ω(.) is the warping function that projects a 3-D point P_i onto the sphere given a pose T. The pose T̂T(x) is an approximation of the true transformation T(x̃), and Ψ_hub is a robust weighting function on the error given by Huber's M-estimator [14].
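The sketch below illustrates one evaluation of this cost under simplifying assumptions of ours: the warped intensities are taken as given (the warping ω and the pose optimisation loop are omitted), and the Huber M-estimator is applied as a per-pixel weight; names and the threshold value are illustrative.

```python
# Illustrative sketch of the robust photometric cost of Eq. (2):
# Huber-weighted squared intensity errors between the current sphere
# and the reference sphere warped by the current pose estimate.
import numpy as np

def huber_weight(residual, k=1.345):
    """Per-pixel weight from Huber's M-estimator."""
    a = np.abs(residual)
    return np.where(a <= k, 1.0, k / np.maximum(a, 1e-12))

def photometric_cost(i_cur_warped, i_ref):
    """F_I = 0.5 * sum_i w_i * e_i^2 over pixel intensity errors e_i."""
    e = i_cur_warped - i_ref            # intensity residuals
    return 0.5 * np.sum(huber_weight(e) * e**2)

# Toy example: two nearly identical spheres flattened to intensity arrays.
print(photometric_cost(np.array([0.2, 0.5, 0.9]), np.array([0.2, 0.4, 0.1])))
```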
C. Updating the Maps
There are two main types of changes occurring in a dynamic environment: those due to dynamic objects and those caused by illumination changes (day/night). As we are focused on life-long mapping, we consider here only the changes in the scene due to occlusions caused by dynamic objects. This choice is motivated by the fact that, in a semantic approach, robustness to illumination changes should be more easily handled by the classification process, which is beyond the scope of this article.
A dynamic class, denoted as C_D, occludes a static class C_S by changing the label associated with the corresponding pixels in the image. The objective is to identify the parts of the images corresponding to dynamic classes and to retrieve the static class occluded by the dynamic object in order to achieve a stable representation. Let us consider two different cases, depending on whether the scene is re-observed or not.
1) Map updates in re-observed areas: When the robot navigates its environment, it can observe the same place several times. Dynamic objects may have moved, so previously occluded areas can be observed. These observations are used to build a more stable semantic representation by replacing previously occluded areas in the reference annotated image with the labels provided by the new observation. To do this, the pose between the two images is computed using dense matching techniques. Then a semantic warping function is used to project labels of the new observation onto the reference annotated image.
Let I*_s be the reference annotated image of size m × n from which we want to compute the stable representation. A pixel in I*_s is identified by its position p* = (u, v), where u ∈ [0, m[ and v ∈ [0, n[. A 3D point in Euclidean space is defined as P* = {p*, Z, l}, where Z ∈ R+ is the depth expressed in the image reference frame and l ∈ L_S the associated label. Let I_i be another observation of the same scene viewed from the position T(x)_i, where T(x)_i is expressed in the reference frame of I*_s.

It is possible to synthesize a new annotated image, denoted as I'_i, from the labels L(p) of I_i at the position of the reference image using the warping function:

p*_ω = ω(T(x)_i; Z, l, p*)    (3)

where the function ω(.) lifts the pixel p* from the reference image to the new observed image using the rigid transform T(x)_i followed by a spherical projection. The projected point does not correspond exactly to a pixel position, so a nearest-neighbor interpolation is used to select the corresponding label. Finally, if a given pixel p_i is associated with a class C_j ∈ C_D in I*_s and with a class C_k ∈ C_S in I'_i, its class is set to C_k.
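The following sketch summarises this re-observed-area update under stated assumptions: `warp` stands in for the function ω of Eq. (3) with pose, depth, and nearest-neighbour lookup already folded in, labels are integer class indices, and the dynamic-class set is hypothetical.

```python
# Sketch of the re-observed-area update (illustrative, not the authors'
# code): reference pixels carrying a dynamic label that the warped new
# observation sees as static are overwritten with the static label.
import numpy as np

DYNAMIC = {7, 8}   # hypothetical indices of dynamic classes, e.g. "car"

def update_reference(ref_labels, obs_labels, warp):
    out = ref_labels.copy()
    h, w = ref_labels.shape
    for u in range(h):
        for v in range(w):
            uo, vo = warp(u, v)            # omega: lift (u, v) into I_i
            c_ref, c_obs = ref_labels[u, v], obs_labels[uo, vo]
            if c_ref in DYNAMIC and c_obs not in DYNAMIC:
                out[u, v] = c_obs          # occluded static class recovered
    return out
```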
2) Map update in unobserved areas: The warping function allows a partial update of the map where an additional observation provides information about the underlying static classes. But some areas may remain unobserved if dynamic objects occlude the same part of the scene in both mapping sequences.
To deal with these areas, we need to do semantic in-
painting: unobserved areas are treated as holes that have
to be filled with static class labels. A model of occlusions
occurring in the scene is computed to infer the label of the
pixels that remain occluded by dynamic objects. It relies on the ergodic assumption, which states that the average behavior of dynamic objects over time is essentially the same as the average behavior of dynamic objects over space. More precisely, this assumption states that we can generalize occlusions observed over the mapping sequences to unobserved areas which remain occluded in other parts of the map. Practically, we compute a model that describes which static class is likely to be occluded by a given dynamic class. For example, the dynamic class "pedestrian" is more likely to occlude the static class "sidewalk" than "sky", just as the class "car" is likely to occlude "road". This model takes the form of a probability distribution of the existence of an underlying static class C_i given the observation of a dynamic class C_j:

P(C_i | C_j) = O(C_i, C_j) / N    (4)
where N is the total number of pixels initially associated with a dynamic label, and O(C_i, C_j) is the number of pixels initially labelled as C_j and corrected to C_i using the warping function. This model is computed thanks to observations made over several acquisition sequences. It is not necessary to remember specific observations, only the model, which is extremely compact.
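A sketch of how such a model can be accumulated from the corrected labels is shown below; names are ours, and `before`/`after` denote the label images prior to and after the warping-based correction.

```python
# Sketch of the occlusion model of Eq. (4): count how often each dynamic
# class C_j was corrected to each static class C_i, normalised by N, the
# number of pixels initially carrying a dynamic label.
import numpy as np

def occlusion_model(before, after, dynamic_classes, n_classes):
    counts = np.zeros((n_classes, n_classes))      # O(C_i, C_j)
    mask = np.isin(before, list(dynamic_classes))  # pixels observed as dynamic
    n = max(int(mask.sum()), 1)                    # N
    for cj, ci in zip(before[mask], after[mask]):
        counts[ci, cj] += 1
    return counts / n                              # P(C_i | C_j), Eq. (4)
```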
However, this model is not sufficient to correctly estimate the occluded classes, because it only takes into account statistics over time. It is necessary to take the spatial context into account to estimate the probability that a given static class is occluded by a dynamic class. For example, if a dynamic object of type "car" is mainly surrounded by the class "building", it is more likely that the car occludes a building than a road, even if roads are usually the most probable occluded class.

To take the context into account, the semantic graph associated with the annotated image is used. For a dynamic area in the image, the semantic graph gives the adjacent semantic areas, constituting the neighbors, denoted as N. Each node n_i in a semantic graph is characterized by a fitted ellipsoid f_i describing its shape [7], whose parameters are presented in section II-A.

To model the probability of associating a static label with a pixel p = (x_p, y_p), a Gaussian function is associated with each neighbor node n_i ∈ N. It takes the general form:
F_i(x_p, y_p) = A_i exp(−(a(x_p − x_i)² + 2b(x_p − x_i)(y_p − y_i) + c(y_p − y_i)²))    (5)

where A_i is the amplitude, set as P(C_i | C_j), (x_i, y_i) is the position of the area in the image, and where:

a = cos²θ / (2σ_x²) + sin²θ / (2σ_y²)    (6)

b = −sin(2θ) / (4σ_x²) + sin(2θ) / (4σ_y²)    (7)

c = sin²θ / (2σ_x²) + cos²θ / (2σ_y²)    (8)

with σ_x = h_i, σ_y = w_i and θ = α_i.

Then for each pixel of the area requiring a new label, the most probable label is computed as follows:

L(C_i) = max_{i ∈ N} F_i(x_p, y_p)    (9)

where L(.) stands for the likelihood.
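Putting Eqs. (5)–(9) together, a minimal sketch of the contextual inference is given below; it reuses the SemanticArea fields from the earlier sketch and takes the occlusion prior P(C_i | C_j) as a dictionary keyed by neighbour label (illustrative names throughout).

```python
# Sketch of the contextual inference of Eqs. (5)-(9): each neighbouring
# area contributes an anisotropic Gaussian F_i whose amplitude A_i is
# the occlusion prior; the pixel takes the label of the strongest one.
import math

def gaussian_response(xp, yp, node, amplitude):
    sx, sy, th = node.h, node.w, node.alpha        # sigma_x, sigma_y, theta
    a = math.cos(th)**2 / (2*sx**2) + math.sin(th)**2 / (2*sy**2)
    b = -math.sin(2*th) / (4*sx**2) + math.sin(2*th) / (4*sy**2)
    c = math.sin(th)**2 / (2*sx**2) + math.cos(th)**2 / (2*sy**2)
    dx, dy = xp - node.x, yp - node.y
    return amplitude * math.exp(-(a*dx*dx + 2*b*dx*dy + c*dy*dy))  # Eq. (5)

def infer_label(xp, yp, neighbours, prior):
    """Eq. (9): label of the neighbour maximising F_i(x_p, y_p)."""
    best = max(neighbours,
               key=lambda n: gaussian_response(xp, yp, n, prior[n.label]))
    return best.label
```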
Using the proposed approach, it is possible to update the map by exploiting both the spatial context and the knowledge acquired along the robot's experience, resulting in a robust and stable representation of the environment. Contrary to many other approaches, we do not need to consider a large set of observations, only a simple and compact model of occlusions.
III. EXPERIMENTS
Our framework has been tested in two ways. First, the
correctness of the class prediction in occluded parts of the
scene has been evaluated by making predictions in areas
where observations of static labels are accessible and used
as ground truth. Then, the usefulness of the approach for
localization is evaluated by comparing similarity scores of
images taken at the same place but at different moments, with and without updating the data. All experiments were performed
using an Intel i7-3840QM CPU at 2.80GHz. All programs
are single-threaded.
The two experiments are realized with a challenging dataset modeling an outdoor environment with forest and building areas at the INRIA Sophia-Antipolis campus. It is composed of hundreds of high-resolution¹ spherical images taken along a 1.6 km pathway. Two sequences have been acquired with our multi-camera stereovision system on the same pathway at two different times, with an interval of three years. The automatic annotation of images produces 9 classes: tree, sky, road, sign, sidewalk, ground sign, building, car, other. The scene parsing stage produces an imperfect labelling, achieving 82% correctly labelled pixels (see more results in [7]). It is important to note that learning is done only on images extracted from the first sequence; changes in illumination are therefore managed by the classification stage only.

¹The full resolution is 2048x665, but we use 1024x333 resolution for classification.

TABLE I
OCCLUSION MODEL

Class         Associated Probability
Sky           0
Building      0.04
Road          0.76
Sidewalk      0.08
Tree          0.11
Signs         0
Ground Signs  0

TABLE II
INFERENCE RESULTS

Class     Score
Building  0.96
Road      0.98
Sidewalk  0.99
Tree      0.97
Global    0.98
A. Prediction Correctness
To estimate the correctness of the predictions, we report two measures. The first one is the global precision of the inference process, given by the number of pixels correctly associated with the class C_i over the total number of pixels labelled as C_i, and denoted as S_global. But this measure alone is not significant, as some classes representing small objects can be ignored without a significant decrease in performance. We therefore also report a per-class unweighted precision score, denoted as S_class. The occlusion model computed from the experiments is reported in table I. The results of the inference process are reported in table II and illustrated in figure 4.
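As a small illustration, the two measures can be computed as below, assuming `pred` and `truth` are the inferred and ground-truth label arrays over the evaluated pixels (names are ours, not the paper's).

```python
# Sketch of the two reported measures: S_global, the overall precision,
# and S_class, an unweighted per-class precision over inferred pixels.
import numpy as np

def precision_scores(pred, truth, classes):
    s_global = float(np.mean(pred == truth))
    s_class = {c: float(np.mean(truth[pred == c] == c))
               for c in classes if np.any(pred == c)}
    return s_global, s_class

pred  = np.array([1, 1, 2, 2, 2, 3])
truth = np.array([1, 1, 2, 2, 1, 3])
print(precision_scores(pred, truth, classes=[1, 2, 3]))
```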
As expected, the occlusion model encodes the common fact that cars are more likely to appear on the road, in front of trees or buildings, than in the sky. The results presented in table II for classes with non-null probabilities show that the inference process is very efficient: almost all pixels are associated with the correct class. The remaining pixels correspond to slight changes in border position. These very good results demonstrate the efficiency of our approach to infer semantics in occluded parts of the scene, and can be explained by two facts. First, taking into account both spatial context and temporal changes allows a very robust model of the world to be built. The ergodic assumption is a very efficient way to compensate for the small number of observations, only two here. The second point is that static

Citations

Hybrid Indoor Navigation assistant for visually impaired people based on fusion of proximity method and pattern recognition algorithm

TL;DR: This paper presents an audio-guided indoor navigation system built into a wearable device, designed to work with a hybrid mapping, that allows blind users safe guided navigation with low computational complexity.

A Guidance System for Blind and Visually Impaired People via Hybrid Data Fusion

TL;DR: An indoor navigation system built into a wearable device is presented that allows visually impaired users to perform guided audio navigation that is safe, fast, and of low computational complexity.

Indoor Navigation Assistant for Visually Impaired by Pedestrian Dead Reckoning and Position Estimative of Correction for Patterns Recognition

TL;DR: A guided-audio indoor navigation system built into a wearable device, designed to work with a hybrid mapping, that allows visually impaired users to perform guided navigation safely and quickly, with low computational complexity.

Semantic Mapping for Autonomous Robots in Urban Environments

TL;DR: The estimation of spatial relations between objects using Probabilistic Logic, including a novel inference method for Markov Logic Networks, and the benefits of combining different sources of semantic information with sensor data are shown in a scene interpretation and a semantic localization task.

Hybrid metric-feature mapping based on camera and Lidar sensor fusion

TL;DR: In this paper, the Lidar and camera measurements are fused to obtain robust motion estimation and construct a hybrid metric-feature map, and a multi-stage loop closure strategy is applied to determine the loop candidate.
References

Semantic representation for navigation in large-scale environments

TL;DR: The capability to infer a route in a global map by using semantics is demonstrated and a new approach to specify paths in terms of high-level robot actions is proposed, which provides robots with the ability to interact with humans in an intuitive way.

A dense map building approach from spherical RGBD images

TL;DR: This work proposes to tackle the pose estimation problem by using both photometric and geometric information in a direct RGBD image registration method and a pose graph representation, whereby, given a database of augmented visual spheres, a travelled trajectory with redundant information is pruned out to a skeletal pose graph.

Fast Hybrid Relocation in Large Scale Metric-Topologic-Semantic Map

TL;DR: This work proposes a robust and efficient algorithm that relies on the MTS-map structure and the semantic description of sub-maps to relocate very fast; it combines the discriminative power of semantics with the robustness of an interpretation tree to compare the graphs very fast and outperform state-of-the-art techniques.

A compact spherical RGBD keyframe-based representation

TL;DR: An environmental representation approach based on hybrid metric and topological maps as a key component for mobile robot navigation is proposed and an uncertainty error model propagation is formulated for outlier rejection and data fusion.