
HAL Id: hal-01237850
https://hal.inria.fr/hal-01237850
Submitted on 3 Dec 2015
Hybrid Metric-Topological-Semantic Mapping in
Dynamic Environments
Romain Drouilly, Patrick Rives, Benoit Morisset
To cite this version:
Romain Drouilly, Patrick Rives, Benoit Morisset. Hybrid Metric-Topological-Semantic Mapping in
Dynamic Environments. IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, IROS’15, Sep
2015, Hamburg, Germany. ⟨hal-01237850⟩

Hybrid Metric-Topological-Semantic Mapping in Dynamic
Environments
Romain Drouilly¹,², Patrick Rives¹, Benoit Morisset²
Abstract— Mapping evolving environments requires an update mechanism to efficiently deal with dynamic objects. In this context, we propose a new approach to update maps of large-scale dynamic environments with semantics. While previous works mainly rely on large amounts of observations, the proposed framework is able to build a stable representation with only two observations of the environment. To do this, scene understanding is used to detect dynamic objects and to recover the labels of the occluded parts of the scene through an inference process which takes into account both spatial context and a class occlusion model. Our method was evaluated on a database acquired at two different times, with an interval of three years, in a large dynamic outdoor environment. The results point out the ability to retrieve the hidden classes with a precision score of 0.98. The performance in terms of localisation is also improved.
I. INTRODUCTION
Lifelong mapping has received an increasing amount of attention in recent years, largely motivated by the growing need to integrate robots into the real world, wherein dynamic objects constantly change the appearance of the scene. A mobile robot operating in such a dynamic world should not only be able to build a map of the observed environment at a specific moment, but also to keep this map consistent over a long period of time. It has to deal with dynamic changes that can cause the navigation process to fail. However, updating the map is particularly challenging in large-scale environments. To identify changes, robots have to keep a memory of the previous states of the environment; the more dynamic it is, the higher the number of states to manage and the more computationally intensive the updating process. Mapping large-scale dynamic environments is thus particularly difficult, as the map size can be arbitrarily large. Additionally, mapping the whole environment many times is not always possible or convenient, so methods using only a small number of observations are advantageous. The idea exploited in this paper is to use scene understanding to retrieve a stable world model with only two acquisition sequences.
Previous mapping strategies developed for dynamic envi-
ronments can be grouped into a few categories. The first group
concerns methods that remove dynamic objects in order to
achieve a stable representation [1]. A strategy to identify
dynamic objects in the scene and map them in a separate
occupancy grid is proposed in [2].

*This work was supported by ECA Robotics.
¹The authors are with team Lagadic at INRIA Méditerranée, France. romain.drouilly@inria.fr, patrick.rives@inria.fr
²The authors are with ECA Robotics. bmo@eca.fr

Fig. 1. Map structure. Spherical images augmented with depth are captured while the robot explores its environment. Then each spherical image is automatically annotated. Finally, a semantic graph is extracted from the annotated image.

Those methods require the identification of moving objects, which makes them better suited
for fast dynamic changes. Another approach that does not
require explicitly identifying moving objects is presented in [3]. It consists of maintaining several maps acquired at different time scales and selecting the most suitable one at a given
time for navigation by checking its consistency with the
current observations. The main drawback of this approach is
the large amount of data needed, five times as big as a single
map, which prevents its use for large-scale environments
where the size of the map is already a problem. A third group
of methods assumes that mapping is a never-ending process and continuously updates the data. The biologically inspired
algorithm RATSLAM is used in [4] to perform persistent
mapping at the cost of an increasing map size. Similarly,
a hybrid metric-topological map is proposed in [5], where a model based on human memory is used to update
a feature-based description of spherical views constituting
the map. The update process consists of adding information about stable features and removing features that no longer
exist in spherical images. Once again, the major issue is to
deal with the large amount of data. A last approach consists in transposing the problem into another space where changes are easier to handle and to model. An example, introduced in [6], is the use of spectral analysis to model the dynamics of the environment. However, identifying dynamic behavior once again requires a large amount of observations, which is the main problem of most of the previous approaches, making them unsuitable for large-scale environments where the amount of data for a single mapping session is already significant.

Fig. 2. Example of a semantic graph extracted from an annotated image.
In this paper we propose a framework adapted to large-scale outdoor environments that relies on the hybrid Metric-Topological-Semantic (MTS) map introduced in our previous work [7]. We show how our mapping approach can naturally handle changes in the environment for both mapping and localisation through the use of semantic maps. Instead of directly updating the low-level layers of the map, we update semantic data and generalise changes observed over space to changes over time using the ergodic assumption.
II. PROPOSED APPROACH
In this work we propose a new scheme to update the map
of large-scale dynamic environments using a stable represen-
tation based on semantic information. While many mapping systems rely on a representation based on low-level feature descriptors, our method only requires a semantic description of a set of reference images [7], [8]. Consequently, the changes occurring in the scene will be coded directly in the semantic layer of the map. In the presented approach, instead of stacking all perceptions at a very high computational cost, a compact representation of changes is built. More precisely, the set of possible classes is split into two groups, namely dynamic and static classes. Then, changes are modeled in terms of static-class occlusions due to dynamic objects. This model takes the form of a probability distribution that encodes the risk that a given dynamic class occludes a static class. It is built from the observation of occlusions over space and time and is used in conjunction with contextual information to infer the semantics of occluded areas. For the sake of completeness, we briefly review in the next two sections our MTS map structure, illustrated in figure 1, and
our localisation strategy, previously introduced in [7]. Then
the proposed map update framework is detailed.
A. MTS Map Structure
Our map consists of a set of spherical RGB images augmented by depth and semantic data, as illustrated in fig. 1. Those images, so-called reference images, are multi-layer local submaps of the environment perceived from a particular viewpoint. The metric layer is built from data acquired with
Fig. 3. Illustration of the registration process (spheres I_s, I*_s and pose T̂T(x)). The current sphere is in blue and the reference spheres are in red.
a multi-camera stereovision system previously described in [9]. The sensor consists of two superimposed rings of three cameras, allowing both photometric and geometric data to be captured over a 360° field of view. Each pixel is assigned a semantic label through a two-step process, as presented in [7]. First, Random Forests are used to classify SIFT-based descriptors extracted densely from images. Then a Fully Connected Conditional Random Field is used to model the neighborhood, and an efficient inference method [10] allows us to correct the labels using spatial context.
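As an illustration of this two-step labelling, the sketch below shows only the first, classification stage, using scikit-learn's RandomForestClassifier; the feature arrays, image size, and class count are placeholders of our own, and the fully connected CRF refinement of [10] is merely indicated in a comment, not implemented.

```python
# Minimal sketch of the per-pixel labelling stage (not the authors' code).
# Dense per-pixel descriptors are assumed precomputed; random arrays
# stand in for the SIFT-based features used in the paper.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

H, W, D = 50, 100, 128                        # placeholder image size, descriptor dim
train_feats = np.random.rand(2000, D)         # placeholder training descriptors
train_labels = np.random.randint(0, 9, 2000)  # 9 classes, as in Sec. III

rf = RandomForestClassifier(n_estimators=50).fit(train_feats, train_labels)

pixel_feats = np.random.rand(H * W, D)        # descriptors of one test image
unary = rf.predict_proba(pixel_feats)         # per-pixel class probabilities
labels = rf.classes_[unary.argmax(axis=1)].reshape(H, W)
# A fully connected CRF (e.g. the inference of [10]) would then refine
# `labels` using these unary potentials plus spatial/appearance context.
```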
From these annotated images, semantic graphs are extracted, as illustrated in figure 2. Let A_i be a group of contiguous pixels with the same label in the annotated image, called a semantic area. A semantic graph is denoted as g = {A, E}, where A is the set of semantic areas and E is the set of edges encoding their adjacency in the annotated image. Each A_i ∈ A is characterized by a fitted ellipsoid envelope f_i = {x_i, y_i, h_i, w_i, α_i}, where (x_i, y_i) is the position of the ellipse, (h_i, w_i) its main axes and α_i its orientation.
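A minimal sketch of this graph structure is given below (illustrative Python, not the authors' implementation): nodes store the label and the fitted-ellipse parameters f_i, and edges store adjacency between areas.

```python
# Illustrative semantic-graph structure: nodes are semantic areas with a
# fitted ellipse f_i = {x_i, y_i, h_i, w_i, alpha_i}; edges encode
# adjacency between areas in the annotated image.
from dataclasses import dataclass

@dataclass
class SemanticArea:
    label: str        # e.g. "road", "building"
    x: float          # ellipse centre (image coordinates)
    y: float
    h: float          # main-axis lengths
    w: float
    alpha: float      # orientation

class SemanticGraph:
    def __init__(self):
        self.areas = []        # set A of semantic areas
        self.edges = set()     # set E of adjacency pairs (node indices)

    def add_area(self, area):
        self.areas.append(area)
        return len(self.areas) - 1

    def connect(self, i, j):
        self.edges.add((min(i, j), max(i, j)))

    def neighbors(self, i):
        return [b if a == i else a for a, b in self.edges if i in (a, b)]
```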
Semantic graphs are powerful local representations of the environment, as they encode both the scene structure and a high-level description of the context in a very compact way. The localisation strategy strongly relies on those graphs. At a larger scale, all submaps are positioned in the scene thanks to a dense visual odometry method presented in [11] and constitute a global graph of the environment.
B. Localization in MTS Map
Localization in the MTS map is a coarse-to-fine two-step process. For a given image of the current scene, a similar submap is efficiently retrieved in the global graph using semantics. The semantic graph g_cur extracted from the current annotated image is compared to the N semantic graphs g_i of the map using an Interpretation Tree. This algorithm allows graphs to be compared efficiently using both node appearance and neighborhoods. Nodes of the semantic graphs with similar labels are matched pairwise using unary constraints, capturing their intrinsic properties (h_i, w_i, α_i), and pairwise constraints, capturing the context formed by the nearest neighbors in the graph.
Matching graphs allows a similarity score, denoted as σ, to be computed between two semantic graphs G_1 and G_2. It is measured as follows:

σ(G_1, G_2) = exp(N_m / N)    (1)

where N_m is the number of nodes matched between the two graphs, denoted as A_12 = A_{G_1} ∩ A_{G_2}, and N is the total number of nodes in the current semantic graph. The submap with the highest score σ corresponds to the most probable closest location.
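Reading Eq. (1) as σ = exp(N_m / N), a minimal sketch of the score is shown below; the node matching itself is assumed already done by the interpretation tree.

```python
# Sketch of the similarity score of Eq. (1), assuming the interpretation
# tree has already produced the set of matched node pairs A_12.
import math

def similarity(n_matched, n_total):
    """sigma(G1, G2) = exp(N_m / N): N_m matched nodes, N nodes in g_cur."""
    return math.exp(n_matched / n_total)

# e.g. 8 of the 10 nodes of the current graph matched a candidate submap:
print(similarity(8, 10))   # ~2.23; the submap with the highest score wins
```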
Once the submap corresponding to the closest position is
retrieved, a dense registration method between the submap
and the current spherical image, described in [12], is applied
to refine the pose estimate locally (see figure 3). Pose esti-
mation between a current spherical image I_s and a retrieved spherical image I*_s is done using robust minimization techniques. Following the formulation of [13], the cost function for optimising intensity errors between spheres {I_s, I*_s} is given as:

F_I = (1/2) Σ_i^k Ψ_hub( ‖ I_s(ω(T̂T(x); P_i)) − I*_s(ω(I; P_i)) ‖² ),    (2)

where ω(.) is the warping function that projects a 3-D point P_i onto the sphere given a pose T. The pose T̂T(x) is an approximation of the true transformation T(x̃), and Ψ_hub is a robust weighting function on the error given by Huber's M-estimator [14].
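The sketch below illustrates one evaluation of this cost under simplifying assumptions of ours: the warped intensities are taken as given (the warping ω and the pose optimisation loop are omitted), and the Huber M-estimator is applied as a per-pixel weight; names and the threshold value are illustrative.

```python
# Illustrative sketch of the robust photometric cost of Eq. (2):
# Huber-weighted squared intensity errors between the current sphere
# and the reference sphere warped by the current pose estimate.
import numpy as np

def huber_weight(residual, k=1.345):
    """Per-pixel weight from Huber's M-estimator."""
    a = np.abs(residual)
    return np.where(a <= k, 1.0, k / np.maximum(a, 1e-12))

def photometric_cost(i_cur_warped, i_ref):
    """F_I = 0.5 * sum_i w_i * e_i^2 over pixel intensity errors e_i."""
    e = i_cur_warped - i_ref            # intensity residuals
    return 0.5 * np.sum(huber_weight(e) * e**2)

# Toy example: two nearly identical spheres flattened to intensity arrays.
print(photometric_cost(np.array([0.2, 0.5, 0.9]), np.array([0.2, 0.4, 0.1])))
```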
C. Updating the Maps
There are two main types of changes occurring in a dynamic environment: those due to dynamic objects and those caused by illumination changes (day/night). As we are focused on life-long mapping, we consider here only the changes in the scene due to occlusions caused by dynamic objects. This choice is motivated by the fact that, in a semantic approach, robustness to illumination changes should be more easily handled by the classification process, which is beyond the scope of this article.
A dynamic class, denoted as C_D, occludes a static class C_S by changing the label associated with the corresponding pixels in the image. The objective is to identify the parts of the images corresponding to dynamic classes and to retrieve the static class occluded by the dynamic object in order to achieve a stable representation. Let us consider two different cases, depending on whether the scene is re-observed or not.
1) Map updates in re-observed areas: When the robot navigates its environment, it can observe the same place several times. Dynamic objects may have moved, so previously occluded areas can be observed. These observations are used to build a more stable semantic representation by replacing previously occluded areas in the reference annotated image with the labels provided by the new observation. To do this, the pose between the two images is computed using dense matching techniques. Then a semantic warping function is used to project labels of the new observation onto the reference annotated image.
Let I*_s be the reference annotated image of size m × n from which we want to compute the stable representation. A pixel in I*_s is identified by its position p* = (u, v), where u ∈ [0, m[ and v ∈ [0, n[. A 3D point in Euclidean space is defined as P* = {p*, Z, l}, where Z ∈ R+ is the depth expressed in the image reference frame and l ∈ L_S the associated label. Let I_i be another observation of the same scene viewed from the position T(x)_i, where T(x)_i is expressed in the reference frame of I*_s.

It is possible to synthesize a new annotated image, denoted as I'_i, from the labels L(p) of I_i at the position of the reference image using the warping function:

p*_ω = ω(T(x)_i; Z, l, p*)    (3)

where the function ω(.) lifts the pixel p* from the reference image to the new observed image using the rigid transform T(x)_i followed by a spherical projection. The projected point does not correspond exactly to a pixel position, so a nearest-neighbor interpolation is used to select the corresponding label. Finally, if a given pixel p_i is associated with a class C_j ∈ C_D in I*_s and with a class C_k ∈ C_S in I'_i, its class is set to C_k.
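The following sketch summarises this re-observed-area update under stated assumptions: `warp` stands in for the function ω of Eq. (3) with pose, depth, and nearest-neighbour lookup already folded in, labels are integer class indices, and the dynamic-class set is hypothetical.

```python
# Sketch of the re-observed-area update (illustrative, not the authors'
# code): reference pixels carrying a dynamic label that the warped new
# observation sees as static are overwritten with the static label.
import numpy as np

DYNAMIC = {7, 8}   # hypothetical indices of dynamic classes, e.g. "car"

def update_reference(ref_labels, obs_labels, warp):
    out = ref_labels.copy()
    h, w = ref_labels.shape
    for u in range(h):
        for v in range(w):
            uo, vo = warp(u, v)            # omega: lift (u, v) into I_i
            c_ref, c_obs = ref_labels[u, v], obs_labels[uo, vo]
            if c_ref in DYNAMIC and c_obs not in DYNAMIC:
                out[u, v] = c_obs          # occluded static class recovered
    return out
```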
2) Map update in unobserved areas: The warping function allows a partial update of the map where an additional observation provides information about the underlying static classes. But some areas may remain unobserved if dynamic objects occlude the same part of the scene in both mapping sequences.
To deal with these areas, we need to do semantic in-
painting: unobserved areas are treated as holes that have
to be filled with static class labels. A model of occlusions
occurring in the scene is computed to infer the label of the
pixels that remain occluded by dynamic objects. It relies on the ergodic assumption, which states that the average behavior of dynamic objects over time is essentially the same as the average behavior of dynamic objects over space. More precisely, this assumption states that we can generalize occlusions observed over the mapping sequences to unobserved areas which remain occluded in other parts of the map. Practically, we compute a model that describes which static class is likely to be occluded by a given dynamic class. For example, the dynamic class "pedestrian" is more likely to occlude the static class "sidewalk" than "sky", just as the class "car" is likely to occlude "road". This model takes the form of a probability distribution of the existence of an underlying static class C_i given the observation of a dynamic class C_j:

P(C_i | C_j) = O(C_i, C_j) / N    (4)
where N is the total number of pixels initially associated with a dynamic label, and O(C_i, C_j) is the number of pixels initially labelled as C_j and corrected to C_i using the warping function. This model is computed thanks to observations made over several acquisition sequences. It is not necessary to remember specific observations, only the model, which is extremely compact.
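A sketch of how such a model can be accumulated from the corrected labels is shown below; names are ours, and `before`/`after` denote the label images prior to and after the warping-based correction.

```python
# Sketch of the occlusion model of Eq. (4): count how often each dynamic
# class C_j was corrected to each static class C_i, normalised by N, the
# number of pixels initially carrying a dynamic label.
import numpy as np

def occlusion_model(before, after, dynamic_classes, n_classes):
    counts = np.zeros((n_classes, n_classes))      # O(C_i, C_j)
    mask = np.isin(before, list(dynamic_classes))  # pixels observed as dynamic
    n = max(int(mask.sum()), 1)                    # N
    for cj, ci in zip(before[mask], after[mask]):
        counts[ci, cj] += 1
    return counts / n                              # P(C_i | C_j), Eq. (4)
```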
However, this model is not sufficient to correctly estimate the occluded classes, because it only takes into account statistics over time. It is necessary to take the spatial context into account to estimate the probability that a given static class is occluded by a dynamic class. For example, if a dynamic object of type "car" is mainly surrounded by the class "building", it is more likely that the car occludes a building than a road, even if roads are usually the most probable occluded class.

To take the context into account, the semantic graph associated with the annotated image is used. For a dynamic area in the image, the semantic graph gives the adjacent semantic areas, constituting the neighbors, denoted as N. Each node n_i in a semantic graph is characterized by a fitted ellipsoid f_i describing its shape [7], whose parameters are presented in section II-A.

To model the probability of associating a static label with a pixel p = (x_p, y_p), a Gaussian function is associated with each neighbor node n_i ∈ N. It takes the general form:
F_i(x_p, y_p) = A_i exp(−(a(x_p − x_i)² + 2b(x_p − x_i)(y_p − y_i) + c(y_p − y_i)²))    (5)

where A_i is the amplitude, set as P(C_i | C_j), (x_i, y_i) is the position of the area in the image, and where:

a = cos²θ / (2σ_x²) + sin²θ / (2σ_y²)    (6)

b = −sin(2θ) / (4σ_x²) + sin(2θ) / (4σ_y²)    (7)

c = sin²θ / (2σ_x²) + cos²θ / (2σ_y²)    (8)

with σ_x = h_i, σ_y = w_i and θ = α_i.

Then for each pixel of the area requiring a new label, the most probable label is computed as follows:

L(C_i) = max_{i ∈ N} F_i(x_p, y_p)    (9)

where L(.) stands for the likelihood.
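Putting Eqs. (5)–(9) together, a minimal sketch of the contextual inference is given below; it reuses the SemanticArea fields from the earlier sketch and takes the occlusion prior P(C_i | C_j) as a dictionary keyed by neighbour label (illustrative names throughout).

```python
# Sketch of the contextual inference of Eqs. (5)-(9): each neighbouring
# area contributes an anisotropic Gaussian F_i whose amplitude A_i is
# the occlusion prior; the pixel takes the label of the strongest one.
import math

def gaussian_response(xp, yp, node, amplitude):
    sx, sy, th = node.h, node.w, node.alpha        # sigma_x, sigma_y, theta
    a = math.cos(th)**2 / (2*sx**2) + math.sin(th)**2 / (2*sy**2)
    b = -math.sin(2*th) / (4*sx**2) + math.sin(2*th) / (4*sy**2)
    c = math.sin(th)**2 / (2*sx**2) + math.cos(th)**2 / (2*sy**2)
    dx, dy = xp - node.x, yp - node.y
    return amplitude * math.exp(-(a*dx*dx + 2*b*dx*dy + c*dy*dy))  # Eq. (5)

def infer_label(xp, yp, neighbours, prior):
    """Eq. (9): label of the neighbour maximising F_i(x_p, y_p)."""
    best = max(neighbours,
               key=lambda n: gaussian_response(xp, yp, n, prior[n.label]))
    return best.label
```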
Using the proposed approach, it is possible to update the map by exploiting both the spatial context and the knowledge acquired along the robot's experience, resulting in a robust and stable representation of the environment. Contrary to many other approaches, we do not need to consider a large set of observations, only a simple and compact model of occlusions.
III. EXPERIMENTS
Our framework has been tested in two ways. First, the
correctness of the class prediction in occluded parts of the
scene has been evaluated by making predictions in areas
where observations of static labels are accessible and used
as ground truth. Then, the usefulness of the approach for
localization is evaluated by comparing similarity scores of
images taken at the same place but at different moments, with and without updating the data. All experiments were performed
using an Intel i7-3840QM CPU at 2.80GHz. All programs
are single-threaded.
The two experiments are realized with a challenging dataset modeling an outdoor environment with forest and building areas at the INRIA Sophia-Antipolis campus. It is composed of hundreds of high-resolution¹ spherical images taken along a 1.6 km pathway. Two sequences have been acquired with our multi-camera stereovision system on the same pathway at two different times, with an interval of three years. The automatic annotation of images produces 9 classes: tree, sky, road, sign, sidewalk, ground sign, building, car, other. The scene parsing stage produces an imperfect labelling, achieving 82% correctly labelled pixels (see more results in [7]). It is important to note that learning is done only on images extracted from the first sequence; changes in illumination are therefore managed by the classification stage only.

¹The full resolution is 2048x665, but we use 1024x333 resolution for classification.

TABLE I
OCCLUSION MODEL

Class         Associated Probability
Sky           0
Building      0.04
Road          0.76
Sidewalk      0.08
Tree          0.11
Signs         0
Ground Signs  0

TABLE II
INFERENCE RESULTS

Class     Score
Building  0.96
Road      0.98
Sidewalk  0.99
Tree      0.97
Global    0.98
A. Prediction Correctness
To estimate the correctness of the predictions, we report two measures. The first one is the global precision of the inference process, given by the number of pixels correctly associated with the class C_i over the total number of pixels labelled as C_i, and denoted as S_global. But this measure alone is not significant, as some classes representing small objects can be ignored without a significant decrease in performance. We therefore also report a per-class unweighted precision score, denoted as S_class. The occlusion model computed from the experiments is reported in table I. The results of the inference process are reported in table II and illustrated in figure 4.
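As a small illustration, the two measures can be computed as below, assuming `pred` and `truth` are the inferred and ground-truth label arrays over the evaluated pixels (names are ours, not the paper's).

```python
# Sketch of the two reported measures: S_global, the overall precision,
# and S_class, an unweighted per-class precision over inferred pixels.
import numpy as np

def precision_scores(pred, truth, classes):
    s_global = float(np.mean(pred == truth))
    s_class = {c: float(np.mean(truth[pred == c] == c))
               for c in classes if np.any(pred == c)}
    return s_global, s_class

pred  = np.array([1, 1, 2, 2, 2, 3])
truth = np.array([1, 1, 2, 2, 1, 3])
print(precision_scores(pred, truth, classes=[1, 2, 3]))
```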
As expected, the occlusion model encodes the common fact that cars are more likely to appear on the road, in front of trees or buildings, than in the sky. The results presented in table II for classes with non-null probabilities show that the inference process is very efficient: almost all pixels are associated with the correct class. The remaining pixels correspond to slight changes in border position. These very good results demonstrate the efficiency of our approach to infer semantics in occluded parts of the scene, and can be explained by two facts. First, taking into account both spatial context and temporal changes allows a very robust model of the world to be built. The ergodic assumption is a very efficient way to compensate for the small number of observations, only two here. The second point is that static

Citations

Hybrid Indoor Navigation assistant for visually impaired people based on fusion of proximity method and pattern recognition algorithm

TL;DR: This paper presents an audio-guided indoor navigation system built into a wearable device, designed to work with a hybrid mapping, that allows blind users safe guided navigation with low computational complexity.

A Guidance System for Blind and Visually Impaired People via Hybrid Data Fusion

TL;DR: An indoor navigation system built into a wearable device is presented that allows visually impaired users to perform guided audio navigation that is safe, fast, and of low computational complexity.

Indoor Navigation Assistant for Visually Impaired by Pedestrian Dead Reckoning and Position Estimative of Correction for Patterns Recognition

TL;DR: A guided-audio indoor navigation system built into a wearable device, designed to work with a hybrid mapping, that allows visually impaired users to perform guided navigation safely and quickly, with low computational complexity.

Semantic Mapping for Autonomous Robots in Urban Environments

TL;DR: The estimation of spatial relations between objects using Probabilistic Logic, including a novel inference method for Markov Logic Networks, and the benefits of combining different sources of semantic information with sensor data are shown in a scene interpretation and a semantic localization task.

Hybrid metric-feature mapping based on camera and Lidar sensor fusion

TL;DR: In this paper, the Lidar and camera measurements are fused to obtain robust motion estimation and construct a hybrid metric-feature map, and a multi-stage loop closure strategy is applied to determine the loop candidate.
References

Semantic representation for navigation in large-scale environments

TL;DR: The capability to infer a route in a global map by using semantics is demonstrated and a new approach to specify paths in terms of high-level robot actions is proposed, which provides robots with the ability to interact with humans in an intuitive way.

A dense map building approach from spherical RGBD images

TL;DR: This work proposes to tackle the pose estimation problem by using both photometric and geometric information in a direct RGBD image registration method and a pose graph representation, whereby, given a database of augmented visual spheres, a travelled trajectory with redundant information is pruned out to a skeletal pose graph.

Fast Hybrid Relocation in Large Scale Metric-Topologic-Semantic Map

TL;DR: This work proposes a robust and efficient algorithm that relies on the MTS-map structure and the semantic description of sub-maps to relocate very fast; it combines the discriminative power of semantics with the robustness of an interpretation tree to compare the graphs very fast and outperform state-of-the-art techniques.

A compact spherical RGBD keyframe-based representation

TL;DR: An environmental representation approach based on hybrid metric and topological maps as a key component for mobile robot navigation is proposed and an uncertainty error model propagation is formulated for outlier rejection and data fusion.