Proceedings ArticleDOI

Hybrid metric-topological-semantic mapping in dynamic environments

TL;DR: This work proposes a new approach to update maps pertaining to large-scale dynamic environments with semantics, which is able to build a stable representation with only two observations of the environment.
Abstract: Mapping evolving environments requires an update mechanism to deal efficiently with dynamic objects. In this context, we propose a new approach to update maps of large-scale dynamic environments with semantics. While previous works mainly rely on a large amount of observations, the proposed framework is able to build a stable representation with only two observations of the environment. To do this, scene understanding is used to detect dynamic objects and to recover the labels of the occluded parts of the scene through an inference process that takes into account both spatial context and a class occlusion model. Our method was evaluated on a database acquired at two different times, with an interval of three years, in a large dynamic outdoor environment. The results show the ability to retrieve the hidden classes with a precision score of 0.98. Performance in terms of localisation is also improved.

Summary (3 min read)

1. Introduction

  • Structure from motion (SfM) is a long standing task in computer vision.
  • The network estimates the depth in the first image and the camera motion.
  • This potential is indicated by their results for the two-frame scenario, where the learning approach clearly outperforms traditional methods.
  • Single-image methods have more problems generalizing to previously unseen types of images.
  • The key to the problem is an architecture that alternates optical flow estimation with the estimation of camera motion and depth; see Fig.

3. Network Architecture

  • The overall network architecture is shown in Fig. 2. DeMoN is a chain of encoder-decoder networks solving different tasks.
  • The last component is a single encoder-decoder network that generates the final upsampled and refined depth map.
  • Likewise, the authors convert the optical flow to a depth map using the previous camera motion prediction and pass it along with the optical flow to the second encoder-decoder.
  • The improvements largely saturate after 3 or 4 iterations.
  • The authors also train the first iteration on its own, but then train all iterations jointly which avoids intermediate storage.

4. Depth and Motion Parameterization

  • The network computes the depth map in the first view and the camera motion to the second view.
  • The translation t is given in Cartesian coordinates.
  • The bootstrap net fails to accurately estimate the scale of the depth.
  • The iterations refine the depth prediction and strongly improve the scale of the depth values.
  • Images show the x component of the optical flow for better visibility.

5.1. Loss functions

  • The network estimates outputs of very different nature: high-dimensional (per-pixel) depth maps and low-dimensional camera motion vectors.
  • The authors apply point-wise losses to their outputs: inverse depth ξ, surface normals n, optical flow w, and optical flow confidence c.
  • Note that the authors apply the predicted scale s to the predicted values ξ.
  • The authors use a minimal parameterization of the camera motion with 3 parameters for rotation r and translation t each.
  • It emphasizes depth discontinuities, stimulates sharp edges in the depth map and increases smoothness within homogeneous regions as seen in Fig. 10.

5.2. Training Schedule

  • The network training is based on the Caffe framework [20].
  • The whole training procedure consists of three phases.
  • First, the authors sequentially train the four encoder-decoder components in both bootstrap and iterative nets for 250k iterations each with a batch size of 32.
  • The outputs of the previous three network iterations are added to the batch, which yields a total batch size of 32 for the iterative network.

6.1. Datasets

  • SUN3D [43] provides a diverse set of indoor images together with depth and camera pose.
  • Depth maps are disturbed by measurement noise, and the authors use the same preprocessing as for SUN3D.
  • Scenes11 is a synthetic dataset with generated images of virtual scenes with random geometry, which provide perfect depth and motion ground truth, but lack realism.
  • The authors did not train on NYU and used the same test split as in Eigen et al. [7].
  • Thus, the authors automatically chose the next image that is sufficiently different from the first image according to a threshold on the difference image.

6.2. Error metrics

  • While single-image methods aim to predict depth at the actual physical scale, two-image methods typically yield the scale relative to the norm of the camera translation vector.
  • Comparing the results of these two families of methods requires a scale-invariant error metric.
  • The inverse-depth error is $\frac{1}{n}\sum_i \left|\frac{1}{z_i} - \frac{1}{\hat{z}_i}\right|$ (10); L1-rel computes the depth error relative to the ground truth depth and therefore reduces errors where the ground truth depth is large and increases the importance of close objects in the ground truth (a sketch of these metrics follows this list).
  • The length of the translation vector is 1 by definition.
  • The accuracy of optical flow is measured by the average endpoint error (EPE), that is, the Euclidean norm of the difference between the predicted and the true flow vector, averaged over all image pixels.
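
To make these metrics concrete, here is a minimal sketch in Python, assuming dense depth maps and flow fields as NumPy arrays; the function names are illustrative, not from the paper.

```python
import numpy as np

def l1_inv(z_gt: np.ndarray, z_pred: np.ndarray) -> float:
    """Error on inverse depth, (1/n) * sum |1/z_i - 1/z_hat_i| (Eq. 10)."""
    return float(np.mean(np.abs(1.0 / z_gt - 1.0 / z_pred)))

def l1_rel(z_gt: np.ndarray, z_pred: np.ndarray) -> float:
    """Depth error relative to ground truth: down-weights distant structures
    and emphasizes close objects."""
    return float(np.mean(np.abs(z_gt - z_pred) / z_gt))

def epe(flow_gt: np.ndarray, flow_pred: np.ndarray) -> float:
    """Average endpoint error: Euclidean norm of the flow difference,
    averaged over all pixels (arrays of shape H x W x 2)."""
    return float(np.mean(np.linalg.norm(flow_gt - flow_pred, axis=-1)))
```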

6.3. Comparison to classic structure from motion

  • The authors compare to several strong baselines that they implemented from state-of-the-art components (“Base-*”).
  • The essential matrix is computed with RANSAC and the 5-point algorithm [31] for both.
  • Tab. 2 shows that DeMoN outperforms all baseline methods both on motion and depth accuracy by a factor of 1.5 to 2 on most datasets.
  • The depth prediction of the first frame is shown.
  • Higher resolution gives the Base-* methods an advantage in depth accuracy, but on the other hand these methods are more prone to outliers.

6.4. Comparison to depth from single image

  • To demonstrate the value of the motion parallax, the authors additionally compare to the single-image depth estimation methods by Eigen & Fergus [7] and Liu et al. [24].
  • The Base-Oracle prediction on NYUv2 is missing because the motion ground truth is not available.
  • Results on more methods and examples are shown in the supplementary material.
  • Two models by Liu et al. were tested: one trained on indoor scenes from the NYUv2 dataset (“indoor”) and another trained on outdoor images from the Make3D dataset [32] (“outdoor”).
  • On all but one dataset, DeMoN also outperforms the single-frame methods numerically, typically by a large margin.

6.4.1 Generalization to new data

  • Scene-specific priors learned during training may be useless or even harmful when being confronted with a scene that is very different from the training data.
  • In contrast, the geometric relations between a pair of images are independent of the content of the scene and should generalize to unknown scenes.
  • Single-frame methods have severe problems in such cases, as most clearly visible in the point cloud visualization of the depth estimate for the last example.
  • Fig. 9 and Tab. 3 show that DeMoN, as expected, generalizes better to these unexpected scenes than single-image methods.
  • It shows that the network has learned to make use of the motion parallax.

6.5. Ablation studies

  • The authors' architecture contains several design decisions that they justify with the following ablation studies.
  • All results have been obtained on the SUN3D dataset with the bootstrap net.
  • Interestingly, while the scale-invariant loss greatly improves the prediction qualitatively (see Fig. 10), it has negative effects on depth scale estimation.
  • (a) Just an L1 loss on the absolute depth values.
  • Tab. 5 shows that, given the same flow, egomotion estimation improves when the flow confidence is provided as an extra input.


HAL Id: hal-01237850
https://hal.inria.fr/hal-01237850
Submitted on 3 Dec 2015
Hybrid Metric-Topological-Semantic Mapping in Dynamic Environments
Romain Drouilly, Patrick Rives, Benoit Morisset
To cite this version:
Romain Drouilly, Patrick Rives, Benoit Morisset. Hybrid Metric-Topological-Semantic Mapping in Dynamic Environments. IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, IROS'15, Sep 2015, Hamburg, Germany. hal-01237850

Hybrid Metric-Topological-Semantic Mapping in Dynamic Environments
Romain Drouilly¹,², Patrick Rives¹, Benoit Morisset²
Abstract— Mapping evolving environments requires an update mechanism to deal efficiently with dynamic objects. In this context, we propose a new approach to update maps of large-scale dynamic environments with semantics. While previous works mainly rely on a large amount of observations, the proposed framework is able to build a stable representation with only two observations of the environment. To do this, scene understanding is used to detect dynamic objects and to recover the labels of the occluded parts of the scene through an inference process that takes into account both spatial context and a class occlusion model. Our method was evaluated on a database acquired at two different times, with an interval of three years, in a large dynamic outdoor environment. The results show the ability to retrieve the hidden classes with a precision score of 0.98. Performance in terms of localisation is also improved.
I. INTRODUCTION
Lifelong mapping has received an increasing amount of attention in recent years, largely motivated by the growing need to integrate robots into the real world, wherein dynamic objects constantly change the appearance of the scene. A mobile robot evolving in such a dynamic world should not only be able to build a map of the observed environment at a specific moment, but also to keep this map consistent over a long period of time. It has to deal with dynamic changes that can cause the navigation process to fail. However, updating the map is particularly challenging in large-scale environments. To identify changes, robots have to keep a memory of the previous states of the environment: the more dynamic the environment, the higher the number of states to manage and the more computationally intensive the updating process. Mapping large-scale dynamic environments is thus particularly difficult, as the map size can be arbitrarily large. Additionally, mapping the whole environment many times is not always possible or convenient, and we could take advantage of methods using only a small number of observations. The idea exploited in this paper is to use scene understanding to retrieve a stable world model with only two acquisition sequences.
Previous mapping strategies developed for dynamic environments can be grouped into a few categories. The first group concerns methods that remove dynamic objects in order to achieve a stable representation [1]. A strategy to identify dynamic objects in the scene and map them in a separate occupancy grid is proposed in [2]. Those methods require identifying moving objects, which makes them better suited for fast dynamic changes.

*This work was supported by ECA Robotics
¹The authors are with team Lagadic at INRIA Méditerranée, France. romain.drouilly@inria.fr, patrick.rives@inria.fr
²The authors are with ECA Robotics. bmo@eca.fr

Fig. 1. Map structure. Spherical images augmented with depth are captured while the robot explores its environment. Then each spherical image is automatically annotated. Finally, a semantic graph is extracted from the annotated image.

Another approach, which does not require explicitly identifying moving objects, is presented in [3]. It consists in maintaining several maps acquired at different time scales and selecting the best adapted one at a given time for navigation by checking its consistency with the current observations. The main drawback of this approach is the large amount of data needed, five times as much as a single map, which prevents its use in large-scale environments where the size of the map is already a problem. A third group
of methods assumes that mapping is a never-ending process and continuously updates data. The biologically inspired algorithm RATSLAM is used in [4] to perform persistent mapping at the cost of an increasing map size. Similarly, a hybrid metric-topological map is proposed in [5], where a model based on human memory is used to update a feature-based description of the spherical views constituting the map. The update process consists in adding information about stable features and removing features that no longer exist in the spherical images. Once again, the major issue is dealing with the large amount of data. A last approach consists in transposing the problem into another space where changes are easier to handle and to model.

Fig. 2. Example of a semantic graph extracted from an annotated image.

An example, introduced
in [6], is the use of spectral analysis to model dynamics of
the environment. However, identifying dynamic behavior once again requires a large amount of observations, which is the main problem of most previous approaches, making them unsuitable for large-scale environments where the amount of data for a single mapping session is already significant.
In this paper we propose a framework adapted to large-scale outdoor environments that relies on the hybrid Metric-Topological-Semantic (MTS) map introduced in our previous work [7]. We show how our mapping approach can naturally handle changes in the environment for both mapping and localisation through the use of semantic maps. Instead of directly updating the low-level layers of the map, we update the semantic data and generalise changes observed over space to changes over time using the ergodic assumption.
II. PROPOSED APPROACH
In this work we propose a new scheme to update the map of large-scale dynamic environments using a stable representation based on semantic information. While many mapping systems rely on a representation based on low-level feature descriptors, our method only requires a semantic description of a set of reference images [7], [8]. Consequently, the changes occurring in the scene are coded directly in the semantic layer of the map. In the presented approach, instead of stacking all perceptions at a very high computational cost, a compact representation of changes is built. More precisely, the set of possible classes is split into two groups, namely dynamic and static classes. Changes are then modeled in terms of occlusions of static classes by dynamic objects. This model takes the form of a probability distribution that encodes the risk that a given dynamic class occludes a static class. It is built from the observation of occlusions over space and time and is used in conjunction with contextual information to infer the semantics of occluded areas. For the sake of completeness, we briefly review in the next two sections our MTS map structure, illustrated in figure 1, and our localisation strategy, previously introduced in [7]. Then the proposed map update framework is detailed.
A. MTS Map Structure
Our map consists of a set of spherical RGB images augmented by depth and semantic data, as illustrated in figure 1. Those images, so-called reference images, are multi-layer local submaps of the environment perceived from a particular viewpoint.

Fig. 3. Illustration of the registration process. The current sphere is in blue and the reference spheres are in red.

The metric layer is built from data acquired with a multi-camera stereovision system previously described in [9]. The sensor consists of two superimposed rings of three cameras, allowing the capture of both photometric and geometric data over a 360° field of view. Each pixel is attributed a semantic label through a two-step process, as presented in [7]. First, Random Forests are used to classify SIFT-based descriptors extracted densely from the images. Then a Fully Connected Conditional Random Field is used to model the neighborhood, and an efficient inference method [10] allows us to correct the labels using spatial context. From these annotated images, semantic graphs are extracted, illustrated in figure 2. Let $A_i$ be a group of contiguous pixels with the same label in the annotated image, called a semantic area. A semantic graph is denoted as $g = \{A, E\}$, where $A$ is the set of semantic areas and $E$ the set of edges encoding their adjacency in the annotated image. Each $A_i \in A$ is characterized by a fitted ellipsoid envelope $f_i = \{x_i, y_i, h_i, w_i, \alpha_i\}$, where $(x_i, y_i)$ is the position of the ellipse and $(h_i, w_i, \alpha_i)$ its main axes and orientation respectively.
Semantic graphs are powerful local representations of the environment, as they encode both the scene structure and a high-level description of the context in a very compact way. The
localisation strategy strongly relies on those graphs. At a
larger scale, all submaps are positioned in the scene thanks
to a dense visual odometry method presented in [11] and
constitute a global graph of the environment.
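
To make the map structure concrete, the following is a minimal sketch of the semantic graph described above, using Python dataclasses; the class and field names are illustrative, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class SemanticArea:
    """A node A_i: contiguous pixels sharing one label, with its fitted ellipse f_i."""
    label: str    # semantic class, e.g. "road" or "building"
    x: float      # ellipse centre (x_i, y_i) in the annotated image
    y: float
    h: float      # main axes (h_i, w_i)
    w: float
    alpha: float  # orientation alpha_i

@dataclass
class SemanticGraph:
    """g = {A, E}: the set of semantic areas plus their adjacency edges."""
    areas: list[SemanticArea]
    edges: set[tuple[int, int]]  # index pairs of areas adjacent in the image
```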
B. Localization in MTS Map
Localization in the MTS map is a two-step coarse-to-fine process. For a given image of the current scene, a similar submap is efficiently retrieved in the global graph using semantics. The semantic graph $g_{cur}$ extracted from the current annotated image is compared to the $N$ semantic graphs $g_i$ of the map using an Interpretation Tree. This algorithm allows graphs to be compared efficiently using both node appearance and neighborhood. Nodes of the semantic graphs with similar labels are matched pairwise using unary constraints, capturing their intrinsic properties $(h_i, w_i, \alpha_i)$, and pairwise constraints, capturing the context formed by the nearest neighbors in the graph.
Matching graphs allows a similarity score, denoted as $\sigma$, to be computed between two semantic graphs $G_1$ and $G_2$. It is measured as follows:

$$\sigma(G_1, G_2) = \exp\left(\frac{N_m}{N}\right) \quad (1)$$

where $N_m$ is the number of nodes matched between the two graphs, denoted as $A_{12} = A_{G_1} \cap A_{G_2}$, and $N$ is the total number of nodes in the current semantic graph. The submap with the highest score $\sigma$ corresponds to the most probable closest location.
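
As a rough illustration of equation (1), here is a minimal sketch of the matching score; label equality stands in for the Interpretation Tree with its unary and pairwise constraints, and all names are illustrative.

```python
import math

def similarity(labels_cur: list[str], labels_ref: list[str]) -> float:
    """sigma(G1, G2) = exp(N_m / N): N_m matched nodes, N nodes in the
    current graph. Label equality stands in for the Interpretation Tree's
    unary (h_i, w_i, alpha_i) and pairwise neighborhood checks."""
    pool = list(labels_ref)
    matched = 0
    for label in labels_cur:
        if label in pool:
            pool.remove(label)  # each reference node can be matched only once
            matched += 1
    return math.exp(matched / len(labels_cur))

# Coarse localization: keep the reference submap with the highest score.
# best = max(reference_graphs, key=lambda g: similarity(current_labels, g))
```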
Once the submap corresponding to the closest position is retrieved, a dense registration method between the submap and the current spherical image, described in [12], is applied to refine the pose estimate locally (see figure 3). Pose estimation between a current spherical image $I_s$ and a retrieved spherical image $I_s^*$ is done using robust minimization techniques. Following the formulation of [13], the cost function for optimising intensity errors between the spheres $\{I_s, I_s^*\}$ is given as:

$$F_I = \frac{1}{2}\sum_{i}^{k} \Psi_{hub}\left(\left\| I_s\big(\omega(\widehat{T}T(x); P_i)\big) - I_s^*\big(\omega(I; P_i)\big) \right\|^2\right), \quad (2)$$

where $\omega(\cdot)$ is the warping function that projects a 3-D point $P_i$, given a pose $T$, onto the sphere. The pose $\widehat{T}T(x)$ is an approximation of the true transformation $T(\tilde{x})$ and $\Psi_{hub}$ is a robust weighting function on the error given by Huber's M-estimator [14].
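
For intuition about equation (2), the following is a minimal sketch of a Huber-weighted photometric cost, assuming the intensities have already been sampled at corresponding points by the warping function; the pose optimisation loop and the spherical sampling are omitted, and the threshold is a common default rather than the paper's value.

```python
import numpy as np

def huber_weight(e: np.ndarray, delta: float = 1.345) -> np.ndarray:
    """Huber M-estimator weights: 1 for small residuals, delta/|e| beyond."""
    a = np.abs(e)
    return np.where(a <= delta, 1.0, delta / np.maximum(a, 1e-12))

def photometric_cost(i_cur: np.ndarray, i_ref_warped: np.ndarray) -> float:
    """0.5 * sum of robustly weighted squared intensity errors, in the
    spirit of F_I in equation (2)."""
    e = i_cur - i_ref_warped  # residuals I_s(warped point) - I_s*(point)
    return float(0.5 * np.sum(huber_weight(e) * e**2))
```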
C. Updating the Maps
There are two main types of changes occurring in a dynamic environment: those due to dynamic objects and those caused by illumination changes (day/night). As we focus on life-long mapping, we consider here only the changes in the scene due to occlusions caused by dynamic objects. This choice is motivated by the fact that, in a semantic approach, robustness to illumination changes should be more easily handled by the classification process, which is beyond the scope of this article.

A dynamic class, denoted as $C_D$, occludes a static class $C_S$ by changing the label associated with the corresponding pixels in the image. The objective is to identify the parts of the images corresponding to dynamic classes and to retrieve the static class occluded by the dynamic object in order to achieve a stable representation. Let us consider two different cases, depending on whether the scene is re-observed or not.
1) Map updates in re-observed areas: When the robot navigates its environment, it can observe the same place several times. Dynamic objects may have moved in the meantime, so previously occluded areas can be observed. These observations are used to build a more stable semantic representation by replacing previously occluded areas in the reference annotated image with the labels provided by the new observation. To do this, the pose between the two images is computed using dense matching techniques. Then a semantic warping function is used to project labels of the new observation onto the reference annotated image.

Let $I_s^*$ be the reference annotated image of size $m \times n$ from which we want to compute the stable representation. A pixel in $I_s^*$ is identified by its position $p^* = (u, v)$, where $u \in [0, m[$ and $v \in [0, n[$. A 3D point in Euclidean space is defined as $P^* = \{p^*, Z, l\}$, where $Z \in \mathbb{R}^+$ is the depth expressed in the image reference frame and $l \in L_S$ is the associated label. Let $I_i$ be another observation of the same scene viewed from the position $T(x)_i$, where $T(x)_i$ is expressed in the reference frame of $I_s^*$.

It is possible to synthesize a new annotated image, denoted as $I_i^0$, from the labels $L(p)$ of $I_i$ at the position of the reference image using the warping function:

$$p_\omega = \omega(T(x)_i; Z, l, p^*) \quad (3)$$

where the function $\omega(\cdot)$ lifts the pixel $p^*$ from the reference image to the newly observed image using the rigid transform $T(x)_i$ followed by a spherical projection. The projected point does not correspond exactly to a pixel position, so a nearest-neighbor interpolation is used to select the corresponding label. Finally, if a given pixel $p_i$ is associated with a class $C_j \in C_D$ in $I_s^*$ and with a class $C_k \in C_S$ in $I_i^0$, its class is set to $C_k$.
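
A minimal sketch of this update rule, assuming label maps as 2-D NumPy arrays and that the new observation has already been warped into the reference frame with equation (3); the dynamic class set shown is illustrative.

```python
import numpy as np

DYNAMIC = {"car", "pedestrian"}  # C_D: illustrative dynamic classes

def update_reobserved(ref_labels: np.ndarray,
                      new_labels_warped: np.ndarray) -> np.ndarray:
    """If a pixel is dynamic (C_j in C_D) in the reference image but static
    (C_k in C_S) in the warped new observation, set it to the static class."""
    out = ref_labels.copy()
    rows, cols = out.shape
    for r in range(rows):
        for c in range(cols):
            if ref_labels[r, c] in DYNAMIC and new_labels_warped[r, c] not in DYNAMIC:
                out[r, c] = new_labels_warped[r, c]
    return out
```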
2) Map update in unobserved areas: The warping function allows a partial update of the map where an additional observation informs us about the underlying static classes. But some areas may remain unobserved if dynamic objects occlude the same part of the scene in both mapping sequences.

To deal with these areas, we need to perform semantic in-painting: unobserved areas are treated as holes that have to be filled with static class labels. A model of the occlusions occurring in the scene is computed to infer the label of the pixels that remain occluded by dynamic objects. It relies on the ergodic assumption, which states that the average behavior of dynamic objects over time is essentially the same as the average behavior of dynamic objects over space. More precisely, this assumption states that we can generalize occlusions observed over the mapping sequences to unobserved areas which remain occluded in other parts of the map. Practically, we compute a model that describes which static class is likely to be occluded by a given dynamic class. For example, the dynamic class "pedestrian" is more likely to occlude the static class "sidewalk" than "sky", just as the class "car" is likely to occlude "road". This model takes the form of a probability distribution of the existence of an underlying static class $C_i$ given the observation of a dynamic class $C_j$:

$$P(C_i \mid C_j) = \frac{O(C_i, C_j)}{N} \quad (4)$$

where $N$ is the total number of pixels initially associated with a dynamic label and $O(C_i, C_j)$ is the number of pixels initially labelled as $C_j$ and corrected to $C_i$ using the warping function. This model is computed from observations made over several acquisition sequences. It is not necessary to remember specific observations, only the model, which is extremely compact.
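
The occlusion model of equation (4) could be accumulated as follows; this is a minimal sketch over the corrections made in re-observed areas, with an illustrative dictionary representation.

```python
from collections import Counter

def occlusion_model(ref_labels, corrected_labels, dynamic_classes):
    """Estimate P(C_i | C_j) as O(C_i, C_j) / N, where N counts pixels that
    initially carried a dynamic label and O counts those corrected to C_i."""
    counts = Counter()  # O(C_i, C_j), keyed by (static, dynamic) pairs
    n_dynamic = 0       # N
    for before, after in zip(ref_labels.flat, corrected_labels.flat):
        if before in dynamic_classes:
            n_dynamic += 1
            if after not in dynamic_classes:
                counts[(after, before)] += 1
    return {pair: n / n_dynamic for pair, n in counts.items()}
```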
However, this model is not sufficient to correctly estimate the occluded classes because it only takes into account statistics over time. It is also necessary to take the spatial context into account to estimate the probability that a given static class is occluded by a dynamic class. For example, if a dynamic object of type "car" is mainly surrounded by the class "building", it is more likely that the car occludes a building than a road, even if roads are usually the most probable occluded class.

To take the context into account, the semantic graph associated with the annotated image is used. For a dynamic area in the image, the semantic graph gives the adjacent semantic areas, constituting the neighbors, denoted as $\mathcal{N}$. Each node $n_i$ in a semantic graph is characterized by a fitted ellipsoid $f_i$ describing its shape [7], whose parameters are presented in section II-A.
To model the probability of associating a static label with a pixel $p = (x_p, y_p)$, a Gaussian function is associated with each neighbor node $n_i \in \mathcal{N}$. It takes the general form:

$$F_i(x_p, y_p) = A_i \exp\!\left(-\left(a(x_p - x_i)^2 + 2b(x_p - x_i)(y_p - y_i) + c(y_p - y_i)^2\right)\right) \quad (5)$$

where $A_i$ is the amplitude, set to $P(C_i \mid C_j)$, $(x_i, y_i)$ is the position of the area in the image, and where:

$$a = \frac{\cos^2\theta}{2\sigma_x^2} + \frac{\sin^2\theta}{2\sigma_y^2} \quad (6)$$

$$b = -\frac{\sin(2\theta)}{4\sigma_x^2} + \frac{\sin(2\theta)}{4\sigma_y^2} \quad (7)$$

$$c = \frac{\sin^2\theta}{2\sigma_x^2} + \frac{\cos^2\theta}{2\sigma_y^2} \quad (8)$$

with $\sigma_x = h_i$, $\sigma_y = w_i$ and $\theta = \alpha_i$.

Then, for each pixel of the area requiring a new label, the most probable label is computed as follows:

$$L(C_i) = \max_{i \in \mathcal{N}}\left(F_i(x_p, y_p)\right) \quad (9)$$

where $L(\cdot)$ stands for the likelihood.
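
A minimal sketch of the contextual inference of equations (5)-(9), reusing the SemanticArea structure sketched in section II-A; `amplitudes` maps each neighbor's label to its occlusion-model prior $P(C_i \mid C_j)$, and all names are illustrative.

```python
import math

def gaussian_vote(area, amplitude, xp, yp):
    """F_i(x_p, y_p): elliptical Gaussian centred on a neighbouring area,
    scaled by the occlusion-model prior P(C_i | C_j), as in Eqs. (5)-(8)."""
    sx, sy, th = area.h, area.w, area.alpha  # sigma_x, sigma_y, theta
    a = math.cos(th)**2 / (2 * sx**2) + math.sin(th)**2 / (2 * sy**2)
    b = -math.sin(2 * th) / (4 * sx**2) + math.sin(2 * th) / (4 * sy**2)
    c = math.sin(th)**2 / (2 * sx**2) + math.cos(th)**2 / (2 * sy**2)
    dx, dy = xp - area.x, yp - area.y
    return amplitude * math.exp(-(a * dx * dx + 2 * b * dx * dy + c * dy * dy))

def infer_label(neighbors, amplitudes, xp, yp):
    """Equation (9): the neighbour whose Gaussian scores highest labels the pixel."""
    best = max(neighbors, key=lambda n: gaussian_vote(n, amplitudes[n.label], xp, yp))
    return best.label
```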
Using the proposed approach, it is possible to update the map by exploiting both the spatial context and the knowledge acquired along the robot's experience, resulting in a robust and stable representation of the environment. Contrary to many other approaches, we do not need to consider a large set of observations, only a simple and compact model of occlusions.
III. EXPERIMENTS
Our framework has been tested in two ways. First, the correctness of the class prediction in occluded parts of the scene has been evaluated by making predictions in areas where observations of static labels are available and can be used as ground truth. Then, the usefulness of the approach for localization is evaluated by comparing similarity scores of images taken at the same place but at different moments, with and without updated data. All experiments were performed using an Intel i7-3840QM CPU at 2.80 GHz. All programs are single-threaded.
The two experiments were realized with a challenging dataset modeling an outdoor environment with forest and building areas at the INRIA Sophia-Antipolis campus. It is composed of hundreds of high-resolution¹ spherical images taken along a 1.6 km pathway. Two sequences have been acquired with our multi-camera stereovision system on the same pathway at two different times, with an interval of three years. The automatic annotation of images produces 9 classes: tree, sky, road, sign, sidewalk, ground sign, building, car, other. The scene parsing stage produces an imperfect labelling, achieving 82% of correctly labelled pixels (see more results in [7]). It is important to note that learning is done only on images extracted from the first sequence; changes in illumination are thus managed by the classification stage only.

¹The full resolution is 2048x665 but we use 1024x333 resolution for classification.

TABLE I
OCCLUSION MODEL

Class          Associated Probability
Sky            0
Building       0.04
Road           0.76
Sidewalk       0.08
Tree           0.11
Signs          0
Ground Signs   0

TABLE II
INFERENCE RESULTS

Class      Score
Building   0.96
Road       0.98
Sidewalk   0.99
Tree       0.97
Global     0.98
A. Predictions Correctness
To estimate the correctness of the predictions, we report two measures. The first one is the global precision of the inference process, given by the number of pixels correctly associated with the class $C_i$ over the total number of pixels labelled as $C_i$, denoted as $S_{global}$. But this measure alone is not sufficient, as some classes representing small objects can be ignored without a significant decrease in performance. We therefore also report a per-class unweighted precision score, denoted as $S_{class}$. The occlusion model computed from the experiments is reported in table I. The results of the inference process are reported in table II and illustrated in figure 4.
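
For concreteness, here is a minimal sketch of these two precision measures, assuming flat arrays of predicted and ground-truth labels restricted to the pixels whose class was inferred; the function name is illustrative.

```python
import numpy as np

def precision_scores(pred, gt):
    """S_global: fraction of inferred pixels assigned the correct class.
    S_class: unweighted per-class precision, correct pixels over all pixels
    predicted as that class."""
    pred, gt = np.asarray(pred), np.asarray(gt)
    s_global = float(np.mean(pred == gt))
    s_class = {c: float(np.mean(gt[pred == c] == c)) for c in np.unique(pred)}
    return s_global, s_class
```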
As expected, the occlusion model encodes the common fact that cars are more likely to appear on the road, in front of trees or buildings, than in the sky. The results presented in table II for classes with non-null probabilities show that the inference process is very efficient: almost all pixels are associated with the correct class. The remaining pixels correspond to slight changes in border position. These very good results demonstrate the efficiency of our approach at inferring semantics in occluded parts of the scene. They can be explained by two facts. First, taking into account both spatial context and temporal changes allows a very robust model of the world to be built. The ergodic assumption is a very efficient way to compensate for the small number of observations, only two here. The second point is that static

Citations
More filters
Proceedings ArticleDOI
01 Sep 2016
TL;DR: This paper presents an audio-guided indoor navigation system built into a wearable device, designed to work with a hybrid mapping, which allows blind users safe guided navigation with low computational complexity.
Abstract: This paper presents an audio-guided indoor navigation system built into a wearable device designed to work with a hybrid mapping. This mapping consists of radio-frequency markers as well as visual markers located in special places inside a known environment. This system allows blind users safe guided navigation with quick response and low computational complexity. The chosen methodological approach divides the system into two stages: one offline and another online. In the offline stage, the indoor mapping is made through the construction of markers to generate a contextual database that increases the quality of location indication. The online stage, where the indoor navigation is performed, is based on the proximity method, visual pattern recognition, odometry, and ultrasonic perception of barriers. Results showed rates of over 90.0% for the recognition of RF and visual markers with a time of 0.4 seconds, as well as over 95.0% positive ultrasonic perception of obstacles.

14 citations

Proceedings ArticleDOI
25 Jun 2018
TL;DR: An indoor navigation system built into a wearable device that allows visually impaired users to perform guided audio navigation that is safe, fast, and of low computational complexity is presented.
Abstract: This paper presents an indoor navigation system built into a wearable device. The system allows visually impaired users to perform guided audio navigation that is safe, fast, and of low computational complexity. The methodological approach chosen divides the process into two phases: offline and online. In the offline phase, the indoor mapping is done by fusing data from radio-frequency and visual markers, constructing a unique and consistent representation. In the online phase, navigation and recognition of each of the internal positions are performed through the fusion representation, or only by the wi-fi or visual signals when one of the sensors is strongly affected by noise or other interference. The results showed that the recognition levels of wi-fi, visual and fusion markers were 87.59%, 90.92%, and 92.03%, respectively. The error margin after the data fusion application was 0.8 m, with an average time of 0.62 ms.

9 citations

Journal ArticleDOI
TL;DR: An audio-guided indoor navigation system built into a wearable device, designed to work with a hybrid mapping, that allows visually impaired users to perform guided navigation safely and quickly with low computational complexity.

5 citations

01 Jan 2018
TL;DR: The estimation of spatial relations between objects using Probabilistic Logic, including a novel inference method for Markov Logic Networks, and the benefits of combining different sources of semantic information with sensor data are shown in a scene interpretation and a semantic localization task.
Abstract: Increasing autonomy and interactivity of mobile robotic technologies require the inclusion of semantic information in environment representations. This thesis focuses on representations and methods for semantic mapping in particular for urban environments. It covers the estimation of spatial relations between objects using Probabilistic Logic, including a novel inference method for Markov Logic Networks. Moreover, the benefits of combining different sources of semantic information with sensor data are shown in a scene interpretation and a semantic localization task.

2 citations


Cites background from "Hybrid metric-topological-semantic ..."

  • ...Semantic categories can also be used to determine the dynamic properties of parts of the map, which helps to keep track of changes when revisiting places where certain objects of dynamic classes have moved, while static objects can be assumed to remain in the same place over time [45]....

    [...]

Journal ArticleDOI
TL;DR: In this paper, the Lidar and camera measurements are fused to obtain robust motion estimation and construct a hybrid metric-feature map, and a multi-stage loop closure strategy is applied to determine the loop candidate.

1 citations

References
More filters
Proceedings Article
12 Dec 2011
TL;DR: This paper considers fully connected CRF models defined on the complete set of pixels in an image and proposes a highly efficient approximate inference algorithm in which the pairwise edge potentials are defined by a linear combination of Gaussian kernels.
Abstract: Most state-of-the-art techniques for multi-class image segmentation and labeling use conditional random fields defined over pixels or image regions. While region-level models often feature dense pairwise connectivity, pixel-level models are considerably larger and have only permitted sparse graph structures. In this paper, we consider fully connected CRF models defined on the complete set of pixels in an image. The resulting graphs have billions of edges, making traditional inference algorithms impractical. Our main contribution is a highly efficient approximate inference algorithm for fully connected CRF models in which the pairwise edge potentials are defined by a linear combination of Gaussian kernels. Our experiments demonstrate that dense connectivity at the pixel level substantially improves segmentation and labeling accuracy.

3,233 citations


"Hybrid metric-topological-semantic ..." refers methods in this paper

  • ...Then a Fully Connected Conditional Random Field is used to model neighborhood and an efficient inference method [10] allows us to correct the labels over spatial context....

    [...]

Journal ArticleDOI
TL;DR: This work investigated the persistent navigation and mapping problem in the context of an autonomous robot that performs mock deliveries in a working office environment over a two-week period and found the solution was based on the biologically inspired visual SLAM system, RatSLAM.
Abstract: The challenge of persistent navigation and mapping is to develop an autonomous robot system that can simultaneously localize, map and navigate over the lifetime of the robot with little or no human intervention. Most solutions to the simultaneous localization and mapping (SLAM) problem aim to produce highly accurate maps of areas that are assumed to be static. In contrast, solutions for persistent navigation and mapping must produce reliable goal-directed navigation outcomes in an environment that is assumed to be in constant flux. We investigate the persistent navigation and mapping problem in the context of an autonomous robot that performs mock deliveries in a working office environment over a two-week period. The solution was based on the biologically inspired visual SLAM system, RatSLAM. RatSLAM performed SLAM continuously while interacting with global and local navigation systems, and a task selection module that selected between exploration, delivery, and recharging modes. The robot performed 1,143 delivery tasks to 11 different locations with only one delivery failure (from which it recovered), traveled a total distance of more than 40 km over 37 hours of active operation, and recharged autonomously a total of 23 times.

302 citations


"Hybrid metric-topological-semantic ..." refers methods in this paper

  • ...The biologically inspired algorithm RATSLAM is used in [4] to perform persistent...

    [...]

Journal ArticleDOI
TL;DR: An on-line algorithm capable of differentiating static and dynamic parts of the environment and representing them appropriately on the map is proposed, based on maintaining two occupancy grids.
Abstract: We propose an on-line algorithm for simultaneous localization and mapping of dynamic environments. Our algorithm is capable of differentiating static and dynamic parts of the environment and representing them appropriately on the map. Our approach is based on maintaining two occupancy grids. One grid models the static parts of the environment, and the other models the dynamic parts of the environment. The union of the two grid maps provides a complete description of the environment over time. We also maintain a third map containing information about static landmarks detected in the environment. These landmarks provide the robot with localization. Results in simulation and real robots experiments show the efficiency of our approach and also show how the differentiation of dynamic and static entities in the environment and SLAM can be mutually beneficial.

262 citations


"Hybrid metric-topological-semantic ..." refers background in this paper

  • ...A strategy to identify dynamic objects in the scene and map them in a separate occupancy grid is proposed in [2]....

    [...]

Journal ArticleDOI
TL;DR: This paper presents a system for long-term SLAM (simultaneous localization and mapping) by mobile service robots and its experimental evaluation in a real dynamic environment and proposes a sample-based representation, where older memories fade at different rates depending on the timescale.
Abstract: This paper presents a system for long-term SLAM (simultaneous localization and mapping) by mobile service robots and its experimental evaluation in a real dynamic environment. To deal with the stability-plasticity dilemma (the trade-off between adaptation to new patterns and preservation of old patterns), the environment is represented by multiple timescales simultaneously (five in our experiments). A sample-based representation is proposed, where older memories fade at different rates depending on the timescale and robust statistics are used to interpret the samples. The dynamics of this representation are analyzed in a five-week experiment, measuring the relative influence of short- and long-term memories over time and further demonstrating the robustness of the approach.

91 citations

Journal ArticleDOI
TL;DR: This paper uses a probabilistic method to track multiple people and to incorporate the estimates of the tracking technique into the mapping process, which results in more accurate maps.
Abstract: The problem of learning maps with mobile robots has received considerable attention over the past years. Most of the approaches, however, assume that the environment is static during the data-acquisition phase.

85 citations


"Hybrid metric-topological-semantic ..." refers background in this paper

  • ...concerns methods that remove dynamic objects in order to achieve a stable representation [1]....

    [...]

Frequently Asked Questions (10)
Q1. What are the contributions mentioned in the paper "Hybrid metric-topological-semantic mapping in dynamic environments" ?

In this context, the authors propose a new approach to update maps pertaining to large-scale dynamic environments with semantics. 

Then a Fully Connected Conditional Random Field is used to model neighborhood and an efficient inference method [10] allows us to correct the labels over spatial context. 

Once the submap corresponding to the closest position is retrieved, a dense registration method between the submap and the current spherical image, described in [12], is applied to refine the pose estimate locally (see figure 3). 

Two sequences have been acquired with their multi-camera stereovision system on the same pathway at two different times with an interval of three years (the full resolution is 2048x665 but the authors use 1024x333 resolution for classification).

the correctness of the class prediction in occluded parts of the scene has been evaluated by making predictions in areas where observations of static labels are accessible and used as ground truth. 

The pose T̂T(x) is an approximation of the true transformation T(x̃) and Ψhub is a robust weighting function on the error given by Huber’s M-estimator [14]. 

Using the proposed approach, it is possible to update the map by exploiting both the spatial context and the knowledge acquired along robot’s experience, resulting in a robust and stable representation of the environment. 

A dynamic class, denoted as $C_D$, occludes a static class $C_S$ by changing the label associated with the corresponding pixels in the image.

To model the probability of associating a static label with a pixel $p = (x_p, y_p)$, a Gaussian function is associated with each neighbor node $n_i \in \mathcal{N}$.

Following the formulation of [13], the cost function for optimising intensity errors between spheres $\{I_s, I_s^*\}$ is given as $F_I = \frac{1}{2}\sum_{i}^{k} \Psi_{hub}\left(\left\| I_s(\omega(\widehat{T}T(x); P_i)) - I_s^*(\omega(I; P_i)) \right\|^2\right)$ (equation (2) above).