Fast Hybrid Relocation in Large Scale Metric-Topologic-Semantic Map
Romain Drouilly (1,2), Patrick Rives (1), Benoit Morisset (2)
Abstract— Navigation in large scale environments is challenging because it requires both an accurate local map and a global relocation ability. We present a new hybrid metric-topological-semantic map structure, called MTS-map, that allows fine metric-based navigation and fast, coarse query-based localisation. It consists of local sub-maps connected through two topological layers at the metric and semantic levels. Semantic information is used to build concise local graph-based descriptions of sub-maps. We propose a robust and efficient algorithm that relies on the MTS-map structure and the semantic description of sub-maps to relocate very fast. We combine the discriminative power of semantics with the robustness of an interpretation tree to compare the graphs very fast and outperform state-of-the-art techniques. The proposed approach is tested on a challenging dataset composed of more than 13000 real-world images, on which we demonstrate the ability to relocate within 0.12 ms.
I. INTRODUCTION
Although it has been largely studied in the last decade, autonomous navigation remains a challenging issue, particularly in complex large scale environments. In this paper we address the problem of building navigation-oriented maps capable of dealing with different localization levels, from coarse to fine. Map-based navigation requires the robot to be able to efficiently query the content of the map at large scale to retrieve its position and simultaneously to infer the positions of local objects around it. The map needs to be precise locally but lightweight at large scale. However, in most 3D maps, information density is homogeneous in space, forcing a compromise between the precision of the local model and the size of the environment to model. This kind of representation intrinsically limits the local quality of the model or reduces its scale-up capability. In this work, we use a more convenient map structure. It consists of a graph whose nodes are local sub-maps built from ego-centred spherical views of the environment, previously introduced in [1].
Beyond local metric precision, navigation is also a product of environment understanding and planning. A map is a cognitive representation of the world, and the robot is able to reason only about concepts encoded within it. The more complex these concepts are, the more "intelligent" its behaviour can be. Therefore intelligent navigation needs the map to contain abstraction layers representing higher-level concepts than geometry and color. Toward this goal, topological mapping was considered of interest early on [2], [3], capturing the accessibility properties of the environment and allowing navigation in complex large scale environments
*This work was supported by ECA Robotics
1 Authors are with INRIA Sophia-Antipolis, France. romain.drouilly@inria.fr, patrick.rives@inria.fr
2 Authors are with ECA Robotics. bmo@eca.fr
[4]. Semantic mapping has only recently received significant attention. It provides a powerful way to enrich the cognitive model of the world and is thereby of interest for navigation. However, despite a notable amount of work on outdoor scene parsing, the use of semantics for outdoor navigation has been poorly studied. Many mapping strategies rely on place classification or on landmarks like doors to infer the robot's position [5]. But localisation is only possible if object classes are strongly related to particular places, which is not the case outdoors. Additionally, the place concept is hard to define for most outdoor environments, as these scenes are not structured enough to allow unambiguous delimitations between different areas.
We propose three main contributions to deal with these problems: a new 3D hybrid map structure designed for navigation purposes, a new framework to extract semantic information, and an efficient algorithm to query the content of the map in a human-friendly way. All these improvements provide the robot both with a precise local representation and a fast global content request ability.
The rest of this paper is organized as follows: related work on space modelling and relocation in large databases is discussed in section II, the MTS-map architecture is presented in section III, followed by scene parsing results. Then the content request problem is treated in section IV, before wrapping up with experimental results in section V and the conclusion in section VI.
II. RELATED WORK
A. Hybrid mapping
Semantic mapping has been an active research domain in recent years and many semantic representations exist in the literature. A semantic map model has been proposed in [6] where objects are grouped along two dimensions, semantic and spatial. Clustering objects along the semantic dimension allows capturing a place label, where a place is defined as a group of objects. Grouping objects in clusters of increasing size provides meaning to the global scene. A multi-layer map is proposed in [7]. It is constructed in a semi-supervised way. The first three levels are composed of metric, navigation and topological maps. The last level is the semantic map, which integrates acquired, asserted, inferred and innate conceptual-ontological knowledge. A 3D extension of the well-known constellation model is presented in [8]. Here again the object is the basic unit of representation for semantic labelling of places. Despite their interest, these approaches are difficult to adapt to outdoor environments because they rely on the concept of place, which is not well defined outdoors. Other methods do not directly rely on this concept. The 3D semantic map presented in [9] is defined as a map containing both metric information and labels of recognised classes. Prior knowledge and object models are needed for scene interpretation and object recognition. More recently, [10] defined a semantic SLAM approach that builds a map based on previously known object models. While they perform well indoors, these works are not easily transferable to outdoor environments: they rely on object recognition and require a model of every object, which is not tractable in large scale environments.
B. Content Request
Relocation can be formulated as a content request problem: given the current observation, we ask the database to provide the corresponding position. Vision-based localization has been studied for a long time, and the use of omni-images dates back to the early 90's [11]. Most modern techniques decompose into three steps: first, interest points are extracted and descriptors computed; then descriptors are matched between two images; finally, outliers are rejected. In the well-known Bag-of-Words (BoW) methods, a tree structure is commonly used to organize the search and speed up the comparison process. Many variations of the BoW algorithm exist. We may cite [12], which uses feature context to improve discriminative power; the idea is to select good feature candidates to match in order to avoid perceptual aliasing. Other recent BoW methods offer good performance in image retrieval using different strategies. A tree-structured Bayesian network is used in [13] to capture word co-occurrence in images. A compact vocabulary is created in [14] through discretization of a binary descriptor space. Despite their undeniable efficiency, these algorithms have the drawback of providing a low-level description of the scene, which is not human-friendly and makes them unsuitable for human-robot cooperation.
III. HYBRID METRIC-TOPOLOGICAL-SEMANTIC MAP ARCHITECTURE
In this section, we propose a new hybrid metric-topological-semantic map structure called the MTS-map. The map architecture is detailed below and illustrated in Fig. 1.
A. Map Architecture
The MTS-map consists of 3-layered local sub-maps, globally connected to each other in a graph structure through the use of a dense visual odometry method, first introduced in [1]. The bottom layer of each sub-map is an ego-centred RGBD spherical view of the environment acquired with a multi-camera system [15]. As close as possible to the sensor measurements, it provides a realistic representation of the local environment. The second layer, the "label layer", presents a first level of abstraction with densely labelled spherical images. There is a direct correspondence between the pixels of these two layers. At the upper stage lies the semantic-graph layer G_o, which provides a compact and robust high-level description of the viewpoint. Nodes are the semantically consistent regions of the labelled image, and edges represent the connections between these regions. A label is attached to every node, together with the size of the area measured in pixels and its eccentricity, represented by the ratio of the length to the width of the shape. Edges are weighted by the number of neighbouring pixels of the two objects. All these layers constitute the RGBDL sphere, where "L" stands for Label. At the global scale, every RGBDL sphere is referenced in a tree structure that clusters spheres according to class presence and class occurrence. Finally, atop all sub-maps is the conceptual layer, defined as a non-spatial map which characterizes the strength of relations between classes, generalized from the G_o graphs.

Fig. 1. MTS-map Architecture. Dark blue rectangles correspond to local sub-maps and light blue rounded rectangles to different layers.
B. Scene Parsing
In this part we propose a framework to extract semantic
information from spherical images. It should be noted that
our work consists of separated building blocks and the
localization step is independent of the algorithm used to label
images.
1) Local Features and Random Forest Classification: The first step of the classification process uses a Random Forest (RF) [16] to estimate class distributions. A Random Forest is a set of T Decision Trees that achieves a good classification rate by averaging the predictions over the set of leaves L = (l_1, ..., l_T) reached in the T trees:

P(c|L) = (1/T) Σ_{i=1}^{T} P(c|l_i)

The model construction complexity is approximately O(T m n log(n)), where n is the number of instances in the training data and m the size of the feature vectors. Provided they are correctly trained, RFs have performance comparable to multi-class SVMs at reduced training and testing cost [17], which makes them popular in computer vision. Moreover, a Random Forest has a deeper architecture than a single Decision Tree or other well-known classification algorithms, which makes it better able to generalize to variations not encountered in the training data [18].

Each Decision Tree is trained on a reduced subset of the input data, randomly chosen with replacement. Each node is then split using the best split among a randomly chosen subset of variables. The Decision Tree produces a prediction by recursively branching left or right at each node until a leaf is reached. Due to the class imbalance in the input data, a prior preference for some classes can affect the results. For that reason, we weigh each training sample proportionally to the inverse of its class frequency.
To achieve good robustness to changes in orientation and scale, the feature vectors use SIFT descriptors computed densely on the gray-scale image, augmented with color information computed on normalized RGB images.
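The classification stage can be sketched with scikit-learn, whose `class_weight="balanced"` option implements exactly the inverse-class-frequency weighting described above. The paper's own implementation, features and hyperparameters are not specified, so the feature matrix below is synthetic and all settings are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Stand-in for dense per-pixel descriptors (e.g. SIFT + color), with a
# strong class imbalance as in outdoor data (class 0 dominates).
X = rng.normal(size=(1000, 16))
y = np.where(rng.random(1000) < 0.8, 0, rng.integers(1, 3, 1000))
X[y == 1] += 2.0
X[y == 2] -= 2.0

# class_weight="balanced" weighs samples by inverse class frequency,
# mirroring the re-weighting described in the text.
rf = RandomForestClassifier(n_estimators=50, class_weight="balanced",
                            random_state=0).fit(X, y)

# predict_proba averages the leaf distributions over all T trees,
# i.e. P(c|L) = (1/T) * sum_i P(c|l_i).
proba = rf.predict_proba(X[:5])
print(proba.shape)  # (5, 3)
```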
2) Spatio-Temporal Consistency with a CRF: The Random Forest produces accurate results but fails to capture contextual information at large scale. To capture the global structure of the scene, a common solution is to embed the first-stage prediction results into a probabilistic graphical model [5].
However, applying the classifier to single images results in practice in temporally inconsistent, "twinkling" classifications. To enforce temporal consistency, large graphical models can be built over consecutive images to propagate labels [19], [20]. The drawback of these methods is the complexity of the graph, which can reach billions of edges. Other methods [21] use optical flow to propagate labels, but need to learn pixel similarities beforehand.
In this section, we present a way to simultaneously embed the temporal and spatial context in the CRF without increasing the number of edges. We use the CRF architecture and the efficient MAP inference algorithm presented in [22]. The fully connected pairwise CRF model is briefly reviewed here. Let X = {x_1, ..., x_n} and Y = {y_1, ..., y_m} be sets of random variables corresponding respectively to observations and labels. A CRF is an undirected graph G whose nodes correspond to X ∪ Y and that encodes a conditional distribution as follows:

P(Y|X) = (1/Z(X)) exp{ −Σ_{c ∈ C_G} φ_c(Y_c|X) }

with C_G the cliques of G, φ_c the induced potentials and Z(X) a normalization factor.
In the fully connected pairwise CRF model, the Gibbs energy [22] of a labelling y is:

E(y) = Σ_i ψ_u(y_i) + Σ_{i<j} ψ_c(y_i, y_j)

where the unary potential ψ_u(y_i) denotes φ(y_i|X) and ψ_c is the pairwise potential.
To enforce temporal consistency, we accumulate the Random Forest predictions from the neighbours of the current view, so that the unary potential ψ_u(y_i) takes the form:

ψ_u(y_i) = α Σ_{n ∈ N} ψ_n(y_i)

where N is the neighbourhood of sphere i and α is a normalization factor. Predictions are accumulated by projecting the neighbours' predictions onto the current view using odometry.
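The accumulation step can be sketched in a few lines of numpy, assuming the neighbours' per-pixel class scores have already been warped into the current view's frame (the projection via odometry and depth is outside the scope of this sketch, and the function name is an assumption):

```python
import numpy as np

def accumulate_unaries(current_scores, neighbour_scores):
    """Average per-pixel class scores over a view and its neighbours.

    current_scores:   (H, W, C) Random Forest class scores of the view.
    neighbour_scores: list of (H, W, C) scores already projected into
                      the current view's frame via odometry (assumed).
    Returns the accumulated unary term, normalized so the scores of
    each pixel sum to one (the alpha factor in the equation above).
    """
    stack = np.stack([current_scores] + list(neighbour_scores))
    acc = stack.sum(axis=0)
    return acc / acc.sum(axis=-1, keepdims=True)

a = np.array([[[0.6, 0.4]]])          # 1x1 image, 2 classes
b = np.array([[[0.2, 0.8]]])          # one projected neighbour
print(accumulate_unaries(a, [b]))     # [[[0.4 0.6]]]
```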
C. Scene Parsing Results
We evaluate our labelling framework on two datasets: CamVid and our INRIA dataset. Due to the lack of other datasets providing fully annotated panoramic RGBD images, we first apply our algorithm frame by frame, embedding only spatial information in the CRF. Then we study the temporal consistency improvement on our dataset. CRF parameters are tuned by 2-fold cross-validation on CamVid and 5-fold cross-validation on the INRIA dataset. All experiments were performed using an Intel i7-3840QM CPU at 2.80GHz. All programs are single-threaded.
a) INRIA Dataset: consists of more than 13000 high-resolution¹ panoramic images taken along a 1.6 km pathway in an outdoor environment with forest and building areas. There are 9 classes: tree, sky, road, signs, sidewalk, ground signs, building, car, and others. We manually labelled a subset of images randomly chosen from the dataset. The training time is 58 minutes for the Random Forest and 43 minutes for the CRF. The mean prediction time is 2.5 s for the Random Forest and 3.5 s for the CRF.
b) CamVid dataset:² consists of several 960x720 video sequences taken in a highly dynamic street environment and labelled at 1 Hz. There are 32 classes in the dataset. We use two sequences: 01TP, which lasts 2:04 minutes, and 06R0, which lasts 1:41 minutes. We use the first half of each sequence as the training set and the second as the test set. The training time is 1h09 for the RF and 48 minutes for the CRF. Prediction time is 1.5 s for the Random Forest and 3.1 s for the CRF.
c) Performance measurement: Two standard measures for multi-class classification are reported: the overall percentage of correctly classified pixels, denoted global, and the unweighted average per-class accuracy, denoted average and defined as

(1/N) Σ_{k=1}^{N} t_k / n_k

where N is the number of classes, t_k the number of correctly classified pixels and n_k the number of annotated pixels of the k-th class.
d) Results: Results for the frame-by-frame labelling are presented in Table I and illustrated in figure 2 for the INRIA dataset and figure 3 for CamVid. As comparisons only make sense between similar measures over similar datasets, we compare our results with those of [21]. Our algorithm reaches near state-of-the-art performance for global per-pixel accuracy and outperforms [21] for average per-class accuracy. Concretely, our algorithm is better at detecting each instance, at the cost of a lower pixel-wise quality. This result is in accordance with our goal of building a discriminative semantic representation of the scene: we need to catch as many objects as possible, with a lower priority on the pixel-wise labelling. Despite the good results, some parts of the labelled spherical images in the INRIA dataset are particularly noisy. This is due to the stitching algorithm used to build each view from three images, which locally changes the light intensity (please consult the video attachment of the paper).
Results with enforced temporal consistency are presented in Table II. It improves both the global and the average per-class accuracy. However, if the neighbourhood is too large, the labelling quality decreases, due to errors in depth estimation that project labels onto wrong positions. The attached video shows the efficiency of temporal consistency in decreasing over-illumination noise.

¹The full resolution is 2048x665, but we use a 1024x333 resolution for classification.
²http://mi.eng.cam.ac.uk/research/projects/VideoRec/CamVid/

Fig. 2. Examples of frame-by-frame labelling results on the INRIA database. Images are presented in the following order. Top: RGB image, Middle: labelling results, Bottom: ground truth. Colour correspondences are: green: tree - blue: sky - gray: road - brown: building - blue: signs - red: car - orange: ground signs - purple: sidewalk - black: others.
TABLE I
RESULTS OF FRAME BY FRAME LABELLING

Method   | Our algorithm    | [21]
Dataset  | Global | Average | Global | Average
CamVid   | 79.2   | 75.2    | 84.2   | 59.5
INRIA    | 81.9   | 80.2    | -      | -

TABLE II
COMPARISON OF LABELLING WITH TEMPORAL CONSISTENCY OVER DIFFERENT NEIGHBOURHOOD SIZES N_s

Neighbourhood size | Global | Average
N_s = 3            | 82.1   | 81.0
N_s = 7            | 83.1   | 82.2
N_s = 11           | 81.3   | 80.4
IV. MAP CONTENT REQUEST
Localization is the task of retrieving the position of a map content query: the position of the robot, or of any other content. Several methods like [23] propose a scene recognition scheme based on a very low dimensional representation. Despite their undeniable efficiency in scene recognition, these methods do not allow high-level content requests and so are hardly extensible to tasks where humans are "in the loop". At the opposite end, works like [24] propose a modelling scheme based on objects and places and use it to request high-level content. These methods use the co-occurrence of object classes and place classes to predict the place label or perform informed search. However, as said earlier, this strategy does not work outdoors, because any object class can be present anywhere and the concept of "place" for open spaces is not straightforward. In this section, we propose an algorithm that relies on the MTS-map to efficiently localize the robot or any human-understandable concept, such as an object or a group of objects with given relations.

Fig. 3. Examples of frame-by-frame labelling results on the CamVid database. Images are presented in the following order. Top: RGB image, Middle: labelling results, Bottom: ground truth. From right to left: best to worst results.
A. Semantic Localization
To achieve robust and efficient localization, our method relies on the proposed MTS-map structure. As explained in section III-A, local sub-maps are indexed in a tree structure encoding class presence and occurrence. Each leaf is a set of sub-maps with similar semantic content. The first step consists in searching the tree for the leaf or leaves with the corresponding content, for example, all leaves with two buildings. This drastically reduces the number of sub-maps to compare with. Then the semantic graphs G_o are compared to select the most probable local sub-map in which to find the needed information. Because a change of viewpoint can fuse several objects, comparing those graphs is formulated as a multivalent graph matching problem. This is an NP-hard problem, but we can use the structure of the graphs to speed up the process, using a variation of the interpretation tree algorithm [25] presented in Algorithm 1. Finally, when high-precision relocation is needed, the visual odometry method presented in [1] is used on the RGBD layer to achieve centimetric precision.
Our semantic graph representation presents several advantages over other ways to abstract an image. It relies on the entire image and not just on sparse local features that could be subject to noise, perceptual aliasing or occlusion. It intrinsically encodes the image structure, which contains an important part of the information: a graphical description allows reconstructing the approximate image structure, while a collection of low-level features does not. Finally, it is extremely lightweight: the size of the map with all full-size images is 53 GB, while the semantic graph representation needs only 18.5 MB, which corresponds to a compression ratio of around 3000.
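The first, tree-based filtering step described above amounts to indexing sub-maps by a semantic signature of class presence and occurrence. A minimal sketch of that idea follows; the names and the exact key format are assumptions, not from the paper:

```python
from collections import Counter, defaultdict

def signature(node_labels):
    """Key a sub-map by class presence and occurrence counts."""
    return tuple(sorted(Counter(node_labels).items()))

index = defaultdict(list)
submaps = {
    "sphere_1": ["building", "building", "tree", "road"],
    "sphere_2": ["building", "tree", "road", "car"],
    "sphere_3": ["building", "building", "tree", "road"],
}
for name, labels in submaps.items():
    index[signature(labels)].append(name)

# Query: all sub-maps whose graphs contain exactly two buildings,
# one tree and one road. Only these candidates go on to the more
# expensive interpretation-tree comparison.
query = signature(["building", "building", "tree", "road"])
print(index[query])  # ['sphere_1', 'sphere_3']
```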
B. Interpretation Tree Algorithm
The interpretation tree is an efficient algorithm that uses the relationships between the nodes of two graphs to speed up the matching process. It relies on two kinds of constraints to measure similarity, called unary constraints and binary constraints. Let G_1 and G_2 be two such graphs. Unary constraints compare a node of G_1 to those of G_2. If the comparison succeeds, the nodes are matched and a score is computed for the couple of nodes. Then the best pair of nodes is added to the list L of matched nodes, and binary constraints check whether every two pairs of nodes in G_1 and G_2 have compatible relationships. We use the following constraints:
Unary constraints: they use three properties of nodes: their label, and the eccentricity and orientation of the elliptical envelope that fits the shape of the corresponding area. If the labels are different or the difference in shape properties is higher than a given threshold, the comparison fails. Taking into account only labels, eccentricity and orientation makes the comparison robust to changes in the apparent size of semantic areas.
Pairwise constraints: they check the relationships of two nodes, using the weights w_i provided by the adjacency matrix of each semantic graph.
The interpretation tree returns the number of matched nodes. The highest score gives the most probable position.
Algorithm 1 Details of our Interpretation Tree algorithm used to compare semantic graphs
INPUTS: G_1, G_2: graphs of the current view and a given view in the database
OUTPUTS: score of the matching (list of matched nodes)
for all nodes n_i ∈ G_1 do
    for all nodes n_j ∈ G_2 do
        if UnaryConstraint(n_i, n_j) then
            add (n_i, n_j) to MatchedNodesList
        end if
    end for
    if |MatchedNodesList| ≥ 1 then
        sort MatchedNodesList
        for all (n_i, n_j) in MatchedNodesList do
            add (n_i, n_j) to InterpList
            if PairwiseConstraint(InterpList) == False then
                remove (n_i, n_j) from InterpList
            end if
        end for
    end if
end for
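Algorithm 1 can be sketched as runnable Python. The graph encoding (dicts of node attributes and edge weights), the thresholds and the pairwise test below are illustrative assumptions; the paper specifies the constraints only qualitatively, and the sorting of candidate pairs by score is omitted for brevity.

```python
ECC_THRESH = 0.3  # max allowed eccentricity difference (assumed)
W_THRESH = 0.5    # max allowed relative edge-weight difference (assumed)

def unary_ok(a, b):
    """Labels must match; eccentricity must be close."""
    return a["label"] == b["label"] and abs(a["ecc"] - b["ecc"]) <= ECC_THRESH

def pairwise_ok(interp, edges1, edges2):
    """Every two matched pairs must have compatible edge weights."""
    for (i1, j1) in interp:
        for (i2, j2) in interp:
            w1 = edges1.get(frozenset((i1, i2)), 0)
            w2 = edges2.get(frozenset((j1, j2)), 0)
            if abs(w1 - w2) > W_THRESH * max(w1, w2, 1):
                return False
    return True

def match_graphs(nodes1, edges1, nodes2, edges2):
    """Return the list of matched node pairs; its length is the score."""
    interp = []
    for i, a in nodes1.items():
        for j, b in nodes2.items():
            if not unary_ok(a, b):
                continue
            interp.append((i, j))
            if not pairwise_ok(interp, edges1, edges2):
                interp.pop()  # "remove (n_i, n_j) from InterpList"
    return interp

g1 = {0: {"label": "building", "ecc": 1.2}, 1: {"label": "tree", "ecc": 2.0}}
e1 = {frozenset((0, 1)): 10}
g2 = {5: {"label": "building", "ecc": 1.3}, 6: {"label": "tree", "ecc": 2.1}}
e2 = {frozenset((5, 6)): 9}
print(len(match_graphs(g1, e1, g2, e2)))  # 2
```

The highest score over all database graphs then gives the most probable position, as described above.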
V. LOCALIZATION FROM IMAGES RESULTS
In this section, we present our results on the problem of localizing an image in a database, which corresponds to the robot localization problem. We compare the performance of our algorithm with the recent state-of-the-art Bag-of-Words technique³ presented in [14]. Their algorithm builds offline a tree structure that performs hierarchical clustering of the image descriptor space. The similarity between the current image and the images in the database is then evaluated by counting the number of common visual words. We trained the vocabulary tree with two settings of branching factor and depth: K=10, L=5, producing 100000 visual words, and K=8, L=4, producing 4096 visual words. The weighting strategy between visual words is term frequency-inverse document frequency (tf-idf) and the scoring type is the L1-norm (for details about the parameters, see [14]).
We evaluate several aspects of the algorithm. In subsection A, we study performance for image retrieval in a wide database. In subsection B, we evaluate the robustness of our algorithm to wide changes in viewpoint. In subsection C, we present some interesting results that cannot be attained with low-level feature-based algorithms. The dataset used for the tests is the INRIA dataset presented in section III-C. CamVid is not used in this section because it is too small, with only 101 labelled images for sequence 06R0 and 124 for sequence 01TP.

³We used the implementation publicly available at: http://webdiis.unizar.es/ dorian/

TABLE III
RETRIEVAL TEST RESULTS: TIME EFFICIENCY FOR EACH ALGORITHM.

Method        | Mean retrieval time
BoW K=10, L=5 | 22 ms
BoW K=8, L=4  | 16 ms
Interp        | 8.40 ms
Interp+Index  | 0.12 ms
Index         | 54.20 µs
A. Image retrieval
Experimental setup: The experiment consists in successively retrieving all images in the database. We use three variations of our method. First, the tree structure is used alone to search for images with the same classes and the same numbers of occurrences; this is denoted "Index". Then, the interpretation tree is used to discriminate between the remaining results; this is denoted "Interp+Index". Finally, we use only the interpretation tree, denoted "Interp". "BoW" denotes the Bag-of-Words algorithm.
Results: Timings are presented in Table III. All versions of our algorithm outperform the BoW techniques in terms of time efficiency. This comes from the use of the image structure, which discriminates very fast between good and false candidates, and from the simplicity of the tests performed: checking labels, shape properties and relation strengths is very fast. Using the index alone is faster than all the other methods, as it simply counts the number of nodes of each class. However, it does not encode the image structure, so it is subject to aliasing.
B. Accommodation to viewpoint change
Experimental setup: We run two experiments to evaluate robustness to changes in viewpoint. The first one consists in taking a subset of images from the original dataset to build a reference database and a different, interleaved subset to build the query set. We take 1 image out of 40 to build the database and 1 out of 40, shifted by 20 images, to build the query set, denoted Distant images. Then, we retrieve the positions of the distant images in the reference database.
