Hybrid metric-topological-semantic mapping in dynamic environments
Summary (3 min read)
1. Introduction
- Structure from motion (SfM) is a long-standing task in computer vision.
- The network estimates the depth in the first image and the camera motion.
- This potential is indicated by their results for the two-frame scenario, where the learning approach clearly outperforms traditional methods.
- Single-image methods have more problems generalizing to previously unseen types of images.
- The key to the problem is an architecture that alternates optical flow estimation with the estimation of camera motion and depth; see the architecture overview in Fig. 2.
3. Network Architecture
- The overall network architecture is shown in Fig. 2. DeMoN is a chain of encoder-decoder networks solving different tasks.
- The last component is a single encoder-decoder network that generates the final upsampled and refined depth map.
- Likewise, the authors convert the optical flow to a depth map using the previous camera motion prediction and pass it along with the optical flow to the second encoder-decoder (see the forward-pass sketch after this list).
- The improvements largely saturate after 3 or 4 iterations.
- The authors also train the first iteration on its own, but then train all iterations jointly, which avoids intermediate storage.
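A minimal sketch of this forward pass is given below, assuming the bootstrap net, iterative net, and refinement net are available as callables; the names and signatures are illustrative, not the authors' actual Caffe definitions.

```python
def demon_forward(bootstrap_net, iterative_net, refinement_net,
                  image_pair, n_iters=3):
    # Bootstrap: a first estimate of optical flow, depth, and camera motion
    # from the image pair alone.
    flow, depth, motion = bootstrap_net(image_pair)
    # Iterative net: alternate optical flow estimation with the estimation
    # of camera motion and depth; improvements largely saturate after 3-4
    # iterations, hence the default n_iters=3.
    for _ in range(n_iters):
        flow, depth, motion = iterative_net(image_pair, flow, depth, motion)
    # Final encoder-decoder generates the upsampled, refined depth map.
    return refinement_net(image_pair, depth), motion
```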
4. Depth and Motion Parameterization
- The network computes the depth map in the first view and the camera motion to the second view.
- The translation t is given in Cartesian coordinates (see the pose-assembly sketch after this list).
- The bootstrap net fails to accurately estimate the scale of the depth.
- The iterations refine the depth prediction and strongly improve the scale of the depth values.
- Images show the x component of the optical flow for better visibility.
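As a concrete illustration of this minimal parameterization, the sketch below converts the 3-parameter angle-axis rotation r into a rotation matrix via the standard Rodrigues formula and assembles it with the Cartesian translation t into a 4x4 pose. The function names are illustrative rather than taken from the paper.

```python
import numpy as np

def angle_axis_to_matrix(r):
    # Rodrigues' formula: r encodes the rotation as axis * angle (3 params).
    theta = np.linalg.norm(r)
    if theta < 1e-8:
        return np.eye(3)
    k = r / theta  # unit rotation axis
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])  # cross-product matrix of k
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def motion_to_pose(r, t):
    # Assemble rotation (angle-axis r) and Cartesian translation t
    # into a 4x4 rigid-body pose matrix.
    T = np.eye(4)
    T[:3, :3] = angle_axis_to_matrix(np.asarray(r, dtype=np.float64))
    T[:3, 3] = t
    return T
```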
5.1. Loss functions
- The network estimates outputs of very different nature: high-dimensional (per-pixel) depth maps and low-dimensional camera motion vectors.
- The authors apply point-wise losses to their outputs: inverse depth ξ, surface normals n, optical flow w, and optical flow confidence c.
- Note that the authors apply the predicted scale s to the predicted values ξ.
- The authors use a minimal parameterization of the camera motion with 3 parameters for rotation r and translation t each.
- The scale-invariant gradient loss emphasizes depth discontinuities, stimulates sharp edges in the depth map, and increases smoothness within homogeneous regions, as seen in Fig. 10 and sketched below.
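The following is a minimal sketch of such a scale-invariant gradient loss on (inverse) depth maps. The multi-scale spacings, the normalization, and the use of an L1 penalty per component are assumptions based on the verbal description above, not a verbatim reimplementation of the paper's loss.

```python
import numpy as np

def scale_invariant_grad_loss(pred, gt, spacings=(1, 2, 4, 8, 16)):
    # Compare normalized discrete gradients of prediction and ground truth
    # at several spacings h; normalization makes the term scale-invariant.
    def norm_grad(f, h):
        gx = (f[:, h:] - f[:, :-h]) / (np.abs(f[:, h:]) + np.abs(f[:, :-h]) + 1e-8)
        gy = (f[h:, :] - f[:-h, :]) / (np.abs(f[h:, :]) + np.abs(f[:-h, :]) + 1e-8)
        return gx, gy

    loss = 0.0
    for h in spacings:
        pgx, pgy = norm_grad(pred, h)
        ggx, ggy = norm_grad(gt, h)
        loss += np.abs(pgx - ggx).sum() + np.abs(pgy - ggy).sum()
    return loss
```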
5.2. Training Schedule
- The network training is based on the Caffe framework [20].
- The whole training procedure consists of three phases.
- First, the authors sequentially train the four encoder-decoder components in both bootstrap and iterative nets for 250k iterations each with a batch size of 32.
- The outputs of the previous three network iterations are added to the batch, which yields a total batch size of 32 for the iterative network.
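The three-phase schedule can be summarized as below. This is a sketch only: train_fn is an assumed helper wrapping the solver (the original runs in Caffe [20]), and iteration counts for the later phases are left as parameters because the summary states them only for phase one.

```python
def demon_training_schedule(train_fn, encoder_decoders, iterative_net,
                            full_net, phase2_iters, phase3_iters):
    # Phase 1: sequentially train the four encoder-decoder components of the
    # bootstrap and iterative nets, 250k iterations each, batch size 32.
    for net in encoder_decoders:
        train_fn(net, iters=250_000, batch_size=32)
    # Phase 2: train the first iteration of the iterative net on its own;
    # outputs of the previous three network iterations are added to the
    # batch, yielding a total batch size of 32.
    train_fn(iterative_net, iters=phase2_iters, batch_size=32)
    # Phase 3: train all iterations jointly, which avoids intermediate
    # storage of network outputs.
    train_fn(full_net, iters=phase3_iters, batch_size=32)
```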
6.1. Datasets
- SUN3D [43] provides a diverse set of indoor images together with depth and camera pose.
- Depth maps are disturbed by measurement noise, and the authors use the same preprocessing as for SUN3D.
- Scenes11 is a synthetic dataset with generated images of virtual scenes with random geometry, which provide perfect depth and motion ground truth, but lack realism.
- The authors did not train on NYU and used the same test split as in Eigen et al. [7].
- For each test image, the authors automatically chose the next image that is sufficiently different from the first image, according to a threshold on the difference image; a sketch of this selection follows below.
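A minimal sketch of this frame selection heuristic: the concrete difference measure (mean absolute intensity difference) is an assumption, since the summary only mentions "a threshold on the difference image".

```python
import numpy as np

def pick_second_frame(first, subsequent_frames, threshold):
    # Return the first later frame whose mean absolute intensity difference
    # to the reference image exceeds the threshold.
    ref = first.astype(np.float64)
    for frame in subsequent_frames:
        if np.mean(np.abs(frame.astype(np.float64) - ref)) > threshold:
            return frame
    return None  # no sufficiently different frame found
```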
6.2. Error metrics
- While single-image methods aim to predict depth at the actual physical scale, two-image methods typically yield the scale relative to the norm of the camera translation vector.
- Comparing the results of these two families of methods requires a scale-invariant error metric.
- L1-inv $= \frac{1}{n}\sum_i \left| \frac{1}{z_i} - \frac{1}{\hat{z}_i} \right|$ (10). L1-rel computes the depth error relative to the ground-truth depth and therefore reduces errors where the ground-truth depth is large and increases the importance of close objects in the ground truth.
- The length of the translation vector is 1 by definition.
- The accuracy of optical flow is measured by the average endpoint error (EPE), that is, the Euclidean norm of the difference between the predicted and the true flow vector, averaged over all image pixels.
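A short sketch of these metrics in code is given below. The L1-inv form matches Eq. (10) above; the exact form of L1-rel is the standard relative error, an assumption where the summary gives only a verbal description.

```python
import numpy as np

def l1_inv(z_gt, z_pred):
    # Eq. (10): mean absolute difference of inverse depths.
    return np.mean(np.abs(1.0 / z_gt - 1.0 / z_pred))

def l1_rel(z_gt, z_pred):
    # Depth error relative to the ground-truth depth; down-weights errors
    # where the ground truth is large, emphasizing close objects.
    return np.mean(np.abs(z_gt - z_pred) / z_gt)

def epe(flow_gt, flow_pred):
    # Average endpoint error: Euclidean norm of the per-pixel flow
    # difference, averaged over all image pixels (flows are HxWx2 arrays).
    return np.mean(np.linalg.norm(flow_gt - flow_pred, axis=-1))
```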
6.3. Comparison to classic structure from motion
- The authors compare to several strong baselines that they implemented from state-of-the-art components (“Base-*”).
- The essential matrix is computed with RANSAC and the 5-point algorithm [31] for both (see the sketch after this list).
- Tab. 2 shows that DeMoN outperforms all baseline methods both on motion and depth accuracy by a factor of 1.5 to 2 on most datasets.
- The depth prediction of the first frame is shown.
- Higher resolution gives the Base-* methods an advantage in depth accuracy, but on the other hand these methods are more prone to outliers.
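For reference, a classic two-view motion estimate of this kind can be written with OpenCV as below. This mirrors the Base-* pipeline only at a high level (matched keypoints pts1/pts2 and intrinsics K are assumed inputs); it is not the authors' implementation.

```python
import cv2
import numpy as np

def estimate_motion(pts1, pts2, K):
    # Essential matrix via RANSAC and the 5-point algorithm (Nistér [31]).
    E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                      prob=0.999, threshold=1.0)
    # Recover rotation R and unit-norm translation t; the scale is
    # unobservable from two views (cf. Sec. 6.2: |t| = 1 by definition).
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
    return R, t
```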
6.4. Comparison to depth from single image
- To demonstrate the value of the motion parallax, the authors additionally compare to the single-image depth estimation methods by Eigen & Fergus [7] and Liu et al. [24].
- The Base-Oracle prediction on NYUv2 is missing because the motion ground truth is not available.
- Results on more methods and examples are shown in the supplementary material.
- Two models by Liu et al. are evaluated: one trained on indoor scenes from the NYUv2 dataset (“indoor”) and one trained on outdoor images from the Make3D dataset [32] (“outdoor”).
- On all but one dataset, DeMoN also outperforms the single-frame methods quantitatively, typically by a large margin.
6.4.1 Generalization to new data
- Scene-specific priors learned during training may be useless or even harmful when being confronted with a scene that is very different from the training data.
- In contrast, the geometric relations between a pair of images are independent of the content of the scene and should generalize to unknown scenes.
- Single-frame methods have severe problems in such cases, as most clearly visible in the point cloud visualization of the depth estimate for the last example.
- Fig. 9 and Tab. 3 show that DeMoN, as expected, generalizes better to these unexpected scenes than single-image methods.
- It shows that the network has learned to make use of the motion parallax.
6.5. Ablation studies
- The authors' architecture contains several design decisions, which they justify with the following ablation studies.
- All results have been obtained on the SUN3D dataset with the bootstrap net.
- Interestingly, while the scale-invariant loss greatly improves the prediction qualitatively (see Fig. 10), it has negative effects on depth scale estimation.
- Variant (a) uses just an L1 loss on the absolute depth values.
- Tab. 5 shows that, given the same flow, egomotion estimation improves when the flow confidence is provided as an extra input.
Frequently Asked Questions (10)
Q2. What is the efficient method for resolving the labels over spatial context?
Then a Fully Connected Conditional Random Field is used to model neighborhood and an efficient inference method [10] allows us to correct the labels over spatial context.
Q3. What is the likely location of the submap?
Once the submap corresponding to the closest position is retrieved, a dense registration method between the submap and the current spherical image, described in [12], is applied to refine the pose estimate locally (see figure 3).
Q4. How many sequences have been acquired with the multi-camera stereovision system?
Two sequences have been acquired with their multi-camera stereovision system on the same pathway at two different times, with an interval of three years. (The full resolution is 2048x665, but the authors use 1024x333 resolution for classification.)
Q5. What is the probable class prediction in occluded parts of the scene?
The correctness of the class prediction in occluded parts of the scene has been evaluated by making predictions in areas where observations of static labels are accessible and used as ground truth.
Q6. What is the weighting function for the pose?
The pose $\hat{T}T(x)$ is an approximation of the true transformation $T(\tilde{x})$, and $\Psi_{hub}$ is a robust weighting function on the error, given by Huber's M-estimator [14].
Q7. What is the way to update the map?
Using the proposed approach, it is possible to update the map by exploiting both the spatial context and the knowledge acquired along the robot's experience, resulting in a robust and stable representation of the environment.
Q8. What is the definition of a dynamic class?
A dynamic class, denoted $C_D$, occludes a static class $C_S$ by changing the label associated with the corresponding pixels in the image.
Q9. What is the probability of a static label being associated to a pixel?
To model the probability of associating a static label to a pixel $p = (x_p, y_p)$, a Gaussian function is associated with each neighbor node $n_i \in \mathcal{N}$.
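A minimal sketch of this neighborhood weighting, assuming an isotropic Gaussian of the pixel distance with a free bandwidth sigma (neither detail is specified in the excerpt):

```python
import numpy as np

def static_label_weights(p, neighbors, sigma=1.0):
    # One Gaussian per neighbor node n_i, evaluated at pixel p = (x_p, y_p).
    # Isotropic form and sigma are assumptions, not the paper's parameters.
    p = np.asarray(p, dtype=np.float64)
    return np.array([np.exp(-np.sum((p - np.asarray(n)) ** 2) / (2.0 * sigma ** 2))
                     for n in neighbors])
```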
Q10. What is the cost function for optimising intensity errors between spheres?
Following the formulation of [13], the cost function for optimising intensity errors between spheres $\{I_s, I_s^*\}$ is given as: $F_I = \frac{1}{2}\sum_i^{k} \Psi_{hub}\,\big\| I_s\big(\omega(\hat{T}T(x); P_i)\big) - \cdots$
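For illustration, one standard form of a Huber M-estimator weight such as $\Psi_{hub}$ is sketched below; the paper's exact threshold and variant are not given in the excerpt.

```python
import numpy as np

def huber_weight(residual, k):
    # Standard Huber weight: unit weight (quadratic influence) for small
    # residuals, k/|e| (linear influence) beyond the threshold k.
    e = np.abs(residual)
    return np.where(e <= k, 1.0, k / np.maximum(e, 1e-12))
```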