SceneNet RGB-D: Can 5M Synthetic Images Beat Generic ImageNet Pre-training on Indoor Segmentation?
Citations
Self-Supervised Visual Feature Learning With Deep Neural Networks: A Survey
SemanticKITTI: A Dataset for Semantic Scene Understanding of LiDAR Sequences
Training Deep Networks with Synthetic Data: Bridging the Reality Gap by Domain Randomization
References
ImageNet Classification with Deep Convolutional Neural Networks
Very Deep Convolutional Networks for Large-Scale Image Recognition
U-Net: Convolutional Networks for Biomedical Image Segmentation
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Frequently Asked Questions (17)
Q2. Why do the authors render on the GPU?
Since OptiX allows rendering on the GPU, it can fully utilise the parallelisation offered by readily available, modern consumer-grade graphics cards.
Q3. What is the primary goal of computer vision research?
A primary goal of computer vision research is to give computers the capability to reason about real-world images in a human-like manner.
Q4. What is the way to simulate the motion of the bodies?
The authors use simple Euler integration to simulate the motion of the bodies and apply randomly sampled 3D directional force vectors as well as drag to each of the bodies independently, with a maximum cap on the permitted speed.
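The motion model is simple enough to sketch directly. Below is a minimal NumPy sketch of this scheme; the time step, drag coefficient, force scale and speed cap are illustrative values, not taken from the paper.

```python
import numpy as np

def step_bodies(positions, velocities, dt=0.02, drag=0.1, max_speed=1.0, force_scale=5.0):
    """One simple Euler integration step for the moving bodies.

    positions, velocities: (N, 3) arrays, one row per body.
    dt, drag, max_speed and force_scale are illustrative values only.
    """
    n = positions.shape[0]
    # Randomly sampled 3D directional force vector for each body, applied independently.
    directions = np.random.normal(size=(n, 3))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True) + 1e-8
    forces = force_scale * directions

    # Drag opposes the current velocity.
    accel = forces - drag * velocities

    # Euler update with a hard cap on the permitted speed.
    velocities = velocities + dt * accel
    speeds = np.linalg.norm(velocities, axis=1, keepdims=True)
    velocities = np.where(speeds > max_speed, velocities * (max_speed / speeds), velocities)
    positions = positions + dt * velocities
    return positions, velocities
```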
Q5. What is the recent work by Gaidon et al.?
Gaidon et al. [9] used the Unity engine to create the Virtual KITTI dataset, which takes real-world seed videos to produce photorealistic synthetic variations to evaluate robustness of models to various visual factors.
Q6. How do the authors initialise the pose and look-at point?
The authors initialise the pose and ‘look-at’ point from a uniform random distribution within the bounding box of the scene, ensuring they are less than 50cm apart.
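A rough rejection-sampling sketch of this initialisation step is given below; the function name, the retry limit, and the exact form of the distance check are assumptions, with the 50cm threshold applied exactly as stated in the answer above.

```python
import numpy as np

def sample_pose_and_lookat(bbox_min, bbox_max, max_separation=0.5, max_tries=1000):
    """Sample a camera pose and 'look-at' point uniformly within the scene
    bounding box, rejecting pairs that violate the 50cm separation constraint
    described above."""
    bbox_min, bbox_max = np.asarray(bbox_min, float), np.asarray(bbox_max, float)
    for _ in range(max_tries):
        pose = np.random.uniform(bbox_min, bbox_max)
        look_at = np.random.uniform(bbox_min, bbox_max)
        if np.linalg.norm(pose - look_at) < max_separation:
            return pose, look_at
    raise RuntimeError("No valid pose/look-at pair found within max_tries")
```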
Q7. How do the authors sample objects for a given scene?
The authors sample objects for a given scene according to the distribution of object categories for that scene type in the SUN RGB-D real-world dataset.
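As a rough illustration, this reduces to drawing from a categorical distribution built from per-scene-type counts; the function name and the example counts below are made up for illustration.

```python
import numpy as np

def sample_objects_for_scene(category_counts, n_objects, rng=None):
    """Draw object categories according to the empirical category distribution
    of a given scene type (e.g. counts collected from SUN RGB-D)."""
    rng = rng or np.random.default_rng()
    categories = list(category_counts)
    counts = np.array([category_counts[c] for c in categories], dtype=float)
    return list(rng.choice(categories, size=n_objects, p=counts / counts.sum()))

# Illustrative (made-up) counts for a 'bedroom' scene type:
bedroom_counts = {"bed": 40, "chair": 25, "lamp": 15, "wardrobe": 10}
print(sample_objects_for_scene(bedroom_counts, n_objects=5))
```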
Q8. Why do the authors skip the final max-pooling layer and 1024-channel convolutions?
To maintain approximately consistent memory usage and batch sizes during training, the authors skip the final max-pooling layer and the 1024-channel convolutions.
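To make the trimming concrete, here is a minimal PyTorch sketch of a U-Net-style encoder that stops at 512 channels, assuming the standard 64-128-256-512-1024 channel progression; batch normalisation, skip connections and the decoder are omitted, and this is not the authors' exact architecture.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two 3x3 convolutions per encoder stage, as in a standard U-Net.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

# Standard U-Net encoder: 64 -> 128 -> 256 -> 512 -> 1024 channels.
# The trimmed variant stops at 512: the final max-pool and the
# 1024-channel convolutions are simply left out to save memory.
trimmed_encoder = nn.Sequential(
    conv_block(3, 64),
    nn.MaxPool2d(2), conv_block(64, 128),
    nn.MaxPool2d(2), conv_block(128, 256),
    nn.MaxPool2d(2), conv_block(256, 512),
    # nn.MaxPool2d(2), conv_block(512, 1024),  # skipped
)
```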
Q9. How does SceneNet RGB-D generate its indoor scenes?
Their dataset, SceneNet RGB-D, samples random layouts from SceneNet [12] and objects from ShapeNet [3] to create a practically unlimited number of scene configurations.
Q10. What is the main area of interest of the paper?
For semantic scene understanding, their main area of interest, Handa et al. [12] produced SceneNet, a repository of labelled synthetic 3D scenes from five different categories.
Q11. How did Peng et al. achieve this?
Peng et al. [22] augmented small datasets of objects with renderings of synthetic 3D objects with random textures and backgrounds to improve object detection performance.
Q12. What is the important factor to consider when rendering a scene?
The authors do not have strict real-time constraints for producing photorealistic renderings, but the scale and quality of the images required mean that the computational cost is an important factor to consider.
Q13. How long did the CNN take to be pre-trained?
Their networks pre-trained on SceneNet RGB-D were both kept at a constant learning rate, since training remained below 30 epochs: the RGB CNN was pre-trained for 15 epochs, which took approximately one month on 4 Nvidia Titan X GPUs, and the RGB-D CNN was pre-trained for 10 epochs, taking 3 weeks.
Q14. What does the algorithm do to the renderings?
The authors do not explicitly add camera noise or distortion to the renderings; however, the random ray-sampling and integration procedure of ray tracing naturally adds a certain degree of noise to the final images.
Q15. What is the most closely related approach?
The most closely related approach, performed concurrently with this work, is the subsequent work on the same set of layouts by Zhang et al. [32], which provided 400K physically-based RGB renderings of a randomly sampled still camera within those indoor scenes, with ground truth for three selected tasks: normal estimation, semantic annotation, and object boundary prediction.
Q16. How many labelled meshes of real world scenes have been obtained?
Hua et al. provide SceneNN [15], a dataset of 100 labelled meshes of real-world scenes obtained with a reconstruction system, with objects labelled directly in 3D for semantic segmentation ground truth.
Q17. How many layouts did the authors select for the validation and test sets?
The authors selected two layouts from each scene type (bathroom, kitchen, office, living room, and bedroom) for each of the validation and test sets, making the layout split 37-10-10 (train-validation-test).
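A small sketch of how such a split could be produced, assuming layouts are grouped by scene type in a dictionary; the function name, input format, and random seed are assumptions.

```python
import random

SCENE_TYPES = ["bathroom", "kitchen", "office", "living room", "bedroom"]

def split_layouts(layouts_by_type, seed=0):
    """Hold out two layouts per scene type for validation and two for test;
    the remaining layouts form the training set (37-10-10 overall)."""
    rng = random.Random(seed)
    train, val, test = [], [], []
    for scene_type in SCENE_TYPES:
        layouts = list(layouts_by_type[scene_type])
        rng.shuffle(layouts)
        val.extend(layouts[:2])
        test.extend(layouts[2:4])
        train.extend(layouts[4:])
    return train, val, test
```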