
SceneNet RGB-D: Can 5M Synthetic Images
Beat Generic ImageNet Pre-training on Indoor Segmentation?
John McCormac, Ankur Handa, Stefan Leutenegger, Andrew J. Davison
Dyson Robotics Laboratory at Imperial College, Department of Computing,
Imperial College London
{brendan.mccormac13,s.leutenegger,a.davison}@imperial.ac.uk, handa.ankur@gmail.com
Abstract
We introduce SceneNet RGB-D, a dataset providing
pixel-perfect ground truth for scene understanding prob-
lems such as semantic segmentation, instance segmenta-
tion, and object detection. It also provides perfect camera
poses and depth data, allowing investigation into geomet-
ric computer vision problems such as optical flow, cam-
era pose estimation, and 3D scene labelling tasks. Ran-
dom sampling permits virtually unlimited scene configu-
rations, and here we provide 5M rendered RGB-D im-
ages from 16K randomly generated 3D trajectories in syn-
thetic layouts, with random but physically simulated ob-
ject configurations. We compare the semantic segmenta-
tion performance of network weights produced from pre-
training on RGB images from our dataset against generic
VGG-16 ImageNet weights. After fine-tuning on the SUN
RGB-D and NYUv2 real-world datasets we find in both
cases that the synthetically pre-trained network outper-
forms the VGG-16 weights. When synthetic pre-training
includes a depth channel (something ImageNet cannot na-
tively provide) the performance is greater still. This sug-
gests that large-scale high-quality synthetic RGB datasets
with task-specific labels can be more useful for pre-
training than real-world generic pre-training such as Im-
ageNet. We host the dataset at http://robotvault.
bitbucket.io/scenenet-rgbd.html.
1. Introduction
A primary goal of computer vision research is to give
computers the capability to reason about real-world im-
ages in a human-like manner. Recent years have witnessed
large improvements in indoor scene understanding, largely
driven by the seminal work of Krizhevsky et al. [19] and
the increasing popularity of Convolutional Neural Networks
(CNNs). That work highlighted the importance of large
scale labelled datasets for supervised learning algorithms.
Figure 1. Example RGB rendered scenes from our dataset.
In this work we aim to obtain and experiment with large
quantities of labelled data without the cost of manual cap-
turing and labelling. In particular, we are motivated by tasks
which require more than a simple text label for an image.
For tasks such as semantic labelling and instance segmenta-
tion, obtaining accurate per-pixel ground truth annotations
by hand is a painstaking task, and the majority of RGB-D datasets have until recently been limited in scale [28, 30].
A number of recent works have started to tackle this
problem. Hua et al. provide sceneNN [15], a dataset of 100 labelled meshes of real-world scenes obtained with a reconstruction system, with objects labelled directly in 3D for semantic segmentation ground truth. Armeni et al. [1] produced the 2D-3D-S dataset with 70K RGB-D images of 6 large-scale indoor (educational and office) areas containing 270 smaller rooms, and the accompanying ground-truth annotations; their work used 360° rotational scans at fixed locations rather than a free 3D trajectory. Very recently, ScanNet by Dai et al. [6] provided a large and impressive real-world RGB-D dataset consisting of 1.5K free reconstruction trajectories taken from 707 indoor spaces, with 2.5M frames, along with dense 3D semantic annotations obtained manually via Mechanical Turk.
| | NYUv2 [28] | SUN RGB-D [30] | sceneNN [15] | 2D-3D-S [1] | ScanNet [6] | SceneNet [12] | SUN CG [31, 32]* | SceneNet RGB-D |
|---|---|---|---|---|---|---|---|---|
| RGB-D videos available | ✓ | ✗ | ✗ | ✓ | ✓ | ✗ | ✗ | ✓ |
| Per-pixel annotations | Key frames | Key frames | Videos | Videos | Videos | Key frames | Key frames | Videos |
| Trajectory ground truth | ✗ | ✗ | ✗ | ✓ | ✓ | ✗ | ✓ | ✓ |
| RGB texturing | Real | Real | Real | Real | Real | Non-photorealistic | Photorealistic | Photorealistic |
| Number of layouts | 464 | - | 100 | 270 | 1513 | 57 | 45,622 | 57 |
| Number of configurations | 464 | - | 100 | 270 | 1513 | 1000 | 45,622 | 16,895 |
| Number of annotated frames | 1,449 | 10K | - | 70K | 2.5M | 10K | 400K | 5M |
| 3D models available | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Method of design | Real | Real | Real | Real | Real | Manual and Random | Manual | Random |

Table 1. A comparison of 3D indoor scene datasets and their differing characteristics. sceneNN provides annotated 3D meshes instead of frames, and so we leave the number of annotated frames blank. 2D-3D-S provides a different type of camera trajectory in the form of rotational scans at fixed positions rather than free-moving 3D trajectories. *We combine within this column the additional recent work of physically based renderings of the same scenes produced by Zhang et al. [32]; it is that work which produced the 400K annotated frames.

Obtaining other forms of ground-truth data from real-world scenes, such as noise-free depth readings, precise camera poses, or 3D models, is even harder and often can only be estimated or provided with costly additional equipment (e.g. LIDAR for depth, VICON for camera pose tracking). In other domains, such as highly dynamic or interactive scenes, synthetic data becomes a necessity. Inspired by the low cost of producing very large-scale
synthetic datasets with complete and accurate ground-truth
information, as well as the recent successes of synthetic data
for training scene understanding systems, our goal is to gen-
erate a large photorealistic indoor RGB-D video dataset and
validate its usefulness in the real world.
This paper makes the following core contributions:
• We make available the largest (5M images) indoor synthetic video dataset of high-quality ray-traced RGB-D images with full lighting effects, visual artefacts such as motion blur, and accompanying ground truth labels.
• We outline a dataset generation pipeline that relies to the greatest degree possible on fully automatic randomised methods.
• We propose a novel and straightforward algorithm to generate sensible random 3D camera trajectories within an arbitrary indoor scene.
• To the best of our knowledge, this is the first work to show that an RGB-CNN pre-trained from scratch on synthetic RGB images can outperform an identical network initialised with the real-world VGG-16 ImageNet weights [29] on a real-world indoor semantic labelling dataset, after fine-tuning.
In Section 3 we provide a description of the dataset itself.
Section 4 describes our random scene generation method,
and Section 5 discusses random trajectory generation. In
Section 6 we describe our rendering framework. Finally,
Section 7 details our experimental results.
2. Background
A growing body of research has highlighted that care-
fully synthesised artificial data with appropriate noise mod-
els can be an effective substitute for real-world labelled data
in problems where ground-truth data is difficult to obtain.
Aubry et al. [2] used synthetic 3D CAD models for learn-
ing visual elements to do 2D-3D alignment in images, and
similarly Gupta et al. [10] trained on renderings of synthetic
objects to do alignment of 3D models with RGB-D images.
Peng et al. [22] augmented small datasets of objects with renderings of synthetic 3D objects with random textures and backgrounds to improve object detection performance. FlowNet [8] and FlowNet 2.0 [16] both used training data obtained from synthetic flying chairs for optical flow estimation, and de Souza et al. [7] used procedural generation of human actions with computer graphics to generate a large dataset of videos for human action recognition.
For semantic scene understanding, our main area of in-
terest, Handa et al. [12] produced SceneNet, a repository of
labelled synthetic 3D scenes from five different categories.
That repository was used to generate per-pixel semantic
segmentation ground truth for depth-only images from ran-
dom viewpoints. They demonstrated that a network trained
on 10K images of synthetic depth data and fine-tuned on
the original NYUv2 [28] and SUN RGB-D [30] real image
datasets shows an increase in the performance of semantic
segmentation when compared to a network trained on just
the original datasets.
For outdoor scenes, Ros et al. generated the SYNTHIA
[26] dataset for road scene understanding, and two inde-
pendent works by Richter et al. [24] and Shafaei et al. [27]
produced synthetic training data from photorealistic gam-
ing engines, validating the performance on real-world seg-
mentation tasks. Gaidon et al. [9] used the Unity engine
to create the Virtual KITTI dataset, which takes real-world
seed videos to produce photorealistic synthetic variations to
evaluate robustness of models to various visual factors. For
indoor scenes, recent work by Qiu et al. [23] called UnrealCV provided a plugin to generate ground truth data and photorealistic images from the Unreal Engine. This use of gaming engines is an exciting direction, but it can be limited by proprietary issues with either the engine or the assets.
Our SceneNet RGB-D dataset uses open-source scene
layouts [12] and 3D object repositories [3] to provide tex-
tured objects. For rendering, we have built upon an open-source ray-tracing framework which allows significant flexibility in the ground truth data we can collect and visual effects we can simulate.

Figure 2. Flow chart of the different stages in our dataset generation pipeline.
Recently, Song et al. released the SUN-CG dataset [31]
containing 46K synthetic scene layouts created using
Planner5D. The most closely related approach to ours, and
performed concurrently with it, is the subsequent work on
the same set of layouts by Zhang et al. [32], which provided
400K physically-based RGB renderings of a randomly sam-
pled still camera within those indoor scenes and provided
the ground truth for three selected tasks: normal estima-
tion, semantic annotation, and object boundary prediction.
Zhang et al. compared pre-training a CNN (already with
ImageNet initialisation) on lower quality OpenGL render-
ings against pre-training on high quality physically-based
renderings, and found pre-training on high quality render-
ings outperformed on all three tasks.
Our dataset, SceneNet RGB-D, samples random layouts
from SceneNet [12] and objects from ShapeNet [3] to cre-
ate a practically unlimited number of scene configurations.
As shown in Table 1, there are a number of key differ-
ences between our work and others. Firstly, our dataset
explicitly provides a randomly generated sequential video
trajectory within a scene, allowing 3D correspondences be-
tween viewpoints for 3D scene understanding tasks, with
the ground truth camera poses acting in lieu of a SLAM
system [20]. Secondly, Zhang et al. [32] use manually
designed scenes, while our randomised approach produces
chaotic configurations that can be generated on-the-fly with
little chance of repeating. Moreover, the layout textures,
lighting, and camera trajectories are all randomised, allow-
ing us to generate a wide variety of geometrically identical
but visually differing renders as shown in Figure 7.
We believe such randomness could help prevent overfit-
ting by providing large quantities of less predictable training
examples with high instructional value. Additionally, ran-
domness provides a simple baseline approach against which
more complex scene-grammars can justify their added com-
plexity. It remains an open question whether randomness is
preferable to designed scenes for learning algorithms. Ran-
domness leads to a simpler data generation pipeline and,
given a sufficient computational budget, allows for dynamic
on-the-fly generated training examples suitable for active
machine learning. A combination of the two approaches, with reasonable manually designed scene layouts or semantic constraints alongside physically simulated randomness, may in the future provide the best of both worlds.
3. Dataset Overview
The overall pipeline is depicted in Figure 2. It was neces-
sary to balance the competing requirements of high frame-
rates for video sequences with the computational cost of
rendering many very similar images, which would not pro-
vide significant variation in the training data. We decided
upon 5 minute trajectories at 320×240 image resolution,
with a single frame per second, resulting in 300 images per
trajectory (the trajectory is calculated at 25Hz, however we
only render every 25th pose). Each view consists of both
a shutter-open and a shutter-close camera pose. We sample from linear interpolations of these poses to produce motion blur. Each render takes 2–3 seconds on an Nvidia GTX 1080 GPU. There is also a trade-off between rendering time and quality of renders (see Figure 6 in Section 6.2).
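As a concrete illustration of the numbers above, the following minimal sketch (ours, not part of the released tooling; the number of blur samples is an assumption) shows the per-trajectory frame arithmetic and how poses between shutter open and close can be interpolated for motion blur:

```python
# Frame-count arithmetic for one trajectory (illustrative only).
TRAJECTORY_SECONDS = 5 * 60      # 5 minute trajectory
SIMULATION_HZ = 25               # poses are computed at 25 Hz
RENDER_EVERY_N = 25              # only every 25th pose is rendered

num_poses = TRAJECTORY_SECONDS * SIMULATION_HZ   # 7500 simulated poses
num_frames = num_poses // RENDER_EVERY_N         # 300 rendered frames
assert num_frames == 300

def blur_sample_positions(shutter_open, shutter_close, n_samples=16):
    """Linearly interpolate camera positions (3-vectors here) between the
    shutter-open and shutter-close poses; averaging renders taken at these
    positions approximates motion blur. n_samples is our choice."""
    return [
        [o + (c - o) * t / (n_samples - 1) for o, c in zip(shutter_open, shutter_close)]
        for t in range(n_samples)
    ]
```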
Various ground truth labels can be obtained with an extra rendering pass. Depth is rendered as the Euclidean distance to the first ray intersection, and instance labels are obtained by assigning indices to each object and rendering these. For ground truth data a single ray is emitted from the pixel centre. In accompanying datafiles we store, for each trajectory, a mapping from instance label to a WordNet semantic label. We have 255 WordNet semantic categories, including 40 added by the ShapeNet dataset. Given the static scene assumption and the depth map, instantaneous optical flow can also be calculated as the time-derivative of a surface point's projection into camera pixel space with respect to the linear interpolation of the shutter-open and shutter-close poses. Examples of the available ground truth are shown in Figure 3, and code to reproduce it is open-source.¹
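The optical-flow definition above can be made concrete with a short sketch (our own illustration, not the dataset code): each pixel is back-projected using the depth map and intrinsics, re-projected under a slightly advanced interpolated pose, and the flow is approximated by finite differences. We assume z-depth and a simple linear blend of pose matrices, which is only adequate for small motions.

```python
import numpy as np

def flow_from_depth(depth, K, T_open, T_close, dt=1e-2):
    """Approximate instantaneous optical flow for a static scene.

    depth:   HxW z-depth map (a simplifying assumption; the dataset stores
             Euclidean ray distance, which would first need converting).
    K:       3x3 camera intrinsics.
    T_open:  4x4 world-from-camera pose at shutter open.
    T_close: 4x4 world-from-camera pose at shutter close.
    dt:      fraction of the shutter interval used for finite differences.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3xN

    # Back-project pixels to 3D points in the shutter-open camera frame.
    rays = np.linalg.inv(K) @ pix
    pts_cam = rays * depth.reshape(1, -1)
    ones = np.ones((1, pts_cam.shape[1]))
    pts_world = (T_open @ np.vstack([pts_cam, ones]))[:3]

    # Advance a small fraction along the open->close interpolation.
    # A linear blend of pose matrices is a crude but adequate approximation
    # for the small inter-shutter motion considered here.
    T_dt = (1 - dt) * T_open + dt * T_close
    pts_cam2 = (np.linalg.inv(T_dt) @ np.vstack([pts_world, ones]))[:3]

    # Re-project and take finite differences in pixel space.
    proj = K @ pts_cam2
    proj = proj[:2] / proj[2]
    flow = (proj - pix[:2]) / dt      # pixels per shutter interval
    return flow.T.reshape(H, W, 2)
```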
Our dataset is separated into train, validation, and test
sets. Each set has a unique set of layouts, objects, and tra-
jectories. However the parameters for randomly choosing
lighting and trajectories remain the same. We selected two
layouts from each type (bathroom, kitchen, office, living
room, and bedroom) for the validation and test sets making
the layout split 37-10-10. For ShapeNet objects within a
scene we randomly divide the objects within each WordNet
class into 80-10-10% splits for train-val-test. This ensures
that some of each type of object are present in each split. Our
final training set has 5M images from 16K room configura-
tions, and our validation and test set have 300K images from
1K different configurations. Each configuration has a single
trajectory through it.
¹ https://github.com/jmccormac/pySceneNetRGBD
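The per-class 80-10-10 object split described above can be sketched as follows (our own illustration, not the dataset tooling; the minimum of one instance per split is our choice):

```python
import random
from collections import defaultdict

def split_objects_by_class(models, seed=0):
    """models: list of (wordnet_class, model_id) pairs.
    Returns 'train'/'val'/'test' lists split roughly 80/10/10 per class,
    so every class appears in every split when it has enough instances."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for cls, model_id in models:
        by_class[cls].append(model_id)

    splits = {"train": [], "val": [], "test": []}
    for cls, ids in by_class.items():
        rng.shuffle(ids)
        n = len(ids)
        n_val, n_test = max(1, n // 10), max(1, n // 10)
        splits["val"] += [(cls, m) for m in ids[:n_val]]
        splits["test"] += [(cls, m) for m in ids[n_val:n_val + n_test]]
        splits["train"] += [(cls, m) for m in ids[n_val + n_test:]]
    return splits
```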

Figure 3. Hand-picked examples from our dataset: (a) rendered photo, and the ground truth labels we generate: (b) depth, (c) instance, (d) class segmentation, and (e) optical flow.
4. Generating Random Scenes with Physics
To create scenes, we randomly select a density of objects per square metre. In our case we have two of these densities: for large objects we choose a density between 0.1 and 0.5 objects per m², and for small objects (<0.4 m tall) we choose a density between 0.5 and 3.0 objects per m². Given the floor area of a scene, we calculate the number of objects needed. We sample objects for a given scene according to the distribution of object categories for that scene type in the SUN RGB-D real-world dataset. We do this with the aim of including relevant objects within a context, e.g. a bathroom is more likely to contain a sink than a microwave. We then pick an instance uniformly at random from the available models for that object category.
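A minimal sketch of this sampling step (our own illustration under stated assumptions: `category_dist` and `models_by_category` are hypothetical pre-computed inputs, and the large/small category distinction is collapsed into a single distribution for brevity):

```python
import random

# Density ranges from the text above (objects per square metre).
LARGE_DENSITY = (0.1, 0.5)   # large objects
SMALL_DENSITY = (0.5, 3.0)   # small objects (< 0.4 m tall)

def sample_scene_objects(floor_area_m2, category_dist, models_by_category, rng=random):
    """Pick object categories according to a scene-type distribution
    (e.g. derived from SUN RGB-D), then a uniform random model instance
    per chosen category."""
    n_large = round(rng.uniform(*LARGE_DENSITY) * floor_area_m2)
    n_small = round(rng.uniform(*SMALL_DENSITY) * floor_area_m2)

    chosen = []
    for _ in range(n_large + n_small):
        # category_dist: {category: probability} for this scene type
        category = rng.choices(list(category_dist), weights=category_dist.values())[0]
        chosen.append((category, rng.choice(models_by_category[category])))
    return chosen
```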
We use an off-the-shelf physics engine, Project Chrono,² to dynamically simulate the scene. The objects are given a constant mass (10 kg) and a convex collision hull, and are positioned randomly within the 3D space of the layout's axis-aligned bounding box. To slightly bias objects towards the correct orientation, we offset the centre of gravity of each object to be below its mesh; without this, we found very few objects were in their normal upright position after the simulation. We simulate the system for 60 s to allow objects to settle into a physically realistic configuration. While not organised in a human manner, the resulting configuration aims to be physically plausible, i.e. avoiding configurations where an object cannot physically support another against gravity, or where objects intersect unrealistically.
² https://projectchrono.org/
5. Generating Random Trajectories
As we render videos at a large scale, it is imperative that
the trajectory generation be automated to avoid costly man-
ual labour. The majority of previous works have used a
SLAM system operated by a human to collect hand-held
motion: the trajectory of the camera poses returned by the
SLAM system is then inserted into a synthetic scene and the
corresponding data is rendered at discrete or interpolated
poses of the trajectory [11, 13]. However, such reliance on
humans to collect trajectories quickly limits the potential
scale of the dataset.
We automate this process using a simple random camera
trajectory generation procedure which we have not found in
any previous synthetic dataset work. For our trajectories,
we have the following desiderata. Generated trajectories should be random, but slightly biased towards looking into central areas of interest rather than, for example, panning along a wall. They should contain a mix of fast and slow rotations, like those of a human operator focussing on nearby and far-away points, and they should have limited rotational freedom that emphasises yaw and pitch rather than roll, which is a less prominent motion in human trajectories.
To achieve the desired trajectories we simulate two physical bodies: one defines the location of the camera, and the other the point in space that it is focussing on, as a proxy for a human paying attention to random points in a scene. We
take the simple approach of locking roll entirely, by setting
the up vector to always be along the positive y-axis. These
two points completely define the camera coordinate system.
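A minimal sketch (ours) of how the two bodies define the camera frame, with the up vector locked to the positive y-axis; the forward = +z axis convention is our assumption, and the degenerate case of looking straight up or down is not handled:

```python
import numpy as np

WORLD_UP = np.array([0.0, 1.0, 0.0])  # roll is locked by fixing 'up' to +y

def camera_pose(eye, look_at):
    """Build a world-from-camera rotation from the camera body position
    and the look-at body position."""
    forward = look_at - eye
    forward /= np.linalg.norm(forward)
    right = np.cross(WORLD_UP, forward)
    right /= np.linalg.norm(right)         # assumes forward is not parallel to WORLD_UP
    up = np.cross(forward, right)
    R = np.stack([right, up, forward], axis=1)  # columns: camera x, y, z axes in world
    return R, eye
```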

We simulate the motion of the two bodies using a simple physical motion model: with Euler integration, we apply randomly sampled 3D directional force vectors as well as drag to each body independently, with a cap on the permitted speed. This physical model has a number of benefits.
Firstly, it provides an intuitive set of metric physical prop-
erties we can set to achieve a desired trajectory, such as the
strength of the force in Newtons and the drag coefficients.
Secondly, it naturally produces smooth trajectories. Finally,
although not currently provided in our dataset, it can au-
tomatically produce synthetic IMU measurements, which
could prove useful for Visual-Inertial systems.
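One integration step for either body can be sketched as follows (our own illustration; the parameter values are placeholders, not those used to generate the dataset):

```python
import numpy as np

DT = 1.0 / 25.0        # simulation step (trajectory computed at 25 Hz)
DRAG = 0.5             # drag coefficient (illustrative)
MAX_SPEED = 1.0        # speed cap in m/s (illustrative)
MAX_FORCE = 2.0        # magnitude bound of the random force in N (illustrative)
MASS = 1.0             # body mass in kg (illustrative)

def euler_step(position, velocity, rng=np.random):
    """One Euler-integration step: random 3D force, linear drag, speed cap."""
    force = rng.uniform(-MAX_FORCE, MAX_FORCE, size=3)   # random directional force
    force -= DRAG * velocity                              # drag opposes motion
    velocity = velocity + (force / MASS) * DT
    speed = np.linalg.norm(velocity)
    if speed > MAX_SPEED:                                  # cap the permitted speed
        velocity *= MAX_SPEED / speed
    position = position + velocity * DT
    return position, velocity
```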
We initialise the pose and ‘look-at’ point from a uniform
random distribution within the bounding box of the scene,
ensuring they are less than 50cm apart. As not all scenes are
convex, it is possible to initialise the starting points outside
of a layout, for example in an ‘L’-shaped room. Therefore,
we have two simple checks. The first is to restart the simula-
tion if either body leaves the bounding volume. The second
is that within the first 500 poses at least 10 different object
instances must have been visible. This prevents trajectories
external to the scene layout with only the outer wall visible.
Finally, to avoid collisions with the scene or objects we
render a depth image using the z-buffer of OpenGL. If a col-
lision occurs, the velocity is simply negated in a ‘bounce’,
which simplifies the collision by assuming the surface nor-
mal is always the inverse of the velocity vector. Figure 4
visualises a two-body trajectory from the final dataset.
Figure 4. Example camera and look-at trajectory through a synthetic scene (with rendered views from the first and last frustum).
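The collision handling can be sketched as below (our own illustration): the distance to the nearest surface along the direction of travel, read back from an OpenGL z-buffer render at the body's position, determines whether the next step would pass through geometry; if so, the velocity is simply negated. The `depth_along_velocity` input and the margin value are assumptions for this sketch.

```python
import numpy as np

COLLISION_MARGIN = 0.1  # metres; our choice of safety margin

def step_with_bounce(position, velocity, dt, depth_along_velocity):
    """If the next step would come closer to a surface than the margin,
    'bounce' by negating the velocity, i.e. treat the surface normal as
    opposing the direction of motion."""
    step_length = np.linalg.norm(velocity) * dt
    if depth_along_velocity - step_length < COLLISION_MARGIN:
        velocity = -velocity
    return position + velocity * dt, velocity
```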
6. Rendering Photorealistic RGB Frames
The rendering engine used was a modified version of the Opposite Renderer³ [21], a flexible open-source ray-tracer built on top of the Nvidia OptiX framework. We do not have strict real-time constraints to produce photorealistic rendering, but the scale and quality of images required does mean the computational cost is an important factor to consider. Since OptiX allows rendering on the GPU, it is able to fully utilise the parallelisation offered by readily-available modern consumer-grade graphics cards.

³ http://apartridge.github.io/OppositeRenderer/
Figure 5. Reflections and transparency: (a) without, (b) with.
6.1. Photon Mapping
We use a process known as photon mapping to approx-
imate the rendering equation. Our static scene assump-
tion makes photon mapping particularly efficient as we can
produce photon maps for a scene which are maintained
throughout the trajectory. A good tutorial on photon map-
ping is given by its creators, Jensen et al. [18]. Normal ray-
tracing allows for accurate reflections and transparency ren-
derings, but photon mapping provides a global illumination
model that also approximates indirect illumination, colour-
bleeding from diffuse surfaces, and caustics. Many of these
effects can be seen in Figure 5.
6.2. Rendering Quality
Rendering over 5M images requires a significant amount
of computation. We rendered our images on 4-12 GPUs for
approximately one month. An important trade-off in this
calculation is between the quality of the renders and the
quantity of images. Figure 6 shows two of the most im-
portant variables dictating this balance within our rendering
framework. Our final dataset was rendered with 16 samples
per pixel and 4 photon maps. This equates to approximately
3s per image on a single GPU.
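As a rough sanity check on those figures (our own back-of-the-envelope arithmetic, not from the paper):

```python
# Rendering budget: 5M images at ~3 s each on a single GPU.
images = 5_000_000
seconds_per_image = 3                              # 16 samples/pixel, 4 photon maps
gpu_days = images * seconds_per_image / 86_400     # ~174 single-GPU days
print(round(gpu_days / 12), round(gpu_days / 4))   # ~14 to ~43 days on 12 to 4 GPUs
```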
6.3. Random Layout Textures and Lighting
To improve the variability within our 57 layouts, we ran-
domly assign textures from a curated library of selected
seamless textures to their components. Each layout object
has a material type, which then gives a number of random
texture images for that type. For example, we have a large
number of different wall textures, floor textures, and curtain
textures. We also generate random indoor lighting for the scene. We have two types of lights: spherical orbs, which serve as point light sources, and parallelograms, which act as area light sources.
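The texture randomisation can be sketched as follows (our own illustration; the library layout, file names, and the `material_type` field are hypothetical):

```python
import random

# Hypothetical texture library: one pool of seamless texture images per
# layout material type.
TEXTURE_LIBRARY = {
    "wall":    ["wall_01.jpg", "wall_02.jpg", "wall_03.jpg"],
    "floor":   ["floor_01.jpg", "floor_02.jpg"],
    "curtain": ["curtain_01.jpg"],
}

def randomise_layout_textures(layout_components, rng=random):
    """Assign each layout component a random seamless texture drawn from
    the pool for its material type."""
    return {component: rng.choice(TEXTURE_LIBRARY[component.material_type])
            for component in layout_components}
```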

References

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS, 2012.
Simonyan, K., and Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR, 2015.
Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In MICCAI, 2015.
Ioffe, S., and Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In ICML, 2015.