
SceneNet RGB-D: Can 5M Synthetic Images
Beat Generic ImageNet Pre-training on Indoor Segmentation?
John McCormac, Ankur Handa, Stefan Leutenegger, Andrew J. Davison
Dyson Robotics Laboratory at Imperial College, Department of Computing,
Imperial College London
{brendan.mccormac13,s.leutenegger,a.davison}@imperial.ac.uk, handa.ankur@gmail.com
Abstract
We introduce SceneNet RGB-D, a dataset providing
pixel-perfect ground truth for scene understanding prob-
lems such as semantic segmentation, instance segmenta-
tion, and object detection. It also provides perfect camera
poses and depth data, allowing investigation into geomet-
ric computer vision problems such as optical flow, cam-
era pose estimation, and 3D scene labelling tasks. Ran-
dom sampling permits virtually unlimited scene configu-
rations, and here we provide 5M rendered RGB-D im-
ages from 16K randomly generated 3D trajectories in syn-
thetic layouts, with random but physically simulated ob-
ject configurations. We compare the semantic segmenta-
tion performance of network weights produced from pre-
training on RGB images from our dataset against generic
VGG-16 ImageNet weights. After fine-tuning on the SUN
RGB-D and NYUv2 real-world datasets we find in both
cases that the synthetically pre-trained network outper-
forms the VGG-16 weights. When synthetic pre-training
includes a depth channel (something ImageNet cannot na-
tively provide) the performance is greater still. This sug-
gests that large-scale high-quality synthetic RGB datasets
with task-specific labels can be more useful for pre-
training than real-world generic pre-training such as Im-
ageNet. We host the dataset at http://robotvault.
bitbucket.io/scenenet-rgbd.html.
1. Introduction
A primary goal of computer vision research is to give
computers the capability to reason about real-world im-
ages in a human-like manner. Recent years have witnessed
large improvements in indoor scene understanding, largely
driven by the seminal work of Krizhevsky et al. [19] and
the increasing popularity of Convolutional Neural Networks
(CNNs). That work highlighted the importance of large
scale labelled datasets for supervised learning algorithms.
Figure 1. Example RGB rendered scenes from our dataset.
In this work we aim to obtain and experiment with large
quantities of labelled data without the cost of manual cap-
turing and labelling. In particular, we are motivated by tasks
which require more than a simple text label for an image.
For tasks such as semantic labelling and instance segmenta-
tion, obtaining accurate per-pixel ground truth annotations
by hand is a painstaking task, and the majority of RGB-D datasets have until recently been limited in scale [28, 30].
A number of recent works have started to tackle this
problem. Hua et al. provide sceneNN [15], a dataset of 100 labelled meshes of real-world scenes obtained with a reconstruction system, with objects labelled directly in 3D for semantic segmentation ground truth. Armeni et al. [1] produced the 2D-3D-S dataset with 70K RGB-D images of 6 large-scale indoor (educational and office) areas containing 270 smaller rooms, and the accompanying ground-truth annotations; their work used 360° rotational scans at fixed locations rather than a free 3D trajectory. Very recently, ScanNet by Dai et al. [6] provided a large and impressive real-world RGB-D dataset consisting of 1.5K free reconstruction trajectories taken from 707 indoor spaces, with 2.5M frames, along with dense 3D semantic annotations obtained manually via Mechanical Turk.
| | NYUv2 [28] | SUN RGB-D [30] | sceneNN [15] | 2D-3D-S [1] | ScanNet [6] | SceneNet [12] | SUN CG [31, 32]* | SceneNet RGB-D |
|---|---|---|---|---|---|---|---|---|
| RGB-D videos available | ✓ | ✗ | ✗ | ✓ | ✓ | ✗ | ✗ | ✓ |
| Per-pixel annotations | Key frames | Key frames | Videos | Videos | Videos | Key frames | Key frames | Videos |
| Trajectory ground truth | ✗ | ✗ | ✗ | ✓ | ✓ | ✗ | ✓ | ✓ |
| RGB texturing | Real | Real | Real | Real | Real | Non-photorealistic | Photorealistic | Photorealistic |
| Number of layouts | 464 | - | 100 | 270 | 1513 | 57 | 45,622 | 57 |
| Number of configurations | 464 | - | 100 | 270 | 1513 | 1000 | 45,622 | 16,895 |
| Number of annotated frames | 1,449 | 10K | - | 70K | 2.5M | 10K | 400K | 5M |
| 3D models available | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Method of design | Real | Real | Real | Real | Real | Manual and Random | Manual | Random |

Table 1. A comparison of 3D indoor scene datasets and their differing characteristics. sceneNN provides annotated 3D meshes instead of frames, and so we leave the number of annotated frames blank. 2D-3D-S provides a different type of camera trajectory in the form of rotational scans at fixed positions rather than free-moving 3D trajectories. *We combine within this column the additional recent work of physically based renderings of the same scenes produced by Zhang et al. [32]; it is that work which produced the 400K annotated frames.

Obtaining other forms of ground-truth data from real-world scenes, such as noise-free depth readings, precise camera poses, or 3D models, is even harder and often can only be estimated or provided with costly additional equipment (e.g. LIDAR for depth, VICON for camera pose tracking). In other domains, such as highly dynamic or interactive scenes, synthetic data becomes a necessity. Inspired by the low cost of producing very large-scale
synthetic datasets with complete and accurate ground-truth
information, as well as the recent successes of synthetic data
for training scene understanding systems, our goal is to gen-
erate a large photorealistic indoor RGB-D video dataset and
validate its usefulness in the real world.
This paper makes the following core contributions:
• We make available the largest (5M images) indoor synthetic video dataset of high-quality ray-traced RGB-D images with full lighting effects, visual artefacts such as motion blur, and accompanying ground truth labels.
• We outline a dataset generation pipeline that relies to the greatest degree possible on fully automatic randomised methods.
• We propose a novel and straightforward algorithm to generate sensible random 3D camera trajectories within an arbitrary indoor scene.
• To the best of our knowledge, this is the first work to show that an RGB-CNN pre-trained from scratch on synthetic RGB images can outperform an identical network initialised with the real-world VGG-16 ImageNet weights [29] on a real-world indoor semantic labelling dataset, after fine-tuning.
In Section 3 we provide a description of the dataset itself.
Section 4 describes our random scene generation method,
and Section 5 discusses random trajectory generation. In
Section 6 we describe our rendering framework. Finally,
Section 7 details our experimental results.
2. Background
A growing body of research has highlighted that care-
fully synthesised artificial data with appropriate noise mod-
els can be an effective substitute for real-world labelled data
in problems where ground-truth data is difficult to obtain.
Aubry et al. [2] used synthetic 3D CAD models for learn-
ing visual elements to do 2D-3D alignment in images, and
similarly Gupta et al. [10] trained on renderings of synthetic
objects to do alignment of 3D models with RGB-D images.
Peng et al. [22] augmented small datasets of objects with renderings of synthetic 3D objects with random textures and backgrounds to improve object detection performance. FlowNet [8] and FlowNet 2.0 [16] both used training data obtained from synthetic flying chairs for optical flow estimation, and de Souza et al. [7] used procedural generation of human actions with computer graphics to generate a large dataset of videos for human action recognition.
For semantic scene understanding, our main area of in-
terest, Handa et al. [12] produced SceneNet, a repository of
labelled synthetic 3D scenes from five different categories.
That repository was used to generate per-pixel semantic
segmentation ground truth for depth-only images from ran-
dom viewpoints. They demonstrated that a network trained
on 10K images of synthetic depth data and fine-tuned on
the original NYUv2 [28] and SUN RGB-D [30] real image
datasets shows an increase in the performance of semantic
segmentation when compared to a network trained on just
the original datasets.
For outdoor scenes, Ros et al. generated the SYNTHIA
[26] dataset for road scene understanding, and two inde-
pendent works by Richter et al. [24] and Shafaei et al. [27]
produced synthetic training data from photorealistic gam-
ing engines, validating the performance on real-world seg-
mentation tasks. Gaidon et al. [9] used the Unity engine
to create the Virtual KITTI dataset, which takes real-world
seed videos to produce photorealistic synthetic variations to
evaluate robustness of models to various visual factors. For
indoor scenes, recent work by Qiu et al. [23] called UnrealCV provided a plugin to generate ground truth data and photorealistic images from the Unreal Engine. This use of gaming engines is an exciting direction, but it can be limited by proprietary issues with either the engine or the assets.
Our SceneNet RGB-D dataset uses open-source scene
layouts [12] and 3D object repositories [3] to provide tex-
tured objects. For rendering, we have built upon an open-source ray-tracing framework which allows significant flexibility in the ground truth data we can collect and visual effects we can simulate.

Figure 2. Flow chart of the different stages in our dataset generation pipeline.
Recently, Song et al. released the SUN-CG dataset [31]
containing 46K synthetic scene layouts created using
Planner5D. The most closely related approach to ours, and
performed concurrently with it, is the subsequent work on
the same set of layouts by Zhang et al. [32], which provided
400K physically-based RGB renderings of a randomly sam-
pled still camera within those indoor scenes and provided
the ground truth for three selected tasks: normal estima-
tion, semantic annotation, and object boundary prediction.
Zhang et al. compared pre-training a CNN (already with
ImageNet initialisation) on lower quality OpenGL render-
ings against pre-training on high quality physically-based
renderings, and found pre-training on high quality render-
ings outperformed on all three tasks.
Our dataset, SceneNet RGB-D, samples random layouts
from SceneNet [12] and objects from ShapeNet [3] to cre-
ate a practically unlimited number of scene configurations.
As shown in Table 1, there are a number of key differ-
ences between our work and others. Firstly, our dataset
explicitly provides a randomly generated sequential video
trajectory within a scene, allowing 3D correspondences be-
tween viewpoints for 3D scene understanding tasks, with
the ground truth camera poses acting in lieu of a SLAM
system [20]. Secondly, Zhang et al. [32] use manually
designed scenes, while our randomised approach produces
chaotic configurations that can be generated on-the-fly with
little chance of repeating. Moreover, the layout textures,
lighting, and camera trajectories are all randomised, allow-
ing us to generate a wide variety of geometrically identical
but visually differing renders as shown in Figure 7.
We believe such randomness could help prevent overfit-
ting by providing large quantities of less predictable training
examples with high instructional value. Additionally, ran-
domness provides a simple baseline approach against which
more complex scene-grammars can justify their added com-
plexity. It remains an open question whether randomness is
preferable to designed scenes for learning algorithms. Ran-
domness leads to a simpler data generation pipeline and,
given a sufficient computational budget, allows for dynamic
on-the-fly generated training examples suitable for active
machine learning. A combination of the two approaches, with reasonable manually designed scene layouts or semantic constraints alongside physically simulated randomness, may in the future provide the best of both worlds.
3. Dataset Overview
The overall pipeline is depicted in Figure 2. It was neces-
sary to balance the competing requirements of high frame-
rates for video sequences with the computational cost of
rendering many very similar images, which would not pro-
vide significant variation in the training data. We decided
upon 5 minute trajectories at 320×240 image resolution,
with a single frame per second, resulting in 300 images per
trajectory (the trajectory is calculated at 25Hz, however we
only render every 25th pose). Each view consists of both
a shutter-open and a shutter-close camera pose. We sample from linear interpolations of these poses to produce motion blur. Each render takes 2–3 seconds on an Nvidia GTX 1080 GPU. There is also a trade-off between rendering time and quality of renders (see Figure 6 in Section 6.2).
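As a concrete illustration of the numbers above, the following minimal sketch (ours, not part of the released tooling; the number of blur samples is an assumption) shows the per-trajectory frame arithmetic and how poses between shutter open and close can be interpolated for motion blur:

```python
# Frame-count arithmetic for one trajectory (illustrative only).
TRAJECTORY_SECONDS = 5 * 60      # 5 minute trajectory
SIMULATION_HZ = 25               # poses are computed at 25 Hz
RENDER_EVERY_N = 25              # only every 25th pose is rendered

num_poses = TRAJECTORY_SECONDS * SIMULATION_HZ   # 7500 simulated poses
num_frames = num_poses // RENDER_EVERY_N         # 300 rendered frames
assert num_frames == 300

def blur_sample_positions(shutter_open, shutter_close, n_samples=16):
    """Linearly interpolate camera positions (3-vectors here) between the
    shutter-open and shutter-close poses; averaging renders taken at these
    positions approximates motion blur. n_samples is our choice."""
    return [
        [o + (c - o) * t / (n_samples - 1) for o, c in zip(shutter_open, shutter_close)]
        for t in range(n_samples)
    ]
```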
Various ground truth labels can be obtained with an extra rendering pass. Depth is rendered as the Euclidean distance to the first ray intersection, and instance labels are obtained by assigning indices to each object and rendering these. For ground truth data a single ray is emitted from the pixel centre. In accompanying datafiles we store, for each trajectory, a mapping from instance label to a WordNet semantic label. We have 255 WordNet semantic categories, including 40 added by the ShapeNet dataset. Given the static scene assumption and the depth map, instantaneous optical flow can also be calculated as the time-derivative of a surface point's projection into camera pixel space with respect to the linear interpolation of the shutter-open and shutter-close poses. Examples of the available ground truth are shown in Figure 3, and code to reproduce it is open-source.¹
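The optical-flow definition above can be made concrete with a short sketch (our own illustration, not the dataset code): each pixel is back-projected using the depth map and intrinsics, re-projected under a slightly advanced interpolated pose, and the flow is approximated by finite differences. We assume z-depth and a simple linear blend of pose matrices, which is only adequate for small motions.

```python
import numpy as np

def flow_from_depth(depth, K, T_open, T_close, dt=1e-2):
    """Approximate instantaneous optical flow for a static scene.

    depth:   HxW z-depth map (a simplifying assumption; the dataset stores
             Euclidean ray distance, which would first need converting).
    K:       3x3 camera intrinsics.
    T_open:  4x4 world-from-camera pose at shutter open.
    T_close: 4x4 world-from-camera pose at shutter close.
    dt:      fraction of the shutter interval used for finite differences.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3xN

    # Back-project pixels to 3D points in the shutter-open camera frame.
    rays = np.linalg.inv(K) @ pix
    pts_cam = rays * depth.reshape(1, -1)
    ones = np.ones((1, pts_cam.shape[1]))
    pts_world = (T_open @ np.vstack([pts_cam, ones]))[:3]

    # Advance a small fraction along the open->close interpolation.
    # A linear blend of pose matrices is a crude but adequate approximation
    # for the small inter-shutter motion considered here.
    T_dt = (1 - dt) * T_open + dt * T_close
    pts_cam2 = (np.linalg.inv(T_dt) @ np.vstack([pts_world, ones]))[:3]

    # Re-project and take finite differences in pixel space.
    proj = K @ pts_cam2
    proj = proj[:2] / proj[2]
    flow = (proj - pix[:2]) / dt      # pixels per shutter interval
    return flow.T.reshape(H, W, 2)
```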
Our dataset is separated into train, validation, and test
sets. Each set has a unique set of layouts, objects, and tra-
jectories. However the parameters for randomly choosing
lighting and trajectories remain the same. We selected two
layouts from each type (bathroom, kitchen, office, living
room, and bedroom) for the validation and test sets making
the layout split 37-10-10. For ShapeNet objects within a
scene we randomly divide the objects within each WordNet
class into 80-10-10% splits for train-val-test. This ensures
that some of each type of object are present in each split. Our
final training set has 5M images from 16K room configura-
tions, and our validation and test set have 300K images from
1K different configurations. Each configuration has a single
trajectory through it.
¹ https://github.com/jmccormac/pySceneNetRGBD
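The per-class 80-10-10 object split described above can be sketched as follows (our own illustration, not the dataset tooling; the minimum of one instance per split is our choice):

```python
import random
from collections import defaultdict

def split_objects_by_class(models, seed=0):
    """models: list of (wordnet_class, model_id) pairs.
    Returns 'train'/'val'/'test' lists split roughly 80/10/10 per class,
    so every class appears in every split when it has enough instances."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for cls, model_id in models:
        by_class[cls].append(model_id)

    splits = {"train": [], "val": [], "test": []}
    for cls, ids in by_class.items():
        rng.shuffle(ids)
        n = len(ids)
        n_val, n_test = max(1, n // 10), max(1, n // 10)
        splits["val"] += [(cls, m) for m in ids[:n_val]]
        splits["test"] += [(cls, m) for m in ids[n_val:n_val + n_test]]
        splits["train"] += [(cls, m) for m in ids[n_val + n_test:]]
    return splits
```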

Figure 3. Hand-picked examples from our dataset: (a) rendered photo, and the ground truth labels we generate: (b) depth, (c) instance, (d) class segmentation, and (e) optical flow.
4. Generating Random Scenes with Physics
To create scenes, we randomly select a density of objects per square metre. In our case we have two of these densities: for large objects we choose a density between 0.1 and 0.5 objects per m², and for small objects (<0.4 m tall) we choose a density between 0.5 and 3.0 objects per m². Given the floor area of a scene, we calculate the number of objects needed. We sample objects for a given scene according to the distribution of object categories for that scene type in the SUN RGB-D real-world dataset. We do this with the aim of including relevant objects within a context, e.g. a bathroom is more likely to contain a sink than a microwave. We then pick an instance uniformly at random from the available models for that object category.
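A minimal sketch of this sampling step (our own illustration under stated assumptions: `category_dist` and `models_by_category` are hypothetical pre-computed inputs, and the large/small category distinction is collapsed into a single distribution for brevity):

```python
import random

# Density ranges from the text above (objects per square metre).
LARGE_DENSITY = (0.1, 0.5)   # large objects
SMALL_DENSITY = (0.5, 3.0)   # small objects (< 0.4 m tall)

def sample_scene_objects(floor_area_m2, category_dist, models_by_category, rng=random):
    """Pick object categories according to a scene-type distribution
    (e.g. derived from SUN RGB-D), then a uniform random model instance
    per chosen category."""
    n_large = round(rng.uniform(*LARGE_DENSITY) * floor_area_m2)
    n_small = round(rng.uniform(*SMALL_DENSITY) * floor_area_m2)

    chosen = []
    for _ in range(n_large + n_small):
        # category_dist: {category: probability} for this scene type
        category = rng.choices(list(category_dist), weights=category_dist.values())[0]
        chosen.append((category, rng.choice(models_by_category[category])))
    return chosen
```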
We use an off-the-shelf physics engine, Project Chrono,² to dynamically simulate the scene. The objects are given a constant mass (10 kg) and a convex collision hull, and are positioned randomly within the 3D space of the layout's axis-aligned bounding box. To slightly bias objects towards the correct orientation, we offset the centre of gravity of each object to be below its mesh; without this, we found very few objects were in their normal upright position after the simulation. We simulate the system for 60 s to allow objects to settle into a physically realistic configuration. While not organised in a human manner, the resulting configuration aims to be physically plausible, i.e. avoiding configurations where an object cannot physically support another against gravity, or where objects intersect unrealistically.
² https://projectchrono.org/
5. Generating Random Trajectories
As we render videos at a large scale, it is imperative that
the trajectory generation be automated to avoid costly man-
ual labour. The majority of previous works have used a
SLAM system operated by a human to collect hand-held
motion: the trajectory of the camera poses returned by the
SLAM system is then inserted into a synthetic scene and the
corresponding data is rendered at discrete or interpolated
poses of the trajectory [11, 13]. However, such reliance on
humans to collect trajectories quickly limits the potential
scale of the dataset.
We automate this process using a simple random camera
trajectory generation procedure which we have not found in
any previous synthetic dataset work. For our trajectories,
we have the following desiderata. Generated trajectories should be random, but slightly biased towards looking into central areas of interest rather than, for example, panning along a wall. They should contain a mix of fast and slow rotations, like those of a human operator focussing on nearby and far-away points, and they should have limited rotational freedom that emphasises yaw and pitch rather than roll, which is a less prominent motion in human trajectories.
To achieve the desired trajectories we simulate two physical bodies: one defines the location of the camera, and the other the point in space that it is focussing on, as a proxy for a human paying attention to random points in a scene. We
take the simple approach of locking roll entirely, by setting
the up vector to always be along the positive y-axis. These
two points completely define the camera coordinate system.
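A minimal sketch (ours) of how the two bodies define the camera frame, with the up vector locked to the positive y-axis; the forward = +z axis convention is our assumption, and the degenerate case of looking straight up or down is not handled:

```python
import numpy as np

WORLD_UP = np.array([0.0, 1.0, 0.0])  # roll is locked by fixing 'up' to +y

def camera_pose(eye, look_at):
    """Build a world-from-camera rotation from the camera body position
    and the look-at body position."""
    forward = look_at - eye
    forward /= np.linalg.norm(forward)
    right = np.cross(WORLD_UP, forward)
    right /= np.linalg.norm(right)         # assumes forward is not parallel to WORLD_UP
    up = np.cross(forward, right)
    R = np.stack([right, up, forward], axis=1)  # columns: camera x, y, z axes in world
    return R, eye
```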

We simulate the motion of the two bodies using a simple physical motion model: with Euler integration, we apply randomly sampled 3D directional force vectors as well as drag to each body independently, with a cap on the permitted speed. This physical model has a number of benefits.
Firstly, it provides an intuitive set of metric physical prop-
erties we can set to achieve a desired trajectory, such as the
strength of the force in Newtons and the drag coefficients.
Secondly, it naturally produces smooth trajectories. Finally,
although not currently provided in our dataset, it can au-
tomatically produce synthetic IMU measurements, which
could prove useful for Visual-Inertial systems.
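One integration step for either body can be sketched as follows (our own illustration; the parameter values are placeholders, not those used to generate the dataset):

```python
import numpy as np

DT = 1.0 / 25.0        # simulation step (trajectory computed at 25 Hz)
DRAG = 0.5             # drag coefficient (illustrative)
MAX_SPEED = 1.0        # speed cap in m/s (illustrative)
MAX_FORCE = 2.0        # magnitude bound of the random force in N (illustrative)
MASS = 1.0             # body mass in kg (illustrative)

def euler_step(position, velocity, rng=np.random):
    """One Euler-integration step: random 3D force, linear drag, speed cap."""
    force = rng.uniform(-MAX_FORCE, MAX_FORCE, size=3)   # random directional force
    force -= DRAG * velocity                              # drag opposes motion
    velocity = velocity + (force / MASS) * DT
    speed = np.linalg.norm(velocity)
    if speed > MAX_SPEED:                                  # cap the permitted speed
        velocity *= MAX_SPEED / speed
    position = position + velocity * DT
    return position, velocity
```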
We initialise the pose and ‘look-at’ point from a uniform
random distribution within the bounding box of the scene,
ensuring they are less than 50cm apart. As not all scenes are
convex, it is possible to initialise the starting points outside
of a layout, for example in an ‘L’-shaped room. Therefore,
we have two simple checks. The first is to restart the simula-
tion if either body leaves the bounding volume. The second
is that within the first 500 poses at least 10 different object
instances must have been visible. This prevents trajectories
external to the scene layout with only the outer wall visible.
Finally, to avoid collisions with the scene or objects we
render a depth image using the z-buffer of OpenGL. If a col-
lision occurs, the velocity is simply negated in a ‘bounce’,
which simplifies the collision by assuming the surface nor-
mal is always the inverse of the velocity vector. Figure 4
visualises a two-body trajectory from the final dataset.
Figure 4. Example camera and look-at trajectory through a synthetic scene (with rendered views from the first and last frustum).
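The collision handling can be sketched as below (our own illustration): the distance to the nearest surface along the direction of travel, read back from an OpenGL z-buffer render at the body's position, determines whether the next step would pass through geometry; if so, the velocity is simply negated. The `depth_along_velocity` input and the margin value are assumptions for this sketch.

```python
import numpy as np

COLLISION_MARGIN = 0.1  # metres; our choice of safety margin

def step_with_bounce(position, velocity, dt, depth_along_velocity):
    """If the next step would come closer to a surface than the margin,
    'bounce' by negating the velocity, i.e. treat the surface normal as
    opposing the direction of motion."""
    step_length = np.linalg.norm(velocity) * dt
    if depth_along_velocity - step_length < COLLISION_MARGIN:
        velocity = -velocity
    return position + velocity * dt, velocity
```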
6. Rendering Photorealistic RGB Frames
The rendering engine used was a modified version of the Opposite Renderer³ [21], a flexible open-source ray-tracer built on top of the Nvidia OptiX framework. We do not have strict real-time constraints to produce photorealistic rendering, but the scale and quality of images required does mean the computational cost is an important factor to consider. Since OptiX allows rendering on the GPU, it is able to fully utilise the parallelisation offered by readily-available modern consumer-grade graphics cards.

³ http://apartridge.github.io/OppositeRenderer/
Figure 5. Reflections and transparency: (a) without, (b) with.
6.1. Photon Mapping
We use a process known as photon mapping to approx-
imate the rendering equation. Our static scene assump-
tion makes photon mapping particularly efficient as we can
produce photon maps for a scene which are maintained
throughout the trajectory. A good tutorial on photon map-
ping is given by its creators, Jensen et al. [18]. Normal ray-
tracing allows for accurate reflections and transparency ren-
derings, but photon mapping provides a global illumination
model that also approximates indirect illumination, colour-
bleeding from diffuse surfaces, and caustics. Many of these
effects can be seen in Figure 5.
6.2. Rendering Quality
Rendering over 5M images requires a significant amount
of computation. We rendered our images on 4-12 GPUs for
approximately one month. An important trade-off in this
calculation is between the quality of the renders and the
quantity of images. Figure 6 shows two of the most im-
portant variables dictating this balance within our rendering
framework. Our final dataset was rendered with 16 samples
per pixel and 4 photon maps. This equates to approximately
3s per image on a single GPU.
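As a rough sanity check on those figures (our own back-of-the-envelope arithmetic, not from the paper):

```python
# Rendering budget: 5M images at ~3 s each on a single GPU.
images = 5_000_000
seconds_per_image = 3                              # 16 samples/pixel, 4 photon maps
gpu_days = images * seconds_per_image / 86_400     # ~174 single-GPU days
print(round(gpu_days / 12), round(gpu_days / 4))   # ~14 to ~43 days on 12 to 4 GPUs
```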
6.3. Random Layout Textures and Lighting
To improve the variability within our 57 layouts, we ran-
domly assign textures from a curated library of selected
seamless textures to their components. Each layout object
has a material type, which then gives a number of random
texture images for that type. For example, we have a large
number of different wall textures, floor textures, and curtain
textures. We also generate random indoor lighting for the scene. We have two types of lights: spherical orbs, which serve as point light sources, and parallelograms, which act as area light sources.
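The texture randomisation can be sketched as follows (our own illustration; the library layout, file names, and the `material_type` field are hypothetical):

```python
import random

# Hypothetical texture library: one pool of seamless texture images per
# layout material type.
TEXTURE_LIBRARY = {
    "wall":    ["wall_01.jpg", "wall_02.jpg", "wall_03.jpg"],
    "floor":   ["floor_01.jpg", "floor_02.jpg"],
    "curtain": ["curtain_01.jpg"],
}

def randomise_layout_textures(layout_components, rng=random):
    """Assign each layout component a random seamless texture drawn from
    the pool for its material type."""
    return {component: rng.choice(TEXTURE_LIBRARY[component.material_type])
            for component in layout_components}
```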

References

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS, 2012.
Simonyan, K., and Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR, 2015.
Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In MICCAI, 2015.
Ioffe, S., and Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In ICML, 2015.