SemanticFusion: Dense 3D Semantic Mapping with Convolutional
Neural Networks
John McCormac, Ankur Handa, Andrew Davison, and Stefan Leutenegger
Dyson Robotics Lab, Imperial College London
Abstract: Ever more robust, accurate and detailed mapping
using visual sensing has proven to be an enabling factor for
mobile robots across a wide variety of applications. For the next
level of robot intelligence and intuitive user interaction, maps
need to extend beyond geometry and appearance; they need
to contain semantics. We address this challenge by combining
Convolutional Neural Networks (CNNs) and a state-of-the-art
dense Simultaneous Localisation and Mapping (SLAM) system,
ElasticFusion, which provides long-term dense correspondence
between frames of indoor RGB-D video even during loopy
scanning trajectories. These correspondences allow the CNN’s
semantic predictions from multiple view points to be proba-
bilistically fused into a map. This not only produces a useful
semantic 3D map, but we also show on the NYUv2 dataset that
fusing multiple predictions leads to an improvement even in the
2D semantic labelling over baseline single frame predictions. We
also show that for a smaller reconstruction dataset with larger
variation in prediction viewpoint, the improvement over single
frame segmentation increases. Our system is efficient enough
to allow real-time interactive use at frame-rates of ≈25Hz.
I. INTRODUCTION
The inclusion of rich semantic information within a dense
map enables a much greater range of functionality than
geometry alone. For instance, in domestic robotics, a simple
fetching task requires knowledge of both what something is,
as well as where it is located. As a specific example, thanks
to sharing of the same spatial and semantic understanding
between user and robot, we may issue commands such
as ’fetch the coffee mug from the nearest table on your
right’. Similarly, the ability to query semantic information
within a map is useful for humans directly, providing a
database for answering spoken queries about the semantics
of a previously made map; ‘How many chairs do we have
in the conference room? What is the distance between the
lectern and its nearest chair?’ In this work, we combine
the geometric information from a state-of-the-art SLAM
system ElasticFusion [25] with recent advances in semantic
segmentation using Convolutional Neural Networks (CNNs).
Our approach is to use the SLAM system to provide
correspondences from the 2D frame into a globally consistent
3D map. This allows the CNN’s semantic predictions from
multiple viewpoints to be probabilistically fused into a dense
semantically annotated map, as shown in Figure 1. Elas-
ticFusion is particularly suitable for fusing semantic labels
because its surfel-based surface representation is automati-
cally deformed to remain consistent after the small and large
loop closures which would frequently occur during typical
interactive use by an agent (whether human or robot). As the
surface representation is deformed and corrected, individual
surfels remain persistently associated with real-world entities
and this enables long-term fusion of per-frame semantic
predictions over wide changes in viewpoint.

Fig. 1: The output of our system: On the left, a dense surfel-based reconstruction from a video sequence in the NYUv2 test set. On the right, the same map, semantically annotated with the classes given in the legend below.
The geometry of the map itself also provides useful
information which can be used to efficiently regularise the
final predictions. Our pipeline is designed to work online, and
although we have not focused on performance, the efficiency
of each component leads to a real-time capable (≈25Hz)
interactive system. The resulting map could also be used
as a basis for more expensive offline processing to further
improve both the geometry and the semantics; however that
has not been explored in the current work.
We evaluate the accuracy of our system on the NYUv2
dataset, and show that by using information from the unla-
belled raw video footage we can improve upon baseline ap-
proaches performing segmentation using only a single frame.
This suggests that the inclusion of SLAM not only provides an
immediately useful semantic 3D map, but also that
many state-of-the-art 2D single frame semantic segmentation
approaches may be boosted in performance when linked with
SLAM.
The NYUv2 dataset was not taken with full room recon-
struction in mind, and often does not provide significant vari-
ations in viewpoints for a given scene. To explore the benefits
of SemanticFusion within a more thorough reconstruction,
we developed a small dataset of a reconstructed office
room, annotated with the NYUv2 semantic classes. Within
this dataset we witness a more significant improvement in
segmentation accuracy over single frame 2D segmentation.
This indicates that the system is particularly well suited to
longer duration scans, where wide viewpoint variation helps
to disambiguate the single-view 2D semantics.

II. RELATED WORK
The works most closely related are Stückler et al. [23] and
Hermans et al. [8]; both aim towards a dense, semantically
annotated 3D map of indoor scenes. They both obtain per-
pixel label predictions for incoming frames using Random
Decision Forests, whereas ours exploits recent advances in
Convolutional Neural Networks that provide state-of-the-art
accuracy, with a real-time capable run-time performance.
They both fuse predictions from different viewpoints in a
classic Bayesian framework. Stückler et al. [23] used a
Multi-Resolution Surfel Map-based SLAM system capable
of operating at 12.8Hz; however, unlike our system they
do not maintain a single global semantic map as local key
frames store aggregated semantic information and these are
subject to graph optimisation in each frame. Hermans et
al. [8] did not use the capability of a full SLAM system with
explicit loop closure: they registered the predictions in the
reference frames using only camera tracking. Their run-time
performance was 4.6Hz, which would prohibit processing a
live video feed, whereas our system is capable of operating
online and interactively. As here, they regularised their pre-
dictions using Krähenbühl and Koltun's [13] fully-connected
CRF inference scheme to obtain a final semantic map.
Previous work by Salas-Moreno et al. aimed to create a
fully capable SLAM system, SLAM++ [19], which maps
indoor scenes at the level of semantically defined objects.
However, their method is limited to mapping objects that are
present in a pre-defined database. It also does not provide the
dense labelling of entire scenes that we aim for in this work,
which also includes walls, floors, doors, and windows; these
are equally important for describing the extent of the room.
Additionally, the features they use to match template models
are hand-crafted unlike our CNN features that are learned in
an end-to-end fashion with large training datasets.
The majority of other approaches to indoor semantic la-
belling either focus on offline batch mapping methods [24],
[12] or on single-frame 2D segmentations which do not
aim to produce a semantically annotated 3D map [3], [20],
[15], [22]. Valentin et al. [24] used a CRF and a per-
pixel labelling from a variant of TextonBoost to reconstruct
semantic maps of both indoor and outdoor scenes. This
produces a globally consistent 3D map; however, inference is
performed on the whole mesh once instead of incrementally
fusing the predictions online. Koppula et al. [12] also tackle
the problem on a completed 3D map, forming segments of
the map into nodes of a graphical model and using hand-
crafted geometric and visual features as edge potentials to
infer the final semantic labelling.
Our semantic mapping pipeline is inspired by the re-
cent success of Convolutional Neural Networks in semantic
labelling and segmentation tasks [14], [16], [17]. CNNs
have proven capable of both state-of-the-art accuracy and
efficient test-time performance. They have exhibited
these capabilities on numerous datasets and a variety of data
modalities, in particular RGB [17], [16], Depth [1], [7] and
Normals [2], [4], [6], [5]. In this work we build on the CNN
model proposed by Noh et al. [17], but modify it to take
advantage of the directly available depth data in a manner
that does not require significant additional pre-processing.

Fig. 2: An overview of our pipeline: Input images are used to produce a SLAM map and a set of probability prediction maps (here only four are shown). These maps are fused into the final dense semantic map via Bayesian updates.
III. METHOD
Our SemanticFusion pipeline is composed of three sepa-
rate units: a real-time SLAM system (ElasticFusion), a Con-
volutional Neural Network, and a Bayesian update scheme,
as illustrated in Figure 2. The role of the SLAM system is
to provide correspondences between frames, and a globally
consistent map of fused surfels. Separately, the CNN receives
a 2D image (for our architecture this is RGBD, for Eigen et
al. [2] it also includes estimated normals), and returns a set
of per pixel class probabilities. Finally, a Bayesian update
scheme keeps track of the class probability distribution for
each surfel, and uses the correspondences provided by the
SLAM system to update those probabilities based on the
CNN’s predictions. In addition, we experiment with a CRF
regularisation scheme to use the geometry of the map itself
to improve the semantic predictions [8], [13]. The following
section outlines each of these components in more detail.
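Before detailing each component, the overall control flow can be summarised in a short sketch. The following Python code is a minimal illustration only, assuming hypothetical `slam`, `cnn`, `fusion`, and `crf` wrapper objects for ElasticFusion, the CNN, the per-surfel label tables, and the CRF regulariser; it is not the authors' implementation, and the update intervals simply reflect the defaults discussed later in Section IV-C.

```python
# Minimal sketch of the SemanticFusion loop (hypothetical wrapper objects).
CNN_EVERY = 10    # CNN forward pass every 10th frame (Sec. IV-C)
CRF_EVERY = 500   # CRF regularisation every 500 frames (Sec. IV-C)

def semantic_fusion(frames, slam, cnn, fusion, crf):
    for k, (rgb, depth) in enumerate(frames):
        T_wc = slam.track(rgb, depth)          # camera pose for frame k
        slam.fuse(rgb, depth, T_wc)            # add/refine surfels, handle loop closures
        fusion.sync(slam.surfels())            # keep per-surfel label tables aligned

        if k % CNN_EVERY == 0:
            probs = cnn.forward(rgb, depth)    # per-pixel class probabilities
            fusion.bayesian_update(slam.surfels(), probs, T_wc)   # Eq. (1)

        if k > 0 and k % CRF_EVERY == 0:
            crf.regularise(slam.surfels(), fusion)                # Sec. III-D
    return slam.surfels(), fusion.probabilities()
```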
A. SLAM Mapping
We choose ElasticFusion as our SLAM system (available at
https://github.com/mp3guy/ElasticFusion). For each
arriving frame, $k$, ElasticFusion tracks the camera pose
via a combined ICP and RGB alignment, to yield a new
pose $T_{WC}$, where $W$ denotes the World frame and $C$ the
camera frame. New surfels are added into our map using this
camera pose, and existing surfel information is combined
with new evidence to refine their positions, normals, and
colour information. Additional checks for a loop closure
event run in parallel and the map is optimised immediately
upon a loop closure detection.
The deformation graph and surfel based representation of
ElasticFusion lend themselves naturally to the task at hand,
allowing probability distributions to be ‘carried along’ with
the surfels during loop closure, and new depth readings
to be fused to update the surfel's depth and normal information,
without destroying the surfel, or its underlying probability
distribution. It operates at real-time frame-rates at VGA
resolution and so can be used both interactively by a human
or in robotic applications. We used the default parameters in
the public implementation, except for the depth cutoff, which
we extend from 3m to 8m to allow reconstruction to occur
on sequences with geometry outside of the 3m range.
B. CNN Architecture
Our CNN is implemented in caffe [11] and adopts the
Deconvolutional Semantic Segmentation network architec-
ture proposed by Noh et al. [17]. Their architecture is
itself based on the VGG 16-layer network [21], but with
the addition of max unpooling and deconvolutional layers
which are trained to output a dense pixel-wise semantic
probability map. This CNN was trained for RGB input, and
in the following sections when using a network with this
setup we refer to it as the RGB-CNN.
Given the availability of depth data, we modified the
original network architecture to accept depth information as
a fourth channel. Unfortunately, the depth modality lacks
the large scale training datasets of its RGB counterpart. The
NYUv2 dataset only consists of 795 labelled training images.
To effectively use depth, we initialized the depth filters with
the average intensity of the other three inputs, which had
already been trained on a large dataset, and converted it
from the 0–255 colour range to the 0–8m depth range by
increasing the weights by a factor of 32×.
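As a rough illustration of this initialisation, the sketch below averages the pretrained RGB filters of the first convolution layer and scales the result by 32 (roughly 255/8, so the 0–8m depth range contributes on a comparable scale to the 0–255 colour range) before copying it into the new fourth input channel. The array layout and function name are assumptions for illustration, not the authors' code.

```python
import numpy as np

def init_depth_channel(conv1_weights):
    """conv1_weights: (out_channels, 4, kH, kW) array whose first three input
    channels hold pretrained RGB filters; the fourth (depth) channel is new."""
    rgb_filters = conv1_weights[:, :3, :, :]
    # Average the pretrained RGB filters and scale by ~32 so the 0-8m depth
    # range produces responses comparable to the 0-255 colour range.
    conv1_weights[:, 3, :, :] = rgb_filters.mean(axis=1) * 32.0
    return conv1_weights
```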
We rescale incoming images to the native 224×224 reso-
lution for our CNNs, using bilinear interpolation for RGB,
and nearest neighbour for depth. In our experiments with
Eigen et al.'s implementation we rescale the inputs in the
same manner to 320×240 resolution. We upsample the
network output probabilities to full 640×480 image resolution
using nearest neighbour when applying the update to surfels,
described in the section below.
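A minimal sketch of this resizing step is shown below; the use of OpenCV and plain NumPy indexing is an assumption about tooling, as the paper does not state which routines were used.

```python
import cv2
import numpy as np

def prepare_inputs(rgb, depth, size=(224, 224)):
    # Bilinear interpolation for colour, nearest neighbour for depth so that
    # no spurious intermediate depth values appear at object boundaries.
    rgb_small = cv2.resize(rgb, size, interpolation=cv2.INTER_LINEAR)
    depth_small = cv2.resize(depth, size, interpolation=cv2.INTER_NEAREST)
    return rgb_small, depth_small

def upsample_probabilities(probs, out_hw=(480, 640)):
    # probs: (h, w, num_classes) network output; nearest-neighbour upsample to
    # the full 640x480 resolution before the per-surfel Bayesian update.
    h, w = probs.shape[:2]
    rows = np.arange(out_hw[0]) * h // out_hw[0]
    cols = np.arange(out_hw[1]) * w // out_hw[1]
    return probs[rows[:, None], cols]
```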
C. Incremental Semantic Label Fusion
In addition to normal and location information, each surfel
(index $s$) in our map $\mathcal{M}$ stores a discrete probability
distribution $P(L_s = l_i)$ over the set of class labels $l_i \in \mathcal{L}$.
Each newly generated surfel is initialised with a uniform
distribution over the semantic classes, as we begin with no
a priori evidence as to its latent classification.
After a prespecified number of frames, we perform a
forward pass of the CNN with the image $I_k$ coming directly
from the camera. Depending on the CNN architecture, this
image can include any combination of RGB, depth, or
normals. Given the data $I_k$ of the $k$-th image, the output of
the CNN is interpreted in a simplified manner as a per-pixel
independent probability distribution over the class labels
$P(O_u = l_i \mid I_k)$, with $u$ denoting pixel coordinates.
Using the tracked camera pose $T_{WC}$, we associate every
surfel at a given 3D location ${}^{W}\mathbf{x}(s)$ in the map with
pixel coordinates $u$ via the camera projection
$u(s,k) = \pi(T_{CW}(k)\,{}^{W}\mathbf{x}(s))$, employing the homogeneous
transformation matrix $T_{CW}(k) = T_{WC}^{-1}(k)$ and using homogeneous
3D coordinates. This enables us to update all the surfels in
the visible set $\mathcal{V}_k \subset \mathcal{M}$ with the corresponding probability
distribution by means of a recursive Bayesian update

$$P(l_i \mid I_{1,\dots,k}) = \frac{1}{Z}\, P(l_i \mid I_{1,\dots,k-1})\, P(O_{u(s,k)} = l_i \mid I_k), \qquad (1)$$

which is applied to all label probabilities per surfel, finally
normalising with constant $Z$ to yield a proper distribution.
It is the SLAM correspondences that allow us to accurately
associate label hypotheses from multiple images and com-
bine evidence in a Bayesian way. The following section dis-
cusses how the naïve independence approximation employed
so far can be mitigated, allowing semantic information to be
propagated spatially when semantics are fused from different
viewpoints.
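A simplified sketch of this fusion step is given below. It projects surfel positions into the current frame with a pinhole model and applies the multiplicative update of Equation (1); determining the visible set by a simple bounds-and-depth check (rather than the SLAM system's surfel splatting, which also handles occlusion) and the array-based interface are simplifying assumptions, not the authors' implementation.

```python
import numpy as np

def bayesian_update(surfel_probs, surfel_pos_w, T_cw, K, cnn_probs):
    """Fuse one CNN prediction into the per-surfel class distributions (Eq. 1).

    surfel_probs : (S, C) current per-surfel class probabilities
    surfel_pos_w : (S, 3) surfel positions in the world frame W
    T_cw         : (4, 4) homogeneous world-to-camera transform, T_CW = T_WC^-1
    K            : (3, 3) camera intrinsic matrix
    cnn_probs    : (H, W, C) per-pixel class probabilities from the CNN
    """
    H, W, _ = cnn_probs.shape
    # Transform surfel positions into the camera frame (homogeneous coordinates).
    pts_h = np.hstack([surfel_pos_w, np.ones((len(surfel_pos_w), 1))])
    pts_c = (T_cw @ pts_h.T).T[:, :3]
    # Pinhole projection pi(.) to pixel coordinates u(s, k).
    z = pts_c[:, 2]
    z_safe = np.where(z > 0, z, 1.0)     # avoid division by zero; masked out below
    u = np.round(K[0, 0] * pts_c[:, 0] / z_safe + K[0, 2]).astype(int)
    v = np.round(K[1, 1] * pts_c[:, 1] / z_safe + K[1, 2]).astype(int)
    visible = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    # Recursive Bayesian update: multiply by the new likelihood and renormalise.
    updated = surfel_probs[visible] * cnn_probs[v[visible], u[visible]]
    updated /= updated.sum(axis=1, keepdims=True)
    surfel_probs[visible] = updated
    return surfel_probs
```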
D. Map Regularisation
We explore the benefits of using map geometry to regu-
larise predictions by applying a fully-connected CRF with
Gaussian edge potentials to surfels in the 3D world frame,
as in the work of Hermans et al. [8], [13]. We do not use the
CRF to arrive at a final prediction for each surfel, but instead
use it incrementally to update the probability distributions.
In our work, we treat each surfel as a node in the graph. The
algorithm uses the mean-field approximation and a message
passing scheme to efficiently infer the latent variables that
approximately minimise the Gibbs energy $E$ of a labelling,
$\mathbf{x}$, in a fully-connected graph, where $x_s \in \{l_i\}$ denotes a
given labelling for the surfel with index $s$.
The energy $E(\mathbf{x})$ consists of two parts: the unary data term
$\psi_u(x_s)$ is a function of a given label, and is parameterised by
the internal probability distribution of the surfel from fusing
multiple CNN predictions as described above. The pairwise
smoothness term, $\psi_p(x_s, x_{s'})$, is a function of the labelling
of two connected surfels in the graph, and is parameterised
by the geometry of the map:

$$E(\mathbf{x}) = \sum_{s} \psi_u(x_s) + \sum_{s<s'} \psi_p(x_s, x_{s'}). \qquad (2)$$
For the data term we simply use the negative logarithm of
the chosen labelling’s probability for a given surfel,
$$\psi_u(x_s) = -\log\!\left(P(L_s = x_s \mid I_{1,\dots,k})\right). \qquad (3)$$
In the scheme proposed by Krähenbühl and Koltun [13]
the smoothness term is constrained to be a linear combination
of $K$ Gaussian edge potential kernels, where $\mathbf{f}_s$ denotes some
feature vector for surfel $s$, and in our case $\mu(x_s, x_{s'})$ is given
by the Potts model, $\mu(x_s, x_{s'}) = [x_s \neq x_{s'}]$:

$$\psi_p(x_s, x_{s'}) = \mu(x_s, x_{s'}) \left( \sum_{m=1}^{K} w^{(m)} k^{(m)}(\mathbf{f}_s, \mathbf{f}_{s'}) \right). \qquad (4)$$
Following previous work [8] we use two pairwise potentials:
a bilateral appearance potential seeking to closely tie
together surfels with both a similar position and appearance,
and a spatial smoothing potential which enforces smooth
predictions in areas with similar surface normals:

$$k_1(\mathbf{f}_s, \mathbf{f}_{s'}) = \exp\!\left( -\frac{|\mathbf{p}_s - \mathbf{p}_{s'}|^2}{2\theta_\alpha^2} - \frac{|\mathbf{c}_s - \mathbf{c}_{s'}|^2}{2\theta_\beta^2} \right), \qquad (5)$$

$$k_2(\mathbf{f}_s, \mathbf{f}_{s'}) = \exp\!\left( -\frac{|\mathbf{p}_s - \mathbf{p}_{s'}|^2}{2\theta_\alpha^2} - \frac{|\mathbf{n}_s - \mathbf{n}_{s'}|^2}{2\theta_\gamma^2} \right). \qquad (6)$$
We chose unit standard deviations of $\theta_\alpha = 0.05$ m in
the spatial domain, $\theta_\beta = 20$ in the RGB colour domain,
and $\theta_\gamma = 0.1$ radians in the angular domain. We did not
tune these parameters for any particular dataset. We also
maintained $w^{(1)} = 10$ and $w^{(2)} = 3$ for all experiments. These
were the default settings in Krähenbühl and Koltun's public
implementation [13] (available from http://www.philkr.net/home/densecrf).
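For concreteness, one mean-field update with these two kernels could look like the brute-force sketch below, which computes the dense O(S²) pairwise terms directly and is therefore suitable only for small surfel counts; the published implementation instead uses the efficient high-dimensional filtering of Krähenbühl and Koltun [13]. The array interface is an assumption for illustration.

```python
import numpy as np

THETA_ALPHA, THETA_BETA, THETA_GAMMA = 0.05, 20.0, 0.1   # defaults above
W1, W2 = 10.0, 3.0

def mean_field_step(fused_probs, pos, col, nrm):
    """One mean-field update over all surfels (brute-force O(S^2) sketch).

    fused_probs : (S, C) per-surfel distributions from Bayesian fusion (unaries)
    pos, col, nrm : (S, 3) surfel positions (m), colours (0-255) and normals
    """
    d2_pos = ((pos[:, None] - pos[None]) ** 2).sum(-1)
    d2_col = ((col[:, None] - col[None]) ** 2).sum(-1)
    d2_nrm = ((nrm[:, None] - nrm[None]) ** 2).sum(-1)
    k1 = np.exp(-d2_pos / (2 * THETA_ALPHA**2) - d2_col / (2 * THETA_BETA**2))   # Eq. (5)
    k2 = np.exp(-d2_pos / (2 * THETA_ALPHA**2) - d2_nrm / (2 * THETA_GAMMA**2))  # Eq. (6)
    k = W1 * k1 + W2 * k2
    np.fill_diagonal(k, 0.0)                              # no self-messages
    unary = -np.log(np.clip(fused_probs, 1e-12, None))    # Eq. (3)
    msg = k @ fused_probs                                 # kernel-weighted label mass
    # Potts compatibility: each label is penalised by the weighted probability
    # that a neighbouring surfel takes a *different* label.
    pairwise = msg.sum(axis=1, keepdims=True) - msg
    q = np.exp(-(unary + pairwise))
    return q / q.sum(axis=1, keepdims=True)
```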
IV. EXPERIMENTS
A. Network Training
We initialise our CNNs with weights from Noh et al. [17]
trained for segmentation on the PASCAL VOC 2012 segmen-
tation dataset [3]. For depth input we initialise the fourth
channel as described in Section III-B, above. We finetuned
this network on the training set of the NYUv2 dataset for
the 13 semantic classes defined by Couprie et al. [1].
For optimisation we used standard stochastic gradient
descent, with a learning rate of 0.01, momentum of 0.9, and
weight decay of $5 \times 10^{-4}$. After 10k iterations we reduced
the learning rate to $1 \times 10^{-3}$. We use a mini-batch size of
64, and trained the networks for a total of 20k iterations over
the course of 2 days on an Nvidia GTX Titan X.
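A solver configuration consistent with these settings might look like the sketch below, using caffe's standard solver fields; the net file name, snapshot settings, and the use of pycaffe to drive training are assumptions, and the mini-batch size of 64 would be set in the data layer of the net definition rather than in the solver.

```python
import caffe

# Hedged sketch of a solver matching the stated hyperparameters.
solver_txt = """
net: "semanticfusion_train.prototxt"  # placeholder: deconv net with 4-channel input
base_lr: 0.01
lr_policy: "step"
gamma: 0.1        # 0.01 -> 0.001 ...
stepsize: 10000   # ... after 10k iterations
momentum: 0.9
weight_decay: 0.0005
max_iter: 20000
snapshot: 10000
snapshot_prefix: "semanticfusion"
solver_mode: GPU
"""
with open("solver.prototxt", "w") as f:
    f.write(solver_txt)

caffe.set_mode_gpu()
solver = caffe.SGDSolver("solver.prototxt")
solver.solve()
```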
B. Reconstruction Dataset
We produced a small experimental RGB-D reconstruction
dataset, which aimed for a relatively complete reconstruction
of an office room. The trajectory used is notably more loopy,
both locally and globally, than the NYUv2 dataset which
typically consists of a single back and forth sweep. We
believe the trajectory in our dataset is more representative
of the scanning motion an active agent may perform when
inspecting a scene.
We also took a different approach to manual annotation of
this data, by using a 3D tool we developed to annotate the
surfels of the final 3D reconstruction with the 13 NYUv2
semantic classes under consideration (only 9 were present).
We then automatically generated 2D labellings for any frame
in the input sequence via projection. The tool, and the
resulting annotations, are depicted in Figure 3. Every 100th
frame of the sequence was used as a test sample to validate
our predictions against the annotated ground truth, resulting
in 49 test frames.

Fig. 3: Our office reconstruction dataset: On the left are the captured RGB and Depth images. On the right is our 3D reconstruction and annotation. Inset into that is the final ground truth rendered labelling we use for testing.
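The projection used to generate these 2D ground-truth labellings can be sketched as follows, splatting each annotated surfel into the image with a simple z-buffer; the single-pixel splat and the array interface are simplifications of whatever renderer was actually used, not the authors' tool.

```python
import numpy as np

def render_labels(surfel_pos_w, surfel_labels, T_cw, K, hw=(480, 640)):
    """Project 3D-annotated surfels into a camera frame to produce a 2D
    ground-truth labelling (single-pixel splat with a z-buffer)."""
    H, W = hw
    label_img = np.zeros((H, W), dtype=np.int32)     # 0 = void / unlabelled
    zbuf = np.full((H, W), np.inf)
    pts_h = np.hstack([surfel_pos_w, np.ones((len(surfel_pos_w), 1))])
    pts_c = (T_cw @ pts_h.T).T
    z = pts_c[:, 2]
    z_safe = np.where(z > 0, z, 1.0)
    u = np.round(K[0, 0] * pts_c[:, 0] / z_safe + K[0, 2]).astype(int)
    v = np.round(K[1, 1] * pts_c[:, 1] / z_safe + K[1, 2]).astype(int)
    ok = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    for ui, vi, zi, li in zip(u[ok], v[ok], z[ok], surfel_labels[ok]):
        if zi < zbuf[vi, ui]:                        # keep the closest surfel
            zbuf[vi, ui] = zi
            label_img[vi, ui] = li
    return label_img
```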
C. CNN and CRF Update Frequency Experiments
We used the dataset to evaluate the accuracy of our
system when only performing a CNN prediction on a subset
of the incoming video frames. We used the RGB-CNN
described above, and evaluated the accuracy of our system
when performing a prediction on every $2^n$ frames, where
$n \in \{0, \dots, 7\}$. We calculate the average frame-rate based upon
the run-time analysis discussed in Section IV-F. As shown
in Figure 4, the accuracy is highest (52.5%) when every
frame is processed by the network, however this leads to
a significant drop in frame-rate to 8.2Hz. Processing every
10th frame results in a slightly reduced accuracy (49-51%),
but over three times the frame-rate, at 25.3Hz. This is the
approach taken in all of our subsequent evaluations.
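These frame-rates follow directly from the per-component timings reported in Section IV-F, assuming the CNN forward pass and Bayesian update costs are simply amortised over the skipped frames and ignoring the infrequent CRF updates:

```python
# Average frame-rate from the Section IV-F timings (all in milliseconds).
SLAM_MS, TABLE_MS, CNN_MS, FUSE_MS = 29.3, 1.0, 51.2, 41.1

def avg_frame_rate_hz(cnn_every):
    per_frame_ms = SLAM_MS + TABLE_MS + (CNN_MS + FUSE_MS) / cnn_every
    return 1000.0 / per_frame_ms

print(avg_frame_rate_hz(1))    # every frame      -> ~8.2 Hz
print(avg_frame_rate_hz(10))   # every 10th frame -> ~25.3 Hz
```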
We also evaluated the effect of varying the number of
frames between CRF updates (Figure 5). We found that when
applied too frequently, the CRF can ‘drown out’ predictions
of the CNN, resulting in a significant reduction in accuracy.
Performing an update every 500 frames results in a slight
improvement, and so we use that as the default update rate
in all subsequent experiments.
D. Accuracy Evaluation
We evaluate the accuracy of our SemanticFusion pipeline
against the accuracy achieved by a single frame CNN seg-
mentation. The results of this evaluation are summarised in
Table I. We observe that in all cases semantically fusing
additional viewpoints improved the accuracy of the segmen-
tation over a single frame system. Performance improved
from 43.6% for a single frame to 48.3% when projecting the
predictions from the 3D SemanticFusion map.
We also evaluate our system on the office dataset when
using predictions from the state-of-the-art CNN developed
by Eigen et al. based on the VGG architecture (we use the
publicly available network weights and implementation from
http://www.cs.nyu.edu/~deigen/dnl/).

Fig. 4: The class average accuracy of our RGB-CNN on the
office reconstruction dataset against the number of frames
skipped between fusing semantic predictions. We perform
this evaluation without CRF smoothing. The right hand axis
shows the estimated run-time performance in terms of FPS.
Fig. 5: The average class accuracy processing every 10th
frame with a CNN, with a variable number of frames
between CRF updates. If applied too frequently the CRF
was detrimental to performance, and the performance im-
provement from the CRF was not significant for this CNN.

To maintain
consistency with the rest of the system, we perform only a
single forward pass of the network to calculate the output
probabilities. The network requires ground truth normal
information, and so to ensure the input pipeline is the
same as in Eigen et al. [2], we preprocess the sequence
with the MATLAB script linked to in the project page to
produce the ground truth normals. With this setup we see an
improvement of 2.9% over the single frame implementation
with SemanticFusion, from 57.1% to 60.0%.
The performance benefit of the CRF was less clear. It
provided a very small improvement of 0.5% for the Eigen
network, but a slight detriment to the RGBD-CNN of 0.2%.
E. NYU Dataset
We choose to validate our approach on the NYUv2
dataset [20], as it is one of the few datasets which provides
all of the information required to evaluate semantic RGB-D
reconstruction. The SUN RGB-D [22], although an order of
magnitude larger than NYUv2 in terms of labelled images,
does not provide the raw RGB-D videos and therefore
could not be used in our evaluation.
The NYUv2 dataset itself is still not ideally suited to
the role. Many of the 206 test set video sequences exhibit
significant drops in frame-rate and thus prove unsuitable for
tracking and reconstruction. In our evaluations we excluded
any sequence which experienced a frame-rate under 2Hz.
The remaining 140 test sequences result in 360 labelled test
images of the original 654 image test set in NYUv2. The
results of our evaluation are presented in Table II and some
qualitative results are shown in Figure 6.
Overall, fusing semantic predictions resulted in a notable
improvement over single frame predictions. However, the
total relative gain of 2.3% for the RGBD-CNN was approx-
imately half of the 4.7% improvement witnessed in the office
reconstruction dataset. We believe this is largely a result
of the capture style of the NYUv2 sequences. The primarily
rotational scanning pattern often used in test trajectories does
not provide as many useful different viewpoints from which
to fuse independent predictions. Despite this, there is still
a significant accuracy improvement over the single frame
predictions.
We also improved upon the state-of-the-art Eigen et al. [2]
CNN, with the class average accuracy going from 59.9% to
63.2% (+3.3%). This result clearly shows, even on this chal-
lenging dataset, the capacity of SemanticFusion to not only
provide a useful semantically annotated 3D map, but also
to improve the predictions of state-of-the-art 2D semantic
segmentation systems.
The improvement as a result of the CRF was not par-
ticularly significant, but positive for both CNNs. Eigen’s
CNN saw +0.4% improvement, and the RGBD-CNN saw
+0.3%. This could possibly be improved with proper tuning
of edge potential weights and unit standard deviations, and
the potential exists to explore many other kinds of map-based
semantic regularisation schemes. We leave these explorations
to future work.
F. Run-time Performance
We benchmark the performance of our system on a random
sample of 30 sequences from the NYUv2 test set. All tests
were performed on an Intel Core i7-5820K 3.30GHz CPU
and an NVIDIA Titan Black GPU. Our SLAM system
requires 29.3ms on average to process each frame and update
the map. For every frame we also update our stored surfel
probability table to account for any surfels removed by the
SLAM system. This process requires an additional 1.0ms.
As discussed above, the other components in our system do
not need to be applied for every frame. A forward pass of
our CNN requires 51.2ms and our Bayesian update scheme
requires a further 41.1ms. Our standard scheme performs
