IEEE ROBOTICS AND AUTOMATION LETTERS. PREPRINT VERSION. ACCEPTED NOVEMBER, 2016
Learning Depth-aware Deep Representations for
Robotic Perception
Lorenzo Porzi¹,², Samuel Rota Bulò², Adrian Penate-Sanchez³, Elisa Ricci¹,², Francesc Moreno-Noguer⁴
Abstract—Exploiting RGB-D data by means of Convolutional
Neural Networks (CNNs) is at the core of a number of robotics
applications, including object detection, scene semantic segmen-
tation and grasping. Most existing approaches, however, exploit
RGB-D data by simply considering depth as an additional
input channel for the network. In this paper we show that
the performance of deep architectures can be boosted by in-
troducing DaConv, a novel, general-purpose CNN block which
exploits depth to learn scale-aware feature representations. We
demonstrate the benefits of DaConv on a variety of robotics
oriented tasks, involving affordance detection, object coordinate
regression and contour detection in RGB-D images. In each
of these experiments we show the potential of the proposed
block and how it can be readily integrated into existing CNN
architectures.
Index Terms—RGB-D Perception; Visual Learning
I. INTRODUCTION
Since the introduction of Microsoft Kinect, RGB-D data
has been used in robotics and computer vision to address
a large variety of tasks, including visual odometry, 3D object
pose estimation, people tracking and activity recognition. The
success of depth sensors can be partially ascribed to the fact
that they provide a low-cost solution to a fundamental problem
in robotics, i.e. the recovery of scale.
In the last few years, deep learning techniques have attracted
the attention of robotics researchers, as they generally guar-
antee improved performance over traditional learning-based
approaches in a wide range of applications and heterogeneous
types of data (e.g. images, audio, text). Deep models have
been applied to a number of robotics tasks involving RGB
inputs, e.g. monocular depth prediction [1], 3D scene layout
understanding [2], change detection in large 3D maps [3] and
camera relocalization [4], [5], [6].
The popularity of Convolutional Neural Networks (CNNs)
has also encouraged researchers to investigate the adoption of
deep models for dealing with RGB-D inputs.
Manuscript received: September 9, 2016; Accepted: November 17, 2016.
This paper was recommended for publication by Editor Jana Kosecka upon
evaluation of the Associate Editor and Reviewers’ comments. This work is
partly funded by the Spanish MINECO project RobInstruct TIN2014-58178-R,
by the ERA-Net Chistera project I-DRESS PCIN-2015-147, by the EU project
AEROARMS H2020-ICT-2014-1-644271 and by the EU project SECOND
HANDS H2020-ICT-2014-1-643950.
¹ Lorenzo Porzi and Elisa Ricci are with University of Perugia, Italy
² Lorenzo Porzi, Samuel Rota Bulò and Elisa Ricci are with Fondazione Bruno Kessler, Trento, Italy
³ Adrian Penate-Sanchez is with University College London
⁴ Francesc Moreno-Noguer is with Institut de Robòtica i Informàtica Industrial (UPC-CSIC), Barcelona, Spain
Digital Object Identifier (DOI): see top of this page.
Fig. 1. Illustration of the intuition motivating our Depth-aware Convolution block. Two identical objects lying at different distances d_1, d_2 from the viewer appear to have different sizes on the image plane. It would be desirable, however, for them to activate the same convolutional neurons in a network. This can be achieved by locally tying the scale of the convolutional kernels to the measured depth.
In particular, the idea of considering CNNs to learn features describing both RGB and depth data has proved beneficial in object detection [7] and recognition [8], [9], semantic segmentation
[10] and grasping [11]. Most previous works, however, exploit
RGB-D data by considering depth or hand-crafted descriptors
derived from depth (i.e. surface normals, HHA features [7]) as
additional input channels for task-specific CNN architectures.
In this way, the scale information provided by depth sensors
is not explicitly used within the network model.
In this paper we depart from previous works and we demon-
strate that depth information can be directly used to derive
more powerful CNN-based feature representations. Specifi-
cally, we introduce DaConv (Depth-aware Convolution), a
novel, general-purpose block for CNN architectures which
performs convolutions at multiple scales and combines the
outputs using a learnable depth-dependent function. Intuitively,
DaConv allows the network to learn convolutional activations
that optimally adapt their receptive fields to the local scale of
their input (see Figure 1).
In the experimental section we thoroughly demonstrate the
benefits of DaConv in a number of robotics applications. In
particular, we consider the tasks of affordance detection [12],
3D object coordinate regression [13], [14] and contour de-
tection [15], [16]. In all three tasks we conduct experiments
on publicly available benchmarks and show that our DaConv
block can be used to systematically improve the performance

of state-of-the-art CNN architectures. Especially remarkable are
the improvements we obtain over state-of-the-art methods in
the RGB-D Part Affordance dataset [12].
In short, the main contributions of this paper are twofold.
First, we introduce DaConv, a novel depth-aware CNN compu-
tational block which uses depth information within the network
to drive scale selection, departing from existing approaches
that tackle invariance to generic transformations [17], [18] or
scale [10], [19]. And second, we demonstrate that DaConv is
a general-purpose block, which can be embedded in different
CNN architectures, improving their performance. The code
implementing DaConv will be made publicly available.
II. RELATED WORK
CNNs for RGB-D data. Due to the impressive results
achieved in tasks such as object recognition and detection, in
the last few years CNNs have established themselves as the dominant learning paradigm in computer vision and robotic perception.
Deep architectures have also been used to tackle challenging
problems involving RGB-D data. For instance, Wu et al.
adopted CNNs to address the problem of depth-based object
recognition. Similarly, the tasks of multi-view recognition and
next-best-view prediction are tackled in [20]. Here, Johns et
al. developed a framework which considers a CNN model to
classify image pairs and then optimally combines the obtained
classification scores. Gupta et al. [21] used CNNs to learn how
to align synthetic 3D models to real instances of the same
object in RGB-D scenes, obtaining a significant improvement
over previous works not considering deep models. Similarly, in
the context of feature learning for RGB-D object recognition,
Wang et al. [22] demonstrated that a CNN-based approach
is advantageous over traditional learning-based techniques.
In [23] a CNN is trained to map image patches to a descriptor
space where pose estimation and object recognition are solved
using a simple nearest-neighbour technique. Deep models are
considered in [11] to tackle the problem of grasping: CNNs
are used to learn a grasp function and to compute a grasp
quality score over all possible grasp poses, given a predefined
discretization.
Learning Invariant Feature Representations with CNNs.
A recent line of research on CNNs has addressed the prob-
lem of devising specific solutions for obtaining invariance
to different kinds of transformations. Examples include the
works of Bruna et al. [18], Gens et al. [24] and Laptev et
al. [25], who sought invariance to translation and rotation, pose
and part deformation, or generic transformations, respectively.
Many works have focused on achieving invariance to scale
changes. Common approaches include multi-scale pooling [26]
and combining activations obtained from scaled versions of
the input, either by simple concatenation [19] or by linear
combination [27]. The method of Chen et al. [27] in par-
ticular has some similarities to ours: feature maps computed
at different scales are linearly combined using an attention-
like mechanism [28]. Differently from our approach, however,
they do not exploit depth information as a prior and only
apply their scale-aware mechanism at a single level in the
network. Recently, in [29] a multi-scale convolutional network
architecture is proposed to jointly perform depth prediction,
surface normal estimation, and semantic labeling. An overview
of approaches modeling scale changes within deep networks is
provided in [16]. Differently from all these previous works, our
approach uses depth information to drive the scale selection of
the convolutional filters.
Specific efforts to define a common framework for CNN
architectures focusing on learning invariant representations have been made in [17] and [30]. The former work presented the
Spatial Transformer layer, which automatically learns a spatial
transformation of its input. The work in [30] introduced an
adaptation method to compute convolutional kernels. Similar
to our DaConv, a local adaptation strategy is considered,
motivated by the fact that different image regions may demand
different adaptation functions. However, their tree-structured
kernel adaptive CNN greatly differs from our DaConv block:
since the focus of [30] is on facial traits recognition, kernels
are dynamically updated according to the spatial distribution
of facial landmarks rather than depth.
Traditionally, depth information has been used to achieve
invariance to scale changes, e.g. in conjunction with random
forests [31], [13]. In these methods, depth is used to determine
the scale at which the binary features of a decision forest are
calculated. More recently, some attempts have been made to
derive deep models robust to scale using depth information:
in [10], a global depth-dependent scaling is applied to the input
of a CNN to solve a semantic segmentation task. However,
in [10] the mapping between depth and scale is predefined,
while in our approach the network learns how to use depth to
locally handle scale at the convolutional filter level.
III. LEARNING DEPTH-AWARE CONVOLUTIONS
In this section we describe the key component of our
contribution, namely a computational block for CNNs called
DaConv, which can be regarded as a convolutional layer
endowed with the ability to adapt the scale of the filter kernels
based on depth information. One issue that we face is the
impossibility of knowing a priori which pixels, and therefore
which depths, contribute to activations within the DaConv
block,¹ while we need this information to drive the scale
selection. To sidestep this problem, we introduce an additional
network (DepthNet) fed with depth information, working in
parallel with the main network (PredictionNet). The role of
DepthNet is to provide the DaConv blocks in PredictionNet
with depth-related features that will trigger the decision about
which scale to choose within each block. PredictionNet, in-
stead, is devoted to delivering the final prediction. We call
DaConvNet the entire architecture, which includes DepthNet
and PredictionNet. In the remainder of this section we provide
some more details about the proposed architecture. Additional
details about the DaConvNet’s inputs, outputs and training
procedure are postponed to the experimental section.
Convolutions with scaled kernels. Within the DaConv block,
we simulate convolutions with filter kernels at different scales
via so-called dilated (a.k.a. atrous) convolutions [32].
¹ We only have coarse information about theoretical receptive fields.

Fig. 2. Schematic representation of a DaConv block. Light blue block: computation of the scale selection factors a_j; orange block: convolution at different scales; yellow block: linear combination.
An ℓ-dilated convolution is a standard convolution with a dilated version of the filter, which is obtained by adding ℓ - 1 zeros between adjacent filter elements. More precisely, let x, ω : ℤ² → ℝ be a discrete function and a discrete filter kernel, respectively (we consider the 2D case for the sake of simplicity). The ℓ-dilated convolution of x and ω is given by

$$(x \star_\ell \omega)(r) = \sum_{t \in \mathbb{Z}^2} x(r - \ell t)\,\omega(t), \qquad (1)$$

where r ∈ ℤ² and ℓ ∈ ℕ_{>0} is the dilation factor. One recovers the standard convolution by taking ℓ = 1, i.e. ⋆ = ⋆_1.
Before applying the ℓ-dilated convolution, we smooth the input signal x by convolving it with a smoothing kernel σ_ℓ in order to propagate local information, which would otherwise be lost due to the dilation operation. We implement σ_ℓ as a binomial kernel with window size 2ℓ + 1, which ensures stronger smoothing effects at higher scales (or, equivalently, with larger dilation factors).
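As a concrete illustration of this smoothing-plus-dilation operation (Eq. (1)), the following minimal PyTorch sketch applies a depthwise binomial blur of window size 2ℓ + 1 followed by an ℓ-dilated convolution. The original implementation is in Caffe; the function names and the same-size padding choice here are our own assumptions.

```python
import torch
import torch.nn.functional as F

def binomial_kernel(ell, channels):
    """Depthwise binomial smoothing kernel sigma_ell with window size 2*ell + 1."""
    row = torch.tensor([1.0])
    for _ in range(2 * ell):                       # repeated [1, 1] convolutions
        row = F.conv1d(row.view(1, 1, -1), torch.ones(1, 1, 2), padding=1).view(-1)
    row = row / row.sum()                          # normalize to sum to one
    k2d = torch.outer(row, row)                    # separable 2D binomial kernel
    return k2d.expand(channels, 1, *k2d.shape).clone()

def smoothed_dilated_conv(x, weight, ell):
    """Binomial smoothing followed by an ell-dilated convolution, cf. Eq. (1).
    x: (N, C_in, H, W); weight: (C_out, C_in, k, k) with odd k."""
    c_in = x.shape[1]
    smooth = binomial_kernel(ell, c_in).to(x)
    x = F.conv2d(x, smooth, padding=ell, groups=c_in)        # x convolved with sigma_ell
    pad = ell * (weight.shape[-1] // 2)                      # keep the spatial size
    return F.conv2d(x, weight, padding=pad, dilation=ell)    # ell-dilated convolution
```

For ℓ = 1 this reduces to a lightly blurred standard convolution; larger dilation factors enlarge the receptive field without adding parameters, as noted in Fig. 3.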
DaConv block.² This block extends standard convolutional
layers with a data-driven selection of the filters’ scale. It
is fed with an input x computed by the previous layers of
PredictionNet, and a depth-dependent input z from DepthNet
(see Figure 2). Both x and z share the same spatial resolution,
while they might have a different number of feature channels.
Like a standard convolutional layer, DaConv is parametrized by m filter kernels {ω_1, . . . , ω_m} (we omit the corresponding bias parameters), but in addition it also has d filter kernels {ν_1, . . . , ν_d} with spatial resolution 1 × 1 that will be involved in the scale selection. Indeed, each filter ν_j is associated with a pre-fixed dilation factor ℓ_j. The output dimensionality of the block is the same one would expect from a standard convolution with the filter bank {ω_1, . . . , ω_m}.
We let the scale selection vary across different spatial locations. To do this, the input z from DepthNet is convolved with each filter ν_j and the output is batch-normalized [33] before entering a softmax layer acting along the feature dimension. This operation preserves the input spatial resolution and yields a probability vector for each spatial location, indicating the scale selection distribution. The probability that dilation factor ℓ_j is chosen for spatial location u is denoted by a_j(u) (see Figure 2, top). The use of batch normalization before the softmax operation ensures that the scale selection will not be biased towards a fixed one across the entire dataset.
² We use the term "block" instead of "layer" because it can be built by composing standard layers found in recent deep network frameworks.
Fig. 3. In ℓ-dilated convolution the elements of a convolutional kernel are interspersed with ℓ - 1 zeroes, thus increasing the receptive field without adding extra parameters.
Finally, the scale selection probabilities encoded in {a_1, . . . , a_d} are used by DaConv to linearly combine the outputs of convolutions of x with the filter kernels ω_i undergoing different dilation factors. In formal terms, the output of the DaConv block is given by

$$y_i = \sum_{j=1}^{d} a_j \odot \big(x \star \sigma_{\ell_j} \star_{\ell_j} \omega_i\big), \qquad (2)$$

where ⊙ denotes the Hadamard (a.k.a. elementwise) product, i ∈ {1, . . . , m} indexes one of the filter kernels ω_i, and between parentheses we find the smoothing operation and ℓ_j-dilated convolution previously described.
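Putting the pieces together, a DaConv block could be sketched as follows in PyTorch (the paper's implementation is in Caffe; the class and argument names are ours, and smoothed_dilated_conv is the helper from the previous sketch). It computes the per-pixel scale-selection weights a_j from the DepthNet features z via 1 × 1 convolutions, batch normalization and a channel-wise softmax, and then combines the smoothed, dilated convolutions of x as in Eq. (2).

```python
import torch
import torch.nn as nn

class DaConv(nn.Module):
    """Depth-aware convolution block (illustrative sketch of Fig. 2 / Eq. (2))."""

    def __init__(self, in_ch, depth_ch, out_ch, ksize=3, dilations=(1, 2, 4)):
        super().__init__()
        self.dilations = dilations
        # shared filter bank {w_1, ..., w_m}; biases omitted as in the text
        self.conv = nn.Conv2d(in_ch, out_ch, ksize, bias=False)
        # 1x1 filters {nu_1, ..., nu_d} followed by batch normalization
        self.scale_sel = nn.Sequential(
            nn.Conv2d(depth_ch, len(dilations), kernel_size=1),
            nn.BatchNorm2d(len(dilations)))

    def forward(self, x, z):
        # per-pixel scale-selection probabilities a_j(u): softmax over the d scales
        a = torch.softmax(self.scale_sel(z), dim=1)
        y = 0.0
        for j, ell in enumerate(self.dilations):
            # smoothed, ell-dilated convolution of x with the shared kernels
            y_j = smoothed_dilated_conv(x, self.conv.weight, ell)
            y = y + a[:, j:j + 1] * y_j     # Hadamard product with a_j, summed over j
        return y
```

The inputs x (from PredictionNet) and z (from DepthNet) must share the same spatial resolution, as required above; the default dilations (1, 2, 4) correspond to the ℓ_j = 2^(j-1) used in the experiments.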
DepthNet. This network provides the depth-specific feature
representations z that drive the selection of the dilation factors
within each DaConv block in PredictionNet. It is designed in
a way to ensure that the input z provided to each DaConv
block has a spatial resolution that matches the one of the
input x to the same DaConv block. Details about the actual
topology of DepthNet in the different application scenarios are
provided in the experimental section. Similarly, we postpone
implementation details about PredictionNet.
IV. DEPTH-AWARE CNN ARCHITECTURES FOR ROBOTIC
PERCEPTION TASKS
In this section we describe and evaluate the proposed depth-
aware approach in three different tasks, which involve RGB-D
data and are of interest for the robotics community, namely part
affordance detection, object coordinates regression and contour
detection. Each task requires pixel-level prediction models.
Therefore, for each application, we consider as baseline a
fully-convolutional network, as it currently represents the most
common architectural choice for pixel-wise classification tasks.
In order to demonstrate the advantages of our proposal, we sys-
tematically compare each baseline network with an associated
DaConv network (“-DA” suffix in the tables and figures).
Each DaConv network is constructed by replacing some of
the convolutional layers of the corresponding baseline network with DaConv blocks, obtaining the PredictionNet, and pairing it with a similarly-structured DepthNet.

TABLE I
NETWORK ARCHITECTURES

Sec. IV-A
  Baseline:      c[3,16]; c[3,16]; p[2,2] | c[3,32]; c[3,32]; p[2,2] | c[3,64]; c[3,64]; p[2,1] | c[3,64]; c[1,n]
  DepthNet:      c[3,8]; c[3,8]; p[2,2] | c[3,16]; c[3,16]; p[2,2] | c[3,32]; c[3,32]
  PredictionNet: c[3,16]; dac[3,16]; p[2,2] | c[3,32]; dac[3,32]; p[2,2] | c[3,64]; dac[3,64]; p[2,1] | c[3,64]; c[1,n]

Sec. IV-B
  Baseline:      c[7,64]; p[2,2] | c[7,64]; p[2,2] | c[7,64]; p[2,2] | c[7,64]; p[2,2] |
                 u[2,2]; d[7,64] | u[2,2]; d[7,64] | u[2,2]; d[7,64] | u[2,2]; d[7,64] | c[1,7]
  DepthNet:      c[7,8]; p[2,2] | c[7,8]; p[2,2] | c[7,8]; p[2,2] | c[7,8]
  PredictionNet: dac[7,64]; p[2,2] | dac[7,64]; p[2,2] | dac[7,64]; p[2,2] | dac[7,64]; p[2,2] |
                 u[2,2]; d[7,64] | u[2,2]; d[7,64] | u[2,2]; d[7,64] | u[2,2]; d[7,64] | c[1,7]

Sec. IV-C
  Baseline:      c[3,64]; c[3,64]; p[2,2] | c[3,128]; c[3,128]; p[2,2] | c[3,256]; c[3,256]; c[3,256]; p[2,2] |
                 c[3,512]; c[3,512]; c[3,512]; p[2,2] | c[3,512]; c[3,512]; c[3,512]
  DepthNet:      c[3,8]; c[3,8]; p[2,2] | c[3,16]; c[3,16]; p[2,2] | c[3,32]; c[3,32]; c[3,32]
  PredictionNet: c[3,64]; dac[3,64]; p[2,2] | c[3,128]; dac[3,128]; p[2,2] | c[3,256]; c[3,256]; dac[3,256]; p[2,2] |
                 c[3,512]; c[3,512]; c[3,512]; p[2,2] | c[3,512]; c[3,512]; c[3,512]

Notation: Convolution: c[size, filters]; Deconvolution: d[size, filters]; DaConv: dac[size, filters]; Pooling: p[size, stride]; Unpooling: u[size, stride]
Additional details about the networks' architectures are provided in Table I, as well as in Sections IV-A, IV-B and IV-C.
Unless otherwise stated, we train all the networks using the Adam stochastic gradient descent method with a weight-decay factor of 5 × 10⁻⁵ and parameters β_1 = 0.9, β_2 = 0.999, and we use DaConv blocks with d = 3 dilation factors, namely ℓ_j = 2^(j-1) for j ∈ {1, 2, 3}.³ In the following sections we use the notation f_i : ℤ² → ℝ to indicate the output of the i-th channel of any network under consideration. All our networks are implemented using the Caffe⁴ framework and trained on a single Nvidia K40 GPU.
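As a minimal sketch (assuming a PyTorch reimplementation; the original code is in Caffe, and weight_decay here is plain L2 regularization, which may differ in minor details), the optimizer configuration described above corresponds to:

```python
import torch

# Placeholder module standing in for any of the baselines / DaConvNets described below.
net = torch.nn.Conv2d(6, 16, kernel_size=3)

optimizer = torch.optim.Adam(
    net.parameters(),
    lr=1e-2,              # starting learning rate of the Sec. IV-A schedule (later 1e-3)
    betas=(0.9, 0.999),   # beta_1, beta_2 as stated above
    weight_decay=5e-5)    # weight-decay factor
```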
A. Object coordinates regression
Several recent methods [13], [14] to estimate the pose
of known 3D objects from a single RGB-D image share a
common two-step pipeline: (i) for each pixel in the image,
predict its 3D coordinates in the object’s frame of reference;
(ii) geometrically estimate the object’s pose from the corre-
spondences between the predicted object coordinates and the
observed depth. The first step results in a pixel-wise, vector-
valued regression that is, in general, quite hard to solve. To
simplify learning, previous works [13], [14] resort to a quantization of the object coordinates space, turning the problem
into a classification one. In these works, the classification is
performed using a random forest, while we show how the
accuracy of step (i) can be improved by employing a DaConv
network to perform the pixel-wise classification.
Dataset and experimental protocol. As in [13], [14], we con-
sider the dataset of Hinterstoisser et al. [34], which comprises
15 sequences of RGB-D images of several objects lying on a
cluttered table. For each sequence, we are given the 6-DOF
pose of a specific object relative to the camera and a 3D mesh
of the object. As in [13], we partition each dimension of an
object’s coordinates space into 5 uniform intervals, obtaining
5 × 5 × 5 = 125 spatial bins in total. By doing so, we can
rephrase the regression task into a classification task.
³ In our experiments we found d = 3 to be a good compromise between classification accuracy and computational complexity.
⁴ http://caffe.berkeleyvision.org/
The number of actual classes can be reduced, since only k (out
of 125) spatial bins will contain at least one point from the
object’s surface, thus being a relevant coordinate for the pose
estimation. Given the depth and camera pose information, we
assign to each of the sequence’s pixels a label in {1, . . . , k}
if it back-projects to one of the k relevant bins, or k + 1 if it
belongs to the background.
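A minimal NumPy sketch of this labeling step is given below. It assumes per-pixel object-frame coordinates and an object mask are already available from the depth and the ground-truth pose; in the paper the k relevant bins are determined from the object's surface, whereas here, for brevity, they are taken from the pixels of the given frame, and labels are 0-based.

```python
import numpy as np

def make_bin_labels(obj_coords, mask, bounds, n_bins=5):
    """Quantize object-frame coordinates into n_bins^3 spatial bins and assign
    each pixel a class in {0, ..., k-1} (relevant bins) or k (background).

    obj_coords: (H, W, 3) per-pixel 3D coordinates in the object's frame.
    mask:       (H, W) boolean, True where the pixel back-projects onto the object.
    bounds:     (min_xyz, max_xyz) extent of the object's 3D mesh.
    """
    lo, hi = np.asarray(bounds[0], float), np.asarray(bounds[1], float)
    # 5 uniform intervals per dimension -> bin index in [0, n_bins - 1]
    idx = np.clip(((obj_coords - lo) / (hi - lo) * n_bins).astype(int), 0, n_bins - 1)
    flat = idx[..., 0] * n_bins**2 + idx[..., 1] * n_bins + idx[..., 2]   # 0 .. 124

    occupied = np.unique(flat[mask])                  # the k relevant bins
    remap = np.full(n_bins**3, -1, dtype=int)
    remap[occupied] = np.arange(len(occupied))        # relevant bin -> {0, ..., k-1}

    labels = np.full(mask.shape, len(occupied), dtype=int)   # background class k
    labels[mask] = remap[flat[mask]]
    return labels, len(occupied)
```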
In our experiments we randomly split each sequence into
train, validation and test sets comprising, respectively, 30%,
10% and 60% of the images. All our results are obtained by
training a different classifier on a train set, selecting parameters
on the corresponding validation set and evaluating on the test
set. Experimental results are reported in terms of the average
per-class accuracy.
Network architecture and training. For this application we
adopt a fully-convolutional network architecture reminiscent
of the VGG net of Simonyan et al. [35]. Compared to [35],
we drastically reduce the number of convolutional filters and
exclude the fully-connected part of the network, as we want
to obtain pixel-wise predictions. Furthermore, we feed the
network with 6-channel tensors obtained by stacking the RGB
image with the 3-channel surface normals computed from the
depth. The architecture, summarized in Table I, is composed
of four main blocks of 3 × 3 convolutions of stride 1, followed
by 2 × 2 max pooling. The first two max pooling layers
have stride 2, while the third one has stride 1, resulting in
a final downsampling factor of 4. Because of this, the network
outputs pixel-wise predictions at one fourth of the original
resolution. At test time we up-sample the predictions using
nearest-neighbor interpolation. As for DaConvNet, we replace
the second convolutional layer within each of the first three
blocks with a DaConv block.
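To make the Table I notation concrete, the Sec. IV-A baseline could be written as follows in PyTorch (a sketch under our own assumptions: ReLU non-linearities and same-size padding are not specified in the table, and n is the number of output classes k + 1):

```python
import torch.nn as nn

def conv(cin, cout, k):
    """c[k, cout]: k x k convolution, stride 1 (ReLU and same-size padding assumed)."""
    return nn.Sequential(nn.Conv2d(cin, cout, k, padding=k // 2), nn.ReLU(inplace=True))

def baseline_sec4a(n_classes, in_ch=6):   # input: RGB stacked with surface normals
    return nn.Sequential(
        conv(in_ch, 16, 3), conv(16, 16, 3), nn.MaxPool2d(2, stride=2),   # p[2, 2]
        conv(16, 32, 3),    conv(32, 32, 3), nn.MaxPool2d(2, stride=2),   # p[2, 2]
        conv(32, 64, 3),    conv(64, 64, 3), nn.MaxPool2d(2, stride=1),   # p[2, 1]
        conv(64, 64, 3),    nn.Conv2d(64, n_classes, 1))                  # c[1, n]
```

The DaConv variant replaces the second convolution of each of the first three blocks with a DaConv block fed with the matching-resolution DepthNet features, as described above.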
The objective we use for training consists of a per-pixel softmax log-loss. The contribution of each pixel is suitably weighted in order to compensate for the highly imbalanced class distribution in the dataset. In formal terms, we address the following optimization problem:

$$\arg\min \; -\sum_{r} \xi_{l(r)} \, \log \frac{\exp(f_{l(r)}(r))}{\sum_{i=1}^{k+1} \exp(f_i(r))}, \qquad (3)$$

where l(r) ∈ {1, . . . , k + 1} is the ground-truth label at spatial location r ∈ ℤ², and the minimization is implicitly taken with respect to the network parameters. The class-rebalancing weights ξ_i are defined as in [29], for all 1 ≤ i ≤ k + 1:

$$\xi_i = \frac{\mathrm{median}_i(\xi^0_i)}{\xi^0_i}, \qquad (4)$$

$$\xi^0_i = \frac{\#\,\text{pixels of class } i}{\#\,\text{pixels in images containing } i}. \qquad (5)$$
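A short sketch of the rebalancing weights of Eqs. (4)-(5) and the weighted loss of Eq. (3), again in PyTorch/NumPy with hypothetical function names (it assumes every class appears at least once in the training label maps):

```python
import numpy as np
import torch
import torch.nn.functional as F

def class_weights(label_maps, n_classes):
    """Median-frequency rebalancing weights xi_i of Eqs. (4)-(5)."""
    class_px = np.zeros(n_classes)   # number of pixels of class i over the training set
    image_px = np.zeros(n_classes)   # number of pixels in images containing class i
    for lbl in label_maps:           # lbl: (H, W) integer label map
        for c in np.unique(lbl):
            class_px[c] += np.sum(lbl == c)
            image_px[c] += lbl.size
    freq = class_px / image_px                                          # xi^0_i
    return torch.tensor(np.median(freq) / freq, dtype=torch.float32)   # xi_i

def weighted_pixel_loss(logits, labels, weights):
    """Weighted per-pixel softmax log-loss of Eq. (3).
    logits: (N, k+1, H, W); labels: (N, H, W) with values in {0, ..., k}."""
    return F.cross_entropy(logits, labels, weight=weights, reduction='sum')
```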
As mentioned already at the beginning of this section, we optimize (3) via stochastic gradient descent. Both the baseline and DaConvNet are trained with the following schedule: 50 epochs with learning rate 10⁻², followed by 25 epochs with learning rate 10⁻³, with a batch size of 64. As is common practice when considering small datasets [16], we perform data augmentation during training. In particular, we form training batches by sampling randomly rotated, scaled and translated 128 × 128 pixel patches from the training images. At test time we apply the learned network to full-resolution images.
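The data augmentation step can be sketched as follows, jointly warping the input tensor (bilinear interpolation) and the label map (nearest neighbor); the rotation, scale and translation ranges are illustrative guesses, since the text only states that patches are randomly rotated, scaled and translated:

```python
import math
import torch
import torch.nn.functional as F

def random_patch(image, labels, patch=128, scale_range=(0.8, 1.2), max_shift=0.25):
    """Sample one randomly rotated, scaled and translated patch.
    image: (C, H, W) float tensor (RGB + normals); labels: (H, W) integer tensor."""
    angle = (torch.rand(1).item() * 2 - 1) * math.pi
    scale = scale_range[0] + torch.rand(1).item() * (scale_range[1] - scale_range[0])
    tx, ty = ((torch.rand(2) * 2 - 1) * max_shift).tolist()

    cos, sin = math.cos(angle) / scale, math.sin(angle) / scale
    theta = torch.tensor([[cos, -sin, tx],
                          [sin,  cos, ty]], dtype=torch.float32).unsqueeze(0)
    grid = F.affine_grid(theta, size=(1, 1, patch, patch), align_corners=False)

    img = F.grid_sample(image.unsqueeze(0), grid, mode='bilinear',
                        align_corners=False).squeeze(0)
    lbl = F.grid_sample(labels[None, None].float(), grid, mode='nearest',
                        align_corners=False).squeeze().long()
    return img, lbl
```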
Results. Figure 4 reports the results of our experimental
evaluation, comparing the proposed DaConv architecture with
the baseline, fully-convolutional network. The proposed net-
work (CNN-DA) outperforms the baseline CNN using standard
convolution layers for all different objects. On average, the use
of DaConv blocks improves classification accuracy by 10%. As
a reference, we also report the results obtained considering a
random forest classifier, as this represents the common choice
for coordinate regression tasks [13], [14]. In particular, we use
the implementation in Piotr Dollár's toolbox⁵, training a forest
with 10 trees observing the same RGB plus surface normal
inputs as the CNN. Note that this implementation differs from
the one in [13], as the code for that is not publicly available.
As one can see, deep architectures outperform off-the-shelf
random forests, which is not so surprising, as CNNs currently
achieve state-of-the-art results in many tasks in robotic percep-
tion. Figure 5 shows an example of the a_j functions learned by the first DaConv block of CNN-DA for the Ape object. Interestingly, the functions mostly follow the scene's depth, with gradually higher weights being assigned to the scale j = 3 for closer objects, and vice versa for j = 1.
This is in accordance with the intuition illustrated in Figure 1.
B. Part Affordance Detection
The problem of localizing and identifying part affordances
[12] is a fundamental task for deploying the next generation
of robotic platforms, which are supposed to effectively col-
laborate with humans in everyday workspaces. Part affordance detection requires segmenting and labeling image regions corresponding to object parts according to their interaction modality, or affordance. In other words, each affordance constitutes a class in the segmentation problem. Predicting affordances is very challenging, since objects from different categories, with different shapes and visual appearances, can have parts associated with the same affordance.
⁵ https://pdollar.github.io/toolbox/
Dataset and experimental protocol. In our experiments we
consider the RGB-D Part Affordance Dataset of Myers et
al. [12], covering 105 different tools and 7 different affor-
dances, namely “grasp”, “cut”, “scoop”, “contain”, “pound”,
“support” and “wrap-grasp”, for a total of about 30k images.
This dataset is split into two parts: (i) a Non-cluttered subset
comprising RGB-D video sequences of single tools lying on
a rotating plane; (ii) a Cluttered subset comprising 3 RGB-
D video sequences of several different tools amassed over
a table. One third of the video frames have been manually
labeled by a group of users, and the labels automatically
propagated to the remaining frames. To account for possibly conflicting labelings provided by different users, each pixel retains as ground-truth information a ranking of affordance labels, ordered from the most voted to the least voted one.
We follow the experimental protocol in [12] by directly
using the publicly available evaluation code from the authors,⁶ which considers only the manually labeled frames, both for training and testing, and gives separate results for the Non-cluttered and Cluttered subsets. Detection accuracy is evaluated in terms of three different metrics: the weighted F-measure F^w_β, the rank weighted F-measure R^w_β, and the ranked correlation score τ_k. For a detailed description of the way these metrics are
calculated, we refer the reader to [12] and the public code.
Network architecture and training. We adopt the SegNet-
Basic architecture in [36], summarized in Table I. This is a
symmetric architecture that takes RGB images as input and
is composed of four convolutional and four deconvolutional
layers with 64 filters of size 7 × 7 and stride 1. Each convolutional
layer is followed by a 2 × 2, stride 2 max-pooling layer and
each deconvolutional layer is preceded by a 2 × 2, stride 2
max-unpooling layer. Batch normalization is applied to the
output of all convolutional and deconvolutional layers. The
network output layer has 8 channels, corresponding to the 7
affordance classes with an additional background class. For
our DaConvNet, we replace each convolutional layer with a
DaConv block.
We train our networks by solving the following optimization
problem:
$$\arg\min \; L_c + \lambda L_r, \qquad (6)$$
where the minimization is intended with respect to the network parameters. The objective is composed of a classification loss term L_c and a ranking-related loss term L_r. The classification loss L_c is a weighted sum of pixel-wise log-loss terms defined similarly to (3), where the per-pixel log-loss term is computed with respect to the top-ranked class in the ground-truth ranking. The ranking loss L_r is a sum of pixel-wise loss terms, each aimed at exploiting the ranking information from the ground truth. It is defined as follows:
$$L_r = -\sum_{r} \sum_{i \neq j} p_{i,j}(r) \, \log\!\big(\sigma(f_i(r) - f_j(r))\big), \qquad (7)$$

where p_{i,j}(r) is 1 if affordance i ranks higher than j in the ground-truth ranking for pixel r, 0.5 if they have the same ranking and 0 otherwise, while σ(·) is the sigmoid function.
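As a minimal sketch of Eq. (7) (hypothetical function name; it assumes the pairwise preferences p_{i,j}(r) have already been precomputed from the ground-truth rankings):

```python
import torch
import torch.nn.functional as F

def ranking_loss(f, p):
    """Pairwise ranking loss L_r of Eq. (7) for a single image.
    f: (C, H, W) network outputs f_i(r) over the C affordance channels.
    p: (C, C, H, W) preferences p_{i,j}(r) in {0, 0.5, 1}."""
    C = f.shape[0]
    diff = f.unsqueeze(1) - f.unsqueeze(0)        # diff[i, j] = f_i(r) - f_j(r)
    pair_loss = -p * F.logsigmoid(diff)           # -p_{i,j}(r) * log(sigmoid(...))
    off_diag = ~torch.eye(C, dtype=torch.bool, device=f.device)
    return pair_loss[off_diag].sum()              # sum over i != j and over pixels r
```

The total training objective of Eq. (6) then adds this term, scaled by λ, to the classification loss L_c.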
⁶ http://www.umiacs.umd.edu/~amyers/part_affordance/
