IEEE ROBOTICS AND AUTOMATION LETTERS. PREPRINT VERSION. ACCEPTED NOVEMBER, 2016
Learning Depth-aware Deep Representations for
Robotic Perception
Lorenzo Porzi¹,², Samuel Rota Bulò², Adrian Penate-Sanchez³, Elisa Ricci¹,², Francesc Moreno-Noguer⁴
Abstract—Exploiting RGB-D data by means of Convolutional
Neural Networks (CNNs) is at the core of a number of robotics
applications, including object detection, scene semantic segmen-
tation and grasping. Most existing approaches, however, exploit
RGB-D data by simply considering depth as an additional
input channel for the network. In this paper we show that
the performance of deep architectures can be boosted by in-
troducing DaConv, a novel, general-purpose CNN block which
exploits depth to learn scale-aware feature representations. We
demonstrate the benefits of DaConv on a variety of robotics
oriented tasks, involving affordance detection, object coordinate
regression and contour detection in RGB-D images. In each
of these experiments we show the potential of the proposed
block and how it can be readily integrated into existing CNN
architectures.
Index Terms—RGB-D Perception; Visual Learning
I. INTRODUCTION
Since the introduction of Microsoft Kinect, RGB-D data
has been used in robotics and computer vision to address
a large variety of tasks, including visual odometry, 3D object
pose estimation, people tracking and activity recognition. The
success of depth sensors can be partially ascribed to the fact
that they provide a low-cost solution to a fundamental problem
in robotics, i.e. the recovery of scale.
In the last few years, deep learning techniques have attracted
the attention of robotics researchers, as they generally guar-
antee improved performance over traditional learning-based
approaches in a wide range of applications and heterogeneous
types of data (e.g. images, audio, text). Deep models have
been applied to a number of robotics tasks involving RGB
inputs, e.g. monocular depth prediction [1], 3D scene layout
understanding [2], change detection in large 3D maps [3] and
camera relocalization [4], [5], [6].
The popularity of Convolutional Neural Networks (CNNs)
has also encouraged researchers to investigate the adoption of
deep models for dealing with RGB-D inputs.
Manuscript received: September 9, 2016; Accepted: November 17, 2016.
This paper was recommended for publication by Editor Jana Kosecka upon
evaluation of the Associate Editor and Reviewers’ comments. This work is
partly funded by the Spanish MINECO project RobInstruct TIN2014-58178-R,
by the ERA-Net Chistera project I-DRESS PCIN-2015-147, by the EU project
AEROARMS H2020-ICT-2014-1-644271 and by the EU project SECOND
HANDS H2020-ICT-2014-1-643950.
¹ Lorenzo Porzi and Elisa Ricci are with University of Perugia, Italy
² Lorenzo Porzi, Samuel Rota Bulò and Elisa Ricci are with Fondazione Bruno Kessler, Trento, Italy
³ Adrian Penate-Sanchez is with University College London
⁴ Francesc Moreno-Noguer is with Institut de Robòtica i Informàtica Industrial (UPC-CSIC), Barcelona, Spain
Digital Object Identifier (DOI): see top of this page.
Fig. 1. Illustration of the intuition motivating our Depth-aware Convolution block. Two identical objects lying at different distances d_1, d_2 from the viewer appear to have different sizes on the image plane. It would be desirable, however, for them to activate the same convolutional neurons in a network. This can be achieved by locally tying the scale of the convolutional kernels to the measured depth.
In particular, the idea of considering CNNs to learn features describing both RGB and depth data has proved beneficial in object detection [7] and recognition [8], [9], semantic segmentation
[10] and grasping [11]. Most previous works, however, exploit
RGB-D data by considering depth or hand-crafted descriptors
derived from depth (i.e. surface normals, HHA features [7]) as
additional input channels for task-specific CNN architectures.
In this way, the scale information provided by depth sensors
is not explicitly used within the network model.
In this paper we depart from previous works and we demon-
strate that depth information can be directly used to derive
more powerful CNN-based feature representations. Specifi-
cally, we introduce DaConv (Depth-aware Convolution), a
novel, general-purpose block for CNN architectures which
performs convolutions at multiple scales and combines the
outputs using a learnable depth-dependent function. Intuitively,
DaConv allows the network to learn convolutional activations
that optimally adapt their receptive fields to the local scale of
their input (see Figure 1).
In the experimental section we thoroughly demonstrate the
benefits of DaConv in a number of robotics applications. In
particular, we consider the tasks of affordance detection [12],
3D object coordinate regression [13], [14] and contour de-
tection [15], [16]. In all three tasks we conduct experiments
on publicly available benchmarks and show that our DaConv
block can be used to systematically improve the performance

of state-of-the-art CNN architectures. Especially remarkable are
the improvements we obtain over state-of-the-art methods in
the RGB-D Part Affordance dataset [12].
In short, the main contributions of this paper are twofold.
First, we introduce DaConv, a novel depth-aware CNN compu-
tational block which uses depth information within the network
to drive scale selection, departing from existing approaches
that tackle invariance to generic transformations [17], [18] or
scale [10], [19]. And second, we demonstrate that DaConv is
a general-purpose block, which can be embedded in different
CNN architectures, improving their performance. The code
implementing DaConv will be made publicly available.
II. RELATED WORK
CNNs for RGB-D data. Due to the impressive results
achieved in tasks such as object recognition and detection, in
the last few years CNNs have established themselves as the dominant learning paradigm in computer vision and robotic perception.
Deep architectures have also been used to tackle challenging
problems involving RGB-D data. For instance, Wu et al.
adopted CNNs to address the problem of depth-based object
recognition. Similarly, the tasks of multi-view recognition and
next-best-view prediction are tackled in [20]. Here, Johns et
al. developed a framework which considers a CNN model to
classify image pairs and then optimally combines the obtained
classification scores. Gupta et al. [21] used CNNs to learn how
to align synthetic 3D models to real instances of the same
object in RGB-D scenes, obtaining a significant improvement
over previous works not considering deep models. Similarly, in
the context of feature learning for RGB-D object recognition,
Wang et al. [22] demonstrated that a CNN-based approach
is advantageous over traditional learning-based techniques.
In [23] a CNN is trained to map image patches to a descriptor
space where pose estimation and object recognition are solved
using a simple nearest-neighbour technique. Deep models are
considered in [11] to tackle the problem of grasping: CNNs
are used to learn a grasp function and to compute a grasp
quality score over all possible grasp poses, given a predefined
discretization.
Learning Invariant Feature Representations with CNNs.
A recent line of research on CNNs has addressed the prob-
lem of devising specific solutions for obtaining invariance
to different kinds of transformations. Examples include the
works of Bruna et al. [18], Gens et al. [24] and Laptev et
al. [25], who sought invariance to translation and rotation, pose
and part deformation, or generic transformations, respectively.
Many works have focused on achieving invariance to scale
changes. Common approaches include multi-scale pooling [26]
and combining activations obtained from scaled versions of
the input, either by simple concatenation [19] or by linear
combination [27]. The method of Chen et al. [27] in par-
ticular has some similarities to ours: feature maps computed
at different scales are linearly combined using an attention-
like mechanism [28]. Differently from our approach, however,
they do not exploit depth information as a prior and only
apply their scale-aware mechanism at a single level in the
network. Recently, in [29] a multi-scale convolutional network
architecture is proposed to jointly perform depth prediction,
surface normal estimation, and semantic labeling. An overview
of approaches modeling scale changes within deep networks is
provided in [16]. Differently from all these previous works, our
approach uses depth information to drive the scale selection of
the convolutional filters.
Specific efforts to define a common framework for CNN
architectures focusing on learning invariant representations have been made in [17] and [30]. The former work presented the
Spatial Transformer layer, which automatically learns a spatial
transformation of its input. The work in [30] introduced an
adaptation method to compute convolutional kernels. Similar
to our DaConv, a local adaptation strategy is considered,
motivated by the fact that different image regions may demand
different adaptation functions. However, their tree-structured
kernel adaptive CNN greatly differs from our DaConv block:
since the focus of [30] is on facial traits recognition, kernels
are dynamically updated according to the spatial distribution
of facial landmarks rather than depth.
Traditionally, depth information has been used to achieve
invariance to scale changes, e.g. in conjunction with random
forests [31], [13]. In these methods, depth is used to determine
the scale at which the binary features of a decision forest are
calculated. More recently, some attempts have been made to
derive deep models robust to scale using depth information:
in [10], a global depth-dependent scaling is applied to the input
of a CNN to solve a semantic segmentation task. However,
in [10] the mapping between depth and scale is predefined,
while in our approach the network learns how to use depth to
locally handle scale at the convolutional filter level.
III. LEARNING DEPTH-AWARE CONVOLUTIONS
In this section we describe the key component of our
contribution, namely a computational block for CNNs called
DaConv, which can be regarded as a convolutional layer
endowed with the ability to adapt the scale of the filter kernels
based on depth information. One issue that we face is the
impossibility of knowing a priori which pixels, and therefore
which depths, contribute to activations within the DaConv
block,¹ while we need this information to drive the scale
selection. To sidestep this problem, we introduce an additional
network (DepthNet) fed with depth information, working in
parallel with the main network (PredictionNet). The role of
DepthNet is to provide the DaConv blocks in PredictionNet
with depth-related features that will trigger the decision about
which scale to choose within each block. PredictionNet, in-
stead, is devoted to delivering the final prediction. We call
DaConvNet the entire architecture, which includes DepthNet
and PredictionNet. In the remainder of this section we provide
some more details about the proposed architecture. Additional
details about the DaConvNet’s inputs, outputs and training
procedure are postponed to the experimental section.
Convolutions with scaled kernels. Within the DaConv block,
we simulate convolutions with filter kernels at different scales
via so-called dilated (a.k.a. atrous) convolutions [32].
¹ We only have coarse information about theoretical receptive fields.

Fig. 2. Schematic representation of a DaConv block. Light blue block: computation of the scale selection factors a_j; orange block: convolution at different scales; yellow block: linear combination.
An ℓ-dilated convolution is a standard convolution with a dilated version of the filter, which is obtained by adding ℓ - 1 zeros between adjacent filter elements. More precisely, let x, ω : ℤ² → ℝ be a discrete function and a discrete filter kernel, respectively (we consider the 2D case for the sake of simplicity). The ℓ-dilated convolution of x and ω is given by

$$(x \star_\ell \omega)(r) = \sum_{t \in \mathbb{Z}^2} x(r - \ell t)\,\omega(t), \qquad (1)$$

where r ∈ ℤ² and ℓ ∈ ℕ_{>0} is the dilation factor. One recovers the standard convolution by taking ℓ = 1, i.e. ⋆ = ⋆_1.
Before applying the ℓ-dilated convolution, we smooth the input signal x by convolving it with a smoothing kernel σ_ℓ in order to propagate local information, which would otherwise be lost due to the dilation operation. We implement σ_ℓ as a binomial kernel with window size 2ℓ + 1, which ensures stronger smoothing effects at higher scales (or, equivalently, with larger dilation factors).
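As a concrete illustration of this smoothing-plus-dilation operation (Eq. (1)), the following minimal PyTorch sketch applies a depthwise binomial blur of window size 2ℓ + 1 followed by an ℓ-dilated convolution. The original implementation is in Caffe; the function names and the same-size padding choice here are our own assumptions.

```python
import torch
import torch.nn.functional as F

def binomial_kernel(ell, channels):
    """Depthwise binomial smoothing kernel sigma_ell with window size 2*ell + 1."""
    row = torch.tensor([1.0])
    for _ in range(2 * ell):                       # repeated [1, 1] convolutions
        row = F.conv1d(row.view(1, 1, -1), torch.ones(1, 1, 2), padding=1).view(-1)
    row = row / row.sum()                          # normalize to sum to one
    k2d = torch.outer(row, row)                    # separable 2D binomial kernel
    return k2d.expand(channels, 1, *k2d.shape).clone()

def smoothed_dilated_conv(x, weight, ell):
    """Binomial smoothing followed by an ell-dilated convolution, cf. Eq. (1).
    x: (N, C_in, H, W); weight: (C_out, C_in, k, k) with odd k."""
    c_in = x.shape[1]
    smooth = binomial_kernel(ell, c_in).to(x)
    x = F.conv2d(x, smooth, padding=ell, groups=c_in)        # x convolved with sigma_ell
    pad = ell * (weight.shape[-1] // 2)                      # keep the spatial size
    return F.conv2d(x, weight, padding=pad, dilation=ell)    # ell-dilated convolution
```

For ℓ = 1 this reduces to a lightly blurred standard convolution; larger dilation factors enlarge the receptive field without adding parameters, as noted in Fig. 3.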
DaConv block.² This block extends standard convolutional
layers with a data-driven selection of the filters’ scale. It
is fed with an input x computed by the previous layers of
PredictionNet, and a depth-dependent input z from DepthNet
(see Figure 2). Both x and z share the same spatial resolution,
while they might have a different number of feature channels.
Like a standard convolutional layer, DaConv is parametrized by m filter kernels {ω_1, . . . , ω_m} (we omit the corresponding bias parameters), but in addition it also has d filter kernels {ν_1, . . . , ν_d} with spatial resolution 1 × 1 that will be involved in the scale selection. Indeed, each filter ν_j is associated with a pre-fixed dilation factor ℓ_j. The output dimensionality of the block is the same one would expect from a standard convolution with the filter bank {ω_1, . . . , ω_m}.
We let the scale selection vary across different spatial locations. To do this, the input z from DepthNet is convolved with each filter ν_j and the output is batch-normalized [33] before entering a softmax layer acting along the feature dimension. This operation preserves the input spatial resolution and yields a probability vector for each spatial location, indicating the scale selection distribution. The probability that dilation factor ℓ_j is chosen for spatial location u is denoted by a_j(u) (see Figure 2, top). The use of batch normalization before the softmax operation ensures that the scale selection will not be biased towards a fixed one across the entire dataset.
² We use the term "block" instead of "layer" because it can be built by composing standard layers found in recent deep network frameworks.
Fig. 3. In ℓ-dilated convolution the elements of a convolutional kernel are interspersed with ℓ - 1 zeroes, thus increasing the receptive field without adding extra parameters.
Finally, the scale selection probabilities encoded in {a_1, . . . , a_d} are used by DaConv to linearly combine the outputs of convolutions of x with the filter kernels ω_i undergoing different dilation factors. In formal terms, the output of the DaConv block is given by

$$y_i = \sum_{j=1}^{d} a_j \odot \big(x \star \sigma_{\ell_j} \star_{\ell_j} \omega_i\big), \qquad (2)$$

where ⊙ denotes the Hadamard (a.k.a. elementwise) product, i ∈ {1, . . . , m} indexes one of the filter kernels ω_i, and between parentheses we find the smoothing operation and ℓ_j-dilated convolution previously described.
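Putting the pieces together, a DaConv block could be sketched as follows in PyTorch (the paper's implementation is in Caffe; the class and argument names are ours, and smoothed_dilated_conv is the helper from the previous sketch). It computes the per-pixel scale-selection weights a_j from the DepthNet features z via 1 × 1 convolutions, batch normalization and a channel-wise softmax, and then combines the smoothed, dilated convolutions of x as in Eq. (2).

```python
import torch
import torch.nn as nn

class DaConv(nn.Module):
    """Depth-aware convolution block (illustrative sketch of Fig. 2 / Eq. (2))."""

    def __init__(self, in_ch, depth_ch, out_ch, ksize=3, dilations=(1, 2, 4)):
        super().__init__()
        self.dilations = dilations
        # shared filter bank {w_1, ..., w_m}; biases omitted as in the text
        self.conv = nn.Conv2d(in_ch, out_ch, ksize, bias=False)
        # 1x1 filters {nu_1, ..., nu_d} followed by batch normalization
        self.scale_sel = nn.Sequential(
            nn.Conv2d(depth_ch, len(dilations), kernel_size=1),
            nn.BatchNorm2d(len(dilations)))

    def forward(self, x, z):
        # per-pixel scale-selection probabilities a_j(u): softmax over the d scales
        a = torch.softmax(self.scale_sel(z), dim=1)
        y = 0.0
        for j, ell in enumerate(self.dilations):
            # smoothed, ell-dilated convolution of x with the shared kernels
            y_j = smoothed_dilated_conv(x, self.conv.weight, ell)
            y = y + a[:, j:j + 1] * y_j     # Hadamard product with a_j, summed over j
        return y
```

The inputs x (from PredictionNet) and z (from DepthNet) must share the same spatial resolution, as required above; the default dilations (1, 2, 4) correspond to the ℓ_j = 2^(j-1) used in the experiments.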
DepthNet. This network provides the depth-specific feature
representations z that drive the selection of the dilation factors
within each DaConv block in PredictionNet. It is designed in
a way to ensure that the input z provided to each DaConv
block has a spatial resolution that matches the one of the
input x to the same DaConv block. Details about the actual
topology of DepthNet in the different application scenarios are
provided in the experimental section. Similarly, we postpone
implementation details about PredictionNet.
IV. DEPTH-AWARE CNN ARCHITECTURES FOR ROBOTIC
PERCEPTION TASKS
In this section we describe and evaluate the proposed depth-
aware approach in three different tasks, which involve RGB-D
data and are of interest for the robotics community, namely part
affordance detection, object coordinates regression and contour
detection. Each task requires pixel-level prediction models.
Therefore, for each application, we consider as baseline a
fully-convolutional network, as it currently represents the most
common architectural choice for pixel-wise classification tasks.
In order to demonstrate the advantages of our proposal, we sys-
tematically compare each baseline network with an associated
DaConv network (“-DA” suffix in the tables and figures).
Each DaConv network is constructed by replacing some of
the convolutional layers of the corresponding baseline network with DaConv blocks, obtaining the PredictionNet, and pairing it with a similarly-structured DepthNet.

TABLE I
NETWORK ARCHITECTURES

Sec. IV-A
  Baseline:      c[3,16]; c[3,16]; p[2,2] | c[3,32]; c[3,32]; p[2,2] | c[3,64]; c[3,64]; p[2,1] | c[3,64]; c[1,n]
  DepthNet:      c[3,8]; c[3,8]; p[2,2] | c[3,16]; c[3,16]; p[2,2] | c[3,32]; c[3,32]
  PredictionNet: c[3,16]; dac[3,16]; p[2,2] | c[3,32]; dac[3,32]; p[2,2] | c[3,64]; dac[3,64]; p[2,1] | c[3,64]; c[1,n]

Sec. IV-B
  Baseline:      c[7,64]; p[2,2] | c[7,64]; p[2,2] | c[7,64]; p[2,2] | c[7,64]; p[2,2] |
                 u[2,2]; d[7,64] | u[2,2]; d[7,64] | u[2,2]; d[7,64] | u[2,2]; d[7,64] | c[1,7]
  DepthNet:      c[7,8]; p[2,2] | c[7,8]; p[2,2] | c[7,8]; p[2,2] | c[7,8]
  PredictionNet: dac[7,64]; p[2,2] | dac[7,64]; p[2,2] | dac[7,64]; p[2,2] | dac[7,64]; p[2,2] |
                 u[2,2]; d[7,64] | u[2,2]; d[7,64] | u[2,2]; d[7,64] | u[2,2]; d[7,64] | c[1,7]

Sec. IV-C
  Baseline:      c[3,64]; c[3,64]; p[2,2] | c[3,128]; c[3,128]; p[2,2] | c[3,256]; c[3,256]; c[3,256]; p[2,2] |
                 c[3,512]; c[3,512]; c[3,512]; p[2,2] | c[3,512]; c[3,512]; c[3,512]
  DepthNet:      c[3,8]; c[3,8]; p[2,2] | c[3,16]; c[3,16]; p[2,2] | c[3,32]; c[3,32]; c[3,32]
  PredictionNet: c[3,64]; dac[3,64]; p[2,2] | c[3,128]; dac[3,128]; p[2,2] | c[3,256]; c[3,256]; dac[3,256]; p[2,2] |
                 c[3,512]; c[3,512]; c[3,512]; p[2,2] | c[3,512]; c[3,512]; c[3,512]

Notation: Convolution: c[size, filters]; Deconvolution: d[size, filters]; DaConv: dac[size, filters]; Pooling: p[size, stride]; Unpooling: u[size, stride]
Additional details about the networks' architectures are provided in Table I, as well as in Sections IV-A, IV-B and IV-C.
Unless otherwise stated, we train all the networks using the Adam stochastic gradient descent method with a weight-decay factor of 5 × 10⁻⁵ and parameters β_1 = 0.9, β_2 = 0.999, and we use DaConv blocks with d = 3 dilation factors, namely ℓ_j = 2^(j-1) for j ∈ {1, 2, 3}.³ In the following sections we use the notation f_i : ℤ² → ℝ to indicate the output of the i-th channel of any network under consideration. All our networks are implemented using the Caffe⁴ framework and trained on a single Nvidia K40 GPU.
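As a minimal sketch (assuming a PyTorch reimplementation; the original code is in Caffe, and weight_decay here is plain L2 regularization, which may differ in minor details), the optimizer configuration described above corresponds to:

```python
import torch

# Placeholder module standing in for any of the baselines / DaConvNets described below.
net = torch.nn.Conv2d(6, 16, kernel_size=3)

optimizer = torch.optim.Adam(
    net.parameters(),
    lr=1e-2,              # starting learning rate of the Sec. IV-A schedule (later 1e-3)
    betas=(0.9, 0.999),   # beta_1, beta_2 as stated above
    weight_decay=5e-5)    # weight-decay factor
```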
A. Object coordinates regression
Several recent methods [13], [14] to estimate the pose
of known 3D objects from a single RGB-D image share a
common two-step pipeline: (i) for each pixel in the image,
predict its 3D coordinates in the object’s frame of reference;
(ii) geometrically estimate the object’s pose from the corre-
spondences between the predicted object coordinates and the
observed depth. The first step results in a pixel-wise, vector-
valued regression that is, in general, quite hard to solve. To
simplify learning, previous works [13], [14] resort to a quantization of the object coordinates space, turning the problem
into a classification one. In these works, the classification is
performed using a random forest, while we show how the
accuracy of step (i) can be improved by employing a DaConv
network to perform the pixel-wise classification.
Dataset and experimental protocol. As in [13], [14], we con-
sider the dataset of Hinterstoisser et al. [34], which comprises
15 sequences of RGB-D images of several objects lying on a
cluttered table. For each sequence, we are given the 6-DOF
pose of a specific object relative to the camera and a 3D mesh
of the object. As in [13], we partition each dimension of an
object’s coordinates space into 5 uniform intervals, obtaining
5 × 5 × 5 = 125 spatial bins in total. By doing so, we can
rephrase the regression task into a classification task.
³ In our experiments we found d = 3 to be a good compromise between classification accuracy and computational complexity.
⁴ http://caffe.berkeleyvision.org/
The number of actual classes can be reduced, since only k (out
of 125) spatial bins will contain at least one point from the
object’s surface, thus being a relevant coordinate for the pose
estimation. Given the depth and camera pose information, we
assign to each of the sequence’s pixels a label in {1, . . . , k}
if it back-projects to one of the k relevant bins, or k + 1 if it
belongs to the background.
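A minimal NumPy sketch of this labeling step is given below. It assumes per-pixel object-frame coordinates and an object mask are already available from the depth and the ground-truth pose; in the paper the k relevant bins are determined from the object's surface, whereas here, for brevity, they are taken from the pixels of the given frame, and labels are 0-based.

```python
import numpy as np

def make_bin_labels(obj_coords, mask, bounds, n_bins=5):
    """Quantize object-frame coordinates into n_bins^3 spatial bins and assign
    each pixel a class in {0, ..., k-1} (relevant bins) or k (background).

    obj_coords: (H, W, 3) per-pixel 3D coordinates in the object's frame.
    mask:       (H, W) boolean, True where the pixel back-projects onto the object.
    bounds:     (min_xyz, max_xyz) extent of the object's 3D mesh.
    """
    lo, hi = np.asarray(bounds[0], float), np.asarray(bounds[1], float)
    # 5 uniform intervals per dimension -> bin index in [0, n_bins - 1]
    idx = np.clip(((obj_coords - lo) / (hi - lo) * n_bins).astype(int), 0, n_bins - 1)
    flat = idx[..., 0] * n_bins**2 + idx[..., 1] * n_bins + idx[..., 2]   # 0 .. 124

    occupied = np.unique(flat[mask])                  # the k relevant bins
    remap = np.full(n_bins**3, -1, dtype=int)
    remap[occupied] = np.arange(len(occupied))        # relevant bin -> {0, ..., k-1}

    labels = np.full(mask.shape, len(occupied), dtype=int)   # background class k
    labels[mask] = remap[flat[mask]]
    return labels, len(occupied)
```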
In our experiments we randomly split each sequence into
train, validation and test sets comprising, respectively, 30%,
10% and 60% of the images. All our results are obtained by
training a different classifier on a train set, selecting parameters
on the corresponding validation set and evaluating on the test
set. Experimental results are reported in terms of the average
per-class accuracy.
Network architecture and training. For this application we
adopt a fully-convolutional network architecture reminiscent
of the VGG net of Simonyan et al. [35]. Compared to [35],
we drastically reduce the number of convolutional filters and
exclude the fully-connected part of the network, as we want
to obtain pixel-wise predictions. Furthermore, we feed the
network with 6-channel tensors obtained by stacking the RGB
image with the 3-channel surface normals computed from the
depth. The architecture, summarized in Table I, is composed
of four main blocks of 3 × 3 convolutions of stride 1, followed
by 2 × 2 max pooling. The first two max pooling layers
have stride 2, while the third one has stride 1, resulting in
a final downsampling factor of 4. Because of this, the network
outputs pixel-wise predictions at one fourth of the original
resolution. At test time we up-sample the predictions using
nearest-neighbor interpolation. As for DaConvNet, we replace
the second convolutional layer within each of the first three
blocks with a DaConv block.
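To make the Table I notation concrete, the Sec. IV-A baseline could be written as follows in PyTorch (a sketch under our own assumptions: ReLU non-linearities and same-size padding are not specified in the table, and n is the number of output classes k + 1):

```python
import torch.nn as nn

def conv(cin, cout, k):
    """c[k, cout]: k x k convolution, stride 1 (ReLU and same-size padding assumed)."""
    return nn.Sequential(nn.Conv2d(cin, cout, k, padding=k // 2), nn.ReLU(inplace=True))

def baseline_sec4a(n_classes, in_ch=6):   # input: RGB stacked with surface normals
    return nn.Sequential(
        conv(in_ch, 16, 3), conv(16, 16, 3), nn.MaxPool2d(2, stride=2),   # p[2, 2]
        conv(16, 32, 3),    conv(32, 32, 3), nn.MaxPool2d(2, stride=2),   # p[2, 2]
        conv(32, 64, 3),    conv(64, 64, 3), nn.MaxPool2d(2, stride=1),   # p[2, 1]
        conv(64, 64, 3),    nn.Conv2d(64, n_classes, 1))                  # c[1, n]
```

The DaConv variant replaces the second convolution of each of the first three blocks with a DaConv block fed with the matching-resolution DepthNet features, as described above.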
The objective we use for training consists of a per-pixel softmax log-loss. The contribution of each pixel is suitably weighted in order to compensate for the highly imbalanced class distribution in the dataset. In formal terms, we address the following optimization problem:

$$\arg\min \; -\sum_{r} \xi_{l(r)} \, \log \frac{\exp(f_{l(r)}(r))}{\sum_{i=1}^{k+1} \exp(f_i(r))}, \qquad (3)$$

where l(r) ∈ {1, . . . , k + 1} is the ground-truth label at spatial location r ∈ ℤ², and the minimization is implicitly taken with respect to the network parameters. The class-rebalancing weights ξ_i are defined as in [29], for all 1 ≤ i ≤ k + 1:

$$\xi_i = \frac{\mathrm{median}_i(\xi^0_i)}{\xi^0_i}, \qquad (4)$$

$$\xi^0_i = \frac{\#\,\text{pixels of class } i}{\#\,\text{pixels in images containing } i}. \qquad (5)$$
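A short sketch of the rebalancing weights of Eqs. (4)-(5) and the weighted loss of Eq. (3), again in PyTorch/NumPy with hypothetical function names (it assumes every class appears at least once in the training label maps):

```python
import numpy as np
import torch
import torch.nn.functional as F

def class_weights(label_maps, n_classes):
    """Median-frequency rebalancing weights xi_i of Eqs. (4)-(5)."""
    class_px = np.zeros(n_classes)   # number of pixels of class i over the training set
    image_px = np.zeros(n_classes)   # number of pixels in images containing class i
    for lbl in label_maps:           # lbl: (H, W) integer label map
        for c in np.unique(lbl):
            class_px[c] += np.sum(lbl == c)
            image_px[c] += lbl.size
    freq = class_px / image_px                                          # xi^0_i
    return torch.tensor(np.median(freq) / freq, dtype=torch.float32)   # xi_i

def weighted_pixel_loss(logits, labels, weights):
    """Weighted per-pixel softmax log-loss of Eq. (3).
    logits: (N, k+1, H, W); labels: (N, H, W) with values in {0, ..., k}."""
    return F.cross_entropy(logits, labels, weight=weights, reduction='sum')
```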
As mentioned already at the beginning of this section, we optimize (3) via stochastic gradient descent. Both the baseline and DaConvNet are trained with the following schedule: 50 epochs with learning rate 10⁻², followed by 25 epochs with learning rate 10⁻³, with a batch size of 64. As is common practice when considering small datasets [16], we perform data augmentation during training. In particular, we form training batches by sampling randomly rotated, scaled and translated 128 × 128 pixel patches from the training images. At test time we apply the learned network to full-resolution images.
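The data augmentation step can be sketched as follows, jointly warping the input tensor (bilinear interpolation) and the label map (nearest neighbor); the rotation, scale and translation ranges are illustrative guesses, since the text only states that patches are randomly rotated, scaled and translated:

```python
import math
import torch
import torch.nn.functional as F

def random_patch(image, labels, patch=128, scale_range=(0.8, 1.2), max_shift=0.25):
    """Sample one randomly rotated, scaled and translated patch.
    image: (C, H, W) float tensor (RGB + normals); labels: (H, W) integer tensor."""
    angle = (torch.rand(1).item() * 2 - 1) * math.pi
    scale = scale_range[0] + torch.rand(1).item() * (scale_range[1] - scale_range[0])
    tx, ty = ((torch.rand(2) * 2 - 1) * max_shift).tolist()

    cos, sin = math.cos(angle) / scale, math.sin(angle) / scale
    theta = torch.tensor([[cos, -sin, tx],
                          [sin,  cos, ty]], dtype=torch.float32).unsqueeze(0)
    grid = F.affine_grid(theta, size=(1, 1, patch, patch), align_corners=False)

    img = F.grid_sample(image.unsqueeze(0), grid, mode='bilinear',
                        align_corners=False).squeeze(0)
    lbl = F.grid_sample(labels[None, None].float(), grid, mode='nearest',
                        align_corners=False).squeeze().long()
    return img, lbl
```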
Results. Figure 4 reports the results of our experimental
evaluation, comparing the proposed DaConv architecture with
the baseline, fully-convolutional network. The proposed net-
work (CNN-DA) outperforms the baseline CNN using standard
convolution layers for all different objects. On average, the use
of DaConv blocks improves classification accuracy by 10%. As
a reference, we also report the results obtained considering a
random forest classifier, as this represents the common choice
for coordinate regression tasks [13], [14]. In particular, we use
the implementation in Piotr Dollár's toolbox⁵, training a forest
with 10 trees observing the same RGB plus surface normal
inputs as the CNN. Note that this implementation differs from
the one in [13], as the code for that is not publicly available.
As one can see, deep architectures outperform off-the-shelf
random forests, which is not so surprising, as CNNs currently
achieve state-of-the-art results in many tasks in robotic percep-
tion. Figure 5 shows an example of the a_j functions learned by the first DaConv block of CNN-DA for the Ape object. Interestingly, the functions mostly follow the scene's depth, with gradually higher weights being assigned to the scale j = 3 for closer objects, and vice versa for j = 1.
This is in accordance with the intuition illustrated in Figure 1.
B. Part Affordance Detection
The problem of localizing and identifying part affordances
[12] is a fundamental task for deploying the next generation
of robotic platforms, which are supposed to effectively col-
laborate with humans in everyday workspaces. Part affordance detection requires segmenting and labeling image regions corresponding to object parts according to their interaction modality, or affordance. In other words, each affordance constitutes a class in the segmentation problem. Predicting affordances is very challenging, since objects from different categories, with different shapes and visual appearances, can have parts associated with the same affordance.
⁵ https://pdollar.github.io/toolbox/
Dataset and experimental protocol. In our experiments we
consider the RGB-D Part Affordance Dataset of Myers et
al. [12], covering 105 different tools and 7 different affor-
dances, namely “grasp”, “cut”, “scoop”, “contain”, “pound”,
“support” and “wrap-grasp”, for a total of about 30k images.
This dataset is split into two parts: (i) a Non-cluttered subset
comprising RGB-D video sequences of single tools lying on
a rotating plane; (ii) a Cluttered subset comprising 3 RGB-
D video sequences of several different tools amassed over
a table. One third of the video frames have been manually
labeled by a group of users, and the labels automatically
propagated to the remaining frames. To account for possibly conflicting labelings provided by different users, each pixel retains as ground-truth information a ranking of affordance labels, ordered from the most voted to the least voted one.
We follow the experimental protocol in [12] by directly
using the publicly available evaluation code from the authors,⁶ which considers only the manually labeled frames, both for training and testing, and gives separate results for the Non-cluttered and Cluttered subsets. Detection accuracy is evaluated in terms of three different metrics: the weighted F-measure F^w_β, the rank weighted F-measure R^w_β, and the ranked correlation score τ_k. For a detailed description of the way these metrics are
calculated, we refer the reader to [12] and the public code.
Network architecture and training. We adopt the SegNet-
Basic architecture in [36], summarized in Table I. This is a
symmetric architecture that takes RGB images as input and
is composed of four convolutional and four deconvolutional
layers with 64 filters of size 7 × 7 and stride 1. Each convolutional
layer is followed by a 2 × 2, stride 2 max-pooling layer and
each deconvolutional layer is preceded by a 2 × 2, stride 2
max-unpooling layer. Batch normalization is applied to the
output of all convolutional and deconvolutional layers. The
network output layer has 8 channels, corresponding to the 7
affordance classes with an additional background class. For
our DaConvNet, we replace each convolutional layer with a
DaConv block.
We train our networks by solving the following optimization
problem:
$$\arg\min \; L_c + \lambda L_r, \qquad (6)$$
where the minimization is intended with respect to the network parameters. The objective is composed of a classification loss term L_c and a ranking-related loss term L_r. The classification loss L_c is a weighted sum of pixel-wise log-loss terms defined similarly to (3), where the per-pixel log-loss term is computed with respect to the top-ranked class in the ground-truth ranking. The ranking loss L_r is a sum of pixel-wise loss terms, each aimed at exploiting the ranking information from the ground truth. It is defined as follows:
$$L_r = -\sum_{r} \sum_{i \neq j} p_{i,j}(r) \, \log\!\big(\sigma(f_i(r) - f_j(r))\big), \qquad (7)$$

where p_{i,j}(r) is 1 if affordance i ranks higher than j in the ground-truth ranking for pixel r, 0.5 if they have the same ranking and 0 otherwise, while σ(·) is the sigmoid function.
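As a minimal sketch of Eq. (7) (hypothetical function name; it assumes the pairwise preferences p_{i,j}(r) have already been precomputed from the ground-truth rankings):

```python
import torch
import torch.nn.functional as F

def ranking_loss(f, p):
    """Pairwise ranking loss L_r of Eq. (7) for a single image.
    f: (C, H, W) network outputs f_i(r) over the C affordance channels.
    p: (C, C, H, W) preferences p_{i,j}(r) in {0, 0.5, 1}."""
    C = f.shape[0]
    diff = f.unsqueeze(1) - f.unsqueeze(0)        # diff[i, j] = f_i(r) - f_j(r)
    pair_loss = -p * F.logsigmoid(diff)           # -p_{i,j}(r) * log(sigmoid(...))
    off_diag = ~torch.eye(C, dtype=torch.bool, device=f.device)
    return pair_loss[off_diag].sum()              # sum over i != j and over pixels r
```

The total training objective of Eq. (6) then adds this term, scaled by λ, to the classification loss L_c.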
⁶ http://www.umiacs.umd.edu/~amyers/part_affordance/
