Learning Hierarchical Features
for Scene Labeling
Clément Farabet, Camille Couprie, Laurent Najman, Yann LeCun
Abstract—Scene labeling consists in labeling each pixel in an image with the category of the object it belongs to. We propose a
method that uses a multiscale convolutional network trained from raw pixels to extract dense feature vectors that encode regions of
multiple sizes centered on each pixel. The method alleviates the need for engineered features, and produces a powerful representation
that captures texture, shape and contextual information. We report results using multiple post-processing methods to produce the final
labeling. Among those, we propose a technique to automatically retrieve, from a pool of segmentation components, an optimal set of
components that best explain the scene; these components are arbitrary, e.g. they can be taken from a segmentation tree, or from any
family of over-segmentations. The system yields record accuracies on the Sift Flow Dataset (33 classes) and the Barcelona Dataset
(170 classes) and near-record accuracy on Stanford Background Dataset (8 classes), while being an order of magnitude faster than
competing approaches, producing a 320 × 240 image labeling in less than a second, including feature extraction.
Index Terms—Convolutional networks, deep learning, image segmentation, image classification, scene parsing.
1 INTRODUCTION
IMAGE UNDERSTANDING is a task of primary importance for a wide range of practical applications. One important step towards understanding an image is to perform full-scene labeling, also known as scene parsing, which consists in labeling every pixel in the image
with the category of the object it belongs to. After a
perfect scene parsing, every region and every object is
delineated and tagged. One challenge of scene parsing
is that it combines the traditional problems of detection,
segmentation, and multi-label recognition in a single
process.
There are two questions of primary importance in the
context of scene parsing: how to produce good internal
representations of the visual information, and how to use
contextual information to ensure the self-consistency of
the interpretation.
This paper presents a scene parsing system that relies
on deep learning methods to approach both questions.
The main idea is to use a convolutional network [27]
operating on a large input window to produce label hy-
potheses for each pixel location. The convolutional net is
fed with raw image pixels (after band-pass filtering and
contrast normalization), and trained in supervised mode
from fully-labeled images to produce a category for each
pixel location. Convolutional networks are composed
of multiple stages each of which contains a filter bank
module, a non-linearity, and a spatial pooling module.
With end-to-end training, convolutional networks can
automatically learn hierarchical feature representations.
Unfortunately, labeling each pixel by looking at a small
region around it is difficult. The category of a pixel
may depend on relatively short-range information (e.g.
Clément Farabet, Camille Couprie, and Yann LeCun are with the Courant
Institute of Mathematical Sciences, New York University (New York, NY
10003, USA).
Clément Farabet and Laurent Najman are with the Laboratoire
d’Informatique Gaspard-Monge, Université Paris-Est, Equipe A3SI,
ESIEE Paris (93160 Noisy-le-Grand, France).
E-mails: cfarabet@cs.nyu.edu, ccouprie@cs.nyu.edu,
l.najman@esiee.fr, yann@cs.nyu.edu
the presence of a human face generally indicates the
presence of a human body nearby), but may also depend
on long-range information. For example, identifying a
grey pixel as belonging to a road, a sidewalk, a gray car,
a concrete building, or a cloudy sky requires a wide con-
textual window that shows enough of the surroundings
to make an informed decision. To address this problem,
we propose to use a multi-scale convolutional network,
which can take into account large input windows, while
keeping the number of free parameters to a minimum.
Common approaches to scene parsing first produce
segmentation hypotheses using graph-based methods.
Candidate segments are then encoded using engineered
features. Finally, a conditional random field (or some
other type of graphical model), is trained to produce
labels for each candidate segment, and to ensure that
the labelings are globally consistent.
A striking characteristic of the system proposed here
is that the use of a large contextual window to label
pixels reduces the requirement for sophisticated post-
processing methods that ensure the consistency of the
labeling.
More precisely, the proposed scene parsing architecture is depicted in Figure 1. It relies on
two main components:
1) Multi-scale, convolutional representation: our
multi-scale, dense feature extractor produces a series of
feature vectors for regions of multiple sizes centered
around every pixel in the image, covering a large
context. The multi-scale convolutional net contains
multiple copies of a single network (all sharing the
same weights) that are applied to different scales of a
Laplacian pyramid version of the input image. For each
pixel, the networks collectively encode the information
present in a large contextual window around the given
pixel (184 × 184 pixels in the system described here).
The convolutional network is fed with raw pixels
and trained end to end, thereby alleviating the need
for hand-engineered features. When properly trained,
these features produce a representation that captures
texture, shape and contextual information. While using

a multiscale representation seems natural for full-scene labeling (FSL), it
has rarely been used in the context of feature learning
systems. The multiscale representation that is learned
is sufficiently complete to allow the detection and
recognition of all the objects and regions in the scene.
However, it does not accurately pinpoint the boundaries
of the regions, and requires some post-processing to
yield cleanly delineated predictions.
2) Graph-based classification:
An over-segmentation is constructed from the image,
and is used to group the feature descriptors. Several
over-segmentations are considered, and three techniques
are proposed to produce the final image labeling.
2.a. Superpixels: The image is segmented into disjoint
components, widely over-segmenting the scene. In this
scenario, a pixelwise classifier is trained on the convo-
lutional feature vectors, and a simple vote is done for
each component, to assign a single class per component.
This method is simple and effective, but imposes a fixed
level of segmentation, which can be suboptimal.
2.b. Conditional random field over superpixels: a
conditional random field is defined over a set of super-
pixels. Compared to the previous, simpler method, this
post-processing models joint probabilities at the level
of the scene, and is useful to avoid local aberrations
(e.g. a person in the sky). That kind of approach is
widely used in the computer vision community, and we
show that our learned multiscale feature representation
essentially makes the use of a global random field much
less useful: most scene-level relationships seem to be
already captured by it.
2.c. Multilevel cut with class purity criterion: A
family of segmentations is constructed over the image
to analyze the scene at multiple levels. In the simplest
case, this family might be a segmentation tree; in the
most general case it can be any set of segmentations,
for example a collection of superpixels either produced
using the same algorithm with different parameter
tunings or produced by different algorithms. Each
segmentation component is represented by the set
of feature vectors that fall into it: the component is
encoded by a spatial grid of aggregated feature vectors.
The aggregated feature vector of each grid cell is
computed by a component-wise max pooling of the
feature vectors centered on all the pixels that fall into the
grid cell. This produces a scale-invariant representation
of the segment and its surrounding. A classifier is then
applied to the aggregated feature grid of each node.
This classifier is trained to estimate the histogram of all
object categories present in the component. A subset of
the components is then selected such that they cover
the entire image. These components are selected so
as to minimize the average “impurity” of the class
distribution in a procedure that we name “optimal
cover”. The class “impurity” is defined as the entropy
of the class distribution. The choice of the cover thus
attempts to find a consistent overall segmentation in
which each segment contains pixels belonging to only
one of the learned categories. This simple method allows
us to consider full families of segmentation components,
rather than a unique, predetermined segmentation (e.g.
a single set of superpixels).
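As a rough illustration of the purity criterion, the sketch below computes the entropy-based "impurity" of a component's class histogram and greedily assembles a cover from the purest components. It is only a hedged approximation of the optimal-cover procedure described later in the paper; the data layout and the greedy selection are our own assumptions, written in Python with NumPy.

```python
import numpy as np

def impurity(class_hist, eps=1e-12):
    """Class "impurity" of a component: entropy of its normalized class histogram."""
    p = class_hist / (class_hist.sum() + eps)
    return -np.sum(p * np.log(p + eps))

def greedy_cover(components, n_pixels):
    """Pick a subset of components that covers every pixel, preferring pure ones.

    `components` is a list of (flat_pixel_indices, class_histogram) pairs drawn
    from a segmentation tree or any pool of over-segmentations.  The greedy
    selection below is a simplified stand-in for the paper's optimal cover.
    """
    covered = np.zeros(n_pixels, dtype=bool)
    chosen = []
    for k in sorted(range(len(components)),
                    key=lambda k: impurity(components[k][1])):
        pixels, _ = components[k]
        if not covered[pixels].all():       # component still explains new pixels
            chosen.append(k)
            covered[pixels] = True
        if covered.all():
            break
    return chosen
```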
All the steps in the process have a complexity linear
(or almost linear) in the number of pixels. The bulk of
the computation resides in the convolutional network
feature extractor. The resulting system is very fast,
producing a full parse of a 320 × 240 image in less than
a second on a conventional CPU, and in less than 100ms
using dedicated hardware, opening the door to real-time
applications. Once trained, the system is parameter free,
and requires no adjustment of thresholds or other knobs.
An early version of this work was first published in [7]. This journal version reports more complete experiments, comparisons, and improved results.
2 RELATED WORK
The scene parsing problem has been approached with a
wide variety of methods in recent years. Many methods
rely on MRFs, CRFs, or other types of graphical models
to ensure the consistency of the labeling and to account
for context [19], [39], [15], [25], [32], [44], [30]. Most
methods rely on a pre-segmentation into superpixels
or other segment candidates, and extract features and
categories from individual segments and from various
combinations of neighboring segments. The graphical
model inference pulls out the most consistent set of
segments which covers the image.
[43] proposed a method to aggregate segments in a greedy fashion using a trained scoring function. The
originality of the approach is that the feature vector
of the combination of two segments is computed from
the feature vectors of the individual segments through
a trainable function. Like us, they use “deep learning”
methods to train their feature extractor. But unlike us,
their feature extractor operates on hand-engineered fea-
tures.
One of the main questions in scene parsing is how
to take a wide context into account to make a local
decision. [32] proposed to use the histogram of labels
extracted from a coarse scale as input to the labeler
that looks at finer scales. Our approach is somewhat
simpler: our feature extractor is applied densely to an
image pyramid. The coarse feature maps thereby generated are upsampled to match the resolution of the finest scale.
Hence with three scales, each feature vector has multiple
fields which encode multiple regions of increasing sizes
and decreasing resolutions, centered on the same pixel
location.
Like us, a number of authors have used families of
segmentations or trees to generate candidate segments
by aggregating elementary segments. The approaches of
[39], [30] rely on inference algorithms based on Graph
Cuts to label images using trees of segmentation. Other
strategies using families of segmentations appeared in
[36], [5]. None of the previous strategies for scene la-
beling used a purity criterion on the class distributions.
Combined with the optimal cover strategy, this purity
criterion is general, efficient and could be applied to
solve different problems.
Contrary to the previously cited approaches using
engineered features, our system extracts features densely
from a multiscale pyramid of images using a convolu-
tional network (ConvNet) [27]. These networks can be
fed with raw pixels and can automatically learn low-
level and mid-level features, alleviating the need for

Fig. 1. Diagram of the scene parsing system. The raw input image is transformed through a Laplacian pyramid. Each scale is fed to a 3-stage convolutional network, which produces a set of feature maps. The feature maps of all scales are concatenated, the coarser-scale maps being upsampled to match the size of the finest-scale map. Each feature vector thus represents a large contextual window around each pixel. In parallel, a single segmentation (i.e. superpixels), or a family of segmentations (e.g. a segmentation tree), is computed to exploit the natural contours of the image. The final labeling is produced from the feature vectors and the segmentation(s) using different methods, as presented in Section 4.
hand-engineered features. One of their advantages is the
ability to compute dense features efficiently over large
images. They are best known for their applications to
detection and recognition [47], [14], [35], [21], but they
have also been used for image segmentation, particularly
for biological image segmentation [34], [20], [46].
The only previously published work on using con-
volutional networks for scene parsing is that of [17].
While somewhat preliminary, their work showed that
convolutional networks fed with raw pixels could be
trained to perform scene parsing with decent accuracy.
Unlike [17] however, our system uses a boundary-based
hierarchy of segmentations to align the labels produced
by the network to the boundaries in the image and thus
produces representations that are independent of the size
of the segments through feature pooling. Slightly after
[8], Schulz and Behnke proposed a similar architecture of a multiscale convolutional network for scene parsing [40]. Unlike us, they use pairwise class location filters
to predict the final segmentation, instead of using the
image gradient that we found to be more accurate.
3 MULTISCALE FEATURE EXTRACTION FOR
SCENE PARSING
The model proposed in this paper, depicted in Figure 1,
relies on two complementary image representations. In
the first representation, an image patch is seen as a point in R^P, and we seek to find a transform f : R^P → R^Q that maps each patch into R^Q, a space where it can be classified linearly. This first representation typically
suffers from two main problems when using a classi-
cal convolutional network, where the image is divided
following a grid pattern: (1) the window considered
rarely contains an object that is properly centered and
scaled, and therefore offers a poor observation basis to
predict the class of the underlying object, (2) integrating
a large context involves increasing the grid size, and
therefore the dimensionality P of the input; given a
finite amount of training data, it is then necessary to
enforce some invariance in the function f itself. This is
usually achieved by using pooling/subsampling layers,
which in turn degrades the ability of the model to
precisely locate and delineate objects. In this paper, f
is implemented by a multiscale convolutional network,
which allows integrating large contexts (as large as the
complete scene) into local decisions, yet still remaining
manageable in terms of parameters/dimensionality. This
multiscale model, in which weights are shared across
scales, allows the model to capture long-range interac-
tions, without the penalty of extra parameters to train.
This model is described in Section 3.1.
In the second representation, the image is seen as an
edge-weighted graph, on which one or several over-
segmentations can be constructed. The components are
spatially accurate, and naturally delineate the underly-
ing objects, as this representation conserves pixel-level
precision. Section 4 describes multiple strategies to combine both representations. In particular, we describe in Section 4.3 a method for analyzing a family of segmenta-
tions (at multiple levels). It can be used as a solution to
the first problem exposed above: assuming the capability
of assessing the quality of all the components in this
family of segmentations, a system can automatically
choose its components so as to produce the best set of
predictions.
3.1 Scale-invariant, scene-level feature extraction
Good internal representations are hierarchical. In vision,
pixels are assembled into edglets, edglets into motifs,
motifs into parts, parts into objects, and objects into
scenes. This suggests that recognition architectures for
vision (and for other modalities such as audio and
natural language) should have multiple trainable stages
stacked on top of each other, one for each level in the
feature hierarchy. Convolutional Networks (ConvNets)
provide a simple framework to learn such hierarchies of
features.

Convolutional Networks [26], [27] are trainable archi-
tectures composed of multiple stages. The input and
output of each stage are sets of arrays called feature maps.
For example, if the input is a color image, each feature
map would be a 2D array containing a color channel of
the input image (for an audio input each feature map
would be a 1D array, and for a video or volumetric
image, it would be a 3D array). At the output, each
feature map represents a particular feature extracted at
all locations on the input. Each stage is composed of
three layers: a filter bank layer, a non-linearity layer, and
a feature pooling layer. A typical ConvNet is composed
of one, two or three such 3-layer stages, followed by a
classification module. Because they are trainable, arbi-
trary input modalities can be modeled, beyond natural
images.
Our feature extractor is a three-stage convolutional
network. The first two stages contain a bank of filters
producing multiple feature maps, a point-wise non-
linear mapping and a spatial pooling followed by sub-
sampling of each feature map. The last layer only con-
tains a bank of filters. The filters (convolution kernels)
are subject to training. Each filter is applied to the
input feature maps through a 2D convolution operation,
which detects local features at all locations on the input.
Each filter bank of a convolutional network produces
features that are equivariant under shifts, i.e. if the
input is shifted, the output is also shifted but otherwise
unchanged.
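The shift-equivariance property can be checked numerically: filtering a shifted image gives the shifted response, away from boundary effects. The snippet below is a small self-contained check; the image, kernel, and margins are arbitrary choices of ours.

```python
import numpy as np
from scipy.signal import correlate2d

img = np.random.rand(16, 16)
kernel = np.random.rand(3, 3)

resp = correlate2d(img, kernel, mode='same')               # filter the image
shifted = np.roll(img, 2, axis=1)                          # shift the input by 2 pixels
resp_shifted = correlate2d(shifted, kernel, mode='same')   # filter the shifted image

# Away from the borders, the response of the shifted image equals the shifted response.
assert np.allclose(resp_shifted[:, 4:-4], np.roll(resp, 2, axis=1)[:, 4:-4])
```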
While convolutional networks have been used success-
fully for a number of image labeling problems, image-
level tasks such as full-scene understanding (pixel-wise
labeling, or any dense feature estimation) require the
system to model complex interactions at the scale of
complete images, not simply within a patch. To view
a large contextual window at full resolution, a convolu-
tional network would have to be unmanageably large.
The solution is to use a multiscale approach. Our
multiscale convolutional network overcomes these limi-
tations by extending the concept of spatial weight repli-
cation to the scale space. Given an input image I, a multiscale pyramid of images X_s, s ∈ {1, . . . , N}, is constructed, where X_1 has the size of I. The multiscale pyramid can be a Laplacian pyramid, and is typically pre-processed so that local neighborhoods have zero mean and unit standard deviation. Given a classical convolutional network f_s with parameters θ_s, the multiscale network is obtained by instantiating one network per scale s, and sharing all parameters across scales: θ_s = θ_0, ∀s ∈ {1, . . . , N}.
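The following sketch builds such a pyramid with per-scale local normalization. It assumes a single-channel floating-point image; the number of scales, the Gaussian widths, and the use of SciPy are illustrative choices, not the paper's exact preprocessing.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def local_normalize(x, sigma=4.0, eps=1e-6):
    """Give local neighbourhoods zero mean and unit standard deviation."""
    mean = gaussian_filter(x, sigma)
    x = x - mean
    std = np.sqrt(gaussian_filter(x ** 2, sigma)) + eps
    return x / std

def laplacian_pyramid(img, n_scales=3):
    """Multiscale inputs X_s, s = 1..N, with X_1 the size of the original image."""
    scales, current = [], img.astype(np.float32)
    for _ in range(n_scales):
        blurred = gaussian_filter(current, sigma=1.0)
        scales.append(local_normalize(current - blurred))  # band-pass, then normalize
        current = blurred[::2, ::2]                        # downsample by 2 for next scale
    return scales
```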
We introduce the following convention: banks of im-
ages will be seen as three dimensional arrays in which
the first dimension is the number of independent feature
maps, or images, the second is the height of the maps
and the third is the width. The output state of the L-th stage is denoted H_L.
The maps in the pyramid are computed using a scaling/normalizing function g_s as X_s = g_s(I), for all s ∈ {1, . . . , N}.
For each scale s, the convolutional network f_s can be described as a sequence of linear transforms, interspersed with non-linear symmetric squashing units (typically the tanh function [28]), and pooling/subsampling operators. For a network f_s with L layers, we have

f_s(X_s; θ_s) = W_L H_{L−1},   (1)
where the vector of hidden units at layer l is

H_l = pool(tanh(W_l H_{l−1} + b_l))   (2)

for all l ∈ {1, . . . , L−1}, with b_l a vector of bias parameters, and H_0 = X_s. The matrices W_l are Toeplitz matrices, therefore each hidden unit vector H_l can be expressed as a regular convolution between kernels from W_l and the previous hidden unit vector H_{l−1}, squashed through a tanh, and pooled spatially. More specifically,

H_{lp} = pool(tanh(b_{lp} + Σ_{q∈parents(p)} w_{lpq} ∗ H_{l−1,q})).   (3)
The filters W_l and the biases b_l constitute the trainable parameters of our model, and are collectively denoted θ_s. The function tanh is a point-wise non-linearity, while
pool is a function that considers a neighborhood of
activations, and produces one activation per neighbor-
hood. In all our experiments, we use a max-pooling
operator, which takes the maximum activation within
the neighborhood. Pooling over a small neighborhood
provides built-in invariance to small translations.
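A minimal PyTorch sketch of one such stage, and of the three-stage network f_s of Eq. (1), is given below. The numbers of feature maps and the 7×7 kernels are placeholders, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ConvNetStage(nn.Module):
    """One stage of Eq. (2): H_l = pool(tanh(W_l H_{l-1} + b_l))."""
    def __init__(self, in_maps, out_maps, kernel=7, pool=2):
        super().__init__()
        self.filters = nn.Conv2d(in_maps, out_maps, kernel, padding=kernel // 2)
        self.pool = nn.MaxPool2d(pool)            # max pooling over small neighbourhoods

    def forward(self, h):
        return self.pool(torch.tanh(self.filters(h)))

class ThreeStageConvNet(nn.Module):
    """f_s: two full stages followed by a last stage holding only a filter bank (Eq. 1)."""
    def __init__(self, in_maps=3, maps=(16, 64, 256)):
        super().__init__()
        self.stage1 = ConvNetStage(in_maps, maps[0])
        self.stage2 = ConvNetStage(maps[0], maps[1])
        self.stage3 = nn.Conv2d(maps[1], maps[2], kernel_size=7, padding=3)

    def forward(self, x):
        return self.stage3(self.stage2(self.stage1(x)))
```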
Finally, the outputs of the N networks are upsampled and concatenated so as to produce F, a map of feature vectors of size N times the size of f_1, which can be seen as local patch descriptors and scene-level descriptors:

F = [f_1, u(f_2), . . . , u(f_N)],   (4)

where u is an upsampling function.
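Building on the sketch above, the multiscale feature map F of Eq. (4) can be assembled by applying one shared network to every scale, upsampling the coarser outputs, and concatenating along the feature dimension. Bilinear interpolation stands in for the unspecified upsampling function u, and each pyramid level is assumed to be a (1, channels, height, width) tensor.

```python
import torch
import torch.nn.functional as nnF

def multiscale_features(pyramid, net):
    """Eq. (4): F = [f_1, u(f_2), ..., u(f_N)] with a single shared network `net`."""
    outputs = [net(x) for x in pyramid]          # same weights applied at every scale
    h, w = outputs[0].shape[-2:]                 # spatial size of the finest-scale map
    upsampled = [outputs[0]] + [
        nnF.interpolate(o, size=(h, w), mode='bilinear', align_corners=False)
        for o in outputs[1:]
    ]
    return torch.cat(upsampled, dim=1)           # concatenate along the feature dimension
```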
As mentioned above, weights are shared between networks f_s. Intuitively, imposing complete weight sharing across scales is a natural way of forcing the network to learn scale-invariant features while reducing the chances of over-fitting. The more scales used to jointly train the models f_s(θ_s), the better the representation becomes for all scales. Because image content is, in principle, scale invariant, using the same function to extract features at each scale is justified.
3.2 Learning discriminative scale-invariant features
As described in Section 3.1, feature vectors in F are obtained by concatenating the outputs of multiple networks f_s, each taking as input a different image in a multiscale pyramid.
Ideally a linear classifier should produce the correct categorization for all pixel locations i, from the feature vectors F_i. We train the parameters θ_s to achieve this goal, using the multiclass cross entropy loss function. Let ĉ_i be the normalized prediction vector from the linear classifier for pixel i. We compute normalized predicted probability distributions over classes ĉ_{i,a} using the softmax function, i.e.

ĉ_{i,a} = exp(w_a^T F_i) / Σ_{b∈classes} exp(w_b^T F_i),   (5)

where w is a temporary weight matrix only used to learn the features. The cross entropy between the predicted class distribution ĉ and the target class distribution c penalizes their deviation and is measured by

Fig. 2. First labeling strategy from the features: using superpixels as described in Section 4.1.
L_cat = − Σ_{i∈pixels} Σ_{a∈classes} c_{i,a} ln(ĉ_{i,a}).   (6)
The true target probability c_{i,a} of class a to be present at location i can either be a distribution of classes at location i in a given neighborhood, or a hard target vector: c_{i,a} = 1 if pixel i is labeled a, and 0 otherwise. For training maximally discriminative features, we use hard target vectors in this first stage.
Once the parameters θ_s are trained, the classifier in Eq. (5) is discarded, and the feature vectors F_i are used with different strategies, as described in Section 4.
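A compact NumPy sketch of this training criterion (Eqs. (5)-(7)) is shown below; the array names and the temporary weight matrix w are illustrative, and in practice the gradient of the loss would be backpropagated into the parameters θ_s.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)        # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def linear_classifier_loss(feats, labels, w, n_classes):
    """Per-pixel softmax over w^T F_i (Eq. 5) and cross entropy with hard targets (Eq. 6).

    feats: (n_pixels, feat_dim) feature vectors F_i, w: (feat_dim, n_classes).
    """
    c_hat = softmax(feats @ w)                   # predicted class distributions
    c = np.eye(n_classes)[labels]                # hard target vectors c_{i,a}
    loss = -np.sum(c * np.log(c_hat + 1e-12))    # L_cat
    pred = c_hat.argmax(axis=1)                  # Eq. (7): per-pixel argmax labeling
    return loss, pred
```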
4 SCENE LABELING STRATEGIES
The simplest strategy for labeling the scene is to use the linear classifier described in Section 3.2, and assign each pixel the argmax of the prediction at its location. More specifically, for each pixel i,

l_i = arg max_{a∈classes} ĉ_{i,a}.   (7)
The resulting labeling l, although fairly accurate, is not visually satisfying, as it lacks spatial consistency and precise delineation of objects. In this section, we explore
three strategies to produce spatially more appealing
labelings.
4.1 Superpixels
Predicting the class of each pixel independently from
its neighbors yields noisy predictions. A simple cleanup
can be obtained by forcing local regions of the same color intensity to be assigned a single label.
As in [13], [16], we compute superpixels, following the method proposed by [11], to produce an over-
segmentation of the image. We then classify each location
of the image densely, and aggregate these predictions in
each superpixel, by computing the average class distri-
bution within the superpixel.
For this method, the pixelwise distributions d̂_k at superpixel k are predicted from the feature vectors F_i using a two-layer neural network:

y_i = W_2 tanh(W_1 F_i + b_1),   (8)

d̂_{i,a} = exp(y_{i,a}) / Σ_{b∈classes} exp(y_{i,b}),   (9)

L_cat = − Σ_{i∈pixels} Σ_{a∈classes} d_{i,a} ln(d̂_{i,a}),   (10)

d̂_{k,a} = (1 / s(k)) Σ_{i∈k} d̂_{i,a},   (11)

with d_i the groundtruth distribution at location i, and s(k) the surface of component k. Matrices W_1 and W_2 are the trainable parameters of the classifier. Using a two-layer neural network, as opposed to the simple linear classifier used in Section 3.2, allows the system to capture non-linear relationships between the features at different scales. In this case, the final labeling for each component k is given by

l_k = arg max_{a∈classes} d̂_{k,a}.   (12)

The pipeline is depicted in Figure 2.
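To make the aggregation of Eqs. (11)-(12) concrete, the sketch below averages the per-pixel distributions inside each superpixel and assigns the argmax class to the whole component; it assumes the two-layer classifier's outputs and a superpixel index map are already available.

```python
import numpy as np

def label_superpixels(d_hat, superpixel_ids, n_superpixels):
    """Average class distributions per superpixel (Eq. 11) and take the argmax (Eq. 12).

    d_hat: (n_pixels, n_classes) outputs of the two-layer classifier,
    superpixel_ids: (n_pixels,) component index k of every pixel.
    """
    n_classes = d_hat.shape[1]
    sums = np.zeros((n_superpixels, n_classes))
    np.add.at(sums, superpixel_ids, d_hat)                  # accumulate per component
    counts = np.bincount(superpixel_ids, minlength=n_superpixels)
    d_k = sums / np.maximum(counts, 1)[:, None]             # (1 / s(k)) * sum of d_hat
    labels_k = d_k.argmax(axis=1)                           # component labels l_k
    return labels_k[superpixel_ids]                         # back-project to pixels
```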
4.2 Conditional Random Fields
The local assignment obtained using superpixels does
not involve a global understanding of the scene. In
this section, we implement a classical CRF model, con-
structed on the superpixels. This is a quite standard ap-
proach for image labeling. Our multi-scale convolutional
network already has the capability of modeling global
relationships within a scene, but might still be prone to
errors, and can benefit from a CRF, to impose consistency
and coherency between labels, at test time.
A common strategy for labeling a scene consists in associating the image with a graph and defining an energy function whose optimal solution corresponds to the desired segmentation [41], [13].
For this purpose, we define a graph G = (V, E) with vertices v ∈ V and edges e ∈ E ⊆ V × V. Each pixel in the image is associated with a vertex, and edges are added between neighboring nodes. An edge e spanning two vertices v_i and v_j is denoted by e_{ij}.
The Conditional Random Field (CRF) energy function is typically composed of a unary term enforcing the variable l to take values close to the predictions d̂ and a pairwise term enforcing regularity or local consistency of l. The CRF energy to minimize is given by

E(l) = Σ_{i∈V} Φ(d̂_i, l_i) + γ Σ_{e_{ij}∈E} Ψ(l_i, l_j).   (13)
We considered as unary terms

Φ(d̂_{i,a}, l_i) = exp(α d̂_{i,a}) 1(l_i ≠ a),   (14)

where d̂_{i,a} corresponds to the probability of class a to be present at a pixel i, computed as in Section 4.1, and 1(·) is an indicator function that equals one if the input is true, and zero otherwise.
The pairwise term consists in

Ψ(l_i, l_j) = exp(−β ‖∇I‖_i) 1(l_i ≠ l_j),   (15)

where ‖∇I‖_i is the ℓ2 norm of the gradient of the image I at a pixel i. Details on the parameters used are given in the experimental section.
The CRF energy (13) is minimized using alpha-expansions [4], [3]. An illustration of the procedure appears in Figure 3.
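The sketch below only evaluates the energy of Eqs. (13)-(15) for a given labeling on a 4-connected pixel grid; the values of α, β, γ are placeholders, and the minimization itself (alpha-expansions, typically via a graph-cut solver) is not shown.

```python
import numpy as np

def crf_energy(labels, d_hat, grad_mag, alpha=1.0, beta=1.0, gamma=1.0):
    """Evaluate E(l) of Eq. (13) on a 4-connected pixel grid.

    labels:   (H, W) integer labeling l
    d_hat:    (H, W, n_classes) class distributions from Section 4.1
    grad_mag: (H, W) l2 norm of the image gradient at each pixel
    """
    H, W, _ = d_hat.shape
    # Unary terms (Eq. 14): exp(alpha * d_hat[i, a]) for every class a != l_i.
    phi = np.exp(alpha * d_hat)
    mask = np.ones_like(phi)
    mask[np.arange(H)[:, None], np.arange(W)[None, :], labels] = 0.0
    unary = np.sum(phi * mask)
    # Pairwise terms (Eq. 15): contrast-sensitive Potts penalty where neighbours disagree.
    diff_h = labels[:, :-1] != labels[:, 1:]
    diff_v = labels[:-1, :] != labels[1:, :]
    pairwise = (np.sum(np.exp(-beta * grad_mag[:, :-1])[diff_h]) +
                np.sum(np.exp(-beta * grad_mag[:-1, :])[diff_v]))
    return unary + gamma * pairwise
```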

References
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, 1998.
Y. Boykov, O. Veksler, and R. Zabih, "Fast approximate energy minimization via graph cuts," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001.
P. Felzenszwalb and D. Huttenlocher, "Efficient graph-based image segmentation," International Journal of Computer Vision, 2004.
P. Arbeláez, M. Maire, C. Fowlkes, and J. Malik, "Contour detection and hierarchical image segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011.
R. Hadsell, S. Chopra, and Y. LeCun, "Dimensionality reduction by learning an invariant mapping," in Proc. CVPR, 2006.