Learning Hierarchical Features
for Scene Labeling
Clément Farabet, Camille Couprie, Laurent Najman, Yann LeCun
Abstract—Scene labeling consists in labeling each pixel in an image with the category of the object it belongs to. We propose a
method that uses a multiscale convolutional network trained from raw pixels to extract dense feature vectors that encode regions of
multiple sizes centered on each pixel. The method alleviates the need for engineered features, and produces a powerful representation
that captures texture, shape and contextual information. We report results using multiple post-processing methods to produce the final
labeling. Among those, we propose a technique to automatically retrieve, from a pool of segmentation components, an optimal set of
components that best explain the scene; these components are arbitrary, e.g. they can be taken from a segmentation tree, or from any
family of over-segmentations. The system yields record accuracies on the Sift Flow Dataset (33 classes) and the Barcelona Dataset
(170 classes) and near-record accuracy on Stanford Background Dataset (8 classes), while being an order of magnitude faster than
competing approaches, producing a 320 × 240 image labeling in less than a second, including feature extraction.
Index Terms—Convolutional networks, deep learning, image segmentation, image classification, scene parsing.
1 INTRODUCTION
IMAGE UNDERSTANDING is a task of primary importance for a wide range of practical applications. One important step towards understanding an image is to perform full-scene labeling, also known as scene parsing, which consists in labeling every pixel in the image
with the category of the object it belongs to. After a
perfect scene parsing, every region and every object is
delineated and tagged. One challenge of scene parsing
is that it combines the traditional problems of detection,
segmentation, and multi-label recognition in a single
process.
There are two questions of primary importance in the
context of scene parsing: how to produce good internal
representations of the visual information, and how to use
contextual information to ensure the self-consistency of
the interpretation.
This paper presents a scene parsing system that relies
on deep learning methods to approach both questions.
The main idea is to use a convolutional network [27]
operating on a large input window to produce label hy-
potheses for each pixel location. The convolutional net is
fed with raw image pixels (after band-pass filtering and
contrast normalization), and trained in supervised mode
from fully-labeled images to produce a category for each
pixel location. Convolutional networks are composed
of multiple stages each of which contains a filter bank
module, a non-linearity, and a spatial pooling module.
With end-to-end training, convolutional networks can
automatically learn hierarchical feature representations.
Unfortunately, labeling each pixel by looking at a small
region around it is difficult. The category of a pixel
may depend on relatively short-range information (e.g.
Clément Farabet, Camille Couprie, and Yann LeCun are with the Courant
Institute of Mathematical Sciences, New York University (New York, NY
10003, USA).
Clément Farabet and Laurent Najman are with the Laboratoire
d’Informatique Gaspard-Monge, Université Paris-Est, Equipe A3SI,
ESIEE Paris (93160 Noisy-le-Grand, France).
E-mails: cfarabet@cs.nyu.edu, ccouprie@cs.nyu.edu,
l.najman@esiee.fr, yann@cs.nyu.edu
the presence of a human face generally indicates the
presence of a human body nearby), but may also depend
on long-range information. For example, identifying a
grey pixel as belonging to a road, a sidewalk, a gray car,
a concrete building, or a cloudy sky requires a wide con-
textual window that shows enough of the surroundings
to make an informed decision. To address this problem,
we propose to use a multi-scale convolutional network,
which can take into account large input windows, while
keeping the number of free parameters to a minimum.
Common approaches to scene parsing first produce
segmentation hypotheses using graph-based methods.
Candidate segments are then encoded using engineered
features. Finally, a conditional random field (or some
other type of graphical model), is trained to produce
labels for each candidate segment, and to ensure that
the labelings are globally consistent.
A striking characteristic of the system proposed here
is that the use of a large contextual window to label
pixels reduces the requirement for sophisticated post-
processing methods that ensure the consistency of the
labeling.
More precisely, the proposed scene parsing architecture is depicted in Figure 1. It relies on
two main components:
1) Multi-scale, convolutional representation: our
multi-scale, dense feature extractor produces a series of
feature vectors for regions of multiple sizes centered
around every pixel in the image, covering a large
context. The multi-scale convolutional net contains
multiple copies of a single network (all sharing the
same weights) that are applied to different scales of a
Laplacian pyramid version of the input image. For each
pixel, the networks collectively encode the information
present in a large contextual window around the given
pixel (184 × 184 pixels in the system described here).
The convolutional network is fed with raw pixels
and trained end to end, thereby alleviating the need
for hand-engineered features. When properly trained,
these features produce a representation that captures
texture, shape and contextual information. While using

a multiscale representation seems natural for full-scene labeling (FSL), it
has rarely been used in the context of feature learning
systems. The multiscale representation that is learned
is sufficiently complete to allow the detection and
recognition of all the objects and regions in the scene.
However, it does not accurately pinpoint the boundaries
of the regions, and requires some post-processing to
yield cleanly delineated predictions.
2) Graph-based classification:
An over-segmentation is constructed from the image,
and is used to group the feature descriptors. Several
over-segmentations are considered, and three techniques
are proposed to produce the final image labeling.
2.a. Superpixels: The image is segmented into disjoint
components, widely over-segmenting the scene. In this
scenario, a pixelwise classifier is trained on the convo-
lutional feature vectors, and a simple vote is done for
each component, to assign a single class per component.
This method is simple and effective, but imposes a fixed
level of segmentation, which can be suboptimal.
2.b. Conditional random field over superpixels: a
conditional random field is defined over a set of super-
pixels. Compared to the previous, simpler method, this
post-processing models joint probabilities at the level
of the scene, and is useful to avoid local aberrations
(e.g. a person in the sky). That kind of approach is
widely used in the computer vision community, and we
show that our learned multiscale feature representation
essentially makes the use of a global random field much
less useful: most scene-level relationships seem to be
already captured by it.
2.c. Multilevel cut with class purity criterion: A
family of segmentations is constructed over the image
to analyze the scene at multiple levels. In the simplest
case, this family might be a segmentation tree; in the
most general case it can be any set of segmentations,
for example a collection of superpixels either produced
using the same algorithm with different parameter
tunings or produced by different algorithms. Each
segmentation component is represented by the set
of feature vectors that fall into it: the component is
encoded by a spatial grid of aggregated feature vectors.
The aggregated feature vector of each grid cell is
computed by a component-wise max pooling of the
feature vectors centered on all the pixels that fall into the
grid cell. This produces a scale-invariant representation
of the segment and its surrounding. A classifier is then
applied to the aggregated feature grid of each node.
This classifier is trained to estimate the histogram of all
object categories present in the component. A subset of
the components is then selected such that they cover
the entire image. These components are selected so
as to minimize the average “impurity” of the class
distribution in a procedure that we name “optimal
cover”. The class “impurity” is defined as the entropy
of the class distribution. The choice of the cover thus
attempts to find a consistent overall segmentation in
which each segment contains pixels belonging to only
one of the learned categories. This simple method allows
us to consider full families of segmentation components,
rather than a unique, predetermined segmentation (e.g.
a single set of superpixels).
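As a rough illustration of the purity criterion, the sketch below computes the entropy-based "impurity" of a component's class histogram and greedily assembles a cover from the purest components. It is only a hedged approximation of the optimal-cover procedure described later in the paper; the data layout and the greedy selection are our own assumptions, written in Python with NumPy.

```python
import numpy as np

def impurity(class_hist, eps=1e-12):
    """Class "impurity" of a component: entropy of its normalized class histogram."""
    p = class_hist / (class_hist.sum() + eps)
    return -np.sum(p * np.log(p + eps))

def greedy_cover(components, n_pixels):
    """Pick a subset of components that covers every pixel, preferring pure ones.

    `components` is a list of (flat_pixel_indices, class_histogram) pairs drawn
    from a segmentation tree or any pool of over-segmentations.  The greedy
    selection below is a simplified stand-in for the paper's optimal cover.
    """
    covered = np.zeros(n_pixels, dtype=bool)
    chosen = []
    for k in sorted(range(len(components)),
                    key=lambda k: impurity(components[k][1])):
        pixels, _ = components[k]
        if not covered[pixels].all():       # component still explains new pixels
            chosen.append(k)
            covered[pixels] = True
        if covered.all():
            break
    return chosen
```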
All the steps in the process have a complexity linear
(or almost linear) in the number of pixels. The bulk of
the computation resides in the convolutional network
feature extractor. The resulting system is very fast,
producing a full parse of a 320 × 240 image in less than
a second on a conventional CPU, and in less than 100ms
using dedicated hardware, opening the door to real-time
applications. Once trained, the system is parameter free,
and requires no adjustment of thresholds or other knobs.
An early version of this work was first published in [7]. This journal version reports more complete experiments, comparisons, and improved results.
2 RELATED WORK
The scene parsing problem has been approached with a
wide variety of methods in recent years. Many methods
rely on MRFs, CRFs, or other types of graphical models
to ensure the consistency of the labeling and to account
for context [19], [39], [15], [25], [32], [44], [30]. Most
methods rely on a pre-segmentation into superpixels
or other segment candidates, and extract features and
categories from individual segments and from various
combinations of neighboring segments. The graphical
model inference pulls out the most consistent set of
segments which covers the image.
[43] proposed a method to aggregate segments in a greedy fashion using a trained scoring function. The
originality of the approach is that the feature vector
of the combination of two segments is computed from
the feature vectors of the individual segments through
a trainable function. Like us, they use “deep learning”
methods to train their feature extractor. But unlike us,
their feature extractor operates on hand-engineered fea-
tures.
One of the main questions in scene parsing is how
to take a wide context into account to make a local
decision. [32] proposed to use the histogram of labels
extracted from a coarse scale as input to the labeler
that looks at finer scales. Our approach is somewhat
simpler: our feature extractor is applied densely to an
image pyramid. The coarse feature maps thereby generated are upsampled to match the resolution of the finest scale.
Hence with three scales, each feature vector has multiple
fields which encode multiple regions of increasing sizes
and decreasing resolutions, centered on the same pixel
location.
Like us, a number of authors have used families of
segmentations or trees to generate candidate segments
by aggregating elementary segments. The approaches of
[39], [30] rely on inference algorithms based on Graph
Cuts to label images using trees of segmentation. Other
strategies using families of segmentations appeared in
[36], [5]. None of the previous strategies for scene la-
beling used a purity criterion on the class distributions.
Combined with the optimal cover strategy, this purity
criterion is general, efficient and could be applied to
solve different problems.
Contrary to the previously cited approaches using
engineered features, our system extracts features densely
from a multiscale pyramid of images using a convolu-
tional network (ConvNet) [27]. These networks can be
fed with raw pixels and can automatically learn low-
level and mid-level features, alleviating the need for

Fig. 1. Diagram of the scene parsing system. The raw input image is transformed through a Laplacian pyramid. Each scale is fed to a 3-stage convolutional network, which produces a set of feature maps. The feature maps of all scales are concatenated, the coarser-scale maps being upsampled to match the size of the finest-scale map. Each feature vector thus represents a large contextual window around each pixel. In parallel, a single segmentation (i.e. superpixels), or a family of segmentations (e.g. a segmentation tree), is computed to exploit the natural contours of the image. The final labeling is produced from the feature vectors and the segmentation(s) using different methods, as presented in Section 4.
hand-engineered features. One of their advantages is the
ability to compute dense features efficiently over large
images. They are best known for their applications to
detection and recognition [47], [14], [35], [21], but they
have also been used for image segmentation, particularly
for biological image segmentation [34], [20], [46].
The only previously published work on using con-
volutional networks for scene parsing is that of [17].
While somewhat preliminary, their work showed that
convolutional networks fed with raw pixels could be
trained to perform scene parsing with decent accuracy.
Unlike [17] however, our system uses a boundary-based
hierarchy of segmentations to align the labels produced
by the network to the boundaries in the image and thus
produces representations that are independent of the size
of the segments through feature pooling. Slightly after
[8], Schulz and Behnke proposed a similar architecture of a multiscale convolutional network for scene parsing [40]. Unlike us, they use pairwise class location filters
to predict the final segmentation, instead of using the
image gradient that we found to be more accurate.
3 MULTISCALE FEATURE EXTRACTION FOR
SCENE PARSING
The model proposed in this paper, depicted in Figure 1,
relies on two complementary image representations. In
the first representation, an image patch is seen as a point in R^P, and we seek to find a transform f : R^P → R^Q that maps each patch into R^Q, a space where it can be classified linearly. This first representation typically
suffers from two main problems when using a classi-
cal convolutional network, where the image is divided
following a grid pattern: (1) the window considered
rarely contains an object that is properly centered and
scaled, and therefore offers a poor observation basis to
predict the class of the underlying object, (2) integrating
a large context involves increasing the grid size, and
therefore the dimensionality P of the input; given a
finite amount of training data, it is then necessary to
enforce some invariance in the function f itself. This is
usually achieved by using pooling/subsampling layers,
which in turn degrades the ability of the model to
precisely locate and delineate objects. In this paper, f
is implemented by a multiscale convolutional network,
which allows integrating large contexts (as large as the
complete scene) into local decisions, yet still remaining
manageable in terms of parameters/dimensionality. This
multiscale model, in which weights are shared across
scales, allows the model to capture long-range interac-
tions, without the penalty of extra parameters to train.
This model is described in Section 3.1.
In the second representation, the image is seen as an
edge-weighted graph, on which one or several over-
segmentations can be constructed. The components are
spatially accurate, and naturally delineate the underly-
ing objects, as this representation conserves pixel-level
precision. Section 4 describes multiple strategies to combine both representations. In particular, we describe in Section 4.3 a method for analyzing a family of segmenta-
tions (at multiple levels). It can be used as a solution to
the first problem exposed above: assuming the capability
of assessing the quality of all the components in this
family of segmentations, a system can automatically
choose its components so as to produce the best set of
predictions.
3.1 Scale-invariant, scene-level feature extraction
Good internal representations are hierarchical. In vision,
pixels are assembled into edglets, edglets into motifs,
motifs into parts, parts into objects, and objects into
scenes. This suggests that recognition architectures for
vision (and for other modalities such as audio and
natural language) should have multiple trainable stages
stacked on top of each other, one for each level in the
feature hierarchy. Convolutional Networks (ConvNets)
provide a simple framework to learn such hierarchies of
features.

Convolutional Networks [26], [27] are trainable archi-
tectures composed of multiple stages. The input and
output of each stage are sets of arrays called feature maps.
For example, if the input is a color image, each feature
map would be a 2D array containing a color channel of
the input image (for an audio input each feature map
would be a 1D array, and for a video or volumetric
image, it would be a 3D array). At the output, each
feature map represents a particular feature extracted at
all locations on the input. Each stage is composed of
three layers: a filter bank layer, a non-linearity layer, and
a feature pooling layer. A typical ConvNet is composed
of one, two or three such 3-layer stages, followed by a
classification module. Because they are trainable, arbi-
trary input modalities can be modeled, beyond natural
images.
Our feature extractor is a three-stage convolutional
network. The first two stages contain a bank of filters
producing multiple feature maps, a point-wise non-
linear mapping and a spatial pooling followed by sub-
sampling of each feature map. The last layer only con-
tains a bank of filters. The filters (convolution kernels)
are subject to training. Each filter is applied to the
input feature maps through a 2D convolution operation,
which detects local features at all locations on the input.
Each filter bank of a convolutional network produces
features that are equivariant under shifts, i.e. if the
input is shifted, the output is also shifted but otherwise
unchanged.
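The shift-equivariance property can be checked numerically: filtering a shifted image gives the shifted response, away from boundary effects. The snippet below is a small self-contained check; the image, kernel, and margins are arbitrary choices of ours.

```python
import numpy as np
from scipy.signal import correlate2d

img = np.random.rand(16, 16)
kernel = np.random.rand(3, 3)

resp = correlate2d(img, kernel, mode='same')               # filter the image
shifted = np.roll(img, 2, axis=1)                          # shift the input by 2 pixels
resp_shifted = correlate2d(shifted, kernel, mode='same')   # filter the shifted image

# Away from the borders, the response of the shifted image equals the shifted response.
assert np.allclose(resp_shifted[:, 4:-4], np.roll(resp, 2, axis=1)[:, 4:-4])
```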
While convolutional networks have been used success-
fully for a number of image labeling problems, image-
level tasks such as full-scene understanding (pixel-wise
labeling, or any dense feature estimation) require the
system to model complex interactions at the scale of
complete images, not simply within a patch. To view
a large contextual window at full resolution, a convolu-
tional network would have to be unmanageably large.
The solution is to use a multiscale approach. Our
multiscale convolutional network overcomes these limi-
tations by extending the concept of spatial weight repli-
cation to the scale space. Given an input image I, a multiscale pyramid of images X_s, s ∈ {1, . . . , N}, is constructed, where X_1 has the size of I. The multiscale pyramid can be a Laplacian pyramid, and is typically pre-processed so that local neighborhoods have zero mean and unit standard deviation. Given a classical convolutional network f_s with parameters θ_s, the multiscale network is obtained by instantiating one network per scale s, and sharing all parameters across scales: θ_s = θ_0, ∀s ∈ {1, . . . , N}.
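The following sketch builds such a pyramid with per-scale local normalization. It assumes a single-channel floating-point image; the number of scales, the Gaussian widths, and the use of SciPy are illustrative choices, not the paper's exact preprocessing.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def local_normalize(x, sigma=4.0, eps=1e-6):
    """Give local neighbourhoods zero mean and unit standard deviation."""
    mean = gaussian_filter(x, sigma)
    x = x - mean
    std = np.sqrt(gaussian_filter(x ** 2, sigma)) + eps
    return x / std

def laplacian_pyramid(img, n_scales=3):
    """Multiscale inputs X_s, s = 1..N, with X_1 the size of the original image."""
    scales, current = [], img.astype(np.float32)
    for _ in range(n_scales):
        blurred = gaussian_filter(current, sigma=1.0)
        scales.append(local_normalize(current - blurred))  # band-pass, then normalize
        current = blurred[::2, ::2]                        # downsample by 2 for next scale
    return scales
```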
We introduce the following convention: banks of im-
ages will be seen as three dimensional arrays in which
the first dimension is the number of independent feature
maps, or images, the second is the height of the maps
and the third is the width. The output state of the L-th stage is denoted H_L.
The maps in the pyramid are computed using a scaling/normalizing function g_s as X_s = g_s(I), for all s ∈ {1, . . . , N}.
For each scale s, the convolutional network f_s can be described as a sequence of linear transforms, interspersed with non-linear symmetric squashing units (typically the tanh function [28]), and pooling/subsampling operators. For a network f_s with L layers, we have

f_s(X_s; θ_s) = W_L H_{L−1},   (1)
where the vector of hidden units at layer l is

H_l = pool(tanh(W_l H_{l−1} + b_l))   (2)

for all l ∈ {1, . . . , L−1}, with b_l a vector of bias parameters, and H_0 = X_s. The matrices W_l are Toeplitz matrices, therefore each hidden unit vector H_l can be expressed as a regular convolution between kernels from W_l and the previous hidden unit vector H_{l−1}, squashed through a tanh, and pooled spatially. More specifically,

H_{lp} = pool(tanh(b_{lp} + Σ_{q∈parents(p)} w_{lpq} ∗ H_{l−1,q})).   (3)
The filters W_l and the biases b_l constitute the trainable parameters of our model, and are collectively denoted θ_s. The function tanh is a point-wise non-linearity, while
pool is a function that considers a neighborhood of
activations, and produces one activation per neighbor-
hood. In all our experiments, we use a max-pooling
operator, which takes the maximum activation within
the neighborhood. Pooling over a small neighborhood
provides built-in invariance to small translations.
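A minimal PyTorch sketch of one such stage, and of the three-stage network f_s of Eq. (1), is given below. The numbers of feature maps and the 7×7 kernels are placeholders, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ConvNetStage(nn.Module):
    """One stage of Eq. (2): H_l = pool(tanh(W_l H_{l-1} + b_l))."""
    def __init__(self, in_maps, out_maps, kernel=7, pool=2):
        super().__init__()
        self.filters = nn.Conv2d(in_maps, out_maps, kernel, padding=kernel // 2)
        self.pool = nn.MaxPool2d(pool)            # max pooling over small neighbourhoods

    def forward(self, h):
        return self.pool(torch.tanh(self.filters(h)))

class ThreeStageConvNet(nn.Module):
    """f_s: two full stages followed by a last stage holding only a filter bank (Eq. 1)."""
    def __init__(self, in_maps=3, maps=(16, 64, 256)):
        super().__init__()
        self.stage1 = ConvNetStage(in_maps, maps[0])
        self.stage2 = ConvNetStage(maps[0], maps[1])
        self.stage3 = nn.Conv2d(maps[1], maps[2], kernel_size=7, padding=3)

    def forward(self, x):
        return self.stage3(self.stage2(self.stage1(x)))
```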
Finally, the outputs of the N networks are upsampled and concatenated so as to produce F, a map of feature vectors of size N times the size of f_1, which can be seen as local patch descriptors and scene-level descriptors:

F = [f_1, u(f_2), . . . , u(f_N)],   (4)

where u is an upsampling function.
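Building on the sketch above, the multiscale feature map F of Eq. (4) can be assembled by applying one shared network to every scale, upsampling the coarser outputs, and concatenating along the feature dimension. Bilinear interpolation stands in for the unspecified upsampling function u, and each pyramid level is assumed to be a (1, channels, height, width) tensor.

```python
import torch
import torch.nn.functional as nnF

def multiscale_features(pyramid, net):
    """Eq. (4): F = [f_1, u(f_2), ..., u(f_N)] with a single shared network `net`."""
    outputs = [net(x) for x in pyramid]          # same weights applied at every scale
    h, w = outputs[0].shape[-2:]                 # spatial size of the finest-scale map
    upsampled = [outputs[0]] + [
        nnF.interpolate(o, size=(h, w), mode='bilinear', align_corners=False)
        for o in outputs[1:]
    ]
    return torch.cat(upsampled, dim=1)           # concatenate along the feature dimension
```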
As mentioned above, weights are shared between networks f_s. Intuitively, imposing complete weight sharing across scales is a natural way of forcing the network to learn scale-invariant features while reducing the chances of over-fitting. The more scales used to jointly train the models f_s(θ_s), the better the representation becomes for all scales. Because image content is, in principle, scale invariant, using the same function to extract features at each scale is justified.
3.2 Learning discriminative scale-invariant features
As described in Section 3.1, feature vectors in F are obtained by concatenating the outputs of multiple networks f_s, each taking as input a different image in a multiscale pyramid.
Ideally a linear classifier should produce the correct categorization for all pixel locations i, from the feature vectors F_i. We train the parameters θ_s to achieve this goal, using the multiclass cross entropy loss function. Let ĉ_i be the normalized prediction vector from the linear classifier for pixel i. We compute normalized predicted probability distributions over classes ĉ_{i,a} using the softmax function, i.e.

ĉ_{i,a} = exp(w_a^T F_i) / Σ_{b∈classes} exp(w_b^T F_i),   (5)

where w is a temporary weight matrix only used to learn the features. The cross entropy between the predicted class distribution ĉ and the target class distribution c penalizes their deviation and is measured by

Fig. 2. First labeling strategy from the features: using superpixels as described in Section 4.1.
L_cat = − Σ_{i∈pixels} Σ_{a∈classes} c_{i,a} ln(ĉ_{i,a}).   (6)
The true target probability c_{i,a} of class a to be present at location i can either be a distribution of classes at location i in a given neighborhood, or a hard target vector: c_{i,a} = 1 if pixel i is labeled a, and 0 otherwise. For training maximally discriminative features, we use hard target vectors in this first stage.
Once the parameters θ_s are trained, the classifier in Eq. (5) is discarded, and the feature vectors F_i are used with different strategies, as described in Section 4.
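A compact NumPy sketch of this training criterion (Eqs. (5)-(7)) is shown below; the array names and the temporary weight matrix w are illustrative, and in practice the gradient of the loss would be backpropagated into the parameters θ_s.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)        # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def linear_classifier_loss(feats, labels, w, n_classes):
    """Per-pixel softmax over w^T F_i (Eq. 5) and cross entropy with hard targets (Eq. 6).

    feats: (n_pixels, feat_dim) feature vectors F_i, w: (feat_dim, n_classes).
    """
    c_hat = softmax(feats @ w)                   # predicted class distributions
    c = np.eye(n_classes)[labels]                # hard target vectors c_{i,a}
    loss = -np.sum(c * np.log(c_hat + 1e-12))    # L_cat
    pred = c_hat.argmax(axis=1)                  # Eq. (7): per-pixel argmax labeling
    return loss, pred
```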
4 SCENE LABELING STRATEGIES
The simplest strategy for labeling the scene is to use the linear classifier described in Section 3.2, and assign each pixel the argmax of the prediction at its location. More specifically, for each pixel i,

l_i = arg max_{a∈classes} ĉ_{i,a}.   (7)
The resulting labeling l, although fairly accurate, is not visually satisfying, as it lacks spatial consistency and precise delineation of objects. In this section, we explore
three strategies to produce spatially more appealing
labelings.
4.1 Superpixels
Predicting the class of each pixel independently from
its neighbors yields noisy predictions. A simple cleanup
can be obtained by forcing local regions of the same color intensity to be assigned a single label.
As in [13], [16], we compute superpixels, following the method proposed by [11], to produce an over-
segmentation of the image. We then classify each location
of the image densely, and aggregate these predictions in
each superpixel, by computing the average class distri-
bution within the superpixel.
For this method, the pixelwise distributions d̂_k at superpixel k are predicted from the feature vectors F_i using a two-layer neural network:

y_i = W_2 tanh(W_1 F_i + b_1),   (8)

d̂_{i,a} = exp(y_{i,a}) / Σ_{b∈classes} exp(y_{i,b}),   (9)

L_cat = − Σ_{i∈pixels} Σ_{a∈classes} d_{i,a} ln(d̂_{i,a}),   (10)

d̂_{k,a} = (1 / s(k)) Σ_{i∈k} d̂_{i,a},   (11)

with d_i the groundtruth distribution at location i, and s(k) the surface of component k. Matrices W_1 and W_2 are the trainable parameters of the classifier. Using a two-layer neural network, as opposed to the simple linear classifier used in Section 3.2, allows the system to capture non-linear relationships between the features at different scales. In this case, the final labeling for each component k is given by

l_k = arg max_{a∈classes} d̂_{k,a}.   (12)

The pipeline is depicted in Figure 2.
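To make the aggregation of Eqs. (11)-(12) concrete, the sketch below averages the per-pixel distributions inside each superpixel and assigns the argmax class to the whole component; it assumes the two-layer classifier's outputs and a superpixel index map are already available.

```python
import numpy as np

def label_superpixels(d_hat, superpixel_ids, n_superpixels):
    """Average class distributions per superpixel (Eq. 11) and take the argmax (Eq. 12).

    d_hat: (n_pixels, n_classes) outputs of the two-layer classifier,
    superpixel_ids: (n_pixels,) component index k of every pixel.
    """
    n_classes = d_hat.shape[1]
    sums = np.zeros((n_superpixels, n_classes))
    np.add.at(sums, superpixel_ids, d_hat)                  # accumulate per component
    counts = np.bincount(superpixel_ids, minlength=n_superpixels)
    d_k = sums / np.maximum(counts, 1)[:, None]             # (1 / s(k)) * sum of d_hat
    labels_k = d_k.argmax(axis=1)                           # component labels l_k
    return labels_k[superpixel_ids]                         # back-project to pixels
```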
4.2 Conditional Random Fields
The local assignment obtained using superpixels does
not involve a global understanding of the scene. In
this section, we implement a classical CRF model, con-
structed on the superpixels. This is a quite standard ap-
proach for image labeling. Our multi-scale convolutional
network already has the capability of modeling global
relationships within a scene, but might still be prone to
errors, and can benefit from a CRF, to impose consistency
and coherency between labels, at test time.
A common strategy for labeling a scene consists in associating the image with a graph and defining an energy function whose optimal solution corresponds to the desired segmentation [41], [13].
For this purpose, we define a graph G = (V, E) with vertices v ∈ V and edges e ∈ E ⊆ V × V. Each pixel in the image is associated with a vertex, and edges are added between neighboring nodes. An edge e spanning two vertices v_i and v_j is denoted by e_{ij}.
The Conditional Random Field (CRF) energy function is typically composed of a unary term enforcing the variable l to take values close to the predictions d̂ and a pairwise term enforcing regularity or local consistency of l. The CRF energy to minimize is given by

E(l) = Σ_{i∈V} Φ(d̂_i, l_i) + γ Σ_{e_{ij}∈E} Ψ(l_i, l_j).   (13)
We considered as unary terms

Φ(d̂_{i,a}, l_i) = exp(α d̂_{i,a}) 1(l_i ≠ a),   (14)

where d̂_{i,a} corresponds to the probability of class a to be present at a pixel i, computed as in Section 4.1, and 1(·) is an indicator function that equals one if the input is true, and zero otherwise.
The pairwise term consists in

Ψ(l_i, l_j) = exp(−β ‖∇I‖_i) 1(l_i ≠ l_j),   (15)

where ‖∇I‖_i is the ℓ2 norm of the gradient of the image I at a pixel i. Details on the parameters used are given in the experimental section.
The CRF energy (13) is minimized using alpha-expansions [4], [3]. An illustration of the procedure appears in Figure 3.
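The sketch below only evaluates the energy of Eqs. (13)-(15) for a given labeling on a 4-connected pixel grid; the values of α, β, γ are placeholders, and the minimization itself (alpha-expansions, typically via a graph-cut solver) is not shown.

```python
import numpy as np

def crf_energy(labels, d_hat, grad_mag, alpha=1.0, beta=1.0, gamma=1.0):
    """Evaluate E(l) of Eq. (13) on a 4-connected pixel grid.

    labels:   (H, W) integer labeling l
    d_hat:    (H, W, n_classes) class distributions from Section 4.1
    grad_mag: (H, W) l2 norm of the image gradient at each pixel
    """
    H, W, _ = d_hat.shape
    # Unary terms (Eq. 14): exp(alpha * d_hat[i, a]) for every class a != l_i.
    phi = np.exp(alpha * d_hat)
    mask = np.ones_like(phi)
    mask[np.arange(H)[:, None], np.arange(W)[None, :], labels] = 0.0
    unary = np.sum(phi * mask)
    # Pairwise terms (Eq. 15): contrast-sensitive Potts penalty where neighbours disagree.
    diff_h = labels[:, :-1] != labels[:, 1:]
    diff_v = labels[:-1, :] != labels[1:, :]
    pairwise = (np.sum(np.exp(-beta * grad_mag[:, :-1])[diff_h]) +
                np.sum(np.exp(-beta * grad_mag[:-1, :])[diff_v]))
    return unary + gamma * pairwise
```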

References
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, 1998.
Y. Boykov, O. Veksler, and R. Zabih, "Fast approximate energy minimization via graph cuts," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001.
P. Felzenszwalb and D. Huttenlocher, "Efficient graph-based image segmentation," International Journal of Computer Vision, 2004.
P. Arbeláez, M. Maire, C. Fowlkes, and J. Malik, "Contour detection and hierarchical image segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011.
R. Hadsell, S. Chopra, and Y. LeCun, "Dimensionality reduction by learning an invariant mapping," in Proc. CVPR, 2006.