
What is the Best Multi-Stage Architecture for Object Recognition?
Kevin Jarrett, Koray Kavukcuoglu, Marc’Aurelio Ranzato and Yann LeCun
The Courant Institute of Mathematical Sciences
New York University, 715 Broadway, New York, NY 10003, USA
koray@cs.nyu.edu
Abstract
In many recent object recognition systems, feature ex-
traction stages are generally composed of a filter bank, a
non-linear transformation, and some sort of feature pooling
layer. Most systems use only one stage of feature extrac-
tion in which the filters are hard-wired, or two stages where
the filters in one or both stages are learned in supervised
or unsupervised mode. This paper addresses three ques-
tions: 1. How do the non-linearities that follow the filter
banks influence the recognition accuracy? 2. Does learn-
ing the filter banks in an unsupervised or supervised man-
ner improve the performance over random filters or hard-
wired filters? 3. Is there any advantage to using an ar-
chitecture with two stages of feature extraction, rather than
one? We show that using non-linearities that include recti-
fication and local contrast normalization is the single most
important ingredient for good accuracy on object recogni-
tion benchmarks. We show that two stages of feature ex-
traction yield better accuracy than one. Most surprisingly,
we show that a two-stage system with random filters can
yield almost 63% recognition rate on Caltech-101, provided
that the proper non-linearities and pooling layers are used.
Finally, we show that with supervised refinement, the sys-
tem achieves state-of-the-art performance on NORB dataset
(5.6%) and unsupervised pre-training followed by super-
vised refinement produces good accuracy on Caltech-101
(> 65%), and the lowest known error rate on the undis-
torted, unprocessed MNIST dataset (0.53%).
1. Introduction
Over the last few years, considerable efforts have been
devoted to designing appropriate feature descriptors for ob-
ject recognition. Many recent proposals use dense features
extracted on regularly-spaced patches over the input image.
The vast majority of these systems use a feature extrac-
tion process composed of a filter bank (generally based on
oriented edge detectors), a non-linear operation (quantiza-
tion, winner-take-all, sparsification, normalization, and/or
point-wise saturation), and a pooling operation that com-
bines nearby values in real space or feature space through
a max, average, or histogramming operator. For example,
the SIFT operator applies oriented edge filters to a small
patch and determines the dominant orientation through a
winner-take-all operation. Finally, the resulting sparse vec-
tors are added (pooled) over a larger patch to form local ori-
entation histograms. Several recognition architectures use a
single stage of such features followed by a supervised clas-
sifier. Particular embodiments of the single-stage systems
use SIFT features [19, 13], HoG [6], Geometric Blur [5],
and models inspired by the architecture of the mammalian
primary visual cortex [24], to mention a few. Other models
use two or more successive stages of such feature extractors,
followed by a supervised classifier. This includes convolu-
tional networks globally trained in purely supervised mode
with gradient descent [10], convolutional networks trained
in supervised mode with an auxiliary task [3], or trained
in purely unsupervised mode [25, 11, 18]. Multi-stage sys-
tems also include HMAX-type models [28, 22] in which the
first layer is hardwired with Gabor filters, and the second
layer is trained in unsupervised mode by storing randomly-
picked output configurations from the first stage into filters
of the second stage. All of these models essentially differ
by whether they have one or two (or more) feature extrac-
tion stages, by the type of non-linearity used after the filter
banks, the method used to pick the filters (hard-wired, un-
supervised, supervised), and the top-level classifier (linear
or more sophisticated).
This paper addresses three questions: 1. How do the non-
linearities that follow the filter banks influence the recogni-
tion accuracy? 2. Does learning the filter banks in an un-
supervised or supervised manner improve the performance
over hard-wired filters or even random filters? 3. Is there
any advantage to using an architecture with two successive
stages of feature extraction, rather than with a single stage?
To address these questions, we experimented with various
combinations of architectures (with 1 or 2 stages of fea-
ture extraction), non-linearities, filter types, filter learning
methods (random, unsupervised and supervised). We use
a recently-proposed unsupervised feature learning method
called Predictive Sparse Decomposition (PSD), based on

an encoder-decoder architecture with sparsity constraints
on the feature vector [12]. Results are presented on the
well-known Caltech-101 dataset [7], on the NORB object
dataset [15], and on the MNIST dataset of handwritten dig-
its [14].
At first glance, one may think that training a complete
system in a purely supervised manner (using gradient de-
scent) is bound to fail on datasets with a small number of
labeled samples such as Caltech-101, because the number of
parameters greatly outstrips the number of samples. One
may also think that the filters need to be carefully hand-
picked (or trained) to produce good performance, and that
the details of the non-linearity play a somewhat secondary
role. These intuitions, as it turns out, are wrong.
1.1. Modules for dense feature extraction
A common choice for the filter bank of the first stage is
Gabor Wavelets [28, 22, 24]. Other proposals use simple
oriented edge detection filters such as gradient operators,
including SIFT [19], and HoG [6]. Another set of meth-
ods learn the filters by adapting them to the statistics of the
input data with unsupervised learning [25, 11, 18]. When
trained on natural images, these filters are Gabor-like edge
detectors. The advantage of learning methods is that they
provide a way to learn the filters in the subsequent stages
of the feature hierarchy. While prior knowledge about im-
age statistics points to the usefulness of oriented edge de-
tectors at the first stage, there is no similar prior knowl-
edge that would allow one to design sensible filters for the sec-
ond stage in the hierarchy. Hence the second stage must
be learned. A number of methods have been proposed to
learn filters in a multi-stage vision system. The simplest
method, which is a kind of patch memorization, is to set
the filters to randomly-picked configurations of outputs of
the previous stage [28, 22]. One of the oldest methods is
to simply learn the filters in a supervised fashion using gra-
dient descent [14, 10, 3]. The main issue with the purely
supervised global training approach is that the number of
parameters to be adjusted is very large, perhaps too large
relative to the available number of training samples for most
applications. Finally, one can train the filters in an unsuper-
vised fashion by following the so-called “deep belief net-
work” strategy [8, 4, 26, 9, 25, 17]: the filters are trained
so that representations at one stage can be reconstructed
from the representation at the next stage under sparsity con-
straints [25, 11] or using the so-called contrastive diver-
gence method [18]. The main problem with the unsuper-
vised approach is that the filters are learned independently
of the task, although a few authors have proposed methods
that combine unsupervised and supervised criteria to allevi-
ate the problem [21, 27, 4].
The second ingredient of a feature extraction system
is the non-linearity. Convolutional networks use a sim-
ple point-wise sigmoid function after the filter banks [14],
while models that are strongly inspired by biology have
included rectifying non-linearities, such as positive part,
absolute value, or squaring functions [24], often followed
by a local contrast normalization [24], which is inspired
by divisive normalization models [20]. SIFT uses a recti-
fication followed by a winner-take-all operation over orien-
tation, which is an extreme form of normalization. The
last step is the pooling layer that can be applied over
space [14, 13, 25, 3], over scale and space [28, 22, 24], or
over similar feature types and space [11]. This layer builds
robustness to small distortions by computing an average or
a max of the filter responses within the pool.
The accuracy of single-stage systems on the Caltech-101
dataset, after training on 30 labeled samples per category,
varies with the details of the architecture and the filters.
SIFT-based systems yield accuracies around 50% when fed
to linear classifiers [11], and around 65% when using more
sophisticated classifiers such as the Pyramid Match Ker-
nel SVM (PMK-SVM) [13, 31, 11]. The V1-like model
of Pinto et al. yields around 60% with a linear classifier fol-
lowing PCA [24]. These methods are similar in the fact that
they use hand-crafted oriented edge filters.
In recent years, a few authors have experimented with
filter-learning methods on Caltech-101. Kavukcuoglu et
al. [11] report recognition rates similar to SIFT using a
single-stage feature extractor fed to either a linear classi-
fier or a PMK-SVM. Several authors have proposed sys-
tems with two stages of learned feature extractors, each
of which comprises filter banks, non-linearities, and pool-
ing. This includes convolutional networks using supervised
training [10] and unsupervised training [25] yielding recog-
nition rates in the mid 50’s, and supervised training us-
ing auxiliary “pseudo-tasks” to regularize the system [3]
yielding 67.2% recognition rate. HMAX-type architectures
have yielded rates in the mid-40’s to mid-50’s [28, 22],
and stacked Restricted Boltzmann Machines [17, 18] have
yielded 65.4% with a PMK-SVM classifier on top. While
the best results on Caltech-101 have been obtained by com-
bining a large number of different feature families [29], the
present study concerns systems with a single feature family,
hence results will be compared with other work in which a
single feature family is used. Better absolute numbers can
be obtained by combining the features presented here with
others, as described in [29].
2. Model Architecture
This section describes how to build a hierarchical feature
extraction and classification system with fast (feed-forward)
processing. The hierarchy stacks one or several feature ex-
traction stages, each of which consists of a filter bank layer,
non-linear transformation layers, and a pooling layer that
combines filter responses over local neighborhoods using
an average or max operation, thereby achieving invariance
to small distortions.

Filter Bank Layer - F_CSG: the input of a filter bank layer is a 3D array with n1 2D feature maps of size n2 × n3. Each component is denoted x_ijk, and each feature map is denoted x_i. The output is also a 3D array y, composed of m1 feature maps of size m2 × m3. A filter in the filter bank k_ij has size l1 × l2 and connects input feature map x_i to output feature map y_j. The module computes:

y_j = g_j \tanh\left( \sum_i k_{ij} * x_i \right)   (1)

where tanh is the hyperbolic tangent non-linearity, * is the 2D discrete convolution operator, and g_j is a trainable scalar coefficient. Taking the border effects into account, we have m2 = n2 - l1 + 1 and m3 = n3 - l2 + 1. This layer is denoted by F_CSG because it is composed of a set of convolution filters (C), a sigmoid/tanh non-linearity (S), and gain coefficients (G). In the following, superscripts are used to denote the size of the filters. For instance, a filter bank layer with 64 filters of size 9×9 is denoted 64F_CSG^{9×9}.
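To make the layer definition concrete, the following NumPy/SciPy sketch implements eq. (1) as written above: a bank of valid 2D convolutions summed over input maps, followed by tanh and a trainable per-output-map gain. Function and variable names are illustrative and not taken from the authors' code.

import numpy as np
from scipy.signal import convolve2d

def f_csg(x, k, g):
    """x: (n1, n2, n3) input maps; k: (m1, n1, l1, l2) filters; g: (m1,) gains."""
    m1, n1, l1, l2 = k.shape
    n2, n3 = x.shape[1], x.shape[2]
    m2, m3 = n2 - l1 + 1, n3 - l2 + 1          # 'valid' convolution output size
    y = np.zeros((m1, m2, m3))
    for j in range(m1):
        s = np.zeros((m2, m3))
        for i in range(n1):
            s += convolve2d(x[i], k[j, i], mode="valid")
        y[j] = g[j] * np.tanh(s)               # y_j = g_j * tanh(sum_i k_ij * x_i)
    return y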
Rectification Layer - R_abs: This module simply applies the absolute value function to all the components of its input: y_ijk = |x_ijk|. Several rectifying non-linearities were tried, including the positive part, and produced similar results.
Local Contrast Normalization Layer - N: This module performs local subtractive and divisive normalizations, enforcing a sort of local competition between adjacent features in a feature map, and between features at the same spatial location in different feature maps. The subtractive normalization operation for a given site x_ijk computes v_{ijk} = x_{ijk} - \sum_{ipq} w_{pq} \, x_{i,j+p,k+q}, where w_pq is a Gaussian weighting window (of size 9×9 in our experiments) normalized so that \sum_{ipq} w_{pq} = 1. The divisive normalization computes y_{ijk} = v_{ijk} / \max(c, \sigma_{jk}), where \sigma_{jk} = (\sum_{ipq} w_{pq} \, v^2_{i,j+p,k+q})^{1/2}. For each sample, the constant c is set to mean(\sigma_{jk}) in the experiments. The denominator is the weighted standard deviation of all features over a spatial neighborhood. The local contrast normalization layer is inspired by computational neuroscience models [24, 20].
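The sketch below implements the subtractive and divisive steps as defined above, with a 9×9 Gaussian window shared across feature maps and c set to the mean of sigma_jk. The Gaussian width and the 'same' border handling are assumed details that the text does not specify; this is an illustration of the definitions, not the authors' implementation.

import numpy as np
from scipy.signal import convolve2d

def gaussian_window(size=9, sigma=2.0):
    ax = np.arange(size) - size // 2
    w = np.exp(-(ax[:, None]**2 + ax[None, :]**2) / (2 * sigma**2))
    return w / w.sum()                          # window sums to 1

def local_contrast_norm(x, size=9):
    """x: (n1, h, w) feature maps."""
    n1 = x.shape[0]
    w = gaussian_window(size) / n1              # split weight across maps so the total sums to 1
    mean = sum(convolve2d(x[i], w, mode="same") for i in range(n1))
    v = x - mean[None, :, :]                    # subtractive normalization
    var = sum(convolve2d(v[i]**2, w, mode="same") for i in range(n1))
    sigma = np.sqrt(var)                        # weighted std over the neighborhood
    c = sigma.mean()                            # per-sample constant c = mean(sigma_jk)
    return v / np.maximum(c, sigma)[None, :, :] # divisive normalization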
Average Pooling and Subsampling Layer - P_A: The purpose of this layer is to build robustness to small distortions, playing the same role as the complex cells in models of visual perception. Each output value is y_{ijk} = \sum_{pq} w_{pq} \, x_{i,j+p,k+q}, where w_pq is a uniform weighting window ("boxcar filter"). Each output feature map is then subsampled spatially by a factor S horizontally and vertically. In this work, we do not consider pooling over feature types, but only over the spatial dimensions. Therefore, the numbers of input and output feature maps are identical, while the spatial resolution is decreased. Disregarding the border effects in the boxcar averaging, the spatial resolution is decreased by the down-sampling ratio S in both directions, denoted by a superscript, so that an average pooling layer with 4×4 down-sampling is denoted P_A^{4×4}.

Figure 1. An example of a feature extraction stage of the type F_CSG - R_abs - N - P_A. An input image (or a feature map) is passed through a non-linear filter bank, followed by rectification, local contrast normalization and spatial pooling/sub-sampling.
Max-Pooling and Subsampling Layer - P_M: building local invariance to shift can be performed with any symmetric pooling operation. The max-pooling module is similar to the average pooling, except that the average operation is replaced by a max operation. In our experiments, the pooling windows were non-overlapping. A max-pooling layer with 4×4 down-sampling is denoted P_M^{4×4}.
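Both pooling layers can be sketched with a single helper that averages (P_A) or takes the max (P_M) over a boxcar window and then subsamples by the stride S. Window and stride values such as 10×10/5×5 come from the experiments section; everything else here is an illustrative choice.

import numpy as np

def pool(x, win, stride, mode="avg"):
    """x: (n1, h, w) feature maps -> (n1, h_out, w_out)."""
    n1, h, w = x.shape
    h_out = (h - win) // stride + 1
    w_out = (w - win) // stride + 1
    y = np.zeros((n1, h_out, w_out))
    for a in range(h_out):
        for b in range(w_out):
            patch = x[:, a*stride:a*stride+win, b*stride:b*stride+win]
            y[:, a, b] = patch.mean(axis=(1, 2)) if mode == "avg" else patch.max(axis=(1, 2))
    return y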
2.1. Combining Modules into a Hierarchy
Different architectures can be produced by cascading the
above-mentioned modules in various ways. An architec-
ture is composed of one or two stages of feature extraction,
each of which is formed by cascading a filtering layer with
different combinations of rectification, normalization, and
pooling. Recognition architectures are composed of one or
two such stages, followed by a classifier, generally a multi-
nomial logistic regression.
F_CSG - P_A: This is the basic building block of traditional convolutional networks, alternating tanh-squashed filter banks with average down-sampling layers [14, 10]. A complete convolutional network would have several sequences of F_CSG - P_A followed by a linear classifier.
F_CSG - R_abs - P_A: The tanh-squashed filter bank is followed by an absolute value non-linearity, and by an average down-sampling layer.
F_CSG - R_abs - N - P_A: The tanh-squashed filter bank is followed by an absolute value non-linearity, by a local contrast normalization layer and by an average down-sampling layer. A composition of this type is sketched below.
F_CSG - P_M: This is also a typical building block of convolutional networks, as well as the basis of the HMAX and other architectures [28, 25], which alternate tanh-squashed filter banks with max-pooling layers.
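Putting the previous sketches together, a stage of the type F_CSG - R_abs - N - P_A (Figure 1) amounts to the following composition. This is illustrative glue code reusing the helper functions sketched above, not the authors' pipeline; the 10×10/5×5 pooling values are those used later in the Caltech-101 experiments.

import numpy as np

def stage_fcsg_rabs_n_pa(x, filters, gains, pool_win=10, pool_stride=5):
    y = f_csg(x, filters, gains)        # filter bank + tanh + gain
    y = np.abs(y)                       # R_abs rectification
    y = local_contrast_norm(y)          # N layer
    return pool(y, pool_win, pool_stride, mode="avg")   # P_A pooling

# e.g. a 64F_CSG^{9x9} first stage on a 1x143x143 gray-scale input:
# filters = np.random.randn(64, 1, 9, 9); gains = np.ones(64)
# out = stage_fcsg_rabs_n_pa(np.random.randn(1, 143, 143), filters, gains)  # -> (64, 26, 26)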
3. Training Protocol
Given a particular architecture, a number of training pro-
tocols have been considered and tested. Each protocol is
identified by a letter R, U, R+, or U+. A single letter (e.g. R) indicates an architecture with a single stage of feature extraction, followed by a classifier, while a double letter (e.g. RR) indicates an architecture with two stages of feature extraction followed by a classifier:
Random Features and Supervised Classifier - R and
RR: The filters in the feature extraction stages are set to
random values and kept fixed (no feature learning takes
place), and the classifier stage is trained in supervised mode.

Unsupervised Features, Supervised Classifier - U and
UU. The filters of the feature extraction stages are trained
using the unsupervised PSD algorithm, described in sec-
tion 3.1, and kept fixed. The classifier stage is trained in
supervised mode.
Random Features, Global Supervised Refinement - R+ and R+R+. The filters in the feature extractor stages are initialized with random values, and the entire system (feature stages + classifier) is trained in supervised mode by gradient descent. The gradients are computed using back-propagation, and all the filters are adjusted by stochastic online updates. This is identical to the usual method for training supervised convolutional networks.
Unsupervised Features, Global Supervised Refinement - U+ and U+U+. The filters in the feature extractor stages are initialized with the PSD unsupervised learning algorithm, and the entire system (feature stages + classifier) is then trained (refined) in supervised mode by gradient descent. The system is trained the same way as random features with global refinement, using online stochastic updates. This is reminiscent of the "deep belief network" strategy in which the stages are first trained in unsupervised mode one after the other, and then globally refined using supervised learning [8, 4, 26].
For instance, a traditional convolutional network with a single stage initialized at random [14] would be denoted by an architecture motif like "F_CSG - P_A", and the training protocol would be denoted by R+. The stages of a convolutional network with max-pooling would be denoted by "F_CSG - P_M". A system with two such stages trained in unsupervised mode, and the classifier (only) trained in supervised mode, as in [25], is denoted UU.
3.1. Unsupervised Training of Filter Banks using
Predictive Sparse Decomposition
In order to learn the filter coefficients (g, k) in the fil-
ter bank layers (see eq. 1), an unsupervised learning al-
gorithm is required. We used the Predictive Sparse De-
composition algorithm of [12], which has the following
characteristics: 1. it produces efficient, feed-forward fil-
ter banks that include a point-wise non-linearity; 2. the
training procedure is deterministic (no sampling required,
as with Restricted Boltzmann Machines); 3. it learns to pro-
duce high-dimensional sparse features, which are suitable
for subsequent pooling, and which enhance class discrim-
inability. Although the filter banks are eventually applied
to entire images, the PSD algorithm trains them on individ-
ual patches (or stacks of patches from multiple input feature
maps) whose size is equal to the size of the filters. The start-
ing point of PSD is the well-known sparse coding algorithm
proposed by Olshausen and Field [23] which, unfortunately,
does not produce direct filters, but “reverse” filters (or dic-
tionary elements). Inputs are approximated as a sparse lin-
ear combination of these dictionary elements. The coef-
ficients constitute the feature representation. The method
learns the optimal dictionary that can be used to reconstruct
a set of training samples under sparsity constraints on the
feature vector. For a given input X (a vectorized patch or
stack of patches), and a matrix W whose columns are the
dictionary elements, the feature vector Z^* is obtained by
minimizing the following energy function:

E_{OF}(X, Z, W) = \|X - WZ\|_2^2 + \lambda \|Z\|_1   (2)

Z^* = \arg\min_Z E_{OF}(X, Z, W)   (3)
where λ is a sparsity hyper-parameter. Given a set of training samples X^i, i = 1 ... P, learning proceeds by minimizing the loss L_{OF}(W) = \frac{1}{P} \sum_{i=1}^{P} \min_Z E_{OF}(X^i, Z, W)
using stochastic gradient descent or a similar procedure.
After learning, for any input X, one needs to run a rather expensive optimization algorithm to find Z^* (the so-called "basis pursuit" problem, which is convex, but non-quadratic [16, 2]). To alleviate the problem, the PSD method [12] trains a simple (feed-forward) regressor (or encoder) to approximate the sparse solution Z^* for all X in the training set. The regressor C(X, K) takes the form of eq. 1 on a patch the size of the filters (K collectively denotes all the filter coefficients). During training, the feature vector Z^* is obtained by minimizing the energy function E_{PSD}(X, Z, W, K), defined as follows:

E_{PSD} = \|X - WZ\|_2^2 + \lambda \|Z\|_1 + \|Z - C(X, K)\|_2^2   (4)

Z^* = \arg\min_Z E_{PSD}(X, Z, W, K)   (5)
As with Olshausen and Field [23], learning proceeds by minimizing the loss L_{PSD}(W, K) = \frac{1}{P} \sum_{i=1}^{P} \min_Z E_{PSD}(X^i, Z, W, K). The learning procedure simultaneously optimizes W (dictionary) and K (filters). Once training is complete, the feature vector for a given input is simply obtained with Z^* = C(X, K), hence the process is extremely fast (feed-forward).
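As a rough illustration of the procedure, the sketch below performs one PSD training step on a single vectorized patch: it approximately solves eq. (5) with a few ISTA iterations on eq. (4), then takes gradient steps on the dictionary W and on the encoder parameters. The encoder is written as a gain vector times tanh(KX), a simplified stand-in for eq. (1) on one patch; the step size, the number of inner iterations, and the variable names are assumptions, not the authors' settings.

import numpy as np

def soft_threshold(u, t):
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def psd_step(X, W, K, D, lam=0.5, eta=0.01, n_ista=50):
    """X: (d,) patch; W: (d, m) dictionary; K: (m, d) encoder filters; D: (m,) encoder gains."""
    C = D * np.tanh(K @ X)                     # encoder prediction C(X, K)
    Z = C.copy()
    L = np.linalg.norm(W, 2) ** 2 + 1.0        # step-size bound for the inner problem
    for _ in range(n_ista):                    # ISTA iterations on eq. (4) w.r.t. Z
        grad = W.T @ (W @ Z - X) + (Z - C)
        Z = soft_threshold(Z - grad / L, lam / L)
    # gradient steps on the dictionary W and the encoder parameters (K, D)
    W -= eta * np.outer(W @ Z - X, Z)
    err = C - Z
    D -= eta * err * np.tanh(K @ X)
    K -= eta * np.outer(err * D * (1.0 - np.tanh(K @ X) ** 2), X)
    return W, K, D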
4. Experiments
In this section, various architectures and training proto-
cols are compared on the Caltech 101 [7], MNIST [1] and
NORB [15] datasets. Our purpose is to determine whether
two stages are better than one stage, which non-linearities
are preferable, and which training protocol makes a differ-
ence.
Images from the Caltech-101 dataset were pre-processed
with a procedure similar to [24]. The steps are: 1) converting to gray-scale (no color) and resizing to 151 × 151 pixels; 2) subtracting the image mean and dividing by the image standard deviation; 3) applying subtractive/divisive normalization (an N layer with c = 1); 4) zero-padding the shorter side to 143 pixels.

Single Stage System: [64.F_CSG^{9×9} - R/N/P^{5×5}] - log reg

          R_abs-N-P_A   R_abs-P_A      N-P_M          N-P_A    P_A
  U+      54.2%         50.0%          44.3%          18.5%    14.5%
  R+      54.8%         47.0%          38.0%          16.3%    14.3%
  U       52.2%         43.3% (±1.6)   44.0%          17.2%    13.4%
  R       53.3%         31.7%          32.1%          15.3%    12.1% (±2.2)
  G       52.3%

Two Stage System: [64.F_CSG^{9×9} - R/N/P^{5×5}] - [256.F_CSG^{9×9} - R/N/P^{4×4}] - log reg

          R_abs-N-P_A   R_abs-P_A      N-P_M          N-P_A    P_A
  U+U+    65.5%         60.5%          61.0%          34.0%    32.0%
  R+R+    64.7%         59.5%          60.0%          31.0%    29.7%
  UU      63.7%         46.7%          56.0%          23.1%    9.1%
  RR      62.9%         33.7% (±1.5)   37.6% (±1.9)   19.6%    8.8%
  GT      55.8%

Single Stage: [64.F_CSG^{9×9} - R_abs/N/P_A^{5×5}] - PMK-SVM
  U       64.0%

Two Stages: [64.F_CSG^{9×9} - R_abs/N/P_A^{5×5}] - [256.F_CSG^{9×9} - R_abs/N] - PMK-SVM
  UU      52.8%

Table 1. Average recognition rates on Caltech-101 with 30 training samples per class. Each row contains results for one of the training protocols, and each column for one type of architecture. All columns use an F_CSG as the first module, followed by the modules shown in the column label. The error bars for all experiments are within 1%, except where noted.
All results are recognition rates averaged over classes,
after training with 30 samples per class, and averaged over
5 drawings of the training set. To adjust hyperparameters,
a validation set of 5 samples per class was taken out of the
training sets. The hyper-parameters were selected to maxi-
mize the performance on the validation set. Then, the sys-
tem was trained over the entire training set. The final error
value is computed as the average error over categories to
account for differences in the number of instances per cat-
egory (as is standard protocol for Caltech-101). The back-
ground category was also included.
Using a Single Stage of Feature Extraction: The first stage is composed of an F_CSG layer with 64 filters of size 9 × 9 (64F_CSG^{9×9}), followed by an abs rectification (R_abs), a local contrast normalization layer (N) and an average pooling layer with a 10 × 10 boxcar filter and 5 × 5 down-sampling (P_A^{5×5}). The output of the first stage is a set of 64 feature maps of size 26 × 26. This output is then fed to a multinomial logistic regression classifier that produces a 102-dimensional output vector representing a posterior distribution over class labels. Lazebnik's PMK-SVM classifier [13] was also tested.
Using Two Stages of Feature Extraction: In two-stage systems, the second-stage feature extractor is fed with the output of the first stage. The first layer of the second stage is an F_CSG module with 256 output feature maps, each of which combines a random subset of 16 feature maps from the previous stage using 9 × 9 kernels. Hence the total number of convolution kernels is 256 × 16 = 4096. The average pooling module uses a 6 × 6 boxcar filter with a 4 × 4 down-sampling step. This produces an output feature map of size 256 × 4 × 4, which is then fed to a multinomial logistic regression classifier. The PMK-SVM classifier was also tested.
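As a quick sanity check on the feature-map sizes quoted above, the arithmetic of Section 2 (valid convolution followed by boxcar pooling with down-sampling) gives 26 × 26 maps after the first stage and 4 × 4 after the second:

def conv_out(n, l):            # n - l + 1
    return n - l + 1

def pool_out(n, win, stride):  # (n - win) // stride + 1
    return (n - win) // stride + 1

s1 = pool_out(conv_out(143, 9), 10, 5)   # 143 -> 135 -> 26: 64 maps of 26x26
s2 = pool_out(conv_out(s1, 9), 6, 4)     # 26 -> 18 -> 4:   256 maps of 4x4
print(s1, s2)                            # 26 4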
Table 1 summarizes the results for the experiments.
1. The most astonishing result is that systems with random filters and no filter learning whatsoever achieve decent performance (53.3% for R and 62.9% for RR), as long as they include absolute value rectification and contrast normalization (R_abs - N - P_A).
2. Comparing experiments from rows R vs R+, RR vs R+R+, U vs U+ and UU vs U+U+, we see that supervised fine tuning consistently improves the performance, particularly with weak non-linearities: 62.9% to 64.7% for RR, 63.7% to 65.5% for UU using R_abs - N - P_A, and 35.1% to 59.5% for RR using R_abs - P_A.
3. It appears clear that two-stage systems (UU, U+U+, RR, R+R+) are systematically and significantly better than their single-stage counterparts (U, U+, R, R+). For instance, 54.2% obtained by U+ compares to 65.5% obtained by U+U+ using R_abs - N - P_A. However, when using the P_A architecture, adding a second stage without supervised refinement does not seem to help. This may be due to cancellation effects of the P_A layer when rectification is not present.
4. Unsupervised training (U, UU, U+, U+U+) does not seem to significantly improve the performance over random filters (R, RR, R+, R+R+) if both rectification and normalization are used (62.9% for RR versus 63.7% for UU). When contrast normalization is removed, the performance gap becomes significant (35.1% for RR versus 47.8% for UU). If no supervised refinement is performed, it looks as if appropriate architectural components are a good
