
What is the Best Multi-Stage Architecture for Object Recognition?
Kevin Jarrett, Koray Kavukcuoglu, Marc’Aurelio Ranzato and Yann LeCun
The Courant Institute of Mathematical Sciences
New York University, 715 Broadway, New York, NY 10003, USA
koray@cs.nyu.edu
Abstract
In many recent object recognition systems, feature ex-
traction stages are generally composed of a filter bank, a
non-linear transformation, and some sort of feature pooling
layer. Most systems use only one stage of feature extrac-
tion in which the filters are hard-wired, or two stages where
the filters in one or both stages are learned in supervised
or unsupervised mode. This paper addresses three ques-
tions: 1. How do the non-linearities that follow the filter
banks influence the recognition accuracy? 2. Does learn-
ing the filter banks in an unsupervised or supervised man-
ner improve the performance over random filters or hard-
wired filters? 3. Is there any advantage to using an ar-
chitecture with two stages of feature extraction, rather than
one? We show that using non-linearities that include recti-
fication and local contrast normalization is the single most
important ingredient for good accuracy on object recogni-
tion benchmarks. We show that two stages of feature ex-
traction yield better accuracy than one. Most surprisingly,
we show that a two-stage system with random filters can
yield almost 63% recognition rate on Caltech-101, provided
that the proper non-linearities and pooling layers are used.
Finally, we show that with supervised refinement, the sys-
tem achieves state-of-the-art performance on NORB dataset
(5.6%) and unsupervised pre-training followed by super-
vised refinement produces good accuracy on Caltech-101
(> 65%), and the lowest known error rate on the undis-
torted, unprocessed MNIST dataset (0.53%).
1. Introduction
Over the last few years, considerable efforts have been
devoted to designing appropriate feature descriptors for ob-
ject recognition. Many recent proposals use dense features
extracted on regularly-spaced patches over the input image.
The vast majority of these systems use a feature extrac-
tion process composed of a filter bank (generally based on
oriented edge detectors), a non-linear operation (quantiza-
tion, winner-take-all, sparsification, normalization, and/or
point-wise saturation), and a pooling operation that com-
bines nearby values in real space or feature space through
a max, average, or histogramming operator. For example,
the SIFT operator applies oriented edge filters to a small
patch and determines the dominant orientation through a
winner-take-all operation. Finally, the resulting sparse vec-
tors are added (pooled) over a larger patch to form local ori-
entation histograms. Several recognition architectures use a
single stage of such features followed by a supervised clas-
sifier. Particular embodiments of the single-stage systems
use SIFT features [19, 13], HoG [6], Geometric Blur [5],
and models inspired by the architecture of the mammalian
primary visual cortex [24], to mention a few. Other models
use two or more successive stages of such feature extractors,
followed by a supervised classifier. This includes convolu-
tional networks globally trained in purely supervised mode
with gradient descent [10], convolutional networks trained
in supervised mode with an auxiliary task [3], or trained
in purely unsupervised mode [25, 11, 18]. Multi-stage sys-
tems also include HMAX-type models [28, 22] in which the
first layer is hardwired with Gabor filters, and the second
layer is trained in unsupervised mode by storing randomly-
picked output configurations from the first stage into filters
of the second stage. All of these models essentially differ
by whether they have one or two (or more) feature extrac-
tion stages, by the type of non-linearity used after the filter
banks, the method used to pick the filters (hard-wired, un-
supervised, supervised), and the top-level classifier (linear
or more sophisticated).
This paper addresses three questions: 1. How do the non-
linearities that follow the filter banks influence the recogni-
tion accuracy? 2. Does learning the filter banks in an un-
supervised or supervised manner improve the performance
over hard-wired filters or even random filters? 3. Is there
any advantage to using an architecture with two successive
stages of feature extraction, rather than with a single stage?
To address these questions, we experimented with various
combinations of architectures (with 1 or 2 stages of fea-
ture extraction), non-linearities, filter types, filter learning
methods (random, unsupervised and supervised). We use
a recently-proposed unsupervised feature learning method
called Predictive Sparse Decomposition (PSD), based on

an encoder-decoder architecture with sparsity constraints
on the feature vector [12]. Results are presented on the
well-known Caltech-101 dataset [7], on the NORB object
dataset [15], and on the MNIST dataset of handwritten dig-
its [14].
At first glance, one may think that training a complete
system in a purely supervised manner (using gradient de-
scent) is bound to fail on datasets with a small number of
labeled samples such as Caltech-101, because the number of
parameters greatly outstrips the number of samples. One
may also think that the filters need to be carefully hand-
picked (or trained) to produce good performance, and that
the details of the non-linearity play a somewhat secondary
role. These intuitions, as it turns out, are wrong.
1.1. Modules for dense feature extraction
A common choice for the filter bank of the first stage is
Gabor Wavelets [28, 22, 24]. Other proposals use simple
oriented edge detection filters such as gradient operators,
including SIFT [19], and HoG [6]. Another set of meth-
ods learn the filters by adapting them to the statistics of the
input data with unsupervised learning [25, 11, 18]. When
trained on natural images, these filters are Gabor-like edge
detectors. The advantage of learning methods is that they
provide a way to learn the filters in the subsequent stages
of the feature hierarchy. While prior knowledge about im-
age statistics points to the usefulness of oriented edge de-
tectors at the first stage, there is no similar prior knowl-
edge that would allow one to design sensible filters for the sec-
ond stage in the hierarchy. Hence the second stage must
be learned. A number of methods have been proposed to
learn filters in a multi-stage vision system. The simplest
method, which is a kind of patch memorization, is to set
the filters to randomly-picked configurations of outputs of
the previous stage [28, 22]. One of the oldest methods is
to simply learn the filters in a supervised fashion using gra-
dient descent [14, 10, 3]. The main issue with the purely
supervised global training approach is that the number of
parameters to be adjusted is very large, perhaps too large
relative to the available number of training samples for most
applications. Finally, one can train the filters in an unsuper-
vised fashion by following the so-called “deep belief net-
work” strategy [8, 4, 26, 9, 25, 17]: the filters are trained
so that representations at one stage can be reconstructed
from the representation at the next stage under sparsity con-
straints [25, 11] or using the so-called contrastive diver-
gence method [18]. The main problem with the unsuper-
vised approach is that the filters are learned independently
of the task, although a few authors have proposed methods
that combine unsupervised and supervised criteria to allevi-
ate the problem [21, 27, 4].
The second ingredient of a feature extraction system
is the non-linearity. Convolutional networks use a sim-
ple point-wise sigmoid function after the filter banks [14],
while models that are strongly inspired by biology have
included rectifying non-linearities, such as positive part,
absolute value, or squaring functions [24], often followed
by a local contrast normalization [24], which is inspired
by divisive normalization models [20]. SIFT uses a recti-
fication followed by a winner-take-all operation over orien-
tation, which is an extreme form of normalization. The
last step is the pooling layer that can be applied over
space [14, 13, 25, 3], over scale and space [28, 22, 24], or
over similar feature types and space [11]. This layer builds
robustness to small distortions by computing an average or
a max of the filter responses within the pool.
The accuracy of single-stage systems on the Caltech-101
dataset, after training on 30 labeled samples per category,
varies with the details of the architecture and the filters.
SIFT-based systems yield accuracies around 50% when fed
to linear classifiers [11], and around 65% when using more
sophisticated classifiers such as the Pyramid Match Ker-
nel SVM (PMK-SVM) [13, 31, 11]. The V1-like model
of Pinto et al. yields around 60% with a linear classifier fol-
lowing PCA [24]. These methods are similar in the fact that
they use hand-crafted oriented edge filters.
In recent years, a few authors have experimented with
filter-learning methods on Caltech-101. Kavukcuoglu et
al. [11] report recognition rates similar to SIFT using a
single-stage feature extractor fed to either a linear classi-
fier or a PMK-SVM. Several authors have proposed sys-
tems with two stages of learned feature extractors, each
of which comprises filter banks, non-linearities, and pool-
ing. This includes convolutional networks using supervised
training [10] and unsupervised training [25] yielding recog-
nition rates in the mid 50’s, and supervised training us-
ing auxiliary “pseudo-tasks” to regularize the system [3]
yielding 67.2% recognition rate. HMAX-type architectures
have yielded rates in the mid-40’s to mid-50’s [28, 22],
and stacked Restricted Boltzmann Machines [17, 18] have
yielded 65.4% with a PMK-SVM classifier on top. While
the best results on Caltech-101 have been obtained by com-
bining a large number of different feature families [29], the
present study concerns systems with a single feature family,
hence results will be compared with other work in which a
single feature family is used. Better absolute numbers can
be obtained by combining the features presented here with
others, as described in [29].
2. Model Architecture
This section describes how to build a hierarchical feature
extraction and classification system with fast (feed-forward)
processing. The hierarchy stacks one or several feature ex-
traction stages, each of which consists of a filter bank layer,
non-linear transformation layers, and a pooling layer that
combines filter responses over local neighborhoods using
an average or max operation, thereby achieving invariance
to small distortions.

Filter Bank Layer - F_CSG: the input of a filter bank layer is a 3D array with n1 2D feature maps of size n2 × n3. Each component is denoted x_ijk, and each feature map is denoted x_i. The output is also a 3D array y, composed of m1 feature maps of size m2 × m3. A filter in the filter bank k_ij has size l1 × l2 and connects input feature map x_i to output feature map y_j. The module computes:

y_j = g_j \tanh\left( \sum_i k_{ij} * x_i \right)   (1)

where tanh is the hyperbolic tangent non-linearity, * is the 2D discrete convolution operator, and g_j is a trainable scalar coefficient. Taking the border effects into account, we have m2 = n2 - l1 + 1 and m3 = n3 - l2 + 1. This layer is denoted by F_CSG because it is composed of a set of convolution filters (C), a sigmoid/tanh non-linearity (S), and gain coefficients (G). In the following, superscripts are used to denote the size of the filters. For instance, a filter bank layer with 64 filters of size 9×9 is denoted 64F_CSG^{9×9}.
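To make the layer definition concrete, the following NumPy/SciPy sketch implements eq. (1) as written above: a bank of valid 2D convolutions summed over input maps, followed by tanh and a trainable per-output-map gain. Function and variable names are illustrative and not taken from the authors' code.

import numpy as np
from scipy.signal import convolve2d

def f_csg(x, k, g):
    """x: (n1, n2, n3) input maps; k: (m1, n1, l1, l2) filters; g: (m1,) gains."""
    m1, n1, l1, l2 = k.shape
    n2, n3 = x.shape[1], x.shape[2]
    m2, m3 = n2 - l1 + 1, n3 - l2 + 1          # 'valid' convolution output size
    y = np.zeros((m1, m2, m3))
    for j in range(m1):
        s = np.zeros((m2, m3))
        for i in range(n1):
            s += convolve2d(x[i], k[j, i], mode="valid")
        y[j] = g[j] * np.tanh(s)               # y_j = g_j * tanh(sum_i k_ij * x_i)
    return y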
Rectification Layer - R_abs: This module simply applies the absolute value function to all the components of its input: y_ijk = |x_ijk|. Several rectifying non-linearities were tried, including the positive part, and produced similar results.
Local Contrast Normalization Layer - N: This module performs local subtractive and divisive normalizations, enforcing a sort of local competition between adjacent features in a feature map, and between features at the same spatial location in different feature maps. The subtractive normalization operation for a given site x_ijk computes v_{ijk} = x_{ijk} - \sum_{ipq} w_{pq} \, x_{i,j+p,k+q}, where w_pq is a Gaussian weighting window (of size 9×9 in our experiments) normalized so that \sum_{ipq} w_{pq} = 1. The divisive normalization computes y_{ijk} = v_{ijk} / \max(c, \sigma_{jk}), where \sigma_{jk} = (\sum_{ipq} w_{pq} \, v^2_{i,j+p,k+q})^{1/2}. For each sample, the constant c is set to mean(\sigma_{jk}) in the experiments. The denominator is the weighted standard deviation of all features over a spatial neighborhood. The local contrast normalization layer is inspired by computational neuroscience models [24, 20].
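The sketch below implements the subtractive and divisive steps as defined above, with a 9×9 Gaussian window shared across feature maps and c set to the mean of sigma_jk. The Gaussian width and the 'same' border handling are assumed details that the text does not specify; this is an illustration of the definitions, not the authors' implementation.

import numpy as np
from scipy.signal import convolve2d

def gaussian_window(size=9, sigma=2.0):
    ax = np.arange(size) - size // 2
    w = np.exp(-(ax[:, None]**2 + ax[None, :]**2) / (2 * sigma**2))
    return w / w.sum()                          # window sums to 1

def local_contrast_norm(x, size=9):
    """x: (n1, h, w) feature maps."""
    n1 = x.shape[0]
    w = gaussian_window(size) / n1              # split weight across maps so the total sums to 1
    mean = sum(convolve2d(x[i], w, mode="same") for i in range(n1))
    v = x - mean[None, :, :]                    # subtractive normalization
    var = sum(convolve2d(v[i]**2, w, mode="same") for i in range(n1))
    sigma = np.sqrt(var)                        # weighted std over the neighborhood
    c = sigma.mean()                            # per-sample constant c = mean(sigma_jk)
    return v / np.maximum(c, sigma)[None, :, :] # divisive normalization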
Average Pooling and Subsampling Layer - P_A: The purpose of this layer is to build robustness to small distortions, playing the same role as the complex cells in models of visual perception. Each output value is y_{ijk} = \sum_{pq} w_{pq} \, x_{i,j+p,k+q}, where w_pq is a uniform weighting window ("boxcar filter"). Each output feature map is then subsampled spatially by a factor S horizontally and vertically. In this work, we do not consider pooling over feature types, but only over the spatial dimensions. Therefore, the numbers of input and output feature maps are identical, while the spatial resolution is decreased. Disregarding the border effects in the boxcar averaging, the spatial resolution is decreased by the down-sampling ratio S in both directions, denoted by a superscript, so that an average pooling layer with 4×4 down-sampling is denoted P_A^{4×4}.

Figure 1. An example of a feature extraction stage of the type F_CSG - R_abs - N - P_A. An input image (or a feature map) is passed through a non-linear filter bank, followed by rectification, local contrast normalization and spatial pooling/sub-sampling.
Max-Pooling and Subsampling Layer - P_M: building local invariance to shift can be performed with any symmetric pooling operation. The max-pooling module is similar to the average pooling, except that the average operation is replaced by a max operation. In our experiments, the pooling windows were non-overlapping. A max-pooling layer with 4×4 down-sampling is denoted P_M^{4×4}.
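Both pooling layers can be sketched with a single helper that averages (P_A) or takes the max (P_M) over a boxcar window and then subsamples by the stride S. Window and stride values such as 10×10/5×5 come from the experiments section; everything else here is an illustrative choice.

import numpy as np

def pool(x, win, stride, mode="avg"):
    """x: (n1, h, w) feature maps -> (n1, h_out, w_out)."""
    n1, h, w = x.shape
    h_out = (h - win) // stride + 1
    w_out = (w - win) // stride + 1
    y = np.zeros((n1, h_out, w_out))
    for a in range(h_out):
        for b in range(w_out):
            patch = x[:, a*stride:a*stride+win, b*stride:b*stride+win]
            y[:, a, b] = patch.mean(axis=(1, 2)) if mode == "avg" else patch.max(axis=(1, 2))
    return y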
2.1. Combining Modules into a Hierarchy
Different architectures can be produced by cascading the
above-mentioned modules in various ways. An architec-
ture is composed of one or two stages of feature extraction,
each of which is formed by cascading a filtering layer with
different combinations of rectification, normalization, and
pooling. Recognition architectures are composed of one or
two such stages, followed by a classifier, generally a multi-
nomial logistic regression.
F_CSG - P_A: This is the basic building block of traditional convolutional networks, alternating tanh-squashed filter banks with average down-sampling layers [14, 10]. A complete convolutional network would have several sequences of F_CSG - P_A followed by a linear classifier.
F_CSG - R_abs - P_A: The tanh-squashed filter bank is followed by an absolute value non-linearity, and by an average down-sampling layer.
F_CSG - R_abs - N - P_A: The tanh-squashed filter bank is followed by an absolute value non-linearity, by a local contrast normalization layer and by an average down-sampling layer. A composition of this type is sketched below.
F_CSG - P_M: This is also a typical building block of convolutional networks, as well as the basis of the HMAX and other architectures [28, 25], which alternate tanh-squashed filter banks with max-pooling layers.
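Putting the previous sketches together, a stage of the type F_CSG - R_abs - N - P_A (Figure 1) amounts to the following composition. This is illustrative glue code reusing the helper functions sketched above, not the authors' pipeline; the 10×10/5×5 pooling values are those used later in the Caltech-101 experiments.

import numpy as np

def stage_fcsg_rabs_n_pa(x, filters, gains, pool_win=10, pool_stride=5):
    y = f_csg(x, filters, gains)        # filter bank + tanh + gain
    y = np.abs(y)                       # R_abs rectification
    y = local_contrast_norm(y)          # N layer
    return pool(y, pool_win, pool_stride, mode="avg")   # P_A pooling

# e.g. a 64F_CSG^{9x9} first stage on a 1x143x143 gray-scale input:
# filters = np.random.randn(64, 1, 9, 9); gains = np.ones(64)
# out = stage_fcsg_rabs_n_pa(np.random.randn(1, 143, 143), filters, gains)  # -> (64, 26, 26)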
3. Training Protocol
Given a particular architecture, a number of training pro-
tocols have been considered and tested. Each protocol is
identified by a letter R, U, R+, or U+. A single letter (e.g. R) indicates an architecture with a single stage of feature extraction, followed by a classifier, while a double letter (e.g. RR) indicates an architecture with two stages of feature extraction followed by a classifier:
Random Features and Supervised Classifier - R and
RR: The filters in the feature extraction stages are set to
random values and kept fixed (no feature learning takes
place), and the classifier stage is trained in supervised mode.

Unsupervised Features, Supervised Classifier - U and
UU. The filters of the feature extraction stages are trained
using the unsupervised PSD algorithm, described in sec-
tion 3.1, and kept fixed. The classifier stage is trained in
supervised mode.
Random Features, Global Supervised Refinement - R+ and R+R+. The filters in the feature extractor stages are initialized with random values, and the entire system (feature stages + classifier) is trained in supervised mode by gradient descent. The gradients are computed using back-propagation, and all the filters are adjusted by stochastic online updates. This is identical to the usual method for training supervised convolutional networks.
Unsupervised Features, Global Supervised Refinement - U+ and U+U+. The filters in the feature extractor stages are initialized with the PSD unsupervised learning algorithm, and the entire system (feature stages + classifier) is then trained (refined) in supervised mode by gradient descent. The system is trained the same way as random features with global refinement, using online stochastic updates. This is reminiscent of the "deep belief network" strategy in which the stages are first trained in unsupervised mode one after the other, and then globally refined using supervised learning [8, 4, 26].
For instance, a traditional convolutional network with a single stage initialized at random [14] would be denoted by an architecture motif like "F_CSG - P_A", and the training protocol would be denoted by R+. The stages of a convolutional network with max-pooling would be denoted by "F_CSG - P_M". A system with two such stages trained in unsupervised mode, and the classifier (only) trained in supervised mode, as in [25], is denoted UU.
3.1. Unsupervised Training of Filter Banks using
Predictive Sparse Decomposition
In order to learn the filter coefficients (g, k) in the fil-
ter bank layers (see eq. 1), an unsupervised learning al-
gorithm is required. We used the Predictive Sparse De-
composition algorithm of [12], which has the following
characteristics: 1. it produces efficient, feed-forward fil-
ter banks that include a point-wise non-linearity; 2. the
training procedure is deterministic (no sampling required,
as with Restricted Boltzmann Machines); 3. it learns to pro-
duce high-dimensional sparse features, which are suitable
for subsequent pooling, and which enhance class discrim-
inability. Although the filter banks are eventually applied
to entire images, the PSD algorithm trains them on individ-
ual patches (or stacks of patches from multiple input feature
maps) whose size is equal to the size of the filters. The start-
ing point of PSD is the well-known sparse coding algorithm
proposed by Olshausen and Field [23] which, unfortunately,
does not produce direct filters, but “reverse” filters (or dic-
tionary elements). Inputs are approximated as a sparse lin-
ear combination of these dictionary elements. The coef-
ficients constitute the feature representation. The method
learns the optimal dictionary that can be used to reconstruct
a set of training samples under sparsity constraints on the
feature vector. For a given input X (a vectorized patch or
stack of patches), and a matrix W whose columns are the
dictionary elements, the feature vector Z^* is obtained by
minimizing the following energy function:

E_{OF}(X, Z, W) = \|X - WZ\|_2^2 + \lambda \|Z\|_1   (2)

Z^* = \arg\min_Z E_{OF}(X, Z, W)   (3)
where λ is a sparsity hyper-parameter. Given a set of training samples X^i, i = 1 ... P, learning proceeds by minimizing the loss L_{OF}(W) = \frac{1}{P} \sum_{i=1}^{P} \min_Z E_{OF}(X^i, Z, W)
using stochastic gradient descent or a similar procedure.
After learning, for any input X, one needs to run a rather expensive optimization algorithm to find Z^* (the so-called "basis pursuit" problem, which is convex, but non-quadratic [16, 2]). To alleviate the problem, the PSD method [12] trains a simple (feed-forward) regressor (or encoder) to approximate the sparse solution Z^* for all X in the training set. The regressor C(X, K) takes the form of eq. 1 on a patch the size of the filters (K collectively denotes all the filter coefficients). During training, the feature vector Z^* is obtained by minimizing the energy function E_{PSD}(X, Z, W, K), defined as follows:

E_{PSD} = \|X - WZ\|_2^2 + \lambda \|Z\|_1 + \|Z - C(X, K)\|_2^2   (4)

Z^* = \arg\min_Z E_{PSD}(X, Z, W, K)   (5)
As with Olshausen and Field [23], learning proceeds by minimizing the loss L_{PSD}(W, K) = \frac{1}{P} \sum_{i=1}^{P} \min_Z E_{PSD}(X^i, Z, W, K). The learning procedure simultaneously optimizes W (dictionary) and K (filters). Once training is complete, the feature vector for a given input is simply obtained with Z^* = C(X, K), hence the process is extremely fast (feed-forward).
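As a rough illustration of the procedure, the sketch below performs one PSD training step on a single vectorized patch: it approximately solves eq. (5) with a few ISTA iterations on eq. (4), then takes gradient steps on the dictionary W and on the encoder parameters. The encoder is written as a gain vector times tanh(KX), a simplified stand-in for eq. (1) on one patch; the step size, the number of inner iterations, and the variable names are assumptions, not the authors' settings.

import numpy as np

def soft_threshold(u, t):
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def psd_step(X, W, K, D, lam=0.5, eta=0.01, n_ista=50):
    """X: (d,) patch; W: (d, m) dictionary; K: (m, d) encoder filters; D: (m,) encoder gains."""
    C = D * np.tanh(K @ X)                     # encoder prediction C(X, K)
    Z = C.copy()
    L = np.linalg.norm(W, 2) ** 2 + 1.0        # step-size bound for the inner problem
    for _ in range(n_ista):                    # ISTA iterations on eq. (4) w.r.t. Z
        grad = W.T @ (W @ Z - X) + (Z - C)
        Z = soft_threshold(Z - grad / L, lam / L)
    # gradient steps on the dictionary W and the encoder parameters (K, D)
    W -= eta * np.outer(W @ Z - X, Z)
    err = C - Z
    D -= eta * err * np.tanh(K @ X)
    K -= eta * np.outer(err * D * (1.0 - np.tanh(K @ X) ** 2), X)
    return W, K, D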
4. Experiments
In this section, various architectures and training proto-
cols are compared on the Caltech 101 [7], MNIST [1] and
NORB [15] datasets. Our purpose is to determine whether
two stages are better than one stage, which non-linearities
are preferable, and which training protocol makes a differ-
ence.
Images from the Caltech-101 dataset were pre-processed
with a procedure similar to [24]. The steps are: 1) converting to gray-scale (no color) and resizing to 151 × 151 pixels; 2) subtracting the image mean and dividing by the image standard deviation; 3) applying subtractive/divisive normalization (an N layer with c = 1); 4) zero-padding the shorter side to 143 pixels.

Single Stage System: [64.F_CSG^{9×9} - R/N/P^{5×5}] - log reg

          R_abs-N-P_A   R_abs-P_A      N-P_M          N-P_A    P_A
  U+      54.2%         50.0%          44.3%          18.5%    14.5%
  R+      54.8%         47.0%          38.0%          16.3%    14.3%
  U       52.2%         43.3% (±1.6)   44.0%          17.2%    13.4%
  R       53.3%         31.7%          32.1%          15.3%    12.1% (±2.2)
  G       52.3%

Two Stage System: [64.F_CSG^{9×9} - R/N/P^{5×5}] - [256.F_CSG^{9×9} - R/N/P^{4×4}] - log reg

          R_abs-N-P_A   R_abs-P_A      N-P_M          N-P_A    P_A
  U+U+    65.5%         60.5%          61.0%          34.0%    32.0%
  R+R+    64.7%         59.5%          60.0%          31.0%    29.7%
  UU      63.7%         46.7%          56.0%          23.1%    9.1%
  RR      62.9%         33.7% (±1.5)   37.6% (±1.9)   19.6%    8.8%
  GT      55.8%

Single Stage: [64.F_CSG^{9×9} - R_abs/N/P_A^{5×5}] - PMK-SVM
  U       64.0%

Two Stages: [64.F_CSG^{9×9} - R_abs/N/P_A^{5×5}] - [256.F_CSG^{9×9} - R_abs/N] - PMK-SVM
  UU      52.8%

Table 1. Average recognition rates on Caltech-101 with 30 training samples per class. Each row contains results for one of the training protocols, and each column for one type of architecture. All columns use an F_CSG as the first module, followed by the modules shown in the column label. The error bars for all experiments are within 1%, except where noted.
All results are recognition rates averaged over classes,
after training with 30 samples per class, and averaged over
5 drawings of the training set. To adjust hyperparameters,
a validation set of 5 samples per class was taken out of the
training sets. The hyper-parameters were selected to maxi-
mize the performance on the validation set. Then, the sys-
tem was trained over the entire training set. The final error
value is computed as the average error over categories to
account for differences in the number of instances per cat-
egory (as is standard protocol for Caltech-101). The back-
ground category was also included.
Using a Single Stage of Feature Extraction: The first stage is composed of an F_CSG layer with 64 filters of size 9 × 9 (64F_CSG^{9×9}), followed by an abs rectification (R_abs), a local contrast normalization layer (N) and an average pooling layer with a 10 × 10 boxcar filter and 5 × 5 down-sampling (P_A^{5×5}). The output of the first stage is a set of 64 feature maps of size 26 × 26. This output is then fed to a multinomial logistic regression classifier that produces a 102-dimensional output vector representing a posterior distribution over class labels. Lazebnik's PMK-SVM classifier [13] was also tested.
Using Two Stages of Feature Extraction: In two-stage systems, the second-stage feature extractor is fed with the output of the first stage. The first layer of the second stage is an F_CSG module with 256 output feature maps, each of which combines a random subset of 16 feature maps from the previous stage using 9 × 9 kernels. Hence the total number of convolution kernels is 256 × 16 = 4096. The average pooling module uses a 6 × 6 boxcar filter with a 4 × 4 down-sampling step. This produces an output feature map of size 256 × 4 × 4, which is then fed to a multinomial logistic regression classifier. The PMK-SVM classifier was also tested.
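As a quick sanity check on the feature-map sizes quoted above, the arithmetic of Section 2 (valid convolution followed by boxcar pooling with down-sampling) gives 26 × 26 maps after the first stage and 4 × 4 after the second:

def conv_out(n, l):            # n - l + 1
    return n - l + 1

def pool_out(n, win, stride):  # (n - win) // stride + 1
    return (n - win) // stride + 1

s1 = pool_out(conv_out(143, 9), 10, 5)   # 143 -> 135 -> 26: 64 maps of 26x26
s2 = pool_out(conv_out(s1, 9), 6, 4)     # 26 -> 18 -> 4:   256 maps of 4x4
print(s1, s2)                            # 26 4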
Table 1 summarizes the results for the experiments.
1. The most astonishing result is that systems with random filters and no filter learning whatsoever achieve decent performance (53.3% for R and 62.9% for RR), as long as they include absolute value rectification and contrast normalization (R_abs - N - P_A).
2. Comparing experiments from rows R vs R+, RR vs R+R+, U vs U+ and UU vs U+U+, we see that supervised fine tuning consistently improves the performance, particularly with weak non-linearities: 62.9% to 64.7% for RR, 63.7% to 65.5% for UU using R_abs - N - P_A, and 35.1% to 59.5% for RR using R_abs - P_A.
3. It appears clear that two-stage systems (UU, U+U+, RR, R+R+) are systematically and significantly better than their single-stage counterparts (U, U+, R, R+). For instance, 54.2% obtained by U+ compares to 65.5% obtained by U+U+ using R_abs - N - P_A. However, when using the P_A architecture, adding a second stage without supervised refinement does not seem to help. This may be due to cancellation effects of the P_A layer when rectification is not present.
4. Unsupervised training (U, UU, U+, U+U+) does not seem to significantly improve the performance over random filters (R, RR, R+, R+R+) if both rectification and normalization are used (62.9% for RR versus 63.7% for UU). When contrast normalization is removed, the performance gap becomes significant (35.1% for RR versus 47.8% for UU). If no supervised refinement is performed, it looks as if appropriate architectural components are a good
