Joint Deep Learning for Pedestrian Detection
Wanli Ouyang and Xiaogang Wang
Department of Electronic Engineering, The Chinese University of Hong Kong
{wlouyang, xgwang}@ee.cuhk.edu.hk
Abstract

Feature extraction, deformation handling, occlusion handling, and classification are four important components in pedestrian detection. Existing methods learn or design these components either individually or sequentially. The interaction among these components is not yet well explored. This paper proposes that they should be jointly learned in order to maximize their strengths through cooperation. We formulate these four components into a joint deep learning framework and propose a new deep network architecture¹. By establishing automatic, mutual interaction among components, the deep model achieves a 9% reduction in the average miss rate compared with the current best-performing pedestrian detection approaches on the largest Caltech benchmark dataset.

¹ Code available at www.ee.cuhk.edu.hk/~wlouyang/projects/ouyangWiccv13Joint/index.html
1. Introduction

Pedestrian detection is a key technology in automotive safety, robotics, and intelligent video surveillance. It has attracted a great deal of research interest [2, 5, 12, 47, 8]. The main challenges of this task are caused by the intra-class variation of pedestrians in clothing, lighting, backgrounds, articulation, and occlusion.
In order to handle these challenges, a group of interdependent components is important. First, features should capture the most discriminative information of pedestrians. Well-known features such as Haar-like features [49], SIFT [29], and HOG [5] are designed to be robust to intra-class variation while remaining sensitive to inter-class variation. Second, deformation models should handle the articulation of human parts such as the torso, head, and legs. The state-of-the-art deformable part-based model in [17] allows human parts to articulate under constraints. Third, occlusion handling approaches [13, 51, 19] seek to identify the occluded regions and avoid their use when determining the existence of a pedestrian in a window. Finally, a classifier decides whether a candidate window shall be detected as enclosing a pedestrian. SVM [5], boosted classifiers [11], random forests [9], and their variations are often used.

Figure 1. Motivation of this paper to jointly learn the four key components in pedestrian detection: feature extraction, deformation handling models, occlusion handling models, and classifiers.
Although these components are interdependent, their interactions have not been well explored. Currently, they are first learned or designed individually or sequentially, and then put together in a pipeline. The interaction among these components is usually achieved through manual parameter configuration. Consider the following three examples. (1) The HOG feature is individually designed, with its parameters manually tuned for the linear SVM classifier used in [5]. The HOG feature then remains fixed when people design new classifiers [31]. (2) A few HOG feature parameters are tuned in [17] and then fixed; different part models are subsequently learned in [17, 58]. (3) With HOG features and deformable models fixed, occlusion handling models are learned in [34, 36], using the part-detection scores as input.
As shown in Fig. 1, the motivation of this paper is to establish automatic interaction in learning these key components. We hope that jointly learned components, like members with team spirit, can create synergy through close interaction and achieve performance greater than that of individually learned components. For example, well-learned features help to locate parts; meanwhile, well-located parts help to learn more discriminative features for different parts. This paper formulates the learning of these key components into a unified deep learning problem. The deep model is especially appropriate for this task because it can organize these components into different layers and jointly optimize them through back-propagation.
This paper makes the following three main contributions.
1. A unified deep model for jointly learning feature extraction, a part deformation model, an occlusion model, and classification. With the deep model, these components interact with each other in the learning process, which allows each component to maximize its strength when cooperating with the others.
2. We enrich the operations in deep models by incorporating the deformation layer into convolutional neural networks (CNNs) [26]. With this layer, various deformation handling approaches can be applied to our deep model.
3. The features are learned from pixels through interaction with the deformation and occlusion handling models. Such interaction helps to learn more discriminative features.
2. Related Work

It has been shown that deep models are potentially more capable than shallow models in handling complex tasks [3]. They have achieved spectacular progress in computer vision [20, 21, 40, 23, 25, 33, 24, 56, 30, 46, 16, 38]. Deep models for pedestrian detection focus on feature learning [44, 33], contextual information learning [57], and occlusion handling [34].
Many features are utilized for pedestrian detection. Haar-like features [49], HOG [5], and dense SIFT [48] are designed to capture the overall shape of pedestrians. First-order color features like color histograms [11], second-order color features like color self-similarity (CSS) [50], and co-occurrence features [43] are also used for pedestrian detection. Texture features like LBP are used in [51]. Other types of features include the covariance descriptor [47], depth [15], segmentation results [13], 3D geometry [22], and their combinations [27, 51, 11, 50, 13, 43]. All the features mentioned above are designed manually. Recently, researchers have become aware of the benefit of learning features from training data [1, 33, 44]. Similar to HOG, these approaches use local max pooling or average pooling to be robust to small local misalignment. However, they do not learn the variable deformation properties of body parts. The approach in [7] learns features and a part-based model sequentially but not jointly.
Since pedestrians undergo non-rigid deformation, the ability to handle deformation improves detection performance. Deformable part-based models are used in [17, 58, 37, 35] for handling the translational movement of parts. To handle more complex articulations, size change and rotation of parts are modeled in [18], and mixtures of part appearance and articulation types are modeled in [4, 55, 6]. In these approaches, features are manually designed.
In order to handle occlusion, many approaches have been proposed for estimating the visibility of parts [13, 51, 54, 53, 45, 27]. Some of them use the detection scores of blocks or parts [51, 34, 13, 54] as input for visibility estimation. Some use other cues like segmentation results [27, 13] and depth [13]. However, all these approaches learn occlusion modeling separately from feature extraction and part models.
The widely used classification approaches include various boosting classifiers [9, 11, 53], linear SVM [5], histogram intersection kernel SVM [31], latent SVM [17], multiple kernel SVM [48], structural SVM [58], and probabilistic models [2, 32]. In these approaches, classifiers are adapted to the training data, but features are designed manually. If useful information has been lost at the feature extraction stage, it cannot be recovered during classification. Ideally, classifiers should guide feature learning.

In summary, previous works treat the components individually or sequentially. This paper takes a global view of these components and is an important step towards jointly learning them for pedestrian detection.
3. Method

3.1. Overview of the proposed deep model

An overview of our proposed deep model is shown in Fig. 2. In this model:
1. Filtered data maps are obtained from the first convolutional layer. This layer convolves the 3-channel input image data with 9 × 9 × 3 filters and outputs 64 maps. |tanh(x)|, i.e. the tanh activation function followed by absolute-value rectification, is applied to each filter response x.
2. Feature maps are obtained by average pooling of the 64 filtered data maps using 4 × 4 boxcar filters with a 4 × 4 subsampling step.
3. Part detection maps are obtained from the second convolutional layer. This layer convolves the feature maps with 20 part filters of different sizes and outputs 20 part detection maps. Details are given in Section 3.3.
4. Part scores are obtained from the 20 part detection maps using a deformation handling layer. This layer outputs 20 part scores. Details are given in Section 3.4.
5. The visibility reasoning over the 20 parts is used for estimating the label y, that is, whether a given window encloses a pedestrian or not. Details are given in Section 3.5.
At the training stage, all the parameters are optimized through back-propagation (BP).
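As an illustration of steps 1-3, the following is a minimal PyTorch-style sketch of the feature-extraction front end. The map sizes follow Fig. 2; the per-part kernel sizes in part_sizes are illustrative assumptions, since the text only specifies that the 20 part filters have different sizes (see Section 3.3).

import torch
import torch.nn as nn

class FrontEnd(nn.Module):
    """Sketch of steps 1-3: conv1 with |tanh| rectification, 4x4 average
    pooling, and a second convolutional layer with per-part filter sizes."""
    def __init__(self, part_sizes):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 64, kernel_size=9)       # 3x84x28 -> 64x76x20
        self.pool = nn.AvgPool2d(kernel_size=4, stride=4)  # 64x76x20 -> 64x19x5
        # One filter per part; the sizes differ across the 20 parts (Sec. 3.3).
        self.part_filters = nn.ModuleList(
            [nn.Conv2d(64, 1, kernel_size=k) for k in part_sizes])

    def forward(self, x):                          # x: (batch, 3, 84, 28)
        x = torch.abs(torch.tanh(self.conv1(x)))   # |tanh(x)| rectification
        x = self.pool(x)
        # 20 part detection maps of different spatial sizes.
        return [f(x) for f in self.part_filters]

# Hypothetical kernel sizes: six small, seven medium, seven large parts.
model = FrontEnd(part_sizes=[3] * 6 + [4] * 7 + [5] * 7)
maps = model(torch.randn(1, 3, 84, 28))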

Figure 2. Overview of our deep model. The image data is convolved with 64 9 × 9 × 3 filters and average pooled to obtain 64 feature maps. The feature maps are then processed by the second convolutional layer and the deformation layer to obtain 20 part scores. Finally, the visibility reasoning model is used to estimate the detection label y.
3.2. Input data preparation

The detection windows are extracted into images of height 84 and width 28, in which pedestrians have height 60 and width 20. The input image data contains three channels.
(1) The first channel is an 84 × 28 Y-channel image after the image is converted into the YUV color space.
(2) The three-channel 42 × 14 images in the YUV color space are concatenated into the second channel of size 84 × 28 with zero padding.
(3) Four 42 × 14 edge maps are concatenated into the third channel of size 84 × 28. Three edge maps are obtained from the three-channel images in the YUV color space. The magnitudes of horizontal and vertical edges are computed using the Sobel edge detector. The fourth edge map is obtained by choosing the maximum magnitudes from the first three edge maps.
In this way, information about pixel values at different resolutions and information about primitive edges are utilized as the input of the first convolutional layer to extract features. The first convolutional layer and its following average pooling layer use the standard CNN settings.
We empirically find that it is better to arrange the images and edge maps into three concatenated channels instead of eight separate channels. In order to deal with illumination change, the data in each channel is preprocessed to zero mean and unit variance.
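As a concrete reading of this construction, here is a minimal NumPy/OpenCV sketch. The exact 2 × 2 tiling of the 42 × 14 maps into an 84 × 28 channel and the use of the combined Sobel gradient magnitude per channel are assumptions, since the layout is not spelled out above.

import numpy as np
import cv2

def make_input(window_bgr):
    """Sketch of the 3-channel 84x28 input of Sec. 3.2 from a BGR window."""
    full = cv2.resize(window_bgr, (28, 84))   # OpenCV dsize is (width, height)
    half = cv2.resize(window_bgr, (14, 42))
    yuv_full = cv2.cvtColor(full, cv2.COLOR_BGR2YUV).astype(np.float32)
    yuv_half = cv2.cvtColor(half, cv2.COLOR_BGR2YUV).astype(np.float32)

    def tile2x2(blocks):                      # assumed 2x2 layout: 84x28 exactly
        return np.vstack([np.hstack(blocks[:2]), np.hstack(blocks[2:])])

    # Channel 1: full-resolution Y image.
    ch1 = yuv_full[:, :, 0]
    # Channel 2: the three 42x14 YUV images plus a zero-padding block.
    zeros = np.zeros((42, 14), np.float32)
    ch2 = tile2x2([yuv_half[:, :, 0], yuv_half[:, :, 1], yuv_half[:, :, 2], zeros])
    # Channel 3: Sobel edge-magnitude maps of the three YUV channels plus
    # their element-wise maximum.
    edges = []
    for c in range(3):
        gx = cv2.Sobel(yuv_half[:, :, c], cv2.CV_32F, 1, 0)
        gy = cv2.Sobel(yuv_half[:, :, c], cv2.CV_32F, 0, 1)
        edges.append(np.sqrt(gx ** 2 + gy ** 2))
    edges.append(np.maximum.reduce(edges[:3]))
    ch3 = tile2x2(edges)

    x = np.stack([ch1, ch2, ch3])             # (3, 84, 28)
    # Zero mean and unit variance per channel to handle illumination change.
    mean = x.mean(axis=(1, 2), keepdims=True)
    std = x.std(axis=(1, 2), keepdims=True) + 1e-6
    return (x - mean) / std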
3.3. Generating the part detection map

Normally, the filter size of a convolutional layer is fixed [26, 24]. Since the parts of pedestrians have different sizes, we design the filters in the second convolutional layer with variable sizes. As shown in Fig. 3(a), we design parts at three levels with different sizes: there are six small parts at level 1, seven medium-sized parts at level 2, and seven large parts at level 3. A part at an upper level is composed of parts at the lower level. Parts at the top level are also the possible occlusion statuses. Gray color indicates occlusion. The other two levels are body parts. In the figure, the head-shoulder part appears twice (representing an occlusion status at the top level and a part at the middle level, respectively) because this body part itself can generate an occlusion status.

Figure 3. The parts model (a) and the filters (b) learned at the second convolutional layer. We follow [14] and visualize the filter that optimizes the corresponding stimuli of the neurons, which is also used in [25].

Fig. 3(b) shows a few part filters learned with our deep model. They are visualized using the activation maximization approach in [14]. The figure shows that the head-shoulder at level 2 and the head-shoulder at level 3 extract different visual cues from the input image. The head-shoulder filters in Fig. 3(b) contain more detailed silhouette information on heads and shoulders than the head-shoulder filter learned with HOG in Fig. 1. The two-legs filter in Fig. 3(b) is visually more meaningful than the one learned with HOG in Fig. 1.
3.4. The deformation layer

In order to learn the deformation constraints of different parts, we propose the deformation handling layer (deformation layer for short) for the CNNs.

The deformation layer takes the $P$ part detection maps as input and outputs $P$ part scores $\mathbf{s} = \{s_1, \ldots, s_P\}$, with $P = 20$ in Fig. 2. The deformation layer treats the detection maps separately and produces the $p$th part score $s_p$ from the $p$th part detection map, denoted by $\mathbf{M}_p$. A 2D summed map, denoted by $\mathbf{B}_p$, is obtained by summing up the part detection map $\mathbf{M}_p$ and the deformation maps as follows:

$$\mathbf{B}_p = \mathbf{M}_p + \sum_{n=1}^{N} c_{n,p} \mathbf{D}_{n,p}. \quad (1)$$

$\mathbf{D}_{n,p}$ denotes the $n$th deformation map for the $p$th part, $c_{n,p}$ denotes the weight for $\mathbf{D}_{n,p}$, and $N$ denotes the number of deformation maps. $s_p$ is globally max-pooled from the $\mathbf{B}_p$ in Eq. (1):

$$s_p = \max_{(x,y)} b_p^{(x,y)}, \quad (2)$$

where $b_p^{(x,y)}$ denotes the $(x,y)$th element of $\mathbf{B}_p$. The detected part location can be inferred from the summed map as follows:

$$(x,y)_p = \arg\max_{(x,y)} b_p^{(x,y)}. \quad (3)$$

At the training stage, only the value at location $(x,y)_p$ of $\mathbf{B}_p$ is used for learning the deformation parameters.

The $c_{n,p}$ and $\mathbf{D}_{n,p}$ in Eq. (1) are the key to designing different deformation models. Both $c_{n,p}$ and $\mathbf{D}_{n,p}$ can be considered as parameters to be learned. Three examples are given below.
Example 1. Suppose $N = 1$, $c_{1,p} = 1$, and the deformation map $\mathbf{D}_{1,p}$ is to be learned. In this case, the discrete locations of the $p$th part are treated as bins and the deformation cost for each bin is learned. $d_{1,p}^{(x,y)}$, which denotes the $(x,y)$th element of $\mathbf{D}_{1,p}$, corresponds to the deformation cost of the $p$th part at location $(x,y)$. The approach in [39] treats deformation as bins of locations.
Example 2. $\mathbf{D}_{1,p}$ can also be predefined. Suppose $N = 1$ and $c_{1,p} = 1$. If $d_{1,p}^{(x,y)}$ is the same for any $(x,y)$, then there is no deformation cost. If $d_{1,p}^{(x,y)} = -\infty$ for $(x,y) \notin \mathcal{X}$ and $d_{1,p}^{(x,y)} = 0$ for $(x,y) \in \mathcal{X}$, then the parts are only allowed to move freely in the location set $\mathcal{X}$. Max-pooling is a special case of this example, obtained by setting $\mathcal{X}$ to be a local region. The disadvantage of max-pooling is that the hand-tuned local region does not adapt to the different deformation properties of different parts.

Figure 4. The deformation layer when the deformation map is defined as in Eq. (4). The part detection map and the deformation maps are summed up with weights $c_{n,p}$ for $n = 1, 2, 3, 4$ to obtain the summed map $\mathbf{B}_p$. Global max pooling is then performed on the summed map to obtain the score $s_p$ for the $p$th part.
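A small numerical check of Example 2, with $\mathcal{X}$ chosen as a hypothetical local region: with $d_{1,p}^{(x,y)} = -\infty$ outside $\mathcal{X}$ and $0$ inside, global max pooling over the summed map reduces to local max pooling over $\mathcal{X}$.

import numpy as np

M = np.random.randn(19, 5).astype(np.float32)  # a part detection map
D1 = np.full_like(M, -np.inf)                  # deformation cost outside X
D1[6:10, 1:4] = 0.0                            # X: the part may move freely here
B = M + D1                                     # Eq. (1) with N = 1, c = 1
assert B.max() == M[6:10, 1:4].max()           # global max == local max over X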
Example 3. The deformation layer can represent the widely used quadratic constraint of deformation in [17]. Below, we omit the subscript $p$ used in Eq. (1) for conciseness. The quadratic constraint of deformation can be represented as follows:

$$b^{(x,y)} = m^{(x,y)} + c_1\left(x - a_x + \frac{c_3}{2c_1}\right)^2 + c_2\left(y - a_y + \frac{c_4}{2c_2}\right)^2, \quad (4)$$

where $m^{(x,y)}$ is the $(x,y)$th element of the part detection map $\mathbf{M}$, and $(a_x, a_y)$ is the predefined anchor location of the $p$th part. They are adjusted by $c_3/(2c_1)$ and $c_4/(2c_2)$, which are automatically learned. $c_1$ and $c_2$ in Eq. (4) decide the deformation cost. There is no deformation cost if $c_1 = c_2 = 0$. Parts are not allowed to move if $c_1 = c_2 = -\infty$. $(a_x, a_y)$ and $\left(\frac{c_3}{2c_1}, \frac{c_4}{2c_2}\right)$ jointly decide the center of the part. The quadratic constraint in Eq. (4) can be represented using Eq. (1) as follows:

$$\mathbf{B} = \mathbf{M} + c_1 \mathbf{D}_1 + c_2 \mathbf{D}_2 + c_3 \mathbf{D}_3 + c_4 \mathbf{D}_4 + c_5 \cdot \mathbf{1},$$
$$b^{(x,y)} = m^{(x,y)} + c_1 d_1^{(x,y)} + c_2 d_2^{(x,y)} + c_3 d_3^{(x,y)} + c_4 d_4^{(x,y)} + c_5,$$
$$d_1^{(x,y)} = (x - a_x)^2, \quad d_2^{(x,y)} = (y - a_y)^2, \quad d_3^{(x,y)} = x - a_x,$$
$$d_4^{(x,y)} = y - a_y, \quad c_5 = c_3^2/(4c_1) + c_4^2/(4c_2), \quad (5)$$

where $\mathbf{1}$ is a matrix with all elements being one and $d_n^{(x,y)}$ is the $(x,y)$th element of $\mathbf{D}_n$. In this case, $c_1, c_2, c_3$, and $c_4$ are parameters to be learned and the $\mathbf{D}_n$ are predefined. $c_5$ is the same at all locations and need not be learned. Fig. 4 illustrates this example, which is used as the deformation layer in this work.
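The following NumPy sketch implements Eqs. (1)-(3) with the quadratic deformation maps of Eq. (5). Negative $c_1$ and $c_2$ penalize displacement from the anchor; the example values are illustrative, since in the model these weights are learned.

import numpy as np

def deformation_layer(M, ax, ay, c1, c2, c3, c4):
    """Summed map B of Eq. (5); part score via Eq. (2), location via Eq. (3)."""
    H, W = M.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float32)   # y = row, x = column
    c5 = c3 ** 2 / (4 * c1) + c4 ** 2 / (4 * c2)     # constant offset
    B = (M + c1 * (xs - ax) ** 2 + c2 * (ys - ay) ** 2
           + c3 * (xs - ax) + c4 * (ys - ay) + c5)
    s_p = B.max()                                     # Eq. (2): global max pooling
    y, x = np.unravel_index(B.argmax(), B.shape)      # Eq. (3): detected location
    return s_p, (x, y)

# Illustrative values: negative c1, c2 make displacement from the anchor costly.
M = np.random.randn(19, 5).astype(np.float32)
score, loc = deformation_layer(M, ax=2.0, ay=9.0, c1=-0.1, c2=-0.1, c3=0.0, c4=0.0)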
3.5. Visibility reasoning and classification

The deformation layer in Section 3.4 provides the part scores $\mathbf{s} = \{s_1, \ldots, s_P\}$ using Eq. (2). $\mathbf{s}$ is then used for visibility reasoning and classification. We adopt the model in [34] to estimate visibility.

Figure 5. The visibility reasoning and detection label estimation model. For the $i$th part at the $l$th level, $s^l_i$ is the detection score and $h^l_i$ is the visibility. For example, $h^1_1$ indicates the visibility of the left-head-shoulder part. Best viewed in color.
Fig. 5 shows the model for the visibility reasoning and classification in Fig. 2. Denote the score and visibility of the $j$th part at level $l$ by $s^l_j$ and $h^l_j$, respectively. Denote the visibility of the $P_l$ parts at level $l$ by $\mathbf{h}^l = [h^l_1 \ldots h^l_{P_l}]^T$. Given $\mathbf{s}$, the model for BP and inference is as follows:

$$\tilde{h}^1_j = \sigma(c^1_j + g^1_j s^1_j),$$
$$\tilde{h}^{l+1}_j = \sigma(\tilde{\mathbf{h}}^{l\,T} \mathbf{w}^l_{*,j} + c^{l+1}_j + g^{l+1}_j s^{l+1}_j), \quad l = 1, 2,$$
$$\tilde{y} = \sigma(\tilde{\mathbf{h}}^{3\,T} \mathbf{w}_{cls} + b), \quad (6)$$

where $\sigma(t) = (1 + \exp(-t))^{-1}$ is the sigmoid function, $g^l_j$ is the weight for $s^l_j$, $c^l_j$ is its bias term, $\mathbf{W}^l$ models the correlation between $\mathbf{h}^l$ and $\mathbf{h}^{l+1}$, $\mathbf{w}^l_{*,j}$ is the $j$th column of $\mathbf{W}^l$, $\mathbf{w}_{cls}$ is considered as the linear classifier for the hidden units $\tilde{\mathbf{h}}^3$, and $\tilde{y}$ is the estimated detection label. Hidden variables at adjacent levels are connected. $\mathbf{w}^l_{*,j}$ represents the relationship between $\tilde{\mathbf{h}}^l$ and $\tilde{h}^{l+1}_j$. A part can have multiple parents and multiple children. The visibility of one part is correlated with the visibility of other parts at the same level through shared parents. $g^l_j$, $c^l_j$, $\mathbf{W}^l$, $\mathbf{w}_{cls}$, and $b$ are parameters to be learned.
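To make the reasoning model concrete, a minimal NumPy sketch of the inference in Eq. (6); levels are indexed 0..2 here instead of 1..3, and the per-level parameters are assumed to be packed as arrays (s[l], g[l], c[l] of length P_l; W[l] of shape P_l × P_{l+1}).

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def visibility_forward(s, g, c, W, w_cls, b):
    """Inference of Eq. (6). For the extra hidden nodes that use no
    detection score (see below), the corresponding g[l][j] is zero."""
    h = [sigmoid(c[0] + g[0] * s[0])]                        # level-1 visibilities
    for l in range(2):                                        # levels 2 and 3
        h.append(sigmoid(h[l] @ W[l] + c[l + 1] + g[l + 1] * s[l + 1]))
    y_hat = sigmoid(h[2] @ w_cls + b)                         # estimated label
    return h, y_hat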
The differences between the deep model in this paper and the approach in [34] are as follows:
1. In [34], the parts at levels 1 and 2 propagate information to the classifier through the parts at level 3, but the imperfect part scores at level 3 may disturb the information from levels 1 and 2. This paper includes extra hidden nodes at levels 2 and 3. These nodes provide branches that help parts at levels 1 and 2 to directly propagate information to the classifier without being disturbed by other parts. These extra hidden nodes do not use detection scores and have the term $g^{l+1}_j s^{l+1}_j = 0$ in Eq. (6). They are represented by white circles in Fig. 5, while the hidden nodes with the term $g^{l+1}_j s^{l+1}_j \neq 0$ in Eq. (6) are represented by gray circles.
2. The approach in [34] only learns the visibility relationship from part scores. Both the HOG features and the parameters of the deformation model are fixed in [34]. In this paper, features, deformable models, and visibility relationships are jointly learned.
In order to learn the parameters in the two convolutional layers and the deformation layer in Fig. 2, the prediction error is back-propagated through $\mathbf{s}$. The gradient for $\mathbf{s}$ is:

$$\frac{\partial L}{\partial s^l_i} = \frac{\partial L}{\partial h^l_i} \frac{\partial h^l_i}{\partial s^l_i} = \frac{\partial L}{\partial h^l_i} h^l_i (1 - h^l_i) g^l_i, \quad (7)$$

where

$$\frac{\partial L}{\partial h^3_i} = \frac{\partial L}{\partial \tilde{y}} \tilde{y}(1 - \tilde{y}) w_{cls,i},$$
$$\frac{\partial L}{\partial h^2_i} = \mathbf{w}^2_{i,*} \left[\frac{\partial L}{\partial \mathbf{h}^3} \circ \mathbf{h}^3 \circ (1 - \mathbf{h}^3)\right],$$
$$\frac{\partial L}{\partial h^1_i} = \mathbf{w}^1_{i,*} \left[\frac{\partial L}{\partial \mathbf{h}^2} \circ \mathbf{h}^2 \circ (1 - \mathbf{h}^2)\right], \quad (8)$$

$\circ$ denotes the Hadamard product, that is, $(\mathbf{U} \circ \mathbf{V})_{i,j} = \mathbf{U}_{i,j} \mathbf{V}_{i,j}$; $\mathbf{w}^l_{i,*}$ is the $i$th row of $\mathbf{W}^l$, and $w_{cls,i}$ is the $i$th element of $\mathbf{w}_{cls}$. $L$ is the loss function. For example, $L = (y_{gnd} - \tilde{y})^2/2$ is the square loss, with $y_{gnd}$ the ground-truth label. $L = -y_{gnd} \log \tilde{y} - (1 - y_{gnd}) \log(1 - \tilde{y})$ is the log loss, which is chosen in this work.
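Continuing the sketch of Eq. (6) above, the following is a minimal sketch of the gradients in Eqs. (7)-(8) for the log loss (for which $\frac{\partial L}{\partial \tilde{y}} \tilde{y}(1 - \tilde{y})$ simplifies to $\tilde{y} - y_{gnd}$).

def visibility_backward(h, y_hat, y_gnd, g, W, w_cls):
    """Gradients of Eqs. (7)-(8); returns dL/ds for each of the three levels."""
    dL_dh = [None, None, (y_hat - y_gnd) * w_cls]         # level 3, Eq. (8)
    for l in (1, 0):                                       # propagate downwards
        delta = dL_dh[l + 1] * h[l + 1] * (1 - h[l + 1])   # Hadamard products
        dL_dh[l] = W[l] @ delta
    # Eq. (7): dL/ds^l = dL/dh^l * h^l (1 - h^l) * g^l, then back-propagated
    # into the deformation layer and the convolutional layers.
    return [dL_dh[l] * h[l] * (1 - h[l]) * g[l] for l in range(3)]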
In order to train this deep architecture, we adopt a multi-stage training strategy. We start with a 1-layer CNN trained with supervision. Since Gabor filters are similar to the human visual system, they are used for initializing the first CNN layer. We add one more layer at each stage; the layers trained in the previous stage are used for initialization, and then all the layers at the current stage are jointly optimized with BP.
4. Experimental Results

The proposed framework is evaluated on the Caltech dataset [12] and the ETH dataset [15]. In order to save computation, a detector using HOG+CSS and linear SVM is utilized for pruning candidate detection windows at both the training and testing stages. Approximately 60,000 training samples that are not pruned by the detector are used for training the deep model. At the testing stage, the execution time required by our deep model is less than 10% of the execution time required by the HOG+CSS+SVM detector, which has filtered most samples. In the deep learning model, the learning rate is fixed at 0.025 with batch size 60. Similar to [44, 24], a norm penalty is not used.
The labels and evaluation code provided by Dollár et al. online are used for evaluation, following the criteria proposed in [12]. As in [12], the log-average miss rate is used to summarize detector performance; it is computed by averaging the miss rate at nine FPPI rates that are evenly spaced in log-space in the range from $10^{-2}$ to $10^{0}$. In the experiments, we evaluate performance on the reasonable subset of the evaluated datasets. This subset, which is the most popular portion of the datasets, consists of pedestrians who are more than 49 pixels in height and whose occluded portions are less than 35%.
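As a concrete reading of this metric, the following is a minimal sketch. Sampling the detector's (FPPI, miss rate) curve by interpolation and averaging in log space (a geometric mean, as in Dollár et al.'s evaluation toolbox) are assumptions beyond the text above.

import numpy as np

def log_average_miss_rate(fppi, miss_rate):
    """Average miss rate at nine FPPI points evenly spaced in log space
    between 1e-2 and 1e0; fppi must be sorted in ascending order."""
    refs = np.logspace(-2, 0, 9)
    mr = np.interp(refs, fppi, miss_rate)      # sample the curve at the refs
    return np.exp(np.log(np.maximum(mr, 1e-10)).mean())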
The compared approaches are VJ [49], Shapelet [42], PoseInv [28], LatSVM-V1 [17], LatSVM-V2 [17], HikSVM [31], HOG [5], MultiFtr [52], HogLbp [51], Pls [43], MultiFtr+CSS, MultiFtr+Motion [50], FeatSynth [1], FPDW [10], ChnFtrs [11], MultiResC [37], CrossTalk [9], DN-HOG [34], and ConvNet-U-MS [44].