Joint Deep Learning for Pedestrian Detection
Wanli Ouyang and Xiaogang Wang
Department of Electronic Engineering, The Chinese University of Hong Kong
{wlouyang, xgwang}@ee.cuhk.edu.hk
Abstract

Feature extraction, deformation handling, occlusion handling, and classification are four important components in pedestrian detection. Existing methods learn or design these components either individually or sequentially. The interaction among these components is not yet well explored. This paper proposes that they should be jointly learned in order to maximize their strengths through cooperation. We formulate these four components into a joint deep learning framework and propose a new deep network architecture¹. By establishing automatic, mutual interaction among components, the deep model achieves a 9% reduction in the average miss rate compared with the current best-performing pedestrian detection approaches on the largest Caltech benchmark dataset.

¹ Code available at www.ee.cuhk.edu.hk/~wlouyang/projects/ouyangWiccv13Joint/index.html
1. Introduction

Pedestrian detection is a key technology in automotive safety, robotics, and intelligent video surveillance. It has attracted a great deal of research interest [2, 5, 12, 47, 8]. The main challenges of this task are caused by the intra-class variation of pedestrians in clothing, lighting, backgrounds, articulation, and occlusion.
In order to handle these challenges, a group of interdependent components is important. First, features should capture the most discriminative information of pedestrians. Well-known features such as Haar-like features [49], SIFT [29], and HOG [5] are designed to be robust to intra-class variation while remaining sensitive to inter-class variation. Second, deformation models should handle the articulation of human parts such as the torso, head, and legs. The state-of-the-art deformable part-based model in [17] allows human parts to articulate under constraints. Third, occlusion handling approaches [13, 51, 19] seek to identify the occluded regions and avoid their use when determining the existence of a pedestrian in a window. Finally, a classifier decides whether a candidate window shall be detected as enclosing a pedestrian. SVM [5], boosted classifiers [11], random forests [9], and their variations are often used.

Figure 1. Motivation of this paper to jointly learn the four key components in pedestrian detection: feature extraction, deformation handling models, occlusion handling models, and classifiers.
Although these components are interdependent, their interactions have not been well explored. Currently, they are first learned or designed individually or sequentially, and then put together in a pipeline. The interaction among these components is usually achieved through manual parameter configuration. Consider the following three examples. (1) The HOG feature is individually designed, with its parameters manually tuned for the linear SVM classifier used in [5]. The HOG feature then remains fixed when people design new classifiers [31]. (2) A few HOG feature parameters are tuned in [17] and then fixed; different part models are subsequently learned in [17, 58]. (3) With HOG features and deformable models fixed, occlusion handling models are learned in [34, 36], using the part-detection scores as input.
As shown in Fig. 1, the motivation of this paper is to establish automatic interaction in learning these key components. We hope that jointly learned components, like members with team spirit, can create synergy through close interaction and achieve performance greater than that of individually learned components. For example, well-learned features help to locate parts; meanwhile, well-located parts help to learn more discriminative features for different parts. This paper formulates the learning of these key components into a unified deep learning problem. The deep model is especially appropriate for this task because it can organize these components into different layers and jointly optimize them through back-propagation.
This paper makes the following three main contributions.
1. A unified deep model for jointly learning feature extraction, a part deformation model, an occlusion model, and classification. With the deep model, these components interact with each other in the learning process, which allows each component to maximize its strength when cooperating with the others.
2. We enrich the operations in deep models by incorporating the deformation layer into convolutional neural networks (CNNs) [26]. With this layer, various deformation handling approaches can be applied to our deep model.
3. The features are learned from pixels through interaction with the deformation and occlusion handling models. Such interaction helps to learn more discriminative features.
2. Related Work

It has been shown that deep models are potentially more capable than shallow models in handling complex tasks [3]. They have achieved spectacular progress in computer vision [20, 21, 40, 23, 25, 33, 24, 56, 30, 46, 16, 38]. Deep models for pedestrian detection focus on feature learning [44, 33], contextual information learning [57], and occlusion handling [34].
Many features are utilized for pedestrian detection. Haar-like features [49], HOG [5], and dense SIFT [48] are designed to capture the overall shape of pedestrians. First-order color features like color histograms [11], second-order color features like color self-similarity (CSS) [50], and co-occurrence features [43] are also used for pedestrian detection. Texture features like LBP are used in [51]. Other types of features include the covariance descriptor [47], depth [15], segmentation results [13], 3D geometry [22], and their combinations [27, 51, 11, 50, 13, 43]. All the features mentioned above are designed manually. Recently, researchers have become aware of the benefit of learning features from training data [1, 33, 44]. Similar to HOG, these approaches use local max pooling or average pooling to be robust to small local misalignment. However, they do not learn the variable deformation properties of body parts. The approach in [7] learns features and a part-based model sequentially but not jointly.
Since pedestrians undergo non-rigid deformation, the ability to handle deformation improves detection performance. Deformable part-based models are used in [17, 58, 37, 35] for handling the translational movement of parts. To handle more complex articulations, size change and rotation of parts are modeled in [18], and mixtures of part appearance and articulation types are modeled in [4, 55, 6]. In these approaches, features are manually designed.
In order to handle occlusion, many approaches have been proposed for estimating the visibility of parts [13, 51, 54, 53, 45, 27]. Some of them use the detection scores of blocks or parts [51, 34, 13, 54] as input for visibility estimation. Some use other cues like segmentation results [27, 13] and depth [13]. However, all these approaches learn occlusion modeling separately from feature extraction and part models.
The widely used classification approaches include various boosting classifiers [9, 11, 53], linear SVM [5], histogram intersection kernel SVM [31], latent SVM [17], multiple kernel SVM [48], structural SVM [58], and probabilistic models [2, 32]. In these approaches, classifiers are adapted to the training data, but features are designed manually. If useful information has been lost at the feature extraction stage, it cannot be recovered during classification. Ideally, classifiers should guide feature learning.

In summary, previous works treat the components individually or sequentially. This paper takes a global view of these components and is an important step towards jointly learning them for pedestrian detection.
3. Method

3.1. Overview of the proposed deep model

An overview of our proposed deep model is shown in Fig. 2. In this model:
1. Filtered data maps are obtained from the first convolutional layer. This layer convolves the 3-channel input image data with 9 × 9 × 3 filters and outputs 64 maps. |tanh(x)|, i.e. the tanh activation function followed by absolute-value rectification, is applied to each filter response x.
2. Feature maps are obtained by average pooling of the 64 filtered data maps using 4 × 4 boxcar filters with a 4 × 4 subsampling step.
3. Part detection maps are obtained from the second convolutional layer. This layer convolves the feature maps with 20 part filters of different sizes and outputs 20 part detection maps. Details are given in Section 3.3.
4. Part scores are obtained from the 20 part detection maps using a deformation handling layer. This layer outputs 20 part scores. Details are given in Section 3.4.
5. The visibility reasoning over the 20 parts is used for estimating the label y, that is, whether a given window encloses a pedestrian or not. Details are given in Section 3.5.
At the training stage, all the parameters are optimized through back-propagation (BP).
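As an illustration of steps 1-3, the following is a minimal PyTorch-style sketch of the feature-extraction front end. The map sizes follow Fig. 2; the per-part kernel sizes in part_sizes are illustrative assumptions, since the text only specifies that the 20 part filters have different sizes (see Section 3.3).

import torch
import torch.nn as nn

class FrontEnd(nn.Module):
    """Sketch of steps 1-3: conv1 with |tanh| rectification, 4x4 average
    pooling, and a second convolutional layer with per-part filter sizes."""
    def __init__(self, part_sizes):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 64, kernel_size=9)       # 3x84x28 -> 64x76x20
        self.pool = nn.AvgPool2d(kernel_size=4, stride=4)  # 64x76x20 -> 64x19x5
        # One filter per part; the sizes differ across the 20 parts (Sec. 3.3).
        self.part_filters = nn.ModuleList(
            [nn.Conv2d(64, 1, kernel_size=k) for k in part_sizes])

    def forward(self, x):                          # x: (batch, 3, 84, 28)
        x = torch.abs(torch.tanh(self.conv1(x)))   # |tanh(x)| rectification
        x = self.pool(x)
        # 20 part detection maps of different spatial sizes.
        return [f(x) for f in self.part_filters]

# Hypothetical kernel sizes: six small, seven medium, seven large parts.
model = FrontEnd(part_sizes=[3] * 6 + [4] * 7 + [5] * 7)
maps = model(torch.randn(1, 3, 84, 28))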

Figure 2. Overview of our deep model. The image data is convolved with 64 9 × 9 × 3 filters and average pooled to obtain 64 feature maps. The feature maps are then processed by the second convolutional layer and the deformation layer to obtain 20 part scores. Finally, the visibility reasoning model is used to estimate the detection label y.
3.2. Input data preparation

The detection windows are extracted into images of height 84 and width 28, in which pedestrians have height 60 and width 20. The input image data contains three channels.
(1) The first channel is an 84 × 28 Y-channel image after the image is converted into the YUV color space.
(2) The three-channel 42 × 14 images in the YUV color space are concatenated into the second channel of size 84 × 28 with zero padding.
(3) Four 42 × 14 edge maps are concatenated into the third channel of size 84 × 28. Three edge maps are obtained from the three-channel images in the YUV color space. The magnitudes of horizontal and vertical edges are computed using the Sobel edge detector. The fourth edge map is obtained by choosing the maximum magnitudes from the first three edge maps.
In this way, information about pixel values at different resolutions and information about primitive edges are utilized as the input of the first convolutional layer to extract features. The first convolutional layer and its following average pooling layer use the standard CNN settings.
We empirically find that it is better to arrange the images and edge maps into three concatenated channels instead of eight separate channels. In order to deal with illumination change, the data in each channel is preprocessed to zero mean and unit variance.
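As a concrete reading of this construction, here is a minimal NumPy/OpenCV sketch. The exact 2 × 2 tiling of the 42 × 14 maps into an 84 × 28 channel and the use of the combined Sobel gradient magnitude per channel are assumptions, since the layout is not spelled out above.

import numpy as np
import cv2

def make_input(window_bgr):
    """Sketch of the 3-channel 84x28 input of Sec. 3.2 from a BGR window."""
    full = cv2.resize(window_bgr, (28, 84))   # OpenCV dsize is (width, height)
    half = cv2.resize(window_bgr, (14, 42))
    yuv_full = cv2.cvtColor(full, cv2.COLOR_BGR2YUV).astype(np.float32)
    yuv_half = cv2.cvtColor(half, cv2.COLOR_BGR2YUV).astype(np.float32)

    def tile2x2(blocks):                      # assumed 2x2 layout: 84x28 exactly
        return np.vstack([np.hstack(blocks[:2]), np.hstack(blocks[2:])])

    # Channel 1: full-resolution Y image.
    ch1 = yuv_full[:, :, 0]
    # Channel 2: the three 42x14 YUV images plus a zero-padding block.
    zeros = np.zeros((42, 14), np.float32)
    ch2 = tile2x2([yuv_half[:, :, 0], yuv_half[:, :, 1], yuv_half[:, :, 2], zeros])
    # Channel 3: Sobel edge-magnitude maps of the three YUV channels plus
    # their element-wise maximum.
    edges = []
    for c in range(3):
        gx = cv2.Sobel(yuv_half[:, :, c], cv2.CV_32F, 1, 0)
        gy = cv2.Sobel(yuv_half[:, :, c], cv2.CV_32F, 0, 1)
        edges.append(np.sqrt(gx ** 2 + gy ** 2))
    edges.append(np.maximum.reduce(edges[:3]))
    ch3 = tile2x2(edges)

    x = np.stack([ch1, ch2, ch3])             # (3, 84, 28)
    # Zero mean and unit variance per channel to handle illumination change.
    mean = x.mean(axis=(1, 2), keepdims=True)
    std = x.std(axis=(1, 2), keepdims=True) + 1e-6
    return (x - mean) / std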
3.3. Generating the part detection map

Normally, the filter size of a convolutional layer is fixed [26, 24]. Since the parts of pedestrians have different sizes, we design the filters in the second convolutional layer with variable sizes. As shown in Fig. 3(a), we design parts at three levels with different sizes: there are six small parts at level 1, seven medium-sized parts at level 2, and seven large parts at level 3. A part at an upper level is composed of parts at the lower level. Parts at the top level are also the possible occlusion statuses. Gray color indicates occlusion. The other two levels are body parts. In the figure, the head-shoulder part appears twice (representing an occlusion status at the top level and a part at the middle level, respectively) because this body part itself can generate an occlusion status.

Figure 3. The parts model (a) and the filters (b) learned at the second convolutional layer. We follow [14] and visualize the filter that optimizes the corresponding stimuli of the neurons, which is also used in [25].

Fig. 3(b) shows a few part filters learned with our deep model. They are visualized using the activation maximization approach in [14]. The figure shows that the head-shoulder at level 2 and the head-shoulder at level 3 extract different visual cues from the input image. The head-shoulder filters in Fig. 3(b) contain more detailed silhouette information on heads and shoulders than the head-shoulder filter learned with HOG in Fig. 1. The two-legs filter in Fig. 3(b) is visually more meaningful than the one learned with HOG in Fig. 1.
3.4. The deformation layer

In order to learn the deformation constraints of different parts, we propose the deformation handling layer (deformation layer for short) for the CNNs.

The deformation layer takes the $P$ part detection maps as input and outputs $P$ part scores $\mathbf{s} = \{s_1, \ldots, s_P\}$, with $P = 20$ in Fig. 2. The deformation layer treats the detection maps separately and produces the $p$th part score $s_p$ from the $p$th part detection map, denoted by $\mathbf{M}_p$. A 2D summed map, denoted by $\mathbf{B}_p$, is obtained by summing up the part detection map $\mathbf{M}_p$ and the deformation maps as follows:

$$\mathbf{B}_p = \mathbf{M}_p + \sum_{n=1}^{N} c_{n,p} \mathbf{D}_{n,p}. \quad (1)$$

$\mathbf{D}_{n,p}$ denotes the $n$th deformation map for the $p$th part, $c_{n,p}$ denotes the weight for $\mathbf{D}_{n,p}$, and $N$ denotes the number of deformation maps. $s_p$ is globally max-pooled from the $\mathbf{B}_p$ in Eq. (1):

$$s_p = \max_{(x,y)} b_p^{(x,y)}, \quad (2)$$

where $b_p^{(x,y)}$ denotes the $(x,y)$th element of $\mathbf{B}_p$. The detected part location can be inferred from the summed map as follows:

$$(x,y)_p = \arg\max_{(x,y)} b_p^{(x,y)}. \quad (3)$$

At the training stage, only the value at location $(x,y)_p$ of $\mathbf{B}_p$ is used for learning the deformation parameters.

The $c_{n,p}$ and $\mathbf{D}_{n,p}$ in Eq. (1) are the key to designing different deformation models. Both $c_{n,p}$ and $\mathbf{D}_{n,p}$ can be considered as parameters to be learned. Three examples are given below.
Example 1. Suppose $N = 1$, $c_{1,p} = 1$, and the deformation map $\mathbf{D}_{1,p}$ is to be learned. In this case, the discrete locations of the $p$th part are treated as bins and the deformation cost for each bin is learned. $d_{1,p}^{(x,y)}$, which denotes the $(x,y)$th element of $\mathbf{D}_{1,p}$, corresponds to the deformation cost of the $p$th part at location $(x,y)$. The approach in [39] treats deformation as bins of locations.
Example 2. $\mathbf{D}_{1,p}$ can also be predefined. Suppose $N = 1$ and $c_{1,p} = 1$. If $d_{1,p}^{(x,y)}$ is the same for any $(x,y)$, then there is no deformation cost. If $d_{1,p}^{(x,y)} = -\infty$ for $(x,y) \notin \mathcal{X}$ and $d_{1,p}^{(x,y)} = 0$ for $(x,y) \in \mathcal{X}$, then the parts are only allowed to move freely in the location set $\mathcal{X}$. Max-pooling is a special case of this example, obtained by setting $\mathcal{X}$ to be a local region. The disadvantage of max-pooling is that the hand-tuned local region does not adapt to the different deformation properties of different parts.

Figure 4. The deformation layer when the deformation map is defined as in Eq. (4). The part detection map and the deformation maps are summed up with weights $c_{n,p}$ for $n = 1, 2, 3, 4$ to obtain the summed map $\mathbf{B}_p$. Global max pooling is then performed on the summed map to obtain the score $s_p$ for the $p$th part.
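A small numerical check of Example 2, with $\mathcal{X}$ chosen as a hypothetical local region: with $d_{1,p}^{(x,y)} = -\infty$ outside $\mathcal{X}$ and $0$ inside, global max pooling over the summed map reduces to local max pooling over $\mathcal{X}$.

import numpy as np

M = np.random.randn(19, 5).astype(np.float32)  # a part detection map
D1 = np.full_like(M, -np.inf)                  # deformation cost outside X
D1[6:10, 1:4] = 0.0                            # X: the part may move freely here
B = M + D1                                     # Eq. (1) with N = 1, c = 1
assert B.max() == M[6:10, 1:4].max()           # global max == local max over X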
Example 3. The deformation layer can represent the widely used quadratic constraint of deformation in [17]. Below, we omit the subscript $p$ used in Eq. (1) for conciseness. The quadratic constraint of deformation can be represented as follows:

$$b^{(x,y)} = m^{(x,y)} + c_1\left(x - a_x + \frac{c_3}{2c_1}\right)^2 + c_2\left(y - a_y + \frac{c_4}{2c_2}\right)^2, \quad (4)$$

where $m^{(x,y)}$ is the $(x,y)$th element of the part detection map $\mathbf{M}$, and $(a_x, a_y)$ is the predefined anchor location of the $p$th part. They are adjusted by $c_3/(2c_1)$ and $c_4/(2c_2)$, which are automatically learned. $c_1$ and $c_2$ in Eq. (4) decide the deformation cost. There is no deformation cost if $c_1 = c_2 = 0$. Parts are not allowed to move if $c_1 = c_2 = -\infty$. $(a_x, a_y)$ and $\left(\frac{c_3}{2c_1}, \frac{c_4}{2c_2}\right)$ jointly decide the center of the part. The quadratic constraint in Eq. (4) can be represented using Eq. (1) as follows:

$$\mathbf{B} = \mathbf{M} + c_1 \mathbf{D}_1 + c_2 \mathbf{D}_2 + c_3 \mathbf{D}_3 + c_4 \mathbf{D}_4 + c_5 \cdot \mathbf{1},$$
$$b^{(x,y)} = m^{(x,y)} + c_1 d_1^{(x,y)} + c_2 d_2^{(x,y)} + c_3 d_3^{(x,y)} + c_4 d_4^{(x,y)} + c_5,$$
$$d_1^{(x,y)} = (x - a_x)^2, \quad d_2^{(x,y)} = (y - a_y)^2, \quad d_3^{(x,y)} = x - a_x,$$
$$d_4^{(x,y)} = y - a_y, \quad c_5 = c_3^2/(4c_1) + c_4^2/(4c_2), \quad (5)$$

where $\mathbf{1}$ is a matrix with all elements being one and $d_n^{(x,y)}$ is the $(x,y)$th element of $\mathbf{D}_n$. In this case, $c_1, c_2, c_3$, and $c_4$ are parameters to be learned and the $\mathbf{D}_n$ are predefined. $c_5$ is the same at all locations and need not be learned. Fig. 4 illustrates this example, which is used as the deformation layer in this work.
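The following NumPy sketch implements Eqs. (1)-(3) with the quadratic deformation maps of Eq. (5). Negative $c_1$ and $c_2$ penalize displacement from the anchor; the example values are illustrative, since in the model these weights are learned.

import numpy as np

def deformation_layer(M, ax, ay, c1, c2, c3, c4):
    """Summed map B of Eq. (5); part score via Eq. (2), location via Eq. (3)."""
    H, W = M.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float32)   # y = row, x = column
    c5 = c3 ** 2 / (4 * c1) + c4 ** 2 / (4 * c2)     # constant offset
    B = (M + c1 * (xs - ax) ** 2 + c2 * (ys - ay) ** 2
           + c3 * (xs - ax) + c4 * (ys - ay) + c5)
    s_p = B.max()                                     # Eq. (2): global max pooling
    y, x = np.unravel_index(B.argmax(), B.shape)      # Eq. (3): detected location
    return s_p, (x, y)

# Illustrative values: negative c1, c2 make displacement from the anchor costly.
M = np.random.randn(19, 5).astype(np.float32)
score, loc = deformation_layer(M, ax=2.0, ay=9.0, c1=-0.1, c2=-0.1, c3=0.0, c4=0.0)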
3.5. Visibility reasoning and classification

The deformation layer in Section 3.4 provides the part scores $\mathbf{s} = \{s_1, \ldots, s_P\}$ using Eq. (2). $\mathbf{s}$ is then used for visibility reasoning and classification. We adopt the model in [34] to estimate visibility.

Figure 5. The visibility reasoning and detection label estimation model. For the $i$th part at the $l$th level, $s^l_i$ is the detection score and $h^l_i$ is the visibility. For example, $h^1_1$ indicates the visibility of the left-head-shoulder part. Best viewed in color.
Fig. 5 shows the model for the visibility reasoning and classification in Fig. 2. Denote the score and visibility of the $j$th part at level $l$ by $s^l_j$ and $h^l_j$, respectively. Denote the visibility of the $P_l$ parts at level $l$ by $\mathbf{h}^l = [h^l_1 \ldots h^l_{P_l}]^T$. Given $\mathbf{s}$, the model for BP and inference is as follows:

$$\tilde{h}^1_j = \sigma(c^1_j + g^1_j s^1_j),$$
$$\tilde{h}^{l+1}_j = \sigma(\tilde{\mathbf{h}}^{l\,T} \mathbf{w}^l_{*,j} + c^{l+1}_j + g^{l+1}_j s^{l+1}_j), \quad l = 1, 2,$$
$$\tilde{y} = \sigma(\tilde{\mathbf{h}}^{3\,T} \mathbf{w}_{cls} + b), \quad (6)$$

where $\sigma(t) = (1 + \exp(-t))^{-1}$ is the sigmoid function, $g^l_j$ is the weight for $s^l_j$, $c^l_j$ is its bias term, $\mathbf{W}^l$ models the correlation between $\mathbf{h}^l$ and $\mathbf{h}^{l+1}$, $\mathbf{w}^l_{*,j}$ is the $j$th column of $\mathbf{W}^l$, $\mathbf{w}_{cls}$ is considered as the linear classifier for the hidden units $\tilde{\mathbf{h}}^3$, and $\tilde{y}$ is the estimated detection label. Hidden variables at adjacent levels are connected. $\mathbf{w}^l_{*,j}$ represents the relationship between $\tilde{\mathbf{h}}^l$ and $\tilde{h}^{l+1}_j$. A part can have multiple parents and multiple children. The visibility of one part is correlated with the visibility of other parts at the same level through shared parents. $g^l_j$, $c^l_j$, $\mathbf{W}^l$, $\mathbf{w}_{cls}$, and $b$ are parameters to be learned.
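To make the reasoning model concrete, a minimal NumPy sketch of the inference in Eq. (6); levels are indexed 0..2 here instead of 1..3, and the per-level parameters are assumed to be packed as arrays (s[l], g[l], c[l] of length P_l; W[l] of shape P_l × P_{l+1}).

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def visibility_forward(s, g, c, W, w_cls, b):
    """Inference of Eq. (6). For the extra hidden nodes that use no
    detection score (see below), the corresponding g[l][j] is zero."""
    h = [sigmoid(c[0] + g[0] * s[0])]                        # level-1 visibilities
    for l in range(2):                                        # levels 2 and 3
        h.append(sigmoid(h[l] @ W[l] + c[l + 1] + g[l + 1] * s[l + 1]))
    y_hat = sigmoid(h[2] @ w_cls + b)                         # estimated label
    return h, y_hat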
The differences between the deep model in this paper and the approach in [34] are as follows:
1. In [34], the parts at levels 1 and 2 propagate information to the classifier through the parts at level 3, but the imperfect part scores at level 3 may disturb the information from levels 1 and 2. This paper includes extra hidden nodes at levels 2 and 3. These nodes provide branches that help parts at levels 1 and 2 to directly propagate information to the classifier without being disturbed by other parts. These extra hidden nodes do not use detection scores and have the term $g^{l+1}_j s^{l+1}_j = 0$ in Eq. (6). They are represented by white circles in Fig. 5, while the hidden nodes with the term $g^{l+1}_j s^{l+1}_j \neq 0$ in Eq. (6) are represented by gray circles.
2. The approach in [34] only learns the visibility relationship from part scores. Both the HOG features and the parameters of the deformation model are fixed in [34]. In this paper, features, deformable models, and visibility relationships are jointly learned.
In order to learn the parameters in the two convolutional layers and the deformation layer in Fig. 2, the prediction error is back-propagated through $\mathbf{s}$. The gradient for $\mathbf{s}$ is:

$$\frac{\partial L}{\partial s^l_i} = \frac{\partial L}{\partial h^l_i} \frac{\partial h^l_i}{\partial s^l_i} = \frac{\partial L}{\partial h^l_i} h^l_i (1 - h^l_i) g^l_i, \quad (7)$$

where

$$\frac{\partial L}{\partial h^3_i} = \frac{\partial L}{\partial \tilde{y}} \tilde{y}(1 - \tilde{y}) w_{cls,i},$$
$$\frac{\partial L}{\partial h^2_i} = \mathbf{w}^2_{i,*} \left[\frac{\partial L}{\partial \mathbf{h}^3} \circ \mathbf{h}^3 \circ (1 - \mathbf{h}^3)\right],$$
$$\frac{\partial L}{\partial h^1_i} = \mathbf{w}^1_{i,*} \left[\frac{\partial L}{\partial \mathbf{h}^2} \circ \mathbf{h}^2 \circ (1 - \mathbf{h}^2)\right], \quad (8)$$

$\circ$ denotes the Hadamard product, that is, $(\mathbf{U} \circ \mathbf{V})_{i,j} = \mathbf{U}_{i,j} \mathbf{V}_{i,j}$; $\mathbf{w}^l_{i,*}$ is the $i$th row of $\mathbf{W}^l$, and $w_{cls,i}$ is the $i$th element of $\mathbf{w}_{cls}$. $L$ is the loss function. For example, $L = (y_{gnd} - \tilde{y})^2/2$ is the square loss, with $y_{gnd}$ the ground-truth label. $L = -y_{gnd} \log \tilde{y} - (1 - y_{gnd}) \log(1 - \tilde{y})$ is the log loss, which is chosen in this work.
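Continuing the sketch of Eq. (6) above, the following is a minimal sketch of the gradients in Eqs. (7)-(8) for the log loss (for which $\frac{\partial L}{\partial \tilde{y}} \tilde{y}(1 - \tilde{y})$ simplifies to $\tilde{y} - y_{gnd}$).

def visibility_backward(h, y_hat, y_gnd, g, W, w_cls):
    """Gradients of Eqs. (7)-(8); returns dL/ds for each of the three levels."""
    dL_dh = [None, None, (y_hat - y_gnd) * w_cls]         # level 3, Eq. (8)
    for l in (1, 0):                                       # propagate downwards
        delta = dL_dh[l + 1] * h[l + 1] * (1 - h[l + 1])   # Hadamard products
        dL_dh[l] = W[l] @ delta
    # Eq. (7): dL/ds^l = dL/dh^l * h^l (1 - h^l) * g^l, then back-propagated
    # into the deformation layer and the convolutional layers.
    return [dL_dh[l] * h[l] * (1 - h[l]) * g[l] for l in range(3)]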
In order to train this deep architecture, we adopt a multi-stage training strategy. We start with a 1-layer CNN trained with supervision. Since Gabor filters are similar to the human visual system, they are used for initializing the first CNN layer. We add one more layer at each stage; the layers trained in the previous stage are used for initialization, and then all the layers at the current stage are jointly optimized with BP.
4. Experimental Results

The proposed framework is evaluated on the Caltech dataset [12] and the ETH dataset [15]. In order to save computation, a detector using HOG+CSS and linear SVM is utilized for pruning candidate detection windows at both the training and testing stages. Approximately 60,000 training samples that are not pruned by the detector are used for training the deep model. At the testing stage, the execution time required by our deep model is less than 10% of the execution time required by the HOG+CSS+SVM detector, which has filtered most samples. In the deep learning model, the learning rate is fixed at 0.025 with batch size 60. Similar to [44, 24], a norm penalty is not used.
The labels and evaluation code provided by Dollár et al. online are used for evaluation, following the criteria proposed in [12]. As in [12], the log-average miss rate is used to summarize detector performance; it is computed by averaging the miss rate at nine FPPI rates that are evenly spaced in log-space in the range from $10^{-2}$ to $10^{0}$. In the experiments, we evaluate performance on the reasonable subset of the evaluated datasets. This subset, which is the most popular portion of the datasets, consists of pedestrians who are more than 49 pixels in height and whose occluded portions are less than 35%.
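As a concrete reading of this metric, the following is a minimal sketch. Sampling the detector's (FPPI, miss rate) curve by interpolation and averaging in log space (a geometric mean, as in Dollár et al.'s evaluation toolbox) are assumptions beyond the text above.

import numpy as np

def log_average_miss_rate(fppi, miss_rate):
    """Average miss rate at nine FPPI points evenly spaced in log space
    between 1e-2 and 1e0; fppi must be sorted in ascending order."""
    refs = np.logspace(-2, 0, 9)
    mr = np.interp(refs, fppi, miss_rate)      # sample the curve at the refs
    return np.exp(np.log(np.maximum(mr, 1e-10)).mean())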
The compared approaches are VJ [49], Shapelet [42], PoseInv [28], LatSVM-V1 [17], LatSVM-V2 [17], HikSVM [31], HOG [5], MultiFtr [52], HogLbp [51], Pls [43], MultiFtr+CSS, MultiFtr+Motion [50], FeatSynth [1], FPDW [10], ChnFtrs [11], MultiResC [37], CrossTalk [9], DN-HOG [34], and ConvNet-U-MS [44].