Proceedings ArticleDOI

Regionlets for Generic Object Detection

01 Dec 2013, pp. 17-24
TL;DR: This work proposes to model an object class by a cascaded boosting classifier which integrates various types of features from competing local regions, named regionlets, and significantly outperforms the state-of-the-art on popular multi-class detection benchmark datasets with a single method.
Abstract: Generic object detection is confronted by dealing with different degrees of variations in distinct object classes with tractable computations, which demands descriptive and flexible object representations that are also efficient to evaluate for many locations. In view of this, we propose to model an object class by a cascaded boosting classifier which integrates various types of features from competing local regions, named as regionlets. A regionlet is a base feature extraction region defined proportionally to a detection window at an arbitrary resolution (i.e., size and aspect ratio). These regionlets are organized in small groups with stable relative positions to delineate fine-grained spatial layouts inside objects. Their features are aggregated to a one-dimensional feature within one group so as to tolerate deformations. Then we evaluate the object bounding box proposals from selective search based on segmentation cues, limiting the evaluation locations to thousands. Our approach significantly outperforms the state-of-the-art on popular multi-class detection benchmark datasets with a single method, without any contexts. It achieves a detection mean average precision of 41.7% on the PASCAL VOC 2007 dataset and 39.7% on VOC 2010 for 20 object categories. It achieves 14.7% mean average precision on the ImageNet dataset for 200 object categories, outperforming the latest deformable part-based model (DPM) by 4.7%.




Regionlets for Generic Object Detection
Xiaoyu Wang Ming Yang Shenghuo Zhu Yuanqing Lin
NEC Laboratories America, Inc.
{xwang,myang,zsh,ylin}@nec-labs.com
Abstract

Generic object detection is confronted by dealing with different degrees of variations in distinct object classes with tractable computations, which demands descriptive and flexible object representations that are also efficient to evaluate for many locations. In view of this, we propose to model an object class by a cascaded boosting classifier which integrates various types of features from competing local regions, named as regionlets. A regionlet is a base feature extraction region defined proportionally to a detection window at an arbitrary resolution (i.e., size and aspect ratio). These regionlets are organized in small groups with stable relative positions to delineate fine-grained spatial layouts inside objects. Their features are aggregated to a one-dimensional feature within one group so as to tolerate deformations. Then we evaluate the object bounding box proposals from selective search based on segmentation cues, limiting the evaluation locations to thousands. Our approach significantly outperforms the state-of-the-art on popular multi-class detection benchmark datasets with a single method, without any contexts. It achieves a detection mean average precision of 41.7% on the PASCAL VOC 2007 dataset and 39.7% on VOC 2010 for 20 object categories. It achieves 14.7% mean average precision on the ImageNet dataset for 200 object categories, outperforming the latest deformable part-based model (DPM) by 4.7%.
1. Introduction
Despite the success of face detection, where the target objects are roughly rigid, generic object detection remains an open problem, mainly due to the challenge of handling all possible variations with tractable computations. In particular, different object classes demonstrate variable degrees of deformation in images, either due to their nature, e.g., living creatures like cats are generally more deformable than man-made objects like vehicles, or due to viewing distances or angles, e.g., deformable objects may appear somewhat rigid at a distance and even rigid objects may show larger variations from different view angles. These pose a fundamental dilemma for object class representations: on one hand, a delicate model describing rigid object appearances may hardly handle deformable objects; on the other hand, a high tolerance of deformation may result in imprecise localization or false positives for rigid objects.

Figure 1: Illustration of the regionlet representation. The regionlet representation can be applied to candidate bounding boxes that have different sizes and aspect ratios. A regionlet-based model is composed of a number of regions (denoted by blue rectangles), and each region is represented by a group of regionlets (denoted by the small orange rectangles inside each region).
Prior arts in object detection cope with object deformation efficiently with primarily three typical strategies. First, if spatial layouts of object appearances are roughly rigid, such as faces or pedestrians at a distance, the classical Adaboost detection [26] mainly tackles local variations with an ensemble classifier of efficient features. Then a sliding window search with cascaded classifiers is an effective way to achieve precise and efficient localization. Second, the deformable part model (DPM) method [12] inherits the HOG window template matching [6] but explicitly models deformations by latent variables, where an exhaustive search of possible locations, scales, and aspect ratios is critical to localize objects. Later on, the DPM has been accelerated by coarse-to-fine search [19], branch and bound [16], and cross-talk approaches [9]. Third, object recognition methods using spatial pyramid matching (SPM) of bag-of-words (BoW) models [17] are adopted for detection [25], and they inherently can tolerate large deformations. These sophisticated detectors are applied to thousands of object-independent candidate regions [25, 2], instead of millions of sliding windows. In return, little modeling of local spatial appearances leaves these recognition classifiers unable to localize rigid objects precisely, e.g., bottles. These successful detection approaches inspire us to investigate a descriptive and flexible object representation, which delivers the modeling capacity for both rigid and deformable objects in a unified framework.
In this paper, we propose a new object representation strategy for generic object detection, which incorporates adaptive deformation handling into both object classifier learning and basic feature extraction. Each object bounding box is classified by a cascaded boosting classifier, where each weak classifier takes the feature response of a region inside the bounding box as its input, and the region is in turn represented by a group of small sub-regions, named as regionlets. The sets of regionlets are selected from a huge pool of candidate regionlet groups by boosting. On one hand, the relative spatial positions of the regionlets within a region and of the region within an object bounding box are stable. Therefore, the proposed regionlet representation can model fine-grained spatial appearance layouts. On the other hand, the feature responses of regionlets within one group are aggregated to a one-dimensional feature, and the resulting feature is generally robust to local deformation. Also, our regionlet model is designed to be flexible enough to take bounding boxes with different sizes and aspect ratios. Therefore our approach is ready to utilize the selective search strategy [25] to evaluate merely thousands of candidate bounding boxes rather than hundreds of thousands (if not millions) of sliding windows as in an exhaustive search.

An illustration of the regionlet representation is shown in Figure 1, where the regionlets drawn as orange boxes are grouped within blue rectangular regions. The regionlets and their groups for one object class are learned in boosting with stable relative positions to each other. When they are applied to two candidate bounding boxes, the feature responses of the regionlets are obtained at their respective scales and aspect ratios without enumerating all possible spatial configurations.
The major contribution of this paper is two-fold: 1) it introduces the regionlet concept, which is flexible enough to extract features from arbitrary bounding boxes; 2) it proposes the regionlet-based representation for an object class, which not only models relative spatial layouts inside an object but also accommodates variations, especially deformations, through regionlet group selection in boosting and the aggregation of feature responses within a regionlet group. As validated in the experiments, the proposed representation adaptively models a varying degree of deformation in diverse object classes.
2. Related Work
Object detection is arguably an indispensable component for most vision tasks, and it has achieved prominent successes for some specific targets such as faces [26, 14] and pedestrians [6, 24, 28, 4, 8]. A complete survey of object detection is certainly beyond the scope of this paper. Instead, we briefly review related generic object detection methods that do not focus on a particular type of object.

One of the most influential methods in generic object detection is the deformable part model (DPM) [12] and its extensions [12, 11, 31, 19]. The DPM object detector consists of a root filter and several part filters. Deformations among parts are inferred with latent variables. Since the resolutions of the object templates are fixed, an exhaustive sliding window search [12] is required to find objects at different scales and different aspect ratios. The exhaustive search can be accelerated by more efficient search [16, 15, 19, 9, 11, 4, 30]. In contrast, our regionlet-based detection handles object deformation directly in feature extraction, and it is flexible enough to deal with different scales and aspect ratios without the need of an exhaustive search.
Recently, a new detection strategy [2, 25, 20] is to use multi-scale image segmentation to propose a couple of thousand candidate bounding boxes for each image; the object categories of the bounding boxes are then determined by strong object classifiers, e.g., a bag-of-words (BoW) model with spatial pyramid matching (SPM) [25]. Because the BoW models ignore spatial relations among local features, they are able to tolerate large deformations. However, because of the lack of local spatial relations, they may not localize rigid objects precisely. Our method borrows the candidate window proposing procedure in [25] to speed up the detection; however, it is fundamentally different from [25] in both feature extraction and classifier learning.
Contexts from local or global appearance have been explored to improve object detection [23, 27, 8, 5, 18, 29]. We do not use any context cues in this paper and leave them as future work.
3. Regionlets for Detection
Object detection is composed of two key components: determining where the candidate locations are in images, and discerning whether they are the objects of interest. Beyond the straightforward exhaustive search of all locations, our regionlet detection approach screens the candidate windows derived from the selective search [25]. Given an image, selective search first over-segments the image into superpixels, and then those superpixels are grouped in a bottom-up manner to propose candidate bounding boxes. The work in [25] shows that such proposed bounding boxes, about 1,000-2,000 per image, achieve very high recall. After this, the task of detection boils down to extracting an appropriate object representation on each proposed box and learning a scoring function to rank the boxes. To that end, we introduce regionlet features for each candidate bounding box. In our proposed method, we construct a largely over-complete regionlet feature pool and then design a cascaded boosting learning process to select the most discriminative regionlets for detection.
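To make this two-stage pipeline concrete, here is a minimal Python sketch. The selective_search and score_box callables are hypothetical placeholders standing in for the proposal stage of [25] and the cascaded boosting classifier of Section 3.2; neither name comes from the paper.

```python
# Minimal sketch of the two-stage detection pipeline: propose candidate
# boxes by selective search, then rank them with a learned scoring
# function. `selective_search` and `score_box` are hypothetical stand-ins.
from typing import Callable, List, Tuple
import numpy as np

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels

def detect(image: np.ndarray,
           selective_search: Callable[[np.ndarray], List[Box]],
           score_box: Callable[[np.ndarray, Box], float],
           threshold: float = 0.0) -> List[Tuple[Box, float]]:
    # Stage 1: roughly 1,000-2,000 candidate boxes from bottom-up
    # grouping of superpixels, instead of millions of sliding windows.
    candidates = selective_search(image)
    # Stage 2: evaluate the learned scoring function on every candidate.
    scored = [(box, score_box(image, box)) for box in candidates]
    # Keep confident detections, highest score first.
    return sorted((d for d in scored if d[1] > threshold),
                  key=lambda d: -d[1])
```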
In the following, Section 3.1 describes what the regionlets are and explains how they are designed to handle deformation. Section 3.2 presents how to construct a largely over-complete regionlet pool and learn a cascaded boosting classifier for an object category by selecting the most discriminative regionlets.
3.1. Regionlets
3.1.1 Regionlet definition
In object detection, an object category is essentially defined by a classifier where both object appearance and the spatial layout inside an object shall be taken into account. For simplicity, appearance features are mostly extracted from rectangular sub-regions within an object, which we refer to as feature extraction regions in this paper. Features extracted from a small region often provide good localization ability but are vulnerable to variations; a big region tends to tolerate more variations but may not be sensitive enough for accurate localization. When large variations, especially deformations, occur, a large rectangular region may not be appropriate for extracting descriptive features of an object, because some parts of the region may not be informative or may even be distracting. This motivates us to define sub-parts of a region, i.e., the regionlets, as the basic units to extract appearance features, and to organize them into small groups which are more flexible to describe distinct object categories with different degrees of deformation.
We introduce the regionlets with an example illustrated in Figure 2. The first column in Figure 2 shows three samples of a person, the target object to detect, which are cropped by black bounding boxes in the second column. A rectangular feature extraction region inside the bounding box is denoted as R, which will contribute a weak classifier to the boosting classifier. Within this region R, we further spot some small sub-regions (e.g., r_1, r_2 and r_3) and define them as a group of regionlets. We employ the term regionlet because the features of these sub-regions will be aggregated to a single feature for R, and they are below the level of a standalone feature extraction region in an object classifier. In short, in the proposed method, a detection bounding box is represented by a number of regions, each of which is composed of a small set of regionlets.
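One plausible way to encode this two-level structure in code is sketched below; the class and field names are illustrative assumptions, not from the authors' implementation. Coordinates are stored relative to the detection window, anticipating the normalization of Section 3.1.3.

```python
# Sketch of the two-level representation: a detection model is a set of
# regions, each holding a small group of regionlets. Names are illustrative.
from dataclasses import dataclass
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (l', t', r', b'), window-relative

@dataclass
class RegionletGroup:
    region: Rect            # feature extraction region R
    regionlets: List[Rect]  # small sub-regions r_1, ..., r_N inside R
    feature_dim: int        # index k of the selected 1-D low-level feature

@dataclass
class RegionletModel:
    groups: List[RegionletGroup]  # selected by boosting, one per weak classifier
```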
This example also illustrates how regionlets are designed to handle deformation. The hand, as a supposedly informative part for a person, may appear at different locations within the bounding box of a person. If we extract the feature for a hand from the whole region R, which roughly covers the possible locations of the hand, the appearance of some non-hand regions on the torso or background is clearly also included in the feature. An ideal deformation handling strategy is to extract features only from the hand region in all three cases. To that end, we introduce three regionlets inside R (in general, a region can contain many regionlets; here "three" is mainly for illustration purposes). Each regionlet r covers a possible location of the hand. Then only features from the regionlets are extracted and aggregated to generate a compact representation for R. Irrelevant appearance from backgrounds is largely discarded. More regionlets in R will increase the capacity to model deformations, e.g., the hand surely may appear in more positions than three. On the other hand, rigid objects may only require one regionlet from a feature extraction region.

Figure 2: Illustration of the relationship among a detection bounding box, a feature extraction region and regionlets. A feature extraction region R, shown as a light blue rectangle, is cropped from a fixed position in 3 samples of a person. Inside R, several small sub-regions denoted as r_1, r_2 and r_3 (small orange rectangles) are the regionlets that capture the possible locations of the hand for person detection.
3.1.2 Region feature extraction
Feature extraction from R takes two steps: 1) extracting appearance features, e.g., the HOG [6] and LBP descriptors [1], from each regionlet respectively; and 2) generating the representation of R based on the regionlets' features. The first step is straightforward. For the second step, we define a permutation-invariant feature operation on the features extracted from regionlets, and such an operation also assumes an exclusive relation among regionlets. Let us denote T(R) as the feature representation for region R and T(r_j) as the feature extracted from the j-th regionlet r_j in R; the operation is then defined as follows:

T(R) = \sum_{j=1}^{N_R} \alpha_j T(r_j),    (1)

subject to \alpha_j \in \{0, 1\}, \quad \sum_{j=1}^{N_R} \alpha_j = 1,
where N_R is the total number of regionlets in region R, and \alpha_j is a binary variable, either 0 or 1. This operation is permutation invariant, namely, the occurrence of the appearance cues in any of the regionlets is equivalent, which allows deformations among these regionlet locations. The operation also assumes exclusiveness within a group of regionlets, namely, one and only one regionlet will contribute to the region feature representation. The exclusive assumption is that, when deformation occurs, the discriminative sub-region appears at only one position in a specific training/testing sample.
In our framework, we simply apply max-pooling over regionlet features, so Eq. 1 is instantiated as:

T(R) = \max_j T(r_j).    (2)
The max-pooling happens for each feature dimension independently. For each regionlet r_j, we first extract low-level feature vectors, such as HOG or LBP histograms. Then, we pick a 1-D feature from the same dimension of these feature vectors in each regionlet and apply Eq. 2 to form the feature for region R. We have millions of such 1-D features in a detection window, and the most discriminative ones are determined through a boosting-type learning process (to be described in Section 3.2.2).
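As a minimal sketch of this instantiation, the per-dimension max-pooling of Eq. 2 and the selection of a single dimension k can be written as follows (the array layout and function names are assumptions):

```python
# Sketch of Eq. 2: T(R) is the element-wise maximum of the regionlets'
# low-level descriptors (e.g., HOG or LBP histograms); a weak classifier
# then reads a single learned dimension k of T(R).
import numpy as np

def region_feature(regionlet_descriptors: np.ndarray) -> np.ndarray:
    """regionlet_descriptors: shape (num_regionlets, feature_dim).
    Returns T(R) with shape (feature_dim,)."""
    return regionlet_descriptors.max(axis=0)

def one_d_feature(regionlet_descriptors: np.ndarray, k: int) -> float:
    """The 1-D feature fed to a weak classifier: the maximum over
    regionlets of dimension k, i.e., T(R)[k]."""
    return float(regionlet_descriptors[:, k].max())
```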
Figure 3 illustrates the process to extract T(R), the 1-D feature for a region R. Here we again use the example in Figure 2, where the blue region R is the one covering the variation of hand locations. Assuming the first dimension of the concatenated low-level features is the most distinctive feature dimension learned for the hand, we collect this dimension from all three regionlets and represent T(R) by the strongest feature response from the top regionlet.
3.1.3 Regionlets normalized by detection windows
In this work, the proposed regionlet representations are evaluated on the candidate bounding boxes derived from the selective search approach [25]. In principle, they are also applicable to sliding windows. The selective search approach first over-segments an image into superpixels, and then the superpixels are grouped in a bottom-up manner to propose candidate bounding boxes. This approach typically produces 1,000 to 2,000 candidate bounding boxes for an object detector to evaluate, compared to millions of windows in an exhaustive sliding window search.

Figure 3: Example of regionlet-based feature extraction: the learned dimension is collected from the regionlets' features to form the 1-D feature for R.
However, these proposed bounding boxes have arbitrary sizes and aspect ratios, so it is not feasible to use template regions (or template regionlets) with fixed absolute sizes as widely used in sliding window search. We address this difficulty by using the positions and sizes of the regionlets and their groups relative to an object bounding box. Figure 4 shows our way of defining regionlets in contrast to fixed regions with absolute sizes. When using a sliding window search, a feature extraction region is often defined by its top-left corner (l, t) and bottom-right corner (r, b) w.r.t. the anchor position of the candidate bounding box. In contrast, our approach normalizes the coordinates by the width w and height h of the box and records the relative position of a region as (l', t', r', b') = (l/w, t/h, r/w, b/h) = R'. These relative region definitions allow us to directly evaluate the regionlet-based representation on candidate windows at different sizes and aspect ratios without scaling images into multiple resolutions or using multiple components to enumerate possible aspect ratios.
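The mapping from these normalized coordinates back to pixels for a given candidate box is a one-liner; the sketch below (with hypothetical names) shows the same relative region landing correctly in boxes of different sizes and aspect ratios.

```python
# Sketch of Figure 4's normalization: a region stored as
# (l', t', r', b') = (l/w, t/h, r/w, b/h) is mapped back to pixel
# coordinates for any candidate box, regardless of its size or
# aspect ratio. Function and variable names are illustrative.
from typing import Tuple

def to_absolute(rel: Tuple[float, float, float, float],
                box: Tuple[int, int, int, int]) -> Tuple[int, int, int, int]:
    """rel: (l', t', r', b') in [0, 1]; box: (x0, y0, x1, y1) in pixels."""
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    l, t, r, b = rel
    return (round(x0 + l * w), round(y0 + t * h),
            round(x0 + r * w), round(y0 + b * h))

# The same relative region applied to two boxes with different shapes:
rel = (0.25, 0.10, 0.75, 0.40)
print(to_absolute(rel, (0, 0, 200, 400)))   # -> (50, 40, 150, 160)
print(to_absolute(rel, (30, 20, 130, 80)))  # -> (55, 26, 105, 44)
```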
3.2. Learning the object detection model
We follow the boosting framework to learn the discriminative regionlet groups and their configurations from a huge pool of candidate regions and regionlets.
3.2.1 Regions/regionlets pool construction
Deformation may occur at different scales. For instance, in person detection, deformation can be caused by a moving finger or a waving hand. A set of small regionlets that is effective at capturing finger-level deformation may hardly handle deformation caused by hand movements. In order to deal with diverse variations, we build a largely over-complete pool of regions and regionlets with various positions, aspect ratios, and sizes. Before regionlet learning, a region R' or a regionlet r' is not applied to a detection window yet, so we call R' a feature region prototype and r' a regionlet prototype.

Figure 4: Relative regions normalized by a candidate window are robust to scale and aspect ratio changes: (a) the traditional definition (l, t, r, b) in absolute coordinates; (b) the normalized definition (l/w, t/h, r/w, b/h).
We first explain how the pool of region feature prototypes is constructed. Using the definition in Section 3.1.3, we denote the 1-D feature of a region relative to a bounding box as R' = (l', t', r', b', k), where k denotes the k-th element of the low-level feature vector of the region; R' represents a feature prototype. The region pool is spanned by X x Y x W x H x F, where X and Y are respectively the spaces of horizontal and vertical anchor positions of R in the detection window, W and H are the width and height of the feature extraction region R', and F is the space of low-level feature vectors (e.g., the concatenation of HOG and LBP). Enumerating all possible regions is impractical and not necessary. We employ a sampling process to reduce the pool size. Algorithm 1 describes how we sample multiple region feature prototypes. In our implementation, we generate about 100 million feature prototypes.
Afterwards, we propose a set of regionlets with random positions inside each region. Although the sizes of regionlets in a region could be arbitrary in general, we restrict the regionlets in a group to have identical size, because our regionlets are designed to capture the same appearance in different possible locations due to deformation. The sizes of regionlets in different groups could be different. A region may contain up to 5 regionlets in our implementation. So the final feature space used as the feature pool for boosting is spanned by R x C, where R is the region feature prototype space and C is the configuration space of regionlets. Therefore, we augment a region feature prototype as R' = (l', t', r', b', k, c) with a regionlet configuration c.
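A small sketch of this sampling step follows, under assumed parameters: the paper only specifies random positions, identical sizes within a group, and at most 5 regionlets per region, so the group size, regionlet size, and uniform placement below are illustrative choices.

```python
# Sketch of the regionlet-configuration step: inside a region prototype,
# sample a small group of identically sized regionlets at random
# positions. All parameter values here are illustrative.
import random
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (l', t', r', b')

def sample_regionlets(region: Rect, n: int, rw: float, rh: float,
                      rng: random.Random) -> List[Rect]:
    """n regionlets of size (rw, rh), placed uniformly inside `region`.
    Sizes are identical within the group, as the paper requires."""
    l, t, r, b = region
    assert rw <= r - l and rh <= b - t, "regionlet must fit inside region"
    out = []
    for _ in range(n):
        x = rng.uniform(l, r - rw)
        y = rng.uniform(t, b - rh)
        out.append((x, y, x + rw, y + rh))
    return out

group = sample_regionlets((0.2, 0.1, 0.8, 0.5), n=3, rw=0.15, rh=0.15,
                          rng=random.Random(0))
```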
Algorithm 1: Generation of region feature prototypes

Input: region width step s_w and height step s_h; maximum width W and height H of region prototypes; horizontal step p_x and vertical step p_y for the region anchor position; minimum width w_min and height h_min of region prototypes; the number of features N to extract from one region

1   begin
2       w ← w_min, h ← h_min, i ← 0
3       for w < W do
4           h ← h_min
5           for h < H do
6               l ← 0, t ← 0
7               for l < W − w do
8                   t ← 0
9                   for t < H − h do
10                      for k = 1, ..., N do
11                          r ← l + w, b ← t + h
12                          R' ← (l/w, t/h, r/w, b/h, k)
13                          R ← R ∪ {R'}
14                      t ← t + p_y, i ← i + 1
15                  l ← l + p_x
16              h ← h + s_h
17          w ← w + s_w

Output: Region feature prototype pool R
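For readers who prefer code, here is a direct Python transcription of Algorithm 1; it is a sketch that mirrors the listing above (including the normalization on line 12), with all step sizes and bounds left as caller-supplied parameters.

```python
# Python transcription of Algorithm 1: sample region feature prototypes
# over a grid of positions and sizes. In practice the steps and bounds
# are tuned so the pool stays tractable (~100M prototypes in the paper).
def region_prototypes(s_w, s_h, W, H, p_x, p_y, w_min, h_min, N):
    pool = []
    w = w_min
    while w < W:                    # enumerate prototype widths
        h = h_min
        while h < H:                # enumerate prototype heights
            l = 0
            while l < W - w:        # horizontal anchor positions
                t = 0
                while t < H - h:    # vertical anchor positions
                    for k in range(1, N + 1):   # low-level feature indices
                        r, b = l + w, t + h
                        pool.append((l / w, t / h, r / w, b / h, k))
                    t += p_y
                l += p_x
            h += s_h
        w += s_w
    return pool
```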
3.2.2 Training with boosting regionlet features
We use RealBoost [22] to train cascaded classifiers for our object detector. One boosting classifier consists of a set of selected weak classifiers. Similar to [14], we define the weak classifier using a lookup table:

h(x) = \sum_{o=1}^{n-1} v_o \mathbf{1}(B(x) = o),    (3)

where h(x) is a piece-wise linear function defined by a lookup table, v_o is the table value for the o-th entry, B(x) quantizes the feature value x into a table entry, and \mathbf{1}(\cdot) is an indicator function. In each round of the training, v_o is computed based on the sample weight distribution as v_o = \frac{1}{2} \ln\left(\frac{U_o^+}{U_o^-}\right), where U_o^+ is the summation of the weights of the positive examples whose feature values fall into the o-th entry of the table, and U_o^- is defined in a similar manner for the weights of the negative examples.
Let us denote Q as a candidate bounding box, R'(Q) as a rectangular region in Q, and T(R'(Q)) as the one-dimensional feature computed on R'(Q) (similar notation as in Eq. 1). Substituting x in Eq. 3 with the extracted feature, we can get the weak classifier in the t-th round of training.
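The lookup-table weak classifier of Eq. 3 can be sketched in Python as follows; the uniform binning used for B(x) and the smoothing constant are assumptions, since the paper does not specify them here.

```python
# Sketch of the lookup-table weak classifier (Eq. 3). B(x) quantizes a
# 1-D regionlet feature into one of n bins; each table entry stores
# v_o = 0.5 * ln(U_o^+ / U_o^-) computed from the boosting sample weights.
# Uniform binning and the smoothing eps are assumptions, not from the paper.
import numpy as np

def fit_weak_classifier(x, y, weights, n_bins=32, eps=1e-9):
    """x: (n,) feature values; y: labels in {+1, -1}; weights: (n,)."""
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    bins = np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)  # B(x)
    u_pos = np.bincount(bins, weights=weights * (y > 0), minlength=n_bins)
    u_neg = np.bincount(bins, weights=weights * (y < 0), minlength=n_bins)
    v = 0.5 * np.log((u_pos + eps) / (u_neg + eps))  # table values v_o
    return edges, v

def weak_classify(x, edges, v):
    """h(x): look up the table value of the bin that x falls into."""
    bins = np.clip(np.digitize(x, edges[1:-1]), 0, len(v) - 1)
    return v[bins]
```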

Citations
Journal ArticleDOI
TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) as mentioned in this paper is a benchmark in object category classification and detection on hundreds of object categories and millions of images, which has been run annually from 2010 to present, attracting participation from more than fifty institutions.
Abstract: The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the 5 years of the challenge, and propose future directions and improvements.

30,811 citations

Proceedings ArticleDOI
23 Jun 2014
TL;DR: RCNN as discussed by the authors combines CNNs with bottom-up region proposals to localize and segment objects, and when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost.
Abstract: Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012 -- achieving a mAP of 53.3%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. We also present experiments that provide insight into what the network learns, revealing a rich hierarchy of image features. Source code for the complete system is available at http://www.cs.berkeley.edu/~rbg/rcnn.

21,729 citations

Book ChapterDOI
06 Sep 2014
TL;DR: This work equips the networks with another pooling strategy, “spatial pyramid pooling”, to eliminate the fixed-size input requirement, and develops a new network structure, called SPP-net, which can generate a fixed-length representation regardless of image size/scale.
Abstract: Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g. 224×224) input image. This requirement is “artificial” and may hurt the recognition accuracy for the images or sub-images of an arbitrary size/scale. In this work, we equip the networks with a more principled pooling strategy, “spatial pyramid pooling”, to eliminate the above requirement. The new network structure, called SPP-net, can generate a fixed-length representation regardless of image size/scale. By removing the fixed-size limitation, we can improve all CNN-based image classification methods in general. Our SPP-net achieves state-of-the-art accuracy on the datasets of ImageNet 2012, Pascal VOC 2007, and Caltech101.

3,945 citations

Book ChapterDOI
06 Sep 2014
TL;DR: A novel method for generating object bounding box proposals using edges is proposed, showing results that are significantly more accurate than the current state-of-the-art while being faster to compute.
Abstract: The use of object proposals is an effective recent approach for increasing the computational efficiency of object detection. We propose a novel method for generating object bounding box proposals using edges. Edges provide a sparse yet informative representation of an image. Our main observation is that the number of contours that are wholly contained in a bounding box is indicative of the likelihood of the box containing an object. We propose a simple box objectness score that measures the number of edges that exist in the box minus those that are members of contours that overlap the box’s boundary. Using efficient data structures, millions of candidate boxes can be evaluated in a fraction of a second, returning a ranked set of a few thousand top-scoring proposals. Using standard metrics, we show results that are significantly more accurate than the current state-of-the-art while being faster to compute. In particular, given just 1000 proposals we achieve over 96% object recall at overlap threshold of 0.5 and over 75% recall at the more challenging overlap of 0.7. Our approach runs in 0.25 seconds and we additionally demonstrate a near real-time variant with only minor loss in accuracy.

2,892 citations

Book ChapterDOI
TL;DR: SPP-Net as mentioned in this paper proposes a spatial pyramid pooling strategy, which can generate a fixed-length representation regardless of image size/scale, and achieves state-of-the-art performance in object detection.
Abstract: Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g., 224x224) input image. This requirement is "artificial" and may reduce the recognition accuracy for the images or sub-images of an arbitrary size/scale. In this work, we equip the networks with another pooling strategy, "spatial pyramid pooling", to eliminate the above requirement. The new network structure, called SPP-net, can generate a fixed-length representation regardless of image size/scale. Pyramid pooling is also robust to object deformations. With these advantages, SPP-net should in general improve all CNN-based image classification methods. On the ImageNet 2012 dataset, we demonstrate that SPP-net boosts the accuracy of a variety of CNN architectures despite their different designs. On the Pascal VOC 2007 and Caltech101 datasets, SPP-net achieves state-of-the-art classification results using a single full-image representation and no fine-tuning. The power of SPP-net is also significant in object detection. Using SPP-net, we compute the feature maps from the entire image only once, and then pool features in arbitrary regions (sub-images) to generate fixed-length representations for training the detectors. This method avoids repeatedly computing the convolutional features. In processing test images, our method is 24-102x faster than the R-CNN method, while achieving better or comparable accuracy on Pascal VOC 2007. In ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014, our methods rank #2 in object detection and #3 in image classification among all 38 teams. This manuscript also introduces the improvement made for this competition.

2,304 citations

References
Proceedings Article
03 Dec 2012
TL;DR: A large, deep convolutional neural network, consisting of five convolutional layers, some followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax, achieved state-of-the-art results on ImageNet classification.
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

73,978 citations

Journal ArticleDOI
TL;DR: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene and can robustly identify objects among clutter and occlusion while achieving near real-time performance.
Abstract: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.

46,906 citations


"Regionlets for Generic Object Detec..." refers background in this paper

  • ...For example, sparse SIFT descriptors [38] are extracted on interest points, and the DCNN may only be evaluated on a few locations....

    [...]

Journal ArticleDOI
01 Jan 1998
TL;DR: In this article, a graph transformer network (GTN) is proposed for handwritten character recognition, which can be used to synthesize a complex decision surface that can classify high-dimensional patterns, such as handwritten characters.
Abstract: Multilayer neural networks trained with the back-propagation algorithm constitute the best example of a successful gradient based learning technique. Given an appropriate network architecture, gradient-based learning algorithms can be used to synthesize a complex decision surface that can classify high-dimensional patterns, such as handwritten characters, with minimal preprocessing. This paper reviews various methods applied to handwritten character recognition and compares them on a standard handwritten digit recognition task. Convolutional neural networks, which are specifically designed to deal with the variability of 2D shapes, are shown to outperform all other techniques. Real-life document recognition systems are composed of multiple modules including field extraction, segmentation recognition, and language modeling. A new learning paradigm, called graph transformer networks (GTN), allows such multimodule systems to be trained globally using gradient-based methods so as to minimize an overall performance measure. Two systems for online handwriting recognition are described. Experiments demonstrate the advantage of global training, and the flexibility of graph transformer networks. A graph transformer network for reading a bank cheque is also described. It uses convolutional neural network character recognizers combined with global training techniques to provide record accuracy on business and personal cheques. It is deployed commercially and reads several million cheques per day.

42,067 citations

Proceedings ArticleDOI
20 Jun 2005
TL;DR: It is shown experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection, and the influence of each stage of the computation on performance is studied.
Abstract: We study the question of feature sets for robust visual object recognition; adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.

31,952 citations


"Regionlets for Generic Object Detec..." refers background or methods in this paper

  • ...method [2] inherits HOG window template matching [3] but explicitly models deformations using latent variables....

    [...]

  • ...Most existing approaches [3], [2], [1], [4], [7] train an object detector at a fixed scale and aspect ratio....

    [...]

  • ...Viola and Jones’s face detector [1] employed Haar features in a cascaded boosting classifier to differentiate facial textures; Dalal and Triggs [3] proposed the Histogram of Oriented Gradients (HOG) templates to model pedestrian silhouettes by a linear SVM....

    [...]

  • ..., the HOG [3] and LBP descriptors [25] from each regionlet respectively; and 2) generating...

    [...]

  • ...HOG [3], LBP [25] and covariance features [19] are adopted as candidate features for the regionlets....

    [...]

Proceedings ArticleDOI
23 Jun 2014
TL;DR: RCNN as discussed by the authors combines CNNs with bottom-up region proposals to localize and segment objects, and when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost.
Abstract: Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012 -- achieving a mAP of 53.3%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. We also present experiments that provide insight into what the network learns, revealing a rich hierarchy of image features. Source code for the complete system is available at http://www.cs.berkeley.edu/~rbg/rcnn.

21,729 citations

Frequently Asked Questions (12)
Q1. What are the contributions in "Regionlets for generic object detection" ?

In view of this, the authors propose to model an object class by a cascaded boosting classifier which integrates various types of features from competing local regions, named as regionlets. Their approach significantly outperforms the state-of-the-art on popular multi-class detection benchmark datasets with a single method, without any contexts. 

As future work, the authors plan to improve the way of proposing bounding boxes in terms of recall and speed. Second, the authors will investigate how the context information can be integrated into the boosting learning process for further improving detection performance. 

A rectangle feature extraction region inside the bounding box is denoted as R, which will contribute a weak classifier to the boosting classifier. 

In their implementation of regionlet-based detection, the authors utilize the selective search bounding boxes from [25] to train their detector. 

The selective search approach first over-segments an image into superpixels, and then the superpixels are grouped in a bottom-up manner to propose some candidate bounding boxes. 

Despite the success of face detection where the target objects are roughly rigid, generic object detection remains an open problem mainly due to the challenge of handling all possible variations with tractable computations. 

Since the resolutions of the object templates are fixed, an exhaustive sliding window search [12] is required to find objects at different scales and different aspect ratios. 

When large variations, especially deformations, occur, a large rectangular region may not be appropriate for extracting descriptive features of an object. 

Due to the regionlets representation and enforced spatial layout learning, their proposed approach performs perfectly in both cases. 

These pose a fundamental dilemma to object class representations: on one hand, a delicate model describing rigid object appearances may hardly handle deformable objects; on the other hand, a high tolerance of deformation may result in imprecise localization or false positives for rigid objects. 

The proposed method is further validated on the much larger ImageNet object detection dataset (ILSVRC2013) [21], including 200 object categories. 

This approach typically produces 1000 to 2000 candidate bounding boxes for an object detector to evaluate on, compared to millions of windows in an exhaustive sliding window search.