Integral Channel Features
Piotr Dollár¹  pdollar@caltech.edu
Zhuowen Tu²  zhuowen.tu@loni.ucla.edu
Pietro Perona¹  perona@caltech.edu
Serge Belongie³  sjb@cs.ucsd.edu

¹ Dept. of Electrical Engineering, California Institute of Technology, Pasadena, CA, USA
² Lab of Neuro Imaging, University of California, Los Angeles, Los Angeles, CA, USA
³ Dept. of Computer Science and Eng., University of California, San Diego, San Diego, CA, USA
Abstract
We study the performance of ‘integral channel features’ for image classification tasks,
focusing in particular on pedestrian detection. The general idea behind integral chan-
nel features is that multiple registered image channels are computed using linear and
non-linear transformations of the input image, and then features such as local sums, his-
tograms, and Haar features and their various generalizations are efficiently computed
using integral images. Such features have been used in recent literature for a variety of
tasks; indeed, variations appear to have been invented independently multiple times.
Although integral channel features have proven effective, little effort has been devoted to
analyzing or optimizing the features themselves. In this work we present a unified view
of the relevant work in this area and perform a detailed experimental evaluation. We
demonstrate that when designed properly, integral channel features not only outperform
other features including histogram of oriented gradient (HOG), they also (1) naturally
integrate heterogeneous sources of information, (2) have few parameters and are insen-
sitive to exact parameter settings, (3) allow for more accurate spatial localization during
detection, and (4) result in fast detectors when coupled with cascade classifiers.
1 Introduction
The performance of object detection systems is determined by two key factors: the learning
algorithm and the feature representation. Considerable recent progress has been made both
on learning [8, 10, 24, 26] and features design [5, 23, 28]. In this work we use a standard
boosting approach [11] and instead focus our attention on the choice of features.
Our study is based on the following architecture: multiple registered image channels are
computed using linear and non-linear transformations of the input image [12, 17]; next, fea-
tures are extracted from each channel using sums over local rectangular regions. These local
sums, and features computed using multiple such sums, including Haar-like wavelets [27],
their various generalizations [7], and even local histograms [20], are computed efficiently
using integral images.
© 2009. The copyright of this document resides with its authors.
It may be distributed unchanged freely in print or electronic forms.

Figure 1: Multiple registered image channels are computed using various transformations of the
input image; next, features such as local sums, histograms, and Haar wavelets are computed efficiently
using integral images. Such features, which we refer to as integral channel features, naturally integrate
heterogeneous sources of information, have few parameters, and result in fast, accurate detectors.
We refer to such features as integral channel features (see Fig. 1). Integral
channel features combine the richness and diversity of information from the use of image
channels with the computational efficiency of the Viola and Jones detection framework [27].
A number of papers have utilized variants of integral channel features; applications have
included object recognition [16, 24], pedestrian detection [7, 31], edge detection [6], brain
anatomical structure segmentation [25] and local region matching [2]. A unified overview of
the feature representations in these works is given in Sec. 1.1.
Although integral channel features have proven effective, little effort has been devoted to
analyzing or optimizing the features themselves. In many of the above mentioned works the
focus was on the learning aspect [7, 8, 24] or novel applications [2, 6, 25] and it is difficult to
decouple the performance gains due to the richer features from gains due to more powerful
learning methods. In [16, 30, 31] the authors used integral channel features for computing
histograms of oriented gradients; although these methods achieved good performance, they
do not explore the full potential of the representation.
Furthermore, some of the integral channel features used in the literature have been com-
putationally expensive, e.g. the channels in [6] took over 30s to compute for a 640 × 480
image. In this work we show how to compute effective channels that take about 0.05-0.2s per
640 × 480 image, depending on the options selected. For 320 × 240 images, the channels can
be computed in real time at rates of 20-80 frames per second on a standard PC.
The INRIA pedestrian dataset [5] serves as our primary testbed. Pedestrian detection has
generated significant interest in the past few years [9]; moreover, the Histogram of Oriented
Gradient (HOG) descriptor [5] was designed specifically for the INRIA dataset. HOG has
since been successfully adopted for numerous object detection tasks and is one of the most
common features used in the PASCAL object challenges [19]. Therefore, not only is pedes-
trian detection interesting in and of itself, HOG’s success in other domains serves as strong
evidence that results obtained on pedestrians should generalize effectively.
Our detailed experimental exploration of integral channel features, along with a number
of performance optimizations and use of complementary channel types, leads to large gains
in performance. We show significantly improved results over previous applications of sim-
ilar features to pedestrian detection [7, 8, 31]. In fact, full-image evaluation on the INRIA
pedestrian dataset shows that learning using standard boosting coupled with our optimized
integral channel features matches or outperforms all but one other method, including state of
the art approaches obtained using HOG features with more sophisticated learning techniques.
On the task of accurate localization in the INRIA dataset, the proposed method outperforms

state of the art by a large margin. Finally, we show results on the recently introduced Caltech
Pedestrian Dataset [9], achieving a detection rate of almost 60% at 1 false positive per image
compared to at most 50% detection rate for competing methods, including HOG.
The remainder of this paper is organized as follows. We begin with a review of related
work below. In Sec. 2 we give a more detailed overview of integral channel features and we
discuss implementation details in Sec. 3. We perform a detailed experimental evaluation in
Sec. 4 and conclude in Sec. 5.
1.1 Related Work
The notion of channels can be traced back to the earliest days of computer vision. The
Roberts Cross Edge Detector [22] employed two tiny (2x2) kernels representing orthogonal
spatial derivative operators. The response of those filters, combined nonlinearly to obtain a
rudimentary measure of edge strength and orientation, could be thought of as the ur-channels.
Another early work was Fukushima’s Neocognitron architecture which used layered chan-
nels of increasing discriminative power [12]. In the following decades, numerous extensions
emerged. E.g., the texture discrimination approach of Malik & Perona [17] employed dozens
of channels computed via nonlinear combination of the responses of a bank of bandpass
filters. Malik & Perona performed spatial integration via Gaussian smoothing; eventually
statistics were pooled using histogram based representations [15, 21] still popular today.
Soon thereafter Viola and Jones proposed a boosted object detection approach with a
front-end that eschewed computationally expensive bandpass kernels for efficient Haar-like
wavelets implemented using integral images [27]. Nevertheless, with computing power in-
creasing rapidly, computing the responses of a bank of bandpass filters became less of a
bottleneck. The idea of using integral images on top of such channels to efficiently pool
statistics with different regions of support naturally followed, giving rise to the representa-
tion we refer to as integral channel features.
The first application of integral images to multiple image channels that we are aware
of appears in Tu’s work on probabilistic boosting trees [24]. Tu computed the response of
Gabor filters and Canny edges [4] at multiple scales and used Haar-like features on top of
these channels in a boosted object detection framework. Later, Tu extended this approach to
3D MRI brain volume segmentation in [25], where each resulting channel and Haar feature
was 3D. Likewise, Dollár et al. [6] used integral channel features to train a per-pixel edge
detector; state of the art results were obtained that matched the performance of methods using
more carefully tuned features. A large number of channels were used including gradients
at various scales, Gabor filter responses, difference of offset Gaussian filters, etc. Similar
channels were used for pedestrian detection [7] and learning patch descriptors [2].
In a different line of work, integral channel features have been used to compute his-
tograms of oriented gradients efficiently. As described in [20], rectangular histograms can
be computed efficiently by quantizing an image into multiple channels (details are given
in Sec. 2). This key idea has been exploited at least three separate times in the literature
[16, 30, 31]. Zhu et al. [31] used integral histograms to approximate HOG features for use
in a cascade. Laptev [16] likewise computed gradient histograms using integral images, re-
sulting in effective object detectors. In later work, [30] used a similar representation for cat
detection, except features were individual histogram bins as opposed to entire histograms as
in [16, 19]. Although details vary, none of these methods explored the richness of possible
channel types to seamlessly integrate heterogeneous sources of information.

Figure 2: Examples of integral channel features: (a) A first-order feature is the sum of pixels in a
rectangular region. (b) A Haar-like feature is a second-order feature approximating a local derivative
[27]. (c) Generalized Haar features include more complex combinations of weighted rectangles [7]. (d)
Histograms can be computed by evaluating local sums on quantized images [20] (see text for details).
2 Channel Types
Given an input image I, a corresponding channel is a registered map of the original im-
age, where the output pixels are computed from corresponding patches of input pixels (thus
preserving overall image layout). A trivial channel is simply C = I for a grayscale image,
likewise for a color image each color channel can serve as a channel. Other channels can be
computed using linear or non-linear transformations of I; various choices are discussed below
(see also Fig. 1). We use Ω to denote a channel generation function and write C = Ω(I).
To allow for fast detection using sliding window detectors, channels must be translationally
invariant, that is, given I and I′ related by a translation, C = Ω(I) and C′ = Ω(I′) must be
related by the same translation. This allows Ω to be evaluated once on the entire image rather
than separately for each overlapping detection window.
We define a first-order channel feature f as a sum of pixels in a fixed rectangular region
in a single channel, denoted by f (C). Using an integral image [27] for each channel, f (C)
can be computed with three floating point operations, and as such first-order features are
extremely fast to compute. We define higher-order channel features as any feature that can
be computed using multiple first-order features (e.g., the difference of two simple features in
the same channel). See Fig. 2 for example channel features.
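To make first-order and higher-order features concrete, the following minimal NumPy sketch (ours, not the authors' implementation; the helper names are hypothetical) builds a padded integral image for one channel and evaluates a rectangular sum and a Haar-like difference of two sums:

    import numpy as np

    def integral_image(channel):
        # Padded cumulative sum so that ii[y, x] equals channel[:y, :x].sum();
        # any rectangle sum then needs only four lookups and three additions.
        ii = channel.cumsum(axis=0).cumsum(axis=1)
        return np.pad(ii, ((1, 0), (1, 0)), mode='constant')

    def rect_sum(ii, y0, x0, y1, x1):
        # First-order channel feature: sum over rows [y0, y1) and columns [x0, x1).
        return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]

    C = np.random.rand(128, 64)                       # one channel of a 128 x 64 window
    ii = integral_image(C)
    f1 = rect_sum(ii, 10, 10, 30, 30)                 # first-order feature
    f2 = rect_sum(ii, 10, 10, 20, 30) - rect_sum(ii, 20, 10, 30, 30)   # Haar-like feature
    assert np.isclose(f1, C[10:30, 10:30].sum())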
The above is a generalization of the features used in the detection framework of Viola and
Jones (VJ) [27]. Specifically, VJ used C = I with features resembling Haar Basis functions
[18]. Using the terminology above, a Haar feature is a higher-order feature involving a sum
of 2-4 rectangles arranged in patterns that compute first and second order image derivatives
at multiple scales. This description is a bit oversimplified, as VJ actually used a two channel
representation where C_1(x,y) = I(x,y) and C_2(x,y) = I(x,y)^2. Together these two channels
were used for variance normalization of detection windows leading to partial invariance to
changing lighting conditions (see [27] for details).
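As an illustrative aside (our reading of [27], not code from that paper), keeping integral images of both I and I^2 lets the standard deviation of any detection window be obtained in constant time, which is what makes this variance normalization cheap. Reusing the padded integral images from the sketch above:

    import numpy as np

    def window_std(ii1, ii2, y0, x0, y1, x1):
        # ii1, ii2: padded integral images of the channels I and I^2.
        n = (y1 - y0) * (x1 - x0)
        s1 = ii1[y1, x1] - ii1[y0, x1] - ii1[y1, x0] + ii1[y0, x0]   # sum of I over the window
        s2 = ii2[y1, x1] - ii2[y0, x1] - ii2[y1, x0] + ii2[y0, x0]   # sum of I^2 over the window
        var = s2 / n - (s1 / n) ** 2
        return np.sqrt(max(var, 0.0))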
A number of channel types for an example image are shown in Fig. 1, panels a-h. Note
that all displayed channels are slightly smaller than the input image; it is of key importance to
discard channel boundaries to avoid introducing undesirable artifacts. Below we present an
overview of common channels used in the literature, all of which are translationally invariant.
Gray & Color: The simplest channel is simply a grayscale version of the image (panel
a). Color channels can also be used; in panel b we display the three CIE-LUV color channels.
Linear Filters: A straightforward approach for generating channels that capture different
aspects of an image is through use of linear filters. Panel c shows I convolved with 4 oriented
Gabor filters [17]; each channel contains edge information in I at a different orientation.
Convolving I with Difference of Gaussian (DoG) filters captures the ‘texturedness’ of the
image at different scales (panel d). Convolution with a large bank of filters can be slow;

nevertheless, linear filters are a simple and effective method for generating diverse channels.
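One possible way to generate such filter channels in a few lines (a sketch under our own choice of scales; the sigma pairs are illustrative, not values from the paper):

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def dog_channels(gray, sigma_pairs=((1.0, 2.0), (2.0, 4.0))):
        # Difference-of-Gaussian channels: each pair of scales yields a channel
        # responding to image structure in a different frequency band (panel d).
        return np.stack([gaussian_filter(gray, s1) - gaussian_filter(gray, s2)
                         for s1, s2 in sigma_pairs], axis=0)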
Nonlinear Transformations: There are countless translationally invariant non-linear
image transformations; here we present a few illustrative cases. Gradient magnitude (panel e)
captures unoriented edge strength while Canny edges (panel f) more explicitly compute edge
information. Gradients for both can be computed at different scales by smoothing the input
image; moreover, for color images a common trick is to compute the gradient on the 3 color
channels separately and use the maximum response [5]. Panel h shows channels obtained by
thresholding the input image with two different thresholds. Although this particular form of
foreground segmentation is not robust, more sophisticated algorithms can be used.
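For instance, the gradient magnitude and thresholding channels might be computed as follows (a hedged sketch; the finite-difference gradient and the threshold values are our own choices):

    import numpy as np

    def gradient_magnitude(img):
        # img: H x W x 3 float color image. The gradient magnitude is computed on
        # each color channel and the maximum response is kept, as suggested in [5].
        mags = []
        for c in range(img.shape[2]):
            gy, gx = np.gradient(img[:, :, c])
            mags.append(np.sqrt(gx ** 2 + gy ** 2))
        return np.max(np.stack(mags, axis=0), axis=0)

    def threshold_channels(gray, thresholds=(0.3, 0.6)):
        # Simple foreground-like channels obtained by thresholding the input (panel h).
        return np.stack([(gray > t).astype(np.float32) for t in thresholds], axis=0)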
Pointwise Transformations: Each pixel in a channel can be transformed by an arbi-
trary function as a post processing step. This can be used to overcome the limitation that
a feature f must be a local sum. E.g., taking the log, we can obtain local products since
exp(∑_i log(x_i)) = ∏_i x_i. Likewise, raising each element to the p-th power can be used to
compute the generalized mean: ((1/n) ∑_{i=1}^n x_i^p)^{1/p}. Setting p to a large value approximates
the maximum of x_i, which can be useful for accumulating values as argued in [12].
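A brief sketch of these pointwise transformations (our illustration; it assumes strictly positive channel values for the log and values roughly in [0, 1] for the power):

    import numpy as np

    def _integral(c):
        # Padded integral image (see the earlier sketch).
        return np.pad(c.cumsum(0).cumsum(1), ((1, 0), (1, 0)), mode='constant')

    def _rect(ii, y0, x0, y1, x1):
        return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]

    def local_product(channel, y0, x0, y1, x1, eps=1e-12):
        # exp of the rectangular sum of log(x) recovers the product of x in the rectangle.
        return np.exp(_rect(_integral(np.log(channel + eps)), y0, x0, y1, x1))

    def generalized_mean(channel, y0, x0, y1, x1, p=10.0):
        # ((1/n) * sum(x_i^p))^(1/p); a large p approximates the local maximum.
        n = (y1 - y0) * (x1 - x0)
        s = _rect(_integral(channel ** p), y0, x0, y1, x1)
        return (s / n) ** (1.0 / p)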
Integral Histogram: Porikli [20] presented an efficient method for computing his-
tograms using integral images. Let Q be a quantized version of I with values in {1,...,q},
and let Q_i(x,y) = 1[Q(x,y) = i] for each i ≤ q (1 is the indicator function). Counting the
elements in a region of Q equal to i can be computed by summing over Q_i in the same re-
gion. Therefore, a histogram over a rectangular region can be efficiently computed given an
integral image for each ‘histogram’ channel Q_i. Although histogram channels can be com-
puted over any input [16], the most common use is a variant known as gradient histograms
[16, 30, 31] described below. Note that a related representation known as ‘value encoding’
has been studied extensively in the neuroscience literature, e.g. see [3].
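A minimal sketch of the integral histogram idea (ours, with an arbitrary quantization of a grayscale image in [0, 1] into q levels):

    import numpy as np

    def histogram_channels(gray, q=8):
        # One binary channel per quantization level; the integral image of each
        # channel turns any rectangular histogram into q rectangle sums.
        Q = np.minimum((gray * q).astype(int), q - 1)             # values in {0, ..., q-1}
        chans = [(Q == i).astype(np.float32) for i in range(q)]
        return [np.pad(c.cumsum(0).cumsum(1), ((1, 0), (1, 0)), mode='constant')
                for c in chans]

    def region_histogram(iis, y0, x0, y1, x1):
        # One rectangle sum per bin yields the histogram of the region.
        return np.array([ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0] for ii in iis])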
Gradient Histogram: A gradient histogram is a weighted histogram where bin index is
determined by gradient angle and weight by gradient magnitude. In other words the channels
are given by Q_θ(x,y) = G(x,y) · 1[Θ(x,y) = θ], where G(x,y) and Θ(x,y) are the gradient
magnitude and quantized gradient angle, respectively, at I(x,y). An example quantized to
4 orientations is shown in panel g. Gradients at different scales can again be obtained by
smoothing I, and an additional channel storing gradient magnitude (panel e) can be used
to L_1 normalize the resulting histogram, thus allowing gradient histograms to approximate
HOG features [31]. Note also the similarity between the channels in panels c and g. Although
seemingly quite different, convolving I with oriented Gabor filters and quantizing gradient
magnitude by gradient angle yields channels capturing qualitatively similar information.
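The gradient histogram channels can be sketched as follows (our illustration; the use of unsigned orientations and six bins is an assumption on our part, not a prescription from the text):

    import numpy as np

    def gradient_histogram_channels(gray, n_orient=6):
        # Q_theta(x, y) = G(x, y) * 1[Theta(x, y) == theta]: the gradient magnitude is
        # scattered into channels indexed by the quantized (unsigned) gradient angle.
        gy, gx = np.gradient(gray)
        mag = np.sqrt(gx ** 2 + gy ** 2)
        ang = np.mod(np.arctan2(gy, gx), np.pi)                   # orientation in [0, pi)
        bins = np.minimum((ang / np.pi * n_orient).astype(int), n_orient - 1)
        chans = np.zeros((n_orient,) + gray.shape, dtype=np.float32)
        for t in range(n_orient):
            chans[t][bins == t] = mag[bins == t]
        return chans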
We briefly discuss channel scale. We differentiate between pre-smoothing (smoothing
the input image) and post-smoothing (smoothing the created channels). In the terminology
of Gårding et al. [13], pre-smoothing determines the ‘local scale’ of the representation and
serves to suppress fine scale structures as well as image noise. Pre-smoothing can have
a significant impact on the information captured, e.g. pre-smoothing determines the scale
of subsequently computed gradients, affecting computed Canny edges (panel f) or gradient
histograms (panel g). Post-smoothing, on the other hand, helps determine the ‘integration
scale’ over which information is pooled. However, the integration scale is also determined by
the support of local sums computed over the channels, making post-smoothing less useful.
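The distinction can be made explicit in code (a small sketch with arbitrary default sigmas):

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def smoothed_gradient_channel(gray, pre_sigma=1.0, post_sigma=0.0):
        # Pre-smoothing sets the 'local scale' before gradients are taken; optional
        # post-smoothing pools the channel, but the local sums used as features
        # already provide an integration scale, so it is often unnecessary.
        g = gaussian_filter(gray, pre_sigma) if pre_sigma > 0 else gray
        gy, gx = np.gradient(g)
        mag = np.sqrt(gx ** 2 + gy ** 2)
        return gaussian_filter(mag, post_sigma) if post_sigma > 0 else mag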
The channel representation has a number of appealing properties. Most channels defined
above can be computed with a few lines of code given standard image processing tools;
furthermore, many of the channels can be computed very efficiently (we discuss optimization
techniques in Sec. 3). In Sec. 4 we show that most parameters used to compute the channels
are not crucial (assuming they’re consistent between training and testing).
