Integral Channel Features
Piotr Dollár¹  pdollar@caltech.edu
Zhuowen Tu²  zhuowen.tu@loni.ucla.edu
Pietro Perona¹  perona@caltech.edu
Serge Belongie³  sjb@cs.ucsd.edu

¹ Dept. of Electrical Engineering, California Institute of Technology, Pasadena, CA, USA
² Lab of Neuro Imaging, University of California, Los Angeles, Los Angeles, CA, USA
³ Dept. of Computer Science and Eng., University of California, San Diego, San Diego, CA, USA
Abstract
We study the performance of ‘integral channel features’ for image classification tasks,
focusing in particular on pedestrian detection. The general idea behind integral chan-
nel features is that multiple registered image channels are computed using linear and
non-linear transformations of the input image, and then features such as local sums, his-
tograms, and Haar features and their various generalizations are efficiently computed
using integral images. Such features have been used in recent literature for a variety of
tasks; indeed, variations appear to have been invented independently multiple times.
Although integral channel features have proven effective, little effort has been devoted to
analyzing or optimizing the features themselves. In this work we present a unified view
of the relevant work in this area and perform a detailed experimental evaluation. We
demonstrate that when designed properly, integral channel features not only outperform
other features including histogram of oriented gradient (HOG), they also (1) naturally
integrate heterogeneous sources of information, (2) have few parameters and are insen-
sitive to exact parameter settings, (3) allow for more accurate spatial localization during
detection, and (4) result in fast detectors when coupled with cascade classifiers.
1 Introduction
The performance of object detection systems is determined by two key factors: the learning
algorithm and the feature representation. Considerable recent progress has been made both
on learning [8, 10, 24, 26] and features design [5, 23, 28]. In this work we use a standard
boosting approach [11] and instead focus our attention on the choice of features.
Our study is based on the following architecture: multiple registered image channels are
computed using linear and non-linear transformations of the input image [12, 17]; next, fea-
tures are extracted from each channel using sums over local rectangular regions. These local
sums, and features computed using multiple such sums, including Haar-like wavelets [27],
their various generalizations [7], and even local histograms [20], are computed efficiently
using integral images.
© 2009. The copyright of this document resides with its authors.
It may be distributed unchanged freely in print or electronic forms.

Figure 1: Multiple registered image channels are computed using various transformations of the
input image; next, features such as local sums, histograms, and Haar wavelets are computed efficiently
using integral images. Such features, which we refer to as integral channel features, naturally integrate
heterogeneous sources of information, have few parameters, and result in fast, accurate detectors.
We refer to such features as integral channel features (see Fig. 1). Integral
channel features combine the richness and diversity of information from the use of image
channels with the computational efficiency of the Viola and Jones detection framework [27].
A number of papers have utilized variants of integral channel features; applications have
included object recognition [16, 24], pedestrian detection [7, 31], edge detection [6], brain
anatomical structure segmentation [25] and local region matching [2]. A unified overview of
the feature representations in these works is given in Sec. 1.1.
Although integral channel features have proven effective, little effort has been devoted to
analyzing or optimizing the features themselves. In many of the above mentioned works the
focus was on the learning aspect [7, 8, 24] or novel applications [2, 6, 25] and it is difficult to
decouple the performance gains due to the richer features from gains due to more powerful
learning methods. In [16, 30, 31] the authors used integral channel features for computing
histograms of oriented gradients; although these methods achieved good performance, they
do not explore the full potential of the representation.
Furthermore, some of the integral channel features used in the literature have been com-
putationally expensive, e.g. the channels in [6] took over 30s to compute for a 640 × 480
image. In this work we show how to compute effective channels that take about 0.05-0.2s per
640 × 480 image, depending on the options selected. For 320 × 240 images, the channels can
be computed in real time at rates of 20-80 frames per second on a standard PC.
The INRIA pedestrian dataset [5] serves as our primary testbed. Pedestrian detection has
generated significant interest in the past few years [9]; moreover, the Histogram of Oriented
Gradient (HOG) descriptor [5] was designed specifically for the INRIA dataset. HOG has
since been successfully adopted for numerous object detection tasks and is one of the most
common features used in the PASCAL object challenges [19]. Therefore, not only is pedes-
trian detection interesting in and of itself, HOG’s success in other domains serves as strong
evidence that results obtained on pedestrians should generalize effectively.
Our detailed experimental exploration of integral channel features, along with a number
of performance optimizations and use of complementary channel types, leads to large gains
in performance. We show significantly improved results over previous applications of sim-
ilar features to pedestrian detection [7, 8, 31]. In fact, full-image evaluation on the INRIA
pedestrian dataset shows that learning using standard boosting coupled with our optimized
integral channel features matches or outperforms all but one other method, including state of
the art approaches obtained using HOG features with more sophisticated learning techniques.
On the task of accurate localization in the INRIA dataset, the proposed method outperforms

state of the art by a large margin. Finally, we show results on the recently introduced Caltech
Pedestrian Dataset [9], achieving a detection rate of almost 60% at 1 false positive per image
compared to at most 50% detection rate for competing methods, including HOG.
The remainder of this paper is organized as follows. We begin with a review of related
work below. In Sec. 2 we give a more detailed overview of integral channel features and we
discuss implementation details in Sec. 3. We perform a detailed experimental evaluation in
Sec. 4 and conclude in Sec. 5.
1.1 Related Work
The notion of channels can be traced back to the earliest days of computer vision. The
Roberts Cross Edge Detector [22] employed two tiny (2x2) kernels representing orthogonal
spatial derivative operators. The response of those filters, combined nonlinearly to obtain a
rudimentary measure of edge strength and orientation, could be thought of as the ur-channels.
Another early work was Fukushima’s Neocognitron architecture which used layered chan-
nels of increasing discriminative power [12]. In the following decades, numerous extensions
emerged. E.g., the texture discrimination approach of Malik & Perona [17] employed dozens
of channels computed via nonlinear combination of the responses of a bank of bandpass
filters. Malik & Perona performed spatial integration via Gaussian smoothing; eventually
statistics were pooled using histogram based representations [15, 21] still popular today.
Soon thereafter Viola and Jones proposed a boosted object detection approach with a
front-end that eschewed computationally expensive bandpass kernels for efficient Haar-like
wavelets implemented using integral images [27]. Nevertheless, with computing power in-
creasing rapidly, computing the responses of a bank of bandpass filters became less of a
bottleneck. The idea of using integral images on top of such channels to efficiently pool
statistics with different regions of support naturally followed, giving rise to the representa-
tion we refer to as integral channel features.
The first application of integral images to multiple image channels that we are aware
of appears in Tu’s work on probabilistic boosting trees [24]. Tu computed the response of
Gabor filters and Canny edges [4] at multiple scales and used Haar-like features on top of
these channels in a boosted object detection framework. Later, Tu extended this approach to
3D MRI brain volume segmentation in [25], where each resulting channel and Haar feature
was 3D. Likewise, Dollár et al. [6] used integral channel features to train a per-pixel edge
detector; state of the art results were obtained that matched the performance of methods using
more carefully tuned features. A large number of channels were used including gradients
at various scales, Gabor filter responses, difference of offset Gaussian filters, etc. Similar
channels were used for pedestrian detection [7] and learning patch descriptors [2].
In a different line of work, integral channel features have been used to compute his-
tograms of oriented gradients efficiently. As described in [20], rectangular histograms can
be computed efficiently by quantizing an image into multiple channels (details are given
in Sec. 2). This key idea has been exploited at least three separate times in the literature
[16, 30, 31]. Zhu et al. [31] used integral histograms to approximate HOG features for use
in a cascade. Laptev [16] likewise computed gradient histograms using integral images, re-
sulting in effective object detectors. In later work, [30] used a similar representation for cat
detection, except features were individual histogram bins as opposed to entire histograms as
in [16, 19]. Although details vary, none of these methods explored the richness of possible
channel types to seamlessly integrate heterogeneous sources of information.

Figure 2: Examples of integral channel features: (a) A first-order feature is the sum of pixels in a
rectangular region. (b) A Haar-like feature is a second-order feature approximating a local derivative
[27]. (c) Generalized Haar features include more complex combinations of weighted rectangles [7]. (d)
Histograms can be computed by evaluating local sums on quantized images [20] (see text for details).
2 Channel Types
Given an input image I, a corresponding channel is a registered map of the original im-
age, where the output pixels are computed from corresponding patches of input pixels (thus
preserving overall image layout). A trivial channel is simply C = I for a grayscale image,
likewise for a color image each color channel can serve as a channel. Other channels can be
computed using linear or non-linear transformations of I; various choices are discussed below
(see also Fig. 1). We use Ω to denote a channel generation function and write C = Ω(I).
To allow for fast detection using sliding window detectors, channels must be translationally
invariant, that is, given I and I′ related by a translation, C = Ω(I) and C′ = Ω(I′) must be
related by the same translation. This allows Ω to be evaluated once on the entire image rather
than separately for each overlapping detection window.
We define a first-order channel feature f as a sum of pixels in a fixed rectangular region
in a single channel, denoted by f (C). Using an integral image [27] for each channel, f (C)
can be computed with three floating point operations, and as such first-order features are
extremely fast to compute. We define higher-order channel features as any feature that can
be computed using multiple first-order features (e.g., the difference of two simple features in
the same channel). See Fig. 2 for example channel features.
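To make first-order and higher-order features concrete, the following minimal NumPy sketch (ours, not the authors' implementation; the helper names are hypothetical) builds a padded integral image for one channel and evaluates a rectangular sum and a Haar-like difference of two sums:

    import numpy as np

    def integral_image(channel):
        # Padded cumulative sum so that ii[y, x] equals channel[:y, :x].sum();
        # any rectangle sum then needs only four lookups and three additions.
        ii = channel.cumsum(axis=0).cumsum(axis=1)
        return np.pad(ii, ((1, 0), (1, 0)), mode='constant')

    def rect_sum(ii, y0, x0, y1, x1):
        # First-order channel feature: sum over rows [y0, y1) and columns [x0, x1).
        return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]

    C = np.random.rand(128, 64)                       # one channel of a 128 x 64 window
    ii = integral_image(C)
    f1 = rect_sum(ii, 10, 10, 30, 30)                 # first-order feature
    f2 = rect_sum(ii, 10, 10, 20, 30) - rect_sum(ii, 20, 10, 30, 30)   # Haar-like feature
    assert np.isclose(f1, C[10:30, 10:30].sum())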
The above is a generalization of the features used in the detection framework of Viola and
Jones (VJ) [27]. Specifically, VJ used C = I with features resembling Haar Basis functions
[18]. Using the terminology above, a Haar feature is a higher-order feature involving a sum
of 2-4 rectangles arranged in patterns that compute first and second order image derivatives
at multiple scales. This description is a bit oversimplified, as VJ actually used a two channel
representation where C_1(x,y) = I(x,y) and C_2(x,y) = I(x,y)^2. Together these two channels
were used for variance normalization of detection windows leading to partial invariance to
changing lighting conditions (see [27] for details).
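As an illustrative aside (our reading of [27], not code from that paper), keeping integral images of both I and I^2 lets the standard deviation of any detection window be obtained in constant time, which is what makes this variance normalization cheap. Reusing the padded integral images from the sketch above:

    import numpy as np

    def window_std(ii1, ii2, y0, x0, y1, x1):
        # ii1, ii2: padded integral images of the channels I and I^2.
        n = (y1 - y0) * (x1 - x0)
        s1 = ii1[y1, x1] - ii1[y0, x1] - ii1[y1, x0] + ii1[y0, x0]   # sum of I over the window
        s2 = ii2[y1, x1] - ii2[y0, x1] - ii2[y1, x0] + ii2[y0, x0]   # sum of I^2 over the window
        var = s2 / n - (s1 / n) ** 2
        return np.sqrt(max(var, 0.0))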
A number of channel types for an example image are shown in Fig. 1, panels a-h. Note
that all displayed channels are slightly smaller than the input image; it is of key importance to
discard channel boundaries to avoid introducing undesirable artifacts. Below we present an
overview of common channels used in the literature, all of which are translationally invariant.
Gray & Color: The simplest channel is simply a grayscale version of the image (panel
a). Color channels can also be used; in panel b we display the three CIE-LUV color channels.
Linear Filters: A straightforward approach for generating channels that capture different
aspects of an image is through use of linear filters. Panel c shows I convolved with 4 oriented
Gabor filters [17]; each channel contains edge information in I at a different orientation.
Convolving I with Difference of Gaussian (DoG) filters captures the ‘texturedness’ of the
image at different scales (panel d). Convolution with a large bank of filters can be slow;

nevertheless, linear filters are a simple and effective method for generating diverse channels.
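One possible way to generate such filter channels in a few lines (a sketch under our own choice of scales; the sigma pairs are illustrative, not values from the paper):

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def dog_channels(gray, sigma_pairs=((1.0, 2.0), (2.0, 4.0))):
        # Difference-of-Gaussian channels: each pair of scales yields a channel
        # responding to image structure in a different frequency band (panel d).
        return np.stack([gaussian_filter(gray, s1) - gaussian_filter(gray, s2)
                         for s1, s2 in sigma_pairs], axis=0)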
Nonlinear Transformations: There are countless translationally invariant non-linear
image transformations; here we present a few illustrative cases. Gradient magnitude (panel e)
captures unoriented edge strength while Canny edges (panel f) more explicitly compute edge
information. Gradients for both can be computed at different scales by smoothing the input
image; moreover, for color images a common trick is to compute the gradient on the 3 color
channels separately and use the maximum response [5]. Panel h shows channels obtained by
thresholding the input image with two different thresholds. Although this particular form of
foreground segmentation is not robust, more sophisticated algorithms can be used.
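For instance, the gradient magnitude and thresholding channels might be computed as follows (a hedged sketch; the finite-difference gradient and the threshold values are our own choices):

    import numpy as np

    def gradient_magnitude(img):
        # img: H x W x 3 float color image. The gradient magnitude is computed on
        # each color channel and the maximum response is kept, as suggested in [5].
        mags = []
        for c in range(img.shape[2]):
            gy, gx = np.gradient(img[:, :, c])
            mags.append(np.sqrt(gx ** 2 + gy ** 2))
        return np.max(np.stack(mags, axis=0), axis=0)

    def threshold_channels(gray, thresholds=(0.3, 0.6)):
        # Simple foreground-like channels obtained by thresholding the input (panel h).
        return np.stack([(gray > t).astype(np.float32) for t in thresholds], axis=0)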
Pointwise Transformations: Each pixel in a channel can be transformed by an arbi-
trary function as a post processing step. This can be used to overcome the limitation that
a feature f must be a local sum. E.g., taking the log, we can obtain local products since
exp(∑_i log(x_i)) = ∏_i x_i. Likewise, raising each element to the p-th power can be used to
compute the generalized mean: ((1/n) ∑_{i=1}^n x_i^p)^{1/p}. Setting p to a large value approximates
the maximum of x_i, which can be useful for accumulating values as argued in [12].
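A brief sketch of these pointwise transformations (our illustration; it assumes strictly positive channel values for the log and values roughly in [0, 1] for the power):

    import numpy as np

    def _integral(c):
        # Padded integral image (see the earlier sketch).
        return np.pad(c.cumsum(0).cumsum(1), ((1, 0), (1, 0)), mode='constant')

    def _rect(ii, y0, x0, y1, x1):
        return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]

    def local_product(channel, y0, x0, y1, x1, eps=1e-12):
        # exp of the rectangular sum of log(x) recovers the product of x in the rectangle.
        return np.exp(_rect(_integral(np.log(channel + eps)), y0, x0, y1, x1))

    def generalized_mean(channel, y0, x0, y1, x1, p=10.0):
        # ((1/n) * sum(x_i^p))^(1/p); a large p approximates the local maximum.
        n = (y1 - y0) * (x1 - x0)
        s = _rect(_integral(channel ** p), y0, x0, y1, x1)
        return (s / n) ** (1.0 / p)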
Integral Histogram: Porikli [20] presented an efficient method for computing his-
tograms using integral images. Let Q be a quantized version of I with values in {1,...,q},
and let Q_i(x,y) = 1[Q(x,y) = i] for each i ≤ q (1 is the indicator function). Counting the
elements in a region of Q equal to i can be computed by summing over Q_i in the same re-
gion. Therefore, a histogram over a rectangular region can be efficiently computed given an
integral image for each ‘histogram’ channel Q_i. Although histogram channels can be com-
puted over any input [16], the most common use is a variant known as gradient histograms
[16, 30, 31] described below. Note that a related representation known as ‘value encoding’
has been studied extensively in the neuroscience literature, e.g. see [3].
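A minimal sketch of the integral histogram idea (ours, with an arbitrary quantization of a grayscale image in [0, 1] into q levels):

    import numpy as np

    def histogram_channels(gray, q=8):
        # One binary channel per quantization level; the integral image of each
        # channel turns any rectangular histogram into q rectangle sums.
        Q = np.minimum((gray * q).astype(int), q - 1)             # values in {0, ..., q-1}
        chans = [(Q == i).astype(np.float32) for i in range(q)]
        return [np.pad(c.cumsum(0).cumsum(1), ((1, 0), (1, 0)), mode='constant')
                for c in chans]

    def region_histogram(iis, y0, x0, y1, x1):
        # One rectangle sum per bin yields the histogram of the region.
        return np.array([ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0] for ii in iis])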
Gradient Histogram: A gradient histogram is a weighted histogram where bin index is
determined by gradient angle and weight by gradient magnitude. In other words the channels
are given by Q_θ(x,y) = G(x,y) · 1[Θ(x,y) = θ], where G(x,y) and Θ(x,y) are the gradient
magnitude and quantized gradient angle, respectively, at I(x,y). An example quantized to
4 orientations is shown in panel g. Gradients at different scales can again be obtained by
smoothing I, and an additional channel storing gradient magnitude (panel e) can be used
to L_1 normalize the resulting histogram, thus allowing gradient histograms to approximate
HOG features [31]. Note also the similarity between the channels in panels c and g. Although
seemingly quite different, convolving I with oriented Gabor filters and quantizing gradient
magnitude by gradient angle yields channels capturing qualitatively similar information.
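The gradient histogram channels can be sketched as follows (our illustration; the use of unsigned orientations and six bins is an assumption on our part, not a prescription from the text):

    import numpy as np

    def gradient_histogram_channels(gray, n_orient=6):
        # Q_theta(x, y) = G(x, y) * 1[Theta(x, y) == theta]: the gradient magnitude is
        # scattered into channels indexed by the quantized (unsigned) gradient angle.
        gy, gx = np.gradient(gray)
        mag = np.sqrt(gx ** 2 + gy ** 2)
        ang = np.mod(np.arctan2(gy, gx), np.pi)                   # orientation in [0, pi)
        bins = np.minimum((ang / np.pi * n_orient).astype(int), n_orient - 1)
        chans = np.zeros((n_orient,) + gray.shape, dtype=np.float32)
        for t in range(n_orient):
            chans[t][bins == t] = mag[bins == t]
        return chans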
We briefly discuss channel scale. We differentiate between pre-smoothing (smoothing
the input image) and post-smoothing (smoothing the created channels). In the terminology
of Gårding et al. [13], pre-smoothing determines the ‘local scale’ of the representation and
serves to suppress fine scale structures as well as image noise. Pre-smoothing can have
a significant impact on the information captured, e.g. pre-smoothing determines the scale
of subsequently computed gradients, affecting computed Canny edges (panel f) or gradient
histograms (panel g). Post-smoothing, on the other hand, helps determine the ‘integration
scale’ over which information is pooled. However, the integration scale is also determined by
the support of local sums computed over the channels, making post-smoothing less useful.
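The distinction can be made explicit in code (a small sketch with arbitrary default sigmas):

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def smoothed_gradient_channel(gray, pre_sigma=1.0, post_sigma=0.0):
        # Pre-smoothing sets the 'local scale' before gradients are taken; optional
        # post-smoothing pools the channel, but the local sums used as features
        # already provide an integration scale, so it is often unnecessary.
        g = gaussian_filter(gray, pre_sigma) if pre_sigma > 0 else gray
        gy, gx = np.gradient(g)
        mag = np.sqrt(gx ** 2 + gy ** 2)
        return gaussian_filter(mag, post_sigma) if post_sigma > 0 else mag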
The channel representation has a number of appealing properties. Most channels defined
above can be computed with a few lines of code given standard image processing tools;
furthermore, many of the channels can be computed very efficiently (we discuss optimization
techniques in Sec. 3). In Sec. 4 we show that most parameters used to compute the channels
are not crucial (assuming they’re consistent between training and testing).
