HOGgles: Visualizing Object Detection Features
Carl Vondrick, Aditya Khosla, Tomasz Malisiewicz, Antonio Torralba
Massachusetts Institute of Technology
{vondrick,khosla,tomasz,torralba}@csail.mit.edu
Abstract

We introduce algorithms to visualize feature spaces used by object detectors. The tools in this paper allow a human to put on 'HOG goggles' and perceive the visual world as a HOG based object detector sees it. We found that these visualizations allow us to analyze object detection systems in new ways and gain new insight into the detector's failures. For example, when we visualize the features for high scoring false alarms, we discovered that, although they are clearly wrong in image space, they do look deceptively similar to true positives in feature space. This result suggests that many of these false alarms are caused by our choice of feature space, and indicates that creating a better learning algorithm or building bigger datasets is unlikely to correct these errors. By visualizing feature spaces, we can gain a more intuitive understanding of our detection systems.
1. Introduction
Figure 1 shows a high scoring detection from an object detector with HOG features and a linear SVM classifier trained on PASCAL. Despite our field's incredible progress in object recognition over the last decade, why do our detectors still think that sea water looks like a car?

Unfortunately, computer vision researchers are often unable to explain the failures of object detection systems. Some researchers blame the features, others the training set, and even more the learning algorithm. Yet, if we wish to build the next generation of object detectors, it seems crucial to understand the failures of our current detectors.
In this paper, we introduce a tool to explain some of the failures of object detection systems.¹ We present algorithms to visualize the feature spaces of object detectors. Since features are too high dimensional for humans to directly inspect, our visualization algorithms work by inverting features back to natural images. We found that these inversions provide an intuitive and accurate visualization of the feature spaces used by object detectors.

Previously titled: Inverting and Visualizing Features for Object Detection
¹ Code is available online at http://mit.edu/vondrick/ihog
Figure 1: An image from PASCAL and a high scoring car detection from DPM [8]. Why did the detector fail?

Figure 2: We show the crop for the false car detection from Figure 1. On the right, we show our visualization of the HOG features for the same patch. Our visualization reveals that this false alarm actually looks like a car in HOG space.
Figure 2 shows the output from our visualization on the features for the false car detection. This visualization reveals that, while there are clearly no cars in the original image, there is a car hiding in the HOG descriptor. HOG features see a slightly different visual world than what we see, and by visualizing this space, we can gain a more intuitive understanding of our object detectors.

Figure 3 inverts more top detections on PASCAL for a few categories. Can you guess which are false alarms? Take a minute to study the figure since the next sentence might ruin the surprise. Although every visualization looks like a true positive, all of these detections are actually false alarms. Consequently, even with a better learning algorithm or more data, these false alarms will likely persist. In other words, the features are to blame.
The principal contribution of this paper is the presentation of algorithms for visualizing features used in object detection. To this end, we present four algorithms to invert object detection features to natural images. Although we focus on HOG features in this paper, our approach is general and can be applied to other features as well. We evaluate our inversions with both automatic benchmarks and a large human study, and we found our visualizations are perceptually more accurate at representing the content of a HOG feature than existing methods; see Figure 4 for a comparison between our visualization and HOG glyphs. We then use our visualizations to inspect the behaviors of object detection systems and analyze their features. Since we hope our visualizations will be useful to other researchers, our final contribution is a public feature visualization toolbox.

Figure 3: We visualize some high scoring detections from the deformable parts model [8] for person, chair, and car. Can you guess which are false alarms? Take a minute to study this figure, then see Figure 16 for the corresponding RGB patches.

Figure 4: In this paper, we present algorithms to visualize HOG features. Our visualizations are perceptually intuitive for humans to understand.
2. Related Work
Our visualization algorithms extend an actively growing body of work in feature inversion. Torralba and Oliva, in early work, described a simple iterative procedure to recover images given only gist descriptors [17]. Weinzaepfel et al. [22] were the first to reconstruct an image given its keypoint SIFT descriptors [13]. Their approach obtains compelling reconstructions using a nearest neighbor based approach on a massive database. d'Angelo et al. [4] then developed an algorithm to reconstruct images given only LBP features [2, 1]. Their method analytically solves for the inverse image and does not require a dataset.
While [22, 4, 17] do a good job at reconstructing images from SIFT, LBP, and gist features, our visualization algorithms have several advantages. Firstly, while existing methods are tailored for specific features, the visualization algorithms we propose are feature independent. Since we cast feature inversion as a machine learning problem, our algorithms can be used to visualize any feature. In this paper, we focus on features for object detection, the most popular of which is HOG. Secondly, our algorithms are fast: our best algorithm can invert features in under a second on a desktop computer, enabling interactive visualization. Finally, to our knowledge, this paper is the first to invert HOG.
Our visualizations enable analysis that complements a recent line of papers that provide tools to diagnose object recognition systems, which we briefly review here. Parikh and Zitnick [18, 19] introduced a new paradigm for human debugging of object detectors, an idea that we adopt in our experiments. Hoiem et al. [10] performed a large study analyzing the errors that object detectors make. Divvala et al. [5] analyze part-based detectors to determine which components of object detection systems have the most impact on performance. Tatu et al. [20] explored the set of images that generate identical HOG descriptors. Liu and Wang [12] designed algorithms to highlight which image regions contribute the most to a classifier's confidence. Zhu et al. [24] try to determine whether we have reached Bayes risk for HOG. The tools in this paper enable an alternative mode to analyze object detectors through visualizations. By putting on 'HOG glasses' and visualizing the world according to the features, we are able to gain a better understanding of the failures and behaviors of our object detection systems.
3. Feature Visualization Algorithms
We pose the feature visualization problem as one of feature inversion, i.e. recovering the natural image that generated a feature vector. Let $x \in \mathbb{R}^D$ be an image and $y = \phi(x)$ be the corresponding HOG feature descriptor. Since $\phi(\cdot)$ is a many-to-one function, no analytic inverse exists. Hence, we seek an image $x$ that, when we compute HOG on it, closely matches the original descriptor $y$:

$$\phi^{-1}(y) = \operatorname*{argmin}_{x \in \mathbb{R}^D} \|\phi(x) - y\|_2^2 \qquad (1)$$

Optimizing Eqn. 1 is challenging. Although Eqn. 1 is not convex, we tried gradient-descent strategies by numerically evaluating the derivative in image space with Newton's method. Unfortunately, we observed poor results, likely because HOG is highly sensitive to noise and Eqn. 1 has frequent local minima.
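To make the objective concrete, here is a minimal sketch of evaluating the Eqn. 1 reconstruction error for a candidate image. It uses scikit-image's hog function as a stand-in for the DPM-style HOG used in the paper; this substitution, the parameter choices, and the function names are ours, not the authors'.

```python
# Sketch of the Eqn. 1 objective (assumes scikit-image's HOG as a stand-in
# for the DPM-style HOG features the paper actually inverts).
import numpy as np
from skimage.feature import hog

def hog_descriptor(image):
    # A common HOG configuration: 9 orientations, 8x8-pixel cells.
    return hog(image, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), feature_vector=True)

def inversion_objective(x, y):
    """Squared L2 distance between HOG(x) and the target descriptor y."""
    return np.sum((hog_descriptor(x) - y) ** 2)

# Descending this objective directly (e.g. with finite-difference gradients)
# tends to stall in local minima, motivating the algorithms below.
```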
In the rest of this section, we present four algorithms for inverting HOG features. Since, to our knowledge, no algorithms to invert HOG have yet been developed, we first describe three simple baselines for HOG inversion. We then present our main inversion algorithm.

Figure 5: We found that averaging the images of top detections from an exemplar LDA detector provides one method to invert HOG features.
3.1. Baseline A: Exemplar LDA (ELDA)
Consider the top detections for the exemplar object detector [9, 15] for a few images shown in Figure 5. Although all top detections are false positives, notice that each detection captures some statistics about the query. Even though the detections are wrong, if we squint, we can see parts of the original object appear in each detection.
We use this simple observation to produce our first inversion baseline. Suppose we wish to invert HOG feature $y$. We first train an exemplar LDA detector [9] for this query, $w = \Sigma^{-1}(y - \mu)$. We score $w$ against every sliding window on a large database. The HOG inverse is then the average of the top $K$ detections in RGB space: $\phi_A^{-1}(y) = \frac{1}{K}\sum_{i=1}^{K} z_i$ where $z_i$ is an image of a top detection.
This method, although simple, produces surprisingly accurate reconstructions, even when the database does not contain the category of the HOG template. However, it is computationally expensive since it requires running an object detector across a large database. We also point out that a similar nearest neighbor method is used in brain research to visualize what a person might be seeing [16].
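A minimal sketch of this baseline follows, assuming the HOG statistics (µ, Σ) and a precomputed database of window features and patches are already in memory; the function name, data layout, and the choice of K are illustrative, not code from the authors' toolbox.

```python
import numpy as np

def invert_hog_elda(y, mu, Sigma, window_feats, window_patches, K=20):
    """Baseline A: average the patches of the top-K ELDA detections.

    y              : HOG descriptor to invert, flattened to shape (d,)
    mu, Sigma      : mean and covariance of HOG features over a database
    window_feats   : (N, d) HOG features of candidate sliding windows
    window_patches : (N, H, W) corresponding image patches
    """
    # Exemplar-LDA template for the query: w = Sigma^{-1} (y - mu).
    w = np.linalg.solve(Sigma, y - mu)
    # Score every candidate window and keep the K highest-scoring ones.
    scores = window_feats @ w
    top = np.argsort(scores)[::-1][:K]
    # The inverse is the pixel-wise average of the top detections.
    return window_patches[top].mean(axis=0)
```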
3.2. Baseline B: Ridge Regression
We present a fast, parametric inversion baseline based on ridge regression. Let $X \in \mathbb{R}^D$ be a random variable representing a gray scale image and $Y \in \mathbb{R}^d$ be a random variable of its corresponding HOG point. We define these random variables to be normally distributed on a $(D+d)$-variate Gaussian $P(X, Y) \sim \mathcal{N}(\mu, \Sigma)$ with parameters $\mu = \begin{bmatrix} \mu_X & \mu_Y \end{bmatrix}^T$ and $\Sigma = \begin{bmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{XY}^T & \Sigma_{YY} \end{bmatrix}$. In order to invert a HOG feature $y$, we calculate the most likely image from the conditional Gaussian distribution $P(X \mid Y = y)$:

$$\phi_B^{-1}(y) = \operatorname*{argmax}_{x \in \mathbb{R}^D} P(X = x \mid Y = y) \qquad (2)$$
It is well known that Gaussians have a closed form conditional mode:

$$\phi_B^{-1}(y) = \Sigma_{XY} \Sigma_{YY}^{-1} (y - \mu_Y) + \mu_X \qquad (3)$$

Under this inversion algorithm, any HOG point can be inverted by a single matrix multiplication, allowing for inversion in under a second.
We estimate $\mu$ and $\Sigma$ on a large database. In practice, $\Sigma$ is not positive definite; we add a small uniform prior (i.e., $\hat{\Sigma} = \Sigma + \lambda I$) so $\Sigma$ can be inverted. Since we wish to invert any HOG point, we assume that $P(X, Y)$ is stationary [9], allowing us to efficiently learn the covariance across massive datasets. We invert an arbitrary dimensional HOG point by marginalizing out unused dimensions.

We found that ridge regression yields blurred inversions. Intuitively, since HOG is invariant to shifts up to its bin size, there are many images that map to the same HOG point. Ridge regression is reporting the statistically most likely image, which is the average over all shifts. This causes ridge regression to only recover the low frequencies of the original image.
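Because Eqn. 3 is a single affine map, the whole baseline fits in a few lines of NumPy. The sketch below estimates the Gaussian parameters from paired data and applies the closed-form inverse; the variable names and the regularizer value are our own choices, not the authors'.

```python
import numpy as np

def fit_ridge_inverter(images, feats, lam=1e-2):
    """Estimate the parameters of the Eqn. 3 inverter.

    images : (N, D) flattened grayscale patches
    feats  : (N, d) corresponding HOG descriptors
    """
    mu_x, mu_y = images.mean(axis=0), feats.mean(axis=0)
    Xc, Yc = images - mu_x, feats - mu_y
    n = images.shape[0]
    Sigma_xy = Xc.T @ Yc / n
    Sigma_yy = Yc.T @ Yc / n + lam * np.eye(feats.shape[1])  # uniform prior
    A = Sigma_xy @ np.linalg.inv(Sigma_yy)  # precompute the inversion matrix
    return mu_x, mu_y, A

def invert_hog_ridge(y, mu_x, mu_y, A):
    """Baseline B: conditional Gaussian mode, one matrix multiply."""
    return A @ (y - mu_y) + mu_x
```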
3.3. Baseline C: Direct Optimization
We now provide a baseline that attempts to find images that, when we compute HOG on them, sufficiently match the original descriptor. In order to do this efficiently, we only consider images that span a natural image basis. Let $U \in \mathbb{R}^{D \times K}$ be the natural image basis. We found using the first $K$ eigenvectors of $\Sigma_{XX} \in \mathbb{R}^{D \times D}$ worked well for this basis. Any image $x \in \mathbb{R}^D$ can be encoded by coefficients $\rho \in \mathbb{R}^K$ in this basis: $x = U\rho$. We wish to minimize:

$$\phi_C^{-1}(y) = U\rho^* \quad \text{where} \quad \rho^* = \operatorname*{argmin}_{\rho \in \mathbb{R}^K} \|\phi(U\rho) - y\|_2^2 \qquad (4)$$
Empirically, we found success optimizing Eqn. 4 using coordinate descent on $\rho$ with random restarts. We use an overcomplete basis corresponding to sparse Gabor-like filters for $U$. We compute the eigenvectors of $\Sigma_{XX}$ across different scales and translate smaller eigenvectors to form $U$.
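The paper does not spell out the coordinate descent schedule, so the following is only one plausible reading of Baseline C: greedy coordinate moves on the basis coefficients with random restarts. Step sizes, sweep counts, and function names are illustrative assumptions.

```python
import numpy as np

def invert_hog_direct(y, U, hog_fn, shape, sweeps=50, step=0.1, restarts=3):
    """Baseline C: coordinate descent on basis coefficients rho (Eqn. 4).

    y      : target HOG descriptor
    U      : (D, K) natural-image basis, e.g. eigenvectors of Sigma_XX
    hog_fn : maps an image of `shape` to a HOG descriptor
    """
    K = U.shape[1]

    def error(rho):
        return np.sum((hog_fn((U @ rho).reshape(shape)) - y) ** 2)

    best_rho, best_err = None, np.inf
    for _ in range(restarts):                       # random restarts
        rho = np.random.randn(K) * 0.01
        err = error(rho)
        for _ in range(sweeps):
            for k in range(K):                      # one coordinate at a time
                for delta in (step, -step):
                    trial = rho.copy()
                    trial[k] += delta
                    trial_err = error(trial)
                    if trial_err < err:             # keep improving moves
                        rho, err = trial, trial_err
        if err < best_err:
            best_rho, best_err = rho, err
    return (U @ best_rho).reshape(shape)
```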
3.4. Algorithm D: Paired Dictionary Learning
In this section, we present our main inversion algorithm. Let $x \in \mathbb{R}^D$ be an image and $y \in \mathbb{R}^d$ be its HOG descriptor.

Figure 6: Inverting HOG using paired dictionary learning. We first project the HOG vector onto a HOG basis. By jointly learning a coupled basis of HOG features and natural images, we then transfer the coefficients to the image basis to recover the natural image.

Figure 7: Some pairs of dictionaries for $U$ and $V$. The left of every pair is the gray scale dictionary element and the right shows the positive components of the corresponding HOG dictionary element. Notice the correlation between dictionaries.
Suppose we write $x$ and $y$ in terms of bases $U \in \mathbb{R}^{D \times K}$ and $V \in \mathbb{R}^{d \times K}$ respectively, but with shared coefficients $\alpha \in \mathbb{R}^K$:

$$x = U\alpha \quad \text{and} \quad y = V\alpha \qquad (5)$$

The key observation is that inversion can be obtained by first projecting the HOG features $y$ onto the HOG basis $V$, then projecting $\alpha$ into the natural image basis $U$:

$$\phi_D^{-1}(y) = U\alpha^* \quad \text{where} \quad \alpha^* = \operatorname*{argmin}_{\alpha \in \mathbb{R}^K} \|V\alpha - y\|_2^2 \ \ \text{s.t.} \ \|\alpha\|_1 \leq \lambda \qquad (6)$$

See Figure 6 for a graphical representation of the paired dictionaries. Since efficient solvers for Eqn. 6 exist [14, 11], we can invert features in under two seconds on a 4 core CPU.
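In code, Eqn. 6 amounts to one sparse-coding problem followed by a matrix multiply. The sketch below uses scikit-learn's Lasso, which solves the penalized rather than the hard-constrained form of the L1 problem; the paper itself relies on dedicated solvers [14, 11], so treat this as an approximation with illustrative names and settings.

```python
import numpy as np
from sklearn.linear_model import Lasso

def invert_hog_pairdict(y, U, V, l1_weight=0.01):
    """Algorithm D: sparse-code y against the HOG dictionary V, then
    transfer the shared coefficients to the image dictionary U (Eqn. 6).

    U : (D, K) image dictionary, V : (d, K) HOG dictionary, y : (d,)
    """
    # Lasso minimizes ||V a - y||^2 + l1_weight * ||a||_1, an unconstrained
    # surrogate for the L1-ball constraint in Eqn. 6.
    solver = Lasso(alpha=l1_weight, fit_intercept=False, max_iter=5000)
    solver.fit(V, y)
    alpha = solver.coef_          # shared sparse coefficients alpha*
    return U @ alpha              # reconstruction in image space
```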
Paired dictionaries require finding appropriate bases $U$ and $V$ such that Eqn. 5 holds. To do this, we solve a paired dictionary learning problem, inspired by recent super resolution sparse coding work [23, 21]:

$$\operatorname*{argmin}_{U, V} \sum_{i=1}^{N} \left( \|x_i - U\alpha_i\|_2^2 + \|\phi(x_i) - V\alpha_i\|_2^2 \right) \quad \text{s.t.} \ \|\alpha_i\|_1 \leq \lambda \ \forall i, \ \|U\|_2^2 \leq \gamma_1, \ \|V\|_2^2 \leq \gamma_2 \qquad (7)$$

After a few algebraic manipulations, the above objective simplifies to a standard sparse coding and dictionary learning problem with concatenated dictionaries, which we optimize using SPAMS [14]. Optimization typically took a few hours on medium sized problems. We estimate $U$ and $V$ with a dictionary size $K \approx 10^3$ and training samples $N \approx 10^6$ from a large database. See Figure 7 for a visualization of the learned dictionary pairs.

Figure 8: We show results for all four of our inversion algorithms (columns: Original, ELDA, Ridge, Direct, PairDict) on held out image patches at dimensions common for object detection. See supplemental for more.
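A rough sketch of the concatenation trick mentioned above: stacking each image with its HOG descriptor and running ordinary dictionary learning yields coupled bases U and V. We use scikit-learn's dictionary learner as a stand-in for SPAMS [14]; the dictionary size, penalty, and function name are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

def learn_paired_dictionary(images, feats, K=1024, l1_weight=1.0):
    """Learn coupled bases U (images) and V (HOG) as in Eqn. 7 by running
    dictionary learning on concatenated vectors [x_i; phi(x_i)].

    images : (N, D) flattened grayscale patches
    feats  : (N, d) corresponding HOG descriptors
    """
    D = images.shape[1]
    Z = np.hstack([images, feats])                    # (N, D + d)
    learner = MiniBatchDictionaryLearning(n_components=K, alpha=l1_weight,
                                          fit_algorithm='lars')
    learner.fit(Z)
    dictionary = learner.components_                  # (K, D + d)
    U = dictionary[:, :D].T                           # (D, K) image basis
    V = dictionary[:, D:].T                           # (d, K) HOG basis
    return U, V
```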
4. Evaluation of Visualizations
We evaluate our inversion algorithms using both qualitative and quantitative measures. We use PASCAL VOC 2011 [6] as our dataset and we invert patches corresponding to objects. Any algorithm that required training could only access the training set. During evaluation, only images from the validation set are examined. The database for exemplar LDA excluded the category of the patch we were inverting to reduce the potential effect of dataset biases.

Figure 9: We show results where our paired dictionary algorithm is trained to recover RGB images instead of only grayscale images. The right shows the original image and the left shows the inverse.

Figure 10 (columns: PairDict, which runs in seconds; Greedy, which runs for days; Original): Although our algorithms are good at inverting HOG, they are not perfect, and struggle to reconstruct high frequency detail. See text for details.

Figure 11 (template sizes 40×40, 20×20, 10×10, 5×5): Our inversion algorithms are sensitive to the HOG template size. We show how performance degrades as the template becomes smaller.
We show our inversions in Figure 8 for a few object categories. Exemplar LDA and ridge regression tend to produce blurred visualizations. Direct optimization recovers high frequency details at the expense of extra noise. Paired dictionary learning tends to produce the best visualization for HOG descriptors. By learning a dictionary over the visual world and the correlation between HOG and natural images, paired dictionary learning recovered high frequencies without introducing significant noise.

We discovered that the paired dictionary is able to recover color from HOG descriptors. Figure 9 shows the result of training a paired dictionary to estimate RGB images instead of grayscale images. While the paired dictionary assigns arbitrary colors to man-made objects and in-door scenes, it frequently colors natural objects correctly, such as grass or the sky, likely because those categories are strongly correlated to HOG descriptors. We focus on grayscale visualizations in this paper because we found those to be more intuitive for humans to understand.

While our visualizations do a good job at representing HOG features, they have some limitations. Figure 10 compares our best visualization (paired dictionary) against a greedy algorithm that draws triangles of random rotation, scale, position, and intensity, and only accepts the triangle if it improves the reconstruction.
Figure 12 (columns: original $x$, $x' = \phi^{-1}(\phi(x))$, $x'' = \phi^{-1}(\phi(x'))$): We recursively compute HOG and invert it with a paired dictionary. While there is some information loss, our visualizations still do a good job at accurately representing HOG features. $\phi(\cdot)$ is HOG, and $\phi^{-1}(\cdot)$ is the inverse.
If we allow the greedy algorithm to execute for an extremely long time (a few days), the visualization better shows higher frequency detail. This reveals that there exists a visualization better than paired dictionary learning, although it may not be tractable. In a related experiment, Figure 12 recursively computes HOG on the inverse and inverts it again. This recursion shows that there is some loss between iterations, although it is minor and appears to discard high frequency details. Moreover, Figure 11 indicates that our inversions are sensitive to the dimensionality of the HOG template. Despite these limitations, our visualizations are, as we will now show, still perceptually intuitive for humans to understand.
We quantitatively evaluate our algorithms under two benchmarks. Firstly, we use an automatic inversion metric that measures how well our inversions reconstruct original images. Secondly, we conducted a large visualization challenge with human subjects on Amazon Mechanical Turk (MTurk), which is designed to determine how well people can infer high level semantics from our visualizations.
4.1. Inversion Benchmark
We consider the inversion performance of our algorithm: given a HOG feature $y$, how well does our inverse $\phi^{-1}(y)$ reconstruct the original pixels $x$ for each algorithm? Since HOG is invariant up to a constant shift and scale, we score each inversion against the original image with normalized cross correlation. Our results are shown in Table 1. Overall, exemplar LDA does the best at pixel level reconstruction.
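Our reading of this scoring protocol is a zero-mean, unit-norm correlation between the inversion and the original patch; a short sketch follows, with names of our own choosing.

```python
import numpy as np

def normalized_cross_correlation(x, x_hat):
    """Score an inversion x_hat against the original patch x, invariant to
    a constant shift and scale in intensity."""
    a = x.astype(float).ravel() - x.mean()
    b = x_hat.astype(float).ravel() - x_hat.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```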
4.2. Visualization Benchmark
While the inversion benchmark evaluates how well the inversions reconstruct the original image, it does not capture the high level content of the inverse: is the inverse of a sheep still a sheep? To evaluate this, we conducted a study on MTurk. We sampled 2,000 windows corresponding to objects in PASCAL VOC 2011. We then showed participants an inversion from one of our algorithms and asked users to classify it into one of the 20 categories. Each window was shown to three different users. Users were required to pass a training course and qualification exam before participating in order to guarantee users understood the task. Users could optionally select that they were not confident in their answer.

References (partial)
[6] M. Everingham et al. The PASCAL Visual Object Classes (VOC) Challenge.
[8] P. Felzenszwalb et al. Object Detection with Discriminatively Trained Part-Based Models.
[13] D. Lowe. Object Recognition from Local Scale-Invariant Features.
[17] A. Oliva and A. Torralba. Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope.
N. Dalal and B. Triggs. Histograms of Oriented Gradients for Human Detection.