
Fine-Grained Crowdsourcing for Fine-Grained Recognition

This work includes humans in the loop to help computers select discriminative features humans use, and proposes the "Bubble Bank" algorithm that uses the human selected bubbles to improve machine recognition performance.
Fine-grained recognition concerns categorization at sub-ordinate levels, where the distinction between object classes is highly local. Compared to basic level recognition, fine-grained categorization can be more challenging as there are in general less data and fewer discriminative features. This necessitates the use of stronger prior for feature selection. In this work, we include humans in the loop to help computers select discriminative features. We introduce a novel online game called "Bubbles" that reveals discriminative features humans use. The player's goal is to identify the category of a heavily blurred image. During the game, the player can choose to reveal full details of circular regions ("bubbles"), with a certain penalty. With proper setup the game generates discriminative bubbles with assured quality. We next propose the "Bubble Bank" algorithm that uses the human selected bubbles to improve machine recognition performance. Experiments demonstrate that our approach yields large improvements over the previous state of the art on challenging benchmarks.


Jia Deng, Jonathan Krause, Li Fei-Fei
Computer Science Department, Stanford University
1. Introduction
Fine-grained recognition concerns recognizing subordinate object classes. Examples include distinguishing different breeds of dogs, species of birds, models of cars, and categories of mushrooms. These tasks yield a great deal of information for a human user and can thus add tremendous value to society.
Fine-grained recognition is challenging. There is in general limited data, as fine-grained labels are much harder to acquire. More importantly, there are far fewer discriminative features compared to categorization at the basic level. Distinguishing a dog from a microwave is easy because there are plenty of helpful visual cues. In comparison, the difference between fine-grained classes can be very subtle and only a few key features matter. Consider, for example, two very similar woodpeckers, “Northern Flicker” and “Red Bellied Woodpecker” (bird A and B in Figure 1). If irrelevant features are used, it is virtually impossible to distinguish the two. But if we know the key differences, the task becomes easy: bird A has a spotted chest while the top of bird B’s head is red. Given limited data, automatic selection of discriminative features becomes especially difficult, as a large number of irrelevant features can cause severe overfitting.
Figure 1. The distinction between fine-grained categories is often very subtle. It is crucial to identify the key features: if the wrong features are selected, the task can be very difficult. A small number of right features, on the other hand, makes the task easy. (Panel labels: “Which bird, (A) or (B)?”; “Wrong features: hard”; “Right features: easy”.)
To tackle the challenge of feature selection, one approach is to apply specialized domain knowledge. This approach can yield great success [16], but demands from the researcher a deep understanding of the specific domain. Another promising direction is including the crowd in the loop by having humans either label or propose parts and attributes [3, 13, 11, 22, 9, 21]. These approaches can potentially reduce the burden of domain-specific engineering. We refer to this category of approaches as “fine-grained crowdsourcing”. It is “fine-grained” in two senses: (1) the crowd not only provides class labels indicating what the object is, but also provides detailed information on how humans achieve fine-grained recognition; (2) the learning algorithm not only optimizes the classification accuracy but also incorporates the “finer-grained” hints from the crowd, which would help avoid overfitting and lead to better generalization performance. The challenge, however, lies in how to design effective annotation tasks. Existing approaches either ask humans to label pre-defined parts and attributes [3, 13], or assign open-ended tasks such as proposing semantic labels [20, 11, 22, 9] and specifying visual primitives (keypoints or regions) [9, 21]. Labeling parts and attributes places a large burden on the researcher in specifying the annotation task. The open-ended alternatives, on the other hand, can be difficult for quality control as the tasks are highly subjective.
In this paper we take one step further in this direction by introducing a novel crowdsourcing approach to help computers select discriminative features. This approach does not require the researcher to specify parts and attributes, is open-ended, and has automatic quality control. Specifically, we propose a novel online game called “Bubbles” that reveals the discriminative features. Consider bird species identification as an example. At each round of the game, a player sees example images for two bird species. She is then given a new image and is asked to classify the bird into one of the two species. She earns points for correct identification and loses points otherwise. Regardless of the outcome, the game advances to the next round with a new image and possibly a new pair of bird species. The key twist of the game is that the new image is always heavily blurred so that the player can only see a rough outline of the bird. The player can, however, click to reveal small, circular areas of the image (“bubbles”) to inspect the full details, with a penalty on game points. Through proper setup of reward, the game can guarantee that bubbles selected by a successful human player contain discriminative features.
The game enjoys the following advantages: (1) Domain agnostic. The only assumption is that humans can discover discriminative visual features from a handful of examples. Thus it applies to a wide range of domains and appeals to a generic crowd. In fact, learning to tell unfamiliar categories apart under time pressure creates challenges and fun. (2) Automatic quality assurance. If a player earns high scores, we know with certainty that the areas chosen must be important. (3) Cost effective. The game provides entertainment and people will volunteer to play. This can enable large-scale data collection at low or zero cost.
Our second contribution is “BubbleBank”, a new algorithm that uses the crowd-selected bubbles for fine-grained recognition. For each bubble from the game, we generate a “bubble detector” that tries to detect the same pattern in other images. Each image can then be represented by a “BubbleBank”, a collection of max-pooled responses from each bubble detector. We demonstrate that BubbleBank can improve on previous state-of-the-art methods by large margins on challenging fine-grained benchmarks. Fig. 2 illustrates our complete framework.
2. Related Work
Our work shares the same end goal as existing work on fine-grained recognition [16, 6, 15, 25, 19, 37, 36], but our approach is more aligned with the general line of research that places humans in the loop [20, 11, 22, 29, 9, 5, 24, 23, 20, 32, 4, 26]. In particular, there has been success in seeking to understand how humans perform recognition, e.g. by asking humans to directly provide annotation rationales [9], to label features in NLP tasks [10], to describe the differences between pairs of images [20], or to perform tasks that are parts of the machine pipeline [23]. Our work is different in that we use online games to discover discriminative features for fine-grained recognition.
Figure 2. In our approach, the crowd first plays the “Bubbles” game, trying to classify a blurred image into one of the two given categories. During the game, the crowd is allowed to inspect circular regions (“bubbles”), with a penalty of game points. In this process, discriminative regions are revealed. Next, when a computer tries to recognize fine-grained categories, it collects the human-selected bubbles and detects similar patterns on an image. The detection responses are max-pooled to form a “BubbleBank” representation that can be used for learning classifiers.
This work relates to many human vision studies. The game is named after a well known psychology technique for studying features that humans use for face recognition [14]. Human subjects are shown a face image with random bubbles revealed and asked to identify the gender or expression. Our approach differs in that our bubbles are actively chosen by the player. Another connection to human vision studies is that our game to a certain extent resembles eye tracking, revealing the locations looked at by humans.
Our game also draws inspiration from human computation [30, 31, 17], especially the seminal “Peekaboom” game [31]. In this two-player game, player A is given a word (e.g. “cow”) and an image. Player A can then click to reveal parts of the image to player B. Player B needs to type the word “cow” after seeing only the revealed area. Our game is different. First, Peekaboom is not suitable for fine-grained recognition because an average player cannot be expected to come up with the same word “Northern Flicker”. Second, the goal of Peekaboom is to locate the objects, not the discriminative parts. In particular, which parts are discriminative for a category depends on what the reference category is. In our game, we replace word typing with binary choices and make discovering discriminative visual features between unfamiliar categories part of the game play. Another difference is that our game is single-player. This eliminates the need to match two players in real time, making it much easier to deploy on paid crowdsourcing platforms such as Amazon Mechanical Turk (AMT).
Figure 3. The game UI. The goal is to correctly classify the center image into one of the two categories. A green bubble follows the cursor. The player can click to reveal the area inside the bubble. The more bubbles used, the fewer points the player can earn.
Finally, our BubbleBank algorithm is related to work that uses collections of part/object detectors [3, 13, 18]. Our approach differs in that our BubbleBank consists of detectors tailored to the outputs of the Bubbles game, with a simple representation that requires no additional detector learning.
3. The Bubbles Game
The Game Mechanism. Fig. 3 shows the game UI. A player is given example images of two categories. In the center lies a blurred, de-saturated image with only the rough outline of the object visible. The goal is to correctly classify the center image into one of the two categories. A green “bubble” (size adjustable) follows the mouse cursor as the player hovers over the center image. When the player clicks, the area under the circle is revealed in full detail. If the player answers correctly, she earns new points. Otherwise she loses points. Either way, the game then advances to the next round, with a new center image and possibly a new pair of categories. Note that all images are assumed to have ground truth class labels so that we can instantly judge the player’s answers.
We design the reward of the game such that a player can only earn high scores if she identifies the categories correctly and uses bubbles parsimoniously. First, we set the penalty on wrong answers very large, for example, 100 points for a correct identification but −300 for an incorrect one. This renders random guessing an ineffective strategy. Also, the player is allowed to pass difficult images or categories with no penalty, so that she is not forced to guess. Second, there is a cost associated with the total area revealed. The points for correct identification decrease as more area is revealed. For example, in our experiments the scores typically drop to zero when about 30% of the object bounding box is revealed. This encourages careful bubble use. This reward setup therefore reliably distinguishes good players and assures the quality of their bubbles.
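The reward scheme can be illustrated with a small sketch. This is not the authors' implementation: the function names and the linear decay of the reward are assumptions made for illustration. Only the example values (100 points, −300 points, no penalty for passing, and zero reward at about 30% revealed area) come from the text.

```python
def round_score(correct, area_fraction, reward=100.0, penalty=300.0, zero_at=0.30):
    """Score for one round of a Bubbles-style game.

    correct       -- True or False for an answer, None if the player passed
    area_fraction -- fraction of the object bounding box revealed, in [0, 1]
    The linear decay below is an assumed shape, not taken from the paper.
    """
    if correct is None:
        return 0.0                      # passing carries no penalty
    if not correct:
        return -penalty                 # wrong answers cost more than right ones earn
    remaining = max(0.0, 1.0 - area_fraction / zero_at)
    return reward * remaining           # full reward with no bubbles, zero at zero_at


def expected_random_guess_score(reward=100.0, penalty=300.0):
    """Expected score of a coin-flip guess between two categories, no bubbles used."""
    return 0.5 * reward - 0.5 * penalty
```

With the example values, a coin-flip guess has a negative expected score, which is why guessing is an ineffective strategy and why a high submission threshold filters out players who guess.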
Another issue of game design is determining the amount of blurring for the center image. With insufficient blurring, the player can directly identify the category, whereas too much blurring would obscure the global shape. To address this issue, we start with a small amount of blurring and increase it gradually in new games until the use of bubbles becomes necessary. Note that this in fact creates useful side information about the scale of the discriminative features.
The game can be enjoyable as it has an engaging challenge-reward setup with instant feedback. To earn high scores, the user needs to discover the differences between highly confusing categories. This is similar to the classical “spot-the-difference” games. Next, the user needs to think about where to place the bubbles. To further enhance the experience, we can create a sense of time pressure by adding a countdown timer and “freezing” the bubbles for a few seconds once a certain amount of area has been revealed.
We finally note that there is nothing specific about birds in the game design. In particular, the players do not need to understand any attributes or parts. Thus the game can be readily applied in a different domain. The only assumption is that humans can learn from a few examples, which turns out to be valid through our large-scale AMT deployment.
AMT Deployment. The game is suitable for deployment on paid crowdsourcing platforms such as AMT. Each AMT task consists of multiple rounds of games. The worker must score enough points in order to submit the task; otherwise the games continue indefinitely. The threshold for submission is set high enough that random guessing is infeasible. This ensures that only good workers are able to submit. Notably, there is no need to make approve/reject decisions, as is necessary for conventional tasks. All submissions are guaranteed to be high quality
and can be automatically approved. This is a significant ad-
vantage as quality control is often a significant concern for
We deployed the game on AMT using the CUB-200-2010 bird dataset [35] that contains 200 types of birds. A total of 275 workers submitted 3339 tasks, with an average price of $0.07 each. This gives 90659 rounds of games, an average of 27 rounds per task. We generate the games from visually confusing category pairs (see Sec. 5.2 for details). Each round identifies one image and lasts 25 seconds on average. Among all game rounds, 71% were successful (i.e. the player correctly identifies the category), 14% failed, and 15% were skipped by passing the image or switching categories. Fig. 4 shows examples of successful games for four pairs of categories. Remarkably, the workers are able to discover the subtle differences between very difficult pairs of categories. As in Fig. 4, the difference between “Common Tern” and “Arctic Tern” lies in whether the tip of the beak is black and in the length of the tail. Also observe how different features are selected for the same image when it is discriminated against different categories. When “Common Tern” is compared against “Herring Gull”, the black patch on the head is discriminative and gets picked often. But when discriminated against “Arctic Tern”, the black patch is no longer relevant and is less frequently chosen. For failed games, we observe that a significant fraction is due to a few very difficult category pairs.
Figure 4. Examples of game results from AMT. The red boxes show zoomed-in views of the bubbles. Top row: Bubbles drawn on images of “Common Tern” when compared against “Herring Gull”. Second row: Bubbles for “Common Tern” on the same images of the top row when compared against “Arctic Tern”. Third row: Bubbles for “Parakeet Auklet” when compared against “Horned Puffin”. Fourth row: Bubbles for “Parakeet Auklet” on the same images of the third row when compared against “Least Auklet”.
It is also remarkable how little is needed to distinguish a pair of fine-grained categories. Fig. 5 plots the cumulative distribution of the area revealed in successful games: over 90% of the games reveal less than 10% of the object bounding box. This validates our hypotheses that (1) humans can indeed discover the fine differences from a handful of examples and (2) for fine-grained recognition, the key features are highly local.
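A cumulative statistic of this kind can be computed as follows. This is a minimal sketch: the function name and the ten example values are hypothetical and are not data from the deployment.

```python
def fraction_of_games_within(revealed_fractions, threshold):
    """Empirical CDF: share of games whose revealed area is <= threshold."""
    if not revealed_fractions:
        raise ValueError("need at least one game")
    within = sum(1 for a in revealed_fractions if a <= threshold)
    return within / len(revealed_fractions)


# Hypothetical revealed-area fractions from ten successful games (illustration only).
example_games = [0.01, 0.02, 0.02, 0.03, 0.04, 0.05, 0.05, 0.07, 0.09, 0.25]
share_under_10_percent = fraction_of_games_within(example_games, 0.10)
```

Evaluating the function over a range of thresholds gives the curve plotted in a figure such as Fig. 5.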
[Figure 5 plot. Legend: bubble sizes as proportions of an image (1%, 5%, 10%). Axes: proportion of image area revealed in a game; percentage of games using equal or less area.]
Figure 5. Statistics of image area revealed in successful games. The area revealed in most of the successful games is small. Over 90% of the games use less than 10% of the object bounding box.
Finally, we can aggregate the bubbles on the same image from multiple games played by multiple players and obtain a heat map of discriminative regions. Fig. 6 shows two examples. It suggests that the game can indeed discover meaningful cues for fine-grained recognition.
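One simple way to aggregate bubbles into a heat map is to count, for every pixel, the fraction of bubbles that cover it. The sketch below assumes bubbles are stored as hypothetical (cx, cy, radius) tuples in pixel coordinates; the paper does not specify the storage format or the exact averaging used.

```python
def bubble_heat_map(width, height, bubbles):
    """Average bubble coverage per pixel.

    bubbles -- list of (cx, cy, radius) circles in pixel coordinates,
               pooled over all games played on the same image
    Returns a height x width grid (list of rows) with values in [0, 1].
    """
    heat = [[0.0] * width for _ in range(height)]
    if not bubbles:
        return heat
    for cx, cy, r in bubbles:
        for y in range(height):
            for x in range(width):
                # A pixel is covered if it lies inside the circle.
                if (x - cx) ** 2 + (y - cy) ** 2 <= r * r:
                    heat[y][x] += 1.0
    n = float(len(bubbles))
    return [[v / n for v in row] for row in heat]
```

Pixels that many players chose to reveal end up with values close to 1, and pixels nobody revealed stay at 0.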
4. The BubbleBank Algorithm
The Bubbles game reveals discriminative features. In this section we show how to use the human-selected bubbles to improve recognition. Our basic idea is to generate a detector for each bubble and represent each image as a collection of responses of the bubble detectors.
The Bubble Detectors. Since each bubble is drawn in the context of discriminating two classes, we start by assuming only two classes. Our intuition is that since each bubble contains discriminative features for recognition, it suffices to detect such patterns in a test image. It is thus natural to obtain a detector for each bubble.
Figure 6. Heat maps of bubbles averaged over multiple games played by multiple players.
How do we represent each bubble detector? Since each bubble is usually a small area, it can be represented by a single descriptor such as SIFT, or a concatenation of simple descriptors. This descriptor acts as an image filter: to detect on a test image, we convolve it with densely sampled patches and then take the maximum response (max-pooling). To further exploit the cues provided by the bubbles, we specify a pooling region for each detector. Instead of convolving with the entire image, each detector operates on a fixed, rectangular region whose center is determined by the relative location of the bubble in the original image. In other words, we have a strong spatial prior about where we expect to detect bubbles. Note that here we have assumed that the object has been localized, as is standard in the classification task in fine-grained recognition [35, 37, 36].
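A single bubble detector with a spatial pooling region can be sketched as follows. This is an illustrative reading of the description above, not the authors' code: the function name, the data layout, and the use of a plain dot product as the filter response are assumptions. Coordinates are assumed to be normalized to [0, 1], and `region` is the side length of the pooling rectangle in those units.

```python
def bubble_response(bubble_desc, bubble_xy, patches, region=0.5):
    """Max-pooled response of one bubble detector on one image.

    bubble_desc -- descriptor of the bubble (list of floats)
    bubble_xy   -- (x, y) of the bubble in its source image, normalized to [0, 1]
    patches     -- list of ((x, y), descriptor) for densely sampled patches of
                   the test image, with (x, y) normalized to [0, 1]
    region      -- side length of the square pooling region (normalized units)
    """
    bx, by = bubble_xy
    half = region / 2.0
    best = None
    for (px, py), desc in patches:
        # Spatial prior: ignore patches outside the pooling region.
        if abs(px - bx) > half or abs(py - by) > half:
            continue
        score = sum(a * b for a, b in zip(bubble_desc, desc))  # filter response
        if best is None or score > best:
            best = score
    # Returning 0.0 when no patch falls in the region is an assumed convention.
    return 0.0 if best is None else best
```

Restricting the search to the pooling region means a patch elsewhere in the image cannot trigger the detector, even if it matches the descriptor strongly.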
Now, assume that we have collected multiple bubbles, each from a training image of one of the two classes (each training image can have multiple bubbles from a single round of game or multiple games played by different players). We can then form a bank of bubble detectors (“BubbleBank”) and represent the image by a vector of the max-pooled responses from each detector, in a spirit similar to the ObjectBank [18] representation. Then a binary classifier can be learned on top of this representation. Fig. 2 illustrates the BubbleBank representation.
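The resulting image representation has one entry per bubble detector. The sketch below shows only that structure; for brevity it pools over all descriptors of the image and omits the spatial pooling region, and the function names are hypothetical.

```python
def max_response(template, descriptors):
    """Largest dot product between a template and any descriptor of the image."""
    return max(sum(t * d for t, d in zip(template, desc)) for desc in descriptors)


def bubblebank_vector(bank, descriptors):
    """One feature per bubble detector: its max-pooled response on the image.

    bank        -- list of bubble descriptors (the detector templates)
    descriptors -- non-empty list of patch descriptors from the image
    """
    return [max_response(template, descriptors) for template in bank]
```

The length of the vector equals the number of bubbles in the bank, so any standard classifier that accepts fixed-length feature vectors can be trained on top of it.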
Extending to Multiple Classes. Extending to multiple classes is straightforward: we can simply obtain bubbles for all pairs of categories and then use all of them to form our BubbleBank. This, however, does not scale well with the number of classes because we need to run $O(K^2)$ games for $K$ classes. Fortunately, obtaining bubbles for every pair of categories is unnecessary in practice. Not all classes are equally similar to others. It is likely that a bubble useful for differentiating a class from another very confusing class is also helpful for discriminating the same class against less similar ones. For example, the bubbles selected for “Common Tern” against “Herring Gull” in Fig. 4 are also useful for distinguishing “Common Tern” from the woodpeckers in Fig. 1. Therefore, for a large number of classes, we can pick only the most confusing category pairs. Specifically, we can first train a baseline classifier and then find out the confusing pairs via cross-validation. Alternatively, if a semantic hierarchy is available and visual similarity between classes is known to align well with the semantic hierarchy, as is often the case [8], we can directly select pairs of categories from within small subtrees.
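Selecting confusing pairs from a cross-validated confusion matrix can be sketched as below. The symmetric score (confusions in both directions added together) and the function names are assumptions; the text only states that confusing pairs are found via cross-validation of a baseline classifier.

```python
def num_pairs(k):
    """Number of unordered class pairs among k classes: k(k-1)/2."""
    return k * (k - 1) // 2


def most_confusing_pairs(confusion, top_n):
    """Rank unordered class pairs by symmetric confusion, highest first.

    confusion[i][j] -- number of class-i validation images predicted as class j
    """
    k = len(confusion)
    scored = []
    for i in range(k):
        for j in range(i + 1, k):
            scored.append((confusion[i][j] + confusion[j][i], i, j))
    # Sort by descending confusion; ties broken by class indices for determinism.
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))
    return [(i, j) for _, i, j in scored[:top_n]]
```

For 200 classes, `num_pairs(200)` gives 19900 possible pairs, which illustrates why running games only on a short list of the most confusing pairs is attractive.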
We conclude this section by further comparing BubbleBank with related methods. On one hand, BubbleBank is related to a class of methods that learn attributes, parts, or object detectors (ObjectBank [18], Poselet [3], Birdlet [13]) and use their responses for classification. However, all these methods require additional annotation to train the detectors. On the other hand, BubbleBank is also related to more generic methods such as the codebook-free and annotation-free approach (CFAF) [36] and LLC [34]. These approaches use simple template representations but generate them through uniform or random sampling, with no additional supervision. Here we highlight some key differences: (1) our detectors are derived from a game that guarantees quality; (2) due to the assured quality, our representation of bubble detectors can be made very simple using low-level descriptors without additional training; (3) we assume a strong spatial prior for each bubble detector.
5. Experiments
5.1. Dataset and Implementation
Dataset. We use a standard fine-grained benchmark, the CUB-200 dataset [35] that contains 200 bird species. There are 6033 images in total and around 30 images per class. All of our experiments use the default training-test split. We experiment on the full dataset as well as a subset of 14 classes from the Vireo and Woodpecker family (CUB-14) that have been used in previous work [13, 36, 38]. All images are cropped to the bounding boxes, as is standard for many published results [35, 5, 37, 36, 1]. At test time, we do not use any ground truth information other than assuming that the image has been cropped.
Bubble Detectors. We implement the bubble detectors using SIFT [27] and color histograms extracted at the bubble locations. The color histograms are based on a color naming method [28] that converts each pixel into an 11-dimensional vector, each dimension representing the probability of one of the 11 basic color terms (e.g. “black”, “blue”, “brown”, etc.). We form an $L_1$-normalized histogram by averaging the color naming vectors within each bubble. The color vector is then concatenated with the SIFT descriptor to form the final 139-dimensional descriptor. To run the bubble detectors, we resize an image to a max dimension of 300 pixels and extract the same SIFT and color descriptors on dense patches at every 2 pixels at multiple resolutions.
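The construction of the 139-dimensional descriptor can be sketched as follows, assuming a standard 128-dimensional SIFT vector and per-pixel color-name probabilities that are already available. Both inputs and the function names are hypothetical placeholders; no SIFT or color-naming computation is performed here.

```python
def color_histogram(pixel_color_probs):
    """Average per-pixel color-name probability vectors.

    Each input vector is assumed to sum to 1, so the average also sums to 1.
    """
    n = len(pixel_color_probs)
    dims = len(pixel_color_probs[0])
    return [sum(p[d] for p in pixel_color_probs) / n for d in range(dims)]


def bubble_descriptor(sift, pixel_color_probs):
    """Concatenate a SIFT vector with the averaged color-name histogram."""
    return list(sift) + color_histogram(pixel_color_probs)
```

With a 128-dimensional SIFT vector and 11 color terms, the concatenation has 128 + 11 = 139 dimensions, matching the descriptor size stated above.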

References
LIBLINEAR: A Library for Large Linear Classification. Journal article. LIBLINEAR is an open source library for large-scale linear classification that supports logistic regression and linear support vector machines and provides easy-to-use command-line tools and library calls for users and developers.
Locality-constrained Linear Coding for image classification. Proceedings article. This paper presents a simple but effective coding scheme called Locality-constrained Linear Coding (LLC) in place of the VQ coding in traditional SPM, using the locality constraints to project each descriptor into its local-coordinate system; the projected coordinates are integrated by max pooling to generate the final representation.
Labeling images with a computer game. Proceedings article. A new interactive system: a game that is fun and can be used to create valuable output that addresses the image-labeling problem and encourages people to do the work by taking advantage of their desire to be entertained.
Evaluating Color Descriptors for Object and Scene Recognition. Journal article. From the theoretical and experimental results, it can be derived that invariance to light intensity changes and light color changes affects category recognition, and the usefulness of invariance is category-specific.
Caltech-UCSD Birds 200. Caltech-UCSD Birds 200 (CUB-200) is a challenging image dataset annotated with 200 bird species to enable the study of subordinate categorization, which is not possible with other popular datasets that focus on basic level categories.
Frequently Asked Questions (18)
Q1. What have the authors contributed in "Fine-grained crowdsourcing for fine-grained recognition"?

In this work, the authors include humans in the loop to help computers select discriminative features. The authors introduce a novel online game called “ Bubbles ” that reveals discriminative features humans use. The authors next propose the “ BubbleBank ” algorithm that uses the human selected bubbles to improve machine recognition performance. Experiments demonstrate that their approach yields large improvements over the previous state of the art on challenging benchmarks. 

Through proper setup of reward, the game can guarantee that bubbles selected by a successful human player contain discriminative features. 

Their intuition is that since each bubble contains discriminative features for recognition, it suffices to detect such patterns in a test image. 

Since each bubble is usually a small area, it can be represented by a single descriptor such as SIFT, or a concatenation of simple descriptors. 

The authors set the penalty on wrong answers very large, for example, 100 points for correct identification but −300 for incorrect ones. 

The authors design the reward of the game such that a player can only earn high scores if she identifies the categories correctly and uses bubbles parsimoniously. 

To run the bubble detectors, the authors resize an image to a max dimension of 300 pixels and extract the same SIFT and color descriptors on dense patches at every 2 pixels at multiple resolutions. 

The authors specify the pooling region for each detector to be a 0.5× 0.5 rectangle centered at the original bubble location, after normalizing all (x, y) coordinates to be in [0, 1]× [0, 1]. 

Using only 1634 human selected bubbles (5% of the entire set), the authors already outperform CFAF [36] (51.05% versus 44.73%). 

Extending to multiple classes is straightforward: the authors can simply obtain bubbles for all pairs of categories and then use all of them to form their BubbleBank. 

To further enhance the experience, the authors can create a sense of time pressure by adding a countdown timer and “freezing” the bubbles for a few seconds once a certain amount of area has been revealed. 

It is “fine-grained” in two senses: (1) the crowd not only provides class labels indicating what the object is, but also provides detailed information on how humans achieve fine grained recognition; (2) the learning algorithm not only optimizes the classification accuracy but also incorporates the “finer-grained” hints from the crowd, which would help avoid overfitting and lead to better generalization performance. 

To address this issue, the authors start with a small amount of blurring and increase it gradually in new games until the use of bubbles becomes necessary. 

It is likely that a bubble useful for differentiating a class from another very confusing class is also helpful for discriminating the same class against less similar ones. 

The authors experiment on the full dataset as well as a subset of 14 classes from the Vireo and Woodpecker family (CUB-14) that have been used in previous work [13, 36, 38]. 

With random bubbles, the performance is similar to the 44.73% achieved by CFAF [36], which also uses random templates but further boosts performance by a bagging technique. 

The authors first obtain the confusion matrix of the KDES method [2] via cross-validation on training data (no test data is used) and then pick the top 763 most confusing pairs. 

Fig. 5 plots the cumulative distribution of the area revealed in successful games — over 90% of the games reveal less than 10% of the object bounding box.