Fine-Grained Crowdsourcing for Fine-Grained Recognition
Jia Deng, Jonathan Krause, Li Fei-Fei
Computer Science Department, Stanford University
Abstract

Fine-grained recognition concerns categorization at sub-ordinate levels, where the distinction between object classes is highly local. Compared to basic-level recognition, fine-grained categorization can be more challenging, as there are in general less data and fewer discriminative features. This necessitates the use of a stronger prior for feature selection. In this work, we include humans in the loop to help computers select discriminative features. We introduce a novel online game called "Bubbles" that reveals the discriminative features humans use. The player's goal is to identify the category of a heavily blurred image. During the game, the player can choose to reveal full details of circular regions ("bubbles"), with a certain penalty. With a proper setup, the game generates discriminative bubbles with assured quality. We next propose the "BubbleBank" algorithm that uses the human-selected bubbles to improve machine recognition performance. Experiments demonstrate that our approach yields large improvements over the previous state of the art on challenging benchmarks.
1. Introduction
Fine-grained recognition concerns recognizing sub-ordinate object classes. Examples include distinguishing different breeds of dogs, species of birds, models of cars, and categories of mushrooms. These tasks yield a great deal of information for a human user and can thus add tremendous value to society.
Fine-grained recognition is challenging. There is in general limited data, as fine-grained labels are much harder to acquire. More importantly, there are far fewer discriminative features compared to categorization at the basic level. Distinguishing a dog from a microwave is easy because there are plenty of helpful visual cues. In comparison, the difference between fine-grained classes can be very subtle, and only a few key features matter. Consider, for example, two very similar woodpeckers, the "Northern Flicker" and the "Red-Bellied Woodpecker" (birds A and B in Figure 1). If irrelevant features are used, it is virtually impossible to distinguish the two. But if we know the key differences, the task becomes easy: bird A has a spotted chest, while the top of bird B's head is red. Given limited data, automatic selection of discriminative features becomes especially difficult, as a large number of irrelevant features can cause severe overfitting.

[Figure 1: Which bird, (A) or (B)? Wrong features: hard; right features: easy.]
Figure 1. The distinction between fine-grained categories is often very subtle. It is crucial to identify the key features: if the wrong features are selected, the task can be very difficult. A small number of right features, on the other hand, makes the task easy.
To tackle the challenge of feature selection, one approach is to apply specialized domain knowledge. This approach can yield great success [16], but demands from the researcher a deep understanding of the specific domain. Another promising direction is including the crowd in the loop by having humans either label or propose parts and attributes [3, 13, 11, 22, 9, 21]. These approaches can potentially reduce the burden of domain-specific engineering. We refer to this category of approaches as "fine-grained crowdsourcing". It is "fine-grained" in two senses: (1) the crowd not only provides class labels indicating what the object is, but also provides detailed information on how humans achieve fine-grained recognition; (2) the learning algorithm not only optimizes the classification accuracy but also incorporates the "finer-grained" hints from the crowd, which help avoid overfitting and lead to better generalization performance. The challenge, however, lies in how to design effective annotation tasks. Existing approaches either ask humans to label pre-defined parts and attributes [3, 13], or assign open-ended tasks such as proposing semantic labels [20, 11, 22, 9] and specifying visual

primitives (keypoints or regions) [9, 21]. Labeling parts and attributes places a large burden on the researcher in specifying the annotation task. The open-ended alternatives, on the other hand, can be difficult for quality control, as the tasks are highly subjective.
In this paper we take one step further in this direction by introducing a novel crowdsourcing approach to help computers select discriminative features. This approach does not require the researcher to specify parts and attributes, is open-ended, and has automatic quality control. Specifically, we propose a novel online game called "Bubbles" that reveals the discriminative features. Consider bird species identification as an example. In each round of the game, a player sees example images for two bird species. She is then given a new image and is asked to classify the bird into one of the two species. She earns points for a correct identification and loses points otherwise. Regardless of the outcome, the game advances to the next round with a new image and possibly a new pair of bird species. The key twist of the game is that the new image is always heavily blurred, so that the player can only see a rough outline of the bird. The player can, however, click to reveal small, circular areas of the image ("bubbles") to inspect the full details, with a penalty on game points. Through a proper setup of rewards, the game can guarantee that bubbles selected by a successful human player contain discriminative features.
The game enjoys the following advantages: (1) Domain agnostic. The only assumption is that humans can discover discriminative visual features from a handful of examples. Thus it applies to a wide range of domains and appeals to a generic crowd. In fact, learning to tell unfamiliar categories apart under time pressure creates challenge and fun. (2) Automatic quality assurance. If a player earns high scores, we know with certainty that the areas chosen must be important. (3) Cost effective. The game provides entertainment, and people will volunteer to play. This can enable large-scale data collection at low or zero cost.
Our second contribution is "BubbleBank", a new algorithm that uses the crowd-selected bubbles for fine-grained recognition. For each bubble from the game, we generate a "bubble detector" that tries to detect the same pattern in other images. Each image can then be represented by a "BubbleBank", a collection of max-pooled responses from each bubble detector. We demonstrate that BubbleBank can improve previous state-of-the-art methods by large margins on challenging fine-grained benchmarks. Fig. 2 illustrates our complete framework.
2. Related Work
Our work shares the same end goal as existing work on fine-grained recognition [16, 6, 15, 25, 19, 37, 36], but our approach is more aligned with the general line of research that places humans in the loop [20, 11, 22, 29, 9, 5, 24, 23, 32, 4, 26]. In particular, there has been success in seeking to understand how humans perform recognition, e.g. by asking humans to directly provide annotation rationales [9], to label features in NLP tasks [10], to describe the differences between pairs of images [20], or to perform tasks that are parts of the machine pipeline [23]. Our work is different in that we use online games to discover discriminative features for fine-grained recognition.

Figure 2. In our approach, the crowd first plays the "Bubbles" game, trying to classify a blurred image into one of the two given categories. During the game, the crowd is allowed to inspect circular regions ("bubbles"), with a penalty of game points. In this process, discriminative regions are revealed. Next, when a computer tries to recognize fine-grained categories, it collects the human-selected bubbles and detects similar patterns on an image. The detection responses are max-pooled to form a "BubbleBank" representation that can be used for learning classifiers.
This work relates to many human vision studies. The game is named after a well-known psychology technique for studying the features that humans use for face recognition [14]. Human subjects are shown a face image with random bubbles revealed and are asked to identify the gender or expression. Our approach differs in that our bubbles are actively chosen by the player. Another connection to human vision studies is that our game to a certain extent resembles eye tracking, revealing the locations looked at by humans.
Our game also draws inspiration from human computation [30, 31, 17], especially the seminal "Peekaboom" game [31]. In this two-player game, Player A is given a word (e.g. "cow") and an image. Player A can then click to reveal parts of the image to Player B. Player B needs to type the word "cow" after seeing only the revealed area. Our game is different. First, Peekaboom is not suitable for fine-grained recognition, because an average player cannot be expected to come up with the word "Northern Flicker". Second, the goal of Peekaboom is to locate the objects, not the discriminative parts. In particular, which parts are discriminative for a category depends on what the reference category is. In our game, we replace word typing with binary choices and make discovering discriminative visual features between unfamiliar categories part of the game play. Another difference is that our game is single-player. This eliminates the need to match two players in real time, making it much easier to deploy on paid crowdsourcing platforms such as Amazon Mechanical Turk (AMT).

Figure 3. The game UI. The goal is to correctly classify the center image into one of the two categories. A green bubble follows the cursor. The player can click to reveal the area inside the bubble. The more bubbles used, the fewer points the player can earn.
Finally, our BubbleBank algorithm is related to work that uses collections of part/object detectors [3, 13, 18]. Our approach differs in that our BubbleBank consists of detectors tailored to the outputs of the Bubbles game, with a simple representation that requires no additional detector learning.
3. The Bubbles Game
The Game Mechanism Fig. 3 shows the game UI. A player is given example images of two categories. In the center lies a blurred, de-saturated image with only the rough outline of the object visible. The goal is to correctly classify the center image into one of the two categories. A green "bubble" (of adjustable size) follows the mouse cursor as the player hovers over the center image. When the player clicks, the area under the circle is revealed in full detail. If the player answers correctly, she earns new points. Otherwise she loses points. Either way, the game then advances to the next round, with a new center image and possibly a new pair of categories. Note that all images are assumed to have ground-truth class labels, so that we can instantly judge the player's answers.
We design the reward of the game such that a player can only earn high scores if she identifies the categories correctly and uses bubbles parsimoniously. First, we make the penalty for wrong answers very large: for example, +100 points for a correct identification but −300 for an incorrect one. This renders random guessing an ineffective strategy (a random binary guess loses 0.5 × 100 − 0.5 × 300 = −100 points per round in expectation). Also, the player is allowed to pass difficult images or categories with no penalty, so that she is not forced to guess. Second, there is a cost associated with the total area revealed: the points for a correct identification decrease as more area is revealed. For example, in our experiments the scores typically drop to zero when about 30% of the object bounding box is revealed. This encourages careful bubble use. This reward setup therefore reliably distinguishes good players and assures the quality of their bubbles.
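For concreteness, here is a minimal sketch of this scoring rule in Python. The +100/−300 values and the roughly-30% area cutoff follow the setup above; the linear decay shape is our assumption for illustration, since only the zero-crossing point is specified.

```python
def round_score(correct: bool, area_revealed: float,
                reward: float = 100.0, penalty: float = -300.0,
                max_area: float = 0.3) -> float:
    """Score for one round of the Bubbles game.

    area_revealed: fraction of the object bounding box revealed (0..1).
    A wrong answer costs a flat penalty, so random guessing loses
    points in expectation; the reward for a correct answer decays
    with the revealed area and reaches zero at max_area (~30%).
    """
    if not correct:
        return penalty
    # Linear decay (an assumption): full reward at 0% revealed,
    # zero reward once max_area of the bounding box is revealed.
    return reward * max(0.0, 1.0 - area_revealed / max_area)
```

Under this sketch, a correct answer after revealing 5% of the bounding box earns about 83 points, while a wrong answer always costs 300.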
Another issue of game design is determining the amount of blurring for the center image. With insufficient blurring, the player can directly identify the category, whereas too much blurring would obscure the global shape. To address this issue, we start with a small amount of blurring and increase it gradually in new games until the use of bubbles becomes necessary. Note that this in fact creates useful side information about the scale of the discriminative features.
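As a sketch, this schedule could be realized with a Gaussian blur whose sigma grows geometrically across new games until accuracy without bubbles approaches chance; the growth factor and the stopping test below are our assumptions, since the text only states that blur is increased gradually.

```python
import cv2  # pip install opencv-python

def blur_image(image, sigma: float):
    """Render the heavily blurred center image (the de-saturation
    also used in the game is omitted here for brevity)."""
    return cv2.GaussianBlur(image, (0, 0), sigma)

def next_sigma(sigma: float, acc_without_bubbles: float,
               chance: float = 0.5, growth: float = 1.5) -> float:
    """Increase blur for new games while players can still classify
    the blurred image alone, i.e. accuracy stays above chance."""
    if acc_without_bubbles > chance + 0.05:
        return sigma * growth  # still too easy: blur more
    return sigma               # bubbles are now necessary: keep sigma
```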
The game can be enjoyable as it has an engaging challenge-reward setup with instant feedback. To earn high scores, the user needs to discover the differences between highly confusing categories. This is similar to the classical "spot-the-difference" games. Next, the user needs to think about where to place the bubbles. To further enhance the experience, we can create a sense of time pressure by adding a countdown timer and "freezing" the bubbles for a few seconds once a certain amount of area has been revealed.
We finally note that there is nothing specific about birds in the game design. In particular, the players do not need to understand any attributes or parts. Thus the game can be readily applied in a different domain. The only assumption is that humans can learn from a few examples, which our large-scale AMT deployment shows to be valid.
AMT Deployment The game is suitable for deployment on paid crowdsourcing platforms such as AMT. Each AMT task consists of multiple rounds of games. The worker must score enough points in order to submit the task; otherwise the games continue indefinitely. The threshold for submission is set high enough that random guessing is infeasible. This ensures that only good workers are able to submit. Notably, there is no need to make approval/reject decisions, as is necessary for conventional tasks. All submissions are guaranteed to be high quality and can be automatically approved. This is a major advantage, as quality control is often a significant concern in crowdsourcing.
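A sketch of the submission gate (the threshold value and the play_round callback are placeholders; the key property is that submission is controlled purely by the cumulative score, with no manual approval step):

```python
from typing import Callable

def run_amt_task(play_round: Callable[[], float], threshold: float) -> float:
    """Keep serving game rounds until the worker's cumulative score
    reaches the submission threshold. A random guesser loses points
    in expectation and never qualifies; good players qualify quickly,
    so every submission is high quality by design.
    """
    total = 0.0
    while total < threshold:
        total += play_round()  # e.g. round_score(...) from above
    return total
```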
We deployed the game on AMT using the CUB-200-2010 bird dataset [35], which contains 200 types of birds. A total of 275 workers submitted 3339 tasks, at an average price of $0.07 each. This gives 90659 rounds of games, an average of 27 rounds per task. We generate the games from visually confusing category pairs (see Sec. 5.2 for details). Each round identifies one image and lasts 25 seconds on average. Among all game rounds, 71% were successful (i.e. the player correctly identifies the category), 14% failed, and 15% were skipped by passing the image or switching categories. Fig. 4 shows examples of successful games for four pairs of categories. Remarkably, the workers are able to discover the subtle differences between very difficult pairs of categories. As shown in Fig. 4, the difference between "Common Tern" and "Arctic Tern" lies in whether the tip of the beak is black and in the length of the tail. Also observe how different features are selected for the same image when it is discriminated against different categories. When "Common Tern" is compared against "Herring Gull", the black patch on the head is discriminative and gets picked often. But when it is discriminated against "Arctic Tern", the black patch is no longer relevant and is less frequently chosen. For failed games, we observe that a significant fraction is due to a few very difficult category pairs.

Figure 4. Examples of game results from AMT. The red boxes show zoomed-in views of the bubbles. Top row: bubbles drawn on images of "Common Tern" when compared against "Herring Gull". Second row: bubbles for "Common Tern" on the same images as the top row, when compared against "Arctic Tern". Third row: bubbles for "Parakeet Auklet" when compared against "Horned Puffin". Fourth row: bubbles for "Parakeet Auklet" on the same images as the third row, when compared against "Least Auklet".
It is also remarkable how little is needed to distinguish a pair of fine-grained categories. Fig. 5 plots the cumulative distribution of the area revealed in successful games: over 90% of the games reveal less than 10% of the object bounding box. This validates our hypotheses that (1) humans can indeed discover the fine differences from a handful of examples, and (2) for fine-grained recognition, the key features are highly local.
[Figure 5 plot: x-axis, proportion of image area revealed in a game; y-axis, percentage of games using equal or less area; curves for bubble sizes of 1%, 5%, and 10% of an image.]
Figure 5. Statistics of image area revealed in successful games. The area revealed in most of the successful games is small. Over 90% of the games use less than 10% of the object bounding box.
Finally, we can aggregate the bubbles on the same image from multiple games played by multiple players and obtain a heat map of discriminative regions. Fig. 6 shows two examples. It suggests that the game can indeed discover meaningful cues for fine-grained recognition.
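One simple way to build such a heat map is to accumulate each bubble as a disk of votes and normalize by the number of games; this disk-accumulation kernel is our assumption, as the exact aggregation is not specified.

```python
import numpy as np

def bubble_heatmap(shape, bubbles):
    """Heat map of discriminative regions for one image.

    shape:   (H, W) of the image.
    bubbles: list of (cx, cy, r) bubble circles in pixels, pooled
             over multiple games and players.
    Each bubble votes uniformly inside its disk; the result is the
    per-pixel frequency of being revealed across games.
    """
    bubbles = list(bubbles)
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    heat = np.zeros(shape, dtype=np.float32)
    for cx, cy, r in bubbles:
        heat += (xs - cx) ** 2 + (ys - cy) ** 2 <= r ** 2
    return heat / max(len(bubbles), 1)
```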
4. The BubbleBank Algorithm
The Bubbles game reveals discriminative features. In this section we show how to use the human-selected bubbles to improve recognition. Our basic idea is to generate a detector for each bubble and represent each image as a collection of responses of the bubble detectors.
Figure 6. Heat maps of bubbles averaged over multiple games played by multiple players.

The Bubble Detectors Since each bubble is drawn in the context of discriminating two classes, we start by assuming only two classes. Our intuition is that since each bubble contains discriminative features for recognition, it suffices to detect such patterns in a test image. It is thus natural to obtain a detector for each bubble.
How do we represent each bubble detector? Since each bubble is usually a small area, it can be represented by a single descriptor such as SIFT, or a concatenation of simple descriptors. This descriptor acts as an image filter: to detect on a test image, we convolve it with densely sampled patches and then take the maximum response (max-pooling). To further exploit the cues provided by the bubbles, we specify a pooling region for each detector. Instead of convolving with the entire image, each detector operates on a fixed, rectangular region whose center is determined by the relative location of the bubble in the original image. In other words, we have a strong spatial prior about where we expect to detect bubbles. Note that here we have assumed that the object has been localized, as is standard in the classification task in fine-grained recognition [35, 37, 36].
Now, assume that we have collected multiple bubbles, each from a training image of one of the two classes (each training image can have multiple bubbles, from a single round of the game or from multiple games played by different players). We can then form a bank of bubble detectors (a "BubbleBank") and represent the image by a vector of the max-pooled responses from each detector, in a spirit similar to the ObjectBank [18] representation. A binary classifier can then be learned on top of this representation. Fig. 2 illustrates the BubbleBank representation.
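A minimal sketch of this representation, assuming descriptors have already been densely extracted and unit-normalized; the 0.5 × 0.5 pooling box matches the setting used in our experiments, and the variable names are illustrative:

```python
import numpy as np

def bubblebank_feature(dense_desc, dense_xy, detectors):
    """BubbleBank representation of one image.

    dense_desc: (N, D) unit-normalized descriptors densely extracted
                from the cropped image.
    dense_xy:   (N, 2) descriptor locations, normalized to [0, 1]^2.
    detectors:  list of (d, (bx, by)) pairs: the bubble's descriptor
                and its normalized location in the training image.
    Returns the vector of max-pooled detector responses.
    """
    feats = np.zeros(len(detectors), dtype=np.float32)
    for i, (d, (bx, by)) in enumerate(detectors):
        # Spatial prior: pool only within a 0.5 x 0.5 box centered
        # at the bubble's original (normalized) location.
        in_box = (np.abs(dense_xy[:, 0] - bx) <= 0.25) & \
                 (np.abs(dense_xy[:, 1] - by) <= 0.25)
        resp = dense_desc[in_box] @ d  # correlation with each patch
        feats[i] = resp.max() if resp.size else 0.0
    return feats
```

A linear classifier (e.g. an SVM trained with LIBLINEAR) can then be learned on these feature vectors.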
Extending to Multiple Classes Extending to multiple classes is straightforward: we can simply obtain bubbles for all pairs of categories and then use all of them to form our BubbleBank. This, however, does not scale well with the number of classes, because we need to run O(K^2) games for K classes. Fortunately, obtaining bubbles for every pair of categories is unnecessary in practice. Not all classes are equally similar to others. It is likely that a bubble useful for differentiating a class from another, very confusing class is also helpful for discriminating the same class against less similar ones. For example, the bubbles selected for "Common Tern" against "Herring Gull" in Fig. 4 are also useful for distinguishing "Common Tern" from the woodpeckers in Fig. 1. Therefore, for a large number of classes, we can pick only the most confusing category pairs. Specifically, we can first train a baseline classifier and then find the confusing pairs via cross-validation. Alternatively, if a semantic hierarchy is available and visual similarity between classes is known to align well with the semantic hierarchy, as is often the case [8], we can directly select pairs of categories from within small subtrees.
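As a sketch, the confusing pairs can be read off a cross-validated confusion matrix as below; symmetrizing the matrix is our assumption, and any monotone measure of pairwise confusion would serve:

```python
import numpy as np

def most_confusing_pairs(conf, k):
    """Select category pairs to send to the Bubbles game.

    conf: (K, K) confusion matrix of a baseline classifier, obtained
          by cross-validation on training data only.
    Returns the k most confused unordered pairs (i, j) with i < j.
    """
    sym = conf + conf.T                    # confusion in either direction
    iu = np.triu_indices(conf.shape[0], k=1)
    order = np.argsort(sym[iu])[::-1][:k]  # largest confusion first
    return [(int(iu[0][m]), int(iu[1][m])) for m in order]
```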
We conclude this section by further comparing BubbleBank with related methods. On one hand, BubbleBank is related to a class of methods that learn attribute, part, or object detectors (ObjectBank [18], Poselet [3], Birdlet [13]) and use their responses for classification. However, all these methods require additional annotation to train the detectors. On the other hand, BubbleBank is also related to more generic methods such as the codebook-free and annotation-free approach (CFAF) [36] and LLC [34]. These approaches use simple template representations but generate them through uniform or random sampling, with no additional supervision. Here we highlight some key differences: (1) our detectors are derived from a game that guarantees quality; (2) due to the assured quality, our representation of bubble detectors can be made very simple, using low-level descriptors without additional training; (3) we assume a strong spatial prior for each bubble detector.
5. Experiments
5.1. Dataset and Implementation
Dataset We use a standard fine-grained benchmark, the CUB-200 dataset [35], which contains 200 bird species. There are 6033 images in total and around 30 images per class. All of our experiments use the default training-test split. We experiment on the full dataset as well as on a subset of 14 classes from the Vireo and Woodpecker families (CUB-14) that has been used in previous work [13, 36, 38]. All images are cropped to the bounding boxes, as is standard for many published results [35, 5, 37, 36, 1]. At test time, we do not use any ground-truth information other than assuming that the image has been cropped.
Bubble Detectors We implement the bubble detectors using SIFT [27] and color histograms extracted at the bubble locations. The color histograms are based on a color naming method [28] that converts each pixel into an 11-dimensional vector, each dimension representing the probability of one of the 11 basic color terms (e.g. "black", "blue", "brown", etc.). We form an L2-normalized histogram by averaging the color naming vectors within each bubble. The color vector is then concatenated with the SIFT descriptor to form the final 139-dimensional descriptor. To run the bubble detectors, we resize an image to a max dimension of 300 pixels and extract the same SIFT and color descriptors on dense patches at every 2 pixels at multiple resolutions.
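Assembling the final descriptor is then a concatenation; a sketch, assuming the 128-dimensional SIFT vector and the per-pixel color-naming probabilities are produced by their respective extractors:

```python
import numpy as np

def bubble_descriptor(sift, color_probs):
    """139-dimensional bubble descriptor: SIFT + color histogram.

    sift:        (128,) SIFT descriptor at the bubble location.
    color_probs: (P, 11) per-pixel probabilities of the 11 basic
                 color terms for the P pixels inside the bubble.
    """
    hist = color_probs.mean(axis=0)          # average color naming
    hist /= np.linalg.norm(hist) + 1e-12     # L2 normalization
    return np.concatenate([sift, hist])      # 128 + 11 = 139 dims
```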

References (selected)

R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 2008.

J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained linear coding for image classification. In CVPR, 2010.

L. von Ahn and L. Dabbish. Labeling images with a computer game. In CHI, 2004.

K. E. A. van de Sande, T. Gevers, and C. G. M. Snoek. Evaluating color descriptors for object and scene recognition. IEEE TPAMI, 2010.

P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, Caltech, 2010.
Frequently Asked Questions (18)
Q1. What have the authors contributed in "Fine-Grained Crowdsourcing for Fine-Grained Recognition"?

In this work, the authors include humans in the loop to help computers select discriminative features. The authors introduce a novel online game called "Bubbles" that reveals discriminative features humans use. The authors next propose the "BubbleBank" algorithm that uses the human selected bubbles to improve machine recognition performance. Experiments demonstrate that their approach yields large improvements over the previous state of the art on challenging benchmarks.

Through proper setup of reward, the game can guarantee that bubbles selected by a successful human player contain discriminative features. 

Their intuition is that since each bubble contains discriminative features for recognition, it suffices to detect such patterns in a test image. 

Since each bubble is usually a small area, it can be represented by a single descriptor such as SIFT, or a concatenation of simple descriptors. 

The authors set the penalty on wrong answers very large, for example, 100 points for correct identification but −300 for incorrect ones.

The authors design the reward of the game such that a player can only earn high scores if she identifies the categories correctly and uses bubbles parsimoniously.

To run the bubble detectors, the authors resize an image to a max dimension of 300 pixels and extract the same SIFT and color descriptors on dense patches at every 2 pixels at multiple resolutions.

The authors specify the pooling region for each detector to be a 0.5 × 0.5 rectangle centered at the original bubble location, after normalizing all (x, y) coordinates to be in [0, 1] × [0, 1].

Using only 1634 human selected bubbles (5% of the entire set), the authors already outperform CFAF [36] (51.05% versus 44.73%).

Extending to Multiple Classes Extending to multiple classes is straightforward — the authors can simply obtain bubbles for all pairs of categories and then use all of them to form their BubbleBank.

To further enhance the experience, the authors can create a sense of time pressure by adding a countdown timer and “freezing” the bubbles for a few seconds once a certain amount of area has been revealed. 

It is “fine-grained” in two senses: (1) the crowd not only provides class labels indicating what the object is, but also provides detailed information on how humans achieve fine grained recognition; (2) the learning algorithm not only optimizes the classification accuracy but also incorporates the “finer-grained” hints from the crowd, which would help avoid overfitting and lead to better generalization performance. 

To address this issue, the authors start with a small amount of blurring and increase it gradually in new games until the use of bubbles becomes necessary. 

It is likely that a bubble useful for differentiating a class from another very confusing class is also helpful for discriminating the same class against less similar ones. 

The authors experiment on the full dataset as well as a subset of 14 classes from the Vireo and Woodpecker family (CUB-14) that have been used in previous work [13, 36, 38]. 

With random bubbles, the performance is similar to the 44.73% achieved by CFAF [36], which also uses random templates but further boosts performance by a bagging technique.

The authors first obtain the confusion matrix of the KDES method [2] via cross-validation on training data (no test data is used) and then pick the top 763 most confusing pairs. 

Fig. 5 plots the cumulative distribution of the area revealed in successful games — over 90% of the games reveal less than 10% of the object bounding box.