How Do Humans Sketch Objects?
Mathias Eitz, TU Berlin (m.eitz@tu-berlin.de)
James Hays, Brown University (hays@cs.brown.edu)
Marc Alexa, TU Berlin (marc.alexa@tu-berlin.de)
Figure 1: In this paper we explore how humans sketch and recognize objects from 250 categories such as the ones shown above.
Abstract
Humans have used sketching to depict our visual world since pre-
historic times. Even today, sketching is possibly the only rendering
technique readily available to all humans. This paper is the first
large scale exploration of human sketches. We analyze the distri-
bution of non-expert sketches of everyday objects such as ‘teapot’
or ‘car’. We ask humans to sketch objects of a given category and
gather 20,000 unique sketches evenly distributed over 250 object
categories. With this dataset we perform a perceptual study and find
that humans can correctly identify the object category of a sketch
73% of the time. We compare human performance against compu-
tational recognition methods. We develop a bag-of-features sketch
representation and use multi-class support vector machines, trained
on our sketch dataset, to classify sketches. The resulting recogni-
tion method is able to identify unknown sketches with 56% accu-
racy (chance is 0.4%). Based on the computational model, we
demonstrate an interactive sketch recognition system. We release
the complete crowd-sourced dataset of sketches to the community.
CR Categories: I.3.6 [Computer Graphics]: Methodology and
Techniques—Interaction techniques;
Keywords: sketch, recognition, learning, crowd-sourcing
Links: DL PDF
1 Introduction
Sketching is a universal form of communication. Since prehistoric
times, people have rendered the visual world in sketch-like petro-
glyphs or cave paintings. Such pictographs predate the appearance
of language by tens of thousands of years and today the ability to
draw and recognize sketched objects is ubiquitous. In fact, recent
neuroscience work suggests that simple, abstracted sketches acti-
vate our brain in similar ways to real stimuli [Walther et al. 2011].
Despite decades of graphics research, sketching is the only mecha-
nism for most people to render visual content. However, there has
never been a formal study of how people sketch objects and how
well such sketches can be recognized by humans and computers.
We examine these topics for the first time and demonstrate applica-
tions of computational sketch understanding. In this paper we use
the term ‘sketch’ to mean a non-expert, abstract pictograph and not
to imply any particular medium (e.g. pencil and paper).
There exists significant prior research on retrieving images or 3d
models based on sketches. The assumption in all of these works
is that, in some well-engineered feature space, sketched objects re-
semble their real-world counterparts. But this fundamental assumption is often violated: most humans are not faithful artists. Instead
people use shared, iconic representations of objects (e.g. stick fig-
ures) or they make dramatic simplifications or exaggerations (e.g.
pronounced ears on rabbits). Thus to understand and recognize
sketches an algorithm must learn from a training dataset of real
sketches, not photos or 3d models. Because people represent the
same object using differing degrees of realism and distinct draw-
ing styles (see Fig. 1), we need a large dataset of sketches which
adequately samples these variations.
There also exists prior research in sketch recognition which tries to
identify predefined glyphs in narrow domains (e.g. wire diagrams,
musical scores). We instead identify objects such as snowmen, ice
cream cones, giraffes, etc. This task is hard, because an average
human is, unfortunately, not a faithful artist. Although both shape and proportions of a sketched object may be far from those of the corresponding real object, and sketches are at the same time an impoverished visual representation, humans are amazingly accurate at interpreting such sketches.
We first define a taxonomy of 250 object categories and acquire
a large dataset of human sketches for the categories using crowd-
sourcing (Sec. 3). Based on the dataset we estimate how humans
perform in recognizing the categories for each sketch (Sec. 4).
We design a robust visual feature descriptor for sketches (Sec. 5).
This feature permits not only unsupervised analysis of the dataset
(Sec. 6), but also the computational recognition of sketches (Sec. 7).
While we achieve a high computational recognition accuracy of
56% (chance is 0.4%), our study also reveals that humans still per-
form significantly better than computers at this task. We show sev-
eral interesting applications of the computational model (Sec. 8):
apart from the interactive recognition of sketches itself, we also
demonstrate that recognizing the category of a sketch could im-
prove image retrieval. We hope that the use of sketching as a visual
input modality opens up computing technology to a significantly
larger user base than text input alone. This paper is a first step

toward this goal and we release the dataset to encourage future re-
search in this domain.
2 Related work
Our work has been inspired by and made possible by recent
progress in several different areas, which we review below.
2.1 Sketch-based retrieval and synthesis
Instead of an example image as in content-based retrieval [Datta
et al. 2008], user input for sketch-based retrieval is a simple binary
sketch, exactly as in our setting. The huge difference compared
to our approach is that these methods do not learn from example
sketches and thus generally do not achieve semantic understanding
of a sketch. Retrieval results are purely based on geometric simi-
larity between the sketch and the image content [Chalechale et al.
2005; Eitz et al. 2011a; Shrivastava et al. 2011; Eitz et al. 2012].
This can help make retrieval efficient as it often can be cast as a
nearest-neighbor problem [Samet 2006]. However, retrieving per-
ceptually meaningful results can be difficult as users generally draw
sketches in an abstract way that is geometrically far from the real
photographs or models (though still recognizable for humans as we
demonstrate later in this paper).
Several image synthesis systems build upon the recent progress in
sketch-based retrieval and allow users to create novel, realistic im-
agery using sketch exemplars. Synthesis systems that are based on
user sketches alone have to rely on huge amounts of data to off-
set the problem of geometric dissimilarity between sketches and
image content [Eitz et al. 2011b] or require users to augment the
sketches with text labels [Chen et al. 2009]. Using template match-
ing to identify face parts, Dixon et al. [2010] propose a system that
helps users get proportions right when sketching portraits. Lee et
al. [2011] build upon this idea and generalize real-time feedback
assisted sketching to a few dozen object categories. Their approach
uses fast nearest neighbor matching to find geometrically similar
objects and blends those object edges into rough shadow guide-
lines. As with other sketch-based retrieval systems, users must draw edges faithfully for the retrieval to work in the presence of many object categories; poor artists see no benefit from the system.
2.2 Sketch recognition
While there is no previous work on recognizing sketched objects,
there is significant research on recognizing simpler, domain spe-
cific sketches. The very first works on sketch-recognition [Suther-
land 1964; Herot 1976] introduce sketching as a means of human-
computer interaction and provide tools that are nowadays ubiquitous, such as drawing lines and curves using mouse or pen. More re-
cent approaches try to understand human sketch input at a higher
level. They are tuned to automatically identify a small variety of
stroke types, such as lines, circles or arcs from noisy user input
and achieve near perfect recognition rates in real-time for those
tasks [Sezgin et al. 2001; Paulson and Hammond 2008]. Ham-
mond and Davis [2005] exploit these lower-level recognizers to
identify higher level symbols in hand-drawn diagrams. If the appli-
cation domain contains a lot of structure, as in the case of chemical molecules or mathematical equations and diagrams, this can be
exploited to achieve very high recognition rates [LaViola Jr. and
Zeleznik 2007; Ouyang and Davis 2011].
2.3 Object and scene classification
Our overall approach is broadly similar to recent work in the com-
puter vision community in which large, categorical databases of
Figure 2: Instructional examples shown to workers on Mechanical Turk for the categories barn, bathtub, blimp and cow. In each field: desired sketching style (left), undesired sketching style (right), annotated with the violated requirement ('context around object not allowed', 'not easily recognizable', 'text labels not allowed', 'large black areas not allowed').
visual phenomena are used to train recognition systems. High-
profile examples of this include the Caltech-256 database of ob-
ject images [Griffin et al. 2007], the SUN database of scenes [Xiao
et al. 2010], and the LabelMe [Russell et al. 2008] and Pascal
VOC [Everingham et al. 2010] databases of spatially annotated ob-
jects in scenes. The considerable effort that goes into building these
databases has allowed algorithms to learn increasingly effective
classifiers and to compare recognition systems on common bench-
marks. Our pipeline is similar to many modern computer vision al-
gorithms although we are working in a new domain which requires
a new, carefully tailored representation. We also need to generate
our data from scratch because there are no preexisting, large repos-
itories of sketches as there are for images (e.g. Flickr). For this
reason we utilize crowd-sourcing to create a database of human ob-
ject sketches and hope that it will be as useful to the community as
existing databases in other visual domains.
3 A large dataset of human object sketches
In this section we describe the collection of a dataset of 20,000 hu-
man sketches. This categorical database is the basis for all learning,
evaluation, and applications in this paper. We define the following
set of criteria for the object categories in our database:
Exhaustive The categories exhaustively cover most objects that
we commonly encounter in everyday life. We want a broad
taxonomy of object categories in order to make the results in-
teresting and useful in practice and to avoid superficially sim-
plifying the recognition task.
Recognizable The categories are recognizable from their shape
alone and do not require context for recognition.
Specific Finally, the categories are specific enough to have rela-
tively few visual manifestations. ‘Animal’ or ‘musical instru-
ment’ would not be good object categories as they have many
subcategories.
3.1 Defining a taxonomy of 250 object categories
In order to identify common objects, we start by extracting the
1,000 most frequent labels from the LabelMe [Russell et al. 2008]
dataset. We manually remove duplicates (e.g. car side vs. car front)
as well as labels that do not follow our criteria. This gives us an
initial set of categories. We augment this with categories from the
Princeton Shape Benchmark [Shilane et al. 2004] and the Caltech
256 dataset [Griffin et al. 2007]. Finally, we add categories by ask-
ing members of our lab to suggest object categories that are not
yet in the list. Our current set of 250 categories is quite exhaus-

Figure 3: Stroke length in sketches over drawing time (median and 10th to 90th percentile): initial strokes are significantly longer than later in the sketching process. On the x-axis, time is normalized for each sketch.
tive as we find it increasingly difficult to come up with additional
categories that adhere to the desiderata outlined above.
3.2 Collecting 20,000 sketches
We ask participants to draw one sketch at a time given a random
category name. For each sketch, participants start with an empty
canvas and have up to 30 minutes to create the final version of the
sketch. We keep our instructions as simple as possible and ask par-
ticipants to “sketch an image [...] that is clearly recognizable to
other humans as belonging to the following category: [...]”. We
also ask users to draw outlines only and not use context around the
actual object. We provide visual examples that illustrate these re-
quirements, see Fig. 2. We provide undo, redo, clear, and delete
buttons for our stroke-based sketching canvas so that participants
can easily familiarize themselves with the tool while drawing their
first sketch. After finishing a sketch participants can move on and
draw another sketch given a new category. In addition to the spatial
parameters of the sketched strokes we store their temporal order.
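A minimal, hypothetical record structure for such stroke data (our own illustration; the names and encoding here are assumptions, not the format of the released dataset) could look like this:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Stroke:
    # (x, y) canvas coordinates in drawing order; timestamps in ms since sketch start.
    points: List[Tuple[float, float]]
    timestamps: List[float]

@dataclass
class Sketch:
    category: str          # e.g. 'teapot'
    strokes: List[Stroke]  # stored in the temporal order they were drawn

# Example: a two-stroke toy sketch for the category 'ladder'.
toy = Sketch(
    category='ladder',
    strokes=[
        Stroke(points=[(10, 10), (10, 200)], timestamps=[0, 350]),
        Stroke(points=[(60, 10), (60, 200)], timestamps=[900, 1300]),
    ],
)
```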
Crowd-sourcing Collecting 20,000 sketches requires a huge
number of participants so we rely on crowd-sourcing to generate
our dataset. We use Amazon Mechanical Turk (AMT) which is a
web-based market where requesters can offer paid “Human Intelli-
gence Tasks” (HITs) to a pool of non-expert workers. We submit
90 × 250 = 22,500 HITs, requesting 90 sketches for each of the
250 categories. In order to ensure a diverse set of sketches within
each category, we limit the number of sketches a worker could draw
to one per category.
In total, we receive sketches from 1,350 unique participants who
spent a total of 741 hours to draw all sketches. The median drawing
time per sketch is 86 seconds with the 10th and 90th percentile at 31 and 280 seconds, respectively. The participants draw a total of
351,060 strokes with each sketch containing a median number of 13
strokes. We find that the first few strokes of a sketch are on average
considerably longer than the remaining ones, see Fig. 3. This sug-
gests that humans tend to follow a coarse-to-fine drawing strategy,
first outlining the shape using longer strokes and then adding detail
at the end of the sketching process.
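As an illustration of the kind of analysis behind Fig. 3, the following sketch computes a median stroke length per normalized-time bin, reusing the hypothetical Stroke/Sketch records from above; the binning scheme and the use of each stroke's start time are our own simplifications, not the authors' code:

```python
import numpy as np

def stroke_length(stroke: Stroke) -> float:
    """Polyline length of a stroke: sum of distances between consecutive points."""
    pts = np.asarray(stroke.points, dtype=float)
    return float(np.linalg.norm(np.diff(pts, axis=0), axis=1).sum())

def length_over_time(sketches, num_bins=20):
    """Median stroke length per normalized-time bin, pooled over all sketches."""
    bins = [[] for _ in range(num_bins)]
    for sketch in sketches:
        if not sketch.strokes:
            continue
        t_end = max(s.timestamps[-1] for s in sketch.strokes) or 1.0
        for stroke in sketch.strokes:
            # Normalized start time of the stroke in [0, 1).
            t = min(stroke.timestamps[0] / t_end, 1.0 - 1e-9)
            bins[int(t * num_bins)].append(stroke_length(stroke))
    return [float(np.median(b)) if b else float('nan') for b in bins]

print(length_over_time([toy]))  # uses the toy example defined above
```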
Our sketching task appears to be quite popular on Mechanical Turk: we received a great deal of positive feedback, the time to complete
all HITs was low, and very few sketches were unusable, adversar-
ial, or automated responses. Still, as with any crowd-sourced data
collection effort, steps must be taken to ensure that data collected
Figure 4: Scatter plot of per-worker sketch recognition performance. Solid (green) dots represent single, unique workers and give their average classification accuracy (y-axis) at the number of sketches they classified (x-axis). Outlined (blue) dots represent the overall average accuracy of all workers that have worked on more than the number of sketches indicated on the x-axis.
from non-expert, untrained users is of sufficient quality.
Data verification We manually inspect and clean the complete
dataset using a simple interactive tool we implemented for this pur-
pose. The tool displays all sketches of a given category on a large
screen which lets us identify incorrect ones at a glance. We remove
sketches that are clearly in the wrong category (e.g. an airplane in
the teapot category), contain offensive content or otherwise do not
follow our requirements (typically excessive context). We do not
remove sketches just because they are poorly drawn. As a result of
this procedure, we remove about 6.3% of the sketches. We truncate
the dataset to contain exactly 80 sketches per category yielding our
final dataset of 20,000 sketches. We make the categories uniformly
sized to simplify training and testing (e.g. we avoid the need to cor-
rect for bias toward the larger classes when learning a classifier).
4 Human sketch recognition
In this section we analyze human sketch recognition performance
and use the dataset gathered in Sec. 3 for this task. Our basic ques-
tions are the following: given a sketch, what is the accuracy with
which humans correctly identify its category? Are there categories
that are easier/more difficult to discern for humans? To provide an-
swers to those questions, we perform a second large-scale, crowd-
sourced study (again using Amazon Mechanical Turk) in which we
ask participants to identify the category of query sketches. This
test provides us with an important human baseline which we later
compare against our computational recognition method. We invite
the reader to try this test on the sketches shown in Fig. 1: can you
correctly identify all categories and solve the riddle hidden in this
figure?
4.1 Methodology
Given a random sketch, we ask participants to select the best fitting
category from the set of 250 object categories. We give workers
unlimited time, although workers are naturally incentivized to work
quickly for greater pay. To avoid the frustration of scrolling through
a list of 250 categories for each query, we roughly follow Xiao et
al. [2010] and organize the categories in an intuitive 3-level hierar-
chy, containing 6 top-level and 27 second-level categories such as
‘animals’, ‘buildings’ and ‘musical instruments’.

Figure 5: Representative sketches from the 18 categories with the highest human category recognition rate (shown in the bottom right corner of each field, in percent), ranging from ‘t-shirt’ (100%) over ‘snake’, ‘comb’ and ‘flower’ (99%) down to ‘chair’ and ‘key’ (95%).
We submit a total of 5,000 HITs to Mechanical Turk, each requir-
ing workers to sequentially identify four sketches from random cat-
egories. This gives us one human classification result for each of
the 20,000 sketches. We include several sketches per HIT to pre-
vent workers from skipping tasks that contain ‘difficult’ sketches (based on AMT preview functions) as this would artificially inflate
the accuracy of certain workers. In order to measure performance
from many participants, we limit the maximum paid HITs to 100
per worker, i.e. 400 sketches (however, three workers did 101, 101
and 119 HITs, respectively, see Fig. 4).
4.2 Human classification results
Humans recognize on average 73.1% of all sketches cor-
rectly. We observe a large variance over the categories: while all
participants correctly identified all instances of ‘t-shirt’, the ‘seag-
ull’ category was only recognized 2.5% of the time. We visu-
alize the categories with highest human recognition performance
in Fig. 5 and those with lowest performance in Fig. 6 (along with the
most confusing categories). Human errors are usually confusions
between semantically similar categories (e.g. ‘panda’ and ‘bear’),
although geometric similarity accounts for some errors (e.g. ‘tire’
and ‘donut’).
If we assume that it takes participants a while to learn our taxon-
omy and hierarchy of objects, we would expect that workers who
have done more HITs are more accurate. However, this effect is
not very pronounced in our experiments: if we remove all results from workers that have classified fewer than 40 sketches, the accuracy of
the remaining workers rises slightly to 73.9%. This suggests that
there are no strong learning effects for this task. We visualize the
accuracy of each single worker as well as the overall accuracy when
gradually removing workers that have classified less than a certain
number of sketches in Fig. 4.
Figure 6: Top row: the six most difficult classes for human recognition (‘seagull’ 2.5%, ‘panda’ 11%, ‘armchair’ 13%, ‘tire’ 21%, ‘ashtray’ 24%, ‘snowboard’ 25%). E.g., only 2.5% of all seagull sketches are correctly identified as such by humans. Instead, humans often mistake sketches belonging to the classes shown in the rows below as seagulls: out of all sketches confused with seagull, 47% belong to flying bird, 24% to standing bird and 14% to pigeon. The remaining 15% (not shown in this figure) are distributed over various other classes.

5 Sketch representation

In our setting, we consider a sketch $S$ as a bitmap image, with $S \in \mathbb{R}^{m \times n}$. While the dataset from Sec. 3 is inherently vector-valued, a bitmap-based approach is more general: it lets us readily analyze any existing sketches even if they are only available in a bitmap representation.

An ideal sketch representation for our purposes is invariant to irrelevant features (e.g. scale, translation), discriminative between categories, and compact. More formally, we are looking for a feature space transform that maps a sketch bitmap to a lower-dimensional representation $x \in \mathbb{R}^d$, i.e. a mapping $f : \mathbb{R}^{m \times n} \rightarrow \mathbb{R}^d$ with (typically) $d \ll m \times n$. In the remainder of this paper we call this
representation either a feature vector or a descriptor. Ideally, the
mapping f preserves the information necessary for x to be distin-
guished from all sketches in other categories.
Probably the most direct way to define a feature space for sketches
is to directly use the (possibly down-scaled) bitmap representation itself.
Such representations work poorly in our experiments. Instead we
adopt methods from computer vision and represent sketches using
local feature vectors that encode distributions of image properties.
Specifically, we encode the distribution of line orientation within
a small local region of a sketch. The binning process during con-
struction of the distribution histograms facilitates better invariance
to slight offsets in orientation compared to directly encoding pixel
values.
In the remainder of this paper, we use the following notational con-
ventions: k denotes a scalar value, h a column vector, S a matrix
and V a set.
5.1 Extracting local features
First, we achieve global scale and translation invariance by isotropi-
cally rescaling each sketch such that the longest side of its bounding
box has a fixed length and each sketch is centered in a 256 × 256
image.
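One way to implement this normalization step, assuming the sketch is given as a binary bitmap with non-zero stroke pixels (the margin parameter and the use of scipy's zoom are our own choices, not the authors' implementation):

```python
import numpy as np
from scipy.ndimage import zoom

def normalize_sketch(bitmap: np.ndarray, out_size: int = 256, margin: int = 4) -> np.ndarray:
    """Isotropically rescale a binary sketch so the longest side of its bounding box
    has a fixed length, then center it on an out_size x out_size canvas."""
    ys, xs = np.nonzero(bitmap)                     # foreground (stroke) pixels
    if len(xs) == 0:
        return np.zeros((out_size, out_size), dtype=float)
    crop = (bitmap[ys.min():ys.max() + 1, xs.min():xs.max() + 1] > 0).astype(float)
    scale = (out_size - 2 * margin) / max(crop.shape)
    resized = zoom(crop, scale, order=1)            # bilinear, isotropic scaling
    resized = resized[:out_size, :out_size]         # guard against rounding overshoot
    out = np.zeros((out_size, out_size), dtype=float)
    y0 = (out_size - resized.shape[0]) // 2
    x0 = (out_size - resized.shape[1]) // 2
    out[y0:y0 + resized.shape[0], x0:x0 + resized.shape[1]] = resized
    return out
```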
We build upon a bag-of-features representation [Sivic and Zisserman 2003] as an intermediate step to define the mapping $f$. We represent a sketch as a large number of local features that encode local orientation estimates (but do not carry any information about their spatial location in the sketch). We write $g_{uv} = \nabla S$ for the gradient of $S$ at coordinate $(u, v)$ and $o_{uv} \in [0, \pi)$ for its orientation. We compute gradients using Gaussian derivatives to achieve reliable orientation estimates. We coarsely bin $\|g\|$ into $r$ orientation bins according to $o$, linearly interpolating into the neighboring bins to avoid sharp energy changes at bin borders. This gives us $r$ orientational response images $O$, each encoding the fraction of orientational energy at a given discrete orientation value. We find that using $r = 4$ orientation bins works well.
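A possible implementation of this step is sketched below, assuming grayscale sketch bitmaps; the Gaussian scale sigma is an assumption on our part, as the paper does not specify it here:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def orientation_responses(S: np.ndarray, r: int = 4, sigma: float = 2.0) -> np.ndarray:
    """Split gradient magnitude into r orientation response images with linear
    interpolation between neighboring orientation bins (orientations in [0, pi))."""
    S = S.astype(float)
    gy = gaussian_filter(S, sigma, order=(1, 0))   # Gaussian derivative along y
    gx = gaussian_filter(S, sigma, order=(0, 1))   # Gaussian derivative along x
    mag = np.hypot(gx, gy)                         # gradient magnitude ||g||
    ori = np.mod(np.arctan2(gy, gx), np.pi)        # orientation o in [0, pi)
    O = np.zeros((r,) + S.shape)
    pos = ori / np.pi * r                          # continuous orientation-bin position
    lo = np.floor(pos).astype(int) % r
    hi = (lo + 1) % r                              # orientation is circular on [0, pi)
    w_hi = pos - np.floor(pos)
    # Distribute each pixel's gradient energy between the two nearest bins.
    for b in range(r):
        O[b] += mag * (lo == b) * (1.0 - w_hi)
        O[b] += mag * (hi == b) * w_hi
    return O
```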
Figure 7: 1d example of fast histogram construction. The total energy in histogram bin $i$ corresponds to $(f \star t)(x_i)$ where $x_i$ is the center of bin $i$.

For each response image, we extract a local descriptor $l_j$ by binning the underlying orientational response values into a small, local histogram using 4 × 4 spatial bins, again linearly interpolating into neighboring bins. We build our final, local patch descriptor by stacking the orientational descriptors into a single column vector $d = [l_1, \dots, l_r]^T$. We normalize each local patch descriptor such that $\|d\|_2 = 1$. This results in a representation that is closely related to the one used for SIFT [Lowe 2004] but stores orientations only [Eitz et al. 2011a].
While in computer vision applications the size of local patches used
to analyze photographs is often quite small (e.g. 16 × 16 pix-
els [Lazebnik et al. 2006]), sketches contain little information at
that scale and larger patch sizes are required for an effective rep-
resentation. In our case, we use local patches covering an area of
12.5% of the size of S. We use 28 × 28 = 784 regularly spaced
sample points in S and extract a local feature for each point. The
resulting representation is a so-called bag-of-features $D = \{d_i\}$.
Due to the relatively large patch size we use, the regions covered
by the local features significantly overlap and each single pixel gets
binned into about 100 distinct histograms. As this requires many
image/histogram accesses, this operation can be quite slow. We
speed up building the local descriptors by observing that the total
energy accumulated in a single spatial histogram bin (using linear
interpolation) is proportional to the convolution of the local image
area with a 2d tent function having an extent of two times bin-
width. We illustrate this property for the 1d case in Fig. 7. As a
consequence, before creating spatial histograms, we first convolve
the response images $O_{1 \dots r}$ with the corresponding function (which we in turn speed up using the FFT). Filling a histogram bin is now
reduced to a single lookup of the response at the center of each bin.
This lets us efficiently extract a large number of local histograms
which will be an important property later in this paper.
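Putting the pieces together, a simplified implementation of the descriptor extraction using this convolution trick could look as follows; the exact tent construction, border handling and sampling grid are our own simplifications of the method described above:

```python
import numpy as np
from scipy.signal import fftconvolve

def tent_kernel(width: int) -> np.ndarray:
    """2d tent (triangle) function with an extent of two times the spatial bin width."""
    t = 1.0 - np.abs(np.linspace(-1, 1, 2 * width + 1))
    return np.outer(t, t)

def local_descriptors(O: np.ndarray, num_samples: int = 28, spatial_bins: int = 4,
                      patch_frac: float = 0.125) -> np.ndarray:
    """Extract local descriptors from orientation response images O (shape r x H x W).
    Each descriptor stacks spatial_bins^2 bins per orientation and is L2-normalized."""
    r, H, W = O.shape
    patch = int(np.sqrt(patch_frac * H * W))        # patch side covering ~12.5% of the image
    bin_width = max(1, patch // spatial_bins)
    kernel = tent_kernel(bin_width)
    # Convolving each response image once lets us fill a histogram bin with one lookup.
    conv = np.stack([fftconvolve(O[i], kernel, mode='same') for i in range(r)])
    centers_y = np.linspace(patch // 2, H - patch // 2 - 1, num_samples).astype(int)
    centers_x = np.linspace(patch // 2, W - patch // 2 - 1, num_samples).astype(int)
    offsets = (np.arange(spatial_bins) - (spatial_bins - 1) / 2.0) * bin_width
    descriptors = []
    for cy in centers_y:
        for cx in centers_x:
            d = []
            for i in range(r):
                for oy in offsets:
                    for ox in offsets:
                        y = int(np.clip(cy + oy, 0, H - 1))
                        x = int(np.clip(cx + ox, 0, W - 1))
                        d.append(conv[i, y, x])     # response at the spatial bin center
            d = np.asarray(d)
            n = np.linalg.norm(d)
            descriptors.append(d / n if n > 0 else d)
    return np.asarray(descriptors)                  # shape: (num_samples^2, r * spatial_bins^2)
```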
At this point, a sketch is represented as a so-called bag-of-features,
containing a large number of local, 64-dimensional feature vec-
tors (4 × 4 spatial bins and 4 orientational bins). In the following,
we build a more compact representation by quantizing each feature
against a visual vocabulary [Sivic and Zisserman 2003].
5.2 Building a visual vocabulary
Using a training set of $n$ local descriptors $d$ randomly sampled from our dataset of sketches, we construct a visual vocabulary using k-means clustering, which partitions the descriptors into $k$ disjoint clusters $C_i$. More specifically, we define our visual vocabulary $V$ to be the set of vectors $\{\mu_i\}$ resulting from minimizing

$$V = \arg\min_{\{\mu_i\}} \sum_{i=1}^{k} \sum_{d_j \in C_i} \|d_j - \mu_i\|^2 \qquad (1)$$

with

$$\mu_i = \frac{1}{|C_i|} \sum_{d_j \in C_i} d_j.$$
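Any standard k-means implementation can be used for this step; a minimal sketch with scikit-learn is shown below, where the vocabulary size k = 500 is an illustrative choice rather than the paper's reported setting:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_vocabulary(descriptors: np.ndarray, k: int = 500, seed: int = 0) -> np.ndarray:
    """Cluster a sample of local descriptors into k visual words (the cluster means)."""
    km = MiniBatchKMeans(n_clusters=k, random_state=seed, n_init=3)
    km.fit(descriptors)
    return km.cluster_centers_      # the set of vectors {mu_i} minimizing Eq. (1)

# Usage: stack local descriptors sampled from many sketches, then
# vocab = build_vocabulary(np.vstack(all_descriptor_sets), k=500)
```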
5.3 Quantizing features
We represent a sketch as a frequency histogram of visual words
h; this is the final representation we use throughout the paper.
As a baseline we use a standard histogram of visual words using a
‘hard’ assignment of local features to visual words [Sivic and Zis-
serman 2003]. We compare this to using ‘soft’ kernel-codebook
coding [Philbin et al. 2008] for constructing the histograms. The
idea behind kernel codebook coding is that a feature vector may be
equally close to multiple visual words but this information cannot
be captured in the case of hard assignment. Instead we use a kernel-
ized distance between descriptors that encodes weighted distances
to all visual words with a rapid falloff for distant visual words.
More specifically, we define our histogram h as:
$$h(D) = \frac{1}{|D|} \sum_{d_i \in D} q(d_i) / \|q(d_i)\|_1 \qquad (2)$$

where $q(d)$ is a vector-valued quantization function that quantizes a local descriptor $d$ against the visual vocabulary $V$:

$$q(d) = [K(d, \mu_1), \dots, K(d, \mu_k)]^T.$$

We use a Gaussian kernel to measure distances between samples, i.e.

$$K(d, \mu) = \exp(-\|d - \mu\|^2 / (2\sigma^2)).$$
Note that in Eqn. (2) we normalize h by the number of samples
to get to our final representation. Thus our representation is not
sensitive to the total number of local features in a sketch, but rather
to local structure and orientation of lines. We use σ = 0.1 in our
experiments.
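The soft quantization of Eq. (2) translates directly into code; the following sketch also includes the hard-assignment baseline for comparison (the row-wise shift before exponentiating is purely for numerical stability and cancels in the l1-normalization):

```python
import numpy as np

def soft_histogram(descriptors: np.ndarray, vocab: np.ndarray, sigma: float = 0.1) -> np.ndarray:
    """Kernel-codebook histogram h(D), Eq. (2): each local descriptor votes for all
    visual words with Gaussian weights, l1-normalized per descriptor, then averaged."""
    # Squared distances between every descriptor and every visual word.
    d2 = ((descriptors[:, None, :] - vocab[None, :, :]) ** 2).sum(axis=2)
    # Subtracting the per-row minimum avoids underflow and cancels in the normalization.
    q = np.exp(-(d2 - d2.min(axis=1, keepdims=True)) / (2.0 * sigma ** 2))
    q /= q.sum(axis=1, keepdims=True)               # l1-normalize each q(d_i)
    return q.mean(axis=0)                           # average over the bag -> histogram h

def hard_histogram(descriptors: np.ndarray, vocab: np.ndarray) -> np.ndarray:
    """Baseline: each descriptor is assigned to its single nearest visual word."""
    d2 = ((descriptors[:, None, :] - vocab[None, :, :]) ** 2).sum(axis=2)
    counts = np.bincount(d2.argmin(axis=1), minlength=len(vocab)).astype(float)
    return counts / counts.sum()
```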
6 Unsupervised dataset analysis
In this section we perform an automatic, unsupervised analysis of
our sketch dataset making use of the feature space we developed in
the previous section (each sketch is represented as a histogram of
visual words h). We would like to provide answers to the following
questions:
• What is the distribution of sketches in the proposed feature space? Ideally, we would find clusters of sketches in this space that clearly represent our categories, i.e. we would hope to find that features within a category are close to each other while having large distances to all other features.

• Can we identify iconic sketches that are good representatives of a category?

• Can we visualize the distribution of sketches in our feature space? This would help build an intuition about the representative power of the feature transform developed in Sec. 5.
Our feature space is sparsely populated (only 20,000 points in a
high-dimensional space). This makes clustering in this space a dif-
ficult problem. Efficient methods such as k-means clustering do not
give meaningful clusters as they use rigid, simple distance metrics
and require us to define the number of clusters beforehand. Instead,
we use variable-bandwidth mean-shift clustering with locality sen-
sitive hashing to speed up the underlying nearest-neighbor search
problem [Georgescu et al. 2003]. Adaptive mean-shift estimates a
density function in feature space for each histogram $h$ as:

$$f(h) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{b_i^d} K\left( \|h - h_i\|^2 / b_i^2 \right),$$

where $b_i$ is the bandwidth associated with each point and $K$ is again a Gaussian kernel. Given this definition, we compute the modes, i.e. the local maxima of the estimated density.
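For illustration, a small numpy sketch of variable-bandwidth mean-shift mode seeking with a Gaussian kernel is given below; the bandwidth heuristic (distance to the k-th nearest neighbor), the convergence test and the omission of the locality sensitive hashing speedup are our own simplifications, not the method of Georgescu et al.:

```python
import numpy as np

def adaptive_mean_shift(H: np.ndarray, k: int = 10, iters: int = 50, tol: float = 1e-5):
    """Move every point uphill on the estimated density until convergence.
    H: (n, d) array of feature histograms. Returns the mode each point converged to.
    Note: builds an O(n^2) distance matrix, so it is meant for small subsets only."""
    n = H.shape[0]
    d2_all = ((H[:, None, :] - H[None, :, :]) ** 2).sum(axis=2)
    # Per-point bandwidth b_i: distance to the k-th nearest neighbor (a common heuristic).
    b = np.sqrt(np.sort(d2_all, axis=1)[:, min(k, n - 1)]) + 1e-12
    modes = H.astype(float).copy()
    for _ in range(iters):
        d2 = ((modes[:, None, :] - H[None, :, :]) ** 2).sum(axis=2)
        w = np.exp(-d2 / (2.0 * b[None, :] ** 2))    # Gaussian kernel with bandwidth b_i
        new_modes = (w @ H) / w.sum(axis=1, keepdims=True)
        if np.max(np.abs(new_modes - modes)) < tol:
            modes = new_modes
            break
        modes = new_modes
    return modes   # points whose modes (nearly) coincide belong to the same cluster
```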

References

EVERINGHAM, M., VAN GOOL, L., WILLIAMS, C. K. I., WINN, J., AND ZISSERMAN, A. 2010. The Pascal Visual Object Classes (VOC) Challenge. International Journal of Computer Vision 88, 2, 303–338.

LAZEBNIK, S., SCHMID, C., AND PONCE, J. 2006. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proc. IEEE CVPR.

LOWE, D. G. 2004. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60, 2, 91–110.

SIVIC, J., AND ZISSERMAN, A. 2003. Video Google: A text retrieval approach to object matching in videos. In Proc. IEEE ICCV.

VAN DER MAATEN, L., AND HINTON, G. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9, 2579–2605.