How Do Humans Sketch Objects?
Mathias Eitz, TU Berlin (m.eitz@tu-berlin.de)
James Hays, Brown University (hays@cs.brown.edu)
Marc Alexa, TU Berlin (marc.alexa@tu-berlin.de)
Figure 1: In this paper we explore how humans sketch and recognize objects from 250 categories such as the ones shown above.
Abstract
Humans have used sketching to depict our visual world since pre-
historic times. Even today, sketching is possibly the only rendering
technique readily available to all humans. This paper is the first
large scale exploration of human sketches. We analyze the distri-
bution of non-expert sketches of everyday objects such as ‘teapot’
or ‘car’. We ask humans to sketch objects of a given category and
gather 20,000 unique sketches evenly distributed over 250 object
categories. With this dataset we perform a perceptual study and find
that humans can correctly identify the object category of a sketch
73% of the time. We compare human performance against compu-
tational recognition methods. We develop a bag-of-features sketch
representation and use multi-class support vector machines, trained
on our sketch dataset, to classify sketches. The resulting recogni-
tion method is able to identify unknown sketches with 56% accu-
racy (chance is 0.4%). Based on the computational model, we
demonstrate an interactive sketch recognition system. We release
the complete crowd-sourced dataset of sketches to the community.
CR Categories: I.3.6 [Computer Graphics]: Methodology and
Techniques—Interaction techniques;
Keywords: sketch, recognition, learning, crowd-sourcing
Links: DL PDF
1 Introduction
Sketching is a universal form of communication. Since prehistoric
times, people have rendered the visual world in sketch-like petro-
glyphs or cave paintings. Such pictographs predate the appearance
of language by tens of thousands of years and today the ability to
draw and recognize sketched objects is ubiquitous. In fact, recent
neuroscience work suggests that simple, abstracted sketches acti-
vate our brain in similar ways to real stimuli [Walther et al. 2011].
Despite decades of graphics research, sketching is the only mecha-
nism for most people to render visual content. However, there has
never been a formal study of how people sketch objects and how
well such sketches can be recognized by humans and computers.
We examine these topics for the first time and demonstrate applica-
tions of computational sketch understanding. In this paper we use
the term ‘sketch’ to mean a non-expert, abstract pictograph and not
to imply any particular medium (e.g. pencil and paper).
There exists significant prior research on retrieving images or 3d
models based on sketches. The assumption in all of these works
is that, in some well-engineered feature space, sketched objects re-
semble their real-world counterparts. But this fundamental assumption is often violated: most humans are not faithful artists. Instead
people use shared, iconic representations of objects (e.g. stick fig-
ures) or they make dramatic simplifications or exaggerations (e.g.
pronounced ears on rabbits). Thus to understand and recognize
sketches an algorithm must learn from a training dataset of real
sketches, not photos or 3d models. Because people represent the
same object using differing degrees of realism and distinct draw-
ing styles (see Fig. 1), we need a large dataset of sketches which
adequately samples these variations.
There also exists prior research in sketch recognition which tries to
identify predefined glyphs in narrow domains (e.g. wire diagrams,
musical scores). We instead identify objects such as snowmen, ice
cream cones, giraffes, etc. This task is hard, because an average
human is, unfortunately, not a faithful artist. Although both shape and proportions of a sketched object may be far from those of the corresponding real object, and sketches are at the same time an impoverished visual representation, humans are amazingly accurate at interpreting such sketches.
We first define a taxonomy of 250 object categories and acquire
a large dataset of human sketches for the categories using crowd-
sourcing (Sec. 3). Based on the dataset we estimate how humans
perform in recognizing the categories for each sketch (Sec. 4).
We design a robust visual feature descriptor for sketches (Sec. 5).
This feature permits not only unsupervised analysis of the dataset
(Sec. 6), but also the computational recognition of sketches (Sec. 7).
While we achieve a high computational recognition accuracy of
56% (chance is 0.4%), our study also reveals that humans still per-
form significantly better than computers at this task. We show sev-
eral interesting applications of the computational model (Sec. 8):
apart from the interactive recognition of sketches itself, we also
demonstrate that recognizing the category of a sketch could im-
prove image retrieval. We hope that the use of sketching as a visual
input modality opens up computing technology to a significantly
larger user base than text input alone. This paper is a first step

toward this goal and we release the dataset to encourage future re-
search in this domain.
2 Related work
Our work has been inspired by and made possible by recent
progress in several different areas, which we review below.
2.1 Sketch-based retrieval and synthesis
Instead of an example image as in content-based retrieval [Datta
et al. 2008], user input for sketch-based retrieval is a simple binary
sketch, exactly as in our setting. The huge difference compared
to our approach is that these methods do not learn from example
sketches and thus generally do not achieve semantic understanding
of a sketch. Retrieval results are purely based on geometric simi-
larity between the sketch and the image content [Chalechale et al.
2005; Eitz et al. 2011a; Shrivastava et al. 2011; Eitz et al. 2012].
This can help make retrieval efficient as it often can be cast as a
nearest-neighbor problem [Samet 2006]. However, retrieving per-
ceptually meaningful results can be difficult as users generally draw
sketches in an abstract way that is geometrically far from the real
photographs or models (though still recognizable for humans as we
demonstrate later in this paper).
Several image synthesis systems build upon the recent progress in
sketch-based retrieval and allow users to create novel, realistic im-
agery using sketch exemplars. Synthesis systems that are based on
user sketches alone have to rely on huge amounts of data to off-
set the problem of geometric dissimilarity between sketches and
image content [Eitz et al. 2011b] or require users to augment the
sketches with text labels [Chen et al. 2009]. Using template match-
ing to identify face parts, Dixon et al. [2010] propose a system that
helps users get proportions right when sketching portraits. Lee et
al. [2011] build upon this idea and generalize real-time feedback
assisted sketching to a few dozen object categories. Their approach
uses fast nearest neighbor matching to find geometrically similar
objects and blends those object edges into rough shadow guide-
lines. As with other sketch-based retrieval systems, users must draw edges faithfully for the retrieval to work in the presence of many object categories; poor artists see no benefit from the system.
2.2 Sketch recognition
While there is no previous work on recognizing sketched objects,
there is significant research on recognizing simpler, domain spe-
cific sketches. The very first works on sketch-recognition [Suther-
land 1964; Herot 1976] introduce sketching as a means of human-
computer interaction and provide tools that are nowadays ubiquitous, such as drawing lines and curves using mouse or pen. More re-
cent approaches try to understand human sketch input at a higher
level. They are tuned to automatically identify a small variety of
stroke types, such as lines, circles or arcs from noisy user input
and achieve near perfect recognition rates in real-time for those
tasks [Sezgin et al. 2001; Paulson and Hammond 2008]. Ham-
mond and Davis [2005] exploit these lower-level recognizers to
identify higher level symbols in hand-drawn diagrams. If the appli-
cation domain contains a lot of structure, as in the case of chemical molecules or mathematical equations and diagrams, this can be
exploited to achieve very high recognition rates [LaViola Jr. and
Zeleznik 2007; Ouyang and Davis 2011].
2.3 Object and scene classification
Our overall approach is broadly similar to recent work in the com-
puter vision community in which large, categorical databases of
Figure 2: Instructional examples shown to workers on Mechanical Turk for the categories barn, bathtub, blimp and cow. In each field: desired sketching style (left), undesired sketching style (right), annotated with the violated requirement ('context around object not allowed', 'not easily recognizable', 'text labels not allowed', 'large black areas not allowed').
visual phenomena are used to train recognition systems. High-
profile examples of this include the Caltech-256 database of ob-
ject images [Griffin et al. 2007], the SUN database of scenes [Xiao
et al. 2010], and the LabelMe [Russell et al. 2008] and Pascal
VOC [Everingham et al. 2010] databases of spatially annotated ob-
jects in scenes. The considerable effort that goes into building these
databases has allowed algorithms to learn increasingly effective
classifiers and to compare recognition systems on common bench-
marks. Our pipeline is similar to many modern computer vision al-
gorithms although we are working in a new domain which requires
a new, carefully tailored representation. We also need to generate
our data from scratch because there are no preexisting, large repos-
itories of sketches as there are for images (e.g. Flickr). For this
reason we utilize crowd-sourcing to create a database of human ob-
ject sketches and hope that it will be as useful to the community as
existing databases in other visual domains.
3 A large dataset of human object sketches
In this section we describe the collection of a dataset of 20,000 hu-
man sketches. This categorical database is the basis for all learning,
evaluation, and applications in this paper. We define the following
set of criteria for the object categories in our database:
Exhaustive The categories exhaustively cover most objects that
we commonly encounter in everyday life. We want a broad
taxonomy of object categories in order to make the results in-
teresting and useful in practice and to avoid superficially sim-
plifying the recognition task.
Recognizable The categories are recognizable from their shape
alone and do not require context for recognition.
Specific Finally, the categories are specific enough to have rela-
tively few visual manifestations. ‘Animal’ or ‘musical instru-
ment’ would not be good object categories as they have many
subcategories.
3.1 Defining a taxonomy of 250 object categories
In order to identify common objects, we start by extracting the
1,000 most frequent labels from the LabelMe [Russell et al. 2008]
dataset. We manually remove duplicates (e.g. car side vs. car front)
as well as labels that do not follow our criteria. This gives us an
initial set of categories. We augment this with categories from the
Princeton Shape Benchmark [Shilane et al. 2004] and the Caltech
256 dataset [Griffin et al. 2007]. Finally, we add categories by ask-
ing members of our lab to suggest object categories that are not
yet in the list. Our current set of 250 categories is quite exhaus-

Figure 3: Stroke length in sketches over drawing time (median and 10th to 90th percentile): initial strokes are significantly longer than later in the sketching process. On the x-axis, time is normalized for each sketch.
tive as we find it increasingly difficult to come up with additional
categories that adhere to the desiderata outlined above.
3.2 Collecting 20,000 sketches
We ask participants to draw one sketch at a time given a random
category name. For each sketch, participants start with an empty
canvas and have up to 30 minutes to create the final version of the
sketch. We keep our instructions as simple as possible and ask par-
ticipants to “sketch an image [...] that is clearly recognizable to
other humans as belonging to the following category: [...]”. We
also ask users to draw outlines only and not use context around the
actual object. We provide visual examples that illustrate these re-
quirements, see Fig. 2. We provide undo, redo, clear, and delete
buttons for our stroke-based sketching canvas so that participants
can easily familiarize themselves with the tool while drawing their
first sketch. After finishing a sketch participants can move on and
draw another sketch given a new category. In addition to the spatial
parameters of the sketched strokes we store their temporal order.
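A minimal, hypothetical record structure for such stroke data (our own illustration; the names and encoding here are assumptions, not the format of the released dataset) could look like this:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Stroke:
    # (x, y) canvas coordinates in drawing order; timestamps in ms since sketch start.
    points: List[Tuple[float, float]]
    timestamps: List[float]

@dataclass
class Sketch:
    category: str          # e.g. 'teapot'
    strokes: List[Stroke]  # stored in the temporal order they were drawn

# Example: a two-stroke toy sketch for the category 'ladder'.
toy = Sketch(
    category='ladder',
    strokes=[
        Stroke(points=[(10, 10), (10, 200)], timestamps=[0, 350]),
        Stroke(points=[(60, 10), (60, 200)], timestamps=[900, 1300]),
    ],
)
```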
Crowd-sourcing Collecting 20,000 sketches requires a huge
number of participants so we rely on crowd-sourcing to generate
our dataset. We use Amazon Mechanical Turk (AMT) which is a
web-based market where requesters can offer paid “Human Intelli-
gence Tasks” (HITs) to a pool of non-expert workers. We submit
90 × 250 = 22,500 HITs, requesting 90 sketches for each of the
250 categories. In order to ensure a diverse set of sketches within
each category, we limit the number of sketches a worker could draw
to one per category.
In total, we receive sketches from 1,350 unique participants who
spent a total of 741 hours to draw all sketches. The median drawing
time per sketch is 86 seconds with the 10th and 90th percentile at 31 and 280 seconds, respectively. The participants draw a total of
351,060 strokes with each sketch containing a median number of 13
strokes. We find that the first few strokes of a sketch are on average
considerably longer than the remaining ones, see Fig. 3. This sug-
gests that humans tend to follow a coarse-to-fine drawing strategy,
first outlining the shape using longer strokes and then adding detail
at the end of the sketching process.
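As an illustration of the kind of analysis behind Fig. 3, the following sketch computes a median stroke length per normalized-time bin, reusing the hypothetical Stroke/Sketch records from above; the binning scheme and the use of each stroke's start time are our own simplifications, not the authors' code:

```python
import numpy as np

def stroke_length(stroke: Stroke) -> float:
    """Polyline length of a stroke: sum of distances between consecutive points."""
    pts = np.asarray(stroke.points, dtype=float)
    return float(np.linalg.norm(np.diff(pts, axis=0), axis=1).sum())

def length_over_time(sketches, num_bins=20):
    """Median stroke length per normalized-time bin, pooled over all sketches."""
    bins = [[] for _ in range(num_bins)]
    for sketch in sketches:
        if not sketch.strokes:
            continue
        t_end = max(s.timestamps[-1] for s in sketch.strokes) or 1.0
        for stroke in sketch.strokes:
            # Normalized start time of the stroke in [0, 1).
            t = min(stroke.timestamps[0] / t_end, 1.0 - 1e-9)
            bins[int(t * num_bins)].append(stroke_length(stroke))
    return [float(np.median(b)) if b else float('nan') for b in bins]

print(length_over_time([toy]))  # uses the toy example defined above
```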
Our sketching task appears to be quite popular on Mechanical Turk: we received a great deal of positive feedback, the time to complete
all HITs was low, and very few sketches were unusable, adversar-
ial, or automated responses. Still, as with any crowd-sourced data
collection effort, steps must be taken to ensure that data collected
Figure 4: Scatter plot of per-worker sketch recognition performance. Solid (green) dots represent single, unique workers and give their average classification accuracy (y-axis) at the number of sketches they classified (x-axis). Outlined (blue) dots represent the overall average accuracy of all workers that have worked on more than the number of sketches indicated on the x-axis.
from non-expert, untrained users is of sufficient quality.
Data verification We manually inspect and clean the complete
dataset using a simple interactive tool we implemented for this pur-
pose. The tool displays all sketches of a given category on a large
screen which lets us identify incorrect ones at a glance. We remove
sketches that are clearly in the wrong category (e.g. an airplane in
the teapot category), contain offensive content or otherwise do not
follow our requirements (typically excessive context). We do not
remove sketches just because they are poorly drawn. As a result of
this procedure, we remove about 6.3% of the sketches. We truncate
the dataset to contain exactly 80 sketches per category yielding our
final dataset of 20,000 sketches. We make the categories uniformly
sized to simplify training and testing (e.g. we avoid the need to cor-
rect for bias toward the larger classes when learning a classifier).
4 Human sketch recognition
In this section we analyze human sketch recognition performance
and use the dataset gathered in Sec. 3 for this task. Our basic ques-
tions are the following: given a sketch, what is the accuracy with
which humans correctly identify its category? Are there categories
that are easier/more difficult to discern for humans? To provide an-
swers to those questions, we perform a second large-scale, crowd-
sourced study (again using Amazon Mechanical Turk) in which we
ask participants to identify the category of query sketches. This
test provides us with an important human baseline which we later
compare against our computational recognition method. We invite
the reader to try this test on the sketches shown in Fig. 1: can you
correctly identify all categories and solve the riddle hidden in this
figure?
4.1 Methodology
Given a random sketch, we ask participants to select the best fitting
category from the set of 250 object categories. We give workers
unlimited time, although workers are naturally incentivized to work
quickly for greater pay. To avoid the frustration of scrolling through
a list of 250 categories for each query, we roughly follow Xiao et
al. [2010] and organize the categories in an intuitive 3-level hierar-
chy, containing 6 top-level and 27 second-level categories such as
‘animals’, ‘buildings’ and ‘musical instruments’.

Figure 5: Representative sketches from the 18 categories with the highest human category recognition rate (shown in the bottom right corner of each field, in percent), ranging from ‘t-shirt’ (100%) over ‘snake’, ‘comb’ and ‘flower’ (99%) down to ‘chair’ and ‘key’ (95%).
We submit a total of 5,000 HITs to Mechanical Turk, each requir-
ing workers to sequentially identify four sketches from random cat-
egories. This gives us one human classification result for each of
the 20,000 sketches. We include several sketches per HIT to pre-
vent workers from skipping tasks that contain ‘difficult’ sketches (based on AMT preview functions) as this would artificially inflate
the accuracy of certain workers. In order to measure performance
from many participants, we limit the maximum paid HITs to 100
per worker, i.e. 400 sketches (however, three workers did 101, 101
and 119 HITs, respectively, see Fig. 4).
4.2 Human classification results
Humans recognize on average 73.1% of all sketches cor-
rectly. We observe a large variance over the categories: while all
participants correctly identified all instances of ‘t-shirt’, the ‘seag-
ull’ category was only recognized 2.5% of the time. We visu-
alize the categories with highest human recognition performance
in Fig. 5 and those with lowest performance in Fig. 6 (along with the
most confusing categories). Human errors are usually confusions
between semantically similar categories (e.g. ‘panda’ and ‘bear’),
although geometric similarity accounts for some errors (e.g. ‘tire’
and ‘donut’).
If we assume that it takes participants a while to learn our taxon-
omy and hierarchy of objects, we would expect that workers who
have done more HITs are more accurate. However, this effect is
not very pronounced in our experiments: if we remove all results from workers that have classified fewer than 40 sketches, the accuracy of
the remaining workers rises slightly to 73.9%. This suggests that
there are no strong learning effects for this task. We visualize the
accuracy of each single worker as well as the overall accuracy when
gradually removing workers that have classified less than a certain
number of sketches in Fig. 4.
Figure 6: Top row: the six most difficult classes for human recognition (‘seagull’ 2.5%, ‘panda’ 11%, ‘armchair’ 13%, ‘tire’ 21%, ‘ashtray’ 24%, ‘snowboard’ 25%). E.g., only 2.5% of all seagull sketches are correctly identified as such by humans. Instead, humans often mistake sketches belonging to the classes shown in the rows below as seagulls: out of all sketches confused with seagull, 47% belong to flying bird, 24% to standing bird and 14% to pigeon. The remaining 15% (not shown in this figure) are distributed over various other classes.

5 Sketch representation

In our setting, we consider a sketch $S$ as a bitmap image, with $S \in \mathbb{R}^{m \times n}$. While the dataset from Sec. 3 is inherently vector-valued, a bitmap-based approach is more general: it lets us readily analyze any existing sketches even if they are only available in a bitmap representation.

An ideal sketch representation for our purposes is invariant to irrelevant features (e.g. scale, translation), discriminative between categories, and compact. More formally, we are looking for a feature space transform that maps a sketch bitmap to a lower-dimensional representation $x \in \mathbb{R}^d$, i.e. a mapping $f : \mathbb{R}^{m \times n} \rightarrow \mathbb{R}^d$ with (typically) $d \ll m \times n$. In the remainder of this paper we call this
representation either a feature vector or a descriptor. Ideally, the
mapping f preserves the information necessary for x to be distin-
guished from all sketches in other categories.
Probably the most direct way to define a feature space for sketches
is to directly use the (possibly down-scaled) bitmap representation itself.
Such representations work poorly in our experiments. Instead we
adopt methods from computer vision and represent sketches using
local feature vectors that encode distributions of image properties.
Specifically, we encode the distribution of line orientation within
a small local region of a sketch. The binning process during con-
struction of the distribution histograms facilitates better invariance
to slight offsets in orientation compared to directly encoding pixel
values.
In the remainder of this paper, we use the following notational con-
ventions: k denotes a scalar value, h a column vector, S a matrix
and V a set.
5.1 Extracting local features
First, we achieve global scale and translation invariance by isotropi-
cally rescaling each sketch such that the longest side of its bounding
box has a fixed length and each sketch is centered in a 256 × 256
image.
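One way to implement this normalization step, assuming the sketch is given as a binary bitmap with non-zero stroke pixels (the margin parameter and the use of scipy's zoom are our own choices, not the authors' implementation):

```python
import numpy as np
from scipy.ndimage import zoom

def normalize_sketch(bitmap: np.ndarray, out_size: int = 256, margin: int = 4) -> np.ndarray:
    """Isotropically rescale a binary sketch so the longest side of its bounding box
    has a fixed length, then center it on an out_size x out_size canvas."""
    ys, xs = np.nonzero(bitmap)                     # foreground (stroke) pixels
    if len(xs) == 0:
        return np.zeros((out_size, out_size), dtype=float)
    crop = (bitmap[ys.min():ys.max() + 1, xs.min():xs.max() + 1] > 0).astype(float)
    scale = (out_size - 2 * margin) / max(crop.shape)
    resized = zoom(crop, scale, order=1)            # bilinear, isotropic scaling
    resized = resized[:out_size, :out_size]         # guard against rounding overshoot
    out = np.zeros((out_size, out_size), dtype=float)
    y0 = (out_size - resized.shape[0]) // 2
    x0 = (out_size - resized.shape[1]) // 2
    out[y0:y0 + resized.shape[0], x0:x0 + resized.shape[1]] = resized
    return out
```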
We build upon a bag-of-features representation [Sivic and Zisserman 2003] as an intermediate step to define the mapping $f$. We represent a sketch as a large number of local features that encode local orientation estimates (but do not carry any information about their spatial location in the sketch). We write $g_{uv} = \nabla S$ for the gradient of $S$ at coordinate $(u, v)$ and $o_{uv} \in [0, \pi)$ for its orientation. We compute gradients using Gaussian derivatives to achieve reliable orientation estimates. We coarsely bin $\|g\|$ into $r$ orientation bins according to $o$, linearly interpolating into the neighboring bins to avoid sharp energy changes at bin borders. This gives us $r$ orientational response images $O$, each encoding the fraction of orientational energy at a given discrete orientation value. We find that using $r = 4$ orientation bins works well.
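A possible implementation of this step is sketched below, assuming grayscale sketch bitmaps; the Gaussian scale sigma is an assumption on our part, as the paper does not specify it here:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def orientation_responses(S: np.ndarray, r: int = 4, sigma: float = 2.0) -> np.ndarray:
    """Split gradient magnitude into r orientation response images with linear
    interpolation between neighboring orientation bins (orientations in [0, pi))."""
    S = S.astype(float)
    gy = gaussian_filter(S, sigma, order=(1, 0))   # Gaussian derivative along y
    gx = gaussian_filter(S, sigma, order=(0, 1))   # Gaussian derivative along x
    mag = np.hypot(gx, gy)                         # gradient magnitude ||g||
    ori = np.mod(np.arctan2(gy, gx), np.pi)        # orientation o in [0, pi)
    O = np.zeros((r,) + S.shape)
    pos = ori / np.pi * r                          # continuous orientation-bin position
    lo = np.floor(pos).astype(int) % r
    hi = (lo + 1) % r                              # orientation is circular on [0, pi)
    w_hi = pos - np.floor(pos)
    # Distribute each pixel's gradient energy between the two nearest bins.
    for b in range(r):
        O[b] += mag * (lo == b) * (1.0 - w_hi)
        O[b] += mag * (hi == b) * w_hi
    return O
```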
Figure 7: 1d example of fast histogram construction. The total energy in histogram bin $i$ corresponds to $(f \star t)(x_i)$ where $x_i$ is the center of bin $i$.

For each response image, we extract a local descriptor $l_j$ by binning the underlying orientational response values into a small, local histogram using 4 × 4 spatial bins, again linearly interpolating into neighboring bins. We build our final, local patch descriptor by stacking the orientational descriptors into a single column vector $d = [l_1, \dots, l_r]^T$. We normalize each local patch descriptor such that $\|d\|_2 = 1$. This results in a representation that is closely related to the one used for SIFT [Lowe 2004] but stores orientations only [Eitz et al. 2011a].
While in computer vision applications the size of local patches used
to analyze photographs is often quite small (e.g. 16 × 16 pix-
els [Lazebnik et al. 2006]), sketches contain little information at
that scale and larger patch sizes are required for an effective rep-
resentation. In our case, we use local patches covering an area of
12.5% of the size of S. We use 28 × 28 = 784 regularly spaced
sample points in S and extract a local feature for each point. The
resulting representation is a so-called bag-of-features $D = \{d_i\}$.
Due to the relatively large patch size we use, the regions covered
by the local features significantly overlap and each single pixel gets
binned into about 100 distinct histograms. As this requires many
image/histogram accesses, this operation can be quite slow. We
speed up building the local descriptors by observing that the total
energy accumulated in a single spatial histogram bin (using linear
interpolation) is proportional to the convolution of the local image
area with a 2d tent function having an extent of two times bin-
width. We illustrate this property for the 1d case in Fig. 7. As a
consequence, before creating spatial histograms, we first convolve
the response images $O_{1 \dots r}$ with the corresponding function (which we in turn speed up using the FFT). Filling a histogram bin is now
reduced to a single lookup of the response at the center of each bin.
This lets us efficiently extract a large number of local histograms
which will be an important property later in this paper.
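Putting the pieces together, a simplified implementation of the descriptor extraction using this convolution trick could look as follows; the exact tent construction, border handling and sampling grid are our own simplifications of the method described above:

```python
import numpy as np
from scipy.signal import fftconvolve

def tent_kernel(width: int) -> np.ndarray:
    """2d tent (triangle) function with an extent of two times the spatial bin width."""
    t = 1.0 - np.abs(np.linspace(-1, 1, 2 * width + 1))
    return np.outer(t, t)

def local_descriptors(O: np.ndarray, num_samples: int = 28, spatial_bins: int = 4,
                      patch_frac: float = 0.125) -> np.ndarray:
    """Extract local descriptors from orientation response images O (shape r x H x W).
    Each descriptor stacks spatial_bins^2 bins per orientation and is L2-normalized."""
    r, H, W = O.shape
    patch = int(np.sqrt(patch_frac * H * W))        # patch side covering ~12.5% of the image
    bin_width = max(1, patch // spatial_bins)
    kernel = tent_kernel(bin_width)
    # Convolving each response image once lets us fill a histogram bin with one lookup.
    conv = np.stack([fftconvolve(O[i], kernel, mode='same') for i in range(r)])
    centers_y = np.linspace(patch // 2, H - patch // 2 - 1, num_samples).astype(int)
    centers_x = np.linspace(patch // 2, W - patch // 2 - 1, num_samples).astype(int)
    offsets = (np.arange(spatial_bins) - (spatial_bins - 1) / 2.0) * bin_width
    descriptors = []
    for cy in centers_y:
        for cx in centers_x:
            d = []
            for i in range(r):
                for oy in offsets:
                    for ox in offsets:
                        y = int(np.clip(cy + oy, 0, H - 1))
                        x = int(np.clip(cx + ox, 0, W - 1))
                        d.append(conv[i, y, x])     # response at the spatial bin center
            d = np.asarray(d)
            n = np.linalg.norm(d)
            descriptors.append(d / n if n > 0 else d)
    return np.asarray(descriptors)                  # shape: (num_samples^2, r * spatial_bins^2)
```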
At this point, a sketch is represented as a so-called bag-of-features,
containing a large number of local, 64-dimensional feature vec-
tors (4 × 4 spatial bins and 4 orientational bins). In the following,
we build a more compact representation by quantizing each feature
against a visual vocabulary [Sivic and Zisserman 2003].
5.2 Building a visual vocabulary
Using a training set of $n$ local descriptors $d$ randomly sampled from our dataset of sketches, we construct a visual vocabulary using k-means clustering, which partitions the descriptors into $k$ disjoint clusters $C_i$. More specifically, we define our visual vocabulary $V$ to be the set of vectors $\{\mu_i\}$ resulting from minimizing

$$V = \arg\min_{\{\mu_i\}} \sum_{i=1}^{k} \sum_{d_j \in C_i} \|d_j - \mu_i\|^2 \qquad (1)$$

with

$$\mu_i = \frac{1}{|C_i|} \sum_{d_j \in C_i} d_j.$$
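Any standard k-means implementation can be used for this step; a minimal sketch with scikit-learn is shown below, where the vocabulary size k = 500 is an illustrative choice rather than the paper's reported setting:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_vocabulary(descriptors: np.ndarray, k: int = 500, seed: int = 0) -> np.ndarray:
    """Cluster a sample of local descriptors into k visual words (the cluster means)."""
    km = MiniBatchKMeans(n_clusters=k, random_state=seed, n_init=3)
    km.fit(descriptors)
    return km.cluster_centers_      # the set of vectors {mu_i} minimizing Eq. (1)

# Usage: stack local descriptors sampled from many sketches, then
# vocab = build_vocabulary(np.vstack(all_descriptor_sets), k=500)
```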
5.3 Quantizing features
We represent a sketch as a frequency histogram of visual words
h; this is the final representation we use throughout the paper.
As a baseline we use a standard histogram of visual words using a
‘hard’ assignment of local features to visual words [Sivic and Zis-
serman 2003]. We compare this to using ‘soft’ kernel-codebook
coding [Philbin et al. 2008] for constructing the histograms. The
idea behind kernel codebook coding is that a feature vector may be
equally close to multiple visual words but this information cannot
be captured in the case of hard assignment. Instead we use a kernel-
ized distance between descriptors that encodes weighted distances
to all visual words with a rapid falloff for distant visual words.
More specifically, we define our histogram h as:
$$h(D) = \frac{1}{|D|} \sum_{d_i \in D} q(d_i) / \|q(d_i)\|_1 \qquad (2)$$

where $q(d)$ is a vector-valued quantization function that quantizes a local descriptor $d$ against the visual vocabulary $V$:

$$q(d) = [K(d, \mu_1), \dots, K(d, \mu_k)]^T.$$

We use a Gaussian kernel to measure distances between samples, i.e.

$$K(d, \mu) = \exp(-\|d - \mu\|^2 / (2\sigma^2)).$$
Note that in Eqn. (2) we normalize h by the number of samples
to get to our final representation. Thus our representation is not
sensitive to the total number of local features in a sketch, but rather
to local structure and orientation of lines. We use σ = 0.1 in our
experiments.
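The soft quantization of Eq. (2) translates directly into code; the following sketch also includes the hard-assignment baseline for comparison (the row-wise shift before exponentiating is purely for numerical stability and cancels in the l1-normalization):

```python
import numpy as np

def soft_histogram(descriptors: np.ndarray, vocab: np.ndarray, sigma: float = 0.1) -> np.ndarray:
    """Kernel-codebook histogram h(D), Eq. (2): each local descriptor votes for all
    visual words with Gaussian weights, l1-normalized per descriptor, then averaged."""
    # Squared distances between every descriptor and every visual word.
    d2 = ((descriptors[:, None, :] - vocab[None, :, :]) ** 2).sum(axis=2)
    # Subtracting the per-row minimum avoids underflow and cancels in the normalization.
    q = np.exp(-(d2 - d2.min(axis=1, keepdims=True)) / (2.0 * sigma ** 2))
    q /= q.sum(axis=1, keepdims=True)               # l1-normalize each q(d_i)
    return q.mean(axis=0)                           # average over the bag -> histogram h

def hard_histogram(descriptors: np.ndarray, vocab: np.ndarray) -> np.ndarray:
    """Baseline: each descriptor is assigned to its single nearest visual word."""
    d2 = ((descriptors[:, None, :] - vocab[None, :, :]) ** 2).sum(axis=2)
    counts = np.bincount(d2.argmin(axis=1), minlength=len(vocab)).astype(float)
    return counts / counts.sum()
```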
6 Unsupervised dataset analysis
In this section we perform an automatic, unsupervised analysis of
our sketch dataset making use of the feature space we developed in
the previous section (each sketch is represented as a histogram of
visual words h). We would like to provide answers to the following
questions:
• What is the distribution of sketches in the proposed feature space? Ideally, we would find clusters of sketches in this space that clearly represent our categories, i.e. we would hope to find that features within a category are close to each other while having large distances to all other features.

• Can we identify iconic sketches that are good representatives of a category?

• Can we visualize the distribution of sketches in our feature space? This would help build an intuition about the representative power of the feature transform developed in Sec. 5.
Our feature space is sparsely populated (only 20,000 points in a
high-dimensional space). This makes clustering in this space a dif-
ficult problem. Efficient methods such as k-means clustering do not
give meaningful clusters as they use rigid, simple distance metrics
and require us to define the number of clusters beforehand. Instead,
we use variable-bandwidth mean-shift clustering with locality sen-
sitive hashing to speed up the underlying nearest-neighbor search
problem [Georgescu et al. 2003]. Adaptive mean-shift estimates a
density function in feature space for each histogram $h$ as:

$$f(h) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{b_i^d} K\left( \|h - h_i\|^2 / b_i^2 \right),$$

where $b_i$ is the bandwidth associated with each point and $K$ is again a Gaussian kernel. Given this definition, we compute the modes, i.e. the local maxima of the estimated density.
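For illustration, a small numpy sketch of variable-bandwidth mean-shift mode seeking with a Gaussian kernel is given below; the bandwidth heuristic (distance to the k-th nearest neighbor), the convergence test and the omission of the locality sensitive hashing speedup are our own simplifications, not the method of Georgescu et al.:

```python
import numpy as np

def adaptive_mean_shift(H: np.ndarray, k: int = 10, iters: int = 50, tol: float = 1e-5):
    """Move every point uphill on the estimated density until convergence.
    H: (n, d) array of feature histograms. Returns the mode each point converged to.
    Note: builds an O(n^2) distance matrix, so it is meant for small subsets only."""
    n = H.shape[0]
    d2_all = ((H[:, None, :] - H[None, :, :]) ** 2).sum(axis=2)
    # Per-point bandwidth b_i: distance to the k-th nearest neighbor (a common heuristic).
    b = np.sqrt(np.sort(d2_all, axis=1)[:, min(k, n - 1)]) + 1e-12
    modes = H.astype(float).copy()
    for _ in range(iters):
        d2 = ((modes[:, None, :] - H[None, :, :]) ** 2).sum(axis=2)
        w = np.exp(-d2 / (2.0 * b[None, :] ** 2))    # Gaussian kernel with bandwidth b_i
        new_modes = (w @ H) / w.sum(axis=1, keepdims=True)
        if np.max(np.abs(new_modes - modes)) < tol:
            modes = new_modes
            break
        modes = new_modes
    return modes   # points whose modes (nearly) coincide belong to the same cluster
```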

References

EVERINGHAM, M., VAN GOOL, L., WILLIAMS, C. K. I., WINN, J., AND ZISSERMAN, A. 2010. The Pascal Visual Object Classes (VOC) Challenge. International Journal of Computer Vision 88, 2, 303–338.

LAZEBNIK, S., SCHMID, C., AND PONCE, J. 2006. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proc. IEEE CVPR.

LOWE, D. G. 2004. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60, 2, 91–110.

SIVIC, J., AND ZISSERMAN, A. 2003. Video Google: A text retrieval approach to object matching in videos. In Proc. IEEE ICCV.

VAN DER MAATEN, L., AND HINTON, G. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9, 2579–2605.