
Sketch Me That Shoe
Qian Yu¹   Feng Liu¹,²   Yi-Zhe Song¹   Tao Xiang¹   Timothy M. Hospedales¹   Chen Change Loy³
¹ Queen Mary University of London, London, UK
² Southeast University, Nanjing, China
³ The Chinese University of Hong Kong, Hong Kong, China
{q.yu, feng.liu, yizhe.song, t.xiang, t.hospedales}@qmul.ac.uk   ccloy@ie.cuhk.edu.hk
Abstract
We investigate the problem of fine-grained sketch-based image retrieval (SBIR), where free-hand human sketches are used as queries to perform instance-level retrieval of images. This is an extremely challenging task because (i) visual comparisons not only need to be fine-grained but also executed cross-domain, (ii) free-hand (finger) sketches are highly abstract, making fine-grained matching harder, and most importantly (iii) annotated cross-domain sketch-photo datasets required for training are scarce, challenging many state-of-the-art machine learning techniques.
In this paper, for the first time, we address all these challenges, providing a step towards the capabilities that would underpin a commercial sketch-based image retrieval application. We introduce a new database of 1,432 sketch-photo pairs from two categories with 32,000 fine-grained triplet ranking annotations. We then develop a deep triplet-ranking model for instance-level SBIR with a novel data augmentation and staged pre-training strategy to alleviate the issue of insufficient fine-grained training data. Extensive experiments are carried out to contribute a variety of insights into the challenges of data sufficiency and over-fitting avoidance when training deep networks for fine-grained cross-domain ranking tasks.
1. Introduction
Notwithstanding the proliferation of touch-screen devices, mainstream image retrieval paradigms at present are still limited to having text or an exemplar image as input. Only very recently has sketch-based image retrieval (SBIR) started to return as a practical form of retrieval. Compared with text, sketches are incredibly intuitive to humans and have been used since prehistoric times to conceptualise and depict visual objects [20, 15]. A unique characteristic of sketches in the context of image retrieval is that they offer inherently fine-grained visual descriptions: a sketch speaks for a 'hundred' words.
Figure 1. Free-hand sketch is ideal for fine-grained instance-level image retrieval.
However, existing SBIR works largely overlook such fine-grained details, and mainly focus on retrieving images of the same category [21, 22, 10, 2, 3, 27, 12, 19, 13, 28, 11], thus not exploiting the real strength of SBIR. This oversight pre-emptively limits the practical use of SBIR since text is often a simpler form of input when only category-level retrieval is required; e.g., one would rather type in the word "shoe" to retrieve one than sketch a shoe, and existing commercial image search engines already do a reasonably good job at category-level image retrieval. In contrast, it is when aiming to retrieve a particular shoe that sketching may be preferable to elucidating a long textual description of it. Figure 1 illustrates an application scenario of using a free-hand sketch for fine-grained image search: a person walking on a street notices that another person walking towards him/her wears a pair of shoes that he/she desperately wants to buy; instead of taking a picture of it, which would be rude, he/she takes out a smartphone and draws a sketch of it using fingers; all the information required to have that pair of shoes is then just one click away.
In this paper, for the first time, the problem of fine-grained instance-level SBIR using free-hand sketches drawn by amateurs on a touch-screen device is studied. This is an extremely challenging problem. Some of the challenges faced are shared with the category-level SBIR task: sketches and photos are from inherently heterogeneous domains (sparse black-and-white line drawings versus dense colour pixels), and free-hand (finger) sketches are often very abstract compared with photos (a person can be drawn as a stick-man). In addition, it has its unique scientific challenges: (i) Fine-grained instance-level retrieval requires a mechanism to capture the fine-grained (dis)similarities of sketches and photo images across the domains. (ii) Collecting and annotating a fine-grained SBIR dataset is much harder than category-level ones. As a result, no large-scale dataset exists for researchers to develop solutions.
We address all these challenges by contributing two large-scale datasets and developing a model for fine-grained instance-level SBIR. For the datasets, we introduce two instance-level SBIR datasets consisting of 1,432 sketch-photo pairs in two categories (shoes and chairs), collected by asking participants to finger-sketch an object after observing a photo. A total of 32,000 ground-truth triplet ranking annotations are provided for both model development and performance evaluation. For the model, we take a deep learning approach to better bridge this large domain gap by learning, rather than engineering [11, 23], free-hand sketch/photo invariant features. Our model is a Siamese network with a triplet ranking objective. However, such a network with three branches naively requires a prohibitive $O(N^3)$ number of annotations, given that CNN models already require a large number of data instances $N$. Despite the large number of annotations provided in our datasets, they are still insufficient to effectively train a deep triplet ranking network for instance-level SBIR. We thus introduce and evaluate various novel strategies, including sketch-specific data augmentation and staged pre-training using auxiliary data sources, to deal with the data insufficiency problem.
Our contributions are as follows: (1) For the first time, the problem of fine-grained instance-level SBIR using free-hand sketches is addressed. (2) We contribute two new fine-grained SBIR datasets with extensive ground-truth annotations, in the hope that they will kick-start research efforts on solving this challenging problem. (3) We formulate a deep triplet ranking model with staged pre-training using various auxiliary data sources including sketches, photos, and sketch-photo category-level pairs. (4) Extensive experiments are conducted to provide insights on how a deep learning model for fine-grained SBIR can benefit from novel sketch-specific data augmentation and various pre-training and sampling strategies to tackle the challenges of a big domain gap and the lack of sufficient training data.
2. Related Work
Category-level and fine-grained SBIR Most existing SBIR works [21, 22, 10, 2, 3, 27, 12, 19, 13, 28, 11] focus on category-level sketch-to-photo retrieval. A bag-of-words (BOW) representation combined with some form of edge detection on the photo images is often employed to bridge the domain gap. The only previous work that attempted to address the fine-grained SBIR problem is that of [16], which is based on the deformable part-based model (DPM) and graph matching. However, their definition of fine-grained is very different from ours: a sketch is considered to be a match to a photo if the objects depicted look similar, i.e., have the same viewpoint, pose and zoom parameters; in other words, they do not have to contain the same object instance. In addition, these hand-crafted-feature-based approaches are inadequate in bridging the domain gap as well as in capturing the subtle intra-category and inter-instance differences, as demonstrated in our experiments.
Other SBIR works, such as Sketch2Photo [4] and AverageExplorer [34], use sketches in addition to text or colour cues for image retrieval. [34] further investigates an interactive process, in which each user 'edit' indicates the traits to focus on for refining retrieval. For now we focus on non-interactive black-and-white sketch-based retrieval, and leave these extensions to future work. Another data-driven method [25] performs well in cross-domain image matching by learning the 'uniqueness' of the query. However, [25] is prohibitively slow, limiting its usability for practical interactive image retrieval; it is thus excluded as a baseline.
SBIR Datasets One of the key barriers to fine-grained SBIR research is the lack of benchmark datasets. There are free-hand sketch datasets, the most commonly used being the TU-Berlin 20,000-sketch dataset [7]; there are also many photo datasets such as PASCAL VOC [8] and ImageNet [6]. Therefore, with few exceptions [22, 11], most existing SBIR datasets were created by combining overlapping categories of sketches and photos, which means only category-level SBIR is possible. The 'semi'-fine-grained dataset in [16] was created by selecting similar-looking sketch-photo pairs from the TU-Berlin and PASCAL VOC datasets. For each of its 14 categories, there are only 6 sketches and 60 images, which is much smaller than our datasets and too small to apply state-of-the-art deep learning techniques. For specific domains such as faces, large-scale datasets exist, such as the CUHK Face Sketches [30]. However, our sketches were drawn by amateurs on touch-screen devices, instead of by artists using pen and paper. Importantly, besides sketch-photo pairs, we provide a large number of triplet ranking annotations, i.e., given a sketch, ranking which of two photos is more similar, making our datasets suitable for more thorough evaluation as well as for developing more advanced retrieval models.
Related Deep Learning Models Deep neural networks, particularly deep convolutional neural networks (CNNs) [14], have achieved great success in various visual recognition tasks. A CNN model, 'Sketch-a-Net', was developed specifically for sketch recognition in [32], and achieves the state-of-the-art recognition performance to date on TU-Berlin [7]. In our fine-grained SBIR model, we use Sketch-a-Net as the basic network architecture in each branch of a triplet ranking Siamese network [9]. However, we introduce two new modifications to improve Sketch-a-Net: a pre-training step using edge maps extracted from ImageNet and a new sketch-specific data augmentation scheme. Our staged pre-training and sampling strategies are similar in spirit to those used in fine-grained image-to-image retrieval work [29, 1], which is also based on a triplet Siamese network, but with the vital difference that our task is cross-domain. For cross-domain modelling, there are two recent works worth mentioning: the ground-to-aerial image matching work in [18] and the sketch-to-3D-shape retrieval work in [28]. The former uses a two-branch Siamese network; we show in our experiments that using a triplet ranking Siamese network is advantageous in that it can better capture subtle inter-instance differences. The latter uses a variant of the Siamese network where each branch has a different architecture; we show that without tying the branches, i.e., without being strictly Siamese, the model is weaker in bridging the semantic gap between the two domains and more likely to over-fit.
3. Fine-Grained Instance-Level SBIR Datasets
We contribute two datasets, one for shoes and the other for chairs¹. There are 1,432 sketches and photos in total, or 716 sketch-photo pairs. The shoe dataset has 419 sketch-photo pairs, and the chair dataset 297 pairs. Figure 2 shows some examples. In each column, we display several similar samples, indicating the fine details that are required to differentiate specific shoes/chairs, as well as the challenge of doing so based on realistic free-hand sketches. We next detail the data collection and annotation process.
¹ Both datasets can be downloaded from http://sketchx.eecs.qmul.ac.uk/downloads.html
3.1. Data Collection
Collecting Photo Images Because our dataset is for fine-grained retrieval, the photo images should cover the variability of the corresponding object category. When collecting the shoe photo images, we selected 419 representative images from UT-Zap50K [31] covering shoes of different types including boots, high-heels, ballerinas, and formal and informal shoes. When collecting chairs, we searched three online shopping websites, including IKEA, Amazon and Taobao, and selected chair product images of varying types and styles. The final selection consists of 297 images which are representative and cover different kinds of chairs, including office chairs, couches, kids' chairs, desk chairs, etc.
Collecting Sketches The second step is to use the collected images to generate corresponding sketches. We recruited 22 volunteers to sketch the images. We showed one shoe/chair image to a volunteer on a tablet for 15 seconds, then displayed a blank canvas and let the volunteer sketch the object he/she just saw using their fingers on the tablet. None of the volunteers had any art training; they are thus representative of the general population who might use the developed SBIR system. As a result, the collected sketches are nowhere near perfect (see Fig. 2), making subsequent SBIR using these sketches challenging.

Figure 2. Examples of the (a) shoe and (b) chair datasets.
3.2. Data Annotation
Our goal is to find the most similar photos to a query sketch. The photo-sketch pair correspondence already provides some annotation that could be used to train a pairwise verification model [5]. However, for fine-grained analysis it is possible to learn a stronger model if we have a detailed ranking of the similarity of each candidate image to a given query sketch. Yet asking a human annotator to rank all 419 shoe photos given a query shoe sketch would be an error-prone task: humans are bad at ranking long lists, but better at individual forced-choice judgements. Therefore, instead of global ranking, a much more manageable triplet ranking task is designed for the annotators. Specifically, each triplet consists of one query sketch and two candidate photos; the task is to determine which of the two candidate photos is more similar to the query sketch. However, exhaustively annotating all possible triplets is also out of the question due to the extremely large number of possible triplets. We therefore selected only a subset of the triplets and obtained the annotations through the following three steps:

1. Attribute Annotation: We first defined an ontology of attributes for shoes and chairs based on existing UT-Zap50K attributes [31] and product tags on online shopping websites. We selected 21 and 15 binary attributes for shoes and chairs respectively. 60 volunteers helped to annotate all 1,432 images with ground-truth attribute vectors.
2. Generating Candidate Photos for each Sketch: Next we selected the 10 most similar candidate images for each sketch, in order to focus our limited amount of gold-standard fine-grained annotation effort. In particular, we combined the attribute vector with a deep feature vector (the fc7-layer features extracted using Sketch-a-Net [32]) and computed the Euclidean distance between each sketch and each image. For each query sketch, we took the 10 closest photo images as candidates for annotation.
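For concreteness, the candidate selection in step 2 can be sketched as follows. This is an illustrative snippet rather than the authors' released code: it assumes the attribute vector and the fc7 feature have simply been concatenated into one row per sketch/photo, and the function name `top_k_candidates` is ours.

```python
import numpy as np

def top_k_candidates(sketch_feats, photo_feats, k=10):
    """For each sketch, return the indices of the k closest photos by Euclidean distance.

    sketch_feats: (S, D) array, one row per sketch (e.g. attribute vector
                  concatenated with an fc7 feature vector).
    photo_feats:  (P, D) array, one row per photo, built the same way.
    """
    # Pairwise squared Euclidean distances, shape (S, P).
    d2 = (
        (sketch_feats ** 2).sum(1, keepdims=True)
        - 2.0 * sketch_feats @ photo_feats.T
        + (photo_feats ** 2).sum(1)
    )
    # Indices of the k smallest distances per sketch, then sorted by distance.
    idx = np.argpartition(d2, k, axis=1)[:, :k]
    row = np.arange(d2.shape[0])[:, None]
    order = np.argsort(d2[row, idx], axis=1)
    return idx[row, order]
```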
3. Triplet Annotation: To provide annotations for the (419 + 297) · 10 · 9/2 ≈ 32,000 triplets generated in the previous step, we recruited 36 volunteers. Each volunteer was presented with one sketch and two photos at a time, and asked to indicate which photo is more similar to the sketch. Each sketch has 10 · 9/2 = 45 triplets, and three people annotated each triplet. We merged the three annotations by majority voting to clean up some human errors. These collected triplet ranking annotations will be used in training our model and provide the ground truth for performance evaluation.
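A minimal sketch of this step, assuming each sketch's 10 candidates are given as a list and each triplet has exactly three recorded votes (the function names are ours, not from the paper):

```python
from itertools import combinations
from collections import Counter

def triplets_for_sketch(candidate_photos):
    """All unordered pairs of a sketch's candidate photos: 10 * 9 / 2 = 45 triplets."""
    return list(combinations(candidate_photos, 2))

def majority_vote(votes):
    """Merge three annotators' choices (each the id of the photo judged more
    similar to the sketch) into a single ground-truth label."""
    return Counter(votes).most_common(1)[0][0]

# Example: two of three annotators preferred photo 'p17' over 'p42'.
# majority_vote(['p17', 'p42', 'p17']) -> 'p17'
```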
4. Methodology
4.1. Overview
For a given query sketch $s$ and a set of $M$ candidate photos $\{p_j\}_{j=1}^{M} \subset P$, we need to compute the similarity between $s$ and $p$ and use it to rank the whole gallery set of photos, in the hope that the true match for the query sketch is ranked at the top. As discussed earlier, this involves two challenges: (i) bridging the domain gap between sketches and photos, and (ii) capturing subtle differences between candidate photos to obtain a fine-grained ranking despite the domain gap and amateur free-hand sketching. To achieve this, we propose to use a deep triplet ranking model to learn a domain-invariant representation $f_\theta(\cdot)$, which enables us to measure the similarity between $s$ and $p \in P$ for retrieval with the Euclidean distance: $D(s, p) = \|f_\theta(s) - f_\theta(p)\|_2^2$.
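Once $f_\theta(\cdot)$ is learned, retrieval reduces to a nearest-neighbour search in the embedding space. A minimal illustration (not the authors' code; it assumes the gallery photo embeddings have been precomputed):

```python
import numpy as np

def rank_gallery(sketch_emb, photo_embs):
    """Rank gallery photos for one query sketch by D(s, p) = ||f(s) - f(p)||_2^2.

    sketch_emb: (d,) embedding of the query sketch, f_theta(s).
    photo_embs: (M, d) embeddings of the M gallery photos, f_theta(p_j).
    Returns gallery indices ordered from most to least similar.
    """
    d2 = ((photo_embs - sketch_emb) ** 2).sum(axis=1)  # squared Euclidean distance to each photo
    return np.argsort(d2)
```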
To learn this representation $f_\theta(\cdot)$ we will use the annotated triplets $\{(s_i, p_i^+, p_i^-)\}_{i=1}^{N}$ as supervision. A triplet ranking model is thus appropriate. Specifically, each triplet consists of a query sketch $s$ and two photos $p^+$ and $p^-$, namely a positive photo and a negative photo, such that the positive one is more similar to the query sketch than the negative one. Our goal is to learn a feature mapping $f_\theta(\cdot)$ that maps photos and sketches to a common feature embedding space, $\mathbb{R}^d$, in which photos similar to particular sketches are closer than dissimilar ones, i.e., the distance between query $s$ and positive $p^+$ is always smaller than the distance between query $s$ and negative $p^-$:

$$D(f_\theta(s), f_\theta(p^+)) < D(f_\theta(s), f_\theta(p^-)). \quad (1)$$

We constrain the embedding to live on the $d$-dimensional hypersphere, i.e., $\|f_\theta(\cdot)\|_2 = 1$.
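One common way to realise this hypersphere constraint is to L2-normalise the output of each branch. A brief illustrative sketch (not the paper's implementation; `branch_net` is a placeholder for any branch CNN, e.g. a Sketch-a-Net-style network ending in a d-dimensional layer):

```python
import torch.nn.functional as F

def embed(branch_net, x):
    """Map a batch of inputs through a branch network and project onto the unit
    hypersphere, so that ||f_theta(.)||_2 = 1 for every embedding."""
    return F.normalize(branch_net(x), p=2, dim=1)
```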
4.2. Triplet Loss
Towards this goal, we formulate a deep triplet ranking model with a ranking loss. The loss is defined using the max-margin framework. For a given triplet $t = (s, p^+, p^-)$, its loss is defined as:

$$L_\theta(t) = \max\big(0,\ \Delta + D(f_\theta(s), f_\theta(p^+)) - D(f_\theta(s), f_\theta(p^-))\big), \quad (2)$$

where $\Delta$ is a margin between the positive-query distance and the negative-query distance. If the two photos are ranked correctly with a margin of distance $\Delta$, then this triplet will not be penalised. Otherwise the loss is a convex approximation of the 0-1 ranking loss, which measures the degree of violation of the desired ranking order specified by the triplet. Overall we optimise the following objective:

$$\min_\theta \sum_{t \in T} L_\theta(t) + \lambda R(\theta), \quad (3)$$

where $T$ is the training set of triplets, $\theta$ are the parameters of the deep model, which defines a mapping $f_\theta(\cdot)$ from the input space to the embedding space, and $R(\cdot)$ is an $\ell_2$ regulariser $\|\theta\|_2^2$. Minimising this loss will narrow the positive-query distance while widening the negative-query distance, and thus learn a representation satisfying the ranking order. With sufficient triplet annotations, the deep model will eventually learn a representation which captures the fine-grained differences between sketches and photos for retrieval.

Even though the new datasets contain thousands of triplet annotations each, they are still far from sufficient to train a deep triplet ranking model with millions of parameters. Next we detail the characteristics of our model, from architecture design and staged model pre-training to sketch-specific data augmentation, all of which are designed to cope with the sparse training data problem.
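A compact sketch of the loss in Eq. (2) on unit-norm embeddings; batch averaging, the margin value and the use of weight decay for the $\ell_2$ regulariser $R(\theta)$ are implementation choices assumed here, not details given in this section:

```python
import torch

def triplet_ranking_loss(s_emb, pos_emb, neg_emb, margin=0.1):
    """Eq. (2): max(0, margin + D(s, p+) - D(s, p-)), averaged over a batch of
    triplets. D is the squared Euclidean distance between embeddings, which are
    assumed to be L2-normalised already."""
    d_pos = ((s_emb - pos_emb) ** 2).sum(dim=1)  # D(f(s), f(p+))
    d_neg = ((s_emb - neg_emb) ** 2).sum(dim=1)  # D(f(s), f(p-))
    return torch.clamp(margin + d_pos - d_neg, min=0).mean()

# The term lambda * ||theta||_2^2 in Eq. (3) is most simply realised as weight
# decay in the optimiser, e.g.
# torch.optim.SGD(model.parameters(), lr=1e-3, weight_decay=1e-4)
```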
4.3. Heterogeneous vs. Siamese Networks
During training, there are three branches in our network,
and each corresponds to one of the atoms in the triplet:
query sketch, positive photo and negative photo (see Fig. 3).
The weights of the two photo branches should always be
shared, while the weights of the photo branch and the sketch
branch can either be shared or not depending on whether we
are using a Siamese network or a heterogeneous network.
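These branch-sharing options can be made concrete with a short sketch. This is illustrative rather than the paper's implementation: `make_branch` stands for any constructor of a branch CNN (e.g. a Sketch-a-Net-style network ending in a d-dimensional layer), and the class name is ours.

```python
import torch.nn as nn
import torch.nn.functional as F

class TripletBranches(nn.Module):
    """Three-branch triplet network. The two photo branches always share weights
    (they are literally the same module); the sketch branch is either tied to the
    photo branch (Siamese) or an independent copy (heterogeneous)."""

    def __init__(self, make_branch, siamese=True):
        super().__init__()
        self.photo_net = make_branch()
        self.sketch_net = self.photo_net if siamese else make_branch()

    def forward(self, sketch, photo_pos, photo_neg):
        s = F.normalize(self.sketch_net(sketch), dim=1)         # f_theta(s)
        p_pos = F.normalize(self.photo_net(photo_pos), dim=1)   # f_theta(p+)
        p_neg = F.normalize(self.photo_net(photo_neg), dim=1)   # f_theta(p-)
        return s, p_pos, p_neg
```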
After examining existing deep networks for cross-domain modelling, it seems that if the two domains are drastically different, e.g. text and image, a heterogeneous