
Sketch Me That Shoe
Qian Yu¹   Feng Liu¹,²   Yi-Zhe Song¹   Tao Xiang¹   Timothy M. Hospedales¹   Chen Change Loy³
¹ Queen Mary University of London, London, UK
² Southeast University, Nanjing, China
³ The Chinese University of Hong Kong, Hong Kong, China
{q.yu, feng.liu, yizhe.song, t.xiang, t.hospedales}@qmul.ac.uk   ccloy@ie.cuhk.edu.hk
Abstract
We investigate the problem of fine-grained sketch-based image retrieval (SBIR), where free-hand human sketches are used as queries to perform instance-level retrieval of images. This is an extremely challenging task because (i) visual comparisons not only need to be fine-grained but also executed cross-domain, (ii) free-hand (finger) sketches are highly abstract, making fine-grained matching harder, and most importantly (iii) annotated cross-domain sketch-photo datasets required for training are scarce, challenging many state-of-the-art machine learning techniques.
In this paper, for the first time, we address all these challenges, providing a step towards the capabilities that would underpin a commercial sketch-based image retrieval application. We introduce a new database of 1,432 sketch-photo pairs from two categories with 32,000 fine-grained triplet ranking annotations. We then develop a deep triplet-ranking model for instance-level SBIR with a novel data augmentation and staged pre-training strategy to alleviate the issue of insufficient fine-grained training data. Extensive experiments are carried out to contribute a variety of insights into the challenges of data sufficiency and over-fitting avoidance when training deep networks for fine-grained cross-domain ranking tasks.
1. Introduction
Notwithstanding the proliferation of touch-screen devices, mainstream image retrieval paradigms at present are still limited to having text or an exemplar image as input. Only very recently has sketch-based image retrieval (SBIR) started to return as a practical form of retrieval. Compared with text, sketches are incredibly intuitive to humans and have been used since prehistoric times to conceptualise and depict visual objects [20, 15]. A unique characteristic of sketches in the context of image retrieval is that they offer inherently fine-grained visual descriptions: a sketch speaks for a 'hundred' words.
Figure 1. Free-hand sketch is ideal for fine-grained instance-level image retrieval.
However, existing SBIR works largely overlook such fine-grained details, and mainly focus on retrieving images of the same category [21, 22, 10, 2, 3, 27, 12, 19, 13, 28, 11], thus not exploiting the real strength of SBIR. This oversight pre-emptively limits the practical use of SBIR since text is often a simpler form of input when only category-level retrieval is required; e.g., one would rather type in the word "shoe" to retrieve one than sketch a shoe, and existing commercial image search engines already do a reasonably good job at category-level image retrieval. In contrast, it is when aiming to retrieve a particular shoe that sketching may be preferable to elucidating a long textual description of it. Figure 1 illustrates an application scenario of using a free-hand sketch for fine-grained image search: a person walking on a street notices that another person walking towards him/her wears a pair of shoes that he/she desperately wants to buy; instead of taking a picture of it, which would be rude, he/she takes out a smartphone and draws a sketch of it using fingers; all the information required to have that pair of shoes is then just one click away.
In this paper, for the first time, the problem of fine-grained instance-level SBIR using free-hand sketches drawn by amateurs on a touch-screen device is studied. This is an extremely challenging problem. Some of the challenges faced are shared with the category-level SBIR task: sketches and photos are from inherently heterogeneous domains (sparse black-and-white line drawings versus dense colour pixels), and free-hand (finger) sketches are often very abstract compared with photos (a person can be drawn as a stick-man). In addition, it has its unique scientific challenges: (i) Fine-grained instance-level retrieval requires a mechanism to capture the fine-grained (dis)similarities of sketches and photo images across the domains. (ii) Collecting and annotating a fine-grained SBIR dataset is much harder than category-level ones. As a result, no large-scale dataset exists for researchers to develop solutions.
We address all these challenges by contributing two large-scale datasets and developing a model for fine-grained instance-level SBIR. For the datasets, we introduce two instance-level SBIR datasets consisting of 1,432 sketch-photo pairs in two categories (shoes and chairs), collected by asking participants to finger-sketch an object after observing a photo. A total of 32,000 ground-truth triplet ranking annotations are provided for both model development and performance evaluation. For the model, we take a deep learning approach to better bridge this large domain gap by learning, rather than engineering [11, 23], free-hand sketch/photo invariant features. Our model is a Siamese network with a triplet ranking objective. However, such a network with three branches naively requires a prohibitive $O(N^3)$ number of annotations, given that CNN models already require a large number of data instances $N$. Despite the large number of annotations provided in our datasets, they are still insufficient to effectively train a deep triplet ranking network for instance-level SBIR. We thus introduce and evaluate various novel strategies, including sketch-specific data augmentation and staged pre-training using auxiliary data sources, to deal with the data insufficiency problem.
Our contributions are as follows: (1) For the first time, the problem of fine-grained instance-level SBIR using free-hand sketches is addressed. (2) We contribute two new fine-grained SBIR datasets with extensive ground-truth annotations, in the hope that they will kick-start research efforts on solving this challenging problem. (3) We formulate a deep triplet ranking model with staged pre-training using various auxiliary data sources including sketches, photos, and sketch-photo category-level pairs. (4) Extensive experiments are conducted to provide insights on how a deep learning model for fine-grained SBIR can benefit from novel sketch-specific data augmentation and various pre-training and sampling strategies to tackle the challenges of a big domain gap and the lack of sufficient training data.
2. Related Work
Category-level and fine-grained SBIR Most existing SBIR works [21, 22, 10, 2, 3, 27, 12, 19, 13, 28, 11] focus on category-level sketch-to-photo retrieval. A bag-of-words (BOW) representation combined with some form of edge detection on the photo images is often employed to bridge the domain gap. The only previous work that attempted to address the fine-grained SBIR problem is that of [16], which is based on the deformable part-based model (DPM) and graph matching. However, their definition of fine-grained is very different from ours: a sketch is considered to be a match to a photo if the objects depicted look similar, i.e., have the same viewpoint, pose and zoom parameters; in other words, they do not have to contain the same object instance. In addition, these hand-crafted-feature-based approaches are inadequate in bridging the domain gap as well as in capturing the subtle intra-category and inter-instance differences, as demonstrated in our experiments.
Other SBIR works, such as Sketch2Photo [4] and AverageExplorer [34], use sketches in addition to text or colour cues for image retrieval. [34] further investigates an interactive process, in which each user 'edit' indicates the traits to focus on for refining retrieval. For now we focus on non-interactive black-and-white sketch-based retrieval, and leave these extensions to future work. Another data-driven method [25] performs well in cross-domain image matching by learning the 'uniqueness' of the query. However, [25] is prohibitively slow, limiting its usability for practical interactive image retrieval; it is thus excluded as a baseline.
SBIR Datasets One of the key barriers to fine-grained SBIR research is the lack of benchmark datasets. There are free-hand sketch datasets, the most commonly used being the TU-Berlin 20,000-sketch dataset [7]; there are also many photo datasets such as PASCAL VOC [8] and ImageNet [6]. Therefore, with few exceptions [22, 11], most existing SBIR datasets were created by combining overlapping categories of sketches and photos, which means only category-level SBIR is possible. The 'semi'-fine-grained dataset in [16] was created by selecting similar-looking sketch-photo pairs from the TU-Berlin and PASCAL VOC datasets. For each of its 14 categories, there are only 6 sketches and 60 images, which is much smaller than our datasets and too small to apply state-of-the-art deep learning techniques. For specific domains such as faces, large-scale datasets exist, such as the CUHK Face Sketches [30]. However, our sketches were drawn by amateurs on touch-screen devices, instead of by artists using pen and paper. Importantly, besides sketch-photo pairs, we provide a large number of triplet ranking annotations, i.e., given a sketch, ranking which of two photos is more similar, making our datasets suitable for more thorough evaluation as well as for developing more advanced retrieval models.
Related Deep Learning Models Deep neural networks, particularly deep convolutional neural networks (CNNs) [14], have achieved great success in various visual recognition tasks. A CNN model, 'Sketch-a-Net', was developed specifically for sketch recognition in [32], and achieves the state-of-the-art recognition performance to date on TU-Berlin [7]. In our fine-grained SBIR model, we use Sketch-a-Net as the basic network architecture in each branch of a triplet ranking Siamese network [9]. However, we introduce two new modifications to improve Sketch-a-Net: a pre-training step using edge maps extracted from ImageNet and a new sketch-specific data augmentation scheme. Our staged pre-training and sampling strategies are similar in spirit to those used in fine-grained image-to-image retrieval work [29, 1], which is also based on a triplet Siamese network, but with the vital difference that our task is cross-domain. For cross-domain modelling, there are two recent works worth mentioning: the ground-to-aerial image matching work in [18] and the sketch-to-3D-shape retrieval work in [28]. The former uses a two-branch Siamese network; we show in our experiments that using a triplet ranking Siamese network is advantageous in that it can better capture subtle inter-instance differences. The latter uses a variant of the Siamese network where each branch has a different architecture; we show that without tying the branches, i.e., without being strictly Siamese, the model is weaker in bridging the semantic gap between the two domains and more likely to over-fit.
3. Fine-Grained Instance-Level SBIR Datasets
We contribute two datasets, one for shoes and the other for chairs¹. There are 1,432 sketches and photos in total, or 716 sketch-photo pairs. The shoe dataset has 419 sketch-photo pairs, and the chair dataset 297 pairs. Figure 2 shows some examples. In each column, we display several similar samples, indicating the fine details that are required to differentiate specific shoes/chairs, as well as the challenge of doing so based on realistic free-hand sketches. We next detail the data collection and annotation process.
¹ Both datasets can be downloaded from http://sketchx.eecs.qmul.ac.uk/downloads.html
3.1. Data Collection
Collecting Photo Images Because our dataset is for fine-grained retrieval, the photo images should cover the variability of the corresponding object category. When collecting the shoe photo images, we selected 419 representative images from UT-Zap50K [31] covering shoes of different types including boots, high-heels, ballerinas, and formal and informal shoes. When collecting chairs, we searched three online shopping websites, including IKEA, Amazon and Taobao, and selected chair product images of varying types and styles. The final selection consists of 297 images which are representative and cover different kinds of chairs, including office chairs, couches, kids' chairs, desk chairs, etc.
Collecting Sketches The second step is to use the collected images to generate corresponding sketches. We recruited 22 volunteers to sketch the images. We showed one shoe/chair image to a volunteer on a tablet for 15 seconds, then displayed a blank canvas and let the volunteer sketch the object he/she just saw using their fingers on the tablet. None of the volunteers had any art training; they are thus representative of the general population who might use the developed SBIR system. As a result, the collected sketches are nowhere near perfect (see Fig. 2), making subsequent SBIR using these sketches challenging.

Figure 2. Examples of the (a) shoe and (b) chair datasets.
3.2. Data Annotation
Our goal is to find the most similar photos to a query sketch. The photo-sketch pair correspondence already provides some annotation that could be used to train a pairwise verification model [5]. However, for fine-grained analysis it is possible to learn a stronger model if we have a detailed ranking of the similarity of each candidate image to a given query sketch. Yet asking a human annotator to rank all 419 shoe photos given a query shoe sketch would be an error-prone task: humans are bad at ranking long lists, but better at individual forced-choice judgements. Therefore, instead of global ranking, a much more manageable triplet ranking task is designed for the annotators. Specifically, each triplet consists of one query sketch and two candidate photos; the task is to determine which of the two candidate photos is more similar to the query sketch. However, exhaustively annotating all possible triplets is also out of the question due to the extremely large number of possible triplets. We therefore selected only a subset of the triplets and obtained the annotations through the following three steps:

1. Attribute Annotation: We first defined an ontology of attributes for shoes and chairs based on existing UT-Zap50K attributes [31] and product tags on online shopping websites. We selected 21 and 15 binary attributes for shoes and chairs respectively. 60 volunteers helped to annotate all 1,432 images with ground-truth attribute vectors.
2. Generating Candidate Photos for each Sketch: Next we selected the 10 most similar candidate images for each sketch, in order to focus our limited amount of gold-standard fine-grained annotation effort. In particular, we combined the attribute vector with a deep feature vector (the fc7-layer features extracted using Sketch-a-Net [32]) and computed the Euclidean distance between each sketch and each image. For each query sketch, we took the 10 closest photo images as candidates for annotation.
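For concreteness, the candidate selection in step 2 can be sketched as follows. This is an illustrative snippet rather than the authors' released code: it assumes the attribute vector and the fc7 feature have simply been concatenated into one row per sketch/photo, and the function name `top_k_candidates` is ours.

```python
import numpy as np

def top_k_candidates(sketch_feats, photo_feats, k=10):
    """For each sketch, return the indices of the k closest photos by Euclidean distance.

    sketch_feats: (S, D) array, one row per sketch (e.g. attribute vector
                  concatenated with an fc7 feature vector).
    photo_feats:  (P, D) array, one row per photo, built the same way.
    """
    # Pairwise squared Euclidean distances, shape (S, P).
    d2 = (
        (sketch_feats ** 2).sum(1, keepdims=True)
        - 2.0 * sketch_feats @ photo_feats.T
        + (photo_feats ** 2).sum(1)
    )
    # Indices of the k smallest distances per sketch, then sorted by distance.
    idx = np.argpartition(d2, k, axis=1)[:, :k]
    row = np.arange(d2.shape[0])[:, None]
    order = np.argsort(d2[row, idx], axis=1)
    return idx[row, order]
```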
3. Triplet Annotation: To provide annotations for the (419 + 297) · 10 · 9/2 ≈ 32,000 triplets generated in the previous step, we recruited 36 volunteers. Each volunteer was presented with one sketch and two photos at a time, and asked to indicate which photo is more similar to the sketch. Each sketch has 10 · 9/2 = 45 triplets, and three people annotated each triplet. We merged the three annotations by majority voting to clean up some human errors. These collected triplet ranking annotations will be used in training our model and provide the ground truth for performance evaluation.
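A minimal sketch of this step, assuming each sketch's 10 candidates are given as a list and each triplet has exactly three recorded votes (the function names are ours, not from the paper):

```python
from itertools import combinations
from collections import Counter

def triplets_for_sketch(candidate_photos):
    """All unordered pairs of a sketch's candidate photos: 10 * 9 / 2 = 45 triplets."""
    return list(combinations(candidate_photos, 2))

def majority_vote(votes):
    """Merge three annotators' choices (each the id of the photo judged more
    similar to the sketch) into a single ground-truth label."""
    return Counter(votes).most_common(1)[0][0]

# Example: two of three annotators preferred photo 'p17' over 'p42'.
# majority_vote(['p17', 'p42', 'p17']) -> 'p17'
```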
4. Methodology
4.1. Overview
For a given query sketch $s$ and a set of $M$ candidate photos $\{p_j\}_{j=1}^{M} \subset P$, we need to compute the similarity between $s$ and $p$ and use it to rank the whole gallery set of photos, in the hope that the true match for the query sketch is ranked at the top. As discussed earlier, this involves two challenges: (i) bridging the domain gap between sketches and photos, and (ii) capturing subtle differences between candidate photos to obtain a fine-grained ranking despite the domain gap and amateur free-hand sketching. To achieve this, we propose to use a deep triplet ranking model to learn a domain-invariant representation $f_\theta(\cdot)$, which enables us to measure the similarity between $s$ and $p \in P$ for retrieval with the Euclidean distance: $D(s, p) = \|f_\theta(s) - f_\theta(p)\|_2^2$.
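Once $f_\theta(\cdot)$ is learned, retrieval reduces to a nearest-neighbour search in the embedding space. A minimal illustration (not the authors' code; it assumes the gallery photo embeddings have been precomputed):

```python
import numpy as np

def rank_gallery(sketch_emb, photo_embs):
    """Rank gallery photos for one query sketch by D(s, p) = ||f(s) - f(p)||_2^2.

    sketch_emb: (d,) embedding of the query sketch, f_theta(s).
    photo_embs: (M, d) embeddings of the M gallery photos, f_theta(p_j).
    Returns gallery indices ordered from most to least similar.
    """
    d2 = ((photo_embs - sketch_emb) ** 2).sum(axis=1)  # squared Euclidean distance to each photo
    return np.argsort(d2)
```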
To learn this representation $f_\theta(\cdot)$ we will use the annotated triplets $\{(s_i, p_i^+, p_i^-)\}_{i=1}^{N}$ as supervision. A triplet ranking model is thus appropriate. Specifically, each triplet consists of a query sketch $s$ and two photos $p^+$ and $p^-$, namely a positive photo and a negative photo, such that the positive one is more similar to the query sketch than the negative one. Our goal is to learn a feature mapping $f_\theta(\cdot)$ that maps photos and sketches to a common feature embedding space, $\mathbb{R}^d$, in which photos similar to particular sketches are closer than dissimilar ones, i.e., the distance between query $s$ and positive $p^+$ is always smaller than the distance between query $s$ and negative $p^-$:

$$D(f_\theta(s), f_\theta(p^+)) < D(f_\theta(s), f_\theta(p^-)). \quad (1)$$

We constrain the embedding to live on the $d$-dimensional hypersphere, i.e., $\|f_\theta(\cdot)\|_2 = 1$.
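One common way to realise this hypersphere constraint is to L2-normalise the output of each branch. A brief illustrative sketch (not the paper's implementation; `branch_net` is a placeholder for any branch CNN, e.g. a Sketch-a-Net-style network ending in a d-dimensional layer):

```python
import torch.nn.functional as F

def embed(branch_net, x):
    """Map a batch of inputs through a branch network and project onto the unit
    hypersphere, so that ||f_theta(.)||_2 = 1 for every embedding."""
    return F.normalize(branch_net(x), p=2, dim=1)
```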
4.2. Triplet Loss
Towards this goal, we formulate a deep triplet ranking model with a ranking loss. The loss is defined using the max-margin framework. For a given triplet $t = (s, p^+, p^-)$, its loss is defined as:

$$L_\theta(t) = \max\big(0,\ \Delta + D(f_\theta(s), f_\theta(p^+)) - D(f_\theta(s), f_\theta(p^-))\big), \quad (2)$$

where $\Delta$ is a margin between the positive-query distance and the negative-query distance. If the two photos are ranked correctly with a margin of distance $\Delta$, then this triplet will not be penalised. Otherwise the loss is a convex approximation of the 0-1 ranking loss, which measures the degree of violation of the desired ranking order specified by the triplet. Overall we optimise the following objective:

$$\min_\theta \sum_{t \in T} L_\theta(t) + \lambda R(\theta), \quad (3)$$

where $T$ is the training set of triplets, $\theta$ are the parameters of the deep model, which defines a mapping $f_\theta(\cdot)$ from the input space to the embedding space, and $R(\cdot)$ is an $\ell_2$ regulariser $\|\theta\|_2^2$. Minimising this loss will narrow the positive-query distance while widening the negative-query distance, and thus learn a representation satisfying the ranking order. With sufficient triplet annotations, the deep model will eventually learn a representation which captures the fine-grained differences between sketches and photos for retrieval.

Even though the new datasets contain thousands of triplet annotations each, they are still far from sufficient to train a deep triplet ranking model with millions of parameters. Next we detail the characteristics of our model, from architecture design and staged model pre-training to sketch-specific data augmentation, all of which are designed to cope with the sparse training data problem.
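A compact sketch of the loss in Eq. (2) on unit-norm embeddings; batch averaging, the margin value and the use of weight decay for the $\ell_2$ regulariser $R(\theta)$ are implementation choices assumed here, not details given in this section:

```python
import torch

def triplet_ranking_loss(s_emb, pos_emb, neg_emb, margin=0.1):
    """Eq. (2): max(0, margin + D(s, p+) - D(s, p-)), averaged over a batch of
    triplets. D is the squared Euclidean distance between embeddings, which are
    assumed to be L2-normalised already."""
    d_pos = ((s_emb - pos_emb) ** 2).sum(dim=1)  # D(f(s), f(p+))
    d_neg = ((s_emb - neg_emb) ** 2).sum(dim=1)  # D(f(s), f(p-))
    return torch.clamp(margin + d_pos - d_neg, min=0).mean()

# The term lambda * ||theta||_2^2 in Eq. (3) is most simply realised as weight
# decay in the optimiser, e.g.
# torch.optim.SGD(model.parameters(), lr=1e-3, weight_decay=1e-4)
```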
4.3. Heterogeneous vs. Siamese Networks
During training, there are three branches in our network,
and each corresponds to one of the atoms in the triplet:
query sketch, positive photo and negative photo (see Fig. 3).
The weights of the two photo branches should always be
shared, while the weights of the photo branch and the sketch
branch can either be shared or not depending on whether we
are using a Siamese network or a heterogeneous network.
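These branch-sharing options can be made concrete with a short sketch. This is illustrative rather than the paper's implementation: `make_branch` stands for any constructor of a branch CNN (e.g. a Sketch-a-Net-style network ending in a d-dimensional layer), and the class name is ours.

```python
import torch.nn as nn
import torch.nn.functional as F

class TripletBranches(nn.Module):
    """Three-branch triplet network. The two photo branches always share weights
    (they are literally the same module); the sketch branch is either tied to the
    photo branch (Siamese) or an independent copy (heterogeneous)."""

    def __init__(self, make_branch, siamese=True):
        super().__init__()
        self.photo_net = make_branch()
        self.sketch_net = self.photo_net if siamese else make_branch()

    def forward(self, sketch, photo_pos, photo_neg):
        s = F.normalize(self.sketch_net(sketch), dim=1)         # f_theta(s)
        p_pos = F.normalize(self.photo_net(photo_pos), dim=1)   # f_theta(p+)
        p_neg = F.normalize(self.photo_net(photo_neg), dim=1)   # f_theta(p-)
        return s, p_pos, p_neg
```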
After examining existing deep networks for cross-domain modelling, it seems that if the two domains are drastically different, e.g. text and image, a heterogeneous