
Relative Parts: Distinctive Parts for Learning Relative Attributes

TL;DR: This paper introduces a part-based joint representation of an image pair that specifically compares corresponding parts, and associates with each part a locally adaptive "significance-coefficient" that represents its discriminative ability with respect to a particular attribute.
Abstract: The notion of relative attributes as introduced by Parikh and Grauman (ICCV, 2011) provides an appealing way of comparing two images based on their visual properties (or attributes) such as "smiling" for face images, "naturalness" for outdoor images, etc. For learning such attributes, a Ranking SVM based formulation was proposed that uses globally represented pairs of annotated images. In this paper, we extend this idea towards learning relative attributes using local parts that are shared across categories. First, instead of using a global representation, we introduce a part-based representation combining a pair of images that specifically compares corresponding parts. Then, with each part we associate a locally adaptive "significance-coefficient" that represents its discriminative ability with respect to a particular attribute. For each attribute, the significance-coefficients are learned simultaneously with a max-margin ranking model in an iterative manner. Compared to the baseline method, the new method is shown to achieve significant improvement in relative attribute prediction accuracy. Additionally, it is also shown to improve relative feedback based interactive image search.

Summary (5 min read)

1. Introduction

  • Visual attributes (or simply attributes) are perceptual properties that can be used to describe an entity (“pointed nose”), an object (“furry sheep”), or a scene (“natural outdoor”).
  • This led to the notion of “relative attributes”, where the strength of an attribute in a given image can be described with respect to some other image/category; e.g. “given face is less chubby than person A and more chubby than person B”.
  • Next, the authors update this part-based representation by additionally learning weights corresponding to each part that denote their contribution towards predicting the strength of a given attribute.
  • The authors compare the baseline method of [24] with the proposed method under various settings.
  • In Sec. 3, the authors discuss the method of [24] for learning relative attribute ranking models.

3. Preliminaries

  • In [24], a Ranking SVM based method was used for learning relative attribute classifiers.
  • Ranking SVM [12] is a max-margin ranking framework that learns linear models to perform pairwise comparisons.
  • This is conceptually different from the conventional one-vs-rest SVM that learns a model using individual samples rather than pairs.
  • Though SVM scores can also be used to perform pairwise comparisons, Ranking SVM has usually been known to perform better than SVM for such tasks.
  • In [24] as well, Ranking SVM was shown to perform better than SVM on the task of relative attribute prediction.

3.1. The Ranking SVM Model

  • Sm = {(Ii, Ij)} is such that both Ii and Ij have nearly the same strength of attribute am.
  • Using Dm, the goal is to learn a ranking function fm that, given a new pair of images Ip and Iq represented by xp and xq respectively, predicts which image has greater strength of attribute am.
  • Using fm, the authors determine which image has higher strength for attribute am based on ympq = sign(fm(xp,xq; wm)).
  • Note that along with pairwise constraints as in [12], the optimization problem now also includes similarity constraints.
  • This is solved in the primal form itself using Newton’s method [3].

4. Proposed Representations

  • The Ranking SVM method discussed above uses a joint representation based on globally computed features (Eq. 2) while determining the strength of some given attribute.
  • Several attributes such as "visible-teeth", "eyes-open", etc. are not representative of the whole image, and correspond only to some specific regions/parts.
  • This means there exists a weak association between an image and its attribute label.
  • This inspires us to build a representation that (i) encodes part/region-specific features, without confusing across parts; and (ii) explicitly encodes the relative significance of each part with respect to a given attribute.
  • Next the authors propose two part-based joint-representations for the task of learning relative attribute classifiers.

4.1. Part-based Joint Representation

  • These parts can be obtained using a domain-specific method; e.g., the method discussed in [35] can be used for determining a set of localized parts in face images.
  • Here, N1 = K × d1 such that each x̃k is a sparse vector with only d1 non-zero entries in the kth interval representing part pk.
  • The advantage of this representation is that it specifically encodes correspondence among parts; i.e., now the kth part of Ip is compared with just the kth part of Iq .
  • The assumption here is that such a direct comparison between localized pairs of parts would provide stronger cues for learning relative attribute models than using a single global representation as in Eq. 2.
  • (This assumption is also validated by improvements in prediction accuracy, as discussed in Sec. 6.)

4.2. Weighted Part-based Joint Representation

  • Though the joint representation proposed in the previous section allows direct part-based comparison between a pair of images, it does not provide information about which parts actually symbolize some given attribute.
  • As discussed in Sec. 4.1, let each image I be represented by a set of K parts.
  • Additionally, let skm ∈ [0, 1] be a weight associated with the kth part.
  • This weight denotes the relative importance of the kth part compared to other parts for predicting the strength of attribute am; i.e., the larger the weight, the more important that part, and vice-versa.
  • These help in explicitly encoding the relative importance of individual parts in the joint representation.

5. Parameter Learning

  • Now the authors discuss how to learn the parameters for each attribute using the two joint representations discussed above.
  • Note that the authors still need to satisfy the constraints as in Eq. 3 and Eq. 4 depending upon the representation followed.

5.1. For Part-based Joint Representation

  • This is similar to OP1, except that now the authors use part-based representation instead of global representation.
  • This allows us to use the same Newton’s method [3] for solving OP2.

5.2. For Weighted Part-based Joint Representation

  • For the weighted part-based joint representation in Eq. 10, the authors need to learn two sets of parameters corresponding to every attribute: the ranking model wm and the significance-coefficients sm.
  • Note that the overall weight of all the parts is constrained to sum to one; i.e., skm ≥ 0, e · sm = 1, which ensures that all parts are used fairly.
  • This is desirable since usually only a few parts contribute towards determining the strength of a given attribute.

5.2.1 Solving the optimization problem

  • The authors solve OP3 in the primal form itself using a block coordinate descent algorithm.
  • The authors consider each set of parameters wm and sm as two blocks, and optimize them in an alternate manner.
  • Qm ⊆ Om is the set of pairs that violate the margin constraint.
  • Note that Qm is not fixed, and may change at every iteration.
  • The authors solve OP4 using an iterative gradient descent and projection method similar to [34].

5.3. Computing Parts

  • The two joint representations as proposed in Sec. 4 are based on an ordered set of corresponding parts computed from a given pair of images.
  • Given a method for computing such parts, their framework is applicable irrespective of the domain.
  • To compute parts from a given face image, the authors use the method proposed in [35].
  • Though these parts can be used to represent several attributes such as "smiling", "eyes-open", etc., there are a few other attributes that are not covered by these parts, such as "bald-head", "visible-forehead" and "dark-hair".

5.4. Relation with Latent Models

  • In the last few years, latent models have become popular for several tasks, particularly for object detection [9].
  • These models usually look for characteristics (e.g., parts) that are shared within a category but distinctive across categories.
  • (As discussed in Sec. 2, recent works such as [1, 5, 13] also have similar motivation, though they do not explicitly investigate the latent aspect.).
  • The authors' work is similar to theirs in the sense that they also seek attribute-specific distinctive parts by incorporating significance-coefficients.
  • However, unlike those works, the authors require these parts to be shared across categories; this is because their ranking method uses these parts to learn attribute-specific models which are independent of the categories being depicted in training pairs.

6. Experiments

  • The authors compare the proposed method with that of [24] under different settings on two datasets.
  • To address this limitation, the authors have collected a new dataset using a subset of LFW [11] images.
  • The new dataset has attribute-level annotations for 10000 image pairs and 10 attributes, and the authors call this the LFW-10 dataset.
  • While collecting the annotations, the authors particularly ignore the category information, thus making it more suitable for the task of learning relative attributes.
  • The details of this dataset are described next.

6.1. LFW-10 Dataset

  • Out of these, 1000 images are used for creating training pairs and the remaining 1000 for testing pairs.
  • The annotations are collected for 10 attributes, with 500 training and testing pairs per attribute.
  • In order to minimize the chances of inconsistency in the dataset, each image pair is annotated by 5 trained annotators, and the final annotation is decided based on majority voting.
  • Figure 5 shows example pairs from this dataset.

6.2. Features for Parts

  • The authors represent each part using a Bag of Words (BoW) histogram over dense SIFT [20] features.
  • The authors consider two settings for learning visual-word vocabulary: (1) In the first setting, they learn a part-specific vocabulary for every part.
  • This is possible since their parts are fixed and known.
  • In practice, the authors learn a vocabulary of 100 visual words for each part.
  • (2) In the second setting, the authors learn a single vocabulary of 100 visual words for all the parts.

6.3. Baselines

  • The authors compare with the Ranking SVM method of [24] using the code provided by its authors.
  • The authors use four features for comparison: (i) BoW histogram over DSIFT features with 1000 visual words, (ii) global 512-dimensional GIST descriptor [23], (iii) global 512-dimensional GIST and 30-dimensional RGB histogram (which was also used in [24]), and (iv) spatial pyramid (SPM) [19] up to two and three levels using DSIFT features and the same vocabulary as in (i).
  • As another baseline, the authors compare the quality of their part-learning framework (Sec. 5.2) against human-selected parts.
  • For this, the authors asked a human expert to select a subset of the most representative parts corresponding to every attribute.
  • For a given attribute am, all the selected parts are assigned equal weights and the remaining parts are assigned zero weight, and then a ranking model wm is learned based on these part weights (see the sketch below).
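A minimal numpy sketch of this human-selected-parts baseline, assuming K = 83 parts as in the paper; the function name and the example part indices are hypothetical, and normalizing the equal weights to sum to one mirrors the simplex constraint placed on sm during learning:

```python
import numpy as np

def human_baseline_weights(selected, K=83):
    """Equal weights on expert-selected parts, zero elsewhere (hypothetical
    helper). Weights are normalized to sum to 1, matching the simplex
    constraint placed on s_m during learning."""
    s = np.zeros(K)
    s[sorted(selected)] = 1.0 / len(selected)
    return s

# Hypothetical mouth-region part indices for an attribute like "smiling":
s_m = human_baseline_weights({48, 51, 54, 57})
```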

6.4. Results

  • Table 1 compares different methods on the PubFig-29 dataset.
  • The consistent gains over the global baselines clearly validate the significance of these part-based representations for learning relative attribute models.
  • Using a part-specific vocabulary performs better than using a single shared vocabulary.
  • Figure 7 shows the top ten learned parts with the highest significance-coefficients for all ten attributes in the LFW-10 dataset.
  • Also, the performance of their method closely matches that obtained using human-selected parts, thus demonstrating its effectiveness.

7. Conclusion

  • Inspired by the success of relative attributes, the authors have presented a novel method that learns relative attribute models using local parts that are shared across categories.
  • The authors' method achieves significant improvements compared to the baseline method.
  • Apart from this, the part-specific weights learned using their method also provide a semantic interpretation of different parts for diverse attributes.


Relative Parts: Distinctive Parts for Learning Relative Attributes
Ramachandruni N. Sandeep, Yashaswi Verma, C. V. Jawahar
Center for Visual Information Technology, IIIT Hyderabad, India - 500032
1. Introduction

Visual attributes (or simply attributes) are perceptual properties that can be used to describe an entity ("pointed nose"), an object ("furry sheep"), or a scene ("natural outdoor"). These act as mid-level representations that are comprehensible for both human as well as machine, thus providing a strong means of filling-up the so-called semantic-gap. Attributes have recently been used as a source of semantic cues in diverse tasks such as object recognition [17, 18], image description [24], learning unseen object categories (or zero-shot learning) [18], etc. While most of these works have focused on binary attributes (indicating presence or absence of some visual property), Parikh and Grauman [24] proposed that it is more natural to consider the strength of an attribute rather than its absolute presence/absence. This led to the notion of "relative attributes", where the strength of an attribute in a given image can be described with respect to some other image/category; e.g., "given face is less chubby than person A and more chubby than person B". In [24], given a set of pairs of images depicting similar and/or different strengths of some particular attribute, the problem of learning a relative attribute classifier is posed as one of learning a ranking model for that attribute similar to Ranking SVM [12].

In this work, we build upon this idea by learning relative attribute models using local parts that are shared across categories. First, we propose a part-based representation that jointly represents a pair of images. A part corresponds to a block around a landmark point detected using a domain-specific method. This representation explicitly encodes correspondences among parts, thus better capturing minute differences in parts that make an attribute more prominent in one image than another, as compared to a global representation as in [24]. Next, we update this part-based representation by additionally learning weights corresponding to each part that denote their contribution towards predicting the strength of a given attribute. We call these weights the "significance-coefficients" of parts. For each attribute, the significance-coefficients are learned in a discriminative manner simultaneously with a max-margin ranking model. Thus, the best parts for predicting the relative attribute "more smiling" will be different from those for predicting "more eyes-open". The steps of the proposed method are illustrated in Figure 1. While the notion of parts is not new, we believe that ours is the first attempt that explores the applicability of parts in a ranking scenario, and for learning relative attribute ranking models in particular.

We compare the baseline method of [24] with the proposed method under various settings. For this, we have collected a new dataset of 10000 pairwise attribute-level annotations using images from the "Labeled Faces in the Wild" (LFW) dataset [11], particularly focusing on (i) large variety among samples in terms of poses, lighting conditions, etc., and (ii) completely ignoring the category information while collecting attribute annotations. Extensive experiments demonstrate that the new method significantly improves the prediction accuracy as compared to the baseline method. Moreover, the learned parts also compare favorably with human-selected parts, thus indicating the intrinsic capacity of the proposed framework for learning attribute-specific semantic parts.

Figure 1. Given an ordered pair of images, first we detect parts corresponding to different (facial) landmarks. Using these, a joint pairwise part-based representation is formed that encodes (i) correspondence among different parts, and (ii) relative importance of each part for a given attribute. Using this, a max-margin ranking model w is learned simultaneously with part weights s (red blocks) in an iterative manner.

The paper is organized as follows. In Sec. 2, we give an overview of some of the recent works based on attributes and relative attributes. In Sec. 3, we discuss the method of [24] for learning relative attribute ranking models. Then we present the new part-based representations in Sec. 4, followed by an algorithm for learning the model variables in Sec. 5. Experiments and results are discussed in Sec. 6, and finally we conclude in Sec. 7.
2. Related Works

As discussed earlier, attributes are properties that are understandable by both human as well as machine. Because of this, attributes have recently gained significant popularity among several vision applications, where attribute identification is not the final goal but just an intermediate step. In [8], objects are described using their attributes; e.g., instead of classifying an image as that of a "sheep", it is described based on its properties such as "has horn", "has wool", etc. This helps in describing even those objects which have few or no examples during the training phase. A similar idea is used in [7, 18], where attribute-based feedback is used for unseen category recognition. Attribute-based feedback has been shown to be useful for anomaly detection [28] within an object category, and for adding unlabeled samples for category classifier learning [4]. Attributes have also been used for multiple-query image search [30], where input attributes along with other related attributes are used in a structured-prediction based model. Along with various applications, attributes have been used in several mid-level tasks. These include identification of color/texture [10], specific objects such as faces [17], and general object categories [18, 33]. In some cases, since it might not be possible to learn discriminative attributes from individual images, in [21], pairs of images are used to learn such attributes based on human feedback.

While most of the above methods have focused on presence/absence of some attribute, in [24] the notion of relative attributes was introduced. In this, two images are compared based on the relative strength of some given attribute, thus providing a semantically richer way of describing the visual world than using binary attributes. Since then, relative attributes have been used in several applications, such as customized image search [15, 16], where a user can interactively describe and refine visual properties while searching for some specific object. This has been further extended in recent works [14, 25]. In [14], generic attribute models are learned that can adapt to different users' preferences. In [25], novel features are introduced based on the user's implied feedback, which subsequently help in improving search performance. In [26], an active learning framework based on relative attribute feedback is proposed. Here, the teacher (human) not only corrects an incorrect prediction made by the learner (machine), but also tells why the prediction is incorrect using attribute-based feedback. This helps the learner in propagating this understanding among other examples, which subsequently improves the learning process. This idea is extended in [2], where the learner learns attribute classifiers along with category classifiers. In [29], a semi-supervised constrained bootstrapping approach is proposed that tries to benefit from inter-class attribute-based relationships to avoid semantic drift during the learning process. In [32], a novel framework for predicting relative dominance among attributes within an image is proposed. In [27], rather than using either binary or relative attributes, their interactions are modeled to better describe images.

Our work closely relates with recent works [1, 5, 6, 13] that use distinctive part/region-based representations for scene classification [13] or fine-grained classification [1, 5, 6]. However, rather than identifying category-specific distinctive parts, our aim is to compare similar parts that are shared across categories. This makes our problem somewhat more challenging, since our representation is expected to capture small relative differences in the appearance of semantically similar parts, which contribute in making some attribute prominent in one image than another.
3. Preliminaries
In [24], a Ranking SVM based method was used for learning relative attribute classifiers. Ranking SVM [12] is a max-margin ranking framework that learns linear models to perform pairwise comparisons. This is conceptually different from the conventional one-vs-rest SVM, which learns a model using individual samples rather than pairs. Though SVM scores can also be used to perform pairwise comparisons, Ranking SVM has usually been known to perform better than SVM for such tasks. In [24] as well, Ranking SVM was shown to perform better than SVM on the task of relative attribute prediction. We now briefly discuss the method used in [24] for learning relative attribute classifiers.
3.1. The Ranking SVM Model
Let I = {I1, . . . , In} be a collection of n images. Each image Ii is represented by a global feature vector xi ∈ R^N. Suppose we have a fixed set of attributes A = {am}. For each attribute am ∈ A, we are given a set Dm = Om ∪ Sm consisting of ordered pairs of images. Here, Om = {(Ii, Ij)} is such that image Ii has more strength of attribute am than image Ij; and Sm = {(Ii, Ij)} is such that both Ii and Ij have nearly the same strength of attribute am. Using Dm, the goal is to learn a ranking function fm that, given a new pair of images Ip and Iq represented by xp and xq respectively, predicts which image has greater strength of attribute am. Under the assumption that fm is a linear function of xp and xq, it is defined as:

$$f_m(\mathbf{x}_p, \mathbf{x}_q; \mathbf{w}_m) = \mathbf{w}_m \cdot \Psi(\mathbf{x}_p, \mathbf{x}_q), \quad (1)$$
$$\Psi(\mathbf{x}_p, \mathbf{x}_q) = \mathbf{x}_p - \mathbf{x}_q. \quad (2)$$

Here, wm is the parameter vector for attribute am, and Ψ(xp, xq) is a joint representation formed using xp and xq. Using fm, we determine which image has higher strength for attribute am based on ympq = sign(fm(xp, xq; wm)): ympq = 1 means Ip has higher strength of am than Iq, and ympq = −1 means otherwise. In order to learn wm, the following constraints need to be satisfied:

$$\mathbf{w}_m \cdot \Psi(\mathbf{x}_i, \mathbf{x}_j) > 0 \quad \forall (I_i, I_j) \in O_m, \quad (3)$$
$$\mathbf{w}_m \cdot \Psi(\mathbf{x}_i, \mathbf{x}_j) = 0 \quad \forall (I_i, I_j) \in S_m. \quad (4)$$

Since this is an NP-hard problem, its relaxed version is solved by introducing slack variables. This leads to the following optimization problem (OP1):

$$\mathrm{OP1}: \min_{\mathbf{w}_m} \; \frac{1}{2}\|\mathbf{w}_m\|_2^2 + C_m \Big( \sum \xi_{ij}^2 + \sum \alpha_{ij}^2 \Big) \quad (5)$$
$$\text{s.t.} \quad \mathbf{w}_m \cdot \Psi(\mathbf{x}_i, \mathbf{x}_j) \ge 1 - \xi_{ij}, \quad \forall (I_i, I_j) \in O_m, \quad (6)$$
$$\|\mathbf{w}_m \cdot \Psi(\mathbf{x}_i, \mathbf{x}_j)\|_1 \le \alpha_{ij}, \quad \forall (I_i, I_j) \in S_m, \quad (7)$$
$$\xi_{ij} \ge 0; \quad \alpha_{ij} \ge 0. \quad (8)$$
Here, $\|\cdot\|_2^2$ denotes the squared L2 norm, $\|\cdot\|_1$ denotes the L1 norm, and Cm > 0 is a constant that controls the trade-off between the regularization and loss terms. Note that along with the pairwise constraints as in [12], the optimization problem now also includes similarity constraints. It is solved in the primal form itself using Newton's method [3].

Figure 2. Given an input image (left), the parts that correspond to "visible-teeth" (middle) and "eyes-open" (right).
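To make the pairwise formulation concrete, here is a minimal numpy sketch of Eq. 1, Eq. 2 and the prediction rule ympq = sign(wm · (xp − xq)). The plain gradient-descent loop on the squared-slack losses of OP1 is our own simplified stand-in for the Newton solver of [3]; the function names, learning rate, and iteration count are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def psi(x_p, x_q):
    """Joint pairwise representation of Eq. 2: the feature difference."""
    return x_p - x_q

def train_ranker(ordered, similar, dim, C=1.0, lr=0.01, iters=500):
    """Learn w_m from ordered pairs (x_i stronger than x_j) and similar
    pairs, using the squared-slack losses of OP1 (Eqs. 5-8). Plain gradient
    descent here; the paper solves the primal with Newton's method [3]."""
    w = np.zeros(dim)
    for _ in range(iters):
        grad = w.copy()                       # gradient of 0.5 * ||w||^2
        for x_i, x_j in ordered:
            d = psi(x_i, x_j)
            margin = w @ d
            if margin < 1.0:                  # violated pair: slack xi_ij > 0
                grad += 2.0 * C * (margin - 1.0) * d
        for x_i, x_j in similar:
            d = psi(x_i, x_j)
            grad += 2.0 * C * (w @ d) * d     # push w.d toward 0 for similar pairs
        w -= lr * grad
    return w

def predict(w, x_p, x_q):
    """+1 if I_p shows the attribute more strongly than I_q, else -1."""
    return int(np.sign(w @ psi(x_p, x_q)))
```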
4. Proposed Representations
The Ranking SVM method discussed above uses a joint representation based on globally computed features (Eq. 2) while determining the strength of some given attribute. However, several attributes such as "visible-teeth", "eyes-open", etc. are not representative of the whole image, and correspond only to some specific regions/parts. This means there exists a weak association between an image and its attribute label. E.g., Figure 2 shows the parts corresponding to the attributes "visible-teeth" and "eyes-open". This inspires us to build a representation that (i) encodes part/region-specific features, without confusing across parts; and (ii) explicitly encodes the relative significance of each part with respect to a given attribute. With this motivation, next we propose two part-based joint representations for the task of learning relative attribute classifiers.
4.1. Part-based Joint Representation
Given an image I, let P = {p1, . . . , pK} be the set of its K parts. These parts can be obtained using a domain-specific method; e.g., the method discussed in [35] can be used for determining a set of localized parts in face images. Each part pk, k ∈ {1, . . . , K}, is represented using an N1-dimensional feature vector x̃k ∈ R^N1. Here, N1 = K × d1 such that each x̃k is a sparse vector with only d1 non-zero entries in the k-th interval representing part pk. Based on this, given a pair of images Ip and Iq, we define a joint part-based feature representation as below:

$$\tilde{\Psi}(\tilde{\mathbf{x}}_p, \tilde{\mathbf{x}}_q) = \sum_{k=1}^{K} \big( \tilde{\mathbf{x}}_p^k - \tilde{\mathbf{x}}_q^k \big), \quad (9)$$

where x̃p = {x̃kp | k ∈ {1, . . . , K}}. The advantage of this representation is that it specifically encodes correspondence among parts; i.e., now the k-th part of Ip is compared with just the k-th part of Iq. The assumption here is that such a direct comparison between localized pairs of parts would provide stronger cues for learning relative attribute models than using a single global representation as in Eq. 2. (This assumption is also validated by improvements in prediction accuracy, as discussed in Sec. 6.)
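As a concrete reading of Eq. 9 (our own sketch, with hypothetical helper names): each part's d1-dimensional descriptor occupies its own block of the N1 = K × d1 vector, so summing the per-part differences amounts to stacking block-wise differences, and the k-th part of Ip is only ever compared with the k-th part of Iq:

```python
import numpy as np

def embed_part(f_k, k, K):
    """Place part k's d1-dim descriptor into the k-th block of an
    N1 = K * d1 dimensional vector (dense here; sparse in the paper)."""
    d1 = len(f_k)
    x = np.zeros(K * d1)
    x[k * d1:(k + 1) * d1] = f_k
    return x

def joint_part_representation(parts_p, parts_q):
    """Eq. 9: sum_k (x_p^k - x_q^k). Since each part lives in its own
    block, this equals concatenating the per-part differences."""
    K = len(parts_p)
    return sum(embed_part(fp, k, K) - embed_part(fq, k, K)
               for k, (fp, fq) in enumerate(zip(parts_p, parts_q)))

# Tiny usage example: K = 3 parts, d1 = 4 dimensions per part.
rng = np.random.default_rng(0)
parts_p = [rng.random(4) for _ in range(3)]
parts_q = [rng.random(4) for _ in range(3)]
print(joint_part_representation(parts_p, parts_q).shape)  # (12,)
```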
4.2. Weighted Part-based Joint Representation
Though the joint representation proposed in the previous section allows direct part-based comparison between a pair of images, it does not provide information about which parts actually symbolize a given attribute. This is particularly desirable in the case of local attributes, where only a few parts are important in predicting attribute strength. With this motivation, we update the joint representation of Eq. 9 to precisely encode the relative importance of parts.

As discussed in Sec. 4.1, let each image I be represented by a set of K parts. Additionally, let skm ∈ [0, 1] be a weight associated with the k-th part. This weight denotes the relative importance of the k-th part compared to other parts for predicting the strength of attribute am; i.e., the larger the weight, the more important that part, and vice-versa. Using this, given a pair of images Ip and Iq, the new weighted part-based joint feature representation is defined as:

$$\tilde{\Psi}_s(\tilde{\mathbf{x}}_p, \tilde{\mathbf{x}}_q, \mathbf{s}_m) = \sum_{k=1}^{K} s_m^k \big( \tilde{\mathbf{x}}_p^k - \tilde{\mathbf{x}}_q^k \big), \quad (10)$$

where sm = [s1m, . . . , sKm]^T. Since skm expresses the relative significance of the k-th part with respect to am, we call it the significance-coefficient of the k-th part. These coefficients explicitly encode the relative importance of individual parts in the joint representation.
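A minimal sketch of Eq. 10, assuming the K part descriptors of an image are stacked as a (K, d1) matrix; because each part occupies its own block (Sec. 4.1), scaling block k by skm before flattening reproduces the weighted sum. The function name is ours:

```python
import numpy as np

def weighted_joint_representation(parts_p, parts_q, s_m):
    """Eq. 10: sum_k s_m[k] * (x_p^k - x_q^k).
    parts_p, parts_q: (K, d1) arrays of per-part descriptors;
    s_m: (K,) significance-coefficients with s_m >= 0, s_m.sum() == 1."""
    diffs = np.asarray(parts_p) - np.asarray(parts_q)  # (K, d1) differences
    # Scaling each part's block by its coefficient and flattening back to
    # the block layout of Sec. 4.1 reproduces the weighted sum of Eq. 10.
    return (np.asarray(s_m)[:, None] * diffs).ravel()
```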
5. Parameter Learning
Now we discuss how to learn the parameters for each attribute using the two joint representations discussed above. Note that we still need to satisfy the constraints as in Eq. 3 and Eq. 4, depending upon the representation followed.
5.1. For Part-based Joint Representation
In order to learn a ranking model based on the part-based representation in Eq. 9, we optimize the following problem:

$$\mathrm{OP2}: \min_{\mathbf{w}_m} \; \frac{1}{2}\|\mathbf{w}_m\|_2^2 + C_m \Big( \sum \xi_{ij}^2 + \sum \alpha_{ij}^2 \Big) \quad (11)$$
$$\text{s.t.} \quad \mathbf{w}_m \cdot \tilde{\Psi}(\tilde{\mathbf{x}}_i, \tilde{\mathbf{x}}_j) \ge 1 - \xi_{ij}, \quad \forall (I_i, I_j) \in O_m, \quad (12)$$
$$\|\mathbf{w}_m \cdot \tilde{\Psi}(\tilde{\mathbf{x}}_i, \tilde{\mathbf{x}}_j)\|_1 \le \alpha_{ij}, \quad \forall (I_i, I_j) \in S_m, \quad (13)$$
$$\xi_{ij} \ge 0; \quad \alpha_{ij} \ge 0. \quad (14)$$

This is similar to OP1, except that now we use the part-based representation instead of the global representation. This allows us to use the same Newton's method [3] for solving OP2.
5.2. For Weighted Part-based Joint Representation
For the weighted part-based joint representation in Eq. 10, we need to learn two sets of parameters corresponding to every attribute: the ranking model wm and the significance-coefficients sm. To do this, we solve the following optimization problem (OP3):

$$\mathrm{OP3}: \min_{\mathbf{w}_m, \mathbf{s}_m} \; \frac{1}{2}\|\mathbf{w}_m\|_2^2 + C_m \Big( \sum \xi_{ij}^2 + \sum \alpha_{ij}^2 \Big) \quad (15)$$
$$\text{s.t.} \quad \mathbf{w}_m \cdot \tilde{\Psi}_s(\tilde{\mathbf{x}}_i, \tilde{\mathbf{x}}_j, \mathbf{s}_m) \ge 1 - \xi_{ij}, \quad \forall (I_i, I_j) \in O_m, \quad (16)$$
$$\|\mathbf{w}_m \cdot \tilde{\Psi}_s(\tilde{\mathbf{x}}_i, \tilde{\mathbf{x}}_j, \mathbf{s}_m)\|_1 \le \alpha_{ij}, \quad \forall (I_i, I_j) \in S_m, \quad (17)$$
$$\xi_{ij} \ge 0; \quad \alpha_{ij} \ge 0; \quad (18)$$
$$s_m^k \ge 0, \; 1 \le k \le K; \quad \mathbf{e} \cdot \mathbf{s}_m = 1, \quad (19)$$

where e = [1, . . . , 1]^T is a constant vector with all entries equal to 1. Note that the overall weight of all the parts is constrained to sum to one; i.e., skm ≥ 0, e · sm = 1, which ensures that all parts are used fairly. This is equivalent to constraining the L1-norm of sm to be 1 (i.e., L1-regularization), thus implicitly imposing sparsity on sm [22, 31]. This is desirable since usually only a few parts contribute towards determining the strength of a given attribute.
5.2.1 Solving the optimization problem
We solve OP3 in the primal form itself using a block coordinate descent algorithm. We consider the two sets of parameters wm and sm as two blocks, and optimize them in an alternating manner. In the beginning, we initialize all entries of wm to zero, and all entries of sm to 1/K.

First we fix sm to optimize wm. For a fixed sm, the problem becomes equivalent to OP2 (Eq. 11 to Eq. 14), and hence can be solved in the same manner using [3].

Then we fix wm to optimize sm. Let X̃i = [x̃1i, . . . , x̃Ki] ∈ R^(N1 × K) be a matrix formed by appending the features corresponding to all the parts of image Ii. Using this, we compute z̃im = X̃i^T wm ∈ R^K. This gives

$$\mathbf{w}_m \cdot \tilde{\Psi}_s(\tilde{\mathbf{x}}_i, \tilde{\mathbf{x}}_j, \mathbf{s}_m) = \mathbf{s}_m \cdot \tilde{\mathbf{z}}_{ijm}, \quad (20)$$
$$\tilde{\mathbf{z}}_{ijm} = \tilde{\mathbf{z}}_{im} - \tilde{\mathbf{z}}_{jm}. \quad (21)$$

Substituting this in OP3 leads to the following optimization problem for learning sm (for fixed wm):

$$\mathrm{OP4}: \min_{\mathbf{s}_m} \; C \Big( \sum_{(I_i, I_j) \in Q_m} \big( 1 - \mathbf{s}_m \cdot \tilde{\mathbf{z}}_{ijm} \big)^2 + \sum_{(I_i, I_j) \in S_m} \big\| \mathbf{s}_m \cdot \tilde{\mathbf{z}}_{ijm} \big\|_1^2 \Big) \quad (22)$$
$$\text{s.t.} \quad s_m^k \ge 0, \; 1 \le k \le K; \quad \mathbf{e} \cdot \mathbf{s}_m = 1, \quad (23)$$

where Qm ⊆ Om is the set of pairs that violate the margin constraint. Note that Qm is not fixed, and may change at every iteration. We solve OP4 using an iterative gradient descent and projection method similar to [34].
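The following sketch (our own illustration) shows the sm-update of this block coordinate descent: pairs are reduced to K-dimensional vectors z̃ijm via Eqs. 20-21, a gradient step is taken on the OP4 objective, and the iterate is projected back onto the simplex of Eq. 23. Plain gradient descent with a fixed step size stands in for the method of [34], and all names and hyper-parameters are assumptions; the wm-update (not shown) reuses the OP2 solver with sm held fixed:

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection onto {s : s >= 0, sum(s) = 1} (Eq. 23)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def optimize_s(z_ordered, z_similar, s, C=1.0, lr=0.01, iters=200):
    """Gradient descent + projection for OP4 (Eqs. 22-23), with w_m fixed.
    z_ordered, z_similar: (n_pairs, K) arrays whose rows are z_ijm (Eq. 21);
    s: current (K,) significance-coefficients."""
    for _ in range(iters):
        margins = z_ordered @ s
        Q = z_ordered[margins < 1.0]   # violating pairs; may change per iteration
        grad = np.zeros_like(s)
        if len(Q):
            grad -= 2.0 * C * (1.0 - Q @ s) @ Q            # ordered-pair term
        if len(z_similar):
            grad += 2.0 * C * (z_similar @ s) @ z_similar  # similar-pair term
        s = project_to_simplex(s - lr * grad)
    return s
```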

Figure 3. Input image (left), parts detected using [35] (middle),
and additional parts detected by us (right).
5.3. Computing Parts
The two joint representations as proposed in Sec. 4 are based on an ordered set of corresponding parts computed from a given pair of images. Given a method for computing such parts, our framework is applicable irrespective of the domain. This makes our framework domain adaptable.

In this work, we consider the domain of face images. To compute parts from a given face image, we use the method proposed in [35]. It is based on a mixture-of-trees model that learns a shared pool of facial parts. Given a face image, it computes a set of 68 parts covering facial landmarks such as eyes, eyebrows, nose, mouth and jawline. Figure 3 shows a face image (left) and its parts (middle) computed using this method. Though these parts can be used to represent several attributes such as "smiling", "eyes-open", etc., there are a few other attributes which are not covered by these parts, such as "bald-head", "visible-forehead" and "dark-hair". In order to cover these attributes as well, we compute additional parts using image-level statistics such as image-size and distance from the earlier 68 parts. This gives an extended set of 83 parts for a given face image. Figure 3 (right) shows this extended set of parts computed for the given image (left).
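Since a part is a block around a detected landmark (Sec. 1), part extraction can be sketched as below; the landmark detector itself is external (e.g., [35]), and the fixed 24-pixel block size is purely an assumption for illustration, as the paper does not specify it here:

```python
import numpy as np

def extract_part_patches(image, landmarks, patch=24):
    """Crop a fixed-size block around each detected landmark point.
    `image`: (H, W) or (H, W, 3) array; `landmarks`: (x, y) points from an
    external detector such as [35]. The 24-pixel block size is an
    assumption for illustration; the paper does not specify it here."""
    h, w = image.shape[:2]
    half = patch // 2
    patches = []
    for x, y in landmarks:
        x0 = int(np.clip(x - half, 0, w - patch))  # keep the block inside the image
        y0 = int(np.clip(y - half, 0, h - patch))
        patches.append(image[y0:y0 + patch, x0:x0 + patch])
    return patches
```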
5.4. Relation with Latent Models
In the last few years, latent models have become popular for several tasks, particularly for object detection [9]. These models usually look for characteristics (e.g., parts) that are shared within a category but distinctive across categories. (As discussed in Sec. 2, recent works such as [1, 5, 13] also have similar motivation, though they do not explicitly investigate the latent aspect.) Our work is similar to theirs in the sense that we also seek attribute-specific distinctive parts by incorporating significance-coefficients. However, in contrast to them, we require these parts to be shared across categories. This is because our ranking method uses these parts to learn attribute-specific models which are independent of the categories being depicted in training pairs.
6. Experiments
We compare the proposed method with that of [24] under different settings on two datasets. First is the PubFig-29 dataset as used in [26]. It consists of 60 face categories and 29 attributes, with attribute annotations being collected at category-level; i.e., using pairs of categories rather than pairs of images. Due to this, the annotations in this dataset are not consistent for several attributes (see Figure 4); e.g., Scarlett Johansson may not be smiling more than Hugh Laurie in all their images. To address this limitation, we have collected a new dataset using a subset of LFW [11] images. The new dataset has attribute-level annotations for 10000 image pairs and 10 attributes, and we call this the LFW-10 dataset. While collecting the annotations, we particularly ignore the category information, thus making it more suitable for the task of learning relative attributes. The details of this dataset are described next.

Figure 4. Example pairs and their ground-truth annotations from the PubFig-29 dataset. Due to category-level annotations, there exist inconsistencies in (true) instance-level attribute visibility.
Figure 5. Example pairs from the LFW-10 dataset. The images exhibit high diversity in terms of age, pose, lighting, occlusion, etc.
6.1. LFW-10 Dataset
We randomly select 2000 images from the LFW dataset [11]. Out of these, 1000 images are used for creating training pairs and the remaining (unseen) 1000 for testing pairs. The annotations are collected for 10 attributes, with 500 training and testing pairs per attribute. In order to minimize the chances of inconsistency in the dataset, each image pair is annotated by 5 trained annotators, and the final annotation is decided based on majority voting. Figure 5 shows example pairs from this dataset.
6.2. Features for Parts
We represent each part using a Bag of Words (BoW) histogram over dense SIFT (DSIFT) [20] features. We consider two settings for learning the visual-word vocabulary: (1) In the first setting, we learn a part-specific vocabulary for every part. This is possible since our parts are fixed and known. In practice, we learn a vocabulary of 100 visual words for each part. This gives an 8300-dimensional (= 83 parts × 100) (sparse) feature vector per part. (2) In the second setting, we learn a single vocabulary of 100 visual words for all the parts. This again results in an 8300-dimensional (sparse) feature vector per part.
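A small sketch of this per-part BoW pipeline, assuming dense SIFT descriptors have already been extracted from each part's region; plain k-means is our assumption for vocabulary learning, since the paper fixes only the vocabulary size (100 words) and not the clustering algorithm:

```python
import numpy as np

def learn_vocabulary(descriptors, V=100, iters=20, seed=0):
    """Plain k-means as a stand-in for vocabulary learning: the paper fixes
    a 100-word vocabulary (per part, or shared across parts) but not the
    clustering algorithm, so k-means is our assumption."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), V, replace=False)].astype(float)
    for _ in range(iters):
        d2 = ((descriptors[:, None] - centers[None]) ** 2).sum(-1)
        assign = d2.argmin(1)
        for v in range(V):
            members = descriptors[assign == v]
            if len(members):
                centers[v] = members.mean(0)
    return centers

def bow_histogram(descriptors, vocabulary):
    """Normalized BoW histogram of a part's dense-SIFT descriptors: assign
    each descriptor to its nearest visual word and count."""
    d2 = ((descriptors[:, None] - vocabulary[None]) ** 2).sum(-1)
    hist = np.bincount(d2.argmin(1), minlength=len(vocabulary)).astype(float)
    return hist / max(hist.sum(), 1.0)
```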

Citations
Journal ArticleDOI
TL;DR: In this article, a joint multi-task learning algorithm is proposed to better predict attributes in images using deep convolutional neural networks (CNN), where each CNN will predict one binary attribute.
Abstract: This paper proposes a joint multi-task learning algorithm to better predict attributes in images using deep convolutional neural networks (CNN). We consider learning binary semantic attributes through a multi-task CNN model, where each CNN will predict one binary attribute. The multi-task learning allows CNN models to simultaneously share visual knowledge among different attribute categories. Each CNN will generate attribute-specific feature representations, and then we apply multi-task learning on the features to predict their attributes. In our multi-task framework, we propose a method to decompose the overall model’s parameters into a latent task matrix and combination matrix. Furthermore, under-sampled classifiers can leverage shared statistics from other classifiers to improve their performance. Natural grouping of attributes is applied such that attributes in the same group are encouraged to share more knowledge. Meanwhile, attributes in different groups will generally compete with each other, and consequently share less knowledge. We show the effectiveness of our method on two popular attribute datasets.

255 citations

Proceedings ArticleDOI
27 Jun 2016
TL;DR: This work provides a novel perspective to attribute detection and proposes to gear the techniques in multi-source domain generalization for the purpose of learning cross-category generalizable attribute detectors.
Abstract: Attributes possess appealing properties and benefit many computer vision problems, such as object recognition, learning with humans in the loop, and image retrieval. Whereas the existing work mainly pursues utilizing attributes for various computer vision problems, we contend that the most basic problem—how to accurately and robustly detect attributes from images—has been left under explored. Especially, the existing work rarely explicitly tackles the need that attribute detectors should generalize well across different categories, including those previously unseen. Noting that this is analogous to the objective of multi-source domain generalization, if we treat each category as a domain, we provide a novel perspective to attribute detection and propose to gear the techniques in multi-source domain generalization for the purpose of learning cross-category generalizable attribute detectors. We validate our understanding and approach with extensive experiments on four challenging datasets and three different problems.

169 citations


Cites background from "Relative Parts: Distinctive Parts f..."

  • ..., tails of mammals) [4, 6, 37, 3, 83, 59, 14], and the relationship between attributes and categories [79, 48, 32, 54]....

BookDOI
01 Jan 2017
TL;DR: This chapter gives an overview of domain adaptation and transfer learning with a specific view to visual applications and reviews DA methods that go beyond image categorization, such as object detection, image segmentation, video analyses or learning visual attributes.
Abstract: The aim of this chapter is to give an overview of domain adaptation and transfer learning with a specific view to visual applications. After a general motivation, we first position domain adaptation in the more general transfer learning problem. Second, we try to address and analyze briefly the state-of-the-art methods for different types of scenarios, first describing the historical shallow methods, addressing both the homogeneous and heterogeneous domain adaptation methods. Third, we discuss the effect of the success of deep convolutional architectures which led to the new type of domain adaptation methods that integrate the adaptation within the deep architecture. Fourth, we review DA methods that go beyond image categorization, such as object detection, image segmentation, video analyses or learning visual attributes. We conclude the chapter with a section where we relate domain adaptation to other machine learning solutions.

169 citations

Proceedings ArticleDOI
01 Oct 2017
TL;DR: The authors propose to overcome the sparsity of supervision problem via synthetically generated images by augmenting real training image pairs with these examples, then train attribute ranking models to predict the relative strength of an attribute in novel pairs of real images.
Abstract: Distinguishing subtle differences in attributes is valuable, yet learning to make visual comparisons remains nontrivial. Not only is the number of possible comparisons quadratic in the number of training images, but also access to images adequately spanning the space of fine-grained visual differences is limited. We propose to overcome the sparsity of supervision problem via synthetically generated images. Building on a state-of-the-art image generation engine, we sample pairs of training images exhibiting slight modifications of individual attributes. Augmenting real training image pairs with these examples, we then train attribute ranking models to predict the relative strength of an attribute in novel pairs of real images. Our results on datasets of faces and fashion images show the great promise of bootstrapping imperfect image generators to counteract sample sparsity for learning to rank.

131 citations

Posted Content
TL;DR: An end-to-end deep convolutional network to simultaneously localize and rank relative visual attributes, given only weakly-supervised pairwise image comparisons is proposed.
Abstract: We propose an end-to-end deep convolutional network to simultaneously localize and rank relative visual attributes, given only weakly-supervised pairwise image comparisons. Unlike previous methods, our network jointly learns the attribute's features, localization, and ranker. The localization module of our network discovers the most informative image region for the attribute, which is then used by the ranking module to learn a ranking model of the attribute. Our end-to-end framework also significantly speeds up processing and is much faster than previous methods. We show state-of-the-art ranking results on various relative attribute datasets, and our qualitative localization results clearly demonstrate our network's ability to learn meaningful image patches.

67 citations


Cites background or methods from "Relative Parts: Distinctive Parts f..."

  • ...The state-of-the-art method of Xiao and Lee [14] outperforms the baseline method of [13] because it automatically discovers the relevant regions of an attribute without relying on pretrained keypoint detectors whose detected parts may be irrelevant to the attribute....

  • ...Since that method has to process a sequence of time-consuming modules including feature extraction, nearest neighbor matching, iterative SVM classifier training to build the visual chains, and training an SVM ranker to rank the chains, it takes ∼10 hours to train one attribute model on LFW-10 using a cluster of 20 CPU nodes with 2 cores each....

  • ...Attribute ranking accuracy on LFW-10....

  • ...5, we show the results for the face attributes on the LFW-10 test images....

  • ...We use the same train-test split used in [13]....

References
Journal ArticleDOI
TL;DR: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene and can robustly identify objects among clutter and occlusion while achieving near real-time performance.
Abstract: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.

46,906 citations

Journal ArticleDOI
TL;DR: A new method for estimation in linear models called the lasso, which minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant, is proposed.
Abstract: SUMMARY We propose a new method for estimation in linear models. The 'lasso' minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant. Because of the nature of this constraint it tends to produce some coefficients that are exactly 0 and hence gives interpretable models. Our simulation studies suggest that the lasso enjoys some of the favourable properties of both subset selection and ridge regression. It produces interpretable models like subset selection and exhibits the stability of ridge regression. There is also an interesting relationship with recent work in adaptive function estimation by Donoho and Johnstone. The lasso idea is quite general and can be applied in a variety of statistical models: extensions to generalized regression models and tree-based models are briefly described.

40,785 citations


"Relative Parts: Distinctive Parts f..." refers background in this paper

  • ..., L1-regularization), thus implicitly imposing sparsity on sm [22, 31]....

01 Jan 2011
TL;DR: The Scale-Invariant Feature Transform (or SIFT) algorithm is a highly robust method to extract and consequently match distinctive invariant features from images that can then be used to reliably match objects in diering images.
Abstract: The Scale-Invariant Feature Transform (or SIFT) algorithm is a highly robust method to extract and consequently match distinctive invariant features from images. These features can then be used to reliably match objects in differing images. The algorithm was first proposed by Lowe [12] and further developed to increase performance resulting in the classic paper [13] that served as foundation for SIFT which has played an important role in robotic and machine vision in the past decade.

14,708 citations


"Relative Parts: Distinctive Parts f..." refers methods in this paper

  • ...We represent each part using a Bag of Words (BoW) histogram over dense SIFT (DSIFT) [20] features....

  • ...We use four features for comparison: (i) BoW histogram over DSIFT features with 1000 visual words, (ii) global 512-dimensional GIST descriptor [23], (iii) global 512-dimensional GIST and 30-dimensional RGB histogram (which was also used in [24]), and (iv) spatial pyramid (SPM) [19] up to two and three levels using DSIFT features and the same vocabulary as in (i)....

  • ...This again results into a 8300- | Method, Accuracy: Global DSIFT + RSVM [24], 61.28; Global GIST + RGB + RSVM [24], 59.18; SPM (up to 2 levels) + RSVM [24], 49.60; SPM (up to 3 levels) + RSVM [24], 49.17; Unweighted parts + Part-specific vocab....

  • ...Method, Accuracy: Global DSIFT + RSVM [24], 64.61; Global GIST + RSVM [24], 68.89; Global GIST + RGB + RSVM [24], 69.89; SPM (up to 2 levels) + RSVM [24], 50.73; SPM (up to 3 levels) + RSVM [24], 50.01; Human selected parts + Part-specific vocab....

Journal ArticleDOI
TL;DR: An object detection system based on mixtures of multiscale deformable part models that is able to represent highly variable object classes and achieves state-of-the-art results in the PASCAL object detection challenges is described.
Abstract: We describe an object detection system based on mixtures of multiscale deformable part models. Our system is able to represent highly variable object classes and achieves state-of-the-art results in the PASCAL object detection challenges. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL data sets. Our system relies on new methods for discriminative training with partially labeled data. We combine a margin-sensitive approach for data-mining hard negative examples with a formalism we call latent SVM. A latent SVM is a reformulation of MI--SVM in terms of latent variables. A latent SVM is semiconvex, and the training problem becomes convex once latent information is specified for the positive examples. This leads to an iterative training algorithm that alternates between fixing latent values for positive examples and optimizing the latent SVM objective function.

10,501 citations


"Relative Parts: Distinctive Parts f..." refers background in this paper

  • ...In the last few years, latent models have become popular for several tasks, particularly for object detection [9]....

Proceedings ArticleDOI
17 Jun 2006
TL;DR: This paper presents a method for recognizing scene categories based on approximate global geometric correspondence that exceeds the state of the art on the Caltech-101 database and achieves high accuracy on a large database of fifteen natural scene categories.
Abstract: This paper presents a method for recognizing scene categories based on approximate global geometric correspondence. This technique works by partitioning the image into increasingly fine sub-regions and computing histograms of local features found inside each sub-region. The resulting "spatial pyramid" is a simple and computationally efficient extension of an orderless bag-of-features image representation, and it shows significantly improved performance on challenging scene categorization tasks. Specifically, our proposed method exceeds the state of the art on the Caltech-101 database and achieves high accuracy on a large database of fifteen natural scene categories. The spatial pyramid framework also offers insights into the success of several recently proposed image descriptions, including Torralba’s "gist" and Lowe’s SIFT descriptors.

8,736 citations


"Relative Parts: Distinctive Parts f..." refers methods in this paper

  • ...We use four features for comparison: (i) BoW histogram over DSIFT features with 1000 visual words, (ii) global 512-dimensional GIST descriptor [23], (iii) global 512-dimensional GIST and 30-dimensional RGB histogram (which was also used in [24]), and (iv) spatial pyramid (SPM) [19] up to two and three levels using DSIFT features and the same vocabulary as in (i)....

  • ...The performance for SPM is comparable to chance accuracy....

  • ...Method, Accuracy: Global DSIFT + RSVM [24], 64.61; Global GIST + RSVM [24], 68.89; Global GIST + RGB + RSVM [24], 69.89; SPM (up to 2 levels) + RSVM [24], 50.73; SPM (up to 3 levels) + RSVM [24], 50.01; Human selected parts + Part-specific vocab....

  • ...We use four features for comparison: (i) BoW histogram over DSIFT features with 1000 visual words, (ii) global 512-dimensional GIST descriptor [23], (iii) global 512-dimensional GIST and 30-dimensional RGB histogram (which was also used in [24]), and (iv) spatial pyramid (SPM) [19] up to two and three levels using DSIFT features and the same vocabulary as in (i)....

  • ...This again results into a 8300- | Method, Accuracy: Global DSIFT + RSVM [24], 61.28; Global GIST + RGB + RSVM [24], 59.18; SPM (up to 2 levels) + RSVM [24], 49.60; SPM (up to 3 levels) + RSVM [24], 49.17; Unweighted parts + Part-specific vocab....

Frequently Asked Questions (11)
Q1. What have the authors contributed in "Relative parts: distinctive parts for learning relative attributes" ?

The notion of relative attributes as introduced by Parikh and Grauman (ICCV, 2011) provides an appealing way of comparing two images based on their visual properties (or attributes) such as "smiling" for face images, "naturalness" for outdoor images, etc. In this paper, the authors extend this idea towards learning relative attributes using local parts that are shared across categories. First, instead of using a global representation, the authors introduce a part-based representation combining a pair of images that specifically compares corresponding parts. Then, with each part the authors associate a locally adaptive "significance-coefficient" that represents its discriminative ability with respect to a particular attribute.

In order to minimize the chances of inconsistency in the dataset, each image pair is annotated by 5 trained annotators, and the final annotation is decided based on majority voting.

Attributes have also been used for multiple-query image search [30], where input attributes along with other related attributes are used in a structured-prediction based model. 

In [29], a semi-supervised constrained bootstrapping approach is proposed that tries to benefit from inter-class attribute-based relationships to avoid semantic drift during the learning process. 

In order to learn wm, the following constraints need to be satisfied: wm · Ψ(xi, xj) > 0 ∀(Ii, Ij) ∈ Om (3); wm · Ψ(xi, xj) = 0 ∀(Ii, Ij) ∈ Sm (4). Since this is an NP-hard problem, its relaxed version is solved by introducing slack variables.

For the weighted part-based joint representation in Eq. 10, the authors need to learn two sets of parameters corresponding to every attribute: ranking model wm, and significancecoefficients sm. 

For a given attribute am, all the selected parts are assigned equal weights and the remaining parts are assigned zero weight, and then a ranking model wm is learned based on these part weights. 

In [24], given a set of pairs of images depicting similar and/or different strengths of some particular attribute, the problem of learning a relative attribute classifier is posed as one of learning a ranking model for that attribute similar to Ranking SVM [12]. 

One possible reason for this could be that using vocabularies learned individually for each part results in less confusion than using a single vocabulary learned using all the parts.

In order to learn a ranking model based on the part-based representation in Eq. 9, the authors optimize the following problem: OP2: min over wm of (1/2)||wm||² + Cm(Σ ξij² + Σ αij²) (11), s.t. wm · Ψ̃(x̃i, x̃j) ≥ 1 − ξij ∀(Ii, Ij) ∈ Om (12); ||wm · Ψ̃(x̃i, x̃j)||₁ ≤ αij ∀(Ii, Ij) ∈ Sm (13); ξij ≥ 0, αij ≥ 0 (14). This is similar to OP1, except that now the authors use the part-based representation instead of the global representation.

This is because their ranking method uses these parts to learn attribute-specific models which are independent of categories being depicted in training pairs.