
Relative Parts: Distinctive Parts for Learning Relative Attributes

TL;DR: This paper introduces a part-based joint representation of an image pair that specifically compares corresponding parts, and associates with each part a locally adaptive "significance-coefficient" that represents its discriminative ability with respect to a particular attribute.
Abstract: The notion of relative attributes as introduced by Parikh and Grauman (ICCV, 2011) provides an appealing way of comparing two images based on their visual properties (or attributes) such as "smiling" for face images, "naturalness" for outdoor images, etc. For learning such attributes, a Ranking SVM based formulation was proposed that uses globally represented pairs of annotated images. In this paper, we extend this idea towards learning relative attributes using local parts that are shared across categories. First, instead of using a global representation, we introduce a part-based representation combining a pair of images that specifically compares corresponding parts. Then, with each part we associate a locally adaptive "significance-coefficient" that represents its discriminative ability with respect to a particular attribute. For each attribute, the significance-coefficients are learned simultaneously with a max-margin ranking model in an iterative manner. Compared to the baseline method, the new method is shown to achieve significant improvement in relative attribute prediction accuracy. Additionally, it is also shown to improve relative feedback based interactive image search.

Summary (5 min read)

1. Introduction

  • Visual attributes (or simply attributes) are perceptual properties that can be used to describe an entity (“pointed nose”), an object (“furry sheep”), or a scene (“natural outdoor”).
  • This led to the notion of “relative attributes”, where the strength of an attribute in a given image can be described with respect to some other image/category; e.g. “given face is less chubby than person A and more chubby than person B”.
  • Next, the authors update this part-based representation by additionally learning weights corresponding to each part that denote their contribution towards predicting the strength of a given attribute.
  • The authors compare the baseline method of [24] with the proposed method under various settings.
  • In Sec. 3, the authors discuss the method of [24] for learning relative attribute ranking models.

3. Preliminaries

  • In [24], a Ranking SVM based method was used for learning relative attribute classifiers.
  • Ranking SVM [12] is a max-margin ranking framework that learns linear models to perform pairwise comparisons.
  • This is conceptually different from the conventional one-vs-rest SVM that learns a model using individual samples rather than pairs.
  • Though SVM scores can also be used to perform pairwise comparisons, Ranking SVM has usually been known to perform better than SVM for such tasks.
  • In [24] as well, Ranking SVM was shown to perform better than SVM on the task of relative attribute prediction.

3.1. The Ranking SVM Model

  • Sm = {(Ii, Ij)} is such that both Ii and Ij have nearly the same strength of attribute am.
  • Using Dm, the goal is to learn a ranking function fm that, given a new pair of images Ip and Iq represented by xp and xq respectively, predicts which image has greater strength of attribute am.
  • Using fm, the authors determine which image has higher strength for attribute am based on ympq = sign(fm(xp,xq; wm)).
  • Note that along with pairwise constraints as in [12], the optimization problem now also includes similarity constraints.
  • This is solved in the primal form itself using Newton’s method [3].

4. Proposed Representations

  • The Ranking SVM method discussed above uses a joint representation based on globally computed features (Eq. 2) while determining the strength of some given attribute.
  • Several attributes such as "visible-teeth", "eyes-open", etc. are not representative of the whole image, and correspond only to some specific regions/parts.
  • This means there exists a weak association between an image and its attribute label.
  • This inspires us to build a representation that (i) encodes part/region-specific features, without confusing across parts; and (ii) explicitly encodes the relative significance of each part with respect to a given attribute.
  • Next the authors propose two part-based joint-representations for the task of learning relative attribute classifiers.

4.1. Part-based Joint Representation

  • These parts can be obtained using a domain-specific method; e.g., the method discussed in [35] can be used for determining a set of localized parts in face images.
  • Here, N1 = K × d1 such that each x̃k is a sparse vector with only d1 non-zero entries in the kth interval representing part pk.
  • The advantage of this representation is that it specifically encodes correspondence among parts; i.e., now the kth part of Ip is compared with just the kth part of Iq .
  • The assumption here is that such a direct comparison between localized pairs of parts would provide stronger cues for learning relative attribute models than using a single global representation as in Eq. 2.
  • (This assumption is also validated by improvements in prediction accuracy, as discussed in Sec. 6.)

4.2. Weighted Part-based Joint Representation

  • Though the joint representation proposed in the previous section allows direct part-based comparison between a pair of images, it does not provide information about which parts actually symbolize some given attribute.
  • As discussed in Sec. 4.1, let each image I be represented by a set of K parts.
  • Additionally, let skm ∈ [0, 1] be a weight associated with the kth part.
  • This weight denotes the relative importance of the kth part compared to other parts for predicting the strength of attribute am; i.e., the larger the weight, the more important that part, and vice-versa.
  • These help in explicitly encoding the relative importance of individual parts in the joint representation.

5. Parameter Learning

  • Now the authors discuss how to learn the parameters for each attribute using the two joint representations discussed above.
  • Note that the authors still need to satisfy the constraints as in Eq. 3 and Eq. 4 depending upon the representation followed.

5.1. For Part-based Joint Representation

  • This is similar to OP1, except that now the authors use part-based representation instead of global representation.
  • This allows us to use the same Newton’s method [3] for solving OP2.

5.2. For Weighted Part-based Joint Representation

  • For the weighted part-based joint representation in Eq. 10, the authors need to learn two sets of parameters corresponding to every attribute: the ranking model wm and the significance-coefficients sm.
  • Note that the overall weight of all the parts is constrained to sum to one; i.e., skm ≥ 0, e · sm = 1, which ensures that all parts are used fairly.
  • This is desirable since usually only a few parts contribute towards determining the strength of a given attribute.

5.2.1 Solving the optimization problem

  • The authors solve OP3 in the primal form itself using a block coordinate descent algorithm.
  • The authors consider each set of parameters wm and sm as two blocks, and optimize them in an alternate manner.
  • Qm ⊆ Om is the set of pairs that violate the margin constraint.
  • Note that Qm is not fixed, and may change at every iteration.
  • The authors solve OP4 using an iterative gradient descent and projection method similar to [34].

5.3. Computing Parts

  • The two joint representations as proposed in Sec. 4 are based on an ordered set of corresponding parts computed from a given pair of images.
  • Given a method for computing such parts, their framework is applicable irrespective of the domain.
  • To compute parts from a given face image, the authors use the method proposed in [35].
  • Though these parts can be used to represent several attributes such as "smiling", "eyes-open", etc., there are a few other attributes that are not covered by these parts, such as "bald-head", "visible-forehead" and "dark-hair".

5.4. Relation with Latent Models

  • In the last few years, latent models have become popular for several tasks, particularly for object detection [9].
  • These models usually look for characteristics (e.g., parts) that are shared within a category but distinctive across categories.
  • (As discussed in Sec. 2, recent works such as [1, 5, 13] also have similar motivation, though they do not explicitly investigate the latent aspect.).
  • The authors' work is similar to theirs in the sense that they also seek attribute-specific distinctive parts by incorporating significance-coefficients.
  • However, unlike those works, the authors require these parts to be shared across categories; this is because their ranking method uses these parts to learn attribute-specific models which are independent of the categories being depicted in training pairs.

6. Experiments

  • The authors compare the proposed method with that of [24] under different settings on two datasets.
  • To address this limitation, the authors have collected a new dataset using a subset of LFW [11] images.
  • The new dataset has attribute-level annotations for 10000 image pairs and 10 attributes, and the authors call this the LFW-10 dataset.
  • While collecting the annotations, the authors particularly ignore the category information, thus making it more suitable for the task of learning relative attributes.
  • The details of this dataset are described next.

6.1. LFW-10 Dataset

  • Out of these, 1000 images are used for creating training pairs and the remaining 1000 for testing pairs.
  • The annotations are collected for 10 attributes, with 500 training and testing pairs per attribute.
  • In order to minimize the chances of inconsistency in the dataset, each image pair is annotated by 5 trained annotators, and the final annotation is decided based on majority voting.
  • Figure 5 shows example pairs from this dataset.

6.2. Features for Parts

  • The authors represent each part using a Bag of Words (BoW) histogram over dense SIFT [20] features.
  • The authors consider two settings for learning visual-word vocabulary: (1) In the first setting, they learn a part-specific vocabulary for every part.
  • This is possible since their parts are fixed and known.
  • In practice, the authors learn a vocabulary of 100 visual words for each part.
  • (2) In the second setting, the authors learn a single vocabulary of 100 visual words for all the parts.

6.3. Baselines

  • The authors compare with the Ranking SVM method of [24] using the code provided by its authors.
  • The authors use four features for comparison: (i) BoW histogram over DSIFT features with 1000 visual words, (ii) global 512-dimensional GIST descriptor [23], (iii) global 512-dimensional GIST and 30-dimensional RGB histogram (which was also used in [24]), and (iv) spatial pyramid (SPM) [19] up to two and three levels using DSIFT features and the same vocabulary as in (i).
  • As another baseline, the authors compare the quality of their part-learning framework (Sec. 5.2) against human-selected parts.
  • For this, the authors asked a human expert to select a subset of the most representative parts corresponding to every attribute.
  • For a given attribute am, all the selected parts are assigned equal weights and the remaining parts are assigned zero weight, and then a ranking model wm is learned based on these part weights (see the sketch below).
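A minimal numpy sketch of this human-selected-parts baseline, assuming K = 83 parts as in the paper; the function name and the example part indices are hypothetical, and normalizing the equal weights to sum to one mirrors the simplex constraint placed on sm during learning:

```python
import numpy as np

def human_baseline_weights(selected, K=83):
    """Equal weights on expert-selected parts, zero elsewhere (hypothetical
    helper). Weights are normalized to sum to 1, matching the simplex
    constraint placed on s_m during learning."""
    s = np.zeros(K)
    s[sorted(selected)] = 1.0 / len(selected)
    return s

# Hypothetical mouth-region part indices for an attribute like "smiling":
s_m = human_baseline_weights({48, 51, 54, 57})
```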

6.4. Results

  • Table 1 compares different methods on the PubFig-29 dataset.
  • The consistent gains over the global baselines clearly validate the significance of these part-based representations for learning relative attribute models.
  • Using a part-specific vocabulary performs better than using a single shared vocabulary.
  • Figure 7 shows the top ten learned parts with the highest significance-coefficients for all ten attributes in the LFW-10 dataset.
  • Also, the performance of their method closely matches that obtained using human-selected parts, thus demonstrating its effectiveness.

7. Conclusion

  • Inspired by the success of relative attributes, the authors have presented a novel method that learns relative attribute models using local parts that are shared across categories.
  • The authors' method achieves significant improvements compared to the baseline method.
  • Apart from this, the part-specific weights learned using their method also provide a semantic interpretation of different parts for diverse attributes.


Relative Parts: Distinctive Parts for Learning Relative Attributes
Ramachandruni N. Sandeep, Yashaswi Verma, C. V. Jawahar
Center for Visual Information Technology, IIIT Hyderabad, India - 500032
1. Introduction

Visual attributes (or simply attributes) are perceptual properties that can be used to describe an entity ("pointed nose"), an object ("furry sheep"), or a scene ("natural outdoor"). These act as mid-level representations that are comprehensible for both human as well as machine, thus providing a strong means of filling-up the so-called semantic-gap. Attributes have recently been used as a source of semantic cues in diverse tasks such as object recognition [17, 18], image description [24], learning unseen object categories (or zero-shot learning) [18], etc. While most of these works have focused on binary attributes (indicating presence or absence of some visual property), Parikh and Grauman [24] proposed that it is more natural to consider the strength of an attribute rather than its absolute presence/absence. This led to the notion of "relative attributes", where the strength of an attribute in a given image can be described with respect to some other image/category; e.g., "given face is less chubby than person A and more chubby than person B". In [24], given a set of pairs of images depicting similar and/or different strengths of some particular attribute, the problem of learning a relative attribute classifier is posed as one of learning a ranking model for that attribute similar to Ranking SVM [12].

In this work, we build upon this idea by learning relative attribute models using local parts that are shared across categories. First, we propose a part-based representation that jointly represents a pair of images. A part corresponds to a block around a landmark point detected using a domain-specific method. This representation explicitly encodes correspondences among parts, thus better capturing minute differences in parts that make an attribute more prominent in one image than another, as compared to a global representation as in [24]. Next, we update this part-based representation by additionally learning weights corresponding to each part that denote their contribution towards predicting the strength of a given attribute. We call these weights the "significance-coefficients" of parts. For each attribute, the significance-coefficients are learned in a discriminative manner simultaneously with a max-margin ranking model. Thus, the best parts for predicting the relative attribute "more smiling" will be different from those for predicting "more eyes-open". The steps of the proposed method are illustrated in Figure 1. While the notion of parts is not new, we believe that ours is the first attempt that explores the applicability of parts in a ranking scenario, and for learning relative attribute ranking models in particular.

We compare the baseline method of [24] with the proposed method under various settings. For this, we have collected a new dataset of 10000 pairwise attribute-level annotations using images from the "Labeled Faces in the Wild" (LFW) dataset [11], particularly focusing on (i) large variety among samples in terms of poses, lighting conditions, etc., and (ii) completely ignoring the category information while collecting attribute annotations. Extensive experiments demonstrate that the new method significantly improves the prediction accuracy as compared to the baseline method. Moreover, the learned parts also compare favorably with human-selected parts, thus indicating the intrinsic capacity of the proposed framework for learning attribute-specific semantic parts.

Figure 1. Given an ordered pair of images, first we detect parts corresponding to different (facial) landmarks. Using these, a joint pairwise part-based representation is formed that encodes (i) correspondence among different parts, and (ii) relative importance of each part for a given attribute. Using this, a max-margin ranking model w is learned simultaneously with part weights s (red blocks) in an iterative manner.

The paper is organized as follows. In Sec. 2, we give an overview of some of the recent works based on attributes and relative attributes. In Sec. 3, we discuss the method of [24] for learning relative attribute ranking models. Then we present the new part-based representations in Sec. 4, followed by an algorithm for learning the model variables in Sec. 5. Experiments and results are discussed in Sec. 6, and finally we conclude in Sec. 7.
2. Related Works

As discussed earlier, attributes are properties that are understandable by both human as well as machine. Because of this, attributes have recently gained significant popularity among several vision applications, where attribute identification is not the final goal but just an intermediate step. In [8], objects are described using their attributes; e.g., instead of classifying an image as that of a "sheep", it is described based on its properties such as "has horn", "has wool", etc. This helps in describing even those objects which have few or no examples during the training phase. A similar idea is used in [7, 18], where attribute-based feedback is used for unseen category recognition. Attribute-based feedback has been shown to be useful for anomaly detection [28] within an object category, and for adding unlabeled samples for category classifier learning [4]. Attributes have also been used for multiple-query image search [30], where input attributes along with other related attributes are used in a structured-prediction based model. Along with various applications, attributes have been used in several mid-level tasks. These include identification of color/texture [10], specific objects such as faces [17], and general object categories [18, 33]. In some cases, since it might not be possible to learn discriminative attributes from individual images, in [21], pairs of images are used to learn such attributes based on human feedback.

While most of the above methods have focused on presence/absence of some attribute, in [24] the notion of relative attributes was introduced. In this, two images are compared based on the relative strength of some given attribute, thus providing a semantically richer way of describing the visual world than using binary attributes. Since then, relative attributes have been used in several applications, such as customized image search [15, 16], where a user can interactively describe and refine visual properties while searching for some specific object. This has been further extended in recent works [14, 25]. In [14], generic attribute models are learned that can adapt to different users' preferences. In [25], novel features are introduced based on the user's implied feedback, which subsequently help in improving search performance. In [26], an active learning framework based on relative attribute feedback is proposed. Here, the teacher (human) not only corrects an incorrect prediction made by the learner (machine), but also tells why the prediction is incorrect using attribute-based feedback. This helps the learner in propagating this understanding among other examples, which subsequently improves the learning process. This idea is extended in [2], where the learner learns attribute classifiers along with category classifiers. In [29], a semi-supervised constrained bootstrapping approach is proposed that tries to benefit from inter-class attribute-based relationships to avoid semantic drift during the learning process. In [32], a novel framework for predicting relative dominance among attributes within an image is proposed. In [27], rather than using either binary or relative attributes, their interactions are modeled to better describe images.

Our work closely relates with recent works [1, 5, 6, 13] that use distinctive part/region-based representations for scene classification [13] or fine-grained classification [1, 5, 6]. However, rather than identifying category-specific distinctive parts, our aim is to compare similar parts that are shared across categories. This makes our problem somewhat more challenging, since our representation is expected to capture small relative differences in the appearance of semantically similar parts, which contribute in making some attribute prominent in one image than another.
3. Preliminaries
In [24], a Ranking SVM based method was used for learning relative attribute classifiers. Ranking SVM [12] is a max-margin ranking framework that learns linear models to perform pairwise comparisons. This is conceptually different from the conventional one-vs-rest SVM, which learns a model using individual samples rather than pairs. Though SVM scores can also be used to perform pairwise comparisons, Ranking SVM has usually been known to perform better than SVM for such tasks. In [24] as well, Ranking SVM was shown to perform better than SVM on the task of relative attribute prediction. We now briefly discuss the method used in [24] for learning relative attribute classifiers.
3.1. The Ranking SVM Model
Let I = {I1, . . . , In} be a collection of n images. Each image Ii is represented by a global feature vector xi ∈ R^N. Suppose we have a fixed set of attributes A = {am}. For each attribute am ∈ A, we are given a set Dm = Om ∪ Sm consisting of ordered pairs of images. Here, Om = {(Ii, Ij)} is such that image Ii has more strength of attribute am than image Ij; and Sm = {(Ii, Ij)} is such that both Ii and Ij have nearly the same strength of attribute am. Using Dm, the goal is to learn a ranking function fm that, given a new pair of images Ip and Iq represented by xp and xq respectively, predicts which image has greater strength of attribute am. Under the assumption that fm is a linear function of xp and xq, it is defined as:

$$f_m(\mathbf{x}_p, \mathbf{x}_q; \mathbf{w}_m) = \mathbf{w}_m \cdot \Psi(\mathbf{x}_p, \mathbf{x}_q), \quad (1)$$
$$\Psi(\mathbf{x}_p, \mathbf{x}_q) = \mathbf{x}_p - \mathbf{x}_q. \quad (2)$$

Here, wm is the parameter vector for attribute am, and Ψ(xp, xq) is a joint representation formed using xp and xq. Using fm, we determine which image has higher strength for attribute am based on ympq = sign(fm(xp, xq; wm)): ympq = 1 means Ip has higher strength of am than Iq, and ympq = −1 means otherwise. In order to learn wm, the following constraints need to be satisfied:

$$\mathbf{w}_m \cdot \Psi(\mathbf{x}_i, \mathbf{x}_j) > 0 \quad \forall (I_i, I_j) \in O_m, \quad (3)$$
$$\mathbf{w}_m \cdot \Psi(\mathbf{x}_i, \mathbf{x}_j) = 0 \quad \forall (I_i, I_j) \in S_m. \quad (4)$$

Since this is an NP-hard problem, its relaxed version is solved by introducing slack variables. This leads to the following optimization problem (OP1):

$$\mathrm{OP1}: \min_{\mathbf{w}_m} \; \frac{1}{2}\|\mathbf{w}_m\|_2^2 + C_m \Big( \sum \xi_{ij}^2 + \sum \alpha_{ij}^2 \Big) \quad (5)$$
$$\text{s.t.} \quad \mathbf{w}_m \cdot \Psi(\mathbf{x}_i, \mathbf{x}_j) \ge 1 - \xi_{ij}, \quad \forall (I_i, I_j) \in O_m, \quad (6)$$
$$\|\mathbf{w}_m \cdot \Psi(\mathbf{x}_i, \mathbf{x}_j)\|_1 \le \alpha_{ij}, \quad \forall (I_i, I_j) \in S_m, \quad (7)$$
$$\xi_{ij} \ge 0; \quad \alpha_{ij} \ge 0. \quad (8)$$
Here, $\|\cdot\|_2^2$ denotes the squared L2 norm, $\|\cdot\|_1$ denotes the L1 norm, and Cm > 0 is a constant that controls the trade-off between the regularization and loss terms. Note that along with the pairwise constraints as in [12], the optimization problem now also includes similarity constraints. It is solved in the primal form itself using Newton's method [3].

Figure 2. Given an input image (left), the parts that correspond to "visible-teeth" (middle) and "eyes-open" (right).
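To make the pairwise formulation concrete, here is a minimal numpy sketch of Eq. 1, Eq. 2 and the prediction rule ympq = sign(wm · (xp − xq)). The plain gradient-descent loop on the squared-slack losses of OP1 is our own simplified stand-in for the Newton solver of [3]; the function names, learning rate, and iteration count are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def psi(x_p, x_q):
    """Joint pairwise representation of Eq. 2: the feature difference."""
    return x_p - x_q

def train_ranker(ordered, similar, dim, C=1.0, lr=0.01, iters=500):
    """Learn w_m from ordered pairs (x_i stronger than x_j) and similar
    pairs, using the squared-slack losses of OP1 (Eqs. 5-8). Plain gradient
    descent here; the paper solves the primal with Newton's method [3]."""
    w = np.zeros(dim)
    for _ in range(iters):
        grad = w.copy()                       # gradient of 0.5 * ||w||^2
        for x_i, x_j in ordered:
            d = psi(x_i, x_j)
            margin = w @ d
            if margin < 1.0:                  # violated pair: slack xi_ij > 0
                grad += 2.0 * C * (margin - 1.0) * d
        for x_i, x_j in similar:
            d = psi(x_i, x_j)
            grad += 2.0 * C * (w @ d) * d     # push w.d toward 0 for similar pairs
        w -= lr * grad
    return w

def predict(w, x_p, x_q):
    """+1 if I_p shows the attribute more strongly than I_q, else -1."""
    return int(np.sign(w @ psi(x_p, x_q)))
```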
4. Proposed Representations
The Ranking SVM method discussed above uses a joint representation based on globally computed features (Eq. 2) while determining the strength of some given attribute. However, several attributes such as "visible-teeth", "eyes-open", etc. are not representative of the whole image, and correspond only to some specific regions/parts. This means there exists a weak association between an image and its attribute label. E.g., Figure 2 shows the parts corresponding to the attributes "visible-teeth" and "eyes-open". This inspires us to build a representation that (i) encodes part/region-specific features, without confusing across parts; and (ii) explicitly encodes the relative significance of each part with respect to a given attribute. With this motivation, next we propose two part-based joint representations for the task of learning relative attribute classifiers.
4.1. Part-based Joint Representation
Given an image I, let P = {p1, . . . , pK} be the set of its K parts. These parts can be obtained using a domain-specific method; e.g., the method discussed in [35] can be used for determining a set of localized parts in face images. Each part pk, k ∈ {1, . . . , K}, is represented using an N1-dimensional feature vector x̃k ∈ R^N1. Here, N1 = K × d1 such that each x̃k is a sparse vector with only d1 non-zero entries in the k-th interval representing part pk. Based on this, given a pair of images Ip and Iq, we define a joint part-based feature representation as below:

$$\tilde{\Psi}(\tilde{\mathbf{x}}_p, \tilde{\mathbf{x}}_q) = \sum_{k=1}^{K} \big( \tilde{\mathbf{x}}_p^k - \tilde{\mathbf{x}}_q^k \big), \quad (9)$$

where x̃p = {x̃kp | k ∈ {1, . . . , K}}. The advantage of this representation is that it specifically encodes correspondence among parts; i.e., now the k-th part of Ip is compared with just the k-th part of Iq. The assumption here is that such a direct comparison between localized pairs of parts would provide stronger cues for learning relative attribute models than using a single global representation as in Eq. 2. (This assumption is also validated by improvements in prediction accuracy, as discussed in Sec. 6.)
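As a concrete reading of Eq. 9 (our own sketch, with hypothetical helper names): each part's d1-dimensional descriptor occupies its own block of the N1 = K × d1 vector, so summing the per-part differences amounts to stacking block-wise differences, and the k-th part of Ip is only ever compared with the k-th part of Iq:

```python
import numpy as np

def embed_part(f_k, k, K):
    """Place part k's d1-dim descriptor into the k-th block of an
    N1 = K * d1 dimensional vector (dense here; sparse in the paper)."""
    d1 = len(f_k)
    x = np.zeros(K * d1)
    x[k * d1:(k + 1) * d1] = f_k
    return x

def joint_part_representation(parts_p, parts_q):
    """Eq. 9: sum_k (x_p^k - x_q^k). Since each part lives in its own
    block, this equals concatenating the per-part differences."""
    K = len(parts_p)
    return sum(embed_part(fp, k, K) - embed_part(fq, k, K)
               for k, (fp, fq) in enumerate(zip(parts_p, parts_q)))

# Tiny usage example: K = 3 parts, d1 = 4 dimensions per part.
rng = np.random.default_rng(0)
parts_p = [rng.random(4) for _ in range(3)]
parts_q = [rng.random(4) for _ in range(3)]
print(joint_part_representation(parts_p, parts_q).shape)  # (12,)
```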
4.2. Weighted Part-based Joint Representation
Though the joint representation proposed in the previous section allows direct part-based comparison between a pair of images, it does not provide information about which parts actually symbolize a given attribute. This is particularly desirable in the case of local attributes, where only a few parts are important in predicting attribute strength. With this motivation, we update the joint representation of Eq. 9 to precisely encode the relative importance of parts.

As discussed in Sec. 4.1, let each image I be represented by a set of K parts. Additionally, let skm ∈ [0, 1] be a weight associated with the k-th part. This weight denotes the relative importance of the k-th part compared to other parts for predicting the strength of attribute am; i.e., the larger the weight, the more important that part, and vice-versa. Using this, given a pair of images Ip and Iq, the new weighted part-based joint feature representation is defined as:

$$\tilde{\Psi}_s(\tilde{\mathbf{x}}_p, \tilde{\mathbf{x}}_q, \mathbf{s}_m) = \sum_{k=1}^{K} s_m^k \big( \tilde{\mathbf{x}}_p^k - \tilde{\mathbf{x}}_q^k \big), \quad (10)$$

where sm = [s1m, . . . , sKm]^T. Since skm expresses the relative significance of the k-th part with respect to am, we call it the significance-coefficient of the k-th part. These coefficients explicitly encode the relative importance of individual parts in the joint representation.
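A minimal sketch of Eq. 10, assuming the K part descriptors of an image are stacked as a (K, d1) matrix; because each part occupies its own block (Sec. 4.1), scaling block k by skm before flattening reproduces the weighted sum. The function name is ours:

```python
import numpy as np

def weighted_joint_representation(parts_p, parts_q, s_m):
    """Eq. 10: sum_k s_m[k] * (x_p^k - x_q^k).
    parts_p, parts_q: (K, d1) arrays of per-part descriptors;
    s_m: (K,) significance-coefficients with s_m >= 0, s_m.sum() == 1."""
    diffs = np.asarray(parts_p) - np.asarray(parts_q)  # (K, d1) differences
    # Scaling each part's block by its coefficient and flattening back to
    # the block layout of Sec. 4.1 reproduces the weighted sum of Eq. 10.
    return (np.asarray(s_m)[:, None] * diffs).ravel()
```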
5. Parameter Learning
Now we discuss how to learn the parameters for each attribute using the two joint representations discussed above. Note that we still need to satisfy the constraints as in Eq. 3 and Eq. 4, depending upon the representation followed.
5.1. For Part-based Joint Representation
In order to learn a ranking model based on the part-based representation in Eq. 9, we optimize the following problem:

$$\mathrm{OP2}: \min_{\mathbf{w}_m} \; \frac{1}{2}\|\mathbf{w}_m\|_2^2 + C_m \Big( \sum \xi_{ij}^2 + \sum \alpha_{ij}^2 \Big) \quad (11)$$
$$\text{s.t.} \quad \mathbf{w}_m \cdot \tilde{\Psi}(\tilde{\mathbf{x}}_i, \tilde{\mathbf{x}}_j) \ge 1 - \xi_{ij}, \quad \forall (I_i, I_j) \in O_m, \quad (12)$$
$$\|\mathbf{w}_m \cdot \tilde{\Psi}(\tilde{\mathbf{x}}_i, \tilde{\mathbf{x}}_j)\|_1 \le \alpha_{ij}, \quad \forall (I_i, I_j) \in S_m, \quad (13)$$
$$\xi_{ij} \ge 0; \quad \alpha_{ij} \ge 0. \quad (14)$$

This is similar to OP1, except that now we use the part-based representation instead of the global representation. This allows us to use the same Newton's method [3] for solving OP2.
5.2. For Weighted Part-based Joint Representation
For the weighted part-based joint representation in Eq. 10, we need to learn two sets of parameters corresponding to every attribute: the ranking model wm and the significance-coefficients sm. To do this, we solve the following optimization problem (OP3):

$$\mathrm{OP3}: \min_{\mathbf{w}_m, \mathbf{s}_m} \; \frac{1}{2}\|\mathbf{w}_m\|_2^2 + C_m \Big( \sum \xi_{ij}^2 + \sum \alpha_{ij}^2 \Big) \quad (15)$$
$$\text{s.t.} \quad \mathbf{w}_m \cdot \tilde{\Psi}_s(\tilde{\mathbf{x}}_i, \tilde{\mathbf{x}}_j, \mathbf{s}_m) \ge 1 - \xi_{ij}, \quad \forall (I_i, I_j) \in O_m, \quad (16)$$
$$\|\mathbf{w}_m \cdot \tilde{\Psi}_s(\tilde{\mathbf{x}}_i, \tilde{\mathbf{x}}_j, \mathbf{s}_m)\|_1 \le \alpha_{ij}, \quad \forall (I_i, I_j) \in S_m, \quad (17)$$
$$\xi_{ij} \ge 0; \quad \alpha_{ij} \ge 0; \quad (18)$$
$$s_m^k \ge 0, \; 1 \le k \le K; \quad \mathbf{e} \cdot \mathbf{s}_m = 1, \quad (19)$$

where e = [1, . . . , 1]^T is a constant vector with all entries equal to 1. Note that the overall weight of all the parts is constrained to sum to one; i.e., skm ≥ 0, e · sm = 1, which ensures that all parts are used fairly. This is equivalent to constraining the L1-norm of sm to be 1 (i.e., L1-regularization), thus implicitly imposing sparsity on sm [22, 31]. This is desirable since usually only a few parts contribute towards determining the strength of a given attribute.
5.2.1 Solving the optimization problem
We solve OP3 in the primal form itself using a block coordinate descent algorithm. We consider the two sets of parameters wm and sm as two blocks, and optimize them in an alternating manner. In the beginning, we initialize all entries of wm to zero, and all entries of sm to 1/K.

First we fix sm to optimize wm. For a fixed sm, the problem becomes equivalent to OP2 (Eq. 11 to Eq. 14), and hence can be solved in the same manner using [3].

Then we fix wm to optimize sm. Let X̃i = [x̃1i, . . . , x̃Ki] ∈ R^(N1 × K) be a matrix formed by appending the features corresponding to all the parts of image Ii. Using this, we compute z̃im = X̃i^T wm ∈ R^K. This gives

$$\mathbf{w}_m \cdot \tilde{\Psi}_s(\tilde{\mathbf{x}}_i, \tilde{\mathbf{x}}_j, \mathbf{s}_m) = \mathbf{s}_m \cdot \tilde{\mathbf{z}}_{ijm}, \quad (20)$$
$$\tilde{\mathbf{z}}_{ijm} = \tilde{\mathbf{z}}_{im} - \tilde{\mathbf{z}}_{jm}. \quad (21)$$

Substituting this in OP3 leads to the following optimization problem for learning sm (for fixed wm):

$$\mathrm{OP4}: \min_{\mathbf{s}_m} \; C \Big( \sum_{(I_i, I_j) \in Q_m} \big( 1 - \mathbf{s}_m \cdot \tilde{\mathbf{z}}_{ijm} \big)^2 + \sum_{(I_i, I_j) \in S_m} \big\| \mathbf{s}_m \cdot \tilde{\mathbf{z}}_{ijm} \big\|_1^2 \Big) \quad (22)$$
$$\text{s.t.} \quad s_m^k \ge 0, \; 1 \le k \le K; \quad \mathbf{e} \cdot \mathbf{s}_m = 1, \quad (23)$$

where Qm ⊆ Om is the set of pairs that violate the margin constraint. Note that Qm is not fixed, and may change at every iteration. We solve OP4 using an iterative gradient descent and projection method similar to [34].
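The following sketch (our own illustration) shows the sm-update of this block coordinate descent: pairs are reduced to K-dimensional vectors z̃ijm via Eqs. 20-21, a gradient step is taken on the OP4 objective, and the iterate is projected back onto the simplex of Eq. 23. Plain gradient descent with a fixed step size stands in for the method of [34], and all names and hyper-parameters are assumptions; the wm-update (not shown) reuses the OP2 solver with sm held fixed:

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection onto {s : s >= 0, sum(s) = 1} (Eq. 23)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def optimize_s(z_ordered, z_similar, s, C=1.0, lr=0.01, iters=200):
    """Gradient descent + projection for OP4 (Eqs. 22-23), with w_m fixed.
    z_ordered, z_similar: (n_pairs, K) arrays whose rows are z_ijm (Eq. 21);
    s: current (K,) significance-coefficients."""
    for _ in range(iters):
        margins = z_ordered @ s
        Q = z_ordered[margins < 1.0]   # violating pairs; may change per iteration
        grad = np.zeros_like(s)
        if len(Q):
            grad -= 2.0 * C * (1.0 - Q @ s) @ Q            # ordered-pair term
        if len(z_similar):
            grad += 2.0 * C * (z_similar @ s) @ z_similar  # similar-pair term
        s = project_to_simplex(s - lr * grad)
    return s
```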

Figure 3. Input image (left), parts detected using [35] (middle),
and additional parts detected by us (right).
5.3. Computing Parts
The two joint representations as proposed in Sec. 4 are based on an ordered set of corresponding parts computed from a given pair of images. Given a method for computing such parts, our framework is applicable irrespective of the domain. This makes our framework domain adaptable.

In this work, we consider the domain of face images. To compute parts from a given face image, we use the method proposed in [35]. It is based on a mixture-of-trees model that learns a shared pool of facial parts. Given a face image, it computes a set of 68 parts covering facial landmarks such as eyes, eyebrows, nose, mouth and jawline. Figure 3 shows a face image (left) and its parts (middle) computed using this method. Though these parts can be used to represent several attributes such as "smiling", "eyes-open", etc., there are a few other attributes which are not covered by these parts, such as "bald-head", "visible-forehead" and "dark-hair". In order to cover these attributes as well, we compute additional parts using image-level statistics such as image-size and distance from the earlier 68 parts. This gives an extended set of 83 parts for a given face image. Figure 3 (right) shows this extended set of parts computed for the given image (left).
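Since a part is a block around a detected landmark (Sec. 1), part extraction can be sketched as below; the landmark detector itself is external (e.g., [35]), and the fixed 24-pixel block size is purely an assumption for illustration, as the paper does not specify it here:

```python
import numpy as np

def extract_part_patches(image, landmarks, patch=24):
    """Crop a fixed-size block around each detected landmark point.
    `image`: (H, W) or (H, W, 3) array; `landmarks`: (x, y) points from an
    external detector such as [35]. The 24-pixel block size is an
    assumption for illustration; the paper does not specify it here."""
    h, w = image.shape[:2]
    half = patch // 2
    patches = []
    for x, y in landmarks:
        x0 = int(np.clip(x - half, 0, w - patch))  # keep the block inside the image
        y0 = int(np.clip(y - half, 0, h - patch))
        patches.append(image[y0:y0 + patch, x0:x0 + patch])
    return patches
```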
5.4. Relation with Latent Models
In the last few years, latent models have become popular for several tasks, particularly for object detection [9]. These models usually look for characteristics (e.g., parts) that are shared within a category but distinctive across categories. (As discussed in Sec. 2, recent works such as [1, 5, 13] also have similar motivation, though they do not explicitly investigate the latent aspect.) Our work is similar to theirs in the sense that we also seek attribute-specific distinctive parts by incorporating significance-coefficients. However, in contrast to them, we require these parts to be shared across categories. This is because our ranking method uses these parts to learn attribute-specific models which are independent of the categories being depicted in training pairs.
6. Experiments
We compare the proposed method with that of [24] under different settings on two datasets. First is the PubFig-29 dataset as used in [26]. It consists of 60 face categories and 29 attributes, with attribute annotations being collected at category-level; i.e., using pairs of categories rather than pairs of images. Due to this, the annotations in this dataset are not consistent for several attributes (see Figure 4); e.g., Scarlett Johansson may not be smiling more than Hugh Laurie in all their images. To address this limitation, we have collected a new dataset using a subset of LFW [11] images. The new dataset has attribute-level annotations for 10000 image pairs and 10 attributes, and we call this the LFW-10 dataset. While collecting the annotations, we particularly ignore the category information, thus making it more suitable for the task of learning relative attributes. The details of this dataset are described next.

Figure 4. Example pairs and their ground-truth annotations from the PubFig-29 dataset. Due to category-level annotations, there exist inconsistencies in (true) instance-level attribute visibility.
Figure 5. Example pairs from the LFW-10 dataset. The images exhibit high diversity in terms of age, pose, lighting, occlusion, etc.
6.1. LFW-10 Dataset
We randomly select 2000 images from the LFW dataset [11]. Out of these, 1000 images are used for creating training pairs and the remaining (unseen) 1000 for testing pairs. The annotations are collected for 10 attributes, with 500 training and testing pairs per attribute. In order to minimize the chances of inconsistency in the dataset, each image pair is annotated by 5 trained annotators, and the final annotation is decided based on majority voting. Figure 5 shows example pairs from this dataset.
6.2. Features for Parts
We represent each part using a Bag of Words (BoW) histogram over dense SIFT (DSIFT) [20] features. We consider two settings for learning the visual-word vocabulary: (1) In the first setting, we learn a part-specific vocabulary for every part. This is possible since our parts are fixed and known. In practice, we learn a vocabulary of 100 visual words for each part. This gives an 8300-dimensional (= 83 parts × 100) (sparse) feature vector per part. (2) In the second setting, we learn a single vocabulary of 100 visual words for all the parts. This again results in an 8300-dimensional (sparse) feature vector per part.
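A small sketch of this per-part BoW pipeline, assuming dense SIFT descriptors have already been extracted from each part's region; plain k-means is our assumption for vocabulary learning, since the paper fixes only the vocabulary size (100 words) and not the clustering algorithm:

```python
import numpy as np

def learn_vocabulary(descriptors, V=100, iters=20, seed=0):
    """Plain k-means as a stand-in for vocabulary learning: the paper fixes
    a 100-word vocabulary (per part, or shared across parts) but not the
    clustering algorithm, so k-means is our assumption."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), V, replace=False)].astype(float)
    for _ in range(iters):
        d2 = ((descriptors[:, None] - centers[None]) ** 2).sum(-1)
        assign = d2.argmin(1)
        for v in range(V):
            members = descriptors[assign == v]
            if len(members):
                centers[v] = members.mean(0)
    return centers

def bow_histogram(descriptors, vocabulary):
    """Normalized BoW histogram of a part's dense-SIFT descriptors: assign
    each descriptor to its nearest visual word and count."""
    d2 = ((descriptors[:, None] - vocabulary[None]) ** 2).sum(-1)
    hist = np.bincount(d2.argmin(1), minlength=len(vocabulary)).astype(float)
    return hist / max(hist.sum(), 1.0)
```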

Citations
Journal ArticleDOI
TL;DR: In this article, a joint multi-task learning algorithm is proposed to better predict attributes in images using deep convolutional neural networks (CNN), where each CNN will predict one binary attribute.
Abstract: This paper proposes a joint multi-task learning algorithm to better predict attributes in images using deep convolutional neural networks (CNN). We consider learning binary semantic attributes through a multi-task CNN model, where each CNN will predict one binary attribute. The multi-task learning allows CNN models to simultaneously share visual knowledge among different attribute categories. Each CNN will generate attribute-specific feature representations, and then we apply multi-task learning on the features to predict their attributes. In our multi-task framework, we propose a method to decompose the overall model’s parameters into a latent task matrix and combination matrix. Furthermore, under-sampled classifiers can leverage shared statistics from other classifiers to improve their performance. Natural grouping of attributes is applied such that attributes in the same group are encouraged to share more knowledge. Meanwhile, attributes in different groups will generally compete with each other, and consequently share less knowledge. We show the effectiveness of our method on two popular attribute datasets.

255 citations

Proceedings ArticleDOI
27 Jun 2016
TL;DR: This work provides a novel perspective to attribute detection and proposes to gear the techniques in multi-source domain generalization for the purpose of learning cross-category generalizable attribute detectors.
Abstract: Attributes possess appealing properties and benefit many computer vision problems, such as object recognition, learning with humans in the loop, and image retrieval. Whereas the existing work mainly pursues utilizing attributes for various computer vision problems, we contend that the most basic problem—how to accurately and robustly detect attributes from images—has been left under explored. Especially, the existing work rarely explicitly tackles the need that attribute detectors should generalize well across different categories, including those previously unseen. Noting that this is analogous to the objective of multi-source domain generalization, if we treat each category as a domain, we provide a novel perspective to attribute detection and propose to gear the techniques in multi-source domain generalization for the purpose of learning cross-category generalizable attribute detectors. We validate our understanding and approach with extensive experiments on four challenging datasets and three different problems.

169 citations


Cites background from "Relative Parts: Distinctive Parts f..."

  • ..., tails of mammals) [4, 6, 37, 3, 83, 59, 14], and the relationship between attributes and categories [79, 48, 32, 54]....

BookDOI
01 Jan 2017
TL;DR: This chapter gives an overview of domain adaptation and transfer learning with a specific view to visual applications and reviews DA methods that go beyond image categorization, such as object detection, image segmentation, video analyses or learning visual attributes.
Abstract: The aim of this chapter is to give an overview of domain adaptation and transfer learning with a specific view to visual applications. After a general motivation, we first position domain adaptation in the more general transfer learning problem. Second, we try to address and analyze briefly the state-of-the-art methods for different types of scenarios, first describing the historical shallow methods, addressing both the homogeneous and heterogeneous domain adaptation methods. Third, we discuss the effect of the success of deep convolutional architectures which led to the new type of domain adaptation methods that integrate the adaptation within the deep architecture. Fourth, we review DA methods that go beyond image categorization, such as object detection, image segmentation, video analyses or learning visual attributes. We conclude the chapter with a section where we relate domain adaptation to other machine learning solutions.

169 citations

Proceedings ArticleDOI
01 Oct 2017
TL;DR: The authors propose to overcome the sparsity of supervision problem via synthetically generated images by augmenting real training image pairs with these examples, then train attribute ranking models to predict the relative strength of an attribute in novel pairs of real images.
Abstract: Distinguishing subtle differences in attributes is valuable, yet learning to make visual comparisons remains nontrivial. Not only is the number of possible comparisons quadratic in the number of training images, but also access to images adequately spanning the space of fine-grained visual differences is limited. We propose to overcome the sparsity of supervision problem via synthetically generated images. Building on a state-of-the-art image generation engine, we sample pairs of training images exhibiting slight modifications of individual attributes. Augmenting real training image pairs with these examples, we then train attribute ranking models to predict the relative strength of an attribute in novel pairs of real images. Our results on datasets of faces and fashion images show the great promise of bootstrapping imperfect image generators to counteract sample sparsity for learning to rank.

131 citations

Posted Content
TL;DR: An end-to-end deep convolutional network to simultaneously localize and rank relative visual attributes, given only weakly-supervised pairwise image comparisons is proposed.
Abstract: We propose an end-to-end deep convolutional network to simultaneously localize and rank relative visual attributes, given only weakly-supervised pairwise image comparisons. Unlike previous methods, our network jointly learns the attribute's features, localization, and ranker. The localization module of our network discovers the most informative image region for the attribute, which is then used by the ranking module to learn a ranking model of the attribute. Our end-to-end framework also significantly speeds up processing and is much faster than previous methods. We show state-of-the-art ranking results on various relative attribute datasets, and our qualitative localization results clearly demonstrate our network's ability to learn meaningful image patches.

67 citations


Cites background or methods from "Relative Parts: Distinctive Parts f..."

  • ...The state-of-the-art method of Xiao and Lee [14] outperforms the baseline method of [13] because it automatically discovers the relevant regions of an attribute without relying on pretrained keypoint detectors whose detected parts may be irrelevant to the attribute....

  • ...Since that method has to process a sequence of time-consuming modules including feature extraction, nearest neighbor matching, iterative SVM classifier training to build the visual chains, and training an SVM ranker to rank the chains, it takes ∼10 hours to train one attribute model on LFW-10 using a cluster of 20 CPU nodes with 2 cores each....

  • ...Attribute ranking accuracy on LFW-10....

  • ...5, we show the results for the face attributes on the LFW-10 test images....

  • ...We use the same train-test split used in [13]....

References
Journal ArticleDOI
TL;DR: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene and can robustly identify objects among clutter and occlusion while achieving near real-time performance.
Abstract: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.

46,906 citations

Journal ArticleDOI
TL;DR: A new method for estimation in linear models called the lasso, which minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant, is proposed.
Abstract: SUMMARY We propose a new method for estimation in linear models. The 'lasso' minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant. Because of the nature of this constraint it tends to produce some coefficients that are exactly 0 and hence gives interpretable models. Our simulation studies suggest that the lasso enjoys some of the favourable properties of both subset selection and ridge regression. It produces interpretable models like subset selection and exhibits the stability of ridge regression. There is also an interesting relationship with recent work in adaptive function estimation by Donoho and Johnstone. The lasso idea is quite general and can be applied in a variety of statistical models: extensions to generalized regression models and tree-based models are briefly described.

40,785 citations


"Relative Parts: Distinctive Parts f..." refers background in this paper

  • ..., L1-regularization), thus implicitly imposing sparsity on sm [22, 31]....

01 Jan 2011
TL;DR: The Scale-Invariant Feature Transform (or SIFT) algorithm is a highly robust method to extract and consequently match distinctive invariant features from images that can then be used to reliably match objects in diering images.
Abstract: The Scale-Invariant Feature Transform (or SIFT) algorithm is a highly robust method to extract and consequently match distinctive invariant features from images. These features can then be used to reliably match objects in differing images. The algorithm was first proposed by Lowe [12] and further developed to increase performance resulting in the classic paper [13] that served as foundation for SIFT which has played an important role in robotic and machine vision in the past decade.

14,708 citations


"Relative Parts: Distinctive Parts f..." refers methods in this paper

  • ...We represent each part using a Bag of Words (BoW) histogram over dense SIFT (DSIFT) [20] features....

  • ...We use four features for comparison: (i) BoW histogram over DSIFT features with 1000 visual words, (ii) global 512-dimensional GIST descriptor [23], (iii) global 512-dimensional GIST and 30-dimensional RGB histogram (which was also used in [24]), and (iv) spatial pyramid (SPM) [19] up to two and three levels using DSIFT features and the same vocabulary as in (i)....

  • ...This again results into a 8300- | Method, Accuracy: Global DSIFT + RSVM [24], 61.28; Global GIST + RGB + RSVM [24], 59.18; SPM (up to 2 levels) + RSVM [24], 49.60; SPM (up to 3 levels) + RSVM [24], 49.17; Unweighted parts + Part-specific vocab....

  • ...Method, Accuracy: Global DSIFT + RSVM [24], 64.61; Global GIST + RSVM [24], 68.89; Global GIST + RGB + RSVM [24], 69.89; SPM (up to 2 levels) + RSVM [24], 50.73; SPM (up to 3 levels) + RSVM [24], 50.01; Human selected parts + Part-specific vocab....

Journal ArticleDOI
TL;DR: An object detection system based on mixtures of multiscale deformable part models that is able to represent highly variable object classes and achieves state-of-the-art results in the PASCAL object detection challenges is described.
Abstract: We describe an object detection system based on mixtures of multiscale deformable part models. Our system is able to represent highly variable object classes and achieves state-of-the-art results in the PASCAL object detection challenges. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL data sets. Our system relies on new methods for discriminative training with partially labeled data. We combine a margin-sensitive approach for data-mining hard negative examples with a formalism we call latent SVM. A latent SVM is a reformulation of MI--SVM in terms of latent variables. A latent SVM is semiconvex, and the training problem becomes convex once latent information is specified for the positive examples. This leads to an iterative training algorithm that alternates between fixing latent values for positive examples and optimizing the latent SVM objective function.

10,501 citations


"Relative Parts: Distinctive Parts f..." refers background in this paper

  • ...In the last few years, latent models have become popular for several tasks, particularly for object detection [9]....

Proceedings ArticleDOI
17 Jun 2006
TL;DR: This paper presents a method for recognizing scene categories based on approximate global geometric correspondence that exceeds the state of the art on the Caltech-101 database and achieves high accuracy on a large database of fifteen natural scene categories.
Abstract: This paper presents a method for recognizing scene categories based on approximate global geometric correspondence. This technique works by partitioning the image into increasingly fine sub-regions and computing histograms of local features found inside each sub-region. The resulting "spatial pyramid" is a simple and computationally efficient extension of an orderless bag-of-features image representation, and it shows significantly improved performance on challenging scene categorization tasks. Specifically, our proposed method exceeds the state of the art on the Caltech-101 database and achieves high accuracy on a large database of fifteen natural scene categories. The spatial pyramid framework also offers insights into the success of several recently proposed image descriptions, including Torralba’s "gist" and Lowe’s SIFT descriptors.

8,736 citations


"Relative Parts: Distinctive Parts f..." refers methods in this paper

  • ...We use four features for comparison: (i) BoW histogram over DSIFT features with 1000 visual words, (ii) global 512-dimensional GIST descriptor [23], (iii) global 512-dimensional GIST and 30-dimensional RGB histogram (which was also used in [24]), and (iv) spatial pyramid (SPM) [19] up to two and three levels using DSIFT features and the same vocabulary as in (i)....

  • ...The performance for SPM is comparable to chance accuracy....

  • ...Method, Accuracy: Global DSIFT + RSVM [24], 64.61; Global GIST + RSVM [24], 68.89; Global GIST + RGB + RSVM [24], 69.89; SPM (up to 2 levels) + RSVM [24], 50.73; SPM (up to 3 levels) + RSVM [24], 50.01; Human selected parts + Part-specific vocab....

  • ...We use four features for comparison: (i) BoW histogram over DSIFT features with 1000 visual words, (ii) global 512-dimensional GIST descriptor [23], (iii) global 512-dimensional GIST and 30-dimensional RGB histogram (which was also used in [24]), and (iv) spatial pyramid (SPM) [19] up to two and three levels using DSIFT features and the same vocabulary as in (i)....

  • ...This again results into a 8300- | Method, Accuracy: Global DSIFT + RSVM [24], 61.28; Global GIST + RGB + RSVM [24], 59.18; SPM (up to 2 levels) + RSVM [24], 49.60; SPM (up to 3 levels) + RSVM [24], 49.17; Unweighted parts + Part-specific vocab....

Frequently Asked Questions (11)
Q1. What have the authors contributed in "Relative parts: distinctive parts for learning relative attributes" ?

The notion of relative attributes as introduced by Parikh and Grauman (ICCV, 2011) provides an appealing way of comparing two images based on their visual properties (or attributes) such as "smiling" for face images, "naturalness" for outdoor images, etc. In this paper, the authors extend this idea towards learning relative attributes using local parts that are shared across categories. First, instead of using a global representation, the authors introduce a part-based representation combining a pair of images that specifically compares corresponding parts. Then, with each part the authors associate a locally adaptive "significance-coefficient" that represents its discriminative ability with respect to a particular attribute.

In order to minimize the chances of inconsistency in the dataset, each image pair is annotated by 5 trained annotators, and the final annotation is decided based on majority voting.

Attributes have also been used for multiple-query image search [30], where input attributes along with other related attributes are used in a structured-prediction based model. 

In [29], a semi-supervised constrained bootstrapping approach is proposed that tries to benefit from inter-class attribute-based relationships to avoid semantic drift during the learning process. 

In order to learn wm, the following constraints need to be satisfied: wm · Ψ(xi, xj) > 0 ∀(Ii, Ij) ∈ Om (3); wm · Ψ(xi, xj) = 0 ∀(Ii, Ij) ∈ Sm (4). Since this is an NP-hard problem, its relaxed version is solved by introducing slack variables.

For the weighted part-based joint representation in Eq. 10, the authors need to learn two sets of parameters corresponding to every attribute: ranking model wm, and significancecoefficients sm. 

For a given attribute am, all the selected parts are assigned equal weights and the remaining parts are assigned zero weight, and then a ranking model wm is learned based on these part weights. 

In [24], given a set of pairs of images depicting similar and/or different strengths of some particular attribute, the problem of learning a relative attribute classifier is posed as one of learning a ranking model for that attribute similar to Ranking SVM [12]. 

One possible reason for this could be that using vocabularies learned individually for each part results in less confusion than using a single vocabulary learned using all the parts.

In order to learn a ranking model based on the part-based representation in Eq. 9, the authors optimize the following problem: OP2: min over wm of (1/2)||wm||² + Cm(Σ ξij² + Σ αij²) (11), s.t. wm · Ψ̃(x̃i, x̃j) ≥ 1 − ξij ∀(Ii, Ij) ∈ Om (12); ||wm · Ψ̃(x̃i, x̃j)||₁ ≤ αij ∀(Ii, Ij) ∈ Sm (13); ξij ≥ 0, αij ≥ 0 (14). This is similar to OP1, except that now the authors use the part-based representation instead of the global representation.

This is because their ranking method uses these parts to learn attribute-specific models which are independent of categories being depicted in training pairs.