Relative Parts: Distinctive Parts for Learning Relative Attributes
Summary (5 min read)
1. Introduction
- Visual attributes (or simply attributes) are perceptual properties that can be used to describe an entity (“pointed nose”), an object (“furry sheep”), or a scene (“natural outdoor”).
- This led to the notion of “relative attributes”, where the strength of an attribute in a given image can be described with respect to some other image/category; e.g., “the given face is less chubby than person A and more chubby than person B”.
- Next, the authors update this part-based representation by additionally learning weights corresponding to each part that denote their contribution towards predicting the strength of a given attribute.
- The authors compare the baseline method of [24] with the proposed method under various settings.
- In Sec. 3, the authors discuss the method of [24] for learning relative attribute ranking models.
3. Preliminaries
- In [24], a Ranking SVM based method was used for learning relative attribute classifiers.
- Ranking SVM [12] is a max-margin ranking framework that learns linear models to perform pairwise comparisons.
- This is conceptually different from the conventional one-vs-rest SVM that learns a model using individual samples rather than pairs.
- Though SVM scores can also be used to perform pairwise comparisons, Ranking SVM is usually known to perform better than SVM for such tasks.
- In [24] as well, Ranking SVM was shown to outperform SVM on the task of relative attribute prediction.
3.1. The Ranking SVM Model
- Here, Sm = {(Ii, Ij)} is such that both Ii and Ij have nearly the same strength of attribute am.
- Given the training data Dm, the goal is to learn a ranking function fm that, given a new pair of images Ip and Iq represented by xp and xq respectively, predicts which image has the greater strength of attribute am.
- Using fm, the authors determine which image has higher strength for attribute am based on ympq = sign(fm(xp,xq; wm)).
- Note that along with pairwise constraints as in [12], the optimization problem now also includes similarity constraints.
- This is solved in the primal form itself using Newton’s method [3].
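As a rough illustration of the pairwise formulation (not the authors' exact solver, which works in the primal with Newton's method [3]), a linear Ranking SVM can be approximated by training a standard linear SVM on pairwise difference vectors. All data and names below are synthetic; this is a sketch of the idea, not the paper's implementation.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy sketch: approximate Ranking SVM by classifying pairwise
# difference vectors (x_p - x_q), labeled +1 when image p has the
# stronger attribute. w_true is a hidden "attribute strength" direction.
rng = np.random.default_rng(0)
w_true = rng.normal(size=16)
X = rng.normal(size=(200, 16))        # stand-in for global image features
scores = X @ w_true                   # synthetic attribute strengths

pairs = [(i, i + 1) for i in range(0, 200, 2)]
X_diff = np.array([X[i] - X[j] for i, j in pairs])
y = np.array([1 if scores[i] > scores[j] else -1 for i, j in pairs])

# No intercept: f_m(x_p, x_q) = w · (x_p - x_q) should be antisymmetric
clf = LinearSVC(C=1.0, fit_intercept=False).fit(X_diff, y)

def rank_pair(xp, xq):
    # sign of f_m decides which image has the greater attribute strength
    return np.sign(clf.decision_function([xp - xq])[0])
```

Because the synthetic labels are generated by a linear score, the learned ranker recovers the ordering almost perfectly; real image features are of course far noisier.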
4. Proposed Representations
- The Ranking SVM method discussed above uses a joint representation based on globally computed features (Eq. 2) while determining the strength of some given attribute.
- Several attributes such as “visible-teeth”, “eyes-open”, etc. are not representative of the whole image, and correspond only to some specific regions/parts.
- This means there exists a weak association between an image and its attribute label.
- This inspires us to build a representation that (i) encodes part/region-specific features, without confusing across parts; and (ii) explicitly encodes the relative significance of each part with respect to a given attribute.
- Next the authors propose two part-based joint-representations for the task of learning relative attribute classifiers.
4.1. Part-based Joint Representation
- These parts can be obtained using a domain-specific method; e.g., the method discussed in [35] can be used for determining a set of localized parts in face images.
- Here, N1 = K × d1 such that each x̃k is a sparse vector with only d1 non-zero entries in the kth interval representing part pk.
- The advantage of this representation is that it specifically encodes correspondence among parts; i.e., now the kth part of Ip is compared with just the kth part of Iq .
- The assumption here is that such a direct comparison between localized pairs of parts would provide stronger cues for learning relative attribute models than using a single global representation as in Eq. 2.
- (This assumption is also validated by improvements in prediction accuracy, as discussed in Sec. 6.)
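A minimal sketch of this stacked part representation (shapes are illustrative; the actual per-part features are the BoW histograms described in Sec. 6.2, and the exact form of the joint map Ψ̃ is an assumption here, taken as the difference vector usual for Ranking SVM):

```python
import numpy as np

# Sketch of Sec. 4.1: image I has K parts, each with a d1-dim feature;
# stack_parts places part k's feature in the k-th interval of a
# K*d1 vector (N1 = K*d1), so corresponding parts of two images
# line up position-wise.
K, d1 = 6, 100
rng = np.random.default_rng(0)

def stack_parts(part_feats):            # part_feats: (K, d1) array
    x_tilde = np.zeros(K * d1)
    for k in range(K):
        x_tilde[k * d1:(k + 1) * d1] = part_feats[k]
    return x_tilde

def joint(parts_p, parts_q):
    # joint representation: part k of I_p is compared only with
    # part k of I_q (difference form is an assumption of this sketch)
    return stack_parts(parts_p) - stack_parts(parts_q)
```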
4.2. Weighted Part-based Joint Representation
- Though the joint representation proposed in the previous section allows direct part-based comparison between a pair of images, it does not provide information about which parts actually symbolize some given attribute.
- As discussed in Sec. 4.1, let each image I be represented by a set of K parts.
- Additionally, let skm ∈ [0, 1] be a weight associated with the kth part.
- This weight denotes the relative importance of the kth part compared to the other parts for predicting the strength of attribute am; i.e., the larger the weight, the more important that part, and vice versa.
- These help in explicitly encoding the relative importance of individual parts in the joint representation.
5. Parameter Learning
- Now the authors discuss how to learn the parameters for each attribute using the two joint representations discussed above.
- Note that the authors still need to satisfy the constraints as in Eq. 3 and Eq. 4 depending upon the representation followed.
5.1. For Part-based Joint Representation
- This is similar to OP1, except that now the authors use part-based representation instead of global representation.
- This allows us to use the same Newton’s method [3] for solving OP2.
5.2. For Weighted Part-based Joint Representation
- For the weighted part-based joint representation in Eq. 10, the authors need to learn two sets of parameters corresponding to every attribute: the ranking model wm, and the significance-coefficients sm.
- Note that the overall weight of all the parts is constrained to sum to one; i.e., skm ≥ 0, e · sm = 1, which ensures that all parts are fairly used.
- This is desirable since usually only a few parts contribute towards determining the strength of a given attribute.
5.2.1 Solving the optimization problem
- The authors solve OP3 in the primal form itself using a block coordinate descent algorithm.
- The authors consider each set of parameters wm and sm as two blocks, and optimize them in an alternate manner.
- Om is the set of pairs that violate the margin constraint.
- Note that Om is not fixed, and may change at every iteration.
- The authors solve OP4 using an iterative gradient descent and projection method similar to [34].
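The gradient-descent-and-project step for sm needs a Euclidean projection onto the probability simplex {s : sk ≥ 0, e · s = 1}. Below is a standard sorting-based projection algorithm; the method in [34] that the authors follow may differ in detail.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {s : s_k >= 0, sum_k s_k = 1}.

    Sort-based algorithm: find the largest rho such that
    u[rho] - (cumsum(u)[rho] - 1) / (rho + 1) > 0, then shift and clip.
    """
    u = np.sort(v)[::-1]                      # descending sort
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(v - theta, 0.0)
```

After each gradient step on sm, applying `project_simplex` restores feasibility, and the projection naturally zeroes out small coefficients, which matches the observation that only a few parts matter for a given attribute.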
5.3. Computing Parts
- The two joint representations as proposed in Sec. 4 are based on an ordered set of corresponding parts computed from a given pair of images.
- Given a method for computing such parts, their framework is applicable irrespective of the domain.
- To compute parts from a given face image, the authors use the method proposed in [35].
- Though these parts can be used to represent several attributes such as “smiling”, “eyes-open”, etc., there are a few other attributes that are not covered by these parts, such as “bald-head”, “visible-forehead” and “dark-hair”.
5.4. Relation with Latent Models
- In the last few years, latent models have become popular for several tasks, particularly for object detection [9].
- These models usually look for characteristics (e.g., parts) that are shared within a category but distinctive across categories.
- (As discussed in Sec. 2, recent works such as [1, 5, 13] also have similar motivation, though they do not explicitly investigate the latent aspect.)
- The authors' work is similar to these in the sense that they also seek attribute-specific distinctive parts, by incorporating significance-coefficients.
- This is because their ranking method uses these parts to learn attribute-specific models that are independent of the categories depicted in the training pairs.
6. Experiments
- The authors compare the proposed method with that of [24] under different settings on two datasets.
- To address the limitations of existing datasets, the authors collected a new dataset using a subset of LFW [11] images.
- The new dataset has attribute-level annotations for 10000 image pairs and 10 attributes, and the authors call this the LFW-10 dataset.
- While collecting the annotations, the authors deliberately ignore the category information, making the dataset more suitable for the task of learning relative attributes.
- The details of this dataset are described next.
6.1. LFW-10 Dataset
- Out of these, 1000 images are used for creating training pairs and the remaining 1000 for testing pairs.
- The annotations are collected for 10 attributes, with 500 training and testing pairs per attribute.
- In order to minimize the chances of inconsistency in the dataset, each image pair is annotated by 5 trained annotators, and the final annotation is decided by majority voting.
- Figure 5 shows example pairs from this dataset.
6.2. Features for Parts
- The authors represent each part using a Bag of Words (BoW) histogram over dense SIFT [20] features.
- The authors consider two settings for learning visual-word vocabulary: (1) In the first setting, they learn a part-specific vocabulary for every part.
- This is possible since their parts are fixed and known.
- In practice, the authors learn a vocabulary of 100 visual words for each part.
- (2) In the second setting, the authors learn a single vocabulary of 100 visual words for all the parts.
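A minimal BoW pipeline for either setting can be sketched as below (synthetic descriptors stand in for dense SIFT, whose extraction is assumed to come from an external library such as OpenCV or vlfeat):

```python
import numpy as np
from sklearn.cluster import KMeans

# Minimal BoW sketch: cluster local descriptors into a 100-word
# vocabulary, then histogram a part's descriptors over it. For the
# part-specific setting, fit one such KMeans per part instead.
rng = np.random.default_rng(0)
descriptors = rng.normal(size=(500, 128))       # stand-in for DSIFT

vocab = KMeans(n_clusters=100, n_init=1, random_state=0).fit(descriptors)

def bow_histogram(part_desc):
    words = vocab.predict(part_desc)            # nearest visual word
    hist = np.bincount(words, minlength=100).astype(float)
    return hist / max(hist.sum(), 1.0)          # L1-normalized histogram
```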
6.3. Baselines
- The authors compare with the Ranking SVM method of [24] using the code provided by its authors.
- The authors use four features for comparison: (i) BoW histogram over DSIFT features with 1000 visual words, (ii) global 512-dimensional GIST descriptor [23], (iii) global 512-dimensional GIST and 30-dimensional RGB histogram (which was also used in [24]), and (iv) spatial pyramid (SPM) [19] up to two and three levels using DSIFT features and the same vocabulary as in (i).
- As another baseline, the authors compare the quality of their part-learning framework (Sec. 5.2) against human-selected parts.
- For this, the authors asked a human expert to select a subset of the most representative parts for every attribute.
- For a given attribute am, all the selected parts are assigned equal weights and the remaining parts are assigned zero weight, and then a ranking model wm is learned based on these part weights.
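The weight assignment for this baseline is straightforward; the part indices below are purely hypothetical:

```python
import numpy as np

# Human-selected-parts baseline sketch: the expert-chosen parts split
# the weight equally and all remaining parts get zero weight, after
# which the ranking model w_m is trained with these weights fixed.
K = 6
selected = [1, 3]                       # hypothetical expert choice
s_m = np.zeros(K)
s_m[selected] = 1.0 / len(selected)     # equal weights, summing to one
```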
6.4. Results
- Table 1 compares different methods on PubFig-29 dataset.
- This clearly validates the significance of these representations for learning relative attribute models.
- (3) Using a part-specific vocabulary performs better than using a single vocabulary.
- Figure 7 shows the top ten learned parts with highest significance-coefficients for all the ten attributes in LFW-10 dataset.
- Also, the performance of their method closely matches with that obtained using human selected parts, thus demonstrating its effectiveness.
6.5. Application to Interactive Image Search
- Now, the authors illustrate the advantage of the proposed method on the task of interactive image search using relative attribute based feedback.
- For a given attribute’s feedback with respect to a reference image, the search set is partitioned into two disjoint sets using that attribute’s scores.
- The ranks of all the images in the search set are averaged over all feedbacks and all reference images.
- A total of 275 searches are performed for each of the six settings, by collecting feedback from 30 human evaluators.
- These results demonstrate that here, too, their method consistently outperforms the baseline method and achieves performance comparable to that using human-selected parts, validating its efficacy.
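A single relative-attribute feedback step above (“the target is more/less <attribute> than this reference”) can be sketched as filtering the search set by the attribute's predicted scores; all names and the score model here are illustrative:

```python
import numpy as np

# Sketch of one feedback round in relative-attribute image search:
# keep only candidates whose attribute score is on the requested
# side of the reference image's score.
rng = np.random.default_rng(1)
scores = rng.normal(size=50)            # predicted attribute strengths
ref = 10                                # reference image index

def apply_feedback(candidates, ref_idx, more=True):
    if more:   # "target is MORE <attribute> than the reference"
        return [i for i in candidates if scores[i] > scores[ref_idx]]
    return [i for i in candidates if scores[i] < scores[ref_idx]]

remaining = apply_feedback(list(range(50)), ref, more=True)
```

Each successive feedback shrinks the candidate set, which is why better attribute rankers translate directly into fewer feedback rounds.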
7. Conclusion
- Inspired by the success of relative attributes, the authors have presented a novel method that learns relative attribute models using local parts that are shared across categories.
- The authors' method achieves significant improvements compared to the baseline method.
- Apart from this, the part-specific weights learned using their method also provide a semantic interpretation of different parts for diverse attributes.
Frequently Asked Questions (11)
Q2. How is the annotation of the image pair determined?
In order to minimize the chances of inconsistency in the dataset, each image pair is annotated by 5 trained annotators, and the final annotation is decided by majority voting.
Q3. What is the common use of attributes in image search?
Attributes have also been used for multiple-query image search [30], where input attributes along with other related attributes are used in a structured-prediction based model.
Q4. What is the idea of a semi-supervised constrained bootstrapping approach?
In [29], a semi-supervised constrained bootstrapping approach is proposed that tries to benefit from inter-class attribute-based relationships to avoid semantic drift during the learning process.
Q5. What is the simplest way to learn wm?
In order to learn wm, the following constraints need to be satisfied:

wm · Ψ(xi, xj) > 0   ∀(Ii, Ij) ∈ Om   (3)
wm · Ψ(xi, xj) = 0   ∀(Ii, Ij) ∈ Sm   (4)

Since this is an NP-hard problem, its relaxed version is solved by introducing slack variables.
Q6. What are the two sets of parameters that are used to learn the ranking model?
For the weighted part-based joint representation in Eq. 10, the authors need to learn two sets of parameters corresponding to every attribute: the ranking model wm, and the significance-coefficients sm.
Q7. What is the wm model for learning the part?
For a given attribute am, all the selected parts are assigned equal weights and the remaining parts are assigned zero weight, and then a ranking model wm is learned based on these part weights.
Q8. What is the problem of learning a relative attribute classifier?
In [24], given a set of pairs of images depicting similar and/or different strengths of some particular attribute, the problem of learning a relative attribute classifier is posed as one of learning a ranking model for that attribute similar to Ranking SVM [12].
Q9. Why is the LFW-10 dataset better than the single vocabulary?
One possible reason could be that using vocabularies learned individually for each part results in less confusion than using a single vocabulary learned from all the parts.
Q10. What is the motivation for implementing the part-based representation in Eq. 9?
In order to learn a ranking model based on the part-based representation in Eq. 9, the authors optimize the following problem:

OP2:  min over wm of  (1/2) ||wm||₂² + Cm (Σ ξij² + Σ αij²)   (11)
s.t.  wm · Ψ̃(x̃i, x̃j) ≥ 1 − ξij,   ∀(Ii, Ij) ∈ Om   (12)
      ||wm · Ψ̃(x̃i, x̃j)||₁ ≤ αij,   ∀(Ii, Ij) ∈ Sm   (13)
      ξij ≥ 0;  αij ≥ 0.   (14)

This is similar to OP1, except that now the authors use the part-based representation instead of the global representation.
Q11. Why do the authors use these parts to learn attribute-specific models?
This is because their ranking method uses these parts to learn attribute-specific models which are independent of categories being depicted in training pairs.