Unsupervised Domain Adaptation by Domain Invariant Projection
Summary (3 min read)
1. Introduction
- Domain shift is a fundamental problem in visual recognition tasks as evidenced by the recent surge of interest in domain adaptation [22, 15, 16].
- Existing domain adaptation techniques, however, fail to account for the fact that the image features themselves may have been distorted by the domain shift, and that some of the image features may be specific to one domain and thus irrelevant for classification in the other one.
- In light of the above discussion, the authors propose to tackle the problem of domain shift by extracting the information that is invariant across the source and target domains.
3. Background
- The authors review some concepts that will be used in their algorithm.
- In particular, the authors briefly discuss the idea of Maximum Mean Discrepancy and introduce some notions of Grassmann manifolds.
3.1. Maximum Mean Discrepancy
- The authors are interested in measuring the dissimilarity between two probability distributions s and t. Non-parametric representations are very well-suited to visual data, which typically exhibits complex probability distributions in high-dimensional spaces.
- The authors employ the maximum mean discrepancy [17] between two distributions s and t to measure their dissimilarity.
- The MMD is an effective non-parametric criterion that compares the distributions of two sets of data by mapping the data to a Reproducing Kernel Hilbert Space (RKHS).
- In short, the MMD between the distributions of two sets of observations is equivalent to the distance between the sample means in a high-dimensional feature space.
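To make this sample-mean view concrete, here is a minimal sketch (our own, not the authors' code) of the empirical MMD under a Gaussian kernel k(a, b) = exp(-‖a − b‖² / σ); expanding the squared RKHS distance between the two sample means yields the three kernel averages below. The name `mmd_rbf` is ours.

```python
import numpy as np

def mmd_rbf(Xs, Xt, sigma=1.0):
    """Empirical MMD between source samples Xs (n x D) and target samples
    Xt (m x D) under the Gaussian kernel k(a, b) = exp(-||a - b||^2 / sigma)."""
    def k(A, B):
        # Pairwise squared Euclidean distances, then the kernel matrix.
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-sq / sigma)
    # MMD^2 = mean k(s, s) + mean k(t, t) - 2 * mean k(s, t)
    mmd2 = k(Xs, Xs).mean() + k(Xt, Xt).mean() - 2 * k(Xs, Xt).mean()
    return np.sqrt(max(mmd2, 0.0))
```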
4. Domain Invariant Projection (DIP)
- The authors introduce their approach to unsupervised domain adaptation.
- The authors first derive the optimization problem at the heart of their approach, and then discuss the details of their Grassmann manifold optimization method.
4.1. Problem Formulation
- Intuitively, with such a representation, a classifier trained on the source domain should perform equally well on the target domain.
- To achieve invariance, the authors search for a projection to a low-dimensional subspace where the source and target distributions are similar, or, in other words, a projection that minimizes a distance measure between the two distributions.
- In particular, the authors measure the distance between these two distributions with the MMD discussed in Section 3.1.
- More generally, any kernel from the class of characteristic kernels can also be employed.
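Putting the two previous points together, here is a hedged sketch of the resulting criterion (our own code, reusing the `mmd_rbf` sketch from Section 3.1; W is a D × d projection matrix with orthonormal columns, W^T W = I):

```python
def dip_objective(W, Xs, Xt, sigma):
    # Distance between the source and target distributions after projecting
    # both domains onto the d-dimensional subspace spanned by W's columns.
    return mmd_rbf(Xs @ W, Xt @ W, sigma)
```

Minimizing this over orthonormal W is the optimization problem later tackled on the Grassmann manifold in Section 4.2.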
4.1.1 Encouraging Class Clustering (DIP-CC)
- In the DIP formulation described above, learning the projection W is done in a fully unsupervised manner.
- Note, however, that even in the so-called unsupervised setting, domain adaptation methods have access to the labels of the source examples.
- Here, the authors show that their formulation naturally allows these labels to be exploited while learning the projection.
- This can be achieved by minimizing the distance between the projected samples of each class and their mean.
- Note also that the regularizer in Eq. 8 is related to the intra-class scatter in the objective function of Linear Discriminant Analysis (LDA).
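A minimal sketch of such an intra-class-scatter penalty (our own code and naming; the full DIP-CC objective would add λ times this term to the MMD criterion above):

```python
import numpy as np

def class_clustering_penalty(W, Xs, ys):
    """Summed squared distance of each projected source sample to the mean
    of its class, i.e., the intra-class scatter after projection by W."""
    Z = Xs @ W
    penalty = 0.0
    for c in np.unique(ys):
        Zc = Z[ys == c]
        penalty += np.sum((Zc - Zc.mean(axis=0)) ** 2)
    return penalty
```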
4.1.2 Semi-Supervised DIP (SS-DIP)
- The formulations of DIP given in Eqs. 7 and 8 fall into the unsupervised domain adaptation category, since they do not exploit any labeled target examples.
- Their formulation can very naturally be extended to the semi-supervised setting.
- In the unsupervised setting, this classifier is only trained using the source examples.
- With Semi-Supervised DIP (SS-DIP), the labeled target examples can be taken into account in two different manners.
- With the class-clustering regularizer of Eq. 8, the authors utilize the target labels in the regularizer when learning W , as well as when learning the final classifier.
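As a hedged illustration of the regularized manner just described (our own sketch, reusing `class_clustering_penalty` from above): the labeled target examples are simply pooled with the source ones inside the regularizer.

```python
import numpy as np

def ss_dip_cc_penalty(W, Xs, ys, Xt_labeled, yt_labeled):
    # Semi-supervised class-clustering term: labeled target examples join
    # the source examples, so both domains shape the per-class clusters.
    X = np.vstack([Xs, Xt_labeled])
    y = np.concatenate([ys, yt_labeled])
    return class_clustering_penalty(W, X, y)
```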
4.2. Optimization on a Grassmann Manifold
- All versions of their DIP formulation yield nonlinear, constrained optimization problems.
- Since W is constrained to have orthonormal columns (W^T W = I) and the objective depends only on the subspace spanned by those columns, solutions can be identified with points on a Grassmann manifold. This lets the authors rewrite their constrained optimization problem as an unconstrained problem on the manifold G(d,D).
- While their optimization problem has become unconstrained, it remains nonlinear.
- Recall from Section 3.2 that CG on a Grassmann manifold involves (i) computing the gradient on the manifold, ∇f_W, (ii) estimating the search direction H, and (iii) performing a line search along a geodesic.
- In their experiments, the authors first applied PCA to the concatenated source and target data, kept all the data variance, and initialized W to the truncated identity matrix.
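A hedged sketch of that initialization (our own code, assuming at least D samples so that the full PCA basis is recovered): once the data is expressed in the PCA basis, the initial W is simply a truncated identity.

```python
import numpy as np

def pca_init(Xs, Xt, d):
    """Rotate the concatenated data into its full PCA basis (all variance
    kept) and initialize W as the D x d truncated identity matrix."""
    X = np.vstack([Xs, Xt])
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # PCA directions
    X_pca = Xc @ Vt.T               # data expressed in the PCA basis
    W0 = np.eye(X_pca.shape[1], d)  # first d coordinates of that basis
    return X_pca[:len(Xs)], X_pca[len(Xs):], W0
```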
5. Experiments
- The authors evaluated their approach on the tasks of indoor WiFi localization and visual object recognition, and compared its performance against the state-of-the-art methods in each task.
- In all their experiments, the authors set the variance σ of the Gaussian kernel to the median squared distance between all source examples, and the weight λ of the regularizer to 4/σ when using the regularizer.
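A short sketch of these two settings (our own code; `pdist` computes all pairwise source distances for the median heuristic):

```python
import numpy as np
from scipy.spatial.distance import pdist

def kernel_hyperparams(Xs):
    # sigma: median squared pairwise distance between source examples;
    # lam: regularizer weight, set to 4 / sigma as stated above.
    sigma = np.median(pdist(Xs, metric="sqeuclidean"))
    return sigma, 4.0 / sigma
```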
5.1. Cross-domain WiFi Localization
- The authors first evaluated their approach on the task of indoor WiFi localization using the public WiFi dataset published in the 2007 IEEE ICDM Contest for domain adaptation [29].
- The goal of indoor WiFi localization is to predict the location of WiFi devices based on received signal strength (RSS) values collected during different time periods.
- The authors followed the transductive evaluation setting introduced in [24] to compare their DIP methods with TCA and SSTCA, which are considered state-of-the-art on this dataset.
- [Figure] Example images from the four domains, from left to right: Amazon, Webcam, DSLR, and Caltech.
5.2. Visual Object Recognition
- The authors then evaluated their approach on the task of visual object recognition using the benchmark domain adaptation dataset introduced in [26].
- This dataset contains images from four different domains: Amazon, DSLR, Webcam, and Caltech.
- The Amazon domain consists of images acquired in a highly-controlled environment with studio lighting conditions.
- The authors' results are presented as DIP for the original model and DIP-CC for the class-clustering-regularized one.
- Table 1 shows the recognition accuracies on the target examples for the 9 pairs of source and target domains.
Citations
4,862 citations
3,351 citations
Cites background or methods from "Unsupervised Domain Adaptation by D..."
...A rich line of prior works have focused on learning shallow features by jointly minimizing a distance metric of domain discrepancy (Pan et al., 2011; Long et al., 2013; Baktashmotlagh et al., 2013; Gong et al., 2013; Zhang et al., 2013; Ghifary et al., 2014; Wang & Schneider, 2014)....
[...]
...…tasks,A → D, D → A andW → A. Office-10 + Caltech-10(Gong et al., 2012) This dataset consists of the 10 common categories shared by the Office31 and Caltech-256 (C) (Griffin et al., 2007) datasets and is widely adopted in transfer learning methods (Long et al., 2013; Baktashmotlagh et al., 2013)....
[...]
...It has been explored to save the manual labeling efforts for machine learning (Pan et al., 2011; Zhang et al., 2013; Wang & Schneider, 2014) and computer vision (Gong et al., 2012; Baktashmotlagh et al., 2013; Long et al., 2013), etc....
[...]
3,222 citations
2,889 citations
Additional excerpts
...Some approaches perform this by reweighing or selecting samples from the source domain [3, 11, 7], while others seek an explicit feature space transformation that would map source distribution into the target ones [16, 10, 2]....
[...]
1,272 citations
References
13,011 citations
"Unsupervised Domain Adaptation by D..." refers methods in this paper
...Local scale-invariant interest points were detected by the SURF detector [2], and a 64-dimensional rotation invariant SURF descriptor was extracted from the image patch around each interest point....
[...]
3,792 citations
"Unsupervised Domain Adaptation by D..." refers background or methods or result in this paper
...By defining F as the set of functions in the unit ball in a universal RKHS H, it was shown that D′(F, s, t) = 0 if and only if s = t [17]....
[...]
...We employ the maximum mean discrepancy [17] between two distributions s and t to measure their dissimilarity....
[...]
...The fact that this kernel yields a distribution distance that only compares the first and second moment of the two distributions [17] will be shown to have little impact on our experimental results, thus showing the robustness of our approach to the choice of kernel....
[...]
...In this work, we make use of the Maximum Mean Discrepancy (MMD) [17] to measure the dissimilarity between the empirical distributions of the source and target examples....
[...]
3,195 citations
"Unsupervised Domain Adaptation by D..." refers background or methods or result in this paper
...We compare our DIP and DIP-CC results, with Gaussian or polynomial kernel in MMD, with those obtained by several state-of-the-art methods: transfer component analysis (TCA) [24], geodesic flow kernel (GFK) [15], geodesic flow sampling (GFS) [16], structural correspondence learning (SCL) [5], kernel mean matching (KMM) [18] and landmark selection (LM) [14]....
[...]
...We followed the transductive evaluation setting introduced in [24] to compare our DIP methods with TCA and SSTCA, which are considered state-of-the-art on this dataset....
[...]
...Note that our algorithms outperform TCA in both unsupervised and supervised settings....
[...]
...This is in contrast with sample re-weighting, or selection methods [21, 18, 14, 24] that place weights outside φ(·)....
[...]
...However, although motivated by MMD, in TCA, the distance between the sample means is measured in a lower-dimensional space rather than in Reproducing Kernel Hilbert Space (RKHS), which somewhat contradicts the intuition behind the use of kernels....
[...]
2,699 citations
Additional excerpts
...This dataset contains images from four different domains: Amazon, DSLR, Webcam, and Caltech....
[...]
...The last domain, Caltech [19], consists of images of 256 object classes downloaded from Google images....
[...]
2,686 citations
"Unsupervised Domain Adaptation by D..." refers background or methods in this paper
...Unlike in flat spaces, this cannot be achieved by simple translation, but requires subtracting a normal component at the end point [13]....
[...]
...In particular, we make use of a conjugate gradient (CG) algorithm on the Grassmann manifold [13]....
[...]
Frequently Asked Questions (16)
Q2. What have the authors stated for future work in "Unsupervised domain adaptation by domain invariant projection"?
Although, in practice, optimization on the Grassmann manifold has proven well-behaved, the authors intend to study whether the use of other characteristic kernels in conjunction with different optimization strategies, such as the convex-concave procedure, could yield theoretical convergence guarantees within their formalism. Finally, the authors also plan to investigate how ideas from the deep learning literature could be employed to obtain domain invariant features.
Q3. What is the tangent space at a point on a manifold?
The tangent space at a point on a manifold is a vector space that consists of the tangent vectors of all possible curves passing through this point.
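For the Grassmann manifold $\mathcal{G}(d, D)$ used in this paper, represented by $D \times d$ matrices $W$ with orthonormal columns, this abstract definition takes a concrete form (a standard result, see e.g. [13]; our addition, not a quote from the paper):

$$T_W\,\mathcal{G}(d, D) = \left\{\, H \in \mathbb{R}^{D \times d} \;:\; W^{\top} H = 0 \,\right\}.$$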
Q4. In what way did the authors first apply PCA to the concatenated source and target data?
In their experiments, the authors first applied PCA to the concatenated source and target data, kept all the data variance, and initialized W to the truncated identity matrix.
Q5. What is the purpose of the evaluation protocol?
In a second experiment, the authors used the more conventional evaluation protocol introduced in [26], which consists of splitting the data into multiple partitions.
Q6. What was the subspace disagreement measure used in all their experiments?
In all their experiments, the authors used the subspace disagreement measure of [15] to automatically determine the dimensionality of the projection matrix W .
Q7. What is the universal kernel associated with the mapping?
The MMD is computed via the kernel trick: with $\phi(\cdot)$ the mapping to the RKHS $\mathcal{H}$ and $k(\cdot,\cdot) = \langle \phi(\cdot), \phi(\cdot) \rangle$ the universal kernel associated with this mapping, the distance between the sample means expands as

$$\left\| \frac{1}{n}\sum_{i=1}^{n} \phi(\tilde{x}^i_s) - \frac{1}{m}\sum_{j=1}^{m} \phi(\tilde{x}^j_t) \right\|_{\mathcal{H}} = \left( \frac{1}{n^2}\sum_{i,j=1}^{n} k(\tilde{x}^i_s, \tilde{x}^j_s) + \frac{1}{m^2}\sum_{i,j=1}^{m} k(\tilde{x}^i_t, \tilde{x}^j_t) - \frac{2}{nm}\sum_{i,j=1}^{n,m} k(\tilde{x}^i_s, \tilde{x}^j_t) \right)^{\frac{1}{2}}.$$
Q8. What are the methods that have been proposed to create intermediate representations?
To relate the source and target domains, several state-of-the-art methods have proposed to create intermediate representations [15, 16].
Q9. What is the MMD of Eq. 5?
With the MMD of Eq. 5 based on the degree-2 polynomial kernel $k_P(\cdot,\cdot)$, $G_{ss}(i,j)$ becomes

$$G_{ss}(i,j) = 2\, k_P(x^i_s, x^j_s) \left( x^i_s {x^j_s}^{\top} + x^j_s {x^i_s}^{\top} \right) W,$$

and similarly for $G_{tt}(\cdot,\cdot)$ and $G_{st}(\cdot,\cdot)$.
Q10. What is the MMD between the distributions of two sets of observations?
In short, the MMD between the distributions of two sets of observations is equivalent to the distance between the sample means in a high-dimensional feature space.
Q11. How is the distance between the sample means measured in TCA?
Although motivated by MMD, in TCA the distance between the sample means is measured in a lower-dimensional space rather than in a Reproducing Kernel Hilbert Space (RKHS), which somewhat contradicts the intuition behind the use of kernels.
Q12. What is the effect of the Gaussian kernel on the experimental results?
The fact that this kernel yields a distribution distance that only compares the first and second moment of the two distributions [17] will be shown to have little impact on their experimental results, thus showing the robustness of their approach to the choice of kernel.
Q13. What is the CG on a Grassmann manifold?
Recall from Section 3.2 that CG on a Grassmann manifold involves (i) computing the gradient on the manifold, ∇f_W, (ii) estimating the search direction H, and (iii) performing a line search along a geodesic.
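For intuition only, here is a deliberately simplified first-order sketch (ours, not the authors' implementation): plain Riemannian gradient descent with a QR retraction standing in for steps (ii) and (iii), i.e., for conjugate search directions and the exact geodesic line search.

```python
import numpy as np

def grassmann_descent(grad_f, W0, step=1e-2, iters=100):
    """Minimize f over the Grassmann manifold, given grad_f returning the
    Euclidean gradient of f at W (a D x d matrix with orthonormal columns).
    Simplified stand-in for the conjugate gradient procedure above."""
    W = W0
    for _ in range(iters):
        G = grad_f(W)
        # (i) Gradient on the manifold: project out the component along W.
        H = -(G - W @ (W.T @ G))
        # (ii)+(iii) simplified: fixed step along -grad, then re-orthonormalize
        # via QR (a retraction in place of an exact geodesic line search).
        W, _ = np.linalg.qr(W + step * H)
    return W
```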
Q14. What is the purpose of the re-weighting?
In particular, in [21, 18], the source examples are re-weighted so as to minimize the MMD between the source and target distributions.
Q15. What is the difference between the unsupervised and the semi-supervised DIP formulations?
In the unregularized formulation of Eq. 7, since no labels are used when learning W , the authors only employ the labeled target examples along with the source ones to train the final classifier.
Q16. What is the definition of domain shift?
Domain shift is a fundamental problem in visual recognition tasks as evidenced by the recent surge of interest in domain adaptation [22, 15, 16].