
Unsupervised Domain Adaptation by Domain Invariant Projection

TL;DR: This paper learns a projection of the data to a low-dimensional latent space where the distance between the empirical distributions of the source and target examples is minimized and demonstrates the effectiveness of the approach on the task of visual object recognition.
Abstract: Domain-invariant representations are key to addressing the domain shift problem where the training and test examples follow different distributions. Existing techniques that have attempted to match the distributions of the source and target domains typically compare these distributions in the original feature space. This space, however, may not be directly suitable for such a comparison, since some of the features may have been distorted by the domain shift, or may be domain specific. In this paper, we introduce a Domain Invariant Projection approach: An unsupervised domain adaptation method that overcomes this issue by extracting the information that is invariant across the source and target domains. More specifically, we learn a projection of the data to a low-dimensional latent space where the distance between the empirical distributions of the source and target examples is minimized. We demonstrate the effectiveness of our approach on the task of visual object recognition and show that it outperforms state-of-the-art methods on a standard domain adaptation benchmark dataset.

Summary (3 min read)

1. Introduction

  • Domain shift is a fundamental problem in visual recognition tasks as evidenced by the recent surge of interest in domain adaptation [22, 15, 16].
  • Sample selection and re-weighting approaches fail to account for the fact that the image features themselves may have been distorted by the domain shift, and that some of the image features may be specific to one domain and thus irrelevant for classification in the other one.
  • In light of the above discussion, the authors propose to tackle the problem of domain shift by extracting the information that is invariant across the source and target domains.

3. Background

  • The authors review some concepts that will be used in their algorithm.
  • In particular, the authors briefly discuss the idea of Maximum Mean Discrepancy and introduce some notions of Grassmann manifolds.

3.1. Maximum Mean Discrepancy

  • The authors are interested in measuring the dissimilarity between two probability distributions s and t. Non-parametric representations are very well-suited to visual data, which typically exhibits complex probability distributions in high-dimensional spaces.
  • The authors employ the maximum mean discrepancy [17] between two distributions s and t to measure their dissimilarity.
  • The MMD is an effective non-parametric criterion that compares the distributions of two sets of data by mapping the data to an RKHS.
  • In short, the MMD between the distributions of two sets of observations is equivalent to the distance between the sample means in a high-dimensional feature space.

4. Domain Invariant Projection (DIP)

  • The authors introduce their approach to unsupervised domain adaptation.
  • The authors first derive the optimization problem at the heart of their approach, and then discuss the details of their Grassmann manifold optimization method.

4.1. Problem Formulation

  • Intuitively, with such a representation, a classifier trained on the source domain should perform equally well on the target domain.
  • To achieve invariance, the authors search for a projection to a lowdimensional subspace where the source and target distributions are similar, or, in other words, a projection that minimizes a distance measure between the two distributions.
  • In particular, the authors measure the distance between these two distributions with the MMD discussed in Section 3.1.
  • In particular, the more general class of characteristic kernels can also be employed.

4.1.1 Encouraging Class Clustering (DIP-CC)

  • In the DIP formulation described above, learning the projection W is done in a fully unsupervised manner.
  • Note, however, that even in the so-called unsupervised setting, domain adaptation methods have access to the labels of the source examples.
  • Here, the authors show that their formulation naturally allows us to exploit these labels while learning the projection.
  • This can be achieved by minimizing the distance between the projected samples of each class and their mean.
  • Note also that the regularizer in Eq. 8 is related to the intra-class scatter in the objective function of Linear Discriminant Analysis (LDA).

4.1.2 Semi-Supervised DIP (SS-DIP)

  • The formulations of DIP given in Eqs. 7 and 8 fall into the unsupervised domain adaptation category, since they do not exploit any labeled target examples.
  • Their formulation can very naturally be extended to the semi-supervised setting.
  • In the unsupervised setting, this classifier is only trained using the source examples.
  • With Semi-Supervised DIP (SS-DIP), the labeled target examples can be taken into account in two different manners.
  • With the class-clustering regularizer of Eq. 8, the authors utilize the target labels in the regularizer when learning W , as well as when learning the final classifier.

4.2. Optimization on a Grassmann Manifold

  • All versions of their DIP formulation yield nonlinear, constrained optimization problems.
  • This lets them rewrite their constrained optimization problem as an unconstrained problem on the manifold G(d,D).
  • While their optimization problem has become unconstrained, it remains nonlinear.
  • Recall from Section 3.2 that CG on a Grassmann manifold involves (i) computing the gradient on the manifold∇fW , (ii) estimating the search direction H , and (iii) performing a line search along a geodesic.
  • In their experiments, the authors first applied PCA to the concatenated source and target data, kept all the data variance, and initialized W to the truncated identity matrix.

5. Experiments

  • The authors evaluated their approach on the tasks of indoor WiFi localization and visual object recognition, and compared its performance against the state-of-the-art methods in each task.
  • In all their experiments, the authors set the variance σ of the Gaussian kernel to the median squared distance between all source examples, and the weight λ of the regularizer to 4/σ when using the regularizer.

5.1. Cross-domain WiFi Localization

  • The authors first evaluated their approach on the task of indoor WiFi localization using the public WiFi dataset published in the 2007 IEEE ICDM Contest for domain adaptation [29].
  • The goal of indoor WiFi localization is to predict the location of WiFi devices based on received signal strength (RSS) values collected during different time periods.
  • The authors followed the transductive evaluation setting introduced in [24] to compare their DIP methods with TCA and SSTCA, which are considered state-of-the-art on this dataset.
  • [Figure: example images from the four domains, from left to right: Amazon, Webcam, DSLR, and Caltech.]

5.2. Visual Object Recognition

  • The authors then evaluated their approach on the task of visual object recognition using the benchmark domain adaptation dataset introduced in [26].
  • This dataset contains images from four different domains: Amazon, DSLR, Webcam, and Caltech.
  • The Amazon domain consists of images acquired in a highly-controlled environment with studio lighting conditions.
  • The authors results are presented as DIP for the original model and DIP-CC for the class-clustering regularized one.
  • Table 1 shows the recognition accuracies on the target examples for the 9 pairs of source and target domains.


Unsupervised Domain Adaptation by Domain Invariant Projection
Mahsa Baktashmotlagh (1,3), Mehrtash T. Harandi (2,3), Brian C. Lovell (1), and Mathieu Salzmann (2,3)
(1) University of Queensland   (2) Australian National University   (3) NICTA, Canberra
mahsa.baktashmotlagh@nicta.com.au
(NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the ARC through the ICT Centre of Excellence program.)
Abstract
Domain-invariant representations are key to addressing the domain shift problem where the training and test examples follow different distributions. Existing techniques that have attempted to match the distributions of the source and target domains typically compare these distributions in the original feature space. This space, however, may not be directly suitable for such a comparison, since some of the features may have been distorted by the domain shift, or may be domain specific. In this paper, we introduce a Domain Invariant Projection approach: An unsupervised domain adaptation method that overcomes this issue by extracting the information that is invariant across the source and target domains. More specifically, we learn a projection of the data to a low-dimensional latent space where the distance between the empirical distributions of the source and target examples is minimized. We demonstrate the effectiveness of our approach on the task of visual object recognition and show that it outperforms state-of-the-art methods on a standard domain adaptation benchmark dataset.
1. Introduction
Domain shift is a fundamental problem in visual recognition tasks as evidenced by the recent surge of interest in domain adaptation [22, 15, 16]. The problem typically arises when the training (source) and test (target) examples follow different distributions. This is a common scenario in modern visual recognition tasks, especially if images are acquired with different cameras, or in very different conditions (e.g., commercial website versus home environment, images taken under different illuminations). Failing to model the distribution shift in the hope that the image features will be robust enough often yields poor recognition accuracy [26, 16, 15, 14]. On the other hand, labeling sufficiently many images from the target domain to train a discriminative classifier specific to this domain is prohibitively time-consuming and impractical in realistic scenarios.

To relate the source and target domains, several state-of-the-art methods have proposed to create intermediate representations [15, 16]. However, these representations do not explicitly try to match the probability distributions of the source and target data, which may make them sub-optimal for classification. Sample selection, or re-weighting, approaches [14, 21] explicitly attempt to match the source and target distributions by finding the most appropriate source examples for the target data. However, they fail to account for the fact that the image features themselves may have been distorted by the domain shift, and that some of the image features may be specific to one domain and thus irrelevant for classification in the other one.

In light of the above discussion, we propose to tackle the problem of domain shift by extracting the information that is invariant across the source and target domains. To this end, we introduce a Domain Invariant Projection (DIP) approach, which aims to learn a low-dimensional latent space where the source and target distributions are similar. Learning such a projection allows us to account for the potential distortions induced by the domain shift, as well as for the presence of domain-specific image features. Furthermore, since the distributions of the source and target data in the latent space are similar, we expect a classifier trained on the source examples to perform well on the target domain.

In this work, we make use of the Maximum Mean Discrepancy (MMD) [17] to measure the dissimilarity between the empirical distributions of the source and target examples. Learning the latent space that minimizes the MMD between the source and target domains can then be formulated as an optimization problem on a Grassmann manifold. This lets us utilize Grassmannian geometry to effectively obtain our domain invariant projection. Although designed to be fully unsupervised, our formalism naturally allows us to exploit label information from either domain during the training process. While not strictly necessary, this information can help boost classification accuracy even further.

In short, we introduce the idea of finding a domain invariant representation of the data by matching the source and target distributions in a low-dimensional latent space, and propose an effective algorithm to learn our Domain Invariant Projection. We demonstrate the benefits of our approach on the task of visual object recognition and show that it outperforms state-of-the-art methods on the standard domain adaptation benchmark dataset [26].
2. Related Work
Existing domain adaptation methods can be divided into two categories: Semi-supervised approaches [12, 3, 26] that assume that a small number of labeled examples from the target domain are available during training, and unsupervised approaches [15, 14, 16, 21] that do not require any labels from the target domain.

In the former category, modifications of Support Vector Machines (SVM) [12, 3] and other statistical classifiers [10] have been proposed to exploit the availability of labeled and unlabeled data from the target domain. Co-regularization of similar classifiers was also introduced to utilize unlabeled target data during training [9]. For visual recognition, metric learning [26] and transformation learning [23] were shown to be effective at making use of the labeled target examples. Furthermore, semi-supervised methods have also been proposed to tackle the case where multiple source domains are available [11, 20]. While semi-supervised methods are often effective, in many applications, labeled target examples are not available and cannot easily be acquired.

To address this issue, unsupervised domain adaptation approaches that rely on purely unsupervised target data have been proposed [28, 7, 8]. In particular, two types of methods have proven quite successful at the task of visual object recognition: Subspace-based approaches and sample re-weighting approaches.

Subspace-based approaches [4, 16, 15] model the domain shift by representing the data with multiple subspaces. In particular, in [4], coupled subspaces are learned using Canonical Correlation Analysis (CCA). Rather than limiting the representation to one source and one target subspace, several techniques exploit intermediate subspaces, which link the source data to the target data. This idea was originally introduced in [16], where the subspaces were modeled as points on a Grassmann manifold, and intermediate subspaces were obtained by sampling points along the geodesic between the source and target subspaces. This method was extended in [15], which showed that all intermediate subspaces could be taken into account by integrating along the geodesic. While this formulation nicely characterizes the change between the source and target data, it is not clear why all the subspaces along this path should yield meaningful representations. More importantly, these subspace-based methods do not explicitly exploit the statistical properties of the observed data.

In contrast, sample re-weighting, or selection, approaches have focused more directly on comparing the distributions of the source and target data. In particular, in [21, 18], the source examples are re-weighted so as to minimize the MMD between the source and target distributions. More recently, an approach to selecting landmarks among the source examples based on the MMD was introduced [14]. This sample selection approach was shown to be very effective, especially for the task of visual object recognition, to the point that it outperforms state-of-the-art semi-supervised approaches. Despite their success, it is important to note that sample re-weighting and selection methods compare the source and target distributions directly in the original feature space. This space, however, may not be appropriate for this task, since the image features may have been distorted by the domain shift, and since some of the features may only be relevant to one specific domain.

In contrast, in this work, we compare the source and target distributions in a low-dimensional latent space where these effects are removed, or reduced. This, in turn, yields a representation that significantly outperforms the recent landmark-based approach [14], as well as other state-of-the-art methods on the task of object recognition.

Transfer Component Analysis (TCA) [24] may be closest in spirit to our work. However, although motivated by the MMD, in TCA, the distance between the sample means is measured in a lower-dimensional space rather than in Reproducing Kernel Hilbert Space (RKHS), which somewhat contradicts the intuition behind the use of kernels. Here, we follow the more intuitive idea of comparing the distributions of the transformed data using the MMD. This, we believe and as suggested by our experiments, makes better use of the expressive power of the kernel in MMD.
3. Background
In this section, we review some concepts that will be used in our algorithm. In particular, we briefly discuss the idea of Maximum Mean Discrepancy and introduce some notions of Grassmann manifolds.
3.1. Maximum Mean Discrepancy
In this work, we are interested in measuring the dissimilarity between two probability distributions s and t. Rather than restricting these distributions to take a specific parametric form, we opt for a non-parametric approach to compare s and t. Non-parametric representations are very well-suited to visual data, which typically exhibits complex probability distributions in high-dimensional spaces.

We employ the maximum mean discrepancy [17] between two distributions s and t to measure their dissimilarity. The MMD is an effective non-parametric criterion that compares the distributions of two sets of data by mapping the data to an RKHS. Given two distributions s and t, the MMD between s and t is defined as

D'(\mathcal{F}, s, t) = \sup_{f \in \mathcal{F}} \left( \mathbb{E}_{\tilde{x}_s \sim s}[f(\tilde{x}_s)] - \mathbb{E}_{\tilde{x}_t \sim t}[f(\tilde{x}_t)] \right) ,

where \mathbb{E}_{\tilde{x}_s}[\cdot] is the expectation under distribution s. By defining \mathcal{F} as the set of functions in the unit ball in a universal RKHS \mathcal{H}, it was shown that D'(\mathcal{F}, s, t) = 0 if and only if s = t [17].

Let \tilde{X}_s = \{\tilde{x}_s^1, \cdots, \tilde{x}_s^n\} and \tilde{X}_t = \{\tilde{x}_t^1, \cdots, \tilde{x}_t^m\} be two sets of observations drawn i.i.d. from s and t, respectively. An empirical estimate of the MMD can be computed as

D(\tilde{X}_s, \tilde{X}_t) = \left\| \frac{1}{n} \sum_{i=1}^{n} \phi(\tilde{x}_s^i) - \frac{1}{m} \sum_{j=1}^{m} \phi(\tilde{x}_t^j) \right\|_{\mathcal{H}}
= \left( \frac{1}{n^2} \sum_{i,j=1}^{n} k(\tilde{x}_s^i, \tilde{x}_s^j) + \frac{1}{m^2} \sum_{i,j=1}^{m} k(\tilde{x}_t^i, \tilde{x}_t^j) - \frac{2}{nm} \sum_{i,j=1}^{n,m} k(\tilde{x}_s^i, \tilde{x}_t^j) \right)^{1/2} ,

where \phi(\cdot) is the mapping to the RKHS \mathcal{H}, and k(\cdot, \cdot) = \langle \phi(\cdot), \phi(\cdot) \rangle is the universal kernel associated with this mapping. In short, the MMD between the distributions of two sets of observations is equivalent to the distance between the sample means in a high-dimensional feature space.
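For concreteness, here is a minimal NumPy sketch of this empirical estimate with a Gaussian kernel, following the bandwidth convention exp(-||a - b||^2 / sigma) used later in Eq. 4 (function and variable names are ours, not the authors'):

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """k(a, b) = exp(-||a - b||^2 / sigma) for all pairs of rows of A and B."""
    sq_dists = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq_dists / sigma)

def empirical_mmd(Xs, Xt, sigma=1.0):
    """Empirical MMD between two samples Xs (n x D) and Xt (m x D)."""
    n, m = len(Xs), len(Xt)
    Kss = gaussian_kernel(Xs, Xs, sigma)
    Ktt = gaussian_kernel(Xt, Xt, sigma)
    Kst = gaussian_kernel(Xs, Xt, sigma)
    mmd_sq = Kss.sum() / n**2 + Ktt.sum() / m**2 - 2 * Kst.sum() / (n * m)
    return np.sqrt(max(mmd_sq, 0.0))  # guard against tiny negative values from rounding
```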
3.2. Grassmann Manifolds
In our formulation, we model the projection of the source and target data to a low-dimensional space as a point W on a Grassmann manifold G(d, D). The Grassmann manifold G(d, D) consists of the set of all linear d-dimensional subspaces of R^D. In particular, this lets us handle constraints of the form W^T W = I_d. Learning the projection then involves non-linear optimization on the Grassmann manifold, which requires some notions of differential geometry reviewed below.

In differential geometry, the shortest path between two points on a manifold is a curve called a geodesic. The tangent space at a point on a manifold is a vector space that consists of the tangent vectors of all possible curves passing through this point. Parallel transport is the action of transferring a tangent vector between two points on a manifold. Unlike in flat spaces, this cannot be achieved by simple translation, but requires subtracting a normal component at the end point [13].

On a Grassmann manifold, the above-mentioned operations have efficient numerical forms and can thus be used to perform optimization on the manifold. In particular, we make use of a conjugate gradient (CG) algorithm on the Grassmann manifold [13]. CG techniques are popular nonlinear optimization methods with fast convergence rates. These methods iteratively optimize the objective function in linearly independent directions called conjugate directions [25]. CG on a Grassmann manifold can be summarized by the following steps:

(i) Compute the gradient \nabla f_W of the objective function f on the manifold at the current estimate W as

\nabla f_W = \partial f_W - W W^T \partial f_W ,   (1)

with \partial f_W the matrix of usual partial derivatives.

(ii) Determine the search direction H by parallel transporting the previous search direction and combining it with \nabla f_W.

(iii) Perform a line search along the geodesic at W in the direction H.

These steps are repeated until convergence to a local minimum, or until a maximum number of iterations is reached.
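As an illustration of steps (i) and (iii), a small sketch (ours, not the authors' code) of the tangent-space projection of Eq. 1 and of a geodesic step, assuming the standard compact-SVD geodesic formula for Grassmann manifolds from Edelman et al. [13]:

```python
import numpy as np

def grassmann_gradient(W, dF):
    """Eq. 1: project the D x d matrix of partial derivatives dF onto the
    tangent space of the Grassmann manifold at W: grad = dF - W W^T dF."""
    return dF - W @ (W.T @ dF)

def geodesic_step(W, H, t):
    """Move along the geodesic starting at W in tangent direction H for step t.
    With the compact SVD H = U diag(s) V^T, the geodesic is
    W(t) = W V diag(cos(s t)) V^T + U diag(sin(s t)) V^T."""
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    return W @ Vt.T @ np.diag(np.cos(s * t)) @ Vt + U @ np.diag(np.sin(s * t)) @ Vt
```

A line search then amounts to evaluating the objective at geodesic_step(W, H, t) for several values of t and keeping the best one.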
4. Domain Invariant Projection (DIP)
In this section, we introduce our approach to unsupervised domain adaptation. We first derive the optimization problem at the heart of our approach, and then discuss the details of our Grassmann manifold optimization method.
4.1. Problem Formulation
Our goal is to find a representation of the data that is invariant across different domains. Intuitively, with such a representation, a classifier trained on the source domain should perform equally well on the target domain. To achieve invariance, we search for a projection to a low-dimensional subspace where the source and target distributions are similar, or, in other words, a projection that minimizes a distance measure between the two distributions.

More specifically, let X_s = [x_s^1, \cdots, x_s^n] be the D x n matrix containing n samples from the source domain and X_t = [x_t^1, \cdots, x_t^m] be the D x m matrix containing m samples from the target domain. We search for a D x d projection matrix W, such that the distributions of the source and target samples in the resulting d-dimensional subspace are as similar as possible. In particular, we measure the distance between these two distributions with the MMD discussed in Section 3.1. This distance can be expressed as

D(W^T X_s, W^T X_t) = \left\| \frac{1}{n} \sum_{i=1}^{n} \phi(W^T x_s^i) - \frac{1}{m} \sum_{j=1}^{m} \phi(W^T x_t^j) \right\|_{\mathcal{H}} ,   (2)

with \phi(\cdot) the mapping from R^D to the high-dimensional RKHS \mathcal{H}. Note that, here, W appears inside \phi(\cdot) in order to measure the MMD of the projected samples. This is in contrast with sample re-weighting, or selection methods [21, 18, 14, 24] that place weights outside \phi(\cdot). Therefore, these methods ultimately still compare the distributions in the original image feature space and may suffer from the presence of domain-specific features.

Using the MMD, learning W can be expressed as the optimization problem

W^* = \arg\min_W \; D^2(W^T X_s, W^T X_t) \quad \text{s.t.} \; W^T W = I_d ,   (3)

where the constraints enforce W to be orthogonal. Such constraints prevent our model from wrongly matching the two distributions by distorting the data, and make it very unlikely that the resulting subspace only contains the noise of both domains. Orthogonality constraints have proven effective in many subspace methods, such as PCA or CCA.

As shown in Section 3.1, the MMD in the RKHS \mathcal{H} can be expressed in terms of a kernel function k(\cdot, \cdot). In particular here, we exploit the Gaussian kernel function, which is known to be universal [27]. This lets us rewrite our objective function as

D^2(W^T X_s, W^T X_t) = \frac{1}{n^2} \sum_{i,j=1}^{n} \exp\!\left( -\frac{(x_s^i - x_s^j)^T W W^T (x_s^i - x_s^j)}{\sigma} \right)
+ \frac{1}{m^2} \sum_{i,j=1}^{m} \exp\!\left( -\frac{(x_t^i - x_t^j)^T W W^T (x_t^i - x_t^j)}{\sigma} \right)
- \frac{2}{mn} \sum_{i,j=1}^{n,m} \exp\!\left( -\frac{(x_s^i - x_t^j)^T W W^T (x_s^i - x_t^j)}{\sigma} \right) .   (4)
Since the Gaussian kernel satisfies the universality condition of the MMD, it is a natural choice for our approach. However, it was shown that, in practice, choices of non-universal kernels may be more appropriate to measure the MMD [6]. In particular, the more general class of characteristic kernels can also be employed. This class incorporates all strictly positive definite kernels, such as the well-known polynomial kernel. Therefore, here, we also consider using the polynomial kernel of degree two. The fact that this kernel yields a distribution distance that only compares the first and second moments of the two distributions [17] will be shown to have little impact on our experimental results, thus showing the robustness of our approach to the choice of kernel. Replacing the Gaussian kernel with this polynomial kernel in our objective function yields

D^2(W^T X_s, W^T X_t) = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \left( 1 + {x_s^i}^T W W^T x_s^j \right)^2
+ \frac{1}{m^2} \sum_{i=1}^{m} \sum_{j=1}^{m} \left( 1 + {x_t^i}^T W W^T x_t^j \right)^2
- \frac{2}{mn} \sum_{i=1}^{n} \sum_{j=1}^{m} \left( 1 + {x_s^i}^T W W^T x_t^j \right)^2 .   (5)
The two definitions of MMD introduced in Eqs. 4 and 5 can be computed efficiently in matrix form as

D^2(W^T X_s, W^T X_t) = \mathrm{Tr}(K_W L) ,   (6)

where

K_W = \begin{bmatrix} K_{s,s} & K_{s,t} \\ K_{t,s} & K_{t,t} \end{bmatrix} \in \mathbb{R}^{(n+m) \times (n+m)} , \quad
L_{ij} = \begin{cases} 1/n^2 & i, j \in \mathcal{S} \\ 1/m^2 & i, j \in \mathcal{T} \\ -1/(nm) & \text{otherwise} \end{cases} ,

with \mathcal{S} and \mathcal{T} the sets of source and target indices, respectively. Each element in K_W is computed using the kernel function (either Gaussian, or polynomial), and thus depends on W. Note that, with both kernels, K_W can be computed efficiently in matrix form (i.e., without looping over its elements). This yields the optimization problem

W^* = \arg\min_W \; \mathrm{Tr}(K_W L) \quad \text{s.t.} \; W^T W = I_d ,   (7)

which is a nonlinear constrained problem. In practice, we represent W as a point on a Grassmann manifold, which yields an unconstrained optimization problem on the manifold. As mentioned in Section 3.2, we make use of a conjugate gradient method on the manifold to obtain W^*.
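To make the matrix form concrete, here is a short NumPy sketch (our illustration, with hypothetical helper names) that builds the coefficient matrix L of Eq. 6 and evaluates the DIP objective Tr(K_W L) of Eq. 7 with the Gaussian kernel of Eq. 4:

```python
import numpy as np

def build_L(n, m):
    """Coefficient matrix L of Eq. 6: 1/n^2 on source-source entries,
    1/m^2 on target-target entries, and -1/(nm) on cross-domain entries."""
    L = np.full((n + m, n + m), -1.0 / (n * m))
    L[:n, :n] = 1.0 / n**2
    L[n:, n:] = 1.0 / m**2
    return L

def dip_objective(W, Xs, Xt, sigma):
    """DIP objective Tr(K_W L) of Eq. 7 with the Gaussian kernel of Eq. 4."""
    Z = np.vstack([Xs, Xt]) @ W                      # all samples projected to d dimensions
    sq = np.sum(Z**2, 1)
    K_W = np.exp(-(sq[:, None] + sq[None, :] - 2 * Z @ Z.T) / sigma)
    L = build_L(len(Xs), len(Xt))
    return np.sum(K_W * L)                           # equals Tr(K_W L) since L is symmetric
```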
4.1.1 Encouraging Class Clustering (DIP-CC)
In the DIP formulation described above, learning the projection W is done in a fully unsupervised manner. Note, however, that even in the so-called unsupervised setting, domain adaptation methods have access to the labels of the source examples. Here, we show that our formulation naturally allows us to exploit these labels while learning the projection.

Intuitively, we are interested in finding a projection that not only minimizes the distance between the distributions of the projected source and target data, but also yields good classification performance. To this end, we search for a projection that encourages samples with the same labels to form a more compact cluster. This can be achieved by minimizing the distance between the projected samples of each class and their mean. This yields the optimization problem

W^* = \arg\min_W \; \mathrm{Tr}(K_W L) + \lambda \sum_{c=1}^{C} \sum_{i=1}^{n_c} \left\| W^T (x_s^{i,c} - \mu_c) \right\|^2 \quad \text{s.t.} \; W^T W = I ,   (8)

where C is the number of classes, n_c the number of examples in class c, x_s^{i,c} denotes the i-th example of class c, and \mu_c the mean of the examples in class c. Note that in our formulation, the mean of the projected examples is equivalent to the projection of the mean. Note also that the regularizer in Eq. 8 is related to the intra-class scatter in the objective function of Linear Discriminant Analysis (LDA). While we also tried to incorporate the other LDA term, which encourages the means of different classes to be spread apart, we found no benefits in doing so in our results.
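A minimal sketch of the class-clustering regularizer of Eq. 8 (ours; `labels` holds the source class labels and `lam` the weight lambda):

```python
import numpy as np

def class_clustering_penalty(W, Xs, labels):
    """Sum over classes of squared distances between projected source samples
    and their projected class mean (the regularizer in Eq. 8)."""
    penalty = 0.0
    for c in np.unique(labels):
        Xc = Xs[labels == c]                 # n_c x D samples of class c
        diffs = (Xc - Xc.mean(axis=0)) @ W   # projected deviations from the class mean
        penalty += np.sum(diffs**2)
    return penalty

# Regularized DIP-CC objective of Eq. 8 (dip_objective as sketched after Eq. 7):
# f(W) = dip_objective(W, Xs, Xt, sigma) + lam * class_clustering_penalty(W, Xs, labels)
```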

4.1.2 Semi-Supervised DIP (SS-DIP)
The formulations of DIP given in Eqs. 7 and 8 fall into the unsupervised domain adaptation category, since they do not exploit any labeled target examples. However, our formulation can very naturally be extended to the semi-supervised setting. To this end, it must first be noted that, after learning W, we train a classifier in the resulting latent space (i.e., on W^T x). In the unsupervised setting, this classifier is only trained using the source examples.

With Semi-Supervised DIP (SS-DIP), the labeled target examples can be taken into account in two different manners. In the unregularized formulation of Eq. 7, since no labels are used when learning W, we only employ the labeled target examples along with the source ones to train the final classifier. With the class-clustering regularizer of Eq. 8, we utilize the target labels in the regularizer when learning W, as well as when learning the final classifier.
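To illustrate how the learned projection is used downstream, a small sketch (ours) of a 1-nearest-neighbor classifier in the latent space W^T x; the paper reports using nearest-neighbor as the final classifier in its experiments:

```python
import numpy as np

def nn_predict(W, X_train, y_train, X_test):
    """1-nearest-neighbor prediction in the learned latent space W^T x."""
    Ztr, Zte = X_train @ W, X_test @ W
    d2 = np.sum(Zte**2, 1)[:, None] + np.sum(Ztr**2, 1)[None, :] - 2 * Zte @ Ztr.T
    return y_train[np.argmin(d2, axis=1)]

# Unsupervised DIP: train on the source samples only.
#   y_pred = nn_predict(W, Xs, ys, Xt)
# SS-DIP: additionally include the few labeled target samples (Xt_lab, yt_lab).
#   y_pred = nn_predict(W, np.vstack([Xs, Xt_lab]), np.concatenate([ys, yt_lab]), Xt_unlab)
```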
4.2. Optimization on a Grassmann Manifold
All versions of our DIP formulation yield nonlinear, constrained optimization problems. To tackle this challenging scenario, we first note that the constraints on W make it a point on a Grassmann manifold. This lets us rewrite our constrained optimization problem as an unconstrained problem on the manifold G(d, D). Optimization on Grassmann manifolds has proven effective at avoiding bad local minima [1]. More specifically, manifold optimization methods often have better convergence behavior than iterative projection methods, which can be crucial with a nonlinear objective function [1].

While our optimization problem has become unconstrained, it remains nonlinear. To effectively address this, we make use of a conjugate gradient method on the manifold. Recall from Section 3.2 that CG on a Grassmann manifold involves (i) computing the gradient on the manifold \nabla f_W, (ii) estimating the search direction H, and (iii) performing a line search along a geodesic. Eq. 1 shows that the gradient on the manifold depends on the partial derivatives of the objective function w.r.t. W, i.e., \partial f / \partial W. The general form of \partial f / \partial W in our formulation is

\frac{\partial f}{\partial W} = \frac{1}{n^2} \sum_{i,j=1}^{n} G_{ss}(i, j) + \frac{1}{m^2} \sum_{i,j=1}^{m} G_{tt}(i, j) - \frac{2}{mn} \sum_{i,j=1}^{n,m} G_{st}(i, j) ,

where G_{ss}(\cdot, \cdot), G_{tt}(\cdot, \cdot) and G_{st}(\cdot, \cdot) are matrices of size D x d. With the definition of MMD in Eq. 4 based on the Gaussian kernel k_G(\cdot, \cdot), the matrix G_{ss}(i, j), e.g., takes the form

G_{ss}(i, j) = -\frac{2}{\sigma} \, k_G(x_s^i, x_s^j) \, (x_s^i - x_s^j)(x_s^i - x_s^j)^T W ,

and similarly for G_{tt}(\cdot, \cdot) and G_{st}(\cdot, \cdot). With the MMD of Eq. 5 based on the degree 2 polynomial kernel k_P(\cdot, \cdot), G_{ss}(i, j) becomes

G_{ss}(i, j) = 2 \, k_P(x_s^i, x_s^j) \, (x_s^i {x_s^j}^T + x_s^j {x_s^i}^T) W ,

and similarly for G_{tt}(\cdot, \cdot) and G_{st}(\cdot, \cdot). Like f itself, \partial f / \partial W can be efficiently computed in matrix form.

In our experiments, we first applied PCA to the concatenated source and target data, kept all the data variance, and initialized W to the truncated identity matrix. We observed that learning W typically converges in only a few iterations.

[Figure 1. Comparison of our approach with TCA on the task of indoor WiFi localization.]
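The following NumPy sketch vectorizes the Euclidean gradient \partial f / \partial W for the Gaussian-kernel objective; the graph-Laplacian rewriting of the pairwise rank-one terms is our own simplification, not taken from the paper, and the result is meant to be fed to the tangent-space projection of Eq. 1:

```python
import numpy as np

def dip_gaussian_gradient(W, Xs, Xt, sigma):
    """Euclidean gradient dF/dW of Tr(K_W L) with the Gaussian kernel,
    summing the G_ss, G_tt and G_st terms over all sample pairs at once."""
    X = np.vstack([Xs, Xt])                      # (n+m) x D combined samples
    n, m = len(Xs), len(Xt)
    L = np.full((n + m, n + m), -1.0 / (n * m))  # coefficient matrix of Eq. 6
    L[:n, :n] = 1.0 / n**2
    L[n:, n:] = 1.0 / m**2
    Z = X @ W
    sq = np.sum(Z**2, 1)
    K_W = np.exp(-(sq[:, None] + sq[None, :] - 2 * Z @ Z.T) / sigma)
    A = (-2.0 / sigma) * L * K_W                 # pairwise weight of each rank-one term
    # sum_ij A_ij (x_i - x_j)(x_i - x_j)^T W, written with a graph-Laplacian identity
    Lap = np.diag(A.sum(axis=1)) - A
    return 2.0 * X.T @ (Lap @ (X @ W))
```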
5. Experiments
We evaluated our approach on the tasks of indoor WiFi localization and visual object recognition, and compared its performance against the state-of-the-art methods in each task. In all our experiments, we set the variance \sigma of the Gaussian kernel to the median squared distance between all source examples, and the weight \lambda of the regularizer to 4/\sigma when using the regularizer.
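For instance, the kernel bandwidth described above can be computed as in the following sketch (ours), where the median is taken over pairwise squared distances between source examples:

```python
import numpy as np

def median_heuristic_sigma(Xs):
    """Set sigma to the median squared pairwise distance between source samples."""
    sq = np.sum(Xs**2, 1)
    sq_dists = sq[:, None] + sq[None, :] - 2 * Xs @ Xs.T
    iu = np.triu_indices(len(Xs), k=1)           # each pair counted once, no self-distances
    return np.median(sq_dists[iu])

# The regularizer weight is then set relative to sigma (4 / sigma in the text above).
```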
5.1. Cross-domain WiFi Localization
We first evaluated our approach on the task of indoor WiFi localization using the public WiFi dataset published in the 2007 IEEE ICDM Contest for domain adaptation [29]. The goal of indoor WiFi localization is to predict the location (labels) of WiFi devices based on received signal strength (RSS) values collected during different time periods (domains). The dataset contains 621 labeled examples collected during time period A (i.e., the source) and 3128 unlabeled examples collected during time period B (i.e., the target).

We followed the transductive evaluation setting introduced in [24] to compare our DIP methods with TCA and SSTCA, which are considered state-of-the-art on this dataset. Nearest-neighbor was employed as the final classifier for our algorithms and for the baselines. In our experiments, we used all the source data and 400 randomly sampled target examples. In Fig. 1, we report the mean Average ...

Citations
Book ChapterDOI
TL;DR: In this article, a new representation learning approach for domain adaptation is proposed, in which data at training and test time come from similar but different distributions, and features that cannot discriminate between the training (source) and test (target) domains are used to promote the emergence of features that are discriminative for the main learning task on the source domain.
Abstract: We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application.

4,862 citations

Posted Content
TL;DR: A new Deep Adaptation Network (DAN) architecture is proposed, which generalizes deep convolutional neural network to the domain adaptation scenario and can learn transferable features with statistical guarantees, and can scale linearly by unbiased estimate of kernel embedding.
Abstract: Recent studies reveal that a deep neural network can learn transferable features which generalize well to novel tasks for domain adaptation. However, as deep features eventually transition from general to specific along the network, the feature transferability drops significantly in higher layers with increasing domain discrepancy. Hence, it is important to formally reduce the dataset bias and enhance the transferability in task-specific layers. In this paper, we propose a new Deep Adaptation Network (DAN) architecture, which generalizes deep convolutional neural network to the domain adaptation scenario. In DAN, hidden representations of all task-specific layers are embedded in a reproducing kernel Hilbert space where the mean embeddings of different domain distributions can be explicitly matched. The domain discrepancy is further reduced using an optimal multi-kernel selection method for mean embedding matching. DAN can learn transferable features with statistical guarantees, and can scale linearly by unbiased estimate of kernel embedding. Extensive empirical evidence shows that the proposed architecture yields state-of-the-art image classification error rates on standard domain adaptation benchmarks.

3,351 citations


Cites background or methods from "Unsupervised Domain Adaptation by D..."

  • ...A rich line of prior works have focused on learning shallow features by jointly minimizing a distance metric of domain discrepancy (Pan et al., 2011; Long et al., 2013; Baktashmotlagh et al., 2013; Gong et al., 2013; Zhang et al., 2013; Ghifary et al., 2014; Wang & Schneider, 2014)....


  • ...…tasks,A → D, D → A andW → A. Office-10 + Caltech-10(Gong et al., 2012) This dataset consists of the 10 common categories shared by the Office31 and Caltech-256 (C) (Griffin et al., 2007) datasets and is widely adopted in transfer learning methods (Long et al., 2013; Baktashmotlagh et al., 2013)....


  • ..., 2013; Wang & Schneider, 2014) and computer vision (Gong et al., 2012; Baktashmotlagh et al., 2013; Long et al., 2013), etc....


  • ...It has been explored to save the manual labeling efforts for machine learning (Pan et al., 2011; Zhang et al., 2013; Wang & Schneider, 2014) and computer vision (Gong et al., 2012; Baktashmotlagh et al., 2013; Long et al., 2013), etc....


  • ...A rich line of prior work has focused on learning shallow features by jointly minimizing a distance metric of domain discrepancy (Pan et al., 2011; Long et al., 2013; Baktashmotlagh et al., 2013; Gong et al., 2013; Zhang et al., 2013; Ghifary et al., 2014; Wang & Schneider, 2014)....


Posted Content
TL;DR: In this paper, a gradient reversal layer is proposed to promote the emergence of deep features that are discriminative for the main learning task on the source domain and invariant with respect to the shift between the domains.
Abstract: Top-performing deep architectures are trained on massive amounts of labeled data. In the absence of labeled data for a certain task, domain adaptation often provides an attractive option given that labeled data of similar nature but from a different domain (e.g. synthetic images) are available. Here, we propose a new approach to domain adaptation in deep architectures that can be trained on large amount of labeled data from the source domain and large amount of unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of "deep" features that are (i) discriminative for the main learning task on the source domain and (ii) invariant with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a simple new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation. Overall, the approach can be implemented with little effort using any of the deep-learning packages. The method performs very well in a series of image classification experiments, achieving adaptation effect in the presence of big domain shifts and outperforming previous state-of-the-art on Office datasets.

3,222 citations

Proceedings Article
06 Jul 2015
TL;DR: The method performs very well in a series of image classification experiments, achieving adaptation effect in the presence of big domain shifts and outperforming previous state-of-the-art on Office datasets.
Abstract: Top-performing deep architectures are trained on massive amounts of labeled data. In the absence of labeled data for a certain task, domain adaptation often provides an attractive option given that labeled data of similar nature but from a different domain (e.g. synthetic images) are available. Here, we propose a new approach to domain adaptation in deep architectures that can be trained on large amount of labeled data from the source domain and large amount of unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of "deep" features that are (i) discriminative for the main learning task on the source domain and (ii) invariant with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a simple new gradient reversal layer. The resulting augmented architecture can be trained using standard back propagation. Overall, the approach can be implemented with little effort using any of the deep-learning packages. The method performs very well in a series of image classification experiments, achieving adaptation effect in the presence of big domain shifts and outperforming previous state-of-the-art on Office datasets.

2,889 citations


Additional excerpts

  • ...Some approaches perform this by reweighing or selecting samples from the source domain [3, 11, 7], while others seek an explicit feature space transformation that would map source distribution into the target ones [16, 10, 2]....


Proceedings Article
06 Jul 2015
TL;DR: Deep Adaptation Network (DAN) as mentioned in this paper embeds hidden representations of all task-specific layers in a reproducing kernel Hilbert space where the mean embeddings of different domain distributions can be explicitly matched.
Abstract: Recent studies reveal that a deep neural network can learn transferable features which generalize well to novel tasks for domain adaptation. However, as deep features eventually transition from general to specific along the network, the feature transferability drops significantly in higher layers with increasing domain discrepancy. Hence, it is important to formally reduce the dataset bias and enhance the transferability in task-specific layers. In this paper, we propose a new Deep Adaptation Network (DAN) architecture, which generalizes deep convolutional neural network to the domain adaptation scenario. In DAN, hidden representations of all task-specific layers are embedded in a reproducing kernel Hilbert space where the mean embeddings of different domain distributions can be explicitly matched. The domain discrepancy is further reduced using an optimal multikernel selection method for mean embedding matching. DAN can learn transferable features with statistical guarantees, and can scale linearly by unbiased estimate of kernel embedding. Extensive empirical evidence shows that the proposed architecture yields state-of-the-art image classification error rates on standard domain adaptation benchmarks.

1,272 citations

References
Book ChapterDOI
07 May 2006
TL;DR: A novel scale- and rotation-invariant interest point detector and descriptor, coined SURF (Speeded Up Robust Features), which approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster.
Abstract: In this paper, we present a novel scale- and rotation-invariant interest point detector and descriptor, coined SURF (Speeded Up Robust Features). It approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster. This is achieved by relying on integral images for image convolutions; by building on the strengths of the leading existing detectors and descriptors (in casu, using a Hessian matrix-based measure for the detector, and a distribution-based descriptor); and by simplifying these methods to the essential. This leads to a combination of novel detection, description, and matching steps. The paper presents experimental results on a standard evaluation set, as well as on imagery obtained in the context of a real-life object recognition application. Both show SURF's strong performance.

13,011 citations


"Unsupervised Domain Adaptation by D..." refers methods in this paper

  • ...Local scale-invariant interest points were detected by the SURF detector [2], and a 64-dimensional rotation-invariant SURF descriptor was extracted from the image patch around each interest point....


Journal ArticleDOI
TL;DR: This work proposes a framework for analyzing and comparing distributions, which is used to construct statistical tests to determine if two samples are drawn from different distributions, and presents two distribution free tests based on large deviation bounds for the maximum mean discrepancy (MMD).
Abstract: We propose a framework for analyzing and comparing distributions, which we use to construct statistical tests to determine if two samples are drawn from different distributions. Our test statistic is the largest difference in expectations over functions in the unit ball of a reproducing kernel Hilbert space (RKHS), and is called the maximum mean discrepancy (MMD).We present two distribution free tests based on large deviation bounds for the MMD, and a third test based on the asymptotic distribution of this statistic. The MMD can be computed in quadratic time, although efficient linear time approximations are available. Our statistic is an instance of an integral probability metric, and various classical metrics on distributions are obtained when alternative function classes are used in place of an RKHS. We apply our two-sample tests to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where they perform strongly. Excellent performance is also obtained when comparing distributions over graphs, for which these are the first such tests.

3,792 citations


"Unsupervised Domain Adaptation by D..." refers background or methods or result in this paper

  • ...By defining F as the set of functions in the unit ball in a universal RKHS H, it was shown that D′(F, s, t) = 0 if and only if s = t [17]....


  • ...We employ the maximum mean discrepancy [17] between two distributions s and t to measure their dissimilarity....


  • ...The fact that this kernel yields a distribution distance that only compares the first and second moment of the two distributions [17] will be shown to have little impact on our experimental results, thus showing the robustness of our approach to the choice of kernel....


  • ...In this work, we make use of the Maximum Mean Discrepancy (MMD) [17] to measure the dissimilarity between the empirical distributions of the source and target examples....


Journal ArticleDOI
TL;DR: This work proposes a novel dimensionality reduction framework for reducing the distance between domains in a latent space for domain adaptation and proposes both unsupervised and semisupervised feature extraction approaches, which can dramatically reduce thedistance between domain distributions by projecting data onto the learned transfer components.
Abstract: Domain adaptation allows knowledge from a source domain to be transferred to a different but related target domain. Intuitively, discovering a good feature representation across domains is crucial. In this paper, we first propose to find such a representation through a new learning method, transfer component analysis (TCA), for domain adaptation. TCA tries to learn some transfer components across domains in a reproducing kernel Hilbert space using maximum mean discrepancy. In the subspace spanned by these transfer components, data properties are preserved and data distributions in different domains are close to each other. As a result, with the new representations in this subspace, we can apply standard machine learning methods to train classifiers or regression models in the source domain for use in the target domain. Furthermore, in order to uncover the knowledge hidden in the relations between the data labels from the source and target domains, we extend TCA in a semisupervised learning setting, which encodes label information into transfer components learning. We call this extension semisupervised TCA. The main contribution of our work is that we propose a novel dimensionality reduction framework for reducing the distance between domains in a latent space for domain adaptation. We propose both unsupervised and semisupervised feature extraction approaches, which can dramatically reduce the distance between domain distributions by projecting data onto the learned transfer components. Finally, our approach can handle large datasets and naturally lead to out-of-sample generalization. The effectiveness and efficiency of our approach are verified by experiments on five toy datasets and two real-world applications: cross-domain indoor WiFi localization and cross-domain text classification.

3,195 citations


"Unsupervised Domain Adaptation by D..." refers background or methods or result in this paper

  • ...We compare our DIP and DIP-CC results, with Gaussian or polynomial kernel in MMD, with those obtained by several state-ofthe-art methods: transfer component analysis (TCA) [24], geodesic flow kernel (GFK) [15], geodesic flow sampling (GFS) [16], structural correspondence learning (SCL) [5], kernel mean matching (KMM) [18] and landmark selection (LM) [14]....


  • ...We followed the transductive evaluation setting introduced in [24] to compare our DIP methods with TCA and SSTCA, which are considered state-of-the-art on this dataset....


  • ...Note that our algorithms outperform TCA in both unsupervised and supervised settings....


  • ...This is in contrast with sample re-weighting, or selection methods [21, 18, 14, 24] that place weights outside φ(·)....


  • ...However, although motivated by MMD, in TCA, the distance between the sample means is measured in a lower-dimensional space rather than in Reproducing Kernel Hilbert Space (RKHS), which somewhat contradicts the intuition behind the use of kernels....


10 Mar 2007
TL;DR: A challenging set of 256 object categories containing a total of 30607 images is introduced and the clutter category is used to train an interest detector which rejects uninformative background regions.
Abstract: We introduce a challenging set of 256 object categories containing a total of 30607 images. The original Caltech-101 [1] was collected by choosing a set of object categories, downloading examples from Google Images and then manually screening out all images that did not fit the category. Caltech-256 is collected in a similar manner with several improvements: a) the number of categories is more than doubled, b) the minimum number of images in any category is increased from 31 to 80, c) artifacts due to image rotation are avoided and d) a new and larger clutter category is introduced for testing background rejection. We suggest several testing paradigms to measure classification performance, then benchmark the dataset using two simple metrics as well as a state-of-the-art spatial pyramid matching [2] algorithm. Finally we use the clutter category to train an interest detector which rejects uninformative background regions.

2,699 citations


Additional excerpts

  • ...This dataset contains images from four different domains: Amazon, DSLR, Webcam, and Caltech....


  • ...The last domain, Caltech [19], consists of images of 256 object classes downloaded from Google images....


Journal ArticleDOI
TL;DR: The theory proposed here provides a taxonomy for numerical linear algebra algorithms, offering a top-level mathematical view of previously unrelated algorithms; developers of new algorithms and perturbation theories will benefit from the theory.
Abstract: In this paper we develop new Newton and conjugate gradient algorithms on the Grassmann and Stiefel manifolds. These manifolds represent the constraints that arise in such areas as the symmetric eigenvalue problem, nonlinear eigenvalue problems, electronic structures computations, and signal processing. In addition to the new algorithms, we show how the geometrical framework gives penetrating new insights allowing us to create, understand, and compare algorithms. The theory proposed here provides a taxonomy for numerical linear algebra algorithms that provide a top level mathematical view of previously unrelated algorithms. It is our hope that developers of new algorithms and perturbation theories will benefit from the theory, methods, and examples in this paper.

2,686 citations


"Unsupervised Domain Adaptation by D..." refers background or methods in this paper

  • ...Unlike in flat spaces, this cannot be achieved by simple translation, but requires subtracting a normal component at the end point [13]....


  • ...In particular, we make use of a conjugate gradient (CG) algorithm on the Grassmann manifold [13]....


Frequently Asked Questions (16)
Q1. What have the authors contributed in "Unsupervised domain adaptation by domain invariant projection"?

Domain-invariant representations are key to addressing the domain shift problem where the training and test examples follow different distributions. In this paper, the authors introduce a Domain Invariant Projection approach: an unsupervised domain adaptation method that extracts the information that is invariant across the source and target domains. More specifically, the authors learn a projection of the data to a low-dimensional latent space where the distance between the empirical distributions of the source and target examples is minimized. The authors demonstrate the effectiveness of their approach on the task of visual object recognition and show that it outperforms state-of-the-art methods on a standard domain adaptation benchmark dataset.

Although, in practice, optimization on the Grassmann manifold has proven well-behaved, the authors intend to study if the use of other characteristic kernels in conjunction with different optimization strategies, such as the convex-concave procedure, could yield theoretical convergence guarantees within their formalism. Finally, the authors also plan to investigate how ideas from the deep learning literature could be employed to obtain domain invariant features. 

In a second experiment, the authors used the more conventional evaluation protocol introduced in [26], which consists of splitting the data into multiple partitions. 

In all their experiments, the authors used the subspace disagreement measure of [15] to automatically determine the dimensionality of the projection matrix W . 
