
Unsupervised Domain Adaptation by Domain Invariant Projection

TL;DR: This paper learns a projection of the data to a low-dimensional latent space where the distance between the empirical distributions of the source and target examples is minimized and demonstrates the effectiveness of the approach on the task of visual object recognition.
Abstract: Domain-invariant representations are key to addressing the domain shift problem where the training and test examples follow different distributions. Existing techniques that have attempted to match the distributions of the source and target domains typically compare these distributions in the original feature space. This space, however, may not be directly suitable for such a comparison, since some of the features may have been distorted by the domain shift, or may be domain specific. In this paper, we introduce a Domain Invariant Projection approach: An unsupervised domain adaptation method that overcomes this issue by extracting the information that is invariant across the source and target domains. More specifically, we learn a projection of the data to a low-dimensional latent space where the distance between the empirical distributions of the source and target examples is minimized. We demonstrate the effectiveness of our approach on the task of visual object recognition and show that it outperforms state-of-the-art methods on a standard domain adaptation benchmark dataset.

Summary (3 min read)

1. Introduction

  • Domain shift is a fundamental problem in visual recognition tasks as evidenced by the recent surge of interest in domain adaptation [22, 15, 16].
  • Sample selection and re-weighting approaches fail to account for the fact that the image features themselves may have been distorted by the domain shift, and that some of the image features may be specific to one domain and thus irrelevant for classification in the other one.
  • In light of the above discussion, the authors propose to tackle the problem of domain shift by extracting the information that is invariant across the source and target domains.

3. Background

  • The authors review some concepts that will be used in their algorithm.
  • In particular, the authors briefly discuss the idea of Maximum Mean Discrepancy and introduce some notions of Grassmann manifolds.

3.1. Maximum Mean Discrepancy

  • The authors are interested in measuring the dissimilarity between two probability distributions s and t. Non-parametric representations are very well-suited to visual data, which typically exhibits complex probability distributions in high-dimensional spaces.
  • The authors employ the maximum mean discrepancy [17] between two distributions s and t to measure their dissimilarity.
  • The MMD is an effective non-parametric criterion that compares the distributions of two sets of data by mapping the data to an RKHS.
  • In short, the MMD between the distributions of two sets of observations is equivalent to the distance between the sample means in a high-dimensional feature space.

4. Domain Invariant Projection (DIP)

  • The authors introduce their approach to unsupervised domain adaptation.
  • The authors first derive the optimization problem at the heart of their approach, and then discuss the details of their Grassmann manifold optimization method.

4.1. Problem Formulation

  • Intuitively, with such a representation, a classifier trained on the source domain should perform equally well on the target domain.
  • To achieve invariance, the authors search for a projection to a lowdimensional subspace where the source and target distributions are similar, or, in other words, a projection that minimizes a distance measure between the two distributions.
  • In particular, the authors measure the distance between these two distributions with the MMD discussed in Section 3.1.
  • In particular, the more general class of characteristic kernels can also be employed.

4.1.1 Encouraging Class Clustering (DIP-CC)

  • In the DIP formulation described above, learning the projection W is done in a fully unsupervised manner.
  • Note, however, that even in the so-called unsupervised setting, domain adaptation methods have access to the labels of the source examples.
  • Here, the authors show that their formulation naturally allows us to exploit these labels while learning the projection.
  • This can be achieved by minimizing the distance between the projected samples of each class and their mean.
  • Note also that the regularizer in Eq. 8 is related to the intra-class scatter in the objective function of Linear Discriminant Analysis (LDA).

4.1.2 Semi-Supervised DIP (SS-DIP)

  • The formulations of DIP given in Eqs. 7 and 8 fall into the unsupervised domain adaptation category, since they do not exploit any labeled target examples.
  • Their formulation can very naturally be extended to the semi-supervised setting.
  • In the unsupervised setting, this classifier is only trained using the source examples.
  • With Semi-Supervised DIP (SS-DIP), the labeled target examples can be taken into account in two different manners.
  • With the class-clustering regularizer of Eq. 8, the authors utilize the target labels in the regularizer when learning W , as well as when learning the final classifier.

4.2. Optimization on a Grassmann Manifold

  • All versions of their DIP formulation yield nonlinear, constrained optimization problems.
  • This lets them rewrite their constrained optimization problem as an unconstrained problem on the manifold G(d,D).
  • While their optimization problem has become unconstrained, it remains nonlinear.
  • Recall from Section 3.2 that CG on a Grassmann manifold involves (i) computing the gradient on the manifold∇fW , (ii) estimating the search direction H , and (iii) performing a line search along a geodesic.
  • In their experiments, the authors first applied PCA to the concatenated source and target data, kept all the data variance, and initialized W to the truncated identity matrix.

5. Experiments

  • The authors evaluated their approach on the tasks of indoor WiFi localization and visual object recognition, and compared its performance against the state-of-the-art methods in each task.
  • In all their experiments, the authors set the variance σ of the Gaussian kernel to the median squared distance between all source examples, and the weight λ of the regularizer to 4/σ when using the regularizer.

5.1. Cross-domain WiFi Localization

  • The authors first evaluated their approach on the task of indoor WiFi localization using the public WiFi dataset published in the 2007 IEEE ICDM Contest for domain adaptation [29].
  • The goal of indoor WiFi localization is to predict the location of WiFi devices based on received signal strength (RSS) values collected during different time periods.
  • The authors followed the transductive evaluation setting introduced in [24] to compare their DIP methods with TCA and SSTCA, which are considered state-of-the-art on this dataset.
  • [Figure: example images from the four domains, from left to right: Amazon, Webcam, DSLR, and Caltech.]

5.2. Visual Object Recognition

  • The authors then evaluated their approach on the task of visual object recognition using the benchmark domain adaptation dataset introduced in [26].
  • This dataset contains images from four different domains: Amazon, DSLR, Webcam, and Caltech.
  • The Amazon domain consists of images acquired in a highly-controlled environment with studio lighting conditions.
  • The authors results are presented as DIP for the original model and DIP-CC for the class-clustering regularized one.
  • Table 1 shows the recognition accuracies on the target examples for the 9 pairs of source and target domains.


Unsupervised Domain Adaptation by Domain Invariant Projection
Mahsa Baktashmotlagh (1,3), Mehrtash T. Harandi (2,3), Brian C. Lovell (1), and Mathieu Salzmann (2,3)
(1) University of Queensland   (2) Australian National University   (3) NICTA, Canberra
mahsa.baktashmotlagh@nicta.com.au
(NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the ARC through the ICT Centre of Excellence program.)
Abstract
Domain-invariant representations are key to addressing the domain shift problem where the training and test examples follow different distributions. Existing techniques that have attempted to match the distributions of the source and target domains typically compare these distributions in the original feature space. This space, however, may not be directly suitable for such a comparison, since some of the features may have been distorted by the domain shift, or may be domain specific. In this paper, we introduce a Domain Invariant Projection approach: An unsupervised domain adaptation method that overcomes this issue by extracting the information that is invariant across the source and target domains. More specifically, we learn a projection of the data to a low-dimensional latent space where the distance between the empirical distributions of the source and target examples is minimized. We demonstrate the effectiveness of our approach on the task of visual object recognition and show that it outperforms state-of-the-art methods on a standard domain adaptation benchmark dataset.
1. Introduction
Domain shift is a fundamental problem in visual recognition tasks as evidenced by the recent surge of interest in domain adaptation [22, 15, 16]. The problem typically arises when the training (source) and test (target) examples follow different distributions. This is a common scenario in modern visual recognition tasks, especially if images are acquired with different cameras, or in very different conditions (e.g., commercial website versus home environment, images taken under different illuminations). Failing to model the distribution shift in the hope that the image features will be robust enough often yields poor recognition accuracy [26, 16, 15, 14]. On the other hand, labeling sufficiently many images from the target domain to train a discriminative classifier specific to this domain is prohibitively time-consuming and impractical in realistic scenarios.

To relate the source and target domains, several state-of-the-art methods have proposed to create intermediate representations [15, 16]. However, these representations do not explicitly try to match the probability distributions of the source and target data, which may make them sub-optimal for classification. Sample selection, or re-weighting, approaches [14, 21] explicitly attempt to match the source and target distributions by finding the most appropriate source examples for the target data. However, they fail to account for the fact that the image features themselves may have been distorted by the domain shift, and that some of the image features may be specific to one domain and thus irrelevant for classification in the other one.

In light of the above discussion, we propose to tackle the problem of domain shift by extracting the information that is invariant across the source and target domains. To this end, we introduce a Domain Invariant Projection (DIP) approach, which aims to learn a low-dimensional latent space where the source and target distributions are similar. Learning such a projection allows us to account for the potential distortions induced by the domain shift, as well as for the presence of domain-specific image features. Furthermore, since the distributions of the source and target data in the latent space are similar, we expect a classifier trained on the source examples to perform well on the target domain.

In this work, we make use of the Maximum Mean Discrepancy (MMD) [17] to measure the dissimilarity between the empirical distributions of the source and target examples. Learning the latent space that minimizes the MMD between the source and target domains can then be formulated as an optimization problem on a Grassmann manifold. This lets us utilize Grassmannian geometry to effectively obtain our domain invariant projection. Although designed to be fully unsupervised, our formalism naturally allows us to exploit label information from either domain during the training process. While not strictly necessary, this information can help boost classification accuracy even further.

In short, we introduce the idea of finding a domain invariant representation of the data by matching the source and target distributions in a low-dimensional latent space, and propose an effective algorithm to learn our Domain Invariant Projection. We demonstrate the benefits of our approach on the task of visual object recognition and show that it outperforms state-of-the-art methods on the standard domain adaptation benchmark dataset [26].
2. Related Work
Existing domain adaptation methods can be divided into two categories: Semi-supervised approaches [12, 3, 26] that assume that a small number of labeled examples from the target domain are available during training, and unsupervised approaches [15, 14, 16, 21] that do not require any labels from the target domain.

In the former category, modifications of Support Vector Machines (SVM) [12, 3] and other statistical classifiers [10] have been proposed to exploit the availability of labeled and unlabeled data from the target domain. Co-regularization of similar classifiers was also introduced to utilize unlabeled target data during training [9]. For visual recognition, metric learning [26] and transformation learning [23] were shown to be effective at making use of the labeled target examples. Furthermore, semi-supervised methods have also been proposed to tackle the case where multiple source domains are available [11, 20]. While semi-supervised methods are often effective, in many applications, labeled target examples are not available and cannot easily be acquired.

To address this issue, unsupervised domain adaptation approaches that rely on purely unsupervised target data have been proposed [28, 7, 8]. In particular, two types of methods have proven quite successful at the task of visual object recognition: Subspace-based approaches and sample re-weighting approaches.

Subspace-based approaches [4, 16, 15] model the domain shift by representing the data with multiple subspaces. In particular, in [4], coupled subspaces are learned using Canonical Correlation Analysis (CCA). Rather than limiting the representation to one source and one target subspace, several techniques exploit intermediate subspaces, which link the source data to the target data. This idea was originally introduced in [16], where the subspaces were modeled as points on a Grassmann manifold, and intermediate subspaces were obtained by sampling points along the geodesic between the source and target subspaces. This method was extended in [15], which showed that all intermediate subspaces could be taken into account by integrating along the geodesic. While this formulation nicely characterizes the change between the source and target data, it is not clear why all the subspaces along this path should yield meaningful representations. More importantly, these subspace-based methods do not explicitly exploit the statistical properties of the observed data.

In contrast, sample re-weighting, or selection, approaches have focused more directly on comparing the distributions of the source and target data. In particular, in [21, 18], the source examples are re-weighted so as to minimize the MMD between the source and target distributions. More recently, an approach to selecting landmarks among the source examples based on the MMD was introduced [14]. This sample selection approach was shown to be very effective, especially for the task of visual object recognition, to the point that it outperforms state-of-the-art semi-supervised approaches. Despite their success, it is important to note that sample re-weighting and selection methods compare the source and target distributions directly in the original feature space. This space, however, may not be appropriate for this task, since the image features may have been distorted by the domain shift, and since some of the features may only be relevant to one specific domain.

In contrast, in this work, we compare the source and target distributions in a low-dimensional latent space where these effects are removed, or reduced. This, in turn, yields a representation that significantly outperforms the recent landmark-based approach [14], as well as other state-of-the-art methods on the task of object recognition.

Transfer Component Analysis (TCA) [24] may be closest in spirit to our work. However, although motivated by the MMD, in TCA, the distance between the sample means is measured in a lower-dimensional space rather than in Reproducing Kernel Hilbert Space (RKHS), which somewhat contradicts the intuition behind the use of kernels. Here, we follow the more intuitive idea of comparing the distributions of the transformed data using the MMD. This, we believe and as suggested by our experiments, makes better use of the expressive power of the kernel in MMD.
3. Background
In this section, we review some concepts that will be used in our algorithm. In particular, we briefly discuss the idea of Maximum Mean Discrepancy and introduce some notions of Grassmann manifolds.
3.1. Maximum Mean Discrepancy
In this work, we are interested in measuring the dissimilarity between two probability distributions s and t. Rather than restricting these distributions to take a specific parametric form, we opt for a non-parametric approach to compare s and t. Non-parametric representations are very well-suited to visual data, which typically exhibits complex probability distributions in high-dimensional spaces.

We employ the maximum mean discrepancy [17] between two distributions s and t to measure their dissimilarity. The MMD is an effective non-parametric criterion that compares the distributions of two sets of data by mapping the data to an RKHS. Given two distributions s and t, the MMD between s and t is defined as

D'(\mathcal{F}, s, t) = \sup_{f \in \mathcal{F}} \left( \mathbb{E}_{\tilde{x}_s \sim s}[f(\tilde{x}_s)] - \mathbb{E}_{\tilde{x}_t \sim t}[f(\tilde{x}_t)] \right) ,

where \mathbb{E}_{\tilde{x}_s}[\cdot] is the expectation under distribution s. By defining \mathcal{F} as the set of functions in the unit ball in a universal RKHS \mathcal{H}, it was shown that D'(\mathcal{F}, s, t) = 0 if and only if s = t [17].

Let \tilde{X}_s = \{\tilde{x}_s^1, \cdots, \tilde{x}_s^n\} and \tilde{X}_t = \{\tilde{x}_t^1, \cdots, \tilde{x}_t^m\} be two sets of observations drawn i.i.d. from s and t, respectively. An empirical estimate of the MMD can be computed as

D(\tilde{X}_s, \tilde{X}_t) = \left\| \frac{1}{n} \sum_{i=1}^{n} \phi(\tilde{x}_s^i) - \frac{1}{m} \sum_{j=1}^{m} \phi(\tilde{x}_t^j) \right\|_{\mathcal{H}}
= \left( \frac{1}{n^2} \sum_{i,j=1}^{n} k(\tilde{x}_s^i, \tilde{x}_s^j) + \frac{1}{m^2} \sum_{i,j=1}^{m} k(\tilde{x}_t^i, \tilde{x}_t^j) - \frac{2}{nm} \sum_{i,j=1}^{n,m} k(\tilde{x}_s^i, \tilde{x}_t^j) \right)^{1/2} ,

where \phi(\cdot) is the mapping to the RKHS \mathcal{H}, and k(\cdot, \cdot) = \langle \phi(\cdot), \phi(\cdot) \rangle is the universal kernel associated with this mapping. In short, the MMD between the distributions of two sets of observations is equivalent to the distance between the sample means in a high-dimensional feature space.
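For concreteness, here is a minimal NumPy sketch of this empirical estimate with a Gaussian kernel, following the bandwidth convention exp(-||a - b||^2 / sigma) used later in Eq. 4 (function and variable names are ours, not the authors'):

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """k(a, b) = exp(-||a - b||^2 / sigma) for all pairs of rows of A and B."""
    sq_dists = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq_dists / sigma)

def empirical_mmd(Xs, Xt, sigma=1.0):
    """Empirical MMD between two samples Xs (n x D) and Xt (m x D)."""
    n, m = len(Xs), len(Xt)
    Kss = gaussian_kernel(Xs, Xs, sigma)
    Ktt = gaussian_kernel(Xt, Xt, sigma)
    Kst = gaussian_kernel(Xs, Xt, sigma)
    mmd_sq = Kss.sum() / n**2 + Ktt.sum() / m**2 - 2 * Kst.sum() / (n * m)
    return np.sqrt(max(mmd_sq, 0.0))  # guard against tiny negative values from rounding
```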
3.2. Grassmann Manifolds
In our formulation, we model the projection of the source and target data to a low-dimensional space as a point W on a Grassmann manifold G(d, D). The Grassmann manifold G(d, D) consists of the set of all linear d-dimensional subspaces of R^D. In particular, this lets us handle constraints of the form W^T W = I_d. Learning the projection then involves non-linear optimization on the Grassmann manifold, which requires some notions of differential geometry reviewed below.

In differential geometry, the shortest path between two points on a manifold is a curve called a geodesic. The tangent space at a point on a manifold is a vector space that consists of the tangent vectors of all possible curves passing through this point. Parallel transport is the action of transferring a tangent vector between two points on a manifold. Unlike in flat spaces, this cannot be achieved by simple translation, but requires subtracting a normal component at the end point [13].

On a Grassmann manifold, the above-mentioned operations have efficient numerical forms and can thus be used to perform optimization on the manifold. In particular, we make use of a conjugate gradient (CG) algorithm on the Grassmann manifold [13]. CG techniques are popular nonlinear optimization methods with fast convergence rates. These methods iteratively optimize the objective function in linearly independent directions called conjugate directions [25]. CG on a Grassmann manifold can be summarized by the following steps:

(i) Compute the gradient \nabla f_W of the objective function f on the manifold at the current estimate W as

\nabla f_W = \partial f_W - W W^T \partial f_W ,   (1)

with \partial f_W the matrix of usual partial derivatives.

(ii) Determine the search direction H by parallel transporting the previous search direction and combining it with \nabla f_W.

(iii) Perform a line search along the geodesic at W in the direction H.

These steps are repeated until convergence to a local minimum, or until a maximum number of iterations is reached.
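As an illustration of steps (i) and (iii), a small sketch (ours, not the authors' code) of the tangent-space projection of Eq. 1 and of a geodesic step, assuming the standard compact-SVD geodesic formula for Grassmann manifolds from Edelman et al. [13]:

```python
import numpy as np

def grassmann_gradient(W, dF):
    """Eq. 1: project the D x d matrix of partial derivatives dF onto the
    tangent space of the Grassmann manifold at W: grad = dF - W W^T dF."""
    return dF - W @ (W.T @ dF)

def geodesic_step(W, H, t):
    """Move along the geodesic starting at W in tangent direction H for step t.
    With the compact SVD H = U diag(s) V^T, the geodesic is
    W(t) = W V diag(cos(s t)) V^T + U diag(sin(s t)) V^T."""
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    return W @ Vt.T @ np.diag(np.cos(s * t)) @ Vt + U @ np.diag(np.sin(s * t)) @ Vt
```

A line search then amounts to evaluating the objective at geodesic_step(W, H, t) for several values of t and keeping the best one.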
4. Domain Invariant Projection (DIP)
In this section, we introduce our approach to unsupervised domain adaptation. We first derive the optimization problem at the heart of our approach, and then discuss the details of our Grassmann manifold optimization method.
4.1. Problem Formulation
Our goal is to find a representation of the data that is invariant across different domains. Intuitively, with such a representation, a classifier trained on the source domain should perform equally well on the target domain. To achieve invariance, we search for a projection to a low-dimensional subspace where the source and target distributions are similar, or, in other words, a projection that minimizes a distance measure between the two distributions.

More specifically, let X_s = [x_s^1, \cdots, x_s^n] be the D x n matrix containing n samples from the source domain and X_t = [x_t^1, \cdots, x_t^m] be the D x m matrix containing m samples from the target domain. We search for a D x d projection matrix W, such that the distributions of the source and target samples in the resulting d-dimensional subspace are as similar as possible. In particular, we measure the distance between these two distributions with the MMD discussed in Section 3.1. This distance can be expressed as

D(W^T X_s, W^T X_t) = \left\| \frac{1}{n} \sum_{i=1}^{n} \phi(W^T x_s^i) - \frac{1}{m} \sum_{j=1}^{m} \phi(W^T x_t^j) \right\|_{\mathcal{H}} ,   (2)

with \phi(\cdot) the mapping from R^D to the high-dimensional RKHS \mathcal{H}. Note that, here, W appears inside \phi(\cdot) in order to measure the MMD of the projected samples. This is in contrast with sample re-weighting, or selection methods [21, 18, 14, 24] that place weights outside \phi(\cdot). Therefore, these methods ultimately still compare the distributions in the original image feature space and may suffer from the presence of domain-specific features.

Using the MMD, learning W can be expressed as the optimization problem

W^* = \arg\min_W \; D^2(W^T X_s, W^T X_t) \quad \text{s.t.} \; W^T W = I_d ,   (3)

where the constraints enforce W to be orthogonal. Such constraints prevent our model from wrongly matching the two distributions by distorting the data, and make it very unlikely that the resulting subspace only contains the noise of both domains. Orthogonality constraints have proven effective in many subspace methods, such as PCA or CCA.

As shown in Section 3.1, the MMD in the RKHS \mathcal{H} can be expressed in terms of a kernel function k(\cdot, \cdot). In particular here, we exploit the Gaussian kernel function, which is known to be universal [27]. This lets us rewrite our objective function as

D^2(W^T X_s, W^T X_t) = \frac{1}{n^2} \sum_{i,j=1}^{n} \exp\!\left( -\frac{(x_s^i - x_s^j)^T W W^T (x_s^i - x_s^j)}{\sigma} \right)
+ \frac{1}{m^2} \sum_{i,j=1}^{m} \exp\!\left( -\frac{(x_t^i - x_t^j)^T W W^T (x_t^i - x_t^j)}{\sigma} \right)
- \frac{2}{mn} \sum_{i,j=1}^{n,m} \exp\!\left( -\frac{(x_s^i - x_t^j)^T W W^T (x_s^i - x_t^j)}{\sigma} \right) .   (4)
Since the Gaussian kernel satisfies the universality condition of the MMD, it is a natural choice for our approach. However, it was shown that, in practice, choices of non-universal kernels may be more appropriate to measure the MMD [6]. In particular, the more general class of characteristic kernels can also be employed. This class incorporates all strictly positive definite kernels, such as the well-known polynomial kernel. Therefore, here, we also consider using the polynomial kernel of degree two. The fact that this kernel yields a distribution distance that only compares the first and second moments of the two distributions [17] will be shown to have little impact on our experimental results, thus showing the robustness of our approach to the choice of kernel. Replacing the Gaussian kernel with this polynomial kernel in our objective function yields

D^2(W^T X_s, W^T X_t) = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \left( 1 + {x_s^i}^T W W^T x_s^j \right)^2
+ \frac{1}{m^2} \sum_{i=1}^{m} \sum_{j=1}^{m} \left( 1 + {x_t^i}^T W W^T x_t^j \right)^2
- \frac{2}{mn} \sum_{i=1}^{n} \sum_{j=1}^{m} \left( 1 + {x_s^i}^T W W^T x_t^j \right)^2 .   (5)
The two definitions of MMD introduced in Eqs. 4 and 5 can be computed efficiently in matrix form as

D^2(W^T X_s, W^T X_t) = \mathrm{Tr}(K_W L) ,   (6)

where

K_W = \begin{bmatrix} K_{s,s} & K_{s,t} \\ K_{t,s} & K_{t,t} \end{bmatrix} \in \mathbb{R}^{(n+m) \times (n+m)} , \quad
L_{ij} = \begin{cases} 1/n^2 & i, j \in \mathcal{S} \\ 1/m^2 & i, j \in \mathcal{T} \\ -1/(nm) & \text{otherwise} \end{cases} ,

with \mathcal{S} and \mathcal{T} the sets of source and target indices, respectively. Each element in K_W is computed using the kernel function (either Gaussian, or polynomial), and thus depends on W. Note that, with both kernels, K_W can be computed efficiently in matrix form (i.e., without looping over its elements). This yields the optimization problem

W^* = \arg\min_W \; \mathrm{Tr}(K_W L) \quad \text{s.t.} \; W^T W = I_d ,   (7)

which is a nonlinear constrained problem. In practice, we represent W as a point on a Grassmann manifold, which yields an unconstrained optimization problem on the manifold. As mentioned in Section 3.2, we make use of a conjugate gradient method on the manifold to obtain W^*.
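To make the matrix form concrete, here is a short NumPy sketch (our illustration, with hypothetical helper names) that builds the coefficient matrix L of Eq. 6 and evaluates the DIP objective Tr(K_W L) of Eq. 7 with the Gaussian kernel of Eq. 4:

```python
import numpy as np

def build_L(n, m):
    """Coefficient matrix L of Eq. 6: 1/n^2 on source-source entries,
    1/m^2 on target-target entries, and -1/(nm) on cross-domain entries."""
    L = np.full((n + m, n + m), -1.0 / (n * m))
    L[:n, :n] = 1.0 / n**2
    L[n:, n:] = 1.0 / m**2
    return L

def dip_objective(W, Xs, Xt, sigma):
    """DIP objective Tr(K_W L) of Eq. 7 with the Gaussian kernel of Eq. 4."""
    Z = np.vstack([Xs, Xt]) @ W                      # all samples projected to d dimensions
    sq = np.sum(Z**2, 1)
    K_W = np.exp(-(sq[:, None] + sq[None, :] - 2 * Z @ Z.T) / sigma)
    L = build_L(len(Xs), len(Xt))
    return np.sum(K_W * L)                           # equals Tr(K_W L) since L is symmetric
```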
4.1.1 Encouraging Class Clustering (DIP-CC)
In the DIP formulation described above, learning the projection W is done in a fully unsupervised manner. Note, however, that even in the so-called unsupervised setting, domain adaptation methods have access to the labels of the source examples. Here, we show that our formulation naturally allows us to exploit these labels while learning the projection.

Intuitively, we are interested in finding a projection that not only minimizes the distance between the distributions of the projected source and target data, but also yields good classification performance. To this end, we search for a projection that encourages samples with the same labels to form a more compact cluster. This can be achieved by minimizing the distance between the projected samples of each class and their mean. This yields the optimization problem

W^* = \arg\min_W \; \mathrm{Tr}(K_W L) + \lambda \sum_{c=1}^{C} \sum_{i=1}^{n_c} \left\| W^T (x_s^{i,c} - \mu_c) \right\|^2 \quad \text{s.t.} \; W^T W = I ,   (8)

where C is the number of classes, n_c the number of examples in class c, x_s^{i,c} denotes the i-th example of class c, and \mu_c the mean of the examples in class c. Note that in our formulation, the mean of the projected examples is equivalent to the projection of the mean. Note also that the regularizer in Eq. 8 is related to the intra-class scatter in the objective function of Linear Discriminant Analysis (LDA). While we also tried to incorporate the other LDA term, which encourages the means of different classes to be spread apart, we found no benefits in doing so in our results.
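A minimal sketch of the class-clustering regularizer of Eq. 8 (ours; `labels` holds the source class labels and `lam` the weight lambda):

```python
import numpy as np

def class_clustering_penalty(W, Xs, labels):
    """Sum over classes of squared distances between projected source samples
    and their projected class mean (the regularizer in Eq. 8)."""
    penalty = 0.0
    for c in np.unique(labels):
        Xc = Xs[labels == c]                 # n_c x D samples of class c
        diffs = (Xc - Xc.mean(axis=0)) @ W   # projected deviations from the class mean
        penalty += np.sum(diffs**2)
    return penalty

# Regularized DIP-CC objective of Eq. 8 (dip_objective as sketched after Eq. 7):
# f(W) = dip_objective(W, Xs, Xt, sigma) + lam * class_clustering_penalty(W, Xs, labels)
```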

4.1.2 Semi-Supervised DIP (SS-DIP)
The formulations of DIP given in Eqs. 7 and 8 fall into the unsupervised domain adaptation category, since they do not exploit any labeled target examples. However, our formulation can very naturally be extended to the semi-supervised setting. To this end, it must first be noted that, after learning W, we train a classifier in the resulting latent space (i.e., on W^T x). In the unsupervised setting, this classifier is only trained using the source examples.

With Semi-Supervised DIP (SS-DIP), the labeled target examples can be taken into account in two different manners. In the unregularized formulation of Eq. 7, since no labels are used when learning W, we only employ the labeled target examples along with the source ones to train the final classifier. With the class-clustering regularizer of Eq. 8, we utilize the target labels in the regularizer when learning W, as well as when learning the final classifier.
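To illustrate how the learned projection is used downstream, a small sketch (ours) of a 1-nearest-neighbor classifier in the latent space W^T x; the paper reports using nearest-neighbor as the final classifier in its experiments:

```python
import numpy as np

def nn_predict(W, X_train, y_train, X_test):
    """1-nearest-neighbor prediction in the learned latent space W^T x."""
    Ztr, Zte = X_train @ W, X_test @ W
    d2 = np.sum(Zte**2, 1)[:, None] + np.sum(Ztr**2, 1)[None, :] - 2 * Zte @ Ztr.T
    return y_train[np.argmin(d2, axis=1)]

# Unsupervised DIP: train on the source samples only.
#   y_pred = nn_predict(W, Xs, ys, Xt)
# SS-DIP: additionally include the few labeled target samples (Xt_lab, yt_lab).
#   y_pred = nn_predict(W, np.vstack([Xs, Xt_lab]), np.concatenate([ys, yt_lab]), Xt_unlab)
```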
4.2. Optimization on a Grassmann Manifold
All versions of our DIP formulation yield nonlinear, constrained optimization problems. To tackle this challenging scenario, we first note that the constraints on W make it a point on a Grassmann manifold. This lets us rewrite our constrained optimization problem as an unconstrained problem on the manifold G(d, D). Optimization on Grassmann manifolds has proven effective at avoiding bad local minima [1]. More specifically, manifold optimization methods often have better convergence behavior than iterative projection methods, which can be crucial with a nonlinear objective function [1].

While our optimization problem has become unconstrained, it remains nonlinear. To effectively address this, we make use of a conjugate gradient method on the manifold. Recall from Section 3.2 that CG on a Grassmann manifold involves (i) computing the gradient on the manifold \nabla f_W, (ii) estimating the search direction H, and (iii) performing a line search along a geodesic. Eq. 1 shows that the gradient on the manifold depends on the partial derivatives of the objective function w.r.t. W, i.e., \partial f / \partial W. The general form of \partial f / \partial W in our formulation is

\frac{\partial f}{\partial W} = \frac{1}{n^2} \sum_{i,j=1}^{n} G_{ss}(i, j) + \frac{1}{m^2} \sum_{i,j=1}^{m} G_{tt}(i, j) - \frac{2}{mn} \sum_{i,j=1}^{n,m} G_{st}(i, j) ,

where G_{ss}(\cdot, \cdot), G_{tt}(\cdot, \cdot) and G_{st}(\cdot, \cdot) are matrices of size D x d. With the definition of MMD in Eq. 4 based on the Gaussian kernel k_G(\cdot, \cdot), the matrix G_{ss}(i, j), e.g., takes the form

G_{ss}(i, j) = -\frac{2}{\sigma} \, k_G(x_s^i, x_s^j) \, (x_s^i - x_s^j)(x_s^i - x_s^j)^T W ,

and similarly for G_{tt}(\cdot, \cdot) and G_{st}(\cdot, \cdot). With the MMD of Eq. 5 based on the degree 2 polynomial kernel k_P(\cdot, \cdot), G_{ss}(i, j) becomes

G_{ss}(i, j) = 2 \, k_P(x_s^i, x_s^j) \, (x_s^i {x_s^j}^T + x_s^j {x_s^i}^T) W ,

and similarly for G_{tt}(\cdot, \cdot) and G_{st}(\cdot, \cdot). Like f itself, \partial f / \partial W can be efficiently computed in matrix form.

In our experiments, we first applied PCA to the concatenated source and target data, kept all the data variance, and initialized W to the truncated identity matrix. We observed that learning W typically converges in only a few iterations.

[Figure 1. Comparison of our approach with TCA on the task of indoor WiFi localization.]
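The following NumPy sketch vectorizes the Euclidean gradient \partial f / \partial W for the Gaussian-kernel objective; the graph-Laplacian rewriting of the pairwise rank-one terms is our own simplification, not taken from the paper, and the result is meant to be fed to the tangent-space projection of Eq. 1:

```python
import numpy as np

def dip_gaussian_gradient(W, Xs, Xt, sigma):
    """Euclidean gradient dF/dW of Tr(K_W L) with the Gaussian kernel,
    summing the G_ss, G_tt and G_st terms over all sample pairs at once."""
    X = np.vstack([Xs, Xt])                      # (n+m) x D combined samples
    n, m = len(Xs), len(Xt)
    L = np.full((n + m, n + m), -1.0 / (n * m))  # coefficient matrix of Eq. 6
    L[:n, :n] = 1.0 / n**2
    L[n:, n:] = 1.0 / m**2
    Z = X @ W
    sq = np.sum(Z**2, 1)
    K_W = np.exp(-(sq[:, None] + sq[None, :] - 2 * Z @ Z.T) / sigma)
    A = (-2.0 / sigma) * L * K_W                 # pairwise weight of each rank-one term
    # sum_ij A_ij (x_i - x_j)(x_i - x_j)^T W, written with a graph-Laplacian identity
    Lap = np.diag(A.sum(axis=1)) - A
    return 2.0 * X.T @ (Lap @ (X @ W))
```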
5. Experiments
We evaluated our approach on the tasks of indoor WiFi localization and visual object recognition, and compared its performance against the state-of-the-art methods in each task. In all our experiments, we set the variance \sigma of the Gaussian kernel to the median squared distance between all source examples, and the weight \lambda of the regularizer to 4/\sigma when using the regularizer.
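For instance, the kernel bandwidth described above can be computed as in the following sketch (ours), where the median is taken over pairwise squared distances between source examples:

```python
import numpy as np

def median_heuristic_sigma(Xs):
    """Set sigma to the median squared pairwise distance between source samples."""
    sq = np.sum(Xs**2, 1)
    sq_dists = sq[:, None] + sq[None, :] - 2 * Xs @ Xs.T
    iu = np.triu_indices(len(Xs), k=1)           # each pair counted once, no self-distances
    return np.median(sq_dists[iu])

# The regularizer weight is then set relative to sigma (4 / sigma in the text above).
```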
5.1. Cross-domain WiFi Localization
We first evaluated our approach on the task of indoor WiFi localization using the public WiFi dataset published in the 2007 IEEE ICDM Contest for domain adaptation [29]. The goal of indoor WiFi localization is to predict the location (labels) of WiFi devices based on received signal strength (RSS) values collected during different time periods (domains). The dataset contains 621 labeled examples collected during time period A (i.e., the source) and 3128 unlabeled examples collected during time period B (i.e., the target).

We followed the transductive evaluation setting introduced in [24] to compare our DIP methods with TCA and SSTCA, which are considered state-of-the-art on this dataset. Nearest-neighbor was employed as the final classifier for our algorithms and for the baselines. In our experiments, we used all the source data and 400 randomly sampled target examples. In Fig. 1, we report the mean Average ...

Citations
Book ChapterDOI
TL;DR: In this article, a new representation learning approach for domain adaptation is proposed, in which data at training and test time come from similar but different distributions, and features that cannot discriminate between the training (source) and test (target) domains are used to promote the emergence of features that are discriminative for the main learning task on the source domain.
Abstract: We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application.

4,862 citations

Posted Content
TL;DR: A new Deep Adaptation Network (DAN) architecture is proposed, which generalizes deep convolutional neural network to the domain adaptation scenario and can learn transferable features with statistical guarantees, and can scale linearly by unbiased estimate of kernel embedding.
Abstract: Recent studies reveal that a deep neural network can learn transferable features which generalize well to novel tasks for domain adaptation. However, as deep features eventually transition from general to specific along the network, the feature transferability drops significantly in higher layers with increasing domain discrepancy. Hence, it is important to formally reduce the dataset bias and enhance the transferability in task-specific layers. In this paper, we propose a new Deep Adaptation Network (DAN) architecture, which generalizes deep convolutional neural network to the domain adaptation scenario. In DAN, hidden representations of all task-specific layers are embedded in a reproducing kernel Hilbert space where the mean embeddings of different domain distributions can be explicitly matched. The domain discrepancy is further reduced using an optimal multi-kernel selection method for mean embedding matching. DAN can learn transferable features with statistical guarantees, and can scale linearly by unbiased estimate of kernel embedding. Extensive empirical evidence shows that the proposed architecture yields state-of-the-art image classification error rates on standard domain adaptation benchmarks.

3,351 citations


Cites background or methods from "Unsupervised Domain Adaptation by D..."

  • ...A rich line of prior works have focused on learning shallow features by jointly minimizing a distance metric of domain discrepancy (Pan et al., 2011; Long et al., 2013; Baktashmotlagh et al., 2013; Gong et al., 2013; Zhang et al., 2013; Ghifary et al., 2014; Wang & Schneider, 2014)....


  • ...…tasks,A → D, D → A andW → A. Office-10 + Caltech-10(Gong et al., 2012) This dataset consists of the 10 common categories shared by the Office31 and Caltech-256 (C) (Griffin et al., 2007) datasets and is widely adopted in transfer learning methods (Long et al., 2013; Baktashmotlagh et al., 2013)....


  • ..., 2013; Wang & Schneider, 2014) and computer vision (Gong et al., 2012; Baktashmotlagh et al., 2013; Long et al., 2013), etc....


  • ...It has been explored to save the manual labeling efforts for machine learning (Pan et al., 2011; Zhang et al., 2013; Wang & Schneider, 2014) and computer vision (Gong et al., 2012; Baktashmotlagh et al., 2013; Long et al., 2013), etc....


  • ...A rich line of prior work has focused on learning shallow features by jointly minimizing a distance metric of domain discrepancy (Pan et al., 2011; Long et al., 2013; Baktashmotlagh et al., 2013; Gong et al., 2013; Zhang et al., 2013; Ghifary et al., 2014; Wang & Schneider, 2014)....


Posted Content
TL;DR: In this paper, a gradient reversal layer is proposed to promote the emergence of deep features that are discriminative for the main learning task on the source domain and invariant with respect to the shift between the domains.
Abstract: Top-performing deep architectures are trained on massive amounts of labeled data. In the absence of labeled data for a certain task, domain adaptation often provides an attractive option given that labeled data of similar nature but from a different domain (e.g. synthetic images) are available. Here, we propose a new approach to domain adaptation in deep architectures that can be trained on large amount of labeled data from the source domain and large amount of unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of "deep" features that are (i) discriminative for the main learning task on the source domain and (ii) invariant with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a simple new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation. Overall, the approach can be implemented with little effort using any of the deep-learning packages. The method performs very well in a series of image classification experiments, achieving adaptation effect in the presence of big domain shifts and outperforming previous state-of-the-art on Office datasets.

3,222 citations

Proceedings Article
06 Jul 2015
TL;DR: The method performs very well in a series of image classification experiments, achieving adaptation effect in the presence of big domain shifts and outperforming previous state-of-the-art on Office datasets.
Abstract: Top-performing deep architectures are trained on massive amounts of labeled data. In the absence of labeled data for a certain task, domain adaptation often provides an attractive option given that labeled data of similar nature but from a different domain (e.g. synthetic images) are available. Here, we propose a new approach to domain adaptation in deep architectures that can be trained on large amount of labeled data from the source domain and large amount of unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of "deep" features that are (i) discriminative for the main learning task on the source domain and (ii) invariant with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a simple new gradient reversal layer. The resulting augmented architecture can be trained using standard back propagation. Overall, the approach can be implemented with little effort using any of the deep-learning packages. The method performs very well in a series of image classification experiments, achieving adaptation effect in the presence of big domain shifts and outperforming previous state-of-the-art on Office datasets.

2,889 citations


Additional excerpts

  • ...Some approaches perform this by reweighing or selecting samples from the source domain [3, 11, 7], while others seek an explicit feature space transformation that would map source distribution into the target ones [16, 10, 2]....


Proceedings Article
06 Jul 2015
TL;DR: Deep Adaptation Network (DAN) as mentioned in this paper embeds hidden representations of all task-specific layers in a reproducing kernel Hilbert space where the mean embeddings of different domain distributions can be explicitly matched.
Abstract: Recent studies reveal that a deep neural network can learn transferable features which generalize well to novel tasks for domain adaptation. However, as deep features eventually transition from general to specific along the network, the feature transferability drops significantly in higher layers with increasing domain discrepancy. Hence, it is important to formally reduce the dataset bias and enhance the transferability in task-specific layers. In this paper, we propose a new Deep Adaptation Network (DAN) architecture, which generalizes deep convolutional neural network to the domain adaptation scenario. In DAN, hidden representations of all task-specific layers are embedded in a reproducing kernel Hilbert space where the mean embeddings of different domain distributions can be explicitly matched. The domain discrepancy is further reduced using an optimal multikernel selection method for mean embedding matching. DAN can learn transferable features with statistical guarantees, and can scale linearly by unbiased estimate of kernel embedding. Extensive empirical evidence shows that the proposed architecture yields state-of-the-art image classification error rates on standard domain adaptation benchmarks.

1,272 citations

References
Book ChapterDOI
07 May 2006
TL;DR: A novel scale- and rotation-invariant interest point detector and descriptor, coined SURF (Speeded Up Robust Features), which approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster.
Abstract: In this paper, we present a novel scale- and rotation-invariant interest point detector and descriptor, coined SURF (Speeded Up Robust Features). It approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster. This is achieved by relying on integral images for image convolutions; by building on the strengths of the leading existing detectors and descriptors (in casu, using a Hessian matrix-based measure for the detector, and a distribution-based descriptor); and by simplifying these methods to the essential. This leads to a combination of novel detection, description, and matching steps. The paper presents experimental results on a standard evaluation set, as well as on imagery obtained in the context of a real-life object recognition application. Both show SURF's strong performance.

13,011 citations


"Unsupervised Domain Adaptation by D..." refers methods in this paper

  • ...Local scale-invariant interest points were detected by the SURF detector [2], and a 64-dimensional rotation-invariant SURF descriptor was extracted from the image patch around each interest point....


Journal ArticleDOI
TL;DR: This work proposes a framework for analyzing and comparing distributions, which is used to construct statistical tests to determine if two samples are drawn from different distributions, and presents two distribution free tests based on large deviation bounds for the maximum mean discrepancy (MMD).
Abstract: We propose a framework for analyzing and comparing distributions, which we use to construct statistical tests to determine if two samples are drawn from different distributions. Our test statistic is the largest difference in expectations over functions in the unit ball of a reproducing kernel Hilbert space (RKHS), and is called the maximum mean discrepancy (MMD).We present two distribution free tests based on large deviation bounds for the MMD, and a third test based on the asymptotic distribution of this statistic. The MMD can be computed in quadratic time, although efficient linear time approximations are available. Our statistic is an instance of an integral probability metric, and various classical metrics on distributions are obtained when alternative function classes are used in place of an RKHS. We apply our two-sample tests to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where they perform strongly. Excellent performance is also obtained when comparing distributions over graphs, for which these are the first such tests.

3,792 citations


"Unsupervised Domain Adaptation by D..." refers background or methods or result in this paper

  • ...By defining F as the set of functions in the unit ball in a universal RKHS H, it was shown that D′(F, s, t) = 0 if and only if s = t [17]....


  • ...We employ the maximum mean discrepancy [17] between two distributions s and t to measure their dissimilarity....


  • ...The fact that this kernel yields a distribution distance that only compares the first and second moment of the two distributions [17] will be shown to have little impact on our experimental results, thus showing the robustness of our approach to the choice of kernel....


  • ...In this work, we make use of the Maximum Mean Discrepancy (MMD) [17] to measure the dissimilarity between the empirical distributions of the source and target examples....


Journal ArticleDOI
TL;DR: This work proposes a novel dimensionality reduction framework for reducing the distance between domains in a latent space for domain adaptation and proposes both unsupervised and semisupervised feature extraction approaches, which can dramatically reduce thedistance between domain distributions by projecting data onto the learned transfer components.
Abstract: Domain adaptation allows knowledge from a source domain to be transferred to a different but related target domain. Intuitively, discovering a good feature representation across domains is crucial. In this paper, we first propose to find such a representation through a new learning method, transfer component analysis (TCA), for domain adaptation. TCA tries to learn some transfer components across domains in a reproducing kernel Hilbert space using maximum mean discrepancy. In the subspace spanned by these transfer components, data properties are preserved and data distributions in different domains are close to each other. As a result, with the new representations in this subspace, we can apply standard machine learning methods to train classifiers or regression models in the source domain for use in the target domain. Furthermore, in order to uncover the knowledge hidden in the relations between the data labels from the source and target domains, we extend TCA in a semisupervised learning setting, which encodes label information into transfer components learning. We call this extension semisupervised TCA. The main contribution of our work is that we propose a novel dimensionality reduction framework for reducing the distance between domains in a latent space for domain adaptation. We propose both unsupervised and semisupervised feature extraction approaches, which can dramatically reduce the distance between domain distributions by projecting data onto the learned transfer components. Finally, our approach can handle large datasets and naturally lead to out-of-sample generalization. The effectiveness and efficiency of our approach are verified by experiments on five toy datasets and two real-world applications: cross-domain indoor WiFi localization and cross-domain text classification.

3,195 citations


"Unsupervised Domain Adaptation by D..." refers background or methods or result in this paper

  • ...We compare our DIP and DIP-CC results, with Gaussian or polynomial kernel in MMD, with those obtained by several state-ofthe-art methods: transfer component analysis (TCA) [24], geodesic flow kernel (GFK) [15], geodesic flow sampling (GFS) [16], structural correspondence learning (SCL) [5], kernel mean matching (KMM) [18] and landmark selection (LM) [14]....


  • ...We followed the transductive evaluation setting introduced in [24] to compare our DIP methods with TCA and SSTCA, which are considered state-of-the-art on this dataset....


  • ...Note that our algorithms outperform TCA in both unsupervised and supervised settings....


  • ...This is in contrast with sample re-weighting, or selection methods [21, 18, 14, 24] that place weights outside φ(·)....


  • ...However, although motivated by MMD, in TCA, the distance between the sample means is measured in a lower-dimensional space rather than in Reproducing Kernel Hilbert Space (RKHS), which somewhat contradicts the intuition behind the use of kernels....


10 Mar 2007
TL;DR: A challenging set of 256 object categories containing a total of 30607 images is introduced and the clutter category is used to train an interest detector which rejects uninformative background regions.
Abstract: We introduce a challenging set of 256 object categories containing a total of 30607 images. The original Caltech-101 [1] was collected by choosing a set of object categories, downloading examples from Google Images and then manually screening out all images that did not fit the category. Caltech-256 is collected in a similar manner with several improvements: a) the number of categories is more than doubled, b) the minimum number of images in any category is increased from 31 to 80, c) artifacts due to image rotation are avoided and d) a new and larger clutter category is introduced for testing background rejection. We suggest several testing paradigms to measure classification performance, then benchmark the dataset using two simple metrics as well as a state-of-the-art spatial pyramid matching [2] algorithm. Finally we use the clutter category to train an interest detector which rejects uninformative background regions.

2,699 citations


Additional excerpts

  • ...This dataset contains images from four different domains: Amazon, DSLR, Webcam, and Caltech....


  • ...The last domain, Caltech [19], consists of images of 256 object classes downloaded from Google images....


Journal ArticleDOI
TL;DR: The theory proposed here provides a taxonomy for numerical linear algebra algorithms, offering a top-level mathematical view of previously unrelated algorithms; developers of new algorithms and perturbation theories will benefit from the theory.
Abstract: In this paper we develop new Newton and conjugate gradient algorithms on the Grassmann and Stiefel manifolds. These manifolds represent the constraints that arise in such areas as the symmetric eigenvalue problem, nonlinear eigenvalue problems, electronic structures computations, and signal processing. In addition to the new algorithms, we show how the geometrical framework gives penetrating new insights allowing us to create, understand, and compare algorithms. The theory proposed here provides a taxonomy for numerical linear algebra algorithms that provide a top level mathematical view of previously unrelated algorithms. It is our hope that developers of new algorithms and perturbation theories will benefit from the theory, methods, and examples in this paper.

2,686 citations


"Unsupervised Domain Adaptation by D..." refers background or methods in this paper

  • ...Unlike in flat spaces, this cannot be achieved by simple translation, but requires subtracting a normal component at the end point [13]....


  • ...In particular, we make use of a conjugate gradient (CG) algorithm on the Grassmann manifold [13]....


Frequently Asked Questions (16)
Q1. What have the authors contributed in "Unsupervised domain adaptation by domain invariant projection"?

Domain-invariant representations are key to addressing the domain shift problem where the training and test examples follow different distributions. In this paper, the authors introduce a Domain Invariant Projection approach: an unsupervised domain adaptation method that extracts the information that is invariant across the source and target domains. More specifically, the authors learn a projection of the data to a low-dimensional latent space where the distance between the empirical distributions of the source and target examples is minimized. The authors demonstrate the effectiveness of their approach on the task of visual object recognition and show that it outperforms state-of-the-art methods on a standard domain adaptation benchmark dataset.

Although, in practice, optimization on the Grassmann manifold has proven well-behaved, the authors intend to study if the use of other characteristic kernels in conjunction with different optimization strategies, such as the convex-concave procedure, could yield theoretical convergence guarantees within their formalism. Finally, the authors also plan to investigate how ideas from the deep learning literature could be employed to obtain domain invariant features. 

In a second experiment, the authors used the more conventional evaluation protocol introduced in [26], which consists of splitting the data into multiple partitions. 

In all their experiments, the authors used the subspace disagreement measure of [15] to automatically determine the dimensionality of the projection matrix W . 
