Proceedings Article

Analysis of Representations for Domain Adaptation

TL;DR: The theory illustrates the tradeoffs inherent in designing a representation for domain adaptation and gives a new justification for a recently proposed model which explicitly minimizes the difference between the source and target domains, while at the same time maximizing the margin of the training set.
Abstract: Discriminative learning methods for classification perform well when training and test data are drawn from the same distribution. In many situations, though, we have labeled training data for a source domain, and we wish to learn a classifier which performs well on a target domain with a different distribution. Under what conditions can we adapt a classifier trained on the source domain for use in the target domain? Intuitively, a good feature representation is a crucial factor in the success of domain adaptation. We formalize this intuition theoretically with a generalization bound for domain adaptation. Our theory illustrates the tradeoffs inherent in designing a representation for domain adaptation and gives a new justification for a recently proposed model. It also points toward a promising new model for domain adaptation: one which explicitly minimizes the difference between the source and target domains, while at the same time maximizing the margin of the training set.
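For orientation, the bound the abstract alludes to has roughly the following shape (a paraphrase from memory, not a verbatim quote of the theorem; see the paper for the exact constants). It bounds the target error by the empirical source error, a VC complexity term, the divergence between the induced source and target distributions, and the error of the ideal joint hypothesis:

```latex
% Paraphrased shape of the adaptation bound. With probability at least
% 1 - \delta over a source sample of size m, for a hypothesis class of
% VC dimension d:
\epsilon_T(h) \;\le\; \hat{\epsilon}_S(h)
  + \sqrt{\frac{4}{m}\Big(d \log\frac{2em}{d} + \log\frac{4}{\delta}\Big)}
  + d_{\mathcal{H}}\big(\tilde{D}_S, \tilde{D}_T\big)
  + \lambda
```

The tradeoff mentioned in the TL;DR is between the divergence term, which a good representation should shrink, and the source error and the joint-hypothesis error λ, which an overly aggressive representation can inflate.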


Citations
Journal ArticleDOI
TL;DR: The relationship between transfer learning and other related machine learning techniques such as domain adaptation, multitask learning, sample selection bias, and covariate shift is discussed.
Abstract: A major assumption in many machine learning and data mining algorithms is that the training and future data must be in the same feature space and have the same distribution. However, in many real-world applications, this assumption may not hold. For example, we sometimes have a classification task in one domain of interest, but we only have sufficient training data in another domain of interest, where the latter data may be in a different feature space or follow a different data distribution. In such cases, knowledge transfer, if done successfully, would greatly improve the performance of learning by avoiding expensive data-labeling effort. In recent years, transfer learning has emerged as a new learning framework to address this problem. This survey focuses on categorizing and reviewing the current progress on transfer learning for classification, regression, and clustering problems. In this survey, we discuss the relationship between transfer learning and other related machine learning techniques such as domain adaptation, multitask learning and sample selection bias, as well as covariate shift. We also explore some potential future issues in transfer learning research.

18,616 citations



Book ChapterDOI
TL;DR: In this article, a new representation learning approach for domain adaptation is proposed for the setting in which data at training and test time come from similar but different distributions; it promotes features that cannot discriminate between the training (source) and test (target) domains while remaining discriminative for the main learning task on the source domain.
Abstract: We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
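The gradient reversal layer described above is simple enough to sketch. The following is a minimal illustration in PyTorch (an assumed dependency here, not the authors' published code): the layer is the identity on the forward pass and negates and scales gradients on the backward pass, so the features are trained adversarially against the domain classifier.

```python
# Minimal sketch of a gradient reversal layer (illustrative; assumes
# PyTorch, and is not the authors' reference implementation).
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)  # identity on the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        # Negate and scale the gradient flowing back into the features.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    """Insert between the feature extractor and the domain classifier."""
    return GradReverse.apply(x, lambd)
```

Placed between the feature extractor and the domain classifier, this single layer turns standard backpropagation into the adversarial game the abstract describes.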

4,862 citations

Journal ArticleDOI
TL;DR: This work proposes a framework for analyzing and comparing distributions, which is used to construct statistical tests to determine if two samples are drawn from different distributions, and presents two distribution free tests based on large deviation bounds for the maximum mean discrepancy (MMD).
Abstract: We propose a framework for analyzing and comparing distributions, which we use to construct statistical tests to determine if two samples are drawn from different distributions. Our test statistic is the largest difference in expectations over functions in the unit ball of a reproducing kernel Hilbert space (RKHS), and is called the maximum mean discrepancy (MMD). We present two distribution free tests based on large deviation bounds for the MMD, and a third test based on the asymptotic distribution of this statistic. The MMD can be computed in quadratic time, although efficient linear time approximations are available. Our statistic is an instance of an integral probability metric, and various classical metrics on distributions are obtained when alternative function classes are used in place of an RKHS. We apply our two-sample tests to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where they perform strongly. Excellent performance is also obtained when comparing distributions over graphs, for which these are the first such tests.
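As a concrete reference, a biased quadratic-time estimate of MMD² takes only a few lines. This is a minimal NumPy sketch assuming an RBF kernel with a fixed bandwidth gamma (both illustrative choices, not the paper's test procedure, which additionally needs a threshold from the large-deviation or asymptotic analysis):

```python
# Minimal sketch: biased quadratic-time estimate of MMD^2 between two
# samples, with an RBF kernel (illustrative choices throughout).
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def mmd2_biased(X, Y, gamma=1.0):
    Kxx = rbf_kernel(X, X, gamma)  # within-sample similarities, X
    Kyy = rbf_kernel(Y, Y, gamma)  # within-sample similarities, Y
    Kxy = rbf_kernel(X, Y, gamma)  # cross-sample similarities
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()
```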

3,792 citations


Cites background from "Analysis of Representations for Dom..."

  • ...(2009, Section 2) show the MMD minimizes the expected risk of a classifier with linear loss on the samples X and Y , and Ben-David et al. (2007, Section 4) use the error of a hyperplane classifier to approximate the A-distance between distributions (Kifer et al., 2004). Reid and Williamson (2011) provide further discussion and examples....


Posted Content
TL;DR: A new Deep Adaptation Network (DAN) architecture is proposed, which generalizes deep convolutional neural networks to the domain adaptation scenario, can learn transferable features with statistical guarantees, and scales linearly via an unbiased estimate of the kernel embedding.
Abstract: Recent studies reveal that a deep neural network can learn transferable features which generalize well to novel tasks for domain adaptation. However, as deep features eventually transition from general to specific along the network, the feature transferability drops significantly in higher layers with increasing domain discrepancy. Hence, it is important to formally reduce the dataset bias and enhance the transferability in task-specific layers. In this paper, we propose a new Deep Adaptation Network (DAN) architecture, which generalizes deep convolutional neural networks to the domain adaptation scenario. In DAN, hidden representations of all task-specific layers are embedded in a reproducing kernel Hilbert space where the mean embeddings of different domain distributions can be explicitly matched. The domain discrepancy is further reduced using an optimal multi-kernel selection method for mean embedding matching. DAN can learn transferable features with statistical guarantees, and scales linearly via an unbiased estimate of the kernel embedding. Extensive empirical evidence shows that the proposed architecture yields state-of-the-art image classification error rates on standard domain adaptation benchmarks.
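The "scales linearly" claim rests on the streaming, unbiased MMD estimator that averages a kernel statistic over disjoint pairs of points rather than over all pairs. Below is a minimal single-kernel NumPy sketch of that estimator (the multi-kernel selection used in DAN is not shown, and the names are illustrative):

```python
# Minimal sketch of the linear-time, unbiased MMD^2 estimator over
# disjoint sample pairs (single RBF kernel; illustrative only).
import numpy as np

def k(x, y, gamma=1.0):
    return np.exp(-gamma * np.sum((x - y) ** 2))

def mmd2_linear(X, Y, gamma=1.0):
    n = (min(len(X), len(Y)) // 2) * 2   # use an even number of points
    h = [k(X[i], X[i+1], gamma) + k(Y[i], Y[i+1], gamma)
         - k(X[i], Y[i+1], gamma) - k(X[i+1], Y[i], gamma)
         for i in range(0, n, 2)]
    return float(np.mean(h))             # one small kernel set per pair
```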

3,351 citations


Cites methods from "Analysis of Representations for Dom..."

  • ...We provide an analysis of the expected target-domain risk of our approach, making use of the theory of domain transfer (Ben-David et al., 2007; 2010; Mansour et al., 2009) and the theory of kernel embedding of probability distributions (Sriperumbudur et al....


Journal ArticleDOI
TL;DR: This work proposes a novel dimensionality reduction framework for reducing the distance between domains in a latent space for domain adaptation, with both unsupervised and semisupervised feature extraction approaches that can dramatically reduce the distance between domain distributions by projecting data onto the learned transfer components.
Abstract: Domain adaptation allows knowledge from a source domain to be transferred to a different but related target domain. Intuitively, discovering a good feature representation across domains is crucial. In this paper, we first propose to find such a representation through a new learning method, transfer component analysis (TCA), for domain adaptation. TCA tries to learn some transfer components across domains in a reproducing kernel Hilbert space using maximum mean discrepancy. In the subspace spanned by these transfer components, data properties are preserved and data distributions in different domains are close to each other. As a result, with the new representations in this subspace, we can apply standard machine learning methods to train classifiers or regression models in the source domain for use in the target domain. Furthermore, in order to uncover the knowledge hidden in the relations between the data labels from the source and target domains, we extend TCA in a semisupervised learning setting, which encodes label information into transfer components learning. We call this extension semisupervised TCA. The main contribution of our work is that we propose a novel dimensionality reduction framework for reducing the distance between domains in a latent space for domain adaptation. We propose both unsupervised and semisupervised feature extraction approaches, which can dramatically reduce the distance between domain distributions by projecting data onto the learned transfer components. Finally, our approach can handle large datasets and naturally leads to out-of-sample generalization. The effectiveness and efficiency of our approach are verified by experiments on five toy datasets and two real-world applications: cross-domain indoor WiFi localization and cross-domain text classification.
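Piecing together the description above, the core TCA computation reduces to an eigenproblem built from a kernel matrix, an MMD coefficient matrix, and a centering matrix. A minimal NumPy sketch with a linear kernel follows; the kernel choice, variable names, and regularizer mu are assumptions for illustration, not the authors' code:

```python
# Minimal sketch of transfer component analysis (TCA) with a linear
# kernel; mu and dim are illustrative hyperparameters.
import numpy as np

def tca(Xs, Xt, dim=2, mu=1.0):
    X = np.vstack([Xs, Xt])
    ns, nt = len(Xs), len(Xt)
    n = ns + nt
    K = X @ X.T                             # linear kernel matrix
    # MMD coefficient matrix L: penalizes mismatch of the domain means.
    e = np.concatenate([np.full(ns, 1.0 / ns), np.full(nt, -1.0 / nt)])
    L = np.outer(e, e)
    H = np.eye(n) - np.ones((n, n)) / n     # centering matrix
    # Transfer components: leading eigenvectors of (KLK + mu*I)^{-1} KHK.
    M = np.linalg.solve(K @ L @ K + mu * np.eye(n), K @ H @ K)
    vals, vecs = np.linalg.eig(M)
    W = vecs[:, np.argsort(-vals.real)[:dim]].real
    return K @ W                            # new features for Xs then Xt
```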

3,195 citations

References
01 Jan 1998
TL;DR: Presenting a method for determining the necessary and sufficient conditions for consistency of the learning process, the author covers function estimation from small data pools, the application of these estimates to real-life problems, and much more.
Abstract: A comprehensive look at learning and generalization theory. The statistical theory of learning and generalization concerns the problem of choosing desired functions on the basis of empirical data. Highly applicable to a variety of computer science and robotics fields, this book offers lucid coverage of the theory as a whole. Presenting a method for determining the necessary and sufficient conditions for consistency of the learning process, the author covers function estimation from small data pools, the application of these estimates to real-life problems, and much more.

26,531 citations


Additional excerpts

  • ...$$\begin{aligned}
    \epsilon_T(h) &\le \lambda_T + \Pr_{D_T}[Z_h \,\Delta\, Z_{h^*}] \\
    &\le \lambda_T + \Pr_{D_S}[Z_h \,\Delta\, Z_{h^*}] + \big|\Pr_{D_S}[Z_h \,\Delta\, Z_{h^*}] - \Pr_{D_T}[Z_h \,\Delta\, Z_{h^*}]\big| \\
    &\le \lambda_T + \Pr_{D_S}[Z_h \,\Delta\, Z_{h^*}] + d_{\mathcal{H}}(\tilde{D}_S, \tilde{D}_T) \\
    &\le \lambda_T + \lambda_S + \epsilon_S(h) + d_{\mathcal{H}}(\tilde{D}_S, \tilde{D}_T) \\
    &\le \lambda + \epsilon_S(h) + d_{\mathcal{H}}(\tilde{D}_S, \tilde{D}_T)
    \end{aligned}$$
    The theorem now follows by a standard application of Vapnik-Chervonenkis theory [14] to bound the true $\epsilon_S(h)$ by its empirical estimate $\hat{\epsilon}_S(h)$....


Book
28 May 1999
TL;DR: This foundational text is the first comprehensive introduction to statistical natural language processing (NLP) to appear and provides broad but rigorous coverage of mathematical and linguistic foundations, as well as detailed discussion of statistical methods, allowing students and researchers to construct their own implementations.
Abstract: Statistical approaches to processing natural language text have become dominant in recent years. This foundational text is the first comprehensive introduction to statistical natural language processing (NLP) to appear. The book contains all the theory and algorithms needed for building NLP tools. It provides broad but rigorous coverage of mathematical and linguistic foundations, as well as detailed discussion of statistical methods, allowing students and researchers to construct their own implementations. The book covers collocation finding, word sense disambiguation, probabilistic parsing, information retrieval, and other applications.

9,295 citations

Proceedings ArticleDOI
22 Jul 2006
TL;DR: This work introduces structural correspondence learning to automatically induce correspondences among features from different domains in order to adapt existing models from a resource-rich source domain to aresource-poor target domain.
Abstract: Discriminative learning methods are widely used in natural language processing. These methods work best when their training and test data are drawn from the same distribution. For many NLP tasks, however, we are confronted with new domains in which labeled data is scarce or non-existent. In such cases, we seek to adapt existing models from a resource-rich source domain to a resource-poor target domain. We introduce structural correspondence learning to automatically induce correspondences among features from different domains. We test our technique on part of speech tagging and show performance gains for varying amounts of source and target training data, as well as improvements in target domain parsing accuracy using our improved tagger.
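At its core, structural correspondence learning predicts domain-independent "pivot" features from the remaining features and takes an SVD of the stacked predictor weights as a shared projection. A rough NumPy sketch under those assumptions follows; ridge regression stands in for the paper's per-pivot modified-Huber classifiers, and the names are illustrative:

```python
# Rough sketch of the SCL projection step (ridge regression substitutes
# for the paper's per-pivot modified-Huber classifiers; illustrative).
import numpy as np

def scl_projection(X, pivot_idx, h=25, lam=1.0):
    """X: (n, d) feature matrix; pivot_idx: columns of pivot features."""
    rest = np.setdiff1d(np.arange(X.shape[1]), pivot_idx)
    A, P = X[:, rest], X[:, pivot_idx]
    # One linear predictor per pivot feature, solved jointly in closed form.
    W = np.linalg.solve(A.T @ A + lam * np.eye(len(rest)), A.T @ P)
    U, _, _ = np.linalg.svd(W, full_matrices=False)
    theta = U[:, :h]      # shared low-dimensional feature subspace
    return rest, theta    # augment features with X[:, rest] @ theta
```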

1,672 citations


"Analysis of Representations for Dom..." refers background or methods in this paper

  • ...For PoS tagging, the original feature space consists of high-dimensional, sparse binary vectors [6]....


  • ...We show experimentally that the heuristic choices made by the recently proposed structural correspondence learning algorithm [6] do lead to lower values of the relevant quantities in our theoretical analysis, providing insight as to why this algorithm achieves its empirical success....


  • ...Indeed recent empirical work in natural language processing [11, 6] has been targeted at exactly this setting....


  • ...However, the assumption does not hold for domain adaptation [5, 7, 13, 6]....


  • ...Section 5 shows how the bound behaves for the structural correspondence learning representation [6] on natural language data....


Proceedings ArticleDOI
Tong Zhang
04 Jul 2004
TL;DR: Stochastic gradient descent algorithms on regularized forms of linear prediction methods, related to online algorithms such as the perceptron, are studied, and numerical rates of convergence for such algorithms are obtained.
Abstract: Linear prediction methods, such as least squares for regression, and logistic regression and support vector machines for classification, have been extensively used in statistics and machine learning. In this paper, we study stochastic gradient descent (SGD) algorithms on regularized forms of linear prediction methods. This class of methods, related to online algorithms such as the perceptron, is both efficient and very simple to implement. We obtain numerical rates of convergence for such algorithms and discuss their implications. Experiments on text data demonstrate the numerical and statistical consequences of our theoretical findings.
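This is the method the Ben-David et al. experiments lean on (see the excerpt below: "a modified Huber loss using stochastic gradient descent"). A minimal NumPy sketch of that loss and update follows, with an illustrative step size and regularization rather than the paper's settings:

```python
# Minimal sketch: SGD on the modified Huber loss for a regularized
# linear classifier, labels in {-1, +1} (hyperparameters illustrative).
import numpy as np

def sgd_modified_huber(X, y, epochs=5, eta=0.01, lam=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            z = y[i] * (w @ X[i])            # margin of example i
            if z >= 1:
                g = 0.0                      # correct with margin: no loss
            elif z >= -1:
                g = -2.0 * (1.0 - z) * y[i]  # quadratic region of the loss
            else:
                g = -4.0 * y[i]              # linear region for gross errors
            w -= eta * (g * X[i] + lam * w)  # regularized gradient step
    return w
```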

1,182 citations


"Analysis of Representations for Dom..." refers methods in this paper

  • ...We minimize a modified Huber loss using stochastic gradient descent, described more completely in [15]....


Book ChapterDOI
31 Aug 2004
TL;DR: A novel method for the detection and estimation of change is presented that assumes the points in the stream are independently generated but otherwise makes no assumptions on the nature of the generating distribution.
Abstract: Detecting changes in a data stream is an important area of research with many applications. In this paper, we present a novel method for the detection and estimation of change. In addition to providing statistical guarantees on the reliability of detected changes, our method also provides meaningful descriptions and quantification of these changes. Our approach assumes that the points in the stream are independently generated, but otherwise makes no assumptions on the nature of the generating distribution. Thus our techniques work for both continuous and discrete data. In an experimental study we demonstrate the power of our techniques.
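The excerpts below use this work's A-distance as a computable divergence between samples: train a classifier to tell the two samples apart and convert its test error into a distance. Here is a minimal sketch of that proxy; the use of scikit-learn's logistic regression is an illustrative assumption, not prescribed by either paper:

```python
# Minimal sketch of the proxy A-distance between two samples: the better
# a classifier separates them, the more divergent the domains.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def proxy_a_distance(Xs, Xt):
    X = np.vstack([Xs, Xt])
    y = np.concatenate([np.zeros(len(Xs)), np.ones(len(Xt))])
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5, random_state=0)
    err = 1.0 - LogisticRegression(max_iter=1000).fit(Xtr, ytr).score(Xte, yte)
    return 2.0 * (1.0 - 2.0 * err)  # d_A estimate; near 0 when domains match
```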

883 citations


"Analysis of Representations for Dom..." refers background or methods in this paper

  • ...The relevant distributional divergence term can be written as the A-distance of Kifer et al. [9]....


  • ...[9] show that the A-distance can be approximated arbitrarily well with increasing sample size....


  • ...2 from [9], we can state a computable bound for the error on the target domain....


  • ...We chose the A-distance, however, precisely because we can measure it from finite samples from the distributions D̃S and D̃T [9]....


  • ...Unfortunately the variational distance between real-valued distributions cannot be computed from finite samples [2, 9] and therefore is not useful to us when investigating representations for domain adaptation on real-world data....
