
Aspect-augmented Adversarial Networks for Domain Adaptation

02 Dec 2017-Transactions of the Association for Computational Linguistics (MIT Press One Rogers Street, Cambridge, MA 02142-1209 USA journals-info@mit.edu)-Vol. 5, Iss: 1, pp 515-528
TL;DR: A neural method for transfer learning between two (source and target) classification tasks or aspects over the same domain is introduced, using a few keywords pertaining to source and target aspects indicating sentence relevance instead of document class labels.
Abstract: We introduce a neural method for transfer learning between two (source and target) classification tasks or aspects over the same domain. Rather than training on target labels, we use a few keywords pertaining to source and target aspects indicating sentence relevance instead of document class labels. Documents are encoded by learning to embed and softly select relevant sentences in an aspect-dependent manner. A shared classifier is trained on the source encoded documents and labels, and applied to target encoded documents. We ensure transfer through aspect-adversarial training so that encoded documents are, as sets, aspect-invariant. Experimental results demonstrate that our approach outperforms different baselines and model variants on two datasets, yielding an improvement of 27% on a pathology dataset and 5% on a review dataset.


Yuan Zhang, Regina Barzilay, and Tommi Jaakkola
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
{yuanzh, regina, tommi}@csail.mit.edu
1 Introduction
Many NLP problems are naturally multitask classi-
fication problems. For instance, values extracted for
different fields from the same document are often
dependent as they share the same context. Exist-
ing systems rely on this dependence (transfer across
fields) to improve accuracy. In this paper, we con-
sider a version of this problem where there is a clear
dependence between two tasks but annotations are
available only for the source task. For example, the target goal may be to classify pathology reports (shown in Figure 1) for the presence of lymph invasion but training data are available only for carcinoma in the same reports. We call this problem aspect transfer as the objective is to learn to classify examples differently, focusing on different aspects, without access to target aspect labels. Clearly, such transfer learning is possible only with auxiliary information relating the tasks together.

Pathology report:
Final diagnosis: BREAST (LEFT) Invasive ductal carcinoma: identified. Carcinoma tumor size: num cm. Grade: 3. Lymphatic vessel invasion: identified. Blood vessel invasion: Suspicious. Margin of invasive carcinoma …
Diagnosis results:
Source (IDC): Positive Target (LVI): Positive
Figure 1: A snippet of a breast pathology report with diagnosis results for two types of disease (aspects): carcinoma (IDC) and lymph invasion (LVI). Note how the same phrase indicating positive results (e.g. identified) is applicable to both aspects. A transfer model learns to map other key phrases (e.g. Grade 3) to such shared indicators.

Footnote 1: The code is available at https://github.com/yuanzh/aspect_adversarial.

The key challenge is to articulate and incorpo-
rate commonalities across the tasks. For instance, in
classifying reviews of different products, sentiment
words (referred to as pivots) can be shared across
the products. This commonality enables one to align
feature spaces across multiple products, enabling
useful transfer (?). Similar properties hold in other
contexts and beyond sentiment analysis. Figure 1 shows that certain words and phrases like “identified”, which indicates the presence of a histological property, are applicable to both carcinoma and lymph invasion. Our method learns and relies on such shared indicators, and utilizes them for effective transfer.

Transactions of the Association for Computational Linguistics, vol. 5, pp. 515–528, 2017. Action Editor: Hal Daumé III.
Submission batch: 12/2016; Revision batch: 6/2017; Published 12/2017.
© 2017 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

The unique feature of our transfer problem is that
both the source and the target classifiers operate over
the same domain, i.e., the same examples. In this
setting, traditional transfer methods will always predict
the same label for both aspects and thus fail.
Instead of supplying the target classifier
with direct training labels, our approach builds
on a secondary relationship between the tasks using
aspect-relevance annotations of sentences. These
relevance annotations indicate a possibility that the
answer could be found in a sentence, not what the
answer is. One can often write simple keyword rules
that identify sentence relevance to a particular as-
pect through representative terms, e.g., specific hor-
monal markers in the context of pathology reports.
Annotations of this kind can be readily provided by
domain experts, or extracted from medical literature
such as codex rules in pathology (Pantanowitz et al.,
2008). We assume a small number of relevance an-
notations (rules) pertaining to both source and target
aspects as a form of weak supervision. We use this
sentence-level aspect relevance to learn how to en-
code the examples (e.g., pathology reports) from the
point of view of the desired aspect. In our approach,
we construct different aspect-dependent encodings
of the same document by softly selecting sentences
relevant to the aspect of interest. The key to effective
transfer is how these encodings are aligned.
This encoding mechanism brings the problem
closer to the realm of standard domain adaptation,
where the derived aspect-specific representations are
considered as different domains. Given these rep-
resentations, our method learns a label classifier
shared between the two domains. To ensure that it
can be adjusted only based on the source class labels, and that it also reasonably applies to the target encodings, we must align the two sets of encoded examples (see footnote 2). Learning this alignment is possible because, as discussed above, some keywords are directly transferable and can serve as anchors for constructing this invariant space. To learn this invariant representation, we introduce an adversarial domain classifier analogous to the recent successful use of adversarial training in computer vision (Ganin and Lempitsky, 2014). The role of the domain classifier (adversary) is to learn to distinguish between the two types of encodings. During training we update the encoder with an adversarial objective to cause the classifier to fail. The encoder therefore learns to eliminate aspect-specific information so that encodings look invariant (as sets) to the classifier, thus establishing aspect-invariant encodings and enabling transfer. All three components in our approach, 1) aspect-driven encoding, 2) classification of source labels, and 3) domain adversary, are trained jointly (concurrently) to complement and balance each other.

Footnote 2: This alignment or invariance is enforced on the level of sets, not individual reports; aspect-driven encoding of any specific report should remain substantially different for the two tasks since the encoded examples are passed on to the same classifier.

Adversarial training of domain and label classi-
fiers can be challenging to stabilize. In our setting,
sentences are encoded with a convolutional model.
Feedback from adversarial training can be an un-
stable guide for how the sentences should be en-
coded. To address this issue, we incorporate an ad-
ditional word-level auto-encoder reconstruction loss
to ground the convolutional processing of sentences.
We empirically demonstrate that this additional ob-
jective yields richer and more diversified feature rep-
resentations, improving transfer.
We evaluate our approach on pathology reports
(aspect transfer) as well as on a more standard re-
view dataset (domain adaptation). On the pathology
dataset, we explore cross-aspect transfer across dif-
ferent types of breast disease. Specifically, we test
on six adaptation tasks, consistently outperforming
all other baselines. Overall, our full model achieves
27% and 20.2% absolute improvement arising from
aspect-driven encoding and adversarial training re-
spectively. Moreover, our unsupervised adaptation
method is only 5.7% behind the accuracy of a super-
vised target model. On the review dataset, we test
adaptations from hotel to restaurant reviews. Our
model outperforms the marginalized denoising au-
toencoder (Chen et al., 2012) by 5%. Finally, we
examine and illustrate the impact of individual com-
ponents on the resulting performance.

2 Related Work
Domain Adaptation for Deep Learning Exist-
ing approaches commonly induce abstract represen-
tations without pulling apart different aspects in the
same example, and therefore are likely to fail on the
aspect transfer problem. The majority of these prior
methods first learn a task-independent representa-
tion, and then train a label predictor (e.g. SVM)
on this representation in a separate step. For example, earlier work employs a shared autoencoder (Glorot et al., 2011; Chopra et al., 2013) to learn a cross-domain representation. Chen et al. (2012) further improve and stabilize the representation learning by utilizing marginalized denoising autoencoders. Later, Zhou et al. (2016) propose to minimize domain-shift of the autoencoder in a linear data combination manner. Other work has focused on learning transferable representations in an
end-to-end fashion. Examples include using trans-
duction learning for object recognition (Sener et al.,
2016) and using residual transfer networks for image
classification (Long et al., 2016). In contrast, we use
adversarial training to encourage learning domain-
invariant features in a more explicit way. Our ap-
proach offers another two advantages over prior
work. First, we jointly optimize features with the
final classification task while many previous works
only learn task-independent features using autoen-
coders. Second, our model can handle traditional
domain transfer as well as aspect transfer, while pre-
vious methods can only handle the former.
Adversarial Learning in Vision and NLP Our
approach closely relates to the idea of domain-
adversarial training. Adversarial networks were
originally developed for image generation (Good-
fellow et al., 2014; Makhzani et al., 2015; Sprin-
genberg, 2015; Radford et al., 2016; Taigman et al.,
2016), and were later applied to domain adaptation
in computer vision (Ganin and Lempitsky, 2014;
Ganin et al., 2015; Bousmalis et al., 2016; Tzeng et
al., 2014) and speech recognition (Shinohara, 2016).
The core idea of these approaches is to promote the
emergence of invariant image features by optimizing
the feature extractor as an adversary against the do-
main classifier. While Ganin et al. (2015) also apply
this idea to sentiment analysis, their practical gains
have remained limited.
Our approach presents two main departures. In
computer vision, adversarial learning has been used
for transferring across domains, while our method
can also handle aspect transfer. In addition, we in-
troduce a reconstruction loss which results in more
robust adversarial training. We believe that this for-
mulation will benefit other applications of adversar-
ial training, beyond the ones described in this paper.
Semi-supervised Learning with Keywords In
our work, we use a small set of keywords as a source
of weak supervision for aspect-relevance scoring.
This relates to prior work on utilizing prototypes and
seed words in semi-supervised learning (Haghighi
and Klein, 2006; Grenager et al., 2005; Chang et
al., 2007; Mann and McCallum, 2010; Jagarlamudi
et al., 2012; Li et al., 2012; Eisenstein, 2017). All
these prior approaches utilize prototype annotations
primarily targeting model bootstrapping but not for
learning representations. In contrast, our model uses
provided keywords to learn aspect-driven encoding
of input examples.
Attention Mechanism in NLP One may view
our aspect-relevance scorer as a sentence-level
“semi-supervised attention”, in which relevant sen-
tences receive more attention during feature extrac-
tion. While traditional attention-based models typ-
ically induce attention in an unsupervised manner,
they have to rely on a large amount of labeled data
for the target task (Bahdanau et al., 2015; Rush et
al., 2015; Chen et al., 2015; Cheng et al., 2016;
Xu et al., 2015; Xu and Saenko, 2016; Yang et
al., 2016; Martins and Astudillo, 2016; Lei et al.,
2016). Unlike these methods, our approach assumes
no label annotations in the target domain. Other work
has focused on utilizing human-provided
rationales as “supervised attention” to improve pre-
diction (Zaidan et al., 2007; Marshall et al., 2015;
Zhang et al., 2016; Brun et al., 2016). In contrast,
our model only assumes access to a small set of key-
words as a source of weak supervision. Moreover,
all these prior approaches focus on in-domain clas-
sification. In this paper, however, we study the task
in the context of domain adaptation.
Multitask Learning Existing multitask learn-
ing methods focus on the case where supervision
is available for all tasks. A typical architecture involves using a shared encoder with a separate classifier for each task (Caruana, 1998; Pan and Yang, 2010; Collobert and Weston, 2008; Liu et al., 2015; Bordes et al., 2012). In contrast, our work assumes
labeled data only for the source aspect. We train a
single classifier for both aspects by learning aspect-
invariant representation that enables the transfer.
3 Problem Formulation
We begin by formalizing aspect transfer with the
idea of differentiating it from standard domain adap-
tation. In our setup, we have two classification tasks
called the source and the target tasks. In contrast to
source and target tasks in domain adaptation, both
of these tasks are defined over the same set of ex-
amples (here documents, e.g., pathology reports).
What differentiates the two classification tasks is
that they pertain to different aspects in the examples.
If each training document were annotated with both
the source and the target aspect labels, the problem
would reduce to multi-label classification. However,
in our setting training labels are available only for
the source aspect so the goal is to solve the target
task without any associated training label.
To fix the notation, let $d = \{s_i\}_{i=1}^{|d|}$ be a document that consists of a sequence of $|d|$ sentences $s_i$. Given a document $d$, and the aspect of interest, we wish to predict the corresponding aspect-dependent class label $y$ (e.g., $y \in \{-1, 1\}$). We assume that the set of possible labels is the same across aspects. We use $y^s_{l;k}$ to denote the $k$-th coordinate of a one-hot vector indicating the correct training source aspect label for document $d_l$. Target aspect labels are not available during training.
Beyond labeled documents for the source aspect $\{d_l, y^s_l\}_{l \in L}$, and shared unlabeled documents for source and target aspects $\{d_l\}_{l \in U}$, we assume further that we have relevance scores pertaining to each
aspect. The relevance is given per sentence, for
some subset of sentences across the documents, and
indicates the possibility that the answer for that doc-
ument would be found in the sentence but without
indicating which way the answer goes. Relevance is
always aspect dependent yet often easy to provide in
the form of simple keyword rules.
We use $r^a_i \in \{0, 1\}$ to denote the given relevance label pertaining to aspect $a$ for sentence $s_i$. Only a small subset of sentences in the training set have associated relevance labels. Let $R = \{(a, l, i)\}$ denote the index set of relevance labels such that if $(a, l, i) \in R$ then aspect $a$'s relevance label $r^a_{l,i}$ is available for the $i$-th sentence in document $d_l$. In our case relevance labels arise from aspect-dependent keyword matches. $r^a_i = 1$ when the sentence contains any keywords pertaining to aspect $a$ and $r^a_i = 0$ if it has any keywords of other aspects (see footnote 3). Separate subsets of relevance labels are available for each aspect as the keywords differ.
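The keyword rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `relevance_label`, the keyword lists, and the use of lower-cased substring matching are hypothetical and are not taken from the authors' code. Sentences that match no keyword are returned as unlabeled (`None`), which is one reading of the statement that only a small subset of sentences carries relevance labels.

```python
from typing import Dict, List, Optional


def relevance_label(sentence: str, aspect: str,
                    keywords: Dict[str, List[str]]) -> Optional[int]:
    """Return 1, 0, or None (unlabeled) for one sentence and one aspect."""
    text = sentence.lower()
    has_own = any(k in text for k in keywords[aspect])
    has_other = any(k in text
                    for other, words in keywords.items() if other != aspect
                    for k in words)
    if has_own:
        # Also covers the footnote-3 case: keywords of this aspect AND others.
        return 1
    if has_other:
        # Only keywords of other aspects were found.
        return 0
    # No keyword matched: the sentence carries no relevance label (assumption).
    return None


# Hypothetical keyword lists, invented for illustration only.
KEYWORDS = {"IDC": ["ductal carcinoma", "idc"], "LVI": ["lymphatic", "lvi"]}
```

For example, with these hypothetical lists a sentence mentioning only "Lymphatic vessel invasion" would receive label 0 for the IDC aspect and label 1 for the LVI aspect.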
The transfer that is sought here is between two
tasks over the same set of examples rather than be-
tween two different types of examples for the same
task as in standard domain adaptation. However, the
two formulations can be reconciled if full relevance
annotations are assumed to be available during train-
ing and testing. In this scenario, we could simply lift
the sets of relevant sentences from each document
as new types of documents. The goal would be then
to learn to classify documents of type T (consisting
of sentences relevant to the target aspect) based on
having labels only for type S (source) documents,
a standard domain adaptation task. Our problem
is more challenging as the aspect-relevance of sen-
tences must be learned from limited annotations.
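The reduction to standard domain adaptation described above can be made concrete with a small sketch. It assumes the idealized case of full relevance annotations; the helper `lift_relevant`, the example report, and its labels are hypothetical and serve only to show how type S and type T documents would be built from the same report.

```python
from typing import Dict, List


def lift_relevant(sentences: List[str], relevance: List[int]) -> List[str]:
    """Keep only the sentences marked relevant (r = 1) for one aspect."""
    if len(sentences) != len(relevance):
        raise ValueError("one relevance label per sentence is required")
    return [s for s, r in zip(sentences, relevance) if r == 1]


# Hypothetical fully annotated report; the labels are invented for illustration.
report = [
    "Invasive ductal carcinoma: identified.",
    "Grade: 3.",
    "Lymphatic vessel invasion: identified.",
]
full_relevance: Dict[str, List[int]] = {
    "source": [1, 1, 0],
    "target": [0, 0, 1],
}

# The same report yields two different "documents", one per aspect.
type_s_doc = lift_relevant(report, full_relevance["source"])
type_t_doc = lift_relevant(report, full_relevance["target"])
```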
Finally, we note that the aspect transfer problem
and the method we develop to solve it work the same
even when source and target documents are a priori
different, something we will demonstrate later.
4 Methods
4.1 Overview of our approach
Our model consists of three key components as
shown in Figure 2. Each document is encoded
in a relevance weighted, aspect-dependent manner
(green, left part of Figure 2) and classified using the
label predictor (blue, top-right). During training, the
encoded documents are also passed on to the domain
classifier (orange, bottom-right). The role of the do-
main classifier, as the adversary, is to ensure that the
aspect-dependent encodings of documents are distri-
butionally matched. This matching justifies the use
of the same end-classifier to provide the predicted
label regardless of the task (aspect).
Footnote 3: $r^a_i = 1$ if the sentence contains keywords pertaining to both aspect $a$ and other aspects.

[Figure 2 diagram. Panels: (a) Document encoder: sentences of a pathology report are mapped to sentence embeddings, assigned predicted relevance scores (e.g. r̂ = 1.0, 0.0, 0.9), combined by a weighted combination and passed through a transformation layer to give the document representation. (b) Label predictor: class label $y^l$; objective: predict labels. (c) Domain classifier: domain label $y^a$; objective: predict domains. Both classifiers backpropagate into the encoder; the adversary objective is to confuse the domain classifier.]
Figure 2: Aspect-augmented adversarial network for transfer learning. The model is composed of (a) an aspect-driven document encoder, (b) a label predictor and (c) a domain classifier.
To encode a document, the model first maps each
sentence into a vector and then passes the vector to a
scoring network to determine whether the sentence
is relevant for the chosen aspect. These predicted
relevance scores are used to obtain document vec-
tors by taking relevance-weighted sum of the asso-
ciated sentence vectors. Thus, the manner in which
the document vector is constructed is always aspect-
dependent due to the chosen relevance weights.
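As a minimal sketch of this encoding step, the function below forms a document vector as a relevance-weighted sum of sentence vectors. The name `document_vector` and the optional division by the total relevance are assumptions made for illustration; the text above only states that a relevance-weighted sum is taken.

```python
from typing import List, Sequence


def document_vector(sentence_vecs: Sequence[Sequence[float]],
                    relevance: Sequence[float],
                    normalize: bool = True) -> List[float]:
    """Combine sentence vectors using predicted relevance scores as weights."""
    if not sentence_vecs or len(sentence_vecs) != len(relevance):
        raise ValueError("need one relevance score per sentence vector")
    dim = len(sentence_vecs[0])
    doc = [0.0] * dim
    for vec, r in zip(sentence_vecs, relevance):
        for j in range(dim):
            doc[j] += r * vec[j]          # relevance-weighted sum
    total = sum(relevance)
    if normalize and total > 0:
        # Assumption: rescale so documents of different length are comparable.
        doc = [v / total for v in doc]
    return doc
```

Because the relevance scores depend on the chosen aspect, calling this function with source-aspect scores and with target-aspect scores gives two different encodings of the same report.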
During training, the resulting adjusted document
vectors are consumed by the two classifiers. The pri-
mary label classifier aims to predict the source labels
(when available), while the domain classifier deter-
mines whether the document vector pertains to the
source or target aspect, which is the label that we
know by construction. Furthermore, we jointly up-
date the document encoder with a reverse of the gra-
dient from the domain classifier, so that the encoder
learns to induce document representations that fool
the domain classifier. The resulting encoded repre-
sentations will be aspect-invariant, facilitating trans-
fer.
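The effect of reversing the domain classifier's gradient can be illustrated with a deliberately tiny scalar model. Everything here is an assumption for illustration: the one-parameter "encoder" `w_enc`, the one-parameter logistic "domain classifier" `w_dom`, the reversal weight `lam`, and the learning rate are hypothetical and do not reflect the architecture or hyperparameters of the actual model.

```python
import math
from typing import Tuple


def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))


def domain_loss(w_enc: float, w_dom: float, x: float, aspect: int) -> float:
    """Cross-entropy of the domain classifier on the encoded input."""
    p = sigmoid(w_dom * (w_enc * x))
    return -(aspect * math.log(p) + (1 - aspect) * math.log(1.0 - p))


def gradients(w_enc: float, w_dom: float, x: float, aspect: int,
              lam: float = 1.0) -> Tuple[float, float]:
    """Return (reversed encoder gradient, ordinary classifier gradient)."""
    z = w_enc * x                       # encoder output
    p = sigmoid(w_dom * z)
    d_logit = p - aspect                # dL/d(logit) for cross-entropy
    grad_dom = d_logit * z              # classifier follows the true gradient
    grad_z = d_logit * w_dom            # gradient arriving at the encoder output
    grad_enc = -lam * grad_z * x        # sign flipped before reaching the encoder
    return grad_enc, grad_dom


def adversarial_step(w_enc: float, w_dom: float, x: float, aspect: int,
                     lr: float = 0.1, lam: float = 1.0) -> Tuple[float, float]:
    """One joint update: classifier descends, encoder ascends the domain loss."""
    grad_enc, grad_dom = gradients(w_enc, w_dom, x, aspect, lam)
    return w_enc - lr * grad_enc, w_dom - lr * grad_dom
```

With the classifier held fixed, the encoder update increases the domain loss, while the classifier update on its own decreases it; this opposition is what pushes the encodings toward aspect invariance.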
Our adversarial training scheme uses all the train-
ing losses concurrently to adjust the model param-
eters. During testing, we simply encode each test
document in a target-aspect dependent manner, and
apply the same label predictor. We expect that the
same label classifier does well on the target task
since it solves the source task, and operates on
relevance-weighted representations that are matched
across the tasks. While our method is designed to work in the extreme setting that the examples for the two aspects are the same, this is by no means a requirement. Our method will also work fine in the more traditional domain adaptation setting, which we will demonstrate later.

[Figure 3 diagram. Labels: input sentence "ductal carcinoma is identified" with word embeddings $x_0, x_1, x_2, x_3$; hidden states $h_1, h_2$; max-pooling into the sentence embedding, $x^{sen} = \max\{h_1, h_2, \ldots\}$; reconstruction of $x_2$, $\hat{x}_2 = \tanh(W^c h_2 + b^c)$.]
Figure 3: Illustration of the convolutional model and the reconstruction of word embeddings from the associated convolutional layer.
4.2 Components in detail
Sentence embedding We apply a convolutional
model illustrated in Figure 3 to each sentence s
i
to
obtain sentence-level vector embeddings x
sen
i
. The
use of RNNs or bi-LSTMs would result in more flex-
ible sentence embeddings but based on our initial ex-
periments, we did not observe any significant gains
over the simpler CNNs.
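The two formulas shown in Figure 3, $x^{sen} = \max\{h_1, h_2, \ldots\}$ and $\hat{x}_2 = \tanh(W^c h_2 + b^c)$, can be sketched as follows. The convolution that produces the hidden states is not reproduced here, so the functions take the hidden states as given. The squared-error form of `reconstruction_loss` and the one-to-one pairing of words with hidden states are assumptions for illustration, since the text does not spell out the loss.

```python
import math
from typing import List, Sequence

Vector = List[float]


def max_pool(hidden: Sequence[Sequence[float]]) -> Vector:
    """Element-wise maximum over hidden states: x_sen = max{h_1, h_2, ...}."""
    return [max(column) for column in zip(*hidden)]


def reconstruct(h: Sequence[float], W_c: Sequence[Sequence[float]],
                b_c: Sequence[float]) -> Vector:
    """Reconstruct a word embedding from a hidden state: tanh(W_c h + b_c)."""
    return [math.tanh(sum(w * v for w, v in zip(row, h)) + b)
            for row, b in zip(W_c, b_c)]


def reconstruction_loss(words: Sequence[Sequence[float]],
                        hidden: Sequence[Sequence[float]],
                        W_c: Sequence[Sequence[float]],
                        b_c: Sequence[float]) -> float:
    """Mean squared error between words and their reconstructions (assumed form)."""
    total = 0.0
    for x, h in zip(words, hidden):     # assumes word i is paired with state i
        x_hat = reconstruct(h, W_c, b_c)
        total += sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return total / len(words)
```

The pooled vector feeds the relevance scorer and document encoder, while the reconstruction term gives the convolutional layer a training signal that does not depend on the adversary.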
We further ground the resulting sentence embed-
dings by including an additional word-level recon-
struction step in the convolutional model. The pur-
pose of this reconstruction step is to balance adver-
sarial training signals propagating back from the do-
main classifier. Specifically, it forces the sentence
encoder to keep rich word-level information in con-
trast to adversarial training that seeks to eliminate
aspect specific features. We provide an empirical
analysis of the impact of this reconstruction in the

Citations
Posted Content
TL;DR: This paper proposed using adversarial training for open-domain dialogue generation, where the generator is trained to generate sequences that are indistinguishable from human-generated dialogue utterances, and the outputs from the discriminator are used as rewards for the generator.
Abstract: In this paper, drawing intuition from the Turing test, we propose using adversarial training for open-domain dialogue generation: the system is trained to produce sequences that are indistinguishable from human-generated dialogue utterances. We cast the task as a reinforcement learning (RL) problem where we jointly train two systems, a generative model to produce response sequences, and a discriminator, analogous to the human evaluator in the Turing test, to distinguish between the human-generated dialogues and the machine-generated ones. The outputs from the discriminator are then used as rewards for the generative model, pushing the system to generate dialogues that mostly resemble human dialogues. In addition to adversarial training we describe a model for adversarial evaluation that uses success in fooling an adversary as a dialogue evaluation metric, while avoiding a number of potential pitfalls. Experimental results on several metrics, including adversarial evaluation, demonstrate that the adversarially-trained system generates higher-quality responses than previous baselines.

645 citations

Proceedings ArticleDOI
23 Jan 2017
TL;DR: This work applies adversarial training to open-domain dialogue generation, training a system to produce sequences that are indistinguishable from human-generated dialogue utterances, and investigates models for adversarial evaluation that uses success in fooling an adversary as a dialogue evaluation metric, while avoiding a number of potential pitfalls.
Abstract: We apply adversarial training to open-domain dialogue generation, training a system to produce sequences that are indistinguishable from human-generated dialogue utterances. We cast the task as a reinforcement learning problem where we jointly train two systems: a generative model to produce response sequences, and a discriminator, analogous to the human evaluator in the Turing test, to distinguish between the human-generated dialogues and the machine-generated ones. In this generative adversarial network approach, the outputs from the discriminator are used to encourage the system towards more human-like dialogue. Further, we investigate models for adversarial evaluation that uses success in fooling an adversary as a dialogue evaluation metric, while avoiding a number of potential pitfalls. Experimental results on several metrics, including adversarial evaluation, demonstrate that the adversarially-trained system generates higher-quality responses than previous baselines.

644 citations


Cites methods from "Aspect-augmented Adversarial Networ..."

  • ...Outside of sequence generation, Chen et al. (2016b) apply the idea of adversarial training to sentiment analysis and Zhang et al. (2017) apply the idea to domain adaptation tasks....

Proceedings ArticleDOI
19 Oct 2017
TL;DR: Comprehensive experimental results show that the proposed ACMR method is superior in learning effective subspace representation and that it significantly outperforms the state-of-the-art cross-modal retrieval methods.
Abstract: Cross-modal retrieval aims to enable flexible retrieval experience across different modalities (e.g., texts vs. images). The core of cross-modal retrieval research is to learn a common subspace where the items of different modalities can be directly compared to each other. In this paper, we present a novel Adversarial Cross-Modal Retrieval (ACMR) method, which seeks an effective common subspace based on adversarial learning. Adversarial learning is implemented as an interplay between two processes. The first process, a feature projector, tries to generate a modality-invariant representation in the common subspace and to confuse the other process, modality classifier, which tries to discriminate between different modalities based on the generated representation. We further impose triplet constraints on the feature projector in order to minimize the gap among the representations of all items from different modalities with same semantic labels, while maximizing the distances among semantically different images and texts. Through the joint exploitation of the above, the underlying cross-modal semantic structure of multimedia data is better preserved when this data is projected into the common subspace. Comprehensive experimental results on four widely used benchmark datasets show that the proposed ACMR method is superior in learning effective subspace representation and that it significantly outperforms the state-of-the-art cross-modal retrieval methods.

641 citations


Cites methods from "Aspect-augmented Adversarial Networ..."

  • ...Furthermore, our approach was inspired by the effectiveness of adversarial learning for various applications, like learning discriminative image features [17], or (un)supervised domain adaptation to enforce domain-invariant features [2, 6, 41], and regularizing correlation loss between cross-modal items [10]....

Journal ArticleDOI
TL;DR: A survey will compare single-source and typically homogeneous unsupervised deep domain adaptation approaches, combining the powerful, hierarchical representations from deep learning with domain adaptation to reduce reliance on potentially costly target data labels.
Abstract: Deep learning has produced state-of-the-art results for a variety of tasks. While such approaches for supervised learning have performed well, they assume that training and testing data are drawn from the same distribution, which may not always be the case. As a complement to this challenge, single-source unsupervised domain adaptation can handle situations where a network is trained on labeled data from a source domain and unlabeled data from a related but different target domain with the goal of performing well at test-time on the target domain. Many single-source and typically homogeneous unsupervised deep domain adaptation approaches have thus been developed, combining the powerful, hierarchical representations from deep learning with domain adaptation to reduce reliance on potentially costly target data labels. This survey will compare these approaches by examining alternative methods, the unique and common elements, results, and theoretical insights. We follow this with a look at application areas and open research directions.

496 citations


Cites background or methods from "Aspect-augmented Adversarial Networ..."

  • ...[275] found a CNN to work just as well as RNNs or bi-LSTMs in their experiments....

  • ...[275] weight examples by their relevance to their target aspect based on a small set of positive and negative keywords (a form of weak supervision)....

  • ...Domain adaptation has been used in natural language processing such as for sentiment analysis (Table 4, [275, 282]), other text classification [144, 275] including weakly-supervised aspect-transfer from one aspect of a dataset to another [275], relation extraction [70], semi-supervised sequence labeling [54], semi-supervised question answering [262], sentence specificity [119], and neural machine translation [23, 31, 42]....

Proceedings Article
24 May 2019
TL;DR: This paper constructs a simple counterexample showing that, contrary to common belief, the above conditions are not sufficient to guarantee successful domain adaptation, and proposes a natural and interpretable generalization upper bound that explicitly takes into account the aforementioned shift.
Abstract: Due to the ability of deep neural nets to learn rich representations, recent advances in unsupervised domain adaptation have focused on learning domain-invariant features that achieve a small error on the source domain. The hope is that the learnt representation, together with the hypothesis learnt from the source domain, can generalize to the target domain. In this paper, we first construct a simple counterexample showing that, contrary to common belief, the above conditions are not sufficient to guarantee successful domain adaptation. In particular, the counterexample exhibits conditional shift: the class-conditional distributions of input features change between source and target domains. To give a sufficient condition for domain adaptation, we propose a natural and interpretable generalization upper bound that explicitly takes into account the aforementioned shift. Moreover, we shed new light on the problem by proving an information-theoretic lower bound on the joint error of any domain adaptation method that attempts to learn invariant representations. Our result characterizes a fundamental tradeoff between learning invariant representations and achieving small joint error on both domains when the marginal label distributions differ from source to target. Finally, we conduct experiments on real-world datasets that corroborate our theoretical findings. We believe these insights are helpful in guiding the future design of domain adaptation and representation learning algorithms.

296 citations


Cites background from "Aspect-augmented Adversarial Networ..."

  • ...…with adversarial learning, e.g., video analysis (Hoffman et al., 2016; Shrivastava et al., 2016; Hoffman et al., 2017; Tzeng et al., 2017), natural language understanding (Zhang et al., 2017; Fu et al., 2017), speech recognition (Zhao et al., 2019; Hosseini-Asl et al., 2018), to name a few....

    [...]

  • ...In fact, despite being successfully applied in various applications (Zhang et al., 2017; Hoffman et al., 2017), it has also been reported that such methods fail to generalize in certain closely related source/target pairs, e.g., digit classification from MNIST to SVHN (Ganin et al., 2016)....

    [...]

References
Journal ArticleDOI
08 Dec 2014
TL;DR: A new framework for estimating generative models via an adversarial process, in which two models are simultaneously trained: a generative model G that captures the data distribution and a discriminative model D that estimates the probability that a sample came from the training data rather than G.
Abstract: We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to ½ everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.
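The minimax game described in the abstract above is commonly written as the following value function. The symbols p_data (the data distribution) and p_z (the noise prior fed to G) are not named in the abstract and are introduced here as notation:

```latex
\min_{G} \max_{D} \; V(D, G)
  = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}(z)}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

When G reproduces the data distribution, the best D can do is output 1/2 for every input, which matches the "D equal to ½ everywhere" solution stated in the abstract.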

38,211 citations


"Aspect-augmented Adversarial Networ..." refers methods in this paper

  • ...Adversarial networks have originally been developed for image generation (Goodfellow et al., 2014; Makhzani et al., 2015; Springenberg, 2015; Radford et al., 2015; Taigman et al., 2016), and later applied to domain adaption in computer vision (Ganin and Lempitsky, 2014; Ganin et al....

    [...]

  • ...Adversarial networks were originally developed for image generation (Goodfellow et al., 2014; Makhzani et al., 2015; Springenberg, 2015; Radford et al., 2015; Taigman et al., 2016), and were later applied to domain adaptation in computer vision (Ganin and Lempitsky, 2014; Ganin et al., 2015;…...

    [...]

Proceedings Article
Sergey Ioffe, Christian Szegedy
06 Jul 2015
TL;DR: Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
Abstract: Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization, and in some cases eliminates the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.82% top-5 test error, exceeding the accuracy of human raters.
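The per-mini-batch normalization described in the abstract above can be sketched roughly as follows. This is a minimal training-time illustration in plain Python, not the authors' implementation: the function name `batch_norm`, the scalar `gamma` and `beta`, and the value of `eps` are assumptions made for the example (in practice the scale and shift are learned per feature, and running statistics are kept for inference).

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift.

    batch: list of equal-length lists, one row per example.
    gamma, beta, eps: illustrative scalars (assumed, not from the source).
    """
    n = len(batch)
    dims = len(batch[0])
    # Per-feature mean and (biased) variance computed over the mini-batch.
    means = [sum(row[j] for row in batch) / n for j in range(dims)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in batch) / n for j in range(dims)
    ]
    # Normalize, then apply the scale (gamma) and shift (beta).
    return [
        [
            gamma * (row[j] - means[j]) / math.sqrt(variances[j] + eps) + beta
            for j in range(dims)
        ]
        for row in batch
    ]

# Two features on very different scales end up on a comparable scale.
normalized = batch_norm([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
```

After normalization, each feature column has approximately zero mean and unit variance within the mini-batch, regardless of its original scale.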

30,843 citations


"Aspect-augmented Adversarial Networ..." refers methods in this paper

  • ...We also apply batch normalization (Ioffe and Szegedy, 2015) on the sentence encoder and apply dropout with ratio 0....

    [...]

Proceedings Article
01 Jan 2015
TL;DR: It is conjectured that the use of a fixed-length vector is a bottleneck in improving the performance of the basic encoder-decoder architecture, and it is proposed to extend this architecture by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
Abstract: Neural machine translation is a recently proposed approach to machine translation. Unlike the traditional statistical machine translation, the neural machine translation aims at building a single neural network that can be jointly tuned to maximize the translation performance. The models proposed recently for neural machine translation often belong to a family of encoder-decoders and consists of an encoder that encodes a source sentence into a fixed-length vector from which a decoder generates a translation. In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly. With this new approach, we achieve a translation performance comparable to the existing state-of-the-art phrase-based system on the task of English-to-French translation. Furthermore, qualitative analysis reveals that the (soft-)alignments found by the model agree well with our intuition.
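The "(soft-)search" idea in the abstract above can be illustrated with a small sketch: score every source position against the current decoder state, turn the scores into weights with a softmax, and take the weighted average of the source annotations. This is a simplification under stated assumptions. The function name `soft_attention` is hypothetical, and a dot-product score is swapped in for brevity; it should not be read as the scoring function used in the paper.

```python
import math

def soft_attention(query, keys, values):
    """Return (weights, context) for one decoding step.

    query: decoder state vector (assumed name).
    keys, values: one vector per source position (assumed names).
    Scoring is a plain dot product, chosen only to keep the sketch short.
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Numerically stable softmax over source positions.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: weighted average of the value vectors.
    dim = len(values[0])
    context = [
        sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)
    ]
    return weights, context

weights, context = soft_attention(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[10.0], [20.0]],
)
```

Because every source position receives a nonzero weight, no hard segment is ever selected; the position whose key best matches the query simply contributes most to the context vector.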

20,027 citations


"Aspect-augmented Adversarial Networ..." refers background or methods in this paper

  • ...While traditional attention-based models typically induce attention in an unsupervised manner, they have to rely on a large amount of labeled data for the target task (Bahdanau et al., 2014; Rush et al., 2015; Chen et al., 2015; Cheng et al., 2016; Xu et al., 2015; Xu and Saenko, 2015; Yang et al., 2015; Martins and Astudillo, 2016; Lei et al., 2016)....

    [...]

Journal ArticleDOI
TL;DR: The relationship between transfer learning and other related machine learning techniques such as domain adaptation, multitask learning and sample selection bias, as well as covariate shift are discussed.
Abstract: A major assumption in many machine learning and data mining algorithms is that the training and future data must be in the same feature space and have the same distribution. However, in many real-world applications, this assumption may not hold. For example, we sometimes have a classification task in one domain of interest, but we only have sufficient training data in another domain of interest, where the latter data may be in a different feature space or follow a different data distribution. In such cases, knowledge transfer, if done successfully, would greatly improve the performance of learning by avoiding much expensive data-labeling efforts. In recent years, transfer learning has emerged as a new learning framework to address this problem. This survey focuses on categorizing and reviewing the current progress on transfer learning for classification, regression, and clustering problems. In this survey, we discuss the relationship between transfer learning and other related machine learning techniques such as domain adaptation, multitask learning and sample selection bias, as well as covariate shift. We also explore some potential future issues in transfer learning research.

18,616 citations


"Aspect-augmented Adversarial Networ..." refers background in this paper

  • ...(Caruana, 1998; Pan and Yang, 2010; Collobert and Weston, 2008; Liu et al., 2015; Bordes et al., 2012)....

    [...]

Posted Content
Sergey Ioffe, Christian Szegedy
TL;DR: Batch Normalization normalizes layer inputs for each training mini-batch to reduce internal covariate shift in deep neural networks, and improves upon the best published result on ImageNet classification.
Abstract: Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.

17,184 citations