Aspect-augmented Adversarial Networks for Domain Adaptation

doi:10.1162/TACL_A_00077

Yuan Zhang, Regina Barzilay, and Tommi Jaakkola

Computer Science and Artiﬁcial Intelligence Laboratory

Massachusetts Institute of Technology

{yuanzh, regina, tommi}@csail.mit.edu

Abstract

We introduce a neural method for transfer

learning between two (source and target) clas-

siﬁcation tasks or aspects over the same do-

main. Rather than training on target la-

bels, we use a few keywords pertaining to

source and target aspects indicating sentence

relevance instead of document class labels.

Documents are encoded by learning to em-

bed and softly select relevant sentences in an

aspect-dependent manner. A shared classi-

ﬁer is trained on the source encoded docu-

ments and labels, and applied to target en-

coded documents. We ensure transfer through

aspect-adversarial training so that encoded

documents are, as sets, aspect-invariant. Ex-

perimental results demonstrate that our ap-

proach outperforms different baselines and

model variants on two datasets, yielding an

improvement of 27% on a pathology dataset

and 5% on a review dataset.


1 Introduction

Many NLP problems are naturally multitask classi-

ﬁcation problems. For instance, values extracted for

different ﬁelds from the same document are often

dependent as they share the same context. Exist-

ing systems rely on this dependence (transfer across

ﬁelds) to improve accuracy. In this paper, we con-

sider a version of this problem where there is a clear

dependence between two tasks but annotations are

available only for the source task. For example,

Footnote 1: The code is available at https://github.com/yuanzh/aspect_adversarial.

Pathology report:

• Final diagnosis: BREAST (LEFT) … Invasive ductal

carcinoma: identiﬁed. Carcinoma tumor size: num cm.

Grade: 3. … Lymphatic vessel invasion: identiﬁed.

Blood vessel invasion: Suspicious. Margin of invasive

carcinoma …

Diagnosis results:

Source (IDC): Positive Target (LVI): Positive

Figure 1: A snippet of a breast pathology report with

diagnosis results for two types of disease (aspects):

carcinoma (IDC) and lymph invasion (LVI). Note

how the same phrase indicating positive results (e.g.

identiﬁed) is applicable to both aspects. A transfer

model learns to map other key phrases (e.g. Grade

3) to such shared indicators.

the target goal may be to classify pathology reports

(shown in Figure 1) for the presence of lymph in-

vasion but training data are available only for car-

cinoma in the same reports. We call this problem

aspect transfer as the objective is to learn to classify

examples differently, focusing on different aspects,

without access to target aspect labels. Clearly, such

transfer learning is possible only with auxiliary in-

formation relating the tasks together.

The key challenge is to articulate and incorpo-

rate commonalities across the tasks. For instance, in

classifying reviews of different products, sentiment

words (referred to as pivots) can be shared across

the products. This commonality enables one to align

feature spaces across multiple products, enabling

useful transfer (?). Similar properties hold in other

contexts and beyond sentiment analysis. Figure 1

Transactions of the Association for Computational Linguistics, vol. 5, pp. 515–528, 2017. Action Editor: Hal Daumé III.
Submission batch: 12/2016; Revision batch: 6/2017; Published 12/2017.
© 2017 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

shows that certain words and phrases like “identi-

ﬁed”, which indicates the presence of a histologi-

cal property, are applicable to both carcinoma and

lymph invasion. Our method learns and relies on

such shared indicators, and utilizes them for effec-

tive transfer.

The unique feature of our transfer problem is that

both the source and the target classiﬁers operate over

the same domain, i.e., the same examples. In this
setting, traditional transfer methods will always predict
the same label for both aspects and thus fail.
Instead of supplying the target classifier
with direct training labels, our approach builds

on a secondary relationship between the tasks using

aspect-relevance annotations of sentences. These

relevance annotations indicate a possibility that the

answer could be found in a sentence, not what the

answer is. One can often write simple keyword rules

that identify sentence relevance to a particular as-

pect through representative terms, e.g., speciﬁc hor-

monal markers in the context of pathology reports.

Annotations of this kind can be readily provided by

domain experts, or extracted from medical literature

such as codex rules in pathology (Pantanowitz et al.,

2008). We assume a small number of relevance an-

notations (rules) pertaining to both source and target

aspects as a form of weak supervision. We use this

sentence-level aspect relevance to learn how to en-

code the examples (e.g., pathology reports) from the

point of view of the desired aspect. In our approach,

we construct different aspect-dependent encodings

of the same document by softly selecting sentences

relevant to the aspect of interest. The key to effective

transfer is how these encodings are aligned.

This encoding mechanism brings the problem

closer to the realm of standard domain adaptation,

where the derived aspect-speciﬁc representations are

considered as different domains. Given these rep-

resentations, our method learns a label classiﬁer

shared between the two domains. To ensure that it

can be adjusted only based on the source class la-

bels, and that it also reasonably applies to the tar-

get encodings, we must align the two sets of en-

coded examples.

Footnote 2: This alignment or invariance is enforced on the level of sets,
not individual reports; the aspect-driven encoding of any specific
report should remain substantially different for the two tasks,
since the encoded examples are passed on to the same classifier.

Learning this alignment is possible because, as discussed above, some keywords

are directly transferable and can serve as anchors

for constructing this invariant space. To learn this

invariant representation, we introduce an adversar-

ial domain classiﬁer analogous to the recent suc-

cessful use of adversarial training in computer vi-

sion (Ganin and Lempitsky, 2014). The role of the

domain classiﬁer (adversary) is to learn to distin-

guish between the two types of encodings. During

training we update the encoder with an adversarial

objective to cause the classiﬁer to fail. The encoder

therefore learns to eliminate aspect-speciﬁc infor-

mation so that encodings look invariant (as sets) to

the classifier, thus establishing aspect-invariant
encodings and enabling transfer. All three components

in our approach, 1) aspect-driven encoding, 2) clas-

siﬁcation of source labels, and 3) domain adversary,

are trained jointly (concurrently) to complement and

balance each other.

Adversarial training of domain and label classi-

ﬁers can be challenging to stabilize. In our setting,

sentences are encoded with a convolutional model.

Feedback from adversarial training can be an un-

stable guide for how the sentences should be en-

coded. To address this issue, we incorporate an ad-

ditional word-level auto-encoder reconstruction loss

to ground the convolutional processing of sentences.

We empirically demonstrate that this additional ob-

jective yields richer and more diversiﬁed feature rep-

resentations, improving transfer.

We evaluate our approach on pathology reports

(aspect transfer) as well as on a more standard re-

view dataset (domain adaptation). On the pathology

dataset, we explore cross-aspect transfer across dif-

ferent types of breast disease. Speciﬁcally, we test

on six adaptation tasks, consistently outperforming

all other baselines. Overall, our full model achieves

27% and 20.2% absolute improvement arising from

aspect-driven encoding and adversarial training re-

spectively. Moreover, our unsupervised adaptation

method is only 5.7% behind the accuracy of a super-

vised target model. On the review dataset, we test

adaptations from hotel to restaurant reviews. Our

model outperforms the marginalized denoising au-

toencoder (Chen et al., 2012) by 5%. Finally, we

examine and illustrate the impact of individual com-

ponents on the resulting performance.


2 Related Work

Domain Adaptation for Deep Learning Exist-

ing approaches commonly induce abstract represen-

tations without pulling apart different aspects in the

same example, and therefore are likely to fail on the

aspect transfer problem. The majority of these prior

methods ﬁrst learn a task-independent representa-

tion, and then train a label predictor (e.g. SVM)

on this representation in a separate step. For example,
earlier work employs a shared autoencoder
(Glorot et al., 2011; Chopra et al., 2013) to

learn a cross-domain representation. Chen et al.

(2012) further improve and stabilize the represen-

tation learning by utilizing marginalized denoising

autoencoders. Later, Zhou et al. (2016) propose to
minimize domain-shift of the autoencoder in a linear
data combination manner. Other work has focused
on learning transferable representations in an

end-to-end fashion. Examples include using trans-

duction learning for object recognition (Sener et al.,

2016) and using residual transfer networks for image

classiﬁcation (Long et al., 2016). In contrast, we use

adversarial training to encourage learning domain-

invariant features in a more explicit way. Our approach
offers two further advantages over prior

work. First, we jointly optimize features with the

ﬁnal classiﬁcation task while many previous works

only learn task-independent features using autoen-

coders. Second, our model can handle traditional

domain transfer as well as aspect transfer, while pre-

vious methods can only handle the former.

Adversarial Learning in Vision and NLP Our

approach closely relates to the idea of domain-

adversarial training. Adversarial networks were

originally developed for image generation (Good-

fellow et al., 2014; Makhzani et al., 2015; Sprin-

genberg, 2015; Radford et al., 2016; Taigman et al.,

2016), and were later applied to domain adaptation

in computer vision (Ganin and Lempitsky, 2014;

Ganin et al., 2015; Bousmalis et al., 2016; Tzeng et

al., 2014) and speech recognition (Shinohara, 2016).

The core idea of these approaches is to promote the

emergence of invariant image features by optimizing

the feature extractor as an adversary against the do-

main classiﬁer. While Ganin et al. (2015) also apply

this idea to sentiment analysis, their practical gains

have remained limited.

Our approach presents two main departures. In

computer vision, adversarial learning has been used

for transferring across domains, while our method

can also handle aspect transfer. In addition, we in-

troduce a reconstruction loss which results in more

robust adversarial training. We believe that this for-

mulation will beneﬁt other applications of adversar-

ial training, beyond the ones described in this paper.

Semi-supervised Learning with Keywords In

our work, we use a small set of keywords as a source

of weak supervision for aspect-relevance scoring.

This relates to prior work on utilizing prototypes and

seed words in semi-supervised learning (Haghighi

and Klein, 2006; Grenager et al., 2005; Chang et

al., 2007; Mann and McCallum, 2010; Jagarlamudi

et al., 2012; Li et al., 2012; Eisenstein, 2017). All

these prior approaches utilize prototype annotations

primarily targeting model bootstrapping but not for

learning representations. In contrast, our model uses

provided keywords to learn aspect-driven encoding

of input examples.

Attention Mechanism in NLP One may view

our aspect-relevance scorer as a sentence-level

“semi-supervised attention”, in which relevant sen-

tences receive more attention during feature extrac-

tion. While traditional attention-based models typ-

ically induce attention in an unsupervised manner,

they have to rely on a large amount of labeled data

for the target task (Bahdanau et al., 2015; Rush et

al., 2015; Chen et al., 2015; Cheng et al., 2016;

Xu et al., 2015; Xu and Saenko, 2016; Yang et

al., 2016; Martins and Astudillo, 2016; Lei et al.,

2016). Unlike these methods, our approach assumes

no label annotations in the target domain. Other work
has focused on utilizing human-provided

rationales as “supervised attention” to improve pre-

diction (Zaidan et al., 2007; Marshall et al., 2015;

Zhang et al., 2016; Brun et al., 2016). In contrast,

our model only assumes access to a small set of key-

words as a source of weak supervision. Moreover,

all these prior approaches focus on in-domain clas-

siﬁcation. In this paper, however, we study the task

in the context of domain adaptation.

Multitask Learning Existing multitask learn-

ing methods focus on the case where supervision

is available for all tasks. A typical architecture in-

volves using a shared encoder with a separate classifier
for each task (Caruana, 1998; Pan and Yang,

2010; Collobert and Weston, 2008; Liu et al., 2015;

Bordes et al., 2012). In contrast, our work assumes

labeled data only for the source aspect. We train a

single classiﬁer for both aspects by learning aspect-

invariant representation that enables the transfer.

3 Problem Formulation

We begin by formalizing aspect transfer, with the
aim of differentiating it from standard domain adaptation.
In our setup, we have two classification tasks

called the source and the target tasks. In contrast to

source and target tasks in domain adaptation, both

of these tasks are deﬁned over the same set of ex-

amples (here documents, e.g., pathology reports).

What differentiates the two classiﬁcation tasks is

that they pertain to different aspects in the examples.

If each training document were annotated with both

the source and the target aspect labels, the problem

would reduce to multi-label classiﬁcation. However,

in our setting training labels are available only for

the source aspect so the goal is to solve the target

task without any associated training label.

To fix the notation, let $d = \{s_i\}_{i=1}^{|d|}$ be a document
that consists of a sequence of $|d|$ sentences $s_i$. Given
a document $d$, and the aspect of interest, we wish
to predict the corresponding aspect-dependent class
label $y$ (e.g., $y \in \{-1, 1\}$). We assume that the set
of possible labels is the same across aspects. We
use $y^s_{l;k}$ to denote the $k$-th coordinate of a one-hot
vector indicating the correct training source aspect
label for document $d_l$. Target aspect labels are not
available during training.

Beyond labeled documents for the source aspect
$\{d_l, y^s_l\}_{l \in L}$, and shared unlabeled documents for
source and target aspects $\{d_l\}_{l \in U}$, we assume further
that we have relevance scores pertaining to each

aspect. The relevance is given per sentence, for

some subset of sentences across the documents, and

indicates the possibility that the answer for that doc-

ument would be found in the sentence but without

indicating which way the answer goes. Relevance is

always aspect dependent yet often easy to provide in

the form of simple keyword rules.

We use $r^a_i \in \{0, 1\}$ to denote the given relevance
label pertaining to aspect $a$ for sentence $s_i$. Only a
small subset of sentences in the training set have associated
relevance labels. Let $R = \{(a, l, i)\}$ denote
the index set of relevance labels such that if
$(a, l, i) \in R$ then aspect $a$'s relevance label $r^a_{l,i}$ is
available for the $i$-th sentence in document $d_l$. In our
case relevance labels arise from aspect-dependent
keyword matches: $r^a_i = 1$ when the sentence contains
any keywords pertaining to aspect $a$, and $r^a_i = 0$
if it has any keywords of other aspects (see footnote 3). Separate
subsets of relevance labels are available for each aspect
as the keywords differ.
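As an illustration of how such keyword rules could produce relevance labels, the following is a minimal sketch. The keyword lists, the function name `relevance_label`, and the use of plain substring matching are assumptions made for this example and are not taken from the paper; sentences that match no keyword receive no label, i.e., they are not in the index set R.

```python
from typing import Dict, List, Optional

# Hypothetical keyword rules; real rules would come from domain experts.
KEYWORDS: Dict[str, List[str]] = {
    "IDC": ["ductal carcinoma"],
    "LVI": ["lymphatic vessel"],
}


def relevance_label(sentence: str, aspect: str,
                    keywords: Dict[str, List[str]] = KEYWORDS) -> Optional[int]:
    """Return 1, 0, or None (no relevance label) for `sentence` and `aspect`."""
    text = sentence.lower()
    hits_own = any(k in text for k in keywords[aspect])
    hits_other = any(
        k in text
        for other, words in keywords.items() if other != aspect
        for k in words
    )
    if hits_own:
        # Also covers sentences matching both this aspect and other aspects.
        return 1
    if hits_other:
        return 0
    return None  # unlabeled sentence: no relevance supervision


if __name__ == "__main__":
    s = "Invasive ductal carcinoma: identified."
    print(relevance_label(s, "IDC"), relevance_label(s, "LVI"))
```

Substring matching is the simplest possible rule; a practical system would likely need tokenization and handling of abbreviations.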

The transfer that is sought here is between two

tasks over the same set of examples rather than be-

tween two different types of examples for the same

task as in standard domain adaptation. However, the

two formulations can be reconciled if full relevance

annotations are assumed to be available during train-

ing and testing. In this scenario, we could simply lift

the sets of relevant sentences from each document

as new types of documents. The goal would be then

to learn to classify documents of type T (consisting

of sentences relevant to the target aspect) based on

having labels only for type S (source) documents,

a standard domain adaptation task. Our problem

is more challenging as the aspect-relevance of sen-

tences must be learned from limited annotations.
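The reduction to standard domain adaptation under full relevance annotations can be sketched as follows. This is only an illustration of the hypothetical scenario described above; the function name `lift_documents` and the dictionary layout of the relevance labels are assumptions for this example.

```python
from typing import Dict, List, Sequence


def lift_documents(sentences: Sequence[str],
                   relevance: Dict[str, Sequence[int]]) -> Dict[str, List[str]]:
    """Build one pseudo-document per aspect from fully annotated relevance.

    relevance[aspect][i] is 1 if sentence i is relevant to that aspect, else 0.
    """
    return {
        aspect: [s for s, r in zip(sentences, labels) if r == 1]
        for aspect, labels in relevance.items()
    }


if __name__ == "__main__":
    report = [
        "Invasive ductal carcinoma: identified.",
        "Lymphatic vessel invasion: identified.",
        "Grade: 3.",
    ]
    full_relevance = {"S": [1, 0, 0], "T": [0, 1, 0]}
    print(lift_documents(report, full_relevance))
```

The type S and type T pseudo-documents would then play the roles of the source and target domains.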

Finally, we note that the aspect transfer problem

and the method we develop to solve it work the same

even when source and target documents are a priori

different, something we will demonstrate later.

4 Methods

4.1 Overview of our approach

Our model consists of three key components as

shown in Figure 2. Each document is encoded

in a relevance weighted, aspect-dependent manner

(green, left part of Figure 2) and classiﬁed using the

label predictor (blue, top-right). During training, the

encoded documents are also passed on to the domain

classiﬁer (orange, bottom-right). The role of the do-

main classiﬁer, as the adversary, is to ensure that the

aspect-dependent encodings of documents are distri-

butionally matched. This matching justiﬁes the use

of the same end-classiﬁer to provide the predicted

label regardless of the task (aspect).

Footnote 3: $r^a_i = 1$ if the sentence contains keywords pertaining to both
aspect $a$ and other aspects.

[Figure 2 diagram. (a) Document encoder: the sentences of a pathology report (e.g., "INVASIVE DUCTAL CARCINOMA Tumor size … Grade: 3.", "Lymphatic vessel invasion: Not identified.", "… (IDC) is identified …") are mapped to sentence embeddings, assigned predicted relevance scores (e.g., $\hat{r} = 1.0$, $\hat{r} = 0.0$, $\hat{r} = 0.9$), and merged by a weighted combination and a transformation layer into the document representation. (b) Label predictor: class label $y^l$; objective: predict labels. (c) Domain classifier: domain label $y^a$; objective: predict domains. Adversary objective (via backprop): confuse the domain classifier.]

Figure 2: Aspect-augmented adversarial network for transfer learning. The model is composed of (a) an
aspect-driven document encoder, (b) a label predictor and (c) a domain classifier.

To encode a document, the model ﬁrst maps each

sentence into a vector and then passes the vector to a

scoring network to determine whether the sentence

is relevant for the chosen aspect. These predicted

relevance scores are used to obtain document vectors
by taking a relevance-weighted sum of the associated
sentence vectors. Thus, the manner in which

the document vector is constructed is always aspect-

dependent due to the chosen relevance weights.
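The relevance-weighted encoding described above can be sketched in a few lines. This is a minimal illustration with hand-picked numbers, not the authors' implementation: the function name `encode_document` is hypothetical, the sentence vectors stand in for the output of the sentence encoder, and no normalization or further transformation is applied.

```python
from typing import List, Sequence


def encode_document(sentence_vectors: Sequence[Sequence[float]],
                    relevance: Sequence[float]) -> List[float]:
    """Relevance-weighted sum of sentence vectors for one aspect."""
    if not sentence_vectors or len(sentence_vectors) != len(relevance):
        raise ValueError("need one relevance score per sentence vector")
    dim = len(sentence_vectors[0])
    doc = [0.0] * dim
    for vec, r in zip(sentence_vectors, relevance):
        for k in range(dim):
            doc[k] += r * vec[k]
    return doc


if __name__ == "__main__":
    sentences = [[1.0, 0.0], [0.0, 2.0], [4.0, 4.0]]
    # The same document encoded under two different aspects.
    print(encode_document(sentences, [1.0, 0.0, 0.0]))  # source-aspect weights
    print(encode_document(sentences, [0.0, 1.0, 0.5]))  # target-aspect weights
```

The two calls use the same sentence vectors but different relevance scores, which is what makes the resulting document vectors aspect-dependent.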

During training, the resulting adjusted document

vectors are consumed by the two classiﬁers. The pri-

mary label classiﬁer aims to predict the source labels

(when available), while the domain classiﬁer deter-

mines whether the document vector pertains to the

source or target aspect, which is the label that we

know by construction. Furthermore, we jointly up-

date the document encoder with a reverse of the gra-

dient from the domain classiﬁer, so that the encoder

learns to induce document representations that fool

the domain classiﬁer. The resulting encoded repre-

sentations will be aspect-invariant, facilitating trans-

fer.
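To make the reversed-gradient update concrete, here is a toy sketch with a one-parameter encoder and a one-parameter logistic domain classifier, using hand-derived gradients. Everything in it is an assumption for illustration (the scalar model, the names `adversarial_step` and `domain_loss`, the learning rate `lr`, and the reversal weight `rho`); it omits the label classifier and the other losses that are trained jointly in the full model.

```python
import math


def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))


def domain_loss(w: float, v: float, x: float, domain: float) -> float:
    """Cross-entropy of the domain classifier on the encoding z = w * x."""
    p = sigmoid(v * w * x)
    return -(domain * math.log(p) + (1.0 - domain) * math.log(1.0 - p))


def adversarial_step(w: float, v: float, x: float, domain: float,
                     lr: float = 0.1, rho: float = 1.0):
    """One joint update on a single example.

    w: encoder weight, v: domain-classifier weight,
    domain: 1.0 for a source-aspect encoding, 0.0 for a target-aspect one.
    """
    z = w * x
    p = sigmoid(v * z)
    grad_v = (p - domain) * z       # d(loss)/dv
    grad_w = (p - domain) * v * x   # d(loss)/dw, through z
    new_v = v - lr * grad_v         # classifier descends the domain loss
    new_w = w + lr * rho * grad_w   # encoder uses the reversed gradient
    return new_w, new_v


if __name__ == "__main__":
    new_w, new_v = adversarial_step(w=1.0, v=1.0, x=1.0, domain=1.0)
    print(new_w, new_v)
```

In this toy case the classifier update lowers the domain loss while the encoder update raises it, which is the opposing pressure that pushes the encodings toward aspect invariance.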

Our adversarial training scheme uses all the train-

ing losses concurrently to adjust the model param-

eters. During testing, we simply encode each test

document in a target-aspect dependent manner, and

apply the same label predictor. We expect that the

same label classiﬁer does well on the target task

since it solves the source task, and operates on

relevance-weighted representations that are matched

across the tasks. While our method is designed to

work in the extreme setting that the examples for the

two aspects are the same, this is by no means a requirement.
Our method will also work fine in the
more traditional domain adaptation setting, which
we will demonstrate later.

[Figure 3 diagram: word embeddings $x_0, x_1, x_2, x_3$ of the input "ductal carcinoma is identified" pass through a convolutional layer with hidden states $h_1, h_2, \ldots$; max-pooling gives the sentence embedding $x^{sen} = \max\{h_1, h_2, \ldots\}$; the reconstruction of $x_2$ is $\hat{x}_2 = \tanh(W^c h_2 + b^c)$.]

Figure 3: Illustration of the convolutional model and
the reconstruction of word embeddings from the associated
convolutional layer.

4.2 Components in detail

Sentence embedding We apply a convolutional
model illustrated in Figure 3 to each sentence $s_i$ to
obtain sentence-level vector embeddings $x^{sen}_i$. The
use of RNNs or bi-LSTMs would result in more flexible
sentence embeddings but, based on our initial experiments,
we did not observe any significant gains

over the simpler CNNs.
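The two formulas shown in Figure 3, max-pooling over hidden states and the tanh reconstruction of a word embedding from its hidden state, can be sketched as below. This is an illustration only: the hidden states are given as plain lists rather than produced by a convolution, the max is assumed to be taken element-wise, and the names `max_pool`, `reconstruct`, `W_c` and `b_c` are chosen for this example.

```python
import math
from typing import List, Sequence


def max_pool(hidden_states: Sequence[Sequence[float]]) -> List[float]:
    """Sentence embedding as the element-wise max over hidden states."""
    return [max(column) for column in zip(*hidden_states)]


def reconstruct(h: Sequence[float],
                W_c: Sequence[Sequence[float]],
                b_c: Sequence[float]) -> List[float]:
    """Reconstructed word embedding: tanh(W_c h + b_c)."""
    return [
        math.tanh(sum(w * hk for w, hk in zip(row, h)) + b)
        for row, b in zip(W_c, b_c)
    ]


if __name__ == "__main__":
    hidden = [[0.1, 0.9], [0.5, 0.2]]       # stand-ins for h_1, h_2
    print(max_pool(hidden))
    W_c = [[1.0, 2.0], [3.0, 4.0]]          # hypothetical parameters
    b_c = [0.0, 0.0]
    print(reconstruct(hidden[1], W_c, b_c))
```

A reconstruction loss would then compare the output of `reconstruct` with the original word embedding.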

We further ground the resulting sentence embed-

dings by including an additional word-level recon-

struction step in the convolutional model. The pur-

pose of this reconstruction step is to balance adver-

sarial training signals propagating back from the do-

main classiﬁer. Speciﬁcally, it forces the sentence

encoder to keep rich word-level information in con-

trast to adversarial training that seeks to eliminate

aspect-specific features. We provide an empirical

analysis of the impact of this reconstruction in the
