
Neural Relation Extraction with Selective Attention over Instances
Yankai Lin^1, Shiqi Shen^1, Zhiyuan Liu^{1,2}, Huanbo Luan^1, Maosong Sun^{1,2}
^1 Department of Computer Science and Technology, State Key Lab on Intelligent Technology and Systems, National Lab for Information Science and Technology, Tsinghua University, Beijing, China
^2 Jiangsu Collaborative Innovation Center for Language Competence, Jiangsu, China
Corresponding author: Zhiyuan Liu (liuzy@tsinghua.edu.cn)
Abstract
Distant supervised relation extraction has been widely used to find novel relational facts from text. However, distant supervision is inevitably accompanied by the wrong labelling problem, and these noisy data substantially hurt the performance of relation extraction. To alleviate this issue, we propose a sentence-level attention-based model for relation extraction. In this model, we employ convolutional neural networks to embed the semantics of sentences. Afterwards, we build sentence-level attention over multiple instances, which is expected to dynamically reduce the weights of those noisy instances. Experimental results on real-world datasets show that our model can make full use of all informative sentences and effectively reduce the influence of wrongly labelled instances. Our model achieves significant and consistent improvements on relation extraction compared with baselines. The source code of this paper can be obtained from https://github.com/thunlp/NRE.
1 Introduction
In recent years, various large-scale knowledge bases (KBs) such as Freebase (Bollacker et al., 2008), DBpedia (Auer et al., 2007) and YAGO (Suchanek et al., 2007) have been built and widely used in many natural language processing (NLP) tasks, including web search and question answering. These KBs mostly consist of relational facts in triple format, e.g., (Microsoft, founder, Bill Gates). Although existing KBs contain a massive amount of facts, they are still far from complete compared with the infinite set of real-world facts. To enrich KBs, many efforts have been invested in automatically finding unknown relational facts. Therefore, relation extraction (RE), the process of generating relational data from plain text, is a crucial task in NLP.
Most existing supervised RE systems require a large amount of labelled relation-specific training data, which is very time consuming and labor intensive. (Mintz et al., 2009) proposes distant supervision to automatically generate training data by aligning KBs with texts. They assume that if two entities have a relation in a KB, then all sentences that contain these two entities will express this relation. For example, (Microsoft, founder, Bill Gates) is a relational fact in a KB. Distant supervision will regard all sentences that contain these two entities as active instances of the relation founder. Although distant supervision is an effective strategy to automatically label training data, it always suffers from the wrong labelling problem. For example, the sentence "Bill Gates's turn to philanthropy was linked to the antitrust problems Microsoft had in the U.S. and the European union." does not express the relation founder but will still be regarded as an active instance. Hence, (Riedel et al., 2010; Hoffmann et al., 2011; Surdeanu et al., 2012) adopt multi-instance learning to alleviate the wrong labelling problem. The main weakness of these conventional methods is that most features are explicitly derived from NLP tools such as POS tagging, and the errors generated by those tools propagate through these methods.
Some recent works (Socher et al., 2012; Zeng et al., 2014; dos Santos et al., 2015) attempt to use deep neural networks for relation classification without handcrafted features. These methods build classifiers based on sentence-level annotated data, which cannot be applied to large-scale KBs due to the lack of human-annotated training data. Therefore, (Zeng et al., 2015) incorporates multi-instance learning into a neural network model, which can build a relation extractor based on distant supervision data. Although that method achieves significant improvement in relation extraction, it is still far from satisfactory. It assumes that at least one sentence mentioning the two entities will express their relation, and only selects the most likely sentence for each entity pair in training and prediction. It is apparent that the method loses a large amount of the rich information contained in the neglected sentences.

[Figure 1: The architecture of the sentence-level attention-based CNN, where $x_i$ and $\mathbf{x}_i$ indicate the original sentence for an entity pair and its corresponding sentence representation, $\alpha_i$ is the weight given by sentence-level attention, and $\mathbf{s}$ indicates the representation of the sentence set.]
In this paper, we propose a sentence-level attention-based convolutional neural network (CNN) for distant supervised relation extraction. As illustrated in Fig. 1, we employ a CNN to embed the semantics of sentences. Afterwards, to utilize all informative sentences, we represent the relation as a semantic composition of sentence embeddings. To address the wrong labelling problem, we build sentence-level attention over multiple instances, which is expected to dynamically reduce the weights of those noisy instances. Finally, we extract relations with the relation vector weighted by sentence-level attention. We evaluate our model on a real-world dataset for the task of relation extraction. The experimental results show that our model achieves significant and consistent improvements in relation extraction compared with the state-of-the-art methods.
The contributions of this paper can be summarized as follows:
• Compared with existing neural relation extraction models, our model can make full use of all informative sentences of each entity pair.
• To address the wrong labelling problem in distant supervision, we propose selective attention to de-emphasize noisy instances.
• In the experiments, we show that selective attention is beneficial to two kinds of CNN models in the task of relation extraction.
2 Related Work
Relation extraction is one of the most important tasks in NLP. Many efforts have been invested in relation extraction, especially in supervised relation extraction. Most of these methods need a great deal of annotated data, which is time consuming and labor intensive. To address this issue, (Mintz et al., 2009) aligns plain text with Freebase by distant supervision. However, distant supervision is inevitably accompanied by the wrong labelling problem. To alleviate this problem, (Riedel et al., 2010) models distant supervision for relation extraction as a multi-instance single-label problem, and (Hoffmann et al., 2011; Surdeanu et al., 2012) adopt multi-instance multi-label learning for relation extraction. Multi-instance learning was originally proposed to address the issue of ambiguously-labelled training data when predicting the activity of drugs (Dietterich et al., 1997); it considers the reliability of the label for each instance. (Bunescu and Mooney, 2007) connects weak supervision with multi-instance learning and extends it to relation extraction. However, all these feature-based methods depend strongly on the quality of the features generated by NLP tools and therefore suffer from the error propagation problem.
Recently, deep learning (Bengio, 2009) has been widely used in various areas, including computer vision and speech recognition. It has also been successfully applied to different NLP tasks such as part-of-speech tagging (Collobert et al., 2011), sentiment analysis (dos Santos and Gatti, 2014), parsing (Socher et al., 2013), and machine translation (Sutskever et al., 2014). Due to this recent success, many researchers have investigated the possibility of using neural networks to automatically learn features for relation extraction. (Socher et al., 2012) uses a recursive neural network for relation extraction: they parse the sentences first and then represent each node in the parse tree as a vector. Moreover, (Zeng et al., 2014; dos Santos et al., 2015) adopt end-to-end convolutional neural networks for relation extraction. Besides, (Xie et al., 2016) attempts to incorporate the text information of entities for relation extraction.
Although these methods achieve great success, they still extract relations at the sentence level and suffer from a lack of sufficient training data. In addition, the multi-instance learning strategy of conventional methods cannot be easily applied to neural network models. Therefore, (Zeng et al., 2015) combines at-least-one multi-instance learning with a neural network model to extract relations from distant supervision data. However, they assume that only one sentence is active for each entity pair, and hence their method loses a large amount of the rich information contained in the neglected sentences. Different from their method, we propose sentence-level attention over multiple instances, which can utilize all informative sentences.

Attention-based models have attracted a lot of interest from researchers recently. The selectivity of attention-based models allows them to learn alignments between different modalities. They have been applied to various areas such as image classification (Mnih et al., 2014), speech recognition (Chorowski et al., 2014), image caption generation (Xu et al., 2015) and machine translation (Bahdanau et al., 2014). To the best of our knowledge, this is the first effort to adopt an attention-based model for distant supervised relation extraction.
3 Methodology
Given a set of sentences $\{x_1, x_2, \cdots, x_n\}$ and two corresponding entities, our model measures the probability of each relation r. In this section, we introduce our model in two main parts:
• Sentence Encoder. Given a sentence x and two target entities, a convolutional neural network (CNN) is used to construct a distributed representation $\mathbf{x}$ of the sentence.
• Selective Attention over Instances. When the distributed vector representations of all sentences are learnt, we use sentence-level attention to select the sentences which really express the corresponding relation.
3.1 Sentence Encoder
[Figure 2: The architecture of the CNN/PCNN used for the sentence encoder. The example sentence is "Bill_Gates is the founder of Microsoft."; its vector representation (word and position embeddings) is passed through a convolution layer, max pooling, and a non-linear layer.]
As shown in Fig. 2, we transform the sentence x into its distributed representation $\mathbf{x}$ by a CNN. First, words in the sentence are transformed into dense real-valued feature vectors. Next, a convolutional layer, a max-pooling layer and a non-linear transformation layer are used to construct the distributed representation of the sentence, i.e., $\mathbf{x}$.
3.1.1 Input Representation
The inputs of the CNN are the raw words of the sentence x. We first transform the words into low-dimensional vectors: each input word is mapped to a vector via a word embedding matrix. In addition, to specify the position of each entity pair, we also use position embeddings for all words in the sentence.
Word Embeddings. Word embeddings aim to transform words into distributed representations which capture the syntactic and semantic meanings of the words. Given a sentence x consisting of m words, $x = \{w_1, w_2, \cdots, w_m\}$, every word $w_i$ is represented by a real-valued vector. Word representations are encoded by column vectors in an embedding matrix $\mathbf{V} \in \mathbb{R}^{d^a \times |V|}$, where V is a fixed-sized vocabulary.
Position Embeddings. In the task of relation extraction, the words close to the target entities are usually informative for determining the relation between the entities. Similar to (Zeng et al., 2014), we use position embeddings specified by entity pairs. They help the CNN keep track of how close each word is to the head or tail entity. A position feature is defined as the combination of the relative distances from the current word to the head and tail entities. For example, in the sentence "Bill_Gates is the founder of Microsoft.", the relative distance from the word "founder" to the head entity Bill_Gates is 3 and to the tail entity Microsoft is 2.

In the example shown in Fig. 2, it is assumed that the dimension $d^a$ of the word embedding is 3 and the dimension $d^b$ of the position embedding is 1. Finally, we concatenate the word embeddings and position embeddings of all words and denote the result as a vector sequence $\mathbf{w} = \{\mathbf{w}_1, \mathbf{w}_2, \cdots, \mathbf{w}_m\}$, where $\mathbf{w}_i \in \mathbb{R}^d$ ($d = d^a + d^b \times 2$).
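To make the input representation concrete, the following is a minimal sketch (in PyTorch, which the paper does not prescribe) of how word and position embeddings might be looked up and concatenated; the dimensions, the distance clipping range, and all names here are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class InputRepresentation(nn.Module):
    """Word embeddings + two position embeddings (relative distance to head / tail entity)."""
    def __init__(self, vocab_size, d_word=50, d_pos=5, max_dist=30):
        super().__init__()
        self.max_dist = max_dist
        self.word_emb = nn.Embedding(vocab_size, d_word)
        # relative distances are clipped and shifted into [0, 2 * max_dist] to index the tables
        self.pos_emb_head = nn.Embedding(2 * max_dist + 1, d_pos)
        self.pos_emb_tail = nn.Embedding(2 * max_dist + 1, d_pos)

    def forward(self, word_ids, head_pos, tail_pos):
        # word_ids: (batch, m) token indices; head_pos / tail_pos: (batch,) entity positions
        m = word_ids.size(1)
        idx = torch.arange(m, device=word_ids.device).unsqueeze(0)                 # (1, m)
        d_head = (idx - head_pos.unsqueeze(1)).clamp(-self.max_dist, self.max_dist) + self.max_dist
        d_tail = (idx - tail_pos.unsqueeze(1)).clamp(-self.max_dist, self.max_dist) + self.max_dist
        # concatenate word and both position embeddings: (batch, m, d_word + 2 * d_pos)
        return torch.cat([self.word_emb(word_ids),
                          self.pos_emb_head(d_head),
                          self.pos_emb_tail(d_tail)], dim=-1)
```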
3.1.2 Convolution, Max-pooling and Non-linear Layers
In relation extraction, the main challenges are that the length of the sentences is variable and that the important information can appear in any part of a sentence. Hence, we should utilize all local features and perform the relation prediction globally. Here, we use a convolutional layer to merge all these features. The convolutional layer first extracts local features with a sliding window of length l over the sentence. In the example shown in Fig. 2, we assume that the length of the sliding window l is 3. Then, it combines all local features via a max-pooling operation to obtain a fixed-sized vector for the input sentence.
Here, convolution is defined as an operation between a vector sequence $\mathbf{w}$ and a convolution matrix $\mathbf{W} \in \mathbb{R}^{d^c \times (l \times d)}$, where $d^c$ is the sentence embedding size. Let us define the vector $\mathbf{q}_i \in \mathbb{R}^{l \times d}$ as the concatenation of the word embeddings within the i-th window:

$$\mathbf{q}_i = \mathbf{w}_{i-l+1:i} \quad (1 \le i \le m + l - 1). \quad (1)$$

Since the window may be outside of the sentence boundaries when it slides near a boundary, we set special padding tokens for the sentence; that is, we regard all out-of-range input vectors $\mathbf{w}_i$ ($i < 1$ or $i > m$) as zero vectors.

Hence, the i-th filter of the convolutional layer is computed as:

$$\mathbf{p}_i = [\mathbf{W}\mathbf{q} + \mathbf{b}]_i, \quad (2)$$

where $\mathbf{b}$ is a bias vector. The i-th element of the vector $\mathbf{x} \in \mathbb{R}^{d^c}$ is then:

$$[\mathbf{x}]_i = \max(\mathbf{p}_i). \quad (3)$$
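As a rough illustration of Eqs. (1)-(3), the window convolution followed by max pooling over time can be expressed with a one-dimensional convolution. This is a sketch under the assumption of zero padding at the sentence boundaries; hyperparameters such as the 230-dimensional sentence embedding are illustrative choices, not prescribed here.

```python
import torch
import torch.nn as nn

class CNNEncoder(nn.Module):
    """Convolution over the word/position sequence, then max pooling and tanh (Eqs. 1-3)."""
    def __init__(self, d_in, d_c=230, window=3):
        super().__init__()
        # padding = window - 1 keeps every window position 1 <= i <= m + l - 1
        self.conv = nn.Conv1d(d_in, d_c, kernel_size=window, padding=window - 1)

    def forward(self, w):
        # w: (batch, m, d_in) concatenated word + position embeddings
        p = self.conv(w.transpose(1, 2))    # (batch, d_c, m + l - 1) filter responses
        x = p.max(dim=2).values             # max over all window positions (Eq. 3)
        return torch.tanh(x)                # (batch, d_c) sentence representation
```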
Further, PCNN (Zeng et al., 2015), which is a variation of the CNN, adopts piecewise max pooling for relation extraction. Each convolutional filter output $\mathbf{p}_i$ is divided into three segments $(\mathbf{p}_{i1}, \mathbf{p}_{i2}, \mathbf{p}_{i3})$ by the head and tail entities, and the max-pooling procedure is performed over the three segments separately:

$$[\mathbf{x}]_{ij} = \max(\mathbf{p}_{ij}), \quad (4)$$

and $[\mathbf{x}]_i$ is set as the concatenation of the $[\mathbf{x}]_{ij}$.

Finally, we apply a non-linear function at the output, such as the hyperbolic tangent.
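The piecewise max pooling of Eq. (4) could be sketched as follows for a single sentence, assuming the head and tail entity positions split the convolution output into three segments; the boundary handling is an illustrative choice, not necessarily the paper's exact implementation.

```python
import torch

def piecewise_max_pool(p, head_pos, tail_pos):
    """p: (d_c, length) convolution output for one sentence.
    Pools each filter over the three segments defined by the entity positions (Eq. 4)."""
    left, right = sorted((head_pos, tail_pos))
    segments = [p[:, :left + 1], p[:, left + 1:right + 1], p[:, right + 1:]]
    pooled = [seg.max(dim=1).values if seg.size(1) > 0 else p.new_zeros(p.size(0))
              for seg in segments]
    # concatenate the three pooled pieces and apply the non-linearity: (3 * d_c,)
    return torch.tanh(torch.cat(pooled))
```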
3.2 Selective Attention over Instances
Suppose there is a set S containing n sentences for the entity pair (head, tail), i.e., $S = \{x_1, x_2, \cdots, x_n\}$.

To exploit the information of all sentences, our model represents the set S with a real-valued vector $\mathbf{s}$ when predicting relation r. It is straightforward that the representation of the set S depends on the representations of all its sentences, $\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_n$. Each sentence representation $\mathbf{x}_i$ contains information about whether the entity pair (head, tail) holds relation r for the input sentence $x_i$.
The set vector $\mathbf{s}$ is then computed as a weighted sum of the sentence vectors $\mathbf{x}_i$:

$$\mathbf{s} = \sum_i \alpha_i \mathbf{x}_i, \quad (5)$$

where $\alpha_i$ is the weight of each sentence vector $\mathbf{x}_i$. In this paper, we define $\alpha_i$ in two ways:
Average: We assume that all sentences in the set S contribute equally to the representation of the set. That is, the embedding of the set S is the average of all the sentence vectors:

$$\mathbf{s} = \sum_i \frac{1}{n} \mathbf{x}_i. \quad (6)$$

This is a naive baseline for our selective attention.
Selective Attention: However, the wrong labelling problem inevitably occurs. Thus, if we regard each sentence equally, the wrongly labelled sentences will bring in massive noise during training and testing. Hence, we use selective attention to de-emphasize the noisy sentences, and $\alpha_i$ is further defined as:

$$\alpha_i = \frac{\exp(e_i)}{\sum_k \exp(e_k)}, \quad (7)$$

where $e_i$ is a query-based function which scores how well the input sentence $x_i$ and the predicted relation r match. We select the bilinear form, which achieves the best performance among different alternatives:

$$e_i = \mathbf{x}_i \mathbf{A} \mathbf{r}, \quad (8)$$

where $\mathbf{A}$ is a weighted diagonal matrix, and $\mathbf{r}$ is the query vector associated with relation r, which indicates the representation of relation r.
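A minimal sketch of the selective attention of Eqs. (5), (7) and (8) is given below; the diagonal parameterization of $\mathbf{A}$ and the processing of one sentence set at a time are assumptions made for illustration, not the authors' exact code.

```python
import torch
import torch.nn as nn

class SelectiveAttention(nn.Module):
    """Bilinear attention over the sentence representations of one entity pair (Eqs. 5, 7, 8)."""
    def __init__(self, d_c, n_rel):
        super().__init__()
        self.A_diag = nn.Parameter(torch.ones(d_c))   # diagonal of the weighted matrix A
        self.rel_query = nn.Embedding(n_rel, d_c)     # query vector r for each relation

    def forward(self, X, rel_id):
        # X: (n, d_c) representations of the n sentences in the set
        # rel_id: 0-dim LongTensor holding the relation index
        r = self.rel_query(rel_id)                    # (d_c,)
        e = X @ (self.A_diag * r)                     # e_i = x_i A r, shape (n,)  (Eq. 8)
        alpha = torch.softmax(e, dim=0)               # attention weights          (Eq. 7)
        s = alpha @ X                                 # weighted sum of sentences  (Eq. 5)
        return s, alpha
```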
Finally, we define the conditional probability $p(r|S, \theta)$ through a softmax layer as follows:

$$p(r|S, \theta) = \frac{\exp(o_r)}{\sum_{k=1}^{n_r} \exp(o_k)}, \quad (9)$$

where $n_r$ is the total number of relations and $\mathbf{o}$ is the final output of the neural network, which corresponds to the scores associated with all relation types and is defined as follows:

$$\mathbf{o} = \mathbf{M}\mathbf{s} + \mathbf{d}, \quad (10)$$

where $\mathbf{d} \in \mathbb{R}^{n_r}$ is a bias vector and $\mathbf{M}$ is the representation matrix of relations.
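Eqs. (9)-(10) then amount to a linear layer followed by a softmax over relation types; the sketch below uses illustrative dimensions (230-dimensional set vectors, 53 relations) and hypothetical names.

```python
import torch
import torch.nn as nn

d_c, n_rel = 230, 53
M = nn.Linear(d_c, n_rel)             # o = M s + d (Eq. 10): weight is M, bias is d

def relation_probabilities(s):
    o = M(s)                          # scores for all relation types
    return torch.softmax(o, dim=-1)   # p(r | S, theta) (Eq. 9)
```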
(Zeng et al., 2015) follows the assumption that at least one mention of the entity pair will reflect their relation, and only uses the sentence with the highest probability in each set for training. Hence, the method they adopt for multi-instance learning can be regarded as a special case of our selective attention in which the weight of the sentence with the highest probability is set to 1 and the others to 0.
3.3 Optimization and Implementation Details
Here we introduce the learning and optimization details of our model. We define the objective function using cross-entropy at the set level as follows:

$$J(\theta) = \sum_{i=1}^{s} \log p(r_i | S_i, \theta), \quad (11)$$

where s indicates the number of sentence sets and $\theta$ indicates all parameters of our model. To solve the optimization problem, we adopt stochastic gradient descent (SGD) to optimize the objective function. For learning, we iterate by randomly selecting a mini-batch from the training set until convergence.

In the implementation, we employ dropout (Srivastava et al., 2014) on the output layer to prevent overfitting. The dropout layer is defined as an element-wise multiplication with a vector $\mathbf{h}$ of Bernoulli random variables with probability p. Equation (10) is then rewritten as:

$$\mathbf{o} = \mathbf{M}(\mathbf{s} \circ \mathbf{h}) + \mathbf{d}. \quad (12)$$
In the test phase, the learnt set representations are scaled by p, i.e., $\hat{\mathbf{s}}_i = p\,\mathbf{s}_i$, and the scaled set vector $\hat{\mathbf{s}}_i$ is finally used to predict relations.
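Putting the pieces together, a hedged sketch of one training step with the set-level cross-entropy of Eq. (11) and dropout on the set vector (Eq. 12) might look as follows. The model methods (encode, attention, dropout, output) refer to the earlier hypothetical sketches, and the use of PyTorch's inverted dropout, which removes the explicit test-time scaling by p, is an implementation assumption rather than the paper's description.

```python
import torch
import torch.nn as nn

def train_step(model, optimizer, batch):
    """batch: list of (sentence_tensors, head_pos, tail_pos, relation_id) tuples,
    one tuple per entity-pair sentence set; relation_id is a 0-dim LongTensor."""
    model.train()
    optimizer.zero_grad()
    loss = 0.0
    for sentences, head, tail, rel in batch:
        X = model.encode(sentences, head, tail)      # (n, d_c) sentence representations
        s, _ = model.attention(X, rel)               # set vector weighted by selective attention
        s = model.dropout(s)                         # Eq. (12); nn.Dropout rescales during training
        o = model.output(s)                          # relation scores, Eq. (10)
        # set-level cross-entropy term of Eq. (11)
        loss = loss + nn.functional.cross_entropy(o.unsqueeze(0), rel.view(1))
    loss.backward()
    optimizer.step()                                 # SGD update over the mini-batch
    return loss.item()
```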
4 Experiments
Our experiments are intended to demonstrate that our neural models with sentence-level selective attention can alleviate the wrong labelling problem and take full advantage of informative sentences for distant supervised relation extraction. To this end, we first introduce the dataset and evaluation metrics used in the experiments. Next, we use cross-validation to determine the parameters of our model. We then evaluate the effects of our selective attention and show its performance on data with different set sizes. Finally, we compare the performance of our method with several state-of-the-art feature-based methods.
4.1 Dataset and Evaluation Metrics
We evaluate our model on a widely used dataset^1 which was developed by (Riedel et al., 2010) and has also been used by (Hoffmann et al., 2011; Surdeanu et al., 2012). This dataset was generated by aligning Freebase relations with the New York Times (NYT) corpus. Entity mentions are found using the Stanford named entity tagger (Finkel et al., 2005), and are further matched to the names of Freebase entities. The Freebase relations are divided into two parts, one for training and one for testing. Sentences aligned from the 2005-2006 portion of the corpus are regarded as training instances, and the testing instances are the aligned sentences from 2007. There are 53 possible relations, including a special relation NA which indicates that there is no relation between the head and tail entities. The training data contains 522,611 sentences, 281,270 entity pairs and 18,252 relational facts. The testing set contains 172,448 sentences, 96,678 entity pairs and 1,950 relational facts.

Similar to previous work (Mintz et al., 2009), we evaluate our model with the held-out evaluation. It evaluates our model by comparing the relation

^1 http://iesl.cs.umass.edu/riedel/ecml/
