
Neural Relation Extraction with Selective Attention over Instances
Yankai Lin^1, Shiqi Shen^1, Zhiyuan Liu^{1,2}, Huanbo Luan^1, Maosong Sun^{1,2}
^1 Department of Computer Science and Technology, State Key Lab on Intelligent Technology and Systems, National Lab for Information Science and Technology, Tsinghua University, Beijing, China
^2 Jiangsu Collaborative Innovation Center for Language Competence, Jiangsu, China
Corresponding author: Zhiyuan Liu (liuzy@tsinghua.edu.cn)
Abstract
Distant supervised relation extraction has been widely used to find novel relational facts from text. However, distant supervision is inevitably accompanied by the wrong labelling problem, and these noisy data substantially hurt the performance of relation extraction. To alleviate this issue, we propose a sentence-level attention-based model for relation extraction. In this model, we employ convolutional neural networks to embed the semantics of sentences. Afterwards, we build sentence-level attention over multiple instances, which is expected to dynamically reduce the weights of those noisy instances. Experimental results on real-world datasets show that our model can make full use of all informative sentences and effectively reduce the influence of wrongly labelled instances. Our model achieves significant and consistent improvements on relation extraction compared with baselines. The source code of this paper can be obtained from https://github.com/thunlp/NRE.
1 Introduction
In recent years, various large-scale knowledge bases (KBs) such as Freebase (Bollacker et al., 2008), DBpedia (Auer et al., 2007) and YAGO (Suchanek et al., 2007) have been built and widely used in many natural language processing (NLP) tasks, including web search and question answering. These KBs mostly consist of relational facts in triple format, e.g., (Microsoft, founder, Bill Gates). Although existing KBs contain a massive amount of facts, they are still far from complete compared with the infinite set of real-world facts. To enrich KBs, many efforts have been invested in automatically finding unknown relational facts. Therefore, relation extraction (RE), the process of generating relational data from plain text, is a crucial task in NLP.
Most existing supervised RE systems require a large amount of labelled relation-specific training data, which is very time consuming and labor intensive. (Mintz et al., 2009) proposes distant supervision to automatically generate training data by aligning KBs with texts. They assume that if two entities have a relation in a KB, then all sentences that contain these two entities will express this relation. For example, (Microsoft, founder, Bill Gates) is a relational fact in a KB. Distant supervision will regard all sentences that contain these two entities as active instances of the relation founder. Although distant supervision is an effective strategy to automatically label training data, it always suffers from the wrong labelling problem. For example, the sentence "Bill Gates's turn to philanthropy was linked to the antitrust problems Microsoft had in the U.S. and the European union." does not express the relation founder but will still be regarded as an active instance. Hence, (Riedel et al., 2010; Hoffmann et al., 2011; Surdeanu et al., 2012) adopt multi-instance learning to alleviate the wrong labelling problem. The main weakness of these conventional methods is that most features are explicitly derived from NLP tools such as POS tagging, and the errors generated by those tools propagate through these methods.
Some recent works (Socher et al., 2012; Zeng et al., 2014; dos Santos et al., 2015) attempt to use deep neural networks for relation classification without handcrafted features. These methods build classifiers based on sentence-level annotated data, which cannot be applied to large-scale KBs due to the lack of human-annotated training data. Therefore, (Zeng et al., 2015) incorporates multi-instance learning into a neural network model, which can build a relation extractor based on distant supervision data. Although that method achieves significant improvement in relation extraction, it is still far from satisfactory. It assumes that at least one sentence mentioning the two entities will express their relation, and only selects the most likely sentence for each entity pair in training and prediction. It is apparent that the method loses a large amount of the rich information contained in the neglected sentences.

[Figure 1: The architecture of the sentence-level attention-based CNN, where $x_i$ and $\mathbf{x}_i$ indicate the original sentence for an entity pair and its corresponding sentence representation, $\alpha_i$ is the weight given by sentence-level attention, and $\mathbf{s}$ indicates the representation of the sentence set.]
In this paper, we propose a sentence-level attention-based convolutional neural network (CNN) for distant supervised relation extraction. As illustrated in Fig. 1, we employ a CNN to embed the semantics of sentences. Afterwards, to utilize all informative sentences, we represent the relation as a semantic composition of sentence embeddings. To address the wrong labelling problem, we build sentence-level attention over multiple instances, which is expected to dynamically reduce the weights of those noisy instances. Finally, we extract relations with the relation vector weighted by sentence-level attention. We evaluate our model on a real-world dataset for the task of relation extraction. The experimental results show that our model achieves significant and consistent improvements in relation extraction compared with the state-of-the-art methods.
The contributions of this paper can be summarized as follows:
• Compared with existing neural relation extraction models, our model can make full use of all informative sentences of each entity pair.
• To address the wrong labelling problem in distant supervision, we propose selective attention to de-emphasize noisy instances.
• In the experiments, we show that selective attention is beneficial to two kinds of CNN models in the task of relation extraction.
2 Related Work
Relation extraction is one of the most important tasks in NLP. Many efforts have been invested in relation extraction, especially in supervised relation extraction. Most of these methods need a great deal of annotated data, which is time consuming and labor intensive. To address this issue, (Mintz et al., 2009) aligns plain text with Freebase by distant supervision. However, distant supervision is inevitably accompanied by the wrong labelling problem. To alleviate this problem, (Riedel et al., 2010) models distant supervision for relation extraction as a multi-instance single-label problem, and (Hoffmann et al., 2011; Surdeanu et al., 2012) adopt multi-instance multi-label learning for relation extraction. Multi-instance learning was originally proposed to address the issue of ambiguously-labelled training data when predicting the activity of drugs (Dietterich et al., 1997); it considers the reliability of the label for each instance. (Bunescu and Mooney, 2007) connects weak supervision with multi-instance learning and extends it to relation extraction. However, all these feature-based methods depend strongly on the quality of the features generated by NLP tools and therefore suffer from the error propagation problem.
Recently, deep learning (Bengio, 2009) has been widely used in various areas, including computer vision and speech recognition. It has also been successfully applied to different NLP tasks such as part-of-speech tagging (Collobert et al., 2011), sentiment analysis (dos Santos and Gatti, 2014), parsing (Socher et al., 2013), and machine translation (Sutskever et al., 2014). Due to this recent success, many researchers have investigated the possibility of using neural networks to automatically learn features for relation extraction. (Socher et al., 2012) uses a recursive neural network for relation extraction: they parse the sentences first and then represent each node in the parse tree as a vector. Moreover, (Zeng et al., 2014; dos Santos et al., 2015) adopt end-to-end convolutional neural networks for relation extraction. Besides, (Xie et al., 2016) attempts to incorporate the text information of entities for relation extraction.
Although these methods achieve great success, they still extract relations at the sentence level and suffer from a lack of sufficient training data. In addition, the multi-instance learning strategy of conventional methods cannot be easily applied to neural network models. Therefore, (Zeng et al., 2015) combines at-least-one multi-instance learning with a neural network model to extract relations from distant supervision data. However, they assume that only one sentence is active for each entity pair, and hence their method loses a large amount of the rich information contained in the neglected sentences. Different from their method, we propose sentence-level attention over multiple instances, which can utilize all informative sentences.

Attention-based models have attracted a lot of interest from researchers recently. The selectivity of attention-based models allows them to learn alignments between different modalities. They have been applied to various areas such as image classification (Mnih et al., 2014), speech recognition (Chorowski et al., 2014), image caption generation (Xu et al., 2015) and machine translation (Bahdanau et al., 2014). To the best of our knowledge, this is the first effort to adopt an attention-based model for distant supervised relation extraction.
3 Methodology
Given a set of sentences $\{x_1, x_2, \cdots, x_n\}$ and two corresponding entities, our model measures the probability of each relation r. In this section, we introduce our model in two main parts:
• Sentence Encoder. Given a sentence x and two target entities, a convolutional neural network (CNN) is used to construct a distributed representation $\mathbf{x}$ of the sentence.
• Selective Attention over Instances. When the distributed vector representations of all sentences are learnt, we use sentence-level attention to select the sentences which really express the corresponding relation.
3.1 Sentence Encoder
[Figure 2: The architecture of the CNN/PCNN used for the sentence encoder. The example sentence is "Bill_Gates is the founder of Microsoft."; its vector representation (word and position embeddings) is passed through a convolution layer, max pooling, and a non-linear layer.]
As shown in Fig. 2, we transform the sentence x into its distributed representation $\mathbf{x}$ by a CNN. First, words in the sentence are transformed into dense real-valued feature vectors. Next, a convolutional layer, a max-pooling layer and a non-linear transformation layer are used to construct the distributed representation of the sentence, i.e., $\mathbf{x}$.
3.1.1 Input Representation
The inputs of the CNN are the raw words of the sentence x. We first transform the words into low-dimensional vectors: each input word is mapped to a vector via a word embedding matrix. In addition, to specify the position of each entity pair, we also use position embeddings for all words in the sentence.
Word Embeddings. Word embeddings aim to transform words into distributed representations which capture the syntactic and semantic meanings of the words. Given a sentence x consisting of m words, $x = \{w_1, w_2, \cdots, w_m\}$, every word $w_i$ is represented by a real-valued vector. Word representations are encoded by column vectors in an embedding matrix $\mathbf{V} \in \mathbb{R}^{d^a \times |V|}$, where V is a fixed-sized vocabulary.
Position Embeddings. In the task of relation extraction, the words close to the target entities are usually informative for determining the relation between the entities. Similar to (Zeng et al., 2014), we use position embeddings specified by entity pairs. They help the CNN keep track of how close each word is to the head or tail entity. A position feature is defined as the combination of the relative distances from the current word to the head and tail entities. For example, in the sentence "Bill_Gates is the founder of Microsoft.", the relative distance from the word "founder" to the head entity Bill_Gates is 3 and to the tail entity Microsoft is 2.

In the example shown in Fig. 2, it is assumed that the dimension $d^a$ of the word embedding is 3 and the dimension $d^b$ of the position embedding is 1. Finally, we concatenate the word embeddings and position embeddings of all words and denote the result as a vector sequence $\mathbf{w} = \{\mathbf{w}_1, \mathbf{w}_2, \cdots, \mathbf{w}_m\}$, where $\mathbf{w}_i \in \mathbb{R}^d$ ($d = d^a + d^b \times 2$).
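To make the input representation concrete, the following is a minimal sketch (in PyTorch, which the paper does not prescribe) of how word and position embeddings might be looked up and concatenated; the dimensions, the distance clipping range, and all names here are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class InputRepresentation(nn.Module):
    """Word embeddings + two position embeddings (relative distance to head / tail entity)."""
    def __init__(self, vocab_size, d_word=50, d_pos=5, max_dist=30):
        super().__init__()
        self.max_dist = max_dist
        self.word_emb = nn.Embedding(vocab_size, d_word)
        # relative distances are clipped and shifted into [0, 2 * max_dist] to index the tables
        self.pos_emb_head = nn.Embedding(2 * max_dist + 1, d_pos)
        self.pos_emb_tail = nn.Embedding(2 * max_dist + 1, d_pos)

    def forward(self, word_ids, head_pos, tail_pos):
        # word_ids: (batch, m) token indices; head_pos / tail_pos: (batch,) entity positions
        m = word_ids.size(1)
        idx = torch.arange(m, device=word_ids.device).unsqueeze(0)                 # (1, m)
        d_head = (idx - head_pos.unsqueeze(1)).clamp(-self.max_dist, self.max_dist) + self.max_dist
        d_tail = (idx - tail_pos.unsqueeze(1)).clamp(-self.max_dist, self.max_dist) + self.max_dist
        # concatenate word and both position embeddings: (batch, m, d_word + 2 * d_pos)
        return torch.cat([self.word_emb(word_ids),
                          self.pos_emb_head(d_head),
                          self.pos_emb_tail(d_tail)], dim=-1)
```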
3.1.2 Convolution, Max-pooling and Non-linear Layers
In relation extraction, the main challenges are that the length of the sentences is variable and that the important information can appear in any part of a sentence. Hence, we should utilize all local features and perform the relation prediction globally. Here, we use a convolutional layer to merge all these features. The convolutional layer first extracts local features with a sliding window of length l over the sentence. In the example shown in Fig. 2, we assume that the length of the sliding window l is 3. Then, it combines all local features via a max-pooling operation to obtain a fixed-sized vector for the input sentence.
Here, convolution is defined as an operation between a vector sequence $\mathbf{w}$ and a convolution matrix $\mathbf{W} \in \mathbb{R}^{d^c \times (l \times d)}$, where $d^c$ is the sentence embedding size. Let us define the vector $\mathbf{q}_i \in \mathbb{R}^{l \times d}$ as the concatenation of the word embeddings within the i-th window:

$$\mathbf{q}_i = \mathbf{w}_{i-l+1:i} \quad (1 \le i \le m + l - 1). \quad (1)$$

Since the window may be outside of the sentence boundaries when it slides near a boundary, we set special padding tokens for the sentence; that is, we regard all out-of-range input vectors $\mathbf{w}_i$ ($i < 1$ or $i > m$) as zero vectors.

Hence, the i-th filter of the convolutional layer is computed as:

$$\mathbf{p}_i = [\mathbf{W}\mathbf{q} + \mathbf{b}]_i, \quad (2)$$

where $\mathbf{b}$ is a bias vector. The i-th element of the vector $\mathbf{x} \in \mathbb{R}^{d^c}$ is then:

$$[\mathbf{x}]_i = \max(\mathbf{p}_i). \quad (3)$$
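As a rough illustration of Eqs. (1)-(3), the window convolution followed by max pooling over time can be expressed with a one-dimensional convolution. This is a sketch under the assumption of zero padding at the sentence boundaries; hyperparameters such as the 230-dimensional sentence embedding are illustrative choices, not prescribed here.

```python
import torch
import torch.nn as nn

class CNNEncoder(nn.Module):
    """Convolution over the word/position sequence, then max pooling and tanh (Eqs. 1-3)."""
    def __init__(self, d_in, d_c=230, window=3):
        super().__init__()
        # padding = window - 1 keeps every window position 1 <= i <= m + l - 1
        self.conv = nn.Conv1d(d_in, d_c, kernel_size=window, padding=window - 1)

    def forward(self, w):
        # w: (batch, m, d_in) concatenated word + position embeddings
        p = self.conv(w.transpose(1, 2))    # (batch, d_c, m + l - 1) filter responses
        x = p.max(dim=2).values             # max over all window positions (Eq. 3)
        return torch.tanh(x)                # (batch, d_c) sentence representation
```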
Further, PCNN (Zeng et al., 2015), which is a variation of the CNN, adopts piecewise max pooling for relation extraction. Each convolutional filter output $\mathbf{p}_i$ is divided into three segments $(\mathbf{p}_{i1}, \mathbf{p}_{i2}, \mathbf{p}_{i3})$ by the head and tail entities, and the max-pooling procedure is performed over the three segments separately:

$$[\mathbf{x}]_{ij} = \max(\mathbf{p}_{ij}), \quad (4)$$

and $[\mathbf{x}]_i$ is set as the concatenation of the $[\mathbf{x}]_{ij}$.

Finally, we apply a non-linear function at the output, such as the hyperbolic tangent.
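The piecewise max pooling of Eq. (4) could be sketched as follows for a single sentence, assuming the head and tail entity positions split the convolution output into three segments; the boundary handling is an illustrative choice, not necessarily the paper's exact implementation.

```python
import torch

def piecewise_max_pool(p, head_pos, tail_pos):
    """p: (d_c, length) convolution output for one sentence.
    Pools each filter over the three segments defined by the entity positions (Eq. 4)."""
    left, right = sorted((head_pos, tail_pos))
    segments = [p[:, :left + 1], p[:, left + 1:right + 1], p[:, right + 1:]]
    pooled = [seg.max(dim=1).values if seg.size(1) > 0 else p.new_zeros(p.size(0))
              for seg in segments]
    # concatenate the three pooled pieces and apply the non-linearity: (3 * d_c,)
    return torch.tanh(torch.cat(pooled))
```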
3.2 Selective Attention over Instances
Suppose there is a set S containing n sentences for the entity pair (head, tail), i.e., $S = \{x_1, x_2, \cdots, x_n\}$.

To exploit the information of all sentences, our model represents the set S with a real-valued vector $\mathbf{s}$ when predicting relation r. It is straightforward that the representation of the set S depends on the representations of all its sentences, $\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_n$. Each sentence representation $\mathbf{x}_i$ contains information about whether the entity pair (head, tail) holds relation r for the input sentence $x_i$.
The set vector $\mathbf{s}$ is then computed as a weighted sum of the sentence vectors $\mathbf{x}_i$:

$$\mathbf{s} = \sum_i \alpha_i \mathbf{x}_i, \quad (5)$$

where $\alpha_i$ is the weight of each sentence vector $\mathbf{x}_i$. In this paper, we define $\alpha_i$ in two ways:
Average: We assume that all sentences in the set S contribute equally to the representation of the set. That is, the embedding of the set S is the average of all the sentence vectors:

$$\mathbf{s} = \sum_i \frac{1}{n} \mathbf{x}_i. \quad (6)$$

This is a naive baseline for our selective attention.
Selective Attention: However, the wrong labelling problem inevitably occurs. Thus, if we regard each sentence equally, the wrongly labelled sentences will bring in massive noise during training and testing. Hence, we use selective attention to de-emphasize the noisy sentences, and $\alpha_i$ is further defined as:

$$\alpha_i = \frac{\exp(e_i)}{\sum_k \exp(e_k)}, \quad (7)$$

where $e_i$ is a query-based function which scores how well the input sentence $x_i$ and the predicted relation r match. We select the bilinear form, which achieves the best performance among different alternatives:

$$e_i = \mathbf{x}_i \mathbf{A} \mathbf{r}, \quad (8)$$

where $\mathbf{A}$ is a weighted diagonal matrix, and $\mathbf{r}$ is the query vector associated with relation r, which indicates the representation of relation r.
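A minimal sketch of the selective attention of Eqs. (5), (7) and (8) is given below; the diagonal parameterization of $\mathbf{A}$ and the processing of one sentence set at a time are assumptions made for illustration, not the authors' exact code.

```python
import torch
import torch.nn as nn

class SelectiveAttention(nn.Module):
    """Bilinear attention over the sentence representations of one entity pair (Eqs. 5, 7, 8)."""
    def __init__(self, d_c, n_rel):
        super().__init__()
        self.A_diag = nn.Parameter(torch.ones(d_c))   # diagonal of the weighted matrix A
        self.rel_query = nn.Embedding(n_rel, d_c)     # query vector r for each relation

    def forward(self, X, rel_id):
        # X: (n, d_c) representations of the n sentences in the set
        # rel_id: 0-dim LongTensor holding the relation index
        r = self.rel_query(rel_id)                    # (d_c,)
        e = X @ (self.A_diag * r)                     # e_i = x_i A r, shape (n,)  (Eq. 8)
        alpha = torch.softmax(e, dim=0)               # attention weights          (Eq. 7)
        s = alpha @ X                                 # weighted sum of sentences  (Eq. 5)
        return s, alpha
```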
Finally, we define the conditional probability $p(r|S, \theta)$ through a softmax layer as follows:

$$p(r|S, \theta) = \frac{\exp(o_r)}{\sum_{k=1}^{n_r} \exp(o_k)}, \quad (9)$$

where $n_r$ is the total number of relations and $\mathbf{o}$ is the final output of the neural network, which corresponds to the scores associated with all relation types and is defined as follows:

$$\mathbf{o} = \mathbf{M}\mathbf{s} + \mathbf{d}, \quad (10)$$

where $\mathbf{d} \in \mathbb{R}^{n_r}$ is a bias vector and $\mathbf{M}$ is the representation matrix of relations.
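Eqs. (9)-(10) then amount to a linear layer followed by a softmax over relation types; the sketch below uses illustrative dimensions (230-dimensional set vectors, 53 relations) and hypothetical names.

```python
import torch
import torch.nn as nn

d_c, n_rel = 230, 53
M = nn.Linear(d_c, n_rel)             # o = M s + d (Eq. 10): weight is M, bias is d

def relation_probabilities(s):
    o = M(s)                          # scores for all relation types
    return torch.softmax(o, dim=-1)   # p(r | S, theta) (Eq. 9)
```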
(Zeng et al., 2015) follows the assumption that at least one mention of the entity pair will reflect their relation, and only uses the sentence with the highest probability in each set for training. Hence, the method they adopt for multi-instance learning can be regarded as a special case of our selective attention in which the weight of the sentence with the highest probability is set to 1 and the others to 0.
3.3 Optimization and Implementation Details
Here we introduce the learning and optimization details of our model. We define the objective function using cross-entropy at the set level as follows:

$$J(\theta) = \sum_{i=1}^{s} \log p(r_i | S_i, \theta), \quad (11)$$

where s indicates the number of sentence sets and $\theta$ indicates all parameters of our model. To solve the optimization problem, we adopt stochastic gradient descent (SGD) to optimize the objective function. For learning, we iterate by randomly selecting a mini-batch from the training set until convergence.

In the implementation, we employ dropout (Srivastava et al., 2014) on the output layer to prevent overfitting. The dropout layer is defined as an element-wise multiplication with a vector $\mathbf{h}$ of Bernoulli random variables with probability p. Equation (10) is then rewritten as:

$$\mathbf{o} = \mathbf{M}(\mathbf{s} \circ \mathbf{h}) + \mathbf{d}. \quad (12)$$
In the test phase, the learnt set representations are scaled by p, i.e., $\hat{\mathbf{s}}_i = p\,\mathbf{s}_i$, and the scaled set vector $\hat{\mathbf{s}}_i$ is finally used to predict relations.
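Putting the pieces together, a hedged sketch of one training step with the set-level cross-entropy of Eq. (11) and dropout on the set vector (Eq. 12) might look as follows. The model methods (encode, attention, dropout, output) refer to the earlier hypothetical sketches, and the use of PyTorch's inverted dropout, which removes the explicit test-time scaling by p, is an implementation assumption rather than the paper's description.

```python
import torch
import torch.nn as nn

def train_step(model, optimizer, batch):
    """batch: list of (sentence_tensors, head_pos, tail_pos, relation_id) tuples,
    one tuple per entity-pair sentence set; relation_id is a 0-dim LongTensor."""
    model.train()
    optimizer.zero_grad()
    loss = 0.0
    for sentences, head, tail, rel in batch:
        X = model.encode(sentences, head, tail)      # (n, d_c) sentence representations
        s, _ = model.attention(X, rel)               # set vector weighted by selective attention
        s = model.dropout(s)                         # Eq. (12); nn.Dropout rescales during training
        o = model.output(s)                          # relation scores, Eq. (10)
        # set-level cross-entropy term of Eq. (11)
        loss = loss + nn.functional.cross_entropy(o.unsqueeze(0), rel.view(1))
    loss.backward()
    optimizer.step()                                 # SGD update over the mini-batch
    return loss.item()
```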
4 Experiments
Our experiments are intended to demonstrate that our neural models with sentence-level selective attention can alleviate the wrong labelling problem and take full advantage of informative sentences for distant supervised relation extraction. To this end, we first introduce the dataset and evaluation metrics used in the experiments. Next, we use cross-validation to determine the parameters of our model. We then evaluate the effects of our selective attention and show its performance on data with different set sizes. Finally, we compare the performance of our method with several state-of-the-art feature-based methods.
4.1 Dataset and Evaluation Metrics
We evaluate our model on a widely used dataset^1 which was developed by (Riedel et al., 2010) and has also been used by (Hoffmann et al., 2011; Surdeanu et al., 2012). This dataset was generated by aligning Freebase relations with the New York Times (NYT) corpus. Entity mentions are found using the Stanford named entity tagger (Finkel et al., 2005), and are further matched to the names of Freebase entities. The Freebase relations are divided into two parts, one for training and one for testing. Sentences aligned from the 2005-2006 portion of the corpus are regarded as training instances, and the testing instances are the aligned sentences from 2007. There are 53 possible relations, including a special relation NA which indicates that there is no relation between the head and tail entities. The training data contains 522,611 sentences, 281,270 entity pairs and 18,252 relational facts. The testing set contains 172,448 sentences, 96,678 entity pairs and 1,950 relational facts.

Similar to previous work (Mintz et al., 2009), we evaluate our model with the held-out evaluation. It evaluates our model by comparing the relation

^1 http://iesl.cs.umass.edu/riedel/ecml/
