Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Short Papers), pages 333-338, Baltimore, Maryland, USA, June 23-25, 2014.
© 2014 Association for Computational Linguistics
A Bayesian Method to Incorporate Background Knowledge during Automatic Text Summarization

Annie Louis
ILCC, School of Informatics, University of Edinburgh, Edinburgh EH8 9AB, UK
alouis@inf.ed.ac.uk
Abstract

In order to summarize a document, it is often useful to have a background set of documents from the domain to serve as a reference for determining new and important information in the input document. We present a model based on Bayesian surprise which provides an intuitive way to identify surprising information from a summarization input with respect to a background corpus. Specifically, the method quantifies the degree to which pieces of information in the input change one's beliefs about the world represented in the background. We develop systems for generic and update summarization based on this idea. Our method provides competitive content selection performance, with particular advantages in the update task, where systems are given a small and topical background corpus.
1 Introduction

Important facts in a new text are those which deviate from previous knowledge on the topic. When people create summaries, they use their knowledge about the world to decide what content in an input document is informative enough to include in a summary. Understandably, in automatic summarization as well, it is useful to keep a background set of documents to represent general facts and their frequency in the domain.

For example, in the simplest setting of multi-document summarization of news, systems are asked to summarize an input set of topically related news documents to reflect its central content. In this GENERIC task, some of the best reported results were obtained by a system (Conroy et al., 2006) which computed importance scores for words in the input by examining whether a word occurs with significantly higher probability in the input compared to a large background collection of news articles. Other specialized summarization tasks explicitly require the use of background information. In the UPDATE summarization task, a system is given two sets of news documents on the same topic; the second contains articles published later in time. The system should summarize the important updates from the second set, assuming a user has already read the first set of articles.

In this work, we present a Bayesian model for assessing the novelty of a sentence taken from a summarization input with respect to a background corpus of documents.

Our model is based on the idea of Bayesian surprise (Itti and Baldi, 2006). For illustration, assume that a user's background knowledge comprises multiple hypotheses about the current state of the world, and that a probability distribution over these hypotheses indicates his degree of belief in each hypothesis. For example, one hypothesis may be that the political situation in Ukraine is peaceful, another that it is not. A priori, assume the user favors the hypothesis about a peaceful Ukraine, i.e., that hypothesis has higher probability in the prior distribution. Given new data, the evidence can be incorporated using Bayes' rule to compute the posterior distribution over the hypotheses. For example, upon viewing news reports about riots in the country, a user would update his beliefs, and the posterior distribution of the user's knowledge would have a higher probability for a riotous Ukraine. Bayesian surprise is the difference between the prior and posterior distributions over the hypotheses, which quantifies the extent to which the new data (the news report) has changed a user's prior beliefs about the world.

In this work, we exemplify how Bayesian surprise can be used to do content selection for text summarization. Here a user's prior knowledge is approximated by a background corpus, and we show how to identify sentences from the input set which are most surprising with respect to this background. We use the method to do two types of summarization tasks: a) GENERIC news summarization, which uses a large random collection of news articles as the background, and b) UPDATE summarization, where the background is a smaller but specific set of news documents on the same topic as the input set. We find that our method performs competitively with a previous log-likelihood ratio approach which identifies words with significantly higher probability in the input compared to the background. The Bayesian approach is more advantageous in the update task, where the background corpus is smaller in size.
2 Related work

Computing new information is useful in many applications. The TREC novelty tasks (Allan et al., 2003; Soboroff and Harman, 2005; Schiffman, 2005) tested the ability of systems to find novel information in an IR setting. Systems were given a list of documents ranked according to relevance to a query. The goal was to find sentences in each document which were relevant to the query and which, at the same time, constituted new information given the content of documents higher in the relevance list.

For update summarization of news, methods range from textual entailment techniques (Bentivogli et al., 2010), used to find facts in the input which are not entailed by the background, to Bayesian topic models (Delort and Alfonseca, 2012), which aim to learn and use topics discussed only in the background, those only in the update input, and those that overlap across the two sets.

Even for generic summarization, some of the best results were obtained by Conroy et al. (2006) by using a large random corpus of news articles as the background while summarizing a new article, an idea first proposed by Lin and Hovy (2000). Central to this approach is the use of a likelihood ratio test to compute topic words: words that have significantly higher probability in the input compared to the background corpus, and are hence descriptive of the input's topic. In this work, we compare our system to topic-word-based ones, since they are also a general method to find surprising new words in a set of input documents but are not a Bayesian approach. We briefly explain the topic-word-based approach below.
Computing topic words: Let us call the input set I and the background B. The log-likelihood ratio test compares two hypotheses:

H1: The word t is not a topic word and occurs with equal probability in I and B, i.e., p(t|I) = p(t|B) = p.

H2: t is a topic word, hence p(t|I) = p_1 and p(t|B) = p_2, with p_1 > p_2.

A set of documents D containing N tokens is viewed as a sequence of words w_1 w_2 ... w_N. The word in each position i is assumed to be generated by a Bernoulli trial which succeeds when the generated word w_i = t and fails when w_i is not t. Suppose that the probability of success is p. Then the probability of a word t appearing k times in a dataset of N tokens is the binomial probability:

  b(k, N, p) = (N choose k) p^k (1 - p)^(N - k)    (1)

The likelihood ratio compares the likelihood of the data D = {B, I} under the two hypotheses:

  lambda = P(D|H1) / P(D|H2) = b(c_t, N, p) / [ b(c_I, N_I, p_1) * b(c_B, N_B, p_2) ]    (2)

p, p_1 and p_2 are estimated by maximum likelihood: p = c_t / N, where c_t is the number of times word t appears in the total set of tokens comprising {B, I}; p_1 = c_I / N_I and p_2 = c_B / N_B are the probabilities of t estimated only from the input and only from the background, respectively.

A convenient aspect of this approach is that -2 log lambda is asymptotically chi-square distributed. So, for a resulting -2 log lambda value, we can use the chi-square table to find the significance level with which the null hypothesis H1 can be rejected. For example, a value of 10 corresponds to a significance level of 0.001 and is standardly used as the cutoff. Words with -2 log lambda > 10 are considered topic words. Conroy et al. (2006)'s system gives a weight of 1 to the topic words and scores sentences by the number of topic words normalized by sentence length.
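For concreteness, the statistic can be computed directly from the four counts in equation (2). The sketch below is only an illustration of that computation (it is not Conroy et al.'s implementation, and it assumes SciPy is available):

```python
from scipy.stats import binom

def llr_statistic(c_input, n_input, c_back, n_back):
    """-2 log(lambda) for one word type, following eqns. (1) and (2):
    c_input / c_back are the word's counts in input I and background B,
    n_input / n_back are the total token counts of I and B."""
    c_t, n = c_input + c_back, n_input + n_back
    p = c_t / n               # H1: one probability shared by I and B
    p1 = c_input / n_input    # H2: separate probability in the input ...
    p2 = c_back / n_back      # ... and in the background
    log_h1 = binom.logpmf(c_t, n, p)
    log_h2 = binom.logpmf(c_input, n_input, p1) + binom.logpmf(c_back, n_back, p2)
    return -2.0 * (log_h1 - log_h2)

# A word that is proportionally far more frequent in the input than in the
# background; values above 10 (chi-square, p < 0.001) mark topic words.
print(llr_statistic(c_input=40, n_input=5_000, c_back=200, n_back=500_000) > 10)
```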
3 Bayesian Surprise

First we present the formal definition of Bayesian surprise given by Itti and Baldi (2006), without reference to the summarization task.

Let ℋ be the space of all hypotheses representing the background knowledge of a user. The user has a probability P(H) associated with each hypothesis H ∈ ℋ. Let D be a new observation. The posterior probability of a single hypothesis H can be computed as:

  P(H|D) = P(D|H) P(H) / P(D)    (3)

The surprise S(D, ℋ) created by D on the hypothesis space ℋ is defined as the difference between the prior and posterior distributions over the hypotheses, and is computed using KL divergence:

  S(D, ℋ) = KL(P(H|D), P(H))    (4)
          = ∫_ℋ P(H|D) log [ P(H|D) / P(H) ] dH    (5)

Note that since KL divergence is not symmetric, we could also compute KL(P(H), P(H|D)) as the surprise value. In some cases, surprise can be computed analytically, in particular when the prior distribution is conjugate to the form of the hypothesis, so that the posterior has the same functional form as the prior. (See Baldi and Itti (2010) for the surprise computation for different families of probability distributions.)
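As a toy illustration of equations (3)-(5), consider a discrete two-hypothesis version of the Ukraine example from the introduction; the numbers below are invented purely for illustration:

```python
import numpy as np

# Prior over two hypotheses: H1 = "situation is peaceful", H2 = "it is not".
prior = np.array([0.8, 0.2])

# Likelihood of the new observation D (a news report about riots) under each
# hypothesis; these values are made up for the example.
likelihood = np.array([0.05, 0.6])

posterior = likelihood * prior
posterior /= posterior.sum()            # Bayes' rule, eqn. (3)

# Surprise = KL(posterior || prior), the discrete analogue of eqns. (4)-(5).
surprise = np.sum(posterior * np.log(posterior / prior))
print(posterior, surprise)
```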
4 Summarization with Bayesian Surprise

We consider the hypothesis space ℋ as the set of all the hypotheses encoding background knowledge. A single hypothesis about the background takes the form of a multinomial distribution over word unigrams. For example, one multinomial may have higher word probabilities for 'Ukraine' and 'peaceful', while another has higher probabilities for 'Ukraine' and 'riots'. P(H) gives a prior probability to each hypothesis based on the information in the background corpus. In our case, P(H) is a Dirichlet distribution, the conjugate prior for multinomials. Suppose that the vocabulary size of the background corpus is V and we label the word types as (w_1, w_2, ..., w_V). Then,

  P(H) = Dir(alpha_1, alpha_2, ..., alpha_V)    (6)

where alpha_1, ..., alpha_V are the concentration parameters of the Dirichlet distribution (and will be set using the background corpus as explained in Section 4.2).

Now consider a new observation I (a text, sentence, or paragraph from the summarization input) with word counts (c_1, c_2, ..., c_V). Then the posterior over the hypotheses is the Dirichlet:

  P(H|I) = Dir(alpha_1 + c_1, alpha_2 + c_2, ..., alpha_V + c_V)    (7)

The surprise due to observing I, S(I, ℋ), is the KL divergence between the two Dirichlet distributions. (Details about computing the KL divergence between two Dirichlet distributions can be found in Penny (2001) and Baldi and Itti (2010).)
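Because both the prior and the posterior are Dirichlet, the surprise has a closed form. The sketch below is a minimal illustrative implementation of the standard KL divergence between two Dirichlet distributions (following the form given by Penny (2001)); it is not code released with the paper:

```python
import numpy as np
from scipy.special import gammaln, digamma

def kl_dirichlet(alpha_post, alpha_prior):
    """KL( Dir(alpha_post) || Dir(alpha_prior) ) for equal-length parameter
    vectors; used here as the surprise S(I, H) of eqns. (6)-(7)."""
    a = np.asarray(alpha_post, dtype=float)
    b = np.asarray(alpha_prior, dtype=float)
    a0, b0 = a.sum(), b.sum()
    return (gammaln(a0) - gammaln(b0)
            - np.sum(gammaln(a) - gammaln(b))
            + np.sum((a - b) * (digamma(a) - digamma(a0))))

# Example: background counts as the prior (eqn. 6); observing the word w_2
# three times shifts only that dimension of the posterior (eqn. 7).
alpha = np.array([10.0, 5.0, 2.0])
print(kl_dirichlet(alpha + np.array([0.0, 3.0, 0.0]), alpha))
```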
Below we propose a general algorithm for summarization using surprise computation. Then we define the prior distribution P(H) for each of our two tasks, GENERIC and UPDATE summarization.
4.1 Extractive summarization algorithm

We first compute a surprise value for each word type in the summarization input. Word scores are then aggregated to obtain a score for each sentence.

Step 1: Word score. Suppose that word type w_i appears c_i times in the summarization input I. We obtain the posterior distribution after seeing all instances of this word as P(H|w_i) = Dir(alpha_1, alpha_2, ..., alpha_i + c_i, ..., alpha_V). The score for w_i is the surprise, computed as the KL divergence between P(H|w_i) and the prior P(H) (eqn. 6).

Step 2: Sentence score. The composition functions used to obtain sentence scores from word scores can impact content selection performance (Nenkova et al., 2006). We experiment with the sum and the average value of the word scores.¹

Step 3: Sentence selection. The goal is to select a subset of sentences with high surprise values. We follow a greedy approach to optimize the summary surprise by choosing the most surprising sentence, the next most surprising, and so on. At the same time, we aim to avoid redundancy, i.e., selecting sentences with similar content. After a sentence is selected for the summary, the surprise for words from this sentence is set to zero. We recompute the surprise for the remaining sentences using Step 2, and the selection process continues until the summary length limit is reached.
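A compact sketch of Steps 1-3 follows, reusing the kl_dirichlet helper from the sketch above; the tokenization, the helper names and the treatment of the length limit are simplifications for illustration only:

```python
import numpy as np
from collections import Counter

def surprise_scores(vocab, alpha, input_counts):
    """Step 1: per-word surprise KL(P(H|w_i) || P(H)), updating only the
    Dirichlet dimension that corresponds to the word."""
    scores = {}
    for i, w in enumerate(vocab):
        c = input_counts.get(w, 0)
        if c == 0:
            continue
        post = alpha.copy()
        post[i] += c
        scores[w] = kl_dirichlet(post, alpha)
    return scores

def greedy_summary(sentences, vocab, alpha, length_limit=100, how="avg"):
    """Steps 2-3: score sentences from word surprise, pick greedily, and
    zero out the surprise of words already covered by the summary."""
    input_counts = Counter(w for s in sentences for w in s.split())
    word_score = surprise_scores(vocab, alpha, input_counts)
    summary, used_words = [], 0
    candidates = set(range(len(sentences)))
    while candidates and used_words < length_limit:
        def score(idx):
            vals = [word_score.get(w, 0.0) for w in sentences[idx].split()]
            return np.sum(vals) if how == "sum" else np.mean(vals)
        best = max(candidates, key=score)
        summary.append(sentences[best])
        used_words += len(sentences[best].split())
        candidates.discard(best)
        for w in sentences[best].split():   # redundancy control
            word_score[w] = 0.0
    return summary
```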
The key differences between our Bayesian approach and a method such as topic words are: (i) the Bayesian approach keeps multiple hypotheses about the background rather than a single one, and surprise is computed from the changes in the probabilities of all of these hypotheses upon seeing the summarization input; (ii) the computation of topic words is local: it assumes a binomial distribution, and the occurrence of a word is independent of the others. In contrast, word surprise, although computed for each word type separately, quantifies the surprise of incorporating the new counts of this word into the background multinomials.
4.2 Input and background

Here we describe the input sets and background corpus used for the two summarization tasks, and define the prior distribution for each. We use data from the DUC (http://www-nlpir.nist.gov/projects/duc/index.html) and TAC (http://www.nist.gov/tac/) summarization evaluation workshops conducted by NIST.

Generic summarization. We use multi-document inputs from DUC 2004. There were 50 inputs, each containing around 10 documents on a common topic. Each input is also provided with 4 manually written summaries created by NIST assessors. We use these manual summaries for evaluation.

The background corpus is a collection of 5,000 randomly selected articles from the English Gigaword corpus. We use a list of 571 stop words from the SMART IR system (Buckley, 1985), and the remaining content-word vocabulary has 59,497 word types. The count of each word in the background is calculated and used as the alpha parameters of the prior Dirichlet distribution P(H) (eqn. 6).

Update summarization. This task uses data from TAC 2009. An input has two sets of documents, A and B, each containing 10 documents. Both A and B are on the same topic, but the documents in B were published later in time than those in A (the background). There were 44 inputs, and 4 manual update summaries are provided for each.

The prior parameters are the counts of words in A for that input (using the same stoplist). The vocabulary of these A sets is smaller, ranging from 400 to 3,000 words for the different inputs.

In practice, for both tasks, a new summarization input can have words unseen in the background. So new words in an input are added to the background corpus with a count of 1, and the counts of existing words in the background are incremented by 1, before computing the prior parameters. The summary length limit is 100 words in both tasks.

¹ An alternative algorithm could directly compute the surprise of a sentence by incorporating the words from the sentence into the posterior. However, we found this specific method to not work well, probably because the few and unrepeated content words from a sentence did not change the posterior much. In future, we plan to use latent topic models to assign a topic to a sentence so that the counts of all the sentence's words can be aggregated into one dimension.
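As an illustration of the prior construction described in this section (background word counts plus the add-one adjustment covering unseen input words), a minimal sketch might look as follows; the tokenization and stop-word handling are placeholders, not the exact preprocessing used in the paper:

```python
from collections import Counter

def build_prior(background_texts, input_texts, stopwords):
    """Alpha parameters for eqn. (6): background counts, with every word
    (including input words unseen in the background) incremented by 1."""
    back = Counter(w for t in background_texts
                   for w in t.lower().split() if w not in stopwords)
    inp = Counter(w for t in input_texts
                  for w in t.lower().split() if w not in stopwords)
    vocab = sorted(set(back) | set(inp))
    alpha = {w: back.get(w, 0) + 1 for w in vocab}   # add-one adjustment
    return vocab, alpha
```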
5 Systems for comparison

We compare against three types of systems: (i) those which, similarly to surprise, use a background corpus to identify important sentences; (ii) a system that uses information from the input set only and no background; and (iii) systems that combine scores from the input and background.

KL_back: represents a simple baseline for surprise computation from a background corpus. A single unigram probability distribution B is created from the background using maximum likelihood. The summary is created by greedily adding sentences which maximize the KL divergence between B and the current summary. Suppose the set of sentences currently chosen for the summary is S. The next step chooses the sentence s_l = arg max_{s_i} KL({S ∪ s_i} || B).
TS_sum, TS_avg: use topic words computed as described in Section 2, utilizing the same background corpus for the generic and update tasks as the surprise-based methods. For the generic task, we use a critical value of 10 (0.001 significance level) for the chi-square distribution during topic word computation. In the update task, however, the background corpus A is smaller, and for most inputs no words exceeded this cutoff. We lower the significance level to the generally accepted value of 0.05 and take words scoring above this as topic words. The number of topic words is still small (ranging from 1 to 30) for the different inputs.

The TS_sum system selects sentences with greater counts of topic words, and TS_avg computes the number of topic words normalized by sentence length. A greedy selection procedure is used. To reduce redundancy, once a sentence is added, the topic words contained in it are removed from the topic word list before the next sentence selection.
KL_inp: represents the system that does not use background information. Rather, the method creates a summary by optimizing for high similarity of the summary with the input word distribution. Suppose the input unigram distribution is I and the current summary is S; the method chooses the sentence s_l = arg min_{s_i} KL({S ∪ s_i} || I) at each iteration. Since {S ∪ s_i} is used to compute the divergence, redundancy is implicitly controlled in this approach. Such a KL objective was used in competitive systems in the past (Daumé III and Marcu, 2006; Haghighi and Vanderwende, 2009).
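Both KL_back and KL_inp reduce to the same greedy loop over a KL objective, differing only in the reference distribution and in whether the divergence is maximized or minimized. A rough sketch follows; the add-one smoothing is an assumption, since the paper does not say how zero counts are handled:

```python
import numpy as np
from collections import Counter

def kl(p_counts, q_counts, vocab):
    """KL(P || Q) over a fixed vocabulary, with add-one smoothing."""
    p = np.array([p_counts.get(w, 0) + 1 for w in vocab], dtype=float)
    q = np.array([q_counts.get(w, 0) + 1 for w in vocab], dtype=float)
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def kl_greedy(sentences, reference_counts, vocab, maximize, limit=100):
    """KL_back: maximize KL(summary || background).
    KL_inp:  minimize KL(summary || input)   (pass maximize=False)."""
    summary_counts, summary, used = Counter(), [], 0
    remaining = set(range(len(sentences)))
    sign = 1.0 if maximize else -1.0
    while remaining and used < limit:
        def gain(i):
            cand = summary_counts + Counter(sentences[i].split())
            return sign * kl(cand, reference_counts, vocab)
        best = max(remaining, key=gain)
        summary.append(sentences[best])
        summary_counts += Counter(sentences[best].split())
        used += len(sentences[best].split())
        remaining.discard(best)
    return summary
```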
Input + background: These systems combine (i) a score based on the background (KL_back, TS or SR) with (ii) the score based on the input only (KL_inp). For example, to combine TS_sum and KL_inp: for each sentence, we compute its scores under the two methods. We then normalize the two sets of scores for the candidate sentences using z-scores and compute the best sentence as arg max_{s_i} (TS_sum(s_i) - KL_inp(s_i)). Redundancy control is done similarly to the TS-only systems.
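The combination itself is a z-score normalization of the two per-sentence score lists. A minimal sketch, assuming the background-based and input-based scores have already been computed for every candidate sentence:

```python
import numpy as np

def combine(background_scores, input_kl_scores):
    """Combined objective for one input: z-normalize both score lists and
    take the TS/SR score minus the KL_inp score (higher is better), as in
    arg max_i (TS_sum(s_i) - KL_inp(s_i))."""
    b = np.asarray(background_scores, dtype=float)
    k = np.asarray(input_kl_scores, dtype=float)
    zb = (b - b.mean()) / b.std()
    zk = (k - k.mean()) / k.std()
    return zb - zk   # rank candidate sentences by this combined score

scores = combine([3, 0, 1, 5], [0.9, 1.4, 1.1, 0.7])
print(int(np.argmax(scores)))   # index of the best candidate sentence
```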
6 Content selection results

For evaluation, we compare each summary to the four manual summaries using ROUGE (Lin and Hovy, 2003; Lin, 2004). All summaries were truncated to 100 words, stemming was performed, and stop words were not removed, as is standard in TAC evaluations. We report the ROUGE-1 and ROUGE-2 recall scores (averaged over the inputs) for each system. We use the Wilcoxon signed-rank test to check for significant differences in mean scores. Table 1 shows the scores for the generic summaries and Table 2 those for the update task. For each system, the peer systems with significantly better scores (p-value < 0.05) are indicated within parentheses. We refer to the surprise-based summaries as SR_sum and SR_avg depending on the type of composition function (Section 4.1).

Table 1: Evaluation results for generic summaries. Systems in parentheses are significantly better.

System           ROUGE-1                     ROUGE-2
KL_back          0.2276 (TS, SR)             0.0250 (TS, SR)
TS_sum           0.3078                      0.0616
TS_avg           0.2841 (TS_sum, SR_sum)     0.0493 (TS_sum)
SR_sum           0.3120                      0.0580
SR_avg           0.3003                      0.0549
----------------------------------------------------------------
KL_inp           0.3075 (KL_inp+TS_avg)      0.0684
KL_inp+TS_sum    0.3250                      0.0725
KL_inp+TS_avg    0.3410                      0.0795
KL_inp+SR_sum    0.3187 (KL_inp+TS_avg)      0.0660 (KL_inp+TS_avg)
KL_inp+SR_avg    0.3220 (KL_inp+TS_avg)      0.0696
First, consider GENERIC summarization and the systems which use the background corpus only (those above the horizontal line). The KL_back baseline performs significantly worse than the topic word and surprise summaries. Numerically, SR_sum has the highest ROUGE-1 score and TS_sum is best according to ROUGE-2. As per the Wilcoxon test, the TS_sum, SR_sum and SR_avg scores are statistically indistinguishable at the 95% confidence level.

Systems below the horizontal line in Table 1 use an objective which combines both similarity with the input and difference from the background. The first line here shows that a system optimizing only for input similarity, KL_inp, by itself has higher scores (though not significantly so) than those using background information only. This result is not surprising for generic summarization, where all the topical content is present in the input and the background is a non-focused random collection. At the same time, adding either TS or SR scores to KL_inp almost always leads to better results, with KL_inp + TS_avg giving the best score.
In UPDATE summarization, the surprise-based methods have an advantage over the topic word ones. SR_avg is significantly better than TS_avg on both ROUGE-1 and ROUGE-2, and better than TS_sum according to ROUGE-1. In fact, the surprise methods have numerically higher ROUGE-1 scores than input similarity (KL_inp), in contrast to generic summarization.

Table 2: Evaluation results for update summaries. Systems in parentheses are significantly better.

System           ROUGE-1                          ROUGE-2
KL_back          0.2246 (TS, SR)                  0.0213 (TS, SR)
TS_sum           0.3037 (SR_avg)                  0.0563
TS_avg           0.2909 (SR_sum, SR_avg)          0.0477 (SR_sum, SR_avg)
SR_sum           0.3201                           0.0640
SR_avg           0.3226                           0.0639
-------------------------------------------------------------------------
KL_inp           0.3098 (KL_inp+SR_avg)           0.0710
KL_inp+TS_sum    0.3010 (KL_inp+SR_sum, avg)      0.0635
KL_inp+TS_avg    0.3021 (KL_inp+SR_sum, avg)      0.0543 (KL_inp, KL_inp+SR_sum, avg)
KL_inp+SR_sum    0.3292                           0.0721
KL_inp+SR_avg    0.3379                           0.0767

When combined with KL_inp, the surprise methods provide improved results, significantly better in terms of ROUGE-1. The TS methods do not lead to any improvement, and KL_inp + TS_avg is significantly worse than KL_inp alone. The limitation of the TS approach arises from the paucity of topic words that exceed the significance cutoff applied to the log-likelihood ratio. Bayesian surprise, in contrast, is robust on the small background corpus and does not need any tuning of cutoff values depending on the size of the background set.

Note that these models do not perform on par with summarization systems that use multiple indicators of content importance, involve supervised training, and perform sentence compression. Rather, our goal in this work is to demonstrate a simple and intuitive unsupervised model.
7 Conclusion

We have introduced a Bayesian summarization method that strongly aligns with intuitions about how people use existing knowledge to identify important events or content in new observations.

Our method is especially valuable when a system must utilize a small background corpus. While the update task datasets we have used were carefully selected and grouped by NIST assessors into initial and background sets, for systems on the web there is little control over the number of background documents on a particular topic. A system should be able to use smaller amounts of background information and, as new data arrives, be able to incorporate the evidence. Our Bayesian approach is a natural fit in such a setting.
Acknowledgements

The author was supported by a Newton International Fellowship (NF120479) from the Royal Society and the British Academy.
References

- Chin-Yew Lin. ROUGE: A Package for Automatic Evaluation of Summaries.
- Chin-Yew Lin and Eduard Hovy. Automatic evaluation of summaries using N-gram co-occurrence statistics.
- Laurent Itti and Pierre Baldi. Bayesian surprise attracts human attention.
- Luisa Bentivogli et al. The Seventh PASCAL Recognizing Textual Entailment Challenge.
- Aria Haghighi and Lucy Vanderwende. Exploring Content Models for Multi-Document Summarization.

TL;DR: The final model, HierSum, utilizes a hierarchical LDA-style model (Blei et al., 2004) to represent content specificity as a hierarchy of topic vocabulary distributions and yields state-of-the-art ROUGE performance and in pairwise user evaluation strongly outperforms Toutanova et al. (2007)'s state of theart discriminative system.