Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Short Papers), pages 333-338, Baltimore, Maryland, USA, June 23-25, 2014.
© 2014 Association for Computational Linguistics
A Bayesian Method to Incorporate Background Knowledge during Automatic Text Summarization

Annie Louis
ILCC, School of Informatics, University of Edinburgh, Edinburgh EH8 9AB, UK
alouis@inf.ed.ac.uk
Abstract

In order to summarize a document, it is often useful to have a background set of documents from the domain to serve as a reference for determining new and important information in the input document. We present a model based on Bayesian surprise which provides an intuitive way to identify surprising information from a summarization input with respect to a background corpus. Specifically, the method quantifies the degree to which pieces of information in the input change one's beliefs about the world represented in the background. We develop systems for generic and update summarization based on this idea. Our method provides competitive content selection performance, with particular advantages in the update task, where systems are given a small and topical background corpus.
1 Introduction

Important facts in a new text are those which deviate from previous knowledge on the topic. When people create summaries, they use their knowledge about the world to decide what content in an input document is informative enough to include in a summary. Understandably, in automatic summarization as well, it is useful to keep a background set of documents to represent general facts and their frequency in the domain.

For example, in the simplest setting of multi-document summarization of news, systems are asked to summarize an input set of topically related news documents to reflect its central content. In this GENERIC task, some of the best reported results were obtained by a system (Conroy et al., 2006) which computed importance scores for words in the input by examining whether a word occurs with significantly higher probability in the input compared to a large background collection of news articles. Other specialized summarization tasks explicitly require the use of background information. In the UPDATE summarization task, a system is given two sets of news documents on the same topic; the second contains articles published later in time. The system should summarize the important updates from the second set, assuming a user has already read the first set of articles.

In this work, we present a Bayesian model for assessing the novelty of a sentence taken from a summarization input with respect to a background corpus of documents.

Our model is based on the idea of Bayesian surprise (Itti and Baldi, 2006). For illustration, assume that a user's background knowledge comprises multiple hypotheses about the current state of the world, and that a probability distribution over these hypotheses indicates his degree of belief in each hypothesis. For example, one hypothesis may be that the political situation in Ukraine is peaceful, another that it is not. A priori, assume the user favors the hypothesis about a peaceful Ukraine, i.e., that hypothesis has higher probability in the prior distribution. Given new data, the evidence can be incorporated using Bayes' rule to compute the posterior distribution over the hypotheses. For example, upon viewing news reports about riots in the country, a user would update his beliefs, and the posterior distribution of the user's knowledge would have a higher probability for a riotous Ukraine. Bayesian surprise is the difference between the prior and posterior distributions over the hypotheses, which quantifies the extent to which the new data (the news report) has changed a user's prior beliefs about the world.

In this work, we exemplify how Bayesian surprise can be used to do content selection for text summarization. Here a user's prior knowledge is approximated by a background corpus, and we show how to identify sentences from the input set which are most surprising with respect to this background. We use the method to do two types of summarization tasks: a) GENERIC news summarization, which uses a large random collection of news articles as the background, and b) UPDATE summarization, where the background is a smaller but specific set of news documents on the same topic as the input set. We find that our method performs competitively with a previous log-likelihood ratio approach which identifies words with significantly higher probability in the input compared to the background. The Bayesian approach is more advantageous in the update task, where the background corpus is smaller in size.
2 Related work

Computing new information is useful in many applications. The TREC novelty tasks (Allan et al., 2003; Soboroff and Harman, 2005; Schiffman, 2005) tested the ability of systems to find novel information in an IR setting. Systems were given a list of documents ranked according to relevance to a query. The goal was to find sentences in each document which were relevant to the query and which, at the same time, constituted new information given the content of documents higher in the relevance list.

For update summarization of news, methods range from textual entailment techniques (Bentivogli et al., 2010), used to find facts in the input which are not entailed by the background, to Bayesian topic models (Delort and Alfonseca, 2012), which aim to learn and use topics discussed only in the background, those only in the update input, and those that overlap across the two sets.

Even for generic summarization, some of the best results were obtained by Conroy et al. (2006) by using a large random corpus of news articles as the background while summarizing a new article, an idea first proposed by Lin and Hovy (2000). Central to this approach is the use of a likelihood ratio test to compute topic words: words that have significantly higher probability in the input compared to the background corpus, and are hence descriptive of the input's topic. In this work, we compare our system to topic-word-based ones, since they are also a general method to find surprising new words in a set of input documents but are not a Bayesian approach. We briefly explain the topic-word-based approach below.
Computing topic words: Let us call the input set I and the background B. The log-likelihood ratio test compares two hypotheses:

H1: The word t is not a topic word and occurs with equal probability in I and B, i.e., p(t|I) = p(t|B) = p.

H2: t is a topic word, hence p(t|I) = p_1 and p(t|B) = p_2, with p_1 > p_2.

A set of documents D containing N tokens is viewed as a sequence of words w_1 w_2 ... w_N. The word in each position i is assumed to be generated by a Bernoulli trial which succeeds when the generated word w_i = t and fails when w_i is not t. Suppose that the probability of success is p. Then the probability of a word t appearing k times in a dataset of N tokens is the binomial probability:

  b(k, N, p) = (N choose k) p^k (1 - p)^(N - k)    (1)

The likelihood ratio compares the likelihood of the data D = {B, I} under the two hypotheses:

  lambda = P(D|H1) / P(D|H2) = b(c_t, N, p) / [ b(c_I, N_I, p_1) * b(c_B, N_B, p_2) ]    (2)

p, p_1 and p_2 are estimated by maximum likelihood: p = c_t / N, where c_t is the number of times word t appears in the total set of tokens comprising {B, I}; p_1 = c_I / N_I and p_2 = c_B / N_B are the probabilities of t estimated only from the input and only from the background, respectively.

A convenient aspect of this approach is that -2 log lambda is asymptotically chi-square distributed. So, for a resulting -2 log lambda value, we can use the chi-square table to find the significance level with which the null hypothesis H1 can be rejected. For example, a value of 10 corresponds to a significance level of 0.001 and is standardly used as the cutoff. Words with -2 log lambda > 10 are considered topic words. Conroy et al. (2006)'s system gives a weight of 1 to the topic words and scores sentences by the number of topic words normalized by sentence length.
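For concreteness, the statistic can be computed directly from the four counts in equation (2). The sketch below is only an illustration of that computation (it is not Conroy et al.'s implementation, and it assumes SciPy is available):

```python
from scipy.stats import binom

def llr_statistic(c_input, n_input, c_back, n_back):
    """-2 log(lambda) for one word type, following eqns. (1) and (2):
    c_input / c_back are the word's counts in input I and background B,
    n_input / n_back are the total token counts of I and B."""
    c_t, n = c_input + c_back, n_input + n_back
    p = c_t / n               # H1: one probability shared by I and B
    p1 = c_input / n_input    # H2: separate probability in the input ...
    p2 = c_back / n_back      # ... and in the background
    log_h1 = binom.logpmf(c_t, n, p)
    log_h2 = binom.logpmf(c_input, n_input, p1) + binom.logpmf(c_back, n_back, p2)
    return -2.0 * (log_h1 - log_h2)

# A word that is proportionally far more frequent in the input than in the
# background; values above 10 (chi-square, p < 0.001) mark topic words.
print(llr_statistic(c_input=40, n_input=5_000, c_back=200, n_back=500_000) > 10)
```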
3 Bayesian Surprise

First we present the formal definition of Bayesian surprise given by Itti and Baldi (2006), without reference to the summarization task.

Let ℋ be the space of all hypotheses representing the background knowledge of a user. The user has a probability P(H) associated with each hypothesis H ∈ ℋ. Let D be a new observation. The posterior probability of a single hypothesis H can be computed as:

  P(H|D) = P(D|H) P(H) / P(D)    (3)

The surprise S(D, ℋ) created by D on the hypothesis space ℋ is defined as the difference between the prior and posterior distributions over the hypotheses, and is computed using KL divergence:

  S(D, ℋ) = KL(P(H|D), P(H))    (4)
          = ∫_ℋ P(H|D) log [ P(H|D) / P(H) ] dH    (5)

Note that since KL divergence is not symmetric, we could also compute KL(P(H), P(H|D)) as the surprise value. In some cases, surprise can be computed analytically, in particular when the prior distribution is conjugate to the form of the hypothesis, so that the posterior has the same functional form as the prior. (See Baldi and Itti (2010) for the surprise computation for different families of probability distributions.)
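As a toy illustration of equations (3)-(5), consider a discrete two-hypothesis version of the Ukraine example from the introduction; the numbers below are invented purely for illustration:

```python
import numpy as np

# Prior over two hypotheses: H1 = "situation is peaceful", H2 = "it is not".
prior = np.array([0.8, 0.2])

# Likelihood of the new observation D (a news report about riots) under each
# hypothesis; these values are made up for the example.
likelihood = np.array([0.05, 0.6])

posterior = likelihood * prior
posterior /= posterior.sum()            # Bayes' rule, eqn. (3)

# Surprise = KL(posterior || prior), the discrete analogue of eqns. (4)-(5).
surprise = np.sum(posterior * np.log(posterior / prior))
print(posterior, surprise)
```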
4 Summarization with Bayesian Surprise

We consider the hypothesis space ℋ as the set of all the hypotheses encoding background knowledge. A single hypothesis about the background takes the form of a multinomial distribution over word unigrams. For example, one multinomial may have higher word probabilities for 'Ukraine' and 'peaceful', while another has higher probabilities for 'Ukraine' and 'riots'. P(H) gives a prior probability to each hypothesis based on the information in the background corpus. In our case, P(H) is a Dirichlet distribution, the conjugate prior for multinomials. Suppose that the vocabulary size of the background corpus is V and we label the word types as (w_1, w_2, ..., w_V). Then,

  P(H) = Dir(alpha_1, alpha_2, ..., alpha_V)    (6)

where alpha_1, ..., alpha_V are the concentration parameters of the Dirichlet distribution (and will be set using the background corpus as explained in Section 4.2).

Now consider a new observation I (a text, sentence, or paragraph from the summarization input) with word counts (c_1, c_2, ..., c_V). Then the posterior over the hypotheses is the Dirichlet:

  P(H|I) = Dir(alpha_1 + c_1, alpha_2 + c_2, ..., alpha_V + c_V)    (7)

The surprise due to observing I, S(I, ℋ), is the KL divergence between the two Dirichlet distributions. (Details about computing the KL divergence between two Dirichlet distributions can be found in Penny (2001) and Baldi and Itti (2010).)
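Because both the prior and the posterior are Dirichlet, the surprise has a closed form. The sketch below is a minimal illustrative implementation of the standard KL divergence between two Dirichlet distributions (following the form given by Penny (2001)); it is not code released with the paper:

```python
import numpy as np
from scipy.special import gammaln, digamma

def kl_dirichlet(alpha_post, alpha_prior):
    """KL( Dir(alpha_post) || Dir(alpha_prior) ) for equal-length parameter
    vectors; used here as the surprise S(I, H) of eqns. (6)-(7)."""
    a = np.asarray(alpha_post, dtype=float)
    b = np.asarray(alpha_prior, dtype=float)
    a0, b0 = a.sum(), b.sum()
    return (gammaln(a0) - gammaln(b0)
            - np.sum(gammaln(a) - gammaln(b))
            + np.sum((a - b) * (digamma(a) - digamma(a0))))

# Example: background counts as the prior (eqn. 6); observing the word w_2
# three times shifts only that dimension of the posterior (eqn. 7).
alpha = np.array([10.0, 5.0, 2.0])
print(kl_dirichlet(alpha + np.array([0.0, 3.0, 0.0]), alpha))
```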
Below we propose a general algorithm for summarization using surprise computation. Then we define the prior distribution P(H) for each of our two tasks, GENERIC and UPDATE summarization.
4.1 Extractive summarization algorithm

We first compute a surprise value for each word type in the summarization input. Word scores are then aggregated to obtain a score for each sentence.

Step 1: Word score. Suppose that word type w_i appears c_i times in the summarization input I. We obtain the posterior distribution after seeing all instances of this word as P(H|w_i) = Dir(alpha_1, alpha_2, ..., alpha_i + c_i, ..., alpha_V). The score for w_i is the surprise, computed as the KL divergence between P(H|w_i) and the prior P(H) (eqn. 6).

Step 2: Sentence score. The composition functions used to obtain sentence scores from word scores can impact content selection performance (Nenkova et al., 2006). We experiment with the sum and the average value of the word scores.¹

Step 3: Sentence selection. The goal is to select a subset of sentences with high surprise values. We follow a greedy approach to optimize the summary surprise by choosing the most surprising sentence, the next most surprising, and so on. At the same time, we aim to avoid redundancy, i.e., selecting sentences with similar content. After a sentence is selected for the summary, the surprise for words from this sentence is set to zero. We recompute the surprise for the remaining sentences using Step 2, and the selection process continues until the summary length limit is reached.
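A compact sketch of Steps 1-3 follows, reusing the kl_dirichlet helper from the sketch above; the tokenization, the helper names and the treatment of the length limit are simplifications for illustration only:

```python
import numpy as np
from collections import Counter

def surprise_scores(vocab, alpha, input_counts):
    """Step 1: per-word surprise KL(P(H|w_i) || P(H)), updating only the
    Dirichlet dimension that corresponds to the word."""
    scores = {}
    for i, w in enumerate(vocab):
        c = input_counts.get(w, 0)
        if c == 0:
            continue
        post = alpha.copy()
        post[i] += c
        scores[w] = kl_dirichlet(post, alpha)
    return scores

def greedy_summary(sentences, vocab, alpha, length_limit=100, how="avg"):
    """Steps 2-3: score sentences from word surprise, pick greedily, and
    zero out the surprise of words already covered by the summary."""
    input_counts = Counter(w for s in sentences for w in s.split())
    word_score = surprise_scores(vocab, alpha, input_counts)
    summary, used_words = [], 0
    candidates = set(range(len(sentences)))
    while candidates and used_words < length_limit:
        def score(idx):
            vals = [word_score.get(w, 0.0) for w in sentences[idx].split()]
            return np.sum(vals) if how == "sum" else np.mean(vals)
        best = max(candidates, key=score)
        summary.append(sentences[best])
        used_words += len(sentences[best].split())
        candidates.discard(best)
        for w in sentences[best].split():   # redundancy control
            word_score[w] = 0.0
    return summary
```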
The key differences between our Bayesian approach and a method such as topic words are: (i) the Bayesian approach keeps multiple hypotheses about the background rather than a single one, and surprise is computed from the changes in the probabilities of all of these hypotheses upon seeing the summarization input; (ii) the computation of topic words is local: it assumes a binomial distribution, and the occurrence of a word is independent of the others. In contrast, word surprise, although computed for each word type separately, quantifies the surprise of incorporating the new counts of this word into the background multinomials.
4.2 Input and background

Here we describe the input sets and background corpus used for the two summarization tasks, and define the prior distribution for each. We use data from the DUC (http://www-nlpir.nist.gov/projects/duc/index.html) and TAC (http://www.nist.gov/tac/) summarization evaluation workshops conducted by NIST.

Generic summarization. We use multi-document inputs from DUC 2004. There were 50 inputs, each containing around 10 documents on a common topic. Each input is also provided with 4 manually written summaries created by NIST assessors. We use these manual summaries for evaluation.

The background corpus is a collection of 5,000 randomly selected articles from the English Gigaword corpus. We use a list of 571 stop words from the SMART IR system (Buckley, 1985), and the remaining content-word vocabulary has 59,497 word types. The count of each word in the background is calculated and used as the alpha parameters of the prior Dirichlet distribution P(H) (eqn. 6).

Update summarization. This task uses data from TAC 2009. An input has two sets of documents, A and B, each containing 10 documents. Both A and B are on the same topic, but the documents in B were published later in time than those in A (the background). There were 44 inputs, and 4 manual update summaries are provided for each.

The prior parameters are the counts of words in A for that input (using the same stoplist). The vocabulary of these A sets is smaller, ranging from 400 to 3,000 words for the different inputs.

In practice, for both tasks, a new summarization input can have words unseen in the background. So new words in an input are added to the background corpus with a count of 1, and the counts of existing words in the background are incremented by 1, before computing the prior parameters. The summary length limit is 100 words in both tasks.

¹ An alternative algorithm could directly compute the surprise of a sentence by incorporating the words from the sentence into the posterior. However, we found this specific method to not work well, probably because the few and unrepeated content words from a sentence did not change the posterior much. In future, we plan to use latent topic models to assign a topic to a sentence so that the counts of all the sentence's words can be aggregated into one dimension.
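As an illustration of the prior construction described in this section (background word counts plus the add-one adjustment covering unseen input words), a minimal sketch might look as follows; the tokenization and stop-word handling are placeholders, not the exact preprocessing used in the paper:

```python
from collections import Counter

def build_prior(background_texts, input_texts, stopwords):
    """Alpha parameters for eqn. (6): background counts, with every word
    (including input words unseen in the background) incremented by 1."""
    back = Counter(w for t in background_texts
                   for w in t.lower().split() if w not in stopwords)
    inp = Counter(w for t in input_texts
                  for w in t.lower().split() if w not in stopwords)
    vocab = sorted(set(back) | set(inp))
    alpha = {w: back.get(w, 0) + 1 for w in vocab}   # add-one adjustment
    return vocab, alpha
```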
5 Systems for comparison

We compare against three types of systems: (i) those which, similarly to surprise, use a background corpus to identify important sentences; (ii) a system that uses information from the input set only and no background; and (iii) systems that combine scores from the input and background.

KL_back: represents a simple baseline for surprise computation from a background corpus. A single unigram probability distribution B is created from the background using maximum likelihood. The summary is created by greedily adding sentences which maximize the KL divergence between B and the current summary. Suppose the set of sentences currently chosen for the summary is S. The next step chooses the sentence s_l = arg max_{s_i} KL({S ∪ s_i} || B).
TS_sum, TS_avg: use topic words computed as described in Section 2, utilizing the same background corpus for the generic and update tasks as the surprise-based methods. For the generic task, we use a critical value of 10 (0.001 significance level) for the chi-square distribution during topic word computation. In the update task, however, the background corpus A is smaller, and for most inputs no words exceeded this cutoff. We lower the significance level to the generally accepted value of 0.05 and take words scoring above this as topic words. The number of topic words is still small (ranging from 1 to 30) for the different inputs.

The TS_sum system selects sentences with greater counts of topic words, and TS_avg computes the number of topic words normalized by sentence length. A greedy selection procedure is used. To reduce redundancy, once a sentence is added, the topic words contained in it are removed from the topic word list before the next sentence selection.
KL_inp: represents the system that does not use background information. Rather, the method creates a summary by optimizing for high similarity of the summary with the input word distribution. Suppose the input unigram distribution is I and the current summary is S; the method chooses the sentence s_l = arg min_{s_i} KL({S ∪ s_i} || I) at each iteration. Since {S ∪ s_i} is used to compute the divergence, redundancy is implicitly controlled in this approach. Such a KL objective was used in competitive systems in the past (Daumé III and Marcu, 2006; Haghighi and Vanderwende, 2009).
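Both KL_back and KL_inp reduce to the same greedy loop over a KL objective, differing only in the reference distribution and in whether the divergence is maximized or minimized. A rough sketch follows; the add-one smoothing is an assumption, since the paper does not say how zero counts are handled:

```python
import numpy as np
from collections import Counter

def kl(p_counts, q_counts, vocab):
    """KL(P || Q) over a fixed vocabulary, with add-one smoothing."""
    p = np.array([p_counts.get(w, 0) + 1 for w in vocab], dtype=float)
    q = np.array([q_counts.get(w, 0) + 1 for w in vocab], dtype=float)
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def kl_greedy(sentences, reference_counts, vocab, maximize, limit=100):
    """KL_back: maximize KL(summary || background).
    KL_inp:  minimize KL(summary || input)   (pass maximize=False)."""
    summary_counts, summary, used = Counter(), [], 0
    remaining = set(range(len(sentences)))
    sign = 1.0 if maximize else -1.0
    while remaining and used < limit:
        def gain(i):
            cand = summary_counts + Counter(sentences[i].split())
            return sign * kl(cand, reference_counts, vocab)
        best = max(remaining, key=gain)
        summary.append(sentences[best])
        summary_counts += Counter(sentences[best].split())
        used += len(sentences[best].split())
        remaining.discard(best)
    return summary
```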
Input + background: These systems combine (i) a score based on the background (KL_back, TS or SR) with (ii) the score based on the input only (KL_inp). For example, to combine TS_sum and KL_inp: for each sentence, we compute its scores under the two methods. We then normalize the two sets of scores for the candidate sentences using z-scores and compute the best sentence as arg max_{s_i} (TS_sum(s_i) - KL_inp(s_i)). Redundancy control is done similarly to the TS-only systems.
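The combination itself is a z-score normalization of the two per-sentence score lists. A minimal sketch, assuming the background-based and input-based scores have already been computed for every candidate sentence:

```python
import numpy as np

def combine(background_scores, input_kl_scores):
    """Combined objective for one input: z-normalize both score lists and
    take the TS/SR score minus the KL_inp score (higher is better), as in
    arg max_i (TS_sum(s_i) - KL_inp(s_i))."""
    b = np.asarray(background_scores, dtype=float)
    k = np.asarray(input_kl_scores, dtype=float)
    zb = (b - b.mean()) / b.std()
    zk = (k - k.mean()) / k.std()
    return zb - zk   # rank candidate sentences by this combined score

scores = combine([3, 0, 1, 5], [0.9, 1.4, 1.1, 0.7])
print(int(np.argmax(scores)))   # index of the best candidate sentence
```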
6 Content selection results

For evaluation, we compare each summary to the four manual summaries using ROUGE (Lin and Hovy, 2003; Lin, 2004). All summaries were truncated to 100 words, stemming was performed, and stop words were not removed, as is standard in TAC evaluations. We report the ROUGE-1 and ROUGE-2 recall scores (averaged over the inputs) for each system. We use the Wilcoxon signed-rank test to check for significant differences in mean scores. Table 1 shows the scores for the generic summaries and Table 2 those for the update task. For each system, the peer systems with significantly better scores (p-value < 0.05) are indicated within parentheses. We refer to the surprise-based summaries as SR_sum and SR_avg depending on the type of composition function (Section 4.1).

Table 1: Evaluation results for generic summaries. Systems in parentheses are significantly better.

System           ROUGE-1                     ROUGE-2
KL_back          0.2276 (TS, SR)             0.0250 (TS, SR)
TS_sum           0.3078                      0.0616
TS_avg           0.2841 (TS_sum, SR_sum)     0.0493 (TS_sum)
SR_sum           0.3120                      0.0580
SR_avg           0.3003                      0.0549
----------------------------------------------------------------
KL_inp           0.3075 (KL_inp+TS_avg)      0.0684
KL_inp+TS_sum    0.3250                      0.0725
KL_inp+TS_avg    0.3410                      0.0795
KL_inp+SR_sum    0.3187 (KL_inp+TS_avg)      0.0660 (KL_inp+TS_avg)
KL_inp+SR_avg    0.3220 (KL_inp+TS_avg)      0.0696
First, consider GENERIC summarization and the systems which use the background corpus only (those above the horizontal line). The KL_back baseline performs significantly worse than the topic word and surprise summaries. Numerically, SR_sum has the highest ROUGE-1 score and TS_sum is best according to ROUGE-2. As per the Wilcoxon test, the TS_sum, SR_sum and SR_avg scores are statistically indistinguishable at the 95% confidence level.

Systems below the horizontal line in Table 1 use an objective which combines both similarity with the input and difference from the background. The first line here shows that a system optimizing only for input similarity, KL_inp, by itself has higher scores (though not significantly so) than those using background information only. This result is not surprising for generic summarization, where all the topical content is present in the input and the background is a non-focused random collection. At the same time, adding either TS or SR scores to KL_inp almost always leads to better results, with KL_inp + TS_avg giving the best score.
In UPDATE summarization, the surprise-based methods have an advantage over the topic word ones. SR_avg is significantly better than TS_avg on both ROUGE-1 and ROUGE-2, and better than TS_sum according to ROUGE-1. In fact, the surprise methods have numerically higher ROUGE-1 scores than input similarity (KL_inp), in contrast to generic summarization.

Table 2: Evaluation results for update summaries. Systems in parentheses are significantly better.

System           ROUGE-1                          ROUGE-2
KL_back          0.2246 (TS, SR)                  0.0213 (TS, SR)
TS_sum           0.3037 (SR_avg)                  0.0563
TS_avg           0.2909 (SR_sum, SR_avg)          0.0477 (SR_sum, SR_avg)
SR_sum           0.3201                           0.0640
SR_avg           0.3226                           0.0639
-------------------------------------------------------------------------
KL_inp           0.3098 (KL_inp+SR_avg)           0.0710
KL_inp+TS_sum    0.3010 (KL_inp+SR_sum, avg)      0.0635
KL_inp+TS_avg    0.3021 (KL_inp+SR_sum, avg)      0.0543 (KL_inp, KL_inp+SR_sum, avg)
KL_inp+SR_sum    0.3292                           0.0721
KL_inp+SR_avg    0.3379                           0.0767

When combined with KL_inp, the surprise methods provide improved results, significantly better in terms of ROUGE-1. The TS methods do not lead to any improvement, and KL_inp + TS_avg is significantly worse than KL_inp alone. The limitation of the TS approach arises from the paucity of topic words that exceed the significance cutoff applied to the log-likelihood ratio. Bayesian surprise, in contrast, is robust on the small background corpus and does not need any tuning of cutoff values depending on the size of the background set.

Note that these models do not perform on par with summarization systems that use multiple indicators of content importance, involve supervised training, and perform sentence compression. Rather, our goal in this work is to demonstrate a simple and intuitive unsupervised model.
7 Conclusion

We have introduced a Bayesian summarization method that strongly aligns with intuitions about how people use existing knowledge to identify important events or content in new observations.

Our method is especially valuable when a system must utilize a small background corpus. While the update task datasets we have used were carefully selected and grouped by NIST assessors into initial and background sets, for systems on the web there is little control over the number of background documents on a particular topic. A system should be able to use smaller amounts of background information and, as new data arrives, be able to incorporate the evidence. Our Bayesian approach is a natural fit in such a setting.
Acknowledgements

The author was supported by a Newton International Fellowship (NF120479) from the Royal Society and the British Academy.
References

- Chin-Yew Lin. ROUGE: A Package for Automatic Evaluation of Summaries.
- Chin-Yew Lin and Eduard Hovy. Automatic evaluation of summaries using N-gram co-occurrence statistics.
- Laurent Itti and Pierre Baldi. Bayesian surprise attracts human attention.
- Luisa Bentivogli et al. The Seventh PASCAL Recognizing Textual Entailment Challenge.
- Aria Haghighi and Lucy Vanderwende. Exploring Content Models for Multi-Document Summarization.

TL;DR: The final model, HierSum, utilizes a hierarchical LDA-style model (Blei et al., 2004) to represent content specificity as a hierarchy of topic vocabulary distributions and yields state-of-the-art ROUGE performance and in pairwise user evaluation strongly outperforms Toutanova et al. (2007)'s state of theart discriminative system.