What are the future works mentioned in the paper "Query-free news search" ?

It would be interesting future work to refine and improve upon the filtering technique presented in this paper. It would also be interesting to experiment with different ways of using the history for query generation.

How many topics have at least one article rated relevant?

In summary, roughly 70% of the topics have at least one article rated relevant, and almost as many have at least one article rated very relevant (R+).

How does the algorithm achieve a precision of 91%?

The best algorithm achieves a precision of 91% on one data set and 84% on a second data set and finds a relevant article for at least 70% of the topics in the data sets.

What is the algorithm for finding news articles on the web?

For this genre of television show, the best algorithm finds a relevant page every 16-20 seconds on average, achieves a precision of 84-91%, and finds a relevant article for about 70% of the topics.

What are some examples of stories that are relevant to the topic?

Examples include a story about a beauty pageant for women in Lithuania’s prisons, a story about a new invention that uses recycled water from showers and baths to flush toilets, and a story about garbage trucks giving English lessons over loudspeakers in Singapore.

How could the topic finding and query generation algorithms be applied to conversations?

As voice recognition systems improve, the same kind of topic finding and query generation algorithms described in this paper could be applied to conversations, providing relevant information immediately upon demand.

What would the IR system use to find the documents?

The system would derive queries from the passages of text that were marked, and search over a local corpus for relevant documents to present to the user.

Why did the authors restrict the search to articles published on the day of the broadcast?

Because the authors want to retrieve articles that are about the current news item, the authors restricted the search to articles published on the day of the broadcast or the day before.

(Open Access) Query-free news search (2003) | Monika Henzinger

Q: What is the approach to finding articles that are related to a stream of text?

Their approach to finding articles that are related to a stream of text is to create queries based on the text and to issue the queries to a search engine.

Q: What is the approach to finding news articles on the web?

Their approach is to extract queries from the ongoing stream of closed captions, issue the queries in real time to a news search engine on the web, and postprocess the top results to determine the news articles that the authors show to the user.

Q: How many articles can be returned for the CNN data set?

The authors return the top two articles for each query, so that a maximum of 514 relevant articles could be returned for this data set when a query is issued every 15 seconds.

Query-Free News Search

Monika Henzinger

Google Inc.

2400 Bayshore Parkway

Mountain View, CA 94043

USA

monika@google.com

Bay-Wei Chang

Google Inc.

2400 Bayshore Parkway

Mountain View, CA 94043

USA

bay@google.com

Brian Milch

UC Berkeley

Computer Science Division

Berkeley, CA 94720-1776

USA

milch@cs.berkeley.edu

Sergey Brin

Google Inc.

2400 Bayshore Parkway

Mountain View, CA 94043

USA

sergey@google.com

ABSTRACT

Many daily activities present information in the form of a stream of text, and often people can benefit from additional information on the topic discussed. TV broadcast news can be treated as one such stream of text; in this paper we discuss finding news articles on the web that are relevant to news currently being broadcast.

We evaluated a variety of algorithms for this problem, looking at the impact of inverse document frequency, stemming, compounds, history, and query length on the relevance and coverage of news articles returned in real time during a broadcast. We also evaluated several postprocessing techniques for improving the precision, including reranking using additional terms, reranking by document similarity, and filtering on document similarity. For the best algorithm, 84%-91% of the articles found were relevant, with at least 64% of the articles being on the exact topic of the broadcast. In addition, a relevant article was found for at least 70% of the topics.

Categories and Subject Descriptors

H.3.3 [Information Systems]: Information Search and Retrieval;

H.3.5 [Information Systems]: Online Information Services

General Terms

Algorithms, experimentation

Keywords

Web information retrieval, query-free search

1. INTRODUCTION

Many daily activities present information using a written or spoken stream of words: television, radio, telephone calls, meetings, face-to-face conversations with others. Often people can benefit from additional information about the topics that are being discussed. Supplementing television broadcasts is particularly attractive because of the passive nature of TV watching. Interaction is severely constrained, usually limited to just changing the channel; there is no way to more finely direct what kind of information will be presented.

WWW2003, May 20–24, 2003, Budapest, Hungary.

ACM 1-58113-680-3/03/0005.

Indeed, several companies have explored suggesting web pages

to viewers as they watch TV. For example, the Intercast system, de-

veloped by Intel, allows entire HTML pages to be broadcast in un-

used portions of the TV signal. A user watching TV on a computer

with a compatible TV tuner card can then view these pages, even

without an Internet connection. NBC transmitted pages via Inter-

cast during their coverage of the 1996 Summer Olympics. The In-

teractive TV Links system, developed by VITAC (a closed caption-

ing company) and WebTV (now a division of Microsoft), broad-

casts URLs in an alternative data channel interleaved with closed

caption data [17, 2]. When a WebTV box detects one of these

URLs, it displays an icon on the screen; if the user chooses to view

the page, the WebTV box fetches it over the Internet.

For both of these systems the producer of a program (or com-

mercial) chooses relevant documents by hand. In fact, the pro-

ducer often creates new documents speciﬁcally to be accessed by

TV viewers. To our knowledge, there has been no previous work

on automatically selecting web pages that a user might want to see

while watching a TV program.

In this paper we study the problem of finding news articles on the web relevant to the ongoing stream of TV broadcast news. We restrict our attention to broadcast news since it is very popular and information-oriented (as opposed to entertainment-oriented).

Our approach is to extract queries from the ongoing stream of

closed captions, issue the queries in real time to a news search en-

gine on the web, and postprocess the top results to determine the

news articles that we show to the user. We evaluated a variety of

algorithms for this problem, looking at the impact of inverse doc-

ument frequency, stemming, compounds, history, and query length

on the relevance and coverage of news articles returned in real time

during a broadcast. We also evaluated several postprocessing tech-

niques for improving the precision, including reranking using ad-

ditional terms, reranking by document similarity, and ﬁltering on

document similarity. The best algorithm achieves a precision of

91% on one data set and 84% on a second data set and ﬁnds a rele-

vant article for at least 70% of the topics in the data sets.

In general, we find that it is more important to concentrate on a good postprocessing step than on a good query generation step. The difference in precision between the best and the worst query generation algorithm is at most 10 percentage points, while our best postprocessing step improves precision by 20 percentage points or more. To reduce the impact of postprocessing on the total number of relevant articles retrieved, we simply increased the number of queries.

To be precise, the best algorithm uses a combination of tech-

niques. Our evaluation indicates that the most important features

for its success are a “history feature” and a postprocessing step that

ﬁlters out irrelevant articles. Many of the other features that we

added to improve the query generation do not seem to have a clearly

beneﬁcial impact on precision. The “history feature” enables the

algorithm to consider all terms since the start of the current topic

when generating a query. It tries to detect when a topic changes

and maintains a data structure that represents all terms in the cur-

rent topic, weighted by age. The ﬁltering step discards articles that

seem too dissimilar to each other or too dissimilar to the current

topic. We also experimented with other postprocessing techniques

but they had only a slight impact on precision.

Our algorithms are basically trying to extract keywords from a

stream of text so that the keywords represent the “current” piece of

the text. Using existing terminology this can be called time-based

keyword extraction. There is a large body of research on topic de-

tection and text summarization. Recently, time-based summariza-

tion has also been studied [1], but to the best of our knowledge

there is no prior work on time-based keyword extraction.

The remainder of this paper is organized as follows: Section 2

describes the different query generation algorithms and the differ-

ent postprocessing steps. Section 3 presents the evaluation. Sec-

tion 4 discusses related work. We conclude in Section 5.

2. OUR APPROACH

Our approach to ﬁnding articles that are related to a stream of

text is to create queries based on the text and to issue the queries to

a search engine. Then we postprocess the answers returned to ﬁnd

the most relevant ones. In our case the text consists of closed cap-

tioning of TV news, and we are looking for relevant news articles

on the web. Thus we issue the queries to a news search engine.

We ﬁrst describe the algorithms we use to create queries and then

the techniques we use for postprocessing the answers.

2.1 Query Generation

We are interested in showing relevant articles at a regular rate during the news broadcast. As a result the query generation algorithm needs to issue a query periodically, i.e., every $s$ seconds. It cannot wait for the end of a topic. We chose $s = 15$ for two reasons: (1) Empirically we determined that showing an article every 10-15 seconds allows the user to read the title and scan the first paragraph. The actual user interface may allow the user to pause and read the current article more thoroughly. (2) A caption text of 15 seconds corresponds to roughly three sentences or roughly 50 words. This should be enough text to generate a well-specified query.

Because postprocessing may eliminate some of the candidate articles, we return two articles for each query. We also tested at a query interval of $s = 7$ seconds, thus allowing up to half of the candidate articles to be discarded while maintaining the same or better coverage as $s = 15$.

The query generation algorithm is given the text segment $T$ since the last query generation. It also keeps information about the previous stream of text. We consider seven different query generation algorithms, described in the following sections. All but the last query generation algorithm issue 2-term queries. A term is either a word or a 2-word compound like New York. Two-term queries are used because experiments on a test set (different from the evaluation set used in this paper) showed that 1-term queries are too vague and return many irrelevant results. On the other hand, roughly half of the time 3-term queries are too specific and do not return any results (because we are requiring all terms to appear in the search results). The last query generation algorithm uses a combination of 3- and 2-term queries to explore whether the 2-term limit hurts performance.

As is common in the IR literature [18], the inverse document frequency ($\mathit{idf}$) of a term is a function of the frequency of the term in the collection and the number of documents in the collection. Specifically, we use the function [illegible]. Since we do not have a large amount of closed caption data available, we used Google’s web collection to compute the $\mathit{idf}$ of the terms. This means the number of documents was over 2 billion, and the term frequency was the frequency of a term in this collection. Unfortunately, there is a difference in word use in written web pages and spoken TV broadcasts. As a result we built a small set of words that are common in captions but rare in the web data. Examples of such words are reporter and analyst. All of the algorithms below ignore the terms on this stopword list.

2.1.1 The baseline algorithm A1-BASE

Our baseline algorithm is a simple $\mathit{tf} \cdot \mathit{idf}$ based algorithm. It weights each term by $\mathit{tf} \cdot \mathit{idf}$, where $\mathit{tf}$ is the frequency of the term in the text segment $T$. This results in larger weights for terms that appear more frequently in $T$, and larger weights for more unusual terms. This is useful since doing a search with the more distinctive terms of the news story is more likely to find articles related to the story. The baseline algorithm returns the two terms with largest weight as the query.
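The baseline weighting can be illustrated with a short Python sketch. It is an illustration under stated assumptions, not the implementation evaluated in this paper: the function names, the `collection_freq` dictionary, and in particular the smoothed logarithmic `idf` are hypothetical choices made for the example. Passing `idf_power=2` squares the idf component, which corresponds to the variant described in the next section.

```python
import math
from collections import Counter


def idf(collection_freq, num_docs):
    # Assumed smoothed log form; the exact idf function of the paper
    # is not reproduced here.
    return math.log(num_docs / (1 + collection_freq))


def top_terms(segment_words, collection_freq, num_docs,
              stopwords=frozenset(), k=2, idf_power=1):
    """Return the k highest-weighted terms of one text segment."""
    # tf: frequency of each non-stopword term in the segment
    tf = Counter(w for w in segment_words if w not in stopwords)
    weights = {
        term: count * idf(collection_freq.get(term, 0), num_docs) ** idf_power
        for term, count in tf.items()
    }
    # Highest weight first; ties broken alphabetically for determinism.
    return sorted(weights, key=lambda t: (-weights[t], t))[:k]


words = "senate vote senate budget the the the reporter".split()
freqs = {"the": 900_000, "senate": 1_000, "vote": 5_000,
         "budget": 3_000, "reporter": 10}
query = top_terms(words, freqs, 1_000_000, stopwords={"reporter"})
```

In this toy collection the very common word "the" receives a near-zero weight despite its high frequency in the segment, and the caption-specific word "reporter" is removed by the stopword list.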

2.1.2 The $\mathit{tf} \cdot \mathit{idf}^2$ algorithm A2-IDF2

This is the same algorithm as the baseline algorithm, but a term is weighted by $\mathit{tf} \cdot \mathit{idf}^2$. The motivation is that rare words, like named entities, are particularly important for issuing focused queries. Thus, the $\mathit{idf}$ component is more important than $\mathit{tf}$.

2.1.3 The simple stemming algorithm A3-STEM

In the previous two algorithms each term is assigned a weight.

Algorithm A3-STEM assigns instead a weight to each stem. The

stem of a word is approximated by taking the ﬁrst 5 letters of the

word. For example, congress and congressional would share the

same stem, congr. The intention is to aggregate the weight of terms

that describe the same entity. We use this simple method of deter-

mining stems instead of a more precise method because our algo-

rithm must be real-time.

For each stem we store all the terms that generated the stem and their weight. The weight of a term is $c \cdot \mathit{tf} \cdot \mathit{idf}^2$, where $c = 1$ if the term was a noun and $c = 0.5$ otherwise. (Nouns are determined using the publicly available Brill tagger [3].) We use this weighting scheme since nouns are often more useful in queries than other parts of speech. The weight of a stem is the sum of the weights of its terms.

To issue a query the algorithm determines the two top-weighted

stems and ﬁnds the top-weighted term for each of these stems.

These two terms form the query.
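The stem-based selection described above can be sketched as follows. This is a minimal illustration that assumes the per-term weights have already been computed; the function names and the input dictionary are hypothetical.

```python
from collections import defaultdict


def stem(word, length=5):
    # Approximate the stem by the first five letters of the word.
    return word[:length]


def stem_query(term_weights, k=2):
    """term_weights maps each term to its weight.

    Groups terms by stem, ranks stems by the sum of their term weights,
    and returns the top-weighted term of each of the k best stems.
    """
    by_stem = defaultdict(dict)
    for term, weight in term_weights.items():
        by_stem[stem(term)][term] = weight
    ranked = sorted(by_stem,
                    key=lambda s: (-sum(by_stem[s].values()), s))[:k]
    return [max(by_stem[s], key=by_stem[s].get) for s in ranked]


weights = {"congress": 3.0, "congressional": 2.0, "budget": 4.0, "vote": 1.0}
query = stem_query(weights)
```

Here "congress" and "congressional" pool their weights under the stem "congr", so that stem outranks "budget" even though "budget" is the single heaviest term.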

2.1.4 The stemming algorithm with compounds, algorithm A4-COMP

Algorithm A4-COMP consists of algorithm A3-STEM extended by two-word compounds. Specifically, we build stems not only for one-word terms, but also for two-word compounds. For this we use a list of allowed compounds compiled from Google’s corpus of web data. Stems are computed by stemming both words in the compound, i.e., the stem for the compound veterans administration is veter-admin. Compounds are considered to be terms and are weighted as before. Queries are issued as for algorithm A3-STEM, i.e., it finds the top-weighted term for the two top-weighted stems. Since a term can now consist of a two-word compound, a query can now in fact consist of two, three, or four words.

2.1.5 The history algorithm A5-HIST

Algorithm A5-HIST is algorithm A4-COMP with a “history feature”. All previous algorithms generated the query terms solely on the basis of the text segment $T$ that was read since the last query generation. Algorithm A5-HIST uses terms from previous text segments to aid in generating a query for the current text segment, the notion being that the context leading up to the current text may contain terms that are still valuable in generating the query.

It does this by keeping a data structure, called the stem vector, which represents the previously seen text, i.e., the history. It combines this information with the information produced by algorithm A4-COMP for the current text segment $T$ and finds the top weighted stems.

To be precise, for each stem the stem vector keeps a weight and

a list of terms that generated the stem, each with its individual

weight. The stem vector keeps the stems of all words that were

seen between the last reset and the current text segment. A reset

simply sets the stem vector to be the empty vector; it occurs when

the topic in a text segment changes substantially from the previous

text segment (see below).

When algorithm A5-HIST receives text segment $T$ it builds a second stem vector for it using algorithm A4-COMP. Then it checks how similar $T$ is to the text represented in the old stem vector by computing a similarity score. To do this we keep a stem vector for each of the last three text segments. (Each text segment consists of the text between two query generations, i.e., it consists of the text of the last $s$ seconds.) We add these vectors and compute the dot-product of this sum with the vector for $T$, only considering the weights of the terms and ignoring the weights of the stems. If the similarity score is above an upper threshold, then $T$ is similar to the earlier text. If the similarity score is above a lower threshold but below the upper threshold, then $T$ is somewhat similar to the earlier text. Otherwise $T$ is dissimilar from the earlier text.

If text segment $T$ is similar to the earlier text, the old stem vector is aged by multiplying every weight by 0.9 and then the two vectors are added. To add the two vectors, both vectors are expanded to have the same stems by suitably adding stems of weight 0. Also the set of terms stored for each stem is expanded to consist of the same set by adding terms of weight 0. Then the two vectors are added by adding the corresponding weights of the stems and of the terms.

If text segment $T$ is very dissimilar from the earlier text, then the old stem vector is reset and is replaced by the new stem vector. To put it another way, when the current text is very different than the previous text, it means that the topic has changed, so previous history should be discarded in deciding what query to issue.

If text segment $T$ is somewhat similar to the earlier text, then the stem vector is not reset, but the weights in the old stem vector are decreased by multiplying them with a weight that decreases with the similarity score. Afterwards the old stem vector and the new stem vector are added. So even though the topic has not completely changed, previous terms are given less weight to allow for topic drift.

We used a test data set (different from the evaluation data sets) to choose values for the upper and the lower threshold in the similarity score calculation. In our implementation, the upper threshold is [illegible] and the lower threshold is [illegible]. When $T$ is somewhat similar, we use the weight multiplier [illegible], which was chosen so that it is at most 0.9, i.e., the weights are more decreased than in the case that $T$ is similar to the earlier text.

In the resulting stem vector the top two terms are found in the

same way as in algorithm A4-COMP.
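The three cases of the history update can be sketched in Python. This is a simplified illustration, not the implementation evaluated here: it flattens the stem vector into a single term-to-weight dictionary, and the names `t_high`, `t_low`, and `drift_multiplier` are hypothetical parameters whose values must be supplied by the caller.

```python
def dot(a, b):
    # Dot product of two sparse term-weight vectors.
    return sum(w * b.get(term, 0.0) for term, w in a.items())


def update_history(history, recent, current, t_high, t_low, drift_multiplier):
    """Merge the current segment vector into the history vector.

    history, current: dict term -> weight.
    recent: vectors of the most recent text segments (the last three are used).
    drift_multiplier: function mapping the similarity score to a decay factor.
    """
    summed = {}
    for vec in recent[-3:]:
        for term, w in vec.items():
            summed[term] = summed.get(term, 0.0) + w
    sim = dot(summed, current)

    if sim > t_high:            # similar: age the old weights by 0.9
        decay = 0.9
    elif sim > t_low:           # somewhat similar: decay more strongly
        decay = drift_multiplier(sim)
    else:                       # dissimilar: reset, keep only the new vector
        return dict(current)

    merged = {term: w * decay for term, w in history.items()}
    for term, w in current.items():
        merged[term] = merged.get(term, 0.0) + w
    return merged


history = {"iraq": 2.0}
recent = [{"iraq": 2.0}]
same_topic = update_history(history, recent, {"iraq": 1.0, "troops": 1.0},
                            t_high=1.0, t_low=0.5,
                            drift_multiplier=lambda sim: 0.5)
new_topic = update_history(history, recent, {"weather": 1.0},
                           t_high=1.0, t_low=0.5,
                           drift_multiplier=lambda sim: 0.5)
```

Terms missing from one vector are treated as having weight 0, which matches the expansion step described above.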

2.1.6 The query shortening algorithm A6-3W

To verify our choice of query length 2 we experimented with a

query shortening algorithm, which issues a multiple term query,

and shortens the query until results are returned from the news

search engine. Earlier experiments showed that reducing the query

to one term hurt precision. Therefore we kept two terms as the

minimum query length. The query shortening algorithm A6-3W is

identical to A5-HIST, but begins with three-term queries, reissuing

the query with the two top-weighted terms if there are no results.
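The shortening loop can be sketched as below. The `search` callable is a hypothetical stand-in for the news search engine, and the function name is chosen for illustration only.

```python
def search_with_shortening(ranked_terms, search, start_terms=3, min_terms=2):
    """Issue the longest query first and drop the lowest-weighted term
    until the search returns results or the minimum length is reached.

    ranked_terms: terms ordered from highest to lowest weight.
    search: callable taking a list of terms and returning a list of results.
    """
    longest = min(start_terms, len(ranked_terms))
    for n in range(longest, min_terms - 1, -1):
        results = search(ranked_terms[:n])
        if results:
            return ranked_terms[:n], results
    return ranked_terms[:min_terms], []


calls = []


def fake_search(terms):
    # Toy engine: three-term queries are too specific and return nothing.
    calls.append(list(terms))
    return ["article"] if len(terms) == 2 else []


used_terms, results = search_with_shortening(
    ["iraq", "troops", "baghdad"], fake_search)
```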

2.1.7 Algorithm A7-IDF

Algorithm A7-IDF is identical to algorithm A5-HIST with $\mathit{idf}^2$ replaced by $\mathit{idf}$. (Note that each increasing algorithm A1-A6 adds one additional feature to the previous. A7-IDF does not fit this pattern; we created it in order to test the specific contribution of $\mathit{idf}^2$ to A5-HIST’s performance.)

2.2 Postprocessing

After generating the search queries we issue them to a news

search engine and retrieve the top at most 15 results. Note that

each result contains exactly one news article. Because we want to

retrieve articles that are about the current news item, we restricted

the search to articles published on the day of the broadcast or the

day before.

We applied several ways of improving upon these search results,

described in the sections below, and then selected the top two re-

sults to show to the user as news articles related to the broadcast

news story.

Since several queries will be issued on the same topic, they may yield similar result sets and many identical or near identical articles may end up being shown to the user. In fact, in the data sets used for the evaluation (see 3.1), queried at both query intervals ($s = 7$ and $s = 15$ seconds), an average of 40% of articles returned would be near-duplicates. Such a large number of duplicates would lead to a poor user experience, so we employed a near-duplicate backoff strategy across all the algorithms. If an article is deemed a near-duplicate of one that has already been presented, the next article in the ranking is selected. If all articles in the result set are exhausted in this manner, the first article in the result set is returned (even though it was deemed a near-duplicate). This reduces the number of repeated highly similar articles to an average of 14% in the evaluation data sets.

To detect duplicates without spending time fetching each article,

we looked at the titles and summaries of the articles returned by the

search engine. We compared these titles and summaries to a cache

of article titles and summaries that have already been displayed

during the broadcast. A similarity metric of more than 20% word

overlap in the title, or more than 30% word overlap in the summary,

was successful in identifying exact matches (e.g., the same article

returned in the results for a different query) and slight variants of

the same article, as is common for news wires to issue as the story

develops over time.
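The backoff strategy and the overlap test can be sketched together. The 20% and 30% thresholds come from the text above; how word overlap is normalized is not stated, so dividing by the size of the smaller word set is an assumption of this sketch, as are the function names and the dictionary layout of an article.

```python
def word_overlap(a, b):
    # Fraction of shared words, relative to the smaller word set (assumed).
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / min(len(words_a), len(words_b))


def is_near_duplicate(candidate, shown,
                      title_threshold=0.20, summary_threshold=0.30):
    """Compare a candidate against the cache of already displayed articles."""
    return any(
        word_overlap(candidate["title"], old["title"]) > title_threshold
        or word_overlap(candidate["summary"], old["summary"]) > summary_threshold
        for old in shown
    )


def pick_article(results, shown):
    """Return the first result that is not a near-duplicate; if every
    result is a near-duplicate, fall back to the first result."""
    for article in results:
        if not is_near_duplicate(article, shown):
            return article
    return results[0] if results else None


shown = [{"title": "Senate passes budget bill",
          "summary": "The senate voted on Tuesday"}]
results = [
    {"title": "Senate passes budget bill update", "summary": "x y z"},
    {"title": "Storm hits coast", "summary": "Heavy rain expected"},
]
chosen = pick_article(results, shown)
```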

The postprocessing steps we used were boosting, similarity rerank-

ing, and ﬁltering.

2.2.1 Boosting

The news search engine gets a two-term query and does not know anything else about the stream of text. The idea behind boosting is to use additional high-weighted terms to select from the search results the most relevant articles. To implement this idea the query generation algorithm returns along with the query associated boost terms and boost values. The boost terms are simply the top five terms found in the same way as the query terms. The boost values are the IDF values of these terms.

The boosting algorithm then reranks the results returned from the search by computing a weight for each result using the boost terms. For a boost term which has IDF $\mathit{idf}$ and occurs $\mathit{tf}$ times in the text summary returned with the result, the weight is incremented by the value [illegible], which is a $\mathit{tf} \cdot \mathit{idf}$-like formula that limits the influence of the $\mathit{tf}$ part to 4. For boost terms in the title, the weight is increased by twice that value. Finally, to favor more recent articles, the weight is divided by $d + 1$, where $d$ is the number of days since the article was published. Since we restrict articles to the current date and the day before, the weight is divided by either 1 or 2. The results are then reordered according to their weight; non-boosted results or ties are kept in their original order.
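The reranking step can be sketched as follows. The saturating function `4 * tf / (tf + 3)` is an assumed stand-in that is bounded by 4; it is not claimed to be the formula used in the paper. The function names and the article dictionary layout are hypothetical, and the sketch only matches single-word boost terms.

```python
def saturating_tf(tf):
    # Assumed bounded form: 0 for tf = 0, 1 for tf = 1, approaches 4.
    return 4.0 * tf / (tf + 3.0)


def boost_score(title, summary, boost_terms, days_old):
    """boost_terms maps each boost term to its idf value."""
    title_words = title.lower().split()
    summary_words = summary.lower().split()
    score = 0.0
    for term, idf in boost_terms.items():
        score += idf * saturating_tf(summary_words.count(term))
        # Occurrences in the title count twice.
        score += 2.0 * idf * saturating_tf(title_words.count(term))
    # Favor recent articles: divide by 1 (today) or 2 (yesterday).
    return score / (days_old + 1)


def rerank(results, boost_terms):
    # sorted() is stable, so ties and non-boosted results keep their order.
    return sorted(
        results,
        key=lambda r: -boost_score(r["title"], r["summary"],
                                   boost_terms, r["days_old"]),
    )


boost_terms = {"senate": 5.0, "budget": 4.0}
results = [
    {"title": "Weather today", "summary": "sunny", "days_old": 0},
    {"title": "Senate budget", "summary": "senate vote on budget", "days_old": 1},
    {"title": "Other", "summary": "nothing", "days_old": 0},
]
reranked = rerank(results, boost_terms)
```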

2.2.2 Similarity reranking

A second way of reranking is to compute for each of the results returned by the search engine its similarity to the text segment $T$ and to rerank the search results according to the similarity score. To implement this idea we built a $\mathit{tf} \cdot \mathit{idf}$-weighted term vector for both the text segment $T$ and the text of the article and compute the normalized cosine similarity score. (The first 500 characters of the article are used.) This step requires first fetching the articles, which can be time-expensive.
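A minimal sketch of similarity reranking is given below. The `idf` dictionary, the default idf for unknown words, the whitespace tokenization, and the function names are assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vector(text, idf, default_idf=1.0):
    # Term frequency in the text times the (assumed given) idf of the term.
    tf = Counter(text.lower().split())
    return {term: count * idf.get(term, default_idf)
            for term, count in tf.items()}


def cosine(a, b):
    # Normalized cosine similarity of two sparse vectors.
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rerank_by_similarity(segment, articles, idf, prefix_chars=500):
    """Order article texts by similarity of their first 500 characters
    to the caption segment."""
    seg_vec = tfidf_vector(segment, idf)
    return sorted(
        articles,
        key=lambda text: -cosine(seg_vec, tfidf_vector(text[:prefix_chars], idf)),
    )


idf = {"senate": 3.0, "budget": 2.0}
segment = "senate votes on budget"
articles = ["sports scores tonight", "senate budget vote delayed"]
ordered = rerank_by_similarity(segment, articles, idf)
```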

2.2.3 Filtering

The idea behind ﬁltering is to discard articles that are very dis-

similar to the caption. Additionally, when the issued query is too

vague, then the top two search results often are very dissimilar. (In-

deed, all the results returned by vague queries are often very differ-

ent from one another.) So whenever we ﬁnd two candidate articles

and they are dissimilar, we suspect a vague query and irrelevant

results. So we discard each of the articles unless it is itself highly

similar to the caption.

We again used the $\mathit{tf} \cdot \mathit{idf}$-weighted term vector for the text segment $T$ and the text of the article and computed the normalized cosine similarity score as in the similarity reranking, above. Whenever the page-$T$ similarity score is below a threshold, the article is discarded (Rule F1). If there are two search results we compute their similarity score and discard the articles if the score is below a second threshold (Rule F2), but allowing each article to be retained if its page-$T$ similarity score is above a third threshold (Rule F3).

We analyzed a test data set (different from the evaluation data sets) to determine appropriate thresholds. In our implementation, the three thresholds are set to [illegible].



3. EVALUATION

To evaluate different algorithms on the same data set the evalu-

ators worked off-line. They were supplied with two browser win-

dows. One browser window contained the article to be evaluated.

The article was annotated with an input box so that the score for

the article could simply be input into the box. The other browser

window contained the part of the closed caption text for which the

article was generated. The evaluators were instructed as follows:

You will be reading a transcript of a television news broadcast.

What you will be evaluating will be the relevance of articles that

we provide periodically during the broadcast. For each displayed

article consider whether the article is relevant to at least one of

the topics being discussed in the newscast for this article. Use the

following scoring system to decide when a article is relevant to a

topic:

0 - if the article is not on the topic

1 - if the article is about the topic in general, but not the exact

story

2 - if the article is about the exact news story that is being

discussed

For example, if the news story is about the results of the presidential

election, then a article about a tax bill in congress would score a

0; a article about the candidates’ stands on the environment would

score a 1; a article about the winner’s victory speech would score

a 2.

Don’t worry if two articles seem very similar, or if you’ve seen

the article previously. Just score them normally. The “current

topic” of the newscast can be any topic discussed since the last

article was seen. So if the article is relevant to any of those topics,

score it as relevant. If the article is not relevant to those recent

topics, but is relevant to a previous segment of the transcript, it is

considered not relevant; give it a 0.

We count an article as “relevant” (R) if it was given a score of 1

or 2 by the human evaluator. We count it as “very relevant” (R+) if

it was given a score of 2.

To compare the algorithms we use precision, i.e., the percent-

age of relevant articles out of all returned articles. Recall is usu-

ally deﬁned as the percentage of returned relevant articles out of

all relevant articles that exist. However, this is very hard to mea-

sure on the web, since it is very difﬁcult to determine all articles on

a given topic. In addition, our algorithms are not designed to re-

turn all relevant documents, but instead a steady stream of relevant

documents. Thus, we deﬁne the relative recall to be the percent-

age of returned relevant articles out of all relevant articles pooled

from all of the query generation algorithms with all postprocessing

variants. We use relative recall instead of the number of relevant

documents to enable comparison over different data sets. Addition-

ally, we measure topic coverage, which is the percentage of topics

(deﬁned below) that have at least one relevant article.
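The three measures can be written down compactly. The sketch below assumes that each returned article carries an evaluator score of 0, 1, or 2; the function names and input layouts are chosen for illustration.

```python
def precision(ratings, min_score=1):
    """Share of returned articles rated at least min_score
    (1 for relevant, 2 for very relevant)."""
    if not ratings:
        return 0.0
    return sum(r >= min_score for r in ratings) / len(ratings)


def relative_recall(num_relevant_returned, pool_size):
    """Relevant articles returned by one algorithm, divided by the pool of
    relevant articles found by all algorithms and postprocessing variants."""
    return num_relevant_returned / pool_size


def topic_coverage(ratings_by_topic):
    """Share of topics with at least one relevant article."""
    covered = sum(any(r >= 1 for r in ratings)
                  for ratings in ratings_by_topic.values())
    return covered / len(ratings_by_topic)


ratings = [0, 1, 2, 2]
p_relevant = precision(ratings)                    # relevant (score 1 or 2)
p_very_relevant = precision(ratings, min_score=2)  # very relevant (score 2)
coverage = topic_coverage({"election": [0, 2], "storm": [0], "sports": []})
```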

To understand the relationship of the different algorithms we

compute their overlap, both in terms of issued queries and in terms

of articles returned. Since ﬁltering is such a powerful technique we

study its effectiveness in more detail.

3.1 Data sets

We evaluated all these approaches using the following two data

sets:

(1) HN: three 30-minute sessions of CNN Headline News, each taken from a different day, and

(2) CNN: one hour of Wolf Blitzer Reports on CNN from one day and 30 minutes from another day.

The Headline News sessions (“HN”) consist of many, relatively short, news stories. The Wolf Blitzer Reports (“CNN”) consist of fewer news stories discussed for longer and in greater depth.

Both data sets contain news stories and meta-text. Meta-text consists of the text between news stories, like “and now to you Tom” or “thank you very much for this report”. For evaluating the performance of our algorithms we manually decomposed the news stories into topics, ignoring all the meta-text. (This manual segmentation is not an input to the algorithms; it was used strictly for evaluation purposes.) Each topic consists of at least 3 sentences on the same theme; we do not count 1-2 sentence long “teasers” for upcoming stories as topics. The shortest topic in our data sets is 10 seconds long, the longest is 426 seconds long. The average length of a topic in the HN data set is 51 seconds and the median is 27 seconds. The topics comprise a total of 4181 seconds (70 mins) out of the 90 mins long caption. In the CNN data set the average topic length is 107 seconds and the median is 49 seconds. The topics comprise a total of 3854 seconds (64 mins).

Table 1: HN data set: Precision and relative recall

                             No postprocessing          Boost + Filter
Technique   Interval (s)     Precision   Rel. recall    Precision   Rel. recall
A1-BASE      7               58%         37%            86%         31%
A2-IDF2      7               58%         37%            87%         31%
A3-STEM      7               64%         32%            88%         29%
A4-COMP      7               64%         32%            88%         28%
A5-HIST      7               64%         36%            91%         30%
A6-THREE     7               72%         33%            89%         28%
A7-IDF       7               61%         38%            89%         31%
A1-BASE     15               63%         20%            91%         17%
A2-IDF2     15               62%         20%            91%         18%
A3-STEM     15               69%         25%            88%         24%
A4-COMP     15               70%         26%            90%         25%
A5-HIST     15               67%         26%            89%         24%
A6-THREE    15               75%         24%            91%         22%
A7-IDF      15               59%         26%            91%         24%

Table 2: CNN data set: Precision and relative recall

                             No postprocessing          Boost + Filter
Technique   Interval (s)     Precision   Rel. recall    Precision   Rel. recall
A1-BASE      7               43%         27%            77%         21%
A2-IDF2      7               46%         27%            75%         18%
A3-STEM      7               43%         23%            76%         18%
A4-COMP      7               44%         23%            76%         17%
A5-HIST      7               55%         32%            84%         23%
A6-THREE     7               60%         30%            86%         23%
A7-IDF       7               52%         25%            82%         23%
A1-BASE     15               48%         17%            83%         14%
A2-IDF2     15               60%         16%            85%         13%
A3-STEM     15               54%         17%            76%         14%
A4-COMP     15               59%         18%            82%         15%
A5-HIST     15               61%         25%            88%         20%
A6-THREE    15               71%         23%            83%         21%
A7-IDF      15               56%         25%            82%         21%

3.2 Evaluation of the Query Generation Algo-

rithms

We first evaluated all the baseline algorithms with two different ways of postprocessing, namely no postprocessing and postprocessing by both boosting and filtering. The CNN data set consists of 3854 seconds, and thus an algorithm that issues a query every 15 seconds issues 257 queries. We return the top two articles for each query so that a maximum of 514 relevant articles could be returned for this data set when $s = 15$. For the HN data set the corresponding number is 557.

The pool of all relevant documents found by any of the algorithms contains 846 documents for the HN data set and 816 for the CNN data set. The relative recall of each algorithm is therefore the number of relevant documents it found divided by the pool size for that data set. Note that at the 15-second query interval no algorithm can return more than 557 (for HN) or 514 (for CNN) relevant articles, so in those cases the maximum possible relative recall would be 557/846 ≈ 66% (HN) or 514/816 ≈ 63% (CNN).
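As a quick check of this arithmetic, the following Python sketch recomputes the ceiling on relative recall from the figures quoted in the text (557 and 846 for HN, 514 and 816 for CNN). The function name `max_relative_recall` is ours, introduced only for illustration.

```python
def max_relative_recall(max_returned: int, pool_size: int) -> float:
    """Upper bound on relative recall when at most `max_returned` articles can be returned."""
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    # An algorithm cannot find more relevant articles than the pool contains.
    return min(max_returned, pool_size) / pool_size


# Figures quoted in the text for the 15-second query interval.
hn_bound = max_relative_recall(557, 846)
cnn_bound = max_relative_recall(514, 816)

print(f"HN:  {hn_bound:.1%}")
print(f"CNN: {cnn_bound:.1%}")
```

Rounded to whole percentages, the two bounds come out at 66% and 63%.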

The pooled relative recall numbers are appropriate for comparing performance among the different algorithms, but not useful as an absolute measure of an algorithm’s recall performance, since no algorithm would be able to achieve 100% relative recall. This is because when a query is issued at a text segment, an algorithm is limited to returning a maximum of two articles. However, pooling usually identifies more than two articles as relevant for a given text segment.
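To make the pooling argument concrete, here is a minimal Python sketch of pooled relative recall for a single text segment. The algorithm names come from the paper, but the article identifiers and relevance judgments are invented for illustration, and `pooled_relative_recall` is a hypothetical helper, not the authors' evaluation code.

```python
from typing import Dict, Set


def pooled_relative_recall(relevant_found: Dict[str, Set[str]]) -> Dict[str, float]:
    """Divide each algorithm's relevant articles by the pool found by any algorithm."""
    pool: Set[str] = set().union(*relevant_found.values())
    if not pool:
        return {name: 0.0 for name in relevant_found}
    return {name: len(found) / len(pool) for name, found in relevant_found.items()}


# Toy data (assumed): relevant articles each algorithm returned for ONE text segment.
# Each algorithm returns at most two articles per query.
relevant_found = {
    "A1-BASE": {"a1", "a2"},
    "A5-HIST": {"a2", "a3"},
    "A7-IDF": {"a4"},
}

recall = pooled_relative_recall(relevant_found)
for name, value in recall.items():
    print(f"{name}: {value:.0%}")
```

In this toy case the pool holds four relevant articles while each algorithm can return at most two, so no algorithm can exceed 50% relative recall on the segment.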

Table 1 presents the precision and relative recall for all the different query generation algorithms for the HN data set. Table 2 shows the corresponding numbers for the CNN data set. They lead to a few observations:

All algorithms perform statistically significantly better with a p-value of

\]/1 /7/78

when postprocessed with boosting and filtering than without postprocessing. Depending on the algorithm, the postprocessing seems to increase the precision by 20-35 percentage points.

For both data sets the highest precision numbers are achieved with postprocessing and a 15-second query interval. However, the largest relative recall is achieved without postprocessing and with a 7-second query interval.

This is no surprise: filtering reduces not only the number of non-relevant articles that are returned, but also the number of relevant ones. The impact of postprocessing on the number of relevant articles that are returned varies greatly between algorithms. The maximum change is 71 articles (A1-BASE with a 7-second query interval on HN), and the minimum change is 10 articles (A3-STEM with a 15-second query interval on HN). Also, reducing the query interval increases the number of queries issued, so one expects the number of returned articles to increase, both relevant and non-relevant. Thus relative recall increases as well.

Precision on the CNN data set is lower than precision on the HN data set. This is somewhat surprising, as longer topics might be expected to lead to higher precision. The reason is that since we issue more queries on the same topic, we reach further down in the result sets to avoid duplicates and end up returning less appropriate articles.

Algorithm A5-HIST with a 7-second query interval and with postprocessing performs well in both precision and relative recall. For the HN data set, it achieves a precision of 91% with 257 relevant articles returned; for the CNN data set, it achieves a precision of 84% with 190 relevant articles returned. This means it returns a relevant article every 16 seconds and every 20 seconds, respectively, on average. The performance of algorithm A6-3W is very similar to that of algorithm A5-HIST. None of the other algorithms achieves precision of at least 90% and relative recall of at least 30%. For example, algorithms A1-BASE and A2-IDF2 with a 15-second query interval have precision 91% on

To determine statistical significance we used the rank-sum test and the t-test. If a p-value is given, it is the p-value of the rank-sum test, as it is more conservative. If no p-value is given, the p-value of the rank-sum test is less than 0.05.
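The per-article intervals quoted for A5-HIST can be reproduced from numbers given in the text: the total topic time of each data set (4181 seconds for HN, 3854 seconds for CNN) divided by the number of relevant articles returned (257 and 190). The Python sketch below is only a consistency check; the function name is ours.

```python
def seconds_per_relevant_article(topic_seconds: int, relevant_returned: int) -> float:
    """Average time between relevant articles over the topic portion of a broadcast."""
    if relevant_returned <= 0:
        raise ValueError("relevant_returned must be positive")
    return topic_seconds / relevant_returned


# Totals and article counts as quoted in the text for A5-HIST with postprocessing.
hn_rate = seconds_per_relevant_article(4181, 257)
cnn_rate = seconds_per_relevant_article(3854, 190)

print(f"HN:  one relevant article every {hn_rate:.1f} s")
print(f"CNN: one relevant article every {cnn_rate:.1f} s")
```

Rounded to whole seconds, these give the 16-second and 20-second figures reported for the two data sets.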

