
Query-Free News Search
Monika Henzinger
Google Inc.
2400 Bayshore Parkway
Mountain View, CA 94043
USA
monika@google.com
Bay-Wei Chang
Google Inc.
2400 Bayshore Parkway
Mountain View, CA 94043
USA
bay@google.com
Brian Milch
UC Berkeley
Computer Science Division
Berkeley, CA 94720-1776
USA
milch@cs.berkeley.edu
Sergey Brin
Google Inc.
2400 Bayshore Parkway
Mountain View, CA 94043
USA
sergey@google.com
ABSTRACT
Many daily activities present information in the form of a stream of
text, and often people can benefit from additional information on
the topic discussed. TV broadcast news can be treated as one such
stream of text; in this paper we discuss finding news articles on the
web that are relevant to news currently being broadcast.
We evaluated a variety of algorithms for this problem, looking at
the impact of inverse document frequency, stemming, compounds,
history, and query length on the relevance and coverage of news
articles returned in real time during a broadcast. We also evaluated
several postprocessing techniques for improving the precision, in-
cluding reranking using additional terms, reranking by document
similarity, and filtering on document similarity. For the best algo-
rithm, 84%-91% of the articles found were relevant, with at least
64% of the articles being on the exact topic of the broadcast. In
addition, a relevant article was found for at least 70% of the topics.
Categories and Subject Descriptors
H.3.3 [Information Systems]: Information Search and Retrieval;
H.3.5 [Information Systems]: Online Information Services
General Terms
Algorithms, experimentation
Keywords
Web information retrieval, query-free search
1. INTRODUCTION
Many daily activities present information using a written or spo-
ken stream of words: television, radio, telephone calls, meetings,
face-to-face conversations with others. Often people can benefit
from additional information about the topics that are being discussed.
Supplementing television broadcasts is particularly attractive
because of the passive nature of TV watching. Interaction is
severely constrained, usually limited to just changing the channel;
there is no way to more finely direct what kind of information will
be presented.
Copyright is held by the author/owner(s).
WWW2003, May 20–24, 2003, Budapest, Hungary.
ACM 1-58113-680-3/03/0005.
Indeed, several companies have explored suggesting web pages
to viewers as they watch TV. For example, the Intercast system, de-
veloped by Intel, allows entire HTML pages to be broadcast in un-
used portions of the TV signal. A user watching TV on a computer
with a compatible TV tuner card can then view these pages, even
without an Internet connection. NBC transmitted pages via Inter-
cast during their coverage of the 1996 Summer Olympics. The In-
teractive TV Links system, developed by VITAC (a closed caption-
ing company) and WebTV (now a division of Microsoft), broad-
casts URLs in an alternative data channel interleaved with closed
caption data [17, 2]. When a WebTV box detects one of these
URLs, it displays an icon on the screen; if the user chooses to view
the page, the WebTV box fetches it over the Internet.
For both of these systems the producer of a program (or com-
mercial) chooses relevant documents by hand. In fact, the pro-
ducer often creates new documents specifically to be accessed by
TV viewers. To our knowledge, there has been no previous work
on automatically selecting web pages that a user might want to see
while watching a TV program.
In this paper we study the problem of finding news articles on
the web relevant to the ongoing stream of TV broadcast news. We
restrict our attention to broadcast news since it is very popular and
information-oriented (as opposed to entertainment-oriented).
Our approach is to extract queries from the ongoing stream of
closed captions, issue the queries in real time to a news search en-
gine on the web, and postprocess the top results to determine the
news articles that we show to the user. We evaluated a variety of
algorithms for this problem, looking at the impact of inverse doc-
ument frequency, stemming, compounds, history, and query length
on the relevance and coverage of news articles returned in real time
during a broadcast. We also evaluated several postprocessing tech-
niques for improving the precision, including reranking using ad-
ditional terms, reranking by document similarity, and filtering on
document similarity. The best algorithm achieves a precision of
91% on one data set and 84% on a second data set and finds a rele-
vant article for at least 70% of the topics in the data sets.
In general, we find that it is more important to concentrate on
a good postprocessing step than on a good query generation step.
The difference in precision between the best and the worst query
generation
algorithm is at most 10 percentage points, while our best
postprocessing step improves precision by 20 percentage points or
more. To reduce the impact of postprocessing on the total number
of relevant articles retrieved, we simply increased the number of
queries.
To be precise, the best algorithm uses a combination of tech-
niques. Our evaluation indicates that the most important features
for its success are a “history feature” and a postprocessing step that
filters out irrelevant articles. Many of the other features that we
added to improve the query generation do not seem to have a clearly
beneficial impact on precision. The “history feature” enables the
algorithm to consider all terms since the start of the current topic
when generating a query. It tries to detect when a topic changes
and maintains a data structure that represents all terms in the cur-
rent topic, weighted by age. The filtering step discards articles that
seem too dissimilar to each other or too dissimilar to the current
topic. We also experimented with other postprocessing techniques
but they had only a slight impact on precision.
Our algorithms are basically trying to extract keywords from a
stream of text so that the keywords represent the “current” piece of
the text. Using existing terminology this can be called time-based
keyword extraction. There is a large body of research on topic de-
tection and text summarization. Recently, time-based summariza-
tion has also been studied [1], but to the best of our knowledge
there is no prior work on time-based keyword extraction.
The remainder of this paper is organized as follows: Section 2
describes the different query generation algorithms and the differ-
ent postprocessing steps. Section 3 presents the evaluation. Sec-
tion 4 discusses related work. We conclude in Section 5.
2. OUR APPROACH
Our approach to finding articles that are related to a stream of
text is to create queries based on the text and to issue the queries to
a search engine. Then we postprocess the answers returned to find
the most relevant ones. In our case the text consists of closed cap-
tioning of TV news, and we are looking for relevant news articles
on the web. Thus we issue the queries to a news search engine.
We first describe the algorithms we use to create queries and then
the techniques we use for postprocessing the answers.
2.1 Query Generation
We are interested in showing relevant articles at a regular rate
during the news broadcast. As a result the query generation
algorithm needs to issue a query periodically, i.e., every $s$
seconds. It cannot wait for the end of a topic. We chose $s = 15$
for two reasons: (1) Empirically we determined that showing an
article every
10-15 seconds allows the user to read the title and scan the first
paragraph. The actual user interface may allow the user to pause
and read the current article more thoroughly. (2) A caption text
of 15 seconds corresponds to roughly three sentences or roughly
50 words. This should be enough text to generate a well-specified
query.
Because postprocessing may eliminate some of the candidate
articles, we return two articles for each query. We also tested at
a query interval of $s = 7$, thus allowing up to half of the
candidate articles to be discarded while maintaining the same or
better coverage as $s = 15$.
The query generation algorithm is given the text segment $T$ since
the last query generation. It also keeps information about the
previous stream of text. We consider seven different query generation
algorithms, described in the following sections. All but the last
query generation algorithm issue 2-term queries. A term is either a
word or a 2-word compound like New York. Two-term queries are
used because experiments on a test set (different from the evalua-
tion set used in this paper) showed that 1-term queries are too vague
and return many irrelevant results. On the other hand, roughly half
of the time 3-term queries are too specific and do not return any
results (because we are requiring all terms to appear in the search
results). The last query generation algorithm uses a combination
of 3- and 2-term queries to explore whether the 2-term limit hurts
performance.
As is common in the IR literature [18] the inverse document
frequency $idf$ of a term is a function of the frequency $f$ of the
term in the collection and the number $N$ of documents in the
collection. Specifically, we use the function $\log(N/(f+1))$. Since
we do not have a large amount of closed caption data available, we
used Google’s web collection to compute the $idf$ of the terms. This
means $N$ was over 2 billion, and $f$ was the frequency of a term in
this collection. Unfortunately, there is a difference in word use in
written web pages and spoken TV broadcasts. As a result we built a
small set of words that are common in captions but rare in the web
data. Examples of such words are reporter and analyst. All of the
algorithms below ignore the terms on this stopword list.
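The idf computation and the caption stopword list can be sketched as follows. This is a minimal illustration, not the authors' code: the smoothed form log(N / (f + 1)) is taken as an assumption, the function names are hypothetical, and the stopword set contains only the two example words named in the text.

```python
import math

# Hypothetical stopword list; the text names only these two examples.
CAPTION_STOPWORDS = {"reporter", "analyst"}


def idf(collection_freq: int, num_docs: int) -> float:
    """Inverse document frequency, assuming the form log(N / (f + 1))."""
    return math.log(num_docs / (collection_freq + 1))


def usable_terms(words):
    """Drop caption-specific stopwords before any term weighting."""
    return [w for w in words if w.lower() not in CAPTION_STOPWORDS]
```

Rarer terms receive a larger value, so a term seen in 9 documents out of 1000 outweighs one seen in 99.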
2.1.1 The baseline algorithm A1-BASE
Our baseline algorithm is a simple $tf \cdot idf$ based algorithm. It
weights each term by $tf \cdot idf$, where $tf$ is the frequency of
the term in the text segment $T$. This results in larger weights for
terms that appear more frequently in $T$, and larger weights for more
unusual
terms. This is useful since doing a search with the more distinctive
terms of the news story is more likely to find articles related to the
story. The baseline algorithm returns the two terms with largest
weight as the query.
2.1.2 The $tf \cdot idf^2$ algorithm A2-IDF2
This is the same algorithm as the baseline algorithm, but a term is
weighted by $tf \cdot idf^2$. The motivation is that rare words, like
named entities, are particularly important for issuing focussed
queries. Thus, the $idf$ component is more important than $tf$.
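The two weightings described so far differ only in the exponent on the idf component, which the following sketch makes explicit. The function name, the `idf_power` parameter, and the dictionary of idf values are assumptions made for illustration; ties are broken alphabetically only to keep the output deterministic.

```python
from collections import Counter


def tfidf_query(segment_words, idf, idf_power=1, k=2):
    """Return the k terms of a text segment with the largest tf * idf**p weight.

    idf_power=1 corresponds to the baseline weighting, idf_power=2 to the
    idf-squared variant. `idf` maps each term to a precomputed idf value.
    """
    tf = Counter(segment_words)
    weights = {
        term: count * idf.get(term, 0.0) ** idf_power
        for term, count in tf.items()
    }
    ranked = sorted(weights, key=lambda term: (-weights[term], term))
    return ranked[:k]


words = ["senate", "tax", "tax", "the", "the", "the"]
idf_values = {"senate": 8.0, "tax": 5.0, "the": 0.1}
print(tfidf_query(words, idf_values, idf_power=1))
print(tfidf_query(words, idf_values, idf_power=2))
```

In this toy segment the frequent term leads under the baseline weighting, while squaring the idf moves the rarer term to the front.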
2.1.3 The simple stemming algorithm A3-STEM
In the previous two algorithms each term is assigned a weight.
Algorithm A3-STEM assigns instead a weight to each stem. The
stem of a word is approximated by taking the first 5 letters of the
word. For example, congress and congressional would share the
same stem, congr. The intention is to aggregate the weight of terms
that describe the same entity. We use this simple method of deter-
mining stems instead of a more precise method because our algo-
rithm must be real-time.
For each stem we store all the terms that generated the stem and
their weight. The weight of a term is $c \cdot tf \cdot idf^2$, where
$c = $ [illegible] if the term was a noun and $c = $ [illegible]
otherwise. (Nouns are determined
using the publicly available Brill tagger [3].) We use this weight-
ing scheme since nouns are often more useful in queries than other
parts of speech. The weight of a stem is the sum of the weights of
its terms.
To issue a query the algorithm determines the two top-weighted
stems and finds the top-weighted term for each of these stems.
These two terms form the query.
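A minimal sketch of this stem aggregation follows. The noun flags are passed in directly rather than produced by a tagger, and the two values of the part-of-speech factor (1.0 for nouns, `non_noun_factor=0.5` otherwise) are hypothetical placeholders, not values confirmed by the text.

```python
from collections import defaultdict


def stem(word, prefix_len=5):
    """Approximate a stem by the first five letters of the word."""
    return word[:prefix_len].lower()


def stem_query(term_stats, non_noun_factor=0.5):
    """Pick the top-weighted term of each of the two top-weighted stems.

    term_stats maps term -> (tf, idf, is_noun). The factor values are
    assumptions chosen only so that nouns count more than other words.
    """
    stems = defaultdict(dict)
    for term, (tf, idf, is_noun) in term_stats.items():
        c = 1.0 if is_noun else non_noun_factor
        stems[stem(term)][term] = c * tf * idf ** 2
    # A stem's weight is the sum of the weights of its terms.
    top_stems = sorted(stems, key=lambda st: -sum(stems[st].values()))[:2]
    return [max(stems[st], key=stems[st].get) for st in top_stems]


stats = {
    "congress": (1, 3.0, True),
    "congressional": (2, 2.0, False),
    "vote": (1, 3.5, True),
    "said": (3, 1.0, False),
}
print(stem_query(stats))
```

Here "vote" has the largest individual weight, but the stem shared by "congress" and "congressional" has the larger summed weight and is therefore ranked first.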
2.1.4 The stemming algorithm with compounds, algorithm A4-COMP
Algorithm A4-COMP consists of algorithm A3-STEM extended
by two-word compounds. Specifically, we build stems not only for
one-word terms, but also for two-word compounds. For this we
use a list of allowed compounds compiled from Google’s corpus
of web data. Stems are computed by stemming both words in the
compound, i.e., the stem for the compound veterans administration
is veter-admin. Compounds are considered to be terms and are
weighted as before. Queries are issued as for algorithm A3-STEM,
i.e., it finds the top-weighted term for the two top-weighted stems.
Since a term can now consist of a two-word compound, a query
can now in fact consist of two, three, or four words.
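Compound stemming can be illustrated with a short sketch. The helper names and the tiny allowed-compound set are hypothetical, and keeping the individual words as terms alongside a matched compound is an assumption, since the text does not say whether they are dropped.

```python
def stem(word, prefix_len=5):
    """Approximate a stem by the first five letters of the word."""
    return word[:prefix_len].lower()


def compound_stem(compound):
    """Stem both words of a two-word compound and join them with a hyphen."""
    return "-".join(stem(word) for word in compound.split())


def terms_with_compounds(words, allowed_compounds):
    """Return the single words plus every adjacent pair on the allowed list."""
    terms = list(words)
    for first, second in zip(words, words[1:]):
        pair = f"{first} {second}"
        if pair.lower() in allowed_compounds:
            terms.append(pair)
    return terms


print(compound_stem("veterans administration"))
```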
2.1.5 The history algorithm A5-HIST
Algorithm A5-HIST is algorithm A4-COMP with a “history feature”.
All previous algorithms generated the query terms solely on
the basis of the text segment $T$ that was read since the last query
generation. Algorithm A5-HIST uses terms from previous text
segments to aid in generating a query for the current text segment,
the notion being that the context leading up to the current text may
contain terms that are still valuable in generating the query.
It does this by keeping a data structure, called the stem vector,
which represents the previously seen text, i.e., the history. It
combines this information with the information produced by
algorithm A4-COMP for the current text segment $T$ and finds the top
weighted stems.
To be precise, for each stem the stem vector keeps a weight and
a list of terms that generated the stem, each with its individual
weight. The stem vector keeps the stems of all words that were
seen between the last reset and the current text segment. A reset
simply sets the stem vector to be the empty vector; it occurs when
the topic in a text segment changes substantially from the previous
text segment (see below).
When algorithm A5-HIST receives text segment $T$ it builds a
second stem vector for it using algorithm A4-COMP. Then it checks
how similar $T$ is to the text represented in the old stem vector by
computing a similarity score $sim$. To do this we keep a stem vector
for each of the last three text segments. (Each text segment
consists of the text between two query generations, i.e., it consists
of the text of the last query interval of $s$ seconds.) We add these
vectors and compute the dot-product of this sum with the vector for
$T$, only considering the weights of the terms and ignoring the
weights of the stems. If the similarity score is above a threshold
$t_1$, then $T$ is similar to the earlier text. If the similarity
score is above $t_2$ but below $t_1$, then $T$ is somewhat similar to
the earlier text. Otherwise $T$ is dissimilar from the earlier text.
If text segment $T$ is similar to the earlier text, the old stem vector
is aged by multiplying every weight by 0.9 and then the two vectors
are added. To add the two vectors, both vectors are expanded to
have the same stems by suitably adding stems of weight 0. Also
the set of terms stored for each stem is expanded to consist of the
same set by adding terms of weight 0. Then the two vectors are
added by adding the corresponding weights of the stems and of the
terms.
If text segment $T$ is very dissimilar from the earlier text, then
the old stem vector is reset and is replaced by the new stem vector.
To put it another way, when the current text is very different than
the previous text, it means that the topic has changed, so previous
history should be discarded in deciding what query to issue.
If text segment $T$ is somewhat similar to the earlier text, then
the stem vector is not reset, but the weights in the old stem vector
are decreased by multiplying them with a weight that decreases
with the similarity score $sim$. Afterwards the old stem vector and
the new stem vector are added. So even though the topic has not
completely changed, previous terms are given less weight to allow
for topic drift.
We used a test data set (different from the evaluation data sets)
to choose values for the similarity thresholds $t_1$ and $t_2$ in the
$sim$ calculation. In our implementation, $t_1 = $ [illegible]
and $t_2 = $ [illegible]. When $T$ is somewhat
similar, we use the weight multiplier $m = $ [illegible], which
was chosen so that $m < 0.9$, i.e., the weights are more decreased
than in the case that $T$ is similar to the earlier text.
In the resulting stem vector the top two terms are found in the
same way as in algorithm A4-COMP.
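The three-way history update can be sketched as below. For brevity the sketch keeps only term weights (the similarity score ignores stem weights anyway), and every numeric setting in the usage example is hypothetical: the thresholds `t1` and `t2` and the `multiplier` function are placeholders, since only the 0.9 aging factor is taken from the text.

```python
from collections import deque


def dot(u, v):
    """Dot product of two sparse term-weight vectors."""
    return sum(w * v.get(term, 0.0) for term, w in u.items())


def add(u, v, scale_u=1.0):
    """Return scale_u * u + v; missing terms count as weight 0."""
    out = {term: scale_u * w for term, w in u.items()}
    for term, w in v.items():
        out[term] = out.get(term, 0.0) + w
    return out


class History:
    def __init__(self, t1, t2, multiplier):
        self.t1, self.t2, self.multiplier = t1, t2, multiplier
        self.vector = {}               # aged history since the last reset
        self.recent = deque(maxlen=3)  # vectors of the last three segments

    def update(self, segment_vector):
        recent_sum = {}
        for vec in self.recent:
            recent_sum = add(recent_sum, vec)
        sim = dot(recent_sum, segment_vector)
        if sim > self.t1:        # similar: age by 0.9, then add
            self.vector = add(self.vector, segment_vector, scale_u=0.9)
        elif sim > self.t2:      # somewhat similar: age more strongly
            self.vector = add(self.vector, segment_vector,
                              scale_u=self.multiplier(sim))
        else:                    # dissimilar: reset to the new vector
            self.vector = dict(segment_vector)
        self.recent.append(segment_vector)
        return sim


history = History(t1=1.0, t2=0.1, multiplier=lambda sim: 0.5)
history.update({"iraq": 2.0})
history.update({"iraq": 1.0, "war": 1.0})
```

The query terms would then be read off `history.vector` in the same way as from a single segment's vector.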
2.1.6 The query shortening algorithm A6-3W
To verify our choice of query length 2 we experimented with a
query shortening algorithm, which issues a multiple term query,
and shortens the query until results are returned from the news
search engine. Earlier experiments showed that reducing the query
to one term hurt precision. Therefore we kept two terms as the
minimum query length. The query shortening algorithm A6-3W is
identical to A5-HIST, but begins with three-term queries, reissuing
the query with the two top-weighted terms if there are no results.
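The shortening loop can be sketched as follows. The `search` callable stands in for the news search engine and is purely hypothetical; the sketch assumes it returns an empty list when no article contains all query terms.

```python
def search_with_shortening(ranked_terms, search, start_terms=3, min_terms=2):
    """Issue the longest query first and shorten it until results come back.

    ranked_terms: query terms ordered by decreasing weight.
    search: hypothetical callable mapping a list of terms to a result list.
    """
    for n in range(start_terms, min_terms - 1, -1):
        query = ranked_terms[:n]
        results = search(query)
        if results:
            return query, results
    return ranked_terms[:min_terms], []


# Toy stand-in for a search engine that requires all terms to match.
index = {("iraq", "weapons"): ["article-1"]}


def fake_search(query):
    return index.get(tuple(query), [])


print(search_with_shortening(["iraq", "weapons", "inspectors"], fake_search))
```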
2.1.7 Algorithm A7-IDF
Algorithm A7-IDF is identical to algorithm A5-HIST with $idf^2$
replaced by $idf$.
(Note that each increasing algorithm A1-A6 adds one additional
feature to the previous. A7-IDF does not fit this pattern; we created
it in order to test the specific contribution of $idf^2$ to A5-HIST’s
performance.)
2.2 Postprocessing
After generating the search queries we issue them to a news
search engine and retrieve at most the top 15 results. Note that
each result contains exactly one news article. Because we want to
retrieve articles that are about the current news item, we restricted
the search to articles published on the day of the broadcast or the
day before.
We applied several ways of improving upon these search results,
described in the sections below, and then selected the top two re-
sults to show to the user as news articles related to the broadcast
news story.
Since several queries will be issued on the same topic, they may
yield similar result sets and many identical or near identical articles
may end up being shown to the user. In fact, in the data sets used
for the evaluation (see 3.1), queried at both query intervals
$s = 7$ and $s = 15$, an
average of 40% of articles returned would be near-duplicates. Such
a large number of duplicates would lead to a poor user experience,
so we employed a near-duplicate backoff strategy across all the al-
gorithms. If an article is deemed a near-duplicate of one that has
already been presented, the next article in the ranking is selected.
If all articles in the result set are exhausted in this manner, the first
article in the result set is returned (even though it was deemed a
near-duplicate). This reduces the number of repeated highly simi-
lar articles to an average of 14% in the evaluation data sets.
To detect duplicates without spending time fetching each article,
we looked at the titles and summaries of the articles returned by the
search engine. We compared these titles and summaries to a cache
of article titles and summaries that have already been displayed
during the broadcast. A similarity metric of more than 20% word
overlap in the title, or more than 30% word overlap in the summary,
was successful in identifying exact matches (e.g., the same article
returned in the results for a different query) and slight variants of
the same article, which news wires commonly issue as a story
develops over time.
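The duplicate check and the backoff strategy can be sketched as below. The 20% and 30% thresholds come from the text; how word overlap is normalized is not stated, so dividing by the size of the smaller word set is an assumption, as are the function names and the dictionary layout of an article.

```python
def word_overlap(a, b):
    """Fraction of shared words, relative to the shorter text (an assumption)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / min(len(words_a), len(words_b))


def is_near_duplicate(article, shown, title_min=0.20, summary_min=0.30):
    """Compare title and summary against the cache of displayed articles."""
    return any(
        word_overlap(article["title"], seen["title"]) > title_min
        or word_overlap(article["summary"], seen["summary"]) > summary_min
        for seen in shown
    )


def pick_article(results, shown):
    """Backoff: first non-duplicate in rank order, else the first result."""
    for article in results:
        if not is_near_duplicate(article, shown):
            return article
    return results[0] if results else None


shown = [{"title": "Senate passes tax bill",
          "summary": "The senate voted on Tuesday"}]
results = [
    {"title": "Senate passes tax bill after debate",
     "summary": "Lawmakers approved cuts"},
    {"title": "Storm hits Florida coast", "summary": "Heavy rain expected"},
]
print(pick_article(results, shown)["title"])
```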
The postprocessing steps we used were boosting, similarity reranking,
and filtering.

2.2.1 Boosting
The news search engine gets a two-term query and does not know
anything else about the stream of text. The idea behind boosting
is to use additional high-weighted terms to select from the search
results the most relevant articles. To implement this idea the query
generation algorithm returns, along with the query, associated boost
terms and boost values. The boost terms are simply the top five
terms found in the same way as the query terms. The boost values
are the IDF values of these terms.
The boosting algorithm then reranks the results returned from the
search by computing a weight for each result using the boost terms.
For a boost term which has IDF $idf$ and occurs $tf$ times in the
text summary returned with the result, the weight is incremented
by the value [formula illegible], which is a $tf \cdot idf$-like formula
that limits the influence of the $tf$ part to 4. For boost terms in the
title, the weight is increased by twice that value. Finally, to favor
more recent articles, the weight is divided by $d + 1$, where $d$ is the
number of days since the article was published. Since we restrict
articles to the current date and the day before, the weight is divided
by either 1 or 2. The results are then reordered according to their
weight; non-boosted results or ties are kept in their original order.
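A minimal sketch of boost-based reranking follows. The saturating term 4·tf / (tf + 3), which equals 1 for a single occurrence and approaches 4 for many, is an assumed stand-in for the formula in the text; the field names and the `days_old` key are likewise hypothetical.

```python
def boost_weight(result, boost_terms, tf_cap=4.0):
    """Sum idf-weighted, saturating term counts over summary and title.

    boost_terms maps each boost term to its idf value. The saturation
    formula is an assumption that caps the tf influence below tf_cap.
    """
    weight = 0.0
    for term, idf in boost_terms.items():
        for field, factor in (("summary", 1.0), ("title", 2.0)):
            tf = result[field].lower().split().count(term)
            weight += factor * idf * tf_cap * tf / (tf + tf_cap - 1.0)
    # Favor recent articles: divide by 1 (today) or 2 (yesterday).
    return weight / (result["days_old"] + 1)


def rerank(results, boost_terms):
    """Order by decreasing boost weight; sorted() is stable, so ties and
    non-boosted results keep the search engine's original order."""
    return sorted(results, key=lambda r: -boost_weight(r, boost_terms))


results = [
    {"title": "Markets rally", "summary": "stocks rose today", "days_old": 0},
    {"title": "Senate tax vote",
     "summary": "the tax bill passed the senate", "days_old": 1},
    {"title": "Weather update", "summary": "rain", "days_old": 0},
]
boosts = {"tax": 2.0, "senate": 3.0}
print([r["title"] for r in rerank(results, boosts)])
```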
2.2.2 Similarity reranking
A second way of reranking is to compute for each of the results
returned by the search engine its similarity to the text segment $T$
and to rerank the search results according to the similarity score.
To implement this idea we built a $tf \cdot idf$-weighted term vector
for both the text segment $T$ and the text of the article and compute
the normalized cosine similarity score. (The first 500 characters of
the article are used.) This step requires first fetching the
articles, which can be time-expensive.
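Similarity reranking can be sketched with plain dictionaries as sparse vectors. Whitespace tokenization, the function names, and the toy idf table are assumptions; the 500-character prefix follows the text.

```python
import math
from collections import Counter


def tfidf_vector(text, idf):
    """Build a sparse tf * idf vector from whitespace-separated words."""
    tf = Counter(text.lower().split())
    return {term: count * idf.get(term, 0.0) for term, count in tf.items()}


def cosine(u, v):
    """Normalized cosine similarity of two sparse vectors (0.0 if empty)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def rerank_by_similarity(segment_text, articles, idf, prefix_chars=500):
    """Sort article texts by similarity of their prefix to the segment."""
    segment = tfidf_vector(segment_text, idf)
    return sorted(
        articles,
        key=lambda text: -cosine(segment,
                                 tfidf_vector(text[:prefix_chars], idf)),
    )


idf_table = {"tax": 2.0, "senate": 3.0, "rain": 1.0}
print(rerank_by_similarity("senate tax",
                           ["rain rain", "senate tax vote"], idf_table))
```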
2.2.3 Filtering
The idea behind filtering is to discard articles that are very dis-
similar to the caption. Additionally, when the issued query is too
vague, then the top two search results often are very dissimilar. (In-
deed, all the results returned by vague queries are often very differ-
ent from one another.) So whenever we find two candidate articles
and they are dissimilar, we suspect a vague query and irrelevant
results. So we discard each of the articles unless it is itself highly
similar to the caption.
We again used the $tf \cdot idf$-weighted term vector for the text
segment $T$ and the text of the article and computed the normalized
cosine similarity score as in the similarity reranking, above.
Whenever the page-$T$ similarity score is below a threshold
$\theta_{F1}$ the article is discarded (Rule F1). If there are two
search results we compute their similarity score and discard the
articles if the score is below a threshold $\theta_{F2}$ (Rule F2),
but allowing each article to be retained if its page-$T$ similarity
score is above a threshold $\theta_{F3}$ (Rule F3).
We analyzed a test data set (different from the evaluation data
sets) to determine appropriate thresholds. In our implementation,
$\theta_{F1} = $ [illegible], $\theta_{F3} = $ [illegible], and
$\theta_{F2} = $ [illegible].
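Rules F1 to F3 can be combined as in the following sketch. The function name and argument layout are hypothetical, and the threshold values in the usage example are placeholders chosen only to exercise each rule, not the values used in the implementation described above.

```python
def filter_candidates(page_sims, pair_sim, f1, f2, f3):
    """Return a keep/discard flag for each of at most two candidates.

    page_sims: similarity of each candidate article to the caption text.
    pair_sim: similarity between the two candidates, or None if only one.
    f1, f2, f3: thresholds of Rules F1, F2 and F3.
    """
    # Rule F1: discard articles that are too dissimilar to the caption.
    keep = [sim >= f1 for sim in page_sims]
    # Rule F2: dissimilar candidates suggest a vague query ...
    if len(page_sims) == 2 and pair_sim is not None and pair_sim < f2:
        # ... Rule F3: but an article highly similar to the caption stays.
        keep = [k and sim > f3 for k, sim in zip(keep, page_sims)]
    return keep


# Placeholder thresholds, for illustration only.
print(filter_candidates([0.5, 0.2], 0.1, f1=0.1, f2=0.3, f3=0.35))
```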
3. EVALUATION
To evaluate different algorithms on the same data set the evalu-
ators worked off-line. They were supplied with two browser win-
dows. One browser window contained the article to be evaluated.
The article was annotated with an input box so that the score for
the article could simply be input into the box. The other browser
window contained the part of the closed caption text for which the
article was generated. The evaluators were instructed as follows:
You will be reading a transcript of a television news broadcast.
What you will be evaluating will be the relevance of articles that
we provide periodically during the broadcast. For each displayed
article consider whether the article is relevant to at least one of
the topics being discussed in the newscast for this article. Use the
following scoring system to decide when an article is relevant to a
topic:
• 0 - if the article is not on the topic
• 1 - if the article is about the topic in general, but not the exact
story
• 2 - if the article is about the exact news story that is being
discussed
For example, if the news story is about the results of the presidential
election, then an article about a tax bill in congress would score a
0; an article about the candidates’ stands on the environment would
score a 1; an article about the winner’s victory speech would score
a 2.
Don’t worry if two articles seem very similar, or if you’ve seen
the article previously. Just score them normally. The “current
topic” of the newscast can be any topic discussed since the last
article was seen. So if the article is relevant to any of those topics,
score it as relevant. If the article is not relevant to those recent
topics, but is relevant to a previous segment of the transcript, it is
considered not relevant; give it a 0.
We count an article as “relevant” (R) if it was given a score of 1
or 2 by the human evaluator. We count it as “very relevant” (R+) if
it was given a score of 2.
To compare the algorithms we use precision, i.e., the percent-
age of relevant articles out of all returned articles. Recall is usu-
ally defined as the percentage of returned relevant articles out of
all relevant articles that exist. However, this is very hard to mea-
sure on the web, since it is very difficult to determine all articles on
a given topic. In addition, our algorithms are not designed to re-
turn all relevant documents, but instead a steady stream of relevant
documents. Thus, we define the relative recall to be the percent-
age of returned relevant articles out of all relevant articles pooled
from all of the query generation algorithms with all postprocessing
variants. We use relative recall instead of the number of relevant
documents to enable comparison over different data sets. Addition-
ally, we measure topic coverage, which is the percentage of topics
(defined below) that have at least one relevant article.
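The three measures can be written down directly from their definitions. The function names and input shapes below are assumptions made for illustration: scores are the evaluator scores 0, 1 or 2, and articles are identified by arbitrary hashable ids.

```python
def precision(scores):
    """Share of returned articles scored 1 or 2 by the evaluator."""
    return sum(score >= 1 for score in scores) / len(scores)


def relative_recall(relevant_returned, relevant_pool):
    """Relevant articles returned by one algorithm, relative to the pool of
    relevant articles found by all algorithms and postprocessing variants."""
    return len(relevant_returned & relevant_pool) / len(relevant_pool)


def topic_coverage(relevant_by_topic):
    """Share of topics for which at least one relevant article was returned."""
    covered = sum(1 for articles in relevant_by_topic.values() if articles)
    return covered / len(relevant_by_topic)


print(precision([2, 1, 0, 0]))
print(relative_recall({"a", "b"}, {"a", "b", "c", "d"}))
print(topic_coverage({"t1": ["a"], "t2": [], "t3": ["b", "c"], "t4": []}))
```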
To understand the relationship of the different algorithms we
compute their overlap, both in terms of issued queries and in terms
of articles returned. Since filtering is such a powerful technique we
study its effectiveness in more detail.
3.1 Data sets
We evaluated all these approaches using the following two data
sets:
(1) HN: three 30-minute sessions of CNN Headline News, each
taken from a different day, and
(2) CNN: one hour of Wolf Blitzer Reports on CNN from one
day and 30 minutes from another day.
The Headline News sessions (“HN”) consist of many, relatively
short, news stories. The Wolf Blitzer Reports (“CNN”) consists of
fewer news stories discussed for longer and in greater depth.
Both data sets contain news stories and meta-text. Meta-text con-
sists of the text between news stories, like “and now to you Tom”
or “thank you very much for this report”. For evaluating the perfor-
mance of our algorithms we manually decomposed the news stories
into topics, ignoring all the meta-text. (This manual segmentation
is not an input to the algorithms; it was used strictly for evaluation
purposes.) Each topic consists of at least 3 sentences on the same
theme; we do not count 1-2 sentence long “teasers” for upcoming
stories as topics. The shortest topic in our data sets is 10 seconds
long, the longest is 426 seconds long. The average length of a topic
in the HN data set is 51 seconds and the median is 27 seconds. The
topics comprise a total of 4181 seconds (70 mins) out of the 90
mins long caption. In the CNN data set the average topic length is
107 seconds and the median is 49 seconds. The topics comprise a
total of 3854 seconds (64 mins).

Table 1: HN data set: precision (P) and relative recall (RR).
                           No postprocessing    Boost+Filter
Technique   Interval (s)      P       RR         P       RR
A1-BASE          7           58%     37%        86%     31%
A2-IDF2          7           58%     37%        87%     31%
A3-STEM          7           64%     32%        88%     29%
A4-COMP          7           64%     32%        88%     28%
A5-HIST          7           64%     36%        91%     30%
A6-THREE         7           72%     33%        89%     28%
A7-IDF           7           61%     38%        89%     31%
A1-BASE         15           63%     20%        91%     17%
A2-IDF2         15           62%     20%        91%     18%
A3-STEM         15           69%     25%        88%     24%
A4-COMP         15           70%     26%        90%     25%
A5-HIST         15           67%     26%        89%     24%
A6-THREE        15           75%     24%        91%     22%
A7-IDF          15           59%     26%        91%     24%

Table 2: CNN data set: precision (P) and relative recall (RR).
                           No postprocessing    Boost+Filter
Technique   Interval (s)      P       RR         P       RR
A1-BASE          7           43%     27%        77%     21%
A2-IDF2          7           46%     27%        75%     18%
A3-STEM          7           43%     23%        76%     18%
A4-COMP          7           44%     23%        76%     17%
A5-HIST          7           55%     32%        84%     23%
A6-THREE         7           60%     30%        86%     23%
A7-IDF           7           52%     25%        82%     23%
A1-BASE         15           48%     17%        83%     14%
A2-IDF2         15           60%     16%        85%     13%
A3-STEM         15           54%     17%        76%     14%
A4-COMP         15           59%     18%        82%     15%
A5-HIST         15           61%     25%        88%     20%
A6-THREE        15           71%     23%        83%     21%
A7-IDF          15           56%     25%        82%     21%

3.2 Evaluation of the Query Generation Algo-
rithms
We first evaluated all the baseline algorithms with two differ-
ent ways of postprocessing, namely no postprocessing and postpro-
cessing by both boosting and filtering. The CNN data set consists
of 3854 seconds, and thus an algorithm that issues a query every
15 seconds issues 257 queries. We return the top two articles for
each query so that a maximum of 514 relevant articles could be
returned for this data set when the query interval is $s = 15$.
For the HN data set the
corresponding number is 557.
The pool of all relevant documents found by any of the algo-
rithms for the HN data set is 846, and for the CNN data set is 816.
Thus the relative recall for each algorithm is calculated by divid-
ing the number of relevant documents it found by these numbers.
Note that for a query interval of $s = 15$ no algorithm can return
more than 557 (for HN) or 514 (for CNN) relevant articles, so in
those cases the maximum possible relative recall would be
$557/846 \approx 66\%$ (HN) or $514/816 \approx 63\%$ (CNN).
The pooled relative recall numbers are appropriate for comparing
performance among the different algorithms, but not useful as an
absolute measure of an algorithm’s recall performance, since no
algorithm would be able to achieve 100% relative recall. This is
because when a query is issued at a text segment, an algorithm is
limited to returning a maximum of two articles. However, pooling
usually identifies more than two articles as relevant for a given text
segment.
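The upper bounds quoted above can be checked with a few lines of arithmetic. Rounding the number of queries up is an assumption that reproduces the CNN figure of 257 queries; the HN maximum of 557 is taken as reported rather than recomputed.

```python
import math


def max_articles(topic_seconds, interval, per_query=2):
    """Upper bound on returned articles: one query per interval (rounded up,
    an assumption), with at most `per_query` articles per query."""
    return math.ceil(topic_seconds / interval) * per_query


cnn_max = max_articles(3854, 15)             # 257 queries, 2 articles each
cnn_bound = round(100 * cnn_max / 816)       # pool of 816 relevant articles
hn_bound = round(100 * 557 / 846)            # reported maximum of 557, pool of 846
print(cnn_max, cnn_bound, hn_bound)
```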
Table 1 presents the precision and relative recall for all the differ-
ent query generation algorithms for the HN data set. Table 2 shows
the corresponding numbers for the CNN data set. It leads to a few
observations:
S
All algorithms perform statistically significantly
1
better with
a p-value of
\]/1 /7/78
when postprocessed with boosting and
filtering than without postprocessing. Depending on the al-
gorithm the postprocessing seems to increase the precision
by 20-35 percentage points.
• For both data sets the highest precision numbers are achieved
with postprocessing and s = [...]. However, the largest relative
recall is achieved without postprocessing and s = [...]. This is no
surprise: Filtering reduces not only the number of non-relevant
articles that are returned, but also the number of relevant ones.
The impact of postprocessing on the number of relevant articles that
are returned varies greatly between algorithms. The maximum change
is 71 articles (A1-BASE with s = [...] on HN), and the minimum change
is 10 articles (A3-STEM with s = [...] on HN). Also, reducing s
increases the number of queries issued, and thus one expects the
number of returned articles to increase, both the relevant and the
non-relevant ones. Thus relative recall increases as well.
• Precision on the CNN data set is lower than precision on the
HN data set. This is somewhat surprising as longer topics might be
expected to lead to higher precision. The reason is that since we
issue more queries on the same topic, we reach further down in the
result sets to avoid duplicates and end up returning less
appropriate articles.
• Algorithm A5-HIST with s = [...] and with postprocessing performs
well in both precision and relative recall. For the HN data set, it
achieves a precision of 91% with 257 relevant articles returned; for
the CNN data set it achieves a precision of 84% with 190 relevant
articles returned. This means it returns a relevant article every
16 seconds and every 20 seconds, respectively, on average. The
performance of algorithm A6-3W is very similar to that of algorithm
A5-HIST. None of the other algorithms achieves precision of at least
90% and relative recall of at least 30%. For example, algorithms
A1-BASE and A2-IDF2 with s = [...] have precision 91% on
¹ To determine statistical significance we used the rank-sum test and
the t-test. If a p-value is given, it is the p-value of the rank-sum test,
as it is more conservative. If no p-value is given, the p-value of the
rank-sum test is less than 0.05.
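The rank-sum test mentioned in the footnote can be illustrated with a
small self-contained sketch. It uses the normal approximation without
tie or continuity correction, which is an assumption (the text does not
say which variant was used), and the precision samples below are
hypothetical values, not numbers from Table 1 or Table 2.

```python
import math
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def rank_sum_test(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Two-sided Wilcoxon rank-sum test (normal approximation).

    Returns (z statistic, p-value). Assumes both samples are non-empty.
    """
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])                      # rank sum of the first sample
    mean = n1 * (n1 + n2 + 1) / 2            # expected rank sum under H0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))     # two-sided tail probability
    return z, p


# Hypothetical precision samples with and without postprocessing.
with_post = [0.91, 0.88, 0.86, 0.84, 0.90]
without_post = [0.62, 0.58, 0.66, 0.60, 0.55]
z_stat, p_value = rank_sum_test(with_post, without_post)
```

With these made-up samples every postprocessed value outranks every
unprocessed one, so the resulting p-value falls below the 0.05
threshold named in the footnote.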
Frequently Asked Questions (15)
Q1. What are the future works mentioned in the paper "Query-free news search" ?

It would be interesting future work to refine and improve upon the filtering technique presented in this paper. It would also be interesting to experiment with different ways of using the history for query generation. 

TV broadcast news can be treated as one such stream of text; in this paper the authors discuss finding news articles on the web that are relevant to news currently being broadcast. The authors evaluated a variety of algorithms for this problem, looking at the impact of inverse document frequency, stemming, compounds, history, and query length on the relevance and coverage of news articles returned in real time during a broadcast. For the best algorithm, 84%-91% of the articles found were relevant, with at least 64% of the articles being on the exact topic of the broadcast. 

Their approach to finding articles that are related to a stream of text is to create queries based on the text and to issue the queries to a search engine. 

The last query generation algorithm uses a combination of 3- and 2-term queries to explore whether the 2-term limit hurts performance. 

In summary, roughly 70% of the topics have at least one article rated relevant, and almost as many have at least one article rated very relevant (R+). 

The best algorithm achieves a precision of 91% on one data set and 84% on a second data set and finds a relevant article for at least 70% of the topics in the data sets. 

Their approach is to extract queries from the ongoing stream of closed captions, issue the queries in real time to a news search engine on the web, and postprocess the top results to determine the news articles that the authors show to the user. 

A reset simply sets the stem vector to be the empty vector; it occurs when the topic in a text segment changes substantially from the previous text segment (see below). 

The authors return the top two articles for each query so that a maximum of 514 relevant articles could be returned for this data set when s = 15. 

For this genre of television show, the best algorithm finds a relevant page every 16-20 seconds on average, achieves a precision of 84-91%, and finds a relevant article for about 70% of the topics. 

Examples include a story about a beauty pageant for women in Lithuania’s prisons, a story about a new invention that uses recycled water from showers and baths to flush toilets, and a story about garbage trucks giving English lessons over loudspeakers in Singapore. 

As voice recognition systems improve, the same kind of topic finding and query generation algorithms described in this paper could be applied to conversations, providing relevant information immediately upon demand. 

To verify their choice of query length 2, the authors experimented with a query shortening algorithm, which issues a multiple-term query and shortens the query until results are returned from the news search engine. 
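
The query shortening procedure can be sketched as follows. This is a hedged illustration only: the `search` callable stands in for the news search engine, and dropping the last (assumed least important) term at each step is an assumption, since the text here does not say which term is removed. All names and the toy index are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def shorten_until_results(
    terms: Sequence[str],
    search: Callable[[str], List[str]],
    min_terms: int = 1,
) -> Tuple[List[str], List[str]]:
    """Issue a multi-term query and drop terms until results come back.

    Assumptions: `terms` is ordered from most to least important, and the
    last term is dropped after every query that returns nothing.
    Returns (terms actually used, results), or ([], []) if nothing matched.
    """
    current = list(terms)
    while current and len(current) >= min_terms:
        results = search(" ".join(current))
        if results:
            return current, results
        current.pop()  # shorten the query by one term and retry
    return [], []


def fake_search(query: str) -> List[str]:
    """Hypothetical stand-in for a news search engine."""
    index = {"lithuania prison": ["article-17"]}
    return index.get(query, [])


used_terms, hits = shorten_until_results(
    ["lithuania", "prison", "pageant", "beauty"], fake_search
)
```

In this toy run the four-term and three-term queries return nothing, so the sketch falls back to the two-term query before any article is found.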

The system would derive queries from the passages of text that were marked, and search over a local corpus for relevant documents to present to the user. 

Because the authors want to retrieve articles that are about the current news item, the authors restricted the search to articles published on the day of the broadcast or the day before.