Proceedings ArticleDOI

Rank aggregation methods for the Web

TL;DR: A set of techniques for the rank aggregation problem is developed and compared with well-known methods, with the goal of designing rank aggregation techniques that can combat spam in Web searches.
Abstract: We consider the problem of combining ranking results from various sources. In the context of the Web, the main applications include building meta-search engines, combining ranking functions, selecting documents based on multiple criteria, and improving search precision through word associations. We develop a set of techniques for the rank aggregation problem and compare their performance to that of well-known methods. A primary goal of our work is to design rank aggregation techniques that can effectively combat "spam," a serious problem in Web searches. Experiments show that our methods are simple, efficient, and effective.

Summary (5 min read)

1. INTRODUCTION

  • When there is just a single criterion (or "judge") for ranking, the task is relatively easy, and is simply a reflection of the judge's opinions and biases.
  • (If simplicity were the only desideratum, dictatorship would prevail over democracy.)
  • In contrast, this paper addresses the problem of computing a "consensus" ranking of the alternatives, given the individual ranking preferences of several judges.
  • The authors call this the rank aggregation problem.
  • The authors provide the theoretical underpinnings for stating criteria for "good" rank aggregation techniques and evaluating specific proposals, and they offer novel algorithmic solutions.

1.1 Motivation

  • As of February 2001, there were at least 24 general-purpose search engines (see Search Engine Watch [1]), as well as numerous special-purpose search engines.
  • There are a number of good reasons why this is the case, even if the authors restrict attention to search engines that are meant to be "general purpose."
  • This is a problem conventionally studied in database middleware (see [15]).
  • There is a second, very broad, set of scenarios where rank aggregation is called for.
  • Notice that the former may produce no useful document, or too few of them, while the latter may produce an enormous list of documents where it is not clear which one to choose as the best.

1.2 Challenges

  • The ideal scenario for rank aggregation is when each judge (search engine in the case of meta-search, individual criterion for multi-criteria selection, and subsets of queries in the case of word association queries) gives a complete ordering of all the alternatives in the universe of alternatives.
  • This, however, is far too unrealistic for two main reasons.
  • Secondly, search engines routinely limit access to about the first few hundred pages in their rank-ordering.
  • The issue of efficiency is also a serious bottleneck in performing rank aggregation for multi-criteria selection and word association queries.
  • Therefore, any method for rank aggregation for Web applications must be capable of dealing with the fact that only the top few hundred entries of each ranking are available.

1.3 Our results

  • The authors provide a mathematical setting in which to study the rank aggregation problem, and propose several algorithms.
  • By drawing on the literature from social choice theory, statistics, and combinatorial optimization, the authors formulate precisely what it means to compute a good consensus ordering of the alternatives, given several rankings of the alternatives.
  • Besides the heuristics, the authors identify a crucial property of Kemeny optimal solutions that is particularly useful in combatting spam, and provide an efficient algorithm for minimally modifying any initial aggregation so as to enjoy this property.
  • This property is called the "extended Condorcet criterion," and the authors call the efficient process that is guaranteed to achieve it "local Kemenization."
  • While there is no guarantee on the quality of the output, the latter methods are extremely efficient, and usually match or outperform the first method.

1.4 Organization

  • The authors describe their framework, including the notions of ranking, distance measures, and optimal aggregation in Section 2.
  • This section also contains a brief description of concepts from graph theory and Markov chains the authors need for this paper.
  • Section 3 discusses spam, the extended Condorcet principle, and local Kemenization.
  • Section 4 describes various rank aggregation methods, including the well-known Borda method and several other new methods.

2.1 Ranking

  • (2) There are situations where full lists are not convenient or even possible.
  • Let U denote the set of all Web pages in the world.
  • Let τ denote the results of a search engine in response to some fixed query.
  • In other words, there are pages in the world which are unranked by this search engine with respect to the query.
  • Such lists that rank only some of the elements in U are called partial lists. (3) A special case of partial lists is the top d list, in which each ranked element is implicitly above all unranked elements.

2.1.1 Distance measures

  • After dividing this number by the maximum value $|S|^2/2$, one can obtain a normalized value of the footrule distance, which is always between 0 and 1.
  • The footrule distance between two lists can be computed in linear time.
  • Dividing this number by the maximum possible value $\binom{|S|}{2}$, the authors obtain a normalized version of the Kendall distance.
  • Note that these distances are not necessarily metrics.
  • The authors do not delve into such discussions here; the interested reader can find such arguments in the books by Diaconis [12], Critchlow [11], or Marden [17].

2.1.2 Optimal rank aggregation

  • The aggregation obtained by optimizing Kendall distance is called Kemeny optimal aggregation and in a precise sense, corresponds to the geometric median of the inputs.
  • The authors show that computing the Kemeny optimal aggregation is NP-hard even when $k = 4$.
  • In Section 3 the authors establish a strong connection between satisfaction of the extended Condorcet criterion and fighting search engine "spam."
  • The following relation shows that Kendall distance can be approximated very well via the Spearman footrule distance.
  • In Section 4 the authors exhibit a polynomial-time algorithm to compute optimal footrule aggregation (scaled footrule aggregation for partial lists).

3. SPAM RESISTANCE AND CONDORCET CRITERIA

  • This is called the extended Condorcet criterion (ECC).
  • (If the evaluators are human, the typical scenario during the design and training of search engines, then the eventual product will incorporate the biases of the training evaluators.).
  • In other words, under this definition of spam, the spam pages are the Condorcet losers, and will occupy the bottom partition of any aggregated ranking that satisfies the extended Condorcet criterion.
  • This procedure is called local Kemenization and is described next.

3.1 Local Kemenization

  • The authors introduce the notion of a locally Kemeny optimal aggregation, a relaxation of Kemeny optimality, that ensures satisfaction of the extended Condorcet principle and yet remains computationally tractable.
  • The authors have discussed the value of the extended Condorcet criterion in increasing resistance to search engine spam and in ensuring that elements in the top partitions remain highly ranked.
  • By applying their "local Kemenization" procedure (described below), one can obtain a ranking that is maximally consistent with the Borda ordering but in which the Condorcet winners are at the top of the list.
  • Intuitively, this approach also preserves the strengths of the initial aggregation.
  • The authors also show that the local Kemenization of an aggregation is unique.

4.2 Footrule and scaled footrule

  • Since the footrule optimal aggregation is a good approximation of Kemeny optimal aggregation, it merits investigation.
  • Now, the authors obtain an algorithm for footrule optimal aggregation via the following proposition (Proposition 4).
  • It can be shown that a permutation minimizing the total footrule distance to the input lists is given by a minimum cost perfect matching in a bipartite graph of pages versus positions.
  • As before, the authors can solve the minimum cost maximum matching problem on this bipartite graph to obtain the footrule aggregation algorithm for partial lists.
  • The authors call this method scaled footrule aggregation (SFO); a sketch of the matching formulation follows below.
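To make the matching formulation concrete, here is a minimal Python sketch of scaled footrule aggregation using SciPy's bipartite matching solver. The cost of placing a page at a position follows the scaled footrule definition from Section 2.1.1; the function name and the use of scipy.optimize.linear_sum_assignment are our choices, not the paper's.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def scaled_footrule_aggregation(lists):
    """Aggregate partial ranked lists by minimum cost perfect matching (SFO).

    One side of the bipartite graph is the union of all ranked pages;
    the other side is the positions 1..n. The cost of placing page i at
    position p is the sum, over the lists containing i, of
    |tau(i)/len(tau) - p/n|, i.e., the scaled footrule contribution.
    """
    universe = sorted(set().union(*map(set, lists)))
    n = len(universe)
    cost = np.zeros((n, n))
    for row, page in enumerate(universe):
        for p in range(1, n + 1):
            cost[row, p - 1] = sum(
                abs((t.index(page) + 1) / len(t) - p / n)
                for t in lists
                if page in t
            )
    rows, cols = linear_sum_assignment(cost)  # min cost perfect matching
    # cols[r] is the (0-based) position assigned to universe[r]
    return [page for _, page in sorted(zip(cols, (universe[r] for r in rows)))]
```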

5.1 Meta-search

  • Several meta-search engines exist (e.g., metacrawler [3]) and many Web users build their own meta-search engines.
  • Given the different crawling strategies, indexing policies, and ranking functions employed by different search engines, meta-search engines are useful in many situations.
  • The actual success of a meta-search engine directly depends on the aggregation technique underlying it.
  • The idea is simple: given a query, obtain the top (say) 100 results from many search engines, apply the rank aggregation function with the universe being the union of pages returned by the search engines, and return the top (say) 100 results of the aggregation; a minimal sketch of one such aggregation step follows below.
  • The authors illustrate this scheme in Section 6.2.1 and examine the performance of their methods.
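As one concrete instance of the aggregation step in this pipeline, here is a small Borda-style scorer for top-d lists. The scoring convention for pages missing from a list is our assumption; this summary does not specify the paper's exact handling of partial lists.

```python
def borda_aggregate(lists):
    """Aggregate top-d lists with a simple Borda count.

    A page at rank r (0-based) in a list of length d earns d - r points
    from that list; pages the list does not rank earn 0 from it (an
    assumed convention for partial lists).
    """
    scores = {}
    for t in lists:
        d = len(t)
        for r, page in enumerate(t):
            scores[page] = scores.get(page, 0) + (d - r)
    # Higher score is better; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda p: (-scores[p], p))

engines = [
    ["a.com", "b.com", "c.com"],
    ["b.com", "a.com", "d.com"],
    ["a.com", "d.com", "b.com"],
]
print(borda_aggregate(engines)[:3])  # ['a.com', 'b.com', 'd.com']
```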

5.2 Aggregating ranking functions

  • Given a collection of documents, the problem of indexing is: store the documents in such a manner that given a search term, those most relevant to the search term can be retrieved easily.
  • Another ranking function might be the consequence of applying the vector-space model and an appropriate distance measure to the document collection.
  • The authors' techniques can be applied to obtain a good aggregation of these ranking functions.
  • If the system is flexible enough to let the user specify various preference criteria (travel dates/times, window/aisle seating, number of stops, frequent-flier preferences, refundable/non-refundable nature of ticket purchase, and of course, price), it can rank the available flight plans based on each of the criteria, and apply rank aggregation methods to give better quality results to the user.
  • In fact, very often there is not even a clear order of importance among the criteria.

5.3 Spam reduction

  • As the authors discussed earlier, the extended Condorcet principle is a reasonable cure for spam.
  • This extra step is inexpensive in terms of computation cost, but has the benefit of reducing spam by ranking Condorcet losers below Condorcet winners.

5.4 Word association techniques

  • Different search engines and portals have different semantics for handling a multi-word query.
  • As discussed in Section 1.1, both these scenarios are inconvenient in many situations.
  • The user lists a number of skills and a number of potential keywords in the job description, for example, "Silicon Valley C++ Java CORBA TCPIP algorithms start-up pre-IPO stock options".
  • It is clear that the "AND" rule might produce no document, and the "OR" rule is equally disastrous.
  • The authors query the search engine with these k sub-queries (using the AND semantics) and obtain the top d (say, d = 100) results for each of the sub-queries; a sketch of this sub-query construction follows below.
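A sketch of the sub-query construction: form all pairs of query terms, fetch the top d results for each pair under AND semantics, and hand the resulting lists to any aggregation method. fetch_top_d is a hypothetical stand-in for a search engine call, not an API from the paper.

```python
from itertools import combinations

def word_association_lists(terms, fetch_top_d, d=100):
    """Build one top-d list per two-term sub-query, ready for rank aggregation.

    fetch_top_d(query, d): hypothetical search call returning a ranked
    list of at most d urls for `query` under AND semantics.
    """
    return [fetch_top_d(" ".join(pair), d) for pair in combinations(terms, 2)]
```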

5.5 Search engine comparison

  • The authors' methods also imply a natural way to compare the performance of various search engines.
  • The main idea is that a search engine can be called good when it behaves like a least noisy expert for a query.
  • This agrees with their earlier notion of what an expert is and how to deal with noisy experts.
  • Thus, the procedure to rank the search engines themselves (with respect to a query) is as follows: obtain a rank aggregation of the results from various search engines and rank the search engines based on their (Kendall or footrule) distance to the aggregated ranking; a sketch follows below.
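In code, this comparison is just aggregation followed by sorting engines by their distance to the aggregate, after projecting the aggregate onto each engine's list. The distance argument can be any of the measures from Section 2.1 (a Kendall implementation is sketched with the distance definitions further down this page); the helper names here are ours.

```python
def rank_engines(engine_lists, aggregate, distance):
    """Rank search engines by distance of their list to the aggregated ranking.

    engine_lists: dict of engine name -> ranked list of urls
    aggregate:    aggregated ranking over the union of all urls
    distance:     e.g. a Kendall or footrule distance between two lists
    """
    def project(full, t):
        members = set(t)
        return [x for x in full if x in members]

    scored = {name: distance(project(aggregate, t), t)
              for name, t in engine_lists.items()}
    return sorted(scored, key=scored.get)  # least noisy engine first
```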

6.1 Infrastructure

  • The first experiment is to build a meta-search engine using different aggregation methods (Section 4) and compare their performances.
  • The third experiment is to illustrate the technique of word association for multi-word queries.
  • While the authors provide numerical values for the first experiment, they provide actual examples for the second and third experiments.
  • The authors' distance measurements are with respect to the union of the top 100 results from these search engines.
  • The authors' notion of two urls being identical is purely syntactic (up to some canonical form); they do not use the content of a page to determine if two urls are identical. An assumed sketch of such a canonical form follows below.
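The page does not spell out the canonical form used for urls, so the normalization below is only a guess at the kind of purely syntactic cleanup meant: lowercase the scheme and host, drop default ports and fragments, and strip a trailing slash.

```python
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url):
    """Syntactic url normalization (an assumed canonical form, not the paper's)."""
    parts = urlsplit(url)
    netloc = parts.netloc.lower()
    for default_port in (":80", ":443"):
        if netloc.endswith(default_port):
            netloc = netloc[: -len(default_port)]
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), netloc, path, parts.query, ""))

assert canonicalize("HTTP://Example.com:80/a/") == canonicalize("http://example.com/a")
```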

6.2.1 Meta-search

  • The fourth column in the table means that 27.231 pages (on average) were present in exactly three of the search engine results.
  • The second column indicates that around 284 pages were present in only one search engine while the last column indicates that less than 2 pages were present in all the search engines.
  • The performance is calculated in terms of the three distance measures described in Section 2.1.

6.2.2 Spam reduction

  • The authors use the following queries: Feng Shui, organic vegetables, gardening.
  • Notice that their definition of spam does not mean evil!
  • On the other hand, the authors were interested in urls that spammed at least two search engines; given that the overlap among search engines was not very high, this proved to be a challenging task.
  • Table 3 presents their examples: the entries are the rank within individual search engines' lists.
  • Based on results from Section 6.2.1, the authors restrict their attention to SFO and MC4 with local Kemenization.

6.2.3 Word associations

  • As noted earlier, Google uses AND semantics and hence for many interesting multi-word queries, the number or the quality of the pages returned is not very high.
  • The authors chose every pair of terms in the multi-word query to construct several lists and then apply rank aggregation (SFO and MC4) to these lists.

6.3 Discussion

  • MC4 outperforms all the other methods.
  • This is very interesting since Borda's method is the usual choice of aggregation, and perhaps the most natural.
  • Recall that the footrule procedure for partial lists was only a heuristic modification of the footrule procedure for full lists.
  • In general, local Kemenization seems to improve the results by around 1-3% in terms of the distance measures.
  • While the authors do not claim that their methods completely eliminate spam, their study shows that they reduce spam in general.

7. CONCLUSIONS AND FURTHER WORK

  • The authors have developed the theoretical groundwork for describing and evaluating rank aggregation methods.
  • The methods are also simple to implement, do not have any computational overhead, and outperform popular classical methods like Borda's.
  • The authors have established the value of the extended Condorcet criterion in the context of meta-search, and have described a simple process, local Kemenization, for ensuring satisfaction of this criterion.
  • Further work involves trying to obtain a qualitative understanding of why the Markov chain methods perform very well.
  • Finally, this work originated in conversations with Helen Nissenbaum on bias in searching.


Rank Aggregation Methods for the Web

Cynthia Dwork (Compaq Systems Research Center, 130 Lytton Ave., Palo Alto, CA 94301; dwork@pa.dec.com)
Ravi Kumar (IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120; ravi@almaden.ibm.com)
Moni Naor (Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot 76100, Israel; naor@wisdom.weizmann.ac.il; this work was done while the author was visiting the IBM Almaden Research Center and Stanford University)
D. Sivakumar (IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120; siva@almaden.ibm.com)

Copyright is held by the author/owner. WWW10, May 1-5, 2001, Hong Kong. ACM 1-58113-348-0/01/0005.

ABSTRACT

We consider the problem of combining ranking results from various sources. In the context of the Web, the main applications include building meta-search engines, combining ranking functions, selecting documents based on multiple criteria, and improving search precision through word associations. We develop a set of techniques for the rank aggregation problem and compare their performance to that of well-known methods. A primary goal of our work is to design rank aggregation techniques that can effectively combat "spam," a serious problem in Web searches. Experiments show that our methods are simple, efficient, and effective.

Keywords: rank aggregation, ranking functions, meta-search, multi-word queries, spam

1. INTRODUCTION

The task of ranking a list of several alternatives based on one or more criteria is encountered in many situations. One of the underlying goals of this endeavor is to identify the best alternatives, either to simply declare them to be the best (e.g., in sports) or to employ them for some purpose. When there is just a single criterion (or "judge") for ranking, the task is relatively easy, and is simply a reflection of the judge's opinions and biases. (If simplicity were the only desideratum, dictatorship would prevail over democracy.) In contrast, this paper addresses the problem of computing a "consensus" ranking of the alternatives, given the individual ranking preferences of several judges. We call this the rank aggregation problem. Specifically, we study the rank aggregation problem in the context of the Web, where it is complicated by a plethora of issues. We begin by underscoring the importance of rank aggregation for Web applications and clarifying the various characteristics of this problem in the context of the Web. We provide the theoretical underpinnings for stating criteria for "good" rank aggregation techniques and evaluating specific proposals, and we offer novel algorithmic solutions. Our experiments provide initial evidence for the success of our methods, which we believe will significantly improve a variety of search applications on the Web.
1.1 Motivation
As of February 2001, there were at least 24 general-purpose
search engines (see Search Engine Watch [1]), as well as nu-
merous special-purp ose search engines. The very fact that
there are so many choices is an indication that no single
search engine has proven to b e satisfactory for all Web users.
There are a number of good reasons why this is the case,
even if we restrict attention to search engines that are meant
to b e \general purp ose." Two fairly obvious reasons are that
no one ranking algorithm can be considered broadly accept-
able and no one search engine is suciently comprehensive
in its coverage of the Web. The issues, however, are some-
what deep er.
Firstly, there is the question of \spam" | devious manip-
ulation by authors of Web pages in an attempt to achieve
undeservedly high rank. No single
ranking function
can b e
trusted to perform well for all queries. A few years ago,
query term frequency was the single main heuristic in rank-
ing Web pages; since the inuential work of Kleinberg [16]
and Brin and Page [7], link analysis has come to be identi-
ed as a very powerful technique in ranking Web pages and
other hyperlinked do cuments. Several other heuristics have
been added, including anchor-text analysis [8], page struc-
ture (headers, etc.) analysis, the use of keyword listings
and the url text itself, etc. These well-motivated heuris-
tics exploit a wealth of information, but are often prone to
manipulation by devious parties.
Secondly, in a world governed by (frequently changing)
commercial interests and alliances, it is not clear that users
haveany form of protection against the biases/interests of
individual search engines. As a case in p oint, note that
\paid placement" and \paid inclusion" (see [2]) appear to
be gaining p opularity among search engines.
In some cases, individual ranking functions are inadequate
613

for a more fundamental reason: the data b eing ranked are
simply not amenable to simple ranking functions. This is
the case with querying ab out multimedia do cuments, e.g.
\nd a do cument that has information about Greek islands
with pictures of beautiful blue b eaches." This is a problem
conventionally studied in database middleware (see [15]).
Several novel approaches have been invented for this pur-
pose, but this problem cannot b e considered well-solved by
any measure. Naturally, these problems fall under the realm
of rank aggregation.
Thus, our rst motivation for studying rank aggregation
in the context of the Web is to provide users a certain degree
of
robustness
of search, in the face of various shortcomings
and biases | malicious or otherwise | of individual search
engines. That is, to nd robust techniques for
meta-search
.
There is a second, very broad, set of scenarios where
rank aggregation is called for. Roughly describ ed, these
are the cases where the user preference includes a variety
of criteria, and the logic of classifying a do cument as ac-
ceptable or unacceptable is to o complicated or to o nebu-
lous to encode in any simple query form. As prototypi-
cal examples, we list some cases that Web users exp eri-
ence frequently. Broadly, these can be classied as
multi-
criteria selection
and
wordassociation queries
. Examples of
multi-criteria selection arise when trying to choose a pro duct
from a database of pro ducts, such as restaurants or travel
plans. Examples of word asso ciation queries arise when a
user wishes to search for a goo d document on a topic; the
user knowsalistofkeywords that collectively describ e the
topic, but isn't sure that the best do cument on the topic
necessarily contains all of them. (See Section 5 for sp e-
cic examples of b oth categories.) This is a very familiar
dilemma for Web search users: when we supply a list of
keywordstoasearch engine, do we ask for do cuments that
contain
al l
the keywords, or do we ask for do cuments that
contain
any
of the keywords? Notice that the former may
produce no useful document, or to o few of them, while the
latter may pro duce an enormous list of do cuments where it
is not clear which one to choose as the best. We prop ose the
following natural approach to this problem:
Associations Ranking
: Rank the database with
respect to several small subsets of the queries,
and aggregate these rankings.
1.2 Challenges

The ideal scenario for rank aggregation is when each judge (search engine in the case of meta-search, individual criterion for multi-criteria selection, and subsets of queries in the case of word association queries) gives a complete ordering of all the alternatives in the universe of alternatives. This, however, is far too unrealistic for two main reasons.

The first reason is a particularly acute problem in doing meta-search: the coverage of various search engines is different; it is unlikely that all search engines will (eventually) be capable of ranking the entire collection of pages on the Web, which is growing at a very high rate. Secondly, search engines routinely limit access to about the first few hundred pages in their rank-ordering. This is done both to ensure the confidentiality of their ranking algorithm, and in the interest of efficiency. The issue of efficiency is also a serious bottleneck in performing rank aggregation for multi-criteria selection and word association queries.

Therefore, any method for rank aggregation for Web applications must be capable of dealing with the fact that only the top few hundred entries of each ranking are available. Of course, if there is absolutely no overlap among these entries, there isn't much any algorithm can do; the challenge is to design rank aggregation algorithms that work when there is limited but non-trivial overlap among the top few hundreds or thousands of entries in each ranking. Finally, in light of the amount of data, it is implicit that any rank aggregation method has to be computationally efficient.
1.3 Our results

We provide a mathematical setting in which to study the rank aggregation problem, and propose several algorithms. By drawing on the literature from social choice theory, statistics, and combinatorial optimization, we formulate precisely what it means to compute a good consensus ordering of the alternatives, given several (partial) rankings of the alternatives. Specifically, we identify the method of Kemeny, originally proposed in the context of social choice theory, as an especially desirable approach, since it minimizes the total disagreement (formalized below) between the several input rankings and their aggregation. Unfortunately, we show that computing optimal solutions based on Kemeny's approach is NP-hard, even when the number of rankings to be aggregated is only 4. Therefore, we provide several heuristic algorithms for rank aggregation and evaluate them in the context of Web applications. Besides the heuristics, we identify a crucial property of Kemeny optimal solutions that is particularly useful in combatting spam, and provide an efficient algorithm for minimally modifying any initial aggregation so as to enjoy this property. This property is called the "extended Condorcet criterion," and we call the efficient process that is guaranteed to achieve it "local Kemenization."

Our algorithms for initial aggregation are based on two broad principles. The first principle is to achieve optimality not with respect to the Kemeny guidelines, but with respect to a different, closely related, measure, for which it is possible to find an efficient solution. The second principle is through the use of Markov chains as a means of combining partial comparison information, derived from the individual rankings, into a total ordering. While there is no guarantee on the quality of the output, the latter methods are extremely efficient, and usually match or outperform the first method.

We report experiments and quantitative measures of quality for the meta-search problem, and give several illustrations of our methods applied for the problems of spam resistance and word association queries.
1.4 Organization
We describe our framework, including the notions of rank-
ing, distance measures, and optimal aggregation in Section
2. This section also contains a brief description of concepts
from graph theory and Markovchains we need for this pap er.
Section 3 discusses spam, the extended Condorcet principle,
and lo cal Kemenization. Section 4 describ es various rank ag-
gregation methods, including the well-known Borda method
and several other new metho ds. Section 5 presents vema-
jor applications of our metho ds and Section 6 presents an
experimental study of some of them. Finally, Section 7 con-
cludes the pap er with some remarks on future work.
614

2. PRELIMINARIES

2.1 Ranking

Given a universe $U$, an ordered list (or simply, a list) with respect to $U$ is an ordering (aka ranking) of a subset $S \subseteq U$, i.e., $\tau = [x_1 \geq x_2 \geq \cdots \geq x_d]$, with each $x_i \in S$, and $\geq$ is some ordering relation on $S$. Also, if $i \in U$ is present in $\tau$, let $\tau(i)$ denote the position or rank of $i$ (a highly ranked or preferred element has a low-numbered position in the list). For a list $\tau$, let $|\tau|$ denote the number of elements. By assigning a unique identifier to each element in $U$, we may assume without loss of generality that $U = \{1, 2, \ldots, |U|\}$.

Depending on the kind of information present in $\tau$, three situations arise:

(1) If $\tau$ contains all the elements in $U$, then it is said to be a full list. Full lists are, in fact, total orderings (permutations) of $U$. For instance, if $U$ is the set of all pages indexed by a search engine, it is easy to see that a full list emerges when we rank pages (say, with respect to a query) according to a fixed algorithm.

(2) There are situations where full lists are not convenient or even possible. For instance, let $U$ denote the set of all Web pages in the world. Let $\tau$ denote the results of a search engine in response to some fixed query. Even though the query might induce a total ordering of the pages indexed by the search engine, since the index set of the search engine is almost surely only a subset of $U$, we have a strict inequality $|\tau| < |U|$. In other words, there are pages in the world which are unranked by this search engine with respect to the query. Such lists that rank only some of the elements in $U$ are called partial lists.

(3) A special case of partial lists is the following. If $S$ is the set of all the pages indexed by a particular search engine and if $\tau$ corresponds to the top 100 results of the search engine with respect to a query, clearly the pages that are not present in list $\tau$ can be assumed to be ranked below 100 by the search engine. Such lists that rank only a subset of $S$ and where it is implicit that each ranked element is above all unranked elements, are called top $d$ lists, where $d$ is the size of the list.

A natural operation of projection will be useful. Given a list $\tau$ and a subset $T$ of the universe $U$, the projection of $\tau$ with respect to $T$ (denoted $\tau|_T$) will be a new list that contains only elements from $T$. Notice that if $\tau$ happens to contain all the elements in $T$, then $\tau|_T$ is a full list with respect to $T$.
2.1.1 Distance measures

How do we measure distance between two full lists with respect to a set $S$? Two popular distance measures are [12]:

(1) The Spearman footrule distance is the sum, over all elements $i \in S$, of the absolute difference between the rank of $i$ according to the two lists. Formally, given two full lists $\sigma$ and $\tau$, the distance is given by $F(\sigma, \tau) = \sum_{i=1}^{|S|} |\sigma(i) - \tau(i)|$. After dividing this number by the maximum value $|S|^2/2$, one can obtain a normalized value of the footrule distance, which is always between 0 and 1. The footrule distance between two lists can be computed in linear time.

(2) The Kendall tau distance counts the number of pairwise disagreements between two lists; that is, the distance between two full lists $\sigma$ and $\tau$ is $K(\sigma, \tau) = |\{(i, j) : i < j,\ \sigma(i) < \sigma(j),\ \text{but}\ \tau(i) > \tau(j)\}|$. Dividing this number by the maximum possible value $\binom{|S|}{2}$, we obtain a normalized version of the Kendall distance. The Kendall distance for full lists is the "bubble sort" distance, i.e., the number of pairwise adjacent transpositions needed to transform one list into the other. The Kendall distance between two lists of length $n$ can be computed in $n \log n$ time using simple data structures.

The above measures are metrics and extend in a natural way to several lists. Given several full lists $\sigma, \tau_1, \ldots, \tau_k$, for instance, the normalized footrule distance of $\sigma$ to $\tau_1, \ldots, \tau_k$ is given by $F(\sigma, \tau_1, \ldots, \tau_k) = (1/k) \sum_{i=1}^{k} F(\sigma, \tau_i)$.

One can define generalizations of these distance measures to partial lists. If $\tau_1, \ldots, \tau_k$ are partial lists, let $U$ denote the union of elements in $\tau_1, \ldots, \tau_k$ and let $\sigma$ be a full list with respect to $U$. Now, given $\sigma$, the idea is to consider the distance between $\tau_i$ and the projection of $\sigma$ with respect to $\tau_i$. Then, for instance, we have the induced footrule distance $F(\sigma, \tau_1, \ldots, \tau_k) = \sum_{i=1}^{k} F(\sigma|_{\tau_i}, \tau_i)/k$. In a similar manner, induced Kendall tau distance can be defined. Finally, we define a third notion of distance that measures the distance between a full list and a partial list on the same universe:

(3) Given one full list and a partial list, the scaled footrule distance weights contributions of elements based on the size of the lists they are present in. More formally, if $\sigma$ is a full list and $\tau$ is a partial list, $F'(\sigma, \tau) = \sum_{i \in \tau} \left| \sigma(i)/|\sigma| - \tau(i)/|\tau| \right|$. We will normalize $F'$ by dividing by $|\tau|/2$.

Note that these distances are not necessarily metrics. To a large extent, our interpretations of experimental results will be in terms of these distance measures. While these distance measures seem natural, why these measures are good is moot. We do not delve into such discussions here; the interested reader can find such arguments in the books by Diaconis [12], Critchlow [11], or Marden [17].
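As a concrete reference for these definitions, here is a minimal Python sketch of both distances between two full lists over the same universe. For clarity it uses the O(n^2) pair-counting form of the Kendall distance; the O(n log n) variant mentioned above would use a merge-sort style inversion count. The function names are ours.

```python
from itertools import combinations

def footrule(sigma, tau):
    """Spearman footrule distance between two full lists (rank = index)."""
    rank_s = {x: r for r, x in enumerate(sigma)}
    rank_t = {x: r for r, x in enumerate(tau)}
    assert rank_s.keys() == rank_t.keys(), "full lists over the same universe"
    return sum(abs(rank_s[x] - rank_t[x]) for x in rank_s)

def kendall(sigma, tau):
    """Kendall tau distance: number of pairwise disagreements (O(n^2))."""
    rank_s = {x: r for r, x in enumerate(sigma)}
    rank_t = {x: r for r, x in enumerate(tau)}
    return sum(
        1
        for x, y in combinations(rank_s, 2)
        if (rank_s[x] < rank_s[y]) != (rank_t[x] < rank_t[y])
    )

a = ["u1", "u2", "u3", "u4"]
b = ["u2", "u1", "u4", "u3"]
print(footrule(a, b), kendall(a, b))  # 4 2, consistent with K <= F <= 2K
```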
2.1.2 Optimal rank aggregation

In the generic context of rank aggregation, the notion of "better" depends on what distance measure we strive to optimize. Suppose we wish to optimize Kendall distance; the question then is: given (full or partial) lists $\tau_1, \ldots, \tau_k$, find a $\sigma$ such that $\sigma$ is a full list with respect to the union of the elements of $\tau_1, \ldots, \tau_k$ and $\sigma$ minimizes $K(\sigma, \tau_1, \ldots, \tau_k)$. The aggregation obtained by optimizing Kendall distance is called Kemeny optimal aggregation and, in a precise sense, corresponds to the geometric median of the inputs. We show that computing the Kemeny optimal aggregation is NP-hard even when $k = 4$ (see the Appendix). (Note that in contrast to the social choice scenario where there are many voters and relatively few candidates, in the Web aggregation scenario we have many candidates (pages) and relatively few voters (the search engines).)

Kemeny optimal aggregations have a maximum likelihood interpretation. Suppose there is an underlying "correct" ordering $\sigma$ of $S$, and each order $\tau_1, \ldots, \tau_k$ is obtained from $\sigma$ by swapping two elements with some probability less than 1/2. Thus, the $\tau$'s are "noisy" versions of $\sigma$. A Kemeny optimal aggregation of $\tau_1, \ldots, \tau_k$ is one that is maximally likely to have produced the $\tau$'s (it need not be unique) [24]. Viewed differently, Kemeny optimal aggregation has the property of eliminating noise from various different ranking schemes. Furthermore, Kemeny optimal aggregations are essentially the only ones that simultaneously satisfy natural and important properties of rank aggregation functions, called neutrality and consistency in the social choice literature, and the so-called Condorcet property [25]. Indeed, Kemeny optimal aggregations satisfy the extended Condorcet criterion. In Section 3 we establish a strong connection between satisfaction of the extended Condorcet criterion and fighting search engine "spam."

Given that Kemeny optimal aggregation is useful, but computationally hard, how do we compute it? The following relation shows that Kendall distance can be approximated very well via the Spearman footrule distance.

Proposition 1. [13] For any two full lists $\sigma, \tau$: $K(\sigma, \tau) \leq F(\sigma, \tau) \leq 2K(\sigma, \tau)$.

This leads us to the problem of footrule optimal aggregation. This is the same as before, except that the optimizing criterion is the footrule distance. In Section 4 we exhibit a polynomial time algorithm to compute optimal footrule aggregation (scaled footrule aggregation for partial lists). Therefore we have:

Proposition 2. If $\sigma$ is the Kemeny optimal aggregation of full lists $\tau_1, \ldots, \tau_k$ and $\sigma'$ optimizes the footrule aggregation, then $K(\sigma', \tau_1, \ldots, \tau_k) \leq 2K(\sigma, \tau_1, \ldots, \tau_k)$.

Later, in Section 4, we develop rank aggregation methods that do not optimize any obvious criteria, but turn out to be very effective in practice.
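Because the problem is NP-hard, no efficient exact algorithm is available in general, but for a toy universe the definition of Kemeny optimality can be checked by exhaustive search. A minimal sketch, reusing the kendall helper from the snippet above:

```python
from itertools import permutations

def kemeny_optimal(lists):
    """Brute force Kemeny optimal aggregation of full lists.

    Feasible only for tiny universes (n! candidate orderings); the
    general problem is NP-hard even for k = 4 input lists.
    """
    best, best_cost = None, float("inf")
    for candidate in permutations(lists[0]):
        cost = sum(kendall(list(candidate), t) for t in lists)
        if cost < best_cost:
            best, best_cost = list(candidate), cost
    return best, best_cost
```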
2.2 Basic notions

Readers familiar with the notions in graph theory and Markov chains can skip this section.

2.2.1 Some concepts from graph theory

A graph $G = (V, E)$ consists of a set of nodes $V$ and a set of edges $E$. Each element $e \in E$ is an unordered pair $(u, v)$ of incident nodes, representing a connection between nodes $u$ and $v$. A graph is connected if the node set cannot be partitioned into components such that there are no edges whose incident nodes occur in different components.

A bipartite graph $G = (V_1, V_2, E)$ consists of two disjoint sets of nodes $V_1, V_2$ such that each edge $e \in E$ has one node from $V_1$ and the other node from $V_2$. A bipartite graph is complete if each node in $V_1$ is connected to every node in $V_2$. A matching is a subset of edges such that for each edge in the matching, there is no other edge that shares a node with it. A maximum matching is a matching of largest cardinality. A weighted graph is a graph with a (non-negative) weight $w_e$ for every edge $e$. Given a weighted graph, the minimum weight maximum matching is the maximum matching with minimum weight. The minimum weight maximum matching problem for bipartite graphs can be solved in time $O(n^{2.5})$ where $n$ is the number of nodes.

A directed graph consists of nodes and edges, but this time an edge is an ordered pair of nodes $(u, v)$, representing a connection from $u$ to $v$. A directed path is said to exist from $u$ to $v$ if there is a sequence of nodes $u = w_0, \ldots, w_k = v$ such that $(w_i, w_{i+1})$ is an edge, for all $i = 0, \ldots, k-1$. A directed cycle is a non-trivial directed path from a node to itself. A strongly connected component of a graph is a set of nodes such that for every pair of nodes in the component, there is a directed path from one to the other. A directed acyclic graph (DAG) is a directed graph with no directed cycles. In a DAG, a sink node is one with no directed path to any other node.
2.2.2 Markov chains

A (homogeneous) Markov chain for a system is specified by a set of states $S = \{1, 2, \ldots, n\}$ and an $n \times n$ non-negative, stochastic (i.e., the sum of each row is 1) matrix $M$. The system begins in some start state in $S$ and at each step moves from one state to another state. This transition is guided by $M$: at each step, if the system is in state $i$, it moves to state $j$ with probability $M_{ij}$. If the current state is given as a probability distribution, the probability distribution of the next state is given by the product of the vector representing the current state distribution and $M$. In general, the start state of the system is chosen according to some distribution $x$ (usually, the uniform distribution) on $S$. After $t$ steps, the state of the system is distributed according to $xM^t$. Under some niceness conditions on the Markov chain (whose details we will not discuss), irrespective of the start distribution $x$, the system eventually reaches a unique fixed point where the state distribution does not change. This distribution is called the stationary distribution. It can be shown that the stationary distribution is given by the principal left eigenvector $y$ of $M$, i.e., $yM = y$. In practice, a simple power-iteration algorithm can quickly obtain a reasonable approximation to $y$.

An important observation here is that the entries in $y$ define a natural ordering on $S$. We call such an ordering the Markov chain ordering of $M$. A technical point to note while using Markov chains for ranking is the following. A Markov chain $M$ defines a weighted graph with $n$ nodes such that the weight on edge $(u, v)$ is given by $M_{uv}$. The strongly connected components of this graph form a DAG. If this DAG has a sink node, then the stationary distribution of the chain will be entirely concentrated in the strongly connected component corresponding to the sink node. In this case, we only obtain an ordering of the alternatives present in this component; if this happens, the natural extended procedure is to remove these states from the chain and repeat the process to rank the remaining nodes. Of course, if this component has sufficiently many alternatives, one may stop the aggregation process and output a partial list containing some of the best alternatives. If the DAG of connected components is (weakly) connected and has more than one sink node, then we will obtain two or more clusters of alternatives, which we could sort by the total probability mass of the components. If the DAG has several weakly connected components, we will obtain incomparable clusters of alternatives. Thus, when we refer to a Markov chain ordering, we refer to the ordering obtained by this extended procedure.
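The specific chains MC1-MC4 are defined in Section 4 of the paper, which this excerpt does not reproduce; the transition rule below therefore follows the usual description of MC4 (from the current page P, pick a page Q uniformly at random, and move to Q if a majority of the lists ranking both P and Q rank Q above P) and should be treated as an assumption here. The power-iteration step follows the text directly.

```python
import numpy as np

def mc4_matrix(lists, universe):
    """Transition matrix in the style of the paper's MC4 chain (assumed rule)."""
    index = {x: i for i, x in enumerate(universe)}
    ranks = [{x: r for r, x in enumerate(t)} for t in lists]
    n = len(universe)
    M = np.zeros((n, n))
    for p in universe:
        for q in universe:
            if p == q:
                continue
            votes = [r[q] < r[p] for r in ranks if p in r and q in r]
            if votes and 2 * sum(votes) > len(votes):  # majority prefers q
                M[index[p], index[q]] = 1.0 / n
        M[index[p], index[p]] = 1.0 - M[index[p]].sum()  # otherwise stay put
    return M

def markov_chain_ordering(M, universe, iters=200):
    """Order states by an approximate stationary distribution (power iteration)."""
    x = np.full(len(universe), 1.0 / len(universe))
    for _ in range(iters):
        x = x @ M
    return [u for _, u in sorted(zip(-x, universe))]
```

This simple sketch ignores the sink-component subtlety discussed above; a fuller version would apply the extended procedure (or a small amount of smoothing) when the chain is not ergodic.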
3. SPAM RESISTANCE AND CONDORCET CRITERIA

In 1785 Marie J. A. N. Caritat, Marquis de Condorcet, proposed that if there is some element of $S$, now known as the Condorcet alternative, that defeats every other in pairwise simple majority voting, then that element should be ranked first [9]. A natural extension, due to Truchon [22] (see also [21]), mandates that if there is a partition $(C, \bar{C})$ of $S$ such that for any $x \in C$ and $y \in \bar{C}$ the majority prefers $x$ to $y$, then $x$ must be ranked above $y$. This is called the extended Condorcet criterion (ECC). We will show that not only can the ECC be achieved efficiently, but it also has excellent "spam-fighting" properties when used in the context of meta-search.

Intuitively, a search engine has been spammed by a page in its index, on a given query, if it ranks the page "too highly" with respect to other pages in the index, in the view of a "typical" user. Indeed, in accord with this intuition, search engines are both rated [18] and trained by human evaluators. This approach to defining spam: (1) permits an author to raise the rank of her page by improving the content; (2) puts ground truth about the relative value of pages into the purview of the users; in other words, the definition does not assume the existence of an absolute ordering that yields the "true" relative value of a pair of pages on a query; (3) does not assume unanimity of users' opinions or consistency among the opinions of a single user; and (4) suggests some natural ways to automate training of engines to incorporate useful biases, such as geographic bias.

We believe that reliance on evaluators in defining spam is unavoidable. (If the evaluators are human, the typical scenario during the design and training of search engines, then the eventual product will incorporate the biases of the training evaluators.) We model the evaluators by the search engine ranking functions. That is, we make the simplifying assumption that for any pair of pages, the relative ordering by the majority of the search engines comparing them is the same as the relative ordering by the majority of the evaluators. Our intuition is that if a page spams all or even most search engines for a particular query, then no combination of these search engines can defeat the spam. This is reasonable: fix a query; if for some pair of pages a majority of the engines is spammed, then the aggregation function is working with overly bad data (garbage in, garbage out). On the other hand, if a page spams strictly fewer than half the search engines, then a majority of the search engines will prefer a "good" page to a spam page. In other words, under this definition of spam, the spam pages are the Condorcet losers, and will occupy the bottom partition of any aggregated ranking that satisfies the extended Condorcet criterion. Similarly, assuming that good pages are preferred by the majority to mediocre ones, these will be the Condorcet winners, and will therefore be ranked highly.

Many of the existing aggregation methods (see Section 4) do not ensure the election of the Condorcet winner, should one exist. Our aim is to obtain a simple method of modifying any initial aggregation of input lists so that the Condorcet losers (spam) will be pushed to the bottom of the ranking during this process. This procedure is called local Kemenization and is described next.
3.1 Local Kemenization

We introduce the notion of a locally Kemeny optimal aggregation, a relaxation of Kemeny optimality, that ensures satisfaction of the extended Condorcet principle and yet remains computationally tractable. As the name implies, local Kemeny optimality is a "local" notion that possesses some of the properties of a Kemeny optimal aggregation.

A full list $\pi$ is a locally Kemeny optimal aggregation of partial lists $\tau_1, \tau_2, \ldots, \tau_k$ if there is no full list $\pi'$ that can be obtained from $\pi$ by performing a single transposition of an adjacent pair of elements and for which $K(\pi', \tau_1, \tau_2, \ldots, \tau_k) < K(\pi, \tau_1, \tau_2, \ldots, \tau_k)$. In other words, it is impossible to reduce the total distance to the $\tau$'s by flipping an adjacent pair.

Every Kemeny optimal aggregation is also locally Kemeny optimal, but the converse is false. Nevertheless, we show that a locally Kemeny optimal aggregation satisfies the extended Condorcet property and can be computed (see the Appendix) in time $O(kn \log n)$.

We have discussed the value of the extended Condorcet criterion in increasing resistance to search engine spam and in ensuring that elements in the top partitions remain highly ranked. However, specific aggregation techniques may add considerable value beyond simple satisfaction of this criterion; in particular, they may produce good rankings of alternatives within a given partition (as noted above, the extended Condorcet criterion gives no guidance within a partition). We now show how, using any initial aggregation $\mu$ of partial lists $\tau_1, \ldots, \tau_k$ (one that is not necessarily Condorcet), we can efficiently construct a locally Kemeny optimal aggregation of the $\tau$'s that is in a well-defined sense maximally consistent with $\mu$. For example, if the $\tau$'s are full lists then $\mu$ could be the Borda ordering on the alternatives (see Section 4.1 for Borda's method). Even if a Condorcet winner exists, the Borda ordering may not rank it first. However, by applying our "local Kemenization" procedure (described below), we can obtain a ranking that is maximally consistent with the Borda ordering but in which the Condorcet winners are at the top of the list.

A local Kemenization (LK) of a full list $\mu$ with respect to $\tau_1, \ldots, \tau_k$ is a procedure that computes a locally Kemeny optimal aggregation of $\tau_1, \ldots, \tau_k$ that is (in a precise sense) maximally consistent with $\mu$. Intuitively, this approach also preserves the strengths of the initial aggregation $\mu$. Thus: (1) the Condorcet losers receive low rank, while the Condorcet winners receive high rank (this follows from local Kemeny optimality); (2) the result disagrees with $\mu$ on the order of any given pair $(i, j)$ of elements only if a majority of those $\tau$'s expressing opinions disagrees with $\mu$ on $(i, j)$; (3) for every $1 \leq d \leq |\mu|$, the length-$d$ prefix of the output is a local Kemenization of the top $d$ elements in $\mu$.

Thus, if $\mu$ is an initial meta-search result, and we have some faith that the top, say, 100 elements of $\mu$ contain enough good pages, then we can build a locally Kemeny optimal aggregation of the projections of the $\tau$'s onto the top 100 elements in $\mu$.

The local Kemenization procedure is a simple inductive construction. Without loss of generality, let $\mu = (1, \ldots, |\mu|)$. Assume inductively that we have constructed $\pi$, a local Kemenization of the projection of the $\tau$'s onto the elements $1, \ldots, \ell - 1$. Insert element $x = \ell$ into the lowest-ranked "permissible" position in $\pi$: just below the lowest-ranked element $y$ in $\pi$ such that (a) no majority among the (original) $\tau$'s prefers $x$ to $y$ and (b) for all successors $z$ of $y$ in $\pi$ there is a majority that prefers $x$ to $z$. In other words, we try to insert $x$ at the end (bottom) of the list $\pi$; we bubble it up toward the top of the list as long as a majority of the $\tau$'s insists that we do.

A rigorous treatment of local Kemeny optimality and local Kemenization is given in the Appendix, where we also show that the local Kemenization of an aggregation is unique. On the strength of these results we suggest the following general approach to rank aggregation:

Given $\tau_1, \ldots, \tau_k$, use your favorite aggregation method to obtain a full list $\mu$. Output the (unique) local Kemenization of $\mu$ with respect to $\tau_1, \ldots, \tau_k$.
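The inductive construction translates almost directly into code. A minimal sketch, assuming each $\tau$ is given as a ranked list and that "majority" counts only the lists ranking both elements (when there is no strict majority, $\mu$'s order is kept); the function names are ours.

```python
def majority_prefers(x, y, ranks):
    """True iff a strict majority of lists ranking both x and y put x above y."""
    votes = [r[x] < r[y] for r in ranks if x in r and y in r]
    return bool(votes) and 2 * sum(votes) > len(votes)

def local_kemenize(mu, lists):
    """Local Kemenization of an initial aggregation mu w.r.t. partial lists."""
    ranks = [{x: r for r, x in enumerate(t)} for t in lists]
    pi = []
    for x in mu:               # insert mu's elements one at a time
        pos = len(pi)          # start at the bottom of the list ...
        while pos > 0 and majority_prefers(x, pi[pos - 1], ranks):
            pos -= 1           # ... and bubble up while a majority insists
        pi.insert(pos, x)
    return pi
```

Feeding in a Borda or Markov chain ordering as mu yields the general recipe suggested above: the output stays maximally consistent with mu while pushing Condorcet losers (spam, under the paper's model) to the bottom.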

Citations
Book
17 Aug 2012
TL;DR: This graduate-level textbook introduces fundamental concepts and methods in machine learning, provides the theoretical underpinnings of these algorithms, and illustrates key aspects of their application.
Abstract: This graduate-level textbook introduces fundamental concepts and methods in machine learning. It describes several important modern algorithms, provides the theoretical underpinnings of these algorithms, and illustrates key aspects for their application. The authors aim to present novel theoretical tools and concepts while giving concise proofs even for relatively advanced topics. Foundations of Machine Learning fills the need for a general textbook that also offers theoretical details and an emphasis on proofs. Certain topics that are often treated with insufficient attention are discussed in more detail here; for example, entire chapters are devoted to regression, multi-class classification, and ranking. The first three chapters lay the theoretical foundation for what follows, but each remaining chapter is mostly self-contained. The appendix offers a concise probability review, a short introduction to convex optimization, tools for concentration bounds, and several basic properties of matrices and norms used in the book. The book is intended for graduate students and researchers in machine learning, statistics, and related areas; it can be used either as a textbook or as a reference text for a research seminar.

2,511 citations

Proceedings ArticleDOI
10 May 2005
TL;DR: This work presents topic diversification, a novel method designed to balance and diversify personalized recommendation lists in order to reflect the user's complete spectrum of interests, and introduces the intra-list similarity metric to assess the topical diversity of recommendation lists.
Abstract: In this work we present topic diversification, a novel method designed to balance and diversify personalized recommendation lists in order to reflect the user's complete spectrum of interests. Though being detrimental to average accuracy, we show that our method improves user satisfaction with recommendation lists, in particular for lists generated using the common item-based collaborative filtering algorithm. Our work builds upon prior research on recommender systems, looking at properties of recommendation lists as entities in their own right rather than specifically focusing on the accuracy of individual recommendations. We introduce the intra-list similarity metric to assess the topical diversity of recommendation lists and the topic diversification approach for decreasing the intra-list similarity. We evaluate our method using book recommendation data, including offline analysis on 361,349 ratings and an online study involving more than 2,100 subjects.

1,813 citations


Cites background from "Rank aggregation methods for the We..."

  • ...Intuitively, this merged top-N list should reflect the highest quality ranking possible, also known as the “rank aggregation problem” [6]....


Proceedings ArticleDOI
07 May 2002
TL;DR: A set of PageRank vectors are proposed, biased using a set of representative topics, to capture more accurately the notion of importance with respect to a particular topic, and are shown to generate more accurate rankings than with a single, generic PageRank vector.
Abstract: In the original PageRank algorithm for improving the ranking of search-query results, a single PageRank vector is computed, using the link structure of the Web, to capture the relative "importance" of Web pages, independent of any particular search query. To yield more accurate search results, we propose computing a set of PageRank vectors, biased using a set of representative topics, to capture more accurately the notion of importance with respect to a particular topic. By using these (precomputed) biased PageRank vectors to generate query-specific importance scores for pages at query time, we show that we can generate more accurate rankings than with a single, generic PageRank vector. For ordinary keyword search queries, we compute the topic-sensitive PageRank scores for pages satisfying the query using the topic of the query keywords. For searches done in context (e.g., when the search query is performed by highlighting words in a Web page), we compute the topic-sensitive PageRank scores using the topic of the context in which the query appeared.

1,765 citations


Cites background or methods from "Rank aggregation methods for the Web"

  • ...For our experiments, we used 35 of the sample queries given in [9], which were in turn compiled from earlier papers....


  • ...See [9] for a discussion of various distance measures for ranked lists in the context of Web search results....


Book
03 Jul 2006
TL;DR: Any business seriously interested in improving its rankings in the major search engines can benefit from the clear examples, sample code, and list of resources provided.
Abstract: Why doesn't your home page appear on the first page of search results, even when you query your own name? How do other web pages always appear at the top? What creates these powerful rankings? And how? The first book ever about the science of web page rankings, Google's PageRank and Beyond supplies the answers to these and other questions and more. The book serves two very different audiences: the curious science reader and the technical computational reader. The chapters build in mathematical sophistication, so that the first five are accessible to the general academic reader. While other chapters are much more mathematical in nature, each one contains something for both audiences. For example, the authors include entertaining asides such as how search engines make money and how the Great Firewall of China influences research. The book includes an extensive background chapter designed to help readers learn more about the mathematics of search engines, and it contains several MATLAB codes and links to sample web data sets. The philosophy throughout is to encourage readers to experiment with the ideas and algorithms in the text. Any business seriously interested in improving its rankings in the major search engines can benefit from the clear examples, sample code, and list of resources provided. Features: many illustrative examples and entertaining asides; MATLAB code; accessible and informal style; complete and self-contained section for mathematics review.

1,548 citations

Journal ArticleDOI
TL;DR: It is shown that using linear combinations of these (precomputed) biased PageRank vectors to generate context-specific importance scores for pages at query time, can generate more accurate rankings than with a single, generic PageRank vector.
Abstract: The original PageRank algorithm for improving the ranking of search-query results computes a single vector, using the link structure of the Web, to capture the relative "importance" of Web pages, independent of any particular search query. To yield more accurate search results, we propose computing a set of PageRank vectors, biased using a set of representative topics, to capture more accurately the notion of importance with respect to a particular topic. For ordinary keyword search queries, we compute the topic-sensitive PageRank scores for pages satisfying the query using the topic of the query keywords. For searches done in context (e.g., when the search query is performed by highlighting words in a Web page), we compute the topic-sensitive PageRank scores using the topic of the context in which the query appeared. By using linear combinations of these (precomputed) biased PageRank vectors to generate context-specific importance scores for pages at query time, we show that we can generate more accurate rankings than with a single, generic PageRank vector. We describe techniques for efficiently implementing a large-scale search system based on the topic-sensitive PageRank scheme.

1,161 citations


Cites methods from "Rank aggregation methods for the Web"

  • ...For our experiments, we used 35 of the sample queries given in [12], which were in turn compiled from earlier papers....


References
Journal ArticleDOI
01 Apr 1998
TL;DR: This paper provides an in-depth description of Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext and looks at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.
Abstract: In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/. To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advance in technology and web proliferation, creating a web search engine today is very different from three years ago. This paper provides an in-depth description of our large-scale web search engine -- the first such detailed public description we know of to date. Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses this question of how to build a practical large-scale system which can exploit the additional information present in hypertext. Also we look at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.

14,696 citations


"Rank aggregation methods for the We..." refers background or methods in this paper


  • ...A few years ago, query term frequency was the single main heuristic in ranking Web pages; since the influential work of Kleinberg [16] and Brin and Page [7], link analysis has come to be identified as a very powerful technique in ranking Web pages and other hyperlinked documents....


  • ...Yet other ranking functions might be the ones implied by PageRank [7] and Clever [16, 8]....


  • ...On the Web, such an approach has already proved tremendously successful [16, 8, 7]....


  • ...the context of Web searching, the HITS algorithm of Kleinberg [16] and the PageRank algorithm of Brin and Page [7] are motivated by similar considerations....


Journal Article
TL;DR: Google as discussed by the authors is a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext and is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems.

13,327 citations

Journal ArticleDOI
Jon Kleinberg1
TL;DR: This work proposes and tests an algorithmic formulation of the notion of authority, based on the relationship between a set of relevant authoritative pages and the set of "hub pages" that join them together in the link structure, and has connections to the eigenvectors of certain matrices associated with the link graph.
Abstract: The network structure of a hyperlinked environment can be a rich source of information about the content of the environment, provided we have effective means for understanding it. We develop a set of algorithmic tools for extracting information from the link structures of such environments, and report on experiments that demonstrate their effectiveness in a variety of contexts on the World Wide Web. The central issue we address within our framework is the distillation of broad search topics, through the discovery of "authoritative" information sources on such topics. We propose and test an algorithmic formulation of the notion of authority, based on the relationship between a set of relevant authoritative pages and the set of "hub pages" that join them together in the link structure. Our formulation has connections to the eigenvectors of certain matrices associated with the link graph; these connections in turn motivate additional heuristics for link-based analysis.

8,328 citations

Book
Gerard Salton1
03 Jan 1989

3,571 citations


Additional excerpts

  • ...(see [20])....


Book
01 Jan 1803

856 citations

Frequently Asked Questions (8)
Q1. What are the contributions in "Rank aggregation methods for the Web"?

The authors consider the problem of combining ranking results from various sources. The authors develop a set of techniques for the rank aggregation problem and compare their performance to that of well-known methods. 

Further work involves trying to obtain a qualitative understanding of why the Markov chain methods perform very well. Also, it will be interesting to measure the efficacy of their methods on a document base with several competing ranking functions.

A strongly connected component of a graph is a set of nodes such that for every pair of nodes in the component, there is a directed path from one to the other. 

Using the technique of local Kemenization, it is easy to take any rank aggregation method and tweak its output to make it satisfy the extended Condorcet principle. 

The idea is simple: given a query, obtain the top (say) 100 results from many search engines, apply the rank aggregation function with the universe being the union of pages returned by the search engines, and return the top (say) 100 results of the aggregation. 

As the authors observed earlier, the problem of constructing a good meta-search engine is tantamount to obtaining a good rank aggregation function for partial and top d lists. 

Their first motivation for studying rank aggregation in the context of the Web is to provide users a certain degree of robustness of search, in the face of various shortcomings and biases, malicious or otherwise, of individual search engines.

Several other heuristics have been added, including anchor-text analysis [8], page structure (headers, etc.) analysis, the use of keyword listings and the url text itself, etc.