Ranking Document Clusters Using Markov Random Fields
Fiana Raiber
fiana@tx.technion.ac.il
Oren Kurland
kurland@ie.technion.ac.il
Faculty of Industrial Engineering and Management, Technion
Haifa 32000, Israel
ABSTRACT
An important challenge in cluster-based document retrieval is ranking document clusters by their relevance to the query. We present a novel cluster ranking approach that utilizes Markov Random Fields (MRFs). MRFs enable the integration of various types of cluster-relevance evidence; e.g., the query-similarity values of the cluster's documents and query-independent measures of the cluster. We use our method to re-rank an initially retrieved document list by ranking clusters that are created from the documents most highly ranked in the list. The resultant retrieval effectiveness is substantially better than that of the initial list for several lists that are produced by effective retrieval methods. Furthermore, our cluster ranking approach significantly outperforms state-of-the-art cluster ranking methods. We also show that our method can be used to improve the performance of (state-of-the-art) results-diversification methods.
Categories and Subject Descriptors: H.3.3 [Information Search
and Retrieval]: Retrieval models
General Terms: Algorithms, Experimentation
Keywords: ad hoc retrieval, cluster ranking, query-specific clus-
ters, markov random fields
1. INTRODUCTION
The cluster hypothesis [33] gave rise to a large body of work on using query-specific document clusters [35] for improving retrieval effectiveness. These clusters are created from documents that are the most highly ranked by an initial search performed in response to the query.

For many queries there are query-specific clusters that contain a very high percentage of relevant documents [8, 32, 25, 14]. Furthermore, positioning the constituent documents of these clusters at the top of the result list yields highly effective retrieval performance; specifically, much better than that of state-of-the-art retrieval methods that rank documents directly [8, 32, 25, 14, 10].
As a result of these findings, there has been much work on
ranking query-specific clusters by their presumed relevance
to the query (e.g., [35, 22, 24, 25, 26, 14, 15]). Most previous approaches to cluster ranking compare a representation of the cluster with that of the query. A few methods integrate additional types of information such as inter-cluster and cluster-document similarities [18, 14, 15]. However, there are no reports of fundamental cluster ranking frameworks that enable the effective integration of the various types of information that might attest to the relevance of a cluster to a query.
We present a novel cluster ranking approach that uses Markov Random Fields. The approach is based on integrating various types of cluster-relevance evidence in a principled manner. These include the query-similarity values of the cluster's documents, inter-document similarities within the cluster, and measures of query-independent properties of the cluster, or more precisely, of its documents.

A large array of experiments conducted with a variety of TREC datasets demonstrates the high effectiveness of using our cluster ranking method to re-rank an initially retrieved document list. The resultant retrieval performance is substantially better than that of the initial ranking for several effective rankings. Furthermore, our method significantly outperforms state-of-the-art cluster ranking methods. Although the method ranks clusters of similar documents, we show that using it to induce document ranking can help to substantially improve the effectiveness of (state-of-the-art) retrieval methods that diversify search results.
2. RETRIEVAL FRAMEWORK
Suppose that some search algorithm was employed over a corpus of documents in response to a query. Let D_init be the list of the initially highest ranked documents. Our goal is to re-rank D_init so as to improve retrieval effectiveness.

To that end, we employ a standard cluster-based retrieval paradigm [34, 24, 18, 26, 15]. We first apply some clustering method upon the documents in D_init; Cl(D_init) is the set of resultant clusters. Then, the clusters in Cl(D_init) are ranked by their presumed relevance to the query. Finally, the clusters' ranking is transformed to a ranking of the documents in D_init by replacing each cluster with its constituent documents and omitting repeats in case the clusters overlap. Documents in a cluster are ordered by their query similarity.

The motivation for employing the cluster-based approach just described follows the cluster hypothesis [33]. That is, letting similar documents provide relevance-status support to each other by virtue of being members of the same clusters. The challenge that we address here is devising a (novel) cluster ranking method; i.e., we tackle the second step of the cluster-based retrieval paradigm.
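The following is a minimal sketch of this re-ranking paradigm, assuming placeholder callables for the clustering method, the cluster ranking method, and the document-query similarity measure (these names do not appear in the paper):

```python
# Sketch of the cluster-based re-ranking paradigm described above.
# cluster(), rank_clusters(), and sim() are placeholders for the clustering
# method, the cluster ranking method (e.g., ClustMRF), and the document-query
# similarity measure, respectively.

def rerank(d_init, query, cluster, rank_clusters, sim):
    """Re-rank the initially retrieved list d_init via cluster ranking."""
    clusters = cluster(d_init)                        # Cl(D_init)
    ranked_clusters = rank_clusters(clusters, query)  # by presumed relevance to the query
    ranking, seen = [], set()
    for c in ranked_clusters:
        # Replace each cluster with its documents, most query-similar first,
        # omitting documents already emitted by a higher-ranked cluster.
        for doc in sorted(c, key=lambda d: sim(query, d), reverse=True):
            if doc not in seen:
                seen.add(doc)
                ranking.append(doc)
    return ranking
```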

Figure 1: The three types of cliques considered for graph G. G is composed of a query node (Q) and three (for the sake of the example) nodes (d_1, d_2, and d_3) that correspond to the documents in cluster C. (i) l_QD contains the query and a single document from C; (ii) l_QC contains all nodes in G; and, (iii) l_C contains only the documents in C.
Formally, let C and Q denote random variables that take as values document clusters and queries, respectively. The cluster ranking task amounts to estimating the probability that a cluster is relevant to a query, p(C|Q):

    p(C|Q) = p(C, Q) / p(Q)  =^rank  p(C, Q).   (1)

The rank equivalence holds as clusters are ranked with respect to a fixed query.
To estimate p(C, Q), we use Markov Random Fields (MRFs).
As we discuss below, MRFs are a convenient framework for
integrating various types of cluster-relevance evidence.
2.1 Using MRFs to rank document clusters
An MRF is defined over a graph G. Nodes represent random variables and edges represent dependencies between these variables. Two nodes that are not connected with an edge correspond to random variables that are independent of each other given all other random variables. The set of nodes in the graph we construct is composed of a node representing the query and nodes representing the cluster's constituent documents. The joint probability over G's nodes, p(C, Q), can be expressed as follows:

    p(C, Q) = (1/Z) ∏_{l ∈ L(G)} ψ_l(l);   (2)

L(G) is the set of cliques in G and l is a clique; ψ_l(l) is a potential (i.e., positive function) defined over l; Z = ∑_{C,Q} ∏_{l ∈ L(G)} ψ_l(l) is the normalization factor that serves to ensure that p(C, Q) is a probability distribution. The normalizer need not be computed here as we rank clusters with respect to a fixed query.

A common instantiation of potential functions is [28]: ψ_l(l) def= exp(λ_l f_l(l)), where f_l(l) is a feature function defined over the clique l and λ_l is the weight associated with this function. Accordingly, omitting the normalizer from Equation 2, applying the rank-preserving log transformation, and substituting the potentials with the corresponding feature functions results in our ClustMRF cluster ranking method:

    p(C|Q)  =^rank  ∑_{l ∈ L(G)} λ_l f_l(l).   (3)

This is a generic linear (in feature functions) cluster ranking function that depends on the graph G. To instantiate a specific ranking method, we need to (i) determine G's structure, specifically, its clique set L(G); and, (ii) associate feature functions with the cliques. We next address these two tasks.
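To make Equation 3 concrete, here is a minimal sketch of the resulting linear scoring rule; the dictionary-based representation of feature values and weights is an illustrative assumption, not notation from the paper:

```python
# Minimal sketch of the linear ClustMRF scoring rule in Equation 3:
# score(C, Q) = sum over cliques l of lambda_l * f_l(l).

def clustmrf_score(feature_values, weights):
    """feature_values: {feature_name: f_l(l)}; weights: {feature_name: lambda_l}."""
    return sum(weights[name] * value for name, value in feature_values.items())

# Hypothetical usage with two feature functions:
# score = clustmrf_score({"geo-qsim": -3.2, "stdv-qsim": -1.1},
#                        {"geo-qsim": 0.8, "stdv-qsim": 0.4})
```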
2.1.1 Cliques and feature functions
We consider three types of cliques in the graph G. These are depicted in Figure 1. In what follows we write d ∈ C to indicate that document d is a member of cluster C.

The first clique (type), l_QD, contains the query and a single document in the cluster. This clique serves for making inferences based on the query similarities of the cluster's constituent documents when considered independently. The second clique, l_QC, contains all nodes of the graph; that is, the query Q and all of C's constituent documents. This clique is used for inducing information from the relations between the query-similarity values of the cluster's constituent documents. The third clique, l_C, contains only the cluster's constituent documents. It is used to induce information based on query-independent properties of the cluster's documents.

In what follows we describe the feature functions defined over the cliques. In some cases a few feature functions are defined for the same clique, and these are used in the summation in Equation 3. Note that the sum of feature functions is also a feature function. The weights associated with the feature functions are set using a train set of queries. (Details are provided in Section 4.1.)
The l_QD clique. High query similarity exhibited by C's constituent documents can potentially attest to C's relevance [26]. Accordingly, let d (∈ C) be the document in l_QD. We define f_{geo-qsim;l_QD}(l_QD) def= log sim(Q, d)^{1/|C|}, where |C| is the number of documents in C, and sim(·, ·) is some inter-text similarity measure, details of which are provided in Section 4.1. Using this feature function in Equation 3 for all the l_QD cliques of G amounts to using the geometric mean of the query-similarity values of C's constituent documents. All feature functions that we consider use logs so as to have a conjunction semantics for the integration of their assigned values when using Equation 3.¹

¹Before applying the log function we employ add-ε (= 10^{-10}) smoothing.
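A minimal sketch of the geo-qsim feature, assuming a sim(query, doc) callable and the add-ε smoothing mentioned in footnote 1:

```python
import math

EPSILON = 1e-10  # add-epsilon smoothing applied before taking logs (footnote 1)

def geo_qsim(query, cluster, sim):
    """Log of the geometric mean of the query-similarity values of the cluster's documents."""
    return sum(math.log(sim(query, d) + EPSILON) for d in cluster) / len(cluster)
```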
The l_QC clique. Using the l_QD clique from above results in considering the query-similarity values of the cluster's documents independently of each other. In contrast, the l_QC clique provides grounds for utilizing the relations between these similarity values. Specifically, we use the log of the minimal, maximal, and standard deviation² of the {sim(Q, d)}_{d ∈ C} values as feature functions for l_QC, denoted min-qsim, max-qsim, and stdv-qsim, respectively.
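A corresponding sketch of the l_QC feature functions, under the same assumptions; smoothing the standard deviation before taking the log is an added guard, not something specified in the paper:

```python
import math
import statistics

EPSILON = 1e-10  # as in the geo-qsim sketch above

def qc_features(query, cluster, sim):
    """min-qsim, max-qsim, and stdv-qsim feature functions for the l_QC clique."""
    sims = [sim(query, d) + EPSILON for d in cluster]
    return {
        "min-qsim": math.log(min(sims)),
        "max-qsim": math.log(max(sims)),
        "stdv-qsim": math.log(statistics.pstdev(sims) + EPSILON),
    }
```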
The l_C clique. Heretofore, the l_QD and l_QC cliques served for inducing information from the query-similarity values of C's documents. We now consider query-independent properties of C that can potentially attest to its relevance. Doing so amounts to defining feature functions over the l_C clique that contains C's documents but not the query. All the feature functions that we define for l_C are constructed as follows. We first define a query-independent document measure, P, and apply it to document d (∈ C), yielding the value P(d). Then, we use log A({P(d)}_{d ∈ C}) where A is an aggregator function: minimum, maximum, and geometric mean. The resultant feature functions are referred to as min-P, max-P, and geo-P, respectively. We next describe the document measures that serve as the basis for the feature functions.
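A minimal sketch of this min-P/max-P/geo-P construction, assuming the query-independent measure is supplied as a callable:

```python
import math

EPSILON = 1e-10  # as in the earlier sketches

def lc_features(cluster, measure, name):
    """min-P, max-P, and geo-P feature functions for a query-independent measure P."""
    values = [measure(d) + EPSILON for d in cluster]
    geo_mean = math.exp(sum(math.log(v) for v in values) / len(values))
    return {
        f"min-{name}": math.log(min(values)),
        f"max-{name}": math.log(max(values)),
        f"geo-{name}": math.log(geo_mean),
    }
```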
The cluster hypothesis [33] implies that relevant documents should be similar to each other. Accordingly, we measure for document d in C its similarity with all documents in C: P_dsim(d) def= (1/|C|) ∑_{d_i ∈ C} sim(d, d_i).
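A minimal sketch of P_dsim, again assuming a sim(·, ·) callable:

```python
def p_dsim(d, cluster, sim):
    """Mean similarity of document d to all documents in its cluster (P_dsim)."""
    return sum(sim(d, d_i) for d_i in cluster) / len(cluster)
```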
The next few query-independent document measures are based on the following premise. The higher the breadth of content in a document, the higher the probability it is relevant to some query. Thus, a cluster containing documents with broad content should be assigned a relatively high probability of being relevant to some query.

High entropy of the term distribution in a document is a potential indicator for content breadth [17, 3]. This is because the distribution is "spread" over many terms rather than focused on a few. Accordingly, we define P_entropy(d) def= −∑_{w ∈ d} p(w|d) log p(w|d), where w is a term and p(w|d) is the probability assigned to w by an unsmoothed unigram language model (i.e., maximum likelihood estimate) induced from d.
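A sketch of P_entropy over a document given as a list of terms:

```python
import math
from collections import Counter

def p_entropy(terms):
    """Entropy of the maximum-likelihood unigram distribution of a document's terms."""
    counts = Counter(terms)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())
```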
Inspired by work on Web spam classification [9], we use the inverse compression ratio of document d, P_icompress(d), as an additional measure. (Gzip is used for compression.) High compression ratio presumably attests to reduced content breadth [9].
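A sketch of P_icompress; the exact normalization is not spelled out here, so compressed size divided by original size is one plausible reading:

```python
import gzip

def p_icompress(text):
    """Inverse compression ratio of a document: gzip-compressed size over original size."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(gzip.compress(raw)) / len(raw)
```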
Two additional content-breadth measures that were proposed in work on Web retrieval [3] are the ratio between the number of stopwords and non-stopwords in the document, P_sw1(d); and the fraction of stopwords in a stopword list that appear in the document, P_sw2(d). We use INQUERY's stopword list [2]. A document containing many stopwords is presumably of richer language (and hence content) than a document that does not contain many of these; e.g., a document containing a table composed only of keywords [3].
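A sketch of the two stopword-based measures, assuming the stopword list is given as a set; the guard against division by zero is an added assumption:

```python
def p_sw1(terms, stopwords):
    """Ratio of stopwords to non-stopwords in the document (P_sw1)."""
    n_stop = sum(1 for t in terms if t in stopwords)
    n_non = len(terms) - n_stop
    return n_stop / max(n_non, 1)

def p_sw2(terms, stopwords):
    """Fraction of the stopword list that appears in the document (P_sw2)."""
    return len(stopwords & set(terms)) / len(stopwords)
```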
For some of the Web collections used for evaluation in Section 4, we also use the PageRank score [4] of the document, P_pr(d), and the confidence level that the document is not spam, P_spam(d). The details of the spam classifier are provided in Section 4.1.
We note that using the feature functions that result from applying the geometric mean aggregator upon the query-independent document measures just described, except for dsim, could have been described in an alternative way. That is, using log P(d)^{1/|C|} as a feature function over a clique containing a single document; then, using these feature functions in Equation 3 amounts to using the geometric mean.³

²It was recently argued that high variance of the query-similarity values of the cluster's documents might be an indicator for the cluster's relevance, as it presumably attests to a low level of "query drift" [19].

³Similarly, we could have used the geometric mean of the query-similarity values of the cluster's constituent documents as a feature function defined over the l_QC clique rather than constructing it using the l_QD cliques as we did above.
3. RELATED WORK
The work most related to ours is that on devising cluster ranking methods. The standard approach is based on measuring the similarity between a cluster representation and that of the query [7, 34, 35, 16, 24, 25, 26]. Specifically, a geometric-mean-based cluster representation was shown to be highly effective [26, 30, 15]. Indeed, ranking clusters by the geometric mean of the query-similarity values of their constituent documents is a state-of-the-art cluster ranking approach [15]. This approach arises as an integration of feature functions used in ClustMRF, and is shown in Section 4 to substantially underperform ClustMRF.

Clusters were also ranked by the highest query similarity exhibited by their constituent documents [22, 31] and by the variance of these similarities [25, 19]. ClustMRF incorporates these methods as feature functions and is shown to outperform each.

Some cluster ranking methods use inter-cluster and cluster-document similarities [14, 15]. While ClustMRF does not utilize such similarities, it is shown to substantially outperform one such state-of-the-art method [15].

A different use of clusters in past work on cluster-based retrieval is for "smoothing" (enriching) the representation of documents [20, 16, 24, 13]. ClustMRF is shown to substantially outperform one such state-of-the-art method [13].

To the best of our knowledge, our work is the first to use MRFs for cluster ranking. In the context of retrieval tasks, MRFs were first introduced for ranking documents directly [28]. We show that using ClustMRF to produce a document ranking substantially outperforms this retrieval approach, as well as the one that augments the standard MRF retrieval model with query-independent document measures [3]. MRFs were also used, for example, for query expansion, passage-based document retrieval, and weighted concept expansion [27].
4. EVALUATION
4.1 Experimental setup
corpus          # of docs     data                     queries
AP              242,918       Disks 1-3                51-150
ROBUST          528,155       Disks 4-5 (-CR)          301-450, 600-700
WT10G           1,692,096     WT10g                    451-550
GOV2            25,205,179    GOV2                     701-850
ClueA, ClueAF   503,903,810   ClueWeb09 (Category A)   1-150
ClueB, ClueBF   50,220,423    ClueWeb09 (Category B)   1-150

Table 1: Datasets used for experiments.
The TREC datasets specified in Table 1 were used for experiments. AP and ROBUST are small collections, composed mostly of news articles. WT10G and GOV2 are Web

collections; the latter is a crawl of the .gov domain. For the ClueWeb Web collection, both the English part of Category A (ClueA) and the Category B subset (ClueB) were used. ClueAF and ClueBF are two additional experimental settings created from ClueWeb following previous work [6]. Specifically, documents assigned by Waterloo's spam classifier [6] with a score below 70 and 50 for ClueA and ClueB, respectively, were filtered out from the initial corpus ranking described below. The score indicates the percentage of all documents in ClueWeb Category A that are presumably "spammier" than the document at hand. The ranking of the residual corpus was used to create the document list upon which the various methods operate. Waterloo's spam score is also used for the P_spam(·) measure that was described in Section 2.1. The P_spam(·) and P_pr(·) (PageRank score) measures are used only for the ClueWeb-based settings as these information types are not available for the other settings.

The titles of TREC topics served as queries. All data was stemmed using the Krovetz stemmer. Stopwords on the INQUERY list were removed from queries but not from documents. The Indri toolkit (www.lemurproject.org/indri) was used for experiments.
Initial retrieval and clustering. As described in Section 2, we use the ClustMRF cluster ranking method to re-rank an initially retrieved document list D_init. Recall that after ClustMRF ranks the clusters created from D_init, these are "replaced" by their constituent documents while omitting repeats. Documents within a cluster are ranked by their query similarity, the measure of which is detailed below. This cluster-based re-ranking approach is employed by all the reference comparison methods that we use and that rely on cluster ranking. Furthermore, ClustMRF and all reference comparison approaches re-rank a list D_init that is composed of the 50 documents that are the most highly ranked by some retrieval method specified below. D_init is relatively short following recommendations in previous work on cluster-based re-ranking [18, 25, 26, 13]. In Section 4.2.7 we study the effect of varying the list size on the performance of ClustMRF and the reference comparisons.
We let all methods re-rank three different initial lists D_init. The first, denoted MRF, is used unless otherwise specified. This list contains the documents in the corpus that are the most highly ranked in response to the query when using the state-of-the-art Markov Random Field approach with the sequential dependence model (SDM) [28]. The free parameters that control the use of term proximity information in SDM, λ_T, λ_O, and λ_U, are set to 0.85, 0.1, and 0.05, respectively, following previous recommendations [28]. We also use MRF's SDM with its free parameters set using cross validation as one of the re-ranking reference comparisons. (Details provided below.) All methods operating on the MRF initial list use the exponent of the document score assigned by SDM (which is a rank-equivalent estimate to that of log p(Q, d)) as sim_MRF(Q, d), the document-query similarity measure. This measure was used to induce the initial ranking using which D_init was created. More generally, for a fair performance comparison we maintain in all the experiments the invariant that the scoring function used to create an initially retrieved list is rank equivalent to the document-query similarity measure used in methods operating on the list. Furthermore, the document-query similarity measure is used in all methods that are based on cluster ranking (including ClustMRF) to order documents within the clusters.
The second initial list used for re-ranking, DocMRF (discussed in Section 4.2.4), is created by enriching MRF's SDM with query-independent document measures [3].

The third initial list, LM, is addressed in Section 4.2.5. The list is created using unigram language models. In contrast, the MRF and DocMRF lists were created using retrieval methods that use term proximity information. Let p_z^{Dir[µ]}(·) be the Dirichlet-smoothed unigram language model induced from text z; µ is the smoothing parameter. The LM similarity between texts x and y is sim_LM(x, y) def= exp(−CE(p_x^{Dir[0]}(·) || p_y^{Dir[µ]}(·))) [37, 17], where CE is the cross entropy measure; µ is set to 1000.⁴ Accordingly, the LM initial list is created by using sim_LM(Q, d) to rank the entire corpus.⁵ This measure serves as the document-query similarity measure for all methods operating over the LM list, and for the inter-document similarity measure used by the dsim feature function.
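A minimal sketch of sim_LM, assuming term-count dictionaries for the two texts and a collection_prob(term) function supplying the collection language model used for Dirichlet smoothing:

```python
import math

def sim_lm(x_terms, y_terms, collection_prob, mu=1000.0):
    """sim_LM(x, y) = exp(-CE(p_x^{Dir[0]} || p_y^{Dir[mu]})), sketched over term counts."""
    x_total = sum(x_terms.values())
    y_total = sum(y_terms.values())
    ce = 0.0
    for w, count in x_terms.items():
        p_x = count / x_total  # unsmoothed (Dir[0]) model of x
        p_y = (y_terms.get(w, 0) + mu * collection_prob(w)) / (y_total + mu)  # Dirichlet-smoothed model of y
        ce += -p_x * math.log(p_y)
    return math.exp(-ce)
```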
Unless otherwise stated, to cluster any of the three initial lists D_init, we use a simple nearest-neighbor clustering approach [18, 25, 14, 26, 13, 15]. For each document d (∈ D_init), a cluster is created from d and the k−1 documents d_i in D_init (d_i ≠ d) with the highest sim_LM(d, d_i); k is set to a value in {5, 10, 20} using cross validation as described below. Using such small overlapping clusters (all of which contain k documents) was shown to be highly effective for cluster-based document retrieval [18, 25, 14, 26, 13, 15]. In Section 4.2.6 we also study the performance of ClustMRF when using hierarchical agglomerative clustering.
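A minimal sketch of this nearest-neighbor clustering, assuming sim_LM is available as a callable:

```python
def nearest_neighbor_clusters(d_init, sim_lm, k):
    """One overlapping cluster per document: the document plus its k-1 nearest neighbors."""
    clusters = []
    for d in d_init:
        neighbors = sorted((x for x in d_init if x != d),
                           key=lambda x: sim_lm(d, x), reverse=True)[:k - 1]
        clusters.append([d] + neighbors)
    return clusters
```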
Evaluation metrics and free parameters. We use MAP (computed at cutoff 50, the size of the list D_init that is re-ranked), the precision of the top 5 documents (p@5), and their NDCG (NDCG@5) as evaluation measures.⁶ The free parameters of our ClustMRF method, as well as those of all reference comparison methods, are set using 10-fold cross validation performed over the queries in an experimental setting. Query IDs are the basis for creating the folds. The two-tailed paired t-test with p ≤ 0.05 was used for testing statistical significance of performance differences.

For our ClustMRF method, the free-parameter values are set in two steps. First, SVM^rank [12] is used to learn the values of the λ_l weights associated with the feature functions. The NDCG@k of the k constituent documents of a cluster serves as the cluster score used for ranking clusters in the learning phase.⁷ (Recall from above that documents in a cluster are ordered based on their query similarity.)
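A minimal sketch of preparing such training data in SVM^rank's input format; the helper names and the grouping of clusters per query are illustrative assumptions:

```python
def build_svmrank_training_data(queries, clusters_per_query, feature_fn, ndcg_at_k):
    """Each cluster becomes one training example whose target is the NDCG@k of its
    k constituent documents (higher target = cluster should be ranked higher)."""
    lines = []
    for qid, query in enumerate(queries, start=1):
        for cluster in clusters_per_query[query]:
            target = ndcg_at_k(cluster, query)      # cluster "relevance" label
            features = feature_fn(cluster, query)   # {feature_name: value}
            feats = " ".join(f"{i}:{v}" for i, (_, v) in
                             enumerate(sorted(features.items()), start=1))
            lines.append(f"{target} qid:{qid} {feats}")
    return lines
```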
⁴The MRF SDM used above also uses Dirichlet-smoothed unigram language models with µ = 1000.

⁵Queries for which there was not a single relevant document in the MRF or LM initial lists were removed from the evaluation. For the ClueWeb settings, the same query set was used for ClueX and ClueXF.

⁶We note that statAP, rather than AP, was the official TREC evaluation metric in 2009 for ClueWeb with queries 1-50. For consistency with the other queries for ClueWeb, and following previous work [3], we use AP for all ClueWeb queries by treating prel files as qrel files. We hasten to point out that evaluation using statAP for the ClueWeb collections with queries 1-50 yielded relative performance patterns that are highly similar to those attained when using AP.

⁷Using MAP@k as the cluster score resulted in slightly less effective performance. We also note that learning-to-rank methods [23] other than SVM^rank, which proved to result in highly effective performance as shown below, can also be used for setting the values of the λ_l weights.

                   Init    TunedMRF   ClustMRF
AP       MAP       10.1     9.9        10.8
         p@5       50.7    48.7        53.0
         NDCG@5    50.6    49.4        54.4^t
ROBUST   MAP       19.9    20.0        21.0^it
         p@5       51.0    51.0        52.4
         NDCG@5    52.5    52.7        54.7
WT10G    MAP       15.8    15.4        18.0^it
         p@5       37.5    36.9        44.9^it
         NDCG@5    37.2    35.3^i      42.8^it
GOV2     MAP       12.7    12.7        14.2^it
         p@5       59.3    60.8        70.1^it
         NDCG@5    48.6    49.5        56.2^it
ClueA    MAP        4.5     4.9^i       6.3^it
         p@5       19.1    21.1        44.6^it
         NDCG@5    12.6    15.6^i      29.4^it
ClueAF   MAP        8.6     8.7         8.9
         p@5       46.3    47.8        50.2
         NDCG@5    32.4    33.1        33.9
ClueB    MAP       12.5    13.5^i      16.1^it
         p@5       33.1    35.5        48.7^it
         NDCG@5    24.4    27.0        37.4^it
ClueBF   MAP       15.8    16.3^i      17.0
         p@5       44.8    46.8        48.5
         NDCG@5    33.2    34.3        36.9

Table 2: The performance of ClustMRF and a tuned MRF (TunedMRF) when re-ranking the MRF initial list (Init). Boldface in the original marks the best result in a row. 'i' and 't' mark statistically significant differences with Init and TunedMRF, respectively.
A ranking of documents in D_init is created from the cluster ranking, which is performed for each cluster size k (∈ {5, 10, 20}), using the approach described above; k is then also set using cross validation by optimizing the MAP performance of the resulting document ranking. The train/test splits for the first and second steps are the same; i.e., the same train set used for learning the λ_l's is the one used for setting the cluster size. As is the case for ClustMRF, the final document ranking induced by any reference comparison method is based on using cross validation to set free-parameter values; and, MAP serves as the optimization criterion in the training (learning) phase.

Finally, we note that the main computational overhead, on top of the initial ranking, incurred by using ClustMRF is the clustering. That is, the feature functions used are either query-independent, and therefore can be computed offline; or use mainly document-query similarity values that have already been computed to create the initial ranking. Clustering of a few dozen documents can be computed efficiently; e.g., based on document snippets.
4.2 Experimental results
4.2.1 Main result
Table 2 presents our main result; namely, the performance of ClustMRF when used to re-rank the MRF initial list. Recall that the initial ranking was induced using MRF's SDM with free-parameter values set following previous recommendations [28]. Thus, we also present for reference the re-ranking performance of using MRF's SDM with its three free parameters set using cross validation, as is the case for the free parameters of ClustMRF; TunedMRF denotes this method.
                   ClustMRF   stdv-qsim   max-sw2   geo-qsim   min-sw2
AP       MAP        10.8       9.4         9.7       10.6       9.6
         p@5        53.0      43.7^c      44.6^c     50.9      49.1
         NDCG@5     54.4      45.0^c      45.8^c     52.0      50.4
ROBUST   MAP        21.0      19.0^c      17.7^c     20.6      16.8^c
         p@5        52.4      50.7        46.9^c     50.4      44.7^c
         NDCG@5     54.7      52.4        49.1^c     52.4      45.9^c
WT10G    MAP        18.0      15.4^c      12.2^c     16.3^c    14.2^c
         p@5        44.9      38.4^c      31.7^c     39.3^c    33.9^c
         NDCG@5     42.8      37.8^c      28.6^c     39.0^c    32.4^c
GOV2     MAP        14.2      12.7^c      12.9^c     13.2^c    14.2
         p@5        70.1      59.3^c      62.3^c     58.0^c    66.3
         NDCG@5     56.2      48.2^c      48.8^c     46.6^c    52.3

                   ClustMRF   max-sw2     max-sw1   max-qsim   geo-qsim
ClueA    MAP         6.3       5.4^c       5.3^c      4.5^c     4.8^c
         p@5        44.6      28.7^c      29.3^c     18.7^c    20.9^c
         NDCG@5     29.4      20.3^c      20.5^c     12.4^c    14.0^c
ClueAF   MAP         8.9       8.6         7.8^c      8.3       8.6
         p@5        50.2      47.2        40.4^c     49.3      48.7
         NDCG@5     33.9      32.5        28.9^c     34.3      33.9
ClueB    MAP        16.1      14.2^c      15.4       12.8^c    12.9^c
         p@5        48.7      41.9^c      42.9^c     33.9^c    34.2^c
         NDCG@5     37.4      30.1^c      32.5^c     25.5^c    25.6^c
ClueBF   MAP        17.0      16.3        15.7^c     14.8^c    15.9
         p@5        48.5      45.0        42.3^c     42.9^c    43.2
         NDCG@5     36.9      35.5        32.8       32.8      33.6

Table 3: Using each of ClustMRF's top-4 feature functions by itself for ranking the clusters so as to re-rank the MRF initial list. Boldface in the original marks the best performance per row. 'c' marks a statistically significant difference with ClustMRF.
We found that using exhaustive search for finding SDM's optimal parameter values in the training phase yields better performance (on the test set) than using SVM^rank [12] and SVM^map [36]. Specifically, λ_T, λ_O, and λ_U were set to values in {0, 0.05, ..., 1} with λ_T + λ_O + λ_U = 1.

We first see in Table 2 that while TunedMRF outperforms the initial MRF ranking in most relevant comparisons (experimental setting × evaluation measure), there are cases (e.g., for AP and WT10G) for which the reverse holds. The latter finding implies that optimal free-parameter values of MRF's SDM do not necessarily generalize across queries. More importantly, we see in Table 2 that ClustMRF outperforms both the initial ranking and TunedMRF in all relevant comparisons. Many of the improvements are substantial and statistically significant. These findings attest to the high effectiveness of using ClustMRF for re-ranking.
4.2.2 Analysis of feature functions
We now turn to analyze the relative importance attributed to the different feature functions used in ClustMRF; i.e., the λ_l weights assigned to these functions in the training phase by SVM^rank. We first average, per experimental setting and cluster size, the weights assigned to a feature function using the different training folds. Then, the feature function is assigned a score that is the reciprocal rank of its corresponding (average) weight. Finally, the feature functions are ordered by averaging their scores across experimental settings and cluster sizes. Two feature functions, pr and spam, are only used for the ClueWeb-based settings. Hence, we perform the analysis separately for the ClueWeb and non-ClueWeb (AP, ROBUST, WT10G, and GOV2) settings.
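A minimal sketch of this reciprocal-rank aggregation, assuming the per-setting average weights are supplied as dictionaries:

```python
from collections import defaultdict

def feature_importance(avg_weights_per_setting):
    """Order feature functions by the average reciprocal rank of their (averaged) weights.
    avg_weights_per_setting: list of {feature_name: average weight} dicts, one per
    experimental setting and cluster size."""
    scores = defaultdict(list)
    for weights in avg_weights_per_setting:
        ranked = sorted(weights, key=lambda f: weights[f], reverse=True)
        for rank, feature in enumerate(ranked, start=1):
            scores[feature].append(1.0 / rank)
    return sorted(scores, key=lambda f: sum(scores[f]) / len(scores[f]), reverse=True)
```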

Frequently Asked Questions (11)
Q1. What are the contributions mentioned in the paper "Ranking document clusters using markov random fields" ?

The authors present a novel cluster ranking approach that utilizes Markov Random Fields (MRFs). The authors use their method to re-rank an initially retrieved document list by ranking clusters that are created from the documents most highly ranked in the list. Furthermore, their cluster ranking approach significantly outperforms state-of-the-art cluster ranking methods. The authors also show that their method can be used to improve the performance of (state-of-the-art) results-diversification methods.

The free parameters that control the use of term proximity information in SDM, λ_T, λ_O, and λ_U, are set to 0.85, 0.1, and 0.05, respectively, following previous recommendations [28].

The second initial list used for re-ranking, DocMRF (discussed in Section 4.2.4), is created by enriching MRF’s SDM with query-independent document measures [3]. 

For the ClueWeb settings, the feature functions defined over the l_C clique and which are based on query-independent document measures (e.g., max-sw1, max-sw2, max-spam) are attributed with high importance.

the authors maintain the invariant mentioned above that the scoring function used to induce the ranking upon which ClustMRF operates is rank equivalent to the document-query similarity measure used in ClustMRF. 

each of the three types of cliques used in Section 2.1 for defining the MRF has at least one associated feature function that is assigned with a relatively high weight. 

the authors define P_entropy(d) def= −∑_{w ∈ d} p(w|d) log p(w|d), where w is a term and p(w|d) is the probability assigned to w by an unsmoothed unigram language model (i.e., maximum likelihood estimate) induced from d. Inspired by work on Web spam classification [9], the authors use the inverse compression ratio of document d, P_icompress(d), as an additional measure.

More generally, the best performance for each diversification method (MMR and xQuAD) is almost always attained by ClustMRF, which often outperforms the other methods in a substantial and statistically significant manner. 

ClustMRF and all reference comparison approaches re-rank a list D_init that is composed of the 50 documents that are the most highly ranked by some retrieval method specified below.

The LM similarity between texts x and y is sim_LM(x, y) def= exp(−CE(p_x^{Dir[0]}(·) || p_y^{Dir[µ]}(·))) [37, 17], where CE is the cross entropy measure; µ is set to 1000.

The graph out-degree and the damping factor used by CRank are set to values in {4, 9, 19, 29, 39, 49} and {0.05, 0.1, ..., 0.9, 0.95}, respectively.