Ranking document clusters using markov random fields
Citations
Fast and effective cluster-based information retrieval using frequent closed itemsets
A Comparison of Retrieval Models using Term Dependencies
Query-performance prediction: setting the expectations straight
Cluster-based information retrieval using pattern mining
Cluster-based polyrepresentation as science modelling approach for information retrieval
References
The anatomy of a large-scale hypertextual Web search engine
Learning to Rank for Information Retrieval
The use of MMR, diversity-based reranking for reordering documents and producing summaries
Training linear SVMs in linear time
Frequently Asked Questions (11)
Q2. What are the free parameters that control the use of term proximity information in SDM?
The free parameters that control the use of term proximity information in SDM, λT, λO, and λU, are set to 0.85, 0.1, and 0.05, respectively, following previous recommendations [28].
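The role of these three weights can be illustrated with a minimal sketch: SDM linearly interpolates the scores of its single-term, ordered-window, and unordered-window clique types. The feature values below are made up for illustration; only the λ weights come from the text above.

```python
# SDM-style linear interpolation of the three clique-type scores.
# The lambda weights are those quoted above; the feature values
# passed in are hypothetical per-document (log-)scores.
LAMBDA_T, LAMBDA_O, LAMBDA_U = 0.85, 0.10, 0.05  # term, ordered, unordered

def sdm_score(f_term: float, f_ordered: float, f_unordered: float) -> float:
    """Weighted combination of SDM's three feature-type scores."""
    return LAMBDA_T * f_term + LAMBDA_O * f_ordered + LAMBDA_U * f_unordered

# Example with made-up feature scores for a single document:
print(sdm_score(-4.2, -6.1, -5.8))
```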
Q3. What is the second initial list used for re-ranking?
The second initial list used for re-ranking, DocMRF (discussed in Section 4.2.4), is created by enriching MRF’s SDM with query-independent document measures [3].
Q4. What is the significance of the feature functions assigned over the lC clique?
For the ClueWeb settings, the feature functions defined over the lC clique, which are based on query-independent document measures (e.g., max-sw1, max-sw2, max-spam), are attributed high importance.
Q5. What is the invariant used to induce the ranking of the top 50 documents?
The authors maintain the invariant mentioned above: the scoring function used to induce the ranking upon which ClustMRF operates is rank equivalent to the document-query similarity measure used in ClustMRF.
Q6. What is the important function assigned to each of the three types of cliques?
Each of the three types of cliques used in Section 2.1 for defining the MRF has at least one associated feature function that is assigned a relatively high weight.
Q7. What is the inverse compression ratio of d?
The authors define P_entropy(d) := −∑_{w∈d} p(w|d) log p(w|d), where w is a term and p(w|d) is the probability assigned to w by an unsmoothed unigram language model (i.e., the maximum likelihood estimate) induced from d. Inspired by work on Web spam classification [9], the authors use the inverse compression ratio of document d, P_icompress(d), as an additional measure.
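Both measures are straightforward to compute; a minimal sketch follows. The entropy follows the definition above; for the inverse compression ratio, the common definition of compressed size divided by original size is assumed here, since the paper's exact formulation is not quoted.

```python
import math
import zlib
from collections import Counter

def p_entropy(doc_terms):
    """P_entropy(d) = -sum_w p(w|d) * log p(w|d), with p(w|d) the
    unsmoothed (MLE) unigram language model induced from d."""
    counts = Counter(doc_terms)
    n = len(doc_terms)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def p_icompress(text):
    """Inverse compression ratio: compressed size / original size
    (an assumed definition; highly repetitive text compresses well)."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw)) / len(raw)

doc = "the quick brown fox jumps over the lazy dog the".split()
print(p_entropy(doc))
print(p_icompress("spam " * 40))
```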
Q8. What is the performance for MMR and xQuAD?
More generally, the best performance for each diversification method (MMR and xQuAD) is almost always attained by ClustMRF, which often outperforms the other methods in a substantial and statistically significant manner.
Q9. What is the popular ranking method for a list Dinit?
ClustMRF and all reference comparison approaches re-rank a list Dinit that is composed of the 50 documents that are the most highly ranked by some retrieval method specified below.
Q10. What is the LM similarity between texts x and y?
The LM similarity between texts x and y is sim_LM(x, y) := exp(−CE(p_x^{Dir[0]}(·) || p_y^{Dir[µ]}(·))) [37, 17], where CE is the cross-entropy measure; µ is set to 1000.
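A minimal sketch of this similarity, assuming standard Dirichlet smoothing for y's language model and an MLE (Dir[0]) model for x; the vocabulary handling and collection statistics here are simplified placeholders, not the paper's exact implementation.

```python
import math
from collections import Counter

MU = 1000  # Dirichlet smoothing parameter, as set in the text

def lm_sim(x_terms, y_terms, collection_probs, mu=MU):
    """sim_LM(x, y) = exp(-CE(p_x^Dir[0] || p_y^Dir[mu])): cross entropy
    between x's unsmoothed MLE model and y's Dirichlet-smoothed model."""
    x_counts, y_counts = Counter(x_terms), Counter(y_terms)
    nx, ny = len(x_terms), len(y_terms)
    ce = 0.0
    for w, cx in x_counts.items():
        p_x = cx / nx                                       # Dir[0]: MLE
        p_coll = collection_probs.get(w, 1e-9)              # floor to avoid log(0)
        p_y = (y_counts[w] + mu * p_coll) / (ny + mu)       # Dirichlet smoothing
        ce -= p_x * math.log(p_y)
    return math.exp(-ce)

# Toy collection with two equiprobable terms:
sim = lm_sim(["a", "a", "b"], ["a", "b", "b"], {"a": 0.5, "b": 0.5})
print(sim)
```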
Q11. What are the graph out-degree and the damping factor used by CRank?
The graph out-degree and the damping factor used by CRank are set to values in {4, 9, 19, 29, 39, 49} and {0.05, 0.1, . . . , 0.9, 0.95}, respectively.
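The damping factor plays its usual PageRank-style role in the graph-based scoring. A generic power-iteration sketch is shown below; CRank's exact graph construction over clusters is described in the paper, so the graph here is a hypothetical stand-in.

```python
def pagerank(neighbors, damping=0.85, iters=50):
    """Generic power-iteration PageRank-style scoring (a sketch, not
    CRank's exact formulation). `neighbors` maps each node to the nodes
    it links to; `damping` is the damping factor swept over
    {0.05, ..., 0.95} above. Dangling-node mass is ignored here."""
    nodes = list(neighbors)
    n = len(nodes)
    score = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1 - damping) / n for v in nodes}
        for v, outs in neighbors.items():
            if outs:
                share = damping * score[v] / len(outs)
                for u in outs:
                    new[u] += share
        score = new
    return score

# Toy 3-node cycle; by symmetry every node ends up with score 1/3.
print(pagerank({"a": ["b"], "b": ["c"], "c": ["a"]}))
```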