Journal ArticleDOI

Efficient and effective spam filtering and re-ranking for large web datasets

01 Oct 2011-Information Retrieval (Springer Netherlands)-Vol. 14, Iss: 5, pp 441-465
TL;DR: It is shown that a simple content-based classifier with minimal training is efficient enough to rank the “spamminess” of every page in the ClueWeb09 dataset using a standard personal computer in 48 hours, and effective enough to yield significant and substantive improvements in the fixed-cutoff precision as well as rank measures of nearly all submitted runs.
Abstract: The TREC 2009 web ad hoc and relevance feedback tasks used a new document collection, the ClueWeb09 dataset, which was crawled from the general web in early 2009. This dataset contains 1 billion web pages, a substantial fraction of which are spam--pages designed to deceive search engines so as to deliver an unwanted payload. We examine the effect of spam on the results of the TREC 2009 web ad hoc and relevance feedback tasks, which used the ClueWeb09 dataset. We show that a simple content-based classifier with minimal training is efficient enough to rank the "spamminess" of every page in the dataset using a standard personal computer in 48 hours, and effective enough to yield significant and substantive improvements in the fixed-cutoff precision (estP10) as well as rank measures (estR-Precision, StatMAP, MAP) of nearly all submitted runs. Moreover, using a set of "honeypot" queries the labeling of training data may be reduced to an entirely automatic process. The results of classical information retrieval methods are particularly enhanced by filtering--from among the worst to among the best.
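The content-based approach summarized above can be illustrated in miniature as an online logistic-regression classifier over hashed byte 4-gram features. This is a toy sketch, not the authors' implementation; the feature-hash size, learning rate, and example messages are all assumptions.

```python
# Toy content-based spam scorer: online logistic regression over
# hashed byte 4-gram features. Illustrative only, not the paper's code.
import math

NUM_BUCKETS = 1 << 20   # hashed feature space (illustrative size)
LEARN_RATE = 0.002      # illustrative learning rate

def features(text: bytes):
    """Distinct hashed byte 4-gram feature indices for a message."""
    return {hash(text[i:i + 4]) % NUM_BUCKETS for i in range(len(text) - 3)}

class SpamScorer:
    def __init__(self):
        self.w = {}  # sparse weight vector

    def score(self, text: bytes) -> float:
        """Logit 'spamminess' score: higher means more spam-like."""
        return sum(self.w.get(f, 0.0) for f in features(text))

    def train(self, text: bytes, is_spam: bool):
        """One online gradient step of logistic regression."""
        p = 1.0 / (1.0 + math.exp(-self.score(text)))
        gradient = (1.0 if is_spam else 0.0) - p
        for f in features(text):
            self.w[f] = self.w.get(f, 0.0) + LEARN_RATE * gradient

# Tiny demonstration with made-up messages.
scorer = SpamScorer()
for _ in range(200):
    scorer.train(b"cheap pills buy now winner", True)
    scorer.train(b"meeting agenda for project review", False)
```

Because scoring is a single sparse dot product per page, a classifier of this shape can rank very large collections quickly, which is the efficiency point the abstract makes.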
Citations
Proceedings ArticleDOI
24 Oct 2016
TL;DR: A novel deep relevance matching model (DRMM) for ad-hoc retrieval that employs a joint deep architecture at the query term level for relevance matching and can significantly outperform some well-known retrieval models as well as state-of-the-art deep matching models.
Abstract: In recent years, deep neural networks have led to exciting breakthroughs in speech recognition, computer vision, and natural language processing (NLP) tasks. However, there have been few positive results of deep models on ad-hoc retrieval tasks. This is partially due to the fact that many important characteristics of the ad-hoc retrieval task have not been well addressed in deep models yet. Typically, the ad-hoc retrieval task is formalized as a matching problem between two pieces of text in existing work using deep models, and treated equivalent to many NLP tasks such as paraphrase identification, question answering and automatic conversation. However, we argue that the ad-hoc retrieval task is mainly about relevance matching while most NLP matching tasks concern semantic matching, and there are some fundamental differences between these two matching tasks. Successful relevance matching requires proper handling of the exact matching signals, query term importance, and diverse matching requirements. In this paper, we propose a novel deep relevance matching model (DRMM) for ad-hoc retrieval. Specifically, our model employs a joint deep architecture at the query term level for relevance matching. By using matching histogram mapping, a feed forward matching network, and a term gating network, we can effectively deal with the three relevance matching factors mentioned above. Experimental results on two representative benchmark collections show that our model can significantly outperform some well-known retrieval models as well as state-of-the-art deep matching models.
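The matching-histogram mapping named in the abstract can be sketched as follows: for one query term, cosine similarities to all document terms are binned into a fixed-length histogram, with the exact-match signal isolated in its own bucket. The bin count and log-count variant shown here are assumptions about configuration, not the authors' exact setup.

```python
# Sketch of DRMM-style matching-histogram mapping for one query term.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def matching_histogram(query_vec, doc_vecs, bins=5):
    """Log-count histogram of query-term vs. document-term similarities.

    Similarities in [-1, 1) fall into `bins` equal-width buckets; an exact
    match (similarity 1.0) is isolated in one extra bucket at the end.
    """
    hist = [0.0] * (bins + 1)
    for d in doc_vecs:
        s = cosine(query_vec, d)
        if s >= 1.0 - 1e-9:                  # exact-match signal
            hist[bins] += 1
        else:
            hist[min(int((s + 1.0) / 2.0 * bins), bins - 1)] += 1
    return [math.log1p(c) for c in hist]     # log of counts
```

The histogram for each query term is then fed to the feed-forward matching network, and the per-term outputs are combined via the term gating network described in the abstract.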

810 citations


Additional excerpts

  • ...ing the Waterloo Fusion spam scores [3]....


Proceedings Article
01 Aug 2013
TL;DR: An efficient algorithm is presented for estimating large modified Kneser-Ney models, including interpolation; streaming and sorting enable the algorithm to scale to much larger models by using a fixed amount of RAM and a variable amount of disk.
Abstract: We present an efficient algorithm to estimate large modified Kneser-Ney models including interpolation. Streaming and sorting enables the algorithm to scale to much larger models by using a fixed amount of RAM and variable amount of disk. Using one machine with 140 GB RAM for 2.8 days, we built an unpruned model on 126 billion tokens. Machine translation experiments with this model show improvement of 0.8 BLEU point over constrained systems for the 2013 Workshop on Machine Translation task in three language pairs. Our algorithm is also faster for small models: we estimated a model on 302 million tokens using 7.7% of the RAM and 14.0% of the wall time taken by SRILM. The code is open source as part of KenLM.
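The streaming-and-sorting idea behind this abstract can be illustrated with a toy external n-gram counter (this is not KenLM): n-grams are spilled to disk in sorted chunks and merged in one streaming pass, so RAM use is bounded by the chunk size while disk use grows with the data.

```python
# Toy fixed-RAM n-gram counting via external sort-and-merge.
import heapq, itertools, os, tempfile

def ngrams(tokens, n=3):
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i:i + n])

def _spill(sorted_grams):
    """Write one sorted chunk to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(text=True)
    with os.fdopen(fd, "w") as f:
        f.write("\n".join(sorted_grams) + "\n")
    return path

def external_count(tokens, n=3, chunk_size=4):
    """Count n-grams with at most `chunk_size` grams held in memory."""
    chunk_files, buf = [], []
    for gram in ngrams(list(tokens), n):
        buf.append(gram)
        if len(buf) >= chunk_size:           # RAM budget reached: spill
            chunk_files.append(_spill(sorted(buf)))
            buf = []
    if buf:
        chunk_files.append(_spill(sorted(buf)))
    streams = [open(f) for f in chunk_files]
    merged = (line.rstrip("\n") for line in heapq.merge(*streams))
    counts = [(g, sum(1 for _ in grp)) for g, grp in itertools.groupby(merged)]
    for s in streams:
        s.close()
    for f in chunk_files:
        os.remove(f)
    return counts
```

Real language-model estimation adds the smoothing arithmetic on top of passes like this one; the point here is only that sorted streaming keeps memory fixed regardless of corpus size.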

553 citations

Proceedings ArticleDOI
TL;DR: Deep Relevance Matching (DRMM) as mentioned in this paper employs a joint deep architecture at the query term level for relevance matching, using matching histogram mapping, a feed forward matching network, and a term gating network.

477 citations

Proceedings Article
01 Jan 2011
TL;DR: In PAN'10, 18 plagiarism detectors were evaluated in detail, highlighting several important aspects of plagiarism detection, such as obfuscation, intrinsic vs. external plagiarism, and plagiarism case length as mentioned in this paper.
Abstract: This paper overviews 18 plagiarism detectors that have been developed and evaluated within PAN'10. We start with a unified retrieval process that summarizes the best practices employed this year. Then, the detectors' performances are evaluated in detail, highlighting several important aspects of plagiarism detection, such as obfuscation, intrinsic vs. external plagiarism, and plagiarism case length. Finally, all results are compared to those of last year's competition.

419 citations

Journal ArticleDOI
01 Jun 2014
TL;DR: An unsupervised method is proposed for modeling the implicit concepts of a query, with the aim of recreating the conceptual representation of the initial information need.
Abstract: A query is the representation of a user's information need, and is the result of a complex cognitive process that often leads to a poor choice of keywords. We propose an unsupervised method for modeling the implicit concepts of a query, with the aim of recreating the conceptual representation of the initial information need. We use latent Dirichlet allocation (LDA) to detect the implicit concepts of the query using pseudo-relevant documents. We evaluate this method in depth using two TREC test collections. In particular, we find that our approach accurately models the implicit concepts of the query while achieving good performance in a document retrieval setting.
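The LDA step described above can be illustrated with a very small collapsed Gibbs sampler that induces latent "concepts" from a handful of pseudo-relevant documents. This is not the authors' system; the hyperparameters and the two-topic example are arbitrary.

```python
# Minimal collapsed Gibbs sampler for LDA (illustration only).
import random
from collections import defaultdict

def lda_gibbs(docs, K=2, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Return the top words of each of K latent 'concepts'."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})
    z = [[rng.randrange(K) for _ in d] for d in docs]
    ndk = [[0] * K for _ in docs]               # doc-topic counts
    nkw = [defaultdict(int) for _ in range(K)]  # topic-word counts
    nk = [0] * K
    for di, d in enumerate(docs):
        for wi, w in enumerate(d):
            k = z[di][wi]; ndk[di][k] += 1; nkw[k][w] += 1; nk[k] += 1
    for _ in range(iters):
        for di, d in enumerate(docs):
            for wi, w in enumerate(d):
                k = z[di][wi]
                ndk[di][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                weights = [(ndk[di][j] + alpha) * (nkw[j][w] + beta)
                           / (nk[j] + V * beta) for j in range(K)]
                k = rng.choices(range(K), weights)[0]
                z[di][wi] = k; ndk[di][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return [sorted(nkw[k], key=nkw[k].get, reverse=True)[:3] for k in range(K)]
```

In the approach the abstract describes, topics inferred this way from pseudo-relevant documents stand in for the implicit concepts of the original query.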

335 citations

References
Proceedings Article
01 Nov 2009
TL;DR: An overview of the TREC 2009 Web Track is provided, including topic development, evaluation measures, and results, including both a traditional ad hoc retrieval task and a new diversity task.
Abstract: The TREC Web Track explores and evaluates Web retrieval technologies. Currently, the Web Track conducts experiments using the new billion-page ClueWeb09 collection. The TREC 2009 Web Track includes both a traditional ad hoc retrieval task and a new diversity task. The goal of this diversity task is to return a ranked list of pages that together provide complete coverage for a query, while avoiding excessive redundancy in the result list. Topics for the track were created from the logs of a commercial search engine, with the aid of tools developed at Microsoft Research. Given a target query, these tools extracted and analyzed groups of related queries, using co-clicks and other information, to identify clusters of queries that highlight different aspects and interpretations of the target query. These clusters were employed by NIST for topic development. Each resulting topic is structured as a representative set of subtopics, each related to a different user need. Documents were judged with respect to the subtopics, as well as with respect to the topic as a whole. For each subtopic, NIST assessors made a binary judgment as to whether or not the document satisfies the information need associated with the subtopic. These topics were used for both the ad hoc task and the diversity task. A total of 26 groups submitted runs to the track, with many groups participating in both tasks. This report provides an overview of the track, including topic development, evaluation measures, and results.

289 citations


"Efficient and effective spam filter..." refers methods in this paper

  • ...Methods to estimate AP with incomplete knowledge of rel(k) have proven to be unreliable for the TREC 09 tasks [private correspondence, Web Track coordinators]....


  • ...The first of these tasks is the ad hoc task of the Web Track (Clarke et al. 2009)....


  • ...The first of these tasks is the ad hoc task of the Web Track [4]....


Proceedings ArticleDOI
23 May 2006
TL;DR: This work shows that it can significantly outperform PageRank using features that are independent of the link structure of the Web, and uses RankNet, a ranking machine learning algorithm, to combine these and other static features based on anchor text and domain characteristics.
Abstract: Since the publication of Brin and Page's paper on PageRank, many in the Web community have depended on PageRank for the static (query-independent) ordering of Web pages. We show that we can significantly outperform PageRank using features that are independent of the link structure of the Web. We gain a further boost in accuracy by using data on the frequency at which users visit Web pages. We use RankNet, a ranking machine learning algorithm, to combine these and other static features based on anchor text and domain characteristics. The resulting model achieves a static ranking pairwise accuracy of 67.3% (vs. 56.7% for PageRank or 50% for random).
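The pairwise accuracy figure quoted above (67.3% vs. 56.7% for PageRank, or 50% for random) can be computed as the fraction of human-preferred document pairs that a static score orders correctly, with ties counted as half. A minimal sketch, with illustrative names:

```python
# Pairwise accuracy of a static (query-independent) ranking.
def pairwise_accuracy(scores, preferences):
    """scores: dict doc_id -> static score.
    preferences: list of (better_doc, worse_doc) pairs from human judgments.
    Returns the fraction of pairs ordered correctly; ties count as half."""
    correct = 0.0
    for better, worse in preferences:
        if scores[better] > scores[worse]:
            correct += 1.0
        elif scores[better] == scores[worse]:
            correct += 0.5
    return correct / len(preferences)
```

A scorer that assigns random or constant values orders each pair correctly with probability one half, which is why 50% is the random baseline the abstract cites.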

213 citations


"Efficient and effective spam filter..." refers methods in this paper

  • ...Although the literature is dominated by graph-based methods for web spam filtering and static ranking (Becchetti et al. 2008; Richardson et al. 2006), content-based email spam filters were found to work as well as graph-based methods in the 2007 Web Spam Challenge (Cormack 2007)....


Proceedings Article
01 Jan 2007
TL;DR: This paper has created the first blog test collection, namely the TREC Blog06 collection, for ad hoc retrieval and opinion finding, and provides an overview of each task, summarises the obtained results and draws conclusions for the future.
Abstract: The goal of the Blog track is to explore the information seeking behaviour in the blogosphere. It aims to create the required infrastructure to facilitate research into the blogosphere and to study retrieval from blogs and other related applied tasks. The track was introduced in 2006 with a main opinion finding task and an open task, which allowed participants the opportunity to influence the determination of a suitable second task for 2007 on other aspects of blogs besides their opinionated nature. As a result, we have created the first blog test collection, namely the TREC Blog06 collection, for ad hoc retrieval and opinion finding. Further background information on the Blog track can be found in the 2006 track overview [2]. TREC 2007 has continued using the Blog06 collection, and saw the addition of a new main task and a new subtask, namely a blog distillation (feed search) task and an opinion polarity subtask respectively, along with a second year of the opinion finding task. NIST developed the topics and relevance judgments for the opinion finding task, and its polarity subtask. For the blog distillation task, the participating groups created the topics and the associated relevance judgments. This second year of the track has seen an increased participation compared to 2006, with 20 groups submitting runs to the opinion finding task, 11 groups submitting runs to the polarity subtask, and 9 groups submitting runs to the blog distillation task. This paper provides an overview of each task, summarises the obtained results and draws conclusions for the future. The remainder of this paper is structured as follows. Section 2 provides a short description of the used Blog06 collection. Section 3 describes the opinion finding task and its polarity subtask, providing an overview of the submitted runs, as well as a summary of the main techniques used by the participants. Section 4 describes the newly created blog distillation (feed search) task, and summarises the results of the runs and the main approaches deployed by the participating groups. We provide concluding remarks in Section 5.

207 citations

Proceedings Article
01 Jan 2007
TL;DR: TREC's Spam Track uses a standard testing framework that presents a set of chronologically ordered email messages to a spam filter for classification, which yields a binary judgement (spam or ham [i.e. non-spam]) that is compared to a human-adjudicated gold standard as discussed by the authors.
Abstract: TREC's Spam Track uses a standard testing framework that presents a set of chronologically ordered email messages to a spam filter for classification. In the filtering task, the messages are presented one at a time to the filter, which yields a binary judgement (spam or ham [i.e. non-spam]) that is compared to a human-adjudicated gold standard. The filter also yields a spamminess score, intended to reflect the likelihood that the classified message is spam, which is the subject of post-hoc ROC (Receiver Operating Characteristic) analysis. Four different forms of user feedback are modeled: with immediate feedback the gold standard for each message is communicated to the filter immediately following classification; with delayed feedback the gold standard is communicated to the filter sometime later (or potentially never), so as to model a user reading email from time to time and perhaps not diligently reporting the filter's errors; with partial feedback the gold standard for only a subset of email recipients is transmitted to the filter, so as to model the case of some users never reporting filter errors; with active on-line learning (suggested by D. Sculley from Tufts University [5]) the filter is allowed to request immediate feedback for a certain quota of messages which is considerably smaller than the total number. Two test corpora -- email messages plus gold standard judgements -- were used to evaluate subject filters. One public corpus (trec07p) was distributed to participants, who ran their filters on the corpora using a track-supplied toolkit implementing the framework and the four kinds of feedback. One private corpus (MrX 3) was not distributed to participants; rather, participants submitted filter implementations that were run, using the toolkit, on the private data. Twelve groups participated in the track, each submitting up to four filters for evaluation in each of the four feedback modes (immediate; delayed; partial; active).

203 citations

Journal ArticleDOI
TL;DR: This article compares the robustness of IR metrics to incomplete relevance assessments, using four different sets of graded-relevance test collections with submitted runs—the TREC 2003 and 2004 robust track data and the NTCIR-6 Japanese and Chinese IR data from the crosslingual task.
Abstract: Modern information retrieval (IR) test collections have grown in size, but the available manpower for relevance assessments has more or less remained constant. Hence, how to reliably evaluate and compare IR systems using incomplete relevance data, where many documents exist that were never examined by the relevance assessors, is receiving a lot of attention. This article compares the robustness of IR metrics to incomplete relevance assessments, using four different sets of graded-relevance test collections with submitted runs--the TREC 2003 and 2004 robust track data and the NTCIR-6 Japanese and Chinese IR data from the crosslingual task. Following previous work, we artificially reduce the original relevance data to simulate IR evaluation environments with extremely incomplete relevance data. We then investigate the effect of this reduction on discriminative power, which we define as the proportion of system pairs with a statistically significant difference for a given probability of Type I Error, and on Kendall's rank correlation, which reflects the overall resemblance of two system rankings according to two different metrics or two different relevance data sets. According to these experiments, Q′, nDCG′ and AP′ proposed by Sakai are superior to bpref proposed by Buckley and Voorhees and to Rank-Biased Precision proposed by Moffat and Zobel. We also point out some weaknesses of bpref and Rank-Biased Precision by examining their formal definitions.
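Kendall's rank correlation, used in this article to measure how closely two system rankings resemble each other, can be sketched as a simple O(n^2) tau over rank positions (ignoring ties); real evaluations would normally use a library implementation.

```python
# O(n^2) Kendall's tau (tau-a, assumes no tied ranks) over two rankings.
def kendall_tau(rank_a, rank_b):
    """rank_a, rank_b: dicts mapping system id -> rank position (1 = best)."""
    systems = list(rank_a)
    n = len(systems)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s, t = systems[i], systems[j]
            da = rank_a[s] - rank_a[t]
            db = rank_b[s] - rank_b[t]
            if da * db > 0:
                concordant += 1    # pair ordered the same way by both
            elif da * db < 0:
                discordant += 1    # pair ordered oppositely
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Identical rankings give tau = 1 and fully reversed rankings give tau = -1, which is why the measure is a natural summary of how much a metric change reorders the evaluated systems.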

143 citations


Additional excerpts

  • ...Unadjudicated documents are elided, and P@10 is computed on the top-ranked 10 documents that are adjudicated [23]....

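The condensed-list precision described in this excerpt can be sketched as follows: unjudged documents are removed from the ranking, and P@10 is then computed over the top judged documents that remain, with a fixed cutoff of 10. Function and argument names here are illustrative.

```python
# Condensed-list P@10: elide unjudged documents, then apply the cutoff.
def condensed_p_at_10(ranked_doc_ids, judgments):
    """ranked_doc_ids: system ranking, best first.
    judgments: dict doc_id -> True (relevant) / False (non-relevant);
    unjudged documents are simply absent from the dict."""
    judged = [d for d in ranked_doc_ids if d in judgments]
    top = judged[:10]
    return sum(1 for d in top if judgments[d]) / 10.0
```

Compressing the list this way keeps the fixed-cutoff measure well defined even when many retrieved documents were never assessed, which is exactly the incomplete-judgment situation the cited article studies.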