Proceedings Article
Intelligent crawling on the World Wide Web with arbitrary predicates
Charu C. Aggarwal, Fatima Al-Garawi, Philip S. Yu
- pp 96-105
TLDR
This paper proposes intelligent crawling, a technique that learns characteristics of the linkage structure of the World Wide Web while the crawl is in progress; the name reflects its adaptive nature in adjusting to the web page linkage structure.
Abstract:
The enormous growth of the world wide web in recent years has made it important to perform resource discovery efficiently. Consequently, several new ideas have been proposed in recent years; among them a key technique is focused crawling, which is able to crawl particular topical portions of the world wide web quickly without having to explore all web pages. In this paper, we propose the novel concept of intelligent crawling, which actually learns characteristics of the linkage structure of the world wide web while performing the crawling. Specifically, the intelligent crawler uses the inlinking web page content, candidate URL structure, or other behaviors of the inlinking web pages or siblings in order to estimate the probability that a candidate is useful for a given crawl. This is a much more general framework than the focused crawling technique, which is based on a pre-defined understanding of the topical structure of the web. The techniques discussed in this paper are applicable for crawling web pages which satisfy arbitrary user-defined predicates such as topical queries, keyword queries or any combinations of the above. Unlike focused crawling, it is not necessary to provide representative topical examples, since the crawler can learn its way into the appropriate topic. We refer to this technique as intelligent crawling because of its adaptive nature in adjusting to the web page linkage structure. The learning crawler is capable of reusing the knowledge gained in a given crawl in order to provide more efficient crawling for closely related predicates.
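The idea in the abstract, estimating how likely a candidate URL is to satisfy the predicate from statistics gathered during the crawl itself, can be illustrated with a small sketch. This is not the authors' algorithm: the names `url_tokens`, `TokenStats`, `crawl`, `TOY_GRAPH` and `is_football`, the smoothed token-frequency score, and the equal weighting of URL evidence and inlinking-page evidence are all assumptions made for illustration. An in-memory dictionary stands in for real HTTP fetches.

```python
import heapq
import re
from collections import defaultdict


def url_tokens(url):
    """Split a URL into lowercase alphanumeric tokens (an assumed feature choice)."""
    return {t for t in re.split(r"[^a-z0-9]+", url.lower()) if t}


class TokenStats:
    """Running counts of how often each URL token occurs on pages that satisfy the predicate."""

    def __init__(self):
        self.seen = defaultdict(int)   # token -> pages crawled whose URL has it
        self.hits = defaultdict(int)   # token -> those pages that satisfied the predicate
        self.pages = 0
        self.page_hits = 0

    def update(self, url, satisfied):
        self.pages += 1
        self.page_hits += int(satisfied)
        for tok in url_tokens(url):
            self.seen[tok] += 1
            self.hits[tok] += int(satisfied)

    def score(self, url):
        """Mean smoothed estimate of P(predicate | token) over the URL's tokens."""
        prior = (self.page_hits + 1) / (self.pages + 2)
        toks = url_tokens(url)
        if not toks:
            return prior
        probs = [(self.hits[t] + prior) / (self.seen[t] + 1) for t in toks]
        return sum(probs) / len(probs)


def crawl(graph, seeds, predicate, budget):
    """Best-first crawl over graph: url -> (text, outlinks); returns [(url, satisfied)]."""
    stats = TokenStats()
    frontier = [(-1.0, i, url) for i, url in enumerate(seeds)]
    heapq.heapify(frontier)
    counter = len(seeds)               # tie-breaker so the heap never compares URLs
    visited, order = set(), []
    while frontier and len(order) < budget:
        _, _, url = heapq.heappop(frontier)
        if url in visited or url not in graph:
            continue
        visited.add(url)
        text, outlinks = graph[url]
        satisfied = predicate(url, text)
        stats.update(url, satisfied)   # learn from every page as it is crawled
        order.append((url, satisfied))
        parent_bonus = 1.0 if satisfied else 0.0   # evidence from the inlinking page
        for link in outlinks:
            if link not in visited:
                # Priority is fixed at push time; a fuller version would re-score.
                priority = 0.5 * stats.score(link) + 0.5 * parent_bonus
                heapq.heappush(frontier, (-priority, counter, link))
                counter += 1
    return order


# Hypothetical toy web used only to exercise the sketch.
TOY_GRAPH = {
    "http://a.example/index": ("portal home", ["http://a.example/sports/scores"]),
    "http://a.example/sports/scores": (
        "football scores",
        ["http://a.example/news/weather", "http://a.example/sports/transfers"],
    ),
    "http://a.example/sports/transfers": ("football transfers", []),
    "http://a.example/news/weather": ("rain tomorrow", []),
}


def is_football(url, text):
    """Example user-defined predicate: a keyword query on page text."""
    return "football" in text


if __name__ == "__main__":
    for url, ok in crawl(TOY_GRAPH, ["http://a.example/index"], is_football, budget=3):
        print(url, ok)
```

In this toy run the crawler reaches `sports/transfers` before `news/weather` even though the news link is listed first, because the `sports` token has already been seen on a page that satisfied the predicate. The predicate is an ordinary function, so keyword queries, classifier outputs, or combinations of them can be swapped in without changing the crawl loop.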
Citations
Journal Article
Searching the Web
TL;DR: An overview of current Web search engine design is offered, introducing a generic search engine architecture and the results of several performance analyses conducted to compare different designs.
Journal Article
Web Crawling
Christopher Olston, Marc Najork
TL;DR: The fundamental challenges of web crawling are outlined and the state-of-the-art models and solutions are described, and avenues for future work are highlighted.
Journal Article
Sindice.com: a document-oriented lookup index for open linked data
Eyal Oren, Renaud Delbru, Michele Catasta, Richard Cyganiak, Holger Stenzhorn, Giovanni Tummarello
TL;DR: Sindice, a lookup index over Semantic Web resources, allows applications to automatically locate documents containing information about a given resource, and extends the sitemap protocol to efficiently index large datasets with minimal impact on data providers.
Journal Article
Topical web crawlers: Evaluating adaptive algorithms
TL;DR: A framework to fairly evaluate topical crawling algorithms under a number of performance metrics is developed and a novel combination of explorative and exploitative bias is found, and an evolutionary crawler is introduced that surpasses the performance of the best nonadaptive crawler after sufficiently long crawls.
Proceedings Article
Accelerated focused crawling through online relevance feedback
TL;DR: Can an automatic program emulate this human behavior and thereby learn to predict the relevance of an unseen HREF target page w.r.t. an information need, based on information limited to the HREF source page?
References
Journal Article
Focused crawling: a new approach to topic-specific Web resource discovery
TL;DR: A new hypertext resource discovery system called a Focused Crawler that is robust against large perturbations in the starting set of URLs, and capable of exploring out and discovering valuable resources that are dozens of links away from the start set, while carefully pruning the millions of pages that may lie within this same radius.
Proceedings Article
Authoritative sources in a hyperlinked environment
TL;DR: This work proposes and tests an algorithmic formulation of the notion of authority, based on the relationship between a set of relevant authoritative pages and the set of "hub pages" that join them together in the link structure, that has connections to the eigenvectors of certain matrices associated with the link graph.
Journal Article
Efficient crawling through URL ordering
TL;DR: In this paper, the authors study in what order a crawler should visit the URLs it has seen, in order to obtain more "important" pages first, and they show that a good ordering scheme can obtain important pages significantly faster than one without.
Journal Article
Automatic resource compilation by analyzing hyperlink structure and associated text
Soumen Chakrabarti, Byron Dom, Prabhakar Raghavan, Sridhar Rajagopalan, David Gibson, Jon Kleinberg
TL;DR: An evaluation of ARC suggests that the resources found by ARC frequently fare almost as well as, and sometimes better than, lists of resources that are manually compiled or classified into a topic.
Journal Article
Improved algorithms for topic distillation in a hyperlinked environment
Krishna Bharat, Monika Henzinger
TL;DR: This paper addresses the problem of topic distillation on the World Wide Web, namely, given a typical user query, finding quality documents related to the query topic, by augmenting a previous connectivity-analysis-based algorithm with content analysis.