Proceedings Article

Intelligent crawling on the World Wide Web with arbitrary predicates

TLDR
This paper proposes the novel concept of intelligent crawling, in which the crawler learns characteristics of the linkage structure of the World Wide Web while it crawls. The name reflects the technique's adaptive nature in adjusting to the web page linkage structure.
Abstract
The enormous growth of the world wide web in recent years has made it important to perform resource discovery efficiently. Consequently, several new ideas have been proposed in recent years; among them a key technique is focused crawling, which is able to crawl particular topical portions of the world wide web quickly without having to explore all web pages. In this paper, we propose the novel concept of intelligent crawling, which actually learns characteristics of the linkage structure of the world wide web while performing the crawling. Specifically, the intelligent crawler uses the inlinking web page content, candidate URL structure, or other behaviors of the inlinking web pages or siblings in order to estimate the probability that a candidate is useful for a given crawl. This is a much more general framework than the focused crawling technique, which is based on a pre-defined understanding of the topical structure of the web. The techniques discussed in this paper are applicable for crawling web pages which satisfy arbitrary user-defined predicates such as topical queries, keyword queries or any combinations of the above. Unlike focused crawling, it is not necessary to provide representative topical examples, since the crawler can learn its way into the appropriate topic. We refer to this technique as intelligent crawling because of its adaptive nature in adjusting to the web page linkage structure. The learning crawler is capable of reusing the knowledge gained in a given crawl in order to provide more efficient crawling for closely related predicates.
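
The idea of ranking candidate URLs by an estimated probability of satisfying a predicate can be illustrated with a small sketch. This is not the paper's statistical model: the smoothed per-token hit rate, the equal weighting of URL evidence and inlinking-page evidence, and every name below (`LearningCrawler`, `url_tokens`, `TOY_GRAPH`) are assumptions made for illustration. The crawl runs over an in-memory toy graph rather than the live web, and a candidate is scored once, when it is first discovered.

```python
import heapq
import re
from collections import defaultdict


def url_tokens(url):
    """Split a URL into lowercase alphanumeric tokens."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


class LearningCrawler:
    """Toy best-first crawler that learns which URL tokens predict hits."""

    def __init__(self, graph, predicate):
        self.graph = graph              # url -> (page_text, [outlinks])
        self.predicate = predicate      # page_text -> bool (user-defined)
        self.seen = defaultdict(int)    # token -> crawled pages containing it
        self.hits = defaultdict(int)    # token -> of those, pages that hit

    def score(self, url, parent_hit):
        """Estimate how promising a candidate is (assumed heuristic)."""
        tokens = url_tokens(url)
        # Laplace-smoothed hit rate per token, averaged over the URL.
        rates = [(self.hits[t] + 1) / (self.seen[t] + 2) for t in tokens]
        token_score = sum(rates) / len(rates) if rates else 0.5
        # Blend URL evidence with whether the inlinking page was a hit.
        return 0.5 * token_score + 0.5 * (1.0 if parent_hit else 0.0)

    def crawl(self, seeds, budget):
        """Visit up to `budget` pages; return [(url, satisfied)] in order."""
        # heapq is a min-heap, so scores are negated; the counter breaks ties
        # in discovery order.
        frontier = [(-0.5, i, s) for i, s in enumerate(seeds)]
        heapq.heapify(frontier)
        queued = set(seeds)
        counter = len(seeds)
        visited = []
        while frontier and len(visited) < budget:
            _, _, url = heapq.heappop(frontier)
            text, links = self.graph.get(url, ("", []))
            hit = self.predicate(text)
            visited.append((url, hit))
            # Learn from the page just crawled.
            for t in set(url_tokens(url)):
                self.seen[t] += 1
                self.hits[t] += int(hit)
            # Score newly discovered candidates with the updated statistics.
            for link in links:
                if link not in queued:
                    queued.add(link)
                    priority = -self.score(link, hit)
                    heapq.heappush(frontier, (priority, counter, link))
                    counter += 1
        return visited


# Hypothetical six-page site used only to exercise the sketch.
BASE = "http://a.example"
TOY_GRAPH = {
    BASE + "/index": ("home", [BASE + "/sports/1", BASE + "/news/1"]),
    BASE + "/sports/1": ("football scores",
                         [BASE + "/sports/2", BASE + "/news/2"]),
    BASE + "/news/1": ("politics", [BASE + "/news/3"]),
    BASE + "/sports/2": ("football transfer", []),
    BASE + "/news/2": ("weather", []),
    BASE + "/news/3": ("economy", []),
}

if __name__ == "__main__":
    crawler = LearningCrawler(TOY_GRAPH, lambda text: "football" in text)
    for url, satisfied in crawler.crawl([BASE + "/index"], budget=3):
        print(url, satisfied)
```

The predicate here is a keyword test, but any function from page text to a boolean would do, which mirrors the abstract's point that the crawl target can be an arbitrary user-defined predicate. Keeping the `seen` and `hits` tables between runs would be one simple way to reuse what a crawl learned for a closely related predicate.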


Citations
Journal Article

Searching the Web

TL;DR: An overview of current Web search engine design is offered, introducing a generic search engine architecture and the results of several performance analyses conducted to compare different designs.
Journal Article

Web Crawling

TL;DR: The fundamental challenges of web crawling are outlined and the state-of-the-art models and solutions are described, and avenues for future work are highlighted.
Journal Article

Sindice.com: a document-oriented lookup index for open linked data

TL;DR: Sindice, a lookup index over Semantic Web resources, allows applications to automatically locate documents containing information about a given resource, and extends the sitemap protocol to efficiently index large datasets with minimal impact on data providers.
Journal Article

Topical web crawlers: Evaluating adaptive algorithms

TL;DR: A framework to fairly evaluate topical crawling algorithms under a number of performance metrics is developed and a novel combination of explorative and exploitative bias is found, and an evolutionary crawler is introduced that surpasses the performance of the best nonadaptive crawler after sufficiently long crawls.
Proceedings Article

Accelerated focused crawling through online relevance feedback

TL;DR: Can an automatic program emulate this human behavior and thereby learn to predict the relevance of an unseen HREF target page w.r.t. an information need, based on information limited to the HREF source page?
References
Journal Article

Focused crawling: a new approach to topic-specific Web resource discovery

TL;DR: A new hypertext resource discovery system called a Focused Crawler that is robust against large perturbations in the starting set of URLs, and capable of exploring out and discovering valuable resources that are dozens of links away from the start set, while carefully pruning the millions of pages that may lie within this same radius.
Proceedings Article

Authoritative sources in a hyperlinked environment

TL;DR: This work proposes and tests an algorithmic formulation of the notion of authority, based on the relationship between a set of relevant authoritative pages and the set of "hub pages" that join them together in the link structure, that has connections to the eigenvectors of certain matrices associated with the link graph.
Journal Article

Efficient crawling through URL ordering

TL;DR: In this paper, the authors study in what order a crawler should visit the URLs it has seen, in order to obtain more "important" pages first, and they show that a good ordering scheme can obtain important pages significantly faster than one without.
Journal Article

Automatic resource compilation by analyzing hyperlink structure and associated text

TL;DR: An evaluation of ARC suggests that the resources found by ARC frequently fare almost as well as, and sometimes better than, lists of resources that are manually compiled or classified into a topic.
Journal Article

Improved algorithms for topic distillation in a hyperlinked environment

TL;DR: This paper addresses the problem of topic distillation on the World Wide Web (given a typical user query, find quality documents related to the query topic) by augmenting a previous connectivity-analysis-based algorithm with content analysis.