Intelligent crawling on the World Wide Web with arbitrary predicates

doi:10.1145/371920.371955

Proceedings Article

pp. 96-105

TLDR

This paper proposes intelligent crawling, a technique in which the crawler learns characteristics of the linkage structure of the world wide web while it crawls. The name reflects the crawler's adaptive adjustment to the web page linkage structure.

Abstract:

The enormous growth of the world wide web in recent years has made it important to perform resource discovery efficiently. Consequently, several new ideas have been proposed in recent years; among them a key technique is focused crawling which is able to crawl particular topical portions of the world wide web quickly without having to explore all web pages. In this paper, we propose the novel concept of intelligent crawling which actually learns characteristics of the linkage structure of the world wide web while performing the crawling. Specifically, the intelligent crawler uses the inlinking web page content, candidate URL structure, or other behaviors of the inlinking web pages or siblings in order to estimate the probability that a candidate is useful for a given crawl. This is a much more general framework than the focused crawling technique which is based on a pre-defined understanding of the topical structure of the web. The techniques discussed in this paper are applicable for crawling web pages which satisfy arbitrary user-defined predicates such as topical queries, keyword queries or any combinations of the above. Unlike focused crawling, it is not necessary to provide representative topical examples, since the crawler can learn its way into the appropriate topic. We refer to this technique as intelligent crawling because of its adaptive nature in adjusting to the web page linkage structure. The learning crawler is capable of reusing the knowledge gained in a given crawl in order to provide more efficient crawling for closely related predicates.
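The idea of ranking candidate URLs from statistics gathered during the crawl can be illustrated with a small sketch. It is not the paper's algorithm: the function names, the toy in-memory web `TOY_WEB`, the add-one smoothing, and the equal-weight average of a URL-token score and an inlink score are all assumptions made for illustration. The sketch only shows the general shape of the approach: crawl a page, test the user-defined predicate, update counts for URL tokens and inlinks, and re-rank the frontier.

```python
import re
from collections import defaultdict

# Hypothetical toy web: url -> (page text, outgoing links).
TOY_WEB = {
    "http://example.com/": ("home", ["http://example.com/sports/1",
                                     "http://example.com/news/1"]),
    "http://example.com/sports/1": ("football scores",
                                    ["http://example.com/sports/2",
                                     "http://example.com/news/2"]),
    "http://example.com/news/1": ("weather report", ["http://example.com/news/3"]),
    "http://example.com/sports/2": ("football transfers", []),
    "http://example.com/news/2": ("election results", []),
    "http://example.com/news/3": ("stock markets", []),
}


def url_tokens(url):
    """Split a URL into lowercase alphanumeric tokens."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


def intelligent_crawl(web, seeds, predicate, budget):
    """Crawl up to `budget` pages, preferring candidates whose URL tokens
    and inlinking pages have been associated with predicate hits so far."""
    tok_seen, tok_hit = defaultdict(int), defaultdict(int)  # per URL token
    in_seen, in_hit = defaultdict(int), defaultdict(int)    # per candidate URL
    crawled, hits = [], []
    frontier = set(seeds)

    def ratio(hit_count, seen_count):
        # Smoothed P(hit | evidence) divided by smoothed overall P(hit).
        prior = (len(hits) + 1) / (len(crawled) + 2)
        return ((hit_count + 1) / (seen_count + 2)) / prior

    def score(url):
        toks = url_tokens(url)
        url_part = (sum(ratio(tok_hit[t], tok_seen[t]) for t in toks) / len(toks)
                    if toks else 1.0)
        link_part = ratio(in_hit[url], in_seen[url])
        return (url_part + link_part) / 2  # assumed equal weighting

    while frontier and len(crawled) < budget:
        # Re-score the whole frontier each step, since the statistics change.
        url = max(sorted(frontier), key=score)
        frontier.remove(url)
        text, outlinks = web.get(url, ("", []))
        is_hit = bool(predicate(text))
        crawled.append(url)
        if is_hit:
            hits.append(url)
        for t in set(url_tokens(url)):
            tok_seen[t] += 1
            tok_hit[t] += is_hit
        for link in outlinks:
            in_seen[link] += 1
            in_hit[link] += is_hit
            if link not in crawled:
                frontier.add(link)
    return crawled, hits


if __name__ == "__main__":
    order, found = intelligent_crawl(
        TOY_WEB, ["http://example.com/"], lambda text: "football" in text, budget=4)
    print(order)
    print(found)
```

The predicate here is an ordinary function of the page text, so a keyword query, a classifier, or a combination of both could be passed in without changing the crawl loop. Re-scoring the whole frontier at every step is quadratic and only suitable for a toy example.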
