Open Access Proceedings Article

Crawling the Hidden Web

TLDR
In this paper, the authors address the problem of designing a crawler capable of extracting content from the hidden Web, i.e., the large volume of high-quality content that lies behind search forms in searchable electronic databases and is ignored by crawlers that only follow hypertext links.
Abstract
Current-day crawlers retrieve content only from the publicly indexable Web, i.e., the set of Web pages reachable purely by following hypertext links, ignoring search forms and pages that require authorization or prior registration. In particular, they ignore the tremendous amount of high quality content “hidden” behind search forms, in large searchable electronic databases. In this paper, we address the problem of designing a crawler capable of extracting content from this hidden Web. We introduce a generic operational model of a hidden Web crawler and describe how this model is realized in HiWE (Hidden Web Exposer), a prototype crawler built at Stanford. We introduce a new Layout-based Information Extraction Technique (LITE) and demonstrate its use in automatically extracting semantic information from search forms and response pages. We also present results from experiments conducted to test and validate our techniques.
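As a rough illustration of the kind of semantic extraction described above, the sketch below pairs visible labels with the fields of a search form using textual proximity in the parsed HTML. This is only a simplification of the idea, not the paper's LITE technique, which works from the rendered page layout; the sample form, the BeautifulSoup dependency, and the extract_form_fields helper are all assumptions made for illustration.

```python
# A minimal sketch (not the paper's LITE algorithm) of pairing visible
# labels with form fields, using textual proximity in the HTML as a
# stand-in for layout-based matching.
# Requires: pip install beautifulsoup4
from bs4 import BeautifulSoup

HTML = """
<form action="/search" method="get">
  <table>
    <tr><td>Author</td><td><input type="text" name="au"></td></tr>
    <tr><td>Title</td><td><input type="text" name="ti"></td></tr>
    <tr><td>Year</td><td><select name="yr"><option>2001</option></select></td></tr>
  </table>
  <input type="submit" value="Search">
</form>
"""

def extract_form_fields(html):
    """Return (label, field name, field type) triples for one search form."""
    soup = BeautifulSoup(html, "html.parser")
    fields = []
    for element in soup.find_all(["input", "select", "textarea"]):
        if element.get("type") == "submit":
            continue
        # Heuristic: take the nearest preceding non-empty text node as the label.
        label = ""
        for prev in element.previous_elements:
            if isinstance(prev, str) and prev.strip():
                label = prev.strip()
                break
        fields.append((label, element.get("name"), element.name))
    return fields

if __name__ == "__main__":
    for label, name, kind in extract_form_fields(HTML):
        print(f"{label!r:10} -> field {name!r} ({kind})")
```

Run as-is, the script prints one (label, field name, field type) triple per visible field, which is roughly the kind of form description a hidden Web crawler needs before it can fill in and submit a form.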



Citations
Patent

Serving advertisements based on content

TL;DR: In this article, the authors present a method for placing targeted ads on a page on the Web (or in some other document of any media type) by obtaining content that includes available spots for ads, determining ads relevant to that content, and/or combining the content with the ads determined to be relevant.

Crawling the Hidden Web.

TL;DR: A generic operational model of a hidden Web crawler is introduced, and the realization of this model in HiWE (Hidden Web Exposer), a prototype crawler built at Stanford, is described.
Proceedings Article

Web application security assessment by fault injection and behavior monitoring

TL;DR: The design of Web application security assessment mechanisms is analyzed in order to identify poor coding practices that render Web applications vulnerable to attacks such as SQL injection and cross-site scripting.
Proceedings Article

Data extraction and label assignment for web databases

TL;DR: A system called DeLa is presented, which reconstructs (part of) a "hidden" back-end web database by sending queries through HTML forms, automatically generating regular-expression wrappers to extract data objects from the result pages, and restoring the retrieved data into an annotated (labelled) table.
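To make the notion of a regular-expression wrapper concrete, here is a toy sketch in which a hand-written pattern with named groups maps the repeated fragments of a result page onto table columns. DeLa induces such wrappers automatically from the repeated structure of response pages; the sample page and the pattern below are invented for illustration and are not taken from the DeLa system.

```python
# A toy illustration (not DeLa's wrapper-induction algorithm) of what a
# regular-expression wrapper over a result page looks like: a pattern
# whose named groups map repeated HTML fragments onto table columns.
import re

RESULT_PAGE = """
<table>
<tr><td class="title">Crawling the Hidden Web</td><td class="year">2001</td></tr>
<tr><td class="title">Searching the World Wide Web</td><td class="year">1998</td></tr>
</table>
"""

# A hand-written wrapper; a DeLa-style system would derive such a pattern
# automatically from the repeated structure of the response pages.
WRAPPER = re.compile(
    r'<td class="title">(?P<title>[^<]+)</td><td class="year">(?P<year>\d{4})</td>'
)

rows = [m.groupdict() for m in WRAPPER.finditer(RESULT_PAGE)]
for row in rows:
    print(row)  # e.g. {'title': 'Crawling the Hidden Web', 'year': '2001'}
```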
Journal Article

Structured databases on the web: observations and implications

TL;DR: This paper surveys this relatively unexplored frontier of the deep Web, measuring characteristics pertinent to both exploring and integrating structured Web sources, and concludes with several implications which, while necessarily subjective, might help shape research directions and solutions.
References
Book

Information Retrieval: Data Structures and Algorithms

TL;DR: For programmers and students interested in parsing text and automated indexing, it is the first collection in book form of the basic data structures and algorithms critical to the storage and retrieval of documents.
Journal Article

Focused crawling: a new approach to topic-specific Web resource discovery

TL;DR: A new hypertext resource discovery system called a Focused Crawler, which is robust against large perturbations in the starting set of URLs and capable of discovering valuable resources that lie dozens of links away from the start set, while carefully pruning the millions of pages that may fall within the same radius.
Journal Article

Accessibility of information on the web

TL;DR: As the Web becomes a major communications medium, the data on it must be made more accessible, and search engines need to do a better job of making it so.
Journal Article

Searching the World Wide Web

TL;DR: The coverage and recency of the major World Wide Web search engines were analyzed, yielding some surprising results, including a lower bound on the size of the indexable Web of 320 million pages.
Journal Article

Efficient crawling through URL ordering

TL;DR: In this paper, the authors study in what order a crawler should visit the URLs it has seen, in order to obtain more "important" pages first, and they show that a crawler with a good ordering scheme can obtain important pages significantly faster than one without.
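The core idea, fetching URLs in order of estimated importance, can be sketched with a priority-queue frontier. The snippet below keys the queue on known in-link counts, one of the importance metrics the paper considers; the fetch and extract_links callables are hypothetical placeholders, not part of the original work.

```python
# A minimal sketch of importance-ordered crawling: keep the frontier in a
# priority queue keyed by an importance estimate (here, the number of
# known in-links), so that "important" pages are fetched first.
import heapq
from collections import defaultdict

def crawl(seed_urls, fetch, extract_links, max_pages=100):
    in_links = defaultdict(int)                 # current in-link counts
    frontier = [(0, url) for url in seed_urls]  # (negative importance, url)
    heapq.heapify(frontier)
    visited = set()
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)                       # placeholder: retrieve the page
        for link in extract_links(page):        # placeholder: parse out-links
            in_links[link] += 1
            if link not in visited:
                # Re-insert with the updated estimate; stale entries are
                # skipped by the `visited` check above.
                heapq.heappush(frontier, (-in_links[link], link))
    return visited
```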