Journal Article

The Harvest information discovery and access system

TLDR
Harvest is a system that provides a scalable, customizable architecture for gathering, indexing, caching, replicating, and accessing Internet information.
Abstract
It is increasingly difficult to make effective use of Internet information, given the rapid growth in data volume, user base, and data diversity. In this paper we introduce Harvest, a system that provides a scalable, customizable architecture for gathering, indexing, caching, replicating, and accessing Internet information.


Citations
Journal Article

Summary cache: a scalable wide-area web cache sharing protocol

TL;DR: This paper demonstrates the benefits of cache sharing, measures the overhead of the existing protocols, and proposes a new protocol called "summary cache", which reduces the number of intercache protocol messages, reduces the bandwidth consumption, and eliminates 30% to 95% of the protocol CPU overhead, all while maintaining almost the same cache hit ratios as ICP.
Proceedings Article

Web mining: information and pattern discovery on the World Wide Web

TL;DR: This paper defines Web mining, presents an overview of the various research issues, techniques, and development efforts, briefly describes WEBMINER, a system for Web usage mining, and concludes by listing research issues.
Report

A hierarchical internet object cache

TL;DR: The design and performance of a hierarchical proxy-cache designed to make Internet information systems scale better are discussed, and performance measurements indicate that hierarchy does not measurably increase access latency.
Patent

Centrifugal communication and collaboration method

TL;DR: In this article, a system and method for communicating information among members of a distributed discussion group having peripheral communication devices involves communication between the peripheral devices and a central agent, where messages are retained in memory, thereby causing discussions to be maintained.
Journal Article

Database techniques for the World-Wide Web: a survey

TL;DR: The primary goal of this survey is to classify the different tasks to which database concepts have been applied, and to emphasize the technical innovations that were required to do so.
References
Proceedings Article

GLIMPSE: a tool to search through entire file systems

TL;DR: Glimpse is designed particularly for personal information, such as one's own file system, and supports many types of queries, flexible interaction, low overhead, and customization. All of these are important features of Glimpse.
Report

Harvest: A Scalable, Customizable Discovery and Access System

TL;DR: This paper introduces Harvest, a system that provides a set of customizable tools for gathering information from diverse repositories, building topic-specific content indexes, flexibly searching the indexes, widely replicating them, and caching objects as they are retrieved across the Internet.
Journal Article

Scalable Internet resource discovery: research problems and approaches

TL;DR: The authors indicate trends in three dimensions (data volume, user base, and data diversity), survey the problems these trends will create for current approaches, and suggest several promising directions for future resource discovery research, along with some initial results from projects carried out by members of the Internet Research Task Force Research Group on Resource Discovery and Directory Service.
Proceedings Article

A case for caching file objects inside internetworks

TL;DR: Evidence is presented that several judiciously placed file caches could reduce the volume of FTP traffic by 42%, and hence the volume of all NSFNET backbone traffic by 21%. If FTP client and server software automatically compressed data, the savings could increase to 27%.
Journal Article

Customized information extraction as a basis for resource discovery

TL;DR: This work presents a model for type-specific, user-customizable information extraction, and a system implementation called Essence, which can extract information from most of the types of files found in common file systems, including files with nested structure.