
Showing papers in "IEEE Data(base) Engineering Bulletin in 2000"


Journal Article
TL;DR: This work classifies the data quality problems addressed by data cleaning, provides an overview of the main solution approaches, and discusses current tool support for data cleaning.
Abstract: We classify data quality problems that are addressed by data cleaning and provide an overview of the main solution approaches. Data cleaning is especially required when integrating heterogeneous data sources and should be addressed together with schema-related data transformations. In data warehouses, data cleaning is a major part of the so-called ETL process. We also discuss current tool support for data cleaning.
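
To make the flavor of such instance-level cleaning concrete, here is a minimal sketch of a transform step that standardizes fields and collapses duplicates after normalization; the record layout, normalization rules, and abbreviation table are invented for illustration and are not taken from the survey.

```python
# Minimal data-cleaning sketch: field standardization plus duplicate
# elimination, as might occur in the transform step of an ETL pipeline.
# The record layout and normalization rules are illustrative assumptions.

def normalize(record):
    """Return a cleaned copy of a customer record."""
    cleaned = dict(record)
    cleaned["name"] = " ".join(record["name"].split()).title()
    cleaned["city"] = " ".join(record["city"].split()).title()
    # Resolve a few unstandardized abbreviations (instance-level problem).
    states = {"calif.": "CA", "ca": "CA", "n.y.": "NY", "ny": "NY"}
    cleaned["state"] = states.get(record["state"].strip().lower(),
                                  record["state"].strip().upper())
    return cleaned

def deduplicate(records):
    """Keep one record per (name, city, state) key after normalization."""
    seen = {}
    for rec in map(normalize, records):
        key = (rec["name"], rec["city"], rec["state"])
        seen.setdefault(key, rec)   # first occurrence wins
    return list(seen.values())

if __name__ == "__main__":
    dirty = [
        {"name": "john  SMITH", "city": "san jose", "state": "calif."},
        {"name": "John Smith",  "city": "San Jose", "state": "CA"},
    ]
    print(deduplicate(dirty))   # a single cleaned John Smith record
```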

1,675 citations


Journal Article
TL;DR: XJoin is a non-blocking join operator with a small memory footprint, allowing many such operators to be active in parallel; it is shown to be an effective solution for providing fast query responses to users even in the presence of slow and bursty remote sources.
Abstract: Wide-area distribution raises significant performance problems for traditional query processing techniques as data access becomes less predictable due to link congestion, load imbalances, and temporary outages. Pipelined query execution is a promising approach to coping with unpredictability in such environments as it allows scheduling to adjust to the arrival properties of the data. We have developed a non-blocking join operator, called XJoin, which has a small memory footprint, allowing many such operators to be active in parallel. XJoin is optimized to produce initial results quickly and can hide intermittent delays in data arrival by reactively scheduling background processing. We show that XJoin is an effective solution for providing fast query responses to users even in the presence of slow and bursty remote sources. 1 Wide-Area Query Processing. The explosive growth of the Internet and the World Wide Web has made tremendous amounts of data available on-line. Emerging standards such as XML, combined with wrapper technologies, address semantic challenges by providing relational-style interfaces to remote data. Beyond the issues of structure and semantics, however, there remain significant technical obstacles to building responsive, usable query processing systems for wide-area environments. A key performance issue that arises in such environments is response-time unpredictability. Data access over wide-area networks involves a large number of remote data sources, intermediate sites, and communications links, all of which are vulnerable to overloading, congestion, and failures. Such problems can cause significant and unpredictable delays in the access of information from remote sources. These delays, in turn, cause traditional distributed query processing strategies to break down, resulting in unresponsive and, hence, unusable systems. In previous work [AFTU96] we identified three classes of delays that can affect the responsiveness of query processing: 1) initial delay, in which there is a longer than expected wait until the first tuple arrives from a remote source; 2) slow delivery, in which data arrive at a fairly constant but slower than expected rate; and 3) bursty arrival, in which data arrive in a fluctuating manner. With traditional query processing techniques, query execution can become blocked even if only one of the accessed data sources experiences such delays.
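
To make the idea of a pipelined, non-blocking join concrete, the sketch below implements a plain symmetric hash join, the in-memory technique XJoin generalizes; it is not XJoin itself, which adds disk-resident partitions and reactive background processing, and the tuple layout is an invented example.

```python
# A minimal symmetric (pipelined, non-blocking) hash join sketch.
# Tuples from either input can arrive in any order; each arrival probes
# the other input's hash table, so results are produced incrementally
# instead of waiting for one input to finish.
from collections import defaultdict

def symmetric_hash_join(arrivals, key_left, key_right):
    """arrivals: iterable of ('L', tuple) or ('R', tuple) in arrival order."""
    left_table, right_table = defaultdict(list), defaultdict(list)
    for side, tup in arrivals:
        if side == "L":
            k = key_left(tup)
            left_table[k].append(tup)
            for match in right_table.get(k, []):
                yield tup, match            # emit a result immediately
        else:
            k = key_right(tup)
            right_table[k].append(tup)
            for match in left_table.get(k, []):
                yield match, tup

# Example: join orders and customers as they trickle in from remote sources.
stream = [("L", ("o1", "c1")), ("R", ("c2", "Bob")),
          ("L", ("o2", "c2")), ("R", ("c1", "Ann"))]
for result in symmetric_hash_join(stream, key_left=lambda t: t[1],
                                  key_right=lambda t: t[0]):
    print(result)
```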

332 citations


Journal Article
Steve Lawrence1
TL;DR: Next-generation search engines will make increasing use of context information, either by using explicit or implicit context information from users, or by implementing additional functionality within restricted contexts.
Abstract: Web search engines generally treat search requests in isolation. The results for a given query are identical, independent of the user or the context in which the user made the request. Next-generation search engines will make increasing use of context information, either by using explicit or implicit context information from users, or by implementing additional functionality within restricted contexts. Greater use of context in web search may help increase competition and diversity on the web.

306 citations


Journal Article
TL;DR: A survey of prior work on adaptive query processing is presented, focusing on three characterizations of adaptivity: the frequency of adaptivity, the effects of adaptivity, and the extent of adaptivity, to set the stage for research in the Telegraph project.
Abstract: As query engines are scaled and federated, they must cope with highly unpredictable and changeable environments. In the Telegraph project, we are attempting to architect and implement a continuously adaptive query engine suitable for global-area systems, massive parallelism, and sensor networks. To set the stage for our research, we present a survey of prior work on adaptive query processing, focusing on three characterizations of adaptivity: the frequency of adaptivity, the effects of adaptivity, and the extent of adaptivity. Given this survey, we sketch directions for research in the Telegraph project.

223 citations


Journal Article
TL;DR: A set of tools is presented for extracting data from web sites and transforming it into a structured data format, such as XML, so that the resulting data can be used to build new applications without having to deal with unstructured data.
Abstract: A critical problem in developing information agents for the Web is accessing data that is formatted for human use. We have developed a set of tools for extracting data from web sites and transforming it into a structured data format, such as XML. The resulting data can then be used to build new applications without having to deal with unstructured data. The advantages of our wrapping technology over previous work are the ability to learn highly accurate extraction rules, to verify the wrapper to ensure that the correct data continues to be extracted, and to automatically adapt to changes in the sites from which the data is being extracted.
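
As a much-simplified illustration of what such a wrapper produces, the sketch below applies a hand-written extraction rule to an invented page fragment and emits XML; the paper's tools learn, verify, and repair such rules automatically rather than relying on hand-written patterns.

```python
# Sketch of a wrapper that extracts structured data from an HTML fragment
# and emits XML. The page layout and the extraction rule are hypothetical.
import re
import xml.etree.ElementTree as ET

PAGE = """
<li><b>The Art of Computer Programming</b> - $189.99</li>
<li><b>Transaction Processing</b> - $79.50</li>
"""

# An "extraction rule": which pattern delimits each field of a record.
RULE = re.compile(r"<b>(?P<title>.*?)</b>\s*-\s*\$(?P<price>[\d.]+)")

def wrap(page_text):
    root = ET.Element("books")
    for match in RULE.finditer(page_text):
        book = ET.SubElement(root, "book")
        ET.SubElement(book, "title").text = match.group("title")
        ET.SubElement(book, "price").text = match.group("price")
    return ET.tostring(root, encoding="unicode")

print(wrap(PAGE))
# <books><book><title>...</title><price>189.99</price></book>...</books>
```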

176 citations


Journal Article
Monika Henzinger1
TL;DR: This survey describes two successful link analysis algorithms and the state of the art of the field.
Abstract: The analysis of the hyperlink structure of the web has led to significant improvements in web information retrieval. This survey describes two successful link analysis algorithms and the state of the art of the field.
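
For concreteness, here is a toy power-iteration implementation of PageRank, one of the best-known link analysis algorithms of the kind such surveys cover; the damping factor, iteration count, and graph are illustrative choices, not values from the article.

```python
# Toy PageRank by power iteration: a page is important if important pages
# link to it. Graph, damping factor, and iteration count are illustrative.
def pagerank(graph, damping=0.85, iterations=50):
    pages = list(graph)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in graph.items():
            if not outlinks:                  # dangling page: spread evenly
                share = damping * rank[page] / len(pages)
                for p in pages:
                    new_rank[p] += share
            else:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank

web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank(web))   # "c" ends up with the highest score
```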

130 citations


Journal Article
TL;DR: A new way to support task-based site search is to dynamically present appropriate metadata that organizes the search results and suggests what to look at next, as a personalized intermixing of search and hypertext.
Abstract: The current state of web search is most successful at directing users to appropriate web sites. Once at the site, the user has a choice of following hyperlinks or using site search, but the latter is notoriously problematic. One solution is to develop specialized search interfaces that explicitly support the types of tasks users perform using the information specific to the site. A new way to support task-based site search is to dynamically present appropriate metadata that organizes the search results and suggests what to look at next, as a personalized intermixing of search and hypertext.
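
A minimal sketch of the underlying idea follows: group a result list by metadata facets so the interface can suggest what to look at next. The documents and facet names are invented, and the real system's metadata model and interface are considerably richer.

```python
# Sketch: group a result list by metadata facets so the interface can show
# "what to look at next". Documents and facet names are invented examples.
from collections import defaultdict

results = [
    {"title": "Installing the driver", "product": "printer", "doctype": "how-to"},
    {"title": "Printer driver FAQ",    "product": "printer", "doctype": "faq"},
    {"title": "Scanner quick start",   "product": "scanner", "doctype": "how-to"},
]

def facet_counts(hits, facet):
    groups = defaultdict(list)
    for hit in hits:
        groups[hit[facet]].append(hit["title"])
    return dict(groups)

# The interface could render these groups as links that refine the search.
print(facet_counts(results, "product"))
print(facet_counts(results, "doctype"))
```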

107 citations


Journal Article
TL;DR: A system for detecting approximate duplicate records in a database is reviewed, and the properties that a pair-wise record matching algorithm must have for a successful duplicate detection system are identified.
Abstract: Detecting database records that are approximate duplicates, but not exact duplicates, is an important task. Databases may contain duplicate records concerning the same real-world entity because of data entry errors, unstandardized abbreviations, or differences in the detailed schemas of records from multiple databases – such as what happens in data warehousing where records from multiple data sources are integrated into a single source of information – among other reasons. In this paper we review a system to detect approximate duplicate records in a database and provide properties that a pair-wise record matching algorithm must have in order to have a successful duplicate detection system.
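
A minimal pair-wise matching sketch in the spirit of the abstract: compare two records field by field with a string similarity measure and declare a duplicate when a weighted score passes a threshold. The weights, similarity function, and threshold are illustrative assumptions, not the reviewed system's actual algorithm.

```python
# Sketch of pair-wise approximate record matching: compare records field by
# field and declare a duplicate when the weighted similarity passes a
# threshold. Weights, measure, and threshold are illustrative assumptions.
from difflib import SequenceMatcher

def field_similarity(a, b):
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def is_duplicate(rec1, rec2, weights, threshold=0.85):
    score = sum(w * field_similarity(rec1[f], rec2[f])
                for f, w in weights.items())
    return score >= threshold

weights = {"name": 0.5, "street": 0.3, "city": 0.2}   # must sum to 1.0
r1 = {"name": "Jon Smith",  "street": "12 Main St.",    "city": "Springfield"}
r2 = {"name": "John Smith", "street": "12 Main Street", "city": "Springfield"}
print(is_duplicate(r1, r2, weights))   # True: likely the same entity
```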

90 citations


Journal Article
TL;DR: This paper proposes a dynamic query processing architecture that includes three dynamic layers: the dynamic query optimizer, the scheduler, and the query evaluator; this design significantly reduces the overheads of the dynamic strategies.
Abstract: Execution plans produced by traditional query optimizers for data integration queries may yield poor performance for several reasons. The cost estimates may be inaccurate, the memory available at run-time may be insufficient, or the data delivery rate can be unpredictable. All these problems have led database researchers and implementors to resort to dynamic strategies to correct or adapt the static query execution plan (QEP). In this paper, we identify the different basic techniques that must be integrated in a dynamic query engine. Following our recent work [6] on the problem of unpredictable data arrival rates, we propose a dynamic query processing architecture which includes three dynamic layers: the dynamic query optimizer, the scheduler, and the query evaluator. Having a three-layer dynamic architecture makes it possible to significantly reduce the overheads of the dynamic strategies.
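
A structural sketch of the three-layer idea follows: the evaluator executes operators and reports delays, the scheduler reorders the remaining work cheaply, and the optimizer is asked to re-plan only when needed. All class names and interfaces are invented placeholders rather than the paper's design.

```python
# Structural sketch of a three-layer dynamic query engine: the query
# evaluator executes operators and reports delays, the scheduler reorders
# the remaining work, and the dynamic optimizer re-plans on request.
# All names and interfaces are invented placeholders.

class DynamicOptimizer:
    def replan(self, plan, delayed_ops):
        # Simplistic re-planning: run operators that have not stalled first.
        return sorted(plan, key=lambda op: op in delayed_ops)

class Scheduler:
    def __init__(self, optimizer):
        self.optimizer = optimizer

    def run(self, plan, evaluator):
        delayed_ops, deferred = set(), set()
        while plan:
            op = plan.pop(0)
            stalled = evaluator.execute(op)
            if stalled:
                delayed_ops.add(op)
                if plan and op not in deferred:
                    deferred.add(op)          # defer each operator at most once
                    plan.append(op)
                    plan = self.optimizer.replan(plan, delayed_ops)

class Evaluator:
    def execute(self, op):
        print("running", op)
        return op == "scan_remote"            # pretend this source stalls

Scheduler(DynamicOptimizer()).run(["scan_remote", "scan_local", "join"],
                                  Evaluator())
# runs scan_local and join while the slow remote scan is pushed back once
```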

75 citations


Journal Article
Zachary G. Ives, Alon Y. Levy, Daniel S. Weld, Daniela Florescu, Marc Friedman
TL;DR: The goal of the Tukwila project at the University of Washington is to design a query processing system that supports a range of adaptive techniques that are configurable for different query processing contexts.
Abstract: As the area of data management for the Internet has gained in popularity, recent work has focused on effectively dealing with unpredictable, dynamic data volumes and transfer rates using adaptive query processing techniques. Important requirements of the Internet domain include: (1) the ability to process XML data as it streams in from the network, in addition to working on locally stored data; (2) dynamic scheduling of operators to adjust to I/O delays and flow rates; (3) sharing and re-use of data across multiple queries, where possible; and (4) the ability to output results and later update them. An equally important consideration is the high degree of variability in performance needs for different query processing domains: perhaps an ad-hoc query application should optimize for display of incomplete and partial incremental results, whereas a corporate data integration application may need the best time-to-completion and may have very strict data "freshness" guarantees. The goal of the Tukwila project at the University of Washington is to design a query processing system that supports a range of adaptive techniques that are configurable for different query processing contexts.

73 citations




Journal Article
TL;DR: A search process is described where the input is the URL of a page and the output is a ranked set of topics on which the page has a reputation, along with a simple formulation of the notion of reputation of a page on a topic.
Abstract: The textual content of the Web enriched with the hyperlink structure surrounding it can be a useful source of information for querying and searching. This paper presents a search process where the input is the URL of a page, and the output is a ranked set of topics on which the page has a reputation. For example, if the input is www.gamelan.com, then a possible output is “Java.” We describe a simple formulation of the notion of reputation of a page on a topic, and report some experiences in the use of this formulation.
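
A toy rendering of one simple way such a reputation score could be computed: rank topics by how many topic-labeled pages link to the target page, weighting each link by the linking page's out-degree. The data and the scoring rule are illustrative assumptions and not necessarily the paper's exact formulation.

```python
# Toy sketch: rank topics for a page by how many pages on each topic link
# to it, weighting each incoming link by the linking page's out-degree.
# Data and the exact scoring rule are illustrative assumptions only.
from collections import defaultdict

links = {                       # page -> pages it links to
    "t1.example.com": ["www.gamelan.com", "other.example.com"],
    "t2.example.com": ["www.gamelan.com"],
    "t3.example.com": ["other.example.com"],
}
topics = {                      # page -> dominant topic of its text
    "t1.example.com": "Java",
    "t2.example.com": "Java",
    "t3.example.com": "gardening",
}

def reputation(target):
    scores = defaultdict(float)
    for page, outlinks in links.items():
        if target in outlinks:
            scores[topics[page]] += 1.0 / len(outlinks)
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(reputation("www.gamelan.com"))   # [('Java', 1.5)]
```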

Journal Article
TL;DR: A common theme in this discussion is the need for very robust methods for finding relevant information, extracting data from pages, and integrating information taken from multiple sources, and the importance of statistical learning methods as a tool for creating such robust methods.
Abstract: In a traditional information retrieval system, it is assumed that queries can be posed about any topic. In reality, a large fraction of web queries are posed about a relatively small number of topics, like products, entertainment, current events, and so on. One way of exploiting this sort of regularity in web search is to build, from the information found on the web, comprehensive databases about specific topics. An appropriate interface to such a database can support complex structured queries which are impossible to answer with traditional topic-independent query methods. Here we discuss three case studies for this “data-centric” approach to web search. A common theme in this discussion is the need for very robust methods for finding relevant information, extracting data from pages, and integrating information taken from multiple sources, and the importance of statistical learning methods as a tool for creating such robust methods.

Journal Article
TL;DR: Four classes of “top-down” algorithms for detecting mirrored host pairs (that is, algorithms based on page attributes such as URL, IP address, and hyperlinks between pages, and not on the page content) are evaluated on a collection of 140 million URLs (on 230,000 hosts) and their associated connectivity information.
Abstract: We compare several algorithms for identifying mirrored hosts on the World Wide Web. The algorithms operate on the basis of URL strings and linkage data: the type of information about Web pages easily available from Web proxies and crawlers. Identification of mirrored hosts can improve Web-based information retrieval in several ways: first, by identifying mirrored hosts, search engines can avoid storing and returning duplicate documents. Second, several new information retrieval techniques for the Web make inferences based on the explicit links among hypertext documents; mirroring perturbs their graph model and degrades performance. Third, mirroring information can be used to redirect users to alternate mirror sites to compensate for various failures, and can thus improve the performance of Web browsers and proxies. We evaluated four classes of “top-down” algorithms for detecting mirrored host pairs (that is, algorithms that are based on page attributes such as URL, IP address, and hyperlinks between pages, and not on the page content) on a collection of 140 million URLs (on 230,000 hosts) and their associated connectivity information. Our best approach, which combines five algorithms, achieved a precision of 0.57 for a recall of 0.86 when considering 100,000 ranked host pairs.
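
A tiny sketch of the URL-string side of such top-down detection: hosts whose sets of URL paths overlap heavily become mirror candidates, with no page content examined. The Jaccard measure and the threshold are illustrative choices; the paper's algorithms are more refined and are combined for the best results.

```python
# Sketch: flag candidate mirrored host pairs from URL strings alone, by
# comparing the sets of paths seen on each host (no page content needed).
# The Jaccard measure and the 0.8 threshold are illustrative choices.
from itertools import combinations

def host_paths(urls):
    paths = {}
    for url in urls:
        host, _, path = url.partition("/")
        paths.setdefault(host, set()).add("/" + path)
    return paths

def mirror_candidates(urls, threshold=0.8):
    paths = host_paths(urls)
    pairs = []
    for h1, h2 in combinations(sorted(paths), 2):
        overlap = len(paths[h1] & paths[h2]) / len(paths[h1] | paths[h2])
        if overlap >= threshold:
            pairs.append((h1, h2, round(overlap, 2)))
    return pairs

crawl = ["www.foo.org/docs/a.html", "www.foo.org/docs/b.html",
         "mirror.bar.net/docs/a.html", "mirror.bar.net/docs/b.html",
         "unrelated.com/index.html"]
print(mirror_candidates(crawl))
# [('mirror.bar.net', 'www.foo.org', 1.0)]
```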


Journal Article
TL;DR: An integrated 8-process value chain needed by an e-commerce system, along with the data associated with each stage of the value chain, is presented, and the logical components of a typical e-commerce database system are discussed.
Abstract: This paper discusses the structure and components of databases for real-world e-commerce systems. We first present an integrated 8-process value chain needed by the e-commerce system and its associated data in each stage of the value chain. We then discuss logical components of a typical e-commerce database system. Finally, we illustrate a detailed design of an e-commerce transaction processing system and comment on a few design considerations specific to e-commerce database systems, such as the primary key, foreign key, outer join, use of weak entity, and schema partition. Understanding the structure of e-commerce database systems will help database designers effectively develop and maintain e-commerce systems.

Journal Article
TL;DR: An overview of data mining techniques for personalization is presented, covering standard techniques used to increase a system's ability to tailor itself to specific user behavior.
Abstract: This paper presents an overview of data mining techniques for personalization. It discusses some of the standard techniques which are used in order to adapt and increase the ability of the system to tailor itself to specific user behavior. We discuss several such techniques, including collaborative filtering, content-based methods, and content-based collaborative filtering methods. We examine the specific applicability of these techniques to various scenarios and the broad advantages of each in specific situations.
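
To make one of the listed techniques concrete, here is a minimal user-based collaborative filtering sketch with toy ratings, cosine similarity, and a single nearest neighbour; it illustrates the general idea rather than any particular system discussed in the paper.

```python
# Minimal user-based collaborative filtering: recommend items liked by the
# users whose rating vectors are most similar to the target user's.
# Ratings data and the similarity/aggregation choices are toy assumptions.
from math import sqrt

ratings = {
    "alice": {"item1": 5, "item2": 3, "item3": 4},
    "bob":   {"item1": 5, "item2": 2, "item4": 5},
    "carol": {"item2": 5, "item3": 1, "item4": 2},
}

def cosine(u, v):
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    return dot / (sqrt(sum(x * x for x in u.values())) *
                  sqrt(sum(x * x for x in v.values())))

def recommend(user, k=1):
    others = [(cosine(ratings[user], ratings[o]), o)
              for o in ratings if o != user]
    top = [o for _, o in sorted(others, reverse=True)[:k]]
    seen = set(ratings[user])
    candidates = {i: ratings[o][i] for o in top for i in ratings[o]
                  if i not in seen}
    return sorted(candidates, key=candidates.get, reverse=True)

print(recommend("alice"))   # ['item4'], borrowed from the most similar user
```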




Journal Article
TL;DR: An electronic business application framework is presented that allows customers and service providers to be linked through XML/EDI with back-end applications and to trigger e-procurement orders sent using either XML or EDI messages.
Abstract: In this paper, we discuss an electronic business application framework and its related architecture. The framework is presented in the form of a prototype system which illustrates how XML tools can assist organizations in building and deploying e-commerce applications. The use of the system is presented in the form of a sample e-commerce application that involves shopping in a virtual store. The framework allows customers and service providers to be linked through XML/EDI with back-end applications and to trigger e-procurement orders which are sent using either XML or EDI messages. Invoice handling and order tracking are also discussed, along with extensions to the proposed architecture that allow business process logic to be encoded at the server side in the form of Event Condition Action scripts.
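
As a sketch of what server-side Event Condition Action logic might look like in such a framework (the events, conditions, and actions here are invented examples, not the prototype's actual scripts):

```python
# Event-Condition-Action sketch: when an event arrives, each rule whose
# condition holds fires its action. Rules and events are invented examples.

RULES = [
    {   # re-order stock when a purchase depletes inventory
        "event": "order_placed",
        "condition": lambda e: e["stock_after"] < 10,
        "action": lambda e: print(f"send e-procurement order for {e['sku']}"),
    },
    {   # notify the customer when an invoice is issued
        "event": "invoice_created",
        "condition": lambda e: True,
        "action": lambda e: print(f"email invoice {e['invoice_id']}"),
    },
]

def dispatch(event_type, payload):
    for rule in RULES:
        if rule["event"] == event_type and rule["condition"](payload):
            rule["action"](payload)

dispatch("order_placed", {"sku": "A-42", "stock_after": 3})
dispatch("invoice_created", {"invoice_id": "INV-2000-17"})
```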