
Showing papers in "IEEE Data(base) Engineering Bulletin in 2000"


Journal Article
TL;DR: This work classifies the data quality problems addressed by data cleaning, provides an overview of the main solution approaches, and discusses current tool support for data cleaning.
Abstract: We classify data quality problems that are addressed by data cleaning and provide an overview of the main solution approaches. Data cleaning is especially required when integrating heterogeneous data sources and should be addressed together with schema-related data transformations. In data warehouses, data cleaning is a major part of the so-called ETL process. We also discuss current tool support for data cleaning.
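
To make the flavor of such instance-level cleaning concrete, here is a minimal sketch of a transform step that standardizes fields and collapses duplicates after normalization; the record layout, normalization rules, and abbreviation table are invented for illustration and are not taken from the survey.

```python
# Minimal data-cleaning sketch: field standardization plus duplicate
# elimination, as might occur in the transform step of an ETL pipeline.
# The record layout and normalization rules are illustrative assumptions.

def normalize(record):
    """Return a cleaned copy of a customer record."""
    cleaned = dict(record)
    cleaned["name"] = " ".join(record["name"].split()).title()
    cleaned["city"] = " ".join(record["city"].split()).title()
    # Resolve a few unstandardized abbreviations (instance-level problem).
    states = {"calif.": "CA", "ca": "CA", "n.y.": "NY", "ny": "NY"}
    cleaned["state"] = states.get(record["state"].strip().lower(),
                                  record["state"].strip().upper())
    return cleaned

def deduplicate(records):
    """Keep one record per (name, city, state) key after normalization."""
    seen = {}
    for rec in map(normalize, records):
        key = (rec["name"], rec["city"], rec["state"])
        seen.setdefault(key, rec)   # first occurrence wins
    return list(seen.values())

if __name__ == "__main__":
    dirty = [
        {"name": "john  SMITH", "city": "san jose", "state": "calif."},
        {"name": "John Smith",  "city": "San Jose", "state": "CA"},
    ]
    print(deduplicate(dirty))   # a single cleaned John Smith record
```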

1,675 citations


Journal Article
TL;DR: XJoin is a non-blocking join operator with a small memory footprint, allowing many such operators to be active in parallel; it is shown to be an effective solution for providing fast query responses to users even in the presence of slow and bursty remote sources.
Abstract: Wide-area distribution raises significant performance problems for traditional query processing techniques as data access becomes less predictable due to link congestion, load imbalances, and temporary outages. Pipelined query execution is a promising approach to coping with unpredictability in such environments as it allows scheduling to adjust to the arrival properties of the data. We have developed a non-blocking join operator, called XJoin, which has a small memory footprint, allowing many such operators to be active in parallel. XJoin is optimized to produce initial results quickly and can hide intermittent delays in data arrival by reactively scheduling background processing. We show that XJoin is an effective solution for providing fast query responses to users even in the presence of slow and bursty remote sources. 1 Wide-Area Query Processing. The explosive growth of the Internet and the World Wide Web has made tremendous amounts of data available on-line. Emerging standards such as XML, combined with wrapper technologies, address semantic challenges by providing relational-style interfaces to remote data. Beyond the issues of structure and semantics, however, there remain significant technical obstacles to building responsive, usable query processing systems for wide-area environments. A key performance issue that arises in such environments is response-time unpredictability. Data access over wide-area networks involves a large number of remote data sources, intermediate sites, and communications links, all of which are vulnerable to overloading, congestion, and failures. Such problems can cause significant and unpredictable delays in the access of information from remote sources. These delays, in turn, cause traditional distributed query processing strategies to break down, resulting in unresponsive and, hence, unusable systems. In previous work [AFTU96] we identified three classes of delays that can affect the responsiveness of query processing: 1) initial delay, in which there is a longer than expected wait until the first tuple arrives from a remote source; 2) slow delivery, in which data arrive at a fairly constant but slower than expected rate; and 3) bursty arrival, in which data arrive in a fluctuating manner. With traditional query processing techniques, query execution can become blocked even if only one of the accessed data sources experiences such delays.
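
To make the idea of a pipelined, non-blocking join concrete, the sketch below implements a plain symmetric hash join, the in-memory technique XJoin generalizes; it is not XJoin itself, which adds disk-resident partitions and reactive background processing, and the tuple layout is an invented example.

```python
# A minimal symmetric (pipelined, non-blocking) hash join sketch.
# Tuples from either input can arrive in any order; each arrival probes
# the other input's hash table, so results are produced incrementally
# instead of waiting for one input to finish.
from collections import defaultdict

def symmetric_hash_join(arrivals, key_left, key_right):
    """arrivals: iterable of ('L', tuple) or ('R', tuple) in arrival order."""
    left_table, right_table = defaultdict(list), defaultdict(list)
    for side, tup in arrivals:
        if side == "L":
            k = key_left(tup)
            left_table[k].append(tup)
            for match in right_table.get(k, []):
                yield tup, match            # emit a result immediately
        else:
            k = key_right(tup)
            right_table[k].append(tup)
            for match in left_table.get(k, []):
                yield match, tup

# Example: join orders and customers as they trickle in from remote sources.
stream = [("L", ("o1", "c1")), ("R", ("c2", "Bob")),
          ("L", ("o2", "c2")), ("R", ("c1", "Ann"))]
for result in symmetric_hash_join(stream, key_left=lambda t: t[1],
                                  key_right=lambda t: t[0]):
    print(result)
```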

332 citations


Journal Article
Steve Lawrence1
TL;DR: Next-generation search engines will make increasing use of context information, either by using explicit or implicit context information from users, or by implementing additional functionality within restricted contexts.
Abstract: Web search engines generally treat search requests in isolation. The results for a given query are identical, independent of the user or the context in which the user made the request. Next-generation search engines will make increasing use of context information, either by using explicit or implicit context information from users, or by implementing additional functionality within restricted contexts. Greater use of context in web search may help increase competition and diversity on the web.

306 citations


Journal Article
TL;DR: A survey of prior work on adaptive query processing is presented, focusing on three characterizations of adaptivity: the frequency of adaptivity, the effects of adaptivity, and the extent of adaptivity, to set the stage for research in the Telegraph project.
Abstract: As query engines are scaled and federated, they must cope with highly unpredictable and changeable environments. In the Telegraph project, we are attempting to architect and implement a continuously adaptive query engine suitable for global-area systems, massive parallelism, and sensor networks. To set the stage for our research, we present a survey of prior work on adaptive query processing, focusing on three characterizations of adaptivity: the frequency of adaptivity, the effects of adaptivity, and the extent of adaptivity. Given this survey, we sketch directions for research in the Telegraph project.

223 citations


Journal Article
TL;DR: A set of tools is presented for extracting data from web sites and transforming it into a structured data format, such as XML, so that the resulting data can be used to build new applications without having to deal with unstructured data.
Abstract: A critical problem in developing information agents for the Web is accessing data that is formatted for human use. We have developed a set of tools for extracting data from web sites and transforming it into a structured data format, such as XML. The resulting data can then be used to build new applications without having to deal with unstructured data. The advantages of our wrapping technology over previous work are the ability to learn highly accurate extraction rules, to verify the wrapper to ensure that the correct data continues to be extracted, and to automatically adapt to changes in the sites from which the data is being extracted.
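
As a much-simplified illustration of what such a wrapper produces, the sketch below applies a hand-written extraction rule to an invented page fragment and emits XML; the paper's tools learn, verify, and repair such rules automatically rather than relying on hand-written patterns.

```python
# Sketch of a wrapper that extracts structured data from an HTML fragment
# and emits XML. The page layout and the extraction rule are hypothetical.
import re
import xml.etree.ElementTree as ET

PAGE = """
<li><b>The Art of Computer Programming</b> - $189.99</li>
<li><b>Transaction Processing</b> - $79.50</li>
"""

# An "extraction rule": which pattern delimits each field of a record.
RULE = re.compile(r"<b>(?P<title>.*?)</b>\s*-\s*\$(?P<price>[\d.]+)")

def wrap(page_text):
    root = ET.Element("books")
    for match in RULE.finditer(page_text):
        book = ET.SubElement(root, "book")
        ET.SubElement(book, "title").text = match.group("title")
        ET.SubElement(book, "price").text = match.group("price")
    return ET.tostring(root, encoding="unicode")

print(wrap(PAGE))
# <books><book><title>...</title><price>189.99</price></book>...</books>
```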

176 citations


Journal Article
Monika Henzinger1
TL;DR: This survey describes two successful link analysis algorithms and the state of the art of the field.
Abstract: The analysis of the hyperlink structure of the web has led to significant improvements in web information retrieval. This survey describes two successful link analysis algorithms and the state of the art of the field.
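
For concreteness, here is a toy power-iteration implementation of PageRank, one of the best-known link analysis algorithms of the kind such surveys cover; the damping factor, iteration count, and graph are illustrative choices, not values from the article.

```python
# Toy PageRank by power iteration: a page is important if important pages
# link to it. Graph, damping factor, and iteration count are illustrative.
def pagerank(graph, damping=0.85, iterations=50):
    pages = list(graph)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in graph.items():
            if not outlinks:                  # dangling page: spread evenly
                share = damping * rank[page] / len(pages)
                for p in pages:
                    new_rank[p] += share
            else:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank

web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank(web))   # "c" ends up with the highest score
```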

130 citations


Journal Article
TL;DR: A new way to support task-based site search is to dynamically present appropriate metadata that organizes the search results and suggests what to look at next, as a personalized intermixing of search and hypertext.
Abstract: The current state of web search is most successful at directing users to appropriate web sites. Once at the site, the user has a choice of following hyperlinks or using site search, but the latter is notoriously problematic. One solution is to develop specialized search interfaces that explicitly support the types of tasks users perform using the information specific to the site. A new way to support task-based site search is to dynamically present appropriate metadata that organizes the search results and suggests what to look at next, as a personalized intermixing of search and hypertext.
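
A minimal sketch of the underlying idea follows: group a result list by metadata facets so the interface can suggest what to look at next. The documents and facet names are invented, and the real system's metadata model and interface are considerably richer.

```python
# Sketch: group a result list by metadata facets so the interface can show
# "what to look at next". Documents and facet names are invented examples.
from collections import defaultdict

results = [
    {"title": "Installing the driver", "product": "printer", "doctype": "how-to"},
    {"title": "Printer driver FAQ",    "product": "printer", "doctype": "faq"},
    {"title": "Scanner quick start",   "product": "scanner", "doctype": "how-to"},
]

def facet_counts(hits, facet):
    groups = defaultdict(list)
    for hit in hits:
        groups[hit[facet]].append(hit["title"])
    return dict(groups)

# The interface could render these groups as links that refine the search.
print(facet_counts(results, "product"))
print(facet_counts(results, "doctype"))
```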

107 citations


Journal Article
TL;DR: A system for detecting approximate duplicate records in a database is reviewed, and the properties that a pair-wise record matching algorithm must have for a successful duplicate detection system are identified.
Abstract: Detecting database records that are approximate duplicates, but not exact duplicates, is an important task. Databases may contain duplicate records concerning the same real-world entity because of data entry errors, unstandardized abbreviations, or differences in the detailed schemas of records from multiple databases – such as what happens in data warehousing where records from multiple data sources are integrated into a single source of information – among other reasons. In this paper we review a system to detect approximate duplicate records in a database and provide properties that a pair-wise record matching algorithm must have in order to have a successful duplicate detection system.
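
A minimal pair-wise matching sketch in the spirit of the abstract: compare two records field by field with a string similarity measure and declare a duplicate when a weighted score passes a threshold. The weights, similarity function, and threshold are illustrative assumptions, not the reviewed system's actual algorithm.

```python
# Sketch of pair-wise approximate record matching: compare records field by
# field and declare a duplicate when the weighted similarity passes a
# threshold. Weights, measure, and threshold are illustrative assumptions.
from difflib import SequenceMatcher

def field_similarity(a, b):
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def is_duplicate(rec1, rec2, weights, threshold=0.85):
    score = sum(w * field_similarity(rec1[f], rec2[f])
                for f, w in weights.items())
    return score >= threshold

weights = {"name": 0.5, "street": 0.3, "city": 0.2}   # must sum to 1.0
r1 = {"name": "Jon Smith",  "street": "12 Main St.",    "city": "Springfield"}
r2 = {"name": "John Smith", "street": "12 Main Street", "city": "Springfield"}
print(is_duplicate(r1, r2, weights))   # True: likely the same entity
```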

90 citations


Journal Article
TL;DR: This paper proposes a dynamic query processing architecture that includes three dynamic layers: the dynamic query optimizer, the scheduler, and the query evaluator; this design significantly reduces the overheads of the dynamic strategies.
Abstract: Execution plans produced by traditional query optimizers for data integration queries may yield poor performance for several reasons. The cost estimates may be inaccurate, the memory available at run-time may be insufficient, or the data delivery rate can be unpredictable. All these problems have led database researchers and implementors to resort to dynamic strategies to correct or adapt the static query execution plan (QEP). In this paper, we identify the different basic techniques that must be integrated in a dynamic query engine. Following our recent work [6] on the problem of unpredictable data arrival rates, we propose a dynamic query processing architecture which includes three dynamic layers: the dynamic query optimizer, the scheduler, and the query evaluator. Having a three-layer dynamic architecture makes it possible to significantly reduce the overheads of the dynamic strategies.
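
A structural sketch of the three-layer idea follows: the evaluator executes operators and reports delays, the scheduler reorders the remaining work cheaply, and the optimizer is asked to re-plan only when needed. All class names and interfaces are invented placeholders rather than the paper's design.

```python
# Structural sketch of a three-layer dynamic query engine: the query
# evaluator executes operators and reports delays, the scheduler reorders
# the remaining work, and the dynamic optimizer re-plans on request.
# All names and interfaces are invented placeholders.

class DynamicOptimizer:
    def replan(self, plan, delayed_ops):
        # Simplistic re-planning: run operators that have not stalled first.
        return sorted(plan, key=lambda op: op in delayed_ops)

class Scheduler:
    def __init__(self, optimizer):
        self.optimizer = optimizer

    def run(self, plan, evaluator):
        delayed_ops, deferred = set(), set()
        while plan:
            op = plan.pop(0)
            stalled = evaluator.execute(op)
            if stalled:
                delayed_ops.add(op)
                if plan and op not in deferred:
                    deferred.add(op)          # defer each operator at most once
                    plan.append(op)
                    plan = self.optimizer.replan(plan, delayed_ops)

class Evaluator:
    def execute(self, op):
        print("running", op)
        return op == "scan_remote"            # pretend this source stalls

Scheduler(DynamicOptimizer()).run(["scan_remote", "scan_local", "join"],
                                  Evaluator())
# runs scan_local and join while the slow remote scan is pushed back once
```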

75 citations


Journal Article
Zachary G. Ives, Alon Y. Levy, Daniel S. Weld, Daniela Florescu, Marc Friedman
TL;DR: The goal of the Tukwila project at the University of Washington is to design a query processing system that supports a range of adaptive techniques that are configurable for different query processing contexts.
Abstract: As the area of data management for the Internet has gained in popularity, recent work has focused on effectively dealing with unpredictable, dynamic data volumes and transfer rates using adaptive query processing techniques. Important requirements of the Internet domain include: (1) the ability to process XML data as it streams in from the network, in addition to working on locally stored data; (2) dynamic scheduling of operators to adjust to I/O delays and flow rates; (3) sharing and re-use of data across multiple queries, where possible; and (4) the ability to output results and later update them. An equally important consideration is the high degree of variability in performance needs for different query processing domains: perhaps an ad-hoc query application should optimize for display of incomplete and partial incremental results, whereas a corporate data integration application may need the best time-to-completion and may have very strict data "freshness" guarantees. The goal of the Tukwila project at the University of Washington is to design a query processing system that supports a range of adaptive techniques that are configurable for different query processing contexts.

73 citations




Journal Article
TL;DR: A search process is described where the input is the URL of a page and the output is a ranked set of topics on which the page has a reputation, along with a simple formulation of the notion of reputation of a page on a topic.
Abstract: The textual content of the Web enriched with the hyperlink structure surrounding it can be a useful source of information for querying and searching. This paper presents a search process where the input is the URL of a page, and the output is a ranked set of topics on which the page has a reputation. For example, if the input is www.gamelan.com, then a possible output is “Java.” We describe a simple formulation of the notion of reputation of a page on a topic, and report some experiences in the use of this formulation.
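
A toy rendering of one simple way such a reputation score could be computed: rank topics by how many topic-labeled pages link to the target page, weighting each link by the linking page's out-degree. The data and the scoring rule are illustrative assumptions and not necessarily the paper's exact formulation.

```python
# Toy sketch: rank topics for a page by how many pages on each topic link
# to it, weighting each incoming link by the linking page's out-degree.
# Data and the exact scoring rule are illustrative assumptions only.
from collections import defaultdict

links = {                       # page -> pages it links to
    "t1.example.com": ["www.gamelan.com", "other.example.com"],
    "t2.example.com": ["www.gamelan.com"],
    "t3.example.com": ["other.example.com"],
}
topics = {                      # page -> dominant topic of its text
    "t1.example.com": "Java",
    "t2.example.com": "Java",
    "t3.example.com": "gardening",
}

def reputation(target):
    scores = defaultdict(float)
    for page, outlinks in links.items():
        if target in outlinks:
            scores[topics[page]] += 1.0 / len(outlinks)
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(reputation("www.gamelan.com"))   # [('Java', 1.5)]
```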

Journal Article
TL;DR: A common theme in this discussion is the need for very robust methods for finding relevant information, extracting data from pages, and integrating information taken from multiple sources, and the importance of statistical learning methods as a tool for creating such robust methods.
Abstract: In a traditional information retrieval system, it is assumed that queries can be posed about any topic. In reality, a large fraction of web queries are posed about a relatively small number of topics, like products, entertainment, current events, and so on. One way of exploiting this sort of regularity in web search is to build, from the information found on the web, comprehensive databases about specific topics. An appropriate interface to such a database can support complex structured queries which are impossible to answer with traditional topic-independent query methods. Here we discuss three case studies for this “data-centric” approach to web search. A common theme in this discussion is the need for very robust methods for finding relevant information, extracting data from pages, and integrating information taken from multiple sources, and the importance of statistical learning methods as a tool for creating such robust methods.

Journal Article
TL;DR: Four classes of “top-down” algorithms for detecting mirrored host pairs (that is, algorithms based on page attributes such as URL, IP address, and hyperlinks between pages, and not on the page content) are evaluated on a collection of 140 million URLs (on 230,000 hosts) and their associated connectivity information.
Abstract: We compare several algorithms for identifying mirrored hosts on the World Wide Web. The algorithms operate on the basis of URL strings and linkage data: the type of information about Web pages easily available from Web proxies and crawlers. Identification of mirrored hosts can improve Web-based information retrieval in several ways: first, by identifying mirrored hosts, search engines can avoid storing and returning duplicate documents. Second, several new information retrieval techniques for the Web make inferences based on the explicit links among hypertext documents; mirroring perturbs their graph model and degrades performance. Third, mirroring information can be used to redirect users to alternate mirror sites to compensate for various failures, and can thus improve the performance of Web browsers and proxies. We evaluated four classes of “top-down” algorithms for detecting mirrored host pairs (that is, algorithms that are based on page attributes such as URL, IP address, and hyperlinks between pages, and not on the page content) on a collection of 140 million URLs (on 230,000 hosts) and their associated connectivity information. Our best approach, which combines five algorithms, achieved a precision of 0.57 for a recall of 0.86 when considering 100,000 ranked host pairs.
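
A tiny sketch of the URL-string side of such top-down detection: hosts whose sets of URL paths overlap heavily become mirror candidates, with no page content examined. The Jaccard measure and the threshold are illustrative choices; the paper's algorithms are more refined and are combined for the best results.

```python
# Sketch: flag candidate mirrored host pairs from URL strings alone, by
# comparing the sets of paths seen on each host (no page content needed).
# The Jaccard measure and the 0.8 threshold are illustrative choices.
from itertools import combinations

def host_paths(urls):
    paths = {}
    for url in urls:
        host, _, path = url.partition("/")
        paths.setdefault(host, set()).add("/" + path)
    return paths

def mirror_candidates(urls, threshold=0.8):
    paths = host_paths(urls)
    pairs = []
    for h1, h2 in combinations(sorted(paths), 2):
        overlap = len(paths[h1] & paths[h2]) / len(paths[h1] | paths[h2])
        if overlap >= threshold:
            pairs.append((h1, h2, round(overlap, 2)))
    return pairs

crawl = ["www.foo.org/docs/a.html", "www.foo.org/docs/b.html",
         "mirror.bar.net/docs/a.html", "mirror.bar.net/docs/b.html",
         "unrelated.com/index.html"]
print(mirror_candidates(crawl))
# [('mirror.bar.net', 'www.foo.org', 1.0)]
```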


Journal Article
TL;DR: An integrated 8-process value chain needed by an e-commerce system, along with the data associated with each stage of the value chain, is presented, and the logical components of a typical e-commerce database system are discussed.
Abstract: This paper discusses the structure and components of databases for real-world e-commerce systems. We first present an integrated 8-process value chain needed by the e-commerce system and its associated data in each stage of the value chain. We then discuss logical components of a typical e-commerce database system. Finally, we illustrate a detailed design of an e-commerce transaction processing system and comment on a few design considerations specific to e-commerce database systems, such as the primary key, foreign key, outer join, use of weak entity, and schema partition. Understanding the structure of e-commerce database systems will help database designers effectively develop and maintain e-commerce systems.

Journal Article
TL;DR: An overview of data mining techniques for personalization is presented, covering standard techniques used to increase a system's ability to tailor itself to specific user behavior.
Abstract: This paper presents an overview of data mining techniques for personalization. It discusses some of the standard techniques which are used in order to adapt and increase the ability of the system to tailor itself to specific user behavior. We discuss several such techniques, including collaborative filtering, content-based methods, and content-based collaborative filtering methods. We examine the specific applicability of these techniques to various scenarios and the broad advantages of each in specific situations.
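
To make one of the listed techniques concrete, here is a minimal user-based collaborative filtering sketch with toy ratings, cosine similarity, and a single nearest neighbour; it illustrates the general idea rather than any particular system discussed in the paper.

```python
# Minimal user-based collaborative filtering: recommend items liked by the
# users whose rating vectors are most similar to the target user's.
# Ratings data and the similarity/aggregation choices are toy assumptions.
from math import sqrt

ratings = {
    "alice": {"item1": 5, "item2": 3, "item3": 4},
    "bob":   {"item1": 5, "item2": 2, "item4": 5},
    "carol": {"item2": 5, "item3": 1, "item4": 2},
}

def cosine(u, v):
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    return dot / (sqrt(sum(x * x for x in u.values())) *
                  sqrt(sum(x * x for x in v.values())))

def recommend(user, k=1):
    others = [(cosine(ratings[user], ratings[o]), o)
              for o in ratings if o != user]
    top = [o for _, o in sorted(others, reverse=True)[:k]]
    seen = set(ratings[user])
    candidates = {i: ratings[o][i] for o in top for i in ratings[o]
                  if i not in seen}
    return sorted(candidates, key=candidates.get, reverse=True)

print(recommend("alice"))   # ['item4'], borrowed from the most similar user
```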




Journal Article
TL;DR: An electronic business application framework is presented that allows customers and service providers to be linked through XML/EDI with back-end applications and to trigger e-procurement orders sent using either XML or EDI messages.
Abstract: In this paper, we discuss an electronic business application framework and its related architecture. The framework is presented in the form of a prototype system which illustrates how XML tools can assist organizations in building and deploying e-commerce applications. The use of the system is presented in the form of a sample e-commerce application that involves shopping in a virtual store. The framework allows customers and service providers to be linked through XML/EDI with back-end applications and to trigger e-procurement orders which are sent using either XML or EDI messages. Invoice handling and order tracking are also discussed, along with extensions to the proposed architecture that allow business process logic to be encoded at the server side in the form of Event Condition Action scripts.
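
As a sketch of what server-side Event Condition Action logic might look like in such a framework (the events, conditions, and actions here are invented examples, not the prototype's actual scripts):

```python
# Event-Condition-Action sketch: when an event arrives, each rule whose
# condition holds fires its action. Rules and events are invented examples.

RULES = [
    {   # re-order stock when a purchase depletes inventory
        "event": "order_placed",
        "condition": lambda e: e["stock_after"] < 10,
        "action": lambda e: print(f"send e-procurement order for {e['sku']}"),
    },
    {   # notify the customer when an invoice is issued
        "event": "invoice_created",
        "condition": lambda e: True,
        "action": lambda e: print(f"email invoice {e['invoice_id']}"),
    },
]

def dispatch(event_type, payload):
    for rule in RULES:
        if rule["event"] == event_type and rule["condition"](payload):
            rule["action"](payload)

dispatch("order_placed", {"sku": "A-42", "stock_after": 3})
dispatch("invoice_created", {"invoice_id": "INV-2000-17"})
```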