
Showing papers presented at the "International Conference on Data Engineering" in 2001


Proceedings ArticleDOI
02 Apr 2001
TL;DR: This work shows how SQL can be extended to pose Skyline queries, presents and evaluates alternative algorithms to implement the Skyline operation, and shows how this operation can be combined with other database operations, e.g., join.
Abstract: We propose to extend database systems by a Skyline operation. This operation filters out a set of interesting points from a potentially large set of data points. A point is interesting if it is not dominated by any other point. For example, a hotel might be interesting for somebody traveling to Nassau if no other hotel is both cheaper and closer to the beach. We show how SQL can be extended to pose Skyline queries, present and evaluate alternative algorithms to implement the Skyline operation, and show how this operation can be combined with other database operations, e.g., join.

2,509 citations
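
To make the domination criterion concrete, here is a minimal Python sketch of a naive skyline computation over (price, distance) pairs; it is illustrative only and is not one of the algorithms evaluated in the paper.

```python
# Illustrative sketch: skyline via pairwise domination tests, assuming smaller
# values are better in every dimension (e.g., hotel price and distance to beach).

def dominates(p, q):
    """p dominates q if p is at least as good everywhere and strictly better somewhere."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

hotels = [(50, 800), (60, 300), (80, 100), (55, 900), (90, 90)]  # (price, distance)
print(skyline(hotels))  # [(50, 800), (60, 300), (80, 100), (90, 90)]
```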


Proceedings ArticleDOI
02 Apr 2001
TL;DR: This work proposes a novel sequential pattern mining method, called PrefixSpan (i.e., Prefix-projected Sequential pattern mining), which explores prefix-projection in sequential pattern mining, and shows that PrefixSpan outperforms both the Apriori-based GSP algorithm and another recently proposed method, FreeSpan, in mining large sequence databases.
Abstract: Sequential pattern mining is an important data mining problem with broad applications. It is challenging since one may need to examine a combinatorially explosive number of possible subsequence patterns. Most of the previously developed sequential pattern mining methods follow the methodology of Apriori, which may substantially reduce the number of combinations to be examined. However, Apriori still encounters problems when a sequence database is large and/or when sequential patterns to be mined are numerous. We propose a novel sequential pattern mining method, called PrefixSpan (i.e., Prefix-projected Sequential pattern mining), which explores prefix-projection in sequential pattern mining. PrefixSpan mines the complete set of patterns but greatly reduces the efforts of candidate subsequence generation. Moreover, prefix-projection substantially reduces the size of projected databases and leads to efficient processing. Our performance study shows that PrefixSpan outperforms both the Apriori-based GSP algorithm and another recently proposed method, FreeSpan, in mining large sequence databases.

1,975 citations
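
A minimal sketch of the prefix-projection idea, simplified to sequences of single items; the actual PrefixSpan handles sequences of itemsets and adds further optimizations not shown here.

```python
# Simplified prefix-projection: grow frequent prefixes by projecting the database
# on each locally frequent item (sequences of single items only).

from collections import defaultdict

def prefixspan(sequences, min_support, prefix=()):
    """Returns {pattern: support} for all sequential patterns with the given prefix."""
    patterns = {}
    support = defaultdict(int)
    for seq in sequences:
        for item in set(seq):          # count each item once per sequence
            support[item] += 1
    for item, count in support.items():
        if count < min_support:
            continue
        new_prefix = prefix + (item,)
        patterns[new_prefix] = count
        # Project: keep the suffix after the first occurrence of `item`.
        projected = [seq[seq.index(item) + 1:] for seq in sequences if item in seq]
        patterns.update(prefixspan(projected, min_support, new_prefix))
    return patterns

db = [list("abcd"), list("acbd"), list("abd"), list("bcd")]
print(prefixspan(db, min_support=3))
```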



Proceedings ArticleDOI
02 Apr 2001
TL;DR: A new algorithm for mining maximal frequent itemsets from a transactional database that integrates a depth-first traversal of the itemset lattice with effective pruning mechanisms and combines a vertical bitmap representation of the database with an efficient relative bitmap compression schema is presented.
Abstract: We present a new algorithm for mining maximal frequent itemsets from a transactional database. Our algorithm is especially efficient when the itemsets in the database are very long. The search strategy of our algorithm integrates a depth-first traversal of the itemset lattice with effective pruning mechanisms. Our implementation of the search strategy combines a vertical bitmap representation of the database with an efficient relative bitmap compression schema. In a thorough experimental analysis of our algorithm on real data, we isolate the effect of the individual components of the algorithm. Our performance numbers show that our algorithm outperforms previous work by a factor of three to five.

391 citations
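
The vertical bitmap representation can be illustrated in a few lines of Python; this shows only support counting by bitmap intersection, not the paper's depth-first search, pruning, or compression scheme.

```python
# Sketch of the vertical bitmap idea: each item keeps one bit per transaction, and
# the support of an itemset is the popcount of the AND of its items' bitmaps.

transactions = [{"a", "b", "c"}, {"a", "c"}, {"a", "b"}, {"b", "c"}, {"a", "b", "c"}]
items = sorted({i for t in transactions for i in t})

# Bit i of bitmap[item] is set iff the item occurs in transaction i.
bitmap = {item: sum(1 << i for i, t in enumerate(transactions) if item in t)
          for item in items}

def support(itemset):
    bits = ~0
    for item in itemset:
        bits &= bitmap[item]
    bits &= (1 << len(transactions)) - 1   # mask to the number of transactions
    return bin(bits).count("1")

print(support({"a", "b"}))  # 3: transactions 0, 2 and 4 contain both a and b
```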


Proceedings ArticleDOI
02 Apr 2001
TL;DR: A notion of convertible constraints is developed and this class is systematically analyzed, classified, and characterized; techniques are developed which enable such constraints to be readily pushed deep inside the recently developed FP-growth algorithm for frequent itemset mining.
Abstract: Recent work has highlighted the importance of the constraint-based mining paradigm in the context of frequent itemsets, associations, correlations, sequential patterns, and many other interesting patterns in large databases. The authors study constraints which cannot be handled with existing theory and techniques. For example, avg(S) θ v, median(S) θ v, sum(S) θ v (S can contain items of arbitrary values, θ ∈ {≥, ≤}) are customarily regarded as "tough" constraints in that they cannot be pushed inside an algorithm such as Apriori. We develop a notion of convertible constraints and systematically analyze, classify, and characterize this class. We also develop techniques which enable them to be readily pushed deep inside the recently developed FP-growth algorithm for frequent itemset mining. Results from our detailed experiments show the effectiveness of the techniques developed.

372 citations
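
The sketch below illustrates why a constraint such as avg(S) ≥ v, though neither monotone nor anti-monotone in general, becomes prunable once items are enumerated in value-descending order. It is a standalone illustration, not the paper's FP-growth integration; the item values and threshold are made up.

```python
# Convertible anti-monotonicity of avg(S) >= v under value-descending enumeration:
# extending a prefix can only lower its average, so a violating prefix (and all of
# its extensions) can be pruned.

values = {"a": 90, "b": 60, "c": 40, "d": 10}   # hypothetical item values
v = 50                                          # constraint: avg(S) >= 50

def satisfying_itemsets(prefix, remaining, running_sum, out):
    for i, item in enumerate(remaining):
        s = running_sum + values[item]
        itemset = prefix + [item]
        if s / len(itemset) < v:
            # Items later in value-descending order are no larger, so every
            # sibling branch and every extension would also violate: prune.
            break
        out.append(itemset)
        satisfying_itemsets(itemset, remaining[i + 1:], s, out)

order = sorted(values, key=values.get, reverse=True)   # value-descending order
result = []
satisfying_itemsets([], order, 0, result)
print(result)   # the 9 itemsets whose average value is at least 50
```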


Proceedings ArticleDOI
02 Apr 2001
TL;DR: A new distance function D_tw-lb that consistently underestimates the time warping distance and also satisfies the triangular inequality is devised, achieving significant speedup of up to 43 times with real-world S&P 500 stock data and up to 720 times with very large synthetic data.
Abstract: This paper proposes a novel method for similarity search that supports time warping in large sequence databases. Time warping enables finding sequences with similar patterns even when they are of different lengths. Previous methods for processing similarity search that supports time warping fail to employ multi-dimensional indexes without false dismissal since the time warping distance does not satisfy the triangular inequality. Our primary goal is to innovate on search performance without permitting any false dismissal. To attain this goal, we devise a new distance function D_tw-lb that consistently underestimates the time warping distance and also satisfies the triangular inequality. D_tw-lb uses a 4-tuple feature vector that is extracted from each sequence and is invariant to time warping. For efficient processing of similarity search, we employ a multi-dimensional index that uses the 4-tuple feature vector as indexing attributes and D_tw-lb as a distance function. The extensive experimental results reveal that our method achieves significant speedup of up to 43 times with real-world S&P 500 stock data and up to 720 times with very large synthetic data.

337 citations
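
The abstract does not spell out the 4-tuple; the sketch below assumes the features are the first, last, largest and smallest values of a sequence and takes the largest component-wise difference, a common construction for lower bounds of this kind. Any warping path must align the two endpoints and must align each extreme value with something, so every component underestimates the true time warping distance.

```python
# Lower-bounding sketch (feature choice is an assumption, not taken from the paper).

def features(seq):
    return (seq[0], seq[-1], max(seq), min(seq))

def d_tw_lb(s, q):
    return max(abs(a - b) for a, b in zip(features(s), features(q)))

def dtw(s, q):
    """Exact dynamic-programming time warping distance (sum of |s_i - q_j| costs)."""
    inf = float("inf")
    d = [[inf] * (len(q) + 1) for _ in range(len(s) + 1)]
    d[0][0] = 0.0
    for i in range(1, len(s) + 1):
        for j in range(1, len(q) + 1):
            cost = abs(s[i - 1] - q[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[len(s)][len(q)]

s, q = [1, 3, 7, 4, 2], [2, 6, 8, 3]
assert d_tw_lb(s, q) <= dtw(s, q)   # the lower bound never overestimates
```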


Proceedings ArticleDOI
Sheng Ma, Joseph L. Hellerstein
02 Apr 2001
TL;DR: This work structures mining for p-patterns as two sub-tasks (finding the periods and mining temporal associations), develops a novel approach based on a chi-squared test for the former, and develops two algorithms, the period-first algorithm and the association-first algorithm, based on the order in which the sub-tasks are performed.
Abstract: Periodic behavior is common in real-world applications. However in many cases, periodicities are partial in that they are present only intermittently. The authors study such intermittent patterns, which they refer to as p-patterns. The formulation of p-patterns takes into account imprecise time information (e.g., due to unsynchronized clocks in distributed environments), noisy data (e.g., due to extraneous events), and shifts in phase and/or periods. We structure mining for p-patterns as two sub-tasks: (1) finding the periods of p-patterns and (2) mining temporal associations. For (2), a level-wise algorithm is used. For (1), we develop a novel approach based on a chi-squared test, and study its performance in the presence of noise. Further we develop two algorithms for mining p-patterns based on the order in which the aforementioned sub-tasks are performed: the period-first algorithm and the association-first algorithm. Our results show that the association-first algorithm has a higher tolerance to noise; the period-first algorithm is more computationally efficient and provides flexibility as to the specification of support levels. In addition, we apply the period-first algorithm to mining data collected from two production computer networks, a process that led to several actionable insights.

257 citations
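
The abstract only states that a chi-squared test is used to find candidate periods; the sketch below is a generic test in that spirit, comparing how many inter-arrival times fall near a candidate period against what uniformly random gaps would produce. It is not the paper's exact statistic.

```python
# Generic chi-squared periodicity check (illustrative, 1 degree of freedom).

def periodic_by_chi_squared(timestamps, p, tol, critical=3.84):  # 3.84 ~ 95%, 1 df
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    n = len(gaps)
    observed = sum(1 for g in gaps if abs(g - p) <= tol)
    # Null model: gaps uniform over (0, max gap]; chance of landing within tol of p.
    prob = min(2 * tol / max(gaps), 1.0)
    expected = n * prob
    if expected in (0, n):
        return False
    chi2 = (observed - expected) ** 2 * (1 / expected + 1 / (n - expected))
    return chi2 > critical and observed > expected

events = [0, 5, 10, 15, 21, 25, 30, 36, 40]          # roughly every 5 time units
print(periodic_by_chi_squared(events, p=5, tol=1))   # True
```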


Proceedings ArticleDOI
02 Apr 2001
TL;DR: It is shown that by pushing the task of handling obstacles into COD-CLARANS instead of abstracting it at the distance function level, more optimization can be done in the form of a pruning function E'.
Abstract: Clustering in spatial data mining is to group similar objects based on their distance, connectivity, or their relative density in space. In the real world there exist many physical obstacles such as rivers, lakes and highways, and their presence may affect the result of clustering substantially. We study the problem of clustering in the presence of obstacles and define it as a COD (Clustering with Obstructed Distance) problem. As a solution to this problem, we propose a scalable clustering algorithm, called COD-CLARANS. We discuss various forms of pre-processed information that could enhance the efficiency of COD-CLARANS. In the strictest sense, the COD problem can be treated as a change in distance function and thus could be handled by current clustering algorithms by changing the distance function. However, we show that by pushing the task of handling obstacles into COD-CLARANS instead of abstracting it at the distance function level, more optimization can be done in the form of a pruning function E'. We conduct various performance studies to show that COD-CLARANS is both efficient and effective.

189 citations


Proceedings ArticleDOI
02 Apr 2001
TL;DR: A new index structure, the Rdnn-tree, is introduced that answers both RNN and NN queries efficiently, outperforms existing methods in various respects, and is well suited to both static and dynamic settings.
Abstract: The Reverse Nearest Neighbor (RNN) problem is to find all points in a given data set whose nearest neighbor is a given query point. Just like the Nearest Neighbor (NN) queries, the RNN queries appear in many practical situations such as marketing and resource management. Thus, efficient methods for the RNN queries in databases are required. The paper introduces a new index structure, the Rdnn-tree, that answers both RNN and NN queries efficiently. A single index structure is employed for a dynamic database, in contrast to the use of multiple indexes in previous work. This leads to significant savings in dynamically maintaining the index structure. The Rdnn-tree outperforms existing methods in various aspects. Experiments on both synthetic and real world data show that our index structure outperforms previous methods by a significant margin (more than 90% in terms of number of leaf nodes accessed) in RNN queries. It also shows improvement in NN queries over standard techniques. Furthermore, performance in insertion and deletion is significantly enhanced by the ability to combine multiple queries (NN and RNN) in one traversal of the tree. These facts make our index structure extremely preferable in both static and dynamic cases.

178 citations


Proceedings ArticleDOI
02 Apr 2001
TL;DR: It is demonstrated that a combination of outlier indexing with weighted sampling can be used to answer aggregation queries with a significantly reduced approximation error compared to either uniform sampling or weighted sampling alone.
Abstract: Studies the problem of approximately answering aggregation queries using sampling. We observe that uniform sampling performs poorly when the distribution of the aggregated attribute is skewed. To address this issue, we introduce a technique called outlier indexing. Uniform sampling is also ineffective for queries with low selectivity. We rely on weighted sampling based on workload information to overcome this shortcoming. We demonstrate that a combination of outlier indexing with weighted sampling can be used to answer aggregation queries with a significantly reduced approximation error compared to either uniform sampling or weighted sampling alone. We discuss the implementation of these techniques on Microsoft's SQL Server and present experimental results that demonstrate the merits of our techniques.

177 citations
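
A sketch of the outlier-indexing idea for approximating SUM: the heavy tail is kept in a small exact "outlier index" and only the well-behaved remainder is sampled. Plain uniform sampling is used for the remainder here; the paper additionally weights the sample using workload information.

```python
# Outlier indexing sketch: exact contribution from outliers + scaled sample estimate.

import random

def approximate_sum(values, outlier_fraction=0.01, sample_fraction=0.05):
    k = max(1, int(len(values) * outlier_fraction))
    ranked = sorted(values, key=abs, reverse=True)
    outliers, rest = ranked[:k], ranked[k:]
    sample = random.sample(rest, max(1, int(len(rest) * sample_fraction)))
    estimate = sum(outliers)                           # exact over the outliers
    estimate += sum(sample) * len(rest) / len(sample)  # scaled-up sample estimate
    return estimate

random.seed(0)
data = [random.gauss(100, 10) for _ in range(10_000)] + [1_000_000, 750_000]  # skewed
print(round(approximate_sum(data)), round(sum(data)))  # the two should be close
```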


Proceedings ArticleDOI
02 Apr 2001
TL;DR: A new indexing technique that works well for variable-length queries: the central idea is to store index structures at different resolutions for a given dataset by exploiting the power of wavelets.
Abstract: Finding similar patterns in a time sequence is a well-studied problem. Most of the current techniques work well for queries of a prespecified length, but not for variable length queries. We propose a new indexing technique that works well for variable length queries. The central idea is to store index structures at different resolutions for a given dataset. The resolutions are based on wavelets. For a given query, a number of subqueries at different resolutions are generated. The ranges of the subqueries are progressively refined based on results from previous subqueries. Our experiments show that the total cost for our method is 4 to 20 times less than the current techniques including linear scan. Because of the need to store information at multiple resolution levels, the storage requirement of our method could potentially be large. In the second part of the paper we show how the index information can be compressed with minimal information loss. According to our experimental results, even after compressing the size of the index to one fifth, the total cost of our method is 3 to 15 times less than the current techniques.

Proceedings ArticleDOI
02 Apr 2001
TL;DR: A family of metric access methods that are fast and easy to implement on top of existing access methods, such as sequential scan, R-trees and Slim-trees: the idea is to elect a set of objects as foci and gauge all other objects by their distances from this set.
Abstract: Designing a new access method inside a commercial DBMS is cumbersome and expensive. We propose a family of metric access methods that are fast and easy to implement on top of existing access methods, such as sequential scan, R-trees and Slim-trees. The idea is to elect a set of objects as foci, and gauge all other objects with their distances from this set. We show how to define the foci set cardinality, how to choose appropriate foci, and how to perform range and nearest-neighbor queries using them, without false dismissals. The foci increase the pruning of distance calculations during the query processing. Furthermore we index the distances from each object to the foci to reduce even triangular inequality comparisons. Experiments on real and synthetic datasets show that our methods match or outperform existing methods. They are up to 10 times faster, and perform up to 10 times fewer distance calculations and disk accesses. In addition, they scale up well, exhibiting sub-linear performance with growing database size.
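
The pruning rests on the triangle inequality, which is what guarantees no false dismissals; a compact sketch of a foci-based range query follows.

```python
# Foci pruning sketch: |d(o, f) - d(q, f)| <= d(o, q), so an object is discarded
# whenever that lower bound already exceeds the query radius; the (expensive)
# distance d(o, q) is computed only for the survivors.

def range_query(objects, foci, dist, q, radius):
    precomputed = {o: [dist(o, f) for f in foci] for o in objects}  # built once, reused
    q_to_foci = [dist(q, f) for f in foci]
    results = []
    for o in objects:
        lower_bound = max(abs(a - b) for a, b in zip(precomputed[o], q_to_foci))
        if lower_bound > radius:
            continue                    # pruned: cannot lie within the radius
        if dist(o, q) <= radius:        # exact check only for non-pruned objects
            results.append(o)
    return results

euclid = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
points = [(0, 0), (1, 1), (5, 5), (9, 9)]
print(range_query(points, foci=[(0, 0), (9, 9)], dist=euclid, q=(1, 0), radius=2))
```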

Proceedings ArticleDOI
02 Apr 2001
TL;DR: This work proposes several estimation algorithms that apply set hashing and maximal overlap to estimate the number of matches of query twiglets formed using variations on different twiglet decomposition techniques, and demonstrates that accurate and robust estimates can be achieved, even with limited space.
Abstract: Describes efficient algorithms for accurately estimating the number of matches of a small node-labeled tree, i.e. a twig, in a large node-labeled tree, using a summary data structure. This problem is of interest for queries on XML and other hierarchical data, to provide query feedback and for cost-based query optimization. Our summary data structure scalably represents approximate frequency information about twiglets (i.e. small twigs) in the data tree. Given a twig query, the number of matches is estimated by creating a set of query twiglets, and combining two complementary approaches: set hashing, used to estimate the number of matches of each query twiglet, and maximal overlap, used to combine the query twiglet estimates into an estimate for the twig query. We propose several estimation algorithms that apply these approaches on query twiglets formed using variations on different twiglet decomposition techniques. We present an extensive experimental evaluation using several real XML data sets, with a variety of twig queries. Our results demonstrate that accurate and robust estimates can be achieved, even with limited space.

Proceedings ArticleDOI
02 Apr 2001
TL;DR: A new index structure called the SB-tree is introduced, which incorporates features from both segment trees and B-trees, supports the fast lookup of aggregate results based on time, and can be maintained efficiently when the data changes.
Abstract: Considers the problems of computing aggregation queries in temporal databases and of maintaining materialized temporal aggregate views efficiently. The latter problem is particularly challenging, since a single data update can cause aggregate results to change over the entire time-line. We introduce a new index structure called the SB-tree, which incorporates features from both segment trees (S-trees) and B-trees. SB-trees support the fast lookup of aggregate results based on time, and can be maintained efficiently when the data changes. We also extend the basic SB-tree index to handle cumulative (also called moving-window) aggregates. For materialized aggregate views in a temporal database or data warehouse, we propose building and maintaining SB-tree indices instead of the views themselves.

Proceedings ArticleDOI
Harald Schöning
02 Apr 2001
TL;DR: A short overview of Tamino's architecture is given and some of the design considerations for Tamino are addressed, in particular areas where database design for XML was nontrivial and where some issues are still open.
Abstract: Tamino is Software AG's XML database management system. In contrast to solutions of other DBMS vendors, Tamino is not just another layer on top of a database system designed to support the relational or an object-oriented data model. Rather, Tamino has been completely designed for XML. This paper gives a short overview of Tamino's architecture and then addresses some of the design considerations for Tamino. In particular, areas are presented where database design for XML was nontrivial, and where some issues are still open.

Proceedings ArticleDOI
02 Apr 2001
TL;DR: This work proposes modifications to well-known techniques to support the progressive processing of approximate nearest-neighbor queries and develops a new technique based on clustering that merges the benefits of the two general classes of approaches.
Abstract: Develops a general framework for approximate nearest-neighbor queries. We categorize the current approaches for nearest-neighbor query processing based on either their ability to reduce the data set that needs to be examined, or their ability to reduce the representation size of each data object. We first propose modifications to well-known techniques to support the progressive processing of approximate nearest-neighbor queries. A user may therefore stop the retrieval process once enough information has been returned. We then develop a new technique based on clustering that merges the benefits of the two general classes of approaches. Our cluster-based approach allows a user to progressively explore the approximate results with increasing accuracy. We propose a new metric for evaluation of approximate nearest-neighbor searching techniques. Using both the proposed and the traditional metrics, we analyze and compare several techniques with a detailed performance evaluation. We demonstrate the feasibility and efficiency of approximate nearest-neighbor searching. We perform experiments on several real data sets and establish the superiority of the proposed cluster-based technique over the existing techniques for approximate nearest-neighbor searching.

Proceedings ArticleDOI
02 Apr 2001
TL;DR: The tree structure of XML documents is exploited to equip users with a powerful tool, the meet operator, that lets them query databases with whose content they are familiar, but without requiring knowledge of tags and hierarchies.
Abstract: Due to the ubiquity and popularity of XML, users often are in the following situation: they want to query XML documents which contain potentially interesting information, but they are unaware of the mark-up structure that is used. For example, it is easy to guess the contents of an XML bibliography file, whereas the mark-up depends on the methodological, cultural and personal background of the author(s). Nonetheless, it is this hierarchical structure that forms the basis of XML query languages. We exploit the tree structure of XML documents to equip users with a powerful tool, the meet operator, that lets them query databases with whose content they are familiar, but without requiring knowledge of tags and hierarchies. Our approach is based on computing the lowest common ancestor of nodes in the XML syntax tree: e.g., given two strings, we are looking for nodes whose offspring contains these two strings. The novelty of this approach is that the result type is unknown at query formulation time and dependent on the database instance. If the two strings are an author's name and a year, mainly publications of the author in this year are returned. If the two strings are numbers, the result mostly consists of publications that have the numbers as year or page numbers. Because the result type of a query is not specified by the user, we refer to the lowest common ancestor as the nearest concept. We also present a running example taken from the bibliography domain, and demonstrate that the operator can be implemented efficiently.
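
A small sketch of the underlying lowest-common-ancestor computation, using Python's standard ElementTree on a toy bibliography; the paper implements the operator inside an XML database, not through this API.

```python
# "Nearest concept" sketch: LCA of nodes whose text contains two search strings.

import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<bib>
  <article><author>Codd</author><title>A Relational Model</title><year>1970</year></article>
  <article><author>Gray</author><title>Transaction Concept</title><year>1981</year></article>
</bib>""")

def nodes_containing(root, text):
    return [n for n in root.iter() if any(text in (t or "") for t in (n.text, n.tail))]

def lowest_common_ancestor(root, a, b):
    parent = {child: node for node in root.iter() for child in node}
    ancestors = set()
    while a is not None:               # walk from a up to the root
        ancestors.add(a)
        a = parent.get(a)
    while b not in ancestors:          # climb from b until we hit that path
        b = parent.get(b)
    return b

meets = [lowest_common_ancestor(doc, x, y)
         for x in nodes_containing(doc, "Codd") for y in nodes_containing(doc, "1970")]
print({m.tag for m in meets})  # {'article'}: the publication linking "Codd" and "1970"
```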

Proceedings ArticleDOI
02 Apr 2001
TL;DR: The authors propose a subsequence matching method, Dual Match, which exploits duality in constructing windows and significantly improves performance in large database applications, and formally prove that the dual approach is correct.
Abstract: The authors propose a subsequence matching method, Dual Match, which exploits duality in constructing windows and significantly improves performance. Dual Match divides data sequences into disjoint windows and the query sequence into sliding windows, and thus is a dual approach of the one by C. Faloutsos et al. (1994), called FRM, which divides data sequences into sliding windows and the query sequence into disjoint windows. We formally prove that our dual approach is correct, i.e., it incurs no false dismissal. We also prove that, given the minimum query length, there is a maximum bound on the window size to guarantee correctness of Dual Match, and discuss the effect of the window size on performance. FRM causes a lot of false alarms by storing minimum bounding rectangles rather than individual points representing windows, to avoid the excessive storage space required for the index. Dual Match solves this problem by directly storing points, but without incurring excessive storage overhead. Experimental results show that, in most cases, Dual Match provides large improvement in both false alarms and performance over FRM, given the same amount of storage space. In particular, for low selectivities (less than 10^-4), Dual Match significantly improves performance up to 430-fold. On the other hand, for high selectivities (more than 10^-2), it shows a very minor degradation (less than 29%). For selectivities in between (10^-4 to 10^-2), Dual Match shows performance slightly better than that of FRM. Dual Match is also 4.10 to 25.6 times faster than FRM in building indexes of approximately the same size. Overall, these results indicate that our approach provides a new paradigm in subsequence matching that improves performance significantly in large database applications.
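
The window construction that gives the method its name can be shown in a few lines; the sketch below only builds the windows and omits the feature extraction and index probing.

```python
# Dual Match window construction: disjoint windows over the data, sliding windows
# over the query (the opposite of FRM, which slides over the data).

def disjoint_windows(seq, w):
    return [tuple(seq[i:i + w]) for i in range(0, len(seq) - w + 1, w)]

def sliding_windows(seq, w):
    return [tuple(seq[i:i + w]) for i in range(len(seq) - w + 1)]

data  = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
query = [1, 5, 9, 2]
w = 2
print(disjoint_windows(data, w))   # [(3, 1), (4, 1), (5, 9), (2, 6), (5, 3)] -> indexed
print(sliding_windows(query, w))   # [(1, 5), (5, 9), (9, 2)] -> probed against the index
```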

Proceedings ArticleDOI
02 Apr 2001
TL;DR: This work has developed a Collaborative Process Manager (CPM) to support decentralized, peer-to-peer process management for inter-enterprise collaboration at the business process level and embedded it into a dynamic software agent architecture, E-Carry, to elevate multi-agent cooperation from the conversation level to the process level for mediating e-commerce applications.
Abstract: Conventional workflow systems are primarily designed for intra-enterprise process management, and they are hardly used to handle processes with tasks and data separated by enterprise boundaries, for reasons such as security, privacy, sharability, firewalls, etc. Further, the cooperation of multiple enterprises is often based on peer-to-peer interactions rather than centralized coordination. As a result, the conventional centralized process management architecture does not fit into the picture of inter-enterprise business-to-business e-commerce. We have developed a Collaborative Process Manager (CPM) to support decentralized, peer-to-peer process management for inter-enterprise collaboration at the business process level. A collaborative process is not handled by a centralized workflow engine, but by multiple CPMs, each representing a player in the business process. Each CPM is used to schedule, dispatch and control the tasks of the process that the player is responsible for, and the CPMs interoperate through an inter-CPM messaging protocol. We have implemented CPM and embedded it into a dynamic software agent architecture, E-Carry, that we developed at HP Labs, to elevate multi-agent cooperation from the conversation level to the process level for mediating e-commerce applications.

Proceedings ArticleDOI
Goetz Graefe, Per-Åke Larson
02 Apr 2001
TL;DR: Since many existing techniques for exploiting CPU caches in the implementation of B-tree indexes have not been discussed in the literature, most of them are surveyed here and made widely available in order to enable, structure, and hopefully stimulate future research.
Abstract: Since many existing techniques for exploiting CPU caches in the implementation of B-tree indexes have not been discussed in the literature, most of them are surveyed. Rather than providing a detailed performance evaluation for one or two of them on some specific contemporary hardware, the purpose is to survey and to make widely available this heretofore-folkloric knowledge in order to enable, structure, and hopefully stimulate future research.

Proceedings ArticleDOI
02 Apr 2001
TL;DR: An extension to current view-based mediator systems called "model-based mediation", in which views are defined and executed at the level of conceptual models (CMs) rather than at the structural level, is proposed.
Abstract: Proposes an extension to current view-based mediator systems called "model-based mediation", in which views are defined and executed at the level of conceptual models (CMs) rather than at the structural level. Structural integration and lifting of data to the conceptual level is "pushed down" from the mediator to wrappers which, in our system, export the classes, associations, constraints and query capabilities of a source. Another novel feature of our architecture is the use of domain maps - semantic nets of concepts and relationships that are used to mediate across sources from multiple worlds (i.e. whose data are related in indirect and often complex ways). As part of registering a source's CM with the mediator, the wrapper creates a "semantic index" of its data into the domain map. We show that these indexes not only semantically correlate the multiple-worlds data, and thereby support the definition of the integrated CM, but they are also useful during query processing, for example, to select relevant sources. A first prototype of the system has been implemented for a complex neuroscience mediation problem.

Proceedings ArticleDOI
02 Apr 2001
TL;DR: The solution is dynamic, allowing insertion or deletion of points in O(d log_p n) page accesses, and generalizes easily to find approximate k-nearest neighbors.
Abstract: We present a new approach for approximate nearest neighbor queries for sets of high dimensional points under any L_t-metric, t = 1, ..., ∞. The proposed algorithm is efficient and simple to implement. The algorithm uses multiple shifted copies of the data points and stores them in up to (d+1) B-trees, where d is the dimensionality of the data, sorted according to their position along a space filling curve. This is done in a way that allows us to guarantee that a neighbor within an O(d^(1+1/t)) factor of the exact nearest can be returned with at most (d+1) log_p n page accesses, where p is the branching factor of the B-trees. In practice, for real data sets, our approximate technique finds the exact nearest neighbor between 87% and 99% of the time and a point no farther than the third nearest neighbor between 98% and 100% of the time. Our solution is dynamic, allowing insertion or deletion of points in O(d log_p n) page accesses, and generalizes easily to find approximate k-nearest neighbors.
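
A compact sketch of the machinery described above: a bit-interleaving (Z-order) key stands in for the space-filling curve, sorted lists stand in for the B-trees, and the shifts are random rather than the deterministic ones needed for the paper's approximation guarantee.

```python
# Shifted-copies / space-filling-curve sketch: gather candidates from the sorted
# neighborhood of the query key in each shifted copy, then compare only those exactly.

import bisect, random

def z_key(point, bits=10):
    key = 0
    for b in range(bits):                     # interleave the bits of all coordinates
        for d, coord in enumerate(point):
            key |= ((coord >> b) & 1) << (b * len(point) + d)
    return key

def build(points, copies=3, dims=2, max_coord=1 << 10):
    shifts = [tuple(random.randrange(max_coord // 2) for _ in range(dims))
              for _ in range(copies)]
    lists = [sorted((z_key(tuple(c + o for c, o in zip(p, s))), p) for p in points)
             for s in shifts]
    return shifts, lists

def approx_nn(query, shifts, lists, probe=4):
    candidates = set()
    for s, keyed in zip(shifts, lists):
        k = z_key(tuple(c + o for c, o in zip(query, s)))
        pos = bisect.bisect_left(keyed, (k, query))
        for _, p in keyed[max(0, pos - probe):pos + probe]:   # neighbors along the curve
            candidates.add(p)
    return min(candidates, key=lambda p: sum((a - b) ** 2 for a, b in zip(p, query)))

random.seed(7)
pts = [(random.randrange(500), random.randrange(500)) for _ in range(200)]
shifts, lists = build(pts)
print(approx_nn((123, 321), shifts, lists))
```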

Proceedings ArticleDOI
02 Apr 2001
TL;DR: It is argued that a block-oriented processing strategy for database operations can lead to better utilization of the processors and caches, generating significantly higher performance.
Abstract: Database systems are not well-tuned to take advantage of modern superscalar processor architectures. In particular, the clocks per instruction (CPI) for rather simple database queries are quite poor compared to scientific kernels or SPEC benchmarks. The lack of performance of database systems has been attributed to poor utilization of caches and processor function units as well as higher branching penalties. In this paper, we argue that a block-oriented processing strategy for database operations can lead to better utilization of the processors and caches, generating significantly higher performance. We have implemented the block-oriented processing technique for aggregation expression evaluation and sorting operations as a feature in the DB2 Universal Database (UDB) system. We present results from representative queries on a 30-GB TPC-H (Transaction Processing Council Benchmark H) database to show the value of this technique.
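
The contrast between tuple-at-a-time and block-at-a-time evaluation can be sketched as follows; this is illustrative only and shows the control-flow difference, not DB2's implementation.

```python
# Block-oriented processing sketch: operate on blocks of column values in a tight
# loop instead of interpreting one tuple at a time.

def sum_tuple_at_a_time(rows):
    total = 0
    for row in rows:                           # per-tuple interpretation overhead
        total += row["price"] * row["qty"]
    return total

def sum_block_at_a_time(blocks):
    total = 0
    for price_block, qty_block in blocks:      # each block is a pair of column arrays
        total += sum(p * q for p, q in zip(price_block, qty_block))
    return total

rows = [{"price": p, "qty": q} for p, q in [(10, 2), (20, 1), (5, 4)]]
blocks = [([10, 20], [2, 1]), ([5], [4])]      # the same data, in blocks of column values
print(sum_tuple_at_a_time(rows), sum_block_at_a_time(blocks))  # 60 60
```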

Proceedings ArticleDOI
02 Apr 2001
TL;DR: The new API for data mining proposed by Microsoft as extensions to the OLE DB standard is described and it is believed this new API will go a long way in enabling deployment of data mining in enterprise data warehouses.
Abstract: The integration of data mining with traditional database systems is key to making it convenient, easy to deploy in real applications, and to growing its user base. We describe the new API for data mining proposed by Microsoft as extensions to the OLE DB standard. We illustrate the basic notions that motivated the API's design and describe the key components of an OLE DB for the data mining provider. We also include examples of the usage and treat the problems of data representation and integration with the SQL framework. We believe this new API will go a long way in enabling deployment of data mining in enterprise data warehouses. A reference implementation of a provider is available with the recent release of Microsoft SQL Server 2000 database system.

Proceedings ArticleDOI
02 Apr 2001
TL;DR: This paper proposes an indexing structure scheme based on the relative region coordinates that can effectively deal with the update problem and presents an algorithm to construct a tree-structured index in which related coordinates are stored together.
Abstract: For most of the index structures for XML data proposed so far, updating is a problem, because an XML element's coordinates are expressed using absolute values. Due to the structural relationship among the elements in XML documents, we have to re-compute these absolute values if the content of the source data is updated. The reconstruction requires the updating of a large portion of the index files, which causes a serious problem, especially when the XML data content is updated frequently. In this paper, we propose an indexing structure scheme based on relative region coordinates that can effectively deal with the update problem. The main idea is that we express the coordinates of an XML element based on the region of its parent element. We present an algorithm to construct a tree-structured index in which related coordinates are stored together. In consequence, our indexing scheme requires the updating of only a small portion of the index file.

Proceedings ArticleDOI
02 Apr 2001
TL;DR: The article addresses architectural issues arising from designing a product to support XML as its core representation, choices in the design of the underlying algebra, on-the-fly data cleaning and caching and materialization policies, and issues which require more attention from the research community.
Abstract: For better or for worse, XML has emerged as a de facto standard for data interchange. This consensus is likely to lead to increased demand for technology that allows users to integrate data from a variety of applications, repositories, and partners, which are located across the corporate intranet or on the Internet. Nimble Technology has spent two years developing a product to service this market. Originally conceived after decades of person-years of research on data integration, the product is now being deployed at several Fortune-500 beta-customer sites. The article reports on the key challenges faced in the design of our product and highlights some issues which require more attention from the research community. In particular we address architectural issues arising from designing a product to support XML as its core representation, choices in the design of the underlying algebra, on-the-fly data cleaning and caching and materialization policies.

Proceedings ArticleDOI
02 Apr 2001
TL;DR: The MD-join provides a clean separation between group definition and aggregate computation, allowing great flexibility in the expression of OLAP queries, and several algebraic transformations that allow relational algebra queries that include MD-joins to be optimized.
Abstract: OLAP queries (i.e. group-by or cube-by queries with aggregation) have proven to be valuable for data analysis and exploration. Many decision support applications need very complex OLAP queries, requiring a fine degree of control over both the group definition and the aggregates that are computed. For example, suppose that the user has access to a data cube whose measure attribute is Sum(Sales). Then the user might wish to compute the sum of sales in New York and the sum of sales in California for those data cube entries in which Sum(Sales)>$1,000,000. This type of complex OLAP query is often difficult to express and difficult to optimize using standard relational operators (including standard aggregation operators). In this paper, we propose the MD-join operator for complex OLAP queries. The MD-join provides a clean separation between group definition and aggregate computation, allowing great flexibility in the expression of OLAP queries. In addition, the MD-join has a simple and easily optimizable implementation, while the equivalent relational algebra expression is often complex and difficult to optimize. We present several algebraic transformations that allow relational algebra queries that include MD-joins to be optimized.
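
A sketch of the operator's semantics as described above: for every row of a base table of groups, aggregates are computed over the detail rows related to it by a condition. The table and column names below are made up for illustration, and this is not the paper's formal definition.

```python
# MD-join-style evaluation: separate group definition (base_rows + condition)
# from aggregate computation (aggregates over the matching detail rows).

def md_join(base_rows, detail_rows, condition, aggregates):
    out = []
    for b in base_rows:
        matching = [d for d in detail_rows if condition(b, d)]
        row = dict(b)
        for name, func in aggregates.items():
            row[name] = func(matching)
        out.append(row)
    return out

cube = [{"product": "p1", "total_sales": 1_500_000},
        {"product": "p2", "total_sales": 800_000}]
sales = [{"product": "p1", "state": "NY", "amount": 900_000},
         {"product": "p1", "state": "CA", "amount": 600_000},
         {"product": "p2", "state": "NY", "amount": 800_000}]

# For cube entries with Sum(Sales) > $1,000,000, attach NY and CA sales separately.
result = md_join(
    [b for b in cube if b["total_sales"] > 1_000_000], sales,
    condition=lambda b, d: b["product"] == d["product"],
    aggregates={"ny_sales": lambda m: sum(d["amount"] for d in m if d["state"] == "NY"),
                "ca_sales": lambda m: sum(d["amount"] for d in m if d["state"] == "CA")})
print(result)
```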

Proceedings ArticleDOI
02 Apr 2001
TL;DR: The authors propose an analytical cost model for the similarity join operation based on indexes and propose a new index architecture and join algorithm which allows a separate optimization of CPU time and I/O time.
Abstract: The similarity join is an important database primitive which has been successfully applied to speed up data mining algorithms. In the similarity join, two point sets of a multidimensional vector space are combined such that the result contains all point pairs where the distance does not exceed a parameter ε. Due to its high practical relevance, many similarity join algorithms have been devised. The authors propose an analytical cost model for the similarity join operation based on indexes. Our problem analysis reveals a serious optimization conflict between CPU time and I/O time: fine-grained index structures are beneficial for CPU efficiency, but deteriorate the I/O performance. As a consequence of this observation, we propose a new index architecture and join algorithm which allows a separate optimization of CPU time and I/O time. Our solution utilizes large pages which are optimized for I/O processing. The pages accommodate a search structure which minimizes the computational effort. In the experimental evaluation, a substantial improvement over competitive techniques is shown.
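
For reference, the operation being optimized: an ε-join that reports all point pairs within distance ε, here implemented with a simple grid so that only neighboring cells are compared. This is an illustration of the primitive, not the paper's index architecture.

```python
# Grid-based epsilon-join sketch: points within eps of each other can only lie in
# the same grid cell or an adjacent one (cell width = eps).

from itertools import product
from collections import defaultdict
from math import dist  # Python 3.8+

def similarity_join(r, s, eps):
    grid = defaultdict(list)
    for p in s:
        grid[tuple(int(c // eps) for c in p)].append(p)
    result = []
    for p in r:
        cell = tuple(int(c // eps) for c in p)
        for offset in product((-1, 0, 1), repeat=len(p)):      # the cell and its neighbors
            for q in grid[tuple(c + o for c, o in zip(cell, offset))]:
                if dist(p, q) <= eps:
                    result.append((p, q))
    return result

r = [(0.0, 0.0), (2.0, 2.0)]
s = [(0.3, 0.4), (5.0, 5.0), (2.2, 1.9)]
print(similarity_join(r, s, eps=0.6))  # [((0.0, 0.0), (0.3, 0.4)), ((2.0, 2.0), (2.2, 1.9))]
```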

Proceedings ArticleDOI
02 Apr 2001
TL;DR: This paper shows that semi-join reducers can indeed be beneficial in modern client-server or middleware systems - either to reduce communication costs or to better exploit all the resources of a system.
Abstract: Semi-join reducers were introduced in the late 1970s as a means to reduce the communication costs of distributed database systems. Subsequent work in the 1980s showed, however, that semi-join reducers are rarely beneficial for the distributed systems of that time. This paper shows that semi-join reducers can indeed be beneficial in modern client-server or middleware systems - either to reduce communication costs or to better exploit all the resources of a system. Furthermore, we present and evaluate alternative ways to extend state-of-the-art (dynamic programming) query optimizers in order to generate good query plans with semi-join reducers. We present two variants, called Access Root and Join Root, which differ in their implementation complexity, running times and the quality of the plans they produce. We present the results of performance experiments that compare both variants with a traditional query optimizer.
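
A sketch of the reduction itself: the join column of one table is shipped first and used to filter the other table before it is sent to the join site. The table contents below are made up.

```python
# Semi-join reduction sketch: only rows that can possibly join are communicated.

def semi_join_reduce(local_rows, remote_join_keys, key):
    """Keep only local rows whose join key appears on the remote side."""
    keys = set(remote_join_keys)    # the (small) projection shipped across the network
    return [r for r in local_rows if r[key] in keys]

orders    = [{"cust": 1, "total": 99}, {"cust": 2, "total": 15}, {"cust": 7, "total": 42}]
customers = [{"cust": 1, "name": "Ada"}, {"cust": 7, "name": "Grace"}]

reduced = semi_join_reduce(orders, (c["cust"] for c in customers), key="cust")
print(reduced)  # only the two orders that can join (cust 1 and cust 7) are shipped
```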

Proceedings ArticleDOI
02 Apr 2001
TL;DR: Interaction expressions and graphs are proposed as a simple yet powerful formalism for the specification and implementation of synchronization conditions in general and inter-workflow dependencies in particular.
Abstract: Current workflow management technology does not provide adequate means for inter-workflow coordination as concurrently executing workflows are considered completely independent. While this simplified view might suffice for one application domain or the other, there are many real-world application scenarios where workflows, though independently modeled in order to remain comprehensible and manageable, are semantically interrelated. As pragmatical approaches, like merging interdependent workflows or inter-workflow message passing, do not satisfactorily solve the inter-workflow coordination problem, interaction expressions and graphs are proposed as a simple yet powerful formalism for the specification and implementation of synchronization conditions in general and inter-workflow dependencies in particular. In addition to a graph based semi-formal interpretation of the formalism, a precise formal semantics, an equivalent operational semantics, an efficient implementation of the latter, and detailed complexity analyses have been developed, allowing the formalism to be actually applied to solve real-world problems like inter-workflow coordination.