
Showing papers presented at the "International Conference on Data Engineering" in 2001


Proceedings ArticleDOI
02 Apr 2001
TL;DR: This work shows how SQL can be extended to pose Skyline queries, presents and evaluates alternative algorithms to implement the Skyline operation, and shows how this operation can be combined with other database operations, e.g., join.
Abstract: We propose to extend database systems by a Skyline operation. This operation filters out a set of interesting points from a potentially large set of data points. A point is interesting if it is not dominated by any other point. For example, a hotel might be interesting for somebody traveling to Nassau if no other hotel is both cheaper and closer to the beach. We show how SQL can be extended to pose Skyline queries, present and evaluate alternative algorithms to implement the Skyline operation, and show how this operation can be combined with other database operations, e.g., join.

2,509 citations
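
To make the domination criterion concrete, here is a minimal Python sketch of a naive skyline computation over (price, distance) pairs; it is illustrative only and is not one of the algorithms evaluated in the paper.

```python
# Illustrative sketch: skyline via pairwise domination tests, assuming smaller
# values are better in every dimension (e.g., hotel price and distance to beach).

def dominates(p, q):
    """p dominates q if p is at least as good everywhere and strictly better somewhere."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

hotels = [(50, 800), (60, 300), (80, 100), (55, 900), (90, 90)]  # (price, distance)
print(skyline(hotels))  # [(50, 800), (60, 300), (80, 100), (90, 90)]
```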


Proceedings ArticleDOI
02 Apr 2001
TL;DR: This work proposes a novel sequential pattern mining method, called PrefixSpan (i.e., Prefix-projected Sequential pattern mining), which explores prefix-projection in sequential pattern mining, and shows that PrefixSpan outperforms both the Apriori-based GSP algorithm and another recently proposed method, FreeSpan, in mining large sequence databases.
Abstract: Sequential pattern mining is an important data mining problem with broad applications. It is challenging since one may need to examine a combinatorially explosive number of possible subsequence patterns. Most of the previously developed sequential pattern mining methods follow the methodology of Apriori, which may substantially reduce the number of combinations to be examined. However, Apriori still encounters problems when a sequence database is large and/or when sequential patterns to be mined are numerous. We propose a novel sequential pattern mining method, called PrefixSpan (i.e., Prefix-projected Sequential pattern mining), which explores prefix-projection in sequential pattern mining. PrefixSpan mines the complete set of patterns but greatly reduces the efforts of candidate subsequence generation. Moreover, prefix-projection substantially reduces the size of projected databases and leads to efficient processing. Our performance study shows that PrefixSpan outperforms both the Apriori-based GSP algorithm and another recently proposed method, FreeSpan, in mining large sequence databases.

1,975 citations
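
A minimal sketch of the prefix-projection idea, simplified to sequences of single items; the actual PrefixSpan handles sequences of itemsets and adds further optimizations not shown here.

```python
# Simplified prefix-projection: grow frequent prefixes by projecting the database
# on each locally frequent item (sequences of single items only).

from collections import defaultdict

def prefixspan(sequences, min_support, prefix=()):
    """Returns {pattern: support} for all sequential patterns with the given prefix."""
    patterns = {}
    support = defaultdict(int)
    for seq in sequences:
        for item in set(seq):          # count each item once per sequence
            support[item] += 1
    for item, count in support.items():
        if count < min_support:
            continue
        new_prefix = prefix + (item,)
        patterns[new_prefix] = count
        # Project: keep the suffix after the first occurrence of `item`.
        projected = [seq[seq.index(item) + 1:] for seq in sequences if item in seq]
        patterns.update(prefixspan(projected, min_support, new_prefix))
    return patterns

db = [list("abcd"), list("acbd"), list("abd"), list("bcd")]
print(prefixspan(db, min_support=3))
```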



Proceedings ArticleDOI
02 Apr 2001
TL;DR: A new algorithm for mining maximal frequent itemsets from a transactional database that integrates a depth-first traversal of the itemset lattice with effective pruning mechanisms and combines a vertical bitmap representation of the database with an efficient relative bitmap compression schema is presented.
Abstract: We present a new algorithm for mining maximal frequent itemsets from a transactional database. Our algorithm is especially efficient when the itemsets in the database are very long. The search strategy of our algorithm integrates a depth-first traversal of the itemset lattice with effective pruning mechanisms. Our implementation of the search strategy combines a vertical bitmap representation of the database with an efficient relative bitmap compression schema. In a thorough experimental analysis of our algorithm on real data, we isolate the effect of the individual components of the algorithm. Our performance numbers show that our algorithm outperforms previous work by a factor of three to five.

391 citations
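
The vertical bitmap representation can be illustrated in a few lines of Python; this shows only support counting by bitmap intersection, not the paper's depth-first search, pruning, or compression scheme.

```python
# Sketch of the vertical bitmap idea: each item keeps one bit per transaction, and
# the support of an itemset is the popcount of the AND of its items' bitmaps.

transactions = [{"a", "b", "c"}, {"a", "c"}, {"a", "b"}, {"b", "c"}, {"a", "b", "c"}]
items = sorted({i for t in transactions for i in t})

# Bit i of bitmap[item] is set iff the item occurs in transaction i.
bitmap = {item: sum(1 << i for i, t in enumerate(transactions) if item in t)
          for item in items}

def support(itemset):
    bits = ~0
    for item in itemset:
        bits &= bitmap[item]
    bits &= (1 << len(transactions)) - 1   # mask to the number of transactions
    return bin(bits).count("1")

print(support({"a", "b"}))  # 3: transactions 0, 2 and 4 contain both a and b
```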


Proceedings ArticleDOI
02 Apr 2001
TL;DR: A notion of convertible constraints is developed and this class is systematically analyzed, classified, and characterized; techniques are developed which enable such constraints to be readily pushed deep inside the recently developed FP-growth algorithm for frequent itemset mining.
Abstract: Recent work has highlighted the importance of the constraint-based mining paradigm in the context of frequent itemsets, associations, correlations, sequential patterns, and many other interesting patterns in large databases. The authors study constraints which cannot be handled with existing theory and techniques. For example, avg(S) θ v, median(S) θ v, sum(S) θ v (S can contain items of arbitrary values, θ ∈ {≥, ≤}) are customarily regarded as "tough" constraints in that they cannot be pushed inside an algorithm such as Apriori. We develop a notion of convertible constraints and systematically analyze, classify, and characterize this class. We also develop techniques which enable them to be readily pushed deep inside the recently developed FP-growth algorithm for frequent itemset mining. Results from our detailed experiments show the effectiveness of the techniques developed.

372 citations
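
The sketch below illustrates why a constraint such as avg(S) ≥ v, though neither monotone nor anti-monotone in general, becomes prunable once items are enumerated in value-descending order. It is a standalone illustration, not the paper's FP-growth integration; the item values and threshold are made up.

```python
# Convertible anti-monotonicity of avg(S) >= v under value-descending enumeration:
# extending a prefix can only lower its average, so a violating prefix (and all of
# its extensions) can be pruned.

values = {"a": 90, "b": 60, "c": 40, "d": 10}   # hypothetical item values
v = 50                                          # constraint: avg(S) >= 50

def satisfying_itemsets(prefix, remaining, running_sum, out):
    for i, item in enumerate(remaining):
        s = running_sum + values[item]
        itemset = prefix + [item]
        if s / len(itemset) < v:
            # Items later in value-descending order are no larger, so every
            # sibling branch and every extension would also violate: prune.
            break
        out.append(itemset)
        satisfying_itemsets(itemset, remaining[i + 1:], s, out)

order = sorted(values, key=values.get, reverse=True)   # value-descending order
result = []
satisfying_itemsets([], order, 0, result)
print(result)   # the 9 itemsets whose average value is at least 50
```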


Proceedings ArticleDOI
02 Apr 2001
TL;DR: A new distance function D_tw-lb that consistently underestimates the time warping distance and also satisfies the triangular inequality is devised, achieving significant speedup of up to 43 times with real-world S&P 500 stock data and up to 720 times with very large synthetic data.
Abstract: This paper proposes a novel method for similarity search that supports time warping in large sequence databases. Time warping enables finding sequences with similar patterns even when they are of different lengths. Previous methods for processing similarity search that supports time warping fail to employ multi-dimensional indexes without false dismissal since the time warping distance does not satisfy the triangular inequality. Our primary goal is to innovate on search performance without permitting any false dismissal. To attain this goal, we devise a new distance function D_tw-lb that consistently underestimates the time warping distance and also satisfies the triangular inequality. D_tw-lb uses a 4-tuple feature vector that is extracted from each sequence and is invariant to time warping. For efficient processing of similarity search, we employ a multi-dimensional index that uses the 4-tuple feature vector as indexing attributes and D_tw-lb as a distance function. The extensive experimental results reveal that our method achieves significant speedup of up to 43 times with real-world S&P 500 stock data and up to 720 times with very large synthetic data.

337 citations
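
The abstract does not spell out the 4-tuple; the sketch below assumes the features are the first, last, largest and smallest values of a sequence and takes the largest component-wise difference, a common construction for lower bounds of this kind. Any warping path must align the two endpoints and must align each extreme value with something, so every component underestimates the true time warping distance.

```python
# Lower-bounding sketch (feature choice is an assumption, not taken from the paper).

def features(seq):
    return (seq[0], seq[-1], max(seq), min(seq))

def d_tw_lb(s, q):
    return max(abs(a - b) for a, b in zip(features(s), features(q)))

def dtw(s, q):
    """Exact dynamic-programming time warping distance (sum of |s_i - q_j| costs)."""
    inf = float("inf")
    d = [[inf] * (len(q) + 1) for _ in range(len(s) + 1)]
    d[0][0] = 0.0
    for i in range(1, len(s) + 1):
        for j in range(1, len(q) + 1):
            cost = abs(s[i - 1] - q[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[len(s)][len(q)]

s, q = [1, 3, 7, 4, 2], [2, 6, 8, 3]
assert d_tw_lb(s, q) <= dtw(s, q)   # the lower bound never overestimates
```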


Proceedings ArticleDOI
Sheng Ma, Joseph L. Hellerstein
02 Apr 2001
TL;DR: This work structures mining for p-patterns as two sub-tasks (finding the periods and mining temporal associations), develops a novel approach based on a chi-squared test for the former, and develops two algorithms, the period-first algorithm and the association-first algorithm, based on the order in which the sub-tasks are performed.
Abstract: Periodic behavior is common in real-world applications. However in many cases, periodicities are partial in that they are present only intermittently. The authors study such intermittent patterns, which they refer to as p-patterns. The formulation of p-patterns takes into account imprecise time information (e.g., due to unsynchronized clocks in distributed environments), noisy data (e.g., due to extraneous events), and shifts in phase and/or periods. We structure mining for p-patterns as two sub-tasks: (1) finding the periods of p-patterns and (2) mining temporal associations. For (2), a level-wise algorithm is used. For (1), we develop a novel approach based on a chi-squared test, and study its performance in the presence of noise. Further we develop two algorithms for mining p-patterns based on the order in which the aforementioned sub-tasks are performed: the period-first algorithm and the association-first algorithm. Our results show that the association-first algorithm has a higher tolerance to noise; the period-first algorithm is more computationally efficient and provides flexibility as to the specification of support levels. In addition, we apply the period-first algorithm to mining data collected from two production computer networks, a process that led to several actionable insights.

257 citations
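
The abstract only states that a chi-squared test is used to find candidate periods; the sketch below is a generic test in that spirit, comparing how many inter-arrival times fall near a candidate period against what uniformly random gaps would produce. It is not the paper's exact statistic.

```python
# Generic chi-squared periodicity check (illustrative, 1 degree of freedom).

def periodic_by_chi_squared(timestamps, p, tol, critical=3.84):  # 3.84 ~ 95%, 1 df
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    n = len(gaps)
    observed = sum(1 for g in gaps if abs(g - p) <= tol)
    # Null model: gaps uniform over (0, max gap]; chance of landing within tol of p.
    prob = min(2 * tol / max(gaps), 1.0)
    expected = n * prob
    if expected in (0, n):
        return False
    chi2 = (observed - expected) ** 2 * (1 / expected + 1 / (n - expected))
    return chi2 > critical and observed > expected

events = [0, 5, 10, 15, 21, 25, 30, 36, 40]          # roughly every 5 time units
print(periodic_by_chi_squared(events, p=5, tol=1))   # True
```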


Proceedings ArticleDOI
02 Apr 2001
TL;DR: It is shown that by pushing the task of handling obstacles into COD-CLARANS instead of abstracting it at the distance function level, more optimization can be done in the form of a pruning function E'.
Abstract: Clustering in spatial data mining is to group similar objects based on their distance, connectivity, or their relative density in space. In the real world there exist many physical obstacles such as rivers, lakes and highways, and their presence may affect the result of clustering substantially. We study the problem of clustering in the presence of obstacles and define it as a COD (Clustering with Obstructed Distance) problem. As a solution to this problem, we propose a scalable clustering algorithm, called COD-CLARANS. We discuss various forms of pre-processed information that could enhance the efficiency of COD-CLARANS. In the strictest sense, the COD problem can be treated as a change in distance function and thus could be handled by current clustering algorithms by changing the distance function. However, we show that by pushing the task of handling obstacles into COD-CLARANS instead of abstracting it at the distance function level, more optimization can be done in the form of a pruning function E'. We conduct various performance studies to show that COD-CLARANS is both efficient and effective.

189 citations


Proceedings ArticleDOI
02 Apr 2001
TL;DR: A new index structure, the Rdnn-tree, is introduced that answers both RNN and NN queries efficiently, outperforms existing methods in various respects, and is well suited to both static and dynamic settings.
Abstract: The Reverse Nearest Neighbor (RNN) problem is to find all points in a given data set whose nearest neighbor is a given query point. Just like the Nearest Neighbor (NN) queries, the RNN queries appear in many practical situations such as marketing and resource management. Thus, efficient methods for the RNN queries in databases are required. The paper introduces a new index structure, the Rdnn-tree, that answers both RNN and NN queries efficiently. A single index structure is employed for a dynamic database, in contrast to the use of multiple indexes in previous work. This leads to significant savings in dynamically maintaining the index structure. The Rdnn-tree outperforms existing methods in various aspects. Experiments on both synthetic and real world data show that our index structure outperforms previous methods by a significant margin (more than 90% in terms of number of leaf nodes accessed) in RNN queries. It also shows improvement in NN queries over standard techniques. Furthermore, performance in insertion and deletion is significantly enhanced by the ability to combine multiple queries (NN and RNN) in one traversal of the tree. These facts make our index structure extremely preferable in both static and dynamic cases.

178 citations


Proceedings ArticleDOI
02 Apr 2001
TL;DR: It is demonstrated that a combination of outlier indexing with weighted sampling can be used to answer aggregation queries with a significantly reduced approximation error compared to either uniform sampling or weighted sampling alone.
Abstract: Studies the problem of approximately answering aggregation queries using sampling. We observe that uniform sampling performs poorly when the distribution of the aggregated attribute is skewed. To address this issue, we introduce a technique called outlier indexing. Uniform sampling is also ineffective for queries with low selectivity. We rely on weighted sampling based on workload information to overcome this shortcoming. We demonstrate that a combination of outlier indexing with weighted sampling can be used to answer aggregation queries with a significantly reduced approximation error compared to either uniform sampling or weighted sampling alone. We discuss the implementation of these techniques on Microsoft's SQL Server and present experimental results that demonstrate the merits of our techniques.

177 citations
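
A sketch of the outlier-indexing idea for approximating SUM: the heavy tail is kept in a small exact "outlier index" and only the well-behaved remainder is sampled. Plain uniform sampling is used for the remainder here; the paper additionally weights the sample using workload information.

```python
# Outlier indexing sketch: exact contribution from outliers + scaled sample estimate.

import random

def approximate_sum(values, outlier_fraction=0.01, sample_fraction=0.05):
    k = max(1, int(len(values) * outlier_fraction))
    ranked = sorted(values, key=abs, reverse=True)
    outliers, rest = ranked[:k], ranked[k:]
    sample = random.sample(rest, max(1, int(len(rest) * sample_fraction)))
    estimate = sum(outliers)                           # exact over the outliers
    estimate += sum(sample) * len(rest) / len(sample)  # scaled-up sample estimate
    return estimate

random.seed(0)
data = [random.gauss(100, 10) for _ in range(10_000)] + [1_000_000, 750_000]  # skewed
print(round(approximate_sum(data)), round(sum(data)))  # the two should be close
```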


Proceedings ArticleDOI
02 Apr 2001
TL;DR: A new indexing technique that works well for variable-length queries: the central idea is to store index structures at different resolutions for a given dataset by exploiting the power of wavelets.
Abstract: Finding similar patterns in a time sequence is a well-studied problem. Most of the current techniques work well for queries of a prespecified length, but not for variable length queries. We propose a new indexing technique that works well for variable length queries. The central idea is to store index structures at different resolutions for a given dataset. The resolutions are based on wavelets. For a given query, a number of subqueries at different resolutions are generated. The ranges of the subqueries are progressively refined based on results from previous subqueries. Our experiments show that the total cost for our method is 4 to 20 times less than the current techniques including linear scan. Because of the need to store information at multiple resolution levels, the storage requirement of our method could potentially be large. In the second part of the paper we show how the index information can be compressed with minimal information loss. According to our experimental results, even after compressing the size of the index to one fifth, the total cost of our method is 3 to 15 times less than the current techniques.

Proceedings ArticleDOI
02 Apr 2001
TL;DR: A family of metric access methods that are fast and easy to implement on top of existing access methods, such as sequential scan, R-trees and Slim-trees: the idea is to elect a set of objects as foci and gauge all other objects by their distances from this set.
Abstract: Designing a new access method inside a commercial DBMS is cumbersome and expensive. We propose a family of metric access methods that are fast and easy to implement on top of existing access methods, such as sequential scan, R-trees and Slim-trees. The idea is to elect a set of objects as foci, and gauge all other objects with their distances from this set. We show how to define the foci set cardinality, how to choose appropriate foci, and how to perform range and nearest-neighbor queries using them, without false dismissals. The foci increase the pruning of distance calculations during the query processing. Furthermore we index the distances from each object to the foci to reduce even triangular inequality comparisons. Experiments on real and synthetic datasets show that our methods match or outperform existing methods. They are up to 10 times faster, and perform up to 10 times fewer distance calculations and disk accesses. In addition, they scale up well, exhibiting sub-linear performance with growing database size.
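
The pruning rests on the triangle inequality, which is what guarantees no false dismissals; a compact sketch of a foci-based range query follows.

```python
# Foci pruning sketch: |d(o, f) - d(q, f)| <= d(o, q), so an object is discarded
# whenever that lower bound already exceeds the query radius; the (expensive)
# distance d(o, q) is computed only for the survivors.

def range_query(objects, foci, dist, q, radius):
    precomputed = {o: [dist(o, f) for f in foci] for o in objects}  # built once, reused
    q_to_foci = [dist(q, f) for f in foci]
    results = []
    for o in objects:
        lower_bound = max(abs(a - b) for a, b in zip(precomputed[o], q_to_foci))
        if lower_bound > radius:
            continue                    # pruned: cannot lie within the radius
        if dist(o, q) <= radius:        # exact check only for non-pruned objects
            results.append(o)
    return results

euclid = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
points = [(0, 0), (1, 1), (5, 5), (9, 9)]
print(range_query(points, foci=[(0, 0), (9, 9)], dist=euclid, q=(1, 0), radius=2))
```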

Proceedings ArticleDOI
02 Apr 2001
TL;DR: This work proposes several estimation algorithms that apply set hashing and maximal overlap to estimate the number of matches of query twiglets formed using variations on different twiglet decomposition techniques, and demonstrates that accurate and robust estimates can be achieved, even with limited space.
Abstract: Describes efficient algorithms for accurately estimating the number of matches of a small node-labeled tree, i.e. a twig, in a large node-labeled tree, using a summary data structure. This problem is of interest for queries on XML and other hierarchical data, to provide query feedback and for cost-based query optimization. Our summary data structure scalably represents approximate frequency information about twiglets (i.e. small twigs) in the data tree. Given a twig query, the number of matches is estimated by creating a set of query twiglets, and combining two complementary approaches: set hashing, used to estimate the number of matches of each query twiglet, and maximal overlap, used to combine the query twiglet estimates into an estimate for the twig query. We propose several estimation algorithms that apply these approaches on query twiglets formed using variations on different twiglet decomposition techniques. We present an extensive experimental evaluation using several real XML data sets, with a variety of twig queries. Our results demonstrate that accurate and robust estimates can be achieved, even with limited space.

Proceedings ArticleDOI
02 Apr 2001
TL;DR: A new index structure called the SB-tree is introduced, which incorporates features from both segment trees and B-trees, supports the fast lookup of aggregate results based on time, and can be maintained efficiently when the data changes.
Abstract: Considers the problems of computing aggregation queries in temporal databases and of maintaining materialized temporal aggregate views efficiently. The latter problem is particularly challenging, since a single data update can cause aggregate results to change over the entire time-line. We introduce a new index structure called the SB-tree, which incorporates features from both segment trees (S-trees) and B-trees. SB-trees support the fast lookup of aggregate results based on time, and can be maintained efficiently when the data changes. We also extend the basic SB-tree index to handle cumulative (also called moving-window) aggregates. For materialized aggregate views in a temporal database or data warehouse, we propose building and maintaining SB-tree indices instead of the views themselves.

Proceedings ArticleDOI
Harald Schöning
02 Apr 2001
TL;DR: A short overview of Tamino's architecture is given and some of the design considerations for Tamino are addressed, in particular areas where database design for XML was nontrivial and where some issues are still open.
Abstract: Tamino is Software AG's XML database management system. In contrast to solutions of other DBMS vendors, Tamino is not just another layer on top of a database system designed to support the relational or an object-oriented data model. Rather, Tamino has been completely designed for XML. This paper gives a short overview of Tamino's architecture and then addresses some of the design considerations for Tamino. In particular, areas are presented where database design for XML was nontrivial, and where some issues are still open.

Proceedings ArticleDOI
02 Apr 2001
TL;DR: This work proposes modifications to well-known techniques to support the progressive processing of approximate nearest-neighbor queries and develops a new technique based on clustering that merges the benefits of the two general classes of approaches.
Abstract: Develops a general framework for approximate nearest-neighbor queries. We categorize the current approaches for nearest-neighbor query processing based on either their ability to reduce the data set that needs to be examined, or their ability to reduce the representation size of each data object. We first propose modifications to well-known techniques to support the progressive processing of approximate nearest-neighbor queries. A user may therefore stop the retrieval process once enough information has been returned. We then develop a new technique based on clustering that merges the benefits of the two general classes of approaches. Our cluster-based approach allows a user to progressively explore the approximate results with increasing accuracy. We propose a new metric for evaluation of approximate nearest-neighbor searching techniques. Using both the proposed and the traditional metrics, we analyze and compare several techniques with a detailed performance evaluation. We demonstrate the feasibility and efficiency of approximate nearest-neighbor searching. We perform experiments on several real data sets and establish the superiority of the proposed cluster-based technique over the existing techniques for approximate nearest-neighbor searching.

Proceedings ArticleDOI
02 Apr 2001
TL;DR: The tree structure of XML documents is exploited to equip users with a powerful tool, the meet operator, that lets them query databases with whose content they are familiar, but without requiring knowledge of tags and hierarchies.
Abstract: Due to the ubiquity and popularity of XML, users often are in the following situation: they want to query XML documents which contain potentially interesting information, but they are unaware of the mark-up structure that is used. For example, it is easy to guess the contents of an XML bibliography file, whereas the mark-up depends on the methodological, cultural and personal background of the author(s). Nonetheless, it is this hierarchical structure that forms the basis of XML query languages. We exploit the tree structure of XML documents to equip users with a powerful tool, the meet operator, that lets them query databases with whose content they are familiar, but without requiring knowledge of tags and hierarchies. Our approach is based on computing the lowest common ancestor of nodes in the XML syntax tree: e.g., given two strings, we are looking for nodes whose offspring contains these two strings. The novelty of this approach is that the result type is unknown at query formulation time and dependent on the database instance. If the two strings are an author's name and a year, mainly publications of the author in this year are returned. If the two strings are numbers, the result mostly consists of publications that have the numbers as year or page numbers. Because the result type of a query is not specified by the user, we refer to the lowest common ancestor as the nearest concept. We also present a running example taken from the bibliography domain, and demonstrate that the operator can be implemented efficiently.
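
A small sketch of the underlying lowest-common-ancestor computation, using Python's standard ElementTree on a toy bibliography; the paper implements the operator inside an XML database, not through this API.

```python
# "Nearest concept" sketch: LCA of nodes whose text contains two search strings.

import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<bib>
  <article><author>Codd</author><title>A Relational Model</title><year>1970</year></article>
  <article><author>Gray</author><title>Transaction Concept</title><year>1981</year></article>
</bib>""")

def nodes_containing(root, text):
    return [n for n in root.iter() if any(text in (t or "") for t in (n.text, n.tail))]

def lowest_common_ancestor(root, a, b):
    parent = {child: node for node in root.iter() for child in node}
    ancestors = set()
    while a is not None:               # walk from a up to the root
        ancestors.add(a)
        a = parent.get(a)
    while b not in ancestors:          # climb from b until we hit that path
        b = parent.get(b)
    return b

meets = [lowest_common_ancestor(doc, x, y)
         for x in nodes_containing(doc, "Codd") for y in nodes_containing(doc, "1970")]
print({m.tag for m in meets})  # {'article'}: the publication linking "Codd" and "1970"
```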

Proceedings ArticleDOI
02 Apr 2001
TL;DR: The authors propose a subsequence matching method, Dual Match, which exploits duality in constructing windows and significantly improves performance in large database applications, and formally prove that the dual approach is correct.
Abstract: The authors propose a subsequence matching method, Dual Match, which exploits duality in constructing windows and significantly improves performance. Dual Match divides data sequences into disjoint windows and the query sequence into sliding windows, and thus is a dual approach of the one by C. Faloutsos et al. (1994), called FRM, which divides data sequences into sliding windows and the query sequence into disjoint windows. We formally prove that our dual approach is correct, i.e., it incurs no false dismissal. We also prove that, given the minimum query length, there is a maximum bound on the window size to guarantee correctness of Dual Match, and discuss the effect of the window size on performance. FRM causes a lot of false alarms by storing minimum bounding rectangles rather than individual points representing windows, to avoid the excessive storage space required for the index. Dual Match solves this problem by directly storing points, but without incurring excessive storage overhead. Experimental results show that, in most cases, Dual Match provides large improvement in both false alarms and performance over FRM, given the same amount of storage space. In particular, for low selectivities (less than 10^-4), Dual Match significantly improves performance up to 430-fold. On the other hand, for high selectivities (more than 10^-2), it shows a very minor degradation (less than 29%). For selectivities in between (10^-4 to 10^-2), Dual Match shows performance slightly better than that of FRM. Dual Match is also 4.10 to 25.6 times faster than FRM in building indexes of approximately the same size. Overall, these results indicate that our approach provides a new paradigm in subsequence matching that improves performance significantly in large database applications.
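
The window construction that gives the method its name can be shown in a few lines; the sketch below only builds the windows and omits the feature extraction and index probing.

```python
# Dual Match window construction: disjoint windows over the data, sliding windows
# over the query (the opposite of FRM, which slides over the data).

def disjoint_windows(seq, w):
    return [tuple(seq[i:i + w]) for i in range(0, len(seq) - w + 1, w)]

def sliding_windows(seq, w):
    return [tuple(seq[i:i + w]) for i in range(len(seq) - w + 1)]

data  = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
query = [1, 5, 9, 2]
w = 2
print(disjoint_windows(data, w))   # [(3, 1), (4, 1), (5, 9), (2, 6), (5, 3)] -> indexed
print(sliding_windows(query, w))   # [(1, 5), (5, 9), (9, 2)] -> probed against the index
```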

Proceedings ArticleDOI
02 Apr 2001
TL;DR: This work has developed a Collaborative Process Manager (CPM) to support decentralized, peer-to-peer process management for inter-enterprise collaboration at the business process level and embedded it into a dynamic software agent architecture, E-Carry, to elevate multi-agent cooperation from the conversation level to the process level for mediating e-commerce applications.
Abstract: Conventional workflow systems are primarily designed for intra-enterprise process management, and they are hardly used to handle processes with tasks and data separated by enterprise boundaries, for reasons such as security, privacy, sharability, firewalls, etc. Further, the cooperation of multiple enterprises is often based on peer-to-peer interactions rather than centralized coordination. As a result, the conventional centralized process management architecture does not fit into the picture of inter-enterprise business-to-business e-commerce. We have developed a Collaborative Process Manager (CPM) to support decentralized, peer-to-peer process management for inter-enterprise collaboration at the business process level. A collaborative process is not handled by a centralized workflow engine, but by multiple CPMs, each representing a player in the business process. Each CPM is used to schedule, dispatch and control the tasks of the process that the player is responsible for, and the CPMs interoperate through an inter-CPM messaging protocol. We have implemented CPM and embedded it into a dynamic software agent architecture, E-Carry, that we developed at HP Labs, to elevate multi-agent cooperation from the conversation level to the process level for mediating e-commerce applications.

Proceedings ArticleDOI
Goetz Graefe, Per-Åke Larson
02 Apr 2001
TL;DR: Since many existing techniques for exploiting CPU caches in the implementation of B-tree indexes have not been discussed in the literature, most of them are surveyed here and made widely available in order to enable, structure, and hopefully stimulate future research.
Abstract: Since many existing techniques for exploiting CPU caches in the implementation of B-tree indexes have not been discussed in the literature, most of them are surveyed. Rather than providing a detailed performance evaluation for one or two of them on some specific contemporary hardware, the purpose is to survey and to make widely available this heretofore-folkloric knowledge in order to enable, structure, and hopefully stimulate future research.

Proceedings ArticleDOI
02 Apr 2001
TL;DR: An extension to current view-based mediator systems called "model-based mediation", in which views are defined and executed at the level of conceptual models (CMs) rather than at the structural level, is proposed.
Abstract: Proposes an extension to current view-based mediator systems called "model-based mediation", in which views are defined and executed at the level of conceptual models (CMs) rather than at the structural level. Structural integration and lifting of data to the conceptual level is "pushed down" from the mediator to wrappers which, in our system, export the classes, associations, constraints and query capabilities of a source. Another novel feature of our architecture is the use of domain maps - semantic nets of concepts and relationships that are used to mediate across sources from multiple worlds (i.e. whose data are related in indirect and often complex ways). As part of registering a source's CM with the mediator, the wrapper creates a "semantic index" of its data into the domain map. We show that these indexes not only semantically correlate the multiple-worlds data, and thereby support the definition of the integrated CM, but they are also useful during query processing, for example, to select relevant sources. A first prototype of the system has been implemented for a complex neuroscience mediation problem.

Proceedings ArticleDOI
02 Apr 2001
TL;DR: The solution is dynamic, allowing insertion or deletion of points in O(d log_p n) page accesses, and generalizes easily to find approximate k-nearest neighbors.
Abstract: We present a new approach for approximate nearest neighbor queries for sets of high dimensional points under any L_t-metric, t = 1, ..., ∞. The proposed algorithm is efficient and simple to implement. The algorithm uses multiple shifted copies of the data points and stores them in up to (d+1) B-trees, where d is the dimensionality of the data, sorted according to their position along a space filling curve. This is done in a way that allows us to guarantee that a neighbor within an O(d^(1+1/t)) factor of the exact nearest can be returned with at most (d+1) log_p n page accesses, where p is the branching factor of the B-trees. In practice, for real data sets, our approximate technique finds the exact nearest neighbor between 87% and 99% of the time and a point no farther than the third nearest neighbor between 98% and 100% of the time. Our solution is dynamic, allowing insertion or deletion of points in O(d log_p n) page accesses, and generalizes easily to find approximate k-nearest neighbors.
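
A compact sketch of the machinery described above: a bit-interleaving (Z-order) key stands in for the space-filling curve, sorted lists stand in for the B-trees, and the shifts are random rather than the deterministic ones needed for the paper's approximation guarantee.

```python
# Shifted-copies / space-filling-curve sketch: gather candidates from the sorted
# neighborhood of the query key in each shifted copy, then compare only those exactly.

import bisect, random

def z_key(point, bits=10):
    key = 0
    for b in range(bits):                     # interleave the bits of all coordinates
        for d, coord in enumerate(point):
            key |= ((coord >> b) & 1) << (b * len(point) + d)
    return key

def build(points, copies=3, dims=2, max_coord=1 << 10):
    shifts = [tuple(random.randrange(max_coord // 2) for _ in range(dims))
              for _ in range(copies)]
    lists = [sorted((z_key(tuple(c + o for c, o in zip(p, s))), p) for p in points)
             for s in shifts]
    return shifts, lists

def approx_nn(query, shifts, lists, probe=4):
    candidates = set()
    for s, keyed in zip(shifts, lists):
        k = z_key(tuple(c + o for c, o in zip(query, s)))
        pos = bisect.bisect_left(keyed, (k, query))
        for _, p in keyed[max(0, pos - probe):pos + probe]:   # neighbors along the curve
            candidates.add(p)
    return min(candidates, key=lambda p: sum((a - b) ** 2 for a, b in zip(p, query)))

random.seed(7)
pts = [(random.randrange(500), random.randrange(500)) for _ in range(200)]
shifts, lists = build(pts)
print(approx_nn((123, 321), shifts, lists))
```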

Proceedings ArticleDOI
02 Apr 2001
TL;DR: It is argued that a block-oriented processing strategy for database operations can lead to better utilization of the processors and caches, generating significantly higher performance.
Abstract: Database systems are not well-tuned to take advantage of modern superscalar processor architectures. In particular, the clocks per instruction (CPI) for rather simple database queries are quite poor compared to scientific kernels or SPEC benchmarks. The lack of performance of database systems has been attributed to poor utilization of caches and processor function units as well as higher branching penalties. In this paper, we argue that a block-oriented processing strategy for database operations can lead to better utilization of the processors and caches, generating significantly higher performance. We have implemented the block-oriented processing technique for aggregation expression evaluation and sorting operations as a feature in the DB2 Universal Database (UDB) system. We present results from representative queries on a 30-GB TPC-H (Transaction Processing Council Benchmark H) database to show the value of this technique.
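
The contrast between tuple-at-a-time and block-at-a-time evaluation can be sketched as follows; this is illustrative only and shows the control-flow difference, not DB2's implementation.

```python
# Block-oriented processing sketch: operate on blocks of column values in a tight
# loop instead of interpreting one tuple at a time.

def sum_tuple_at_a_time(rows):
    total = 0
    for row in rows:                           # per-tuple interpretation overhead
        total += row["price"] * row["qty"]
    return total

def sum_block_at_a_time(blocks):
    total = 0
    for price_block, qty_block in blocks:      # each block is a pair of column arrays
        total += sum(p * q for p, q in zip(price_block, qty_block))
    return total

rows = [{"price": p, "qty": q} for p, q in [(10, 2), (20, 1), (5, 4)]]
blocks = [([10, 20], [2, 1]), ([5], [4])]      # the same data, in blocks of column values
print(sum_tuple_at_a_time(rows), sum_block_at_a_time(blocks))  # 60 60
```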

Proceedings ArticleDOI
02 Apr 2001
TL;DR: The new API for data mining proposed by Microsoft as extensions to the OLE DB standard is described and it is believed this new API will go a long way in enabling deployment of data mining in enterprise data warehouses.
Abstract: The integration of data mining with traditional database systems is key to making it convenient, easy to deploy in real applications, and to growing its user base. We describe the new API for data mining proposed by Microsoft as extensions to the OLE DB standard. We illustrate the basic notions that motivated the API's design and describe the key components of an OLE DB for the data mining provider. We also include examples of the usage and treat the problems of data representation and integration with the SQL framework. We believe this new API will go a long way in enabling deployment of data mining in enterprise data warehouses. A reference implementation of a provider is available with the recent release of Microsoft SQL Server 2000 database system.

Proceedings ArticleDOI
02 Apr 2001
TL;DR: This paper proposes an indexing structure scheme based on the relative region coordinates that can effectively deal with the update problem and presents an algorithm to construct a tree-structured index in which related coordinates are stored together.
Abstract: For most of the index structures for XML data proposed so far, updating is a problem, because an XML element's coordinates are expressed using absolute values. Due to the structural relationship among the elements in XML documents, we have to re-compute these absolute values if the content of the source data is updated. The reconstruction requires the updating of a large portion of the index files, which causes a serious problem, especially when the XML data content is updated frequently. In this paper, we propose an indexing structure scheme based on relative region coordinates that can effectively deal with the update problem. The main idea is that we express the coordinates of an XML element based on the region of its parent element. We present an algorithm to construct a tree-structured index in which related coordinates are stored together. In consequence, our indexing scheme requires the updating of only a small portion of the index file.

Proceedings ArticleDOI
02 Apr 2001
TL;DR: The article addresses architectural issues arising from designing a product to support XML as its core representation, choices in the design of the underlying algebra, on-the-fly data cleaning and caching and materialization policies, and issues which require more attention from the research community.
Abstract: For better or for worse, XML has emerged as a de facto standard for data interchange. This consensus is likely to lead to increased demand for technology that allows users to integrate data from a variety of applications, repositories, and partners, which are located across the corporate intranet or on the Internet. Nimble Technology has spent two years developing a product to service this market. Originally conceived after decades of person-years of research on data integration, the product is now being deployed at several Fortune-500 beta-customer sites. The article reports on the key challenges faced in the design of our product and highlights some issues which require more attention from the research community. In particular we address architectural issues arising from designing a product to support XML as its core representation, choices in the design of the underlying algebra, on-the-fly data cleaning and caching and materialization policies.

Proceedings ArticleDOI
02 Apr 2001
TL;DR: The MD-join provides a clean separation between group definition and aggregate computation, allowing great flexibility in the expression of OLAP queries, and several algebraic transformations that allow relational algebra queries that include MD-joins to be optimized.
Abstract: OLAP queries (i.e. group-by or cube-by queries with aggregation) have proven to be valuable for data analysis and exploration. Many decision support applications need very complex OLAP queries, requiring a fine degree of control over both the group definition and the aggregates that are computed. For example, suppose that the user has access to a data cube whose measure attribute is Sum(Sales). Then the user might wish to compute the sum of sales in New York and the sum of sales in California for those data cube entries in which Sum(Sales)>$1,000,000. This type of complex OLAP query is often difficult to express and difficult to optimize using standard relational operators (including standard aggregation operators). In this paper, we propose the MD-join operator for complex OLAP queries. The MD-join provides a clean separation between group definition and aggregate computation, allowing great flexibility in the expression of OLAP queries. In addition, the MD-join has a simple and easily optimizable implementation, while the equivalent relational algebra expression is often complex and difficult to optimize. We present several algebraic transformations that allow relational algebra queries that include MD-joins to be optimized.
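
A sketch of the operator's semantics as described above: for every row of a base table of groups, aggregates are computed over the detail rows related to it by a condition. The table and column names below are made up for illustration, and this is not the paper's formal definition.

```python
# MD-join-style evaluation: separate group definition (base_rows + condition)
# from aggregate computation (aggregates over the matching detail rows).

def md_join(base_rows, detail_rows, condition, aggregates):
    out = []
    for b in base_rows:
        matching = [d for d in detail_rows if condition(b, d)]
        row = dict(b)
        for name, func in aggregates.items():
            row[name] = func(matching)
        out.append(row)
    return out

cube = [{"product": "p1", "total_sales": 1_500_000},
        {"product": "p2", "total_sales": 800_000}]
sales = [{"product": "p1", "state": "NY", "amount": 900_000},
         {"product": "p1", "state": "CA", "amount": 600_000},
         {"product": "p2", "state": "NY", "amount": 800_000}]

# For cube entries with Sum(Sales) > $1,000,000, attach NY and CA sales separately.
result = md_join(
    [b for b in cube if b["total_sales"] > 1_000_000], sales,
    condition=lambda b, d: b["product"] == d["product"],
    aggregates={"ny_sales": lambda m: sum(d["amount"] for d in m if d["state"] == "NY"),
                "ca_sales": lambda m: sum(d["amount"] for d in m if d["state"] == "CA")})
print(result)
```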

Proceedings ArticleDOI
02 Apr 2001
TL;DR: The authors propose an analytical cost model for the similarity join operation based on indexes and propose a new index architecture and join algorithm which allows a separate optimization of CPU time and I/O time.
Abstract: The similarity join is an important database primitive which has been successfully applied to speed up data mining algorithms. In the similarity join, two point sets of a multidimensional vector space are combined such that the result contains all point pairs where the distance does not exceed a parameter ε. Due to its high practical relevance, many similarity join algorithms have been devised. The authors propose an analytical cost model for the similarity join operation based on indexes. Our problem analysis reveals a serious optimization conflict between CPU time and I/O time: fine-grained index structures are beneficial for CPU efficiency, but deteriorate the I/O performance. As a consequence of this observation, we propose a new index architecture and join algorithm which allows a separate optimization of CPU time and I/O time. Our solution utilizes large pages which are optimized for I/O processing. The pages accommodate a search structure which minimizes the computational effort. In the experimental evaluation, a substantial improvement over competitive techniques is shown.
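
For reference, the operation being optimized: an ε-join that reports all point pairs within distance ε, here implemented with a simple grid so that only neighboring cells are compared. This is an illustration of the primitive, not the paper's index architecture.

```python
# Grid-based epsilon-join sketch: points within eps of each other can only lie in
# the same grid cell or an adjacent one (cell width = eps).

from itertools import product
from collections import defaultdict
from math import dist  # Python 3.8+

def similarity_join(r, s, eps):
    grid = defaultdict(list)
    for p in s:
        grid[tuple(int(c // eps) for c in p)].append(p)
    result = []
    for p in r:
        cell = tuple(int(c // eps) for c in p)
        for offset in product((-1, 0, 1), repeat=len(p)):      # the cell and its neighbors
            for q in grid[tuple(c + o for c, o in zip(cell, offset))]:
                if dist(p, q) <= eps:
                    result.append((p, q))
    return result

r = [(0.0, 0.0), (2.0, 2.0)]
s = [(0.3, 0.4), (5.0, 5.0), (2.2, 1.9)]
print(similarity_join(r, s, eps=0.6))  # [((0.0, 0.0), (0.3, 0.4)), ((2.0, 2.0), (2.2, 1.9))]
```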

Proceedings ArticleDOI
02 Apr 2001
TL;DR: This paper shows that semi-join reducers can indeed be beneficial in modern client-server or middleware systems - either to reduce communication costs or to better exploit all the resources of a system.
Abstract: Semi-join reducers were introduced in the late 1970s as a means to reduce the communication costs of distributed database systems. Subsequent work in the 1980s showed, however, that semi-join reducers are rarely beneficial for the distributed systems of that time. This paper shows that semi-join reducers can indeed be beneficial in modern client-server or middleware systems - either to reduce communication costs or to better exploit all the resources of a system. Furthermore, we present and evaluate alternative ways to extend state-of-the-art (dynamic programming) query optimizers in order to generate good query plans with semi-join reducers. We present two variants, called Access Root and Join Root, which differ in their implementation complexity, running times and the quality of the plans they produce. We present the results of performance experiments that compare both variants with a traditional query optimizer.
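
A sketch of the reduction itself: the join column of one table is shipped first and used to filter the other table before it is sent to the join site. The table contents below are made up.

```python
# Semi-join reduction sketch: only rows that can possibly join are communicated.

def semi_join_reduce(local_rows, remote_join_keys, key):
    """Keep only local rows whose join key appears on the remote side."""
    keys = set(remote_join_keys)    # the (small) projection shipped across the network
    return [r for r in local_rows if r[key] in keys]

orders    = [{"cust": 1, "total": 99}, {"cust": 2, "total": 15}, {"cust": 7, "total": 42}]
customers = [{"cust": 1, "name": "Ada"}, {"cust": 7, "name": "Grace"}]

reduced = semi_join_reduce(orders, (c["cust"] for c in customers), key="cust")
print(reduced)  # only the two orders that can join (cust 1 and cust 7) are shipped
```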

Proceedings ArticleDOI
02 Apr 2001
TL;DR: Interaction expressions and graphs are proposed as a simple yet powerful formalism for the specification and implementation of synchronization conditions in general and inter-workflow dependencies in particular.
Abstract: Current workflow management technology does not provide adequate means for inter-workflow coordination as concurrently executing workflows are considered completely independent. While this simplified view might suffice for one application domain or the other, there are many real-world application scenarios where workflows, though independently modeled in order to remain comprehensible and manageable, are semantically interrelated. As pragmatical approaches, like merging interdependent workflows or inter-workflow message passing, do not satisfactorily solve the inter-workflow coordination problem, interaction expressions and graphs are proposed as a simple yet powerful formalism for the specification and implementation of synchronization conditions in general and inter-workflow dependencies in particular. In addition to a graph based semi-formal interpretation of the formalism, a precise formal semantics, an equivalent operational semantics, an efficient implementation of the latter, and detailed complexity analyses have been developed, allowing the formalism to be actually applied to solve real-world problems like inter-workflow coordination.