Showing papers in "Distributed and Parallel Databases in 2006"
TL;DR: FD is completely distributed, does not depend on the existence of certain peers, addresses the volatility of peers during query execution, and can achieve major performance gains in terms of communication and response time.
Abstract: A major problem of unstructured P2P systems is their heavy network traffic. This is caused mainly by high numbers of query answers, many of which are irrelevant for users. One solution to this problem is to use Top-k queries whereby the user can specify a limited number (k) of the most relevant answers. In this paper, we present FD, a (Fully Distributed) framework for executing Top-k queries in unstructured P2P systems, with the objective of reducing network traffic. FD consists of a family of algorithms that are simple but effective. FD is completely distributed, does not depend on the existence of certain peers, and addresses the volatility of peers during query execution. We validated FD through implementation over a 64-node cluster and simulation using the BRITE topology generator and SimJava. Our performance evaluation shows that FD can achieve major performance gains in terms of communication and response time.
78 citations
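The Top-k idea in the FD abstract above can be illustrated with a minimal sketch. It is not the FD algorithm family itself, which the abstract does not specify. It only shows why returning k answers per peer and merging them reduces the number of answers sent. The peer names, item ids, scores, and function names are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple

Answer = Tuple[float, str]  # (relevance score, item id); both are illustrative

def local_top_k(scored_items: Dict[str, float], k: int) -> List[Answer]:
    """Return the k highest-scoring local answers of one peer, best first."""
    return heapq.nlargest(k, ((score, item) for item, score in scored_items.items()))

def merge_top_k(score_lists: List[List[Answer]], k: int) -> List[Answer]:
    """Merge several partial answer lists and keep only the k best overall."""
    return heapq.nlargest(k, (answer for lst in score_lists for answer in lst))

# Hypothetical peers and relevance scores, made up for this example.
peers = {
    "p1": {"a": 0.9, "b": 0.2, "c": 0.5},
    "p2": {"d": 0.8, "e": 0.7},
    "p3": {"f": 0.1, "g": 0.95},
}
k = 2
partial = [local_top_k(items, k) for items in peers.values()]
result = merge_top_k(partial, k)
sent = sum(len(lst) for lst in partial)             # answers actually transmitted
total = sum(len(items) for items in peers.values())  # answers without Top-k
print(result, sent, total)
```

Each peer forwards at most k answers, so the number of transmitted answers grows with k and the number of peers rather than with the size of each peer's local result.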
TL;DR: This paper discusses the cgmCUBE Project, a multi-year effort to design and implement a multi-processor platform for data cube generation that targets the relational database model (ROLAP), and discusses new algorithmic and system optimizations relating to a thorough optimization of the underlying sequential cube construction method.
Abstract: On-line Analytical Processing (OLAP) has become one of the most powerful and prominent technologies for knowledge discovery in VLDB (Very Large Database) environments. Central to the OLAP paradigm is the data cube, a multi-dimensional hierarchy of aggregate values that provides a rich analytical model for decision support. Various sequential algorithms for the efficient generation of the data cube have appeared in the literature. However, given the size of contemporary data warehousing repositories, multi-processor solutions are crucial for the massive computational demands of current and future OLAP systems.
In this paper we discuss the cgmCUBE Project, a multi-year effort to design and implement a multi-processor platform for data cube generation that targets the relational database model (ROLAP). More specifically, we discuss new algorithmic and system optimizations relating to (1) a thorough optimization of the underlying sequential cube construction method and (2) a detailed and carefully engineered cost model for improved parallel load balancing and faster sequential cube construction. These optimizations were key in allowing us to build a prototype that is able to produce data cube output at a rate of over one TeraByte per hour.
48 citations
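For readers unfamiliar with the data cube mentioned in the cgmCUBE abstract above, the following is a naive sequential sketch: it computes one aggregate table per subset of the dimensions. It is not the cgmCUBE method and uses none of the optimizations or the cost model the abstract refers to. The column names and rows are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def data_cube(rows, dims, measure):
    """Compute SUM(measure) for every subset of dims (2^d group-bys)."""
    cube = {}
    for r in range(len(dims) + 1):
        for group in combinations(dims, r):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in group)
                totals[key] += row[measure]
            cube[group] = dict(totals)
    return cube

# Hypothetical fact table.
rows = [
    {"region": "east", "product": "x", "sales": 10},
    {"region": "east", "product": "y", "sales": 5},
    {"region": "west", "product": "x", "sales": 7},
]
cube = data_cube(rows, ("region", "product"), "sales")
print(cube[("region",)])
```

This version scans the input once per group-by, so its cost grows exponentially with the number of dimensions. That growth is one reason the abstract argues for parallel cube construction.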
TL;DR: This paper proposes a low-complexity, practical resource selection and scheduling algorithm that enables queries to employ partitioned parallelism, in order to achieve better performance in a Grid setting.
Abstract: Advances in network technologies and the emergence of Grid computing have both increased the need and provided the infrastructure for computation and data intensive applications to run over collections of heterogeneous and autonomous nodes. In the context of database query processing, existing parallelisation techniques cannot operate well in Grid environments because the way they select machines and allocate tasks compromises partitioned parallelism. The main contribution of this paper is the proposal of a low-complexity, practical resource selection and scheduling algorithm that enables queries to employ partitioned parallelism, in order to achieve better performance in a Grid setting. The evaluation results show that the scheduler proposed outperforms current techniques without sacrificing the efficiency of resource utilisation.
39 citations
TL;DR: SWIFT, a static two-phase locking and high priority based, write-update type, ideal for fast and timeliness commit protocol, is proposed; its performance is also analyzed for partial read-only optimization, which minimizes intersite message traffic, execute-commit conflicts and log writes, consequently resulting in a better response time.
Abstract: Although there are several factors contributing to the difficulty in meeting distributed real time transaction deadlines, data conflicts among transactions, especially in the commitment phase, are the prime factor resulting in system performance degradation. Therefore, the design of an efficient commit protocol is of great significance for distributed real time database systems (DRTDBS). Most of the existing commit protocols try to improve system performance by allowing a committing cohort to lend its data to an executing cohort, thus reducing data inaccessibility. These protocols block the borrower when it tries to send the WORKDONE/PREPARED message [1, 6, 8, 9], thus increasing the transaction commit time. This paper first analyzes all kinds of dependencies that may arise due to data access conflicts among executing-committing transactions when a committing cohort is allowed to lend its data to an executing cohort. It then proposes a static two-phase locking and high priority based, write-update type, ideal for fast and timeliness commit protocol, i.e. SWIFT. In SWIFT, the execution phase of a cohort is divided into two parts, a locking phase and a processing phase, and then, in place of the WORKDONE message, a WORKSTARTED message is sent just before the start of the processing phase of the cohort. Further, the borrower is allowed to send the WORKSTARTED message if it is only commit dependent on other cohorts, instead of being blocked as in [1, 6, 8, 9]. This reduces the time needed for commit processing and is free from cascaded aborts. To ensure non-violation of ACID properties, checking of completion of processing and the removal of dependency of the cohort are required before sending the YES-VOTE message. Simulation results show that SWIFT improves the system performance in comparison to earlier protocols.
The performance of SWIFT is also analyzed for partial read-only optimization, which minimizes intersite message traffic, execute-commit conflicts and log writes consequently resulting in a better response time. The impact of permitting the cohorts of the same transaction to communicate with each other [5] on SWIFT has also been analyzed.
38 citations
TL;DR: In this paper, the authors describe data mining and data warehousing techniques that can improve the performance and usability of Intrusion Detection Systems (IDS) by modeling network traffic and alerts using a multi-dimensional data model and star schemas.
Abstract: This paper describes data mining and data warehousing techniques that can improve the performance and usability of Intrusion Detection Systems (IDS). Current IDS do not provide support for historical data analysis and data summarization. This paper presents techniques to model network traffic and alerts using a multi-dimensional data model and star schemas. This data model was used to perform network security analysis and detect denial of service attacks. Our data model can also be used to handle heterogeneous data sources (e.g. firewall logs, system calls, net-flow data) and enable up to two orders of magnitude faster query response times for analysts as compared to the current state of the art. We have used our techniques to implement a prototype system that is being successfully used at Army Research Labs. Our system has helped the security analyst in detecting intrusions and in historical data analysis for generating reports on trend analysis.
29 citations
TL;DR: This paper proposes a framework for replicated declustering, using a limited amount of replication and provides extensions to apply it on real data, which include arbitrary grids and a large number of disks, and shows that this framework is effective for parallel processing of multiple queries.
Abstract: A common technique used to minimize I/O in data intensive applications is data declustering over parallel servers. This technique involves distributing data among several disks so as to parallelize query retrieval and thus, improve performance. We focus on optimizing access to large spatial data, and the most common type of queries on such data, i.e., range queries. An optimal declustering scheme is one in which the processing for all range queries is balanced uniformly among the available disks. It has been shown that single copy based declustering schemes are non-optimal for range queries. In this paper, we integrate replication in conjunction with parallel disk declustering for efficient processing of range queries. We note that replication is largely used in database applications for several purposes like load balancing, fault tolerance and availability of data. We propose theoretical foundations for replicated declustering and propose a class of replicated declustering schemes, periodic allocations, which are shown to be strictly optimal for a number of disks. We propose a framework for replicated declustering, using a limited amount of replication and provide extensions to apply it on real data, which include arbitrary grids and a large number of disks. Our framework also provides an effective indexing scheme that enables fast identification of data of interest in parallel servers. In addition to optimal processing of single queries, we show that this framework is effective for parallel processing of multiple queries. We present experimental results comparing the proposed replication scheme to other techniques for both single queries and multiple queries, on synthetic and real data sets.
20 citations
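The declustering notions in the abstract above can be made concrete with a small single-copy sketch. It assumes that a periodic allocation assigns grid cell (i, j) to disk (a*i + b*j) mod N. That form and the parameters a, b and N are assumptions, since the abstract does not give the definition. The sketch does not implement the replicated scheme. It only compares the retrieval cost of a range query with the balanced optimum of ceil(|Q| / N).

```python
from collections import Counter
from math import ceil

def periodic_disk(i, j, a, b, n_disks):
    """Disk for grid cell (i, j) under the assumed form (a*i + b*j) mod N."""
    return (a * i + b * j) % n_disks

def retrieval_cost(query, a, b, n_disks):
    """Parallel cost of a range query: the largest number of cells on one disk."""
    (i0, i1), (j0, j1) = query  # inclusive row and column ranges
    load = Counter(
        periodic_disk(i, j, a, b, n_disks)
        for i in range(i0, i1 + 1)
        for j in range(j0, j1 + 1)
    )
    return max(load.values())

def optimal_cost(query, n_disks):
    """Cost if the query's cells were spread perfectly evenly over the disks."""
    (i0, i1), (j0, j1) = query
    return ceil((i1 - i0 + 1) * (j1 - j0 + 1) / n_disks)

query = ((0, 1), (0, 1))                 # a 2 x 2 range query
cost_a = retrieval_cost(query, 1, 1, 5)  # two cells land on the same disk
cost_b = retrieval_cost(query, 1, 2, 5)  # all four cells land on distinct disks
best = optimal_cost(query, 5)
print(cost_a, cost_b, best)
```

With five disks, the choice a = 1, b = 1 needs two accesses for this query while a = 1, b = 2 matches the optimum of one, which shows how the allocation parameters affect the additive error.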
TL;DR: A performance evaluation of the Incremental Click-Stream Tree model over two different Web server access logs indicates that the proposed incremental model yields significant speed-up of recommendation time and improvement of the prediction accuracy.
Abstract: Predicting the next request of a user has gained importance as Web-based activity increases in order to guide Web users during their visits to Web sites. Previously proposed methods for recommendation use data collected over time in order to extract usage patterns. However, these patterns may change over time, because each day new log entries are added to the database and old entries are deleted. Thus, over time it is highly desirable to perform the update of the recommendation model incrementally. In this paper, we propose a new model for modeling and predicting Web user sessions which attempts to reduce the online recommendation time while retaining predictive accuracy. Since it is very easy to modify the model, it is updated during the recommendation process. The incremental algorithm yields a better prediction accuracy as well as a shorter online recommendation time. A performance evaluation of the Incremental Click-Stream Tree model over two different Web server access logs indicates that the proposed incremental model yields significant speed-up of recommendation time and improvement of the prediction accuracy.
18 citations
TL;DR: This work proposes a distributed peer to peer Web service registry solution based on lightweight Web service profiles that allows the specification of arbitrary contexts of Web services and presents a prototype that uses tuple spaces as global storage and communication means.
Abstract: Transient Web service provisioning implies a variety of different requirements that are hard to meet in traditional Web service environments. Currently, Web service brokerage focuses on centralized or replicated architectures. We argue that such systems are not efficient when it comes to dynamic, respectively ad hoc, Web service provisioning. We propose a distributed peer to peer Web service registry solution based on lightweight Web service profiles. We further introduce the notion of views that allow the specification of arbitrary contexts of Web services and provide a working example to illustrate our approach. Finally, we present a prototype that uses tuple spaces as global storage and communication means.
18 citations
TL;DR: This paper shows that the two-dimensional sequential access requirement cannot be satisfied by simply modeling MEMS-based storage as conventional disks, and proposes a new placement scheme that exploits the physical properties of MEMS-based storage to solve this problem.
Abstract: Due to the large difference between seek time and transfer time in current disk technology, it is advantageous to perform large I/O using a single sequential access rather than multiple small random I/O accesses. However, prior optimal cost and data placement approaches for processing range queries over two-dimensional datasets do not consider this property. In particular, these techniques do not consider the issue of sequential data placement when multiple I/O blocks need to be retrieved from a single device. In this paper, we reevaluate the optimal cost of range queries by declustering two-dimensional datasets over multiple devices, and prove that, in general, it is impossible to achieve the new optimal cost. This is because disks cannot facilitate two-dimensional sequential access which is required by the new optimal cost. Then we revisit the existing data allocation schemes under the new optimal cost, and show that none of them can achieve the new optimal cost. Fortunately, MEMS-based storage is being developed to reduce I/O cost. We first show that the two-dimensional sequential access requirement can not be satisfied by simply modeling MEMS-based storage as conventional disks. Then we propose a new placement scheme that exploits the physical properties of MEMS-based storage to solve this problem. Our theoretical analysis and experimental results show that the new scheme achieves almost optimal I/O costs.
16 citations
TL;DR: This work enumerates the different kinds of dependencies that may be present in an advanced transaction and classify them into two broad categories: event ordering and event enforcement dependencies.
Abstract: Transactional dependencies play an important role in coordinating and executing the subtransactions in advanced transaction processing models, such as, nested transactions and workflow transactions. Researchers have formalized the notion of transactional dependencies and have shown how various advanced transaction models can be expressed using different kinds of dependencies. Incorrect specification of dependencies can result in unpredictable behavior of the advanced transaction, which, in turn, can lead to unavailability of resources and information integrity problems. In this work, we focus on how to correctly specify dependencies in an advanced transaction. We enumerate the different kinds of dependencies that may be present in an advanced transaction and classify them into two broad categories: event ordering and event enforcement dependencies. Different event ordering and event enforcement dependencies in an advanced transaction often interact in subtle ways resulting in conflicts and redundancies. We describe the different types of conflicts that can arise due to the presence of multiple dependencies and describe how one can detect such conflicts. An advanced transaction may also contain redundant dependencies--these are dependencies that can be logically derived from other dependencies. We show how such extraneous dependencies can be eliminated to get an equivalent set of dependencies that has the same effect as the original set. Our dependency analysis is done in the context of a generalized advanced transaction model that is capable of expressing different kinds of advanced transactions.
10 citations
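The conflict and redundancy notions in the abstract above can be illustrated under a simplifying assumption: each event ordering dependency is modeled as a pair (a, b) meaning "a must occur before b", a conflict is a cycle among such pairs, and a dependency is redundant if it follows from the others by transitivity. This model and the event names are hypothetical, and the paper's own definitions may be richer.

```python
def reachable(edges, src, dst):
    """True if dst can be reached from src by following one or more edges."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        for a, b in edges:
            if a == node and b not in seen:
                if b == dst:
                    return True
                seen.add(b)
                stack.append(b)
    return False

def has_conflict(edges):
    """A cycle means some event would have to occur before itself."""
    return any(reachable(edges, a, a) for a, _ in edges)

def redundant(edges):
    """Dependencies that can be derived from the remaining ones."""
    return [e for e in edges
            if reachable([x for x in edges if x != e], e[0], e[1])]

# Hypothetical ordering dependencies between subtransaction events.
deps = [
    ("t1.commit", "t2.begin"),
    ("t2.begin", "t3.begin"),
    ("t1.commit", "t3.begin"),  # implied by the two dependencies above
]
extra = redundant(deps)
ok = has_conflict(deps)
bad = has_conflict(deps + [("t3.begin", "t1.commit")])
print(extra, ok, bad)
```

Each redundancy check here is made against the full original set, so when several dependencies imply one another they should be removed one at a time, rechecking after each removal.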
TL;DR: This work proposes a method that combines both strategies efficiently, i.e. mining in parallel for the set of patterns while pushing constraints, and is able to effectively discover frequent patterns in a database made of a billion transactions using a 32-processor cluster in less than an hour and a half.
Abstract: When computationally feasible, mining huge databases produces tremendously large numbers of frequent patterns. In many cases, it is impractical to mine those datasets due to their sheer size; not only the extent of the existing patterns, but mainly the magnitude of the search space. Many approaches have suggested the use of constraints to apply to the patterns or searching for frequent patterns in parallel. So far, those approaches are still not genuinely effective to mine extremely large datasets.
We propose a method that combines both strategies efficiently, i.e. mining in parallel for the set of patterns while pushing constraints. Using this approach we could mine significantly large datasets, with sizes never reported in the literature before. We are able to effectively discover frequent patterns in a database made of a billion transactions using a 32-processor cluster in less than an hour and a half.
TL;DR: In this article, the authors propose a set of spatial relations that need to be supported in browsing applications, namely, the contains, contained and the overlap relations, and prove a lower bound on the storage required to answer queries about the contains relation accurately at a given resolution.
Abstract: As online spatial datasets grow both in number and sophistication, it becomes increasingly difficult for users to decide whether a dataset is suitable for their tasks, especially when they do not have prior knowledge of the dataset. In this paper, we propose browsing as an effective and efficient way to explore the content of a spatial dataset. Browsing allows users to view the size of a result set before evaluating the query at the database, thereby avoiding zero-hit/mega-hit queries and saving time and resources. Although the underlying technique supporting browsing is similar to range query aggregation and selectivity estimation, spatial dataset browsing poses some unique challenges. In this paper, we identify a set of spatial relations that need to be supported in browsing applications, namely, the contains, contained and the overlap relations. We prove a lower bound on the storage required to answer queries about the contains relation accurately at a given resolution. We then present three storage-efficient approximation algorithms which we believe to be the first to estimate query results about these spatial relations. We evaluate these algorithms with both synthetic and real world datasets and show that they provide highly accurate estimates for datasets with various characteristics.
TL;DR: This paper proposes a new in-network data aggregation protocol, called the Distributed Adaptive Filtering (DAF) protocol, which works in a distributed manner and proceeds adaptively in the sense that the filtering condition in each node is changed by using only local information.
Abstract: Continuous aggregation queries with a tolerable error threshold have many applications in sensor networks. Since the communication cost is important in the lifetime of sensor networks, there have been a few methods to reduce the communication cost for continuous aggregation queries having a tolerable error threshold. In previous methods, the error threshold in each node is periodically adjusted based on the global statistics collected in the central site that are obtained from all the nodes in the network. These methods require that users specify a few parameters, e.g., adjustment period. However, determination of these parameters by users, in practice, is very difficult and undesirable for sensor network applications demanding unattended operations in dynamically changing environments. In this paper, we propose a new in-network data aggregation protocol, called the Distributed Adaptive Filtering (DAF) protocol. It works in a distributed manner and proceeds adaptively in the sense that the filtering condition in each node is adaptively changed by using only local information. It does not require user parameters that are used in the previous method. We show through various experiments that the proposed method outperforms other existing methods.
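A minimal sketch of filter-based suppression may help with the abstract above. It assumes a common filtering model in which a node transmits a reading only when it differs from the last transmitted value by more than a filter width. The sketch uses a fixed width: the rule by which DAF adapts the filtering condition from local information is not given in the abstract and is not reproduced here. The class name, width and readings are hypothetical.

```python
class FilteredNode:
    """Sensor node that suppresses readings falling inside its filter."""

    def __init__(self, width):
        self.width = width          # tolerated deviation (assumed fixed here)
        self.last_reported = None   # last value actually transmitted

    def observe(self, reading):
        """Return the reading if it must be transmitted, otherwise None."""
        if self.last_reported is None or abs(reading - self.last_reported) > self.width:
            self.last_reported = reading
            return reading
        return None

# Hypothetical temperature readings at one node.
node = FilteredNode(width=1.0)
sent = [node.observe(r) for r in [20.0, 20.4, 21.5, 21.0]]
messages = sum(1 for s in sent if s is not None)
print(sent, messages)
```

Only two of the four readings are transmitted, and the value known upstream never deviates from the true reading by more than the filter width.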
TL;DR: A disk allocation and retrieval mechanism for arbitrary queries based on design theory is proposed; it handles nonuniform data and high dimensions, supports incremental declustering, and has good fault-tolerance properties. Experimental results show the feasibility of the algorithm.
Abstract: Declustering is a common technique used to reduce query response times. Data is declustered over multiple disks and query retrieval can be parallelized. Most of the research on declustering is targeted at spatial range queries and investigates schemes with low additive error. Recently, declustering using replication has been proposed to reduce the additive overhead. Replication significantly reduces retrieval cost of arbitrary queries. In this paper, we propose a disk allocation and retrieval mechanism for arbitrary queries based on design theory. Using the proposed c-copy replicated declustering scheme, $(c-1)k^{2}+ck$ buckets can be retrieved using at most k disk accesses. The retrieval algorithm is very efficient and is asymptotically optimal with $\Theta(|Q|)$ complexity for a query Q. In addition to the deterministic worst-case bound and efficient retrieval, the proposed algorithm handles nonuniform data and high dimensions, supports incremental declustering, and has good fault-tolerance properties. Experimental results show the feasibility of the algorithm.
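The retrieval bound stated in the abstract above, (c-1)k^2 + ck buckets in at most k disk accesses, can be evaluated for a few values of c and k. The function name is hypothetical, and the sketch only evaluates the formula. It does not implement the design-theoretic allocation.

```python
def max_buckets(c: int, k: int) -> int:
    """Buckets retrievable in at most k accesses with c copies, per the stated bound."""
    return (c - 1) * k**2 + c * k

# A few illustrative parameter choices.
table = {(c, k): max_buckets(c, k) for c in (1, 2, 3) for k in (2, 3)}
print(table)
```

With a single copy (c = 1) the expression reduces to k buckets in k accesses, while with two copies and k = 3 it gives 15 buckets, so the quadratic term is what replication contributes.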
TL;DR: A dynamic object replication algorithm, referred to as Real-time distributed dynamic Window Mechanism (RDDWM), is designed that adapts to the random patterns of read-write requests, with the objective of minimizing the total servicing cost of the system.
Abstract: A real-time distributed database system (RTDDBS) must maintain the consistency constraints of objects and must also guarantee the time constraints imposed by each request arriving at the system. Such a time constraint of a request is usually defined as a deadline period, which means that the request must be serviced on or before its time constraint. Servicing these requests may incur I/O costs, control-message transferring costs or data-message transferring costs. As a result, in our work, we first present a mathematical model that considers all these costs. Using this cost model, our objective is to service all the requests on or before their respective deadline periods and minimize the total servicing cost. To this end, from a theoretical standpoint, we design a dynamic object replication algorithm, referred to as Real-time distributed dynamic Window Mechanism (RDDWM), that adapts to the random patterns of read-write requests. Using competitive analysis, from a practical perspective, we study the performance of the RDDWM algorithm under two different extreme conditions, i.e., when the deadline period of each request is sufficiently long and when the deadline period of each request is very short. Several illustrative examples are provided for ease of understanding.
TL;DR: It is found that the “adaptive” nature of Active Hash Join yields enhanced parallelism in all cases, especially when the aggregate ASD resources are comparable to the main CPU and main memory.
Abstract: Contemporary long-term storage devices feature powerful embedded processors and sizeable memory buffers. Active Storage Devices (ASD) is the hard disk technology that makes use of these significant resources to not only manage the disk operation but also to execute custom application code on large amounts of data. While prior research has shown that ASDs perform exceedingly well with filter-type algorithms, the evaluation of binary-relational operators has been limited. In this paper, we analyze and evaluate inter-operator parallelism of GRACE-based join algorithms that function atop ASDs. We derive accurate cost expressions for existing algorithms and expose performance bottlenecks; upon these findings we propose Active Hash Join, a new algorithm that exploits all system resources. Through experimentation, we confirm that existing algorithms are best suited for systems with either small or large numbers of ASDs. However, we find that the "adaptive" nature of Active Hash Join yields enhanced parallelism in all cases, especially when the aggregate ASD resources are comparable to the main CPU and main memory.
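The GRACE-based joins mentioned in the abstract above build on a two-phase idea: partition both relations by a hash of the join key, then join each pair of matching partitions with an in-memory hash table. The sketch below shows only that textbook structure on in-memory lists. It is not Active Hash Join, and it does not model ASDs or inter-operator parallelism. The relation contents and function names are hypothetical.

```python
from collections import defaultdict

def partition(relation, key, n_parts):
    """Phase 1: split a relation into n_parts buckets by hash of the join key."""
    parts = [[] for _ in range(n_parts)]
    for row in relation:
        parts[hash(row[key]) % n_parts].append(row)
    return parts

def grace_hash_join(r, s, key, n_parts=4):
    """Phase 2: join matching partitions pairwise with a build and a probe step."""
    out = []
    for r_part, s_part in zip(partition(r, key, n_parts), partition(s, key, n_parts)):
        table = defaultdict(list)
        for row in r_part:                 # build on the partition of r
            table[row[key]].append(row)
        for row in s_part:                 # probe with the partition of s
            for match in table.get(row[key], []):
                out.append({**match, **row})
    return out

# Hypothetical relations.
r = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3, "name": "c"}]
s = [{"id": 2, "qty": 5}, {"id": 3, "qty": 7}, {"id": 3, "qty": 1}, {"id": 4, "qty": 9}]
joined = sorted(
    (row["id"], row["name"], row["qty"]) for row in grace_hash_join(r, s, "id")
)
print(joined)
```

Rows with equal keys always fall into the same partition index, so each partition pair can be joined independently. That independence is what makes the partitions natural units to distribute across several processing resources.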