
# Todd Eavis

Other affiliations: Carleton University, Dalhousie University

Bio: Todd Eavis is an academic researcher at Concordia University. The author has contributed to research on online analytical processing (OLAP) and data cubes. The author has an h-index of 13 and has co-authored 43 publications that have received 558 citations. Previous affiliations of Todd Eavis include Carleton University and Dalhousie University.

##### Papers


••

TL;DR: In this paper, two different partitioning strategies, one for top-down and one for bottom-up cube algorithms, are proposed to assign subcubes to individual processors in such a way that the loads assigned to the processors are balanced.

Abstract: This paper presents a general methodology for the efficient parallelization of existing data cube construction algorithms. We describe two different partitioning strategies, one for top-down and one for bottom-up cube algorithms. Both partitioning strategies assign subcubes to individual processors in such a way that the loads assigned to the processors are balanced. Our methods reduce inter-processor communication overhead by partitioning the load in advance instead of computing each individual group-by in parallel. Our partitioning strategies create a small number of coarse tasks. This allows for sharing of prefixes and sort orders between different group-by computations. Our methods enable code reuse by permitting the use of existing sequential (external memory) data cube algorithms for the subcube computations on each processor. This supports the transfer of optimized sequential data cube code to a parallel setting.
The bottom-up partitioning strategy balances the number of single-attribute external memory sorts made by each processor. The top-down strategy partitions a weighted tree in which weights reflect algorithm-specific cost measures like estimated group-by sizes. Both partitioning approaches can be implemented on any shared-disk parallel machine composed of p processors connected via an interconnection fabric and with access to a shared parallel disk array.
We have implemented our parallel top-down data cube construction method in C++ with the MPI message passing library for communication and the LEDA library for the required graph algorithms. We tested our code on an eight-processor cluster, using a variety of different data sets with a range of sizes, dimensions, density, and skew. Comparison tests were performed on a SunFire 6800. The tests show that our partitioning strategies generate a close to optimal load balance between processors. The actual run times observed show an optimal speedup of p.
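The load-balancing idea in this abstract, assigning a small number of coarse, weighted tasks to p processors before any computation starts, can be illustrated with a minimal sketch. This is not the paper's partitioning algorithm: the top-down strategy partitions a weighted tree, whereas the sketch below ignores the tree structure and uses a generic greedy heuristic (heaviest task first, onto the least-loaded processor). The function name `assign_tasks` and the cost figures are hypothetical.

```python
import heapq

def assign_tasks(weights, p):
    """Greedily assign weighted tasks to p processors, heaviest task first."""
    # Min-heap of (current load, processor id): the lightest processor pops first.
    heap = [(0, proc) for proc in range(p)]
    heapq.heapify(heap)
    assignment = {proc: [] for proc in range(p)}
    # Each task goes to whichever processor currently has the smallest load.
    for task, weight in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        load, proc = heapq.heappop(heap)
        assignment[proc].append(task)
        heapq.heappush(heap, (load + weight, proc))
    loads = {proc: sum(weights[t] for t in tasks) for proc, tasks in assignment.items()}
    return assignment, loads

# Hypothetical estimated costs per subcube (e.g. estimated group-by sizes).
estimated_cost = {"ABCD": 90, "ABC": 60, "ABD": 55, "ACD": 50,
                  "BCD": 45, "AB": 20, "CD": 15, "A": 5}
assignment, loads = assign_tasks(estimated_cost, p=4)
print(loads)
```

Because the assignment is fixed up front, each processor could then run an unmodified sequential algorithm on its own tasks, which is the code-reuse property the abstract emphasizes.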

65 citations

••

22 Apr 2003

TL;DR: This paper presents a parallel method for generating data cubes on a shared-nothing multiprocessor that uses a ROLAP representation of the data cube where views are stored as relational tables and allows for tight integration with current relational database technology.

Abstract: The pre-computation of data cubes is critical to improving the response time of on-line analytical processing (OLAP) systems and can be instrumental in accelerating data mining tasks in large data warehouses. In order to meet the need for improved performance created by growing data sizes, parallel solutions for generating the data cube are becoming increasingly important. The paper presents a parallel method for generating data cubes on a shared-nothing multiprocessor. Since no (expensive) shared disk is required, our method can be used on low-cost Beowulf-style clusters consisting of standard PCs with local disks connected via a data switch. Our approach uses a ROLAP representation of the data cube where views are stored as relational tables. This allows for tight integration with current relational database technology. We have implemented our parallel shared-nothing data cube generation method and evaluated it on a PC cluster, exploring relative speedup, local vs. global schedule trees, data skew, cardinality of dimensions, data dimensionality, and balance tradeoffs. For an input data set of 2,000,000 rows (72 Megabytes), our parallel data cube generation method achieves close to optimal speedup, generating a full data cube of ≈227 million rows (5.6 Gigabytes) on a 16-processor cluster in under 6 minutes. For an input data set of 10,000,000 rows (360 Megabytes), our parallel method, running on a 16-processor PC cluster, created a data cube consisting of ≈846 million rows (21.7 Gigabytes) in under 47 minutes.
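To make the term "full data cube" concrete, the sketch below computes every one of the 2^d group-by views for a tiny, made-up fact table and stores each view as a table keyed by its grouping attributes, loosely mirroring the ROLAP idea of one relational table per view. It is a naive sequential illustration only: every view is recomputed from the raw rows, with none of the sharing, scheduling, or parallelism described in the abstract. The function name `full_cube` and the sample rows are assumptions.

```python
from collections import defaultdict
from itertools import combinations

def full_cube(rows, dims, measure):
    """Return {view attributes: {key tuple: summed measure}} for all 2^d views."""
    cube = {}
    for k in range(len(dims), -1, -1):
        for view in combinations(dims, k):
            table = defaultdict(int)
            for row in rows:
                # Group by the attributes of this view and sum the measure.
                table[tuple(row[d] for d in view)] += row[measure]
            cube[view] = dict(table)
    return cube

# Hypothetical fact table with three dimensions and one measure.
rows = [
    {"store": "S1", "product": "P1", "year": 2002, "sales": 10},
    {"store": "S1", "product": "P2", "year": 2002, "sales": 5},
    {"store": "S2", "product": "P1", "year": 2003, "sales": 7},
]
cube = full_cube(rows, dims=("store", "product", "year"), measure="sales")
print(len(cube))          # 8 views for 3 dimensions
print(cube[("store",)])   # {('S1',): 15, ('S2',): 7}
print(cube[()])           # {(): 22}
```

The number of views doubles with each added dimension, which is one reason the output sizes quoted in the abstract are so much larger than the inputs.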

55 citations

••

TL;DR: This paper discusses the cgmCUBE Project, a multi-year effort to design and implement a multi-processor platform for data cube generation that targets the relational database model (ROLAP), and discusses new algorithmic and system optimizations relating to a thorough optimization of the underlying sequential cube construction method.

Abstract: On-line Analytical Processing (OLAP) has become one of the most powerful and prominent technologies for knowledge discovery in VLDB (Very Large Database) environments. Central to the OLAP paradigm is the data cube, a multi-dimensional hierarchy of aggregate values that provides a rich analytical model for decision support. Various sequential algorithms for the efficient generation of the data cube have appeared in the literature. However, given the size of contemporary data warehousing repositories, multi-processor solutions are crucial for the massive computational demands of current and future OLAP systems.
In this paper we discuss the cgmCUBE Project, a multi-year effort to design and implement a multi-processor platform for data cube generation that targets the relational database model (ROLAP). More specifically, we discuss new algorithmic and system optimizations relating to (1) a thorough optimization of the underlying sequential cube construction method and (2) a detailed and carefully engineered cost model for improved parallel load balancing and faster sequential cube construction. These optimizations were key in allowing us to build a prototype that is able to produce data cube output at a rate of over one TeraByte per hour.

48 citations

••

04 Jan 2001

TL;DR: Two different partitioning strategies are described, one for top-down and one for bottom-up cube algorithms, which enable code reuse by permitting the use of existing sequential data cube algorithms for the subcube computations on each processor.

Abstract: This paper presents a general methodology for the efficient parallelization of existing data cube construction algorithms. We describe two different partitioning strategies, one for top-down and one for bottom-up cube algorithms. Both partitioning strategies assign subcubes to individual processors in such a way that the loads assigned to the processors are balanced. Our methods reduce inter-processor communication overhead by partitioning the load in advance instead of computing each individual group-by in parallel as is done in previous parallel approaches. In fact, after the initial load distribution phase, each processor can compute its assigned subcube without any communication with the other processors. Our methods enable code reuse by permitting the use of existing sequential (external memory) data cube algorithms for the subcube computations on each processor. This supports the transfer of optimized sequential data cube code to a parallel setting.
The bottom-up partitioning strategy balances the number of single-attribute external memory sorts made by each processor. The top-down strategy partitions a weighted tree in which weights reflect algorithm-specific cost measures like estimated group-by sizes. Both partitioning approaches can be implemented on any shared-disk parallel machine composed of p processors connected via an interconnection fabric and with access to a shared parallel disk array. Experimental results presented show that our partitioning strategies generate a close to optimal load balance between processors.

35 citations

••

12 May 2003

TL;DR: This paper addresses the query performance issue for Relational OLAP (ROLAP) datacubes by presenting a distributed multi-dimensional ROLAP indexing scheme which is practical to implement, requires only a small communication volume, and is fully adapted to distributed disks.

Abstract: This paper addresses the query performance issue for Relational OLAP (ROLAP) datacubes. We present a distributed multi-dimensional ROLAP indexing scheme which is practical to implement, requires only a small communication volume, and is fully adapted to distributed disks. Our solution is efficient for spatial searches in high dimensions and scalable in terms of data sizes, dimensions, and number of processors. Our method is also incrementally maintainable. Using "surrogate" group-bys, it allows for the efficient processing of arbitrary OLAP queries on partial cubes, where not all of the group-bys have been materialized. Our experiments show that the ROLAP advantage of better scalability, in comparison to MOLAP can be maintained while providing, at the same time, a fast and flexible index for OLAP queries.
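One plausible reading of answering queries on a partial cube is that a group-by that was never materialized can be derived from a materialized view containing all of its attributes. The sketch below illustrates only that general idea; it is not the paper's indexing scheme, and whether it matches the authors' exact notion of a "surrogate" group-by is an assumption. The function name `answer_from_surrogate` and the sample views are hypothetical, and the re-aggregation shown is valid for sums only.

```python
from collections import defaultdict

def answer_from_surrogate(materialized, wanted):
    """Aggregate the smallest materialized view that covers all wanted attributes."""
    candidates = [v for v in materialized if set(wanted) <= set(v)]
    if not candidates:
        raise KeyError(f"no materialized view covers {wanted}")
    # Prefer the view with the fewest rows, since it is cheapest to scan.
    surrogate = min(candidates, key=lambda v: len(materialized[v]))
    positions = [surrogate.index(a) for a in wanted]
    result = defaultdict(int)
    for key, total in materialized[surrogate].items():
        result[tuple(key[i] for i in positions)] += total
    return surrogate, dict(result)

# Hypothetical partial cube: the ("product",) view was never materialized.
materialized = {
    ("store", "product"): {("S1", "P1"): 10, ("S1", "P2"): 5, ("S2", "P1"): 7},
    ("store", "product", "year"): {("S1", "P1", 2002): 10, ("S1", "P2", 2002): 5,
                                   ("S2", "P1", 2003): 7},
}
surrogate, result = answer_from_surrogate(materialized, ("product",))
print(surrogate, result)
```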

32 citations

##### Cited by


••

TL;DR: A set of examples or training set (TS) is said to be imbalanced if one of the classes is represented by a very small number of cases compared to the other classes.

Abstract: A set of examples or training set (TS) is said to be imbalanced if one of the classes is represented by a very small number of cases compared to the other classes. Following the common practice [1,2], we consider only two-class problems and therefore, the examples are either positive or negative (that is, either from the minority class or the majority class, respectively). High imbalance occurs in applications where the classifier is to detect a rare but important case, such as fraudulent telephone calls, oil spills in satellite images, failures in a manufacturing process, or rare medical diagnoses. It has been observed that class imbalance may cause a significant deterioration in the performance attainable by standard supervised methods. Most of the attempts at dealing with this problem can be grouped into three categories [2]. One is to assign distinct costs to the classification errors. The second is to resample the original TS, either by over-sampling the minority class and/or under-sampling the majority class until the classes are approximately equally represented. The third consists in internally biasing the discrimination-based process so as to compensate for the class imbalance. As pointed out by many authors, the performance of a classifier in applications with class imbalance must not be expressed in terms of the average accuracy. For instance, consider a domain where only 2% of the examples are positive. In such a situation, labeling all new samples as negative would give an accuracy of 98% while failing on all positive cases. Consequently, in environments with imbalanced

458 citations
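The 2% example in the abstract above can be checked with a few lines of arithmetic. The sketch assumes a hypothetical set of 1,000 samples and a trivial classifier that always predicts the negative class; recall on the positive class is used here as one illustrative alternative to accuracy, not necessarily the measure the paper adopts.

```python
# 1,000 hypothetical samples, 2% of them positive (label 1).
labels = [1] * 20 + [0] * 980
predictions = [0] * len(labels)  # trivial classifier: always predict negative

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = true_positives / sum(labels)

print(accuracy, recall)  # 0.98 0.0
```

The accuracy looks excellent even though not a single positive case is detected, which is why average accuracy is a misleading summary under heavy imbalance.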

••

01 Jun 2012

TL;DR: A tunable data bucketization algorithm is provided that allows the data owner to control the trade-off between disclosure risk and cost and is derived from cost and disclosure-risk metrics that estimate client’s computational overhead and disclosure risk respectively.

Abstract: In this paper, we study the problem of supporting multidimensional range queries on encrypted data. The problem is motivated by secure data outsourcing applications where a client may store his/her data on a remote server in encrypted form and want to execute queries using server's computational capabilities. The solution approach is to compute a secure indexing tag of the data by applying bucketization (a generic form of data partitioning) which prevents the server from learning exact values but still allows it to check if a record satisfies the query predicate. Queries are evaluated in an approximate manner where the returned set of records may contain some false positives. These records then need to be weeded out by the client which comprises the computational overhead of our scheme. We develop a bucketization procedure for answering multidimensional range queries on multidimensional data. For a given bucketization scheme, we derive cost and disclosure-risk metrics that estimate client's computational overhead and disclosure risk respectively. Given a multidimensional dataset, its bucketization is posed as an optimization problem where the goal is to minimize the risk of disclosure while keeping query cost (client's computational overhead) below a certain user-specified threshold value. We provide a tunable data bucketization algorithm that allows the data owner to control the trade-off between disclosure risk and cost. We also study the trade-off characteristics through an extensive set of experiments on real and synthetic data.
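The query flow described in this abstract (the server filters on coarse bucket tags, the client discards false positives) can be sketched as follows. This is a simplified illustration under stated assumptions: buckets are plain equal-width grid cells over non-negative integers rather than the output of the paper's optimization-based algorithm, encryption and decryption are omitted entirely, and the names `tag`, `server_candidates`, and `client_filter` as well as the sample records are hypothetical.

```python
def tag(point, widths):
    """Bucket tag: one equal-width bucket id per dimension."""
    return tuple(v // w for v, w in zip(point, widths))

def server_candidates(index, lo, hi, widths):
    """Server side: sees only tags, returns ids whose bucket overlaps the query box."""
    lo_tag, hi_tag = tag(lo, widths), tag(hi, widths)
    return sorted(rid for rid, t in index.items()
                  if all(a <= x <= b for a, x, b in zip(lo_tag, t, hi_tag)))

def client_filter(candidates, records, lo, hi):
    """Client side: after decryption (omitted here), drop the false positives."""
    return [rid for rid in candidates
            if all(a <= x <= b for a, x, b in zip(lo, records[rid], hi))]

# Hypothetical two-dimensional records, e.g. (age, salary in thousands).
records = {1: (12, 40), 2: (18, 75), 3: (33, 42), 4: (15, 49)}
widths = (10, 25)  # wider buckets reveal less but return more false positives
index = {rid: tag(point, widths) for rid, point in records.items()}

lo, hi = (10, 30), (16, 45)
candidates = server_candidates(index, lo, hi, widths)
matches = client_filter(candidates, records, lo, hi)
print(candidates, matches)  # [1, 4] [1]
```

Record 4 shares a bucket with the query range but falls outside it, so it is returned by the server and then removed by the client. That extra client-side work is the cost side of the trade-off that the bucket widths control.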

236 citations

••

TL;DR: The study presented concentrates on managing the imbalanced training sample problem, scaling up some preprocessing algorithms, and filtering the training set. Experimental results show the potential of multiple classifier systems when applied to those situations.

Abstract: Combination (ensembles) of classifiers is now a well established research line. It has been observed that the predictive accuracy of a combination of independent classifiers excels that of the single best classifier. While ensembles of classifiers have been mostly employed to achieve higher recognition accuracy, this paper focuses on the use of combinations of individual classifiers for handling several problems from the practice in the machine learning, pattern recognition and data mining domains. In particular, the study presented concentrates on managing the imbalanced training sample problem, scaling up some preprocessing algorithms and filtering the training set. Here, all these situations are examined mainly in connection with the nearest neighbour classifier. Experimental results show the potential of multiple classifier systems when applied to those situations.

233 citations

••

TL;DR: This paper presents a study concerning the relative merits of several re-sizing techniques for handling the imbalance issue and the convenience of combining some of these techniques.

Abstract: The problem of imbalanced training sets in supervised pattern recognition methods is receiving growing attention. Imbalanced training sample means that one class is represented by a large number of examples while the other is represented by only a few. It has been observed that this situation, which arises in several practical domains, may produce an important deterioration of the classification accuracy, in particular with patterns belonging to the less represented classes. In this paper we present a study concerning the relative merits of several re-sizing techniques for handling the imbalance issue. We assess also the convenience of combining some of these techniques.
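The abstract does not say which re-sizing techniques were compared, so the following is only a generic illustration of one simple way to resize an imbalanced training set: random under-sampling of the majority class until both classes are the same size. The function name `random_undersample` and the toy data are hypothetical.

```python
import random

def random_undersample(samples, labels, seed=0):
    """Randomly drop majority-class samples until both classes have equal size."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    # Keep every minority sample plus an equally sized random subset of the majority.
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [samples[i] for i in kept], [labels[i] for i in kept]

# Toy training set: 10 positive and 90 negative examples.
samples = list(range(100))
labels = [1] * 10 + [0] * 90
new_samples, new_labels = random_undersample(samples, labels)
print(len(new_labels), sum(new_labels))  # 20 10
```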

219 citations