Managing statistical behavior of large data sets in shared-nothing architectures

doi:10.1109/71.735955

Journal Article

Isidore Rigoutsos, +1 more

- 01 Nov 1998 -

IEEE Transactions on Parallel and Distri...

- Vol. 9, Iss: 11, pp 1073-1087

TLDR

The paper proposes a two-stage methodology that uses knowledge of the hashing function to reorganize the group assignments so that the resulting groups have similar expected cardinalities. The method is generally applicable and independent of the hashing function used.

Abstract:

Increasingly large data sets are being stored in networked architectures. Many of the available data structures are not easily amenable to parallel realizations. Hashing schemes show promise in that respect for the simple reason that the underlying data structure can be decomposed and spread among the set of cooperating nodes with minimal communication and maintenance requirements. In all cases, storage utilization and load balancing are issues that need to be addressed. One can identify two basic approaches to tackle the problem. One way is to address it as part of the design of the data structure that is used to store and retrieve the data. The other is to maintain the data structure intact but address the problem separately. The method that we present here falls in the latter category and is applicable whenever a hash table is the preferred data structure. Intrinsically attached to the hash table in use is a hashing function that allows one to partition a possibly unbounded set of data items into a finite set of groups; the hashing function provides the partitioning by assigning each data item to one of the groups. In general, the hashing function cannot guarantee that the various groups will have the same cardinality on average, for all possible data item distributions. In this paper, we propose a two-stage methodology that uses the knowledge of the hashing function to reorganize the group assignments so that the resulting groups have similar expected cardinalities. The method is generally applicable and independent of the hashing function used. We show the power of the methodology using both synthetic and real-world databases. The derived quasi-uniform storage occupancy and associated load-balancing gains are significant.
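The abstract does not spell out the two stages, so the following is only an illustrative sketch of the general idea of rebalancing hash groups, not the authors' algorithm. It assumes a toy modulo hashing function, estimates each bucket's expected occupancy from a data sample, and then greedily assigns buckets to nodes so that the per-node totals come out similar. All names here (`bucket_of`, `estimate_bucket_loads`, `assign_buckets_to_nodes`, `node_totals`) are hypothetical.

```python
import heapq
from collections import Counter


def bucket_of(item: int, num_buckets: int) -> int:
    """Toy hashing function (an assumption): plain modulo on integer keys."""
    return item % num_buckets


def estimate_bucket_loads(sample, num_buckets):
    """First step of the sketch: estimate expected bucket occupancy from a sample."""
    counts = Counter(bucket_of(x, num_buckets) for x in sample)
    return [counts.get(b, 0) for b in range(num_buckets)]


def assign_buckets_to_nodes(loads, num_nodes):
    """Second step of the sketch: greedy assignment, heaviest bucket first.

    Each bucket goes to the node with the smallest running total. This is a
    standard load-balancing heuristic swapped in for illustration; it is not
    taken from the paper.
    """
    heap = [(0, node) for node in range(num_nodes)]  # (running total, node id)
    heapq.heapify(heap)
    assignment = {}
    for bucket in sorted(range(len(loads)), key=lambda b: -loads[b]):
        total, node = heapq.heappop(heap)
        assignment[bucket] = node
        heapq.heappush(heap, (total + loads[bucket], node))
    return assignment


def node_totals(loads, assignment, num_nodes):
    """Sum the estimated bucket loads that land on each node."""
    totals = [0] * num_nodes
    for bucket, node in assignment.items():
        totals[node] += loads[bucket]
    return totals


if __name__ == "__main__":
    # Skewed toy sample: small keys are far more frequent than large ones.
    sample = [k for k in range(8) for _ in range(2 ** (8 - k))]
    loads = estimate_bucket_loads(sample, num_buckets=8)
    balanced = assign_buckets_to_nodes(loads, num_nodes=2)
    contiguous = {b: b // 4 for b in range(8)}  # naive: buckets 0-3 on node 0
    print("contiguous:", node_totals(loads, contiguous, 2))
    print("balanced:  ", node_totals(loads, balanced, 2))
```

The design choice in this sketch is to leave the hashing function untouched and change only which node owns which bucket, which mirrors the abstract's "maintain the data structure intact" approach. How the paper itself derives and uses expected cardinalities is not stated in the abstract.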

Citations

Radio-wave propagation prediction using ray-tracing techniques on a network of workstations (NOW)

Method, system and program products for managing processing groups of a distributed computing environment
