Journal Article

Managing statistical behavior of large data sets in shared-nothing architectures

Isidore Rigoutsos, +1 more
01 Nov 1998, Vol. 9, Iss. 11, pp. 1073-1087
TLDR
The paper proposes a two-stage methodology that uses knowledge of the hashing function to reorganize the group assignments so that the resulting groups have similar expected cardinalities. The method is generally applicable and independent of the hashing function used.
Abstract
Increasingly large data sets are being stored in networked architectures. Many of the available data structures are not easily amenable to parallel realizations. Hashing schemes show promise in this respect because the underlying data structure can be decomposed and spread among a set of cooperating nodes with minimal communication and maintenance requirements. In all cases, storage utilization and load balancing are issues that need to be addressed. There are two basic approaches to the problem. One is to address it as part of the design of the data structure used to store and retrieve the data. The other is to leave the data structure intact and address the problem separately. The method we present here falls into the latter category and is applicable whenever a hash table is the preferred data structure. Every hash table comes with a hashing function that partitions a possibly unbounded set of data items into a finite set of groups by assigning each data item to one of the groups. In general, the hashing function cannot guarantee that the groups will have the same cardinality on average for all possible data item distributions. In this paper, we propose a two-stage methodology that uses knowledge of the hashing function to reorganize the group assignments so that the resulting groups have similar expected cardinalities. The method is generally applicable and independent of the hashing function used. We show the power of the methodology using both synthetic and real-world databases. The derived quasi-uniform storage occupancy and the associated load-balancing gains are significant.
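
The abstract does not spell out the two stages, so the following Python sketch is only an illustration of the general idea and should not be read as the authors' algorithm. It assumes a toy hashing function (`bucket_of`), estimates per-group occupancy from a sample of keys, and then reassigns groups to nodes with a standard greedy heuristic (heaviest group onto the currently lightest node) so that the nodes end up with similar expected loads. All function names and the choice of heuristic are assumptions made for this example.

```python
import heapq
from collections import Counter


def bucket_of(key: str, num_buckets: int) -> int:
    """Toy hashing function (an assumption): sum of UTF-8 byte values mod num_buckets."""
    return sum(key.encode("utf-8")) % num_buckets


def estimate_bucket_counts(sample, num_buckets):
    """First step of the sketch: estimate each group's occupancy from sampled keys."""
    counts = Counter(bucket_of(key, num_buckets) for key in sample)
    return [counts.get(b, 0) for b in range(num_buckets)]


def assign_buckets_to_nodes(bucket_counts, num_nodes):
    """Second step of the sketch: greedy rebalancing (not the paper's method).

    Groups are visited from heaviest to lightest, and each one is placed
    on the node whose accumulated load is currently the smallest.
    """
    heap = [(0, node) for node in range(num_nodes)]  # (load, node id)
    heapq.heapify(heap)
    assignment = {}
    order = sorted(range(len(bucket_counts)), key=lambda b: -bucket_counts[b])
    for bucket in order:
        load, node = heapq.heappop(heap)
        assignment[bucket] = node
        heapq.heappush(heap, (load + bucket_counts[bucket], node))
    return assignment


def node_loads(bucket_counts, assignment, num_nodes):
    """Total estimated number of items stored on each node."""
    loads = [0] * num_nodes
    for bucket, node in assignment.items():
        loads[node] += bucket_counts[bucket]
    return loads


if __name__ == "__main__":
    # Hypothetical skewed occupancy estimates for 8 groups.
    counts = [50, 5, 5, 30, 10, 20, 40, 0]
    naive = {b: b % 2 for b in range(len(counts))}
    balanced = assign_buckets_to_nodes(counts, num_nodes=2)
    print("naive loads:   ", node_loads(counts, naive, 2))
    print("balanced loads:", node_loads(counts, balanced, 2))
```

With the hypothetical counts above, assigning group `b` to node `b % 2` leaves one node with almost twice the load of the other, while the greedy reassignment evens the two nodes out. The hashing function itself is left untouched; only the mapping from groups to nodes changes, which is the sense in which such a scheme stays independent of the hashing function.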

Citations
Journal Article

Radio-wave propagation prediction using ray-tracing techniques on a network of workstations (NOW)

TL;DR: This paper proposes a computational framework that functions on a network of workstations (NOW) and helps speed up the lengthy prediction process and addresses issues regarding main memory consumption, intermediate data assembly, and final prediction generation.
Patent

Method, system and program products for managing processing groups of a distributed computing environment

TL;DR: In this paper, a distributed synchronous transaction system protocol is provided to manage the replication of distributed transactions for client application instances, without having the application instances be aware of other instances to receive the transaction.