
Showing papers on "Parallel algorithm published in 1997"


Journal ArticleDOI
TL;DR: A parallel preconditioner is presented for the solution of general sparse linear systems of equations using a sparse approximate inverse computed explicitly and then applied as a preconditioner to an iterative method.
Abstract: A parallel preconditioner is presented for the solution of general sparse linear systems of equations. A sparse approximate inverse is computed explicitly and then applied as a preconditioner to an iterative method. The computation of the preconditioner is inherently parallel, and its application only requires a matrix-vector product. The sparsity pattern of the approximate inverse is not imposed a priori but captured automatically. This keeps the amount of work and the number of nonzero entries in the preconditioner to a minimum. Rigorous bounds on the clustering of the eigenvalues and the singular values are derived for the preconditioned system, and the proximity of the approximate to the true inverse is estimated. An extensive set of test problems from scientific and industrial applications provides convincing evidence of the effectiveness of this approach.
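As a concrete illustration of why this construction parallelizes, here is a minimal Python sketch of the per-column least-squares idea. It is not the paper's algorithm: the paper captures the sparsity pattern of the approximate inverse automatically, whereas this sketch assumes a fixed pattern (that of A^T) and a nonzero diagonal, and the helper name spai_fixed_pattern is hypothetical.

```python
# Hedged sketch: column-wise least-squares construction of a sparse
# approximate inverse M over a *fixed* pattern (assumption; the paper
# grows the pattern adaptively). Each column is an independent small
# least-squares problem, which is what makes the computation parallel.
import numpy as np
import scipy.sparse as sp

def spai_fixed_pattern(A):
    """Minimize ||A m_j - e_j||_2 per column j over a fixed pattern."""
    A = sp.csc_matrix(A)
    n = A.shape[0]
    cols = []
    for j in range(n):
        J = A.getrow(j).indices            # allowed nonzero rows of m_j
        I = np.unique(A[:, J].indices)     # rows of A touched by columns J
        Ahat = A[I, :][:, J].toarray()     # small dense subproblem
        e = np.zeros(len(I))
        e[np.where(I == j)[0]] = 1.0       # e_j restricted to rows I
        m, *_ = np.linalg.lstsq(Ahat, e, rcond=None)
        col = np.zeros(n)
        col[J] = m
        cols.append(col)
    return sp.csc_matrix(np.column_stack(cols))
```

Applying the resulting M inside an iterative method then costs only one extra matrix-vector product per iteration, as the abstract notes.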

635 citations


Journal ArticleDOI
17 Oct 1997-Science
TL;DR: The maximal clique problem has been solved by means of molecular biology techniques; the algorithm is highly parallel and has satisfactory fidelity, representing further evidence for the ability of DNA computing to solve NP-complete search problems.
Abstract: The maximal clique problem has been solved by means of molecular biology techniques. A pool of DNA molecules corresponding to the total ensemble of six-vertex cliques was built, followed by a series of selection processes. The algorithm is highly parallel and has satisfactory fidelity. This work represents further evidence for the ability of DNA computing to solve NP-complete search problems.

610 citations


Proceedings ArticleDOI
01 Jun 1997
TL;DR: The experimental results on a Cray T3D parallel computer show that the Hybrid Distribution algorithm scales linearly, exploits the aggregate memory better, and can generate more association rules with a single scan of the database per pass.
Abstract: One of the important problems in data mining is discovering association rules from databases of transactions where each transaction consists of a set of items. The most time-consuming operation in this discovery process is the computation of the frequency of occurrence of interesting subsets of items (called candidates) in the database of transactions. To prune the exponentially large space of candidates, most existing algorithms consider only those candidates that have a user-defined minimum support. Even with the pruning, the task of finding all association rules requires a lot of computation power and time. Parallel computers offer a potential solution to the computation requirement of this task, provided efficient and scalable parallel algorithms can be designed. In this paper, we present two new parallel algorithms for mining association rules. The Intelligent Data Distribution algorithm efficiently uses the aggregate memory of the parallel computer by employing an intelligent candidate partitioning scheme and uses an efficient communication mechanism to move data among the processors. The Hybrid Distribution algorithm further improves upon the Intelligent Data Distribution algorithm by dynamically partitioning the candidate set to maintain good load balance. The experimental results on a Cray T3D parallel computer show that the Hybrid Distribution algorithm scales linearly, exploits the aggregate memory better, and can generate more association rules with a single scan of the database per pass.
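The candidate-partitioning idea can be made concrete with a small, loudly simplified sketch: each worker owns a disjoint slice of the candidates and counts supports for its slice, so the candidate set occupies the aggregate memory only once. The round-robin split and all names here are illustrative, not the paper's exact IDD scheme.

```python
# Hedged sketch of candidate partitioning for parallel support counting.
from itertools import combinations
from multiprocessing import Pool

transactions = [frozenset(t) for t in
                [{1, 2, 3}, {1, 3}, {2, 3, 4}, {1, 2, 3, 4}]]
candidates = [frozenset(c) for c in combinations([1, 2, 3, 4], 2)]

def count_slice(cand_slice):
    # each worker counts support only for its own share of the candidates
    return {c: sum(c <= t for t in transactions) for c in cand_slice}

if __name__ == "__main__":
    P = 2
    slices = [candidates[i::P] for i in range(P)]   # round-robin partition
    with Pool(P) as pool:
        counts = {}
        for part in pool.map(count_slice, slices):
            counts.update(part)
    frequent = {c for c, n in counts.items() if n >= 2}   # min support = 2
```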

410 citations


Journal ArticleDOI
TL;DR: This paper describes new parallel association mining algorithms that use novel itemset clustering techniques to approximate the set of potentially maximal frequent itemsets, presents results on the performance of the algorithms on various databases, and compares them against a well-known parallel algorithm.
Abstract: Discovery of association rules is an important data mining task. Several parallel and sequential algorithms have been proposed in the literature to solve this problem. Almost all of these algorithms make repeated passes over the database to determine the set of frequent itemsets (a subset of database items), thus incurring high I/O overhead. In the parallel case, most algorithms perform a sum-reduction at the end of each pass to construct the global counts, also incurring high synchronization cost. In this paper we describe new parallel association mining algorithms. The algorithms use novel itemset clustering techniques to approximate the set of potentially maximal frequent itemsets. Once this set has been identified, the algorithms make use of efficient traversal techniques to generate the frequent itemsets contained in each cluster. We propose two clustering schemes based on equivalence classes and maximal hypergraph cliques, and study two lattice traversal techniques based on bottom-up and hybrid search. We use a vertical database layout to cluster related transactions together. The database is also selectively replicated so that the portion of the database needed for the computation of associations is local to each processor. After the initial set-up phase, the algorithms do not need any further communication or synchronization. The algorithms minimize I/O overheads by scanning the local database portion only twice: once in the set-up phase, and once when processing the itemset clusters. Unlike previous parallel approaches, the algorithms use simple intersection operations to compute frequent itemsets and do not have to maintain or search complex hash structures. Our experimental testbed is a 32-processor DEC Alpha cluster interconnected by the Memory Channel network. We present results on the performance of our algorithms on various databases, and compare them against a well-known parallel algorithm. The best new algorithm outperforms it by an order of magnitude.
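The "simple intersection operations" on a vertical layout are easy to picture; below is a minimal sketch (toy data, illustrative names) of tidlist intersection, the core kernel these algorithms use instead of hash structures.

```python
# Hedged sketch: vertical database layout and support via intersection.
transactions = {0: {1, 2, 3}, 1: {1, 3}, 2: {2, 3, 4}, 3: {1, 2, 3, 4}}

# invert to the vertical layout: item -> set of transaction ids (tidlist)
tidlists = {}
for tid, items in transactions.items():
    for item in items:
        tidlists.setdefault(item, set()).add(tid)

def support(itemset):
    """Support of an itemset is the size of the intersected tidlists."""
    return len(set.intersection(*(tidlists[i] for i in itemset)))

assert support({1, 3}) == 3   # contained in transactions 0, 1, and 3
```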

341 citations


Journal ArticleDOI
TL;DR: Time-efficient algorithms to solve the maze-routing problem on a reconfigurable mesh architecture and a fast algorithm to find the single shortest path (SSP) are presented.

275 citations


Journal ArticleDOI
TL;DR: The first algorithms to factor a wide class of sparse matrices that are asymptotically as scalable as dense matrix factorization algorithms on a variety of parallel architectures are presented.
Abstract: In this paper, we describe scalable parallel algorithms for symmetric sparse matrix factorization, analyze their performance and scalability, and present experimental results for up to 1,024 processors on a Cray T3D parallel computer. Through our analysis and experimental results, we demonstrate that our algorithms substantially improve the state of the art in parallel direct solution of sparse linear systems, both in terms of scalability and overall performance. It is a well-known fact that dense matrix factorization scales well and can be implemented efficiently on parallel computers. In this paper, we present the first algorithms to factor a wide class of sparse matrices (including those arising from two- and three-dimensional finite element problems) that are asymptotically as scalable as dense matrix factorization algorithms on a variety of parallel architectures. Our algorithms incur less communication overhead and are more scalable than any previously known parallel formulation of sparse matrix factorization. Although in this paper we discuss Cholesky factorization of symmetric positive definite matrices, the algorithms can be adapted for solving sparse linear least squares problems and for Gaussian elimination of diagonally dominant matrices that are almost symmetric in structure. An implementation of one of our sparse Cholesky factorization algorithms delivers up to 20 GFlops on a Cray T3D for medium-size structural engineering and linear programming problems. To the best of our knowledge, this is the highest performance ever obtained for sparse Cholesky factorization on any supercomputer.
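To make the kernel being parallelized concrete, here is a minimal dense left-looking column-Cholesky in numpy. It shows only the dependency structure (column j waits on all columns k < j) that a parallel schedule must respect; the paper's contribution, the scalable sparse formulation, is not attempted here.

```python
# Hedged sketch: dense left-looking Cholesky (not the paper's sparse algorithm).
import numpy as np

def cholesky_columns(A):
    """Column j gathers updates from all previous columns, then is scaled."""
    n = A.shape[0]
    L = np.zeros_like(A, dtype=float)
    for j in range(n):
        s = A[j:, j] - L[j:, :j] @ L[j, :j]   # updates from columns k < j
        L[j, j] = np.sqrt(s[0])
        L[j + 1:, j] = s[1:] / L[j, j]
    return L

A = np.array([[4., 2.], [2., 3.]])
assert np.allclose(cholesky_columns(A) @ cholesky_columns(A).T, A)
```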

239 citations


Journal ArticleDOI
TL;DR: The minimum-degree greedy algorithm is shown to achieve a performance ratio of (Δ+2)/3 for approximating independent sets in graphs with degree bounded by Δ, and the analysis yields a precise characterization of the size of the independent sets found by the algorithm as a function of the independence number.
Abstract: The minimum-degree greedy algorithm, or Greedy for short, is a simple and well-studied method for finding independent sets in graphs. We show that it achieves a performance ratio of (Δ+2)/3 for approximating independent sets in graphs with degree bounded by Δ. The analysis yields a precise characterization of the size of the independent sets found by the algorithm as a function of the independence number, as well as a generalization of Turán's bound. We also analyze the algorithm when run in combination with a known preprocessing technique, and obtain a $$(2\bar d + 3)/5$$ performance ratio on graphs with average degree $$\bar d$$, improving on the previous best $$(\bar d + 1)/2$$ of Hochbaum. Finally, we present an efficient parallel and distributed algorithm attaining the performance guarantees of Greedy.
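The sequential algorithm being analyzed is short enough to state in full; the sketch below is a straightforward rendering of Greedy (the paper's parallel and distributed versions attain the same guarantees).

```python
# Minimum-degree greedy independent set (Greedy), sequential version.
def greedy_mis(adj):
    """adj: dict vertex -> set of neighbors (undirected graph)."""
    adj = {v: set(ns) for v, ns in adj.items()}
    independent = set()
    while adj:
        v = min(adj, key=lambda u: len(adj[u]))   # minimum-degree vertex
        independent.add(v)
        removed = adj[v] | {v}                    # delete v and its neighbors
        for u in removed:
            adj.pop(u, None)
        for ns in adj.values():
            ns -= removed
    return independent

# triangle with a pendant vertex: Greedy picks the degree-1 vertex first
assert 3 in greedy_mis({0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}})
```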

234 citations


Book
26 Feb 1997
TL;DR: This book is written as a textbook for undergraduate and graduate students and provides a careful explanation of the subject as well as motivation for further research.
Abstract: This book is devoted to the investigation of a special topic in theoretical computer science - communication complexity as an abstract measure of the complexity of computing problems. Its main aim is to show how the theoretical study of communication complexity can be useful in the process of designing effective parallel algorithms. The author shows how to get important information about the parallel complexity (parallel time, the number of processors, the descriptional complexity of the topology of the parallel architecture) of specific computing problems from knowledge of their communication complexity. The book is written as a textbook for undergraduate and graduate students and provides a careful explanation of the subject as well as motivation for further research.

220 citations


Journal ArticleDOI
TL;DR: An approach to selecting the shape points and the outer layer used for erosion during each iteration of parallel thinning is introduced; the approach produces good skeletons for different types of corners.

201 citations


Journal ArticleDOI
TL;DR: The PSA algorithm proposed in the paper has shown significant improvements in solution quality for the largest of the test networks, and the conditions under which the parallel algorithm is most efficient are investigated.
Abstract: The simulated annealing optimization technique has been successfully applied to a number of electrical engineering problems, including transmission system expansion planning. The method is general in the sense that it does not assume any particular property of the problem being solved, such as linearity or convexity. Moreover, it has the ability to provide solutions arbitrarily close to an optimum (i.e. it is asymptotically convergent) as the cooling process slows down. The drawback of the approach is the computational burden: finding optimal solutions may be extremely expensive in some cases. This paper presents a parallel simulated annealing (PSA) algorithm for solving the long-term transmission network expansion planning problem. A strategy that does not affect the basic convergence properties of the sequential simulated annealing algorithm has been implemented and tested. The paper investigates the conditions under which the parallel algorithm is most efficient. The parallel implementations have been tested on three example networks: a small 6-bus network and two complex real-life networks. Excellent results are reported in the test section of the paper: in addition to reductions in computing times, the PSA algorithm proposed in the paper has shown significant improvements in solution quality for the largest of the test networks.
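One common PSA strategy, running several annealing chains in parallel and keeping the best result, is sketched below on a toy one-dimensional cost function. This is an illustration only: the paper's strategy is specifically chosen so as not to affect the sequential algorithm's convergence properties, and the cost function here merely stands in for an expansion-plan evaluation.

```python
# Hedged sketch: independent parallel simulated-annealing chains.
import math
import random
from multiprocessing import Pool

def anneal(seed, steps=10_000, t0=1.0, alpha=0.999):
    rng = random.Random(seed)
    cost = lambda x: (x - 3.0) ** 2      # toy stand-in for plan cost
    x = rng.uniform(-10, 10)
    t = t0
    for _ in range(steps):
        y = x + rng.gauss(0, 0.5)        # neighbor move
        if cost(y) < cost(x) or rng.random() < math.exp((cost(x) - cost(y)) / t):
            x = y                        # accept (Metropolis criterion)
        t *= alpha                       # geometric cooling schedule
    return cost(x), x

if __name__ == "__main__":
    with Pool(4) as pool:
        best_cost, best_x = min(pool.map(anneal, range(4)))
```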

164 citations


Journal ArticleDOI
TL;DR: This paper proposes a novel GA-based algorithm designed to simultaneously meet the goals of high performance, scalability, and fast running time; it outperforms both heuristics while taking considerably less running time.

Journal ArticleDOI
TL;DR: An efficient method for computing the discrete cosine transform (DCT) is proposed, which is a generalization of the radix 2 DCT algorithm, and the recursive properties of the DCT for an even length input sequence are derived.
Abstract: An efficient method for computing the discrete cosine transform (DCT) is proposed. Based on direct decomposition of the DCT, the recursive properties of the DCT for an even-length input sequence are derived, generalizing the radix-2 DCT algorithm. From this recursive property, a new DCT algorithm for an even-length sequence is obtained. The proposed algorithm is highly structured and requires fewer computations than others. Its regular structure is suitable for fast parallel algorithms and VLSI implementation.
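The even-indexed half of such a decomposition is a standard identity and shows where the recursion comes from (the odd-indexed outputs satisfy a slightly more involved recurrence, which is the substance of the paper's derivation). With the unnormalized DCT-II

$$X_k = \sum_{n=0}^{N-1} x_n \cos\frac{\pi(2n+1)k}{2N}, \qquad k = 0, \ldots, N-1,$$

substituting k = 2r and folding n with N-1-n gives

$$X_{2r} = \sum_{n=0}^{N/2-1} \left(x_n + x_{N-1-n}\right) \cos\frac{\pi(2n+1)r}{2(N/2)},$$

i.e. the even-indexed outputs of an N-point DCT form an N/2-point DCT of the folded sequence, and recursing on the halves yields a radix-2-style algorithm.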

Proceedings ArticleDOI
19 Oct 1997
TL;DR: Four-round protocols whose error does not decrease under parallel repetition are presented, which exploit non-malleable encryption and can be based on any trapdoor permutation.
Abstract: Whether or not parallel repetition lowers the error has been a fundamental question in the theory of protocols, with applications in many different areas. It is well known that parallel repetition reduces the error at an exponential rate in interactive proofs and Arthur-Merlin games. It seems to have been taken for granted that the same is true in arguments, or other proofs where the soundness only holds with respect to computationally bounded parties. We show that this is not the case. Surprisingly, parallel repetition can actually fail in this setting. We present four-round protocols whose error does not decrease under parallel repetition. This holds for any (polynomial) number of repetitions. These protocols exploit non-malleable encryption and can be based on any trapdoor permutation. On the other hand we show that for three-round protocols the error does go down exponentially fast. The question of parallel error reduction is particularly important when the protocol is used in cryptographic settings like identification, and the error represents the probability that an intruder succeeds.

Journal ArticleDOI
TL;DR: This work gives a randomized parallel algorithm for computing single-source shortest paths in weighted digraphs and shows that the exact shortest-path problem can be efficiently reduced to solving a series of approximate shortest- path subproblems.

Journal ArticleDOI
TL;DR: This paper introduces two arctangent radices and shows that about 2/3 of the rotation directions can be derived in parallel without any error.
Abstract: Each coordinate rotation digital computer (CORDIC) iteration selects the rotation direction by analyzing the results of the previous iteration. In this paper, we introduce two arctangent radices and show that about 2/3 of the rotation directions can be derived in parallel without any error. Some architectures exploiting these strategies are proposed.
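For contrast, a minimal conventional CORDIC rotation is sketched below: the direction d_i must be chosen sequentially from the residual angle, which is exactly the serial dependency the arctangent-radix scheme relaxes for about 2/3 of the iterations. The floating-point Python here is only a model of the fixed-point hardware loop.

```python
# Hedged sketch: conventional CORDIC rotation mode (serial direction choice).
import math

def cordic_rotate(x, y, angle, iters=32):
    # constant scale factor accumulated by the micro-rotations
    K = math.prod(1 / math.sqrt(1 + 2.0 ** (-2 * i)) for i in range(iters))
    z = angle
    for i in range(iters):
        d = 1.0 if z >= 0 else -1.0          # depends on the previous iteration
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * math.atan(2.0 ** -i)        # subtract the arctangent radix
    return x * K, y * K

cx, cy = cordic_rotate(1.0, 0.0, math.pi / 5)
assert abs(cx - math.cos(math.pi / 5)) < 1e-6
```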

Journal ArticleDOI
01 Apr 1997
TL;DR: The proposed two-dimensional multistate cellular automaton architecture achieves a high frequency of operation and is particularly suited for VLSI implementation due to its inherent parallelism, structural locality, regularity, and modularity.
Abstract: This paper presents a new parallel algorithm for collision-free path planning of a diamond-shaped robot among arbitrarily shaped obstacles, which are represented as a discrete image, and its implementation in VLSI. The proposed algorithm is based on a retraction of free space onto the Voronoi diagram, which is constructed through the time evolution of cellular automata, after an initial phase during which the boundaries of obstacles are identified and coded with respect to their orientation. The proposed algorithm is both space and time efficient, since it does not require the modeling of objects or distance and intersection calculations. Additionally, the proposed two-dimensional multistate cellular automaton architecture achieves a high frequency of operation and is particularly suited for VLSI implementation due to its inherent parallelism, structural locality, regularity, and modularity.

Journal ArticleDOI
TL;DR: Timings and segmentation results of the algorithm, built on top of the Message Passing Interface and tested on the Cray T3D, are presented to justify the superiority of the novel design over previous implementations.
Abstract: The parallel watershed transformation used in gray-scale image segmentation is reconsidered on the basis of the component labeling problem. The main idea is to break the sequentiality of the watershed transformation and to correctly delimit the extent of all connected components locally, on each processor, simultaneously. The internal fragmentation of the catchment basins into smaller subcomponents, due to domain decomposition, is finally solved by employing a global connected components operator. Therefore, in a pyramidal structure of master-slave processors, internal contours of adjacent subcomponents within the same component are hierarchically removed. Global final connected areas are efficiently obtained in log₂ N steps on a logical grid of N processors. Timings and segmentation results of the algorithm, built on top of the Message Passing Interface and tested on the Cray T3D, are presented to justify the superiority of the novel design over previous implementations.

Journal ArticleDOI
01 Jul 1997
TL;DR: A prototype suggests that the DEVS formalism can be combined with genetic algorithms running in parallel to serve as the basis of a very general, very fast class of simulation environments.
Abstract: DEVS-C++, a high-performance environment for modeling large-scale systems at high resolution, uses the DEVS (Discrete-EVent system Specification) formalism to represent both continuous and discrete processes. A prototype suggests that the DEVS formalism can be combined with genetic algorithms running in parallel to serve as the basis of a very general, very fast class of simulation environments.

Proceedings Article
25 Aug 1997
TL;DR: This work develops a general approach to the problem of scheduling distributed multi-dimensional resource units for all kinds of parallelism within and across queries and operators, and presents heuristic algorithms for various forms of the problem.
Abstract: Scheduling query execution plans is a particularly complex problem in hierarchical parallel systems, where each site consists of a collection of local time-shared (e.g., CPU(s) or disk(s)) and space-shared (e.g., memory) resources and communicates with remote sites by message-passing. We develop a general approach to the problem, capturing the full complexity of scheduling distributed multi-dimensional resource units for all kinds of parallelism within and across queries and operators. We present heuristic algorithms for various forms of the problem, some of which are provably near-optimal. Preliminary experimental results confirm the effectiveness of our approach.

Proceedings ArticleDOI
21 Jun 1997
TL;DR: This paper defines the LoPC model, which is inspired by the LogP model but accounts for contention for message processing resources in parallel algorithms on a multiprocessor or network of workstations, and derives the general form of the model for parallel applications that communicate via active messages.
Abstract: Parallel algorithm designers need computational models that take first-order system costs into account, but are also simple enough to use in practice. This paper introduces the LoPC model, which is inspired by the LogP model but accounts for contention for message processing resources in parallel algorithms on a multiprocessor or network of workstations. LoPC takes the L, o and P parameters directly from the LogP model and uses them to predict the cost of contention, C. This paper defines the LoPC model and derives the general form of the model for parallel applications that communicate via active messages. Model modifications for systems that implement coherent shared memory abstractions are also discussed. We carry out the analysis for two important classes of applications that have irregular communication. In the case of parallel applications with homogeneous all-to-any communication, such as sparse matrix computations, the analysis yields a simple rule of thumb and insight into contention costs. In the case of parallel client-server algorithms, the LoPC analysis provides a simple and accurate calculation of the optimal allocation of nodes between clients and servers. The LoPC estimates for these applications are shown to be accurate when compared against event-driven simulation and against a sparse matrix computation on the MIT Alewife multiprocessor.
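The paper derives C from L, o and P; the following is a loudly hypothetical illustration, not the published formula. It prices a LogP-style request/reply and adds an M/M/1-style waiting term as a stand-in for contention, just to show the kind of quantity the model predicts.

```python
# Hypothetical illustration only: LogP-style round trip plus a queueing
# term standing in for LoPC's contention cost C. The rho expression and
# all constants are assumptions, not the published model.
def lopc_like_cost(L, o, requests_per_sec):
    service = o                          # server occupancy per message
    rho = requests_per_sec * service     # utilization of the serving node
    assert rho < 1, "server saturated"
    C = service * rho / (1 - rho)        # M/M/1 mean waiting time (illustrative)
    return 2 * L + 4 * o + C             # request + reply, each costing L + 2o
```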


Journal ArticleDOI
TL;DR: This paper describes four different parallel algorithms for implementing the spectral transform method on hypercube- and mesh-connected multicomputers with cut-through routing and reports on computational experiments that were conducted to evaluate their efficiency on parallel computers.
Abstract: The spectral transform method is a standard numerical technique for solving partial differential equations on a sphere and is widely used in atmospheric circulation models. Recent research has identified several promising algorithms for implementing this method on massively parallel computers; however, no detailed comparison of the different algorithms has previously been attempted. In this paper, we describe these different parallel algorithms and report on computational experiments that we have conducted to evaluate their efficiency on parallel computers. The experiments used a testbed code that solves the nonlinear shallow water equations on a sphere; considerable care was taken to ensure that the experiments provide a fair comparison of the different algorithms and that the results are relevant to global models. We focus on hypercube- and mesh-connected multicomputers with cut-through routing, such as the Intel iPSC/860, DELTA, and Paragon, and the nCUBE/2, but we also indicate how the results extend to other parallel computer architectures. The results of this study are relevant not only to the spectral transform method but also to multidimensional fast Fourier transforms (FFTs) and other parallel transforms.
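The communication structure shared by these algorithms and by multidimensional FFTs is easy to see in the classic transpose formulation: perform local transforms along the axis you own, transpose (the single all-to-all communication step), then transform the other axis. In the sketch below the transpose is an in-memory numpy call standing in for interprocessor communication.

```python
# Hedged sketch: the "transpose" pattern behind parallel 2D transforms.
import numpy as np

def fft2_by_transpose(a):
    a = np.fft.fft(a, axis=1)   # local FFTs on the rows each process owns
    a = a.T                     # stand-in for the all-to-all transpose
    a = np.fft.fft(a, axis=1)   # local FFTs on the former columns
    return a.T

x = np.random.rand(8, 8)
assert np.allclose(fft2_by_transpose(x), np.fft.fft2(x))
```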

Proceedings ArticleDOI
19 Jan 1997
TL;DR: In this article, the authors demonstrate that DNA computers can simulate Boolean circuits with a small overhead, and they also show that for the class NC$^1$, the slowdown can be reduced to a constant, and that the inputs, the Boolean AND gates, and the OR gates can be encoded to DNA oligonucleotide sequences.
Abstract: We demonstrate that DNA computers can simulate Boolean circuits with a small overhead. Boolean circuits embody the notion of massively parallel signal processing and are frequently encountered in many parallel algorithms. Many important problems such as sorting, integer arithmetic, and matrix multiplication are known to be computable by small size Boolean circuits much faster than by ordinary sequential digital computers. This paper shows that DNA chemistry allows one to simulate large semi-unbounded fan-in Boolean circuits with a logarithmic slowdown in computation time. Also, for the class NC$^1$, the slowdown can be reduced to a constant. In this algorithm we have encoded the inputs, the Boolean AND gates, and the OR gates to DNA oligonucleotide sequences. We operate on the gates and the inputs by standard molecular techniques of sequence-specific annealing, ligation, separation by size, limited amplification, sequence-specific cleavage, and detection by size. Preliminary biochemical experiments on a small test circuit have produced encouraging results. Further confirmatory experiments are in progress.

Journal ArticleDOI
TL;DR: A realistic combinatorial optimization problem is used as an example to show how a genetic algorithm can be parallelized in an efficient way and it is shown that it is possible to obtain good solutions to the problem even with a very low communication load.

Proceedings ArticleDOI
18 Dec 1997
TL;DR: The technique is based on a comparison routine that determines the relative position of two points in the order induced by a space filling curve and could be used in conjunction with any parallel sorting algorithm to effect parallel domain decomposition.
Abstract: Partitioning techniques based on space filling curves have received much recent attention due to their low running time and good load balance characteristics. The basic idea underlying these methods is to order the multidimensional data according to a space filling curve and partition the resulting one dimensional order. However, space filling curves are defined for points that lie on a uniform grid of a particular resolution. It is typically assumed that the coordinates of the points are representable using a fixed number of bits, and the run times of the algorithms depend upon the number of bits used. We present a simple and efficient technique for ordering arbitrary and dynamic multidimensional data using space filling curves and its application to parallel domain decomposition and load balancing. Our technique is based on a comparison routine that determines the relative position of two points in the order induced by a space filling curve. The comparison routine could then be used in conjunction with any parallel sorting algorithm to effect parallel domain decomposition.
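For the fixed-precision integer case, the comparison routine can be written with the well-known most-significant-differing-bit trick for Z-order (Morton order); the sketch below is that restricted case, whereas the paper's routine handles arbitrary and dynamic data.

```python
# Hedged sketch: Z-order comparison without explicitly interleaving bits.
import functools

def less_msb(a, b):
    """True iff the highest set bit of a is strictly below that of b."""
    return a < b and a < (a ^ b)

def z_less(p, q):
    """Is point p before point q in Z-order? (tuples of non-negative ints)"""
    best_dim, best_xor = 0, 0
    for d in range(len(p)):
        x = p[d] ^ q[d]
        if less_msb(best_xor, x):       # dimension with the highest differing bit
            best_dim, best_xor = d, x
    return p[best_dim] < q[best_dim]

# usable with any sort, exactly as suggested for parallel domain decomposition
cmp = lambda a, b: -1 if z_less(a, b) else (1 if z_less(b, a) else 0)
pts = sorted([(3, 1), (0, 0), (2, 2)], key=functools.cmp_to_key(cmp))
```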

Journal ArticleDOI
TL;DR: This paper addresses the compile-time optimization of a form of nested-loop computation that is motivated by a computational physics application and a pruning search strategy for determination of an optimal form is developed.
Abstract: This paper addresses the compile-time optimization of a form of nested-loop computation that is motivated by a computational physics application. The computations involve multi-dimensional surface and volume integrals where the integrand is a product of a number of array terms. Besides the issue of optimal distribution of the arrays among the processors, there is also scope for reordering of the operations using the commutativity and associativity properties of addition and multiplication, and the application of the distributive law to significantly reduce the number of operations executed. A formalization of the operation minimization problem and proof of its NP-completeness is provided. A pruning search strategy for determination of an optimal form is developed. An analysis of the communication requirements and a polynomial-time algorithm for determination of optimal distribution of the arrays are also provided.
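A one-line instance of the operation reordering in question: by the distributive law, a double summation over a product of independent terms collapses from O(N^2) multiplications to two O(N) sums and one multiply.

```python
# Tiny worked example of the distributive-law rewrite (illustrative only).
import numpy as np

A = np.random.rand(1000)
B = np.random.rand(1000)

naive = sum(A[i] * B[j] for i in range(len(A)) for j in range(len(B)))
factored = A.sum() * B.sum()    # sum_i sum_j A_i * B_j = (sum_i A_i)(sum_j B_j)
assert np.isclose(naive, factored)
```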

Book ChapterDOI
07 Jul 1997
TL;DR: In this paper, the authors present deterministic parallel algorithms for the coarse-grained multicomputer (CGM) and bulk-synchronous parallel computer (BSP) models which solve the following well known graph problems: (1) list ranking, (2) Euler tour construction, (3) computing the connected components and spanning forest, (4) lowest common ancestor preprocessing, (5) tree contraction and expression tree evaluation, (6) computing an ear decomposition, (7) 2-edge connectivity and biconnectivity (testing and component computation), and (8) chordal graph recognition (finding a perfect elimination ordering).
Abstract: In this paper, we present deterministic parallel algorithms for the coarse grained multicomputer (CGM) and bulk-synchronous parallel computer (BSP) models which solve the following well known graph problems: (1) list ranking, (2) Euler tour construction, (3) computing the connected components and spanning forest, (4) lowest common ancestor preprocessing, (5) tree contraction and expression tree evaluation, (6) computing an ear decomposition or open ear decomposition, (7) 2-edge connectivity and biconnectivity (testing and component computation), and (8) chordal graph recognition (finding a perfect elimination ordering). The algorithms for Problems 1–7 require O(log p) communication rounds and linear sequential work per round. Our results for Problems 1 and 2 hold for arbitrary ratios \(\frac{n}{p}\), i.e. they are fully scalable, and for Problems 3–8 it is assumed that \(\frac{n}{p} \geqslant p^\epsilon\) for some \(\epsilon > 0\), which is true for all commercially available multiprocessors. We view the algorithms presented as an important step towards the final goal of O(1) communication rounds. Note that the number of communication rounds obtained in this paper is independent of n and grows only very slowly with respect to p. Hence, for most practical purposes, the number of communication rounds can be considered as constant. The result for Problem 1 is a considerable improvement over those previously reported. The algorithms for Problems 2–7 are the first practically relevant deterministic parallel algorithms for these problems to be used for commercially available coarse grained parallel machines.
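To fix ideas on Problem 1, here is a sequential emulation of pointer-jumping list ranking: in O(log n) synchronous rounds every node learns its distance to the tail, and each round is one fully parallel step (one communication round in a BSP-style setting). This classic textbook version is shown only to illustrate the problem; it is not the paper's O(log p)-round algorithm.

```python
# Hedged sketch: pointer-jumping list ranking, sequential emulation.
import math

def list_rank(succ):
    """succ[i] = successor of node i; the tail points to itself.
    Returns each node's distance to the tail."""
    n = len(succ)
    rank = [0 if succ[i] == i else 1 for i in range(n)]
    for _ in range(max(1, math.ceil(math.log2(n)) + 1)):
        # one synchronous round: read the old arrays, write new ones
        rank = [rank[i] + rank[succ[i]] for i in range(n)]
        succ = [succ[succ[i]] for i in range(n)]
    return rank

# list 2 -> 0 -> 1 (tail): distances to the tail are 1, 0, 2
assert list_rank([1, 1, 0]) == [1, 0, 2]
```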

Journal ArticleDOI
TL;DR: The proposed topology of dual-direction ring is shown to be well amenable to parallel implementation of the GA for the UC problem and speed-up and efficiency for each topology with different number of processor are compared to those of the sequential GA approach.
Abstract: Through a constraint handling technique, this paper proposes a parallel genetic algorithm (GA) approach to solving the thermal unit commitment (UC) problem. The developed algorithm is implemented on an eight-processor transputer network, the processors of which are arranged in master-slave and dual-direction ring structures, respectively. The proposed approach has been tested on a 38-unit thermal power system over a 24-hour period. Speed-up and efficiency for each topology with different numbers of processors are compared to those of the sequential GA approach. The proposed dual-direction ring topology is shown to be well suited to parallel implementation of the GA for the UC problem.

Journal ArticleDOI
TL;DR: The parallel approach is shown to consistently perform better than a sequential genetic algorithm when applied to these routing problems and is able to significantly reduce the occurrence of crosstalk.
Abstract: This paper presents a novel approach to solve the VLSI (very large scale integration) channel and switchbox routing problems. The approach is based on a parallel genetic algorithm (PGA) that runs on a distributed network of workstations. The algorithm optimizes both physical constraints (length of nets, number of vias) and crosstalk (delay due to coupled capacitance). The parallel approach is shown to consistently perform better than a sequential genetic algorithm when applied to these routing problems. An extensive investigation of the parameters of the algorithm yields routing results that are qualitatively better than or as good as the best published results. In addition, the algorithm is able to significantly reduce the occurrence of crosstalk.
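A common way to structure such a distributed PGA is the island model: independent populations evolve in worker processes and periodically exchange their best individuals. The sketch below is that generic pattern on a toy bit-string objective; the encoding and fitness merely stand in for the paper's net-length/vias/crosstalk objective, and the migration policy is an assumption.

```python
# Hedged sketch: island-model parallel genetic algorithm (illustrative).
import random
from multiprocessing import Pool

TARGET = [1, 0, 1, 1, 0, 1, 0, 1]        # toy stand-in for a routing encoding

def fitness(ind):
    return sum(a == b for a, b in zip(ind, TARGET))

def evolve_island(args):
    pop, generations = args
    rng = random.Random()
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: len(pop) // 2]    # truncation selection
        children = []
        for _ in range(len(pop) - len(parents)):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(TARGET))
            child = a[:cut] + b[cut:]     # one-point crossover
            if rng.random() < 0.1:        # point mutation
                i = rng.randrange(len(child))
                child[i] ^= 1
            children.append(child)
        pop = parents + children
    return pop

if __name__ == "__main__":
    islands = [[[random.randint(0, 1) for _ in TARGET] for _ in range(20)]
               for _ in range(4)]
    with Pool(4) as pool:
        for _ in range(5):                # migrate between epochs
            islands = pool.map(evolve_island, [(p, 10) for p in islands])
            best = max((max(p, key=fitness) for p in islands), key=fitness)
            for p in islands:
                p[-1] = best[:]           # broadcast the best individual
```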

Proceedings ArticleDOI
15 Nov 1997
TL;DR: This work describes the implementation of a mark-sweep garbage collector for shared-memory machines, reports its performance, and observes that implementation details affect performance heavily.
Abstract: This work describes the implementation of a mark-sweep garbage collector (GC) for shared-memory machines and reports its performance. It is a simple "parallel" collector in which all processors cooperatively traverse objects in the global shared heap. The collector stops the application program during a collection and assumes a uniform access cost to all locations in the shared heap. The implementation is based on the Boehm-Demers-Weiser conservative GC (Boehm GC). Experiments have been done on an Ultra Enterprise 10000 (UltraSparc processor, 250 MHz, 64 processors). We wrote two applications, BH (an N-body problem solver) and CKY (a context-free grammar parser), in a parallel extension to C++. Through the experiments, we observe that load balancing is the key to achieving scalability. A naive collector without load redistribution hardly exhibits speed-up (at most fourfold speed-up on 64 processors). Performance can be improved by dynamic load balancing, which exchanges objects to be scanned between processors, but we still observe that straightforward implementation severely limits performance. First, large objects become a source of significant load imbalance, because the unit of load redistribution is a single object. Performance is improved by splitting a large object into small pieces before pushing it onto the mark stack. Next, processors spend a significant amount of time idle because of a serializing method for termination detection using a shared counter; this problem suddenly appeared on more than 32 processors. By implementing a non-serializing method for termination detection, the idle time is eliminated and performance is improved. With all of these careful implementation choices, we achieved average speed-ups of 28.0 in BH and 28.6 in CKY on 64 processors.
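Two of the implementation points measured here, an explicit mark stack and splitting large objects into fixed-size chunks so the unit of load redistribution stays small, can be sketched compactly. The single-threaded Python below uses an invented heap encoding and is only an illustration of those two ideas, not the Boehm-GC-based collector itself.

```python
# Hedged sketch: mark phase with an explicit stack and large-object splitting.
CHUNK = 4   # maximum references scanned per work unit (assumed constant)

def mark(heap, roots):
    """heap: dict obj_id -> list of referenced obj_ids (invented encoding)."""
    marked = set()
    stack = [(r, 0) for r in roots]          # work units: (object, offset)
    while stack:
        obj, off = stack.pop()
        if off == 0:
            if obj in marked:
                continue
            marked.add(obj)
        refs = heap[obj]
        if len(refs) - off > CHUNK:          # split: push a continuation
            stack.append((obj, off + CHUNK))
        for child in refs[off:off + CHUNK]:
            if child not in marked:
                stack.append((child, 0))
    return marked

heap = {0: [1, 2, 3, 4, 5], 1: [], 2: [0], 3: [], 4: [], 5: []}
assert mark(heap, [0]) == {0, 1, 2, 3, 4, 5}
```

In the parallel setting each work unit on the stack is small enough to hand to another processor, which is what restores load balance for large objects.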