
Showing papers in "International Journal of Parallel Programming in 1989"


Journal ArticleDOI
TL;DR: This paper introduces a dynamic strategy called WorkCrews for controlling the use of parallelism on small-scale, tightly-coupled multiprocessors; its ordering of queue requests favors coarse-grained subtasks, which further reduces the overhead of task decomposition.
Abstract: In implementing parallel programs, it is important to find strategies for controlling parallelism that make the most effective use of available resources. In this paper, we introduce a dynamic strategy called WorkCrews for controlling the use of parallelism on small-scale, tightly-coupled multiprocessors. In the WorkCrew model, tasks are assigned to a finite set of workers. As in other mechanisms for specifying parallelism, each worker can enqueue subtasks for concurrent evaluation by other workers as they become idle. The WorkCrew paradigm has two advantages. First, much of the work associated with task division can be deferred until a new worker actually undertakes the subtask and avoided altogether if the original worker ends up executing the subtask serially. Second, the ordering of queue requests under WorkCrews favors coarse-grained subtasks, which reduces further the overhead of task decomposition.
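
To make the deferred-splitting idea concrete, here is a minimal Python sketch (the WorkCrew class, its offer/try_retract methods, and the summation example are invented for illustration, not the paper's interface): a divide-and-conquer sum offers its second half to idle workers and, if nobody claims the offer, retracts it and runs that half serially, so the cost of a real split is paid only when help actually arrives. Helpers take the oldest outstanding offer, which tends to be the coarsest one.

```python
import threading
import time

class WorkCrew:
    """Toy work-crew: workers publish 'offers' of help that idle workers may claim."""
    def __init__(self):
        self.lock = threading.Lock()
        self.offers = []                      # outstanding help requests, oldest first

    def offer(self, fn, args):
        """Publish a subtask an idle worker *may* pick up; splitting cost is deferred."""
        entry = {"fn": fn, "args": args, "taken": False,
                 "done": threading.Event(), "result": None}
        with self.lock:
            self.offers.append(entry)
        return entry

    def try_retract(self, entry):
        """Withdraw an unclaimed offer so the original worker can run it serially."""
        with self.lock:
            if not entry["taken"]:
                entry["taken"] = True
                self.offers.remove(entry)
                return True
        return False

    def helper(self, stop):
        """Idle worker: claim the oldest (coarsest) outstanding offer and run it."""
        while not stop.is_set():
            entry = None
            with self.lock:
                if self.offers:
                    entry = self.offers.pop(0)
                    entry["taken"] = True
            if entry is None:
                time.sleep(0.001)
                continue
            entry["result"] = entry["fn"](*entry["args"])
            entry["done"].set()

def crew_sum(crew, xs, lo, hi, grain=4096):
    """Divide-and-conquer sum: offer the right half, do the left half, then
    either retract the offer (serial fallback) or wait for the helper's result."""
    if hi - lo <= grain:
        return sum(xs[lo:hi])
    mid = (lo + hi) // 2
    entry = crew.offer(crew_sum, (crew, xs, mid, hi, grain))
    left = crew_sum(crew, xs, lo, mid, grain)
    if crew.try_retract(entry):
        right = crew_sum(crew, xs, mid, hi, grain)    # nobody helped: no split overhead paid
    else:
        entry["done"].wait()                          # a helper took it: wait for its result
        right = entry["result"]
    return left + right

if __name__ == "__main__":
    crew, stop = WorkCrew(), threading.Event()
    helpers = [threading.Thread(target=crew.helper, args=(stop,)) for _ in range(3)]
    for t in helpers:
        t.start()
    print(crew_sum(crew, list(range(100000)), 0, 100000))   # 4999950000
    stop.set()
    for t in helpers:
        t.join()
```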

107 citations


Journal ArticleDOI
TL;DR: The results of initial investigations using the Intel iPSC/1 hypercube and the Connection Machine for parallel sequence comparisons have wide applicability to the parallel processing of biological sequence comparisons.
Abstract: Comparison of biological (DNA or protein) sequences provides insight into molecular structure, function, and homology, and is increasingly important as the available databases become larger and more numerous. One method of increasing the speed of the calculations is to perform them in parallel. We present the results of initial investigations using the Intel iPSC/1 hypercube and the Connection Machine (CM-I) for these comparisons. Since these machines have very different architectures, the issues and performance trade-offs discussed have a wide applicability for the parallel processing of biological sequence comparisons.
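
The abstract does not spell out the comparison algorithm, but sequence comparison is typically a dynamic-programming recurrence, and a common way to parallelize it is an anti-diagonal (wavefront) sweep, since all cells on one anti-diagonal are mutually independent. The sketch below uses plain edit distance as a stand-in (an assumption for illustration, not necessarily the scoring scheme used in the paper).

```python
def edit_distance_wavefront(a, b):
    """Edit distance computed by anti-diagonals: cells with the same i + j are
    independent, so each anti-diagonal could be split across processors."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                               # boundary: delete all of a[:i]
    for j in range(m + 1):
        d[0][j] = j                               # boundary: insert all of b[:j]
    for diag in range(2, n + m + 1):              # sweep anti-diagonals i + j = diag
        cells = [(i, diag - i) for i in range(1, n + 1) if 1 <= diag - i <= m]
        for i, j in cells:                        # independent: could run in parallel
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return d[n][m]

print(edit_distance_wavefront("GATTACA", "GCATGCU"))   # 4
```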

76 citations


Journal ArticleDOI
TL;DR: This paper describes research that was performed to demonstrate that multiprocessor execution of functional programs on current multiprocessors is feasible, and results in a significant reduction in their execution times.
Abstract: Functional languages have recently gained attention as vehicles for programming in a concise and elegant manner. In addition, it has been suggested that functional programming provides a natural methodology for programming multiprocessor computers. This dissertation demonstrates that multiprocessor execution of functional programs is feasible, and results in a significant reduction in their execution times. Two implementations of the functional language ALFL were built on commercially available multiprocessors. Alfalfa is an implementation on the Intel iPSC hypercube multiprocessor, and Buckwheat is an implementation on the Encore Multimax shared-memory multiprocessor. Each implementation includes a compiler that performs automatic decomposition of ALFL programs. The compiler is responsible for detecting the inherent parallelism in a program, and decomposing the program into a collection of tasks, called serial combinators, that can be executed in parallel. One of the primary goals of the compiler is to generate serial combinators exhibiting the coarsest granularity possible without sacrificing useful parallelism. This dissertation describes the algorithms used by the compiler to analyze, decompose, and optimize functional programs. The abstract machine model supported by Alfalfa and Buckwheat is called heterogeneous graph reduction, which is a hybrid of graph reduction and conventional stack-oriented execution. This model supports parallelism, lazy evaluation, and higher order functions while at the same time making efficient use of the processors in the system. The Alfalfa and Buckwheat run-time systems support dynamic load balancing, interprocessor communication (if required), and storage management. A large number of experiments were performed on Alfalfa and Buckwheat for a variety of programs. The results of these experiments, as well as the conclusions drawn from them, are presented.
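
As a rough illustration of the granularity question (this is a toy of my own, not the ALFL compiler's serial combinators): independent operands of an expression graph may be handed to other workers, while a single-dependency chain stays inside one serial task.

```python
from concurrent.futures import ThreadPoolExecutor

# Expression nodes: ("lit", value) | ("neg", child) | ("add", left, right)
def evaluate(node, pool):
    tag = node[0]
    if tag == "lit":
        return node[1]
    if tag == "neg":
        # Single dependency: keep it in the same (serial) task, like a chain
        # folded into one coarse serial combinator.
        return -evaluate(node[1], pool)
    if tag == "add":
        # Independent operands: one of them may be evaluated by another worker.
        right = pool.submit(evaluate, node[2], pool)
        left = evaluate(node[1], pool)
        return left + right.result()
    raise ValueError(f"unknown node {tag!r}")

expr = ("add", ("neg", ("lit", 3)), ("add", ("lit", 4), ("lit", 5)))
with ThreadPoolExecutor(max_workers=4) as pool:
    print(evaluate(expr, pool))            # (-3) + (4 + 5) = 6
```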

62 citations


Journal ArticleDOI
TL;DR: It is shown that the performance of randomized algorithms is less affected by factors that prevent most parallel deterministic algorithms from attaining their theoretical speedup bounds and reliability is enhanced because the failure of a single processor leads only to degradation, not failure, of the algorithm.
Abstract: Randomized algorithms are algorithms that employ randomness in their solution method. We show that the performance of randomized algorithms is less affected by factors that prevent most parallel deterministic algorithms from attaining their theoretical speedup bounds. A major reason is that the mapping of randomized algorithms onto multiprocessors involves very little scheduling or communication overhead. Furthermore, reliability is enhanced because the failure of a single processor leads only to degradation, not failure, of the algorithm. We present results of an extensive simulation done on a multiprocessor simulator, running a randomized branch-and-bound algorithm. The particular case we consider is the knapsack problem, due to its ease of formulation. We observe the largest speedups in precisely those problems that take large amounts of time to solve.
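
A minimal sketch of the general approach, with illustrative item data and a deliberately simple bound rather than anything from the paper: every worker runs the same exhaustive branch-and-bound with a differently seeded branching order, so there is essentially no scheduling or communication, and losing one worker loses only redundant work.

```python
import random
from concurrent.futures import ProcessPoolExecutor

ITEMS = [(60, 10), (100, 20), (120, 30), (80, 25), (40, 5)]   # (value, weight) pairs
CAPACITY = 50

def branch_and_bound(seed):
    """0/1 knapsack by depth-first branch-and-bound; the seed only varies the
    order in which items are branched on, so every run returns the optimum."""
    rng = random.Random(seed)
    order = list(range(len(ITEMS)))
    rng.shuffle(order)                       # randomized branching order
    rest = [0] * (len(order) + 1)            # rest[i] = total value of items order[i:]
    for i in reversed(range(len(order))):
        rest[i] = rest[i + 1] + ITEMS[order[i]][0]
    best = 0
    def search(i, value, room):
        nonlocal best
        if value > best:
            best = value
        if i == len(order) or value + rest[i] <= best:
            return                           # prune: taking everything left cannot win
        v, w = ITEMS[order[i]]
        if w <= room:
            search(i + 1, value + v, room - w)   # branch: take the item
        search(i + 1, value, room)               # branch: skip the item
    search(0, 0, CAPACITY)
    return best

if __name__ == "__main__":
    # Independent randomized searches need no coordination; losing one worker
    # would only lose that worker's (redundant) result, not the answer.
    with ProcessPoolExecutor() as pool:
        print(max(pool.map(branch_and_bound, range(4))))     # 220
```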

37 citations


Journal ArticleDOI
TL;DR: Four algorithms based on the first fit approach that provide different granularities of parallel access to the allocator's data structures are investigated, showing that simple algorithms are appropriate when the expected number of concurrent requests per memory is low and the request pattern is not bursty.
Abstract: Dynamic storage allocation is a vital component of programming systems intended for multiprocessor architectures that support globally shared memory. Highly parallel algorithms for access to system data structures lie at the core of effective memory allocation strategies as well as solutions to other parallel systems problems. In this paper, we investigate four algorithms, all based on the first fit approach, that provide different granularities of parallel access to the allocator's data structures. These solutions employ a variety of design techniques including specialized locking protocols, the use of atomic fetch-and-Φ operations, and structural modifications. We describe experiments designed to compare the performance of these schemes. The results show that simple algorithms are appropriate when the expected number of concurrent requests per memory is low and the request pattern is not bursty. Algorithms that support finer granularity access while avoiding locking protocols are successful in a range of larger processor/memory ratios.
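
For reference, the coarsest-grained of these schemes amounts to a first-fit free list protected by a single lock. A toy sketch of that baseline (the finer-grained and fetch-and-Φ variants are not reproduced here):

```python
import threading

class FirstFitAllocator:
    """Sketch of the simplest variant: one lock around a first-fit free list."""
    def __init__(self, size):
        self.lock = threading.Lock()
        self.free = [(0, size)]            # sorted list of (offset, length) holes

    def alloc(self, n):
        with self.lock:
            for i, (off, length) in enumerate(self.free):
                if length >= n:            # first hole that fits
                    if length == n:
                        del self.free[i]
                    else:
                        self.free[i] = (off + n, length - n)
                    return off
        return None                        # out of memory

    def free_block(self, off, n):
        with self.lock:
            self.free.append((off, n))
            self.free.sort()
            merged = [self.free[0]]        # coalesce adjacent holes
            for o, length in self.free[1:]:
                po, pl = merged[-1]
                if po + pl == o:
                    merged[-1] = (po, pl + length)
                else:
                    merged.append((o, length))
            self.free = merged

heap = FirstFitAllocator(1024)
a = heap.alloc(100)
b = heap.alloc(200)
heap.free_block(a, 100)
print(heap.alloc(50))                      # 0: first fit reuses the freed hole
```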

28 citations


Journal ArticleDOI
TL;DR: The implementations indicate that transitive closure computations are intrinsically difficult for distributed memory parallel machines because of the need for global information, and the results for shared memory machines exhibited excellent speedups.
Abstract: Practical parallel algorithms, based on classical sequential Union-Find algorithms for computing transitive closures of binary relations, are described and implemented for both shared memory and distributed memory parallel computers. By practical algorithms, we mean algorithms that are efficient for parallel systems with bounded numbers of processors as opposed to algorithms where the number of processors grows with the problem size. Transitive closures are useful for decomposing many applications problems into independent subproblems. The implementations were on an ENCORE Multimax shared memory machine and an NCUBE hypercube. Our implementations indicate that transitive closure computations are intrinsically difficult for distributed memory parallel machines because of the need for global information. By contrast, our results for shared memory machines exhibited excellent speedups.
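
The sequential Union-Find core that such algorithms build on can be sketched as follows (standard path compression and union by rank; the shared-memory and hypercube parallelizations are not shown):

```python
class UnionFind:
    """Sequential Union-Find: equivalence classes of a binary relation."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):                     # path compression (halving)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):                 # union by rank
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1

# Transitive closure of the related pairs = components after union-ing each pair.
uf = UnionFind(6)
for a, b in [(0, 1), (1, 2), (4, 5)]:
    uf.union(a, b)
print(uf.find(0) == uf.find(2), uf.find(0) == uf.find(4))   # True False
```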

23 citations


Journal ArticleDOI
TL;DR: A technique for adapting the Morris sliding garbage collection algorithm to execute on parallel machines with shared memory is described, and it is shown how the technique for parallelizing the sequential algorithm can be adapted to a semi-space copying algorithm.
Abstract: This paper describes a technique for adapting the Morris sliding garbage collection algorithm to execute on parallel machines with shared memory. The algorithm is described within the framework of an implementation of the parallel logic language Parlog. However, the algorithm is a general one and can easily be adapted to parallel Prolog systems and to other languages. The performance of the algorithm executing a few simple Parlog benchmarks is analyzed. Finally, it is shown how the technique for parallelizing the sequential algorithm can be adapted for a semi-space copying algorithm.
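
As a reminder of what a sliding (order-preserving, compacting) collection does, here is a toy mark-and-slide pass over an array heap; the cell encoding is invented for this sketch, and neither Morris's pointer-reversal technique nor the paper's parallelization is reproduced.

```python
# Toy heap: each cell is a tuple; integer fields are pointers (heap indices).
HEAP = [("cons", 2, None), ("garbage",), ("cons", 4, None), ("garbage",), ("atom",)]
ROOTS = [0]

def collect(heap, roots):
    marked = set()
    stack = list(roots)
    while stack:                                   # mark phase: trace from the roots
        i = stack.pop()
        if i in marked:
            continue
        marked.add(i)
        stack.extend(f for f in heap[i][1:] if isinstance(f, int))
    new_addr = {}                                  # sliding preserves allocation order
    for i in sorted(marked):
        new_addr[i] = len(new_addr)
    new_heap = []
    for i in sorted(marked):                       # slide live cells down, fixing pointers
        cell = tuple(new_addr.get(f, f) if isinstance(f, int) else f for f in heap[i])
        new_heap.append(cell)
    return new_heap, [new_addr[r] for r in roots]

print(collect(HEAP, ROOTS))   # ([('cons', 1, None), ('cons', 2, None), ('atom',)], [0])
```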

22 citations


Journal ArticleDOI
TL;DR: MultiScheme, the system resulting from these extensions, supports Halstead's future construct as the simple model for parallelism and, by revealing the underlying placeholders on top of which this construct is built, supports a variety of additional parallel programming techniques.
Abstract: The Scheme language can be converted into a parallel processing language by adding two new data types (placeholders and weak pairs), two processor synchronization primitives, and a task distribution mechanism. The mechanisms that support task creation, scheduling, and task synchronization are built using these extensions and features already present in the sequential language. Implementing the core of the parallel processing component in Scheme itself provides a testbed for a variety of experiments and extensions. MultiScheme, the system resulting from these extensions, supports Halstead's future construct as the simple model for parallelism. By revealing the underlying placeholders on top of which this construct is built, MultiScheme supports a variety of additional parallel programming techniques. It supports speculative computation through a simple procedural interface and the automatic garbage collection of tasks. The qlet and qlambda constructs of the QLisp language are also easily implemented in MultiScheme, as are the more familiar fork and join constructs of imperative programming.
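
A minimal sketch of placeholders and a future built on them, in Python rather than Scheme (the names Placeholder, determine, and touch echo the concepts but are not MultiScheme's actual interface):

```python
import threading

class Placeholder:
    """An object standing in for a value that has not been computed yet."""
    def __init__(self):
        self._ready = threading.Event()
        self._value = None

    def determine(self, value):            # give the placeholder its value, once
        self._value = value
        self._ready.set()

def touch(x):
    """Force a value: wait for a placeholder to be determined, pass others through."""
    if isinstance(x, Placeholder):
        x._ready.wait()
        return x._value
    return x

def future(thunk):
    """The future construct: return a placeholder immediately and determine it
    from a concurrently running task."""
    p = Placeholder()
    threading.Thread(target=lambda: p.determine(thunk())).start()
    return p

f = future(lambda: sum(range(1_000_000)))  # runs concurrently with the caller
print(touch(f) + 1)                        # touching blocks until the value exists
```

The point of exposing the placeholder rather than only the future is that other mechanisms, such as speculative tasks or qlet-style binding, can be layered on the same primitive.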

16 citations


Journal ArticleDOI
TL;DR: It is demonstrated that, for certain data parallel algorithms, it is possible to determine optimal design parameters analytically using a simple model for the NCUBE hypercube computer.
Abstract: Designing efficient parallel algorithms in a message-based parallel computer should consider both time-space tradeoffs and computation-communication tradeoffs. In order to balance these tradeoffs and achieve the optimal performance of an algorithm, one has to consider various design parameters such as the number of processors required and the size of partitions. In this paper, we demonstrate that, for certain data parallel algorithms, it is possible to determine these design parameters analytically. To serve as a basis for the discussions that follow, a simple model for the NCUBE hypercube computer is introduced. Using this model, we take two examples, array summation and matrix multiplication, to illustrate how their performance can be modeled. By optimizing these expressions, one is able to determine optimal design parameters that lead to efficient execution. Experiments on a 64-node NCUBE verified the accuracy of the analytic results and are used to further support the discussions.
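
A toy version of such a model for array summation on a hypercube, with made-up cost constants rather than the paper's measured ones: each node adds its n/p elements, partial sums are then combined along the log2(p) cube dimensions, and the model-optimal p falls out by minimizing the resulting expression.

```python
import math

t_add = 1.0      # assumed cost of one addition (illustrative, not measured)
t_msg = 50.0     # assumed cost of one nearest-neighbor message

def t_total(n, p):
    """Model: local summation of n/p elements, then log2(p) combine steps,
    each paying one message plus one addition."""
    return (n / p) * t_add + math.log2(p) * (t_msg + t_add)

n = 4096
for p in (1, 4, 16, 64):
    print(f"p={p:2d}  T={t_total(n, p):8.1f}")
best_p = min((2 ** k for k in range(7)), key=lambda p: t_total(n, p))
print("model-optimal p:", best_p)          # 64 for these particular constants
```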

11 citations


Journal ArticleDOI
TL;DR: It is argued that Samal and Henderson's argument makes assumptions about how processors are used, and a counterexample is given that enforces arc consistency in a constant number of steps using O(n^2 a^2 2^(na)) processors, leaving open whether the lower bound holds for a polynomial number of processors.
Abstract: Samal and Henderson claim that any parallel algorithm for enforcing arc consistency in the worst case must have Ω(na) sequential steps, where n is the number of nodes, and a is the number of labels per node. We argue that Samal and Henderson's argument makes assumptions about how processors are used and give a counterexample that enforces arc consistency in a constant number of steps using O(n^2 a^2 2^(na)) processors. It is possible that the lower bound holds for a polynomial number of processors; if such a lower bound were to be proven it would answer an important open question in theoretical computer science concerning the relation between the complexity classes P and NC. The strongest existing lower bound for the arc consistency problem states that it cannot be solved in polynomial log time unless P = NC.
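
For context, what "enforcing arc consistency" computes can be stated as a small sequential routine; the sketch below is AC-3-style and is not the parallel construction discussed in the paper.

```python
from collections import deque

def ac3(domains, constraints):
    """domains: {var: set of labels}; constraints: {(x, y): set of allowed (vx, vy) pairs}.
    Removes every label that has no support in a neighboring domain."""
    arcs = {(x, y) for (x, y) in constraints} | {(y, x) for (x, y) in constraints}
    def allowed(x, vx, y, vy):
        if (x, y) in constraints:
            return (vx, vy) in constraints[(x, y)]
        return (vy, vx) in constraints[(y, x)]
    queue = deque(arcs)
    while queue:
        x, y = queue.popleft()
        revised = False
        for vx in set(domains[x]):
            if not any(allowed(x, vx, y, vy) for vy in domains[y]):
                domains[x].discard(vx)     # vx has no support in y's domain
                revised = True
        if revised:                        # re-examine arcs pointing into x
            queue.extend((z, x) for (z, w) in arcs if w == x and z != y)
    return domains

doms = {"A": {1, 2, 3}, "B": {1, 2, 3}}
cons = {("A", "B"): {(1, 2), (2, 3)}}      # only pairs where B = A + 1 are allowed
print(ac3(doms, cons))                     # {'A': {1, 2}, 'B': {2, 3}}
```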

8 citations


Journal ArticleDOI
TL;DR: It is shown that modes can increase the precision of the backtracking algorithm, though the algorithm allows this precision to be traded off against overhead on a procedure-by-procedure and call-by-call basis.
Abstract: We present the first backtracking algorithm for stream AND-parallel logic programs. It relies on compile-time knowledge of the dataflow graph of each clause to let it figure out efficiently which goals to kill or restart when a goal fails. This crucial information, which we derive from mode declarations, was not available at compile-time in any previous stream AND-parallel system.
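
A toy illustration of how a clause's compile-time dataflow graph can drive backtracking (the clause, variable modes, and restart policy here are invented, and much simpler than the paper's algorithm): when a goal fails, restart a producer of one of its inputs and kill every goal downstream of that producer.

```python
# Clause-level dataflow, derived at compile time from mode declarations.
producers = {"X": "g1", "Y": "g2", "Z": "g3"}                    # variable -> goal that binds it
inputs = {"g1": [], "g2": ["X"], "g3": ["X", "Y"], "g4": ["Z"]}  # goal -> variables it consumes

# Goal-level edges: which earlier goals each goal depends on.
depends_on = {g: {producers[v] for v in vs} for g, vs in inputs.items()}

def on_failure(failed_goal):
    """Choose a producer of the failed goal's input to restart, and collect
    every goal downstream of it that must be killed."""
    restart = max(depends_on[failed_goal], default=None)         # crude 'most recent producer' rule
    if restart is None:
        return None, set()
    kill, frontier = set(), {restart}
    while frontier:
        g = frontier.pop()
        for h, deps in depends_on.items():
            if g in deps and h not in kill:
                kill.add(h)
                frontier.add(h)
    return restart, kill

print(on_failure("g3"))    # restart 'g2'; kill {'g3', 'g4'} (set order may vary)
```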