
Showing papers on "Parallel algorithm published in 1993"


01 May 1993
TL;DR: Comparing the results to the fastest reported vectorized Cray Y-MP and C90 algorithm shows that the current generation of parallel machines is competitive with conventional vector supercomputers even for small problems.
Abstract: Three parallel algorithms for classical molecular dynamics are presented. The first assigns each processor a fixed subset of atoms; the second assigns each a fixed subset of inter-atomic forces to compute; the third assigns each a fixed spatial region. The algorithms are suitable for molecular dynamics models which can be difficult to parallelize efficiently—those with short-range forces where the neighbors of each atom change rapidly. They can be implemented on any distributed-memory parallel machine which allows for message-passing of data between independently executing processors. The algorithms are tested on a standard Lennard-Jones benchmark problem for system sizes ranging from 500 to 100,000,000 atoms on several parallel supercomputers: the nCUBE 2, Intel iPSC/860 and Paragon, and Cray T3D. Comparing the results to the fastest reported vectorized Cray Y-MP and C90 algorithm shows that the current generation of parallel machines is competitive with conventional vector supercomputers even for small problems. For large problems, the spatial algorithm achieves parallel efficiencies of 90%, and an 1840-node Intel Paragon performs up to 165 times faster than a single Cray C90 processor. Trade-offs between the three algorithms and guidelines for adapting them to more complex molecular dynamics simulations are also discussed.

29,323 citations
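As a rough aid to intuition, here is a toy Python sketch of the spatial-decomposition strategy described in this abstract: each "processor" owns a slab of the domain, imports nearby ghost atoms, and computes short-range forces only for the atoms it owns. It is a 1D toy with invented reduced Lennard-Jones units, no periodic boundaries, and no real message passing; it is not the authors' implementation.

```python
import numpy as np

def lj_force_over_r(r):
    """Lennard-Jones force magnitude divided by r, in reduced units."""
    inv_r2 = 1.0 / (r * r)
    inv_r6 = inv_r2 ** 3
    return 24.0 * inv_r2 * inv_r6 * (2.0 * inv_r6 - 1.0)

def slab_decomposition_forces(x, box, nproc, cutoff):
    """Toy sketch: loop over slabs standing in for independent processes."""
    forces = np.zeros_like(x)
    slab = box / nproc
    for p in range(nproc):
        lo, hi = p * slab, (p + 1) * slab
        owned = np.where((x >= lo) & (x < hi))[0]
        # ghost atoms: owned elsewhere, but within the cutoff of this slab
        ghosts = np.where(((x >= lo - cutoff) & (x < lo)) |
                          ((x >= hi) & (x < hi + cutoff)))[0]
        local = np.concatenate([owned, ghosts])   # what a real code would receive
        for i in owned:                           # forces accumulate only on owned atoms
            for j in local:
                if i == j:
                    continue
                dx = x[i] - x[j]
                if abs(dx) < cutoff:
                    forces[i] += lj_force_over_r(abs(dx)) * dx
    return forces

rng = np.random.default_rng(0)
positions = rng.uniform(0.0, 10.0, size=200)
print(slab_decomposition_forces(positions, box=10.0, nproc=4, cutoff=2.5)[:3])
```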


Proceedings ArticleDOI
01 Jul 1993
TL;DR: A new parallel machine model, called LogP, is offered that reflects the critical technology trends underlying parallel computers and is intended to serve as a basis for developing fast, portable parallel algorithms and to offer guidelines to machine designers.
Abstract: A vast body of theoretical research has focused either on overly simplistic models of parallel computation, notably the PRAM, or overly specific models that have few representatives in the real world. Both kinds of models encourage exploitation of formal loopholes, rather than rewarding development of techniques that yield performance across a range of current and future parallel machines. This paper offers a new parallel machine model, called LogP, that reflects the critical technology trends underlying parallel computers. It is intended to serve as a basis for developing fast, portable parallel algorithms and to offer guidelines to machine designers. Such a model must strike a balance between detail and simplicity in order to reveal important bottlenecks without making analysis of interesting problems intractable. The model is based on four parameters that specify abstractly the computing bandwidth, the communication bandwidth, the communication delay, and the efficiency of coupling communication and computation. Portable parallel algorithms typically adapt to the machine configuration in terms of these parameters. The utility of the model is demonstrated through examples that are implemented on the CM-5.

1,515 citations
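A back-of-the-envelope Python sketch of how the four LogP parameters are used in analysis (L = latency, o = per-message overhead, g = gap, P = processors). The parameter values are invented, and the broadcast estimate is a simple binomial-tree bound, not the optimal LogP broadcast schedule from the paper.

```python
import math

def point_to_point(L, o):
    """One small message: send overhead + wire latency + receive overhead."""
    return 2 * o + L

def binomial_broadcast(L, o, g, P):
    """Rough estimate: each round doubles the number of informed processors,
    and a sender cannot issue messages faster than the gap g allows."""
    rounds = math.ceil(math.log2(P))
    per_round = max(point_to_point(L, o), g + o)
    return rounds * per_round

print(point_to_point(L=6, o=2))                  # 10 cycles for one message
print(binomial_broadcast(L=6, o=2, g=4, P=32))   # 5 rounds * 10 = 50 cycles
```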


Book
01 Nov 1993
TL;DR: High Performance Fortran is a set of extensions to Fortran expressing parallel execution at a relatively high level that brings the convenience of sequential Fortran a step closer to today's complex parallel machines.
Abstract: From the Publisher: High Performance Fortran (HPF) is a set of extensions to Fortran expressing parallel execution at a relatively high level. For the thousands of scientists, engineers, and others who wish to take advantage of the power of both vector and parallel supercomputers, five of the principal authors of HPF have teamed up here to write a tutorial for the language. There is an increasing need for a common parallel Fortran that can serve as a programming interface with the new parallel machines that are appearing on the market. While HPF does not solve all the problems of parallel programming, it does provide a portable, high-level expression for data-parallel algorithms that brings the convenience of sequential Fortran a step closer to today's complex parallel machines.

677 citations


Journal ArticleDOI
01 Dec 1993
TL;DR: This paper shows how simple and parallel techniques can be combined to achieve this goal and deal with complex real world scenes and shows that the algorithm relies on correlation followed by interpolation and performs very well on difficult images such as faces and cluttered ground level scenes.
Abstract: To compute reliable dense depth maps, a stereo algorithm must preserve depth discontinuities and avoid gross errors. In this paper, we show how simple and parallel techniques can be combined to achieve this goal and deal with complex real world scenes. Our algorithm relies on correlation followed by interpolation. During the correlation phase the two images play a symmetric role and we use a validity criterion for the matches that eliminates gross errors: at places where the images cannot be correlated reliably, due to lack of texture or occlusions for example, the algorithm does not produce wrong matches but a very sparse disparity map, as opposed to a dense one when the correlation is successful. To generate a dense depth map, the information is then propagated across the featureless areas, but not across discontinuities, by an interpolation scheme that takes image grey levels into account to preserve image features. We show that our algorithm performs very well on difficult images such as faces and cluttered ground level scenes. Because all the algorithms described here are parallel and very regular they could be implemented in hardware and lead to extremely fast stereo systems.

483 citations
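The symmetric validity criterion amounts to a left-right consistency test. Below is a hedged Python sketch of that idea using plain SSD window matching; it is a generic stand-in, not the authors' correlation measure, and the interpolation step is not shown.

```python
import numpy as np

def disparity_map(ref, other, max_d, w, sign):
    """Window-based SSD matching; sign=-1 when `ref` is the left image
    (matches lie to the left in the right image), sign=+1 the other way."""
    h, wid = ref.shape
    disp = np.full((h, wid), -1, dtype=int)
    for y in range(w, h - w):
        for x in range(w, wid - w):
            patch = ref[y - w:y + w + 1, x - w:x + w + 1].astype(float)
            best, best_cost = -1, np.inf
            for d in range(max_d + 1):
                xx = x + sign * d
                if xx - w < 0 or xx + w >= wid:
                    break
                cand = other[y - w:y + w + 1, xx - w:xx + w + 1].astype(float)
                cost = np.sum((patch - cand) ** 2)
                if cost < best_cost:
                    best, best_cost = d, cost
            disp[y, x] = best
    return disp

def cross_check(d_left, d_right):
    """Keep a match only if the right-to-left search lands back on the same pixel."""
    h, w = d_left.shape
    valid = np.full((h, w), -1, dtype=int)
    for y in range(h):
        for x in range(w):
            d = d_left[y, x]
            if d >= 0 and x - d >= 0 and d_right[y, x - d] == d:
                valid[y, x] = d
    return valid

left = np.random.default_rng(3).integers(0, 255, size=(32, 48))
right = np.roll(left, -2, axis=1)                  # synthetic shift of 2 pixels
dl = disparity_map(left, right, max_d=4, w=2, sign=-1)
dr = disparity_map(right, left, max_d=4, w=2, sign=+1)
print(np.count_nonzero(cross_check(dl, dr) == 2))  # most validated pixels report d = 2
```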


Proceedings ArticleDOI
01 Dec 1993
TL;DR: The authors report on an efficient adaptive N-body method which computes the forces on an arbitrary distribution of bodies in a time which scales as N log N with the particle number.
Abstract: The authors report on an efficient adaptive N-body method which they have recently designed and implemented. The algorithm computes the forces on an arbitrary distribution of bodies in a time which scales as N log N with the particle number. The accuracy of the force calculations is analytically bounded, and can be adjusted via a user-defined parameter from a few percent relative accuracy down to machine arithmetic accuracy. Instead of using pointers to indicate the topology of the tree, the authors identify each possible cell with a key. The mapping of keys into memory locations is achieved via a hash table. This allows the program to access data in an efficient manner across multiple processors. Performance of the parallel program is measured on the 512-processor Intel Touchstone Delta system. Comments on a number of wide-ranging applications which can benefit from application of this type of algorithm are included.

457 citations
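A small Python sketch of the central idea of identifying cells by keys rather than pointers: interleave the bits of the integer cell coordinates under a leading placeholder bit (a Morton-style key) and use an ordinary hash table to map keys to cell data. The key format here is a simplified assumption, not the paper's exact encoding.

```python
def cell_key(ix, iy, iz, level):
    """Interleave coordinate bits (Morton order) under a leading placeholder bit,
    so cells at different tree depths get distinct keys."""
    key = 1
    for b in reversed(range(level)):
        key = (key << 3) | (((ix >> b) & 1) << 2) | (((iy >> b) & 1) << 1) | ((iz >> b) & 1)
    return key

# An ordinary dict plays the role of the hash table mapping keys to cell data.
cells = {}
cells[cell_key(3, 5, 1, level=3)] = {"mass": 1.0, "center_of_mass": (0.4, 0.7, 0.2)}
print(cells)   # the key encodes the cell's path from the root of the octree
```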


Journal ArticleDOI
TL;DR: This paper describes an insertion algorithm for the Vehicle Routing and Scheduling Problem with Time Windows that builds routes in parallel and uses a generalized regret measure over all unrouted customers to select the next candidate for insertion.

436 citations
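A hedged Python sketch of a regret-based selection rule of the kind the TL;DR describes; the precise regret measure, the time-window and capacity feasibility checks, and the parallel route construction of the paper are not reproduced here. `dist` is any callable giving travel cost between two customers.

```python
def cheapest_insertion_cost(route, customer, dist):
    """Cheapest extra distance for inserting `customer` between consecutive stops."""
    best = float("inf")
    for i in range(len(route) - 1):
        extra = (dist(route[i], customer) + dist(customer, route[i + 1])
                 - dist(route[i], route[i + 1]))
        best = min(best, extra)
    return best

def select_by_regret(unrouted, routes, dist):
    """Pick the unrouted customer whose regret (what is lost by not placing it
    in its best route now), summed over the remaining routes, is largest."""
    best_customer, best_regret = None, -1.0
    for c in unrouted:
        costs = sorted(cheapest_insertion_cost(r, c, dist) for r in routes)
        regret = sum(cost - costs[0] for cost in costs[1:])
        if regret > best_regret:
            best_customer, best_regret = c, regret
    return best_customer

# tiny usage example on points in the plane; customer 0 is the depot
points = {0: (0, 0), 1: (2, 1), 2: (-1, 3), 3: (4, 4)}
d = lambda a, b: ((points[a][0] - points[b][0]) ** 2 + (points[a][1] - points[b][1]) ** 2) ** 0.5
routes = [[0, 1, 0], [0, 2, 0]]           # two open routes, depot to depot
print(select_by_regret({3}, routes, d))   # -> 3, the only unrouted customer
```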


Book
01 Sep 1993
TL;DR: These thorough but introductory presentations demonstrate the most important algorithmic techniques and their use in exposing the hidden parallelism within problems.
Abstract: From the Publisher: This landmark collaboration will enhance the knowledge and abilities of anyone interested in parallel algorithms and in developing programs for parallel computers. Many problems currently solved with sequential algorithms are themselves highly parallelizable when designers use the powerful parallel techniques now available. These thorough but introductory presentations demonstrate the most important algorithmic techniques and their use in exposing the hidden parallelism within problems. Beginning with familiar sequential algorithms, the authors provide a careful description of the fundamental problem, its solution, and analysis, complete with examples and exercises. Each of the 22 chapters then synthesizes a more sophisticated parallel algorithm using the simpler sequential and parallel techniques used to introduce the problem. The PRAM shared-memory model of computing provides a unifying framework. This model has been used extensively for designing parallel algorithms and can be efficiently simulated on many of the parallel architectures now in use. Applying the methods in this book will offer designers a substantial advantage when solving problems for parallel computation.

352 citations


Journal ArticleDOI
TL;DR: Isoefficiency analysis helps to determine the best algorithm/architecture combination for a particular problem without explicitly analyzing all possible combinations under all possible conditions.
Abstract: Isoefficiency analysis helps us determine the best algorithm/architecture combination for a particular problem without explicitly analyzing all possible combinations under all possible conditions.

329 citations
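A numeric Python sketch of how an isoefficiency function is used: fix a target efficiency E and ask how the problem size W must grow with the processor count p. The overhead term T_o(W, p) = p log2 p below is purely illustrative and is not one of the algorithm/architecture combinations analyzed by the authors.

```python
import math

def required_problem_size(p, E, overhead=lambda W, p: p * math.log2(p)):
    """Solve W = E/(1-E) * T_o(W, p) by fixed-point iteration."""
    K = E / (1.0 - E)
    W = 1.0
    for _ in range(100):
        W = K * overhead(W, p)
    return W

for p in (4, 16, 64, 256):
    # with this overhead term, W must grow roughly as p log p to hold E = 0.8
    print(p, round(required_problem_size(p, E=0.8)))
```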


Journal ArticleDOI
TL;DR: It is proved that every nonlinear clustering of a coarse grain DAG can be transformed into a linear clustering whose parallel time is less than or equal to that of the nonlinear clustering.
Abstract: The authors consider the impact of the granularity on scheduling task graphs. Scheduling consists of two parts: the processor assignment of tasks, also called clustering, and the ordering of tasks for execution in each processor. The authors introduce two types of clusterings: nonlinear and linear clusterings. A clustering is nonlinear if two parallel tasks are mapped in the same cluster; otherwise it is linear. Linear clustering fully exploits the natural parallelism of a given directed acyclic task graph (DAG), while nonlinear clustering sequentializes independent tasks to reduce parallelism. The authors also introduce a new quantification of the granularity of a DAG and define a coarse grain DAG as one whose granularity is greater than one. It is proved that every nonlinear clustering of a coarse grain DAG can be transformed into a linear clustering whose parallel time is less than or equal to that of the nonlinear one. This result is used to prove the optimality of some important linear clusterings used in parallel numerical computing.

302 citations
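A short Python sketch of the definition at the heart of the result: a clustering is linear exactly when no two tasks placed in the same cluster are independent (unordered by the DAG). The task graph and the two example clusterings below are invented for illustration.

```python
from itertools import combinations

def reachable(dag, src):
    """Set of nodes reachable from src in the task DAG (adjacency-list dict)."""
    seen, stack = set(), [src]
    while stack:
        u = stack.pop()
        for v in dag.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def is_linear(dag, clustering):
    """clustering maps task -> cluster id; linear means no cluster holds two parallel tasks."""
    reach = {u: reachable(dag, u) for u in dag}
    by_cluster = {}
    for task, c in clustering.items():
        by_cluster.setdefault(c, []).append(task)
    for tasks in by_cluster.values():
        for a, b in combinations(tasks, 2):
            if b not in reach[a] and a not in reach[b]:
                return False        # two independent tasks share a cluster
    return True

dag = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(is_linear(dag, {"a": 0, "b": 0, "d": 0, "c": 1}))   # True: a->b->d is a chain
print(is_linear(dag, {"a": 0, "b": 0, "c": 0, "d": 1}))   # False: b and c are parallel
```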


Journal ArticleDOI
TL;DR: Algorithms that are efficient for solving a variety of problems involving graphs and digitized images are introduced that are asymptotically superior to those previously obtained for the mesh, the mesh with multiple broadcasting, the mesh with multiple buses, the mesh-of-trees, and the pyramid computer.
Abstract: The mesh with reconfigurable bus is presented as a model of computation. The reconfigurable mesh captures salient features from a variety of sources, including the CAAPP, CHiP, polymorphic-torus network, and bus automaton. It consists of an array of processors interconnected by a reconfigurable bus system that can be used to dynamically obtain various interconnection patterns between the processors. A variety of fundamental data-movement operations for the reconfigurable mesh are introduced. Based on these operations, algorithms that are efficient for solving a variety of problems involving graphs and digitized images are also introduced. The algorithms are asymptotically superior to those previously obtained for the aforementioned reconfigurable architectures, as well as to those previously obtained for the mesh, the mesh with multiple broadcasting, the mesh with multiple buses, the mesh-of-trees, and the pyramid computer. The power of reconfigurability is illustrated by solving some problems, such as the exclusive OR, more efficiently on the reconfigurable mesh than is possible on the parallel random-access machine (PRAM).

261 citations


ReportDOI
01 May 1993
TL;DR: In this article, three parallel algorithms for classical molecular dynamics are presented, which can be implemented on any distributed-memory parallel machine which allows for message-passing of data between independently executing processors.
Abstract: Three parallel algorithms for classical molecular dynamics are presented. The first assigns each processor a subset of atoms; the second assigns each a subset of inter-atomic forces to compute; the third assigns each a fixed spatial region. The algorithms are suitable for molecular dynamics models which can be difficult to parallelize efficiently -- those with short-range forces where the neighbors of each atom change rapidly. They can be implemented on any distributed-memory parallel machine which allows for message-passing of data between independently executing processors. The algorithms are tested on a standard Lennard-Jones benchmark problem for system sizes ranging from 500 to 10,000,000 atoms on three parallel supercomputers: the nCUBE 2, Intel iPSC/860, and Intel Delta. Comparing the results to the fastest reported vectorized Cray Y-MP and C90 algorithm shows that the current generation of parallel machines is competitive with conventional vector supercomputers even for small problems. For large problems, the spatial algorithm achieves parallel efficiencies of 90% and the Intel Delta performs about 30 times faster than a single Y-MP processor and 12 times faster than a single C90 processor. Trade-offs between the three algorithms and guidelines for adapting them to more complex molecular dynamics simulations are also discussed.

Proceedings ArticleDOI
03 Nov 1993
TL;DR: New techniques for designing efficient algorithms for computational geometry problems that are too large to be solved in internal memory are given, and these algorithms are the first known optimal algorithms for a wide range of two-level and hierarchical multilevel memory models, including parallel models.
Abstract: In this paper we give new techniques for designing efficient algorithms for computational geometry problems that are too large to be solved in internal memory. We use these techniques to develop optimal and practical algorithms for a number of important large-scale problems. We discuss our algorithms primarily in the context of single processor/single disk machines, a domain in which they are not only the first known optimal results but also of tremendous practical value. Our methods also produce the first known optimal algorithms for a wide range of two-level and hierarchical multilevel memory models, including parallel models. The algorithms are optimal both in terms of I/O cost and internal computation.

Journal ArticleDOI
TL;DR: Universal randomized methods for parallelizing sequential backtrack search and branch-and-bound computation are presented and demonstrate the effectiveness of randomization in distributed parallel computation.
Abstract: Universal randomized methods for parallelizing sequential backtrack search and branch-and-bound computation are presented. These methods execute on message-passing multiprocessor systems, and require no global data structures or complex communication protocols. For backtrack search, it is shown that, uniformly on all instances, the method described in this paper is likely to yield a speed-up within a small constant factor from optimal, when all solutions to the problem instance are required. For branch-and-bound computation, it is shown that, uniformly on all instances, the execution time of this method is unlikely to exceed a certain inherent lower bound by more than a constant factor. These randomized methods demonstrate the effectiveness of randomization in distributed parallel computation.

Journal ArticleDOI
TL;DR: A transitive-closure-based test generation algorithm is presented in which dependences derived from the transitive closure are used to reduce ternary relations to binary relations that, in turn, dynamically update the transitive closure.
Abstract: A transitive-closure-based test generation algorithm is presented. A test is obtained by determining signal values that satisfy a Boolean equation derived from the neural network model of the circuit incorporating necessary conditions for fault activation and path sensitization. The algorithm is a sequence of two main steps that are repeatedly executed: transitive closure computation and decision-making. A key feature of the algorithm is that dependences derived from the transitive closure are used to reduce ternary relations to binary relations that in turn dynamically update the transitive closure. The signals are either determined from the transitive closure or are enumerated until the Boolean equation is satisfied. Experimental results on the ISCAS 1985 and the combinational parts of ISCAS 1989 benchmark circuits are presented to demonstrate efficient test generation and redundancy identification. Results on four state-of-the-art production VLSI circuits are also presented.

Journal ArticleDOI
TL;DR: It is still an open issue to decide which of the various architectures (shared-memory, shared-disk, or shared-nothing) is best for database management under various conditions.
Abstract: Parallel database systems attempt to exploit recent multiprocessor computer architectures in order to build high-performance and high-availability database servers at a much lower price than equivalent mainframe computers. Although there are commercial SQL-based products, a number of open problems hamper the full exploitation of the capabilities of parallel systems. These problems touch on issues ranging from those of parallel processing to distributed database management. Furthermore, it is still an open issue to decide which of the various architectures (shared-memory, shared-disk, or shared-nothing) is best for database management under various conditions. Finally, there are new issues raised by the introduction of higher functionality such as knowledge-based or object-oriented capabilities within a parallel database system.

01 Jan 1993
TL;DR: In this paper, the authors describe an efficient approach to partitioning unstructured meshes that occur naturally in the finite element and finite difference methods, making use of the underlying geometric structure of a given mesh and finding a provably good partition in randomized O(n) time.
Abstract: This paper describes an efficient approach to partitioning unstructured meshes that occur naturally in the finite element and finite difference methods. The approach makes use of the underlying geometric structure of a given mesh and finds a provably good partition in randomized O(n) time. It applies to meshes in both two and three dimensions. The new method has applications in efficient sequential and parallel algorithms for large-scale problems in scientific computing. This is an overview paper written with emphasis on the algorithmic aspects of the approach. Many detailed proofs can be found in companion papers.
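For flavor only, here is a much simpler geometric partitioner in Python: recursive coordinate bisection of the vertex coordinates. This is explicitly not the authors' sphere-separator method (which projects points onto a sphere and cuts with a random great circle); it only shows the shape of a partitioner that uses geometry rather than graph connectivity.

```python
import numpy as np

def coordinate_bisection(points, ids, depth):
    """Recursively split vertex ids into 2**depth parts by alternating coordinates."""
    if depth == 0 or len(ids) <= 1:
        return [ids]
    axis = depth % points.shape[1]
    order = ids[np.argsort(points[ids, axis])]
    mid = len(order) // 2
    return (coordinate_bisection(points, order[:mid], depth - 1) +
            coordinate_bisection(points, order[mid:], depth - 1))

pts = np.random.default_rng(2).uniform(size=(1000, 2))   # stand-in mesh vertices
parts = coordinate_bisection(pts, np.arange(len(pts)), depth=3)
print([len(p) for p in parts])   # 8 nearly equal parts
```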

Proceedings ArticleDOI
01 Jul 1993
TL;DR: This paper presents a bottom-up clustering algorithm based on recursive collapsing of small cliques in a graph that leads to a natural parallel implementation in which multiple processors are used to identify clusters simultaneously.
Abstract: In this paper, we present a bottom-up clustering algorithm based on recursive collapsing of small cliques in a graph. The sizes of the small cliques are derived using random graph theory. This clustering algorithm leads to a natural parallel implementation in which multiple processors are used to identify clusters simultaneously. We also present a cluster-based partitioning method in which our clustering algorithm is used as a preprocessing step to both the bisection algorithm by Fiduccia and Mattheyses and a ratio-cut algorithm by Wei and Cheng. Our results show that cluster-based partitioning obtains cut sizes up to 49.6% smaller than the bisection algorithm, and obtains ratio cut sizes up to 66.8% smaller than the ratio-cut algorithm. Moreover, we show that cluster-based partitioning produces much more stable results than direct partitioning.

Journal ArticleDOI
TL;DR: A realistic method is described that scales all relevant parameters under considerations imposed by the application domain, which leads to different conclusions about the effectiveness and design of large multiprocessors than the naive practice of scaling only the data set size.
Abstract: Models for the constraints under which an application should be scaled, including constant problem-size scaling, memory-constrained scaling, and time-constrained scaling, are reviewed. A realistic method is described that scales all relevant parameters under considerations imposed by the application domain. This method leads to different conclusions about the effectiveness and design of large multiprocessors than the naive practice of scaling only the data set size. The primary example application is a simulation of galaxies using the Barnes-Hut hierarchical N-body method.
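A hedged numeric sketch of memory-constrained versus time-constrained scaling, under generic cost assumptions (time per step ~ N log N, memory ~ N) that roughly fit a hierarchical N-body code; the constants and the cost model are assumptions for illustration, not the paper's measurements.

```python
import math

def memory_constrained_N(N1, p):
    """Memory per processor held fixed: N grows linearly with p."""
    return N1 * p

def time_constrained_N(N1, p):
    """Parallel time per step held fixed: solve N log2 N = p * N1 log2 N1."""
    target = p * N1 * math.log2(N1)
    N = target
    for _ in range(50):                 # simple fixed-point refinement
        N = target / math.log2(N)
    return N

N1 = 10_000
for p in (1, 16, 256):
    # time-constrained N grows noticeably more slowly than memory-constrained N
    print(p, memory_constrained_N(N1, p), round(time_constrained_N(N1, p)))
```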

Journal ArticleDOI
TL;DR: The authors present the scalability analysis of a parallel fast Fourier transform (FFT) algorithm on mesh and hypercube connected multicomputers using the isoefficiency metric and show that it is more cost-effective to implement the FFT algorithm on a hypercube rather than a mesh.
Abstract: The authors present the scalability analysis of a parallel fast Fourier transform (FFT) algorithm on mesh and hypercube connected multicomputers using the isoefficiency metric. The isoefficiency function of an algorithm-architecture combination is defined as the rate at which the problem size should grow with the number of processors to maintain a fixed efficiency. It is shown that it is more cost-effective to implement the FFT algorithm on a hypercube rather than a mesh despite the fact that large-scale meshes are cheaper to construct than large hypercubes. Although the scope of this work is limited to the Cooley-Tukey FFT algorithm on a few classes of architectures, the methodology can be used to study the performance of various FFT algorithms on a variety of architectures such as SIMD hypercube and mesh architectures and shared memory architecture.

Proceedings ArticleDOI
01 Jul 1993
TL;DR: Initial benchmark results show that NESL's performance is competitive with that of machine-specific codes for regular dense data, and is often superior for irregular data.
Abstract: This paper gives an overview of the implementation of NESL, a portable nested data-parallel language. This language and its implementation are the first to fully support nested data structures as well as nested data-parallel function calls. These features allow the concise description of parallel algorithms on irregular data, such as sparse matrices and graphs. In addition, they maintain the advantages of data-parallel languages: a simple programming model and portability. The current NESL implementation is based on an intermediate language called VCODE and a library of vector routines called CVL. It runs on the Connection Machine CM-2, the Cray Y-MP C90, and serial machines. We compare initial benchmark results of NESL with those of machine-specific code on these machines for three algorithms: least-squares line-fitting, median finding, and a sparse-matrix vector product. These results show that NESL's performance is competitive with that of machine-specific codes for regular dense data, and is often superior for irregular data.

Proceedings ArticleDOI
01 Nov 1993
TL;DR: This paper presents a divide-and-conquer ray-traced volume rendering algorithm and a parallel image compositing method, along with their implementation and performance on the Connection Machine CM-5, and networked workstations.
Abstract: This paper presents a divide-and-conquer ray-traced volume rendering algorithm and a parallel image compositing method, along with their implementation and performance on the Connection Machine CM-5 and networked workstations. This algorithm distributes both the data and the computations to individual processing units to achieve fast, high-quality rendering of high-resolution data. The volume data, once distributed, is left intact. The processing nodes perform local ray tracing of their subvolumes concurrently. No communication between processing units is needed during this local ray-tracing process. A subimage is generated by each processing unit and the final image is obtained by compositing subimages in the proper order, which can be determined a priori. Test results on the CM-5 and a group of networked workstations demonstrate the practicality of our rendering algorithm and compositing method.
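A minimal Python sketch of compositing subimages with the "over" operator in a predetermined front-to-back order. The ordering here is simply list order, standing in for the view-dependent subvolume order the paper derives, and the parallel (tree-structured) compositing is not shown.

```python
import numpy as np

def over(front, back):
    """Front-to-back 'over' for premultiplied RGBA arrays of shape (H, W, 4) in [0, 1]."""
    alpha_front = front[..., 3:4]
    return front + (1.0 - alpha_front) * back

def composite(subimages_front_to_back):
    """Fold the partial images in the given visibility order."""
    result = subimages_front_to_back[0]
    for img in subimages_front_to_back[1:]:
        result = over(result, img)
    return result

h, w = 4, 4
rng = np.random.default_rng(1)
parts = [rng.uniform(0.0, 0.5, size=(h, w, 4)) for _ in range(3)]   # fake subimages
print(composite(parts).shape)
```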

01 Apr 1993
TL;DR: NESL is intended to be used as a portable interface for programming a variety of parallel and vector supercomputers, and as a basis for teaching parallel algorithms, and several examples of algorithms coded in the language are described.
Abstract: This report describes NESL, a strongly-typed, applicative, data-parallel language. NESL is intended to be used as a portable interface for programming a variety of parallel and vector supercomputers, and as a basis for teaching parallel algorithms. Parallelism is supplied through a simple set of data-parallel constructs based on sequences (ordered sets), including a mechanism for applying any function over the elements of a sequence in parallel and a rich set of parallel functions that manipulate sequences. NESL fully supports nested sequences and nested parallelism -- the ability to take a parallel function and apply it over multiple instances in parallel. Nested parallelism is important for implementing algorithms with complex and dynamically changing data structures, such as required in many graph and sparse matrix algorithms. NESL also provides a mechanism for calculating the asymptotic running time for a program on various parallel machine models, including the parallel random access machine (PRAM). This is useful for estimating running times of algorithms on actual machines and, when teaching algorithms, for supplying a close correspondence between the code and the theoretical complexity. This report defines NESL and describes several examples of algorithms coded in the language. The examples include algorithms for median finding, sorting, string searching, finding prime numbers, and finding a planar convex hull. NESL currently compiles to an intermediate language called Vcode, which runs on the Cray Y-MP, Connection Machine CM-2, and Encore Multimax. For many algorithms, the current implementation gives performance close to optimized machine-specific code for these machines. Note: This report is an updated version of CMU-CS-92-103, which described version 2.4 of the language. The most significant changes in version 2.6 are that it supports polymorphic types, has an ML-like syntax instead of a lisp-like syntax, and includes support for I/O.
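A hedged Python analogue of the nested data-parallel style described above: a sparse matrix as a sequence of rows, each row a sequence of (column, value) pairs, with the matrix-vector product written as an apply-to-each nested inside another. In NESL both levels would execute in parallel; the comprehensions below are sequential stand-ins for that notation, not NESL code.

```python
def sparse_matvec(rows, x):
    """Nested 'apply-to-each': outer over rows, inner over a row's nonzeros."""
    return [sum(value * x[col] for col, value in row) for row in rows]

rows = [[(0, 2.0), (2, 1.0)],            # row 0 has nonzeros in columns 0 and 2
        [(1, 3.0)],
        [(0, 1.0), (1, 1.0), (2, 1.0)]]
print(sparse_matvec(rows, [1.0, 2.0, 3.0]))   # [5.0, 6.0, 6.0]
```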

Journal ArticleDOI
TL;DR: A generalized mapping strategy that uses a combination of graph theory, mathematical programming, and heuristics to guide the mapping of a parallel algorithm and the architecture is proposed.
Abstract: A generalized mapping strategy that uses a combination of graph theory, mathematical programming, and heuristics is proposed. The authors use the knowledge from the given algorithm and the architecture to guide the mapping. The approach begins with a graphical representation of the parallel algorithm (problem graph) and the parallel computer (host graph). Using these representations, the authors generate a new graphical representation (extended host graph) on which the problem graph is mapped. An accurate characterization of the communication overhead is used in the objective functions to evaluate the optimality of the mapping. An efficient mapping scheme is developed which uses two levels of optimization procedures. The objective functions include minimizing the communication overhead and minimizing the total execution time which includes both computation and communication times. The mapping scheme is tested by simulation and further confirmed by mapping a real world application onto actual distributed environments.

Book
15 Mar 1993
TL;DR: This text provides one of the broadest presentations of parallel processing available, including the structure of parallel processors and parallel algorithms, with extensive coverage of array and multiprocessor architectures.
Abstract: From the Publisher: This text provides one of the broadest presentations of parallel processing available, including the structure of parallel processors and parallel algorithms. The emphasis is on mapping algorithms to highly parallel computers, with extensive coverage of array and multiprocessor architectures. Early chapters provide insightful coverage on the analysis of parallel algorithms and program transformations, effectively integrating a variety of material previously scattered throughout the literature. Theory and practice are well balanced across diverse topics in this concise presentation. For exceptional clarity and comprehension, the author presents complex material in geometric graphs as well as algebraic notation. Each chapter includes well-chosen examples, tables summarizing related key concepts and definitions, and a broad range of worked exercises. Features: an overview of common hardware and theoretical models, including algorithm characteristics and impediments to fast performance; analysis of data dependencies and inherent parallelism through program examples, building from simple to complex; graphic and explanatory coverage of program transformations; an easy-to-follow presentation of parallel processor structures and interconnection networks, including parallelizing and restructuring compilers; parallel synchronization methods and types of parallel operating systems; detailed descriptions of hypercube systems; and specialized chapters on dataflow and on AI architectures.

Journal ArticleDOI
TL;DR: A full configuration interaction (FCI) algorithm is presented and discussed, an integral driven formalism based on the explicit construction of tables which realize the correspondence between the FCI vector x and the vector Hx, H being the Hamiltonian matrix of the system.
Abstract: A full configuration interaction (FCI) algorithm is presented and discussed. It is an integral driven formalism based on the explicit construction of tables which realize the correspondence between the FCI vector x and the vector Hx, H being the Hamiltonian matrix of the system. In this way no decomposition of the identity is needed, and in the simplest implementation only the two vectors x and Hx need to be kept on disk. The main test has been done on the cyclic polyene C18H18 in the Pariser–Parr–Pople approximation, where the size of the FCI vector can be reduced to about 73 million components. Running on a CRAY Y‐MP with 4 CPU and 32 MW of core memory, we obtained an elapsed CPU time per iteration of about 300 s and a total elapsed time of 1000 s, which correspond to about 4 and 14 s per million determinants, respectively. The parallel CPU speed‐up obtained by running with the 4 CPU is greater than 3, without any substantial increasing of the memory or disk requirements.

02 Jan 1993
TL;DR: In this paper, the fairness of parallel performance metrics is studied and the theoretical and experimental results show that the most commonly used performance metric, parallel speedup, is unfair in that it favors slow processors and poorly coded programs.
Abstract: Due to programming difficulty, parallel algorithms are commonly compared with different levels of programming effort. They are also compared on different architectures. In this paper, the "fairness" of parallel performance metrics is studied. Theoretical and experimental results show that the most commonly used performance metric, parallel speedup, is "unfair", in that it favors slow processors and poorly coded programs. Two new performance metrics are introduced. The first one, sizeup, provides a "fair" performance measurement. The second one is a generalization of speedup -- the generalized speedup. The relation between sizeup, speedup, and generalized speedup is studied. A real application has been implemented on an nCUBE 2 multicomputer. The experimental results match the analytical results closely.
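A toy Python illustration of the difference between the two metrics: speedup compares times for a fixed problem, while sizeup compares how much work can be finished in a fixed time. The cost model below (a fixed serial portion plus perfectly parallel work) is invented for illustration and is not the application measured in the report.

```python
def run_time(work, p, serial=100.0):
    """Invented cost model: a fixed serial part plus perfectly parallel work."""
    return serial + work / p

def speedup(work, p):
    return run_time(work, 1) / run_time(work, p)

def sizeup(p, budget):
    """Work solvable on p processors within `budget`, relative to 1 processor."""
    def max_work(procs):
        lo, hi = 1.0, 1e9
        for _ in range(200):                 # bisection on the cost model
            mid = 0.5 * (lo + hi)
            if run_time(mid, procs) <= budget:
                lo = mid
            else:
                hi = mid
        return lo
    return max_work(p) / max_work(1)

budget = run_time(10_000, 1)                 # time the serial run takes
for p in (2, 8, 32):
    # under this model, speedup saturates as the serial part dominates,
    # while sizeup keeps growing with p
    print(p, round(speedup(10_000, p), 2), round(sizeup(p, budget), 2))
```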

Proceedings ArticleDOI
27 Apr 1993
TL;DR: It is shown how the total least squares recursive algorithm for the real data FIR (finite impulse response) adaptive filtering problem can be applied to reconstruct a high-resolution filtered image from undersampled, noisy multiframes, when the interframe displacements are not accurately known.
Abstract: It is shown how the total least squares recursive algorithm for the real data FIR (finite impulse response) adaptive filtering problem can be applied to reconstruct a high-resolution filtered image from undersampled, noisy multiframes, when the interframe displacements are not accurately known. This is done in the wavenumber domain after transforming the complex data problem to an equivalent real data problem, to which the algorithm developed by C.E. Davila (Proc. ICASSP 1991 p.1853-6 of 1991) applies. The procedure developed also applies when the multiframes are degraded by linear shift-invariant blurs. All the advantages of implementation via massively parallel computational architecture apply. The performance of the algorithm is verified by computer simulations.

Journal ArticleDOI
TL;DR: A unified approach to the solution of the partitioning problem is presented based on the following concepts: Algorithms are represented by programs, and the concept of stepwise refinement of programs is used to solve the partitioning problem by applying a sequence of provably correct program transformations.

Journal ArticleDOI
TL;DR: Experimental results for many synthetic and practical problems run on various parallel machines that validate the theoretical analysis are presented, and it is shown that the average speedup obtained is linear when the distribution of solutions is uniform and superlinear when the distribution of solutions is nonuniform.
Abstract: Analytical models and experimental results concerning the average case behavior of parallel backtracking are presented. Two types of backtrack search algorithms are considered: simple backtracking, which does not use heuristics to order and prune search, and heuristic backtracking, which does. Analytical models are used to compare the average number of nodes visited in sequential and parallel search for each case. For simple backtracking, it is shown that the average speedup obtained is linear when the distribution of solutions is uniform and superlinear when the distribution of solutions is nonuniform. For heuristic backtracking, the average speedup obtained is at least linear, and the speedup obtained on a subset of instances is superlinear. Experimental results for many synthetic and practical problems run on various parallel machines that validate the theoretical analysis are presented.
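A toy Python simulation in the spirit of the simple-backtracking analysis, with an invented tree shape and solution placement; it only shows how a nonuniform solution distribution can push the average speedup above the processor count, and does not reproduce the paper's models.

```python
import random

def sequential_steps(leaves):
    """Leaves examined left to right until a solution is found."""
    for i, is_solution in enumerate(leaves, 1):
        if is_solution:
            return i
    return len(leaves)

def parallel_steps(leaves, p):
    """p processors sweep contiguous slices of leaves in lockstep."""
    n = len(leaves)
    chunk = (n + p - 1) // p
    step = 0
    while True:
        for q in range(p):
            idx = q * chunk + step
            if idx < min((q + 1) * chunk, n) and leaves[idx]:
                return step + 1
        step += 1

random.seed(0)
n, p, trials = 10_000, 16, 200
speedups = []
for _ in range(trials):
    leaves = [False] * n
    leaves[random.randint(n // 2, n - 1)] = True   # nonuniform: solution in the right half
    speedups.append(sequential_steps(leaves) / parallel_steps(leaves, p))
print(sum(speedups) / trials)   # average speedup well above p = 16 in this toy setting
```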

Journal ArticleDOI
TL;DR: An efficient and robust image reconstruction algorithm for static impedance imaging using Hachtel's augmented matrix method was developed and it was shown that the parallel computation could reduce the computation time from hours to minutes.
Abstract: An efficient and robust image reconstruction algorithm for static impedance imaging using Hachtel's augmented matrix method was developed. This improved Newton-Raphson method produced more accurate images by reducing the undesirable effects of the ill-conditioned Hessian matrix. It is demonstrated that the electrical impedance tomography (EIT) system could produce two-dimensional static images from a physical phantom with 7% spatial resolution at the center and 5% at the periphery. Static EIT image reconstruction requires a large amount of computation. In order to overcome the limitations on reducing the computation time by algorithmic approaches, the improved Newton-Raphson algorithm was implemented on a parallel computer system. It is shown that the parallel computation could reduce the computation time from hours to minutes.