
Showing papers on "Massively parallel" published in 1993


Proceedings ArticleDOI
01 Jul 1993
TL;DR: A new parallel machine model, called LogP, is offered that reflects the critical technology trends underlying parallel computers and is intended to serve as a basis for developing fast, portable parallel algorithms and to offer guidelines to machine designers.
Abstract: A vast body of theoretical research has focused either on overly simplistic models of parallel computation, notably the PRAM, or overly specific models that have few representatives in the real world. Both kinds of models encourage exploitation of formal loopholes, rather than rewarding development of techniques that yield performance across a range of current and future parallel machines. This paper offers a new parallel machine model, called LogP, that reflects the critical technology trends underlying parallel computers. It is intended to serve as a basis for developing fast, portable parallel algorithms and to offer guidelines to machine designers. Such a model must strike a balance between detail and simplicity in order to reveal important bottlenecks without making analysis of interesting problems intractable. The model is based on four parameters that specify abstractly the computing bandwidth, the communication bandwidth, the communication delay, and the efficiency of coupling communication and computation. Portable parallel algorithms typically adapt to the machine configuration in terms of these parameters. The utility of the model is demonstrated through examples that are implemented on the CM-5.
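The four parameters are conventionally written L (an upper bound on communication delay), o (per-message send/receive overhead), g (the minimum gap between consecutive messages, the reciprocal of per-processor communication bandwidth), and P (the number of processors). A minimal sketch of how an algorithm designer might cost messages under the model, with illustrative parameter values not taken from the paper:

```python
# Minimal LogP costing sketch (illustrative parameter values).
# L: latency, o: per-message send/receive overhead,
# g: gap between consecutive messages, P: processor count.
L, o, g, P = 6.0, 2.0, 4.0, 32

def point_to_point():
    """One message: sender overhead + network latency + receiver overhead."""
    return o + L + o

def pipelined_sends(n):
    """One processor injects n messages back-to-back; successive
    injections are separated by max(g, o), and the last message
    still pays latency plus the receiver's overhead."""
    return o + (n - 1) * max(g, o) + L + o

print(point_to_point())    # 10.0
print(pipelined_sends(8))  # 2 + 7*4 + 6 + 2 = 38.0
```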

1,515 citations


Book
01 Aug 1993
TL;DR: The most comprehensive work of its kind, Evolution and Optimum Seeking offers a state-of-the-art perspective on the field for researchers in computer-aided design, planning, control, systems analysis, computational intelligence, and artificial life.
Abstract: From the Publisher: With the publication of this book, Hans-Paul Schwefel has responded to rapidly growing interest in Evolutionary Computation, a field that originated, in part, with his pioneering work in the early 1970s. Evolution and Optimum Seeking offers a systematic overview of both new and classical approaches to computer-aided optimum system design methods, including the new class of Evolutionary Algorithms and other "Parallel Problem Solving from Nature" (PPSN) methods. It presents numerical optimization methods and algorithms for computer calculation, which will be particularly useful for massively parallel computers. It is the only book in the field that offers in-depth comparisons between classical direct optimization methods and the newer methods. Dr. Schwefel's method consists essentially of the adaptation of simple evolutionary rules to a computer procedure in the search for optimal parameters within a simulation model of a technical device. In addition to its historical and practical value, Evolution and Optimum Seeking will stimulate further research into PPSN and interdisciplinary thinking about multi-agent self-organization in natural and artificial environments. These developments have been accelerated by fortunate changes in the computational environment, especially with respect to new architectures. MIMD (Multiple Instructions Multiple Data) machines with many processors working in parallel on one task seem to lend themselves to inherently parallel problem solving concepts like Evolution Strategies. The most comprehensive work of its kind, Evolution and Optimum Seeking offers a state-of-the-art perspective on the field for researchers in computer-aided design, planning, control, systems analysis, computational intelligence, and artificial life. Its range and depth make it a virtual handbook for practitioners, from an epistemological introduction to the concepts and strategies of optimum seeking to a taxonomy of optimization tasks and solution principles.

704 citations


Proceedings ArticleDOI
R.E. Kessler1, J.L. Schwarzmeier1
01 Jan 1993
TL;DR: Cray Research's massively parallel processing (MPP) philosophy is presented, together with a brief description of the design of the Cray T3D, the first MPP designed by Cray Research, and the 3-D torus interprocessor interconnect is discussed.
Abstract: The authors present Cray Research's massively parallel processing (MPP) philosophy, together with a brief description of the design of the Cray T3D, the first MPP designed by Cray Research. They give a brief overview of the important features of the Cray T3D, including the 3-D torus interprocessor interconnect. They discuss in more detail the motivation for the 3-D torus interconnect. Using a very simple capacity model of network performance, they show how three dimensions provide a solid balance between locality and scalability up to thousands of nodes.
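The dimensionality argument can be sanity-checked with a back-of-the-envelope estimate (a rough sketch, not the authors' exact capacity model): for a fixed node count N arranged as a d-dimensional torus of side k = N^(1/d), the average distance per ring dimension is about k/4, so average route length falls sharply from one to three dimensions but only marginally beyond, while wiring cost keeps growing.

```python
# Rough average-route-length estimate for an N-node d-dimensional torus
# (illustrative back-of-the-envelope model, not the paper's).
def avg_distance(N, d):
    k = N ** (1.0 / d)      # nodes per dimension
    return d * k / 4.0      # ~k/4 average hops per ring dimension

N = 1024
for d in (1, 2, 3, 4):
    print(d, round(avg_distance(N, d), 2))
# 1 256.0 / 2 16.0 / 3 7.56 / 4 5.66 -- diminishing returns past 3-D
```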

327 citations


Proceedings ArticleDOI
06 Oct 1993
TL;DR: Pablo is a performance analysis environment designed to provide unobtrusive performance data capture, analysis, and presentation across a wide variety of scalable parallel systems.
Abstract: Developers of application codes for massively parallel computer systems face daunting performance tuning and optimization problems that must be solved if massively parallel systems are to fulfill their promise. Recording and analyzing the dynamics of application program, system software, and hardware interactions is the key to understanding and the prerequisite to performance tuning, but this instrumentation and analysis must not unduly perturb program execution. Pablo is a performance analysis environment designed to provide unobtrusive performance data capture, analysis, and presentation across a wide variety of scalable parallel systems. Current efforts include dynamic statistical clustering to reduce the volume of data that must be captured and complete performance data immersion via head-mounted displays.

299 citations


Journal ArticleDOI
TL;DR: The authors describe their work on the massively parallel finite-element computation of compressible and incompressible flows with the CM-200 and CM-5 Connection Machines, which provides a capability for solving a large class of practical problems involving free surfaces, two-liquid interfaces, and fluid-structure interactions.
Abstract: The authors describe their work on the massively parallel finite-element computation of compressible and incompressible flows with the CM-200 and CM-5 Connection Machines. Their computations are based on implicit methods, and their parallel implementations are based on the assumption that the mesh is unstructured. Computations for flow problems involving moving boundaries and interfaces are achieved by using the deformable-spatial-domain/stabilized-space-time method. Using special mesh update schemes, the frequency of remeshing is minimized to reduce the projection errors involved and also to make parallelizing the computations easier. This method and its implementation on massively parallel supercomputers provide a capability for solving a large class of practical problems involving free surfaces, two-liquid interfaces, and fluid-structure interactions.

262 citations


Journal ArticleDOI
TL;DR: The High Performance Fortran Forum (HPFF) as discussed by the authors was a coalition of computer vendors, government laboratories, and academic groups founded in 1992 to improve the performance and usability of Fortran-90 for computationally intensive applications on a wide variety of machines, including massively parallel single-instruction multiple-data (SIMD) and MIMD systems and vector processors.
Abstract: The article discusses Fortran-90, its basis in Fortran-77, its implications for parallel machines, and the extensions developed for it by the High Performance Fortran Forum (HPFF), a coalition of computer vendors, government laboratories, and academic groups founded in 1992 to improve the performance and usability of Fortran-90 for computationally intensive applications on a wide variety of machines, including massively parallel single-instruction multiple-data (SIMD) and multiple-instruction multiple-data (MIMD) systems and vector processors. It describes SIMD and MIMD systems, previous attempts to develop languages for them, the genesis of the HPFF, how the group actually worked, and the HPF programming model.

234 citations


Journal ArticleDOI
TL;DR: In this paper, the evolution of nonlinear dynamical systems such as fluids and plasmas is being investigated in three dimensions at increasingly high resolutions, and the authors expect the resolution to increase to 1000³ by the end of the decade.
Abstract: With the advent of massively parallel computers, the evolution of nonlinear dynamical systems such as fluids and plasmas is being investigated in three dimensions at increasingly high resolutions. Today a typical physical volume is represented by 100³ grid points, and we may expect the resolution to increase to 1000³ by the end of the decade.

190 citations


Journal ArticleDOI
TL;DR: The initial implementation of cooperative shared memory uses a simple programming model, called Check-In/Check-Out (CICO), in conjunction with even simpler hardware, called Dir1SW, that adds little complexity to message-passing hardware, but efficiently supports programs written within the CICO model.
Abstract: We believe the paucity of massively parallel, shared-memory machines follows from the lack of a shared-memory programming performance model that can inform programmers of the cost of operations (so they can avoid expensive ones) and can tell hardware designers which cases are common (so they can build simple hardware to optimize them). Cooperative shared memory, our approach to shared-memory design, addresses this problem. Our initial implementation of cooperative shared memory uses a simple programming model, called Check-In/Check-Out (CICO), in conjunction with even simpler hardware, called Dir1SW. In CICO, programs bracket uses of shared data with a check_out directive marking the expected first use and a check_in directive terminating the expected use of the data. A cooperative prefetch directive helps hide communication latency. Dir1SW is a minimal directory protocol that adds little complexity to message-passing hardware, but efficiently supports programs written within the CICO model.
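Schematically, the CICO discipline looks like the sketch below. The directive names follow the abstract, but the Python stand-ins are hypothetical: real CICO annotations are performance hints to the memory system, not data-moving calls, so a correct program behaves the same if they are ignored.

```python
# CICO bracketing sketch (hypothetical stand-ins; real annotations are
# hints, so a correct program behaves the same if they are ignored).
def check_out(block, exclusive=False):  # expected first use of a block
    pass

def check_in(block):                    # expected last use of a block
    pass

def prefetch(block):                    # hide latency of a later check_out
    pass

def relax_point(grid, i):
    prefetch(grid[(i + 1) % len(grid)])     # overlap communication
    check_out(grid[i], exclusive=True)      # bracket: first expected use
    grid[i] = 0.5 * (grid[i - 1] + grid[(i + 1) % len(grid)])
    check_in(grid[i])                       # bracket: last expected use

grid = [1.0, 2.0, 3.0, 4.0]
relax_point(grid, 1)
print(grid)  # [1.0, 2.0, 3.0, 4.0]
```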

172 citations


Proceedings ArticleDOI
01 Nov 1993
TL;DR: This paper presents a divide-and-conquer ray-traced volume rendering algorithm and a parallel image compositing method, along with their implementation and performance on the Connection Machine CM-5, and networked workstations.
Abstract: This paper presents a divide-and-conquer ray-traced volume rendering algorithm and a parallel image compositing method, along with their implementation and performance on the Connection Machine CM-5 and networked workstations. This algorithm distributes both the data and the computations to individual processing units to achieve fast, high-quality rendering of high-resolution data. The volume data, once distributed, is left intact. The processing nodes perform local ray tracing of their subvolume concurrently. No communication between processing units is needed during this local ray-tracing process. A subimage is generated by each processing unit and the final image is obtained by compositing subimages in the proper order, which can be determined a priori. Test results on the CM-5 and a group of networked workstations demonstrate the practicality of our rendering algorithm and compositing method.
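The compositing step can be sketched with the standard "over" operator on premultiplied-alpha pixels (an illustrative reduction, not the paper's exact implementation). Because "over" is associative, subimages can also be combined pairwise in parallel as long as the front-to-back order is respected.

```python
# Ordered subimage compositing sketch with the "over" operator
# (premultiplied alpha; illustrative, one pixel per subimage).
def over(front, back):
    fc, fa = front
    bc, ba = back
    return (fc + (1.0 - fa) * bc, fa + (1.0 - fa) * ba)

def composite(subimages_front_to_back):
    result = (0.0, 0.0)               # fully transparent
    for pixel in subimages_front_to_back:
        result = over(result, pixel)
    return result

print(composite([(0.2, 0.4), (0.5, 0.6)]))  # (0.5, 0.76)
```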

128 citations


Book
01 Jul 1993
TL;DR: Current issues involved in the development of systems which support fine-grain concurrency in a single shared address space are discussed, including algorithmic, architectural, technological, and programming issues.
Abstract: A major challenge for computer science in the 1990s is to determine the extent to which general purpose parallel computing can be achieved. The goal is to deliver both scalable parallel performance and architecture independent parallel software. (Work in the 1980s showed that either of these alone can be achieved.) Success in this endeavour would permit the long overdue separation of software considerations in parallel computing from those of hardware. This separation would, in turn, encourage the growth of a large and diverse parallel software industry, and provide a focus for future hardware developments. In recent years a number of new routing and memory management techniques have been developed which permit the efficient implementation of a single shared address space on distributed memory architectures. We also now have a large set of efficient, practical shared memory parallel algorithms for important problems. In this paper we discuss some of the current issues involved in the development of systems which support fine-grain concurrency in a single shared address space. The paper covers algorithmic, architectural, technological, and programming issues.

119 citations


Patent
13 Dec 1993
TL;DR: In this article, a messaging facility is described that enables the passing of data from one processing element to another in a globally addressable, distributed memory multiprocessor without having an explicit destination address in the target processing element's memory.
Abstract: A messaging facility is described that enables the passing of packets of data from one processing element to another in a globally addressable, distributed memory multiprocessor without having an explicit destination address in the target processing element's memory. The messaging facility can be used to accomplish a remote action by defining an opcode convention that permits one processor to send a message containing opcode, address and arguments to another. The destination processor, upon receiving the message after the arrival interrupt, can decode the opcode and perform the indicated action using the argument address and data. The messaging facility provides the primitives for the construction of an interprocessor communication protocol. Operating system communication and message-passing programming models can be accomplished using the messaging facility.
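A schematic of the opcode convention (hypothetical opcode names and handlers invented for illustration; the patent defines the convention abstractly, not this API): a message carries an opcode, a target address, and arguments, and the receiving processor decodes and acts on arrival.

```python
# Opcode-convention sketch (hypothetical opcodes and handler names).
memory = {}

def op_write(addr, value):
    memory[addr] = value

def op_add(addr, value):                  # a remote atomic update
    memory[addr] = memory.get(addr, 0) + value

HANDLERS = {"WRITE": op_write, "ADD": op_add}

def on_arrival(message):                  # run from the arrival interrupt
    opcode, addr, args = message
    HANDLERS[opcode](addr, *args)

on_arrival(("WRITE", 0x100, (7,)))
on_arrival(("ADD", 0x100, (5,)))
print(memory[0x100])                      # 12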

Proceedings ArticleDOI
01 Jul 1993
TL;DR: A new programming paradigm called ActorSpace is presented, which provides powerful support for component-based construction of massively parallel and distributed applications and open interfaces to servers and pattern-directed access to software repositories.
Abstract: We present a new programming paradigm called ActorSpace. ActorSpace provides a new communication model based on destination patterns. An actorSpace is a computationally passive container of actors which acts as a context for matching patterns. Patterns are matched against listed attributes of actors and actorSpaces that are visible in the actorSpace. Both visibility and attributes are dynamic. Messages may be sent to one or all members of a group defined by a pattern. The paradigm provides powerful support for component-based construction of massively parallel and distributed applications. In particular, it supports open interfaces to servers and pattern-directed access to software repositories.
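Pattern-directed delivery can be sketched as follows (a hypothetical API invented for illustration; the paper defines the model, not this code): actors list attributes, and a message goes to one or all actors whose attributes match the destination pattern.

```python
import random

# ActorSpace-style pattern-directed delivery sketch (hypothetical API).
class ActorSpace:
    def __init__(self):
        self.actors = []                      # (attributes, mailbox) pairs

    def register(self, attributes):
        mailbox = []
        self.actors.append((attributes, mailbox))
        return mailbox

    def matching(self, pattern):              # pattern is a subset match
        return [mbox for attrs, mbox in self.actors
                if pattern.items() <= attrs.items()]

    def send_one(self, pattern, msg):         # one member of the group
        random.choice(self.matching(pattern)).append(msg)

    def broadcast(self, pattern, msg):        # all members of the group
        for mbox in self.matching(pattern):
            mbox.append(msg)

space = ActorSpace()
a = space.register({"service": "print", "color": True})
b = space.register({"service": "print", "color": False})
space.broadcast({"service": "print"}, "job-1")
print(len(a), len(b))                         # 1 1
```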

Journal ArticleDOI
TL;DR: A new heuristic algorithm to perform tabu search on the Quadratic Assignment Problem (QAP) is developed and a new intensification strategy based on intermediate term memory is proposed and shown to be promising especially while solving large QAPs.
Abstract: A new heuristic algorithm to perform tabu search on the Quadratic Assignment Problem (QAP) is developed. A massively parallel implementation of the algorithm on the Connection Machine CM-2 is provided. The implementation uses n² processors, where n is the size of the problem. The elements of the algorithm, called Par_tabu, include dynamically changing tabu list sizes, aspiration criterion and long term memory. A new intensification strategy based on intermediate term memory is proposed and shown to be promising especially while solving large QAPs. The combination of all these elements gives a very efficient heuristic for the QAP: the best known or improved solutions are obtained in a significantly smaller number of iterations than in other comparative studies. Combined with the implementation on CM-2, this approach provides suboptimal solutions to QAPs of bigger dimensions in reasonable time.
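For orientation, the skeleton of tabu search on the QAP with pairwise exchanges looks roughly like the sequential sketch below. It is illustrative only: it omits Par_tabu's dynamic tabu list sizes, intermediate and long term memory, and the n²-processor parallelization.

```python
import itertools, random

# Sequential tabu-search skeleton for the QAP (illustrative sketch).
def qap_cost(p, flow, dist):
    n = len(p)
    return sum(flow[i][j] * dist[p[i]][p[j]]
               for i in range(n) for j in range(n))

def tabu_search(flow, dist, iters=100, tenure=7):
    n = len(flow)
    p = list(range(n)); random.shuffle(p)
    best, best_cost = p[:], qap_cost(p, flow, dist)
    tabu = {}                       # swap (i, j) -> iteration it frees up
    for t in range(iters):
        candidates = []
        for i, j in itertools.combinations(range(n), 2):
            q = p[:]; q[i], q[j] = q[j], q[i]
            c = qap_cost(q, flow, dist)
            # aspiration: allow a tabu swap if it beats the best so far
            if tabu.get((i, j), -1) < t or c < best_cost:
                candidates.append((c, i, j, q))
        c, i, j, p = min(candidates)            # best admissible neighbor
        tabu[(i, j)] = t + tenure
        if c < best_cost:
            best, best_cost = p[:], c
    return best, best_cost

n = 6
flow = [[abs(i - j) for j in range(n)] for i in range(n)]
dist = [[(i + j) % n for j in range(n)] for i in range(n)]
print(tabu_search(flow, dist)[1])
```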


Journal ArticleDOI
TL;DR: The present DDM-based parallel finite element algorithm is combined with a hierarchical model for data and processor management so that the workload is balanced dynamically among the processors.

Book
26 Mar 1993
TL;DR: Low Level Parallel Image Processing; Parallel FFT-like Transform Algorithms on Transputers; Parallel Edge Detection and Related Algorithms; Parallel Segmentation Algorithms; MIMD and SIMD Parallel Range Data Segmentation.
Abstract: Low Level Parallel Image Processing; Parallel FFT-like Transform Algorithms on Transputers; Parallel Edge Detection and Related Algorithms; Parallel Segmentation Algorithms; MIMD and SIMD Parallel Range Data Segmentation; Parallel Stereo and Motion Estimation; Parallel Implementations of the Backpropagation Learning Algorithm Based on Network Topology; Parallel Neural Computation Based on Algebraic Partitioning; Parallel Neural Computing Based on Network Duplicating; PARALLEL EIKONA: A Parallel Digital Image Processing Package; Parallel Architectures and Algorithms for Real Time Computer Vision; Index.

Journal ArticleDOI
TL;DR: The Vesta parallel file system provides user-directed checkpointing of files during continuing program execution with very little processing overhead and is scalable to a very large number of I/O and compute nodes.
Abstract: The Vesta parallel file system provides parallel access from compute nodes to files distributed across I/O nodes in a massively parallel computer. Vesta is intended to solve the I/O problems of massively parallel computers executing numerically intensive scientific applications. Vesta has three interesting characteristics: First, it provides a user defined parallel view of file data, and allows user defined partitioning and repartitioning of files without moving data among I/O nodes. The parallel file access semantics of Vesta directly support the operations required by parallel language I/O libraries. Second, Vesta is scalable to a very large number (many hundreds) of I/O and compute nodes and does not contain any sequential bottlenecks in the data-access path. Third, it provides user-directed checkpointing of files during continuing program execution with very little processing overhead.

Proceedings ArticleDOI
01 Aug 1993
TL;DR: The predicted parameter values allow a realistic ranking of different program versions with respect to actual runtime, and experiments show a strong correlation between the statically computed parameters and actual measurements.
Abstract: This paper presents a Parameter based Performance Prediction Tool (PPPT) which is part of the Vienna Fortran Compilation System (VFCS), a compiler that automatically translates Fortran programs into message passing programs for massively parallel architectures. The PPPT is applied to an explicitly parallel program generated by the VFCS, which may contain synchronous as well as asynchronous communication and is attributed with parameters computed in a previous profiling run. It statically computes a set of optional parameters that characterize the behavior of the parallel program. This includes work distribution, the number of data transfers, the amount of data transferred, transfer times, network contention, and the number of cache misses. These parameters can be selectively determined for statements, loops, procedures, and the entire program; furthermore, their effect with respect to individual processors can be examined. The tool plays an important role in the VFCS by providing the system as well as the user with vital performance information about the program. In particular, it supports automatic data distribution generation and the intelligent selection of transformation strategies, based on properties of the algorithm and characteristics of the target architecture. The tool has been implemented. Experiments show a strong correlation between the statically computed parameters and actual measurements; furthermore, it turns out that the predicted parameter values allow a realistic ranking of different program versions with respect to the actual runtime.
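In spirit, a parameter-based prediction combines a work-distribution term with communication terms; the formula, names, and numbers below are illustrative, not the VFCS implementation.

```python
# Parameter-based runtime estimate sketch (illustrative formula/values).
def estimate_time(work_per_proc, n_transfers, bytes_per_transfer,
                  flop_time=1e-8, startup=1e-4, per_byte=1e-7):
    compute = max(work_per_proc) * flop_time   # slowest processor dominates
    comm = n_transfers * (startup + bytes_per_transfer * per_byte)
    return compute + comm

# Rank two hypothetical versions of a program by predicted runtime.
balanced = estimate_time([1e6] * 4, n_transfers=10, bytes_per_transfer=8192)
skewed   = estimate_time([4e6, 0, 0, 0], n_transfers=10, bytes_per_transfer=8192)
print(balanced < skewed)  # True: the better work distribution wins
```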

Patent
12 Jul 1993
TL;DR: A massively parallel electron beam array for controllably imaging a target includes a multiplicity of emitter cathodes, each incorporating one or more micron-sized emitter tips.
Abstract: A massively parallel electron beam array for controllably imaging a target includes a multiplicity of emitter cathodes, each incorporating one or more micron-sized emitter tips. Each tip is controlled by a control electrode to produce an electron stream, and its deflection is controlled by a multielement deflection electrode to permit scanning of a corresponding target region.

Proceedings ArticleDOI
01 Dec 1993
TL;DR: The Vesta interface provides a user-defined parallel view of file data, which gives users some control over the layout of data, and six parallel access modes to Vesta files are defined, which give users very versatile parallel file access.
Abstract: The Vesta parallel file system is intended to solve the I/O problems of massively parallel multicomputers executing numerically intensive scientific applications. It provides parallel access from the applications to files distributed across multiple storage nodes in the multicomputer, thereby exposing an opportunity for high-bandwidth data transfer across the multicomputer's low-latency network. The Vesta interface provides a user-defined parallel view of file data, which gives users some control over the layout of data. This is useful for tailoring data layout to match common access patterns. The interface also allows user-defined partitioning and repartitioning of files without moving data among storage nodes. Libraries with higher-level interfaces that hide the layout details, while exploiting the power of parallel access, may be implemented above the basic interface. It is shown how collective I/O operations can be implemented, and six parallel access modes to Vesta files are defined. Each mode has unique characteristics in terms of how the processes share the file and how their accesses are interleaved. The combination of user-defined file partitioning and the six access modes gives users very versatile parallel file access.
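The key idea, that a view changes which logical records a process sees without moving any data, can be sketched as follows (hypothetical round-robin layout and API, invented for illustration):

```python
# Vesta-style partitioned view sketch (hypothetical layout and API).
RECORDS = 16            # records in the file
NODES = 4               # record r lives on storage node r % NODES

def view(n_partitions, partition):
    """The records one process sees under a round-robin partitioning."""
    return [r for r in range(RECORDS) if r % n_partitions == partition]

# Repartitioning from 4 ways to 2 ways changes the views only;
# each record stays on the storage node where it was written.
print(view(4, 0))                        # [0, 4, 8, 12]
print(view(2, 0))                        # [0, 2, 4, 6, 8, 10, 12, 14]
print([r % NODES for r in view(2, 0)])   # placement is unchanged
```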

Patent
01 Dec 1993
TL;DR: In this paper, an application-level method for dynamically maintaining global load balance on a parallel computer, particularly on massively parallel MIMD computers, is proposed, where global load balancing is achieved by overlapping neighborhoods of processors, where each neighborhood performs local load balancing.
Abstract: An application-level method for dynamically maintaining global load balance on a parallel computer, particularly on massively parallel MIMD computers. Global load balancing is achieved by overlapping neighborhoods of processors, where each neighborhood performs local load balancing. The method supports a large class of finite element and finite difference based applications and provides an automatic element management system to which applications are easily integrated.
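The mechanism can be illustrated with a one-dimensional diffusion sketch (illustrative only, not the patented method): each processor balances only within its local neighborhood, and because neighborhoods overlap, imbalance spreads until the load is globally even.

```python
# Overlapping-neighborhood balancing sketch (illustrative diffusion).
def balance_step(load):
    n = len(load)
    return [(load[(i - 1) % n] + load[i] + load[(i + 1) % n]) / 3.0
            for i in range(n)]           # move toward the local average

load = [90.0, 0.0, 0.0, 0.0, 0.0, 0.0]   # all work starts on one node
for _ in range(50):
    load = balance_step(load)
print([round(x, 1) for x in load])        # -> 15.0 everywhere (the mean)
```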

Journal ArticleDOI
TL;DR: The stabilized space-time formulation for moving boundaries and interfaces, and a new stabilized velocity-pressure-stress formulation are both described, and significant aspects of the implementation of these methods on massively parallel architectures are discussed.

Patent
13 Dec 1993
TL;DR: In this article, address translation means for distributed memory massively parallel processing (MPP) systems include means for defining virtual addresses for processing elements (PE's) and memory relative to a partition of PE's under program control, and physical addresses for PE's and memory corresponding to identities and locations of PE modules within computer cabinetry.
Abstract: Address translation means for distributed memory massively parallel processing (MPP) systems include means for defining virtual addresses for processing elements (PE's) and memory relative to a partition of PE's under program control, means for defining logical addresses for PE's and memory within a three-dimensional interconnected network of PE's in the MPP, and physical addresses for PE's and memory corresponding to identities and locations of PE modules within computer cabinetry. As physical PE's are mapped into or out of the logical MPP, as spares are needed, logical addresses are updated. Address references generated by a PE within a partition in virtual address mode are converted to logical addresses and physical addresses for routing on the network.
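Conceptually the translation is two table lookups; the sketch below uses made-up table contents to show how mapping in a spare touches only the logical-to-physical table, leaving program-visible virtual PE numbers unchanged.

```python
# Virtual -> logical -> physical PE translation sketch (illustrative).
partition_base = 8                  # this partition starts at logical PE 8
logical_to_physical = {8: 40, 9: 41, 10: 42, 11: 43}   # module slots

def translate(virtual_pe):
    logical = partition_base + virtual_pe
    return logical_to_physical[logical]

print(translate(2))                 # 42
logical_to_physical[10] = 99        # swap in a spare for a failed module
print(translate(2))                 # 99; the program still uses virtual PE 2
```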

Patent
22 Nov 1993
TL;DR: In this article, a two-dimensional input/output system for a massively parallel SIMD computer system providing an interface for the two-way transfer of data between a host computer and the SIMD computers is presented.
Abstract: A two-dimensional input/output system for a massively parallel SIMD computer system providing an interface for the two-way transfer of data between a host computer and the SIMD computer. A plurality of buffers, equal in number to and distributed with the individual processing elements of the SIMD computer, are used to provide a temporary storage area which allows data in different formats to be mapped in a format suitable for transfer to the host computer or for transfer to the SIMD processing elements. The temporary storage is controlled in such a way as to transfer entire blocks of data in a single SIMD system clock cycle, thereby achieving an input/output data rate of N bits/cycle for a SIMD computer consisting of N processors. The system is capable of handling irregular as well as regular data structures. The system also emphasizes a distributed approach in having the input/output system divided into N pieces and distributed to each processor to reduce the wiring complexity while maintaining the I/O rate.

Proceedings ArticleDOI
26 Apr 1993
TL;DR: The authors describe a file system design for massively parallel computers which makes very efficient use of a few disks per processor, overcoming the traditional input/output (I/O) bottleneck of massively parallel machines by storing the data on disks within the high-speed interconnection network.
Abstract: The authors describe a file system design for massively parallel computers which makes very efficient use of a few disks per processor. This overcomes the traditional input/output (I/O) bottleneck of massively parallel machines by storing the data on disks within the high-speed interconnection network. In addition, the file system, called RAMA (Rapid Access to Massive Archive), requires little internode synchronization, removing another common bottleneck in parallel processor file systems. Support for a large tertiary storage system can easily be integrated into the file system; in fact, RAMA runs most efficiently when tertiary storage is used.
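A defining feature of this design direction is that block placement is computable rather than looked up, so any node can find a block without consulting a central directory. The hash and layout below are illustrative, not RAMA's actual on-disk scheme.

```python
import hashlib

# Hashed block placement sketch (illustrative hash and layout).
N_NODES, DISKS_PER_NODE = 64, 2

def place(file_id, block_no):
    h = hashlib.md5(f"{file_id}:{block_no}".encode()).digest()
    key = int.from_bytes(h[:8], "big")
    return key % N_NODES, (key // N_NODES) % DISKS_PER_NODE

# Any node computes the same (node, disk) pair with no synchronization.
print(place("simulation.dat", 0))
```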

Journal ArticleDOI
TL;DR: It is proved that with high probability the algorithms produce well-balanced storage for sufficiently large matrices with a bounded number of nonzeros in each row and column, but no other restrictions on structure.
Abstract: This paper investigates the balancing of distributed compressed storage of large sparse matrices on a massively parallel computer. For fast computation of matrix–vector and matrix–matrix products on a rectangular processor array with efficient communications along its rows and columns, it is required that the nonzero elements of each matrix row or column be distributed among the processors located within the same array row or column, respectively. Randomized packing algorithms are constructed with such properties, and it is proved that with high probability the algorithms produce well-balanced storage for sufficiently large matrices with a bounded number of nonzeros in each row and column, but no other restrictions on structure. Then basic matrix–vector multiplication routines are described with fully parallel interprocessor communications and intraprocessor gather and scatter operations. Their efficiency is demonstrated on the 16,384-processor MasPar computer.
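The row/column constraint forces nonzero (i, j) onto processor (row_map(i), col_map(j)); choosing the maps as random permutations is the randomized-packing idea in miniature (an illustrative sketch, not the paper's exact algorithms):

```python
import random

# Randomized packing sketch: matrix row i maps to processor-array row
# pi[i] % R and column j to array column pj[j] % C, so row and column
# communication stays within one array row/column while random
# permutations balance the per-processor nonzero counts.
N, R, C = 1000, 4, 4
pi = list(range(N)); random.shuffle(pi)
pj = list(range(N)); random.shuffle(pj)

def place(i, j):
    return pi[i] % R, pj[j] % C

counts = [[0] * C for _ in range(R)]
for _ in range(20000):                    # a random sparsity pattern
    r, c = place(random.randrange(N), random.randrange(N))
    counts[r][c] += 1
print(min(map(min, counts)), max(map(max, counts)))  # nearly equal
```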

Journal ArticleDOI
TL;DR: Different modifications of a class of parallel algorithms, initially designed by A. Bellen and M. Zennaro for difference equations and called “across the steps” methods, are studied for the purpose of solving initial value problems in ordinary differential equations on a massively parallel computer.
Abstract: In this paper, we study different modifications of a class of parallel algorithms, initially designed by A. Bellen and M. Zennaro for difference equations and called "across the steps" methods by their authors, for the purpose of solving initial value problems in ordinary differential equations (ODEs) on a massively parallel computer. Restriction to dissipative problems is discussed, which allows these problems to be solved efficiently, as shown by the simulations.
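The flavor of an "across the steps" method can be shown in a few lines (an illustrative sketch, not the authors' schemes): keep the whole trajectory, and let every step update simultaneously from the previous iterate; for a dissipative problem the sweep contracts toward the sequential solution.

```python
# "Across the steps" iteration sketch for y' = -y, y(0) = 1 (explicit
# Euler; every step within a sweep is independent, hence parallel in k).
f = lambda y: -y
h, n_steps = 0.1, 50
y = [1.0] * (n_steps + 1)            # initial guess: constant trajectory

for sweep in range(100):
    y = [1.0] + [y[k - 1] + h * f(y[k - 1]) for k in range(1, n_steps + 1)]

print(round(y[-1], 6))               # ~(1 - h)**50 = 0.005154
```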

Book ChapterDOI
01 Jan 1993
TL;DR: The Bird-Meertens formalism is an approach to developing and executing data-parallel programs; it encourages software development by equational transformation; it can be implemented efficiently across a wide range of architecture families; and it is equipped with a realistic cost calculus, so that trade-offs in software design can be explored before implementation.
Abstract: The expense of developing and maintaining software is the major obstacle to the routine use of parallel computation. Architecture independent programming offers a way of avoiding the problem, but the requirements for a model of parallel computation that will permit it are demanding. The Bird-Meertens formalism is an approach to developing and executing data-parallel programs; it encourages software development by equational transformation; it can be implemented efficiently across a wide range of architecture families; and it can be equipped with a realistic cost calculus, so that trade-offs in software design can be explored before implementation. It makes an ideal model of parallel computation.
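A small taste of the equational style (illustrative; the formalism itself is usually presented in functional notation): the law map f . map g = map (f . g) fuses two data-parallel passes into one, and reduction by an associative operator can be grouped across processors in any way, which is exactly the kind of trade-off a cost calculus scores before implementation.

```python
from functools import reduce

# Two Bird-Meertens-style laws, checked concretely (illustrative).
xs = list(range(10))
f = lambda x: x + 1
g = lambda x: 2 * x

two_passes = list(map(f, map(g, xs)))            # map f . map g
one_pass = list(map(lambda x: f(g(x)), xs))      # map (f . g), fused
assert two_passes == one_pass

# Associativity of + lets the reduction tree be chosen by the machine.
left = reduce(lambda a, b: a + b, xs)
split = sum(xs[:5]) + sum(xs[5:])                # any grouping is valid
assert left == split
print(left)                                      # 45
```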

Journal ArticleDOI
TL;DR: An algorithm for solving nonlinear, two-stage stochastic problems with network recourse based on the framework of row-action methods that permits the massively parallel solution of all the scenario subproblems concurrently and achieves computing rates of 276 MFLOPS.
Abstract: We develop an algorithm for solving nonlinear, two-stage stochastic problems with network recourse. The algorithm is based on the framework of row-action methods. The problem is formulated by replicating the first-stage variables and then adding nonanticipativity side constraints. A series of independent deterministic network problems are solved at each step of the algorithm, followed by an iterative step over the nonanticipativity constraints. The solution point of the iterates over the nonanticipativity constraints is obtained analytically. The row-action nature of the algorithm makes it suitable for parallel implementations. A data representation of the problem is developed that permits the massively parallel solution of all the scenario subproblems concurrently. The algorithm is implemented on a Connection Machine CM-2 with up to 32K processing elements and achieves computing rates of 276 MFLOPS. Very large problems (8,192 scenarios, with a deterministic equivalent nonlinear program of 868,367 constraints and 2,474,017 variables) are solved within a few minutes. We report extensive numerical results regarding the effects of stochasticity on the efficiency of the algorithm.
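The step the abstract calls analytic is the projection onto the nonanticipativity constraints: with the first-stage variables replicated per scenario, projecting onto "all copies equal" replaces each copy by the probability-weighted average. A minimal sketch of that step only, with invented numbers:

```python
# Nonanticipativity projection sketch: replace each scenario's copy of
# the first-stage variable by the probability-weighted average.
def project_nonanticipative(copies, probs):
    avg = sum(p * x for p, x in zip(probs, copies))
    return [avg] * len(copies)

copies = [3.0, 5.0, 4.0]        # per-scenario first-stage proposals
probs = [0.5, 0.25, 0.25]       # scenario probabilities
print(project_nonanticipative(copies, probs))   # [3.75, 3.75, 3.75]
```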

Proceedings ArticleDOI
05 Jan 1993
TL;DR: Hill-climbing, simulated annealing and genetic algorithms are search techniques that can be applied to most combinatorial optimization problems and are used to solve the mapping problem, which is the optimal static allocation of communication processes on distributed memory architectures.
Abstract: Hill-climbing, simulated annealing and genetic algorithms are search techniques that can be applied to most combinatorial optimization problems. The three algorithms are used to solve the mapping problem, which is the optimal static allocation of communication processes on distributed memory architectures. Each algorithm is independently evaluated and optimized according to its parameters. The parallelization of the algorithms is also considered. As an example, a massively parallel genetic algorithm is proposed for the problem, and results of its implementation on a 128-processor Supernode are given. A comparative study of the algorithms is then carried out. The criteria of performance considered are the quality of the solutions obtained and the amount of search time used for several benchmarks. A hybrid approach consisting of a combination of genetic algorithms and hill-climbing is also proposed and evaluated.
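As a point of reference, the hill-climbing baseline for the mapping problem fits in a few lines (an illustrative cost model and sketch, not the paper's implementation): place communicating processes on processors so as to minimize traffic-weighted distance, accepting a random swap only when it improves the cost.

```python
import random

# Hill-climbing sketch for the mapping problem (illustrative cost model).
def cost(place, traffic, dist):
    n = len(place)
    return sum(traffic[i][j] * dist[place[i]][place[j]]
               for i in range(n) for j in range(n))

def hill_climb(traffic, dist, iters=2000):
    n = len(traffic)
    place = list(range(n)); random.shuffle(place)
    best = cost(place, traffic, dist)
    for _ in range(iters):
        i, j = random.sample(range(n), 2)
        place[i], place[j] = place[j], place[i]        # try a swap
        c = cost(place, traffic, dist)
        if c < best:
            best = c                                   # keep improvement
        else:
            place[i], place[j] = place[j], place[i]    # undo
    return place, best

traffic = [[0, 5, 1], [5, 0, 2], [1, 2, 0]]   # process communication
dist = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]      # processor distances
print(hill_climb(traffic, dist)[1])
```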