Showing papers on "Massively parallel" published in 1995


Journal ArticleDOI
TL;DR: The Myrinet local area network employs the same technology used for packet communication and switching within massively parallel processors, and demonstrates the highest performance per unit cost of any current LAN.

Abstract: The Myrinet local area network employs the same technology used for packet communication and switching within massively parallel processors. In realizing this distributed MPP network, we developed specialized communication channels, cut-through switches, host interfaces, and software. To our knowledge, Myrinet demonstrates the highest performance per unit cost of any current LAN.

1,857 citations


Journal ArticleDOI
TL;DR: A three-field arbitrary Lagrangian-Eulerian (ALE) finite element/volume formulation for coupled transient aeroelastic problems is presented and a rigorous derivation of a geometric conservation law for flow problems with moving boundaries and unstructured deformable meshes is included.
Abstract: A three-field arbitrary Lagrangian-Eulerian (ALE) finite element/volume formulation for coupled transient aeroelastic problems is presented. The description includes a rigorous derivation of a geometric conservation law for flow problems with moving boundaries and unstructured deformable meshes. The solution of the coupled governing equations with a mixed explicit (fluid)/implicit (structure) staggered procedure is discussed with particular reference to accuracy, stability, distributed computing, I/O transfers, subcycling and parallel processing. A general and flexible framework for implementing partitioned solution procedures for coupled aeroelastic problems on heterogeneous and/or parallel computational platforms is described. This framework and the explicit/implicit partitioned procedures are demonstrated with the numerical investigation on an iPSC-860 massively parallel processor of the instability of flat panels with infinite aspect ratio in supersonic airstreams.

368 citations


Journal ArticleDOI
TL;DR: The Terasys project as discussed by the authors pairs processor-in-memory (PIM) chips, in which a single-bit ALU controls each column of memory, with the high-level data parallel bit C (dbC) language.

Abstract: SRC researchers have designed and fabricated a processor-in-memory (PIM) chip, a standard 4-bit memory augmented with a single-bit ALU controlling each column of memory. In principle, PIM chips can replace the memory of any processor, including a supercomputer. To validate the notion of integrating SIMD computing into conventional processors on a more modest scale, we have built a half dozen Terasys workstations, which are Sun Microsystems Sparcstation-2 workstations in which 8 megabytes of address space consist of PIM memory holding 32K single-bit ALUs. We have designed and implemented a high-level parallel language, called data parallel bit C (dbC), for Terasys and demonstrated that dbC applications using the PIM memory as a SIMD array run at the speed of multiple Cray-YMP processors. Thus, we can deliver supercomputer performance for a small fraction of supercomputer cost. Since the successful creation of the Terasys research prototype, we have begun work on processing in memory in a supercomputer setting. In a collaborative research project, we are working with Cray Computer to incorporate a new Cray-designed implementation of the PIM chips into two octants of Cray-3 memory.
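
To make the column-parallel model concrete, here is a sketch of ours (not dbC, whose syntax differs): a numpy array stands in for the 32K single-bit ALUs, each executing the same instruction on its own memory column.

```python
# Illustrative sketch only -- not dbC. A numpy uint8 array stands in for
# Terasys' 32K single-bit ALUs; the "instruction" below is executed by
# every ALU on its own memory column simultaneously.
import numpy as np

N_ALUS = 32 * 1024
rng = np.random.default_rng(0)
a = rng.integers(0, 2, N_ALUS, dtype=np.uint8)  # one bit per column
b = rng.integers(0, 2, N_ALUS, dtype=np.uint8)
carry = np.zeros(N_ALUS, dtype=np.uint8)

# A single SIMD step: a full adder evaluated in every column at once.
# Multi-bit arithmetic is built by iterating such steps bit-serially.
s = a ^ b ^ carry
carry = (a & b) | (carry & (a ^ b))
print(int(s.sum()), int(carry.sum()))
```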

367 citations


Book
31 Jan 1995
TL;DR: This book presents mathematical models for studying the long-range transport of air pollutants, tests the reliability of the numerical algorithms, and reports numerical experiments with the Danish Eulerian Model.
Abstract: 1. The Air Pollution Problem. 2. Mathematical Models for Studying the Long-Range Transport of Air Pollutants. 3. Numerical Treatment of Large Air Pollution Models. 4. Testing the Reliability of the Numerical Algorithms. 5. Need for Efficient Algorithms. 6. Computations on High-Speed Computers. 7. Running Air Pollution Models on Vector Machines. 8. Running Models on Parallel Machines with Shared Memory. 9. Running Models on Massively Parallel Computers. 10. Numerical Experiments with the Danish Eulerian Model. References. Author Index. Subject Index.

315 citations


Journal ArticleDOI
TL;DR: This work examines software and hardware approaches to implementing collective communication operations and describes the major classes of algorithms proposed to solve problems arising in this research area.
Abstract: Most MPC networks use wormhole routing to reduce the effect of path length on communication time. Researchers have exploited this by designing ingenious algorithms to speed collective communication. Many projects have addressed the design of efficient collective communication algorithms for wormhole-routed systems. By exploiting the relative distance-insensitivity of wormhole routing, these new algorithms often differ fundamentally from their store-and-forward counterparts. We examine software and hardware approaches to implementing collective communication operations. Although we emphasize methods in which the underlying architecture is a direct network, such as a hypercube or mesh, as opposed to an indirect switch-based network, several approaches apply to systems of either type. We illustrate several issues arising in this research area and describe the major classes of algorithms proposed to solve these problems.
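
To illustrate how distance-insensitivity changes algorithm design, here is a minimal simulation of ours (not from the survey) of recursive-doubling broadcast: in each step, every node holding the message forwards it to the node whose ID differs in one bit, sends that wormhole routing makes roughly as cheap as nearest-neighbor ones.

```python
# Simulation sketch: recursive-doubling broadcast on P nodes. Non-neighbor
# sends are assumed cheap, as on a wormhole-routed network, so the message
# reaches all P nodes in log2(P) steps.
P = 16  # number of nodes, assumed a power of two
has_msg = [False] * P
has_msg[0] = True  # the root starts with the message

steps = 0
while (1 << steps) < P:
    for node in range(P):
        partner = node ^ (1 << steps)  # ID differing in bit `steps`
        if has_msg[node] and not has_msg[partner]:
            has_msg[partner] = True    # one message "sent" this step
    steps += 1

assert all(has_msg)
print(f"{P} nodes reached in {steps} steps")
```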

203 citations


Patent
05 Jun 1995
TL;DR: In this article, an improved parallel processing apparatus and method execute an iterative sequence of instructions by arranging the sequence into subtasks and allocating those subtasks to processors in such a manner as to minimize data contention among the processors and to maximize the locality of data to them.
Abstract: An improved parallel processing apparatus and method executes an iterative sequence of instructions by arranging the sequence into subtasks and allocating those subtasks to processors. This division and allocation is conducted in such a manner as to minimize data contention among the processors and to maximize the locality of data to them. The improved apparatus and method have application to a variety of multiprocessor systems, including those which are massively parallel.

197 citations


Journal ArticleDOI
TL;DR: The Paradigm (PARAllelizing compiler for DIstributed-memory, General-purpose Multicomputers) project at the University of Illinois addresses the problem of massively parallel distributed-memory multicomputers by developing automatic methods for the efficient parallelization of sequential programs.
Abstract: To harness the computational power of massively parallel distributed-memory multicomputers, users must write efficient software. This process is laborious because of the absence of global address space. The programmer must manually distribute computations and data across processors and explicitly manage communication. The Paradigm (PARAllelizing compiler for DIstributed-memory, General-purpose Multicomputers) project at the University of Illinois addresses this problem by developing automatic methods for the efficient parallelization of sequential programs. A unified approach efficiently supports regular and irregular computations using data and functional parallelism.
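
As a flavor of the bookkeeping such a compiler must generate (a hypothetical sketch, not Paradigm's actual output): the local index ranges for a BLOCK-distributed array under the owner-computes rule.

```python
# Hypothetical sketch of compiler-generated bookkeeping for a BLOCK
# distribution: each processor owns a contiguous slice and, by the
# owner-computes rule, executes only the iterations writing its own slice.
def block_range(n, nprocs, p):
    """Global index range [lo, hi) owned by processor p."""
    base, extra = divmod(n, nprocs)
    lo = p * base + min(p, extra)
    hi = lo + base + (1 if p < extra else 0)
    return lo, hi

N, NPROCS = 100, 4
for p in range(NPROCS):
    lo, hi = block_range(N, NPROCS, p)
    # a global loop "for i in range(N): a[i] = ..." becomes a local loop
    print(f"processor {p} computes i in [{lo}, {hi})")
```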

195 citations


Journal ArticleDOI
TL;DR: The authors show that adaptive parallelism has the potential to integrate heterogeneous platforms seamlessly into a unified computing resource and to permit more efficient sharing of traditional parallel processors than is possible with current systems.
Abstract: Desktop computers are idle much of the time. Ongoing trends make aggregate LAN "waste"-idle compute cycles-an increasingly attractive target for recycling. Piranha, a software implementation of adaptive parallelism, allows these waste cycles to be recaptured by putting them to work running parallel applications. Most parallel processing is static: programs execute on a fixed set of processors throughout a computation. Adaptive parallelism allows for dynamic processor sets which means that the number of processors working on a computation may vary, depending on availability. With adaptive parallelism, instead of parceling out jobs to idle workstations, a single job is distributed over many workstations. Adaptive parallelism is potentially valuable on dedicated multiprocessors as well, particularly on massively parallel processors. One key Piranha advantage is that task descriptors, not processes, are the basic movable, remappable computation unit. The task descriptor approach supports strong heterogeneity. A process image representing a task in mid computation can't be moved to a machine of a different type, but a task descriptor can be. Thus, a task begun on a Sun computer can be completed by an IBM machine. The authors show that adaptive parallelism has the potential to integrate heterogeneous platforms seamlessly into a unified computing resource and to permit more efficient sharing of traditional parallel processors than is possible with current systems.
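
A minimal sketch of the task-descriptor idea (names invented; this is not Piranha's API): because a descriptor is plain data rather than a process image, any node of any architecture can pick it up and run it.

```python
# Hypothetical illustration of task descriptors as the movable unit of work.
# A descriptor is pure data, so it migrates across machine types; a
# half-finished process image cannot.
from dataclasses import dataclass

@dataclass
class TaskDescriptor:
    func_name: str  # which registered function to run
    args: tuple     # inputs; plain, machine-independent data

REGISTRY = {"square": lambda x: x * x}

def run_anywhere(task: TaskDescriptor):
    """Any node, of any architecture, can pick up a descriptor and run it."""
    return REGISTRY[task.func_name](*task.args)

# A tuple-space-like bag of tasks; a node that "retreats" simply returns
# its unfinished descriptors to the bag for others to claim.
bag = [TaskDescriptor("square", (i,)) for i in range(5)]
print([run_anywhere(t) for t in bag])
```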

163 citations


Journal ArticleDOI
TL;DR: In this paper, a large-scale finite element formulation of 3D, unsteady incompressible flows, including those involving fluid-structure interactions, is presented, with time-varying spatial domains handled by the deforming-spatial-domain/stabilized space-time (DSD/SST) formulation.

Abstract: Massively parallel finite element computations of 3D, unsteady incompressible flows, including those involving fluid-structure interactions, are presented. The computations with time-varying spatial domains are based on the deforming-spatial-domain/stabilized space-time (DSD/SST) finite element formulation. The capability to solve 3D problems involving fluid-structure interactions is demonstrated by investigating the dynamics of a flexible cantilevered pipe conveying fluid. Computations of flow past a stationary rectangular wing at Reynolds numbers 1000, 2500 and 10^7 reveal interesting flow patterns. In these computations, at each time step approximately 3 × 10^6 non-linear equations are solved to update the flow field. Also, preliminary results are presented for flow past a wing in flapping motion. In this case a specially designed mesh moving scheme is employed to eliminate the need for remeshing. All these computations are carried out on the Army High Performance Computing Research Center supercomputers CM-200 and CM-5, with major speed-ups compared with traditional supercomputers. The coupled equation systems arising from the finite element discretizations of these large-scale problems are solved iteratively with diagonal preconditioners. In some cases, to reduce the memory requirements even further, these iterations are carried out with a matrix-free strategy. The finite element formulations and their parallel implementations assume unstructured meshes.

146 citations


Journal ArticleDOI
TL;DR: This paper investigates two ways of reducing the communication overhead introduced by inner products in the iterative solution methods CG and GMRES(m).
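
The communication pattern such methods aim for, sketched with mpi4py (our illustration, not the paper's algorithm): independent partial inner products are combined into a single global reduction, so the synchronization latency is paid once rather than once per inner product.

```python
# Sketch: batch two inner products into ONE allreduce instead of two.
# Requires mpi4py and an MPI launch, e.g. `mpirun -n 4 python fused_dots.py`.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
n_local = 1000
rng = np.random.default_rng(comm.Get_rank())
r = rng.standard_normal(n_local)
z = rng.standard_normal(n_local)

# Two local partial inner products, one global reduction.
partial = np.array([r @ r, r @ z])
total = np.empty(2)
comm.Allreduce(partial, total, op=MPI.SUM)
rr, rz = total
if comm.Get_rank() == 0:
    print(f"(r,r) = {rr:.3f}, (r,z) = {rz:.3f}")
```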

135 citations


Journal ArticleDOI
TL;DR: In this article, an expectation-driven low-level image segmentation approach is presented for road detection using mesh-connected massively parallel SIMD architectures capable of handling hierarchical data structures, where the input image is assumed to contain a distorted version of a given template.

Abstract: The main aim of this work is the development of a vision-based road detection system fast enough to cope with the difficult real-time constraints imposed by moving vehicle applications. The hardware platform, a special-purpose massively parallel system, has been chosen to minimize system production and operational costs. This paper presents a novel approach to expectation-driven low-level image segmentation, which can be mapped naturally onto mesh-connected massively parallel SIMD architectures capable of handling hierarchical data structures. The input image is assumed to contain a distorted version of a given template; a multiresolution stretching process is used to reshape the original template in accordance with the acquired image content, minimizing a potential function. The distorted template is the process output.

Posted Content
TL;DR: A novel approach to expectation-driven low-level image segmentation, which can be mapped naturally onto mesh-connected massively parallel Simd architectures capable of handling hierarchical data structures, is presented.
Abstract: The main aim of this work is the development of a vision-based road detection system fast enough to cope with the difficult real-time constraints imposed by moving vehicle applications. The hardware platform, a special-purpose massively parallel system, has been chosen to minimize system production and operational costs. This paper presents a novel approach to expectation-driven low-level image segmentation, which can be mapped naturally onto mesh-connected massively parallel SIMD architectures capable of handling hierarchical data structures. The input image is assumed to contain a distorted version of a given template; a multiresolution stretching process is used to reshape the original template in accordance with the acquired image content, minimizing a potential function. The distorted template is the process output.
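
A toy illustration of the multiresolution idea (ours; a plain shift search stands in for the paper's potential-minimizing template stretching, which deforms the template rather than merely sliding it): search coarsely on downsampled data, then refine the estimate at full resolution.

```python
# Coarse-to-fine matching sketch: estimate a template's position cheaply at
# low resolution, then refine locally at full resolution.
import numpy as np

def downsample(img):
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def best_shift(image, template, candidates):
    h, w = template.shape
    def ssd(dy, dx):  # sum of squared differences at a candidate shift
        return ((image[dy:dy + h, dx:dx + w] - template) ** 2).sum()
    return min(candidates, key=lambda s: ssd(*s))

rng = np.random.default_rng(1)
image = rng.standard_normal((64, 64))
template = image[10:42, 12:44].copy()  # ground-truth shift (10, 12)

# Coarse level: cheap exhaustive search on quarter-size data.
dy, dx = best_shift(downsample(image), downsample(template),
                    [(y, x) for y in range(16) for x in range(16)])
# Fine level: refine around the doubled coarse estimate.
cands = [(2 * dy + y, 2 * dx + x) for y in (-1, 0, 1) for x in (-1, 0, 1)]
print(best_shift(image, template, cands))  # -> (10, 12)
```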

Journal ArticleDOI
TL;DR: A mixed computational model is presented for GA-based structural optimization of large space structures on massively parallel supercomputers; it has been implemented on the Connection Machine CM-5 and applied to optimize steel structures subjected to the constraints of the American Institute of Steel Construction's allowable stress design specifications.

Abstract: Genetic-algorithm (GA)-based structural optimization can be parallelized to a high degree on the new generation of scalable distributed-memory multiprocessors. In this paper, a mixed computational model...

Patent
07 Jun 1995
TL;DR: A parallel array processor for massively parallel applications is formed with low-power CMOS DRAM processing, incorporating the processing elements on a single chip; it merges processor and memory with multiple PMEs (eight 16-bit processors with 32K memory and I/O) in DRAM, has no memory access delays, and uses all the pins for networking.

Abstract: A parallel array processor for massively parallel applications is formed with low-power CMOS DRAM processing while incorporating processing elements on a single chip. Eight processors on a single chip have their own associated processing element, significant memory, and I/O and are interconnected with a hypercube-based, but modified, topology. These nodes are then interconnected, either by a hypercube, modified hypercube, or ring, or ring-within-ring network topology. Conventional microprocessor MPPs consume pins and time going to memory. The new architecture merges processor and memory with multiple PMEs (eight 16-bit processors with 32K memory and I/O) in DRAM, has no memory access delays, and uses all the pins for networking. Each chip will have eight 16-bit processors, each processor providing 5 MIPs performance. I/O has three internal ports and one external port shared by the plural processors on the chip. The scalable chip PME has internal and external connections for broadcast and asynchronous SIMD, MIMD and SIMIMD (SIMD/MIMD) operation with dynamic switching of modes. The chip can be used in systems which employ 32, 64 or 128,000 processors, and can be used for lower, intermediate and higher ranges. Local and global memory functions can all be provided by the chips themselves, and the system can connect to and support other global memories and DASD. The chip can be used as a microprocessor accelerator, in personal computer applications, as a vision or avionics computer system, or as a workstation or supercomputer.

Journal ArticleDOI
TL;DR: The flow problems the authors consider typically come from aerospace applications, including those in 3D and those involving moving boundaries interacting with boundary layers and shocks, and are solved using the deformable-spatial-domain/stabilized-space-time (DSD/SST) formulation.
Abstract: Massively parallel finite element computations of the compressible Euler and Navier-Stokes equations using parallel supercomputers are presented. The finite element formulations are based on the conservation variables and the streamline-upwind/Petrov-Galerkin (SUPG) stabilization method is used to prevent potential numerical oscillations due to dominant advection terms. These computations are based on both implicit and explicit methods and their parallel implementation assumes that the mesh is unstructured. The implicit computations are based on iterative strategies. Large-scale 3D problems are solved using a matrix-free iteration technique which reduces the memory requirements significantly. The flow problems we consider typically come from aerospace applications, including those in 3D and those involving moving boundaries interacting with boundary layers and shocks. Problems with fixed boundaries are solved using a semidiscrete formulation and the ones involving moving boundaries are solved using the deformable-spatial-domain/stabilized-space-time (DSD/SST) formulation.
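
The matrix-free idea in miniature (a generic illustration, not these authors' implementation): the Krylov solver only ever needs matrix-vector products, so the operator can be applied stencil-by-stencil without assembling or storing a global matrix.

```python
# Generic matrix-free solve: supply only y = A @ v, here an on-the-fly 1D
# Laplacian stencil on an SPD stand-in problem. CG is shown for brevity;
# the flow solvers in the paper use nonsymmetric Krylov methods.
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

n = 1000

def apply_A(v):
    """Apply the 1D Laplacian stencil without storing the matrix."""
    out = 2.0 * v
    out[1:] -= v[:-1]
    out[:-1] -= v[1:]
    return out

A = LinearOperator((n, n), matvec=apply_A, dtype=float)
b = np.ones(n)
x, info = cg(A, b)
print(info, np.linalg.norm(apply_A(x) - b))  # info == 0 means converged
```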

Journal ArticleDOI
TL;DR: Here, results are presented from an approach to parallel computation that relies on explicit message‐passing and distributed computing and is able to perform computations in the hundreds of Mflops range with a demonstrated total parallel overhead of less than ten percent.
Abstract: Thermally driven convection within the earth’s mantle determines one of the longest time scales of our planet. Plate tectonics, the piecewise continuous movement of the earth’s surface, is the prime manifestation of this slow deformational process but ultimately all large scale geological activity and dynamics of the planet involves the release of potential energy in the mantle. Massively parallel supercomputers are now allowing us to construct models of mantle convection with unprecedented complexity and realism. Here we present results from an approach to parallel computation that relies on explicit message‐passing and distributed computing. Connecting workstations together as a single parallel machine over a network and using the parallel virtual machine software, we are able to perform computations in the hundreds of Mflops range with a demonstrated total parallel overhead of less than ten percent. We have run high‐resolution thermal convection calculations for the earth’s mantle on the Los Alamos 16‐node cluster of IBM RS/6000 workstations employing a finite element mesh with more than 1.3 million grid points. These results indicate this approach to parallel computing offers a practical and efficient means of utilizing a broad spectrum of parallel hardware. © 1995 American Institute of Physics.
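
The explicit message-passing style described above, in schematic form (our mpi4py sketch, not the authors' PVM code): each rank owns a slab of the mesh and exchanges one-cell halo boundaries with its neighbors before each update.

```python
# Schematic 1D domain decomposition with halo exchange.
# Run with e.g. `mpirun -n 4 python halo.py` (requires mpi4py).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
up = rank - 1 if rank > 0 else MPI.PROC_NULL           # PROC_NULL makes the
down = rank + 1 if rank < size - 1 else MPI.PROC_NULL  # boundary a no-op

u = np.full(12, float(rank))  # [ghost | 10 interior cells | ghost]

# Send first interior cell up, receive bottom ghost from below; and vice versa.
comm.Sendrecv(u[1:2], dest=up, recvbuf=u[11:12], source=down)
comm.Sendrecv(u[10:11], dest=down, recvbuf=u[0:1], source=up)

u[1:11] = 0.5 * (u[0:10] + u[2:12])  # one explicit smoothing update
print(rank, u[1], u[10])
```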

Proceedings Article
20 Aug 1995
TL;DR: Early experiences are presented with a prototype exploratory data analysis environment, CONQUEST, designed to provide content-based access to massive scientific datasets, together with several associated feature extraction algorithms implemented on MPP platforms.

Abstract: The important scientific challenge of understanding global climate change is one that clearly requires the application of knowledge discovery and data-mining techniques on a massive scale. Advances in parallel supercomputing technology, enabling high-resolution modeling, as well as in sensor technology, allowing data capture on an unprecedented scale, conspire to overwhelm present-day analysis approaches. We present here early experiences with a prototype exploratory data analysis environment, CONQUEST, designed to provide content-based access to such massive scientific datasets. CONQUEST (CONtent-based Querying in Space and Time) employs a combination of workstations and massively parallel processors (MPPs) to mine geophysical datasets possessing a prominent temporal component. It is designed to enable complex multi-modal interactive querying and knowledge discovery, while simultaneously coping with the extraordinary computational demands posed by the scope of the datasets involved. After outlining a working prototype, we concentrate here on the description of several associated feature extraction algorithms implemented on MPP platforms, together with some typical results.

Journal ArticleDOI
TL;DR: Simple skeleton particle-in-cell codes designed for massively parallel computers are described; they are used to develop new algorithms and to evaluate new parallel computers.

Patent
04 Apr 1995
TL;DR: In this article, a parallel computing system comprising N blocks of processors, where N is an integer greater than 1, is presented. Each block of the N blocks contains M processors, each of which includes an arithmetic logic unit, a local memory and an input/output (I/O) interface.
Abstract: A parallel computing system comprising N blocks of processors, where N is an integer greater than 1. Each block of the N blocks of processors contains M processors, where M is an integer greater than 1. Each processor includes an arithmetic logic unit (ALU), a local memory and an input/output (I/O) interface. The computing system also contains a control means, connected to each of the M processors, for providing identical instructions to each of the M processors, and a host means, coupled to each of the control means within the N blocks of processors. The host means selectively organizes the control means of each of the N blocks of M processors into at least two groups of P blocks of M processors, P being an integer less than or equal to N. In operation, the host means causes the control means within each group of P blocks of M processors to provide each group of P blocks of M processors respectively different identical processor instructions. To facilitate communications amongst the processors, the parallel computing system includes an interprocessor communications channel that selectively interconnects the processors.

Proceedings ArticleDOI
01 Aug 1995
TL;DR: This paper addresses the problem of optimizing throughput in task pipelines and presents two new solution algorithms; the first, based on dynamic programming, finds the optimal mapping of k tasks onto P processors in O(P^4 k^2) time.

Abstract: Many applications in a variety of domains including digital signal processing, image processing and computer vision are composed of a sequence of tasks that act on a stream of input data sets in a pipelined manner. Recent research has established that these applications are best mapped to a massively parallel machine by dividing the tasks into modules and assigning a subset of the available processors to each module. This paper addresses the problem of optimally mapping such applications onto a massively parallel machine. We formulate the problem of optimizing throughput in task pipelines and present two new solution algorithms. The formulation uses a general and realistic model for inter-task communication, takes memory constraints into account, and addresses the entire problem of mapping, which includes clustering tasks into modules, assignment of processors to modules, and possible replication of modules. The first algorithm is based on dynamic programming and finds the optimal mapping of k tasks onto P processors in O(P^4 k^2) time. We also present a heuristic algorithm that is linear in the number of processors and establish with theoretical and practical results that the solutions obtained are optimal in practical situations. The entire framework is implemented as an automatic mapping tool for the Fx parallelizing compiler for High Performance Fortran. We present experimental results that demonstrate the importance of choosing a good mapping and show that the methods presented yield efficient mappings and predict optimal performance accurately.
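
A stripped-down version of the dynamic-programming formulation (our assumptions: contiguous modules, ideal speedup, no communication or memory terms, which is why the complexity differs from the paper's O(P^4 k^2) bound): minimize the bottleneck module time, which maximizes pipeline throughput.

```python
# Toy DP for mapping a chain of k tasks onto P processors: partition the
# chain into contiguous modules and give each module some processors so the
# slowest module (the pipeline bottleneck) is as fast as possible.
from functools import lru_cache

work = [4.0, 1.0, 3.0, 2.0]  # per-input cost of each task in the chain
P = 5                        # processors available

prefix = [0.0]
for w in work:
    prefix.append(prefix[-1] + w)

@lru_cache(maxsize=None)
def best(j, p):
    """Min bottleneck time for tasks 0..j-1 on p processors."""
    if j == 0:
        return 0.0
    if p == 0:
        return float("inf")
    return min(
        max(best(i, p - q), (prefix[j] - prefix[i]) / q)
        for i in range(j)          # last module = tasks i..j-1
        for q in range(1, p + 1)   # processors given to that module
    )

print(best(len(work), P))  # -> 2.0, matching the total-work/P lower bound
```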

Journal ArticleDOI
TL;DR: In this article, the authors discuss aspects of data set size and computational feasibility for general classes of algorithms in the context of CPU performance, memory size, hard disk capacity, screen resolution and massively parallel architectures.
Abstract: Recently, Huber offered a taxonomy of data set sizes ranging from tiny (10^2 bytes) to huge (10^10 bytes). This taxonomy is particularly appealing because it quantifies the meaning of tiny, small, medium, large, and huge. Indeed, some investigators consider 300 small and 10,000 large while others consider 10,000 small. In Huber's taxonomy, most statistical and visualization techniques are computationally feasible with tiny data sets. With larger data sets, however, computers run out of computational horsepower and graphics displays run out of resolution fairly quickly. In this article, I discuss aspects of data set size and computational feasibility for general classes of algorithms in the context of CPU performance, memory size, hard disk capacity, screen resolution and massively parallel architectures. I discuss some strategies such as recursive formulations that mitigate the impact of size. I also discuss the potential for scalable parallelization that will mitigate the effects of computational ...
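
One concrete instance of the recursive-formulation strategy mentioned above: Welford's one-pass updates deliver the mean and variance in O(1) memory, so the computation never requires the data set to fit in RAM or to be revisited.

```python
# Welford's recursive (one-pass, constant-memory) mean and variance.
def welford(stream):
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n            # recursive mean update
        m2 += delta * (x - mean)     # accumulates sum of squared deviations
    return mean, m2 / (n - 1) if n > 1 else float("nan")

print(welford(iter(range(1_000_000))))  # works on a pure stream, no storage
```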

Book ChapterDOI
25 Apr 1995
TL;DR: CARMI, a resource management system, aimed at allowing a parallel application to make use of all available computing power, and WoDi which provides a simple interface for writing master-workers programs in a dynamic resource environment.
Abstract: In every production parallel processing environment, the set of resources potentially available to an application fluctuate due to changes in the load on the system. This is true for clusters of workstations which are an increasingly popular platform for parallel computing. Today's parallel programming environments have largely succeeded in making the communication aspect of parallel programming much easier, but they have not provided adequate resource management services which are needed to adapt to such changes in availability. To fill this need, we have developed CARMI, a resource management system, aimed at allowing a parallel application to make use of all available computing power. CARMI permits an application to grow as new resources become available, and shrink when resources are reclaimed. Building upon CARMI, we have also developed WoDi which provides a simple interface for writing master-workers programs in a dynamic resource environment. Both CARMI and WoDi are operational, and have been used on a pool of more than 200 workstations managed by the Condor batch system. Experience with the two systems has shown them to be easy to use, and capable of providing large numbers of cycles to parallel applications even in a real-life production environment in which no resources are dedicated to parallel processing.
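
The master-workers pattern that WoDi packages, in skeleton form (invented names; not the CARMI/WoDi API): workers pull tasks from a shared queue, so they can join or leave between tasks, which is exactly the property that matters when resource availability fluctuates.

```python
# Master-workers skeleton: the master fills a queue; workers consume until
# they receive a sentinel, which models a resource being reclaimed.
import multiprocessing as mp

def worker(tasks, results):
    while True:
        item = tasks.get()
        if item is None:        # sentinel: this worker is being reclaimed
            break
        results.put((item, item * item))

if __name__ == "__main__":
    tasks, results = mp.Queue(), mp.Queue()
    for i in range(20):
        tasks.put(i)
    pool = [mp.Process(target=worker, args=(tasks, results)) for _ in range(4)]
    for p in pool:
        p.start()
    for _ in pool:
        tasks.put(None)         # one sentinel per current worker
    print(sorted(results.get() for _ in range(20)))
    for p in pool:
        p.join()
```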

Journal ArticleDOI
01 Sep 1995
TL;DR: This paper outlines a general methodology for the data-parallel implementation of spectral methods on massively parallel machines with distributed memory and shows that very high performance can be obtained on a wide range of massively parallel architectures.

Abstract: Here we have demonstrated the possibility of very high performance in the implementation of a global spectral methodology on a massively parallel architecture with distributed memory. Spectral simulations of channel flow and thermal convection in a three-dimensional Cartesian geometry have yielded very high performance: up to 26 Gflops on a 512-node CM-5. In general, implementation of spectral methodology in parallel processors with distributed memory requires nonlocal interprocessor data transfer that is not restricted to being between nearest neighbors. In spite of their increased communication overhead, better performance is possible in global methodologies owing to their dense matrix operations and organized data communication. In this paper we outline a general methodology for the data-parallel implementation of spectral methods on massively parallel machines with distributed memory. Following the steps presented here, very high performance can be obtained on a wide variety of massively parallel architectures.
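
The organized, nonlocal data movement referred to above is often realized with the transpose method, sketched here as a single-process numpy analogue (an assumed illustration, not the paper's code): transform the locally stored axis, globally transpose so the other axis becomes local, then transform again.

```python
# Transpose-method sketch for a 2D spectral transform. On a distributed
# machine the transpose is an all-to-all; here np.transpose stands in.
import numpy as np

field = np.random.default_rng(2).standard_normal((8, 8))

step1 = np.fft.fft(field, axis=1)     # axis 1 is local: transform it
step2 = step1.T                       # stands in for the global all-to-all
spectrum = np.fft.fft(step2, axis=1)  # former axis 0 is now local

# Same result as the direct 2D transform, up to the final transpose.
assert np.allclose(spectrum.T, np.fft.fft2(field))
print("transpose-based transform matches fft2")
```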

Proceedings ArticleDOI
25 Sep 1995
TL;DR: This paper presents the vision-based road detection system currently operational on the MOB-LAB land vehicle, based on a full-custom low-cost massively parallel system that achieves real-time performance in the processing of image sequences thanks to an extremely efficient implementation of the algorithm.

Abstract: This paper presents the vision-based road detection system currently operational on the MOB-LAB land vehicle. Based on a full-custom low-cost massively parallel system, it achieves real-time performance (≈17 Hz) in the processing of image sequences, due to the extremely efficient implementation of the algorithm. Assuming a flat road and a completely known set of acquisition parameters (camera position, orientation, optics), the system is capable of detecting road markings on structured roads even in extremely severe shadow conditions.

Journal ArticleDOI
TL;DR: This work investigates the problem of evaluating Fortran 90-style array expressions on massively parallel distributed-memory machines and presents algorithms based on dynamic programming that solve the embedding problem optimally for several communication cost metrics: multidimensional grids and rings, hypercubes, fat-trees, and the discrete metric.
Abstract: We investigate the problem of evaluating Fortran 90-style array expressions on massively parallel distributed-memory machines. On such a machine, an elementwise operation can be performed in constant time for arrays whose corresponding elements are in the same processor. If the arrays are not aligned in this manner, the cost of aligning them is part of the cost of evaluating the expression tree. The choice of where to perform the operation then affects this cost. We describe the communication cost of the parallel machine theoretically as a metric space; we model the alignment problem as that of finding a minimum-cost embedding of the expression tree into this space. We present algorithms based on dynamic programming that solve the embedding problem optimally for several communication cost metrics: multidimensional grids and rings, hypercubes, fat-trees, and the discrete metric. We also extend our approach to handle operations that change the shape of the arrays.
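
A minimal instance of the tree dynamic program for the simplest of the listed metrics, the discrete metric (realigning an array costs 1, staying put costs 0); the paper's algorithms handle the richer grid, hypercube and fat-tree metrics the same way, with a more elaborate distance term.

```python
# Tree DP for the discrete metric: cost[a] is the cheapest way to make the
# subtree's value available at alignment a.
def align_cost(node, alignments):
    if node[0] == "leaf":
        home = node[1]                  # where this array already lives
        return {a: 0 if a == home else 1 for a in alignments}
    _, left, right = node               # an elementwise binary op
    lc = align_cost(left, alignments)
    rc = align_cost(right, alignments)
    def deliver(c):                     # operand delivered to alignment a:
        floor = min(c.values()) + 1     # either already there, or one move
        return {a: min(c[a], floor) for a in alignments}
    dl, dr = deliver(lc), deliver(rc)
    return {a: dl[a] + dr[a] for a in alignments}

# (A + B) + C with A, B aligned at 0 and C at 1.
tree = ("op", ("op", ("leaf", 0), ("leaf", 0)), ("leaf", 1))
print(align_cost(tree, alignments=[0, 1]))  # -> {0: 1, 1: 1}: one move suffices
```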

Journal ArticleDOI
TL;DR: In this article, three-dimensional, time-dependent features of melt flows which occur during the Czochralski growth of oxide crystals are analyzed using a theoretical bulk-flow model.

Proceedings ArticleDOI
Peter M. Kogge, T. Sunaga, H. Miyataka, K. Kitamura, Eric E. Retter
27 Mar 1995
TL;DR: The basic chip technology and organization, some projections on the future of EXECUBE-like PIM chips, and finally some lessons to be learned as to why this technology should radically affect the way the authors ought to think about computer architecture are overviewed.

Abstract: A new 5 V 0.8 μm CMOS technology merges 100 K custom circuits and 4.5 Mb DRAM onto a single die that supports both high density memory and significant computing logic. One of the first chips built with this technology implements a unique Processor-In-Memory (PIM) computer architecture termed EXECUBE and has 8 separate 25 MHz CPU macros and 16 separate 32 K × 9 b DRAM macros on a single die. These macros are organized together to provide a single part type for scalable massively parallel processing applications, particularly embedded ones where minimal glue logic is desired. Each chip delivers 50 Mips of performance at 2.7 W. This paper overviews the basic chip technology and organization, some projections on the future of EXECUBE-like PIM chips, and finally some lessons to be learned as to why this technology should radically affect the way we ought to think about computer architecture.

Proceedings ArticleDOI
25 Apr 1995
TL;DR: This paper is the first to present a parallelization of a highly efficient best-first branch-and-bound algorithm to solve large symmetric traveling salesman problems on a massively parallel computer containing 1024 processors.
Abstract: This paper is the first to present a parallelization of a highly efficient best-first branch-and-bound algorithm to solve large symmetric traveling salesman problems on a massively parallel computer containing 1024 processors. The underlying sequential branch-and-bound algorithm is based on 1-tree relaxation. The parallelization of the branch-and-bound algorithm is fully distributed. Every processor performs the same sequential algorithm but on a different part of the solution tree. To distribute subproblems among the processors we use a new direct-neighbor dynamic load-balancing strategy. The general principle can be applied to all other branch-and-bound algorithms, leading to an "automatic" parallelization. At present we can efficiently solve traveling salesman problems up to a size of 318 cities on networks of up to 1024 transputers. On hard problems we achieve an almost linear speed-up.
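
A skeleton of the best-first branch and bound each processor runs (our sketch: a cheap minimum-outgoing-edge bound stands in for the paper's 1-tree relaxation, and the direct-neighbor load balancing between processors is not shown).

```python
# Best-first branch and bound for a tiny asymmetric TSP instance.
import heapq

D = [[0, 2, 9, 10],
     [1, 0, 6, 4],
     [15, 7, 0, 8],
     [6, 3, 12, 0]]
n = len(D)
# Lower bound ingredient: every city still to be departed costs at least
# its cheapest outgoing edge.
min_out = [min(d for j, d in enumerate(row) if j != i)
           for i, row in enumerate(D)]

best_cost, best_tour = float("inf"), None
heap = [(sum(min_out), 0, (0,))]  # (lower bound, cost so far, partial tour)
while heap:
    bound, cost, tour = heapq.heappop(heap)
    if bound >= best_cost:
        continue  # pruned: cannot beat the incumbent
    if len(tour) == n:
        total = cost + D[tour[-1]][0]  # close the cycle
        if total < best_cost:
            best_cost, best_tour = total, tour
        continue
    for city in range(n):
        if city not in tour:
            c = cost + D[tour[-1]][city]
            remaining = sum(min_out[i] for i in range(n) if i not in tour)
            heapq.heappush(heap, (c + remaining, c, tour + (city,)))

print(best_cost, best_tour)  # -> 21 (0, 2, 3, 1)
```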

Journal ArticleDOI
01 Oct 1995
TL;DR: A partitioning strategy for the message-passing version that significantly reduces memory requirements and increases model speed is developed; the T3D and CM-5 are found to execute the "large" model version at roughly the same speed.
Abstract: A two-pronged effort to convert a recently developed ocean circulation model written in Fortran-77 for execution on massively parallel computers is described. A data-parallel version was developed for the CM-5 manufactured by Thinking Machines, Inc., while a message-passing version was developed for both the Cray T3D and the Silicon Graphics ONYX workstation. Since the time differentiation scheme in the ocean model is fully explicit and does not require solution of elliptic partial differential equations, adequate machine utilization has been achieved without major changes to the original algorithms. We developed a partitioning strategy for the message passing version that significantly reduces memory requirements and increases model speed. On a per-node basis (a T3D node is one Alpha processor, a CM-5 node is one Sparc chip and four vector units), the T3D and CM-5 are found to execute our “large” model version consisting of 511 × 511 horizontal mesh points at roughly the same speed.

Journal ArticleDOI
TL;DR: An iterative method for solving large linear systems with sparse symmetric positive definite matrices on massively parallel computers is suggested, based on the Factorized Sparse Approximate Inverse (FSAI) preconditioning of ‘parallel’ CG iterations.
Abstract: An iterative method for solving large linear systems with sparse symmetric positive definite matrices on massively parallel computers is suggested. The method is based on the Factorized Sparse Approximate Inverse (FSAI) preconditioning of ‘parallel’ CG iterations. Efficiency of a concurrent implementation of the FSAI-CG iterations is analyzed for a model hypercube, and an estimate of the optimal hypercube dimension is derived. For finite element applications, two strategies for selecting the preconditioner sparsity pattern are suggested. A high convergence rate of the resulting iterations is demonstrated numerically for the 3D equilibrium equations for linear elastic orthotropic materials approximated using both h- and p-versions of the FEM.