Showing papers on "Massively parallel" published in 1995


Journal ArticleDOI
TL;DR: The Myrinet local area network employs the same technology used for packet communication and switching within massively parallel processors, and demonstrates the highest performance per unit cost of any current LAN.

Abstract: The Myrinet local area network employs the same technology used for packet communication and switching within massively parallel processors. In realizing this distributed MPP network, we developed specialized communication channels, cut-through switches, host interfaces, and software. To our knowledge, Myrinet demonstrates the highest performance per unit cost of any current LAN.

1,857 citations


Journal ArticleDOI
TL;DR: A three-field arbitrary Lagrangian-Eulerian (ALE) finite element/volume formulation for coupled transient aeroelastic problems is presented and a rigorous derivation of a geometric conservation law for flow problems with moving boundaries and unstructured deformable meshes is included.
Abstract: A three-field arbitrary Lagrangian-Eulerian (ALE) finite element/volume formulation for coupled transient aeroelastic problems is presented. The description includes a rigorous derivation of a geometric conservation law for flow problems with moving boundaries and unstructured deformable meshes. The solution of the coupled governing equations with a mixed explicit (fluid)/implicit (structure) staggered procedure is discussed with particular reference to accuracy, stability, distributed computing, I/O transfers, subcycling and parallel processing. A general and flexible framework for implementing partitioned solution procedures for coupled aeroelastic problems on heterogeneous and/or parallel computational platforms is described. This framework and the explicit/implicit partitioned procedures are demonstrated with the numerical investigation on an iPSC-860 massively parallel processor of the instability of flat panels with infinite aspect ratio in supersonic airstreams.

368 citations


Journal ArticleDOI
TL;DR: The Terasys project as discussed by the authors pairs processor-in-memory (PIM) chips, in which a single-bit ALU controls each column of memory, with the high-level data parallel bit C (dbC) language.

Abstract: SRC researchers have designed and fabricated a processor-in-memory (PIM) chip, a standard 4-bit memory augmented with a single-bit ALU controlling each column of memory. In principle, PIM chips can replace the memory of any processor, including a supercomputer. To validate the notion of integrating SIMD computing into conventional processors on a more modest scale, we have built a half dozen Terasys workstations, which are Sun Microsystems Sparcstation-2 workstations in which 8 megabytes of address space consist of PIM memory holding 32K single-bit ALUs. We have designed and implemented a high-level parallel language, called data parallel bit C (dbC), for Terasys and demonstrated that dbC applications using the PIM memory as a SIMD array run at the speed of multiple Cray-YMP processors. Thus, we can deliver supercomputer performance for a small fraction of supercomputer cost. Since the successful creation of the Terasys research prototype, we have begun work on processing in memory in a supercomputer setting. In a collaborative research project, we are working with Cray Computer to incorporate a new Cray-designed implementation of the PIM chips into two octants of Cray-3 memory.
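
To make the column-parallel model concrete, here is a sketch of ours (not dbC, whose syntax differs): a numpy array stands in for the 32K single-bit ALUs, each executing the same instruction on its own memory column.

```python
# Illustrative sketch only -- not dbC. A numpy uint8 array stands in for
# Terasys' 32K single-bit ALUs; the "instruction" below is executed by
# every ALU on its own memory column simultaneously.
import numpy as np

N_ALUS = 32 * 1024
rng = np.random.default_rng(0)
a = rng.integers(0, 2, N_ALUS, dtype=np.uint8)  # one bit per column
b = rng.integers(0, 2, N_ALUS, dtype=np.uint8)
carry = np.zeros(N_ALUS, dtype=np.uint8)

# A single SIMD step: a full adder evaluated in every column at once.
# Multi-bit arithmetic is built by iterating such steps bit-serially.
s = a ^ b ^ carry
carry = (a & b) | (carry & (a ^ b))
print(int(s.sum()), int(carry.sum()))
```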

367 citations


Book
31 Jan 1995
TL;DR: This book presents mathematical models for studying the long-range transport of air pollutants, tests the reliability of the numerical algorithms, and reports numerical experiments with the Danish Eulerian Model.
Abstract: 1. The Air Pollution Problem. 2. Mathematical Models for Studying the Long-Range Transport of Air Pollutants. 3. Numerical Treatment of Large Air Pollution Models. 4. Testing the Reliability of the Numerical Algorithms. 5. Need for Efficient Algorithms. 6. Computations on High-Speed Computers. 7. Running Air Pollution Models on Vector Machines. 8. Running Models on Parallel Machines with Shared Memory. 9. Running Models on Massively Parallel Computers. 10. Numerical Experiments with the Danish Eulerian Model. References. Author Index. Subject Index.

315 citations


Journal ArticleDOI
TL;DR: This work examines software and hardware approaches to implementing collective communication operations and describes the major classes of algorithms proposed to solve problems arising in this research area.
Abstract: Most MPC networks use wormhole routing to reduce the effect of path length on communication time. Researchers have exploited this by designing ingenious algorithms to speed collective communication. Many projects have addressed the design of efficient collective communication algorithms for wormhole-routed systems. By exploiting the relative distance-insensitivity of wormhole routing, these new algorithms often differ fundamentally from their store-and-forward counterparts. We examine software and hardware approaches to implementing collective communication operations. Although we emphasize methods in which the underlying architecture is a direct network, such as a hypercube or mesh, as opposed to an indirect switch-based network, several approaches apply to systems of either type. We illustrate several issues arising in this research area and describe the major classes of algorithms proposed to solve these problems.
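
To illustrate how distance-insensitivity changes algorithm design, here is a minimal simulation of ours (not from the survey) of recursive-doubling broadcast: in each step, every node holding the message forwards it to the node whose ID differs in one bit, sends that wormhole routing makes roughly as cheap as nearest-neighbor ones.

```python
# Simulation sketch: recursive-doubling broadcast on P nodes. Non-neighbor
# sends are assumed cheap, as on a wormhole-routed network, so the message
# reaches all P nodes in log2(P) steps.
P = 16  # number of nodes, assumed a power of two
has_msg = [False] * P
has_msg[0] = True  # the root starts with the message

steps = 0
while (1 << steps) < P:
    for node in range(P):
        partner = node ^ (1 << steps)  # ID differing in bit `steps`
        if has_msg[node] and not has_msg[partner]:
            has_msg[partner] = True    # one message "sent" this step
    steps += 1

assert all(has_msg)
print(f"{P} nodes reached in {steps} steps")
```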

203 citations


Patent
05 Jun 1995
TL;DR: In this article, an improved parallel processing apparatus and method execute an iterative sequence of instructions by arranging the sequence into subtasks and allocating those subtasks to processors in such a manner as to minimize data contention among the processors and to maximize the locality of data to them.
Abstract: An improved parallel processing apparatus and method executes an iterative sequence of instructions by arranging the sequence into subtasks and allocating those subtasks to processors. This division and allocation is conducted in such a manner as to minimize data contention among the processors and to maximize the locality of data to them. The improved apparatus and method have application to a variety of multiprocessor systems, including those which are massively parallel.

197 citations


Journal ArticleDOI
TL;DR: The Paradigm (PARAllelizing compiler for DIstributed-memory, General-purpose Multicomputers) project at the University of Illinois addresses the problem of massively parallel distributed-memory multicomputers by developing automatic methods for the efficient parallelization of sequential programs.
Abstract: To harness the computational power of massively parallel distributed-memory multicomputers, users must write efficient software. This process is laborious because of the absence of global address space. The programmer must manually distribute computations and data across processors and explicitly manage communication. The Paradigm (PARAllelizing compiler for DIstributed-memory, General-purpose Multicomputers) project at the University of Illinois addresses this problem by developing automatic methods for the efficient parallelization of sequential programs. A unified approach efficiently supports regular and irregular computations using data and functional parallelism.
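
As a flavor of the bookkeeping such a compiler must generate (a hypothetical sketch, not Paradigm's actual output): the local index ranges for a BLOCK-distributed array under the owner-computes rule.

```python
# Hypothetical sketch of compiler-generated bookkeeping for a BLOCK
# distribution: each processor owns a contiguous slice and, by the
# owner-computes rule, executes only the iterations writing its own slice.
def block_range(n, nprocs, p):
    """Global index range [lo, hi) owned by processor p."""
    base, extra = divmod(n, nprocs)
    lo = p * base + min(p, extra)
    hi = lo + base + (1 if p < extra else 0)
    return lo, hi

N, NPROCS = 100, 4
for p in range(NPROCS):
    lo, hi = block_range(N, NPROCS, p)
    # a global loop "for i in range(N): a[i] = ..." becomes a local loop
    print(f"processor {p} computes i in [{lo}, {hi})")
```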

195 citations


Journal ArticleDOI
TL;DR: The authors show that adaptive parallelism has the potential to integrate heterogeneous platforms seamlessly into a unified computing resource and to permit more efficient sharing of traditional parallel processors than is possible with current systems.
Abstract: Desktop computers are idle much of the time. Ongoing trends make aggregate LAN "waste"-idle compute cycles-an increasingly attractive target for recycling. Piranha, a software implementation of adaptive parallelism, allows these waste cycles to be recaptured by putting them to work running parallel applications. Most parallel processing is static: programs execute on a fixed set of processors throughout a computation. Adaptive parallelism allows for dynamic processor sets which means that the number of processors working on a computation may vary, depending on availability. With adaptive parallelism, instead of parceling out jobs to idle workstations, a single job is distributed over many workstations. Adaptive parallelism is potentially valuable on dedicated multiprocessors as well, particularly on massively parallel processors. One key Piranha advantage is that task descriptors, not processes, are the basic movable, remappable computation unit. The task descriptor approach supports strong heterogeneity. A process image representing a task in mid computation can't be moved to a machine of a different type, but a task descriptor can be. Thus, a task begun on a Sun computer can be completed by an IBM machine. The authors show that adaptive parallelism has the potential to integrate heterogeneous platforms seamlessly into a unified computing resource and to permit more efficient sharing of traditional parallel processors than is possible with current systems.
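
A minimal sketch of the task-descriptor idea (names invented; this is not Piranha's API): because a descriptor is plain data rather than a process image, any node of any architecture can pick it up and run it.

```python
# Hypothetical illustration of task descriptors as the movable unit of work.
# A descriptor is pure data, so it migrates across machine types; a
# half-finished process image cannot.
from dataclasses import dataclass

@dataclass
class TaskDescriptor:
    func_name: str  # which registered function to run
    args: tuple     # inputs; plain, machine-independent data

REGISTRY = {"square": lambda x: x * x}

def run_anywhere(task: TaskDescriptor):
    """Any node, of any architecture, can pick up a descriptor and run it."""
    return REGISTRY[task.func_name](*task.args)

# A tuple-space-like bag of tasks; a node that "retreats" simply returns
# its unfinished descriptors to the bag for others to claim.
bag = [TaskDescriptor("square", (i,)) for i in range(5)]
print([run_anywhere(t) for t in bag])
```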

163 citations


Journal ArticleDOI
TL;DR: In this paper, a large-scale finite element formulation of 3D, unsteady incompressible flows, including those involving fluid-structure interactions, is presented, with time-varying spatial domains handled by the deforming-spatial-domain/stabilized space-time (DSD/SST) formulation.

Abstract: Massively parallel finite element computations of 3D, unsteady incompressible flows, including those involving fluid-structure interactions, are presented. The computations with time-varying spatial domains are based on the deforming-spatial-domain/stabilized space-time (DSD/SST) finite element formulation. The capability to solve 3D problems involving fluid-structure interactions is demonstrated by investigating the dynamics of a flexible cantilevered pipe conveying fluid. Computations of flow past a stationary rectangular wing at Reynolds numbers 1000, 2500 and 10^7 reveal interesting flow patterns. In these computations, at each time step approximately 3 × 10^6 non-linear equations are solved to update the flow field. Also, preliminary results are presented for flow past a wing in flapping motion. In this case a specially designed mesh moving scheme is employed to eliminate the need for remeshing. All these computations are carried out on the Army High Performance Computing Research Center supercomputers CM-200 and CM-5, with major speed-ups compared with traditional supercomputers. The coupled equation systems arising from the finite element discretizations of these large-scale problems are solved iteratively with diagonal preconditioners. In some cases, to reduce the memory requirements even further, these iterations are carried out with a matrix-free strategy. The finite element formulations and their parallel implementations assume unstructured meshes.

146 citations


Journal ArticleDOI
TL;DR: This paper investigates two ways of reducing the communication overhead introduced by inner products in the iterative solution methods CG and GMRES(m).
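
The communication pattern such methods aim for, sketched with mpi4py (our illustration, not the paper's algorithm): independent partial inner products are combined into a single global reduction, so the synchronization latency is paid once rather than once per inner product.

```python
# Sketch: batch two inner products into ONE allreduce instead of two.
# Requires mpi4py and an MPI launch, e.g. `mpirun -n 4 python fused_dots.py`.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
n_local = 1000
rng = np.random.default_rng(comm.Get_rank())
r = rng.standard_normal(n_local)
z = rng.standard_normal(n_local)

# Two local partial inner products, one global reduction.
partial = np.array([r @ r, r @ z])
total = np.empty(2)
comm.Allreduce(partial, total, op=MPI.SUM)
rr, rz = total
if comm.Get_rank() == 0:
    print(f"(r,r) = {rr:.3f}, (r,z) = {rz:.3f}")
```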

135 citations


Journal ArticleDOI
TL;DR: In this article, an expectation-driven low-level image segmentation approach is presented for road detection using mesh-connected massively parallel SIMD architectures capable of handling hierarchical data structures, where the input image is assumed to contain a distorted version of a given template.

Abstract: The main aim of this work is the development of a vision-based road detection system fast enough to cope with the difficult real-time constraints imposed by moving vehicle applications. The hardware platform, a special-purpose massively parallel system, has been chosen to minimize system production and operational costs. This paper presents a novel approach to expectation-driven low-level image segmentation, which can be mapped naturally onto mesh-connected massively parallel SIMD architectures capable of handling hierarchical data structures. The input image is assumed to contain a distorted version of a given template; a multiresolution stretching process is used to reshape the original template in accordance with the acquired image content, minimizing a potential function. The distorted template is the process output.

Posted Content
TL;DR: A novel approach to expectation-driven low-level image segmentation, which can be mapped naturally onto mesh-connected massively parallel Simd architectures capable of handling hierarchical data structures, is presented.
Abstract: The main aim of this work is the development of a vision-based road detection system fast enough to cope with the difficult real-time constraints imposed by moving vehicle applications. The hardware platform, a special-purpose massively parallel system, has been chosen to minimize system production and operational costs. This paper presents a novel approach to expectation-driven low-level image segmentation, which can be mapped naturally onto mesh-connected massively parallel SIMD architectures capable of handling hierarchical data structures. The input image is assumed to contain a distorted version of a given template; a multiresolution stretching process is used to reshape the original template in accordance with the acquired image content, minimizing a potential function. The distorted template is the process output.
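
A toy illustration of the multiresolution idea (ours; a plain shift search stands in for the paper's potential-minimizing template stretching, which deforms the template rather than merely sliding it): search coarsely on downsampled data, then refine the estimate at full resolution.

```python
# Coarse-to-fine matching sketch: estimate a template's position cheaply at
# low resolution, then refine locally at full resolution.
import numpy as np

def downsample(img):
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def best_shift(image, template, candidates):
    h, w = template.shape
    def ssd(dy, dx):  # sum of squared differences at a candidate shift
        return ((image[dy:dy + h, dx:dx + w] - template) ** 2).sum()
    return min(candidates, key=lambda s: ssd(*s))

rng = np.random.default_rng(1)
image = rng.standard_normal((64, 64))
template = image[10:42, 12:44].copy()  # ground-truth shift (10, 12)

# Coarse level: cheap exhaustive search on quarter-size data.
dy, dx = best_shift(downsample(image), downsample(template),
                    [(y, x) for y in range(16) for x in range(16)])
# Fine level: refine around the doubled coarse estimate.
cands = [(2 * dy + y, 2 * dx + x) for y in (-1, 0, 1) for x in (-1, 0, 1)]
print(best_shift(image, template, cands))  # -> (10, 12)
```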

Journal ArticleDOI
TL;DR: A mixed computational model is presented for GA-based structural optimization of large space structures on massively parallel supercomputers; it has been implemented on the Connection Machine CM-5 and applied to optimize steel structures subjected to the constraints of the American Institute of Steel Construction's allowable stress design specifications.

Abstract: Genetic-algorithm (GA)-based structural optimization can be parallelized to a high degree on the new generation of scalable distributed-memory multiprocessors. In this paper, a mixed computational model...

Patent
07 Jun 1995
TL;DR: A parallel array processor for massively parallel applications is formed with low-power CMOS DRAM processing, incorporating the processing elements on a single chip; it merges processor and memory with multiple PMEs (eight 16-bit processors with 32K memory and I/O) in DRAM, has no memory access delays, and uses all the pins for networking.

Abstract: A parallel array processor for massively parallel applications is formed with low-power CMOS DRAM processing while incorporating processing elements on a single chip. Eight processors on a single chip have their own associated processing element, significant memory, and I/O and are interconnected with a hypercube-based, but modified, topology. These nodes are then interconnected, either by a hypercube, modified hypercube, or ring, or ring-within-ring network topology. Conventional microprocessor MPPs consume pins and time going to memory. The new architecture merges processor and memory with multiple PMEs (eight 16-bit processors with 32K memory and I/O) in DRAM, has no memory access delays, and uses all the pins for networking. Each chip will have eight 16-bit processors, each processor providing 5 MIPs performance. I/O has three internal ports and one external port shared by the plural processors on the chip. The scalable chip PME has internal and external connections for broadcast and asynchronous SIMD, MIMD and SIMIMD (SIMD/MIMD) operation with dynamic switching of modes. The chip can be used in systems which employ 32, 64 or 128,000 processors, and can be used for lower, intermediate and higher ranges. Local and global memory functions can all be provided by the chips themselves, and the system can connect to and support other global memories and DASD. The chip can be used as a microprocessor accelerator, in personal computer applications, as a vision or avionics computer system, or as a workstation or supercomputer.

Journal ArticleDOI
TL;DR: The flow problems the authors consider typically come from aerospace applications, including those in 3D and those involving moving boundaries interacting with boundary layers and shocks, and are solved using the deformable-spatial-domain/stabilized-space-time (DSD/SST) formulation.
Abstract: Massively parallel finite element computations of the compressible Euler and Navier-Stokes equations using parallel supercomputers are presented. The finite element formulations are based on the conservation variables and the streamline-upwind/Petrov-Galerkin (SUPG) stabilization method is used to prevent potential numerical oscillations due to dominant advection terms. These computations are based on both implicit and explicit methods and their parallel implementation assumes that the mesh is unstructured. The implicit computations are based on iterative strategies. Large-scale 3D problems are solved using a matrix-free iteration technique which reduces the memory requirements significantly. The flow problems we consider typically come from aerospace applications, including those in 3D and those involving moving boundaries interacting with boundary layers and shocks. Problems with fixed boundaries are solved using a semidiscrete formulation and the ones involving moving boundaries are solved using the deformable-spatial-domain/stabilized-space-time (DSD/SST) formulation.
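
The matrix-free idea in miniature (a generic illustration, not these authors' implementation): the Krylov solver only ever needs matrix-vector products, so the operator can be applied stencil-by-stencil without assembling or storing a global matrix.

```python
# Generic matrix-free solve: supply only y = A @ v, here an on-the-fly 1D
# Laplacian stencil on an SPD stand-in problem. CG is shown for brevity;
# the flow solvers in the paper use nonsymmetric Krylov methods.
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

n = 1000

def apply_A(v):
    """Apply the 1D Laplacian stencil without storing the matrix."""
    out = 2.0 * v
    out[1:] -= v[:-1]
    out[:-1] -= v[1:]
    return out

A = LinearOperator((n, n), matvec=apply_A, dtype=float)
b = np.ones(n)
x, info = cg(A, b)
print(info, np.linalg.norm(apply_A(x) - b))  # info == 0 means converged
```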

Journal ArticleDOI
TL;DR: Here, results are presented from an approach to parallel computation that relies on explicit message‐passing and distributed computing and is able to perform computations in the hundreds of Mflops range with a demonstrated total parallel overhead of less than ten percent.
Abstract: Thermally driven convection within the earth’s mantle determines one of the longest time scales of our planet. Plate tectonics, the piecewise continuous movement of the earth’s surface, is the prime manifestation of this slow deformational process but ultimately all large scale geological activity and dynamics of the planet involves the release of potential energy in the mantle. Massively parallel supercomputers are now allowing us to construct models of mantle convection with unprecedented complexity and realism. Here we present results from an approach to parallel computation that relies on explicit message‐passing and distributed computing. Connecting workstations together as a single parallel machine over a network and using the parallel virtual machine software, we are able to perform computations in the hundreds of Mflops range with a demonstrated total parallel overhead of less than ten percent. We have run high‐resolution thermal convection calculations for the earth’s mantle on the Los Alamos 16‐node cluster of IBM RS/6000 workstations employing a finite element mesh with more than 1.3 million grid points. These results indicate this approach to parallel computing offers a practical and efficient means of utilizing a broad spectrum of parallel hardware. © 1995 American Institute of Physics.
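
The explicit message-passing style described above, in schematic form (our mpi4py sketch, not the authors' PVM code): each rank owns a slab of the mesh and exchanges one-cell halo boundaries with its neighbors before each update.

```python
# Schematic 1D domain decomposition with halo exchange.
# Run with e.g. `mpirun -n 4 python halo.py` (requires mpi4py).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
up = rank - 1 if rank > 0 else MPI.PROC_NULL           # PROC_NULL makes the
down = rank + 1 if rank < size - 1 else MPI.PROC_NULL  # boundary a no-op

u = np.full(12, float(rank))  # [ghost | 10 interior cells | ghost]

# Send first interior cell up, receive bottom ghost from below; and vice versa.
comm.Sendrecv(u[1:2], dest=up, recvbuf=u[11:12], source=down)
comm.Sendrecv(u[10:11], dest=down, recvbuf=u[0:1], source=up)

u[1:11] = 0.5 * (u[0:10] + u[2:12])  # one explicit smoothing update
print(rank, u[1], u[10])
```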

Proceedings Article
20 Aug 1995
TL;DR: Early experiences are presented with a prototype exploratory data analysis environment, CONQUEST, designed to provide content-based access to massive scientific datasets, together with several associated feature extraction algorithms implemented on MPP platforms.

Abstract: The important scientific challenge of understanding global climate change is one that clearly requires the application of knowledge discovery and data-mining techniques on a massive scale. Advances in parallel supercomputing technology, enabling high-resolution modeling, as well as in sensor technology, allowing data capture on an unprecedented scale, conspire to overwhelm present-day analysis approaches. We present here early experiences with a prototype exploratory data analysis environment, CONQUEST, designed to provide content-based access to such massive scientific datasets. CONQUEST (CONtent-based Querying in Space and Time) employs a combination of workstations and massively parallel processors (MPPs) to mine geophysical datasets possessing a prominent temporal component. It is designed to enable complex multi-modal interactive querying and knowledge discovery, while simultaneously coping with the extraordinary computational demands posed by the scope of the datasets involved. After outlining a working prototype, we concentrate here on the description of several associated feature extraction algorithms implemented on MPP platforms, together with some typical results.

Journal ArticleDOI
TL;DR: Simple skeleton particle-in-cell codes designed for massively parallel computers are described; they are used to develop new algorithms and to evaluate new parallel computers.

Patent
04 Apr 1995
TL;DR: In this article, a parallel computing system comprising N blocks of processors, where N is an integer greater than 1, is presented. Each block of the N blocks contains M processors, each of which includes an arithmetic logic unit, a local memory and an input/output (I/O) interface.
Abstract: A parallel computing system comprising N blocks of processors, where N is an integer greater than 1. Each block of the N blocks of processors contains M processors, where M is an integer greater than 1. Each processor includes an arithmetic logic unit (ALU), a local memory and an input/output (I/O) interface. The computing system also contains a control means, connected to each of the M processors, for providing identical instructions to each of the M processors, and a host means, coupled to each of the control means within the N blocks of processors. The host means selectively organizes the control means of each of the N blocks of M processors into at least two groups of P blocks of M processors, P being an integer less than or equal to N. In operation, the host means causes the control means within each group of P blocks of M processors to provide each group of P blocks of M processors respectively different identical processor instructions. To facilitate communications amongst the processors, the parallel computing system includes an interprocessor communications channel that selectively interconnects the processors.

Proceedings ArticleDOI
01 Aug 1995
TL;DR: This paper addresses the problem of optimizing throughput in task pipelines and presents two new solution algorithms; the first, based on dynamic programming, finds the optimal mapping of k tasks onto P processors in O(P^4 k^2) time.

Abstract: Many applications in a variety of domains including digital signal processing, image processing and computer vision are composed of a sequence of tasks that act on a stream of input data sets in a pipelined manner. Recent research has established that these applications are best mapped to a massively parallel machine by dividing the tasks into modules and assigning a subset of the available processors to each module. This paper addresses the problem of optimally mapping such applications onto a massively parallel machine. We formulate the problem of optimizing throughput in task pipelines and present two new solution algorithms. The formulation uses a general and realistic model for inter-task communication, takes memory constraints into account, and addresses the entire problem of mapping, which includes clustering tasks into modules, assignment of processors to modules, and possible replication of modules. The first algorithm is based on dynamic programming and finds the optimal mapping of k tasks onto P processors in O(P^4 k^2) time. We also present a heuristic algorithm that is linear in the number of processors and establish with theoretical and practical results that the solutions obtained are optimal in practical situations. The entire framework is implemented as an automatic mapping tool for the Fx parallelizing compiler for High Performance Fortran. We present experimental results that demonstrate the importance of choosing a good mapping and show that the methods presented yield efficient mappings and predict optimal performance accurately.
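
A stripped-down version of the dynamic-programming formulation (our assumptions: contiguous modules, ideal speedup, no communication or memory terms, which is why the complexity differs from the paper's O(P^4 k^2) bound): minimize the bottleneck module time, which maximizes pipeline throughput.

```python
# Toy DP for mapping a chain of k tasks onto P processors: partition the
# chain into contiguous modules and give each module some processors so the
# slowest module (the pipeline bottleneck) is as fast as possible.
from functools import lru_cache

work = [4.0, 1.0, 3.0, 2.0]  # per-input cost of each task in the chain
P = 5                        # processors available

prefix = [0.0]
for w in work:
    prefix.append(prefix[-1] + w)

@lru_cache(maxsize=None)
def best(j, p):
    """Min bottleneck time for tasks 0..j-1 on p processors."""
    if j == 0:
        return 0.0
    if p == 0:
        return float("inf")
    return min(
        max(best(i, p - q), (prefix[j] - prefix[i]) / q)
        for i in range(j)          # last module = tasks i..j-1
        for q in range(1, p + 1)   # processors given to that module
    )

print(best(len(work), P))  # -> 2.0, matching the total-work/P lower bound
```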

Journal ArticleDOI
TL;DR: In this article, the authors discuss aspects of data set size and computational feasibility for general classes of algorithms in the context of CPU performance, memory size, hard disk capacity, screen resolution and massively parallel architectures.
Abstract: Recently, Huber offered a taxonomy of data set sizes ranging from tiny (10^2 bytes) to huge (10^10 bytes). This taxonomy is particularly appealing because it quantifies the meaning of tiny, small, medium, large, and huge. Indeed, some investigators consider 300 small and 10,000 large while others consider 10,000 small. In Huber's taxonomy, most statistical and visualization techniques are computationally feasible with tiny data sets. With larger data sets, however, computers run out of computational horsepower and graphics displays run out of resolution fairly quickly. In this article, I discuss aspects of data set size and computational feasibility for general classes of algorithms in the context of CPU performance, memory size, hard disk capacity, screen resolution and massively parallel architectures. I discuss some strategies such as recursive formulations that mitigate the impact of size. I also discuss the potential for scalable parallelization that will mitigate the effects of computational ...
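
One concrete instance of the recursive-formulation strategy mentioned above: Welford's one-pass updates deliver the mean and variance in O(1) memory, so the computation never requires the data set to fit in RAM or to be revisited.

```python
# Welford's recursive (one-pass, constant-memory) mean and variance.
def welford(stream):
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n            # recursive mean update
        m2 += delta * (x - mean)     # accumulates sum of squared deviations
    return mean, m2 / (n - 1) if n > 1 else float("nan")

print(welford(iter(range(1_000_000))))  # works on a pure stream, no storage
```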

Book ChapterDOI
25 Apr 1995
TL;DR: CARMI, a resource management system, aimed at allowing a parallel application to make use of all available computing power, and WoDi which provides a simple interface for writing master-workers programs in a dynamic resource environment.
Abstract: In every production parallel processing environment, the set of resources potentially available to an application fluctuate due to changes in the load on the system. This is true for clusters of workstations which are an increasingly popular platform for parallel computing. Today's parallel programming environments have largely succeeded in making the communication aspect of parallel programming much easier, but they have not provided adequate resource management services which are needed to adapt to such changes in availability. To fill this need, we have developed CARMI, a resource management system, aimed at allowing a parallel application to make use of all available computing power. CARMI permits an application to grow as new resources become available, and shrink when resources are reclaimed. Building upon CARMI, we have also developed WoDi which provides a simple interface for writing master-workers programs in a dynamic resource environment. Both CARMI and WoDi are operational, and have been used on a pool of more than 200 workstations managed by the Condor batch system. Experience with the two systems has shown them to be easy to use, and capable of providing large numbers of cycles to parallel applications even in a real-life production environment in which no resources are dedicated to parallel processing.
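
The master-workers pattern that WoDi packages, in skeleton form (invented names; not the CARMI/WoDi API): workers pull tasks from a shared queue, so they can join or leave between tasks, which is exactly the property that matters when resource availability fluctuates.

```python
# Master-workers skeleton: the master fills a queue; workers consume until
# they receive a sentinel, which models a resource being reclaimed.
import multiprocessing as mp

def worker(tasks, results):
    while True:
        item = tasks.get()
        if item is None:        # sentinel: this worker is being reclaimed
            break
        results.put((item, item * item))

if __name__ == "__main__":
    tasks, results = mp.Queue(), mp.Queue()
    for i in range(20):
        tasks.put(i)
    pool = [mp.Process(target=worker, args=(tasks, results)) for _ in range(4)]
    for p in pool:
        p.start()
    for _ in pool:
        tasks.put(None)         # one sentinel per current worker
    print(sorted(results.get() for _ in range(20)))
    for p in pool:
        p.join()
```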

Journal ArticleDOI
01 Sep 1995
TL;DR: This paper outlines a general methodology for the data-parallel implementation of spectral methods on massively parallel machines with distributed memory and shows that very high performance can be obtained on a wide range of massively parallel architectures.

Abstract: Here we have demonstrated the possibility of very high performance in the implementation of a global spectral methodology on a massively parallel architecture with distributed memory. Spectral simulations of channel flow and thermal convection in a three-dimensional Cartesian geometry have yielded very high performance: up to 26 Gflops on a 512-node CM-5. In general, implementation of spectral methodology in parallel processors with distributed memory requires nonlocal interprocessor data transfer that is not restricted to being between nearest neighbors. In spite of their increased communication overhead, better performance is possible in global methodologies owing to their dense matrix operations and organized data communication. In this paper we outline a general methodology for the data-parallel implementation of spectral methods on massively parallel machines with distributed memory. Following the steps presented here, very high performance can be obtained on a wide variety of massively parallel architectures.
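
The organized, nonlocal data movement referred to above is often realized with the transpose method, sketched here as a single-process numpy analogue (an assumed illustration, not the paper's code): transform the locally stored axis, globally transpose so the other axis becomes local, then transform again.

```python
# Transpose-method sketch for a 2D spectral transform. On a distributed
# machine the transpose is an all-to-all; here np.transpose stands in.
import numpy as np

field = np.random.default_rng(2).standard_normal((8, 8))

step1 = np.fft.fft(field, axis=1)     # axis 1 is local: transform it
step2 = step1.T                       # stands in for the global all-to-all
spectrum = np.fft.fft(step2, axis=1)  # former axis 0 is now local

# Same result as the direct 2D transform, up to the final transpose.
assert np.allclose(spectrum.T, np.fft.fft2(field))
print("transpose-based transform matches fft2")
```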

Proceedings ArticleDOI
25 Sep 1995
TL;DR: This paper presents the vision-based road detection system currently operational on the MOB-LAB land vehicle, based on a full-custom low-cost massively parallel system that achieves real-time performance in the processing of image sequences thanks to an extremely efficient implementation of the algorithm.

Abstract: This paper presents the vision-based road detection system currently operational on the MOB-LAB land vehicle. Based on a full-custom low-cost massively parallel system, it achieves real-time performance (≈17 Hz) in the processing of image sequences, due to the extremely efficient implementation of the algorithm. Assuming a flat road and a completely known set of acquisition parameters (camera position, orientation, optics), the system is capable of detecting road markings on structured roads even in extremely severe shadow conditions.

Journal ArticleDOI
TL;DR: This work investigates the problem of evaluating Fortran 90-style array expressions on massively parallel distributed-memory machines and presents algorithms based on dynamic programming that solve the embedding problem optimally for several communication cost metrics: multidimensional grids and rings, hypercubes, fat-trees, and the discrete metric.
Abstract: We investigate the problem of evaluating Fortran 90-style array expressions on massively parallel distributed-memory machines. On such a machine, an elementwise operation can be performed in constant time for arrays whose corresponding elements are in the same processor. If the arrays are not aligned in this manner, the cost of aligning them is part of the cost of evaluating the expression tree. The choice of where to perform the operation then affects this cost. We describe the communication cost of the parallel machine theoretically as a metric space; we model the alignment problem as that of finding a minimum-cost embedding of the expression tree into this space. We present algorithms based on dynamic programming that solve the embedding problem optimally for several communication cost metrics: multidimensional grids and rings, hypercubes, fat-trees, and the discrete metric. We also extend our approach to handle operations that change the shape of the arrays.
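
A minimal instance of the tree dynamic program for the simplest of the listed metrics, the discrete metric (realigning an array costs 1, staying put costs 0); the paper's algorithms handle the richer grid, hypercube and fat-tree metrics the same way, with a more elaborate distance term.

```python
# Tree DP for the discrete metric: cost[a] is the cheapest way to make the
# subtree's value available at alignment a.
def align_cost(node, alignments):
    if node[0] == "leaf":
        home = node[1]                  # where this array already lives
        return {a: 0 if a == home else 1 for a in alignments}
    _, left, right = node               # an elementwise binary op
    lc = align_cost(left, alignments)
    rc = align_cost(right, alignments)
    def deliver(c):                     # operand delivered to alignment a:
        floor = min(c.values()) + 1     # either already there, or one move
        return {a: min(c[a], floor) for a in alignments}
    dl, dr = deliver(lc), deliver(rc)
    return {a: dl[a] + dr[a] for a in alignments}

# (A + B) + C with A, B aligned at 0 and C at 1.
tree = ("op", ("op", ("leaf", 0), ("leaf", 0)), ("leaf", 1))
print(align_cost(tree, alignments=[0, 1]))  # -> {0: 1, 1: 1}: one move suffices
```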

Journal ArticleDOI
TL;DR: In this article, three-dimensional, time-dependent features of melt flows which occur during the Czochralski growth of oxide crystals are analyzed using a theoretical bulk-flow model.

Proceedings ArticleDOI
Peter M. Kogge, T. Sunaga, H. Miyataka, K. Kitamura, Eric E. Retter
27 Mar 1995
TL;DR: The basic chip technology and organization, some projections on the future of EXECUBE-like PIM chips, and finally some lessons to be learned as to why this technology should radically affect the way the authors ought to think about computer architecture are overviewed.

Abstract: A new 5 V 0.8 μm CMOS technology merges 100 K custom circuits and 4.5 Mb DRAM onto a single die that supports both high density memory and significant computing logic. One of the first chips built with this technology implements a unique Processor-In-Memory (PIM) computer architecture termed EXECUBE and has 8 separate 25 MHz CPU macros and 16 separate 32 K × 9 b DRAM macros on a single die. These macros are organized together to provide a single part type for scalable massively parallel processing applications, particularly embedded ones where minimal glue logic is desired. Each chip delivers 50 Mips of performance at 2.7 W. This paper overviews the basic chip technology and organization, some projections on the future of EXECUBE-like PIM chips, and finally some lessons to be learned as to why this technology should radically affect the way we ought to think about computer architecture.

Proceedings ArticleDOI
25 Apr 1995
TL;DR: This paper is the first to present a parallelization of a highly efficient best-first branch-and-bound algorithm to solve large symmetric traveling salesman problems on a massively parallel computer containing 1024 processors.
Abstract: This paper is the first to present a parallelization of a highly efficient best-first branch-and-bound algorithm to solve large symmetric traveling salesman problems on a massively parallel computer containing 1024 processors. The underlying sequential branch-and-bound algorithm is based on 1-tree relaxation. The parallelization of the branch-and-bound algorithm is fully distributed. Every processor performs the same sequential algorithm but on a different part of the solution tree. To distribute subproblems among the processors we use a new direct-neighbor dynamic load-balancing strategy. The general principle can be applied to all other branch-and-bound algorithms, leading to an "automatic" parallelization. At present we can efficiently solve traveling salesman problems up to a size of 318 cities on networks of up to 1024 transputers. On hard problems we achieve an almost linear speed-up.
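
A skeleton of the best-first branch and bound each processor runs (our sketch: a cheap minimum-outgoing-edge bound stands in for the paper's 1-tree relaxation, and the direct-neighbor load balancing between processors is not shown).

```python
# Best-first branch and bound for a tiny asymmetric TSP instance.
import heapq

D = [[0, 2, 9, 10],
     [1, 0, 6, 4],
     [15, 7, 0, 8],
     [6, 3, 12, 0]]
n = len(D)
# Lower bound ingredient: every city still to be departed costs at least
# its cheapest outgoing edge.
min_out = [min(d for j, d in enumerate(row) if j != i)
           for i, row in enumerate(D)]

best_cost, best_tour = float("inf"), None
heap = [(sum(min_out), 0, (0,))]  # (lower bound, cost so far, partial tour)
while heap:
    bound, cost, tour = heapq.heappop(heap)
    if bound >= best_cost:
        continue  # pruned: cannot beat the incumbent
    if len(tour) == n:
        total = cost + D[tour[-1]][0]  # close the cycle
        if total < best_cost:
            best_cost, best_tour = total, tour
        continue
    for city in range(n):
        if city not in tour:
            c = cost + D[tour[-1]][city]
            remaining = sum(min_out[i] for i in range(n) if i not in tour)
            heapq.heappush(heap, (c + remaining, c, tour + (city,)))

print(best_cost, best_tour)  # -> 21 (0, 2, 3, 1)
```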

Journal ArticleDOI
01 Oct 1995
TL;DR: A partitioning strategy for the message-passing version that significantly reduces memory requirements and increases model speed is developed; the T3D and CM-5 are found to execute the "large" model version at roughly the same speed.
Abstract: A two-pronged effort to convert a recently developed ocean circulation model written in Fortran-77 for execution on massively parallel computers is described. A data-parallel version was developed for the CM-5 manufactured by Thinking Machines, Inc., while a message-passing version was developed for both the Cray T3D and the Silicon Graphics ONYX workstation. Since the time differentiation scheme in the ocean model is fully explicit and does not require solution of elliptic partial differential equations, adequate machine utilization has been achieved without major changes to the original algorithms. We developed a partitioning strategy for the message passing version that significantly reduces memory requirements and increases model speed. On a per-node basis (a T3D node is one Alpha processor, a CM-5 node is one Sparc chip and four vector units), the T3D and CM-5 are found to execute our “large” model version consisting of 511 × 511 horizontal mesh points at roughly the same speed.

Journal ArticleDOI
TL;DR: An iterative method for solving large linear systems with sparse symmetric positive definite matrices on massively parallel computers is suggested, based on the Factorized Sparse Approximate Inverse (FSAI) preconditioning of ‘parallel’ CG iterations.
Abstract: An iterative method for solving large linear systems with sparse symmetric positive definite matrices on massively parallel computers is suggested. The method is based on the Factorized Sparse Approximate Inverse (FSAI) preconditioning of ‘parallel’ CG iterations. Efficiency of a concurrent implementation of the FSAI-CG iterations is analyzed for a model hypercube, and an estimate of the optimal hypercube dimension is derived. For finite element applications, two strategies for selecting the preconditioner sparsity pattern are suggested. A high convergence rate of the resulting iterations is demonstrated numerically for the 3D equilibrium equations for linear elastic orthotropic materials approximated using both h- and p-versions of the FEM.