Book

Compiling sequential programs for distributed memory parallel computers with Pandore II

TL;DR: The language features related to data distribution are presented together with the compiling techniques, and the programming methodology is illustrated by the parallelization of the Gram-Schmidt algorithm.
Abstract: Parallelization of programs for distributed memory parallel computers is always difficult because of many low-level problems due to a programming model based on parallel processes. The Pandore system has been designed to allow programmers to maintain a sequential programming style. The Pandore compiler generates parallel processes according to a data decomposition specified by the programmer. For this purpose, new constructs have been added to an existing sequential language. Code generation relies on the SPMD model and on the locality of writes rule. A prototype has shown the feasibility of this approach. The second version of Pandore we are currently implementing is described here. We present the language features related to data distribution together with the compiling techniques. The programming methodology is illustrated by the parallelization of the Gram-Schmidt algorithm.
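As a rough illustration of the code-generation scheme the abstract describes, the sketch below shows how an owner-computes ("locality of writes") guard could look for a loop over a BLOCK-distributed array; the sizes, distribution, and loop body are illustrative assumptions, not Pandore II's actual generated code.

```python
# Minimal sketch (assumed names and sizes): SPMD execution of a sequential
# loop under the locality-of-writes rule with a BLOCK distribution.
N, P = 16, 4                         # array size and number of processes (assumed)
BLOCK = N // P                       # block size; assume P divides N

def owner(i):
    return i // BLOCK                # process owning element i

A = [float(i) for i in range(N)]     # conceptually distributed arrays
B = [0.0] * N

for p in range(P):                   # simulate the P SPMD processes
    for i in range(1, N):            # the original sequential loop
        if owner(i) != p:            # guard the compiler would insert:
            continue                 # only the owner of B[i] executes the write
        # reading A[i-1] may need a value received from owner(i-1);
        # the write to B[i] is local by construction
        B[i] = A[i] + A[i - 1]
```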
Citations
Journal ArticleDOI
TL;DR: It is argued that an ideal model should be easy to program, should have a software development methodology, should be architecture-independent, should be easy to understand, should guarantee performance, and should provide accurate information about the cost of programs.
Abstract: We survey parallel programming models and languages using six criteria to assess their suitability for realistic portable parallel programming. We argue that an ideal model should be easy to program, should have a software development methodology, should be architecture-independent, should be easy to understand, should guarantee performance, and should provide accurate information about the cost of programs. These criteria reflect our belief that developments in parallelism must be driven by a parallel software industry based on portability and efficiency. We consider programming models in six categories, depending on the level of abstraction they provide. Those that are very abstract conceal even the presence of parallelism at the software level. Such models make software easy to build and port, but efficient and predictable performance is usually hard to achieve. At the other end of the spectrum, low-level models make all of the messy issues of parallel programming explicit (how many threads, how to place them, how to express communication, and how to schedule communication), so that software is hard to build and not very portable, but is usually efficient. Most recent models are near the center of this spectrum, exploring the best tradeoffs between expressiveness and performance. A few models have achieved both abstractness and efficiency. Both kinds of models raise the possibility of parallelism as part of the mainstream of computing.

410 citations

Proceedings ArticleDOI
01 Jun 1993
TL;DR: It is shown that the problems of communication code generation, local memory management, message aggregation and redundant data communication elimination can all be solved by projecting polyhedra represented by sets of inequalities onto lower dimensional spaces.
Abstract: This paper presents several algorithms to solve code generation and optimization problems specific to machines with distributed address spaces. Given a description of how the computation is to be partitioned across the processors in a machine, our algorithms produce an SPMD (single program multiple data) program to be run on each processor. Our compiler generates the necessary receive and send instructions, optimizes the communication by eliminating redundant communication and aggregating small messages into large messages, allocates space locally on each processor, and translates global data addresses to local addresses. Our techniques are based on an exact data-flow analysis on individual array element accesses. Unlike data dependence analysis, this analysis determines if two dynamic instances refer to the same value, and not just to the same location. Using this information, our compiler can handle more flexible data decompositions and find more opportunities for communication optimization than systems based on data dependence analysis. Our technique is based on a uniform framework, where data decompositions, computation decompositions and the data flow information are all represented as systems of linear inequalities. We show that the problems of communication code generation, local memory management, message aggregation and redundant data communication elimination can all be solved by projecting polyhedra represented by sets of inequalities onto lower dimensional spaces.
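The last sentence of this abstract is the key operation: projecting a polyhedron given by linear inequalities onto a lower-dimensional space. A standard way to perform such a projection is Fourier-Motzkin elimination; the toy sketch below is an illustrative assumption, not the paper's actual algorithm or data structures, and eliminates one variable from a small system.

```python
# Toy Fourier-Motzkin elimination: project {x : A x <= b} onto the remaining
# variables by eliminating variable k. A constraint is (a, b), meaning
# sum_j a[j] * x[j] <= b.
def eliminate(constraints, k):
    pos  = [c for c in constraints if c[0][k] > 0]   # upper bounds on x_k
    neg  = [c for c in constraints if c[0][k] < 0]   # lower bounds on x_k
    rest = [c for c in constraints if c[0][k] == 0]  # x_k absent
    out = list(rest)
    for ap, bp in pos:
        for an, bn in neg:
            sp, sn = -an[k], ap[k]   # positive scale factors so x_k cancels
            a = [sp * ap[j] + sn * an[j] for j in range(len(ap))]
            out.append((a, sp * bp + sn * bn))
    return out

# Example with x = (i, j): 0 <= i <= 7 and i <= j <= 7.
# Eliminating j (index 1) leaves constraints equivalent to 0 <= i <= 7.
cons = [([-1, 0], 0), ([1, 0], 7), ([1, -1], 0), ([0, 1], 7)]
print(eliminate(cons, 1))
```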

241 citations

01 Jan 1997
TL;DR: The Stanford SUIF interprocedural parallelizer is capable of detecting coarser granularity of parallelism in sequential scientific applications than previously possible and a framework based on systems of linear inequalities for developing compiler algorithms is introduced.
Abstract: Shared-memory multiprocessors, built out of the latest microprocessors, are becoming a widely available class of computationally powerful machines. These affordable multiprocessors can potentially deliver supercomputer-like performance to the general public. To effectively harness the power of these machines it is important to find all the available parallelism in programs. The Stanford SUIF interprocedural parallelizer we have developed is capable of detecting coarser granularity of parallelism in sequential scientific applications than previously possible. Specifically, it can parallelize loops that span numerous procedures and hundreds of lines of code, frequently requiring modifications to array data structures such as array privatization. Measurements from several standard benchmark suites demonstrate that aggressive interprocedural analyses can substantially advance the capability of automatic parallelization technology. However, locating parallelism is not sufficient for achieving high performance. It is critical to make effective use of the memory hierarchy. In parallel applications, false sharing and cache conflicts between processors can significantly reduce performance. We have developed the first compiler that automatically performs a full suite of data transformations (a combination of transposing, strip-mining and padding). The performance of many benchmarks improves drastically after the data transformations. We introduce a framework based on systems of linear inequalities for developing compiler algorithms. Many of the whole-program analyses and aggressive optimizations in our compiler employ this framework. Using this framework, general solutions to many compiler problems can be found systematically.
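As one concrete illustration of the transformations named above, the sketch below pads per-processor accumulators so that each one occupies its own cache line, which removes false sharing; the cache-line size and the accumulator layout are assumptions for illustration, not the SUIF compiler's actual transformation.

```python
import numpy as np

P = 4                          # number of processors (assumed)
LINE = 64                      # assumed cache-line size in bytes
PAD = LINE // 8                # float64 slots per cache line

# Unpadded, partial sums for different processors would share a cache line
# (false sharing); padded, each processor's slot starts on its own line.
partial = np.zeros((P, PAD))

for p in range(P):             # each "processor" updates only its own line
    partial[p, 0] += 1.0

total = partial[:, 0].sum()    # combine the per-processor results
print(total)
```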

51 citations


Cites background from "Compiling sequential programs for distributed memory parallel computers with Pandore II"

  • ...Some notable projects are Al [141], Blaze [100], Crystal [106], FORTRAN-D [85,139], Id Nouveau [123], Kali [114,98], Pandore [15], Pandore II [14], ...


  • ...Location-Centric Approach Many of the existing compilers developed for distributed memory machines have a similar basic approach to how they generate code from user-specified data decompositions [14,98,85,114,123,139]....


Book ChapterDOI
10 Oct 1994
TL;DR: This paper presents a study of the 3D layout of a particular class of graphs, the finite reachability graphs of communicating processes, and gives a conical representation which satisfies the stated requirements.
Abstract: This paper presents a study of the 3D layout of some particular graphs. These graphs are the finite reachability graphs of communicating processes. Interesting semantic information present in the graph, such as concurrency, non-determinism, and the membership of actions in processes, is explicitly used in the layout. We start with the study of deterministic processes and give a conical representation which satisfies our requirements. Then we extend our layout to non-deterministic processes.

24 citations

Proceedings Article
10 Jul 1995
TL;DR: An innovative method is presented for allocating local blocks and temporaries for received values and for managing the associated access mechanisms, so that distributed arrays can be handled efficiently.
Abstract: High Performance Fortran and other similar languages have been designed as a means to express portable data-parallel programs for distributed memory machines. As data distribution is a key feature for exploiting the parallelism of applications, a crucial point for HPF compilers is their ability to manage distributed arrays efficiently. We present in this paper an innovative method to allocate local blocks and temporaries for received values and to manage the associated access mechanisms. The performance of these access mechanisms is measured, and experimental results on the use of this array management within an existing compiler are shown.
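To make the kind of distributed-array management discussed here concrete, the sketch below shows the usual HPF-style BLOCK addressing arithmetic: each processor allocates only its own block plus a temporary slot for a received neighbour value, and global indices are translated to (owner, local offset) pairs. The sizes and the single-slot temporary are illustrative assumptions, not the allocation scheme of the paper.

```python
# Minimal sketch of BLOCK-distributed array storage and address translation.
N, P = 100, 4
BLOCK = (N + P - 1) // P              # ceiling division: HPF BLOCK size

def owner(g):
    return g // BLOCK                 # processor owning global index g

def to_local(g):
    return g % BLOCK                  # offset inside the owner's block

# Local storage per processor: the block itself plus one temporary cell
# for a value received from the left neighbour (e.g. for a[i-1] references).
local = [[0.0] * (BLOCK + 1) for _ in range(P)]

g = 42                                # a global index
p, l = owner(g), to_local(g)
local[p][l] = 3.14                    # write through the translated address
print(p, l)
```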

21 citations