
Showing papers by "Jesús Labarta" published in 1996


Book ChapterDOI
26 Aug 1996
TL;DR: An environment whose aim is to aid in the development and tuning of message passing applications before actually running them in a real system with a large number of processors is described.
Abstract: This paper describes an environment whose aim is to aid in the development and tuning of message passing applications before actually running them in a real system with a large number of processors. Our objective is not to eliminate tests on real machines but to be able to focus them in a more selective way and thereby minimize their number. The environment presented in this paper consists of three closely integrated tools: an instrumented communication library, a trace driven simulator (Dimemas) and a visualization/analysis tool (Paraver).
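For intuition, here is a minimal sketch of the trace-driven idea behind a simulator like Dimemas: replay a recorded message trace against a simple latency/bandwidth network model to predict per-process completion times. The trace record format, the model constants, and the `replay` function below are illustrative assumptions, not the actual Dimemas input format or algorithm.

```python
# Hypothetical trace record: (rank, op, peer, bytes, cpu_time).
# Communication time is predicted with a simple linear
# latency/bandwidth model, in the spirit of trace-driven simulation.
LATENCY = 25e-6      # assumed network latency (s)
BANDWIDTH = 10e6     # assumed bandwidth (bytes/s)

def replay(trace, nprocs):
    """Predict per-process completion times from a message trace."""
    clock = [0.0] * nprocs          # local clock of each process
    pending = {}                    # (src, dst) -> message arrival time
    # The trace is assumed globally ordered, so a send precedes its recv.
    for rank, op, peer, nbytes, cpu in trace:
        clock[rank] += cpu          # account for local computation
        if op == "send":
            pending[(rank, peer)] = clock[rank] + LATENCY + nbytes / BANDWIDTH
        elif op == "recv":
            arrival = pending.pop((peer, rank))
            clock[rank] = max(clock[rank], arrival)   # block until arrival
    return clock

# Toy two-process trace: rank 0 computes, then sends 1 MB to rank 1.
trace = [(0, "send", 1, 1_000_000, 0.010),
         (1, "recv", 0, 1_000_000, 0.002)]
print(replay(trace, 2))
```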

190 citations


Book ChapterDOI
26 Aug 1996
TL;DR: The design and implementation of a user-level thread package based on the nano-threads programming model, whose goal is to efficiently manage application parallelism at user level, is described.
Abstract: In this paper we describe the design and implementation of a user-level thread package based on the nano-threads programming model, whose goal is to efficiently manage the application parallelism at user level. Nano-thread applications work closely with the operating system to adapt quickly to resource availability.
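As a rough illustration of the user-level threading mechanism, the Python sketch below multiplexes cooperative "threads" (generators) over a ready queue entirely in user space, so context switches never enter the kernel. This is only an analogy for the concept; the nano-threads package itself is implemented at a far lower level.

```python
from collections import deque

# Toy user-level scheduler: "threads" are generators that yield at
# cooperative scheduling points, so switching between them happens
# entirely in user space, with no kernel involvement.
def worker(name, steps):
    for i in range(steps):
        print(f"{name}: step {i}")
        yield                      # cooperative yield point

def run(threads):
    ready = deque(threads)         # ready queue managed in user space
    while ready:
        t = ready.popleft()
        try:
            next(t)                # resume thread until its next yield
            ready.append(t)        # still runnable: back of the queue
        except StopIteration:
            pass                   # thread finished

run([worker("A", 2), worker("B", 3)])
```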

56 citations


Proceedings ArticleDOI
17 Nov 1996
TL;DR: A data distribution tool that automatically derives the data mapping for the arrays and the parallelization strategy for the loops in a Fortran 77 program is described, along with the quality of the solutions generated and the feasibility of the approach in terms of compilation time.
Abstract: This paper describes the design of a data distribution tool which automatically derives the data mapping for the arrays and the parallelization strategy for the loops in a Fortran 77 program. The layout generated can be static or dynamic, and the distribution is one-dimensional BLOCK or CYCLIC. The tool takes into account the control flow statements in the code in order to better estimate the behavior of the program. All the information regarding data movement and parallelism is contained in a single data structure named the Communication-Parallelism Graph (CPG). The CPG is used to model a minimal path problem in which time is the objective function to minimize. It is solved using a general purpose linear programming solver, which finds the optimal solution for the whole problem. The experimental results illustrate the quality of the solutions generated and the feasibility of the approach in terms of compilation time.
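The general shape of such a formulation is a 0-1 linear program: binary variables select one candidate layout per CPG node or phase, and the objective aggregates the estimated times. A generic statement follows (the paper's actual constraint set is CPG-specific and not reproduced here):

```latex
\[
\min_{\mathbf{x}} \; \mathbf{c}^{\mathsf{T}}\mathbf{x}
\quad \text{subject to} \quad
A\mathbf{x} \le \mathbf{b}, \qquad x_i \in \{0,1\},
\]
```

where each $x_i$ indicates whether a candidate distribution is selected and $\mathbf{c}$ collects the estimated communication and computation times.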

30 citations


Book ChapterDOI
26 Aug 1996
TL;DR: A new cooperative caching mechanism, PACA, is presented, along with a caching algorithm, LRU-Interleaved, and an aggressive prefetching algorithm, Full-File-On-Open; the caching algorithm avoids the cache coherence problem with no loss in performance.
Abstract: A new cooperative caching mechanism, PACA, is presented, along with a caching algorithm, LRU-Interleaved, and an aggressive prefetching algorithm, Full-File-On-Open. The caching algorithm is especially targeted at parallel machines running a microkernel-based operating system. It avoids the cache coherence problem with no loss in performance. Compared with another cooperative caching algorithm (N-Chance Forwarding) in this environment, LRU-Interleaved obtains better results. We also evaluate an aggressive prefetching algorithm that greatly increases read performance by taking advantage of the huge caches that cooperative caching offers.
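One plausible reading of "LRU-Interleaved" is sketched below, under stated assumptions: each file block is assigned a single home node by interleaving (block number modulo node count), and each node manages its share with plain LRU. Because every block is cached in exactly one place, no coherence protocol between caches is needed. The class, names, and mapping rule are hypothetical illustrations, not the paper's implementation.

```python
from collections import OrderedDict

class NodeCache:
    """Per-node cache managed with plain LRU (illustrative)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()          # block id -> data, LRU order

    def access(self, block, fetch):
        if block in self.blocks:
            self.blocks.move_to_end(block)   # LRU hit: mark most recent
            return self.blocks[block]
        data = fetch(block)                  # miss: read from disk/server
        self.blocks[block] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used
        return data

def home_node(block, nnodes):
    return block % nnodes                    # interleaved placement rule

nodes = [NodeCache(capacity=2) for _ in range(4)]

def read(block):
    # A block is only ever cached at its home node, so there is
    # never more than one cached copy to keep coherent.
    return nodes[home_node(block, 4)].access(block, lambda b: f"data{b}")

print(read(5), read(5))   # second access hits node 1's cache
```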

23 citations


Proceedings ArticleDOI
23 Oct 1996
TL;DR: This work presents an approach to automatically derive static or dynamic data distribution strategies for the arrays used in a program using a general purpose linear 0-1 integer programming solver and finds the optimal solution for the problem for one-dimensional array distributions.
Abstract: Physically distributed-memory multiprocessors are becoming popular, and data distribution and loop parallelization are aspects that a parallelizing compiler has to consider in order to obtain efficiency from the system. The cost of accessing local and remote data can differ by one or several orders of magnitude, and this can dramatically affect the performance of the system. It is therefore desirable to free the programmer from having to consider the low-level details of the target architecture, program explicit processes, or specify interprocess communication. We present an approach to automatically derive static or dynamic data distribution strategies for the arrays used in a program. All the information required about data movement and parallelism is contained in a single data structure, called the Communication-Parallelism Graph (CPG). The problem is modeled and solved using a general purpose linear 0-1 integer programming solver, which allows us to find the optimal solution for the problem for one-dimensional array distributions. We also show the feasibility of this approach in terms of compilation time and quality of the solutions generated.
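To make the 0-1 formulation concrete, here is a toy model in the same spirit: choose BLOCK or CYCLIC for each of two program phases so that estimated phase cost plus a redistribution penalty is minimized. The costs, phase names, and remap penalty are invented for illustration, and the sketch uses the PuLP modeling library rather than the solver used in the paper.

```python
import pulp   # pip install pulp

# Invented per-phase costs for each candidate layout.
cost = {("p1", "BLOCK"): 10, ("p1", "CYCLIC"): 14,
        ("p2", "BLOCK"): 12, ("p2", "CYCLIC"): 6}
REMAP = 5     # assumed cost of redistributing the array between phases

prob = pulp.LpProblem("data_distribution", pulp.LpMinimize)
x = pulp.LpVariable.dicts("x", cost.keys(), cat="Binary")
r = pulp.LpVariable("remap", cat="Binary")   # 1 if the layout changes

# Exactly one layout per phase.
for p in ("p1", "p2"):
    prob += pulp.lpSum(x[(p, d)] for d in ("BLOCK", "CYCLIC")) == 1

# r must be 1 whenever the two phases pick different layouts.
for d in ("BLOCK", "CYCLIC"):
    prob += x[("p1", d)] - x[("p2", d)] <= r
    prob += x[("p2", d)] - x[("p1", d)] <= r

# Objective: total estimated time, including any remapping penalty.
prob += pulp.lpSum(cost[k] * x[k] for k in cost) + REMAP * r
prob.solve(pulp.PULP_CBC_CMD(msg=False))
for k in cost:
    if x[k].value() == 1:
        print(k)   # here the static all-CYCLIC layout wins (cost 20)
```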

12 citations


Proceedings ArticleDOI
24 Jan 1996
TL;DR: The parallelizing algorithm solves the important problem of deciding which set of transformations to apply in order to maximize the degree of parallelism (the number of parallel loops within a loop nest), and a way of generating efficient transformed code that exploits coarse-grain parallelism on a MIMD system is presented.
Abstract: The paper extends the framework of linear loop transformations by adding a new nonlinear step to the transformation process. The current framework of linear loop transformations cannot identify a significant fraction of parallelism; for this reason, we present a method to complement it with some basic transformations in order to extract the maximum loop parallelism in perfectly nested loops with tight recurrences in the dependence graph. The parallelizing algorithm solves the important problem of deciding which set of transformations to apply in order to maximize the degree of parallelism (the number of parallel loops within a loop nest), and presents a way of generating efficient transformed code that exploits coarse-grain parallelism on a MIMD system.
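A textbook instance of such a linear transformation is loop skewing. For a doubly nested loop with dependence distance vectors $(1,0)$ and $(0,1)$, the unimodular matrix below maps both dependences so that the outer loop carries them all, leaving the inner loop fully parallel (the classic wavefront). This standard example is for orientation only and is not taken from the paper:

```latex
\[
T = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}, \qquad
T\begin{pmatrix} 1 \\ 0 \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \quad
T\begin{pmatrix} 0 \\ 1 \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \end{pmatrix}.
\]
```

Both transformed distance vectors have a positive first component, so all dependences are carried by the outermost loop and the inner loop can run in parallel.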

5 citations


Journal ArticleDOI
TL;DR: An automatic data distribution method is described which deals with both the alignment and the distribution problems in a single optimization phase, as opposed to solving these two interdependent problems sequentially as done by previous work.
Abstract: This paper describes an automatic data distribution method which deals with both the alignment and the distribution problems in a single optimization phase, as opposed to solving these two interdependent problems sequentially as done by previous work. The core of this work is the Communication-Parallelism Graph, which describes the relationships among array dimensions of the same and different array references regarding communication and parallelism. The overall data distribution problem is then formulated as a linear 0-1 integer programming problem, where the objective function to be minimized is the total execution time. The solution is static in the sense that the layout of the arrays does not change during the execution of the program. We also show the feasibility of this approach in terms of compilation time and quality of the solutions generated.

4 citations


Book ChapterDOI
08 Aug 1996
TL;DR: The main features of the automatic parallelization and data distribution research tool are described, and the performance of the parallelization strategies generated is shown.
Abstract: Shared-memory multiprocessor systems can achieve high performance levels when appropriate work parallelization and data distribution are performed. These two actions are not independent, and decisions have to be taken in a unified way, trying to minimize execution time and data movement costs. The first goal is achieved by parallelizing loops (the main components suitable for parallel execution in scientific codes) and assigning work to processors with good load balancing in mind. The second goal is achieved when data is stored in the cache memories of processors so as to minimize both true and false sharing of cache lines. This paper describes the main features of our automatic parallelization and data distribution research tool and shows the performance of the parallelization strategies it generates. The tool (named PDDT) accepts programs written in Fortran 77 and generates directives of shared-memory programming models (like Power Fortran from SGI or Exemplar from Convex).

4 citations


Book ChapterDOI
07 Oct 1996
TL;DR: A parallel library (PLS) to solve linear systems arising from non-overlapped Domain Decomposition methods is described; preconditioned Krylov subspace iterative methods are considered as linear solvers.
Abstract: In this paper we describe a parallel library (PLS) to solve linear systems arising from non-overlapped Domain Decomposition methods. Preconditioned Krylov subspace iterative methods are considered as linear solvers. PLS has been implemented on top of PVM using FORTRAN 77, together with an additional library which allows the use of dynamic memory allocation.
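As an example of the kind of solver such a library provides, here is a generic preconditioned conjugate gradient in Python/NumPy with a Jacobi (diagonal) preconditioner. It is a self-contained sketch of the textbook algorithm; PLS's actual preconditioners, domain-decomposition structure, and FORTRAN 77/PVM implementation are not reproduced.

```python
import numpy as np

def pcg(A, b, tol=1e-8, maxit=200):
    """Textbook preconditioned conjugate gradient for SPD systems."""
    M_inv = 1.0 / np.diag(A)              # Jacobi preconditioner
    x = np.zeros_like(b)
    r = b - A @ x                         # initial residual
    z = M_inv * r                         # preconditioned residual
    p = z.copy()                          # initial search direction
    rz = r @ z
    for _ in range(maxit):
        Ap = A @ p
        alpha = rz / (p @ Ap)             # step length along p
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M_inv * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p         # new conjugate direction
        rz = rz_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])    # small SPD test matrix
b = np.array([1.0, 2.0])
print(pcg(A, b))                           # ~ [0.0909, 0.6364]
```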

3 citations


Book ChapterDOI
15 Apr 1996
TL;DR: The parallel version of PERMAS not only provides a cost-effective and fast solution for medium-size Finite Element simulations, but also allows extremely large industrial examples to be solved that until now were restricted to large supercomputers at very high cost.
Abstract: The general purpose Finite Element system PERMAS has been ported to highly parallel computer architectures within the scope of the ESPRIT project EUROPORT-1. We reported on the technical and theoretical background at the last HPCN conference. The parallelization rates and scalability achieved with this strategy are shown using both industrially relevant and synthetic scalable examples. The behavior of the parallel version is studied on a parallel machine with a high-speed communication network. The impact of the parallel version is not only a cost-effective and fast solution for medium-size Finite Element simulations, but also that extremely large industrial examples may be solved that until now were restricted to large supercomputers at very high cost. The results are discussed in view of the underlying approach and data structures.

3 citations


Book ChapterDOI
19 Aug 1996
TL;DR: A distributed parallel implementation of a Finite Element simulation used in the ophthalmic optics industry is described; non-overlapped domain decomposition methods are used to perform the parallelization on a cluster of workstations.
Abstract: We describe a distributed parallel implementation of a Finite Element simulation used in the ophthalmic optics industry. We use non-overlapped domain decomposition methods to perform the parallelization on a cluster of workstations. Different numerical techniques were implemented, and the code was tuned with a performance analyzer for distributed parallel programs.