
Showing papers in "Scientific Programming in 2001"


Journal ArticleDOI
TL;DR: The results demonstrate that this style of programming will not always be the most effective mechanism on SMP systems and cannot be regarded as the ideal programming model for all codes, however, significant benefit may be obtained from a mixed mode implementation.
Abstract: MPI / OpenMP mixed mode codes could potentially offer the most effective parallelisation strategy for an SMP cluster, as well as allowing the different characteristics of both paradigms to be exploited to give the best performance on a single SMP. This paper discusses the implementation, development and performance of mixed mode MPI / OpenMP applications. The results demonstrate that this style of programming will not always be the most effective mechanism on SMP systems and cannot be regarded as the ideal programming model for all codes. In some situations, however, significant benefit may be obtained from a mixed mode implementation. For example, benefit may be obtained if the parallel (MPI) code suffers from: poor scaling with MPI processes due to load imbalance or too fine a grain problem size, memory limitations due to the use of a replicated data strategy, or a restriction on the number of MPI processes combinations. In addition, if the system has a poorly optimised or limited scaling MPI implementation then a mixed mode code may increase the code performance.
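To make the structure concrete, the sketch below (our illustration, not taken from the paper) shows the typical shape of a mixed mode code: MPI decomposes the problem across SMP nodes while OpenMP parallelises the loop within each process. The problem size N and the summation kernel are placeholders chosen only for illustration.

```c
/* Minimal mixed mode MPI/OpenMP sketch (illustrative only, not from the paper).
 * Compile e.g. with: mpicc -fopenmp hybrid.c -o hybrid */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000   /* per-process problem size (assumed for illustration) */

int main(int argc, char **argv)
{
    int provided, rank, nprocs;
    double local = 0.0, global = 0.0;

    /* Request thread support: only the master thread makes MPI calls here. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* OpenMP exploits the processors within one SMP node. */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < N; i++)
        local += 1.0 / ((double)rank * N + i + 1);

    /* MPI combines the per-node partial results across the cluster. */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %f (processes = %d, threads = %d)\n",
               global, nprocs, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}
```

Requesting MPI_THREAD_FUNNELED documents that only the master thread makes MPI calls, which is the usual discipline in mixed mode codes.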

214 citations


Journal ArticleDOI
TL;DR: A "cluster-enabled" OpenMP compiler for a page-based software distributed shared memory system, SCASH, which works on a cluster of PCs and a set of directives are added to specify data mapping and loop scheduling method which schedules iterations onto threads associated with the data mapping.
Abstract: OpenMP is attracting wide-spread interest because of its easy-to-use parallel programming model for shared memory multiprocessors. We have implemented a "cluster-enabled" OpenMP compiler for a page-based software distributed shared memory system, SCASH, which works on a cluster of PCs. It allows OpenMP programs to run transparently in a distributed memory environment. The compiler transforms OpenMP programs into parallel programs using SCASH so that shared global variables are allocated at run time in the shared address space of SCASH. A set of directives is added to specify data mapping and loop scheduling method which schedules iterations onto threads associated with the data mapping. Our experimental results show that the data mapping may greatly impact on the performance of OpenMP programs in the software distributed shared memory system. The performance of some NAS parallel benchmark programs in OpenMP is improved by using our extended directives.

52 citations


Journal ArticleDOI
TL;DR: The primary target is compiler-directed software distributed shared memory systems in which aggressive compiler optimizations for software-implemented coherence schemes are crucial to obtaining good performance.
Abstract: We have developed compiler optimization techniques for explicit parallel programs using the OpenMP API. To enable optimization across threads, we designed dataflow analysis techniques in which interactions between threads are effectively modeled. Structured description of parallelism and relaxed memory consistency in OpenMP make the analyses effective and efficient. We developed algorithms for reaching definitions analysis, memory synchronization analysis, and cross-loop data dependence analysis for parallel loops. Our primary target is compiler-directed software distributed shared memory systems in which aggressive compiler optimizations for software-implemented coherence schemes are crucial to obtaining good performance. We also developed optimizations applicable to general OpenMP implementations, namely redundant barrier removal and privatization of dynamically allocated objects. Experimental results for the coherency optimization show that aggressive compiler optimizations are quite effective for a shared-write intensive program because the coherence-induced communication volume in such a program is much larger than that in shared-read intensive programs.
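The barrier-removal optimisation described above is performed by the compiler, but its effect can be shown in source form. In the hand-written sketch below (an illustration, not the authors' code), the implicit barrier after the first worksharing loop can be dropped with nowait because, under identical static schedules, the second loop only reads elements that the same thread wrote in the first.

```c
/* Illustration of the effect of redundant barrier removal (not the authors' code).
 * With static scheduling each thread gets the same iterations in both loops,
 * so the implicit barrier after the first loop is redundant and 'nowait' is safe. */
#include <omp.h>

void scale_and_accumulate(const double *a, double *b, double *c, int n)
{
    #pragma omp parallel
    {
        #pragma omp for schedule(static) nowait   /* barrier removed */
        for (int i = 0; i < n; i++)
            b[i] = 2.0 * a[i];

        #pragma omp for schedule(static)          /* same iteration-to-thread mapping */
        for (int i = 0; i < n; i++)
            c[i] = c[i] + b[i];                   /* reads only b[i] written by this thread */
    }
}
```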

36 citations


Journal ArticleDOI
TL;DR: The presented evaluation demonstrates that the environment offers significant support in general parallel tuning efforts and that the toolset facilitates many common tasks in OpenMP parallel programming in an efficient manner.
Abstract: We present our effort to provide a comprehensive parallel programming environment for the OpenMP parallel directive language. This environment includes a parallel programming methodology for the OpenMP programming model and a set of tools (Ursa Minor and InterPol) that support this methodology. Our toolset provides automated and interactive assistance to parallel programmers in time-consuming tasks of the proposed methodology. The features provided by our tools include performance and program structure visualization, interactive optimization, support for performance modeling, and performance advising for finding and correcting performance problems. The presented evaluation demonstrates that our environment offers significant support in general parallel tuning efforts and that the toolset facilitates many common tasks in OpenMP parallel programming in an efficient manner.

29 citations


Journal ArticleDOI
TL;DR: In this paper, a recursive method for the LU factorization of sparse matrices is described, and performance results show that the recursive approach may perform comparably to leading software packages for sparse matrix factorization in terms of execution time, memory usage and error estimates of the solution.
Abstract: This paper describes a recursive method for the LU factorization of sparse matrices. The recursive formulation of common linear algebra codes has proven very successful in dense matrix computations. An extension of the recursive technique for sparse matrices is presented. Performance results given here show that the recursive approach may perform comparably to leading software packages for sparse matrix factorization in terms of execution time, memory usage, and error estimates of the solution.
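For readers unfamiliar with the recursive formulation, the following dense, non-pivoting sketch (ours, not the paper's code, which targets the sparse case and uses proper pivoting and BLAS kernels) conveys the idea: factor the leading block recursively, apply triangular solves to the off-diagonal blocks, update the trailing block, and recurse on it. Column-major storage with leading dimension lda is assumed.

```c
/* Recursive LU factorization of a dense n x n matrix, without pivoting.
 * Column-major storage, leading dimension lda; A is overwritten by L\U.
 * Illustrative sketch of the recursive formulation only. */

/* Solve X * U = B for X (U: n1 x n1 upper triangular), overwriting B with X. */
static void trsm_right_upper(int n2, int n1, const double *U, int ldu,
                             double *B, int ldb)
{
    for (int j = 0; j < n1; j++) {
        for (int k = 0; k < j; k++)
            for (int i = 0; i < n2; i++)
                B[i + j*ldb] -= B[i + k*ldb] * U[k + j*ldu];
        for (int i = 0; i < n2; i++)
            B[i + j*ldb] /= U[j + j*ldu];
    }
}

/* Solve L * X = B for X (L: n1 x n1 unit lower triangular), overwriting B with X. */
static void trsm_left_unit_lower(int n1, int n2, const double *L, int ldl,
                                 double *B, int ldb)
{
    for (int j = 0; j < n2; j++)
        for (int i = 0; i < n1; i++)
            for (int k = 0; k < i; k++)
                B[i + j*ldb] -= L[i + k*ldl] * B[k + j*ldb];
}

/* A22 -= L21 * U12 (trailing n2 x n2 update). */
static void gemm_update(int n2, int n1, const double *L21, int ldl,
                        const double *U12, int ldu, double *A22, int lda)
{
    for (int j = 0; j < n2; j++)
        for (int k = 0; k < n1; k++)
            for (int i = 0; i < n2; i++)
                A22[i + j*lda] -= L21[i + k*ldl] * U12[k + j*ldu];
}

void lu_recursive(int n, double *A, int lda)
{
    if (n == 1) return;                              /* 1x1 block: nothing to factor */

    int n1 = n / 2, n2 = n - n1;
    double *A11 = A;                                 /* leading n1 x n1 block        */
    double *A21 = A + n1;                            /* below-diagonal n2 x n1 block */
    double *A12 = A + n1 * lda;                      /* right n1 x n2 block          */
    double *A22 = A + n1 * lda + n1;                 /* trailing n2 x n2 block       */

    lu_recursive(n1, A11, lda);                      /* A11 = L11 * U11   */
    trsm_right_upper(n2, n1, A11, lda, A21, lda);    /* A21 <- L21        */
    trsm_left_unit_lower(n1, n2, A11, lda, A12, lda);/* A12 <- U12        */
    gemm_update(n2, n1, A21, lda, A12, lda, A22, lda);/* A22 -= L21 * U12 */
    lu_recursive(n2, A22, lda);                      /* factor the rest   */
}
```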

21 citations


Journal ArticleDOI
TL;DR: Janet as discussed by the authors is a Java language extension and preprocessing tool that enables convenient integration of native code with Java programs and generates Java programs that execute with little or no degradation despite the flexibility and generality of the interface.
Abstract: Java is growing in appropriateness and usability for high performance computing. With this increasing adoption, issues relating to combining Java with existing codes in other languages become more important. The Java Native Interface (JNI) is portable but too inconvenient to use directly owing to its low-level API. This paper presents Janet -- a highly expressive Java language extension and preprocessing tool that enables convenient integration of native code with Java programs. The Janet methodology overcomes some of the limitations of JNI and generates Java programs that execute with little or no degradation despite the flexibility and generality of the interface.
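To see why raw JNI is regarded as inconvenient, consider the boilerplate required just to sum a Java double[] from C. This is a generic JNI example (the class name Kernels and method sum are assumed for illustration), not Janet-generated code.

```c
/* Plain JNI example (generic, not Janet output): native sum of a Java double[].
 * The Java side would declare:  public static native double sum(double[] a);
 * in an assumed class named Kernels. */
#include <jni.h>

JNIEXPORT jdouble JNICALL
Java_Kernels_sum(JNIEnv *env, jclass cls, jdoubleArray arr)
{
    jsize n = (*env)->GetArrayLength(env, arr);
    /* May copy the array; Janet-style tools hide this bookkeeping. */
    jdouble *a = (*env)->GetDoubleArrayElements(env, arr, NULL);
    jdouble s = 0.0;
    for (jsize i = 0; i < n; i++)
        s += a[i];
    /* Release without copying back, since the array was not modified. */
    (*env)->ReleaseDoubleArrayElements(env, arr, a, JNI_ABORT);
    return s;
}
```

A tool like Janet generates this kind of glue automatically, so the programmer does not manage array pinning and release by hand.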

16 citations



Journal ArticleDOI
TL;DR: A benchmark of iterative solvers for sparse matrices is presented and results on some high performance processors are given that show that performance is largely determined by memory bandwidth.
Abstract: We present a benchmark of iterative solvers for sparse matrices. The benchmark contains several common methods and data structures, chosen to be representative of the performance of a large class of methods in current use. We give results on some high performance processors that show that performance is largely determined by memory bandwidth.
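The bandwidth-bound behaviour is easiest to see in the central kernel of most iterative solvers, the sparse matrix-vector product. The compressed sparse row (CSR) version below is a generic sketch, not the benchmark code itself: it performs roughly two floating-point operations per matrix entry loaded from memory, so its speed is set by how fast val, col and x can be streamed from memory rather than by arithmetic.

```c
/* Sparse matrix-vector product y = A*x in compressed sparse row (CSR) format.
 * Generic illustration of the kind of kernel such a benchmark measures.
 *   rowptr[i] .. rowptr[i+1]-1  index the nonzeros of row i in val[]/col[]. */
void csr_matvec(int n, const int *rowptr, const int *col,
                const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            sum += val[k] * x[col[k]];   /* ~2 flops per 12-16 bytes loaded */
        y[i] = sum;
    }
}
```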

13 citations


Journal ArticleDOI
TL;DR: It is claimed that if the problem naturally possesses multiple levels of parallelism, then applying parallelism to all levels may significantly enhance the scalability of the algorithm, and this claim is sustained by numerical experiments.
Abstract: In this paper we discuss the use of nested parallelism. Our claim is that if the problem naturally possesses multiple levels of parallelism, then applying parallelism to all levels may significantly enhance the scalability of the algorithm. This claim is sustained by numerical experiments. We also discuss how to implement multi-level parallelism using OpenMP. We find current OpenMP implementations, based on version 1.0, to have severe limitations for implementing nested parallelization. We then show how this can be circumvented by explicitly assigning tasks to threads. Load balancing issues become more complicated with two (or more) levels of parallelism. To handle this problem, we have designed a distribution algorithm which groups threads into teams, each team being responsible for one coarse grain outer-level task. This algorithm is proven to produce the optimal load balance, under given assumptions.
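Since nested parallel regions were poorly supported in OpenMP 1.0 implementations, the grouping into teams can be emulated inside a single parallel region. The sketch below is our illustration of the idea, not the authors' algorithm; the task count, team sizing and the workload are placeholders.

```c
/* Emulating two levels of parallelism inside a single OpenMP parallel region
 * (illustrative sketch; team sizing and the workload are placeholders).
 * Threads are grouped into teams, one team per coarse grain outer-level task. */
#include <omp.h>

void two_level(int ntasks, int inner_n, double *task_result)  /* task_result: zeroed by caller */
{
    #pragma omp parallel
    {
        int nth = omp_get_num_threads();
        int tid = omp_get_thread_num();

        int team_size = (nth / ntasks > 0) ? nth / ntasks : 1;
        int team   = tid / team_size;          /* which outer task this thread serves */
        int member = tid % team_size;          /* rank of the thread within its team  */

        if (team < ntasks) {
            double partial = 0.0;
            /* Inner-level (fine grain) work is split cyclically among team members. */
            for (int i = member; i < inner_n; i += team_size)
                partial += (double)i * (team + 1);      /* placeholder workload */

            #pragma omp atomic
            task_result[team] += partial;               /* combine within the team */
        }
    }
}
```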

13 citations


Journal ArticleDOI
TL;DR: It is shown how coarse grain OpenMP parallelism can also be used to facilitate overlapping MPI communication and computation for stencil-based grid programs such as a program performing Gauss-Seidel iteration with red-black ordering.
Abstract: Machines comprised of a distributed collection of shared memory or SMP nodes are becoming common for parallel computing. OpenMP can be combined with MPI on many such machines. Motivations for combining OpenMP and MPI are discussed. While OpenMP is typically used for exploiting loop-level parallelism, it can also be used to enable coarse grain parallelism, potentially leading to less overhead. We show how coarse grain OpenMP parallelism can also be used to facilitate overlapping MPI communication and computation for stencil-based grid programs such as a program performing Gauss-Seidel iteration with red-black ordering. Spatial subdivision or domain decomposition is used to assign a portion of the grid to each thread. One thread is assigned a null calculation region so that it is free to perform communication. Example calculations were run on an IBM SP using both the Kuck & Associates and IBM compilers.
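The overlap can be sketched as follows (a schematic of the technique, not the paper's code; the grid layout, the neighbour ranks and a Jacobi-style update stand in for the real red-black Gauss-Seidel sweep): inside one coarse grain parallel region, thread 0 performs the halo exchange while the remaining threads update interior rows that need no halo data, and a barrier separates this from the update of the boundary-adjacent rows.

```c
/* Schematic of overlapping MPI halo exchange with OpenMP computation inside one
 * coarse grain parallel region (illustration only; only one exchange direction
 * is shown, and a Jacobi-style update replaces the red-black sweep). */
#include <mpi.h>
#include <omp.h>

void sweep(double *u, double *unew, int nrows, int ncols, int up, int down)
{
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nth = omp_get_num_threads();

        if (tid == 0) {
            /* The "null calculation" thread communicates: send our first interior
             * row upwards and receive the lower halo row from below. */
            MPI_Sendrecv(&u[1 * ncols], ncols, MPI_DOUBLE, up, 0,
                         &u[(nrows - 1) * ncols], ncols, MPI_DOUBLE, down, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            /* Remaining threads update interior rows that need no halo data. */
            for (int i = 2 + (tid - 1); i < nrows - 2; i += nth - 1)
                for (int j = 1; j < ncols - 1; j++)
                    unew[i*ncols + j] = 0.25 * (u[(i-1)*ncols + j] + u[(i+1)*ncols + j]
                                              + u[i*ncols + j - 1] + u[i*ncols + j + 1]);
        }

        #pragma omp barrier   /* halo row has now arrived */

        #pragma omp for
        for (int j = 1; j < ncols - 1; j++) {
            /* Row adjacent to the received halo is updated after communication. */
            int i = nrows - 2;
            unew[i*ncols + j] = 0.25 * (u[(i-1)*ncols + j] + u[(i+1)*ncols + j]
                                      + u[i*ncols + j - 1] + u[i*ncols + j + 1]);
        }
    }
}
```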

13 citations


Journal ArticleDOI
TL;DR: A new OpenMP clause, indirect, is proposed for parallel loops that have irregular data access patterns; it is trivial to implement in a conforming way by protecting every array update, but it also allows for an inspector/executor compiler implementation which will be more efficient in sparse cases.
Abstract: Many scientific applications involve array operations that are sparse in nature, i.e. array elements depend on the values of relatively few elements of the same or another array. When parallelised in the shared-memory model, there are often inter-thread dependencies which require that the individual array updates are protected in some way. Possible strategies include protecting all the updates, or having each thread compute local temporary results which are then combined globally across threads. However, for the extremely common situation of sparse array access, neither of these approaches is particularly efficient. The key point is that data access patterns usually remain constant for a long time, so it is possible to use an inspector/executor approach. When the sparse operation is first encountered, the access pattern is inspected to identify those updates which have potential inter-thread dependencies. Whenever the code is actually executed, only these selected updates are protected. We propose a new OpenMP clause, indirect, for parallel loops that have irregular data access patterns. This is trivial to implement in a conforming way by protecting every array update, but it also allows for an inspector/executor compiler implementation which will be more efficient in sparse cases. We describe efficient compiler implementation strategies for the new directive. We also present timings from the kernels of a Discrete Element Modelling application and a Finite Element code where the inspector/executor approach is used. The results demonstrate that the method can be extremely efficient in practice.
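A hand-coded version of the inspector/executor idea looks roughly like the sketch below (our illustration; a compiler implementation of the proposed indirect clause would generate something similar). The inspector marks the elements of the result array that more than one thread will update through the index array, and the executor protects only those updates with atomics, leaving the common sparse case unsynchronised. The block distribution of iterations is assumed to be identical in both phases.

```c
/* Hand-coded inspector/executor sketch for a sparse, indirectly addressed update
 *   y[idx[i]] += contrib[i]
 * Illustration only; the thread count seen by the inspector must match the
 * executor's parallel region for the ownership analysis to remain valid. */
#include <omp.h>
#include <stdlib.h>
#include <string.h>

/* Iteration range of thread tid under a simple block distribution. */
static void my_block(int n, int tid, int nth, int *lo, int *hi)
{
    int chunk = (n + nth - 1) / nth;
    *lo = tid * chunk;
    *hi = (*lo + chunk < n) ? *lo + chunk : n;
}

/* Inspector: flag result elements touched by more than one thread. */
void inspect(int n, int ny, const int *idx, unsigned char *shared_flag)
{
    int nth = omp_get_max_threads();
    int *owner = malloc(ny * sizeof(int));
    for (int j = 0; j < ny; j++) owner[j] = -1;
    memset(shared_flag, 0, ny);

    for (int tid = 0; tid < nth; tid++) {
        int lo, hi;
        my_block(n, tid, nth, &lo, &hi);
        for (int i = lo; i < hi; i++) {
            int j = idx[i];
            if (owner[j] < 0)         owner[j] = tid;
            else if (owner[j] != tid) shared_flag[j] = 1;  /* cross-thread element */
        }
    }
    free(owner);
}

/* Executor: protect only the flagged updates. */
void execute(int n, const int *idx, const double *contrib, double *y,
             const unsigned char *shared_flag)
{
    #pragma omp parallel
    {
        int lo, hi;
        my_block(n, omp_get_thread_num(), omp_get_num_threads(), &lo, &hi);
        for (int i = lo; i < hi; i++) {
            int j = idx[i];
            if (shared_flag[j]) {
                #pragma omp atomic
                y[j] += contrib[i];      /* rare contended case: protected   */
            } else {
                y[j] += contrib[i];      /* common sparse case: unprotected  */
            }
        }
    }
}
```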

Journal ArticleDOI
TL;DR: This work uses two similar parallelization tools, Pfortran and Cray's Co-Array Fortran, in the parallelization of the GROMOS96 molecular dynamics module, showing linear speedup within the range expected by these parallelization methods.
Abstract: After at least a decade of parallel tool development, parallelization of scientific applications remains a significant undertaking. Typically parallelization is a specialized activity supported only partially by the programming tool set, with the programmer involved with parallel issues in addition to sequential ones. The details of concern range from algorithm design down to low-level data movement details. The aim of parallel programming tools is to automate the latter without sacrificing performance and portability, allowing the programmer to focus on algorithm specification and development. We present our use of two similar parallelization tools, Pfortran and Cray's Co-Array Fortran, in the parallelization of the GROMOS96 molecular dynamics module. Our parallelization started from the GROMOS96 distribution's shared-memory implementation of the replicated algorithm, but used little of that existing parallel structure. Consequently, our parallelization was close to starting with the sequential version. We found the intuitive extensions to Pfortran and Co-Array Fortran helpful in the rapid parallelization of the project. We present performance figures for both the Pfortran and Co-Array Fortran parallelizations showing linear speedup within the range expected by these parallelization methods.

Journal ArticleDOI
TL;DR: The Computer Aided Parallelisation Toolkit has been extended to automatically generate OpenMP-based parallel programs with nominal user assistance and it is shown how efficient directives can be placed using the toolkit's in-depth interprocedural analysis.
Abstract: The shared-memory programming model can be an effective way to achieve parallelism on shared memory parallel computers. Historically however, the lack of a programming standard using directives and the limited scalability have affected its take-up. Recent advances in hardware and software technologies have resulted in improvements to both the performance of parallel programs with compiler directives and the issue of portability with the introduction of OpenMP. In this study, the Computer Aided Parallelisation Toolkit has been extended to automatically generate OpenMP-based parallel programs with nominal user assistance. We categorize the different loop types and show how efficient directives can be placed using the toolkit's in-depth interprocedural analysis. Examples are taken from the NAS parallel benchmarks and a number of real-world application codes. This demonstrates the great potential of using the toolkit to quickly parallelise serial programs as well as the good performance achievable on up to 300 processors for hybrid message passing-directive parallelisations.


Journal ArticleDOI
TL;DR: This paper introduces and compares two decomposition strategies, in the framework of shared memory systems, as applied to a case study particle in cell application, and considers time efficiency, memory occupancy, and program restructuring effort.
Abstract: A crucial issue in parallel programming (both for distributed and shared memory architectures) is work decomposition. The work decomposition task can be accomplished without a large programming effort by using high-level parallel programming languages such as OpenMP. However, particular care must still be paid to achieving performance goals. In this paper we introduce and compare two decomposition strategies, in the framework of shared memory systems, as applied to a case study particle-in-cell application. A number of different implementations of them, based on the OpenMP language, are discussed with regard to time efficiency, memory occupancy, and program restructuring effort.
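The two strategies can be caricatured with a one-dimensional charge-deposition loop (a simplified illustration, not the paper's code; the grid size, particle count, cell indices and charges are placeholders): parallelising over particles requires protecting the grid updates, whereas decomposing the grid lets each thread update only the cells it owns, at the cost of scanning or pre-binning the particle list.

```c
/* Simplified charge deposition kernel illustrating two OpenMP decompositions
 * (illustration only).  np particles deposit charge q[p] into grid cell cell[p]. */
#include <omp.h>

/* Strategy 1: decompose over particles; concurrent grid updates must be protected. */
void deposit_particle_decomp(int np, const int *cell, const double *q, double *rho)
{
    #pragma omp parallel for
    for (int p = 0; p < np; p++) {
        #pragma omp atomic
        rho[cell[p]] += q[p];        /* several threads may hit the same cell */
    }
}

/* Strategy 2: decompose over the grid; each thread owns a contiguous range of
 * cells and deposits only the particles falling into it (no locking needed). */
void deposit_domain_decomp(int np, const int *cell, const double *q,
                           double *rho, int ngrid)
{
    #pragma omp parallel
    {
        int nth = omp_get_num_threads();
        int tid = omp_get_thread_num();
        int lo  = (int)((long)ngrid * tid / nth);
        int hi  = (int)((long)ngrid * (tid + 1) / nth);

        for (int p = 0; p < np; p++)
            if (cell[p] >= lo && cell[p] < hi)
                rho[cell[p]] += q[p];   /* only this thread writes these cells */
    }
}
```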

Journal ArticleDOI
TL;DR: In this article, the authors present a prototype implementation of a network interface that can preserve communication between processes during process migration, which is a substitution for the well-known socket interface.
Abstract: Efficient load balancing is essential for parallel distributed computing. Many parallel computing environments use TCP or UDP through the socket interface as a communication mechanism. This paper presents the design and development of a prototype implementation of a network interface that can preserve communication between processes during process migration. This new communication library is a substitution for the well-known socket interface. It is implemented in user space; it is portable, and no modifications of user applications are required. TCP/IP is applied for internal communication, which guarantees relatively high performance and portability.

Journal ArticleDOI
TL;DR: This paper compares different parallel implementations of the same algorithm for solving nonlinear simulation problems on unstructured meshes and finds that the explicit programming model is more efficient than the implicit model by 20-70%, depending on the mesh and the machine.
Abstract: In this paper we compare different parallel implementations of the same algorithm for solving nonlinear simulation problems on unstructured meshes. In the first implementation, making use of the message-passing programming model and the PVM system, the domain decomposition of the unstructured mesh is implemented, while the second implementation takes advantage of the inherent parallelism of the algorithm by adopting the shared-memory programming model. Both implementations are applied to the preconditioned GMRES method that solves iteratively the system of linear equations. A combined approach, the hybrid programming model suitable for multicomputers with SMP nodes, is introduced. For performance measurements we use compressible fluid flow simulation in which sequences of finite element solutions form time approximations to the Euler equations. The tests are performed on HP SPP1600, HP S2000 and SGI Origin2000 multiprocessors and report wall-clock execution time and speedup for different numbers of processing nodes and for different meshes. Experimentally, the explicit programming model proves to be more efficient than the implicit model by 20-70%, depending on the mesh and the machine.

Journal ArticleDOI
TL;DR: This paper presents and compares methodologies to generate discrete adjoint codes, and compares these methodologies in terms of execution time and memory requirement on a one dimensional thermal-hydraulic module for two-phase flow modeling.
Abstract: From a computational point of view, sensitivity analysis, calibration of a model, or variational data assimilation may be tackled after the differentiation of the numerical code representing the model into an adjoint code. This paper presents and compares methodologies to generate discrete adjoint codes. These methods can be implemented when hand writing adjoint codes, or within Automatic Differentiation (AD) tools. AD has been successfully applied to industrial codes that were large and general enough to fully validate this new technology. We compare these methodologies in terms of execution time and memory requirement on a one-dimensional thermal-hydraulic module for two-phase flow modeling. From this experiment, we identify development directions for AD tools, as well as methods by which AD tool users can obtain efficient adjoint codes semi-automatically. The next objective is to automatically generate adjoint codes as efficient as hand-written ones.
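As a reminder of what a hand-written discrete adjoint looks like, the textbook-style example below (unrelated to the paper's thermal-hydraulic module) differentiates a two-statement forward computation: each forward statement gives rise to adjoint statements that are executed in reverse order and accumulate derivatives of the output with respect to each input.

```c
/* Hand-written discrete adjoint of a tiny forward computation
 * (textbook-style illustration, not taken from the paper's test module).
 *
 * Forward model:   y = a * x * x + sin(x)
 * Adjoint:         given yb = dJ/dy, accumulate xb = dJ/dx and ab = dJ/da.
 */
#include <math.h>

double forward(double a, double x)
{
    double t = a * x * x;      /* statement 1 */
    double y = t + sin(x);     /* statement 2 */
    return y;
}

void adjoint(double a, double x, double yb, double *ab, double *xb)
{
    double tb = 0.0;

    /* reverse of statement 2:  y = t + sin(x)  */
    tb  += yb;                 /* dy/dt = 1       */
    *xb += cos(x) * yb;        /* dy/dx = cos(x)  */

    /* reverse of statement 1:  t = a * x * x    */
    *ab += x * x * tb;         /* dt/da = x*x     */
    *xb += 2.0 * a * x * tb;   /* dt/dx = 2*a*x   */
}
```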

Journal ArticleDOI
TL;DR: The technique used to port a vector code to an SMP-ccNUMA architecture is described, and the performance of these models on such systems is covered.
Abstract: Weather forecast limited area models, wave models and ocean models commonly run on vector machines or on MPP systems. Recently shared memory multiprocessor systems with ccNUMA architecture (SMP-ccNUMA) have been shown to deliver very good performance on many applications. It is important to know that SMP-ccNUMA systems perform and scale well even for the above-mentioned models, and that a relatively simple effort is needed to parallelize the codes on these systems thanks to the availability of OpenMP as a standard shared memory paradigm. This paper deals with the implementation on an SGI Origin 2000 of a weather forecast model (LAMBO -- Limited Area Model Bologna, the NCEP ETA model adapted to the Italian territory), a wave model (WA.M. -- Wave Model, on the Mediterranean Sea and on the Adriatic Sea) and an ocean model (M.O.M. -- Modular Ocean Model, used with data assimilation). These three models were written for vector machines, so the paper describes the technique used to port a vector code to an SMP-ccNUMA architecture. Another aspect covered by this paper is the performance of these models on these systems.
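One typical step in such a port, sketched below in C for brevity even though the models themselves are Fortran codes (the grid sizes and Jacobi-style update are placeholders), is to move the parallelism from the vectorised inner loop to an OpenMP directive on the outer loop, and to initialise the arrays with the same static schedule so that first-touch page placement on the ccNUMA system puts each thread's data in local memory.

```c
/* Sketch of porting a vector loop to OpenMP on a ccNUMA system (illustration
 * only, written in C although the models discussed are Fortran codes). */
#include <omp.h>

#define NX 512          /* placeholder grid dimensions */
#define NY 512

static double u[NX][NY], unew[NX][NY];

void initialise(void)
{
    /* First touch: same static schedule as the compute loop, so each thread's
     * rows end up in memory local to the processor that will later use them. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < NX; i++)
        for (int j = 0; j < NY; j++)
            u[i][j] = unew[i][j] = 0.0;
}

void sweep(void)
{
    /* The loop the vector compiler used to handle now carries the OpenMP directive. */
    #pragma omp parallel for schedule(static)
    for (int i = 1; i < NX - 1; i++)
        for (int j = 1; j < NY - 1; j++)
            unew[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                               + u[i][j - 1] + u[i][j + 1]);
}
```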



Journal ArticleDOI
TL;DR: It is argued that having several methods on a component interface can be used to mitigate performance problems that may arise when trying to solve problems in PSEs based on small components.
Abstract: We have investigated aspects of the design of Problem Solving Environments (PSE) by constructing a prototype using CORBA as middleware. The two issues we are mainly concerned with are the use of non-trivial (containing more than just a start method) CORBA interfaces for the computational components, and the provision of interactivity using the same mechanisms used for flow control. After describing the design decisions that allow us to investigate these issues, and contrasting them with alternatives, we describe the architecture of the prototype and its use in the context of a study of photonic materials. We argue that having several methods on a component interface can be used to mitigate performance problems that may arise when trying to solve problems in PSEs based on small components. We describe how our mechanism allows a high degree of computational steering over all components.


Journal ArticleDOI
TL;DR: The development and implementation of a distributed job execution environment for highly iterative jobs that allows for fine-grained job control, timely status notification and dynamic registration and deregistration of execution platforms depending on resources available.
Abstract: This paper describes the development and implementation of a distributed job execution environment for highly iterative jobs. An iterative job is defined here as a binary code that is run multiple times with incremental changes in the input values for each run. An execution environment is a set of resources on a computing platform that can be made available to run the job and hold the output until it is collected. The goal is to design a complete, object-oriented execution system that runs a variety of jobs with minimal changes. Areas of code that are unique to a specific type of job are decoupled from the rest. The system allows for fine-grained job control, timely status notification and dynamic registration and deregistration of execution platforms depending on resources available. Several object-oriented technologies are employed: Java, CORBA, UML, and software design patterns. The environment has been tested using a simulation code, INS2D.



Journal ArticleDOI
TL;DR: This part of enabling technologies investigates current hardware techniques and their functionalities and provides a comparison between various products.
Abstract: The most valuable assets in every scientific community are the expert work force and the research results/data produced. The last decade has seen new experimental and computational techniques developing at an ever-faster pace, encouraging the production of ever-larger quantities of data in ever-shorter time spans. Concurrently the traditional scientific working environment has changed beyond recognition. Today scientists can use a wide spectrum of experimental, computational and analytical facilities, often widely distributed over the UK and Europe. In this environment new challenges are posed for the Management of Data every day, but are we ready to tackle them? Do we know exactly what the challenges are? Is the right technology available and is it applied where necessary? This part of enabling technologies investigates current hardware techniques and their functionalities and provides a comparison between various products.

Journal Article
TL;DR: In this article, the authors propose that without the correct audio cues that move in three-dimensional space with the visuals, the VR illusion is broken and a true immersive VR experience cannot be realized.
Abstract: Background: Immersive audio is used by Virtual Reality (VR) gaming companies like Oculus Rift, HTC VIVE, and Google Cardboard. They create a VR where a user can be "transported" into a fictional world and see it right before their eyes. However, visuals account for only one aspect of a true VR experience. Without the correct audio cues that move in three-dimensional space with the visuals, the VR illusion is broken and a true immersive VR experience cannot be realized.