
Showing papers presented at "International Workshop on OpenMP in 2003"


Book Chapter · DOI
26 Jun 2003
TL;DR: A new scheduling strategy is proposed that derives at run time the best scheduling policy for each parallel loop in the program, based on information gathered by the library itself.
Abstract: Choosing the appropriate assignment of loop iterations to threads is one of the most important decisions that must be taken when parallelizing loops, the main source of parallelism in numerical applications. This is not an easy task, even for expert programmers, and it can potentially take a large amount of time. OpenMP offers the schedule clause, with a set of predefined iteration-scheduling strategies, to specify how (and when) this assignment of iterations to threads is done. In some cases, the best schedule depends on architectural characteristics of the target machine, the data input, and so on, making the code less portable. Even worse, the best schedule can change during execution, depending on dynamic changes in the behavior of the loop or in the resources available in the system. Also, for certain types of imbalanced loops, the schedulers already proposed in the literature cannot extract the maximum parallelism because they do not appropriately trade off load balancing and data locality. This paper proposes a new scheduling strategy that derives at run time the best scheduling policy for each parallel loop in the program, based on information gathered at runtime by the library itself.

45 citations
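
For illustration, a minimal sketch (not from the paper) of the predefined schedule kinds such an adaptive strategy chooses among; the loop body and chunk size are placeholders:

```cpp
#include <vector>

// The predefined OpenMP schedules an adaptive strategy can pick between.
// Which one wins depends on how imbalanced the iterations are and how
// much data locality matters.
void scale(std::vector<double>& a, double s) {
    const int n = static_cast<int>(a.size());

    // static: iterations split into equal chunks up front; best locality,
    // no load balancing.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i) a[i] *= s;

    // dynamic: threads grab chunks of 64 iterations on demand; balances
    // load at the cost of scheduling overhead and locality.
    #pragma omp parallel for schedule(dynamic, 64)
    for (int i = 0; i < n; ++i) a[i] *= s;

    // guided: chunk sizes shrink as the loop progresses; a compromise
    // between the two above.
    #pragma omp parallel for schedule(guided)
    for (int i = 0; i < n; ++i) a[i] *= s;
}
```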


Book Chapter · DOI
26 Jun 2003
TL;DR: A practical OpenMP compiler for SOCs, especially targeting 3SoC, is proposed, along with solutions for extending OpenMP directives to incorporate advanced architectural features of SOCs.
Abstract: With the advent of modern System-on-Chip (SOC) design, the integration of multiple processors onto one die has become the trend. So far there are no standard programming paradigms for SOCs or heterogeneous chip multiprocessors; users are required to write complex assembly language and/or C programs for them. Developing a standard programming model for this new parallel architecture is necessary. In this paper, we propose a practical OpenMP compiler for SOCs, especially targeting 3SoC. We also present our solutions for extending OpenMP directives to incorporate advanced architectural features of SOCs. Preliminary performance evaluation shows scalable speedup using different types of processors and the effectiveness of performance improvement through optimization.

36 citations


Book Chapter · DOI
Paul Petersen, Sanjiv Shah
26 Jun 2003
TL;DR: The ability to dynamically analyze multiple sibling OpenMP teams enhances the previous Assure support and complements previous work on static analysis; in addition, binary instrumentation capabilities allow detection of thread-safety violations in the system and third-party libraries that most applications use.
Abstract: The Intel® Thread Checker is the second incarnation of the projection-based dynamic analysis technology first introduced with Assure, which greatly simplifies application development with OpenMP. The ability to dynamically analyze multiple sibling OpenMP teams enhances the previous Assure support and complements previous work on static analysis. In addition, binary instrumentation capabilities allow detection of thread-safety violations in system and third-party libraries that most applications use.

31 citations
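
For illustration, a minimal sketch (not from the paper) of the kind of defect a dynamic race detector such as the Thread Checker is built to flag, together with the race-free alternative:

```cpp
#include <cstdio>

// A textbook data race: `sum` is updated by all threads without
// synchronization, so the result is nondeterministic.
int main() {
    const int n = 1000;
    int sum = 0;

    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        sum += i;               // race: concurrent unsynchronized updates

    // Correct alternative: declare the accumulation as a reduction.
    int ok = 0;
    #pragma omp parallel for reduction(+ : ok)
    for (int i = 0; i < n; ++i)
        ok += i;

    std::printf("racy=%d correct=%d\n", sum, ok);
    return 0;
}
```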


Book Chapter · DOI
26 Jun 2003
TL;DR: A tool that relieves users of writing SPMD code by automatically converting OpenMP programs into equivalent SPMD-style OpenMP, covering the modification of array declarations and parallel loops and the handling of a variety of OpenMP constructs, including the REDUCTION and ORDERED clauses and synchronization.
Abstract: The scalability of an OpenMP program in a ccNUMA system with a large number of processors suffers from remote memory accesses, cache misses and false sharing. Good data locality is needed to overcome these problems, whereas OpenMP offers limited capabilities to control it on ccNUMA architectures. A so-called SPMD-style OpenMP program can achieve data locality by means of array privatization, and this approach has shown good performance in previous research. It is hard to write SPMD OpenMP code; therefore we are building a tool to relieve users of this task by automatically converting OpenMP programs into equivalent SPMD-style OpenMP. We show the process of the translation by considering how to modify array declarations and parallel loops, and by showing how to handle a variety of OpenMP constructs, including the REDUCTION and ORDERED clauses and synchronization. We are currently implementing these translations in an interactive tool based on the Open64 compiler.

19 citations
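
As a rough illustration of the target of such a translation, here is a hand-written SPMD-style fragment; the block distribution and all names are assumptions, not the tool's actual output:

```cpp
#include <omp.h>

// SPMD style: instead of a shared array updated via a work-shared loop,
// each thread allocates and works on its own private block, so on a
// ccNUMA machine the block is placed in (and stays in) local memory.
const int N = 1 << 20;

void spmd_style(double* x) {
    #pragma omp parallel
    {
        const int tid = omp_get_thread_num();
        const int nth = omp_get_num_threads();
        const int chunk = (N + nth - 1) / nth;
        const int lo = tid * chunk;
        const int hi = lo + chunk > N ? N : lo + chunk;

        double* mine = new double[hi - lo];   // privatized slice
        for (int i = lo; i < hi; ++i)
            mine[i - lo] = 2.0 * x[i];
        for (int i = lo; i < hi; ++i)
            x[i] = mine[i - lo];
        delete[] mine;
    }
}
```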


Book Chapter · DOI
26 Jun 2003
TL;DR: The goal of the project is to quantify the degree to which OpenMP can be extended to distributed systems and to develop supporting compiler techniques.
Abstract: In this paper, we present techniques for translating and optimizing realistic OpenMP applications on distributed systems. The goal of our project is to quantify the degree to which OpenMP can be extended to distributed systems and to develop supporting compiler techniques. Our present compiler techniques translate OpenMP programs into a form suitable for execution on a Software DSM system. We have implemented a compiler that performs this basic translation, and we have proposed optimization techniques that improve the baseline performance of OpenMP applications on distributed computer systems. Our results show that, while kernel benchmarks can show high efficiency for OpenMP programs on distributed systems, full applications need careful consideration of shared data access patterns. A naive translation (similar to the basic translation done by OpenMP compilers for SMPs) leads to acceptable performance in very few applications. We propose optimizations such as computation repartitioning, page-aware optimizations, and access privatization that result in an average 70% performance improvement on the SPEC OMPM2001 benchmark applications.

17 citations
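
A minimal sketch of the access-privatization idea, assuming a small read-only lookup table; the names are illustrative and this is not the paper's compiler output:

```cpp
#include <vector>

// On software DSM, a read-only table consulted by every thread can be
// copied per thread with firstprivate, so the pages holding it are never
// write-shared and incur no coherence traffic.
void apply(const std::vector<double>& table, std::vector<double>& out) {
    const int n = static_cast<int>(out.size());
    const int m = static_cast<int>(table.size());
    std::vector<double> priv = table;      // copied into each thread below

    #pragma omp parallel for firstprivate(priv)
    for (int i = 0; i < n; ++i)
        out[i] = 2.0 * priv[i % m];        // reads hit the private copy
}
```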


Book Chapter · DOI
26 Jun 2003
TL;DR: Explains how to interprocedurally detect whether an OpenMP program consistently schedules its parallel loops, and describes code patterns whose translation strategy differs from the straightforward approach that can otherwise be applied.
Abstract: A so-called SPMD-style OpenMP program can achieve scalability on ccNUMA systems by means of array privatization, and earlier research has shown good performance under this approach. Since it is hard to write SPMD OpenMP code, we showed a strategy for the automatic translation of many OpenMP constructs into SPMD style in our previous work. In this paper, we first explain how to interprocedurally detect whether the OpenMP program consistently schedules the parallel loops. If the parallel loops are consistently scheduled, we may carry out array privatization according to OpenMP semantics. We give two examples of code patterns that can be handled despite the fact that they are not consistent, and where the strategy used to translate them differs from the straightforward approach that can otherwise be applied.

17 citations
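
For illustration, a sketch (assumed, not from the paper) of two consistently scheduled loops: both use schedule(static) over identical bounds, so each element is touched by the same thread in both loops, which is what licenses privatization:

```cpp
// Both loops use the same static schedule over the same iteration space,
// so the thread that writes A[i] in the first loop is the one that reads
// it in the second; A can therefore be privatized per thread.
void consistent(double* A, double* B, int n) {
    #pragma omp parallel
    {
        #pragma omp for schedule(static)
        for (int i = 0; i < n; ++i)
            A[i] = 0.5 * i;

        #pragma omp for schedule(static)
        for (int i = 0; i < n; ++i)
            B[i] = A[i] + 1.0;   // same thread touched A[i] above
    }
}
```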


Book Chapter · DOI
26 Jun 2003
TL;DR: This work reports on experiences running OpenMP programs on a commodity cluster of PCs running a software distributed shared memory system, compares the performance of the OpenMP implementations with that of their message-passing counterparts, and discusses performance differences.
Abstract: In this work we report on our experiences running OpenMP programs on a commodity cluster of PCs running a software distributed shared memory (DSM) system. We describe our test environment and report on the performance of a subset of the NAS Parallel Benchmarks that have been automatically parallelized for OpenMP. We compare the performance of the OpenMP implementations with that of their message passing counterparts and discuss performance differences.

17 citations


Book Chapter · DOI
26 Jun 2003
TL;DR: A C++ infrastructure for source-to-source translation, demonstrated by translating a serial program with high-level abstractions into a lower-level parallel program in two separate phases: OpenMP directives are first introduced, then translated into explicit parallelism.
Abstract: In this paper we describe a C++ infrastructure for source-to-source translation. We demonstrate the translation of a serial program with high-level abstractions to a lower-level parallel program in two separate phases. In the first phase, OpenMP directives are introduced, driven by the semantics of the high-level abstractions. Then the OpenMP directives are translated to a C++ program that explicitly creates and manages parallelism according to the specified directives. Both phases are implemented using the same mechanisms in our infrastructure.

15 citations
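
As a rough sketch of what the second phase produces, here is a parallel loop lowered to explicit thread creation and joining; std::thread and the static partitioning are illustrative stand-ins for the infrastructure's actual runtime:

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// What a directive like
//     #pragma omp parallel for
//     for (int i = 0; i < n; ++i) a[i] += 1.0;
// can lower to: explicit thread creation, a static partition of the
// iteration space, and a join that plays the role of the implicit barrier.
void lowered(double* a, int n, int nthreads) {
    std::vector<std::thread> team;
    for (int t = 0; t < nthreads; ++t) {
        team.emplace_back([=] {
            const int chunk = (n + nthreads - 1) / nthreads;
            const int lo = t * chunk;
            const int hi = std::min(n, lo + chunk);
            for (int i = lo; i < hi; ++i) a[i] += 1.0;
        });
    }
    for (auto& th : team) th.join();   // barrier at region end
}
```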


Book Chapter · DOI
26 Jun 2003
TL;DR: An OpenMP implementation of a recursive algorithm for the parallel fast Fourier transform (FFT) on shared-memory parallel computers; a recursive three-step FFT algorithm improves performance by effectively utilizing the cache memory.
Abstract: In this paper, we propose an OpenMP implementation of a recursive algorithm for the parallel fast Fourier transform (FFT) on shared-memory parallel computers. A recursive three-step FFT algorithm improves performance by effectively utilizing the cache memory. Performance results of one-dimensional FFTs on the DELL PowerEdge 7150 and the hp workstation zx6000 are reported. We successfully achieved performance of about 757 MFLOPS on the DELL PowerEdge 7150 (Itanium 800 MHz, 4 CPUs) and about 871 MFLOPS on the hp workstation zx6000 (Itanium2 1 GHz, 2 CPUs) for a 2^24-point FFT.

11 citations
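
For illustration only, a plain recursive radix-2 FFT parallelized with OpenMP sections; the paper's recursive three-step algorithm adds cache blocking and differs in detail:

```cpp
#include <complex>
#include <vector>

// Recursive radix-2 Cooley-Tukey FFT (decimation in time).
// n must be a power of two; x is overwritten with its DFT.
void fft(std::complex<double>* x, int n) {
    if (n == 1) return;
    const int h = n / 2;
    std::vector<std::complex<double>> even(h), odd(h);
    for (int i = 0; i < h; ++i) {
        even[i] = x[2 * i];
        odd[i]  = x[2 * i + 1];
    }
    // Recurse on both halves in parallel; the if clause avoids spawning
    // threads for small subproblems where overhead would dominate.
    #pragma omp parallel sections if(n > 4096)
    {
        #pragma omp section
        fft(even.data(), h);
        #pragma omp section
        fft(odd.data(), h);
    }
    const double pi = 3.14159265358979323846;
    for (int k = 0; k < h; ++k) {
        const std::complex<double> w = std::polar(1.0, -2.0 * pi * k / n);
        x[k]     = even[k] + w * odd[k];
        x[k + h] = even[k] - w * odd[k];
    }
}
```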


Book Chapter · DOI
26 Jun 2003
TL;DR: DMPL, an OpenMP debugger interface that can be implemented as a dynamically loaded library, is presented; it is currently being considered by the OpenMP Tools Committee as a mechanism to bridge the development-tool gap in the OpenMP standard.
Abstract: OpenMP is a widely adopted standard for threading directives across compiler implementations. The standard is very successful, since it provides application writers with a simple, portable programming model for introducing shared-memory parallelism into their codes. However, the standard does not address key issues for supporting that programming model in development tools such as debuggers. In this paper, we present DMPL, an OpenMP debugger interface that can be implemented as a dynamically loaded library. DMPL is currently being considered by the OpenMP Tools Committee as a mechanism to bridge the development-tool gap in the OpenMP standard.

10 citations
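
The delivery mechanism, generically: a tool dlopen()s such a library and resolves entry points by name. The library path and symbol below are hypothetical placeholders, not DMPL's actual interface; build with -ldl on Linux:

```cpp
#include <dlfcn.h>
#include <cstdio>

// Generic POSIX pattern for loading a debugger-support library at run
// time. "libdmpl_example.so" and "example_init" are made-up names used
// only to show the mechanism.
int main() {
    void* lib = dlopen("libdmpl_example.so", RTLD_NOW);
    if (!lib) { std::fprintf(stderr, "%s\n", dlerror()); return 1; }

    using init_fn = int (*)();
    auto init = reinterpret_cast<init_fn>(dlsym(lib, "example_init"));
    if (init) init();          // call the resolved entry point

    dlclose(lib);
    return 0;
}
```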


Book Chapter · DOI
26 Jun 2003
TL;DR: A new implementation, distributed counters with local sensor, is introduced, which considerably reduces overhead on POWER3 and POWER4 SMP systems; its relative performance is expected to increase with the number of processors in an SMP and as memory latencies lengthen relative to cache latencies.
Abstract: Barrier synchronization is an important and performance critical primitive in many parallel programming models, including the popular OpenMP model. In this paper, we compare the performance of several software implementations of barrier synchronization and introduce a new implementation, distributed counters with local sensor, which considerably reduces overhead on POWER3 and POWER4 SMP systems. Through experiments with the EPCC OpenMP benchmark, we demonstrate a 79% reduction in overhead on a 32-way POWER4 system and an 87% reduction in overhead on a 16-way POWER3 system when comparing with a fetch-and-add implementation. Since these improvements are primarily attributed to reduced L2 and L3 cache misses, we expect the relative performance of our implementation to increase with the number of processors in an SMP and as memory latencies lengthen relative to cache latencies.
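
A sketch in the spirit of a distributed-counter barrier; the padding, spin policy, and sense-flag release below are illustrative assumptions, not the paper's POWER implementation:

```cpp
#include <atomic>

// Each thread signals arrival on its own cache line (no contended central
// counter); thread 0 gathers the arrivals and flips a release flag that
// the others compare against a thread-private expected sense.
struct alignas(128) Slot { std::atomic<int> v{0}; };

struct DistBarrier {
    static const int MAXT = 64;
    Slot arrive[MAXT];            // one padded arrival slot per thread
    std::atomic<int> sense{0};    // global release flag

    // my_sense is thread-private state, initially 0, kept by the caller.
    void wait(int tid, int nthreads, int& my_sense) {
        my_sense ^= 1;                                      // new episode
        arrive[tid].v.store(my_sense, std::memory_order_release);
        if (tid == 0) {
            for (int t = 1; t < nthreads; ++t)              // gather
                while (arrive[t].v.load(std::memory_order_acquire)
                       != my_sense) {}
            sense.store(my_sense, std::memory_order_release); // release
        } else {
            while (sense.load(std::memory_order_acquire) != my_sense) {}
        }
    }
};
```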

Book Chapter · DOI
26 Jun 2003
TL;DR: The implementation of an OpenMP environment for parallelizing applications targeting BG/C, currently under development at the CEPBA-IBM Research Institute, is described, and issues not initially considered in the design of the BG/C architecture that matter for supporting a programming model such as OpenMP are identified.
Abstract: Multithreaded architectures have the potential to tolerate large memory and functional-unit latencies and to increase resource utilization. The Blue Gene/Cyclops architecture, being developed at the IBM T. J. Watson Research Center, is one such system, offering massive intra-chip parallelism. Although the BG/C architecture was initially designed to execute specific applications, we believe that it can be effectively used on a broad range of parallel numerical applications. Programming such applications for this unconventional design requires a significant porting effort when using the basic built-in mechanisms for thread management and synchronization. In this paper, we describe the implementation of an OpenMP environment, currently under development at the CEPBA-IBM Research Institute, for parallelizing applications targeting BG/C. The environment is evaluated with a set of simple numerical kernels and a subset of the NAS OpenMP benchmarks. We identify issues that were not initially considered in the design of the BG/C architecture to support a programming model such as OpenMP. We also evaluate features currently offered by the BG/C architecture that should be considered in the implementation of an efficient OpenMP layer for massive intra-chip parallel architectures.

Book Chapter · DOI
26 Jun 2003
TL;DR: A prototype runtime system, providing support at the backend of the NANOS OpenMP compiler, that enables the execution of unmodified OpenMP Fortran programs on both SMPs and clusters of multiprocessors, either through the hybrid programming model (MPI+OpenMP) or directly on top of Software Distributed Shared Memory (SDSM).
Abstract: This paper presents a prototype runtime system, providing support at the backend of the NANOS OpenMP compiler, that enables the execution of unmodified OpenMP Fortran programs on both SMPs and clusters of multiprocessors, either through the hybrid programming model (MPI+OpenMP) or directly on top of Software Distributed Shared Memory (SDSM). The latter is feasible by adopting a share-everything approach for the code generated by the OpenMP compiler, which corresponds to the "default shared" philosophy of OpenMP. Specifically, the user-level thread stacks and the Fortran common blocks are allocated explicitly, though transparently to the programmer, in shared memory. The management of the internal runtime-system structures and of the fork-join multilevel parallelism is based on explicit communication, exploiting however the shared-memory hardware of the available SMP nodes whenever possible. The modular design of the runtime system allows the integration of existing unmodified SDSM libraries, despite their design for SPMD execution.
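
For illustration, the minimal shape of the hybrid model such a runtime supports: MPI between nodes, an OpenMP team within each node (a generic example, not the NANOS code path):

```cpp
#include <mpi.h>
#include <cstdio>

// Hybrid MPI+OpenMP: one MPI process per node, threaded work inside,
// and a final MPI reduction across nodes.
int main(int argc, char** argv) {
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 0.0;
    #pragma omp parallel for reduction(+ : local)
    for (int i = 0; i < 1000000; ++i)
        local += 1.0 / (i + 1 + rank);     // node-local work, threaded

    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) std::printf("total = %f\n", total);
    MPI_Finalize();
    return 0;
}
```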

Book Chapter · DOI
26 Jun 2003
TL;DR: Hardware counter events on some popular architectures were investigated with the purpose of detecting bottlenecks of particular interest to shared-memory programming, such as OpenMP, and events relevant for the intended purpose were shown to exist on the investigated platforms.
Abstract: Hardware counter events on some popular architectures were investigated with the purpose of detecting bottlenecks of particular interest to shared-memory programming, such as OpenMP. A fully portable test suite was written in OpenMP, accessing the hardware performance counters by means of PAPI. Events relevant for the intended purpose were shown to exist on the investigated platforms. Further, these events could in most cases be accessed directly through their platform-independent, PAPI pre-defined names. In some cases, suggestions for improvement in the pre-defined mapping were made based on the experiments.
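
A minimal sketch of reading a platform-independent PAPI preset (here PAPI_L2_TCM, total L2 cache misses) per thread around an OpenMP loop; error handling is trimmed for brevity:

```cpp
#include <papi.h>
#include <omp.h>
#include <cstdio>

int main() {
    PAPI_library_init(PAPI_VER_CURRENT);
    // Let PAPI identify threads by their OpenMP thread number.
    PAPI_thread_init([]() -> unsigned long {
        return static_cast<unsigned long>(omp_get_thread_num());
    });

    static double a[1 << 20];
    #pragma omp parallel
    {
        int es = PAPI_NULL;
        long long misses = 0;
        PAPI_create_eventset(&es);
        PAPI_add_event(es, PAPI_L2_TCM);   // pre-defined, portable name
        PAPI_start(es);

        #pragma omp for
        for (int i = 0; i < (1 << 20); ++i) a[i] = i;

        PAPI_stop(es, &misses);
        std::printf("thread %d: L2 misses = %lld\n",
                    omp_get_thread_num(), misses);
    }
    return 0;
}
```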

Book Chapter · DOI
26 Jun 2003
TL;DR: stOMP, a specializing thread library for OpenMP, uses a combined compile-time and run-time system to specialize parallel regions; the results suggest a large potential for improvement from the runtime optimization of OpenMP applications.
Abstract: This paper introduces stOMP, a specializing thread library for OpenMP. Using a combined compile-time and run-time system, stOMP specializes OpenMP parallel regions for frequently seen values and the configuration of the runtime system. An overview of stOMP is presented, as well as motivation for the runtime optimization of OpenMP applications. The overheads incurred by a prototype implementation of stOMP are evaluated on three SPEC OpenMP benchmarks and the EPCC scheduling microbenchmark. The results are encouraging and suggest that there is a large potential for improvement from the runtime optimization of OpenMP applications.
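
A hand-written analogue (assumed, not stOMP's output) of what such a specializer automates: if a region parameter almost always takes one value, emit a variant specialized for it behind a guard:

```cpp
// General version: strided access, harder for the compiler to optimize.
void saxpy_general(float* y, const float* x, float a, int n, int stride) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        y[i * stride] += a * x[i * stride];
}

// If profiling shows stride == 1 is the common case, dispatch to a
// specialized body with contiguous, vectorizable accesses.
void saxpy(float* y, const float* x, float a, int n, int stride) {
    if (stride == 1) {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            y[i] += a * x[i];
    } else {
        saxpy_general(y, x, a, n, stride);
    }
}
```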

Book Chapter · DOI
26 Jun 2003
TL;DR: OpenMP is applied to non-traditional benchmarks, namely embedded applications, to examine its applicability in this area; the OpenMP-parallelized code is also shown to be much larger than the serial version due to multithreaded libraries.
Abstract: Embedded systems are becoming ever more important and are applied everywhere around us, in mobile phones, PDAs, HDTVs, and so on. In this paper, we applied OpenMP to non-traditional benchmarks, i.e. embedded applications, in order to examine the applicability of OpenMP in this area. We parallelized the EEMBC embedded benchmarks, consisting of 5 categories and 34 applications in total, and measured their performance in detail. In our experiments, we found 90 parallel sections in 17 applications, but achieved speedup in only four applications. Since embedded applications consist of many small loops, we could not obtain speedup due to large parallelization overheads such as thread management and extra instructions. We also show that the OpenMP-parallelized code is much larger than the serial version due to multithreaded libraries, which is critical for embedded systems because of their limited memory. Finally, we discuss a critical, though trivial, problem we identified in the current OpenMP specification when applying OpenMP to these applications.
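
For illustration, OpenMP's if clause offers a partial mitigation for the overhead problem measured here: fall back to serial execution below a trip-count threshold (the threshold below is an arbitrary placeholder):

```cpp
// For the short loops typical of embedded kernels, forking a team can
// cost more than the loop body; the if clause keeps small instances
// serial and parallelizes only large ones.
void add(short* dst, const short* src, int n) {
    #pragma omp parallel for if(n > 10000)
    for (int i = 0; i < n; ++i)
        dst[i] += src[i];
}
```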

Book Chapter · DOI
26 Jun 2003
TL;DR: A parallelized version of the parallel rollout algorithm is proposed, and its performance is evaluated on a multi-class task scheduling problem using the OpenMP and MPI programming models.
Abstract: Parallel rollout is a formal method of combining multiple heuristic policies available to a sequential decision maker in the framework of Markov Decision Processes (MDPs). The method improves on the performance of all of the heuristic policies by adapting to the different stochastic system trajectories. Exploiting the inherent multi-level parallelism in the method, in this paper we propose a parallelized version of the parallel rollout algorithm and evaluate its performance on a multi-class task scheduling problem using the OpenMP and MPI programming models. We analyze and compare the performance of the two parallelized versions, OpenMP and MPI, in several execution environments. We show that the performance using the OpenMP API is higher than with MPI, due to lower overhead for data synchronization across processors.
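
A sketch of the coarse-grained parallelism being exploited, with Policy and evaluate() as illustrative placeholders rather than the paper's interfaces:

```cpp
#include <algorithm>
#include <vector>

struct Policy { int id; };

// Placeholder: a real implementation would average rewards over sampled
// system trajectories simulated under policy p.
double evaluate(const Policy& p) {
    return static_cast<double>(p.id % 7);
}

// Each candidate heuristic policy is evaluated by independent simulation,
// so the evaluations run in parallel and the best policy is selected.
int best_policy(const std::vector<Policy>& policies) {
    const int m = static_cast<int>(policies.size());
    std::vector<double> value(m);

    #pragma omp parallel for
    for (int i = 0; i < m; ++i)
        value[i] = evaluate(policies[i]);  // independent rollouts

    return static_cast<int>(
        std::max_element(value.begin(), value.end()) - value.begin());
}
```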

Book Chapter · DOI
Shi-Jung Kao
26 Jun 2003
TL;DR: Two possible implementation techniques are described and contrasted, and ways to synchronize the execution of C++ OpenMP programs in the event of uncaught exceptions are suggested.
Abstract: This paper discusses the issue of exception handling in C++ OpenMP programs. Two possible implementation techniques are described and contrasted. The paper also suggests ways to synchronize the execution of C++ OpenMP programs in the event of uncaught exceptions.
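
One widely used idiom (a generic sketch, not necessarily one of the paper's two techniques): since OpenMP requires an exception thrown inside a parallel region to be caught within that region by the same thread, each thread catches locally and the first exception is rethrown after the region:

```cpp
#include <cstdio>
#include <exception>
#include <stdexcept>

void work(int i) {
    if (i == 13) throw std::runtime_error("bad iteration");
}

int main() {
    std::exception_ptr err = nullptr;

    #pragma omp parallel for
    for (int i = 0; i < 100; ++i) {
        try {
            work(i);
        } catch (...) {
            // Catch inside the region; remember the first failure.
            #pragma omp critical
            if (!err) err = std::current_exception();
        }
    }

    if (err) {                    // rethrow serially, outside the region
        try { std::rethrow_exception(err); }
        catch (const std::exception& e) { std::printf("%s\n", e.what()); }
    }
    return 0;
}
```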

Book Chapter · DOI
26 Jun 2003
TL;DR: An in-place method using vacancy tracking cycles is developed, which outperforms the traditional 2-array method and demonstrates the validity of the parallel paradigm of mixing MPI with OpenMP.
Abstract: We evaluate dynamic data remapping on clusters of SMPs under the OpenMP, MPI, and hybrid paradigms. The traditional method of multi-dimensional array transposition needs an auxiliary array of the same size and a copy-back stage. We recently developed an in-place method using vacancy tracking cycles. The vacancy tracking algorithm outperforms the traditional two-array method, as demonstrated by extensive comparisons. The performance of multi-threaded parallelism using OpenMP is first tested with different scheduling methods and different numbers of threads. Both methods are then parallelized using several parallel paradigms. At the node level, pure OpenMP outperforms pure MPI by a factor of 2.76 for the vacancy tracking method. Across the entire cluster of SMP nodes, by carefully choosing thread numbers, the hybrid MPI/OpenMP implementation outperforms pure MPI by a factor of 3.79 for the traditional method and 4.44 for the vacancy tracking method, demonstrating the validity of the parallel paradigm of mixing MPI with OpenMP.
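
For illustration, a serial sketch of cycle-following in-place transposition, the family of algorithms the vacancy tracking method belongs to; the paper's variant and its OpenMP/MPI parallelization over independent cycles are more elaborate:

```cpp
#include <utility>
#include <vector>

// In-place transpose of a row-major n1 x n2 matrix: element k moves to
// (k * n1) mod (n1*n2 - 1), with 0 and n1*n2-1 fixed, so walking each
// permutation cycle once transposes the array with no auxiliary copy.
void transpose_inplace(std::vector<double>& a, long n1, long n2) {
    const long N = n1 * n2;
    std::vector<bool> moved(N, false);
    for (long start = 1; start < N - 1; ++start) {
        if (moved[start]) continue;
        long k = start;
        double carried = a[start];      // value displaced along the cycle
        do {
            const long dest = (k * n1) % (N - 1);  // where a[k] belongs
            std::swap(carried, a[dest]);
            moved[dest] = true;
            k = dest;
        } while (k != start);
    }
}
```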

Book Chapter · DOI
26 Jun 2003
TL;DR: An extended overhead-analysis scheme based on a layered model is proposed for OpenMP programming, to further enhance the capability of overhead analysis and thus make OpenMP performance tuning easier.
Abstract: Overhead analysis was developed as a performance-tuning approach for parallel programming and was adopted by several performance-analysis systems for OpenMP programs. In this paper, an extended overhead-analysis scheme based on a layered model is proposed for OpenMP programming, to further enhance the capability of overhead analysis and thus make OpenMP performance tuning easier. An example case called ILP/TLP overlap is studied in detail to illustrate the idea of the layered overhead model, and a new way to organize overheads hierarchically is also presented based on the layered overhead model.