
Showing papers presented at "International Workshop on OpenMP in 2003"


Book Chapter · DOI
26 Jun 2003
TL;DR: A new scheduling strategy is proposed that derives at run time the best scheduling policy for each parallel loop in the program, based on information gathered by the library itself.
Abstract: Choosing the appropriate assignment of loop iterations to threads is one of the most important decisions that must be taken when parallelizing loops, the main source of parallelism in numerical applications. This is not an easy task, even for expert programmers, and it can potentially take a large amount of time. OpenMP offers the schedule clause, with a set of predefined iteration-scheduling strategies, to specify how (and when) this assignment of iterations to threads is done. In some cases, the best schedule depends on architectural characteristics of the target machine, the data input, and so on, making the code less portable. Even worse, the best schedule can change during execution, depending on dynamic changes in the behavior of the loop or in the resources available in the system. Also, for certain types of imbalanced loops, the schedulers already proposed in the literature cannot extract the maximum parallelism because they do not appropriately trade off load balancing and data locality. This paper proposes a new scheduling strategy that derives at run time the best scheduling policy for each parallel loop in the program, based on information gathered at runtime by the library itself.

45 citations
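
For illustration, a minimal sketch (not from the paper) of the predefined schedule kinds such an adaptive strategy chooses among; the loop body and chunk size are placeholders:

```cpp
#include <vector>

// The predefined OpenMP schedules an adaptive strategy can pick between.
// Which one wins depends on how imbalanced the iterations are and how
// much data locality matters.
void scale(std::vector<double>& a, double s) {
    const int n = static_cast<int>(a.size());

    // static: iterations split into equal chunks up front; best locality,
    // no load balancing.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i) a[i] *= s;

    // dynamic: threads grab chunks of 64 iterations on demand; balances
    // load at the cost of scheduling overhead and locality.
    #pragma omp parallel for schedule(dynamic, 64)
    for (int i = 0; i < n; ++i) a[i] *= s;

    // guided: chunk sizes shrink as the loop progresses; a compromise
    // between the two above.
    #pragma omp parallel for schedule(guided)
    for (int i = 0; i < n; ++i) a[i] *= s;
}
```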


Book Chapter · DOI
26 Jun 2003
TL;DR: A practical OpenMP compiler for SOCs, especially targeting 3SoC, is proposed, along with solutions for extending OpenMP directives to incorporate advanced architectural features of SOCs.
Abstract: With the advent of modern System-on-Chip (SOC) design, the integration of multiple processors onto one die has become the trend. So far there are no standard programming paradigms for SOCs or heterogeneous chip multiprocessors; users are required to write complex assembly language and/or C programs for them. Developing a standard programming model for this new parallel architecture is necessary. In this paper, we propose a practical OpenMP compiler for SOCs, especially targeting 3SoC. We also present our solutions for extending OpenMP directives to incorporate advanced architectural features of SOCs. Preliminary performance evaluation shows scalable speedup using different types of processors and the effectiveness of performance improvement through optimization.

36 citations


Book Chapter · DOI
Paul Petersen, Sanjiv Shah
26 Jun 2003
TL;DR: The ability to dynamically analyze multiple sibling OpenMP teams enhances the previous Assure support and complements previous work on static analysis; in addition, binary instrumentation capabilities allow detection of thread-safety violations in the system and third-party libraries that most applications use.
Abstract: The Intel® Thread Checker is the second incarnation of the projection-based dynamic analysis technology first introduced with Assure, which greatly simplifies application development with OpenMP. The ability to dynamically analyze multiple sibling OpenMP teams enhances the previous Assure support and complements previous work on static analysis. In addition, binary instrumentation capabilities allow detection of thread-safety violations in system and third-party libraries that most applications use.

31 citations
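
For illustration, a minimal sketch (not from the paper) of the kind of defect a dynamic race detector such as the Thread Checker is built to flag, together with the race-free alternative:

```cpp
#include <cstdio>

// A textbook data race: `sum` is updated by all threads without
// synchronization, so the result is nondeterministic.
int main() {
    const int n = 1000;
    int sum = 0;

    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        sum += i;               // race: concurrent unsynchronized updates

    // Correct alternative: declare the accumulation as a reduction.
    int ok = 0;
    #pragma omp parallel for reduction(+ : ok)
    for (int i = 0; i < n; ++i)
        ok += i;

    std::printf("racy=%d correct=%d\n", sum, ok);
    return 0;
}
```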


Book Chapter · DOI
26 Jun 2003
TL;DR: A tool that relieves users of writing SPMD code by automatically converting OpenMP programs into equivalent SPMD-style OpenMP, covering the modification of array declarations and parallel loops and the handling of a variety of OpenMP constructs, including the REDUCTION and ORDERED clauses and synchronization.
Abstract: The scalability of an OpenMP program in a ccNUMA system with a large number of processors suffers from remote memory accesses, cache misses and false sharing. Good data locality is needed to overcome these problems, whereas OpenMP offers limited capabilities to control it on ccNUMA architectures. A so-called SPMD-style OpenMP program can achieve data locality by means of array privatization, and this approach has shown good performance in previous research. It is hard to write SPMD OpenMP code; therefore we are building a tool to relieve users of this task by automatically converting OpenMP programs into equivalent SPMD-style OpenMP. We show the process of the translation by considering how to modify array declarations and parallel loops, and by showing how to handle a variety of OpenMP constructs, including the REDUCTION and ORDERED clauses and synchronization. We are currently implementing these translations in an interactive tool based on the Open64 compiler.

19 citations
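
As a rough illustration of the target of such a translation, here is a hand-written SPMD-style fragment; the block distribution and all names are assumptions, not the tool's actual output:

```cpp
#include <omp.h>

// SPMD style: instead of a shared array updated via a work-shared loop,
// each thread allocates and works on its own private block, so on a
// ccNUMA machine the block is placed in (and stays in) local memory.
const int N = 1 << 20;

void spmd_style(double* x) {
    #pragma omp parallel
    {
        const int tid = omp_get_thread_num();
        const int nth = omp_get_num_threads();
        const int chunk = (N + nth - 1) / nth;
        const int lo = tid * chunk;
        const int hi = lo + chunk > N ? N : lo + chunk;

        double* mine = new double[hi - lo];   // privatized slice
        for (int i = lo; i < hi; ++i)
            mine[i - lo] = 2.0 * x[i];
        for (int i = lo; i < hi; ++i)
            x[i] = mine[i - lo];
        delete[] mine;
    }
}
```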


Book Chapter · DOI
26 Jun 2003
TL;DR: The goal of the project is to quantify the degree to which OpenMP can be extended to distributed systems and to develop supporting compiler techniques.
Abstract: In this paper, we present techniques for translating and optimizing realistic OpenMP applications on distributed systems. The goal of our project is to quantify the degree to which OpenMP can be extended to distributed systems and to develop supporting compiler techniques. Our present compiler techniques translate OpenMP programs into a form suitable for execution on a Software DSM system. We have implemented a compiler that performs this basic translation, and we have proposed optimization techniques that improve the baseline performance of OpenMP applications on distributed computer systems. Our results show that, while kernel benchmarks can show high efficiency for OpenMP programs on distributed systems, full applications need careful consideration of shared data access patterns. A naive translation (similar to the basic translation done by OpenMP compilers for SMPs) leads to acceptable performance in very few applications. We propose optimizations such as computation repartitioning, page-aware optimizations, and access privatization that result in an average 70% performance improvement on the SPEC OMPM2001 benchmark applications.

17 citations
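
A minimal sketch of the access-privatization idea, assuming a small read-only lookup table; the names are illustrative and this is not the paper's compiler output:

```cpp
#include <vector>

// On software DSM, a read-only table consulted by every thread can be
// copied per thread with firstprivate, so the pages holding it are never
// write-shared and incur no coherence traffic.
void apply(const std::vector<double>& table, std::vector<double>& out) {
    const int n = static_cast<int>(out.size());
    const int m = static_cast<int>(table.size());
    std::vector<double> priv = table;      // copied into each thread below

    #pragma omp parallel for firstprivate(priv)
    for (int i = 0; i < n; ++i)
        out[i] = 2.0 * priv[i % m];        // reads hit the private copy
}
```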


Book Chapter · DOI
26 Jun 2003
TL;DR: Explains how to interprocedurally detect whether an OpenMP program consistently schedules its parallel loops, and describes code patterns whose translation strategy differs from the straightforward approach that can otherwise be applied.
Abstract: A so-called SPMD-style OpenMP program can achieve scalability on ccNUMA systems by means of array privatization, and earlier research has shown good performance under this approach. Since it is hard to write SPMD OpenMP code, we showed a strategy for the automatic translation of many OpenMP constructs into SPMD style in our previous work. In this paper, we first explain how to interprocedurally detect whether the OpenMP program consistently schedules the parallel loops. If the parallel loops are consistently scheduled, we may carry out array privatization according to OpenMP semantics. We give two examples of code patterns that can be handled despite the fact that they are not consistent, and where the strategy used to translate them differs from the straightforward approach that can otherwise be applied.

17 citations
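
For illustration, a sketch (assumed, not from the paper) of two consistently scheduled loops: both use schedule(static) over identical bounds, so each element is touched by the same thread in both loops, which is what licenses privatization:

```cpp
// Both loops use the same static schedule over the same iteration space,
// so the thread that writes A[i] in the first loop is the one that reads
// it in the second; A can therefore be privatized per thread.
void consistent(double* A, double* B, int n) {
    #pragma omp parallel
    {
        #pragma omp for schedule(static)
        for (int i = 0; i < n; ++i)
            A[i] = 0.5 * i;

        #pragma omp for schedule(static)
        for (int i = 0; i < n; ++i)
            B[i] = A[i] + 1.0;   // same thread touched A[i] above
    }
}
```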


Book Chapter · DOI
26 Jun 2003
TL;DR: This work reports on experiences running OpenMP programs on a commodity cluster of PCs running a software distributed shared memory system, compares the performance of the OpenMP implementations with that of their message-passing counterparts, and discusses performance differences.
Abstract: In this work we report on our experiences running OpenMP programs on a commodity cluster of PCs running a software distributed shared memory (DSM) system. We describe our test environment and report on the performance of a subset of the NAS Parallel Benchmarks that have been automatically parallelized for OpenMP. We compare the performance of the OpenMP implementations with that of their message passing counterparts and discuss performance differences.

17 citations


Book Chapter · DOI
26 Jun 2003
TL;DR: A C++ infrastructure for source-to-source translation, demonstrated by translating a serial program with high-level abstractions into a lower-level parallel program in two separate phases: OpenMP directives are first introduced, then translated into explicit parallelism.
Abstract: In this paper we describe a C++ infrastructure for source-to-source translation. We demonstrate the translation of a serial program with high-level abstractions to a lower-level parallel program in two separate phases. In the first phase, OpenMP directives are introduced, driven by the semantics of the high-level abstractions. Then the OpenMP directives are translated to a C++ program that explicitly creates and manages parallelism according to the specified directives. Both phases are implemented using the same mechanisms in our infrastructure.

15 citations
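
As a rough sketch of what the second phase produces, here is a parallel loop lowered to explicit thread creation and joining; std::thread and the static partitioning are illustrative stand-ins for the infrastructure's actual runtime:

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// What a directive like
//     #pragma omp parallel for
//     for (int i = 0; i < n; ++i) a[i] += 1.0;
// can lower to: explicit thread creation, a static partition of the
// iteration space, and a join that plays the role of the implicit barrier.
void lowered(double* a, int n, int nthreads) {
    std::vector<std::thread> team;
    for (int t = 0; t < nthreads; ++t) {
        team.emplace_back([=] {
            const int chunk = (n + nthreads - 1) / nthreads;
            const int lo = t * chunk;
            const int hi = std::min(n, lo + chunk);
            for (int i = lo; i < hi; ++i) a[i] += 1.0;
        });
    }
    for (auto& th : team) th.join();   // barrier at region end
}
```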


Book Chapter · DOI
26 Jun 2003
TL;DR: An OpenMP implementation of a recursive algorithm for the parallel fast Fourier transform (FFT) on shared-memory parallel computers; a recursive three-step FFT algorithm improves performance by effectively utilizing the cache memory.
Abstract: In this paper, we propose an OpenMP implementation of a recursive algorithm for the parallel fast Fourier transform (FFT) on shared-memory parallel computers. A recursive three-step FFT algorithm improves performance by effectively utilizing the cache memory. Performance results of one-dimensional FFTs on the DELL PowerEdge 7150 and the hp workstation zx6000 are reported. We successfully achieved performance of about 757 MFLOPS on the DELL PowerEdge 7150 (Itanium 800 MHz, 4 CPUs) and about 871 MFLOPS on the hp workstation zx6000 (Itanium2 1 GHz, 2 CPUs) for a 2^24-point FFT.

11 citations
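
For illustration only, a plain recursive radix-2 FFT parallelized with OpenMP sections; the paper's recursive three-step algorithm adds cache blocking and differs in detail:

```cpp
#include <complex>
#include <vector>

// Recursive radix-2 Cooley-Tukey FFT (decimation in time).
// n must be a power of two; x is overwritten with its DFT.
void fft(std::complex<double>* x, int n) {
    if (n == 1) return;
    const int h = n / 2;
    std::vector<std::complex<double>> even(h), odd(h);
    for (int i = 0; i < h; ++i) {
        even[i] = x[2 * i];
        odd[i]  = x[2 * i + 1];
    }
    // Recurse on both halves in parallel; the if clause avoids spawning
    // threads for small subproblems where overhead would dominate.
    #pragma omp parallel sections if(n > 4096)
    {
        #pragma omp section
        fft(even.data(), h);
        #pragma omp section
        fft(odd.data(), h);
    }
    const double pi = 3.14159265358979323846;
    for (int k = 0; k < h; ++k) {
        const std::complex<double> w = std::polar(1.0, -2.0 * pi * k / n);
        x[k]     = even[k] + w * odd[k];
        x[k + h] = even[k] - w * odd[k];
    }
}
```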


Book Chapter · DOI
26 Jun 2003
TL;DR: DMPL, an OpenMP debugger interface that can be implemented as a dynamically loaded library, is presented; it is currently being considered by the OpenMP Tools Committee as a mechanism to bridge the development-tool gap in the OpenMP standard.
Abstract: OpenMP is a widely adopted standard for threading directives across compiler implementations. The standard is very successful, since it provides application writers with a simple, portable programming model for introducing shared-memory parallelism into their codes. However, the standard does not address key issues for supporting that programming model in development tools such as debuggers. In this paper, we present DMPL, an OpenMP debugger interface that can be implemented as a dynamically loaded library. DMPL is currently being considered by the OpenMP Tools Committee as a mechanism to bridge the development-tool gap in the OpenMP standard.

10 citations
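
The delivery mechanism, generically: a tool dlopen()s such a library and resolves entry points by name. The library path and symbol below are hypothetical placeholders, not DMPL's actual interface; build with -ldl on Linux:

```cpp
#include <dlfcn.h>
#include <cstdio>

// Generic POSIX pattern for loading a debugger-support library at run
// time. "libdmpl_example.so" and "example_init" are made-up names used
// only to show the mechanism.
int main() {
    void* lib = dlopen("libdmpl_example.so", RTLD_NOW);
    if (!lib) { std::fprintf(stderr, "%s\n", dlerror()); return 1; }

    using init_fn = int (*)();
    auto init = reinterpret_cast<init_fn>(dlsym(lib, "example_init"));
    if (init) init();          // call the resolved entry point

    dlclose(lib);
    return 0;
}
```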


Book Chapter · DOI
26 Jun 2003
TL;DR: A new implementation, distributed counters with local sensor, is introduced, which considerably reduces overhead on POWER3 and POWER4 SMP systems; its relative performance is expected to increase with the number of processors in an SMP and as memory latencies lengthen relative to cache latencies.
Abstract: Barrier synchronization is an important and performance critical primitive in many parallel programming models, including the popular OpenMP model. In this paper, we compare the performance of several software implementations of barrier synchronization and introduce a new implementation, distributed counters with local sensor, which considerably reduces overhead on POWER3 and POWER4 SMP systems. Through experiments with the EPCC OpenMP benchmark, we demonstrate a 79% reduction in overhead on a 32-way POWER4 system and an 87% reduction in overhead on a 16-way POWER3 system when comparing with a fetch-and-add implementation. Since these improvements are primarily attributed to reduced L2 and L3 cache misses, we expect the relative performance of our implementation to increase with the number of processors in an SMP and as memory latencies lengthen relative to cache latencies.
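
A sketch in the spirit of a distributed-counter barrier; the padding, spin policy, and sense-flag release below are illustrative assumptions, not the paper's POWER implementation:

```cpp
#include <atomic>

// Each thread signals arrival on its own cache line (no contended central
// counter); thread 0 gathers the arrivals and flips a release flag that
// the others compare against a thread-private expected sense.
struct alignas(128) Slot { std::atomic<int> v{0}; };

struct DistBarrier {
    static const int MAXT = 64;
    Slot arrive[MAXT];            // one padded arrival slot per thread
    std::atomic<int> sense{0};    // global release flag

    // my_sense is thread-private state, initially 0, kept by the caller.
    void wait(int tid, int nthreads, int& my_sense) {
        my_sense ^= 1;                                      // new episode
        arrive[tid].v.store(my_sense, std::memory_order_release);
        if (tid == 0) {
            for (int t = 1; t < nthreads; ++t)              // gather
                while (arrive[t].v.load(std::memory_order_acquire)
                       != my_sense) {}
            sense.store(my_sense, std::memory_order_release); // release
        } else {
            while (sense.load(std::memory_order_acquire) != my_sense) {}
        }
    }
};
```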

Book Chapter · DOI
26 Jun 2003
TL;DR: The implementation of an OpenMP environment for parallelizing applications targeting BG/C, currently under development at the CEPBA-IBM Research Institute, is described, and issues not initially considered in the design of the BG/C architecture that matter for supporting a programming model such as OpenMP are identified.
Abstract: Multithreaded architectures have the potential to tolerate large memory and functional-unit latencies and to increase resource utilization. The Blue Gene/Cyclops architecture, being developed at the IBM T. J. Watson Research Center, is one such system, offering massive intra-chip parallelism. Although the BG/C architecture was initially designed to execute specific applications, we believe that it can be effectively used on a broad range of parallel numerical applications. Programming such applications for this unconventional design requires a significant porting effort when using the basic built-in mechanisms for thread management and synchronization. In this paper, we describe the implementation of an OpenMP environment, currently under development at the CEPBA-IBM Research Institute, for parallelizing applications targeting BG/C. The environment is evaluated with a set of simple numerical kernels and a subset of the NAS OpenMP benchmarks. We identify issues that were not initially considered in the design of the BG/C architecture to support a programming model such as OpenMP. We also evaluate features currently offered by the BG/C architecture that should be considered in the implementation of an efficient OpenMP layer for massive intra-chip parallel architectures.

Book Chapter · DOI
26 Jun 2003
TL;DR: A prototype runtime system, providing support at the backend of the NANOS OpenMP compiler, that enables the execution of unmodified OpenMP Fortran programs on both SMPs and clusters of multiprocessors, either through the hybrid programming model (MPI+OpenMP) or directly on top of Software Distributed Shared Memory (SDSM).
Abstract: This paper presents a prototype runtime system, providing support at the backend of the NANOS OpenMP compiler, that enables the execution of unmodified OpenMP Fortran programs on both SMPs and clusters of multiprocessors, either through the hybrid programming model (MPI+OpenMP) or directly on top of Software Distributed Shared Memory (SDSM). The latter is feasible by adopting a share-everything approach for the code generated by the OpenMP compiler, which corresponds to the "default shared" philosophy of OpenMP. Specifically, the user-level thread stacks and the Fortran common blocks are allocated explicitly, though transparently to the programmer, in shared memory. The management of the internal runtime-system structures and of the fork-join multilevel parallelism is based on explicit communication, exploiting however the shared-memory hardware of the available SMP nodes whenever possible. The modular design of the runtime system allows the integration of existing unmodified SDSM libraries, despite their design for SPMD execution.
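
For illustration, the minimal shape of the hybrid model such a runtime supports: MPI between nodes, an OpenMP team within each node (a generic example, not the NANOS code path):

```cpp
#include <mpi.h>
#include <cstdio>

// Hybrid MPI+OpenMP: one MPI process per node, threaded work inside,
// and a final MPI reduction across nodes.
int main(int argc, char** argv) {
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 0.0;
    #pragma omp parallel for reduction(+ : local)
    for (int i = 0; i < 1000000; ++i)
        local += 1.0 / (i + 1 + rank);     // node-local work, threaded

    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) std::printf("total = %f\n", total);
    MPI_Finalize();
    return 0;
}
```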

Book Chapter · DOI
26 Jun 2003
TL;DR: Hardware counter events on some popular architectures were investigated with the purpose of detecting bottlenecks of particular interest to shared-memory programming, such as OpenMP, and events relevant for the intended purpose were shown to exist on the investigated platforms.
Abstract: Hardware counter events on some popular architectures were investigated with the purpose of detecting bottlenecks of particular interest to shared-memory programming, such as OpenMP. A fully portable test suite was written in OpenMP, accessing the hardware performance counters by means of PAPI. Events relevant for the intended purpose were shown to exist on the investigated platforms. Further, these events could in most cases be accessed directly through their platform-independent, PAPI pre-defined names. In some cases, suggestions for improvement in the pre-defined mapping were made based on the experiments.
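
A minimal sketch of reading a platform-independent PAPI preset (here PAPI_L2_TCM, total L2 cache misses) per thread around an OpenMP loop; error handling is trimmed for brevity:

```cpp
#include <papi.h>
#include <omp.h>
#include <cstdio>

int main() {
    PAPI_library_init(PAPI_VER_CURRENT);
    // Let PAPI identify threads by their OpenMP thread number.
    PAPI_thread_init([]() -> unsigned long {
        return static_cast<unsigned long>(omp_get_thread_num());
    });

    static double a[1 << 20];
    #pragma omp parallel
    {
        int es = PAPI_NULL;
        long long misses = 0;
        PAPI_create_eventset(&es);
        PAPI_add_event(es, PAPI_L2_TCM);   // pre-defined, portable name
        PAPI_start(es);

        #pragma omp for
        for (int i = 0; i < (1 << 20); ++i) a[i] = i;

        PAPI_stop(es, &misses);
        std::printf("thread %d: L2 misses = %lld\n",
                    omp_get_thread_num(), misses);
    }
    return 0;
}
```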

Book Chapter · DOI
26 Jun 2003
TL;DR: stOMP, a specializing thread library for OpenMP, uses a combined compile-time and run-time system to specialize parallel regions; the results suggest a large potential for improvement from the runtime optimization of OpenMP applications.
Abstract: This paper introduces stOMP, a specializing thread library for OpenMP. Using a combined compile-time and run-time system, stOMP specializes OpenMP parallel regions for frequently seen values and the configuration of the runtime system. An overview of stOMP is presented, as well as motivation for the runtime optimization of OpenMP applications. The overheads incurred by a prototype implementation of stOMP are evaluated on three SPEC OpenMP benchmarks and the EPCC scheduling microbenchmark. The results are encouraging and suggest that there is a large potential for improvement from the runtime optimization of OpenMP applications.
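
A hand-written analogue (assumed, not stOMP's output) of what such a specializer automates: if a region parameter almost always takes one value, emit a variant specialized for it behind a guard:

```cpp
// General version: strided access, harder for the compiler to optimize.
void saxpy_general(float* y, const float* x, float a, int n, int stride) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        y[i * stride] += a * x[i * stride];
}

// If profiling shows stride == 1 is the common case, dispatch to a
// specialized body with contiguous, vectorizable accesses.
void saxpy(float* y, const float* x, float a, int n, int stride) {
    if (stride == 1) {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            y[i] += a * x[i];
    } else {
        saxpy_general(y, x, a, n, stride);
    }
}
```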

Book Chapter · DOI
26 Jun 2003
TL;DR: OpenMP is applied to non-traditional benchmarks, namely embedded applications, to examine its applicability in this area; the OpenMP-parallelized code is also shown to be much larger than the serial version due to multithreaded libraries.
Abstract: Embedded systems are becoming ever more important and are applied everywhere around us, in mobile phones, PDAs, HDTVs, and so on. In this paper, we applied OpenMP to non-traditional benchmarks, i.e. embedded applications, in order to examine the applicability of OpenMP in this area. We parallelized the EEMBC embedded benchmarks, consisting of 5 categories and 34 applications in total, and measured their performance in detail. In our experiments, we found 90 parallel sections in 17 applications, but achieved speedup in only four applications. Since embedded applications consist of many small loops, we could not obtain speedup due to large parallelization overheads such as thread management and extra instructions. We also show that the OpenMP-parallelized code is much larger than the serial version due to multithreaded libraries, which is critical for embedded systems because of their limited memory. Finally, we discuss a critical, though trivial, problem we identified in the current OpenMP specification when applying OpenMP to these applications.
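
For illustration, OpenMP's if clause offers a partial mitigation for the overhead problem measured here: fall back to serial execution below a trip-count threshold (the threshold below is an arbitrary placeholder):

```cpp
// For the short loops typical of embedded kernels, forking a team can
// cost more than the loop body; the if clause keeps small instances
// serial and parallelizes only large ones.
void add(short* dst, const short* src, int n) {
    #pragma omp parallel for if(n > 10000)
    for (int i = 0; i < n; ++i)
        dst[i] += src[i];
}
```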

Book Chapter · DOI
26 Jun 2003
TL;DR: A parallelized version of the parallel rollout algorithm is proposed, and its performance is evaluated on a multi-class task scheduling problem using the OpenMP and MPI programming models.
Abstract: Parallel rollout is a formal method of combining multiple heuristic policies available to a sequential decision maker in the framework of Markov Decision Processes (MDPs). The method improves on the performance of all of the heuristic policies by adapting to the different stochastic system trajectories. Exploiting the inherent multi-level parallelism in the method, in this paper we propose a parallelized version of the parallel rollout algorithm and evaluate its performance on a multi-class task scheduling problem using the OpenMP and MPI programming models. We analyze and compare the performance of the two parallelized versions, OpenMP and MPI, in several execution environments. We show that the performance using the OpenMP API is higher than with MPI, due to lower overhead for data synchronization across processors.
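
A sketch of the coarse-grained parallelism being exploited, with Policy and evaluate() as illustrative placeholders rather than the paper's interfaces:

```cpp
#include <algorithm>
#include <vector>

struct Policy { int id; };

// Placeholder: a real implementation would average rewards over sampled
// system trajectories simulated under policy p.
double evaluate(const Policy& p) {
    return static_cast<double>(p.id % 7);
}

// Each candidate heuristic policy is evaluated by independent simulation,
// so the evaluations run in parallel and the best policy is selected.
int best_policy(const std::vector<Policy>& policies) {
    const int m = static_cast<int>(policies.size());
    std::vector<double> value(m);

    #pragma omp parallel for
    for (int i = 0; i < m; ++i)
        value[i] = evaluate(policies[i]);  // independent rollouts

    return static_cast<int>(
        std::max_element(value.begin(), value.end()) - value.begin());
}
```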

Book Chapter · DOI
Shi-Jung Kao
26 Jun 2003
TL;DR: Two possible implementation techniques are described and contrasted, and ways to synchronize the execution of C++ OpenMP programs in the event of uncaught exceptions are suggested.
Abstract: This paper discusses the issue of exception handling in C++ OpenMP programs. Two possible implementation techniques are described and contrasted. The paper also suggests ways to synchronize the execution of C++ OpenMP programs in the event of uncaught exceptions.
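
One widely used idiom (a generic sketch, not necessarily one of the paper's two techniques): since OpenMP requires an exception thrown inside a parallel region to be caught within that region by the same thread, each thread catches locally and the first exception is rethrown after the region:

```cpp
#include <cstdio>
#include <exception>
#include <stdexcept>

void work(int i) {
    if (i == 13) throw std::runtime_error("bad iteration");
}

int main() {
    std::exception_ptr err = nullptr;

    #pragma omp parallel for
    for (int i = 0; i < 100; ++i) {
        try {
            work(i);
        } catch (...) {
            // Catch inside the region; remember the first failure.
            #pragma omp critical
            if (!err) err = std::current_exception();
        }
    }

    if (err) {                    // rethrow serially, outside the region
        try { std::rethrow_exception(err); }
        catch (const std::exception& e) { std::printf("%s\n", e.what()); }
    }
    return 0;
}
```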

Book Chapter · DOI
26 Jun 2003
TL;DR: An in-place method using vacancy tracking cycles is developed, which outperforms the traditional 2-array method and demonstrates the validity of the parallel paradigm of mixing MPI with OpenMP.
Abstract: We evaluate dynamic data remapping on clusters of SMPs under the OpenMP, MPI, and hybrid paradigms. The traditional method of multi-dimensional array transposition needs an auxiliary array of the same size and a copy-back stage. We recently developed an in-place method using vacancy tracking cycles. The vacancy tracking algorithm outperforms the traditional two-array method, as demonstrated by extensive comparisons. The performance of multi-threaded parallelism using OpenMP is first tested with different scheduling methods and different numbers of threads. Both methods are then parallelized using several parallel paradigms. At the node level, pure OpenMP outperforms pure MPI by a factor of 2.76 for the vacancy tracking method. Across the entire cluster of SMP nodes, by carefully choosing thread numbers, the hybrid MPI/OpenMP implementation outperforms pure MPI by a factor of 3.79 for the traditional method and 4.44 for the vacancy tracking method, demonstrating the validity of the parallel paradigm of mixing MPI with OpenMP.
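
For illustration, a serial sketch of cycle-following in-place transposition, the family of algorithms the vacancy tracking method belongs to; the paper's variant and its OpenMP/MPI parallelization over independent cycles are more elaborate:

```cpp
#include <utility>
#include <vector>

// In-place transpose of a row-major n1 x n2 matrix: element k moves to
// (k * n1) mod (n1*n2 - 1), with 0 and n1*n2-1 fixed, so walking each
// permutation cycle once transposes the array with no auxiliary copy.
void transpose_inplace(std::vector<double>& a, long n1, long n2) {
    const long N = n1 * n2;
    std::vector<bool> moved(N, false);
    for (long start = 1; start < N - 1; ++start) {
        if (moved[start]) continue;
        long k = start;
        double carried = a[start];      // value displaced along the cycle
        do {
            const long dest = (k * n1) % (N - 1);  // where a[k] belongs
            std::swap(carried, a[dest]);
            moved[dest] = true;
            k = dest;
        } while (k != start);
    }
}
```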

Book Chapter · DOI
26 Jun 2003
TL;DR: An extended overhead-analysis scheme based on a layered model is proposed for OpenMP programming, to further enhance the capability of overhead analysis and thus make OpenMP performance tuning easier.
Abstract: Overhead analysis was developed as a performance-tuning approach for parallel programming and was adopted by several performance-analysis systems for OpenMP programs. In this paper, an extended overhead-analysis scheme based on a layered model is proposed for OpenMP programming, to further enhance the capability of overhead analysis and thus make OpenMP performance tuning easier. An example case called ILP/TLP overlap is studied in detail to illustrate the idea of the layered overhead model, and a new way to organize overheads hierarchically is also presented based on the layered overhead model.