
Showing papers presented at "International Workshop on OpenMP in 2010"


Book ChapterDOI
14 Jun 2010
TL;DR: This work simplifies OpenMP research by removing the problematic tight coupling between compiler translations and runtime libraries, and presents a set of rules to define a common OpenMP runtime library (XOMP) on top of multiple runtime libraries.
Abstract: OpenMP is a popular and evolving programming model for shared-memory platforms. It relies on compilers to target modern hardware architectures for optimal performance. A variety of extensible and robust research compilers are key to OpenMP’s sustainable success in the future. In this paper, we present our efforts to build an OpenMP 3.0 research compiler for C, C++, and Fortran using the ROSE source-to-source compiler framework. Our goal is to support OpenMP research for ourselves and others. We have extended ROSE’s internal representation to handle all OpenMP 3.0 constructs, thus facilitating experimenting with them. Since OpenMP research is often complicated by the tight coupling of the compiler translation and the runtime system, we present a set of rules to define a common OpenMP runtime library (XOMP) on top of multiple runtime libraries. These rules additionally define how to build a set of translations targeting XOMP. Our work demonstrates how to reuse OpenMP translations across different runtime libraries. This work simplifies OpenMP research by decoupling the problematic dependence between the compiler translations and the runtime libraries. We present an evaluation of our work by demonstrating an analysis tool for OpenMP correctness. We also show how XOMP can be defined using both GOMP and Omni. Our comparative performance results against other OpenMP compilers demonstrate that our flexible runtime support does not incur additional overhead.
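
The core design idea is the XOMP layering: compiler-generated translations call one neutral runtime interface, and that interface forwards to whichever library (GOMP, Omni) was selected at build time. Below is a minimal sketch of this layering in C; the xomp_* names are illustrative, not ROSE's actual XOMP API, while the GOMP_* entry points are libgomp's real ones.

```c
/* Sketch of the XOMP layering idea: one neutral interface, multiple
 * backends chosen at build time. xomp_* names are illustrative. */
void xomp_parallel_start(void (*fn)(void *), void *data, unsigned nthreads)
{
#if defined(XOMP_USE_GOMP)
    /* libgomp entry points, as emitted by GCC-translated code */
    extern void GOMP_parallel_start(void (*)(void *), void *, unsigned);
    GOMP_parallel_start(fn, data, nthreads);
    fn(data);            /* the encountering thread also runs the region */
#else
    fn(data);            /* fallback backend: run the region sequentially */
#endif
}

void xomp_parallel_end(void)
{
#if defined(XOMP_USE_GOMP)
    extern void GOMP_parallel_end(void);
    GOMP_parallel_end();
#endif
}
```

A translation targeting such an interface never changes when the backend library does, which is the decoupling the paper describes.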

67 citations


Book ChapterDOI
14 Jun 2010
TL;DR: An extended taxonomy of hybrid MPI/OpenMP programming and a new module for the MPC framework providing a fully 2.5-compliant OpenMP runtime completely integrated into an MPI 1.3 implementation are introduced.
Abstract: With the advent of multicore- and manycore-based supercomputers, parallel programming models like MPI and OpenMP have become more widely used to express various levels of parallelism in applications. But even though combining multiple models is possible, the resulting performance may not meet expectations. This is mainly due to collaboration issues between the runtime implementations. In this paper, we introduce an extended taxonomy of hybrid MPI/OpenMP programming and a new module for the MPC framework providing a fully 2.5-compliant OpenMP runtime completely integrated into an MPI 1.3 implementation. The design and implementation guidelines enable two features: (i) built-in oversubscribing capabilities with performance comparable to state-of-the-art implementations on pure OpenMP benchmarks and programs, and (ii) the possibility to run hybrid MPI/OpenMP applications with a limited overhead due to the mix of two different programming models.
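
For context, the usage model such a unified runtime targets is the standard hybrid pattern below: MPI ranks across processes, an OpenMP team inside each rank. This is a minimal, self-contained example using only standard MPI and OpenMP APIs; nothing in it is MPC-specific.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* FUNNELED: only the master thread of each process makes MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel   /* thread-level parallelism inside the rank */
    printf("rank %d: thread %d of %d\n",
           rank, omp_get_thread_num(), omp_get_num_threads());

    MPI_Finalize();
    return 0;
}
```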

31 citations


Book ChapterDOI
14 Jun 2010
TL;DR: The authors design and implement “OMPCUDA”, an OpenMP framework for GPUs based on the Omni OpenMP Compiler, and validate with test programs that parallel speedup can easily be obtained from the same code as existing OpenMP.
Abstract: The arithmetic performance of GPGPU computing attracts attention; however, the difficulty of programming GPUs poses a problem. We have proposed GPGPU programming that uses existing parallel programming techniques, and as a concrete realization of this proposal we are now developing an OpenMP framework for GPUs. The framework is based on the Omni OpenMP Compiler and is named “OMPCUDA”. In this paper we describe the design and implementation of OMPCUDA. We evaluated it using test programs and validated that parallel speedup could easily be achieved with the same code as existing OpenMP.
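
The point of the approach is that code like the following ordinary OpenMP loop (a generic example, not taken from the paper) compiles unchanged: a conventional OpenMP compiler runs it on CPU threads, while a framework like OMPCUDA can map the independent iterations onto GPU threads.

```c
#include <omp.h>

/* an ordinary data-parallel OpenMP loop: each iteration is independent,
 * so the same source can be mapped to CPU threads or GPU threads */
void saxpy(int n, float a, const float *x, float *y)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```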

25 citations


Book ChapterDOI
14 Jun 2010
TL;DR: A portable method to distinguish individual task instances and to track their suspension and resumption with event-based instrumentation is described and possible extensions of the OpenMP specification are discussed to provide general support for task identifiers with untied tasks.
Abstract: With version 3.0, the OpenMP specification introduced a task construct and with it an additional dimension of concurrency. While offering a convenient means to express task parallelism, the new construct presents a serious challenge to event-based performance analysis. Since tasking may disrupt the classic sequence of region entry and exit events, essential analysis procedures such as reconstructing dynamic call paths or correctly attributing performance metrics to individual task region instances may become impossible. To overcome this limitation, we describe a portable method to distinguish individual task instances and to track their suspension and resumption with event-based instrumentation. Implemented as an extension of the OPARI source-code instrumenter, our portable solution supports C/C++ programs with tied tasks and with untied tasks that are suspended only at implied scheduling points, while introducing only negligible measurement overhead. Finally, we discuss possible extensions of the OpenMP specification to provide general support for task identifiers with untied tasks.
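
A minimal sketch of the underlying idea, assuming a hypothetical instrumentation scheme rather than OPARI's actual generated code: each task instance draws an ID at creation time and carries it into the task body via firstprivate, so that enter/exit (and suspend/resume) events can be attributed to the correct instance even when the runtime interleaves tasks.

```c
#include <stdio.h>

static int next_task_id;                 /* task-instance counter */

/* call from inside a parallel region; each created task gets its own ID */
void spawn_instrumented_task(int work_item)
{
    int my_id;
    #pragma omp critical(task_id_lock)   /* OpenMP 3.0-safe counter update */
    my_id = next_task_id++;              /* ID fixed at task creation */

    #pragma omp task firstprivate(my_id, work_item)
    {
        printf("event: task instance %d enters\n", my_id);
        /* ... task body; task scheduling points may suspend it here ... */
        printf("event: task instance %d exits\n", my_id);
    }
}
```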

23 citations


Proceedings Article
01 Jun 2010
TL;DR: The diverse motivations and constraints converging towards the design of the simple yet powerful language extension are surveyed, and experimental results of a prototype implementation in a public branch of GCC 4.5 are presented.
Abstract: This paper introduces an extension to OpenMP 3.0 enabling stream programming with minimal, incremental additions that seamlessly integrate into the current specification. The stream programming model decomposes programs into tasks and makes explicit the flow of data among them, thus exposing data, task, and pipeline parallelism. It helps programmers express concurrency and data locality properties, avoiding non-portable low-level code and early optimizations. We survey the diverse motivations and constraints converging towards the design of our simple yet powerful language extension, and we present experimental results of a prototype implementation in a public branch of GCC 4.5.
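
In the style of the proposal, tasks annotate the data they consume and produce, which makes a producer/consumer pipeline explicit. The sketch below is illustrative: the input/output clause spellings follow the proposal and are not standard OpenMP, and produce()/consume() are assumed helpers.

```c
extern int  produce(int i);
extern void consume(int v);

void pipeline(int n)
{
    int x;                            /* x acts as a stream between stages */
    for (int i = 0; i < n; i++) {
        #pragma omp task output(x)    /* producer stage writes the stream */
        x = produce(i);

        #pragma omp task input(x)     /* consumer stage reads the stream */
        consume(x);
    }
}
```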

19 citations


Book ChapterDOI
14 Jun 2010
TL;DR: This paper identifies issues with the current OpenMP specification, proposes a path to extend OpenMP with error-handling capabilities, and, as a first step, adds a construct that cleanly shuts down parallel regions.
Abstract: OpenMP lacks essential features for developing mission-critical software. In particular, it has no support for detecting and handling errors or even a concept of them. In this paper, the OpenMP Error Model Subcommittee reports on solutions under consideration for this major omission. We identify issues with the current OpenMP specification and propose a path to extend OpenMP with error-handling capabilities. We add a construct that cleanly shuts down parallel regions as a first step. We then discuss two orthogonal proposals that extend OpenMP with features to handle system-level and user-defined errors.
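
OpenMP did later standardize a mechanism in this spirit: the cancellation constructs of OpenMP 4.0. As a hedged illustration of what cleanly shutting down a region looks like in that eventual form (not the exact construct proposed here):

```c
extern void work(int item);

/* run with OMP_CANCELLATION=true for cancellation to take effect */
void process(int n, int *data)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        if (data[i] < 0) {                 /* error detected */
            #pragma omp cancel for         /* request clean loop shutdown */
        }
        #pragma omp cancellation point for /* other threads stop here */
        work(data[i]);
    }
}
```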

18 citations


Book ChapterDOI
14 Jun 2010
TL;DR: This study combines a software TM (STM) system to support transactions with an OpenMP implementation to start thread teams and provide task and loop-level parallelization, demonstrating that even with the relatively high overheads of STM, transactions can outperform OpenMP critical sections by 10%.
Abstract: Transactional Memory (TM) has received significant attention recently as a mechanism to reduce the complexity of shared memory programming. We explore the potential of TM to improve OpenMP applications. We combine a software TM (STM) system to support transactions with an OpenMP implementation to start thread teams and provide task and loop-level parallelization. We apply this system to two application scenarios that reflect realistic TM use cases. Our results with this system demonstrate that even with the relatively high overheads of STM, transactions can outperform OpenMP critical sections by 10%. Overall, our study demonstrates that extending OpenMP to include transactions would ease programming effort while allowing improved performance.
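
The contrast being measured, sketched on a histogram-style update (generic code, not the paper's benchmark): a critical section serializes every update, whereas a transaction serializes only on actual conflicts. The transactional pragma in the comment is illustrative syntax, not standard OpenMP; bin_of() and weight() are assumed helpers.

```c
extern int    bin_of(int item);
extern double weight(int item);

void histogram(int n, const int *items, double *bins)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        int b = bin_of(items[i]);

        #pragma omp critical     /* baseline: all threads serialize here */
        bins[b] += weight(items[i]);

        /* transactional alternative (illustrative syntax):
         *     #pragma tm_atomic
         *     bins[b] += weight(items[i]);
         * updates to different bins would then proceed concurrently,
         * with conflicting updates detected and retried by the STM */
    }
}
```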

17 citations


Book ChapterDOI
14 Jun 2010
TL;DR: This work provides a user-friendly solution to the performance problems of nested OpenMP programs concerning thread and data locality, particularly on cc-NUMA architectures, and demonstrates its benefits by comparing the performance of kernel benchmarks and real-world applications with and without the affinity optimizations.
Abstract: In this work we discuss the performance problems of nested OpenMP programs concerning thread and data locality, particularly on cc-NUMA architectures. We provide a user-friendly solution and demonstrate its benefits by comparing the performance of several kernel benchmarks and real-world applications with and without our affinity optimizations.
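
For readers unfamiliar with the setting, "nested OpenMP" means opening an inner parallel region inside an outer one, as in the minimal example below; where the resulting threads and their data end up on a cc-NUMA machine is precisely what affinity optimizations control.

```c
#include <omp.h>

void nested_example(void)
{
    omp_set_nested(1);                       /* enable nested parallelism */

    #pragma omp parallel num_threads(2)      /* outer team: 2 threads */
    {
        #pragma omp parallel num_threads(4)  /* each opens an inner team */
        {
            /* 8 threads in total; binding each team near its data is
             * the cc-NUMA locality problem addressed here */
        }
    }
}
```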

14 citations


Book ChapterDOI
Jianian Yan1, Jiangzhou He1, Wentao Han1, Wenguang Chen1, Weimin Zheng1 
14 Jun 2010
TL;DR: A prototype scheduler, SWOMPS, is designed and implemented to help schedule the threads of all concurrent applications system-wide, and it is shown that the performance slowdown of individual applications in concurrent execution is reasonable.
Abstract: With the approach of the many-core era, it becomes more and more difficult for a single OpenMP application to efficiently utilize all the available processor cores. On the other hand, the available cores become more than necessary for some applications. We believe executing multiple OpenMP applications concurrently will be a common usage model in the future. In this model, how threads are scheduled on the cores is important, as the cores are asymmetric. We have designed and implemented a prototype scheduler, SWOMPS, to help schedule the threads of all the concurrent applications system-wide. The scheduler makes its decisions based on the underlying hardware configuration as well as hints about the scheduling preferences of each application provided by users. Experimental evaluation shows SWOMPS is quite efficient in improving performance. With the help of SWOMPS, we compared exclusively running one application with concurrently running multiple applications in terms of system throughput and individual application performance. In various experimental comparisons, concurrent execution delivers higher throughput, while the performance slowdown of individual applications in concurrent execution remains reasonable.

10 citations


Book ChapterDOI
14 Jun 2010
TL;DR: This paper proposes new mechanisms to allow the use of most pre-existing binary functions on user-defined data types as User-Defined Reduction (UDR) operators and shows that the UDR prototype implementation provides consistently good performance across a range of thread counts without increasing general runtime overheads.
Abstract: Reductions are commonly used in parallel programs to produce a global result from partial results computed in parallel. Currently, OpenMP only supports reductions for primitive data types and a limited set of base language operators. This is a significant limitation for those applications that employ user-defined data types (e.g., objects). Implementing manual reduction algorithms makes software development more complex and error-prone. Additionally, an OpenMP runtime system cannot optimize a manual reduction algorithm in ways typically applied to reductions on primitive types. In this paper, we propose new mechanisms to allow the use of most pre-existing binary functions on user-defined data types as User-Defined Reduction (UDR) operators. Our measurements show that our UDR prototype implementation provides consistently good performance across a range of thread counts without increasing general runtime overheads.
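
UDRs in this spirit were later standardized in OpenMP 4.0's declare reduction directive. As a hedged illustration of the concept in that eventual syntax (not the paper's proposed mechanism), here is a reduction over a user-defined type built from a pre-existing binary function:

```c
#include <float.h>

typedef struct { double val; int idx; } MinLoc;   /* user-defined type */

/* pre-existing binary function used as the reduction combiner */
static MinLoc minloc_combine(MinLoc a, MinLoc b)
{
    return (a.val <= b.val) ? a : b;
}

#pragma omp declare reduction(minloc : MinLoc :                   \
        omp_out = minloc_combine(omp_out, omp_in))                \
        initializer(omp_priv = (MinLoc){ DBL_MAX, -1 })

MinLoc find_min(int n, const double *x)
{
    MinLoc m = { DBL_MAX, -1 };
    #pragma omp parallel for reduction(minloc : m)
    for (int i = 0; i < n; i++)
        if (x[i] < m.val) { m.val = x[i]; m.idx = i; }
    return m;
}
```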

10 citations


Book ChapterDOI
14 Jun 2010
TL;DR: This paper addresses the common use case of tasks generated in a tree-like hierarchy, with task granularity decreasing as depth increases, and proposes a new final clause to force coalescing of excessively fine-grained tasks.
Abstract: OpenMP tasks were introduced in order to support irregular parallelism. However, task runtime overhead is necessarily higher than for worksharing constructs, and can hamper performance if the tasks are too finely grained. In this paper, we address the common use case of tasks generated in a tree-like hierarchy, with task granularity decreasing as depth increases, and propose a new final clause to force coalescing of excessively fine-grained tasks.
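
The final clause was subsequently adopted in OpenMP 3.1. A minimal sketch of the tree-recursive pattern the paper targets, in that adopted syntax: below a cutoff depth the tasks become final, so all deeper descendants execute inline instead of being created as separate tasks (the cutoff value 8 is arbitrary).

```c
/* call the initial fib(n, 0) from inside a parallel/single region */
long fib(int n, int depth)
{
    long a, b;
    if (n < 2) return n;

    #pragma omp task shared(a) final(depth >= 8)  /* coalesce deep tasks */
    a = fib(n - 1, depth + 1);

    #pragma omp task shared(b) final(depth >= 8)
    b = fib(n - 2, depth + 1);

    #pragma omp taskwait
    return a + b;
}
```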

Book ChapterDOI
14 Jun 2010
TL;DR: This paper introduces hybrid parallel programming with XPFortran on SMP clusters, in which thread-level parallelism is realized by OpenMP, presents the language support and compiler implementation of OpenMP directives in XPFortran, and shares some experiences in XPFortran-OpenMP hybrid programming.
Abstract: The process-thread hybrid programming paradigm is commonly employed in SMP clusters. XPFortran, a parallel programming language that specifies a set of compiler directives and library routines, can be used to realize process-level parallelism in distributed memory systems. In this paper, we introduce hybrid parallel programming with XPFortran on SMP clusters, in which thread-level parallelism is realized by OpenMP. We present the language support and compiler implementation of OpenMP directives in XPFortran, and show some of our experiences in XPFortran-OpenMP hybrid programming. For nested loops parallelized by process-thread hybrid programming, it is common practice to use process parallelization for outer loops and thread parallelization for inner ones. However, we have found that in some cases it is possible to write an XPFortran-OpenMP hybrid program the other way around, i.e., OpenMP outside, XPFortran inside. Our evaluation results show that this programming style sometimes delivers better performance than the traditional one. We therefore recommend using hybrid parallelization flexibly.

Proceedings Article
26 Feb 2010
TL;DR: The syntax and examples of the proposed features for introducing data locality into OpenMP are presented, in the hope of enabling further discussion of useful language features to keep OpenMP scalable on emerging architectures.
Abstract: This paper presents our idea for introducing a data locality feature into OpenMP. Given that memory systems are hierarchical while OpenMP is flat, we believe it is important to introduce new features that give OpenMP programmers the capability to manage data layout and to align tasks and data as closely as possible on modern architectures. We present the syntax and examples of the proposed features in this paper, and hope to enable further discussion of useful language features to keep OpenMP scalable on emerging architectures.

Book ChapterDOI
14 Jun 2010
TL;DR: This work proposes and implements process-level scheduling of OpenMP parallel regions and presents a number of scheduling optimizations based on system topology information, and evaluates their effectiveness in terms of metrics calculated in simulations as well as experimentally obtained performance and power consumption results.
Abstract: Multi-core multi-processor machines provide parallelism at multiple levels, including CPUs, cores and hardware multithreading. Elements at each level in this hierarchy potentially exhibit heterogeneous memory access latencies. Due to these issues and the high degree of hardware parallelism, existing OpenMP applications often fail to use the whole system effectively. To increase throughput and decrease power consumption of OpenMP systems employed in HPC settings we propose and implement process-level scheduling of OpenMP parallel regions. We present a number of scheduling optimizations based on system topology information, and evaluate their effectiveness in terms of metrics calculated in simulations as well as experimentally obtained performance and power consumption results. On 32 core machines our methods achieve performance improvements of up to 33% as compared to standard OS-level scheduling, and reduce power consumption by an average of 12% for long-term tests.
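
Process-level placement of this kind ultimately rests on OS affinity interfaces. A minimal sketch using the standard Linux API (not the paper's scheduler; the core numbering is an assumption about the machine's topology):

```c
#define _GNU_SOURCE
#include <sched.h>

/* pin the calling thread (and threads it later spawns inherit the mask)
 * to one socket's cores */
void pin_to_socket0(void)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    for (int c = 0; c < 8; c++)   /* assumption: cores 0-7 share socket 0 */
        CPU_SET(c, &mask);
    sched_setaffinity(0, sizeof(mask), &mask);   /* 0 = calling thread */
}
```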

Book ChapterDOI
14 Jun 2010
TL;DR: This study of parallelism in fuzzy systems using OpenMP, and of its feasibility on embedded platforms, found that the coarse-grained approach is more effective because the overhead of OpenMP becomes more visible on low-speed CPUs.
Abstract: Developing fuzzy applications involves many steps, each of which may require many computation cycles depending on the application and the target platform. In this work, we study parallelism in fuzzy systems using OpenMP and its feasibility on embedded platforms. Two versions of the parallelization are considered: fine-grained and coarse-grained. In our study, we found the coarse-grained approach to be more effective, because the overhead of OpenMP becomes more visible on a low-speed CPU; the coarse-grained approach is therefore suggested. Two variants, using parallel-for and sections, are proposed; they yield different speedup rates depending on the characteristics of the application and the fuzzy parameters. In general, the experiments show that as the system runs continuously, the OpenMP implementation achieves a certain speedup, overcoming the OpenMP overhead through the proposed parallelization schemes.
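
The two coarse-grained variants mentioned correspond to the standard OpenMP constructs sketched below (generic code, not the paper's; evaluate_rule(), fuzzify(), and defuzzify_previous() are assumed helpers): a parallel-for over independent rule evaluations, and sections that overlap distinct stages.

```c
#include <omp.h>

extern double evaluate_rule(int r);
extern void   fuzzify(void), defuzzify_previous(void);

/* parallel-for variant: independent rule evaluations share one loop */
void eval_rules(int nrules, double *strength)
{
    #pragma omp parallel for
    for (int r = 0; r < nrules; r++)
        strength[r] = evaluate_rule(r);
}

/* sections variant: distinct stages run concurrently, e.g. fuzzifying
 * the current input while defuzzifying the previous one */
void eval_stages(void)
{
    #pragma omp parallel sections
    {
        #pragma omp section
        fuzzify();
        #pragma omp section
        defuzzify_previous();
    }
}
```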