
Showing papers presented at "International Workshop on OpenMP in 2010"


Book ChapterDOI
14 Jun 2010
TL;DR: This work simplifies OpenMP research by removing the problematic tight coupling between compiler translations and runtime libraries, and presents a set of rules to define a common OpenMP runtime library (XOMP) on top of multiple runtime libraries.
Abstract: OpenMP is a popular and evolving programming model for shared-memory platforms. It relies on compilers to target modern hardware architectures for optimal performance. A variety of extensible and robust research compilers are key to OpenMP’s sustainable success in the future. In this paper, we present our efforts to build an OpenMP 3.0 research compiler for C, C++, and Fortran using the ROSE source-to-source compiler framework. Our goal is to support OpenMP research for ourselves and others. We have extended ROSE’s internal representation to handle all OpenMP 3.0 constructs, thus facilitating experimenting with them. Since OpenMP research is often complicated by the tight coupling of the compiler translation and the runtime system, we present a set of rules to define a common OpenMP runtime library (XOMP) on top of multiple runtime libraries. These rules additionally define how to build a set of translations targeting XOMP. Our work demonstrates how to reuse OpenMP translations across different runtime libraries. This work simplifies OpenMP research by decoupling the problematic dependence between the compiler translations and the runtime libraries. We present an evaluation of our work by demonstrating an analysis tool for OpenMP correctness. We also show how XOMP can be defined using both GOMP and Omni. Our comparative performance results against other OpenMP compilers demonstrate that our flexible runtime support does not incur additional overhead.
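
The core design idea is the XOMP layering: compiler-generated translations call one neutral runtime interface, and that interface forwards to whichever library (GOMP, Omni) was selected at build time. Below is a minimal sketch of this layering in C; the xomp_* names are illustrative, not ROSE's actual XOMP API, while the GOMP_* entry points are libgomp's real ones.

```c
/* Sketch of the XOMP layering idea: one neutral interface, multiple
 * backends chosen at build time. xomp_* names are illustrative. */
void xomp_parallel_start(void (*fn)(void *), void *data, unsigned nthreads)
{
#if defined(XOMP_USE_GOMP)
    /* libgomp entry points, as emitted by GCC-translated code */
    extern void GOMP_parallel_start(void (*)(void *), void *, unsigned);
    GOMP_parallel_start(fn, data, nthreads);
    fn(data);            /* the encountering thread also runs the region */
#else
    fn(data);            /* fallback backend: run the region sequentially */
#endif
}

void xomp_parallel_end(void)
{
#if defined(XOMP_USE_GOMP)
    extern void GOMP_parallel_end(void);
    GOMP_parallel_end();
#endif
}
```

A translation targeting such an interface never changes when the backend library does, which is the decoupling the paper describes.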

67 citations


Book ChapterDOI
14 Jun 2010
TL;DR: An extended taxonomy of hybrid MPI/OpenMP programming and a new module for the MPC framework providing a fully 2.5-compliant OpenMP runtime completely integrated into an MPI 1.3 implementation are introduced.
Abstract: With the advent of multicore- and manycore-based supercomputers, parallel programming models like MPI and OpenMP have become more widely used to express various levels of parallelism in applications. But even though combining multiple models is possible, the resulting performance may not meet expectations. This is mainly due to collaboration issues between the runtime implementations. In this paper, we introduce an extended taxonomy of hybrid MPI/OpenMP programming and a new module for the MPC framework providing a fully 2.5-compliant OpenMP runtime completely integrated into an MPI 1.3 implementation. The design and implementation guidelines enable two features: (i) built-in oversubscribing capabilities with performance comparable to state-of-the-art implementations on pure OpenMP benchmarks and programs, and (ii) the possibility to run hybrid MPI/OpenMP applications with a limited overhead due to the mix of two different programming models.
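
For context, the usage model such a unified runtime targets is the standard hybrid pattern below: MPI ranks across processes, an OpenMP team inside each rank. This is a minimal, self-contained example using only standard MPI and OpenMP APIs; nothing in it is MPC-specific.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* FUNNELED: only the master thread of each process makes MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel   /* thread-level parallelism inside the rank */
    printf("rank %d: thread %d of %d\n",
           rank, omp_get_thread_num(), omp_get_num_threads());

    MPI_Finalize();
    return 0;
}
```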

31 citations


Book ChapterDOI
14 Jun 2010
TL;DR: The authors design and implement “OMPCUDA”, an OpenMP framework for GPUs based on the Omni OpenMP Compiler, and validate with test programs that parallel speedup can easily be obtained from the same code as existing OpenMP.
Abstract: The arithmetic performance of GPGPU computing attracts attention; however, the difficulty of programming GPUs poses a problem. We have proposed GPGPU programming that uses existing parallel programming techniques, and as a concrete realization of this proposal we are now developing an OpenMP framework for GPUs. The framework is based on the Omni OpenMP Compiler and is named “OMPCUDA”. In this paper we describe the design and implementation of OMPCUDA. We evaluated it using test programs and validated that parallel speedup could easily be achieved with the same code as existing OpenMP.
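
The point of the approach is that code like the following ordinary OpenMP loop (a generic example, not taken from the paper) compiles unchanged: a conventional OpenMP compiler runs it on CPU threads, while a framework like OMPCUDA can map the independent iterations onto GPU threads.

```c
#include <omp.h>

/* an ordinary data-parallel OpenMP loop: each iteration is independent,
 * so the same source can be mapped to CPU threads or GPU threads */
void saxpy(int n, float a, const float *x, float *y)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```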

25 citations


Book ChapterDOI
14 Jun 2010
TL;DR: A portable method to distinguish individual task instances and to track their suspension and resumption with event-based instrumentation is described and possible extensions of the OpenMP specification are discussed to provide general support for task identifiers with untied tasks.
Abstract: With version 3.0, the OpenMP specification introduced a task construct and with it an additional dimension of concurrency. While offering a convenient means to express task parallelism, the new construct presents a serious challenge to event-based performance analysis. Since tasking may disrupt the classic sequence of region entry and exit events, essential analysis procedures such as reconstructing dynamic call paths or correctly attributing performance metrics to individual task region instances may become impossible. To overcome this limitation, we describe a portable method to distinguish individual task instances and to track their suspension and resumption with event-based instrumentation. Implemented as an extension of the OPARI source-code instrumenter, our portable solution supports C/C++ programs with tied tasks and with untied tasks that are suspended only at implied scheduling points, while introducing only negligible measurement overhead. Finally, we discuss possible extensions of the OpenMP specification to provide general support for task identifiers with untied tasks.
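
A minimal sketch of the underlying idea, assuming a hypothetical instrumentation scheme rather than OPARI's actual generated code: each task instance draws an ID at creation time and carries it into the task body via firstprivate, so that enter/exit (and suspend/resume) events can be attributed to the correct instance even when the runtime interleaves tasks.

```c
#include <stdio.h>

static int next_task_id;                 /* task-instance counter */

/* call from inside a parallel region; each created task gets its own ID */
void spawn_instrumented_task(int work_item)
{
    int my_id;
    #pragma omp critical(task_id_lock)   /* OpenMP 3.0-safe counter update */
    my_id = next_task_id++;              /* ID fixed at task creation */

    #pragma omp task firstprivate(my_id, work_item)
    {
        printf("event: task instance %d enters\n", my_id);
        /* ... task body; task scheduling points may suspend it here ... */
        printf("event: task instance %d exits\n", my_id);
    }
}
```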

23 citations


Proceedings Article
01 Jun 2010
TL;DR: The diverse motivations and constraints converging towards the design of the simple yet powerful language extension are surveyed, and experimental results of a prototype implementation in a public branch of GCC 4.5 are presented.
Abstract: This paper introduces an extension to OpenMP 3.0 enabling stream programming with minimal, incremental additions that seamlessly integrate into the current specification. The stream programming model decomposes programs into tasks and makes explicit the flow of data among them, thus exposing data, task, and pipeline parallelism. It helps programmers express concurrency and data locality properties, avoiding non-portable low-level code and early optimizations. We survey the diverse motivations and constraints converging towards the design of our simple yet powerful language extension, and we present experimental results of a prototype implementation in a public branch of GCC 4.5.
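
In the style of the proposal, tasks annotate the data they consume and produce, which makes a producer/consumer pipeline explicit. The sketch below is illustrative: the input/output clause spellings follow the proposal and are not standard OpenMP, and produce()/consume() are assumed helpers.

```c
extern int  produce(int i);
extern void consume(int v);

void pipeline(int n)
{
    int x;                            /* x acts as a stream between stages */
    for (int i = 0; i < n; i++) {
        #pragma omp task output(x)    /* producer stage writes the stream */
        x = produce(i);

        #pragma omp task input(x)     /* consumer stage reads the stream */
        consume(x);
    }
}
```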

19 citations


Book ChapterDOI
14 Jun 2010
TL;DR: This paper identifies issues with the current OpenMP specification, proposes a path to extend OpenMP with error-handling capabilities, and, as a first step, adds a construct that cleanly shuts down parallel regions.
Abstract: OpenMP lacks essential features for developing mission-critical software. In particular, it has no support for detecting and handling errors or even a concept of them. In this paper, the OpenMP Error Model Subcommittee reports on solutions under consideration for this major omission. We identify issues with the current OpenMP specification and propose a path to extend OpenMP with error-handling capabilities. We add a construct that cleanly shuts down parallel regions as a first step. We then discuss two orthogonal proposals that extend OpenMP with features to handle system-level and user-defined errors.
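
OpenMP did later standardize a mechanism in this spirit: the cancellation constructs of OpenMP 4.0. As a hedged illustration of what cleanly shutting down a region looks like in that eventual form (not the exact construct proposed here):

```c
extern void work(int item);

/* run with OMP_CANCELLATION=true for cancellation to take effect */
void process(int n, int *data)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        if (data[i] < 0) {                 /* error detected */
            #pragma omp cancel for         /* request clean loop shutdown */
        }
        #pragma omp cancellation point for /* other threads stop here */
        work(data[i]);
    }
}
```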

18 citations


Book ChapterDOI
14 Jun 2010
TL;DR: This study combines a software TM (STM) system to support transactions with an OpenMP implementation to start thread teams and provide task and loop-level parallelization, demonstrating that even with the relatively high overheads of STM, transactions can outperform OpenMP critical sections by 10%.
Abstract: Transactional Memory (TM) has received significant attention recently as a mechanism to reduce the complexity of shared memory programming. We explore the potential of TM to improve OpenMP applications. We combine a software TM (STM) system to support transactions with an OpenMP implementation to start thread teams and provide task and loop-level parallelization. We apply this system to two application scenarios that reflect realistic TM use cases. Our results with this system demonstrate that even with the relatively high overheads of STM, transactions can outperform OpenMP critical sections by 10%. Overall, our study demonstrates that extending OpenMP to include transactions would ease programming effort while allowing improved performance.
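
The contrast being measured, sketched on a histogram-style update (generic code, not the paper's benchmark): a critical section serializes every update, whereas a transaction serializes only on actual conflicts. The transactional pragma in the comment is illustrative syntax, not standard OpenMP; bin_of() and weight() are assumed helpers.

```c
extern int    bin_of(int item);
extern double weight(int item);

void histogram(int n, const int *items, double *bins)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        int b = bin_of(items[i]);

        #pragma omp critical     /* baseline: all threads serialize here */
        bins[b] += weight(items[i]);

        /* transactional alternative (illustrative syntax):
         *     #pragma tm_atomic
         *     bins[b] += weight(items[i]);
         * updates to different bins would then proceed concurrently,
         * with conflicting updates detected and retried by the STM */
    }
}
```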

17 citations


Book ChapterDOI
14 Jun 2010
TL;DR: This work provides a user-friendly solution to the performance problems of nested OpenMP programs concerning thread and data locality, particularly on cc-NUMA architectures, and demonstrates its benefits by comparing the performance of kernel benchmarks and real-world applications with and without the affinity optimizations.
Abstract: In this work we discuss the performance problems of nested OpenMP programs concerning thread and data locality, particularly on cc-NUMA architectures. We provide a user-friendly solution and demonstrate its benefits by comparing the performance of several kernel benchmarks and real-world applications with and without our affinity optimizations.
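
For readers unfamiliar with the setting, "nested OpenMP" means opening an inner parallel region inside an outer one, as in the minimal example below; where the resulting threads and their data end up on a cc-NUMA machine is precisely what affinity optimizations control.

```c
#include <omp.h>

void nested_example(void)
{
    omp_set_nested(1);                       /* enable nested parallelism */

    #pragma omp parallel num_threads(2)      /* outer team: 2 threads */
    {
        #pragma omp parallel num_threads(4)  /* each opens an inner team */
        {
            /* 8 threads in total; binding each team near its data is
             * the cc-NUMA locality problem addressed here */
        }
    }
}
```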

14 citations


Book ChapterDOI
Jianian Yan1, Jiangzhou He1, Wentao Han1, Wenguang Chen1, Weimin Zheng1 
14 Jun 2010
TL;DR: A prototype scheduler, SWOMPS, is designed and implemented to help schedule the threads of all concurrent applications system-wide, and it is shown that the performance slowdown of individual applications in concurrent execution is reasonable.
Abstract: With the approach of the many-core era, it becomes more and more difficult for a single OpenMP application to efficiently utilize all the available processor cores. On the other hand, the available cores become more than necessary for some applications. We believe executing multiple OpenMP applications concurrently will be a common usage model in the future. In this model, how threads are scheduled on the cores is important, as the cores are asymmetric. We have designed and implemented a prototype scheduler, SWOMPS, to help schedule the threads of all the concurrent applications system-wide. The scheduler makes its decisions based on the underlying hardware configuration as well as hints about the scheduling preferences of each application provided by users. Experimental evaluation shows SWOMPS is quite efficient in improving performance. With the help of SWOMPS, we compared exclusively running one application with concurrently running multiple applications in terms of system throughput and individual application performance. In various experimental comparisons, concurrent execution delivers higher throughput, while the performance slowdown of individual applications in concurrent execution remains reasonable.

10 citations


Book ChapterDOI
14 Jun 2010
TL;DR: This paper proposes new mechanisms to allow the use of most pre-existing binary functions on user-defined data types as User-Defined Reduction (UDR) operators and shows that the UDR prototype implementation provides consistently good performance across a range of thread counts without increasing general runtime overheads.
Abstract: Reductions are commonly used in parallel programs to produce a global result from partial results computed in parallel. Currently, OpenMP only supports reductions for primitive data types and a limited set of base language operators. This is a significant limitation for those applications that employ user-defined data types (e.g., objects). Implementing manual reduction algorithms makes software development more complex and error-prone. Additionally, an OpenMP runtime system cannot optimize a manual reduction algorithm in ways typically applied to reductions on primitive types. In this paper, we propose new mechanisms to allow the use of most pre-existing binary functions on user-defined data types as User-Defined Reduction (UDR) operators. Our measurements show that our UDR prototype implementation provides consistently good performance across a range of thread counts without increasing general runtime overheads.
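
UDRs in this spirit were later standardized in OpenMP 4.0's declare reduction directive. As a hedged illustration of the concept in that eventual syntax (not the paper's proposed mechanism), here is a reduction over a user-defined type built from a pre-existing binary function:

```c
#include <float.h>

typedef struct { double val; int idx; } MinLoc;   /* user-defined type */

/* pre-existing binary function used as the reduction combiner */
static MinLoc minloc_combine(MinLoc a, MinLoc b)
{
    return (a.val <= b.val) ? a : b;
}

#pragma omp declare reduction(minloc : MinLoc :                   \
        omp_out = minloc_combine(omp_out, omp_in))                \
        initializer(omp_priv = (MinLoc){ DBL_MAX, -1 })

MinLoc find_min(int n, const double *x)
{
    MinLoc m = { DBL_MAX, -1 };
    #pragma omp parallel for reduction(minloc : m)
    for (int i = 0; i < n; i++)
        if (x[i] < m.val) { m.val = x[i]; m.idx = i; }
    return m;
}
```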

10 citations


Book ChapterDOI
14 Jun 2010
TL;DR: This paper addresses the common use case of tasks generated in a tree-like hierarchy, with task granularity decreasing as depth increases, and proposes a new final clause to force coalescing of excessively fine-grained tasks.
Abstract: OpenMP tasks were introduced in order to support irregular parallelism. However, task runtime overhead is necessarily higher than for worksharing constructs, and can hamper performance if the tasks are too finely grained. In this paper, we address the common use case of tasks generated in a tree-like hierarchy, with task granularity decreasing as depth increases, and propose a new final clause to force coalescing of excessively fine-grained tasks.
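
The final clause was subsequently adopted in OpenMP 3.1. A minimal sketch of the tree-recursive pattern the paper targets, in that adopted syntax: below a cutoff depth the tasks become final, so all deeper descendants execute inline instead of being created as separate tasks (the cutoff value 8 is arbitrary).

```c
/* call the initial fib(n, 0) from inside a parallel/single region */
long fib(int n, int depth)
{
    long a, b;
    if (n < 2) return n;

    #pragma omp task shared(a) final(depth >= 8)  /* coalesce deep tasks */
    a = fib(n - 1, depth + 1);

    #pragma omp task shared(b) final(depth >= 8)
    b = fib(n - 2, depth + 1);

    #pragma omp taskwait
    return a + b;
}
```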

Book ChapterDOI
14 Jun 2010
TL;DR: This paper introduces hybrid parallel programming with XPFortran on SMP clusters, in which thread-level parallelism is realized by OpenMP, presents the language support and compiler implementation of OpenMP directives in XPFortran, and shares some experiences in XPFortran-OpenMP hybrid programming.
Abstract: The process-thread hybrid programming paradigm is commonly employed in SMP clusters. XPFortran, a parallel programming language that specifies a set of compiler directives and library routines, can be used to realize process-level parallelism in distributed memory systems. In this paper, we introduce hybrid parallel programming with XPFortran on SMP clusters, in which thread-level parallelism is realized by OpenMP. We present the language support and compiler implementation of OpenMP directives in XPFortran, and show some of our experiences in XPFortran-OpenMP hybrid programming. For nested loops parallelized by process-thread hybrid programming, it is common practice to use process parallelization for outer loops and thread parallelization for inner ones. However, we have found that in some cases it is possible to write an XPFortran-OpenMP hybrid program the other way around, i.e., OpenMP outside, XPFortran inside. Our evaluation results show that this programming style sometimes delivers better performance than the traditional one. We therefore recommend using hybrid parallelization flexibly.

Proceedings Article
26 Feb 2010
TL;DR: The syntax and examples of the proposed features for introducing data locality into OpenMP are presented, in the hope of enabling further discussion of useful language features to keep OpenMP scalable on emerging architectures.
Abstract: This paper presents our idea for introducing a data locality feature into OpenMP. Given that memory systems are hierarchical while OpenMP is flat, we believe it is important to introduce new features that give OpenMP programmers the capability to manage data layout and to align tasks and data as closely as possible on modern architectures. We present the syntax and examples of the proposed features in this paper, and hope to enable further discussion of useful language features to keep OpenMP scalable on emerging architectures.

Book ChapterDOI
14 Jun 2010
TL;DR: This work proposes and implements process-level scheduling of OpenMP parallel regions and presents a number of scheduling optimizations based on system topology information, and evaluates their effectiveness in terms of metrics calculated in simulations as well as experimentally obtained performance and power consumption results.
Abstract: Multi-core multi-processor machines provide parallelism at multiple levels, including CPUs, cores and hardware multithreading. Elements at each level in this hierarchy potentially exhibit heterogeneous memory access latencies. Due to these issues and the high degree of hardware parallelism, existing OpenMP applications often fail to use the whole system effectively. To increase throughput and decrease power consumption of OpenMP systems employed in HPC settings we propose and implement process-level scheduling of OpenMP parallel regions. We present a number of scheduling optimizations based on system topology information, and evaluate their effectiveness in terms of metrics calculated in simulations as well as experimentally obtained performance and power consumption results. On 32 core machines our methods achieve performance improvements of up to 33% as compared to standard OS-level scheduling, and reduce power consumption by an average of 12% for long-term tests.
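
Process-level placement of this kind ultimately rests on OS affinity interfaces. A minimal sketch using the standard Linux API (not the paper's scheduler; the core numbering is an assumption about the machine's topology):

```c
#define _GNU_SOURCE
#include <sched.h>

/* pin the calling thread (and threads it later spawns inherit the mask)
 * to one socket's cores */
void pin_to_socket0(void)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    for (int c = 0; c < 8; c++)   /* assumption: cores 0-7 share socket 0 */
        CPU_SET(c, &mask);
    sched_setaffinity(0, sizeof(mask), &mask);   /* 0 = calling thread */
}
```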

Book ChapterDOI
14 Jun 2010
TL;DR: This study of parallelism in fuzzy systems using OpenMP, and of its feasibility on embedded platforms, found that the coarse-grained approach is more effective because the overhead of OpenMP becomes more visible on low-speed CPUs.
Abstract: Developing fuzzy applications involves many steps, each of which may require many computation cycles depending on the application and the target platform. In this work, we study parallelism in fuzzy systems using OpenMP and its feasibility on embedded platforms. Two versions of the parallelization are considered: fine-grained and coarse-grained. In our study, we found the coarse-grained approach to be more effective, because the overhead of OpenMP becomes more visible on a low-speed CPU; the coarse-grained approach is therefore suggested. Two variants, using parallel-for and sections, are proposed; they yield different speedup rates depending on the characteristics of the application and the fuzzy parameters. In general, the experiments show that as the system runs continuously, the OpenMP implementation achieves a certain speedup, overcoming the OpenMP overhead through the proposed parallelization schemes.
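
The two coarse-grained variants mentioned correspond to the standard OpenMP constructs sketched below (generic code, not the paper's; evaluate_rule(), fuzzify(), and defuzzify_previous() are assumed helpers): a parallel-for over independent rule evaluations, and sections that overlap distinct stages.

```c
#include <omp.h>

extern double evaluate_rule(int r);
extern void   fuzzify(void), defuzzify_previous(void);

/* parallel-for variant: independent rule evaluations share one loop */
void eval_rules(int nrules, double *strength)
{
    #pragma omp parallel for
    for (int r = 0; r < nrules; r++)
        strength[r] = evaluate_rule(r);
}

/* sections variant: distinct stages run concurrently, e.g. fuzzifying
 * the current input while defuzzifying the previous one */
void eval_stages(void)
{
    #pragma omp parallel sections
    {
        #pragma omp section
        fuzzify();
        #pragma omp section
        defuzzify_previous();
    }
}
```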