
Showing papers presented at "International Workshop on OpenMP in 2015"


Book ChapterDOI
01 Oct 2015
TL;DR: This work introduces OpenMPE, an extension to OpenMP designed for power management, which exposes per-region multi-objective optimization hints and application-level adaptation parameters in order to create energy-saving opportunities for the whole system stack.
Abstract: Power, and consequently energy, has recently attained first-class system resource status, on par with conventional metrics such as CPU time. To reduce energy consumption, many hardware- and OS-level solutions have been investigated. However, application-level information - which can provide the system with valuable insights unattainable otherwise - was considered in only a handful of cases. We introduce OpenMPE, an extension to OpenMP designed for power management. OpenMP is the de-facto standard for programming parallel shared memory systems, but does not yet provide any support for power control. Our extension exposes (i) per-region multi-objective optimization hints and (ii) application-level adaptation parameters, in order to create energy-saving opportunities for the whole system stack. We have implemented OpenMPE support in a compiler and runtime system, and empirically evaluated its performance on two architectures, mobile and desktop. Our results demonstrate the effectiveness of OpenMPE with geometric-mean energy savings of 15 % across 9 use cases while maintaining full quality of service.

27 citations


Book ChapterDOI
01 Oct 2015
TL;DR: This paper shows the usefulness of three OmpSs features not currently handled by OpenMP 4.0 by deploying them on three applications of the PARSEC benchmark suite and measuring the performance benefits.
Abstract: OpenMP has been for many years the most widely used programming model for shared memory architectures. Periodically, new features are proposed and some of them are finally selected for inclusion in the OpenMP standard. The OmpSs programming model developed at the Barcelona Supercomputing Center (BSC) aims to be an OpenMP forerunner that handles the main OpenMP constructs plus some extra features not included in the OpenMP standard. In this paper we show the usefulness of three OmpSs features not currently handled by OpenMP 4.0 by deploying them on three applications of the PARSEC benchmark suite and showing the performance benefits. This paper also shows performance trade-offs between the OmpSs/OpenMP tasking and loop-parallelism constructs and shows how a hybrid implementation that combines both approaches is sometimes the best option.

15 citations
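To make the tasking-versus-worksharing trade-off concrete, here is a minimal sketch (not the paper's code; process() and the block size of 64 are illustrative assumptions) contrasting the two styles in standard OpenMP:

#include <algorithm>
#include <cstddef>
#include <vector>

// process() is a stand-in for any per-element kernel.
void process(double &x) { x *= 2.0; }

// Worksharing style: one parallel loop, iterations split among threads.
void loop_version(std::vector<double> &v) {
    #pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < v.size(); ++i)
        process(v[i]);
}

// Tasking style: one generator thread spawns a task per block of 64 elements.
void task_version(std::vector<double> &v) {
    #pragma omp parallel
    #pragma omp single
    for (std::size_t i = 0; i < v.size(); i += 64) {
        std::size_t end = std::min(i + 64, v.size());
        #pragma omp task firstprivate(i, end) shared(v)
        for (std::size_t j = i; j < end; ++j)
            process(v[j]);
    }
}

A hybrid of the kind the paper evaluates would mix both, e.g. a worksharing loop whose iterations spawn tasks for irregular inner work.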


Book ChapterDOI
01 Oct 2015
TL;DR: This paper describes the process of porting the NekBone mini-application to run on a Cray XC30 hybrid supercomputer using OpenMP device constructs, as introduced in version 4.0 of the OpenMP standard and implemented in a pre-release version of the Cray Compilation Environment (CCE) compiler.
Abstract: In this paper we describe the process of porting the NekBone mini-application to run on a Cray XC30 hybrid supercomputer using OpenMP device constructs, as introduced in version 4.0 of the OpenMP standard and implemented in a pre-release version of the Cray Compilation Environment (CCE) compiler. We document the porting process and show how the performance evolves during the addition of the 66 constructs needed to accelerate the application. In doing so, we provide a user-centric introduction to the device constructs and an overview of the approach needed to port a parallel application using them. Some contrasts with OpenACC are also drawn to aid those wishing to either implement both programming models or to migrate from one to the other.

15 citations
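For readers unfamiliar with the device constructs, a minimal OpenMP 4.0 offload sketch (an illustrative axpy-style kernel, not NekBone code):

void axpy_offload(int n, double alpha, const double *a, double *b) {
    // Keep a and b resident on the device for the duration of the kernel.
    #pragma omp target data map(to: a[0:n]) map(tofrom: b[0:n])
    {
        // Offload the loop: teams of threads on the accelerator share the iterations.
        #pragma omp target teams distribute parallel for
        for (int i = 0; i < n; ++i)
            b[i] += alpha * a[i];
    }
}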


Book ChapterDOI
01 Oct 2015
TL;DR: This work investigates the energy consumption of OpenMP applications on the new Intel processor generation, called Haswell, starting with the basic characteristics of the chip before looking at automatic energy optimization features.
Abstract: Modern processors contain many features to reduce the energy consumption of the chip. The gain from these features depends highly on the workload being executed. In this work, we investigate the energy consumption of OpenMP applications on the new Intel processor generation, called Haswell. We start with the basic characteristics of the chip before we look at automatic energy optimization features. Then, we investigate the energy consumed by load-imbalanced applications and present a library to lower the energy consumption for iteratively recurring imbalance patterns. Here, we show that energy savings of up to 20 % are possible without any loss of performance.

10 citations
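The recurring-imbalance pattern such a library targets can be sketched as follows (illustrative only; work() is an assumed kernel whose cost grows with the row index, so threads owning low rows idle at the implicit barrier in every outer iteration - idle time that can be spent at lower frequency):

void work(int r);   // assumed kernel, increasingly expensive with r

void sweep(int rows, int iters) {
    for (int t = 0; t < iters; ++t) {          // the same imbalance recurs each iteration
        #pragma omp parallel for schedule(static)
        for (int r = 0; r < rows; ++r)
            work(r);                           // cost grows with r -> imbalance
    }
}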


Book ChapterDOI
01 Oct 2015
TL;DR: The recently introduced OpenMP 4.0 standard extends the directive-based approach to exploit accelerators; however, programming clusters still requires the use of other specialized languages or libraries.
Abstract: Modern high-performance machines are challenging to program because of the availability of a wide array of compute resources that often require low-level, specialized knowledge to exploit. OpenMP is a directive-based approach that can effectively exploit shared-memory multicores. The recently introduced OpenMP 4.0 standard extends the directive-based approach to exploit accelerators. However, programming clusters still requires the use of other specialized languages or libraries.

9 citations


Book ChapterDOI
01 Oct 2015
TL;DR: This paper presents methods based on hardware transactional memory (HTM) for executing OpenMP barrier, critical, and taskwait directives without blocking, showing a 73 % performance improvement over traditional locking approaches and a 23 % improvement over other HTM approaches on critical sections.
Abstract: OpenMP applications with abundant parallelism are often characterized by their high performance. Unfortunately, OpenMP applications with many synchronization or serialization points perform poorly because of blocking, i.e., the threads have to wait for each other. In this paper, we present methods based on hardware transactional memory (HTM) for executing OpenMP barrier, critical, and taskwait directives without blocking. Although HTM is still relatively new in the Intel and IBM architectures, we experimentally show a 73 % performance improvement over traditional locking approaches, and 23 % better results than other HTM approaches on critical sections. Speculation over barriers can decrease execution time by up to 41 %. We expect that future systems with HTM support and more cores will benefit even more from our approach, as they are more likely to block.

8 citations
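A minimal sketch of the speculation idea on a critical section, using the Intel TSX RTM intrinsics (not the paper's implementation; note that a production elision scheme must also read the fallback lock's state inside the transaction to remain mutually exclusive with the fallback path):

#include <immintrin.h>   // Intel TSX RTM intrinsics: _xbegin/_xend

long counter = 0;

void add(long v) {
    if (_xbegin() == _XBEGIN_STARTED) {
        counter += v;                 // speculative, conflict-checked by hardware
        _xend();
    } else {
        #pragma omp critical
        counter += v;                 // non-speculative fallback path
    }
}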


Book ChapterDOI
01 Oct 2015
TL;DR: It is shown that the OMPT framework can detect unique patterns that can be used to build a quality detection model for false sharing in OpenMP programs; this work treats false sharing detection as a binary classification problem.
Abstract: Writing a parallel shared-memory application that scales well on future multi-core processors is a challenging task. The contention among shared resources increases as the number of threads increases. This may cause a false-sharing problem, which can degrade the performance of an application. OpenMP Tools (OMPT) [2], a performance-tool API for OpenMP, enables performance tools to gather useful performance-related information from OpenMP applications with low overhead. In this paper, we propose a light-weight false-sharing detection technique for the OpenMP programming model using OMPT. We show that the OMPT framework has the ability to detect unique patterns that can be used to build a quality detection model for false sharing in OpenMP programs. In this work, we treat the false-sharing detection problem as a binary classification problem. We develop a set of OpenMP programs in which false sharing can be turned on and off. We run these programs both with and without false sharing and collect a set of hardware performance event counts using OMPT. We use the collected data to train a binary classifier. We test the trained classifier using the NAS Parallel Benchmark applications. Our experiments show that the trained classifier can detect false-sharing cases with an average accuracy of around 90 %.

8 citations
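A classic false-sharing pattern of the kind such a classifier must recognize (illustrative; the thread count and 64-byte cache-line size are assumptions):

#include <omp.h>

enum { NT = 8 };
long packed[NT];                        // adjacent counters share cache lines -> false sharing
struct alignas(64) Padded { long v; };
Padded padded[NT];                      // one counter per 64-byte cache line

void count(long iters, bool fixed) {
    #pragma omp parallel num_threads(NT)
    {
        int t = omp_get_thread_num();
        for (long i = 0; i < iters; ++i) {
            if (fixed) ++padded[t].v;   // private line, no coherence traffic
            else       ++packed[t];     // heavy coherence traffic between cores
        }
    }
}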


Book ChapterDOI
01 Oct 2015
TL;DR: PAGANtec, a tool for error correction of next-generation sequencing data based on the novel PAGAN graph structure, is presented; it was parallelized with OpenMP, and performance analysis and tuning revealed that OpenMP tasks are a more suitable paradigm for this work than traditional work-sharing.
Abstract: Next-generation sequencing techniques have rapidly reduced the cost of sequencing a genome, but come with a relatively high error rate. Therefore, error correction of this data is a necessary task before assembly can take place. Since the input data is huge and error correction is compute-intensive, parallelizing this work on a modern shared-memory system can help to keep the runtime feasible. In this work we present PAGANtec, a tool for error correction of next-generation sequencing data, based on the novel PAGAN graph structure. PAGANtec was parallelized with OpenMP, and a performance analysis and tuning were carried out. The analysis revealed that OpenMP tasks are a more suitable paradigm for this work than traditional work-sharing.

7 citations
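A generic task-per-node graph walk illustrating why tasking fits such irregular structures better than worksharing loops (not PAGANtec code; Node, fix_node() and an acyclic traversal are assumptions):

struct Node { int nkids; Node **kids; /* sequence payload omitted */ };

void fix_node(Node *n);                  // assumed per-node error-correction kernel

void correct(Node *n) {
    fix_node(n);
    for (int i = 0; i < n->nkids; ++i) { // irregular fan-out: one task per child
        #pragma omp task
        correct(n->kids[i]);
    }
    #pragma omp taskwait                 // join the subgraph before returning
}

void run(Node *root) {
    #pragma omp parallel
    #pragma omp single                   // one thread seeds the task tree
    correct(root);
}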


Book ChapterDOI
01 Oct 2015
TL;DR: This work presents the infrastructure for device support in the ompi research compiler, one of the few compilers that currently implement the new device directives, and discusses the necessary compiler transformations and the general runtime organization.
Abstract: OpenMP 4.0 represents a major upgrade in the language specifications of the standard. Important constructs for the exploitation of SIMD parallelism, the support for dependencies among tasks and the ability to cancel the operations of a team of threads have been added. What is arguably the most important addition, however, is the introduction of the device model. A variety of computational units, such as GPUs, DSPs and general- or special-purpose accelerators, are viewed as attached devices, where portions of a unified application code can be offloaded for execution. In this work we present the infrastructure for device support in the ompi research compiler, one of the few compilers that currently implement the new device directives. We discuss the necessary compiler transformations and the general runtime organization. For the first time, special emphasis is placed on the important problem of data environment handling. In addition, we present a prototype implementation on the popular Parallella board which exploits the dual-core ARM host processor and the 16-core Epiphany accelerator of the system.

6 citations
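The data-environment problem can be illustrated with a standard OpenMP 4.0 sketch (host_inspect() is an assumed host-side hook, not ompi-specific code): an enclosing target data region keeps the array resident on the device across two kernels, and target update resynchronizes it with the host in between.

void host_inspect(double *x, int n);     // assumed host-side hook

void two_kernels(int n, double *x) {
    #pragma omp target data map(tofrom: x[0:n])   // one device data environment
    {
        #pragma omp target teams distribute parallel for
        for (int i = 0; i < n; ++i) x[i] *= 2.0;

        #pragma omp target update from(x[0:n])    // host sees the intermediate data
        host_inspect(x, n);
        #pragma omp target update to(x[0:n])      // push host changes back

        #pragma omp target teams distribute parallel for
        for (int i = 0; i < n; ++i) x[i] += 1.0;
    }
}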


Book ChapterDOI
01 Oct 2015
TL;DR: The experience shows that the existing OpenMP Accelerator Model can effectively help programmers leverage accelerators; however, complex data types and non-canonical control structures can pose challenges for programmers to productively apply accelerator directives.
Abstract: The Department of Energy has a wide range of large-scale, parallel scientific applications running on cutting-edge high-performance computing systems to support its mission and tackle critical science challenges. A recent trend in these high-performance computing systems is to add commodity accelerators, such as Nvidia GPUs and Intel Xeon Phi coprocessors, into compute nodes so that increased performance can be achieved without exceeding the limited power budget. However, it is well known in the high-performance computing community that porting existing applications to accelerators is a difficult task given the numerous unique hardware features and the general complexity of software. In this paper, we share our experiences of using the OpenMP Accelerator Model to port two stencil applications to exploit Nvidia GPUs. Introduced as part of the OpenMP 4.0 specification, the OpenMP Accelerator Model provides a set of directives for users to specify semantics related to accelerators so that compilers and runtime systems can automatically handle repetitive and error-prone accelerator programming tasks, including code transformations, work scheduling, data management, reduction, and so on. Using a prototype compiler implementation based on the ROSE source-to-source compiler framework, we report the problems we encountered during the porting process, our solutions, and the obtained performance. Productivity is also evaluated. Our experience shows that the existing OpenMP Accelerator Model can effectively help programmers leverage accelerators. However, complex data types and non-canonical control structures can pose challenges for programmers to productively apply accelerator directives.

6 citations
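A minimal stencil offload in the style the paper describes (an illustrative 2-D Jacobi-like kernel with canonical loops; not one of the paper's applications):

void jacobi_step(int n, const double *in, double *out) {
    #pragma omp target map(to: in[0:n*n]) map(from: out[0:n*n])
    #pragma omp teams distribute parallel for collapse(2)
    for (int i = 1; i < n - 1; ++i)
        for (int j = 1; j < n - 1; ++j)
            out[i*n + j] = 0.25 * (in[(i-1)*n + j] + in[(i+1)*n + j]
                                 + in[i*n + j-1]   + in[i*n + j+1]);
}

Canonical nests like this map cleanly onto the directives; the complex data types and non-canonical control flow mentioned above are precisely what this pattern lacks.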


Book ChapterDOI
01 Oct 2015
TL;DR: This paper investigates the applicability of OpenMP, the dominant shared-memory parallel programming model in high-performance computing, to the domain of data analytics and shows that OpenMP outperforms the Phoenix++ system by a large margin for several benchmarks.
Abstract: As data analytics grows in importance, it is also quickly becoming one of the dominant application domains that require parallel processing. This paper investigates the applicability of OpenMP, the dominant shared-memory parallel programming model in high-performance computing, to the domain of data analytics. We contrast the performance and programmability of key data analytics benchmarks against Phoenix++, a state-of-the-art shared-memory map/reduce programming system. Our study shows that OpenMP outperforms the Phoenix++ system by a large margin for several benchmarks. In other cases, however, the programming model lacks support for this application domain.
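A sketch of the map/reduce-style kernel at issue, written directly in OpenMP (an illustrative histogram; benchmark data and names are placeholders, and keys are assumed non-negative):

#include <cstddef>
#include <omp.h>
#include <vector>

std::vector<long> histogram(const std::vector<int> &keys, int nbuckets) {
    std::vector<long> global(nbuckets, 0);
    #pragma omp parallel
    {
        std::vector<long> local(nbuckets, 0);      // "map" phase: no sharing
        #pragma omp for nowait
        for (std::size_t i = 0; i < keys.size(); ++i)
            ++local[keys[i] % nbuckets];
        #pragma omp critical                        // "reduce" phase: merge buckets
        for (int b = 0; b < nbuckets; ++b)
            global[b] += local[b];
    }
    return global;
}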

Book ChapterDOI
01 Oct 2015
TL;DR: This work presents an extension to OpenMP that supports task-parallel reductions on task and taskgroup constructs to improve productivity and programmability, and explores issues for programmers and software vendors regarding programming transparency as well as the impact on the current standard.
Abstract: Reductions represent a common algorithmic pattern in many scientific applications. OpenMP* has always supported them on parallel and worksharing constructs. OpenMP 3.0’s tasking constructs enable new parallelization opportunities through the annotation of irregular algorithms. Unfortunately, the tasking model does not easily allow the expression of concurrent reductions, which limits the general applicability of the programming model to such algorithms. In this work, we present an extension to OpenMP that supports task-parallel reductions on task and taskgroup constructs to improve productivity and programmability. We present the specification of the feature and explore issues for programmers and software vendors regarding programming transparency as well as the impact on the current standard with respect to nesting, untied task support and task data dependencies. Our performance evaluation demonstrates results comparable to hand-coded task reductions.
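For flavor, a sketch using the task-reduction syntax that OpenMP later standardized in version 5.0, which differs in detail from the clauses proposed in this paper (node is an assumed linked-list type):

struct node { long value; node *next; };

long sum_list(node *head) {
    long sum = 0;
    #pragma omp parallel
    #pragma omp single
    #pragma omp taskgroup task_reduction(+: sum)   // reduction scope
    for (node *p = head; p; p = p->next) {
        #pragma omp task in_reduction(+: sum) firstprivate(p)
        sum += p->value;                            // each task contributes
    }
    return sum;                                     // taskgroup joins all tasks
}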

Book ChapterDOI
01 Oct 2015
TL;DR: A systematic mechanism for exception handling with the co-use of OpenMP directives, based on a Java implementation of OpenMP, is proposed, along with a flexible approach to thread cancellation (as an extension of OpenMP directives) that supports this exception handling within parallel execution.
Abstract: OpenMP has become increasingly prevalent due to the simplicity it offers to elegantly and incrementally introduce parallelism. However, it still lacks some high-level language features that are essential in object-oriented programming. One such mechanism is exception handling. In languages such as Java, the concept of exception handling has been an integral aspect of the language since the first release. For OpenMP to be truly embraced within this object-oriented community, essential object-oriented concepts such as exception handling need to be given some attention. The official OpenMP standard has little specification on error recovery, as the challenges of supporting exception-based error recovery in OpenMP extend to both the semantic specifications and the related runtime support. This paper proposes a systematic mechanism for exception handling with the co-use of OpenMP directives, based on a Java implementation of OpenMP. The concept of exception handling with OpenMP directives has been formalized and categorized. Hand in hand with this exception-handling proposal, a flexible approach to thread cancellation is also proposed (as an extension of OpenMP directives) that supports this exception handling within parallel execution. The runtime support and its implementation are discussed. The evaluation shows that, while no prominent overhead is introduced, the new approach provides a more elegant coding style that increases parallel development efficiency and software robustness.

Book ChapterDOI
01 Oct 2015
TL;DR: This paper proposes mechanisms to improve support for code-passing abstractions in OpenMP, with a particular focus on device constructs and the aggregation and passing of OpenMP state through base language abstractions.
Abstract: Code-passing abstractions based on lambdas and blocks are becoming increasingly popular to capture repetitive patterns that are amenable to parallelization. These abstractions improve code maintainability and simplify choosing from a range of mechanisms to implement parallelism. Several frameworks that use this model, including RAJA and Kokkos, employ OpenMP as one of their target parallel models. However, OpenMP inadequately supports the abstraction since it frequently requires information that is not available within the abstraction. Thus, OpenMP requires access to variables and parameters not directly supplied by the base language. This paper explores the issues with supporting these abstractions in OpenMP, with a particular focus on device constructs and the aggregation and passing of OpenMP state through base-language abstractions. We propose mechanisms to improve support for these abstractions and also to reduce the burden of duplication in existing OpenMP applications.
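The pattern and the gap can be sketched as follows (forall is an illustrative stand-in, not the RAJA or Kokkos API): the OpenMP pragma lives inside the abstraction, and the loop body arrives as a lambda.

#include <cstddef>

template <typename Body>
void forall(std::size_t n, Body body) {
    #pragma omp parallel for          // the only place a pragma appears
    for (std::size_t i = 0; i < n; ++i)
        body(i);
}

// The caller never writes a pragma, so any OpenMP state it wants to
// control (target device, schedule, data mapping) must somehow pass
// through forall -- the gap this paper addresses.
void scale(double *x, std::size_t n, double a) {
    forall(n, [=](std::size_t i) { x[i] *= a; });
}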

Book ChapterDOI
01 Oct 2015
TL;DR: It is found that either speculative execution option always outperforms the other two modes in terms of their convergence characteristics, and the TSX options are very competitive when it comes to runtime performance measured with the “time-to-convergence” criterion introduced in [8].
Abstract: In this paper we continue our investigations started in [8] into the effects of using different synchronization mechanisms in OpenMP-threaded iterative mesh optimization algorithms. We port our test code to the Intel® Xeon® processor (former codename “Haswell”) by employing a user-guided locking API for OpenMP [4] that provides a general and unified user interface and runtime framework. Since the Intel® Transactional Synchronization Extensions (TSX) provide two different options for speculation — Hardware Lock Elision (HLE) and Restricted Transactional Memory (RTM) — we compare a total of four different run modes: (i) HLE, (ii) RTM, (iii) OpenMP critical, and (iv) “unsynchronized”. As we did in [8], we find that either speculative execution option always outperforms the other two modes in terms of their convergence characteristics. Even with their higher overhead, the TSX options are very competitive when it comes to runtime performance measured with the “time-to-convergence” criterion introduced in [8].
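A sketch using the hinted-lock API standardized in OpenMP 4.5, which is in the spirit of the user-guided locking API [4] the paper employs (relocate() is an assumed mesh-smoothing kernel, not the paper's code):

#include <omp.h>

void relocate(int v);    // assumed mesh-smoothing kernel

omp_lock_t vlock;        // protects the shared mesh region

void init() {
    // The hint lets the runtime back the lock with HLE/RTM speculation.
    omp_init_lock_with_hint(&vlock, omp_lock_hint_speculative);
}

void move_vertex(int v) {
    omp_set_lock(&vlock);      // may execute speculatively under TSX
    relocate(v);
    omp_unset_lock(&vlock);
}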

Book ChapterDOI
01 Oct 2015
TL;DR: It is shown that in order to improve the efficiency of loop scheduling strategies, one must adapt the loop scheduling strategies so as to handle all overheads simultaneously.
Abstract: Many different sources of overhead impact the efficiency of a scheduling strategy applied to a parallel loop within a scientific application. In prior work, we handled these overheads using multiple loop scheduling strategies, with each scheduling strategy focusing on mitigating a subset of the overheads. However, mitigating the impact of one source of overhead can lead to an increase in the impact of another source of overhead, and vice versa. In this work, we show that in order to improve the efficiency of loop scheduling strategies, one must adapt the loop scheduling strategies so as to handle all overheads simultaneously. To show this, we describe a composition of our existing loop scheduling strategies, and experiment with the composed scheduling strategy on standard benchmarks and application codes. Applying the composed scheduling strategy to three MPI+OpenMP scientific codes run on a cluster of SMPs improves performance by an average of 31 % over standard OpenMP static scheduling.
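The baseline trade-off such compositions navigate, in its simplest standard-OpenMP form (kernel() is an assumed routine with iteration-dependent cost):

double kernel(int i);    // assumed kernel with iteration-dependent cost

void run(int n, double *a) {
    #pragma omp parallel for schedule(static)       // lowest scheduling overhead,
    for (int i = 0; i < n; ++i) a[i] = kernel(i);   // but no load balancing

    #pragma omp parallel for schedule(dynamic, 16)  // balances load, but pays
    for (int i = 0; i < n; ++i) a[i] = kernel(i);   // a per-chunk dispatch cost
}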

Book ChapterDOI
01 Oct 2015
TL;DR: Changes to the OpenMP specification that would allow implementations to merge adjacent parallel regions automatically are explored, including the removal of issues that make the transformation non-conforming and the addition of hints that facilitate the optimization.
Abstract: Maximizing the scope of a parallel region, which avoids the costs of barriers and of launching additional parallel regions, is among the first recommendations in any optimization guide for OpenMP. While clearly beneficial and easily accomplished for code where regions are visibly contiguous, regions often become contiguous only after compiler optimization or resolution of abstraction layers. This paper explores changes to the OpenMP specification that would allow implementations to merge adjacent parallel regions automatically, including the removal of issues that make the transformation non-conforming and the addition of hints that facilitate the optimization. Beyond simple merging, we explore hints to fuse workshared loops that occur in syntactically distinct parallel regions or to apply nowait to such loops. Our evaluation shows these changes can provide an overall speedup of 2–8× for a microbenchmark, or 6 % for a representative physics application.
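The merging transformation can be illustrated with a standard-OpenMP sketch (illustrative loops, not the paper's benchmark): the merged form launches one region instead of two and can drop the final barrier.

void before(int n, double *a, double *b) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) a[i] = 2.0 * b[i];
    #pragma omp parallel for                          // second region launch
    for (int i = 0; i < n; ++i) b[i] = a[i] + 1.0;
}

void after(int n, double *a, double *b) {
    #pragma omp parallel                              // one region for both loops
    {
        #pragma omp for              // implicit barrier preserves the a->b dependence
        for (int i = 0; i < n; ++i) a[i] = 2.0 * b[i];
        #pragma omp for nowait       // last loop: the region-end barrier suffices
        for (int i = 0; i < n; ++i) b[i] = a[i] + 1.0;
    }
}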

Book ChapterDOI
01 Oct 2015
TL;DR: The experiences and issues found while implementing an OMPD library prototype for a commonly used OpenMP runtime and a parallel debugger are presented.
Abstract: With complex codes moving to systems of increasing on-node parallelism using OpenMP, debugging these codes is becoming increasingly challenging. While debuggers can significantly aid programmers, existing ones support OpenMP only at a low system-thread level, reducing their effectiveness. The previously published draft for a standard OpenMP debugging interface (OMPD) is intended to enable debuggers to raise their debugging abstraction to the conceptual levels of OpenMP by mediating between the tools and the OpenMP runtime library. In this paper, we present our experiences and the issues that we have found while implementing an OMPD library prototype for a commonly used OpenMP runtime and a parallel debugger.

Book ChapterDOI
01 Oct 2015
TL;DR: This work discusses several parallelization methods for multi-level hierarchical SMP systems using a stencil-based finite difference code and suggests OpenMP runtime improvements.
Abstract: We discuss several parallelization methods for multi-level hierarchical SMP systems using a stencil-based finite difference code. Performance comparisons and suggestions for OpenMP runtime improvements are provided.