
Showing papers presented at "International Workshop on OpenMP in 2011"


Book Chapter DOI
13 Jun 2011
TL;DR: This paper presents extensions to OpenMP that offer a high-level programming model whose accelerated performance is comparable to that of hand-coded CUDA implementations.
Abstract: OpenMP [14] is the dominant programming model for shared-memory parallelism in C, C++ and Fortran due to its easy-to-use directive-based style, portability and broad support by compiler vendors. Compute-intensive application regions are increasingly being accelerated using devices such as GPUs and DSPs, and a programming model with similar characteristics is needed here. This paper presents extensions to OpenMP that provide such a programming model. Our results demonstrate that a high-level programming model can provide accelerated performance comparable to that of hand-coded implementations in CUDA.
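The paper's proposed directive syntax is not reproduced in this listing; purely as a hedged illustration of the style of directive-based offload it argues for, the sketch below uses the target constructs that OpenMP later standardized (4.0) on a simple SAXPY loop.

#include <stdio.h>

#define N 1048576

int main(void)
{
    static float x[N], y[N];
    const float a = 2.0f;

    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* Map the arrays to the device and run the SAXPY loop there;
     * without an attached accelerator the region falls back to the host. */
    #pragma omp target map(to: x) map(tofrom: y)
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];

    printf("y[0] = %f\n", y[0]);
    return 0;
}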

53 citations


Book Chapter DOI
13 Jun 2011
TL;DR: ompVerify, as discussed by the authors, is a static analysis tool for OpenMP programs, integrated into the standard open-source Eclipse IDE, that detects data-race errors in OpenMP parallel loops by flagging incorrectly specified omp parallel for directives.
Abstract: We describe a static analysis tool for OpenMP programs integrated into the standard open-source Eclipse IDE. It can detect an important class of common data-race errors in OpenMP parallel loop programs by flagging incorrectly specified omp parallel for directives and data races. The analysis is based on the polyhedral model and covers a class of program fragments called Affine Control Loops (ACLs, or alternatively, Static Control Parts, SCoPs). ompVerify automatically extracts such ACLs from an input C program and then reports any errors to the user as specific and precise error messages. We illustrate the power of our techniques through a number of simple but non-trivial examples with subtle parallelization errors that are difficult to detect, even for expert OpenMP programmers.
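As a hedged illustration (not taken from the paper), the fragment below shows the kind of subtly incorrect omp parallel for that such a tool is designed to flag: the scalar temporary is shared by default, so the iterations race on it.

#include <stdio.h>

#define N 1000

int main(void)
{
    double a[N], b[N];
    double tmp;                       /* shared by default: every iteration races on it */

    for (int i = 0; i < N; i++) b[i] = i;

    #pragma omp parallel for          /* incorrect: needs private(tmp) */
    for (int i = 0; i < N; i++) {
        tmp = 2.0 * b[i];
        a[i] = tmp + 1.0;
    }

    printf("a[10] = %f\n", a[10]);    /* may be wrong because of the race */
    return 0;
}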

46 citations


Book Chapter DOI
13 Jun 2011
TL;DR: This paper provides an overview of one OpenMP tasking implementation, highlights its main features, discusses the implementation, and demonstrates its performance with user-controlled runtime variables.
Abstract: Many task-based programming models have been developed and refined in recent years to support application development for shared memory platforms. Asynchronous tasks are a powerful programming abstraction that offers flexibility in conjunction with great expressivity. Research involving standardized tasking models like OpenMP and non-standardized models like Cilk facilitates improvements in many tasking implementations. While the asynchronous task is arguably a fundamental element of parallel programming, it is the implementation, not the concept, that makes all the difference with respect to the performance obtained by a program parallelized using tasks. There are many approaches to implementing tasking constructs, but few have also given attention to providing the user with capabilities for fine-tuning the execution of their code. This paper provides an overview of one OpenMP implementation, highlights its main features, discusses the implementation, and demonstrates its performance with user-controlled runtime variables.
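As a minimal, hedged sketch of the standard tasking constructs whose implementation the paper examines (the implementation-specific tuning variables it describes are not listed here), a recursive Fibonacci illustrates task creation and synchronization:

#include <stdio.h>

static long fib(int n)
{
    long x, y;
    if (n < 2) return n;

    #pragma omp task shared(x)        /* child task computes fib(n-1) */
    x = fib(n - 1);
    #pragma omp task shared(y)        /* child task computes fib(n-2) */
    y = fib(n - 2);
    #pragma omp taskwait              /* wait for both children */

    return x + y;
}

int main(void)
{
    long r = 0;
    #pragma omp parallel
    #pragma omp single                /* one thread creates the root of the task tree */
    r = fib(20);
    printf("fib(20) = %ld\n", r);
    return 0;
}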

20 citations


Book Chapter DOI
13 Jun 2011
TL;DR: This work implements an empirical tuning framework and proposes an algorithm that partitions programs into sections and tunes each code section individually; it is one of the first approaches to deliver an auto-parallelization system that guarantees performance improvements for nearly all programs.
Abstract: Automatic parallelization combined with tuning techniques is an alternative to manual parallelization of sequential programs to exploit the increased computational power that current multi-core systems offer. Automatic parallelization concentrates on finding any possible parallelism in the program, whereas tuning systems help identify efficient parallel code segments and serialize inefficient ones using runtime performance metrics. In this work we study the performance gap between automatically and hand-parallelized OpenMP applications and try to determine whether this gap can be closed by compile-time techniques or whether it requires dynamic or user-interactive solutions. We implement an empirical tuning framework and propose an algorithm that partitions programs into sections and tunes each code section individually. Experiments show that tuned applications perform better than the original serial programs in the worst case and sometimes outperform hand-parallelized applications. Our work is one of the first approaches delivering an auto-parallelization system that guarantees performance improvements for nearly all programs; hence it eliminates the need for users to "experiment" with such tools in order to obtain the shortest runtime of their applications.
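As a hedged illustration of one kind of decision such a tuner makes (this is not the paper's framework itself), the standard if() clause can serialize a code section that profiling showed to be unprofitable below some size; the threshold value here is hypothetical.

#include <stdio.h>

#define THRESHOLD 5000   /* hypothetical cut-over point found by empirical tuning */

void scale(double *a, int n, double s)
{
    /* Run in parallel only when the tuner found it profitable. */
    #pragma omp parallel for if(n > THRESHOLD)
    for (int i = 0; i < n; i++)
        a[i] *= s;
}

int main(void)
{
    static double a[100000];
    for (int i = 0; i < 100000; i++) a[i] = i;
    scale(a, 100000, 0.5);
    printf("a[7] = %f\n", a[7]);
    return 0;
}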

17 citations


Book Chapter DOI
13 Jun 2011
TL;DR: An extension to the Thread-Local Storage (TLS) mechanism is proposed to support data placement in the thread-based MPI model and data visibility in nested hybrid MPI/OpenMP applications.
Abstract: With the advent of the multicore era, the architecture of supercomputers in HPC (High-Performance Computing) is evolving to integrate larger computational nodes with an increasing number of cores. This change is driving the evolution of the parallel programming models currently used by scientific applications. Multiple approaches advocate the use of thread-based programming models. One direction is the exploitation of the thread-based MPI programming model mixed with OpenMP, leading to hybrid applications. But mixing parallel programming models requires fine-grained management of data placement and visibility. Indeed, every model includes extensions to privatize some variable declarations, i.e., to create a small amount of storage only accessible by one task or thread. This article proposes an extension to the Thread-Local Storage (TLS) mechanism to support data placement in the thread-based MPI model and data visibility in nested hybrid MPI/OpenMP applications.
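The paper's extended TLS levels are its own proposal and are not reproduced here; as a baseline, the sketch below shows the standard OpenMP threadprivate privatization the abstract alludes to, which gives each thread its own copy of a global variable.

#include <stdio.h>
#include <omp.h>

int counter = 0;                          /* one copy per OpenMP thread */
#pragma omp threadprivate(counter)

int main(void)
{
    #pragma omp parallel
    {
        counter = omp_get_thread_num();   /* each thread writes only its own copy */
        #pragma omp critical
        printf("thread %d sees counter = %d\n", omp_get_thread_num(), counter);
    }
    return 0;
}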

12 citations


Book Chapter DOI
13 Jun 2011
TL;DR: This paper analyzes the performance of the hybrid (MPI+OpenMP) programming model in the context of an implicit unstructured mesh CFD code; the effects of cache locality, update management, work division, and synchronization frequency are studied.
Abstract: The complexity of programming modern multicore processor based clusters is rapidly rising, with GPUs adding further demand for fine-grained parallelism. This paper analyzes the performance of the hybrid (MPI+OpenMP) programming model in the context of an implicit unstructured mesh CFD code. At the implementation level, the effects of cache locality, update management, work division, and synchronization frequency are studied. The hybrid model presents interesting algorithmic opportunities as well: the convergence of the linear system solver is quicker than in the pure MPI case, since the parallel preconditioner stays stronger when the hybrid model is used. This implies significant savings in the cost of communication and synchronization (explicit and implicit). Even though OpenMP-based parallelism is easier to implement (within a subdomain assigned to one MPI process for simplicity), getting good performance requires attention to data partitioning issues similar to those in the message-passing case.
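The CFD code itself is not shown here; the sketch below is only a minimal hybrid MPI+OpenMP skeleton of the model the paper evaluates, with one MPI process per subdomain, OpenMP threads inside it, and MPI calls funneled through the master thread. The array and the reduction are placeholders.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 100000

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    static double u[N];
    double local = 0.0, global = 0.0;

    /* Threaded work inside this rank's subdomain. */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < N; i++) {
        u[i] = (double)(rank + i);
        local += u[i] * u[i];
    }

    /* Communication is performed by the master thread only (FUNNELED). */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global norm^2 = %e\n", global);

    MPI_Finalize();
    return 0;
}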

10 citations


Book Chapter DOI
13 Jun 2011
TL;DR: This paper proposes two new synchronization constructs in the OpenMP programming model, thread-level phasers and iteration-level phasers, to support various synchronization patterns such as point-to-point synchronization and sub-group barriers with neighbor threads.
Abstract: OpenMP is a widely used standard for parallel programming on a broad range of SMP systems. In the OpenMP programming model, synchronization points are specified by implicit or explicit barrier operations. However, certain classes of computations such as stencil algorithms need to specify synchronization only among particular tasks/threads so as to support pipeline parallelism with better synchronization efficiency and data locality than wavefront parallelism using all-to-all barriers. In this paper, we propose two new synchronization constructs in the OpenMP programming model, thread-level phasers and iteration-level phasers, to support various synchronization patterns such as point-to-point synchronization and sub-group barriers with neighbor threads. Experimental results on three platforms using numerical applications show performance improvements of phasers over OpenMP barriers of up to 1.74× on an 8-core Intel Nehalem system, up to 1.59× on a 16-core Core-2-Quad system and up to 1.44× on a 32-core IBM Power7 system. It is reasonable to expect larger increases on future manycore processors.
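The phaser constructs themselves are the paper's proposal and are not part of standard OpenMP, so the sketch below shows only the baseline they aim to improve on: a time-stepped 1-D stencil in which every step synchronizes all threads through the barriers implied by the worksharing loops, even though each chunk depends only on its two neighbors.

#include <stdio.h>

#define N 4096
#define STEPS 100

int main(void)
{
    static double a[N], b[N];
    for (int i = 0; i < N; i++) a[i] = (i == N / 2) ? 1.0 : 0.0;

    #pragma omp parallel
    for (int t = 0; t < STEPS; t++) {
        #pragma omp for                       /* implicit all-to-all barrier at the end */
        for (int i = 1; i < N - 1; i++)
            b[i] = 0.5 * a[i] + 0.25 * (a[i - 1] + a[i + 1]);

        #pragma omp for                       /* second all-to-all barrier per time step */
        for (int i = 1; i < N - 1; i++)
            a[i] = b[i];
    }

    printf("a[N/2] = %f\n", a[N / 2]);
    return 0;
}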

9 citations


Book Chapter DOI
13 Jun 2011
TL;DR: This work focuses on the OpenMP parallelisation of an iterative linear equation solver and parallel usage of an explicit solver for the nonlinear phase-field equations in microstructure evolution simulations based on the phase-field method.
Abstract: This work focuses on the OpenMP parallelisation of an iterative linear equation solver and parallel usage of an explicit solver for the nonlinear phase-field equations. Both solvers are used in microstructure evolution simulations based on the phase-field method. For the latter, we compare a graph-based solution using OpenMP tasks to first-come-first-served scheduling using an OpenMP critical section. We discuss how the task solution might benefit from the introduction of OpenMP task dependencies. The concepts are implemented in the software MICRESS, which is mainly used by materials engineers for the simulation of the evolving material microstructure during processing.
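As a hedged sketch of the task-dependency mechanism the authors suggest their graph-based solution could benefit from (the depend clause was standardized later, in OpenMP 4.0), the fragment below chains block updates through in/out dependences; the block array and its update rule are placeholders, not MICRESS code.

#include <stdio.h>

#define NBLOCKS 8

int main(void)
{
    double block[NBLOCKS] = {0};

    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: block[0])
        block[0] = 1.0;

        for (int b = 1; b < NBLOCKS; b++) {
            /* Each block waits for its left neighbour: a simple chain standing
             * in for the solver's dependence graph. */
            #pragma omp task depend(in: block[b - 1]) depend(out: block[b])
            block[b] = block[b - 1] + 1.0;
        }
    }   /* the implicit barrier here also waits for all generated tasks */

    printf("block[%d] = %f\n", NBLOCKS - 1, block[NBLOCKS - 1]);
    return 0;
}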

8 citations


Book Chapter DOI
13 Jun 2011
TL;DR: This work proposes and implements means to query system information from within the program, so that expert users can take advantage of this knowledge, and demonstrates the usefulness of the approach with an application from the Fraunhofer Institute for Laser Technology in Aachen.
Abstract: Today most multi-socket shared memory systems exhibit a nonuniform memory architecture (NUMA). However, programming models such as OpenMP do not provide explicit support for it. To overcome this limitation, we propose a platform-independent approach to describe the system topology and to place threads on the hardware. A distance matrix provides system information and is used to allow thread binding with user-defined strategies. We propose and implement means to query this information from within the program, so that expert users can take advantage of this knowledge, and demonstrate the usefulness of our approach with an application from the Fraunhofer Institute for Laser Technology in Aachen.
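The distance-matrix query and binding API is the paper's own proposal and is not reproduced here; purely as an illustration of topology-aware placement, the sketch below uses the controls OpenMP standardized later (proc_bind and the places runtime queries from OpenMP 4.0/4.5).

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Example environment: OMP_PLACES=cores OMP_NUM_THREADS=8 ./a.out */
    #pragma omp parallel proc_bind(spread)
    {
        #pragma omp critical
        printf("thread %d runs on place %d of %d\n",
               omp_get_thread_num(), omp_get_place_num(), omp_get_num_places());
    }
    return 0;
}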

7 citations


Book Chapter DOI
13 Jun 2011
TL;DR: Experiments with a prototype implementation on the Cell Broadband Engine show the benefit of allowing OpenMP teams to be created across the different elements of a heterogeneous architecture.
Abstract: Modern architectures are becoming more heterogeneous. OpenMP currently has no mechanism for assigning work to specific parts of these heterogeneous architectures. We propose a combination of thread mapping and subteams as a means to give programmers control over how work is allocated on these architectures. Experiments with a prototype implementation on the Cell Broadband Engine show the benefit of allowing OpenMP teams to be created across the different elements of a heterogeneous architecture.

6 citations


Book Chapter DOI
13 Jun 2011
TL;DR: This paper studies the performance issues of OpenMP execution on virtualized multicore systems, to quantify the performance deficit that virtualization imposes on OpenMP applications and to identify the reasons for the performance loss.
Abstract: Virtualization technology has been applied to a variety of areas including server consolidation, High Performance Computing, as well as Grid and Cloud computing. Because applications do not run directly on the hardware of a host machine, virtualization generally causes a performance loss for both sequential and parallel applications. This paper studies the performance issues of OpenMP execution on virtualized multicore systems. The goal of this study is to quantify the performance deficit caused by virtualization of OpenMP applications and further to identify the reasons for the performance loss. The results of the investigation are expected to guide the optimization of virtualization technologies as well as the applications.

Book Chapter DOI
13 Jun 2011
TL;DR: The functionality of a collector-based dynamic optimization framework called DARWIN is demonstrated; it uses collected performance data as feedback to affect the behavior of the program through the OpenMP runtime and is thus able to optimize certain aspects of execution.
Abstract: Developing shared memory parallel programs using OpenMP is straightforward, but getting good performance in terms of speedup and scalability can be difficult. This paper demonstrates the functionality of a collector-based dynamic optimization framework called DARWIN that uses collected performance data as feedback to affect the behavior of the program through the OpenMP runtime, and is thus able to optimize certain aspects of execution. The DARWIN framework utilizes the OpenMP Collector API to drive the optimization activity and various open-source libraries to support its data collection and optimizations.

Book Chapter DOI
Mark Woodyard
13 Jun 2011
TL;DR: An experimental OpenMP performance analysis model has been developed to give insight into many application scalability bottlenecks, and a case study shows how the tool helped diagnose performance loss caused by OpenMP work distribution schedule strategies.
Abstract: Data centers are increasingly focused on optimal use of resources. For technical computing environments, with compute-dominated workloads, we can increase data center efficiencies by increasing multicore processor utilization. OpenMP programmers need assistance in better understanding efficiencies and scaling for both dedicated and throughput environments. An experimental OpenMP performance analysis model has been developed to give insight into many application scalability bottlenecks. A tool has been developed to implement the model. Compared to other performance analysis tools, this tool takes into account how the operating system scheduler affects OpenMP threaded application performance. Poor parallel scalability can result in reduced system utilization. A case study shows how the tool helped diagnose performance loss caused by OpenMP work distribution schedule strategies. Changing the work distribution schedule substantially improved application performance and system utilization. This tool is specific to Solaris and Studio compilers, although the performance model is applicable to other OpenMP compilers, Linux and UNIX systems.