
Showing papers presented at "Parallel and Distributed Processing Techniques and Applications in 2013"


Proceedings Article
01 Jan 2013
TL;DR: This work presents an extension of SkePU for GPU clusters that requires no modification of the SkePU application source code, and shows the benefit of lazy memory copying in terms of the speedup gained for one level of Strassen's algorithm and for a synthetic matrix sum application.
Abstract: SkePU is a C++ template library with a simple and unified interface for expressing data parallel computations in terms of generic components, called skeletons, on multi-GPU systems using CUDA and OpenCL. The smart containers in SkePU, such as Matrix and Vector, perform data management with a lazy memory copying mechanism that reduces redundant data communication. SkePU provides programmability, portability and even performance portability, but until now applications written using SkePU could only run on a single multi-GPU node. We present an extension of SkePU for GPU clusters that requires no modification of the SkePU application source code. With our prototype implementation, we performed two experiments. The first demonstrates scalability with regular algorithms for N-body simulation and electric field calculation over multiple GPU nodes. The results of the second experiment show the benefit of lazy memory copying in terms of the speedup gained for one level of Strassen's algorithm and for a synthetic matrix sum application.
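
The smart-container idea is easiest to see in code. Below is a minimal, self-contained C++ sketch of a map skeleton over a container that copies data to the device only when the device copy is stale; the names (LazyVector, map_skeleton) and the host-side "device" are illustrative stand-ins, not SkePU's actual API.

```cpp
// Hypothetical sketch of a data-parallel "map" skeleton with lazy copying.
// LazyVector and map_skeleton are illustrative, not SkePU's real interface.
#include <cstddef>
#include <iostream>
#include <vector>

// A container that tracks whether its device copy is up to date, so a
// transfer happens only when the data is actually stale on the device.
template <typename T>
class LazyVector {
public:
    explicit LazyVector(std::size_t n) : host_(n), device_valid_(false) {}

    // Called before a device-side skeleton runs: copy only if needed.
    void ensure_on_device() {
        if (!device_valid_) {
            // (a real backend would call cudaMemcpy / clEnqueueWriteBuffer)
            std::cout << "copying " << host_.size() << " elements to device\n";
            device_valid_ = true;
        }
    }

    // A host-side write invalidates the device copy.
    T& operator[](std::size_t i) { device_valid_ = false; return host_[i]; }
    const T& at(std::size_t i) const { return host_[i]; } // read-only access
    std::size_t size() const { return host_.size(); }

private:
    std::vector<T> host_;
    bool device_valid_; // true if the device copy matches the host data
};

// A binary "map" skeleton: applies f element-wise. A real multi-GPU backend
// would launch a kernel; here the loop runs on the host for illustration.
template <typename T, typename F>
void map_skeleton(LazyVector<T>& out, LazyVector<T>& a, LazyVector<T>& b, F f) {
    a.ensure_on_device();
    b.ensure_on_device();
    for (std::size_t i = 0; i < out.size(); ++i)
        out[i] = f(a.at(i), b.at(i));
}

int main() {
    LazyVector<float> x(4), y(4), sum(4);
    for (std::size_t i = 0; i < 4; ++i) { x[i] = float(i); y[i] = 2.0f * i; }
    map_skeleton(sum, x, y, [](float u, float v) { return u + v; });
    map_skeleton(sum, x, y, [](float u, float v) { return u + v; }); // no re-copy
    std::cout << sum.at(3) << "\n"; // prints 9
}
```

The second skeleton call triggers no transfer because neither input was written between calls; this is the redundant-communication saving the abstract attributes to lazy memory copying.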

14 citations


Proceedings ArticleDOI
22 Jul 2013
Abstract: This research has been supported by MICINN-Spain under contracts TIN2011-28689-C02-01, TIN2012-38876-C02-01 and the Generalitat of Catalunya (2009-SGR-1434).

8 citations


Proceedings Article
09 Apr 2013
TL;DR: A benchmark-driven model is developed for a simple shallow water code on a Cray XE6 system, to explore how deployment choices such as domain decomposition and core affinity affect performance.
Abstract: The complexity of current and emerging architectures provides users with options about how best to use the available resources, but makes predicting performance challenging. In this work a benchmark-driven model is developed for a simple shallow water code on a Cray XE6 system, to explore how deployment choices such as domain decomposition and core affinity affect performance. The resource sharing present in modern multi-core architectures adds various levels of heterogeneity to the system. Shared resources often include cache, memory, network controllers and in some cases floating point units (as in the AMD Bulldozer), which means that access time depends on the mapping of application tasks and on each core's location within the system. Heterogeneity increases further with the use of hardware accelerators such as GPUs and the Intel Xeon Phi, where many specialist cores are attached to general-purpose cores. This trend towards shared resources and non-uniform cores is expected to continue into the exascale era. The complexity of these systems means that various runtime scenarios are possible, and it has been found that under-populating nodes, altering the domain decomposition and using non-standard task-to-core mappings can dramatically alter performance. Finding this out, however, is often a process of trial and error. To better inform this process, a performance model was developed for a simple regular grid-based kernel code, shallow. The code comprises two distinct types of work: loop-based array updates and nearest-neighbour halo exchanges. Separate performance models were developed for each part, both based on a similar methodology. Application-specific benchmarks were run to measure performance for different problem sizes under different execution scenarios. These results were then fed into a performance model that derives resource usage for a given deployment scenario, interpolating between results as necessary.
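
To make the two-part model concrete, here is a hedged C++ sketch: per-step time is predicted as the sum of a loop-update term and a halo-exchange term, each linearly interpolated from benchmark samples. The sample values and the piecewise-linear form are assumptions for illustration, not the paper's measured data or exact methodology.

```cpp
// Illustrative two-part performance model: predicted runtime =
// array-update time + halo-exchange time, each interpolated from
// application-specific benchmark measurements (values here are made up).
#include <algorithm>
#include <cstdio>
#include <vector>

struct Sample { double size; double seconds; }; // one benchmark point

// Piecewise-linear interpolation between measured benchmark points,
// clamped at the ends of the measured range.
double interp(const std::vector<Sample>& s, double size) {
    if (size <= s.front().size) return s.front().seconds;
    if (size >= s.back().size)  return s.back().seconds;
    auto hi = std::lower_bound(s.begin(), s.end(), size,
        [](const Sample& a, double v) { return a.size < v; });
    auto lo = hi - 1;
    double t = (size - lo->size) / (hi->size - lo->size);
    return lo->seconds + t * (hi->seconds - lo->seconds);
}

int main() {
    // Hypothetical measurements for one deployment scenario.
    std::vector<Sample> update = {{1e4, 0.02}, {1e5, 0.21}, {1e6, 2.3}};
    std::vector<Sample> halo   = {{1e4, 0.01}, {1e5, 0.04}, {1e6, 0.15}};

    double local = 2.5e5; // local subdomain size after decomposition
    double predicted = interp(update, local) + interp(halo, local);
    std::printf("predicted time per step: %.3f s\n", predicted);
}
```

Changing the decomposition changes the local subdomain size and halo length, so re-evaluating the model for each candidate deployment replaces the trial-and-error process the abstract describes.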

2 citations


Proceedings Article
01 Jan 2013

2 citations


Proceedings Article
22 Jul 2013
TL;DR: This paper addresses issues in application resilience, i.e., fault-tolerance to algorithmic errors and to resource allocation failures, and overviews a platform used to deploy, execute, monitor, restart and resume distributed applications on grids and cloud infrastructures in case of unexpected behavior.
Abstract: Distributed computing infrastructures such as grids and clouds support system and network fault-tolerance. They transparently repair and prevent communication and system software errors. They also allow duplication and migration of jobs and data to prevent hardware failures. However, only limited work has been done so far on application resilience, i.e., the ability to resume normal execution after errors and abnormal executions in distributed environments. This paper addresses issues in application resilience, i.e., fault-tolerance to algorithmic errors and to resource allocation failures. It presents solutions for error detection and management, and overviews a platform used to deploy, execute, monitor, restart and resume distributed applications on grid and cloud infrastructures in case of unexpected behavior.
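
As a rough illustration of the restart-and-resume behaviour such a platform provides, the C++ sketch below re-launches a failed job from its last checkpoint a bounded number of times. The ./job command and --resume flag are placeholders, not the paper's platform; a real deployment would also redeploy the task on another grid or cloud node.

```cpp
// Minimal sketch (assumed, not the paper's platform) of monitor/restart
// logic: run a job, detect abnormal exit, and resume it a bounded number
// of times before giving up.
#include <cstdio>
#include <cstdlib>

int main() {
    const int max_restarts = 3;
    for (int attempt = 0; attempt <= max_restarts; ++attempt) {
        // "./job" and "--resume" are placeholder names for illustration.
        int status = std::system(attempt == 0 ? "./job" : "./job --resume");
        if (status == 0) {
            std::puts("job finished normally");
            return 0;
        }
        std::printf("abnormal exit (status %d), restarting...\n", status);
    }
    std::puts("giving up after repeated failures");
    return 1;
}
```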

Proceedings Article
22 Jul 2013
TL;DR: A way to emulate a precise CPU frequency through DVFS management in virtualized environments is proposed, implemented, and evaluated in the Xen hypervisor.
Abstract: Nowadays, virtualization is present in almost all computing infrastructures. Thanks to VM migration and server consolidation, virtualization helps in reducing power consumption in distributed environments. In addition, Dynamic Voltage and Frequency Scaling (DVFS) allows servers to dynamically adjust the processor frequency (according to the CPU load) so as to consume less energy. We observe that while DVFS is widely used, it still wastes energy. By default, the ondemand governor scales the processor frequency up or down according to the current load and predefined up and down thresholds. However, DVFS scaling policies can only select among the advertised processor frequencies, which form a discrete set. The frequency chosen for a given load is therefore often higher than necessary, which wastes energy. In this paper, we propose a way to emulate a precise CPU frequency through DVFS management in virtualized environments. We implemented and evaluated our prototype in the Xen hypervisor.
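
One way to emulate a frequency between two advertised ones is to time-multiplex the two neighbouring frequencies so that their time-weighted average matches the target. The sketch below shows only that arithmetic; the paper's actual mechanism lives inside Xen's DVFS management, and the period and frequency values here are assumptions.

```cpp
// Sketch of the arithmetic behind emulating a precise average frequency:
// alternate between the two neighbouring advertised frequencies so that
// the time-weighted mean equals the target frequency.
#include <cstdio>

int main() {
    double f_low  = 1.6e9;  // advertised frequency below the target (Hz)
    double f_high = 2.0e9;  // advertised frequency above the target (Hz)
    double f_goal = 1.7e9;  // precise frequency required by the load (Hz)

    // Fraction of each period spent at f_high so that
    // alpha * f_high + (1 - alpha) * f_low == f_goal.
    double alpha = (f_goal - f_low) / (f_high - f_low);

    double period_ms = 100.0; // assumed switching period
    std::printf("%.1f ms at %.2f GHz, %.1f ms at %.2f GHz per period\n",
                alpha * period_ms, f_high / 1e9,
                (1.0 - alpha) * period_ms, f_low / 1e9);
}
```

With these assumed values the split is 25 ms at 2.0 GHz and 75 ms at 1.6 GHz per 100 ms period, averaging the requested 1.7 GHz instead of rounding up to 2.0 GHz and wasting energy.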