
Showing papers on "Software portability" published in 2012


Book
15 Feb 2012
TL;DR: This book provides a complete and comprehensive reference/guide to Pyomo (Python Optimization Modeling Objects) for both beginning and advanced modelers, including students at the undergraduate and graduate levels, academic researchers, and practitioners.
Abstract: This book provides a complete and comprehensive reference/guide to Pyomo (Python Optimization Modeling Objects) for both beginning and advanced modelers, including students at the undergraduate and graduate levels, academic researchers, and practitioners. The text illustrates the breadth of the modeling and analysis capabilities that are supported by the software and its support for complex real-world applications. Pyomo is an open source software package for formulating and solving large-scale optimization and operations research problems. The text begins with a tutorial on simple linear and integer programming models. A detailed reference of Pyomo's modeling components is illustrated with extensive examples, including a discussion of how to load data from data sources like spreadsheets and databases. Chapters describing advanced modeling capabilities for nonlinear and stochastic optimization are also included. The Pyomo software provides familiar modeling features within Python, a powerful dynamic programming language that has a very clear, readable syntax and intuitive object orientation. Pyomo includes Python classes for defining sparse sets, parameters, and variables, which can be used to formulate algebraic expressions that define objectives and constraints. Moreover, Pyomo can be used from a command-line interface and within Python's interactive command environment, which makes it easy to create Pyomo models, apply a variety of optimizers, and examine solutions. The software supports a different modeling approach than commercial AML (Algebraic Modeling Language) tools. It is designed for flexibility, extensibility, portability, and maintainability, but also maintains the central ideas in modern AMLs.

683 citations


Journal ArticleDOI
Efraim Rotem, Alon Naveh, Doron Rajwan, Avinash N. Ananthakrishnan, Eliezer Weissmann
TL;DR: This article describes power-management innovations introduced on Intel's Sandy Bridge microprocessor, and argues that an architectural approach that is adaptive to and cognizant of workload behavior and platform physical constraints is indispensable to meeting performance and efficiency goals.
Abstract: Modern microprocessors are evolving into system-on-a-chip designs with high integration levels, catering to ever-shrinking form factors. Portability without compromising performance is a driving market need. An architectural approach that's adaptive to and cognizant of workload behavior and platform physical constraints is indispensable to meeting these performance and efficiency goals. This article describes power-management innovations introduced on Intel's Sandy Bridge microprocessor.

452 citations


Journal ArticleDOI
TL;DR: Accuracy tests show that Arduino boards, a low-cost and open-source family of I/O boards, may be an inexpensive tool for many psychological and neurophysiological labs.
Abstract: Typical experiments in psychological and neurophysiological settings often require the accurate control of multiple input and output signals. These signals are often generated or recorded via computer software and/or external dedicated hardware. Dedicated hardware is usually very expensive and requires additional software to control its behavior. In the present article, I present some accuracy tests on a low-cost and open-source I/O board (Arduino family) that may be useful in many lab environments. One of the strengths of Arduinos is the possibility they afford to load the experimental script on the board’s memory and let it run without interfacing with computers or external software, thus granting complete independence, portability, and accuracy. Furthermore, a large community has arisen around the Arduino idea and offers many hardware add-ons and hundreds of free scripts for different projects. Accuracy tests show that Arduino boards may be an inexpensive tool for many psychological and neurophysiological labs.

329 citations


Journal ArticleDOI
01 Jul 2012
TL;DR: This work proposes a representation for feed-forward imaging pipelines that separates the algorithm from its schedule, enabling high-performance without sacrificing code clarity, and demonstrates the power of this representation by expressing a range of recent image processing applications in an embedded domain specific language called Halide and compiling them for ARM, x86, and GPUs.
Abstract: Using existing programming tools, writing high-performance image processing code requires sacrificing readability, portability, and modularity. We argue that this is a consequence of conflating what computations define the algorithm, with decisions about storage and the order of computation. We refer to these latter two concerns as the schedule, including choices of tiling, fusion, recomputation vs. storage, vectorization, and parallelism. We propose a representation for feed-forward imaging pipelines that separates the algorithm from its schedule, enabling high-performance without sacrificing code clarity. This decoupling simplifies the algorithm specification: images and intermediate buffers become functions over an infinite integer domain, with no explicit storage or boundary conditions. Imaging pipelines are compositions of functions. Programmers separately specify scheduling strategies for the various functions composing the algorithm, which allows them to efficiently explore different optimizations without changing the algorithmic code. We demonstrate the power of this representation by expressing a range of recent image processing applications in an embedded domain specific language called Halide, and compiling them for ARM, x86, and GPUs. Our compiler targets SIMD units, multiple cores, and complex memory hierarchies. We demonstrate that it can handle algorithms such as a camera raw pipeline, the bilateral grid, fast local Laplacian filtering, and image segmentation. The algorithms expressed in our language are both shorter and faster than state-of-the-art implementations.

256 citations
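
The separation of algorithm from schedule described in the Halide abstract can be illustrated with a small sketch. This is plain Python, not Halide's API: the names `make_pipeline`, `realize`, and the schedule labels "inline" and "store" are hypothetical and chosen only for illustration. The two-stage blur is defined once as functions over integer coordinates, and the schedule only decides whether the intermediate stage is recomputed or stored.

```python
from functools import lru_cache


def make_pipeline(image, schedule):
    """Build a two-stage box blur over `image` (a list of rows).

    The algorithm is the same for every schedule. The hypothetical
    `schedule` argument only chooses between recomputing the
    intermediate stage on demand ("inline") and storing each value
    after its first use ("store").
    """
    height, width = len(image), len(image[0])

    def source(x, y):
        # Clamp coordinates so the input is defined for all integers.
        cx = min(max(x, 0), width - 1)
        cy = min(max(y, 0), height - 1)
        return image[cy][cx]

    def blur_x(x, y):
        # Horizontal 3-tap average.
        return (source(x - 1, y) + source(x, y) + source(x + 1, y)) / 3.0

    if schedule == "store":
        # Scheduling decision: keep intermediate values in memory.
        blur_x = lru_cache(maxsize=None)(blur_x)
    elif schedule != "inline":
        raise ValueError(f"unknown schedule: {schedule}")

    def blur_y(x, y):
        # Vertical 3-tap average over the intermediate stage.
        return (blur_x(x, y - 1) + blur_x(x, y) + blur_x(x, y + 1)) / 3.0

    return blur_y


def realize(func, width, height):
    """Evaluate a pipeline function over a finite output region."""
    return [[func(x, y) for x in range(width)] for y in range(height)]
```

Both schedules produce the same output for the same input; only the amount of recomputation differs, which is the trade-off the abstract calls recomputation vs. storage.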


Journal ArticleDOI
TL;DR: The authors show how plans in the Topology and Orchestration Specification for Cloud Applications (TOSCA) can make the management of cloud services (their operational aspects) portable, alongside the application components themselves.
Abstract: For cloud services to be portable, their management must also be portable to the targeted environment, as must the application components themselves. Here, the authors show how plans in the Topology and Orchestration Specification for Cloud Applications (TOSCA) can enable portability of these operational aspects.

233 citations


Proceedings ArticleDOI
22 Oct 2012
TL;DR: This paper presents the state of the art of the evaluation and measurement of mobile application usability, covering proposed definitions of mobile application usability and methods to evaluate it.
Abstract: Mobile devices and applications provide significant advantages to their users, in terms of portability, location awareness, and accessibility. A number of studies have examined usability challenges in the mobile context, and proposed definitions of mobile application usability and methods to evaluate it. This paper presents the state of the art of the evaluation and measurement of mobile application usability.

224 citations


Proceedings ArticleDOI
02 Jun 2012
TL;DR: It is argued that Model-Driven Development can be helpful in this context as it would allow developers to design software systems in a cloud-agnostic way and to be supported by model transformation techniques into the process of instantiating the system into specific, possibly, multiple Clouds.
Abstract: Cloud computing is emerging as a major trend in the ICT industry. While most of the attention of the research community is focused on considering the perspective of the Cloud providers, offering mechanisms to support scaling of resources and interoperability and federation between Clouds, the perspective of developers and operators willing to choose the Cloud without being strictly bound to a specific solution is mostly neglected. We argue that Model-Driven Development can be helpful in this context as it would allow developers to design software systems in a cloud-agnostic way and to be supported by model transformation techniques into the process of instantiating the system into specific, possibly, multiple Clouds. The MODAClouds (MOdel-Driven Approach for the design and execution of applications on multiple Clouds) approach we present here is based on these principles and aims at supporting system developers and operators in exploiting multiple Clouds for the same system and in migrating (part of) their systems from Cloud to Cloud as needed. MODAClouds offers a quality-driven design, development and operation method and features a Decision Support System to enable risk analysis for the selection of Cloud providers and for the evaluation of the Cloud adoption impact on internal business processes. Furthermore, MODAClouds offers a run-time environment for observing the system under execution and for enabling a feedback loop with the design environment. This allows system developers to react to performance fluctuations and to re-deploy applications on different Clouds on the long term.

223 citations


Proceedings ArticleDOI
25 Jun 2012
TL;DR: This announcement describes the problem based benchmark suite (PBBS), a set of benchmarks designed for comparing parallel algorithmic approaches, parallel programming language styles, and machine architectures across a broad set of problems.
Abstract: This announcement describes the problem based benchmark suite (PBBS). PBBS is a set of benchmarks designed for comparing parallel algorithmic approaches, parallel programming language styles, and machine architectures across a broad set of problems. Each benchmark is defined concretely in terms of a problem specification and a set of input distributions. No requirements are made in terms of algorithmic approach, programming language, or machine architecture. The goal of the benchmarks is not only to compare runtimes, but also to be able to compare code and other aspects of an implementation (e.g., portability, robustness, determinism, and generality). As such the code for an implementation of a benchmark is as important as its runtime, and the public PBBS repository will include both code and performance results. The benchmarks are designed to make it easy for others to try their own implementations, or to add new benchmark problems. Each benchmark problem includes the problem specification, the specification of input and output file formats, default input generators, test codes that check the correctness of the output for a given input, driver code that can be linked with implementations, a baseline sequential implementation, a baseline multicore implementation, and scripts for running timings (and checks) and outputting the results in a standard format. The current suite includes the following problems: integer sort, comparison sort, remove duplicates, dictionary, breadth first search, spanning forest, minimum spanning forest, maximal independent set, maximal matching, K-nearest neighbors, Delaunay triangulation, convex hull, suffix arrays, n-body, and ray casting. For each problem, we report the performance of our baseline multicore implementation on a 40-core machine.

196 citations
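
The structure of a PBBS benchmark problem (input generator, correctness check, driver, baseline implementation) can be sketched for the comparison sort problem as follows. This is a minimal illustration in Python, not code from the PBBS repository: every function name here is hypothetical, and the real suite defines its own file formats and drivers.

```python
import random
import time
from collections import Counter


def generate_input(n, seed=0):
    """Default input generator: n pseudo-random integers."""
    rng = random.Random(seed)
    return [rng.randrange(n) for _ in range(n)]


def check_output(inp, out):
    """Correct output is ordered and is a permutation of the input."""
    ordered = all(out[i] <= out[i + 1] for i in range(len(out) - 1))
    return ordered and Counter(inp) == Counter(out)


def run_benchmark(implementation, n=1000, seed=0):
    """Driver: time any implementation and check its result."""
    data = generate_input(n, seed)
    start = time.perf_counter()
    result = implementation(list(data))
    elapsed = time.perf_counter() - start
    return {"n": n, "seconds": elapsed, "correct": check_output(data, result)}


def baseline_sequential(values):
    """Baseline implementation using the built-in sort."""
    return sorted(values)


def insertion_sort(values):
    """An alternative implementation plugged into the same driver."""
    out = []
    for v in values:
        i = len(out)
        while i > 0 and out[i - 1] > v:
            i -= 1
        out.insert(i, v)
    return out
```

Because the driver depends only on the problem specification, any implementation with the same input and output contract can be timed and checked without changing the harness.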


Proceedings ArticleDOI
29 Apr 2012
TL;DR: The tool Go Ahead is introduced that is able to implement run-time reconfigurable systems for all recent Xilinx FPGAs and provides a scripting interface and all features can be accessed remotely.
Abstract: Exploiting the benefits of partial run-time reconfiguration requires efficient tools. In this paper, we introduce the tool Go Ahead that is able to implement run-time reconfigurable systems for all recent Xilinx FPGAs. This includes in particular support for low cost and low power Spartan-6 FPGAs. Go Ahead assists during floor planning and automates the constraint generation. It interacts with the Xilinx vendor tools and triggers the physical implementation phases all the way down to the final configuration bit streams. Go Ahead enables the building of flexible systems for integrating many reconfigurable modules very efficiently into a system. The tool targets (re)usability, portability to future devices, and migration paths among reconfigurable systems featuring different FPGAs or even FPGA families. Moreover, it provides a scripting interface and all features can be accessed remotely.

138 citations


Proceedings ArticleDOI
15 Oct 2012
TL;DR: This paper examines how fragmentation manifests within the Android project, proposes a method for tracking fragmentation using feature analysis on project repositories, and finds that Labeled-LDA produces better (more feature-oriented) topics than LDA.
Abstract: The fragmentation of the Android ecosystem causes portability and compatibility issues within the entire Android platform, which increases developer workload, delays application deployment, and ultimately disappoints users. This subject is discussed in the press and in scientific publications but it has yet to be systematically examined. The Android bug reports, as submitted by Android-device users, span across operating-system versions and hardware platforms and can provide interesting evidence about the problem. In this paper, we analyze the bug reports related to two popular vendors, HTC and Motorola. First, we manually label the bug reports. Next, we use Labeled-LDA (Latent Dirichlet Allocation) on the labeled data and LDA on the original data, to infer topics. Finally, by examining the relevance of the top 18 bug topics for each vendor's bug reports over time, we classify topics as common or unique (vendor-specific). The latter category constitutes evidence of fragmentation and lack of portability. By comparing Labeled-LDA against LDA, we find that Labeled-LDA produced better, i.e., more feature oriented, topics than LDA. In this paper we find out how fragmentation is manifested within the Android project and we propose a method for tracking fragmentation using feature analysis on project repositories.

125 citations


Proceedings ArticleDOI
05 Sep 2012
TL;DR: The neuFlow SoC was designed to accelerate neural networks and other complex vision algorithms based on large numbers of convolutions and matrix-to-matrix operations and post-layout characterization shows that the system delivers up to 320 GOPS with an average power consumption of 0.6 W.
Abstract: This paper presents a bio-inspired vision system-on-a-chip, the neuFlow SoC, implemented in the IBM 45 nm SOI process. The neuFlow SoC was designed to accelerate neural networks and other complex vision algorithms based on large numbers of convolutions and matrix-to-matrix operations. Post-layout characterization shows that the system delivers up to 320 GOPS with an average power consumption of 0.6 W. The power efficiency and portability of this system make it ideal for embedded vision-based devices, such as driver assistance and robotic vision.

Journal ArticleDOI
TL;DR: In this paper, the agent-oriented software development (AOSD) and MDE paradigms are fully integrated for the development of MAS, and meta-modeling techniques are explicitly used to speed up several phases of the process.

Proceedings ArticleDOI
13 May 2012
TL;DR: It is demonstrated that an application written once at a high-level using the OP2 API can be easily portable across a wide range of contrasting platforms and is capable of achieving near-optimal performance without the intervention of the domain application programmer.
Abstract: OP2 is an “active” library framework for the solution of unstructured mesh-based applications. It utilizes source-to-source translation and compilation so that a single application code written using the OP2 API can be transformed into different parallel implementations for execution on different back-end hardware platforms. In this paper we present the design of the current OP2 library, and investigate its capabilities in achieving performance portability, near-optimal performance, and scaling on modern multi-core and many-core processor based systems. A key feature of this work is OP2's recent extension facilitating the development and execution of applications on a distributed memory cluster of GPUs. We discuss the main design issues in parallelizing unstructured mesh based applications on heterogeneous platforms. These include handling data dependencies in accessing indirectly referenced data, the impact of unstructured mesh data layouts (array of structs vs. struct of arrays) and design considerations in generating code for execution on a cluster of GPUs. A representative CFD application written using the OP2 framework is utilized to provide a contrasting benchmarking and performance analysis study on a range of multi-core/many-core systems. These include multi-core CPUs from Intel (Westmere and Sandy Bridge) and AMD (Magny-Cours), GPUs from NVIDIA (GTX560Ti, Tesla C2070), a distributed memory CPU cluster (Cray XE6) and a distributed memory GPU cluster (Tesla C2050 GPUs with InfiniBand). OP2's design choices are explored with quantitative insights into their contributions to performance. We demonstrate that an application written once at a high-level using the OP2 API can be easily portable across a wide range of contrasting platforms and is capable of achieving near-optimal performance without the intervention of the domain application programmer.
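
The data dependency that arises when a loop over mesh edges updates node data through an indirection map can be sketched as follows. This is plain Python and not the OP2 API: `color_edges` and `par_loop_colored` are hypothetical names, and greedy edge coloring is shown as one common way to avoid conflicting updates. The abstract does not say how OP2 itself resolves them.

```python
def color_edges(edge_to_node):
    """Greedily assign colors so that no two edges of the same color
    touch the same node. Edges of one color can then be updated
    concurrently without write conflicts."""
    used_by_node = {}
    colors = []
    for n0, n1 in edge_to_node:
        taken = used_by_node.setdefault(n0, set()) | used_by_node.setdefault(n1, set())
        color = 0
        while color in taken:
            color += 1
        colors.append(color)
        used_by_node[n0].add(color)
        used_by_node[n1].add(color)
    return colors


def par_loop_colored(kernel, edge_to_node, node_data):
    """Run `kernel` over all edges, one color at a time.

    Each color group is conflict-free, so a parallel backend could
    execute it concurrently. This sketch runs it sequentially."""
    colors = color_edges(edge_to_node)
    for color in range(max(colors, default=-1) + 1):
        for edge, (n0, n1) in enumerate(edge_to_node):
            if colors[edge] == color:
                kernel(node_data, n0, n1)


def count_degree(node_data, n0, n1):
    """Example kernel: indirect increment of both endpoint nodes."""
    node_data[n0] += 1
    node_data[n1] += 1
```

For a triangle with edges (0, 1), (1, 2), (2, 0), every pair of edges shares a node, so each edge receives its own color and the loop degenerates to sequential execution. Larger meshes typically need only a few colors.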

Proceedings ArticleDOI
10 Nov 2012
TL;DR: It is found that OpenACC is an extremely viable programming model for accelerator devices, improving programmer productivity and achieving better performance than OpenCL and CUDA.
Abstract: Hardware accelerators such as GPGPUs are becoming increasingly common in HPC platforms and their use is widely recognised as being one of the most promising approaches for reaching exascale levels of performance. Large HPC centres, such as AWE, have made huge investments in maintaining their existing scientific software codebases, the vast majority of which were not designed to effectively utilise accelerator devices. Consequently, HPC centres will have to decide how to develop their existing applications to take best advantage of future HPC system architectures. Given limited development and financial resources, it is unlikely that all potential approaches will be evaluated for each application. We are interested in how this decision making can be improved, and this work seeks to directly evaluate three candidate technologies (OpenACC, OpenCL and CUDA) in terms of performance, programmer productivity, and portability using a recently developed Lagrangian-Eulerian explicit hydrodynamics mini-application. We find that OpenACC is an extremely viable programming model for accelerator devices, improving programmer productivity and achieving better performance than OpenCL and CUDA.

Journal ArticleDOI
01 Apr 2012
TL;DR: This paper presents the skeleton library Muesli, which not only simplifies parallel programming but also allows a single application to be written that may be executed on a variety of parallel machines, ranging from simple multi-core processors with shared memory to clusters of multi- and many-core processors with distributed memory, as well as multi-GPU systems and GPU clusters.
Abstract: Due to the lack of high-level abstractions, developers of parallel applications have to deal with low-level details such as coordinating threads or synchronising processes. Thus, parallel programming still remains a difficult and error-prone task. In order to shield the user from these low-level details, algorithmic skeletons have been proposed. They encapsulate typical parallel programming patterns and have emerged to be an efficient approach to simplifying the development of parallel applications. In this paper, we present our skeleton library Muesli, which not only simplifies parallel programming but also allows writing a single application that may be executed on a variety of parallel machines, ranging from simple multi-core processors with shared memory to clusters of multi- and many-core processors with distributed memory, as well as multi-GPU systems and GPU clusters. This level of platform independence is not reached by other existing approaches that simplify parallel programming. Internally, the skeletons are based on MPI, OpenMP and CUDA. We demonstrate portability and efficiency of our approach by providing experimental results.
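
The idea of an algorithmic skeleton, a reusable parallel pattern whose backend can change without touching application code, can be sketched in Python. Muesli itself is a C++ library built on MPI, OpenMP and CUDA. The functions `map_skeleton` and `fold_skeleton` and the `backend` parameter below are hypothetical stand-ins, with a thread pool taking the place of a real parallel backend.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_skeleton(func, items, backend="sequential", workers=4):
    """Apply `func` to every item. The backend is chosen separately
    from the application logic that calls this skeleton."""
    if backend == "sequential":
        return [func(x) for x in items]
    if backend == "threads":
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # pool.map preserves input order in its results.
            return list(pool.map(func, items))
    raise ValueError(f"unknown backend: {backend}")


def fold_skeleton(op, items, initial):
    """Combine all items with a binary operator."""
    return reduce(op, items, initial)


def sum_of_squares(values, backend):
    """Application code: written once, independent of the backend."""
    squared = map_skeleton(lambda v: v * v, values, backend=backend)
    return fold_skeleton(lambda a, b: a + b, squared, 0)
```

The application function `sum_of_squares` never mentions threads or processes, so switching the execution platform is a one-argument change.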

Proceedings ArticleDOI
10 Nov 2012
TL;DR: This work presents work in progress on PyOP2, a high-level embedded domain-specific language for mesh-based simulation codes that executes numerical kernels in parallel over unstructured meshes; kernels for finite element computations can be generated automatically from equations given in the domain-specific Unified Form Language.
Abstract: Emerging many-core platforms are very difficult to program in a performance portable manner whilst achieving high efficiency on a diverse range of architectures. We present work in progress on PyOP2, a high-level embedded domain-specific language for mesh-based simulation codes that executes numerical kernels in parallel over unstructured meshes. Just-in-time kernel compilation and parallel scheduling are delayed until runtime, when problem-specific parameters are available. Using generative metaprogramming, performance portability is achieved, while details of the parallel implementation are abstracted from the programmer. PyOP2 kernels for finite element computations can be generated automatically from equations given in the domain-specific Unified Form Language. Interfacing to the multi-phase CFD code Fluidity through a very thin layer on top of PyOP2 yields a general purpose finite element solver with an input notation very close to mathematical formulae. Preliminary performance figures show speedups of up to 3.4× compared to Fluidity's built-in solvers when running in parallel.
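
Delaying kernel generation until runtime, when problem-specific parameters are known, can be sketched with Python's built-in `compile` and `exec`. This is only an illustration of generative metaprogramming, not PyOP2's interface: `generate_axpy_kernel` is a hypothetical name, and the real system generates and compiles code for parallel backends rather than Python source.

```python
def generate_axpy_kernel(alpha, n):
    """Generate and compile a kernel specialized for `alpha` and `n`.

    The loop bound and the scaling factor are baked into the source
    text as constants, which is possible only because generation
    happens at runtime, after these values are known."""
    source = (
        "def kernel(x, y):\n"
        f"    for i in range({n}):\n"
        f"        y[i] += {alpha!r} * x[i]\n"
    )
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["kernel"], source
```

A caller obtains the specialized function with `kernel, source = generate_axpy_kernel(2.0, 3)` and can inspect `source` to see the constants that were substituted.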

Proceedings ArticleDOI
25 Jun 2012
TL;DR: A framework to simplify the interface between a variety of external sensors and consumer Android devices is presented and three alternative architectures for application-level drivers are explored to understand trade-offs in performance, device portability, simplicity, and deployment ease.
Abstract: Smartphones can now connect to a variety of external sensors over wired and wireless channels. However, ensuring proper device interaction can be burdensome, especially when a single application needs to integrate with a number of sensors using different communication channels and data formats. This paper presents a framework to simplify the interface between a variety of external sensors and consumer Android devices. The framework simplifies both application and driver development with abstractions that separate responsibilities between the user application, sensor framework, and device driver. These abstractions facilitate a componentized framework that allows developers to focus on writing minimal pieces of sensor-specific code enabling an ecosystem of reusable sensor drivers. The paper explores three alternative architectures for application-level drivers to understand trade-offs in performance, device portability, simplicity, and deployment ease. We explore these tradeoffs in the context of four sensing applications designed to support our work in the developing world. They highlight a range of sensor usage models for our application-level driver framework that vary data types, configuration methods, communication channels, and sampling rates to demonstrate the framework's effectiveness.

Proceedings ArticleDOI
25 Feb 2012
TL;DR: It is demonstrated that work-stealing scheduling principles are applicable to a rich programming language such as X10, achieving performance at scale without compromising expressivity, ease of use, or portability.
Abstract: The X10 programming language is intended to ease the programming of scalable concurrent and distributed applications. X10 augments a familiar imperative object-oriented programming model with constructs to support light-weight asynchronous tasks as well as execution across multiple address spaces. A crucial aspect of X10's runtime system is the scheduling of concurrent tasks. Work-stealing schedulers have been shown to efficiently load balance fine-grain divide-and-conquer task-parallel programs on SMPs and multicores. But X10 is not limited to shared-memory fork-join parallelism. X10 permits tasks to suspend and synchronize by means of conditional atomic blocks and remote task invocations. In this paper, we demonstrate that work-stealing scheduling principles are applicable to a rich programming language such as X10, achieving performance at scale without compromising expressivity, ease of use, or portability. We design and implement a portable work-stealing execution engine for X10. While this engine is biased toward the efficient execution of fork-join parallelism in shared memory, it handles the full X10 language, especially conditional atomic blocks and distribution. We show that this engine improves the run time of a series of benchmark programs by several orders of magnitude when used in combination with the C++ backend compiler and runtime for X10. It achieves scaling comparable to state-of-the-art work-stealing scheduler implementations (the Cilk++ compiler and the Java fork/join framework) despite the dramatic increase in generality.

Journal ArticleDOI
26 Jan 2012
TL;DR: The extent to which automated compiler techniques can defend against timing-based side channel attacks on modern x86 processors is evaluated, and the extent to which compiler backends are a suitable tool to provide automated support for the proposed mitigations is discussed.
Abstract: This paper studies and evaluates the extent to which automated compiler techniques can defend against timing-based side channel attacks on modern x86 processors. We study how modern x86 processors can leak timing information through side channels that relate to data flow. We study the efficiency, effectiveness, portability, predictability and sensitivity of several mitigating code transformations that eliminate or minimize key-dependent execution time variations. Furthermore, we discuss the extent to which compiler backends are a suitable tool to provide automated support for the proposed mitigations.
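
One kind of mitigating code transformation that removes key-dependent control flow replaces a branch on a secret bit with mask arithmetic. The sketch below shows only the shape of such a transformation. The abstract does not list the specific transformations the paper studies, both function names are hypothetical, and Python integer operations carry no constant-time guarantee, so this is not a hardened implementation.

```python
MASK32 = 0xFFFFFFFF


def select_branching(secret_bit, a, b):
    """Key-dependent control flow: the executed path reveals the bit."""
    if secret_bit:
        return a
    return b


def select_branchless(secret_bit, a, b):
    """Same result with no branch on the secret.

    Assumes secret_bit is 0 or 1 and a, b are 32-bit unsigned values.
    The mask is all ones when secret_bit is 1 and all zeros otherwise."""
    mask = (-secret_bit) & MASK32
    return (a & mask) | (b & ~mask & MASK32)
```

Both functions return `a` when the bit is 1 and `b` when it is 0. The branchless form executes the same sequence of operations in either case, which is the property a compiler backend would aim to preserve down to machine code.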

Proceedings ArticleDOI
10 Dec 2012
TL;DR: This work introduces MAClets, software programs uploaded and executed on-demand over wireless cards, and devised to change the card's real-time medium access control operation, and envision a new architecture for wireless cards based on a protocol interpreter and a powerful API.
Abstract: We introduce MAClets, software programs uploaded and executed on-demand over wireless cards, and devised to change the card's real-time medium access control operation. MAClets permit seamless reconfiguration of the MAC stack, so as to adapt it to mutated context and spectrum conditions and perform tailored performance optimizations hardly accountable by a once-for-all protocol stack design. Following traditional active networking principles, MAClets can be directly conveyed within data packets and executed on hard-coded devices acting as virtual MAC machines. Indeed, rather than executing a pre-defined protocol, we envision a new architecture for wireless cards based on a protocol interpreter (enabling code portability) and a powerful API. Experiments involving the distribution of MAClets within data packets, and their execution over commodity WLAN cards, show the flexibility and viability of the proposed concept.

Journal ArticleDOI
TL;DR: This review covers the effects of CPU and GPU parallel computing in EM & OM applications, broadly scoped to include digital image/volume correlation, fringe pattern analysis, tomography, hyperspectral imaging, computer-generated holograms, and integral imaging.

Proceedings ArticleDOI
22 Apr 2012
TL;DR: The goal of this combination "Work-in-Progress and Vision" paper is to delineate application requirements in a manner that is not overly specific to individual applications or the optimizations used for certain hardware platforms, so that the authors can draw broader conclusions about hardware requirements.
Abstract: In the past, evaluating the architectural innovation of parallel computing devices relied on a benchmark suite based on existing programs, e.g., EEMBC or SPEC. However, with the growing ubiquity of parallel computing devices, we argue that it is unclear how best to express parallel computation, and hence, a need exists to identify a higher level of abstraction for reasoning about parallel application requirements. Therefore, the goal of this combination "Work-in-Progress and Vision" paper is to delineate application requirements in a manner that is not overly specific to individual applications or the optimizations used for certain hardware platforms, so that we can draw broader conclusions about hardware requirements. Our initial effort, dubbed "OpenCL and the 13 Dwarfs" or OCD for short, realizes Berkeley's 13 computational dwarfs of scientific computing in OpenCL, where each dwarf captures a pattern of computation and communication that is common to a class of important applications.

Book ChapterDOI
12 Sep 2012
TL;DR: This paper analyzes TOSCA with the focus on requirements on workflow modeling languages to come up with a strong link to the application topology with the goal to improve modeling support.
Abstract: TOSCA is an upcoming standard to capture cloud application topologies and their management in a portable way. Management aspects include provisioning, operation and deprovisioning of an application. Management plans capture these aspects in workflows. BPMN 2.0 as general-purpose language can be used to model these workflows. There is, however, no tailored support for management plans in BPMN. This paper analyzes TOSCA with the focus on requirements on workflow modeling languages to come up with a strong link to the application topology with the goal to improve modeling support. To simplify the modeling of management plans, we introduce BPMN4TOSCA, which extends BPMN with four TOSCA-specific elements: TOSCA Topology Management Task, TOSCA Node Management Task, TOSCA Script Task, and TOSCA Data Object. Portability is ensured by a transformation of BPMN4TOSCA to plain BPMN. A prototypical modeling tool supports the strong link between the management plan and the TOSCA topology.

Journal ArticleDOI
TL;DR: This paper presents a cooperative human-robot interaction system that has been specifically developed for portability between different humanoid platforms through abstraction layers at the perceptual and motor interfaces.
Abstract: Robots should be capable of interacting in a cooperative and adaptive manner with their human counterparts in open-ended tasks that can change in real-time. An important aspect of the robot behavior will be the ability to acquire new knowledge of the cooperative tasks by observing and interacting with humans. The current research addresses this challenge. We present results from a cooperative human-robot interaction system that has been specifically developed for portability between different humanoid platforms, by abstraction layers at the perceptual and motor interfaces. In the perceptual domain, the resulting system is demonstrated to learn to recognize objects and to recognize actions as sequences of perceptual primitives, and to transfer this learning, and recognition, between different robotic platforms. For execution, composite actions and plans are shown to be learnt on one robot and executed successfully on a different one. Most importantly, the system provides the ability to link actions into shared plans, that form the basis of human-robot cooperation, applying principles from human cognitive development to the domain of robot cognitive systems.

Journal ArticleDOI
TL;DR: A new lightweight video monitoring system (COSMOS) that has been developed to target several key characteristics including portability, low-cost, robustness and easy installation is presented.

Proceedings ArticleDOI
12 Mar 2012
TL;DR: Three complementary approaches that can provide both portability and an increased level of abstraction for the programming of heterogeneous multicore systems are discussed, and it is shown how they could complement each other in an integrational programming framework for heterogeneous multicore systems.
Abstract: We discuss three complementary approaches that can provide both portability and an increased level of abstraction for the programming of heterogeneous multicore systems. Together, these approaches also support performance portability, as currently investigated in the EU FP7 project PEPPHER. In particular, we consider (1) a library-based approach, here represented by the integration of the SkePU C++ skeleton programming library with the StarPU runtime system for dynamic scheduling and dynamic selection of suitable execution units for parallel tasks; (2) a language-based approach, here represented by the Offload-C++ high-level language extensions and Offload compiler to generate platform-specific code; and (3) a component-based approach, specifically the PEPPHER component system for annotating user-level application components with performance metadata, thereby preparing them for performance-aware composition. We discuss the strengths and weaknesses of these approaches and show how they could complement each other in an integrational programming framework for heterogeneous multicore systems.

Proceedings ArticleDOI
22 Feb 2012
TL;DR: This paper addresses the portability challenge by introducing a framework of architecture and middleware for virtualization of FPGA platforms, collectively named VirtualRC, and enabling portability of 11 applications and two high-level synthesis tools across three physical platforms.
Abstract: Numerous studies have shown significant performance and power benefits of field-programmable gate arrays (FPGAs). Despite these benefits, FPGA usage has been limited by application design complexity caused largely by the lack of code and tool portability across different FPGA platforms, which prevents design reuse. This paper addresses the portability challenge by introducing a framework of architecture and middleware for virtualization of FPGA platforms, collectively named VirtualRC. Experiments show modest overhead of 5-6% in performance and 1% in area, while enabling portability of 11 applications and two high-level synthesis tools across three physical platforms.

Proceedings ArticleDOI
21 May 2012
TL;DR: Novel methods and compiler transformations that increase programmer productivity by enabling users of the language Chapel to provide a single code implementation that the compiler can then use to target not only conventional multiprocessors, but also high-throughput and hybrid machines are presented.
Abstract: It has been widely shown that high-throughput computing architectures such as GPUs offer large performance gains compared with their traditional low-latency counterparts for many applications. The downside to these architectures is that the current programming models present numerous challenges to the programmer: lower-level languages, loss of portability across different architectures, explicit data movement, and challenges in performance optimization. This paper presents novel methods and compiler transformations that increase programmer productivity by enabling users of the language Chapel to provide a single code implementation that the compiler can then use to target not only conventional multiprocessors, but also high-throughput and hybrid machines. Rather than resorting to different parallel libraries or annotations for a given parallel platform, this work leverages a language that has been designed from first principles to address the challenge of programming for parallelism and locality. This also has the advantage of providing portability across different parallel architectures. Finally, this work presents experimental results from the Parboil benchmark suite which demonstrate that codes written in Chapel achieve performance comparable to the original versions implemented in CUDA on both GPUs and multicore platforms.

Journal ArticleDOI
TL;DR: The Kokkos Array programming model provides a library-based approach to implement computational kernels that are performance-portable to CPU-multicore and GPGPU accelerator devices.
Abstract: Large, complex scientific and engineering application codes have a significant investment in computational kernels to implement their mathematical models. Porting these computational kernels to the collection of modern manycore accelerator devices is a major challenge in that these devices have diverse programming models, application programming interfaces (APIs), and performance requirements. The Kokkos Array programming model provides a library-based approach to implement computational kernels that are performance-portable to CPU-multicore and GPGPU accelerator devices. This programming model is based upon three fundamental concepts: (1) manycore compute devices each with its own memory space, (2) data parallel kernels, and (3) multidimensional arrays. Kernel execution performance is, especially for NVIDIA® devices, extremely dependent on data access patterns. The optimal data access pattern can be different for different manycore devices, potentially leading to different implementations of computational kernels specialized for different devices. The Kokkos Array programming model supports performance-portable kernels by (1) separating data access patterns from computational kernels through a multidimensional array API and (2) introducing device-specific data access mappings when a kernel is compiled. An implementation of Kokkos Array is available through Trilinos [Trilinos website, http://trilinos.sandia.gov/, August 2011].
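The separation of data access patterns from kernel code described in this abstract can be illustrated with a minimal Python sketch. This is not the Kokkos Array API: the names `Array2D`, `layout_right`, `layout_left`, `fill`, and `row_sums` are hypothetical, and the layout is chosen at run time here, whereas the abstract describes mappings introduced when a kernel is compiled. The point is only that the kernel addresses elements as `a[i, j]` and never sees how indices map to memory.

```python
# Illustrative sketch only; all names are hypothetical, not Kokkos identifiers.

def layout_right(i, j, n_rows, n_cols):
    """Row-major mapping: consecutive j values are adjacent in storage."""
    return i * n_cols + j

def layout_left(i, j, n_rows, n_cols):
    """Column-major mapping: consecutive i values are adjacent in storage."""
    return i + j * n_rows

class Array2D:
    """2-D array whose index-to-offset mapping is supplied separately."""

    def __init__(self, n_rows, n_cols, layout):
        self.n_rows = n_rows
        self.n_cols = n_cols
        self._layout = layout
        self.storage = [0.0] * (n_rows * n_cols)

    def __getitem__(self, idx):
        i, j = idx
        return self.storage[self._layout(i, j, self.n_rows, self.n_cols)]

    def __setitem__(self, idx, value):
        i, j = idx
        self.storage[self._layout(i, j, self.n_rows, self.n_cols)] = value

def fill(a):
    """Write a[i, j] = 10*i + j through the array interface."""
    for i in range(a.n_rows):
        for j in range(a.n_cols):
            a[i, j] = float(10 * i + j)

def row_sums(a):
    """Kernel written only against a[i, j]; it is layout-agnostic."""
    return [sum(a[i, j] for j in range(a.n_cols)) for i in range(a.n_rows)]

row_major = Array2D(2, 3, layout_right)
col_major = Array2D(2, 3, layout_left)
fill(row_major)
fill(col_major)

# Same kernel, same results, different memory order underneath.
print(row_sums(row_major), row_sums(col_major))
print(row_major.storage)
print(col_major.storage)
```

Both arrays give the same row sums while their underlying storage order differs, so the same kernel source could be paired with whichever mapping suits a given device. Which mapping suits which device is not stated in the abstract and is left open here.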

Journal ArticleDOI
TL;DR: This research work will help HPC application developers select an apt monitoring mechanism, and help HPC tool developers add the energy monitoring mechanisms that fit well with their basic monitoring infrastructures and validate existing tools in terms of overhead, portability, and user-friendliness.