Showing papers on "Software portability published in 2010"

PDF

Open Access

Journal Article•DOI•

Native Client: a sandbox for portable, untrusted x86 native code

[...]

Bennet S. Yee¹, David C. Sehr¹, Gregory Dardyk¹, J. Bradley Chen¹, Robert Muth¹, Tavis Ormandy¹, Shiki Okasaka¹, Neha Narula¹, Nicholas Fullagar¹ - Show less +5 more•Institutions (1)

Google¹

01 Jan 2010-Communications of The ACM

TL;DR: This paper describes the design, implementation and evaluation of Native Client, a sandbox for untrusted x86 native code that combines software fault isolation and a secure runtime to direct system interaction and side effects through interfaces managed by Native Client.

...read moreread less

Abstract: This paper describes the design, implementation and evaluation of Native Client, a sandbox for untrusted x86 native code. Native Client aims to give browser-based applications the computational performance of native applications without compromising safety. Native Client uses software fault isolation and a secure runtime to direct system interaction and side effects through interfaces managed by Native Client. Native Client provides operating system portability for binary code while supporting performance-oriented features generally absent from web application programming environments, such as thread support, instruction set extensions such as SSE, and use of compiler intrinsics and hand-coded assembler. We combine these properties in an open architecture that encourages community review and 3rd-party tools.

...read moreread less

434 citations

Proceedings Article•DOI•

The Hadoop distributed filesystem: Balancing portability and performance

[...]

Jeffrey Shafer¹, Scott Rixner¹, Alan L. Cox¹•Institutions (1)

Rice University¹

28 Mar 2010

TL;DR: The performance of HDFS is analyzed and several performance issues are uncovered, including architectural bottlenecks exist in the Hadoop implementation that result in inefficient HDFS usage due to delays in scheduling new MapReduce tasks.

...read moreread less

Abstract: Hadoop is a popular open-source implementation of MapReduce for the analysis of large datasets. To manage storage resources across the cluster, Hadoop uses a distributed user-level filesystem. This filesystem - HDFS - is written in Java and designed for portability across heterogeneous hardware and software platforms. This paper analyzes the performance of HDFS and uncovers several performance issues. First, architectural bottlenecks exist in the Hadoop implementation that result in inefficient HDFS usage due to delays in scheduling new MapReduce tasks. Second, portability limitations prevent the Java implementation from exploiting features of the native platform. Third, HDFS implicitly makes portability assumptions about how the native platform manages storage resources, even though native filesystems and I/O schedulers vary widely in design and behavior. This paper investigates the root causes of these performance bottlenecks in order to evaluate tradeoffs between portability and performance in the Hadoop distributed filesystem.

...read moreread less

331 citations

Proceedings Article•DOI•

An auto-tuning framework for parallel multicore stencil computations

[...]

Shoaib Kamil¹, Cy Chan¹, Leonid Oliker¹, John Shalf¹, Samuel Williams¹ - Show less +1 more•Institutions (1)

Lawrence Berkeley National Laboratory¹

19 Apr 2010

TL;DR: In this article, the authors present a stencil auto-tuning framework that significantly advances programmer productivity by automatically converting a straightforward sequential Fortran 95 stencil expression into tuned parallel implementations in Fortran, C, or CUDA.

...read moreread less

Abstract: Although stencil auto-tuning has shown tremendous potential in effectively utilizing architectural resources, it has hitherto been limited to single kernel instantiations; in addition, the large variety of stencil kernels used in practice makes this computation pattern difficult to assemble into a library This work presents a stencil auto-tuning framework that significantly advances programmer productivity by automatically converting a straightforward sequential Fortran 95 stencil expression into tuned parallel implementations in Fortran, C, or CUDA, thus allowing performance portability across diverse computer architectures, including the AMD Barcelona, Intel Nehalem, Sun Victoria Falls, and the latest NVIDIA GPUs Results show that our generalized methodology delivers significant performance gains of up to 22× speedup over the reference serial implementation Overall we demonstrate that such domain-specific auto-tuners hold enormous promise for architectural efficiency, programmer productivity, performance portability, and algorithmic adaptability on existing and emerging multicore systems

...read moreread less

243 citations

Journal Issue•DOI•

A survey of algorithmic skeleton frameworks: high-level structured parallel programming enablers

[...]

Horacio González-Vélez¹, Mario Leyton²•Institutions (2)

Robert Gordon University¹, University of Chile²

01 Nov 2010-Software - Practice and Experience

TL;DR: The ASKF can be positioned as high-level structured parallel programming enablers, as their systematic utilization permits the abstract description of programs and fosters portability by focusing on the description of the algorithmic structure rather than on its detailed implementation.

...read moreread less

Abstract: Structured parallel programs ought to be conceived as two separate and complementary entities: computation, which expresses the calculations in a procedural manner, and coordination, which abstracts the interaction and communication. By abstracting commonly used patterns of parallel computation, communication, and interaction, algorithmic skeletons enable programmers to code algorithms without specifying platform-dependent primitives. This article presents a literature review on algorithmic skeleton frameworks (ASKF), parallel software development environments furnishing a collection of parameterizable algorithmic skeletons, where the control flow, nesting, resource monitoring, and portability of the resulting parallel program is dictated by the ASKF as opposed to the programmer. Consequently, the ASKF can be positioned as high-level structured parallel programming enablers, as their systematic utilization permits the abstract description of programs and fosters portability by focusing on the description of the algorithmic structure rather than on its detailed implementation. Copyright © 2010 John Wiley & Sons, Ltd.

...read moreread less

186 citations

Journal Article•DOI•

An asymmetric distributed shared memory model for heterogeneous parallel systems

[...]

Isaac Gelado¹, John E. Stone², Javier Cabezas¹, Sanjay J. Patel², Nacho Navarro¹, Wen-mei W. Hwu² - Show less +2 more•Institutions (2)

Polytechnic University of Catalonia¹, University of Illinois at Urbana–Champaign²

13 Mar 2010

TL;DR: A new programming model for heterogeneous computing, called Asymmetric Distributed Shared Memory (ADSM), that maintains a shared logical memory space for CPUs to access objects in the accelerator physical memory but not vice versa, is presented.

...read moreread less

Abstract: Heterogeneous computing combines general purpose CPUs with accelerators to efficiently execute both sequential control-intensive and data-parallel phases of applications. Existing programming models for heterogeneous computing rely on programmers to explicitly manage data transfers between the CPU system memory and accelerator memory. This paper presents a new programming model for heterogeneous computing, called Asymmetric Distributed Shared Memory (ADSM), that maintains a shared logical memory space for CPUs to access objects in the accelerator physical memory but not vice versa. The asymmetry allows light-weight implementations that avoid common pitfalls of symmetrical distributed shared memory systems. ADSM allows programmers to assign data objects to performance critical methods. When a method is selected for accelerator execution, its associated data objects are allocated within the shared logical memory space, which is hosted in the accelerator physical memory and transparently accessible by the methods executed on CPUs. We argue that ADSM reduces programming efforts for heterogeneous computing systems and enhances application portability. We present a software implementation of ADSM, called GMAC, on top of CUDA in a GNU/Linux environment. We show that applications written in ADSM and running on top of GMAC achieve performance comparable to their counterparts using programmer-managed data transfers. This paper presents the GMAC system and evaluates different design choices. We further suggest additional architectural support that will likely allow GMAC to achieve higher application performance than the current CUDA model.

...read moreread less

170 citations

Proceedings Article•DOI•

MapCG: writing parallel program portable between CPU and GPU

[...]

Chuntao Hong¹, Dehao Chen¹, Wenguang Chen¹, Weimin Zheng¹, Haibo Lin² - Show less +1 more•Institutions (2)

Tsinghua University¹, IBM²

11 Sep 2010

TL;DR: This research presents a novel and scalable approaches to solve the problem of high development and maintenance cost of writing GPU specific code with low level GPU APIs such as CUDA.

...read moreread less

Abstract: Graphics Processing Units (GPU) have been playing an important role in the general purpose computing market recently. The common approach to program GPU today is to write GPU specific code with low level GPU APIs such as CUDA. Although this approach can achieve very good performance, it raises serious portability issues: programmers are required to write a specific version of code for each potential target architecture. It results in high development and maintenance cost.We believe it is desired to have a programming model which provides source code portability between CPUs and GPUs, and different GPUs: Programmers only need to write one version of code and can be compiled and executed on either CPUs or GPUs efficiently without modification.In this paper, we propose MapCG, a MapReduce framework to provide source code level portability between CPU and GPU. Different from OpenCL, our framework is based on MapReduce, which provides a high level programming model, making programming much easier.We describe the design of the MapReduce-based high-level programming language and the underlying runtime system to enable portability between CPU and GPU. A prototype of MapCG runtime was implemented, supporting multi-core CPU and NVIDIA GPUs. Experiments show that our implementation can execute the same source code efficiently on multi-core CPU platforms and GPUs, achieving an average of 1.6-2.5x speedup over previous implementations of MapReduce on eight commonly used applications.

...read moreread less

150 citations

Book•

Chasing Stars: The Myth of Talent and the Portability of Performance

[...]

Boris Groysberg

19 Apr 2010

TL;DR: In this paper, the authors discuss women's performance more portable than men's than women's ability to move in teams, and the importance of women's mobility in the field of talent management.

...read moreread less

Abstract: Acknowledgments ix Introduction 3 Part One: Talent and Portability Chapter 1: Moving On 15 Chapter 2: Analysts' Labor Market 35 Chapter 3: The Limits of Portability 51 Chapter 4: Do Firms Benefit from Hiring Stars? 77 Part Two: Facets of Portability Chapter 5: Stars and Their Galaxies: Firms of Origin and Portability 93 Chapter 6: Integrating Stars: The Hiring Firm and Portability of Performance 125 Chapter 7: Liftouts (Taking Some of It with You): Moving in Teams 141 Chapter 8: Women and Portability: Why Is Women's Performance More Portable than Men's? 163 Part Three: Implications for Talent Management: Developing, Retaining, and Rewarding Stars Chapter 9: Star Formation: Developmental Cultures at Work 197 Chapter 10: Turnover: Who Leaves and Why 239 Chapter 11: A Special Case of Turnover: Stars as Entrepreneurs 253 Chapter 12: Measuring and Rewarding Stars' Performance 273 Chapter 13: Lessons from Wall Street and Elsewhere 321 Appendix 343 Notes 353 Index 437

...read moreread less

130 citations

Journal Article•DOI•

Understanding Position Transducer Technology for Strength and Conditioning Practitioners

[...]

Nigel Harris, John B. Cronin, Kristie-Lee Taylor, Jodovtseff Boris¹, Jeremy M. Sheppard - Show less +1 more•Institutions (1)

University of Liège¹

01 Aug 2010-Strength and Conditioning Journal

TL;DR: This article discusses this PIECE of technology from its design to how it may be USed to inform practice and offers a great deal of information that can be used to guide programming and training to better EFFECT.

...read moreread less

Abstract: SUMMARY: STRENGTH AND POWER ASSESSMENTS IN CONDITIONING PRACTICE HAVE TYPICALLY INVOLVED RUDIMENTARY MEASURES SUCH AS 1 REPETITION MAXIMUM. MORE COMPLEX LABORATORY ANALYSIS HAS BEEN AVAILABLE BUT BECAUSE OF THE PRICE AND PORTABILITY OF EQUIPMENT, SUCH ANALYSIS REMAINED IMPRACTICAL TO MOST PRACTITIONERS. RECENTLY, A NUMBER OF DEVICES HAVE BECOME AVAILABLE THAT ARE REASONABLY INEXPENSIVE AND PORTABLE AND OFFER A GREAT DEAL OF INFORMATION THAT CAN BE USED TO GUIDE PROGRAMMING AND TRAINING TO BETTER EFFECT. ONE SUCH DEVICE IS THE LINEAR POSITION TRANSDUCER. THIS ARTICLE DISCUSSES THIS PIECE OF TECHNOLOGY FROM ITS DESIGN TO HOW IT MAY BE USED TO INFORM PRACTICE

...read moreread less

117 citations

Proceedings Article•DOI•

Partitioning streaming parallelism for multi-cores: a machine learning based approach

[...]

Zheng Wang¹, Michael O'Boyle¹•Institutions (1)

University of Edinburgh¹

11 Sep 2010

TL;DR: This work develops a portable and automatic compiler-based approach to partitioning streaming programs using machine learning that predicts the ideal partition structure for a given streaming application using prior knowledge learned off-line.

...read moreread less

Abstract: Stream based languages are a popular approach to expressing parallelism in modern applications. The efficient mapping of streaming parallelism to multi-core processors is, however, highly dependent on the program and underlying architecture. We address this by developing a portable and automatic compiler-based approach to partitioning streaming programs using machine learning. Our technique predicts the ideal partition structure for a given streaming application using prior knowledge learned off-line. Using the predictor we rapidly search the program space (without executing any code) to generate and select a good partition. We applied this technique to standard StreamIt applications and compared against existing approaches. On a 4-core platform, our approach achieves 60% of the best performance found by iteratively compiling and executing over 3000 different partitions per program. We obtain, on average, a 1.90x speedup over the already tuned partitioning scheme of the StreamIt compiler. When compared against a state-of-the-art analytical, model-based approach, we achieve, on average, a 1.77x performance improvement. By porting our approach to a 8-core platform, we are able to obtain 1.8x improvement over the StreamIt default scheme, demonstrating the portability of our approach.

...read moreread less

113 citations

Proceedings Article•DOI•

Intermediate fabrics: virtual architectures for circuit portability and fast placement and routing

[...]

James Coole¹, Greg Stitt¹•Institutions (1)

University of Florida¹

24 Oct 2010

TL;DR: In this paper, a virtual reconfigurable architectures for different application domains, implemented on top of commercial off-the-shelf (COTS) devices, is proposed to hide the complexity of fine-grained physical devices and enable circuit portability across all devices that implement the intermediate fabric.

...read moreread less

Abstract: Although hardware/software partitioning of embedded applications onto FPGAs is widely known to have performance and power advantages, FPGA usage has been typically limited to hardware experts, due largely to several problems: 1) difficulty of integrating hardware design tools into well-established software tool flows, 2) increasingly lengthy FPGA design iterations due to placement and routing, and 3) a lack of portability and interoperability resulting from device/platform-specific tools and bitfiles. In this paper, we directly address the last two problems by introducing intermediate fabrics, which are virtual reconfigurable architectures specialized for different application domains, implemented on top of commercial-off-the-shelf devices. Such specialization enables near-instantaneous placement and routing by hiding the complexity of fine-grained physical devices, while also enabling circuit portability across all devices that implement the intermediate fabric. When combined with existing work on runtime synthesis from software binaries, intermediate fabrics reduce the effects of all three problems by enabling transparent usage of COTS FPGAs by software designers. In this paper, we explore intermediate fabric architectures using specialization techniques to minimize area and performance overhead of the virtual fabric while maximizing routability and speedup of placement and routing. We present results showing an average placement and routing speedup of 554x, with an average area overhead of 10% and clock overhead of 18%, which corresponds to an average frequency of 195 MHz.

...read moreread less

103 citations

Book Chapter•DOI•

Building a mosaic of clouds

[...]

Beniamino Di Martino¹, Dana Petcu, Roberto Cossu², Pedro Gonçalves, Tamás Máhr, Miguel Loichate - Show less +2 more•Institutions (2)

Seconda Università degli Studi di Napoli¹, European Space Agency²

31 Aug 2010

TL;DR: A position paper exposing the concepts behind a recent proposal for an open-source application programming interface and platform for dealing with multiple Cloud computing offers is presented.

...read moreread less

Abstract: The current diversity of Cloud computing services, benefic for the fast development of a new IT market, hinders the easy development, portability and inter-operability of Cloud oriented applications. Developing an application oriented view of Cloud services instead the current provider ones can lead to a step forward in the adoption of Cloud computing on a larger scale than the actual one. In this context, we present a position paper exposing the concepts behind a recent proposal for an open-source application programming interface and platform for dealing with multiple Cloud computing offers.

...read moreread less

Proceedings Article•DOI•

Twin peaks: a software platform for heterogeneous computing on general-purpose and graphics processors

[...]

Jayanth Gummaraju¹, Ben Sander¹, Laurent Morichetti¹, Benedict R. Gaster¹, Mike Houston¹, Bixia Zheng¹ - Show less +2 more•Institutions (1)

Advanced Micro Devices¹

11 Sep 2010

TL;DR: Twin Peaks is presented, a software platform for heterogeneous computing that executes code originally targeted for GPUs on CPUs as well, which permits a more balanced execution between the CPU and GPU, and enables portability of code between these architectures and to CPU-only environments.

...read moreread less

Abstract: Modern processors are evolving into hybrid, heterogeneous processors with both CPU and GPU cores used for general-purpose computation. Several languages such as Brook, CUDA, and more recently OpenCL are being developed to fully harness the potential of these processors. These languages typically involve the control code running on the CPU and the performance-critical, data-parallel kernel code running on the GPUs. In this paper, we present Twin Peaks, a software platform for heterogeneous computing that executes code originally targeted for GPUs efficiently on CPUs as well. This permits a more balanced execution between the CPU and GPU, and enables portability of code between these architectures and to CPU-only environments. We propose several techniques in the runtime system to efficiently utilize the caches and functional units present in CPUs. Using OpenCL as a canonical language for heterogeneous computing, and running several experiments on real hardware, we show that our techniques enable GPGPU-style code to execute efficiently on multicore CPUs with minimal runtime overhead. These results also show that for maximum performance, it is beneficial for applications to utilize both CPUs and GPUs as accelerator targets.

...read moreread less

Proceedings Article•DOI•

Secure Virtual Machine Execution under an Untrusted Management OS

[...]

Chunxiao Li¹, Anand Raghunathan², Niraj K. Jha¹•Institutions (2)

Princeton University¹, Purdue University²

05 Jul 2010

TL;DR: This paper proposes a secure virtualization architecture that provides a secure run-time environment, network interface, and secondary storage for a guest VM, and evaluates the performance penalties incurred, and demonstrates that the penalties are minimal.

...read moreread less

Abstract: Virtualization is a rapidly evolving technology that can be used to provide a range of benefits to computing systems, including improved resource utilization, software portability, and reliability. For security-critical applications, it is highly desirable to have a small trusted computing base (TCB), since it minimizes the surface of attacks that could jeopardize the security of the entire system. In traditional virtualization architectures, the TCB for an application includes not only the hardware and the virtual machine monitor (VMM), but also the whole management operating system (OS) that contains the device drivers and virtual machine (VM) management functionality. For many applications, it is not acceptable to trust this management OS, due to its large code base and abundance of vulnerabilities. In this paper, we address the problem of providing a secure execution environment on a virtualized computing platform under the assumption of an untrusted management OS. We propose a secure virtualization architecture that provides a secure run-time environment, network interface, and secondary storage for a guest VM. The proposed architecture significantly reduces the TCB of security-critical guest VMs, leading to improved security in an untrusted management environment. We have implemented a prototype of the proposed approach using the Xen virtualization system, and demonstrated how it can be used to facilitate secure remote computing services. We evaluate the performance penalties incurred by the proposed architecture, and demonstrate that the penalties are minimal.

...read moreread less

Proceedings Article•DOI•

A mapping path for multi-GPGPU accelerated computers from a portable high level programming abstraction

[...]

Allen Leung, Nicolas Vasilache, Benoit Meister, Muthu Baskaran, David E. Wohlford, Cédric Bastoul, Richard Lethin - Show less +3 more

14 Mar 2010

TL;DR: This paper discusses the implementation, in the R-Stream compiler, of a source to source mapping pathway from a high-level, textbook-style algorithm expression method in ANSI C, to multi-GPGPU accelerated computers.

...read moreread less

Abstract: Programmers for GPGPU face rapidly changing substrate of programming abstractions, execution models, and hardware implementations. It has been established, through numerous demonstrations for particular conjunctions of application kernel, programming languages, and GPU hardware instance, that it is possible to achieve significant improvements in the price/performance and energy/performance over general purpose processors. But these demonstrations are each the result of significant dedicated programmer labor, which is likely to be duplicated for each new GPU hardware architecture to achieve performance portability.This paper discusses the implementation, in the R-Stream compiler, of a source to source mapping pathway from a high-level, textbook-style algorithm expression method in ANSI C, to multi-GPGPU accelerated computers. The compiler performs hierarchical decomposition and parallelization of the algorithm between and across host, multiple GPGPUs, and within-GPU. The semantic transformations are expressed within the polyhedral model, including optimization of integrated parallelization, locality, and contiguity tradeoffs. Hierarchical tiling is performed. Communication and synchronizations operations at multiple levels are generated automatically. The resulting mapping is currently emitted in the CUDA programming language.The GPU backend adds to the range of hardware and accelerator targets for R-Stream and indicates the potential for performance portability of single sources across multiple hardware targets.

...read moreread less

Journal Article•DOI•

ForestGOMP: An Efficient OpenMP Environment for NUMA Architectures

[...]

François Broquedis¹, Nathalie Furmento¹, Brice Goglin¹, Pierre-André Wacrenier¹, Raymond Namyst¹ - Show less +1 more•Institutions (1)

University of Bordeaux¹

30 May 2010-International Journal of Parallel Programming

TL;DR: The runtime, which is based on a multi-level thread scheduler combined with a NUMA-aware memory manager, converts this information into scheduling hints related to thread-memory affinity issues that enable dynamic load distribution guided by application structure and hardware topology, thus helping to achieve performance portability.

...read moreread less

Abstract: Exploiting the full computational power of current hierarchical multiprocessor machines requires a very careful distribution of threads and data among the underlying non-uniform architecture so as to avoid remote memory access penalties. Directive-based programming languages such as OpenMP, can greatly help to perform such a distribution by providing programmers with an easy way to structure the parallelism of their application and to transmit this information to the runtime system. Our runtime, which is based on a multi-level thread scheduler combined with a NUMA-aware memory manager, converts this information into scheduling hints related to thread-memory affinity issues. These hints enable dynamic load distribution guided by application structure and hardware topology, thus helping to achieve performance portability. Several experiments show that mixed solutions (migrating both threads and data) outperform work-stealing based balancing strategies and next-touch-based data distribution policies. These techniques provide insights about additional optimizations.

...read moreread less

Proceedings Article•DOI•

Combined Iterative and Model-driven Optimization in an Automatic Parallelization Framework

[...]

Louis-Noël Pouchet¹, Uday Bondhugula², Cédric Bastoul³, Albert Cohen⁴, J. Ramanujam⁵, P. Sadayappan¹ - Show less +2 more•Institutions (5)

Ohio State University¹, IBM², University of Paris-Sud³, French Institute for Research in Computer Science and Automation⁴, Louisiana State University⁵

13 Nov 2010

TL;DR: This work has developed a completely automatic framework in which it focuses the empirical search on the set of valid possibilities to perform fusion/code motion, and rely on model-based mechanisms to perform tiling, vectorization and parallelization on the transformed program.

...read moreread less

Abstract: Today's multi-core era places significant demands on an optimizing compiler, which must parallelize programs, exploit memory hierarchy, and leverage the ever-increasing SIMD capabilities of modern processors. Existing model-based heuristics for performance optimization used in compilers are limited in their ability to identify profitable parallelism/locality trade-offs and usually lead to sub-optimal performance. To address this problem, we distinguish optimizations for which effective model-based heuristics and profitability estimates exist, from optimizations that require empirical search to achieve good performance in a portable fashion. We have developed a completely automatic framework in which we focus the empirical search on the set of valid possibilities to perform fusion/code motion, and rely on model-based mechanisms to perform tiling, vectorization and parallelization on the transformed program. We demonstrate the effectiveness of this approach in terms of strong performance improvements on a single target as well as performance portability across different target architectures.

...read moreread less

Reference Book•DOI•

Scientific Computing with Multicore and Accelerators

[...]

Jakub Kurzak¹, Jakub Kurzak², David A. Bader², Jack Dongarra²•Institutions (2)

National Center for Supercomputing Applications¹, University of Illinois at Urbana–Champaign²

07 Dec 2010

TL;DR: Scientific Computing with Multicore and Accelerators focuses on the architectural design and implementation of multicore and manycore processors and accelerators, including graphics processing units (GPUs) and the Sony Toshiba IBM Cell Broadband Engine (BE) currently used in the Sony PlayStation 3.

...read moreread less

Abstract: The hybrid/heterogeneous nature of future microprocessors and large high-performance computing systems will result in a reliance on two major types of components: multicore/manycore central processing units and special purpose hardware/massively parallel accelerators. While these technologies have numerous benefits, they also pose substantial performance challenges for developers, including scalability, software tuning, and programming issues. Researchers at the Forefront Reveal Results from Their Own State-of-the-Art WorkEdited by some of the top researchers in the field and with contributions from a variety of international experts, Scientific Computing with Multicore and Accelerators focuses on the architectural design and implementation of multicore and manycore processors and accelerators, including graphics processing units (GPUs) and the Sony Toshiba IBM (STI) Cell Broadband Engine (BE) currently used in the Sony PlayStation 3. The book explains how numerical libraries, such as LAPACK, help solve computational science problems; explores the emerging area of hardware-oriented numerics; and presents the design of a fast Fourier transform (FFT) and a parallel list ranking algorithm for the Cell BE. It covers stencil computations, auto-tuning, optimizations of a computational kernel, sequence alignment and homology, and pairwise computations. The book also evaluates the portability of drug design applications to the Cell BE and illustrates how to successfully exploit the computational capabilities of GPUs for scientific applications. It concludes with chapters on dataflow frameworks, the Charm++ programming model, scan algorithms, and a portable intracore communication framework. Explores the New Computational Landscape of Hybrid Processors By offering insight into the process of constructing and effectively using the technology, this volume provides a thorough and practical introduction to the area of hybrid computing. It discusses introductory concepts and simple examples of parallel computing, logical and performance debugging for parallel computing, and advanced topics and issues related to the use and building of many applications.

...read moreread less

Proceedings Article•

An experimental study on performance portability of OpenCL kernels

[...]

Sean Rul¹, Hans Vandierendonck, Joris D'Haene¹, Koen De Bosschere¹•Institutions (1)

Ghent University¹

01 Jan 2010

TL;DR: This paper investigates the specificity of code optimizations to accelerator architecture and the severity of lack of performance portability, and achieves functional protability, allowing to reduce the development time of kernels.

...read moreread less

Abstract: Accelerator processors allow energy-efficient computation at high performance, especially for computationintensive applications. There exists a plethora of different accelerator architectures, such as GPUs and the Cell Broadband Engine. Each accelerator has its own programming language, but the recently introduced OpenCL language unifies accelerator programming languages. Hereby, OpenCL achieves functional protability, allowing to reduce the development time of kernels. Functional portability however has limited value without performance portability: the possibility to re-use optimized kernels with good performance. This paper investigates the specificity of code optimizations to accelerator architecture and the severity of lack of performance portability.

...read moreread less

Proceedings Article•DOI•

Lazy binary-splitting: a run-time adaptive work-stealing scheduler

[...]

Alexandros Tzannes¹, George C. Caragea¹, Rajeev Barua¹, Uzi Vishkin¹•Institutions (1)

University of Maryland, College Park¹

09 Jan 2010

TL;DR: Lazy Binary Splitting is presented, a user-level scheduler of nested parallelism for shared-memory multiprocessors that builds on existing Eager binary Splitting work-stealing, but improves performance and ease-of-programming.

...read moreread less

Abstract: We present Lazy Binary Splitting (LBS), a user-level scheduler of nested parallelism for shared-memory multiprocessors that builds on existing Eager Binary Splitting work-stealing (EBS) implemented in Intel's Threading Building Blocks (TBB), but improves performance and ease-of-programming. In its simplest form (SP), EBS requires manual tuning by repeatedly running the application under carefully controlled conditions to determine a stop-splitting-threshold (sst)for every do-all loop in the code. This threshold limits the parallelism and prevents excessive overheads for fine-grain parallelism. Besides being tedious, this tuning also over-fits the code to some particular dataset, platform and calling context of the do-all loop, resulting in poor performance portability for the code. LBS overcomes both the performance portability and ease-of-programming pitfalls of a manually fixed threshold by adapting dynamically to run-time conditions without requiring tuning.We compare LBS to Auto-Partitioner (AP), the latest default scheduler of TBB, which does not require manual tuning either but lacks context portability, and outperform it by 38.9% using TBB's default AP configuration, and by 16.2% after we tuned AP to our experimental platform. We also compare LBS to SP by manually finding SP's sst using a training dataset and then running both on a different execution dataset. LBS outperforms SP by 19.5% on average. while allowing for improved performance portability without requiring tedious manual tuning. LBS also outperforms SP with sst=1, its default value when undefined, by 56.7%, and serializing work-stealing (SWS), another work-stealer by 54.7%. Finally, compared to serializing inner parallelism (SI) which has been used by OpenMP, LBS is 54.2% faster.

...read moreread less

Book Chapter•DOI•

Architecturing a sky computing platform

[...]

Dana Petcu, Ciprian Craciun, Marian Neagul, Silviu Panica, Beniamino Di Martino¹, Salvatore Venticinque¹, Massimiliano Rak¹, Rocco Aversa¹ - Show less +4 more•Institutions (1)

Seconda Università degli Studi di Napoli¹

13 Dec 2010

TL;DR: The need for an open-source Cloud application programming interface and a platform targeted for developing multi-Cloud oriented applications and the approach that is proposed for a platform that allows the deployment of component-based applications in Cloud environments taking into account multiple Cloud provider offers is described.

...read moreread less

Abstract: Current Cloud computing solutions force people to be stranded into locked, proprietary systems. In order to overcome this limitation several efforts of the research community are addressing issues such as common programming models, open standard interfaces, adequate service level agreements or portability of applications. In this context, we argue about the need for an open-source Cloud application programming interface and a platform targeted for developing multi-Cloud oriented applications. This paper describes the approach that we propose for a platform that allows the deployment of component-based applications in Cloud environments taking into account multiple Cloud provider offers.

...read moreread less

Journal Article•DOI•

Exploring e-Learning Knowledge Through Ontological Memetic Agents

[...]

Giovanni Acampora¹, Vincenzo Loia¹, Matteo Gaeta¹•Institutions (1)

University of Salerno¹

01 May 2010-IEEE Computational Intelligence Magazine

TL;DR: A novel multi-agent e- learning system empowered with (ontological) knowledge representation and memetic computing to efficiently manage complex and unstructured information that characterize e-Learning is proposed.

...read moreread less

Abstract: E-Learning systems have proven to be fundamental in several areas of tertiary education and in business companies. There are many significant advantages for people who learn online such as convenience, portability, flexibility and costs. However, the remarkable velocity and volatility of modern knowledge due to the exponential growth of the World Wide Web, requires novel learning methods that offer additional features such as information structuring, efficiency, task relevance and personalization. This paper proposes a novel multi-agent e-Learning system empowered with (ontological) knowledge representation and memetic computing to efficiently manage complex and unstructured information that characterize e-Learning. In particular, differing from other similar approaches, our proposal uses (1) ontologies to provide a suitable method for modeling knowledge about learning content and activities, and (2) memetic agents as intelligent explorers in order to create ?in time? and personalized e-Learning experiences that satisfy learners' specific preferences. The proposed method has been tested by realizing a multi-agent software plug-in for an industrial e-Learning platform with experimentations to validate our memetic proposal in terms of flexibility, efficiency and interoperability.

...read moreread less

Proceedings Article•DOI•

VMKit: a substrate for managed runtime environments

[...]

Nicolas Geoffray¹, Gaël Thomas¹, Julia Lawall², Gilles Muller¹, Bertil Folliot¹ - Show less +1 more•Institutions (2)

Pierre-and-Marie-Curie University¹, University of Copenhagen²

17 Mar 2010

TL;DR: VMKit is described and evaluates, a first attempt to build a common substrate that eases the development of high-level MREs, and has performance comparable to the well established open source M REs Cacao, Apache Harmony and Mono.

...read moreread less

Abstract: Managed Runtime Environments (MREs), such as the JVM and the CLI, form an attractive environment for program execution, by providing portability and safety, via the use of a bytecode language and automatic memory management, as well as good performance, via just-in-time (JIT) compilation. Nevertheless, developing a fully featured MRE, including e.g. a garbage collector and JIT compiler, is a herculean task. As a result, new languages cannot easily take advantage of the benefits of MREs, and it is difficult to experiment with extensions of existing MRE based languages.This paper describes and evaluates VMKit, a first attempt to build a common substrate that eases the development of high-level MREs. We have successfully used VMKit to build two MREs: a Java Virtual Machine and a Common Language Runtime. We provide an extensive study of the lessons learned in developing this infrastructure, and assess the ease of implementing new MREs or MRE extensions and the resulting performance. In particular, it took one of the authors only one month to develop a Common Language Runtime using VMKit. VMKit furthermore has performance comparableto the well established open source MREs Cacao, Apache Harmony and Mono, and is 1.2 to 3 times slower than JikesRVM on most of the Dacapo benchmarks.

...read moreread less

Journal Issue•DOI•

CPPC: a compiler-assisted tool for portable checkpointing of message-passing applications

[...]

Gabriel Rodríguez¹, María Martín¹, Patricia González¹, Juan Touriño¹, Ramón Doallo¹ - Show less +1 more•Institutions (1)

University of A Coruña¹

01 Apr 2010-Concurrency and Computation: Practice and Experience

TL;DR: This paper covers both the operation of the CPPC library and its compiler support, and experimental results using benchmarks and large-scale real applications are included, demonstrating usability, efficiency, and portability.

...read moreread less

Abstract: With the evolution of high-performance computing toward heterogeneous, massively parallel systems, parallel applications have developed new checkpoint and restart necessities. Whether due to a failure in the execution or to a migration of the application processes to different machines, checkpointing tools must be able to operate in heterogeneous environments. However, some of the data manipulated by a parallel application are not truly portable. Examples of these include opaque state (e.g. data structures for communications support) or diversity of interfaces for a single feature (e.g. communications, I-O). Directly manipulating the underlying ad hoc representations renders checkpointing tools unable to work on different environments. Portable checkpointers usually work around portability issues at the cost of transparency: the user must provide information such as what data need to be stored, where to store them, or where to checkpoint. CPPC (ComPiler for Portable Checkpointing) is a checkpointing tool designed to feature both portability and transparency. It is made up of a library and a compiler. The CPPC library contains routines for variable level checkpointing, using portable code and protocols. The CPPC compiler helps to achieve transparency by relieving the user from time-consuming tasks, such as data flow and communications analyses and adding instrumentation code. This paper covers both the operation of the CPPC library and its compiler support. Experimental results using benchmarks and large-scale real applications are included, demonstrating usability, efficiency, and portability. Copyright © 2009 John Wiley & Sons, Ltd.

...read moreread less

Proceedings Article•DOI•

Mobile interaction with static and dynamic NFC-based displays

[...]

Robert Hardy¹, Enrico Rukzio¹, Paul Holleis², Matthias Wagner²•Institutions (2)

Lancaster University¹, NTT DoCoMo²

07 Sep 2010

TL;DR: A development framework, two prototypes, and a comparative study in the area of multi-tag Near-Field Communication (NFC) interaction indicate that all participants preferred the dynamic display, although the static display has advantages, e.g. with respect to privacy and portability.

...read moreread less

Abstract: This paper reports on a development framework, two prototypes, and a comparative study in the area of multi-tag Near-Field Communication (NFC) interaction. By combining NFC with static and dynamic displays, such as posters and projections, services are made more visible and allow users to interact with them easily by interacting directly with the display with their phone. In this paper, we explore such interactions, in particular, the combination of the phone display and large NFC displays. We also compare static displays and dynamic displays, and present a list of deciding factors for a particular deployment situation. We discuss one prototype for each display type and developed a corresponding framework which can be used to accelerate the development of such prototypes whilst supporting a high level of versatility. The findings of a controlled comparative study indicate, among other things, that all participants preferred the dynamic display, although the static display has advantages, e.g. with respect to privacy and portability.

...read moreread less

Proceedings Article•DOI•

Processor virtualization and split compilation for heterogeneous multicore embedded systems

[...]

Albert Cohen¹, Erven Rohou¹•Institutions (1)

French Institute for Research in Computer Science and Automation¹

13 Jun 2010

TL;DR: This work combines instruction-set virtualization with just-in-time compilation, compiling C, C++ and managed languages to a target-independent intermediate language, maximizing the information flow between compilation steps in a split optimization process.

...read moreread less

Abstract: Embedded multiprocessors have always been heterogeneous, driven by the power-efficiency and compute-density of hardware specialization. We aim to achieve portability and sustained performance of complete applications, leveraging diverse programmable cores. We combine instruction-set virtualization with just-in-time compilation, compiling C, C++ and managed languages to a target-independent intermediate language, maximizing the information flow between compilation steps in a split optimization process.

...read moreread less

Patent•

Detachable computer with variable performance computing environment

[...]

Feng-Hsiung Hsu¹, Xiong-Fei Cai¹, Rui Gao¹, Chunhui Zhang¹•Institutions (1)

Microsoft¹

24 Jun 2010

TL;DR: In this paper, the authors present a configuration of a computing device featuring a display unit with a resource-conserving processor that may be used independently (e.g., as a tablet), but may be connected to a base unit featuring a resource intensive processor.

...read moreread less

Abstract: Computing devices are often designed in view of a particular usage scenario, but may be unsuitable for usage in other computing scenarios. For example, a notebook computer with a large display, an integrated keyboard, and a high-performance processor suitable for many computing tasks may be heavy, large, and power-inefficient; and a tablet lacking a keyboard and incorporating a low-powered processor may improve portability but may present inadequate performance for many tasks. Presented herein is a configuration of a computing device featuring a display unit with a resource-conserving processor that may be used independently (e.g., as a tablet), but that may be connected to a base unit featuring a resource-intensive processor. The operating system of the device may accordingly transition between a resource-intensive computing environment and a resource-conserving computing environment based on the connection with the base unit, thereby satisfying the dual roles of workstation and portable tablet device.

...read moreread less

Journal Article•DOI•

Self-Consistent MPI Performance Guidelines

[...]

Jesper Larsson Träff¹, William Gropp², Rajeev Thakur³•Institutions (3)

University of Vienna¹, University of Illinois at Urbana–Champaign², Argonne National Laboratory³

01 May 2010-IEEE Transactions on Parallel and Distributed Systems

TL;DR: This work introduces and semiformalize the concept of self-consistent performance guidelines for MPI, and provides a (nonexhaustive) set of such guidelines in a form that could be automatically verified by benchmarks and experiment management tools.

...read moreread less

Abstract: Message passing using the Message-Passing Interface (MPI) is at present the most widely adopted framework for programming parallel applications for distributed memory and clustered parallel systems. For reasons of (universal) implementability, the MPI standard does not state any specific performance guarantees, but users expect MPI implementations to deliver good and consistent performance in the sense of efficient utilization of the underlying parallel (communication) system. For performance portability reasons, users also naturally desire communication optimizations performed on one parallel platform with one MPI implementation to be preserved when switching to another MPI implementation on another platform. We address the problem of ensuring performance consistency and portability by formulating performance guidelines and conditions that are desirable for good MPI implementations to fulfill. Instead of prescribing a specific performance model (which may be realistic on some systems, under some MPI protocol and algorithm assumptions, etc.), we formulate these guidelines by relating the performance of various aspects of the semantically strongly interrelated MPI standard to each other. Common-sense expectations, for instance, suggest that no MPI function should perform worse than a combination of other MPI functions that implement the same functionality, no specialized function should perform worse than a more general function that can implement the same functionality, no function with weak semantic guarantees should perform worse than a similar function with stronger semantics, and so on. Such guidelines may enable implementers to provide higher quality MPI implementations, minimize performance surprises, and eliminate the need for users to make special, nonportable optimizations by hand. We introduce and semiformalize the concept of self-consistent performance guidelines for MPI, and provide a (nonexhaustive) set of such guidelines in a form that could be automatically verified by benchmarks and experiment management tools. We present experimental results that show cases where guidelines are not satisfied in common MPI implementations, thereby indicating room for improvement in today's MPI implementations.

...read moreread less

Proceedings Article•DOI•

SIRC: An Extensible Reconfigurable Computing Communication API

[...]

Ken Eguro¹•Institutions (1)

Microsoft¹

02 May 2010

TL;DR: The Simple Interface for Reconfigurable Computing (SIRC) project provides a straightforward, portable and extensible open-source communication and synchronization API that allows applications built for existing systems to migrate to different platforms without significant modification to user code.

...read moreread less

Abstract: Reconfigurable computing applications often need to divide computation between software running on a conventional desktop processor and hardware mapped to an FPGA. However, the reconfigurable computing development platforms available today either do not provide a sufficient mechanism for the communication and synchronization that is needed or else employ a complex & proprietary API specific to a given toolflow or device, limiting code portability. The Simple Interface for Reconfigurable Computing (SIRC) project provides a straightforward, portable and extensible open-source communication and synchronization API. It consists of both a software-side interface and a hardware-side interface that allows C++ code running on a host PC to communicate and synchronize with a Verilog-based circuit mapped to a FPGA. One key feature of this API is that both the hardware and software user interfaces can remain consistent across all platforms and future releases. This allows applications built for existing systems to migrate to different platforms without significant modification to user code.

...read moreread less

Proceedings Article•

Investigating multiple approaches for SLU portability to a new language

[...]

Bassam Jabaian¹, Laurent Besacier, Fabrice Lefèvre¹•Institutions (1)

University of Avignon¹

01 Jan 2010

TL;DR: The first experimental results show the efficiency of the proposed portability methods in general for a fast and low-cost SLU porting from French to Italian and the best performance are obtained by using translation only at the test level.

...read moreread less

Abstract: The challenge with language portability of a spoken language understanding module is to be able to reuse the knowledge and the data available in a source language to produce knowledge in the target language. In this paper several approaches are proposed, motivated by the availability of the MEDIA French dialogue corpus and its manual translation into Italian. The three portability methods investigated are based on statistical machine translation or automatic word alignment techniques and differ in the level of the system development at which the translation is performed. The first experimental results show the efficiency of the proposed portability methods in general for a fast and low-cost SLU porting from French to Italian and the best performance are obtained by using translation only at the test level.

...read moreread less

Journal Article•DOI•

MPI as a Programming Model for High-Performance Reconfigurable Computers

[...]

Manuel Saldana, Arun Patel, Christopher A. Madill¹, Daniel Nunes¹, Danyao Wang¹, Paul Chow¹, Ralph D. Wittig², Henry E. Styles², Andrew Putnam² - Show less +5 more•Institutions (2)

University of Toronto¹, Xilinx²

01 Nov 2010-ACM Transactions on Reconfigurable Technology and Systems

TL;DR: TMD-MPI is shown to address current design challenges in HPRC usage, suggesting that the MPI standard has enough syntax and semantics to program these new types of parallel architectures.

...read moreread less

Abstract: High-Performance Reconfigurable Computers (HPRCs) consist of one or more standard microprocessors tightly-coupled with one or more reconfigurable FPGAs. HPRCs have been shown to provide good speedups and good cost/performance ratios, but not necessarily ease of use, leading to a slow acceptance of this technology. HPRCs introduce new design challenges, such as the lack of portability across platforms, incompatibilities with legacy code, users reluctant to change their code base, a prolonged learning curve, and the need for a system-level Hardware/Software co-design development flow. This article presents the evolution and current work on TMD-MPI, which started as an MPI-based programming model for Multiprocessor Systems-on-Chip implemented in FPGAs, and has now evolved to include multiple X86 processors. TMD-MPI is shown to address current design challenges in HPRC usage, suggesting that the MPI standard has enough syntax and semantics to program these new types of parallel architectures. Also presented is the TMD-MPI Ecosystem, which consists of research projects and tools that are developed around TMD-MPI to further improve HPRC usability. Finally, we present preliminary communication performance measurements.

...read moreread less

Collapse