
Showing papers in "ACM SIGARCH Computer Architecture News in 2011"


Journal ArticleDOI
TL;DR: The high level of collaboration on the gem5 project, combined with the previous success of the component parts and a liberal BSD-like license, make gem5 a valuable full-system simulation tool.
Abstract: The gem5 simulation infrastructure is the merger of the best aspects of the M5 [4] and GEMS [9] simulators. M5 provides a highly configurable simulation framework, multiple ISAs, and diverse CPU models. GEMS complements these features with a detailed and flexible memory system, including support for multiple cache coherence protocols and interconnect models. Currently, gem5 supports most commercial ISAs (ARM, ALPHA, MIPS, Power, SPARC, and x86), including booting Linux on three of them (ARM, ALPHA, and x86). The project is the result of the combined efforts of many academic and industrial institutions, including AMD, ARM, HP, MIPS, Princeton, MIT, and the Universities of Michigan, Texas, and Wisconsin. Over the past ten years, M5 and GEMS have been used in hundreds of publications and have been downloaded tens of thousands of times. The high level of collaboration on the gem5 project, combined with the previous success of the component parts and a liberal BSD-like license, make gem5 a valuable full-system simulation tool.

4,039 citations


Journal ArticleDOI
TL;DR: This work studies the automated generation of full or truncated multipliers using the embedded multipliers and adders present in the DSP blocks of current FPGAs, and addresses arbitrary precisions, including single and double precision as well as the quadruple precision introduced by the IEEE-754-2008 standard and currently unsupported by processor hardware.
Abstract: The implementation of high-precision floating-point applications on reconfigurable hardware requires large multipliers. Full multipliers are the core of floating-point multipliers. Truncated multipliers, trading resources for a well-controlled accuracy degradation, are useful building blocks in situations where a full multiplier is not needed. This work studies the automated generation of such multipliers using the embedded multipliers and adders present in the DSP blocks of current FPGAs. The optimization of such multipliers is expressed as a tiling problem, where a tile represents a hardware multiplier, and super-tiles represent combinations of several hardware multipliers and adders, making efficient use of the DSP internal resources. This tiling technique is shown to adapt to full or truncated multipliers. It addresses arbitrary precisions, including single and double precision as well as the quadruple precision introduced by the IEEE-754-2008 standard, which is currently unsupported by processor hardware. An open-source implementation is provided in the FloPoCo project.

68 citations
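
For readers unfamiliar with the tiling formulation, the idea can be illustrated in plain software: a wide multiplication is decomposed into partial products that each fit a fixed-size embedded multiplier, then shifted and summed. The Python sketch below assumes 17x17-bit tiles as a stand-in for DSP-block multipliers; it illustrates the concept only and is not FloPoCo code.

    # Illustrative sketch: decompose a wide unsigned multiplication into
    # fixed-size "tiles", mimicking how DSP-block multipliers are combined.
    # The 17-bit tile width is an assumption, not FloPoCo's actual tiling.
    import random

    TILE = 17  # bits per hardware multiplier input (assumed)

    def split(x, width, tile=TILE):
        """Split x into (shift, limb) pairs of tile-sized limbs covering 'width' bits."""
        limbs = []
        for shift in range(0, width, tile):
            limbs.append((shift, (x >> shift) & ((1 << tile) - 1)))
        return limbs

    def tiled_multiply(a, b, width):
        """Full product of two 'width'-bit numbers built from tile-sized partial products."""
        acc = 0
        for sa, la in split(a, width):
            for sb, lb in split(b, width):
                acc += (la * lb) << (sa + sb)   # one DSP-sized multiply per tile pair
        return acc

    if __name__ == "__main__":
        w = 53  # significand width of an IEEE double
        for _ in range(1000):
            a, b = random.getrandbits(w), random.getrandbits(w)
            assert tiled_multiply(a, b, w) == a * b
        print("tiled multiplication matches the exact product")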


Journal ArticleDOI
TL;DR: The MaxCompiler programming system is described which allows software engineers to create dataflow engines optimized for their particular applications, and an example application that has been accelerated using this methodology is discussed.
Abstract: Over the past decade x86 processors have come to dominate the world's largest supercomputers. However, in the future, conventional multi-core processors are unlikely to be able to deliver the necessary performance per $ and per W to achieve exascale performance. Heterogeneous computing is emerging as a powerful alternative to conventional multi-core designs to help address these challenges. In this paper we describe our approach to Maximum Performance Computing - building application-specific computers which complement conventional x86 processors with high performance dataflow engines implemented on FPGA to provide 10-100x improvements in performance and performance/W. We describe the MaxCompiler programming system which allows software engineers to create dataflow engines optimized for their particular applications, and discuss an example application that has been accelerated using this methodology.

48 citations


Journal ArticleDOI
TL;DR: An FPGA-accelerated Asian option pricing solution, using a highly-optimised parallel Monte-Carlo architecture is proposed, and the proposed pipelined design is described parametrically, facilitating its re-use for different technologies.
Abstract: Arithmetic Asian options are financial derivatives which have the feature of path-dependency: they depend on the entire price path of the underlying asset, rather than just the instantaneous price. This path-dependency makes them difficult to price, as only computationally intensive Monte-Carlo methods can provide accurate prices. This paper proposes an FPGA-accelerated Asian option pricing solution, using a highly-optimised parallel Monte-Carlo architecture. The proposed pipelined design is described parametrically, facilitating its re-use for different technologies. An implementation of this architecture in a Virtex-5 xc5vlx330t FPGA at 200MHz is 313 times faster than a multi-threaded software implementation running on an Intel Xeon E5420 quad-core CPU at 2.5GHz; it is also 2.2 times faster than a Tesla C1060 GPU at 1.3 GHz.

31 citations
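
As background on the pricing problem, the following minimal Python sketch prices an arithmetic Asian call by Monte-Carlo simulation of geometric Brownian motion paths. The parameters and path counts are arbitrary illustrative choices; the code does not reflect the paper's pipelined FPGA architecture.

    # Minimal Monte-Carlo pricer for an arithmetic Asian call option.
    # Illustrative only; parameters and path counts are arbitrary choices.
    import math, random

    def asian_call_price(s0, strike, rate, vol, maturity, steps, paths, seed=1):
        random.seed(seed)
        dt = maturity / steps
        drift = (rate - 0.5 * vol * vol) * dt
        diffusion = vol * math.sqrt(dt)
        payoff_sum = 0.0
        for _ in range(paths):
            s, path_sum = s0, 0.0
            for _ in range(steps):
                s *= math.exp(drift + diffusion * random.gauss(0.0, 1.0))
                path_sum += s
            average = path_sum / steps            # arithmetic average -> path dependency
            payoff_sum += max(average - strike, 0.0)
        return math.exp(-rate * maturity) * payoff_sum / paths

    if __name__ == "__main__":
        print(asian_call_price(s0=100, strike=100, rate=0.05, vol=0.2,
                               maturity=1.0, steps=252, paths=20000))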


Journal ArticleDOI
TL;DR: This research presents a novel and scalable approach called "Smart Guess" that automates the very labor-intensive, time-consuming, and expensive process of manually cataloging individual components of a software system.
Abstract: Diagnosing software failures in the field is notoriously difficult, in part due to the fundamental complexity of trouble-shooting any complex software system, but further exacerbated by the paucity...

29 citations


Journal ArticleDOI
TL;DR: The approach shows that, for N-body computation, the fastest design, which involves 2 CPU cores, 10 FPGA cores and 40960 GPU threads, is 2 times faster than a design with only FPGAs while achieving better overall energy efficiency.
Abstract: Processing speed and energy efficiency are two of the most critical issues for computer systems. This paper presents a systematic approach for profiling the power and performance characteristics of applications targeting heterogeneous multi-core computing platforms. Our approach enables rapid and automated design space exploration involving optimisation of workload distribution for systems with accelerators such as FPGAs and GPUs. We demonstrate that, with minor modification to the design, it is possible to estimate the performance and power-efficiency trade-off to identify an optimised workload distribution. Our approach shows that, for N-body computation, the fastest design, which involves 2 CPU cores, 10 FPGA cores and 40960 GPU threads, is 2 times faster than a design with only FPGAs while achieving better overall energy efficiency.

26 citations
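
The kind of design-space exploration described above can be mimicked with a simple analytical model: given per-device throughput and power figures, enumerate workload splits and keep the fastest (or most energy-efficient) one. In the sketch below, all throughput and power numbers are invented placeholders, not the paper's measurements.

    # Toy design-space exploration for splitting N-body work across devices.
    # Throughput (interactions/s) and power (W) figures are made-up placeholders.
    DEVICES = {
        "cpu":  {"throughput": 1.0e9, "power": 80.0},
        "fpga": {"throughput": 4.0e9, "power": 25.0},
        "gpu":  {"throughput": 6.0e9, "power": 150.0},
    }

    def evaluate(split, work=1.0e12):
        """split maps device -> fraction of work; returns (time, energy)."""
        # Devices run in parallel, so the slowest device sets the runtime.
        time = max(work * frac / DEVICES[d]["throughput"]
                   for d, frac in split.items() if frac > 0)
        # Devices given no work are assumed idle and consume nothing.
        energy = sum(DEVICES[d]["power"] * time
                     for d, frac in split.items() if frac > 0)
        return time, energy

    def explore(step=0.05):
        best = None
        n = int(round(1 / step))
        for i in range(n + 1):
            for j in range(n + 1 - i):
                split = {"cpu": i * step, "fpga": j * step, "gpu": 1 - (i + j) * step}
                time, energy = evaluate(split)
                if best is None or time < best[0]:
                    best = (time, energy, split)
        return best

    if __name__ == "__main__":
        time, energy, split = explore()
        print(f"fastest split {split}: {time:.3f} s, {energy:.1f} J")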


Journal ArticleDOI
TL;DR: This paper proposes improvements to a recent analysis of the SCAN policy and carries out an empirical investigation of SATF performance to derive a relationship between the queue length and mean service time; both methods outperform FCFS scheduling.
Abstract: Performance of many important computer applications depends on the performance of Hard Disk Drives (HDDs). Disk capacities and transfer rates have been increasing rapidly, but the improvement in disk access time is disappointingly slow. Caching and prefetching are two methods to alleviate this delay, which is 6-7 orders of magnitude longer than the processor cycle time. Disk scheduling is desirable when the data is not cached and a disk access is required. This paper is concerned with the analysis of two disk arm scheduling methods: SCAN and SATF (shortest access time first). SATF outperforms SCAN, while both methods outperform FCFS scheduling. We propose improvements to a recent analysis of the SCAN policy and carry out an empirical investigation of SATF performance to derive a relationship between the queue length and mean service time. A review of variations of SCAN and SATF is provided, since they have been utilized in conjunction with multilevel disk scheduling methods. We also discuss recent developments to improve the performance of high capacity HDDs, which allow multiple tracks to be accessed without incurring seeks.

24 citations
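
As a rough intuition for why access-time-aware policies win, the toy simulation below services the same request queue under FCFS and under a greedy shortest-access-time-first policy, using a simplified access-time model (linear seek plus rotational latency). The model and its constants are illustrative assumptions, not those of the paper's analysis.

    # Toy comparison of FCFS vs. a greedy SATF-style policy under a crude
    # access-time model (linear seek plus rotational latency).  All constants
    # are illustrative assumptions, not the paper's parameters.
    import random

    TRACKS = 10000
    SEEK_PER_TRACK = 0.001    # ms of seek per track moved (assumed)
    ROTATION = 8.33           # ms per revolution (7200 RPM)

    def access_time(cur_track, req):
        seek = abs(req["track"] - cur_track) * SEEK_PER_TRACK
        phase_after_seek = (seek / ROTATION) % 1.0
        rot = ((req["angle"] - phase_after_seek) % 1.0) * ROTATION
        return seek + rot

    def mean_service_time(requests, pick_next):
        pending, cur, total = list(requests), TRACKS // 2, 0.0
        n = len(pending)
        while pending:
            nxt = pick_next(cur, pending)
            total += access_time(cur, nxt)
            cur = nxt["track"]
            pending.remove(nxt)
        return total / n

    def fcfs(cur, pending):
        return pending[0]                   # serve in arrival order

    def satf(cur, pending):
        return min(pending, key=lambda r: access_time(cur, r))  # greedy SATF

    if __name__ == "__main__":
        random.seed(0)
        queue = [{"track": random.randrange(TRACKS), "angle": random.random()}
                 for _ in range(64)]
        print("FCFS mean service time (ms):", mean_service_time(queue, fcfs))
        print("SATF mean service time (ms):", mean_service_time(queue, satf))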


Journal ArticleDOI
TL;DR: This work presents a taxonomy and modular implementation approach for data-parallel accelerators, including the MIMD, vector-SIMD, subword- SIMD, SIMT, and vector-thread (VT) architectural design patterns.
Abstract: We present a taxonomy and modular implementation approach for data-parallel accelerators, including the MIMD, vector-SIMD, subword-SIMD, SIMT, and vector-thread (VT) architectural design patterns. ...

23 citations


Journal ArticleDOI
TL;DR: In this paper, a prototype of a source-to-source compiler automating the fusion phase is presented, and the impact of fusions generated by the compiler, as well as the compiler's efficiency, is experimentally evaluated.
Abstract: When implementing a function mapping on a contemporary GPU, several contradictory performance factors affecting the distribution of computation into GPU kernels have to be balanced. A decomposition-fusion scheme suggests decomposing the computational problem into several simple functions implemented as standalone kernels, and later fusing some of these functions into more complex kernels to improve memory locality. In this paper, a prototype of a source-to-source compiler automating the fusion phase is presented, and the impact of fusions generated by the compiler, as well as the compiler's efficiency, is experimentally evaluated.

23 citations
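
The decomposition-fusion trade-off can be sketched without a GPU: two simple functions can run as separate "kernels" that exchange data through an intermediate array, or be fused into one loop that keeps the intermediate value local. The plain-Python example below illustrates the idea only; it is not output of the compiler described above.

    # Plain-Python illustration of decomposition vs. fusion.  On a GPU the fused
    # version avoids writing the intermediate array to global memory, which is
    # the memory-locality benefit that kernel fusion is after.

    def scale_kernel(x, alpha):
        return [alpha * v for v in x]          # "kernel" 1: writes an intermediate array

    def add_kernel(y, z):
        return [a + b for a, b in zip(y, z)]   # "kernel" 2: reads the intermediate back

    def decomposed_axpy(alpha, x, z):
        tmp = scale_kernel(x, alpha)           # round-trip through memory
        return add_kernel(tmp, z)

    def fused_axpy(alpha, x, z):
        # Fused kernel: the scaled value stays in a register-like local variable.
        return [alpha * xi + zi for xi, zi in zip(x, z)]

    if __name__ == "__main__":
        x, z = [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]
        assert decomposed_axpy(2.0, x, z) == fused_axpy(2.0, x, z)
        print(fused_axpy(2.0, x, z))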


Journal ArticleDOI
TL;DR: In this article, the authors consider optimizing the performance of a multi-core microprocessor within a power budget, noting that most existing solutions are centralized and cannot scale well.
Abstract: Optimizing the performance of a multi-core microprocessor within a power budget has recently received a lot of attention. However, most existing solutions are centralized and cannot scale well with...

21 citations


Journal ArticleDOI
TL;DR: The deep and complicated pipeline structure generated from the MUSCL dataflow is divided and optimized across two FPGA boards by using a tuning tool called RER, and about 60% utilization of the pipeline is achieved even when using serial links between the two boards.
Abstract: UPACS (Unified Platform for Aerospace Computational Simulation) is a practical CFD (Computational Fluid Dynamics) package offering a wide range of selectable features. A custom machine for efficient execution of MUSCL, a core function of UPACS, is implemented on FLOPS-2D (Flexibly Linkable Object for Programmable System), a multi-FPGA reconfigurable system. The deep and complicated pipeline structure generated from the MUSCL dataflow is divided and optimized across two FPGA boards by using a tuning tool called RER. With optimization of the order of operations and the pipeline structure, about 60% utilization of the pipeline is achieved even when using serial links between the two boards. The execution time is 6.16-23.19 times faster than that of the software on a 2.66 GHz Intel Core 2 Duo processor.

Journal ArticleDOI
TL;DR: This paper proposes a framework for automatic transformation of an application at the binary level, with which the user can execute an arbitrary application on the CGRA, describes the overall process, and presents solutions to several problems that arise from such an approach.
Abstract: Coarse-grained reconfigurable architectures (CGRAs) have been well-researched and shown to be particularly effective in the acceleration of data-intensive applications. However, practical difficulties in application mapping have hindered their widespread adoption. Typically, an application must be modified manually or by using special compilers and design tools in order to fully exploit the architecture. This imposes considerable design costs on the application developer and reduces software portability. In this paper, we propose a framework for automatic transformation of an application at the binary level, with which the user can execute an arbitrary application on the CGRA. Our approach analyzes the binary code and determines which portions of the program to accelerate, maps them to the reconfigurable array, then modifies the binary code appropriately to run on the CGRA. We describe the overall process of our framework, and present solutions to several problems that arise from such an approach. Results from our preliminary experiments show that we are able to achieve a speedup of up to 14.8.

Journal ArticleDOI
TL;DR: Inclusive last-level caches (LLCs) waste precious silicon estate due to cross-level replication of cache blocks, a problem that grows as the industry moves toward cache hierarchies with larger inner levels.
Abstract: Inclusive last-level caches (LLCs) waste precious silicon estate due to cross-level replication of cache blocks. As the industry moves toward cache hierarchies with larger inner levels, this wasted...

Journal ArticleDOI
TL;DR: This research presents a meta-modelling architecture suitable for scalable, modular modelling of datacenter power consumption and shows clear trends in how these costs are optimized over time.
Abstract: Datacenter power consumption has a significant impact on both its recurring electricity bill (Op-ex) and one-time construction costs (Cap-ex). Existing work optimizing these costs has relied primar...

Journal ArticleDOI
TL;DR: A programming framework for high performance clusters with various hardware accelerators that has been used to support physics simulation and financial application development and achieves significant performance improvement on a 16-node cluster with FPGA and GPU accelerators.
Abstract: We describe a programming framework for high performance clusters with various hardware accelerators. In this framework, users can utilize the available heterogeneous resources productively and efficiently. The distributed application is highly modularized to support dynamic system configuration with changing types and numbers of accelerators. Multiple layers of communication interface are introduced to reduce the overhead in both control messages and data transfers. Parallelism can be achieved by controlling the accelerators in various schemes through a scheduling extension. The framework has been used to support physics simulation and financial application development. We achieve significant performance improvement on a 16-node cluster with FPGA and GPU accelerators.

Journal ArticleDOI
TL;DR: This paper presents an initial design space exploration of viable compute architectures that might address the drastically different requirements of bedside and portable medical ultrasound imaging systems using adaptive beamforming, together with the design and implementation of a GPU accelerator that provides over 45x performance improvement over the equivalent C implementation on a single CPU.
Abstract: The use of adaptive beamforming is a viable solution to provide high-resolution real-time medical ultrasound imaging. However, the increase in image resolution comes at the expense of a significant increase in compute requirement over conventional algorithms. In a bedside diagnosis setting where plug-in power is available, GPUs are promising accelerators to address the processing demand. However, in the case of point-of-care diagnostics where portable ultrasound imaging devices must be used, alternative power-efficient computer systems must be employed, possibly at the expense of lower image resolution in order to maintain real-time performance. This paper presents an initial design space exploration of viable compute architectures that might address the drastically different requirements of bedside and portable medical ultrasound imaging systems using adaptive beamforming. The design and implementation of a GPU accelerator that provides over 45x performance improvement over the equivalent C implementation on a single CPU is presented. Furthermore, an implementation of the beamforming algorithm on a high-performance mobile platform based on an ARM Cortex A8 mobile processor in combination with the built-in NEON accelerator is also presented. The mobile platform delivers over 270x reduction in power consumption compared to the GPU platform, at the expense of much-reduced performance. The tradeoffs between power, performance and image quality among the target platforms are studied and future research directions in power-efficient architectures for high-performance medical ultrasound systems are presented.
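
To give a sense of what the adaptive (minimum-variance) beamformer computes per output sample, the NumPy sketch below forms Capon/MVDR apodization weights from a sample covariance matrix. This is a generic textbook formulation with assumed channel counts and diagonal loading, not the paper's implementation.

    # Generic minimum-variance (MVDR/Capon) beamforming weights in NumPy.
    # Textbook formulation only; channel count and loading factor are assumptions.
    import numpy as np

    def mvdr_weights(channel_data, steering, loading=1e-2):
        """channel_data: (channels, samples) array of delay-aligned data.
        steering: (channels,) steering vector (all ones after delay alignment)."""
        R = channel_data @ channel_data.conj().T / channel_data.shape[1]
        # Diagonal loading stabilises the inversion of the sample covariance.
        R = R + loading * np.trace(R).real / R.shape[0] * np.eye(R.shape[0])
        Ri_a = np.linalg.solve(R, steering)
        return Ri_a / (steering.conj() @ Ri_a)

    def beamform_sample(channel_sample, weights):
        return weights.conj() @ channel_sample

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        channels, samples = 32, 64
        data = rng.standard_normal((channels, samples))   # stand-in for delayed RF data
        a = np.ones(channels)
        w = mvdr_weights(data, a)
        print("beamformed value:", beamform_sample(data[:, 0], w))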

Journal ArticleDOI
TL;DR: This work proposes a set of CELL-accelerated routines for basic LQCD calculations, and shows a significant speedup compared to a standard processor, for instance 11 times faster than a 2.83 GHz Intel processor.
Abstract: Quantum chromodynamics (QCD) is the theory of subnuclear physics, aiming at modeling the strong nuclear force, which is responsible for the interactions of nuclear particles. Numerical QCD studies are performed through a discrete formalism called LQCD (Lattice Quantum Chromodynamics). Typical simulations involve very large volumes of data and numerically sensitive entities, hence the crucial need for high performance computing systems. We propose a set of CELL-accelerated routines for basic LQCD calculations. Our framework is provided as a unified library and is particularly optimized for iterative use. Each routine is parallelized among the SPUs, and each SPU achieves its task by looping over small chunks of arrays from main memory. Our SPU implementation is vectorized with double precision data, and the cooperation with the PPU shows a good overlap between data transfers and computations. Moreover, we permanently keep the SPU context and use mailboxes to synchronize between consecutive calls. We validate our library by using it to derive a CELL version of an existing LQCD package (tmLQCD). Experimental results on individual routines show a significant speedup compared to a standard processor, for instance 11 times faster than a 2.83 GHz Intel processor (without SSE). This ratio is around 9 (with a QS22 blade) when considering a more cooperative context such as solving a linear system of equations (usually referred to as Wilson-Dirac inversion). Our results clearly demonstrate that the CELL is a very promising platform for large-scale LQCD simulations.

Journal ArticleDOI
TL;DR: In 2001, Chou et al. published a study of faults found by applying a static analyzer to Linux versions 1.0 through 2.4.1, finding that the drivers directory contained up to several times more of certain kinds of faults than the rest of the kernel.
Abstract: In 2001, Chou et al. published a study of faults found by applying a static analyzer to Linux versions 1.0 through 2.4.1. A major result of their work was that the drivers directory contained up to...

Journal ArticleDOI
TL;DR: Recent advances in the neuroscientific understanding of the brain are bringing about a tantalizing opportunity for building synthetic machines that perform computation in ways that differ radically from existing systems.
Abstract: Recent advances in the neuroscientific understanding of the brain are bringing about a tantalizing opportunity for building synthetic machines that perform computation in ways that differ radically...

Journal ArticleDOI
TL;DR: It is revealed that branch predication can enable instruction packing, a VLIW-like GPU feature that is designed to increase the parallel execution of independent instructions, and can also decrease the number of control flow instructions thereby improving the performance of GPU kernels with both single and multiple branch paths.
Abstract: Branch predication is a program transformation technique that combines instructions from multiple branches of an if statement into a straight-line sequence and associates each instruction of the sequence with a predicate. Branch predication improves the execution of branch statements on processors that support predicated execution of instructions, e.g., Intel IA-64, because such a transformation improves instruction scheduling and might help cache performance. This paper proposes a novel software-based branch predication technique for GPUs. The main motivation is that branch instructions can easily become a performance bottleneck for a GPU program because of the cost of branch instructions compared to ALU instructions and the possibility of low ALU utilization due to the separation of ALU instructions within control flow blocks. Due to the SIMD nature and massive multi-threading architecture of the GPU, branching can be costly if more than one path is taken by a set of concurrent threads in a kernel. In this paper we reveal that branch predication can enable instruction packing, a VLIW-like GPU feature that is designed to increase the parallel execution of independent instructions, and can also decrease the number of control flow instructions, thereby improving the performance of GPU kernels with both single and multiple branch paths. The key to our novel branch predication technique is a set of transformation rules that takes into consideration the specifics of the GPU architecture and implements software-based predicated execution of instructions on the GPU with little to no overhead. Furthermore, we identify architectural and program factors that affect the effectiveness of our technique and build a benefit analysis model for the transformation. The implementation of our technique on synthetic benchmarks and a real-world application demonstrates its effectiveness.
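
The transformation itself is easy to show in scalar form: an if/else becomes a straight-line sequence in which both arms are evaluated and a predicate selects the result, mirroring what SIMD lanes effectively do under divergence. The Python snippet below is a language-neutral illustration of the idea, not the paper's GPU transformation rules.

    # Scalar illustration of branch predication: replace control flow with a
    # straight-line sequence guarded by a predicate.  This mirrors what happens
    # on SIMD hardware, but it is not the paper's actual transformation rules.

    def with_branch(x):
        if x >= 0.0:
            y = x * 2.0       # "then" path
        else:
            y = -x + 1.0      # "else" path
        return y

    def predicated(x):
        p = x >= 0.0                        # predicate
        then_val = x * 2.0                  # both arms computed unconditionally
        else_val = -x + 1.0
        return then_val if p else else_val  # select, no divergent control flow

    if __name__ == "__main__":
        for v in (-3.0, 0.0, 2.5):
            assert with_branch(v) == predicated(v)
        print("predicated form matches branching form")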

Journal ArticleDOI
Serban Georgescu, Peter Chow
TL;DR: This work compares the performance that can be achieved using the open-source solver package PETSc run on GPU-enabled Amazon EC2 hardware with that of an optimized legacy FEM code run on a last-generation 12-core blade server, and shows that, although good performance can be achieved, some development is still needed to reach peak performance.
Abstract: After more than five years since GPUs were first used as accelerators for general scientific computations, the field of General Purpose GPU computing, or GPGPU, has finally reached the mainstream. Developers now have access to a mature hardware and software ecosystem. On the software side, several major open-source packages now support GPU acceleration, while on the hardware side cloud-based solutions provide a simple way to access powerful machines with the latest GPUs at low cost. In this context, we look at the GPU acceleration of CAE, with a focus on the matrix solvers. We compare the performance that can be achieved using the open-source solver package PETSc run on GPU-enabled Amazon EC2 hardware with that of an optimized legacy FEM code run on a last-generation 12-core blade server. Our results show that, although good performance can be achieved, some development is still needed to reach peak performance.

Journal ArticleDOI
TL;DR: PowerDial is a system for dynamically adapting application behavior to execute successfully in the face of load and power fluctuations by transforming static configuration parameters into dynamically tunable knobs.
Abstract: We present PowerDial, a system for dynamically adapting application behavior to execute successfully in the face of load and power fluctuations. PowerDial transforms static configuration parameters...

Journal ArticleDOI
TL;DR: This research studies the possibility of using 64-bit polynomials in software and hardware, using the fastest multiple-lookup-table algorithms for generating CRCs, and shows that throughput continues to increase as the number of bits processed at a time is increased.
Abstract: Deployment of jumbo frame sizes beyond 9000 bytes for storage systems is limited by the 32-bit Cyclic Redundancy Checks used by a network protocol. In order to overcome this limitation we study the possibility of using 64-bit polynomials in software and hardware, using the fastest multiple-lookup-table algorithms for generating CRCs. CRC is a sequential process, so software-based solutions are limited in throughput by the speed and architectural improvements of a single CPU. We study the tradeoff between using distributed LUTs and embedded BRAM in hardware implementations. Our results show that the BRAM-based approach is the fastest hardware implementation, reaching a maximum of 347.37 Gbps while processing 1024 bits at a time, which is 606x faster than the software implementation of the same algorithm running on a 3.2 GHz Xeon with 2 MB of L2 cache. The proposed architectures have been implemented on a Xilinx Virtex 6 LX550T prototyping device, requiring less than 1% of the device's resources. Our research shows that throughput will continue to increase as the number of bits processed at a time is increased.
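
The lookup-table approach itself is standard: precompute the CRC of every possible input byte, then process the message one byte (or, with several tables, several bytes) per step. The Python sketch below implements a single-table, byte-at-a-time CRC-64 using the ECMA-182 polynomial as a stand-in; the paper's 64-bit polynomial and multi-table hardware variants are not reproduced here.

    # Single-table, byte-at-a-time CRC-64 (MSB-first, no reflection, zero init).
    # The ECMA-182 polynomial is a stand-in; the paper's polynomial and
    # multi-table/hardware variants are not reproduced here.
    POLY = 0x42F0E1EBA9EA3693   # CRC-64/ECMA-182
    MASK = (1 << 64) - 1

    def make_table():
        table = []
        for byte in range(256):
            crc = byte << 56
            for _ in range(8):
                crc = ((crc << 1) ^ POLY) & MASK if crc & (1 << 63) else (crc << 1) & MASK
            table.append(crc)
        return table

    TABLE = make_table()

    def crc64(data, crc=0):
        # One table lookup per input byte instead of eight bit-serial steps.
        for b in data:
            crc = (TABLE[(crc >> 56) ^ b] ^ (crc << 8)) & MASK
        return crc

    def crc64_bitwise(data, crc=0):
        # Reference bit-serial implementation used to check the table version.
        for b in data:
            crc ^= b << 56
            for _ in range(8):
                crc = ((crc << 1) ^ POLY) & MASK if crc & (1 << 63) else (crc << 1) & MASK
        return crc

    if __name__ == "__main__":
        msg = b"jumbo frame payload"
        assert crc64(msg) == crc64_bitwise(msg)
        print(hex(crc64(msg)))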

Journal ArticleDOI
Fu Binzhang, Han Yinhe, Ma Jun, Li Huawei, Li Xiaowei
TL;DR: Applications' traffic tends to be bursty and the location of hot-spot nodes moves as time goes by, which significantly aggravates the blocking problem of wormhole-routed Network-on-Chip (NoC).
Abstract: Applications' traffic tends to be bursty and the location of hot-spot nodes moves as time goes by. This significantly aggravates the blocking problem of wormhole-routed Network-on-Chip (NoC). M...

Journal ArticleDOI
TL;DR: Clearly, parallelism directly affects the area and the execution time, but this paper shows that the energy consumption is not constant, and decreases as the parallelism grows.
Abstract: Nowadays, System-on-Chip architectures are composed of several execution resources which support complex applications. As it shares silicon area and limits the cost of the global circuit, the embedding of a reconfigurable resource in these SoCs provides flexibility to the hardware. In this case, several implementations of the same algorithm, offering different characteristics, can be considered in order to optimize performance. In general, the tasks mapped on reconfigurable resources are algorithms that can be defined through several levels of parallelism. Clearly, parallelism directly affects the area and the execution time, but this paper shows that the energy consumption is not constant, and decreases as the parallelism grows.
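
A small analytical model makes the energy argument concrete: for a fixed amount of work, higher parallelism shortens the execution time roughly in proportion, so the static (time-dependent) part of the energy shrinks even though more area is active. All constants in the sketch below are invented for illustration and are unrelated to the paper's measurements.

    # Toy energy model for a task mapped onto a reconfigurable resource with
    # varying parallelism P.  All constants are invented for illustration.
    WORK = 1.0e6          # operations
    F_CLK = 100.0e6       # Hz
    E_OP = 10.0e-12       # J per operation (dynamic energy, roughly constant in P)
    P_STATIC = 0.05       # W of static/leakage power, assumed fixed for the fabric

    def execution_time(parallelism):
        return WORK / (parallelism * F_CLK)          # more parallel units -> shorter run

    def energy(parallelism):
        t = execution_time(parallelism)
        dynamic = WORK * E_OP                        # same total work regardless of P
        static = P_STATIC * t                        # static energy scales with runtime
        return dynamic + static

    if __name__ == "__main__":
        for p in (1, 2, 4, 8, 16):
            print(f"P={p:2d}: time={execution_time(p)*1e3:6.3f} ms, "
                  f"energy={energy(p)*1e6:7.2f} uJ")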

Journal ArticleDOI
TL;DR: The basic architecture of the SCMA and the requirements and design for SCMAs to operate scalably over multiple devices are described, and additional FPGAs provide higher performance proportional to the number of devices, resulting in almost linear speedup.
Abstract: This paper demonstrates and evaluates the performance and scalability of the systolic computational-memory array (SCMA) for stencil computation, which is a typical computing kernel of scientific simulation. We describe the basic architecture of the SCMA, and show the requirements and the design for SCMAs to operate scalably over multiple devices. We implement a prototype of the SCMA with three ALTERA Stratix III FPGAs, which form a 1x3 FPGA array by connecting three DE3 boards with different clock sources. The prototype SCMA demonstrates that the difference in operating clock frequency hardly influences the total execution cycles, although it causes slight stall cycles in the sub-SCMAs on different FPGAs. With three benchmark programs of typical computing kernels based on the finite difference method, we show that additional FPGAs provide higher performance proportional to the number of devices, resulting in almost linear speedup.
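
Stencil computation, the kernel the SCMA targets, is easy to state in software: each grid point is updated from a fixed neighbourhood of points. The NumPy sketch below performs 2-D Jacobi sweeps with a 5-point stencil as a generic example; it says nothing about the systolic array's actual mapping.

    # Generic 2-D 5-point stencil (Jacobi) sweep in NumPy, as an example of the
    # kind of kernel a systolic computational-memory array accelerates.
    import numpy as np

    def jacobi_step(u):
        """One sweep of the 5-point stencil on the interior of a 2-D grid."""
        new = u.copy()
        new[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                                  u[1:-1, :-2] + u[1:-1, 2:])
        return new

    if __name__ == "__main__":
        n = 64
        u = np.zeros((n, n))
        u[0, :] = 1.0                      # fixed boundary condition on one edge
        for _ in range(200):
            u = jacobi_step(u)
        print("centre value after 200 sweeps:", u[n // 2, n // 2])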

Journal ArticleDOI
TL;DR: A systematic approach for identification and extraction of fine-grain data parallelism from the PPN specification is presented and implemented in a tool, called kpn2gpu, which produces fine-grain data-parallel CUDA kernels for graphics processing units (GPUs).
Abstract: With advances in manycore and accelerator architectures, the high performance and embedded spaces are rapidly converging. Emerging architectures feature different forms of parallelism. Polyhedral Process Networks (PPNs) are a proven model of choice for automated generation of pipeline and task parallel programs from sequential source code; however, data parallelism is not addressed. In this paper, we present a systematic approach for identification and extraction of fine-grain data parallelism from the PPN specification. The approach is implemented in a tool, called kpn2gpu, which produces fine-grain data-parallel CUDA kernels for graphics processing units (GPUs). First experiments indicate that generated applications have the potential to exploit different forms of parallelism provided by the architecture and that kernels feature a highly regular structure that allows subsequent optimizations.

Journal ArticleDOI
TL;DR: This work augments virtual memory to allow each page to specify its preferred granularity of access, and proposes adaptive granularity to combine the best of fine-grained and coarse-grained memory accesses.
Abstract: We propose adaptive granularity to combine the best of fine-grained and coarse-grained memory accesses. We augment virtual memory to allow each page to specify its preferred granularity of access b...

Journal ArticleDOI
TL;DR: This paper aims to deliver working, efficient GPU code in a library that is downloaded and run by many different users, and targets the linear solver module, including Conjugate Gradient, Jacobi and MinRes solvers for sparse matrices.
Abstract: Graphics Processing Units (GPUs) are widely used to accelerate scientific applications. Many successes have been reported with speedups of two or three orders of magnitude over serial implementations of the same algorithms. These speedups typically pertain to a specific implementation with fixed parameters mapped to a specific hardware implementation. The implementations are not designed to be easily ported to other GPUs, even from the same manufacturer. When the target hardware changes, the application must be re-optimized. In this paper we address a different problem. We aim to deliver working, efficient GPU code in a library that is downloaded and run by many different users. The issue is to deliver efficiency independent of the individual user parameters and without a priori knowledge of the hardware the user will employ. This problem requires a different set of tradeoffs than finding the best runtime for a single solution. Solutions must be adaptable to a range of different parameters both to solve users' problems and to make the best use of the target hardware. Another issue is the integration of GPUs into a Problem Solving Environment (PSE) where the use of a GPU is almost invisible from the perspective of the user. Ease of use and smooth interaction with the existing user interface are important to our approach. We illustrate our solution with the incorporation of GPU processing into the SCIRun Biomedical PSE developed at the Scientific Computing and Imaging (SCI) Institute at the University of Utah. SCIRun allows scientists to interactively construct many different types of biomedical simulations. We use this environment to demonstrate the effectiveness of the GPU by accelerating time-consuming algorithms in the scientists' simulations. Specifically, we target the linear solver module, including Conjugate Gradient, Jacobi and MinRes solvers for sparse matrices.
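
For context, the conjugate-gradient method exposed by the linear solver module can be written in a few lines for a sparse symmetric positive-definite system. The NumPy/SciPy version below is a generic reference implementation, unrelated to SCIRun's CUDA code.

    # Generic conjugate-gradient solver for a sparse SPD system, as a reference
    # for the kind of kernel a GPU-accelerated linear-solver module offloads.
    import numpy as np
    import scipy.sparse as sp

    def conjugate_gradient(A, b, tol=1e-8, max_iter=1000):
        x = np.zeros_like(b)
        r = b - A @ x
        p = r.copy()
        rs_old = r @ r
        for _ in range(max_iter):
            Ap = A @ p
            alpha = rs_old / (p @ Ap)
            x += alpha * p
            r -= alpha * Ap
            rs_new = r @ r
            if np.sqrt(rs_new) < tol:
                break
            p = r + (rs_new / rs_old) * p
            rs_old = rs_new
        return x

    if __name__ == "__main__":
        n = 200
        # 1-D Poisson matrix: sparse, symmetric positive definite.
        A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
        b = np.ones(n)
        x = conjugate_gradient(A, b)
        print("residual norm:", np.linalg.norm(b - A @ x))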

Journal ArticleDOI
TL;DR: CoreSymphony is a cooperative and reconfigurable superscalar processor architecture that improves single-thread performance in chip multiprocessors by enabling several narrow-issue cores to be fused into a single wide-issue core.
Abstract: This paper describes CoreSymphony, a cooperative and reconfigurable superscalar processor architecture that improves single-thread performance in chip multiprocessors. CoreSymphony enables several narrow-issue cores to be fused into a single wide-issue core. In this paper, we describe the problems associated with achieving a cooperative superscalar processor. We then describe techniques by which to overcome these problems. The evaluation results obtained using SPEC2006 benchmarks indicate that four-core fusion achieves 88% higher IPC than an individual core.