
Showing papers in "ACM SIGARCH Computer Architecture News in 2011"


Journal ArticleDOI
TL;DR: The high level of collaboration on the gem5 project, combined with the previous success of the component parts and a liberal BSD-like license, make gem5 a valuable full-system simulation tool.
Abstract: The gem5 simulation infrastructure is the merger of the best aspects of the M5 [4] and GEMS [9] simulators. M5 provides a highly configurable simulation framework, multiple ISAs, and diverse CPU models. GEMS complements these features with a detailed and flexible memory system, including support for multiple cache coherence protocols and interconnect models. Currently, gem5 supports most commercial ISAs (ARM, ALPHA, MIPS, Power, SPARC, and x86), including booting Linux on three of them (ARM, ALPHA, and x86). The project is the result of the combined efforts of many academic and industrial institutions, including AMD, ARM, HP, MIPS, Princeton, MIT, and the Universities of Michigan, Texas, and Wisconsin. Over the past ten years, M5 and GEMS have been used in hundreds of publications and have been downloaded tens of thousands of times. The high level of collaboration on the gem5 project, combined with the previous success of the component parts and a liberal BSD-like license, make gem5 a valuable full-system simulation tool.

4,039 citations


Journal ArticleDOI
TL;DR: This work studies the automated generation of full or truncated multipliers using the embedded multipliers and adders present in the DSP blocks of current FPGAs, and addresses arbitrary precisions, including single and double precision as well as the quadruple precision introduced by the IEEE-754-2008 standard and currently unsupported by processor hardware.
Abstract: The implementation of high-precision floating-point applications on reconfigurable hardware requires large multipliers. Full multipliers are the core of floating-point multipliers. Truncated multipliers, trading resources for a well-controlled accuracy degradation, are useful building blocks in situations where a full multiplier is not needed. This work studies the automated generation of such multipliers using the embedded multipliers and adders present in the DSP blocks of current FPGAs. The optimization of such multipliers is expressed as a tiling problem, where a tile represents a hardware multiplier, and super-tiles represent combinations of several hardware multipliers and adders, making efficient use of the DSP internal resources. This tiling technique is shown to adapt to full or truncated multipliers. It addresses arbitrary precisions, including single and double precision as well as the quadruple precision introduced by the IEEE-754-2008 standard, which is currently unsupported by processor hardware. An open-source implementation is provided in the FloPoCo project.

68 citations
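
For readers unfamiliar with the tiling formulation, the idea can be illustrated in plain software: a wide multiplication is decomposed into partial products that each fit a fixed-size embedded multiplier, then shifted and summed. The Python sketch below assumes 17x17-bit tiles as a stand-in for DSP-block multipliers; it illustrates the concept only and is not FloPoCo code.

    # Illustrative sketch: decompose a wide unsigned multiplication into
    # fixed-size "tiles", mimicking how DSP-block multipliers are combined.
    # The 17-bit tile width is an assumption, not FloPoCo's actual tiling.
    import random

    TILE = 17  # bits per hardware multiplier input (assumed)

    def split(x, width, tile=TILE):
        """Split x into (shift, limb) pairs of tile-sized limbs covering 'width' bits."""
        limbs = []
        for shift in range(0, width, tile):
            limbs.append((shift, (x >> shift) & ((1 << tile) - 1)))
        return limbs

    def tiled_multiply(a, b, width):
        """Full product of two 'width'-bit numbers built from tile-sized partial products."""
        acc = 0
        for sa, la in split(a, width):
            for sb, lb in split(b, width):
                acc += (la * lb) << (sa + sb)   # one DSP-sized multiply per tile pair
        return acc

    if __name__ == "__main__":
        w = 53  # significand width of an IEEE double
        for _ in range(1000):
            a, b = random.getrandbits(w), random.getrandbits(w)
            assert tiled_multiply(a, b, w) == a * b
        print("tiled multiplication matches the exact product")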


Journal ArticleDOI
TL;DR: The MaxCompiler programming system is described which allows software engineers to create dataflow engines optimized for their particular applications, and an example application that has been accelerated using this methodology is discussed.
Abstract: Over the past decade x86 processors have come to dominate the world's largest supercomputers. However, in the future, conventional multi-core processors are unlikely to be able to deliver the necessary performance per $ and per W to achieve exascale performance. Heterogeneous computing is emerging as a powerful alternative to conventional multi-core designs to help address these challenges. In this paper we describe our approach to Maximum Performance Computing - building application-specific computers which complement conventional x86 processors with high performance dataflow engines implemented on FPGA to provide 10-100x improvements in performance and performance/W. We describe the MaxCompiler programming system which allows software engineers to create dataflow engines optimized for their particular applications, and discuss an example application that has been accelerated using this methodology.

48 citations


Journal ArticleDOI
TL;DR: An FPGA-accelerated Asian option pricing solution, using a highly-optimised parallel Monte-Carlo architecture is proposed, and the proposed pipelined design is described parametrically, facilitating its re-use for different technologies.
Abstract: Arithmetic Asian options are financial derivatives which have the feature of path-dependency: they depend on the entire price path of the underlying asset, rather than just the instantaneous price. This path-dependency makes them difficult to price, as only computationally intensive Monte-Carlo methods can provide accurate prices. This paper proposes an FPGA-accelerated Asian option pricing solution, using a highly-optimised parallel Monte-Carlo architecture. The proposed pipelined design is described parametrically, facilitating its re-use for different technologies. An implementation of this architecture in a Virtex-5 xc5vlx330t FPGA at 200MHz is 313 times faster than a multi-threaded software implementation running on an Intel Xeon E5420 quad-core CPU at 2.5GHz; it is also 2.2 times faster than a Tesla C1060 GPU at 1.3 GHz.

31 citations
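
As background on the pricing problem, the following minimal Python sketch prices an arithmetic Asian call by Monte-Carlo simulation of geometric Brownian motion paths. The parameters and path counts are arbitrary illustrative choices; the code does not reflect the paper's pipelined FPGA architecture.

    # Minimal Monte-Carlo pricer for an arithmetic Asian call option.
    # Illustrative only; parameters and path counts are arbitrary choices.
    import math, random

    def asian_call_price(s0, strike, rate, vol, maturity, steps, paths, seed=1):
        random.seed(seed)
        dt = maturity / steps
        drift = (rate - 0.5 * vol * vol) * dt
        diffusion = vol * math.sqrt(dt)
        payoff_sum = 0.0
        for _ in range(paths):
            s, path_sum = s0, 0.0
            for _ in range(steps):
                s *= math.exp(drift + diffusion * random.gauss(0.0, 1.0))
                path_sum += s
            average = path_sum / steps            # arithmetic average -> path dependency
            payoff_sum += max(average - strike, 0.0)
        return math.exp(-rate * maturity) * payoff_sum / paths

    if __name__ == "__main__":
        print(asian_call_price(s0=100, strike=100, rate=0.05, vol=0.2,
                               maturity=1.0, steps=252, paths=20000))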


Journal ArticleDOI
TL;DR: This research presents a novel and scalable approach called "Smart Guess" that automates the very labor-intensive, time-consuming, and expensive process of manually cataloging individual components of a software system.
Abstract: Diagnosing software failures in the field is notoriously difficult, in part due to the fundamental complexity of trouble-shooting any complex software system, but further exacerbated by the paucity...

29 citations


Journal ArticleDOI
TL;DR: The approach shows that, for N-body computation, the fastest design, which involves 2 CPU cores, 10 FPGA cores and 40960 GPU threads, is 2 times faster than a design with only FPGAs while achieving better overall energy efficiency.
Abstract: Processing speed and energy efficiency are two of the most critical issues for computer systems. This paper presents a systematic approach for profiling the power and performance characteristics of applications targeting heterogeneous multi-core computing platforms. Our approach enables rapid and automated design space exploration involving optimisation of workload distribution for systems with accelerators such as FPGAs and GPUs. We demonstrate that, with minor modification to the design, it is possible to estimate the performance and power-efficiency trade-off to identify an optimised workload distribution. Our approach shows that, for N-body computation, the fastest design, which involves 2 CPU cores, 10 FPGA cores and 40960 GPU threads, is 2 times faster than a design with only FPGAs while achieving better overall energy efficiency.

26 citations
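
The kind of design-space exploration described above can be mimicked with a simple analytical model: given per-device throughput and power figures, enumerate workload splits and keep the fastest (or most energy-efficient) one. In the sketch below, all throughput and power numbers are invented placeholders, not the paper's measurements.

    # Toy design-space exploration for splitting N-body work across devices.
    # Throughput (interactions/s) and power (W) figures are made-up placeholders.
    DEVICES = {
        "cpu":  {"throughput": 1.0e9, "power": 80.0},
        "fpga": {"throughput": 4.0e9, "power": 25.0},
        "gpu":  {"throughput": 6.0e9, "power": 150.0},
    }

    def evaluate(split, work=1.0e12):
        """split maps device -> fraction of work; returns (time, energy)."""
        # Devices run in parallel, so the slowest device sets the runtime.
        time = max(work * frac / DEVICES[d]["throughput"]
                   for d, frac in split.items() if frac > 0)
        # Devices given no work are assumed idle and consume nothing.
        energy = sum(DEVICES[d]["power"] * time
                     for d, frac in split.items() if frac > 0)
        return time, energy

    def explore(step=0.05):
        best = None
        n = int(round(1 / step))
        for i in range(n + 1):
            for j in range(n + 1 - i):
                split = {"cpu": i * step, "fpga": j * step, "gpu": 1 - (i + j) * step}
                time, energy = evaluate(split)
                if best is None or time < best[0]:
                    best = (time, energy, split)
        return best

    if __name__ == "__main__":
        time, energy, split = explore()
        print(f"fastest split {split}: {time:.3f} s, {energy:.1f} J")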


Journal ArticleDOI
TL;DR: This paper proposes improvements to a recent analysis of the SCAN policy and carries out an empirical investigation of SATF performance to derive a relationship between the queue length and mean service time; both methods outperform FCFS scheduling.
Abstract: Performance of many important computer applications depends on the performance of Hard Disk Drives (HDDs). Disk capacities and transfer rates have been increasing rapidly, but the improvement in disk access time is disappointingly slow. Caching and prefetching are two methods to alleviate this delay, which is 6-7 orders of magnitude longer than the processor cycle time. Disk scheduling is desirable when the data is not cached and a disk access is required. This paper is concerned with the analysis of two disk arm scheduling methods: SCAN and SATF (shortest access time first). SATF outperforms SCAN, while both methods outperform FCFS scheduling. We propose improvements to a recent analysis of the SCAN policy and carry out an empirical investigation of SATF performance to derive a relationship between the queue length and mean service time. A review of variations of SCAN and SATF is provided, since they have been utilized in conjunction with multilevel disk scheduling methods. We also discuss recent developments to improve the performance of high capacity HDDs, which allow multiple tracks to be accessed without incurring seeks.

24 citations
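
As a rough intuition for why access-time-aware policies win, the toy simulation below services the same request queue under FCFS and under a greedy shortest-access-time-first policy, using a simplified access-time model (linear seek plus rotational latency). The model and its constants are illustrative assumptions, not those of the paper's analysis.

    # Toy comparison of FCFS vs. a greedy SATF-style policy under a crude
    # access-time model (linear seek plus rotational latency).  All constants
    # are illustrative assumptions, not the paper's parameters.
    import random

    TRACKS = 10000
    SEEK_PER_TRACK = 0.001    # ms of seek per track moved (assumed)
    ROTATION = 8.33           # ms per revolution (7200 RPM)

    def access_time(cur_track, req):
        seek = abs(req["track"] - cur_track) * SEEK_PER_TRACK
        phase_after_seek = (seek / ROTATION) % 1.0
        rot = ((req["angle"] - phase_after_seek) % 1.0) * ROTATION
        return seek + rot

    def mean_service_time(requests, pick_next):
        pending, cur, total = list(requests), TRACKS // 2, 0.0
        n = len(pending)
        while pending:
            nxt = pick_next(cur, pending)
            total += access_time(cur, nxt)
            cur = nxt["track"]
            pending.remove(nxt)
        return total / n

    def fcfs(cur, pending):
        return pending[0]                   # serve in arrival order

    def satf(cur, pending):
        return min(pending, key=lambda r: access_time(cur, r))  # greedy SATF

    if __name__ == "__main__":
        random.seed(0)
        queue = [{"track": random.randrange(TRACKS), "angle": random.random()}
                 for _ in range(64)]
        print("FCFS mean service time (ms):", mean_service_time(queue, fcfs))
        print("SATF mean service time (ms):", mean_service_time(queue, satf))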


Journal ArticleDOI
TL;DR: This work presents a taxonomy and modular implementation approach for data-parallel accelerators, including the MIMD, vector-SIMD, subword- SIMD, SIMT, and vector-thread (VT) architectural design patterns.
Abstract: We present a taxonomy and modular implementation approach for data-parallel accelerators, including the MIMD, vector-SIMD, subword-SIMD, SIMT, and vector-thread (VT) architectural design patterns. ...

23 citations


Journal ArticleDOI
TL;DR: In this paper, a prototype of a source-to-source compiler automating the fusion phase is presented, and the impact of fusions generated by the compiler, as well as the compiler's efficiency, is experimentally evaluated.
Abstract: When implementing a function mapping on a contemporary GPU, several contradictory performance factors affecting the distribution of computation into GPU kernels have to be balanced. A decomposition-fusion scheme suggests decomposing the computational problem into several simple functions implemented as standalone kernels, and later fusing some of these functions into more complex kernels to improve memory locality. In this paper, a prototype of a source-to-source compiler automating the fusion phase is presented, and the impact of fusions generated by the compiler, as well as the compiler's efficiency, is experimentally evaluated.

23 citations
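
The decomposition-fusion trade-off can be sketched without a GPU: two simple functions can run as separate "kernels" that exchange data through an intermediate array, or be fused into one loop that keeps the intermediate value local. The plain-Python example below illustrates the idea only; it is not output of the compiler described above.

    # Plain-Python illustration of decomposition vs. fusion.  On a GPU the fused
    # version avoids writing the intermediate array to global memory, which is
    # the memory-locality benefit that kernel fusion is after.

    def scale_kernel(x, alpha):
        return [alpha * v for v in x]          # "kernel" 1: writes an intermediate array

    def add_kernel(y, z):
        return [a + b for a, b in zip(y, z)]   # "kernel" 2: reads the intermediate back

    def decomposed_axpy(alpha, x, z):
        tmp = scale_kernel(x, alpha)           # round-trip through memory
        return add_kernel(tmp, z)

    def fused_axpy(alpha, x, z):
        # Fused kernel: the scaled value stays in a register-like local variable.
        return [alpha * xi + zi for xi, zi in zip(x, z)]

    if __name__ == "__main__":
        x, z = [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]
        assert decomposed_axpy(2.0, x, z) == fused_axpy(2.0, x, z)
        print(fused_axpy(2.0, x, z))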


Journal ArticleDOI
TL;DR: In this article, the authors consider optimizing the performance of a multi-core microprocessor within a power budget, noting that most existing solutions are centralized and cannot scale well.
Abstract: Optimizing the performance of a multi-core microprocessor within a power budget has recently received a lot of attention. However, most existing solutions are centralized and cannot scale well with...

21 citations


Journal ArticleDOI
TL;DR: The deep and complicated pipeline structure generated from the MUSCL dataflow is divided and optimized across two FPGA boards by using a tuning tool called RER, and about 60% utilization of the pipeline is achieved even when using serial links between the two boards.
Abstract: UPACS (Unified Platform for Aerospace Computational Simulation) is a practical CFD (Computational Fluid Dynamics) package offering a wide range of selectable features. A custom machine for efficient execution of MUSCL, a core function of UPACS, is implemented on FLOPS-2D (Flexibly Linkable Object for Programmable System), a multi-FPGA reconfigurable system. The deep and complicated pipeline structure generated from the MUSCL dataflow is divided and optimized across two FPGA boards by using a tuning tool called RER. With optimization of the order of operations and the pipeline structure, about 60% utilization of the pipeline is achieved even when using serial links between the two boards. The execution time is 6.16-23.19 times faster than that of the software on a 2.66 GHz Intel Core 2 Duo processor.

Journal ArticleDOI
TL;DR: This paper proposes a framework for automatic transformation of an application at the binary level, with which the user can execute an arbitrary application on the CGRA, describes the overall process, and presents solutions to several problems that arise from such an approach.
Abstract: Coarse-grained reconfigurable architectures (CGRAs) have been well-researched and shown to be particularly effective in the acceleration of data-intensive applications. However, practical difficulties in application mapping have hindered their widespread adoption. Typically, an application must be modified manually or by using special compilers and design tools in order to fully exploit the architecture. This imposes considerable design costs on the application developer and reduces software portability. In this paper, we propose a framework for automatic transformation of an application at the binary level, with which the user can execute an arbitrary application on the CGRA. Our approach analyzes the binary code and determines which portions of the program to accelerate, maps them to the reconfigurable array, then modifies the binary code appropriately to run on the CGRA. We describe the overall process of our framework, and present solutions to several problems that arise from such an approach. Results from our preliminary experiments show that we are able to achieve a speedup of up to 14.8.

Journal ArticleDOI
TL;DR: Inclusive last-level caches (LLCs) waste precious silicon estate due to cross-level replication of cache blocks, a problem that grows as the industry moves toward cache hierarchies with larger inner levels.
Abstract: Inclusive last-level caches (LLCs) waste precious silicon estate due to cross-level replication of cache blocks. As the industry moves toward cache hierarchies with larger inner levels, this wasted...

Journal ArticleDOI
TL;DR: This research presents a meta-modelling architecture suitable for scalable, modular modelling of datacenter power consumption and shows clear trends in how these costs are optimized over time.
Abstract: Datacenter power consumption has a significant impact on both its recurring electricity bill (Op-ex) and one-time construction costs (Cap-ex). Existing work optimizing these costs has relied primar...

Journal ArticleDOI
TL;DR: A programming framework for high performance clusters with various hardware accelerators that has been used to support physics simulation and financial application development and achieves significant performance improvement on a 16-node cluster with FPGA and GPU accelerators.
Abstract: We describe a programming framework for high performance clusters with various hardware accelerators. In this framework, users can utilize the available heterogeneous resources productively and efficiently. The distributed application is highly modularized to support dynamic system configuration with changing types and numbers of accelerators. Multiple layers of communication interface are introduced to reduce the overhead in both control messages and data transfers. Parallelism can be achieved by controlling the accelerators in various schemes through a scheduling extension. The framework has been used to support physics simulation and financial application development. We achieve significant performance improvement on a 16-node cluster with FPGA and GPU accelerators.

Journal ArticleDOI
TL;DR: This paper presents an initial design space exploration of viable compute architectures that might address the drastically different requirements of bedside and portable medical ultrasound imaging systems using adaptive beamforming, together with the design and implementation of a GPU accelerator that provides over 45x performance improvement over the equivalent C implementation on a single CPU.
Abstract: The use of adaptive beamforming is a viable solution to provide high-resolution real-time medical ultrasound imaging. However, the increase in image resolution comes at the expense of a significant increase in compute requirement over conventional algorithms. In a bedside diagnosis setting where plug-in power is available, GPUs are promising accelerators to address the processing demand. However, in the case of point-of-care diagnostics where portable ultrasound imaging devices must be used, alternative power-efficient computer systems must be employed, possibly at the expense of lower image resolution in order to maintain real-time performance. This paper presents an initial design space exploration of viable compute architectures that might address the drastically different requirements of bedside and portable medical ultrasound imaging systems using adaptive beamforming. The design and implementation of a GPU accelerator that provides over 45x performance improvement over the equivalent C implementation on a single CPU is presented. Furthermore, an implementation of the beamforming algorithm on a high-performance mobile platform based on an ARM Cortex A8 mobile processor in combination with the built-in NEON accelerator is also presented. The mobile platform delivers over 270x reduction in power consumption compared to the GPU platform, at the expense of much-reduced performance. The tradeoffs between power, performance and image quality among the target platforms are studied and future research directions in power-efficient architectures for high-performance medical ultrasound systems are presented.
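
To give a sense of what the adaptive (minimum-variance) beamformer computes per output sample, the NumPy sketch below forms Capon/MVDR apodization weights from a sample covariance matrix. This is a generic textbook formulation with assumed channel counts and diagonal loading, not the paper's implementation.

    # Generic minimum-variance (MVDR/Capon) beamforming weights in NumPy.
    # Textbook formulation only; channel count and loading factor are assumptions.
    import numpy as np

    def mvdr_weights(channel_data, steering, loading=1e-2):
        """channel_data: (channels, samples) array of delay-aligned data.
        steering: (channels,) steering vector (all ones after delay alignment)."""
        R = channel_data @ channel_data.conj().T / channel_data.shape[1]
        # Diagonal loading stabilises the inversion of the sample covariance.
        R = R + loading * np.trace(R).real / R.shape[0] * np.eye(R.shape[0])
        Ri_a = np.linalg.solve(R, steering)
        return Ri_a / (steering.conj() @ Ri_a)

    def beamform_sample(channel_sample, weights):
        return weights.conj() @ channel_sample

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        channels, samples = 32, 64
        data = rng.standard_normal((channels, samples))   # stand-in for delayed RF data
        a = np.ones(channels)
        w = mvdr_weights(data, a)
        print("beamformed value:", beamform_sample(data[:, 0], w))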

Journal ArticleDOI
TL;DR: This work proposes a set of CELL-accelerated routines for basic LQCD calculations, and shows a significant speedup compared to a standard processor, for instance 11 times faster than a 2.83 GHz Intel processor.
Abstract: Quantum chromodynamics (QCD) is the theory of subnuclear physics, aiming at modeling the strong nuclear force, which is responsible for the interactions of nuclear particles. Numerical QCD studies are performed through a discrete formalism called LQCD (Lattice Quantum Chromodynamics). Typical simulations involve very large volumes of data and numerically sensitive entities, hence the crucial need for high performance computing systems. We propose a set of CELL-accelerated routines for basic LQCD calculations. Our framework is provided as a unified library and is particularly optimized for iterative use. Each routine is parallelized among the SPUs, and each SPU achieves its task by looping over small chunks of arrays from main memory. Our SPU implementation is vectorized with double precision data, and the cooperation with the PPU shows a good overlap between data transfers and computations. Moreover, we permanently keep the SPU context and use mailboxes to synchronize between consecutive calls. We validate our library by using it to derive a CELL version of an existing LQCD package (tmLQCD). Experimental results on individual routines show a significant speedup compared to a standard processor, for instance 11 times faster than a 2.83 GHz Intel processor (without SSE). This ratio is around 9 (with a QS22 blade) when considering a more cooperative context such as solving a linear system of equations (usually referred to as Wilson-Dirac inversion). Our results clearly demonstrate that the CELL is a very promising platform for large-scale LQCD simulations.

Journal ArticleDOI
TL;DR: In 2001, Chou et al. published a study of faults found by applying a static analyzer to Linux versions 1.0 through 2.4.1, finding that the drivers directory contained up to several times more of certain kinds of faults than the rest of the kernel.
Abstract: In 2001, Chou et al. published a study of faults found by applying a static analyzer to Linux versions 1.0 through 2.4.1. A major result of their work was that the drivers directory contained up to...

Journal ArticleDOI
TL;DR: Recent advances in the neuroscientific understanding of the brain are bringing about a tantalizing opportunity for building synthetic machines that perform computation in ways that differ radically from existing systems.
Abstract: Recent advances in the neuroscientific understanding of the brain are bringing about a tantalizing opportunity for building synthetic machines that perform computation in ways that differ radically...

Journal ArticleDOI
TL;DR: It is revealed that branch predication can enable instruction packing, a VLIW-like GPU feature that is designed to increase the parallel execution of independent instructions, and can also decrease the number of control flow instructions thereby improving the performance of GPU kernels with both single and multiple branch paths.
Abstract: Branch predication is a program transformation technique that combines instructions from multiple branches of an if statement into a straight-line sequence and associates each instruction of the sequence with a predicate. Branch predication improves the execution of branch statements on processors that support predicated execution of instructions, e.g., Intel IA-64, because such a transformation improves instruction scheduling and might help cache performance. This paper proposes a novel software-based branch predication technique for GPUs. The main motivation is that branch instructions can easily become a performance bottleneck for a GPU program because of the cost of branch instructions compared to ALU instructions and the possibility of low ALU utilization due to the separation of ALU instructions within control flow blocks. Due to the SIMD nature and massive multi-threading architecture of the GPU, branching can be costly if more than one path is taken by a set of concurrent threads in a kernel. In this paper we reveal that branch predication can enable instruction packing, a VLIW-like GPU feature that is designed to increase the parallel execution of independent instructions, and can also decrease the number of control flow instructions, thereby improving the performance of GPU kernels with both single and multiple branch paths. The key to our novel branch predication technique is a set of transformation rules that takes into consideration the specifics of the GPU architecture and implements software-based predicated execution of instructions on the GPU with little to no overhead. Furthermore, we identify architectural and program factors that affect the effectiveness of our technique and build a benefit analysis model for the transformation. The implementation of our technique on synthetic benchmarks and a real-world application demonstrates its effectiveness.
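
The transformation itself is easy to show in scalar form: an if/else becomes a straight-line sequence in which both arms are evaluated and a predicate selects the result, mirroring what SIMD lanes effectively do under divergence. The Python snippet below is a language-neutral illustration of the idea, not the paper's GPU transformation rules.

    # Scalar illustration of branch predication: replace control flow with a
    # straight-line sequence guarded by a predicate.  This mirrors what happens
    # on SIMD hardware, but it is not the paper's actual transformation rules.

    def with_branch(x):
        if x >= 0.0:
            y = x * 2.0       # "then" path
        else:
            y = -x + 1.0      # "else" path
        return y

    def predicated(x):
        p = x >= 0.0                        # predicate
        then_val = x * 2.0                  # both arms computed unconditionally
        else_val = -x + 1.0
        return then_val if p else else_val  # select, no divergent control flow

    if __name__ == "__main__":
        for v in (-3.0, 0.0, 2.5):
            assert with_branch(v) == predicated(v)
        print("predicated form matches branching form")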

Journal ArticleDOI
Serban Georgescu, Peter Chow
TL;DR: This work compares the performance that can be achieved using the open-source solver package PETSc run on GPU-enabled Amazon EC2 hardware with that of an optimized legacy FEM code run on a last-generation 12-core blade server, and shows that, although good performance can be achieved, some development is still needed to reach peak performance.
Abstract: After more than five years since GPUs were first used as accelerators for general scientific computations, the field of General Purpose GPU computing, or GPGPU, has finally reached the mainstream. Developers now have access to a mature hardware and software ecosystem. On the software side, several major open-source packages now support GPU acceleration, while on the hardware side cloud-based solutions provide a simple way to access powerful machines with the latest GPUs at low cost. In this context, we look at the GPU acceleration of CAE, with a focus on the matrix solvers. We compare the performance that can be achieved using the open-source solver package PETSc run on GPU-enabled Amazon EC2 hardware with that of an optimized legacy FEM code run on a last-generation 12-core blade server. Our results show that, although good performance can be achieved, some development is still needed to reach peak performance.

Journal ArticleDOI
TL;DR: PowerDial is a system for dynamically adapting application behavior to execute successfully in the face of load and power fluctuations by transforming static configuration parameters into dynamically tunable knobs.
Abstract: We present PowerDial, a system for dynamically adapting application behavior to execute successfully in the face of load and power fluctuations. PowerDial transforms static configuration parameters...

Journal ArticleDOI
TL;DR: This research studies the possibility of using 64-bit polynomials in software and hardware, using the fastest multiple-lookup-table algorithms for generating CRCs, and shows that throughput continues to increase as the number of bits processed at a time is increased.
Abstract: Deployment of jumbo frame sizes beyond 9000 bytes for storage systems is limited by the 32-bit Cyclic Redundancy Checks used by a network protocol. In order to overcome this limitation we study the possibility of using 64-bit polynomials in software and hardware, using the fastest multiple-lookup-table algorithms for generating CRCs. CRC is a sequential process, so software-based solutions are limited in throughput by the speed and architectural improvements of a single CPU. We study the tradeoff between using distributed LUTs and embedded BRAM in hardware implementations. Our results show that the BRAM-based approach is the fastest hardware implementation, reaching a maximum of 347.37 Gbps while processing 1024 bits at a time, which is 606x faster than the software implementation of the same algorithm running on a 3.2 GHz Xeon with 2 MB of L2 cache. The proposed architectures have been implemented on a Xilinx Virtex 6 LX550T prototyping device, requiring less than 1% of the device's resources. Our research shows that throughput will continue to increase as the number of bits processed at a time is increased.
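
The lookup-table approach itself is standard: precompute the CRC of every possible input byte, then process the message one byte (or, with several tables, several bytes) per step. The Python sketch below implements a single-table, byte-at-a-time CRC-64 using the ECMA-182 polynomial as a stand-in; the paper's 64-bit polynomial and multi-table hardware variants are not reproduced here.

    # Single-table, byte-at-a-time CRC-64 (MSB-first, no reflection, zero init).
    # The ECMA-182 polynomial is a stand-in; the paper's polynomial and
    # multi-table/hardware variants are not reproduced here.
    POLY = 0x42F0E1EBA9EA3693   # CRC-64/ECMA-182
    MASK = (1 << 64) - 1

    def make_table():
        table = []
        for byte in range(256):
            crc = byte << 56
            for _ in range(8):
                crc = ((crc << 1) ^ POLY) & MASK if crc & (1 << 63) else (crc << 1) & MASK
            table.append(crc)
        return table

    TABLE = make_table()

    def crc64(data, crc=0):
        # One table lookup per input byte instead of eight bit-serial steps.
        for b in data:
            crc = (TABLE[(crc >> 56) ^ b] ^ (crc << 8)) & MASK
        return crc

    def crc64_bitwise(data, crc=0):
        # Reference bit-serial implementation used to check the table version.
        for b in data:
            crc ^= b << 56
            for _ in range(8):
                crc = ((crc << 1) ^ POLY) & MASK if crc & (1 << 63) else (crc << 1) & MASK
        return crc

    if __name__ == "__main__":
        msg = b"jumbo frame payload"
        assert crc64(msg) == crc64_bitwise(msg)
        print(hex(crc64(msg)))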

Journal ArticleDOI
Fu Binzhang, Han Yinhe, Ma Jun, Li Huawei, Li Xiaowei
TL;DR: Applications' traffic tends to be bursty and the location of hot-spot nodes moves as time goes by, which significantly aggravates the blocking problem of wormhole-routed Network-on-Chip (NoC).
Abstract: Applications' traffic tends to be bursty and the location of hot-spot nodes moves as time goes by. This significantly aggravates the blocking problem of wormhole-routed Network-on-Chip (NoC). M...

Journal ArticleDOI
TL;DR: Clearly, parallelism directly affects the area and the execution time, but this paper shows that the energy consumption is not constant, and decreases as the parallelism grows.
Abstract: Nowadays, System-on-Chip architectures are composed of several execution resources which support complex applications. As it shares silicon area and limits the cost of the global circuit, the embedding of a reconfigurable resource in these SoCs provides flexibility to the hardware. In this case, several implementations of the same algorithm, offering different characteristics, can be considered in order to optimize performance. In general, the tasks mapped on reconfigurable resources are algorithms that can be defined through several levels of parallelism. Clearly, parallelism directly affects the area and the execution time, but this paper shows that the energy consumption is not constant, and decreases as the parallelism grows.
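
A small analytical model makes the energy argument concrete: for a fixed amount of work, higher parallelism shortens the execution time roughly in proportion, so the static (time-dependent) part of the energy shrinks even though more area is active. All constants in the sketch below are invented for illustration and are unrelated to the paper's measurements.

    # Toy energy model for a task mapped onto a reconfigurable resource with
    # varying parallelism P.  All constants are invented for illustration.
    WORK = 1.0e6          # operations
    F_CLK = 100.0e6       # Hz
    E_OP = 10.0e-12       # J per operation (dynamic energy, roughly constant in P)
    P_STATIC = 0.05       # W of static/leakage power, assumed fixed for the fabric

    def execution_time(parallelism):
        return WORK / (parallelism * F_CLK)          # more parallel units -> shorter run

    def energy(parallelism):
        t = execution_time(parallelism)
        dynamic = WORK * E_OP                        # same total work regardless of P
        static = P_STATIC * t                        # static energy scales with runtime
        return dynamic + static

    if __name__ == "__main__":
        for p in (1, 2, 4, 8, 16):
            print(f"P={p:2d}: time={execution_time(p)*1e3:6.3f} ms, "
                  f"energy={energy(p)*1e6:7.2f} uJ")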

Journal ArticleDOI
TL;DR: The basic architecture of the SCMA and the requirements and design for SCMAs to operate scalably over multiple devices are described, and additional FPGAs provide higher performance proportional to the number of devices, resulting in almost linear speedup.
Abstract: This paper demonstrates and evaluates the performance and scalability of the systolic computational-memory array (SCMA) for stencil computation, which is a typical computing kernel of scientific simulation. We describe the basic architecture of the SCMA, and show the requirements and the design for SCMAs to operate scalably over multiple devices. We implement a prototype of the SCMA with three ALTERA Stratix III FPGAs, which form a 1x3 FPGA array by connecting three DE3 boards with different clock sources. The prototype SCMA demonstrates that the difference in operating clock frequency hardly influences the total execution cycles, although it causes slight stall cycles in the sub-SCMAs on different FPGAs. With three benchmark programs of typical computing kernels based on the finite difference method, we show that additional FPGAs provide higher performance proportional to the number of devices, resulting in almost linear speedup.
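
Stencil computation, the kernel the SCMA targets, is easy to state in software: each grid point is updated from a fixed neighbourhood of points. The NumPy sketch below performs 2-D Jacobi sweeps with a 5-point stencil as a generic example; it says nothing about the systolic array's actual mapping.

    # Generic 2-D 5-point stencil (Jacobi) sweep in NumPy, as an example of the
    # kind of kernel a systolic computational-memory array accelerates.
    import numpy as np

    def jacobi_step(u):
        """One sweep of the 5-point stencil on the interior of a 2-D grid."""
        new = u.copy()
        new[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                                  u[1:-1, :-2] + u[1:-1, 2:])
        return new

    if __name__ == "__main__":
        n = 64
        u = np.zeros((n, n))
        u[0, :] = 1.0                      # fixed boundary condition on one edge
        for _ in range(200):
            u = jacobi_step(u)
        print("centre value after 200 sweeps:", u[n // 2, n // 2])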

Journal ArticleDOI
TL;DR: A systematic approach for identification and extraction of fine-grain data parallelism from the PPN specification is presented and implemented in a tool, called kpn2gpu, which produces fine-grain data-parallel CUDA kernels for graphics processing units (GPUs).
Abstract: With advances in manycore and accelerator architectures, the high performance and embedded spaces are rapidly converging. Emerging architectures feature different forms of parallelism. Polyhedral Process Networks (PPNs) are a proven model of choice for automated generation of pipeline and task parallel programs from sequential source code; however, data parallelism is not addressed. In this paper, we present a systematic approach for identification and extraction of fine-grain data parallelism from the PPN specification. The approach is implemented in a tool, called kpn2gpu, which produces fine-grain data-parallel CUDA kernels for graphics processing units (GPUs). First experiments indicate that generated applications have the potential to exploit different forms of parallelism provided by the architecture and that kernels feature a highly regular structure that allows subsequent optimizations.

Journal ArticleDOI
TL;DR: This work augments virtual memory to allow each page to specify its preferred granularity of access, and proposes adaptive granularity to combine the best of fine-grained and coarse-grained memory accesses.
Abstract: We propose adaptive granularity to combine the best of fine-grained and coarse-grained memory accesses. We augment virtual memory to allow each page to specify its preferred granularity of access b...

Journal ArticleDOI
TL;DR: This paper aims to deliver working, efficient GPU code in a library that is downloaded and run by many different users, and targets the linear solver module, including Conjugate Gradient, Jacobi and MinRes solvers for sparse matrices.
Abstract: Graphics Processing Units (GPUs) are widely used to accelerate scientific applications. Many successes have been reported with speedups of two or three orders of magnitude over serial implementations of the same algorithms. These speedups typically pertain to a specific implementation with fixed parameters mapped to a specific hardware implementation. The implementations are not designed to be easily ported to other GPUs, even from the same manufacturer. When the target hardware changes, the application must be re-optimized. In this paper we address a different problem. We aim to deliver working, efficient GPU code in a library that is downloaded and run by many different users. The issue is to deliver efficiency independent of the individual user parameters and without a priori knowledge of the hardware the user will employ. This problem requires a different set of tradeoffs than finding the best runtime for a single solution. Solutions must be adaptable to a range of different parameters both to solve users' problems and to make the best use of the target hardware. Another issue is the integration of GPUs into a Problem Solving Environment (PSE) where the use of a GPU is almost invisible from the perspective of the user. Ease of use and smooth interaction with the existing user interface are important to our approach. We illustrate our solution with the incorporation of GPU processing into the SCIRun Biomedical PSE developed at the Scientific Computing and Imaging (SCI) Institute at the University of Utah. SCIRun allows scientists to interactively construct many different types of biomedical simulations. We use this environment to demonstrate the effectiveness of the GPU by accelerating time-consuming algorithms in the scientists' simulations. Specifically, we target the linear solver module, including Conjugate Gradient, Jacobi and MinRes solvers for sparse matrices.
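
For context, the conjugate-gradient method exposed by the linear solver module can be written in a few lines for a sparse symmetric positive-definite system. The NumPy/SciPy version below is a generic reference implementation, unrelated to SCIRun's CUDA code.

    # Generic conjugate-gradient solver for a sparse SPD system, as a reference
    # for the kind of kernel a GPU-accelerated linear-solver module offloads.
    import numpy as np
    import scipy.sparse as sp

    def conjugate_gradient(A, b, tol=1e-8, max_iter=1000):
        x = np.zeros_like(b)
        r = b - A @ x
        p = r.copy()
        rs_old = r @ r
        for _ in range(max_iter):
            Ap = A @ p
            alpha = rs_old / (p @ Ap)
            x += alpha * p
            r -= alpha * Ap
            rs_new = r @ r
            if np.sqrt(rs_new) < tol:
                break
            p = r + (rs_new / rs_old) * p
            rs_old = rs_new
        return x

    if __name__ == "__main__":
        n = 200
        # 1-D Poisson matrix: sparse, symmetric positive definite.
        A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
        b = np.ones(n)
        x = conjugate_gradient(A, b)
        print("residual norm:", np.linalg.norm(b - A @ x))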

Journal ArticleDOI
TL;DR: CoreSymphony is a cooperative and reconfigurable superscalar processor architecture that improves single-thread performance in chip multiprocessors by enabling several narrow-issue cores to be fused into a single wide-issue core.
Abstract: This paper describes CoreSymphony, a cooperative and reconfigurable superscalar processor architecture that improves single-thread performance in chip multiprocessors. CoreSymphony enables several narrow-issue cores to be fused into a single wide-issue core. In this paper, we describe the problems associated with achieving a cooperative superscalar processor. We then describe techniques by which to overcome these problems. The evaluation results obtained using SPEC2006 benchmarks indicate that four-core fusion achieves 88% higher IPC than an individual core.