Author

Rafael Ubal

Bio: Rafael Ubal is an academic researcher from Northeastern University. The author has contributed to research in topics including Microarchitecture and Instruction set. The author has an h-index of 10 and has co-authored 28 publications receiving 788 citations. Previous affiliations of Rafael Ubal include the Polytechnic University of Valencia.

Papers
Proceedings ArticleDOI
19 Sep 2012
TL;DR: This paper presents Multi2Sim, an open-source, modular, and fully configurable toolset that enables ISA-level simulation of an x86 CPU and an AMD Evergreen GPU, and addresses program emulation correctness, as well as architectural simulation accuracy, using AMD's OpenCL benchmark suite.
Abstract: Accurate simulation is essential for the proper design and evaluation of any computing platform. With the current move toward the CPU-GPU heterogeneous computing era, researchers need a simulation framework that can model both kinds of computing devices and their interaction. In this paper, we present Multi2Sim, an open-source, modular, and fully configurable toolset that enables ISA-level simulation of an x86 CPU and an AMD Evergreen GPU. Focusing on a model of the AMD Radeon 5870 GPU, we address program emulation correctness, as well as architectural simulation accuracy, using AMD's OpenCL benchmark suite. Simulation capabilities are demonstrated with a preliminary architectural exploration study and workload characterization examples. The project source code, benchmark packages, and a detailed user's guide are publicly available at www.multi2sim.org.
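
As a rough illustration of the architectural exploration the paper demonstrates, a configuration sweep can be scripted around the simulator binary. This is a sketch only: the m2s binary and the --evg-sim, --evg-config, and --evg-report options are assumed from the Multi2Sim 4.x user's guide, and the benchmark and config file names are hypothetical; consult www.multi2sim.org for the exact options of your release.

```python
# Sketch: sweep GPU configurations with Multi2Sim and collect reports.
# Assumes 'm2s' is on PATH and that '--evg-sim detailed', '--evg-config',
# and '--evg-report' match your Multi2Sim release; check the user's guide
# at www.multi2sim.org before relying on these flag names.
import subprocess

BENCHMARK = "./MatrixMultiplication"       # hypothetical AMD SDK benchmark binary
CONFIGS = ["gpu-8cu.ini", "gpu-16cu.ini"]  # hypothetical GPU config files

for cfg in CONFIGS:
    report = f"report-{cfg}.txt"
    subprocess.run(
        ["m2s", "--evg-sim", "detailed",
         "--evg-config", cfg,
         "--evg-report", report,
         BENCHMARK],
        check=True,
    )
    print(f"{cfg}: wrote {report}")
```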

440 citations

Proceedings ArticleDOI
19 Nov 2007
TL;DR: The Multi2Sim simulation framework is presented, which models the major components of upcoming systems and is intended to address the limitations of existing simulators.
Abstract: Current microprocessors are based on complex designs that integrate different components on a single chip, such as hardware threads, processor cores, the memory hierarchy, and interconnection networks. The constant need to evaluate new designs for each of these components motivates the development of tools that simulate the system working as a whole. In this paper, we present the Multi2Sim simulation framework, which models the major components of upcoming systems and is intended to address the limitations of existing simulators. A set of simulation examples is also included for illustrative purposes.

164 citations

Proceedings ArticleDOI
22 Jun 2019
TL;DR: This work presents MGPUSim, a cycle-accurate, extensively validated, multi-GPU simulator based on AMD's Graphics Core Next 3 (GCN3) instruction set architecture, and proposes the Locality API, an API extension that allows the GPU programmer to avoid the complexity of multi-GPU programming while precisely controlling data placement in multi-GPU memory.
Abstract: The rapidly growing popularity and scale of data-parallel workloads demand a corresponding increase in the raw computational power of Graphics Processing Units (GPUs). As single-GPU platforms struggle to satisfy these performance demands, multi-GPU platforms have started to dominate the high-performance computing world. The advent of such systems raises a number of design challenges, including the GPU microarchitecture, multi-GPU interconnect fabric, runtime libraries, and associated programming models. The research community currently lacks a publicly available and comprehensive multi-GPU simulation framework to evaluate next-generation multi-GPU system designs. In this work, we present MGPUSim, a cycle-accurate, extensively validated, multi-GPU simulator based on AMD's Graphics Core Next 3 (GCN3) instruction set architecture. MGPUSim comes with built-in support for multi-threaded execution to enable fast, parallelized, and accurate simulation. In terms of performance accuracy, MGPUSim differs by only 5.5% on average from the actual GPU hardware. We also achieve a 3.5x and a 2.5x average speedup running functional emulation and detailed timing simulation, respectively, on a 4-core CPU, while delivering the same accuracy as serial simulation. We illustrate the flexibility and capability of the simulator through two concrete design studies. In the first, we propose the Locality API, an API extension that allows the GPU programmer to avoid the complexity of multi-GPU programming while precisely controlling data placement in multi-GPU memory. In the second design study, we propose Progressive Page Splitting Migration (PASI), a customized multi-GPU memory management system enabling the hardware to progressively improve data placement. For a discrete 4-GPU system, we observe that the Locality API can speed up the system by 1.6x (geometric mean), and PASI can improve system performance by 2.6x (geometric mean) across all benchmarks, compared to a unified 4-GPU platform.
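
The speedups above are reported as geometric means across benchmarks, the standard way to summarize speedup ratios. A quick illustration of that computation; the per-benchmark numbers below are made up, not from the paper:

```python
# Geometric mean of per-benchmark speedups: the summary statistic behind
# the 1.6x (Locality API) and 2.6x (PASI) claims above.
# The individual speedups here are illustrative only.
import math

speedups = [1.2, 1.8, 2.9, 3.1, 2.2]
geomean = math.exp(sum(math.log(s) for s in speedups) / len(speedups))
print(f"geometric-mean speedup: {geomean:.2f}x")  # ~2.12x for these values
```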

51 citations

Proceedings ArticleDOI
08 Jun 2015
TL;DR: This paper advocates using silicon-photonic link technology for on-chip communication in GPUs, and presents the first GPU-specific analysis of a cost-effective hybrid photonic crossbar NoC.
Abstract: Silicon-photonic link technology promises to satisfy the growing need for high-bandwidth, low-latency, and energy-efficient network-on-chip (NoC) architectures. While silicon-photonic NoC designs have been extensively studied for future many-core systems, their use in massively-threaded GPUs has received little attention to date. In this paper, we first analyze an electrical NoC which connects different cache levels (L1 to L2) in a contemporary GPU memory hierarchy. Evaluating workloads from the AMD SDK run on the Multi2Sim GPU simulator finds that, apart from limits in memory bandwidth, an electrical NoC can significantly hamper performance and impede scalability, especially as the number of compute units grows in future GPU systems. To address this issue, we advocate using silicon-photonic link technology for on-chip communication in GPUs, and we present the first GPU-specific analysis of a cost-effective hybrid photonic crossbar NoC. Our baseline is based on an AMD Southern Islands GPU with 32 compute units (CUs), and we compare this design to our proposed hybrid silicon-photonic NoC. Our proposed photonic hybrid NoC increases performance by up to 6x (2.7x on average) and reduces the energy-delay² product (ED²P) by up to 99% (79% on average) compared to conventional electrical crossbars. For future GPU systems, we study an electrical 2D-mesh topology, since it scales better than an electrical crossbar. For a 128-CU GPU, the proposed hybrid silicon-photonic NoC can improve performance by up to 1.9x (43% on average) and achieve up to a 62% reduction in ED²P (3% on average) in comparison to the best-performing mesh design.
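
ED²P here is the usual energy-delay² product, E·D², which weights latency more heavily than energy. A small worked example of how a percentage reduction in ED²P follows from energy and delay changes; the scaling factors are illustrative, not taken from the paper:

```python
# Energy-delay^2 product: ED2P = energy * delay**2.
# Illustrative numbers: a NoC that cuts delay to 0.4x and energy to 0.8x
# of the electrical baseline cuts ED2P by ~87%.
def ed2p(energy, delay):
    return energy * delay ** 2

base = ed2p(energy=1.0, delay=1.0)      # normalized electrical crossbar
photonic = ed2p(energy=0.8, delay=0.4)  # hypothetical photonic NoC
print(f"ED2P reduction: {(1 - photonic / base):.0%}")  # ~87%
```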

31 citations

Journal ArticleDOI
TL;DR: Using UMH with NMOESI improves the performance of a CPU-multiGPU system by at least 1.92× in comparison to alternative software-based approaches, and allows the CPU to access GPU-modified data at least 13× faster.
Abstract: In this article, we describe how to ease memory management between a Central Processing Unit (CPU) and one or multiple discrete Graphics Processing Units (GPUs) by architecting a novel hardware-based Unified Memory Hierarchy (UMH). Adopting UMH, a GPU accesses the CPU memory only if it does not find its required data in the directories associated with its high-bandwidth memory, or if the NMOESI coherency protocol limits access to that data. Using UMH with NMOESI improves the performance of a CPU-multiGPU system by at least 1.92× in comparison to alternative software-based approaches. It also allows the CPU to access GPU-modified data at least 13× faster.
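
The access path described above, where a GPU consults the directories of its local high-bandwidth memory before falling back to CPU memory, can be sketched as a simple decision rule. This is only an illustration of the logic stated in the abstract, not the UMH implementation; all names are hypothetical, and the sketch collapses NMOESI (which adds an N, non-coherent, state to MOESI) into "readable vs. not":

```python
# Illustrative sketch: serve a GPU request from local high-bandwidth
# memory (HBM) when the directory holds the block in a state the
# coherence protocol allows it to read; otherwise fall back to CPU memory.
READABLE_STATES = {"M", "O", "E", "S"}  # "I" (Invalid) forces a fallback

def gpu_access(addr, hbm_directory):
    state = hbm_directory.get(addr, "I")
    if state in READABLE_STATES:
        return "serve from local HBM"
    return "access CPU memory"

directory = {0x1000: "S", 0x2000: "I"}
print(gpu_access(0x1000, directory))  # serve from local HBM
print(gpu_access(0x2000, directory))  # access CPU memory
print(gpu_access(0x3000, directory))  # access CPU memory (not present)
```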

26 citations


Cited by
Proceedings ArticleDOI
23 Jun 2013
TL;DR: This work proposes a new GPGPU power model that is configurable, capable of cycle-level calculations, and carefully validated against real hardware measurements, and that accurately tracks the power consumption trend over time.
Abstract: General-purpose GPUs (GPGPUs) are becoming prevalent in mainstream computing, and performance per watt has emerged as a more crucial evaluation metric than peak performance. As such, GPU architects require robust tools that will enable them to quickly explore new ways to optimize GPGPUs for energy efficiency. We propose a new GPGPU power model that is configurable, capable of cycle-level calculations, and carefully validated against real hardware measurements. To achieve configurability, we use a bottom-up methodology and abstract parameters from the microarchitectural components as the model's inputs. We developed a rigorous suite of 80 microbenchmarks that we use to bound any modeling uncertainties and inaccuracies. The power model is comprehensively validated against measurements of two commercially available GPUs, and the measured error is within 9.9% and 13.4% for the two target GPUs (GTX 480 and Quadro FX5600). The model also accurately tracks the power consumption trend over time. We integrated the power model with the cycle-level simulator GPGPU-Sim and demonstrate the energy savings by utilizing dynamic voltage and frequency scaling (DVFS) and clock gating. Traditional DVFS reduces GPU energy consumption by 14.4% by leveraging within-kernel runtime variations. Finer-grained, SM-cluster-level DVFS improves the energy savings from 6.6% to 13.6% for those benchmarks that show clustered execution behavior. We also show that clock gating inactive lanes during divergence reduces dynamic power by 11.2%.
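
The DVFS savings above rest on the standard CMOS dynamic-power relation, P_dyn ≈ α·C·V²·f: scaling voltage and frequency down together shrinks power roughly cubically while, for compute-bound work, runtime stretches only linearly with 1/f. A back-of-the-envelope sketch, assuming compute-bound runtime and illustrative scaling factors:

```python
# Back-of-the-envelope DVFS energy: dynamic power scales ~ V^2 * f and
# compute-bound runtime scales ~ 1/f, so energy scales ~ V^2.
# Scaling factors are illustrative, not from the paper.
def dvfs_energy(v_scale, f_scale, base_power=1.0, base_time=1.0):
    power = base_power * v_scale ** 2 * f_scale
    time = base_time / f_scale
    return power * time

base = dvfs_energy(1.0, 1.0)
scaled = dvfs_energy(v_scale=0.9, f_scale=0.8)  # modest V/f reduction
print(f"energy saving: {(1 - scaled / base):.0%}")  # ~19%
```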

558 citations

Journal ArticleDOI
TL;DR: gem5-gpu is a new simulator that models tightly integrated CPU-GPU systems and is able to simulate many system configurations, ranging from a system with coherent caches and a single virtual address space across the CPU and GPU to a system that maintains separate GPU and CPU physical address spaces.
Abstract: gem5-gpu is a new simulator that models tightly integrated CPU-GPU systems. It builds on gem5, a modular full-system CPU simulator, and GPGPU-Sim, a detailed GPGPU simulator. gem5-gpu routes most memory accesses through Ruby, which is a highly configurable memory system in gem5. By doing this, it is able to simulate many system configurations, ranging from a system with coherent caches and a single virtual address space across the CPU and GPU to a system that maintains separate GPU and CPU physical address spaces. gem5-gpu can run most unmodified CUDA 3.2 source code. Applications can launch non-blocking kernels, allowing the CPU and GPU to execute simultaneously. We present gem5-gpu's software architecture and a brief performance validation. We also discuss possible extensions to the simulator. gem5-gpu is open source and available at gem5-gpu.cs.wisc.edu.

228 citations

Journal ArticleDOI
TL;DR: Combining the power, area, and timing results of McPAT with performance simulation of PARSEC benchmarks for manycore designs at the 22nm technology node shows that 8-core clustering gives the best energy-delay product, whereas when die area is taken into account, 4-core clustering gives the best EDA²P and EDAP.
Abstract: This article introduces McPAT, an integrated power, area, and timing modeling framework that supports comprehensive design space exploration for multicore and manycore processor configurations ranging from 90nm to 22nm and beyond. At the microarchitectural level, McPAT includes models for the fundamental components of a complete chip multiprocessor, including in-order and out-of-order processor cores, networks-on-chip, shared caches, and integrated system components such as memory controllers and Ethernet controllers. At the circuit level, McPAT supports detailed modeling of critical-path timing, area, and power. At the technology level, McPAT models timing, area, and power for the device types forecast in the ITRS roadmap. McPAT has a flexible XML interface to facilitate its use with many performance simulators. Combined with a performance simulator, McPAT enables architects to accurately quantify the cost of new ideas and assess trade-offs of different architectures using new metrics such as the Energy-Delay-Area² Product (EDA²P) and the Energy-Delay-Area Product (EDAP). This article explores the interconnect options of future manycore processors by varying the degree of clustering over generations of process technologies. Clustering brings interesting trade-offs between area and performance, because the interconnects needed to group cores into clusters incur area overhead, but many applications can make good use of them due to synergies from cache sharing. Combining the power, area, and timing results of McPAT with performance simulation of PARSEC benchmarks for manycore designs at the 22nm technology node shows that 8-core clustering gives the best energy-delay product, whereas when die area is taken into account, 4-core clustering gives the best EDA²P and EDAP.
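
EDAP and EDA²P fold die area into the classic energy-delay product: EDAP multiplies energy, delay, and area once each, while EDA²P, as the acronym suggests, squares the area term to penalize die cost more strongly. A worked example with made-up numbers showing how the two metrics can rank designs differently:

```python
# Cost metrics from the McPAT article: EDAP = E * D * A and, as the
# acronym suggests, EDA2P = E * D * A**2 (area weighted quadratically).
# The design numbers below are illustrative, not from the article.
def edap(e, d, a):
    return e * d * a

def eda2p(e, d, a):
    return e * d * a ** 2

# Two hypothetical designs: B is faster and more energy-efficient than
# the normalized baseline A, but 20% larger.
a_metrics = dict(e=1.00, d=1.00, a=1.00)
b_metrics = dict(e=0.95, d=0.80, a=1.20)
print(f"EDAP : A={edap(**a_metrics):.2f}  B={edap(**b_metrics):.2f}")
print(f"EDA2P: A={eda2p(**a_metrics):.2f}  B={eda2p(**b_metrics):.2f}")
# B wins on EDAP (0.91 < 1.00) but loses on EDA2P (1.09 > 1.00),
# illustrating how squaring area flips the trade-off.
```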

201 citations

Proceedings ArticleDOI
07 Feb 2015
TL;DR: A GPU performance and power estimation model that uses machine learning techniques on measurements from real GPU hardware and that, after an initial training phase, runs as fast as or faster than the program running natively on real hardware.
Abstract: Graphics Processing Units (GPUs) have numerous configuration and design options, including core frequency, number of parallel compute units (CUs), and available memory bandwidth. At many stages of the design process, it is important to estimate how application performance and power are impacted by these options. This paper describes a GPU performance and power estimation model that uses machine learning techniques on measurements from real GPU hardware. The model is trained on a collection of applications that are run at numerous different hardware configurations. From the measured performance and power data, the model learns how applications scale as the GPU's configuration is changed. Hardware performance counter values are then gathered when running a new application on a single GPU configuration. These dynamic counter values are fed into a neural network that predicts which scaling curve from the training data best represents this kernel. This scaling curve is then used to estimate the performance and power of the new application at different GPU configurations. Over an 8× range of the number of CUs, a 3.3× range of core frequencies, and a 2.9× range of memory bandwidth, our model's performance and power estimates are accurate to within 15% and 10% of real hardware, respectively. This is comparable to the accuracy of cycle-level simulators. However, after an initial training phase, our model runs as fast as, or faster than, the program running natively on real hardware.
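
The core idea, matching a new kernel's performance-counter signature to the closest scaling curve seen in training, can be sketched with a simple nearest-neighbor lookup. The paper trains a neural network for this classification step; nearest-neighbor stands in here for brevity, and all counter values and curves below are fabricated:

```python
# Sketch of the scaling-curve idea: map a kernel's hardware-counter
# vector to the closest counter vector seen in training, then reuse that
# kernel's measured performance-vs-CU scaling curve.
import math

# training data: counter signature -> normalized speedup at 8/16/32 CUs
TRAINING = {
    (0.9, 0.1): [1.0, 1.9, 3.6],  # compute-bound kernel: scales well
    (0.2, 0.8): [1.0, 1.2, 1.3],  # bandwidth-bound kernel: scales poorly
}

def predict_scaling(counters):
    key = min(TRAINING, key=lambda k: math.dist(k, counters))
    return TRAINING[key]

print(predict_scaling((0.8, 0.2)))  # -> [1.0, 1.9, 3.6]
```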

196 citations

Proceedings ArticleDOI
25 Mar 2016
TL;DR: OpenPiton is the world's first open-source, general-purpose, multithreaded manycore processor and framework; it leverages the industry-hardened OpenSPARC T1 core with modifications and builds upon it with a scratch-built, scalable uncore, creating a flexible, modern manycore design.
Abstract: Industry is building larger, more complex manycore processors on the back of strong institutional knowledge, but academic projects face difficulties in replicating that scale. To alleviate these difficulties and to develop and share knowledge, the community needs open architecture frameworks for simulation, synthesis, and software exploration that support extensibility, scalability, and configurability, alongside an established base of verification tools and supported software. In this paper, we present OpenPiton, an open-source framework for building scalable architecture research prototypes from 1 core to 500 million cores. OpenPiton is the world's first open-source, general-purpose, multithreaded manycore processor and framework. OpenPiton leverages the industry-hardened OpenSPARC T1 core with modifications and builds upon it with a scratch-built, scalable uncore, creating a flexible, modern manycore design. In addition, OpenPiton provides synthesis and backend scripts for ASIC and FPGA flows to enable other researchers to bring their designs to implementation. OpenPiton provides a complete verification infrastructure of over 8,000 tests, is supported by mature software tools, runs full-stack multiuser Debian Linux, and is written in industry-standard Verilog. Multiple implementations of OpenPiton have been created, including a taped-out 25-core implementation in IBM's 32nm process and multiple Xilinx FPGA prototypes.

165 citations