Author

Rafael Ubal

Bio: Rafael Ubal is an academic researcher from Northeastern University. The author has contributed to research in topics including Microarchitecture and Instruction set. The author has an h-index of 10 and has co-authored 28 publications receiving 788 citations. Previous affiliations of Rafael Ubal include the Polytechnic University of Valencia.

Papers
Proceedings ArticleDOI
19 Sep 2012
TL;DR: This paper presents Multi2Sim, an open-source, modular, and fully configurable toolset that enables ISA-level simulation of an x86 CPU and an AMD Evergreen GPU, and addresses program emulation correctness, as well as architectural simulation accuracy, using AMD's OpenCL benchmark suite.
Abstract: Accurate simulation is essential for the proper design and evaluation of any computing platform. With the current move toward the CPU-GPU heterogeneous computing era, researchers need a simulation framework that can model both kinds of computing devices and their interaction. In this paper, we present Multi2Sim, an open-source, modular, and fully configurable toolset that enables ISA-level simulation of an x86 CPU and an AMD Evergreen GPU. Focusing on a model of the AMD Radeon 5870 GPU, we address program emulation correctness, as well as architectural simulation accuracy, using AMD's OpenCL benchmark suite. Simulation capabilities are demonstrated with a preliminary architectural exploration study and workload characterization examples. The project source code, benchmark packages, and a detailed user's guide are publicly available at www.multi2sim.org.
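
As a rough illustration of the architectural exploration the paper demonstrates, a configuration sweep can be scripted around the simulator binary. This is a sketch only: the m2s binary and the --evg-sim, --evg-config, and --evg-report options are assumed from the Multi2Sim 4.x user's guide, and the benchmark and config file names are hypothetical; consult www.multi2sim.org for the exact options of your release.

```python
# Sketch: sweep GPU configurations with Multi2Sim and collect reports.
# Assumes 'm2s' is on PATH and that '--evg-sim detailed', '--evg-config',
# and '--evg-report' match your Multi2Sim release; check the user's guide
# at www.multi2sim.org before relying on these flag names.
import subprocess

BENCHMARK = "./MatrixMultiplication"       # hypothetical AMD SDK benchmark binary
CONFIGS = ["gpu-8cu.ini", "gpu-16cu.ini"]  # hypothetical GPU config files

for cfg in CONFIGS:
    report = f"report-{cfg}.txt"
    subprocess.run(
        ["m2s", "--evg-sim", "detailed",
         "--evg-config", cfg,
         "--evg-report", report,
         BENCHMARK],
        check=True,
    )
    print(f"{cfg}: wrote {report}")
```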

440 citations

Proceedings ArticleDOI
19 Nov 2007
TL;DR: The Multi2Sim simulation framework is presented, which models the major components of upcoming systems and is intended to address the limitations of existing simulators.
Abstract: Current microprocessors are based on complex designs that integrate different components on a single chip, such as hardware threads, processor cores, the memory hierarchy, and interconnection networks. The constant need to evaluate new designs for each of these components motivates the development of tools that simulate the system working as a whole. In this paper, we present the Multi2Sim simulation framework, which models the major components of upcoming systems and is intended to address the limitations of existing simulators. A set of simulation examples is also included for illustrative purposes.

164 citations

Proceedings ArticleDOI
22 Jun 2019
TL;DR: This work presents MGPUSim, a cycle-accurate, extensively validated, multi-GPU simulator based on AMD's Graphics Core Next 3 (GCN3) instruction set architecture, and proposes the Locality API, an API extension that allows the GPU programmer to avoid the complexity of multi-GPU programming while precisely controlling data placement in multi-GPU memory.
Abstract: The rapidly growing popularity and scale of data-parallel workloads demand a corresponding increase in the raw computational power of Graphics Processing Units (GPUs). As single-GPU platforms struggle to satisfy these performance demands, multi-GPU platforms have started to dominate the high-performance computing world. The advent of such systems raises a number of design challenges, including the GPU microarchitecture, multi-GPU interconnect fabric, runtime libraries, and associated programming models. The research community currently lacks a publicly available and comprehensive multi-GPU simulation framework to evaluate next-generation multi-GPU system designs. In this work, we present MGPUSim, a cycle-accurate, extensively validated, multi-GPU simulator based on AMD's Graphics Core Next 3 (GCN3) instruction set architecture. MGPUSim comes with built-in support for multi-threaded execution to enable fast, parallelized, and accurate simulation. In terms of performance accuracy, MGPUSim differs by only 5.5% on average from the actual GPU hardware. We also achieve a 3.5x and a 2.5x average speedup running functional emulation and detailed timing simulation, respectively, on a 4-core CPU, while delivering the same accuracy as serial simulation. We illustrate the flexibility and capability of the simulator through two concrete design studies. In the first, we propose the Locality API, an API extension that allows the GPU programmer to avoid the complexity of multi-GPU programming while precisely controlling data placement in multi-GPU memory. In the second design study, we propose Progressive Page Splitting Migration (PASI), a customized multi-GPU memory management system enabling the hardware to progressively improve data placement. For a discrete 4-GPU system, we observe that the Locality API can speed up the system by 1.6x (geometric mean), and PASI can improve system performance by 2.6x (geometric mean) across all benchmarks, compared to a unified 4-GPU platform.
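
The speedups above are reported as geometric means across benchmarks, the standard way to summarize speedup ratios. A quick illustration of that computation; the per-benchmark numbers below are made up, not from the paper:

```python
# Geometric mean of per-benchmark speedups: the summary statistic behind
# the 1.6x (Locality API) and 2.6x (PASI) claims above.
# The individual speedups here are illustrative only.
import math

speedups = [1.2, 1.8, 2.9, 3.1, 2.2]
geomean = math.exp(sum(math.log(s) for s in speedups) / len(speedups))
print(f"geometric-mean speedup: {geomean:.2f}x")  # ~2.12x for these values
```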

51 citations

Proceedings ArticleDOI
08 Jun 2015
TL;DR: This paper advocates using silicon-photonic link technology for on-chip communication in GPUs, and presents the first GPU-specific analysis of a cost-effective hybrid photonic crossbar NoC.
Abstract: Silicon-photonic link technology promises to satisfy the growing need for high-bandwidth, low-latency, and energy-efficient network-on-chip (NoC) architectures. While silicon-photonic NoC designs have been extensively studied for future many-core systems, their use in massively-threaded GPUs has received little attention to date. In this paper, we first analyze an electrical NoC which connects different cache levels (L1 to L2) in a contemporary GPU memory hierarchy. Evaluating workloads from the AMD SDK run on the Multi2Sim GPU simulator finds that, apart from limits in memory bandwidth, an electrical NoC can significantly hamper performance and impede scalability, especially as the number of compute units grows in future GPU systems. To address this issue, we advocate using silicon-photonic link technology for on-chip communication in GPUs, and we present the first GPU-specific analysis of a cost-effective hybrid photonic crossbar NoC. Our baseline is based on an AMD Southern Islands GPU with 32 compute units (CUs), and we compare this design to our proposed hybrid silicon-photonic NoC. Our proposed photonic hybrid NoC increases performance by up to 6x (2.7x on average) and reduces the energy-delay² product (ED²P) by up to 99% (79% on average) compared to conventional electrical crossbars. For future GPU systems, we study an electrical 2D-mesh topology, since it scales better than an electrical crossbar. For a 128-CU GPU, the proposed hybrid silicon-photonic NoC can improve performance by up to 1.9x (43% on average) and achieve up to a 62% reduction in ED²P (3% on average) in comparison to the best-performing mesh design.
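
ED²P here is the usual energy-delay² product, E·D², which weights latency more heavily than energy. A small worked example of how a percentage reduction in ED²P follows from energy and delay changes; the scaling factors are illustrative, not taken from the paper:

```python
# Energy-delay^2 product: ED2P = energy * delay**2.
# Illustrative numbers: a NoC that cuts delay to 0.4x and energy to 0.8x
# of the electrical baseline cuts ED2P by ~87%.
def ed2p(energy, delay):
    return energy * delay ** 2

base = ed2p(energy=1.0, delay=1.0)      # normalized electrical crossbar
photonic = ed2p(energy=0.8, delay=0.4)  # hypothetical photonic NoC
print(f"ED2P reduction: {(1 - photonic / base):.0%}")  # ~87%
```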

31 citations

Journal ArticleDOI
TL;DR: Using UMH with NMOESI improves the performance of a CPU-multiGPU system by at least 1.92× in comparison to alternative software-based approaches, and allows the CPU to access GPU-modified data at least 13× faster.
Abstract: In this article, we describe how to ease memory management between a Central Processing Unit (CPU) and one or multiple discrete Graphics Processing Units (GPUs) by architecting a novel hardware-based Unified Memory Hierarchy (UMH). Adopting UMH, a GPU accesses the CPU memory only if it does not find its required data in the directories associated with its high-bandwidth memory, or if the NMOESI coherency protocol limits access to that data. Using UMH with NMOESI improves the performance of a CPU-multiGPU system by at least 1.92× in comparison to alternative software-based approaches. It also allows the CPU to access GPU-modified data at least 13× faster.
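
The access path described above, where a GPU consults the directories of its local high-bandwidth memory before falling back to CPU memory, can be sketched as a simple decision rule. This is only an illustration of the logic stated in the abstract, not the UMH implementation; all names are hypothetical, and the sketch collapses NMOESI (which adds an N, non-coherent, state to MOESI) into "readable vs. not":

```python
# Illustrative sketch: serve a GPU request from local high-bandwidth
# memory (HBM) when the directory holds the block in a state the
# coherence protocol allows it to read; otherwise fall back to CPU memory.
READABLE_STATES = {"M", "O", "E", "S"}  # "I" (Invalid) forces a fallback

def gpu_access(addr, hbm_directory):
    state = hbm_directory.get(addr, "I")
    if state in READABLE_STATES:
        return "serve from local HBM"
    return "access CPU memory"

directory = {0x1000: "S", 0x2000: "I"}
print(gpu_access(0x1000, directory))  # serve from local HBM
print(gpu_access(0x2000, directory))  # access CPU memory
print(gpu_access(0x3000, directory))  # access CPU memory (not present)
```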

26 citations


Cited by
Proceedings ArticleDOI
23 Jun 2013
TL;DR: This work proposes a new GPGPU power model that is configurable, capable of cycle-level calculations, and carefully validated against real hardware measurements, and that accurately tracks the power consumption trend over time.
Abstract: General-purpose GPUs (GPGPUs) are becoming prevalent in mainstream computing, and performance per watt has emerged as a more crucial evaluation metric than peak performance. As such, GPU architects require robust tools that will enable them to quickly explore new ways to optimize GPGPUs for energy efficiency. We propose a new GPGPU power model that is configurable, capable of cycle-level calculations, and carefully validated against real hardware measurements. To achieve configurability, we use a bottom-up methodology and abstract parameters from the microarchitectural components as the model's inputs. We developed a rigorous suite of 80 microbenchmarks that we use to bound any modeling uncertainties and inaccuracies. The power model is comprehensively validated against measurements of two commercially available GPUs, and the measured error is within 9.9% and 13.4% for the two target GPUs (GTX 480 and Quadro FX5600). The model also accurately tracks the power consumption trend over time. We integrated the power model with the cycle-level simulator GPGPU-Sim and demonstrate the energy savings by utilizing dynamic voltage and frequency scaling (DVFS) and clock gating. Traditional DVFS reduces GPU energy consumption by 14.4% by leveraging within-kernel runtime variations. Finer-grained, SM-cluster-level DVFS improves the energy savings from 6.6% to 13.6% for those benchmarks that show clustered execution behavior. We also show that clock gating inactive lanes during divergence reduces dynamic power by 11.2%.
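
The DVFS savings above rest on the standard CMOS dynamic-power relation, P_dyn ≈ α·C·V²·f: scaling voltage and frequency down together shrinks power roughly cubically while, for compute-bound work, runtime stretches only linearly with 1/f. A back-of-the-envelope sketch, assuming compute-bound runtime and illustrative scaling factors:

```python
# Back-of-the-envelope DVFS energy: dynamic power scales ~ V^2 * f and
# compute-bound runtime scales ~ 1/f, so energy scales ~ V^2.
# Scaling factors are illustrative, not from the paper.
def dvfs_energy(v_scale, f_scale, base_power=1.0, base_time=1.0):
    power = base_power * v_scale ** 2 * f_scale
    time = base_time / f_scale
    return power * time

base = dvfs_energy(1.0, 1.0)
scaled = dvfs_energy(v_scale=0.9, f_scale=0.8)  # modest V/f reduction
print(f"energy saving: {(1 - scaled / base):.0%}")  # ~19%
```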

558 citations

Journal ArticleDOI
TL;DR: gem5-gpu is a new simulator that models tightly integrated CPU-GPU systems and is able to simulate many system configurations, ranging from a system with coherent caches and a single virtual address space across the CPU and GPU to a system that maintains separate GPU and CPU physical address spaces.
Abstract: gem5-gpu is a new simulator that models tightly integrated CPU-GPU systems. It builds on gem5, a modular full-system CPU simulator, and GPGPU-Sim, a detailed GPGPU simulator. gem5-gpu routes most memory accesses through Ruby, which is a highly configurable memory system in gem5. By doing this, it is able to simulate many system configurations, ranging from a system with coherent caches and a single virtual address space across the CPU and GPU to a system that maintains separate GPU and CPU physical address spaces. gem5-gpu can run most unmodified CUDA 3.2 source code. Applications can launch non-blocking kernels, allowing the CPU and GPU to execute simultaneously. We present gem5-gpu's software architecture and a brief performance validation. We also discuss possible extensions to the simulator. gem5-gpu is open source and available at gem5-gpu.cs.wisc.edu.

228 citations

Journal ArticleDOI
TL;DR: Combining the power, area, and timing results of McPAT with performance simulation of PARSEC benchmarks for manycore designs at the 22nm technology node shows that 8-core clustering gives the best energy-delay product, whereas when die area is taken into account, 4-core clustering gives the best EDA²P and EDAP.
Abstract: This article introduces McPAT, an integrated power, area, and timing modeling framework that supports comprehensive design space exploration for multicore and manycore processor configurations ranging from 90nm to 22nm and beyond. At the microarchitectural level, McPAT includes models for the fundamental components of a complete chip multiprocessor, including in-order and out-of-order processor cores, networks-on-chip, shared caches, and integrated system components such as memory controllers and Ethernet controllers. At the circuit level, McPAT supports detailed modeling of critical-path timing, area, and power. At the technology level, McPAT models timing, area, and power for the device types forecast in the ITRS roadmap. McPAT has a flexible XML interface to facilitate its use with many performance simulators. Combined with a performance simulator, McPAT enables architects to accurately quantify the cost of new ideas and assess trade-offs of different architectures using new metrics such as the Energy-Delay-Area² Product (EDA²P) and the Energy-Delay-Area Product (EDAP). This article explores the interconnect options of future manycore processors by varying the degree of clustering over generations of process technologies. Clustering brings interesting trade-offs between area and performance, because the interconnects needed to group cores into clusters incur area overhead, but many applications can make good use of them due to synergies from cache sharing. Combining the power, area, and timing results of McPAT with performance simulation of PARSEC benchmarks for manycore designs at the 22nm technology node shows that 8-core clustering gives the best energy-delay product, whereas when die area is taken into account, 4-core clustering gives the best EDA²P and EDAP.
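
EDAP and EDA²P fold die area into the classic energy-delay product: EDAP multiplies energy, delay, and area once each, while EDA²P, as the acronym suggests, squares the area term to penalize die cost more strongly. A worked example with made-up numbers showing how the two metrics can rank designs differently:

```python
# Cost metrics from the McPAT article: EDAP = E * D * A and, as the
# acronym suggests, EDA2P = E * D * A**2 (area weighted quadratically).
# The design numbers below are illustrative, not from the article.
def edap(e, d, a):
    return e * d * a

def eda2p(e, d, a):
    return e * d * a ** 2

# Two hypothetical designs: B is faster and more energy-efficient than
# the normalized baseline A, but 20% larger.
a_metrics = dict(e=1.00, d=1.00, a=1.00)
b_metrics = dict(e=0.95, d=0.80, a=1.20)
print(f"EDAP : A={edap(**a_metrics):.2f}  B={edap(**b_metrics):.2f}")
print(f"EDA2P: A={eda2p(**a_metrics):.2f}  B={eda2p(**b_metrics):.2f}")
# B wins on EDAP (0.91 < 1.00) but loses on EDA2P (1.09 > 1.00),
# illustrating how squaring area flips the trade-off.
```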

201 citations

Proceedings ArticleDOI
07 Feb 2015
TL;DR: A GPU performance and power estimation model that uses machine learning techniques on measurements from real GPU hardware and that, after an initial training phase, runs as fast as or faster than the program running natively on real hardware.
Abstract: Graphics Processing Units (GPUs) have numerous configuration and design options, including core frequency, number of parallel compute units (CUs), and available memory bandwidth. At many stages of the design process, it is important to estimate how application performance and power are impacted by these options. This paper describes a GPU performance and power estimation model that uses machine learning techniques on measurements from real GPU hardware. The model is trained on a collection of applications that are run at numerous different hardware configurations. From the measured performance and power data, the model learns how applications scale as the GPU's configuration is changed. Hardware performance counter values are then gathered when running a new application on a single GPU configuration. These dynamic counter values are fed into a neural network that predicts which scaling curve from the training data best represents this kernel. This scaling curve is then used to estimate the performance and power of the new application at different GPU configurations. Over an 8× range of the number of CUs, a 3.3× range of core frequencies, and a 2.9× range of memory bandwidth, our model's performance and power estimates are accurate to within 15% and 10% of real hardware, respectively. This is comparable to the accuracy of cycle-level simulators. However, after an initial training phase, our model runs as fast as, or faster than, the program running natively on real hardware.
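
The core idea, matching a new kernel's performance-counter signature to the closest scaling curve seen in training, can be sketched with a simple nearest-neighbor lookup. The paper trains a neural network for this classification step; nearest-neighbor stands in here for brevity, and all counter values and curves below are fabricated:

```python
# Sketch of the scaling-curve idea: map a kernel's hardware-counter
# vector to the closest counter vector seen in training, then reuse that
# kernel's measured performance-vs-CU scaling curve.
import math

# training data: counter signature -> normalized speedup at 8/16/32 CUs
TRAINING = {
    (0.9, 0.1): [1.0, 1.9, 3.6],  # compute-bound kernel: scales well
    (0.2, 0.8): [1.0, 1.2, 1.3],  # bandwidth-bound kernel: scales poorly
}

def predict_scaling(counters):
    key = min(TRAINING, key=lambda k: math.dist(k, counters))
    return TRAINING[key]

print(predict_scaling((0.8, 0.2)))  # -> [1.0, 1.9, 3.6]
```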

196 citations

Proceedings ArticleDOI
25 Mar 2016
TL;DR: OpenPiton is the world's first open-source, general-purpose, multithreaded manycore processor and framework; it leverages the industry-hardened OpenSPARC T1 core with modifications and builds upon it with a scratch-built, scalable uncore, creating a flexible, modern manycore design.
Abstract: Industry is building larger, more complex manycore processors on the back of strong institutional knowledge, but academic projects face difficulties in replicating that scale. To alleviate these difficulties and to develop and share knowledge, the community needs open architecture frameworks for simulation, synthesis, and software exploration that support extensibility, scalability, and configurability, alongside an established base of verification tools and supported software. In this paper, we present OpenPiton, an open-source framework for building scalable architecture research prototypes from 1 core to 500 million cores. OpenPiton is the world's first open-source, general-purpose, multithreaded manycore processor and framework. OpenPiton leverages the industry-hardened OpenSPARC T1 core with modifications and builds upon it with a scratch-built, scalable uncore, creating a flexible, modern manycore design. In addition, OpenPiton provides synthesis and backend scripts for ASIC and FPGA flows to enable other researchers to bring their designs to implementation. OpenPiton provides a complete verification infrastructure of over 8,000 tests, is supported by mature software tools, runs full-stack multiuser Debian Linux, and is written in industry-standard Verilog. Multiple implementations of OpenPiton have been created, including a taped-out 25-core implementation in IBM's 32nm process and multiple Xilinx FPGA prototypes.

165 citations