Author

Yuhui Bao

Bio: Yuhui Bao is an academic researcher from Northeastern University. The author has contributed to research in topics: Instruction set & Microarchitecture. The author has an h-index of 2 and has co-authored 2 publications receiving 39 citations.

Papers
Proceedings ArticleDOI
22 Jun 2019
TL;DR: This work presents MGPUSim, a cycle-accurate, extensively validated multi-GPU simulator based on AMD's Graphics Core Next 3 (GCN3) instruction set architecture, and proposes the Locality API, an API extension that allows the GPU programmer to avoid the complexity of multi-GPU programming while precisely controlling data placement in multi-GPU memory.
Abstract: The rapidly growing popularity and scale of data-parallel workloads demand a corresponding increase in raw computational power of Graphics Processing Units (GPUs). As single-GPU platforms struggle to satisfy these performance demands, multi-GPU platforms have started to dominate the high-performance computing world. The advent of such systems raises a number of design challenges, including the GPU microarchitecture, multi-GPU interconnect fabric, runtime libraries, and associated programming models. The research community currently lacks a publicly available and comprehensive multi-GPU simulation framework to evaluate next-generation multi-GPU system designs. In this work, we present MGPUSim, a cycle-accurate, extensively validated, multi-GPU simulator, based on AMD's Graphics Core Next 3 (GCN3) instruction set architecture. MGPUSim comes with in-built support for multi-threaded execution to enable fast, parallelized, and accurate simulation. In terms of performance accuracy, MGPUSim differs by only 5.5% on average from the actual GPU hardware. We also achieve a 3.5x and a 2.5x average speedup running functional emulation and detailed timing simulation, respectively, on a 4-core CPU, while delivering the same accuracy as serial simulation. We illustrate the flexibility and capability of the simulator through two concrete design studies. In the first, we propose the Locality API, an API extension that allows the GPU programmer to avoid the complexity of multi-GPU programming while precisely controlling data placement in multi-GPU memory. In the second design study, we propose Progressive Page Splitting Migration (PASI), a customized multi-GPU memory management system enabling the hardware to progressively improve data placement. For a discrete 4-GPU system, we observe that the Locality API can speed up the system by 1.6x (geometric mean), and PASI can improve the system performance by 2.6x (geometric mean) across all benchmarks, compared to a unified 4-GPU platform.
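The abstract does not show the Locality API's actual calls, which belong to MGPUSim's own OpenCL-like runtime. As a rough CUDA analogue of the same idea, the sketch below pins each quarter of one shared buffer to a different GPU of a multi-GPU node using standard unified-memory placement hints; the partitioning scheme and buffer size are illustrative assumptions, not MGPUSim's API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Sketch: place each quarter of a managed buffer on a different GPU,
// mirroring programmer-controlled data placement (plain CUDA UM, not the
// Locality API itself).
int main() {
    int numGpus = 0;
    cudaGetDeviceCount(&numGpus);               // e.g. 4 on a discrete 4-GPU system
    if (numGpus < 1) return 0;

    const size_t n = 1 << 26;                   // 64M floats shared across GPUs
    float *buf = nullptr;
    cudaMallocManaged(&buf, n * sizeof(float)); // one unified allocation

    size_t chunk = n / numGpus;
    for (int d = 0; d < numGpus; ++d) {
        float *part = buf + d * chunk;
        size_t bytes = chunk * sizeof(float);
        // Hint: keep this slice resident in GPU d's memory...
        cudaMemAdvise(part, bytes, cudaMemAdviseSetPreferredLocation, d);
        // ...and move it there now rather than on first touch.
        cudaMemPrefetchAsync(part, bytes, d, /*stream=*/0);
    }
    for (int d = 0; d < numGpus; ++d) { cudaSetDevice(d); cudaDeviceSynchronize(); }

    printf("placed %zu floats across %d GPUs\n", n, numGpus);
    cudaFree(buf);
    return 0;
}
```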

51 citations

Posted Content
TL;DR: MGSim, a cycle-accurate, extensively validated multi-GPU simulator based on AMD's Graphics Core Next 3 (GCN3) instruction set architecture, is presented, and the design implications evaluated in this work suggest that D-MGPU is an attractive programming model for future multi-GPU systems.
Abstract: The rapidly growing popularity and scale of data-parallel workloads demand a corresponding increase in raw computational power of GPUs (Graphics Processing Units). As single-GPU systems struggle to satisfy the performance demands, multi-GPU systems have begun to dominate the high-performance computing world. The advent of such systems raises a number of design challenges, including the GPU microarchitecture, multi-GPU interconnect fabrics, runtime libraries and associated programming models. The research community currently lacks a publicly available and comprehensive multi-GPU simulation framework and benchmark suite to evaluate multi-GPU system design solutions. In this work, we present MGSim, a cycle-accurate, extensively validated, multi-GPU simulator, based on AMD's Graphics Core Next 3 (GCN3) instruction set architecture. We complement MGSim with MGMark, a suite of multi-GPU workloads that explores multi-GPU collaborative execution patterns. Our simulator is scalable and comes with in-built support for multi-threaded execution to enable fast and efficient simulations. In terms of performance accuracy, MGSim differs by 5.5% on average when compared against actual GPU hardware. We also achieve a 3.5x and a 2.5x average speedup in functional emulation and architectural simulation with 4 CPU cores, while delivering the same accuracy as the serial simulation. We illustrate the novel simulation capabilities provided by our simulator through a case study exploring programming models based on a unified multi-GPU system (U-MGPU) and a discrete multi-GPU system (D-MGPU) that both utilize unified memory space and cross-GPU memory access. We evaluate the design implications from our case study, suggesting that D-MGPU is an attractive programming model for future multi-GPU systems.
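The D-MGPU model in the case study relies on a unified address space with direct cross-GPU memory access. The CUDA sketch below is a loose stand-in for that mechanism on real hardware, not the paper's runtime: it enables peer access so a kernel running on GPU 0 can dereference a buffer that physically resides on GPU 1. The kernel and buffer names are made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: runs on GPU 0 but reads/writes remote memory on GPU 1.
__global__ void scaleRemote(float *remote, int n, float k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) remote[i] *= k;                  // access crosses the GPU interconnect
}

int main() {
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, /*device=*/0, /*peerDevice=*/1);
    if (!canAccess) { printf("no peer access between GPU 0 and GPU 1\n"); return 0; }

    const int n = 1 << 20;
    float *bufOnGpu1 = nullptr;
    cudaSetDevice(1);
    cudaMalloc(&bufOnGpu1, n * sizeof(float));  // physically resides on GPU 1
    cudaMemset(bufOnGpu1, 0, n * sizeof(float));

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(/*peerDevice=*/1, 0);   // map GPU 1's memory into GPU 0
    scaleRemote<<<(n + 255) / 256, 256>>>(bufOnGpu1, n, 2.0f);
    cudaDeviceSynchronize();

    cudaSetDevice(1);
    cudaFree(bufOnGpu1);
    return 0;
}
```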

9 citations

Proceedings ArticleDOI
08 Oct 2022
TL;DR: NaviSim, as discussed by the authors, is a cycle-level GPU simulator that models AMD RDNA GPUs and accurately reproduces the GPU's kernel execution time, matching hardware execution within 9.92% on average.
Abstract: As GPUs continue to grow in popularity for accelerating demanding applications, such as high-performance computing and machine learning, GPU architects need to deliver more powerful devices with updated instruction set architectures (ISAs) and new microarchitectural features. The introduction of the AMD RDNA architecture is one example where the GPU architecture was dramatically changed, modifying the underlying programming model, the core architecture, and the cache hierarchy. To date, no publicly-available simulator infrastructure can model the AMD RDNA GPU, preventing researchers from exploring new GPU designs based on the state-of-the-art RDNA architecture. In this paper, we present the NaviSim simulator, the first cycle-level GPU simulator framework that models AMD RDNA GPUs. NaviSim faithfully emulates the new RDNA ISA. We extensively tune and validate NaviSim using several microbenchmarks and 10 full workloads. Our evaluation shows that NaviSim can accurately model the GPU's kernel execution time, achieving similar performance to hardware execution within 9.92% (on average), as measured on an AMD RX 5500 XT GPU and an AMD Radeon Pro W6800 GPU. To demonstrate the full utility of the NaviSim simulator, we carry out a performance study of the impact of individual RDNA features, attempting to understand better the design decisions behind these features. We carry out a number of experiments to isolate each RDNA feature and evaluate its impact on overall performance, as well as demonstrate the usability and flexibility of NaviSim.
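NaviSim is tuned and validated with microbenchmarks, though the abstract does not list them. A typical example of the genre is a dependent-load (pointer-chasing) kernel that exposes memory latency so simulated and measured cycle counts can be compared; the CUDA sketch below is a generic version of that idea (on RDNA hardware one would use the HIP equivalents), not code from the paper, and its sizes and strides are arbitrary assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Generic latency microbenchmark: a single thread chases a pointer chain so
// every load depends on the previous one, exposing average memory latency.
__global__ void chase(const unsigned *next, int hops, long long *cycles,
                      unsigned *sink) {
    unsigned idx = 0;
    long long start = clock64();
    for (int i = 0; i < hops; ++i) idx = next[idx];   // serialized, dependent loads
    long long stop = clock64();
    *cycles = stop - start;
    *sink = idx;                  // keep the loop from being optimized away
}

int main() {
    const int n = 1 << 20, hops = 1 << 16, stride = 4096 / sizeof(unsigned);
    unsigned *h = new unsigned[n];
    for (int i = 0; i < n; ++i) h[i] = (i + stride) % n;  // strided ring of indices

    unsigned *d_next, *d_sink; long long *d_cycles;
    cudaMalloc(&d_next, n * sizeof(unsigned));
    cudaMalloc(&d_sink, sizeof(unsigned));
    cudaMalloc(&d_cycles, sizeof(long long));
    cudaMemcpy(d_next, h, n * sizeof(unsigned), cudaMemcpyHostToDevice);

    chase<<<1, 1>>>(d_next, hops, d_cycles, d_sink);
    long long cycles = 0;
    cudaMemcpy(&cycles, d_cycles, sizeof(long long), cudaMemcpyDeviceToHost);
    printf("avg latency: %.1f cycles/load\n", (double)cycles / hops);

    cudaFree(d_next); cudaFree(d_sink); cudaFree(d_cycles); delete[] h;
    return 0;
}
```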

Cited by
Proceedings ArticleDOI
30 May 2020
TL;DR: A new GPU simulator frontend is introduced that minimizes the effort required to simulate different machine ISAs through trace-driven simulation of NVIDIA’s native machine ISA, while still supporting execution- driven simulation of the virtual ISA.
Abstract: In computer architecture, significant innovation frequently comes from industry. However, the simulation tools used by industry are often not released for open use, and even when they are, the exact details of industrial designs are not disclosed. As a result, research in the architecture space must ensure that assumptions about contemporary processor design remain true. To help bridge the gap between opaque industrial innovation and public research, we introduce three mechanisms that make it much easier for GPU simulators to keep up with industry. First, we introduce a new GPU simulator frontend that minimizes the effort required to simulate different machine ISAs through trace-driven simulation of NVIDIA's native machine ISA, while still supporting execution-driven simulation of the virtual ISA. Second, we extensively update GPGPU-Sim's performance model to increase its level of detail, configurability and accuracy. Finally, surrounding the new frontend and flexible performance model is an infrastructure that enables quick, detailed validation. A comprehensive set of microbenchmarks and automated correlation plotting ease the modeling process. We use these three new mechanisms to build Accel-Sim, a detailed simulation framework that decreases cycle error by 79 percentage points, over a wide range of 80 workloads, consisting of 1,945 kernel instances. We further demonstrate that Accel-Sim is able to simulate benchmark suites that no other open-source simulator can. In particular, we use Accel-Sim to simulate an additional 60 workloads, comprising 11,440 kernel instances, from the machine learning benchmark suite Deepbench. Deepbench makes use of closed-source, hand-tuned kernels with no virtual ISA implementation. Using a rigorous counter-by-counter analysis, we validate Accel-Sim against contemporary GPUs. Finally, to highlight the effects of falling behind industry, this paper presents two case studies that demonstrate how incorrect baseline assumptions can hide new areas of opportunity and lead to potentially incorrect design decisions.
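The "counter-by-counter analysis" amounts to comparing each simulated statistic against the matching hardware counter across many kernels. Below is a minimal C++ sketch of that comparison, using hypothetical per-kernel cycle counts and reporting mean absolute percentage error plus Pearson correlation, the same style of metrics correlation plots are built from; it is an illustration, not Accel-Sim's actual tooling.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Compare simulated vs. hardware cycle counts for a set of kernels and report
// mean absolute percentage error (MAPE) and Pearson correlation.
void correlate(const std::vector<double> &sim, const std::vector<double> &hw) {
    const size_t n = sim.size();
    double mape = 0, sumS = 0, sumH = 0;
    for (size_t i = 0; i < n; ++i) {
        mape += std::fabs(sim[i] - hw[i]) / hw[i];
        sumS += sim[i]; sumH += hw[i];
    }
    mape = 100.0 * mape / n;
    double meanS = sumS / n, meanH = sumH / n, cov = 0, varS = 0, varH = 0;
    for (size_t i = 0; i < n; ++i) {
        cov  += (sim[i] - meanS) * (hw[i] - meanH);
        varS += (sim[i] - meanS) * (sim[i] - meanS);
        varH += (hw[i] - meanH) * (hw[i] - meanH);
    }
    printf("cycle error: %.1f%%  correlation: %.3f\n",
           mape, cov / std::sqrt(varS * varH));
}

int main() {
    // Hypothetical per-kernel cycle counts (simulator vs. profiled hardware).
    std::vector<double> sim = {1.2e6, 3.4e6, 9.8e5, 2.1e6};
    std::vector<double> hw  = {1.1e6, 3.6e6, 1.0e6, 2.0e6};
    correlate(sim, hw);
    return 0;
}
```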

130 citations

Proceedings ArticleDOI
01 Oct 2020
TL;DR: VANS is developed, which models the sophisticated microarchitecture design of the Optane DIMM and is validated by comparing with the detailed performance characteristics of Optane-DIMM-attached Intel servers; two architectural optimizations on top of the Optane DIMM, Lazy Cache and Pre-translation, are also developed, which significantly improve cloud workload performance.
Abstract: Scalable server-grade non-volatile RAM (NVRAM) DIMMs became commercially available with the release of Intel’s Optane DIMM. Recent studies on Optane DIMM systems unveil discrepant performance characteristics, compared to what many researchers assumed before the product release. Most of these studies focus on system software design and performance analysis. To thoroughly analyze the source of this discrepancy and facilitate real-NVRAM-aware architecture design, we propose a framework that characterizes and models Optane DIMM’s microarchitecture. Our framework consists of a Low-level profilEr for Non-volatile memory Systems (LENS) and a Validated cycle-Accurate NVRAM Simulator (VANS). LENS allows us to comprehensively analyze the performance attributes and reverse engineer NVRAM microarchitectures. Based on LENS characterization, we develop VANS, which models the sophisticated microarchitecture design of Optane DIMM, and is validated by comparing with the detailed performance characteristics of Optane-DIMM-attached Intel servers. VANS adopts a modular design that can be easily modified to extend to other NVRAM architecture designs; it can also be attached to full-system simulators, such as gem5. By using LENS and VANS, we develop two architectural optimizations on top of Optane DIMM, Lazy Cache and Pre-translation, which significantly improve cloud workload performance.
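LENS-style low-level profiling is not spelled out in the abstract, but the core idea is to time carefully constructed access patterns while keeping the CPU caches out of the way. The C++ sketch below shows the flavor of such a probe (flush, fence, timed dependent reads); the buffer size, the fact that it sits in ordinary DRAM rather than a mapped Optane region, and the measured quantity are all assumptions, not LENS's methodology.

```cpp
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <immintrin.h>   // _mm_clflush, _mm_mfence

// Time dependent reads over a large buffer after flushing it from the CPU
// caches, so the measured latency reflects the memory device underneath.
// (On a real Optane system the buffer would be mapped from the NVRAM DIMM.)
int main() {
    const size_t n = size_t(1) << 26;                 // 64M entries (~512 MB)
    size_t *buf = static_cast<size_t *>(malloc(n * sizeof(size_t)));
    const size_t stride = 4096 / sizeof(size_t);
    for (size_t i = 0; i < n; ++i) buf[i] = (i + stride) % n;   // pointer-chase ring

    for (size_t i = 0; i < n; i += 64 / sizeof(size_t))
        _mm_clflush(&buf[i]);                         // evict the buffer, line by line
    _mm_mfence();

    const size_t hops = 1 << 20;
    size_t idx = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < hops; ++i) idx = buf[idx]; // serialized, cache-missing reads
    auto t1 = std::chrono::steady_clock::now();

    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / hops;
    printf("avg read latency: %.1f ns (idx=%zu)\n", ns, idx);
    free(buf);
    return 0;
}
```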

70 citations

Proceedings ArticleDOI
01 Nov 2020
TL;DR: TABOR, as discussed by the authors, proposes a new objective function that guides optimization to identify a trojan backdoor more accurately, prunes the restored triggers using ideas from interpretable AI, and adds an anomaly detection method that not only facilitates the identification of intentionally injected triggers but also filters out false alarms.
Abstract: A trojan backdoor is a hidden pattern typically implanted in a deep neural network (DNN). It can be activated to force the infected model to behave abnormally when an input sample with a particular trigger is fed to the model. As such, given a DNN and clean input samples, it is challenging to inspect and determine the existence of a trojan backdoor. Recently, researchers have designed and developed several pioneering solutions to address this problem. They demonstrate that the proposed techniques have great potential in trojan detection. However, we show that none of these existing techniques completely addresses the problem. On the one hand, they mostly work under the unrealistic assumption that the contaminated training database is available. On the other hand, these techniques can neither accurately detect the existence of trojan backdoors, nor restore high-fidelity triggers, especially when infected models are trained with high-dimensional data, and the triggers pertaining to the trojan vary in size, shape, and position. In this work, we propose TABOR, a new trojan detection technique. Conceptually, it formalizes the detection of a trojan backdoor as solving an optimization objective function. Different from existing techniques that also model trojan detection as an optimization problem, TABOR first designs a new objective function that can guide optimization to identify a trojan backdoor more correctly and accurately. Second, TABOR borrows the idea of interpretable AI to further prune the restored triggers. Last, TABOR designs a new anomaly detection method, which can not only facilitate the identification of intentionally injected triggers but also filter out false alarms (i.e., triggers detected from an uninfected model). We train 112 DNNs on five datasets and infect these models with two existing trojan attacks. We evaluate TABOR by using these infected models, and demonstrate that TABOR has much better performance in trigger restoration, trojan detection, and elimination than Neural Cleanse, the state-of-the-art trojan detection technique.
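TABOR's exact objective is not reproduced in the abstract, but the general trigger-reconstruction formulation it builds on (and augments with extra regularizers) takes roughly the form below, where m is the trigger mask, Δ the trigger pattern, y_t the suspected target label, f the model, L the classification loss, and the λ terms regularization weights; R(m, Δ) is a placeholder standing in for TABOR's additional regularizers rather than their actual definition.

```latex
\min_{m,\,\Delta}\;
\sum_{x \in X}
\mathcal{L}\!\big(f\big(x \odot (1 - m) + \Delta \odot m\big),\, y_t\big)
\;+\; \lambda_1 \lVert m \rVert_1
\;+\; \lambda_2\, R(m, \Delta)
```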

32 citations

Proceedings ArticleDOI
18 Oct 2021
TL;DR: In this article, the authors propose a configurable GPU power model called AccelWattch that can be driven by emulation and trace-driven environments, hardware counters, or a mix of the two, models both PTX and SASS ISAs, accounts for power gating and control-flow divergence, and supports DVFS.
Abstract: Graphics Processing Units (GPUs) are rapidly dominating the accelerator space, as illustrated by their widespread adoption in the data analytics and machine learning markets. At the same time, performance per watt has emerged as a crucial evaluation metric together with peak performance. As such, GPU architects require robust tools that will enable them to model both the performance and the power consumption of modern GPUs. However, while GPU performance modeling has progressed in great strides, power modeling has lagged behind. To mitigate this problem, we propose AccelWattch, a configurable GPU power model that resolves two long-standing needs: the lack of a detailed and accurate cycle-level power model for modern GPU architectures, and the inability to capture their constant and static power with existing tools. AccelWattch can be driven by emulation and trace-driven environments, hardware counters, or a mix of the two, models both PTX and SASS ISAs, accounts for power gating and control-flow divergence, and supports DVFS. We integrate AccelWattch with GPGPU-Sim and Accel-Sim to facilitate its widespread use. We validate AccelWattch on an NVIDIA Volta GPU, and show that it achieves strong correlation against hardware power measurements. Finally, we demonstrate that AccelWattch can enable reliable design space exploration: by directly applying AccelWattch tuned for Volta on GPU configurations resembling NVIDIA Pascal and Turing GPUs, we obtain accurate power models for these architectures.
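The abstract's split between dynamic, constant, and static power maps onto the usual counter-driven decomposition used by GPU power models; a simplified generic form is sketched below, where A_c is the per-cycle activity of component c taken from hardware counters or simulator statistics and (V, f) the DVFS operating point. The coefficients α_c and the functional form are generic assumptions for illustration, not AccelWattch's calibrated model.

```latex
P_{\text{total}} \;=\; P_{\text{constant}}
\;+\; P_{\text{static}}(V, T)
\;+\; \sum_{c \,\in\, \text{components}} \alpha_c \, A_c \, V^2 f
```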

28 citations

Journal ArticleDOI
TL;DR: Grus, a GPU graph processing framework, improves space efficiency through a Unified Memory trimming scheme tailored to the access behaviors of graph workloads and uses a lightweight frontier structure to reduce atomic operations, delivering up to a 64× average speedup over the state-of-the-art in-memory GPU graph processing framework.
Abstract: Today’s GPU graph processing frameworks face scalability and efficiency issues as the graph size exceeds the GPU-dedicated memory limit. Although recent GPUs can over-subscribe memory with Unified Memory (UM), they incur significant overhead when handling graph-structured data. In addition, many popular processing frameworks suffer sub-optimal efficiency due to heavy atomic operations when tracking the active vertices. This article presents Grus, a novel system framework that allows GPU graph processing to stay competitive with the ever-growing graph complexity. Grus improves space efficiency through a UM trimming scheme tailored to the data access behaviors of graph workloads. It also uses a lightweight frontier structure to further reduce atomic operations. With an easy-to-use interface that abstracts the above details, Grus shows up to 64× average speedup over the state-of-the-art in-memory GPU graph processing framework. It allows one to process large graphs of 5.5 billion edges in seconds with a single GPU.
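The "heavy atomic operations when tracking the active vertices" refers to the standard pattern in frontier-based GPU graph traversal, where every thread that discovers a newly active vertex appends it to the next frontier through an atomic counter. The CUDA sketch below shows that baseline pattern on a toy graph; it is a generic illustration of the bottleneck Grus's lightweight frontier structure targets, not Grus code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Baseline BFS expansion: each thread scans one frontier vertex's neighbors
// and appends newly discovered vertices to the next frontier via atomicAdd.
__global__ void expandFrontier(const int *rowPtr, const int *colIdx,
                               const int *frontier, int frontierSize,
                               int *visited, int *nextFrontier, int *nextSize) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= frontierSize) return;
    int v = frontier[t];
    for (int e = rowPtr[v]; e < rowPtr[v + 1]; ++e) {
        int u = colIdx[e];
        if (atomicExch(&visited[u], 1) == 0) {     // claim u exactly once
            int slot = atomicAdd(nextSize, 1);     // contended atomic counter
            nextFrontier[slot] = u;
        }
    }
}

int main() {
    // Toy 4-vertex graph in CSR form: 0->{1,2}, 1->{3}, 2->{3}, 3->{}
    int hRow[] = {0, 2, 3, 4, 4}, hCol[] = {1, 2, 3, 3};
    int hFrontier[] = {0}, hVisited[] = {1, 0, 0, 0};

    int *rowPtr, *colIdx, *frontier, *visited, *nextFrontier, *nextSize;
    cudaMalloc(&rowPtr, sizeof(hRow));
    cudaMemcpy(rowPtr, hRow, sizeof(hRow), cudaMemcpyHostToDevice);
    cudaMalloc(&colIdx, sizeof(hCol));
    cudaMemcpy(colIdx, hCol, sizeof(hCol), cudaMemcpyHostToDevice);
    cudaMalloc(&frontier, sizeof(hFrontier));
    cudaMemcpy(frontier, hFrontier, sizeof(hFrontier), cudaMemcpyHostToDevice);
    cudaMalloc(&visited, sizeof(hVisited));
    cudaMemcpy(visited, hVisited, sizeof(hVisited), cudaMemcpyHostToDevice);
    cudaMalloc(&nextFrontier, 4 * sizeof(int));
    cudaMalloc(&nextSize, sizeof(int));
    cudaMemset(nextSize, 0, sizeof(int));

    expandFrontier<<<1, 32>>>(rowPtr, colIdx, frontier, 1,
                              visited, nextFrontier, nextSize);
    int n = 0;
    cudaMemcpy(&n, nextSize, sizeof(int), cudaMemcpyDeviceToHost);
    printf("next frontier size: %d\n", n);  // expect 2 (vertices 1 and 2)
    return 0;
}
```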

24 citations