Author

Harrison Barclay

Bio: Harrison Barclay is an academic researcher from Northeastern University. The author has contributed to research in topics: Microarchitecture & Emulation. The author has an h-index of 1, co-authored 2 publications receiving 31 citations.

Papers
Proceedings ArticleDOI
22 Jun 2019
TL;DR: This work presents MGPUSim, a cycle-accurate, extensively validated multi-GPU simulator based on AMD's Graphics Core Next 3 (GCN3) instruction set architecture, and proposes the Locality API, an API extension that allows the GPU programmer to avoid the complexity of multi-GPU programming while precisely controlling data placement in the multi-GPU memory.
Abstract: The rapidly growing popularity and scale of data-parallel workloads demand a corresponding increase in raw computational power of Graphics Processing Units (GPUs). As single-GPU platforms struggle to satisfy these performance demands, multi-GPU platforms have started to dominate the high-performance computing world. The advent of such systems raises a number of design challenges, including the GPU microarchitecture, multi-GPU interconnect fabric, runtime libraries, and associated programming models. The research community currently lacks a publicly available and comprehensive multi-GPU simulation framework to evaluate next-generation multi-GPU system designs. In this work, we present MGPUSim, a cycle-accurate, extensively validated, multi-GPU simulator, based on AMD's Graphics Core Next 3 (GCN3) instruction set architecture. MGPUSim comes with in-built support for multi-threaded execution to enable fast, parallelized, and accurate simulation. In terms of performance accuracy, MGPUSim differs by only 5.5% on average from the actual GPU hardware. We also achieve a 3.5x and a 2.5x average speedup running functional emulation and detailed timing simulation, respectively, on a 4-core CPU, while delivering the same accuracy as serial simulation. We illustrate the flexibility and capability of the simulator through two concrete design studies. In the first, we propose the Locality API, an API extension that allows the GPU programmer to avoid the complexity of multi-GPU programming while precisely controlling data placement in the multi-GPU memory. In the second design study, we propose Progressive Page Splitting Migration (PASI), a customized multi-GPU memory management system enabling the hardware to progressively improve data placement. For a discrete 4-GPU system, we observe that the Locality API can speed up the system by 1.6x (geometric mean), and PASI can improve the system performance by 2.6x (geometric mean) across all benchmarks, compared to a unified 4-GPU platform.

51 citations
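To make the Locality API idea above concrete, here is a minimal Python sketch of explicit page placement; the names (PlacementTable, place, gpu_of) and the page-size/GPU-count choices are illustrative assumptions, not MGPUSim's actual API, but they show how a programmer-supplied placement hint overrides default round-robin interleaving so each GPU's working set stays local.

```python
# Hypothetical sketch of explicit page placement across a 4-GPU system.
# Names and defaults are invented for illustration; MGPUSim's Locality API differs.

class PlacementTable:
    """Maps virtual page numbers to the GPU that owns the physical copy."""
    def __init__(self, num_gpus=4, page_size=4096):
        self.num_gpus = num_gpus
        self.page_size = page_size
        self.explicit = {}  # page number -> GPU id, set via the placement hint

    def place(self, addr, size, gpu):
        """Pin every page of [addr, addr + size) on one GPU (Locality-API-style hint)."""
        first = addr // self.page_size
        last = (addr + size - 1) // self.page_size
        for page in range(first, last + 1):
            self.explicit[page] = gpu

    def gpu_of(self, addr):
        """Resolve which GPU serves an address: explicit hint, else interleave."""
        page = addr // self.page_size
        return self.explicit.get(page, page % self.num_gpus)


if __name__ == "__main__":
    table = PlacementTable()
    buf_size = 16 * 1024 * 1024
    # Give each GPU a contiguous quarter of the buffer so its work stays
    # local instead of crossing the inter-GPU fabric.
    for gpu in range(4):
        table.place(gpu * buf_size // 4, buf_size // 4, gpu)
    print(table.gpu_of(0), table.gpu_of(buf_size - 1))  # 0 3
```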

Journal ArticleDOI
01 Dec 2019
TL;DR: It is found that reproducibility of the multi-physics simulations can be demonstrated on a single-node system and that multi-node scaling can be confirmed; however, reproducibility of the visual and geophysical simulation results was inconclusive due to issues related to the input parameters provided to the model.
Abstract: This paper evaluates the reproducibility of a Supercomputing 17 paper titled Extreme Scale Multi-Physics Simulations of the Tsunamigenic 2004 Sumatra Megathrust Earthquake. We evaluate reproducibility on a significantly smaller computer system than used in the original work. We found that we were able to demonstrate reproducibility of the multi-physics simulations on a single-node system, as well as confirm multi-node scaling. However, reproducibility of the visual and geophysical simulation results was inconclusive due to issues related to input parameters provided to our model. The SC 17 paper provided results for both CPU-based simulations and Xeon Phi-based simulations. Since our cluster uses NVIDIA V100s for acceleration, we are only able to assess the CPU-based results in terms of reproducibility.

2 citations
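Confirming multi-node scaling, as described above, amounts to computing speedup and parallel efficiency from measured runtimes. The short Python sketch below illustrates that calculation with placeholder timings; the numbers are not measurements from the paper.

```python
# Strong-scaling check: speedup and parallel efficiency from runtimes.
# The runtimes below are placeholders, not measurements from the paper.

runtimes = {1: 1000.0, 2: 520.0, 4: 270.0, 8: 150.0}  # nodes -> seconds (hypothetical)

base = runtimes[1]
for nodes, t in sorted(runtimes.items()):
    speedup = base / t
    efficiency = speedup / nodes
    print(f"{nodes} node(s): speedup {speedup:.2f}x, parallel efficiency {efficiency:.0%}")
```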


Cited by
Proceedings ArticleDOI
30 May 2020
TL;DR: A new GPU simulator frontend is introduced that minimizes the effort required to simulate different machine ISAs through trace-driven simulation of NVIDIA’s native machine ISA, while still supporting execution-driven simulation of the virtual ISA.
Abstract: In computer architecture, significant innovation frequently comes from industry. However, the simulation tools used by industry are often not released for open use, and even when they are, the exact details of industrial designs are not disclosed. As a result, research in the architecture space must ensure that assumptions about contemporary processor design remain true. To help bridge the gap between opaque industrial innovation and public research, we introduce three mechanisms that make it much easier for GPU simulators to keep up with industry. First, we introduce a new GPU simulator frontend that minimizes the effort required to simulate different machine ISAs through trace-driven simulation of NVIDIA's native machine ISA, while still supporting execution-driven simulation of the virtual ISA. Second, we extensively update GPGPU-Sim's performance model to increase its level of detail, configurability and accuracy. Finally, surrounding the new frontend and flexible performance model is an infrastructure that enables quick, detailed validation. A comprehensive set of microbenchmarks and automated correlation plotting ease the modeling process. We use these three new mechanisms to build Accel-Sim, a detailed simulation framework that decreases cycle error by 79 percentage points over a wide range of 80 workloads, consisting of 1,945 kernel instances. We further demonstrate that Accel-Sim is able to simulate benchmark suites that no other open-source simulator can. In particular, we use Accel-Sim to simulate an additional 60 workloads, comprised of 11,440 kernel instances, from the machine learning benchmark suite Deepbench. Deepbench makes use of closed-source, hand-tuned kernels with no virtual ISA implementation. Using a rigorous counter-by-counter analysis, we validate Accel-Sim against contemporary GPUs. Finally, to highlight the effects of falling behind industry, this paper presents two case-studies that demonstrate how incorrect baseline assumptions can hide new areas of opportunity and lead to potentially incorrect design decisions.

130 citations
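The trace-driven frontend described above replays pre-recorded machine-ISA instructions through a timing model instead of functionally executing the program. The Python sketch below illustrates that idea with a made-up trace record format and toy instruction latencies; it is far simpler than Accel-Sim's actual SASS trace replay and is not its implementation.

```python
# Illustrative trace-driven frontend: replay pre-recorded per-kernel instruction
# records through a toy timing model. Record format and latencies are hypothetical.

from dataclasses import dataclass


@dataclass
class TraceRecord:
    warp_id: int
    opcode: str       # e.g. "IADD", "LDG", "STG"
    active_mask: int  # 32-bit mask of active threads


LATENCY = {"IADD": 4, "FMUL": 4, "LDG": 300, "STG": 300}  # toy cycle costs


def replay(trace):
    """Accumulate a per-warp cycle count by replaying trace records in order."""
    warp_cycles = {}
    for rec in trace:
        cost = LATENCY.get(rec.opcode, 4)
        # Charge the full cost regardless of how many threads are active,
        # mimicking SIMT lockstep execution under divergence.
        warp_cycles[rec.warp_id] = warp_cycles.get(rec.warp_id, 0) + cost
    return max(warp_cycles.values(), default=0)


if __name__ == "__main__":
    trace = [TraceRecord(0, "LDG", 0xFFFFFFFF),
             TraceRecord(0, "FMUL", 0xFFFFFFFF),
             TraceRecord(1, "IADD", 0x0000FFFF)]
    print("estimated kernel cycles:", replay(trace))
```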

Proceedings ArticleDOI
01 Oct 2020
TL;DR: VANS is developed, which models the sophisticated microarchitecture design of Optane DIMM and is validated by comparing with the detailed performance characteristics of Optane-DIMM-attached Intel servers; two architectural optimizations, Lazy Cache and Pre-translation, are then developed on top of Optane DIMM, which significantly improve cloud workload performance.
Abstract: Scalable server-grade non-volatile RAM (NVRAM) DIMMs became commercially available with the release of Intel’s Optane DIMM. Recent studies on Optane DIMM systems unveil discrepant performance characteristics, compared to what many researchers assumed before the product release. Most of these studies focus on system software design and performance analysis. To thoroughly analyze the source of this discrepancy and facilitate real-NVRAM-aware architecture design, we propose a framework that characterizes and models Optane DIMM’s microarchitecture. Our framework consists of a Low-level profilEr for Non-volatile memory Systems (LENS) and a Validated cycle-Accurate NVRAM Simulator (VANS). LENS allows us to comprehensively analyze the performance attributes and reverse engineer NVRAM microarchitectures. Based on LENS characterization, we develop VANS, which models the sophisticated microarchitecture design of Optane DIMM, and is validated by comparing with the detailed performance characteristics of Optane-DIMM-attached Intel servers. VANS adopts a modular design that can be easily modified to extend to other NVRAM architecture designs; it can also be attached to full-system simulators, such as gem5. By using LENS and VANS, we develop two architectural optimizations on top of Optane DIMM, Lazy Cache and Pre-translation, which significantly improve cloud workload performance.

70 citations
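VANS's modular design chains memory-hierarchy components so that a request accumulates latency stage by stage and stops at the first stage that hits. The Python sketch below illustrates that composition style with invented component names, capacities, and latencies; it is a toy, not the validated Optane DIMM model.

```python
# Hedged sketch of a modular NVRAM timing model in the spirit of a chained design:
# a read flows through components, each adding its own latency. Names, capacities,
# and latencies are assumptions for illustration.

class Stage:
    def __init__(self, name, hit_latency_ns, miss_latency_ns, capacity_lines):
        self.name = name
        self.hit = hit_latency_ns
        self.miss = miss_latency_ns
        self.capacity = capacity_lines
        self.resident = set()  # crude fully-associative buffer model

    def access(self, line_addr):
        """Return (latency, hit) and update the buffer contents."""
        if line_addr in self.resident:
            return self.hit, True
        if len(self.resident) >= self.capacity:
            self.resident.pop()  # arbitrary eviction, enough for a sketch
        self.resident.add(line_addr)
        return self.miss, False


def read_latency(stages, addr, line_size=256):
    """Walk the request down the hierarchy until some stage hits."""
    line = addr // line_size
    total = 0
    for stage in stages:
        lat, hit = stage.access(line)
        total += lat
        if hit:
            break
    return total


if __name__ == "__main__":
    pipeline = [Stage("on-DIMM buffer", 40, 60, 4096),
                Stage("media", 150, 150, 1 << 30)]
    print(read_latency(pipeline, 0x1000), "ns first touch,",
          read_latency(pipeline, 0x1000), "ns on reuse")
```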

Proceedings ArticleDOI
01 Nov 2020
TL;DR: TABOR, as discussed by the authors, designs a new objective function that guides the optimization to identify a trojan backdoor more correctly and accurately, borrows ideas from interpretable AI to prune the restored triggers, and adds an anomaly detection method that not only facilitates the identification of intentionally injected triggers but also filters out false alarms.
Abstract: A trojan backdoor is a hidden pattern typically implanted in a deep neural network (DNN). When an input sample containing a particular trigger is fed to the infected model, the backdoor is activated and forces the model to behave abnormally. As such, given a DNN and clean input samples, it is challenging to inspect and determine the existence of a trojan backdoor. Recently, researchers have designed and developed several pioneering solutions to address this problem. They demonstrate that the proposed techniques have great potential in trojan detection. However, we show that none of these existing techniques completely addresses the problem. On the one hand, they mostly work under the unrealistic assumption that the contaminated training database is available. On the other hand, these techniques can neither accurately detect the existence of trojan backdoors nor restore high-fidelity triggers, especially when infected models are trained with high-dimensional data and the triggers pertaining to the trojan vary in size, shape, and position. In this work, we propose TABOR, a new trojan detection technique. Conceptually, it formalizes the detection of a trojan backdoor as solving an optimization objective function. Different from the existing technique that also models trojan detection as an optimization problem, TABOR first designs a new objective function that guides the optimization to identify a trojan backdoor more correctly and accurately. Second, TABOR borrows the idea of interpretable AI to further prune the restored triggers. Last, TABOR designs a new anomaly detection method, which not only facilitates the identification of intentionally injected triggers but also filters out false alarms (i.e., triggers detected from an uninfected model). We train 112 DNNs on five datasets and infect these models with two existing trojan attacks. We evaluate TABOR by using these infected models, and demonstrate that TABOR has much better performance in trigger restoration, trojan detection, and elimination than Neural Cleanse, the state-of-the-art trojan detection technique.

32 citations
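Framing trojan detection as an optimization problem, as the abstract describes, means searching for a small trigger (a mask plus a pattern) that flips clean inputs toward a suspected target label. The Python sketch below shows one such objective evaluated on a toy linear model; the model, inputs, and lambda weight are placeholders, and TABOR's full objective adds further regularizers that are omitted here.

```python
# Minimal sketch of trigger reconstruction as an optimization objective.
# The toy "model", weights, and lambda value are placeholders for illustration.

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 784)) * 0.01  # toy linear classifier, 10 classes


def model_logits(x):
    return W @ x


def apply_trigger(x, mask, pattern):
    """Stamp the candidate trigger onto a clean input."""
    return (1.0 - mask) * x + mask * pattern


def objective(mask, pattern, clean_inputs, target_label, lam=0.05):
    """Cross-entropy toward the suspected target label + L1 penalty on the mask.

    A small mask that still flips every input to target_label is evidence of an
    injected trigger; needing a large mask suggests a false alarm.
    """
    loss = 0.0
    for x in clean_inputs:
        logits = model_logits(apply_trigger(x, mask, pattern))
        logits -= logits.max()  # numerical stability
        log_probs = logits - np.log(np.exp(logits).sum())
        loss += -log_probs[target_label]
    return loss / len(clean_inputs) + lam * np.abs(mask).sum()


if __name__ == "__main__":
    xs = [rng.random(784) for _ in range(8)]
    mask = np.zeros(784)
    mask[:9] = 1.0          # candidate 9-pixel trigger on a flattened 28x28 input
    pattern = np.ones(784)
    print("objective value:", objective(mask, pattern, xs, target_label=7))
```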

Proceedings ArticleDOI
18 Oct 2021
TL;DR: In this article, the authors propose a configurable GPU power model called AccelWattch that can be driven by emulation and trace-driven environments, hardware counters, or a mix of the two, models both PTX and SASS ISAs, accounts for power gating and control-flow divergence, and supports DVFS.
Abstract: Graphics Processing Units (GPUs) are rapidly dominating the accelerator space, as illustrated by their widespread adoption in the data analytics and machine learning markets. At the same time, performance per watt has emerged as a crucial evaluation metric together with peak performance. As such, GPU architects require robust tools that will enable them to model both the performance and the power consumption of modern GPUs. However, while GPU performance modeling has progressed in great strides, power modeling has lagged behind. To mitigate this problem we propose AccelWattch, a configurable GPU power model that resolves two long-standing needs: the lack of a detailed and accurate cycle-level power model for modern GPU architectures, and the inability to capture their constant and static power with existing tools. AccelWattch can be driven by emulation and trace-driven environments, hardware counters, or a mix of the two, models both PTX and SASS ISAs, accounts for power gating and control-flow divergence, and supports DVFS. We integrate AccelWattch with GPGPU-Sim and Accel-Sim to facilitate its widespread use. We validate AccelWattch on an NVIDIA Volta GPU, and show that it achieves strong correlation against hardware power measurements. Finally, we demonstrate that AccelWattch can enable reliable design space exploration: by directly applying AccelWattch tuned for Volta on GPU configurations resembling NVIDIA Pascal and Turing GPUs, we obtain accurate power models for these architectures.

28 citations
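Counter-driven GPU power models of this kind typically combine per-event dynamic energy with static (leakage) and constant power, rescaled under DVFS. The Python sketch below illustrates that general structure with invented energy coefficients and counter values; these are assumptions for illustration, not AccelWattch's calibrated numbers or equations.

```python
# Hedged sketch of a counter-driven GPU power estimate. Energy-per-event values,
# leakage, and counter totals are invented for illustration.

ENERGY_NJ = {"fp32_op": 0.02, "int_op": 0.01, "dram_access": 20.0, "l2_access": 1.0}


def estimate_power(counters, window_s, volt, ref_volt=1.0,
                   constant_w=25.0, static_ref_w=30.0):
    """Return (dynamic, static, constant, total) power in watts for one window."""
    dyn_j = sum(ENERGY_NJ[evt] * 1e-9 * n for evt, n in counters.items())
    # Crude DVFS model: per-event energy scales roughly with V^2, leakage with V.
    dyn_w = (dyn_j / window_s) * (volt / ref_volt) ** 2
    static_w = static_ref_w * (volt / ref_volt)
    return dyn_w, static_w, constant_w, dyn_w + static_w + constant_w


if __name__ == "__main__":
    counters = {"fp32_op": 4_000_000_000, "int_op": 1_000_000_000,
                "dram_access": 50_000_000, "l2_access": 200_000_000}
    print(estimate_power(counters, window_s=0.01, volt=1.0))
```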

Journal ArticleDOI
TL;DR: Grus is presented, a novel system framework that improves space efficiency through a Unified Memory trimming scheme tailored to graph workloads and reduces atomic operations with a lightweight frontier structure, showing up to 6.4× average speedup over the state-of-the-art in-memory GPU graph processing framework.
Abstract: Today’s GPU graph processing frameworks face scalability and efficiency issues as the graph size exceeds the GPU-dedicated memory limit. Although recent GPUs can over-subscribe memory with Unified Memory (UM), they incur significant overhead when handling graph-structured data. In addition, many popular processing frameworks suffer sub-optimal efficiency due to heavy atomic operations when tracking the active vertices. This article presents Grus, a novel system framework that allows GPU graph processing to stay competitive with the ever-growing graph complexity. Grus improves space efficiency through a UM trimming scheme tailored to the data access behaviors of graph workloads. It also uses a lightweight frontier structure to further reduce atomic operations. With an easy-to-use interface that abstracts the above details, Grus shows up to 6.4× average speedup over the state-of-the-art in-memory GPU graph processing framework. It allows one to process large graphs of 5.5 billion edges in seconds with a single GPU.

24 citations
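One common way to build the kind of lightweight, low-atomic frontier mentioned above is a per-level bitmap: marking a vertex is an idempotent write, so duplicate discoveries need no atomically appended worklist. The Python sketch below shows one BFS level over a CSR graph in that style; it illustrates the general idea, and Grus's actual frontier structure may differ.

```python
# Hedged sketch of a bitmap frontier for one BFS level over a CSR graph.
# Illustrates the general "lightweight frontier" idea, not Grus's data structure.

def bfs_level(row_ptr, col_idx, frontier, visited):
    """Expand one level; return the next frontier as a bitmap (list of bools)."""
    n = len(frontier)
    next_frontier = [False] * n
    for v in range(n):
        if not frontier[v]:
            continue
        for e in range(row_ptr[v], row_ptr[v + 1]):
            u = col_idx[e]
            if not visited[u]:
                visited[u] = True       # on a GPU this write is racy but idempotent
                next_frontier[u] = True  # marking a bit needs no atomic queue append
    return next_frontier


if __name__ == "__main__":
    # 4-vertex path graph 0-1-2-3 in CSR form
    row_ptr = [0, 1, 3, 5, 6]
    col_idx = [1, 0, 2, 1, 3, 2]
    visited = [True, False, False, False]
    frontier = [True, False, False, False]
    while any(frontier):
        frontier = bfs_level(row_ptr, col_idx, frontier, visited)
        print([i for i, f in enumerate(frontier) if f])
```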