Author

Yifan Sun

Other affiliations: Northeastern University
Bio: Yifan Sun is an academic researcher from the College of William & Mary. The author has contributed to research in topics including Computer science and Overhead (computing). The author has an h-index of 8, has co-authored 28 publications, and has received 227 citations. Previous affiliations of Yifan Sun include Northeastern University.

Papers
Posted Content
TL;DR: It is found that transistor scaling remains critical to keeping the laws valid, but that architectural solutions have become increasingly important and will play a larger role in the future, and that GPUs consistently deliver higher performance than CPUs.
Abstract: Moore's Law and Dennard Scaling have guided the semiconductor industry for the past few decades. Recently, both laws have faced validity challenges as transistor sizes approach the practical limits of physics. We are interested in testing the validity of these laws and reflecting on the reasons behind the results. In this work, we collect data on more than 4,000 publicly available CPU and GPU products. We find that transistor scaling remains critical in keeping the laws valid. However, architectural solutions have become increasingly important and will play a larger role in the future. We observe that GPUs consistently deliver higher performance than CPUs. GPU performance continues to rise because of increases in GPU frequency, improvements in thermal design power (TDP), and growth in die size. But we also see the ratio of GPU to CPU performance moving closer to parity, thanks to new SIMD extensions on CPUs and increased CPU core counts.
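
As an illustrative aside, the doubling period behind this kind of trend analysis can be estimated with a simple log-linear fit. The sketch below uses a handful of made-up (year, transistor count) pairs rather than the paper's 4,000-product dataset, and assumes NumPy is available.

# Illustrative sketch: estimate a Moore's-Law-style doubling period from
# hypothetical (year, transistor count) samples, not the paper's real data.
import numpy as np

years = np.array([2000, 2004, 2008, 2012, 2016, 2020], dtype=float)
transistors = np.array([4.2e7, 1.25e8, 7.3e8, 1.4e9, 7.2e9, 3.9e10])

# Fit log2(transistors) = slope * year + intercept; the doubling period is 1 / slope.
slope, intercept = np.polyfit(years, np.log2(transistors), 1)
print(f"Estimated doubling period: {1.0 / slope:.2f} years")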

53 citations

Proceedings ArticleDOI
22 Jun 2019
TL;DR: This work presents MGPUSim, a cycle-accurate, extensively validated, multi-GPU simulator based on AMD's Graphics Core Next 3 (GCN3) instruction set architecture, and proposes the Locality API, an API extension that allows the GPU programmer to avoid the complexity of multi-GPU programming while precisely controlling data placement in multi-GPU memory.
Abstract: The rapidly growing popularity and scale of data-parallel workloads demand a corresponding increase in the raw computational power of Graphics Processing Units (GPUs). As single-GPU platforms struggle to satisfy these performance demands, multi-GPU platforms have started to dominate the high-performance computing world. The advent of such systems raises a number of design challenges, including the GPU microarchitecture, multi-GPU interconnect fabric, runtime libraries, and associated programming models. The research community currently lacks a publicly available and comprehensive multi-GPU simulation framework to evaluate next-generation multi-GPU system designs. In this work, we present MGPUSim, a cycle-accurate, extensively validated, multi-GPU simulator based on AMD's Graphics Core Next 3 (GCN3) instruction set architecture. MGPUSim comes with built-in support for multi-threaded execution to enable fast, parallelized, and accurate simulation. In terms of performance accuracy, MGPUSim differs by only 5.5% on average from the actual GPU hardware. We also achieve a 3.5x and a 2.5x average speedup running functional emulation and detailed timing simulation, respectively, on a 4-core CPU, while delivering the same accuracy as serial simulation. We illustrate the flexibility and capability of the simulator through two concrete design studies. In the first, we propose the Locality API, an API extension that allows the GPU programmer to avoid the complexity of multi-GPU programming while precisely controlling data placement in multi-GPU memory. In the second design study, we propose Progressive Page Splitting Migration (PASI), a customized multi-GPU memory management system enabling the hardware to progressively improve data placement. For a discrete 4-GPU system, we observe that the Locality API can speed up the system by 1.6x (geometric mean), and PASI can improve the system performance by 2.6x (geometric mean) across all benchmarks, compared to a unified 4-GPU platform.
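
As a side note on the reported numbers, the 1.6x and 2.6x figures are geometric means across benchmarks. A minimal illustration of how such a summary is computed, using invented per-benchmark speedups rather than the paper's measurements, is:

# Illustrative only: geometric-mean speedup across benchmarks.
import math

speedups = [1.2, 1.9, 1.4, 2.3, 1.5]  # hypothetical speedups over a baseline
geomean = math.prod(speedups) ** (1.0 / len(speedups))
print(f"Geometric-mean speedup: {geomean:.2f}x")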

51 citations

Proceedings ArticleDOI
06 May 2021
TL;DR: In this paper, an analysis of 668 COVID-19 visualizations is presented to examine who, (uses) what data, (to communicate) what messages, in what form, under what circumstances in the context of crisis visualizations.
Abstract: In response to COVID-19, a vast number of visualizations have been created to communicate information to the public. Information exposure in a public health crisis can impact people's attitudes towards and responses to the crisis and risks, and ultimately the trajectory of a pandemic. As such, there is a need for work that documents, organizes, and investigates what COVID-19 visualizations have been presented to the public. We address this gap through an analysis of 668 COVID-19 visualizations. We present our findings through a conceptual framework, derived from our analysis, that examines who, (uses) what data, (to communicate) what messages, in what form, under what circumstances in the context of COVID-19 crisis visualizations. We provide a set of factors to be considered within each component of the framework. We conclude with directions for future crisis visualization research.

49 citations


Proceedings ArticleDOI
02 Oct 2018
TL;DR: This work profiles and analyzes the training of five popular DNNs using 1, 2, 4, and 8 GPUs, and shows the breakdown of training time across the FP+BP stage and the WU stage to provide insights about the limiting factors of the training algorithm as well as to identify the bottlenecks in the multi-GPU system architecture.
Abstract: High performance multi-GPU systems are widely used to accelerate the training of deep neural networks (DNNs) by exploiting the inherently massive parallel nature of the training process. Typically, the training of DNNs in multi-GPU systems leverages a data-parallel model in which a DNN is replicated on every GPU, and each GPU performs Forward Propagation (FP), Backward Propagation (BP), and Weight Update (WU). We analyze the WU stage, which is composed of collective communication (e.g., allReduce, broadcast) that demands very efficient communication among the GPUs to avoid diminishing returns when scaling the number of GPUs in the system. To overcome this issue, different data transfer mechanisms and libraries have been introduced by NVIDIA and adopted by high-level frameworks to train DNNs. In this work, we evaluate and compare the performance of the peer-to-peer (P2P) data transfer method and the NCCL library-based communication method for training DNNs on a DGX-1 system consisting of 8 NVIDIA Volta-based GPUs. We profile and analyze the training of five popular DNNs (GoogLeNet, AlexNet, Inception-v3, ResNet, and LeNet) using 1, 2, 4, and 8 GPUs. We show the breakdown of the training time across the FP+BP stage and the WU stage to provide insights about the limiting factors of the training algorithm as well as to identify the bottlenecks in the multi-GPU system architecture. Our detailed profiling and analysis can help programmers and DNN model designers accelerate the training process of DNNs.
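
For readers unfamiliar with the WU stage discussed above, the following is a minimal sketch of gradient averaging via allReduce in data-parallel training. It assumes PyTorch with torch.distributed already initialized (for example, through torchrun with the NCCL backend); it is not the paper's profiling setup, and none of the profiled DNNs are reproduced here.

# Minimal sketch of the Weight Update (WU) stage in data-parallel training:
# each rank averages its gradients with all other GPUs via an allReduce.
# Assumes torch.distributed has been initialized (e.g., NCCL backend via torchrun).
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    """All-reduce every parameter's gradient and divide by the number of ranks."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

# Typical use inside the training loop, after backward() and before optimizer.step():
#   loss.backward()            # FP + BP stages
#   average_gradients(model)   # WU stage: collective communication across GPUs
#   optimizer.step()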

33 citations


Cited by
Posted Content
TL;DR: GShard enabled scaling a multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding; the authors demonstrate that such a giant model can be trained efficiently on 2048 TPU v3 accelerators in 4 days, achieving far superior quality for translation from 100 languages to English compared to the prior art.
Abstract: Neural network scaling has been critical for improving the model quality in many real-world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is affirmed to be a sure-fire approach for better model quality, there are challenges on the path such as the computation cost, ease of programming, and efficient implementation on parallel devices. GShard is a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler. It provides an elegant way to express a wide range of parallel computation patterns with minimal changes to the existing model code. GShard enabled us to scale up multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding. We demonstrate that such a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.
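
To make the Mixture-of-Experts idea concrete, below is a tiny, illustrative top-2 gating layer written in NumPy. It sketches only the routing concept; GShard's actual implementation shards experts across accelerators through the XLA compiler and adds capacity limits and load-balancing losses, none of which appear here.

# Illustrative only: a toy top-2 gated Mixture-of-Experts layer in NumPy.
# Shapes and the number of experts are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
num_tokens, d_model, num_experts = 8, 16, 4

tokens = rng.standard_normal((num_tokens, d_model))
gate_w = rng.standard_normal((d_model, num_experts))             # gating network
expert_w = rng.standard_normal((num_experts, d_model, d_model))  # one weight matrix per expert

logits = tokens @ gate_w
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)  # softmax over experts
top2 = np.argsort(probs, axis=-1)[:, -2:]                            # two best experts per token

output = np.zeros_like(tokens)
for t in range(num_tokens):
    for e in top2[t]:
        # Each token is processed by its two chosen experts, weighted by the gate.
        output[t] += probs[t, e] * (tokens[t] @ expert_w[e])

print(output.shape)  # (8, 16)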

385 citations

Book
01 Jan 2016
Abstract: 1. Building the infrastructure (infrastructure on AWS; configuring a VPC; connecting the VPC to an on-premises environment). 2. Storing, sharing, and publishing file objects (features of the S3 object storage; using S3 as file storage; building a web server; balancing reliability and cost). 3. Building application servers (Amazon EC2 and AWS Lambda; improving scalability; running programs serverlessly; making use of database services). 4. Managing the AWS system (resource monitoring, anomaly detection, and alerting; mechanisms for improving fault tolerance, and backup & recovery; configuration management).

350 citations

Proceedings ArticleDOI
30 May 2020
TL;DR: A new GPU simulator frontend is introduced that minimizes the effort required to simulate different machine ISAs through trace-driven simulation of NVIDIA's native machine ISA, while still supporting execution-driven simulation of the virtual ISA.
Abstract: In computer architecture, significant innovation frequently comes from industry. However, the simulation tools used by industry are often not released for open use, and even when they are, the exact details of industrial designs are not disclosed. As a result, research in the architecture space must ensure that assumptions about contemporary processor design remain true. To help bridge the gap between opaque industrial innovation and public research, we introduce three mechanisms that make it much easier for GPU simulators to keep up with industry. First, we introduce a new GPU simulator frontend that minimizes the effort required to simulate different machine ISAs through trace-driven simulation of NVIDIA's native machine ISA, while still supporting execution-driven simulation of the virtual ISA. Second, we extensively update GPGPU-Sim's performance model to increase its level of detail, configurability, and accuracy. Finally, surrounding the new frontend and flexible performance model is an infrastructure that enables quick, detailed validation. A comprehensive set of microbenchmarks and automated correlation plotting ease the modeling process. We use these three new mechanisms to build Accel-Sim, a detailed simulation framework that decreases cycle error by 79 percentage points over a wide range of 80 workloads, consisting of 1,945 kernel instances. We further demonstrate that Accel-Sim is able to simulate benchmark suites that no other open-source simulator can. In particular, we use Accel-Sim to simulate an additional 60 workloads, comprised of 11,440 kernel instances, from the machine learning benchmark suite DeepBench. DeepBench makes use of closed-source, hand-tuned kernels with no virtual ISA implementation. Using a rigorous counter-by-counter analysis, we validate Accel-Sim against contemporary GPUs. Finally, to highlight the effects of falling behind industry, this paper presents two case studies that demonstrate how incorrect baseline assumptions can hide new areas of opportunity and lead to potentially incorrect design decisions.
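
The counter-by-counter validation described above amounts to comparing simulated and measured statistics kernel by kernel. A minimal, illustrative version of such a check, with invented cycle counts and using mean absolute percentage error as one possible definition of cycle error, might look like:

# Illustrative only: compare simulated vs. measured cycles per kernel.
import numpy as np

hardware_cycles = np.array([1.2e6, 3.4e5, 8.9e6, 2.1e6])   # hypothetical hardware counters
simulated_cycles = np.array([1.1e6, 3.9e5, 9.6e6, 1.8e6])  # hypothetical simulator output

# One common error metric: mean absolute percentage error across kernels.
mape = np.mean(np.abs(simulated_cycles - hardware_cycles) / hardware_cycles) * 100
# Correlation between simulated and measured cycles, as used in correlation plots.
corr = np.corrcoef(hardware_cycles, simulated_cycles)[0, 1]

print(f"Cycle error (MAPE): {mape:.1f}%, correlation: {corr:.3f}")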

130 citations

Journal ArticleDOI
TL;DR: A thorough evaluation of five of the latest types of modern GPU interconnects, across six high-end servers and HPC platforms, shows that, for an application running in a multi-GPU node, choosing the right GPU combination can have a considerable impact on GPU communication efficiency as well as on the application's overall performance.
Abstract: High performance multi-GPU computing has become an inevitable trend due to the ever-increasing demand for computation capability in emerging domains such as deep learning, big data, and planet-scale simulations. However, the lack of deep understanding of how modern GPUs can be connected, and of the real impact of state-of-the-art interconnect technology on multi-GPU application performance, has become a hurdle. In this paper, we fill the gap by conducting a thorough evaluation of five of the latest types of modern GPU interconnects: PCIe, NVLink-V1, NVLink-V2, NVLink-SLI, and NVSwitch, on six high-end servers and HPC platforms: NVIDIA P100-DGX-1, V100-DGX-1, DGX-2, OLCF's SummitDev and Summit supercomputers, as well as an SLI-linked system with two NVIDIA Turing RTX-2080 GPUs. Based on the empirical evaluation, we have observed four new types of GPU communication network NUMA effects: three are triggered by NVLink's topology, connectivity, and routing, while one is caused by a PCIe chipset design issue. These observations indicate that, for an application running in a multi-GPU node, choosing the right GPU combination can have a considerable impact on GPU communication efficiency, as well as on the application's overall performance. Our evaluation can be leveraged in building practical multi-GPU performance models, which are vital for GPU task allocation, scheduling, and migration in a shared environment (e.g., AI cloud and HPC centers), as well as for communication-oriented performance tuning.
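
A rough flavor of this kind of interconnect measurement can be given with a short timing sketch. The code below assumes a multi-GPU node with PyTorch and CUDA available; it times a single device-to-device copy, whereas a rigorous study such as the one above would add warm-up runs, explicit peer-access control, and many repetitions.

# Illustrative only: rough GPU-to-GPU copy bandwidth between two devices.
import time
import torch

def p2p_copy_gbps(src: int, dst: int, size_mb: int = 256) -> float:
    """Time one copy of size_mb megabytes from GPU src to GPU dst, in GB/s."""
    buf = torch.empty(size_mb * 1024 * 1024, dtype=torch.uint8, device=f"cuda:{src}")
    torch.cuda.synchronize(src)
    start = time.perf_counter()
    buf.to(f"cuda:{dst}")          # device-to-device copy
    torch.cuda.synchronize(dst)
    seconds = time.perf_counter() - start
    return (size_mb / 1024.0) / seconds

if torch.cuda.device_count() >= 2:
    print(f"GPU0 -> GPU1: {p2p_copy_gbps(0, 1):.1f} GB/s")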

118 citations