
Amir Kavyan Ziabari

Researcher at Northeastern University

Publications - 16
Citations - 380

Amir Kavyan Ziabari is an academic researcher from Northeastern University. The author has contributed to research in the topics of Memory hierarchy and Network-on-Chip. The author has an h-index of 10 and has co-authored 16 publications receiving 322 citations. Previous affiliations of Amir Kavyan Ziabari include Advanced Micro Devices.

Papers
Proceedings Article

Hetero-mark, a benchmark suite for CPU-GPU collaborative computing

TL;DR: Hetero-Mark is proposed to help heterogeneous-system programmers understand CPU-GPU collaborative computing and to guide computer architects in improving the design of the runtime and the driver.
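As a rough illustration of the collaborative pattern the suite targets, the sketch below splits one array between a CPU thread and a GPU. It is written in CUDA rather than the OpenCL/HSA stack Hetero-Mark actually uses, and the 50/50 split, kernel, and buffer names are invented for this example; it is a minimal sketch, not code from the benchmark suite.

```cuda
#include <cstdio>
#include <thread>
#include <cuda_runtime.h>

// Toy workload: scale every element by 2. The GPU handles the first half
// of the buffer while a CPU thread handles the second half.
__global__ void scaleGpu(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

static void scaleCpu(float *data, int n) {
    for (int i = 0; i < n; ++i) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    const int gpuShare = n / 2;  // static 50/50 split, purely for the sketch
    float *host = nullptr, *dev = nullptr;
    cudaMallocHost((void **)&host, n * sizeof(float));  // pinned, so async copies overlap
    cudaMalloc((void **)&dev, gpuShare * sizeof(float));
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    // Enqueue the GPU's share asynchronously, then let the CPU work in parallel.
    cudaMemcpyAsync(dev, host, gpuShare * sizeof(float), cudaMemcpyHostToDevice, 0);
    scaleGpu<<<(gpuShare + 255) / 256, 256>>>(dev, gpuShare);
    cudaMemcpyAsync(host, dev, gpuShare * sizeof(float), cudaMemcpyDeviceToHost, 0);
    std::thread cpuWorker(scaleCpu, host + gpuShare, n - gpuShare);

    cpuWorker.join();            // CPU half done
    cudaDeviceSynchronize();     // GPU half done
    printf("host[0]=%.1f host[n-1]=%.1f\n", host[0], host[n - 1]);
    cudaFreeHost(host);
    cudaFree(dev);
    return 0;
}
```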
Proceedings Article

MGPUSim: enabling multi-GPU performance modeling and optimization

TL;DR: This work presents MGPUSim, a cycle-accurate, extensively validated multi-GPU simulator based on AMD's Graphics Core Next 3 (GCN3) instruction set architecture, and proposes the Locality API, an API extension that allows the GPU programmer to avoid the complexity of multi-GPU programming while precisely controlling data placement in the multi-GPU memory.
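The paper's Locality API itself is not reproduced here. As a loose analogy for programmer-directed data placement across GPUs, the sketch below uses CUDA unified memory with the real cudaMemAdvise and cudaMemPrefetchAsync calls to pin each slice of a shared buffer near the GPU that will use it; the kernel and slicing scheme are invented for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each GPU increments its own slice of one shared managed buffer.
__global__ void touch(float *slice, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) slice[i] += 1.0f;
}

int main() {
    int numGpus = 0;
    cudaGetDeviceCount(&numGpus);
    if (numGpus == 0) return 0;

    const int n = 1 << 22;
    const int chunk = n / numGpus;   // assume numGpus divides n for the sketch
    float *buf = nullptr;
    cudaMallocManaged((void **)&buf, n * sizeof(float));  // one pointer, all GPUs

    // Placement hints: pin each slice in the memory of the GPU that uses it.
    for (int dev = 0; dev < numGpus; ++dev) {
        float *slice = buf + dev * chunk;
        cudaMemAdvise(slice, chunk * sizeof(float),
                      cudaMemAdviseSetPreferredLocation, dev);
        cudaMemPrefetchAsync(slice, chunk * sizeof(float), dev, 0);
    }

    // Each GPU now works on the slice placed in its local memory.
    for (int dev = 0; dev < numGpus; ++dev) {
        cudaSetDevice(dev);
        touch<<<(chunk + 255) / 256, 256>>>(buf + dev * chunk, chunk);
    }
    for (int dev = 0; dev < numGpus; ++dev) {
        cudaSetDevice(dev);
        cudaDeviceSynchronize();
    }
    printf("touched %d elements on %d GPU(s)\n", chunk * numGpus, numGpus);
    cudaFree(buf);
    return 0;
}
```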
Proceedings Article

Asymmetric NoC Architectures for GPU Systems

TL;DR: An asymmetric NoC design tailored to a GPU's memory access pattern is explored, providing one network for L1-to-L2 communication and a second for L2-to-L1 traffic; the results show that an asymmetric multi-network Cmesh provides the most energy-efficient communication fabric for the target GPU system.
Proceedings Article

A comprehensive performance analysis of HSA and OpenCL 2.0

TL;DR: This paper provides the first comprehensive study of OpenCL 2.0 and HSA 1.0 execution, with OpenCL 1.2 as the baseline, and finds that using HSA signals removes 92% of the overhead due to synchronous kernel launches.
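The HSA signal mechanism itself is not shown here. As an analogy for the cost the paper attributes to synchronous kernel launches, the CUDA sketch below times one-synchronization-per-launch against batched asynchronous launches with a single synchronization at the end; the empty kernel and iteration count are arbitrary choices for the illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void tiny() {}  // empty kernel, so launch/sync overhead dominates

int main() {
    const int iters = 1000;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms = 0.0f;

    // Synchronous style: the host blocks after every single launch.
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) {
        tiny<<<1, 1>>>();
        cudaDeviceSynchronize();
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("per-launch sync: %.3f ms\n", ms);

    // Asynchronous style: enqueue everything, synchronize once at the end.
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) tiny<<<1, 1>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("single sync:     %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```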
Proceedings Article

Profiling DNN Workloads on a Volta-based DGX-1 System

TL;DR: This work profiles and analyzes the training of five popular DNNs using 1, 2, 4, and 8 GPUs, and shows the breakdown of training time across the FP+BP (forward and backward propagation) stage and the WU (weight update) stage, providing insights into the limiting factors of the training algorithm and identifying bottlenecks in the multi-GPU system architecture.
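As a minimal, hypothetical illustration of the kind of per-stage breakdown the paper reports, the CUDA sketch below brackets two stand-in kernels, playing the roles of FP+BP and WU, with events and prints per-stage time. Real training would run cuDNN/cuBLAS kernels in these slots; the dummy kernels here are invented for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-ins for the two stages the paper separates: forward+backward
// propagation (FP+BP) and weight update (WU). These dummies just occupy
// the GPU so the event-based timing has something to measure.
__global__ void fpBpStage(float *w, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) w[i] = w[i] * 0.999f + 0.001f;   // pretend gradient math
}
__global__ void wuStage(float *w, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) w[i] -= 0.01f * w[i];            // pretend SGD step
}

int main() {
    const int n = 1 << 20;
    float *w = nullptr;
    cudaMalloc((void **)&w, n * sizeof(float));
    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0); cudaEventCreate(&t1); cudaEventCreate(&t2);

    cudaEventRecord(t0);
    fpBpStage<<<(n + 255) / 256, 256>>>(w, n);
    cudaEventRecord(t1);                         // FP+BP / WU boundary
    wuStage<<<(n + 255) / 256, 256>>>(w, n);
    cudaEventRecord(t2);
    cudaEventSynchronize(t2);

    float fpBpMs = 0.0f, wuMs = 0.0f;
    cudaEventElapsedTime(&fpBpMs, t0, t1);
    cudaEventElapsedTime(&wuMs, t1, t2);
    printf("FP+BP: %.3f ms  WU: %.3f ms\n", fpBpMs, wuMs);
    cudaFree(w);
    return 0;
}
```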