Hongwen Dai

Researcher at North Carolina State University

Publications - 9
Citations - 213

Hongwen Dai is an academic researcher from North Carolina State University. The author has contributed to research in topics including Cache and Cache pollution, has an h-index of 5, and has co-authored 9 publications receiving 175 citations.

Papers
Proceedings Article

Locality-Driven Dynamic GPU Cache Bypassing

TL;DR: This paper presents a design that integrates locality filtering based on reuse characteristics of GPU workloads into the decoupled tag store of the existing L1 D-cache through simple and cost-effective hardware extensions.
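
As a rough illustration of the bypass decision this TL;DR describes, here is a minimal, simulator-style sketch (CUDA host code, compilable with nvcc). The decoupled tag store is abstracted as a hash map, and the counter width and kReuseThreshold value are illustrative assumptions, not the paper's design:

```cuda
#include <cstdint>
#include <unordered_map>

// Decoupled tag store abstracted as a map from cache-line address to a
// saturating reuse counter (hypothetical layout, for illustration only).
class LocalityFilter {
    std::unordered_map<uint64_t, uint8_t> tag_store;
    static constexpr uint8_t kReuseThreshold = 1;  // illustrative value
public:
    // Returns true if the access should bypass the L1 D-cache.
    bool should_bypass(uint64_t line_addr) {
        auto it = tag_store.find(line_addr);
        if (it == tag_store.end()) {
            tag_store[line_addr] = 0;  // first touch: no reuse observed yet
            return true;               // treat as streaming -> bypass
        }
        if (it->second < 255) ++it->second;   // record another reuse
        return it->second < kReuseThreshold;  // proven reuse -> cache the line
    }
};
```
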
Proceedings Article

Accelerate GPU Concurrent Kernel Execution by Mitigating Memory Pipeline Stalls

TL;DR: The proposed schemes significantly improve the weighted speedup of two state-of-the-art intra-SM sharing schemes, Warped-Slicer and SMK, by 24.6% and 27.2% on average, respectively, with lightweight hardware overhead.
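
Concurrent kernel execution itself is exposed to software through CUDA streams; the stall-mitigation schemes evaluated here are hardware policies beneath that API. A self-contained sketch of co-running a latency-bound and a bandwidth-bound kernel (the kernel bodies are illustrative):

```cuda
#include <cuda_runtime.h>

__global__ void latency_bound(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = a[i] * 2.0f + 1.0f;  // compute-light, little traffic
}

__global__ void bandwidth_bound(const float *src, float *dst, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i];  // streaming copy, memory-pipeline heavy
}

int main() {
    const int n = 1 << 20;
    float *a, *src, *dst;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&src, n * sizeof(float));
    cudaMalloc(&dst, n * sizeof(float));

    // Separate streams let the hardware co-schedule the two kernels.
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    latency_bound<<<n / 256, 256, 0, s1>>>(a, n);
    bandwidth_bound<<<n / 256, 256, 0, s2>>>(src, dst, n);
    cudaDeviceSynchronize();

    cudaStreamDestroy(s1); cudaStreamDestroy(s2);
    cudaFree(a); cudaFree(src); cudaFree(dst);
    return 0;
}
```
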
Proceedings Article

Understanding the tradeoffs between software-managed vs. hardware-managed caches in GPUs

TL;DR: This paper presents an in-depth study that reveals interesting and sometimes unexpected tradeoffs between shared memory and the hardware-managed L1 D-caches in GPU architecture, and shows that most benchmarks perform significantly better with shared memory than with the L1 D-caches.
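
The tradeoff under study can be seen directly in CUDA: the same 1D 3-point stencil written once to rely on the hardware-managed L1 D-cache and once with an explicitly managed shared-memory tile. A minimal sketch (kernel names and tile layout are illustrative):

```cuda
// Relies on the L1 D-cache to capture reuse of neighboring elements.
__global__ void stencil_l1(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        out[i] = in[i - 1] + in[i] + in[i + 1];
}

// Stages the same reuse explicitly in the software-managed scratchpad.
__global__ void stencil_smem(const float *in, float *out, int n) {
    extern __shared__ float tile[];  // blockDim.x + 2 elements (with halo)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x + 1;
    if (i < n) {
        tile[t] = in[i];
        if (threadIdx.x == 0 && i > 0)                  tile[0] = in[i - 1];
        if (threadIdx.x == blockDim.x - 1 && i < n - 1) tile[t + 1] = in[i + 1];
    }
    __syncthreads();
    if (i > 0 && i < n - 1)
        out[i] = tile[t - 1] + tile[t] + tile[t + 1];
}
// Launch: stencil_smem<<<grid, block, (block + 2) * sizeof(float)>>>(in, out, n);
```
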
Proceedings Article

A model-driven approach to warp/thread-block level GPU cache bypassing

TL;DR: This paper proposes a simple yet effective performance model to estimate the impact of cache contention and resource congestion as a function of the number of warps/thread blocks that bypass the cache, and designs a hardware-based dynamic warp/thread-block-level GPU cache bypassing scheme.
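
A hedged sketch of what "model-driven" means here (host-side code; the cost curve below is a stand-in invented for illustration, not the paper's model). It scores each candidate bypass count by trading cache contention against memory-pipeline congestion and picks the minimum:

```cuda
// Hypothetical cost model: estimated cycles per access when `bypassing`
// of the `total` warps skip the L1 D-cache. All constants are stand-ins.
static double estimated_cost(int bypassing, int total,
                             double hit_lat, double miss_lat,
                             double congestion_per_bypass) {
    int cached = total - bypassing;
    // More warps sharing the cache -> more contention -> lower hit rate.
    double hit_rate = 1.0 / (1.0 + 0.25 * cached);  // stand-in curve
    double cached_cost = hit_rate * hit_lat + (1.0 - hit_rate) * miss_lat;
    // Bypassed accesses always pay miss latency plus added congestion.
    double bypass_cost = miss_lat + congestion_per_bypass * bypassing;
    return (cached * cached_cost + bypassing * bypass_cost) / total;
}

// Pick the bypass count minimizing the modeled cost; the hardware would
// then mark that many warps (or thread blocks) as cache-bypassing.
int best_bypass_count(int total_warps) {
    int best = 0;
    double best_cost = 1e30;
    for (int b = 0; b <= total_warps; ++b) {
        double c = estimated_cost(b, total_warps, /*hit*/ 30.0,
                                  /*miss*/ 300.0, /*congestion*/ 4.0);
        if (c < best_cost) { best_cost = c; best = b; }
    }
    return best;
}
```
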
Journal Article

Coordinated CTA Combination and Bandwidth Partitioning for GPU Concurrent Kernel Execution

TL;DR: A coordinated approach for CTA combination and bandwidth partitioning that dynamically detects co-running kernels as latency-sensitive or bandwidth-intensive, and allocates more CTA resources to latency-sensitive kernels and more NoC/DRAM bandwidth resources to NoC-/DRAM-intensive kernels.
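
A loose sketch of that coordination logic (host-side C++; the classification counter, thresholds, and 60/40 splits are placeholders, not the paper's values):

```cuda
struct KernelStats {
    double dram_util;  // sampled fraction of DRAM bandwidth consumed
};

enum class KernelClass { LatencySensitive, BandwidthIntensive };

// Illustrative rule: heavy DRAM users are classified bandwidth-intensive.
KernelClass classify(const KernelStats &s) {
    return s.dram_util > 0.5 ? KernelClass::BandwidthIntensive
                             : KernelClass::LatencySensitive;
}

// Split CTA slots and NoC/DRAM bandwidth between two co-running kernels:
// the latency-sensitive kernel gets more CTA slots, the bandwidth-
// intensive one the larger bandwidth share.
void partition(const KernelStats &k1, const KernelStats &k2,
               int total_cta_slots,
               int &cta1, int &cta2, double &bw1, double &bw2) {
    bool k1_bw = classify(k1) == KernelClass::BandwidthIntensive;
    bool k2_bw = classify(k2) == KernelClass::BandwidthIntensive;
    if (k1_bw == k2_bw) {            // same class: split evenly
        cta1 = total_cta_slots / 2;
        bw1 = 0.5;
    } else {                         // mixed pair: skew both resources
        cta1 = k1_bw ? total_cta_slots * 2 / 5 : total_cta_slots * 3 / 5;
        bw1  = k1_bw ? 0.6 : 0.4;
    }
    cta2 = total_cta_slots - cta1;
    bw2  = 1.0 - bw1;
}
```
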