Home
/
Authors
/
Xuhao Chen

Author

Xuhao Chen

National University of Defense Technology

Other affiliations: Massachusetts Institute of Technology, University of Texas at Austin, University of Illinois at Urbana–Champaign

Bio: Xuhao Chen is an academic researcher from National University of Defense Technology. The author has contributed to research in topics: Speedup & Cache. The author has an hindex of 9, co-authored 37 publications receiving 363 citations. Previous affiliations of Xuhao Chen include Massachusetts Institute of Technology & University of Texas at Austin.

Topics: Speedup, Cache, Parallel algorithm, Overhead (computing), Register file ...read more

Papers

PDF

Open Access

More filters

Proceedings Article•DOI•

Adaptive Cache Management for Energy-Efficient GPU Computing

[...]

Xuhao Chen¹, Li-Wen Chang¹, Christopher I. Rodrigues¹, Jie Lv¹, Zhiying Wang², Wen-mei W. Hwu¹ - Show less +2 more•Institutions (2)

University of Illinois at Urbana–Champaign¹, National University of Defense Technology²

13 Dec 2014

TL;DR: A specialized cache management policy for GPGPUs is proposed that is coordinated with warp throttling to dynamically control the active number of warps and a simple predictor to dynamically estimate the optimal number of active warps that can take full advantage of the cache space and on-chip resources.

...read moreread less

Abstract: With the SIMT execution model, GPUs can hidememory latency through massive multithreading for many applications that have regular memory access patterns. To support applications with irregular memory access patterns, cache hierarchies have been introduced to GPU architectures to capture temporal and spatial locality and mitigate the effect of irregular accesses. However, GPU caches exhibit poor efficiency due to the mismatch of the throughput-oriented execution model and its cache hierarchy design, which limits system performance and energy-efficiency. The massive amount of memory requests generated by GPU scause cache contention and resource congestion. Existing CPUcache management policies that are designed for multicoresystems, can be suboptimal when directly applied to GPUcaches. We propose a specialized cache management policy for GPGPUs. The cache hierarchy is protected from contention by the bypass policy based on reuse distance. Contention and resource congestion are detected at runtime. To avoid oversaturatingon-chip resources, the bypass policy is coordinated with warp throttling to dynamically control the active number of warps. We also propose a simple predictor to dynamically estimate the optimal number of active warps that can take full advantage of the cache space and on-chip resources. Experimental results show that cache efficiency is significantly improved and on-chip resources are better utilized for cache sensitive benchmarks. This results in a harmonic mean IPCimprovement of 74% and 17% (maximum 661% and 44% IPCimprovement), compared to the baseline GPU architecture and optimal static warp throttling, respectively.

...read moreread less

142 citations

Journal Article•DOI•

Pangolin: an efficient and flexible graph mining system on CPU and GPU

[...]

Xuhao Chen¹, Roshan Dathathri¹, Gurbinder Gill¹, Keshav Pingali¹•Institutions (1)

University of Texas at Austin¹

01 Apr 2020

TL;DR: Pangolin this paper is an efficient and flexible in-memory graph pattern mining (GPM) framework targeting shared-memory CPUs and GPUs that provides high-level abstractions for GPU processing.

...read moreread less

Abstract: There is growing interest in graph pattern mining (GPM) problems such as motif counting. GPM systems have been developed to provide unified interfaces for programming algorithms for these problems and for running them on parallel systems. However, existing systems may take hours to mine even simple patterns in moderate-sized graphs, which significantly limits their real-world usability.We present Pangolin, an efficient and flexible in-memory GPM framework targeting shared-memory CPUs and GPUs. Pangolin is the first GPM system that provides high-level abstractions for GPU processing. It provides a simple programming interface based on the extend-reduce-filter model, which allows users to specify application specific knowledge for search space pruning and isomorphism test elimination. We describe novel optimizations that exploit locality, reduce memory consumption, and mitigate the overheads of dynamic memory allocation and synchronization.Evaluation on a 28-core CPU demonstrates that Pangolin outperforms existing GPM frameworks Arabesque, RStream, and Fractal by 49×, 88×, and 80× on average, respectively. Acceleration on a V100 GPU further improves performance of Pangolin by 15× on average. Compared to state-of-the-art hand-optimized GPM applications, Pangolin provides competitive performance with less programming effort.

...read moreread less

44 citations

Posted Content•

Pangolin: An Efficient and Flexible Graph Mining System on CPU and GPU

[...]

Xuhao Chen¹, Roshan Dathathri¹, Gurbinder Gill¹, Keshav Pingali¹•Institutions (1)

University of Texas at Austin¹

16 Nov 2019

TL;DR: Pangolin is the first graph mining system that supports GPU processing and provides a simple embedding-centric programming interface based on the extend-reduce-filter model, which enables user to specify application-specific knowledge like aggressive enumeration search space pruning and isomorphism test elimination.

...read moreread less

27 citations

Proceedings Article•DOI•

DistTC: High Performance Distributed Triangle Counting

[...]

Loc Hoang¹, Vishwesh Jatala¹, Xuhao Chen¹, Udit Agarwal¹, Roshan Dathathri¹, Gurbinder Gill¹, Keshav Pingali¹ - Show less +3 more•Institutions (1)

University of Texas at Austin¹

01 Sep 2019

TL;DR: Experimental results show that this novel multi-machine multi-GPU implementation of triangle counting which exploits a novel application-agnostic graph partitioning strategy that eliminates almost all inter-host communication during triangle counting can handle very large graphs such as clueweb12.

...read moreread less

Abstract: We describe a novel multi-machine multi-GPU implementation of triangle counting which exploits a novel application-agnostic graph partitioning strategy that eliminates almost all inter-host communication during triangle counting. Experimental results show that this distributed triangle counting implementation can handle very large graphs such as clueweb12, which has almost one billion vertices and 37 billion edges, and it is up to 1.6× faster than TriCore, the 2018 Graph Challenge champion.

...read moreread less

25 citations

Proceedings Article•DOI•

Architecting energy-efficient STT-RAM based register file on GPGPUs via delta compression

[...]

Hang Zhang¹, Xuhao Chen¹, Nong Xiao¹, Fang Liu¹•Institutions (1)

National University of Defense Technology¹

05 Jun 2016

TL;DR: This paper proposes to optimize STT-RAM based GPU register files for better energy-efficiency and performance via two techniques, including a light-weight compression framework with awareness of register value similarity and a centralized SRAM-based write buffer design to address the long write latency overhead.

...read moreread less

Abstract: To facilitate efficient context switches, GPUs usually employ a large-capacity register file to accommodate a massive amount of context information. However, the large register file introduces high power consumption, flowing to high leakage power SRAM cells. Emerging non-volatile STT-RAM memory has recently been studied as a potential replacement to alleviate the leakage challenge when constructing register files on GPUs. Unfortunately, due to the long write latency and high energy consumption associated with write operations in STT-RAM, simply replacing SRAM with STTRAM for register files would incur non-trivial performance overhead and only bring marginal energy benefits. In this paper, we propose to optimize STT-RAM based GPU register files for better energy-efficiency and performance via two techniques. First, we employ a light-weight compression framework with awareness of register value similarity. It is coupled with a group-based write driver control to mitigate the high energy overhead caused by STT-RAM writes. Second, to address the long write latency overhead of STT-RAM, we propose a centralized SRAM-based write buffer design to efficiently absorb STT-RAM writes with better buffer utilization, rather than the conventional design with distributed per-bank based write buffers. The experimental results show that our STT-RAM based register file design consumes only 37.4% energy over the SRAM baseline, while incurring only negligible performance degradation.

...read moreread less

25 citations

1
2
3
4
…
5
6
7
8

Collapse

Cited by

PDF

Open Access

More filters

Matrix Factorization Techniques for Recommender Systems

[...]

Patrick Seemann

01 Jan 2014

2,080 citations

Proceedings Article•DOI•

Adaptive Cache Management for Energy-Efficient GPU Computing

[...]

Xuhao Chen¹, Li-Wen Chang¹, Christopher I. Rodrigues¹, Jie Lv¹, Zhiying Wang², Wen-mei W. Hwu¹ - Show less +2 more•Institutions (2)

University of Illinois at Urbana–Champaign¹, National University of Defense Technology²

13 Dec 2014

...read moreread less

142 citations

Proceedings Article•DOI•

Coordinated static and dynamic cache bypassing for GPUs

[...]

Xiaolong Xie¹, Yun Liang¹, Yu Wang², Guangyu Sun¹, Tao Wang¹ - Show less +1 more•Institutions (2)

Peking University¹, Tsinghua University²

09 Mar 2015

TL;DR: In this paper, a coordinated static and dynamic cache bypassing technique is proposed to improve application performance by identifying the global loads that indicate strong preferences for caching or bypassing through profiling.

...read moreread less

Abstract: The massive parallel architecture enables graphics processing units (GPUs) to boost performance for a wide range of applications. Initially, GPUs only employ scratchpad memory as on-chip memory. Recently, to broaden the scope of applications that can be accelerated by GPUs, GPU vendors have used caches in conjunction with scratchpad memory as on-chip memory in the new generations of GPUs. Unfortunately, GPU caches face many performance challenges that arise due to excessive thread contention for cache resource. Cache bypassing, where memory requests can selectively bypass the cache, is one solution that can help to mitigate the cache resource contention problem. In this paper, we propose coordinated static and dynamic cache bypassing to improve application performance. At compile-time, we identify the global loads that indicate strong preferences for caching or bypassing through profiling. For the rest global loads, our dynamic cache bypassing has the flexibility to cache only a fraction of threads. In CUDA programming model, the threads are divided into work units called thread blocks. Our dynamic bypassing technique modulates the ratio of thread blocks that cache or bypass at run-time. We choose to modulate at thread block level in order to avoid the memory divergence problems. Our approach combines compile-time analysis that determines the cache or bypass preferences for global loads with run-time management that adjusts the ratio of thread blocks that cache or bypass. Our coordinated static and dynamic cache bypassing technique achieves up to 2.28X (average I.32X) performance speedup for a variety of GPU applications.

...read moreread less

129 citations

Proceedings Article•DOI•

Locality-Driven Dynamic GPU Cache Bypassing

[...]

Chao Li¹, Shuaiwen Leon Song², Hongwen Dai¹, Albert Sidelnik³, Siva Kumar Sastry Hari³, Huiyang Zhou¹ - Show less +2 more•Institutions (3)

North Carolina State University¹, Pacific Northwest National Laboratory², Nvidia³

08 Jun 2015

TL;DR: This paper presents a design that integrates locality filtering based on reuse characteristics of GPU workloads into the decoupled tag store of the existing L1 D-cache through simple and cost-effective hardware extensions.

...read moreread less

Abstract: This paper presents novel cache optimizations for massively parallel, throughput-oriented architectures like GPUs. L1 data caches (L1 D-caches) are critical resources for providing high-bandwidth and low-latency data accesses. However, the high number of simultaneous requests from single-instruction multiple-thread (SIMT) cores makes the limited capacity of L1 D-caches a performance and energy bottleneck, especially for memory-intensive applications. We observe that the memory access streams to L1 D-caches for many applications contain a significant amount of requests with low reuse, which greatly reduce the cache efficacy. Existing GPU cache management schemes are either based on conditional/reactive solutions or hit-rate based designs specifically developed for CPU last level caches, which can limit overall performance. To overcome these challenges, we propose an efficient locality monitoring mechanism to dynamically filter the access stream on cache insertion such that only the data with high reuse and short reuse distances are stored in the L1 D-cache. Specifically, we present a design that integrates locality filtering based on reuse characteristics of GPU workloads into the decoupled tag store of the existing L1 D-cache through simple and cost-effective hardware extensions. Results show that our proposed design can dramatically reduce cache contention and achieve up to 56.8% and an average of 30.3% performance improvement over the baseline architecture, for a range of highly-optimized cache-unfriendly applications with minor area overhead and better energy efficiency. Our design also significantly outperforms the state-of-the-art CPU and GPU bypassing schemes (especially for irregular applications), without generating extra L2 and DRAM level contention.

...read moreread less

109 citations

Proceedings Article•DOI•

An efficient compiler framework for cache bypassing on GPUs

[...]

Xiaolong Xie¹, Yun Liang¹, Guangyu Sun¹, Deming Chen²•Institutions (2)

Peking University¹, University of Illinois at Urbana–Champaign²

18 Nov 2013

TL;DR: An efficient compiler framework for cache bypassing on GPUs is proposed and efficient algorithms that judiciously select global load instructions for cache access or bypass are presented.

...read moreread less

Abstract: Graphics Processing Units (GPUs) have become ubiquitous for general purpose applications due to their tremendous computing power. Initially, GPUs only employ scratchpad memory as on-chip memory. Though scratchpad memory benefits many applications, it is not ideal for those general purpose applications with irregular memory accesses. Hence, GPU vendors have introduced caches in conjunction with scratchpad memory in the recent generations of GPUs. The caches on GPUs are highly-configurable. The programmer or the compiler can explicitly control cache access or bypass for global load instructions. This highly-configurable feature of GPU caches opens up the opportunities for optimizing the cache performance. In this paper, we propose an efficient compiler framework for cache bypassing on GPUs. Our objective is to efficiently utilize the configurable cache and improve the overall performance for general purpose GPU applications. In order to achieve this goal, we first characterize GPU cache utilization and develop performance metrics to estimate the cache reuses and memory traffic. Next, we present efficient algorithms that judiciously select global load instructions for cache access or bypass. Finally, we integrate our techniques into an automatic compiler framework that leverages PTX instruction set architecture. Experiments evaluation demonstrates that compared to cache-all and bypass-all solutions, our techniques can achieve considerable performance improvement.

...read moreread less

87 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79

Collapse