Author

Chao Li

Other affiliations: Princeton University
Bio: Chao Li is an academic researcher from North Carolina State University. The author has contributed to research in topics including Cache and Cache pollution. The author has an h-index of 10 and has co-authored 18 publications receiving 490 citations. Previous affiliations of Chao Li include Princeton University.

Papers
Proceedings ArticleDOI
06 Feb 2014
TL;DR: A new SpMV format, called blocked compressed common coordinate (BCCOO), is devised; it uses bit flags to store the row indices in a blocked common coordinate format so as to alleviate the bandwidth problem, and an auto-tuning framework is introduced to choose optimization parameters based on the characteristics of input sparse matrices and target hardware platforms.
Abstract: SpMV is a key linear algebra algorithm and has been widely used in many important application domains. As a result, numerous attempts have been made to optimize SpMV on GPUs to leverage their massive computational throughput. Although the previous work has shown impressive progress, load imbalance and high memory bandwidth remain the critical performance bottlenecks for SpMV. In this paper, we present our novel solutions to these problems. First, we devise a new SpMV format, called blocked compressed common coordinate (BCCOO), which uses bit flags to store the row indices in a blocked common coordinate (COO) format so as to alleviate the bandwidth problem. We further improve this format by partitioning the matrix into vertical slices to enhance the cache hit rates when accessing the vector to be multiplied. Second, we revisit the segmented scan approach for SpMV to address the load imbalance problem. We propose a highly efficient matrix-based segmented sum/scan for SpMV and further improve it by eliminating global synchronization. Then, we introduce an auto-tuning framework to choose optimization parameters based on the characteristics of input sparse matrices and target hardware platforms. Our experimental results on GTX680 GPUs and GTX480 GPUs show that our proposed framework achieves significant performance improvement over the vendor tuned CUSPARSE V5.0 (up to 229% and 65% on average on GTX680 GPUs, up to 150% and 42% on average on GTX480 GPUs) and some most recently proposed schemes (e.g., up to 195% and 70% on average over clSpMV on GTX680 GPUs, up to 162% and 40% on average over clSpMV on GTX480 GPUs).

134 citations
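
The key idea of the BCCOO format, replacing COO's explicit per-element row indices with a compact bit-flag array, can be sketched on the CPU. Below is a minimal, unblocked illustration assuming a flag of 1 means the row continues and 0 marks the last nonzero of a row (and that there are no empty rows); it is not the paper's blocked GPU implementation or its matrix-based segmented scan.

```python
import numpy as np

# A small sparse matrix in COO form (one row, col, value triple per nonzero).
rows = np.array([0, 0, 1, 2, 2, 2])
cols = np.array([0, 2, 1, 0, 1, 3])
vals = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
num_rows, num_cols = 3, 4

# Replace the row-index array with one bit per nonzero:
# 0 marks the last nonzero of its row, 1 means the row continues.
# (Simplified, unblocked variant; assumes the matrix has no empty rows.)
flags = np.ones(len(vals), dtype=np.uint8)
flags[np.flatnonzero(np.diff(np.append(rows, num_rows)) > 0)] = 0

x = np.array([1.0, 1.0, 1.0, 1.0])    # dense vector to multiply

# SpMV as a sequential segmented sum: accumulate products until a
# 0 flag closes the current row, then emit the partial sum.
y = np.zeros(num_rows)
row, acc = 0, 0.0
for flag, col, val in zip(flags, cols, vals):
    acc += val * x[col]
    if flag == 0:                     # row boundary reached
        y[row], row, acc = acc, row + 1, 0.0

# Verify against a dense reference multiply.
dense = np.zeros((num_rows, num_cols))
dense[rows, cols] = vals
assert np.allclose(y, dense @ x)
print(y)                              # [ 3.  3. 15.]
```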

Proceedings ArticleDOI
08 Jun 2015
TL;DR: This paper presents a design that integrates locality filtering based on reuse characteristics of GPU workloads into the decoupled tag store of the existing L1 D-cache through simple and cost-effective hardware extensions.
Abstract: This paper presents novel cache optimizations for massively parallel, throughput-oriented architectures like GPUs. L1 data caches (L1 D-caches) are critical resources for providing high-bandwidth and low-latency data accesses. However, the high number of simultaneous requests from single-instruction multiple-thread (SIMT) cores makes the limited capacity of L1 D-caches a performance and energy bottleneck, especially for memory-intensive applications. We observe that the memory access streams to L1 D-caches for many applications contain a significant amount of requests with low reuse, which greatly reduce the cache efficacy. Existing GPU cache management schemes are either based on conditional/reactive solutions or hit-rate based designs specifically developed for CPU last level caches, which can limit overall performance. To overcome these challenges, we propose an efficient locality monitoring mechanism to dynamically filter the access stream on cache insertion such that only the data with high reuse and short reuse distances are stored in the L1 D-cache. Specifically, we present a design that integrates locality filtering based on reuse characteristics of GPU workloads into the decoupled tag store of the existing L1 D-cache through simple and cost-effective hardware extensions. Results show that our proposed design can dramatically reduce cache contention and achieve up to 56.8% and an average of 30.3% performance improvement over the baseline architecture, for a range of highly-optimized cache-unfriendly applications with minor area overhead and better energy efficiency. Our design also significantly outperforms the state-of-the-art CPU and GPU bypassing schemes (especially for irregular applications), without generating extra L2 and DRAM level contention.

109 citations
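
The locality-filtering idea, admitting a line into the small L1 only after it has demonstrated reuse, can be mimicked with a toy software model. The bypass policy below (insert on the second touch of a tag remembered in a small recency table) is a hypothetical simplification for illustration, not the decoupled-tag-store design evaluated in the paper.

```python
from collections import OrderedDict

class FilteredCache:
    """Toy fully-associative LRU cache that bypasses low-reuse lines:
    a tag is inserted only after it has been seen once before in a
    small table of recently missed tags (the 'locality filter')."""

    def __init__(self, cache_lines=4, filter_entries=8):
        self.cache = OrderedDict()       # resident tags, LRU order
        self.filt = OrderedDict()        # recently missed tags
        self.cache_lines, self.filter_entries = cache_lines, filter_entries
        self.hits = self.misses = 0

    def access(self, tag):
        if tag in self.cache:                    # hit: refresh LRU position
            self.cache.move_to_end(tag)
            self.hits += 1
            return
        self.misses += 1
        if tag in self.filt:                     # reuse observed: insert the line
            del self.filt[tag]
            self.cache[tag] = None
            if len(self.cache) > self.cache_lines:
                self.cache.popitem(last=False)   # evict LRU line
        else:                                    # first touch: bypass, remember tag
            self.filt[tag] = None
            if len(self.filt) > self.filter_entries:
                self.filt.popitem(last=False)

hot = [0, 1, 2] * 8                    # small hot set with short reuse distance
stream = list(range(100, 124))         # streaming tags, each touched once
trace = [t for pair in zip(stream, hot) for t in pair]

cache = FilteredCache()
for tag in trace:
    cache.access(tag)
# The hot set hits after a short warm-up; the stream never pollutes the cache.
print(f"hits={cache.hits} misses={cache.misses}")
```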

Proceedings ArticleDOI
13 Nov 2016
TL;DR: In this article, the authors study the memory efficiency of various CNN layers and reveal the performance implications of both data layouts and memory access patterns, with speedups of up to 27.9× for a single layer and up to 5.6× for whole networks.
Abstract: Leveraging large data sets, deep Convolutional Neural Networks (CNNs) achieve state-of-the-art recognition accuracy. Due to the substantial compute and memory operations, however, they require significant execution time. The massive parallel computing capability of GPUs makes them one of the ideal platforms to accelerate CNNs, and a number of GPU-based CNN libraries have been developed. While existing works mainly focus on the computational efficiency of CNNs, the memory efficiency of CNNs has been largely overlooked. Yet CNNs have intricate data structures and their memory behavior can have a significant impact on performance. In this work, we study the memory efficiency of various CNN layers and reveal the performance implications of both data layouts and memory access patterns. Experiments show the universal effect of our proposed optimizations on both single layers and various networks, with speedups of up to 27.9× for a single layer and up to 5.6× for whole networks.

80 citations
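
One concrete instance of the data-layout effect discussed above is the choice between the NCHW and NHWC tensor layouts, which determines whether neighboring threads (or inner-loop iterations) touch contiguous addresses. The snippet below is a generic NumPy illustration of the two layouts and their linear offsets; the layout names are standard conventions and the shapes are arbitrary, not taken from the paper.

```python
import numpy as np

N, C, H, W = 8, 16, 32, 32
x_nchw = np.arange(N * C * H * W, dtype=np.float32).reshape(N, C, H, W)

# NHWC stores the channel dimension innermost, so the C values of one pixel
# are contiguous; NCHW keeps each channel's H*W plane contiguous instead.
x_nhwc = np.ascontiguousarray(x_nchw.transpose(0, 2, 3, 1))

# The same element is addressed with different linear offsets:
n, c, h, w = 1, 3, 5, 7
off_nchw = ((n * C + c) * H + h) * W + w
off_nhwc = ((n * H + h) * W + w) * C + c
assert x_nchw.ravel()[off_nchw] == x_nhwc.ravel()[off_nhwc]

# Consecutive w positions (e.g. consecutive threads mapped to w) are
# unit-stride in NCHW but C elements apart in NHWC, so the memory access
# pattern a kernel sees depends on the layout it is given.
print("stride along w (elements):", x_nchw.strides[3] // 4, "vs", x_nhwc.strides[2] // 4)
```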

Journal ArticleDOI
TL;DR: This paper first studies a set of GPGPU benchmarks that contain parallel loops and highlights that the benefits of leveraging such parallel loops using dynamic parallelism are too limited to offset its overhead; it then presents the proposed solution to exploit nested parallelism in CUDA, referred to as CUDA-NP.
Abstract: Parallel programs consist of a series of code sections with different thread-level parallelism (TLP). As a result, it is rather common that a thread in a parallel program, such as a GPU kernel in CUDA programs, still contains both sequential code and parallel loops. In order to leverage such parallel loops, the latest NVIDIA Kepler architecture introduces dynamic parallelism, which allows a GPU thread to start another GPU kernel, thereby reducing the overhead of launching kernels from a CPU. However, with dynamic parallelism, a parent thread can only communicate with its child threads through global memory, and the overhead of launching GPU kernels is non-trivial even within GPUs. In this paper, we first study a set of GPGPU benchmarks that contain parallel loops, and highlight that these benchmarks do not have a very high loop count or a high degree of TLP. Consequently, the benefits of leveraging such parallel loops using dynamic parallelism are too limited to offset its overhead. We then present our proposed solution to exploit nested parallelism in CUDA, referred to as CUDA-NP. With CUDA-NP, we initially enable a high number of threads when a GPU program starts, and use control flow to activate different numbers of threads for different code sections. We implement our proposed CUDA-NP framework using a directive-based compiler approach. For a GPU kernel, an application developer only needs to add OpenMP-like pragmas for parallelizable code sections. Then, our CUDA-NP compiler automatically generates the optimized GPU kernels. It supports both the reduction and the scan primitives, explores different ways to distribute parallel loop iterations into threads, and efficiently manages on-chip resources. Our experiments show that for a set of GPGPU benchmarks, which have already been optimized and contain nested parallelism, our proposed CUDA-NP framework further improves the performance by up to 6.69 times, and by 2.01 times on average.

55 citations
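
The core mechanism of CUDA-NP, activating a pool of already-launched slave threads for a nested parallel loop through control flow rather than launching a child kernel, comes down to how loop iterations are mapped onto that fixed thread pool. The sketch below only illustrates the two standard distribution choices (cyclic vs. blocked) for such a mapping; it is a plain Python illustration, not the compiler framework described in the paper.

```python
# Map the iterations of a nested loop of length n onto a fixed pool of
# `num_threads` slave threads, the way a pre-launched GPU thread would
# pick its share via control flow instead of a child-kernel launch.

def cyclic_iters(tid, num_threads, n):
    """Cyclic distribution: thread tid takes iterations tid, tid+T, tid+2T, ..."""
    return list(range(tid, n, num_threads))

def blocked_iters(tid, num_threads, n):
    """Blocked distribution: thread tid takes one contiguous chunk."""
    chunk = (n + num_threads - 1) // num_threads
    return list(range(tid * chunk, min((tid + 1) * chunk, n)))

n, num_threads = 10, 4
for tid in range(num_threads):
    print(f"thread {tid}: cyclic {cyclic_iters(tid, num_threads, n)}"
          f"  blocked {blocked_iters(tid, num_threads, n)}")

# Every iteration is covered exactly once under either mapping.
for dist in (cyclic_iters, blocked_iters):
    covered = sorted(i for t in range(num_threads) for i in dist(t, num_threads, n))
    assert covered == list(range(n))
```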

Patent
20 May 2016
TL;DR: In this paper, the authors examine the performance impact of different data layouts and then describe a method to produce data layout selection for various layers of the CNN, including a fast transformation implementation.
Abstract: Aspects of the present disclosure are directed to techniques that improve the performance of CNN systems through improved memory efficiency for CNNs operating on GPUs. Aspects of the disclosure demonstrate that off-chip memory in such CNN systems is underutilized due to at least three characteristics, namely data layout, data locality, and inter-kernel redundancy. Aspects of the disclosure examine the performance impact of different data layouts and then describe a method to produce data layout selections for various layers of the CNN, including a fast transformation implementation. Disclosed are improvements to data locality from working-set expansion, elimination of inter-kernel redundancy, and an increase of TLP using kernel reconstruction techniques including kernel fusion and thread injection. Disclosed experimental results show that these optimizations are very effective at boosting the performance of CNNs, by up to 9.76 times for a single kernel and 2.05 times for a network.

36 citations
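
Of the kernel-reconstruction techniques listed, kernel fusion is the easiest to sketch: two back-to-back element-wise stages executed separately force the intermediate result through memory, while a fused pass keeps it in a local variable (a register, in the GPU analogue). The snippet below counts those element transfers for a hypothetical ReLU-then-scale pair; it illustrates the general idea of eliminating inter-kernel redundancy, not the patented method.

```python
import numpy as np

x = np.random.rand(1 << 16).astype(np.float32)

# Unfused: stage 1 writes an intermediate array that stage 2 reads back,
# so the intermediate makes a round trip through memory.
def unfused(x):
    tmp = np.maximum(x, 0.0)                 # stage 1 (e.g. ReLU)
    y = tmp * 2.0 + 1.0                      # stage 2 (e.g. scale + bias)
    transfers = 4 * x.size                   # read x, write tmp, read tmp, write y
    return y, transfers

# Fused: one pass over the data; the intermediate never leaves the loop body.
def fused(x):
    y = np.empty_like(x)
    for i in range(x.size):
        t = x[i] if x[i] > 0.0 else 0.0      # stage 1
        y[i] = t * 2.0 + 1.0                 # stage 2
    transfers = 2 * x.size                   # read x, write y
    return y, transfers

y1, t1 = unfused(x)
y2, t2 = fused(x)
assert np.allclose(y1, y2)
print(f"element transfers: unfused={t1}, fused={t2}")
```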


Cited by
Journal ArticleDOI
Tal Ben-Nun1, Torsten Hoefler1
TL;DR: The problem of accelerating DNN training is described from a theoretical perspective, followed by approaches for its parallelization, and potential directions for parallelism in deep learning are extrapolated.
Abstract: Deep Neural Networks (DNNs) are becoming an important tool in modern computing applications. Accelerating their training is a major challenge and techniques range from distributed algorithms to low-level circuit design. In this survey, we describe the problem from a theoretical perspective, followed by approaches for its parallelization. We present trends in DNN architectures and the resulting implications on parallelization strategies. We then review and model the different types of concurrency in DNNs: from the single operator, through parallelism in network inference and training, to distributed deep learning. We discuss asynchronous stochastic optimization, distributed system architectures, communication schemes, and neural architecture search. Based on those approaches, we extrapolate potential directions for parallelism in deep learning.

433 citations
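
Among the forms of concurrency the survey covers, data parallelism is the simplest to sketch: each replica computes gradients on its own shard of the mini-batch and the shards' gradients are averaged (the all-reduce step) before the update. The NumPy snippet below shows that equivalence for a linear least-squares model; it is a generic textbook illustration, not code from the survey.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 5))
y = rng.normal(size=64)
w = np.zeros(5)

def grad(w, Xb, yb):
    """Mean-squared-error gradient for a linear model on one mini-batch."""
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

# Single-worker gradient on the full batch ...
g_full = grad(w, X, y)

# ... equals the average of per-replica gradients on equal-sized shards
# (the all-reduce step in synchronous data-parallel training).
shards = np.split(np.arange(64), 4)
g_avg = np.mean([grad(w, X[s], y[s]) for s in shards], axis=0)

assert np.allclose(g_full, g_avg)
print("data-parallel average matches the full-batch gradient")
```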

Posted Content
TL;DR: This paper presents CMSIS-NN, efficient kernels developed to maximize the performance and minimize the memory footprint of neural network (NN) applications on Arm Cortex-M processors targeted at intelligent IoT edge devices.
Abstract: Deep Neural Networks are becoming increasingly popular in always-on IoT edge devices performing data analytics right at the source, reducing latency as well as energy consumption for data communication. This paper presents CMSIS-NN, efficient kernels developed to maximize the performance and minimize the memory footprint of neural network (NN) applications on Arm Cortex-M processors targeted for intelligent IoT edge devices. Neural network inference based on CMSIS-NN kernels achieves 4.6X improvement in runtime/throughput and 4.9X improvement in energy efficiency.

278 citations
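
The kernels CMSIS-NN provides operate on 8-bit fixed-point (q7) data, accumulating in a wider integer type and then shifting and saturating back to 8 bits. The snippet below sketches that arithmetic pattern generically; the function name and the simple shift-based requantization are illustrative assumptions, not the CMSIS-NN API.

```python
import numpy as np

def q7_dot(a_q7, b_q7, out_shift):
    """Dot product of two q7 (int8) vectors: accumulate in a wide integer,
    shift down, then saturate back into the int8 range.
    (Generic fixed-point sketch, not the CMSIS-NN API.)"""
    acc = int(np.sum(a_q7.astype(np.int32) * b_q7.astype(np.int32)))
    acc >>= out_shift                            # requantize (illustrative scheme)
    return np.int8(max(-128, min(127, acc)))     # saturate to q7

rng = np.random.default_rng(1)
a = rng.integers(-128, 128, size=32, dtype=np.int8)
b = rng.integers(-128, 128, size=32, dtype=np.int8)

y = q7_dot(a, b, out_shift=7)
exact = int(np.dot(a.astype(np.int64), b.astype(np.int64))) >> 7
# The q7 result equals the exact shifted value, clamped to [-128, 127].
print("q7 result:", y, "| exact shifted value before saturation:", exact)
```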

Proceedings ArticleDOI
01 Oct 2019
TL;DR: In this paper, a Harmonic Densely Connected Network is proposed to achieve high efficiency in terms of both low MACs and low memory traffic for real-time object detection and semantic segmentation.
Abstract: State-of-the-art neural network architectures such as ResNet, MobileNet, and DenseNet have achieved outstanding accuracy over low MACs and small model size counterparts. However, these metrics might not be accurate for predicting the inference time. We suggest that memory traffic for accessing intermediate feature maps can be a factor dominating the inference latency, especially in such tasks as real-time object detection and semantic segmentation of high-resolution video. We propose a Harmonic Densely Connected Network to achieve high efficiency in terms of both low MACs and memory traffic. The new network achieves 35%, 36%, 30%, 32%, and 45% inference time reduction compared with FC-DenseNet-103, DenseNet-264, ResNet-50, ResNet-152, and SSD-VGG, respectively. We use tools including Nvidia profiler and ARM Scale-Sim to measure the memory traffic and verify that the inference latency is indeed proportional to the memory traffic consumption and the proposed network consumes low memory traffic. We conclude that one should take memory traffic into consideration when designing neural network architectures for high-resolution applications at the edge.

238 citations
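
The metric argued for here, traffic for intermediate feature maps rather than MACs alone, can be estimated from layer shapes with simple arithmetic. The helper below does this for a hypothetical stride-1 convolution, counting only input and output feature-map bytes and assuming the weights stay on-chip; this is a common simplification for illustration, not the paper's measurement methodology (which uses the Nvidia profiler and ARM Scale-Sim).

```python
def conv_layer_stats(h, w, c_in, c_out, k, bytes_per_elem=4):
    """MACs and an estimate of off-chip feature-map traffic for a stride-1,
    'same'-padded k x k convolution (weights assumed to stay on-chip)."""
    macs = h * w * c_in * c_out * k * k
    traffic = (h * w * c_in + h * w * c_out) * bytes_per_elem  # read input + write output
    return macs, traffic

# Two layers with identical MACs but very different feature-map traffic:
# a low-resolution, wide layer vs. a high-resolution, narrow one.
for name, (h, w, c_in, c_out) in {
    "low-res, many channels ": (28, 28, 256, 256),
    "high-res, few channels ": (224, 224, 32, 32),
}.items():
    macs, traffic = conv_layer_stats(h, w, c_in, c_out, k=3)
    print(f"{name} MACs={macs/1e6:.0f}M  feature-map traffic={traffic/1e6:.1f}MB")
```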

Proceedings ArticleDOI
08 Jun 2015
TL;DR: CSR5 (Compressed Sparse Row 5), a new storage format that offers high-throughput SpMV on various platforms including CPUs, GPUs, and Xeon Phi, is proposed; thanks to its low overhead for format conversion, it is also practical for real-world applications such as solvers with only tens of iterations.
Abstract: Sparse matrix-vector multiplication (SpMV) is a fundamental building block for numerous applications. In this paper, we propose CSR5 (Compressed Sparse Row 5), a new storage format, which offers high-throughput SpMV on various platforms including CPUs, GPUs and Xeon Phi. First, the CSR5 format is insensitive to the sparsity structure of the input matrix. Thus the single format can support an SpMV algorithm that is efficient both for regular matrices and for irregular matrices. Furthermore, we show that the overhead of the format conversion from the CSR to the CSR5 can be as low as the cost of a few SpMV operations. We compare the CSR5-based SpMV algorithm with 11 state-of-the-art formats and algorithms on four mainstream processors using 14 regular and 10 irregular matrices as a benchmark suite. For the 14 regular matrices in the suite, we achieve comparable or better performance over the previous work. For the 10 irregular matrices, the CSR5 obtains average performance improvements of 17.6%, 28.5%, 173.0% and 293.3% (up to 213.3%, 153.6%, 405.1% and 943.3%) over the best existing work on dual-socket Intel CPUs, an NVIDIA GPU, an AMD GPU and an Intel Xeon Phi, respectively. For real-world applications such as a solver with only tens of iterations, the CSR5 format can be more practical because of its low overhead for format conversion.

226 citations
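
For orientation, the baseline CSR layout that CSR5 extends keeps a row-pointer array, column indices, and values; the tile-level transposition and per-tile metadata that CSR5 adds on top are omitted here. The snippet below is only a plain CSR SpMV to fix notation, not an implementation of CSR5.

```python
import numpy as np

# CSR: row_ptr[i] .. row_ptr[i+1] delimit the nonzeros of row i.
row_ptr = np.array([0, 2, 3, 6])
col_idx = np.array([0, 2, 1, 0, 1, 3])
vals    = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x = np.array([1.0, 2.0, 3.0, 4.0])

def csr_spmv(row_ptr, col_idx, vals, x):
    """y = A @ x for a matrix A stored in CSR form."""
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):
        lo, hi = row_ptr[i], row_ptr[i + 1]
        y[i] = np.dot(vals[lo:hi], x[col_idx[lo:hi]])
    return y

y = csr_spmv(row_ptr, col_idx, vals, x)

# Cross-check against the equivalent dense matrix.
dense = np.array([[1.0, 0.0, 2.0, 0.0],
                  [0.0, 3.0, 0.0, 0.0],
                  [4.0, 5.0, 0.0, 6.0]])
assert np.allclose(y, dense @ x)
print(y)   # [ 7.  6. 38.]
```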