Home
/
Authors
/
Jiajia Li

Author

Jiajia Li

Other affiliations: Chinese Academy of Sciences, Georgia Institute of Technology

Bio: Jiajia Li is an academic researcher from Pacific Northwest National Laboratory. The author has contributed to research in topics: Sparse matrix & Speedup. The author has an hindex of 14, co-authored 40 publications receiving 716 citations. Previous affiliations of Jiajia Li include Chinese Academy of Sciences & Georgia Institute of Technology.

Topics: Sparse matrix, Speedup, Tensor, Tensor (intrinsic definition), Memory bandwidth ...read more

Papers

PDF

Open Access

More filters

Proceedings Article•DOI•

SMAT: an input adaptive auto-tuner for sparse matrix-vector multiplication

[...]

Jiajia Li¹, Guangming Tan¹, Mingyu Chen¹, Ninghui Sun¹•Institutions (1)

Chinese Academy of Sciences¹

16 Jun 2013

TL;DR: A Sparse Matrix-vector multiplication Auto-Tuning system (SMAT) to bridge the gap between specific optimizations and general-purpose usage and automatically determines the optimal format and implementation for any input sparse matrix at runtime.

...read moreread less

Abstract: Sparse Matrix Vector multiplication (SpMV) is an important kernel in both traditional high performance computing and emerging data-intensive applications. By far, SpMV libraries are optimized by either application-specific or architecture-specific approaches, making the libraries become too complicated to be used extensively in real applications. In this work we develop a Sparse Matrix-vector multiplication Auto-Tuning system (SMAT) to bridge the gap between specific optimizations and general-purpose usage. SMAT provides users with a unified programming interface in compressed sparse row (CSR) format and automatically determines the optimal format and implementation for any input sparse matrix at runtime. For this purpose, SMAT leverages a learning model, which is generated in an off-line stage by a machine learning method with a training set of more than 2000 matrices from the UF sparse matrix collection, to quickly predict the best combination of the matrix feature parameters. Our experiments show that SMAT achieves impressive performance of up to 51GFLOPS in single-precision and 37GFLOPS in double-precision on mainstream x86 multi-core processors, which are both more than 3 times faster than the Intel MKL library. We also demonstrate its adaptability in an algebraic multigrid solver from Hypre library with above 20% performance improvement reported.

...read moreread less

125 citations

Journal Article•DOI•

Evaluating Modern GPU Interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect

[...]

Ang Li¹, Shuaiwen Leon Song¹, Jieyang Chen², Jiajia Li¹, Xu Liu³, Nathan R. Tallent¹, Kevin J. Barker¹ - Show less +3 more•Institutions (3)

Pacific Northwest National Laboratory¹, Oak Ridge National Laboratory², College of William & Mary³

01 Jan 2020-IEEE Transactions on Parallel and Distributed Systems

TL;DR: A thorough evaluation on five latest types of modern GPU interconnects from six high-end servers and HPC platforms shows that, for an application running in a multi-GPU node, choosing the right GPU combination can impose considerable impact on GPU communication efficiency, as well as the application's overall performance.

...read moreread less

Abstract: High performance multi-GPU computing becomes an inevitable trend due to the ever-increasing demand on computation capability in emerging domains such as deep learning, big data and planet-scale simulations. However, the lack of deep understanding on how modern GPUs can be connected and the real impact of state-of-the-art interconnect technology on multi-GPU application performance become a hurdle. In this paper, we fill the gap by conducting a thorough evaluation on five latest types of modern GPU interconnects: PCIe, NVLink-V1, NVLink-V2, NVLink-SLI and NVSwitch, from six high-end servers and HPC platforms: NVIDIA P100-DGX-1, V100-DGX-1, DGX-2, OLCF's SummitDev and Summit supercomputers, as well as an SLI-linked system with two NVIDIA Turing RTX-2080 GPUs. Based on the empirical evaluation, we have observed four new types of GPU communication network NUMA effects: three are triggered by NVLink's topology, connectivity and routing, while one is caused by PCIe chipset design issue. These observations indicate that, for an application running in a multi-GPU node, choosing the right GPU combination can impose considerable impact on GPU communication efficiency, as well as the application's overall performance. Our evaluation can be leveraged in building practical multi-GPU performance models, which are vital for GPU task allocation, scheduling and migration in a shared environment (e.g., AI cloud and HPC centers), as well as communication-oriented performance tuning.

...read moreread less

118 citations

Proceedings Article•DOI•

HiCOO: hierarchical storage of sparse tensors

[...]

Jiajia Li¹, Jimeng Sun¹, Richard Vuduc¹•Institutions (1)

Georgia Institute of Technology¹

11 Nov 2018

TL;DR: This paper evaluates HiCOO by implementing a single-node, multicore-parallel version of the matricized tensor-times-Khatri-Rao product (MTTKRP) operation, which is the most expensive computational core in the widely used CANDECOMP/PARAFAC decomposition (CPD) algorithm.

...read moreread less

Abstract: This paper proposes a new storage format for sparse tensors, called Hierarchical COOrdinate (HiCOO; pronounced: "haiku"). It derives from coordinate (COO) format, arguably the de facto standard for general sparse tensor storage. HiCOO improves upon COO by compressing the indices in units of sparse tensor blocks, with the goals of preserving the "mode-agnostic" simplicity of COO while reducing the bytes needed to represent the tensor and promoting data locality. We evaluate HiCOO by implementing a single-node, multicore-parallel version of the matricized tensor-times-Khatri-Rao product (MTTKRP) operation, which is the most expensive computational core in the widely used CANDECOMP/PARAFAC decomposition (CPD) algorithm. This MTTKRP implementation achieves up to 23.0X (6.8X on average) speedup over COO format and up to 15.6X (3.1X on average) speedup over another state-of-the-art format, compressed sparse fiber (CSF), by using less or comparable storage of them. When used within CPD, we also observe speedups against COO- and CSF-based implementations.

...read moreread less

83 citations

Journal Article•DOI•

Evaluating Modern GPU Interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect

[...]

Ang Li¹, Shuaiwen Leon Song¹, Jieyang Chen², Jiajia Li¹, Xu Liu³, Nathan R. Tallent¹, Kevin J. Barker¹ - Show less +3 more•Institutions (3)

Pacific Northwest National Laboratory¹, Oak Ridge National Laboratory², College of William & Mary³

11 Mar 2019-arXiv: Hardware Architecture

TL;DR: In this article, the authors conduct a thorough evaluation on five latest types of modern GPU interconnects: PCIe, NVLink-V1, NVlink-V2, NVSwitch, and NVLinkSLI.

...read moreread less

80 citations

Proceedings Article•DOI•

An input-adaptive and in-place approach to dense tensor-times-matrix multiply

[...]

Jiajia Li¹, Casey Battaglino¹, Ioakeim Perros¹, Jimeng Sun¹, Richard Vuduc¹ - Show less +1 more•Institutions (1)

Georgia Institute of Technology¹

15 Nov 2015

TL;DR: A novel framework for producing fast single-node implementations of dense tensor-times-matrix multiply of arbitrary dimension, called In-TensLi's in-place and input-adaptive T implementations achieve 4× and 13× speedups, showing Gemm-like performance on a variety of input sizes.

...read moreread less

Abstract: This paper describes a novel framework, called I n T ens L i ("intensely"), for producing fast single-node implementations of dense tensor-times-matrix multiply (T tm ) of arbitrary dimension. Whereas conventional implementations of T tm rely on explicitly converting the input tensor operand into a matrix---in order to be able to use any available and fast general matrix-matrix multiply (G emm ) implementation---our framework's strategy is to carry out the T tm in-place, avoiding this copy. As the resulting implementations expose tuning parameters, this paper also describes a heuristic empirical model for selecting an optimal configuration based on the T tm 's inputs. When compared to widely used single-node T tm implementations that are available in the Tensor Toolbox and Cyclops Tensor Framework (C tf ), In-TensLi's in-place and input-adaptive T tm implementations achieve 4× and 13× speedups, showing Gemm-like performance on a variety of input sizes.

...read moreread less

72 citations

1
2
3
4
…
5
6
7
8
9

Collapse

Cited by

PDF

Open Access

More filters

Data Mining - Concepts and Techniques.

[...]

Petra Perner

01 Jan 2002

9,314 citations

Journal Article•DOI•

A Survey of CPU-GPU Heterogeneous Computing Techniques

[...]

Sparsh Mittal¹, Jeffrey S. Vetter¹•Institutions (1)

Oak Ridge National Laboratory¹

21 Jul 2015-ACM Computing Surveys

TL;DR: This article surveys Heterogeneous Computing Techniques (HCTs) such as workload partitioning that enable utilizing both CPUs and GPUs to improve performance and/or energy efficiency and reviews both discrete and fused CPU-GPU systems.

...read moreread less

Abstract: As both CPUs and GPUs become employed in a wide range of applications, it has been acknowledged that both of these Processing Units (PUs) have their unique features and strengths and hence, CPU-GPU collaboration is inevitable to achieve high-performance computing. This has motivated a significant amount of research on heterogeneous computing techniques, along with the design of CPU-GPU fused chips and petascale heterogeneous supercomputers. In this article, we survey Heterogeneous Computing Techniques (HCTs) such as workload partitioning that enable utilizing both CPUs and GPUs to improve performance and/or energy efficiency. We review heterogeneous computing approaches at runtime, algorithm, programming, compiler, and application levels. Further, we review both discrete and fused CPU-GPU systems and discuss benchmark suites designed for evaluating Heterogeneous Computing Systems (HCSs). We believe that this article will provide insights into the workings and scope of applications of HCTs to researchers and motivate them to further harness the computational powers of CPUs and GPUs to achieve the goal of exascale performance.

...read moreread less

414 citations

Journal Article•DOI•

Low-Rank Tensor Networks for Dimensionality Reduction and Large-Scale Optimization Problems: Perspectives and Challenges PART 1.

[...]

Andrzej Cichocki, Namgil Lee, Ivan V. Oseledets, Anh Huy Phan, Qibin Zhao, Danilo P. Mandic - Show less +2 more

04 Sep 2016-arXiv: Numerical Analysis

TL;DR: In this paper, the authors provide mathematical and graphical representations and interpretation of tensor networks, with the main focus on the Tucker and Tensor Train (TT) decompositions and their extensions or generalizations.

...read moreread less

Abstract: Machine learning and data mining algorithms are becoming increasingly important in analyzing large volume, multi-relational and multi--modal datasets, which are often conveniently represented as multiway arrays or tensors. It is therefore timely and valuable for the multidisciplinary research community to review tensor decompositions and tensor networks as emerging tools for large-scale data analysis and data mining. We provide the mathematical and graphical representations and interpretation of tensor networks, with the main focus on the Tucker and Tensor Train (TT) decompositions and their extensions or generalizations. Keywords: Tensor networks, Function-related tensors, CP decomposition, Tucker models, tensor train (TT) decompositions, matrix product states (MPS), matrix product operators (MPO), basic tensor operations, multiway component analysis, multilinear blind source separation, tensor completion, linear/multilinear dimensionality reduction, large-scale optimization problems, symmetric eigenvalue decomposition (EVD), PCA/SVD, huge systems of linear equations, pseudo-inverse of very large matrices, Lasso and Canonical Correlation Analysis (CCA) (This is Part 1)

...read moreread less

381 citations

Journal Article•DOI•

The tensor algebra compiler

[...]

Fredrik Kjolstad¹, Shoaib Kamil², Stephen Chou¹, David Lugato, Saman Amarasinghe¹ - Show less +1 more•Institutions (2)

Massachusetts Institute of Technology¹, Adobe Systems²

12 Oct 2017

TL;DR: TACO as mentioned in this paper is a C++ library that automatically generates compound tensor algebra operations on dense and sparse tensors, which can be used in machine learning, data analytics, engineering and the physical sciences.

...read moreread less

Abstract: Tensor algebra is a powerful tool with applications in machine learning, data analytics, engineering and the physical sciences. Tensors are often sparse and compound operations must frequently be computed in a single kernel for performance and to save memory. Programmers are left to write kernels for every operation of interest, with different mixes of dense and sparse tensors in different formats. The combinations are infinite, which makes it impossible to manually implement and optimize them all. This paper introduces the first compiler technique to automatically generate kernels for any compound tensor algebra operation on dense and sparse tensors. The technique is implemented in a C++ library called taco. Its performance is competitive with best-in-class hand-optimized kernels in popular libraries, while supporting far more tensor operations.

...read moreread less

240 citations

Posted Content•

Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking.

[...]

Zhe Jia, Marco Maggioni, Benjamin Staiger, Daniele Paolo Scarpazza

18 Apr 2018-arXiv: Distributed, Parallel, and Cluster Computing

TL;DR: This technical report presents the microarchitectural details of the NVIDIA Volta architecture, discovered through microbenchmarks and instruction set disassembly, and compares quantitatively the findings against its predecessors, Kepler, Maxwell and Pascal.

...read moreread less

Abstract: Every year, novel NVIDIA GPU designs are introduced. This rapid architectural and technological progression, coupled with a reluctance by manufacturers to disclose low-level details, makes it difficult for even the most proficient GPU software designers to remain up-to-date with the technological advances at a microarchitectural level. To address this dearth of public, microarchitectural-level information on the novel NVIDIA GPUs, independent researchers have resorted to microbenchmarks-based dissection and discovery. This has led to a prolific line of publications that shed light on instruction encoding, and memory hierarchy's geometry and features at each level. Namely, research that describes the performance and behavior of the Kepler, Maxwell and Pascal architectures. In this technical report, we continue this line of research by presenting the microarchitectural details of the NVIDIA Volta architecture, discovered through microbenchmarks and instruction set disassembly. Additionally, we compare quantitatively our Volta findings against its predecessors, Kepler, Maxwell and Pascal.

...read moreread less

233 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140

Collapse