
Jiajia Li

Researcher at Pacific Northwest National Laboratory

Publications - 45
Citations - 1073

Jiajia Li is an academic researcher at Pacific Northwest National Laboratory. The author has contributed to research on topics including sparse matrices and speedup, has an h-index of 14, and has co-authored 40 publications receiving 716 citations. Previous affiliations of Jiajia Li include the Chinese Academy of Sciences and the Georgia Institute of Technology.

Papers
Proceedings ArticleDOI

SMAT: an input adaptive auto-tuner for sparse matrix-vector multiplication

TL;DR: This paper presents SMAT, a Sparse Matrix-vector multiplication Auto-Tuning system that bridges the gap between format-specific optimizations and general-purpose use by automatically determining the optimal storage format and implementation for any input sparse matrix at runtime.
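
For intuition only, here is a minimal sketch of the kind of input-adaptive format selection that SMAT automates, using SciPy's built-in sparse formats. The heuristic and its threshold are illustrative assumptions, not SMAT's actual learned model, which considers more formats and richer matrix features.

```python
import numpy as np
import scipy.sparse as sp

def pick_format_and_spmv(A_coo: sp.coo_matrix, x: np.ndarray) -> np.ndarray:
    """Toy input-adaptive SpMV: choose a storage format from a simple
    structural feature of the matrix, then compute y = A @ x.

    SMAT learns this decision from data; the rule below is a made-up stand-in.
    """
    n_rows, _ = A_coo.shape
    nnz_per_row = np.bincount(A_coo.row, minlength=n_rows)

    if nnz_per_row.std() < 2.0:   # nearly uniform rows -> diagonal-friendly format
        A = A_coo.todia()
    else:                         # irregular rows -> general-purpose CSR
        A = A_coo.tocsr()
    return A @ x

# Usage: a small random sparse matrix
A = sp.random(1000, 1000, density=0.01, format="coo", random_state=0)
x = np.ones(1000)
y = pick_format_and_spmv(A, x)
```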
Journal ArticleDOI

Evaluating Modern GPU Interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect

TL;DR: A thorough evaluation of five of the latest modern GPU interconnects across six high-end servers and HPC platforms shows that, for an application running on a multi-GPU node, choosing the right GPU combination can have a considerable impact on GPU communication efficiency, as well as on the application's overall performance.
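
As a rough illustration of the microbenchmarking behind such an evaluation, the sketch below times repeated GPU-to-GPU copies with CuPy and reports an effective bandwidth. It assumes a machine with at least two CUDA GPUs and CuPy installed; it is not the methodology or benchmark suite used in the paper.

```python
import cupy as cp

def d2d_bandwidth_gib_s(src_dev: int, dst_dev: int,
                        nbytes: int = 256 * 1024 * 1024, trials: int = 10) -> float:
    """Time repeated device-to-device copies between two GPUs, in GiB/s.

    Whether the transfer travels over PCIe, NVLink, or NVSwitch (or stages
    through the host when peer access is unavailable) depends on the platform,
    which is exactly the variability the paper measures.
    """
    with cp.cuda.Device(src_dev):
        src = cp.zeros(nbytes, dtype=cp.uint8)
    with cp.cuda.Device(dst_dev):
        dst = cp.zeros(nbytes, dtype=cp.uint8)
        start, stop = cp.cuda.Event(), cp.cuda.Event()
        start.record()
        for _ in range(trials):
            dst.data.copy_from_device(src.data, nbytes)
        stop.record()
        stop.synchronize()
        ms = cp.cuda.get_elapsed_time(start, stop)
    return nbytes * trials / (ms / 1000.0) / 2**30

# Usage (requires >= 2 GPUs): print(d2d_bandwidth_gib_s(0, 1))
```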
Proceedings ArticleDOI

HiCOO: hierarchical storage of sparse tensors

TL;DR: This paper evaluates HiCOO by implementing a single-node, multicore-parallel version of the matricized tensor-times-Khatri-Rao product (MTTKRP) operation, which is the most expensive computational core in the widely used CANDECOMP/PARAFAC decomposition (CPD) algorithm.
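
To make the MTTKRP operation concrete, here is a minimal dense-index sketch for a third-order sparse tensor stored in plain COO (not HiCOO's blocked, compressed layout); the names, shapes, and random data are illustrative assumptions.

```python
import numpy as np

def mttkrp_mode0_coo(inds: np.ndarray, vals: np.ndarray,
                     dims: tuple, B: np.ndarray, C: np.ndarray) -> np.ndarray:
    """Mode-0 MTTKRP for a 3rd-order sparse tensor in COO form.

    inds : (nnz, 3) integer coordinates (i, j, k)
    vals : (nnz,)   nonzero values
    B, C : factor matrices of shapes (J, R) and (K, R)
    Returns M of shape (I, R) with M[i] += val * (B[j] * C[k]) per nonzero.
    """
    I, R = dims[0], B.shape[1]
    M = np.zeros((I, R))
    rows = vals[:, None] * B[inds[:, 1]] * C[inds[:, 2]]  # one length-R row per nonzero
    np.add.at(M, inds[:, 0], rows)                        # scatter-add into output rows
    return M

# Usage: a tiny random sparse tensor with 5 nonzeros, rank R = 2
rng = np.random.default_rng(0)
dims, nnz, R = (4, 5, 6), 5, 2
inds = np.column_stack([rng.integers(0, d, nnz) for d in dims])
vals = rng.random(nnz)
B, C = rng.random((dims[1], R)), rng.random((dims[2], R))
M = mttkrp_mode0_coo(inds, vals, dims, B, C)
```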
Journal ArticleDOI

Evaluating Modern GPU Interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect

TL;DR: In this article, the authors conduct a thorough evaluation of five of the latest modern GPU interconnects: PCIe, NVLink-V1, NVLink-V2, NV-SLI, and NVSwitch.
Proceedings ArticleDOI

An input-adaptive and in-place approach to dense tensor-times-matrix multiply

TL;DR: A novel framework, called In-TensLi, for producing fast single-node implementations of dense tensor-times-matrix multiply (TTM) of arbitrary dimension; In-TensLi's in-place and input-adaptive TTM implementations achieve 4× and 13× speedups, showing GEMM-like performance on a variety of input sizes.
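
For reference, below is a minimal sketch of the conventional (not in-place) way to compute a mode-n tensor-times-matrix product by unfolding the tensor and calling a GEMM; the explicit transpose and reshape traffic it incurs is the kind of overhead In-TensLi's in-place, input-adaptive formulation is designed to avoid. Names and shapes here are illustrative.

```python
import numpy as np

def ttm(X: np.ndarray, U: np.ndarray, mode: int) -> np.ndarray:
    """Mode-`mode` tensor-times-matrix: contracts dimension `mode` of X
    with the columns of U, so the result has shape[mode] == U.shape[0].

    Conventional approach: unfold (matricize) X along `mode`, multiply by U,
    then fold the result back into a tensor.
    """
    # Move the contracted mode to the front and flatten the remaining modes.
    Xm = np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)  # (I_mode, prod of other dims)
    Ym = U @ Xm                                              # the GEMM call
    new_shape = (U.shape[0],) + tuple(np.delete(X.shape, mode))
    return np.moveaxis(Ym.reshape(new_shape), 0, mode)

# Usage: contract mode 1 of a 30x40x50 tensor with a 16x40 matrix
X = np.random.rand(30, 40, 50)
U = np.random.rand(16, 40)
Y = ttm(X, U, mode=1)   # shape (30, 16, 50)
```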