scispace - formally typeset
Author

Aravind Sukumaran-Rajam

Bio: Aravind Sukumaran-Rajam is an academic researcher at Washington State University. The author has contributed to research on topics including matrix multiplication and sparse matrices, has an h-index of 12, and has co-authored 35 publications receiving 370 citations. Previous affiliations of Aravind Sukumaran-Rajam include Ohio State University and the University of Strasbourg.

Papers
Proceedings ArticleDOI
16 Feb 2019
TL;DR: This paper devises an adaptive tiling strategy and applies it to enhance the performance of two primitives: SpMM (product of a sparse matrix and a dense matrix) and SDDMM (sampled dense-dense matrix multiplication).
Abstract: Tiling is a key technique for data locality optimization and is widely used in high-performance implementations of dense matrix-matrix multiplication for multicore/manycore CPUs and GPUs. However, the irregular and matrix-dependent data access pattern of sparse matrix multiplication makes it challenging to use tiling to enhance data reuse. In this paper, we devise an adaptive tiling strategy and apply it to enhance the performance of two primitives: SpMM (product of a sparse matrix and a dense matrix) and SDDMM (sampled dense-dense matrix multiplication). In contrast to studies that have resorted to non-standard sparse-matrix representations to enhance performance, we use the standard Compressed Sparse Row (CSR) representation, within which intra-row reordering is performed to enable adaptive tiling. Experimental evaluation using an extensive set of matrices from the SuiteSparse collection demonstrates significant performance improvement over currently available state-of-the-art alternatives.

96 citations
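For reference, the SpMM primitive that this paper optimizes can be sketched in a few lines. This is not the paper's kernel (which adds intra-row reordering and adaptive tiling on top); it is only a minimal pure-Python sketch of CSR-based SpMM, with hypothetical function and argument names:

```python
def spmm_csr(indptr, indices, vals, B):
    """Multiply a CSR sparse matrix A (m x k) by a dense matrix B (k x n).

    CSR stores, for each row i, its nonzeros in vals[indptr[i]:indptr[i+1]]
    with column indices in the matching slice of indices.
    """
    m = len(indptr) - 1
    n = len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i in range(m):                       # one output row per sparse row
        for p in range(indptr[i], indptr[i + 1]):
            j, a = indices[p], vals[p]       # nonzero A[i][j]
            row_b = B[j]                     # dense row of B, reused across all n columns
            for c in range(n):
                C[i][c] += a * row_b[c]
    return C

# A = [[1, 0], [0, 2]] in CSR form
indptr, indices, vals = [0, 1, 2], [0, 1], [1.0, 2.0]
B = [[1.0, 2.0], [3.0, 4.0]]
print(spmm_csr(indptr, indices, vals, B))  # [[1.0, 2.0], [6.0, 8.0]]
```

The inner loop over `row_b` is the data-reuse opportunity the abstract refers to: each nonzero of A touches a whole dense row of B, so tiling which rows of B stay resident is what determines performance.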

Proceedings ArticleDOI
11 Jun 2018
TL;DR: An in-depth analysis is presented to contrast SpMV and SpMM, and a new sparse-matrix representation and computation approach suited to achieving high data-movement efficiency and effective GPU parallelization of SpMM is developed.
Abstract: Sparse Matrix-Vector (SpMV) and Sparse Matrix-Multivector (SpMM) products are key kernels for computational science and data science. While GPUs offer significantly higher peak performance and memory bandwidth than multicore CPUs, achieving high performance on sparse computations on GPUs is very challenging. A tremendous amount of recent research has focused on various GPU implementations of the SpMV kernel. But the multi-vector SpMM kernel has received much less attention. In this paper, we present an in-depth analysis to contrast SpMV and SpMM, and develop a new sparse-matrix representation and computation approach suited to achieving high data-movement efficiency and effective GPU parallelization of SpMM. Experimental evaluation using the entire SuiteSparse matrix suite demonstrates significant performance improvement over existing SpMM implementations from vendor libraries.

51 citations

Proceedings ArticleDOI
10 Feb 2018
TL;DR: A statement reordering framework is developed that models stencil computations as a DAG of trees with shared leaves, and adapts an optimal scheduling algorithm for minimizing register usage for expression trees.
Abstract: The recent advent of compute-intensive GPU architecture has allowed application developers to explore high-order 3D stencils for better computational accuracy. A common optimization strategy for such stencils is to expose sufficient data reuse by means such as loop unrolling, with the expectation of register-level reuse. However, the resulting code is often highly constrained by register pressure. While current state-of-the-art register allocators are satisfactory for most applications, they are unable to effectively manage register pressure for such complex high-order stencils, resulting in sub-optimal code with a large number of register spills. In this paper, we develop a statement reordering framework that models stencil computations as a DAG of trees with shared leaves, and adapts an optimal scheduling algorithm for minimizing register usage for expression trees. The effectiveness of the approach is demonstrated through experimental results on a range of stencils extracted from application codes.

45 citations

Journal ArticleDOI
30 Aug 2018
TL;DR: This paper describes how effective tiled code can be generated for GPUs from a domain-specific language (DSL) for stencils, demonstrating advantages over state-of-the-art general-purpose compiler optimizations.
Abstract: Stencil computations arise in a number of computational domains. They exhibit significant data parallelism and are thus well suited for execution on graphical processing units (GPUs), but can be memory-bandwidth limited unless temporal locality is utilized via tiling. This paper describes how effective tiled code can be generated for GPUs from a domain-specific language (DSL) for stencils. Experimental results demonstrate the benefits of such a domain-specific optimization approach over state-of-the-art general-purpose compiler optimizations.

43 citations

Proceedings ArticleDOI
19 Apr 2021
TL;DR: In this paper, an analytical modeling approach for finding the best loop-level optimization configuration for CNNs on multi-core CPUs is developed, which achieves comparable or better performance than state-of-the-art libraries and auto-tuning-based optimizers.
Abstract: Moving data through the memory hierarchy is a fundamental bottleneck that can limit the performance of core machine learning algorithms such as convolutional neural networks (CNNs). Loop-level optimizations, including loop tiling and loop permutation, are fundamental transformations for reducing data movement. However, the search space of loop-level optimization configurations is explosively large. This paper develops an analytical modeling approach for finding the best loop-level optimization configuration for CNNs on multi-core CPUs. Experimental evaluation shows that this approach achieves comparable or better performance than state-of-the-art libraries and auto-tuning-based optimizers for CNNs.

33 citations
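To make the abstract's "loop tiling and loop permutation" concrete, here is a minimal sketch of loop tiling on a plain matrix multiply (the core of many CNN layers after lowering). It is illustrative only, not the paper's analytical model; the tile size `T` stands in for the configuration the model would choose:

```python
def matmul_tiled(A, B, T=2):
    """Tiled i-j-k matrix multiply.

    Tiling splits each loop into a tile loop (ii, jj, kk) and an intra-tile
    loop, so a T x T block of C and the matching slices of A and B stay
    hot in cache between updates instead of streaming the full matrices.
    """
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for ii in range(0, m, T):
        for jj in range(0, n, T):
            for kk in range(0, k, T):
                for i in range(ii, min(ii + T, m)):
                    for j in range(jj, min(jj + T, n)):
                        s = C[i][j]
                        for p in range(kk, min(kk + T, k)):
                            s += A[i][p] * B[p][j]
                        C[i][j] = s
    return C

A, B = [[1, 2], [3, 4]], [[5, 6], [7, 8]]
print(matmul_tiled(A, B, T=1))  # [[19.0, 22.0], [43.0, 50.0]]
```

Every tile size for each of the three loops, times every permutation of the six loops, is one point in the configuration space; its explosive size is why the paper replaces exhaustive auto-tuning with an analytical model.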


Cited by
Proceedings ArticleDOI
11 Jun 2018
TL;DR: This paper introduces a new approach to building distributed-memory graph analytics systems that exploits heterogeneity in processor types (CPU and GPU), partitioning policies, and programming models, and Gluon, a communication-optimizing substrate that enables these programs to run on heterogeneous clusters and optimizes communication in a novel way.
Abstract: This paper introduces a new approach to building distributed-memory graph analytics systems that exploits heterogeneity in processor types (CPU and GPU), partitioning policies, and programming models. The key to this approach is Gluon, a communication-optimizing substrate. Programmers write applications in a shared-memory programming system of their choice and interface these applications with Gluon using a lightweight API. Gluon enables these programs to run on heterogeneous clusters and optimizes communication in a novel way by exploiting structural and temporal invariants of graph partitioning policies. To demonstrate Gluon’s ability to support different programming models, we interfaced Gluon with the Galois and Ligra shared-memory graph analytics systems to produce distributed-memory versions of these systems named D-Galois and D-Ligra, respectively. To demonstrate Gluon’s ability to support heterogeneous processors, we interfaced Gluon with IrGL, a state-of-the-art single-GPU system for graph analytics, to produce D-IrGL, the first multi-GPU distributed-memory graph analytics system. Our experiments were done on CPU clusters with up to 256 hosts and roughly 70,000 threads and on multi-GPU clusters with up to 64 GPUs. The communication optimizations in Gluon improve end-to-end application execution time by ∼2.6× on the average. D-Galois and D-IrGL scale well and are faster than Gemini, the state-of-the-art distributed CPU graph analytics system, by factors of ∼3.9× and ∼4.9×, respectively, on the average.

125 citations

Proceedings ArticleDOI
Nitish Srivastava, Hanchen Jin, Jie Liu, David H. Albonesi, Zhiru Zhang
01 Oct 2020
TL;DR: This work proposes MatRaptor, a novel high-performance and highly resource-efficient SpGEMM accelerator based on the row-wise product, which offers a better tradeoff between data reuse and on-chip memory requirements and achieves higher performance for large sparse matrices.
Abstract: Sparse-sparse matrix multiplication (SpGEMM) is a computation kernel widely used in numerous application domains such as data analytics, graph processing, and scientific computing. In this work we propose MatRaptor, a novel SpGEMM accelerator that is high performance and highly resource efficient. Unlike conventional methods using inner or outer product as the meta operation for matrix multiplication, our approach is based on row-wise product, which offers a better tradeoff in terms of data reuse and on-chip memory requirements, and achieves higher performance for large sparse matrices. We further propose a new hardware-friendly sparse storage format, which allows parallel compute engines to access the sparse data in a vectorized and streaming fashion, leading to high utilization of memory bandwidth. We prototype and simulate our accelerator architecture using gem5 on a diverse set of matrices. Our experiments show that MatRaptor achieves 129.2× speedup over single-threaded CPU, 8.8× speedup over GPU and 1.8× speedup over the state-of-the-art SpGEMM accelerator (OuterSPACE). MatRaptor also has 7.2× lower power consumption and 31.3× smaller area compared to OuterSPACE.

104 citations
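The row-wise product that MatRaptor is built around (often called Gustavson's algorithm) is easy to state in software terms. This is a hedged pure-Python sketch, not the accelerator's datapath; the dict-of-rows storage is a stand-in for its hardware-friendly sparse format:

```python
def spgemm_rowwise(A, B):
    """C = A @ B for sparse matrices stored as lists of {col: val} dicts.

    Row-wise product: row i of C is a linear combination of the rows of B
    selected by the nonzeros of row i of A. Unlike inner product (poor
    reuse) or outer product (large partial results), only one row of A,
    the touched rows of B, and one accumulator row are live at a time.
    """
    C = []
    for a_row in A:
        acc = {}                        # sparse accumulator for one row of C
        for j, a in a_row.items():
            for k, b in B[j].items():   # scale row j of B by A[i][j] and merge
                acc[k] = acc.get(k, 0.0) + a * b
        C.append(acc)
    return C

A = [{0: 1.0, 1: 2.0}, {1: 3.0}]
B = [{0: 1.0}, {0: 4.0, 1: 5.0}]
print(spgemm_rowwise(A, B))  # [{0: 9.0, 1: 10.0}, {0: 12.0, 1: 15.0}]
```

The small, bounded working set per output row is what the abstract means by a better tradeoff between data reuse and on-chip memory requirements.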

Journal ArticleDOI
TL;DR: The proposed performance-aware model is shown to achieve high accuracy and to reliably select the best storage formats for SpGEMM on the Sunway TaihuLight supercomputer.
Abstract: General sparse matrix-sparse matrix multiplication (SpGEMM) is one of the fundamental linear-algebra operations in a wide variety of scientific applications. To implement efficient SpGEMM for many large-scale applications, this paper proposes scalable and optimized SpGEMM kernels based on the COO, CSR, ELL, and CSC formats on the Sunway TaihuLight supercomputer. First, a multi-level parallelism design for SpGEMM is proposed to exploit the parallelism of over 10 million cores and to better manage memory on the special Sunway architecture. Optimization strategies, such as load balancing, coalesced DMA transmission, data reuse, vectorized computation, and parallel pipeline processing, are applied to further optimize the performance of the SpGEMM kernels. Second, we thoroughly analyze the performance of the proposed kernels. Third, a performance-aware model for SpGEMM is proposed to select the most appropriate compressed storage format for each sparse matrix, so as to achieve the optimal performance of SpGEMM on the Sunway. The experimental results show that the SpGEMM kernels have good scalability and meet the challenge of high-speed computing on large-scale data sets on the Sunway. In addition, the performance-aware model for SpGEMM achieves an average absolute relative error rate of 8.31 percent when the kernels are executed in a single process, and 8.59 percent on average when the kernels are executed in multiple processes. This demonstrates that the proposed performance-aware model is highly accurate and precise enough to select the best formats for SpGEMM on the Sunway TaihuLight supercomputer.

97 citations
