Conference

Irregular Applications: Architectures and Algorithms

About: Irregular Applications: Architectures and Algorithms is an academic conference. The conference publishes majorly in the area(s): Speedup & Sparse matrix. Over the lifetime, 93 publications have been published by the conference receiving 1007 citations.

...read moreread less

Topics: Speedup, Sparse matrix, Memory bandwidth, Parallel algorithm, Graph (abstract data type) ...read more

Papers published on a yearly basis

Papers

PDF

Open Access

More filters

Proceedings Article•DOI•

Tensor-matrix products with a compressed sparse tensor

[...]

Shaden Smith¹, George Karypis¹•Institutions (1)

University of Minnesota¹

15 Nov 2015

TL;DR: The compressed sparse fiber (CSF) a data structure for sparse tensors along with a novel parallel algorithm for tensor-matrix multiplication is introduced and offers similar operation reductions as existing compressed methods while using only a single tensor structure.

...read moreread less

Abstract: The Canonical Polyadic Decomposition (CPD) of tensors is a powerful tool for analyzing multi-way data and is used extensively to analyze very large and extremely sparse datasets. The bottleneck of computing the CPD is multiplying a sparse tensor by several dense matrices. Algorithms for tensor-matrix products fall into two classes. The first class saves floating point operations by storing a compressed tensor for each dimension of the data. These methods are fast but suffer high memory costs. The second class uses a single uncompressed tensor at the cost of additional floating point operations. In this work, we bridge the gap between the two approaches and introduce the compressed sparse fiber (CSF) a data structure for sparse tensors along with a novel parallel algorithm for tensor-matrix multiplication. CSF offers similar operation reductions as existing compressed methods while using only a single tensor structure. We validate our contributions with experiments comparing against state-of-the-art methods on a diverse set of datasets. Our work uses 58% less memory than the state-of-the-art while achieving 81% of the parallel performance on 16 threads.

...read moreread less

125 citations

Proceedings Article•DOI•

Fast triangle counting on the GPU

[...]

Oded Green, Pavan Yalamanchili, Lluís-Miquel Munguía¹•Institutions (1)

Georgia Institute of Technology¹

16 Nov 2014

TL;DR: This paper shows the first scalable GPU implementation for triangle counting using a new list intersection algorithm called Intersect Path (named after the Merge Path algorithm), which has two levels of parallelism.

...read moreread less

Abstract: Triangle counting in a graph is a building block for clustering coefficients which is a widely used social network analytic for finding key players in a network based on their local connectivity. In this paper we show the first scalable GPU implementation for triangle counting. Our approach uses a new list intersection algorithm called Intersect Path (named after the Merge Path algorithm). This algorithm has two levels of parallelism. The first level partitions the vertices to the streaming multiprocessors on the GPU. The second level is responsible for parallelizing the work across the GPU's streaming processors and utilizing different block sizes. For testing purposes, we used graphs taken from the DIMACS 10 Graph Challenge. Our experiments were conducted on NVIDIA's K40 GPU. Our GPU triangle counting implementation achieves speedups in the range of 9X -- 32X over a CPU sequential implementation.

...read moreread less

73 citations

Proceedings Article•DOI•

Highly scalable near memory processing with migrating threads on the emu system architecture

[...]

Timothy J. Dysart, Peter M. Kogge, Martin Deneroff, Eric Bovell, Preston Briggs, Jay B. Brockman, Kenneth Jacobsen, Yujen Juan, Shannon K. Kuntz, Richard Lethin, Janice O. McMahon, Chandra Pawar, Martin Perrigo, Sarah Rucker, John Ruttenberg, Max Ruttenberg, Steve Stein - Show less +13 more

13 Nov 2016

TL;DR: A new, highly-scalable PGAS memory-centric system architecture where migrating threads travel to the data they access, and a comparison of key parameters with a variety of today's systems, of differing architectures, indicates the potential advantages.

...read moreread less

Abstract: There is growing evidence that current architectures do not well handle cache-unfriendly applications such as sparse math operations, data analytics, and graph algorithms. This is due, in part, to the irregular memory access patterns demonstrated by these applications, and in how remote memory accesses are handled. This paper introduces a new, highly-scalable PGAS memory-centric system architecture where migrating threads travel to the data they access. Scaling both memory capacities and the number of cores can be largely invisible to the programmer.The first implementation of this architecture, implemented with FPGAs, is discussed in detail. A comparison of key parameters with a variety of today's systems, of differing architectures, indicates the potential advantages. Early projections of performance against several well-documented kernels translate these advantages into comparative numbers. Future implementations of this architecture may expand the performance advantages by the application of current state of the art silicon technology.

...read moreread less

53 citations

Proceedings Article•DOI•

DistDGL: Distributed Graph Neural Network Training for Billion-Scale Graphs

[...]

Da Zheng, Chao Ma, Minjie Wang, Jinjing Zhou, Qidong Su, Xiang Song, Quan Gan, Zheng Zhang, George Karypis - Show less +5 more

11 Oct 2020

TL;DR: DistDGL as mentioned in this paper is a system for training GNNs in a mini-batch fashion on a cluster of machines based on the Deep Graph Library (DGL), a popular GNN development framework.

...read moreread less

Abstract: Graph neural networks (GNN) have shown great success in learning from graph-structured data. They are widely used in various applications, such as recommendation, fraud detection, and search. In these domains, the graphs are typically large, containing hundreds of millions of nodes and several billions of edges. To tackle this challenge, we develop DistDGL, a system for training GNNs in a mini-batch fashion on a cluster of machines. DistDGL is based on the Deep Graph Library (DGL), a popular GNN development framework. DistDGL distributes the graph and its associated data (initial features and embeddings) across the machines and uses this distribution to derive a computational decomposition by following an owner-compute rule. DistDGL follows a synchronous training approach and allows ego-networks forming the mini-batches to include non-local nodes. To minimize the overheads associated with distributed computations, DistDGL uses a high-quality and light-weight min-cut graph partitioning algorithm along with multiple balancing constraints. This allows it to reduce communication overheads and statically balance the computations. It further reduces the communication by replicating halo nodes and by using sparse embedding updates. The combination of these design choices allows DistDGL to train high-quality models while achieving high parallel efficiency and memory scalability. We demonstrate our optimizations on both inductive and transductive GNN models. Our results show that DistDGL achieves linear speedup without compromising model accuracy and requires only 13 seconds to complete a training epoch for a graph with 100 million nodes and 3 billion edges on a cluster with 16 machines.

...read moreread less

48 citations

Proceedings Article•DOI•

Improving graph partitioning for modern graphs and architectures

[...]

Dominique LaSalle¹, Md. Mostofa Ali Patwary², Nadathur Satish², Narayanan Sundaram², Pradeep Dubey², George Karypis¹ - Show less +2 more•Institutions (2)

University of Minnesota¹, Intel²

15 Nov 2015

TL;DR: Algorithmic improvements to the multithreaded graph partitioner mt-Metis are presented, which decrease the runtime by 1.5-11.7X and improve strong scaling by 82%.

...read moreread less

Abstract: Graph partitioning is an important preprocessing step in applications dealing with sparse-irregular data. As such, the ability to efficiently partition a graph in parallel is crucial to the performance of these applications. The number of compute cores in a compute node continues to increase, demanding ever more scalability from shared-memory graph partitioners. In this paper we present algorithmic improvements to the multithreaded graph partitioner mt-Metis. We experimentally evaluate our methods on a 36 core machine, using 20 different graphs from a variety of domains. Our improvements decrease the runtime by 1.5-11.7X and improve strong scaling by 82%.

...read moreread less

39 citations

Collapse

Performance

Metrics

Papers

1,007

Citations

No. of papers from the Conference in previous years
Year	Papers
2020	8
2019	11
2018	9
2017	11
2016	14
2015	13