Conference

Irregular Applications: Architectures and Algorithms 

About: Irregular Applications: Architectures and Algorithms is an academic conference. It publishes mainly in the areas of speedup and sparse matrices. Over its lifetime, the conference has published 93 papers, which have received 1007 citations.

Papers
Proceedings ArticleDOI
15 Nov 2015
TL;DR: Introduces the compressed sparse fiber (CSF), a data structure for sparse tensors, along with a novel parallel algorithm for tensor-matrix multiplication; CSF offers operation reductions similar to existing compressed methods while using only a single tensor structure.
Abstract: The Canonical Polyadic Decomposition (CPD) of tensors is a powerful tool for analyzing multi-way data and is used extensively to analyze very large and extremely sparse datasets. The bottleneck of computing the CPD is multiplying a sparse tensor by several dense matrices. Algorithms for tensor-matrix products fall into two classes. The first class saves floating point operations by storing a compressed tensor for each dimension of the data. These methods are fast but suffer high memory costs. The second class uses a single uncompressed tensor at the cost of additional floating point operations. In this work, we bridge the gap between the two approaches and introduce the compressed sparse fiber (CSF), a data structure for sparse tensors, along with a novel parallel algorithm for tensor-matrix multiplication. CSF offers similar operation reductions as existing compressed methods while using only a single tensor structure. We validate our contributions with experiments comparing against state-of-the-art methods on a diverse set of datasets. Our work uses 58% less memory than the state-of-the-art while achieving 81% of the parallel performance on 16 threads.

125 citations
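
The CSF layout and the traversal it enables can be illustrated with a short sketch. The code below is a minimal reading of the structure described in the abstract, not the paper's implementation; array names such as i_ptr and j_ptr are my own, and the kernel is the serial form of the MTTKRP bottleneck, whose outer loop over slices is what the paper parallelizes across threads.

```python
# A minimal sketch of a CSF-like layout for a 3-mode sparse tensor, plus the
# serial form of the tensor-times-matrix kernel it accelerates. Array names
# (i_ptr, j_ptr, ...) are illustrative assumptions, not the paper's layout.
import numpy as np

def build_csf(coords, vals):
    """coords: (nnz, 3) int array of (i, j, k) indices; vals: (nnz,) values.
    Groups nonzeros into a 3-level tree: slices (i) -> fibers (i, j) -> leaves."""
    order = np.lexsort((coords[:, 2], coords[:, 1], coords[:, 0]))
    c, v = coords[order], np.asarray(vals, dtype=float)[order]
    # A new fiber starts wherever (i, j) changes; a new slice wherever i changes.
    fib = np.concatenate(([0], np.flatnonzero(
        (c[1:, 0] != c[:-1, 0]) | (c[1:, 1] != c[:-1, 1])) + 1))
    sl = np.concatenate(([0], np.flatnonzero(
        c[fib][1:, 0] != c[fib][:-1, 0]) + 1))
    return {"i_ids": c[fib][sl, 0], "i_ptr": np.append(sl, len(fib)),
            "j_ids": c[fib, 1],     "j_ptr": np.append(fib, len(v)),
            "k_ids": c[:, 2],       "vals": v}

def mttkrp(csf, B, C, nrows):
    """Y[i, :] = sum_{j,k} X[i,j,k] * B[j, :] * C[k, :], the CPD bottleneck.
    One pass over the single CSF structure; each fiber's leaves are reduced
    once and then scaled by B[j, :], which is where the operation savings
    over an uncompressed (coordinate-format) traversal come from."""
    Y = np.zeros((nrows, B.shape[1]))
    for s in range(len(csf["i_ids"])):            # parallelizable over slices
        i = csf["i_ids"][s]
        for f in range(csf["i_ptr"][s], csf["i_ptr"][s + 1]):
            lo, hi = csf["j_ptr"][f], csf["j_ptr"][f + 1]
            acc = csf["vals"][lo:hi] @ C[csf["k_ids"][lo:hi]]
            Y[i] += acc * B[csf["j_ids"][f]]
    return Y
```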

Proceedings ArticleDOI
16 Nov 2014
TL;DR: Presents the first scalable GPU implementation of triangle counting, built on a new list intersection algorithm called Intersect Path (named after the Merge Path algorithm) that exposes two levels of parallelism.
Abstract: Triangle counting in a graph is a building block for clustering coefficients, a widely used social network analytic for finding key players in a network based on their local connectivity. In this paper we show the first scalable GPU implementation of triangle counting. Our approach uses a new list intersection algorithm called Intersect Path (named after the Merge Path algorithm). This algorithm has two levels of parallelism. The first level partitions the vertices across the streaming multiprocessors on the GPU. The second level parallelizes the work across the GPU's streaming processors and utilizes different block sizes. For testing purposes, we used graphs taken from the DIMACS 10 Graph Challenge. Our experiments were conducted on NVIDIA's K40 GPU. Our GPU triangle counting implementation achieves speedups in the range of 9X -- 32X over a sequential CPU implementation.

73 citations
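
Conceptually, the counting reduces to intersecting sorted adjacency lists, which the GPU version performs in parallel with Intersect Path. The sequential sketch below shows only that reduction; it is my illustration of the underlying technique, not the paper's CUDA implementation.

```python
# A sequential sketch of intersection-based triangle counting. The paper's
# GPU version performs each intersection in parallel with Intersect Path,
# but the counting logic is the same.

def count_triangles(adj):
    """adj: dict mapping each vertex to a sorted list of neighbors.
    Counts each triangle once by restricting to ordered vertex triples."""
    total = 0
    for u, nbrs in adj.items():
        for v in nbrs:
            if v <= u:
                continue
            # |N(u) ∩ N(v)| restricted to w > v counts triangle (u, v, w) once.
            a, b = nbrs, adj[v]
            i = j = 0
            while i < len(a) and j < len(b):
                if a[i] == b[j]:
                    if a[i] > v:
                        total += 1
                    i += 1; j += 1
                elif a[i] < b[j]:
                    i += 1
                else:
                    j += 1
    return total

# Example: a 4-clique contains exactly 4 triangles.
K4 = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2]}
assert count_triangles(K4) == 4
```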

Proceedings ArticleDOI
13 Nov 2016
TL;DR: Introduces a new, highly scalable PGAS memory-centric system architecture in which migrating threads travel to the data they access; a comparison of key parameters with a variety of today's systems of differing architectures indicates its potential advantages.
Abstract: There is growing evidence that current architectures do not handle cache-unfriendly applications such as sparse math operations, data analytics, and graph algorithms well. This is due, in part, to the irregular memory access patterns exhibited by these applications, and to how remote memory accesses are handled. This paper introduces a new, highly scalable PGAS memory-centric system architecture where migrating threads travel to the data they access. Scaling both memory capacities and the number of cores can be largely invisible to the programmer. The first implementation of this architecture, built with FPGAs, is discussed in detail. A comparison of key parameters with a variety of today's systems of differing architectures indicates the potential advantages. Early projections of performance against several well-documented kernels translate these advantages into comparative numbers. Future implementations of this architecture may expand the performance advantages through the application of current state-of-the-art silicon technology.

53 citations
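
The core idea, moving the thread to the data rather than the data to the thread, can be shown with a toy model. The sketch below is purely conceptual and is my illustration, not the paper's FPGA system: it assumes a cyclic placement of addresses across nodes and counts the inter-node transitions a pointer chase incurs, each of which this architecture turns into a one-way migration of a small thread context instead of a round-trip remote read.

```python
# A toy model of the migrating-thread idea: follow an irregular pointer chain
# across nodes, counting inter-node transitions. Conventionally each one is a
# round-trip remote read; here each would be a one-way thread migration.

NODES = 4

def owner(i):
    return i % NODES           # assumed cyclic placement of element i

def chase(pointers, start):
    """Follow a pointer chain and count hops between owning nodes."""
    node, hops, i = owner(start), 0, start
    while i is not None:
        if owner(i) != node:   # migrate the thread to the data's owner
            node, hops = owner(i), hops + 1
        i = pointers[i]        # every access is now node-local
    return hops

chain = {0: 5, 5: 2, 2: 7, 7: None}    # an irregular, cache-unfriendly chain
print(chase(chain, 0), "migrations")    # -> 3 migrations
```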

Proceedings ArticleDOI
11 Oct 2020
TL;DR: Presents DistDGL, a system for training GNNs in a mini-batch fashion on a cluster of machines, built on the Deep Graph Library (DGL), a popular GNN development framework.
Abstract: Graph neural networks (GNNs) have shown great success in learning from graph-structured data. They are widely used in various applications, such as recommendation, fraud detection, and search. In these domains, the graphs are typically large, containing hundreds of millions of nodes and several billions of edges. To tackle this challenge, we develop DistDGL, a system for training GNNs in a mini-batch fashion on a cluster of machines. DistDGL is based on the Deep Graph Library (DGL), a popular GNN development framework. DistDGL distributes the graph and its associated data (initial features and embeddings) across the machines and uses this distribution to derive a computational decomposition by following an owner-compute rule. DistDGL follows a synchronous training approach and allows ego-networks forming the mini-batches to include non-local nodes. To minimize the overheads associated with distributed computations, DistDGL uses a high-quality, lightweight min-cut graph partitioning algorithm along with multiple balancing constraints. This allows it to reduce communication overheads and statically balance the computations. It further reduces communication by replicating halo nodes and by using sparse embedding updates. The combination of these design choices allows DistDGL to train high-quality models while achieving high parallel efficiency and memory scalability. We demonstrate our optimizations on both inductive and transductive GNN models. Our results show that DistDGL achieves linear speedup without compromising model accuracy and requires only 13 seconds to complete a training epoch for a graph with 100 million nodes and 3 billion edges on a cluster with 16 machines.

48 citations
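
The owner-compute decomposition with halo replication can be sketched concisely. The function below is my illustration of that idea, not DistDGL's actual API: each node's computation is assigned to its owning partition, and read-only halo copies of neighbors across the cut are replicated so mini-batch ego-networks can be assembled locally.

```python
# A sketch of owner-compute partitioning with halo replication, as described
# in the abstract above. Illustrative only; DistDGL's real partitioner is a
# min-cut algorithm with multiple balancing constraints.
from collections import defaultdict

def partition_with_halo(edges, part_of):
    """edges: iterable of (u, v); part_of: dict node -> owning partition.
    Returns, per partition, its owned nodes and the halo nodes it replicates."""
    owned = defaultdict(set)
    halo = defaultdict(set)
    for u, v in edges:
        pu, pv = part_of[u], part_of[v]
        owned[pu].add(u); owned[pv].add(v)
        if pu != pv:                  # edge crosses the cut:
            halo[pu].add(v)           # each side keeps a read-only copy
            halo[pv].add(u)           # of the other endpoint's features
    return owned, halo

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
part_of = {0: 0, 1: 0, 2: 1, 3: 1}    # e.g. produced by a min-cut partitioner
owned, halo = partition_with_halo(edges, part_of)
print(owned[0], halo[0])              # {0, 1} {2, 3}
```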

Proceedings ArticleDOI
15 Nov 2015
TL;DR: Presents algorithmic improvements to the multithreaded graph partitioner mt-Metis that decrease its runtime by 1.5-11.7X and improve strong scaling by 82%.
Abstract: Graph partitioning is an important preprocessing step in applications dealing with sparse-irregular data. As such, the ability to efficiently partition a graph in parallel is crucial to the performance of these applications. The number of compute cores in a compute node continues to increase, demanding ever more scalability from shared-memory graph partitioners. In this paper we present algorithmic improvements to the multithreaded graph partitioner mt-Metis. We experimentally evaluate our methods on a 36 core machine, using 20 different graphs from a variety of domains. Our improvements decrease the runtime by 1.5-11.7X and improve strong scaling by 82%.

39 citations
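
mt-Metis follows the multilevel paradigm: coarsen the graph by collapsing a matching, partition the small graph, then project the partition back while refining. The serial sketch below shows that skeleton under simplifying assumptions (connected input, no balance constraint during refinement); it is not mt-Metis's algorithm, whose contribution is parallelizing and improving each of these phases.

```python
# A compact serial sketch of the multilevel partitioning pipeline that
# mt-Metis parallelizes (coarsen -> initial partition -> project + refine).
# Graphs are dicts: vertex -> {neighbor: edge_weight}. Illustrative only.

def coarsen(adj):
    """Collapse a greedy heavy-edge matching into a smaller graph."""
    mate, cmap, n = {}, {}, 0
    for u in adj:                                   # heavy-edge matching
        if u in mate:
            continue
        cands = [v for v in adj[u] if v not in mate and v != u]
        mate[u] = max(cands, key=lambda v: adj[u][v]) if cands else u
        mate[mate[u]] = u
    for u in adj:                                   # number coarse vertices
        if u not in cmap:
            cmap[u] = cmap[mate[u]] = n
            n += 1
    coarse = {c: {} for c in range(n)}              # sum weights of merged edges
    for u in adj:
        for v, w in adj[u].items():
            cu, cv = cmap[u], cmap[v]
            if cu != cv:
                coarse[cu][cv] = coarse[cu].get(cv, 0) + w
    return coarse, cmap

def bisect(adj, min_size=4):
    """Multilevel bisection: coarsen until tiny, split, project back, refine."""
    coarse, cmap = coarsen(adj)
    if len(adj) <= min_size or len(coarse) == len(adj):
        half = set(list(adj)[: len(adj) // 2])      # toy initial partition
        return {u: int(u in half) for u in adj}
    cpart = bisect(coarse, min_size)
    part = {u: cpart[cmap[u]] for u in adj}         # project to the finer graph
    for u in adj:                                   # one greedy refinement sweep
        gain = sum(w if part[v] != part[u] else -w
                   for v, w in adj[u].items())
        if gain > 0:                                # (a real refiner also keeps
            part[u] = 1 - part[u]                   #  the partition balanced)
    return part
```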

Performance Metrics
No. of papers from the Conference in previous years
Year    Papers
2020    8
2019    11
2018    9
2017    11
2016    14
2015    13