
Showing papers presented at "Irregular Applications: Architectures and Algorithms in 2015"


Proceedings ArticleDOI
15 Nov 2015
TL;DR: The compressed sparse fiber (CSF), a data structure for sparse tensors, is introduced along with a novel parallel algorithm for tensor-matrix multiplication; CSF offers similar operation reductions as existing compressed methods while using only a single tensor structure.
Abstract: The Canonical Polyadic Decomposition (CPD) of tensors is a powerful tool for analyzing multi-way data and is used extensively to analyze very large and extremely sparse datasets. The bottleneck of computing the CPD is multiplying a sparse tensor by several dense matrices. Algorithms for tensor-matrix products fall into two classes. The first class saves floating point operations by storing a compressed tensor for each dimension of the data. These methods are fast but suffer high memory costs. The second class uses a single uncompressed tensor at the cost of additional floating point operations. In this work, we bridge the gap between the two approaches and introduce the compressed sparse fiber (CSF), a data structure for sparse tensors, along with a novel parallel algorithm for tensor-matrix multiplication. CSF offers similar operation reductions as existing compressed methods while using only a single tensor structure. We validate our contributions with experiments comparing against state-of-the-art methods on a diverse set of datasets. Our work uses 58% less memory than the state-of-the-art while achieving 81% of the parallel performance on 16 threads.

125 citations
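The fiber-based compression the abstract describes can be illustrated with a small Python sketch (function names are hypothetical, not the authors' implementation): sorted coordinates are grouped into nested fibers, so the partial product for each (i, j) fiber is computed once and reused, which is where the operation savings come from.

```python
from collections import defaultdict

def build_csf(coords_vals):
    """Group sorted (i, j, k, v) entries of a 3-mode sparse tensor into
    nested fibers: mode-0 index -> mode-1 index -> list of (k, v) pairs."""
    tree = defaultdict(lambda: defaultdict(list))
    for i, j, k, v in sorted(coords_vals):
        tree[i][j].append((k, v))
    return tree

def mttkrp_row(tree, i, B, C, rank):
    """One row of the MTTKRP kernel M[i] = sum_{j,k} X[i,j,k] * (B[j] * C[k]),
    accumulating each (i, j) fiber once before scaling by B[j]."""
    out = [0.0] * rank
    for j, fiber in tree[i].items():
        acc = [0.0] * rank
        for k, v in fiber:
            for r in range(rank):
                acc[r] += v * C[k][r]
        for r in range(rank):
            out[r] += acc[r] * B[j][r]
    return out
```

The fiber grouping plays the role of CSF's per-level pointer arrays; a real implementation stores flat index/pointer arrays rather than nested dictionaries.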


Proceedings ArticleDOI
15 Nov 2015
TL;DR: Algorithmic improvements to the multithreaded graph partitioner mt-Metis are presented, which decrease the runtime by 1.5-11.7X and improve strong scaling by 82%.
Abstract: Graph partitioning is an important preprocessing step in applications dealing with sparse-irregular data. As such, the ability to efficiently partition a graph in parallel is crucial to the performance of these applications. The number of compute cores in a compute node continues to increase, demanding ever more scalability from shared-memory graph partitioners. In this paper we present algorithmic improvements to the multithreaded graph partitioner mt-Metis. We experimentally evaluate our methods on a 36 core machine, using 20 different graphs from a variety of domains. Our improvements decrease the runtime by 1.5-11.7X and improve strong scaling by 82%.

39 citations
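As a rough illustration of what a partitioner optimizes (not the multilevel algorithm mt-Metis actually uses), here is a toy Python sketch that grows one part by BFS and measures the resulting edge cut:

```python
from collections import deque

def bfs_bisect(adj):
    """Grow part 0 by BFS from vertex 0 until it holds half the vertices;
    a toy stand-in for a real multilevel partitioner."""
    n = len(adj)
    target = n // 2
    part = [1] * n
    part[0] = 0
    q, grown = deque([0]), 1
    while q and grown < target:
        u = q.popleft()
        for v in adj[u]:
            if part[v] == 1 and grown < target:
                part[v] = 0
                grown += 1
                q.append(v)
    return part

def edge_cut(adj, part):
    """Count edges crossing between the two parts."""
    return sum(part[u] != part[v]
               for u in range(len(adj)) for v in adj[u] if u < v)
```

Minimizing this edge cut while keeping the parts balanced is the objective the paper's improved mt-Metis algorithms pursue at scale.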


Proceedings ArticleDOI
15 Nov 2015
TL;DR: A task-based formulation of Scalable Universal Matrix Multiplication Algorithm (SUMMA), a popular algorithm for matrix multiplication, is applied to the multiplication of hierarchy-free, rank-structured matrices that appear in the domain of quantum chemistry (QC).
Abstract: A task-based formulation of Scalable Universal Matrix Multiplication Algorithm (SUMMA), a popular algorithm for matrix multiplication (MM), is applied to the multiplication of hierarchy-free, rank-structured matrices that appear in the domain of quantum chemistry (QC). The novel features of our formulation are: (1) concurrent scheduling of multiple SUMMA iterations, and (2) fine-grained task-based composition. These features make it tolerant of the load imbalance due to the irregular matrix structure and eliminate all artifactual sources of global synchronization. Scalability of iterative computation of square-root inverse of block-rank-sparse QC matrices is demonstrated; for full-rank (dense) matrices the performance of our SUMMA formulation usually exceeds that of the state-of-the-art dense MM implementations (ScaLAPACK and Cyclops Tensor Framework).

35 citations
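The outer-product structure underlying SUMMA can be sketched in plain Python (a shared-memory stand-in; the distributed algorithm instead broadcasts each panel along process rows and columns, and the paper's contribution is scheduling several such iterations concurrently as fine-grained tasks):

```python
def summa(A, B, nb):
    """Dense SUMMA sketch: C accumulates a sum of outer products of
    block-columns of A and block-rows of B, one panel per iteration."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for k0 in range(0, n, nb):          # one "SUMMA iteration" per panel
        for i in range(n):
            for k in range(k0, min(k0 + nb, n)):
                a = A[i][k]
                for j in range(n):
                    C[i][j] += a * B[k][j]
    return C
```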


Proceedings ArticleDOI
15 Nov 2015
TL;DR: It is found that the HMC's separate read and write channels are not well exploited by read-dominated data-centric workloads, suggesting that combined read/write channels might show higher utilization on these access patterns.
Abstract: The Hybrid Memory Cube is an early commercial product embodying attributes of future stacked DRAM architectures, namely large capacity, high bandwidth, on-package memory controller, and high speed serial interface. We study the performance and energy of a Gen2 HMC on data-centric workloads through a combination of emulation and execution on an HMC FPGA board. An in-house FPGA emulator has been used to obtain memory traces for a small collection of data-centric benchmarks. Our FPGA emulator is based on a 32-bit ARM processor and non-intrusively captures complete memory access traces at only 20X slowdown from real time. We have developed tools to run combined trace fragments from multiple benchmarks on the HMC board, giving a unique capability to characterize HMC performance and power usage under a data parallel workload. We find that the HMC's separate read and write channels are not well exploited by read-dominated data-centric workloads. Our benchmarks achieve between 66% -- 80% of peak bandwidth (80 GB/s for 32-byte packets with a 50--50 read/write mix) on the HMC, suggesting that combined read/write channels might show higher utilization on these access patterns. Bandwidth scales linearly up to saturation with increased demand on highly concurrent application workloads with many independent memory requests. There is a corresponding increase in latency, ranging from 80 ns under an extremely light load to 130 ns at high bandwidth.

33 citations


Proceedings ArticleDOI
15 Nov 2015
TL;DR: This paper presents breadth-first search and single-source shortest paths algorithms that use dynamic parallelism to adapt to the irregular and data-driven nature of these problems; the approach results in simple code that closely follows the high-level description of the algorithms.
Abstract: Dynamic parallelism allows GPU kernels to launch additional kernels at runtime directly from the GPU. In this paper we show that dynamic parallelism enables relatively simple high-performance graph algorithms for GPUs. We present breadth-first search (BFS) and single-source shortest paths (SSSP) algorithms that use dynamic parallelism to adapt to the irregular and data-driven nature of these problems. Our approach results in simple code that closely follows the high-level description of the algorithms but yields performance competitive with the current state of the art.

23 citations
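A level-synchronous BFS like the one the abstract builds on can be sketched in Python; in the paper's GPU setting, the inner per-vertex expansion is where dynamic parallelism lets a frontier vertex launch a child kernel sized to its degree (this sketch is sequential and illustrative only):

```python
def bfs_levels(adj, src):
    """Level-synchronous BFS over an adjacency-list graph.
    Each frontier vertex's edge expansion (the inner loop) is what a
    GPU child kernel would handle under dynamic parallelism."""
    dist = {src: 0}
    frontier = [src]
    while frontier:
        nxt = []
        for u in frontier:
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    nxt.append(v)
        frontier = nxt
    return dist
```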


Proceedings ArticleDOI
15 Nov 2015
TL;DR: This work generalises the various ways in which a sparse matrix--vector (SpMV) multiplication can be vectorised and arrives at a novel data structure that generalises three earlier well-known data structures for sparse computations: the Blocked CRS format, the (sliced) ELLPACK format, and segmented scan based formats.
Abstract: This work generalises the various ways in which a sparse matrix--vector (SpMV) multiplication can be vectorised. It arrives at a novel data structure that generalises three earlier well-known data structures for sparse computations: the Blocked CRS format, the (sliced) ELLPACK format, and segmented scan based formats. The new data structure is relevant since efficient use of new hardware requires the use of increasingly wide vector registers. Normally, the use of vectorisation for sparse computations is limited due to bandwidth constraints. In cases where computations are limited by memory latencies instead of memory bandwidth, however, vectorisation can still help performance. The Intel Xeon Phi, appearing as a component in several TOP500 supercomputers, displays exactly this behaviour for SpMV multiplication. On this architecture the use of the new generalised vectorisation scheme increases performance by up to 178 percent.

11 citations
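One of the formats being generalised, sliced ELLPACK, can be shown with a short Python sketch (illustrative, not the paper's unified data structure): rows are padded only to the longest row within their slice, so a wide vector register can process one padded column of a slice per step.

```python
def to_sliced_ell(rows, slice_h):
    """Pack rows (lists of (col, val) pairs) into sliced ELLPACK:
    within each slice of slice_h rows, pad to the slice's widest row.
    Padding uses column -1 and value 0.0."""
    slices = []
    for s in range(0, len(rows), slice_h):
        block = rows[s:s + slice_h]
        width = max(len(r) for r in block)
        cols = [[c for c, _ in r] + [-1] * (width - len(r)) for r in block]
        vals = [[v for _, v in r] + [0.0] * (width - len(r)) for r in block]
        slices.append((s, cols, vals))
    return slices

def spmv(slices, x, n):
    """y = A*x over the sliced-ELLPACK slices; padding entries are skipped."""
    y = [0.0] * n
    for s, cols, vals in slices:
        for i, (crow, vrow) in enumerate(zip(cols, vals)):
            y[s + i] = sum(v * x[c] for c, v in zip(crow, vrow) if c >= 0)
    return y
```

Slicing keeps the padding overhead local: a single very long row inflates only its own slice, not the whole matrix.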


Proceedings ArticleDOI
15 Nov 2015
TL;DR: pL2AP uses a number of cache-tiling optimizations, combined with fine-grained, dynamically balanced parallel tasks, to solve the AllPairs cosine similarity search problem in a multi-core environment.
Abstract: Solving the AllPairs similarity search problem entails finding all pairs of vectors in a high dimensional sparse dataset that have a similarity value higher than a given threshold. The output from this problem is a crucial component in many real-world applications, such as clustering, online advertising, recommender systems, near-duplicate document detection, and query refinement. A number of serial algorithms have been proposed that solve the problem by pruning many of the possible similarity candidates for each query object, after accessing only a few of their non-zero values. The pruning process results in unpredictable memory access patterns that can reduce search efficiency. In this context, we introduce pL2AP, which efficiently solves the AllPairs cosine similarity search problem in a multi-core environment. Our method uses a number of cache-tiling optimizations, combined with fine-grained, dynamically balanced parallel tasks, to solve the problem 1.5x-238x faster than existing parallel baselines on datasets with hundreds of millions of non-zeros.

10 citations
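The candidate-generation core that pL2AP prunes and tiles can be sketched naively in Python (this omits all of pL2AP's pruning and parallelism; names are hypothetical): unit-normalize the vectors, build an inverted index over nonzero dimensions, and accumulate dot products only for pairs that share a dimension.

```python
import math

def all_pairs(vecs, t):
    """Return pairs (i, j), i < j, of sparse vectors (dicts: dim -> value)
    whose cosine similarity is at least t, via an inverted index."""
    unit = []
    for v in vecs:
        nrm = math.sqrt(sum(x * x for x in v.values()))
        unit.append({d: x / nrm for d, x in v.items()})
    index = {}            # dimension -> list of (vector id, value)
    results = set()
    for i, v in enumerate(unit):
        acc = {}          # partial dot products against earlier vectors
        for d, x in v.items():
            for j, y in index.get(d, []):
                acc[j] = acc.get(j, 0.0) + x * y
        for j, sim in acc.items():
            if sim >= t:
                results.add((j, i))
        for d, x in v.items():
            index.setdefault(d, []).append((i, x))
    return results
```

Even this naive version touches the index in a data-dependent order, which is the unpredictable access pattern the abstract says cache tiling is meant to tame.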


Proceedings ArticleDOI
15 Nov 2015
TL;DR: This work presents a compact hierarchical data structure for extreme-scale volume modeling, dubbed HDT (shorthand for hybrid dynamic tree), and demonstrates that HDT is arguably the most storage-effective representation.
Abstract: Volume data has extensive uses in medical imaging, such as MRI (magnetic resonance imaging) scans; in visual effects production, including volume rendering and fluid simulation; and in computer-aided design and manufacturing (CAD/CAM) for advanced prototyping, such as 3D printing. This work presents a compact hierarchical data structure, dubbed HDT, for extreme-scale volume modeling. The name is shorthand for hybrid dynamic tree. HDT is hybrid in that it fuses two contrasting structures, namely the octree and the tiled grid. This fusion of two opposing data layouts alleviates the limitations inherent to each. We describe the HDT construction algorithm for generating volumetric representations of triangle meshes on GPUs. While HDT mirrors much of the existing work on sparse 3D data modeling, our evaluation and comparative studies with prior research demonstrate that HDT is arguably the most storage-effective representation. The geometric topology in HDT consumes just short of two bits per voxel --- five times more compact than the current state-of-the-art volumetric structure.

10 citations
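The hybrid idea, a sparse top level over small dense tiles, can be illustrated with a toy Python voxel store (4x4x4 bit tiles here; HDT's actual octree levels and tile sizes differ, and this is not the authors' structure):

```python
class TiledVoxels:
    """Toy hybrid layout: a sparse top level (dict of occupied tiles)
    over dense 4x4x4 tiles stored as 64-bit occupancy masks, echoing
    HDT's fusion of a tree with a tiled grid."""
    T = 4  # tile edge length

    def __init__(self):
        self.tiles = {}   # tile coordinate -> 64-bit occupancy mask

    def _locate(self, x, y, z):
        key = (x // self.T, y // self.T, z // self.T)
        bit = (x % self.T) + self.T * ((y % self.T) + self.T * (z % self.T))
        return key, bit

    def set(self, x, y, z):
        key, bit = self._locate(x, y, z)
        self.tiles[key] = self.tiles.get(key, 0) | (1 << bit)

    def get(self, x, y, z):
        key, bit = self._locate(x, y, z)
        return bool(self.tiles.get(key, 0) >> bit & 1)
```

Empty regions cost nothing at the top level, while occupied regions pay one bit per voxel inside a tile, which is the flavor of the storage economy the abstract quantifies.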


Proceedings ArticleDOI
15 Nov 2015
TL;DR: The Graph Algorithm Iron Law (GAIL) is presented to quantify these tradeoffs to help understand graph algorithm performance.
Abstract: As new applications for graph algorithms emerge, there has been a great deal of research interest in improving graph processing. However, it is often difficult to understand how these new contributions improve performance. Execution time, the most commonly reported metric, distinguishes which alternative is the fastest but does not give any insight as to why. A new contribution may have an algorithmic innovation that allows it to examine fewer graph edges. It could also have an implementation optimization that reduces communication. It could even have optimizations that allow it to increase its memory bandwidth utilization. More interestingly, a new innovation may simultaneously affect all three of these factors (algorithmic work, communication volume, and memory bandwidth utilization). We present the Graph Algorithm Iron Law (GAIL) to quantify these tradeoffs to help understand graph algorithm performance.

9 citations
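GAIL's factorization can be written down directly: total runtime is the product of edges examined (algorithmic work), memory requests per edge (communication volume), and seconds per memory request (bandwidth utilization). A minimal Python sketch with a hypothetical function name:

```python
def gail(edges_examined, memory_requests, seconds):
    """Decompose a graph algorithm's runtime into the three GAIL factors:
    edges examined, memory requests per edge, and seconds per request.
    Their product recovers the total runtime."""
    return (edges_examined,
            memory_requests / edges_examined,
            seconds / memory_requests)
```

Comparing two implementations factor by factor shows whether a speedup came from doing less work, communicating less, or using memory bandwidth better.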


Proceedings ArticleDOI
15 Nov 2015
TL;DR: This work proposes a data-centric approach to GPU-based Adaptive Mesh Refinement (AMR), porting all the mesh adaptation operations that touch the data arrays to the GPU, which allows the stencil data arrays to reside in GPU memory for the entirety of the simulation.
Abstract: It has been demonstrated that explicit stencil computations of high-resolution schemes can benefit greatly from GPUs. This includes Adaptive Mesh Refinement (AMR), a model for locally adapting the resolution of a stencil grid. Unlike uniform grid stencils, however, adapting the grid is typically done on the CPU side, which requires transferring the stencil data arrays to/from the CPU every time the grid is adapted. We propose a data-centric approach to GPU-based AMR: porting all the mesh adaptation operations that touch the data arrays to the GPU. This allows the stencil data arrays to reside in GPU memory for the entirety of the simulation; the GPU code specializes on the data residing in its memory while the CPU specializes on the AMR metadata residing in CPU memory. We compare the performance of the proposed method to a basic GPU implementation and an optimized GPU implementation that overlaps communication and computation. The performance of two GPU-based AMR applications is enhanced by 2.21x and 2.83x compared to the basic implementation.

7 citations
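The flag-and-refine step that such a data-centric design keeps on the GPU can be illustrated with a toy 1-D Python sketch (illustrative only; real AMR operates on multi-dimensional block-structured grids and the paper's point is where these operations run, not this particular criterion):

```python
def refine_flags(u, tol):
    """Flag each interval whose solution jump exceeds tol; this is the
    kind of data-array pass the paper moves onto the GPU."""
    return [abs(u[i + 1] - u[i]) > tol for i in range(len(u) - 1)]

def refine(u, flags):
    """Insert a midpoint value into every flagged interval (toy 1-D AMR)."""
    out = [u[0]]
    for i, f in enumerate(flags):
        if f:
            out.append(0.5 * (u[i] + u[i + 1]))
        out.append(u[i + 1])
    return out
```

Because both passes read and rewrite the solution arrays, running them on the CPU would force a round trip of exactly the data the paper keeps resident on the GPU.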


Proceedings ArticleDOI
15 Nov 2015
TL;DR: The proposed techniques are well suited to computing BC scores in graphs which are too large to fit in a single GPU's memory; the computation time for a graph with 234 million edges is reduced to less than 2 hours.
Abstract: Betweenness Centrality (BC) is steadily growing in popularity as a metric of the influence of a vertex in a graph. Exact BC computation for a large-scale graph is extraordinarily challenging and requires high performance computing techniques to provide results in a reasonable amount of time. Here, we present the techniques we developed to speed up the computation of BC on multi-GPU systems. Our approach combines a bi-dimensional (2-D) decomposition of the graph with multi-level parallelism. Experimental results show that the proposed techniques are well suited to computing BC scores in graphs which are too large to fit in a single GPU's memory. In particular, the computation time for a graph with 234 million edges is reduced to less than 2 hours.
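The per-source computation being distributed here is Brandes' algorithm for unweighted graphs; a sequential Python sketch (the paper's contribution is the 2-D multi-GPU decomposition, not this baseline):

```python
from collections import deque

def betweenness(adj):
    """Brandes' betweenness centrality for an unweighted graph given as
    adjacency lists: a BFS from each source counts shortest paths (sigma),
    then a reverse sweep accumulates dependencies (delta)."""
    n = len(adj)
    bc = [0.0] * n
    for s in range(n):
        sigma = [0] * n; sigma[s] = 1
        dist = [-1] * n; dist[s] = 0
        order, q = [], deque([s])
        while q:
            u = q.popleft(); order.append(u)
            for v in adj[u]:
                if dist[v] < 0:
                    dist[v] = dist[u] + 1
                    q.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
        delta = [0.0] * n
        for u in reversed(order):
            for v in adj[u]:
                if dist[v] == dist[u] + 1:
                    delta[u] += sigma[u] / sigma[v] * (1 + delta[v])
            if u != s:
                bc[u] += delta[u]
    return bc
```

The outer loop over sources is embarrassingly parallel; the hard part the paper addresses is fitting and traversing the graph itself across multiple GPUs.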

Proceedings ArticleDOI
15 Nov 2015
TL;DR: This work describes PathFinder, its signature search algorithm, which is a modified depth-first recursive search wherein adjacent nodes' labels are compared before recursing down their edges, and its general performance and cache characteristics.
Abstract: Graphs are widely used in data analytics applications in a variety of fields and are rapidly gaining attention in the computational scientific and engineering (CSE) application community. An important application of graphs concerns binary (executable) signature search, which addresses the potential of a suspect binary evading binary signature detection via obfuscation. A control flow graph generated from a binary allows identification of a pattern of system calls, an ordered sequence of which can then be used as a signature in the search. An application proxy, named PathFinder, represents these properties, allowing examination of the performance characteristics of algorithms used in the search. In this work, we describe PathFinder and its signature search algorithm, a modified depth-first recursive search wherein adjacent nodes' labels are compared before recursing down their edges, and report its general performance and cache characteristics. We highlight some important differences between PathFinder and traditional CSE applications. For example, the L2 cache hit ratio (less than 60%) in PathFinder is observed to be substantially lower than those observed for traditional CSE applications.
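The modified depth-first search can be sketched in Python: compare the current node's label against the next unmatched signature element, then recurse down its edges (a hypothetical simplification of the proxy's actual kernel):

```python
def has_signature(adj, labels, sig):
    """Return True if some simple path through the labeled graph contains
    the label sequence sig in order (not necessarily contiguously)."""
    def dfs(u, i, seen):
        if labels[u] == sig[i]:       # compare label before recursing
            i += 1
            if i == len(sig):
                return True
        for v in adj[u]:
            if v not in seen and dfs(v, i, seen | {v}):
                return True
        return False
    return any(dfs(u, 0, {u}) for u in range(len(adj)))
```

The data-dependent, pointer-chasing recursion is what produces the poor cache behavior the abstract reports.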

Proceedings ArticleDOI
Alex Pothen1
15 Nov 2015
TL;DR: This paper discusses parallel algorithms for matching, which are difficult to design because many algorithms rely on searching for long paths in a graph, or implicitly communicate information along long paths, and thus have little concurrency.
Abstract: Computing a matching in a graph is one of "the hardest simple problems" in computer science. It is simple since most variants of matching can be solved in polynomial time, yet hard because the running times are high and the algorithms are complex. It is even more challenging to design parallel algorithms for matching, since many algorithms rely on searching for long paths in a graph, or implicitly communicate information along long paths, and thus have little concurrency.