
Showing papers presented at "Irregular Applications: Architectures and Algorithms in 2015"


Proceedings ArticleDOI
15 Nov 2015
TL;DR: The compressed sparse fiber (CSF), a data structure for sparse tensors, is introduced along with a novel parallel algorithm for tensor-matrix multiplication; CSF offers similar operation reductions as existing compressed methods while using only a single tensor structure.
Abstract: The Canonical Polyadic Decomposition (CPD) of tensors is a powerful tool for analyzing multi-way data and is used extensively to analyze very large and extremely sparse datasets. The bottleneck of computing the CPD is multiplying a sparse tensor by several dense matrices. Algorithms for tensor-matrix products fall into two classes. The first class saves floating point operations by storing a compressed tensor for each dimension of the data. These methods are fast but suffer high memory costs. The second class uses a single uncompressed tensor at the cost of additional floating point operations. In this work, we bridge the gap between the two approaches and introduce the compressed sparse fiber (CSF), a data structure for sparse tensors, along with a novel parallel algorithm for tensor-matrix multiplication. CSF offers similar operation reductions as existing compressed methods while using only a single tensor structure. We validate our contributions with experiments comparing against state-of-the-art methods on a diverse set of datasets. Our work uses 58% less memory than the state-of-the-art while achieving 81% of the parallel performance on 16 threads.

125 citations
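The fiber-based compression the abstract describes can be illustrated with a small Python sketch (function names are hypothetical, not the authors' implementation): sorted coordinates are grouped into nested fibers, so the partial product for each (i, j) fiber is computed once and reused, which is where the operation savings come from.

```python
from collections import defaultdict

def build_csf(coords_vals):
    """Group sorted (i, j, k, v) entries of a 3-mode sparse tensor into
    nested fibers: mode-0 index -> mode-1 index -> list of (k, v) pairs."""
    tree = defaultdict(lambda: defaultdict(list))
    for i, j, k, v in sorted(coords_vals):
        tree[i][j].append((k, v))
    return tree

def mttkrp_row(tree, i, B, C, rank):
    """One row of the MTTKRP kernel M[i] = sum_{j,k} X[i,j,k] * (B[j] * C[k]),
    accumulating each (i, j) fiber once before scaling by B[j]."""
    out = [0.0] * rank
    for j, fiber in tree[i].items():
        acc = [0.0] * rank
        for k, v in fiber:
            for r in range(rank):
                acc[r] += v * C[k][r]
        for r in range(rank):
            out[r] += acc[r] * B[j][r]
    return out
```

The fiber grouping plays the role of CSF's per-level pointer arrays; a real implementation stores flat index/pointer arrays rather than nested dictionaries.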


Proceedings ArticleDOI
15 Nov 2015
TL;DR: Algorithmic improvements to the multithreaded graph partitioner mt-Metis are presented, which decrease the runtime by 1.5-11.7X and improve strong scaling by 82%.
Abstract: Graph partitioning is an important preprocessing step in applications dealing with sparse-irregular data. As such, the ability to efficiently partition a graph in parallel is crucial to the performance of these applications. The number of compute cores in a compute node continues to increase, demanding ever more scalability from shared-memory graph partitioners. In this paper we present algorithmic improvements to the multithreaded graph partitioner mt-Metis. We experimentally evaluate our methods on a 36 core machine, using 20 different graphs from a variety of domains. Our improvements decrease the runtime by 1.5-11.7X and improve strong scaling by 82%.

39 citations
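As a rough illustration of what a partitioner optimizes (not the multilevel algorithm mt-Metis actually uses), here is a toy Python sketch that grows one part by BFS and measures the resulting edge cut:

```python
from collections import deque

def bfs_bisect(adj):
    """Grow part 0 by BFS from vertex 0 until it holds half the vertices;
    a toy stand-in for a real multilevel partitioner."""
    n = len(adj)
    target = n // 2
    part = [1] * n
    part[0] = 0
    q, grown = deque([0]), 1
    while q and grown < target:
        u = q.popleft()
        for v in adj[u]:
            if part[v] == 1 and grown < target:
                part[v] = 0
                grown += 1
                q.append(v)
    return part

def edge_cut(adj, part):
    """Count edges crossing between the two parts."""
    return sum(part[u] != part[v]
               for u in range(len(adj)) for v in adj[u] if u < v)
```

Minimizing this edge cut while keeping the parts balanced is the objective the paper's improved mt-Metis algorithms pursue at scale.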


Proceedings ArticleDOI
15 Nov 2015
TL;DR: A task-based formulation of Scalable Universal Matrix Multiplication Algorithm (SUMMA), a popular algorithm for matrix multiplication, is applied to the multiplication of hierarchy-free, rank-structured matrices that appear in the domain of quantum chemistry (QC).
Abstract: A task-based formulation of Scalable Universal Matrix Multiplication Algorithm (SUMMA), a popular algorithm for matrix multiplication (MM), is applied to the multiplication of hierarchy-free, rank-structured matrices that appear in the domain of quantum chemistry (QC). The novel features of our formulation are: (1) concurrent scheduling of multiple SUMMA iterations, and (2) fine-grained task-based composition. These features make it tolerant of the load imbalance due to the irregular matrix structure and eliminate all artifactual sources of global synchronization. Scalability of iterative computation of square-root inverse of block-rank-sparse QC matrices is demonstrated; for full-rank (dense) matrices the performance of our SUMMA formulation usually exceeds that of the state-of-the-art dense MM implementations (ScaLAPACK and Cyclops Tensor Framework).

35 citations
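The outer-product structure underlying SUMMA can be sketched in plain Python (a shared-memory stand-in; the distributed algorithm instead broadcasts each panel along process rows and columns, and the paper's contribution is scheduling several such iterations concurrently as fine-grained tasks):

```python
def summa(A, B, nb):
    """Dense SUMMA sketch: C accumulates a sum of outer products of
    block-columns of A and block-rows of B, one panel per iteration."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for k0 in range(0, n, nb):          # one "SUMMA iteration" per panel
        for i in range(n):
            for k in range(k0, min(k0 + nb, n)):
                a = A[i][k]
                for j in range(n):
                    C[i][j] += a * B[k][j]
    return C
```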


Proceedings ArticleDOI
15 Nov 2015
TL;DR: It is found that the HMC's separate read and write channels are not well exploited by read-dominated data-centric workloads, suggesting that combined read/write channels might show higher utilization on these access patterns.
Abstract: The Hybrid Memory Cube is an early commercial product embodying attributes of future stacked DRAM architectures, namely large capacity, high bandwidth, on-package memory controller, and high speed serial interface. We study the performance and energy of a Gen2 HMC on data-centric workloads through a combination of emulation and execution on an HMC FPGA board. An in-house FPGA emulator has been used to obtain memory traces for a small collection of data-centric benchmarks. Our FPGA emulator is based on a 32-bit ARM processor and non-intrusively captures complete memory access traces at only 20X slowdown from real time. We have developed tools to run combined trace fragments from multiple benchmarks on the HMC board, giving a unique capability to characterize HMC performance and power usage under a data parallel workload. We find that the HMC's separate read and write channels are not well exploited by read-dominated data-centric workloads. Our benchmarks achieve between 66% -- 80% of peak bandwidth (80 GB/s for 32-byte packets with a 50--50 read/write mix) on the HMC, suggesting that combined read/write channels might show higher utilization on these access patterns. Bandwidth scales linearly up to saturation with increased demand on highly concurrent application workloads with many independent memory requests. There is a corresponding increase in latency, ranging from 80 ns under an extremely light load to 130 ns at high bandwidth.

33 citations


Proceedings ArticleDOI
15 Nov 2015
TL;DR: This paper presents breadth-first search and single-source shortest paths algorithms that use dynamic parallelism to adapt to the irregular and data-driven nature of these problems; the approach results in simple code that closely follows the high-level description of the algorithms.
Abstract: Dynamic parallelism allows GPU kernels to launch additional kernels at runtime directly from the GPU. In this paper we show that dynamic parallelism enables relatively simple high-performance graph algorithms for GPUs. We present breadth-first search (BFS) and single-source shortest paths (SSSP) algorithms that use dynamic parallelism to adapt to the irregular and data-driven nature of these problems. Our approach results in simple code that closely follows the high-level description of the algorithms but yields performance competitive with the current state of the art.

23 citations
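A level-synchronous BFS like the one the abstract builds on can be sketched in Python; in the paper's GPU setting, the inner per-vertex expansion is where dynamic parallelism lets a frontier vertex launch a child kernel sized to its degree (this sketch is sequential and illustrative only):

```python
def bfs_levels(adj, src):
    """Level-synchronous BFS over an adjacency-list graph.
    Each frontier vertex's edge expansion (the inner loop) is what a
    GPU child kernel would handle under dynamic parallelism."""
    dist = {src: 0}
    frontier = [src]
    while frontier:
        nxt = []
        for u in frontier:
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    nxt.append(v)
        frontier = nxt
    return dist
```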


Proceedings ArticleDOI
15 Nov 2015
TL;DR: This work generalises the various ways in which a sparse matrix--vector (SpMV) multiplication can be vectorised and arrives at a novel data structure that generalises three earlier well-known data structures for sparse computations: the Blocked CRS format, the (sliced) ELLPACK format, and segmented scan based formats.
Abstract: This work generalises the various ways in which a sparse matrix--vector (SpMV) multiplication can be vectorised. It arrives at a novel data structure that generalises three earlier well-known data structures for sparse computations: the Blocked CRS format, the (sliced) ELLPACK format, and segmented scan based formats. The new data structure is relevant since efficient use of new hardware requires the use of increasingly wide vector registers. Normally, the use of vectorisation for sparse computations is limited due to bandwidth constraints. In cases where computations are limited by memory latencies instead of memory bandwidth, however, vectorisation can still help performance. The Intel Xeon Phi, appearing as a component in several TOP500 supercomputers, displays exactly this behaviour for SpMV multiplication. On this architecture the use of the new generalised vectorisation scheme increases performance by up to 178 percent.

11 citations
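One of the formats being generalised, sliced ELLPACK, can be shown with a short Python sketch (illustrative, not the paper's unified data structure): rows are padded only to the longest row within their slice, so a wide vector register can process one padded column of a slice per step.

```python
def to_sliced_ell(rows, slice_h):
    """Pack rows (lists of (col, val) pairs) into sliced ELLPACK:
    within each slice of slice_h rows, pad to the slice's widest row.
    Padding uses column -1 and value 0.0."""
    slices = []
    for s in range(0, len(rows), slice_h):
        block = rows[s:s + slice_h]
        width = max(len(r) for r in block)
        cols = [[c for c, _ in r] + [-1] * (width - len(r)) for r in block]
        vals = [[v for _, v in r] + [0.0] * (width - len(r)) for r in block]
        slices.append((s, cols, vals))
    return slices

def spmv(slices, x, n):
    """y = A*x over the sliced-ELLPACK slices; padding entries are skipped."""
    y = [0.0] * n
    for s, cols, vals in slices:
        for i, (crow, vrow) in enumerate(zip(cols, vals)):
            y[s + i] = sum(v * x[c] for c, v in zip(crow, vrow) if c >= 0)
    return y
```

Slicing keeps the padding overhead local: a single very long row inflates only its own slice, not the whole matrix.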


Proceedings ArticleDOI
15 Nov 2015
TL;DR: pL2AP uses a number of cache-tiling optimizations, combined with fine-grained, dynamically balanced parallel tasks, to solve the AllPairs cosine similarity search problem in a multi-core environment.
Abstract: Solving the AllPairs similarity search problem entails finding all pairs of vectors in a high dimensional sparse dataset that have a similarity value higher than a given threshold. The output from this problem is a crucial component in many real-world applications, such as clustering, online advertising, recommender systems, near-duplicate document detection, and query refinement. A number of serial algorithms have been proposed that solve the problem by pruning many of the possible similarity candidates for each query object, after accessing only a few of their non-zero values. The pruning process results in unpredictable memory access patterns that can reduce search efficiency. In this context, we introduce pL2AP, which efficiently solves the AllPairs cosine similarity search problem in a multi-core environment. Our method uses a number of cache-tiling optimizations, combined with fine-grained, dynamically balanced parallel tasks, to solve the problem 1.5x-238x faster than existing parallel baselines on datasets with hundreds of millions of non-zeros.

10 citations
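The candidate-generation core that pL2AP prunes and tiles can be sketched naively in Python (this omits all of pL2AP's pruning and parallelism; names are hypothetical): unit-normalize the vectors, build an inverted index over nonzero dimensions, and accumulate dot products only for pairs that share a dimension.

```python
import math

def all_pairs(vecs, t):
    """Return pairs (i, j), i < j, of sparse vectors (dicts: dim -> value)
    whose cosine similarity is at least t, via an inverted index."""
    unit = []
    for v in vecs:
        nrm = math.sqrt(sum(x * x for x in v.values()))
        unit.append({d: x / nrm for d, x in v.items()})
    index = {}            # dimension -> list of (vector id, value)
    results = set()
    for i, v in enumerate(unit):
        acc = {}          # partial dot products against earlier vectors
        for d, x in v.items():
            for j, y in index.get(d, []):
                acc[j] = acc.get(j, 0.0) + x * y
        for j, sim in acc.items():
            if sim >= t:
                results.add((j, i))
        for d, x in v.items():
            index.setdefault(d, []).append((i, x))
    return results
```

Even this naive version touches the index in a data-dependent order, which is the unpredictable access pattern the abstract says cache tiling is meant to tame.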


Proceedings ArticleDOI
15 Nov 2015
TL;DR: This work presents a compact hierarchical data structure for extreme-scale volume modeling, dubbed HDT (shorthand for hybrid dynamic tree), and demonstrates that HDT is arguably the most storage-effective representation.
Abstract: Volume data has extensive uses in medical imaging, such as MRI (magnetic resonance imaging) scans; in visual effects production, including volume rendering and fluid simulation; and in computer-aided design and manufacturing (CAD/CAM) for advanced prototyping, such as 3D printing. This work presents a compact hierarchical data structure, dubbed HDT, for extreme-scale volume modeling. The name is shorthand for hybrid dynamic tree. HDT is hybrid in that it fuses two contrasting structures, namely the octree and the tiled grid. This fusion of two opposing data layouts alleviates the limitations inherent to each. We describe the HDT construction algorithm for generating volumetric representations of triangle meshes on GPUs. While HDT mirrors much of the existing work on sparse 3D data modeling, our evaluation and comparative studies with prior research demonstrate that HDT is arguably the most storage-effective representation. The geometric topology in HDT consumes just short of two bits per voxel --- five times more compact than the current state-of-the-art volumetric structure.

10 citations
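The hybrid idea, a sparse top level over small dense tiles, can be illustrated with a toy Python voxel store (4x4x4 bit tiles here; HDT's actual octree levels and tile sizes differ, and this is not the authors' structure):

```python
class TiledVoxels:
    """Toy hybrid layout: a sparse top level (dict of occupied tiles)
    over dense 4x4x4 tiles stored as 64-bit occupancy masks, echoing
    HDT's fusion of a tree with a tiled grid."""
    T = 4  # tile edge length

    def __init__(self):
        self.tiles = {}   # tile coordinate -> 64-bit occupancy mask

    def _locate(self, x, y, z):
        key = (x // self.T, y // self.T, z // self.T)
        bit = (x % self.T) + self.T * ((y % self.T) + self.T * (z % self.T))
        return key, bit

    def set(self, x, y, z):
        key, bit = self._locate(x, y, z)
        self.tiles[key] = self.tiles.get(key, 0) | (1 << bit)

    def get(self, x, y, z):
        key, bit = self._locate(x, y, z)
        return bool(self.tiles.get(key, 0) >> bit & 1)
```

Empty regions cost nothing at the top level, while occupied regions pay one bit per voxel inside a tile, which is the flavor of the storage economy the abstract quantifies.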


Proceedings ArticleDOI
15 Nov 2015
TL;DR: The Graph Algorithm Iron Law (GAIL) is presented to quantify these tradeoffs to help understand graph algorithm performance.
Abstract: As new applications for graph algorithms emerge, there has been a great deal of research interest in improving graph processing. However, it is often difficult to understand how these new contributions improve performance. Execution time, the most commonly reported metric, distinguishes which alternative is the fastest but does not give any insight as to why. A new contribution may have an algorithmic innovation that allows it to examine fewer graph edges. It could also have an implementation optimization that reduces communication. It could even have optimizations that allow it to increase its memory bandwidth utilization. More interestingly, a new innovation may simultaneously affect all three of these factors (algorithmic work, communication volume, and memory bandwidth utilization). We present the Graph Algorithm Iron Law (GAIL) to quantify these tradeoffs to help understand graph algorithm performance.

9 citations
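GAIL's factorization can be written down directly: total runtime is the product of edges examined (algorithmic work), memory requests per edge (communication volume), and seconds per memory request (bandwidth utilization). A minimal Python sketch with a hypothetical function name:

```python
def gail(edges_examined, memory_requests, seconds):
    """Decompose a graph algorithm's runtime into the three GAIL factors:
    edges examined, memory requests per edge, and seconds per request.
    Their product recovers the total runtime."""
    return (edges_examined,
            memory_requests / edges_examined,
            seconds / memory_requests)
```

Comparing two implementations factor by factor shows whether a speedup came from doing less work, communicating less, or using memory bandwidth better.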


Proceedings ArticleDOI
15 Nov 2015
TL;DR: This work proposes a data-centric approach to GPU-based Adaptive Mesh Refinement (AMR), porting all the mesh adaptation operations that touch the data arrays to the GPU, which allows the stencil data arrays to reside in GPU memory for the entirety of the simulation.
Abstract: It has been demonstrated that explicit stencil computations of high-resolution schemes can benefit greatly from GPUs. This includes Adaptive Mesh Refinement (AMR), a model for locally adapting the resolution of a stencil grid. Unlike uniform grid stencils, however, adapting the grid is typically done on the CPU side, which requires transferring the stencil data arrays to/from the CPU every time the grid is adapted. We propose a data-centric approach to GPU-based AMR: porting all the mesh adaptation operations that touch the data arrays to the GPU. This allows the stencil data arrays to reside in GPU memory for the entirety of the simulation; the GPU code specializes on the data residing in its memory while the CPU specializes on the AMR metadata residing in CPU memory. We compare the performance of the proposed method to a basic GPU implementation and an optimized GPU implementation that overlaps communication and computation. The performance of two GPU-based AMR applications is enhanced by 2.21x and 2.83x compared to the basic implementation.

7 citations
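The flag-and-refine step that such a data-centric design keeps on the GPU can be illustrated with a toy 1-D Python sketch (illustrative only; real AMR operates on multi-dimensional block-structured grids and the paper's point is where these operations run, not this particular criterion):

```python
def refine_flags(u, tol):
    """Flag each interval whose solution jump exceeds tol; this is the
    kind of data-array pass the paper moves onto the GPU."""
    return [abs(u[i + 1] - u[i]) > tol for i in range(len(u) - 1)]

def refine(u, flags):
    """Insert a midpoint value into every flagged interval (toy 1-D AMR)."""
    out = [u[0]]
    for i, f in enumerate(flags):
        if f:
            out.append(0.5 * (u[i] + u[i + 1]))
        out.append(u[i + 1])
    return out
```

Because both passes read and rewrite the solution arrays, running them on the CPU would force a round trip of exactly the data the paper keeps resident on the GPU.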


Proceedings ArticleDOI
15 Nov 2015
TL;DR: The proposed techniques are well suited to computing BC scores in graphs which are too large to fit in a single GPU's memory; the computation time for a graph with 234 million edges is reduced to less than 2 hours.
Abstract: Betweenness Centrality (BC) is steadily growing in popularity as a metric of the influence of a vertex in a graph. Exact BC computation for a large-scale graph is extraordinarily challenging and requires high performance computing techniques to provide results in a reasonable amount of time. Here, we present the techniques we developed to speed up the computation of BC on multi-GPU systems. Our approach combines a bi-dimensional (2-D) decomposition of the graph with multi-level parallelism. Experimental results show that the proposed techniques are well suited to computing BC scores in graphs which are too large to fit in a single GPU's memory. In particular, the computation time for a graph with 234 million edges is reduced to less than 2 hours.
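The per-source computation being distributed here is Brandes' algorithm for unweighted graphs; a sequential Python sketch (the paper's contribution is the 2-D multi-GPU decomposition, not this baseline):

```python
from collections import deque

def betweenness(adj):
    """Brandes' betweenness centrality for an unweighted graph given as
    adjacency lists: a BFS from each source counts shortest paths (sigma),
    then a reverse sweep accumulates dependencies (delta)."""
    n = len(adj)
    bc = [0.0] * n
    for s in range(n):
        sigma = [0] * n; sigma[s] = 1
        dist = [-1] * n; dist[s] = 0
        order, q = [], deque([s])
        while q:
            u = q.popleft(); order.append(u)
            for v in adj[u]:
                if dist[v] < 0:
                    dist[v] = dist[u] + 1
                    q.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
        delta = [0.0] * n
        for u in reversed(order):
            for v in adj[u]:
                if dist[v] == dist[u] + 1:
                    delta[u] += sigma[u] / sigma[v] * (1 + delta[v])
            if u != s:
                bc[u] += delta[u]
    return bc
```

The outer loop over sources is embarrassingly parallel; the hard part the paper addresses is fitting and traversing the graph itself across multiple GPUs.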

Proceedings ArticleDOI
15 Nov 2015
TL;DR: This work describes PathFinder, its signature search algorithm, which is a modified depth-first recursive search wherein adjacent nodes' labels are compared before recursing down their edges, and its general performance and cache characteristics.
Abstract: Graphs are widely used in data analytics applications in a variety of fields and are rapidly gaining attention in the computational scientific and engineering (CSE) application community. An important application of graphs concerns binary (executable) signature search, which addresses the potential of a suspect binary evading binary signature detection via obfuscation. A control flow graph generated from a binary allows identification of a pattern of system calls, an ordered sequence of which can then be used as a signature in the search. An application proxy, named PathFinder, represents these properties, allowing examination of the performance characteristics of algorithms used in the search. In this work, we describe PathFinder and its signature search algorithm, a modified depth-first recursive search wherein adjacent nodes' labels are compared before recursing down their edges, and report its general performance and cache characteristics. We highlight some important differences between PathFinder and traditional CSE applications. For example, the L2 cache hit ratio (less than 60%) in PathFinder is observed to be substantially lower than those observed for traditional CSE applications.
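The modified depth-first search can be sketched in Python: compare the current node's label against the next unmatched signature element, then recurse down its edges (a hypothetical simplification of the proxy's actual kernel):

```python
def has_signature(adj, labels, sig):
    """Return True if some simple path through the labeled graph contains
    the label sequence sig in order (not necessarily contiguously)."""
    def dfs(u, i, seen):
        if labels[u] == sig[i]:       # compare label before recursing
            i += 1
            if i == len(sig):
                return True
        for v in adj[u]:
            if v not in seen and dfs(v, i, seen | {v}):
                return True
        return False
    return any(dfs(u, 0, {u}) for u in range(len(adj)))
```

The data-dependent, pointer-chasing recursion is what produces the poor cache behavior the abstract reports.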

Proceedings ArticleDOI
Alex Pothen1
15 Nov 2015
TL;DR: This paper discusses parallel algorithms for matching, which are difficult to design because many algorithms rely on searching for long paths in a graph, or implicitly communicate information along long paths, and thus have little concurrency.
Abstract: Computing a matching in a graph is one of "the hardest simple problems" in computer science. It is simple since most variants of matching can be solved in polynomial time, yet hard because the running times are high and the algorithms are complex. It is even more challenging to design parallel algorithms for matching, since many algorithms rely on searching for long paths in a graph, or implicitly communicate information along long paths, and thus have little concurrency.