
Showing papers presented at "Irregular Applications: Architectures and Algorithms in 2014"


Proceedings ArticleDOI
16 Nov 2014
TL;DR: This paper shows the first scalable GPU implementation for triangle counting using a new list intersection algorithm called Intersect Path (named after the Merge Path algorithm), which has two levels of parallelism.
Abstract: Triangle counting in a graph is a building block for the clustering coefficient, a widely used social network analytic for finding key players in a network based on their local connectivity. In this paper we show the first scalable GPU implementation for triangle counting. Our approach uses a new list intersection algorithm called Intersect Path (named after the Merge Path algorithm). This algorithm has two levels of parallelism: the first level partitions the vertices across the streaming multiprocessors on the GPU; the second level parallelizes the work across the GPU's streaming processors, utilizing different block sizes. For testing purposes, we used graphs taken from the DIMACS 10 Graph Challenge. Our experiments were conducted on NVIDIA's K40 GPU. Our GPU triangle counting implementation achieves speedups in the range of 9X -- 32X over a CPU sequential implementation.
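The Intersect Path algorithm itself is not reproduced in this listing. As a point of reference, the operation it parallelizes on the GPU is intersection of sorted adjacency lists; a minimal sequential Python sketch of triangle counting built on that intersection (not the authors' implementation) might look like:

```python
def count_triangles(adj):
    """Count triangles by intersecting sorted adjacency lists.

    adj: dict mapping each vertex to a sorted list of its neighbors
    (undirected graph). Each triangle (u, v, w) with u < v < w is
    counted exactly once.
    """
    count = 0
    for u, nbrs in adj.items():
        for v in nbrs:
            if v <= u:
                continue  # consider each edge once, with u < v
            # two-pointer intersection of adj[u] and adj[v]
            a, b = adj[u], adj[v]
            i = j = 0
            while i < len(a) and j < len(b):
                if a[i] == b[j]:
                    if a[i] > v:  # common neighbor w with w > v closes a triangle
                        count += 1
                    i += 1
                    j += 1
                elif a[i] < b[j]:
                    i += 1
                else:
                    j += 1
    return count
```

On the GPU, the paper assigns these per-edge intersections to streaming processors; here they simply run in a loop.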

73 citations


Proceedings ArticleDOI
16 Nov 2014
TL;DR: This paper describes a supernodal Cholesky factorization algorithm which permits improved utilization of the GPU when factoring sparse matrices and performance results for commonly studied matrices are presented.
Abstract: Sparse direct factorization is a fundamental tool in scientific computing. As the major component of a sparse direct solver, it represents the dominant computational cost for many analyses. While the substantial computational capability provided by GPUs (Graphics Processing Units) can help alleviate this cost, many aspects of sparse factorization and GPU computing, most particularly the prevalence of small/irregular dense math and slow PCIe communication, make it challenging to fully utilize this resource. In this paper we describe a supernodal Cholesky factorization algorithm which permits improved utilization of the GPU when factoring sparse matrices. The central idea is to stream branches of the elimination tree (subtrees which terminate in leaves) through the GPU and perform the factorization of each branch entirely on the GPU. This avoids the majority of the PCIe communication without the need for a complex task scheduler. Importantly, within these branches, many independent, small, dense operations are batched to minimize kernel launch overhead and several of these batched kernels are executed concurrently to maximize device utilization. Supernodes towards the root of the elimination tree (where a branch involving that supernode would exceed device memory) typically involve sufficient dense math such that PCIe communication can be effectively hidden, GPU utilization is high and hybrid computing can be easily leveraged. Performance results for commonly studied matrices are presented along with suggested actions for further optimizations.
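For context, the dense kernel at the heart of each supernode is an ordinary Cholesky decomposition. A minimal sequential Python sketch of that kernel (with none of the paper's batching, subtree streaming, or sparsity handling) is:

```python
import math

def cholesky(a):
    """Dense Cholesky factorization: A = L * L^T.

    a: symmetric positive-definite matrix as a list of lists.
    Returns the lower-triangular factor L.
    """
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # diagonal entry: subtract squares of already-computed row entries
        s = a[j][j] - sum(L[j][k] ** 2 for k in range(j))
        L[j][j] = math.sqrt(s)
        # entries below the diagonal in column j
        for i in range(j + 1, n):
            s = a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = s / L[j][j]
    return L
```

In a supernodal sparse solver, many small instances of this dense computation (plus the associated updates) arise independently across the elimination tree, which is what makes the paper's batching of kernels worthwhile.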

35 citations


Proceedings ArticleDOI
16 Nov 2014
TL;DR: A method of implementing SPARQL queries within the GraphLab framework is outlined, obtaining good scaling to the size of the system, 51 nodes.
Abstract: In this paper we explore the fusion of two largely disparate but related communities, that of Big Data and the Semantic Web. Due to the rise of large real-world graph datasets, a number of graph-centric parallel platforms have been proposed and developed. Many of these platforms, notable among them Pregel, Giraph, GraphLab, GraphChi, the Graph Processing System, and GraphX, present a programming interface that is vertex-centric, a variant of Valiant's Bulk Synchronous Parallel model. These platforms seek to address growing analytical needs for very large graph datasets arising from a variety of sources, such as social, biological, and computer networks. With this growing interest in large graphs, there has also been a concomitant rise in the Semantic Web, which describes data in terms of subject-predicate-object triples, or in other words edges of a graph where the predicate is a directed labeled edge between the two vertices, the subject and object. Despite the graph-oriented nature of Semantic Web data, and the advent of an increasingly large web of data, no one has explored the usage of these maturing graph platforms to analyze Semantic Web data. In this paper we outline a method of implementing SPARQL queries within the GraphLab framework, obtaining good scaling to the size of our system, 51 nodes.
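The paper's GraphLab implementation is not shown in this listing. As an illustration of the data model it builds on, here is a minimal Python sketch of matching a single SPARQL-style triple pattern against a set of subject-predicate-object edges (function and variable names are illustrative, not from the paper):

```python
def match_pattern(triples, pattern):
    """Match one SPARQL-style triple pattern against a triple store.

    triples: iterable of (subject, predicate, object) string tuples.
    pattern: a (s, p, o) tuple where terms starting with '?' are
    variables. Returns a list of variable-binding dicts, one per match.
    """
    results = []
    for triple in triples:
        binding = {}
        ok = True
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # variable: bind it, or check consistency with a prior binding
                if binding.get(term, value) != value:
                    ok = False
                    break
                binding[term] = value
            elif term != value:
                ok = False  # constant term must match exactly
                break
        if ok:
            results.append(binding)
    return results
```

A full SPARQL engine joins the bindings of several such patterns; in a vertex-centric framework like GraphLab, that join is expressed as message passing along the labeled edges.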

22 citations


Proceedings ArticleDOI
16 Nov 2014
TL;DR: This work proposes a distributed priority scheduler to (approximately) prioritize useful work without synchronization, and implements the single-source shortest paths algorithm using the scheduler and shows significant improvements over a previous implementation of SSSP using Δ-stepping.
Abstract: Massively parallel computers provide unprecedented computing power that is only expected to grow. With great power comes great responsibility---parallel overheads may dominate and must be minimized. The synchronization overhead in particular is deeply rooted in the programming practice because it makes algorithms easier to design and implement. In the effort to eliminate it, we introduce the idea of distributed control where global synchronization is reduced to termination detection and each worker optimistically proceeds ahead, based on the local knowledge of the global computation. We propose a distributed priority scheduler to (approximately) prioritize useful work without synchronization, and we implement the single-source shortest paths (SSSP) algorithm using the scheduler. We show significant improvements over a previous implementation of SSSP using Δ-stepping.
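The baseline the authors compare against, Δ-stepping, buckets vertices by tentative distance and relaxes cheap ("light") edges within a bucket before expensive ("heavy") ones. A sequential Python sketch of that baseline (assumed details; the paper's distributed, synchronization-free scheduler is not reproduced here):

```python
import collections
import math

def delta_stepping(graph, source, delta):
    """Single-source shortest paths via sequential Δ-stepping.

    graph: dict mapping vertex u to a list of (v, weight) edges.
    Returns a dict of tentative distances from source.
    """
    dist = collections.defaultdict(lambda: math.inf)
    dist[source] = 0.0
    buckets = {0: {source}}
    while buckets:
        i = min(buckets)
        frontier = buckets.pop(i)
        settled = set()
        # phase 1: repeatedly relax light edges (w <= delta) within bucket i
        while frontier:
            settled |= frontier
            requests = [(v, dist[u] + w)
                        for u in frontier
                        for v, w in graph.get(u, [])
                        if w <= delta]
            frontier = set()
            for v, d in requests:
                if d < dist[v]:
                    dist[v] = d
                    b = int(d // delta)
                    if b == i:
                        frontier.add(v)  # re-enters the current bucket
                    else:
                        buckets.setdefault(b, set()).add(v)
        # phase 2: relax heavy edges (w > delta) once per settled vertex
        for u in settled:
            for v, w in graph.get(u, []):
                if w > delta:
                    d = dist[u] + w
                    if d < dist[v]:
                        dist[v] = d
                        buckets.setdefault(int(d // delta), set()).add(v)
    return dict(dist)
```

The global bucket boundary is exactly the kind of synchronization point the paper's distributed-control approach replaces with optimistic, locally ordered work.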

11 citations


Proceedings ArticleDOI
16 Nov 2014
TL;DR: This work presents an extremely fast and scalable algorithm for identifying the high closeness centrality vertices, using group testing, and shows that this approach is significantly faster (best-case over 50 times, worst- case over 7 times) than the currently used methods.
Abstract: The significance of an entity in a network is generally given by the centrality value of its vertex. For most analysis purposes, only the high ranked vertices are required. However, most algorithms calculate the centrality values of all the vertices. We present an extremely fast and scalable algorithm for identifying the high closeness centrality vertices, using group testing. We show that our approach is significantly faster (best-case over 50 times, worst-case over 7 times) than the currently used methods. We can also use group testing to identify networks that are sensitive to edge perturbation.
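The group-testing construction is in the paper itself; for context, the exhaustive baseline it accelerates computes closeness centrality for every vertex via one BFS each. A minimal Python sketch of that baseline:

```python
import collections

def closeness(adj, v):
    """Closeness centrality of v in an unweighted, connected graph:
    (n - 1) divided by the sum of shortest-path distances from v."""
    dist = {v: 0}
    queue = collections.deque([v])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

def top_k_closeness(adj, k):
    """Exhaustive ranking: one BFS per vertex, then sort.
    This O(n * (n + m)) cost is what the paper's group testing avoids."""
    scores = sorted(((closeness(adj, v), v) for v in adj), reverse=True)
    return [v for _, v in scores[:k]]
```

Group testing instead probes carefully chosen vertex subsets so that only the high-centrality vertices need individual evaluation.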

7 citations


Proceedings ArticleDOI
16 Nov 2014
TL;DR: ARCHER, an application for computing radiation dosimetry for CT imaging involving whole-body patient phantoms, has been extended to execute on any combination of CPUs, GPUs and MICs concurrently, creating a new level of heterogeneous concurrent execution of Monte Carlo photon transport.
Abstract: In this paper, a new level of heterogeneous concurrent execution of Monte Carlo photon transport is presented. ARCHER, an application for computing radiation dosimetry for CT imaging involving whole-body patient phantoms, has been extended to execute on any combination of CPUs, GPUs and MICs concurrently. The goal is for ARCHER to detect and simultaneously utilize all CPU, GPU and MIC processing devices available. Due to the irregular nature of the Monte Carlo photon transport algorithm, a new "self service" approach to organizing the heterogeneous device computing has been implemented. This approach efficiently and effectively allows each device to repeatedly grab portions of the domain and compute concurrently until the entire domain has been simulated. New timing benchmarks using various combinations of Intel and NVIDIA devices are presented. A speedup of 13x has been observed when utilizing Intel's Xeon X5650 CPU, Intel's Xeon Phi 5110P MIC and NVIDIA's K40 GPU concurrently versus just the Intel Xeon X5650.
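The "self service" scheme can be pictured as each device atomically grabbing the next unclaimed chunk of the particle domain until it is exhausted. A hypothetical Python sketch with threads standing in for devices (device names, chunk size, and the `simulate` placeholder are illustrative, not from the paper):

```python
import threading

def self_service(num_particles, device_names, chunk=1000):
    """Each 'device' thread repeatedly claims the next chunk of particle
    indices and processes it, until the whole domain is simulated.
    Returns how many particles each device handled."""
    next_index = 0
    lock = threading.Lock()
    done = {name: 0 for name in device_names}

    def worker(name):
        nonlocal next_index
        while True:
            with lock:  # stands in for an atomic fetch-and-add
                start = next_index
                next_index += chunk
            if start >= num_particles:
                return  # domain exhausted
            end = min(start + chunk, num_particles)
            # simulate(start, end) would run the transport kernel here
            done[name] += end - start

    threads = [threading.Thread(target=worker, args=(n,)) for n in device_names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done
```

Because each device pulls work at its own pace, a fast GPU naturally ends up with more chunks than a slower CPU, without any central scheduler assigning shares up front.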

6 citations


Proceedings ArticleDOI
16 Nov 2014
TL;DR: This work considers the relationship between merging, branch predictors, and input data dependency, and explains the resulting loss of performance using a visualization technique called Merge Path.
Abstract: Merging is a building block for many computational domains. In this work we consider the relationship between merging, branch predictors, and input data dependency. Branch predictors are ubiquitous in modern processors as they are useful for many high performance computing applications. While it is well known that performance and branch prediction accuracy go hand-in-hand, these have not been studied in the context of merging. We thoroughly test merging using multiple input array sizes and values, using the same code and compiler optimizations. As the number of possible keys increases, so does the number of branch mis-predictions, resulting in reduced performance. The reduction in performance can be as much as 5X. We explain this phenomenon using a visualization technique called Merge Path that makes the effect intuitive. We support this visualization approach with modeling, thorough testing, and analysis on multiple systems.
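The data-dependent branch the paper studies is the per-element comparison inside a merge. A Python sketch contrasting the branchy form with a branch-free selection (illustrative only; the paper's measurements concern compiled code and hardware predictors, which Python does not expose):

```python
def merge_branchy(a, b):
    """Standard merge of two sorted lists: one data-dependent branch
    per output element. Hard-to-predict inputs hurt this form."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:  # branch outcome depends on the input values
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    return out + a[i:] + b[j:]

def merge_branchless(a, b):
    """Branch-free selection: the comparison result is used
    arithmetically, mimicking the conditional-move code a compiler
    might emit to avoid the unpredictable branch."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        take_a = int(a[i] <= b[j])       # 0 or 1
        out.append((b[j], a[i])[take_a])  # select without an if
        i += take_a
        j += 1 - take_a
    return out + a[i:] + b[j:]
```

With few distinct keys, long runs come from one input and the branch in the first form is easy to predict; with many distinct keys the take-from-a/take-from-b decision approaches a coin flip, which is the regime where the paper observes up to a 5X slowdown.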

5 citations


Proceedings ArticleDOI
16 Nov 2014
TL;DR: This paper sets out to characterize workload disparity and to explore data-driven models for executing irregular computations on GPU architectures.
Abstract: General-purpose computing on graphics processing unit (GPGPU) architectures relies on data locality and regular computation to leverage parallel resources and achieve performance benefits over multi-core systems. Current GPUs often cannot effectively accommodate irregular algorithms and non-uniform memory accesses. Irregular execution corrupts the efficiency of the GPU's rigid groups of threads, leading to idle execution and workload disparity in the GPU architecture. While hardware aspects of branch divergence, local memory bank conflicts, and non-coalesced memory accesses cause workload disparity, the greater issue is the loss of execution potential from other GPU workgroups. This paper sets out to characterize the workload disparity and to explore data-driven models for executing irregular computations on GPU architectures.

2 citations