scispace - formally typeset
Search or ask a question
Journal ArticleDOI

Falcon: A Graph Manipulation Language for Heterogeneous Systems

TL;DR: A domain-specific language (DSL) is proposed, Falcon, for implementing graph algorithms that abstracts the hardware, provides constructs to write explicitly parallel programs at a higher level, and can work with general algorithms that may change the graph structure.
Abstract: Graph algorithms have been shown to possess enough parallelism to keep several computing resources busy—even hundreds of cores on a GPU. Unfortunately, tuning their implementation for efficient execution on a particular hardware configuration of heterogeneous systems consisting of multicore CPUs and GPUs is challenging, time consuming, and error prone. To address these issues, we propose a domain-specific language (DSL), Falcon, for implementing graph algorithms that (i) abstracts the hardware, (ii) provides constructs to write explicitly parallel programs at a higher level, and (iii) can work with general algorithms that may change the graph structure (morph algorithms). We illustrate the usage of our DSL to implement local computation algorithms (that do not change the graph structure) and morph algorithms such as Delaunay mesh refinement, survey propagation, and dynamic SSSP on GPU and multicore CPUs. Using a set of benchmark graphs, we illustrate that the generated code performs close to the state-of-the-art hand-tuned implementations.
Citations
More filters
Proceedings ArticleDOI
11 Jun 2018
TL;DR: This paper introduces a new approach to building distributed-memory graph analytics systems that exploits heterogeneity in processor types (CPU and GPU), partitioning policies, and programming models, and Gluon, a communication-optimizing substrate that enables these programs to run on heterogeneous clusters and optimizes communication in a novel way.
Abstract: This paper introduces a new approach to building distributed-memory graph analytics systems that exploits heterogeneity in processor types (CPU and GPU), partitioning policies, and programming models. The key to this approach is Gluon, a communication-optimizing substrate. Programmers write applications in a shared-memory programming system of their choice and interface these applications with Gluon using a lightweight API. Gluon enables these programs to run on heterogeneous clusters and optimizes communication in a novel way by exploiting structural and temporal invariants of graph partitioning policies. To demonstrate Gluon’s ability to support different programming models, we interfaced Gluon with the Galois and Ligra shared-memory graph analytics systems to produce distributed-memory versions of these systems named D-Galois and D-Ligra, respectively. To demonstrate Gluon’s ability to support heterogeneous processors, we interfaced Gluon with IrGL, a state-of-the-art single-GPU system for graph analytics, to produce D-IrGL, the first multi-GPU distributed-memory graph analytics system. Our experiments were done on CPU clusters with up to 256 hosts and roughly 70,000 threads and on multi-GPU clusters with up to 64 GPUs. The communication optimizations in Gluon improve end-to-end application execution time by ∼2.6× on the average. D-Galois and D-IrGL scale well and are faster than Gemini, the state-of-the-art distributed CPU graph analytics system, by factors of ∼3.9× and ∼4.9×, respectively, on the average.

125 citations

Proceedings ArticleDOI
19 Oct 2016
TL;DR: This paper argues that three optimizations called throughput optimizations are key to high-performance for this application class and has implemented these optimizations in a compiler that produces CUDA code from an intermediate-level program representation called IrGL.
Abstract: Writing high-performance GPU implementations of graph algorithms can be challenging. In this paper, we argue that three optimizations called throughput optimizations are key to high-performance for this application class. These optimizations describe a large implementation space making it unrealistic for programmers to implement them by hand. To address this problem, we have implemented these optimizations in a compiler that produces CUDA code from an intermediate-level program representation called IrGL. Compared to state-of-the-art handwritten CUDA implementations of eight graph applications, code generated by the IrGL compiler is up to 5.95x times faster (median 1.4x) for five applications and never more than 30% slower for the others. Throughput optimizations contribute an improvement up to 4.16x (median 1.4x) to the performance of unoptimized IrGL code.

82 citations


Cites background from "Falcon: A Graph Manipulation Langua..."

  • ...Since our compiler has full discretion on scheduling the iterations of ForAll loops, some low-level features of CUDA such as shared memory are not supported in their full generality and may not be used when writing operator code....

    [...]

  • ...Categories and Subject Descriptors D.3.3 [Programming Languages]: Language Constructs and Features; D.3.4 [Programming Languages]: Processors Keywords Graph applications, amorphous data-parallelism, GPUs, compilers, optimization, throughput...

    [...]

Journal ArticleDOI
01 Apr 2020
TL;DR: Pangolin this paper is an efficient and flexible in-memory graph pattern mining (GPM) framework targeting shared-memory CPUs and GPUs that provides high-level abstractions for GPU processing.
Abstract: There is growing interest in graph pattern mining (GPM) problems such as motif counting. GPM systems have been developed to provide unified interfaces for programming algorithms for these problems and for running them on parallel systems. However, existing systems may take hours to mine even simple patterns in moderate-sized graphs, which significantly limits their real-world usability.We present Pangolin, an efficient and flexible in-memory GPM framework targeting shared-memory CPUs and GPUs. Pangolin is the first GPM system that provides high-level abstractions for GPU processing. It provides a simple programming interface based on the extend-reduce-filter model, which allows users to specify application specific knowledge for search space pruning and isomorphism test elimination. We describe novel optimizations that exploit locality, reduce memory consumption, and mitigate the overheads of dynamic memory allocation and synchronization.Evaluation on a 28-core CPU demonstrates that Pangolin outperforms existing GPM frameworks Arabesque, RStream, and Fractal by 49×, 88×, and 80× on average, respectively. Acceleration on a V100 GPU further improves performance of Pangolin by 15× on average. Compared to state-of-the-art hand-optimized GPM applications, Pangolin provides competitive performance with less programming effort.

44 citations

Proceedings ArticleDOI
01 Sep 2017
TL;DR: This paper develops an approach to graph processing on GPUs that seeks to overcome some of the performance limitations of existing frameworks, and uses multiple data representation and execution strategies for dense versus sparse vertex frontiers, dependent on the fraction of active graph vertices.
Abstract: High-level GPU graph processing frameworks are an attractive alternative for achieving both high productivity and high performance. Hence, several high-level frameworks for graph processing on GPUs have been developed. In this paper, we develop an approach to graph processing on GPUs that seeks to overcome some of the performance limitations of existing frameworks. It uses multiple data representation and execution strategies for dense versus sparse vertex frontiers, dependent on the fraction of active graph vertices. A two-phase edge processing approach trades off extra data movement for improved load balancing across GPU threads, by using a 2D blocked representation for edge data. Experimental results demonstrate performance improvement over current state-of-the-art GPU graph processing frameworks for many benchmark programs and data sets.

30 citations


Cites methods from "Falcon: A Graph Manipulation Langua..."

  • ...Several such GPU graph processing frameworks have been recently developed, including VWC [9], MapGraph [7], Medusa [22], CuSha [11], WS [10], Frog [19], GreenMarl [8], Falcon [4], Groute [2], and Gunrock [21]....

    [...]

Journal ArticleDOI
TL;DR: This work cast the X3DH protocol as a specific type of authenticated key exchange (AKE) protocol, which it is called a Signal-conforming AKE protocol, and formally defines its security model based on the vast prior works on AKE protocols, which results in the first post-quantum secure replacement of the X 3DH protocol on well-established assumptions.

18 citations

References
More filters
01 Jan 2009
TL;DR: This paper presents fast implementations of common graph operations like breadth-first search, st-connectivity, single-source shortest path, all-pairs shortest Path, minimum spanning tree, and maximum flow for undirected graphs on the GPU using the CUDA programming model.
Abstract: The Graphics Processing Units (GPUs) provide high computation power at a low cost and is an important compute accelerator with a massively multithreaded architecture. In this paper, we present fast implementations of common graph operations like breadth-first search, st-connectivity, single-source shortest path, all-pairs shortest path, minimum spanning tree, and maximum flow for undirected graphs on the GPU using the CUDA programming model. Our implementations exhibit high performance, especially on large graphs. We use two data-parallel programming methodologies for these algorithms. One is an iterative, mask-based approach that processes valid data elements like vertices and edges using independent threads for each. The other is a divide-and-conquer approach that reduces the problem into smaller problems that are handled later using the same approach. Parallel algorithms for such problems have been reported in the literature before, especially on supercomputers. The massively multithreaded model of the GPU makes it possible to adopt the data-parallel approach even to irregular algorithms like graph algorithms, using O(V ) or O(E) simultaneous threads. The algorithms and the underlying techniques presented in this paper are likely to be applicable to many irregular algorithms. We show the impact of our implementations on random, scalefree, and real-life graphs of up to millions of vertices on highend and low-end GPUs. The availability and spread of GPUs to desktops and laptops make them ideal candidates to accelerate graph operations over the CPU-only implementations. Practical implementations of basic operations go a long way in realizing their potential.

94 citations


"Falcon: A Graph Manipulation Langua..." refers methods in this paper

  • ...Efficient implementations of local computation algorithms such as breadth-first search (BFS) and SSSP were reported several years ago [Harish and Narayanan 2007; Harish et al. 2009]....

    [...]

Proceedings ArticleDOI
23 Feb 2013
TL;DR: This work proposes efficient techniques to perform concurrent subgraph addition, subgraph deletion, conflict detection and several optimizations to improve the scalability of morph algorithms and provides several insights into how other morph algorithms can be efficiently implemented on GPUs.
Abstract: There is growing interest in using GPUs to accelerate graph algorithms such as breadth-first search, computing page-ranks, and finding shortest paths. However, these algorithms do not modify the graph structure, so their implementation is relatively easy compared to general graph algorithms like mesh generation and refinement, which morph the underlying graph in non-trivial ways by adding and removing nodes and edges. We know relatively little about how to implement morph algorithms efficiently on GPUs.In this paper, we present and study four morph algorithms: (i) a computational geometry algorithm called Delaunay Mesh Refinement (DMR), (ii) an approximate SAT solver called Survey Propagation (SP), (iii) a compiler analysis called Points-To Analysis (PTA), and (iv) Boruvka's Minimum Spanning Tree algorithm (MST). Each of these algorithms modifies the graph data structure in different ways and thus poses interesting challenges.We overcome these challenges using algorithmic and GPU-specific optimizations. We propose efficient techniques to perform concurrent subgraph addition, subgraph deletion, conflict detection and several optimizations to improve the scalability of morph algorithms. For an input mesh with 10 million triangles, our DMR code achieves an 80x speedup over the highly optimized serial Triangle program and a 2.3x speedup over a multicore implementation running with 48 threads. Our SP code is 3x faster than a multicore implementation with 48 threads on an input with 1 million literals. The PTA implementation is able to analyze six SPEC 2000 benchmark programs in just 74 milliseconds, achieving a geometric mean speedup of 9.3x over a 48-thread multicore version. Our MST code is slower than a multicore version with 48 threads for sparse graphs but significantly faster for denser graphs.This work provides several insights into how other morph algorithms can be efficiently implemented on GPUs.

90 citations


"Falcon: A Graph Manipulation Langua..." refers background or methods in this paper

  • ...Handwritten codes of LonestarGPU [Nasre et al. 2013b] for GPUs and Galois [Pingali et al....

    [...]

  • ...Handwritten codes of LonestarGPU [Nasre et al. 2013b] for GPUs and Galois [Pingali et al. 2011] for multicore CPUs, both of which support morph algorithms, are very complex....

    [...]

Journal ArticleDOI
TL;DR: GPGPUs have recently emerged as powerful vehicles for general-purpose high-performance computing, although a new Compute Unified Device Architecture (CUDA) programming model from NVIDIA offers advantages over previous models.
Abstract: GPGPUs have recently emerged as powerful vehicles for general-purpose high-performance computing. Although a new Compute Unified Device Architecture (CUDA) programming model from NVIDIA offers impr...

85 citations

Book ChapterDOI
24 Aug 1998
TL;DR: The Δ-stepping algorithm, a generalization of Dial's algorithm and the Bellman-Ford algorithm, improves the situation at least in the following "average-case" sense: for random directed graphs with edge probability d/n and uniformly distributed edge weights a PRAM version works in expected time O(log3 n/ log log n) using linear work.
Abstract: In spite of intensive research, little progress has been made towards fast and work-efficient parallel algorithms for the single source shortest path problem. Our Δ-stepping algorithm, a generalization of Dial's algorithm and the Bellman-Ford algorithm, improves this situation at least in the following "average-case" sense: For random directed graphs with edge probability d/n and uniformly distributed edge weights a PRAM version works in expected time O(log3 n/ log log n) using linear work. The algorithm also allows for efficient adaptation to distributed memory machines. Implementations show that our approach works on real machines. As a side effect, we get a simple linear time sequential algorithm for a large class of not necessarily random directed graphs with random edge weights.

85 citations

Journal ArticleDOI
09 Jun 2012
TL;DR: This paper builds on two related insights to provide a unified lightweight mechanism for supporting exceptions and speculation on GPUs, and demonstrates how iGPU exception support enables virtual memory paging with very low overhead and circuit-speculation techniques that can provide over 25% reduction in energy.
Abstract: Since the introduction of fully programmable vertex shader hardware, GPU computing has made tremendous advances. Exception support and speculative execution are the next steps to expand the scope and improve the usability of GPUs. However, traditional mechanisms to support exceptions and speculative execution are highly intrusive to GPU hardware design. This paper builds on two related insights to provide a unified lightweight mechanism for supporting exceptions and speculation on GPUs. First, we observe that GPU programs can be broken into code regions that contain little or no live register state at their entry point. We then also recognize that it is simple to generate these regions in such a way that they are idempotent, allowing their entry points to function as program recovery points and enabling support for exception handling, fast context switches, and speculation, all with very low overhead. We call the architecture of GPUs executing these idempotent regions the iGPU architecture. The hardware extensions required are minimal and the construction of idempotent code regions is fully transparent under the typical dynamic compilation framework of GPUs. We demonstrate how iGPU exception support enables virtual memory paging with very low overhead (1% to 4%), and how speculation support enables circuit-speculation techniques that can provide over 25% reduction in energy.

84 citations


"Falcon: A Graph Manipulation Langua..." refers methods in this paper

  • ...The iGPU [Menon et al. 2012] architecture proposes a method for breaking a GPU function execution into many idempotent regions so that in between two continuous regions, there is very little live state, and this fact can be used for speculative execution....

    [...]