Journal ArticleDOI

Falcon: A Graph Manipulation Language for Heterogeneous Systems

TL;DR: A domain-specific language (DSL), Falcon, is proposed for implementing graph algorithms; it abstracts the hardware, provides constructs to write explicitly parallel programs at a higher level, and can work with general algorithms that may change the graph structure (morph algorithms).
Abstract: Graph algorithms have been shown to possess enough parallelism to keep several computing resources busy—even hundreds of cores on a GPU. Unfortunately, tuning their implementation for efficient execution on a particular hardware configuration of heterogeneous systems consisting of multicore CPUs and GPUs is challenging, time consuming, and error prone. To address these issues, we propose a domain-specific language (DSL), Falcon, for implementing graph algorithms that (i) abstracts the hardware, (ii) provides constructs to write explicitly parallel programs at a higher level, and (iii) can work with general algorithms that may change the graph structure (morph algorithms). We illustrate the usage of our DSL to implement local computation algorithms (that do not change the graph structure) and morph algorithms such as Delaunay mesh refinement, survey propagation, and dynamic SSSP on GPU and multicore CPUs. Using a set of benchmark graphs, we illustrate that the generated code performs close to the state-of-the-art hand-tuned implementations.
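As a rough illustration of the kind of explicitly parallel operator such a DSL targets, the sketch below shows a Bellman-Ford-style edge-relaxation step for SSSP in plain C++/OpenMP over a CSR graph. This is not Falcon syntax; the struct and function names are ours, and a DSL compiler would decide how to map the loop onto GPU threads or CPU cores.

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

// Minimal CSR graph: row_offsets[v]..row_offsets[v+1] indexes into col_indices/weights.
struct CSRGraph {
    std::vector<uint32_t> row_offsets;
    std::vector<uint32_t> col_indices;
    std::vector<uint32_t> weights;
    uint32_t num_vertices() const { return row_offsets.size() - 1; }
};

// One round of edge relaxation (the per-edge "operator" a graph DSL would parallelize).
// Returns true if any distance changed, so the caller can iterate to a fixed point.
bool relax_all_edges(const CSRGraph& g, std::vector<std::atomic<uint32_t>>& dist) {
    std::atomic<bool> changed{false};
    #pragma omp parallel for schedule(dynamic, 1024)
    for (int64_t u = 0; u < static_cast<int64_t>(g.num_vertices()); ++u) {
        uint32_t du = dist[u].load(std::memory_order_relaxed);
        if (du == UINT32_MAX) continue;                       // vertex not reached yet
        for (uint32_t e = g.row_offsets[u]; e < g.row_offsets[u + 1]; ++e) {
            uint32_t v = g.col_indices[e];
            uint32_t cand = du + g.weights[e];
            uint32_t old = dist[v].load(std::memory_order_relaxed);
            // Atomic "min" update, analogous to a DSL-provided MIN reduction.
            while (cand < old &&
                   !dist[v].compare_exchange_weak(old, cand, std::memory_order_relaxed)) {}
            if (cand < old) changed.store(true, std::memory_order_relaxed);
        }
    }
    return changed.load();
}
```

Iterating relax_all_edges until it returns false computes single-source shortest paths; the point of a DSL is that the programmer writes only the operator and the reduction, not the mapping to hardware.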
Citations
Proceedings ArticleDOI
11 Jun 2018
TL;DR: This paper introduces a new approach to building distributed-memory graph analytics systems that exploits heterogeneity in processor types (CPU and GPU), partitioning policies, and programming models; its key component is Gluon, a communication-optimizing substrate that enables shared-memory graph programs to run on heterogeneous clusters and optimizes communication in a novel way by exploiting invariants of graph partitioning policies.
Abstract: This paper introduces a new approach to building distributed-memory graph analytics systems that exploits heterogeneity in processor types (CPU and GPU), partitioning policies, and programming models. The key to this approach is Gluon, a communication-optimizing substrate. Programmers write applications in a shared-memory programming system of their choice and interface these applications with Gluon using a lightweight API. Gluon enables these programs to run on heterogeneous clusters and optimizes communication in a novel way by exploiting structural and temporal invariants of graph partitioning policies. To demonstrate Gluon’s ability to support different programming models, we interfaced Gluon with the Galois and Ligra shared-memory graph analytics systems to produce distributed-memory versions of these systems named D-Galois and D-Ligra, respectively. To demonstrate Gluon’s ability to support heterogeneous processors, we interfaced Gluon with IrGL, a state-of-the-art single-GPU system for graph analytics, to produce D-IrGL, the first multi-GPU distributed-memory graph analytics system. Our experiments were done on CPU clusters with up to 256 hosts and roughly 70,000 threads and on multi-GPU clusters with up to 64 GPUs. The communication optimizations in Gluon improve end-to-end application execution time by ∼2.6× on the average. D-Galois and D-IrGL scale well and are faster than Gemini, the state-of-the-art distributed CPU graph analytics system, by factors of ∼3.9× and ∼4.9×, respectively, on the average.
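The sketch below is a purely conceptual illustration of the reduce-then-broadcast synchronization pattern that partitioned graph systems of this kind rely on: each host owns "master" copies of some vertices and proxy copies of others, and after a local compute round the copies are reconciled. Hosts are simulated as in-memory partitions here; the types and the MIN reduction are our assumptions, not Gluon's API.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// A host's local slice of the vertex data; global_ids maps local slots to global ids.
struct HostPartition {
    std::vector<uint32_t> global_ids;
    std::vector<uint32_t> value;      // e.g., the current shortest-path distance
};

// Reduce phase: fold every host's local copy of vertex g into master[g] with MIN.
void reduce_to_masters(const std::vector<HostPartition>& hosts,
                       std::vector<uint32_t>& master) {
    for (const auto& h : hosts)
        for (size_t i = 0; i < h.global_ids.size(); ++i)
            master[h.global_ids[i]] = std::min(master[h.global_ids[i]], h.value[i]);
}

// Broadcast phase: push the reduced canonical values back to every local copy,
// so the next round of local computation starts from a consistent view.
void broadcast_to_mirrors(const std::vector<uint32_t>& master,
                          std::vector<HostPartition>& hosts) {
    for (auto& h : hosts)
        for (size_t i = 0; i < h.global_ids.size(); ++i)
            h.value[i] = master[h.global_ids[i]];
}
```

In a real distributed system the two phases become network communication, and the optimizations described above reduce how much of it is needed.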

125 citations

Proceedings ArticleDOI
19 Oct 2016
TL;DR: This paper argues that three optimizations called throughput optimizations are key to high performance for this application class, and implements them in a compiler that produces CUDA code from an intermediate-level program representation called IrGL.
Abstract: Writing high-performance GPU implementations of graph algorithms can be challenging. In this paper, we argue that three optimizations called throughput optimizations are key to high performance for this application class. These optimizations describe a large implementation space, making it unrealistic for programmers to implement them by hand. To address this problem, we have implemented these optimizations in a compiler that produces CUDA code from an intermediate-level program representation called IrGL. Compared to state-of-the-art handwritten CUDA implementations of eight graph applications, code generated by the IrGL compiler is up to 5.95x faster (median 1.4x) for five applications and never more than 30% slower for the others. Throughput optimizations contribute an improvement of up to 4.16x (median 1.4x) to the performance of unoptimized IrGL code.
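The abstract does not name the three optimizations, so the sketch below shows only one generic throughput-oriented pattern that GPU graph frameworks commonly rely on: driving a ForAll loop from a worklist of active vertices instead of re-scanning the whole graph every round. It is sequential C++ over the CSRGraph from the first sketch and is our illustration, not the compiler's output.

```cpp
#include <cstdint>
#include <vector>

// Worklist-driven SSSP: only vertices whose tentative distance changed are pushed
// and re-examined in the next round, which avoids wasted work on inactive vertices.
void sssp_worklist(const CSRGraph& g, std::vector<uint32_t>& dist, uint32_t source) {
    dist.assign(g.num_vertices(), UINT32_MAX);
    dist[source] = 0;
    std::vector<uint32_t> work{source}, next;
    while (!work.empty()) {
        next.clear();
        for (uint32_t u : work)                  // the "ForAll" over active vertices
            for (uint32_t e = g.row_offsets[u]; e < g.row_offsets[u + 1]; ++e) {
                uint32_t v = g.col_indices[e];
                uint32_t cand = dist[u] + g.weights[e];
                if (cand < dist[v]) { dist[v] = cand; next.push_back(v); }
            }
        work.swap(next);
    }
}
```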

82 citations


Cites background from "Falcon: A Graph Manipulation Langua..."

  • ...Since our compiler has full discretion on scheduling the iterations of ForAll loops, some low-level features of CUDA such as shared memory are not supported in their full generality and may not be used when writing operator code....


  • ...Categories and Subject Descriptors D.3.3 [Programming Languages]: Language Constructs and Features; D.3.4 [Programming Languages]: Processors Keywords Graph applications, amorphous data-parallelism, GPUs, compilers, optimization, throughput...


Journal ArticleDOI
01 Apr 2020
TL;DR: Pangolin is an efficient and flexible in-memory graph pattern mining (GPM) framework targeting shared-memory CPUs and GPUs, and the first GPM system to provide high-level abstractions for GPU processing.
Abstract: There is growing interest in graph pattern mining (GPM) problems such as motif counting. GPM systems have been developed to provide unified interfaces for programming algorithms for these problems and for running them on parallel systems. However, existing systems may take hours to mine even simple patterns in moderate-sized graphs, which significantly limits their real-world usability. We present Pangolin, an efficient and flexible in-memory GPM framework targeting shared-memory CPUs and GPUs. Pangolin is the first GPM system that provides high-level abstractions for GPU processing. It provides a simple programming interface based on the extend-reduce-filter model, which allows users to specify application-specific knowledge for search space pruning and isomorphism test elimination. We describe novel optimizations that exploit locality, reduce memory consumption, and mitigate the overheads of dynamic memory allocation and synchronization. Evaluation on a 28-core CPU demonstrates that Pangolin outperforms existing GPM frameworks Arabesque, RStream, and Fractal by 49×, 88×, and 80× on average, respectively. Acceleration on a V100 GPU further improves performance of Pangolin by 15× on average. Compared to state-of-the-art hand-optimized GPM applications, Pangolin provides competitive performance with less programming effort.
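To make the extend-reduce-filter idea concrete, the sketch below counts triangles by extending vertex embeddings one vertex at a time and filtering out non-canonical or non-closing extensions. It is our illustration in C++/OpenMP over the CSRGraph from the first sketch (sorted adjacency lists assumed), not Pangolin's API.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Membership test assuming each adjacency list in col_indices is sorted.
bool has_edge(const CSRGraph& g, uint32_t u, uint32_t v) {
    auto first = g.col_indices.begin() + g.row_offsets[u];
    auto last  = g.col_indices.begin() + g.row_offsets[u + 1];
    return std::binary_search(first, last, v);
}

// Triangle counting in an extend -> filter -> reduce style: embeddings {v0} are
// extended to {v0, v1} and then {v0, v1, v2}; the ordering filters (v0 < v1 < v2)
// remove duplicate embeddings, and the edge filter keeps only closed triangles.
uint64_t count_triangles(const CSRGraph& g) {
    uint64_t total = 0;
    #pragma omp parallel for reduction(+:total) schedule(dynamic, 64)
    for (int64_t v0 = 0; v0 < (int64_t)g.num_vertices(); ++v0) {
        for (uint32_t e1 = g.row_offsets[v0]; e1 < g.row_offsets[v0 + 1]; ++e1) {
            uint32_t v1 = g.col_indices[e1];
            if (v1 <= v0) continue;                       // filter: canonical order
            for (uint32_t e2 = g.row_offsets[v0]; e2 < g.row_offsets[v0 + 1]; ++e2) {
                uint32_t v2 = g.col_indices[e2];
                if (v2 <= v1) continue;                   // filter: canonical order
                if (has_edge(g, v1, v2)) ++total;         // reduce: count survivors
            }
        }
    }
    return total;
}
```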

44 citations

Proceedings ArticleDOI
01 Sep 2017
TL;DR: This paper develops an approach to graph processing on GPUs that seeks to overcome some of the performance limitations of existing frameworks, using multiple data representation and execution strategies for dense versus sparse vertex frontiers, depending on the fraction of active graph vertices.
Abstract: High-level GPU graph processing frameworks are an attractive alternative for achieving both high productivity and high performance. Hence, several high-level frameworks for graph processing on GPUs have been developed. In this paper, we develop an approach to graph processing on GPUs that seeks to overcome some of the performance limitations of existing frameworks. It uses multiple data representation and execution strategies for dense versus sparse vertex frontiers, depending on the fraction of active graph vertices. A two-phase edge processing approach trades off extra data movement for improved load balancing across GPU threads, by using a 2D blocked representation for edge data. Experimental results demonstrate performance improvement over current state-of-the-art GPU graph processing frameworks for many benchmark programs and data sets.
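The sketch below illustrates the dense-versus-sparse frontier idea for one BFS level: a sparse list of active vertices (push) when few vertices are active, a dense scan over unvisited vertices (pull) otherwise. It is sequential C++ over the CSRGraph from the first sketch; the 5% threshold and the push/pull split are our illustrative choices, and the 2D-blocked edge layout is not shown.

```cpp
#include <cstdint>
#include <vector>

// One BFS level. The caller initializes level[] to -1 except level[source] = 0,
// starts with frontier = {source} and depth = 0, and loops until the returned
// frontier is empty. A symmetric (undirected) CSR is assumed for the pull phase.
std::vector<uint32_t> bfs_level(const CSRGraph& g,
                                const std::vector<uint32_t>& frontier,
                                std::vector<int>& level, int depth) {
    std::vector<uint32_t> next;
    const double active_fraction = double(frontier.size()) / g.num_vertices();
    if (active_fraction < 0.05) {
        // Sparse (push) phase: iterate only over the active vertices' edges.
        for (uint32_t u : frontier)
            for (uint32_t e = g.row_offsets[u]; e < g.row_offsets[u + 1]; ++e) {
                uint32_t v = g.col_indices[e];
                if (level[v] < 0) { level[v] = depth + 1; next.push_back(v); }
            }
    } else {
        // Dense (pull) phase: every unvisited vertex scans its neighbors for a parent.
        std::vector<uint8_t> in_frontier(g.num_vertices(), 0);
        for (uint32_t u : frontier) in_frontier[u] = 1;
        for (uint32_t v = 0; v < g.num_vertices(); ++v) {
            if (level[v] >= 0) continue;
            for (uint32_t e = g.row_offsets[v]; e < g.row_offsets[v + 1]; ++e)
                if (in_frontier[g.col_indices[e]]) {
                    level[v] = depth + 1; next.push_back(v); break;
                }
        }
    }
    return next;
}
```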

30 citations


Cites methods from "Falcon: A Graph Manipulation Langua..."

  • ...Several such GPU graph processing frameworks have been recently developed, including VWC [9], MapGraph [7], Medusa [22], CuSha [11], WS [10], Frog [19], GreenMarl [8], Falcon [4], Groute [2], and Gunrock [21]....


References
Journal ArticleDOI
TL;DR: This paper performs an experimental analysis of three different algorithms: Dijkstra's algorithm and the two output-bounded algorithms proposed by Ramalingam and Reps in [30] and by Frigioni, Marchetti-Spaccamela, and Nanni in [18], respectively.
Abstract: In this paper we propose the first experimental study of the fully dynamic single-source shortest-paths problem on directed graphs with positive real edge weights. In particular, we perform an experimental analysis of three different algorithms: Dijkstra's algorithm, and the two output-bounded algorithms proposed by Ramalingam and Reps in [30] and by Frigioni, Marchetti-Spaccamela and Nanni in [18], respectively. The main goal of this paper is to provide first experimental evidence for: (a) the effectiveness of dynamic algorithms for shortest paths with respect to a traditional static approach to this problem; (b) the validity of the theoretical model of output boundedness to analyze dynamic graph algorithms. Besides randomly generated graphs, useful to capture the "asymptotic" behavior of the algorithms, we also ran experiments on a widely used graph from the real world, i.e., the Internet graph.
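For intuition, the sketch below shows the incremental (insertion-only) case: after one edge insertion, the distance array is repaired by propagating improvements with a heap. This is a textbook-style repair in C++ over the CSRGraph from the first sketch, not the Ramalingam-Reps or Frigioni et al. algorithm; the graph is assumed to already contain its other edges, and the deletion case, which makes the problem fully dynamic, is not handled.

```cpp
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Incremental SSSP: after inserting edge (u, v, w), repair dist[] by propagating
// improvements from v with a Dijkstra-like heap over the existing graph.
void sssp_insert_edge(const CSRGraph& g, std::vector<uint32_t>& dist,
                      uint32_t u, uint32_t v, uint32_t w) {
    if (dist[u] == UINT32_MAX || dist[u] + w >= dist[v]) return;   // nothing improves
    dist[v] = dist[u] + w;
    using Entry = std::pair<uint32_t, uint32_t>;                   // (distance, vertex)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pq;
    pq.push({dist[v], v});
    while (!pq.empty()) {
        auto [d, x] = pq.top();
        pq.pop();
        if (d != dist[x]) continue;                                // stale heap entry
        for (uint32_t e = g.row_offsets[x]; e < g.row_offsets[x + 1]; ++e) {
            uint32_t y = g.col_indices[e];
            uint32_t nd = d + g.weights[e];
            if (nd < dist[y]) { dist[y] = nd; pq.push({nd, y}); }
        }
    }
}
```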

77 citations


"Falcon: A Graph Manipulation Langua..." refers methods in this paper

  • ...A dynamic algorithm where only edges get added (deleted) is called an incremental (decremental) algorithm, whereas algorithms where both insertion and deletion of edges happen are called fully dynamic algorithms [Frigioni et al. 1998]....


Proceedings ArticleDOI
15 Apr 1996
TL;DR: Analysis of a parallel algorithm for the minimum spanning tree problem, based on the sequential algorithm of O. Boruvka (1926), shows that in principle a speedup proportional to the number of processors can be achieved, but that communication costs can be significant.
Abstract: We study parallel algorithms for the minimum spanning tree problem, based on the sequential algorithm of O. Boruvka (1926). The target architectures for our algorithm are asynchronous, distributed-memory machines. Analysis of our parallel algorithm on a simple model reminiscent of the LogP model shows that in principle a speedup proportional to the number of processors can be achieved, but that communication costs can be significant. To reduce these costs, we develop a new randomized linear-work pointer jumping scheme that performs better than previous linear-work algorithms. We also consider empirically the effects of data imbalance on the running time. For the graphs used in our experiments, load balancing schemes result in little improvement in running times. Our implementations on sparse graphs with 64,000 vertices on Thinking Machines' CM-5 achieve a speedup factor of about 4 on 16 processors. In this environment, packaging of messages turns out to be the most effective way to reduce communication costs.
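As a minimal, sequential sketch of one Boruvka round (the distributed-memory machinery, randomization, and message packaging discussed above are not shown; the types and names are ours): every component selects its cheapest outgoing edge, and the selected edges are contracted into a union-find. Repeating rounds until no edge is selected yields a minimum spanning forest.

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

struct Edge { uint32_t u, v, w; };

struct UnionFind {
    std::vector<uint32_t> parent;
    explicit UnionFind(uint32_t n) : parent(n) { std::iota(parent.begin(), parent.end(), 0u); }
    uint32_t find(uint32_t x) {                    // path halving ("pointer jumping")
        while (parent[x] != x) { parent[x] = parent[parent[x]]; x = parent[x]; }
        return x;
    }
    void unite(uint32_t a, uint32_t b) { parent[find(a)] = find(b); }
};

// One Boruvka round; returns false when no component could grow (algorithm done).
bool boruvka_round(const std::vector<Edge>& edges, UnionFind& uf,
                   std::vector<Edge>& mst) {
    const uint32_t n = uf.parent.size();
    std::vector<int64_t> best(n, -1);              // index of cheapest outgoing edge
    for (size_t i = 0; i < edges.size(); ++i) {
        uint32_t cu = uf.find(edges[i].u), cv = uf.find(edges[i].v);
        if (cu == cv) continue;                    // edge is internal to a component
        for (uint32_t c : {cu, cv})
            if (best[c] < 0 || edges[i].w < edges[best[c]].w) best[c] = i;
    }
    bool merged = false;
    for (uint32_t c = 0; c < n; ++c) {
        if (best[c] < 0) continue;
        const Edge& e = edges[best[c]];
        if (uf.find(e.u) != uf.find(e.v)) {        // may already be joined this round
            uf.unite(e.u, e.v);
            mst.push_back(e);
            merged = true;
        }
    }
    return merged;
}
```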

76 citations


"Falcon: A Graph Manipulation Langua..." refers methods in this paper

  • ...A Set data type can be used to implement, as an example, Boruvka’s minimum spanning tree (MST) algorithm [Chung and Condon 1996]....


Proceedings ArticleDOI
16 Mar 2013
TL;DR: This work investigates a set of techniques to make the betweenness centrality computations faster on GPUs as well as on heterogeneous CPU/GPU architectures, and shows that heterogeneous computing, i.e., using both architectures at the same time, is a promising solution for betweennesscentrality.
Abstract: The betweenness centrality metric has always been intriguing for graph analyses and used in various applications. Yet, it is one of the most computationally expensive kernels in graph mining. In this work, we investigate a set of techniques to make the betweenness centrality computations faster on GPUs as well as on heterogeneous CPU/GPU architectures. Our techniques are based on virtualization of the vertices with high degree, strided access to adjacency lists, removal of the vertices with degree 1, and graph ordering. By combining these techniques within a fine-grain parallelism, we reduced the computation time on GPUs significantly for a set of social networks. On CPUs, which can usually have access to a large amount of memory, we used a coarse-grain parallelism. We showed that heterogeneous computing, i.e., using both architectures at the same time, is a promising solution for betweenness centrality. Experimental results show that the proposed techniques can be a great arsenal for reducing the centrality computation time for networks. In particular, they reduce the computation time for a graph with 234 million edges from more than 4 months to less than 12 days.
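For reference, the baseline formulation these GPU techniques accelerate is Brandes' algorithm; a single-source pass for unweighted graphs over the CSRGraph from the first sketch might look like the sketch below (names are ours; the virtualization, strided-access, and ordering optimizations are not shown). Summing the contribution of every source into a zero-initialized bc array gives the unnormalized betweenness values.

```cpp
#include <cstdint>
#include <queue>
#include <vector>

// One source's contribution to betweenness centrality (Brandes' algorithm,
// unweighted graphs): a BFS records shortest-path counts, then dependencies
// are accumulated in reverse BFS order.
void brandes_from_source(const CSRGraph& g, uint32_t s, std::vector<double>& bc) {
    const uint32_t n = g.num_vertices();
    std::vector<int32_t> dist(n, -1);
    std::vector<double> sigma(n, 0.0), delta(n, 0.0);
    std::vector<uint32_t> order;                       // vertices in BFS order
    dist[s] = 0; sigma[s] = 1.0;
    std::queue<uint32_t> q; q.push(s);
    while (!q.empty()) {
        uint32_t v = q.front(); q.pop();
        order.push_back(v);
        for (uint32_t e = g.row_offsets[v]; e < g.row_offsets[v + 1]; ++e) {
            uint32_t w = g.col_indices[e];
            if (dist[w] < 0) { dist[w] = dist[v] + 1; q.push(w); }
            if (dist[w] == dist[v] + 1) sigma[w] += sigma[v];   // v precedes w
        }
    }
    for (auto it = order.rbegin(); it != order.rend(); ++it) { // reverse BFS order
        uint32_t w = *it;
        for (uint32_t e = g.row_offsets[w]; e < g.row_offsets[w + 1]; ++e) {
            uint32_t v = g.col_indices[e];
            if (dist[v] == dist[w] + 1)                         // w precedes v
                delta[w] += (sigma[w] / sigma[v]) * (1.0 + delta[v]);
        }
        if (w != s) bc[w] += delta[w];
    }
}
```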

76 citations


"Falcon: A Graph Manipulation Langua..." refers methods in this paper

  • ...In addition, there have been successful implementations of other local computation algorithms such as n-body simulation [Burtscher and Pingali 2011], betweenness centrality [Sariyüce et al. 2013], and dataflow analysis [Mendez-Lojo et al....


  • ...In addition, there have been successful implementations of other local computation algorithms such as n-body simulation [Burtscher and Pingali 2011], betweenness centrality [Sariyüce et al. 2013], and dataflow analysis [Mendez-Lojo et al. 2012; Prabhu et al. 2011] on the GPU....


Proceedings ArticleDOI
15 Feb 2014
TL;DR: This paper uses Green-Marl, a domain-specific language for graph analysis, to intuitively describe graph algorithms and extends its compiler to generate equivalent Pregel implementations, showing that the Pregel programs generated by the Green-Marl compiler perform similarly to manually coded Pregel implementations of the same algorithms.
Abstract: Large-scale graph processing, with its massive data sets, requires distributed processing. However, conventional frameworks for distributed graph processing, such as Pregel, use non-traditional programming models that are well-suited for parallelism and scalability but inconvenient for implementing non-trivial graph algorithms. In this paper, we use Green-Marl, a Domain-Specific Language for graph analysis, to intuitively describe graph algorithms and extend its compiler to generate equivalent Pregel implementations. Using the semantic information captured by Green-Marl, the compiler applies a set of transformation rules that convert imperative graph algorithms into Pregel's programming model. Our experiments show that the Pregel programs generated by the Green-Marl compiler perform similarly to manually coded Pregel implementations of the same algorithms. The compiler is even able to generate a Pregel implementation of a complicated graph algorithm for which a manual Pregel implementation is very challenging.
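To show the kind of target the compiler generates code for, the sketch below is a Pregel-style vertex program for SSSP in plain C++: in each superstep a vertex takes the minimum of its incoming messages and, if its distance improved, sends candidate distances to its out-neighbors. The class and method names are ours, not Pregel's or Green-Marl's API.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

struct SSSPVertex {
    uint32_t id = 0;
    uint32_t dist = UINT32_MAX;
    std::vector<std::pair<uint32_t, uint32_t>> out_edges;    // (target, weight)

    // One superstep; returns messages as (target vertex, candidate distance) pairs.
    // Returning no messages corresponds to the vertex voting to halt.
    std::vector<std::pair<uint32_t, uint32_t>>
    compute(const std::vector<uint32_t>& messages, bool is_source, int superstep) {
        uint32_t best = dist;
        if (superstep == 0 && is_source) best = 0;
        for (uint32_t m : messages) best = std::min(best, m);
        std::vector<std::pair<uint32_t, uint32_t>> out;
        if (best < dist) {                                    // distance improved: propagate
            dist = best;
            for (auto [t, w] : out_edges) out.push_back({t, dist + w});
        }
        return out;
    }
};
```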

63 citations

Journal ArticleDOI
26 Jan 2011
TL;DR: EigenCFA, an algorithm for accelerating higher-order control-flow analysis (specifically, 0CFA) with a GPU, is described, implemented and benchmarked, with a factor of 72 speedup over an optimized CPU implementation.
Abstract: We describe, implement and benchmark EigenCFA, an algorithm for accelerating higher-order control-flow analysis (specifically, 0CFA) with a GPU. Ultimately, our program transformations, reductions and optimizations achieve a factor of 72 speedup over an optimized CPU implementation. We began our investigation with the view that GPUs accelerate high-arithmetic, data-parallel computations with a poor tolerance for branching. Taking that perspective to its limit, we reduced Shivers's abstract-interpretive 0CFA to an algorithm synthesized from linear-algebra operations. Central to this reduction were "abstract" Church encodings, and encodings of the syntax tree and abstract domains as vectors and matrices. A straightforward (dense-matrix) implementation of EigenCFA performed slower than a fast CPU implementation. Ultimately, sparse-matrix data structures and operations turned out to be the critical accelerants. Because control-flow graphs are sparse in practice (up to 96% empty), our control-flow matrices are also sparse, giving the sparse matrix operations an overwhelming space and speed advantage. We also achieved speedups by carefully permitting data races. The monotonicity of 0CFA makes it sound to perform analysis operations in parallel, possibly using stale or even partially-updated data.
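For context on why sparsity was the critical accelerant, the sketch below is a generic CSR sparse matrix-vector product, whose work is proportional to the number of nonzeros rather than n². This is a plain numeric illustration; EigenCFA's actual matrices encode abstract domains and Church-encoded syntax and are Boolean, which is not modeled here.

```cpp
#include <cstddef>
#include <vector>

// Sparse matrix in CSR form: row_ptr has rows + 1 offsets into cols/vals.
struct CSRMatrix {
    size_t rows = 0;
    std::vector<size_t> row_ptr;
    std::vector<size_t> cols;
    std::vector<double> vals;
};

// y = A * x, touching only the stored nonzeros of each row.
std::vector<double> spmv(const CSRMatrix& A, const std::vector<double>& x) {
    std::vector<double> y(A.rows, 0.0);
    for (size_t i = 0; i < A.rows; ++i)
        for (size_t k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            y[i] += A.vals[k] * x[A.cols[k]];
    return y;
}
```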

54 citations


"Falcon: A Graph Manipulation Langua..." refers methods in this paper

  • ...2013], and dataflow analysis [Mendez-Lojo et al. 2012; Prabhu et al. 2011] on the GPU....


  • ...In addition, there have been successful implementations of other local computation algorithms such as n-body simulation [Burtscher and Pingali 2011], betweenness centrality [Sariyüce et al. 2013], and dataflow analysis [Mendez-Lojo et al. 2012; Prabhu et al. 2011] on the GPU....
