Journal Article

Falcon: A Graph Manipulation Language for Heterogeneous Systems

TL;DR: Falcon, a domain-specific language (DSL) for implementing graph algorithms, is proposed; it abstracts the hardware, provides constructs to write explicitly parallel programs at a higher level, and can work with general algorithms that may change the graph structure.
Abstract: Graph algorithms have been shown to possess enough parallelism to keep several computing resources busy—even hundreds of cores on a GPU. Unfortunately, tuning their implementation for efficient execution on a particular hardware configuration of heterogeneous systems consisting of multicore CPUs and GPUs is challenging, time consuming, and error prone. To address these issues, we propose a domain-specific language (DSL), Falcon, for implementing graph algorithms that (i) abstracts the hardware, (ii) provides constructs to write explicitly parallel programs at a higher level, and (iii) can work with general algorithms that may change the graph structure (morph algorithms). We illustrate the usage of our DSL to implement local computation algorithms (that do not change the graph structure) and morph algorithms such as Delaunay mesh refinement, survey propagation, and dynamic SSSP on GPU and multicore CPUs. Using a set of benchmark graphs, we illustrate that the generated code performs close to the state-of-the-art hand-tuned implementations.
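As a rough illustration of what the abstract calls a local computation algorithm (one that updates node data but never changes the graph structure), the following Python sketch computes single-source shortest paths by repeated edge relaxation. It is not Falcon syntax and not the authors' implementation; the function name `sssp` and the edge-list representation are assumptions made only for this example.

```python
import math


def sssp(num_nodes, edges, source):
    """Single-source shortest paths by repeated edge relaxation.

    edges is a list of (u, v, weight) tuples for a directed graph with
    non-negative weights. The graph itself is never modified; only the
    per-node distance values change (a "local computation").
    """
    dist = [math.inf] * num_nodes
    dist[source] = 0
    changed = True
    while changed:
        changed = False
        # Each relaxation is independent of the others, so a parallel
        # implementation could process edges concurrently, provided the
        # min-update on dist[v] is performed atomically.
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                changed = True
    return dist


if __name__ == "__main__":
    example_edges = [(0, 1, 4), (0, 2, 1), (2, 1, 2), (1, 3, 5)]
    print(sssp(4, example_edges, 0))  # [0, 3, 1, 8]
```

A morph algorithm such as Delaunay mesh refinement would, by contrast, add and remove nodes and edges while it runs, which this sketch does not attempt.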
Citations
Proceedings Article
11 Jun 2018
TL;DR: This paper introduces a new approach to building distributed-memory graph analytics systems that exploits heterogeneity in processor types (CPU and GPU), partitioning policies, and programming models, and Gluon, a communication-optimizing substrate that enables these programs to run on heterogeneous clusters and optimizes communication in a novel way.
Abstract: This paper introduces a new approach to building distributed-memory graph analytics systems that exploits heterogeneity in processor types (CPU and GPU), partitioning policies, and programming models. The key to this approach is Gluon, a communication-optimizing substrate. Programmers write applications in a shared-memory programming system of their choice and interface these applications with Gluon using a lightweight API. Gluon enables these programs to run on heterogeneous clusters and optimizes communication in a novel way by exploiting structural and temporal invariants of graph partitioning policies. To demonstrate Gluon’s ability to support different programming models, we interfaced Gluon with the Galois and Ligra shared-memory graph analytics systems to produce distributed-memory versions of these systems named D-Galois and D-Ligra, respectively. To demonstrate Gluon’s ability to support heterogeneous processors, we interfaced Gluon with IrGL, a state-of-the-art single-GPU system for graph analytics, to produce D-IrGL, the first multi-GPU distributed-memory graph analytics system. Our experiments were done on CPU clusters with up to 256 hosts and roughly 70,000 threads and on multi-GPU clusters with up to 64 GPUs. The communication optimizations in Gluon improve end-to-end application execution time by ∼2.6× on the average. D-Galois and D-IrGL scale well and are faster than Gemini, the state-of-the-art distributed CPU graph analytics system, by factors of ∼3.9× and ∼4.9×, respectively, on the average.

125 citations

Proceedings Article
19 Oct 2016
TL;DR: This paper argues that three optimizations, called throughput optimizations, are key to high performance for this application class, and implements them in a compiler that produces CUDA code from an intermediate-level program representation called IrGL.
Abstract: Writing high-performance GPU implementations of graph algorithms can be challenging. In this paper, we argue that three optimizations called throughput optimizations are key to high performance for this application class. These optimizations describe a large implementation space, making it unrealistic for programmers to implement them by hand. To address this problem, we have implemented these optimizations in a compiler that produces CUDA code from an intermediate-level program representation called IrGL. Compared to state-of-the-art handwritten CUDA implementations of eight graph applications, code generated by the IrGL compiler is up to 5.95x faster (median 1.4x) for five applications and never more than 30% slower for the others. Throughput optimizations contribute an improvement of up to 4.16x (median 1.4x) to the performance of unoptimized IrGL code.

82 citations


Cites background from "Falcon: A Graph Manipulation Langua..."

  • ...Since our compiler has full discretion on scheduling the iterations of ForAll loops, some low-level features of CUDA such as shared memory are not supported in their full generality and may not be used when writing operator code....

    [...]

  • ...Categories and Subject Descriptors D.3.3 [Programming Languages]: Language Constructs and Features; D.3.4 [Programming Languages]: Processors Keywords Graph applications, amorphous data-parallelism, GPUs, compilers, optimization, throughput...

    [...]

Journal Article
01 Apr 2020
TL;DR: Pangolin is an efficient and flexible in-memory graph pattern mining (GPM) framework targeting shared-memory CPUs and GPUs that provides high-level abstractions for GPU processing.
Abstract: There is growing interest in graph pattern mining (GPM) problems such as motif counting. GPM systems have been developed to provide unified interfaces for programming algorithms for these problems and for running them on parallel systems. However, existing systems may take hours to mine even simple patterns in moderate-sized graphs, which significantly limits their real-world usability. We present Pangolin, an efficient and flexible in-memory GPM framework targeting shared-memory CPUs and GPUs. Pangolin is the first GPM system that provides high-level abstractions for GPU processing. It provides a simple programming interface based on the extend-reduce-filter model, which allows users to specify application-specific knowledge for search space pruning and isomorphism test elimination. We describe novel optimizations that exploit locality, reduce memory consumption, and mitigate the overheads of dynamic memory allocation and synchronization. Evaluation on a 28-core CPU demonstrates that Pangolin outperforms existing GPM frameworks Arabesque, RStream, and Fractal by 49×, 88×, and 80× on average, respectively. Acceleration on a V100 GPU further improves performance of Pangolin by 15× on average. Compared to state-of-the-art hand-optimized GPM applications, Pangolin provides competitive performance with less programming effort.

44 citations

Proceedings Article
01 Sep 2017
TL;DR: This paper develops an approach to graph processing on GPUs that seeks to overcome some of the performance limitations of existing frameworks, and uses multiple data representation and execution strategies for dense versus sparse vertex frontiers, dependent on the fraction of active graph vertices.
Abstract: High-level GPU graph processing frameworks are an attractive alternative for achieving both high productivity and high performance. Hence, several high-level frameworks for graph processing on GPUs have been developed. In this paper, we develop an approach to graph processing on GPUs that seeks to overcome some of the performance limitations of existing frameworks. It uses multiple data representation and execution strategies for dense versus sparse vertex frontiers, dependent on the fraction of active graph vertices. A two-phase edge processing approach trades off extra data movement for improved load balancing across GPU threads, by using a 2D blocked representation for edge data. Experimental results demonstrate performance improvement over current state-of-the-art GPU graph processing frameworks for many benchmark programs and data sets.

30 citations


Cites methods from "Falcon: A Graph Manipulation Langua..."

  • ...Several such GPU graph processing frameworks have been recently developed, including VWC [9], MapGraph [7], Medusa [22], CuSha [11], WS [10], Frog [19], GreenMarl [8], Falcon [4], Groute [2], and Gunrock [21]....

    [...]

Journal Article
TL;DR: This work casts the X3DH protocol as a specific type of authenticated key exchange (AKE) protocol, called a Signal-conforming AKE protocol, and formally defines its security model based on the vast prior work on AKE protocols, resulting in the first post-quantum secure replacement of the X3DH protocol based on well-established assumptions.

18 citations

References
Proceedings Article
03 Nov 2013
TL;DR: X-Stream is novel in using an edge-centric rather than a vertex-centric implementation of this model, and streaming completely unordered edge lists rather than performing random access, and competes favorably with existing systems for graph processing.
Abstract: X-Stream is a system for processing both in-memory and out-of-core graphs on a single shared-memory machine. While retaining the scatter-gather programming model with state stored in the vertices, X-Stream is novel in (i) using an edge-centric rather than a vertex-centric implementation of this model, and (ii) streaming completely unordered edge lists rather than performing random access. This design is motivated by the fact that sequential bandwidth for all storage media (main memory, SSD, and magnetic disk) is substantially larger than random access bandwidth. We demonstrate that a large number of graph algorithms can be expressed using the edge-centric scatter-gather model. The resulting implementations scale well in terms of number of cores, in terms of number of I/O devices, and across different storage media. X-Stream competes favorably with existing systems for graph processing. Besides sequential access, we identify as one of the main contributors to better performance the fact that X-Stream does not need to sort edge lists during preprocessing.
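The edge-centric scatter-gather model described in this abstract can be sketched in a few lines of Python. This is only an illustration of the idea, not X-Stream's API: the function name `edge_centric_components`, the in-memory lists standing in for streamed storage, and the choice of connected components as the example algorithm are all assumptions made for the sketch.

```python
def edge_centric_components(num_vertices, edges):
    """Label connected components using edge-centric scatter-gather.

    edges is an unordered list of (src, dst) pairs; an undirected edge is
    listed once in each direction. State lives in the vertices (label),
    and each iteration streams the whole edge list sequentially instead of
    looking up the neighbours of individual vertices.
    """
    label = list(range(num_vertices))
    active = [True] * num_vertices  # vertices whose label changed last round
    while any(active):
        # Scatter phase: stream over edges in storage order and emit an
        # update for every edge whose source vertex is active.
        updates = []
        for src, dst in edges:
            if active[src]:
                updates.append((dst, label[src]))
        active = [False] * num_vertices
        # Gather phase: stream over the updates and apply each one to the
        # state of its destination vertex.
        for dst, value in updates:
            if value < label[dst]:
                label[dst] = value
                active[dst] = True
    return label


if __name__ == "__main__":
    example_edges = [(0, 1), (1, 0), (1, 2), (2, 1), (3, 4), (4, 3)]
    print(edge_centric_components(5, example_edges))  # [0, 0, 0, 3, 3]
```

Note that the edge list is never sorted or indexed by vertex; every pass reads it front to back, which is the access pattern the abstract motivates with the gap between sequential and random-access bandwidth.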

640 citations

Proceedings Article
14 Feb 2009
TL;DR: This paper presents a compiler framework for automatic source-to-source translation of standard OpenMP applications into CUDA-based GPGPU applications, and identifies several key transformation techniques, which enable efficient GPU global memory access, to achieve high performance.
Abstract: GPGPUs have recently emerged as powerful vehicles for general-purpose high-performance computing. Although a new Compute Unified Device Architecture (CUDA) programming model from NVIDIA offers improved programmability for general computing, programming GPGPUs is still complex and error-prone. This paper presents a compiler framework for automatic source-to-source translation of standard OpenMP applications into CUDA-based GPGPU applications. The goal of this translation is to further improve programmability and make existing OpenMP applications amenable to execution on GPGPUs. In this paper, we have identified several key transformation techniques, which enable efficient GPU global memory access, to achieve high performance. Experimental results from two important kernels (JACOBI and SPMUL) and two NAS OpenMP Parallel Benchmarks (EP and CG) show that the described translator and compile-time optimizations work well on both regular and irregular applications, leading to performance improvements of up to 50X over the unoptimized translation (up to 328X over serial).

476 citations


"Falcon: A Graph Manipulation Langua..." refers methods in this paper

  • ...OpenMP to GPGPU [Lee et al. 2009] is a framework for automatic code generation for the GPU from OpenMP CPU code....

    [...]

Proceedings Article
01 Jul 1993
TL;DR: This paper presents a technique for creating high-quality triangular meshes for regions on curved surfaces, an extension of previous methods developed for regions in the plane.
Abstract: For several commonly-used solution techniques for partial differential equations, the first step is to divide the problem region into simply-shaped elements, creating a mesh. We present a technique for creating high-quality triangular meshes for regions on curved surfaces. This technique is an extension of previous methods we developed for regions in the plane. For both flat and curved surfaces, the resulting meshes are guaranteed to exhibit the following properties: (1) internal and external boundaries are respected, (2) element shapes are guaranteed—all elements are triangles with angles between 30 and 120 degrees (with the exception of badly shaped elements that may be required by the specified boundary), and (3) element density can be controlled, producing small elements in “interesting” areas and large elements elsewhere. An additional contribution of this paper is the development of a practical generalization of Delaunay triangulation to curved surfaces.

467 citations

Book Chapter
01 Jan 2012
TL;DR: Thrust is a parallel template library for CUDA C/C++ that lets developers implement high-performance applications with minimal programming effort by describing computations with high-level algorithms and delegating the implementation decisions to the library.
Abstract: This chapter demonstrates how to leverage the Thrust parallel template library to implement high-performance applications with minimal programming effort. With the introduction of CUDA C/C++, developers can harness the massive parallelism of the graphics processing unit (GPU) through a standard programming language. CUDA allows developers to make fine-grained decisions about how computations are decomposed into parallel threads and executed on the device. The level of control offered by CUDA C/C++ is an important feature; it facilitates the development of high-performance algorithms for a variety of computationally demanding tasks which merit significant optimization and profit from low-level control of the mapping onto hardware. With Thrust, developers describe their computation using a collection of high-level algorithms and completely delegate the decision of how to implement the computation to the library. Thrust is implemented entirely within CUDA C/C++ and maintains interoperability with the rest of the CUDA ecosystem. Interoperability is an important feature because no single language or library is the best tool for every problem. Thrust presents a style of programming emphasizing genericity and composability. Indeed, the vast majority of Thrust's functionality is derived from four fundamental parallel algorithms: for each, reduce, scan, and sort. Thrust's high-level algorithms enhance programmer productivity by automating the mapping of computational tasks onto the GPU. Thrust also boosts programmer productivity by providing a rich set of algorithms for common patterns.

435 citations

Journal Article
04 Jun 2011
TL;DR: It is suggested that the operator formulation and tao-analysis of algorithms can be the foundation of a systematic approach to parallel programming.
Abstract: For more than thirty years, the parallel programming community has used the dependence graph as the main abstraction for reasoning about and exploiting parallelism in "regular" algorithms that use dense arrays, such as finite-differences and FFTs. In this paper, we argue that the dependence graph is not a suitable abstraction for algorithms in new application areas like machine learning and network analysis in which the key data structures are "irregular" data structures like graphs, trees, and sets. To address the need for better abstractions, we introduce a data-centric formulation of algorithms called the operator formulation in which an algorithm is expressed in terms of its action on data structures. This formulation is the basis for a structural analysis of algorithms that we call tao-analysis. Tao-analysis can be viewed as an abstraction of algorithms that distills out algorithmic properties important for parallelization. It reveals that a generalized form of data-parallelism called amorphous data-parallelism is ubiquitous in algorithms, and that, depending on the tao-structure of the algorithm, this parallelism may be exploited by compile-time, inspector-executor or optimistic parallelization, thereby unifying these seemingly unrelated parallelization techniques. Regular algorithms emerge as a special case of irregular algorithms, and many application-specific optimization techniques can be generalized to a broader context. These results suggest that the operator formulation and tao-analysis of algorithms can be the foundation of a systematic approach to parallel programming.

380 citations


"Falcon: A Graph Manipulation Langua..." refers background or methods in this paper

  • ...This last issue is addressed by library-based approaches such as Galois [Pingali et al. 2011] and Totem [Gharaibeh et al. 2013]....

    [...]

  • ...Handwritten codes of LonestarGPU [Nasre et al. 2013b] for GPUs and Galois [Pingali et al. 2011] for multicore CPUs, both of which support morph algorithms, are very complex....

    [...]

  • ...The Galois framework [Pingali et al. 2011], which is a library implementation in C++, supports cautious morph algorithms and generates code only for multicore CPUs....

    [...]

  • ...2.1 [Pingali et al. 2011], Totem, and Green-Marl [Hong et al. 2012]....

    [...]