Journal ArticleDOI

Falcon: A Graph Manipulation Language for Heterogeneous Systems

TL;DR: A domain-specific language (DSL), Falcon, is proposed for implementing graph algorithms; it abstracts the hardware, provides constructs to write explicitly parallel programs at a higher level, and works with general algorithms that may change the graph structure.
Abstract: Graph algorithms have been shown to possess enough parallelism to keep several computing resources busy—even hundreds of cores on a GPU. Unfortunately, tuning their implementation for efficient execution on a particular hardware configuration of heterogeneous systems consisting of multicore CPUs and GPUs is challenging, time consuming, and error prone. To address these issues, we propose a domain-specific language (DSL), Falcon, for implementing graph algorithms that (i) abstracts the hardware, (ii) provides constructs to write explicitly parallel programs at a higher level, and (iii) can work with general algorithms that may change the graph structure (morph algorithms). We illustrate the usage of our DSL to implement local computation algorithms (that do not change the graph structure) and morph algorithms such as Delaunay mesh refinement, survey propagation, and dynamic SSSP on GPU and multicore CPUs. Using a set of benchmark graphs, we illustrate that the generated code performs close to the state-of-the-art hand-tuned implementations.
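As a point of reference for what the DSL abstracts, below is a minimal hand-written CUDA sketch of one SSSP relaxation step over a CSR graph. The kernel and every name in it (relaxEdges, rowOffsets, and so on) are illustrative assumptions, not Falcon's generated code; the host would launch it repeatedly until changed stays 0.

```cuda
#include <climits>
#include <cuda_runtime.h>

// One Bellman-Ford-style relaxation step over a CSR graph: each thread
// relaxes the outgoing edges of one vertex and atomically lowers the
// tentative distance of the target vertex.
__global__ void relaxEdges(int nVertices, const int *rowOffsets,
                           const int *colIndices, const int *weights,
                           int *dist, int *changed) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= nVertices || dist[v] == INT_MAX) return;   // out of range or unreached
    for (int e = rowOffsets[v]; e < rowOffsets[v + 1]; ++e) {
        int u = colIndices[e];
        int cand = dist[v] + weights[e];
        // atomicMin returns the previous value; if we lowered it, ask the
        // host for another iteration (benign race on the flag).
        if (atomicMin(&dist[u], cand) > cand) *changed = 1;
    }
}
```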
Citations
Journal Article
Abstract: This paper presents benchmarking and profiling of the two lattice-based signature scheme finalists, Dilithium and Falcon, on the ARM Cortex M7 using the STM32F767ZI NUCLEO-144 development board. This research is motivated by the Cortex M7 being the only processor in the Cortex-M family to offer a double-precision (i.e., 64-bit) floating-point unit, which allows Falcon's implementations, requiring 53 bits of precision, to run floating-point operations natively without any emulation. Falcon shows significant speed-ups of 6.2-8.3x in clock cycles and 6.2-11.8x in runtime, but Dilithium does not show much improvement beyond the gains from the slightly faster processor. We then present profiling results of the two schemes on the Cortex M7 to show their respective bottlenecks and the operations where improvements are and can be made; some operations in Falcon's procedures see speed-ups of an order of magnitude. Finally, we test the native FPU instructions on the Cortex M7, used in Falcon's FPR instructions, for constant runtime and find irregularities on four different STM32 boards, as well as on the Raspberry Pi 3, which was used in previous benchmarking results for Falcon.

5 citations

01 Jan 2016
TL;DR: This paper introduces vertex grouping as a technique that enables a trade-off between memory consumption and work efficiency, and shows that these techniques provide up to a 5.46x speedup over the recently proposed WS-VR framework across multiple algorithms and inputs.
Abstract: Author(s): Khorasani, Farzad | Advisor(s): Gupta, Rajiv | The massive parallel processing power of GPUs presents an attractive opportunity for accelerating large-scale vertex-centric graph computations. However, the inherent irregularity and large sizes of real-world power-law graphs create many challenges: lock-step execution by threads within a SIMD group restricts exploitable parallelism, the GPU's limited DRAM size restricts the sizes of graphs that can be offloaded to the GPU, and the limited inter-GPU communication bandwidth necessitates judicious use of available bandwidth. This dissertation addresses all of these challenges. We present Warp Segmentation, which greatly enhances GPU device utilization by dynamically assigning an appropriate number of SIMD threads to process a vertex while employing the compact CSR representation to maximize the graph size that can be held in GPU global memory; prior works can either maximize graph sizes (e.g., VWC) or device utilization (e.g., CuSha). We scale graph processing over multiple GPUs via Vertex Refinement, which dynamically collects and transfers only the updated boundary vertices, dramatically reducing the amount of inter-GPU data transfer; existing multi-GPU techniques (Medusa, TOTEM) perform a high degree of wasteful vertex transfers. Since processing all vertices at every iteration wastes much of the GPU's computation power, we present a work-efficient solution that processes during an iteration only those vertices that were activated in the previous iteration, employing an effective task expansion strategy that avoids intra-warp thread underutilization. For multi-GPU graph computation, we present permissive partitioning to dynamically balance load across GPUs. Also, as recording vertex activeness requires additional data structures, we introduce vertex grouping to manage the graph storage overhead, enabling a trade-off between memory consumption and work efficiency. Finally, to apply the proposed solutions to other irregular applications, we generalize our techniques and present Collaborative Context Collection (CCC) and Collaborative Task Engagement (CTE). CCC is a software/compiler technique that enhances SIMD efficiency in loops containing thread divergence; CTE abstracts away the complexities of a rather complicated technique behind a CUDA C++ device-side template library and balances load across threads within a SIMD group.
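The compact CSR representation the dissertation builds on is the standard adjacency layout for GPU graph processing. A minimal sketch, with struct and field names chosen for illustration rather than taken from the dissertation:

```cuda
#include <vector>

// Compressed Sparse Row (CSR): the out-edges of vertex v occupy
// colIndices[rowOffsets[v] .. rowOffsets[v+1]-1]. Two flat arrays,
// so the whole graph can be copied to GPU global memory in one piece.
struct CsrGraph {
    std::vector<int> rowOffsets;  // size nVertices + 1, prefix sums of degrees
    std::vector<int> colIndices;  // size nEdges, destination vertex of each edge
};

// Example: edges 0->1, 0->2, 2->1 give
//   rowOffsets = {0, 2, 2, 3}
//   colIndices = {1, 2, 1}
```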

5 citations

Posted Content
TL;DR: Pangolin is the first GPM system to provide high-level abstractions for GPU processing; it offers a simple programming interface based on the extend-reduce-filter model, which enables users to specify application-specific knowledge for search-space pruning and isomorphism-test elimination.
Abstract: There is growing interest in graph pattern mining (GPM) problems such as motif counting. GPM systems have been developed to provide unified interfaces for programming algorithms for these problems and for running them on parallel systems. However, existing systems may take hours to mine even simple patterns in moderate-sized graphs, which significantly limits their real-world usability. We present Pangolin, a high-performance and flexible in-memory GPM framework targeting shared-memory CPUs and GPUs. Pangolin is the first GPM system that provides high-level abstractions for GPU processing. It provides a simple programming interface based on the extend-reduce-filter model, which enables users to specify application-specific knowledge for search space pruning and isomorphism test elimination. We describe novel optimizations that exploit locality, reduce memory consumption, and mitigate the overheads of dynamic memory allocation and synchronization. Evaluation on a 28-core CPU demonstrates that Pangolin outperforms existing GPM frameworks Arabesque, RStream, and Fractal by 49x, 88x, and 80x on average, respectively. Acceleration on a V100 GPU further improves performance of Pangolin by 15x on average. Compared to state-of-the-art hand-optimized GPM applications, Pangolin provides competitive performance with less programming effort.
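A schematic, host-side reading of the extend-reduce-filter model described above might look as follows; the types, callback signatures, and loop structure are assumptions for illustration, not Pangolin's API.

```cuda
#include <cstdint>
#include <utility>
#include <vector>

// Each level extends every partial embedding by one vertex, filters out
// embeddings that application-specific knowledge proves cannot match,
// and finally reduces the survivors into a count.
using Embedding = std::vector<int>;  // vertex IDs of a partial subgraph

template <typename ExtendFn, typename FilterFn, typename ReduceFn>
uint64_t mine(std::vector<Embedding> frontier, int patternSize,
              ExtendFn extend, FilterFn prune, ReduceFn reduce) {
    for (int level = 1; level < patternSize; ++level) {
        std::vector<Embedding> next;
        for (const Embedding &em : frontier)
            extend(em, next);                 // user code: add one more vertex
        frontier.clear();
        for (Embedding &em : next)
            if (!prune(em))                   // user code: search-space pruning
                frontier.push_back(std::move(em));
    }
    uint64_t total = 0;
    for (const Embedding &em : frontier)
        total += reduce(em);                  // user code: aggregate matches
    return total;
}
```

For triangle counting, for instance, extend would append a neighbor with a larger ID than the last vertex, prune would drop non-cliques, and reduce would return 1 per surviving embedding.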

4 citations

Book ChapterDOI
28 Aug 2017
TL;DR: Large graphs are widely used in real-world graph analytics, and some graph algorithms with a high degree of parallelism ideally run on an accelerator cluster, but existing frameworks do not support accelerator clusters.
Abstract: Large graphs are widely used in real-world graph analytics, and the memory available in a single machine is usually inadequate to process them. A good solution is to use a distributed environment, but the typical programming styles used in existing distributed frameworks differ from imperative programming and are difficult for programmers to adapt to. Moreover, some graph algorithms with a high degree of parallelism ideally run on an accelerator cluster, yet the error-prone, low-level programming methods (manual memory and thread management) available for such systems deter programmers from using these architectures. Existing frameworks do not deal with accelerator clusters.

4 citations


Cites background or methods from "Falcon: A Graph Manipulation Language for Heterogeneous Systems"

  • ...Our framework uses the front-end of the Falcon compiler [18]....

  • ...Besides C data types, Falcon [18] has additional data types pertinent to graph algorithms such as graph, vertex, edge, set and collection....

  • ...Keywords: Distributed architecture, Accelerator, Cross-Platform, Graph processing, DSL, Falcon...

  • ...The front-end of Falcon parses the DSL code and generates an AST, which is input to the back-end of our framework....

  • ...Falcon [18], a graph DSL, extends the C programming language and helps programmers implement graph analysis algorithms intuitively....

Journal ArticleDOI
TL;DR: The HyPar strategy provides performance equivalent to the state-of-the-art optimized CPU-only and GPU-only implementations of the corresponding applications, when compared against the prevalent BSP approach for multi-device executions of graphs.

4 citations

References
23 Oct 2011

4,685 citations

Journal ArticleDOI
TL;DR: The bulk-synchronous parallel (BSP) model is introduced as a candidate for this role, with results quantifying its efficiency both in implementing high-level language features and algorithms and in being implemented in hardware.
Abstract: The success of the von Neumann model of sequential computation is attributable to the fact that it is an efficient bridge between software and hardware: high-level languages can be efficiently compiled onto this model, yet it can be efficiently implemented in hardware. The author argues that an analogous bridge between software and hardware is required for parallel computation if that is to become as widely used. This article introduces the bulk-synchronous parallel (BSP) model as a candidate for this role, and gives results quantifying its efficiency both in implementing high-level language features and algorithms, as well as in being implemented in hardware.
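A minimal sketch of the superstep structure the BSP model prescribes, using C++20 threads and a barrier; the worker body is a placeholder for illustration, not Valiant's formulation.

```cuda
#include <barrier>
#include <thread>
#include <vector>

// Each processor alternates local computation and communication; a global
// barrier ends every superstep, making all messages visible before the
// next one starts.
void bspWorker(int id, int nSupersteps, std::barrier<> &sync) {
    for (int step = 0; step < nSupersteps; ++step) {
        // 1. Compute on locally held data.
        // 2. Issue sends; their effects become visible only after the barrier.
        sync.arrive_and_wait();  // end of superstep
    }
}

int main() {
    const int P = 4, supersteps = 3;
    std::barrier<> sync(P);
    std::vector<std::thread> procs;
    for (int i = 0; i < P; ++i)
        procs.emplace_back(bspWorker, i, supersteps, std::ref(sync));
    for (auto &t : procs) t.join();
}
```

The model's cost accounting charges a superstep roughly w + h·g + l, where w is the local work, h the largest message volume, g the bandwidth factor, and l the barrier latency.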

3,885 citations

Proceedings ArticleDOI
06 Jun 2010
TL;DR: A model for processing large graphs, designed for efficient, scalable, and fault-tolerant implementation on clusters of thousands of commodity computers, whose implied synchronicity makes reasoning about programs easier.
Abstract: Many practical computing problems concern large graphs. Standard examples include the Web graph and various social networks. The scale of these graphs - in some cases billions of vertices, trillions of edges - poses challenges to their efficient processing. In this paper we present a computational model suitable for this task. Programs are expressed as a sequence of iterations, in each of which a vertex can receive messages sent in the previous iteration, send messages to other vertices, and modify its own state and that of its outgoing edges or mutate graph topology. This vertex-centric approach is flexible enough to express a broad set of algorithms. The model has been designed for efficient, scalable and fault-tolerant implementation on clusters of thousands of commodity computers, and its implied synchronicity makes reasoning about programs easier. Distribution-related details are hidden behind an abstract API. The result is a framework for processing large graphs that is expressive and easy to program.
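A toy, single-threaded rendering of this vertex-centric superstep loop for unit-weight shortest paths; real Pregel partitions vertices across machines and exposes the model through its user-defined compute function, so everything below is an illustrative reduction, not Pregel's code.

```cuda
#include <algorithm>
#include <limits>
#include <vector>

// Each superstep: a vertex reads its incoming messages (combined by min),
// updates its state if they improve it, and sends messages along its
// out-edges; a vertex with no improving message stays halted.
struct Vertex { int dist; std::vector<int> out; };  // dist starts at INF

void run(std::vector<Vertex> &g, int source) {
    const int INF = std::numeric_limits<int>::max();
    std::vector<int> inbox(g.size(), INF), outbox(g.size(), INF);
    inbox[source] = 0;
    for (bool active = true; active; ) {
        active = false;
        for (size_t v = 0; v < g.size(); ++v) {
            if (inbox[v] >= g[v].dist) continue;     // vote to halt
            g[v].dist = inbox[v];                    // compute on messages
            for (int u : g[v].out)                   // send to neighbors
                outbox[u] = std::min(outbox[u], g[v].dist + 1);
            active = true;
        }
        inbox.swap(outbox);                          // barrier between supersteps
        std::fill(outbox.begin(), outbox.end(), INF);
    }
}
```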

3,840 citations


"Falcon: A Graph Manipulation Langua..." refers background in this paper

  • ...Pregel [Malewicz et al. 2010] is a graph-processing framework in a distributed setting....

Proceedings ArticleDOI
11 Aug 2008
TL;DR: Presents a collection of slides covering the following topics: CUDA parallel programming model; CUDA toolkit and libraries; performance optimization; and application development.
Abstract: The advent of multicore CPUs and manycore GPUs means that mainstream processor chips are now parallel systems. Furthermore, their parallelism continues to scale with Moore's law. The challenge is to develop mainstream application software that transparently scales its parallelism to leverage the increasing number of processor cores, much as 3D graphics applications transparently scale their parallelism to manycore GPUs with widely varying numbers of cores.
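The transparent scaling described here is visible in even the smallest CUDA program: the grid is sized by the data, and the runtime spreads its blocks across however many cores the device has. A generic sketch of the programming model, not code from the paper:

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) x[i] *= a;                           // one element per thread
}

int main() {
    const int n = 1 << 20;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));       // unified memory, for brevity
    for (int i = 0; i < n; ++i) x[i] = 1.0f;
    int threads = 256, blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(x, 2.0f, n);         // grid size follows the data
    cudaDeviceSynchronize();                        // wait for the GPU to finish
    cudaFree(x);
    return 0;
}
```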

2,216 citations


"Falcon: A Graph Manipulation Langua..." refers methods in this paper

  • ...We provide atomic library functions MIN, MAX, SUB, AND, and so on, which are abstractions over the similar ones in CUDA [Nickolls et al. 2008] and GCC [Stallman et al. 2011].... (see the sketch after this list)

  • ...It uses the compare_and_swap variant of CUDA [Nickolls et al. 2008] and GCC [Stallman et al. 2011] for the GPU and CPU, respectively....

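The pattern these snippets describe, an atomic MIN built from compare-and-swap on both the GPU (CUDA's atomicCAS) and the CPU (GCC's __sync builtins), can be sketched as follows; the function names and the changed flag are illustrative, not Falcon's library code.

```cuda
#include <cuda_runtime.h>

// Lower *addr to val atomically; set *changed if the stored value
// actually decreased. Retries while other threads race on the cell.
__device__ void atomicMinFlag(int *addr, int val, int *changed) {
    int old = *addr;
    while (val < old) {
        int seen = atomicCAS(addr, old, val);  // CUDA compare-and-swap
        if (seen == old) { *changed = 1; return; }
        old = seen;                            // lost the race; retry
    }
}

// CPU counterpart using the GCC builtin the snippet cites.
void atomicMinFlagCpu(int *addr, int val, int *changed) {
    int old = *addr;
    while (val < old) {
        int seen = __sync_val_compare_and_swap(addr, old, val);
        if (seen == old) { *changed = 1; return; }
        old = seen;
    }
}
```

CUDA also offers atomicMin directly for integers; the compare-and-swap loop matters because it generalizes to reductions the hardware does not provide natively.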