Journal ArticleDOI

DisGCo: A Compiler for Distributed Graph Analytics

30 Sep 2020 - ACM Transactions on Architecture and Code Optimization (Association for Computing Machinery) - Vol. 17, Iss. 4, pp. 1-26
TL;DR: DisGCo is the first graph DSL compiler that can handle all syntactic capabilities of a practical graph DSL like Green-Marl and generate code that can run on distributed systems.
Abstract: Graph algorithms are widely used in various applications. Their programmability and performance have garnered a lot of interest among researchers. Being able to run these graph analytics programs on distributed systems is an important requirement. Green-Marl is a popular Domain Specific Language (DSL) for coding graph algorithms and is known for its simplicity. However, the existing Green-Marl compiler for distributed systems (Green-Marl to Pregel) can only compile limited types of Green-Marl programs (those in Pregel canonical form). This severely restricts the types of parallel Green-Marl programs that can be executed on distributed systems. We present DisGCo, the first compiler to translate any general Green-Marl program to an equivalent MPI program that can run on distributed systems. Translating Green-Marl programs to MPI (SPMD/MPMD style of computation, distributed memory) presents many challenges beyond differences in syntax, as Green-Marl gives the programmer a unified view of the whole memory and allows parallel and serial code to be intermixed. We first present the set of challenges involved in translating Green-Marl programs to MPI and then present a systematic approach to performing the translation. We also present a few optimization techniques to improve the performance of our generated programs. DisGCo is the first graph DSL compiler that can handle all syntactic capabilities of a practical graph DSL like Green-Marl and generate code that can run on distributed systems. Our preliminary evaluation of DisGCo shows that our generated programs are scalable. Further, compared to the state-of-the-art DH-Falcon compiler that translates a subset of Falcon programs to MPI, our generated codes exhibit a geomean speedup of 17.32×.
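To make the kind of translation described above concrete, here is a hedged C++/MPI sketch (not DisGCo's generated output): a Green-Marl-style parallel Foreach with an implicit sum reduction, written by hand in SPMD form under the assumption of a simple block partition of vertex IDs and a placeholder per-vertex degree property.

```cpp
// Hedged sketch (not DisGCo's actual output): how a Green-Marl-style
// "Foreach(n: G.Nodes) { total += n.degree; }" reduction might map to
// SPMD MPI code, assuming a simple block partition of vertex IDs.
#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const long num_nodes = 1000000;                 // hypothetical graph size
    long begin = rank * num_nodes / nprocs;         // locally owned vertex range
    long end   = (rank + 1) * num_nodes / nprocs;

    std::vector<int> degree(end - begin, 1);        // placeholder per-vertex property

    long local_sum = 0;
    for (long v = begin; v < end; ++v)              // the parallel Foreach becomes a
        local_sum += degree[v - begin];             // loop over locally owned vertices

    long total = 0;                                 // the implicit Green-Marl reduction
    MPI_Allreduce(&local_sum, &total, 1, MPI_LONG,  // becomes an explicit collective
                  MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) std::printf("sum of degrees = %ld\n", total);
    MPI_Finalize();
    return 0;
}
```

The real compiler must also handle remote property accesses and intermixed serial regions, which this hand-written sketch sidesteps.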
Citations
Journal ArticleDOI
TL;DR: StarPlat is a graph DSL that allows programmers to specify graph algorithms in a high-level format but generates code for three different backends from the same algorithmic specification.
Abstract: Graphs model several real-world phenomena. With the growth of unstructured and semi-structured data, parallelization of graph algorithms is inevitable. Unfortunately, due to the inherent irregularity of computation, memory access, and communication, graph algorithms are traditionally challenging to parallelize. To tame this challenge, several libraries, frameworks, and domain-specific languages (DSLs) have been proposed to reduce the parallel programming burden of the users, who are often domain experts. However, existing frameworks to model graph algorithms typically target a single architecture. In this paper, we present a graph DSL, named StarPlat, that allows programmers to specify graph algorithms in a high-level format but generates code for three different backends from the same algorithmic specification. In particular, the DSL compiler generates OpenMP for multi-core, MPI for distributed, and CUDA for many-core GPUs. Since these three are completely different parallel programming paradigms, binding them together under the same language is challenging. We share our experience with the language design. Central to our compiler is an intermediate representation that allows a common representation of the high-level program, from which the individual backend code generations begin. We demonstrate the expressiveness of StarPlat by specifying four graph algorithms: betweenness centrality computation, page rank computation, single-source shortest paths, and triangle counting. We illustrate the effectiveness of our approach by comparing the performance of the generated codes with that of hand-crafted library codes. We find that the generated code is competitive with library-based codes in many cases. More importantly, we show the feasibility of generating efficient code for different target architectures from the same algorithmic specification of graph algorithms.
Journal ArticleDOI
TL;DR: In this article, a novel backend for the Open Neural Network Compiler (ONNC) is proposed that exploits machine learning to optimize code for ARM Cortex-M devices.
Abstract: The diversity of software and hardware forces programmers to spend a great deal of time optimizing their source code, which often requires specific treatment for each platform. The problem becomes critical on embedded devices, where computational and memory resources are strictly constrained. Compilers play an essential role in deploying source code on a target device through the backend. In this work, a novel backend for the Open Neural Network Compiler (ONNC) is proposed, which exploits machine learning to optimize code for the ARM Cortex-M device. The backend requires minimal changes to Open Neural Network Exchange (ONNX) models. Several novel optimization techniques are also incorporated in the backend, such as quantizing the ONNX model’s weights and automatically tuning the dimensions of operators in computations. The performance of the proposed framework is evaluated for two applications: handwritten digit recognition on the Modified National Institute of Standards and Technology (MNIST) dataset and model, and image classification on the Canadian Institute For Advanced Research 10 (CIFAR-10) dataset with the AlexNet-Light model. The system achieves 98.90% and 90.55% accuracy for handwritten digit recognition and image classification, respectively. Furthermore, the proposed architecture is significantly more lightweight than other state-of-the-art models in terms of both computation time and generated source code complexity. From the system perspective, this work provides a novel approach to deploying direct computations from the available ONNX models to target devices by optimizing compilers while maintaining high accuracy.
Journal ArticleDOI
TL;DR: In this paper, the authors investigate the use of a programming model based on series-parallel partial orders: computations are expressed as directed graphs that expose parallelization opportunities and necessary sequencing by construction.
Abstract: The number of processing elements per solution is growing. From embedded devices now employing (often heterogeneous) multi-core processors, across many-core scientific computing platforms, to distributed systems comprising thousands of interconnected processors, parallel programming of one form or another is now the norm. Understanding how to efficiently parallelize code, however, is still an open problem, and the difficulties are exacerbated across heterogeneous processing, and especially at run time, when it is sometimes desirable to change the parallelization strategy to meet non-functional requirements (e.g., load balancing and power consumption). In this article, we investigate the use of a programming model based on series-parallel partial orders: computations are expressed as directed graphs that expose parallelization opportunities and necessary sequencing by construction. This programming model is suitable as an intermediate representation for higher-level languages. We then describe a model of computation for such a programming model that maps such graphs into a stack-based structure more amenable to hardware processing. We describe the formal small-step semantics for this model of computation and use this formal description to show that the model can be arbitrarily parallelized, at compile time and run time, with correct execution guaranteed by design. We empirically support this claim and evaluate parallelization benefits using a prototype open-source compiler, targeting a message-passing many-core simulation. We empirically verify the correctness of arbitrary parallelization, supporting the validity of our formal semantics, analyze the distribution of operations within cores to understand the implementation impact of the paradigm, and assess execution time improvements when five micro-benchmarks are automatically and randomly parallelized across 2 × 2 and 4 × 4 multi-core configurations, resulting in execution-time decreases of up to 95% in the best case.
References
Book
01 Jan 1990
TL;DR: The updated new edition of the classic Introduction to Algorithms is intended primarily for use in undergraduate or graduate courses in algorithms or data structures and presents a rich variety of algorithms and covers them in considerable depth while making their design and analysis accessible to all levels of readers.
Abstract: From the Publisher: The updated new edition of the classic Introduction to Algorithms is intended primarily for use in undergraduate or graduate courses in algorithms or data structures. Like the first edition, this text can also be used for self-study by technical professionals since it discusses engineering issues in algorithm design as well as the mathematical aspects. In its new edition, Introduction to Algorithms continues to provide a comprehensive introduction to the modern study of algorithms. The revision has been updated to reflect changes in the years since the book's original publication. New chapters on the role of algorithms in computing and on probabilistic analysis and randomized algorithms have been included. Sections throughout the book have been rewritten for increased clarity, and material has been added wherever a fuller explanation has seemed useful or new information warrants expanded coverage. As in the classic first edition, this new edition of Introduction to Algorithms presents a rich variety of algorithms and covers them in considerable depth while making their design and analysis accessible to all levels of readers. Further, the algorithms are presented in pseudocode to make the book easily accessible to students from all programming language backgrounds. Each chapter presents an algorithm, a design technique, an application area, or a related topic. The chapters are not dependent on one another, so the instructor can organize his or her use of the book in the way that best suits the course's needs. Additionally, the new edition offers a 25% increase over the first edition in the number of problems, giving the book 155 problems and over 900 exercises that reinforce the concepts the students are learning.

21,651 citations

01 Jan 2005

19,250 citations

Journal ArticleDOI
TL;DR: This work presents a new coarsening heuristic (called the heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement, and presents a much faster variation of the Kernighan--Lin (KL) algorithm for refinement during uncoarsening.
Abstract: Recently, a number of researchers have investigated a class of graph partitioning algorithms that reduce the size of the graph by collapsing vertices and edges, partition the smaller graph, and then uncoarsen it to construct a partition for the original graph [Bui and Jones, Proc. of the 6th SIAM Conference on Parallel Processing for Scientific Computing, 1993, 445--452; Hendrickson and Leland, A Multilevel Algorithm for Partitioning Graphs, Tech. report SAND 93-1301, Sandia National Laboratories, Albuquerque, NM, 1993]. From the early work it was clear that multilevel techniques held great promise; however, it was not known if they can be made to consistently produce high quality partitions for graphs arising in a wide range of application domains. We investigate the effectiveness of many different choices for all three phases: coarsening, partition of the coarsest graph, and refinement. In particular, we present a new coarsening heuristic (called heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement. We also present a much faster variation of the Kernighan--Lin (KL) algorithm for refining during uncoarsening. We test our scheme on a large number of graphs arising in various domains including finite element methods, linear programming, VLSI, and transportation. Our experiments show that our scheme produces partitions that are consistently better than those produced by spectral partitioning schemes in substantially smaller time. Also, when our scheme is used to compute fill-reducing orderings for sparse matrices, it produces orderings that have substantially smaller fill than the widely used multiple minimum degree algorithm.
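The heavy-edge idea behind the coarsening phase described above is easy to sketch. The following C++ fragment is illustrative only, not the METIS implementation, and the adjacency-list representation is a placeholder: each unmatched vertex is paired with the unmatched neighbour reachable over its heaviest incident edge, and matched pairs are then collapsed into single vertices of the coarser graph.

```cpp
// Hedged sketch of heavy-edge matching for the coarsening phase
// (illustrative only, not the METIS implementation).
#include <vector>
#include <numeric>

struct Edge { int to; double weight; };
using Graph = std::vector<std::vector<Edge>>;   // adjacency list

std::vector<int> heavy_edge_matching(const Graph& g) {
    int n = static_cast<int>(g.size());
    std::vector<int> match(n, -1);              // -1 means "still unmatched"
    std::vector<int> order(n);
    std::iota(order.begin(), order.end(), 0);   // a random permutation is typical;
                                                // identity order keeps the sketch short
    for (int v : order) {
        if (match[v] != -1) continue;
        int best = -1;
        double best_w = -1.0;
        for (const Edge& e : g[v]) {            // pick the heaviest edge leading to an
            if (match[e.to] == -1 && e.to != v && e.weight > best_w) {
                best = e.to;                    // unmatched neighbour
                best_w = e.weight;
            }
        }
        if (best != -1) { match[v] = best; match[best] = v; }
        else            { match[v] = v; }       // no partner: vertex maps to itself
    }
    return match;                               // pairs (v, match[v]) collapse into
}                                               // one vertex of the coarser graph
```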

5,629 citations


"DisGCo: A Compiler for Distributed ..." refers background in this paper

  • ...There are also standalone tools that partition the graphs for later use [29, 50]....

    [...]

Proceedings ArticleDOI
06 Jun 2010
TL;DR: A model for processing large graphs, designed for efficient, scalable, and fault-tolerant implementation on clusters of thousands of commodity computers; its implied synchronicity makes reasoning about programs easier.
Abstract: Many practical computing problems concern large graphs. Standard examples include the Web graph and various social networks. The scale of these graphs - in some cases billions of vertices, trillions of edges - poses challenges to their efficient processing. In this paper we present a computational model suitable for this task. Programs are expressed as a sequence of iterations, in each of which a vertex can receive messages sent in the previous iteration, send messages to other vertices, and modify its own state and that of its outgoing edges or mutate graph topology. This vertex-centric approach is flexible enough to express a broad set of algorithms. The model has been designed for efficient, scalable and fault-tolerant implementation on clusters of thousands of commodity computers, and its implied synchronicity makes reasoning about programs easier. Distribution-related details are hidden behind an abstract API. The result is a framework for processing large graphs that is expressive and easy to program.
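The vertex-centric, superstep-based model described above can be illustrated with a stripped-down, single-process C++ sketch (not Pregel's actual API; the shortest-paths use case, unit edge weights, and data layout are assumptions). In each superstep a vertex folds incoming messages into its value and sends messages along its out-edges only when the value improved; the computation halts when no messages remain in flight.

```cpp
// Hedged sketch of the vertex-centric superstep model (not Pregel's API),
// shown for single-source shortest paths with unit edge weights.
#include <vector>
#include <limits>
#include <algorithm>
#include <utility>
#include <cstddef>

struct Vertex {
    double value = std::numeric_limits<double>::infinity();
    std::vector<int> out_edges;
};

void run_supersteps(std::vector<Vertex>& vs, int source) {
    std::vector<std::vector<double>> inbox(vs.size()), outbox(vs.size());
    inbox[source].push_back(0.0);                       // kick off the computation

    bool messages_in_flight = true;
    while (messages_in_flight) {                        // one iteration = one superstep
        for (std::size_t v = 0; v < vs.size(); ++v) {
            if (inbox[v].empty()) continue;             // halted until a message arrives
            double old = vs[v].value;
            for (double m : inbox[v])                   // combine incoming messages
                vs[v].value = std::min(vs[v].value, m);
            inbox[v].clear();
            if (vs[v].value < old)                      // improved: notify out-neighbours
                for (int u : vs[v].out_edges)
                    outbox[u].push_back(vs[v].value + 1.0);
        }
        messages_in_flight = false;                     // barrier: deliver messages for
        for (std::size_t v = 0; v < vs.size(); ++v) {   // the next superstep
            if (!outbox[v].empty()) messages_in_flight = true;
            std::swap(inbox[v], outbox[v]);
            outbox[v].clear();
        }
    }
}
```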

3,840 citations


"DisGCo: A Compiler for Distributed ..." refers background in this paper

  • ...Many works [12, 17, 23, 25, 30, 37, 45] in the literature exploit the efficient BSP model to bring out abstractions for programming, as well...

    [...]

  • ...For example, of the 27 programs in the Green-Marl repository, only 7 could be compiled by the existing Pregel backend....

    [...]

  • ...Even though the Pregel backend can be used to compile Green-Marl programs to be run on distributed systems, the backend can only translate programs in Pregel canonical form [28]: a small subset of possible Green-Marl programs....

    [...]

  • ...graph algorithms using traditional general-purpose high-level languages (for example, C++, Java, and so on), researchers have proposed languages/frameworks/libraries such as GraphLab [35], PowerGraph [20], Gemini [58], Pregel [37], Green-Marl [27], and DH-Falcon [13] that provide different APIs for writing parallel graph algorithms....

    [...]


Proceedings ArticleDOI
08 Oct 2012
TL;DR: This paper describes the challenges of computation on natural graphs in the context of existing graph-parallel abstractions and introduces the PowerGraph abstraction which exploits the internal structure of graph programs to address these challenges.
Abstract: Large-scale graph-structured computation is central to tasks ranging from targeted advertising to natural language processing and has led to the development of several graph-parallel abstractions including Pregel and GraphLab. However, the natural graphs commonly found in the real world have highly skewed power-law degree distributions, which challenge the assumptions made by these abstractions, limiting performance and scalability. In this paper, we characterize the challenges of computation on natural graphs in the context of existing graph-parallel abstractions. We then introduce the PowerGraph abstraction, which exploits the internal structure of graph programs to address these challenges. Leveraging the PowerGraph abstraction, we introduce a new approach to distributed graph placement and representation that exploits the structure of power-law graphs. We provide a detailed analysis and experimental evaluation comparing PowerGraph to two popular graph-parallel systems. Finally, we describe three different implementation strategies for PowerGraph and discuss their relative merits with empirical evaluations on large-scale real-world problems demonstrating order-of-magnitude gains.
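The Gather, Apply, Scatter decomposition referenced in the citation snippets below can be sketched for one PageRank iteration. This is a minimal single-machine illustration, not PowerGraph's actual API; the graph layout, damping factor, and omission of an explicit scatter step are assumptions made to keep the sketch short. The point of the split is that the gather step is an associative sum, so a runtime can spread a high-degree vertex's edge work across machines.

```cpp
// Hedged sketch of the gather-apply-scatter (GAS) decomposition
// (single-machine illustration, not PowerGraph's actual API).
#include <vector>
#include <cstddef>

struct Graph {
    std::vector<std::vector<int>> in_nbrs;   // incoming neighbours of each vertex
    std::vector<std::vector<int>> out_nbrs;  // outgoing neighbours of each vertex
};

// One PageRank iteration expressed as gather / apply / scatter steps.
// Assumes a consistent graph, so every in-neighbour has at least one out-edge.
void pagerank_gas_step(const Graph& g, std::vector<double>& rank) {
    const double damping = 0.85;
    std::vector<double> acc(rank.size(), 0.0);

    // Gather: accumulate contributions over in-edges (an associative sum, so a
    // runtime could split it across the machines holding a vertex's edges).
    for (std::size_t v = 0; v < rank.size(); ++v)
        for (int u : g.in_nbrs[v])
            acc[v] += rank[u] / static_cast<double>(g.out_nbrs[u].size());

    // Apply: update each vertex value from its gathered accumulator.
    for (std::size_t v = 0; v < rank.size(); ++v)
        rank[v] = (1.0 - damping) + damping * acc[v];

    // Scatter: a real GAS system would signal out-neighbours whose values may
    // now change; omitted here because every vertex is recomputed in the next
    // full iteration anyway.
}
```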

1,710 citations


"DisGCo: A Compiler for Distributed ..." refers background in this paper


  • ...There are many frameworks [17, 20, 34, 38, 43, 44, 46, 47, 58] that help encode different types of graph algorithms for distributed systems....

    [...]

  • ...graph algorithms using traditional general-purpose high-level languages (for example, C++, Java, and so on), researchers have proposed languages/frameworks/libraries such as GraphLab [35], PowerGraph [20], Gemini [58], Pregel [37], Green-Marl [27], and DH-Falcon [13] that provide different APIs for writing parallel graph algorithms....

    [...]

  • ...PowerGraph [20] focuses on the challenges of power-law graphs where the programmer needs to provide the implementations for Gather, Apply, and Scatter functions to code any graph algorithm....

    [...]