Journal ArticleDOI

DisGCo: A Compiler for Distributed Graph Analytics

30 Sep 2020-ACM Transactions on Architecture and Code Optimization (Association for Computing Machinery (ACM))-Vol. 17, Iss: 4, pp 1-26
TL;DR: DisGCo is the first graph DSL compiler that can handle all syntactic capabilities of a practical graph DSL like Green-Marl and generate code that can run on distributed systems.
Abstract: Graph algorithms are widely used in various applications. Their programmability and performance have garnered a lot of interest among researchers. Being able to run these graph analytics programs on distributed systems is an important requirement. Green-Marl is a popular Domain Specific Language (DSL) for coding graph algorithms and is known for its simplicity. However, the existing Green-Marl compiler for distributed systems (Green-Marl to Pregel) can only compile limited types of Green-Marl programs (those in Pregel canonical form). This severely restricts the types of parallel Green-Marl programs that can be executed on distributed systems. We present DisGCo, the first compiler to translate any general Green-Marl program to an equivalent MPI program that can run on distributed systems. Translating Green-Marl programs to MPI (SPMD/MPMD style of computation, distributed memory) presents many other challenges besides the issues related to differences in syntax, as Green-Marl gives the programmer a unified view of the whole memory and allows parallel and serial code to be intermixed. We first present the set of challenges involved in translating Green-Marl programs to MPI and then present a systematic approach to do the translation. We also present a few optimization techniques to improve the performance of our generated programs. DisGCo is the first graph DSL compiler that can handle all syntactic capabilities of a practical graph DSL like Green-Marl and generate code that can run on distributed systems. Our preliminary evaluation of DisGCo shows that our generated programs are scalable. Further, compared to the state-of-the-art DH-Falcon compiler that translates a subset of Falcon programs to MPI, our generated codes exhibit a geomean speedup of 17.32×.
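The core difficulty the abstract describes is that Green-Marl exposes a single, shared view of the graph while MPI is distributed-memory SPMD. The sketch below only illustrates that gap and is not DisGCo's actual generated code: a Green-Marl-style Foreach over the nodes with a sum reduction, lowered by hand to MPI under an assumed block partition of the nodes (the node count, partitioning, and degree array are placeholders).

```cpp
// Illustrative sketch only (not DisGCo's output): a Green-Marl-style
//   Foreach (n: G.Nodes) { sum += degree of n; }
// lowered by hand to SPMD MPI, where each rank owns a contiguous slice of nodes.
#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const long num_nodes = 1000000;                 // global node count (assumed)
    long begin = rank * num_nodes / nprocs;         // this rank's node range
    long end   = (rank + 1) * num_nodes / nprocs;

    // Each rank holds only the degrees of the nodes it owns (placeholder values).
    std::vector<int> degree(end - begin, 1);

    long local_sum = 0;
    for (long n = begin; n < end; ++n)              // body of the "parallel" loop
        local_sum += degree[n - begin];

    long global_sum = 0;                            // the Green-Marl sum reduction
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) std::printf("sum of degrees = %ld\n", global_sum);
    MPI_Finalize();
    return 0;
}
```

In real generated code, node ownership, remote property accesses, and the interleaving of serial and parallel regions all have to be handled by the compiler; the sketch fixes them by hand.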
Citations
Journal ArticleDOI
TL;DR: StarPlat is a graph DSL that lets programmers specify graph algorithms in a high-level format while generating code for three different backends from the same algorithmic specification.
Abstract: Graphs model several real-world phenomena. With the growth of unstructured and semi-structured data, parallelization of graph algorithms is inevitable. Unfortunately, due to the inherent irregularity of computation, memory access, and communication, graph algorithms are traditionally challenging to parallelize. To tame this challenge, several libraries, frameworks, and domain-specific languages (DSLs) have been proposed to reduce the parallel programming burden of the users, who are often domain experts. However, existing frameworks to model graph algorithms typically target a single architecture. In this paper, we present a graph DSL, named StarPlat, that allows programmers to specify graph algorithms in a high-level format, but generates code for three different backends from the same algorithmic specification. In particular, the DSL compiler generates OpenMP for multi-core, MPI for distributed, and CUDA for many-core GPUs. Since these three are completely different parallel programming paradigms, binding them together under the same language is challenging. We share our experience with the language design. Central to our compiler is an intermediate representation which allows a common representation of the high-level program, from which individual backend code generations begin. We demonstrate the expressiveness of StarPlat by specifying four graph algorithms: betweenness centrality computation, page rank computation, single-source shortest paths, and triangle counting. We illustrate the effectiveness of our approach by comparing the performance of the generated codes with that obtained with hand-crafted library codes. We find that the generated code is competitive with library-based codes in many cases. More importantly, we show the feasibility of generating efficient codes for different target architectures from the same algorithmic specification of graph algorithms.
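To make the "one specification, three backends" idea concrete, here is a hedged sketch (not StarPlat's actual output) of how a single per-node update could be lowered for the multi-core backend, with comments noting how the distributed and GPU lowerings would differ.

```cpp
// Illustrative sketch (not StarPlat's generated code): the same per-node update,
// written once, lowered to an OpenMP loop for multi-core; the MPI and CUDA
// lowerings of the same specification are described in the comments.
#include <omp.h>
#include <vector>

void relax_all(std::vector<int>& dist, const std::vector<int>& update) {
    // OpenMP backend: one shared array, loop iterations split across threads.
    #pragma omp parallel for
    for (long n = 0; n < (long)dist.size(); ++n)
        if (update[n] < dist[n]) dist[n] = update[n];

    // An MPI backend would instead give each rank a slice of dist and exchange
    // boundary updates via messages; a CUDA backend would launch one thread per n.
}

int main() {
    std::vector<int> dist(8, 100), update(8, 1);   // toy data
    relax_all(dist, update);
    return 0;
}
```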
Journal ArticleDOI
TL;DR: A novel backend for the Open Neural Network Compiler (ONNC) is proposed that exploits machine learning to optimize code for the ARM Cortex-M device.
Abstract: The diversity of software and hardware forces programmers to spend a great deal of time optimizing their source code, which often requires specific treatment for each platform. The problem becomes critical on embedded devices, where computational and memory resources are strictly constrained. Compilers play an essential role in deploying source code on a target device through the backend. In this work, a novel backend for the Open Neural Network Compiler (ONNC) is proposed, which exploits machine learning to optimize code for the ARM Cortex-M device. The backend requires minimal changes to Open Neural Network Exchange (ONNX) models. Several novel optimization techniques are also incorporated in the backend, such as quantizing the ONNX model's weights and automatically tuning the dimensions of operators in computations. The performance of the proposed framework is evaluated for two applications: handwritten digit recognition on the Modified National Institute of Standards and Technology (MNIST) dataset and model, and image classification on the Canadian Institute For Advanced Research 10 (CIFAR-10) dataset with the AlexNet-Light model. The system achieves 98.90% and 90.55% accuracy for handwritten digit recognition and image classification, respectively. Furthermore, the proposed architecture is significantly more lightweight than other state-of-the-art models in terms of both computation time and generated source code complexity. From the system perspective, this work provides a novel approach to deploying computations from available ONNX models directly to target devices through compiler optimization while maintaining high accuracy.
Journal ArticleDOI
TL;DR: The authors investigate a programming model based on series-parallel partial orders: computations are expressed as directed graphs that expose parallelization opportunities and necessary sequencing by construction.
Abstract: The number of processing elements per solution is growing. From embedded devices now employing (often heterogeneous) multi-core processors, across many-core scientific computing platforms, to distributed systems comprising thousands of interconnected processors, parallel programming of one form or another is now the norm. Understanding how to efficiently parallelize code, however, is still an open problem, and the difficulties are exacerbated across heterogeneous processing, and especially at run time, when it is sometimes desirable to change the parallelization strategy to meet non-functional requirements (e.g., load balancing and power consumption). In this article, we investigate the use of a programming model based on series-parallel partial orders: computations are expressed as directed graphs that expose parallelization opportunities and necessary sequencing by construction. This programming model is suitable as an intermediate representation for higher-level languages. We then describe a model of computation for such a programming model that maps such graphs into a stack-based structure more amenable to hardware processing. We describe the formal small-step semantics for this model of computation and use this formal description to show that the model can be arbitrarily parallelized, at compile and runtime, with correct execution guaranteed by design. We empirically support this claim and evaluate parallelization benefits using a prototype open-source compiler, targeting a message-passing many-core simulation. We empirically verify the correctness of arbitrary parallelization, supporting the validity of our formal semantics, analyze the distribution of operations within cores to understand the implementation impact of the paradigm, and assess execution-time improvements when five micro-benchmarks are automatically and randomly parallelized across 2 × 2 and 4 × 4 multi-core configurations, resulting in execution-time decreases of up to 95% in the best case.
References
Journal ArticleDOI
01 Sep 2011
TL;DR: A parallel library written with message-passing (MPI) calls that allows algorithms to be expressed in the MapReduce paradigm, enabling terabyte-scale data sets to be processed on traditional MPI-based clusters; the results are compared with non-MapReduce algorithms, different machines, and different MapReduce software.
Abstract: We describe a parallel library written with message-passing (MPI) calls that allows algorithms to be expressed in the MapReduce paradigm. This means the calling program does not need to include explicit parallel code, but instead provides "map" and "reduce" functions that operate independently on elements of a data set distributed across processors. The library performs needed data movement between processors. We describe how typical MapReduce functionality can be implemented in an MPI context, and also in an out-of-core manner for data sets that do not fit within the aggregate memory of a parallel machine. Our motivation for creating this library was to enable graph algorithms to be written as MapReduce operations, allowing processing of terabyte-scale data sets on traditional MPI-based clusters. We outline MapReduce versions of several such algorithms: vertex ranking via PageRank, triangle finding, connected component identification, Luby's algorithm for maximal independent sets, and single-source shortest-path calculation. To test the algorithms on arbitrarily large artificial graphs we generate randomized R-MAT matrices in parallel; a MapReduce version of this operation is also described. Performance and scalability results for the various algorithms are presented for varying size graphs on a distributed-memory cluster. For some cases, we compare the results with non-MapReduce algorithms, different machines, and different MapReduce software, namely Hadoop. Our open-source library is written in C++, is callable from C++, C, Fortran, or scripting languages such as Python, and can run on any parallel platform that supports MPI.
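The following sketch illustrates the map/shuffle/reduce pattern the abstract describes directly on top of MPI. It uses plain MPI collectives rather than the MR-MPI library's own API, and the int-only key/value encoding and the toy key space are assumptions made for brevity.

```cpp
// Minimal MapReduce-over-MPI sketch (not the MR-MPI library API): each rank
// "maps" local data into (key, value) pairs, the pairs are shuffled so that
// key k lands on rank k % nprocs, and each rank "reduces" what it received.
#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    // --- map: emit (key, value) pairs from local data (synthetic here) ---
    std::vector<std::vector<int>> outbox(nprocs);   // pairs grouped by target rank
    for (int i = 0; i < 8; ++i) {
        int key = (rank * 8 + i) % 5, value = 1;    // 5 distinct keys, count of 1
        int target = key % nprocs;
        outbox[target].push_back(key);
        outbox[target].push_back(value);
    }

    // --- shuffle: exchange pair counts, then the pairs themselves ---
    std::vector<int> sendcounts(nprocs), recvcounts(nprocs);
    for (int p = 0; p < nprocs; ++p) sendcounts[p] = (int)outbox[p].size();
    MPI_Alltoall(sendcounts.data(), 1, MPI_INT,
                 recvcounts.data(), 1, MPI_INT, MPI_COMM_WORLD);

    std::vector<int> sdispls(nprocs, 0), rdispls(nprocs, 0), sendbuf;
    for (int p = 0; p < nprocs; ++p) {
        sdispls[p] = (int)sendbuf.size();
        sendbuf.insert(sendbuf.end(), outbox[p].begin(), outbox[p].end());
        if (p > 0) rdispls[p] = rdispls[p - 1] + recvcounts[p - 1];
    }
    std::vector<int> recvbuf(rdispls[nprocs - 1] + recvcounts[nprocs - 1]);
    MPI_Alltoallv(sendbuf.data(), sendcounts.data(), sdispls.data(), MPI_INT,
                  recvbuf.data(), recvcounts.data(), rdispls.data(), MPI_INT,
                  MPI_COMM_WORLD);

    // --- reduce: sum the values of each key this rank now owns ---
    std::vector<long> count(5, 0);
    for (size_t i = 0; i + 1 < recvbuf.size(); i += 2)
        count[recvbuf[i]] += recvbuf[i + 1];
    for (int k = 0; k < 5; ++k)
        if (k % nprocs == rank) std::printf("key %d -> %ld\n", k, count[k]);

    MPI_Finalize();
    return 0;
}
```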

208 citations


"DisGCo: A Compiler for Distributed ..." refers background in this paper

  • ...There are many frameworks [17, 20, 34, 38, 43, 44, 46, 47, 58] that help encode different types of graph algorithms for distributed systems....


  • ...[43] proposed a library called MR-MPI, which helps to write MPI programs for graphs in Map-Reduce format....


Proceedings ArticleDOI
Jung-Won Kim, Sangmin Seo, Jun Lee, Jeongho Nah, Gangwon Jo, Jaejin Lee
25 Jun 2012
TL;DR: It is shown that the original OpenCL semantics naturally fits the heterogeneous cluster programming environment, and the framework achieves high performance and ease of programming.
Abstract: In this paper, we propose SnuCL, an OpenCL framework for heterogeneous CPU/GPU clusters. We show that the original OpenCL semantics naturally fits to the heterogeneous cluster programming environment, and the framework achieves high performance and ease of programming. The target cluster architecture consists of a designated, single host node and many compute nodes. They are connected by an interconnection network, such as Gigabit Ethernet and InfiniBand switches. Each compute node is equipped with multicore CPUs and multiple GPUs. A set of CPU cores or each GPU becomes an OpenCL compute device. The host node executes the host program in an OpenCL application. SnuCL provides a system image running a single operating system instance for heterogeneous CPU/GPU clusters to the user. It allows the application to utilize compute devices in a compute node as if they were in the host node. No communication API, such as the MPI library, is required in the application source. SnuCL also provides collective communication extensions to OpenCL to facilitate manipulating memory objects. With SnuCL, an OpenCL application becomes portable not only between heterogeneous devices in a single node, but also between compute devices in the cluster environment. We implement SnuCL and evaluate its performance using eleven OpenCL benchmark applications.
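The point about "no communication API in the application source" can be seen from ordinary OpenCL host code such as the sketch below. This is plain OpenCL 1.2 host code, not SnuCL-specific; under SnuCL the enumerated devices may physically reside on remote compute nodes, yet the application code is unchanged.

```cpp
// Ordinary OpenCL host code: enumerate all devices of a platform and build a
// context over them. With SnuCL, devices on remote cluster nodes appear here
// as if they were attached to the host node; no MPI calls are needed.
#include <CL/cl.h>
#include <vector>
#include <cstdio>

int main() {
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, nullptr);

    // Query how many devices the platform exposes, then fetch their IDs.
    cl_uint ndev = 0;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 0, nullptr, &ndev);
    std::vector<cl_device_id> devices(ndev);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, ndev, devices.data(), nullptr);

    // One context spanning every device the platform reports.
    cl_int err = 0;
    cl_context ctx = clCreateContext(nullptr, ndev, devices.data(),
                                     nullptr, nullptr, &err);
    std::printf("context over %u devices, err=%d\n", ndev, err);
    clReleaseContext(ctx);
    return 0;
}
```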

190 citations


"DisGCo: A Compiler for Distributed ..." refers background in this paper

  • ...SnuCL [31] extends OpenCL to admit programs that can be run on heterogeneous systems....


Proceedings ArticleDOI
Jim Gray, Raymond A. Lorie, G. R. Putzolu
22 Sep 1975
TL;DR: This paper proposes a locking protocol that associates locks with sets of resources and allows simultaneous locking at various granularities by different transactions, based on the introduction of additional lock modes besides the conventional share and exclusive modes.
Abstract: This paper proposes a locking protocol which associates locks with sets of resources. This protocol allows simultaneous locking at various granularities by different transactions. It is based on the introduction of additional lock modes besides the conventional share mode and exclusive mode. The protocol is generalized from simple hierarchies of locks to directed acyclic graphs of locks and to dynamic graphs of locks. The issues of scheduling and granting conflicting requests for the same resource are then discussed. Lastly, these ideas are compared with the lock mechanisms provided by existing data management systems.
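A minimal sketch of the key mechanism: intention modes taken on ancestors in the resource hierarchy let coarse locks coexist safely with fine-grained ones. The compatibility matrix below for IS, IX, S, SIX, and X follows the classic table for the modes the paper introduces; the surrounding checking code is only illustrative.

```cpp
// Compatibility check for hierarchical lock modes: intention-shared (IS),
// intention-exclusive (IX), shared (S), shared + intention-exclusive (SIX),
// and exclusive (X).
#include <cstdio>

enum Mode { IS, IX, S, SIX, X };

// compat[held][requested] == true means the new request can be granted.
const bool compat[5][5] = {
    //            IS     IX     S      SIX    X
    /* IS  */ { true,  true,  true,  true,  false },
    /* IX  */ { true,  true,  false, false, false },
    /* S   */ { true,  false, true,  false, false },
    /* SIX */ { true,  false, false, false, false },
    /* X   */ { false, false, false, false, false },
};

int main() {
    // A transaction scanning a whole file (S on the file) is compatible with one
    // that only intends to read individual records (IS on the file) ...
    std::printf("S vs IS: %s\n", compat[S][IS] ? "grant" : "wait");
    // ... but not with one that intends to update records (IX on the file).
    std::printf("S vs IX: %s\n", compat[S][IX] ? "grant" : "wait");
    return 0;
}
```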

177 citations


"DisGCo: A Compiler for Distributed ..." refers background in this paper

  • ...Optimizing the overheads of synchronization is a hot area of research to improve the performance of parallel programs [14, 21, 22, 55, 56]....


Proceedings Article
08 Jul 2015
TL;DR: Grappa enables users to program a cluster as if it were a single, large, non-uniform memory access (NUMA) machine, and addresses deficiencies of previous DSM systems by exploiting application parallelism, trading off latency for throughput.
Abstract: We present Grappa, a modern take on software distributed shared memory (DSM) for in-memory data-intensive applications. Grappa enables users to program a cluster as if it were a single, large, non-uniform memory access (NUMA) machine. Performance scales up even for applications that have poor locality and input-dependent load distribution. Grappa addresses deficiencies of previous DSM systems by exploiting application parallelism, trading off latency for throughput. We evaluate Grappa with an in-memory MapReduce framework (10× faster than Spark [74]); a vertex-centric framework inspired by GraphLab (1.33× faster than native GraphLab [48]); and a relational query execution engine (12.5× faster than Shark [31]). All these frameworks required only 60-690 lines of Grappa code.
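A toy sketch of the PGAS-style view Grappa provides, written here with plain MPI one-sided operations rather than Grappa's API: a "global address" is an (owner rank, local offset) pair (the GlobalAddr helper is hypothetical), and a remote read turns into a one-sided get against the owner's exposed memory.

```cpp
// Toy "cluster as one big NUMA machine" sketch using MPI RMA, not Grappa's API.
#include <mpi.h>
#include <cstdio>

struct GlobalAddr { int owner; MPI_Aint offset; };   // hypothetical helper type

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    // Each rank exposes a small local array through an RMA window.
    long local[4] = {rank * 10, rank * 10 + 1, rank * 10 + 2, rank * 10 + 3};
    MPI_Win win;
    MPI_Win_create(local, sizeof(local), sizeof(long), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    // Read element 2 of the last rank's array, regardless of where it lives.
    GlobalAddr addr{nprocs - 1, 2};
    long value = 0;
    MPI_Win_lock(MPI_LOCK_SHARED, addr.owner, 0, win);
    MPI_Get(&value, 1, MPI_LONG, addr.owner, addr.offset, 1, MPI_LONG, win);
    MPI_Win_unlock(addr.owner, win);

    if (rank == 0) std::printf("remote value = %ld\n", value);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```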

174 citations


"DisGCo: A Compiler for Distributed ..." refers background in this paper

  • ...There are many frameworks [17, 20, 34, 38, 43, 44, 46, 47, 58] that help encode different types of graph algorithms for distributed systems....


Proceedings ArticleDOI
01 Aug 1995
TL;DR: Novel compiler optimizations for reducing synchronization overhead in compiler-parallelized scientific codes are presented; computation partitions and data communication are represented as systems of symbolic linear inequalities for high flexibility and precision.
Abstract: This paper presents novel compiler optimizations for reducing synchronization overhead in compiler-parallelized scientific codes. A hybrid programming model is employed to combine the flexibility of the fork-join model with the precision and power of the single-program, multiple-data (SPMD) model. By exploiting compile-time computation partitions, communication analysis can eliminate barrier synchronization or replace it with less expensive forms of synchronization. We show that computation partitions and data communication can be represented as systems of symbolic linear inequalities for high flexibility and precision. These optimizations have been implemented in the Stanford SUIF compiler. We extensively evaluate their performance using standard benchmark suites. Experimental results show barrier synchronization is reduced by 29% on average and by several orders of magnitude for certain programs.
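To illustrate the optimization (this is not the SUIF implementation, and it is phrased in MPI for consistency with the rest of this page): when communication analysis shows that only neighboring processes exchange data across a phase boundary, the all-process barrier can be replaced by point-to-point synchronization with just those neighbors.

```cpp
// Barrier elimination sketch: a full barrier vs. synchronizing only with the
// left/right neighbors that actually share data (the neighbor pattern is assumed).
#include <mpi.h>

void phase_boundary_barrier() {
    // Conservative version: every process waits for every other process.
    MPI_Barrier(MPI_COMM_WORLD);
}

void phase_boundary_p2p(int rank, int nprocs) {
    // Optimized version: exchange a small token only with the ranks whose data
    // is consumed in the next phase (here, the left and right neighbors).
    int send_token = 0, recv_left = 0, recv_right = 0;
    MPI_Request reqs[4];
    int n = 0;
    if (rank > 0) {
        MPI_Isend(&send_token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD, &reqs[n++]);
        MPI_Irecv(&recv_left, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD, &reqs[n++]);
    }
    if (rank < nprocs - 1) {
        MPI_Isend(&send_token, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD, &reqs[n++]);
        MPI_Irecv(&recv_right, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD, &reqs[n++]);
    }
    MPI_Waitall(n, reqs, MPI_STATUSES_IGNORE);
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    phase_boundary_barrier();           // conservative synchronization
    phase_boundary_p2p(rank, nprocs);   // cheaper neighbor-only synchronization
    MPI_Finalize();
    return 0;
}
```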

144 citations


"DisGCo: A Compiler for Distributed ..." refers background in this paper

  • ...There have been many prior works [7, 10, 16, 52] that translate fork-join style code to SPMD code....
