Author

Hoang-Vu Dang

Bio: Hoang-Vu Dang is an academic researcher from the University of Illinois at Urbana-Champaign. The author has contributed to research in topics including CUDA and sparse matrix-vector multiplication. The author has an h-index of 11 and has co-authored 20 publications receiving 321 citations. Previous affiliations of Hoang-Vu Dang include Nanyang Technological University and the University of Mainz.

Papers
Proceedings ArticleDOI
11 Jun 2018
TL;DR: This paper introduces a new approach to building distributed-memory graph analytics systems that exploits heterogeneity in processor types (CPU and GPU), partitioning policies, and programming models. Its key component is Gluon, a communication-optimizing substrate that enables shared-memory graph applications to run on heterogeneous clusters and optimizes communication in a novel way.
Abstract: This paper introduces a new approach to building distributed-memory graph analytics systems that exploits heterogeneity in processor types (CPU and GPU), partitioning policies, and programming models. The key to this approach is Gluon, a communication-optimizing substrate. Programmers write applications in a shared-memory programming system of their choice and interface these applications with Gluon using a lightweight API. Gluon enables these programs to run on heterogeneous clusters and optimizes communication in a novel way by exploiting structural and temporal invariants of graph partitioning policies. To demonstrate Gluon’s ability to support different programming models, we interfaced Gluon with the Galois and Ligra shared-memory graph analytics systems to produce distributed-memory versions of these systems named D-Galois and D-Ligra, respectively. To demonstrate Gluon’s ability to support heterogeneous processors, we interfaced Gluon with IrGL, a state-of-the-art single-GPU system for graph analytics, to produce D-IrGL, the first multi-GPU distributed-memory graph analytics system. Our experiments were done on CPU clusters with up to 256 hosts and roughly 70,000 threads and on multi-GPU clusters with up to 64 GPUs. The communication optimizations in Gluon improve end-to-end application execution time by ∼2.6× on the average. D-Galois and D-IrGL scale well and are faster than Gemini, the state-of-the-art distributed CPU graph analytics system, by factors of ∼3.9× and ∼4.9×, respectively, on the average.
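The abstract above describes Gluon as a substrate that synchronizes partitioned graph state across hosts. As a rough illustration only (not Gluon's actual API; the struct and function names here are hypothetical), the sketch below shows the kind of mirror-to-master boundary synchronization such a substrate performs after each compute round, expressed with plain MPI.

```cpp
// Hypothetical sketch of one half of a boundary synchronization step: each
// host sends the values of its mirror nodes to the host that owns the master
// copy, which reduces them (min-reduce here, as in BFS/SSSP-style updates).
// The broadcast of reduced master values back to mirrors is omitted.
#include <mpi.h>
#include <cstdint>
#include <vector>

struct Partition {
  // For each remote host: local indices of mirror nodes whose master lives there.
  std::vector<std::vector<uint32_t>> mirrors_for_host;
  // For each remote host: local indices of our master nodes mirrored on that host.
  std::vector<std::vector<uint32_t>> masters_for_host;
};

void sync_min(std::vector<uint32_t>& dist, const Partition& p, MPI_Comm comm) {
  int nhosts = 0, me = 0;
  MPI_Comm_size(comm, &nhosts);
  MPI_Comm_rank(comm, &me);
  for (int h = 0; h < nhosts; ++h) {
    if (h == me) continue;
    // Pack mirror values destined for host h.
    std::vector<uint32_t> sendbuf;
    for (uint32_t v : p.mirrors_for_host[h]) sendbuf.push_back(dist[v]);
    // Receive count matches because h's mirrors of our masters correspond
    // one-to-one to masters_for_host[h].
    std::vector<uint32_t> recvbuf(p.masters_for_host[h].size());
    MPI_Sendrecv(sendbuf.data(), (int)sendbuf.size(), MPI_UINT32_T, h, 0,
                 recvbuf.data(), (int)recvbuf.size(), MPI_UINT32_T, h, 0,
                 comm, MPI_STATUS_IGNORE);
    // Reduce received mirror values into the local master copies.
    for (size_t i = 0; i < recvbuf.size(); ++i) {
      uint32_t m = p.masters_for_host[h][i];
      if (recvbuf[i] < dist[m]) dist[m] = recvbuf[i];
    }
  }
}
```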

125 citations

Journal ArticleDOI
TL;DR: This paper presents a distributed, physically based integrated flow model that couples two-dimensional overland flow and three-dimensional variably saturated sub-surface flow on a GPU-based (Graphics Processing Unit) parallel computing architecture; the results suggest the feasibility of GPU computing for fully distributed, physics-based hydrologic models over large areas.
Abstract: The widespread availability of high-resolution lidar data provides an opportunity to capture micro-topographic control on the partitioning and transport of water for incorporation in coupled surface and sub-surface flow modeling. However, large-scale simulations of integrated flow at the lidar data resolution are computationally expensive due to the density of the computational grid and the iterative nature of the algorithms for solving the nonlinearity. Here we present a distributed, physically based integrated flow model that couples two-dimensional overland flow and three-dimensional variably saturated sub-surface flow on a GPU-based (Graphics Processing Unit) parallel computing architecture. An Alternating Direction Implicit (ADI) scheme, modified for the GPU architecture, is used for the numerical solutions in both models. A boundary-condition switching approach is applied to partition potential water fluxes into actual fluxes for the coupling between the surface and sub-surface models. The algorithms are verified using five benchmark problems that have been widely adopted in the literature. This is followed by a large-scale simulation using lidar data. We demonstrate that the method is computationally efficient and produces physically consistent solutions; this computational efficiency suggests the feasibility of GPU computing for fully distributed, physics-based hydrologic models over large areas. Highlights: We present an integrated 2-D overland flow and 3-D sub-surface flow model. The model is implemented on a GPU-based parallel computing architecture. Widely adopted benchmark problems for model verification are presented. A large-scale simulation using high-resolution lidar topographic data is presented.
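For readers unfamiliar with how ADI maps onto a GPU: one common mapping reduces each ADI sweep to a large batch of independent tridiagonal solves, one per grid line, with one CUDA thread per line. The kernel below is a minimal sketch of that building block (the Thomas algorithm) under an assumed line-contiguous storage layout; it is not the paper's implementation.

```cpp
// Minimal batched tridiagonal solver: n independent systems of size m, stored
// line-contiguously. a = sub-diagonal, b = diagonal, c = super-diagonal,
// d = right-hand side (overwritten with the solution), cprime = scratch.
#include <cuda_runtime.h>

__global__ void thomas_batch(const double* a, const double* b, const double* c,
                             double* d, double* cprime, int n, int m) {
  int line = blockIdx.x * blockDim.x + threadIdx.x;
  if (line >= n) return;
  int off = line * m;

  // Forward elimination.
  cprime[off] = c[off] / b[off];
  d[off]      = d[off] / b[off];
  for (int i = 1; i < m; ++i) {
    double denom = b[off + i] - a[off + i] * cprime[off + i - 1];
    cprime[off + i] = c[off + i] / denom;
    d[off + i] = (d[off + i] - a[off + i] * d[off + i - 1]) / denom;
  }
  // Back substitution.
  for (int i = m - 2; i >= 0; --i)
    d[off + i] -= cprime[off + i] * d[off + i + 1];
}
```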

44 citations

Journal ArticleDOI
01 Nov 2013
TL;DR: A new format called Sliced COO (SCOO) and an efficient CUDA implementation that performs SpMV on the GPU using atomic operations are presented and compared to existing formats of the NVIDIA Cusp library on large sparse matrices.
Abstract: We propose the Sliced Coordinate Format (SCOO) for Sparse Matrix-Vector Multiplication on GPUs. An associated CUDA implementation that takes advantage of atomic operations is presented. We propose partitioning methods to transform a given sparse matrix into SCOO format. An efficient dual-GPU implementation that overlaps computation and communication is described. Extensive performance comparisons of SCOO with other formats on GPUs and CPUs are provided. Existing formats for Sparse Matrix-Vector Multiplication (SpMV) on the GPU outperform their corresponding implementations on multi-core CPUs. In this paper, we present a new format called Sliced COO (SCOO) and an efficient CUDA implementation to perform SpMV on the GPU using atomic operations. We compare SCOO performance to existing formats of the NVIDIA Cusp library using large sparse matrices. Our results for single-precision floating-point matrices show that SCOO outperforms the COO and CSR formats for all tested matrices and the HYB format for all tested unstructured matrices on a single GPU. Furthermore, our dual-GPU implementation achieves an efficiency of 94% on average. Due to the lower performance of existing CUDA-enabled GPUs for atomic operations on double-precision floating-point numbers, the SCOO implementation for double precision does not consistently outperform the other formats for every unstructured matrix. Overall, the average speedup of SCOO for the tested benchmark dataset is 3.33 (1.56) compared to CSR, 5.25 (2.42) compared to COO, and 2.39 (1.37) compared to HYB for single (double) precision on a Tesla C2075. Furthermore, a comparison to a Sandy Bridge CPU shows that SCOO on a Fermi GPU outperforms the multi-threaded CSR implementation of the Intel MKL library on an i7-2700K by a factor between 5.5 (2.3) and 18 (12.7) for single (double) precision. Source code is available at https://github.com/danghvu/cudaSpmv.
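The actual SCOO code is available at the link above; the kernel below is only a stripped-down sketch of the underlying idea of accumulating SpMV partial products with atomic additions over a COO slice. The real implementation sorts non-zeros into slices and accumulates in fast shared memory, whereas this simplified version accumulates directly in global memory.

```cpp
// Simplified atomic-based SpMV over a COO slice (illustrative only).
__global__ void spmv_coo_atomic(const int* row, const int* col,
                                const float* val, const float* x,
                                float* y, int nnz) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  // Grid-stride loop over the non-zeros of the slice; atomics resolve the
  // write conflicts that arise when several threads hit the same row.
  for (; i < nnz; i += gridDim.x * blockDim.x)
    atomicAdd(&y[row[i]], val[i] * x[col[i]]);
}
```

Atomics make the kernel insensitive to how non-zeros are distributed across rows, which is why the approach holds up on irregular, unstructured matrices.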

40 citations

Proceedings ArticleDOI
25 Sep 2016
TL;DR: It is shown that, by avoiding the use of wildcards in MPI, one can efficiently support millions of concurrently communicating light-weight threads using send-receive communication.
Abstract: We explore in this paper the advantages that accrue from avoiding the use of wildcards in MPI. We show that, with this change, one can efficiently support millions of concurrently communicating light-weight threads using send-receive communication.
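A minimal sketch of the wildcard-free style the paper argues for is shown below, assuming each light-weight thread is assigned its own MPI tag (an illustrative convention, not the paper's runtime): every receive names an explicit source and tag, so no MPI_ANY_SOURCE or MPI_ANY_TAG matching is required and messages can be steered directly to the waiting thread.

```cpp
// Wildcard-free send-receive between two logical threads on different ranks.
// Assumes MPI was initialized with MPI_Init_thread(..., MPI_THREAD_MULTIPLE, ...).
#include <mpi.h>

void exchange(int peer_rank, int my_thread_id, int peer_thread_id,
              const double* sendbuf, double* recvbuf, int count) {
  MPI_Request reqs[2];
  // Receive is addressed to *this* thread's tag from an explicit source.
  MPI_Irecv(recvbuf, count, MPI_DOUBLE, peer_rank, /*tag=*/my_thread_id,
            MPI_COMM_WORLD, &reqs[0]);
  // Send is addressed to the peer thread's tag.
  MPI_Isend(sendbuf, count, MPI_DOUBLE, peer_rank, /*tag=*/peer_thread_id,
            MPI_COMM_WORLD, &reqs[1]);
  MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}
```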

25 citations

Proceedings ArticleDOI
01 Sep 2019
TL;DR: This paper presents Gluon-Async, a lock-free, non-blocking, bulk-asynchronous runtime for distributed and heterogeneous graph analytics and the first asynchronous distributed GPU graph analytics system; programs written using BASP-style execution are on average ~1.5x faster than those in D-Galois and D-IrGL on real-world large-diameter graphs at scale.
Abstract: Distributed graph analytics systems for CPUs, like D-Galois and Gemini, and for GPUs, like D-IrGL and Lux, use a bulk-synchronous parallel (BSP) programming and execution model. BSP permits bulk-communication and uses large messages which are supported efficiently by current message transport layers, but bulk-synchronization can exacerbate the performance impact of load imbalance because a round cannot be completed until every host has completed that round. Asynchronous distributed graph analytics systems circumvent this problem by permitting hosts to make progress at their own pace, but existing systems either use global locks and send small messages or send large messages but do not support general partitioning policies such as vertex-cuts. Consequently, they perform substantially worse than bulk-synchronous systems. Moreover, none of their programming or execution models can be easily adapted for heterogeneous devices like GPUs. In this paper, we design and implement a lock-free, non-blocking, bulk-asynchronous runtime called Gluon-Async for distributed and heterogeneous graph analytics. The runtime supports any partitioning policy and uses bulk-communication. We present the bulk-asynchronous parallel (BASP) model which allows the programmer to utilize the runtime by specifying only the abstract communication required. Applications written in this model are compared with the BSP programs written using (1) D-Galois and D-IrGL, the state-of-the-art distributed graph analytics systems (which are bulk-synchronous) for CPUs and GPUs, respectively, and (2) Lux, another (bulk-synchronous) distributed GPU graph analytical system. Our evaluation shows that programs written using BASP-style execution are on average ~1.5x faster than those in D-Galois and D-IrGL on real-world large-diameter graphs at scale. They are also on average ~12x faster than Lux. To the best of our knowledge, Gluon-Async is the first asynchronous distributed GPU graph analytics system.
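To make the BSP/BASP distinction concrete, the skeleton below sketches a bulk-asynchronous host loop: rounds proceed at each host's own pace and updates are exchanged in bulk with non-blocking calls, with a periodic collective vote standing in for real quiescence detection. This is an illustrative simplification, not Gluon-Async's runtime.

```cpp
// BASP-style host loop (simplified): no per-round global barrier; a periodic
// non-blocking allreduce of "done" votes approximates termination detection.
#include <mpi.h>

void basp_loop() {
  int nhosts = 0;
  MPI_Comm_size(MPI_COMM_WORLD, &nhosts);

  while (true) {
    // 1. Perform a round of local graph computation on whatever boundary
    //    updates have arrived so far (application-specific, omitted).
    // 2. Send bulk boundary updates with MPI_Isend and drain received
    //    updates with MPI_Iprobe / MPI_Recv (also omitted).
    bool locally_done = /* no local work left and nothing new received */ true;

    // 3. Vote on termination without forcing hosts into lock-step rounds.
    int my_vote = locally_done ? 1 : 0, votes = 0, finished = 0;
    MPI_Request req;
    MPI_Iallreduce(&my_vote, &votes, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD, &req);
    while (!finished) {
      // A real runtime keeps computing and communicating while the
      // reduction is in flight; here we simply poll it.
      MPI_Test(&req, &finished, MPI_STATUS_IGNORE);
    }
    if (votes == nhosts) break;  // every host reported it was done
  }
}
```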

21 citations


Cited by
01 Jan 2002
TL;DR: In this article, the authors present 15 technical reports that were directly supported by the JSC grant and note that four staff members (Tesar, Tosunoglu, Hooper, and Freeman) have been involved in directing the research.
Abstract: This has been wonderful support for the University and I very much appreciate your role and others at JSC in giving us good advice and direction. As you can see from the table of contents, the program has covered a very broad range of topics. The 15 technical reports are those which were directly supported by the grant. Note also that 13 M.Sc. and 11 Ph.D. students have been directly involved in this effort. Finally, four staff members (Tesar, Tosunoglu, Hooper, and Freeman) have been involved and participated in the direction of the research.

195 citations

Journal ArticleDOI
24 Oct 2018
TL;DR: GraphIt, a new DSL for graph computations, is introduced; it generates fast implementations for algorithms with different performance characteristics running on graphs with different sizes and structures, and it outperforms the next fastest of six state-of-the-art shared-memory frameworks on 24 out of 32 experiments.
Abstract: The performance bottlenecks of graph applications depend not only on the algorithm and the underlying hardware, but also on the size and structure of the input graph. As a result, programmers must try different combinations of a large set of techniques, which make tradeoffs among locality, work-efficiency, and parallelism, to develop the best implementation for a specific algorithm and type of graph. Existing graph frameworks and domain specific languages (DSLs) lack flexibility, supporting only a limited set of optimizations. This paper introduces GraphIt, a new DSL for graph computations that generates fast implementations for algorithms with different performance characteristics running on graphs with different sizes and structures. GraphIt separates what is computed (algorithm) from how it is computed (schedule). Programmers specify the algorithm using an algorithm language, and performance optimizations are specified using a separate scheduling language. The algorithm language simplifies expressing the algorithms, while exposing opportunities for optimizations. We formulate graph optimizations, including edge traversal direction, data layout, parallelization, cache, NUMA, and kernel fusion optimizations, as tradeoffs among locality, parallelism, and work-efficiency. The scheduling language enables programmers to easily search through this complicated tradeoff space by composing together a large set of edge traversal, vertex data layout, and program structure optimizations. The separation of algorithm and schedule also enables us to build an autotuner on top of GraphIt to automatically find high-performance schedules. The compiler uses a new scheduling representation, the graph iteration space, to model, compose, and ensure the validity of the large number of optimizations. We evaluate GraphIt’s performance with seven algorithms on graphs with different structures and sizes. GraphIt outperforms the next fastest of six state-of-the-art shared-memory frameworks (Ligra, Green-Marl, GraphMat, Galois, Gemini, and Grazelle) on 24 out of 32 experiments by up to 4.8×, and is never more than 43% slower than the fastest framework on the other experiments. GraphIt also reduces the lines of code by up to an order of magnitude compared to the next fastest framework.
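GraphIt's algorithm and scheduling languages are DSLs of their own; the C++ analogue below only mirrors the separation the abstract describes. The edge-update "algorithm" is written once, and the traversal "schedule" (push versus pull here, a single point in GraphIt's much larger tradeoff space) is chosen independently.

```cpp
// Algorithm/schedule separation in plain C++ (not GraphIt syntax).
#include <cstddef>
#include <vector>

struct CSR {  // graph in compressed sparse row form
  std::vector<int> offsets, neighbors;
  int num_vertices() const { return (int)offsets.size() - 1; }
};

// Algorithm: one rank contribution along an edge, identical for any schedule.
inline void update(int src, int dst, const std::vector<double>& contrib,
                   std::vector<double>& next_rank) {
  next_rank[dst] += contrib[src];
}

// Schedule A: push, iterating over the out-edges of each source.
void apply_push(const CSR& out_graph, const std::vector<double>& contrib,
                std::vector<double>& next_rank) {
  for (int u = 0; u < out_graph.num_vertices(); ++u)
    for (int e = out_graph.offsets[u]; e < out_graph.offsets[u + 1]; ++e)
      update(u, out_graph.neighbors[e], contrib, next_rank);
}

// Schedule B: pull, iterating over the in-edges of each destination
// (better locality on next_rank and no write conflicts under parallelism).
void apply_pull(const CSR& in_graph, const std::vector<double>& contrib,
                std::vector<double>& next_rank) {
  for (int v = 0; v < in_graph.num_vertices(); ++v)
    for (int e = in_graph.offsets[v]; e < in_graph.offsets[v + 1]; ++e)
      update(in_graph.neighbors[e], v, contrib, next_rank);
}
```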

137 citations

Journal ArticleDOI
TL;DR: The proposed partitioning strategy balances the load distribution, and SpMV performance is significantly improved when a sparse matrix is partitioned into dense blocks using the presented method, which achieves the highest mean density compared with other strategies.
Abstract: This paper presents a sparse matrix partitioning strategy to improve the performance of SpMV on GPUs and multicore CPUs. The method has wide adaptability to different types of sparse matrices, unlike existing methods that adapt only to particular sparse matrices. In addition, our partitioning method obtains dense blocks by analyzing the probability distribution of non-zero elements in a sparse matrix, and it results in a very low proportion of zero padding. We make the following significant contributions. (1) We present a partitioning strategy for sparse matrices based on probabilistic modeling of the non-zero elements in a row. (2) We prove that our method has the highest mean density compared with other strategies for given partition ratios obtained from the computing powers of heterogeneous processors. (3) We develop a CPU-GPU hybrid parallel computing model for SpMV on GPUs and multicore CPUs in a heterogeneous computing platform. Our partitioning strategy achieves balanced load distribution, and the performance of SpMV is significantly improved when a sparse matrix is partitioned into dense blocks using our method. The average performance improvement of our solution for SpMV is about 15.75 percent on multicore CPUs, compared to that of the other solutions. By considering the rows of a matrix in a unique order based on the probability mass function of the number of non-zeros in a row, the average performance improvement of our solution for SpMV is about 33.52 percent on GPUs and multicore CPUs of a heterogeneous computing platform, compared to that of partitioning methods based on the original row order of the matrix.
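As a rough illustration of the row-ordering idea (not the paper's exact algorithm), the sketch below orders rows by their non-zero counts, so rows of similar density end up adjacent and form dense blocks, and then assigns rows to the GPU and CPU according to an assumed fraction of the non-zeros the GPU should handle.

```cpp
// Reorder rows by non-zero count and split them between GPU and CPU by a
// given work fraction (illustrative sketch; names and the split criterion
// are assumptions, not the paper's method).
#include <algorithm>
#include <numeric>
#include <vector>

struct Split { std::vector<int> gpu_rows, cpu_rows; };

Split partition_rows(const std::vector<int>& row_ptr,  // CSR row pointer
                     double gpu_fraction) {            // e.g. 0.8 of the nnz
  int n = (int)row_ptr.size() - 1;
  std::vector<int> order(n);
  std::iota(order.begin(), order.end(), 0);
  // Sort rows by non-zero count so adjacent rows form dense, uniform blocks.
  std::sort(order.begin(), order.end(), [&](int a, int b) {
    return (row_ptr[a + 1] - row_ptr[a]) > (row_ptr[b + 1] - row_ptr[b]);
  });
  // Assign rows to the GPU until it holds roughly gpu_fraction of all non-zeros.
  long long total_nnz = row_ptr[n], gpu_nnz = 0;
  Split s;
  for (int r : order) {
    if (gpu_nnz < gpu_fraction * total_nnz) {
      s.gpu_rows.push_back(r);
      gpu_nnz += row_ptr[r + 1] - row_ptr[r];
    } else {
      s.cpu_rows.push_back(r);
    }
  }
  return s;
}
```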

113 citations