Author

Hoang-Vu Dang

Bio: Hoang-Vu Dang is an academic researcher from the University of Illinois at Urbana-Champaign. The author has contributed to research in topics including CUDA and sparse matrix-vector multiplication. The author has an h-index of 11 and has co-authored 20 publications receiving 321 citations. Previous affiliations of Hoang-Vu Dang include Nanyang Technological University and the University of Mainz.

Papers
Proceedings ArticleDOI
11 Jun 2018
TL;DR: This paper introduces a new approach to building distributed-memory graph analytics systems that exploits heterogeneity in processor types (CPU and GPU), partitioning policies, and programming models. Its key component is Gluon, a communication-optimizing substrate that enables shared-memory graph applications to run on heterogeneous clusters and optimizes communication in a novel way.
Abstract: This paper introduces a new approach to building distributed-memory graph analytics systems that exploits heterogeneity in processor types (CPU and GPU), partitioning policies, and programming models. The key to this approach is Gluon, a communication-optimizing substrate. Programmers write applications in a shared-memory programming system of their choice and interface these applications with Gluon using a lightweight API. Gluon enables these programs to run on heterogeneous clusters and optimizes communication in a novel way by exploiting structural and temporal invariants of graph partitioning policies. To demonstrate Gluon’s ability to support different programming models, we interfaced Gluon with the Galois and Ligra shared-memory graph analytics systems to produce distributed-memory versions of these systems named D-Galois and D-Ligra, respectively. To demonstrate Gluon’s ability to support heterogeneous processors, we interfaced Gluon with IrGL, a state-of-the-art single-GPU system for graph analytics, to produce D-IrGL, the first multi-GPU distributed-memory graph analytics system. Our experiments were done on CPU clusters with up to 256 hosts and roughly 70,000 threads and on multi-GPU clusters with up to 64 GPUs. The communication optimizations in Gluon improve end-to-end application execution time by ∼2.6× on the average. D-Galois and D-IrGL scale well and are faster than Gemini, the state-of-the-art distributed CPU graph analytics system, by factors of ∼3.9× and ∼4.9×, respectively, on the average.
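The abstract above describes Gluon as a substrate that synchronizes partitioned graph state across hosts. As a rough illustration only (not Gluon's actual API; the struct and function names here are hypothetical), the sketch below shows the kind of mirror-to-master boundary synchronization such a substrate performs after each compute round, expressed with plain MPI.

```cpp
// Hypothetical sketch of one half of a boundary synchronization step: each
// host sends the values of its mirror nodes to the host that owns the master
// copy, which reduces them (min-reduce here, as in BFS/SSSP-style updates).
// The broadcast of reduced master values back to mirrors is omitted.
#include <mpi.h>
#include <cstdint>
#include <vector>

struct Partition {
  // For each remote host: local indices of mirror nodes whose master lives there.
  std::vector<std::vector<uint32_t>> mirrors_for_host;
  // For each remote host: local indices of our master nodes mirrored on that host.
  std::vector<std::vector<uint32_t>> masters_for_host;
};

void sync_min(std::vector<uint32_t>& dist, const Partition& p, MPI_Comm comm) {
  int nhosts = 0, me = 0;
  MPI_Comm_size(comm, &nhosts);
  MPI_Comm_rank(comm, &me);
  for (int h = 0; h < nhosts; ++h) {
    if (h == me) continue;
    // Pack mirror values destined for host h.
    std::vector<uint32_t> sendbuf;
    for (uint32_t v : p.mirrors_for_host[h]) sendbuf.push_back(dist[v]);
    // Receive count matches because h's mirrors of our masters correspond
    // one-to-one to masters_for_host[h].
    std::vector<uint32_t> recvbuf(p.masters_for_host[h].size());
    MPI_Sendrecv(sendbuf.data(), (int)sendbuf.size(), MPI_UINT32_T, h, 0,
                 recvbuf.data(), (int)recvbuf.size(), MPI_UINT32_T, h, 0,
                 comm, MPI_STATUS_IGNORE);
    // Reduce received mirror values into the local master copies.
    for (size_t i = 0; i < recvbuf.size(); ++i) {
      uint32_t m = p.masters_for_host[h][i];
      if (recvbuf[i] < dist[m]) dist[m] = recvbuf[i];
    }
  }
}
```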

125 citations

Journal ArticleDOI
TL;DR: This paper presents a distributed, physically based integrated flow model that couples two-dimensional overland flow and three-dimensional variably saturated sub-surface flow on a GPU-based (Graphics Processing Unit) parallel computing architecture; the results suggest the feasibility of GPU computing for fully distributed, physics-based hydrologic models over large areas.
Abstract: The widespread availability of high-resolution lidar data provides an opportunity to capture micro-topographic control on the partitioning and transport of water for incorporation in coupled surface and sub-surface flow modeling. However, large-scale simulations of integrated flow at the lidar data resolution are computationally expensive due to the density of the computational grid and the iterative nature of the algorithms for solving the nonlinearity. Here we present a distributed, physically based integrated flow model that couples two-dimensional overland flow and three-dimensional variably saturated sub-surface flow on a GPU-based (Graphics Processing Unit) parallel computing architecture. An Alternating Direction Implicit (ADI) scheme, modified for the GPU architecture, is used for the numerical solutions in both models. A boundary-condition switching approach is applied to partition potential water fluxes into actual fluxes for the coupling between the surface and sub-surface models. The algorithms are verified using five benchmark problems that have been widely adopted in the literature. This is followed by a large-scale simulation using lidar data. We demonstrate that the method is computationally efficient and produces physically consistent solutions; this computational efficiency suggests the feasibility of GPU computing for fully distributed, physics-based hydrologic models over large areas. Highlights: We present an integrated 2-D overland flow and 3-D sub-surface flow model. The model is implemented on a GPU-based parallel computing architecture. Widely adopted benchmark problems for model verification are presented. A large-scale simulation using high-resolution lidar topographic data is presented.
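For readers unfamiliar with how ADI maps onto a GPU: one common mapping reduces each ADI sweep to a large batch of independent tridiagonal solves, one per grid line, with one CUDA thread per line. The kernel below is a minimal sketch of that building block (the Thomas algorithm) under an assumed line-contiguous storage layout; it is not the paper's implementation.

```cpp
// Minimal batched tridiagonal solver: n independent systems of size m, stored
// line-contiguously. a = sub-diagonal, b = diagonal, c = super-diagonal,
// d = right-hand side (overwritten with the solution), cprime = scratch.
#include <cuda_runtime.h>

__global__ void thomas_batch(const double* a, const double* b, const double* c,
                             double* d, double* cprime, int n, int m) {
  int line = blockIdx.x * blockDim.x + threadIdx.x;
  if (line >= n) return;
  int off = line * m;

  // Forward elimination.
  cprime[off] = c[off] / b[off];
  d[off]      = d[off] / b[off];
  for (int i = 1; i < m; ++i) {
    double denom = b[off + i] - a[off + i] * cprime[off + i - 1];
    cprime[off + i] = c[off + i] / denom;
    d[off + i] = (d[off + i] - a[off + i] * d[off + i - 1]) / denom;
  }
  // Back substitution.
  for (int i = m - 2; i >= 0; --i)
    d[off + i] -= cprime[off + i] * d[off + i + 1];
}
```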

44 citations

Journal ArticleDOI
01 Nov 2013
TL;DR: A new format called Sliced COO (SCOO) and an efficient CUDA implementation that performs SpMV on the GPU using atomic operations are presented and compared to existing formats of the NVIDIA Cusp library on large sparse matrices.
Abstract: We propose the Sliced Coordinate Format (SCOO) for Sparse Matrix-Vector Multiplication on GPUs. An associated CUDA implementation that takes advantage of atomic operations is presented. We propose partitioning methods to transform a given sparse matrix into SCOO format. An efficient dual-GPU implementation that overlaps computation and communication is described. Extensive performance comparisons of SCOO with other formats on GPUs and CPUs are provided. Existing formats for Sparse Matrix-Vector Multiplication (SpMV) on the GPU outperform their corresponding implementations on multi-core CPUs. In this paper, we present a new format called Sliced COO (SCOO) and an efficient CUDA implementation to perform SpMV on the GPU using atomic operations. We compare SCOO performance to existing formats of the NVIDIA Cusp library using large sparse matrices. Our results for single-precision floating-point matrices show that SCOO outperforms the COO and CSR formats for all tested matrices and the HYB format for all tested unstructured matrices on a single GPU. Furthermore, our dual-GPU implementation achieves an efficiency of 94% on average. Due to the lower performance of existing CUDA-enabled GPUs for atomic operations on double-precision floating-point numbers, the SCOO implementation for double precision does not consistently outperform the other formats for every unstructured matrix. Overall, the average speedup of SCOO for the tested benchmark dataset is 3.33 (1.56) compared to CSR, 5.25 (2.42) compared to COO, and 2.39 (1.37) compared to HYB for single (double) precision on a Tesla C2075. Furthermore, a comparison to a Sandy Bridge CPU shows that SCOO on a Fermi GPU outperforms the multi-threaded CSR implementation of the Intel MKL library on an i7-2700K by a factor between 5.5 (2.3) and 18 (12.7) for single (double) precision. Source code is available at https://github.com/danghvu/cudaSpmv.
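The actual SCOO code is available at the link above; the kernel below is only a stripped-down sketch of the underlying idea of accumulating SpMV partial products with atomic additions over a COO slice. The real implementation sorts non-zeros into slices and accumulates in fast shared memory, whereas this simplified version accumulates directly in global memory.

```cpp
// Simplified atomic-based SpMV over a COO slice (illustrative only).
__global__ void spmv_coo_atomic(const int* row, const int* col,
                                const float* val, const float* x,
                                float* y, int nnz) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  // Grid-stride loop over the non-zeros of the slice; atomics resolve the
  // write conflicts that arise when several threads hit the same row.
  for (; i < nnz; i += gridDim.x * blockDim.x)
    atomicAdd(&y[row[i]], val[i] * x[col[i]]);
}
```

Atomics make the kernel insensitive to how non-zeros are distributed across rows, which is why the approach holds up on irregular, unstructured matrices.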

40 citations

Proceedings ArticleDOI
25 Sep 2016
TL;DR: It is shown that, by avoiding the use of wildcards in MPI, one can efficiently support millions of concurrently communicating light-weight threads using send-receive communication.
Abstract: We explore in this paper the advantages that accrue from avoiding the use of wildcards in MPI. We show that, with this change, one can efficiently support millions of concurrently communicating light-weight threads using send-receive communication.
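A minimal sketch of the wildcard-free style the paper argues for is shown below, assuming each light-weight thread is assigned its own MPI tag (an illustrative convention, not the paper's runtime): every receive names an explicit source and tag, so no MPI_ANY_SOURCE or MPI_ANY_TAG matching is required and messages can be steered directly to the waiting thread.

```cpp
// Wildcard-free send-receive between two logical threads on different ranks.
// Assumes MPI was initialized with MPI_Init_thread(..., MPI_THREAD_MULTIPLE, ...).
#include <mpi.h>

void exchange(int peer_rank, int my_thread_id, int peer_thread_id,
              const double* sendbuf, double* recvbuf, int count) {
  MPI_Request reqs[2];
  // Receive is addressed to *this* thread's tag from an explicit source.
  MPI_Irecv(recvbuf, count, MPI_DOUBLE, peer_rank, /*tag=*/my_thread_id,
            MPI_COMM_WORLD, &reqs[0]);
  // Send is addressed to the peer thread's tag.
  MPI_Isend(sendbuf, count, MPI_DOUBLE, peer_rank, /*tag=*/peer_thread_id,
            MPI_COMM_WORLD, &reqs[1]);
  MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}
```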

25 citations

Proceedings ArticleDOI
01 Sep 2019
TL;DR: This paper presents Gluon-Async, a lock-free, non-blocking, bulk-asynchronous runtime for distributed and heterogeneous graph analytics and the first asynchronous distributed GPU graph analytics system; programs written using BASP-style execution are on average ~1.5x faster than those in D-Galois and D-IrGL on real-world large-diameter graphs at scale.
Abstract: Distributed graph analytics systems for CPUs, like D-Galois and Gemini, and for GPUs, like D-IrGL and Lux, use a bulk-synchronous parallel (BSP) programming and execution model. BSP permits bulk-communication and uses large messages which are supported efficiently by current message transport layers, but bulk-synchronization can exacerbate the performance impact of load imbalance because a round cannot be completed until every host has completed that round. Asynchronous distributed graph analytics systems circumvent this problem by permitting hosts to make progress at their own pace, but existing systems either use global locks and send small messages or send large messages but do not support general partitioning policies such as vertex-cuts. Consequently, they perform substantially worse than bulk-synchronous systems. Moreover, none of their programming or execution models can be easily adapted for heterogeneous devices like GPUs. In this paper, we design and implement a lock-free, non-blocking, bulk-asynchronous runtime called Gluon-Async for distributed and heterogeneous graph analytics. The runtime supports any partitioning policy and uses bulk-communication. We present the bulk-asynchronous parallel (BASP) model which allows the programmer to utilize the runtime by specifying only the abstract communication required. Applications written in this model are compared with the BSP programs written using (1) D-Galois and D-IrGL, the state-of-the-art distributed graph analytics systems (which are bulk-synchronous) for CPUs and GPUs, respectively, and (2) Lux, another (bulk-synchronous) distributed GPU graph analytical system. Our evaluation shows that programs written using BASP-style execution are on average ~1.5x faster than those in D-Galois and D-IrGL on real-world large-diameter graphs at scale. They are also on average ~12x faster than Lux. To the best of our knowledge, Gluon-Async is the first asynchronous distributed GPU graph analytics system.
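To make the BSP/BASP distinction concrete, the skeleton below sketches a bulk-asynchronous host loop: rounds proceed at each host's own pace and updates are exchanged in bulk with non-blocking calls, with a periodic collective vote standing in for real quiescence detection. This is an illustrative simplification, not Gluon-Async's runtime.

```cpp
// BASP-style host loop (simplified): no per-round global barrier; a periodic
// non-blocking allreduce of "done" votes approximates termination detection.
#include <mpi.h>

void basp_loop() {
  int nhosts = 0;
  MPI_Comm_size(MPI_COMM_WORLD, &nhosts);

  while (true) {
    // 1. Perform a round of local graph computation on whatever boundary
    //    updates have arrived so far (application-specific, omitted).
    // 2. Send bulk boundary updates with MPI_Isend and drain received
    //    updates with MPI_Iprobe / MPI_Recv (also omitted).
    bool locally_done = /* no local work left and nothing new received */ true;

    // 3. Vote on termination without forcing hosts into lock-step rounds.
    int my_vote = locally_done ? 1 : 0, votes = 0, finished = 0;
    MPI_Request req;
    MPI_Iallreduce(&my_vote, &votes, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD, &req);
    while (!finished) {
      // A real runtime keeps computing and communicating while the
      // reduction is in flight; here we simply poll it.
      MPI_Test(&req, &finished, MPI_STATUS_IGNORE);
    }
    if (votes == nhosts) break;  // every host reported it was done
  }
}
```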

21 citations


Cited by
01 Jan 2002
TL;DR: In this article, the authors present 15 technical reports that were directly supported by the JSC grant and note that four staff members (Tesar, Tosunoglu, Hooper, and Freeman) have been involved in directing the research.
Abstract: This has been wonderful support for the University and I very much appreciate your role and others at JSC in giving us good advice and direction. As you can see from the table of contents, the program has covered a very broad range of topics. The 15 technical reports are those which were directly supported by the grant. Note also that 13 M.Sc. and 11 Ph.D. students have been directly involved in this effort. Finally, four staff members (Tesar, Tosunoglu, Hooper, and Freeman) have been involved and participated in the direction of the research.

195 citations

Journal ArticleDOI
24 Oct 2018
TL;DR: GraphIt, a new DSL for graph computations, is introduced; it generates fast implementations for algorithms with different performance characteristics running on graphs with different sizes and structures, and it outperforms the next fastest of six state-of-the-art shared-memory frameworks on 24 out of 32 experiments.
Abstract: The performance bottlenecks of graph applications depend not only on the algorithm and the underlying hardware, but also on the size and structure of the input graph. As a result, programmers must try different combinations of a large set of techniques, which make tradeoffs among locality, work-efficiency, and parallelism, to develop the best implementation for a specific algorithm and type of graph. Existing graph frameworks and domain specific languages (DSLs) lack flexibility, supporting only a limited set of optimizations. This paper introduces GraphIt, a new DSL for graph computations that generates fast implementations for algorithms with different performance characteristics running on graphs with different sizes and structures. GraphIt separates what is computed (algorithm) from how it is computed (schedule). Programmers specify the algorithm using an algorithm language, and performance optimizations are specified using a separate scheduling language. The algorithm language simplifies expressing the algorithms, while exposing opportunities for optimizations. We formulate graph optimizations, including edge traversal direction, data layout, parallelization, cache, NUMA, and kernel fusion optimizations, as tradeoffs among locality, parallelism, and work-efficiency. The scheduling language enables programmers to easily search through this complicated tradeoff space by composing together a large set of edge traversal, vertex data layout, and program structure optimizations. The separation of algorithm and schedule also enables us to build an autotuner on top of GraphIt to automatically find high-performance schedules. The compiler uses a new scheduling representation, the graph iteration space, to model, compose, and ensure the validity of the large number of optimizations. We evaluate GraphIt’s performance with seven algorithms on graphs with different structures and sizes. GraphIt outperforms the next fastest of six state-of-the-art shared-memory frameworks (Ligra, Green-Marl, GraphMat, Galois, Gemini, and Grazelle) on 24 out of 32 experiments by up to 4.8×, and is never more than 43% slower than the fastest framework on the other experiments. GraphIt also reduces the lines of code by up to an order of magnitude compared to the next fastest framework.
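GraphIt's algorithm and scheduling languages are DSLs of their own; the C++ analogue below only mirrors the separation the abstract describes. The edge-update "algorithm" is written once, and the traversal "schedule" (push versus pull here, a single point in GraphIt's much larger tradeoff space) is chosen independently.

```cpp
// Algorithm/schedule separation in plain C++ (not GraphIt syntax).
#include <cstddef>
#include <vector>

struct CSR {  // graph in compressed sparse row form
  std::vector<int> offsets, neighbors;
  int num_vertices() const { return (int)offsets.size() - 1; }
};

// Algorithm: one rank contribution along an edge, identical for any schedule.
inline void update(int src, int dst, const std::vector<double>& contrib,
                   std::vector<double>& next_rank) {
  next_rank[dst] += contrib[src];
}

// Schedule A: push, iterating over the out-edges of each source.
void apply_push(const CSR& out_graph, const std::vector<double>& contrib,
                std::vector<double>& next_rank) {
  for (int u = 0; u < out_graph.num_vertices(); ++u)
    for (int e = out_graph.offsets[u]; e < out_graph.offsets[u + 1]; ++e)
      update(u, out_graph.neighbors[e], contrib, next_rank);
}

// Schedule B: pull, iterating over the in-edges of each destination
// (better locality on next_rank and no write conflicts under parallelism).
void apply_pull(const CSR& in_graph, const std::vector<double>& contrib,
                std::vector<double>& next_rank) {
  for (int v = 0; v < in_graph.num_vertices(); ++v)
    for (int e = in_graph.offsets[v]; e < in_graph.offsets[v + 1]; ++e)
      update(in_graph.neighbors[e], v, contrib, next_rank);
}
```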

137 citations

Journal ArticleDOI
TL;DR: The proposed partitioning strategy balances the load distribution, and SpMV performance is significantly improved when a sparse matrix is partitioned into dense blocks using the presented method, which achieves the highest mean density compared with other strategies.
Abstract: This paper presents a sparse matrix partitioning strategy to improve the performance of SpMV on GPUs and multicore CPUs. The method has wide adaptability to different types of sparse matrices, unlike existing methods that adapt only to particular sparse matrices. In addition, our partitioning method obtains dense blocks by analyzing the probability distribution of non-zero elements in a sparse matrix, and it results in a very low proportion of zero padding. We make the following significant contributions. (1) We present a partitioning strategy for sparse matrices based on probabilistic modeling of the non-zero elements in a row. (2) We prove that our method has the highest mean density compared with other strategies for given partition ratios obtained from the computing powers of heterogeneous processors. (3) We develop a CPU-GPU hybrid parallel computing model for SpMV on GPUs and multicore CPUs in a heterogeneous computing platform. Our partitioning strategy achieves balanced load distribution, and the performance of SpMV is significantly improved when a sparse matrix is partitioned into dense blocks using our method. The average performance improvement of our solution for SpMV is about 15.75 percent on multicore CPUs, compared to that of the other solutions. By considering the rows of a matrix in a unique order based on the probability mass function of the number of non-zeros in a row, the average performance improvement of our solution for SpMV is about 33.52 percent on GPUs and multicore CPUs of a heterogeneous computing platform, compared to that of partitioning methods based on the original row order of the matrix.
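As a rough illustration of the row-ordering idea (not the paper's exact algorithm), the sketch below orders rows by their non-zero counts, so rows of similar density end up adjacent and form dense blocks, and then assigns rows to the GPU and CPU according to an assumed fraction of the non-zeros the GPU should handle.

```cpp
// Reorder rows by non-zero count and split them between GPU and CPU by a
// given work fraction (illustrative sketch; names and the split criterion
// are assumptions, not the paper's method).
#include <algorithm>
#include <numeric>
#include <vector>

struct Split { std::vector<int> gpu_rows, cpu_rows; };

Split partition_rows(const std::vector<int>& row_ptr,  // CSR row pointer
                     double gpu_fraction) {            // e.g. 0.8 of the nnz
  int n = (int)row_ptr.size() - 1;
  std::vector<int> order(n);
  std::iota(order.begin(), order.end(), 0);
  // Sort rows by non-zero count so adjacent rows form dense, uniform blocks.
  std::sort(order.begin(), order.end(), [&](int a, int b) {
    return (row_ptr[a + 1] - row_ptr[a]) > (row_ptr[b + 1] - row_ptr[b]);
  });
  // Assign rows to the GPU until it holds roughly gpu_fraction of all non-zeros.
  long long total_nnz = row_ptr[n], gpu_nnz = 0;
  Split s;
  for (int r : order) {
    if (gpu_nnz < gpu_fraction * total_nnz) {
      s.gpu_rows.push_back(r);
      gpu_nnz += row_ptr[r + 1] - row_ptr[r];
    } else {
      s.cpu_rows.push_back(r);
    }
  }
  return s;
}
```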

113 citations