Author

Pavan Yalamanchili

Bio: Pavan Yalamanchili is an academic researcher. The author has contributed to research in the topic of Path (graph theory), has an h-index of 1, and has co-authored 1 publication receiving 60 citations.

Papers
Proceedings ArticleDOI
16 Nov 2014
TL;DR: This paper shows the first scalable GPU implementation for triangle counting using a new list intersection algorithm called Intersect Path (named after the Merge Path algorithm), which has two levels of parallelism.
Abstract: Triangle counting in a graph is a building block for the clustering coefficient, a widely used social network analytic for finding key players in a network based on their local connectivity. In this paper we show the first scalable GPU implementation for triangle counting. Our approach uses a new list intersection algorithm called Intersect Path (named after the Merge Path algorithm). This algorithm has two levels of parallelism. The first level partitions the vertices across the streaming multiprocessors on the GPU. The second level is responsible for parallelizing the work across the GPU's streaming processors and utilizing different block sizes. For testing purposes, we used graphs taken from the DIMACS 10 Graph Challenge. Our experiments were conducted on NVIDIA's K40 GPU. Our GPU triangle counting implementation achieves speedups in the range of 9X -- 32X over a CPU sequential implementation.
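
As a rough illustration of the idea (not the paper's GPU code), the C++ sketch below shows the merge-style sorted-list intersection that Intersect Path parallelizes: for each edge (u, v), the common neighbors of u and v are exactly the third vertices of the triangles that edge closes. All function names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Merge-style intersection of two sorted adjacency lists: the sequential
// core that Intersect Path parallelizes across GPU threads. Returns the
// number of common neighbors, i.e. triangles closed by the edge (u, v).
static std::size_t intersect_sorted(const std::vector<int>& a,
                                    const std::vector<int>& b) {
    std::size_t count = 0, i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j])      ++i;
        else if (a[i] > b[j]) ++j;
        else { ++count; ++i; ++j; }  // common neighbor found
    }
    return count;
}

// Count triangles in an undirected graph given as sorted adjacency lists
// (both directions of every edge stored, no self loops).
std::size_t count_triangles(const std::vector<std::vector<int>>& adj) {
    std::size_t total = 0;
    for (int u = 0; u < static_cast<int>(adj.size()); ++u)
        for (int v : adj[u])
            if (u < v)  // visit each undirected edge once
                total += intersect_sorted(adj[u], adj[v]);
    return total / 3;  // each triangle is found once per each of its 3 edges
}
```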

73 citations


Cited by
Proceedings ArticleDOI
23 Feb 2013
TL;DR: This paper presents a lightweight graph processing framework specific to shared-memory parallel/multicore machines, which makes graph traversal algorithms easy to write and significantly more efficient than previously reported results from graph frameworks on machines with many more cores.
Abstract: There has been significant recent interest in parallel frameworks for processing graphs due to their applicability in studying social networks, the Web graph, networks in biology, and unstructured meshes in scientific simulation. Due to the desire to process large graphs, these systems have emphasized the ability to run on distributed memory machines. Today, however, a single multicore server can support more than a terabyte of memory, which can fit graphs with tens or even hundreds of billions of edges. Furthermore, for graph algorithms, shared-memory multicores are generally significantly more efficient on a per core, per dollar, and per joule basis than distributed memory systems, and shared-memory algorithms tend to be simpler than their distributed counterparts. In this paper, we present a lightweight graph processing framework that is specific to shared-memory parallel/multicore machines, which makes graph traversal algorithms easy to write. The framework has two very simple routines, one for mapping over edges and one for mapping over vertices. Our routines can be applied to any subset of the vertices, which makes the framework useful for many graph traversal algorithms that operate on subsets of the vertices. Based on recent ideas used in a very fast algorithm for breadth-first search (BFS), our routines automatically adapt to the density of vertex sets. We implement several algorithms in this framework, including BFS, graph radii estimation, graph connectivity, betweenness centrality, PageRank and single-source shortest paths. Our algorithms expressed using this framework are very simple and concise, and perform almost as well as highly optimized code. Furthermore, they get good speedups on a 40-core machine and are significantly more efficient than previously reported results using graph frameworks on machines with many more cores.
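
The sketch below is a loose, sequential rendering of that two-routine interface, with BFS written on top of an edgeMap-style primitive. It mirrors the description in the abstract rather than Ligra's actual API: the real framework runs these loops in parallel (claiming vertices with compare-and-swap) and adapts between sparse and dense edge traversal. Names are illustrative.

```cpp
#include <functional>
#include <vector>

// A frontier is just a set of active vertex ids in this sketch.
using Frontier = std::vector<int>;
using Graph    = std::vector<std::vector<int>>;  // adjacency lists

// edgeMap: apply update(src, dst) to every edge out of the frontier;
// destinations for which update returns true form the next frontier.
// (The density-adaptive sparse/dense switching is omitted here.)
Frontier edgeMap(const Graph& g, const Frontier& f,
                 const std::function<bool(int, int)>& update) {
    Frontier next;
    for (int u : f)
        for (int v : g[u])
            if (update(u, v)) next.push_back(v);
    return next;
}

// BFS expressed with edgeMap: claim unvisited vertices level by level.
std::vector<int> bfs(const Graph& g, int source) {
    std::vector<int> parent(g.size(), -1);
    parent[source] = source;
    Frontier frontier{source};
    while (!frontier.empty())
        frontier = edgeMap(g, frontier, [&](int u, int v) {
            if (parent[v] != -1) return false;  // already visited
            parent[v] = u;                      // claim v (a CAS in parallel)
            return true;                        // v joins the next frontier
        });
    return parent;
}
```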

816 citations

Proceedings ArticleDOI
13 Apr 2015
TL;DR: This paper describes the design and implementation of simple and fast multicore parallel algorithms for exact, as well as approximate, triangle counting and other triangle computations that scale to billions of nodes and edges and are much faster than existing parallel approximate triangle counting implementations.
Abstract: Triangle counting and enumeration has emerged as a basic tool in large-scale network analysis, fueling the development of algorithms that scale to massive graphs. Most of the existing algorithms, however, are designed for the distributed-memory setting or the external-memory setting, and cannot take full advantage of a multicore machine, whose capacity has grown to accommodate even the largest of real-world graphs.

143 citations

Journal ArticleDOI
23 Aug 2017
TL;DR: The results show that on a single GPU, Gunrock has on average at least an order of magnitude speedup over Boost and PowerGraph, comparable performance to the fastest GPU hardwired primitives and CPU shared-memory graph libraries, and better performance than any other GPU high-level graph library.
Abstract: For large-scale graph analytics on the GPU, the irregularity of data access and control flow, and the complexity of programming GPUs, have presented two significant challenges to developing a programmable high-performance graph library. “Gunrock,” our graph-processing system designed specifically for the GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on operations on a vertex or edge frontier. Gunrock achieves a balance between performance and expressiveness by coupling high-performance GPU computing primitives and optimization strategies with a high-level programming model that allows programmers to quickly develop new graph primitives with small code size and minimal GPU programming knowledge. We characterize the performance of various optimization strategies and evaluate Gunrock’s overall performance on different GPU architectures on a wide range of graph primitives that span from traversal-based algorithms and ranking algorithms, to triangle counting and bipartite-graph-based algorithms. The results show that on a single GPU, Gunrock has on average at least an order of magnitude speedup over Boost and PowerGraph, comparable performance to the fastest GPU hardwired primitives and CPU shared-memory graph libraries, such as Ligra and Galois, and better performance than any other GPU high-level graph library.
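
A minimal sketch of what a frontier-centric abstraction of this kind looks like, assuming two hypothetical operators: an "advance" that expands a frontier along edges and a "filter" that prunes it. Gunrock's real operators are load-balanced GPU kernels; the sequential bodies below only convey the programming model.

```cpp
#include <vector>

using Frontier = std::vector<int>;
using Graph    = std::vector<std::vector<int>>;  // adjacency lists

// "Advance": expand the current frontier to all neighboring vertices.
// On the GPU this irregular expansion is load-balanced across threads;
// the sketch keeps only the sequential semantics.
Frontier advance(const Graph& g, const Frontier& in) {
    Frontier out;
    for (int u : in)
        out.insert(out.end(), g[u].begin(), g[u].end());
    return out;
}

// "Filter": keep each unvisited vertex once, marking it visited.
Frontier filter(Frontier in, std::vector<char>& visited) {
    Frontier out;
    for (int v : in)
        if (!visited[v]) { visited[v] = 1; out.push_back(v); }
    return out;
}

// A graph primitive is a short loop over operator applications; one BFS
// step is the composition filter(advance(frontier)).
int bfs_depth(const Graph& g, int source) {
    std::vector<char> visited(g.size(), 0);
    Frontier f = filter({source}, visited);
    int depth = 0;
    while (!f.empty()) { f = filter(advance(g, f), visited); ++depth; }
    return depth;
}
```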

99 citations

Posted Content
TL;DR: Gunrock, as discussed by the authors, is a GPU graph-processing system for large-scale graph analytics built on a high-level, bulk-synchronous, data-centric abstraction focused on operations on a vertex or edge frontier.
Abstract: For large-scale graph analytics on the GPU, the irregularity of data access and control flow, and the complexity of programming GPUs, have presented two significant challenges to developing a programmable high-performance graph library. "Gunrock", our graph-processing system designed specifically for the GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on operations on a vertex or edge frontier. Gunrock achieves a balance between performance and expressiveness by coupling high performance GPU computing primitives and optimization strategies with a high-level programming model that allows programmers to quickly develop new graph primitives with small code size and minimal GPU programming knowledge. We characterize the performance of various optimization strategies and evaluate Gunrock's overall performance on different GPU architectures on a wide range of graph primitives that span from traversal-based algorithms and ranking algorithms, to triangle counting and bipartite-graph-based algorithms. The results show that on a single GPU, Gunrock has on average at least an order of magnitude speedup over Boost and PowerGraph, comparable performance to the fastest GPU hardwired primitives and CPU shared-memory graph libraries such as Ligra and Galois, and better performance than any other GPU high-level graph library.

55 citations

Proceedings ArticleDOI
11 Nov 2018
TL;DR: This paper presents TriCore, a scalable GPU-based triangle counting system built on three major techniques: a binary-search-based intersection algorithm that increases both thread parallelism and memory performance on Graphics Processing Units (GPUs), a partitioned-and-streamed graph layout that lets a GPU process graphs far larger than its memory, and a dynamic workload management technique that balances the workload across GPUs.
Abstract: An exact triangle counting algorithm enumerates the triangles in a graph by identifying the common neighbors of the two vertices of each edge. In this work, we present TriCore, a scalable GPU-based triangle counting system that consists of three major techniques. First, we design a binary search based algorithm that can increase both the thread parallelism and memory performance on Graphics Processing Units (GPUs), both of which are absent from prior work. Second, in contrast to prior attempts which require multiple graph representations, i.e., compressed sparse row (CSR), edge list, and bitmap, to be present in the GPU memory, TriCore evenly partitions and distributes the partitioned CSR data across all the GPUs, and uses a streaming buffer to load the edge list from the CPU memory on the fly. This design enables TriCore to process graphs that are orders of magnitude larger than the GPU memory. Third, we further develop a dynamic workload management technique to balance the workload across GPUs. Our evaluation demonstrates that TriCore on a single GPU can count the triangles in the billion-edge Twitter graph within 24 seconds, that is, 22× faster than the state-of-the-art CPU project which uses CPUs that are 8× more expensive. When processing big graphs (up to 33.4 billion edges) that are ∼22× larger than the memory size of a single GPU, it achieves 24× speedup when scaling from 1 to 32 GPUs.
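
The sketch below illustrates the first of those techniques, binary-search-based intersection over a CSR graph, in sequential C++. It assumes edges are oriented from lower to higher vertex id so each triangle is counted exactly once, and it omits TriCore's streaming and multi-GPU machinery; names and types are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// CSR graph: the neighbors of v are col[row[v] .. row[v+1]),
// sorted within each row and restricted to ids greater than v.
struct Csr {
    std::vector<std::size_t> row;
    std::vector<int>         col;
};

// Binary search for x in the sorted CSR row of vertex v. On the GPU,
// each search is an independent, regular probe, which is what lets
// TriCore assign many of them to threads in parallel.
static bool row_contains(const Csr& g, int v, int x) {
    std::size_t lo = g.row[v], hi = g.row[v + 1];
    while (lo < hi) {
        std::size_t mid = lo + (hi - lo) / 2;
        if (g.col[mid] < x)      lo = mid + 1;
        else if (g.col[mid] > x) hi = mid;
        else return true;
    }
    return false;
}

// For each oriented edge (u, v), probe every neighbor of u against v's
// row; with the low-to-high orientation each triangle is found once.
std::uint64_t count_triangles(const Csr& g,
                              const std::vector<std::pair<int, int>>& edges) {
    std::uint64_t total = 0;
    for (auto [u, v] : edges)
        for (std::size_t i = g.row[u]; i < g.row[u + 1]; ++i)
            total += row_contains(g, v, g.col[i]);
    return total;
}
```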

54 citations