
Showing papers on "Breadth-first search published in 2017"


Journal ArticleDOI
01 Sep 2017
TL;DR: This paper proposes a GPU-based dynamic graph storage scheme to support existing graph algorithms easily and proposes parallel update algorithms to support efficient stream updates so that the maintained graph is immediately available for high-speed analytic processing on GPUs.
Abstract: As graph analytics often involves compute-intensive operations, GPUs have been extensively used to accelerate the processing. However, in many applications such as social networks, cyber security, and fraud detection, their representative graphs evolve frequently and one has to perform a rebuild of the graph structure on GPUs to incorporate the updates. Hence, rebuilding the graphs becomes the bottleneck of processing high-speed graph streams. In this paper, we propose a GPU-based dynamic graph storage scheme to support existing graph algorithms easily. Furthermore, we propose parallel update algorithms to support efficient stream updates so that the maintained graph is immediately available for high-speed analytic processing on GPUs. Our extensive experiments with three streaming applications on large-scale real and synthetic datasets demonstrate the superior performance of our proposed approach.

75 citations
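The motivation above, that in-place updates beat full rebuilds for streaming graphs, can be illustrated with a minimal CPU-side sketch; the function name and update format here are illustrative, not the paper's GPU API:

```python
# Hypothetical sketch: apply a stream of edge insertions/deletions to an
# undirected adjacency structure in place, instead of rebuilding the whole
# graph per batch (the bottleneck the paper targets on GPUs).

def apply_edge_stream(adj, updates):
    """Apply ('add'|'del', u, v) updates to an undirected adjacency dict."""
    for op, u, v in updates:
        if op == "add":
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
        else:  # "del"
            adj.get(u, set()).discard(v)
            adj.get(v, set()).discard(u)
    return adj

adj = {0: {1}, 1: {0}}
apply_edge_stream(adj, [("add", 1, 2), ("add", 2, 3), ("del", 0, 1)])
```

A GPU scheme additionally has to keep adjacency data packed for coalesced access and apply updates in parallel, which is what the proposed storage scheme addresses.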


Proceedings ArticleDOI
01 May 2017
TL;DR: SlimSell as mentioned in this paper is a vectorizable graph representation to accelerate BFS based on sparse-matrix dense-vector (SpMV) products, which reduces the necessary storage (by up to 50%) and thus pressure on the memory subsystem.
Abstract: Vectorization and GPUs will profoundly change graph processing. Traditional graph algorithms tuned for 32- or 64-bit based memory accesses will be inefficient on architectures with 512-bit wide (or larger) instruction units that are already present in the Intel Knights Landing (KNL) manycore CPU. Anticipating this shift, we propose SlimSell: a vectorizable graph representation to accelerate Breadth-First Search (BFS) based on sparse-matrix dense-vector (SpMV) products. SlimSell extends and combines the state-of-the-art SIMD-friendly Sell-C-σ matrix storage format with tropical, real, boolean, and sel-max semiring operations. The resulting design reduces the necessary storage (by up to 50%) and thus the pressure on the memory subsystem. We augment SlimSell with the SlimWork and SlimChunk schemes that reduce the amount of work and improve load balance, further accelerating BFS. We evaluate all the schemes on Intel Haswell multicore CPUs, the state-of-the-art Intel Xeon Phi KNL manycore CPUs, and NVIDIA Tesla GPUs. Our experiments indicate which semiring offers the highest speedups for BFS and illustrate that SlimSell accelerates a tuned Graph500 BFS code by up to 33%. This work shows that vectorization can secure high performance in BFS based on SpMV products; the proposed principles and designs can be extended to other graph algorithms.

65 citations
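The algebraic view that SlimSell builds on, BFS as repeated sparse matrix-vector products over a semiring, can be sketched in a few lines; this is a minimal dense boolean-semiring version for illustration, not the Sell-C-σ layout itself:

```python
# BFS expressed as repeated matrix-vector products over a boolean semiring
# (multiply -> AND, add -> OR). Each product expands the frontier one level.

def bfs_spmv(adj_matrix, source):
    n = len(adj_matrix)
    dist = [-1] * n
    frontier = [False] * n
    frontier[source] = True
    dist[source] = 0
    level = 0
    while any(frontier):
        level += 1
        nxt = [False] * n
        for v in range(n):
            if dist[v] == -1:
                # boolean semiring "dot product": OR over AND
                nxt[v] = any(adj_matrix[v][u] and frontier[u] for u in range(n))
                if nxt[v]:
                    dist[v] = level
        frontier = nxt
    return dist

# path 0-1-2 plus isolated vertex 3
A = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0]]
print(bfs_spmv(A, 0))  # [0, 1, 2, -1]
```

Swapping the boolean AND/OR for tropical (min/+) or sel-max operations yields the other semirings the paper evaluates.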


Proceedings ArticleDOI
22 Feb 2017
TL;DR: This is the first work to implement a graph processing system on an FPGA-HMC platform based on software/hardware co-design and co-optimization; it proposes a two-level bitmap scheme to further reduce memory accesses and optimizes key design parameters (e.g., memory access granularity).
Abstract: Large graph processing has gained great attention in recent years due to its broad applicability from machine learning to social science. Large real-world graphs, however, are inherently difficult to process efficiently, not only due to their large memory footprint, but also because most graph algorithms entail memory access patterns with poor locality and a low compute-to-memory access ratio. In this work, we leverage the exceptional random access performance of emerging Hybrid Memory Cube (HMC) technology that stacks multiple DRAM dies on top of a logic layer, combined with the flexibility and efficiency of FPGAs, to address these challenges. To the best of our knowledge, this is the first work that implements a graph processing system on an FPGA-HMC platform based on software/hardware co-design and co-optimization. We first present the modifications of the algorithm and a platform-aware graph processing architecture to perform level-synchronized breadth-first search (BFS) on the FPGA-HMC platform. To gain better insight into the potential bottlenecks of the proposed implementation, we develop an analytical performance model to quantitatively evaluate the HMC access latency and the corresponding BFS performance. Based on the analysis, we propose a two-level bitmap scheme to further reduce memory accesses and optimize key design parameters (e.g., memory access granularity). Finally, we evaluate the performance of our BFS implementation using the AC-510 development kit from Micron. We achieved 166 million traversed edges per second (MTEPS) using the Graph500 benchmark on a random graph with a scale of 25 and an edge factor of 16, which significantly outperforms CPU and other FPGA-based large graph processors.

60 citations
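The bitmap idea is easy to see in software: visited and frontier sets become packed bit vectors, so membership tests and updates are single bit operations. A hypothetical Python sketch using integers as bitmaps:

```python
# Level-synchronized BFS with bitmap frontiers: the software analogue of the
# paper's bitmap scheme. Python ints serve as arbitrary-width bit vectors.

def bfs_bitmap(adj, source):
    visited = 1 << source
    current = 1 << source
    levels = {source: 0}
    level = 0
    while current:
        level += 1
        nxt = 0
        v = 0
        frontier = current
        while frontier:
            if frontier & 1:               # vertex v is in the frontier
                for w in adj[v]:
                    if not (visited >> w) & 1:
                        nxt |= 1 << w      # set bit w in the next frontier
                        visited |= 1 << w
                        levels[w] = level
            frontier >>= 1
            v += 1
        current = nxt
    return levels

adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
print(bfs_bitmap(adj, 0))  # {0: 0, 1: 1, 2: 1, 3: 2}
```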


Proceedings ArticleDOI
01 May 2017
TL;DR: This paper shares the experience of designing and implementing the Breadth-First Search algorithm on Sunway TaihuLight, a newly released machine with 40,960 nodes and 10.6 million accelerator cores, and achieves 23755.7 giga-traversed edges per second, the best among heterogeneous machines and the second overall in the Graph500 list of June 2016.
Abstract: Interest has recently grown in efficiently analyzing unstructured data such as social network graphs and protein structures. A fundamental graph algorithm for such tasks is the Breadth-First Search (BFS) algorithm, the foundation for many other important graph algorithms such as calculating the shortest path or finding the maximum flow in graphs. In this paper, we share our experience of designing and implementing the BFS algorithm on Sunway TaihuLight, a newly released machine with 40,960 nodes and 10.6 million accelerator cores. It tops the Top500 list of June 2016 with a 93.01 petaflops Linpack performance [1]. Designed for extremely large-scale computation and power efficiency, processors on Sunway TaihuLight employ a unique heterogeneous many-core architecture and memory hierarchy. With its extremely large size, the machine provides both opportunities and challenges for implementing high-performance irregular algorithms such as BFS. We propose several techniques, including pipelined module mapping, contention-free data shuffling, and group-based message batching, to address the challenges of efficiently utilizing the features of this large-scale heterogeneous machine. We ultimately achieved 23755.7 giga-traversed edges per second (GTEPS), which is the best among heterogeneous machines and the second overall in the Graph500 list of June 2016 [2].

39 citations


Posted Content
TL;DR: This chapter studies the problem of traversing large graphs using the breadth-first search order on distributed-memory supercomputers, and considers both the traditional level-synchronous top-down algorithm as well as the recently discovered direction optimizing algorithm.
Abstract: [Author(s): Buluc, Aydin; Beamer, Scott; Madduri, Kamesh; Asanovic, Krste; Patterson, David] This chapter studies the problem of traversing large graphs in breadth-first search order on distributed-memory supercomputers. We consider both the traditional level-synchronous top-down algorithm and the recently discovered direction-optimizing algorithm. We analyze the performance and scalability trade-offs of different local data structures such as CSR and DCSC, of enabling in-node multithreading, and of graph decompositions such as 1D and 2D decomposition.

34 citations
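The direction-optimizing algorithm mentioned above switches traversal direction per level: top-down while the frontier is small, bottom-up (each unvisited vertex scans for a visited parent) once it grows. A minimal shared-memory sketch, with an illustrative switching threshold alpha:

```python
# Direction-optimizing BFS sketch. The switch heuristic here (frontier size
# vs. alpha * n) is a simplification of the published heuristics.

def bfs_direction_optimizing(adj, source, alpha=0.5):
    n = len(adj)
    parent = {source: source}
    frontier = {source}
    while frontier:
        nxt = set()
        if len(frontier) < alpha * n:      # top-down step
            for u in frontier:
                for v in adj[u]:
                    if v not in parent:
                        parent[v] = u
                        nxt.add(v)
        else:                              # bottom-up step
            for v in range(n):
                if v not in parent and any(u in frontier for u in adj[v]):
                    parent[v] = next(u for u in adj[v] if u in frontier)
                    nxt.add(v)
        frontier = nxt
    return parent

adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
print(sorted(bfs_direction_optimizing(adj, 0)))  # [0, 1, 2, 3]
```

The bottom-up step pays off on low-diameter graphs, where a large frontier would otherwise generate many failed neighbor checks.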


Journal Article
TL;DR: In this paper, the authors revisited the result of Herbster et al. and showed that it has important implications for signal denoising on graphs, which can be translated to our setting as follows: given a general graph, if we run the standard depth-first search (DFS) traversal algorithm, then the total variation of any signal over the chain graph induced by DFS is no more than twice its total variation over the original graph.
Abstract: The fused lasso, also known as (anisotropic) total variation denoising, is widely used for piecewise constant signal estimation with respect to a given undirected graph. The fused lasso estimate is highly nontrivial to compute when the underlying graph is large and has an arbitrary structure. But for a special graph structure, namely the chain graph, the fused lasso (or simply, the 1d fused lasso) can be computed in linear time. In this paper, we revisit a result recently established in the online classification literature (Herbster et al., 2009; Cesa-Bianchi et al., 2013) and show that it has important implications for signal denoising on graphs. The result can be translated to our setting as follows: given a general graph, if we run the standard depth-first search (DFS) traversal algorithm, then the total variation of any signal over the chain graph induced by DFS is no more than twice its total variation over the original graph. This result leads to several interesting theoretical and computational conclusions. Letting m and n denote the number of edges and nodes, respectively, of the graph in consideration, it implies that for an underlying signal with total variation t over the graph, the fused lasso (properly tuned) achieves a mean squared error rate of t^{2/3} n^{-2/3}. Moreover, precisely the same mean squared error rate is achieved by running the 1d fused lasso on the DFS-induced chain graph. Importantly, the latter estimator is simple and computationally cheap, requiring O(m) operations to construct the DFS-induced chain and O(n) operations to compute the 1d fused lasso solution over this chain. Further, for trees that have bounded maximum degree, the error rate of t^{2/3} n^{-2/3} cannot be improved, in the sense that it is the minimax rate for signals that have total variation t over the tree.
Finally, several related results also hold; for example, the analogous result holds for a roughness measure defined by the l_0 norm of differences across edges in place of the total variation metric.

32 citations
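The chain reduction at the heart of the paper can be demonstrated in a few lines: compute a DFS (discovery) order, then compare the total variation of a signal along the induced chain with its total variation over the graph; the theorem above bounds the former by twice the latter. A minimal sketch on a toy tree:

```python
# DFS-induced chain and total variation (TV) comparison.

def dfs_order(adj, root):
    """Return vertices in DFS discovery order (iterative, stack-based)."""
    order, seen, stack = [], {root}, [root]
    while stack:
        u = stack.pop()
        order.append(u)
        for v in reversed(adj[u]):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return order

def tv_graph(adj, x):
    """TV over the graph: sum of |differences| across all edges."""
    return sum(abs(x[u] - x[v]) for u in adj for v in adj[u] if u < v)

def tv_chain(order, x):
    """TV along the chain given by consecutive vertices in `order`."""
    return sum(abs(x[order[i]] - x[order[i + 1]]) for i in range(len(order) - 1))

adj = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}   # a small tree
x = [0.0, 1.0, 0.0, 1.0]                       # signal on vertices
order = dfs_order(adj, 0)
assert tv_chain(order, x) <= 2 * tv_graph(adj, x)  # the paper's bound
```

Running the fast 1d fused lasso on this chain then inherits the graph estimator's error rate, which is the paper's computational payoff.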


Journal ArticleDOI
TL;DR: A new method for distributed parallel BFS can compute BFS on a graph of one trillion vertices within half a second, using large supercomputers such as the K-Computer.
Abstract: There are many large-scale graphs in the real world, such as Web graphs and social graphs. Interest in large-scale graph analysis has grown in recent years. Breadth-First Search (BFS) is one of the most fundamental graph algorithms, used as a component of many other graph algorithms. Our new method for distributed parallel BFS can compute BFS on a graph of one trillion vertices within half a second, using large supercomputers such as the K-Computer. Using our proposed algorithm, the K-Computer was ranked 1st in Graph500, using all of its 82,944 available nodes, in June and November 2015 and in June 2016, achieving 38,621.4 GTEPS. Based on the hybrid BFS algorithm by Beamer (Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing Workshops and PhD Forum, IPDPSW '13, IEEE Computer Society, Washington, 2013), we devise sets of optimizations for scaling to an extreme number of nodes, including a new efficient graph data structure and several optimization techniques such as vertex reordering and load balancing. Our performance evaluation on the K-Computer shows that our new BFS is 3.19 times faster on 30,720 nodes than the base version using the previously known best techniques.

27 citations


Journal ArticleDOI
TL;DR: The experimental results showed that the shape of the extracted tree skeleton was consistent with the real tree, indicating that the proposed method is effective and feasible.
Abstract: A tree skeleton describes the shape and topological structure of a tree, which are useful to forest researchers. A terrestrial laser scanner (TLS) can scan trees with high accuracy and speed to acquire point cloud data, which can be used to extract tree skeletons. An adaptive method for extracting tree skeletons from TLS point cloud data is proposed in this paper. The point cloud data were segmented by artificial filtration and k-means clustering, and the point cloud data of the trunk and branches were retained for skeleton extraction. The skeleton nodes were then calculated using a breadth-first search (BFS) method, a quantifying method, and a clustering method. Based on their connectivity, the skeleton nodes were connected to generate the tree skeleton, which was then smoothed using Laplace smoothing. The point cloud data of a toona tree and a peach tree were used to test the proposed method and to compare it with the shortest-path method, illustrating the robustness and superiority of the method. The experimental results showed that the shape of the extracted tree skeleton was consistent with the real tree, indicating that the proposed method is effective and feasible.

27 citations


Journal ArticleDOI
TL;DR: In this paper, the authors present an efficient distributed memory parallel algorithm for computing connected components in undirected graphs based on Shiloach-Vishkin's PRAM approach and employ a heuristic that allows the algorithm to quickly predict the type of the network by computing the degree distribution and follow the optimal hybrid route.
Abstract: We present an efficient distributed memory parallel algorithm for computing connected components in undirected graphs based on Shiloach-Vishkin's PRAM approach. We discuss multiple optimization techniques that reduce communication volume as well as load-balance the algorithm. We also note that the efficiency of the parallel graph connectivity algorithm depends on the underlying graph topology. Particularly for short diameter graph components, we observe that the parallel Breadth First Search (BFS) method offers better performance. However, running parallel BFS is not efficient for computing large diameter components or large numbers of small components. To address this challenge, we employ a heuristic that allows the algorithm to quickly predict the type of the network by computing the degree distribution and follow the optimal hybrid route. Using large graphs with diverse topologies from domains including metagenomics, web crawl, social graphs and road networks, we show that our hybrid implementation is efficient and scalable for each of the graph types. Our approach achieves a runtime of 215 seconds using 32K cores of Cray XC30 for a metagenomic graph with over 50 billion edges. When compared against the previous state-of-the-art method, we see performance improvements of up to 24×.

22 citations
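The BFS half of the hybrid is the familiar label-by-traversal method, efficient for short-diameter components. A sequential sketch of the idea (the paper's version is distributed and paired with a Shiloach-Vishkin-style algorithm):

```python
# Connected components via repeated BFS: each BFS from an unvisited vertex
# labels one whole component.
from collections import deque

def connected_components_bfs(adj):
    comp, label = {}, 0
    for s in adj:
        if s in comp:
            continue
        comp[s] = label
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in comp:
                    comp[v] = label
                    q.append(v)
        label += 1
    return comp

adj = {0: [1], 1: [0], 2: [3], 3: [2], 4: []}
labels = connected_components_bfs(adj)
print(labels[0] == labels[1], labels[2] == labels[3], labels[0] != labels[4])
# True True True
```

The degree-distribution heuristic described above decides, per input, whether this BFS route or the Shiloach-Vishkin route is likely to win.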


Proceedings ArticleDOI
01 Jan 2017
TL;DR: It is proved that resolving the decision question NC = RNC would imply an NC algorithm for finding a bipartite perfect matching and for finding a DFS tree in NC.
Abstract: We present a pseudo-deterministic NC algorithm for finding perfect matchings in bipartite graphs. Specifically, our algorithm is a randomized parallel algorithm which uses poly(n) processors, poly(log n) depth, poly(log n) random bits, and outputs for each bipartite input graph a unique perfect matching with high probability. That is, on the same graph it returns the same matching for almost all choices of randomness. As an immediate consequence we also find a pseudo-deterministic NC algorithm for constructing a depth first search (DFS) tree. We introduce a method for computing the union of all min-weight perfect matchings of a weighted graph in RNC and a novel set of weight assignments which in combination enable isolating a unique matching in a graph. We then show a way to use pseudo-deterministic algorithms to reduce the number of random bits used by general randomized algorithms. The main idea is that random bits can be reused by successive invocations of pseudo-deterministic randomized algorithms. We use the technique to show an RNC algorithm for constructing a depth first search (DFS) tree using only O(log^2 n) bits, whereas the previous best randomized algorithm used O(log^7 n), and a new sequential randomized algorithm for the set-maxima problem which uses fewer random bits than the previous state of the art. Furthermore, we prove that resolving the decision question NC = RNC would imply an NC algorithm for finding a bipartite perfect matching and finding a DFS tree in NC. This is not implied by previous randomized NC search algorithms for finding a bipartite perfect matching, but is implied by the existence of a pseudo-deterministic NC search algorithm.

21 citations


Journal ArticleDOI
TL;DR: A novel tree-grafting method is described that eliminates most of the redundant edge traversals caused by the inability of multiple-source searches to discard unsuccessful search trees; multiple-source searches create more parallelism than single-source algorithms.
Abstract: It is difficult to obtain high performance when computing matchings on parallel processors because matching algorithms explicitly or implicitly search for paths in the graph, and when these paths become long, there is little concurrency. In spite of this limitation, we present a new algorithm and its shared-memory parallelization that achieves good performance and scalability in computing maximum cardinality matchings in bipartite graphs. Our algorithm searches for augmenting paths via specialized breadth-first searches (BFS) from multiple source vertices, hence creating more parallelism than single source algorithms. Algorithms that employ multiple-source searches cannot discard a search tree once no augmenting path is discovered from the tree, unlike algorithms that rely on single-source searches. We describe a novel tree-grafting method that eliminates most of the redundant edge traversals resulting from this property of multiple-source searches. We also employ the recent direction-optimizing BFS algorithm as a subroutine to discover augmenting paths faster. Our algorithm compares favorably with the current best algorithms in terms of the number of edges traversed, the average augmenting path length, and the number of iterations. We provide a proof of correctness for our algorithm. Our NUMA-aware implementation is scalable to 80 threads of an Intel multiprocessor and to 240 threads on an Intel Knights Corner coprocessor. On average, our parallel algorithm runs an order of magnitude faster than the fastest algorithms available. The performance improvement is more significant on graphs with small matching number.
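For contrast with the multi-source parallel searches described above, here is the textbook single-source augmenting-path matching that such algorithms accelerate; a minimal sketch, not the paper's algorithm:

```python
# Baseline maximum-cardinality bipartite matching: search for an augmenting
# path from each free left vertex, augmenting when a free right vertex is
# reached. The paper replaces this per-source search with multi-source BFS
# plus tree grafting.

def max_bipartite_matching(adj_left, n_right):
    match_right = [-1] * n_right  # right vertex -> matched left vertex

    def try_augment(u, seen):
        for v in adj_left[u]:
            if v in seen:
                continue
            seen.add(v)
            # v is free, or its current partner can be re-matched elsewhere
            if match_right[v] == -1 or try_augment(match_right[v], seen):
                match_right[v] = u
                return True
        return False

    matched = sum(try_augment(u, set()) for u in range(len(adj_left)))
    return matched, match_right

# left 0 -> right {0,1}, left 1 -> right {0}, left 2 -> right {1}
size, match = max_bipartite_matching([[0, 1], [0], [1]], 2)
print(size)  # 2
```

Each search here walks one alternating tree at a time; the parallel algorithm grows many such trees at once, which is where the redundant-traversal problem (and the tree-grafting fix) comes from.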

Proceedings ArticleDOI
01 Jun 2017
TL;DR: A distributed algorithm using messages of size O(logn) for constructing virtual rings of graphs that are on average shorter than rings based on depth first search is presented.
Abstract: The loose coupling and the inherent scalability make publish/subscribe systems an ideal candidate for event-driven services for wireless networks using low power protocols such as IEEE 802.15.4. This work introduces a distributed algorithm to build and maintain a routing structure for such networks. The algorithm dynamically maintains a multicast tree for each node. While previous work focused on minimizing these trees we aim to keep the effort to maintain them in case of fluctuations of subscribers low. The multicast trees are implicitly defined by a novel structure called augmented virtual ring. The main contribution is a distributed algorithm to build and maintain this augmented virtual ring. Maintenance operations after sub-and unsubscriptions require message exchange in a limited region only. We compare the average lengths of the constructedforwarding paths with an almost ideal approach. As a resultof independent interest we present a distributed algorithm using messages of size O(logn) for constructing virtual rings of graphs that are on average shorter than rings based on depth first search.

Journal ArticleDOI
TL;DR: This work proposes a new DPOP algorithm, named BFSDPOP, that uses a Breadth First Search (BFS) pseudo-tree as the communication structure, and compares it with the original DPOP on three types of problems: graph coloring problems, meeting scheduling problems, and random DCOPs.
Abstract: A Depth First Search (DFS) pseudo-tree is popularly used as the communication structure in complete algorithms for solving Distributed Constraint Optimization Problems (DCOPs) from multiagent systems. The advantage of a DFS pseudo-tree lies in its parallelism derived from pseudo-tree branches, because the nodes in different branches are relatively independent and can compute concurrently. However, the DFS pseudo-trees constructed in experiments often turn out to be chain-like, which greatly impairs the performance of solving algorithms. Therefore, we propose a new DPOP algorithm, named BFSDPOP, that uses a Breadth First Search (BFS) pseudo-tree as the communication structure. Compared with a DFS pseudo-tree, a BFS pseudo-tree offers more parallelism, as it has many more branches. Another notable advantage is that the height of a BFS pseudo-tree is much lower than that of a DFS pseudo-tree, which gives rise to shorter communication paths and less communication time. The method of Cluster Removing is also presented to allocate cross-edge constraints so as to reduce the size of the largest message in BFSDPOP. In the experiments, BFSDPOP with a BFS pseudo-tree and the original DPOP with a DFS pseudo-tree are compared on three types of problems: graph coloring problems, meeting scheduling problems, and random DCOPs. The results show that BFSDPOP outperforms the original DPOP in most cases, which demonstrates the advantages of BFS pseudo-trees over DFS pseudo-trees.
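The construction underlying BFSDPOP can be sketched as follows: a BFS tree supplies parents and levels, and constraint edges that are not tree edges become the cross-edges that Cluster Removing must allocate. An illustrative sketch, not the paper's implementation:

```python
# BFS pseudo-tree of a constraint graph: tree edges from BFS parents,
# remaining constraint edges classified as cross-edges.
from collections import deque

def bfs_pseudo_tree(adj, root):
    parent, level = {root: None}, {root: 0}
    q = deque([root])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                level[v] = level[u] + 1
                q.append(v)
    tree_edges = {frozenset((u, p)) for u, p in parent.items() if p is not None}
    cross = {(u, v) for u in adj for v in adj[u]
             if u < v and frozenset((u, v)) not in tree_edges}
    return parent, level, cross

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
parent, level, cross = bfs_pseudo_tree(adj, 0)
print(level)  # {0: 0, 1: 1, 2: 1, 3: 2}
print(cross)  # {(1, 2)}
```

The tree height here (2) is what bounds the length of the utility-propagation paths in a DPOP-style algorithm, which is why a shallow BFS tree helps.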

Proceedings ArticleDOI
12 Nov 2017
TL;DR: This work proposes a novel work-efficient parallel algorithm for the DFS traversal of directed acyclic graph (DAG) that outperforms sequential DFS on the CPU by up to 6x in the authors' experiments.
Abstract: Depth-First Search (DFS) is a pervasive algorithm, often used as a building block for topological sort, connectivity and planarity testing, among many other applications. We propose a novel work-efficient parallel algorithm for the DFS traversal of a directed acyclic graph (DAG). The algorithm traverses the entire DAG in a BFS-like fashion no more than three times. As a result, it finds the DFS pre-order (discovery) and post-order (finish) times as well as the parent relationship associated with every node in the DAG. We analyse the runtime and work complexity of this novel parallel algorithm. We also show that our algorithm is easy to implement and optimize for performance. In particular, we show that its CUDA implementation on the GPU outperforms sequential DFS on the CPU by up to 6x in our experiments.
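As a sequential reference for the outputs the parallel algorithm produces, here is the classical recursive computation of DFS discovery/finish times and parents on a DAG:

```python
# Classical DFS on a DAG: pre-order (discovery) and post-order (finish)
# timestamps plus the parent relation, the same outputs the parallel
# BFS-sweep algorithm recovers.

def dfs_orders(adj, roots):
    pre, post, parent = {}, {}, {}
    clock = [0]

    def visit(u):
        pre[u] = clock[0]; clock[0] += 1
        for v in adj[u]:
            if v not in pre:
                parent[v] = u
                visit(v)
        post[u] = clock[0]; clock[0] += 1

    for r in roots:
        if r not in pre:
            parent[r] = None
            visit(r)
    return pre, post, parent

adj = {0: [1, 2], 1: [3], 2: [3], 3: []}  # small diamond DAG
pre, post, parent = dfs_orders(adj, [0])
print(pre)   # {0: 0, 1: 1, 3: 2, 2: 5}
print(post)  # {3: 3, 1: 4, 2: 6, 0: 7}
```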

Proceedings ArticleDOI
01 Sep 2017
TL;DR: In this paper, the authors present an approach and associated software for analyzing the performance and scalability of parallel, open-source graph libraries, such as GraphMat, Graph500, Graph Algorithm Platform Benchmark Suite, GraphBIG, and PowerGraph.
Abstract: The rapidly growing number of large network analysis problems has led to the emergence of many parallel and distributed graph processing systems; one survey in 2014 identified over 80. Determining the best approach for a given problem is infeasible for most developers. We present an approach and associated software for analyzing the performance and scalability of parallel, open-source graph libraries. We demonstrate our approach on five graph processing packages: GraphMat, Graph500, Graph Algorithm Platform Benchmark Suite, GraphBIG, and PowerGraph, using synthetic and real-world datasets. We examine previously overlooked aspects of parallel graph processing performance, such as phases of execution and energy usage, for three algorithms: breadth first search, single source shortest paths, and PageRank.

Book ChapterDOI
03 Aug 2017
TL;DR: Several time-space tradeoffs for performing Maximum Cardinality Search (MCS), Stack Breadth First Search (Stack BFS), and Queue Breadth First Search (Queue BFS) on a given input graph are presented, and space-efficient implementations are provided for testing whether a given undirected graph is chordal, reporting an independent set, and properly coloring a given chordal graph.
Abstract: Following the recent trends of designing space efficient algorithms for fundamental algorithmic graph problems, we present several time-space tradeoffs for performing Maximum Cardinality Search (MCS), Stack Breadth First Search (Stack BFS), and Queue Breadth First Search (Queue BFS) on a given input graph. As applications of these results, we also provide space-efficient implementations for testing if a given undirected graph is chordal, reporting an independent set, and a proper coloring of a given chordal graph among others. Finally, we also show how two other seemingly different graph problems and their algorithms have surprising connection with MCS with respect to designing space efficient algorithms.
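Maximum Cardinality Search itself is short to state: repeatedly visit the unvisited vertex with the most already-visited neighbors. A minimal sketch, ignoring the time-space tradeoffs that are the chapter's actual subject:

```python
# Maximum Cardinality Search (MCS). On a chordal graph, the reverse of the
# MCS order is a perfect elimination ordering, which underlies the
# chordality test mentioned above.

def mcs_order(adj):
    weight = {v: 0 for v in adj}   # number of visited neighbors
    order = []
    while weight:
        u = max(weight, key=weight.get)   # unvisited vertex of max weight
        order.append(u)
        del weight[u]
        for v in adj[u]:
            if v in weight:
                weight[v] += 1
    return order

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}  # a chordal graph
print(mcs_order(adj))  # [0, 1, 2, 3] for this input
```

A naive priority structure costs extra space; the chapter's contribution is doing this (and the BFS variants) with limited workspace.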

Proceedings Article
01 Jan 2017
TL;DR: New anytime search algorithms that combine best-first with depth-first search into hybrid schemes for Marginal MAP inference in graphical models are introduced, leading to solutions for difficult instances where previous solvers were unable to find even a single solution.
Abstract: We introduce new anytime search algorithms that combine best-first with depth-first search into hybrid schemes for Marginal MAP inference in graphical models. The main goal is to facilitate the generation of upper bounds (via the best-first part) alongside the lower bounds of solutions (via the depth-first part) in an anytime fashion. We compare against two of the best current state-of-the-art schemes and show that our best+depth search scheme produces higher quality solutions faster while also producing a bound on their accuracy, which can be used to measure solution quality during search. An extensive empirical evaluation demonstrates the effectiveness of our new methods, which enjoy the strengths of best-first (optimality of search) and of depth-first (memory robustness), leading to solutions for difficult instances where previous solvers were unable to find even a single solution.

Journal ArticleDOI
TL;DR: Since the slicing list detection is performed at each visited node in the detection tree, the complexity reduction is especially significant when the number of antennas and the alphabet size are large, making the proposed detector a competitive option for high spectral-efficiency wireless systems.
Abstract: A bottleneck in multiple-input multiple-output communications systems is the complexity of detection at the receiver. The complexity of optimum maximum-likelihood detection is often prohibitive, especially for large numbers of antennas and large alphabets. A suboptimal tree-search-based detector known as the $K$ -best detector is an effective scheme that provides a flexible performance-complexity tradeoff. In this paper, we identify scalar list detection as a key building block of the $K$ -best detector, and we propose an efficient low-complexity implementation of the scalar list detector for $M$ -ary QAM using a slicing operation. Embedding the slicing list detector into the $K$ -best framework leads to our proposed slicing $K$ -best detector. Simulation results show that the proposed detector offers comparable performance to the conventional $K$ -best detector, but with significantly reduced complexity when $K$ is less than the QAM alphabet size $M$ . Since the slicing list detection is performed at each visited node in the detection tree, the complexity reduction is especially significant when the number of antennas and the alphabet size are large, making the proposed detector a competitive option for high spectral-efficiency wireless systems.

Journal ArticleDOI
Guangyan Zhang1, Shuhan Cheng1, Jiwu Shu1, Qingda Hu1, Weimin Zheng1 
TL;DR: This article proposes FastBFS, a new approach that accelerates breadth-first graph search on a single server by leveraging the access pattern of iteration over a big graph, using an edge-centric graph processing model to obtain the high bandwidth of sequential memory and/or disk access without expensive data preprocessing.

Book ChapterDOI
01 Jan 2017
TL;DR: An application of the breadth-first search (BFS) technique for storage optimization of an ASRS, which is flexible to changes in the order of the rack matrix and avoids the array of sensors usually required.
Abstract: Automated storage and retrieval systems (ASRS) are generally used in the production and supply chain industries for the storage and retrieval of products. In the present era, they are also used in state-of-the-art applications such as automated car parking systems, automated library management systems, and automated locker systems. Breadth-first search (BFS) is a type of uninformed search technique in graph theory. This research paper explains an application of the BFS technique for storage optimization of an ASRS. The implementation of BFS in random storage assignment is the core area of the research. The algorithm searches for the nearest empty slot for material storage, and it is flexible to changes in the order of the rack matrix. To determine the status (empty/filled) of the racks, a method is discussed that avoids the array of sensors generally used for this purpose.
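The slot-search step described above reduces to a BFS over the rack grid; a minimal sketch with an illustrative rack layout and start cell:

```python
# Nearest-empty-slot search: BFS over a rack modeled as a grid, so the
# first empty cell dequeued is the closest one (in grid steps) to the
# input point. Rack contents here are illustrative (0 = empty, 1 = filled).
from collections import deque

def nearest_empty_slot(rack, start):
    rows, cols = len(rack), len(rack[0])
    q = deque([start])
    seen = {start}
    while q:
        r, c = q.popleft()
        if rack[r][c] == 0:
            return (r, c)
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and (nr, nc) not in seen:
                seen.add((nr, nc))
                q.append((nr, nc))
    return None  # rack is full

rack = [[1, 1, 0],
        [1, 1, 1],
        [1, 0, 1]]
print(nearest_empty_slot(rack, (0, 0)))  # (0, 2)
```

Because BFS explores cells in order of distance, no per-slot sensor scan is needed: the stored occupancy matrix answers the status query.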

Journal ArticleDOI
TL;DR: This work presents the first algorithm for maintaining a DFS tree for an undirected graph under insertion of edges, and takes total O(n^2) time for processing any arbitrary online sequence of edge insertions.
Abstract: The Depth First Search (DFS) tree is a fundamental data structure for graphs used in solving various algorithmic problems. However, very few results are known for maintaining a DFS tree in a dynamic environment, i.e., under insertion or deletion of edges. We present the first algorithm for maintaining a DFS tree for an undirected graph under insertion of edges. For processing any arbitrary online sequence of edge insertions, this algorithm takes total O(n^2) time.

Proceedings ArticleDOI
24 Jul 2017
TL;DR: In this article, the authors presented a parallel algorithm for maintaining a DFS tree in O(1) time using m processors on an EREW PRAM, where m is the number of nodes in the graph.
Abstract: The depth first search (DFS) tree is a fundamental data structure for solving various graph problems. The classical algorithm [SIAMCOMP74] for building a DFS tree requires O(m+n) time for a given undirected graph G having n vertices and m edges. Recently, Baswana et al. [SODA16] presented a simple algorithm for updating the DFS tree of an undirected graph after an edge/vertex update in O(n) time. However, their algorithm is strictly sequential. We present an algorithm achieving similar bounds that can be adapted easily to the parallel environment. In the parallel environment, a DFS tree can be computed from scratch using O(m) processors in expected O(1) time [SICOMP90] on an EREW PRAM, whereas the best deterministic algorithm takes O(√n) time [SIAMCOMP90, JAL93] on a CRCW PRAM. Our algorithm can be used to develop optimal (up to polylog n factors) deterministic algorithms for maintaining fully dynamic DFS and fault-tolerant DFS of an undirected graph. 1. Parallel fully dynamic DFS: given any arbitrary online sequence of vertex or edge updates, we can maintain a DFS tree of an undirected graph in O(1) time per update using m processors on an EREW PRAM. 2. Parallel fault-tolerant DFS: an undirected graph can be preprocessed to build a data structure of size O(m) such that for a set of k updates (where k is constant) in the graph, a DFS tree of the updated graph can be computed in O(1) time using n processors on an EREW PRAM. For constant k, this is also work optimal (up to polylog n factors). Moreover, our fully dynamic DFS algorithm provides, in a seamless manner, nearly optimal (up to polylog n factors) algorithms for maintaining a DFS tree in the semi-streaming environment and in a restricted distributed model. These are the first parallel, semi-streaming, and distributed algorithms for maintaining a DFS tree in the dynamic setting.

Journal ArticleDOI
TL;DR: A lattice-reduction (LR)-aided breadth-first tree searching algorithm for MIMO detection achieves near-optimal performance with very low complexity; simulations verify its higher efficiency, in terms of the performance/complexity tradeoff, than the existing LR-aided K-best detectors and LR-aided fixed-complexity sphere decoders.
Abstract: We propose a lattice-reduction (LR)-aided breadth-first tree searching algorithm for MIMO detection that achieves near-optimal performance with very low complexity. At each level of the tree in the search, only the paths whose accumulated metrics satisfy a particular restriction condition are kept as candidates. Furthermore, the number of child nodes expanded from each parent node and the maximum number of candidates preserved at each level are also restricted. All these measures ensure that the proposed algorithm reaches a preset near-optimal performance while achieving very low average and maximum computational complexity. Simulation results verify the proposed algorithm’s higher efficiency in terms of the performance/complexity tradeoff than the existing LR-aided K-best detectors and LR-aided fixed-complexity sphere decoders.
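The level-by-level pruning the abstract describes can be illustrated with a generic breadth-first K-best search. The sketch below is a deliberate simplification (real-valued metrics, no lattice reduction), not the paper's detector; `k_max` and `radius` are hypothetical stand-ins for the paper's candidate cap and restriction condition.

```python
# Illustrative breadth-first K-best tree search: at each level, expand
# every surviving path by all candidate symbols, then keep only paths
# whose accumulated metric stays within `radius` of the level's best,
# capped at `k_max` candidates.
import heapq

def k_best_search(levels, metric, k_max, radius):
    """levels: list of candidate-symbol lists, one per tree level.
    metric(path, sym): incremental cost of appending sym to path."""
    candidates = [((), 0.0)]          # (path, accumulated metric)
    for symbols in levels:
        expanded = [(cost + metric(path, s), path + (s,))
                    for path, cost in candidates for s in symbols]
        best = min(c for c, _ in expanded)
        # Restriction condition: discard paths far above the level's best.
        survivors = [(c, p) for c, p in expanded if c <= best + radius]
        # Cap the number of candidates preserved at this level.
        survivors = heapq.nsmallest(k_max, survivors)
        candidates = [(p, c) for c, p in survivors]
    return min(candidates, key=lambda pc: pc[1])
```

For example, with symbol alphabet {-1, 1} per level and a per-level squared-distance metric to a target vector, the search recovers the closest path while never holding more than `k_max` candidates.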

Book ChapterDOI
04 Oct 2017
TL;DR: This article studies how PageRank can be updated in an evolving tree graph, partitioning the vertices into levels using breadth-first search and determining PageRanks as the expected numbers of visits by random walks starting from any vertex in the graph.
Abstract: In this article, we study how PageRank can be updated in an evolving tree graph. We are interested in finding how the ranks of the graph can be updated simultaneously and effectively using previous ranks, without resorting to iterative methods such as the Jacobi or power method. We demonstrate and discuss how PageRank can be updated when a leaf is added to a tree, when at least one leaf is added to a vertex with at least one outgoing edge, when an edge is added between vertices at the same level, and when a forward edge is added in a tree graph. The results of this paper provide new insights into, and applications of, the standard partitioning of the vertices of a graph into levels using the breadth-first search algorithm. One then determines PageRanks as the expected numbers of visits by random walks starting from any vertex in the graph. We note that the time complexity of the proposed method is linear. It is also important to point out that the type of vertex plays an essential role in the updating of PageRank.
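The standard level partitioning the abstract refers to is plain breadth-first search from the root. As a minimal illustration (not the paper's update procedure), the sketch below groups the vertices of a tree, given as a hypothetical `children` map, into BFS levels:

```python
# Partition tree vertices into levels by breadth-first search from root.
from collections import deque

def bfs_levels(children, root):
    """children: dict mapping a vertex to the list of its child vertices.
    Returns a list of levels, each a list of vertices at that depth."""
    levels, frontier = [], deque([root])
    while frontier:
        levels.append(list(frontier))   # record the current level
        nxt = deque()
        for v in frontier:
            nxt.extend(children.get(v, []))
        frontier = nxt                  # descend one level
    return levels
```

For the tree rooted at 0 with children `{0: [1, 2], 1: [3]}`, this yields the levels `[[0], [1, 2], [3]]`, the partition on which the level-wise PageRank updates operate.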

Posted Content
TL;DR: This work proposes three techniques for improved load-balancing of graph applications on GPUs, and illustrates the effectiveness of each of the proposed techniques in comparison to the existing node-based and edge-based mechanisms.
Abstract: Acceleration of graph applications on GPUs has found large interest due to the ubiquitous use of graph processing in various domains. The inherent irregularity in graph applications leads to several challenges for parallelization. A key challenge, which we address in this paper, is that of load imbalance. If the work assignment to threads uses node-based graph partitioning, it can result in a skewed task distribution, leading to poor load balance. In contrast, if the work assignment uses edge-based graph partitioning, the load balancing is better, but the memory requirement is relatively higher. This makes it unsuitable for large graphs. In this work, we propose three techniques for improved load balancing of graph applications on GPUs. Each technique brings unique advantages, and a user may have to employ a specific technique based on the requirement. Using Breadth First Search and Single Source Shortest Paths as our processing kernels, we illustrate the effectiveness of each of the proposed techniques in comparison to the existing node-based and edge-based mechanisms.
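To see why node-based partitioning can skew the task distribution, consider this small illustrative sketch (not from the paper): it contrasts the worst per-thread edge load under an even vertex split against an ideal even edge split.

```python
# Contrast node-based vs edge-based work assignment on a degree list.
# Node-based: split vertices evenly among threads; a high-degree vertex
# drags all its edges onto one thread. Edge-based: split the edge list
# evenly, so the maximum per-thread load is near the ideal average.
def partition_load(degrees, num_threads):
    """Returns (max per-thread edges under node split, ideal even edge split)."""
    n, m = len(degrees), sum(degrees)
    per = (n + num_threads - 1) // num_threads      # vertices per thread
    node_loads = [sum(degrees[i:i + per]) for i in range(0, n, per)]
    edge_max = (m + num_threads - 1) // num_threads  # even edge split
    return max(node_loads), edge_max
```

With degrees `[100, 1, 1, 1]` and two threads, the node-based split leaves one thread with 101 edges while the edge-based split caps every thread near 52, which is the imbalance the paper's techniques target.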

Posted Content
TL;DR: This paper fills the gap through arbitrary grouping of branches, including this grouping in the delay analysis of any contention tree algorithm that runs a breadth-first-search exploration, and shows that the analysis is in agreement with the realizations.
Abstract: The contention tree algorithm was originally invented as a solution to the stable-throughput problem of Slotted ALOHA in multiple access schemes. Even though the throughput is stabilized in tree algorithms, the delay of requests may grow to infinity with the arrival rate of the system. Delay depends heavily on how the tree structure is explored, i.e., breadth search or depth search. Breadth search is necessary for faster exploration of the tree. The analytical probability distribution of delay, which is available mostly for depth search, is not generalizable to all breadth searches. In this paper we fill this gap through arbitrary grouping of branches, and by including this grouping in the delay analysis. This enables obtaining the delay analysis of any contention tree algorithm that runs a breadth-first-search exploration. We show through simulations that the analysis is in agreement with the realizations.
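As a toy illustration of breadth-first tree exploration in this setting (not the paper's analysis), the sketch below simulates a binary contention tree in which a colliding group splits by fair coin flips and pending groups are served in FIFO (breadth-first) order; it returns the number of slots consumed:

```python
# Simulate a binary contention tree explored breadth-first: a group of
# more than one user collides in its slot and is split at random into
# two subgroups, which join the back of the FIFO queue of pending groups.
import random

def tree_resolution_slots(n, rng):
    """n: number of initially colliding users. rng: a random.Random.
    Returns the total number of slots until every user is resolved."""
    queue = [n]                          # pending groups, FIFO order
    slots = 0
    while queue:
        group = queue.pop(0)
        slots += 1                       # one slot per group transmission
        if group > 1:                    # collision: split by coin flips
            left = sum(rng.random() < 0.5 for _ in range(group))
            queue += [left, group - left]
    return slots
```

Because every collision spawns exactly two child slots, the slot count is always odd and at least 2n - 1 for n users, matching the binary-tree structure the delay analysis is built on.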

Journal ArticleDOI
TL;DR: A new approach for the additive projection parallel preconditioned iterative method based on semiaggregation and a subspace compression technique, for general sparse linear systems, is presented, based on nonoverlapping domain decomposition in conjunction with algebraic graph partitioning techniques for separating the subdomains.
Abstract: During the last decades, the continuous expansion of supercomputing infrastructures necessitates the design of scalable and robust parallel numerical methods for solving large sparse linear systems. A new approach for the additive projection parallel preconditioned iterative method based on semiaggregation and a subspace compression technique, for general sparse linear systems, is presented. The subspace compression technique utilizes a subdomain adjacency matrix and breadth first search to discover and aggregate subdomains to limit the average size of the local linear systems, resulting in reduced memory requirements. The depth of aggregation is controlled by a user defined parameter. The local coefficient matrices use the aggregates computed during the formation of the subdomain adjacency matrix in order to avoid recomputation and improve performance. Moreover, the rows and columns corresponding to the newly formed aggregates are ordered last to further reduce fill-in during the factorization of the local coefficient matrices. Furthermore, the method is based on nonoverlapping domain decomposition in conjunction with algebraic graph partitioning techniques for separating the subdomains. Finally, the applicability and implementation issues are discussed and numerical results along with comparative results are presented.
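A minimal sketch of the BFS-based aggregation idea, under simplifying assumptions (the paper's actual semiaggregation procedure is more involved): starting from each unassigned subdomain, a depth-limited breadth-first search over a hypothetical subdomain adjacency map collects neighbors into an aggregate, with the depth bound playing the role of the user-defined parameter.

```python
# Greedy depth-limited BFS aggregation over a subdomain adjacency map.
from collections import deque

def aggregate_subdomains(adj, depth):
    """adj: dict mapping a subdomain id to its adjacent subdomain ids.
    depth: BFS depth limit controlling how far aggregation reaches.
    Returns a list of aggregates (lists of subdomain ids)."""
    seen, aggregates = set(), []
    for s in sorted(adj):
        if s in seen:
            continue
        agg, frontier = [s], deque([(s, 0)])
        seen.add(s)
        while frontier:
            u, d = frontier.popleft()
            if d == depth:               # depth limit keeps aggregates small
                continue
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    agg.append(v)
                    frontier.append((v, d + 1))
        aggregates.append(agg)
    return aggregates
```

On a chain of four subdomains with depth 1, this produces two aggregates of two subdomains each, limiting the size of the local linear systems as the abstract describes.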

Proceedings ArticleDOI
01 Oct 2017
TL;DR: In this paper, the two algorithms are applied to the STA module and their runtime efficiencies are compared by testing a large number of sequential circuit instances, finding that the BFS algorithm can implement the STA module more efficiently than the DFS algorithm.
Abstract: This paper is a study of the application of graph traversal algorithms in the Static Timing Analysis (STA) module of an Electronic Design Automation (EDA) tool. After the Field Programmable Gate Array (FPGA) tool reads in the design circuit, it internally replaces the circuit with a UDM (Universal Design Methodology) netlist. The STA module does not work directly on the netlist, but converts the netlist to a timing graph that stores only timing-related information. The timing graph consists of nodes and edges: nodes correspond to component pins or input and output ports, and edges connect nodes. Edges have weights attached to them that denote characteristics such as, in this case, timing arc delays [1]. There are two basic graph traversal algorithms: Depth First Search (DFS) and Breadth First Search (BFS). In this paper, both algorithms are applied to the STA module and their runtime efficiencies are compared by testing a large number of sequential circuit instances. The conclusion is that the BFS algorithm can implement the STA module more efficiently than the DFS algorithm.
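A BFS-style traversal of such a timing graph can be sketched as follows; this is an illustrative simplification (a Kahn-style level-order walk computing latest arrival times over weighted timing arcs), not the tool's actual STA implementation.

```python
# BFS-style (Kahn) traversal of a weighted timing graph: propagate the
# latest arrival time along timing arcs, visiting a node only after all
# of its predecessors have been processed.
from collections import deque

def arrival_times(edges, n):
    """edges: dict mapping an arc (u, v) to its delay. n: node count.
    Returns the latest arrival time at each of the n nodes."""
    succ = {u: [] for u in range(n)}
    indeg = [0] * n
    for (u, v), d in edges.items():
        succ[u].append((v, d))
        indeg[v] += 1
    arrival = [0.0] * n
    frontier = deque(u for u in range(n) if indeg[u] == 0)  # input ports
    while frontier:
        u = frontier.popleft()
        for v, d in succ[u]:
            arrival[v] = max(arrival[v], arrival[u] + d)    # worst-case path
            indeg[v] -= 1
            if indeg[v] == 0:
                frontier.append(v)
    return arrival
```

Each node and arc is touched exactly once, which is the linear-time behavior that makes a breadth-first sweep attractive for timing propagation.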

Posted Content
TL;DR: The results show that depth first search is efficient for energy constrained exploration of trees, even though it is known that the same does not hold for the exploration of arbitrary graphs.
Abstract: Depth first search is a natural algorithmic technique for constructing a closed route that visits all vertices of a graph. The length of such a route equals, in an edge-weighted tree, twice the total weight of all edges of the tree, and this is asymptotically optimal over all exploration strategies. This paper considers a variant of such search strategies where the length of each route is bounded by a positive integer $B$ (e.g. due to limited energy resources of the searcher). The objective is to cover all the edges of a tree $T$ using the minimum number of routes, each starting and ending at the root and each being of length at most $B$. To this end, we analyze the following natural greedy tree traversal process that is based on decomposing a depth first search traversal into a sequence of limited-length routes. Given any arbitrary depth first search traversal $R$ of the tree $T$, we cover $R$ with routes $R_1,\ldots,R_l$, each of length at most $B$, such that: $R_i$ starts at the root, reaches directly the farthest point of $R$ visited by $R_{i-1}$, then $R_i$ continues along the path $R$ as far as possible, and finally $R_i$ returns to the root. We call the above algorithm \emph{piecemeal-DFS} and we prove that it achieves the asymptotically minimal number of routes $l$, regardless of the choice of $R$. Our analysis also shows that the total length of the traversal (and thus the traversal time) of piecemeal-DFS is asymptotically minimal over all energy-constrained exploration strategies. The fact that $R$ can be chosen arbitrarily means that the exploration strategy can be constructed in an online fashion when the input tree $T$ is not known in advance. Surprisingly, our results show that depth first search is efficient for energy-constrained exploration of trees, even though it is known that the same does not hold for energy-constrained exploration of arbitrary graphs.
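The decomposition can be sketched directly on a precomputed DFS tour. The code below is a simplified illustration of the stated greedy rule, not the paper's full algorithm; `tour_dist` and `root_dist` are assumed precomputed: prefix distances along the tour and tree distances from each tour stop to the root, respectively.

```python
# Greedily split a DFS tour into routes of length <= B. Each route goes
# root -> resume point -> as far along the tour as the budget allows ->
# root, matching the piecemeal decomposition described in the abstract.
def piecemeal_routes(tour_dist, root_dist, B):
    """tour_dist[i]: distance along the DFS tour from its start to stop i.
    root_dist[i]: tree distance from the root to stop i.
    Returns routes as (start stop, end stop) index pairs on the tour."""
    routes, p = [], 0
    while p < len(tour_dist) - 1:
        q = p
        # Advance along the tour while the route still fits in budget B:
        # reach stop p from the root, walk the tour to stop q+1, return.
        while (q + 1 < len(tour_dist) and
               root_dist[p] + (tour_dist[q + 1] - tour_dist[p])
               + root_dist[q + 1] <= B):
            q += 1
        if q == p:
            raise ValueError("budget B too small to make progress")
        routes.append((p, q))
        p = q                            # next route resumes at stop q
    return routes
```

For a star with two unit-weight leaves (tour stops root, leaf, root, leaf, root) and budget B = 2, the tour splits into two routes, one per leaf; with B = 4 a single route suffices.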

Book ChapterDOI
01 Jan 2017
TL;DR: This paper presented a fast, scalable, and energy-efficient BFS for a nonuniform memory access (NUMA)-based system, in which the NUMA architecture was carefully considered.
Abstract: The breadth-first search (BFS) is one of the most central processing kernels in graph theory. In this paper, we presented a fast, scalable, and energy-efficient BFS for a nonuniform memory access (NUMA)-based system, in which the NUMA architecture was carefully considered. Our implementation achieved a performance rate of 17.5 billion edges per second for a Kronecker graph with \(2^{33}\) vertices and \(2^{37}\) edges on two racks of an SGI UV 2000 system with 1,280 threads, and the fastest entries for a shared-memory system in the June 2014 and November 2014 Graph500 lists. It also produced the most energy-efficient entries in the first and second (small data category) and third, fourth, fifth, and sixth (big data category) Green Graph500 lists on a 4-socket Intel Xeon E5-4640 system.
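The kernel being timed here is the standard level-synchronous BFS; a minimal single-threaded sketch (without the paper's NUMA-aware optimizations) looks like this:

```python
# Level-synchronous breadth-first search: expand the frontier one hop at
# a time, recording the hop distance from the source to every reachable
# vertex. This is the kernel that Graph500 measures in traversed edges
# per second (TEPS).
from collections import deque

def bfs_distances(adj, source):
    """adj: dict mapping a vertex to its list of neighbors.
    Returns a dict of hop distances for all vertices reachable from source."""
    dist = {source: 0}
    frontier = deque([source])
    while frontier:
        u = frontier.popleft()
        for v in adj.get(u, []):
            if v not in dist:            # first visit fixes the distance
                dist[v] = dist[u] + 1
                frontier.append(v)
    return dist
```

NUMA-aware implementations like the one described above keep each thread's portion of the frontier and adjacency data on its local memory node; the traversal logic itself is unchanged.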