Journal ArticleDOI

Cache-Oblivious Data Structures and Algorithms for Undirected Breadth-First Search and Shortest Paths

TL;DR: The cache-oblivious SSSP algorithm takes nearly full advantage of block transfers for dense graphs, and the number of I/Os for sparse graphs is reduced by a factor of nearly $\sqrt{B}$, where B is the cache-block size.
Abstract: We present improved cache-oblivious data structures and algorithms for breadth-first search (BFS) on undirected graphs and the single-source shortest path (SSSP) problem on undirected graphs with non-negative edge weights. For the SSSP problem, our result closes the performance gap between the currently best cache-aware algorithm and the cache-oblivious counterpart. Our cache-oblivious SSSP algorithm takes nearly full advantage of block transfers for dense graphs. The algorithm relies on a new data structure, called the bucket heap, which is the first cache-oblivious priority queue to efficiently support a weak DecreaseKey operation. For the BFS problem, we reduce the number of I/Os for sparse graphs by a factor of nearly $\sqrt{B}$, where B is the cache-block size, nearly closing the performance gap between the currently best cache-aware and cache-oblivious algorithms.
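The "weak DecreaseKey" supported by the bucket heap lets the queue lower a key's priority without locating its old entry; stale duplicates are simply filtered out later. Below is a minimal in-memory sketch of that interface in Python. It is only an illustration of the operation's semantics, not the paper's bucket heap (whose buckets of geometrically growing size, merged lazily, are what yield the cache-oblivious I/O bound); the class and method names are invented here.

import heapq

class LazyPQ:
    """Toy priority queue with 'weak DecreaseKey' semantics:
    update(key, prio) lowers the effective priority of key to
    min(old, prio) by inserting a duplicate; extract_min discards
    stale entries on the way out."""

    def __init__(self):
        self._heap = []   # (priority, key) pairs, possibly stale
        self._best = {}   # key -> smallest priority inserted so far

    def update(self, key, prio):
        # Weak DecreaseKey: never search for the old entry; just insert.
        if prio < self._best.get(key, float("inf")):
            self._best[key] = prio
            heapq.heappush(self._heap, (prio, key))

    def extract_min(self):
        # Pop until we find an entry that is not stale.
        while self._heap:
            prio, key = heapq.heappop(self._heap)
            if self._best.get(key) == prio:
                del self._best[key]
                return key, prio
        raise IndexError("extract_min from empty queue")

Dijkstra-style SSSP only ever needs exactly this access pattern when relaxing edges, which is why a priority queue with even a weak DecreaseKey suffices for the shortest-path application.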


Citations
Book
23 Sep 2018
TL;DR: This dissertation focuses on two fundamental sorting problems, string sorting and suffix sorting; it proposes both multiway distribution-based string sorting (string sample sort) and multiway merge-based string sorting (LCP-aware merge and mergesort), and engineers and parallelizes both approaches.
Abstract: This dissertation focuses on two fundamental sorting problems: string sorting and suffix sorting. The first part considers parallel string sorting on shared-memory multi-core machines, the second part external memory suffix sorting using the induced sorting principle, and the third part distributed external memory suffix sorting with a new distributed algorithmic big data framework named Thrill.

Sorting strings or vectors is a basic algorithmic challenge that differs from integer sorting because it is important to access components of the keys to avoid repeated operations on entire strings. We focus on sorting large inputs which fit into the RAM of a shared-memory machine. String sorting is needed, for instance, in database index construction, in suffix sorting algorithms, and to order high-dimensional geometric data. We first survey engineered variants of basic sequential string sorting algorithms and perform an extensive experimental evaluation to measure their performance. Furthermore, we perform experiments to quantify parallel memory bandwidth and latency as preliminary work for designing parallel string sorting algorithms.

We then propose string sample sort as an adaptation of sample sort to string objects and present its engineered version Super Scalar String Sample Sort. This parallel-ready algorithm runs in O(D/w + n log n) expected time, makes effective use of the cache hierarchy, uses word- and instruction-level parallelism, and avoids branch mispredictions. Our parallelization, named Parallel Super Scalar String Sample Sort (pS5), employs voluntary work sharing for load balancing and is the overall best performing algorithm on single-socket multi-core machines in our experiments. For platforms with non-uniform memory access (NUMA) we propose to run pS5 on each NUMA node independently and then merge the sorted string sequences. To accelerate the merge with longest common prefix (LCP) values we present a new LCP-aware multiway merge algorithm using a tournament tree (a toy sketch of the LCP-aware merging idea follows this abstract). The merge algorithm is also used to construct a stand-alone LCP-aware K-way mergesort, which runs in O(D + n log n + n/K) time and benefits from long common prefixes in the input. Broadly speaking, we propose both multiway distribution-based string sorting with string sample sort and multiway merge-based string sorting with LCP-aware merge and mergesort, and engineer and parallelize both approaches. We also present parallelizations of multikey quicksort and radix sort, and perform an extensive experimental evaluation using six machines and seven inputs. For all input instances, except random strings and URLs, pS5 achieves higher speedups on modern single-socket multi-core machines than our own parallel multikey quicksort and radix sort implementations, which are already better than any previous ones. On multi-socket NUMA machines pS5 combined with the LCP-aware top-level multiway merging was fastest on most inputs.

We then turn our focus to suffix sorting, which is equivalent to suffix array construction. The suffix array is one of the most popular text indexes; it can be used for fast substring search in DNA or text corpora and in compression applications, and it is the basis for many string algorithms. When augmented with the LCP array and additional tables, the suffix array can emulate the suffix tree in a myriad of stringology algorithms. Our goal is to create fast and scalable suffix sorting algorithms to generate large suffix arrays for real-world inputs.
As an introduction to suffix array construction, we first present a brief survey of its principles and history. Our initial contribution to this field is eSAIS, the first external memory suffix sorting algorithm which uses the induced sorting principle. Its central loop is an elegant reformulation of this principle using an external memory priority queue, and our theoretical analysis shows that eSAIS requires at most Sort(17n) + Scan(9n) I/O volume. We then extend eSAIS to also construct the LCP array while suffix sorting, which yields the first implementation of fully external memory suffix and LCP array construction in the literature. Our experiments demonstrate that eSAIS is a factor of two faster than DC3, the previously best external memory suffix sorting implementation. After our initial publication of eSAIS, many authors showed interest in the topic, and we review their contributions and improvements over eSAIS.

For scaling to even larger inputs, we then consider suffix sorting on a distributed cluster machine. To harness the computational power of such a system in a convenient data-flow style functional programming paradigm, we propose the new high-performance distributed big data processing framework Thrill. Thrill's central concept is a distributed immutable array (DIA), a virtual array of C++ objects distributed onto the cluster. Such arrays can be manipulated using a small set of scalable primitives, such as mapping, reducing, and sorting. These are implemented using pipelined distributed external memory algorithms encapsulated as C++ template classes, which can be efficiently coupled to form large complex applications. Our Thrill prototype is evaluated using five micro benchmarks against the popular frameworks Apache Spark and Flink on up to 16 hosts in the AWS Elastic Compute Cloud. Thrill consistently outperforms the other frameworks in all benchmarks and on all numbers of hosts.

Using Thrill we then implement five suffix sorting algorithms as a case study. Three are based on prefix doubling and two are variants of the linear-time difference cover algorithm DC. The implementation of these complex algorithms demonstrates the expressiveness of the scalable primitives provided by Thrill. They are also the first distributed external memory suffix sorters presented in the literature. We compare them experimentally against two hand-coded MPI implementations and the fastest non-distributed sequential suffix sorters. Our results show that algorithms implemented using Thrill are competitive with MPI programs, but scale to larger inputs due to automatic usage of external memory. In the future, these implementations can benefit from improvements to Thrill such as fault tolerance or specialized sorting algorithms.
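To make the LCP-aware merging idea concrete, here is a toy two-way version in Python: it reuses the common prefix length established by earlier comparisons so that shared characters are never re-scanned. This is only an illustrative sketch with invented function names; the dissertation's tournament-tree merge generalizes the same invariant to K sequences.

def lcp(a, b, k=0):
    # Longest common prefix length of a and b, scanning from position k
    # (the caller guarantees a[:k] == b[:k]).
    n = min(len(a), len(b))
    while k < n and a[k] == b[k]:
        k += 1
    return k

def lcp_merge(xs, ys):
    # Merge two sorted string lists. Invariant: h is a known lower bound
    # on lcp(xs[i], ys[j]), so each comparison resumes at position h
    # instead of position 0.
    lx = [0] + [lcp(xs[k - 1], xs[k]) for k in range(1, len(xs))]
    ly = [0] + [lcp(ys[k - 1], ys[k]) for k in range(1, len(ys))]
    out, i, j, h = [], 0, 0, 0
    while i < len(xs) and j < len(ys):
        h = lcp(xs[i], ys[j], h)
        if h == len(xs[i]) or (h < len(ys[j]) and xs[i][h] < ys[j][h]):
            out.append(xs[i])
            i += 1
            # lcp(xs[i], ys[j]) >= min(lcp(xs[i-1], xs[i]), old h)
            h = min(lx[i], h) if i < len(xs) else 0
        else:
            out.append(ys[j])
            j += 1
            h = min(ly[j], h) if j < len(ys) else 0
    return out + xs[i:] + ys[j:]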
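The prefix-doubling family of suffix sorters mentioned in the last paragraph can likewise be shown in a few lines. The sketch below is a plain in-memory O(n log² n) rendering of the principle, assuming nothing from the dissertation's code; the distributed variants essentially replace the sort call with Thrill's distributed sorting primitive.

def suffix_array_prefix_doubling(s):
    # Rank all suffixes by their first k characters, doubling k each
    # round until all ranks are distinct.
    n = len(s)
    sa = list(range(n))
    if n <= 1:
        return sa
    rank = [ord(c) for c in s]
    k = 1
    while True:
        # Sort by (rank of first k chars, rank of the next k chars).
        key = lambda i: (rank[i], rank[i + k] if i + k < n else -1)
        sa.sort(key=key)
        # Re-rank: suffixes with equal keys keep equal ranks.
        new_rank = [0] * n
        for t in range(1, n):
            new_rank[sa[t]] = new_rank[sa[t - 1]] + (key(sa[t]) != key(sa[t - 1]))
        rank = new_rank
        if rank[sa[-1]] == n - 1:  # all n ranks distinct: done
            break
        k *= 2
    return sa

# Example: suffix_array_prefix_doubling("banana") returns [5, 3, 1, 0, 4, 2].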

16 citations

Book ChapterDOI
11 Jul 2005
TL;DR: This work presents an efficient cache-oblivious implementation of the shortest-path algorithm for planar graphs by Klein et al., and proves that it incurs no more than ${\mathcal O}(\frac{N}{B^{1/2-\epsilon}} + \frac{N}{B}\log N)$, i.e. o(N), block transfers on a graph with N vertices.
Abstract: We present an efficient cache-oblivious implementation of the shortest-path algorithm for planar graphs by Klein et al., and prove that it incurs no more than ${\mathcal O}(\frac{N}{B^{1/2-\epsilon}} + \frac{N}{B}\log N)$ block transfers on a graph with N vertices. This is the first cache-oblivious algorithm for this problem that incurs o(N) block transfers.

12 citations


Cites background from "Cache-Oblivious Data Structures and..."

  • ...Recently, a number of cache-oblivious graph algorithms have been obtained for general graphs, including algorithms for computing connected components and minimum spanning trees [2], directed breadth-first search and depth-first search [2], undirected breadth-first search [12], and undirected shortest paths [12, 14]....


Proceedings ArticleDOI
06 Jan 2019
TL;DR: This paper proposes a new priority queue which supports the DecreaseKey operation and has an expected amortized I/O complexity of $O(\frac{1}{B}\log\frac{N}{B}/\log\log N)$.
Abstract: A priority queue is a fundamental data structure that maintains a dynamic set of (key, priority)-pairs and supports Insert, Delete, ExtractMin and DecreaseKey operations. In the external memory model, the current best priority queue supports each operation in amortized $O(\frac{1}{B}\log\frac{N}{B})$ I/Os. If the DecreaseKey operation does not need to be supported, one can design a more efficient data structure that supports the Insert, Delete and ExtractMin operations in $O(\frac{1}{B}\log\frac{N}{B}/\log\frac{M}{B})$ I/Os. A recent result shows that a degradation in performance is inevitable by proving a lower bound of $\Omega(\frac{1}{B}\log B/\log\log N)$ I/Os for priority queues with DecreaseKeys. In this paper we tighten the gap between the lower bound and the upper bound by proposing a new priority queue which supports the DecreaseKey operation and has an expected amortized I/O complexity of $O(\frac{1}{B}\log\frac{N}{B}/\log\log N)$. Our result improves the external memory priority queue with DecreaseKeys for the first time in over a decade, and also gives the fastest external memory single source shortest path algorithm.
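For comparison, the bounds discussed in this abstract side by side (N items, block size B, internal memory M); the identity $\log_{M/B}\frac{N}{B} = \log\frac{N}{B}/\log\frac{M}{B}$ relates the two forms of the DecreaseKey-free upper bound:

\begin{align*}
\text{with DecreaseKey (previous best):}\quad & O\!\left(\tfrac{1}{B}\log\tfrac{N}{B}\right)\\
\text{without DecreaseKey:}\quad & O\!\left(\tfrac{1}{B}\log_{M/B}\tfrac{N}{B}\right) = O\!\left(\tfrac{1}{B}\cdot\tfrac{\log(N/B)}{\log(M/B)}\right)\\
\text{lower bound with DecreaseKey:}\quad & \Omega\!\left(\tfrac{1}{B}\cdot\tfrac{\log B}{\log\log N}\right)\\
\text{this paper (expected, amortized):}\quad & O\!\left(\tfrac{1}{B}\cdot\tfrac{\log(N/B)}{\log\log N}\right)
\end{align*}

The remaining gap between the last two lines is a factor of $\log(N/B)/\log B$.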

12 citations


Cites methods from "Cache-Oblivious Data Structures and..."

  • ...In the cache-oblivious model the priority queues with [13, 10] or without DecreaseKeys [3, 8] can both achieve the same I/O complexity as in the external memory model....


Book ChapterDOI
23 Feb 2006
TL;DR: In this paper, a randomized algorithm for sorting binary strings in external memory is proposed, where the error probability can be chosen as $O(N^{-c})$ for any positive constant c. The algorithm also works in the cache-oblivious model under the tall cache assumption.
Abstract: We give a randomized algorithm for sorting strings in external memory. For K binary strings comprising N words in total, our algorithm finds the sorted order and the longest common prefix sequence of the strings using $O(\frac{K}{B}\log_{M/B}(\frac{K}{M})\log(\frac{N}{K}) + \frac{N}{B})$ I/Os. This bound is never worse than $O(\frac{K}{B}\log_{M/B}(\frac{K}{M})\log\log_{M/B}(\frac{K}{M}) + \frac{N}{B})$ I/Os, and improves on the (deterministic) algorithm of Arge et al. (On sorting strings in external memory, STOC '97). The error probability of the algorithm can be chosen as $O(N^{-c})$ for any positive constant c. The algorithm even works in the cache-oblivious model under the tall cache assumption, i.e., assuming $M > B^{1+\epsilon}$ for some $\epsilon > 0$. An implication of our result is improved construction algorithms for external memory string dictionaries.
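The "longest common prefix sequence" output mentioned above is easy to state in code: entry t is the LCP length of the t-th and (t+1)-st strings in sorted order. A toy in-memory illustration in Python (plain character scanning, not the I/O-efficient randomized algorithm of the paper; the function name is invented here):

def lcp_sequence(sorted_strings):
    # LCP lengths of consecutive strings in an already-sorted list.
    lcps = []
    for prev, cur in zip(sorted_strings, sorted_strings[1:]):
        h = 0
        while h < min(len(prev), len(cur)) and prev[h] == cur[h]:
            h += 1
        lcps.append(h)
    return lcps

# Example: lcp_sequence(["car", "cart", "cat"]) returns [3, 2].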

11 citations

Proceedings ArticleDOI
07 Jan 2007
TL;DR: The algorithm is the first cache-oblivious shortest-path algorithm incurring less than one memory transfer per vertex if the graph is sparse and the maximum edge length W satisfies $W = 2^{o(B)}$, where B is the cache block size.
Abstract: We present a cache-oblivious algorithm for computing single-source shortest paths in undirected graphs with non-negative edge lengths. The algorithm incurs $O(\sqrt{\frac{nm\log W}{B}} + \frac{m}{B}\log n + \mathrm{MST}(n, m))$ memory transfers on a graph with n vertices, m edges, and real edge lengths between 1 and W; B denotes the cache block size, and MST(n, m) denotes the number of memory transfers required to compute a minimum spanning tree of a graph with n vertices and m edges. Our algorithm is the first cache-oblivious shortest-path algorithm incurring less than one memory transfer per vertex if the graph is sparse (m = O(n)) and $W = 2^{o(B)}$.
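The sparse-graph claim can be read off the dominant first term of the bound: with m = O(n) and $W = 2^{o(B)}$, i.e. $\log W = o(B)$,

\[
\sqrt{\frac{n \cdot m \log W}{B}} = O\!\left(n\sqrt{\frac{\log W}{B}}\right) = o(n),
\]

so the algorithm spends asymptotically less than one memory transfer per vertex; the remaining terms are lower-order for the parameter ranges considered in the paper.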

10 citations


Cites background or methods from "Cache-Oblivious Data Structures and..."

  • ...The improved algorithms of [6, 14, 15, 16] overcome this bottleneck using the following idea: Instead of accessing one adjacency list at a time, group vertices appropriately and form edge groups by concatenating the adjacency lists of the vertices in each vertex group; when the first vertex in a vertex group is visited, load the whole corresponding edge group into a hot pool....


  • ...ing variants of SSSP on undirected graphs in a cache-efficient manner [3, 6, 8, 13, 14, 15, 16, 17]....


  • ...The main bottleneck in the algorithms of [6, 8, 13, 17] is that retrieving the adjacency lists of visited vertices, in order to relax their incident edges, requires at least one MT per vertex because the order in which vertices are visited is hard to predict....


  • ...The algorithms of [6, 8, 13, 15, 16] face the same problem and address it by using a second priority queue to eliminate re-inserted vertices before they can be visited for a second time....


  • ...This bound has been matched in the cache-oblivious model [6, 8] by developing a cache-oblivious priority queue to replace the external one used in [13]....

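The hot-pool idea quoted in the first excerpt above (group vertices, concatenate their adjacency lists into "edge groups", and load a whole group the first time one of its vertices is visited) can be sketched in a few lines of Python. This is a toy in-memory illustration only; all names are invented here, and a Python list stands in for a contiguous, block-aligned region of external memory.

def build_edge_groups(adj, groups):
    # Concatenate the adjacency lists of each vertex group into one
    # contiguous "edge group".
    edge_group = {}   # group id -> concatenated adjacency lists
    group_of = {}     # vertex  -> group id
    for gid, vertices in enumerate(groups):
        edge_group[gid] = [e for v in vertices for e in adj[v]]
        for v in vertices:
            group_of[v] = gid
    return edge_group, group_of

def load_on_first_visit(v, edge_group, group_of, hot_pool, loaded):
    # When the first vertex of a group is visited, pull the whole edge
    # group into the hot pool in one sequential scan, instead of paying
    # one random access per adjacency list.
    gid = group_of[v]
    if gid not in loaded:
        hot_pool.extend(edge_group[gid])
        loaded.add(gid)

The point of the grouping is that one sequential load of an edge group costs O(size/B) transfers, whereas fetching each adjacency list on demand costs at least one transfer per vertex because the visit order is hard to predict.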