scispace - formally typeset
Search or ask a question
Proceedings ArticleDOI

Low depth cache-oblivious algorithms

TL;DR: This paper describes several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators.
Abstract: In this paper we explore a simple and general approach for developing parallel algorithms that lead to good cache complexity on parallel machines with private or shared caches. The approach is to design nested-parallel algorithms that have low depth (span, critical path length) and for which the natural sequential evaluation order has low cache complexity in the cache-oblivious model. We describe several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators.Using known mappings, our results lead to low cache complexities on shared-memory multiprocessors with a single level of private caches or a single shared cache. We generalize these mappings to multi-level cache hierarchies of private or shared caches, implying that our algorithms also have low cache complexities on such hierarchies. The key factor in obtaining these low parallel cache complexities is the low depth of the algorithms we propose.

Content maybe subject to copyright    Report

Citations
More filters
Proceedings ArticleDOI
23 Feb 2013
TL;DR: This paper presents a lightweight graph processing framework that is specific for shared-memory parallel/multicore machines, which makes graph traversal algorithms easy to write and significantly more efficient than previously reported results using graph frameworks on machines with many more cores.
Abstract: There has been significant recent interest in parallel frameworks for processing graphs due to their applicability in studying social networks, the Web graph, networks in biology, and unstructured meshes in scientific simulation. Due to the desire to process large graphs, these systems have emphasized the ability to run on distributed memory machines. Today, however, a single multicore server can support more than a terabyte of memory, which can fit graphs with tens or even hundreds of billions of edges. Furthermore, for graph algorithms, shared-memory multicores are generally significantly more efficient on a per core, per dollar, and per joule basis than distributed memory systems, and shared-memory algorithms tend to be simpler than their distributed counterparts.In this paper, we present a lightweight graph processing framework that is specific for shared-memory parallel/multicore machines, which makes graph traversal algorithms easy to write. The framework has two very simple routines, one for mapping over edges and one for mapping over vertices. Our routines can be applied to any subset of the vertices, which makes the framework useful for many graph traversal algorithms that operate on subsets of the vertices. Based on recent ideas used in a very fast algorithm for breadth-first search (BFS), our routines automatically adapt to the density of vertex sets. We implement several algorithms in this framework, including BFS, graph radii estimation, graph connectivity, betweenness centrality, PageRank and single-source shortest paths. Our algorithms expressed using this framework are very simple and concise, and perform almost as well as highly optimized code. Furthermore, they get good speedups on a 40-core machine and are significantly more efficient than previously reported results using graph frameworks on machines with many more cores.

816 citations

Proceedings ArticleDOI
13 Apr 2015
TL;DR: This paper describes the design and implementation of simple and fast multicore parallel algorithms for exact, as well as approximate, triangle counting and other triangle computations that scale to billions of nodes and edges, and is much faster than existing parallel approximate triangle counting implementations.
Abstract: Triangle counting and enumeration has emerged as a basic tool in large-scale network analysis, fueling the development of algorithms that scale to massive graphs. Most of the existing algorithms, however, are designed for the distributed-memory setting or the external-memory setting, and cannot take full advantage of a multicore machine, whose capacity has grown to accommodate even the largest of real-world graphs.

143 citations

Proceedings ArticleDOI
25 Feb 2012
TL;DR: The main contribution is to demonstrate that for this wide body of problems, there exist efficient internally deterministic algorithms, and moreover that these algorithms are natural to reason about and not complicated to code.
Abstract: The virtues of deterministic parallelism have been argued for decades and many forms of deterministic parallelism have been described and analyzed. Here we are concerned with one of the strongest forms, requiring that for any input there is a unique dependence graph representing a trace of the computation annotated with every operation and value. This has been referred to as internal determinism, and implies a sequential semantics---i.e., considering any sequential traversal of the dependence graph is sufficient for analyzing the correctness of the code. In addition to returning deterministic results, internal determinism has many advantages including ease of reasoning about the code, ease of verifying correctness, ease of debugging, ease of defining invariants, ease of defining good coverage for testing, and ease of formally, informally and experimentally reasoning about performance. On the other hand one needs to consider the possible downsides of determinism, which might include making algorithms (i) more complicated, unnatural or special purpose and/or (ii) slower or less scalable.In this paper we study the effectiveness of this strong form of determinism through a broad set of benchmark problems. Our main contribution is to demonstrate that for this wide body of problems, there exist efficient internally deterministic algorithms, and moreover that these algorithms are natural to reason about and not complicated to code. We leverage an approach to determinism suggested by Steele (1990), which is to use nested parallelism with commutative operations. Our algorithms apply several diverse programming paradigms that fit within the model including (i) a strict functional style (no shared state among concurrent operations), (ii) an approach we refer to as deterministic reservations, and (iii) the use of commutative, linearizable operations on data structures. We describe algorithms for the benchmark problems that use these deterministic approaches and present performance results on a 32-core machine. Perhaps surprisingly, for all problems, our internally deterministic algorithms achieve good speedup and good performance even relative to prior nondeterministic solutions.

141 citations


Cites methods from "Low depth cache-oblivious algorithm..."

  • ...Comparison Sort: We use a low-depth cache-efficient sample sort [9]....

    [...]

Journal ArticleDOI
09 Jun 2012
TL;DR: It is demonstrated that the otherwise uncontrolled growth of the Ninja gap can be contained and offer a more stable and predictable performance growth over future architectures, offering strong evidence that radical language changes are not required.
Abstract: Current processor trends of integrating more cores with wider SIMD units, along with a deeper and complex memory hierarchy, have made it increasingly more challenging to extract performance from applications. It is believed by some that traditional approaches to programming do not apply to these modern processors and hence radical new languages must be discovered. In this paper, we question this thinking and offer evidence in support of traditional programming methods and the performance-vs-programming effort effectiveness of common multi-core processors and upcoming many-core architectures in delivering significant speedup, and close-to-optimal performance for commonly used parallel computing workloads. We first quantify the extent of the "Ninja gap", which is the performance gap between naively written C/C++ code that is parallelism unaware (often serial) and best-optimized code on modern multi-/many-core processors. Using a set of representative throughput computing benchmarks, we show that there is an average Ninja gap of 24X (up to 53X) for a recent 6-core Intel® Core™ i7 X980 Westmere CPU, and that this gap if left unaddressed will inevitably increase. We show how a set of well-known algorithmic changes coupled with advancements in modern compiler technology can bring down the Ninja gap to an average of just 1.3X. These changes typically require low programming effort, as compared to the very high effort in producing Ninja code. We also discuss hardware support for programmability that can reduce the impact of these changes and even further increase programmer productivity. We show equally encouraging results for the upcoming Intel® Many Integrated Core architecture (Intel® MIC) which has more cores and wider SIMD. We thus demonstrate that we can contain the otherwise uncontrolled growth of the Ninja gap and offer a more stable and predictable performance growth over future architectures, offering strong evidence that radical language changes are not required.

87 citations


Cites methods from "Low depth cache-oblivious algorithm..."

  • ...There have been various techniques proposed to address these algorithmic changes, either using compiler assisted optimization [27], using cache-oblivious algorithms [6] or specialized languages like Sequoia [21]....

    [...]

Proceedings ArticleDOI
04 Jun 2011
TL;DR: The parallel cache-oblivious (PCO) model is presented, a relatively simple modification to the CO model that can be used to account for costs on a broad range of cache hierarchies, and a new scheduler is described, which attains provably good cache performance and runtime on parallel machine models with hierarchical caches.
Abstract: For nested-parallel computations with low depth (span, critical path length) analyzing the work, depth, and sequential cache complexity suffices to attain reasonably strong bounds on the parallel runtime and cache complexity on machine models with either shared or private caches. These bounds, however, do not extend to general hierarchical caches, due to limitations in (i) the cache-oblivious (CO) model used to analyze cache complexity and (ii) the schedulers used to map computation tasks to processors. This paper presents the parallel cache-oblivious (PCO) model, a relatively simple modification to the CO model that can be used to account for costs on a broad range of cache hierarchies. The first change is to avoid capturing artificial data sharing among parallel threads, and the second is to account for parallelism-memory imbalances within tasks. Despite the more restrictive nature of PCO compared to CO, many algorithms have the same asymptotic cache complexity bounds.The paper then describes a new scheduler for hierarchical caches, which extends recent work on "space-bounded schedulers" to allow for computations with arbitrary work imbalance among parallel subtasks. This scheduler attains provably good cache performance and runtime on parallel machine models with hierarchical caches, for nested-parallel computations analyzed using the PCO model. We show that under reasonable assumptions our scheduler is "work efficient" in the sense that the cost of the cache misses are evenly balanced across the processors---i.e., the runtime can be determined within a constant factor by taking the total cost of the cache misses analyzed for a computation and dividing it by the number of processors. In contrast, to further support our model, we show that no scheduler can achieve such bounds (optimizing for both cache misses and runtime) if work, depth, and sequential cache complexity are the only parameters used to analyze a computation.

84 citations


Cites background from "Low depth cache-oblivious algorithm..."

  • ...However, all future references .t into cache until reaching a supertask that does not .t in cache, at which point the Problem Span Cache Complexity Q * Scan (pre.x sums, etc.) O(log n) O(ln/Bl) Matrix Transpose (n × m matrix) [20] v v O(log(n + m))v O(lnm/Bl)v Matrix Multiplication ( n × n matrix) [20] v v O( n)v O(ln 1.5/Bl/ M + 1) v Matrix Inversion ( n × n matrix) O( n) O(ln 1.5/Bl/ M + 1) Quicksort [22] O(log2 n) O(ln/Bl(1 + logln/(M + 1)l)) Sample Sort [10] O(log2 n) O(ln/BlllogM+2 nl) Sparse-Matrix Vector Multiply [10] (m nonzeros, nE edge separators) O(log2 n) O(lm/B + n/(M + 1)1-El) Convex Hull (e.g., see [8]) O(log2 n) O(ln/BlllogM+2 nl) Barnes Hut tree (e.g., see [8]) O(log2 n) O(ln/Bl(1 + logln/(M + 1)l)) Table 1: Cache complexities of some algorithms analyzed in the PCO model....

    [...]

  • ...Unfortunately, current dynamic parallelism approaches have important limitations: they either apply to hierarchies of only private or only shared caches [1,9,10,16,21], require some strict balance criteria [7,15], or require a joint algorithm/scheduler analysis [7, 13–16]....

    [...]

  • ...Sample Sort [10] O(log(2) n) O(dn/BedlogM+2 ne) Sparse-Matrix Vector Multiply [10] O(log(2) n) O(dm/B + n/(M + 1)1− e) (m nonzeros, n edge separators)...

    [...]

  • ...A pair of common abstract measures for capturing parallel cache based locality are the number of misses given a sequential ordering of a parallel computation [1, 9, 10, 21], and the depth (span, critical path length) of the computation....

    [...]

References
More filters
Proceedings ArticleDOI
20 Sep 1993
TL;DR: The paper assesses the strengths and weaknesses of the PMH model as a generic model and recommends a parameterized generic model that captures the features of diverse computer architectures to facilitate the development of portable programs.
Abstract: A parameterized generic model that captures the features of diverse computer architectures would facilitate the development of portable programs. Specific models appropriate to particular computers are obtained by specifying parameters of the generic model. A generic model should be simple, and for each machine that it is intended to represent, it should have a reasonably accurate specific model. The Parallel Memory Hierarchy (PMH) model of computation uses a single mechanism to model the costs of both interprocessor communication and memory hierarchy traffic. A computer is modeled as a tree of memory modules with processors at the leaves. All data movement takes the form of block transfers between children and their parents. The paper assesses the strengths and weaknesses of the PMH model as a generic model. >

81 citations


"Low depth cache-oblivious algorithm..." refers background in this paper

  • ...Clearly this lower bound also applies to more general multi-level cache models such as studied in [3, 4, 30, 42, 46, 49]....

    [...]

  • ...It would be interesting to extend these results to more general multi-level models [3, 4, 18, 30, 42, 46, 49], while preserving the goal of supporting a simple model for algorithm design and analysis....

    [...]

Book ChapterDOI
08 Jul 2002
TL;DR: This work adapts the distribution sweeping model for divide-and-conquer algorithms to the cache oblivious model, and demonstrates by a series of algorithms the feasibility of the method in a cache oblivious setting.
Abstract: We adapt the distribution sweepingmetho d to the cache oblivious model. Distribution sweepingis the name used for a general approach for divide-and-conquer algorithms where the combination of solved subproblems can be viewed as a merging process of streams. We demonstrate by a series of algorithms for specific problems the feasibility of the method in a cache oblivious setting. The problems all come from computational geometry, and are: orthogonal line segment intersection reporting, the all nearest neighbors problem, the 3D maxima problem, computingthe measure of a set of axis-parallel rectangles, computingthe visibility of a set of line segments from a point, batched orthogonal range queries, and reportingpairwise intersections of axis-parallel rectangles. Our basic buildingblo ck is a simplified version of the cache oblivious sorting algorithm Funnelsort of Frigo et al., which is of independent interest.

81 citations

Journal ArticleDOI
Matteo Frigo1, Volker Strumpen1
TL;DR: It is shown that a multithreaded cache oblivious matrix multiplication incurs cache misses when executed by the Cilk work-stealing scheduler on a machine with P processors, each with a cache of size Z, with high probability.
Abstract: We present a technique for analyzing the number of cache misses incurred by multithreaded cache oblivious algorithms on an idealized parallel machine in which each processor has a private cache. We specialize this technique to computations executed by the Cilk work-stealing scheduler on a machine with dag-consistent shared memory. We show that a multithreaded cache oblivious matrix multiplication incurs cache misses when executed by the Cilk scheduler on a machine with P processors, each with a cache of size Z, with high probability. This bound is tighter than previously published bounds. We also present a new multithreaded cache oblivious algorithm for 1D stencil computations incurring cache misses with high probability, one for Gaussian elimination and back substitution, and one for the length computation part of the longest common subsequence problem incurring cache misses with high probability.

76 citations


"Low depth cache-oblivious algorithm..." refers background or methods in this paper

  • ...[39], the multi-level PMDH assumes a variant of the dag consistency cache consistency model [19] that uses an optimal replacement policy instead of LRU replacement....

    [...]

  • ...It seems that avoiding this would require a modified model for cache complexity, or taking into account particular properties of programs as studied in some previous work [39, 14, 29]....

    [...]

  • ...[39] for a class of algorithms with a regularity condition, by Blelloch et al....

    [...]

Proceedings ArticleDOI
19 Apr 2010
TL;DR: This work introduces a multicore-oblivious (MO) approach to algorithms and schedulers for HM, and presents efficient MO algorithms for several fundamental problems including matrix transposition, FFT, sorting, the Gaussian Elimination Paradigm, list ranking, and connected components.
Abstract: We address the design of algorithms for multicores that are oblivious to machine parameters. We propose HM, a multicore model consisting of a parallel shared-memory machine with hierarchical multi-level caching, and we introduce a multicore-oblivious (MO) approach to algorithms and schedulers for HM. An MO algorithm is specified with no mention of any machine parameters, such as the number of cores, number of cache levels, cache sizes and block lengths. However, it is equipped with a small set of instructions that can be used to provide hints to the run-time scheduler on how to schedule parallel tasks. We present efficient MO algorithms for several fundamental problems including matrix transposition, FFT, sorting, the Gaussian Elimination Paradigm, list ranking, and connected components. The notion of an MO algorithm is complementary to that of a network-oblivious (NO) algorithm, recently introduced by Bilardi et al. for parallel distributed-memory machines where processors communicate point-to-point. We show that several of our MO algorithms translate into efficient NO algorithms, adding to the body of known efficient NO algorithms.

70 citations


"Low depth cache-oblivious algorithm..." refers background in this paper

  • ...[31] has studied cache oblivious algorithms for a parallel model with a tree of caches....

    [...]

Journal ArticleDOI
TL;DR: If A is stored in column major layout, it is proved that SpMV has I/O complexity $\Theta(\min\{\frac{kN}{B}\max\{1,\log_{M/B}\frac{N}{M}\},kN\}})$ for k≤N/2.
Abstract: We study the problem of sparse-matrix dense-vector multiplication (SpMV) in external memory. The task of SpMV is to compute y:=Ax, where A is a sparse N×N matrix and x is a vector. We express sparsity by a parameter k, and for each choice of k consider the class of matrices where the number of nonzero entries is kN, i.e., where the average number of nonzero entries per column is k. We investigate what is the external worst-case complexity, i.e., the best possible upper bound on the number of I/Os, as a function of k, N and the parameters M (memory size) and B (track size) of the I/O-model. We determine this complexity up to a constant factor for all meaningful choices of these parameters, as long as k≤N 1−e , where e depends on the problem variant. Our model of computation for the lower bound is a combination of the I/O-models of Aggarwal and Vitter, and of Hong and Kung. We study variants of the problem, differing in the memory layout of A. If A is stored in column major layout, we prove that SpMV has I/O complexity $\Theta(\min\{\frac{kN}{B}\max\{1,\log_{M/B}\frac{N}{\max\{k,M\}}\},\,kN\})$for k≤N 1−e and any constant 0

60 citations