Proceedings ArticleDOI

Low depth cache-oblivious algorithms

TL;DR: This paper describes several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators.
Abstract: In this paper we explore a simple and general approach for developing parallel algorithms that lead to good cache complexity on parallel machines with private or shared caches. The approach is to design nested-parallel algorithms that have low depth (span, critical path length) and for which the natural sequential evaluation order has low cache complexity in the cache-oblivious model. We describe several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators. Using known mappings, our results lead to low cache complexities on shared-memory multiprocessors with a single level of private caches or a single shared cache. We generalize these mappings to multi-level cache hierarchies of private or shared caches, implying that our algorithms also have low cache complexities on such hierarchies. The key factor in obtaining these low parallel cache complexities is the low depth of the algorithms we propose.
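One concrete instance of this recipe is the classic cache-oblivious matrix transpose: recursing on the larger dimension gives O(log(n + m)) depth when the two recursive calls are forked, while the natural sequential order of the same code has cache complexity O(⌈nm/B⌉) (both bounds appear in the PCO table quoted later on this page). Below is a minimal C++ sketch of mine, not the paper's code; the tile size and the use of std::async for fork-join are assumptions.

```cpp
// A nested-parallel, divide-and-conquer matrix transpose whose natural
// sequential order is cache-oblivious and whose depth is O(log(n + m)).
// Fork-join parallelism is expressed with std::async; a real
// implementation would use a work-stealing runtime and cut off parallel
// recursion near the base case.
#include <cstddef>
#include <future>

// Transpose the n x m submatrix of A (row stride lda) into the m x n
// submatrix of B (row stride ldb), recursively splitting the larger side.
void transpose(const double* A, double* B,
               std::size_t n, std::size_t m,
               std::size_t lda, std::size_t ldb) {
    const std::size_t BASE = 32;           // base-case tile; tuned, not cache-aware
    if (n <= BASE && m <= BASE) {
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < m; ++j)
                B[j * ldb + i] = A[i * lda + j];
        return;
    }
    if (n >= m) {                          // split rows; halves are independent
        auto top = std::async(std::launch::async, transpose,
                              A, B, n / 2, m, lda, ldb);
        transpose(A + (n / 2) * lda, B + n / 2, n - n / 2, m, lda, ldb);
        top.get();
    } else {                               // split columns
        auto left = std::async(std::launch::async, transpose,
                               A, B, n, m / 2, lda, ldb);
        transpose(A + m / 2, B + (m / 2) * ldb, n, m - m / 2, lda, ldb);
        left.get();
    }
}
```

Forking on the larger dimension keeps the recursion balanced, and because each level of recursion adds only constant depth, schedulers can map the computation onto private or shared caches with the bounds the abstract cites.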


Citations
Proceedings ArticleDOI
23 Feb 2013
TL;DR: This paper presents a lightweight graph processing framework specific to shared-memory parallel/multicore machines, which makes graph traversal algorithms easy to write and yields implementations significantly more efficient than previously reported results using graph frameworks on machines with many more cores.
Abstract: There has been significant recent interest in parallel frameworks for processing graphs due to their applicability in studying social networks, the Web graph, networks in biology, and unstructured meshes in scientific simulation. Due to the desire to process large graphs, these systems have emphasized the ability to run on distributed memory machines. Today, however, a single multicore server can support more than a terabyte of memory, which can fit graphs with tens or even hundreds of billions of edges. Furthermore, for graph algorithms, shared-memory multicores are generally significantly more efficient on a per core, per dollar, and per joule basis than distributed memory systems, and shared-memory algorithms tend to be simpler than their distributed counterparts. In this paper, we present a lightweight graph processing framework that is specific for shared-memory parallel/multicore machines, which makes graph traversal algorithms easy to write. The framework has two very simple routines, one for mapping over edges and one for mapping over vertices. Our routines can be applied to any subset of the vertices, which makes the framework useful for many graph traversal algorithms that operate on subsets of the vertices. Based on recent ideas used in a very fast algorithm for breadth-first search (BFS), our routines automatically adapt to the density of vertex sets. We implement several algorithms in this framework, including BFS, graph radii estimation, graph connectivity, betweenness centrality, PageRank and single-source shortest paths. Our algorithms expressed using this framework are very simple and concise, and perform almost as well as highly optimized code. Furthermore, they get good speedups on a 40-core machine and are significantly more efficient than previously reported results using graph frameworks on machines with many more cores.
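A rough C++ rendering of that two-routine interface is below. The edgeMap/vertexMap names follow the published framework, but the CSR layout, sequential loops, and BFS driver are my own simplifications; the real framework parallelizes both routines and switches between sparse and dense frontier representations based on density.

```cpp
// Sketch of an edgeMap/vertexMap-style interface and BFS written on top
// of it. Sequential for clarity; comments note where the parallel
// version differs.
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

struct Graph {                          // CSR adjacency
    std::vector<std::size_t> offsets;   // offsets[v]..offsets[v+1): edges of v
    std::vector<std::uint32_t> targets;
    std::size_t numVertices() const { return offsets.size() - 1; }
};

using VertexSet = std::vector<std::uint32_t>;

// Map over edges out of the frontier; targets for which f returns true
// form the output set.
VertexSet edgeMap(const Graph& g, const VertexSet& frontier,
                  const std::function<bool(std::uint32_t, std::uint32_t)>& f) {
    VertexSet next;
    for (std::uint32_t u : frontier)
        for (std::size_t e = g.offsets[u]; e < g.offsets[u + 1]; ++e)
            if (f(u, g.targets[e])) next.push_back(g.targets[e]);
    return next;
}

// Map a function over every vertex in a set.
void vertexMap(const VertexSet& s,
               const std::function<void(std::uint32_t)>& f) {
    for (std::uint32_t v : s) f(v);
}

// BFS expressed with the two routines: parents[v] doubles as "visited".
std::vector<std::int64_t> bfs(const Graph& g, std::uint32_t root) {
    std::vector<std::int64_t> parents(g.numVertices(), -1);
    parents[root] = root;
    VertexSet frontier{root};
    while (!frontier.empty())
        frontier = edgeMap(g, frontier, [&](std::uint32_t u, std::uint32_t v) {
            if (parents[v] != -1) return false;  // already visited
            parents[v] = u;                      // claim v (an atomic CAS in parallel)
            return true;                         // v joins the next frontier
        });
    return parents;
}
```

The lambda passed to edgeMap carries all of the BFS-specific logic, which is what makes algorithms written against this interface so concise.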

816 citations

Proceedings ArticleDOI
13 Apr 2015
TL;DR: This paper describes the design and implementation of simple and fast multicore parallel algorithms for exact, as well as approximate, triangle counting and other triangle computations that scale to billions of nodes and edges; the approximate version is much faster than existing parallel approximate triangle counting implementations.
Abstract: Triangle counting and enumeration has emerged as a basic tool in large-scale network analysis, fueling the development of algorithms that scale to massive graphs. Most of the existing algorithms, however, are designed for the distributed-memory setting or the external-memory setting, and cannot take full advantage of a multicore machine, whose capacity has grown to accommodate even the largest of real-world graphs.
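For context, the kernel most exact triangle counters share is sketched below (a textbook formulation of mine, not the paper's algorithm): orient each edge from the lower-ranked to the higher-ranked endpoint, then count, for each directed edge (u, v), the intersection of the two sorted adjacency lists.

```cpp
// Intersection-based exact triangle counting. With each adjacency list
// sorted and containing only higher-ranked neighbors, every triangle is
// counted exactly once.
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

// adj[u] must be sorted and contain only neighbors ranked above u.
std::uint64_t countTriangles(const std::vector<std::vector<std::uint32_t>>& adj) {
    std::uint64_t total = 0;
    for (std::size_t u = 0; u < adj.size(); ++u)     // parallelizable loop:
        for (std::uint32_t v : adj[u]) {             // one reduction per vertex
            // |adj[u] ∩ adj[v]| = number of triangles closing edge (u, v)
            std::vector<std::uint32_t> common;
            std::set_intersection(adj[u].begin(), adj[u].end(),
                                  adj[v].begin(), adj[v].end(),
                                  std::back_inserter(common));
            total += common.size();
        }
    return total;
}
```

Ranking vertices by degree keeps every directed adjacency list short, which is what bounds the total intersection work on skewed real-world graphs.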

143 citations

Proceedings ArticleDOI
25 Feb 2012
TL;DR: The main contribution is to demonstrate that for this wide body of problems, there exist efficient internally deterministic algorithms, and moreover that these algorithms are natural to reason about and not complicated to code.
Abstract: The virtues of deterministic parallelism have been argued for decades and many forms of deterministic parallelism have been described and analyzed. Here we are concerned with one of the strongest forms, requiring that for any input there is a unique dependence graph representing a trace of the computation annotated with every operation and value. This has been referred to as internal determinism, and implies a sequential semantics---i.e., considering any sequential traversal of the dependence graph is sufficient for analyzing the correctness of the code. In addition to returning deterministic results, internal determinism has many advantages including ease of reasoning about the code, ease of verifying correctness, ease of debugging, ease of defining invariants, ease of defining good coverage for testing, and ease of formally, informally and experimentally reasoning about performance. On the other hand one needs to consider the possible downsides of determinism, which might include making algorithms (i) more complicated, unnatural or special purpose and/or (ii) slower or less scalable. In this paper we study the effectiveness of this strong form of determinism through a broad set of benchmark problems. Our main contribution is to demonstrate that for this wide body of problems, there exist efficient internally deterministic algorithms, and moreover that these algorithms are natural to reason about and not complicated to code. We leverage an approach to determinism suggested by Steele (1990), which is to use nested parallelism with commutative operations. Our algorithms apply several diverse programming paradigms that fit within the model including (i) a strict functional style (no shared state among concurrent operations), (ii) an approach we refer to as deterministic reservations, and (iii) the use of commutative, linearizable operations on data structures. We describe algorithms for the benchmark problems that use these deterministic approaches and present performance results on a 32-core machine. Perhaps surprisingly, for all problems, our internally deterministic algorithms achieve good speedup and good performance even relative to prior nondeterministic solutions.
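Of the three paradigms listed, deterministic reservations is the least standard, so a sketch helps. The C++ below is my own simplification (rounds are simulated sequentially and run over all remaining vertices rather than a prefix, as the real technique does): applied to greedy maximal independent set, each remaining vertex reserves itself and its neighbors with a priority write (minimum index wins), and commits only if it won every reservation.

```cpp
// Deterministic-reservations-style greedy maximal independent set.
// Committing all "locally minimal" vertices each round reproduces the
// result of the sequential greedy-by-index algorithm, so the output is
// deterministic regardless of scheduling.
#include <algorithm>
#include <cstdint>
#include <vector>

std::vector<bool> greedyMIS(const std::vector<std::vector<std::uint32_t>>& adj) {
    const auto n = static_cast<std::uint32_t>(adj.size());
    const std::uint32_t NONE = UINT32_MAX;
    std::vector<bool> inMIS(n, false), live(n, true);
    std::vector<std::uint32_t> reserve(n);
    bool changed = true;
    while (changed) {
        changed = false;
        std::fill(reserve.begin(), reserve.end(), NONE);
        // Reserve phase (a parallel-for with a priority write in the real thing).
        for (std::uint32_t v = 0; v < n; ++v)
            if (live[v]) {
                reserve[v] = std::min(reserve[v], v);
                for (std::uint32_t u : adj[v])
                    if (live[u]) reserve[u] = std::min(reserve[u], v);
            }
        // Commit phase: v enters the MIS iff it won all its reservations.
        for (std::uint32_t v = 0; v < n; ++v)
            if (live[v] && reserve[v] == v &&
                std::all_of(adj[v].begin(), adj[v].end(),
                            [&](std::uint32_t u) { return !live[u] || reserve[u] == v; })) {
                inMIS[v] = true;
                changed = true;
            }
        // Remove committed vertices and their neighbors.
        for (std::uint32_t v = 0; v < n; ++v)
            if (inMIS[v] && live[v]) {
                live[v] = false;
                for (std::uint32_t u : adj[v]) live[u] = false;
            }
    }
    return inMIS;
}
```

Because the commit test depends only on indices, every execution produces the same independent set as the sequential greedy order, which is exactly the internal-determinism property the paper argues for.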

141 citations


Cites methods from "Low depth cache-oblivious algorithm..."

  • ...Comparison Sort: We use a low-depth cache-efficient sample sort [9]....


Journal ArticleDOI
09 Jun 2012
TL;DR: It is demonstrated that the otherwise uncontrolled growth of the Ninja gap can be contained, yielding more stable and predictable performance growth across future architectures and offering strong evidence that radical language changes are not required.
Abstract: Current processor trends of integrating more cores with wider SIMD units, along with a deeper and more complex memory hierarchy, have made it increasingly challenging to extract performance from applications. It is believed by some that traditional approaches to programming do not apply to these modern processors and hence radical new languages must be discovered. In this paper, we question this thinking and offer evidence in support of traditional programming methods and the performance-versus-programming-effort effectiveness of common multi-core processors and upcoming many-core architectures in delivering significant speedups and close-to-optimal performance for commonly used parallel computing workloads. We first quantify the extent of the "Ninja gap", which is the performance gap between naively written C/C++ code that is parallelism unaware (often serial) and best-optimized code on modern multi-/many-core processors. Using a set of representative throughput computing benchmarks, we show that there is an average Ninja gap of 24X (up to 53X) for a recent 6-core Intel® Core™ i7 X980 Westmere CPU, and that this gap, if left unaddressed, will inevitably increase. We show how a set of well-known algorithmic changes coupled with advancements in modern compiler technology can bring down the Ninja gap to an average of just 1.3X. These changes typically require low programming effort, as compared to the very high effort in producing Ninja code. We also discuss hardware support for programmability that can reduce the impact of these changes and even further increase programmer productivity. We show equally encouraging results for the upcoming Intel® Many Integrated Core architecture (Intel® MIC), which has more cores and wider SIMD. We thus demonstrate that we can contain the otherwise uncontrolled growth of the Ninja gap and offer more stable and predictable performance growth across future architectures, offering strong evidence that radical language changes are not required.
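One concrete instance of the kind of well-known, low-effort change the abstract refers to (an illustration of mine, not an example taken from the paper) is switching from an array-of-structures to a structure-of-arrays layout, so that hot inner loops become unit-stride and auto-vectorizable. The `omp simd` hint is standard OpenMP.

```cpp
// AoS vs. SoA: the same reduction, before and after a layout change
// that lets the compiler's vectorizer and the cache hierarchy do their
// jobs.
#include <cstddef>
#include <vector>

struct ParticleAoS { float x, y, z, w; };       // AoS: stride-4 access per field

float sumX_aos(const std::vector<ParticleAoS>& p) {
    float s = 0.f;
    for (std::size_t i = 0; i < p.size(); ++i) s += p[i].x;   // strided loads
    return s;
}

struct ParticlesSoA {                            // SoA: each field contiguous
    std::vector<float> x, y, z, w;
};

float sumX_soa(const ParticlesSoA& p) {
    float s = 0.f;
    #pragma omp simd reduction(+ : s)
    for (std::size_t i = 0; i < p.x.size(); ++i) s += p.x[i]; // unit stride
    return s;
}
```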

87 citations


Cites methods from "Low depth cache-oblivious algorithm..."

  • ...There have been various techniques proposed to address these algorithmic changes, either using compiler assisted optimization [27], using cache-oblivious algorithms [6] or specialized languages like Sequoia [21]....


Proceedings ArticleDOI
04 Jun 2011
TL;DR: The parallel cache-oblivious (PCO) model is presented, a relatively simple modification to the CO model that can be used to account for costs on a broad range of cache hierarchies, along with a new scheduler that attains provably good cache performance and runtime on parallel machine models with hierarchical caches.
Abstract: For nested-parallel computations with low depth (span, critical path length) analyzing the work, depth, and sequential cache complexity suffices to attain reasonably strong bounds on the parallel runtime and cache complexity on machine models with either shared or private caches. These bounds, however, do not extend to general hierarchical caches, due to limitations in (i) the cache-oblivious (CO) model used to analyze cache complexity and (ii) the schedulers used to map computation tasks to processors. This paper presents the parallel cache-oblivious (PCO) model, a relatively simple modification to the CO model that can be used to account for costs on a broad range of cache hierarchies. The first change is to avoid capturing artificial data sharing among parallel threads, and the second is to account for parallelism-memory imbalances within tasks. Despite the more restrictive nature of PCO compared to CO, many algorithms have the same asymptotic cache complexity bounds. The paper then describes a new scheduler for hierarchical caches, which extends recent work on "space-bounded schedulers" to allow for computations with arbitrary work imbalance among parallel subtasks. This scheduler attains provably good cache performance and runtime on parallel machine models with hierarchical caches, for nested-parallel computations analyzed using the PCO model. We show that under reasonable assumptions our scheduler is "work efficient" in the sense that the cost of the cache misses is evenly balanced across the processors---i.e., the runtime can be determined within a constant factor by taking the total cost of the cache misses analyzed for a computation and dividing it by the number of processors. In contrast, to further support our model, we show that no scheduler can achieve such bounds (optimizing for both cache misses and runtime) if work, depth, and sequential cache complexity are the only parameters used to analyze a computation.
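The "work efficient" claim at the end of the abstract has a compact formulaic reading; the notation below is mine, a hedged restatement rather than the paper's formal theorem.

```latex
% Let Q_i be the number of cache misses the analysis charges at level i
% of the hierarchy and C_i the cost of a miss at that level; with P
% processors the scheduler's runtime satisfies, up to a constant factor,
\[
  T_P = O\!\left(\frac{\sum_i Q_i \, C_i}{P}\right),
\]
% i.e., the total cost of the cache misses is evenly balanced across
% the processors.
```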

84 citations


Cites background from "Low depth cache-oblivious algorithm..."

  • ...However, all future references fit into cache until reaching a supertask that does not fit in cache, at which point the...

    Problem                                                            | Span          | Cache Complexity Q*
    Scan (prefix sums, etc.)                                           | O(log n)      | O(⌈n/B⌉)
    Matrix Transpose (n × m matrix) [20]                               | O(log(n + m)) | O(⌈nm/B⌉)
    Matrix Multiplication (√n × √n matrix) [20]                        | O(√n)         | O(⌈n^1.5/B⌉/√(M + 1) + 1)
    Matrix Inversion (√n × √n matrix)                                  | O(√n)         | O(⌈n^1.5/B⌉/√(M + 1) + 1)
    Quicksort [22]                                                     | O(log² n)     | O(⌈n/B⌉(1 + log⌈n/(M + 1)⌉))
    Sample Sort [10]                                                   | O(log² n)     | O(⌈n/B⌉⌈log_{M+2} n⌉)
    Sparse-Matrix Vector Multiply [10] (m nonzeros, n^ε edge separators) | O(log² n)   | O(⌈m/B + n/(M + 1)^{1−ε}⌉)
    Convex Hull (e.g., see [8])                                        | O(log² n)     | O(⌈n/B⌉⌈log_{M+2} n⌉)
    Barnes-Hut tree (e.g., see [8])                                    | O(log² n)     | O(⌈n/B⌉(1 + log⌈n/(M + 1)⌉))

    Table 1: Cache complexities of some algorithms analyzed in the PCO model....


  • ...Unfortunately, current dynamic parallelism approaches have important limitations: they either apply to hierarchies of only private or only shared caches [1,9,10,16,21], require some strict balance criteria [7,15], or require a joint algorithm/scheduler analysis [7, 13–16]....


  • ...Sample Sort [10]: span O(log² n), cache complexity O(⌈n/B⌉⌈log_{M+2} n⌉); Sparse-Matrix Vector Multiply [10] (m nonzeros, n^ε edge separators): span O(log² n), cache complexity O(⌈m/B + n/(M + 1)^{1−ε}⌉)...


  • ...A pair of common abstract measures for capturing parallel cache based locality are the number of misses given a sequential ordering of a parallel computation [1, 9, 10, 21], and the depth (span, critical path length) of the computation....


References
Proceedings ArticleDOI
01 Nov 1986
TL;DR: A new deterministic coin tossing technique is introduced that provides for fast and efficient breaking of a symmetric situation in parallel.
Abstract: Several results concerning parallel algorithms are improved. A partial list of the new results includes: for ranking a linked list of length n, O(log n log* n) time using an optimal number of processors; for selecting the m-th smallest out of n elements, O(log n log* n) time using an optimal number of processors; for graph connectivity, O(log n log^(2) n log^(3) n) time using (m + n)α(m, n)/(log n log^(2) n log^(3) n) processors; and for finding a minimum spanning forest in a graph, O(log n log^(2) n log^(3) n) time using (m + n)/(log n log^(2) n) processors, where n is the number of vertices and m is the number of edges. All the new algorithms are deterministic. These results provide an optimal deterministic parallel algorithm for list ranking that achieves poly-log time. Also, they provide an optimal algorithm for connectivity which runs in almost logarithmic time when m ≥ n log* n. This optimal algorithm achieves logarithmic time when m = n^(1+ε), where 0 < ε ≤ 1. Our results are also strong enough to refute a known conjecture regarding a limit on the possible performance of any parallel algorithm for the list ranking problem. This paper introduces a new deterministic coin tossing technique that provides for fast and efficient breaking of a symmetric situation in parallel. Previously, it was known how to break such symmetries only by means of randomization. Interestingly, the structure of all the algorithms in this paper follows a paradigm which we call accelerating cascades: given several alternative parallel algorithms for the same problem, this paradigm constructs a new algorithm for the same problem out of these algorithms; the performance of the new algorithm compares favourably with that of any of its building blocks.
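The coin-tossing technique itself is mechanical enough to sketch. Below is one round in C++, a textbook rendering under my own naming rather than the paper's code: each node compares its label with its successor's, finds the lowest differing bit position i, and takes 2i plus its own bit i as its new label, which shrinks a label range of size L to O(log L) values while keeping adjacent labels distinct.

```cpp
// One round of deterministic coin tossing on a linked list.
// next[v] = successor of v, or v itself at the tail.
// labels[] must assign distinct values to adjacent nodes (initially, ids).
#include <bit>
#include <cstdint>
#include <vector>

void coinTossRound(const std::vector<std::uint32_t>& next,
                   std::vector<std::uint32_t>& labels) {
    std::vector<std::uint32_t> fresh(labels.size());
    for (std::uint32_t v = 0; v < labels.size(); ++v) {  // a parallel-for in spirit
        std::uint32_t w = next[v];
        if (w == v) {           // tail: kept as-is here; a full implementation
            fresh[v] = labels[v];  // recolors it if it collides with its predecessor
            continue;
        }
        std::uint32_t diff = labels[v] ^ labels[w];
        auto i = static_cast<std::uint32_t>(std::countr_zero(diff));
        fresh[v] = 2 * i + ((labels[v] >> i) & 1u);      // lowest differing bit + own bit
    }
    labels.swap(fresh);
}
```

Repeating the round O(log* n) times drives the label range down to a constant, after which, for example, local minima yield the ruling and independent sets that the citing paper builds its list-ranking step on.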

203 citations


"Low depth cache-oblivious algorithm..." refers methods in this paper

  • ...Vishkin’s deterministic coin tossing [32] to find an O(log log n)-ruling set and then convert the ruling set to an independent set of size at least n/3 in O(log log n) rounds....


Journal ArticleDOI
TL;DR: A locality-guided work-stealing algorithm that improves the data locality of multithreaded computations by allowing a thread to have an affinity for a processor, improving the performance of work stealing by up to 80%.
Abstract: This paper studies the data locality of the work-stealing scheduling algorithm on hardware-controlled shared-memory machines, where movement of data to and from the cache is solely controlled by the hardware. We present lower and upper bounds on the number of cache misses when using work stealing, and introduce a locality-guided work-stealing algorithm along with its experimental validation. As a lower bound, we show that a work-stealing application that exhibits good data locality on a uniprocessor may exhibit poor data locality on a multiprocessor. In particular, we show a family of multithreaded computations G_n whose members perform Θ(n) operations (work) and incur a constant number of cache misses on a uniprocessor, while even on two processors the total number of cache misses soars to Ω(n). On the other hand, we show a tight upper bound on the number of cache misses that nested-parallel computations, a large and important class of computations, incur due to multiprocessing. In particular, for nested-parallel computations, we show that on P processors a multiprocessor execution incurs an expected O(C⌈m/s⌉PT∞) more misses than the uniprocessor execution. Here m is the execution time of an instruction incurring a cache miss, s is the steal time, C is the size of the cache, and T∞ is the number of nodes on the longest chain of dependencies. Based on this we give strong execution-time bounds for nested-parallel computations using work stealing. For the second part of our results, we present a locality-guided work-stealing algorithm that improves the data locality of multithreaded computations by allowing a thread to have an affinity for a processor. Our initial experiments on iterative data-parallel applications show that the algorithm matches the performance of static partitioning under traditional workloads but improves performance by up to 50% over static partitioning under multiprogrammed workloads. Furthermore, locality-guided work stealing improves the performance of work stealing by up to 80%.

185 citations

Journal ArticleDOI
TL;DR: A procedure is proposed which is a generalization of minimal storage tree sorting and which has the following three properties: there is a significant improvement in the expected number of comparisons required to sort the input sequence, the procedure is statistically insensitive to bias in the input sequence, and the expected number of comparisons approaches the information-theoretic lower bound on the number of comparisons required.
Abstract: The methods currently in use and previously proposed for the choice of a root in minimal storage tree sorting are in reality methods for making inefficient statistical estimates of the median of the sequence to be sorted. By making efficient use of the information in a random sample chosen during input of the sequence to be sorted, significant improvements over ordinary minimal storage tree sorting can be made. A procedure is proposed which is a generalization of minimal storage tree sorting and which has the following three properties: (a) There is a significant improvement (over ordinary minimal storage tree sorting) in the expected number of comparisons required to sort the input sequence. (b) The procedure is statistically insensitive to bias in the input sequence. (c) The expected number of comparisons required by the procedure approaches (slowly) the information-theoretic lower bound on the number of comparisons required. The procedure is, therefore, "asymptotically optimal."
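The core idea, estimating the median from a random sample rather than from a single element, transfers directly to pivot selection in any partition-based sort. A small C++ sketch under my own parameter choices (sample size s, std::mt19937):

```cpp
// Pick a partition pivot as the median of a random sample of s elements,
// a sharper estimate of the true median than a single random element.
#include <algorithm>
#include <random>
#include <vector>

// Precondition: a is non-empty and s >= 1.
double samplePivot(const std::vector<double>& a, std::size_t s,
                   std::mt19937& rng) {
    std::vector<double> sample(s);
    std::uniform_int_distribution<std::size_t> pick(0, a.size() - 1);
    for (auto& x : sample) x = a[pick(rng)];
    std::nth_element(sample.begin(), sample.begin() + s / 2, sample.end());
    return sample[s / 2];   // the sample median
}
```

Growing the sample with the input size is what pushes the expected comparison count toward the information-theoretic bound the abstract mentions.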

179 citations


"Low depth cache-oblivious algorithm..." refers methods in this paper

  • ...Our parallel sorting algorithm is based on a version of sample sort [37, 45], and has optimal cache complexity....


Journal ArticleDOI
TL;DR: The main result is an optimal randomized parallel algorithm for INTEGER_SORT, the first known that is optimal: the product of its time and processor bounds is upper bounded by a linear function of the input size.
Abstract: This paper assumes a parallel RAM (random access machine) model which allows both concurrent reads and concurrent writes of a global memory. The main result is an optimal randomized parallel algorithm for INTEGER_SORT (i.e., for sorting n integers in the range [1, n]). This algorithm costs only logarithmic time and is the first known that is optimal: the product of its time and processor bounds is upper bounded by a linear function of the input size. Also given is a deterministic sublogarithmic time algorithm for prefix sum. In addition this paper presents a sublogarithmic time algorithm for obtaining a random permutation of n elements in parallel. And finally, sublogarithmic time algorithms for GENERAL_SORT and INTEGER_SORT are presented. Our sub-logarithmic GENERAL_SORT algorithm is also optimal.
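For intuition, the sequential skeleton of INTEGER_SORT is counting plus a prefix sum; the C++ sketch below is mine (the paper's contribution is doing this work-optimally in parallel), and it shows the role the prefix-sum primitive plays.

```cpp
// Counting sort for n keys in [1, n]: histogram, exclusive prefix sum,
// then stable placement. The prefix sum is the step a parallel version
// replaces with a parallel prefix-sum primitive.
#include <cstddef>
#include <vector>

std::vector<std::size_t> integerSort(const std::vector<std::size_t>& keys) {
    const std::size_t n = keys.size();
    std::vector<std::size_t> count(n + 2, 0), out(n);
    for (std::size_t k : keys) ++count[k + 1];       // histogram (keys in [1, n])
    for (std::size_t i = 1; i <= n + 1; ++i)         // exclusive prefix sum:
        count[i] += count[i - 1];                    // count[k] = first slot for k
    for (std::size_t k : keys) out[count[k]++] = k;  // stable placement
    return out;
}
```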

175 citations


"Low depth cache-oblivious algorithm..." refers methods in this paper

  • ...Our parallel sorting algorithm is based on a version of sample sort [37, 45], and has optimal cache complexity....


Journal ArticleDOI
TL;DR: In this paper, the authors introduced the Uniform Memory Hierarchy (UMH) model, which captures performance-relevant aspects of the hierarchical nature of computer memory and is used to quantify architectural requirements of several algorithms and to ratify the faster speeds achieved by tuned implementations that use improved data-movement strategies.
Abstract: The Uniform Memory Hierarchy (UMH) model introduced in this paper captures performance-relevant aspects of the hierarchical nature of computer memory. It is used to quantify architectural requirements of several algorithms and to ratify the faster speeds achieved by tuned implementations that use improved data-movement strategies.

175 citations