Proceedings ArticleDOI

Low depth cache-oblivious algorithms

TL;DR: This paper describes several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators.
Abstract: In this paper we explore a simple and general approach for developing parallel algorithms that lead to good cache complexity on parallel machines with private or shared caches. The approach is to design nested-parallel algorithms that have low depth (span, critical path length) and for which the natural sequential evaluation order has low cache complexity in the cache-oblivious model. We describe several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators. Using known mappings, our results lead to low cache complexities on shared-memory multiprocessors with a single level of private caches or a single shared cache. We generalize these mappings to multi-level cache hierarchies of private or shared caches, implying that our algorithms also have low cache complexities on such hierarchies. The key factor in obtaining these low parallel cache complexities is the low depth of the algorithms we propose.
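The approach the abstract describes — nested-parallel divide-and-conquer whose natural sequential order is cache-friendly at every level — is easy to see in the classic recursive matrix transpose. The sketch below is a generic Python illustration, not code from the paper; the recursion is written serially, with comments marking where a nested-parallel runtime would fork:

```python
def transpose(A, B, r0, r1, c0, c1):
    """Recursively write the transpose of A[r0:r1][c0:c1] into B,
    where A is n x m and B is m x n (lists of lists)."""
    rows, cols = r1 - r0, c1 - c0
    if rows == 1 and cols == 1:
        B[c0][r0] = A[r0][c0]           # single element: copy it
    elif rows >= cols:
        mid = r0 + rows // 2
        # The two recursive calls touch disjoint data and are
        # independent, so a nested-parallel runtime forks them.
        transpose(A, B, r0, mid, c0, c1)
        transpose(A, B, mid, r1, c0, c1)
    else:
        mid = c0 + cols // 2
        transpose(A, B, r0, r1, c0, mid)
        transpose(A, B, r0, r1, mid, c1)
```

Because the recursion always splits the larger dimension, subproblems eventually fit in any level of cache without the code knowing the cache parameters, and the two halves at each level are independent, giving polylogarithmic depth.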


Citations
Proceedings ArticleDOI
23 Feb 2013
TL;DR: This paper presents a lightweight graph processing framework that is specific for shared-memory parallel/multicore machines, which makes graph traversal algorithms easy to write and significantly more efficient than previously reported results using graph frameworks on machines with many more cores.
Abstract: There has been significant recent interest in parallel frameworks for processing graphs due to their applicability in studying social networks, the Web graph, networks in biology, and unstructured meshes in scientific simulation. Due to the desire to process large graphs, these systems have emphasized the ability to run on distributed memory machines. Today, however, a single multicore server can support more than a terabyte of memory, which can fit graphs with tens or even hundreds of billions of edges. Furthermore, for graph algorithms, shared-memory multicores are generally significantly more efficient on a per core, per dollar, and per joule basis than distributed memory systems, and shared-memory algorithms tend to be simpler than their distributed counterparts. In this paper, we present a lightweight graph processing framework that is specific for shared-memory parallel/multicore machines, which makes graph traversal algorithms easy to write. The framework has two very simple routines, one for mapping over edges and one for mapping over vertices. Our routines can be applied to any subset of the vertices, which makes the framework useful for many graph traversal algorithms that operate on subsets of the vertices. Based on recent ideas used in a very fast algorithm for breadth-first search (BFS), our routines automatically adapt to the density of vertex sets. We implement several algorithms in this framework, including BFS, graph radii estimation, graph connectivity, betweenness centrality, PageRank and single-source shortest paths. Our algorithms expressed using this framework are very simple and concise, and perform almost as well as highly optimized code. Furthermore, they get good speedups on a 40-core machine and are significantly more efficient than previously reported results using graph frameworks on machines with many more cores.
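As a rough illustration of the two-routine interface the abstract describes (one routine mapping over the edges out of a vertex subset, one mapping over vertices), here is a hypothetical Python sketch of BFS built on an edge-map primitive. The names edge_map, update, and cond are illustrative, not the framework's actual API:

```python
def edge_map(graph, frontier, update, cond):
    """Apply update over edges (u, v) with u in the frontier; return the
    set of targets v for which cond(v) held and update(u, v) succeeded."""
    next_frontier = set()
    for u in frontier:          # parallel over frontier vertices
        for v in graph[u]:      # parallel over out-edges
            if cond(v) and update(u, v):
                next_frontier.add(v)
    return next_frontier

def bfs(graph, src):
    """BFS parent map via repeated edge_map calls over the frontier."""
    parent = {src: src}
    frontier = {src}
    while frontier:
        def update(u, v):
            # First writer wins; in a real parallel framework this
            # would be a compare-and-swap.
            if v not in parent:
                parent[v] = u
                return True
            return False
        frontier = edge_map(graph, frontier, update,
                            lambda v: v not in parent)
    return parent
```

The density-adaptive switch the abstract mentions (between sparse push and dense pull traversal of the frontier) would live inside edge_map, hidden from the algorithm code.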

816 citations

Proceedings ArticleDOI
13 Apr 2015
TL;DR: This paper describes the design and implementation of simple and fast multicore parallel algorithms for exact, as well as approximate, triangle counting and other triangle computations that scale to billions of nodes and edges and are much faster than existing parallel approximate triangle counting implementations.
Abstract: Triangle counting and enumeration has emerged as a basic tool in large-scale network analysis, fueling the development of algorithms that scale to massive graphs. Most of the existing algorithms, however, are designed for the distributed-memory setting or the external-memory setting, and cannot take full advantage of a multicore machine, whose capacity has grown to accommodate even the largest of real-world graphs.
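Exact triangle counting has a compact sequential core: for each edge (u, v), count the common neighbors of u and v, using an ordering so each triangle is counted once. The following is a generic serial sketch, not the paper's parallel algorithm, assuming integer vertex ids:

```python
def triangle_count(adj):
    """Count triangles in an undirected graph given as a dict mapping
    each vertex to a set of its neighbors (integer vertex ids)."""
    count = 0
    for u in adj:                       # parallelizable over vertices
        for v in adj[u]:
            if v > u:                   # direct each edge low -> high id
                # Common neighbors w with w > v close a triangle
                # u < v < w, so each triangle is counted exactly once.
                count += sum(1 for w in adj[u] & adj[v] if w > v)
    return count
```

Multicore implementations typically parallelize the outer loop and rank vertices by degree rather than id to balance the intersection work.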

143 citations

Proceedings ArticleDOI
25 Feb 2012
TL;DR: The main contribution is to demonstrate that for this wide body of problems, there exist efficient internally deterministic algorithms, and moreover that these algorithms are natural to reason about and not complicated to code.
Abstract: The virtues of deterministic parallelism have been argued for decades and many forms of deterministic parallelism have been described and analyzed. Here we are concerned with one of the strongest forms, requiring that for any input there is a unique dependence graph representing a trace of the computation annotated with every operation and value. This has been referred to as internal determinism, and implies a sequential semantics---i.e., considering any sequential traversal of the dependence graph is sufficient for analyzing the correctness of the code. In addition to returning deterministic results, internal determinism has many advantages including ease of reasoning about the code, ease of verifying correctness, ease of debugging, ease of defining invariants, ease of defining good coverage for testing, and ease of formally, informally and experimentally reasoning about performance. On the other hand, one needs to consider the possible downsides of determinism, which might include making algorithms (i) more complicated, unnatural or special purpose and/or (ii) slower or less scalable. In this paper we study the effectiveness of this strong form of determinism through a broad set of benchmark problems. Our main contribution is to demonstrate that for this wide body of problems, there exist efficient internally deterministic algorithms, and moreover that these algorithms are natural to reason about and not complicated to code. We leverage an approach to determinism suggested by Steele (1990), which is to use nested parallelism with commutative operations. Our algorithms apply several diverse programming paradigms that fit within the model including (i) a strict functional style (no shared state among concurrent operations), (ii) an approach we refer to as deterministic reservations, and (iii) the use of commutative, linearizable operations on data structures.
We describe algorithms for the benchmark problems that use these deterministic approaches and present performance results on a 32-core machine. Perhaps surprisingly, for all problems, our internally deterministic algorithms achieve good speedup and good performance even relative to prior nondeterministic solutions.
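The "deterministic reservations" approach mentioned above can be illustrated on greedy maximal independent set: in each round, every pending vertex writes its priority over its neighborhood with a min-reduction (the reserve phase), and a vertex commits exactly when it won every reservation it participates in, which reproduces the sequential greedy result regardless of execution order. This is a serial Python sketch of the pattern with the parallelizable loops noted in comments; it is an illustration, not the paper's code:

```python
def greedy_mis(adj):
    """Maximal independent set via deterministic reservations.
    adj: adjacency lists indexed by vertex id. Returns the same set as
    the sequential greedy algorithm scanning vertices 0, 1, 2, ..."""
    n = len(adj)
    PENDING, IN_SET, OUT = 0, 1, 2
    state = [PENDING] * n
    while PENDING in state:
        # Reserve phase: every pending vertex writes its priority (its
        # id) over its closed pending neighborhood with a min-reduction.
        # Iterations are independent, so this loop parallelizes.
        reserve = [n] * n
        for v in range(n):
            if state[v] == PENDING:
                reserve[v] = min(reserve[v], v)
                for w in adj[v]:
                    if state[w] == PENDING:
                        reserve[w] = min(reserve[w], v)
        # Commit phase: a vertex joins iff it won all its reservations,
        # i.e. it has the minimum id in its pending neighborhood.
        # Winners in a round are mutually non-adjacent, so commits
        # never conflict.
        winners = [v for v in range(n)
                   if state[v] == PENDING and reserve[v] == v]
        for v in winners:
            state[v] = IN_SET
            for w in adj[v]:
                if state[w] == PENDING:
                    state[w] = OUT
    return [v for v in range(n) if state[v] == IN_SET]
```

The key property is that the output is a function of the input alone: however the reserve and commit loops are scheduled, the min-reduction and the commit test decide the same winners each round.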

141 citations


Cites methods from "Low depth cache-oblivious algorithm..."

  • ...Comparison Sort: We use a low-depth cache-efficient sample sort [9]....


Journal ArticleDOI
09 Jun 2012
TL;DR: This paper demonstrates that the otherwise uncontrolled growth of the Ninja gap can be contained, offering more stable and predictable performance growth over future architectures and strong evidence that radical language changes are not required.
Abstract: Current processor trends of integrating more cores with wider SIMD units, along with a deeper and complex memory hierarchy, have made it increasingly more challenging to extract performance from applications. It is believed by some that traditional approaches to programming do not apply to these modern processors and hence radical new languages must be discovered. In this paper, we question this thinking and offer evidence in support of traditional programming methods and the performance-vs-programming effort effectiveness of common multi-core processors and upcoming many-core architectures in delivering significant speedup, and close-to-optimal performance for commonly used parallel computing workloads. We first quantify the extent of the "Ninja gap", which is the performance gap between naively written C/C++ code that is parallelism unaware (often serial) and best-optimized code on modern multi-/many-core processors. Using a set of representative throughput computing benchmarks, we show that there is an average Ninja gap of 24X (up to 53X) for a recent 6-core Intel® Core™ i7 X980 Westmere CPU, and that this gap, if left unaddressed, will inevitably increase. We show how a set of well-known algorithmic changes coupled with advancements in modern compiler technology can bring down the Ninja gap to an average of just 1.3X. These changes typically require low programming effort, as compared to the very high effort in producing Ninja code. We also discuss hardware support for programmability that can reduce the impact of these changes and even further increase programmer productivity. We show equally encouraging results for the upcoming Intel® Many Integrated Core architecture (Intel® MIC), which has more cores and wider SIMD.
We thus demonstrate that we can contain the otherwise uncontrolled growth of the Ninja gap and offer a more stable and predictable performance growth over future architectures, offering strong evidence that radical language changes are not required.

87 citations


Cites methods from "Low depth cache-oblivious algorithm..."

  • ...There have been various techniques proposed to address these algorithmic changes, either using compiler assisted optimization [27], using cache-oblivious algorithms [6] or specialized languages like Sequoia [21]....


Proceedings ArticleDOI
04 Jun 2011
TL;DR: The parallel cache-oblivious (PCO) model is presented, a relatively simple modification to the CO model that can be used to account for costs on a broad range of cache hierarchies, and a new scheduler is described, which attains provably good cache performance and runtime on parallel machine models with hierarchical caches.
Abstract: For nested-parallel computations with low depth (span, critical path length) analyzing the work, depth, and sequential cache complexity suffices to attain reasonably strong bounds on the parallel runtime and cache complexity on machine models with either shared or private caches. These bounds, however, do not extend to general hierarchical caches, due to limitations in (i) the cache-oblivious (CO) model used to analyze cache complexity and (ii) the schedulers used to map computation tasks to processors. This paper presents the parallel cache-oblivious (PCO) model, a relatively simple modification to the CO model that can be used to account for costs on a broad range of cache hierarchies. The first change is to avoid capturing artificial data sharing among parallel threads, and the second is to account for parallelism-memory imbalances within tasks. Despite the more restrictive nature of PCO compared to CO, many algorithms have the same asymptotic cache complexity bounds. The paper then describes a new scheduler for hierarchical caches, which extends recent work on "space-bounded schedulers" to allow for computations with arbitrary work imbalance among parallel subtasks. This scheduler attains provably good cache performance and runtime on parallel machine models with hierarchical caches, for nested-parallel computations analyzed using the PCO model. We show that under reasonable assumptions our scheduler is "work efficient" in the sense that the cost of the cache misses is evenly balanced across the processors---i.e., the runtime can be determined within a constant factor by taking the total cost of the cache misses analyzed for a computation and dividing it by the number of processors. In contrast, to further support our model, we show that no scheduler can achieve such bounds (optimizing for both cache misses and runtime) if work, depth, and sequential cache complexity are the only parameters used to analyze a computation.

84 citations


Cites background from "Low depth cache-oblivious algorithm..."

  • ...However, all future references fit into cache until reaching a supertask that does not fit in cache, at which point the [...] Table 1: Cache complexities of some algorithms analyzed in the PCO model:

      Problem                                          Span           Cache Complexity Q*
      Scan (prefix sums, etc.)                         O(log n)       O(⌈n/B⌉)
      Matrix Transpose (n × m matrix) [20]             O(log(n + m))  O(⌈nm/B⌉)
      Matrix Multiplication (√n × √n matrix) [20]      O(√n)          O(⌈n^1.5/B⌉/√(M + 1))
      Matrix Inversion (√n × √n matrix)                O(√n)          O(⌈n^1.5/B⌉/√(M + 1))
      Quicksort [22]                                   O(log² n)      O(⌈n/B⌉(1 + log(n/(M + 1))))
      Sample Sort [10]                                 O(log² n)      O(⌈n/B⌉⌈log_{M+2} n⌉)
      Sparse-Matrix Vector Multiply [10]               O(log² n)      O(⌈m/B + n/(M + 1)^{1−ε}⌉)
        (m nonzeros, n^ε edge separators)
      Convex Hull (e.g., see [8])                      O(log² n)      O(⌈n/B⌉⌈log_{M+2} n⌉)
      Barnes-Hut tree (e.g., see [8])                  O(log² n)      O(⌈n/B⌉(1 + log(n/(M + 1))))


  • ...Unfortunately, current dynamic parallelism approaches have important limitations: they either apply to hierarchies of only private or only shared caches [1,9,10,16,21], require some strict balance criteria [7,15], or require a joint algorithm/scheduler analysis [7, 13–16]....


  • ...Sample Sort [10]: O(log² n) span, O(⌈n/B⌉⌈log_{M+2} n⌉) cache complexity; Sparse-Matrix Vector Multiply [10] (m nonzeros, n^ε edge separators): O(log² n) span, O(⌈m/B + n/(M + 1)^{1−ε}⌉) cache complexity...


  • ...A pair of common abstract measures for capturing parallel cache based locality are the number of misses given a sequential ordering of a parallel computation [1, 9, 10, 21], and the depth (span, critical path length) of the computation....


References
Proceedings ArticleDOI
11 Aug 2009
TL;DR: This paper describes cache-oblivious sorting algorithms with optimal work, optimal cache complexity and polylogarithmic depth, which lead to low cache complexities on shared-memory multiprocessors with a single level of private caches or a single shared cache.
Abstract: Cache-oblivious algorithms have the advantage of achieving good sequential cache complexity across all levels of a multi-level cache hierarchy, regardless of the specifics (cache size and cache line size) of each level. In this paper, we describe cache-oblivious sorting algorithms with optimal work, optimal cache complexity and polylogarithmic depth. Using known mappings, these lead to low cache complexities on shared-memory multiprocessors with a single level of private caches or a single shared cache. Moreover, the low cache complexities extend to shared-memory multiprocessors with common configurations of multi-level caches. The key factor in the low cache complexity on multiprocessors is the low depth of the algorithms we propose.
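For intuition, a sample sort has the following skeleton: choose pivots from a random sample of the keys, partition the keys into buckets around the pivots, and sort the buckets recursively. The serial Python sketch below shows only this structure; the paper's cache-oblivious version additionally relies on parallel primitives (e.g., for the bucket partition) to achieve its depth and cache bounds:

```python
import bisect
import random

def sample_sort(keys, branching=8, cutoff=32):
    """Serial sample-sort skeleton: sample pivots, bucket, recurse."""
    # Base case: small inputs are sorted directly.
    if len(keys) <= cutoff:
        return sorted(keys)
    # Pick branching-1 pivots from a random sample of the keys.
    pivots = sorted(random.sample(keys, branching - 1))
    # Partition keys into buckets by binary search over the pivots.
    buckets = [[] for _ in range(branching)]
    for k in keys:                      # parallelizable partition
        buckets[bisect.bisect_right(pivots, k)].append(k)
    # Guard against pathological pivot choices (e.g. all keys equal),
    # which would otherwise leave the input unsplit.
    if max(len(b) for b in buckets) == len(keys):
        return sorted(keys)
    # The recursive bucket sorts are independent, so a nested-parallel
    # runtime runs them all concurrently.
    out = []
    for b in buckets:
        out.extend(sample_sort(b, branching, cutoff))
    return out
```

In the low-depth version, the branching factor grows with the input size (roughly √n-way splitting) so the recursion depth stays polylogarithmic while the partition stays cache-efficient.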

7 citations


"Low depth cache-oblivious algorithm..." refers background in this paper

  • ...A preliminary version of this paper appeared as a three page brief announcement in SPAA’09 [17]....


Proceedings ArticleDOI
07 May 2007
TL;DR: This talk will discuss some of these insights, mostly based on work coauthored by the speaker over the years, thereby offering a somewhat personal perspective rather than attempting a balanced and complete survey of the field.
Abstract: Technological evolution successively reduces accidental constraints and leads to machine organizations dictated by fundamental physical constraints on layout and signal speed. More specifically, machines have to be realized in three-dimensional space, with elementary building blocks and connecting elements that exhibit minimum feature sizes. Furthermore, there is a maximum speed at which information can travel. The conjunction of layout and signal speed constraints results in lower limits to latencies and upper limits to bandwidths, respectively, between points and across cuts of any machine. In the outlined scenario, communication becomes a dominant factor determining performance. Machine organizations become increasingly parallel and hierarchical, to provide the most efficient support possible for data processing and communication requirements of computations. At the same time, the design and the implementation of algorithms must strive for increased concurrency and locality, to reduce the communication requirements arising from their execution. While, at the state of the art, we do not have a complete and systematic theory of optimal design of machines and algorithms, under the fundamental physical constraints, we do have a number of insights that can provide valuable guidance toward a unified framework for parallel and hierarchical computation. This talk will discuss some of these insights, mostly based on work coauthored by the speaker over the years, thereby offering a somewhat personal perspective rather than attempting a balanced and complete survey of the field. For reasons reflecting historical technological tradeoffs between local communication overheads and flight time of messages, many of the models of computation studied in computer science assume instantaneous communication, irrespective of the distance between source and destination. The latter assumption is, of course, incompatible with the upper limit on the speed of signals.
Once this limit is duly recognized and incorporated into the models, a number of interesting consequences emerge, as systematically investigated in [8, 9, 10]. One significant consequence is that slack parallelism cannot be traded off against locality in a scalable way. In other words, the hierarchical nature of machines has to be explicitly taken into account for performance optimization. The Random Access Machine (RAM) has been a cornerstone of the history of computing, providing the basis for much design and analysis of sequential algorithms and inspiring much development in uniprocessor computer architecture. However, constant-time access to any location in memory is not achievable with bounded signal speed. This is simply established theoretically and clearly witnessed in the practice of computer design, where memory latencies have become a serious bottleneck, as reflected in an increasing gap between the maximum number of instructions executable by the processor in unit time and those actually executed for average workloads. In [1, 2, 3], a systematic study has been undertaken of how to approximate the ideal RAM under layout and speed-of-signal constraints. Novel highly pipelinable hierarchical memory structures as well as novel processor organizations have been proposed and analyzed in this context, but considerable work remains to be done in this direction, to fully exploit the combined potential of concurrency and locality in the memory hierarchy. Communication requirements arise in a computation when an interaction is required between data produced at different places or at different times. The network and the memory system typically handle the data transfers in the two cases. Network and memory data transfers become strictly intertwined in systems built with Chip Multi Processors (CMPs).
Indeed, these transfers have to compete for the same bandwidth across the chip boundary. A theory based on the quantitative notion of information exchange has been developed in [6] and applied to the optimization of dedicated machines. In particular, the theory determines the optimal split of the chip area between memory and functional units, given the information-exchange function of the target application. While CMP organizations further open the machine design space and pose a number of interesting performance optimization problems, it is recognized that the greatest challenge to the full exploitation of the new architectures comes from the software side. Indeed, managing parallelism and locality adds significant burdens to the process of algorithm design and implementation. Furthermore, the greater variety to be expected in the architecture and in the amount of resources of the platforms that will become available creates a nontrivial obstacle to the portability of software, particularly where performance is at stake. Several models of computation have been formulated that capture the variations in the properties of both the mem-

1 citations


Additional excerpts

  • ...7] and distributed memory machines [48, 33, 12]....
