Journal ArticleDOI

The power of parallel prefix

TL;DR: This study solves the prefix computation problem when the order of the elements is specified by a linked list, assuming the weakest PRAM model, in which shared memory locations can only be exclusively read or written (the EREW model).
Abstract: The prefix computation problem is to compute all n initial products a1 * … * ai, i = 1, …, n, of a set of n elements, where * is an associative operation. An O(((log n)/log(2n/p)) × (n/p)) time deterministic parallel algorithm using p ≤ n processors is presented to solve the prefix computation problem when the order of the elements is specified by a linked list. For p ≤ O(n^(1−ε)) (ε > 0 any constant), this algorithm achieves linear speedup. Such optimal speedup was previously achieved only by probabilistic algorithms. This study assumes the weakest PRAM model, where shared memory locations can only be exclusively read or written (the EREW model).
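The paper's work-optimal deterministic EREW algorithm is considerably more involved than what fits here; as an illustrative baseline only, the sketch below simulates the classic pointer-jumping scheme for prefix computation over a linked list, which runs in O(log n) synchronous rounds but does O(n log n) work. The list, the values, and the use of integer addition for * are made-up example data, not anything from the paper.

```cpp
// Illustrative baseline only, not the paper's work-optimal EREW algorithm:
// pointer-jumping prefix computation over a linked list, simulated
// sequentially one synchronous PRAM round at a time.
#include <cstdio>
#include <vector>

int main() {
    // List order 0 -> 2 -> 3 -> 1, encoded by predecessor pointers:
    // pred[i] is the element before i in list order, -1 at the head.
    std::vector<int> pred = {-1, 3, 0, 2};
    std::vector<long long> val = {5, 1, 7, 2};   // a_i; '*' is integer '+' here
    std::vector<int> p2(pred.size());
    std::vector<long long> v2(val.size());
    bool active = true;
    while (active) {                  // O(log n) rounds in the parallel setting
        active = false;
        for (std::size_t i = 0; i < pred.size(); ++i) {   // "for all i in parallel"
            if (pred[i] != -1) {
                v2[i] = val[pred[i]] + val[i];   // fold in the block ending at pred[i]
                p2[i] = pred[pred[i]];           // then jump twice as far back
                active = true;
            } else {
                v2[i] = val[i];
                p2[i] = -1;
            }
        }
        val.swap(v2);                 // all reads precede all writes, as on a PRAM
        pred.swap(p2);
    }
    // val[i] is now the product of all elements up to and including i in
    // list order; printed by index this gives: 5 15 12 14
    for (long long v : val) std::printf("%lld ", v);
    std::printf("\n");
    return 0;
}
```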
Citations
Journal ArticleDOI
TL;DR: A parallel algorithm for random walk generation in regular as well as irregular regions is presented and is shown to ideally fit on a hypercube of n nodes, where n is the number of processors.
Abstract: Random walks are widely applicable in statistical and scientific computations. In particular, they are used in the Monte Carlo method to solve elliptic and parabolic partial differential equations (PDEs). This method holds several advantages over other methods for PDEs as it solves problems with irregular boundaries and/or discontinuities, gives solutions at individual points, and exhibits great parallelism. However, the generation of each random walk in the Monte Carlo method has been done sequentially because each point in the walk is derived from the preceding point by moving one grid step along a randomly selected direction. A parallel algorithm for random walk generation in regular as well as irregular regions is presented. The algorithm is based on parallel prefix computations. The communication structure of the algorithm is shown to ideally fit on a hypercube of n nodes, where n is the number of processors.
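As a minimal sketch of the connection to prefix computation, assuming an unbounded regular grid so that boundaries can be ignored (the cited paper's handling of irregular regions and its hypercube mapping are not shown), the walk positions are simply the inclusive scan of independently drawn unit steps under componentwise addition, which is associative and therefore amenable to any parallel scan:

```cpp
// Hedged sketch: on an unbounded regular grid, a random walk is the prefix
// sum of its i.i.d. unit steps, so all positions follow from one scan.
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

struct Step { int dx, dy; };

int main() {
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> dir(0, 3);
    const int n = 16;
    std::vector<Step> steps(n);
    for (Step& s : steps) {           // draw unit steps E/W/N/S independently
        static const Step moves[4] = {{1, 0}, {-1, 0}, {0, 1}, {0, -1}};
        s = moves[dir(rng)];
    }
    std::vector<Step> pos(n);
    // The scan operator (componentwise addition) is associative, which is all
    // a parallel prefix implementation needs.
    std::inclusive_scan(steps.begin(), steps.end(), pos.begin(),
                        [](Step a, Step b) { return Step{a.dx + b.dx, a.dy + b.dy}; });
    for (const Step& p : pos) std::printf("(%d,%d) ", p.dx, p.dy);
    std::printf("\n");
    return 0;
}
```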

4 citations

Journal ArticleDOI
TL;DR: A routing strategy to ensure the edge-disjointness of the routing paths in executing binary tree algorithms is identified and the fault tolerance of the embedding method is discussed.

4 citations

Proceedings ArticleDOI
16 May 2011
TL;DR: This paper presents a shared-memory programming framework that allows tasks to dynamically spawn subtasks with a given degree of parallelism for implementing tightly coupled parallel parts of the algorithm, and presents a new algorithm for work-stealing with deterministic team-building.
Abstract: Parallelizing complex applications, even for well-behaved parallel systems, often calls for different parallelization approaches within the same application. In this paper we discuss three applications from the literature that, for reasons of both efficiency and expressive convenience, benefit from a mixture of task and more tightly coupled data parallelism. These three applications, namely Quicksort, list ranking, and LU factorization with partial pivoting, are paradigms for recursive, mixed-mode parallel algorithms that can neither easily nor efficiently be expressed in either a purely data-parallel or a purely task-parallel fashion. As a solution we present a shared-memory programming framework that allows tasks to dynamically spawn subtasks with a given degree of parallelism for implementing tightly coupled parallel parts of the algorithm. All three paradigmatic applications can naturally be expressed in this framework, which in turn can be supported by an extended, non-conventional work-stealing scheduler, which we also briefly sketch. Using our new algorithm for work-stealing with deterministic team-building, we show, beyond the improved and more natural implementability, better scalability in many cases and sometimes better absolute performance than with less natural implementations based on pure task parallelism executed with conventional work-stealing. Detailed performance results using an Intel 32-core system substantiate our claims.
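The framework and its deterministic team-building scheduler are the paper's own and are not reproduced here; as a hedged stand-in, the sketch below shows only the recursive task-spawning shape of the Quicksort example using plain std::async, with a depth parameter loosely playing the role of a requested degree of parallelism (no work-stealing or team-building is modeled).

```cpp
// Hedged stand-in for recursive mixed-mode parallelism, not the authors'
// framework: Quicksort where each call may spawn its left half as a subtask.
#include <algorithm>
#include <cstdio>
#include <functional>
#include <future>
#include <vector>

void quicksort(std::vector<int>& v, int lo, int hi, int depth) {
    if (hi - lo < 2) return;
    int pivot = v[lo + (hi - lo) / 2];
    int* base = v.data();
    // Three-way split: < pivot, == pivot, > pivot. The middle block is never
    // empty, so the recursion always shrinks.
    int* m1 = std::partition(base + lo, base + hi,
                             [pivot](int x) { return x < pivot; });
    int* m2 = std::partition(m1, base + hi,
                             [pivot](int x) { return x == pivot; });
    int a = static_cast<int>(m1 - base);
    int b = static_cast<int>(m2 - base);
    if (depth > 0) {
        // Spawn the left part as a new asynchronous task; recurse locally on
        // the right part. The two ranges are disjoint, so there is no race.
        auto left = std::async(std::launch::async, quicksort,
                               std::ref(v), lo, a, depth - 1);
        quicksort(v, b, hi, depth - 1);
        left.get();
    } else {
        quicksort(v, lo, a, 0);
        quicksort(v, b, hi, 0);
    }
}

int main() {
    std::vector<int> v = {9, 3, 7, 1, 8, 2, 6, 5, 4, 0};
    quicksort(v, 0, static_cast<int>(v.size()), /*depth=*/2);  // at most 4 concurrent tasks
    for (int x : v) std::printf("%d ", x);
    std::printf("\n");
    return 0;
}
```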

4 citations


Cites background from "The power of parallel prefix"

  • ...The operation of splicing out of a list element is called pair_off in [19]....


Proceedings ArticleDOI
16 Aug 1993
TL;DR: It is shown that articulation points and bridges of permutation graphs can be found in O(log n) time using O(n/log n) processors on an EREW PRAM.
Abstract: We show that articulation points and bridges of permutation graphs can be found in O(log n) time using O(n/log n) processors on an EREW PRAM. The algorithms are optimal with respect to the time-processor product.

4 citations

Journal ArticleDOI
TL;DR: In this article, a hierarchical prefix scan algorithm is proposed to reduce the time of registration of a series of electron microscopy images to less than 3 minutes by translating the image registration into a specific instance of the prefix scan.
Abstract: Parallelism patterns (e.g., map or reduce) have proven to be effective tools for parallelizing high-performance applications. In this article, we study the recursive registration of a series of electron microscopy images, a time-consuming and imbalanced computation necessary for nano-scale microscopy analysis. We show that by translating the image registration into a specific instance of the prefix scan, we can convert this seemingly sequential problem into a parallel computation that scales to over a thousand cores. We analyze a variety of scan algorithms that behave similarly for common low-compute operators and propose a novel work-stealing procedure for a hierarchical prefix scan. Our evaluation shows that by identifying a suitable and well-optimized prefix scan algorithm, we reduce the time-to-solution for a series of 4,096 images spanning ten seconds of microscopy acquisition from over 10 hours to less than 3 minutes (using 1024 Intel Haswell cores), enabling derivation of material properties at the nanoscale for long microscopy image series.
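A minimal sketch of the reduction itself (the names Shift and compose are hypothetical, the drift values are dummy data, and the paper's hierarchical, work-stealing scan is not reproduced): if registering each frame against its predecessor yields a translation, then the absolute alignment of every frame is the inclusive scan of those pairwise translations under an associative compose operator, so any parallel prefix-scan algorithm applies.

```cpp
// Hedged sketch of the registration-as-scan reduction with dummy data;
// real registration would estimate the pairwise shifts from image content.
#include <cstdio>
#include <numeric>
#include <vector>

struct Shift { double dx, dy; };   // a pure-translation transform

// Composing two translations is associative, which is what a scan requires.
static Shift compose(Shift a, Shift b) { return Shift{a.dx + b.dx, a.dy + b.dy}; }

int main() {
    // Pairwise drift estimates between consecutive microscopy frames (dummy data).
    std::vector<Shift> pairwise = {{0.0, 0.0}, {0.4, -0.1}, {0.5, 0.0}, {0.3, 0.2}};
    std::vector<Shift> absolute(pairwise.size());
    std::inclusive_scan(pairwise.begin(), pairwise.end(), absolute.begin(), compose);
    for (const Shift& s : absolute)
        std::printf("(%.1f, %.1f) ", s.dx, s.dy);   // drift of each frame vs. frame 0
    std::printf("\n");
    return 0;
}
```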

4 citations