Journal ArticleDOI

The power of parallel prefix

TL;DR: This study presents a deterministic parallel algorithm for the prefix computation problem when the order of the elements is specified by a linked list, assuming the weakest PRAM model, in which shared memory locations can only be exclusively read or written (the EREW model).
Abstract: The prefix computation problem is to compute all $n$ initial products $a_1 * \cdots * a_i$, $i = 1, \ldots, n$, of a set of $n$ elements, where $*$ is an associative operation. An $O\left(\frac{\log n}{\log(2n/p)} \cdot \frac{n}{p}\right)$ time deterministic parallel algorithm using $p \le n$ processors is presented to solve the prefix computation problem when the order of the elements is specified by a linked list. For $p \le O(n^{1-\epsilon})$ ($\epsilon > 0$ any constant), this algorithm achieves linear speedup. Such optimal speedup was previously achieved only by probabilistic algorithms. This study assumes the weakest PRAM model, where shared memory locations can only be exclusively read or written (the EREW model).
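To make the problem statement concrete, the following is a minimal Python sketch of prefix computation over a linked list. It is not the algorithm of the paper: the first function is the plain sequential walk, and the second simulates the textbook pointer-jumping technique, which needs O(log n) synchronous rounds but performs O(n log n) operations in total. The array-based list representation (`values`, `succ`, `pred`, `head`) and both function names are assumptions made for illustration.

```python
from operator import add


def list_prefix_sequential(values, succ, head, op):
    """Walk the list from `head`; prefix[i] is the product of all
    elements from the head up to and including node i."""
    prefix = [None] * len(values)
    node, acc = head, None
    while node is not None:
        acc = values[node] if acc is None else op(acc, values[node])
        prefix[node] = acc
        node = succ[node]
    return prefix


def list_prefix_pointer_jumping(values, pred, op):
    """Simulate synchronous pointer-jumping rounds.

    Invariant: acc[i] is the product of the nodes strictly after
    link[i] up to and including i. Each round doubles that span.
    """
    acc = list(values)
    link = list(pred)
    while any(p is not None for p in link):
        new_acc, new_link = list(acc), list(link)
        for i, p in enumerate(link):  # conceptually one processor per node
            if p is not None:
                # The predecessor's segment goes on the left, so a
                # non-commutative `op` is handled correctly.
                new_acc[i] = op(acc[p], acc[i])
                new_link[i] = link[p]
        acc, link = new_acc, new_link
    return acc


# List order is node 2 -> node 0 -> node 3 -> node 1.
values = ["a", "b", "c", "d"]
succ = [3, None, 0, 1]
pred = [2, 3, None, 0]
print(list_prefix_sequential(values, succ, 2, add))
print(list_prefix_pointer_jumping(values, pred, add))
```

String concatenation is used as the operation because it is associative but not commutative, so any ordering mistake shows up in the result.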
Citations
Journal ArticleDOI
TL;DR: A parallel algorithm to generate the dominance graph on a collection of nonoverlapping iso-oriented rectangles is presented, which is the directed graph which contains an edge from a rectangle b to rectangle c iff c is immediately above b.
Abstract: A parallel algorithm to generate the dominance graph on a collection of nonoverlapping iso-oriented rectangles is presented. This graph arises from the constraint graph commonly used in compaction algorithms for VLSI circuits. The dominance graph expresses the notion of “aboveness” on a collection of nonoverlapping rectangles: it is the directed graph which contains an edge from a rectangle b to rectangle c iff c is immediately above b. The algorithm is based on the divide and conquer paradigm; in the EREW PRAM model, it has time complexity $O(\log^2 n)$, using $n/\log n$ processors. Its processor-time product is $O(n \log n)$, which is optimal.

5 citations


Cites background from "The power of parallel prefix"

  • ...(14) Once this array is computed, O(n'/p') time suffices to move all marked members of Ty to Ty; in particular, each marked Ty[i] is moved to T~[partial_sum[i]]....

    [...]
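The step quoted above uses prefix sums to compact marked elements: the partial sum at a marked position gives its slot in the output. A minimal sequential sketch of that idea follows. The function name `compact_marked` and its list-based interface are assumptions for illustration and are not taken from the citing paper.

```python
from itertools import accumulate


def compact_marked(items, marked):
    """Move every marked item to the front, preserving order.

    partial_sum[i] counts the marked items among positions 0..i, so a
    marked item at position i belongs at output slot partial_sum[i] - 1
    (0-based indexing).
    """
    flags = [1 if m else 0 for m in marked]
    partial_sum = list(accumulate(flags))  # inclusive prefix sums
    out = [None] * (partial_sum[-1] if partial_sum else 0)
    for i, item in enumerate(items):  # each iteration is independent
        if marked[i]:
            out[partial_sum[i] - 1] = item
    return out


print(compact_marked(list("abcdef"), [0, 1, 1, 0, 0, 1]))
```

Because every marked item writes to a distinct slot, the final loop has no write conflicts, which is what makes the move step easy to distribute across processors once the partial sums are known.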

Posted Content
TL;DR: By identifying a suitable and well-optimized prefix scan algorithm, this article reduces time-to-solution on a series of 4,096 images spanning ten seconds of microscopy acquisition from over 10 hours to less than 3 minutes, enabling derivation of material properties at nanoscale for long microscopy image series.
Abstract: Parallelism patterns (e.g., map or reduce) have proven to be effective tools for parallelizing high-performance applications. In this paper, we study the recursive registration of a series of electron microscopy images - a time-consuming and imbalanced computation necessary for nano-scale microscopy analysis. We show that by translating the image registration into a specific instance of the prefix scan, we can convert this seemingly sequential problem into a parallel computation that scales to over a thousand cores. We analyze a variety of scan algorithms that behave similarly for common low-compute operators and propose a novel work-stealing procedure for a hierarchical prefix scan. Our evaluation shows that by identifying a suitable and well-optimized prefix scan algorithm, we reduce time-to-solution on a series of 4,096 images spanning ten seconds of microscopy acquisition from over 10 hours to less than 3 minutes (using 1024 Intel Haswell cores), enabling derivation of material properties at nanoscale for long microscopy image series.
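The idea of turning a seemingly sequential chain into a prefix scan can be illustrated with a small sketch. It assumes, purely for illustration, that each step in the chain is a one-dimensional affine map stored as a pair `(a, b)` meaning x -> a*x + b; this is a hypothetical stand-in for a registration transform, not the operator used in the paper. The three-phase blocked scan below is a generic hierarchical scheme and does not include the work-stealing procedure the abstract mentions.

```python
from itertools import accumulate


def compose(f, g):
    """Return the map 'apply f, then g' for maps (a, b): x -> a*x + b.

    Composition is associative but not commutative.
    """
    a1, b1 = f
    a2, b2 = g
    return (a2 * a1, a2 * b1 + b2)


def blocked_scan(items, op, block_size):
    """Inclusive scan in three phases over fixed-size blocks."""
    blocks = [items[i:i + block_size]
              for i in range(0, len(items), block_size)]
    # Phase 1: scan each block on its own (independent per block).
    local = [list(accumulate(block, op)) for block in blocks]
    # Phase 2: scan the per-block totals.
    offsets = list(accumulate((scanned[-1] for scanned in local), op))
    # Phase 3: prepend the preceding blocks' total to every local
    # result (independent per block).
    result = list(local[0]) if local else []
    for k in range(1, len(local)):
        result.extend(op(offsets[k - 1], x) for x in local[k])
    return result


maps = [(2, 1), (3, 0), (1, 5), (2, 2), (1, -1)]
print(blocked_scan(maps, compose, 2))
print(list(accumulate(maps, compose)))  # sequential reference
```

Phases 1 and 3 touch each block independently, so they can run in parallel; only the short phase 2 scan over block totals is sequential in this sketch.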

5 citations


Cites methods from "The power of parallel prefix"

  • ...al [17] presented such algorithm on an EREW model [17]....

    [...]

Journal ArticleDOI
TL;DR: An optimal parallel algorithm for computing a cycle separator of an n-vertex embedded planar undirected graph in $O(\log n)$ time on $n/\log n$ processors is presented and an improved parallel algorithm is obtained for constructing a depth-first search tree rooted at any given vertex in a connected planar undirected graph.
Abstract: We present an optimal parallel algorithm for computing a cycle separator of an n-vertex embedded planar undirected graph in $O(\log n)$ time on $n/\log n$ processors. As a consequence, we also obtain an improved parallel algorithm for constructing a depth-first search tree rooted at any given vertex in a connected planar undirected graph in $O(\log^2 n)$ time on $n/\log n$ processors. The best previous algorithms for computing depth-first search trees and cycle separators achieved the same time complexities, but with $n$ processors. Our algorithms run on a parallel random access machine that permits concurrent reads and concurrent writes in its shared memory and allows an arbitrary processor to succeed in case of a write conflict.

5 citations


Cites methods from "The power of parallel prefix"

  • ...Step 5 uses optimal algorithms for prefix computation and list ranking [5], [10], [17], [28], [29]....

    [...]

Proceedings ArticleDOI
10 Oct 1988
TL;DR: The development of algorithms which can be ported among different fine-grain, massively parallel architectures and yield reasonably good implementations on each is discussed, and sample algorithms are given to solve some fundamental geometric problems.
Abstract: The development of algorithms which can be ported among different fine-grain, massively parallel architectures and yield reasonably good implementations on each is discussed. The approach is to write algorithms in terms of general data movement operations and then implement the data movement operations on the target architecture. Efficient implementation of the data movement operations requires careful programming, but since the data movement operations form the foundation of many programs, the cost of implementing them can be amortized. The use of data movement operations also helps programmers think in terms of higher-level programming units, in the same way that the use of standard data structures helps programmers of serial computers. An approach is described for designing efficient, portable algorithms, and sample algorithms are given to solve some fundamental geometric problems. The difficulties of portability and efficiency for these geometric problems are redirected into similar difficulties for the standardization operations.

5 citations


Cites background from "The power of parallel prefix"

  • ...Interested readers might consult [2, 3, 5, 6, 7] for additional operations and extensive uses of the operations discussed here....

    [...]

  • ...More recently there have been attempts to promote specific data movement operations as a programming aid [2, 3], or to develop a collection of data movement operations particularly useful for a specific architecture [5]....

    [...]

Journal ArticleDOI
01 Jul 1992
TL;DR: The algorithm for finding all polygons in G that are congruent to P requires $\Theta(n \log n)$ time for a CREW PRAM with m processors, which improves upon the $O(n^2)$ time required by the systolic array algorithm of [7].
Abstract: Given a straight-line embedded plane graph G of n edges and a polygon P of m edges, m ≤ n, we describe an algorithm for finding all polygons in G that are congruent to P. Our algorithm requires $\Theta(n \log n)$ time for a CREW PRAM with m processors. This improves upon the $O(n^2)$ time (with m processors) required by the systolic array algorithm of [7]. We also show the problem is in NC by showing how to implement our algorithm in $\Theta(\log n)$ time using mn processors.

5 citations