Journal ArticleDOI

The power of parallel prefix

TL;DR: This study solves the prefix computation problem when the order of the elements is specified by a linked list, assuming the weakest PRAM model, in which shared memory locations can only be exclusively read or written (the EREW model).
Abstract: The prefix computation problem is to compute all n initial products a1 * ... * ai, i = 1, ..., n, of a set of n elements, where * is an associative operation. An O(((log n)/log(2n/p)) × (n/p))-time deterministic parallel algorithm using p ≤ n processors is presented to solve the prefix computation problem when the order of the elements is specified by a linked list. For p ≤ O(n^(1-ε)) (ε > 0 any constant), this algorithm achieves linear speedup. Such optimal speedup was previously achieved only by probabilistic algorithms. This study assumes the weakest PRAM model, where shared memory locations can only be exclusively read or written (the EREW model).
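The linked-list prefix computation described in the abstract can be illustrated with a sequential simulation of the classic pointer-jumping technique. This is a minimal sketch: the names `list_prefix`, `succ`, `val`, and `op` are illustrative and not from the paper, and the paper's actual algorithm is considerably more involved (it achieves linear speedup, whereas plain pointer jumping does O(n log n) total work).

```python
# Minimal sketch: sequential simulation of pointer jumping on a linked
# list (illustrative names; not the paper's linear-work algorithm).
# For simplicity this computes, for each node, the product of the
# elements from that node to the tail; head-to-node prefixes are
# symmetric (follow predecessor pointers instead).
import operator

def list_prefix(succ, val, op=operator.mul):
    """succ[i]: index of the node after i (None at the tail).
    val[i]: node i's element; op: any associative operation."""
    n = len(val)
    val, succ = list(val), list(succ)
    # Each round simulates one parallel step: every node combines its
    # value with its successor's, then jumps its pointer ahead. After
    # round k, val[i] covers up to 2^k consecutive elements, so
    # O(log n) rounds suffice -- at O(n log n) total work, however,
    # not the linear work the paper achieves.
    for _ in range(max(1, n.bit_length())):
        new_val, new_succ = list(val), list(succ)
        for i in range(n):
            j = succ[i]
            if j is not None:
                new_val[i] = op(val[i], val[j])  # merge adjacent blocks
                new_succ[i] = succ[j]            # pointer jumping
        val, succ = new_val, new_succ
    return val
```

For example, on the list 0 → 1 → 2 → 3 with values [1, 2, 3, 4], `list_prefix([1, 2, 3, None], [1, 2, 3, 4])` returns [24, 24, 12, 4], each node's product through the tail.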
Citations
Proceedings ArticleDOI
26 May 2011
TL;DR: In this article, a parallel prefix algorithm for message-passing multicomputers is presented, which uses only half-duplex communications and provides the flexibility of choosing parameter values for either fewer computation time steps or fewer communication time steps to achieve the minimal running time based on the ratio of the time required by a communication step to the time of a computation step.
Abstract: A new computation-efficient parallel prefix algorithm for message-passing multicomputers is presented. The algorithm uses only half-duplex communications. It provides the flexibility of choosing parameter values for either fewer computation time steps or fewer communication time steps to achieve the minimal running time based on the ratio of the time required by a communication step to the time required by a computation step. Thus, under certain conditions, the new algorithm can run faster than previous ones for the same multicomputer model.

1 citation

Posted Content
TL;DR: A parallel (EREW PRAM) algorithm is presented showing that when a linked list is contracted from size n to size n/c for a suitable constant c, it can be packed into an array of size n/d for a constant 1 < d ≤ c in the time of 3-coloring the list.
Abstract: We present a parallel (EREW PRAM) algorithm for linked list contraction. We show that when we contract a linked list from size $n$ to size $n/c$ for a suitable constant $c$, we can pack the linked list into an array of size $n/d$ for a constant $1 < d\leq c$ in the time of 3-coloring the list. Thus, for a set of linked lists with a total of $n$ elements in which the longest list has $l$ elements, our algorithm contracts them in $O(n\log i/p+(\log^{(i)}n+\log i)\log \log l+\log l)$ time, for an arbitrary constructible integer $i$, with $p$ processors on the EREW PRAM, where $\log^{(1)} n =\log n$, $\log^{(t)}n=\log \log^{(t-1)} n$, and $\log^*n=\min \{ i|\log^{(i)} n < 10\}$. When $i$ is a constant we get time $O(n/p+\log^{(i)}n\log \log l+\log l)$. Thus, when $l=\Omega (\log^{(c)}n)$ for any constant $c$, we achieve $O(n/p+\log l)$ time. The previous best deterministic EREW PRAM algorithm has time $O(n/p+\log n)$ and the best CRCW PRAM algorithm has time $O(n/p+\log n/\log \log n+\log l)$. Keywords: parallel algorithms, linked list, linked list contraction, uniform linked list contraction, EREW PRAM.

1 citation

DissertationDOI
01 Jan 2014
TL;DR: By modeling how list ranking algorithms retrieve information on the structure of the list in memory, a lower bound is given that is quadratic in sorting complexity for certain parameter settings, yielding the first non-trivial lower bounds for list ranking in the bulk synchronous parallel and MapReduce models.
Abstract: The performance of many algorithms on large input instances substantially depends on the number of triggered cache misses instead of the number of executed operations. This behavior is captured by the external memory model in a natural way. It models a computer by a fast cache of bounded size and a conceptually infinite (external) memory. In contrast to the classical RAM model, the complexity measure is the number of cache lines transferred between the cache and the memory. Computations on elements in the cache are not counted. Recent trends in processor design and advances in big data computing require massively parallel algorithms. The parallel external memory (PEM) model extends the external memory model so that it also captures parallelism. It consists of multiple processors, each with a private cache, that share the (external) memory. This thesis considers three computational problems in the context of (parallel) external memory algorithms. For the fundamental problem of list ranking, an algorithm was previously known that has sorting complexity for many settings of the PEM model. In the first part of this thesis, this algorithm is complemented by matching lower bounds for most practical settings. Interestingly, a stronger lower bound is shown for parameter ranges that had not previously been considered. By modeling how list ranking algorithms retrieve information on the structure of the list in memory, we give a lower bound that is quadratic in sorting complexity for certain parameter settings. It is noteworthy that this result implies the first non-trivial lower bounds for list ranking in the bulk synchronous parallel and MapReduce models. These lower bounds are complemented by a list ranking algorithm which is, in contrast to previous algorithms, analyzed for all parameter settings of the PEM model. In the second part, an efficient PEM algorithm is presented for computing a tree decomposition of bounded width for a graph. The main challenge is to implement a load balancing strategy such that the running [...]
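The list ranking problem at the center of the thesis's first part admits a compact illustration. The sketch below (the names `list_rank` and `succ` are illustrative, not the thesis's) simulates the textbook pointer-jumping solution, whose scattered pointer chasing is exactly the access pattern that triggers cache misses in the (parallel) external memory model:

```python
# Minimal sketch of list ranking: given a linked list, compute each
# node's distance to the tail. Sequential simulation of pointer
# jumping; PEM-efficient algorithms are far more intricate, since
# following pointers at random is what causes cache-line transfers.

def list_rank(succ):
    """succ[i]: index of the node after i, or None at the tail."""
    n = len(succ)
    rank = [0 if s is None else 1 for s in succ]  # 1 hop per live pointer
    succ = list(succ)
    for _ in range(max(1, n.bit_length())):  # O(log n) jumping rounds
        new_rank, new_succ = list(rank), list(succ)
        for i in range(n):
            j = succ[i]
            if j is not None:
                new_rank[i] = rank[i] + rank[j]  # add the skipped distance
                new_succ[i] = succ[j]            # jump over the successor
        rank, succ = new_rank, new_succ
    return rank
```

On the list 0 → 1 → 2 → 3, `list_rank([1, 2, 3, None])` returns [3, 2, 1, 0]. In a RAM-style analysis this costs O(n log n) work; the point of the external memory treatment is that the cache-miss count, not the operation count, dominates on large instances.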

1 citation

01 Jan 1985
TL;DR: Two parallel algorithms are presented for constructing the Voronoi diagram of a set of line segments in the plane.
Abstract: We present two parallel algorithms for constructing the Voronoi diagram of a set of n > 0 line segments in the plane: a) The first algorithm runs in O(log² n) time using O(n) processors. This improves the previous best results (by A. Chow and also by Aggarwal, Chazelle, Guibas, O'Dunlaing and Yap) in two respects. First, we improve the running time by a factor of O(log n), and second, the original results allow only sets of points. b) By using O(n^(1+ε)) processors, for any ε > 0, we improve the running time to O(log n). This is the fastest known algorithm using a subquadratic number of processors. The results combine a number of techniques: a new O(log n) method for point location in certain tree-shaped Voronoi diagrams, a method of Aggarwal et al. for reducing contour tracing to merging tree-shaped Voronoi diagrams, and a technique of Yap for computing the Voronoi diagrams of line segments. The computational model we use is the CREW PRAM (Concurrent-Read, Exclusive-Write Parallel RAM).

1 citation


Cites methods from "The power of parallel prefix"

  • ...Sort E1 and E' in O(log n) time with O(n) processors along the y-direction using parallel prefix [10] (every edge can determine in O(1) time its predecessor)....

    [...]

Journal ArticleDOI
TL;DR: This paper considers the problem of matching image curves against a database of object curve models on massively parallel computers such as the Connection Machine by iteratively finding the longest common subcurve.

1 citation