scispace - formally typeset
Search or ask a question
Book

Parallel merge sort

09 Sep 2011-
TL;DR: In this paper, a parallel implementation of merge sort on a CREW PRAM that uses n processors and O(logn) time is given, and the constant in the running time is small.
Abstract: We give a parallel implementation of merge sort on a CREW PRAM that uses n processors and O(logn) time; the constant in the running time is small. We also give a more complex version of the algorithm for the EREW PRAM; it also uses n processors and O(logn) time. The constant in the running time is still moderate, though not as small.

Content maybe subject to copyright    Report

Citations
More filters
Proceedings ArticleDOI
23 May 2009
TL;DR: The design of high-performance parallel radix sort and merge sort routines for manycore GPUs, taking advantage of the full programmability offered by CUDA, are described, which are the fastest GPU sort and the fastest comparison-based sort reported in the literature.
Abstract: We describe the design of high-performance parallel radix sort and merge sort routines for manycore GPUs, taking advantage of the full programmability offered by CUDA. Our radix sort is the fastest GPU sort and our merge sort is the fastest comparison-based sort reported in the literature. Our radix sort is up to 4 times faster than the graphics-based GPUSort and greater than 2 times faster than other CUDA-based radix sorts. It is also 23% faster, on average, than even a very carefully optimized multicore CPU sorting routine. To achieve this performance, we carefully design our algorithms to expose substantial fine-grained parallelism and decompose the computation into independent tasks that perform minimal global communication. We exploit the high-speed onchip shared memory provided by NVIDIA's GPU architecture and efficient data-parallel primitives, particularly parallel scan. While targeted at GPUs, these algorithms should also be well-suited for other manycore processors.

684 citations

Book
01 Jan 1990
TL;DR: A model of parallelism that extends and formalizes the Data-Parallel model on which the Connection Machine and other supercomputers are based is described, and it is argued that data-parallel models are not only practical and can be applied to a surprisingly wide variety of problems, they are also well suited for very-high-level languages and lead to a concise and clear description of algorithms and their complexity.
Abstract: "Vector Models for Data-Parallel Computing "describes a model of parallelism that extends and formalizes the Data-Parallel model on which the Connection Machine and other supercomputers are based. It presents many algorithms based on the model, ranging from graph algorithms to numerical algorithms, and argues that data-parallel models are not only practical and can be applied to a surprisingly wide variety of problems, they are also well suited for very-high-level languages and lead to a concise and clear description of algorithms and their complexity. Many of the author's ideas have been incorporated into the instruction set and into algorithms currently running on the Connection Machine.The book includes the definition of a parallel vector machine; an extensive description of the uses of the scan (also called parallel-prefix) operations; the introduction of segmented vector operations; parallel data structures for trees, graphs, and grids; many parallel computational-geometry, graph, numerical and sorting algorithms; techniques for compiling nested parallelism; a compiler for Paralation Lisp; and details on the implementation of the scan operations.Guy E. Blelloch is an Assistant Professor of Computer Science and a Principal Investigator with the Super Compiler and Advanced Language project at Carnegie Mellon University.Contents: Introduction. Parallel Vector Models. The Scan Primitives. Computational-Geometry Algorithms. Graph Algorithms. Numerical Algorithms. Languages and Compilers. Correction-Oriented Languages. Flattening Nested Parallelism. A Compiler for Paralation Lisp. Paralation-Lisp Code. The Scan Vector Model. Data Structures. Implementing Parallel Vector Models. Implementing the Scan Operations. Conclusions. Glossary.

571 citations

Journal ArticleDOI
TL;DR: A study of the effects of adding two scan primitives as unit-time primitives to PRAM (parallel random access machine) models is presented and it is shown that the primitives improve the asymptotic running time of many algorithms by an O(log n) factor, greatly simplifying the description of many technologies.
Abstract: A study of the effects of adding two scan primitives as unit-time primitives to PRAM (parallel random access machine) models is presented. It is shown that the primitives improve the asymptotic running time of many algorithms by an O(log n) factor, greatly simplifying the description of many algorithms, and are significantly easier to implement than memory references. It is argued that the algorithm designer should feel free to use these operations as if they were as cheap as a memory reference. The author describes five algorithms that clearly illustrate how the scan primitives can be used in algorithm design: a radix-sort algorithm, a quicksort algorithm, a minimum-spanning-tree algorithm, a line-drawing algorithm, and a merging algorithm. These all run on an EREW (exclusive read, exclusive write) PRAM with the addition of two scan primitives and are either simpler or more efficient than their pure PRAM counterparts. The scan primitives have been implemented in microcode on the Connection Machine system, are available in PARIS (the parallel instruction set of the machine). >

543 citations

Book ChapterDOI
30 Jun 2003
TL;DR: The skew algorithm for suffix array construction over integer alphabets that can be implemented to run in linear time using integer sorting as its only nontrivial subroutine is introduced.
Abstract: A suffix array represents the suffixes of a string in sorted order. Being a simpler and more compact alternative to suffix trees, it is an important tool for full text indexing and other string processing tasks. We introduce the skew algorithm for suffix array construction over integer alphabets that can be implemented to run in linear time using integer sorting as its only nontrivial subroutine: 1. recursively sort suffixes beginning at positions i mod 3 ≠ 0. 2. sort the remaining suffixes using the information obtained in step one. 3. merge the two sorted sequences obtained in steps one and two. The algorithm is much simpler than previous linear time algorithms that are all based on the more complicated suffix tree data structure. Since sorting is a well studied problem, we obtain optimal algorithms for several other models of computation, e.g. external memory with parallel disks, cache oblivious, and parallel. The adaptations for BSP and EREW-PRAM are asymptotically faster than the best previously known algorithms.

465 citations

Journal ArticleDOI
TL;DR: A generalized algorithm, DC, that allows a space-efficient implementation and, moreover, supports the choice of a space--time tradeoff and is asymptotically faster than all previous suffix tree or array construction algorithms.
Abstract: Suffix trees and suffix arrays are widely used and largely interchangeable index structures on strings and sequences. Practitioners prefer suffix arrays due to their simplicity and space efficiency while theoreticians use suffix trees due to linear-time construction algorithms and more explicit structure. We narrow this gap between theory and practice with a simple linear-time construction algorithm for suffix arrays. The simplicity is demonstrated with a C++ implementation of 50 effective lines of code. The algorithm is called DC3, which stems from the central underlying concept of difference cover. This view leads to a generalized algorithm, DC, that allows a space-efficient implementation and, moreover, supports the choice of a space--time tradeoff. For any v ∈ [1,√n], it runs in O(vn) time using O(n/√v) space in addition to the input string and the suffix array. We also present variants of the algorithm for several parallel and hierarchical memory models of computation. The algorithms for BSP and EREW-PRAM models are asymptotically faster than all previous suffix tree or array construction algorithms.

420 citations

References
More filters
Proceedings ArticleDOI
30 Apr 1968
TL;DR: To achieve high throughput rates today's computers perform several operations simultaneously; not only are I/O operations performed concurrently with computing, but also, in multiprocessors, several computing operations are done concurrently.
Abstract: To achieve high throughput rates today's computers perform several operations simultaneously. Not only are I/O operations performed concurrently with computing, but also, in multiprocessors, several computing operations are done concurrently. A major problem in the design of such a computing system is the connecting together of the various parts of the system (the I/O devices, memories, processing units, etc.) in such a way that all the required data transfers can be accommodated. One common scheme is a high-speed bus which is time-shared by the various parts; speed of available hardware limits this scheme. Another scheme is a cross-bar switch or matrix; limiting factors here are the amount of hardware (an m × n matrix requires m × n cross-points) and the fan-in and fan-out of the hardware.

2,553 citations

Journal ArticleDOI
TL;DR: It is pointed out that analyses of parallelism in computational problems have practical implications even when multi-processor machines are not available, and a unified framework for cases like this is presented.
Abstract: The goal of this paper is to point out that analyses of parallelism m computational problems have practical implications even when mult~processor machines are not available. This is true because, in many cases, a good parallel algorithm for one problem may turn out to be useful for designing an efficsent serial algorithm for another problem A unified framework for cases like this is presented. Particular cases, which axe discussed in this paper, provide motivation for examining parallelism in sorting, selecuon, minimum-spanning-tree, shortest route, max-flow, and matrix multiplication problems, as well as in scheduling and locational problems.

696 citations

Journal ArticleDOI
TL;DR: A sorting network withcn logn comparisons where in thei-th step of the algorithm the contents of registersRj, andRk, wherej, k are absolute constants then change their contents or not according to the result of the comparison.
Abstract: We give a sorting network withcn logn comparisons. The algorithm can be performed inc logn parallel steps as well, where in a parallel step we comparen/2 disjoint pairs. In thei-th step of the algorithm we compare the contents of registersR j(i) , andR k(i) , wherej(i), k(i) are absolute constants then change their contents or not according to the result of the comparison.

497 citations

Journal ArticleDOI
TL;DR: The worst-case time complexity of algorithms for multiprocessor computers with binary comparisons as the basic operations is investigated and the algorithm for finding the maximum is shown to be optimal for all values of k and n.
Abstract: The worst-case time complexity of algorithms for multiprocessor computers with binary comparisons as the basic operations is investigated. It is shown that for the problems of finding the maximum, sorting, and merging a pair of sorted lists, if n, the size of the input set, is not less than k, the number of processors, speedups of at least $O(k/\log \log k)$ can be achieved with respect to comparison operations. The algorithm for finding the maximum is shown to be optimal for all values of k and n.

412 citations

Proceedings ArticleDOI
05 May 1982
TL;DR: It is shown that log log n - log log r is asymptotically optimal for rn processors to merge two sorted lists of n elements and is able to achieve such an efficient sort via Valiant's parallel merging algorithm.
Abstract: A variety of models have been proposed for the study of synchronous parallel computation. We review these models and study further some prototype problems. We distinguish two classes of models, fixed connection networks and models based on a shared memory. Routing is the prototype problem for the networks. In particular, routing provides the basis for simulating the more powerful shared memory models. We show that a simple but important class of deterministic strategies (oblivious routing) is necessarily inefficient with respect to worst case analysis. Routing can be viewed as a special case of sorting and the existence of a deterministic O(logn) routing or sorting algorithm for an n processor fixed connection network remains open. However, if we consider the more powerful class of shared memory models, we are -&-ldquo;almost-&-rdquo; able to achieve such an efficient sort via Valiant's parallel merging algorithm. Within a spectrum of models, we show that log log n - log log r is asymptotically optimal for rn processors to merge two sorted lists of n elements.

231 citations