Parallel merge sort

Book•

09 Sep 2011

TL;DR: This paper gives a parallel implementation of merge sort on a CREW PRAM that uses n processors and O(log n) time, with a small constant in the running time.

Abstract: We give a parallel implementation of merge sort on a CREW PRAM that uses n processors and O(log n) time; the constant in the running time is small. We also give a more complex version of the algorithm for the EREW PRAM; it also uses n processors and O(log n) time. The constant in the running time is still moderate, though not as small.
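The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea behind merging with one processor per element: each element finds its output slot from its own index plus its rank in the other list. The names `rank_merge` and `merge_sort` are hypothetical, the loops are serial stand-ins for parallel steps, and ranking by binary search costs O(log n) per merge level, which is slower than the O(log n) total the paper claims for its own, more elaborate scheme.

```python
from bisect import bisect_left, bisect_right


def rank_merge(a, b):
    """Merge sorted lists a and b by computing each element's final slot."""
    out = [None] * (len(a) + len(b))
    # Conceptually one processor per element: every iteration is independent.
    for i, x in enumerate(a):
        out[i + bisect_left(b, x)] = x    # on ties, elements of a go first
    for j, y in enumerate(b):
        out[j + bisect_right(a, y)] = y
    return out


def merge_sort(xs):
    """Bottom-up merge sort: about log2(n) levels of independent merges."""
    runs = [[x] for x in xs]
    while len(runs) > 1:
        # All merges on one level are independent of each other.
        merged = [rank_merge(runs[k], runs[k + 1])
                  for k in range(0, len(runs) - 1, 2)]
        if len(runs) % 2:
            merged.append(runs[-1])       # odd run out is carried up unchanged
        runs = merged
    return runs[0] if runs else []
```

The binary searches read shared data concurrently but every output slot is written exactly once, which is the access pattern a CREW model permits.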

Citations

Proceedings Article•DOI•

Designing efficient sorting algorithms for manycore GPUs

Nadathur Satish¹, Mark J. Harris², Michael Garland²•Institutions (2)

University of California, Berkeley¹, Nvidia²

23 May 2009

TL;DR: Describes the design of high-performance parallel radix sort and merge sort routines for manycore GPUs that take advantage of the full programmability offered by CUDA; the radix sort is the fastest GPU sort and the merge sort is the fastest comparison-based sort reported in the literature.

Abstract: We describe the design of high-performance parallel radix sort and merge sort routines for manycore GPUs, taking advantage of the full programmability offered by CUDA. Our radix sort is the fastest GPU sort and our merge sort is the fastest comparison-based sort reported in the literature. Our radix sort is up to 4 times faster than the graphics-based GPUSort and greater than 2 times faster than other CUDA-based radix sorts. It is also 23% faster, on average, than even a very carefully optimized multicore CPU sorting routine. To achieve this performance, we carefully design our algorithms to expose substantial fine-grained parallelism and decompose the computation into independent tasks that perform minimal global communication. We exploit the high-speed on-chip shared memory provided by NVIDIA's GPU architecture and efficient data-parallel primitives, particularly parallel scan. While targeted at GPUs, these algorithms should also be well-suited for other manycore processors.
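The abstract names parallel scan as a key primitive without describing it. As a rough illustration only, the sketch below simulates the common up-sweep/down-sweep formulation of an exclusive prefix sum in serial Python. The function name `exclusive_scan` is hypothetical, and this is not the paper's CUDA code; each inner loop corresponds to one step in which all iterations could run in parallel.

```python
def exclusive_scan(values):
    """Exclusive prefix sum via up-sweep/down-sweep on a zero-padded copy."""
    n = len(values)
    size = 1
    while size < n:
        size *= 2                       # pad to a power of two
    tree = list(values) + [0] * (size - n)

    # Up-sweep: build partial sums; inner-loop iterations are independent.
    stride = 1
    while stride < size:
        for i in range(2 * stride - 1, size, 2 * stride):
            tree[i] += tree[i - stride]
        stride *= 2

    # Down-sweep: push prefixes back down the implicit tree.
    tree[size - 1] = 0
    stride = size // 2
    while stride >= 1:
        for i in range(2 * stride - 1, size, 2 * stride):
            left = tree[i - stride]
            tree[i - stride] = tree[i]
            tree[i] += left
        stride //= 2
    return tree[:n]
```

For example, `exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3])` returns `[0, 3, 4, 11, 11, 15, 16, 22]`: each output is the sum of all inputs strictly before that position.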

684 citations

Book•

Vector models for data-parallel computing

Guy E. Blelloch¹•Institutions (1)

Carnegie Mellon University¹

01 Jan 1990

TL;DR: A model of parallelism that extends and formalizes the Data-Parallel model on which the Connection Machine and other supercomputers are based is described, and it is argued that data-parallel models are not only practical and can be applied to a surprisingly wide variety of problems, they are also well suited for very-high-level languages and lead to a concise and clear description of algorithms and their complexity.

Abstract: "Vector Models for Data-Parallel Computing" describes a model of parallelism that extends and formalizes the Data-Parallel model on which the Connection Machine and other supercomputers are based. It presents many algorithms based on the model, ranging from graph algorithms to numerical algorithms, and argues that data-parallel models are not only practical and can be applied to a surprisingly wide variety of problems, they are also well suited for very-high-level languages and lead to a concise and clear description of algorithms and their complexity. Many of the author's ideas have been incorporated into the instruction set and into algorithms currently running on the Connection Machine. The book includes the definition of a parallel vector machine; an extensive description of the uses of the scan (also called parallel-prefix) operations; the introduction of segmented vector operations; parallel data structures for trees, graphs, and grids; many parallel computational-geometry, graph, numerical and sorting algorithms; techniques for compiling nested parallelism; a compiler for Paralation Lisp; and details on the implementation of the scan operations. Guy E. Blelloch is an Assistant Professor of Computer Science and a Principal Investigator with the Super Compiler and Advanced Language project at Carnegie Mellon University. Contents: Introduction. Parallel Vector Models. The Scan Primitives. Computational-Geometry Algorithms. Graph Algorithms. Numerical Algorithms. Languages and Compilers. Collection-Oriented Languages. Flattening Nested Parallelism. A Compiler for Paralation Lisp. Paralation-Lisp Code. The Scan Vector Model. Data Structures. Implementing Parallel Vector Models. Implementing the Scan Operations. Conclusions. Glossary.

571 citations

Journal Article•DOI•

Scans as primitive parallel operations

Guy E. Blelloch¹•Institutions (1)

Carnegie Mellon University¹

01 Nov 1989-IEEE Transactions on Computers

TL;DR: A study of the effects of adding two scan primitives as unit-time primitives to PRAM (parallel random access machine) models is presented, and it is shown that the primitives improve the asymptotic running time of many algorithms by an O(log n) factor and greatly simplify the description of many algorithms.

Abstract: A study of the effects of adding two scan primitives as unit-time primitives to PRAM (parallel random access machine) models is presented. It is shown that the primitives improve the asymptotic running time of many algorithms by an O(log n) factor, greatly simplifying the description of many algorithms, and are significantly easier to implement than memory references. It is argued that the algorithm designer should feel free to use these operations as if they were as cheap as a memory reference. The author describes five algorithms that clearly illustrate how the scan primitives can be used in algorithm design: a radix-sort algorithm, a quicksort algorithm, a minimum-spanning-tree algorithm, a line-drawing algorithm, and a merging algorithm. These all run on an EREW (exclusive read, exclusive write) PRAM with the addition of two scan primitives and are either simpler or more efficient than their pure PRAM counterparts. The scan primitives have been implemented in microcode on the Connection Machine system and are available in PARIS (the parallel instruction set of the machine).
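To make the radix-sort use of scans concrete, here is a minimal sketch, assuming non-negative integer keys below `2**bits`. It is written in the spirit of a scan-based stable split and is not taken from the paper: `exclusive_sum`, `split`, and `radix_sort` are hypothetical names, and `exclusive_sum` is a serial stand-in for the unit-time scan primitive.

```python
from itertools import accumulate


def exclusive_sum(flags):
    """Exclusive prefix sum; stands in for the unit-time scan primitive."""
    if not flags:
        return []
    return [0] + list(accumulate(flags))[:-1]


def split(keys, bit):
    """Stable partition: keys with `bit` clear first, then keys with it set."""
    zeros = [1 - ((k >> bit) & 1) for k in keys]
    ones = [1 - z for z in zeros]
    zero_pos = exclusive_sum(zeros)            # slot among the 0-bit keys
    total_zeros = sum(zeros)
    one_pos = [total_zeros + p for p in exclusive_sum(ones)]
    out = [None] * len(keys)
    for i, k in enumerate(keys):               # independent per element
        out[zero_pos[i] if zeros[i] else one_pos[i]] = k
    return out


def radix_sort(keys, bits):
    """Least-significant-bit-first radix sort: one stable split per bit."""
    for bit in range(bits):
        keys = split(keys, bit)
    return keys
```

Because every element writes to a distinct slot computed from the scans, the scatter step needs no concurrent writes, which fits the exclusive-write setting the abstract mentions.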

543 citations

Book Chapter•DOI•

Simple linear work suffix array construction

Juha Kärkkäinen¹, Peter Sanders¹•Institutions (1)

Max Planck Society¹

30 Jun 2003

TL;DR: The skew algorithm for suffix array construction over integer alphabets that can be implemented to run in linear time using integer sorting as its only nontrivial subroutine is introduced.

Abstract: A suffix array represents the suffixes of a string in sorted order. Being a simpler and more compact alternative to suffix trees, it is an important tool for full text indexing and other string processing tasks. We introduce the skew algorithm for suffix array construction over integer alphabets that can be implemented to run in linear time using integer sorting as its only nontrivial subroutine: 1. recursively sort suffixes beginning at positions i mod 3 ≠ 0. 2. sort the remaining suffixes using the information obtained in step one. 3. merge the two sorted sequences obtained in steps one and two. The algorithm is much simpler than previous linear time algorithms that are all based on the more complicated suffix tree data structure. Since sorting is a well studied problem, we obtain optimal algorithms for several other models of computation, e.g. external memory with parallel disks, cache oblivious, and parallel. The adaptations for BSP and EREW-PRAM are asymptotically faster than the best previously known algorithms.
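As a small illustration of the terms used here, the sketch below builds a suffix array by plain sorting and shows which positions step 1 of the skew algorithm recurses on. This is deliberately not the skew algorithm: `naive_suffix_array` takes far more than linear time, and both function names are hypothetical.

```python
def naive_suffix_array(text):
    """Start positions of the suffixes of `text`, in sorted suffix order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def skew_sample_split(n):
    """Positions for step 1 (i mod 3 != 0) and for step 2 (i mod 3 == 0)."""
    sample = [i for i in range(n) if i % 3 != 0]
    rest = [i for i in range(n) if i % 3 == 0]
    return sample, rest
```

For "banana", `naive_suffix_array` returns `[5, 3, 1, 0, 4, 2]`, corresponding to the suffixes "a", "ana", "anana", "banana", "na", "nana". With n = 7, `skew_sample_split` puts positions 1, 2, 4, 5 in the recursive sample and 0, 3, 6 in the remainder, so the recursion handles two thirds of the suffixes.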

465 citations

Journal Article•DOI•

Linear work suffix array construction

Juha Kärkkäinen¹, Peter Sanders², Stefan Burkhardt³•Institutions (3)

University of Helsinki¹, Karlsruhe Institute of Technology², Google³

01 Nov 2006-Journal of the ACM

TL;DR: Presents DC3, a simple linear-time suffix array construction algorithm, and a generalized algorithm, DC, that allows a space-efficient implementation and supports a choice of space-time tradeoff; variants for the BSP and EREW-PRAM models are asymptotically faster than all previous suffix tree or array construction algorithms.

Abstract: Suffix trees and suffix arrays are widely used and largely interchangeable index structures on strings and sequences. Practitioners prefer suffix arrays due to their simplicity and space efficiency while theoreticians use suffix trees due to linear-time construction algorithms and more explicit structure. We narrow this gap between theory and practice with a simple linear-time construction algorithm for suffix arrays. The simplicity is demonstrated with a C++ implementation of 50 effective lines of code. The algorithm is called DC3, which stems from the central underlying concept of difference cover. This view leads to a generalized algorithm, DC, that allows a space-efficient implementation and, moreover, supports the choice of a space-time tradeoff. For any v ∈ [1,√n], it runs in O(vn) time using O(n/√v) space in addition to the input string and the suffix array. We also present variants of the algorithm for several parallel and hierarchical memory models of computation. The algorithms for BSP and EREW-PRAM models are asymptotically faster than all previous suffix tree or array construction algorithms.
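The abstract names the difference cover concept without defining it. Under the usual definition (an assumption here, not a quotation from the paper), a set D is a difference cover modulo v if every residue modulo v can be written as a difference of two elements of D. The hypothetical helper below checks that property by brute force.

```python
def is_difference_cover(cover, v):
    """True if every residue mod v is a difference of two elements of `cover`."""
    diffs = {(a - b) % v for a in cover for b in cover}
    return diffs == set(range(v))
```

With this definition, `is_difference_cover({1, 2}, 3)` is true, which lines up with the positions i mod 3 ≠ 0 that the skew algorithm sorts recursively, and `is_difference_cover({1, 2, 4}, 7)` is true as well, while `is_difference_cover({1, 2}, 4)` is false because no pair differs by 2.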

420 citations

References

Proceedings Article•DOI•

Sorting networks and their applications

Kenneth E. Batcher¹•Institutions (1)

Goodyear Aerospace¹

30 Apr 1968

TL;DR: To achieve high throughput rates today's computers perform several operations simultaneously; not only are I/O operations performed concurrently with computing, but also, in multiprocessors, several computing operations are done concurrently.

Abstract: To achieve high throughput rates today's computers perform several operations simultaneously. Not only are I/O operations performed concurrently with computing, but also, in multiprocessors, several computing operations are done concurrently. A major problem in the design of such a computing system is the connecting together of the various parts of the system (the I/O devices, memories, processing units, etc.) in such a way that all the required data transfers can be accommodated. One common scheme is a high-speed bus which is time-shared by the various parts; speed of available hardware limits this scheme. Another scheme is a cross-bar switch or matrix; limiting factors here are the amount of hardware (an m × n matrix requires m × n cross-points) and the fan-in and fan-out of the hardware.

2,553 citations

Journal Article•DOI•

Applying Parallel Computation Algorithms in the Design of Serial Algorithms

Nimrod Megiddo¹•Institutions (1)

Tel Aviv University¹

01 Oct 1983-Journal of the ACM

TL;DR: It is pointed out that analyses of parallelism in computational problems have practical implications even when multi-processor machines are not available, and a unified framework for cases like this is presented.

Abstract: The goal of this paper is to point out that analyses of parallelism in computational problems have practical implications even when multiprocessor machines are not available. This is true because, in many cases, a good parallel algorithm for one problem may turn out to be useful for designing an efficient serial algorithm for another problem. A unified framework for cases like this is presented. Particular cases, which are discussed in this paper, provide motivation for examining parallelism in sorting, selection, minimum-spanning-tree, shortest route, max-flow, and matrix multiplication problems, as well as in scheduling and locational problems.

696 citations

Journal Article•DOI•

Sorting in c log n parallel steps

Miklós Ajtai¹, János Komlós¹, Endre Szemerédi¹•Institutions (1)

Hungarian Academy of Sciences¹

01 Jan 1983-Combinatorica

TL;DR: A sorting network with cn log n comparisons is given; in the i-th step of the algorithm, the contents of registers R_{j(i)} and R_{k(i)}, where j(i) and k(i) are absolute constants, are compared and then exchanged or not according to the result of the comparison.

Abstract: We give a sorting network with cn log n comparisons. The algorithm can be performed in c log n parallel steps as well, where in a parallel step we compare n/2 disjoint pairs. In the i-th step of the algorithm we compare the contents of registers R_{j(i)} and R_{k(i)}, where j(i), k(i) are absolute constants, then change their contents or not according to the result of the comparison.
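The network from this paper is far too intricate to reproduce here, so the sketch below swaps in a much simpler one, odd-even transposition sort, purely to illustrate what a fixed, data-independent schedule of compare-exchange steps on disjoint register pairs looks like. It uses n parallel steps rather than c log n, and the names `apply_step` and `odd_even_transposition_sort` are hypothetical.

```python
def apply_step(registers, pairs):
    """One parallel step: compare-exchange disjoint register pairs (j, k), j < k."""
    for j, k in pairs:                  # pairs are disjoint, so order is irrelevant
        if registers[j] > registers[k]:
            registers[j], registers[k] = registers[k], registers[j]


def odd_even_transposition_sort(values):
    """Sort with a fixed schedule of n compare-exchange steps."""
    regs = list(values)
    n = len(regs)
    for step in range(n):
        start = step % 2                # alternate even and odd neighbour pairs
        apply_step(regs, [(j, j + 1) for j in range(start, n - 1, 2)])
    return regs
```

The pair list for each step depends only on the step number and n, never on the data, which is the defining property of a sorting network.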

497 citations

Journal Article•DOI•

Parallelism in Comparison Problems

Leslie G. Valiant

01 Sep 1975-SIAM Journal on Computing

TL;DR: The worst-case time complexity of algorithms for multiprocessor computers with binary comparisons as the basic operations is investigated and the algorithm for finding the maximum is shown to be optimal for all values of k and n.

Abstract: The worst-case time complexity of algorithms for multiprocessor computers with binary comparisons as the basic operations is investigated. It is shown that for the problems of finding the maximum, sorting, and merging a pair of sorted lists, if n, the size of the input set, is not less than k, the number of processors, speedups of at least $O(k/\log \log k)$ can be achieved with respect to comparison operations. The algorithm for finding the maximum is shown to be optimal for all values of k and n.

412 citations

Proceedings Article•DOI•

Routing, merging and sorting on parallel models of computation

Allan Borodin¹, John E. Hopcroft²•Institutions (2)

University of Toronto¹, Cornell University²

05 May 1982

TL;DR: It is shown that log log n - log log r is asymptotically optimal for rn processors to merge two sorted lists of n elements, and that shared memory models can almost achieve an O(log n) sort via Valiant's parallel merging algorithm.

Abstract: A variety of models have been proposed for the study of synchronous parallel computation. We review these models and study further some prototype problems. We distinguish two classes of models, fixed connection networks and models based on a shared memory. Routing is the prototype problem for the networks. In particular, routing provides the basis for simulating the more powerful shared memory models. We show that a simple but important class of deterministic strategies (oblivious routing) is necessarily inefficient with respect to worst case analysis. Routing can be viewed as a special case of sorting and the existence of a deterministic O(log n) routing or sorting algorithm for an n processor fixed connection network remains open. However, if we consider the more powerful class of shared memory models, we are "almost" able to achieve such an efficient sort via Valiant's parallel merging algorithm. Within a spectrum of models, we show that log log n - log log r is asymptotically optimal for rn processors to merge two sorted lists of n elements.

231 citations