An introduction to parallel algorithms

Home
/
Papers
/
An introduction to parallel algorithms

Book•

An introduction to parallel algorithms

Joseph JaJa¹•Institutions (1)

01 Oct 1992-

TL;DR: This book provides an introduction to the design and analysis of parallel algorithms, with the emphasis on the application of the PRAM model of parallel computation, with all its variants, to algorithm analysis.

read less

Abstract: Written by an authority in the field, this book provides an introduction to the design and analysis of parallel algorithms. The emphasis is on the application of the PRAM (parallel random access machine) model of parallel computation, with all its variants, to algorithm analysis. Special attention is given to the selection of relevant data structures and to algorithm design principles that have proved to be useful. Features *Uses PRAM (parallel random access machine) as the model for parallel computation. *Covers all essential classes of parallel algorithms. *Rich exercise sets. *Written by a highly respected author within the field. 0201548569B04062001

...read moreread less

Content maybe subject to copyright Report

Citations

PDF

Open Access

More filters

Book Chapter•DOI•

Tree-Like Parallelization of Reduct and Construct Computation

[...]

Robert Susmaga¹•Institutions (1)

Poznań University of Technology¹

01 Jun 2004-Lecture Notes in Computer Science

TL;DR: The paper presents a so-called constrained tree-like model of parallelization of this task and illustrates practical behaviour of this algorithm in a computational experiment.

...read moreread less

Abstract: The paper addresses the problem of parallel computing in reduct/construct generation. The reducts are subsets of attributes that may be successfully applied in information/decision table analysis. Constructs, defined in a similar way, represent a notion that is a kind of generalization of the reduct. They ensure both discernibility between pairs of objects belonging to different classes (in which they follow the reducts) as well as similarity between pairs of objects belonging to the same class (which is not the case with reducts). Unfortunately, exhaustive sets of minimal constructs, similarly to sets of minimal reducts, are NP-hard to generate. To speed up the computations, decomposing the original task into multiple subtasks and executing these in parallel is employed. The paper presents a so-called constrained tree-like model of parallelization of this task and illustrates practical behaviour of this algorithm in a computational experiment.

...read moreread less

13 citations

Book Chapter•DOI•

The Bulk-Synchronous Parallel Random Access Machine

[...]

Alexandre Tiskin¹•Institutions (1)

University of Oxford¹

26 Aug 1996

TL;DR: A new model of bulk-synchronous parallel computation, called BSPRAM, is presented, which is a variant of BSP based on a mixture of shared and distributed memory, which are equivalent for some important classes of algorithms.

...read moreread less

Abstract: The model of bulk-synchronous parallel (BSP) computation is intended to provide a simple and realistic framework for general-purpose parallel computing. Originally, BSP was defined as a distributed memory model. In this paper we present a new model, called BSPRAM, which is a variant of BSP based on a mixture of shared and distributed memory. The two models are equivalent for some important classes of algorithms. We identify two such classes: oblivious and coarse-block algorithms. Finally, we present BSPRAM algorithms for dense matrix multiplication and Fast Fourier Transform.

...read moreread less

13 citations

Journal Article•DOI•

Random Data Accesses on a Coarse-Grained Parallel Machine I. One-to-One Mappings

[...]

Ravi V. Shankar¹, Sanjay Ranka²•Institutions (2)

Syracuse University¹, University of Florida²

10 Jul 1997-Journal of Parallel and Distributed Computing

TL;DR: This paper describes deterministic communication-efficient algorithms for performing dynamic permutations on a coarse-grained parallel machine and shows that the general permutation operation can be completed in C?n/p(+ lower order terms) time and is optimal and scalable provided n?p3+p2?/?

...read moreread less

13 citations

Journal Article•DOI•

Optimal parallel algorithms for rectilinear link-distance problems

[...]

Andrzej Lingas¹, Anil Maheshwari², Jörg-Rüdiger Sack³•Institutions (3)

Lund University¹, Tata Institute of Fundamental Research², Carleton University³

01 Sep 1995-Algorithmica

TL;DR: Techniques for solving link-distance problems in parallel which are expected to find applications in the design of other parallel computational geometry algorithms are developed.

...read moreread less

Abstract: We provide optimal parallel solutions to several link-distance problems set in trapezoided rectilinear polygons. All our main parallel algorithms are deterministic and designed to run on the exclusive read exclusive write parallel random access machine (EREW PRAM). LetP be a trapezoided rectilinear simple polygon withn vertices. InO(logn) time usingO(n/logn) processors we can optimally compute:1.Minimum rectilinear link paths, or shortest paths in theL1 metric from any point inP to all vertices ofP.2.Minimum rectilinear link paths from any segment insideP to all vertices ofP.3.The rectilinear window (histogram) partition ofP.4.Both covering radii and vertex intervals for any diagonal ofP.5.A data structure to support rectilinear link-distance queries between any two points inP (queries can be answered optimally inO(logn) time by uniprocessor). Our solution to 5 is based on a new linear-time sequential algorithm for this problem which is also provided here. This improves on the previously best-known sequential algorithm for this problem which usedO(n logn) time and space.5 We develop techniques for solving link-distance problems in parallel which are expected to find applications in the design of other parallel computational geometry algorithms. We employ these parallel techniques, for example, to compute (on a CREW PRAM) optimally the link diameter, the link center, and the central diagonal of a rectilinear polygon.

...read moreread less

13 citations

Cites methods from "An introduction to parallel algorit..."

...6. Using the pointer jumping technique [ 16 ], for each vertex w assign a pointer to a vertex v for which e(v, d~) is known and d, 16 ]....
[...]
...Since the dual trees are not constant degree trees, we need to convert them to binary trees in order to avoid concurrent writes. This can be done by the algorithm in [ 16 ]....
[...]
...... we inductively define f(u) as the most vertically remote maximal diagonal among f*(u) and the f(c)'s for children c of u. Note that this reduces the computation of f(u) to finding a minimum of single values given by children and a single precomputed value at the node u. Therefore, we can use the tree contraction technique here and compute the required information by performing the work in logarithmic time using O(n/log n) processors ......
[...]
...For the other tools such as parallel prefix, list ranking, and doubling, see [ 16 ] and [17]....
[...]

Proceedings Article•DOI•

O(log2 n) time efficient parallel factorization of dense, sparse separable, and banded matrices

[...]

John H. Reif¹•Institutions (1)

Carnegie Mellon University¹

01 Aug 1994

TL;DR: New parallel methods for various exact factorizations of several classes of matrices are given, match the best known work bounds of parallel algorithms with polylog time bounds, and are within a log factor of the work bounds for the bestknown sequential algorithms for the same problems.

...read moreread less

Abstract: Known polylog parallel algorithms for the solution of linear systems and related problems require computation of the characteristic polynomial or related forms, which are known to be highly unstable in practice. However, matrix factorizations of various types, bypassing computation of the characteristic polynomial, are used extensively in sequential numerical computations and are essential in many applications. This paper gives new parallel methods for various exact factorizations of several classes of matrices. We assume the input matrices are n × n with either integer entries of size ≤ 2no(1). We make no other assumption on the input. We assume the arithmetic PRAM model of parallel computation. Our main result is a reduction of the known parallel time bounds for these factorizations from O(log3n) to O(log2n). Our results are work efficient; we match the best known work bounds of parallel algorithms with polylog time bounds, and are within a log n factor of the work bounds for the best known sequential algorithms for the same problems. The exact factorizations we compute for symmetric positive definite matrices include: 1. recursive factorization sequences and trees, 2. LU factorizations, 3. QR factorizations, and 4. reduction to upper Hessenberg form. The classes of matrices for which we can efficiently compute these factorizations include:1. dense matrices, in time O(log2n) with processor bound P(n) (the number of processors needed to multiply two n × n matrices in O(log n time), 2. block diagonal matrices, in time O(log2b with P(b)n/b processors, 3. sparse matrices which are s(n)-separable (recursive factorizations only), in time O(log2n) with P(s(n)) processors where s(n) is of the form n γ for 0 4. banded matrices, in parallel time O((logn) log b) with P(b)n/b processors. Our factorizations also provide us similarly efficient algorithms for exact computation (given arbitrary rational matrices that need not be symmetric positive definite) of the following: 1. solution of the corresponding linear systems, 2. the determinant, 3. the inverse. Thus our results provide the first known efficient parallel algorithms for exact solution of these matrix problems, that avoids computation of the characteristic polynomial or related forms. Instead we use a construction which modifies the input matrix, which may initially have arbitrary condition, so as to have condition nearly 1, and then applies a multilevel, pipelined Newton iteration, followed by a similar multilevel, pipelined Hensel Lifting.

...read moreread less

13 citations

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
…
103
104
105
106
107
108
109
…
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse

References

PDF

Open Access

More filters

Book•

Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes

[...]

F. Thomson Leighton

01 Sep 1991

TL;DR: This chapter discusses sorting on a Linear Array with a Systolic and Semisystolic Model of Computation, which automates the very labor-intensive and therefore time-heavy and expensive process of manually sorting arrays.

...read moreread less

Abstract: Preface Acknowledgments Notation 1 Arrays and Trees 1.1 Elementary Sorting and Counting 1.1.1 Sorting on a Linear Array Assessing the Performance of the Algorithm Sorting N Numbers with Fewer Than N Processors 1.1.2 Sorting in the Bit Model 1.1.3 Lower Bounds 1.1.4 A Counterexample-Counting 1.1.5 Properties of the Fixed-Connection Network Model 1.2 Integer Arithmetic 1.2.1 Carry-Lookahead Addition 1.2.2 Prefix Computations-Segmented Prefix Computations 1.2.3 Carry-Save Addition 1.2.4 Multiplication and Convolution 1.2.5 Division and Newton Iteration 1.3 Matrix Algorithms 1.3.1 Elementary Matrix Products 1.3.2 Algorithms for Triangular Matrices 1.3.3 Algorithms for Tridiagonal Matrices -Odd-Even Reduction -Parallel Prefix Algorithms 1.3.4 Gaussian Elimination 1.3.5 Iterative Methods -Jacobi Relaxation -Gauss-Seidel Relaxation Finite Difference Methods -Multigrid Methods 1.4 Retiming and Systolic Conversion 1.4.1 A Motivating Example-Palindrome Recognition 1.4.2 The Systolic and Semisystolic Model of Computation 1.4.3 Retiming Semisystolic Networks 1.4.4 Conversion of a Semisystolic Network into a Systolic Network 1.4.5 The Special Case of Broadcasting 1.4.6 Retiming the Host 1.4.7 Design by Systolic Conversion-A Summary 1.5 Graph Algorithms 1.5.1 Transitive Closure 1.5.2 Connected Components 1.5.3 Shortest Paths 1.5.4 Breadth-First Spanning Trees 1.5.5 Minimum Weight Spanning Trees 1.6 Sorting Revisited 1.6.1 Odd-Even Transposition Sort on a Linear Array 1.6.2 A Simple Root-N(log N + 1)-Step Sorting Algorithm 1.6.3 A (3 Root- N + o(Root-N))-Step Sorting Algorithm 1.6.4 A Matching Lower Bound 1.7 Packet Routing 1.7.1 Greedy Algorithms 1.7.2 Average-Case Analysis of Greedy Algorithms -Routing N Packets to Random Destinations -Analysis of Dynamic Routing Problems 1.7.3 Randomized Routing Algorithms 1.7.4 Deterministic Algorithms with Small Queues 1.7.5 An Off-line Algorithm 1.7.6 Other Routing Models and Algorithms 1.8 Image Analysis and Computational Geometry 1.8.1 Component-Labelling Algorithms -Levialdi's Algorithm -An O (Root-N)-Step Recursive Algorithm 1.8.2 Computing Hough Transforms 1.8.3 Nearest-Neighbor Algorithms 1.8.4 Finding Convex Hulls 1.9 Higher-Dimensional Arrays 1.9.1 Definitions and Properties 1.9.2 Matrix Multiplication 1.9.3 Sorting 1.9.4 Packet Routing 1.9.5 Simulating High-Dimensional Arrays on Low-Dimensional Arrays 1.10 problems 1.11 Bibliographic Notes 2 Meshes of Trees 2.1 The Two-Dimensional Mesh of Trees 2.1.1 Definition and Properties 2.1.2 Recursive Decomposition 2.1.3 Derivation from KN,N 2.1.4 Variations 2.1.5 Comparison With the Pyramid and Multigrid 2.2 Elementary O(log N)-Step Algorithms 2.2.1 Routing 2.2.2 Sorting 2.2.3 Matrix-Vector Multiplication 2.2.4 Jacobi Relaxation 2.2.5 Pivoting 2.2.6 Convolution 2.2.7 Convex Hull 2.3 Integer Arithmetic 2.3.1 Multiplication 2.3.2 Division and Chinese Remaindering 2.3.3 Related Problems -Iterated Products -Rooting Finding 2.4 Matrix Algorithms 2.4.1 The Three-Dimensional Mesh of Trees 2.4.2 Matrix Multiplication 2.4.3 Inverting Lower Triangular Matrices 2.4.4 Inverting Arbitrary Matrices -Csanky's Algorithm -Inversion by Newton Iteration 2.4.5 Related Problems 2.5 Graph Algorithms 2.5.1 Minimum-Weight Spanning Trees 2.5.2 Connected Components 2.5.3 Transitive Closure 2.5.4 Shortest Paths 2.5.5 Matching Problems 2.6 Fast Evaluation of Straight-Line Code 2.6.1 Addition and Multiplication Over a Semiring 2.6.2 Extension to Codes with Subtraction and Division 2.6.3 Applications 2.7 Higher-Dimensional meshes of Trees 2.7.1 Definitions and Properties 2.7.2 The Shuffle-Tree Graph 2.8 Problems 2.9 Bibliographic Notes 3 Hypercubes and Related Networks 3.1 The Hypercube 3.1.1 Definitions and Properties 3.1.2 Containment of Arrays -Higher-Dimensional Arrays -Non-Power-of-2 Arrays 3.1.3 Containment of Complete Binary Trees 3.1.4 Embeddings of Arbitrary Binary Trees -Embeddings with Dilation 1 and Load O(M over N + log N) -Embeddings with Dilation O(1) and Load O (M over N + 1) -A Review of One-Error-Correcting Codes -Embedding Plog N into Hlog N 3.1.5 Containment of Meshes of Trees 3.1.6 Other Containment Results 3.2 The Butterfly, Cube-Connected-Cycles , and Benes Network 3.2.1 Definitions and Properties 3.2.2 Simulation of Arbitrary Networks 3.2.3 Simulation of Normal Hypercube Algorithms 3.2.4 Some Containment and Simulation Results 3.3 The Shuffle-Exchange and de Bruijn Graphs 3.3.1 Definitions and Properties 3.3.2 The Diaconis Card Tricks 3.3.3 Simulation of Normal Hypercube Algorithms 3.3.4 Similarities with the Butterfly 3.3.5 Some Containment and Simulation Results 3.4 Packet-Routing Algorithms 3.4.1 Definitions and Routing Models 3.4.2 Greedy Routing Algorithms and Worst-Case Problems 3.4.3 Packing, Spreading, and Monotone Routing Problems -Reducing a Many-to-Many Routing Problem to a Many-to-One Routing Problem -Reducing a Routing Problem to a Sorting Problem 3.4.4 The Average-Case Behavior of the Greedy Algorithm -Bounds on Congestion -Bounds on Running Time -Analyzing Non-Predictive Contention-Resolution Protocols 3.4.5 Converting Worst-Case Routing Problems into Average-Case Routing Problems -Hashing -Randomized Routing 3.4.6 Bounding Queue Sizes -Routing on Arbitrary Levelled Networks 3.4.7 Routing with Combining 3.4.8 The Information Dispersal Approach to Routing -Using Information Dispersal to Attain Fault-Tolerance -Finite Fields and Coding Theory 3.4.9 Circuit-Switching Algorithms 3.5 Sorting 3.5.1 Odd-Even Merge Sort -Constructing a Sorting Circuit with Depth log N(log N +1)/2 3.5.2 Sorting Small Sets 3.5.3 A Deterministic O(log N log log N)-Step Sorting Algorithm 3.5.4 Randomized O(log N)-Step Sorting Algorithms -A Circuit with Depth 7.45 log N that Usually Sorts 3.6 Simulating a Parallel Random Access Machine 3.6.1 PRAM Models and Shared Memories 3.6.2 Randomized Simulations Based on Hashing 3.6.3 Deterministic Simulations using Replicated Data 3.6.4 Using Information Dispersal to Improve Performance 3.7 The Fast Fourier Transform 3.7.1 The Algorithm 3.7.2 Implementation on the Butterfly and Shuffle-Exchange Graph 3.7.3 Application to Convolution and Polynomial Arithmetic 3.7.4 Application to Integer Multiplication 3.8 Other Hypercubic Networks 3.8.1 Butterflylike Networks -The Omega Network -The Flip Network -The Baseline and Reverse Baseline Networks -Banyan and Delta Networks -k-ary Butterflies 3.8.2 De Bruijn-Type Networks -The k-ary de Bruijn Graph -The Generalized Shuffle-Exchange Graph 3.9 Problems 3.10 Bibliographic Notes Bibliography Index Lemmas, Theorems, and Corollaries Author Index Subject Index

...read moreread less

2,895 citations

"An introduction to parallel algorit..." refers background in this paper

...Multiprocessorbased computers have been around for decades and various types of computer architectures [2] have been implemented in hardware throughout the years with different types of advantages/performance gains depending on the application....
[...]
...Every location in the array represents a node of the tree: T [1] is the root, with children at T [2] and T [3]....
[...]
...The text by [2] is a good start as it contains a comprehensive description of algorithms and different architecture topologies for the network model (tree, hypercube, mesh, and butterfly)....
[...]

Book•

Computer Architecture and Parallel Processing

[...]

Kai Hwang, Faye A. Briggs

01 Jan 1984

TL;DR: The authors have divided the use of computers into the following four levels of sophistication: data processing, information processing, knowledge processing, and intelligence processing.

...read moreread less

Abstract: The book is intended as a text to support two semesters of courses in computer architecture at the college senior and graduate levels. There are excellent problems for students at the end of each chapter. The authors have divided the use of computers into the following four levels of sophistication: data processing, information processing, knowledge processing, and intelligence processing.

...read moreread less

1,410 citations

"An introduction to parallel algorit..." refers background in this paper

...Parallel architectures have been described in several books (see, for example, [18, 29])....
[...]

Journal Article•DOI•

Data parallel algorithms

[...]

W. Daniel Hillis, Guy L. Steele

01 Dec 1986-Communications of The ACM

TL;DR: The success of data parallel algorithms—even on problems that at first glance seem inherently serial—suggests that this style of programming has much wider applicability than was previously thought.

...read moreread less

Abstract: Parallel computers with tens of thousands of processors are typically programmed in a data parallel style, as opposed to the control parallel style used in multiprocessing. The success of data parallel algorithms—even on problems that at first glance seem inherently serial—suggests that this style of programming has much wider applicability than was previously thought.

...read moreread less

1,000 citations

"An introduction to parallel algorit..." refers background in this paper

...Recent work on the mapping of PRAM algorithms on bounded-degree networks is described in [3,13,14, 20, 25], Our presentation on the communication complexity of the matrix-multiplication problem in the sharedmemory model is taken from [1], Data-parallel algorithms are described in [15]....
[...]

Proceedings Article•DOI•

Parallelism in random access machines

[...]

Steven Fortune, James C. Wyllie

01 May 1978

TL;DR: A model of computation based on random access machines operating in parallel and sharing a common memory is presented and can accept in polynomial time exactly the sets accepted by nondeterministic exponential time bounded Turing machines.

...read moreread less

Abstract: A model of computation based on random access machines operating in parallel and sharing a common memory is presented. The computational power of this model is related to that of traditional models. In particular, deterministic parallel RAM's can accept in polynomial time exactly the sets accepted by polynomial tape bounded Turing machines; nondeterministic RAM's can accept in polynomial time exactly the sets accepted by nondeterministic exponential time bounded Turing machines. Similar results hold for other classes. The effect of limiting the size of the common memory is also considered.

...read moreread less

951 citations

"An introduction to parallel algorit..." refers background in this paper

...Rigorous descriptions of shared-memory models were introduced later in [11,12]....
[...]

Journal Article•DOI•

The Parallel Evaluation of General Arithmetic Expressions

[...]

Richard P. Brent¹•Institutions (1)

Australian National University¹

01 Apr 1974-Journal of the ACM

TL;DR: It is shown that arithmetic expressions with n ≥ 1 variables and constants; operations of addition, multiplication, and division; and any depth of parenthesis nesting can be evaluated in time 4 log 2 + 10(n - 1) using processors which can independently perform arithmetic operations in unit time.

...read moreread less

Abstract: It is shown that arithmetic expressions with n ≥ 1 variables and constants; operations of addition, multiplication, and division; and any depth of parenthesis nesting can be evaluated in time 4 log2n + 10(n - 1)/p using p ≥ 1 processors which can independently perform arithmetic operations in unit time. This bound is within a constant factor of the best possible. A sharper result is given for expressions without the division operation, and the question of numerical stability is discussed.

...read moreread less

864 citations

"An introduction to parallel algorit..." refers methods in this paper

...The WT scheduling principle is derived from a theorem in [7], In the literature, this principle is commonly referred to as Brent's theorem or Brent's scheduling principle....
[...]