An introduction to parallel algorithms

Home
/
Papers
/
An introduction to parallel algorithms

Book•

An introduction to parallel algorithms

Joseph JaJa¹•Institutions (1)

01 Oct 1992-

TL;DR: This book provides an introduction to the design and analysis of parallel algorithms, with the emphasis on the application of the PRAM model of parallel computation, with all its variants, to algorithm analysis.

read less

Abstract: Written by an authority in the field, this book provides an introduction to the design and analysis of parallel algorithms. The emphasis is on the application of the PRAM (parallel random access machine) model of parallel computation, with all its variants, to algorithm analysis. Special attention is given to the selection of relevant data structures and to algorithm design principles that have proved to be useful. Features *Uses PRAM (parallel random access machine) as the model for parallel computation. *Covers all essential classes of parallel algorithms. *Rich exercise sets. *Written by a highly respected author within the field. 0201548569B04062001

...read moreread less

Content maybe subject to copyright Report

Citations

PDF

Open Access

More filters

Models and Resource Metrics for Parallel and Distributed Computation

[...]

John H. Reif

01 Apr 1994

TL;DR: In this article, the authors present a framework of using resource metrics to characterize the various models of parallel computation, including the basic synchronous models and extensions of the basic models which more accurately reflect practical machines by incorporating notions of asynchrony, communication cost, and memory hierarchy.

...read moreread less

Abstract: This paper presents a framework of using resource metrics to characterize the various models of parallel computation. Our framework reflects the approach of recent models to abstract architectural details into several generic parameters, which we call resource metrics. We examine the different resource metrics chosen by different parallel models, categorizing the models into four classes: the basic synchronous models, and extensions of the basic models which more accurately reflect practical machines by incorporating notions of asynchrony, communication cost, and memory hierarchy. We then present a new parallel computation model, the LogP-HMM model, as an illustration of design principles based on the framework of resource metrics. The LogP-HMM model extends an existing parameterized network model (LogP) with a sequential hierarchical memory model (HMM) characterizing each processor. The result captures both network communication costs and the effects of multileveled memory such as local cache and I/O. Mo...

...read moreread less

23 citations

Proceedings Article•DOI•

The Parallel Persistent Memory Model

[...]

Guy E. Blelloch¹, Phillip B. Gibbons¹, Yan Gu¹, Charles McGuffey¹, Julian Shun² - Show less +1 more•Institutions (2)

Carnegie Mellon University¹, Massachusetts Institute of Technology²

11 Jul 2018

TL;DR: In this article, the authors consider a parallel computational model, the Parallel Persistent Memory model, comprised of P processors, each with a fast local ephemeral memory of limited size, and sharing a large persistent memory.

...read moreread less

Abstract: We consider a parallel computational model, the Parallel Persistent Memory model, comprised of P processors, each with a fast local ephemeral memory of limited size, and sharing a large persistent memory. The model allows for each processor to fault at any time (with bounded probability), and possibly restart. When a processor faults, all of its state and local ephemeral memory is lost, but the persistent memory remains. This model is motivated by upcoming non-volatile memories that are nearly as fast as existing random access memory, are accessible at the granularity of cache lines, and have the capability of surviving power outages. It is further motivated by the observation that in large parallel systems, failure of processors and their caches is not unusual. We present several results for the model, using an approach that breaks a computation into capsules, each of which can be safely run multiple times. For the single-processor version we describe how to simulate any program in the RAM, the external memory model, or the ideal-cache model with an expected constant factor overhead. For the multiprocessor version we describe how to efficiently implement a work-stealing scheduler within the model such that it handles both soft faults, with a processor restarting, and hard faults, with a processor permanently failing. For any multithreaded fork-join computation that is race free, write-after-read conflict free and has W work, D depth, and C maximum capsule work in the absence of faults, the scheduler guarantees a time bound on the model of $Oleft(\fracW P_A + \fracDP P_A leftlceillog_1/(C\f) W\right\rceil\right)$ in expectation, where P is the maximum number of processors, $P_A$ is the average number, and $\faultprob leq 1/(2C)$ is the probability a processor faults between successive persistent memory accesses. Within the model, and using the proposed methods, we develop efficient algorithms for parallel prefix sums, merging, sorting, and matrix multiply.

...read moreread less

23 citations

Journal Article•

Some perfect matchings and perfect half-integral matchings in NC ∗

[...]

Raghav Kulkarni¹, Meena Mahajan, Kasturi Varadarajan²•Institutions (2)

University of Chicago¹, University of Iowa²

01 Jan 2008-Chicago Journal of Theoretical Computer Science

TL;DR: It is shown that for any class of bipartite graphs which is closed under edge deletion and where the number of perfect matchings can be counted in NC, there is a deterministic NC algorithm for finding a perfect matching.

...read moreread less

Abstract: We show that for any class of bipartite graphs which is closed under edge deletion and where the number of perfect matchings can be counted in NC, there is a deterministic NC algorithm for finding a perfect matching. In particular, a perfect matching can be found in NC for planar bipartite graphs and K3,3-free bipartite graphs via this approach. A crucial ingredient is part of an interior-point algorithm due to Goldberg, Plotkin, Shmoys and Tardos. An easy observation allows this approach to handle regular bipartite graphs as well. We show, by a careful analysis of the polynomial time algorithm due to Galluccio and Loebl, that the number of perfect matchings in a graph of small (O(log n)) genus can be counted in NC. So perfect matchings in small genus bipartite graphs can also be found via this approach. We then present a different algorithm for finding a perfect matching in a planar bipartite graph. This algorithm is substantially different from the algorithm described above, and also from the algorithm of Miller and Naor, which predates the approach of Goldberg et al. and tackles the same problem. Our new algorithm extends to small genus bipartite graphs, but not to K3,3-free bipartite graphs. We next show that a non-trivial extension of this algorithm allows us to compute a vertex of the fractional perfect matching polytope (such a vertex is either a perfect matching or a half-integral matching) in NC, provided the graph is planar or small genus but not necessarily bipartite, and has a perfect matching to begin with. This extension rekindles the hope for an NC-algorithm to find a perfect matching in a non-bipartite planar graph. � Most results in this paper were originally announced in papers in Proc. 32nd ACM Symposium on Theory of

...read moreread less

23 citations

Cites methods from "An introduction to parallel algorit..."

...Most algorithms in this paper are presented using the concurrent-read concurrent-write (CRCW) parallel random-access machine PRAM model, see for instance [ 13 ]....
[...]

Book Chapter•DOI•

Work-Efficient Parallel Union-Find with Applications to Incremental Graph Connectivity

[...]

Natcha Simsiri¹, Kanat Tangwongsan², Srikanta Tirthapura³, Kun-Lung Wu⁴•Institutions (4)

University of Massachusetts Amherst¹, Mahidol University International College², Iowa State University³, IBM⁴

24 Aug 2016

TL;DR: This paper presents the first shared-memory parallel algorithm for IGC and equivalently, Union-Find that is provably work-efficient i.e., does no more work than the sequential optimal and has polylogarithmic parallel depth.

...read moreread less

Abstract: On an undirected graph, how can one quickly answer whether two vertices are connected while allowing more edges to be added incrementally? This is the well-studied incremental graph connectivity IGC problem, a fundamental problem that can be efficiently solved using solutions to the classical union-find problem. Motivated by the need to handle larger and rapidly-changing graphs, this paper presents the first shared-memory parallel algorithm for IGC and equivalently, Union-Find that is provably work-efficient i.e., does no more work than the sequential optimal and has polylogarithmic parallel depth. It performs path compression in parallel without a lock or speculative execution. We also present a simpler algorithm with slightly worse theoretical properties, but which is easier to implement, and has good practical performance.

...read moreread less

22 citations

Proceedings Article•DOI•

Spartan: A Framework For Sparse Robust Addressable Networks

[...]

John Augustine¹, Sumathi Sivasubramaniam²•Institutions (2)

Indian Institutes of Technology¹, Indian Institute of Technology Madras²

01 May 2018

TL;DR: This work presents an overlay framework called Sparse Robust Addressable Network (Spartan) that can tolerate heavy adversarial churn, and shows that Spartan can be built efficiently in a fully distributed manner within O(log n) rounds.

...read moreread less

Abstract: A Peer-to-Peer (P2P) network is a dynamic collection of nodes that connect with each other via virtual overlay links built upon an underlying network (usually, the Internet). Typical P2P networks are highly dynamic and can experience very heavy churn, i.e., a large number of nodes join/leave the network every time step. We present an overlay framework called Sparse Robust Addressable Network (Spartan) that can tolerate heavy adversarial churn. We show that Spartan can be built efficiently in a fully distributed manner within O(log n) rounds. Furthermore, the Spartan overlay structure can be maintained, again, in a fully distributed manner despite adversarially controlled churn (i.e., nodes joining and leaving) and significant variation in the number of nodes. When the number of nodes in the network lies in [n, fn] for any fixed f > 1 the adversary can remove up to ?n nodes and add up to ? n nodes (for some small but fixed ? > 0) within any period of P rounds for some P ? O(log log n). Moreover, the adversary can add or remove nodes from the network at will and without any forewarning. Despite such uncertainty in the network, Spartan maintains ?(n/log n) committees that are stable and addressable collections of ?(log n) nodes each. Any node that enters the network will be able to gain membership in one of these committees within O(1) rounds. The committees are also capable of performing sustained computation and passing messages between each other. Thus, any protocol designed for static networks can be simulated on Spartan with minimal overhead. This makes Spartan an ideal platform for developing applications. All our results hold with high probability.

...read moreread less

22 citations

Cites methods from "An introduction to parallel algorit..."

...Normally, the random walks would require O(log n) steps to reach a random node, but we adapt the well-known pointer doubling technique [19, 33] to our context to achieve an exponential speedup of O(log logn) steps....
[...]

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
…
64
65
66
67
68
69
70
…
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse

References

PDF

Open Access

More filters

Book•

Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes

[...]

F. Thomson Leighton

01 Sep 1991

TL;DR: This chapter discusses sorting on a Linear Array with a Systolic and Semisystolic Model of Computation, which automates the very labor-intensive and therefore time-heavy and expensive process of manually sorting arrays.

...read moreread less

Abstract: Preface Acknowledgments Notation 1 Arrays and Trees 1.1 Elementary Sorting and Counting 1.1.1 Sorting on a Linear Array Assessing the Performance of the Algorithm Sorting N Numbers with Fewer Than N Processors 1.1.2 Sorting in the Bit Model 1.1.3 Lower Bounds 1.1.4 A Counterexample-Counting 1.1.5 Properties of the Fixed-Connection Network Model 1.2 Integer Arithmetic 1.2.1 Carry-Lookahead Addition 1.2.2 Prefix Computations-Segmented Prefix Computations 1.2.3 Carry-Save Addition 1.2.4 Multiplication and Convolution 1.2.5 Division and Newton Iteration 1.3 Matrix Algorithms 1.3.1 Elementary Matrix Products 1.3.2 Algorithms for Triangular Matrices 1.3.3 Algorithms for Tridiagonal Matrices -Odd-Even Reduction -Parallel Prefix Algorithms 1.3.4 Gaussian Elimination 1.3.5 Iterative Methods -Jacobi Relaxation -Gauss-Seidel Relaxation Finite Difference Methods -Multigrid Methods 1.4 Retiming and Systolic Conversion 1.4.1 A Motivating Example-Palindrome Recognition 1.4.2 The Systolic and Semisystolic Model of Computation 1.4.3 Retiming Semisystolic Networks 1.4.4 Conversion of a Semisystolic Network into a Systolic Network 1.4.5 The Special Case of Broadcasting 1.4.6 Retiming the Host 1.4.7 Design by Systolic Conversion-A Summary 1.5 Graph Algorithms 1.5.1 Transitive Closure 1.5.2 Connected Components 1.5.3 Shortest Paths 1.5.4 Breadth-First Spanning Trees 1.5.5 Minimum Weight Spanning Trees 1.6 Sorting Revisited 1.6.1 Odd-Even Transposition Sort on a Linear Array 1.6.2 A Simple Root-N(log N + 1)-Step Sorting Algorithm 1.6.3 A (3 Root- N + o(Root-N))-Step Sorting Algorithm 1.6.4 A Matching Lower Bound 1.7 Packet Routing 1.7.1 Greedy Algorithms 1.7.2 Average-Case Analysis of Greedy Algorithms -Routing N Packets to Random Destinations -Analysis of Dynamic Routing Problems 1.7.3 Randomized Routing Algorithms 1.7.4 Deterministic Algorithms with Small Queues 1.7.5 An Off-line Algorithm 1.7.6 Other Routing Models and Algorithms 1.8 Image Analysis and Computational Geometry 1.8.1 Component-Labelling Algorithms -Levialdi's Algorithm -An O (Root-N)-Step Recursive Algorithm 1.8.2 Computing Hough Transforms 1.8.3 Nearest-Neighbor Algorithms 1.8.4 Finding Convex Hulls 1.9 Higher-Dimensional Arrays 1.9.1 Definitions and Properties 1.9.2 Matrix Multiplication 1.9.3 Sorting 1.9.4 Packet Routing 1.9.5 Simulating High-Dimensional Arrays on Low-Dimensional Arrays 1.10 problems 1.11 Bibliographic Notes 2 Meshes of Trees 2.1 The Two-Dimensional Mesh of Trees 2.1.1 Definition and Properties 2.1.2 Recursive Decomposition 2.1.3 Derivation from KN,N 2.1.4 Variations 2.1.5 Comparison With the Pyramid and Multigrid 2.2 Elementary O(log N)-Step Algorithms 2.2.1 Routing 2.2.2 Sorting 2.2.3 Matrix-Vector Multiplication 2.2.4 Jacobi Relaxation 2.2.5 Pivoting 2.2.6 Convolution 2.2.7 Convex Hull 2.3 Integer Arithmetic 2.3.1 Multiplication 2.3.2 Division and Chinese Remaindering 2.3.3 Related Problems -Iterated Products -Rooting Finding 2.4 Matrix Algorithms 2.4.1 The Three-Dimensional Mesh of Trees 2.4.2 Matrix Multiplication 2.4.3 Inverting Lower Triangular Matrices 2.4.4 Inverting Arbitrary Matrices -Csanky's Algorithm -Inversion by Newton Iteration 2.4.5 Related Problems 2.5 Graph Algorithms 2.5.1 Minimum-Weight Spanning Trees 2.5.2 Connected Components 2.5.3 Transitive Closure 2.5.4 Shortest Paths 2.5.5 Matching Problems 2.6 Fast Evaluation of Straight-Line Code 2.6.1 Addition and Multiplication Over a Semiring 2.6.2 Extension to Codes with Subtraction and Division 2.6.3 Applications 2.7 Higher-Dimensional meshes of Trees 2.7.1 Definitions and Properties 2.7.2 The Shuffle-Tree Graph 2.8 Problems 2.9 Bibliographic Notes 3 Hypercubes and Related Networks 3.1 The Hypercube 3.1.1 Definitions and Properties 3.1.2 Containment of Arrays -Higher-Dimensional Arrays -Non-Power-of-2 Arrays 3.1.3 Containment of Complete Binary Trees 3.1.4 Embeddings of Arbitrary Binary Trees -Embeddings with Dilation 1 and Load O(M over N + log N) -Embeddings with Dilation O(1) and Load O (M over N + 1) -A Review of One-Error-Correcting Codes -Embedding Plog N into Hlog N 3.1.5 Containment of Meshes of Trees 3.1.6 Other Containment Results 3.2 The Butterfly, Cube-Connected-Cycles , and Benes Network 3.2.1 Definitions and Properties 3.2.2 Simulation of Arbitrary Networks 3.2.3 Simulation of Normal Hypercube Algorithms 3.2.4 Some Containment and Simulation Results 3.3 The Shuffle-Exchange and de Bruijn Graphs 3.3.1 Definitions and Properties 3.3.2 The Diaconis Card Tricks 3.3.3 Simulation of Normal Hypercube Algorithms 3.3.4 Similarities with the Butterfly 3.3.5 Some Containment and Simulation Results 3.4 Packet-Routing Algorithms 3.4.1 Definitions and Routing Models 3.4.2 Greedy Routing Algorithms and Worst-Case Problems 3.4.3 Packing, Spreading, and Monotone Routing Problems -Reducing a Many-to-Many Routing Problem to a Many-to-One Routing Problem -Reducing a Routing Problem to a Sorting Problem 3.4.4 The Average-Case Behavior of the Greedy Algorithm -Bounds on Congestion -Bounds on Running Time -Analyzing Non-Predictive Contention-Resolution Protocols 3.4.5 Converting Worst-Case Routing Problems into Average-Case Routing Problems -Hashing -Randomized Routing 3.4.6 Bounding Queue Sizes -Routing on Arbitrary Levelled Networks 3.4.7 Routing with Combining 3.4.8 The Information Dispersal Approach to Routing -Using Information Dispersal to Attain Fault-Tolerance -Finite Fields and Coding Theory 3.4.9 Circuit-Switching Algorithms 3.5 Sorting 3.5.1 Odd-Even Merge Sort -Constructing a Sorting Circuit with Depth log N(log N +1)/2 3.5.2 Sorting Small Sets 3.5.3 A Deterministic O(log N log log N)-Step Sorting Algorithm 3.5.4 Randomized O(log N)-Step Sorting Algorithms -A Circuit with Depth 7.45 log N that Usually Sorts 3.6 Simulating a Parallel Random Access Machine 3.6.1 PRAM Models and Shared Memories 3.6.2 Randomized Simulations Based on Hashing 3.6.3 Deterministic Simulations using Replicated Data 3.6.4 Using Information Dispersal to Improve Performance 3.7 The Fast Fourier Transform 3.7.1 The Algorithm 3.7.2 Implementation on the Butterfly and Shuffle-Exchange Graph 3.7.3 Application to Convolution and Polynomial Arithmetic 3.7.4 Application to Integer Multiplication 3.8 Other Hypercubic Networks 3.8.1 Butterflylike Networks -The Omega Network -The Flip Network -The Baseline and Reverse Baseline Networks -Banyan and Delta Networks -k-ary Butterflies 3.8.2 De Bruijn-Type Networks -The k-ary de Bruijn Graph -The Generalized Shuffle-Exchange Graph 3.9 Problems 3.10 Bibliographic Notes Bibliography Index Lemmas, Theorems, and Corollaries Author Index Subject Index

...read moreread less

2,895 citations

"An introduction to parallel algorit..." refers background in this paper

...Multiprocessorbased computers have been around for decades and various types of computer architectures [2] have been implemented in hardware throughout the years with different types of advantages/performance gains depending on the application....
[...]
...Every location in the array represents a node of the tree: T [1] is the root, with children at T [2] and T [3]....
[...]
...The text by [2] is a good start as it contains a comprehensive description of algorithms and different architecture topologies for the network model (tree, hypercube, mesh, and butterfly)....
[...]

Book•

Computer Architecture and Parallel Processing

[...]

Kai Hwang, Faye A. Briggs

01 Jan 1984

TL;DR: The authors have divided the use of computers into the following four levels of sophistication: data processing, information processing, knowledge processing, and intelligence processing.

...read moreread less

Abstract: The book is intended as a text to support two semesters of courses in computer architecture at the college senior and graduate levels. There are excellent problems for students at the end of each chapter. The authors have divided the use of computers into the following four levels of sophistication: data processing, information processing, knowledge processing, and intelligence processing.

...read moreread less

1,410 citations

"An introduction to parallel algorit..." refers background in this paper

...Parallel architectures have been described in several books (see, for example, [18, 29])....
[...]

Journal Article•DOI•

Data parallel algorithms

[...]

W. Daniel Hillis, Guy L. Steele

01 Dec 1986-Communications of The ACM

TL;DR: The success of data parallel algorithms—even on problems that at first glance seem inherently serial—suggests that this style of programming has much wider applicability than was previously thought.

...read moreread less

Abstract: Parallel computers with tens of thousands of processors are typically programmed in a data parallel style, as opposed to the control parallel style used in multiprocessing. The success of data parallel algorithms—even on problems that at first glance seem inherently serial—suggests that this style of programming has much wider applicability than was previously thought.

...read moreread less

1,000 citations

"An introduction to parallel algorit..." refers background in this paper

...Recent work on the mapping of PRAM algorithms on bounded-degree networks is described in [3,13,14, 20, 25], Our presentation on the communication complexity of the matrix-multiplication problem in the sharedmemory model is taken from [1], Data-parallel algorithms are described in [15]....
[...]

Proceedings Article•DOI•

Parallelism in random access machines

[...]

Steven Fortune, James C. Wyllie

01 May 1978

TL;DR: A model of computation based on random access machines operating in parallel and sharing a common memory is presented and can accept in polynomial time exactly the sets accepted by nondeterministic exponential time bounded Turing machines.

...read moreread less

Abstract: A model of computation based on random access machines operating in parallel and sharing a common memory is presented. The computational power of this model is related to that of traditional models. In particular, deterministic parallel RAM's can accept in polynomial time exactly the sets accepted by polynomial tape bounded Turing machines; nondeterministic RAM's can accept in polynomial time exactly the sets accepted by nondeterministic exponential time bounded Turing machines. Similar results hold for other classes. The effect of limiting the size of the common memory is also considered.

...read moreread less

951 citations

"An introduction to parallel algorit..." refers background in this paper

...Rigorous descriptions of shared-memory models were introduced later in [11,12]....
[...]

Journal Article•DOI•

The Parallel Evaluation of General Arithmetic Expressions

[...]

Richard P. Brent¹•Institutions (1)

Australian National University¹

01 Apr 1974-Journal of the ACM

TL;DR: It is shown that arithmetic expressions with n ≥ 1 variables and constants; operations of addition, multiplication, and division; and any depth of parenthesis nesting can be evaluated in time 4 log 2 + 10(n - 1) using processors which can independently perform arithmetic operations in unit time.

...read moreread less

Abstract: It is shown that arithmetic expressions with n ≥ 1 variables and constants; operations of addition, multiplication, and division; and any depth of parenthesis nesting can be evaluated in time 4 log2n + 10(n - 1)/p using p ≥ 1 processors which can independently perform arithmetic operations in unit time. This bound is within a constant factor of the best possible. A sharper result is given for expressions without the division operation, and the question of numerical stability is discussed.

...read moreread less

864 citations

"An introduction to parallel algorit..." refers methods in this paper

...The WT scheduling principle is derived from a theorem in [7], In the literature, this principle is commonly referred to as Brent's theorem or Brent's scheduling principle....
[...]