An introduction to parallel algorithms

Home
/
Papers
/
An introduction to parallel algorithms

Book•

An introduction to parallel algorithms

Joseph JaJa¹•Institutions (1)

01 Oct 1992-

TL;DR: This book provides an introduction to the design and analysis of parallel algorithms, with the emphasis on the application of the PRAM model of parallel computation, with all its variants, to algorithm analysis.

read less

Abstract: Written by an authority in the field, this book provides an introduction to the design and analysis of parallel algorithms. The emphasis is on the application of the PRAM (parallel random access machine) model of parallel computation, with all its variants, to algorithm analysis. Special attention is given to the selection of relevant data structures and to algorithm design principles that have proved to be useful. Features *Uses PRAM (parallel random access machine) as the model for parallel computation. *Covers all essential classes of parallel algorithms. *Rich exercise sets. *Written by a highly respected author within the field. 0201548569B04062001

...read moreread less

Content maybe subject to copyright Report

Citations

PDF

Open Access

More filters

Proceedings Article•DOI•

Parallel Approach to the Firefly Algorithm

[...]

Gabriel Linhares¹, Guilherme Anchieta Costa¹, Leonardo Jacobson¹, Lucio Rodrigues¹, Vitor Ballabenute¹, Guilherme Alberto Wachs-Lopes¹, Paulo Sergio Rodrigues¹ - Show less +3 more•Institutions (1)

Centro Universitário da FEI¹

01 Oct 2018

TL;DR: This paper uses the distributed computing concept to optimize the sequential version of the Firefly Algorithm, and shows that the proposed distributed version is more efficient than the regular existing algorithm.

...read moreread less

Abstract: Distributed computing is a computation approach in which many calculations are made at the same time in a distributed memory model, exploring the fact that big problems can sometimes be divided into little ones that can be solved at the same time. This paper uses the distributed computing concept to optimize the sequential version of the Firefly Algorithm (FA). Results show that the proposed distributed version is more efficient than the regular existing algorithm.

...read moreread less

Cites background from "An introduction to parallel algorit..."

...In that way a problem is divided into smaller parts that can be solved concurrently, after that each part is further broken down into a series of instructions and them a control/coordination mechanism is employed to administrate [10]....
[...]

Proceedings Article•DOI•

Parallel backtrack computing of association schemes using classroom pc's

[...]

Izumi Miyamoto

01 Jul 2002

An Immediate Concurrent Execution (ICE) Abstraction Proposal for Many-Cores

[...]

Uzi Vishkin

01 Dec 2008

TL;DR: The current paper: (i) advocates incorporating a quest for the simplest possible abstraction in the debate on the future of many-core computers, (ii) suggests “immediate concurrent execution (ICE)” as a new abstraction, and (iii) argues that an XMT architecture is one possible demonstration of ICE providing an easyto-program general-purpose many- core platform.

...read moreread less

Abstract: Settling on a simple abstraction that programmers aim at, and hardware and software systems people enable and support, is an important step towards convergence to a robust many-core platform. The current paper: (i) advocates incorporating a quest for the simplest possible abstraction in the debate on the future of many-core computers, (ii) suggests “immediate concurrent execution (ICE)” as a new abstraction, and (iii) argues that an XMT architecture is one possible demonstration of ICE providing an easyto-program general-purpose many-core platform. 1. Case for Abstraction In 2004, standard (desktop) computers comprised one processor core. In 2008, some have 8 cores. By 2012, 64core computers (another factor of 8) are expected. Transition from serial computing to parallel computing mandates the reinvention of the very heart of computer science (CS). These highly parallel computers need to be built and programmed in a new way. Current solutions by leading vendors do not scale to tens of cores. Given that clock speeds have not been improving for quite a few years, the use of parallel processing for improving single-program completion time is a critical target for future designs. We need to figure out how to build scalable many-core computers, how to program them effectively so that programmers can get strong performance with minimal programming effort, how to train the workforce, and how to teach this new environment at all levels, including introductory programming courses to college freshmen and K-12 students. Foremost among current challenges is timely convergence to a robust many-core platform that will serve the world for many years to come. Critical to the economy and workforce, the basic motivation behind the current position paper is bringing about the reinvention of CS for meeting this challenge: 1) Andy Grove (Intel) noted that the software spiral (hardware improvements lead to software improvements that lead back to hardware improvements) had been an engine of sustained growth for IT; but (as explained in [6] and since convergence is yet to happen), it is now broken! 2) Both under-trained and mistrained for a future certain to be dominated by parallelism, most CS students only study the old serial paradigm, acquiring serial habits that complicate later transition to parallelism. But, how should we approach the convergence challenge, and, in particular, what the first step should be. The final posting in a special series on why research advances are needed to overcome the problems posed by mulitcore processors on the Computing Community Consortium blog [5] perhaps implies a perception of despair in the community. The problem is not new. Many parallel computer architectures have been proposed and built over the last 40 years, but with limited success. Exploiting the parallelism present in them has often eluded their users. The main source of encouragement in [5] is a call on all involved communities to collaboratively start with a clean slate, rather than have language researchers locked into mechanisms supported by commodity hardware and hardware researchers locked into fully supporting any current software. This is not the first time that CS is facing a complex system problem requiring a solution that involves many different players and should be robust over time in the face of system upgrades. It has become a signature intellectual success story of CS to address such problems by figuring out a simple abstraction that acts as “a single nail holding everything together”. In fact, abstractions that present the user with a virtual machine that is easier to understand and program than the underlying hardware, but still allows effective use of the hardware, facilitated significant Computer Science accomplishments. Broad consensus built around these simple One of the dictionary definitions of abstract is difficult to understand, or abstruse. In CS, however, abstraction has become synonym with the quest for simplicity. Interestingly, the word abstraction in Hebrew shares the same root with simple (as well as undress and expand). abstractions was the key. Some formative abstractions were: (i) that any single instruction available for execution in a serial program executes immediately, henceforth called immediate serial execution (ISE); note that since an instruction may apply to any location in memory, ISE extended another formative abstraction that we call “immediate memory access (IMA)”: that any particular word of an indefinitely large memory is immediately available, and (ii) that a computer is serving the task that the user is currently working on exclusively, henceforth exclusive computer availability (ECA). The IMA abstraction abstracts away a hierarchy of memories, each with greater capacity, but slower access time, than the preceding one, and the ISE abstraction extends it to immediate execution of any operation. The ECA abstraction abstracts away virtual file systems that can be implemented in local storage or a local or global network, access to the Internet, and other tasks that may be concurrently using the same computer system. These abstractions have improved the productivity of programmers and other users, and contributed towards broadening participation in computing. Some simple and robust abstraction can be the first writing on the clean slate sought in [5]. We will then need to build a consensus around such an abstraction as a way to reproduce past CS success stories for the many-core era. Finding the best many-core platform requires a battle of ideas whose outcome will affect a rather broad community. The need for acceptance by all relevant segments of the community suggests the necessity of benchmarks for predicting the success of a many-core platform. Development of such benchmarks is, in fact, long overdue. Abstractions provide an effective way for lowering the bar towards broadening participation in the debate to all relevant participants. While the utility of abstraction will become much clearer once such benchmarks are available, there is no reason not to focus on abstractions immediately. The desired abstraction will: (i) be simple, hiding the details of the underlying hardware, (ii) be accessible to the broadest possible groups of users, (iii) allow strong speedups for applications, (iv) be scalable; a user of a 16-core computer should rely on the same abstraction as a user of a future generation 1024-core computer, or else performance code will have to be continuously rewritten; this will also help put theions was the key. Some formative abstractions were: (i) that any single instruction available for execution in a serial program executes immediately, henceforth called immediate serial execution (ISE); note that since an instruction may apply to any location in memory, ISE extended another formative abstraction that we call “immediate memory access (IMA)”: that any particular word of an indefinitely large memory is immediately available, and (ii) that a computer is serving the task that the user is currently working on exclusively, henceforth exclusive computer availability (ECA). The IMA abstraction abstracts away a hierarchy of memories, each with greater capacity, but slower access time, than the preceding one, and the ISE abstraction extends it to immediate execution of any operation. The ECA abstraction abstracts away virtual file systems that can be implemented in local storage or a local or global network, access to the Internet, and other tasks that may be concurrently using the same computer system. These abstractions have improved the productivity of programmers and other users, and contributed towards broadening participation in computing. Some simple and robust abstraction can be the first writing on the clean slate sought in [5]. We will then need to build a consensus around such an abstraction as a way to reproduce past CS success stories for the many-core era. Finding the best many-core platform requires a battle of ideas whose outcome will affect a rather broad community. The need for acceptance by all relevant segments of the community suggests the necessity of benchmarks for predicting the success of a many-core platform. Development of such benchmarks is, in fact, long overdue. Abstractions provide an effective way for lowering the bar towards broadening participation in the debate to all relevant participants. While the utility of abstraction will become much clearer once such benchmarks are available, there is no reason not to focus on abstractions immediately. The desired abstraction will: (i) be simple, hiding the details of the underlying hardware, (ii) be accessible to the broadest possible groups of users, (iii) allow strong speedups for applications, (iv) be scalable; a user of a 16-core computer should rely on the same abstraction as a user of a future generation 1024-core computer, or else performance code will have to be continuously rewritten; this will also help put the above noted software spiral back on track; (v) extend, rather than replace, existing (successful) abstractions; in particular, when code provides no parallelism, the user will need to be able to fall back on the serial abstraction ISE; and last, but definitely not least, (vi) be buildable; we must be able to build an actual computer system that provides good performance for users that rely on the abstraction. Note also that the ECA abstraction does

...read moreread less

Cites methods from "An introduction to parallel algorit..."

...The methodology of restricting attention only to work and depth has, in fact, been used as the main framework for the presentation of PRAM algorithms in texts such as [2,3]; see also the class notes available through [9]....
[...]

Journal Article•DOI•

Parallel two dimensional witness computation

[...]

Richard Cole¹, Zvi Galil², Ramesh Hariharan³, S. Muthukrishnan⁴, Kunsoo Park⁵ - Show less +1 more•Institutions (5)

Courant Institute of Mathematical Sciences¹, Columbia University², Indian Institute of Science³, Rutgers University⁴, Seoul National University⁵

10 Jan 2004-Information & Computation

TL;DR: An optimal parallel CRCW-PRAM algorithm to compute witnesses for all non-period vectors of an m1 × m2 pattern is given and yields a work optimal algorithm for 2D pattern matching.

...read moreread less

Abstract: An optimal parallel CRCW-PRAM algorithm to compute witnesses for all non-period vectors of an m1 × m2 pattern is given. The algorithm takes O(log log m) time and does O(m1 × m2) work, where m = max{m1, m2}. This yields a work optimal algorithm for 2D pattern matching which takes O(log log m) preprocessing time and O(1) text processing time.

...read moreread less

Cites background from "An introduction to parallel algorit..."

..., simultaneous writes to the same location by several processors are guaranteed to be of the same value [15]....
[...]

Book Chapter•DOI•

Ultrafast Randomized Parallel Construction- and Approximation Algorithms for Spanning Forests in Dense Graphs

[...]

Anders Dessmark¹, Carsten Dorgerloh², Andrzei Lingas¹, J "urgen Wirtgen²•Institutions (2)

Lund University¹, University of Bonn²

01 Oct 1997

TL;DR: A first randomized time and work CRCW-PRAM algorithm for finding a spanning forest of an undirected dense graph with $n$ vertices and is optimal with respect to time, work and space.

...read moreread less

Abstract: We present a first randomized $\O$$(\log^{(k)} n)$ time and $\O$$(n + m)$ work CRCW-PRAM algorithm for finding a spanning forest of an undirected dense graph with $n$ vertices. Furthermore we construct a randomized $\O$$(\log \log n)$ time and $\O$$(n \log n)$ work CREW-PRAM algorithm for finding spanning trees in random graphs. Our algorithm is optimal with respect to time, work and space.

...read moreread less

Collapse

References

PDF

Open Access

More filters

Book•

Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes

[...]

F. Thomson Leighton

01 Sep 1991

TL;DR: This chapter discusses sorting on a Linear Array with a Systolic and Semisystolic Model of Computation, which automates the very labor-intensive and therefore time-heavy and expensive process of manually sorting arrays.

...read moreread less

Abstract: Preface Acknowledgments Notation 1 Arrays and Trees 1.1 Elementary Sorting and Counting 1.1.1 Sorting on a Linear Array Assessing the Performance of the Algorithm Sorting N Numbers with Fewer Than N Processors 1.1.2 Sorting in the Bit Model 1.1.3 Lower Bounds 1.1.4 A Counterexample-Counting 1.1.5 Properties of the Fixed-Connection Network Model 1.2 Integer Arithmetic 1.2.1 Carry-Lookahead Addition 1.2.2 Prefix Computations-Segmented Prefix Computations 1.2.3 Carry-Save Addition 1.2.4 Multiplication and Convolution 1.2.5 Division and Newton Iteration 1.3 Matrix Algorithms 1.3.1 Elementary Matrix Products 1.3.2 Algorithms for Triangular Matrices 1.3.3 Algorithms for Tridiagonal Matrices -Odd-Even Reduction -Parallel Prefix Algorithms 1.3.4 Gaussian Elimination 1.3.5 Iterative Methods -Jacobi Relaxation -Gauss-Seidel Relaxation Finite Difference Methods -Multigrid Methods 1.4 Retiming and Systolic Conversion 1.4.1 A Motivating Example-Palindrome Recognition 1.4.2 The Systolic and Semisystolic Model of Computation 1.4.3 Retiming Semisystolic Networks 1.4.4 Conversion of a Semisystolic Network into a Systolic Network 1.4.5 The Special Case of Broadcasting 1.4.6 Retiming the Host 1.4.7 Design by Systolic Conversion-A Summary 1.5 Graph Algorithms 1.5.1 Transitive Closure 1.5.2 Connected Components 1.5.3 Shortest Paths 1.5.4 Breadth-First Spanning Trees 1.5.5 Minimum Weight Spanning Trees 1.6 Sorting Revisited 1.6.1 Odd-Even Transposition Sort on a Linear Array 1.6.2 A Simple Root-N(log N + 1)-Step Sorting Algorithm 1.6.3 A (3 Root- N + o(Root-N))-Step Sorting Algorithm 1.6.4 A Matching Lower Bound 1.7 Packet Routing 1.7.1 Greedy Algorithms 1.7.2 Average-Case Analysis of Greedy Algorithms -Routing N Packets to Random Destinations -Analysis of Dynamic Routing Problems 1.7.3 Randomized Routing Algorithms 1.7.4 Deterministic Algorithms with Small Queues 1.7.5 An Off-line Algorithm 1.7.6 Other Routing Models and Algorithms 1.8 Image Analysis and Computational Geometry 1.8.1 Component-Labelling Algorithms -Levialdi's Algorithm -An O (Root-N)-Step Recursive Algorithm 1.8.2 Computing Hough Transforms 1.8.3 Nearest-Neighbor Algorithms 1.8.4 Finding Convex Hulls 1.9 Higher-Dimensional Arrays 1.9.1 Definitions and Properties 1.9.2 Matrix Multiplication 1.9.3 Sorting 1.9.4 Packet Routing 1.9.5 Simulating High-Dimensional Arrays on Low-Dimensional Arrays 1.10 problems 1.11 Bibliographic Notes 2 Meshes of Trees 2.1 The Two-Dimensional Mesh of Trees 2.1.1 Definition and Properties 2.1.2 Recursive Decomposition 2.1.3 Derivation from KN,N 2.1.4 Variations 2.1.5 Comparison With the Pyramid and Multigrid 2.2 Elementary O(log N)-Step Algorithms 2.2.1 Routing 2.2.2 Sorting 2.2.3 Matrix-Vector Multiplication 2.2.4 Jacobi Relaxation 2.2.5 Pivoting 2.2.6 Convolution 2.2.7 Convex Hull 2.3 Integer Arithmetic 2.3.1 Multiplication 2.3.2 Division and Chinese Remaindering 2.3.3 Related Problems -Iterated Products -Rooting Finding 2.4 Matrix Algorithms 2.4.1 The Three-Dimensional Mesh of Trees 2.4.2 Matrix Multiplication 2.4.3 Inverting Lower Triangular Matrices 2.4.4 Inverting Arbitrary Matrices -Csanky's Algorithm -Inversion by Newton Iteration 2.4.5 Related Problems 2.5 Graph Algorithms 2.5.1 Minimum-Weight Spanning Trees 2.5.2 Connected Components 2.5.3 Transitive Closure 2.5.4 Shortest Paths 2.5.5 Matching Problems 2.6 Fast Evaluation of Straight-Line Code 2.6.1 Addition and Multiplication Over a Semiring 2.6.2 Extension to Codes with Subtraction and Division 2.6.3 Applications 2.7 Higher-Dimensional meshes of Trees 2.7.1 Definitions and Properties 2.7.2 The Shuffle-Tree Graph 2.8 Problems 2.9 Bibliographic Notes 3 Hypercubes and Related Networks 3.1 The Hypercube 3.1.1 Definitions and Properties 3.1.2 Containment of Arrays -Higher-Dimensional Arrays -Non-Power-of-2 Arrays 3.1.3 Containment of Complete Binary Trees 3.1.4 Embeddings of Arbitrary Binary Trees -Embeddings with Dilation 1 and Load O(M over N + log N) -Embeddings with Dilation O(1) and Load O (M over N + 1) -A Review of One-Error-Correcting Codes -Embedding Plog N into Hlog N 3.1.5 Containment of Meshes of Trees 3.1.6 Other Containment Results 3.2 The Butterfly, Cube-Connected-Cycles , and Benes Network 3.2.1 Definitions and Properties 3.2.2 Simulation of Arbitrary Networks 3.2.3 Simulation of Normal Hypercube Algorithms 3.2.4 Some Containment and Simulation Results 3.3 The Shuffle-Exchange and de Bruijn Graphs 3.3.1 Definitions and Properties 3.3.2 The Diaconis Card Tricks 3.3.3 Simulation of Normal Hypercube Algorithms 3.3.4 Similarities with the Butterfly 3.3.5 Some Containment and Simulation Results 3.4 Packet-Routing Algorithms 3.4.1 Definitions and Routing Models 3.4.2 Greedy Routing Algorithms and Worst-Case Problems 3.4.3 Packing, Spreading, and Monotone Routing Problems -Reducing a Many-to-Many Routing Problem to a Many-to-One Routing Problem -Reducing a Routing Problem to a Sorting Problem 3.4.4 The Average-Case Behavior of the Greedy Algorithm -Bounds on Congestion -Bounds on Running Time -Analyzing Non-Predictive Contention-Resolution Protocols 3.4.5 Converting Worst-Case Routing Problems into Average-Case Routing Problems -Hashing -Randomized Routing 3.4.6 Bounding Queue Sizes -Routing on Arbitrary Levelled Networks 3.4.7 Routing with Combining 3.4.8 The Information Dispersal Approach to Routing -Using Information Dispersal to Attain Fault-Tolerance -Finite Fields and Coding Theory 3.4.9 Circuit-Switching Algorithms 3.5 Sorting 3.5.1 Odd-Even Merge Sort -Constructing a Sorting Circuit with Depth log N(log N +1)/2 3.5.2 Sorting Small Sets 3.5.3 A Deterministic O(log N log log N)-Step Sorting Algorithm 3.5.4 Randomized O(log N)-Step Sorting Algorithms -A Circuit with Depth 7.45 log N that Usually Sorts 3.6 Simulating a Parallel Random Access Machine 3.6.1 PRAM Models and Shared Memories 3.6.2 Randomized Simulations Based on Hashing 3.6.3 Deterministic Simulations using Replicated Data 3.6.4 Using Information Dispersal to Improve Performance 3.7 The Fast Fourier Transform 3.7.1 The Algorithm 3.7.2 Implementation on the Butterfly and Shuffle-Exchange Graph 3.7.3 Application to Convolution and Polynomial Arithmetic 3.7.4 Application to Integer Multiplication 3.8 Other Hypercubic Networks 3.8.1 Butterflylike Networks -The Omega Network -The Flip Network -The Baseline and Reverse Baseline Networks -Banyan and Delta Networks -k-ary Butterflies 3.8.2 De Bruijn-Type Networks -The k-ary de Bruijn Graph -The Generalized Shuffle-Exchange Graph 3.9 Problems 3.10 Bibliographic Notes Bibliography Index Lemmas, Theorems, and Corollaries Author Index Subject Index

...read moreread less

2,895 citations

"An introduction to parallel algorit..." refers background in this paper

...Multiprocessorbased computers have been around for decades and various types of computer architectures [2] have been implemented in hardware throughout the years with different types of advantages/performance gains depending on the application....
[...]
...Every location in the array represents a node of the tree: T [1] is the root, with children at T [2] and T [3]....
[...]
...The text by [2] is a good start as it contains a comprehensive description of algorithms and different architecture topologies for the network model (tree, hypercube, mesh, and butterfly)....
[...]

Book•

Computer Architecture and Parallel Processing

[...]

Kai Hwang, Faye A. Briggs

01 Jan 1984

TL;DR: The authors have divided the use of computers into the following four levels of sophistication: data processing, information processing, knowledge processing, and intelligence processing.

...read moreread less

Abstract: The book is intended as a text to support two semesters of courses in computer architecture at the college senior and graduate levels. There are excellent problems for students at the end of each chapter. The authors have divided the use of computers into the following four levels of sophistication: data processing, information processing, knowledge processing, and intelligence processing.

...read moreread less

1,410 citations

"An introduction to parallel algorit..." refers background in this paper

...Parallel architectures have been described in several books (see, for example, [18, 29])....
[...]

Journal Article•DOI•

Data parallel algorithms

[...]

W. Daniel Hillis, Guy L. Steele

01 Dec 1986-Communications of The ACM

TL;DR: The success of data parallel algorithms—even on problems that at first glance seem inherently serial—suggests that this style of programming has much wider applicability than was previously thought.

...read moreread less

Abstract: Parallel computers with tens of thousands of processors are typically programmed in a data parallel style, as opposed to the control parallel style used in multiprocessing. The success of data parallel algorithms—even on problems that at first glance seem inherently serial—suggests that this style of programming has much wider applicability than was previously thought.

...read moreread less

1,000 citations

"An introduction to parallel algorit..." refers background in this paper

...Recent work on the mapping of PRAM algorithms on bounded-degree networks is described in [3,13,14, 20, 25], Our presentation on the communication complexity of the matrix-multiplication problem in the sharedmemory model is taken from [1], Data-parallel algorithms are described in [15]....
[...]

Proceedings Article•DOI•

Parallelism in random access machines

[...]

Steven Fortune, James C. Wyllie

01 May 1978

TL;DR: A model of computation based on random access machines operating in parallel and sharing a common memory is presented and can accept in polynomial time exactly the sets accepted by nondeterministic exponential time bounded Turing machines.

...read moreread less

Abstract: A model of computation based on random access machines operating in parallel and sharing a common memory is presented. The computational power of this model is related to that of traditional models. In particular, deterministic parallel RAM's can accept in polynomial time exactly the sets accepted by polynomial tape bounded Turing machines; nondeterministic RAM's can accept in polynomial time exactly the sets accepted by nondeterministic exponential time bounded Turing machines. Similar results hold for other classes. The effect of limiting the size of the common memory is also considered.

...read moreread less

951 citations

"An introduction to parallel algorit..." refers background in this paper

...Rigorous descriptions of shared-memory models were introduced later in [11,12]....
[...]

Journal Article•DOI•

The Parallel Evaluation of General Arithmetic Expressions

[...]

Richard P. Brent¹•Institutions (1)

Australian National University¹

01 Apr 1974-Journal of the ACM

TL;DR: It is shown that arithmetic expressions with n ≥ 1 variables and constants; operations of addition, multiplication, and division; and any depth of parenthesis nesting can be evaluated in time 4 log 2 + 10(n - 1) using processors which can independently perform arithmetic operations in unit time.

...read moreread less

Abstract: It is shown that arithmetic expressions with n ≥ 1 variables and constants; operations of addition, multiplication, and division; and any depth of parenthesis nesting can be evaluated in time 4 log2n + 10(n - 1)/p using p ≥ 1 processors which can independently perform arithmetic operations in unit time. This bound is within a constant factor of the best possible. A sharper result is given for expressions without the division operation, and the question of numerical stability is discussed.

...read moreread less

864 citations

"An introduction to parallel algorit..." refers methods in this paper

...The WT scheduling principle is derived from a theorem in [7], In the literature, this principle is commonly referred to as Brent's theorem or Brent's scheduling principle....
[...]