Book

An introduction to parallel algorithms

01 Oct 1992
TL;DR: This book provides an introduction to the design and analysis of parallel algorithms, with emphasis on applying the PRAM model of parallel computation, in all its variants, to algorithm analysis.
Abstract: Written by an authority in the field, this book provides an introduction to the design and analysis of parallel algorithms. The emphasis is on the application of the PRAM (parallel random access machine) model of parallel computation, with all its variants, to algorithm analysis. Special attention is given to the selection of relevant data structures and to algorithm design principles that have proved to be useful.

Features:

  • Uses the PRAM (parallel random access machine) as the model for parallel computation.
  • Covers all essential classes of parallel algorithms.
  • Rich exercise sets.
  • Written by a highly respected author within the field.


Citations
26 Nov 2014
TL;DR: This thesis develops algorithms for computing primitives such as list ranking, sorting, and pseudorandomness, as well as several graph algorithms, on a hybrid platform consisting of a 6-core Intel CPU and Nvidia GPUs of broadly two generations.
Abstract: The computing industry has undergone several paradigm shifts in the last few decades. Fueled by the need for faster computing, larger data, and real-time processing, parallel computing has emerged as one of the dominant paradigms. Motivated by the success of distributed computing models and the limitations faced by single-core processors, parallel computing is the only alternative for building faster computers. It remains one of the most challenging areas of computer science, and developing algorithms and optimization techniques that exploit the processing power of a current-generation parallel computer is still a very exciting area for research. The parallel computing industry underwent a massive shift when conventional sequential computers hit the power wall. This led to the development of multicore and many-core chips that pack multiple sequential computing cores into a single chip. The immediate impact was the need to (re)design sequential algorithms to utilize the computing power of such chips. Combined with intricate memory and cache structures, parallel algorithms require a high degree of engineering to reach optimal performance. The many-core revolution started with the release of Graphics Processing Units (GPUs), which have a large number of compute cores and offer massive parallelism. As many-core chips evolved, GPUs found application in graphics and gaming as well as general-purpose computation. In the same time frame, Central Processing Units (CPUs) also underwent substantial innovation and emerged as more powerful and mature computing machines, although multicore CPUs were mostly ignored in their initial days. With the advancement of accelerator platforms, CPUs and GPUs are now able to communicate more efficiently. Recent works such as those in [79, 91, 43] show that hybrid algorithms actually provide better performance and efficiency than conventional accelerator-based computing. In this thesis we work towards hybrid multicore computing: developing algorithms and optimization strategies for popular computing primitives on a hybrid platform, that is, one containing two or more multicore or many-core devices. Efficient algorithm design on hybrid platforms faces several challenges, such as communication bottlenecks, load balance, and synchronization. We develop algorithms for computing primitives such as list ranking, sorting, and pseudorandomness, as well as some graph algorithms. We experiment on a hybrid platform consisting of a 6-core Intel CPU and Nvidia GPUs of broadly two generations.
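Most parallel list-ranking algorithms, including those in the excerpts below, build on pointer jumping. A minimal sequential C sketch of a Wyllie-style pointer-jumping pass (the example list and array sizes are illustrative, not taken from the thesis); on a PRAM the inner loop runs with one processor per node in O(1) time, so O(log n) rounds suffice:

```c
/* Pointer-jumping list ranking: each node i has a successor next[i]
 * (the tail points to itself) and rank[i] = distance to the tail. */
#include <stdio.h>

#define N 8

int main(void) {
    /* Illustrative list: node i's successor is i+1; node N-1 is the tail. */
    int next[N], rank[N];
    for (int i = 0; i < N; i++) {
        next[i] = (i == N - 1) ? i : i + 1;
        rank[i] = (i == N - 1) ? 0 : 1;
    }
    int done = 0;
    while (!done) {                            /* O(log n) rounds */
        done = 1;
        int nnext[N], nrank[N];
        for (int i = 0; i < N; i++) {          /* "in parallel" on a PRAM */
            nrank[i] = rank[i] + rank[next[i]];
            nnext[i] = next[next[i]];          /* jump the pointer */
            if (next[i] != nnext[i]) done = 0;
        }
        for (int i = 0; i < N; i++) { rank[i] = nrank[i]; next[i] = nnext[i]; }
    }
    for (int i = 0; i < N; i++) printf("rank[%d] = %d\n", i, rank[i]);
    return 0;
}
```

The double buffering (nnext, nrank) stands in for the synchronous reads and writes of a PRAM step.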

3 citations


Cites background or methods from "An introduction to parallel algorithms"

  • ...Our algorithm follows the model of most parallel list ranking algorithms [64]....


  • ...[64]) that with high probability, the FIS constructed above has at least n/c nodes for c ≥ 24....


  • ...However, to be time-optimal, this requires O(log n) time per iteration [64]....


Ulrich Meyer
01 Jan 2001
TL;DR: This paper gives simple label-setting and label-correcting algorithms for arbitrary directed graphs with random real edge weights uniformly distributed in [0, 1] and shows that they need linear time O(n + m) with high probability.
Abstract: The quest for a linear-time single-source shortest-path (SSSP) algorithm on directed graphs with positive edge weights is an ongoing hot research topic. While Thorup recently found an O(n + m) time RAM algorithm for undirected graphs with n nodes, m edges and integer edge weights in {0, ..., 2^w - 1}, where w denotes the word length, the currently best time bound for directed sparse graphs on a RAM is O(n + m log log n). In the present paper we study the average-case complexity of SSSP. We give simple label-setting and label-correcting algorithms for arbitrary directed graphs with random real edge weights uniformly distributed in [0, 1] and show that they need linear time O(n + m) with high probability. A variant of the label-correcting approach also supports parallelization. Furthermore, we propose a general method to construct graphs with random edge weights which incur large non-linear expected running times on many traditional shortest-path algorithms.
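As background, here is a minimal C sketch of a generic FIFO label-correcting scheme (essentially Bellman-Ford with a queue). It illustrates the general technique only; it is not Meyer's SP-C or SP-C* algorithm, and the toy graph and weights are made up:

```c
/* Label-correcting SSSP: repeatedly dequeue a vertex and relax its
 * outgoing edges, re-queueing any vertex whose label improves. */
#include <stdio.h>

#define NV 4   /* vertices */
#define NE 5   /* edges    */

int main(void) {
    /* Illustrative edge list (u, v, weight in [0,1]). */
    int eu[NE] = {0, 0, 1, 2, 1};
    int ev[NE] = {1, 2, 2, 3, 3};
    double w[NE] = {0.3, 0.9, 0.2, 0.4, 0.8};

    double dist[NV];
    int queue[NV * NV], head = 0, tail = 0, inq[NV] = {0};
    for (int i = 0; i < NV; i++) dist[i] = 1e18;

    dist[0] = 0.0; queue[tail++] = 0; inq[0] = 1;   /* source = 0 */
    while (head < tail) {
        int u = queue[head++]; inq[u] = 0;
        for (int e = 0; e < NE; e++) {       /* scan edges out of u */
            if (eu[e] != u) continue;
            if (dist[u] + w[e] < dist[ev[e]]) {
                dist[ev[e]] = dist[u] + w[e];   /* correct the label */
                if (!inq[ev[e]]) { queue[tail++] = ev[e]; inq[ev[e]] = 1; }
            }
        }
    }
    for (int i = 0; i < NV; i++) printf("dist[%d] = %.2f\n", i, dist[i]);
    return 0;
}
```

The queue is sized generously for this toy instance; label-setting methods such as Dijkstra differ in extracting the minimum label instead of the queue front.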

3 citations


Cites methods from "An introduction to parallel algorithms"

  • ...In the following we sketch the result of a simple parallelization for SP-C and SP-C* on the parallel random access machine (PRAM) [22, 34]....


Journal ArticleDOI
TL;DR: This paper presents an O(log n) time parallel algorithm for arithmetic expression evaluation on an n × n processor array with a reconfigurable bus system, where n is the sum of the number of operators and constants in the expression.

3 citations

01 Jan 1997
TL;DR: In this article, the authors show how to simulate PRAMs on PRAMs with a constant fraction of memory faults, giving the first fully work-optimal randomized simulation of EREW on faulty EREW, and a work-optimal O(log* n)-time simulation on faulty CRCW.
Abstract: In this paper we show two results on PRAMs with a constant fraction of memory faults. First we show how to preprocess (i.e. connect a constant fraction of the processors into a binary tree) a faulty EREW PRAM with n/log n processors and O(n) memory cells in O(log n) time. The preprocessing is a basic step of the simulations from [7, 9, 17]. Our algorithm, together with the results from [17], gives the first fully work-optimal randomized simulation of EREW on EREW with faults, with logarithmic overhead. In the second part of the paper, we consider the CRCW PRAM with memory faults. We show that (after O(log* n)-time preprocessing) any algorithm for an O(n)-processor PRAM can be simulated with optimal work in O(log* n) time on a CRCW PRAM with memory faults. The simulation improves the result of [7]. All simulations assume static faults, i.e. the errors are determined before the computation starts and no new errors occur during the computation.

3 citations

Book ChapterDOI
24 Oct 2011
TL;DR: This paper presents a new parallel algorithm that generates a short addition chain for x using a polynomial number of processors on an EREW PRAM (exclusive read exclusive write parallel random access machine); the algorithm is faster than previous algorithms and is based on the binary method.
Abstract: An addition chain for a natural number x of n bits is a sequence of numbers a0, a1, ..., al such that a0 = 1, al = x, and ak = ai + aj with 0 ≤ i, j < k ≤ l. The addition chain problem asks for the minimal number of additions needed to compute x starting from 1. In this paper, we present a new parallel algorithm that generates a short addition chain for x. The algorithm has running time O(log² n) using a polynomial number of processors on an EREW PRAM (exclusive read exclusive write parallel random access machine). The algorithm is faster than previous algorithms and is based on the binary method.
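A sequential C sketch of the binary method that the parallel algorithm above is based on: double once per bit of x and add 1 for each set bit after the leading one, which yields a chain of length at most 2⌊log2 x⌋ (the target value 45 is illustrative):

```c
/* Binary-method addition chain: every element is the sum of two
 * earlier elements (doubling, or adding a0 = 1), ending at x. */
#include <stdio.h>

int main(void) {
    unsigned x = 45;               /* example target: 101101 in binary */
    unsigned chain[64], len = 0;

    int top = 31;
    while (top > 0 && !((x >> top) & 1u)) top--;   /* index of leading 1 */

    unsigned cur = 1;
    chain[len++] = cur;                 /* a0 = 1 */
    for (int b = top - 1; b >= 0; b--) {
        cur += cur;                     /* double: ak = a(k-1) + a(k-1) */
        chain[len++] = cur;
        if ((x >> b) & 1u) {
            cur += 1;                   /* add a0: ak = a(k-1) + 1 */
            chain[len++] = cur;
        }
    }
    printf("addition chain for %u:", x);
    for (unsigned i = 0; i < len; i++) printf(" %u", chain[i]);
    printf("\n");                       /* ends at cur == x */
    return 0;
}
```

For x = 45 this prints 1, 2, 4, 5, 10, 11, 22, 44, 45; the paper's contribution is generating such chains in parallel, not this sequential loop.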

3 citations


Cites background or methods from "An introduction to parallel algorithms"

  • ...They differ in the conventions regarding concurrent reading and writing [1,7]....


  • ...Communication Complexity: The communication complexity is defined as the worst-case bound on the traffic between the shared memory and any local memory of a processor [7]....


  • ...We can compute W by applying the prefix sums [1,7] on Y.... (see the sketch after this list)


  • ...The array Q can be constructed by applying the prefix multiplication [1,7]....

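The prefix-sums primitive quoted in the excerpts above can be illustrated with a log-step scan. A minimal sequential C sketch (the input array merely stands in for the cited paper's Y; on a PRAM each round of the inner loop is a single parallel step, and swapping + for * gives the prefix multiplication used to build Q):

```c
/* Inclusive prefix sums in O(log n) rounds (Hillis-Steele style):
 * round d adds the value from 2^d positions back. */
#include <stdio.h>
#include <string.h>

#define N 8

int main(void) {
    int y[N] = {1, 2, 3, 4, 5, 6, 7, 8};   /* illustrative input */
    int tmp[N];

    for (int d = 1; d < N; d <<= 1) {      /* O(log n) rounds */
        memcpy(tmp, y, sizeof y);          /* synchronous PRAM reads */
        for (int i = d; i < N; i++)        /* "in parallel" on a PRAM */
            y[i] = tmp[i] + tmp[i - d];
    }
    /* y[i] now holds the sum of the first i+1 inputs. */
    for (int i = 0; i < N; i++) printf("%d ", y[i]);
    printf("\n");
    return 0;
}
```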

References
Book
01 Sep 1991
TL;DR: This book surveys parallel algorithms and architectures for arrays, trees, meshes of trees, hypercubes, and related networks, beginning with sorting on a linear array and the systolic and semisystolic models of computation.
Abstract: Table of contents:

Preface; Acknowledgments; Notation

1 Arrays and Trees
  1.1 Elementary Sorting and Counting: 1.1.1 Sorting on a Linear Array (Assessing the Performance of the Algorithm; Sorting N Numbers with Fewer Than N Processors); 1.1.2 Sorting in the Bit Model; 1.1.3 Lower Bounds; 1.1.4 A Counterexample: Counting; 1.1.5 Properties of the Fixed-Connection Network Model
  1.2 Integer Arithmetic: 1.2.1 Carry-Lookahead Addition; 1.2.2 Prefix Computations (Segmented Prefix Computations); 1.2.3 Carry-Save Addition; 1.2.4 Multiplication and Convolution; 1.2.5 Division and Newton Iteration
  1.3 Matrix Algorithms: 1.3.1 Elementary Matrix Products; 1.3.2 Algorithms for Triangular Matrices; 1.3.3 Algorithms for Tridiagonal Matrices (Odd-Even Reduction; Parallel Prefix Algorithms); 1.3.4 Gaussian Elimination; 1.3.5 Iterative Methods (Jacobi Relaxation; Gauss-Seidel Relaxation; Finite Difference Methods; Multigrid Methods)
  1.4 Retiming and Systolic Conversion: 1.4.1 A Motivating Example: Palindrome Recognition; 1.4.2 The Systolic and Semisystolic Model of Computation; 1.4.3 Retiming Semisystolic Networks; 1.4.4 Conversion of a Semisystolic Network into a Systolic Network; 1.4.5 The Special Case of Broadcasting; 1.4.6 Retiming the Host; 1.4.7 Design by Systolic Conversion: A Summary
  1.5 Graph Algorithms: 1.5.1 Transitive Closure; 1.5.2 Connected Components; 1.5.3 Shortest Paths; 1.5.4 Breadth-First Spanning Trees; 1.5.5 Minimum Weight Spanning Trees
  1.6 Sorting Revisited: 1.6.1 Odd-Even Transposition Sort on a Linear Array; 1.6.2 A Simple √N(log N + 1)-Step Sorting Algorithm; 1.6.3 A (3√N + o(√N))-Step Sorting Algorithm; 1.6.4 A Matching Lower Bound
  1.7 Packet Routing: 1.7.1 Greedy Algorithms; 1.7.2 Average-Case Analysis of Greedy Algorithms (Routing N Packets to Random Destinations; Analysis of Dynamic Routing Problems); 1.7.3 Randomized Routing Algorithms; 1.7.4 Deterministic Algorithms with Small Queues; 1.7.5 An Off-line Algorithm; 1.7.6 Other Routing Models and Algorithms
  1.8 Image Analysis and Computational Geometry: 1.8.1 Component-Labelling Algorithms (Levialdi's Algorithm; An O(√N)-Step Recursive Algorithm); 1.8.2 Computing Hough Transforms; 1.8.3 Nearest-Neighbor Algorithms; 1.8.4 Finding Convex Hulls
  1.9 Higher-Dimensional Arrays: 1.9.1 Definitions and Properties; 1.9.2 Matrix Multiplication; 1.9.3 Sorting; 1.9.4 Packet Routing; 1.9.5 Simulating High-Dimensional Arrays on Low-Dimensional Arrays
  1.10 Problems
  1.11 Bibliographic Notes

2 Meshes of Trees
  2.1 The Two-Dimensional Mesh of Trees: 2.1.1 Definition and Properties; 2.1.2 Recursive Decomposition; 2.1.3 Derivation from K_{N,N}; 2.1.4 Variations; 2.1.5 Comparison with the Pyramid and Multigrid
  2.2 Elementary O(log N)-Step Algorithms: 2.2.1 Routing; 2.2.2 Sorting; 2.2.3 Matrix-Vector Multiplication; 2.2.4 Jacobi Relaxation; 2.2.5 Pivoting; 2.2.6 Convolution; 2.2.7 Convex Hull
  2.3 Integer Arithmetic: 2.3.1 Multiplication; 2.3.2 Division and Chinese Remaindering; 2.3.3 Related Problems (Iterated Products; Root Finding)
  2.4 Matrix Algorithms: 2.4.1 The Three-Dimensional Mesh of Trees; 2.4.2 Matrix Multiplication; 2.4.3 Inverting Lower Triangular Matrices; 2.4.4 Inverting Arbitrary Matrices (Csanky's Algorithm; Inversion by Newton Iteration); 2.4.5 Related Problems
  2.5 Graph Algorithms: 2.5.1 Minimum-Weight Spanning Trees; 2.5.2 Connected Components; 2.5.3 Transitive Closure; 2.5.4 Shortest Paths; 2.5.5 Matching Problems
  2.6 Fast Evaluation of Straight-Line Code: 2.6.1 Addition and Multiplication over a Semiring; 2.6.2 Extension to Codes with Subtraction and Division; 2.6.3 Applications
  2.7 Higher-Dimensional Meshes of Trees: 2.7.1 Definitions and Properties; 2.7.2 The Shuffle-Tree Graph
  2.8 Problems
  2.9 Bibliographic Notes

3 Hypercubes and Related Networks
  3.1 The Hypercube: 3.1.1 Definitions and Properties; 3.1.2 Containment of Arrays (Higher-Dimensional Arrays; Non-Power-of-2 Arrays); 3.1.3 Containment of Complete Binary Trees; 3.1.4 Embeddings of Arbitrary Binary Trees (Embeddings with Dilation 1 and Load O(M/N + log N); Embeddings with Dilation O(1) and Load O(M/N + 1); A Review of One-Error-Correcting Codes; Embedding P_{log N} into H_{log N}); 3.1.5 Containment of Meshes of Trees; 3.1.6 Other Containment Results
  3.2 The Butterfly, Cube-Connected-Cycles, and Benes Network: 3.2.1 Definitions and Properties; 3.2.2 Simulation of Arbitrary Networks; 3.2.3 Simulation of Normal Hypercube Algorithms; 3.2.4 Some Containment and Simulation Results
  3.3 The Shuffle-Exchange and de Bruijn Graphs: 3.3.1 Definitions and Properties; 3.3.2 The Diaconis Card Tricks; 3.3.3 Simulation of Normal Hypercube Algorithms; 3.3.4 Similarities with the Butterfly; 3.3.5 Some Containment and Simulation Results
  3.4 Packet-Routing Algorithms: 3.4.1 Definitions and Routing Models; 3.4.2 Greedy Routing Algorithms and Worst-Case Problems; 3.4.3 Packing, Spreading, and Monotone Routing Problems (Reducing a Many-to-Many Routing Problem to a Many-to-One Routing Problem; Reducing a Routing Problem to a Sorting Problem); 3.4.4 The Average-Case Behavior of the Greedy Algorithm (Bounds on Congestion; Bounds on Running Time; Analyzing Non-Predictive Contention-Resolution Protocols); 3.4.5 Converting Worst-Case Routing Problems into Average-Case Routing Problems (Hashing; Randomized Routing); 3.4.6 Bounding Queue Sizes (Routing on Arbitrary Levelled Networks); 3.4.7 Routing with Combining; 3.4.8 The Information Dispersal Approach to Routing (Using Information Dispersal to Attain Fault-Tolerance; Finite Fields and Coding Theory); 3.4.9 Circuit-Switching Algorithms
  3.5 Sorting: 3.5.1 Odd-Even Merge Sort (Constructing a Sorting Circuit with Depth log N(log N + 1)/2); 3.5.2 Sorting Small Sets; 3.5.3 A Deterministic O(log N log log N)-Step Sorting Algorithm; 3.5.4 Randomized O(log N)-Step Sorting Algorithms (A Circuit with Depth 7.45 log N that Usually Sorts)
  3.6 Simulating a Parallel Random Access Machine: 3.6.1 PRAM Models and Shared Memories; 3.6.2 Randomized Simulations Based on Hashing; 3.6.3 Deterministic Simulations Using Replicated Data; 3.6.4 Using Information Dispersal to Improve Performance
  3.7 The Fast Fourier Transform: 3.7.1 The Algorithm; 3.7.2 Implementation on the Butterfly and Shuffle-Exchange Graph; 3.7.3 Application to Convolution and Polynomial Arithmetic; 3.7.4 Application to Integer Multiplication
  3.8 Other Hypercubic Networks: 3.8.1 Butterflylike Networks (The Omega Network; The Flip Network; The Baseline and Reverse Baseline Networks; Banyan and Delta Networks; k-ary Butterflies); 3.8.2 De Bruijn-Type Networks (The k-ary de Bruijn Graph; The Generalized Shuffle-Exchange Graph)
  3.9 Problems
  3.10 Bibliographic Notes

Bibliography; Index of Lemmas, Theorems, and Corollaries; Author Index; Subject Index
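As a taste of the book's opening chapter, here is a C sketch of odd-even transposition sort (Section 1.6.1 in the contents above): N compare-exchange rounds over alternating even/odd pairs, which a linear array of N processors can execute in O(N) parallel steps since each round touches disjoint pairs:

```c
/* Odd-even transposition sort: alternate compare-exchanges on
 * (even, even+1) and (odd, odd+1) pairs for N rounds. */
#include <stdio.h>

#define N 8

int main(void) {
    int a[N] = {5, 3, 8, 1, 7, 2, 6, 4};   /* illustrative input */

    for (int round = 0; round < N; round++) {
        int start = round % 2;                     /* even or odd phase */
        for (int i = start; i + 1 < N; i += 2) {   /* disjoint pairs */
            if (a[i] > a[i + 1]) {
                int t = a[i]; a[i] = a[i + 1]; a[i + 1] = t;
            }
        }
    }
    for (int i = 0; i < N; i++) printf("%d ", a[i]);
    printf("\n");
    return 0;
}
```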

2,895 citations


"An introduction to parallel algorit..." refers background in this paper

  • ...Multiprocessor-based computers have been around for decades and various types of computer architectures [2] have been implemented in hardware throughout the years with different types of advantages/performance gains depending on the application....


  • ...Every location in the array represents a node of the tree: T[1] is the root, with children at T[2] and T[3].... (see the sketch after this list)


  • ...The text by [2] is a good start as it contains a comprehensive description of algorithms and different architecture topologies for the network model (tree, hypercube, mesh, and butterfly)....

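The implicit array layout of a complete binary tree quoted in the excerpt above needs no pointers: node k has its parent at k/2 and its children at 2k and 2k+1. A minimal C sketch (tree size and node choice are illustrative):

```c
/* Implicit binary tree in an array: T[1] is the root, node k has
 * children 2k and 2k+1 and parent k/2 (integer division). */
#include <stdio.h>

int main(void) {
    int T[16];                     /* 1-based: T[1..15] is a full tree */
    for (int k = 1; k <= 15; k++) T[k] = k;

    int k = 5;                     /* any internal node */
    printf("node %d: parent %d, left child %d, right child %d\n",
           T[k], T[k / 2], T[2 * k], T[2 * k + 1]);
    return 0;
}
```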

Book
01 Jan 1984
TL;DR: The authors have divided the use of computers into the following four levels of sophistication: data processing, information processing, knowledge processing, and intelligence processing.
Abstract: The book is intended as a text to support two semesters of courses in computer architecture at the college senior and graduate levels. There are excellent problems for students at the end of each chapter. The authors have divided the use of computers into the following four levels of sophistication: data processing, information processing, knowledge processing, and intelligence processing.

1,410 citations


"An introduction to parallel algorit..." refers background in this paper

  • ...Parallel architectures have been described in several books (see, for example, [18, 29])....


Journal ArticleDOI
TL;DR: The success of data parallel algorithms—even on problems that at first glance seem inherently serial—suggests that this style of programming has much wider applicability than was previously thought.
Abstract: Parallel computers with tens of thousands of processors are typically programmed in a data parallel style, as opposed to the control parallel style used in multiprocessing. The success of data parallel algorithms—even on problems that at first glance seem inherently serial—suggests that this style of programming has much wider applicability than was previously thought.

1,000 citations


"An introduction to parallel algorit..." refers background in this paper

  • ...Recent work on the mapping of PRAM algorithms on bounded-degree networks is described in [3, 13, 14, 20, 25]. Our presentation on the communication complexity of the matrix-multiplication problem in the shared-memory model is taken from [1]. Data-parallel algorithms are described in [15]....


Proceedings ArticleDOI
01 May 1978
TL;DR: A model of computation based on random access machines operating in parallel and sharing a common memory is presented and can accept in polynomial time exactly the sets accepted by nondeterministic exponential time bounded Turing machines.
Abstract: A model of computation based on random access machines operating in parallel and sharing a common memory is presented. The computational power of this model is related to that of traditional models. In particular, deterministic parallel RAM's can accept in polynomial time exactly the sets accepted by polynomial tape bounded Turing machines; nondeterministic RAM's can accept in polynomial time exactly the sets accepted by nondeterministic exponential time bounded Turing machines. Similar results hold for other classes. The effect of limiting the size of the common memory is also considered.

951 citations


"An introduction to parallel algorit..." refers background in this paper

  • ...Rigorous descriptions of shared-memory models were introduced later in [11,12]....


Journal ArticleDOI
TL;DR: It is shown that arithmetic expressions with n ≥ 1 variables and constants; operations of addition, multiplication, and division; and any depth of parenthesis nesting can be evaluated in time 4 log2 n + 10(n - 1)/p using p ≥ 1 processors which can independently perform arithmetic operations in unit time.
Abstract: It is shown that arithmetic expressions with n ≥ 1 variables and constants; operations of addition, multiplication, and division; and any depth of parenthesis nesting can be evaluated in time 4 log2 n + 10(n - 1)/p using p ≥ 1 processors which can independently perform arithmetic operations in unit time. This bound is within a constant factor of the best possible. A sharper result is given for expressions without the division operation, and the question of numerical stability is discussed.

864 citations


"An introduction to parallel algorit..." refers methods in this paper

  • ...The WT scheduling principle is derived from a theorem in [7]. In the literature, this principle is commonly referred to as Brent's theorem or Brent's scheduling principle.... (stated below)

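The principle quoted above is usually stated as follows; a short LaTeX rendering of the standard form of Brent's theorem (the notation W for work, T for parallel steps, and p for processors is the common one, not necessarily the book's):

```latex
% Brent's scheduling principle in its standard form: if an algorithm
% performs W(n) operations in T(n) parallel steps, then p processors
% can run it in time
\[
  T_p(n) \le T(n) + \frac{W(n) - T(n)}{p}
         = O\!\left(\frac{W(n)}{p} + T(n)\right),
\]
% because step $i$'s $W_i(n)$ operations take
% $\lceil W_i(n)/p \rceil \le (W_i(n)-1)/p + 1$ time, and summing
% over the $T(n)$ steps gives the bound.
```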