scispace - formally typeset
Search or ask a question
Book

An introduction to parallel algorithms

01 Oct 1992-
TL;DR: This book provides an introduction to the design and analysis of parallel algorithms, with the emphasis on the application of the PRAM model of parallel computation, with all its variants, to algorithm analysis.
Abstract: Written by an authority in the field, this book provides an introduction to the design and analysis of parallel algorithms. The emphasis is on the application of the PRAM (parallel random access machine) model of parallel computation, with all its variants, to algorithm analysis. Special attention is given to the selection of relevant data structures and to algorithm design principles that have proved to be useful. Features *Uses PRAM (parallel random access machine) as the model for parallel computation. *Covers all essential classes of parallel algorithms. *Rich exercise sets. *Written by a highly respected author within the field. 0201548569B04062001

Content maybe subject to copyright    Report

Citations
More filters
Journal Article
TL;DR: This paper shows that the total ordering of the program variables results in an elegant divide and conquer algorithm for the problem of feasibility testing in Difference Constraint Systems, which can be parallelized in straightforward fashion to run on any SIMD or MIMD architecture, thereby greatly enhancing its effectiveness.
Abstract: Chain Programming is a restricted form of Linear Programming with a number of interesting properties A Chain Program is characterized by a total ordering on the program variables (of a linear program) In other words, the constraints x1 ≤ x2xn are either implicitly or explicitly part of the constraint system defined by a linear program At the present juncture, it is not clear whether an arbitrary linear program augmented with a chain is easier to solve than linear programs in general, either asymptotically or computationally; however, it is definitely the case that Chain Programs possess a number of useful properties In this paper, we restrict ourselves to linear programs that are constituted entirely of difference constraints For such linear programs (also called Difference Constraint Systems), we show that the total ordering of the program variables results in an elegant divide and conquer algorithm for the problem of feasibility testing This approach can be parallelized in straightforward fashion to run on any SIMD or MIMD architecture, thereby greatly enhancing its effectiveness Inasmuch as difference constraint logic is an integral part of a number of verification problems in both Model-checking and real-time scheduling, our result is of particular importance to these communities Secondly, difference constraint systems are duals of shortest path problems in directed graphs and hence our study is important from the perspectives of the Artificial Intelligence and Operations Research communities as well One of the surprising consequences of our research is the establishment of a link between Chain Programs over Difference Constraints (CPD) and a specialized class of Totally Unimodular (TUM) matrices called interval matrices This connection makes it likely that there exists a more efficient algorithm for the problem of feasibility testing in CPDs

1 citations

Proceedings ArticleDOI
14 May 2011
TL;DR: The partial sum algorithm is used to improve the accuracy of single precision adding, the correctness of the algorithm is verified from the perspective of experiment by means of the matrix multiplication, and the effect of partial sum algorithms on compute peak of the GPU is analyzed.
Abstract: The single precision in the computer is composed of two parts: the mantissa and the exponent. which are expressed by the limited binary bits. During adding on the single precision, the smaller one should shift to line up the decimal points, If the mantissa of the smaller one exceeds the range of registers, truncating or rounding off will be executed and cause losing precision. As far as the serials of GTX200 is concerned, the length of the register storing intermediate results is the same as the one for the final results. The accuracy problem is very prominent while the length of the mantissa exceeds the range of register after lining up the decimal points during the single precision adding. In this paper, we use the partial sum algorithm to improve the accuracy of single precision adding, and verify the correctness of the algorithm from the perspective of experiment by means of the matrix multiplication. Finally, we analyze the effect of partial sum algorithm on compute peak of the GPU and come to the conclusion that the partial sum algorithm has little influence on the compute peak of the GPU.

1 citations

Dissertation
01 Jan 1999
TL;DR: Extensions to the HASE simulation environment are proposed that classify simulation components according to their communication interfaces, which facilitates the loose coupling of simulation entities and consequently promotes component reuse and potential runtime reduction.
Abstract: The practice of simulating real world systems on computers is widespread and forms an important aspect of many different disciplines. A simulation model provides a simplified view of a real world system facilitating interaction with key aspects of a system without the distraction of unnecessary detail. This thesis is concerned with the role of simulation in computer architecture design. It is recognised that the use of simulation in the design lifecycle is expensive and has tended to focus upon the register transfer (RT) level of design. The majority of design projects have no need for fully articulated models in the initial stages; the designer is more involved with fundamental decisions typically based upon choice of algorithm and high-level performance analysis. Following an overview of current simulation techniques and software, extensions to the HASE simulation environment are proposed that classify simulation components according to their communication interfaces. This facilitates the loose coupling of simulation entities and consequently promotes component reuse. In addition, the problem of allowing entities represented at different levels of architectural abstraction to communicate was examined and a technique developed to allow entities to negotiate a level of service. The MEDL and EDL languages were developed to enhance HASE's component library and project storage facilities; other software tools allowing the visualisation of a hierarchical model in terms of communication and abstraction were also developed. Various model libraries were developed to investigate the trade-offs between model accuracy, runtime and flexibility afforded by the new techniques. It was demonstrated that the developed techniques facilitate component reuse and offer potential runtime reduction.

1 citations

Posted Content
TL;DR: A new sequential algorithm for biconnectivity augmentation in trees is presented by simplifying the algorithm reported in \cite{nsn} by the help of edge contraction tool and it is shown that the parallel algorithm is optimal whose processor-time product is O(n) where $n$ is the number of vertices of a tree.
Abstract: For a connected graph, a vertex separator is a set of vertices whose removal creates at least two components and a minimum vertex separator is a vertex separator of least cardinality. The vertex connectivity refers to the size of a minimum vertex separator. For a connected graph $G$ with vertex connectivity $k (k \geq 1)$, the connectivity augmentation refers to a set $S$ of edges whose augmentation to $G$ increases its vertex connectivity by one. A minimum connectivity augmentation of $G$ is the one in which $S$ is minimum. In this paper, we focus our attention on connectivity augmentation of trees. Towards this end, we present a new sequential algorithm for biconnectivity augmentation in trees by simplifying the algorithm reported in \cite{nsn}. The simplicity is achieved with the help of edge contraction tool. This tool helps us in getting a recursive subproblem preserving all connectivity information. Subsequently, we present a parallel algorithm to obtain a minimum connectivity augmentation set in trees. Our parallel algorithm essentially follows the overall structure of sequential algorithm. Our implementation is based on CREW PRAM model with $O(\Delta)$ processors, where $\Delta$ refers to the maximum degree of a tree. We also show that our parallel algorithm is optimal whose processor-time product is O(n) where $n$ is the number of vertices of a tree, which is an improvement over the parallel algorithm reported in \cite{hsu}.

1 citations

References
More filters
Book
01 Sep 1991
TL;DR: This chapter discusses sorting on a Linear Array with a Systolic and Semisystolic Model of Computation, which automates the very labor-intensive and therefore time-heavy and expensive process of manually sorting arrays.
Abstract: Preface Acknowledgments Notation 1 Arrays and Trees 1.1 Elementary Sorting and Counting 1.1.1 Sorting on a Linear Array Assessing the Performance of the Algorithm Sorting N Numbers with Fewer Than N Processors 1.1.2 Sorting in the Bit Model 1.1.3 Lower Bounds 1.1.4 A Counterexample-Counting 1.1.5 Properties of the Fixed-Connection Network Model 1.2 Integer Arithmetic 1.2.1 Carry-Lookahead Addition 1.2.2 Prefix Computations-Segmented Prefix Computations 1.2.3 Carry-Save Addition 1.2.4 Multiplication and Convolution 1.2.5 Division and Newton Iteration 1.3 Matrix Algorithms 1.3.1 Elementary Matrix Products 1.3.2 Algorithms for Triangular Matrices 1.3.3 Algorithms for Tridiagonal Matrices -Odd-Even Reduction -Parallel Prefix Algorithms 1.3.4 Gaussian Elimination 1.3.5 Iterative Methods -Jacobi Relaxation -Gauss-Seidel Relaxation Finite Difference Methods -Multigrid Methods 1.4 Retiming and Systolic Conversion 1.4.1 A Motivating Example-Palindrome Recognition 1.4.2 The Systolic and Semisystolic Model of Computation 1.4.3 Retiming Semisystolic Networks 1.4.4 Conversion of a Semisystolic Network into a Systolic Network 1.4.5 The Special Case of Broadcasting 1.4.6 Retiming the Host 1.4.7 Design by Systolic Conversion-A Summary 1.5 Graph Algorithms 1.5.1 Transitive Closure 1.5.2 Connected Components 1.5.3 Shortest Paths 1.5.4 Breadth-First Spanning Trees 1.5.5 Minimum Weight Spanning Trees 1.6 Sorting Revisited 1.6.1 Odd-Even Transposition Sort on a Linear Array 1.6.2 A Simple Root-N(log N + 1)-Step Sorting Algorithm 1.6.3 A (3 Root- N + o(Root-N))-Step Sorting Algorithm 1.6.4 A Matching Lower Bound 1.7 Packet Routing 1.7.1 Greedy Algorithms 1.7.2 Average-Case Analysis of Greedy Algorithms -Routing N Packets to Random Destinations -Analysis of Dynamic Routing Problems 1.7.3 Randomized Routing Algorithms 1.7.4 Deterministic Algorithms with Small Queues 1.7.5 An Off-line Algorithm 1.7.6 Other Routing Models and Algorithms 1.8 Image Analysis and Computational Geometry 1.8.1 Component-Labelling Algorithms -Levialdi's Algorithm -An O (Root-N)-Step Recursive Algorithm 1.8.2 Computing Hough Transforms 1.8.3 Nearest-Neighbor Algorithms 1.8.4 Finding Convex Hulls 1.9 Higher-Dimensional Arrays 1.9.1 Definitions and Properties 1.9.2 Matrix Multiplication 1.9.3 Sorting 1.9.4 Packet Routing 1.9.5 Simulating High-Dimensional Arrays on Low-Dimensional Arrays 1.10 problems 1.11 Bibliographic Notes 2 Meshes of Trees 2.1 The Two-Dimensional Mesh of Trees 2.1.1 Definition and Properties 2.1.2 Recursive Decomposition 2.1.3 Derivation from KN,N 2.1.4 Variations 2.1.5 Comparison With the Pyramid and Multigrid 2.2 Elementary O(log N)-Step Algorithms 2.2.1 Routing 2.2.2 Sorting 2.2.3 Matrix-Vector Multiplication 2.2.4 Jacobi Relaxation 2.2.5 Pivoting 2.2.6 Convolution 2.2.7 Convex Hull 2.3 Integer Arithmetic 2.3.1 Multiplication 2.3.2 Division and Chinese Remaindering 2.3.3 Related Problems -Iterated Products -Rooting Finding 2.4 Matrix Algorithms 2.4.1 The Three-Dimensional Mesh of Trees 2.4.2 Matrix Multiplication 2.4.3 Inverting Lower Triangular Matrices 2.4.4 Inverting Arbitrary Matrices -Csanky's Algorithm -Inversion by Newton Iteration 2.4.5 Related Problems 2.5 Graph Algorithms 2.5.1 Minimum-Weight Spanning Trees 2.5.2 Connected Components 2.5.3 Transitive Closure 2.5.4 Shortest Paths 2.5.5 Matching Problems 2.6 Fast Evaluation of Straight-Line Code 2.6.1 Addition and Multiplication Over a Semiring 2.6.2 Extension to Codes with Subtraction and Division 2.6.3 Applications 2.7 Higher-Dimensional meshes of Trees 2.7.1 Definitions and Properties 2.7.2 The Shuffle-Tree Graph 2.8 Problems 2.9 Bibliographic Notes 3 Hypercubes and Related Networks 3.1 The Hypercube 3.1.1 Definitions and Properties 3.1.2 Containment of Arrays -Higher-Dimensional Arrays -Non-Power-of-2 Arrays 3.1.3 Containment of Complete Binary Trees 3.1.4 Embeddings of Arbitrary Binary Trees -Embeddings with Dilation 1 and Load O(M over N + log N) -Embeddings with Dilation O(1) and Load O (M over N + 1) -A Review of One-Error-Correcting Codes -Embedding Plog N into Hlog N 3.1.5 Containment of Meshes of Trees 3.1.6 Other Containment Results 3.2 The Butterfly, Cube-Connected-Cycles , and Benes Network 3.2.1 Definitions and Properties 3.2.2 Simulation of Arbitrary Networks 3.2.3 Simulation of Normal Hypercube Algorithms 3.2.4 Some Containment and Simulation Results 3.3 The Shuffle-Exchange and de Bruijn Graphs 3.3.1 Definitions and Properties 3.3.2 The Diaconis Card Tricks 3.3.3 Simulation of Normal Hypercube Algorithms 3.3.4 Similarities with the Butterfly 3.3.5 Some Containment and Simulation Results 3.4 Packet-Routing Algorithms 3.4.1 Definitions and Routing Models 3.4.2 Greedy Routing Algorithms and Worst-Case Problems 3.4.3 Packing, Spreading, and Monotone Routing Problems -Reducing a Many-to-Many Routing Problem to a Many-to-One Routing Problem -Reducing a Routing Problem to a Sorting Problem 3.4.4 The Average-Case Behavior of the Greedy Algorithm -Bounds on Congestion -Bounds on Running Time -Analyzing Non-Predictive Contention-Resolution Protocols 3.4.5 Converting Worst-Case Routing Problems into Average-Case Routing Problems -Hashing -Randomized Routing 3.4.6 Bounding Queue Sizes -Routing on Arbitrary Levelled Networks 3.4.7 Routing with Combining 3.4.8 The Information Dispersal Approach to Routing -Using Information Dispersal to Attain Fault-Tolerance -Finite Fields and Coding Theory 3.4.9 Circuit-Switching Algorithms 3.5 Sorting 3.5.1 Odd-Even Merge Sort -Constructing a Sorting Circuit with Depth log N(log N +1)/2 3.5.2 Sorting Small Sets 3.5.3 A Deterministic O(log N log log N)-Step Sorting Algorithm 3.5.4 Randomized O(log N)-Step Sorting Algorithms -A Circuit with Depth 7.45 log N that Usually Sorts 3.6 Simulating a Parallel Random Access Machine 3.6.1 PRAM Models and Shared Memories 3.6.2 Randomized Simulations Based on Hashing 3.6.3 Deterministic Simulations using Replicated Data 3.6.4 Using Information Dispersal to Improve Performance 3.7 The Fast Fourier Transform 3.7.1 The Algorithm 3.7.2 Implementation on the Butterfly and Shuffle-Exchange Graph 3.7.3 Application to Convolution and Polynomial Arithmetic 3.7.4 Application to Integer Multiplication 3.8 Other Hypercubic Networks 3.8.1 Butterflylike Networks -The Omega Network -The Flip Network -The Baseline and Reverse Baseline Networks -Banyan and Delta Networks -k-ary Butterflies 3.8.2 De Bruijn-Type Networks -The k-ary de Bruijn Graph -The Generalized Shuffle-Exchange Graph 3.9 Problems 3.10 Bibliographic Notes Bibliography Index Lemmas, Theorems, and Corollaries Author Index Subject Index

2,895 citations


"An introduction to parallel algorit..." refers background in this paper

  • ...Multiprocessorbased computers have been around for decades and various types of computer architectures [2] have been implemented in hardware throughout the years with different types of advantages/performance gains depending on the application....

    [...]

  • ...Every location in the array represents a node of the tree: T [1] is the root, with children at T [2] and T [3]....

    [...]

  • ...The text by [2] is a good start as it contains a comprehensive description of algorithms and different architecture topologies for the network model (tree, hypercube, mesh, and butterfly)....

    [...]

Book
01 Jan 1984
TL;DR: The authors have divided the use of computers into the following four levels of sophistication: data processing, information processing, knowledge processing, and intelligence processing.
Abstract: The book is intended as a text to support two semesters of courses in computer architecture at the college senior and graduate levels. There are excellent problems for students at the end of each chapter. The authors have divided the use of computers into the following four levels of sophistication: data processing, information processing, knowledge processing, and intelligence processing.

1,410 citations


"An introduction to parallel algorit..." refers background in this paper

  • ...Parallel architectures have been described in several books (see, for example, [18, 29])....

    [...]

Journal ArticleDOI
TL;DR: The success of data parallel algorithms—even on problems that at first glance seem inherently serial—suggests that this style of programming has much wider applicability than was previously thought.
Abstract: Parallel computers with tens of thousands of processors are typically programmed in a data parallel style, as opposed to the control parallel style used in multiprocessing. The success of data parallel algorithms—even on problems that at first glance seem inherently serial—suggests that this style of programming has much wider applicability than was previously thought.

1,000 citations


"An introduction to parallel algorit..." refers background in this paper

  • ...Recent work on the mapping of PRAM algorithms on bounded-degree networks is described in [3,13,14, 20, 25], Our presentation on the communication complexity of the matrix-multiplication problem in the sharedmemory model is taken from [1], Data-parallel algorithms are described in [15]....

    [...]

Proceedings ArticleDOI
01 May 1978
TL;DR: A model of computation based on random access machines operating in parallel and sharing a common memory is presented and can accept in polynomial time exactly the sets accepted by nondeterministic exponential time bounded Turing machines.
Abstract: A model of computation based on random access machines operating in parallel and sharing a common memory is presented. The computational power of this model is related to that of traditional models. In particular, deterministic parallel RAM's can accept in polynomial time exactly the sets accepted by polynomial tape bounded Turing machines; nondeterministic RAM's can accept in polynomial time exactly the sets accepted by nondeterministic exponential time bounded Turing machines. Similar results hold for other classes. The effect of limiting the size of the common memory is also considered.

951 citations


"An introduction to parallel algorit..." refers background in this paper

  • ...Rigorous descriptions of shared-memory models were introduced later in [11,12]....

    [...]

Journal ArticleDOI
TL;DR: It is shown that arithmetic expressions with n ≥ 1 variables and constants; operations of addition, multiplication, and division; and any depth of parenthesis nesting can be evaluated in time 4 log 2 + 10(n - 1) using processors which can independently perform arithmetic operations in unit time.
Abstract: It is shown that arithmetic expressions with n ≥ 1 variables and constants; operations of addition, multiplication, and division; and any depth of parenthesis nesting can be evaluated in time 4 log2n + 10(n - 1)/p using p ≥ 1 processors which can independently perform arithmetic operations in unit time. This bound is within a constant factor of the best possible. A sharper result is given for expressions without the division operation, and the question of numerical stability is discussed.

864 citations


"An introduction to parallel algorit..." refers methods in this paper

  • ...The WT scheduling principle is derived from a theorem in [7], In the literature, this principle is commonly referred to as Brent's theorem or Brent's scheduling principle....

    [...]