Book

An introduction to parallel algorithms

01 Oct 1992
TL;DR: This book provides an introduction to the design and analysis of parallel algorithms, with the emphasis on the application of the PRAM model of parallel computation, with all its variants, to algorithm analysis.
Abstract: Written by an authority in the field, this book provides an introduction to the design and analysis of parallel algorithms. The emphasis is on the application of the PRAM (parallel random access machine) model of parallel computation, with all its variants, to algorithm analysis. Special attention is given to the selection of relevant data structures and to algorithm design principles that have proved to be useful. Features: uses the PRAM (parallel random access machine) as the model for parallel computation; covers all essential classes of parallel algorithms; rich exercise sets; written by a highly respected author within the field.


Citations
Book ChapterDOI
01 May 2000
TL;DR: Some fundamental PRAM algorithms, such as prefix sums and list ranking, are implemented and evaluated, and these algorithms are shown to achieve optimal speedup on up to at least 16 processors.
Abstract: The main contribution of this work is to introduce a multithreaded parallel computer model (MPCM), which has a number of multithreaded processors connected by an interconnection network. We have implemented some fundamental PRAM algorithms, such as prefix sums and list ranking, and evaluated their performance. These algorithms achieved optimal speedup on up to at least 16 processors.
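A minimal sketch of the prefix-sums pattern named above, simulated sequentially in Python (an illustration of the PRAM-style algorithm, not the paper's MPCM code): each outer iteration stands in for one synchronous PRAM round, and the snapshot copy models the convention that all reads in a round happen before any writes.

```python
# Hillis-Steele-style prefix sums: O(log n) rounds, simulated sequentially.
def prefix_sums(a):
    x = list(a)
    n = len(x)
    d = 1
    while d < n:
        prev = list(x)            # snapshot: reads precede writes
        for i in range(d, n):     # conceptually, all i run in parallel
            x[i] = prev[i] + prev[i - d]
        d *= 2
    return x

assert prefix_sums([1, 2, 3, 4]) == [1, 3, 6, 10]
```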

1 citation

Proceedings Article
01 Jan 2008
TL;DR: The Low-degree PRAM (LoPRAM) model is based on the PRAM architecture and inherits many of its interesting theoretical properties; the number of processors or cores is bounded by O(log n), the synchronization model is looser, and parallelism is used at a higher level unless explicitly specified otherwise.
Abstract: We propose a new model with a small degree of parallelism that reflects current and future multicore architectures in practice. The model is based on the PRAM architecture and hence inherits many of its interesting theoretical properties. The key observations and differences are that the degree of parallelism (i.e., the number of processors or cores) is bounded by O(log n), the synchronization model is looser, and parallelism is used at a higher level unless explicitly specified otherwise. Surprisingly, these three rather minor variants result in a model in which obtaining work-optimal algorithms is significantly easier than for the PRAM. The new model is called Low-degree PRAM, or LoPRAM for short. Lastly, we observe that there are thresholds in the complexity of programming at p = O(log n) and p = O(sqrt(n)) and provide references for specific problems for which this threshold has been formally proven.
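As an illustration of the model's flavor (our sketch, not the paper's code), the snippet below caps the worker count at p = O(log n) and hands out work at the level of whole subproblems, the higher-level use of parallelism the abstract describes:

```python
# LoPRAM-flavored reduction: at most O(log n) workers, coarse-grained tasks.
import math
from concurrent.futures import ThreadPoolExecutor

def lopram_sum(a):
    if not a:
        return 0
    n = len(a)
    p = max(1, int(math.log2(n)))            # degree of parallelism O(log n)
    chunk = math.ceil(n / p)
    chunks = [a[i:i + chunk] for i in range(0, n, chunk)]
    with ThreadPoolExecutor(max_workers=p) as ex:
        partials = list(ex.map(sum, chunks)) # one coarse task per worker
    return sum(partials)                     # only O(log n) partials remain

assert lopram_sum(list(range(1000))) == sum(range(1000))
```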

1 citation

Journal ArticleDOI
TL;DR: A new natural sorting method whose run time for n input data ranges from O(n) to O(n log m), where m is a positive value bounded by half of n, indicating that the method uses less time as well as acceptable memory to sort a data sequence by exploiting its natural order.
Abstract: Problem statement: Researchers focused their attention on optimally adaptive sorting algorithms and illustrated a need to develop tools for constructing adaptive algorithms for large classes of measures. In an adaptive sorting algorithm, the run time for n input data varies smoothly from O(n) to O(n log n) with respect to several measures of disorder. Questions were raised whether any approach or technique could reduce the run time of adaptive sorting and provide an easier way of implementation for practical applications. Approach: The objective of this study is to present a new natural sorting method with a run time for n input data of O(n) to O(n log m), where m is a positive value bounded by half of n. In our method, a single pass over the input data creates blocks of data, or buffers, according to their natural sequential order, which can be ascending or descending. Afterward, a bottom-up approach is applied to merge the naturally sorted subsequences or buffers. Additionally, a parallel merging technique is aggregated into the proposed algorithm. Results: Experiments are provided to establish the best, worst and average case runtime behavior of the proposed method. The simulation statistics are in harmony with the theoretical calculations and demonstrate the method's efficiency. Conclusion: The results indicated that our method uses less time as well as acceptable memory to sort a data sequence by exploiting its natural order, and that it is applicable to realistic research. A parallel implementation can make the algorithm more time-efficient and will be a future research issue.
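The method's two phases translate directly into a short sketch, given here as our illustrative reconstruction (ascending runs only, sequential merging; the authors' parallel merging is not reproduced):

```python
# Natural merge sort: one pass finds the existing ascending runs, then
# bottom-up passes merge adjacent runs until a single sorted run remains.
def natural_merge_sort(a):
    if not a:
        return []
    runs, run = [], [a[0]]
    for x in a[1:]:                    # single pass: detect natural runs
        if x >= run[-1]:
            run.append(x)
        else:
            runs.append(run)
            run = [x]
    runs.append(run)
    while len(runs) > 1:               # bottom-up merging of adjacent runs
        merged = [merge(runs[i], runs[i + 1])
                  for i in range(0, len(runs) - 1, 2)]
        if len(runs) % 2:
            merged.append(runs[-1])
        runs = merged
    return runs[0]

def merge(left, right):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

assert natural_merge_sort([3, 4, 1, 2, 5]) == [1, 2, 3, 4, 5]
```

With m runs, the merge phase does O(log m) passes of O(n) work each, matching the O(n) to O(n log m) range claimed above.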

1 citation


Cites background or methods from "An introduction to parallel algorit..."

  • ...Using the merge sort algorithm, sorting a sequence of n elements can be done optimally in O(log n log(log n)) (JaJa, 1992)....



  • ...According to Simple Merge Sort (JaJa, 1992), the running time of this algorithm is O(log n log(log n)) and the total number of operations used is O(n log n) (where the PRAM model is CREW PRAM)....


  • ...The design of generic sorting algorithms results in several advantages (Estivill-Castro and Wood, 1992a), for example: • The algorithm designer can focus the efforts on the combinatorial properties of the measures of disorder of interest rather than on the combinatorial properties of the algorithm • The designer can regulate the trade-off between the number of measures for adaptivity and the amount of machinery required • The resulting implementations are practical and do not require complex data structures • Parallelism is present as the approach is inherited from Mergesort (JaJa, 1992) In the proposed technique, at first, the disorderness of the data is checked and partitioned in a single pass over the data set....


  • ...The number of ascending runs is directly related to the measure Runs; Natural Mergesort takes O(n (1 + log[Runs(n) + 1])) time....


DissertationDOI
01 Jan 2014
TL;DR: A highly multithreaded FFT-based direct Poisson solver that is optimized for the recent NVIDIA GPUs and a new approach that minimizes the number of global memory accesses and overlaps the computations along the different dimensions are presented.
Abstract: Title of dissertation: OPTIMIZATION TECHNIQUES FOR MAPPING ALGORITHMS AND APPLICATIONS ONTO CUDA GPU PLATFORMS AND CPU-GPU HETEROGENEOUS PLATFORMS Jing Wu, Doctor of Philosophy, 2014 Dissertation directed by: Professor Joseph F. JaJa, Department of Electrical and Computer Engineering An emerging trend in processor architecture seems to indicate the doubling of the number of cores per chip every two years with the same or decreased clock speed. Of particular interest to this thesis is the class of many-core processors, which are becoming more attractive due to their high performance, low cost, and low power consumption. The main goal of this dissertation is to develop optimization techniques for mapping algorithms and applications onto CUDA GPUs and CPU-GPU heterogeneous platforms. The Fast Fourier Transform (FFT) constitutes a fundamental tool in computational science and engineering, and hence a GPU-optimized implementation is of paramount importance. We first study the mapping of the 3D FFT onto recent CUDA GPUs and develop a new approach that minimizes the number of global memory accesses and overlaps the computations along the different dimensions. We obtain some of the fastest known implementations for the computation of multi-dimensional FFTs. We then present a highly multithreaded FFT-based direct Poisson solver that is optimized for recent NVIDIA GPUs. In addition to the massive multithreading, our algorithm carefully manages the multiple layers of the memory hierarchy so that all global memory accesses are coalesced into 128-byte device memory transactions. As a result, we have achieved up to 375 GFLOPS with a bandwidth of 120 GB/s on the GTX 480. We further extend our methodology to deal with CPU-GPU based heterogeneous platforms for the case when the input is too large to fit in the GPU's global memory. We develop optimization techniques for memory-bound and computation-bound applications. The main challenge here is to minimize data transfer between the CPU memory and the device memory and to overlap these transfers as much as possible with kernel execution. For memory-bound applications, we achieve a near-peak effective PCIe bus bandwidth of 9-10 GB/s and performance as high as 145 GFLOPS for multi-dimensional FFT computations and for solving the Poisson equation. We extend our CPU-GPU based software pipeline to a computation-bound application, DGEMM, and achieve the illusion of a memory of the CPU memory's size with a computation throughput similar to a pure GPU's.
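The FFT-based direct Poisson solve at the heart of the dissertation can be illustrated compactly: with periodic boundaries the Laplacian is diagonal in Fourier space, so the solve is one forward FFT, a pointwise division by the eigenvalues, and one inverse FFT. The NumPy sketch below shows the numerics only; the dissertation's actual contribution, the CUDA mapping with coalesced accesses and massive multithreading, is not reproduced here.

```python
# Spectral solve of -laplacian(u) = f on a periodic 2D grid.
import numpy as np

def poisson_periodic_2d(f, h=1.0):
    n, m = f.shape
    kx = 2 * np.pi * np.fft.fftfreq(n, d=h)    # angular wavenumbers
    ky = 2 * np.pi * np.fft.fftfreq(m, d=h)
    lam = kx[:, None] ** 2 + ky[None, :] ** 2  # eigenvalues of -laplacian
    fh = np.fft.fft2(f)
    lam[0, 0] = 1.0             # dodge divide-by-zero for the mean mode
    uh = fh / lam
    uh[0, 0] = 0.0              # pin the undetermined constant (zero mean)
    return np.real(np.fft.ifft2(uh))

# Self-check: u = sin(x)sin(y) satisfies -laplacian(u) = 2u.
n = 64
x = np.linspace(0, 2 * np.pi, n, endpoint=False)
u = np.outer(np.sin(x), np.sin(x))
assert np.abs(poisson_periodic_2d(2 * u, h=2 * np.pi / n) - u).max() < 1e-8
```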

1 citation


Cites background from "An introduction to parallel algorit..."

  • ...For shared memory models, the most prevalent theoretical model is the Parallel Random Access Machine (PRAM) model [23]....


ReportDOI
01 Jan 2006
TL;DR: BEC is a portable lightweight approach for incremental acceptance of the GAS model, along an evolution path that leverages existing infrastructures and maintains backward compatibility with existing programming methods and environments.
Abstract: We propose an abstraction, named BEC, to enable Global Address Space (GAS) capabilities for parallel programming in SPMD style. It is a portable, lightweight approach for incremental acceptance of the GAS model, along an evolution path that leverages existing infrastructures and maintains backward compatibility with existing programming methods and environments. It assists migration of legacy applications, thereby encouraging their expert programmers to adopt the new model. In addition, it provides for some of the unaddressed needs, such as efficient support for high-volume, fine-grained, and random communications, which are common in parallel graph algorithms, sparse matrix operations, and large-scale simulations. The idea behind BEC is that messages are aggregated by a runtime library for bulk transport to handle such unpredictable communication patterns. Data from initial experiments with a prototype communication bundling library using the Bundle-Exchange-Compute (thus motivating the name BEC) programming style shows that this approach scales well. As examples of suitable BEC applications, we present sparse matrix kernels for multiplication and overlapping Schwarz preconditioning [5, 11]. We also discuss solid mechanics material contact [1, 18] with abundant irregular, fine-grained communication. BEC can be used as an enhancement to existing environments such as MPI. It can also function as an intermediate language [14] to other high-level GAS languages such as PRAM C [8] and UPC [30]. Furthermore, it can serve as a bridge between programming models such as virtual shared memory and message passing.
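The core mechanism the abstract describes, buffering fine-grained messages per destination and shipping them in bulk at a phase boundary, fits in a few lines. The sketch below is our illustration of the Bundle-Exchange-Compute pattern; send_bulk is a hypothetical transport hook standing in for whatever MPI or GAS layer sits underneath.

```python
# Bundle-Exchange-Compute pattern: aggregate tiny messages, send in bulk.
from collections import defaultdict

class Bundler:
    def __init__(self, send_bulk):
        self.buffers = defaultdict(list)
        self.send_bulk = send_bulk        # callable(dest, list_of_msgs)

    def put(self, dest, msg):
        self.buffers[dest].append(msg)    # cheap local append, no send yet

    def exchange(self):
        for dest, msgs in self.buffers.items():
            self.send_bulk(dest, msgs)    # one bulk transfer per destination
        self.buffers.clear()

# Usage: 1000 fine-grained updates become four bulk messages.
log = []
b = Bundler(lambda dest, msgs: log.append((dest, len(msgs))))
for i in range(1000):
    b.put(i % 4, ("update", i))
b.exchange()
assert log == [(0, 250), (1, 250), (2, 250), (3, 250)]
```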

1 citation

References
Book
01 Sep 1991
TL;DR: This book presents the design and analysis of parallel algorithms and architectures for arrays, trees, hypercubes, and related networks, from elementary sorting on a linear array to packet routing, PRAM simulation, and the systolic and semisystolic models of computation.
Abstract: Preface Acknowledgments Notation
1 Arrays and Trees 1.1 Elementary Sorting and Counting 1.1.1 Sorting on a Linear Array Assessing the Performance of the Algorithm Sorting N Numbers with Fewer Than N Processors 1.1.2 Sorting in the Bit Model 1.1.3 Lower Bounds 1.1.4 A Counterexample-Counting 1.1.5 Properties of the Fixed-Connection Network Model 1.2 Integer Arithmetic 1.2.1 Carry-Lookahead Addition 1.2.2 Prefix Computations-Segmented Prefix Computations 1.2.3 Carry-Save Addition 1.2.4 Multiplication and Convolution 1.2.5 Division and Newton Iteration 1.3 Matrix Algorithms 1.3.1 Elementary Matrix Products 1.3.2 Algorithms for Triangular Matrices 1.3.3 Algorithms for Tridiagonal Matrices -Odd-Even Reduction -Parallel Prefix Algorithms 1.3.4 Gaussian Elimination 1.3.5 Iterative Methods -Jacobi Relaxation -Gauss-Seidel Relaxation Finite Difference Methods -Multigrid Methods 1.4 Retiming and Systolic Conversion 1.4.1 A Motivating Example-Palindrome Recognition 1.4.2 The Systolic and Semisystolic Model of Computation 1.4.3 Retiming Semisystolic Networks 1.4.4 Conversion of a Semisystolic Network into a Systolic Network 1.4.5 The Special Case of Broadcasting 1.4.6 Retiming the Host 1.4.7 Design by Systolic Conversion-A Summary 1.5 Graph Algorithms 1.5.1 Transitive Closure 1.5.2 Connected Components 1.5.3 Shortest Paths 1.5.4 Breadth-First Spanning Trees 1.5.5 Minimum Weight Spanning Trees 1.6 Sorting Revisited 1.6.1 Odd-Even Transposition Sort on a Linear Array 1.6.2 A Simple Root-N(log N + 1)-Step Sorting Algorithm 1.6.3 A (3 Root-N + o(Root-N))-Step Sorting Algorithm 1.6.4 A Matching Lower Bound 1.7 Packet Routing 1.7.1 Greedy Algorithms 1.7.2 Average-Case Analysis of Greedy Algorithms -Routing N Packets to Random Destinations -Analysis of Dynamic Routing Problems 1.7.3 Randomized Routing Algorithms 1.7.4 Deterministic Algorithms with Small Queues 1.7.5 An Off-line Algorithm 1.7.6 Other Routing Models and Algorithms 1.8 Image Analysis and Computational Geometry 1.8.1 Component-Labelling Algorithms -Levialdi's Algorithm -An O(Root-N)-Step Recursive Algorithm 1.8.2 Computing Hough Transforms 1.8.3 Nearest-Neighbor Algorithms 1.8.4 Finding Convex Hulls 1.9 Higher-Dimensional Arrays 1.9.1 Definitions and Properties 1.9.2 Matrix Multiplication 1.9.3 Sorting 1.9.4 Packet Routing 1.9.5 Simulating High-Dimensional Arrays on Low-Dimensional Arrays 1.10 Problems 1.11 Bibliographic Notes
2 Meshes of Trees 2.1 The Two-Dimensional Mesh of Trees 2.1.1 Definition and Properties 2.1.2 Recursive Decomposition 2.1.3 Derivation from KN,N 2.1.4 Variations 2.1.5 Comparison With the Pyramid and Multigrid 2.2 Elementary O(log N)-Step Algorithms 2.2.1 Routing 2.2.2 Sorting 2.2.3 Matrix-Vector Multiplication 2.2.4 Jacobi Relaxation 2.2.5 Pivoting 2.2.6 Convolution 2.2.7 Convex Hull 2.3 Integer Arithmetic 2.3.1 Multiplication 2.3.2 Division and Chinese Remaindering 2.3.3 Related Problems -Iterated Products -Root Finding 2.4 Matrix Algorithms 2.4.1 The Three-Dimensional Mesh of Trees 2.4.2 Matrix Multiplication 2.4.3 Inverting Lower Triangular Matrices 2.4.4 Inverting Arbitrary Matrices -Csanky's Algorithm -Inversion by Newton Iteration 2.4.5 Related Problems 2.5 Graph Algorithms 2.5.1 Minimum-Weight Spanning Trees 2.5.2 Connected Components 2.5.3 Transitive Closure 2.5.4 Shortest Paths 2.5.5 Matching Problems 2.6 Fast Evaluation of Straight-Line Code 2.6.1 Addition and Multiplication Over a Semiring 2.6.2 Extension to Codes with Subtraction and Division 2.6.3 Applications 2.7 Higher-Dimensional Meshes of Trees 2.7.1 Definitions and Properties 2.7.2 The Shuffle-Tree Graph 2.8 Problems 2.9 Bibliographic Notes
3 Hypercubes and Related Networks 3.1 The Hypercube 3.1.1 Definitions and Properties 3.1.2 Containment of Arrays -Higher-Dimensional Arrays -Non-Power-of-2 Arrays 3.1.3 Containment of Complete Binary Trees 3.1.4 Embeddings of Arbitrary Binary Trees -Embeddings with Dilation 1 and Load O(M over N + log N) -Embeddings with Dilation O(1) and Load O(M over N + 1) -A Review of One-Error-Correcting Codes -Embedding Plog N into Hlog N 3.1.5 Containment of Meshes of Trees 3.1.6 Other Containment Results 3.2 The Butterfly, Cube-Connected-Cycles, and Benes Network 3.2.1 Definitions and Properties 3.2.2 Simulation of Arbitrary Networks 3.2.3 Simulation of Normal Hypercube Algorithms 3.2.4 Some Containment and Simulation Results 3.3 The Shuffle-Exchange and de Bruijn Graphs 3.3.1 Definitions and Properties 3.3.2 The Diaconis Card Tricks 3.3.3 Simulation of Normal Hypercube Algorithms 3.3.4 Similarities with the Butterfly 3.3.5 Some Containment and Simulation Results 3.4 Packet-Routing Algorithms 3.4.1 Definitions and Routing Models 3.4.2 Greedy Routing Algorithms and Worst-Case Problems 3.4.3 Packing, Spreading, and Monotone Routing Problems -Reducing a Many-to-Many Routing Problem to a Many-to-One Routing Problem -Reducing a Routing Problem to a Sorting Problem 3.4.4 The Average-Case Behavior of the Greedy Algorithm -Bounds on Congestion -Bounds on Running Time -Analyzing Non-Predictive Contention-Resolution Protocols 3.4.5 Converting Worst-Case Routing Problems into Average-Case Routing Problems -Hashing -Randomized Routing 3.4.6 Bounding Queue Sizes -Routing on Arbitrary Levelled Networks 3.4.7 Routing with Combining 3.4.8 The Information Dispersal Approach to Routing -Using Information Dispersal to Attain Fault-Tolerance -Finite Fields and Coding Theory 3.4.9 Circuit-Switching Algorithms 3.5 Sorting 3.5.1 Odd-Even Merge Sort -Constructing a Sorting Circuit with Depth log N(log N + 1)/2 3.5.2 Sorting Small Sets 3.5.3 A Deterministic O(log N log log N)-Step Sorting Algorithm 3.5.4 Randomized O(log N)-Step Sorting Algorithms -A Circuit with Depth 7.45 log N that Usually Sorts 3.6 Simulating a Parallel Random Access Machine 3.6.1 PRAM Models and Shared Memories 3.6.2 Randomized Simulations Based on Hashing 3.6.3 Deterministic Simulations using Replicated Data 3.6.4 Using Information Dispersal to Improve Performance 3.7 The Fast Fourier Transform 3.7.1 The Algorithm 3.7.2 Implementation on the Butterfly and Shuffle-Exchange Graph 3.7.3 Application to Convolution and Polynomial Arithmetic 3.7.4 Application to Integer Multiplication 3.8 Other Hypercubic Networks 3.8.1 Butterflylike Networks -The Omega Network -The Flip Network -The Baseline and Reverse Baseline Networks -Banyan and Delta Networks -k-ary Butterflies 3.8.2 De Bruijn-Type Networks -The k-ary de Bruijn Graph -The Generalized Shuffle-Exchange Graph 3.9 Problems 3.10 Bibliographic Notes
Bibliography Index Lemmas, Theorems, and Corollaries Author Index Subject Index

2,895 citations


"An introduction to parallel algorit..." refers background in this paper

  • ...Multiprocessor-based computers have been around for decades and various types of computer architectures [2] have been implemented in hardware throughout the years with different types of advantages/performance gains depending on the application....


  • ...Every location in the array represents a node of the tree: T[1] is the root, with children at T[2] and T[3].... (see the sketch after this list)


  • ...The text by [2] is a good start as it contains a comprehensive description of algorithms and different architecture topologies for the network model (tree, hypercube, mesh, and butterfly)....

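A tiny sketch of the array embedding mentioned in the excerpt above (1-indexed heap layout, our illustration):

```python
# Complete binary tree stored in an array: T[1] is the root and
# node i has children at 2i and 2i + 1, parent at i // 2.
def children(i):
    return 2 * i, 2 * i + 1

def parent(i):
    return i // 2

T = [None, 'a', 'b', 'c', 'd', 'e', 'f', 'g']   # T[1..7], no pointers needed
assert children(1) == (2, 3) and parent(3) == 1
```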

Book
01 Jan 1984
TL;DR: The authors have divided the use of computers into the following four levels of sophistication: data processing, information processing, knowledge processing, and intelligence processing.
Abstract: The book is intended as a text to support two semesters of courses in computer architecture at the college senior and graduate levels. There are excellent problems for students at the end of each chapter. The authors have divided the use of computers into the following four levels of sophistication: data processing, information processing, knowledge processing, and intelligence processing.

1,410 citations


"An introduction to parallel algorit..." refers background in this paper

  • ...Parallel architectures have been described in several books (see, for example, [18, 29])....


Journal ArticleDOI
TL;DR: The success of data parallel algorithms—even on problems that at first glance seem inherently serial—suggests that this style of programming has much wider applicability than was previously thought.
Abstract: Parallel computers with tens of thousands of processors are typically programmed in a data parallel style, as opposed to the control parallel style used in multiprocessing. The success of data parallel algorithms—even on problems that at first glance seem inherently serial—suggests that this style of programming has much wider applicability than was previously thought.
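A classic instance of the data-parallel style succeeding on a seemingly serial problem is list ranking by pointer jumping. The sketch below simulates the synchronous rounds sequentially; it illustrates the technique and is not code from the article.

```python
# List ranking by pointer jumping: rank[i] becomes the distance from
# node i to the end of the list after O(log n) synchronous rounds.
import math

def list_rank(succ):
    n = len(succ)
    rank = [0 if succ[i] == i else 1 for i in range(n)]
    nxt = list(succ)
    for _ in range(max(1, math.ceil(math.log2(n)))):
        # One round: every node reads before anyone writes.
        rank, nxt = ([rank[i] + rank[nxt[i]] for i in range(n)],
                     [nxt[nxt[i]] for i in range(n)])
    return rank

# 0 -> 1 -> 2 -> 3, where 3 is terminal (points to itself).
assert list_rank([1, 2, 3, 3]) == [3, 2, 1, 0]
```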

1,000 citations


"An introduction to parallel algorit..." refers background in this paper

  • ...Recent work on the mapping of PRAM algorithms on bounded-degree networks is described in [3, 13, 14, 20, 25]. Our presentation on the communication complexity of the matrix-multiplication problem in the shared-memory model is taken from [1]. Data-parallel algorithms are described in [15]....


Proceedings ArticleDOI
01 May 1978
TL;DR: A model of computation based on random access machines operating in parallel and sharing a common memory is presented; deterministic parallel RAMs accept in polynomial time exactly the sets accepted by polynomial-tape-bounded Turing machines, while nondeterministic parallel RAMs accept in polynomial time exactly the sets accepted by nondeterministic exponential-time-bounded Turing machines.
Abstract: A model of computation based on random access machines operating in parallel and sharing a common memory is presented. The computational power of this model is related to that of traditional models. In particular, deterministic parallel RAM's can accept in polynomial time exactly the sets accepted by polynomial tape bounded Turing machines; nondeterministic RAM's can accept in polynomial time exactly the sets accepted by nondeterministic exponential time bounded Turing machines. Similar results hold for other classes. The effect of limiting the size of the common memory is also considered.

951 citations


"An introduction to parallel algorit..." refers background in this paper

  • ...Rigorous descriptions of shared-memory models were introduced later in [11, 12]....


Journal ArticleDOI
TL;DR: It is shown that arithmetic expressions with n ≥ 1 variables and constants; operations of addition, multiplication, and division; and any depth of parenthesis nesting can be evaluated in time 4 log2 n + 10(n - 1)/p using p ≥ 1 processors which can independently perform arithmetic operations in unit time.
Abstract: It is shown that arithmetic expressions with n ≥ 1 variables and constants; operations of addition, multiplication, and division; and any depth of parenthesis nesting can be evaluated in time 4 log2 n + 10(n - 1)/p using p ≥ 1 processors which can independently perform arithmetic operations in unit time. This bound is within a constant factor of the best possible. A sharper result is given for expressions without the division operation, and the question of numerical stability is discussed.
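This theorem underlies what the excerpt below calls Brent's scheduling principle: an algorithm that performs W operations in T parallel steps can be simulated on p processors in at most floor(W/p) + T steps. A small numeric illustration of that general bound (the 4 log2 n + 10(n - 1)/p expression above is the paper's sharper, problem-specific form):

```python
# Brent's scheduling bound: W operations in T parallel steps run on
# p processors in at most floor(W/p) + T steps.
import math

def brent_bound(work, depth, p):
    return work // p + depth

# Balanced-tree sum of n = 1024 numbers: W = n - 1 adds, T = log2(n) steps.
n = 1024
assert brent_bound(n - 1, int(math.log2(n)), p=16) == 73
```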

864 citations


"An introduction to parallel algorit..." refers methods in this paper

  • ...The WT scheduling principle is derived from a theorem in [7]. In the literature, this principle is commonly referred to as Brent's theorem or Brent's scheduling principle....
