Book

An introduction to parallel algorithms

01 Oct 1992
TL;DR: This book provides an introduction to the design and analysis of parallel algorithms, with emphasis on applying the PRAM model of parallel computation, and all its variants, to algorithm analysis.
Abstract: Written by an authority in the field, this book provides an introduction to the design and analysis of parallel algorithms. The emphasis is on the application of the PRAM (parallel random access machine) model of parallel computation, with all its variants, to algorithm analysis. Special attention is given to the selection of relevant data structures and to algorithm design principles that have proved to be useful.
Features:
  • Uses the PRAM (parallel random access machine) as the model for parallel computation.
  • Covers all essential classes of parallel algorithms.
  • Rich exercise sets.
  • Written by a highly respected author within the field.
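
To illustrate the PRAM-style analysis the book emphasizes, here is a minimal Python simulation of prefix sums computed by doubling, running in O(log n) lock-step rounds with O(n log n) work. This is our own hedged sketch (the function name and structure are ours, not the book's code):

    # Sketch: PRAM-style parallel prefix sums by doubling, simulated
    # sequentially. Each round rebuilds the whole array from the previous
    # round's values, mimicking synchronous lock-step PRAM execution.
    def pram_prefix_sums(values):
        x = list(values)
        d = 1
        while d < len(x):
            # On a PRAM, all n processors would perform this step in unit time.
            x = [x[i] + x[i - d] if i >= d else x[i] for i in range(len(x))]
            d *= 2
        return x

    assert pram_prefix_sums([1, 2, 3, 4]) == [1, 3, 6, 10]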


Citations
01 Jan 2003
TL;DR: This thesis numerically demonstrates non-convexity of the functional F_CT for an arbitrarily chosen image, and proposes a method for brain extraction based on segmentation and a nonlinear diffusion equation.
Abstract: Segmentation of images is a major task of image processing. There is no general segmentation procedure that can deal with all sorts of images, and the correct solution will always depend to a certain degree on subjectivity. This thesis focuses on segmentation of images using level set functions, and specifically explores the work of Tony Chan and Luminita Vese [CV01]. A new model for image segmentation, the Chan-Tai model, is tested; it was proposed in [CT03] and extends the work of Chan-Vese. This level set formulation depends on two level set functions which can identify four distinct regions with no vacuum or overlap between the regions. The segmented image is approximated by a piecewise constant function q such that the functional F = ½ ∫_Ω (q − B)² dx dy is minimised with respect to the two level set functions φ1 and φ2 over the region Ω, where B is the image function. When minimising this functional, the δ-function appears in the Euler-Lagrange equations. The δ-function can create numerical oscillations, and in the literature it has sometimes been replaced by the gradient of its argument. In this thesis we propose and numerically test different replacements for the δ-function. The least-squares minimisation problem for F_CT is non-convex, which complicates the search for a global minimiser of the functional. For most practical problems there will be several local minima, and sometimes several global ones. Showing non-convexity of a functional is theoretically difficult, partly because convexity or non-convexity depends on the image being segmented. In this thesis we show numerically that the functional F_CT is non-convex for an arbitrarily chosen image. To improve the ability of the numerical scheme to locate a global solution, we propose a method for initialising the level set functions. The isodata algorithm used for this is a well-known segmentation technique, but it has severe limitations as a stand-alone segmentation method. The initialisation provides a good initial guess of the solution, and the chances of ending up at a global minimum increase greatly. Segmentation of MRI images plays an important role in this thesis as a major motivation for segmentation. Costly human resources are relieved if MRI images can be segmented automatically into different tissues. For the level set approach with two level set functions to work properly on MRI images, a brain extraction must first be performed. Clinicians want to know the location and quantity of three tissue types: white matter, gray matter, and cerebrospinal fluid. These tissues are located inside the brain, and in the process of brain extraction other tissue types, such as bone, skin, and muscle, are removed, leaving only the tissues of interest. In this thesis we propose a method for brain extraction based on segmentation and a nonlinear diffusion equation. Finally, we investigate an image registration technique for reading number plates on cars. The characters cut from the number plate are aligned by centre of mass to ideal characters (templates), and are then registered against all possible templates; the character is taken to be the template with the greatest similarity after the registration process.
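
The isodata initialisation mentioned above is a standard iterative thresholding scheme; the following Python sketch is our own illustration (hypothetical names, not code from the thesis):

    # Classic isodata (iterative mean) thresholding: split intensities at a
    # threshold, recompute the two class means, and set the new threshold to
    # their average until it stabilises.
    def isodata_threshold(pixels, tol=0.5):
        t = sum(pixels) / len(pixels)          # initial guess: global mean
        while True:
            low = [p for p in pixels if p <= t]
            high = [p for p in pixels if p > t]
            if not low or not high:
                return t
            new_t = 0.5 * (sum(low) / len(low) + sum(high) / len(high))
            if abs(new_t - t) < tol:
                return new_t
            t = new_t

The resulting threshold can then seed the level set functions φ1 and φ2 with a rough foreground/background split before the gradient-based minimisation starts.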

13 citations

Dissertation
12 Dec 2005
TL;DR: The design of a programming language is the result of a compromise that strikes a balance between the language's various qualities, such as expressiveness, safety, performance predictability, efficiency, and simplicity of the semantics.
Abstract: Some problems require performance that only massively parallel machines or meta-computers can offer. Writing algorithms for this type of machine remains harder than for strictly sequential ones, and the design of suitable languages is an active research topic, notwithstanding the frequent use of concurrent programming. Indeed, the design of a programming language is the result of a compromise that strikes a balance between the language's various qualities, such as expressiveness, safety, performance predictability, efficiency, and simplicity of the semantics. Work prior to this thesis explored in depth the intermediate position occupied by the algorithmic-skeleton paradigm. The aim, however, was not to design an a priori fixed set of operations and then devise methodologies for performance prediction, but to fix a structured model of parallelism (with its cost model) and then design a universal set of operations with which any algorithm of that model can be programmed. The objective is therefore the following: to design universal languages in which the programmer can form an idea of the cost from the source code. This thesis is part of the project «CoordinAtion et Repartition des Applications Multiprocesseurs en objective camL» (CARAML) of the ACI GRID, whose objective was the development of libraries for high-performance and global computing around the OCaml language. The project was organized in three successive phases: safety and single-user irregular data-parallel operations; data-parallel multiprocessing operations; and global operations for programming computational grids. This manuscript is organized in three parts, each corresponding to the author's contributions in one of the phases of the CARAML project: a semantic study of a functional language for BSP programming and the certification of programs written in this language; the presentation of a parallel composition primitive (which also allows parallel divide-and-conquer algorithms to be programmed), an example application via the use and implementation of parallel data structures, and an extension for parallel input/output in BSML; and the adaptation of the language to meta-computing.
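
For context, the BSP model underlying BSML (due to Valiant) prices a superstep as max_i w_i + h·g + l, which is precisely what lets the programmer "form an idea of the cost from the source code". A minimal Python sketch of this standard cost formula, with hypothetical names of our own:

    # Standard BSP superstep cost w + h*g + l, where g is the per-word
    # communication throughput and l the barrier synchronisation latency.
    def bsp_superstep_cost(local_work, msgs_per_proc, g, l):
        w = max(local_work)          # slowest processor's computation
        h = max(msgs_per_proc)       # h-relation: max words sent or received
        return w + h * g + l

    def bsp_program_cost(supersteps, g, l):
        # supersteps: list of (local_work, msgs_per_proc) pairs
        return sum(bsp_superstep_cost(w, h, g, l) for w, h in supersteps)

Here g and l are machine parameters measured once per target architecture, so a program's cost expression can be read off its source structure.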

13 citations


Cites methods from "An introduction to parallel algorit..."

  • ...But given the subtlety of certain parallel algorithms [156], this method cannot be used in a general and satisfactory way....

    [...]

  • ...It uses a concurrent semantics to express parallel algorithms, which by definition serve to design functions [156] and not general relations,...

    [...]

Journal ArticleDOI
TL;DR: Two simple optimal-work parallel algorithms are given for sorting a list L = (X_1, X_2, …, X_m) of m strings over an arbitrary alphabet Σ, where ∑_{i=1}^{m} |X_i| = n and two elements of Σ can be compared in unit time using a single processor.

13 citations

Proceedings Article
01 Jan 2011
TL;DR: The maximum potential merge speedup is found to be limited, since only two of the merge's four stages are likely to benefit from parallelization; a parallel dictionary slice merge algorithm is presented, along with an alternative parallel merge algorithm for GPUs that achieves up to 40% more throughput than its CPU implementation.
Abstract: Column-oriented in-memory databases typically use dictionary compression to reduce overall storage space and allow fast lookup and comparison. However, there is a high performance cost for updates, since the dictionary used for compression has to be recreated each time records are created, updated, or deleted. This has to be taken into account for TPC-C-like workloads, in which around 45% of all queries are transactional modifications. A technique called differential updates can be used to allow faster modifications. In addition to the main storage, the database then maintains a delta storage to accommodate modifying queries. During the merge process, the modifications of the delta are merged with the main storage in parallel to the normal operation of the database. Current hardware and software trends suggest that this problem can be tackled by massively parallelizing the merge process. One approach to massive parallelism is GPUs, which offer orders of magnitude more cores than modern CPUs. Therefore, we analyze the feasibility of a parallel GPU merge implementation and its potential speedup. We found that the maximum potential merge speedup is limited, since only two of its four stages are likely to benefit from parallelization. We present a parallel dictionary slice merge algorithm as well as an alternative parallel merge algorithm for GPUs that achieves up to 40% more throughput than its CPU implementation. In addition, we propose a parallel duplicate removal algorithm that achieves up to 27 times the throughput of the CPU implementation.
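
For intuition about the merge being parallelized, the following sequential Python sketch (our own, with hypothetical names, not the paper's GPU code) merges a sorted main dictionary with a sorted delta dictionary, removing duplicates and recording the ID mappings a column store would need to rewrite its compressed value IDs:

    # Sequential sketch of a sorted-dictionary merge with duplicate removal.
    # Returns the merged dictionary plus old-ID -> new-ID mappings for the
    # main and delta sides.
    def merge_dictionaries(main, delta):
        merged, map_main, map_delta = [], [], []
        i = j = 0
        while i < len(main) or j < len(delta):
            if j >= len(delta) or (i < len(main) and main[i] <= delta[j]):
                v = main[i]
            else:
                v = delta[j]
            if not merged or merged[-1] != v:   # duplicate removal
                merged.append(v)
            if i < len(main) and main[i] == v:
                map_main.append(len(merged) - 1); i += 1
            if j < len(delta) and delta[j] == v:
                map_delta.append(len(merged) - 1); j += 1
        return merged, map_main, map_delta

    assert merge_dictionaries(["a", "c"], ["b", "c"]) == \
        (["a", "b", "c"], [0, 2], [1, 2])

The paper's point is that steps like this merge and the duplicate removal can be split across many GPU threads, while the remaining stages of the process stay essentially sequential.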

13 citations

Proceedings ArticleDOI
10 May 2009
TL;DR: This paper outlines a configurable emulated shared memory (CESM) architecture that retains the advantages of ESM architectures for sufficiently parallel code but can also execute applications with low parallelism efficiently.
Abstract: Emulated shared memory (ESM) multiprocessor systems on chip (MP-SOC) and network on chip (NOC) regions are efficient general-purpose computing engines for future computers and embedded systems running applications unknown at design time. While they provide the programmer with a synchronous, unified, constant-time-accessible shared memory, existing ESM architectures have been shown to be inefficient on workloads with low parallelism. In this paper we outline a configurable emulated shared memory (CESM) architecture that retains the advantages of ESM architectures for sufficiently parallel code but is also able to execute applications with low parallelism efficiently. This is achieved by allowing multiple threads to join into a single non-uniform memory access (NUMA) bunch and by organizing the memory system to support NUMA-like behavior for thread-local data when parallelism is limited. Performance simulations as well as silicon area and power consumption estimations of CESM MP-SOC/NOC regions are provided.

13 citations

References
Book
01 Sep 1991
TL;DR: This book covers sorting on a linear array under the systolic and semisystolic models of computation, as part of a broad treatment of algorithms for arrays, trees, meshes of trees, and hypercubic networks.
Abstract: Preface / Acknowledgments / Notation
1 Arrays and Trees
1.1 Elementary Sorting and Counting 1.1.1 Sorting on a Linear Array -Assessing the Performance of the Algorithm -Sorting N Numbers with Fewer Than N Processors 1.1.2 Sorting in the Bit Model 1.1.3 Lower Bounds 1.1.4 A Counterexample-Counting 1.1.5 Properties of the Fixed-Connection Network Model
1.2 Integer Arithmetic 1.2.1 Carry-Lookahead Addition 1.2.2 Prefix Computations -Segmented Prefix Computations 1.2.3 Carry-Save Addition 1.2.4 Multiplication and Convolution 1.2.5 Division and Newton Iteration
1.3 Matrix Algorithms 1.3.1 Elementary Matrix Products 1.3.2 Algorithms for Triangular Matrices 1.3.3 Algorithms for Tridiagonal Matrices -Odd-Even Reduction -Parallel Prefix Algorithms 1.3.4 Gaussian Elimination 1.3.5 Iterative Methods -Jacobi Relaxation -Gauss-Seidel Relaxation -Finite Difference Methods -Multigrid Methods
1.4 Retiming and Systolic Conversion 1.4.1 A Motivating Example-Palindrome Recognition 1.4.2 The Systolic and Semisystolic Model of Computation 1.4.3 Retiming Semisystolic Networks 1.4.4 Conversion of a Semisystolic Network into a Systolic Network 1.4.5 The Special Case of Broadcasting 1.4.6 Retiming the Host 1.4.7 Design by Systolic Conversion-A Summary
1.5 Graph Algorithms 1.5.1 Transitive Closure 1.5.2 Connected Components 1.5.3 Shortest Paths 1.5.4 Breadth-First Spanning Trees 1.5.5 Minimum Weight Spanning Trees
1.6 Sorting Revisited 1.6.1 Odd-Even Transposition Sort on a Linear Array 1.6.2 A Simple Root-N(log N + 1)-Step Sorting Algorithm 1.6.3 A (3 Root-N + o(Root-N))-Step Sorting Algorithm 1.6.4 A Matching Lower Bound
1.7 Packet Routing 1.7.1 Greedy Algorithms 1.7.2 Average-Case Analysis of Greedy Algorithms -Routing N Packets to Random Destinations -Analysis of Dynamic Routing Problems 1.7.3 Randomized Routing Algorithms 1.7.4 Deterministic Algorithms with Small Queues 1.7.5 An Off-line Algorithm 1.7.6 Other Routing Models and Algorithms
1.8 Image Analysis and Computational Geometry 1.8.1 Component-Labelling Algorithms -Levialdi's Algorithm -An O(Root-N)-Step Recursive Algorithm 1.8.2 Computing Hough Transforms 1.8.3 Nearest-Neighbor Algorithms 1.8.4 Finding Convex Hulls
1.9 Higher-Dimensional Arrays 1.9.1 Definitions and Properties 1.9.2 Matrix Multiplication 1.9.3 Sorting 1.9.4 Packet Routing 1.9.5 Simulating High-Dimensional Arrays on Low-Dimensional Arrays
1.10 Problems
1.11 Bibliographic Notes
2 Meshes of Trees
2.1 The Two-Dimensional Mesh of Trees 2.1.1 Definition and Properties 2.1.2 Recursive Decomposition 2.1.3 Derivation from K_{N,N} 2.1.4 Variations 2.1.5 Comparison With the Pyramid and Multigrid
2.2 Elementary O(log N)-Step Algorithms 2.2.1 Routing 2.2.2 Sorting 2.2.3 Matrix-Vector Multiplication 2.2.4 Jacobi Relaxation 2.2.5 Pivoting 2.2.6 Convolution 2.2.7 Convex Hull
2.3 Integer Arithmetic 2.3.1 Multiplication 2.3.2 Division and Chinese Remaindering 2.3.3 Related Problems -Iterated Products -Root Finding
2.4 Matrix Algorithms 2.4.1 The Three-Dimensional Mesh of Trees 2.4.2 Matrix Multiplication 2.4.3 Inverting Lower Triangular Matrices 2.4.4 Inverting Arbitrary Matrices -Csanky's Algorithm -Inversion by Newton Iteration 2.4.5 Related Problems
2.5 Graph Algorithms 2.5.1 Minimum-Weight Spanning Trees 2.5.2 Connected Components 2.5.3 Transitive Closure 2.5.4 Shortest Paths 2.5.5 Matching Problems
2.6 Fast Evaluation of Straight-Line Code 2.6.1 Addition and Multiplication Over a Semiring 2.6.2 Extension to Codes with Subtraction and Division 2.6.3 Applications
2.7 Higher-Dimensional Meshes of Trees 2.7.1 Definitions and Properties 2.7.2 The Shuffle-Tree Graph
2.8 Problems
2.9 Bibliographic Notes
3 Hypercubes and Related Networks
3.1 The Hypercube 3.1.1 Definitions and Properties 3.1.2 Containment of Arrays -Higher-Dimensional Arrays -Non-Power-of-2 Arrays 3.1.3 Containment of Complete Binary Trees 3.1.4 Embeddings of Arbitrary Binary Trees -Embeddings with Dilation 1 and Load O(M/N + log N) -Embeddings with Dilation O(1) and Load O(M/N + 1) -A Review of One-Error-Correcting Codes -Embedding P_{log N} into H_{log N} 3.1.5 Containment of Meshes of Trees 3.1.6 Other Containment Results
3.2 The Butterfly, Cube-Connected-Cycles, and Benes Network 3.2.1 Definitions and Properties 3.2.2 Simulation of Arbitrary Networks 3.2.3 Simulation of Normal Hypercube Algorithms 3.2.4 Some Containment and Simulation Results
3.3 The Shuffle-Exchange and de Bruijn Graphs 3.3.1 Definitions and Properties 3.3.2 The Diaconis Card Tricks 3.3.3 Simulation of Normal Hypercube Algorithms 3.3.4 Similarities with the Butterfly 3.3.5 Some Containment and Simulation Results
3.4 Packet-Routing Algorithms 3.4.1 Definitions and Routing Models 3.4.2 Greedy Routing Algorithms and Worst-Case Problems 3.4.3 Packing, Spreading, and Monotone Routing Problems -Reducing a Many-to-Many Routing Problem to a Many-to-One Routing Problem -Reducing a Routing Problem to a Sorting Problem 3.4.4 The Average-Case Behavior of the Greedy Algorithm -Bounds on Congestion -Bounds on Running Time -Analyzing Non-Predictive Contention-Resolution Protocols 3.4.5 Converting Worst-Case Routing Problems into Average-Case Routing Problems -Hashing -Randomized Routing 3.4.6 Bounding Queue Sizes -Routing on Arbitrary Levelled Networks 3.4.7 Routing with Combining 3.4.8 The Information Dispersal Approach to Routing -Using Information Dispersal to Attain Fault-Tolerance -Finite Fields and Coding Theory 3.4.9 Circuit-Switching Algorithms
3.5 Sorting 3.5.1 Odd-Even Merge Sort -Constructing a Sorting Circuit with Depth log N(log N + 1)/2 3.5.2 Sorting Small Sets 3.5.3 A Deterministic O(log N log log N)-Step Sorting Algorithm 3.5.4 Randomized O(log N)-Step Sorting Algorithms -A Circuit with Depth 7.45 log N that Usually Sorts
3.6 Simulating a Parallel Random Access Machine 3.6.1 PRAM Models and Shared Memories 3.6.2 Randomized Simulations Based on Hashing 3.6.3 Deterministic Simulations Using Replicated Data 3.6.4 Using Information Dispersal to Improve Performance
3.7 The Fast Fourier Transform 3.7.1 The Algorithm 3.7.2 Implementation on the Butterfly and Shuffle-Exchange Graph 3.7.3 Application to Convolution and Polynomial Arithmetic 3.7.4 Application to Integer Multiplication
3.8 Other Hypercubic Networks 3.8.1 Butterflylike Networks -The Omega Network -The Flip Network -The Baseline and Reverse Baseline Networks -Banyan and Delta Networks -k-ary Butterflies 3.8.2 De Bruijn-Type Networks -The k-ary de Bruijn Graph -The Generalized Shuffle-Exchange Graph
3.9 Problems
3.10 Bibliographic Notes
Bibliography / Index / Lemmas, Theorems, and Corollaries / Author Index / Subject Index
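
As a taste of the linear-array material in Section 1.6.1 above, here is a sequential Python simulation of odd-even transposition sort, which sorts n values in n synchronized rounds on an n-cell array. The code is our own illustrative sketch, not taken from the book:

    # Odd-even transposition sort: n lock-step rounds on a linear array,
    # alternating compare-exchange over even-indexed and odd-indexed pairs.
    # On the array, all pairs in a round operate in parallel.
    def odd_even_transposition_sort(a):
        a = list(a)
        n = len(a)
        for step in range(n):
            start = step % 2                    # even pairs, then odd pairs
            for i in range(start, n - 1, 2):    # concurrent on real hardware
                if a[i] > a[i + 1]:
                    a[i], a[i + 1] = a[i + 1], a[i]
        return a

    assert odd_even_transposition_sort([3, 1, 4, 1, 5]) == [1, 1, 3, 4, 5]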

2,895 citations


"An introduction to parallel algorit..." refers background in this paper

  • ...Multiprocessor-based computers have been around for decades and various types of computer architectures [2] have been implemented in hardware throughout the years with different types of advantages/performance gains depending on the application....

    [...]

  • ...Every location in the array represents a node of the tree: T [1] is the root, with children at T [2] and T [3]....

    [...]

  • ...The text by [2] is a good start as it contains a comprehensive description of algorithms and different architecture topologies for the network model (tree, hypercube, mesh, and butterfly)....

    [...]
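
The implicit array layout quoted above (T[1] the root, with children at T[2] and T[3]) generalizes to children at indices 2i and 2i+1. A tiny Python sketch of our own, with hypothetical names:

    # 1-indexed implicit binary tree stored in an array: the node at index i
    # has children at 2*i and 2*i + 1 and parent at i // 2 (root at index 1).
    T = [None, "root", "left", "right", "L.L", "L.R"]   # T[0] unused

    def children(i):
        return [j for j in (2 * i, 2 * i + 1) if j < len(T)]

    def parent(i):
        return i // 2 if i > 1 else None

    assert children(1) == [2, 3] and parent(5) == 2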

Book
01 Jan 1984
TL;DR: The authors have divided the use of computers into the following four levels of sophistication: data processing, information processing, knowledge processing, and intelligence processing.
Abstract: The book is intended as a text to support two semesters of courses in computer architecture at the college senior and graduate levels. There are excellent problems for students at the end of each chapter. The authors have divided the use of computers into the following four levels of sophistication: data processing, information processing, knowledge processing, and intelligence processing.

1,410 citations


"An introduction to parallel algorit..." refers background in this paper

  • ...Parallel architectures have been described in several books (see, for example, [18, 29])....

    [...]

Journal ArticleDOI
TL;DR: The success of data parallel algorithms—even on problems that at first glance seem inherently serial—suggests that this style of programming has much wider applicability than was previously thought.
Abstract: Parallel computers with tens of thousands of processors are typically programmed in a data parallel style, as opposed to the control parallel style used in multiprocessing. The success of data parallel algorithms—even on problems that at first glance seem inherently serial—suggests that this style of programming has much wider applicability than was previously thought.
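
A canonical example of this data-parallel style, on a problem that "at first glance seems inherently serial", is list ranking by pointer jumping: every element of a linked list learns its distance to the end in O(log n) lock-step rounds. The following Python simulation is our own hedged sketch:

    # Data-parallel list ranking by pointer jumping. next_ptr[i] is the
    # successor of element i; the tail points to itself. All "processors"
    # read the old arrays, then all write at once (synchronous rounds).
    def list_rank(next_ptr):
        n = len(next_ptr)
        nxt = list(next_ptr)
        rank = [0 if nxt[i] == i else 1 for i in range(n)]
        for _ in range(max(1, n.bit_length())):     # ~log2(n) rounds suffice
            rank = [rank[i] + rank[nxt[i]] for i in range(n)]
            nxt = [nxt[nxt[i]] for i in range(n)]
        return rank

    assert list_rank([1, 2, 2]) == [2, 1, 0]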

1,000 citations


"An introduction to parallel algorit..." refers background in this paper

  • ...Recent work on the mapping of PRAM algorithms on bounded-degree networks is described in [3, 13, 14, 20, 25]. Our presentation on the communication complexity of the matrix-multiplication problem in the shared-memory model is taken from [1]. Data-parallel algorithms are described in [15]....

    [...]

Proceedings ArticleDOI
01 May 1978
TL;DR: A model of computation based on random access machines operating in parallel and sharing a common memory is presented; its nondeterministic variant accepts in polynomial time exactly the sets accepted by nondeterministic exponential-time-bounded Turing machines.
Abstract: A model of computation based on random access machines operating in parallel and sharing a common memory is presented. The computational power of this model is related to that of traditional models. In particular, deterministic parallel RAM's can accept in polynomial time exactly the sets accepted by polynomial tape bounded Turing machines; nondeterministic RAM's can accept in polynomial time exactly the sets accepted by nondeterministic exponential time bounded Turing machines. Similar results hold for other classes. The effect of limiting the size of the common memory is also considered.
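
In modern complexity-theoretic notation, the abstract's two characterizations can be written as follows (our paraphrase, not a quotation from the paper):

    \[
      \bigcup_{k} \mathrm{PRAM\text{-}TIME}(n^{k}) = \mathrm{PSPACE},
      \qquad
      \bigcup_{k} \mathrm{NPRAM\text{-}TIME}(n^{k}) = \mathrm{NEXPTIME},
    \]

since "polynomial tape bounded Turing machines" define PSPACE and "nondeterministic exponential time bounded Turing machines" define NEXPTIME.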

951 citations


"An introduction to parallel algorit..." refers background in this paper

  • ...Rigorous descriptions of shared-memory models were introduced later in [11, 12]....

    [...]

Journal ArticleDOI
TL;DR: It is shown that arithmetic expressions with n ≥ 1 variables and constants; operations of addition, multiplication, and division; and any depth of parenthesis nesting can be evaluated in time 4 log₂ n + 10(n − 1)/p using p ≥ 1 processors which can independently perform arithmetic operations in unit time.
Abstract: It is shown that arithmetic expressions with n ≥ 1 variables and constants; operations of addition, multiplication, and division; and any depth of parenthesis nesting can be evaluated in time 4 log₂ n + 10(n − 1)/p using p ≥ 1 processors which can independently perform arithmetic operations in unit time. This bound is within a constant factor of the best possible. A sharper result is given for expressions without the division operation, and the question of numerical stability is discussed.
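
The bound above has the canonical work-depth form that the citing text calls Brent's theorem (its WT scheduling principle, quoted below): a computation with total work W and critical-path depth D runs on p processors in time

    \[
      T_p \;\le\; D + \frac{W - D}{p} \;\le\; D + \frac{W}{p},
    \]

with, for expression evaluation, D = O(log n) matching the 4 log₂ n term and W = O(n) matching the 10(n − 1) term. This restatement of the standard form is ours, not a quotation from the paper.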

864 citations


"An introduction to parallel algorit..." refers methods in this paper

  • ...The WT scheduling principle is derived from a theorem in [7]. In the literature, this principle is commonly referred to as Brent's theorem or Brent's scheduling principle....

    [...]