
Showing papers on "Bulk synchronous parallel published in 1999"


Journal ArticleDOI
TL;DR: The BSPRAM model is used to simplify the description of the algorithms, and new memory-efficient BSP algorithms both for standard and for fast matrix multiplication are proposed.
Abstract: The model of bulk-synchronous parallel (BSP) computation is an emerging paradigm of general-purpose parallel computing. Its modification, the BSPRAM model, allows one to combine the advantages of distributed and shared-memory style programming. In this paper we study the BSP memory complexity of matrix multiplication. We propose new memory-efficient BSP algorithms both for standard and for fast matrix multiplication. The BSPRAM model is used to simplify the description of the algorithms. The communication and synchronization complexity of our algorithms is slightly higher than that of known time-efficient BSP algorithms. The current time-efficient and new memory-efficient algorithms are connected by a continuous tradeoff.
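For orientation, the tradeoff mentioned above can be anchored with the classical BSP costs of n-by-n matrix multiplication on p processors (standard results from the BSP literature, quoted for context rather than taken from this paper): the time-efficient "3D" algorithm minimizes communication at the price of extra memory, while the memory-efficient "2D" algorithm stores only a constant number of matrix blocks per processor.

    W_{\mathrm{3D}} = O(n^3/p), \quad H_{\mathrm{3D}} = O(n^2/p^{2/3}), \quad M_{\mathrm{3D}} = O(n^2/p^{2/3})
    W_{\mathrm{2D}} = O(n^3/p), \quad H_{\mathrm{2D}} = O(n^2/p^{1/2}), \quad M_{\mathrm{2D}} = O(n^2/p)

Here W, H and M denote local computation, communicated words and memory per processor; the continuous tradeoff the abstract refers to interpolates between these two extremes.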

77 citations


Journal ArticleDOI
TL;DR: The Nondeterminator is an asymptotically efficient serial algorithm for detecting determinacy races in series-parallel directed acyclic graphs and exhibits a slowdown of less than 12 compared with the serial execution time of the original optimized code, which the authors contend is an acceptable slowdown for debugging purposes.
Abstract: A parallel multithreaded program that is ostensibly deterministic may nevertheless behave nondeterministically due to bugs in the code. These bugs are called determinacy races, and they result when one thread updates a location in shared memory while another thread is concurrently accessing the location. We have implemented a provably efficient determinacy-race detector for Cilk, an algorithmic multithreaded programming language. If a Cilk program is run on a given input data set, our debugging tool, which we call the "Nondeterminator," either determines at least one location in the program that is subject to a determinacy race, or else it certifies that the program is race free when run on that data set. The core of the Nondeterminator is an asymptotically efficient serial algorithm (inspired by Tarjan's nearly linear-time least-common-ancestors algorithm) for detecting determinacy races in series-parallel directed acyclic graphs. For a Cilk program that runs in T time on one processor and uses v shared-memory locations, the Nondeterminator runs in O(T α(v,v)) time, where α is Tarjan's functional inverse of Ackermann's function, a very slowly growing function which, for all practical purposes, is bounded above by 4. The Nondeterminator uses at most a constant factor more space than does the original program. On a variety of Cilk program benchmarks, the Nondeterminator exhibits a slowdown of less than 12 compared with the serial execution time of the original optimized code, which we contend is an acceptable slowdown for debugging purposes.
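The determinacy race defined above is easy to reproduce outside Cilk as well; the following minimal pthreads sketch (an illustrative example, not one of the paper's benchmarks) exhibits one: two threads perform unprotected read-modify-write updates on the same shared location, so the final value depends on the schedule.

    #include <pthread.h>
    #include <stdio.h>

    static int counter = 0;                 /* shared memory location */

    /* Unprotected read-modify-write: a determinacy race when two
       threads run this concurrently. */
    static void *increment(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++)
            counter++;
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, increment, NULL);
        pthread_create(&t2, NULL, increment, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        /* A deterministic program would always print 200000; because of
           the race, smaller values are routinely observed. */
        printf("counter = %d\n", counter);
        return 0;
    }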

61 citations


Journal ArticleDOI
TL;DR: The design and implementation of the Green BSP Library, a small library of functions that implement the BSP model, and of several applications that were written for this library are described and the performance of the library and application programs on several parallel architectures are discussed.
Abstract: The Bulk-Synchronous Parallel (BSP) model was proposed by Valiant as a standard interface between parallel software and hardware. In theory, the BSP model has been shown to allow the asymptotically optimal execution of architecture independent software on a variety of architectures. Our goal in this work is to experimentally examine the practical use of the BSP model on current parallel architectures. We describe the design and implementation of the Green BSP Library, a small library of functions that implement the BSP model, and of several applications that were written for this library. We then discuss the performance of the library and application programs on several parallel architectures. Our results are positive in that we demonstrate efficiency and portability over a range of parallel architectures and show that the BSP cost model is useful for predicting performance trends and estimating execution times.
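The cost model referred to here is worth stating explicitly, since several other entries on this page rely on it (this is the standard BSP formulation, not something specific to the Green BSP Library): a program executes as a sequence of S supersteps, and its predicted running time is

    T = \sum_{s=1}^{S} \left( w_s + g\,h_s + l \right)

where w_s is the maximum local computation in superstep s, h_s the maximum number of words sent or received by any processor, g the per-word communication cost, and l the cost of barrier synchronization.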

47 citations


BookDOI
01 Jan 1999
TL;DR: This book discusses Distributed Branch and Bound Algorithms for Global Optimization, Large-Scale Structured Discrete Optimization via Parallel Genetic Algorithms, and Pushing the Limits of QAP Problems Using Parallel Processing.
Abstract: Foreword.- Preface.- Distributed Branch And Bound Algorithms For Global Optimization.- Large-Scale Structured Discrete Optimization Via Parallel Genetic Algorithms.- Pushing The Limits of QAP Problems Using Parallel Processing.- Is Nugent30 Within Reach?- On The Design of Parallel Discrete Algorithms For High Performance Computing Systems.- Parallel Algorithms For Satisfiability (SAT) Testing.- Sequential And Parallel Branch-And-Bound Search Under Limited-Memory Constraints.- A Parallel Grasp For The Data Association Multidimensional Assignment Problem.- Basic Algorithms On Parallel Optical Models Of Computing.- Randomized Parallel Algorithms.- Finite Behavior Of Simulated Annealing: A Probabilistic Study.

21 citations


Book ChapterDOI
27 Nov 1999
TL;DR: This work presents an extension to the BSP model - a decomposable BSP (dBSP for short), and shows how space-bounded sequential algorithms can be transformed into pipelined ones with bounded period on dBSP.
Abstract: The Bulk Synchronous Parallel (BSP) computer is a generally accepted realistic model of parallel computers introduced by Valiant in 1990. We present an extension to the BSP model - a decomposable BSP (dBSP for short). Performance of several elementary algorithms, namely broadcasting, prefix computation, and matrix multiplication, is analyzed on BSP and dBSP models. For a suitable setting of parameters, these algorithms run asymptotically faster on dBSP than on BSP. We also show how space-bounded sequential algorithms can be transformed into pipelined ones with bounded period on dBSP. Such a transformation is proved impossible for the BSP model. Finally, we present an algorithm for the simulation of dBSP on BSP.
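As a concrete instance of why decomposition can help, consider the standard BSP analysis of broadcasting n words to p processors (a textbook calculation, offered for context; the paper's own dBSP analysis refines such costs when the machine splits into independently synchronizing submachines):

    T_{\text{one-stage}} = (p-1)\,n\,g + l, \qquad T_{\text{two-phase}} \approx 2\,(n\,g + l)

The two-phase scheme first scatters the message in blocks of n/p words and then performs an all-gather; on dBSP, both the g and l terms may shrink further because each phase can run within a smaller submachine.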

20 citations


Book
01 Jan 1999
TL;DR: This volume collects papers on MPI and PVM covering performance evaluation (for example, MPI/MBCF with the NAS Parallel Benchmarks, MPI and BSP programs on the CRAY T3E, and distributed MPI applications in a German gigabit testbed), library extensions and implementations, tools, algorithms, and applications.
Abstract: Evaluation and Performance: Performance Issues of Distributed MPI Applications in a German Gigabit Testbed; Reproducible Measurements of MPI Performance Characteristics; Performance Evaluation of MPI/MBCF with the NAS Parallel Benchmarks; Performance and Predictability of MPI and BSP Programs on the CRAY T3E; Automatic Profiling of MPI Applications with Hardware Performance Counters; Monitor Overhead Measurement with SKaMPI; A Standard Interface for Debugger Access to Message Queue Information in MPI; Towards Portable Runtime Support for Irregular and Out-of-Core Computations; Enhancing the Functionality of Performance Measurement Tools for Message Passing Environments; Performance Modeling Based on PVM; Efficient Replay of PVM Programs; Relating the Execution Behaviour with the Structure of the Application.
Extensions and Improvements: Extending PVM with Consistent Cut Capabilities: Application Aspects and Implementation Strategies; Flattening on the Fly: Efficient Handling of MPI Derived Datatypes; PVM Emulation in the Harness Metacomputing System: A Plug-In Based Approach; Implementing MPI-2 Extended Collective Operations; Modeling MPI Collective Communications on the AP3000 Multicomputer; MPL: Efficient Record/Replay of Nondeterministic Features of Message Passing Libraries; Comparison of PVM and MPI on SGI Multiprocessors in a High Bandwidth Multimedia Application; On-Line Visualisation, or Combining the Standard ORNL PVM with a Vendor PVM Implementation; Native versus Java Message Passing; JPT: A Java Parallelization Tool; Facilitating Parallel Programming in PVM Using Condensed Graphs; Nested Bulk Synchronous Parallel Computing.
Implementation Issues: An MPI Implementation on Top of the Virtual Interface Architecture; MiMPI: A Multithread-Safe Implementation of MPI; Building MPI for Multi-Programming Systems Using Implicit Information; The Design for a High Performance MPI Implementation on the Myrinet Network; Implementing MPI's One-Sided Communications for WMPI.
Tools: A Parallel Genetic Programming Tool Based on PVM; Net-Console: A Web-Based Development Environment for MPI Programs; VisualMPI - A Knowledge-Based System for Writing Efficient MPI Applications.
Algorithms: Solving Generalized Boundary Value Problems with Distributed Computing and Recursive Programming; Hyper-Rectangle Distribution Algorithm for Parallel Multidimensional Numerical Integration; Parallel Monte Carlo Algorithms for Sparse SLAE Using MPI; A Method for Model Parameter Identification Using Parallel Genetic Algorithms; Large-Scale FE Modelling in Geomechanics: A Case Study in Parallelization; A Parallel Robust Multigrid Algorithm Based on Semi-coarsening.
Applications in Science and Engineering: PLIERS: A Parallel Information Retrieval System Using MPI; Parallel DSIR Text Retrieval System; PVM Implementation of Heterogeneous ScaLAPACK Dense Linear Solvers; Using PMD to Parallel-Solve Large-Scale Navier-Stokes Equations: Performance Analysis on SGI/CRAY-T3E Machine; Implementation Issues of Computational Fluid Dynamics Algorithms on Parallel Computers; A Scalable Parallel Gauss-Seidel and Jacobi Solver for Animal Genetics; Parallel Approaches to a Numerically-Intensive Application Using PVM; Solving the Inverse Toeplitz Eigenproblem Using ScaLAPACK and MPI; A Parallel Implementation of the Eigenproblem for Large, Symmetric and Sparse Matrices; Parallel Computation of the SVD of a Matrix Product; Porting Generalized Eigenvalue Software on Distributed Memory Machines Using Systolic Model Principles; Heading for an Asynchronous Parallel Ocean Model; Distributed Collision Handling for Particle-Based Simulation; Parallel Watershed Algorithm on Images from Cranial CT-Scans Using PVM and MPI on a Distributed Memory System; MPIPOV: A Parallel Implementation of POV-Ray Based on MPI; Minimum Communication Cost Fractal Image Compression on PVM; Cluster Computing Using MPI and Windows NT to Solve the Processing of Remotely Sensed Imagery; Ground Water Flow Modelling in PVM.
Networking: Virtual BUS: A Simple Implementation of an Effortless Networking System Based on PVM; Collective Communication on Dedicated Clusters of Workstations; Experiences Deploying a Distributed Parallel Processing Environment over a Broadband Multiservice Network; Asynchronous Communications in MPI - the BIP/Myrinet Approach; Parallel Computing on PC Clusters - An Alternative to Supercomputers for Industrial Applications; Benchmarking the PVM Group Communication Efficiency.
Heterogeneous Distributed Systems: Dynamic Assignment with Process Migration in Distributed Environments; Parallelizing of Sequential Annotated Programs in PVM Environment; di_pSystem: A Parallel Programming System for Distributed Memory Architectures; Parallel NLP Strategies Using PVM on Heterogeneous Distributed Environments; Using PVM for Distributed Logic Minimization in a Network of Computers.

15 citations


Book ChapterDOI
12 Apr 1999
TL;DR: This paper shows how the Bulk Synchronous Parallel (BSP) model is implemented using the Bayanihan software framework to enable programmers to port the growing base of BSP-based parallel applications to Java while achieving adaptive parallelism and protection against both the random faults and intentional sabotage that are possible in volunteer computing systems.
Abstract: In recent years, there has been a surge of interest in Java-based volunteer computing systems, which aim to make it possible to build very large parallel computing networks very quickly by enabling users to join a parallel computation by simply visiting a web page and running a Java applet on a standard browser. A key research issue in implementing such systems is that of choosing an appropriate programming model. While traditional models such as MPI-like message-passing can be and have been ported to Java-based systems, they are not generally well-suited to the heterogeneous and dynamic structure of volunteer computing systems, where nodes can join and leave a computation at any time. In this paper, we present an implementation of the Bulk Synchronous Parallel (BSP) model, which provides programmers with familiar message-passing and remote memory primitives while remaining flexible enough to be used in dynamic environments. We show how we have implemented this model using the Bayanihan software framework to enable programmers to port the growing base of BSP-based parallel applications to Java while achieving adaptive parallelism and protection against both the random faults and intentional sabotage that are possible in volunteer computing systems.

13 citations


Journal ArticleDOI
TL;DR: In this paper, a new framework for synchronization optimizations and a new set of transformations for programs that implement critical sections using mutual exclusion locks are described, which allows the compiler to move constructs that acquire and release locks both within and between procedures and to eliminate acquire/release constructs.
Abstract: As parallel machines become part of the mainstream computing environment, compilers will need to apply synchronization optimizations to deliver efficient parallel software. This paper describes a new framework for synchronization optimizations and a new set of transformations for programs that implement critical sections using mutual exclusion locks. These transformations allow the compiler to move constructs that acquire and release locks both within and between procedures and to eliminate acquire and release constructs. The paper also presents a new synchronization algorithm, lock elimination, for reducing synchronization overhead. This optimization locates computations that repeatedly acquire and release the same lock, then uses the transformations to obtain equivalent computations that acquire and release the lock only once. Experimental results from a parallelizing compiler for object-based programs illustrate the practical utility of this optimization. For three benchmark programs the optimization dramatically reduces the number of times the computations acquire and release locks, which significantly reduces the amount of time processors spend acquiring and releasing locks. For one of the three benchmarks, the optimization always significantly improves the overall performance. Depending on the number of processors executing the computation, the optimized version runs between 2.11 and 1.83 times faster than the unoptimized version. For one of the other benchmarks, the optimized version runs between 1.13 and 0.96 times faster than the unoptimized version, with a mean of 1.08 times faster. For the final benchmark, the optimization reduces the overall performance.
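The effect of the lock elimination transformation described above is easy to picture in C (a schematic sketch of what the optimization achieves, not the compiler's object-based input language):

    #include <pthread.h>

    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    double total = 0.0;

    /* Before: the same lock is acquired and released on every iteration. */
    void accumulate(const double *x, int n) {
        for (int i = 0; i < n; i++) {
            pthread_mutex_lock(&m);
            total += x[i];
            pthread_mutex_unlock(&m);
        }
    }

    /* After lock elimination: a single acquire/release pair encloses the
       loop. This is only legal when no other thread needs to observe the
       intermediate states; establishing such conditions is precisely what
       the paper's analysis and transformations are for. */
    void accumulate_optimized(const double *x, int n) {
        pthread_mutex_lock(&m);
        for (int i = 0; i < n; i++)
            total += x[i];
        pthread_mutex_unlock(&m);
    }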

12 citations


Proceedings Article
01 Jan 1999
TL;DR: A common scenario for distributed-memory parallel machines assumes that each processor has multiple local hard disks, each accessible only by its local processor, making parallel disk I/O a necessity for large-scale BSP and CGM applications.

Abstract: Parallel algorithms for the Bulk Synchronous Parallel (BSP) and the closely related Coarse Grained Multicomputer (CGM) programming models assume that all data can be distributed over the main memories of the processors involved. In practice, this may not be the case. For large-scale applications where parallel processing is helpful, the total amount of data often exceeds the total main memory available, and parallel disk I/O becomes a necessity. A common scenario for distributed-memory parallel machines assumes that each processor has multiple local hard disks available, where each disk is only accessible by the respective local processor. Parallel disk I/O has been identified as a critical component of a suitable high-performance computer for a number of the Grand Challenge problems; see, for instance, the Scalable I/O Initiative project [5].

9 citations


Journal ArticleDOI
01 Feb 1999
TL;DR: Fork95 as discussed by the authors is a parallel programming language for the Parallel Random Access Machine (PRAM) model of parallel computation, used in the SB-PRAM project at the University of Saarbrücken.

Abstract: We investigate the well-known Parallel Random Access Machine (PRAM) model of parallel computation as a practical parallel programming model. The two components of this project are a general-purpose PRAM programming language, called Fork95, and a library, called PAD, of fundamental, efficiently implemented parallel algorithms and data structures. We outline the main features of Fork95 as they apply to the implementation of PAD, and describe the implementation of library procedures for prefix-sums and sorting. The Fork95 compiler generates code for the SB-PRAM, a hardware emulation of the PRAM, which is currently being completed at the University of Saarbrücken. Both language and library can immediately be used with this machine. The project is, however, of independent interest. The programming environment can help the algorithm designer to evaluate the practicality of new parallel algorithms, and can furthermore be used as a tool for teaching and communication of parallel algorithms.
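To give a flavour of what a library routine like PAD's prefix-sums computes, the following self-contained C program runs the usual three-phase blocked scan sequentially over p logical blocks (an illustration of the general scheme, not PAD's actual Fork95 code; all names are invented for the example):

    #include <stdio.h>

    /* Three-phase blocked prefix sum.
       Phase 1: each of P blocks computes its local sum (parallel per block).
       Phase 2: exclusive scan over the P block sums.
       Phase 3: each block rescans, offset by its block prefix. */
    #define N 16
    #define P 4

    int main(void) {
        int a[N], block_sum[P], block_pre[P];
        for (int i = 0; i < N; i++) a[i] = i + 1;

        int b = N / P;
        for (int k = 0; k < P; k++) {        /* phase 1 */
            block_sum[k] = 0;
            for (int i = k * b; i < (k + 1) * b; i++) block_sum[k] += a[i];
        }
        block_pre[0] = 0;                    /* phase 2: scan of P values */
        for (int k = 1; k < P; k++) block_pre[k] = block_pre[k-1] + block_sum[k-1];

        for (int k = 0; k < P; k++) {        /* phase 3 */
            int run = block_pre[k];
            for (int i = k * b; i < (k + 1) * b; i++) { run += a[i]; a[i] = run; }
        }
        for (int i = 0; i < N; i++) printf("%d ", a[i]);  /* inclusive sums */
        printf("\n");
        return 0;
    }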

8 citations


01 Jan 1999
TL;DR: Simulation techniques are developed, both randomized and deterministic, which produce efficient EM algorithms from efficient algorithms developed under BSP-like parallel computing models, answering a challenge posed by the ACM Working Group on Storage I/O for Large-Scale Computing.

Abstract: External memory (EM) algorithms are designed for computational problems in which the size of the internal memory of the computer is only a small fraction of the problem size. The Parallel Disk Model (PDM) of Vitter and Shriver is widely used to discriminate between external memory algorithms on the basis of input/output (I/O) complexity. Parallel algorithms are designed to efficiently utilize the computing power of multiple processing units, interconnected by a communication mechanism. A popular model for developing and analyzing parallel algorithms is the Bulk Synchronous Parallel (BSP) model due to Valiant. In this work we develop simulation techniques, both randomized and deterministic, which produce efficient EM algorithms from efficient algorithms developed under BSP-like parallel computing models. Our techniques can accommodate one or multiple processors on the EM target machine, each with one or more disks, and they also adapt to the disk blocking factor of the target machine. We propose new, more comprehensive models for EM and parallel algorithms which consider the total costs incurred by the algorithm, including computation, I/O and communication. The new EM-BSP, EM-BSP*, and EM-CGM models combine the features of the BSP and PDM and thereby answer a challenge posed by the ACM Working Group on Storage I/O for Large-Scale Computing. We obtain parallel external memory algorithms for a large number of problems including sorting, permutation, matrix transpose, geometric and GIS problems including 3D convex hulls (2D Voronoi diagrams), and various graph problems.
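Schematically, and with the caveat that the precise symbols here are reconstructed from the EM-BSP literature rather than quoted from this work, the EM-BSP model augments the BSP parameters (p, g, l) with M (local memory per processor), D (disks per processor), B (block size) and G (time per parallel block transfer), so that a superstep performing I/O costs on the order of

    T_s \approx w_s + g\,h_s + G\,b_s + l

where b_s is the maximum number of block transfers performed by any processor.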

Proceedings ArticleDOI
23 Jun 1999
TL;DR: A practical parallel computation model, LogPQ, which incorporates communication queues into the LogP model, is presented; experiments show that the LogPQ model predicts execution times more accurately than the LogP model.

Abstract: Massively parallel computers consisting of a large number of processing elements have been developed and are expected to serve as high performance computers in advanced science and technology. A practical parallel computation model is required to analyze parallel algorithms on massively parallel computers. We present a practical parallel computation model, LogPQ, which takes communication queues into account in addition to the LogP model. The LogPQ model has three queues for each communication line and four supplementary parameters beyond those of the LogP model. This paper addresses the performance of parallel matrix multiplication using the LogPQ model. The parallel performances on the parallel machine CM-5 are compared between the LogP and LogPQ models. The comparison shows that the LogPQ model predicts the execution times more accurately than the LogP model.
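For context, the base LogP model characterizes a machine by four parameters: the latency L, the per-message processor overhead o, the gap g between consecutive message submissions, and the processor count P. Sending one small message then costs

    T_{\text{msg}} = o_{\text{send}} + L + o_{\text{recv}} = 2o + L

LogPQ refines this picture by additionally modelling the queues on each communication line; its three queues and four supplementary parameters are not reproduced here.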

Book ChapterDOI
12 Apr 1999
TL;DR: This paper describes how the language of Communicating Sequential Processes has been applied to the analysis of a transport layer protocol used in the implementation of the Bulk Synchronous Parallel model.
Abstract: In this paper we describe how the language of Communicating Sequential Processes (CSP) has been applied to the analysis of a transport layer protocol used in the implementation of the Bulk Synchronous Parallel model (BSP). The protocol is suited to the bulk transfer of data between a group of processes that communicate over an unreliable medium with fixed buffer capacities on both sender and receiver. This protocol is modelled using CSP, and verified using the refinement checker FDR2. This verification has been used to establish that the protocol is free from the potential for both deadlock and livelock, and also that it is fault-tolerant.

Proceedings ArticleDOI
11 Nov 1999
TL;DR: The running time cost of performing discrete-event simulation on the bulk-synchronous parallel (BSP) model of computing is discussed and a performance prediction methodology is devised that enables the designer of parallel simulation models to predict in advance the systems which are amenable for efficient execution on a given BSP computer.
Abstract: This paper discusses the running time cost of performing discrete-event simulation on the bulk-synchronous parallel (BSP) model of computing. The BSP model provides a general purpose framework for parallel computing which is independent of the architecture of the computer and thereby it enables the development of portable software. In addition, the structure of BSP computations allows the accurate determination of the cost of parallel algorithms. We use this feature to devise a performance prediction methodology that enables the designer of parallel simulation models to predict in advance the systems which are amenable for efficient execution on a given BSP computer. The methodology is simple enough to be automated in parallel simulation languages.

Journal ArticleDOI
TL;DR: Load balance, in both communication and computation, as well as linear speedup are achieved for the Toeplitz system solver, while at the same time the memory requirement is kept to a minimum.

Journal ArticleDOI
01 Feb 1999
TL;DR: This paper explores the practical use of BSP, focusing on the portability and predictability it offers, without incurring any significant loss in performance.
Abstract: Valiant proposed the Bulk Synchronous Parallel (BSP) model as a possible model for parallel computing. He refers to BSP as a “bridging” model, being applicable to both system and algorithm design. The model allows hardware and software design to proceed independently but ensures compatibility between parallel computers and parallel programs. This paper explores the practical use of BSP, focusing on the portability and predictability it offers, without incurring any significant loss in performance. A BSP algorithm for sorting proposed by Gerbessiotis and Valiant is implemented in a portable fashion on three different parallel computers, specifically an Intel iPSC/860, a Transtech Parastation and an Alex AVX Series 2. The program uses a standard library of communication functions designed and implemented for each machine to support the BSP model. The measured performance of the program is compared to the BSP predictions and to other sorting results on similar machines to provide evidence for the utility of the BSP model.
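The library used in the paper predates the standardized BSPlib interface, but the style of programming it supports is conveyed by the following BSPlib-flavoured sketch (a hypothetical example written against the Oxford BSPlib interface mentioned elsewhere on this page, not code from the paper): each processor contributes one value, which a one-sided bsp_put delivers into processor 0's array at the superstep boundary.

    #include <stdio.h>
    #include <stdlib.h>
    #include "bsp.h"                /* Oxford BSPlib */

    int main(void) {
        bsp_begin(bsp_nprocs());
        int p = bsp_nprocs(), pid = bsp_pid();
        int value = pid * pid;      /* local contribution */
        int *all = calloc(p, sizeof(int));
        bsp_push_reg(all, p * (int)sizeof(int));
        bsp_sync();                 /* registration takes effect */

        bsp_put(0, &value, all, pid * (int)sizeof(int), sizeof(int));
        bsp_sync();                 /* superstep boundary: puts delivered */

        if (pid == 0)
            for (int i = 0; i < p; i++)
                printf("from %d: %d\n", i, all[i]);
        bsp_pop_reg(all);
        bsp_end();
        return 0;
    }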

Book ChapterDOI
26 Sep 1999
TL;DR: This work compares the prediction accuracy of the BSP and BSPWB models and the performance of their respective software libraries, Oxford BSPlib and MPI, and shows not only better scalability for MPI but also that the performance of MPI programs can be predicted as accurately as that of Oxford BSPlib programs.

Abstract: It has been argued that message passing systems based on pairwise, rather than barrier, synchronization suffer from having no simple analytic cost model for prediction. The BSP Without Barriers Model (BSPWB) has been proposed as an alternative to the Bulk Synchronous Parallel (BSP) model for the analysis, design and prediction of asynchronous MPI programs. This work compares the prediction accuracy of the BSP and BSPWB models and the performance of their respective software libraries: Oxford BSPlib and MPI. Three test cases, representing three general problem solving paradigms, are considered. These cases cover a wide range of requirements in communication, synchronisation and computation. The results obtained on the CRAY-T3E show not only better scalability for MPI but also that the performance of MPI programs can be predicted as accurately as that of Oxford BSPlib programs.

Proceedings ArticleDOI
03 Feb 1999
TL;DR: The parallel computing model presented in this paper, the Collective Computing model (CCM), is an improvement of the well-known Bulk Synchronous Parallel (BSP) model; it is asynchronous and describes a system exploited through standard software platforms with functions for group creation and collective operations.

Abstract: The parallel computing model presented in this paper, the Collective Computing model (CCM), is an improvement of the well-known Bulk Synchronous Parallel (BSP) model. The synchronicity imposed by the BSP model restricts the set of available algorithms and prevents the overlapping of computation and communication. Other models, like the LogP model, allow asynchronous computing and overlapping but depend on the use of specific libraries. The CCM is asynchronous and describes a system exploited through standard software platforms with functions for group creation and collective operations. Based on the BSP model, two kinds of supersteps are considered: division supersteps and normal supersteps. To illustrate these concepts, the Fast Fourier Transform algorithm and Parallel Sorting by Regular Sampling are used. Computational results confirm the accuracy of the model on three different parallel computers: a Cray T3E, a Silicon Graphics Origin 2000 and a Digital AlphaServer.

Journal ArticleDOI
01 Mar 1999
TL;DR: The second type of parallel preconditioner is obtained by considering only the diagonal blocks of the multisplitting of the matrix, reducing the communication cost of the algorithm and making it better suited to machines with slow communication networks.

Abstract: To obtain an efficient parallel algorithm to solve sparse linear systems with the preconditioned conjugate gradient (PCG) method, two types of parallel preconditioners are introduced. The first is a polynomial preconditioner based on a multisplitting of the matrix system, and the second is obtained by considering only the diagonal blocks of the multisplitting, which reduces the communication cost of the algorithm and therefore makes it better suited to machines with slow communication networks. Its validity as a preconditioner is justified theoretically. The complexity of the PCG method is analyzed using the Bulk Synchronous Parallel (BSP) model. Experimental results obtained on an IBM SP2 and a CONVEX SPP1000 using the Oxford BSP Library are reported.
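The diagonal-block variant described above has the familiar block-Jacobi form (stated here for illustration; the paper's multisplitting construction is more general). With the matrix A partitioned into p diagonal blocks A_{11}, ..., A_{pp},

    M = \operatorname{diag}(A_{11}, A_{22}, \ldots, A_{pp})

so the preconditioning step M z = r of each PCG iteration decouples into p independent local solves, which is what removes the communication term from its BSP cost.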

Journal ArticleDOI
TL;DR: The Bulk Synchronous Parallel (BSP) model is used to design a fully efficient, scalable and portable parallel IQMR algorithm and to provide accurate performance prediction of the algorithm for a wide range of architectures including the Cray T3D, the Parsytec GC/PowerPlus, and a cluster of workstations connected by an Ethernet.
Abstract: For the solution of unsymmetric linear systems of equations, we have proposed an improved version of the quasi-minimal residual (IQMR) method [21] by using the Lanczos process as a major component, combining elements of numerical stability and parallel algorithm design. For the Lanczos process, stability is obtained by a coupled two-term procedure that generates Lanczos vectors scaled to unit length. The algorithm is derived such that all inner products and matrix-vector multiplications of a single iteration step are independent, and the communication time required for inner products can be overlapped efficiently with computation time. In this paper, we use the Bulk Synchronous Parallel (BSP) model to design a fully efficient, scalable and portable parallel IQMR algorithm and to provide accurate performance prediction of the algorithm for a wide range of architectures including the Cray T3D, the Parsytec GC/PowerPlus, and a cluster of workstations connected by an Ethernet. This performance model gives us useful insight into the time complexity of the IQMR method using only a few system-dependent parameters based on a simple and accurate cost modelling. The theoretical performance predictions are compared with measured timing results of a numerical application from ocean flow simulation.

Journal Article
TL;DR: Based on monoid calculus, this paper generates a general-purpose data model, applies (a, b)-trees, suitable for BSP, to data storage structures, and discusses query optimization, transaction processing and primary database operations such as join algorithms and external sort algorithms.

Abstract: Current parallel database systems mainly target the traditional relational database, most of them are based on special-purpose parallel frameworks, and there has been little in the way of a general parallel ORDB (object-relational database) system. Based on monoid calculus, this paper generates a general-purpose data model which is simple but has powerful expressive abilities and good extensibility. This data model can express almost all SQL3 and OQL queries. Choosing the BSP (bulk synchronous parallel) model as an ideal general parallel environment, the paper designs and analyzes several methods of one-dimensional data placement, applies (a, b)-trees, suitable for BSP, to data storage structures, and discusses query optimization, transaction processing and primary database operations such as join algorithms, external sort algorithms and so on. Combining monoid calculus with BSP, it is practical to generate a general-purpose parallel ORDB system.

Dissertation
01 Jan 1999
TL;DR: The bulk, global view of BSP computation is used to implement protocols that both maintain and take into account global state for optimising performance, and a regression technique is suggested which can be applied to sampled global performance data.
Abstract: In the Bulk Synchronous Parallel (or BSP) model of parallel communication represented by BSPlib , the relaxed coupling of the global computation, communication and synchronisation, whilst providing a definite semantics, does not prescribe exactly when and where communication is to be carried out during the computation. It merely states that it cannot happen before requested by the application and that at certain points local computation cannot proceed unless updates have been applied from the other participating processors. The nature of the computation and this framework is open to exploitation by the implementation of the runtime system and can be made to suit particular physical environments without requiring application program changes. This bulk and global view of parallel computation can be used to implement protocols that both maintain and take into account global state for optimising performance. Such global protocols can provide performance improvements which are not easily achieved with local and greedy strategies and may in turn be locally sub-optimal. This global perspective and the exploitable nature of BSP computation is applied to congestion avoidance, transport layer protocols suitable for BSP computation, global stable check-pointing, and work process placement and migration, to achieve a better overall performance. An important consideration for the compositionality of parallel computer systems into larger systems is that in order for the composite to exhibit good performance, the individual components must also do so. However, it is not obvious how the individual components contribute to the global performance. Already mentioned is that non-locally optimal strategies might lead to globally optimal performance, but also of importance is that variance observed at the local level also influences performance. A number of decisions in the transport protocol design and implementations have been made in order that the observed variance in the protocol's behaviour is minimised. It is demonstrated why this is required using the BSP model. The analysis also suggests a regression technique which can be applied to sampled global performance data.

Reference EntryDOI
27 Dec 1999
TL;DR: The sections in this article are Lock-Step Synchronous Parallel Languages, Bulk Synchronous Parallel Languages, Fine-Grain Synchronous Parallel Languages, and Implicitly Parallel Programming Languages.

Abstract: The sections in this article are: 1. Lock-Step Synchronous Parallel Languages; 2. Bulk Synchronous Parallel Languages; 3. Fine-Grain Synchronous Parallel Languages; 4. Implicitly Parallel Programming Languages; 5. Conclusion.

Book ChapterDOI
11 Jul 1999
TL;DR: An error is corrected in the analysis of the local computation cost of a bulk-synchronous parallel algorithm for Boolean matrix multiplication.
Abstract: We correct an error in the analysis of local computation cost of a bulk-synchronous parallel algorithm for Boolean matrix multiplication.

Journal ArticleDOI
TL;DR: A Hierarchical Bulk Synchronous Parallel performance model is introduced to capture the performance optimization problem for various stages in parallel program development and to accurately predict the performance of a parallel program by considering factors causing variance at local computation and global communication.
Abstract: Based on the framework of BSP, a Hierarchical Bulk Synchronous Parallel (HBSP) performance model is introduced in this paper to capture the performance optimization problem for various stages in parallel program development and to accurately predict the performance of a parallel program by considering factors causing variance at local computation and global communication. The related methodology has been applied to several real applications and the results show that HBSP is a suitable model for optimizing parallel programs.

Proceedings ArticleDOI
23 Jun 1999
TL;DR: A new parallel performance profiling system for the Bulk Synchronous Parallel (BSP) model that uses BSP Profiler to trace and generate more comprehensive profiling information resulting from BSP program executions and which is visualised and shown as performance profiling graphs using BSP Visualiser.
Abstract: The paper introduces a new parallel performance profiling system for the Bulk Synchronous Parallel (BSP) model. The profiling system, called BSP Pro, consists of a performance profiling tool, BSP Profiler and a performance visualisation tool, BSP Visualiser. The aim of BSP Pro is to assist in the analysis and improvement of BSP program performance by minimising load imbalance among processes. BSP Pro is different from other systems, such as the profiling tools within the Oxford BSP toolset, in terms of both its features and its implementation. It uses BSP Profiler to trace and generate more comprehensive profiling information resulting from BSP program executions. The profiling information is then visualised and shown as performance profiling graphs using BSP Visualiser. The visualising component of BSP Pro is fully developed in Java and utilises Java graphics to expose and highlight process load imbalance in both computation and interprocess communication.

Book ChapterDOI
26 Sep 1999
TL;DR: The study of optimal policies for the implementation of the division functions leads to the concept of a dynamic polytope, and the advantages of the model are exemplified through the divide and conquer paradigm.

Abstract: The BSP model can be extended with the inclusion of processor sets. In this model, processor sets can be divided, and they synchronize through collective operations. The study of optimal policies for the implementation of the division functions leads us to the concept of a dynamic polytope. The advantages of the model are exemplified through the divide and conquer paradigm. Computational results for two instances of this paradigm on three parallel machines are presented.

Book ChapterDOI
06 Sep 1999
TL;DR: A new parallel performance profiling system for the BSP model that traces and generates comprehensive information on timing and communication by each process in each superstep, making it easier to identify overloaded processes in a superstep.
Abstract: Load balance is one of the critical factors affecting the overall performance of BSP (Bulk Synchronous Parallel) programs. Without sufficient performance profiling information generated by effective profiling tools, it is often difficult to find out to what extent and where load imbalance has occurred in a BSP program. In this paper, we introduce a new parallel performance profiling system for the BSP model. The system traces and generates comprehensive information on timing and communication by each process in each superstep. Its aim is to assist in the improvement of BSP program performance by identifying load imbalance among processors. The profiling data is visualised via a series of performance profiling graphs, making it easier to identify overloaded processes in a superstep. The visualising component of the system is written in Java, and thus runs on almost any type of computer system.

Book ChapterDOI
06 Sep 1999
TL;DR: A student exercise is introduced that is devoted to comparing two parallel languages, namely MPI (Message Passing Interface) and BSP (Bulk Synchronous Parallel language).

Abstract: In this paper we introduce a student exercise that is devoted to comparing two parallel languages, namely MPI (Message Passing Interface) and BSP (Bulk Synchronous Parallel language). The work to accomplish is integrated into a "long-term project" because the questions act like a nest of dolls: answering one opens a new direction.