
Showing papers in "Concurrency and Computation: Practice and Experience in 1999"


Journal ArticleDOI
TL;DR: This paper investigates the claim that functional languages offer low-cost parallelism in the context of symbolic programs on modest parallel architectures, and presents the first comparative study of the construction of large applications in a parallel functional language, in this case Glasgow Parallel Haskell (GPH).
Abstract: We investigate the claim that functional languages offer low-cost parallelism in the context of symbolic programs on modest parallel architectures. In our investigation we present the first comparative study of the construction of large applications in a parallel functional language, in our case in Glasgow Parallel Haskell (GPH). The applications cover a range of application areas, use several parallel programming paradigms, and are measured on two very different parallel architectures. On the applications level the most significant result is that we are able to achieve modest wall-clock speedups (between factors of 2 and 10) over the optimised sequential versions for all but one of the programs. Speedups are obtained even for programs that were not written with the intention of being parallelised. These gains are achieved with relatively small programmer effort. One reason for the relative ease of parallelisation is the use of evaluation strategies, a new parallel programming technique that separates the algorithm from the co-ordination of parallel behaviour. On the language level we show that the combination of lazy and parallel evaluation is useful for achieving a high level of abstraction. In particular we can describe top-level parallelism, and also preserve module abstraction by describing parallelism over the data structures provided at the module interface (‘data-oriented parallelism’). Furthermore, we find that the determinism of the language is helpful, as is the largely implicit nature of parallelism in GPH. Copyright © 1999 John Wiley & Sons, Ltd.

38 citations





Journal ArticleDOI
TL;DR: A software system is presented for the management of geographically distributed high-performance computers that co-ordinates the co-operative use of resources in autonomous computing sites.
Abstract: We present a software system for the management of geographically distributed high-performance computers. It consists of three components:
1. The Computing Center Software (CCS) is a vendor-independent resource management software for local HPC systems. It controls the mapping and scheduling of interactive and batch jobs on massively parallel systems.
2. The Resource and Service Description (RSD) is used by CCS for specifying and mapping hardware and software components of (meta-)computing environments. It has a graphical user interface, a textual representation and an object-oriented API.
3. The Service Coordination Layer (SCL) co-ordinates the co-operative use of resources in autonomous computing sites. It negotiates between the applications' requirements and the available system services.

26 citations


Journal ArticleDOI
TL;DR: Efficiency is achieved by the concept of a self-optimising class library of primitive image processing operations, which allows programs to be written in a high level, algebraic notation and which is automatically parallelised (using an application-specific data parallel approach).
Abstract: This paper describes a domain specific programming model for execution on parallel and distributed architectures. The model has initially been targeted at the application area of image processing, though the techniques developed may be more generally applicable to other domains where an algebraic or library-based approach is common. Efficiency is achieved by the concept of a self-optimising class library of primitive image processing operations, which allows programs to be written in a high level, algebraic notation and which is automatically parallelised (using an application-specific data parallel approach). The class library is extended automatically with optimised operations, generated by a transformation system, giving improved execution performance. The parallel implementation of the model described here is based on MPI and has been tested on a C40 processor network, a quad-processor Unix workstation, and a network of PCs running Linux. Timings are included to indicate the impact of the automatic optimisation facility (rather than the effect of parallelisation). Copyright © 1999 John Wiley & Sons, Ltd.

20 citations
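The abstract describes the approach at a high level; the flavour of a self-optimising library of primitive image operations can be sketched as follows. Simple per-pixel operators are written separately, and the transformation system in effect replaces a composed sequence with a single fused pass over the image. The function names (img_add, img_threshold, img_add_threshold) are illustrative assumptions, not the paper's API.

```c
#include <stddef.h>
#include <stdint.h>

/* Two primitive per-pixel operations, as a library of image operators
 * might expose them (names are illustrative only). */
void img_add(const uint8_t *a, const uint8_t *b, uint8_t *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        unsigned s = a[i] + b[i];
        out[i] = s > 255 ? 255 : (uint8_t)s;        /* saturating add */
    }
}

void img_threshold(const uint8_t *in, uint8_t *out, size_t n, uint8_t t) {
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] >= t ? 255 : 0;
}

/* The kind of fused operation a transformation system could generate for
 * the composition threshold(add(a, b), t): one pass, no temporary image. */
void img_add_threshold(const uint8_t *a, const uint8_t *b,
                       uint8_t *out, size_t n, uint8_t t) {
    for (size_t i = 0; i < n; i++) {
        unsigned s = a[i] + b[i];
        if (s > 255) s = 255;
        out[i] = s >= t ? 255 : 0;
    }
}
```

The fused version avoids the intermediate buffer and the second traversal; distributing such fused operations over image partitions is where the data-parallel speedup would come from.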


Journal ArticleDOI
TL;DR: A significant improvement in the total execution time and a reduction in the number of message contentions are illustrated, and it is proved that the generalized hypercube is a very versatile interconnection network.
Abstract: This paper presents results of evaluating the communications capabilities of the generalized hypercube interconnection network. The generalized hypercube has outstanding topological properties, but it has not been implemented on a large scale because of its very high wiring complexity. For this reason, this network has not been studied extensively in the past. However, recent and expected technological advancements will soon render this network viable for massively parallel systems. We first present implementations of randomized many-to-all broadcasting and multicasting on generalized hypercubes, using as the basis the one-to-all broadcast algorithm presented by Fragopoulou et al. (1996). We test the proposed implementations under realistic communication traffic patterns and message generations, for the all-port model of communication. Our results show that the size of the intermediate message buffers has a significant effect on the total communication time, and this effect becomes very dramatic for large systems with large numbers of dimensions. We also propose a modification of this multicast algorithm that applies congestion control to improve its performance. The results illustrate a significant improvement in the total execution time and a reduction in the number of message contentions, and also prove that the generalized hypercube is a very versatile interconnection network. Copyright © 1999 John Wiley & Sons, Ltd.

18 citations
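For readers unfamiliar with the topology: under the usual definition, nodes of a generalized hypercube carry mixed-radix addresses and two nodes are adjacent exactly when their addresses differ in a single digit position, by any amount, which is what drives the high wiring complexity. The small sketch below enumerates a node's neighbours under that definition; the radix layout and routine names are our own illustration, not the paper's code.

```c
#include <stdio.h>

#define DIMS 3

/* Radix (number of node positions) in each dimension; a small
 * GH(4,3,2) example with 4*3*2 = 24 nodes. */
static const int radix[DIMS] = {4, 3, 2};

/* Print every neighbour of the node whose mixed-radix address is given
 * in addr[]: neighbours differ from it in exactly one digit. */
static void list_neighbours(const int addr[DIMS]) {
    for (int d = 0; d < DIMS; d++) {
        for (int v = 0; v < radix[d]; v++) {
            if (v == addr[d]) continue;             /* skip the node itself */
            printf("neighbour via dim %d: (", d);
            for (int k = 0; k < DIMS; k++)
                printf("%d%s", k == d ? v : addr[k], k + 1 < DIMS ? "," : ")\n");
        }
    }
}

int main(void) {
    int node[DIMS] = {2, 1, 0};
    list_neighbours(node);      /* degree = (4-1)+(3-1)+(2-1) = 6 links */
    return 0;
}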




Journal ArticleDOI
TL;DR: The use of unstructured adaptive tetrahedral meshes in the solution of transient flows poses a challenge for parallel computing due to the irregular and frequently changing nature of the data and its distribution; a parallel mesh adaptation algorithm, PTETRAD, is described and analysed to address this.
Abstract: The use of unstructured adaptive tetrahedral meshes in the solution of transient flows poses a challenge for parallel computing due to the irregular and frequently changing nature of the data and its distribution. A parallel mesh adaptation algorithm, PTETRAD, for unstructured tetrahedral meshes (based on the serial code TETRAD) is described and analysed. The portable implementation of the parallel code in C with MPI is described and discussed. The scalability of the code is considered, analysed and illustrated by numerical experiments using a shock wave diffraction problem.

13 citations


Journal ArticleDOI
TL;DR: In this paper, a new framework for synchronization optimizations and a new set of transformations for programs that implement critical sections using mutual exclusion locks are described. These transformations allow the compiler to move constructs that acquire and release locks both within and between procedures and to eliminate acquire/release constructs.
Abstract: As parallel machines become part of the mainstream computing environment, compilers will need to apply synchronization optimizations to deliver efficient parallel software. This paper describes a new framework for synchronization optimizations and a new set of transformations for programs that implement critical sections using mutual exclusion locks. These transformations allow the compiler to move constructs that acquire and release locks both within and between procedures and to eliminate acquire and release constructs. The paper also presents a new synchronization algorithm, lock elimination, for reducing synchronization overhead. This optimization locates computations that repeatedly acquire and release the same lock, then uses the transformations to obtain equivalent computations that acquire and release the lock only once. Experimental results from a parallelizing compiler for object-based programs illustrate the practical utility of this optimization. For three benchmark programs the optimization dramatically reduces the number of times the computations acquire and release locks, which significantly reduces the amount of time processors spend acquiring and releasing locks. For one of the three benchmarks, the optimization always significantly improves the overall performance. Depending on the number of processors executing the computation, the optimized version runs between 2.11 and 1.83 times faster than the unoptimized version. For one of the other benchmarks, the optimized version runs between 1.13 and 0.96 times faster than the unoptimized version, with a mean of 1.08 times faster. For the final benchmark, the optimization reduces the overall performance.
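The core transformation is easy to picture. A minimal before/after sketch with a POSIX mutex is shown below; it is our own example of the repeated-acquire pattern the optimization targets, not the paper's compiler output or benchmarks.

```c
#include <pthread.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static double sum = 0.0;

/* Before: the same lock is acquired and released once per element. */
void accumulate_unoptimized(const double *a, int n) {
    for (int i = 0; i < n; i++) {
        pthread_mutex_lock(&m);
        sum += a[i];
        pthread_mutex_unlock(&m);
    }
}

/* After: the transformed computation acquires and releases the lock only
 * once, hoisting it out of the loop.  This is legal here because nothing
 * else in the loop body synchronizes; establishing that is exactly what
 * the compiler framework must do before applying the transformation. */
void accumulate_optimized(const double *a, int n) {
    pthread_mutex_lock(&m);
    for (int i = 0; i < n; i++)
        sum += a[i];
    pthread_mutex_unlock(&m);
}
```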


Journal ArticleDOI
TL;DR: This work derives the optimal mapping and scheduling of tiles to physical processors under some reasonable assumptions, in the context of limited computational resources and assuming communication-computation overlap.
Abstract: In the framework of fully permutable loops, tiling is a compiler technique (also known as ‘loop blocking’) that has been extensively studied as a source-to-source program transformation. Little work has been devoted to the mapping and scheduling of the tiles on to physical parallel processors. We present several new results in the context of limited computational resources and assuming communication-computation overlap. In particular, under some reasonable assumptions, we derive the optimal mapping and scheduling of tiles to physical processors. Copyright © 1999 John Wiley & Sons, Ltd.
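The source-to-source transformation itself is standard and easy to show; what the paper studies is how the resulting tiles should be mapped and scheduled. The sketch below is a textbook blocked loop nest, with the tile size B as a free parameter and N chosen divisible by B for simplicity; it is a generic illustration, not the paper's example.

```c
#define N 1024
#define B 32                       /* tile (block) size, a tunable parameter */

/* Original fully permutable loop nest. */
void relax(double a[N][N], const double b[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 0.5 * (a[i][j] + b[i][j]);
}

/* Tiled ("loop blocked") version: the iteration space is cut into B x B
 * tiles; each tile becomes a schedulable unit that can be mapped to a
 * physical processor, which is the problem the paper addresses. */
void relax_tiled(double a[N][N], const double b[N][N]) {
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int i = ii; i < ii + B; i++)
                for (int j = jj; j < jj + B; j++)
                    a[i][j] = 0.5 * (a[i][j] + b[i][j]);
}
```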



Journal ArticleDOI
TL;DR: This work presents a technique for controlling applied communication load in networks of workstations that achieves high communication throughput and minimises its variance.
Abstract: The BSP cost model measures the cost of communication using a single architectural parameter, g, which measures permeability of the network to continuous traffic. Architectures such as networks of workstations pose particular problems for high-performance communication because it is hard to achieve high communication throughput, and even harder to do so predictably. Yet both of these are required for BSP to be effective. We present a technique for controlling applied communication load that achieves both. Traffic is presented to the communication network at a rate chosen to maximise throughput and minimise its variance. Significant performance improvements can be achieved compared to unstructured communication over the same transport protocols as in the case of, for example, MPI. Copyright © 1999 John Wiley & Sons, Ltd.
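The key idea is that traffic is presented to the network at a deliberately chosen rate rather than as fast as possible. The fragment below is only a crude sketch of such pacing, spacing fixed-size sends with nanosleep to hold the applied load near a target rate; it is not the authors' protocol or the BSP library's mechanism.

```c
#include <stddef.h>
#include <time.h>
#include <sys/socket.h>

/* Send `len` bytes in fixed-size chunks, pacing the sends so that the
 * applied load stays near `bytes_per_sec`.  A sketch of rate-controlled
 * communication only; real implementations adapt the rate to feedback. */
int paced_send(int sock, const char *buf, size_t len,
               size_t chunk, double bytes_per_sec) {
    double interval = (double)chunk / bytes_per_sec;    /* seconds per chunk */
    struct timespec ts;
    ts.tv_sec  = (time_t)interval;
    ts.tv_nsec = (long)((interval - (double)ts.tv_sec) * 1e9);

    for (size_t off = 0; off < len; off += chunk) {
        size_t n = len - off < chunk ? len - off : chunk;
        if (send(sock, buf + off, n, 0) < 0)
            return -1;                                  /* propagate errors */
        nanosleep(&ts, NULL);                           /* hold applied rate */
    }
    return 0;
}
```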


Journal ArticleDOI
TL;DR: This work focuses on the application of distributed schemes for massively parallel architectures to MDO problems, as a tool for reducing computation time and solving larger problems.
Abstract: Multidisciplinary design optimization (MDO) for large-scale engineering problems poses many challenges (e.g., the design of an efficient concurrent paradigm for global optimization based on disciplinary analyses, expensive computations over vast data sets, etc.). This work focuses on the application of distributed schemes for massively parallel architectures to MDO problems, as a tool for reducing computation time and solving larger problems. The specific problem considered here is configuration optimization of a high speed civil transport (HSCT), and the efficient parallelization of the embedded paradigm for reasonable design space identification. Two distributed dynamic load balancing techniques (random polling and global round robin with message combining) and two necessary termination detection schemes (global task count and token passing) were implemented and evaluated in terms of effectiveness and scalability to large problem sizes and a thousand processors. The effect of certain parameters on execution time was also inspected. Empirical results demonstrated stable performance and effectiveness for all schemes, and the parametric study showed that the selected algorithmic parameters have a negligible effect on performance.
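Of the two termination detection schemes, global task count is the simpler to sketch: each process counts the tasks it has created and completed, and the computation has terminated when the global totals agree and every process is idle. The MPI fragment below is a hedged, simplified illustration of that invariant (using a synchronous collective for clarity), not the authors' implementation.

```c
#include <mpi.h>

/* Local bookkeeping kept by every process. */
static long tasks_created   = 0;
static long tasks_completed = 0;

/* Collective termination test for the "global task count" scheme: returns
 * nonzero on all ranks once every created task has completed and no rank
 * is still working.  A production scheme would avoid the blocking
 * collective, but the condition being tested is the same. */
int globally_terminated(int locally_idle) {
    long local[3]  = { tasks_created, tasks_completed, locally_idle ? 0 : 1 };
    long global[3];
    MPI_Allreduce(local, global, 3, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);
    return global[0] == global[1] && global[2] == 0;
}
```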

Journal ArticleDOI
TL;DR: An analytical tool is described which determines the performance characteristics of relational database transactions executing on particular machine configurations and provides simple graphical visualisations of these to enable users to obtain rapid insight into particular scenarios.
Abstract: The uptake of parallel DBMSs is being hampered by uncertainty about the impact on performance of porting database applications from sequential to parallel systems. The development of tools which aid the system manager or machine vendor could help to reduce this problem. This paper describes an analytical tool which determines the performance characteristics (in terms of throughput, resource utilisation and response time) of relational database transactions executing on particular machine configurations and provides simple graphical visualisations of these to enable users to obtain rapid insight into particular scenarios. The problems of handling different parallel DBMSs are illustrated with reference to three systems – Ingres, Informix and Oracle. A brief description is also given of two different approaches used to confirm the validity of the analytical approach on which the tool is based. Copyright © 1999 John Wiley & Sons, Ltd.
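The abstract does not disclose the tool's internal model, but the kind of quantities it reports (throughput, resource utilisation, response time) can be illustrated with textbook relationships: utilisation as the product of arrival rate and service demand, and response time growing sharply as utilisation approaches one. The calculation below is an M/M/1-style illustration of our own, not the model used by the tool.

```c
#include <stdio.h>

/* Purely illustrative open-system approximation of the throughput /
 * utilisation / response-time relationships such an analytical tool
 * reports; it is not the paper's model. */
int main(void) {
    double arrival_rate   = 40.0;    /* transactions per second offered   */
    double service_demand = 0.02;    /* seconds of disk time per txn      */

    double utilisation = arrival_rate * service_demand;        /* rho     */
    if (utilisation >= 1.0) {
        printf("device saturated (rho = %.2f)\n", utilisation);
        return 0;
    }
    double response = service_demand / (1.0 - utilisation);    /* M/M/1   */
    printf("utilisation %.0f%%, response time %.3f s\n",
           100.0 * utilisation, response);
    return 0;
}
```

With the figures above the disk is 80% utilised and the predicted response time is 0.1 s; doubling the offered load saturates the device, which is exactly the sort of insight a quick graphical visualisation is meant to give.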

Journal ArticleDOI
TL;DR: Several parallel implementations based on DASSLSO are explored and their performance when using the Message Passing Interface (MPI) on an SGI Origin 2000 is compared.
Abstract: In this paper, we discuss the parallel computation of the sensitivity analysis of systems of differential-algebraic equations (DAEs) with a moderate number of state variables and a large number of sensitivity parameters. Several parallel implementations based on DASSLSO are explored and their performance when using the Message Passing Interface (MPI) on an SGI Origin 2000 is compared. Copyright © 1999 John Wiley & Sons, Ltd.
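The natural source of parallelism here is that the sensitivity systems for different parameters can be integrated largely independently of one another, so one straightforward scheme distributes the sensitivity parameters across MPI ranks. The block partition below is a generic sketch of that distribution under our own assumptions; it says nothing about DASSLSO's actual interface.

```c
#include <mpi.h>
#include <stdio.h>

/* Block-distribute `nparam` sensitivity parameters over the ranks of
 * MPI_COMM_WORLD; each rank would then integrate the state equations plus
 * only its own block of sensitivity equations.  The partition is a generic
 * sketch, not the interface of DASSLSO. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size, nparam = 200;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int base  = nparam / size, rem = nparam % size;
    int count = base + (rank < rem ? 1 : 0);
    int first = rank * base + (rank < rem ? rank : rem);

    printf("rank %d integrates sensitivities %d..%d\n",
           rank, first, first + count - 1);
    /* ... invoke the DAE/sensitivity solver for parameters
     *     [first, first + count) on this rank ... */
    MPI_Finalize();
    return 0;
}
```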



Journal ArticleDOI
TL;DR: The design and capabilities of the SDSC Encryption and Authentication system are presented and future plans for enhancing this system are discussed.
Abstract: As part of the Distributed Object Computation Testbed project (DOCT) and the Data Intensive Computing initiative of the National Partnership for Advanced Computational Infrastructure (NPACI), the San Diego Supercomputer Center has designed and implemented a multi-platform encryption and authentication system referred to as the SDSC Encryption and Authentication, or SEA, system. The SEA system is based on RSA and RC5 encryption capabilities and is designed for use in an HPC/WAN environment containing diverse hardware architectures and operating systems (including Cray T90, Cray T3E, Cray J90, SunOS, Solaris, AIX, SGI, HP, NextStep, and Linux). The system includes the SEA library, which provides reliable, efficient, and flexible authentication and encryption capabilities between two processes communicating via TCP/IP sockets, and SEA utilities/daemons, which provide a simple key management system. It is currently in use by the SDSC Storage Resource Broker (SRB), as well as by user interface utilities to SDSC's installation of the High Performance Storage System (HPSS). This paper presents the design and capabilities of the SEA system and discusses future plans for enhancing this system. Copyright © 1999 John Wiley & Sons, Ltd.



Journal ArticleDOI
TL;DR: DSU is designed to assist geophysicists in developing and executing sequences of Seismic Unix (SU) applications in clusters of workstations as well as on tightly coupled multiprocessor machines.
Abstract: This paper describes a distributed system called Distributed Seismic Unix (DSU). DSU provides tools for creating and executing application sequences over several types of multiprocessor environments. DSU is designed to assist geophysicists in developing and executing sequences of Seismic Unix (SU) applications in clusters of workstations as well as on tightly coupled multiprocessor machines. SU is a large collection of subroutine libraries, graphics tools and fundamental seismic data processing applications that is freely available via the Internet from the Center for Wave Phenomena (CWP) of the Colorado School of Mines. SU is currently used at more than 500 sites in 32 countries around the world. DSU is built on top of three publicly available software packages: SU itself; TCL/TK, which provides the necessary tools to build the graphical user interface (GUI); and PVM (Parallel Virtual Machine), which supports process management and communication. DSU handles tree-like graphs representing sequences of SU applications. Nodes of a graph represent SU applications, while the arcs represent the way the data flow from the root node to the leaf nodes of the tree. In general the root node corresponds to an application that reads or creates synthetic seismic data, and the leaf nodes are associated with applications that write or display the processed seismic data; intermediate nodes are usually associated with typical seismic processing applications like filters, convolutions and signal processing. Pipelining parallelism is obtained when executing single-branch tree sequences, while a higher degree of parallelism is obtained when executing sequences with several branches. A major advantage of the DSU framework for distribution is that SU applications do not need to be modified for parallelism; only a few low-level system functions need to be modified. Copyright © 1999 John Wiley & Sons, Ltd.
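DSU's central data structure is the tree just described: nodes are SU applications, arcs carry the data from the root (a reader or generator) towards the leaves (writers or displays). A minimal C rendering of such a node makes the pipelining point concrete: a single-branch tree is a Unix-style pipeline, while several branches give independent pipelines that can run concurrently. The struct and field names are our own illustration, not DSU's actual data structures or real SU command names.

```c
#include <stdlib.h>
#include <string.h>

/* One node of a DSU-style processing tree: an application to run plus the
 * children that consume its output.  Field names are illustrative only. */
typedef struct su_node {
    char            *command;      /* e.g. a filter or display invocation  */
    struct su_node **children;     /* downstream applications              */
    int              nchildren;
} su_node;

static su_node *make_node(const char *cmd, int nchildren) {
    su_node *n   = malloc(sizeof *n);
    n->command   = strdup(cmd);
    n->nchildren = nchildren;
    n->children  = nchildren ? calloc(nchildren, sizeof *n->children) : NULL;
    return n;
}

/* A root that produces data, one intermediate filter, and two leaves that
 * display and store the result: the two branches below the filter could be
 * executed as two concurrent pipelines. */
su_node *example_tree(void) {
    su_node *root = make_node("synthetic-data reader", 1);
    su_node *filt = make_node("bandpass filter", 2);
    root->children[0] = filt;
    filt->children[0] = make_node("display", 0);
    filt->children[1] = make_node("write to disk", 0);
    return root;
}
```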

Journal ArticleDOI
TL;DR: The concurrent software design technique (CSDT) is compared with other existing approaches, and the lessons learned from experience with the technique are presented to highlight its benefits.
Abstract: Software design is the process of mapping software functional requirements into a set of modules for implementation. In this paper, a new design technique called the concurrent software design technique (CSDT) is proposed. CSDT extends software design techniques, which are based on structured analysis and design, by identifying independent concurrent tasks for implementation in multiprocessing, multitasking and the C/S environment. A case study on re-engineering a large legacy system, implemented on mainframes as a sequential system, to a C/S environment is presented next in order to highlight the benefits of the CSDT. Finally, this paper concludes with a comparison of CSDT with other existing approaches and the lessons learned from the experience with this technique. © 1999 John Wiley & Sons, Ltd.