scispace - formally typeset

Showing papers on "Speedup" published in 1984


Proceedings Article
01 Jan 1984
TL;DR: A general algorithmic technique which simplifies and improves the computation of various functions on trees is introduced; it typically requires O(log n) time using O(n) space on an exclusive-read, exclusive-write parallel RAM.
Abstract: In this paper we propose a new algorithm for finding the blocks (biconnected components) of an undirected graph. A serial implementation runs in O(n+m) time and space on a graph of n vertices and m edges. A parallel implementation runs in O(log n) time and O(n+m) space using O(n+m) processors on a concurrent-read, concurrent-write parallel RAM. An alternative implementation runs in O(n^2/p) time and O(n^2) space using any number p ≤ n^2/log n of processors, on a concurrent-read, exclusive-write parallel RAM. The latter algorithm has optimal speedup, assuming an adjacency matrix representation of the input. A general algorithmic technique which simplifies and improves the computation of various functions on trees is introduced. This technique typically requires O(log n) time using O(n) processors and O(n) space on an exclusive-read, exclusive-write parallel RAM.
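The serial O(n+m) block-finding step admits a compact sketch using the classical depth-first-search-plus-edge-stack technique (a hedged illustration in Python; function and variable names are ours, not the paper's):

```python
# Hedged sketch of serial O(n+m) biconnected-components finding:
# DFS pushes tree and back edges on a stack; when a child's low-link
# reaches back no higher than the current vertex, a block is popped.
def biconnected_components(n, edges):
    """Return the blocks of an undirected graph as lists of edges."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    disc = [0] * n            # discovery times (0 = unvisited)
    low = [0] * n             # low-link values
    timer = [1]
    stack, blocks = [], []

    def dfs(u, parent):
        disc[u] = low[u] = timer[0]
        timer[0] += 1
        for v in adj[u]:
            if not disc[v]:                 # tree edge
                stack.append((u, v))
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if low[v] >= disc[u]:       # u separates a block
                    block = []
                    while True:
                        e = stack.pop()
                        block.append(e)
                        if e == (u, v):
                            break
                    blocks.append(block)
            elif v != parent and disc[v] < disc[u]:   # back edge
                stack.append((u, v))
                low[u] = min(low[u], disc[v])

    for s in range(n):
        if not disc[s]:
            dfs(s, -1)
    return blocks
```

For a triangle 0-1-2 with a pendant edge 1-3, this yields two blocks: the triangle and the single pendant edge.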

198 citations


Journal ArticleDOI
Nicolau, Fisher
TL;DR: This paper focuses on long instruction word architectures, such as attached scientific processors and horizontally microcoded CPU's, and argues that even if the authors had infinite hardware, these architectures could not provide a speedup of more than a factor of 2 or 3 on real programs.
Abstract: Long instruction word architectures, such as attached scientific processors and horizontally microcoded CPU's, are a popular means of obtaining code speedup via fine-grained parallelism. The falling cost of hardware holds out the hope of using these architectures for much more parallelism. But this hope has been diminished by experiments measuring how much parallelism is available in the code to start with. These experiments implied that even if we had infinite hardware, long instruction word architectures could not provide a speedup of more than a factor of 2 or 3 on real programs.

190 citations


Book
01 Jan 1984
TL;DR: The overall result is that the larger the problem, the closer the algorithms approach optimal speedup, which allows algorithms to be designed assuming any number of processing elements.
Abstract: We present and analyze several practical parallel algorithms for multicomputers. Chapter four presents two distributed algorithms for implementing alpha-beta search on a tree of processors. Each processor is an independent computer with its own memory and is connected by communication lines to each of its nearest neighbors. Measurements of the first algorithm's performance on the Arachne distributed operating system are presented. For each algorithm, a theoretical model is developed that predicts speedup with arbitrarily many processors. Chapter five shows how locally-defined iterative methods give rise to natural multicomputer algorithms. We consider two interconnection topologies, the grid and the tree. Each processor (or terminal processor in the case of a tree multicomputer) engages in serial computation on its region and communicates border values to its neighbors when those values become available. As a focus for our investigation we consider the numerical solution of elliptic partial differential equations. We concentrate on the Dirichlet problem for Laplace's equation on a square region, but our results can be generalized to situations involving arbitrarily shaped domains (of any number of dimensions) and elliptic equations with variable coefficients. Our analysis derives the running time of the grid and the tree algorithms with respect to per-message overhead, per-point communication time, and per-point computation time. The overall result is that the larger the problem, the closer the algorithms approach optimal speedup. We also show how to apply the tree algorithms to non-uniform regions. A large-network algorithm solves a problem of size N on a network of N processors. Chapter six presents a general method for transforming large-network algorithms into quotient-network algorithms, which solve problems of size N on networks with fewer processors. This transformation allows algorithms to be designed assuming any number of processing elements. 
The implementation of such algorithms on a quotient network results in no loss of efficiency, and often a great savings in hardware cost.
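The alpha-beta search that chapter four distributes over a tree of processors can be sketched serially as follows (illustrative only; the thesis's distributed variants partition such game trees across processors):

```python
# Minimal serial alpha-beta sketch on an explicit game tree:
# leaves are integers (static evaluations), internal nodes are lists.
def alphabeta(node, alpha, beta, maximizing):
    if isinstance(node, int):          # leaf: return its evaluation
        return node
    if maximizing:
        value = float('-inf')
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:          # beta cutoff: prune the rest
                break
        return value
    value = float('inf')
    for child in node:
        value = min(value, alphabeta(child, alpha, beta, True))
        beta = min(beta, value)
        if alpha >= beta:              # alpha cutoff
            break
    return value
```

On the tree [[3, 5], [6, 9], [1, 2]] with a maximizing root, the search prunes the second leaf of the last subtree and returns 6.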

48 citations


01 Jan 1984
TL;DR: This thesis concentrates on finding scheduling algorithms which reduce the sensitivity of the solution time to the communication delay among computational elements; it has been determined that the overall speedup is sensitive to the delays between cooperating computational elements.
Abstract: A large amount of computer time is used for the solution of systems of linear equations in the course of circuit simulation during the design of integrated circuits. This expenditure limits the size of circuits which can be practically simulated, and results in poor response time in an interactive environment. In order to increase the size of circuits which can be simulated and to improve response time, one option, pursued here, is to apply concurrent computation to the linear equation solution aspect of circuit simulation. This concurrent computation will exploit inherent parallelism in the linear equation solution to reduce the time required for that solution. We focus on one particular method for solution of the linear equations: LU decomposition. While LU decomposition has a great deal of inherent parallelism, the wide range of sparse matrix structures requires that this parallelism be detected automatically. It has been determined that the overall speedup is sensitive to the delays between cooperating computational elements, and the manner in which the concurrent computations are mapped onto computational elements is therefore of importance. The approach used is as follows: Given a sparse matrix with a particular structure, a code generator produces a program representing the LU decomposition for that matrix. Another program detects the precedence constraints among the sequential instructions in the code and models the solution process as a directed graph. Based on this graph, scheduling techniques are employed to assign segments of code to computational elements for concurrent execution. Most of this thesis concentrates on the last problem: finding scheduling algorithms which reduce the sensitivity of the solution time to the communication delay among computational elements. This is based on the following observation. With zero delay, Hu's common level scheduling algorithm gives good speedup performance. However, when the communication delay is large compared to the execution time of an instruction in the code, considerable degradation in speedup performance is observed for Hu's algorithm. Finding optimal schedules appears to be intractable in polynomial time, so heuristic algorithms with feasible running times that give suboptimal schedules have to be constructed. This is approached in two different ways. Heuristic local-minimization scheduling algorithms using two matching algorithms from combinatorial optimization are studied and promising results are obtained. These two matching algorithms, min-max matching and weighted matching, give an optimal code-to-processor assignment at each time step. . . . (Author's abstract exceeds stipulated maximum length. Discontinued here with permission of author.)
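Hu's level scheduling rule can be sketched as follows for unit-time tasks with zero communication delay, the regime in which the abstract notes it performs well (an illustrative Python sketch, not the thesis's implementation):

```python
# Hedged sketch of Hu-style level scheduling: a task's level is the
# length of its longest chain to a sink, and at each time step the
# p highest-level ready tasks are run. Communication delay is zero.
def hu_schedule(succ, p):
    """succ: task -> list of successors (unit-time tasks); p processors.
    Returns the schedule length in time steps."""
    memo = {}
    def level(t):                  # longest chain from t to a sink
        if t not in memo:
            memo[t] = 1 + max((level(s) for s in succ[t]), default=0)
        return memo[t]

    pred_count = {t: 0 for t in succ}
    for t in succ:
        for s in succ[t]:
            pred_count[s] += 1

    done, time = set(), 0
    while len(done) < len(succ):
        ready = [t for t in succ if t not in done and pred_count[t] == 0]
        # Hu's priority rule: highest-level ready tasks first
        for t in sorted(ready, key=level, reverse=True)[:p]:
            done.add(t)
            for s in succ[t]:
                pred_count[s] -= 1
        time += 1
    return time
```

For a diamond DAG a→{b,c}→d, two processors finish in 3 steps while one processor needs 4.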

26 citations


01 Jan 1984
TL;DR: The method of conjugate gradients is used to solve the system of matrix equations, which is distributed among the processors of a MIMD computer according to an element-based spatial decomposition.
Abstract: An algorithm for the iterative solution of finite element problems on a concurrent processor is presented. The method of conjugate gradients is used to solve the system of matrix equations, which is distributed among the processors of a MIMD computer according to an element-based spatial decomposition. This algorithm is implemented in a two-dimensional elastostatics program on the Caltech Hypercube concurrent processor. The results of tests on up to 32 processors show nearly linear concurrent speedup, with efficiencies over 90% for sufficiently large problems.
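The conjugate-gradient iteration at the core of the method can be sketched serially as follows (illustrative pure Python; the paper distributes exactly these vector and matrix-vector operations across processors via the element-based decomposition):

```python
# Hedged serial sketch of the conjugate-gradient method for a
# symmetric positive-definite system A x = b (dense lists here;
# the paper's version works on distributed finite-element data).
def conjugate_gradient(A, b, tol=1e-10, max_iter=100):
    n = len(b)
    x = [0.0] * n
    r = b[:]                       # residual r = b - A x, with x = 0
    p = r[:]                       # initial search direction
    rs = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:           # converged: residual is tiny
            break
        p = [r[i] + (rs_new / rs) * p[i] for i in range(n)]
        rs = rs_new
    return x
```

Each iteration needs one matrix-vector product and a few dot products, which is what makes the method attractive to distribute.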

22 citations


ReportDOI
22 Feb 1984
TL;DR: Waveform Relaxation algorithms have been proven to be effective in the transient analysis of large scale integrated circuits and a new waveform relaxation simulator for MOS digital circuits, RELAX2, is described.
Abstract: Waveform Relaxation (WR) algorithms have been proven to be effective in the transient analysis of large scale integrated circuits. A new waveform relaxation simulator for MOS digital circuits, RELAX2, is described. Several speedup techniques included in RELAX2, such as adjusting the length of the interval of simulation, using simpler models in the first few iterations, and allowing looser timestep control in the first few iterations, are also presented.
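The waveform-relaxation idea itself can be illustrated on a toy two-component system (a hedged sketch, not RELAX2's code): each subsystem is integrated over the whole interval against the other subsystem's previous-iteration waveform, and the sweep is repeated until the waveforms converge.

```python
# Hedged waveform-relaxation illustration on the coupled system
#   x' = -x + 0.5*y,  y' = -y + 0.5*x,  x(0)=1, y(0)=0.
# Each sweep integrates each equation with forward Euler against the
# OTHER variable's waveform from the previous sweep.
def waveform_relaxation(t_end=1.0, steps=100, sweeps=20):
    h = t_end / steps
    x = [1.0] * (steps + 1)        # initial waveform guesses
    y = [0.0] * (steps + 1)
    for _ in range(sweeps):
        xn, yn = [1.0], [0.0]
        for k in range(steps):     # Euler step against old waveforms
            xn.append(xn[k] + h * (-xn[k] + 0.5 * y[k]))
            yn.append(yn[k] + h * (-yn[k] + 0.5 * x[k]))
        x, y = xn, yn
    return x, y
```

The fixed point of this iteration is exactly the coupled forward-Euler solution, and for a weakly coupled system a handful of sweeps suffices.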

22 citations


Journal ArticleDOI
John H. Reif
TL;DR: It is shown that parallelism uniformly speeds up time bounded Probabilistic sequential RAM computations by nearly a quadratic factor and that probabilistic choice can be eliminated from parallel computation by introducing nonuniformity.
Abstract: This paper introduces probabilistic choice to synchronous parallel machine models; in particular parallel RAMs. The power of probabilistic choice in parallel computations is illustrated by parallelizing some known probabilistic sequential algorithms. We characterize the computational complexity of time, space, and processor bounded probabilistic parallel RAMs in terms of the computational complexity of probabilistic sequential RAMs. We show that parallelism uniformly speeds up time bounded probabilistic sequential RAM computations by nearly a quadratic factor. We also show that probabilistic choice can be eliminated from parallel computations by introducing nonuniformity.

17 citations


01 May 1984
TL;DR: The goal was to both reduce the real time required and to increase the scope of the simulations by splitting the existing simulator to run on a network of VAX 11/750s connected by DECNet/ETHERNET-II.
Abstract: We describe our experience in improving the real-time performance of a particular large and complex simulator through distributed processing. Our goal was both to reduce the real time required and to increase the scope of the simulations by splitting the existing simulator (written in Lisp) to run on a network of VAX 11/750s connected by DECNet/ETHERNET-II. We present data showing that the additional CPU power and the combined physical memory available in the network contribute to significant real-time speedup. Experience with two- and three-machine networks indicates that where there was no memory contention in the single-process simulator, we obtain a speedup proportional to the number of processes. Where there was memory contention in the single-process simulator, the speedup is much more dramatic. We also detail the capabilities that were added to the conventional network communication structure to implement, debug, and interact with the distributed simulator.

14 citations


Proceedings ArticleDOI
24 Oct 1984
TL;DR: It is proved that at least log(n) + 1 steps are necessary for computing the sum of n integers by a WRAM, regardless of the number of processors and the resolution of write conflicts; this extends a lower bound of Yao for parallel computation trees.
Abstract: Lower bounds for sequential and parallel random access machines (RAM's, WRAM's) and distributed systems of RAM's (DRAM's) are proved. We show that, when p processors instead of one are available, the computation of certain functions cannot be sped up by a factor of p but only by a factor of O(log p). For DRAM's with a communication graph of degree c, a maximal speedup of O(log c) can be achieved for these problems. We apply these results to testing the solvability of linear diophantine equations. This generalizes a lower bound of Yao for parallel computation trees. Improving results of Dobkin/Lipton and Klein/Meyer auf der Heide, we establish large lower bounds for the above problem on RAM's. Finally we prove that at least log(n) + 1 steps are necessary for computing the sum of n integers by a WRAM, regardless of the number of processors and the resolution of write conflicts.
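The log(n) + 1 lower bound for summation is matched, up to constants, by the familiar tree reduction, in which pairwise additions halve the number of values at each synchronous step (an illustrative sketch, not the paper's machine model):

```python
# Tree reduction: each "parallel step" adds all adjacent pairs at
# once, so n values are summed in ceil(log2 n) steps.
def tree_sum(values):
    vals = list(values)
    steps = 0
    while len(vals) > 1:
        # one synchronous step: all pair sums happen simultaneously
        vals = [vals[i] + vals[i + 1] if i + 1 < len(vals) else vals[i]
                for i in range(0, len(vals), 2)]
        steps += 1
    return vals[0], steps
```

Summing 16 values takes 4 steps (16 → 8 → 4 → 2 → 1); 5 values take 3.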

14 citations


Journal ArticleDOI
TL;DR: It is shown that sorting n numbers can be done on a chip with processing area A = o(n) with an almost optimal speedup in a network with mesh-connected interconnections.
Abstract: We propose a new VLSI architecture which allows many problems to be solved quite efficiently on chips with very small processing areas. We consider in detail the sorting problem and show how it can be solved quickly and elegantly on our model. We show that sorting n numbers can be done on a chip with processing area A = o(n) with an almost optimal speedup in a network with mesh-connected interconnections. The control is shown to be simple and easily implementable in VLSI.
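A one-dimensional flavour of nearest-neighbour sorting can be sketched with odd-even transposition sort, where every step compares and swaps only adjacent values (illustrative only, not the paper's VLSI design):

```python
# Odd-even transposition sort: alternating steps compare-exchange
# even-indexed and odd-indexed neighbour pairs; n values sort in at
# most n steps. All compares within a step could run in parallel.
def odd_even_transposition_sort(a):
    a = list(a)
    n = len(a)
    for step in range(n):
        start = step % 2               # alternate even/odd pairs
        for i in range(start, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a
```

Because each step touches only neighbours, the scheme maps directly onto a linear array or one row of a mesh.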

13 citations


Journal ArticleDOI
TL;DR: The effect of introducing parallel computers to solve optimisation problems is considered and the possible interaction of the four classifications with the currently available parallel processing machines is considered.
Abstract: In this paper we will consider the effect of introducing parallel computers to solve optimisation problems. We briefly highlight four situations where most improvements are likely. We consider the possible interaction of the four classifications with the currently available parallel processing machines. A brief description of one of the parallel systems, ICL DAP, is outlined. We have implemented two parallel (SIMD) algorithms, one for local optimisation and the other for global optimisation, on the ICL DAP. Numerical results, together with the processing times, are reported.

Journal ArticleDOI
TL;DR: A molecular dynamics model is processed by the parallel array computer PAX, which has an architecture of nearest-neighbor mesh connection of processors; it realizes high efficiency close to 1, which assures linear speedup proportional to the size of the processor array.

Journal ArticleDOI
01 Dec 1984
TL;DR: It is demonstrated that careful algorithm design can lead to a significant speedup of the calculation when more than one processor is used, and the throughput times obtained in this study are an order of magnitude faster than some conventional approaches.
Abstract: The availability of a multiprocessor vector machine, such as the CRAY X-MP, along with large, fast secondary memory such as the CRAY SSD, opens new frontiers to numerical algorithm design for 3-D simulations. The 3-D seismic migration, which is of crucial importance in exploration seismology, will be studied as a model problem. The numerical model discussed in this paper employs an alternating direction implicit (ADI) Crank-Nicolson scheme which takes full advantage of the parallel architecture of the underlying machine. It is demonstrated that careful algorithm design can lead to a significant speedup of the calculation when more than one processor is used. The throughput times obtained in this study are an order of magnitude faster than some conventional approaches.
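The serial workhorse inside each ADI half-sweep is a tridiagonal solve along every grid line; here is a hedged sketch of the Thomas algorithm for one such line (the paper vectorises and parallelises many independent solves at once, which this sketch does not show):

```python
# Hedged sketch of the Thomas algorithm: forward elimination then
# back substitution for a tridiagonal system, O(n) per grid line.
def thomas(a, b, c, d):
    """Solve a tridiagonal system: a = sub-, b = main, c = super-
    diagonal (a[0] and c[-1] unused), d = right-hand side."""
    n = len(d)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):                      # forward elimination
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):             # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

In an ADI step, one direction's lines are solved implicitly like this while the other direction is treated explicitly, then the roles swap.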

01 Jan 1984
TL;DR: This dissertation model a computer job as a Directed Acyclic Graph (DAG), each node in the DAG representing a separate task that can be processed by any processor, and defines a common concurrency measure which gives a comparison of how much parallelism can be achieved.
Abstract: The idea of multiprocessing has been with us for many years. We would like to know, however, how much gain (i.e., speedup) is really achieved when multi-processors are used. In this dissertation, we model a computer job as a Directed Acyclic Graph (DAG), each node in the DAG representing a separate task that can be processed by any processor. Four parameters are used to characterize the concurrency problem which results in 16 cases. The four parameters are: (1) How the jobs arrive: either a fixed number of jobs at time zero or jobs arriving from a Poisson source; (2) the DAG: either the same for each job or each job randomly selecting its DAG; (3) service time of each task: constant or exponentially distributed; (4) the number of processors: either a fixed number or an infinite number (infinite number of processors meaning that whenever a task requires a processor, one will be available). For all cases studied, we define a common concurrency measure which gives a comparison of how much parallelism can be achieved. The concurrency measure is obtained exactly for several cases by first converting the DAG into a Markov chain where each state represents a possible set of tasks that can be executed in parallel. From this Markov chain, and by utilizing a special feature in the chain, we are able to find the equilibrium probabilities of each state and the average time required to process a single job. We also find upper and lower bounds for the concurrency measure for certain cases studied. The upper bound is found by synchronizing the execution at various places in the DAG. We present two algorithms for assigning the tasks to processors. One algorithm minimizes the expected time to complete all jobs while the other algorithm maximizes the utilization of the processors. The communication cost between any two tasks that reside on different processors is modeled as a task.
We study the effect of the communication costs on the gains that are achieved from multi-processing.
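One simple concurrency measure of this kind, the average parallelism of a DAG of unit-time tasks (total work divided by critical-path length), can be sketched as follows; this is an illustrative stand-in, and the dissertation's exact measure may differ:

```python
# Hedged sketch of a DAG concurrency measure for unit-time tasks:
# average parallelism = total work / critical-path length, an upper
# bound on achievable speedup regardless of processor count.
def average_parallelism(succ):
    """succ: task -> list of successor tasks (all tasks take 1 unit)."""
    memo = {}
    def cp(t):                     # longest chain starting at t
        if t not in memo:
            memo[t] = 1 + max((cp(s) for s in succ[t]), default=0)
        return memo[t]
    work = len(succ)
    critical_path = max(cp(t) for t in succ)
    return work / critical_path
```

A diamond DAG a→{b,c}→d has 4 units of work on a critical path of 3, so its average parallelism is 4/3; a pure chain scores exactly 1.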

Journal ArticleDOI
01 Mar 1984
TL;DR: A relaxation algorithm composed of both a time-step parallel algorithm and a component-wise parallel algorithm for solving large-scale system simulation problems in parallel is proposed and the possible trade-offs between the speedup ratio, efficiency, and waiting time are analyzed.
Abstract: A relaxation algorithm composed of both a time-step parallel algorithm and a component-wise parallel algorithm is proposed for solving large-scale system simulation problems in parallel. The interconnected nature of the system, which is characterized by the component connection model, is fully exploited by this approach. A technique for finding an optimal number of the time steps is also described. Finally, this algorithm is illustrated via several examples in which the possible trade-offs between the speedup ratio, efficiency, and waiting time are analyzed.

DOI
01 Mar 1984
TL;DR: These models are tailored to suit the requirements of real-time microprocessor systems and thus are different from much of the literature on memory interference which is directed toward general-purpose multiprocessor systems.
Abstract: Two mathematical models are presented for the analysis of memory interference in time-shared-bus multimicroprocessor systems. The first is a discrete-time queuing model and the second is a Markov model. The measure of performance in each case is the fractional increase in execution time resulting from bus contention. Another measure, which is derived from this, is the speedup of the multiprocessor as compared to a uniprocessor. These models are tailored to suit the requirements of real-time microprocessor systems and thus are different from much of the literature on memory interference, which is directed toward general-purpose multiprocessor systems. The validity of the models is verified by comparison with simulation results and actual hardware measurements.
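A toy discrete-time simulation conveys the flavour of such contention models (a hedged stand-in, not the paper's model): each cycle a processor requests the shared bus with probability q, one request is served per cycle, and the rest stall.

```python
# Toy discrete-time bus-contention model (illustrative assumptions):
# each non-stalled processor issues a bus request with probability q
# per cycle; the bus serves one request per cycle. We report the
# slowdown relative to a contention-free processor (~work/q cycles).
import random

def bus_contention_slowdown(p, q, work=2000, seed=1):
    rng = random.Random(seed)
    remaining = [work] * p          # bus accesses still owed per CPU
    pending = [False] * p           # has an unserved bus request
    cycles = 0
    while any(r > 0 for r in remaining):
        for i in range(p):
            if remaining[i] > 0 and not pending[i] and rng.random() < q:
                pending[i] = True
        waiters = [i for i in range(p) if pending[i]]
        if waiters:                 # bus grants one request per cycle
            i = rng.choice(waiters)
            pending[i] = False
            remaining[i] -= 1
        cycles += 1
    return cycles * q / work        # 1.0 means no contention penalty
```

With one processor the slowdown stays near 1; with four processors at q = 0.5 the bus saturates and execution time roughly doubles, which is the fractional-increase effect the paper's models quantify analytically.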

01 Jan 1984
TL;DR: The MOSSIM Simulation Engine (MSE) is a special purpose processor for performing switch-level simulation of MOS VLSI circuits; functional simulation is provided on the MSE to facilitate the efficient simulation of large circuits.
Abstract: As the complexity of VLSI circuits approaches 10^6 devices, the computational requirements of design verification are exceeding the capacity of general purpose computers. To provide the computing power required to verify these complex VLSI chips, special purpose hardware for performing simulation is required. Existing simulation engines which perform logic simulation are inadequate for MOS VLSI because they cannot accurately model MOS circuits. Switch-level simulation, on the other hand, models the effects of capacitance and transistor ratios at speeds comparable to logic simulation. The MOSSIM Simulation Engine (MSE) is a special purpose processor for performing switch-level simulation of MOS VLSI circuits. A single processor MSE performs switch-level simulation 200 to 500 times faster than a VAX 11/780. Several MSE processors can be connected in parallel to achieve additional speedup. A virtual processor mechanism allows the MSE to simulate large circuits, with the size of the circuit limited only by the amount of backing store available to hold the circuit description. Functional simulation is provided on the MSE to facilitate the efficient simulation of large circuits.

01 Jan 1984
TL;DR: A brief review of two classes of rule-based expert systems is presented, followed by a detailed analysis of potential sources of parallelism at the production or rule level, the subrule level, and at the search level.
Abstract: A brief review of two classes of rule-based expert systems is presented, followed by a detailed analysis of potential sources of parallelism at the production or rule level, the subrule level (including match, select, and act parallelism), and at the search level (including AND, OR, and stream parallelism). The potential amount of parallelism from each source is discussed and characterized in terms of its granularity, inherent serial constraints, efficiency, speedup, dynamic behavior, and communication volume, frequency, and topology. Subrule parallelism will yield, at best, two- to tenfold speedup, and rule level parallelism will yield a modest speedup on the order of 5 to 10 times. Rule level can be combined with OR, AND, and stream parallelism in many instances to yield further parallel speedups.

Proceedings ArticleDOI
Wayne Moore
01 Dec 1984
TL;DR: The application of a parallel processor to speeding up computation in various control applications, and how this affects decision-making, is discussed.
Abstract: This paper discusses the application of a parallel processor to speeding up computation in various control applications.