scispace - formally typeset

Showing papers on "Speedup" published in 1984


Proceedings Article
01 Jan 1984
TL;DR: A general algorithmic technique which simplifies and improves the computation of various functions on trees is introduced; it typically requires O(log n) time using O(n) space on an exclusive-read, exclusive-write parallel RAM.
Abstract: In this paper we propose a new algorithm for finding the blocks (biconnected components) of an undirected graph. A serial implementation runs in O(n+m) time and space on a graph of n vertices and m edges. A parallel implementation runs in O(log n) time and O(n+m) space using O(n+m) processors on a concurrent-read, concurrent-write parallel RAM. An alternative implementation runs in O(n^2/p) time and O(n^2) space using any number p ≤ n^2/log n of processors, on a concurrent-read, exclusive-write parallel RAM. The latter algorithm has optimal speedup, assuming an adjacency matrix representation of the input. A general algorithmic technique which simplifies and improves the computation of various functions on trees is introduced. This technique typically requires O(log n) time using O(n) processors and O(n) space on an exclusive-read, exclusive-write parallel RAM.
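The serial O(n+m) block-finding step admits a compact sketch using the classical depth-first-search-plus-edge-stack technique (a hedged illustration in Python; function and variable names are ours, not the paper's):

```python
# Hedged sketch of serial O(n+m) biconnected-components finding:
# DFS pushes tree and back edges on a stack; when a child's low-link
# reaches back no higher than the current vertex, a block is popped.
def biconnected_components(n, edges):
    """Return the blocks of an undirected graph as lists of edges."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    disc = [0] * n            # discovery times (0 = unvisited)
    low = [0] * n             # low-link values
    timer = [1]
    stack, blocks = [], []

    def dfs(u, parent):
        disc[u] = low[u] = timer[0]
        timer[0] += 1
        for v in adj[u]:
            if not disc[v]:                 # tree edge
                stack.append((u, v))
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if low[v] >= disc[u]:       # u separates a block
                    block = []
                    while True:
                        e = stack.pop()
                        block.append(e)
                        if e == (u, v):
                            break
                    blocks.append(block)
            elif v != parent and disc[v] < disc[u]:   # back edge
                stack.append((u, v))
                low[u] = min(low[u], disc[v])

    for s in range(n):
        if not disc[s]:
            dfs(s, -1)
    return blocks
```

For a triangle 0-1-2 with a pendant edge 1-3, this yields two blocks: the triangle and the single pendant edge.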

198 citations


Journal ArticleDOI
Nicolau, Fisher
TL;DR: This paper focuses on long instruction word architectures, such as attached scientific processors and horizontally microcoded CPU's, and argues that even if the authors had infinite hardware, these architectures could not provide a speedup of more than a factor of 2 or 3 on real programs.
Abstract: Long instruction word architectures, such as attached scientific processors and horizontally microcoded CPU's, are a popular means of obtaining code speedup via fine-grained parallelism. The falling cost of hardware holds out the hope of using these architectures for much more parallelism. But this hope has been diminished by experiments measuring how much parallelism is available in the code to start with. These experiments implied that even if we had infinite hardware, long instruction word architectures could not provide a speedup of more than a factor of 2 or 3 on real programs.

190 citations


Book
01 Jan 1984
TL;DR: The overall result is that the larger the problem, the closer the algorithms approach optimal speedup, which allows algorithms to be designed assuming any number of processing elements.
Abstract: We present and analyze several practical parallel algorithms for multicomputers. Chapter four presents two distributed algorithms for implementing alpha-beta search on a tree of processors. Each processor is an independent computer with its own memory and is connected by communication lines to each of its nearest neighbors. Measurements of the first algorithm's performance on the Arachne distributed operating system are presented. For each algorithm, a theoretical model is developed that predicts speedup with arbitrarily many processors. Chapter five shows how locally-defined iterative methods give rise to natural multicomputer algorithms. We consider two interconnection topologies, the grid and the tree. Each processor (or terminal processor in the case of a tree multicomputer) engages in serial computation on its region and communicates border values to its neighbors when those values become available. As a focus for our investigation we consider the numerical solution of elliptic partial differential equations. We concentrate on the Dirichlet problem for Laplace's equation on a square region, but our results can be generalized to situations involving arbitrarily shaped domains (of any number of dimensions) and elliptic equations with variable coefficients. Our analysis derives the running time of the grid and the tree algorithms with respect to per-message overhead, per-point communication time, and per-point computation time. The overall result is that the larger the problem, the closer the algorithms approach optimal speedup. We also show how to apply the tree algorithms to non-uniform regions. A large-network algorithm solves a problem of size N on a network of N processors. Chapter six presents a general method for transforming large-network algorithms into quotient-network algorithms, which solve problems of size N on networks with fewer processors. This transformation allows algorithms to be designed assuming any number of processing elements. 
The implementation of such algorithms on a quotient network results in no loss of efficiency, and often a great savings in hardware cost.
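The alpha-beta search that chapter four distributes over a tree of processors can be sketched serially as follows (illustrative only; the thesis's distributed variants partition such game trees across processors):

```python
# Minimal serial alpha-beta sketch on an explicit game tree:
# leaves are integers (static evaluations), internal nodes are lists.
def alphabeta(node, alpha, beta, maximizing):
    if isinstance(node, int):          # leaf: return its evaluation
        return node
    if maximizing:
        value = float('-inf')
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:          # beta cutoff: prune the rest
                break
        return value
    value = float('inf')
    for child in node:
        value = min(value, alphabeta(child, alpha, beta, True))
        beta = min(beta, value)
        if alpha >= beta:              # alpha cutoff
            break
    return value
```

On the tree [[3, 5], [6, 9], [1, 2]] with a maximizing root, the search prunes the second leaf of the last subtree and returns 6.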

48 citations


01 Jan 1984
TL;DR: This thesis concentrates on finding scheduling algorithms which reduce the sensitivity of the solution time to the communication delay among computational elements; it has been determined that the overall speedup is sensitive to the delays between cooperating computational elements.
Abstract: A large amount of computer time is used for the solution of systems of linear equations in the course of circuit simulation during the design of integrated circuits. This expenditure limits the size of circuits which can be practically simulated, and results in poor response time in an interactive environment. In order to increase the size of circuits which can be simulated and to improve response time, one option, pursued here, is to apply concurrent computation to the linear equation solution aspect of circuit simulation. This concurrent computation will exploit inherent parallelism in the linear equation solution to reduce the time required for that solution. We focus on one particular method for solution of the linear equations: LU decomposition. While LU decomposition has a great deal of inherent parallelism, the wide range of sparse matrix structures requires that this parallelism be detected automatically. It has been determined that the overall speedup is sensitive to the delays between cooperating computational elements, and the manner in which the concurrent computations are mapped onto computational elements is therefore of importance. The approach used is as follows: Given a sparse matrix with a particular structure, a code generator produces a program representing the LU decomposition for that matrix. Another program detects the precedence constraints among the sequential instructions in the code and models the solution process as a directed graph. Based on this graph, scheduling techniques are employed to assign segments of code to computational elements for concurrent execution. Most of this thesis concentrates on the last problem: finding scheduling algorithms which reduce the sensitivity of the solution time to the communication delay among computational elements. This is based on the following observation. With zero delay, Hu's common level scheduling algorithm gives good speedup performance. However, when the communication delay is large compared to the execution time of an instruction in the code, considerable degradation in speedup performance is observed for Hu's algorithm. Finding optimal schedules appears to be intractable in polynomial time, so heuristic algorithms with feasible running times that give suboptimal schedules have to be constructed. This is approached in two different ways. Heuristic local-minimization scheduling algorithms using two matching algorithms from combinatorial optimization are studied and promising results are obtained. These two matching algorithms, min-max matching and weighted matching, give an optimal code-to-processor assignment at each time step. . . . (Author's abstract exceeds stipulated maximum length. Discontinued here with permission of author.)
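Hu's level scheduling rule can be sketched as follows for unit-time tasks with zero communication delay, the regime in which the abstract notes it performs well (an illustrative Python sketch, not the thesis's implementation):

```python
# Hedged sketch of Hu-style level scheduling: a task's level is the
# length of its longest chain to a sink, and at each time step the
# p highest-level ready tasks are run. Communication delay is zero.
def hu_schedule(succ, p):
    """succ: task -> list of successors (unit-time tasks); p processors.
    Returns the schedule length in time steps."""
    memo = {}
    def level(t):                  # longest chain from t to a sink
        if t not in memo:
            memo[t] = 1 + max((level(s) for s in succ[t]), default=0)
        return memo[t]

    pred_count = {t: 0 for t in succ}
    for t in succ:
        for s in succ[t]:
            pred_count[s] += 1

    done, time = set(), 0
    while len(done) < len(succ):
        ready = [t for t in succ if t not in done and pred_count[t] == 0]
        # Hu's priority rule: highest-level ready tasks first
        for t in sorted(ready, key=level, reverse=True)[:p]:
            done.add(t)
            for s in succ[t]:
                pred_count[s] -= 1
        time += 1
    return time
```

For a diamond DAG a→{b,c}→d, two processors finish in 3 steps while one processor needs 4.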

26 citations


01 Jan 1984
TL;DR: The method of conjugate gradients is used to solve the system of matrix equations, which is distributed among the processors of a MIMD computer according to an element-based spatial decomposition.
Abstract: An algorithm for the iterative solution of finite element problems on a concurrent processor is presented. The method of conjugate gradients is used to solve the system of matrix equations, which is distributed among the processors of a MIMD computer according to an element-based spatial decomposition. This algorithm is implemented in a two-dimensional elastostatics program on the Caltech Hypercube concurrent processor. The results of tests on up to 32 processors show nearly linear concurrent speedup, with efficiencies over 90% for sufficiently large problems.
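The conjugate-gradient iteration at the core of the method can be sketched serially as follows (illustrative pure Python; the paper distributes exactly these vector and matrix-vector operations across processors via the element-based decomposition):

```python
# Hedged serial sketch of the conjugate-gradient method for a
# symmetric positive-definite system A x = b (dense lists here;
# the paper's version works on distributed finite-element data).
def conjugate_gradient(A, b, tol=1e-10, max_iter=100):
    n = len(b)
    x = [0.0] * n
    r = b[:]                       # residual r = b - A x, with x = 0
    p = r[:]                       # initial search direction
    rs = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:           # converged: residual is tiny
            break
        p = [r[i] + (rs_new / rs) * p[i] for i in range(n)]
        rs = rs_new
    return x
```

Each iteration needs one matrix-vector product and a few dot products, which is what makes the method attractive to distribute.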

22 citations


ReportDOI
22 Feb 1984
TL;DR: Waveform Relaxation algorithms have been proven to be effective in the transient analysis of large scale integrated circuits and a new waveform relaxation simulator for MOS digital circuits, RELAX2, is described.
Abstract: Waveform Relaxation (WR) algorithms have been proven to be effective in the transient analysis of large scale integrated circuits. A new waveform relaxation simulator for MOS digital circuits, RELAX2, is described. Several speedup techniques included in RELAX2, such as adjusting the length of the interval of simulation, using simpler models in the first few iterations, and allowing looser timestep control in the first few iterations, are also presented.
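The waveform-relaxation idea itself can be illustrated on a toy two-component system (a hedged sketch, not RELAX2's code): each subsystem is integrated over the whole interval against the other subsystem's previous-iteration waveform, and the sweep is repeated until the waveforms converge.

```python
# Hedged waveform-relaxation illustration on the coupled system
#   x' = -x + 0.5*y,  y' = -y + 0.5*x,  x(0)=1, y(0)=0.
# Each sweep integrates each equation with forward Euler against the
# OTHER variable's waveform from the previous sweep.
def waveform_relaxation(t_end=1.0, steps=100, sweeps=20):
    h = t_end / steps
    x = [1.0] * (steps + 1)        # initial waveform guesses
    y = [0.0] * (steps + 1)
    for _ in range(sweeps):
        xn, yn = [1.0], [0.0]
        for k in range(steps):     # Euler step against old waveforms
            xn.append(xn[k] + h * (-xn[k] + 0.5 * y[k]))
            yn.append(yn[k] + h * (-yn[k] + 0.5 * x[k]))
        x, y = xn, yn
    return x, y
```

The fixed point of this iteration is exactly the coupled forward-Euler solution, and for a weakly coupled system a handful of sweeps suffices.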

22 citations


Journal ArticleDOI
John H. Reif
TL;DR: It is shown that parallelism uniformly speeds up time bounded Probabilistic sequential RAM computations by nearly a quadratic factor and that probabilistic choice can be eliminated from parallel computation by introducing nonuniformity.
Abstract: This paper introduces probabilistic choice to synchronous parallel machine models; in particular parallel RAMs. The power of probabilistic choice in parallel computations is illustrated by parallelizing some known probabilistic sequential algorithms. We characterize the computational complexity of time, space, and processor bounded probabilistic parallel RAMs in terms of the computational complexity of probabilistic sequential RAMs. We show that parallelism uniformly speeds up time bounded probabilistic sequential RAM computations by nearly a quadratic factor. We also show that probabilistic choice can be eliminated from parallel computations by introducing nonuniformity.

17 citations


01 May 1984
TL;DR: The goal was to both reduce the real time required and to increase the scope of the simulations by splitting the existing simulator to run on a network of VAX 11/750s connected by DECNet/ETHERNET-II.
Abstract: We describe our experience in improving the real-time performance of a particular large and complex simulator through distributed processing. Our goal was both to reduce the real time required and to increase the scope of the simulations by splitting the existing simulator (written in Lisp) to run on a network of VAX 11/750s connected by DECNet/ETHERNET-II. We present data showing that the additional CPU power and the combined physical memory available in the network contribute to significant real-time speedup. Experience with two- and three-machine networks indicates that where there was no memory contention in the single-process simulator, we obtain a speedup proportional to the number of processes. Where there was memory contention in the single-process simulator, the speedup is much more dramatic. We also detail the capabilities that were added to the conventional network communication structure to implement, debug, and interact with the distributed simulator.

14 citations


Proceedings ArticleDOI
24 Oct 1984
TL;DR: It is proved that at least log(n) + 1 steps are necessary for computing the sum of n integers by a WRAM, regardless of the number of processors and the resolution of write conflicts; this extends a lower bound of Yao for parallel computation trees.
Abstract: Lower bounds for sequential and parallel random access machines (RAM's, WRAM's) and distributed systems of RAM's (DRAM's) are proved. We show that, when p processors instead of one are available, the computation of certain functions cannot be sped up by a factor of p but only by a factor of O(log p). For DRAM's with a communication graph of degree c, a maximal speedup of O(log c) can be achieved for these problems. We apply these results to testing the solvability of linear diophantine equations. This generalizes a lower bound of Yao for parallel computation trees. Improving results of Dobkin/Lipton and Klein/Meyer auf der Heide, we establish large lower bounds for the above problem on RAM's. Finally we prove that at least log(n) + 1 steps are necessary for computing the sum of n integers by a WRAM, regardless of the number of processors and the resolution of write conflicts.
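The log(n) + 1 lower bound for summation is matched, up to constants, by the familiar tree reduction, in which pairwise additions halve the number of values at each synchronous step (an illustrative sketch, not the paper's machine model):

```python
# Tree reduction: each "parallel step" adds all adjacent pairs at
# once, so n values are summed in ceil(log2 n) steps.
def tree_sum(values):
    vals = list(values)
    steps = 0
    while len(vals) > 1:
        # one synchronous step: all pair sums happen simultaneously
        vals = [vals[i] + vals[i + 1] if i + 1 < len(vals) else vals[i]
                for i in range(0, len(vals), 2)]
        steps += 1
    return vals[0], steps
```

Summing 16 values takes 4 steps (16 → 8 → 4 → 2 → 1); 5 values take 3.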

14 citations


Journal ArticleDOI
TL;DR: It is shown that sorting n numbers can be done on a chip with processing area A = o(n) with an almost optimal speedup in a network with mesh-connected interconnections.
Abstract: We propose a new VLSI architecture which allows many problems to be solved quite efficiently on chips with very small processing areas. We consider in detail the sorting problem and show how it can be solved quickly and elegantly on our model. We show that sorting n numbers can be done on a chip with processing area A = o(n) with an almost optimal speedup in a network with mesh-connected interconnections. The control is shown to be simple and easily implementable in VLSI.
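A one-dimensional flavour of nearest-neighbour sorting can be sketched with odd-even transposition sort, where every step compares and swaps only adjacent values (illustrative only, not the paper's VLSI design):

```python
# Odd-even transposition sort: alternating steps compare-exchange
# even-indexed and odd-indexed neighbour pairs; n values sort in at
# most n steps. All compares within a step could run in parallel.
def odd_even_transposition_sort(a):
    a = list(a)
    n = len(a)
    for step in range(n):
        start = step % 2               # alternate even/odd pairs
        for i in range(start, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a
```

Because each step touches only neighbours, the scheme maps directly onto a linear array or one row of a mesh.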

13 citations


Journal ArticleDOI
TL;DR: The effect of introducing parallel computers to solve optimisation problems is considered and the possible interaction of the four classifications with the currently available parallel processing machines is considered.
Abstract: In this paper we will consider the effect of introducing parallel computers to solve optimisation problems. We briefly highlight four situations where most improvements are likely. We consider the possible interaction of the four classifications with the currently available parallel processing machines. A brief description of one of the parallel systems, ICL DAP, is outlined. We have implemented two parallel (SIMD) algorithms, one for local optimisation and the other for global optimisation, on the ICL DAP. Numerical results, together with the processing times, are reported.

Journal ArticleDOI
TL;DR: A molecular dynamics model is processed by the parallel array computer PAX, which has an architecture of nearest-neighbor mesh connection of processors; it realizes high efficiency close to 1, which assures linear speedup proportional to the size of the processor array.

Journal ArticleDOI
01 Dec 1984
TL;DR: It is demonstrated that careful algorithm design can lead to a significant speedup of the calculation when more than one processor is used, and the throughput times obtained in this study are an order of magnitude faster than some conventional approaches.
Abstract: The availability of a multiprocessor vector machine, such as the CRAY X-MP, along with large, fast secondary memory such as the CRAY SSD, opens new frontiers to numerical algorithm design for 3-D simulations. The 3-D seismic migration, which is of crucial importance in exploration seismology, will be studied as a model problem. The numerical model discussed in this paper employs an alternating direction implicit (ADI) Crank-Nicolson scheme which takes full advantage of the parallel architecture of the underlying machine. It is demonstrated that careful algorithm design can lead to a significant speedup of the calculation when more than one processor is used. The throughput times obtained in this study are an order of magnitude faster than some conventional approaches.
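The serial workhorse inside each ADI half-sweep is a tridiagonal solve along every grid line; here is a hedged sketch of the Thomas algorithm for one such line (the paper vectorises and parallelises many independent solves at once, which this sketch does not show):

```python
# Hedged sketch of the Thomas algorithm: forward elimination then
# back substitution for a tridiagonal system, O(n) per grid line.
def thomas(a, b, c, d):
    """Solve a tridiagonal system: a = sub-, b = main, c = super-
    diagonal (a[0] and c[-1] unused), d = right-hand side."""
    n = len(d)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):                      # forward elimination
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):             # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

In an ADI step, one direction's lines are solved implicitly like this while the other direction is treated explicitly, then the roles swap.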

01 Jan 1984
TL;DR: This dissertation model a computer job as a Directed Acyclic Graph (DAG), each node in the DAG representing a separate task that can be processed by any processor, and defines a common concurrency measure which gives a comparison of how much parallelism can be achieved.
Abstract: The idea of multiprocessing has been with us for many years. We would like to know, however, how much gain (i.e., speedup) is really achieved when multi-processors are used. In this dissertation, we model a computer job as a Directed Acyclic Graph (DAG), each node in the DAG representing a separate task that can be processed by any processor. Four parameters are used to characterize the concurrency problem which results in 16 cases. The four parameters are: (1) How the jobs arrive: either a fixed number of jobs at time zero or jobs arriving from a Poisson source; (2) the DAG: either the same for each job or each job randomly selecting its DAG; (3) service time of each task: constant or exponentially distributed; (4) the number of processors: either a fixed number or an infinite number (infinite number of processors meaning that whenever a task requires a processor, one will be available). For all cases studied, we define a common concurrency measure which gives a comparison of how much parallelism can be achieved. The concurrency measure is obtained exactly for several cases by first converting the DAG into a Markov chain where each state represents a possible set of tasks that can be executed in parallel. From this Markov chain, and by utilizing a special feature in the chain, we are able to find the equilibrium probabilities of each state and the average time required to process a single job. We also find upper and lower bounds for the concurrency measure for certain cases studied. The upper bound is found by synchronizing the execution at various places in the DAG. We present two algorithms for assigning the tasks to processors. One algorithm minimizes the expected time to complete all jobs while the other algorithm maximizes the utilization of the processors. The communication cost between any two tasks that reside on different processors is modeled as a task.
We study the effect of the communication costs on the gains that are achieved from multi-processing.
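One simple concurrency measure of this kind, the average parallelism of a DAG of unit-time tasks (total work divided by critical-path length), can be sketched as follows; this is an illustrative stand-in, and the dissertation's exact measure may differ:

```python
# Hedged sketch of a DAG concurrency measure for unit-time tasks:
# average parallelism = total work / critical-path length, an upper
# bound on achievable speedup regardless of processor count.
def average_parallelism(succ):
    """succ: task -> list of successor tasks (all tasks take 1 unit)."""
    memo = {}
    def cp(t):                     # longest chain starting at t
        if t not in memo:
            memo[t] = 1 + max((cp(s) for s in succ[t]), default=0)
        return memo[t]
    work = len(succ)
    critical_path = max(cp(t) for t in succ)
    return work / critical_path
```

A diamond DAG a→{b,c}→d has 4 units of work on a critical path of 3, so its average parallelism is 4/3; a pure chain scores exactly 1.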

Journal ArticleDOI
01 Mar 1984
TL;DR: A relaxation algorithm composed of both a time-step parallel algorithm and a component-wise parallel algorithm for solving large-scale system simulation problems in parallel is proposed and the possible trade-offs between the speedup ratio, efficiency, and waiting time are analyzed.
Abstract: A relaxation algorithm composed of both a time-step parallel algorithm and a component-wise parallel algorithm is proposed for solving large-scale system simulation problems in parallel. The interconnected nature of the system, which is characterized by the component connection model, is fully exploited by this approach. A technique for finding an optimal number of the time steps is also described. Finally, this algorithm is illustrated via several examples in which the possible trade-offs between the speedup ratio, efficiency, and waiting time are analyzed.

DOI
01 Mar 1984
TL;DR: These models are tailored to suit the requirements of real-time microprocessor systems and thus are different from much of the literature on memory interference which is directed toward general-purpose multiprocessor systems.
Abstract: Two mathematical models are presented for the analysis of memory interference in time-shared-bus multimicroprocessor systems. The first is a discrete-time queuing model and the second is a Markov model. The measure of performance in each case is the fractional increase in execution time resulting from bus contention. Another measure, which is derived from this, is the speedup of the multiprocessor as compared to a uniprocessor. These models are tailored to suit the requirements of real-time microprocessor systems and thus are different from much of the literature on memory interference, which is directed toward general-purpose multiprocessor systems. The validity of the models is verified by comparison with simulation results and actual hardware measurements.
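A toy discrete-time simulation conveys the flavour of such contention models (a hedged stand-in, not the paper's model): each cycle a processor requests the shared bus with probability q, one request is served per cycle, and the rest stall.

```python
# Toy discrete-time bus-contention model (illustrative assumptions):
# each non-stalled processor issues a bus request with probability q
# per cycle; the bus serves one request per cycle. We report the
# slowdown relative to a contention-free processor (~work/q cycles).
import random

def bus_contention_slowdown(p, q, work=2000, seed=1):
    rng = random.Random(seed)
    remaining = [work] * p          # bus accesses still owed per CPU
    pending = [False] * p           # has an unserved bus request
    cycles = 0
    while any(r > 0 for r in remaining):
        for i in range(p):
            if remaining[i] > 0 and not pending[i] and rng.random() < q:
                pending[i] = True
        waiters = [i for i in range(p) if pending[i]]
        if waiters:                 # bus grants one request per cycle
            i = rng.choice(waiters)
            pending[i] = False
            remaining[i] -= 1
        cycles += 1
    return cycles * q / work        # 1.0 means no contention penalty
```

With one processor the slowdown stays near 1; with four processors at q = 0.5 the bus saturates and execution time roughly doubles, which is the fractional-increase effect the paper's models quantify analytically.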

01 Jan 1984
TL;DR: The MOSSIM Simulation Engine (MSE) is a special purpose processor for performing switch-level simulation of MOS VLSI circuits; functional simulation is provided on the MSE to facilitate the efficient simulation of large circuits.
Abstract: As the complexity of VLSI circuits approaches 10^6 devices, the computational requirements of design verification are exceeding the capacity of general purpose computers. To provide the computing power required to verify these complex VLSI chips, special purpose hardware for performing simulation is required. Existing simulation engines which perform logic simulation are inadequate for MOS VLSI because they cannot accurately model MOS circuits. Switch-level simulation, on the other hand, models the effects of capacitance and transistor ratios at speeds comparable to logic simulation. The MOSSIM Simulation Engine (MSE) is a special purpose processor for performing switch-level simulation of MOS VLSI circuits. A single processor MSE performs switch-level simulation 200 to 500 times faster than a VAX 11/780. Several MSE processors can be connected in parallel to achieve additional speedup. A virtual processor mechanism allows the MSE to simulate large circuits, with the size of the circuit limited only by the amount of backing store available to hold the circuit description. Functional simulation is provided on the MSE to facilitate the efficient simulation of large circuits.

01 Jan 1984
TL;DR: A brief review of two classes of rule-based expert systems is presented, followed by a detailed analysis of potential sources of parallelism at the production or rule level, the subrule level, and at the search level.
Abstract: A brief review of two classes of rule-based expert systems is presented, followed by a detailed analysis of potential sources of parallelism at the production or rule level, the subrule level (including match, select, and act parallelism), and at the search level (including AND, OR, and stream parallelism). The potential amount of parallelism from each source is discussed and characterized in terms of its granularity, inherent serial constraints, efficiency, speedup, dynamic behavior, and communication volume, frequency, and topology. Subrule parallelism will yield, at best, two- to tenfold speedup, and rule level parallelism will yield a modest speedup on the order of 5 to 10 times. Rule level can be combined with OR, AND, and stream parallelism in many instances to yield further parallel speedups.

Proceedings ArticleDOI
Wayne Moore
01 Dec 1984
TL;DR: The application of a parallel processor to speeding up computation in various control applications, and how this affects decision-making, is discussed.
Abstract: This paper discusses the application of a parallel processor to speeding up computation in various control applications.