
Showing papers on "Speedup published in 1987"


Journal ArticleDOI
TL;DR: An algorithm that computes the standard Grassberger-Procaccia correlation dimension of a strange attractor from a finite sample of N points on the attractor, achieving speedup factors of up to a thousand over the usual method.
Abstract: We present an algorithm which computes the standard Grassberger-Procaccia correlation dimension of a strange attractor from a finite sample of N points on the attractor. The usual algorithm involves measuring the distances between all pairs of points on the attractor, but then discarding those distances greater than some cutoff ${r}_{0}$. Our idea is to avoid computing those larger distances altogether. This is done with a spatial grid of boxes (each of size ${r}_{0}$) into which the points are organized. By computing distances between pairs of points only if those points are in the same or in adjacent boxes, we get all the distances less than ${r}_{0}$ and avoid computing many of the larger distances. The execution time for the algorithm depends on the choice of ${r}_{0}$: the smaller ${r}_{0}$, the fewer distances to calculate and, in general, the shorter the run time. The minimum time scales as O(N log N); this compares with the O(${N}^{2}$) time that is usually required. Using this algorithm, we have obtained speedup factors of up to a thousand over the usual method.

249 citations
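The box-assisted search described in the abstract above is straightforward to sketch. The following is a minimal illustration (function and variable names are ours, not the authors' code): points are binned into boxes of side r0, and distances are computed only within a box and between adjacent boxes, so pairs farther apart than r0 are never examined.

```python
import itertools
from collections import defaultdict
from math import dist

def correlation_pairs(points, r0):
    # Bin each point into a grid box of side r0.
    boxes = defaultdict(list)
    for p in points:
        boxes[tuple(int(c // r0) for c in p)].append(p)

    dim = len(points[0])
    offsets = list(itertools.product((-1, 0, 1), repeat=dim))
    count = 0
    for key, pts in boxes.items():
        for off in offsets:
            nkey = tuple(k + o for k, o in zip(key, off))
            if nkey < key or nkey not in boxes:
                continue  # visit each unordered box pair exactly once
            if nkey == key:
                pairs = itertools.combinations(pts, 2)
            else:
                pairs = itertools.product(pts, boxes[nkey])
            count += sum(1 for a, b in pairs if dist(a, b) < r0)
    return count
```

The count agrees with the brute-force all-pairs version, but only near pairs are ever tested, which is where the O(N log N) behavior comes from.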


Journal ArticleDOI
01 Nov 1987
TL;DR: The mechanics of Time Warp are reviewed, the TWOS operating system is described, how to construct simulations in object-oriented form to run under TWOS is shown, a qualitative comparison of Time Warp to the Chandy-Misra method is offered, and preliminary measurements of time-to-completion, speedup, rollback rate, and antimessage rate are reported.
Abstract: This paper describes the Time Warp Operating System, under development for three years at the Jet Propulsion Laboratory for the Caltech Mark III Hypercube multi-processor. Its primary goal is concurrent execution of large, irregular discrete event simulations at maximum speed. It also supports any other distributed applications that are synchronized by virtual time. The Time Warp Operating System includes a complete implementation of the Time Warp mechanism, and is a substantial departure from conventional operating systems in that it performs synchronization by a general distributed process rollback mechanism. The use of general rollback forces a rethinking of many aspects of operating system design, including programming interface, scheduling, message routing and queueing, storage management, flow control, and commitment. In this paper we review the mechanics of Time Warp, describe the TWOS operating system, show how to construct simulations in object-oriented form to run under TWOS, and offer a qualitative comparison of Time Warp to the Chandy-Misra method of distributed simulation. We also include details of two benchmark simulations and preliminary measurements of time-to-completion, speedup, rollback rate, and antimessage rate, all as functions of the number of processors used.

193 citations


Proceedings Article
01 Jan 1987
TL;DR: The authors review the mechanics of Time warp, describe the TWOS operating system, show how to construct simulations in object-oriented form to run under TWOS, and offer a qualitative comparison of Time Warp to the Chandy-Misra method of distributed simulation.
Abstract: This paper describes the Time Warp Operating System, under development for three years at the Jet Propulsion Laboratory for the Caltech Mark III Hypercube multiprocessor. Its primary goal is concurrent execution of large, irregular discrete event simulations at maximum speed. It also supports any other distributed applications that are synchronized by virtual time. In this paper, the authors review the mechanics of Time Warp, describe the TWOS operating system, show how to construct simulations in object-oriented form to run under TWOS, and offer a qualitative comparison of Time Warp to the Chandy-Misra method of distributed simulation. They also include details of two benchmark simulations and preliminary measurements of time-to-completion, speedup, rollback rate, and antimessage rate, all as functions of the number of processors used.

184 citations


Journal ArticleDOI
TL;DR: A parallel formulation of depth-first search which retains the storage efficiency of sequential depth-first search and can be mapped on to any MIMD architecture is presented; experimental results show that hypercube and shared-memory architectures are significantly better.
Abstract: This paper presents a parallel formulation of depth-first search which retains the storage efficiency of sequential depth-first search and can be mapped on to any MIMD architecture. To study its effectiveness it has been implemented to solve the 15-puzzle problem on three commercially available multiprocessors - Sequent Balance 21000, the Intel Hypercube and BBN Butterfly. We have been able to achieve fairly linear speedup on Sequent up to 30 processors (the maximum configuration available) and on the Intel Hypercube and BBN Butterfly up to 128 processors (the maximum configurations available). Many researchers considered the ring architecture to be quite suitable for parallel depth-first search. Our experimental results show that hypercube and shared-memory architectures are significantly better. At the heart of our parallel formulation is a dynamic work distribution scheme that divides the work between different processors. The effectiveness of the parallel formulation is strongly influenced by the work distribution scheme and architectural features such as presence/absence of shared memory, the diameter of the network, relative speed of the communication network, etc. In a companion paper [16], we analyze the effectiveness of different load-balancing schemes and architectures, and also present new improved work distribution schemes.

174 citations


Journal ArticleDOI
TL;DR: It is shown that an adaptive strategy which switches between two parallel decompositions at the optimal temperature yields speedup significantly better than any single strategy approach, and models are developed to account for the observed performance, and to predict the crossover points for switching strategies.
Abstract: Physical design tools based on simulated annealing algorithms have been shown to produce results of extremely high quality, but typically at a very high cost in execution time. This paper selects a representative annealing application--standard cell placement--and develops multiprocessor-based annealing algorithms for placement. A taxonomy of possible multiprocessor decompositions of annealing algorithms is presented which divides decomposition schemes into two broad classes: those which divide individual moves into subtasks and distribute them across cooperating processors, and those which perform complete moves in parallel. It is shown that the choice of multiprocessor annealing strategy is influenced by temperature; in particular, the paper introduces the idea of adaptive strategies that dynamically change the parallel decomposition scheme to achieve maximum speedup as the annealing task progresses through each temperature regime. Implementations of three parallel placement strategies are described for an experimental shared-memory multiprocessor. Practical speedups are achieved over a serial version of the algorithm, and it is shown that an adaptive strategy which switches between two parallel decompositions at the optimal temperature yields speedup significantly better than any single strategy approach. Models are developed to account for the observed performance, and to predict the crossover points for switching strategies.

150 citations


Journal ArticleDOI
TL;DR: The notion of linear multi-step methods for solving ordinary differential equations is generalized to a class of multi-block methods where step values are all obtained together in a single block advance by allocating the parallel tasks on separate processors.
Abstract: The notion of linear multi-step methods for solving ordinary differential equations is generalized to a class of multi-block methods. In a multi-block method, step values are all obtained together in a single block advance, which is accomplished by allocating the parallel tasks to separate processors. The expected benefit of multi-block methods is the speedup in the computation of solutions. The basic formulation is described. Examples are given to demonstrate the existence of such schemes. The predictor-corrector type combination is formed and the resulting stability problem is considered. Test results of one of these multi-block methods on the Denelcor HEP machine are reported.

111 citations


Journal ArticleDOI
TL;DR: The results indicate that for algorithms where both the computation and the communication overhead can be fully decomposed among N processors, the speedup is a nondecreasing function of the level of granularity for arbitrary interconnection structure and allocation of subproblems to processors.
Abstract: In this paper we analyze the effects of the problem decomposition, the allocation of subproblems to processors, and the grain size of subproblems on the performance of a multiple-processor shared-memory architecture. Our results indicate that for algorithms where both the computation and the communication overhead can be fully decomposed among N processors, the speedup is a nondecreasing function of the level of granularity for arbitrary interconnection structure and allocation of subproblems to processors. For these algorithms, the speedup is an increasing function of the level of granularity provided that the interconnection bandwidth is greater than unity. If the bandwidth is equal to unity, then the speedup converges to the value equal to the ratio of processing time to communication time. For algorithms where the computation is decomposable but the communication overhead cannot be decomposed, the speedup is a nondecreasing function of the level of granularity for the best case bandwidth only. If the bandwidth is less than N, the speedup reaches its maximum and then decreases approaching zero as the level of granularity grows. For algorithms where the computation consists of parallel and serial sections of code and the communication overhead is fully decomposable, the speedup converges to a value inversely proportional to the fraction of time spent in the serial code even for the best case interconnection bandwidth.

89 citations
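The limiting behaviors stated in the abstract above can be reproduced with a toy timing model (illustrative formulas of our own, not the paper's model): computation splits N ways, communication overhead either splits or does not, and a serial code fraction caps the speedup at its inverse.

```python
def speedup(n, t_comp, t_comm, serial_frac=0.0, decomposable_comm=True):
    # Sequential baseline is pure computation time t_comp; the parallel
    # run pays the serial part in full, 1/n of the parallel part, and
    # the communication overhead (split n ways only if decomposable).
    serial = serial_frac * t_comp
    parallel = (1.0 - serial_frac) * t_comp
    comm = t_comm / n if decomposable_comm else t_comm
    return t_comp / (serial + parallel / n + comm)
```

With fully decomposable communication and serial fraction f, the speedup approaches 1/f as n grows (the Amdahl-style limit in the abstract's last sentence); with nondecomposable communication it approaches t_comp/t_comm, the ratio of processing time to communication time.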


Journal ArticleDOI
01 Jun 1987
TL;DR: It is shown that the mapping problem—assigning processes to processors—can be reduced to the graph partitioning problem and an evolution method derived from biology is applied to the travelling salesman problem.
Abstract: Problems concerning parallel computers can be understood and solved using results of natural sciences. We show that the mapping problem—assigning processes to processors—can be reduced to the graph partitioning problem. We solve the partitioning problem by an evolution method derived from biology. The evolution method is then applied to the travelling salesman problem. The competition part of the evolution method gives an almost linear speedup compared to the sequential method. A cooperation method leads to new heuristics giving better results than the known heuristics.

82 citations


Proceedings Article
13 Jul 1987
TL;DR: A parallel version of the Iterative-Deepening-A* (IDA*) algorithm that retains all the nice properties of the sequential IDA* and yet does not appear to be limited in the amount of parallelism.
Abstract: This paper presents a parallel version of the Iterative-Deepening-A* (IDA*) algorithm. Iterative-Deepening-A* is an important admissible algorithm for state-space search which has been shown to be optimal both in time and space for a wide variety of state-space search problems. Our parallel version retains all the nice properties of the sequential IDA* and yet does not appear to be limited in the amount of parallelism. To test its effectiveness, we have implemented this algorithm on a Sequent Balance 21000 parallel processor to solve the 15-puzzle problem, and have been able to obtain almost linear speedups on the 30 processors that are available on the machine. On machines where a larger number of processors is available, we expect that the speedup will still grow linearly. The parallel version seems suitable even for loosely coupled architectures such as the Hypercube.

75 citations


Journal ArticleDOI
Stone
TL;DR: Provides the performance analysis of speedup and other aspects of algorithmic behavior that reveals which factors of machine and algorithm design contribute most strongly to performance, an analysis that the earlier work did not provide.
Abstract: Parallelism by itself does not necessarily lead to higher speed. In the case study presented here, the parallel algorithm was far less efficient than a good serial algorithm. The study does, however, reveal how to put parallelism to best use: run the more efficient serial algorithm in a parallel manner. The case study extends the work of others who presented an algorithm for high-speed querying of a large database. The results show that the throughput for parallel query analysis is high in an absolute sense. But a performance analysis of speedup or other aspects of algorithmic behavior that would reveal what factors of machine and algorithm design contribute most strongly to the performance were not provided. This article provides that analysis.

71 citations


Journal ArticleDOI
01 Oct 1987
TL;DR: It is shown that the performance produced with a modified CRAY-1S scalar architecture and a code scheduler utilizing loop unrolling is comparable to the performance achieved by the CRAY-1S with a vector unit and the CFT vectorizing compiler.
Abstract: This paper studies two compilation techniques for enhancing scalar performance in high-speed scientific processors: software pipelining and loop unrolling. We study the impact of the architecture (size of the register file) and of the hardware (size of instruction buffer) on the efficiency of loop unrolling. We also develop a methodology for classifying software pipelining techniques. For loop unrolling, a straightforward scheduling algorithm is shown to produce near-optimal results when not inhibited by recurrences or memory hazards. Software pipelining requires less hardware but also achieves less speedup. Finally, we show that the performance produced with a modified CRAY-1S scalar architecture and a code scheduler utilizing loop unrolling is comparable to the performance achieved by the CRAY-1S with a vector unit and the CFT vectorizing compiler.

Journal ArticleDOI
01 Nov 1987
TL;DR: It is shown how to efficiently implement the preconditioned conjugate gradient method on a four-processor CRAY X-MP/48 with nearly optimal speedup and high Mflops rates.
Abstract: We show how to efficiently implement the preconditioned conjugate gradient method on a four-processor CRAY X-MP/48 computer. We solve block tridiagonal systems using block preconditioners well suited to parallel computation. Numerical results are presented that exhibit nearly optimal speedup and high Mflops rates.

Journal ArticleDOI
TL;DR: Measurements of code parallelism for the, LINPACK numerical package are presented to support the belief that typical numerical programs contain much potential parallelism that can be discovered by a good restructuring compiler.
Abstract: The main aim of the paper is to study allocation of processors to parallel programs executing on a multiprocessor system, and the resulting speedups. First, we consider a parallel program as a sequence of steps where each step consists of a set of parallel operations. General bounds on the speedup on a p-processor system are derived based on this model. Measurements of code parallelism for the LINPACK numerical package are presented to support the belief that typical numerical programs contain much potential parallelism that can be discovered by a good restructuring compiler. Next, a parallel program is represented as a task graph whose nodes are DOACROSS loops (i.e., loops whose iterations can be partially overlapped). It is shown how processors can be allocated to exploit horizontal and vertical parallelism in such graphs. Two processor allocation heuristic algorithms (WP and PA) are presented. PA is the heart of WP and is used to obtain efficient processor allocations for a set of independent parallel tasks. WP allocates processors to general task graphs. Finally, a general formula for the speedup of a DOACROSS loop is given that is more accurate than the known formula.

Journal ArticleDOI
01 Sep 1987
TL;DR: A graphical model is described that profiles the execution of the barriers and other parallel programming constructs and shows that in order to achieve the best performance, different situations call for different barrier implementations.
Abstract: A barrier is a method for synchronizing a large number of concurrent computer processes. After considering some basic synchronization mechanisms, a collection of barrier algorithms with either linear or logarithmic depth are presented. A graphical model is described that profiles the execution of the barriers and other parallel programming constructs. This model shows how the interaction between the barrier algorithms and the work that they synchronize can impact their performance. One result is that logarithmic tree structured barriers show good performance when synchronizing fixed length work, while linear self-scheduled barriers show better performance when synchronizing fixed length work with an imbedded critical section. The linear barriers are better able to exploit the process skew associated with critical sections. Timing experiments that support these conclusions, performed on an eighteen-processor Flex/32 shared-memory multiprocessor, are detailed.
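A linear, counter-based barrier of the kind the paper classifies is easy to sketch. The following is a generic sketch, not the paper's Flex/32 code; the logarithmic variants it studies instead combine arrivals pairwise up a tree. The last of n arriving threads releases the rest, and a generation counter makes the barrier reusable.

```python
import threading

class CentralBarrier:
    """Linear-depth barrier: arrivals increment one shared counter."""

    def __init__(self, n):
        self.n = n
        self.count = 0
        self.generation = 0
        self.cond = threading.Condition()

    def wait(self):
        with self.cond:
            gen = self.generation
            self.count += 1
            if self.count == self.n:
                # Last arrival: reset and release this generation.
                self.count = 0
                self.generation += 1
                self.cond.notify_all()
            else:
                while gen == self.generation:
                    self.cond.wait()
```

Every arrival serializes on one lock, hence the linear depth; a tree barrier replaces the shared counter with log-depth pairwise combining at the cost of more bookkeeping per process.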

Book ChapterDOI
08 Jun 1987
TL;DR: A model from which a superlinear speedup can be deduced is described, based on the fact that, on average, the solutions are distributed nonuniformly in the case of the satisfiability problem.
Abstract: We have implemented a backtracking strategy for the satisfiability problem on a ring of processors and we observed a superlinear speedup on average. In this paper we describe a model from which this superlinear speedup can be deduced. The model is based on the fact that, on average, the solutions are distributed nonuniformly in the case of the satisfiability problem. To our knowledge this phenomenon was not used before in the analysis of algorithms.
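The phenomenon can be caricatured with a tiny deterministic example (entirely our construction, not the paper's model): one processor scans the search space in order, while k processors scan disjoint slices. A solution sitting just past a slice boundary is found almost immediately by one of the k processors, so for nonuniformly placed solutions the speedup can exceed k.

```python
def parallel_speedup(n, solution_pos, k):
    # One processor scans cells 0..n-1 left to right and stops at the
    # solution (cost solution_pos + 1). k processors each scan one
    # contiguous slice of n/k cells; the first to reach the solution
    # stops everyone, so the parallel cost is the solution's offset
    # within its slice.
    assert n % k == 0 and 0 <= solution_pos < n
    seq_cost = solution_pos + 1
    par_cost = solution_pos % (n // k) + 1
    return seq_cost / par_cost
```

A solution at position 900 of 1000 cells with 10 processors yields a speedup of 901, far above the linear bound of 10; solutions spread uniformly would average out to at most k.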

Journal ArticleDOI
Stolfo
TL;DR: The author presents the initial measured performance of the implemented systems on the DADO2 prototype and a general overview of the DADO architecture and the theoretical speedup achievable for almost decomposable searching problems.
Abstract: On December 5, 1985 a 1023-processor parallel machine named DADO2 was successfully demonstrated at Columbia University. DADO2 is the fourth prototype, but the first large-scale prototype, of a class of machines called DADO. DADO was first proposed in 1980 as a special-purpose parallel computer attached to a conventional host processor and designed to accelerate a particular class of artificial intelligence rule-based programming paradigms called production systems. PSs have been widely used for the implementation of knowledge-based expert systems. Since its introduction, we have learned that DADO suits the high-speed execution of a large class of problems that we call almost decomposable searching problems, of which PSs are but one example. Several specific examples of this class of problems presently run on DADO2; others are currently under development and nearing completion. In this article the author presents the initial measured performance of the implemented systems on the DADO2 prototype. This data is preceded by a general overview of the DADO architecture and the theoretical speedup achievable for almost decomposable searching problems.

Journal ArticleDOI
01 Jul 1987
TL;DR: A row-oriented implementation of Gaussian elimination with partial pivoting on a local-memory multiprocessor is described, along with a simple load-balancing scheme that is inexpensive and maintains computational balance in the presence of pivoting; experiments use an Intel hypercube.
Abstract: A row-oriented implementation of Gaussian elimination with partial pivoting on a local-memory multiprocessor is described. In the absence of pivoting, the initial data loading of the node processors leads to a balanced computation. However, if interchanges occur, the computational loads on the processors may become unbalanced, leading to inefficiency. A simple load-balancing scheme is described which is inexpensive and which maintains computational balance in the presence of pivoting. Using some reasonable assumptions about the probability of pivoting occurring, an analysis of the communication costs of the algorithm is developed, along with an analysis of the computation performed in each node processor. This model is then used to derive the expected speedup of the algorithm. Finally, experiments using an Intel hypercube are presented in order to demonstrate the extent to which the analytical model predicts the performance.

Journal ArticleDOI
TL;DR: The results show that the speedup which can be obtained theoretically in a parallel system may be decreased significantly by synchronization constraints, and the performance indices of the parallel execution of programs are studied.
Abstract: The paper presents a performance model of fork and join synchronization primitives. The primitives are used in parallel programs executed on distributed systems. Three variants of the execution of parallel programs with fork and join primitives are considered and queueing models are proposed to evaluate their performance on a finite number of processors. Synchronization delays incurred by the programs are represented by a state-dependent server with service rate depending on a particular synchronization scheme. Closed form results are presented for the two processor case and a numerical method is proposed for many processors. Fork-join queueing networks having more complex structure i.e., processors arranged in series and in parallel, are also analyzed in the same manner. The networks can model the execution of jobs with a general task precedence graph corresponding to a nested structure of the fork-join primitives. Some performance indices of the parallel execution of programs are studied. The results show that the speedup which can be obtained theoretically in a parallel system may be decreased significantly by synchronization constraints.

Proceedings Article
01 Dec 1987
TL;DR: The numerical solution of an elliptic partial differential equation is examined in order to study the relationship between problem size and architecture and identifies the smallest grid size which fully benefits from using all available processors.
Abstract: The communication and synchronization overhead inherent in parallel processing can lead to situations where adding processors to the solution method actually increases execution time. Problem type, problem size, and architecture type all affect the optimal number of processors to employ. The numerical solution of an elliptic partial differential equation is examined in order to study the relationship between problem size and architecture. The equation's domain is discretized into n^2 grid points which are divided into partitions and mapped onto the individual processor memories. The relationships between grid size, stencil type, partitioning strategy, processor execution time, and communication network type are analytically quantified. In so doing, the optimal number of processors to assign to the solution was determined, and the analysis identified (1) the smallest grid size which fully benefits from using all available processors, (2) the leverage on performance given by increasing processor speed or communication network speed, and (3) the suitability of various architectures for large numerical problems.
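The trade-off the abstract quantifies can be illustrated with a toy per-iteration cost model (our own formulas and parameter names, not the paper's): for an n-by-n grid split into p square partitions, local computation shrinks as n^2/p while, on a network that serializes the halo exchanges, total boundary traffic grows as 4n*sqrt(p), so an optimal processor count exists.

```python
def iteration_time(n, p, t_calc, t_msg):
    # Compute term: n*n/p grid points per processor per iteration.
    # Communication term: each of p partitions exchanges a halo of
    # roughly 4*n/sqrt(p) points; serialized across the network this
    # totals 4*n*sqrt(p) transfers.
    return t_calc * n * n / p + t_msg * 4 * n * p ** 0.5

def optimal_processors(n, t_calc, t_msg, p_max):
    # Past the minimizer, adding processors slows the solution down,
    # exactly the situation the abstract describes.
    return min(range(1, p_max + 1),
               key=lambda p: iteration_time(n, p, t_calc, t_msg))
```

Setting the derivative of the continuous model to zero gives p_opt = (t_calc * n / (2 * t_msg))**(2/3), so the optimal processor count grows with both grid size and the compute-to-communication cost ratio.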

Proceedings ArticleDOI
01 Oct 1987
TL;DR: This paper presents statistics of several designs at four design abstraction levels - the instruction, behavioral, RTL, and gate levels, which found a factor of roughly ten speed-up between each of the abstraction levels.
Abstract: This paper presents statistics of several designs at four design abstraction levels - the instruction, behavioral, RTL, and gate levels. The data includes simulation time profiles, maximum speedup and limitations of parallelism, typical model evaluation times, event distributions, element intensities, and component counts for the four abstraction levels. This data is then used to analyze and evaluate several speed-up approaches: mixed-level simulation, parallel software simulators, parallel pipelined hardware accelerators, and decreased time resolution. The results show that element activity is around 0.1 to 0.5% at any particular time point. For the example circuits (3400 gates, 5000 gates, and 150,000 transistors), simulations show that parallelism can obtain speed-ups between 10-30. We found a factor of roughly ten speed-up between each of the abstraction levels.

Journal ArticleDOI
01 Sep 1987
TL;DR: Modification of the concurrent algorithm, to take advantage of the cache and frontwidth reduction by element reordering, doubled the concurrent speedup on the ELXSI to 2.8 on four processors.
Abstract: Frontal methods are an efficient and popular means of Gauss elimination of matrix equations that arise in finite element analysis. Nested dissection of a computational domain makes possible high-level parallelism in a widely used frontal algorithm for unsymmetric systems. A concurrent, highly vectorized, multifrontal, finite element analysis of axisymmetric liquid drop oscillations with 2,210 equations runs on the CRAY X-MP/48 with factors of 1.9 and 2.9 reduction in elapsed time on two and four processors, respectively. On an ELXSI 6400, which has an additional memory level, local processor cache, ignored in the algorithm's design for the CRAY, implementation of the same problem initially achieved a speedup of only 1.4 on four processors. Modification of the concurrent algorithm, to take advantage of the cache and frontwidth reduction by element reordering, doubled the concurrent speedup on the ELXSI to 2.8 on four processors.

Journal ArticleDOI
TL;DR: This paper demonstrates that the potential of intrinsic parallelism in Monte Carlo methods, which has remained essentially untapped so far, can be exploited to implement these methods efficiently on SIMD and MIMD computers.
Abstract: This paper demonstrates that the potential of intrinsic parallelism in Monte Carlo methods, which has remained essentially untapped so far, can be exploited to implement these methods efficiently on SIMD and MIMD computers. Two basic static and dynamic computation assignment schemes are proposed for assigning the primary estimate computations (PECs) to processors in a parallel computer. These schemes can be used to design parallel Monte Carlo algorithms for many applications. The time complexity analyses of static computation assignment (SCA) schemes are carried out using some results from order statistics, whereas those of dynamic computation assignment (DCA) schemes are carried out using results from order statistics, renewal and queueing theories. It is shown that for a smaller number of processors, linear speedup can be achieved with the SCA schemes and a speedup almost equal to the number of processors can be achieved with the DCA schemes. Some computational results for Monte Carlo solutions of Lapla...

Journal ArticleDOI
TL;DR: This paper presents a parallel version of the Sieve of Eratosthenes, a straightforward algorithm for finding all prime numbers in a given range, which serves as a test of some of the capabilities of a parallel machine.
Abstract: The Sieve of Eratosthenes for finding prime numbers in recent years has seen much use as a benchmark algorithm for serial computers while its intrinsically parallel nature has gone largely unnoticed. The implementation of a parallel version of this algorithm for a real parallel computer, the Flex/32, is described and its performance discussed. It is shown that the algorithm is sensitive to several fundamental performance parameters of parallel machines, such as spawning time, signaling time, memory access, and overhead of process switching. Because of the nature of the algorithm, it is impossible to get any speedup beyond 4 or 5 processors unless some form of dynamic load balancing is employed. We describe the performance of our algorithm with and without load balancing and compare it with theoretical lower bounds and simulated results. It is straightforward to understand this algorithm and to check the final results. However, its efficient implementation on a real parallel machine requires thoughtful design, especially if dynamic load balancing is desired. The fundamental operations required by the algorithm are very simple: this means that the slightest overhead appears prominently in performance data. The Sieve thus serves not only as a very severe test of the capabilities of a parallel processor but is also an interesting challenge for the programmer.
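The decomposition that makes the Sieve parallel is segmentation: base primes up to the square root of N are found serially, and disjoint segments of the range can then be sieved independently, one per processor. Below is a minimal single-process sketch of that decomposition (our illustration, not the paper's Flex/32 implementation); the segment loop is where a real implementation would spawn workers, and the uneven cost per segment is what motivates the dynamic load balancing the paper discusses.

```python
from math import isqrt

def sieve_segments(n, n_segments):
    # Serial phase: base primes up to sqrt(n) by a plain sieve.
    base_limit = isqrt(n)
    flags = bytearray([1]) * (base_limit + 1)
    flags[0:2] = b"\x00\x00"
    for p in range(2, isqrt(base_limit) + 1):
        if flags[p]:
            flags[p * p::p] = bytearray(len(flags[p * p::p]))
    base = [p for p in range(2, base_limit + 1) if flags[p]]

    # "Parallel" phase, run sequentially here: each segment depends
    # only on the base primes, so segments could go to separate
    # processors with no cross-talk.
    primes = list(base)
    lo = base_limit + 1
    seg = max(1, (n - base_limit) // n_segments + 1)
    while lo <= n:
        hi = min(lo + seg - 1, n)
        mark = bytearray([1]) * (hi - lo + 1)
        for p in base:
            start = max(p * p, ((lo + p - 1) // p) * p)
            for m in range(start, hi + 1, p):
                mark[m - lo] = 0
        primes.extend(lo + i for i, keep in enumerate(mark) if keep)
        lo = hi + 1
    return primes
```

The result is independent of how many segments are used, which makes the final output easy to check, one of the properties that makes the Sieve attractive as a benchmark.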

Journal ArticleDOI
TL;DR: The optimal block-partitioning algorithm which tears a power system into several blocks is presented and its clear difference from that for the load flow is clarified.
Abstract: This paper presents an effective parallel computation algorithm for the static state estimation of a power system. The proposed algorithm makes use of the matrix inversion lemma and assumes the use of a MIMD type computer system. The optimal block-partitioning algorithm which tears a power system into several blocks is presented and its clear difference from that for the load flow is clarified. Simulation results on systems of up to 135 nodes have demonstrated that the algorithm is very promising for speeding up the state estimation.

Journal ArticleDOI
R. B. King, V. Sonnad
TL;DR: An element-by-element solution algorithm for systems of equations arising in applying the finite element method in solid mechanics was implemented on the loosely coupled array of processors (LCAP) parallel computer located at IBM Kingston.
Abstract: An element-by-element solution algorithm for systems of equations arising in applying the finite element method in solid mechanics was implemented on the loosely coupled array of processors (LCAP) parallel computer located at IBM Kingston. The element-by-element algorithm has previously been shown to be advantageous over direct solution algorithms for large problems on sequential computers. It also has the advantage that it can be implemented in parallel on machines such as the LCAP in a relatively straightforward manner. The results show that solution speedup efficiencies of approximately 95% can be readily achieved with this method, with no indication that the speed-up efficiency drops off as more processors are added. The implementation used is applicable to other coarse-grained parallel architectures in addition to the LCAP computer.

Journal ArticleDOI
TL;DR: This paper presents the technique of concurrent hierarchical fault simulation, a performance model, and two hierarchical optimization techniques to enhance fault simulator performance and indicates that the speedup should increase with circuit size.
Abstract: This paper presents the technique of concurrent hierarchical fault simulation, a performance model, and two hierarchical optimization techniques to enhance fault simulator performance. The mechanisms for these enhancements are demonstrated with a performance model and are validated experimentally via CHIEFS, the Concurrent Hierarchical and Extensible Fault Simulator, and WRAP, an offline hierarchy compressor. Hierarchy-based fault partitioning and circuit reconfiguration are shown to improve simulator performance to O(n log n) under appropriate conditions. A decoupled fault modeling technique permits further performance improvements via a bottom-up hierarchy compression technique where macros of primitives are converted to single primitives. When combined, these techniques have produced a factor of 180 speedup on a mantissa multiplier. The performance model indicates that the speedup should increase with circuit size.

Proceedings ArticleDOI
01 Oct 1987
TL;DR: Algorithms and programming techniques needed to develop SUM (Simulation Using Massively parallel computers), a relaxation-based circuit simulator on the Connection Machine, a massively parallel processor with up to 65536 processors are described.
Abstract: Accurate circuit simulation is a very important step in the design of high-performance integrated circuits. The ever-increasing size of integrated circuits requires an inordinate amount of computer time to be spent in circuit simulation. Parallel processors have been considered to speed up the simulation process. Massively parallel computers have become available recently and present an interesting new paradigm for expensive CAD applications. This paper describes the algorithms and programming techniques needed to develop SUM (Simulation Using Massively parallel computers), a relaxation-based circuit simulator on the Connection Machine, a massively parallel processor with up to 65536 processors. SUM can simulate circuits at almost constant CPU time per iteration, regardless of circuit size, and can therefore simulate very large circuits. Circuit simulators running on the largest supercomputers can handle circuits of comparable size; SUM, however, scales easily as the number of processors in the Connection Machine increases, with almost no increase in CPU time.
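Relaxation-based simulation maps naturally onto one-unknown-per-processor machines: in a Jacobi-style sweep, every unknown is updated from the previous iterate only, so all updates within a sweep are independent. A small sequential sketch of that structure (a generic diagonally dominant system stands in for the linearised circuit equations; this is not SUM's code):

```python
def jacobi_step(A, b, x):
    """One synchronous relaxation sweep. Every unknown is updated from
    the previous iterate only, so the n updates are mutually independent:
    on a machine like the Connection Machine each could live on its own
    processor, giving near-constant time per iteration regardless of n."""
    n = len(b)
    return [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)]

def relax(A, b, tol=1e-12, max_sweeps=1000):
    """Iterate until successive sweeps agree to within tol."""
    x = [0.0] * len(b)
    for _ in range(max_sweeps):
        x_new = jacobi_step(A, b, x)
        if max(abs(u - v) for u, v in zip(x_new, x)) < tol:
            return x_new
        x = x_new
    return x

# Stand-in system (diagonally dominant, so the sweep converges).
A = [[2.0, -1.0], [-1.0, 2.0]]
b = [1.0, 1.0]
x = relax(A, b)   # converges to [1.0, 1.0]
```

On the real machine the inner `sum` over neighbours becomes a local communication step between processors holding adjacent circuit nodes, which is where the near-constant per-iteration cost comes from.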


Journal ArticleDOI
01 Nov 1987
TL;DR: The paper describes the implementation of the Successive Overrelaxation (SOR) method on an asynchronous multiprocessor computer for solving large, linear systems, achieving a speedup of O(N/3) and an efficiency of 2/3 using p = [N/2] processors.
Abstract: The paper describes the implementation of the Successive Overrelaxation (SOR) method on an asynchronous multiprocessor computer for solving large, linear systems. The parallel algorithm is derived by dividing the serial SOR method into noninterfering tasks which are then combined with an optimal schedule for a feasible number of processors. The important features of the algorithm are: (i) it achieves a speedup Sp ≅ O(N/3) and an efficiency Ep ≅ 2/3 using p = [N/2] processors, where N is the number of equations; (ii) it contains a high level of inherent parallelism, while the convergence theory of the parallel SOR method remains the same as that of its sequential counterpart; and (iii) it may be modified to use block methods in order to minimise the overhead due to communication and synchronisation of the processors.
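One standard way to split SOR into noninterfering tasks, in the spirit of the schedule described (though not necessarily the paper's exact decomposition), is red-black ordering: for a 1D model problem, all odd-indexed interior points can be updated simultaneously, then all even-indexed ones, and the iteration is mathematically identical to a sequential SOR sweep in that ordering. A sketch:

```python
def redblack_sor_sweep(u, f, h2, omega):
    """One red-black SOR sweep for -u'' = f on a uniform 1D grid with
    fixed boundary values u[0] and u[-1]. Odd-indexed interior points
    depend only on even-indexed neighbours and vice versa, so each
    half-sweep is a set of noninterfering tasks that could run on
    about N/2 processors at once."""
    n = len(u)
    for start in (1, 2):                 # odd interior points, then even
        for i in range(start, n - 1, 2):
            gauss_seidel = 0.5 * (u[i - 1] + u[i + 1] + h2 * f[i])
            u[i] = (1.0 - omega) * u[i] + omega * gauss_seidel
    return u

# Laplace problem (f = 0) with boundary values 0 and 1: the exact
# solution is linear in the grid index.
u = [0.0, 0.0, 0.0, 0.0, 1.0]
f = [0.0] * 5
for _ in range(200):
    redblack_sor_sweep(u, f, h2=1.0, omega=1.5)
```

The convergence theory carries over unchanged because the 1D Laplacian is consistently ordered under red-black numbering, matching feature (ii) of the abstract; grouping several grid points per task gives the block variant of feature (iii).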

Journal ArticleDOI
01 Jun 1987
TL;DR: A parallel Monte Carlo photon transport algorithm that ensures the reproducibility of results is presented; its key feature is a pair of pseudo-random number generators that make it possible to reproduce results exactly in an asynchronous parallel processing environment.
Abstract: We present a parallel Monte Carlo photon transport algorithm that ensures the reproducibility of results. The important feature of this parallel implementation is the introduction of a pair of pseudo-random number generators. This pair of generators is structured in such a manner as to ensure minimal correlation between the two sequences of pseudo-random numbers produced. We term this structure a ‘pseudo-random tree’. Using this structure, we are able to reproduce results exactly in an asynchronous parallel processing environment. The algorithm tracks the history of photons as they interact with two carbon cylinders joined end to end. The algorithm was implemented on both a Denelcor HEP and a CRAY X-MP/48. We describe the algorithm and the pseudo-random tree structure and present speedup results of our implementation.
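The pseudo-random tree idea can be sketched with two multiplicative congruential generators: one "spawning" generator walks down the tree handing each photon history its own seed, while a second generator produces the random stream within a history. Each history's numbers then depend only on its own seed, not on which processor runs it or in what order. An illustrative sketch (the multipliers are common textbook constants chosen for illustration, not the paper's; a real pseudo-random tree must also be validated for low correlation between the two sequences):

```python
M = 2**31 - 1   # Mersenne prime modulus shared by both generators

def lcg_stream(mult, seed):
    """Multiplicative congruential generator x <- mult * x mod M."""
    while True:
        seed = (mult * seed) % M
        yield seed

# One multiplier walks "down" the tree spawning per-history seeds,
# the other walks "across" producing each history's own stream.
SPAWN_MULT, STREAM_MULT = 16807, 48271

def spawn_seeds(root_seed, n_histories):
    """Hand each photon history a reproducible seed of its own."""
    gen = lcg_stream(SPAWN_MULT, root_seed)
    return [next(gen) for _ in range(n_histories)]

def history_stream(seed, n_draws):
    """Uniform (0, 1) draws for one photon history."""
    gen = lcg_stream(STREAM_MULT, seed)
    return [next(gen) / M for _ in range(n_draws)]

# Processing histories in any order yields identical per-history
# streams, which is what makes asynchronous execution reproducible:
seeds = spawn_seeds(12345, 4)
in_order = [history_stream(s, 3) for s in seeds]
reversed_order = [history_stream(s, 3) for s in reversed(seeds)]
```

Scheduling on the HEP or the X-MP/48 then only decides *when* each history is processed, never *which* random numbers it sees.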