
Showing papers on "Speedup published in 1992"


Proceedings ArticleDOI
01 Apr 1992
TL;DR: The hardware overhead of directory-based cache coherence in the prototype of the DASH multiprocessor is examined and the effectiveness of coherent caches and the relationship between an application's reference behavior and its speedup is characterized.
Abstract: The fundamental premise behind the DASH project is that it is feasible to build large-scale shared-memory multiprocessors with hardware cache coherence. While paper studies and software simulators are useful for understanding many high-level design trade-offs, prototypes are essential to ensure that no critical details are overlooked. A prototype provides convincing evidence of the feasibility of the design, allows one to accurately estimate both the hardware and complexity costs of various features, and provides a platform for studying real workloads. A 16-processor prototype of the DASH multiprocessor has been operational for the last six months. In this paper, the hardware overhead of directory-based cache coherence in the prototype is examined. We also discuss the performance of the system, and the speedups obtained by parallel applications running on the prototype. Using a sophisticated hardware performance monitor, we characterize the effectiveness of coherent caches and the relationship between an application's reference behavior and its speedup.

214 citations


Proceedings ArticleDOI
01 Jun 1992
TL;DR: Experiments indicate that the multiplexed R-tree with the PI heuristic gives better response time than the disk-striping (=“Super-node”) approach, and imposes a lighter load on the I/O subsystem.
Abstract: We consider the problem of exploiting parallelism to accelerate the performance of spatial access methods and specifically, R-trees [11]. Our goal is to design a server for spatial data, so as to maximize the throughput of range queries. This can be achieved by (a) maximizing parallelism for large range queries, and (b) engaging as few disks as possible on point queries [22]. We propose a simple hardware architecture consisting of one processor with several disks attached to it. On this architecture, we propose to distribute the nodes of a traditional R-tree, with cross-disk pointers (“Multiplexed” R-tree). The R-tree code is identical to the one for a single-disk R-tree, with the only addition that we have to decide which disk a newly created R-tree node should be stored on. We propose and examine several criteria to choose a disk for a new node. The most successful one, termed “proximity index” or PI, estimates the similarity of the new node with the other R-tree nodes already on a disk, and chooses the disk with the lowest similarity. Experimental results show that our scheme consistently outperforms all the other heuristics for node-to-disk assignments, achieving up to 55% gains over the Round Robin one. Experiments also indicate that the multiplexed R-tree with the PI heuristic gives better response time than the disk-striping (=“Super-node”) approach, and imposes a lighter load on the I/O subsystem. The speedup of our method is close to linear, increasing with the size of the queries.
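The node-to-disk assignment described above can be sketched as a toy example. This is a hypothetical simplification: the paper's proximity index is a specific spatial similarity measure, whereas here similarity is approximated by plain MBR overlap area; the functions `overlap_area` and `assign_disk` and the rectangle encoding are illustrative assumptions, not the paper's code.

```python
def overlap_area(a, b):
    # a, b: minimum bounding rectangles as (xmin, ymin, xmax, ymax)
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0.0

def assign_disk(new_mbr, disks):
    """Pick the disk whose resident R-tree nodes are least similar to the
    new node (here: least overlapping), so nodes likely to be fetched by
    the same range query end up on different disks and can be read in
    parallel. `disks` is a list of lists of resident-node MBRs."""
    scores = [sum(overlap_area(new_mbr, m) for m in disk) for disk in disks]
    return scores.index(min(scores))
```

A node overlapping the region already stored on one disk is routed to another, so a range query covering that region engages several disks at once.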

189 citations


Journal ArticleDOI
TL;DR: It is shown that parallel architectures fall somewhat short of ideal speedups in practice, but they should still enable current CMOS technologies to go well beyond 1 Gb/s data rates.
Abstract: The use of VLSI technology to speed up cyclic redundancy checking (CRC) circuits used for error detection in telecommunications systems is investigated. By generalizing the analysis of a parallel prototype, performance is estimated over a wide range of external constraints and design choices. It is shown that parallel architectures fall somewhat short of ideal speedups in practice, but they should still enable current CMOS technologies to go well beyond 1 Gb/s data rates.

189 citations


Journal ArticleDOI
TL;DR: A new algorithm called the localized constraints algorithm was developed in order to efficiently vectorize a constraint molecular dynamics simulation, PRESTO (PRotein Engineering SimulaTOr), which provided high performance on many types of vector processors.

171 citations


Proceedings ArticleDOI
01 May 1992
TL;DR: The authors propose a simple way to dramatically improve the performance of input-queued ATM packet switches beyond the 82% saturation point obtained in previous work; the method yields a throughput improvement from 65% to 92% without speedup, trunking, or complicated hardware.
Abstract: The authors propose a simple way to dramatically improve the performance of input-queued ATM packet switches beyond the 82% saturation point obtained in previous work. The method is an extension of the independent output-port schedulers technique and is based on the notion of recycled time slots, i.e. reusing time slots normally wasted due to scheduling conflicts. In contrast to previous results, the technique yields a throughput improvement from 65% to 92% without speedup, trunking, or complicated hardware. If input grouping with a group size of four is also employed, then the method can yield up to 95% throughput.

105 citations


Journal ArticleDOI
TL;DR: The architecture is scalable and flexible enough to be useful for simulating various kinds of networks and paradigms, and the speedup factor increases regularly with the number of clusters involved (to a factor of 80).
Abstract: Neural network simulations on a parallel architecture are reported. The architecture is scalable and flexible enough to be useful for simulating various kinds of networks and paradigms. The computing device is based on an existing coarse-grain parallel framework (INMOS transputers), improved with finer-grain parallel abilities through VLSI chips, and is called the Lneuro 1.0 (for LEP neuromimetic) circuit. The modular architecture of the circuit makes it possible to build various kinds of boards to match the expected range of applications or to increase the power of the system by adding more hardware. The resulting machine remains reconfigurable to accommodate a specific problem to some extent. A small-scale machine has been realized using 16 Lneuros, to experimentally test the behavior of this architecture. Results are presented on an integer version of Kohonen feature maps. The speedup factor increases regularly with the number of clusters involved (to a factor of 80). Some ways to improve this family of neural network simulation machines are also investigated.

90 citations


Proceedings ArticleDOI
08 Jun 1992
TL;DR: The authors present an efficient sequential circuit parallel fault simulator, HOPE, which simulates 32 faults at a time, which is about two times faster than PROOFS for most ISCAS89 sequential benchmark circuits.
Abstract: The authors present an efficient sequential circuit parallel fault simulator, HOPE, which simulates 32 faults at a time. HOPE is a parallel fault simulator based on single fault propagation. It adopts the zero gate delay model. The key idea incorporated in HOPE is to screen out faults with short propagation paths, and prevent them from being simulated in parallel. The screening process drastically reduces the number of faults simulated in parallel to achieve substantial speedup. The experimental results presented show that HOPE is about two times faster than PROOFS for most ISCAS89 sequential benchmark circuits.

84 citations


Journal ArticleDOI
TL;DR: The motivation for the RAP is described and it is shown how the architecture matches the target algorithm; peak performance on the error back-propagation algorithm is about 50% of a linear speedup.

71 citations


Proceedings ArticleDOI
01 Mar 1992
TL;DR: The authors present a new load balancing strategy and its application to distributed branch & bound algorithms and demonstrate its efficiency by solving some NP-complete problems on a network of up to 256 transputers.
Abstract: The authors present a new load balancing strategy and its application to distributed branch & bound algorithms and demonstrate its efficiency by solving some NP-complete problems on a network of up to 256 transputers. The parallelization of their branch & bound algorithm is fully distributed. Every processor performs the same algorithm but each on a different part of the solution tree. In this case it is necessary to distribute subproblems among the processors to achieve a well balanced workload. Their load balancing method overcomes the problem of search overhead and idle times by an appropriate load model and avoids thrashing effects by a feedback control method. Using this strategy they were able to achieve a speedup of up to 237.32 on a 256 processor network for very short parallel computation times, compared to an efficient sequential algorithm.

69 citations


Journal ArticleDOI
TL;DR: An algorithm which optimally selects hardware blocks for implementing these abstract building blocks and a technique for hierarchical redistribution and insertion of pipeline registers, which makes the area tradeoff between the cost of additional speedup circuitry and pipeline registers possible.
Abstract: At the highest abstraction level, the specification of a data path consists of a number of interconnected abstract building blocks and a constraint on the minimal clock frequency. An algorithm which optimally selects hardware blocks for implementing these abstract building blocks is presented. A technique for hierarchical redistribution and insertion of pipeline registers is also presented. Finally, the two optimization tasks are combined. This combination makes the area tradeoff between the cost of additional speedup circuitry and pipeline registers possible. The techniques are based on accurate hierarchical timing models for the hardware blocks. The automation relieves the designer of the numerous, time-consuming critical path verifications and area evaluations that are required to explore the large design space. The implementation of the algorithms has resulted in a CAD tool called HANDEL, embedded in the data-path compiler CHOPIN.

64 citations


Proceedings Article
23 Aug 1992
TL;DR: The results of this study can be used to design data fragmentation strategies for large parallel machines, and the optimal number of processors for the parallel execution of an operation is smaller for a main-memory system than for a disk-based system.
Abstract: This paper evaluates the performance of the parallel, main-memory DBMS PRISMA/DB. First, an abstract architecture for parallel query execution is presented. A performance model for the execution of simple relational operations on this architecture is developed. The parameters in the model are set using experiments on PRISMA/DB, and the performance of PRISMA/DB is analyzed in the context of the model. Several conclusions can be drawn from the model combined with the results of the performance experiments. Firstly, the performance of PRISMA/DB appears to be competitive with respect to other systems. Secondly, the developed model can explain the results from the performance experiments to a large extent. Also, it is concluded that observed linear speedup for small numbers of processors cannot always be extrapolated to larger numbers of processors. Finally, it is concluded that the optimal number of processors for the parallel execution of an operation is smaller for a main-memory system than for a disk-based system. The results of this study can be used to design data fragmentation strategies for large parallel machines.

Journal ArticleDOI
01 Nov 1992
TL;DR: A parallel algorithm for solving multiextremal multidimensional global optimization problems by applying Peano-type space-filling curves is proposed, and conditions which guarantee considerable speedup with respect to the sequential version of the algorithm are established.
Abstract: A parallel algorithm for solving multiextremal multidimensional global optimization problems is proposed. The algorithm is based on reducing multidimensional problems to one-dimensional ones by applying Peano-type space-filling curves. A new parallel scheme to construct such curves is presented. For reduced optimization problems a parallel global optimization method is constructed. Sufficient conditions of global convergence are investigated. Conditions, which guarantee considerable speedup with respect to the sequential version of the algorithm, are established. Numerical experiments executed on ALLIANT FX/80 are also presented.

Journal ArticleDOI
TL;DR: A new approach to parallel and distributed simulation of discrete event systems using a single clock mechanism that drives all trajectories simultaneously and which offers the possibility of concurrent performance evaluation and comparison at many system parameter values offers new and significant opportunities for performance optimization.
Abstract: In this paper we propose a new approach to parallel and distributed simulation of discrete event systems. Most parallel and distributed discrete event simulation algorithms are concerned with the simulation of one “large” discrete event system. In this case the computational intensity is due to the size and complexity of the simulated system. In contrast, we are interested in simulating a “large” number of “medium sized” systems. These are variants of a “nominal system” with different system parameter values or operation policies. The computational intensity in our case is due to the “large” number of simulated variants. Many simulation projects such as factor screening, performance modeling, and optimization require system performance evaluations at many parameter values; and others, we believe, could significantly benefit from them. There is considerable work in the literature on stochastic coupling of trajectories of parametric families of stochastic processes. Our approach can be viewed as the simulation of the coupled trajectories. We use a single clock mechanism that drives all trajectories simultaneously, hence the approach is called Single Clock Multiple System (SCMS) simulation. The single clock synchronizes all trajectories such that the “same” event occurs at the “same” time at all systems. This synchronization is the basis of our parallel and distributed algorithms. We focus on a particular implementation of the SCMS simulation using the so-called Standard Clock (SC) technique and also on the massively parallel implementation of the SC algorithms on the SIMD Connection Machine. Orders of magnitude of speedup are possible. Furthermore, the possibility of concurrent performance evaluation and comparison at many system parameter values offers new and significant opportunities for performance optimization.
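The coupling idea behind single-clock simulation can be illustrated with a toy model. This is a hypothetical sketch, not the paper's Standard Clock algorithm: one shared stream of clock ticks and random draws drives every parameter variant, so all variants see the "same" events; the function `scms_run` and the Bernoulli-arrival model are illustrative assumptions.

```python
import random

def scms_run(arrival_probs, ticks, seed=0):
    """Single-clock sketch: one common random draw per tick is shared by
    all variants; variant k records an arrival when the shared draw falls
    below its own arrival probability. Because the draw is shared, the
    trajectories are stochastically coupled: a variant with a higher
    probability can never see fewer arrivals than one with a lower one."""
    rng = random.Random(seed)
    counts = [0] * len(arrival_probs)
    for _ in range(ticks):
        u = rng.random()            # the same draw drives every variant
        for k, p in enumerate(arrival_probs):
            if u < p:
                counts[k] += 1
    return counts
```

The coupling makes comparisons across parameter values low-variance: differences between variants come only from the parameters, never from independent noise.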

Journal ArticleDOI
TL;DR: This study provides guidelines for choosing a particular coupling algorithm for mixed-level circuit and device simulation and describes a modified two-level Newton algorithm and a full block-LU decomposition algorithm used for transient analysis.
Abstract: A general framework for mixed-level circuit and device simulation is described. This framework was used in the development of the simulation program CODECS (coupled device and circuit simulator). Various algorithms to couple the device and circuit simulators for DC and transient analyses have been implemented in CODECS. These algorithms are evaluated based on their convergence properties and run-time performance. This study provides guidelines for choosing a particular coupling algorithm. A modified two-level Newton algorithm is used for DC analysis, whereas a full block-LU decomposition algorithm is used for transient analysis. This combination of algorithms provides reasonable convergence and run-time performance. A simple latency scheme provides a 50% speedup. Coupling for small-signal AC and pole-zero analyses is described.

Journal ArticleDOI
TL;DR: The authors model a job in a parallel processing system as a sequence of stages, each of which requires a certain integral number of processors for a certain interval of time, and find that the average number of jobs in the system with arrivals equals unity when power is maximized.
Abstract: The authors model a job in a parallel processing system as a sequence of stages, each of which requires a certain integral number of processors for a certain interval of time. They derive the speedup of the system for two cases: systems with no arrivals, and systems with arrivals. In the case with no arrivals, their speedup result is a generalization of Amdahl's law (G.M. Amdahl, 1967). They extend the notion of power as previously applied to general queuing and computer-communication systems to their case of parallel processing systems. They find the optimal job input and the optimal number of processors to use so that power is maximized. Many of the results for the case of arrivals are the same as for the case of no arrivals. It is found that the average number of jobs in the system with arrivals equals unity when power is maximized. They also model a job in such a way that the number of processors required continuously varies over time. The same performance indices and parameters studied in the discrete model are evaluated for this continuous model.

Journal ArticleDOI
TL;DR: These algorithms are applicable to particle-in-cell codes based on two-dimensional boundary-fitted coordinates in order to localize particles inside the grid and are suitable for complicated geometries with outer and inner curved boundaries.

Journal ArticleDOI
TL;DR: This paper describes three critical path analysis algorithms based on different event-scheduling (process scheduling) policies that can be integrated with sequential simulation programs written by users or integrated with simulation languages.
Abstract: Discrete event simulation is usually time consuming. Recently, there has been a great deal of interest in using parallel computers to speed up the simulation process. Before the parallel simulation approach is applied, it is important to understand the inherent parallelism of simulation applications. A simple technique called critical path analysis was proposed to study parallelism of simulation applications. This paper describes three critical path analysis algorithms based on different event-scheduling (process scheduling) policies. These algorithms are much simpler than a previous approach, where the events must be recorded in a trace and an extra pass is required to process the event trace. In our approach, the critical path analysis algorithms are integrated with the sequential simulation. At the end of the sequential simulation, the optimal parallel execution time is also computed. Livny proposed an algorithm similar to our approach (his, however, was designed for a specific language). Our algorithms can be integrated with sequential simulation programs written by users or be integrated with simulation languages. Another advantage of our algorithms over previous approaches is that ours can be used to study load balancing under different event-scheduling policies. Since our algorithms can be easily inserted in sequential simulation programs, critical path analysis can be applied to existing sequential programs without difficulty. The results can then be used to predict the performance of parallel simulation on similar applications. An example is given to show how useful information can be obtained from our algorithms.
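The core computation in critical path analysis can be sketched in a few lines. This is a generic illustration under the assumption that the event dependencies are available as a DAG in topological order; the function `critical_path` is not the paper's algorithm, which folds the computation into the sequential simulation itself.

```python
def critical_path(events, preds, cost):
    """events: event ids in topological order; preds[e]: events that must
    be simulated before e; cost[e]: e's processing time. The longest
    dependency chain is a lower bound on parallel simulation time, no
    matter how events are assigned to processors."""
    finish = {}
    for e in events:
        finish[e] = max((finish[q] for q in preds[e]), default=0.0) + cost[e]
    return max(finish.values())
```

Dividing the total sequential cost by the critical-path length gives the maximum speedup any parallel simulator could extract from that event structure.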

Proceedings ArticleDOI
01 Jul 1992
TL;DR: A parallel algorithm for design-space exploration and trade-off analysis is presented and results showed reduction in search time, improvement in design quality, and close-to-linear speedup.
Abstract: A parallel algorithm for design-space exploration and trade-off analysis is presented. Coarse-grained parallelism is introduced by generating multiple module bags and performing scheduling and performance analysis of the data flow graph for each module bag in parallel. This algorithm was implemented on a multiple processor machine as part of a distributed high-level synthesis system. Experimental results showed reduction in search time, improvement in design quality, and close-to-linear speedup.

Proceedings ArticleDOI
07 Jan 1992
TL;DR: The fixed time method does not reward slower processors with higher speedup; it predicts a new limit to speedup, which is more optimistic than Amdahl's; it shows an efficiency which is independent of processor speed and ensemble size; it sometimes gives non-spurious superlinear speedup.
Abstract: In measuring the performance of parallel computers, the usual method is to choose a problem and test the execution time as the processor count is varied. This model underlies definitions of 'speedup,' 'efficiency,' and arguments against parallel processing such as Ware's (1972) formulation of Amdahl's law (1967). Fixed time models use problem size as the figure of merit. Analysis and experiments based on fixed time instead of fixed size have yielded surprising consequences: the fixed time method does not reward slower processors with higher speedup; it predicts a new limit to speedup, which is more optimistic than Amdahl's; it shows an efficiency which is independent of processor speed and ensemble size; it sometimes gives non-spurious superlinear speedup; it provides a practical means (the SLALOM benchmark) of comparing computers of widely varying speeds without distortion.
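The contrast between the two models can be made concrete with their familiar closed forms. A sketch under the assumption that a fraction s of the work is serial: the fixed-size form is Amdahl's law, the fixed-time form is the Gustafson-style scaled speedup; the function names are ours, not the paper's.

```python
def fixed_size_speedup(serial_frac, p):
    """Amdahl's law: the problem size is held constant as processors are
    added, so the serial fraction caps the achievable speedup."""
    return 1.0 / (serial_frac + (1.0 - serial_frac) / p)

def fixed_time_speedup(serial_frac, p):
    """Fixed-time model: the problem grows to fill the same wall-clock
    time, so the parallel portion scales with p and the limit grows
    linearly rather than saturating."""
    return serial_frac + (1.0 - serial_frac) * p
```

With a 5% serial fraction on 1024 processors, the fixed-size model caps speedup below 20, while the fixed-time model gives roughly 973, illustrating why the fixed-time limit is "more optimistic than Amdahl's."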

01 Jan 1992
TL;DR: The main conclusions reached are (1) parallel logic simulation of large circuits on general purpose workstations is practical, (2) the EFDP, FGA, and PGA algorithms work very well, and (3) synchronous time algorithms appear to be much faster and more space conscious than asynchronous time algorithms.
Abstract: This dissertation records research on the problem of performing logic simulation of a circuit on a parallel architecture. Research begins with the development of a new circuit partitioning algorithm called the Event Flow and Distribution Partitioning (EFDP) algorithm. This algorithm is discussed in detail and is shown to produce a good partitioning of the circuit. Partitioning is followed by simulation. Test circuits are chosen from VLSI chips designed at the NASA Space Engineering Research Center for VLSI Design at the University of Idaho. These circuits range in size from 292 transistors to 249,897 transistors. Five test vectors are used to perform the simulation of these circuits. Simulation is first run on a uni-processor (a DEC 3000) using NOVA, a logic simulator developed at the NASA SERC. Simulation times varying from a few minutes to a few hours are reported on the suites of five test vectors. These same tests are then run on the same simulator modified to simulate execution on a shared memory multiprocessor using two new synchronous time parallel simulation algorithms and an asynchronous time parallel simulation algorithm. The two new synchronous time parallel simulation algorithms are types of greedy algorithms and are called the Focused Greedy Algorithm (FGA) and the Planned Greedy Algorithm (PGA). It is found that both algorithms performed nearly the same with the FGA showing a slightly higher speedup but also requiring slightly more storage than the PGA. An average speedup of up to 6.6 on an 8 processor architecture and 13.9 on a 16 processor architecture is reported. An asynchronous time parallel simulation algorithm based on the Time Warp algorithm is tested using the same seven circuits and the same suite of five test vectors. An average speedup of up to 4.2 on an 8 processor architecture and 7.7 on a 16 processor architecture is reported. 
The main conclusions reached are (1) parallel logic simulation of large circuits on general purpose workstations is practical, (2) the EFDP, FGA, and PGA algorithms work very well, and (3) synchronous time algorithms appear to be much faster and more space conscious than asynchronous time algorithms.

Journal ArticleDOI
TL;DR: A parallel implementation of numerical inversion of the Laplace transform (NILT) and parallel interconnect optimization, resulting in substantial CPU speedup over existing NILT simulation and optimization, is described.
Abstract: A CAD framework addressing three specific aspects of the high-speed interconnect problem, namely, simulation, sensitivity analysis, and performance optimization, is described. Distributed interconnect models represented by uniform or nonuniform lossy coupled transmission lines are supported. The CAD framework incorporates parallel processing capabilities. It also provides a design environment for integrating accurate simulations or waveform estimation, sensitivity analysis, design specifications, and numerical optimization. Approaches enhancing the accuracy and ensuring the stability of moment-matching techniques used in the asymptotic waveform evaluation (AWE) are introduced. Also described are a parallel implementation of numerical inversion of the Laplace transform (NILT) and parallel interconnect optimization, resulting in substantial CPU speedup over existing NILT simulation and optimization.

Proceedings ArticleDOI
John P. Fishburn1
01 Jul 1992
TL;DR: The author describes heuristic methods for performance optimization of mapped combinational logic, implemented in the system LATTIS (logic area-time tradeoff for integrated systems), which has six transform types: gate repowering, buffer insertion, downpowering of noncritical fanouts of the critical path, gate duplication, DeMorgan's laws, and timing-directed factorization and remapping of subcircuits.
Abstract: The author describes heuristic methods for performance optimization of mapped combinational logic, implemented in the system LATTIS (logic area-time tradeoff for integrated systems). LATTIS currently has six transform types: gate repowering, buffer insertion, downpowering of noncritical fanouts of the critical path, gate duplication, DeMorgan's laws, and timing-directed factorization and remapping of subcircuits. From among the transforms applicable on the critical path, LATTIS chooses the one with maximum benefit/cost. Cost is the increase in area, and benefit is the improvement in local slack, weighted by the number of primary inputs/outputs affected. The delay-area curves produced by LATTIS for the 70 largest circuits of the 1991 MCNC multilevel combinational logic benchmark set are given.

Journal ArticleDOI
TL;DR: In this article, the authors study the problem of choosing the partition sizes and a minimum completion time schedule for the execution of a set of tasks, each of which can be executed on partitions of varying sizes.
Abstract: A partitionable multiprocessor system can form multiple partitions, each consisting of a controller and a varying number of processors. Given such a system and a set of tasks, each of which can be executed on partitions of varying sizes, the authors study the problem of choosing the partition sizes and a minimum completion time schedule for the execution of these tasks. They assume that the number of tasks to be scheduled on the system is no more than the maximum number of partitions that can be formed simultaneously by the system, and that parallelization of the tasks can achieve at most perfect speedup. They show this scheduling problem to be NP-hard, and present a polynomial time approximation algorithm for this problem. The authors derive a parameter dependent, asymptotically tight worst-case performance bound for the algorithm, and evaluate its average performance through simulation.

Journal ArticleDOI
TL;DR: Reduce and Partition (R and P), a novel sequential algorithm which combines the best features of these two algorithms, is presented, and it is shown that R and P runs almost twice as fast as the previously known fastest algorithm.
Abstract: The parallelization of the two best-known sequential algorithms, that of W.P. Dotson and J.O. Gobein (1979) and that of L.B. Page and J.E. Perry (PP-F2TDN) (1989), for computing the terminal-pair reliability in a network is discussed. Reduce and Partition (R and P), a novel sequential algorithm which combines the best features of these two algorithms, is presented. It is shown that R and P runs almost twice as fast as the previously known fastest algorithm. A parallel version of R and P is also presented. The execution times of all three parallel algorithms with various numbers of processors for different networks on the BBN Butterfly parallel computer are provided. The parallel algorithms were implemented on a shared-memory parallel computer. In R and P, the greedy approach was used in selecting shortest paths in order to locally minimize the number of subproblems. This selection did not consider the effect of reductions on the subproblems to be generated.

Book ChapterDOI
01 Jan 1992
TL;DR: A register allocation algorithm and a cache usage optimization algorithm based on the reference window concept which can be effectively implemented in a compiler system are described.
Abstract: In this paper, we consider the problem of optimizing register allocation and cache behavior for loop array references. We exploit techniques developed initially for data locality estimation and improvement. First we review the concept of “reference window” that serves as our basic tool for both data locality evaluation and management. Then we study how some loop restructuring techniques (interchanging, tiling, ...) can help to improve data locality. We describe a register allocation algorithm and a cache usage optimization algorithm based on the window concept which can be effectively implemented in a compiler system. Experimental speedup measurements on a RISC processor, the IBM RS/6000, give evidence of the efficiency of our technique.
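Loop tiling, one of the restructuring techniques the abstract mentions, can be sketched in a language-neutral way. This is an illustrative toy, not the paper's reference-window algorithm: the function name and the column-order access pattern are our assumptions, chosen because striding down columns of a row-major array is the classic case tiling helps.

```python
def tiled_column_sum(a, n, tile):
    """Traverse an n-by-n row-major matrix (flat list `a`) tile by tile.
    The access a[j*n + i] walks down columns, which without tiling would
    stride across the whole array on every step; tiling keeps each
    tile-sized working set cache-resident. The result is simply the sum
    of all elements, so correctness is easy to check."""
    total = 0.0
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for i in range(ii, min(ii + tile, n)):
                for j in range(jj, min(jj + tile, n)):
                    total += a[j * n + i]   # column-order access, localized per tile
    return total
```

In a compiled language the tiled version touches the same elements in an order with far better temporal locality; in Python the example only demonstrates the iteration-space restructuring itself.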

Journal ArticleDOI
TL;DR: This article describes a parallelization of the simulation program CHARMM for the Intel iPSC/860, a distributed memory multiprocessor, and examines the effectiveness of the parallelization in the context of a case study of a realistic molecular system.
Abstract: Dynamics simulations of molecular systems are notoriously computationally intensive. Using parallel computers for these simulations is important for reducing their turnaround time. In this article we describe a parallelization of the simulation program CHARMM for the Intel iPSC/860, a distributed memory multiprocessor. In the parallelization, the computational work is partitioned among the processors for core calculations including the calculation of forces, the integration of equations of motion, the correction of atomic coordinates by constraint, and the generation and update of data structures used to compute nonbonded interactions. Processors coordinate their activity using synchronous communication to exchange data values. Key data structures used are partitioned among the processors in nearly equal pieces, reducing the memory requirement per node and making it possible to simulate larger molecular systems. We examine the effectiveness of the parallelization in the context of a case study of a realistic molecular system. While effective speedup was achieved for many of the dynamics calculations, other calculations fared less well due to growing communication costs for exchanging data among processors. The strategies we used are applicable to parallelization of similar molecular mechanics and dynamics programs for distributed memory multiprocessors. © 1992 by John Wiley & Sons, Inc.

Journal ArticleDOI
TL;DR: The properties of Whitney elements provide the basis of a novel integral formulation for solving three-dimensional magnetostatics and eddy-current problems using a tree structure, which realizes much of the speedup promised by parallel computing.
Abstract: The properties of Whitney elements provide the basis of a novel integral formulation for solving three-dimensional magnetostatics and eddy-current problems. Using a tree structure reduces the number of unknowns to a few less than the number of nodes for the static case, a significant saving over earlier integral formulations. Interface conditions are satisfied exactly, but the material constitutive relations are satisfied only approximately. The resulting magnetostatics code, GFUNET, is attractive in terms of convenience, CPU time utilization, and accuracy. It realizes much of the speedup promised by parallel computing. Results for an accelerator sextupole magnet and for TEAM Workshop problem #13 are presented.

Proceedings ArticleDOI
12 Apr 1992
TL;DR: In this article, a partially shared Rete network is proposed for parallel implementation and a hierarchical two-level parallel architecture based on this network is outlined, which achieves significant speedup by reducing the dynamic scheduling overheads of fine-grained jobs in a multiprocessor implementation of the rete network, taking advantage of the sharing of common computations in the network.
Abstract: The authors investigate methods to speed up the match phase of the execution of production systems. The Rete match algorithm is taken as the basis of the implementation. A partially shared Rete network is proposed for parallel implementation and a hierarchical two-level parallel architecture based on this network is outlined. The proposed architecture achieves significant speedup by reducing the dynamic scheduling overheads of fine-grained jobs in a multiprocessor implementation of the Rete network, while still taking advantage of the sharing of common computations in the network.

Book ChapterDOI
15 Jul 1992
TL;DR: High efficiency of random competition (compared with other parallel theorem provers) is proved on highly parallel architectures with thousands of processors; no communication between the processors is necessary during run-time.
Abstract: With random competition we propose a method for parallelizing arbitrary theorem provers. We can prove high efficiency (compared with other parallel theorem provers) of random competition on highly parallel architectures with thousands of processors. This method is suited for all kinds of distributed memory architectures, particularly for large networks of high performance workstations since no communication between the processors is necessary during run-time. On a set of examples we show the performance of random competition applied to the model elimination theorem prover SETHEO.

Proceedings ArticleDOI
26 Apr 1992
TL;DR: Experiments are presented indicating that on shared-memory machines, programs written in the nonshared-memory programming model generally offer better performance, in addition to being more portable and scalable.
Abstract: Experiments are presented indicating that on shared-memory machines, programs written in the nonshared-memory programming model generally offer better performance, in addition to being more portable and scalable. The authors study the LU decomposition problem and a molecular dynamics simulation on three shared-memory machines with widely differing architectures, and analyze the results from three perspectives: performance, speedup, and scaling.