
Showing papers on "Speedup published in 1992"


Proceedings ArticleDOI
01 Apr 1992
TL;DR: The hardware overhead of directory-based cache coherence in the prototype of the DASH multiprocessor is examined and the effectiveness of coherent caches and the relationship between an application's reference behavior and its speedup is characterized.
Abstract: The fundamental premise behind the DASH project is that it is feasible to build large-scale shared-memory multiprocessors with hardware cache coherence. While paper studies and software simulators are useful for understanding many high-level design trade-offs, prototypes are essential to ensure that no critical details are overlooked. A prototype provides convincing evidence of the feasibility of the design, allows one to accurately estimate both the hardware and complexity costs of various features, and provides a platform for studying real workloads. A 16-processor prototype of the DASH multiprocessor has been operational for the last six months. In this paper, the hardware overhead of directory-based cache coherence in the prototype is examined. We also discuss the performance of the system, and the speedups obtained by parallel applications running on the prototype. Using a sophisticated hardware performance monitor, we characterize the effectiveness of coherent caches and the relationship between an application's reference behavior and its speedup.

214 citations


Proceedings ArticleDOI
01 Jun 1992
TL;DR: Experiments indicate that the multiplexed R-tree with the PI heuristic gives better response time than the disk-striping (=“Super-node”) approach, and imposes a lighter load on the I/O subsystem.
Abstract: We consider the problem of exploiting parallelism to accelerate the performance of spatial access methods and specifically, R-trees [11]. Our goal is to design a server for spatial data, so as to maximize the throughput of range queries. This can be achieved by (a) maximizing parallelism for large range queries, and (b) engaging as few disks as possible on point queries [22]. We propose a simple hardware architecture consisting of one processor with several disks attached to it. On this architecture, we propose to distribute the nodes of a traditional R-tree, with cross-disk pointers (“Multiplexed” R-tree). The R-tree code is identical to the one for a single-disk R-tree, with the only addition that we have to decide which disk a newly created R-tree node should be stored on. We propose and examine several criteria to choose a disk for a new node. The most successful one, termed “proximity index” or PI, estimates the similarity of the new node with the other R-tree nodes already on a disk, and chooses the disk with the lowest similarity. Experimental results show that our scheme consistently outperforms all the other heuristics for node-to-disk assignments, achieving up to 55% gains over the Round Robin one. Experiments also indicate that the multiplexed R-tree with the PI heuristic gives better response time than the disk-striping (=“Super-node”) approach, and imposes a lighter load on the I/O subsystem. The speedup of our method is close to linear, increasing with the size of the queries.
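The node-to-disk assignment described above can be sketched as a toy example. This is a hypothetical simplification: the paper's proximity index is a specific spatial similarity measure, whereas here similarity is approximated by plain MBR overlap area; the functions `overlap_area` and `assign_disk` and the rectangle encoding are illustrative assumptions, not the paper's code.

```python
def overlap_area(a, b):
    # a, b: minimum bounding rectangles as (xmin, ymin, xmax, ymax)
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0.0

def assign_disk(new_mbr, disks):
    """Pick the disk whose resident R-tree nodes are least similar to the
    new node (here: least overlapping), so nodes likely to be fetched by
    the same range query end up on different disks and can be read in
    parallel. `disks` is a list of lists of resident-node MBRs."""
    scores = [sum(overlap_area(new_mbr, m) for m in disk) for disk in disks]
    return scores.index(min(scores))
```

A node overlapping the region already stored on one disk is routed to another, so a range query covering that region engages several disks at once.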

189 citations


Journal ArticleDOI
TL;DR: It is shown that parallel architectures fall somewhat short of ideal speedups in practice, but they should still enable current CMOS technologies to go well beyond 1 Gb/s data rates.
Abstract: The use of VLSI technology to speed up cyclic redundancy checking (CRC) circuits used for error detection in telecommunications systems is investigated. By generalizing the analysis of a parallel prototype, performance is estimated over a wide range of external constraints and design choices. It is shown that parallel architectures fall somewhat short of ideal speedups in practice, but they should still enable current CMOS technologies to go well beyond 1 Gb/s data rates.

189 citations


Journal ArticleDOI
TL;DR: A new algorithm called the localized constraints algorithm was developed in order to efficiently vectorize a constraint molecular dynamics simulation, PRESTO (PRotein Engineering SimulaTOr), which provided high performance on many types of vector processors.

171 citations


Proceedings ArticleDOI
01 May 1992
TL;DR: The authors propose a simple way to dramatically improve the performance of input-queued ATM packet switches beyond the 82% saturation point obtained in previous work; the method yields a throughput improvement from 65% to 92% without speedup, trunking, or complicated hardware.
Abstract: The authors propose a simple way to dramatically improve the performance of input-queued ATM packet switches beyond the 82% saturation point obtained in previous work. The method is an extension of the independent output-port schedulers technique and is based on the notion of recycled time slots, i.e. reusing time slots normally wasted due to scheduling conflicts. In contrast to previous results, the technique yields a throughput improvement from 65% to 92% without speedup, trunking, or complicated hardware. If input grouping with a group size of four is also employed, then the method can yield up to 95% throughput.

105 citations


Journal ArticleDOI
TL;DR: The architecture is scalable and flexible enough to be useful for simulating various kinds of networks and paradigms, and the speedup factor increases regularly with the number of clusters involved (to a factor of 80).
Abstract: Neural network simulations on a parallel architecture are reported. The architecture is scalable and flexible enough to be useful for simulating various kinds of networks and paradigms. The computing device is based on an existing coarse-grain parallel framework (INMOS transputers), improved with finer-grain parallel abilities through VLSI chips, and is called the Lneuro 1.0 (for LEP neuromimetic) circuit. The modular architecture of the circuit makes it possible to build various kinds of boards to match the expected range of applications or to increase the power of the system by adding more hardware. The resulting machine remains reconfigurable to accommodate a specific problem to some extent. A small-scale machine has been realized using 16 Lneuros, to experimentally test the behavior of this architecture. Results are presented on an integer version of Kohonen feature maps. The speedup factor increases regularly with the number of clusters involved (to a factor of 80). Some ways to improve this family of neural network simulation machines are also investigated.

90 citations


Proceedings ArticleDOI
08 Jun 1992
TL;DR: The authors present an efficient sequential circuit parallel fault simulator, HOPE, which simulates 32 faults at a time, which is about two times faster than PROOFS for most ISCAS89 sequential benchmark circuits.
Abstract: The authors present an efficient sequential circuit parallel fault simulator, HOPE, which simulates 32 faults at a time. HOPE is a parallel fault simulator based on single fault propagation. It adopts the zero gate delay model. The key idea incorporated in HOPE is to screen out faults with short propagation paths, and prevent them from being simulated in parallel. The screening process drastically reduces the number of faults simulated in parallel to achieve substantial speedup. The experimental results presented show that HOPE is about two times faster than PROOFS for most ISCAS89 sequential benchmark circuits.

84 citations


Journal ArticleDOI
TL;DR: The motivation for the RAP is described and it is shown how the architecture matches the target algorithm; peak performance on the error back-propagation algorithm is about 50% of a linear speedup.

71 citations


Proceedings ArticleDOI
01 Mar 1992
TL;DR: The authors present a new load balancing strategy and its application to distributed branch & bound algorithms and demonstrate its efficiency by solving some NP-complete problems on a network of up to 256 transputers.
Abstract: The authors present a new load balancing strategy and its application to distributed branch & bound algorithms and demonstrate its efficiency by solving some NP-complete problems on a network of up to 256 transputers. The parallelization of their branch & bound algorithm is fully distributed. Every processor performs the same algorithm but each on a different part of the solution tree. In this case it is necessary to distribute subproblems among the processors to achieve a well balanced workload. Their load balancing method overcomes the problem of search overhead and idle times by an appropriate load model and avoids thrashing effects by a feedback control method. Using this strategy they were able to achieve a speedup of up to 237.32 on a 256 processor network for very short parallel computation times, compared to an efficient sequential algorithm.

69 citations


Journal ArticleDOI
TL;DR: An algorithm which optimally selects hardware blocks for implementing these abstract building blocks and a technique for hierarchical redistribution and insertion of pipeline registers, which makes the area tradeoff between the cost of additional speedup circuitry and pipeline registers possible.
Abstract: At the highest abstraction level, the specification of a data path consists of a number of interconnected abstract building blocks and a constraint on the minimal clock frequency. An algorithm which optimally selects hardware blocks for implementing these abstract building blocks is presented. A technique for hierarchical redistribution and insertion of pipeline registers is also presented. Finally, the two optimization tasks are combined. This combination makes the area tradeoff between the cost of additional speedup circuitry and pipeline registers possible. The techniques are based on accurate hierarchical timing models for the hardware blocks. The automation relieves the designer of the numerous, time-consuming critical path verifications and area evaluations that are required to explore the large design space. The implementation of the algorithms has resulted in a CAD tool called HANDEL, embedded in the data-path compiler CHOPIN.

64 citations


Proceedings Article
23 Aug 1992
TL;DR: The results of this study can be used to design data fragmentation strategies for large parallel machines, and the optimal number of processors for the parallel execution of an operation is smaller for a main-memory system than for a disk-based system.
Abstract: This paper evaluates the performance of the parallel, main-memory DBMS PRISMA/DB. First, an abstract architecture for parallel query execution is presented. A performance model for the execution of simple relational operations on this architecture is developed. The parameters in the model are set using experiments on PRISMA/DB, and the performance of PRISMA/DB is analyzed in the context of the model. Several conclusions can be drawn from the model combined with the results of the performance experiments. Firstly, the performance of PRISMA/DB appears to be competitive with respect to other systems. Secondly, the developed model can explain the results from the performance experiments to a large extent. Also, it is concluded that observed linear speedup for small numbers of processors cannot always be extrapolated to larger numbers of processors. Finally, it is concluded that the optimal number of processors for the parallel execution of an operation is smaller for a main-memory system than for a disk-based system. The results of this study can be used to design data fragmentation strategies for large parallel machines.

Journal ArticleDOI
01 Nov 1992
TL;DR: A parallel algorithm for solving multiextremal multidimensional global optimization problems by applying Peano-type space-filling curves is proposed, and conditions which guarantee considerable speedup with respect to the sequential version of the algorithm are established.
Abstract: A parallel algorithm for solving multiextremal multidimensional global optimization problems is proposed. The algorithm is based on reducing multidimensional problems to one-dimensional ones by applying Peano-type space-filling curves. A new parallel scheme to construct such curves is presented. For reduced optimization problems a parallel global optimization method is constructed. Sufficient conditions of global convergence are investigated. Conditions, which guarantee considerable speedup with respect to the sequential version of the algorithm, are established. Numerical experiments executed on ALLIANT FX/80 are also presented.

Journal ArticleDOI
TL;DR: A new approach to parallel and distributed simulation of discrete event systems using a single clock mechanism that drives all trajectories simultaneously and which offers the possibility of concurrent performance evaluation and comparison at many system parameter values offers new and significant opportunities for performance optimization.
Abstract: In this paper we propose a new approach to parallel and distributed simulation of discrete event systems. Most parallel and distributed discrete event simulation algorithms are concerned with the simulation of one “large” discrete event system. In this case the computational intensity is due to the size and complexity of the simulated system. In contrast, we are interested in simulating a “large” number of “medium sized” systems. These are variants of a “nominal system” with different system parameter values or operation policies. The computational intensity in our case is due to the “large” number of simulated variants. Many simulation projects such as factor screening, performance modeling, and optimization require system performance evaluations at many parameter values; and others, we believe, could significantly benefit from them. There is considerable work in the literature on stochastic coupling of trajectories of parametric families of stochastic processes. Our approach can be viewed as the simulation of the coupled trajectories. We use a single clock mechanism that drives all trajectories simultaneously, hence the approach is called Single Clock Multiple System (SCMS) simulation. The single clock synchronizes all trajectories such that the “same” event occurs at the “same” time at all systems. This synchronization is the basis of our parallel and distributed algorithms. We focus on a particular implementation of the SCMS simulation using the so-called Standard Clock (SC) technique and also on the massively parallel implementation of the SC algorithms on the SIMD Connection Machine. Orders of magnitude of speedup are possible. Furthermore, the possibility of concurrent performance evaluation and comparison at many system parameter values offers new and significant opportunities for performance optimization.
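The coupling idea behind single-clock simulation can be illustrated with a toy model. This is a hypothetical sketch, not the paper's Standard Clock algorithm: one shared stream of clock ticks and random draws drives every parameter variant, so all variants see the "same" events; the function `scms_run` and the Bernoulli-arrival model are illustrative assumptions.

```python
import random

def scms_run(arrival_probs, ticks, seed=0):
    """Single-clock sketch: one common random draw per tick is shared by
    all variants; variant k records an arrival when the shared draw falls
    below its own arrival probability. Because the draw is shared, the
    trajectories are stochastically coupled: a variant with a higher
    probability can never see fewer arrivals than one with a lower one."""
    rng = random.Random(seed)
    counts = [0] * len(arrival_probs)
    for _ in range(ticks):
        u = rng.random()            # the same draw drives every variant
        for k, p in enumerate(arrival_probs):
            if u < p:
                counts[k] += 1
    return counts
```

The coupling makes comparisons across parameter values low-variance: differences between variants come only from the parameters, never from independent noise.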

Journal ArticleDOI
TL;DR: This study provides guidelines for choosing a particular coupling algorithm for mixed-level circuit and device simulation and describes a modified two-level Newton algorithm and a full block-LU decomposition algorithm used for transient analysis.
Abstract: A general framework for mixed-level circuit and device simulation is described. This framework was used in the development of the simulation program CODECS (coupled device and circuit simulator). Various algorithms to couple the device and circuit simulators for DC and transient analyses have been implemented in CODECS. These algorithms are evaluated based on their convergence properties and run-time performance. This study provides guidelines for choosing a particular coupling algorithm. A modified two-level Newton algorithm is used for DC analysis, whereas a full block-LU decomposition algorithm is used for transient analysis. This combination of algorithms provides reasonable convergence and run-time performance. A simple latency scheme provides a 50% speedup. Coupling for small-signal AC and pole-zero analyses is described.

Journal ArticleDOI
TL;DR: The authors model a job in a parallel processing system as a sequence of stages, each of which requires a certain integral number of processors for a certain interval of time, and find that the average number of jobs in the system with arrivals equals unity when power is maximized.
Abstract: The authors model a job in a parallel processing system as a sequence of stages, each of which requires a certain integral number of processors for a certain interval of time. They derive the speedup of the system for two cases: systems with no arrivals, and systems with arrivals. In the case with no arrivals, their speedup result is a generalization of Amdahl's law (G.M. Amdahl, 1967). They extend the notion of power as previously applied to general queuing and computer-communication systems to their case of parallel processing systems. They find the optimal job input and the optimal number of processors to use so that power is maximized. Many of the results for the case of arrivals are the same as for the case of no arrivals. It is found that the average number of jobs in the system with arrivals equals unity when power is maximized. They also model a job in such a way that the number of processors required continuously varies over time. The same performance indices and parameters studied in the discrete model are evaluated for this continuous model.

Journal ArticleDOI
TL;DR: These algorithms are applicable to particle-in-cell codes based on two-dimensional boundary-fitted coordinates in order to localize particles inside the grid and are suitable for complicated geometries with outer and inner curved boundaries.

Journal ArticleDOI
TL;DR: This paper describes three critical path analysis algorithms based on different event-scheduling (process scheduling) policies that can be integrated with sequential simulation programs written by users or integrated with simulation languages.
Abstract: Discrete event simulation is usually time consuming. Recently, there has been a great deal of interest in using parallel computers to speed up the simulation process. Before the parallel simulation approach is applied, it is important to understand the inherent parallelism of simulation applications. A simple technique called critical path analysis was proposed to study parallelism of simulation applications. This paper describes three critical path analysis algorithms based on different event-scheduling (process scheduling) policies. These algorithms are much simpler than a previous approach, where the events must be recorded in a trace and an extra pass is required to process the event trace. In our approach, the critical path analysis algorithms are integrated with the sequential simulation. At the end of the sequential simulation, the optimal parallel execution time is also computed. Livny proposed an algorithm similar to our approach (his, however, was designed for a specific language). Our algorithms can be integrated with sequential simulation programs written by users or be integrated with simulation languages. Another advantage of our algorithms over previous approaches is that ours can be used to study load balancing under different event-scheduling policies. Since our algorithms can be easily inserted in sequential simulation programs, critical path analysis can be applied to existing sequential programs without difficulty. The results can then be used to predict the performance of parallel simulation on similar applications. An example is given to show how useful information can be obtained from our algorithms.
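The core computation in critical path analysis can be sketched in a few lines. This is a generic illustration under the assumption that the event dependencies are available as a DAG in topological order; the function `critical_path` is not the paper's algorithm, which folds the computation into the sequential simulation itself.

```python
def critical_path(events, preds, cost):
    """events: event ids in topological order; preds[e]: events that must
    be simulated before e; cost[e]: e's processing time. The longest
    dependency chain is a lower bound on parallel simulation time, no
    matter how events are assigned to processors."""
    finish = {}
    for e in events:
        finish[e] = max((finish[q] for q in preds[e]), default=0.0) + cost[e]
    return max(finish.values())
```

Dividing the total sequential cost by the critical-path length gives the maximum speedup any parallel simulator could extract from that event structure.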

Proceedings ArticleDOI
01 Jul 1992
TL;DR: A parallel algorithm for design-space exploration and trade-off analysis is presented and results showed reduction in search time, improvement in design quality, and close-to-linear speedup.
Abstract: A parallel algorithm for design-space exploration and trade-off analysis is presented. Coarse-grained parallelism is introduced by generating multiple module bags and performing scheduling and performance analysis of the data flow graph for each module bag in parallel. This algorithm was implemented on a multiple processor machine as part of a distributed high-level synthesis system. Experimental results showed reduction in search time, improvement in design quality, and close-to-linear speedup.

Proceedings ArticleDOI
07 Jan 1992
TL;DR: The fixed time method does not reward slower processors with higher speedup; it predicts a new limit to speedup, which is more optimistic than Amdahl's; it shows an efficiency which is independent of processor speed and ensemble size; it sometimes gives non-spurious superlinear speedup.
Abstract: In measuring the performance of parallel computers, the usual method is to choose a problem and test the execution time as the processor count is varied. This model underlies definitions of 'speedup,' 'efficiency,' and arguments against parallel processing such as Ware's (1972) formulation of Amdahl's law (1967). Fixed time models use problem size as the figure of merit. Analysis and experiments based on fixed time instead of fixed size have yielded surprising consequences: the fixed time method does not reward slower processors with higher speedup; it predicts a new limit to speedup, which is more optimistic than Amdahl's; it shows an efficiency which is independent of processor speed and ensemble size; it sometimes gives non-spurious superlinear speedup; it provides a practical means (the SLALOM benchmark) of comparing computers of widely varying speeds without distortion.
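The contrast between the two models can be made concrete with their familiar closed forms. A sketch under the assumption that a fraction s of the work is serial: the fixed-size form is Amdahl's law, the fixed-time form is the Gustafson-style scaled speedup; the function names are ours, not the paper's.

```python
def fixed_size_speedup(serial_frac, p):
    """Amdahl's law: the problem size is held constant as processors are
    added, so the serial fraction caps the achievable speedup."""
    return 1.0 / (serial_frac + (1.0 - serial_frac) / p)

def fixed_time_speedup(serial_frac, p):
    """Fixed-time model: the problem grows to fill the same wall-clock
    time, so the parallel portion scales with p and the limit grows
    linearly rather than saturating."""
    return serial_frac + (1.0 - serial_frac) * p
```

With a 5% serial fraction on 1024 processors, the fixed-size model caps speedup below 20, while the fixed-time model gives roughly 973, illustrating why the fixed-time limit is "more optimistic than Amdahl's."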

01 Jan 1992
TL;DR: The main conclusions reached are (1) parallel logic simulation of large circuits on general purpose workstations is practical, (2) the EFDP, FGA, and PGA algorithms work very well, and (3) synchronous time algorithms appear to be much faster and more space conscious than asynchronous time algorithms.
Abstract: This dissertation records research on the problem of performing logic simulation of a circuit on a parallel architecture. Research begins with the development of a new circuit partitioning algorithm called the Event Flow and Distribution Partitioning (EFDP) algorithm. This algorithm is discussed in detail and is shown to produce a good partitioning of the circuit. Partitioning is followed by simulation. Test circuits are chosen from VLSI chips designed at the NASA Space Engineering Research Center for VLSI Design at the University of Idaho. These circuits range in size from 292 transistors to 249,897 transistors. Five test vectors are used to perform the simulation of these circuits. Simulation is first run on a uni-processor (a DEC 3000) using NOVA, a logic simulator developed at the NASA SERC. Simulation times varying from a few minutes to a few hours are reported on the suites of five test vectors. These same tests are then run on the same simulator modified to simulate execution on a shared memory multiprocessor using two new synchronous time parallel simulation algorithms and an asynchronous time parallel simulation algorithm. The two new synchronous time parallel simulation algorithms are types of greedy algorithms and are called the Focused Greedy Algorithm (FGA) and the Planned Greedy Algorithm (PGA). It is found that both algorithms performed nearly the same with the FGA showing a slightly higher speedup but also requiring slightly more storage than the PGA. An average speedup of up to 6.6 on an 8 processor architecture and 13.9 on a 16 processor architecture is reported. An asynchronous time parallel simulation algorithm based on the Time Warp algorithm is tested using the same seven circuits and the same suite of five test vectors. An average speedup of up to 4.2 on an 8 processor architecture and 7.7 on a 16 processor architecture is reported. 
The main conclusions reached are (1) parallel logic simulation of large circuits on general purpose workstations is practical, (2) the EFDP, FGA, and PGA algorithms work very well, and (3) synchronous time algorithms appear to be much faster and more space conscious than asynchronous time algorithms.

Journal ArticleDOI
TL;DR: A parallel implementation of numerical inversion of the Laplace transform (NILT) and parallel interconnect optimization, resulting in substantial CPU speedup over existing NILT simulation and optimization, is described.
Abstract: A CAD framework addressing three specific aspects of the high-speed interconnect problem, namely, simulation, sensitivity analysis, and performance optimization, is described. Distributed interconnect models represented by uniform or nonuniform lossy coupled transmission lines are supported. The CAD framework incorporates parallel processing capabilities. It also provides a design environment for integrating accurate simulations or waveform estimation, sensitivity analysis, design specifications, and numerical optimization. Approaches enhancing the accuracy and ensuring the stability of moment-matching techniques used in the asymptotic waveform evaluation (AWE) are introduced. Also described are a parallel implementation of numerical inversion of the Laplace transform (NILT) and parallel interconnect optimization, resulting in substantial CPU speedup over existing NILT simulation and optimization.

Proceedings ArticleDOI
John P. Fishburn1
01 Jul 1992
TL;DR: The author describes heuristic methods for performance optimization of mapped combinational logic, implemented in the system LATTIS (logic area-time tradeoff for integrated systems), which has six transform types: gate repowering, buffer insertion, downpowering of noncritical fanouts of the critical path, gate duplication, DeMorgan's laws, and timing-directed factorization and remapping of subcircuits.
Abstract: The author describes heuristic methods for performance optimization of mapped combinational logic, implemented in the system LATTIS (logic area-time tradeoff for integrated systems). LATTIS currently has six transform types: gate repowering, buffer insertion, downpowering of noncritical fanouts of the critical path, gate duplication, DeMorgan's laws, and timing-directed factorization and remapping of subcircuits. From among the transforms applicable on the critical path, LATTIS chooses the one with maximum benefit/cost. Cost is the increase in area, and benefit is the improvement in local slack, weighted by the number of primary inputs/outputs affected. The delay-area curves produced by LATTIS for the 70 largest circuits of the 1991 MCNC multilevel combinational logic benchmark set are given.

Journal ArticleDOI
TL;DR: In this article, the authors study the problem of choosing the partition sizes and a minimum completion time schedule for the execution of a set of tasks, each of which can be executed on partitions of varying sizes.
Abstract: A partitionable multiprocessor system can form multiple partitions, each consisting of a controller and a varying number of processors. Given such a system and a set of tasks, each of which can be executed on partitions of varying sizes, the authors study the problem of choosing the partition sizes and a minimum completion time schedule for the execution of these tasks. They assume that the number of tasks to be scheduled on the system is no more than the maximum number of partitions that can be formed simultaneously by the system, and that parallelization of the tasks can achieve at most perfect speedup. They show this scheduling problem to be NP-hard, and present a polynomial time approximation algorithm for this problem. The authors derive a parameter dependent, asymptotically tight worst-case performance bound for the algorithm, and evaluate its average performance through simulation.

Journal ArticleDOI
TL;DR: Reduce and Partition (R and P), a novel sequential algorithm which combines the best features of these two algorithms, is presented, and it is shown that R and P runs almost twice as fast as the previously known fastest algorithm.
Abstract: The parallelization of the two best-known sequential algorithms, that of W.P. Dotson and J.O. Gobein (1979) and that of L.B. Page and J.E. Perry (PP-F2TDN) (1989), for computing the terminal-pair reliability in a network is discussed. Reduce and Partition (R and P), a novel sequential algorithm which combines the best features of these two algorithms, is presented. It is shown that R and P runs almost twice as fast as the previously known fastest algorithm. A parallel version of R and P is also presented. The execution times of all three parallel algorithms with various numbers of processors for different networks on the BBN Butterfly parallel computer are provided. The parallel algorithms were implemented on a shared-memory parallel computer. In R and P, the greedy approach was used in selecting shortest paths in order to locally minimize the number of subproblems. This selection did not consider the effect of reductions on the subproblems to be generated.

Book ChapterDOI
01 Jan 1992
TL;DR: A register allocation algorithm and a cache usage optimization algorithm based on the reference window concept which can be effectively implemented in a compiler system are described.
Abstract: In this paper, we consider the problem of optimizing register allocation and cache behavior for loop array references. We exploit techniques developed initially for data locality estimation and improvement. First we review the concept of “reference window” that serves as our basic tool for both data locality evaluation and management. Then we study how some loop restructuring techniques (interchanging, tiling, ...) can help to improve data locality. We describe a register allocation algorithm and a cache usage optimization algorithm based on the window concept which can be effectively implemented in a compiler system. Experimental speedup measurements on a RISC processor, the IBM RS/6000, give evidence of the efficiency of our technique.
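Loop tiling, one of the restructuring techniques the abstract mentions, can be sketched in a language-neutral way. This is an illustrative toy, not the paper's reference-window algorithm: the function name and the column-order access pattern are our assumptions, chosen because striding down columns of a row-major array is the classic case tiling helps.

```python
def tiled_column_sum(a, n, tile):
    """Traverse an n-by-n row-major matrix (flat list `a`) tile by tile.
    The access a[j*n + i] walks down columns, which without tiling would
    stride across the whole array on every step; tiling keeps each
    tile-sized working set cache-resident. The result is simply the sum
    of all elements, so correctness is easy to check."""
    total = 0.0
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for i in range(ii, min(ii + tile, n)):
                for j in range(jj, min(jj + tile, n)):
                    total += a[j * n + i]   # column-order access, localized per tile
    return total
```

In a compiled language the tiled version touches the same elements in an order with far better temporal locality; in Python the example only demonstrates the iteration-space restructuring itself.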

Journal ArticleDOI
TL;DR: This article describes a parallelization of the simulation program CHARMM for the Intel iPSC/860, a distributed memory multiprocessor, and examines the effectiveness of the parallelization in the context of a case study of a realistic molecular system.
Abstract: Dynamics simulations of molecular systems are notoriously computationally intensive. Using parallel computers for these simulations is important for reducing their turnaround time. In this article we describe a parallelization of the simulation program CHARMM for the Intel iPSC/860, a distributed memory multiprocessor. In the parallelization, the computational work is partitioned among the processors for core calculations including the calculation of forces, the integration of equations of motion, the correction of atomic coordinates by constraint, and the generation and update of data structures used to compute nonbonded interactions. Processors coordinate their activity using synchronous communication to exchange data values. Key data structures used are partitioned among the processors in nearly equal pieces, reducing the memory requirement per node and making it possible to simulate larger molecular systems. We examine the effectiveness of the parallelization in the context of a case study of a realistic molecular system. While effective speedup was achieved for many of the dynamics calculations, other calculations fared less well due to growing communication costs for exchanging data among processors. The strategies we used are applicable to parallelization of similar molecular mechanics and dynamics programs for distributed memory multiprocessors. © 1992 by John Wiley & Sons, Inc.

Journal ArticleDOI
TL;DR: The properties of Whitney elements provide the basis of a novel integral formulation for solving three-dimensional magnetostatics and eddy-current problems using a tree structure, which realizes much of the speedup promised by parallel computing.
Abstract: The properties of Whitney elements provide the basis of a novel integral formulation for solving three-dimensional magnetostatics and eddy-current problems. Using a tree structure reduces the number of unknowns to a few less than the number of nodes for the static case, a significant saving over earlier integral formulations. Interface conditions are satisfied exactly, but the material constitutive relations are satisfied only approximately. The resulting magnetostatics code, GFUNET, is attractive in terms of convenience, CPU time utilization, and accuracy. It realizes much of the speedup promised by parallel computing. Results for an accelerator sextupole magnet and for TEAM Workshop problem #13 are presented.

Proceedings ArticleDOI
12 Apr 1992
TL;DR: In this article, a partially shared Rete network is proposed for parallel implementation and a hierarchical two-level parallel architecture based on this network is outlined, which achieves significant speedup by reducing the dynamic scheduling overheads of fine-grained jobs in a multiprocessor implementation of the rete network, taking advantage of the sharing of common computations in the network.
Abstract: The authors investigate methods to speed up the match phase of the execution of production systems. The Rete match algorithm is taken as the basis of the implementation. A partially shared Rete network is proposed for parallel implementation and a hierarchical two-level parallel architecture based on this network is outlined. The proposed architecture achieves significant speedup by reducing the dynamic scheduling overheads of fine-grained jobs in a multiprocessor implementation of the Rete network, while still taking advantage of the sharing of common computations in the network.

Book ChapterDOI
15 Jul 1992
TL;DR: High efficiency of random competition (compared with other parallel theorem provers) is proved on highly parallel architectures with thousands of processors; no communication between the processors is necessary during run-time.
Abstract: With random competition we propose a method for parallelizing arbitrary theorem provers. We can prove high efficiency (compared with other parallel theorem provers) of random competition on highly parallel architectures with thousands of processors. This method is suited for all kinds of distributed memory architectures, particularly for large networks of high performance workstations since no communication between the processors is necessary during run-time. On a set of examples we show the performance of random competition applied to the model elimination theorem prover SETHEO.

Proceedings ArticleDOI
26 Apr 1992
TL;DR: Experiments are presented indicating that on shared-memory machines, programs written in the nonshared-memory programming model generally offer better performance, in addition to being more portable and scalable.
Abstract: Experiments are presented indicating that on shared-memory machines, programs written in the nonshared-memory programming model generally offer better performance, in addition to being more portable and scalable. The authors study the LU decomposition problem and a molecular dynamics simulation on three shared-memory machines with widely differing architectures, and analyze the results from three perspectives: performance, speedup, and scaling.