Showing papers on "Speedup published in 1988"


Journal ArticleDOI
TL;DR: A massively parallelizable algorithm for the classical assignment problem was proposed in this article, where unassigned persons bid simultaneously for objects thereby raising their prices. Once all bids are in, objects are awarded to the highest bidder.
Abstract: We propose a massively parallelizable algorithm for the classical assignment problem. The algorithm operates like an auction whereby unassigned persons bid simultaneously for objects, thereby raising their prices. Once all bids are in, objects are awarded to the highest bidder. The algorithm can also be interpreted as a Jacobi-like relaxation method for solving a dual problem. Its (sequential) worst-case complexity, for a particular implementation that uses scaling, is O(NA log(NC)), where N is the number of persons, A is the number of pairs of persons and objects that can be assigned to each other, and C is the maximum absolute object value. Computational results show that, for large problems, the algorithm is competitive with existing methods even without the benefit of parallelism. When executed on a parallel machine, the algorithm exhibits substantial speedup.
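The bidding mechanism described in the abstract is easy to prototype. The sketch below is a minimal sequential auction iteration for a small dense assignment problem; the fixed epsilon and the dense benefit matrix are illustrative assumptions, not the paper's scaled, parallel implementation.

```python
# Minimal sketch of an auction iteration for the assignment problem
# (dense benefits, fixed epsilon; the paper's scaled, parallel version is richer).
def auction(benefit, eps=0.01):
    n = len(benefit)                     # persons == objects
    price = [0.0] * n
    owner = [None] * n                   # object -> person
    assigned = [None] * n                # person -> object
    unassigned = list(range(n))
    while unassigned:
        person = unassigned.pop()
        # Best and second-best object values for this person at current prices.
        values = [benefit[person][j] - price[j] for j in range(n)]
        best = max(range(n), key=lambda j: values[j])
        best_val = values[best]
        second_val = max(v for j, v in enumerate(values) if j != best)
        # The bid raises the price by the value margin plus epsilon.
        price[best] += best_val - second_val + eps
        # Award the object to the bidder, displacing any previous owner.
        if owner[best] is not None:
            displaced = owner[best]
            assigned[displaced] = None
            unassigned.append(displaced)
        owner[best] = person
        assigned[person] = best
    return assigned, price

if __name__ == "__main__":
    benefit = [[7, 2, 1], [2, 8, 4], [3, 6, 9]]
    print(auction(benefit))
```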

649 citations


Proceedings ArticleDOI
13 Jan 1988
TL;DR: A class of partitionings is presented that encompasses previous techniques and provides enough flexibility to adapt code to multiprocessors with two levels of parallelism and two levels of memory.
Abstract: Supercompilers must reschedule computations defined by nested DO-loops in order to make efficient use of supercomputer features (vector units, multiple elementary processors, cache memory, etc.). Many rescheduling techniques, such as loop interchange, loop strip-mining, and rectangular partitioning, have been described to speed up program execution. We present here a class of partitionings that encompasses previous techniques and provides enough flexibility to adapt code to multiprocessors with two levels of parallelism and two levels of memory.
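As a concrete illustration of the kind of rescheduling this work generalizes, the sketch below strip-mines a doubly nested loop into block (tile) form; the block sizes and function name are placeholders chosen for a hypothetical cache and processor count, not values from the paper.

```python
# Original loop nest:
#   for i in range(N): for j in range(M): C[i][j] = A[i][j] + B[i][j]
# Strip-mined / tiled version: the outer block loops could be distributed
# across processors, while the inner element loops preserve cache locality.
def tiled_add(A, B, C, N, M, BI=64, BJ=64):
    for ii in range(0, N, BI):                       # block loop over i (parallelizable)
        for jj in range(0, M, BJ):                   # block loop over j
            for i in range(ii, min(ii + BI, N)):     # element loops inside one tile
                for j in range(jj, min(jj + BJ, M)):
                    C[i][j] = A[i][j] + B[i][j]
```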

594 citations


Journal ArticleDOI
TL;DR: H hierarchical network structures are developed that have the property that the optimal global estimate based on all the available information can be reconstructed from estimates computed by local processor nodes solely on the basis of their own local information and transmitted to a central processor.
Abstract: Various multisensor network scenarios with signal processing tasks that are amenable to multiprocessor implementation are described. The natural origins of such multitasking are emphasized, and novel parallel structures for state estimation using the Kalman filter are proposed that extend existing results in several directions. In particular, hierarchical network structures are developed that have the property that the optimal global estimate based on all the available information can be reconstructed from estimates computed by local processor nodes solely on the basis of their own local information and transmitted to a central processor. The algorithms potentially yield an approximately linear speedup rate, are reasonably failure-resistant, and are optimized with respect to communication bandwidth and memory requirements at the various processors.
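For intuition only: the sketch below fuses independent local estimates of a static parameter by information (inverse-covariance) weighting. This is a drastically simplified stand-in for the paper's hierarchical Kalman filter structures, and the variable names and test values are invented.

```python
import numpy as np

# Fuse independent local estimates x_i with covariances P_i into a global
# estimate by information (inverse-covariance) weighting. Static,
# independent-estimate simplification; not the paper's hierarchical setup.
def fuse(estimates, covariances):
    info = sum(np.linalg.inv(P) for P in covariances)          # global information matrix
    info_state = sum(np.linalg.inv(P) @ x
                     for x, P in zip(estimates, covariances))  # information-weighted states
    P_global = np.linalg.inv(info)
    x_global = P_global @ info_state
    return x_global, P_global

if __name__ == "__main__":
    x1, P1 = np.array([1.0, 0.0]), np.diag([1.0, 4.0])
    x2, P2 = np.array([1.2, 0.3]), np.diag([2.0, 1.0])
    print(fuse([x1, x2], [P1, P2]))
```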

482 citations


Journal ArticleDOI
TL;DR: The scaled-problem paradigm better reveals the capabilities of large ensembles, and permits detection of subtle hardware-induced load imbalances that may become increasingly important as parallel processors increase in node count.
Abstract: We have developed highly efficient parallel solutions for three practical, full-scale scientific problems: wave mechanics, fluid dynamics, and structural analysis. Several algorithmic techniques are used to keep communication and serial overhead small as both problem size and number of processors are varied. A new parameter, operation efficiency, is introduced that quantifies the tradeoff between communication and redundant computation. A 1024-processor MIMD ensemble is measured to be 502 to 637 times as fast as a single processor when problem size for the ensemble is fixed, and 1009 to 1020 times as fast as a single processor when problem size per processor is fixed. The latter measure, denoted scaled speedup, is developed and contrasted with the traditional measure of parallel speedup. The scaled-problem paradigm better reveals the capabilities of large ensembles, and permits detection of subtle hardware-induced load imbalances (such as error correction and data-dependent MFLOPS rates) that may become increasingly important as parallel processors increase in node count. Sustained performance for the applications is 70 to 130 MFLOPS, validating the massively parallel ensemble approach as a practical alternative to more conventional processing methods. The techniques presented appear extensible to even higher levels of parallelism than the 1024-processor level explored here.
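The contrast between fixed-size and scaled speedup can be made concrete with the usual serial-fraction model. The sketch below compares fixed-size (Amdahl-style) speedup with scaled speedup for a given serial fraction; the serial fraction used here is an illustrative assumption, not a figure from the paper.

```python
# Fixed-size vs. scaled speedup for a serial fraction s of the work.
def fixed_size_speedup(s, p):
    # Problem size fixed: the parallel part shrinks relative to the serial part.
    return 1.0 / (s + (1.0 - s) / p)

def scaled_speedup(s, p):
    # Problem size grows with p so the parallel work per processor stays fixed.
    return s + (1.0 - s) * p

if __name__ == "__main__":
    s = 0.001                               # illustrative serial fraction
    for p in (64, 256, 1024):
        print(p, round(fixed_size_speedup(s, p), 1), round(scaled_speedup(s, p), 1))
```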

433 citations


Journal ArticleDOI
01 Apr 1988
TL;DR: A new genetic algorithm which relies on intelligent evolution of individuals is presented, which is inherently parallel and shows a superlinear speedup in multiprocessor systems.
Abstract: Evolution algorithms for combinatorial optimization were proposed in the 1970s, but they did not have a major influence at the time. With the availability of parallel computers, these algorithms will become more important. In this paper we discuss the dynamics of three different classes of evolution algorithms: network algorithms derived from the replicator equation, Darwinian algorithms, and genetic algorithms inheriting genetic information. We present a new genetic algorithm which relies on intelligent evolution of individuals. With this algorithm, we have computed the best solution of a famous travelling salesman problem. The algorithm is inherently parallel and shows a superlinear speedup in multiprocessor systems.

391 citations


Journal ArticleDOI
TL;DR: A simple and efficient non-numerical algorithm for the automatic decomposition of an arbitrary finite element domain into a specified number of balanced subdomains is presented and it is shown that both the algorithm and its implementation are suitable for shared memory as well as local memory multiprocessors.

320 citations


Journal ArticleDOI
01 Apr 1988
TL;DR: A single-program-multiple-data computational model, implemented in the EPEX system to run FORTRAN scientific application programs in parallel mode, is presented, and the applicability of the model is demonstrated in the parallelization of several applications.
Abstract: We present a single-program-multiple-data computational model which we have implemented in the EPEX system to run FORTRAN scientific application programs in parallel mode. The computational model assumes a shared memory organization and is based on the scheme that all processes executing a program in parallel remain in existence for the entire execution; however, the tasks to be executed by each process are determined dynamically during execution by the use of appropriate synchronizing constructs that are embedded in the program. We have demonstrated the applicability of the model in the parallelization of several applications. We discuss parallelization features of these applications and performance issues such as overhead, speedup, and efficiency.
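A small sketch of the kind of synchronizing construct such a model relies on: a shared counter from which long-lived worker processes dynamically claim loop iterations. This is a Python threading analogue with invented names, not EPEX's FORTRAN constructs.

```python
import threading

# SPMD-style dynamic self-scheduling: all workers exist for the whole run and
# repeatedly claim the next loop iteration from a shared counter.
class SharedCounter:
    def __init__(self, limit):
        self.next = 0
        self.limit = limit
        self.lock = threading.Lock()

    def claim(self):
        with self.lock:
            if self.next >= self.limit:
                return None
            i = self.next
            self.next += 1
            return i

def worker(counter, data, results):
    while True:
        i = counter.claim()
        if i is None:                 # no iterations left
            break
        results[i] = data[i] ** 2     # the loop body

if __name__ == "__main__":
    data = list(range(100))
    results = [0] * len(data)
    counter = SharedCounter(len(data))
    threads = [threading.Thread(target=worker, args=(counter, data, results))
               for _ in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()        # join plays the role of the end-of-loop barrier
    print(results[:5], results[-1])
```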

193 citations


Journal ArticleDOI
TL;DR: These algorithms are based on iterative improvement of a dual cost and operate in a manner that is reminiscent of coordinate ascent and Gauss-Seidel relaxation methods, and are found to be several times faster on standard benchmark problems, and faster by an order of magnitude on large, randomly generated problems.
Abstract: We propose a new class of algorithms for linear cost network flow problems with and without gains. These algorithms are based on iterative improvement of a dual cost and operate in a manner that is reminiscent of coordinate ascent and Gauss-Seidel relaxation methods. We compare our coded implementations of these methods with mature state-of-the-art primal simplex and primal-dual codes, and find them to be several times faster on standard benchmark problems, and faster by an order of magnitude on large, randomly generated problems. Our experiments indicate that the speedup factor increases with problem dimension.

191 citations


Proceedings ArticleDOI
07 Nov 1988
TL;DR: An algorithm for speeding up combinational logic with minimal area increase is presented, using a static timing analyzer and a weighted min-cut algorithm to determine the subset of nodes to be resynthesized.
Abstract: An algorithm for speeding up combinational logic with minimal area increase is presented. A static timing analyzer is used to identify the critical paths. Then a weighted min-cut algorithm is used to determine the subset of nodes to be resynthesized. This subset is selected so that the speedup is achieved with minimal area increase. Resynthesis is done by selectively collapsing the logic along the critical paths and then decomposing the collapsed nodes to minimize the critical delay. This process is iterated until either the timing requirements are satisfied or no further improvement can be made. The algorithm has been implemented and tested on many design examples with promising results.
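The first step the abstract mentions, static timing analysis, reduces to a longest-path computation over the gate-level DAG. The sketch below computes arrival times in topological order and traces back one critical path; the gate delays and netlist format are invented placeholders, not the paper's data model.

```python
from collections import defaultdict, deque

# Static timing sketch: arrival time of a node is its own delay plus the latest
# arrival among its fanins; the critical path is traced back from the node
# with the largest arrival time.
def arrival_times(delays, fanins):
    fanout = defaultdict(list)
    indeg = {n: len(fanins.get(n, [])) for n in delays}
    for n, ins in fanins.items():
        for i in ins:
            fanout[i].append(n)
    arrival, pred = {}, {}
    ready = deque(n for n, d in indeg.items() if d == 0)
    while ready:
        n = ready.popleft()
        best = max(fanins.get(n, []), key=lambda i: arrival[i], default=None)
        arrival[n] = delays[n] + (arrival[best] if best else 0)
        pred[n] = best
        for m in fanout[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                ready.append(m)
    return arrival, pred

def critical_path(delays, fanins):
    arrival, pred = arrival_times(delays, fanins)
    node = max(arrival, key=arrival.get)
    path = []
    while node is not None:
        path.append(node)
        node = pred[node]
    return list(reversed(path)), max(arrival.values())

if __name__ == "__main__":
    delays = {"a": 1, "b": 1, "g1": 3, "g2": 2, "g3": 4, "out": 1}
    fanins = {"g1": ["a", "b"], "g2": ["a"], "g3": ["g1", "g2"], "out": ["g3"]}
    print(critical_path(delays, fanins))
```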

167 citations


Journal ArticleDOI
17 May 1988
TL;DR: A project is under way to evaluate MASA-like architectures for executing programs written in Multilisp, which features a tagged architecture, multiple contexts, fast trap handling, and a synchronization bit in every memory word.
Abstract: MASA is a “first cut” at a processor architecture intended as a building block for a multiprocessor that can execute parallel Lisp programs efficiently. MASA features a tagged architecture, multiple contexts, fast trap handling, and a synchronization bit in every memory word. MASA's principal novelty is its use of multiple contexts both to support multithreaded execution—interleaved execution from separate instruction streams—and to speed up procedure calls and trap handling in the same manner as register windows. A project is under way to evaluate MASA-like architectures for executing programs written in Multilisp.

159 citations


Journal ArticleDOI
TL;DR: Preliminary experiments are presented that show how approximate processing helps a vehicle-monitoring problem solver meet deadlines, and a framework is outlined for flexibly meeting real-time constraints in AI systems.
Abstract: We propose an approach for meeting real-time constraints in AI systems that views (1) time as a resource that should be considered when making control decisions, (2) plans as ways of expressing control decisions, and (3) approximate processing as a way of satisfying time constraints that cannot be achieved through normal processing. In this approach, a real-time problem solver estimates the time required to generate solutions and their quality. This estimate permits the system to anticipate whether the current objectives will be met in time. The system can then take corrective action by forming lower quality solutions within time constraints. This may involve modifying existing plans or forming radically different plans that utilize only rough data characteristics and approximate knowledge to achieve a desired speedup. A decision about how to change processing should be situation-dependent, based on the current state of processing and on domain-dependent solution criteria. We present preliminary experiments that show how approximate processing helps a vehicle-monitoring problem solver meet deadlines, and outline a framework for flexibly meeting real-time constraints.

Journal ArticleDOI
K. So1, R.N. Rechtschaffen1
TL;DR: In this paper, the performance of set associative caches is analyzed by grouping the cache lines into regions according to their positions in the replacement stacks of a cache, and observing how the memory access of a CPU is distributed over these regions.
Abstract: The performance of set associative caches is analyzed. The method used is to group the cache lines into regions according to their positions in the replacement stacks of a cache, and then to observe how the memory access of a CPU is distributed over these regions. Results from the preserved CPU traces show that the memory accesses are heavily concentrated on the most recently used (MRU) region in the cache. The concept of MRU change is introduced; the idea is to use the event that the CPU accesses a non-MRU line to approximate the time the CPU is changing its working set. The concept is shown to be useful in many aspects of cache design and performance evaluation, such as comparison of various replacement algorithms, improvement of prefetch algorithms, and speedup of cache simulation.
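The MRU-region analysis can be reproduced on a small scale with an LRU stack simulation: maintain a per-set recency stack and record the stack position hit by each reference. The sketch below does this for a toy address trace; the trace, set count, associativity, and line size are invented for illustration.

```python
from collections import defaultdict

# Classify each reference by the LRU stack position it hits within its set.
# Position 0 is the MRU line; a hit outside position 0 approximates an
# "MRU change" event in the sense of the abstract.
def mru_profile(trace, num_sets=4, assoc=4, line_size=16):
    stacks = defaultdict(list)        # set index -> list of tags, MRU first
    counts = defaultdict(int)         # stack position (or 'miss') -> count
    for addr in trace:
        line = addr // line_size
        s, tag = line % num_sets, line // num_sets
        stack = stacks[s]
        if tag in stack:
            pos = stack.index(tag)
            counts[pos] += 1
            stack.pop(pos)
        else:
            counts['miss'] += 1
            if len(stack) == assoc:
                stack.pop()           # evict the LRU line
        stack.insert(0, tag)          # the referenced line becomes MRU
    return dict(counts)

if __name__ == "__main__":
    trace = [0, 4, 8, 12, 0, 4, 256, 0, 4, 8, 512, 0]
    print(mru_profile(trace))
```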

Journal ArticleDOI
TL;DR: The current state of the SpeedUp package is described, both with respect to the input language and the techniques employed for the solution of the large systems of algebraic and differential/algebraic equations that typically describe chemical processes under steady-state and transient operation.

Journal ArticleDOI
TL;DR: Experimental results on the BBN Butterfly parallel processor demonstrate that the use of concurrent-heap algorithms in parallel branch-and-bound improves its performance substantially.
Abstract: Contention for the shared heap limits the obtainable speedup in parallel algorithms using this data structure as a priority queue. An approach that allows concurrent insertions and deletions on the heap in a shared-memory multiprocessor is presented. The scheme retains the strict priority ordering of the serial-access heap algorithms, i.e., a delete operation returns the best key of all keys that have been inserted or are being inserted at the time the delete is started. Experimental results on the BBN Butterfly parallel processor demonstrate that the use of concurrent-heap algorithms in parallel branch-and-bound improves its performance substantially.
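To see why heap contention matters, the sketch below implements the serial-access baseline the paper improves on: a binary heap protected by a single coarse lock, which every branch-and-bound worker would share. The class and method names are invented; the paper's contribution is a finer-grained concurrent scheme that avoids serializing whole operations.

```python
import heapq
import threading

# Coarse-grained locked heap: every insert/delete holds one lock, so all
# priority-queue operations serialize. The paper's concurrent-heap algorithms
# relax exactly this bottleneck while keeping strict priority order.
class LockedHeap:
    def __init__(self):
        self._heap = []
        self._lock = threading.Lock()

    def insert(self, key, item):
        with self._lock:
            heapq.heappush(self._heap, (key, item))

    def delete_min(self):
        with self._lock:
            return heapq.heappop(self._heap) if self._heap else None

if __name__ == "__main__":
    # In parallel branch-and-bound, many worker threads would share this object.
    heap = LockedHeap()
    for k in (5, 1, 3):
        heap.insert(k, "node%d" % k)
    print(heap.delete_min())          # (1, 'node1'): strict best-first order
```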

Journal ArticleDOI
TL;DR: These approaches achieve significant speedup over uniprocessor simulated annealing, giving high-quality VLSI placement of standard cells in a short period of time.
Abstract: An algorithm called heuristic spanning creates parallelism by simultaneously investigating different areas of the plausible combinatorial search space. It is used to replace the high-temperature portion of simulated annealing. The low-temperature portion of simulated annealing is sped up by a technique called section annealing, in which placement is geographically divided and the pieces are assigned to separate processors. Each processor generates simulated-annealing-style moves for the cells in its area and communicates the moves to other processors as necessary. Heuristic spanning and section annealing are shown experimentally to converge to the same final cost function as regular simulated annealing. These approaches achieve significant speedup over uniprocessor simulated annealing, giving high-quality VLSI placement of standard cells in a short period of time.
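For context, the core simulated-annealing move loop that both techniques build on is sketched below on a toy placement-style cost; heuristic spanning would launch several such searches from different coarse configurations, and section annealing would restrict each copy's moves to its own geographic region. The cost function, schedule, and names are illustrative assumptions.

```python
import math
import random

# Plain simulated annealing on a toy 1-D "placement": minimize a displacement
# cost of a permutation by swapping two cells per move.
def anneal(weights, temp=10.0, cooling=0.95, moves_per_temp=200, seed=0):
    rng = random.Random(seed)
    order = list(range(len(weights)))
    rng.shuffle(order)
    cost = lambda o: sum(abs(o[i] - i) * weights[i] for i in range(len(o)))
    current = cost(order)
    while temp > 0.01:
        for _ in range(moves_per_temp):
            i, j = rng.randrange(len(order)), rng.randrange(len(order))
            order[i], order[j] = order[j], order[i]          # propose a swap
            new = cost(order)
            delta = new - current
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                current = new                                # accept the move
            else:
                order[i], order[j] = order[j], order[i]      # reject: undo the swap
        temp *= cooling
    return order, current

if __name__ == "__main__":
    print(anneal([1, 3, 2, 5, 4, 1, 2, 3])[1])   # should approach 0 (identity order)
```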

Journal ArticleDOI
TL;DR: The design of specialized processing array architectures, capable of executing any given arbitrary algorithm, is proposed; an approach is adopted in which the algorithm is first represented in the form of a dataflow graph and then mapped onto the specialized processor array.
Abstract: The design of specialized processing array architectures, capable of executing any given arbitrary algorithm, is proposed. An approach is adopted in which the algorithm is first represented in the form of a dataflow graph and then mapped onto the specialized processor array. The processors in this array execute the operations included in the corresponding nodes (or subsets of nodes) of the dataflow graph, while regular interconnections of these elements serve as edges of the graph. To speed up the execution, the proposed array allows the generation of computation fronts and their cancellation at a later time, depending on the arriving data operands; thus it is called a data-driven array. The structure of the basic cell and its programming are examined. Some design details are presented for two selected blocks, the instruction memory and the flag array. A scheme for mapping a dataflow graph (program) onto a hexagonally connected array is described and analyzed. Two distinct performance measures (mapping efficiency and array utilization) and some performance results are discussed.

Proceedings ArticleDOI
01 Jun 1988
TL;DR: A fast and easily parallelizable global routing algorithm for standard cells and its parallel implementation are presented; the router is based on enumerating a subset of all two-bend routes between two points, and results in 16% to 37% fewer total tracks than the TimberWolf global router.
Abstract: A fast and easily parallelizable global routing algorithm for standard cells and its parallel implementation are presented. LocusRoute is meant to be used as the cost function for a placement algorithm, and this context constrains the structure of the global routing algorithm and its parallel implementation. The router is based on enumerating a subset of all two-bend routes between two points, and results in 16% to 37% fewer total tracks than the TimberWolf global router for standard cells [Sech85]. It is comparable in quality to a maze router and an industrial router, but is a factor of 10 or more faster. Three approaches to parallelizing the router are implemented: wire-by-wire parallelism, segment-by-segment, and route-by-route. Two of these approaches achieve significant speedup: route-by-route achieves up to 4.6 using eight processors, and wire-by-wire achieves from 5.8 to 7.6 on eight processors.
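The two-bend enumeration at the heart of the router is easy to sketch: for a pin pair, generate every route with at most two bends and pick the one crossing the least congested cells. The grid representation, congestion map, and function names below are invented; the real LocusRoute cost function and its parallel decomposition are richer.

```python
# Enumerate routes with at most two bends between two grid points and return
# the one whose cells have the lowest total congestion (placeholder cost).
def cells_on_path(p, q):
    """Cells on a straight horizontal or vertical segment, endpoints included."""
    (r1, c1), (r2, c2) = p, q
    if r1 == r2:
        step = 1 if c2 >= c1 else -1
        return [(r1, c) for c in range(c1, c2 + step, step)]
    step = 1 if r2 >= r1 else -1
    return [(r, c1) for r in range(r1, r2 + step, step)]

def two_bend_routes(src, dst):
    (r1, c1), (r2, c2) = src, dst
    routes = []
    # Vertical-horizontal-vertical routes through an intermediate row r.
    for r in range(min(r1, r2), max(r1, r2) + 1):
        route = cells_on_path((r1, c1), (r, c1)) + \
                cells_on_path((r, c1), (r, c2)) + \
                cells_on_path((r, c2), (r2, c2))
        routes.append(list(dict.fromkeys(route)))     # drop duplicate corner cells
    # Horizontal-vertical-horizontal routes through an intermediate column c.
    for c in range(min(c1, c2), max(c1, c2) + 1):
        route = cells_on_path((r1, c1), (r1, c)) + \
                cells_on_path((r1, c), (r2, c)) + \
                cells_on_path((r2, c), (r2, c2))
        routes.append(list(dict.fromkeys(route)))
    return routes

def best_route(src, dst, congestion):
    return min(two_bend_routes(src, dst),
               key=lambda route: sum(congestion.get(cell, 0) for cell in route))

if __name__ == "__main__":
    congestion = {(1, 3): 5, (2, 3): 5}               # two crowded cells to avoid
    print(best_route((0, 0), (3, 5), congestion))
```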

Journal ArticleDOI
J. G. Malone1
TL;DR: A concurrent finite element formulation for linear and nonlinear transient analysis using an explicit time integration scheme, with a new decomposition algorithm that automatically divides an arbitrary finite element mesh into regions and assigns each region to a processor on the hypercube.
Abstract: This paper discusses a concurrent finite element formulation for linear and nonlinear transient analysis using an explicit time integration scheme. The formulation has been developed for execution on hypercube multiprocessor computers. The formulation includes a new decomposition algorithm which automatically divides an arbitrary finite element mesh into regions and assigns each region to a processor on the hypercube. The algorithm selects the assignment of regions so as to minimize interprocessor communication and to balance the computational load across the processors. The decomposition algorithm is deterministic in nature and relies on a scheme which reduces the bandwidth of the matrix representation of the connectivities in the mesh. The algorithms have been implemented on a 32-processor Intel hypercube (the iPSC/d5 machine): speedup factors of greater than 31 have been obtained. Performance limitations of the hypercube architecture for finite element analysis are discussed.
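The decomposition strategy described above (order the nodes so the connectivity matrix has small bandwidth, then cut the ordering into contiguous blocks) can be sketched with a Cuthill-McKee-style ordering. This is a hedged reconstruction of the general idea, not the paper's exact algorithm; the toy mesh and helper names are illustrative.

```python
from collections import deque

def cuthill_mckee(adj):
    """Breadth-first ordering, lowest-degree first, that tends to reduce bandwidth."""
    order, visited = [], set()
    for start in sorted(adj, key=lambda n: len(adj[n])):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            order.append(node)
            for nbr in sorted(adj[node], key=lambda n: len(adj[n])):
                if nbr not in visited:
                    visited.add(nbr)
                    queue.append(nbr)
    return order

def decompose(adj, num_regions):
    """Cut the bandwidth-reduced ordering into contiguous, balanced regions."""
    order = cuthill_mckee(adj)
    size = -(-len(order) // num_regions)      # ceiling division
    return [order[i:i + size] for i in range(0, len(order), size)]

if __name__ == "__main__":
    # A small chain of quadrilaterals as a stand-in for a finite element mesh.
    adj = {0: [1, 2], 1: [0, 3], 2: [0, 3, 4], 3: [1, 2, 5],
           4: [2, 5, 6], 5: [3, 4, 7], 6: [4, 7], 7: [5, 6]}
    print(decompose(adj, 2))
```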

Journal ArticleDOI
17 May 1988
TL;DR: This paper investigates the use of a more abstract and significantly more efficient analytical model for evaluating the relative performance of cache consistency protocols and shows promise for evaluating architectural tradeoffs at a much more detailed level than was previously thought possible.
Abstract: A number of dynamic cache consistency protocols have been developed for multiprocessors having a shared bus interconnect between processors and shared memory. The relative performance of these protocols has been studied extensively using simulation and detailed analytical models based on Markov chain techniques. Both of these approaches use relatively detailed models, which capture cache and bus interference rather precisely, but which are highly expensive to evaluate. In this paper, we investigate the use of a more abstract and significantly more efficient analytical model for evaluating the relative performance of cache consistency protocols. The model includes bus interference, cache interference, and main memory interference, but represents the interactions between the caches by steady-state mean collision rates which are computed by iterative solution of the model equations. We show that the speedup estimates obtained from the mean-value model are highly accurate. The results agree with the speedup estimates of the detailed analytical models to within 3%, over all modifications studied and over a wide range of parameter values. This result is surprising, given that the distinctions among the protocols are quite subtle. The validation experiments include sets of reasonable values of the workload parameters, as well as sets of unrealistic values for which one might expect the mean-value equations to break down. The conclusion we can draw is that this modeling technique shows promise for evaluating architectural tradeoffs at a much more detailed level than was previously thought possible. We also discuss the relationship between results of the analytical models and the results of independent evaluations of the protocols using simulation.

Proceedings ArticleDOI
24 Jul 1988
TL;DR: In this article, the gradient reuse algorithm (GRA) is proposed to improve the learning rate of the backpropagation algorithm by reusing gradients that are computed using backpropagation several times, until the resulting weight updates no longer lead to a reduction in error.
Abstract: A simple method for improving the learning rate of the backpropagation algorithm is described and analyzed. The method is referred to as the gradient reuse algorithm (GRA). The basic idea is that gradients which are computed using backpropagation are reused several times until the resulting weight updates no longer lead to a reduction in error. It is shown that convergence speedup is a function of the reuse rate, and that the reuse rate can be controlled by using a dynamic convergence parameter.
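The reuse idea is simple to prototype outside a neural network: compute a gradient once and keep stepping along it while the error keeps dropping, recomputing only when it stops helping. The quadratic objective, step size, and iteration counts below are illustrative stand-ins for the paper's backpropagation setting.

```python
import numpy as np

# Gradient-reuse sketch on a quadratic f(w) = 0.5 * w^T A w - b^T w.
# One gradient evaluation is reused for several updates until it no longer
# reduces the error, mimicking the reuse loop described in the abstract.
def gradient_reuse_descent(A, b, w, lr=0.05, iters=200):
    f = lambda w: 0.5 * w @ A @ w - b @ w
    grad_evals = 0
    for _ in range(iters):
        g = A @ w - b                   # one backprop-like gradient computation
        grad_evals += 1
        while True:                     # reuse g while it still reduces f
            w_new = w - lr * g
            if f(w_new) >= f(w):
                break
            w = w_new
    return w, grad_evals

if __name__ == "__main__":
    A = np.array([[3.0, 0.2], [0.2, 1.0]])
    b = np.array([1.0, -2.0])
    w, evals = gradient_reuse_descent(A, b, np.zeros(2))
    print(w, evals, np.linalg.solve(A, b))   # w should approach the exact solution
```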

Journal ArticleDOI
TL;DR: In this paper, the authors examine some properties of this speedup, particularly its pattern of generality, in order to give indications as to its underlying theoretical basis and show that general procedures can be strengthened by practice and contradict the notion that speedup is purely a function of increased accessibility of schemas or other memory structures representing knowledge about the judgment target.
Abstract: Many types of social and nonsocial cognitive processes can be performed faster after they have been practiced. Two experiments examine some properties of this speedup, particularly its pattern of generality, in order to give indications as to its underlying theoretical basis. For example, consider a person who practices judging whether a number of behaviors imply a particular target trait. Is the resulting increase in speed specific to the behaviors that were judged, does it apply to judgments of new behaviors with respect to the same target trait, or is it applicable to all judgements using the same process, even for different target traits? These experiments identify components of speedup that show each of these patterns. The results show that general procedures can be strengthened by practice and contradict the notion that speedup is purely a function of increased accessibility of schemas or other memory structures representing knowledge about the judgment target. The effects of practice need not be content-specific.

Journal ArticleDOI
V. Ostovic1
TL;DR: The ability of the method to evaluate both time and space variations simultaneously is illustrated with graphs which represent the traveling waves in induction machines on a tooth-by-tooth basis.
Abstract: The fundamentals of magnetic equivalent circuit representation of electromechanical systems and the application of these principles to the computation of induction machine dynamics have been previously discussed. Here, some simplifications which clarify the principles of the method and speed up the computation are introduced. The ability of the method to evaluate both time and space variations simultaneously is illustrated with graphs which represent the traveling waves in induction machines on a tooth-by-tooth basis.

Proceedings Article
21 Aug 1988
TL;DR: A novel processor allocation strategy, called Bound-and-Branch, is presented for parallel alpha-beta search that achieves linear speedup in the case of perfect node ordering, and an actual speedup of 12 is obtained with 32 processors.
Abstract: We propose a parallel tree search algorithm based on the idea of tree-decomposition in which different processors search different parts of the tree. This generic algorithm effectively searches irregular trees using an arbitrary number of processors without shared memory or centralized control. The algorithm is independent of the particular type of tree search, such as single-agent or two-player game, and independent of any particular processor allocation strategy. Uniprocessor depth-first and breadth-first search are special cases of this generic algorithm. The algorithm has been implemented for alpha-beta search in the game of Othello on a 32-node Hypercube multiprocessor. The number of node evaluations grows approximately linearly with the number of processors P, resulting in an overall speedup for alpha-beta with random node ordering of P^0.75. Furthermore we present a novel processor allocation strategy, called Bound-and-Branch, for parallel alpha-beta search that achieves linear speedup in the case of perfect node ordering. Using this strategy, an actual speedup of 12 is obtained with 32 processors.
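For reference, the sketch below is plain sequential alpha-beta on an explicit game tree, with a comment marking the point where a Bound-and-Branch style allocation would search the first child serially to establish a bound and only then hand the remaining children to other processors. The tree representation is invented; this is not the paper's Hypercube implementation.

```python
# Sequential alpha-beta on an explicit tree: a leaf is a number, an internal
# node is a list of children.
def alphabeta(node, alpha=float("-inf"), beta=float("inf"), maximizing=True):
    if not isinstance(node, list):          # leaf: static evaluation
        return node
    best = float("-inf") if maximizing else float("inf")
    for i, child in enumerate(node):
        value = alphabeta(child, alpha, beta, not maximizing)
        # After i == 0 a real bound exists; a Bound-and-Branch style scheme
        # could search the remaining children concurrently with that bound.
        if maximizing:
            best = max(best, value)
            alpha = max(alpha, best)
        else:
            best = min(best, value)
            beta = min(beta, best)
        if alpha >= beta:                   # cutoff
            break
    return best

if __name__ == "__main__":
    tree = [[[3, 5], [6, 9]], [[1, 2], [0, -1]]]
    print(alphabeta(tree))                  # 5 for this tree
```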

Book ChapterDOI
21 Dec 1988
TL;DR: If the search space has more than one solution and if these solutions are randomly distributed in a relatively small region of the search space, then the average speedup in parallel depth-first search can be superlinear; if all the solutions are uniformly distributed over the whole search space, then it is linear.
Abstract: When N processors perform depth-first search on disjoint parts of a state space tree to find a solution, the speedup can be superlinear (i.e., > N) or sublinear (i.e., < N).

Journal ArticleDOI
01 Sep 1988
TL;DR: It is shown that fine-grain parallelism can be used to mask large, unpredictable memory latency and synchronization waits in architectures employing dataflow instruction execution mechanisms.
Abstract: A method for assessing the benefits of fine-grain parallelism in "real" programs is presented. The method is based on parallelism profiles and speedup curves derived by executing dataflow graphs on an interpreter under progressively more realistic assumptions about processor resources and communication costs. Even using traditional algorithms, the programs exhibit ample parallelism when parallelism is exposed at all levels. The bias introduced by the language Id and the compiler is examined. A method of estimating speedup through analysis of the ideal parallelism profile is developed, avoiding repeated execution of programs. It is shown that fine-grain parallelism can be used to mask large, unpredictable memory latency and synchronization waits in architectures employing dataflow instruction execution mechanisms. Finally, the effects of grouping portions of dataflow programs, and requiring that the operators in a group execute on a single processor, are explored.
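The ideal-profile estimate described above amounts to charging, at each time step, ceil(parallelism / p) cycles on p processors. A minimal version of that calculation, with a made-up profile, is sketched below.

```python
import math

# Estimate speedup on p processors from an ideal parallelism profile:
# profile[t] is the number of operations that could execute at step t with
# unbounded processors. Each step then costs ceil(profile[t] / p) on p processors.
def estimated_speedup(profile, p):
    total_work = sum(profile)                       # time on one processor
    time_p = sum(math.ceil(w / p) for w in profile)
    return total_work / time_p

if __name__ == "__main__":
    profile = [1, 4, 16, 64, 64, 16, 4, 1]          # invented parallelism profile
    for p in (4, 16, 64):
        print(p, round(estimated_speedup(profile, p), 2))
```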

Journal ArticleDOI
TL;DR: In this article, a load-flow model is presented which is at least 50% faster than the traditional fast decoupled load flow model as described by B. Stott and O. Alsac.
Abstract: A load-flow model is presented which is at least 50% faster than the traditional fast decoupled load flow model as described by B. Stott and O. Alsac (1974). Reasons for the speedup are considered, and details of the model's performance are given. The performance of the model is investigated by examining a number of power systems, the largest of which is a 1435-bus system.

Proceedings ArticleDOI
M.S. Lakshmi1, Philip S. Yu1
01 Jan 1988
TL;DR: The MIPS ratio as well as speedup are found to be very sensitive to the amount of skew, so careful thought should be given to parallelizing database applications and to the design of algorithms and query optimizers for parallel architectures.
Abstract: Skew in the distribution of values taken by an attribute is identified as a major factor that can affect the performance of parallel architectures for relational joins. The effect of skew on the performance of two parallel architectures is evaluated using analytic models. In one architecture, called database machine (DBMC), data as well as processing power are distributed, while in the other architecture, called Single Processor Parallel Input/output (SPPI), data is distributed but the processing power is concentrated in one processor. The two architectures are compared in terms of the ratio of MIPS used by DBMC and SPPI to deliver the same throughput and response time. In addition, the horizontal growth potential of DBMC is evaluated in terms of maximum speedup achievable by DBMC relative to SPPI response time. The MIPS ratio as well as speedup are found to be very sensitive to the amount of skew. These results suggest that careful thought should be given to parallelizing database applications and to the design of algorithms and query optimizers for parallel architectures.
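The sensitivity to skew has a simple back-of-the-envelope form: when work is spread over many processors, response time is governed by the largest partition, so speedup is total work divided by the maximum share. The Zipf-like skew model below is an illustrative assumption, not the paper's analytic model.

```python
# Skew-limited speedup: the busiest processor determines the response time.
def skew_limited_speedup(shares):
    return sum(shares) / max(shares)

def zipf_shares(n, theta):
    """Illustrative skewed work distribution (theta = 0 means uniform)."""
    weights = [1.0 / (i ** theta) for i in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

if __name__ == "__main__":
    for theta in (0.0, 0.5, 1.0):
        shares = zipf_shares(32, theta)
        print(theta, round(skew_limited_speedup(shares), 1))   # 32.0 only when uniform
```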

Journal ArticleDOI
TL;DR: The first parallel sort algorithm for shared memory MIMD (multiple-instruction-multiple-data-stream) multiprocessors that has a theoretical and measured speedup near linear is exhibited and is based on a novel asynchronous parallel merge that evenly partitions data to be merged among any number of processors.
Abstract: The first parallel sort algorithm for shared memory MIMD (multiple-instruction-multiple-data-stream) multiprocessors that has a theoretical and measured speedup near linear is exhibited. It is based on a novel asynchronous parallel merge that evenly partitions data to be merged among any number of processors. A benchmark sorting algorithm is proposed that uses this merge to remove the linear time bottleneck inherent in previous multiprocessor sorts. This sort, when applied to a data set on p processors, has a time complexity of O((n log n)/p) + O((n log p)/p) and a space complexity of 2n, where n is the number of keys being sorted. Evaluations of the merge and benchmark sort algorithms on a 12-processor Sequent Balance 21000 System demonstrate near-linear speedup when compared to sequential Quicksort.
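The even partitioning of a merge can be reproduced with a co-rank style binary search: for each of p target prefix lengths, determine how many elements come from each input so that every processor merges an equal-sized, independent slice. The sketch below is a sequential illustration of that partitioning idea, not the authors' asynchronous shared-memory code.

```python
def merge_split(a, b, k):
    """Smallest i such that taking i elements from a and k - i from b forms
    a valid prefix of merge(a, b)."""
    lo, hi = max(0, k - len(b)), min(k, len(a))
    while lo < hi:
        i = (lo + hi) // 2
        if a[i] < b[k - i - 1]:
            lo = i + 1                  # too few elements taken from a
        else:
            hi = i
    return lo

def partition_merge(a, b, p):
    """Split merge(a, b) into p balanced pieces, one per (virtual) processor."""
    total = len(a) + len(b)
    bounds = [(t * total) // p for t in range(p + 1)]
    splits = [merge_split(a, b, k) for k in bounds]
    pieces = []
    for t in range(p):
        i0, i1 = splits[t], splits[t + 1]
        j0, j1 = bounds[t] - i0, bounds[t + 1] - i1
        pieces.append(sorted(a[i0:i1] + b[j0:j1]))   # each piece merged independently
    return pieces

if __name__ == "__main__":
    a, b = sorted([5, 9, 12, 14, 20, 21]), sorted([1, 2, 13, 15, 16, 30])
    pieces = partition_merge(a, b, 3)
    print(pieces)
    assert sum(pieces, []) == sorted(a + b)          # pieces concatenate to the full merge
```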

Proceedings ArticleDOI
01 Jan 1988
TL;DR: This work outlines the design of a C* compiler for a hypercube multicomputer and aims to minimize the amount of time spent synchronizing, limit the number of interprocessor communications, and make each physical processor's emulation of a set of virtual processors as efficient as possible.
Abstract: A data parallel language such as C* has a number of advantages over conventional hypercube programming languages. The algorithm design process is simpler, because (1) message passing is invisible, (2) race conditions are nonexistent, and (3) the data can be put into a one-to-one correspondence with the virtual processors. Since data are mapped to virtual processors, rather than physical processors, it is easier to move algorithms implemented on one size hypercube to a larger or smaller system. We outline the design of a C* compiler for a hypercube multicomputer. Our design goals are to minimize the amount of time spent synchronizing, limit the number of interprocessor communications, and make each physical processor's emulation of a set of virtual processors as efficient as possible. We have hand translated three benchmark programs and compared their performance with that of ordinary C programs. All three programs (matrix multiplication, LU decomposition, and hyperquicksort) achieve reasonable speedup on a commercial hypercube, even when solving problems of modest size. On a 64-processor NCUBE/7, the C* matrix multiplication program achieves a speedup of 27 when multiplying two 64 × 64 matrices, the hyperquicksort program achieves a speedup of 10 when sorting 16,384 integers, and LU decomposition attains a speedup of 7 when decomposing a 256 × 256 system of linear equations. We believe the degradation in machine performance resulting from the use of a data parallel language will be more than compensated for by the increase in programmer productivity.

Journal ArticleDOI
TL;DR: A technique for parallel backtracking using randomization is proposed, whose main advantage is that good speedups are possible with little or no interprocessor communication.
Abstract: A technique for parallel backtracking using randomization is proposed. Its main advantage is that good speedups are possible with little or no interprocessor communication. The speedup obtainable is problem-dependent. In those cases where the problem size becomes very large, randomization is extremely successful, achieving good speedups. The technique also ensures high reliability, flexibility, and fault tolerance.
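The randomization idea can be illustrated with independent backtracking searches that differ only in the random order in which they try values; in a parallel setting each processor would run one copy with its own seed and stop as soon as any copy finds a solution. The N-queens instance below is an illustrative stand-in for the paper's benchmark problems.

```python
import random

# Randomized backtracking for N-queens: each "processor" runs the same search
# with a different seed, which randomizes the order in which column choices are
# explored. The first copy to find a solution would stop the others.
def solve_queens(n, seed):
    rng = random.Random(seed)
    nodes = 0

    def safe(cols, col):
        row = len(cols)
        return all(c != col and abs(c - col) != row - r for r, c in enumerate(cols))

    def backtrack(cols):
        nonlocal nodes
        nodes += 1
        if len(cols) == n:
            return list(cols)
        choices = list(range(n))
        rng.shuffle(choices)                 # randomized value ordering
        for col in choices:
            if safe(cols, col):
                result = backtrack(cols + [col])
                if result:
                    return result
        return None

    return backtrack([]), nodes

if __name__ == "__main__":
    # Each seed plays the role of one processor; node counts vary widely across seeds.
    runs = [(solve_queens(16, seed)[1], seed) for seed in range(8)]
    print(min(runs), max(runs))
```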