
Showing papers on "Speedup published in 1991"


Journal ArticleDOI
01 Sep 1991
TL;DR: The parallel genetic algorithm PGA is applied to the optimization of continuous functions and is able to find the global minimum of Rastrigin's function of dimension 400 on a 64 processor system!
Abstract: In this paper, the parallel genetic algorithm PGA is applied to the optimization of continuous functions. The PGA uses a mixed strategy. Subpopulations try to locate good local minima. If a subpopulation does not progress after a number of generations, hillclimbing is done. Good local minima of a subpopulation are diffused to neighboring subpopulations. Many simulation results are given with popular test functions. The PGA is at least as good as other genetic algorithms on simple problems. A comparison with mathematical optimization methods is done for very large problems. Here a breakthrough can be reported. The PGA is able to find the global minimum of Rastrigin's function of dimension 400 on a 64 processor system! Furthermore, we give an example of a superlinear speedup.
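For reference, Rastrigin's function, whose 400-dimensional global minimum the PGA locates, has a standard closed form that can be sketched in a few lines of Python (the dimension is just the length of the input vector):

```python
import math

def rastrigin(x):
    """Rastrigin's function: highly multimodal, with its global minimum
    f(x) = 0 at the origin regardless of dimension."""
    return 10 * len(x) + sum(xi * xi - 10 * math.cos(2 * math.pi * xi) for xi in x)

print(rastrigin([0.0] * 400))  # → 0.0 (the global minimum the PGA finds)
```

The many cosine terms create a regular lattice of local minima, which is what makes dimension 400 a hard test for optimization methods.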

647 citations


Book
01 Jan 1991
TL;DR: The inadequacies of conventional parallel languages for programming multicomputers are identified, and a compiler that translates C* programs into C programs suitable for compilation and execution on a hypercube multicomputer is presented.
Abstract: The inadequacies of conventional parallel languages for programming multicomputers are identified. The C* language is briefly reviewed, and a compiler that translates C* programs into C programs suitable for compilation and execution on a hypercube multicomputer is presented. Results illustrating the efficiency of executing data-parallel programs on a hypercube multicomputer are reported. They show the speedup achieved by three hand-compiled C* programs executing on an N-Cube 3200 multicomputer. The first two programs, Mandelbrot set calculation and matrix multiplication, have a high degree of parallelism and a simple control structure. The C* compiler can generate relatively straightforward code with performance comparable to hand-written C code. Results for a C* program that performs Gaussian elimination with partial pivoting are also presented and discussed. >

294 citations


01 Jan 1991
TL;DR: Finding application-specific benchmarks difficult to interpret and lacking any standard alternative, this work set out to develop a benchmark that could be used to evaluate DIRECT both relative to itself and relative to the "university" version of Ingres.
Abstract: In 1981 as we were completing the implementation of the DIRECT database machine [DEWI79, BORA82], attention turned to evaluating its performance. At that time no standard database benchmark existed. There were only a few application-specific benchmarks. While application-specific benchmarks measure which database system is best for a particular application, it was very difficult to understand them. We were interested in a benchmark to measure DIRECT's speedup characteristics. Thus, we set out to develop a benchmark that could be used to evaluate DIRECT both relative to itself and relative to the "university" version of Ingres.

189 citations


Proceedings Article
03 Sep 1991
TL;DR: A comparison between the partitioned band join algorithm and the classical sort-merge join algorithm, and data from speedup and scaleup experiments demonstrating that the partitioned band join is efficiently parallelizable, are presented.
Abstract: A non-equijoin of relations R and S is a band join if the join predicate requires values in the join attribute of R to fall within a specified band about the values in the join attribute of S. We propose a new algorithm, termed a partitioned band join, for evaluating band joins. We present a comparison between the partitioned band join algorithm and the classical sort-merge join algorithm (optimized for band joins) using both an analytical model and an implementation on top of the WiSS storage system. The results show that the partitioned band join algorithm outperforms sort-merge unless memory is scarce and the operands of the join are of equal size. We also describe a parallel implementation of the partitioned band join on the Gamma database machine, and present data from speedup and scaleup experiments demonstrating that the partitioned band join is efficiently parallelizable.
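The band-join predicate itself can be sketched as follows (a minimal Python illustration with hypothetical relation values and band bounds; the paper's partitioned algorithm is not reproduced here):

```python
def band_join(R, S, c1, c2):
    """Band join of R and S on a numeric attribute: a pair (r, s) qualifies
    when r falls within the band [s - c1, s + c2] about s. This nested-loop
    sketch shows only the predicate; the paper's partitioned band join
    range-partitions both inputs so partition pairs can be joined independently."""
    return [(r, s) for r in R for s in S if s - c1 <= r <= s + c2]

print(band_join([10, 20], [8, 12, 19, 30], 2, 2))  # → [(10, 8), (10, 12), (20, 19)]
```

With c1 = c2 = 0 this degenerates to an equijoin, which is why sort-merge (the comparison baseline) adapts naturally to band predicates.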

163 citations


Journal ArticleDOI
TL;DR: It is proved that, even in the absence of image error, each model must be represented by a 2D surface in the index space, which places an unexpected lower bound on the space required to implement indexing and proves that no quantity is invariant for all projections of a model into the image.
Abstract: Model-based visual recognition systems often match groups of image features to groups of model features to form initial hypotheses, which are then verified. In order to accelerate recognition considerably, the model groups can be arranged in an index space (hashed) offline such that feasible matches are found by indexing into this space. For the case of 2D images and 3D models consisting of point features, bounds on the space required for indexing and on the speedup that such indexing can achieve are demonstrated. It is proved that, even in the absence of image error, each model must be represented by a 2D surface in the index space. This places an unexpected lower bound on the space required to implement indexing and proves that no quantity is invariant for all projections of a model into the image. Theoretical bounds on the speedup achieved by indexing in the presence of image error are also determined, and an implementation of indexing for measuring this speedup empirically is presented. It is found that indexing can produce only a minimal speedup on its own. However, when accompanied by a grouping operation, indexing can provide significant speedups that grow exponentially with the number of features in the groups.

147 citations


Proceedings ArticleDOI
01 Sep 1991
TL;DR: It is demonstrated that randomization is an extremely powerful tool for designing very fast and efficient parallel algorithms and a running time of O(lg* n) (nearly-constant), with high probability, is achieved using n/lg* n (optimal speedup) processors for a wide range of fundamental problems.
Abstract: It is demonstrated that randomization is an extremely powerful tool for designing very fast and efficient parallel algorithms. Specifically, a running time of O(lg* n) (nearly-constant), with high probability, is achieved using n/lg* n (optimal speedup) processors for a wide range of fundamental problems. Also given is a constant time algorithm which, using n processors, approximates the sum of n positive numbers to within an error which is smaller than the sum by an order of magnitude. A variety of known and new techniques are used. New techniques, which are of independent interest, include estimation of the size of a set in constant time for several settings, and ways for deriving superfast optimal algorithms from superfast nonoptimal ones.

139 citations


Journal ArticleDOI
01 Dec 1991
TL;DR: Theoretical and experimental results show that the most commonly used performance metric, parallel speedup, is 'unfair', in that it favors slow processors and poorly coded programs.
Abstract: The traditional definition of 'speedup' as the ratio of sequential execution time to parallel execution time has been widely accepted. One drawback to this metric is that it tends to reward slower processors and inefficient compilation with higher speedup. It seems unfair that the goals of high speed and high speedup are at odds with each other. In this paper, the 'fairness' of parallel performance metrics is studied. Theoretical and experimental results show that the most commonly used performance metric, parallel speedup, is 'unfair', in that it favors slow processors and poorly coded programs. Two new performance metrics are introduced. The first one, sizeup, provides a 'fair' performance measurement. The second one is a generalization of speedup - the generalized speedup, which recognizes that speedup is the ratio of speeds, not times. The relations among sizeup, speedup, and generalized speedup are studied. The various metrics have been tested using a real application that runs on an nCUBE 2 multicomputer. The experimental results closely match the analytical results.
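The two definitions of speedup contrasted here can be sketched as follows (a minimal illustration with hypothetical timings; "work" stands for whatever operation count the experimenter measures):

```python
def speedup(t_seq, t_par):
    """Traditional speedup: sequential execution time over parallel time."""
    return t_seq / t_par

def generalized_speedup(work_par, t_par, work_seq, t_seq):
    """Generalized speedup: parallel speed over sequential speed, where
    speed = work / time. For a fixed-size problem (work_par == work_seq)
    it reduces to the traditional definition; for a scaled problem the
    two metrics diverge."""
    return (work_par / t_par) / (work_seq / t_seq)

# Hypothetical timings for the same fixed-size problem:
print(speedup(100.0, 12.5))                        # → 8.0
print(generalized_speedup(1e9, 12.5, 1e9, 100.0))  # → 8.0
```

The "unfairness" the paper identifies shows up in the traditional metric: a slower processor inflates t_seq and t_par together, raising the ratio without raising actual speed.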

137 citations


Journal ArticleDOI
TL;DR: In this article, a parallel simulated annealing algorithm that is problem-independent, maintains the serial decision sequence, and obtains speedup which can exceed log₂ P on P processors is discussed.
Abstract: A parallel simulated annealing algorithm that is problem-independent, maintains the serial decision sequence, and obtains speedup which can exceed log₂ P on P processors is discussed. The algorithm achieves parallelism by using the concurrency technique of speculative computation. Implementation of the parallel algorithm on a hypercube multiprocessor and application to a task assignment problem are described. The simulated annealing solutions are shown to be, on average, 28% better than the solutions produced by a random task assignment algorithm and 2% better than the solutions produced by a heuristic.
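The serial decision sequence preserved by the algorithm is the standard Metropolis acceptance test, sketched below (Python; the speculative-computation machinery itself is not shown):

```python
import math
import random

def metropolis_accept(delta, temperature, rng=random.random):
    """One serial annealing decision: always accept an improving move
    (delta <= 0); accept a worsening move with probability exp(-delta / T)."""
    if delta <= 0:
        return True
    return rng() < math.exp(-delta / temperature)
```

The parallel algorithm gains its concurrency by speculatively evaluating the moves that would follow both the accept and the reject outcome of each such test, which is how it can reproduce the serial sequence exactly.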

112 citations


Journal ArticleDOI
01 Dec 1991
TL;DR: The strengths and weaknesses of the most commonly used benchmarks of supercomputer performance are compared (Livermore, Linpack, Perfect, SPEC and EuroBen); the theoretical peak performance is defined and compared with the realised performance on some of these benchmarks.
Abstract: The strengths and weaknesses of the most commonly used benchmarks of supercomputer performance are compared (Livermore, Linpack, Perfect, SPEC and EuroBen). The theoretical peak performance is defined and compared with the realised performance on some of these benchmarks. The wide differences are interpreted in terms of the performance parameters r∞, n1/2, f1/2 and s1/2, the latter three of which characterise the degradation of performance from inadequate vector length, inadequate computational intensity, and synchronisation overhead. The RINF and POLY benchmarks are defined for measuring these parameters. The PING-PONG benchmark is described for measuring the characteristics of communication in distributed systems, and the dangers associated with use of speedup to compare the performance of algorithms on multiprocessor systems are discussed.

99 citations


Journal ArticleDOI
Jia Chen1, T.E. Stern2
TL;DR: The model is extended to cover a nonhomogeneous system, where traffic intensity at each input varies and destination distribution is not uniform and it is seen that input imbalance has a more adverse effect on throughput than output imbalance.
Abstract: A general model is presented to study the performance of a family of space-domain packet switches, implementing both input and output queuing and varying degrees of speedup. Based on this model, the impact of the speedup factor on the switch performance is analyzed. In particular, the maximum switch throughput, and the average system delay for any given degree of speedup are obtained. The results demonstrate that the switch can achieve 99% throughput with a modest speedup factor of four. Packet blocking probability for systems with finite buffers can also be derived from this model, and the impact of buffer allocation on blocking probability is investigated. Given a fixed buffer budget, this analysis obtains an optimal placement of buffers among input and output ports to minimize the blocking probability. The model is also extended to cover a nonhomogeneous system, where traffic intensity at each input varies and destination distribution is not uniform. Using this model, the effect of traffic imbalance on the maximum switch throughput is studied. It is seen that input imbalance has a more adverse effect on throughput than output imbalance. >

96 citations


Proceedings ArticleDOI
14 Apr 1991
TL;DR: Two algorithms are presented for accelerating the operation of a stack decoder using a method for computing the true least upper bound so that an optimal admissible A* search can be performed.
Abstract: Two algorithms are presented for accelerating the operation of a stack decoder. The first is a method for computing the true least upper bound so that an optimal admissible A* search can be performed. The second is a set of methods for linearizing the computation required by a stack decoder. The A* search has been implemented in a continuous speech recognizer simulator and has demonstrated a significant speedup. The linearizing algorithm has been partially implemented in the simulator and has also shown significant computational savings.

Proceedings ArticleDOI
01 Jun 1991
TL;DR: This work presents a generic algorithm for implementing backtrack search on an N-processor butterfly network that requires time O(M/N + h) with high probability; this is optimal and is obtained without making assumptions about the shape of the tree being searched.
Abstract: We present a generic algorithm for implementing backtrack search on an N-processor butterfly network. For a backtrack search tree having M nodes and height h, our algorithm requires time O(M/N + h) with high probability. This is optimal and is obtained without making assumptions about the shape of the tree being searched.

Journal ArticleDOI
TL;DR: A method for high-performance software testing, called mutant unification, is described; it is designed to support program mutation on parallel machines based on the single instruction multiple data stream (SIMD) paradigm.
Abstract: A method for high-performance software testing, called mutant unification, is described. The method is designed to support program mutation on parallel machines based on the single instruction multiple data stream (SIMD) paradigm. Several parameters that affect the performance of unification have been identified, and their effect on the time to completion of a mutation test cycle and on speedup has been studied. Program mutation analysis provides an effective means for determining the reliability of large software systems and a systematic method for measuring the adequacy of test data. However, it is likely that testing large software systems using mutation is computation bound and prohibitive on traditional sequential machines. Current implementations of mutation tools are unacceptably slow and are only suitable for testing relatively small programs. The proposed unification method provides a practical alternative to the current approaches. The method also opens up a new application domain for SIMD machines.

Journal ArticleDOI
TL;DR: This paper presents the implementation of an AND-parallel execution model of logic programs on a shared-memory multiprocessor based upon the Warren Abstract Machine, and shows that the parallel implementation can achieve reasonable speedup on dozens of processors.
Abstract: This paper presents the implementation and performance results of an AND-parallel execution model of logic programs on a shared-memory multiprocessor. The execution model is meant for logic programs with “don't-know nondeterminism”, and handles binding conflicts by dynamically detecting dependencies among literals. The model also incorporates intelligent backtracking at the clause level. Our implementation of this model is based upon the Warren Abstract Machine (WAM); hence it retains most of the efficiency of the WAM for sequential segments of logic programs. Performance results on a Sequent Balance 21000 show that on suitable programs, our parallel implementation can achieve linear speedup on dozens of processors. We also present an analysis of different overheads encountered in the implementation of the execution model.

Journal ArticleDOI
TL;DR: A new method, stochastic region contraction (SRC), is proposed that achieves a computational speedup of 30-50 when compared to the commonly used simulated-annealing method and is ideally suited for coarse-grain parallel processing.
Abstract: The authors deal with optimal microphone placement and gain for a linear one-dimensional array of ten microphones in a confined environment. A power spectral dispersion function (PSD) is used as a core element for a min-max objective function (PSDX). Derivation of the optimal spacings and gains of the microphones is a hard computational problem since the min-max objective function exhibits multiple local minima (hundreds or thousands). The authors address the computational problem of finding the global optimal solution of the PSDX function. A new method, stochastic region contraction (SRC), is proposed. It achieves a computational speedup of 30-50 when compared to the commonly used simulated-annealing method. SRC is ideally suited for coarse-grain parallel processing.

Journal ArticleDOI
TL;DR: A result of independent interest is a parallel hashing technique that enables drastic reduction of space requirements for the price of using randomness in the parallel sorting algorithm and for some parallel string matching algorithms.

Book ChapterDOI
11 Aug 1991
TL;DR: This note shows how to gain speed in modular multiplication by scaling the modulus; in RSA, the modulus can be chosen to need no scaling, so that most of the minor extra costs are eliminated.
Abstract: There are a number of techniques known for speeding up modular multiplication, which is the main arithmetic operation in RSA cryptography. This note shows how to gain speed by scaling the modulus. Resulting hardware is limited only by the speed of addition. Detailed analysis of fan out shows that over existing methods the speedup is potentially as much as two-fold. This is because the addition and fan out can now be done in parallel. Of course, in RSA the modulus can be chosen to need no scaling, so that most of the minor extra costs are eliminated.
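As context, the RSA workload being accelerated is modular exponentiation, a long chain of modular multiplications. A textbook-sized sketch in Python (toy primes from the standard worked example; this illustrates the arithmetic only, not the paper's scaled-modulus hardware technique):

```python
# Toy RSA (textbook primes; real moduli are hundreds of digits long).
p, q = 61, 53
n = p * q                 # modulus 3233
e, d = 17, 2753           # e*d ≡ 1 (mod lcm(p-1, q-1)), so decryption inverts encryption
m = 65                    # message, with 0 <= m < n

c = pow(m, e, n)          # encrypt: repeated modular multiplication
print(c, pow(c, d, n))    # → 2790 65 (decryption recovers the message)
```

Every one of those modular multiplications sits on the critical path, which is why a constant-factor speedup in the multiplier translates directly into RSA throughput.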

Journal ArticleDOI
TL;DR: It may be concluded that a fine-grain scheduling scheme is not appropriate for parallel LU factorization using an iPSC hypercube parallel processing computer, and the parallelLU factorization implementation using factorization path scheduling was found to perform significantly better than levelwise scheduling.
Abstract: Two new coarse-grain scheduling schemes, the levelwise and factorization path scheduling schemes, are examined. These schemes differ significantly from fine-grain scheduling schemes which have been proposed in the past. If a fine-grain scheduling scheme at the floating-point-operation level is an appropriate scheduling method for the iPSC hypercube parallel processing computer, then the levelwise scheduling scheme presented should have gain comparable to that obtained using the factorization path scheduling scheme. Since this is not the case, it may be concluded that a fine-grain scheduling scheme is not appropriate for parallel LU factorization using an iPSC hypercube. Furthermore, the parallel LU factorization implementation using factorization path scheduling was found to perform significantly better than levelwise scheduling. The maximum speedup of 2.08 was obtained by using four processors on the 494 bus system. The efficiency at maximum speedup was 52.1%.

Journal ArticleDOI
TL;DR: An architecture suitable for real-time image coding using adaptive vector quantization (VQ) is presented, where the data is accessed simultaneously and in parallel on the basis of its content.
Abstract: An architecture suitable for real-time image coding using adaptive vector quantization (VQ) is presented. This architecture is based on the concept of content-addressable memory (CAM), where the data is accessed simultaneously and in parallel on the basis of its content. VQ essentially involves, for each input vector, a search operation to obtain the best match codeword. A speedup results if a CAM-based implementation is used. This speedup, coupled with the gains in execution time for the basic distortion operation, implies that even codebook generation is possible in real time.
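The search that a CAM parallelizes is full-search VQ encoding, sketched sequentially below (Python, with a hypothetical codebook):

```python
def nearest_codeword(codebook, x):
    """Full-search VQ encoding: return the index of the codeword with minimum
    squared-error distortion to the input vector x. A CAM evaluates all
    codewords simultaneously; this sketch searches them one by one."""
    def distortion(c):
        return sum((ci - xi) ** 2 for ci, xi in zip(c, x))
    return min(range(len(codebook)), key=lambda i: distortion(codebook[i]))

print(nearest_codeword([[0, 0], [4, 4], [8, 0]], [3, 3]))  # → 1
```

Because this search runs once per input vector, replacing the sequential scan with a parallel content-based lookup removes the dominant cost of encoding.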

Proceedings ArticleDOI
25 Feb 1991
TL;DR: A heuristic algorithm is described for technology mapping that performs a decomposition of the circuit in the FPGA primitives, driven by the information on logic functional sharing.
Abstract: The authors present a new approach for performing technology mapping onto field programmable gate arrays (FPGAs). They consider one class of FPGAs, based on two-output five-input RAM-based cells, that are used to implement combinational logic functions. A heuristic algorithm is described for technology mapping that performs a decomposition of the circuit in the FPGA primitives, driven by the information on logic functional sharing. The authors have implemented the algorithm in the program Hydra. Experimental results show an average of 20% to 25% improvement over other existing programs in mapping area and a 67-fold speedup in computing time.

Journal ArticleDOI
TL;DR: By combining polygon scan-conversion with a dynamic screen data structure, the technique provides significant speedup in the display time of polygonal scenes that depend on BSP trees, especially in cases where the number of polygons is large.
Abstract: A technique for displaying binary space partitioning (BSP) trees that is faster than the usual back-to-front display method is presented. By combining polygon scan-conversion with a dynamic screen data structure, the technique, a front-to-back approach, provides significant speedup in the display time of polygonal scenes that depend on BSP trees, especially in cases where the number of polygons is large. This speedup is confirmed by applying the technique to randomly generated triangles.

Journal ArticleDOI
TL;DR: The behavior of n interacting processors synchronized by the Time Warp protocol is analyzed using a discrete-state, continuous-time Markov chain model and the results have been validated through performance measurements of a Time Warp testbed executing on a shared-memory multiprocessor.
Abstract: The behavior of n interacting processors synchronized by the Time Warp protocol is analyzed using a discrete-state, continuous-time Markov chain model. The performance and dynamics of the processes (or processors) are analyzed under the following assumptions: exponential task times and timestamp increments on messages; each event message generates one new message that is sent to a randomly selected process; negligible rollback, state-saving, and communication delay; unbounded message buffers; and homogeneous processors. Several performance measures are determined, such as: the fraction of processed events that commit, speedup, rollback probability, expected length of rollback, the probability mass function for the number of uncommitted processed events, the probability distribution function for the virtual time of a process, and the fraction of time the processors remain idle. The analysis is approximate; thus the results have been validated through performance measurements of a Time Warp testbed executing on a shared-memory multiprocessor.

Journal ArticleDOI
01 May 1991
TL;DR: Asymptotically, the ACS-feedback no longer has to be processed recursively, i.e., there is no feedback and this fact can be exploited technically to design efficient and purely feedforward architectures for Viterbi decoding that have a modular extendable structure.
Abstract: The Viterbi-Algorithm (VA) is a common application of dynamic programming. The algorithm contains a nonlinear feedback loop (ACS-feedback, ACS: add-compare-select) which is the bottleneck in high data rate implementations. In this paper we show that, asymptotically, the ACS-feedback no longer has to be processed recursively, i.e., there is no feedback. With only negligible performance loss, this fact can be exploited technically to design efficient and purely feedforward architectures for Viterbi decoding that have a modular extendable structure. By designing one cascadable module, any speedup can be achieved simply by adding modules to the implementation. It is shown that optimization criteria, such as minimum latency or maximum hardware efficiency, are met by very different architectures.
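The ACS recursion that forms the bottleneck can be sketched as a single trellis step (Python; the two-state trellis and metric values are hypothetical):

```python
def acs(path_metrics, incoming):
    """One add-compare-select step. For each next state, add each incoming
    branch metric to its predecessor's path metric, compare the sums, and
    select the survivor (minimum-metric) path."""
    new_metrics, survivors = [], []
    for branches in incoming:  # branches into one next state: (prev_state, branch_metric)
        prev, metric = min(((p, path_metrics[p] + bm) for p, bm in branches),
                           key=lambda t: t[1])
        new_metrics.append(metric)
        survivors.append(prev)
    return new_metrics, survivors

# Hypothetical fully connected two-state trellis:
print(acs([0.0, 2.0], [[(0, 1.0), (1, 0.5)],    # branches into state 0
                       [(0, 3.0), (1, 0.0)]]))  # → ([1.0, 2.0], [0, 1])
```

Because each step's input metrics are the previous step's output, the loop is inherently recursive; the paper's contribution is showing that this feedback can be broken asymptotically with negligible performance loss.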

Journal ArticleDOI
01 Dec 1991
TL;DR: In this paper, the authors compare the capabilities of several commercially available, vectorizing Fortran compilers using a test suite of Fortran loops on a variety of supercomputers, mini-supercomputers and mainframes.
Abstract: We compare the capabilities of several commercially available, vectorizing Fortran compilers using a test suite of Fortran loops. We present the results of compiling and executing these loops on a variety of supercomputers, mini-supercomputers, and mainframes.

Journal ArticleDOI
TL;DR: All-optical image coding and decoding are demonstrated for image encipherment communication; coding was performed within an 8-ms duration, which is even faster than the conventional television system.
Abstract: All-optical image coding and decoding are demonstrated for image encipherment communication. The optical parallel coder consists of two cascaded bipolar-operational spatial light modulators (BSLMs) that can select positive and negative image readout in response to the electrical pulse polarity. Coding is performed by an exclusive-OR operation, to a key and an input image, while decoding is the same operation performed to a decoding key and an enciphered image. Coding was performed within an 8-ms duration, which is even faster than the conventional television system. Increasing the voltages to the BSLMs two-fold or three-fold and shortening the pulse durations speed up the processing rate.


Journal ArticleDOI
01 Sep 1991
TL;DR: Parallel algorithms for implementing the Kohonen SOFM on a linear chain and a two-dimensional mesh of transputers are presented, and the performance of massively parallel systems is predicted from performance models.
Abstract: The Self-Organizing Feature Map (SOFM) proposed by Kohonen is a widely used vector quantization algorithm. A drawback of SOFM is the increase of computation time with an increase in the number of neurons. However, the inherent parallelism of SOFM allows parallel processing to speed up the computation. Parallel algorithms for implementing the Kohonen SOFM are presented in this paper. We have implemented the algorithms on a linear chain and a two-dimensional mesh of transputers. Significant speedup has been achieved. In addition, models to describe the performance of the algorithms are also presented. The performance of massively parallel systems is predicted from the models.
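One SOFM training step, the unit of work the transputer implementations distribute, can be sketched as follows (Python; the three-unit linear-chain neighbourhood and the example data are illustrative assumptions):

```python
def sofm_step(weights, x, lr):
    """One SOFM step on a linear chain of units: find the best-matching unit
    (BMU) by squared Euclidean distance, then pull the BMU and its chain
    neighbours toward the input vector x by learning rate lr."""
    def dist(w):
        return sum((wi - xi) ** 2 for wi, xi in zip(w, x))
    bmu = min(range(len(weights)), key=lambda i: dist(weights[i]))
    for i in (bmu - 1, bmu, bmu + 1):
        if 0 <= i < len(weights):
            weights[i] = [wi + lr * (xi - wi) for wi, xi in zip(weights[i], x)]
    return bmu

w = [[0.0], [1.0], [5.0]]
print(sofm_step(w, [0.9], 0.5), w)  # BMU is unit 1; it and its neighbours move toward 0.9
```

The BMU search and the neighbourhood updates are independent across units, which is the inherent parallelism the paper exploits on chains and meshes of transputers.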

Journal ArticleDOI
TL;DR: The design of a benchmark, SLALOM™, that scales automatically to the computing power available is presented; it corrects several deficiencies in various existing benchmarks: it is highly scalable, it solves a real problem, it includes input and output times, and it can be run on parallel machines of all kinds, using any convenient language.

Journal ArticleDOI
TL;DR: A detailed description of vector/parallel algorithms for the molecular dynamics (MD) simulation of macromolecular systems on multiple processor, shared-memory computers is presented in this paper.
Abstract: A detailed description of vector/parallel algorithms for the molecular dynamics (MD) simulation of macromolecular systems on multiple processor, shared-memory computers is presented. The algorithms encompass three computationally intensive portions of typical MD programs: (1) the evaluation of the potential energies and forces, (2) the generation of the nonbonded neighbor list, and (3) the satisfaction of holonomic constraints. We implemented the algorithms into two standard programs; CHARMM and AMBER, and obtained near linear speedups on eight processors of a Cray Y-MP for cases (1) and (2). For case (3) the SHAKE method demonstrated a speedup of 6.0 on eight processors while the matrix inversion method demonstrated 6.4. For a system of water molecules the performance improvement over the standard scalar SHAKE subroutine in AMBER ranged from a factor of 165 to greater than 2000.

Journal ArticleDOI
TL;DR: The proposed processor-efficient parallel algorithm for the 0/1 knapsack problem has optimal time speedup and processor efficiency over the best known sequential algorithm and performs very well for a wide range of input sizes.
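As a sequential baseline of the kind such parallel algorithms are measured against, the classic dynamic program for the 0/1 knapsack problem can be sketched as (Python; the item data are hypothetical):

```python
def knapsack(values, weights, capacity):
    """Classic O(n * capacity) dynamic program for the 0/1 knapsack problem:
    best[c] holds the best value achievable within weight budget c.
    Iterating c downward ensures each item is used at most once."""
    best = [0] * (capacity + 1)
    for v, w in zip(values, weights):
        for c in range(capacity, w - 1, -1):
            best[c] = max(best[c], best[c - w] + v)
    return best[capacity]

print(knapsack([60, 100, 120], [10, 20, 30], 50))  # → 220
```

Each row of this recurrence depends only on the previous row, which is what makes the table amenable to the kind of processor-efficient parallelization the paper claims.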