Showing papers on "Speedup published in 1991"

PDF

Open Access

Journal Article•DOI•

Paper: The parallel genetic algorithm as function optimizer

[...]

01 Sep 1991

TL;DR: The parallel genetic algorithm PGA is applied to the optimization of continuous functions and is able to find the global minimum of Rastrigin's function of dimension 400 on a 64 processor system!

...read moreread less

Abstract: In this paper, the parallel genetic algorithm PGA is applied to the optimization of continuous functions. The PGA uses a mixed strategy. Subpopulations try to locate good local minima. If a subpopulation does not progress after a number of generations, hillclimbing is done. Good local minima of a subpopulation are diffused to neighboring subpopulations. Many simulation results are given with popular test functions. The PGA is at least as good as other genetic algorithms on simple problems. A comparison with mathematical optimization methods is done for very large problems. Here a breakthrough can be reported. The PGA is able to find the global minimum of Rastrigin's function of dimension 400 on a 64 processor system! Furthermore, we give an example of a superlinear speedup.

...read moreread less

647 citations

Book•

Data-parallel programming on MIMD computers

[...]

Michael J. Quinn¹, Philip J. Hatcher²•Institutions (2)

Oregon State University¹, University of New Hampshire²

01 Jan 1991

TL;DR: The inadequacies of conventional parallel languages for programming multicomputers are identified, and a compiler that translates C* programs into C programs suitable for compilation and execution on a hypercube multicomputer is presented.

...read moreread less

Abstract: The inadequacies of conventional parallel languages for programming multicomputers are identified. The C* language is briefly reviewed, and a compiler that translates C* programs into C programs suitable for compilation and execution on a hypercube multicomputer is presented. Results illustrating the efficiency of executing data-parallel programs on a hypercube multicomputer are reported. They show the speedup achieved by three hand-compiled C* programs executing on an N-Cube 3200 multicomputer. The first two programs, Mandelbrot set calculation and matrix multiplication, have a high degree of parallelism and a simple control structure. The C* compiler can generate relatively straightforward code with performance comparable to hand-written C code. Results for a C* program that performs Gaussian elimination with partial pivoting are also presented and discussed. >

...read moreread less

294 citations

The Wisconsin Benchmark: Past, Present, and Future.

[...]

David J. DeWitt

01 Jan 1991

TL;DR: This work set out to develop a benchmark that could be used to evaluate DIRECT both relative to itself and relative to the "university" version of Ingres, and found it difficult to understand application-specific benchmarks.

...read moreread less

Abstract: In 1981 as we were completing the implementation of the DIRECT database machine [DEWI79, BORA82], attention turned to evaluating its performance. At that time no standard database benchmark existed. There were only a few application-specific benchmarks. While application-specific benchmarks measure which database system is best for a particular application, it was very difficult to understand them. We were interested in a benchmark to measure DIRECT's speedup characteristics. Thus, we set out to develop a benchmark that could be used to evaluate DIRECT both relative to itself and relative to the "university" version of Ingres.

...read moreread less

189 citations

Proceedings Article•

An Evaluation of Non-Equijoin Algorithms

[...]

David J. DeWitt, Jeffrey F. Naughton, Donovan A. Schneider

03 Sep 1991

TL;DR: A comparison between the partitioned band ,join algorithm and the classical sort-merge join algorit and data from speedup and scalcup experiments demonstrating that the partitioning hand join is efficiently paral-efficient are presented.

...read moreread less

Abstract: A non-equijoin of relations R and S is R bnnd join if the join predicate requires valnes in the join att.ribr1t.e of R 10 fall within a specified hand ahcmt the valnrs in the join r.tt.rihnte of S. We propose a new algorithm. t.ermed a partitionerl hnnd join, for evaluating band joins. We present a comparison between the partitioned band ,join algorithm and the classical sort-merge join algorit.hm (op(.imixed for band ,jnins) using bot,h an analytical model and an implemenlaCon on top of the WiSS storage system. The results show that the partitioned ba.nd join algorithm outperforms sort.-merge unlrsp memory is scarce and t.he opernnda of t,he join are of equal size. We also describe a parallel implementation of the pnrtitioned band join on the Gamma database machine. and present data from speedup and scalcup experiments demonstrating that the partitioned hand join is efficiently paral-

...read moreread less

163 citations

Journal Article•DOI•

Space and time bounds on indexing 3D models from 2D images

[...]

D.T. Clemens¹, David W. Jacobs¹•Institutions (1)

Massachusetts Institute of Technology¹

01 Oct 1991-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: It is proved that, even in the absence of image error, each model must be represented by a 2D surface in the index space, which places an unexpected lower bound on the space required to implement indexing and proves that no quantity is invariant for all projections of a model into the image.

...read moreread less

Abstract: Model-based visual recognition systems often match groups of image features to groups of model features to form initial hypotheses, which are then verified. In order to accelerate recognition considerably, the model groups can be arranged in an index space (hashed) offline such that feasible matches are found by indexing into this space. For the case of 2D images and 3D models consisting of point features, bounds on the space required for indexing and on the speedup that such indexing can achieve are demonstrated. It is proved that, even in the absence of image error, each model must be represented by a 2D surface in the index space. This places an unexpected lower bound on the space required to implement indexing and proves that no quantity is invariant for all projections of a model into the image. Theoretical bounds on the speedup achieved by indexing in the presence of image error are also determined, and an implementation of indexing for measuring this speedup empirically is presented. It is found that indexing can produce only a minimal speedup on its own. However, when accompanied by a grouping operation, indexing can provide significant speedups that grow exponentially with the number of features in the groups. >

...read moreread less

147 citations

Proceedings Article•DOI•

Towards a theory of nearly constant time parallel algorithms

[...]

Joseph Gil¹, Yossi Matias, Uzi Vishkin•Institutions (1)

University of British Columbia¹

01 Sep 1991

TL;DR: It is demonstrated that randomization is an extremely powerful tool for designing very fast and efficient parallel algorithms and a running time of O(lg* n) (nearly-constant), with high probability, is achieved using n/lG* n (optimal speedup) processors for a wide range of fundamental problems.

...read moreread less

Abstract: It is demonstrated that randomization is an extremely powerful tool for designing very fast and efficient parallel algorithms. Specifically, a running time of O(lg* n) (nearly-constant), with high probability, is achieved using n/lg* n (optimal speedup) processors for a wide range of fundamental problems. Also given is a constant time algorithm which, using n processors, approximates the sum of n positive numbers to within an error which is smaller than the sum by an order of magnitude. A variety of known and new techniques are used. New techniques, which are of independent interest, include estimation of the size of a set in constant time for several settings, and ways for deriving superfast optimal algorithms from superfast nonoptimal ones. >

...read moreread less

139 citations

Journal Article•DOI•

Paper: Toward a better parallel performance metric

[...]

Xian-He Sun¹, John L. Gustafson¹•Institutions (1)

Iowa State University¹

01 Dec 1991

TL;DR: Theoretical and experimental results show that the most commonly used performance metric, parallel speedup, is 'unfair', in that it favors slow processors and poorly coded programs.

...read moreread less

Abstract: The traditional definition of 'speedup' as the ratio of sequential execution time to parallel execution time has been widely accepted. One drawback to this metric is that it tends to reward slower processors and inefficient compilation with higher speedup. It seems unfair that the goals of high speed and high speedup are at odds with each other. In this paper, the 'fairness' of parallel performance metrics is studied. Theoretical and experimental results show that the most commonly used performance metric, parallel speedup, is 'unfair', in that it favors slow processors and poorly coded programs. Two new performance metrics are introduced. The first one, sizeup, provides a 'fair' performance measurement. The second one is a generalization of speedup - the generalized speedup, which recognizes that speedup is the ratio of speeds, not times. The relation between sizeup, speedup, and generalized speedup are studied. The various metrics have been tested using a real application that runs on an nCUBE 2 multicomputer. The experimental results closely match the analytical results.

...read moreread less

137 citations

Journal Article•DOI•

Parallel simulated annealing using speculative computation

[...]

E.E. Witte¹, Roger D. Chamberlain¹, Mark A. Franklin¹•Institutions (1)

Washington University in St. Louis¹

01 Oct 1991-IEEE Transactions on Parallel and Distributed Systems

TL;DR: In this article, a parallel simulated annealing algorithm that is problem-independent, maintains the serial decision sequence, and obtains speedup which can exceed log/sub 2/P on P processors is discussed.

...read moreread less

Abstract: A parallel simulated annealing algorithm that is problem-independent, maintains the serial decision sequence, and obtains speedup which can exceed log/sub 2/P on P processors is discussed. The algorithm achieves parallelism by using the concurrency technique of speculative computation. Implementation of the parallel algorithm on a hypercube multiprocessor and application to a task assignment problem are described. The simulated annealing solutions are shown to be, on average, 28% better than the solutions produced by a random task assignment algorithm and 2% better than the solutions produced by a heuristic. >

...read moreread less

112 citations

Journal Article•DOI•

Paper: Performance parameters and benchmarking of supercomputers

[...]

Roger W. Hockney¹•Institutions (1)

University of Reading¹

01 Dec 1991

TL;DR: The strengths and weaknesses of the most commonly used benchmarks of supercomputer performance are compared (Livermore, Linpack, Perfect, SPEC and EuroBen); the theoretical peak performance is defined and compared with the realised performance on some of these benchmarks.

...read moreread less

Abstract: The strengths and weaknesses of the most commonly used benchmarks of supercomputer performance are compared (Livermore, Linpack, Perfect, SPEC and EuroBen) The theoretical peak performance is defined and compared with the realised performance on some of these benchmarks The wide differences are interpreted in terms of terms of the performance parameters r"~, n"1"/"2, f"1"/"2, s"1"/"2, the latter three of which characterise the degradation of performance from inadequate vector length, inadequate computational intensity, and synchronisation overhead The RINF, POLY benchmarks are defined for measuring these parameters The PING-PONG benchmark is described for measuring the characteristics of communication in distributed systems, and the dangers associated with use of Speedup to compare the performance of algorithms on multiprocessor systems are discussed

...read moreread less

99 citations

Journal Article•DOI•

Throughput analysis, optimal buffer allocation, and traffic imbalance study of a generic nonblocking packet switch

[...]

Jia Chen¹, T.E. Stern²•Institutions (2)

IBM¹, Columbia University²

01 Apr 1991-IEEE Journal on Selected Areas in Communications

TL;DR: The model is extended to cover a nonhomogeneous system, where traffic intensity at each input varies and destination distribution is not uniform and it is seen that input imbalance has a more adverse effect on throughput than output imbalance.

...read moreread less

Abstract: A general model is presented to study the performance of a family of space-domain packet switches, implementing both input and output queuing and varying degrees of speedup. Based on this model, the impact of the speedup factor on the switch performance is analyzed. In particular, the maximum switch throughput, and the average system delay for any given degree of speedup are obtained. The results demonstrate that the switch can achieve 99% throughput with a modest speedup factor of four. Packet blocking probability for systems with finite buffers can also be derived from this model, and the impact of buffer allocation on blocking probability is investigated. Given a fixed buffer budget, this analysis obtains an optimal placement of buffers among input and output ports to minimize the blocking probability. The model is also extended to cover a nonhomogeneous system, where traffic intensity at each input varies and destination distribution is not uniform. Using this model, the effect of traffic imbalance on the maximum switch throughput is studied. It is seen that input imbalance has a more adverse effect on throughput than output imbalance. >

...read moreread less

96 citations

Proceedings Article•DOI•

Algorithms for an optimal A* search and linearizing the search in the stack decoder

[...]

Douglas B. Paul¹•Institutions (1)

Massachusetts Institute of Technology¹

14 Apr 1991

TL;DR: Two algorithms are presented for accelerating the operation of a stack decoder using a method for computing the true least upper bound so that an optimal admissible A* search can be performed.

...read moreread less

Abstract: Two algorithms are presented for accelerating the operation of a stack decoder. The first is a method for computing the true least upper bound so that an optimal admissible A* search can be performed. The second is a set of methods for linearizing the computation required by a stack decoder. The A* search has been implemented in a continuous speech recognizer simulator and has demonstrated a significant speedup. The linearizing algorithm has been partially implemented in the simulator and has also shown significant computational savings. >

...read moreread less

Proceedings Article•DOI•

Optimal speedup for backtrack search on a butterfly network

[...]

Abhiram Ranade¹•Institutions (1)

University of California, Berkeley¹

01 Jun 1991

TL;DR: This work presents a generic algorithm for implementing backtrack search on anN processor butterfly network that requires timeO(M/N +h) with high probability and is optimal and is obtained without making assumptions about the shape of the tree being searched.

...read moreread less

Abstract: We present a generic algorithm for implementing backtrack search on anN processor butterfly network. For a backtrack search tree havingM nodes the heighth, our algorithm requires timeO(M/N +h) with high probability. This is optimal and is obtained without making assumptions about the shape of the tree being searched.

...read moreread less

Journal Article•DOI•

High performance software testing on SIMD machines

[...]

E.W. Krauser¹, Aditya P. Mathur¹, V.J. Rego¹•Institutions (1)

Purdue University¹

01 May 1991-IEEE Transactions on Software Engineering

TL;DR: A method for high-performance, software testing, called mutant unification, is described, designed to support program mutation on parallel machines based on the single instruction multiple data stream (SIMD) paradigm.

...read moreread less

Abstract: A method for high-performance, software testing, called mutant unification, is described. The method is designed to support program mutation on parallel machines based on the single instruction multiple data stream (SIMD) paradigm. Several parameters that affect the performance of unification have been identified and their effect on the time to completion of a mutation test cycle and speedup has been studied. Program mutation analysis provides an effective means for determining the reliability of large software systems and a systematic method for measuring the adequacy of test data. However, it is likely that testing large software systems using mutation is computation bound and prohibitive on traditional sequential machines. Current, implementations of mutation tools are unacceptably slow and are only suitable for testing relatively small programs. The proposed unification method provides a practical alternative to the current approaches. The method also opens up a new application domain for SIMD machines. >

...read moreread less

Journal Article•DOI•

And-parallel execution of logic programs on a shared-Memory multiprocessor

[...]

Yow-Jian Lin, Vipin Kumar¹•Institutions (1)

University of Minnesota¹

02 Jan 1991-Journal of Logic Programming

TL;DR: This paper presents the implementation of an AND-parallel execution model of logic programs on a shared-memory multiprocessor based upon the Warren Abstract Machine, and shows that the parallel implementation can achieve reasonable speedup on dozens of processors.

...read moreread less

Abstract: This paper presents the implementation and performance results of an and -parallel execution model of logic programs on a shared-memory multiprocessor. The execution model is meant for logic programs with “don't-know nondeterminism”, and handles binding conflicts by dynamically detecting dependencies among literals. The model also incorporates intelligent backtracking at the clause level. Our implementation of this model is based upon the Warren Machine (WAM); hence it retains most of the efficiency of the WAM for sequential segments of logic programs. Performance results on Sequent Balance 21000 show that on suitable programs, our parallel implementation can achieve linear speedup on dozens of processors. We also present an analysis of different overheads encountered in the implementation of the execution model.

...read moreread less

Journal Article•DOI•

Microphone array optimization by stochastic region contraction

[...]

M.F. Berger¹, Harvey F. Silverman²•Institutions (2)

Tel Aviv University¹, Brown University²

01 Nov 1991-IEEE Transactions on Signal Processing

TL;DR: A new method, stochastic region contraction (SRC), is proposed that achieves a computational speedup of 30-50 when compared to the commonly used simulated-annealing method and is ideally suited for coarse-gain parallel processing.

...read moreread less

Abstract: The authors deal with optimal microphone placement and gain for a linear one-dimensional array often in a confined environment. A power spectral dispersion function (PSD) is used as a core element for a min-max objective function (PSDX). Derivation of the optimal spacings and gains of the microphones is a hard computational problem since the min-max objective function exhibits multiple local minima (hundreds or thousands). The authors address the computational problem of finding the global optimal solution of the PSDX function. A new method, stochastic region contraction (SRC), is proposed. It achieves a computational speedup of 30-50 when compared to the commonly used simulated-annealing method. SRC is ideally suited for coarse-gain parallel processing. >

...read moreread less

Journal Article•DOI•

Parallel hashing and integer sorting

[...]

Yossi Matias¹, Yossi Matias², Uzi Vishkin¹, Uzi Vishkin²•Institutions (2)

Tel Aviv University¹, University of Maryland, College Park²

01 Dec 1991-Journal of Algorithms

TL;DR: A result of independent interest is a parallel hashing technique that enables drastic reduction of space requirements for the price of using randomness in the parallel sorting algorithm and for some parallel string matching algorithms.

...read moreread less

Book Chapter•DOI•

Faster Modular Multiplication by Operand Scaling

[...]

Colin D. Walter

11 Aug 1991

TL;DR: This note shows how to gain speed by scaling the modulus, which can be chosen to need no scaling, so that most of the minor extra costs are eliminated.

...read moreread less

Abstract: There are a number of techniques known for speeding up modular multiplication, which is the main arithmetic operation in RSA cryptography This note shows how to gain speed by scaling the modulus Resulting hardware is limited only by the speed of addition Detailed analysis of fan out shows that over existing methods the speedup is potentially as much as two-fold This is because the addition and fan out can now be done in parallel Of course, in RSA the modulus can be chosen to need no scaling, so that most of the minor extra costs are eliminated

...read moreread less

Journal Article•DOI•

Coarse grain scheduling in parallel triangular factorization and solution of power system matrices

[...]

K. Lau, Daniel Tylavsky, Anjan Bose

01 May 1991-IEEE Transactions on Power Systems

TL;DR: It may be concluded that a fine-grain scheduling scheme is not appropriate for parallel LU factorization using an iPSC hypercube parallel processing computer, and the parallelLU factorization implementation using factorization path scheduling was found to perform significantly better than levelwise scheduling.

...read moreread less

Abstract: Two new coarse-grain scheduling schemes, the levelwise and factorization path scheduling schemes, are examined. These schemes differ significantly from fine-grain scheduling schemes which have been proposed in the past. If a fine-grain scheduling scheme at the floating-point-operation level is an appropriate scheduling method for the iPSC hypercube parallel processing computer, then the levelwise scheduling scheme presented should have gain comparable to that obtained using the factorization path scheduling scheme. Since this is not the case, it may be concluded that a fine-grain scheduling scheme is not appropriate for parallel LU factorization using an iPSC hypercube. Furthermore, the parallel LU factorization implementation using factorization path scheduling was found to perform significantly better than levelwise scheduling. The maximum speedup of 2.08 was obtained by using four processors on the 494 bus system. The efficiency at maximum speedup was 52.1%. >

...read moreread less

Journal Article•DOI•

A content-addressable memory architecture for image coding using vector quantization

[...]

Sethuraman Panchanathan¹, Morris Goldberg¹•Institutions (1)

Ottawa University¹

01 Sep 1991-IEEE Transactions on Signal Processing

TL;DR: An architecture suitable for real-time image coding using adaptive vector quantization (VQ) is presented, where the data is accessed simultaneously and in parallel on the basis of its content.

...read moreread less

Abstract: An architecture suitable for real-time image coding using adaptive vector quantization (VQ) is presented. This architecture is based on the concept of content-addressable memory (CAM), where the data is accessed simultaneously and in parallel on the basis of its content. VQ essentially involves, for each input vector, a search operation to obtain the best match codeword. A speedup results if a CAM-based implementation is used. This speedup, coupled with the gains in execution time for the basic distortion operation, implies that even codebook generation is possible in real time ( >

...read moreread less

Proceedings Article•DOI•

Technology mapping for a two-output RAM-based field programmable gate array

[...]

D. Filo¹, J. Chih-Yuan Yang¹, Frederic Mailhot¹, G. De Micheli¹•Institutions (1)

Stanford University¹

25 Feb 1991

TL;DR: A heuristic algorithm is described for technology mapping that performs a decomposition of the circuit in the FPGA primitives, driven by the information on logic functional sharing.

...read moreread less

Abstract: The authors present a new approach for performing technology mapping onto field programmable gate arrays (FPGAs). They consider one class of FPGAs, based on two-output five-input RAM-based cells, that are used to implement combinational logic functions. A heuristic algorithm is described for technology mapping that performs a decomposition of the circuit in the FPGA primitives, driven by the information on logic functional sharing. The authors have implemented the algorithm in the program Hydra. Experimental results shows an average of 20% to 25% improvement over other existing programs in mapping area and 67-fold speedup in computing time. >

...read moreread less

Journal Article•DOI•

Front-to-back display of BSP trees

[...]

Dan Gordon¹, S. Chen•Institutions (1)

Texas A&M University¹

01 Sep 1991-IEEE Computer Graphics and Applications

TL;DR: By combining polygon scan-conversion with a dynamic screen data structure, the technique provides significant speedup in the display time of polygonal scenes that depend on BSP trees, especially in cases where the number of polygons is large.

...read moreread less

Abstract: A technique for displaying binary space partitioning (BSP) trees that is faster than the usual back-to-front display method is presented. By combining polygon scan-conversion with a dynamic screen data structure, the technique, a front-to-back approach, provides significant speedup in the display time of polygonal scenes that depend on BSP trees, especially in cases where the number of polygons is large. This speedup is confirmed by applying the technique to randomly generated triangles. >

...read moreread less

Journal Article•DOI•

Performance analysis of Time Warp with multiple homogeneous processors

[...]

A. Gupta¹, Ian F. Akyildiz¹, Richard M. Fujimoto¹•Institutions (1)

Georgia Institute of Technology¹

01 Oct 1991-IEEE Transactions on Software Engineering

TL;DR: The behavior of n interacting processors synchronized by the Time Warp protocol is analyzed using a discrete-state, continuous-time Markov chain model and the results have been validated through performance measurements of a Time Warp testbed executing on a shared-memory multiprocessor.

...read moreread less

Abstract: The behavior of n interacting processors synchronized by the Time Warp protocol is analyzed using a discrete-state, continuous-time Markov chain model. The performance and dynamics of the processes (or processors) are analyzed under the following assumptions: exponential task times and timestamp increments on messages, each event message generates one new message that is sent to a randomly selected process, negligible rollback, state saving, and communication delay, unbounded message buffers, and homogeneous processors. Several performance measures are determined, such as: the fraction of processed events that commit, speedup, rollback probability, expected length of rollback, the probability mass function for the number of uncommitted processed events, the probability distribution function for the virtual time of a process, and the fraction of time the processors remain idle. The analysis is approximate, thus the results have been validated through performance measurements of a Time Warp testbed executing on a shared-memory multiprocessor. >

...read moreread less

Journal Article•DOI•

Feedforward architectures for parallel Viterbi decoding

[...]

Gerhard Fettweis¹, Heinrich Meyr²•Institutions (2)

IBM¹, RWTH Aachen University²

01 May 1991

TL;DR: Asymptotically, the ACS-feedback no longer has to be processed recursively, i.e., there is no feedback and this fact can be exploited technically to design efficient and purely feedforward architectures for Viterbi decoding that have a modular extendable structure.

...read moreread less

Abstract: The Viterbi-Algorithm (VA) is a common application of dynamic programming. The algorithm contains a nonlinear feedback loop (ACS-feedback, ACS: add-compare-select) which is the bottleneck in high data rate implementations. In this paper we show that, asymptotically, the ACS-feedback no longer has to be processed recursively, i.e., there is no feedback. With only negligible performance loss, this fact can be exploited technically to design efficient and purely feedforward architectures for Viterbi decoding that have a modular extendable structure. By designing one cascadable module, any speedup can be achieved simply by adding modules to the implementation. It is shown that optimization criteria, as minimum latency or maximum hardware efficiency, are met by very different architectures.

...read moreread less

Journal Article•DOI•

Paper: A comparative study of automatic vectorizing compilers

[...]

David Levine¹, David Callahan, Jack Dongarra²•Institutions (2)

Argonne National Laboratory¹, University of Tennessee²

01 Dec 1991

TL;DR: In this paper, the authors compare the capabilities of several commercially available, vectorizing Fortran compilers using a test suite of Fortran loops on a variety of supercomputers, mini-supercomputers and mainframes.

...read moreread less

Abstract: We compare the capabilities of several commercially available, vectorizing Fortran compilers using a test suite of Fortran loops. We present the results of compiling and executing these loops on a variety of supercomputers, mini-supercomputers, and mainframes.

...read moreread less

Journal Article•DOI•

Image encipherment based on optical parallel processing using spatial light modulators

[...]

S. Fukushima, T. Kurokawa, Sakai Yoshihisa

01 Dec 1991-IEEE Photonics Technology Letters

TL;DR: All-optical image coding and decoding are demonstrated for image encipherment communication, which was performed within a 8-ms duration, which is even faster than the conventional television system.

...read moreread less

Abstract: All-optical image coding and decoding are demonstrated for image encipherment communication. The optical parallel coder consists of two cascaded bipolar-operational spatial light modulators (BSLMs) that can select positive and negative image readout in response to the electrical pulse polarity. Coding is performed by an exclusive-OR operation, to a key and an input image, while decoding is the same operation performed to a decoding key and an enciphered image. Coding was performed within a 8-ms duration, which is even faster than the conventional television system. Increasing the voltages to the BSLMs two-fold or three-fold and shortening the pulse durations speed up the processing rate. >

...read moreread less

Journal Article•DOI•

Supercritical speedup

[...]

David Jefferson¹, Peter Reiher²•Institutions (2)

University of California, Los Angeles¹, Jet Propulsion Laboratory²

01 Apr 1991

Journal Article•DOI•

Parallelizing the Self-Organizing Feature Map on multiprocessor systems

[...]

Chwan-Hwa Wu¹, Russel E. Hodges¹, Chia-Jiu Wang²•Institutions (2)

Auburn University¹, University of Colorado Colorado Springs²

01 Sep 1991

TL;DR: Parallel algorithms for implementing the Kohonen SOFM on a linear chain and a two-dimensional mesh of transputers and the performance of massively parallel systems is predicted from the models.

...read moreread less

Abstract: The Self-Organizing Feature Map (SOFM) proposed by Kohonen is a widely used vector quantization algorithm. A drawback of SOFM is the increase of computation time with an increase in the number of neurons. However, the inherent parallelism of SOFM allows parallel processing to speed up the computation. Parallel algorithms for implementing the Kohonen SOFM are presented in this paper. We have implemented the algorithms on a linear chain and a two-dimensional mesh of transputers. Significant speedup has been achieved. In addition, models to describe the performance of the algorithms are also presented. The performance of massively parallel systems is predicted from the models.

...read moreread less

Journal Article•DOI•

The design of a scalable, fixed-time computer benchmark

[...]

John L. Gustafson¹, Diane T. Rover¹, Stephen T. Elbert¹, Michael Carter¹•Institutions (1)

Iowa State University¹

01 Aug 1991-Journal of Parallel and Distributed Computing

TL;DR: The design of a benchmark is presented, SLALOM{trademark}, that scales automatically to the computing power available, and corrects several deficiencies in various existing benchmarks: it is highly scalable, it solves a real problem, it includes input and output times, and it can be run on parallel machines of all kinds, using any convenient language.

...read moreread less

Journal Article•DOI•

Vector and parallel algorithms for the molecular dynamics simulation of macromolecules on shared‐memory computers

[...]

John E. Mertz¹, Douglas J. Tobias², Charles L. Brooks², U. C. Singh³•Institutions (3)

Cray¹, Carnegie Mellon University², Scripps Health³

01 Dec 1991-Journal of Computational Chemistry

TL;DR: A detailed description of vector/parallel algorithms for the molecular dynamics (MD) simulation of macromolecular systems on multiple processor, shared-memory computers is presented in this paper.

...read moreread less

Abstract: A detailed description of vector/parallel algorithms for the molecular dynamics (MD) simulation of macromolecular systems on multiple processor, shared-memory computers is presented. The algorithms encompass three computationally intensive portions of typical MD programs: (1) the evaluation of the potential energies and forces, (2) the generation of the nonbonded neighbor list, and (3) the satisfaction of holonomic constraints. We implemented the algorithms into two standard programs; CHARMM and AMBER, and obtained near linear speedups on eight processors of a Cray Y-MP for cases (1) and (2). For case (3) the SHAKE method demonstrated a speedup of 6.0 on eight processors while the matrix inversion method demonstrated 6.4. For a system of water molecules the performance improvement over the standard scalar SHAKE subroutine in AMBER ranged from a factor of 165 to greater than 2000.

...read moreread less

Journal Article•DOI•

Processor-efficient hypercube algorithms for the knapsack problem

[...]

Jianhua Lin¹, James A. Storer¹•Institutions (1)

Brandeis University¹

01 Nov 1991-Journal of Parallel and Distributed Computing

TL;DR: The proposed processor-efficient parallel algorithm for the 0/1 knapsack problem has optimal time speedup and processor efficiency over the best known sequential algorithm and performs very well for a wide range of input sizes.

...read moreread less

Collapse