
Showing papers on "Benchmark (computing)" published in 1988


Proceedings ArticleDOI
07 Nov 1988
TL;DR: The results show that binary decision diagrams (BDD) with the proposed ordering method can verify almost all benchmark circuits in less than several central processor unit (CPU) minutes, which is one hundred times faster than times reported in the literature.
Abstract: R.E. Bryant proposed a method to handle logic expressions (IEEE Trans. Comput., vol. C-35, no. 8, p. 677-91, 1986) based on binary decision diagrams (BDD) with the restriction that the variable ordering is fixed throughout a diagram. The method is more efficient than other methods proposed so far, but its performance depends heavily on the variable ordering. A simple but powerful algorithm for variable ordering is developed. The algorithm tries to find a variable ordering which minimizes the number of crosspoints of nets when the circuit diagram is drawn. This is applied to the Boolean comparison of ISCAS benchmark circuits for test pattern generation. The results show that binary decision diagrams (BDD) with the proposed ordering method can verify almost all benchmark circuits in less than several central processor unit (CPU) minutes, which is one hundred times (or more) faster than times reported in the literature. Some techniques for circuit evaluation ordering are also mentioned.
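To make the ordering sensitivity concrete, here is a minimal Python sketch (not the paper's crosspoint heuristic) that builds a reduced ordered BDD by Shannon expansion and counts internal nodes for a 2-bit equality comparator under two variable orders; the function and orders are illustrative only.

```python
def bdd_size(f, order):
    """Build a reduced ordered BDD for boolean function f (a predicate on a
    dict of variable assignments) under the given variable order; return
    the number of internal nodes."""
    unique = {}                      # (var, lo, hi) -> node id; shares isomorphic subgraphs
    TERM0, TERM1 = 0, 1

    def mk(i, env):
        if i == len(order):          # all variables assigned: terminal
            return TERM1 if f(env) else TERM0
        lo = mk(i + 1, {**env, order[i]: 0})
        hi = mk(i + 1, {**env, order[i]: 1})
        if lo == hi:                 # redundant test: elide the node
            return lo
        key = (order[i], lo, hi)
        if key not in unique:
            unique[key] = len(unique) + 2
        return unique[key]

    mk(0, {})
    return len(unique)

# 2-bit equality comparator: (a1 == b1) and (a2 == b2)
eq = lambda v: v['a1'] == v['b1'] and v['a2'] == v['b2']
print(bdd_size(eq, ['a1', 'b1', 'a2', 'b2']))  # interleaved order: 6 nodes
print(bdd_size(eq, ['a1', 'a2', 'b1', 'b2']))  # blocked order: 9 nodes
```

For wider comparators the gap grows from linear to exponential in the number of bits, which is why an ordering heuristic such as the paper's crosspoint minimization matters.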

278 citations


Journal ArticleDOI
TL;DR: These algorithms are based on iterative improvement of a dual cost and operate in a manner that is reminiscent of coordinate ascent and Gauss-Seidel relaxation methods, and are found to be several times faster on standard benchmark problems, and faster by an order of magnitude on large, randomly generated problems.
Abstract: We propose a new class of algorithms for linear cost network flow problems with and without gains. These algorithms are based on iterative improvement of a dual cost and operate in a manner that is reminiscent of coordinate ascent and Gauss-Seidel relaxation methods. We compare our coded implementations of these methods with mature state-of-the-art primal simplex and primal-dual codes, and find them to be several times faster on standard benchmark problems, and faster by an order of magnitude on large, randomly generated problems. Our experiments indicate that the speedup factor increases with problem dimension.
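The relaxation codes themselves are not shown here, but the dual coordinate-ascent flavor can be illustrated with a toy auction-style method on the assignment problem, a special case of linear network flow; the benefit matrix and epsilon below are made up, and this is a sketch in the spirit of the approach, not the paper's algorithm.

```python
def auction_assignment(benefit, eps=0.01):
    """Toy auction/dual-ascent solver for the n-by-n assignment problem
    (maximize total benefit). Prices are the dual variables; each bid is
    a coordinate-ascent step on the dual cost."""
    n = len(benefit)
    prices = [0.0] * n          # dual variable per object
    owner = [None] * n          # object -> person
    assigned = [None] * n       # person -> object
    unassigned = list(range(n))
    while unassigned:
        i = unassigned.pop()
        values = [benefit[i][j] - prices[j] for j in range(n)]
        best = max(range(n), key=lambda j: values[j])
        second = max(values[j] for j in range(n) if j != best) if n > 1 else values[best]
        # raise the price of the best object by the bid increment
        prices[best] += values[best] - second + eps
        if owner[best] is not None:          # outbid the previous owner
            assigned[owner[best]] = None
            unassigned.append(owner[best])
        owner[best] = i
        assigned[i] = best
    return assigned

benefit = [[10, 2, 3], [4, 8, 6], [5, 7, 9]]
print(auction_assignment(benefit))   # [0, 1, 2] is optimal for this matrix
```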

191 citations


Journal ArticleDOI
TL;DR: A new algorithm for optimally balancing assembly lines, FABLE, is formulated and tested; it obtains proven optimal solutions for ten 1000-task lines, each of which possesses the computationally favorable conditions of an average of at least 6 tasks per work station and a small number of between-task precedence requirements.
Abstract: A new algorithm for optimally balancing assembly lines is formulated and tested. Named "FABLE," it obtains proven optimal solutions for ten 1000-task lines, each of which possesses the computationally favorable conditions of an average of at least 6 tasks per work station and a small number of between-task precedence requirements, in less than 20 seconds of IBM 3033U CPU time per problem. FABLE also performs very favorably on a benchmark group of 64 test problems drawn from the literature, of up to 111 tasks each. FABLE finds and proves an optimal solution to the 64 problems in a total of 3.16 seconds of IBM 3090 CPU time. FABLE is a 'laser' type, depth-first, branch-and-bound algorithm, with logic designed for very fast achievement of feasibility, ensuring a feasible solution to any line of 1000 or even more tasks. It utilizes new and existing dominance rules and bound arguments. A total of 549 problems of various characteristics are solved to determine the conditions under which FABLE performs most and least favorably. Performance is sensitive to the average number of tasks per work station, the number of between-task precedence requirements measured by 'order strength', and the total number of tasks per problem. A heuristic variant of FABLE is also described.
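For readers unfamiliar with the problem, here is a small depth-first branch-and-bound sketch for type-1 assembly line balancing (fewest stations of a given cycle time, respecting precedence); the bound and branching are far simpler than FABLE's dominance rules, and the task data are invented.

```python
import math

def balance(times, prec, cycle):
    """Minimize the number of stations of capacity `cycle`; `prec` holds
    pairs (a, b) meaning task a must precede task b. Assumes every task
    fits in one station."""
    assert all(t <= cycle for t in times)
    n = len(times)
    best = [n]   # trivial upper bound: one task per station

    def dfs(done, stations, slack):
        remaining = sum(times[t] for t in range(n) if t not in done)
        # lower bound: stations so far + remaining work / cycle time
        if stations + math.ceil(max(0, remaining - slack) / cycle) >= best[0]:
            return
        if len(done) == n:
            best[0] = stations
            return
        ready = [t for t in range(n) if t not in done
                 and all(a in done for a, b in prec if b == t)]
        fits = [t for t in ready if times[t] <= slack]
        if not fits:                         # open a new station
            dfs(done, stations + 1, cycle)
            return
        for t in fits:
            dfs(done | {t}, stations, slack - times[t])

    dfs(frozenset(), 1, cycle)
    return best[0]

times = [3, 4, 2, 5, 1, 4]
prec = [(0, 2), (1, 3), (2, 4), (3, 5)]
print(balance(times, prec, cycle=7))   # 3 stations
```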

186 citations


Book ChapterDOI
01 Mar 1988

127 citations


Journal ArticleDOI
R. P. Weicker
TL;DR: The Dhrystone benchmark program has become a popular benchmark for CPU/compiler performance measurement, in particular in the area of minicomputers, workstations, PCs and microprocessors; Version 2 is the result of a re-evaluation of whether the benchmark can still fulfill this function.
Abstract: The Dhrystone benchmark program [1] has become a popular benchmark for CPU/compiler performance measurement, in particular in the area of minicomputers, workstations, PCs and microprocessors. It apparently satisfies a need for an easy-to-use integer benchmark; it gives a first performance indication which is more meaningful than MIPS numbers which, in their literal meaning (million instructions per second), cannot be used across different instruction sets (e.g. RISC vs. CISC). With the increasing use of the benchmark, it seems necessary to reconsider the benchmark and to check whether it can still fulfill this function. Version 2 of Dhrystone is the result of such a re-evaluation; it has been made for two reasons.
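The principle of such a benchmark can be sketched in a few lines: time a synthetic integer/string workload and report iterations per second, the analogue of "Dhrystones per second". The kernel below is a made-up stand-in, not the actual Dhrystone code.

```python
import time

def synthetic_kernel(loops):
    """A stand-in integer/string workload in the spirit of a synthetic
    benchmark (not the real Dhrystone C program)."""
    int_glob, s = 0, "DHRYSTONE PROGRAM"
    for i in range(loops):
        a = (i * 3 + 7) % 101          # integer arithmetic
        b = a * 2 - i % 13
        if s[:4] == "DHRY":            # string comparison
            int_glob = a + b
    return int_glob

loops = 1_000_000
t0 = time.perf_counter()
synthetic_kernel(loops)
elapsed = time.perf_counter() - t0
print(f"{loops / elapsed:.0f} loops/second")   # the score analogue
```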

101 citations


Journal ArticleDOI
01 Jun 1988
TL;DR: This paper presents and analyzes the computational and parallel complexity of the Livermore Loops and addresses the concern that their computations must be understood thoroughly, so that efficient implementations may be written.
Abstract: This paper presents and analyzes the computational and parallel complexity of the Livermore Loops. The Loops represent the type of computational kernels typically found in large-scale scientific computing and have been used to benchmark computer systems since the mid-1960s. On parallel systems, a process's computational structure can greatly affect its efficiency. If the Loops are to be used to benchmark such systems, their computations must be understood thoroughly, so that efficient implementations may be written. This paper addresses that concern.
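As an example of the kind of kernel involved, the first Livermore Loop (commonly given as the "hydro fragment") is an element-wise, recurrence-free loop, shown here in NumPy form; array sizes and constants are illustrative.

```python
import numpy as np

n = 1000
q, r, t = 0.5, 2.0, 0.1
y = np.random.rand(n + 11)
z = np.random.rand(n + 11)

# Kernel 1, hydro fragment: x[k] = q + y[k] * (r * z[k+10] + t * z[k+11]),
# fully data-parallel, hence easy to vectorize or distribute.
x = q + y[:n] * (r * z[10:n + 10] + t * z[11:n + 11])
print(x[:4])
```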

81 citations


Proceedings ArticleDOI
01 Jun 1988
TL;DR: The two most frequently used symbolic layout compaction approaches, constraint graph compaction and virtual grid compaction, are reviewed in this paper.
Abstract: Symbolic layout and compaction is reaching maturity. This is demonstrated, in part, by the recent or imminent introduction of a number of commercial symbolic layout and compaction systems. The two most frequently used symbolic layout compaction approaches, constraint graph compaction and virtual grid compaction, are reviewed in this paper. The current status of these two approaches is presented by looking at the results of the ICCD87 compaction benchmark session.
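A minimal sketch of the constraint-graph idea: minimum-spacing constraints x[b] >= x[a] + d form a DAG, and the compacted one-dimensional coordinates are longest paths from the left edge. The three-element example is invented.

```python
from collections import defaultdict

def compact(num_elems, constraints):
    """Each constraint (a, b, d) requires x[b] >= x[a] + d; minimal
    coordinates are longest paths, computed in topological order."""
    adj = defaultdict(list)
    indeg = [0] * num_elems
    for a, b, d in constraints:
        adj[a].append((b, d))
        indeg[b] += 1
    order = [v for v in range(num_elems) if indeg[v] == 0]
    x = [0] * num_elems
    for v in order:                  # Kahn-style topological sweep
        for w, d in adj[v]:
            x[w] = max(x[w], x[v] + d)
            indeg[w] -= 1
            if indeg[w] == 0:
                order.append(w)
    return x

# elements 0-1 at least 4 apart, 1-2 at least 3, 0-2 at least 9
print(compact(3, [(0, 1, 4), (1, 2, 3), (0, 2, 9)]))   # [0, 4, 9]
```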

80 citations


Journal ArticleDOI
17 May 1988
TL;DR: It is found that in non-vector machines, pipelining multiple function units does not provide significant performance improvements, and it is worthwhile to investigate the performance improvements that can be achieved from issuing multiple instructions each clock cycle.
Abstract: In this paper, we look at the interaction of pipelining and multiple functional units in single-processor machines. When implementing a high-performance machine, a number of hardware techniques may be used to improve the performance of the final system. Our goal is to gain an understanding of how each of these techniques contributes to performance improvement. As a basis for our studies we use a CRAY-like processor model and the issue rate (instructions per clock cycle) as the performance measure. We then systematically augment this base, non-pipelined machine with more and more hardware features and evaluate the performance impact of each feature. We find, for example, that in non-vector machines, pipelining multiple function units does not provide significant performance improvements. Dataflow limits are then derived for our benchmark programs to determine the performance potential of each benchmark. In addition, other limits are computed which apply more realistic constraints on a computation. Based on these more realistic limits, we determine that it is worthwhile to investigate the performance improvements that can be achieved by issuing multiple instructions each clock cycle. Several hardware approaches to issuing multiple instructions each clock cycle are evaluated.
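As a toy illustration of the issue-rate measure (not the paper's CRAY-like model), the sketch below counts cycles for an in-order, single-issue machine where an instruction stalls until its operands are ready, then reports instructions per clock cycle.

```python
def issue_rate(instrs, latency):
    """instrs: (destination, source registers, opcode) triples in program
    order. Each instruction issues at least one cycle after the previous
    one and no earlier than its operands become available."""
    ready = {}            # register -> cycle its value is available
    cycle = 0
    for dst, srcs, op in instrs:
        issue = max([cycle + 1] + [ready.get(s, 0) for s in srcs])
        ready[dst] = issue + latency[op]
        cycle = issue
    return len(instrs) / cycle

# a dependent chain: load feeds an add which feeds a multiply
prog = [("r1", [], "load"), ("r2", ["r1"], "add"), ("r3", ["r2", "r1"], "mul")]
print(issue_rate(prog, {"load": 3, "add": 1, "mul": 4}))   # 0.6 IPC
```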

63 citations


Proceedings ArticleDOI
01 Dec 1988
TL;DR: It is found that, with sufficient memory, multiple processors can greatly improve the achievable transaction processing performance; a prototype kernel runs the standard debit-credit benchmark at over 1000 transactions per second.
Abstract: In this paper we describe an experiment designed to evaluate the potential transaction processing system performance achievable through the combination of multiple processors and massive memories. The experiment consisted of the design and implementation of a transaction processing kernel on stock multiprocessors. We found that with sufficient memory, multiple processors can greatly improve performance. A prototype implementation of the kernel on a pair of Firefly multiprocessors (each with five 1-MIP processors) runs the standard debit-credit benchmark at over 1000 transactions per second.
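The debit-credit profile itself is simple to sketch: each transaction updates an account, a teller and a branch balance and appends a history record. The in-memory tables and single lock below stand in for a real database kernel and its concurrency control.

```python
import threading

# In-memory tables for a toy debit-credit (TP1-style) transaction.
accounts = {i: 1000 for i in range(100)}
tellers  = {i: 0 for i in range(10)}
branches = {i: 0 for i in range(2)}
history  = []
lock = threading.Lock()   # stand-in for real concurrency control

def debit_credit(acct, teller, branch, delta):
    """The classic four-update transaction: account, teller and branch
    balances plus an append-only history record."""
    with lock:
        accounts[acct] += delta
        tellers[teller] += delta
        branches[branch] += delta
        history.append((acct, teller, branch, delta))
        return accounts[acct]

print(debit_credit(acct=7, teller=3, branch=0, delta=-50))   # 950
```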

51 citations


Proceedings ArticleDOI
01 Jan 1988
TL;DR: This work outlines the design of a C* compiler for a hypercube multicomputer and aims to minimize the amount of time spent synchronizing, limit the number of interprocessor communications, and make each physical processor's emulation of a set of virtual processors as efficient as possible.
Abstract: A data parallel language such as C* has a number of advantages over conventional hypercube programming languages. The algorithm design process is simpler, because (1) message passing is invisible, (2) race conditions are nonexistent, and (3) the data can be put into a one-to-one correspondence with the virtual processors. Since data are mapped to virtual processors, rather than physical processors, it is easier to move algorithms implemented on one size hypercube to a larger or smaller system. We outline the design of a C* compiler for a hypercube multicomputer. Our design goals are to minimize the amount of time spent synchronizing, limit the number of interprocessor communications, and make each physical processor's emulation of a set of virtual processors as efficient as possible. We have hand translated three benchmark programs and compared their performance with that of ordinary C programs. All three programs—matrix multiplication, LU decomposition, and hyperquicksort—achieve reasonable speedup on a commercial hypercube, even when solving problems of modest size. On a 64-processor NCUBE/7, the C* matrix multiplication program achieves a speedup of 27 when multiplying two 64 × 64 matrices, the hyperquicksort program achieves a speedup of 10 when sorting 16,384 integers, and LU decomposition attains a speedup of 7 when decomposing a 256 × 256 system of linear equations. We believe the degradation in machine performance resulting from the use of a data parallel language will be more than compensated for by the increase in programmer productivity.
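The virtual-processor emulation the compiler must generate can be sketched as a loop over a contiguous block of virtual processors per physical processor; the block mapping and step function here are illustrative, not the compiler's actual output.

```python
def emulate(num_phys, num_virt, step):
    """Sketch of one SIMD step over `num_virt` virtual processors emulated
    by `num_phys` physical ones: each physical processor loops over its
    contiguous block of virtual processors."""
    per_proc = (num_virt + num_phys - 1) // num_phys   # ceiling division
    state = [0] * num_virt
    for p in range(num_phys):                 # in reality: in parallel
        lo = p * per_proc
        for v in range(lo, min(lo + per_proc, num_virt)):
            state[v] = step(v)                # the data-parallel body
    return state

print(emulate(num_phys=4, num_virt=10, step=lambda v: v * v))
```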

45 citations


Journal ArticleDOI
TL;DR: This note reports on the implementation of AC-unification algorithms, based on the variable-abstraction method of Stickel and on the constant-abstraction method of Livesey, Siekmann, and Herold, and gives a set of 105 benchmark examples, comparing execution times for implementations of the two approaches.
Abstract: This note reports on the implementation of AC-unification algorithms, based on the variable-abstraction method of Stickel and on the constant-abstraction method of Livesey, Siekmann, and Herold. We give a set of 105 benchmark examples and compare execution times for implementations of the two approaches. This documents for other researchers what we consider to be the state-of-the-art performance for elementary AC-unification problems.

Proceedings ArticleDOI
03 Oct 1988
TL;DR: The authors present a method for transforming multilevel equations into a gate-level netlist of a given technology that takes full advantage of any gate library's timing and area information and offers the designer an option to trade off CPU runtime for better results.
Abstract: The authors present a method for transforming multilevel equations into a gate-level netlist of a given technology. The proposed mapping procedure performs multiple mappings, each with randomly selected program parameters. The number of mappings is user-settable, and it offers the designer an option to trade off CPU runtime for better results. This feature is important to designers who begin by exploring the space of architectural possibilities, then finally create a specific, highly optimized circuit. The proposed technology mapping method has been implemented in C as a logic-design tool (McMAP) that takes full advantage of any gate library's timing and area information. Using default parameter settings, the tool synthesized several standard benchmark examples yielding higher-quality circuits with lower CPU requirements than previously reported.
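The multiple-random-mapping idea can be sketched as a random-restart loop; `run_one_mapping` below is a placeholder for a real technology-mapping pass, and the parameter names and cost model are invented.

```python
import random

def run_one_mapping(equations, library, params):
    # Placeholder mapper: the "area" depends only on the random parameter,
    # standing in for a real library-aware technology-mapping pass.
    return {"netlist": equations, "area": 100 * params["decompose_bias"] + 50}

def map_with_restarts(equations, library, tries, seed=0):
    """Run the mapper `tries` times with randomly drawn parameters and
    keep the cheapest result: more tries trade CPU time for quality."""
    rng = random.Random(seed)
    best, best_area = None, float("inf")
    for _ in range(tries):
        params = {"decompose_bias": rng.random()}
        result = run_one_mapping(equations, library, params)
        if result["area"] < best_area:
            best, best_area = result, result["area"]
    return best

print(map_with_restarts(["f = a*b + c"], library=None, tries=8))
```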

Proceedings ArticleDOI
16 May 1988
TL;DR: A discussion is presented of the implementation of two WDF (wave digital filter) benchmarks that have been designed with three architecture-specific silicon compilers; it is shown that choosing a good algorithm is an important initial optimization step.
Abstract: A discussion is presented of the implementation of two WDF (wave digital filter) benchmarks that have been designed with three architecture-specific silicon compilers. The design time for high-level synthesis and optimization is roughly one day. For each of the three synthesis systems, the elapsed design cycle starting from the specifications down to the optimized signal flow graph is another 1-2 days. For the architecture and layout generation (including the evaluation of the tradeoffs), the design time ranges from a few hours to a day. Specific higher-level filter specifications have been used for the synthesis with three different implementation strategies in the CATHEDRAL silicon compilers. It is shown that choosing a good algorithm is an important initial optimization step.

Journal ArticleDOI
TL;DR: The LRU cache hit function is used as a general characterization of locality of reference to address the synthesis question of whether benchmarks can be created that have a required locality of reference.
Abstract: The LRU cache hit function is used as a general characterization of locality of reference to address the synthesis question of whether benchmarks can be created that have a required locality of reference. Several results are given that show circumstances under which this synthesis can or cannot be achieved. An additional characterization called the warm-start cache hit function is introduced and shown to be efficiently computable. The operations of repetition and replication are used to form new programs, and their characteristics are derived. Using these operations, a general benchmark synthesis technique is obtained and demonstrated with an example.
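The LRU cache hit function of a reference trace is computable from stack distances: a reference hits in every cache whose size is at least its distance from the top of the LRU stack. A minimal sketch (the paper's synthesis operations are not reproduced):

```python
def lru_hit_function(trace):
    """Return a function c -> hit ratio of an LRU cache of size c,
    computed from the stack distances of the trace."""
    stack, dist_counts = [], {}
    for ref in trace:
        if ref in stack:
            d = stack.index(ref) + 1          # stack distance (1-based)
            dist_counts[d] = dist_counts.get(d, 0) + 1
            stack.remove(ref)
        stack.insert(0, ref)                  # most recently used on top
    n = len(trace)
    return lambda c: sum(v for d, v in dist_counts.items() if d <= c) / n

hits = lru_hit_function("ABCABDABC")
print(hits(2), hits(3))   # hit ratios for cache sizes 2 and 3
```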

Journal ArticleDOI
TL;DR: The high-level-language-graph transformations that must be performed to achieve high performance for numerical and nonnumerical programs executed in a dataflow computing environment are described for Lisp, using the DCBL transformations.
Abstract: The authors compare dataflow computing models, languages, and dataflow computing machines for numerical and nonnumerical computations. The high-level-language-graph transformations that must be performed to achieve high performance for numerical and nonnumerical programs executed in a dataflow computing environment are described for Lisp, using the DCBL transformations. Some general problems of dataflow computing machines are discussed. Performance evaluation measurements obtained by executing benchmark programs in the ETL nonnumerical dataflow computing environment, the EM-3, are presented.

Proceedings ArticleDOI
15 Jun 1988
TL;DR: In this article, an experimental process control system under development at Caltech is described, which is intended to be a source of benchmark control and identification problems, and a first principles theoretical model is developed and compared to preliminary experimental data.
Abstract: This paper describes an experimental process control system under development at Caltech. It is intended to be a source of benchmark control and identification problems. A first principles theoretical model is developed and compared to preliminary experimental data.

Book ChapterDOI
01 Jun 1988
TL;DR: In this article, two techniques for the improvement of the efficiency of the analysis of DSPN are outlined, using a previously published model of a high speed local area network as an example of the application of the proposed techniques, and as a benchmark for the assessment of their efficiency.
Abstract: The applicability of DSPN models has been limited by the computational complexity of the algorithm for the evaluation of the steady state probability distribution over reachable markings, so that it was often necessary to resort to simulation rather than analysis. Two techniques for the improvement of the efficiency of the analysis of DSPN are outlined in this paper, using a previously published model of a high speed local area network as an example of the application of the proposed techniques, and as a benchmark for the assessment of their efficiency.

01 Jun 1988
TL;DR: A machine analyzer is described, which measures the performance of a given machine on Fortran source language constructs and a program analyzer, which analyzes Fortran programs and determines the frequency of execution of each of the same set of source language operations.
Abstract: From runs of standard benchmark suites, it is possible neither to characterize a machine nor to predict the running time of other benchmarks which have not been run. In this paper, we report on a new approach to benchmarking and machine characterization. We describe the creation and use of a machine analyzer, which measures the performance of a given machine on Fortran source language constructs. The machine analyzer yields a set of parameters which characterize the machine and spotlight its strong and weak points. We also describe a program analyzer, which analyzes Fortran programs and determines the frequency of execution of each of the same set of source language operations. We then show that by combining a machine characterization and a program characterization, we are able to predict with good accuracy the running time of a given benchmark on a given machine. Characterizations are provided for the Cray X-MP/48, Cyber 205, IBM 3090/200, Amdahl 5840, Convex C-1, VAX 8600, VAX 11/785, VAX 11/780, SUN 3/50 and IBM RT-PC/125, and for the following benchmark program suites: Los Alamos (BMK8A1), Baskett, Linpack, Livermore Loops, Mandelbrot Set, NAS Kernels, Shell Sort, Smith, Whetstone and Sieve of Eratosthenes.
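The prediction step combines the two characterizations linearly: predicted time is the sum, over source-level operations, of execution frequency times per-operation cost. The operation names and timings below are made up for illustration.

```python
# machine analyzer output: seconds per source-level operation (invented)
machine = {"fadd": 50e-9, "fmul": 60e-9, "loop_overhead": 20e-9}
# program analyzer output: execution counts per operation (invented)
program = {"fadd": 4_000_000, "fmul": 3_000_000, "loop_overhead": 1_000_000}

predicted = sum(program[op] * machine[op] for op in program)
print(f"predicted running time: {predicted:.3f} s")   # 0.400 s
```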

01 Jun 1988
TL;DR: The class of forward branching programs is defined and shown to be a superset of the strongly left-sequential programs, and an efficient reduction algorithm is designed that works for a sequence of reductions whenever the program is forward branching.
Abstract: High-level programming languages have introduced models of computation that are very different from the traditional von Neumann machine. Translating such programs into efficient machine code for traditional machines is therefore non-trivial. In this thesis, we are interested in generating code for Equational Programs. An equational program consists of equations of the form T = U, interpreted as reduction rules, i.e., the program takes an input term and repeatedly replaces instances of left-hand sides of equations by corresponding right-hand sides. The process, called reduction, stops when no instance of a left-hand side can be found. The term is then in normal form. The reduction process may go on forever if no normal form exists. Huet and Levy showed that the strongly sequential programs have efficient translations, in that there is an efficient algorithm for finding an instance of a left-hand side that it is necessary to replace in every sequence of reductions to normal form. Their algorithm does not, however, consider sequences of reductions. Simply restarting the algorithm after each replacement is too expensive. Hoffmann and O'Donnell defined the strongly left-sequential programs, for which an efficient algorithm exists that solves this problem. In this thesis we define the class of forward branching programs, and show that it is a superset of the strongly left-sequential programs. We design an efficient reduction algorithm that works for a sequence of reductions whenever the program is forward branching. We also give a decision procedure for the forward branching programs. Huet and Levy, as well as Hoffmann and O'Donnell, ignored the problem of replacing instances of left-hand sides with right-hand sides. Experience with previous implementations suggests that these replacements are serious bottlenecks. We define an efficient replacement algorithm for the forward branching programs. Our algorithm is a novel application of partial evaluation. Finally, we present an experimental implementation of a compiler for forward branching programs. We present a simple benchmark that suggests that our implementation performs as well as compiled Franz Lisp and within a factor of two of Unix C.
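A minimal picture of reduction to normal form (ignoring the thesis's sequentiality machinery): terms as nested tuples, leftmost-outermost rewriting with two Peano-addition rules.

```python
# Rules: add(0, y) = y   and   add(s(x), y) = s(add(x, y))
def step(t):
    """Try one leftmost-outermost rewrite; return (new_term, changed)."""
    if isinstance(t, tuple) and t[0] == "add":
        a, b = t[1], t[2]
        if a == "0":
            return b, True
        if isinstance(a, tuple) and a[0] == "s":
            return ("s", ("add", a[1], b)), True
    if isinstance(t, tuple):                      # descend into subterms
        for i, sub in enumerate(t[1:], 1):
            new, changed = step(sub)
            if changed:
                return t[:i] + (new,) + t[i + 1:], True
    return t, False

def normalize(t):
    changed = True
    while changed:                                # reduce to normal form
        t, changed = step(t)
    return t

two = ("s", ("s", "0"))
print(normalize(("add", two, two)))   # s(s(s(s(0))))
```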

Proceedings ArticleDOI
01 Jan 1988

Proceedings ArticleDOI
12 Sep 1988
TL;DR: A method for PLA (programmable logic-array) test-pattern generation based on a branch-and-bound algorithm that exploits function monotonicity is presented; irrevocable input assignments are made first, resulting in the efficient generation of compact test sets.
Abstract: A method for PLA (programmable logic-array) test-pattern generation based on a branch-and-bound algorithm that exploits function monotonicity is presented. The algorithm makes irrevocable input assignments first, resulting in the efficient generation of compact test sets. In most cases there is no backtracking. An intelligent branching heuristic is presented. The algorithm handles extended fault models, including cross-point and delay faults. Heuristics which speed up test-set generation and improve test-set compaction are discussed. Results of tests on a wide range of benchmark PLAs are included.

Journal ArticleDOI
TL;DR: This work has demonstrated that the assumption that textually identical loop statements will take the same amount of time to execute in a dual loop design is inaccurate in these specific test cases.
Abstract: Benchmarks that measure time values using a standard system clock often employ a dual loop design. One of the important assumptions of this design is that textually identical loop statements will take the same amount of time to execute. This assumption has been tested on two bare computers with Ada® test programs and has been demonstrated to be inaccurate in these specific test cases.
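The dual loop design in question can be sketched as follows: time an empty control loop and a loop containing the operation, and attribute the difference to the operation. The code assumes the two loops cost the same, which is exactly the assumption the article tests.

```python
import time

def dual_loop(n):
    """Estimate the cost of one operation as (measured loop time minus
    control loop time) / n."""
    t0 = time.perf_counter()
    for _ in range(n):
        pass                      # control loop: overhead only
    t1 = time.perf_counter()
    acc = 0
    for _ in range(n):
        acc = acc + 1             # measured loop: overhead + operation
    t2 = time.perf_counter()
    return ((t2 - t1) - (t1 - t0)) / n   # estimated seconds per operation

print(dual_loop(1_000_000))
```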

Proceedings ArticleDOI
01 Jun 1988
TL;DR: This work examines the problem of evaluating performance of supercomputer architectures on sparse (matrix) computations and lays out the details of a benchmark package that consists of several independent modules, each of which has a distinct role.
Abstract: We examine the problem of evaluating the performance of supercomputer architectures on sparse (matrix) computations and lay out the details of a benchmark package for this problem. Whereas there already exist a number of benchmark packages for scientific computations, such as the Livermore Loops, the Linpack benchmark and the Los Alamos benchmark, none of these deals with the specific nature of sparse computations. Sparse matrix techniques are characterized by the relatively small number of operations per data element and the irregularity of the computation. Both facts may significantly increase the overhead time due to memory traffic. For this reason, the performance evaluation of sparse computations should not only take into account the CPU performance but also the degradation of performance caused by high memory traffic. Furthermore, sparse matrix techniques comprise a variety of different types of basic computations. Taking these considerations into account we propose a benchmark package that consists of several independent modules, each of which has a distinct role.
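A representative sparse kernel of the kind such a package must cover is the CSR matrix-vector product: few arithmetic operations per element and an indirect gather from x, so memory traffic dominates. The 3x3 example is invented.

```python
import numpy as np

def spmv(data, indices, indptr, x):
    """Sparse matrix-vector product, matrix in compressed sparse row
    (CSR) form: data holds nonzeros, indices their column numbers,
    indptr the row boundaries."""
    y = np.zeros(len(indptr) - 1)
    for i in range(len(y)):
        for k in range(indptr[i], indptr[i + 1]):
            y[i] += data[k] * x[indices[k]]   # indirect (gather) access
    return y

# 3x3 matrix [[2,0,1],[0,3,0],[4,0,5]] in CSR form
data    = np.array([2.0, 1.0, 3.0, 4.0, 5.0])
indices = np.array([0, 2, 1, 0, 2])
indptr  = np.array([0, 2, 3, 5])
print(spmv(data, indices, indptr, np.array([1.0, 1.0, 1.0])))  # [3. 3. 9.]
```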

Journal ArticleDOI
TL;DR: Design details and benchmark results are given for a Prolog interpreter that can be executed across a network by using message passing to implement AND-parallelism, and is simple and easy to use, yet significantly speeds up existing programs.
Abstract: Design details and benchmark results are given for a Prolog interpreter that can be executed across a network by using message passing to implement AND-parallelism. The system is simple and easy to use, yet significantly speeds up existing programs. The system hardware is a group of Sun 3/50 workstations connected to a 10-Mb/s Ethernet. The number of machines actually used by the system is determined when it is initialized. The benchmark programs to test the system are a Prolog compiler, a recursive Fibonacci program, an implementation of the standard quicksort algorithm, and a simple chess program.

Journal ArticleDOI
TL;DR: The authors discuss the design architecture of the TX1, provide some performance analysis for the design, and describe the debugging feature provided on the processor.
Abstract: The 32-bit TX1 microprocessor, developed to meet the architectural specification of Japan's TRON (The Real-time Operating system Nucleus) project, has been given a loosely coupled pipeline structure to meet the demands of high-performance systems. The authors discuss the design architecture of the TX1, provide some performance analysis for the design, and describe the debugging feature provided on the processor. Results for several benchmark programs show that the average performance of the TX1 is over 5 MIPS (million instructions per second).

Journal ArticleDOI
TL;DR: DUNIX is an operating system that integrates several computers, connected by a packet switching network, into a single UNIX machine, which exhibits surprisingly high performance.
Abstract: DUNIX is an operating system that integrates several computers, connected by a packet switching network, into a single UNIX machine. As far as the users and their software can tell, the system is a single large computer running UNIX. This illusion is created by cooperation of the computers' kernels. The kernels' mode of operation is novel. The software is procedure call oriented. The code that implements a specific system call (e.g., open) does not know whether the object in question (the file) is local or remote. That uniformity makes the kernel small and easy to maintain. The system behaves gracefully under subcomponent failures. Users who do not have objects (files, processes, ttys, etc.) on a given computer are not disturbed when that computer crashes. The system administrator may switch a disk from a "dead" computer to a healthy one, and remount the disk under the original path-name. After the switch, users may access files on that disk via the same old names. DUNIX exhibits surprisingly high performance. For a compilation benchmark, DUNIX is faster than 4.2 BSD, even if in the DUNIX case all the files in question are remote. Currently, at Bell Communications Research we have an installation running DUNIX on five DEC VAX computers connected by an Ethernet. This installation speaks TCP/IP and is on the Internet.

Proceedings ArticleDOI
07 Nov 1988
TL;DR: A model of optimal performance for a machine with an infinite number of processors having uniform memory accesses is presented and demonstrates that some circuits have significantly more parallelism than previously believed.
Abstract: The optimal level of performance from parallel discrete-event simulation depends on the circuit being simulated, the vectors being simulated, and the machine on which the simulation is being performed. Empirical studies based on very simple models suggest that the amount of parallelism available in typical circuits is very small. A model of optimal performance for a machine with an infinite number of processors having uniform memory accesses is presented. It demonstrates that some circuits have significantly more parallelism than previously believed. The model is refined to define the optimal load partitioning for a machine with a finite number of processors with uniform access and extended to define the optimal static data partitioning. A metric is obtained which can be used to benchmark different models of parallel simulation. The effectiveness of these models in detecting performance problems of the version of RSIM running on the BBN Butterfly is shown.
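The infinite-processor bound described above can be sketched directly: with unlimited uniform-access processors, the best possible speedup is the total number of events divided by the length of the longest dependency chain. The event graph below is invented.

```python
def max_parallelism(events, deps):
    """Upper bound on parallel simulation speedup with unlimited
    processors: total events / critical path length. `deps` maps an
    event to the events it must follow."""
    depth = {}
    def d(e):                       # longest dependency chain ending at e
        if e not in depth:
            depth[e] = 1 + max((d(p) for p in deps.get(e, [])), default=0)
        return depth[e]
    critical = max(d(e) for e in events)
    return len(events) / critical

events = ["e1", "e2", "e3", "e4", "e5"]
deps = {"e3": ["e1", "e2"], "e4": ["e3"], "e5": ["e2"]}
print(max_parallelism(events, deps))   # 5 events / chain length 3
```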

Proceedings ArticleDOI
10 Oct 1988
TL;DR: The authors briefly describe the ASP architecture and its implementation and report the results of an evaluation of its applicability to image-processing tasks, using the Defense Advanced Research Projects Agency image-understanding benchmark as a benchmark.
Abstract: The associative string processor (ASP) is a homogeneous, reconfigurable, programmable massively parallel processor which offers step-function advantages in cost performance and application flexibility due to its unique architecture and its use of state-of-the-art microelectronics. The authors briefly describe the ASP architecture and its implementation and report the results of an evaluation of its applicability to image-processing tasks. In order to provide a realistic demonstration of the above-mentioned advantages, a set of independently defined tasks (viz. the Defense Advanced Research Projects Agency (DARPA) image-understanding benchmark) was chosen for the evaluation, and the results are used to compare the performance of the ASP architecture with the performances of other parallel computer architectures when applied to the same computer vision tasks.

Journal ArticleDOI
TL;DR: The papers in this special issue of COMPEL describe six benchmark problems that can be used to validate eddy current computer codes and include computational results to the problems.
Abstract: The papers in this special issue of COMPEL describe six benchmark problems that can be used to validate eddy current computer codes. The papers include computational results to the problems. The results were presented at six workshops for the comparison of codes held in Europe, America, and Japan between March 1986 and August 1987.

Journal ArticleDOI
TL;DR: While the speedup of this implementation on highly OR-parallel problems is very good, overall performance is poor; still, many aspects of the implementation and its performance can prove useful in designing future systems for similar machines.
Abstract: The research focus in parallel logic programming is shifting rapidly from theoretical considerations and simulation on uniprocessors to implementation on true multiprocessors. This report presents performance figures from such a system, Boplog, for OR-parallel Horn clause logic programs on the BBN Butterfly Parallel Processor. Boplog is designed expressly for a large-scale shared memory multiprocessor with an Omega interconnect. The target machine and underlying execution model are described briefly. The primary focus of the paper is on detailed statistics taken from the execution of benchmark programs to assess the performance of the model and clarify the impact of design and architecture decisions. They show that while the speedup of this implementation on highly OR-parallel problems is very good, overall performance is poor. Despite its speed drawback, many aspects of the implementation and its performance can prove useful in designing future systems for similar machines. A binding model that prohibits constant-time access to bindings, and the inability of the machine to support an ambitious use of machine memory, appear to be the most damaging factors.