
Showing papers on "Benchmark (computing)" published in 1996


Book ChapterDOI
01 Jan 1996
TL;DR: This chapter continues with a detailed computational study of the most powerful algorithm on 162 benchmark problems and discusses the suitability of the algorithm for either very large or very difficult JSP instances.
Abstract: In this chapter we give a survey on the GA approaches considered so far. We continue with a detailed computational study of the most powerful algorithm on 162 benchmark problems. Finally we discuss the suitability of the algorithm for either very large or very difficult JSP instances.

711 citations


Journal ArticleDOI
TL;DR: In this paper, the authors describe automatic parallelization techniques in the SUIF (Stanford University Intermediate Format) compiler that result in good multiprocessor performance for array-based numerical programs.
Abstract: This article describes automatic parallelization techniques in the SUIF (Stanford University Intermediate Format) compiler that result in good multiprocessor performance for array-based numerical programs. Parallelizing compilers for multiprocessors face many hurdles. However, SUIF's robust analysis and memory optimization techniques enabled speedups on three fourths of the NAS and SPECfp95 benchmark programs.

592 citations


Journal ArticleDOI
TL;DR: This article presents compiler optimizations to improve data locality based on a simple yet accurate cost model and finds that, although performance improvements were difficult to achieve, the optimizations significantly improved several programs.
Abstract: In the past decade, processor speed has become significantly faster than memory speed. Small, fast cache memories are designed to overcome this discrepancy, but they are only effective when programs exhibit data locality. In this article, we present compiler optimizations to improve data locality based on a simple yet accurate cost model. The model computes both temporal and spatial reuse of cache lines to find desirable loop organizations. The cost model drives the application of compound transformations consisting of loop permutation, loop fusion, loop distribution, and loop reversal. To validate our optimization strategy, we implemented our algorithms and ran experiments on a large collection of scientific programs and kernels. Experiments illustrate that for kernels our model and algorithm can select and achieve the best loop structure for a nest. For over 30 complete applications, we executed the original and transformed versions and simulated cache hit rates. We collected statistics about the inherent characteristics of these programs and our ability to improve their data locality. To our knowledge, these studies are the first of such breadth and depth. We found performance improvements were difficult to achieve because benchmark programs typically have high hit rates even for small data caches; however, our optimizations significantly improved several programs.
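For a concrete feel for this kind of cost model, the following is a minimal sketch (not the article's actual model): it counts cache lines touched by a toy 2D loop nest under two loop orders, assuming row-major storage, 8-byte elements, 64-byte lines, and an array too large to stay cache-resident.

```python
# Toy cache-line count for a 2D loop nest under two loop orders.
# Assumptions (ours, not the article's): row-major layout, 8-byte elements,
# 64-byte cache lines, array much larger than the cache.
def lines_touched(n, m, elems_per_line, order):
    if order == "ij":                        # i outer, j inner: rows walked contiguously
        return n * -(-m // elems_per_line)   # ceil(m / elems_per_line) lines per row
    else:                                    # j outer, i inner: each access lands on a new line
        return n * m

n = m = 1024
for order in ("ij", "ji"):
    print(order, lines_touched(n, m, 64 // 8, order))   # ij: 131072, ji: 1048576
```

A model of this flavor would rank the ij order as the better loop organization and drive a loop permutation toward it.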

566 citations


Journal ArticleDOI
TL;DR: A tabu search algorithm for the multi-depot vehicle routing problem with capacity and route length restrictions is described and is shown to outperform existing heuristics.

342 citations


Journal ArticleDOI
TL;DR: A machine-independent model of program execution is developed to characterize both machine performance and program execution, and a metric for program similarity is developed that makes it possible to classify benchmarks with respect to a large set of characteristics.
Abstract: Standard benchmarking provides run-times for given programs on given machines, but fails to provide insight as to why those results were obtained (either in terms of machine or program characteristics) and fails to provide run-times for that program on some other machine, or for some other programs on that machine. We have developed a machine-independent model of program execution to characterize both machine performance and program execution. By merging these machine and program characterizations, we can estimate execution time for arbitrary machine/program combinations. Our technique allows us to identify those operations, either on the machine or in the programs, which dominate the benchmark results. This information helps designers in improving the performance of future machines and users in tuning their applications to better utilize the performance of existing machines. Here we apply our methodology to characterize benchmarks and predict their execution times. We present extensive run-time statistics for a large set of benchmarks including the SPEC and Perfect Club suites. We show how these statistics can be used to identify important shortcomings in the programs. In addition, we give execution time estimates for a large sample of programs and machines and compare these against benchmark results. Finally, we develop a metric for program similarity that makes it possible to classify benchmarks with respect to a large set of characteristics.
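The basic decomposition can be illustrated with a tiny, purely hypothetical example: the program is characterized by abstract-operation counts, the machine by per-operation times, and the prediction is their dot product (all numbers below are invented).

```python
# Hypothetical illustration of the machine/program decomposition; all numbers invented.
program_counts = {"flop": 2.0e9, "mem": 1.5e9, "branch": 3.0e8}   # program characterization
machine_times  = {"flop": 5e-9, "mem": 12e-9, "branch": 2e-9}     # machine characterization (s/op)

predicted = sum(program_counts[op] * machine_times[op] for op in program_counts)
print(f"predicted run time: {predicted:.2f} s")   # 2e9*5ns + 1.5e9*12ns + 3e8*2ns = 28.60 s
```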

230 citations


15 May 1996
TL;DR: The main contributions of this thesis are an 8-fold speedup and 4-fold memory size reduction over the baseline Sphinx-II system, and the improvement in speed is obtained from the following techniques: lexical tree search, phonetic fast match heuristic, and global best path search of the word lattice.
Abstract: Advances in speech technology and computing power have created a surge of interest in the practical application of speech recognition. However, the most accurate speech recognition systems in the research world are still far too slow and expensive to be used in practical, large vocabulary continuous speech applications. Their main goal has been recognition accuracy, with emphasis on acoustic and language modelling. But practical speech recognition also requires the computation to be carried out in real time within the limited resources (CPU power and memory size) of commonly available computers. There has been relatively little work in this direction that also preserves the accuracy of research systems. In this thesis, we focus on efficient and accurate speech recognition. It is easy to improve recognition speed and reduce memory requirements by trading away accuracy, for example by greater pruning, and using simpler acoustic and language models. It is much harder to improve both the recognition speed and reduce main memory size while preserving the accuracy. This thesis presents several techniques for improving the overall performance of the CMU Sphinx-II system. Sphinx-II employs semi-continuous hidden Markov models for acoustics and trigram language models, and is one of the premier research systems of its kind. The techniques in this thesis are validated on several widely used benchmark test sets using two vocabulary sizes of about 20K and 58K words. The main contributions of this thesis are an 8-fold speedup and 4-fold memory size reduction over the baseline Sphinx-II system. The improvement in speed is obtained from the following techniques: lexical tree search, phonetic fast match heuristic, and global best path search of the word lattice.
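The "global best path search of the word lattice" is, at heart, a best-path dynamic program over a DAG of word hypotheses. The sketch below is illustrative only (it is not Sphinx-II code); the lattice, scores, and node numbering are invented, and nodes are assumed to be numbered in topological order.

```python
# Toy word lattice: edges are (start_node, end_node, word, log_score); nodes are
# assumed topologically numbered, so a single pass in node order suffices.
edges = [
    (0, 1, "the", -1.0), (0, 1, "a", -1.4),
    (1, 2, "cat", -2.0), (1, 2, "cap", -2.3),
    (2, 3, "</s>", -0.1),
]
best = {0: (0.0, [])}                               # node -> (best score, best word sequence)
for u, v, word, score in sorted(edges):
    if u in best:
        cand = (best[u][0] + score, best[u][1] + [word])
        if v not in best or cand[0] > best[v][0]:
            best[v] = cand
print(best[3])                                      # (-3.1, ['the', 'cat', '</s>'])
```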

221 citations


Book
01 Jan 1996
TL;DR: Contents: Introduction: Mixed Analog-Digital Chips; The MOSFET: Introduction and Qualitative View.
Abstract: Contents: Introduction: Mixed Analog-Digital Chips; The MOSFET: Introduction and Qualitative View; MOSFET DC Modeling; MOSFET Small-Signal Modeling; Technology and Available Circuit Components; Layout. Appendices: Additional MOS Transistor Modeling Information; A Set of Benchmark Tests for Evaluating MOSFET Models for Analog Design; A Sample Spice Input File.

195 citations


Journal ArticleDOI
TL;DR: An efficient method for selecting important input variables when building a fuzzy model from data by systematically removing premises in the fuzzy rules of this initial model to search for the best simplified model without actually generating any new models.
Abstract: We present an efficient method for selecting important input variables when building a fuzzy model from data. Past methods for input variable selection require generating different models while searching for the optimal combination of variables; our method requires generating only one model that employs all possible input variables. To determine the important variables, premises in the fuzzy rules of this initial model are systematically removed to search for the best simplified model without actually generating any new models. We also present an efficient method for generating the initial model that typically must incorporate a large number of input variables. These methods are illustrated through application to the benchmark Box and Jenkins gas furnace data; the results are compared with those of other fuzzy models found in the literature.
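A hedged sketch of the general idea (not the paper's actual model or rule form): in a tiny zero-order Sugeno-style fuzzy model with Gaussian premises, "removing" a premise simply drops its factor from the rule's firing strength, so simplified models can be scored without building any new model. The rule parameters and inputs below are invented.

```python
import math

def fire(rule, x, removed, ridx):
    """Firing strength of one rule; premises listed in `removed` are skipped."""
    s = 1.0
    for i, (center, sigma) in enumerate(rule["premises"]):
        if (ridx, i) in removed:                    # premise pruned from this rule
            continue
        s *= math.exp(-((x[i] - center) ** 2) / (2 * sigma ** 2))
    return s

def predict(rules, x, removed=frozenset()):
    w = [fire(r, x, removed, ri) for ri, r in enumerate(rules)]
    return sum(wi * r["out"] for wi, r in zip(w, rules)) / (sum(w) or 1.0)

rules = [                                           # two invented rules over two inputs
    {"premises": [(0.0, 1.0), (0.0, 1.0)], "out": 1.0},
    {"premises": [(2.0, 1.0), (2.0, 1.0)], "out": 3.0},
]
x = (1.0, 2.0)
print(predict(rules, x))                            # full initial model
print(predict(rules, x, removed={(0, 1)}))          # rule 0 with its second premise removed
```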

141 citations


Book ChapterDOI
01 Jan 1996
TL;DR: A Parallel Tabu Search algorithm for the vehicle routing problem under capacity and route length restrictions is described; in the neighborhood search, the algorithm uses compound moves generated by an ejection chain process.
Abstract: In this paper we describe a Parallel Tabu Search algorithm for the vehicle routing problem under capacity and distance restrictions. In the neighborhood search, the algorithm uses compound moves generated by an ejection chain process. Parallel processing is used to explore the solution space more extensively and different parallel techniques are used to accelerate the search process. Tests were carried out on a network of Sun SPARC workstations and computational results for a set of benchmark problems prove the efficiency of the algorithm proposed.
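As background, the skeleton of a plain (sequential, single-swap) tabu search looks like the sketch below; it is a generic illustration on a toy permutation problem, not the paper's parallel ejection-chain algorithm, and all parameters are invented.

```python
import itertools, random

def cost(tour, d):                                   # toy tour length
    return sum(d[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def tabu_search(d, iters=200, tenure=7):
    n = len(d)
    cur = list(range(n)); random.shuffle(cur)
    best, tabu = cur[:], {}
    for it in range(iters):
        moves = []
        for i, j in itertools.combinations(range(n), 2):        # simple swap neighborhood
            cand = cur[:]; cand[i], cand[j] = cand[j], cand[i]
            c = cost(cand, d)
            if tabu.get((i, j), -1) < it or c < cost(best, d):   # tabu check + aspiration
                moves.append((c, (i, j), cand))
        c, (i, j), cur = min(moves)                              # best admissible move
        tabu[(i, j)] = it + tenure                               # forbid reversing it for a while
        if c < cost(best, d):
            best = cur[:]
    return best, cost(best, d)

random.seed(0)
d = [[abs(i - j) for j in range(8)] for i in range(8)]           # invented distances
print(tabu_search(d))
```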

133 citations


Proceedings ArticleDOI
01 Sep 1996
TL;DR: It is demonstrated that compiler-directed page coloring can lead to significant performance improvements over two commonly used page mapping strategies for machines with either direct-mapped or two-way set-associative caches, and is complementary to latency-hiding techniques such as prefetching.
Abstract: This paper presents a new technique, compiler-directed page coloring, that eliminates conflict misses in multiprocessor applications. It enables applications to make better use of the increased aggregate cache size available in a multiprocessor. This technique uses the compiler's knowledge of the access patterns of the parallelized applications to direct the operating system's virtual memory page mapping strategy. We demonstrate that this technique can lead to significant performance improvements over two commonly used page mapping strategies for machines with either direct-mapped or two-way set-associative caches. We also show that it is complementary to latency-hiding techniques such as prefetching. We implemented compiler-directed page coloring in the SUIF parallelizing compiler and on two commercial operating systems. We applied the technique to the SPEC95fp benchmark suite, a representative set of numeric programs. We used the SimOS machine simulator to analyze the applications and isolate their performance bottlenecks. We also validated these results on a real machine, an eight-processor 350 MHz Digital AlphaServer. Compiler-directed page coloring leads to significant performance improvements for several applications. Overall, our technique improves the SPEC95fp rating for eight processors by 8% over Digital UNIX's page mapping policy and by 20% over page coloring, a standard page mapping policy. The SUIF compiler achieves a SPEC95fp ratio of 57.4, the highest ratio to date.
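The mechanism itself is simple to sketch (this illustrates the page-coloring idea, not the SUIF or OS implementation; cache and page sizes are assumed): in a physically indexed cache, the index bits above the page offset define a page "color", and pages of the same color contend for the same cache region.

```python
PAGE = 4096                           # assumed page size
CACHE = 1 << 20                       # assumed 1 MB direct-mapped cache
COLORS = CACHE // PAGE                # 256 page colors

def color(frame):                     # physical frame number -> page color
    return frame % COLORS

naive   = [0, COLORS, 2 * COLORS, 3 * COLORS]   # frames that alias in the cache
colored = [0, 1, 2, 3]                          # frames chosen to get distinct colors
print([color(f) for f in naive])      # [0, 0, 0, 0] -> these pages all conflict
print([color(f) for f in colored])    # [0, 1, 2, 3] -> conflict-free
```

Compiler-directed page coloring uses the compiler's knowledge of which pages a parallel loop touches together to steer the operating system toward the second kind of frame assignment.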

123 citations


Journal ArticleDOI
TL;DR: A prototype system named GATTO is used to assess the effectiveness of the approach in terms of result quality and CPU time requirements and the results are the best ones reported in the literature for most of the largest standard benchmark circuits.
Abstract: This paper deals with automated test pattern generation for large synchronous sequential circuits and describes an approach based on genetic algorithms. A prototype system named GATTO is used to assess the effectiveness of the approach in terms of result quality and CPU time requirements. An account is also given of a distributed version of the same algorithm, named GATTO*. Being based on the PVM library, it runs on any network of workstations and is able to either reduce the required time, or improve the result quality with respect to the monoprocessor version. In the latter case, in terms of Fault Coverage, the results are the best ones reported in the literature for most of the largest standard benchmark circuits. The flexibility of GATTO enables users to easily trade off fault coverage and CPU time to suit their needs.

Journal ArticleDOI
TL;DR: The results on benchmark and real circuits indicate that a large number of redundancies are found, much faster than a test-generation-based approach for redundancy identification, however, FIRE is not guaranteed to identify all redundancies in a circuit.
Abstract: FIRE is a novel Fault-Independent algorithm for combinational REdundancy identification. The algorithm is based on a simple concept that a fault which requires a conflict as a necessary condition for its detection is undetectable and hence redundant. FIRE does not use the backtracking-based exhaustive search performed by fault-oriented automatic test generation algorithms, and identifies redundant faults without any search. Our results on benchmark and real circuits indicate that we find a large number of redundancies (about 80% of the combinational redundancies in benchmark circuits), much faster than a test-generation-based approach for redundancy identification. However, FIRE is not guaranteed to identify all redundancies in a circuit.

Book ChapterDOI
01 Jul 1996
TL;DR: This work describes how to model and verify real-time systems using the formal verification tool Cospan, which supports automata-theoretic verification of coordinating processes with timing constraints.
Abstract: We describe how to model and verify real-time systems using the formal verification tool Cospan. The verifier supports automata-theoretic verification of coordinating processes with timing constraints. We discuss different heuristics, and our experiences with the tool for certain benchmark problems appearing in the verification literature.

Proceedings ArticleDOI
15 Feb 1996
TL;DR: Preliminary results indicate that compared to LUT-based FPGAs the Hybrid offers savings of more than a factor of two in terms of chip area.
Abstract: This paper proposes a new field-programmable architecture that is a combination of two existing technologies: Field Programmable Gate Arrays (FPGAs) based on LookUp Tables (LUTs), and Complex Programmable Logic Devices based on PALs/PLAs. The methodology used for development of the new architecture, called Hybrid FPGA, is based on analysis of a large set of benchmark circuits, in which we determine what types of logic resources best match the needs of the circuits. The proposed Hybrid FPGA is evaluated by manually technology mapping a set of circuits into the new architecture and estimating the total chip area needed for each circuit, compared to the area that would be required if only LUTs were available. Preliminary results indicate that compared to LUT-based FPGAs the Hybrid offers savings of more than a factor of two in terms of chip area.

Patent
28 Feb 1996
TL;DR: A benchmarking application for testing the performance of a database server (14) includes a plurality of execution parameters (82) and a program (78) operable to read the execution parameters as discussed by the authors.
Abstract: A benchmarking application for testing the performance of a database server (14) includes a plurality of execution parameters (82) and a program (78) operable to read the execution parameters (82). Processes (56, 58, 60) are generated by the program (78) in accordance with the execution parameters (82). Each process (56, 58, 60) represents a user (16, 18, 20) of the database server (14) and generates benchmark transactions (108) for submission to the database server (14).
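A minimal sketch of the shape such a harness could take (invented names and parameters, not the patent's implementation): read execution parameters, spawn one process per simulated user, and have each process submit its benchmark transactions.

```python
import multiprocessing, random, time

def user(user_id, txns, think_time):
    for _ in range(txns):
        time.sleep(think_time * random.random())          # simulated think time
        # a real harness would submit a SQL transaction to the database server here
        print(f"user {user_id}: transaction submitted")

if __name__ == "__main__":
    params = {"users": 4, "txns_per_user": 3, "think_time": 0.01}   # execution parameters
    procs = [multiprocessing.Process(target=user,
                                     args=(i, params["txns_per_user"], params["think_time"]))
             for i in range(params["users"])]
    for p in procs: p.start()
    for p in procs: p.join()
```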

Proceedings ArticleDOI
03 Feb 1996
TL;DR: The introduction of a new metric, called the R-metric, to evaluate the representativeness of reduced traces when applied to a wide class of processor designs and the development of a novel graph-based heuristic to generate reduced traces based on the notions incorporated in the metric.
Abstract: Performance evaluation of processor designs using dynamic instruction traces is a critical part of the iterative design process. The widening gap between the billions of instructions in such traces for benchmark programs and the throughput of timers performing the analysis in the tens of thousands of instructions per second has led to the use of reduced traces during design. This opens up the issue of whether these traces are truly representative of the actual workload in these benchmark programs. The first key result in this paper is the introduction of a new metric, called the R-metric, to evaluate the representativeness of these reduced traces when applied to a wide class of processor designs. The second key result is the development of a novel graph-based heuristic to generate reduced traces based on the notions incorporated in the metric. These ideas have been implemented in a prototype system (SMART) for generating representative and reduced traces. Extensive experimental results are presented on various benchmarks to demonstrate the quality of the synthetic traces and the uses of the R-metric.

Book
01 Dec 1996
TL;DR: Reports on work done to develop a benchmark problem for genetic programming, the royal tree: a function that accounts for tree shape as part of its evaluation function, thus controlling for a parameter not often found in the GP literature.
Abstract: We report on work done to develop a benchmark problem for genetic programming, both as a difficult problem to test GP abilities and as a platform for tuning GP parameters. This benchmark, the royal tree, is a function that accounts for tree shape as part of its evaluation function, thus it controls for a parameter not often found in the GP literature. It also is a progressive function, allowing the user to set the difficulty of the problem attempted. We not only describe the function, but also report on results of using island parallelism for solving GP problems. The results obtained are somewhat surprising, as it appears that a single large population outperforms a group of smaller populations under all the conditions tested. 15.1 Introduction Given the multiplicity of GP programs that could produce the correct solution for a particular problem, it is difficult to judge the effectiveness of various architectural changes or parameter settings on the performance of a GP system. We encountered these problems directly in the design of our genetic programming tool lilgp. When lilgp was completed, we wanted to test how well it solved a set of standard GP problems. In fact, for a new GP system it is difficult to judge whether it is performing as intended or not, since the programs it generates are not necessarily identical to those generated by other GP systems. This raised two questions: what constitutes a "standard" problem in GP, and how do we rate the performance of a system on such a problem. One of the goals of this research was to create a benchmark problem to test how well a particular GP configuration would perform as compared to other configurations. Such benchmarks have existed for some time in the GA field, in particular the royal road problems of Holland [Jones 1994]. In creating the royal road, Holland addressed three issues. First, the royal road provides a proof-of-principle for the kind of difficult problems, exhibiting deception, that a genetic algorithm is capable of solving. Second, it serves as a benchmark of performance for tuning GA parameters. For example, at ICGA93, Holland claimed a specialized, properly tuned GA

Journal ArticleDOI
TL;DR: Over 25 implementations of different functional languages are benchmarked using the same program, a floating-point intensive application taken from molecular biology, and the principal aspects studied are compile time and execution time for the various implementations that were benchmarked.
Abstract: Over 25 implementations of different functional languages are benchmarked using the same program, a floating-point intensive application taken from molecular biology. The principal aspects studied are compile time and execution time for the various implementations that were benchmarked. An important consideration is how the program can be modified and tuned to obtain maximal performance on each language implementation. With few exceptions, the compilers take a significant amount of time to compile this program, though most compilers were faster than the then current GNU C compiler (GCC version 2.5.8). Compilers that generate C or Lisp are often slower than those that generate native code directly: the cost of compiling the intermediate form is normally a large fraction of the total compilation time. There is no clear distinction between the runtime performance of eager and lazy implementations when appropriate annotations are used: lazy implementations have clearly come of age when it comes to implementing largely strict applications, such as the Pseudoknot program. The speed of C can be approached by some implementations, but to achieve this performance, special measures such as strictness annotations are required by non-strict implementations. The benchmark results have to be interpreted with care. Firstly, a benchmark based on a single program cannot cover a wide spectrum of 'typical' applications. Secondly, the compilers vary in the kind and level of optimisations offered, so the effort required to obtain an optimal version of the program is similarly varied.

Journal ArticleDOI
TL;DR: The proposed Genetic Algorithm for the Floorplan Area Optimization problem is based on suitable techniques for solution encoding and evaluation function definition, effective cross-over and mutation operators, and heuristic operators which further improve the method's effectiveness.
Abstract: The paper describes a Genetic Algorithm for the Floorplan Area Optimization problem. The algorithm is based on suitable techniques for solution encoding and evaluation function definition, effective cross-over and mutation operators, and heuristic operators which further improve the method's effectiveness. An adaptive approach automatically provides the optimal values for the activation probabilities of the operators. Experimental results show that the proposed method is competitive with the most effective ones as far as CPU time requirements and result accuracy are concerned, but it also presents some advantages. It requires a limited amount of memory, it is not sensitive to special structures which are critical for other methods, and has a complexity which grows linearly with the number of implementations. Finally, we demonstrate that the method is able to handle floorplans much larger (in terms of number of basic rectangles) than any benchmark previously considered in the literature.

Proceedings ArticleDOI
03 Jan 1996
TL;DR: This paper proposes a novel approach to obtain a lower bound of the maximum power consumption using Automatic Test Generation (ATG) technique and shows that this approach generates the lower bound with the quality which cannot be achieved using simulation-based techniques.
Abstract: Excessive instantaneous power consumption in VLSI circuits may reduce the reliability and performance of VLSI chips. Hence, to synthesize circuits with high reliability, it is essential to efficiently obtain a precise estimation of the maximum power dissipation. However, due to the inherent input-pattern dependence of the problem, it is intractable to conduct an exhaustive search for circuits with a large number of primary inputs. Hence, the practical approach is to generate a tight lower bound and an upper bound for maximum power dissipation within a reasonable amount of CPU time. In this paper, instead of using the traditional simulation-based techniques, we propose a novel approach to obtain a lower bound of the maximum power consumption using Automatic Test Generation (ATG) technique. Experiments with MCNC and ISCAS-85 benchmark circuits show that our approach generates the lower bound with the quality which cannot be achieved using simulation-based techniques. In addition, a Monte Carlo based technique to estimate maximum power dissipation is described. It not only serves as a comparison version for our ATG approach, but also generates a metric to measure the quality of a lower bound from a statistical point of view.
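The Monte Carlo baseline mentioned at the end is easy to sketch (illustrative only: a toy netlist and zero-delay toggle counting, not the paper's power model): apply random vector pairs and keep the largest number of gate-output toggles seen as a lower bound on peak switching activity.

```python
import random

def evaluate(netlist, inputs):
    """Zero-delay evaluation of a topologically ordered AND/OR netlist."""
    vals = dict(inputs)
    for gate, (op, a, b) in netlist.items():
        vals[gate] = (vals[a] & vals[b]) if op == "AND" else (vals[a] | vals[b])
    return vals

netlist = {"g1": ("AND", "a", "b"), "g2": ("OR", "g1", "c")}   # toy circuit
random.seed(1)
best = 0
for _ in range(1000):
    v1 = {x: random.randint(0, 1) for x in "abc"}
    v2 = {x: random.randint(0, 1) for x in "abc"}
    o1, o2 = evaluate(netlist, v1), evaluate(netlist, v2)
    best = max(best, sum(o1[g] != o2[g] for g in netlist))     # gate-output toggles
print("max toggles observed (lower bound):", best)
```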

Proceedings ArticleDOI
15 Feb 1996
TL;DR: This paper presents a method for generating large random circuits with a fixed number of inputs, outputs, blocks, pins per cell, and approximate Rent exponent, and finds that routability is best predicted by estimating the total wirelength in the circuit, not the mean wirelength times pins per cell.
Abstract: FPLD architectures are often designed based on the results of experiments with "typical" benchmark circuits. For very large FPLDs, it may be difficult to obtain enough benchmark circuits to accurately evaluate an architecture. In this paper, we present a method for generating large random circuits with a fixed number of inputs, outputs, blocks, pins per cell, and approximate Rent exponent. The circuits generated are used to evaluate several routability measures. We find that routability is best predicted by estimating the total wirelength in the circuit, not the mean wirelength times pins per cell.

Journal ArticleDOI
TL;DR: A near-optimum parallel algorithm for solving the facility layout problem, which is NP-complete, is presented in this paper; the algorithm has given improved solutions for several benchmark problems over the best existing algorithms.

Proceedings ArticleDOI
01 Jun 1996
TL;DR: It is claimed that tree based algorithms, like the one described in this paper, should be the technique of choice for basic blocks code generation with heterogeneous memory register architectures.
Abstract: In this paper we address the problem of code generation for basic blocks in heterogeneous memory-register DSP processors. We propose a new technique, based on register-transfer paths, that can be used for efficiently dismantling basic block DAGs (Directed Acyclic Graphs) into expression trees. This approach builds on recent results which report an optimal code generation algorithm for expression trees for these architectures. This technique has been implemented and experimentally validated for the TMS320C25, a popular fixed point DSP processor. The results show that good code quality can be obtained using the proposed technique. An analysis of the type of DAGs found in the DSPstone benchmark programs reveals that the majority of basic blocks in this benchmark set are expression trees and leaf DAGs. This leads to our claim that tree based algorithms, like the one described in this paper, should be the technique of choice for basic blocks code generation with heterogeneous memory register architectures.
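Dismantling a DAG into expression trees can be pictured with the toy sketch below: cut every interior node that has more than one use, and let each cut point root its own tree. This shows only the general idea; the paper's actual criterion is based on register-transfer paths, and the DAG here is invented.

```python
from collections import defaultdict

dag = {                      # node -> operand nodes (leaves have no operands)
    "t1": ["a", "b"], "t2": ["t1", "c"], "t3": ["t1", "d"],   # t1 has two uses
    "a": [], "b": [], "c": [], "d": [],
}
uses = defaultdict(int)
for ops in dag.values():
    for o in ops:
        uses[o] += 1

shared = {n for n, ops in dag.items() if ops and uses[n] > 1}   # interior nodes to cut
roots = shared | {n for n in dag if dag[n] and uses[n] == 0}    # roots of the resulting trees

def tree(n, at_root=True):
    if not at_root and n in shared:
        return n                          # reference to the value computed by another tree
    return (n, [tree(o, False) for o in dag[n]]) if dag[n] else n

for r in sorted(roots):
    print(r, "->", tree(r))               # t1, t2, t3 each become an expression tree
```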

Journal ArticleDOI
TL;DR: In this paper, the generalized integral transform technique (GITT) is employed to handle the steady two-dimensional incompressible Navier-Stokes equations in stream function-only formulation.

Journal ArticleDOI
TL;DR: This paper evaluates the IBM SP2 architecture, the AIX parallel programming environment, and the IBM message-passing library through STAP (Space-Time Adaptive Processing) benchmark experiments, and conducts a scalability analysis to reveal the performance growth rate as a function of machine size and STAP problem size.
Abstract: This paper evaluates the IBM SP2 architecture, the AIX parallel programming environment, and the IBM message-passing library (MPL) through STAP (Space-Time Adaptive Processing) benchmark experiments. Only coarse-grain parallelism was exploited on the SP2 due to its high communication overhead. A new parallelization scheme is developed for programming message passing multicomputers. Parallel STAP benchmark structures are illustrated with domain decomposition, efficient mapping of partitioned programs, and optimization of collective communication operations. We measure the SP2 performance in terms of execution time, Gflop/s rate, speedup over a single SP2 node, and overall system utilization. With 256 nodes, the Maui SP2 demonstrated the best performance of 23 Gflop/s in executing the High-Order Post-Doppler program, corresponding to a 34% system utilization. We have conducted a scalability analysis to reveal the performance growth rate as a function of machine size and STAP problem size. Important lessons learned from these parallel processing benchmark experiments are discussed in the context of real-time, adaptive, radar signal processing on massively parallel processors (MPP).
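The quoted 34% utilization is consistent with a simple back-of-the-envelope check, assuming each SP2 node peaks at roughly 266 Mflop/s (a POWER2 figure that is our assumption, not stated in the abstract):

```python
# Utilization = achieved rate / (number of nodes * per-node peak); peak per node is assumed.
nodes, achieved_gflops, per_node_peak_gflops = 256, 23.0, 0.266
utilization = achieved_gflops / (nodes * per_node_peak_gflops)
print(f"{utilization:.0%}")   # ~34%
```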


Proceedings ArticleDOI
12 Jun 1996
TL;DR: This paper discusses the applicability of the wireless MAC protocols and validates their usefulness for real-time communications based on the parameters in the benchmark table.
Abstract: This paper presents the performance analysis of the wireless medium access control (WMAC) protocol and the remote frame medium access control (RFMAC) protocol for a wireless control area network (WCAN). These two MAC protocols are suggested for distributed and centralised wireless communications respectively, as part of the complete WCAN project. The performances of the protocols are evaluated by simulating the models of the protocols using a set of signals called the "SAE benchmark". The benchmark table provides an example to illustrate the application of a control area network (CAN) system. This paper discusses the applicability of the wireless MAC protocols and validates their usefulness for real-time communications based on the parameters in the benchmark table.

01 Jan 1996
TL;DR: The benchmark results based on a variety of DSP algorithms in video processing, digital communication, digital filtering, and speech recognition confirm the performance, efficiency and generality of the PADDI-2 architecture.
Abstract: A data-driven multiprocessor architecture called PADDI-2 specially designed for rapid prototyping of high throughput digital signal processing algorithms is presented. Characteristics of typical high speed DSP systems were examined and the efficiencies and deficiencies of existing traditional architectures were studied to establish the architectural requirements, and to guide the architectural design. The proposed PADDI-2 architecture is a highly scalable and modular, multiple-instruction stream multiple-data stream (MIMD) architecture. It consists of a large number of fine-grain processing elements called nanoprocessors interconnected by a flexible and high-bandwidth communication network. The basic idea is that a data flow graph representing a DSP algorithm is directly mapped onto a network of nanoprocessors. The algorithm is executed by the nanoprocessors executing the operations associated with the assigned data flow nodes in a data-driven manner. High computation power is achieved by using multiple nanoprocessors to exploit the large amount of fine-grain parallelism inherent in the target algorithms. Programming flexibility is provided by the MIMD control strategy and the flexible interconnection network which can be reconfigured to handle a wide range of DSP algorithms, including those with heterogeneous communication patterns. As a proof of concept, a single-chip multiprocessor integrated circuit containing 48 16-bit nanoprocessors was designed and fabricated in a 2-metal 1-µm CMOS technology. A 2-level, area-efficient communication network occupying only 17% of the core area provides flexible and high-bandwidth inter-processor communications. Running at 50 MHz, the chip achieves 2.4 GOPS peak performance and 800 MBytes per second I/O bandwidth. An integrated development system including an assembler, a VHDL-based system simulator, and a demonstration board has been developed for PADDI-2 program development and demonstration. The benchmark results based on a variety of DSP algorithms in video processing, digital communication, digital filtering, and speech recognition confirm the performance, efficiency and generality of the architecture. Moreover, when compared with several competitive architectures that target the same application domain, PADDI-2 is in general about 2 to 3 times better in terms of hardware efficiency.
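The data-driven execution model described here can be illustrated conceptually (this sketch has nothing to do with PADDI-2's actual hardware or tools): each node of a dataflow graph fires as soon as all of its input tokens are available.

```python
graph = {                       # node -> (function, input nodes); "x" and "y" are external inputs
    "add": (lambda a, b: a + b, ["x", "y"]),
    "mul": (lambda a, b: a * b, ["add", "y"]),
}
tokens = {"x": 3, "y": 4}       # initial input tokens
pending = dict(graph)
while pending:
    fired = False
    for name, (fn, ins) in list(pending.items()):
        if all(i in tokens for i in ins):           # all operands have arrived -> fire
            tokens[name] = fn(*(tokens[i] for i in ins))
            del pending[name]
            fired = True
    if not fired:
        raise RuntimeError("no node can fire")
print(tokens["mul"])                                # (3 + 4) * 4 = 28
```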

Journal ArticleDOI
TL;DR: A way to obtain the entire cost versus delay tradeoff curve of a combinational logic circuit in an efficient way is described, and every point on the resulting curve is the global optimum of the corresponding gate sizing problem.
Abstract: The gate sizing problem is the problem of finding load drive capabilities for all gates in a given Boolean network such that a given delay limit is kept, and the necessary cost in terms of active area usage and/or power consumption is minimal. This paper describes a way to obtain the entire cost versus delay tradeoff curve of a combinational logic circuit in an efficient way. Every point on the resulting curve is the global optimum of the corresponding gate sizing problem. The problem is solved by mapping it onto piecewise linear models in such a way that a piecewise linear (circuit) simulator can do the job. It is shown that this setup is very efficient, and can produce tradeoff curves for large circuits (thousands of gates) in a few minutes. Benchmark results for the entire set of MCNC '91 two-level examples are given.
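What such a cost-versus-delay curve looks like can be shown with a brute-force sketch on a toy gate chain (illustrative only: discrete sizes, an invented per-stage delay model, and exhaustive enumeration instead of the paper's piecewise-linear formulation):

```python
import itertools

sizes = [1, 2, 4]                                    # hypothetical drive strengths
def delay(chain):                                    # invented per-stage delay model
    return sum(1.0 / s + 0.2 * s for s in chain)
def area(chain):
    return sum(chain)

points = sorted((delay(c), area(c)) for c in itertools.product(sizes, repeat=3))
pareto, best_area = [], float("inf")
for d, a in points:                                  # keep the lower envelope (Pareto points)
    if a < best_area:
        pareto.append((round(d, 3), a))
        best_area = a
print(pareto)    # each point: the minimum area achievable under that delay limit
```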

Proceedings ArticleDOI
10 Nov 1996
TL;DR: Two hierarchical strategies for avoiding local optima during iterative improvement are proposed: (1) Partial Clustering, and (2) Module Restructuring, which are successful in reducing both area and wire length in addition to reducing the computational time required for optimization.
Abstract: In this paper, we propose a hybrid floorplanning methodology. Two hierarchical strategies for avoiding local optima during iterative improvement are proposed: (1) Partial Clustering, and (2) Module Restructuring. These strategies work for localizing nets connecting small modules in small regions, and conceal such small modules and their nets during the iterative improvement phase. This method is successful in reducing both area and wire length in addition to reducing the computational time required for optimization. Although our method only searches slicing floorplans, the results are superior to the results obtained even with non-slicing floorplans. We applied our method to the largest MCNC floorplan benchmark example, ami49, and industrial data. For the ami49 benchmark, we obtained results superior to any published results for both estimated area and routing results.