
Showing papers on "Benchmark (computing) published in 2000"


Journal ArticleDOI
01 May 2000
TL;DR: The design and implementation of Dynamo, a software dynamic optimization system that is capable of transparently improving the performance of a native instruction stream as it executes on the processor, are described and evaluated.
Abstract: We describe the design and implementation of Dynamo, a software dynamic optimization system that is capable of transparently improving the performance of a native instruction stream as it executes on the processor. The input native instruction stream to Dynamo can be dynamically generated (by a JIT for example), or it can come from the execution of a statically compiled native binary. This paper evaluates the Dynamo system in the latter, more challenging situation, in order to emphasize the limits, rather than the potential, of the system. Our experiments demonstrate that even statically optimized native binaries can be accelerated by Dynamo, and often by a significant degree. For example, the average performance of -O optimized SpecInt95 benchmark binaries created by the HP product C compiler is improved to a level comparable to their -O4 optimized version running without Dynamo. Dynamo achieves this by focusing its efforts on optimization opportunities that tend to manifest only at runtime, and hence opportunities that might be difficult for a static compiler to exploit. Dynamo's operation is transparent in the sense that it does not depend on any user annotations or binary instrumentation, and does not require multiple runs, or any special compiler, operating system or hardware support. The Dynamo prototype presented here is a realistic implementation running on an HP PA-8000 workstation under the HPUX 10.20 operating system.

935 citations
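The abstract does not spell out Dynamo's internal machinery, so the following Python sketch only illustrates the general idea behind counter-based hot-trace selection that dynamic optimizers of this kind rely on; the block structure, threshold, and fragment format are invented for the example, and a real system like Dynamo forms longer traces and optimizes them before caching:

    # Illustrative counter-based hot-trace selection (not Dynamo's actual code).
    # A "program" is a dict: block label -> (list of ops, label of next block).
    HOT_THRESHOLD = 3   # real systems use much larger thresholds

    def run(program, entry, steps=50):
        counters, fragments, executed = {}, {}, []
        label = entry
        for _ in range(steps):
            if label is None:
                break
            if label in fragments:
                # Hot path already recorded: execute the whole fragment at once.
                ops, label = fragments[label]
                executed.extend(ops)
                continue
            counters[label] = counters.get(label, 0) + 1
            ops, nxt = program[label]
            executed.extend(ops)
            if counters[label] == HOT_THRESHOLD and nxt is not None:
                # Record this block plus its successor as a single fragment; this
                # is the point where a real system would also optimize the trace.
                ops2, nxt2 = program[nxt]
                fragments[label] = (ops + ops2, nxt2)
            label = nxt
        return executed, fragments

    # Tiny two-block loop: A -> B -> A -> ...
    program = {"A": (["load", "add"], "B"), "B": (["cmp", "branch A"], "A")}
    trace, frags = run(program, "A", steps=20)
    print(sorted(frags))   # ['A', 'B']: both blocks of the loop became trace heads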


Journal ArticleDOI
TL;DR: CPU2000, as described in this paper, is a new CPU benchmark suite with 19 applications that have never before been in a SPEC CPU suite; it replaces the CPU95 suite, which SPEC retired on 30 June 2000.
Abstract: As computers and software have become more powerful, it seems almost human nature to want the biggest and fastest toy you can afford. But how do you know if your toy is tops? Even if your application never does any I/O, it's not just the speed of the CPU that dictates performance. Cache, main memory, and compilers also play a role. Software applications also have differing performance requirements. So whom do you trust to provide this information? The Standard Performance Evaluation Corporation (SPEC) is a nonprofit consortium whose members include hardware vendors, software vendors, universities, customers, and consultants. SPEC's mission is to develop technically credible and objective component- and system-level benchmarks for multiple operating systems and environments, including high-performance numeric computing, Web servers, and graphical subsystems. On 30 June 2000, SPEC retired the CPU95 benchmark suite. Its replacement is CPU2000, a new CPU benchmark suite with 19 applications that have never before been in a SPEC CPU suite. The article discusses how SPEC developed this benchmark suite and what the benchmarks do.

877 citations


Journal ArticleDOI
16 May 2000
TL;DR: This paper proposes three cost-based heuristic algorithms: Volcano-SH and Volcano-RU, which are based on simple modifications to the Volcano search strategy, and a greedy heuristic that incorporates novel optimizations that improve efficiency greatly.
Abstract: Complex queries are becoming commonplace, with the growing use of decision support systems. These complex queries often have a lot of common sub-expressions, either within a single query, or across multiple such queries run as a batch. Multiquery optimization aims at exploiting common sub-expressions to reduce evaluation cost. Multi-query optimization has hitherto been viewed as impractical, since earlier algorithms were exhaustive and explored a doubly exponential search space. In this paper we demonstrate that multi-query optimization using heuristics is practical, and provides significant benefits. We propose three cost-based heuristic algorithms: Volcano-SH and Volcano-RU, which are based on simple modifications to the Volcano search strategy, and a greedy heuristic. Our greedy heuristic incorporates novel optimizations that improve efficiency greatly. Our algorithms are designed to be easily added to existing optimizers. We present a performance study comparing the algorithms, using workloads consisting of queries from the TPC-D benchmark. The study shows that our algorithms provide significant benefits over traditional optimization, at a very acceptable overhead in optimization time.

414 citations
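As a rough illustration of the greedy idea only (not the paper's Volcano-based algorithms), the sketch below assumes each candidate common sub-expression comes with a known materialization cost and fixed per-query savings, and repeatedly materializes the candidate with the largest net benefit; the paper's greedy heuristic instead re-costs the whole plan after each selection:

    # Hedged sketch of greedily choosing sub-expressions to materialize.
    def greedy_materialize(candidates):
        """candidates: list of (name, materialization_cost, {query: saving})."""
        def net_benefit(c):
            _, cost, savings = c
            return sum(savings.values()) - cost   # total savings minus one-time cost
        chosen, remaining = [], list(candidates)
        while remaining:
            best = max(remaining, key=net_benefit)
            if net_benefit(best) <= 0:
                break                             # no remaining candidate pays for itself
            chosen.append(best[0])
            remaining.remove(best)
        return chosen

    candidates = [
        ("scan_orders_1995", 40, {"Q1": 30, "Q2": 35}),   # shared by two queries
        ("join_cust_orders", 90, {"Q1": 50}),             # used by only one query
    ]
    print(greedy_materialize(candidates))   # ['scan_orders_1995']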


Proceedings ArticleDOI
01 Oct 2000
TL;DR: This paper addresses the problem of resolving virtual method and interface calls in Java bytecode by developing a technique that can be solved with only one iteration, and thus scales linearly with the size of the program, while at the same time providing more accurate results than two popular existing linear techniques, class hierarchy analysis and rapid type analysis.
Abstract: This paper addresses the problem of resolving virtual method and interface calls in Java bytecode. The main focus is on a new practical technique that can be used to analyze large applications. Our fundamental design goal was to develop a technique that can be solved with only one iteration, and thus scales linearly with the size of the program, while at the same time providing more accurate results than two popular existing linear techniques, class hierarchy analysis and rapid type analysis. We present two variations of our new technique, variable-type analysis and a coarser-grain version called declared-type analysis. Both of these analyses are inexpensive, easy to implement, and our experimental results show that they scale linearly in the size of the program. We have implemented our new analyses using the Soot framework, and we report on empirical results for seven benchmarks. We have used our techniques to build accurate call graphs for complete applications (including libraries) and we show that compared to a conservative call graph built using class hierarchy analysis, our new variable-type analysis can remove a significant number of nodes (methods) and call edges. Further, our results show that we can improve upon the compression obtained using rapid type analysis. We also provide dynamic measurements of monomorphic call sites, focusing on the benchmark code excluding libraries. We demonstrate that when considering only the benchmark code, both rapid type analysis and our new declared-type analysis do not add much precision over class hierarchy analysis. However, our finer-grained variable-type analysis does resolve significantly more call sites, particularly for programs with more complex uses of objects.

312 citations
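For context, class hierarchy analysis, one of the two baselines mentioned above, resolves a virtual call by collecting the method definitions visible in the declared receiver type and all of its subtypes; the sketch below uses a made-up hierarchy, and the paper's variable-type analysis would prune this set further using assignment information:

    # Minimal class-hierarchy-analysis (CHA) sketch: the possible targets of a
    # virtual call are the implementations in the declared type and its subtypes.
    subclasses = {                      # direct subclasses, made-up hierarchy
        "Shape": ["Circle", "Square"],
        "Circle": [], "Square": ["RoundedSquare"], "RoundedSquare": [],
    }
    implements = {                      # which classes define draw() themselves
        "Shape": True, "Circle": True, "Square": False, "RoundedSquare": True,
    }

    def cha_targets(declared_type, method_defined=implements):
        """All classes whose definition of the method a call site might invoke."""
        targets, worklist = set(), [declared_type]
        while worklist:
            cls = worklist.pop()
            if method_defined.get(cls, False):
                targets.add(cls)
            worklist.extend(subclasses.get(cls, []))
        return targets

    # Square inherits Shape.draw(), so Shape stays in the target set.
    print(sorted(cha_targets("Shape")))   # ['Circle', 'RoundedSquare', 'Shape']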


Journal ArticleDOI
TL;DR: A tabu search is proposed for the Capacitated Arc Routing Problem, which outperforms all known heuristics and often produces a proven optimum.
Abstract: The Capacitated Arc Routing Problem arises in several contexts where streets or roads must be traversed for maintenance purposes or for the delivery of services. A tabu search is proposed for this difficult problem. On benchmark instances, it outperforms all known heuristics and often produces a proven optimum.

280 citations
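The abstract gives no algorithmic detail, so the following is only a generic tabu-search skeleton with a placeholder neighborhood, not the authors' procedure for the Capacitated Arc Routing Problem: move to the best non-tabu neighbor each iteration, forbid reversing recent moves for a fixed tenure, and keep the best solution seen (with the usual aspiration exception):

    # Generic tabu-search skeleton (illustrative only, not the paper's algorithm).
    # 'neighbors(s)' must yield (move, new_solution) pairs; 'cost' is minimized.
    def tabu_search(start, cost, neighbors, iters=200, tenure=7):
        current, best = start, start
        tabu = {}                                    # move -> iteration when it expires
        for it in range(iters):
            candidates = [(m, s) for m, s in neighbors(current)
                          if tabu.get(m, -1) < it or cost(s) < cost(best)]  # aspiration
            if not candidates:
                break
            move, current = min(candidates, key=lambda ms: cost(ms[1]))
            tabu[move] = it + tenure                 # forbid undoing this move for a while
            if cost(current) < cost(best):
                best = current
        return best

    # Toy use: minimize a 1-D function by +/-1 moves on an integer.
    cost = lambda x: (x - 17) ** 2
    neighbors = lambda x: [("+1", x + 1), ("-1", x - 1)]
    print(tabu_search(0, cost, neighbors))           # 17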


Proceedings ArticleDOI
Erik Ruf1
01 May 2000
TL;DR: This work presents a new technique for removing unnecessary synchronization operations from statically compiled Java programs that makes use of a compact, equivalence-class-based representation that eliminates the need for fixed point operations during the analysis.
Abstract: We present a new technique for removing unnecessary synchronization operations from statically compiled Java programs. Our approach improves upon current efforts based on escape analysis, as it can eliminate synchronization operations even on objects that escape their allocating threads. It makes use of a compact, equivalence-class-based representation that eliminates the need for fixed point operations during the analysis. We describe and evaluate the performance of an implementation in the Marmot native Java compiler. For the benchmark programs examined, the optimization removes 100% of the dynamic synchronization operations in single-threaded programs, and 0-99% in multi-threaded programs, at a low cost in additional compilation time and code growth.

240 citations
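Ruf's analysis itself is not reproduced in the abstract; as a loose illustration of why an equivalence-class representation avoids fixed-point iteration, the sketch below merges aliased object names with union-find and keeps synchronization only on objects whose equivalence class is reachable from another thread (all names and inputs are hypothetical):

    # Loose illustration of an equivalence-class (union-find) view of the problem;
    # this is not Ruf's actual analysis, just the flavor of a non-fixpoint approach.
    class UnionFind:
        def __init__(self):
            self.parent = {}
        def find(self, x):
            self.parent.setdefault(x, x)
            while self.parent[x] != x:
                self.parent[x] = self.parent[self.parent[x]]   # path halving
                x = self.parent[x]
            return x
        def union(self, a, b):
            self.parent[self.find(a)] = self.find(b)

    aliases = [("o1", "o2"), ("o3", "o4")]        # assignments merge object names
    shared_with_other_thread = {"o4"}             # e.g. stored in a static field
    sync_sites = {"site_A": "o1", "site_B": "o3"} # lock acquisitions on objects

    uf = UnionFind()
    for a, b in aliases:
        uf.union(a, b)
    escaping_reps = {uf.find(o) for o in shared_with_other_thread}
    removable = [s for s, obj in sync_sites.items()
                 if uf.find(obj) not in escaping_reps]
    print(removable)    # ['site_A']: o3 is aliased to o4, which escapes its thread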


Proceedings ArticleDOI
01 Nov 2000
TL;DR: Experimental results show tiling 3D codes can reduce miss rates and achieve performance improvements of 17-121% for key scientific kernels, including a 27% average improvement for the key computational loop nest in the SPEC/NAS benchmark MGRID.
Abstract: Compiler transformations can significantly improve data locality for many scientific programs. In this paper, we show that iterative solvers for partial differential equations (PDEs) in three dimensions require new compiler optimizations not needed for 2D codes, since reuse along the third dimension cannot fit in cache for larger problem sizes. Tiling is a program transformation compilers can apply to capture this reuse, but successful application of tiling requires selection of non-conflicting tiles and/or padding array dimensions to eliminate conflicts. We present new algorithms and cost models for selecting tiling shapes and array pads. We explain why tiling is rarely needed for 2D PDE solvers, but can be helpful for 3D stencil codes. Experimental results show tiling 3D codes can reduce miss rates and achieve performance improvements of 17-121 percent for key scientific kernels, including a 27 percent average improvement for the key computational loop nest in the SPEC/NAS benchmark mgrid.

235 citations
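To make the transformation concrete, here is a sketch of a 7-point Jacobi-style 3-D stencil with the two inner dimensions tiled; it is plain Python over NumPy arrays with arbitrary fixed tile sizes, so it only shows the loop structure, whereas the paper's compiler would generate the code and pick tile sizes and pads from its cost models:

    import numpy as np

    # Hedged sketch of tiling a 7-point 3-D stencil sweep (illustrative tile sizes).
    def stencil_tiled(A, tj=16, tk=16):
        n = A.shape[0]
        B = A.copy()
        for jj in range(1, n - 1, tj):           # tile the j dimension
            for kk in range(1, n - 1, tk):       # tile the k dimension
                for i in range(1, n - 1):        # reuse along i now stays in cache
                    for j in range(jj, min(jj + tj, n - 1)):
                        for k in range(kk, min(kk + tk, n - 1)):
                            B[i, j, k] = (A[i-1, j, k] + A[i+1, j, k] +
                                          A[i, j-1, k] + A[i, j+1, k] +
                                          A[i, j, k-1] + A[i, j, k+1] +
                                          A[i, j, k]) / 7.0
        return B

    def stencil_untiled(A):
        n = A.shape[0]
        B = A.copy()
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                for k in range(1, n - 1):
                    B[i, j, k] = (A[i-1, j, k] + A[i+1, j, k] + A[i, j-1, k] +
                                  A[i, j+1, k] + A[i, j, k-1] + A[i, j, k+1] +
                                  A[i, j, k]) / 7.0
        return B

    A = np.random.rand(20, 20, 20)
    # Tiling only reorders the iterations; the result is unchanged.
    assert np.allclose(stencil_tiled(A), stencil_untiled(A))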


Journal ArticleDOI
23 Mar 2000-Nature
TL;DR: An experimental realization of an algorithmic benchmark using an NMR technique that involves coherent manipulation of seven qubits is reported, which can be used as a reliable and efficient method for creating a standard pseudopure state, the first step for implementing traditional quantum algorithms in liquid state NMR systems.
Abstract: Quantum information processing offers potentially great advantages over classical information processing, both for efficient algorithms [1, 2] and for secure communication [3, 4]. Therefore, it is important to establish that scalable control of a large number of quantum bits (qubits) can be achieved in practice. There are a rapidly growing number of proposed device technologies [5-11] for quantum information processing. Of these technologies, those exploiting nuclear magnetic resonance (NMR) have been the first to demonstrate non-trivial quantum algorithms with small numbers of qubits [12-16]. To compare different physical realizations of quantum information processors, it is necessary to establish benchmark experiments that are independent of the underlying physical system, and that demonstrate reliable and coherent control of a reasonable number of qubits. Here we report an experimental realization of an algorithmic benchmark using an NMR technique that involves coherent manipulation of seven qubits. Moreover, our experimental procedure can be used as a reliable and efficient method for creating a standard pseudopure state, the first step for implementing traditional quantum algorithms in liquid state NMR systems. The benchmark and the techniques can be adapted for use with other proposed quantum devices.

233 citations


Journal ArticleDOI
TL;DR: A new multilevel k-way hypergraph partitioning algorithm that substantially outperforms the existing state-of-the-art K-PM/LR algorithm for multi-way partitioning, both for optimizing local as well as global objectives.
Abstract: In this paper, we present a new multilevel k-way hypergraph partitioning algorithm that substantially outperforms the existing state-of-the-art K-PM/LR algorithm for multi-way partitioning, both for optimizing local as well as global objectives. Experiments on the ISPD98 benchmark suite show that the partitionings produced by our scheme are on the average 15% to 23% better than those produced by the K-PM/LR algorithm, both in terms of the hyperedge cut as well as the (K – 1) metric. Furthermore, our algorithm is significantly faster, requiring 4 to 5 times less time than that required by K-PM/LR.

233 citations
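The two quality measures quoted in the abstract are easy to state precisely; a small sketch, assuming a partition is given as a map from vertex to part index:

    # Compute the hyperedge cut and the (K - 1) metric for a given partition.
    # A hypergraph is a list of hyperedges; each hyperedge is a set of vertices.
    def partition_quality(hyperedges, part_of):
        cut = 0          # number of hyperedges spanning more than one part
        k_minus_1 = 0    # sum over hyperedges of (parts touched - 1)
        for edge in hyperedges:
            parts_touched = {part_of[v] for v in edge}
            if len(parts_touched) > 1:
                cut += 1
            k_minus_1 += len(parts_touched) - 1
        return cut, k_minus_1

    hyperedges = [{"a", "b"}, {"a", "c", "d"}, {"c", "d"}]
    part_of = {"a": 0, "b": 0, "c": 1, "d": 2}
    print(partition_quality(hyperedges, part_of))   # (2, 3)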


Proceedings ArticleDOI
24 Apr 2000
TL;DR: The paper presents a benchmark, CommBench, for use in evaluating and designing telecommunications network processors, and characteristics such as instruction frequencies, computational complexity, and cache performance are presented.
Abstract: The paper presents a benchmark, CommBench, for use in evaluating and designing telecommunications network processors. The benchmark applications focus on small, computationally intense program kernels typical of the network processor environment. The benchmark is composed of eight programs, four of them oriented towards packet header processing and four oriented towards data stream processing. The benchmark is defined and characteristics such as instruction frequencies, computational complexity, and cache performance are presented. These measured characteristics are compared to the standard SPEC benchmark. Three examples are presented indicating how CommBench can aid in the design of a single chip network multiprocessor.

215 citations


Journal ArticleDOI
TL;DR: A collection of systems suitable for testing PID controllers, gathered from a wide range of sources, is described.

Journal ArticleDOI
TL;DR: This article presents a detailed comparative analysis of two particularly well-known families of local search algorithms for SAT, the GSAT and WalkSAT architectures, using a benchmark set that contains instances from randomized distributions as well as SAT-encoded problems from various domains.
Abstract: Local search algorithms are among the standard methods for solving hard combinatorial problems from various areas of artificial intelligence and operations research. For SAT, some of the most successful and powerful algorithms are based on stochastic local search, and in the past 10 years a large number of such algorithms have been proposed and investigated. In this article, we focus on two particularly well-known families of local search algorithms for SAT, the GSAT and WalkSAT architectures. We present a detailed comparative analysis of these algorithms' performance using a benchmark set that contains instances from randomized distributions as well as SAT-encoded problems from various domains. We also investigate the robustness of the observed performance characteristics as algorithm-dependent and problem-dependent parameters are changed. Our empirical analysis gives a very detailed picture of the algorithms' performance for various domains of SAT problems; it also reveals a fundamental weakness in some of the best-performing algorithms and shows how this can be overcome.
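For readers unfamiliar with the WalkSAT family discussed above, here is a compact sketch of the basic WalkSAT step: pick an unsatisfied clause, then with some noise probability flip a random variable from it, and otherwise flip the variable that breaks the fewest currently satisfied clauses (this is a generic textbook version, not the specific variants benchmarked in the article):

    import random

    # Compact sketch of basic WalkSAT. A formula is a list of clauses; a clause is
    # a list of literals (positive or negative non-zero ints, DIMACS style).
    def walksat(clauses, n_vars, max_flips=10000, noise=0.5, rng=random.Random(0)):
        assign = {v: rng.choice([True, False]) for v in range(1, n_vars + 1)}
        sat = lambda lit: assign[abs(lit)] == (lit > 0)
        for _ in range(max_flips):
            unsat = [c for c in clauses if not any(sat(l) for l in c)]
            if not unsat:
                return assign                      # satisfying assignment found
            clause = rng.choice(unsat)
            if rng.random() < noise:
                var = abs(rng.choice(clause))      # random-walk move
            else:
                def breaks(v):                     # clauses that flipping v would unsatisfy
                    assign[v] = not assign[v]
                    b = sum(1 for c in clauses
                            if any(abs(l) == v for l in c)
                            and not any(sat(l) for l in c))
                    assign[v] = not assign[v]
                    return b
                var = min((abs(l) for l in clause), key=breaks)   # greedy move
            assign[var] = not assign[var]
        return None                                # gave up

    # (x1 or x2) and (not x1 or x3) and (not x2 or not x3): prints a satisfying assignment.
    print(walksat([[1, 2], [-1, 3], [-2, -3]], 3))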

Proceedings ArticleDOI
17 Apr 2000
TL;DR: An experimental prototype of a software system that will take MATLAB descriptions of various applications, and automatically map them onto a distributed computing environment consisting of embedded processors, digital signal processors and field-programmable gate arrays built from commercial off-the-shelf components is implemented.
Abstract: Recently, high-level languages such as MATLAB have become popular in prototyping algorithms in domains such as signal and image processing. Many of these applications, whose subtasks have diverse execution requirements, often employ distributed, heterogeneous, reconfigurable systems. These systems consist of an interconnected set of heterogeneous processing resources that provide a variety of architectural capabilities. The objective of the MATCH (MATLAB Compiler for Heterogeneous Computing Systems) compiler project at Northwestern University is to make it easier for the users to develop efficient code for distributed, heterogeneous, reconfigurable computing systems. Towards this end we are implementing and evaluating an experimental prototype of a software system that will take MATLAB descriptions of various applications, and automatically map them onto a distributed computing environment consisting of embedded processors, digital signal processors and field-programmable gate arrays built from commercial off-the-shelf components. We provide an overview of the MATCH compiler and discuss the testbed which is being used to demonstrate our ideas. We present preliminary experimental results on some benchmark MATLAB programs with the use of the MATCH compiler.

Proceedings ArticleDOI
05 Nov 2000
TL;DR: A new class of fast and highly scalable placement algorithms that directly handle complex constraints and achieve total wirelengths comparable to the state of the art are designed and implemented.
Abstract: We have designed and implemented a new class of fast and highly scalable placement algorithms that directly handle complex constraints and achieve total wirelengths comparable to the state of the art. Our approach exploits recent advances in (i) multilevel methods for hierarchical computation, (ii) interior-point methods for nonconvex nonlinear programming, and (iii) the Fast Multipole Method for the order N evaluation of sums over the N (N - 1)/2 pairwise interactions of N components. Significant adaptation of these methods for the placement problem is required, and we have therefore developed a set of customized discrete algorithms for clustering, declustering, slot assignment, and local refinement with which the continuous algorithms are naturally combined. Preliminary test runs on benchmark circuits with up to 184,000 cells produce total wirelengths within approximately 5-10% of those of GORDIAN-L [1] in less than one tenth the run time. Such an ultra-fast placement engine is badly needed for timing convergence of the synthesis and layout phases of integrated circuit design.

Journal ArticleDOI
TL;DR: The paper considers the classic linear assignment problem with a min-sum objective function, surveys the most efficient and easily available codes for its solution, and selects eight codes.

Proceedings ArticleDOI
08 Jan 2000
TL;DR: Run-time mechanisms that dynamically distribute the instructions of a program between the two clusters of a clustered microarchitecture are investigated and can achieve an average speed-up of 36% for the SpecInt95 benchmark suite.
Abstract: Clustered microarchitectures are an effective approach to reducing the penalties caused by wire delays inside a chip. Current superscalar processors have in fact a two-cluster microarchitecture with a naive code partitioning approach: integer instructions are allocated to one cluster and floating-point instructions to the other. This partitioning scheme is simple and results in no communications between the two clusters (just through memory) but it is in general far from optimal because the workload is not evenly distributed most of the time. In fact, when the processor is running integer programs, the workload is extremely unbalanced since the FP cluster is not used at all. In this work we investigate run-time mechanisms that dynamically distribute the instructions of a program among these two clusters. By optimizing the trade-off between inter-cluster communication penalty and workload balance, the proposed schemes can achieve an average speed-up of 36% for the SpecInt95 benchmark suite.
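A toy version of the trade-off the paper studies, purely illustrative and not the authors' mechanism: send each instruction to the cluster that already holds more of its source operands, unless that cluster is too far ahead in pending work, in which case rebalance:

    # Toy steering heuristic for a two-cluster machine (illustrative only).
    # Each instruction is (destination_register, [source_registers]).
    def steer(instructions, imbalance_limit=2):
        location = {}                 # register -> cluster that produced it
        load = [0, 0]                 # instructions steered to each cluster so far
        placement = []
        for dest, sources in instructions:
            votes = [sum(1 for s in sources if location.get(s) == c) for c in (0, 1)]
            cluster = 0 if votes[0] >= votes[1] else 1        # minimize communication
            if load[cluster] - load[1 - cluster] > imbalance_limit:
                cluster = 1 - cluster                         # ...unless too unbalanced
            placement.append(cluster)
            location[dest] = cluster
            load[cluster] += 1
        return placement

    instrs = [("r1", []), ("r2", ["r1"]), ("r3", ["r1", "r2"]), ("r4", [])]
    print(steer(instrs))   # [0, 0, 0, 1]: r4 has no operands, so it rebalances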

Journal ArticleDOI
TL;DR: The paper describes a procedure for optimising the performance of an industrially designed inventory control system, using three classic control policies that utilise sales, inventory and pipeline information to set the order rate so as to achieve a desired balance between capacity, demand and minimum associated stock level.

Journal ArticleDOI
Arno Sprecher1
TL;DR: The purpose of this paper is to direct the focus to a branch-and-bound concept that can, by simple adaptations, operate on a wide range of problem settings and can compete with the best approaches available for the single-mode problem.
Abstract: We consider the resource-constrained project scheduling problem. The purpose of this paper is to direct the focus to a branch-and-bound concept that can, by simple adaptations, operate on a wide range of problem settings. The general approach can, e.g., deal with multimode problems, resource availability varying with time, and a wide range of objectives. Even the simple assembly line balancing problem of type-1 can be competitively approached with some modifications. Although the algorithm is the most general and simple one currently available for resource-constrained project scheduling, the computational performance can compete with the best approaches available for the single-mode problem. The algorithm uses far less memory than the state-of-the-art procedure, i.e., 256 KB versus 24 MB, for solving the standard benchmark set with projects consisting of 32 activities within comparable time. If both approaches are allowed to make limited use of memory, i.e., 256 KB, then more than 97% of the benchmark instances can be solved within fractions of the time required by the current state-of-the-art procedure. The truncated version of our algorithm achieves at 256 KB approximately the results of the truncated version of the state-of-the-art approach at 24 MB. Since in general the memory requirements exponentially grow with the number of activities the project consists of, memory will become a critical resource, and the strategy to access previously stored information will gain fundamental importance when solving larger projects.

Proceedings ArticleDOI
26 Mar 2000
TL;DR: This work designs a new family of distributed and asynchronous PCMA algorithms for autonomous channel access in high-performance wireless networks and finds them to perform substantially better than a standard benchmark algorithm for power control.
Abstract: We address the issue of power-controlled shared channel access in future wireless networks supporting packetized data traffic, beyond the voice-oriented continuous traffic primarily supported by current-generation networks. First, some novel formulations of the power control problem are introduced, which become progressively more general by incorporating various relevant costs. The analysis of the models under simple, yet natural, assumptions yields certain ubiquitous structural properties of 'optimal' power control algorithms. Based on such structural properties, we design a new family of distributed and asynchronous PCMA algorithms and evaluate them experimentally by simulation. They are found to perform substantially better than a standard benchmark algorithm for power control. This is a first step towards the design of full PCMA protocols for autonomous channel access in high-performance wireless networks.

01 May 2000
TL;DR: The first phase in a benchmark SHM problem organized under the auspices of the IASC-ASCE Structural Health Monitoring Task Group is detailed, with the scale-model structure adopted for use in this benchmark problem described.
Abstract: Structural health monitoring (SHM) is a promising field with widespread application in civil engineering. However, many SHM studies apply different methods to different structures, often making side-by-side comparison of the methods difficult. This paper details the first phase in a benchmark SHM problem organized under the auspices of the IASC-ASCE Structural Health Monitoring Task Group. The scale-model structure adopted for use in this benchmark problem is described. Then, two analytical models based on the structure — one a 12DOF shear-building model, the other a 120DOF model, both finite-element based — are given. The damage patterns to be identified are listed as well as the types and number of sensors, magnitude of sensor information, and so forth. More details are available on the Task Group web site at wusceel.cive.wustl.edu/asce.shm/ .

Journal ArticleDOI
TL;DR: A set of benchmark formulas for proof search in the propositional modal logics K, KT, and S4 is presented, and the discussion of postulates concerning ATP benchmarks helps to obtain improved benchmark methods for other logics, too.
Abstract: A lot of methods have been proposed – and sometimes implemented – for proof search in the propositional modal logics K, KT, and S4. It is difficult to compare the usefulness of these methods in practice, since in most cases no or only a few execution times have been published. We try to improve this unsatisfactory situation by presenting a set of benchmark formulas. Note that we do not just list formulas, but give a method that allows us to compare different provers today and in the future. As a starting point we give the results we obtained when we applied this benchmark method to the Logics Workbench (LWB). We hope that the discussion of postulates concerning ATP benchmarks helps to obtain improved benchmark methods for other logics, too.


Book ChapterDOI
27 Aug 2000
TL;DR: It is shown that the minimum area point for architectures similar to those available from Xilinx Corporation falls below the 100% logic utilization point for many circuits.
Abstract: In this paper we outline a procedure to determine appropriate partitioning of programmable logic and interconnect area to minimize overall device area across a broad range of benchmark circuits. To validate our design approach, FPGA layout tools which target devices with less than 100% logic capacity have been developed to augment existing approaches that target fully-utilized devices. These tools have been applied to FPGA and reconfigurable computing benchmarks which range from simple state machines to pipelined datapaths. In general, it is shown that the minimum area point for architectures similar to those available from Xilinx Corporation falls below the 100% logic utilization point for many circuits.

Proceedings ArticleDOI
24 Apr 2000
TL;DR: It is shown how more detailed statistical profiles can be obtained and how the synthetic trace generation mechanism should be designed to generate syntactically correct benchmark traces so that the performance predictions are far more accurate than those reported in previous research.
Abstract: Most research in the area of microarchitectural performance analysis is done using trace-driven simulations. Although trace-driven simulations are fairly accurate, they are both time- and space-consuming which makes them sometimes impractical. Modeling the execution of a computer program by a statistical profile and generating a synthetic benchmark trace from this statistical profile can be used to accelerate the design process. Thanks to the statistical nature of this technique, performance characteristics quickly converge to a steady state solution during simulation, which makes this technique suitable for fast design space explorations. In this paper, it is shown how more detailed statistical profiles can be obtained and how the synthetic trace generation mechanism should be designed to generate syntactically correct benchmark traces. As a result, the performance predictions in this paper are far more accurate than those reported in previous research.
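The statistical-profile idea can be illustrated with a tiny generator; the profile format below is hypothetical and far coarser than the detailed profiles the paper argues for, but it shows the basic mechanism of sampling instruction classes and dependency distances:

    import random

    # Tiny illustration of statistical trace generation (hypothetical profile shape).
    profile = {
        "mix": {"alu": 0.55, "load": 0.25, "store": 0.10, "branch": 0.10},
        # probability that an operand depends on the result produced d instructions ago
        "dep_distance": {1: 0.4, 2: 0.3, 4: 0.2, 8: 0.1},
    }

    def synthetic_trace(profile, length, rng=random.Random(42)):
        kinds, kind_w = list(profile["mix"]), list(profile["mix"].values())
        dists, dist_w = list(profile["dep_distance"]), list(profile["dep_distance"].values())
        trace = []
        for i in range(length):
            kind = rng.choices(kinds, weights=kind_w)[0]
            dep = i - rng.choices(dists, weights=dist_w)[0]   # producer's position
            trace.append({"pos": i, "kind": kind,
                          "depends_on": dep if dep >= 0 else None})
        return trace

    for instr in synthetic_trace(profile, 5):
        print(instr)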

Journal ArticleDOI
TL;DR: A graph-based benchmark generation method is extended to include functional information and the use of a user-specified component library, together with the restriction that no combinational loops are introduced, now broadens the scope to timing-driven and logic optimizer applications.
Abstract: For the development and evaluation of computer-aided design tools for partitioning, floorplanning, placement, and routing of digital circuits, a huge number of benchmark circuits with suitable characteristic parameters is required. Observing the lack of industrial benchmark circuits available for use in evaluation tools, one could consider actually generating synthetic circuits. In this paper, we extend a graph-based benchmark generation method to include functional information. The use of a user-specified component library, together with the restriction that no combinational loops are introduced, now broadens the scope to timing-driven and logic optimizer applications. Experiments show that the resemblance between the characteristic Rent curve and the net degree distribution of real versus synthetic benchmark circuits is hardly influenced by the suggested extensions and that the resulting circuits are more realistic than before. An indirect validation verifies that existing partitioning programs have comparable behavior for both real and synthetic circuits. The problems of accounting for timing-aware characteristics in synthetic benchmarks are addressed in detail and suggestions for extensions are included.

Journal ArticleDOI
TL;DR: This work proposes an original solution based on genetic algorithms, which makes it possible to determine a set of good heuristics for a given benchmark, and proposes a dynamic, agent-based model to simulate the behavior of the entities that collaborate to improve the Gantt diagram.

Journal ArticleDOI
TL;DR: This article presents the observations demonstrating that operations on “narrow-width” quantities are common not only in multimedia codes, but also in more general workloads, and proposes two hardware mechanisms that dynamically recognize and capitalize on these narrow-width operations.
Abstract: The large address space needs of many current applications have pushed processor designs toward 64-bit word widths. Although full 64-bit addresses and operations are indeed sometimes needed, arithmetic operations on much smaller quantities are still more common. In fact, another instruction set trend has been the introduction of instructions geared toward subword operations on 16-bit quantities. For example, most major processors now include instruction set support for multimedia operations allowing parallel execution of several subword operations in the same ALU. This article presents our observations demonstrating that operations on “narrow-width” quantities are common not only in multimedia codes, but also in more general workloads. In fact, across the SPECint95 benchmarks, over half the integer operation executions require 16 bits or less. Based on this data, we propose two hardware mechanisms that dynamically recognize and capitalize on these narrow-width operations. The first, a power-oriented optimization, reduces processor power consumption by using operand-value-based clock gating to turn off portions of arithmetic units that will be unused by narrow-width operations. This optimization results in a 45%-60% reduction in the integer unit's power consumption for the SPECint95 and MediaBench benchmark suites. Applying this optimization to SPECfp95 benchmarks results in slightly smaller power reductions, but still seems warranted. These reductions in integer unit power consumption equate to a 5%-10% full-chip power savings. Our second, a performance-oriented optimization, improves processor performance by packing together narrow-width operations so that they share a single arithmetic unit. Conceptually similar to a dynamic form of MMX, this optimization offers speedups of 4.3%-6.2% for SPECint95 and 8.0%-10.4% for MediaBench. Overall, these optimizations highlight an increasing opportunity for value-based optimizations to improve both power and performance in current microprocessors.
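The detection step behind both proposed mechanisms is simple to state in software; the sketch below (illustrative only, not the authors' hardware) classifies operand pairs as narrow when both values fit in 16 bits, which is the condition under which they could share clock-gated ALU slices or be packed into one unit:

    # Classify operand pairs as narrow (both fit in 16 bits) -- a software sketch
    # of the value check the proposed hardware would perform on operands.
    def is_narrow(value, bits=16):
        """True if the two's-complement representation fits in `bits` bits."""
        return -(1 << (bits - 1)) <= value < (1 << (bits - 1))

    def narrow_fraction(operand_pairs, bits=16):
        narrow = sum(1 for a, b in operand_pairs
                     if is_narrow(a, bits) and is_narrow(b, bits))
        return narrow / len(operand_pairs)

    ops = [(3, 1200), (70000, 4), (-5, -32768), (1 << 40, 2)]
    print(narrow_fraction(ops))   # 0.5: the first and third pairs are narrow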

Proceedings Article
10 Sep 2000
TL;DR: This work presents a calibration tool that automatically extracts the relevant parameters about the memory subsystem from any hardware and demonstrates how a database system equipped with this calibrator can automatically tune memory-conscious database algorithms to their optimal settings.
Abstract: Performance of modern hardware increasingly depends on proper utilization of both the memory cache hierarchy and parallel execution possibilities in today's super-scalar CPUs. Recent database research has demonstrated that database system performance severely suffers from poor utilization of these resources. In previous work, we presented join algorithms that strongly accelerate large equi-joins by tuning the memory access pattern to match the characteristics of the memory cache subsystem in the benchmark hardware. In order to make such algorithms applicable in database systems that run on a wide variety of platforms, we now present a calibration tool that automatically extracts the relevant parameters about the memory subsystem from any hardware. Exhaustive experiments with join-queries demonstrate how a database system equipped with this calibrator can automatically tune memory-conscious database algorithms to their optimal settings. Once memory access is optimized, CPU resource usage becomes crucial for database performance. We demonstrate how CPU resource usage can be improved by using appropriate implementation techniques. Join experiments with the Monet database system on various hardware platforms confirm that combining memory and CPU optimization can lead to almost an order of magnitude of performance improvement on modern processors.
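One piece of such a calibrator, inferring cache-level boundaries from a measured latency curve, can be sketched as follows; this is illustrative only, not the paper's tool, and the measurements here are supplied as data because the real timing loop would be a pointer-chasing kernel written in C:

    # Detect cache-level boundaries from (working-set size, latency) measurements.
    # Illustrative only; a real calibrator measures these numbers itself with a
    # pointer-chasing loop and then applies a similar jump-detection step.
    def cache_boundaries(measurements, jump_factor=1.8):
        """measurements: list of (size_in_bytes, ns_per_access), sorted by size."""
        boundaries = []
        for (size_prev, lat_prev), (size, lat) in zip(measurements, measurements[1:]):
            if lat > jump_factor * lat_prev:   # latency jump => fell out of a cache level
                boundaries.append(size_prev)   # last size that still fit
        return boundaries

    measured = [(16_384, 1.2), (32_768, 1.3), (65_536, 4.0),      # jump after 32 KB
                (131_072, 4.2), (262_144, 4.3), (524_288, 12.0)]  # jump after 256 KB
    print(cache_boundaries(measured))   # [32768, 262144]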

Proceedings ArticleDOI
01 Jan 2000
TL;DR: Out-of-order execution and branch prediction are observed to be extremely important to exploit such parallelism in media applications on very long instruction word (VLIW), single instruction multiple data (SIMD), and superscalar processors.
Abstract: This paper aims to provide a quantitative understanding of the performance of DSP and multimedia applications on very long instruction word (VLIW), single instruction multiple data (SIMD), and superscalar processors. We evaluate the performance of the VLIW paradigm using Texas Instruments Inc.'s TMS320C62xx processor and the SIMD paradigm using Intel's Pentium II processor (with MMX) on a set of DSP and media benchmarks. Tradeoffs in superscalar performance are evaluated with a combination of measurements on Pentium II and simulation experiments on the SimpleScalar simulator. Our benchmark suite includes kernels (filtering, autocorrelation, and dot product) and applications (audio effects, G.711 speech coding, and speech compression). Optimized assembly libraries and compiler intrinsics were used to create the SIMD and VLIW code. We used the hardware performance counters on the Pentium II and the stand-alone simulator for the C62xx to obtain the execution cycle counts. In comparison to non-SIMD Pentium II performance, the SIMD version exhibits a speedup ranging from 1.0 to 5.5 while the speedup of the VLIW version ranges from 0.63 to 9.0. The benchmarks are seen to contain large amounts of available parallelism, however, most of it is inter-iteration parallelism. Out-of-order execution and branch prediction are observed to be extremely important to exploit such parallelism in media applications.

Proceedings ArticleDOI
01 Nov 2000
TL;DR: This paper introduces a high performance communication middle layer, called PM2, for hetero-geneous network environments, and suggests that binary code written in PM2 or written in a communication library, such as MPICH-SCore, may run on any combination of those networks without re-compilation.
Abstract: This paper introduces a high performance communication middle layer, called PM2, for heterogeneous network environments. PM2 currently supports Myrinet, Ethernet, and SMP. Binary code written in PM2 or written in a communication library, such as MPICH-SCore on top of PM2, may run on any combination of those networks without re-compilation. According to a set of NAS parallel benchmark results, MPICH-SCore performance is better than dedicated communication libraries such as MPICH-BIP/SMP and MPICH-GM when running some benchmark programs.