
Showing papers on "Benchmark (computing)" published in 2003


Proceedings ArticleDOI
03 Dec 2003
TL;DR: This paper identifies numerous cases, such as prefetches, dynamically dead code, and wrong-path instructions, in which a fault will not affect correct execution, and shows AVFs of 28% and 9% for the instruction queue and execution units, respectively, averaged across dynamic sections of the entire CPU2000 benchmark suite.
Abstract: Single-event upsets from particle strikes have become a key challenge in microprocessor design. Techniques to deal with these transient faults exist, but come at a cost. Designers clearly require accurate estimates of processor error rates to make appropriate cost/reliability tradeoffs. This paper describes a method for generating these estimates. A key aspect of this analysis is that some single-bit faults (such as those occurring in the branch predictor) do not produce an error in a program's output. We define a structure's architectural vulnerability factor (AVF) as the probability that a fault in that particular structure will result in an error. A structure's error rate is the product of its raw error rate, as determined by process and circuit technology, and the AVF. Unfortunately, computing AVFs of complex structures, such as the instruction queue, can be quite involved. We identify numerous cases, such as prefetches, dynamically dead code, and wrong-path instructions, in which a fault will not affect correct execution. We instrument a detailed IA64 processor simulator to map bit-level microarchitectural state to these cases, generating per-structure AVF estimates. This analysis shows AVFs of 28% and 9% for the instruction queue and execution units, respectively, averaged across dynamic sections of the entire CPU2000 benchmark suite.
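
The AVF derating described here is easy to illustrate. Below is a minimal sketch, not the authors' tooling: it combines a hypothetical per-bit raw fault rate and hypothetical structure sizes with the AVFs reported in the abstract.

```python
# Hypothetical sketch: effective error rate = raw error rate x AVF,
# per the abstract's definition. Only the 28%/9% AVFs come from the
# paper; all bit counts and the raw FIT rate are invented.

def effective_error_rate(raw_fit: float, avf: float) -> float:
    """Derate a raw circuit-level error rate (in FIT) by the AVF."""
    return raw_fit * avf

raw_fit_per_bit = 0.001                      # hypothetical raw FIT per bit
structures = {                               # hypothetical bit counts, paper's AVFs
    "instruction queue": (64 * 128, 0.28),
    "execution units":   (8 * 1024, 0.09),
}
for name, (bits, avf) in structures.items():
    print(f"{name}: {effective_error_rate(raw_fit_per_bit * bits, avf):.2f} FIT")
```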

915 citations


Journal ArticleDOI
TL;DR: Aside from the LINPACK Benchmark suite, the TOP500 and the HPL codes are presented and information is given on how to interpret the results of the benchmark and how the results fit into the performance evaluation process.
Abstract: This paper describes the LINPACK Benchmark and some of its variations commonly used to assess the performance of computer systems. Aside from the LINPACK Benchmark suite, the TOP500 and the HPL codes are presented. The latter is frequently used to obtain results for TOP500 submissions. Information is also given on how to interpret the results of the benchmark and how the results fit into the performance evaluation process.
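
For readers unfamiliar with how LINPACK-style numbers are produced, the sketch below times a dense Ax = b solve and converts it to Gflop/s using the standard 2/3·n³ + 2·n² operation count. The problem size is arbitrary, and this is a toy stand-in rather than the HPL code itself.

```python
# Minimal LINPACK-style measurement: time a dense solve and report
# Gflop/s via the conventional 2/3 n^3 + 2 n^2 flop count.
import time
import numpy as np

n = 2000
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

t0 = time.perf_counter()
x = np.linalg.solve(A, b)            # LU factorization + triangular solves
elapsed = time.perf_counter() - t0

flops = (2.0 / 3.0) * n**3 + 2.0 * n**2
print(f"n={n}: {elapsed:.3f} s, {flops / elapsed / 1e9:.2f} Gflop/s")
print("residual:", np.linalg.norm(A @ x - b) / np.linalg.norm(b))
```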

787 citations


Proceedings ArticleDOI
01 May 2003
TL;DR: The Sampling Microarchitecture Simulation (SMARTS) framework is presented as an approach to enable fast and accurate performance measurements of full-length benchmarks and accelerates simulation by selectively measuring in detail only an appropriate benchmark subset.
Abstract: Current software-based microarchitecture simulators are many orders of magnitude slower than the hardware they simulate. Hence, most microarchitecture design studies draw their conclusions from drastically truncated benchmark simulations that are often inaccurate and misleading. This paper presents the Sampling Microarchitecture Simulation (SMARTS) framework as an approach to enable fast and accurate performance measurements of full-length benchmarks. SMARTS accelerates simulation by selectively measuring in detail only an appropriate benchmark subset. SMARTS prescribes a statistically sound procedure for configuring a systematic sampling simulation run to achieve a desired quantifiable confidence in estimates. Analysis of 41 of the 45 possible SPEC2K benchmark/input combinations shows that CPI and energy per instruction (EPI) can be estimated to within ±3% with 99.7% confidence by measuring fewer than 50 million instructions per benchmark. In practice, inaccuracy in microarchitectural state initialization introduces an additional uncertainty which we empirically bound to ∼2% for the tested benchmarks. Our implementation of SMARTS achieves an actual average error of only 0.64% on CPI and 0.59% on EPI for the tested benchmarks, running with average speedups of 35 and 60 over detailed simulation of 8-way and 16-way out-of-order processors, respectively.
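
The confidence-driven sample sizing behind SMARTS follows standard sampling theory: the number of sampled measurement units depends on the variability of the metric and the target interval. A rough sketch, with a hypothetical coefficient of variation:

```python
# Sampling-theory sketch (not the SMARTS code): number of sampled units
# needed for a +/-epsilon relative confidence interval at z-sigma
# confidence, given the coefficient of variation (CoV) of per-unit CPI.
import math

def required_samples(cov: float, rel_error: float, z: float = 3.0) -> int:
    """n >= (z * CoV / epsilon)^2."""
    return math.ceil((z * cov / rel_error) ** 2)

# Hypothetical CoV of 0.5; +/-3% at 99.7% confidence (z = 3).
print(required_samples(cov=0.5, rel_error=0.03))   # -> 2500 sampled units
```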

563 citations


Journal ArticleDOI
TL;DR: The essential backbone of the framework is an evolutionary algorithm coupled with a feasible sequential quadratic programming solver in the spirit of Lamarckian learning that leverages surrogate models for solving computationally expensive design problems with general constraints on a limited computational budget.
Abstract: We present a parallel evolutionary optimization algorithm that leverages surrogate models for solving computationally expensive design problems with general constraints, on a limited computational budget. The essential backbone of our framework is an evolutionary algorithm coupled with a feasible sequential quadratic programming solver in the spirit of Lamarckian learning. We employ a trust-region approach for interleaving use of exact models for the objective and constraint functions with computationally cheap surrogate models during local search. In contrast to earlier work, we construct local surrogate models using radial basis functions motivated by the principle of transductive inference. Further, the present approach retains the intrinsic parallelism of evolutionary algorithms and can hence be readily implemented on grid computing infrastructures. Experimental results are presented for some benchmark test functions and an aerodynamic wing design problem to demonstrate that our algorithm converges to good designs on a limited computational budget.
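
The local-surrogate step can be illustrated in a few lines: fit a radial basis function model to a handful of exact evaluations and query it cheaply during local search. The sketch below uses SciPy's RBFInterpolator with a toy objective standing in for the expensive simulation; it is not the authors' framework.

```python
# RBF surrogate sketch: exact evaluations are expensive, so a radial
# basis function model fitted to sampled points answers local-search
# queries cheaply. The objective here is a toy placeholder.
import numpy as np
from scipy.interpolate import RBFInterpolator

def expensive_objective(x):                  # stand-in for a costly simulation
    return np.sum(x**2, axis=-1)

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(30, 2))         # sampled design points
y = expensive_objective(X)                   # exact (expensive) evaluations

surrogate = RBFInterpolator(X, y, kernel="thin_plate_spline")

query = np.array([[0.1, -0.3]])
print("surrogate:", surrogate(query)[0], "exact:", expensive_objective(query)[0])
```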

559 citations


Journal ArticleDOI
TL;DR: The findings of this study indicate that ACOAs are an attractive alternative to GAs for the optimal design of water distribution systems, as they outperformed GAs for the two case studies considered both in terms of computational efficiency and their ability to find near global optimal solutions.
Abstract: During the last decade, evolutionary methods such as genetic algorithms have been used extensively for the optimal design and operation of water distribution systems. More recently, ant colony optimization algorithms (ACOAs), which are evolutionary methods based on the foraging behavior of ants, have been successfully applied to a number of benchmark combinatorial optimization problems. In this paper, a formulation is developed which enables ACOAs to be used for the optimal design of water distribution systems. This formulation is applied to two benchmark water distribution system optimization problems and the results are compared with those obtained using genetic algorithms (GAs). The findings of this study indicate that ACOAs are an attractive alternative to GAs for the optimal design of water distribution systems, as they outperformed GAs for the two case studies considered both in terms of computational efficiency and their ability to find near global optimal solutions.
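
A minimal sketch of the ACO mechanics the abstract relies on, applied to a single discrete choice such as a pipe diameter: options are picked with probability proportional to pheromone, which evaporates and is reinforced on low-cost solutions. The diameters, cost function, and parameters below are all invented for illustration.

```python
# Toy ACO loop for one discrete decision: pheromone-proportional
# selection, evaporation, and cost-based reinforcement.
import random

options = [100, 150, 200, 250]               # hypothetical pipe diameters (mm)
tau = {d: 1.0 for d in options}              # pheromone per option
rho, Q = 0.1, 1.0                            # evaporation rate, deposit scale

def choose() -> int:
    """Roulette-wheel selection proportional to pheromone."""
    r, acc = random.uniform(0, sum(tau.values())), 0.0
    for d in options:
        acc += tau[d]
        if r <= acc:
            return d
    return options[-1]

for _ in range(100):
    d = choose()
    cost = abs(d - 200) + 1                  # toy cost: 200 mm is "optimal"
    tau[d] = (1 - rho) * tau[d] + Q / cost   # evaporate, then reinforce
print(max(tau, key=tau.get))                 # most reinforced diameter
```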

479 citations


Proceedings ArticleDOI
10 Jun 2003
TL;DR: How to use the SimPoint tool, and an improved SimPoint algorithm designed to significantly reduce the simulation time required when the simulation environment relies upon fast-forwarding are described.
Abstract: Modern architecture research relies heavily on detailed pipeline simulation. Simulating the full execution of a single industry standard benchmark at this level of detail takes on the order of months to complete. This problem is exacerbated by the fact that to properly perform an architectural evaluation requires multiple benchmarks to be evaluated across many separate runs. To address this issue we recently created a tool called SimPoint that automatically finds a small set of Simulation Points to represent the complete execution of a program for efficient and accurate simulation. In this paper we describe how to use the SimPoint tool, and introduce an improved SimPoint algorithm designed to significantly reduce the simulation time required when the simulation environment relies upon fast-forwarding.

311 citations


Proceedings ArticleDOI
10 Jun 2003
TL;DR: The most striking observation is the strong correlation between power consumption and the instructions per cycle (IPC) during OS routine executions, and the proposed models can estimate OS power for run-time dynamic thermal and energy management.
Abstract: The increasing constraints on power consumption in many computing systems point to the need for power modeling and estimation for all components of a system. The Operating System (OS) constitutes a major software component and dissipates a significant portion of total power in many modern application executions. Therefore, modeling OS power is imperative for accurate software power evaluation, as well as power management (e.g. dynamic thermal control and equal energy scheduling) in the light of OS-intensive workloads. This paper characterizes the power behavior of a commercial OS across a wide spectrum of applications to understand OS energy profiles and then proposes various models to cost-effectively estimate its run-time energy dissipation. The proposed models rely on a few simple parameters and have various degrees of complexity and accuracy. Experiments show that compared with cycle-accurate full-system simulation, the model can predict cumulative OS energy to within 1% accuracy for a set of benchmark programs evaluated on a high-end superscalar microprocessor. When applied to track run-time OS energy profiles, the proposed routine-level OS power model offers superior accuracy compared with a simpler, flat OS power model, yielding per-routine estimation error of less than 6%. The most striking observation is the strong correlation between power consumption and the instructions per cycle (IPC) during OS routine executions. Since tools and methodology to measure IPC exist on modern microprocessors, the proposed models can estimate OS power for run-time dynamic thermal and energy management.
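
The IPC correlation highlighted here suggests a simple linear routine-level model, sketched below with hypothetical measurements fit by least squares; the paper's actual models and coefficients differ.

```python
# Hypothetical linear OS power model P ~= p0 + k * IPC, fit to
# invented per-routine (IPC, power) samples by least squares.
import numpy as np

ipc     = np.array([0.4, 0.7, 1.1, 1.5, 1.9])        # measured per-routine IPC
power_w = np.array([18.0, 21.5, 26.0, 30.2, 34.8])   # measured power (W)

k, p0 = np.polyfit(ipc, power_w, 1)                  # slope, intercept
print(f"P ~= {p0:.1f} + {k:.1f} * IPC")
print("predicted power at IPC=1.0:", p0 + k * 1.0)
```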

254 citations


Proceedings ArticleDOI
03 Nov 2003
TL;DR: Experimental results show that the proposed approach is far more effective than the other considered techniques in terms of fault detection capability, at the cost of a limited increase in memory requirements and in performance overhead.
Abstract: Over the last few years, an increasing number of safety-critical tasks have been demanded of computer systems. In this paper, a software-based approach for developing safety-critical applications is analyzed. The technique is based on the introduction of additional executable assertions to check the correct execution of the program control flow. By applying the proposed technique, several benchmark applications have been hardened against transient errors. Fault injection campaigns have been performed to evaluate the fault detection capability of the proposed technique in comparison with state-of-the-art alternative assertion-based methods. Experimental results show that the proposed approach is far more effective than the other considered techniques in terms of fault detection capability, at the cost of a limited increase in memory requirements and in performance overhead.
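
To make the idea of executable control-flow assertions concrete, here is a toy sketch: each block updates a runtime signature and asserts that it matches a precomputed value, so an illegal jump trips the assertion. The signature scheme is invented for illustration and is far simpler than the paper's technique.

```python
# Toy signature-based control-flow check: a legal path through blocks
# A -> B -> C reproduces the precomputed signatures; skipping or
# reordering blocks makes an assertion fire.
EXPECTED = {"A": 0x1, "B": 0x3, "C": 0x7}   # compile-time signatures (invented)

sig = 0
def enter(block: str) -> None:
    global sig
    sig = (sig << 1) | 1                    # runtime signature update
    assert sig == EXPECTED[block], f"control-flow error in {block}"

enter("A"); enter("B"); enter("C")          # legal path: no assertion fires
```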

244 citations


Proceedings ArticleDOI
27 Sep 2003
TL;DR: A statistically driven algorithm for forming clusters from which simulation points are chosen, and algorithms for picking simulation points earlier in a program's execution are examined - in order to significantly reduce fast-forwarding time during simulation.
Abstract: Modern architecture research relies heavily on detailed pipeline simulation. Simulating the full execution of an industry standard benchmark can take weeks to months to complete. To address this issue we have recently proposed using Simulation Points (found by only examining basic block execution frequency profiles) to increase the efficiency and accuracy of simulation. Simulation points are a small set of execution samples that when combined represent the complete execution of the program. In this paper we present a statistically driven algorithm for forming clusters from which simulation points are chosen, and examine algorithms for picking simulation points earlier in a program's execution - in order to significantly reduce fast-forwarding time during simulation. In addition, we show that simulation points can be used independently of the underlying architecture. The points are generated once for a program/input pair by only examining the code executed. We show the points accurately track hardware metrics (e.g., performance and cache miss rates) between different architecture configurations. They can therefore be used across different architecture configurations to allow a designer to make accurate trade-off decisions between different configurations.
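
The clustering step can be sketched as k-means over basic-block frequency vectors, with the interval nearest each centroid chosen as a simulation point. The vectors below are random placeholders rather than real profiles, and the paper's own algorithm adds statistical machinery beyond this.

```python
# Sketch: cluster execution intervals by basic-block vector (BBV) and
# pick the interval closest to each centroid as a simulation point.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
bbvs = rng.random((200, 32))                 # 200 intervals x 32 basic blocks
bbvs /= bbvs.sum(axis=1, keepdims=True)      # normalize each interval's profile

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(bbvs)
for c in range(4):
    members = np.flatnonzero(km.labels_ == c)
    dists = np.linalg.norm(bbvs[members] - km.cluster_centers_[c], axis=1)
    print(f"cluster {c}: simulation point = interval {members[np.argmin(dists)]}")
```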

222 citations


Proceedings ArticleDOI
20 Jul 2003
TL;DR: The purpose is to determine the importance of several basic reduction techniques on Support Vector Machines, by comparing their relative performance improvement when applied on the standard REUTERS-21578 benchmark.
Abstract: Given a data set and a learning task such as classification, there are two prime motives for executing some kind of data set reduction. On one hand there is the possible algorithm performance improvement. On the other hand the decrease in the overall size of the data set can bring advantages in storage space used and time spent computing. Our purpose is to determine the importance of several basic reduction techniques on Support Vector Machines, by comparing their relative performance improvement when applied on the standard REUTERS-21578 benchmark.

200 citations


Journal ArticleDOI
TL;DR: It is suggested that the BMD method be used as a first choice and that, in cases where it is not possible to fit a model to the data, the traditional NOAEL approach be used instead; the possibilities for making benchmark dose calculations on continuous data need further investigation.
Abstract: The benchmark dose method has been proposed as an alternative to the no-observed-adverse-effect level (NOAEL) approach for assessing noncancer risks associated with hazardous compounds. The benchmark dose method is a more powerful statistical tool than the traditional NOAEL approach and represents a step in the right direction for a more accurate risk assessment. The benchmark dose method involves fitting a mathematical model to all the dose-response data within a study, and thus more biological information is incorporated in the resulting estimates of guidance values (e.g., acceptable daily intakes, ADIs). Although there is an increasing interest in the benchmark dose approach, it has not yet found its way into regulatory toxicology in Europe, while in the United States the U.S. Environmental Protection Agency (EPA) already uses the benchmark dose in health risk assessment. Several software packages are available today for benchmark dose calculations. The availability of software to facilitate the analysis can make modeling appear simple, but often the interpretation of the results is not trivial, and it is recommended that benchmark dose modeling be performed in collaboration with a toxicologist and someone familiar with this type of statistical analysis. The procedure does not replace expert judgments of toxicologists and others addressing the hazard characterization issues in risk assessment. The aim of this article is to make risk assessors familiar with the concept, to show how the method can be used, and to describe some possibilities, limitations, and extensions of the benchmark dose approach. In this article the benchmark dose approach is presented in detail and compared to the traditional NOAEL approach. Statistical methods essential for the benchmark dose method are presented in Appendix A, and different mathematical models used in the U.S. EPA's BMD software, the Crump software, and the Kalliomaa software are described in the text and in Appendix B. For replacement of NOAEL in health risk assessment it is considered important that consensus is reached on the crucial parts of the benchmark dose method, that is, selection of risk types and the determination of a response level corresponding to the BMD, especially for continuous data. It is suggested that the BMD method be used as a first choice and that in cases where it is not possible to fit a model to the data the traditional NOAEL approach be used instead. The possibilities for making benchmark dose calculations on continuous data need to be further investigated. In addition, it is of importance to study whether it would be appropriate to increase the number of dose levels by decreasing the number of animals in each dose group.
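
To make the BMD procedure concrete, the sketch below fits an illustrative log-logistic dose-response model to quantal data and inverts it at a 10% extra-risk benchmark response. The model form and data are hypothetical and do not reproduce the EPA, Crump, or Kalliomaa software.

```python
# BMD sketch: fit a dose-response model, then solve for the dose giving
# 10% extra risk over background. All data and the model are illustrative.
import numpy as np
from scipy.optimize import curve_fit, brentq

dose     = np.array([0.0, 1.0, 3.0, 10.0, 30.0])
response = np.array([0.02, 0.05, 0.12, 0.35, 0.70])   # fraction affected

def model(d, bg, k, d50):
    """Background plus log-logistic increase with dose."""
    return bg + (1 - bg) / (1 + (d50 / np.maximum(d, 1e-9)) ** k)

(bg, k, d50), _ = curve_fit(model, dose, response, p0=[0.02, 1.0, 10.0])

bmr = 0.10                                   # benchmark response: 10% extra risk
target = bg + bmr * (1 - bg)
bmd = brentq(lambda d: model(d, bg, k, d50) - target, 1e-6, 100.0)
print(f"BMD(10%) ~= {bmd:.2f}")
```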

Journal ArticleDOI
TL;DR: The investigation shows that PSO yields solutions that are not inferior to those of the benchmark methods and, simultaneously, it has several theoretical, computational, and practical advantages.
Abstract: Most common methods used in optimal control of reservoir systems require a large number of control variables, which are typically the sequences of releases from all reservoirs and for all time steps of the control period. In contrast, the less widespread parameterization-simulation-optimization (PSO) method is a low-dimensional method. It uses a handful of control variables, which are parameters of a simple rule that is valid through the entire control period and determines the releases from different reservoirs at each time step. The parameterization of the rule is linked to simulation of the reservoir system, which enables the calculation of a performance measure of the system for given parameter values, and nonlinear optimization, which enables determination of the optimal parameter values. To evaluate the PSO method and, particularly, to investigate whether the radical reduction of the number of control variables might lead to inferior solutions or not, we compare it to two alternative methods. These methods, namely, the high-dimensional perfect foresight method and the simplified "equivalent reservoir" method that merges the reservoir system into a single hypothetical reservoir, determine "benchmark" performance measures for the comparison. The comparison is done both theoretically and by investigation of the results of the PSO against the benchmark methods in a large variety of test problems. Forty-one test problems for a hypothetical system of two reservoirs are constructed and solved for comparison. These refer to different objectives (maximization of reliable yield, minimization of cost, maximization of energy production), water uses (irrigation, water supply, energy production), characteristics of the reservoir system, and hydrological scenarios. The investigation shows that PSO yields solutions that are not inferior to those of the benchmark methods and, simultaneously, it has several theoretical, computational, and practical advantages.
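
The low-dimensional character of the PSO (parameterization-simulation-optimization) method is easy to caricature: a one-parameter release rule is simulated over the whole control period and that single parameter is optimized. Everything below, including the rule, inflows, and demand, is a toy illustration rather than the paper's formulation.

```python
# PSO caricature: simulate a one-parameter release rule over the control
# period, score it, and tune the parameter with a scalar optimizer.
import numpy as np
from scipy.optimize import minimize_scalar

inflow = np.array([3.0, 5.0, 2.0, 6.0, 4.0, 1.0])    # hypothetical inflows
demand, capacity = 3.5, 10.0

def shortage(theta: float) -> float:
    """Squared supply deficit under rule: release = theta * storage."""
    storage, short = 5.0, 0.0
    for q in inflow:                                  # simulate the rule
        release = min(theta * storage, storage + q)
        storage = min(storage + q - release, capacity)
        short += max(demand - release, 0.0) ** 2
    return short

res = minimize_scalar(shortage, bounds=(0.1, 1.0), method="bounded")
print(f"optimal theta = {res.x:.2f}")
```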

Journal Article
TL;DR: In this paper, the authors use statistical data analysis techniques such as principal components analysis (PCA) and cluster analysis to explore the workload space and select a limited set of representative benchmark-input pairs that span the complete workload space.
Abstract: Having a representative workload of the target domain of a microprocessor is extremely important throughout its design. The composition of a workload involves two issues: (i) which benchmarks to select and (ii) which input data sets to select per benchmark. Unfortunately, it is impossible to select a huge number of benchmarks and respective input sets due to the large instruction counts per benchmark and due to limitations on the available simulation time. In this paper, we use statistical data analysis techniques such as principal components analysis (PCA) and cluster analysis to efficiently explore the workload space. Within this workload space, different input data sets for a given benchmark can be displayed, a distance can be measured between program-input pairs that gives us an idea about their mutual behavioral differences, and representative input data sets can be selected for the given benchmark. This methodology is validated by showing that program-input pairs that are close to each other in this workload space indeed exhibit similar behavior. The final goal is to select a limited set of representative benchmark-input pairs that span the complete workload space. Next to workload composition, we discuss two other possible applications, namely getting insight in the impact of input data sets on program behavior and evaluating the representativeness of sampled traces.
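
A minimal sketch of the analysis pipeline described here: standardize per-pair program characteristics, project them with PCA into a low-dimensional workload space, and cluster to pick representatives. The feature values are random placeholders.

```python
# Workload-space sketch: PCA projection of program-input characteristics
# followed by clustering to select representative pairs.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
features = rng.random((40, 8))          # 40 program-input pairs x 8 metrics

Z = StandardScaler().fit_transform(features)
coords = PCA(n_components=2).fit_transform(Z)   # 2-D workload space

labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(coords)
for c in range(5):
    print(f"cluster {c}:", np.flatnonzero(labels == c))
```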

Journal ArticleDOI
TL;DR: The results show that slow-motion benchmarking solves the problems with using conventional benchmarks on thin-client systems and is an accurate tool for analyzing the performance of these systems.
Abstract: Modern thin-client systems are designed to provide the same graphical interfaces and applications available on traditional desktop computers while centralizing administration and allowing more efficient use of computing resources. Despite the rapidly increasing popularity of these client-server systems, there are few reliable analyses of their performance. Industry standard benchmark techniques commonly used for measuring desktop system performance are ill-suited for measuring the performance of thin-client systems because these benchmarks only measure application performance on the server, not the actual user-perceived performance on the client. To address this problem, we have developed slow-motion benchmarking, a new measurement technique for evaluating thin-client systems. In slow-motion benchmarking, performance is measured by capturing network packet traces between a thin client and its respective server during the execution of a slow-motion version of a conventional benchmark application. These results can then be used either independently or in conjunction with conventional benchmark results to yield an accurate and objective measure of the performance of thin-client systems. We have demonstrated the effectiveness of slow-motion benchmarking by using this technique to measure the performance of several popular thin-client systems in various network environments on Web and multimedia workloads. Our results show that slow-motion benchmarking solves the problems with using conventional benchmarks on thin-client systems and is an accurate tool for analyzing the performance of these systems.

Journal ArticleDOI
TL;DR: A new model showing how genetic algorithms can be manipulated to help optimize bus transit routing design, incorporating unique service frequency settings for each route is proposed and shown to be more efficient than the binary-coded genetic algorithm benchmark, in which problem content cannot be utilized.
Abstract: In this paper we propose a new model showing how genetic algorithms can be manipulated to help optimize bus transit routing design, incorporating unique service frequency settings for each route. The main lesson is in the power that can be given to heuristic methods if problem content is exploited appropriately. In this example, seven proposed genetic operators are designed for this specific problem to facilitate the search within a reasonable amount of time. In addition, headway coordination is applied by the ranking of transfer demands at the transfer terminals. The model is applied on a benchmark network to test its efficiency, and performance results are presented. It is shown that the proposed model is more efficient than the binary-coded genetic algorithm benchmark, in which problem content cannot be utilized.

Journal ArticleDOI
TL;DR: An integrated framework for system-on-chip (SOC) test automation based on a new test access mechanism (TAM) architecture consisting of flexible-width test buses that can fork and merge between cores is described.
Abstract: We describe an integrated framework for system-on-chip (SOC) test automation. Our framework is based on a new test access mechanism (TAM) architecture consisting of flexible-width test buses that can fork and merge between cores. Test wrapper and TAM cooptimization for this architecture is performed by representing core tests using rectangles and by employing a novel rectangle packing algorithm for test scheduling. Test scheduling is tightly integrated with TAM optimization and it incorporates precedence and power constraints in the test schedule, while allowing the SOC integrator to designate a group of tests as preemptable. Test preemption helps avoid hardware and power consumption conflicts, thereby leading to a more efficient test schedule. Finally, we study the relationship between TAM width and tester data volume to identify an effective TAM width for the SOC. We present experimental results on our test automation framework for four benchmark SOCs.
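
The rectangle-packing view of test scheduling can be illustrated with a greedy shelf heuristic: each core test is a rectangle (TAM width by test time) placed on the first shelf where its width fits. This toy sketch ignores the paper's wrapper cooptimization, preemption, and power constraints; the core list is hypothetical.

```python
# Toy shelf packing for SOC test scheduling: rectangles (TAM width, time)
# packed under a total TAM width W; schedule length is the sum of shelf
# heights, each set by the longest test on that shelf.
W = 32                                         # total TAM width
tests = [(16, 100), (8, 250), (8, 180), (16, 90), (24, 60)]

tests.sort(key=lambda t: t[1], reverse=True)   # longest tests first
shelves = []                                   # each shelf: [used_width, height]
for width, time in tests:
    for shelf in shelves:
        if shelf[0] + width <= W:
            shelf[0] += width                  # runs alongside earlier tests
            break
    else:
        shelves.append([width, time])          # open a new shelf
print("schedule length:", sum(h for _, h in shelves))
```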

Journal ArticleDOI
TL;DR: A novel hybrid approach based upon stochastic sampling, interpolation and spring models is designed, which allows the visualisation of data sets of previously infeasible size and is a solid foundation for interactive and visual exploration of data.
Abstract: The term 'proximity data' refers to data sets within which it is possible to assess the similarity of pairs of objects. Multidimensional scaling (MDS) is applied to such data and attempts to map high-dimensional objects onto low-dimensional space through the preservation of these similarity relations. Standard MDS techniques have in the past suffered from high computational complexity and, as such, could not feasibly be applied to data sets over a few thousand objects in size. Through a novel hybrid approach based upon stochastic sampling, interpolation and spring models, we have designed an algorithm running in O(N√N). Using Chalmers' 1996 O(N²) spring model as a benchmark for the evaluation of our technique, we compare layout quality and run times using sets of synthetic and real data. Our algorithm executes significantly faster than Chalmers' 1996 algorithm, while producing superior layouts. In reducing complexity and run time, we allow the visualisation of data sets of previously infeasible size. Our results indicate that our method is a solid foundation for interactive and visual exploration of data.
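
For intuition, here is the O(N²) spring-model core that the paper's O(N√N) hybrid accelerates through sampling and interpolation: every pair pulls or pushes the 2-D layout toward the high-dimensional distances. The data, step size, and iteration count are arbitrary.

```python
# O(N^2) spring-model iteration for MDS: each pair exerts a force
# proportional to (layout distance - target dissimilarity).
import numpy as np

rng = np.random.default_rng(3)
data = rng.random((50, 10))                          # high-dimensional objects
target = np.linalg.norm(data[:, None] - data[None, :], axis=-1)
pos = rng.random((50, 2))                            # initial 2-D layout

for _ in range(200):
    delta = pos[:, None] - pos[None, :]              # pairwise displacements
    dist = np.linalg.norm(delta, axis=-1)
    np.fill_diagonal(dist, 1.0)                      # avoid divide-by-zero
    dist = np.maximum(dist, 1e-6)
    force = ((dist - target) / dist)[..., None] * delta
    pos -= 0.005 * force.sum(axis=1)                 # step along net force

layout_d = np.linalg.norm(pos[:, None] - pos[None, :], axis=-1)
print(f"residual stress: {np.sum((layout_d - target) ** 2):.1f}")
```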

Journal ArticleDOI
27 Oct 2003
TL;DR: A new architecture for embedded reconfigurable computing, based on a very-long instruction word (VLIW) processor enhanced with an additional run-time configurable datapath, leading to an improvement in both timing performance and power consumption.
Abstract: This paper describes a new architecture for embedded reconfigurable computing, based on a very-long instruction word (VLIW) processor enhanced with an additional run-time configurable datapath. The reconfigurable unit is tightly coupled with the processor, featuring an application-specific instruction-set extension. Mapping computation-intensive algorithmic portions on the reconfigurable unit allows a more efficient elaboration, thus leading to an improvement in both timing performance and power consumption. A test chip has been implemented in a standard 0.18-μm CMOS technology. The test of a signal processing algorithmic benchmark showed speedups ranging from 4.3× to 13.5× and energy consumption reduced by up to 92%.

Journal ArticleDOI
01 May 2003-Infor
TL;DR: Results from a computational experiment over common benchmark problems show that the proposed technique matches or outperforms some of the best heuristic routing procedures, providing six new best-known solutions.
Abstract: A route-directed hybrid genetic approach to address the Vehicle Routing Problem with Time Windows is presented. The proposed scheme relies on the concept of simultaneous evolution of two populations pursuing different objectives subject to partial constraint relaxation. The first population evolves individuals to minimize total traveled distance while the second focuses on minimizing temporal constraint violation to generate a feasible solution, both subject to a fixed number of tours. Genetic operators have been designed to incorporate key concepts emerging from recent promising techniques such as insertion heuristics and large neighborhood search to further explore the solution space. Results from a computational experiment over common benchmark problems show that the proposed technique matches or outperforms some of the best heuristic routing procedures, providing six new best-known solutions. In comparison, the method proved to be fast, cost-effective and highly competitive.

Proceedings ArticleDOI
11 Jun 2003
TL;DR: Results support the idea that manual parallelization using TLS is an efficient way to extract fine-grain thread-level parallelism.
Abstract: In this paper, we provide examples of how thread-level speculation (TLS) simplifies manual parallelization and enhances its performance. A number of techniques for manual parallelization using TLS are presented and results are provided that indicate the performance contribution of each technique on seven SPEC CPU2000 benchmark applications. We also provide indications of the programming effort required to parallelize each benchmark. TLS parallelization yielded a 110% speedup on our four floating point applications and a 70% speedup on our three integer applications, while requiring only approximately 80 programmer hours and 150 lines of non-template code per application. These results support the idea that manual parallelization using TLS is an efficient way to extract fine-grain thread-level parallelism.

01 Jan 2003
TL;DR: A metaheuristic based on annealing-like restarts to diversify and intensify local searches for the vehicle routing problem with time windows; extensive comparisons show it is comparable to the best in published literature.
Abstract: In this paper, we propose a metaheuristic based on annealing-like restarts to diversify and intensify local searches for solving the vehicle routing problem with time windows (VRPTW). Using Solomon's benchmark instances for the problem, our method obtained seven new best results and equaled 19 other best results. Extensive comparisons indicate that our method is comparable to the best in published literature. This approach is flexible and can be extended to handle other variants of vehicle routing problems and other combinatorial optimization problems.

Book ChapterDOI
09 Sep 2003
TL;DR: This paper proposes a dependability benchmark for OLTP systems that uses the workload of the TPC-C performance benchmark and specifies the measures and all the steps required to evaluate both the performance and key dependability features ofOLTP systems, with emphasis on availability.
Abstract: The ascendance of networked information in our economy and daily lives has increased the awareness of the importance of dependability features. OLTP (On-Line Transaction Processing) systems constitute the kernel of the information systems used today to support the daily operations of most of the business. Although these systems comprise the best examples of complex business-critical systems, no practical way has been proposed so far to characterize the impact of faults in such systems or to compare alternative solutions concerning dependability features. This paper proposes a dependability benchmark for OLTP systems. This dependability benchmark uses the workload of the TPC-C performance benchmark and specifies the measures and all the steps required to evaluate both the performance and key dependability features of OLTP systems, with emphasis on availability. This dependability benchmark is presented through a concrete example of benchmarking the performance and dependability of several different transactional systems configurations. The effort required to run the dependability benchmark is also discussed in detail.

Journal ArticleDOI
TL;DR: The authors used experimental results from their TPC-W implementation to assess the benchmark's behavior, including its granularity and sensitivity to changes in workload and system parameters.
Abstract: Correctly interpreting benchmark results requires a basic knowledge of the synthetic workload the benchmark uses to determine how well it represents diverse e-commerce applications' real-world workloads. Factors that influence these results include the characteristics of the system under test, the procedures used to execute the tests, and the performance metrics the benchmark generates. TPC-W performs server evaluation in a controlled Internet e-commerce environment that simulates the activities of a business-oriented transactional Web server. The authors used experimental results from their TPC-W implementation to assess the benchmark's behavior, including its granularity and sensitivity to changes in workload and system parameters.

Journal ArticleDOI
TL;DR: This paper presents a tutorial overview of some of the issues that arise in the design of switched linear control systems, together with a benchmark regulation problem.
Abstract: In this paper we present a tutorial overview of some of the issues that arise in the design of switched linear control systems. Particular emphasis is given to issues relating to stability and control system realisation. A benchmark regulation problem is then presented. This problem is most naturally solved by means of a switched control design. The challenge to the community is to design a control system that meets the required performance specifications and permits the application of rigorous analysis techniques. A simple design solution is presented and the limitations of currently available analysis techniques are illustrated with reference to this example.

Proceedings ArticleDOI
26 Oct 2003
TL;DR: The goal of this paper is to study this complex interaction between the Java application, its input and the virtual machine it runs on at the microarchitectural level by measuring a large number of performance characteristics using performance counters on an AMD K7 Duron microprocessor.
Abstract: Java workloads are becoming increasingly prominent on various platforms ranging from embedded systems, over general-purpose computers, to high-end servers. Understanding the implications of all the aspects involved when running Java workloads is thus extremely important during the design of a system that will run such workloads. In other words, understanding the interaction between the Java application, its input and the virtual machine it runs on, is key to a successful design. The goal of this paper is to study this complex interaction at the microarchitectural level, e.g., by analyzing the branch behavior, the cache behavior, etc. This is done by measuring a large number of performance characteristics using performance counters on an AMD K7 Duron microprocessor. These performance characteristics are measured for seven virtual machine configurations, and a collection of Java benchmarks with corresponding inputs coming from the SPECjvm98 benchmark suite, the SPECjbb2000 benchmark suite, the Java Grande Forum benchmark suite and an open-source raytracer, called Raja, with 19 scene descriptions. This large amount of data is further analyzed using statistical data analysis techniques, namely principal components analysis and cluster analysis. These techniques provide useful insights in an understandable way. From our experiments, we conclude that (i) the behavior observed at the microarchitectural level is primarily determined by the virtual machine for small input sets, e.g., the SPECjvm98 s1 input set; (ii) the behavior can be quite different for various input sets, e.g., short-running versus long-running benchmarks; (iii) for long-running benchmarks with few hot spots, the behavior can be primarily determined by the Java program and not the virtual machine, i.e., all the virtual machines optimize the hot spots to similarly behaving native code; (iv) in general, the behavior of a Java application running on one virtual machine can be significantly different from running on another virtual machine. These conclusions warn researchers working on Java workloads to be careful when using a limited number of Java benchmarks or virtual machines since this might lead to biased conclusions.

Journal ArticleDOI
TL;DR: This work evaluates the Vector IRAM architecture and shows that a compiler can vectorize embedded tasks automatically without compromising code density, and describes a prototype vector processor that outperforms high-end superscalar and VLIW designs by 1.5x to 100x for media tasks, without compromising power consumption.
Abstract: For embedded applications with data-level parallelism, a vector processor offers high performance at low power consumption and low design complexity. Unlike superscalar and VLIW designs, a vector processor is scalable and can optimally match specific application requirements.To demonstrate that vector architectures meet the requirements of embedded media processing, we evaluate the Vector IRAM, or VIRAM (pronounced "V-IRAM"), architecture developed at UC Berkeley, using benchmarks from the Embedded Microprocessor Benchmark Consortium (EEMBC). Our evaluation covers all three components of the VIRAM architecture: the instruction set, the vectorizing compiler, and the processor microarchitecture. We show that a compiler can vectorize embedded tasks automatically without compromising code density. We also describe a prototype vector processor that outperforms high-end superscalar and VLIW designs by 1.5x to 100x for media tasks, without compromising power consumption. Finally, we demonstrate that clustering and modular design techniques let a vector processor scale to tens of arithmetic data paths before wide instruction-issue capabilities become necessary.

Journal ArticleDOI
TL;DR: In this article, the authors present the problem definition for the second generation of benchmark structural control problems for cable-stayed bridges and provide a testbed for the development of strategies for the control of cable-stayed bridges.
Abstract: This paper presents the problem definition for the second generation of benchmark structural control problems for cable-stayed bridges. The goal of this study is to provide a testbed for the development of strategies for the control of cable-stayed bridges. Based on detailed drawings of the Bill Emerson Memorial Bridge, a three-dimensional evaluation model has been developed to represent the complex behavior of the full-scale benchmark bridge. Phase II considers more complex structural behavior than phase I, including multi-support and transverse excitations. Evaluation criteria are presented for the design problem that are consistent with the goals of seismic response control of a cable-stayed bridge. Control constraints are also provided to ensure that the benchmark results are representative of a control implementation on the physical structure. Each participant in this benchmark bridge control study is given the task of defining, evaluating and reporting on their proposed control strategies. Participants should also evaluate the robust stability and performance of their resulting designs through simulation with an evaluation model which includes additional mass due to snow loads. The problem and a sample control design have been made available in the form of a set of MATLAB equations.

Book ChapterDOI
02 Jun 2003
TL;DR: A performance modeling framework that is faster than traditional cycle-accurate simulation, more sophisticated than performance estimation based on system peak-performance metrics, and is shown to be effective on the LINPACK benchmark and a synthetic version of an ocean modeling application (NLOM).
Abstract: This work presents a performance modeling framework, developed by the Performance Modeling and Characterization (PMaC) Lab at the San Diego Supercomputer Center, that is faster than traditional cycle-accurate simulation, more sophisticated than performance estimation based on system peak-performance metrics, and is shown to be effective on the LINPACK benchmark and a synthetic version of an ocean modeling application (NLOM). The LINPACK benchmark is further used to investigate methods to reduce the time required to make accurate performance predictions with the framework. These methods are applied to the predictions of the synthetic NLOM application.

Proceedings ArticleDOI
02 May 2003
TL;DR: To provide high-level focus to distributed space system flight dynamics and control research, several benchmark problems are suggested, intended to capture high-level features that would be generic to many similar missions.
Abstract: To provide high-level focus to distributed space system flight dynamics and control research, several benchmark problems are suggested. These problems are not specific to any current or proposed mission, but instead are intended to capture high-level features that would be generic to many similar missions.

Proceedings ArticleDOI
03 Dec 2003
TL;DR: This paper defines a power similarity metric as an intersection of both magnitude-based and ratio-wise similarities in the power dissipation of processor components and develops a thresholding algorithm in order to partition the power behavior into similarity groups.
Abstract: Characterizing program behavior is important for both hardware and software research. Most modern applications exhibit distinctly different behavior throughout their runtimes, which constitute several phases of execution that share a greater amount of resemblance within themselves compared to other regions of execution. These execution phases can occur at very large scales, necessitating prohibitively long simulation times for characterization. Due to the implementation of extensive clock gating and additional power and thermal management techniques in modern processors, these program phases are also reflected in program power behavior, which can be used as an alternative means of program behavior characterization for power-oriented research. In this paper, we present our methodology for identifying phases in program power behavior and determining execution points that correspond to these phases, as well as defining a small set of power signatures representative of overall program power behavior. We define a power similarity metric as an intersection of both magnitude-based and ratio-wise similarities in the power dissipation of processor components. We then develop a thresholding algorithm in order to partition the power behavior into similarity groups. We illustrate our methodology with the gzip benchmark for its whole runtime and characterize gzip power behavior with both the selected execution points and defined signature vectors.
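
The "intersection of magnitude-based and ratio-wise similarities" can be sketched as two thresholded tests that must both pass before two power vectors fall into the same similarity group. The thresholds and per-component wattages below are invented for illustration.

```python
# Toy power-similarity test: two per-component power vectors are similar
# only if (a) their total magnitudes agree within mag_tol and (b) their
# component ratios agree within ratio_tol - the intersection of both.
import numpy as np

def power_similar(p, q, mag_tol=0.1, ratio_tol=0.05) -> bool:
    p, q = np.asarray(p, float), np.asarray(q, float)
    mag_ok = abs(p.sum() - q.sum()) <= mag_tol * max(p.sum(), q.sum())
    ratio_ok = np.all(np.abs(p / p.sum() - q / q.sum()) <= ratio_tol)
    return bool(mag_ok and ratio_ok)

a = [12.0, 6.0, 3.0]                        # hypothetical per-unit watts
b = [12.5, 6.2, 2.9]
print(power_similar(a, b))                  # True: close in both senses
```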