
Showing papers on "Benchmark (computing)" published in 2001


Proceedings ArticleDOI
02 Dec 2001
TL;DR: A new version of SimpleScalar that has been adapted to the ARM instruction set is used to characterize the performance of the benchmarks using configurations similar to current and next generation embedded processors.
Abstract: This paper examines a set of commercially representative embedded programs and compares them to an existing benchmark suite, SPEC2000. A new version of SimpleScalar that has been adapted to the ARM instruction set is used to characterize the performance of the benchmarks using configurations similar to current and next generation embedded processors. Several characteristics distinguish the representative embedded programs from the existing SPEC benchmarks including instruction distribution, memory behavior, and available parallelism. The embedded benchmarks, called MiBench, are freely available to all researchers.

3,548 citations


Book ChapterDOI
03 Sep 2001
TL;DR: The goal of this paper is to provide an experimental comparison of the efficiency of min-cut/max-flow algorithms for energy minimization in vision, comparing the running times of several standard algorithms as well as a recently developed new algorithm.
Abstract: After [10, 15, 12, 2, 4] minimum cut/maximum flow algorithms on graphs emerged as an increasingly useful tool for exact or approximate energy minimization in low-level vision. The combinatorial optimization literature provides many min-cut/max-flow algorithms with different polynomial time complexity. Their practical efficiency, however, has to date been studied mainly outside the scope of computer vision. The goal of this paper is to provide an experimental comparison of the efficiency of min-cut/max flow algorithms for energy minimization in vision. We compare the running times of several standard algorithms, as well as a new algorithm that we have recently developed. The algorithms we study include both Goldberg-style "push-relabel" methods and algorithms based on Ford-Fulkerson style augmenting paths. We benchmark these algorithms on a number of typical graphs in the contexts of image restoration, stereo, and interactive segmentation. In many cases our new algorithm works several times faster than any of the other methods making near real-time performance possible.

3,099 citations
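For readers who want a concrete reference point for the augmenting-path family compared above, the following is a minimal Edmonds-Karp max-flow sketch in Python. It is a generic textbook algorithm rather than the authors' new method, and the tiny two-pixel graph and its capacities are invented purely for illustration.

```python
from collections import deque

def max_flow(capacity, source, sink):
    """Edmonds-Karp: repeatedly push flow along shortest augmenting paths.

    capacity: dict-of-dicts, capacity[u][v] = edge capacity.
    Returns the value of the maximum s-t flow (= min-cut capacity)."""
    # Residual capacities, with reverse edges initialized to 0.
    residual = {u: dict(nbrs) for u, nbrs in capacity.items()}
    for u, nbrs in capacity.items():
        for v in nbrs:
            residual.setdefault(v, {}).setdefault(u, 0)

    flow = 0
    while True:
        # BFS for a shortest augmenting path in the residual graph.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow  # no augmenting path left: flow is maximum
        # Bottleneck capacity along the path, then augment.
        bottleneck, v = float("inf"), sink
        while parent[v] is not None:
            u = parent[v]
            bottleneck = min(bottleneck, residual[u][v])
            v = u
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck

# Toy two-pixel segmentation graph (hypothetical numbers): s/t are the labels,
# "p" and "q" are pixels with a smoothness edge between them.
graph = {"s": {"p": 4, "q": 1}, "p": {"q": 2, "t": 1}, "q": {"t": 5}, "t": {}}
print(max_flow(graph, "s", "t"))  # min-cut capacity of this toy graph
```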


Proceedings ArticleDOI
08 Sep 2001
TL;DR: This paper proposes Basic Block Distribution Analysis as an automated approach for finding small portions of a program to simulate that are representative of the entire program's execution, and shows that the periodicity of the basic block frequency profile reflects the periodicity of detailed simulation across several different architectural metrics.
Abstract: Modern architecture research relies heavily on detailed pipeline simulation. Simulating the full execution of an industry standard benchmark can take weeks to months to complete. To overcome this problem researchers choose a very small portion of a program's execution to evaluate their results, rather than simulating the entire program. In this paper we propose Basic Block Distribution Analysis as an automated approach for finding these small portions of the program to simulate that are representative of the entire program's execution. This approach is based upon using profiles of a program's code structure (basic blocks) to uniquely identify different phases of execution in the program. We show that the periodicity of the basic block frequency profile reflects the periodicity of detailed simulation across several different architectural metrics (e.g., IPC, branch miss rate, cache miss rate, value misprediction, address misprediction, and reorder buffer occupancy). Since basic block frequencies can be collected using very fast profiling tools, our approach provides a practical technique for finding the periodicity and simulation points in applications.

571 citations
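A rough sketch of the interval-profiling idea described above, under my own simplifying assumptions: split execution into fixed-length instruction intervals, record a basic-block frequency vector per interval, and pick the interval whose normalized vector is closest to the whole-program vector as the simulation point. The Manhattan distance and all names below are illustrative choices, not definitions from the paper.

```python
import numpy as np

def pick_simulation_point(bb_counts):
    """bb_counts[i, b] = executed instructions attributed to basic block b
    during interval i (each interval covers a fixed number of dynamic
    instructions).  Returns the interval whose normalized basic-block
    frequency vector is closest to the whole-program vector."""
    per_interval = bb_counts / bb_counts.sum(axis=1, keepdims=True)
    whole_program = bb_counts.sum(axis=0) / bb_counts.sum()
    # Manhattan distance between each interval's profile and the full run.
    dist = np.abs(per_interval - whole_program).sum(axis=1)
    return int(dist.argmin()), dist

# Hypothetical profile: 6 intervals x 4 basic blocks.
counts = np.array([[90, 5, 3, 2],
                   [10, 60, 20, 10],
                   [12, 58, 20, 10],
                   [11, 59, 21, 9],
                   [80, 10, 6, 4],
                   [30, 40, 20, 10]])
best, dist = pick_simulation_point(counts)
print("simulate interval", best, "distances:", np.round(dist, 3))
```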




Journal ArticleDOI
TL;DR: A new method for generating motion fields from real sequences containing polyhedral objects is presented, together with a test suite for benchmarking optical flow algorithms consisting of complex synthetic sequences and real scenes with ground truth.

294 citations


Journal ArticleDOI
TL;DR: A stochastic formulation of watermarking attacks using an estimation-based concept and a new method of evaluating image quality based on the Watson metric, which overcomes the limitations of the PSNR, are proposed.

283 citations


Book ChapterDOI
30 Jul 2001
TL;DR: An overview of a new benchmark suite for parallel computers, SPEComp, which targets mid-size parallel servers and includes a number of science/engineering and data processing applications, is presented.
Abstract: We present a new benchmark suite for parallel computers. SPEComp targets mid-size parallel servers. It includes a number of science/engineering and data processing applications. Parallelism is expressed in the OpenMP API. The suite includes two data sets, Medium and Large, of approximately 1.6 and 4 GB in size. Our overview also describes the organization developing SPEComp, issues in creating OpenMP parallel benchmarks, the benchmarking methodology underlying SPEComp, and basic performance characteristics.

239 citations


Book ChapterDOI
25 Apr 2001
TL;DR: A second generation benchmark for image watermarking is proposed which includes attacks that take into account powerful prior information about the watermark and the watermarking algorithms, and presents results as a function of application.
Abstract: Digital image watermarking techniques for copyright protection have become increasingly robust. The best algorithms perform well against the now standard benchmark tests included in the Stirmark package. However, the Stirmark tests are limited since in general they do not properly model the watermarking process and consequently have limited potential to remove the best watermarks. Here we propose a second generation benchmark for image watermarking which includes attacks that take into account powerful prior information about the watermark and the watermarking algorithms. We follow the model of the Stirmark benchmark and propose several new categories of tests including: denoising (ML and MAP), wavelet compression, watermark copy attack, active desynchronization, denoising, geometrical attacks, and denoising followed by perceptual remodulation. In addition, we take the important step of presenting results as a function of application. This is an important contribution since it is unlikely that one technology will be suitable for all applications.

193 citations


Proceedings ArticleDOI
01 Dec 2001
TL;DR: A mechanism to dynamically, simultaneously and independently adjust the sizes of the issue queue (IQ), the reorder buffer (ROB) and the load/store queue (LSQ) based on the periodic sampling of their occupancies to achieve significant power savings with minimal impact on performance is proposed.
Abstract: The "one-size-fits-all" philosophy used for permanently allocating datapath resources in today's superscalar CPUs to maximize performance across a wide range of applications results in the overcommitment of resources in general. To reduce power dissipation in the datapath, the resource allocations can be dynamically adjusted based on the demands of applications. We propose a mechanism to dynamically, simultaneously and independently adjust the sizes of the issue queue (IQ), the reorder buffer (ROB) and the load/store queue (LSQ) based on the periodic sampling of their occupancies to achieve significant power savings with minimal impact on performance. Resource upsizing is done more aggressively (compared to downsizing) using the relative rate of blocked dispatches to limit the performance penalty. Our results are validated by the execution of SPEC 95 benchmark suite on a substantially modified version of Simplescalar simulator, where the IQ, the ROB, the LSQ and the register files are implemented as separate structures, as is the case with most practical implementations. For the SPEC 95 benchmarks, the use of our technique in a 4-way superscalar processor results in a power savings in excess of 70% within individual components and an average power savings of 53% for the IQ, LSQ and ROB combined for the entire benchmark suite with an average performance penalty of only 5%.

176 citations
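A toy model of the occupancy-sampling control loop the abstract describes: shrink a resource when its sampled occupancy stays well below the enabled size, and grow it back more aggressively when blocked dispatches indicate it is too small. The thresholds, step sizes, and interface are illustrative guesses, not the paper's parameters.

```python
class ResizableQueue:
    """Toy model of one datapath resource (IQ, ROB or LSQ) whose active size
    is adjusted periodically from occupancy samples and blocked dispatches."""

    def __init__(self, max_size=64, step=8):
        self.max_size = max_size
        self.step = step                 # resize granularity (one partition)
        self.size = max_size             # currently enabled entries
        self.occupancy_samples = []
        self.blocked_dispatches = 0
        self.samples = 0

    def sample(self, occupancy, dispatch_blocked):
        self.occupancy_samples.append(occupancy)
        self.samples += 1
        self.blocked_dispatches += dispatch_blocked

    def resize(self, downsize_slack=0.7, upsize_block_rate=0.05):
        """Called at the end of each sampling period."""
        avg_occ = sum(self.occupancy_samples) / len(self.occupancy_samples)
        block_rate = self.blocked_dispatches / max(self.samples, 1)
        if block_rate > upsize_block_rate:
            # Upsize aggressively: jump by two partitions when dispatch blocks.
            self.size = min(self.max_size, self.size + 2 * self.step)
        elif avg_occ < downsize_slack * self.size:
            # Downsize conservatively: release one unused partition.
            self.size = max(self.step, self.size - self.step)
        self.occupancy_samples.clear()
        self.blocked_dispatches = self.samples = 0
        return self.size

iq = ResizableQueue(max_size=64)
for cycle in range(1000):
    iq.sample(occupancy=20, dispatch_blocked=False)
print(iq.resize())   # low occupancy, no stalls -> downsized to 56
```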


Proceedings Article
04 Aug 2001
TL;DR: This work presents a hybrid approach for the 0-1 multidimensional knapsack problem that combines linear programming and Tabu Search and improves significantly on the best known results of a set of more than 150 benchmark instances.
Abstract: We present a hybrid approach for the 0-1 multidimensional knapsack problem. The proposed approach combines linear programming and Tabu Search. The resulting algorithm improves significantly on the best known results of a set of more than 150 benchmark instances.
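As a hedged illustration of the "linear programming first, discrete search second" idea, the sketch below solves the LP relaxation of the 0-1 multidimensional knapsack with SciPy and then builds a feasible 0-1 solution greedily from the fractional values. The paper's algorithm performs Tabu Search around the LP solution; this sketch deliberately omits that part, and the five-item instance is hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def mkp_lp_then_greedy(profits, weights, capacities):
    """0-1 multidimensional knapsack: maximize profits @ x subject to
    weights @ x <= capacities, x in {0,1}^n.

    Solve the LP relaxation, then build a feasible 0-1 solution by taking
    items in decreasing order of their fractional LP values.  (Not the
    paper's Tabu Search algorithm, only the LP front end.)"""
    n = len(profits)
    res = linprog(-np.asarray(profits), A_ub=weights, b_ub=capacities,
                  bounds=[(0, 1)] * n, method="highs")
    order = np.argsort(-res.x)                    # most "wanted" items first
    x = np.zeros(n, dtype=int)
    load = np.zeros(len(capacities))
    for i in order:
        if np.all(load + weights[:, i] <= capacities):
            x[i] = 1
            load += weights[:, i]
    return x, float(profits @ x)

# Tiny hypothetical instance: 5 items, 2 resource constraints.
profits = np.array([10, 13, 7, 8, 4])
weights = np.array([[3, 4, 2, 3, 1],     # resource 1 usage per item
                    [2, 3, 3, 1, 2]])    # resource 2 usage per item
capacities = np.array([7, 6])
print(mkp_lp_then_greedy(profits, weights, capacities))
```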

Journal ArticleDOI
TL;DR: In this article, the authors investigate the impact of reordering on data reuse at different levels in the memory hierarchy and introduce a new architecture-independent multi-level blocking strategy for irregular applications.
Abstract: The performance of irregular applications on modern computer systems is hurt by the wide gap between CPU and memory speeds because these applications typically under-utilize multi-level memory hierarchies, which help hide this gap. This paper investigates using data and computation reorderings to improve memory hierarchy utilization for irregular applications. We evaluate the impact of reordering on data reuse at different levels in the memory hierarchy. We focus on coordinated data and computation reordering based on space-filling curves and we introduce a new architecture-independent multi-level blocking strategy for irregular applications. For two particle codes we studied, the most effective reorderings reduced overall execution time by a factor of two and four, respectively. Preliminary experience with a scatter benchmark derived from a large unstructured mesh application showed that careful data and computation ordering reduced primary cache misses by a factor of two compared to a random ordering.
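A minimal sketch of data reordering along a space-filling curve, assuming a Morton (Z-order) curve and non-negative 2-D particle coordinates; the paper's actual curve, blocking strategy, and data structures may differ.

```python
def morton_key(x, y, bits=16):
    """Interleave the bits of integer grid coordinates (x, y) to form a
    Morton (Z-order) key; sorting by this key groups spatially nearby
    particles together in memory."""
    key = 0
    for b in range(bits):
        key |= ((x >> b) & 1) << (2 * b)
        key |= ((y >> b) & 1) << (2 * b + 1)
    return key

def reorder_particles(particles, cell=1.0):
    """particles: list of (x, y) float positions with non-negative coordinates.
    Returns the particles sorted along a Z-order curve over a `cell`-sized
    grid, so loops over the reordered array touch memory with better locality."""
    return sorted(particles, key=lambda p: morton_key(int(p[0] / cell),
                                                      int(p[1] / cell)))

# Hypothetical particle cloud.
pts = [(9.2, 1.1), (0.3, 0.4), (8.7, 1.5), (0.9, 0.2), (4.4, 7.8)]
print(reorder_particles(pts))
```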

Book ChapterDOI
07 Mar 2001
TL;DR: A scalable multi-user benchmark called XMach-1 (XML Data Management benchmark) is proposed. Based on a web application, it considers different types of XML data, in particular text documents, schema-less data and structured data, and measures the query throughput of a system under response time constraints.
Abstract: We propose a scalable multi-user benchmark called XMach-1 (XML Data Management benchmark) for evaluating the performance of XML data management systems. It is based on a web application and considers different types of XML data, in particular text documents, schema-less data and structured data. We specify the structure of the benchmark database and the generation of its contents. Furthermore, we define a mix of XML queries and update operations for which system performance is determined. The primary performance metric, Xqps, measures the query throughput of a system under response time constraints. We will use XMach-1 to evaluate both native XML data management systems and XML-enabled relational DBMS.

Journal ArticleDOI
TL;DR: The paper presents a new simulated annealing (SA)-based algorithm for the assembly line-balancing problem with a U-type configuration that employs an intelligent mechanism to search a large solution space.
Abstract: The paper presents a new simulated annealing (SA)-based algorithm for the assembly line-balancing problem with a U-type configuration. The proposed algorithm employs an intelligent mechanism to search a large solution space. U-type assembly systems are becoming increasingly popular in today's modern production environments since they are more general than the traditional assembly systems. In these systems, tasks are to be allocated into stations by moving forward and backward through the precedence diagram in contrast to a typical forward move in the traditional assembly systems. The performance of the algorithm is measured by solving a large number of benchmark problems available in the literature. The results of the computational experiments indicate that the proposed SA-based algorithm performs quite effectively. It also yields the optimal solution for most problem instances. Future research directions and a comprehensive bibliography are also provided here.
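A generic simulated annealing skeleton of the kind the abstract builds on; the U-line-specific pieces (a feasible task-to-station assignment and a neighborhood move that respects the forward and backward precedence rules) are not reproduced here, so the toy usage at the end just minimizes a quadratic.

```python
import math, random

def simulated_annealing(initial, neighbor, cost, t0=10.0, alpha=0.95,
                        iters_per_temp=100, t_min=1e-3):
    """Generic SA loop.  For the paper's U-line balancing problem, `initial`
    would be a feasible task-to-station assignment and `neighbor` a
    precedence-respecting move; those problem-specific parts are omitted."""
    current = best = initial
    t = t0
    while t > t_min:
        for _ in range(iters_per_temp):
            cand = neighbor(current)
            delta = cost(cand) - cost(current)
            # Accept improvements always, worse moves with Boltzmann probability.
            if delta <= 0 or random.random() < math.exp(-delta / t):
                current = cand
            if cost(current) < cost(best):
                best = current
        t *= alpha                      # geometric cooling schedule
    return best

# Toy usage: minimize (x - 3)^2 over integers by +/-1 moves; typically prints 3.
print(simulated_annealing(0, lambda x: x + random.choice((-1, 1)),
                          lambda x: (x - 3) ** 2))
```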

Journal ArticleDOI
TL;DR: It is found that the benchmark calculations are highly dependent on the choice of the dose-effect function and the definition of the benchmark dose, and it is recommended that several sets of biologically relevant default settings be used to illustrate the effect on the benchmark results.
Abstract: A threshold for dose-dependent toxicity is crucial for standards setting but may not be possible to specify from empirical studies. Crump (1984) instead proposed calculating the lower statistical confidence bound of the benchmark dose, which he defined as the dose that causes a small excess risk. This concept has several advantages and has been adopted by regulatory agencies for establishing safe exposure limits for toxic substances such as mercury. We have examined the validity of this method as applied to an epidemiological study of continuous response data associated with mercury exposure. For models that are linear in the parameters, we derived an approximate expression for the lower confidence bound of the benchmark dose. We find that the benchmark calculations are highly dependent on the choice of the dose-effect function and the definition of the benchmark dose. We therefore recommend that several sets of biologically relevant default settings be used to illustrate the effect on the benchmark results and to stimulate research that will guide an a priori choice of proper default settings.
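For readers unfamiliar with the benchmark-dose concept, a standard formulation in my own notation (not copied from this paper): for a response Y with abnormality cutoff c and a small predefined benchmark response level BMR, the benchmark dose and its lower confidence bound are defined as follows.

```latex
% P_0  : background probability of an abnormal response at dose 0
% BMR  : benchmark response (small predefined excess risk, e.g. 5%)
% BMDL : lower statistical confidence bound of the benchmark dose
\[
  P_0 = \Pr\bigl[Y > c \mid d = 0\bigr], \qquad
  \Pr\bigl[Y > c \mid d = \mathrm{BMD}\bigr] - P_0 = \mathrm{BMR},
\]
\[
  \mathrm{BMDL} = \text{lower confidence limit of } \mathrm{BMD},
  \ \text{used as the point of departure for exposure limits.}
\]
```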

Book ChapterDOI
04 Sep 2001
TL;DR: In the process of reducing and verifying the SPEC 2000 benchmark datasets, this work obtains instruction mix, memory behavior, and instructions per cycle characterization information about each benchmark program.
Abstract: The large input datasets in the SPEC 2000 benchmark suite result in unreasonably long simulation times when using detailed execution-driven simulators for evaluating future computer architecture ideas. To address this problem, we have an ongoing project to reduce the execution times of the SPEC 2000 benchmarks in a quantitatively defensible way. Upon completion of this work, we will have smaller input datasets for several SPEC 2000 benchmarks. The programs using our reduced input datasets will produce execution profiles that accurately reflect the program behavior of the full reference dataset, as measured using standard statistical tests. In the process of reducing and verifying the SPEC 2000 benchmark datasets, we also obtain instruction mix, memory behavior, and instructions per cycle characterization information about each benchmark program.

Proceedings ArticleDOI
01 May 2001
TL;DR: This work presents a general model for schedules with pipelining, and shows that finding a valid schedule with minimum cost is NP-hard, and presents a greedy heuristic for finding good schedules.
Abstract: Database systems frequently have to execute a set of related queries, which share several common subexpressions. Multi-query optimization exploits this, by finding evaluation plans that share common results. Current approaches to multi-query optimization assume that common subexpressions are materialized. Significant performance benefits can be had if common subexpressions are pipelined to their uses, without being materialized. However, plans with pipelining may not always be realizable with limited buffer space, as we show. We present a general model for schedules with pipelining, and present a necessary and sufficient condition for determining validity of a schedule under our model. We show that finding a valid schedule with minimum cost is NP-hard. We present a greedy heuristic for finding good schedules. Finally, we present a performance study that shows the benefit of our algorithms on batches of queries from the TPCD benchmark.

Proceedings ArticleDOI
13 Mar 2001
TL;DR: A technique based upon a statistical approach is described that improves existing estimation techniques and provides a degree of reliability in the error of the estimated execution time.
Abstract: Estimates of the execution time of embedded software play an important role in function-architecture co-design. This paper describes a technique based upon a statistical approach that improves existing estimation techniques. Our approach provides a degree of reliability in the error of the estimated execution time. We illustrate the technique using both control-oriented and computation-dominated benchmark programs.

Journal ArticleDOI
TL;DR: The main advantage of the proposed procedure is that it identifies critical activities without requiring too much information, and it is shown that the information that DEA provides for inefficient DMUs is in general not sufficient to improve their activities.

Journal ArticleDOI
TL;DR: In this article, the authors compared the performance of simulation problem analysis and research kernel (SPARK) and the HVACSIM+ programs by means of benchmark testing and showed that the graph-theoretic techniques employed in SPARK offer significant speed advantages over the other methods for significantly reducible problems and that even problem portions with little reduction potential can be solved efficiently.

Proceedings Article
27 Aug 2001
TL;DR: Using simulations, the performance potential of the MOLEN ρµ-coded processor, which comprises hardwired and microcoded reconfigurable units, is established and it is indicated that the execution cycles of the superscalar machine can be reduced by 30% for the JPEG benchmark and by 32% for the MPEG-2 benchmark using the proposed processor organization.
Abstract: In this paper, we introduce the MOLEN ρµ-coded processor which comprises hardwired and microcoded reconfigurable units. At the expense of three new instructions, the proposed mechanisms allow instructions, entire pieces of code, or their combination to execute in a reconfigurable manner. The reconfiguration of the hardware and the execution on the reconfigured hardware are performed by ρ-microcode (an extension of the classical microcode to allow reconfiguration capabilities). We include fixed and pageable microcode hardware features to extend the flexibility and improve the performance. The scheme allows partial reconfiguration and includes caching mechanisms for non-frequently used reconfiguration and execution microcode. Using simulations, we establish the performance potential of the proposed processor assuming the JPEG and MPEG-2 benchmarks, the ALTERA APEX20K boards for the implementation, and a hardwired superscalar processor. After implementation, cycle time estimations and normalization, our simulations indicate that the execution cycles of the superscalar machine can be reduced by 30% for the JPEG benchmark and by 32% for the MPEG-2 benchmark using the proposed processor organization.

Book ChapterDOI
18 Apr 2001
TL;DR: This paper systematically configures an ILS algorithm by optimizing the individual procedures of ILS and then optimizing their interaction, arriving at a highly effective ILS approach that outperforms the authors' implementation of the iterated dynasearch algorithm on the hardest benchmark instances.
Abstract: In this article we investigate the application of iterated local search (ILS) to the single machine total weighted tardiness problem. Our research is inspired by the recently proposed iterated dynasearch approach, which was shown to be a very effective ILS algorithm for this problem. In this paper we systematically configure an ILS algorithm by optimizing the individual procedures of ILS and then optimizing their interaction. We arrive at a highly effective ILS approach, which outperforms our implementation of the iterated dynasearch algorithm on the hardest benchmark instances.
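A barebones ILS control loop for the single machine total weighted tardiness problem, to make the structure concrete. The descent neighborhood, perturbation, and acceptance rule below are simple illustrative choices, not the configuration the paper arrives at, and the five-job instance is hypothetical.

```python
import random

def twt(seq, jobs):
    """Total weighted tardiness of a job sequence.
    jobs[j] = (processing_time, due_date, weight)."""
    t = total = 0
    for j in seq:
        p, d, w = jobs[j]
        t += p
        total += w * max(0, t - d)
    return total

def local_search(seq, jobs):
    """First-improvement descent over adjacent pairwise swaps."""
    improved = True
    while improved:
        improved = False
        for i in range(len(seq) - 1):
            cand = seq[:i] + [seq[i + 1], seq[i]] + seq[i + 2:]
            if twt(cand, jobs) < twt(seq, jobs):
                seq, improved = cand, True
    return seq

def iterated_local_search(jobs, iterations=200, seed=0):
    """Barebones ILS: descend, perturb by shuffling a random 3-job window,
    descend again, and keep the better of the two local optima."""
    random.seed(seed)
    seq = local_search(list(range(len(jobs))), jobs)
    best = seq
    for _ in range(iterations):
        pert = seq[:]
        i = random.randrange(len(pert) - 2)
        window = pert[i:i + 3]
        random.shuffle(window)                  # perturbation step
        pert[i:i + 3] = window
        cand = local_search(pert, jobs)
        if twt(cand, jobs) <= twt(best, jobs):  # acceptance criterion
            best = seq = cand
    return best, twt(best, jobs)

# Hypothetical 5-job instance: (processing time, due date, weight).
jobs = [(4, 5, 2), (3, 4, 1), (7, 10, 3), (2, 3, 2), (5, 12, 1)]
print(iterated_local_search(jobs))
```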

Journal ArticleDOI
TL;DR: This paper offers a framework for a spatio-temporal data sets generator, a first step towards a full benchmark for the large real world application field of “smoothly” moving objects with few or no restrictions in motion.
Abstract: The spatio-temporal database research community has just started to investigate benchmarking issues. On one hand we would rather have a benchmark that is representative of real world applications, in order to verify the expressiveness of proposed models. On the other hand, we would like a benchmark that offers a sizeable workload of data and query sets, which could obviously stress the strengths and weaknesses of a broad range of data access methods. This paper offers a framework for a spatio-temporal data sets generator, a first step towards a full benchmark for the large real world application field of “smoothly” moving objects with few or no restrictions in motion. The driving application is the modeling of fishing ships where the ships go in the direction of the most attractive shoals of fish while trying to avoid storm areas. Shoals are themselves attracted by plankton areas. Ships are moving points; plankton or storm areas are regions with fixed center but moving shape; and shoals are moving regions. The specification is written in such a way that the users can easily adjust generation model parameters.
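A toy sketch of the kind of generator the abstract describes: moving points that drift smoothly toward an attractor plus noise. The single static attractor and all parameters are drastic simplifications of the paper's ship/shoal/plankton/storm model.

```python
import random

def generate_ship_tracks(n_ships, n_steps, attractor=(50.0, 50.0), pull=0.1,
                         noise=2.0, seed=0):
    """Toy generator for 'smoothly' moving points: each ship drifts toward a
    fixed attractor (standing in for an attractive shoal) plus Gaussian noise.
    Returns one list of (t, x, y) samples per ship."""
    rng = random.Random(seed)
    tracks = []
    for _ in range(n_ships):
        x, y = rng.uniform(0, 100), rng.uniform(0, 100)
        track = [(0, x, y)]
        for t in range(1, n_steps):
            x += pull * (attractor[0] - x) + rng.gauss(0, noise)
            y += pull * (attractor[1] - y) + rng.gauss(0, noise)
            track.append((t, x, y))
        tracks.append(track)
    return tracks

print(generate_ship_tracks(2, 3)[0])   # first ship's short track
```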

Journal ArticleDOI
TL;DR: Functional cache miss ratios and related statistics for selected benchmarks in the SPEC CPU2000 suite are presented.
Abstract: The SPEC CPU2000 benchmark suite (http://www.spec.org/osg/cpu2000) is a collection of 26 compute-intensive, non-trivial programs used to evaluate the performance of a computer's CPU, memory system, and compilers. The benchmarks in this suite were chosen to represent real-world applications, and thus exhibit a wide range of runtime behaviors. On this webpage, we present functional cache miss ratios and related statistics for selected benchmarks in the SPEC CPU2000 suite. In particular, we consider split L1 caches with sizes ranging from 4KB to 1MB, 64B blocks, and associativities of 1, 2, 4, 8, and full. Most of this data was collected at the University of Wisconsin-Madison with the aid of the Simplescalar toolset (http://www.simplescalar.org).
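A small functional cache simulator of the sort that produces such miss ratios, assuming an LRU set-associative cache and a synthetic address trace; the published data were gathered with the SimpleScalar toolset, not with this sketch.

```python
from collections import OrderedDict

def miss_ratio(trace, size_bytes, assoc, block_bytes=64):
    """Functional simulation of one LRU set-associative cache level.
    trace: iterable of byte addresses.  Returns misses / accesses."""
    n_sets = size_bytes // (block_bytes * assoc)
    sets = [OrderedDict() for _ in range(n_sets)]   # per-set LRU stacks
    misses = accesses = 0
    for addr in trace:
        block = addr // block_bytes
        s = sets[block % n_sets]
        accesses += 1
        if block in s:
            s.move_to_end(block)                    # hit: refresh LRU position
        else:
            misses += 1
            s[block] = True
            if len(s) > assoc:
                s.popitem(last=False)               # evict least recently used
    return misses / accesses

# Hypothetical toy trace: a sequential sweep over a 32KB array, run twice.
trace = [i * 64 for i in range(512)] * 2
print(miss_ratio(trace, size_bytes=16 * 1024, assoc=2))   # too small: all miss
print(miss_ratio(trace, size_bytes=64 * 1024, assoc=2))   # fits: 2nd pass hits
```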


Journal ArticleDOI
TL;DR: A checkpointing technique is presented in which the selection of checkpoint positions is based on a checkpointing-recovery cost model; it allows faster execution and, in some cases, has the additional advantage that less memory is required for recording state vectors.
Abstract: Recent papers have shown that the performance of Time Warp simulators can be improved by appropriately selecting the positions of checkpoints, instead of taking them on a periodic basis. In this paper, we present a checkpointing technique in which the selection of the positions of checkpoints is based on a checkpointing-recovery cost model. Given the current state S, the model determines the convenience of recording S as a checkpoint before the next event is executed. This is done by taking into account the position of the last taken checkpoint, the granularity (i.e., the execution time) of intermediate events, and using an estimate of the probability that S will have to be restored due to rollback in the future of the execution. A synthetic benchmark in different configurations is used for evaluating and comparing this approach to classical periodic techniques. As a testing environment we used a cluster of PCs connected through a Myrinet switch coupled with a fast communication layer specifically designed to exploit the potential of this type of switch. The obtained results point out that our solution allows faster execution and, in some cases, exhibits the additional advantage that less memory is required for recording state vectors. This possibly contributes to further performance improvements when memory is a critical resource for the specific application. A performance study for the case of a cellular phone system simulation is finally reported to demonstrate the effectiveness of this solution for a real world application.
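The general shape of such a checkpointing decision, written in my own notation rather than the paper's: record the current state S only when saving it now is expected to be cheaper than re-building it later by re-executing the events since the previous checkpoint.

```latex
% delta_s : cost of saving one state vector
% p_rb(S) : estimated probability that state S will be restored by a rollback
% g_i     : granularity (execution time) of intermediate event i
\[
  \text{take a checkpoint at } S
  \quad\Longleftrightarrow\quad
  \delta_s \;<\; p_{\mathrm{rb}}(S) \sum_{i \in \mathcal{E}(S)} g_i ,
\]
where $\mathcal{E}(S)$ is the set of events executed since the last recorded
checkpoint, i.e. the work that coasting forward would have to repeat.
```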

Proceedings ArticleDOI
17 Jun 2001
TL;DR: This paper presents an optimal algorithm to combine loop shifting, loop fusion and array contraction to reduce the temporary array storage required to execute a collection of loops.
Abstract: In this paper, we propose memory reduction as a new approach to data locality enhancement. Under this approach, we use the compiler to reduce the size of the data repeatedly referenced in a collection of nested loops. Between their reuses, the data will more likely remain in higher-speed memory devices, such as the cache. Specifically, we present an optimal algorithm to combine loop shifting, loop fusion and array contraction to reduce the temporary array storage required to execute a collection of loops. When applied to 20 benchmark programs, our technique reduces the memory requirement, counting both the data and the code, by 51% on average. The transformed programs gain a speedup of 1.40 on average, due to the reduced footprint and, consequently, the improved data locality.
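A generic before/after illustration of loop fusion plus array contraction (a textbook example, not one of the paper's 20 benchmarks): fusing the two loops lets the full-size temporary array collapse to a scalar.

```python
n = 100_000
a = [float(i) for i in range(n)]
b = [1.0] * n

def before(a, b):
    """Two loops communicating through a full-size temporary array."""
    n = len(a)
    t = [0.0] * n            # O(n) temporary storage, written then re-read
    c = [0.0] * n
    for i in range(n):
        t[i] = a[i] + b[i]
    for i in range(n):
        c[i] = 2.0 * t[i]
    return c

def after(a, b):
    """Same computation after loop fusion + array contraction: the temporary
    shrinks to a scalar, so each value is consumed while it is still hot."""
    n = len(a)
    c = [0.0] * n
    for i in range(n):
        t = a[i] + b[i]      # contracted temporary
        c[i] = 2.0 * t
    return c

assert before(a, b) == after(a, b)
```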


Journal ArticleDOI
TL;DR: This paper illustrates a new approach to evaluating portfolios in the context of multiple performance measures based upon linear programming techniques and identifies the n-dimensional efficient portfolio frontier.
Abstract: This paper illustrates a new approach to evaluating portfolios in the context of multiple performance measures. The approach is based upon linear programming techniques and identifies the n-dimensional efficient portfolio frontier. An illustrative example with commodity trading advisor (CTA) returns shows that benchmarks can be identified for each individual portfolio.
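One standard way to cast the identification of an n-dimensional efficient frontier as a linear program is a DEA-style model; the paper's exact formulation may differ, so the notation below is illustrative only. Portfolio k, with risk measures x_{ik} as inputs and performance measures y_{rk} as outputs, lies on the frontier exactly when the optimum theta* equals 1.

```latex
\[
  \max_{\theta,\ \lambda \ge 0} \ \theta
  \quad \text{s.t.} \quad
  \sum_{j} \lambda_j \, y_{rj} \ \ge\ \theta \, y_{rk} \ \ \forall r,
  \qquad
  \sum_{j} \lambda_j \, x_{ij} \ \le\ x_{ik} \ \ \forall i ,
\]
% theta^* = 1: portfolio k is efficient; theta^* > 1 would mean a convex
% combination of other portfolios dominates it on every measure.
```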
