
Showing papers in "Journal of Instruction-level Parallelism in 2005"


Journal Article
TL;DR: This paper describes the new features available in the SimPoint 3.0 release, which provides support for correctly clustering variable-length intervals, taking into consideration the weight of each interval during clustering.
Abstract: This paper describes the new features available in the SimPoint 3.0 release. The release provides two techniques for drastically reducing the run-time of SimPoint: faster searching to find the best clustering, and efficiently clustering large numbers of intervals. SimPoint 3.0 also provides an option to output only the simulation points that represent the majority of execution, which can reduce simulation time without much increase in error. Finally, this release provides support for correctly clustering variable-length intervals, taking into consideration the weight of each interval during clustering. This paper describes SimPoint 3.0’s new features and how to use them, and points out some common pitfalls.
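
The weighted clustering the abstract refers to amounts to a k-means in which each interval's pull on its centroid is proportional to the amount of execution it represents. Below is a minimal Python sketch of that idea, assuming NumPy arrays of per-interval basic-block vectors and instruction-count weights; it illustrates the technique only and is not SimPoint's actual implementation.

    import numpy as np

    def weighted_kmeans(vectors, weights, k, iters=100, seed=0):
        """Weighted k-means sketch: longer intervals pull centroids harder.

        vectors: (N, D) array of per-interval basic-block vectors.
        weights: (N,) array, e.g. instructions executed per interval.
        Illustrative only; not SimPoint's code.
        """
        rng = np.random.default_rng(seed)
        centers = vectors[rng.choice(len(vectors), k, replace=False)].astype(float)
        labels = np.zeros(len(vectors), dtype=int)
        for _ in range(iters):
            # Assign each interval to its nearest centroid.
            d = ((vectors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d.argmin(axis=1)
            # Recompute each centroid as a weight-weighted mean.
            for c in range(k):
                mask = labels == c
                if mask.any():
                    w = weights[mask]
                    centers[c] = (vectors[mask] * w[:, None]).sum(axis=0) / w.sum()
        return labels, centers

A simulation point would then be the interval nearest each centroid, weighted by its cluster's share of the total instruction count.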

309 citations


Journal Article
TL;DR: The Optimization-Space Exploration (OSE) compiler organization is presented, the first practical iterative compilation strategy applicable to optimizations in general-purpose compilers; it uses the compiler writer's knowledge encoded in the heuristics to select a small number of promising optimization alternatives for a given code segment.

229 citations


Journal Article
TL;DR: This paper attempts to show that there is a significant peak temperature reduction potential in managing lateral heat spreading through floorplanning and argues that this potential warrants consideration of the temperature-performance trade-off early in the design stage at the microarchitectural level using floorplanning.
Abstract: In current day microprocessors, exponentially increasing power densities, leakage, cooling costs, and reliability concerns have resulted in temperature becoming a first class design constraint like performance and power. Hence, virtually every high performance microprocessor uses a combination of an elaborate thermal package and some form of Dynamic Thermal Management (DTM) scheme that adaptively controls its temperature. While DTM schemes exploit the important variable of power density to control temperature, this paper attempts to show that there is a significant peak temperature reduction potential in managing lateral heat spreading through floorplanning. It argues that this potential warrants consideration of the temperature-performance trade-off early in the design stage at the microarchitectural level using floorplanning. As a demonstration, it uses a previously proposed wire delay model and a floorplanning algorithm based on simulated annealing to present a profile-driven, thermal-aware floorplanning scheme that significantly reduces peak processor temperature with minimal performance impact, making it quite competitive with DTM.
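
As a rough sketch of the profile-driven, simulated-annealing approach the abstract describes, the following Python skeleton anneals a floorplan under a cost that trades peak temperature against performance. The floorplan object, thermal model, and wire-delay penalty are caller-supplied stand-ins, and the weighting and cooling schedule are illustrative; none of this is the paper's actual algorithm.

    import math, random

    def anneal_floorplan(fp, peak_temp, perf_penalty, alpha=0.5,
                         t0=1000.0, cool=0.95, steps=10000):
        """Skeleton of thermal-aware floorplanning by simulated annealing.

        fp: a floorplan object with a perturb() method (hypothetical).
        peak_temp, perf_penalty: caller-supplied models, e.g. a thermal
        RC model and a wire-delay estimate. All names are stand-ins.
        """
        def cost(f):
            # Weighted sum: trade peak temperature against lost performance.
            return alpha * peak_temp(f) + (1 - alpha) * perf_penalty(f)

        cur, cur_c = fp, cost(fp)
        best, best_c, t = cur, cur_c, t0
        for _ in range(steps):
            cand = cur.perturb()            # swap/rotate/move a block
            c = cost(cand)
            # Accept downhill moves always, uphill with Boltzmann probability.
            if c < cur_c or random.random() < math.exp((cur_c - c) / t):
                cur, cur_c = cand, c
                if c < best_c:
                    best, best_c = cand, c
            t *= cool                       # geometric cooling schedule
        return best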

201 citations


Journal Article
TL;DR: This paper describes the 2bcgskew branch predictor fused by an alloyed redundant history skewed perceptron predictor, which is the design submitted to the 1st JILP Championship Branch Prediction (CBP) competition.
Abstract: This paper describes the 2bcgskew branch predictor fused by an alloyed redundant history skewed perceptron predictor, which is our design submitted to the 1st JILP Championship Branch Prediction (CBP) competition. The presented predictor intelligently combines multiple predictions (fusion) in order to obtain a more accurate prediction. The various predictions are delivered by a 2bcgskew predictor and include the 2bcgskew prediction itself as well as the bias and hysteresis bits of its component predictors. Together with global history, local history and address information, these predictions are used in the fusion predictor, which is an alloyed redundant history skewed perceptron predictor (RHSP). The new predictor design outperforms gshare by 40% on the CBP traces. This improvement also manifests itself for the SPEC INT 2000 benchmarks.
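
The fusion step can be pictured as a perceptron whose inputs are the component predictions themselves. The sketch below shows only that shape, with ±1 inputs standing for the 2bcgskew prediction, its components' bias and hysteresis bits, and history/address bits; it is an illustration of prediction fusion, not the authors' RHSP.

    def fuse(inputs, weights, threshold=0):
        """Perceptron-style fusion of multiple predictor outputs.

        inputs: +/-1 features (component predictions, bias/hysteresis
        bits, history and address bits). weights[0] is the bias weight.
        A sketch of the fusion idea only.
        """
        s = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
        return s >= threshold   # True = predict taken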

71 citations


Journal Article
TL;DR: A tag-based, global-history predictor derived from PPM is proposed; it features five tables and a new update method that improves the misprediction rate, along with a method for implementing its hashing functions in hardware.
Abstract: The predictor proposed is a tag-based, global-history predictor derived from PPM. It features five tables. Each table is indexed with a different history length. The prediction for a branch is given by the up-down saturating counter associated with the longest matching history. For this kind of predictor to work well, the update must be done carefully. We propose a new update method that improves the misprediction rate. We also propose a method for implementing hashing functions in hardware.

1 Overview

The predictor proposed is a global-history predictor derived from PPM. PPM was originally introduced for text compression [2], and it was used in [1] for branch prediction. Figure 1 shows a synopsis of the proposed predictor, which features 5 tables. It can be viewed as a 4th-order approximation to PPM [6], while YAGS [3] can be viewed as a 1st-order approximation. The leftmost table in Figure 1 is a bimodal predictor [4]. We refer to this table as table 0. It has 4k entries and is indexed with the 12 least significant bits of the branch PC. Each entry of table 0 contains a 3-bit up-down saturating counter and a bit m (m stands for meta-predictor) whose function is described in Section 3. Table 0 uses a total of 4k × (3 + 1) = 16 Kbits of storage. The 4 other tables are indexed with both the branch PC and some global history bits: tables 1, 2, 3, and 4 are indexed respectively with the 10, 20, 40, and 80 most recent bits of the 80-bit global history, as indicated in Figure 1. When the number of global history bits exceeds the number of index bits, the global history is “folded” by a bit-wise XOR of groups of consecutive history bits, then XORed with the branch PC as in a gshare predictor [4]. For example, table 3 is indexed with 40 history bits, and the index may be implemented as pc[0 : 9] ⊕ h[0 : 9] ⊕ h[10 : 19] ⊕ h[20 : 29] ⊕ h[30 : 39], where ⊕ denotes the bit-wise XOR. Section 4 describes precisely the index functions that were used for the submission. Each of the tables 1 to 4 has 1k entries. Each entry contains an 8-bit tag, a 3-bit up-down saturating counter, and a bit u (u stands for “useful entry”; its function is described in Section 3), for a total of 12 bits per entry. So each of the tables 1 to 4 uses 1k × (3 + 8 + 1) = 12 Kbits. The total storage used by the predictor is 16k + 4 × 12k = 64 Kbits.

2 Obtaining a prediction

At prediction time, the 5 tables are accessed simultaneously. While accessing the tables, an 8-bit tag is computed for each of tables 1 to 4. The hash function used to compute the 8-bit tag is different from the one used to index the table, but it takes as input the same PC and global history bits. Once the access is done, we obtain four 8-bit tags from tables 1 to 4, and 5 prediction bits from tables 0 to 4 (the prediction bit is the most significant bit of the 3-bit counter). We obtain a total of 4 × 8 + 5 = 37 bits. These 37 bits are then reduced to a 0/1 final prediction, obtained as the most significant bit of the counter associated with the longest matching history.
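
The folded indexing the abstract spells out for table 3 is easy to sketch. The Python function below XORs consecutive groups of history bits down to the index width and then XORs in the low PC bits; the bit-numbering convention is one plausible reading of the abstract, not the authors' exact Section 4 functions.

    def fold_index(pc, hist, hist_len, index_bits=10):
        """Fold hist_len global-history bits down to index_bits and XOR
        with the low PC bits, as in the abstract's table-3 example:
        pc[0:9] ^ h[0:9] ^ h[10:19] ^ h[20:29] ^ h[30:39].
        Bit ordering here is one plausible reading, not the paper's exact
        hash."""
        mask = (1 << index_bits) - 1
        folded = 0
        for i in range(0, hist_len, index_bits):
            folded ^= (hist >> i) & mask   # XOR consecutive bit groups
        return (pc ^ folded) & mask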

40 citations


Journal Article
TL;DR: The design of a novel fine-grained hardware cache monitoring system in an SMT-based processor that enables improved operating system scheduling and recaptures parallelism by mitigating interference is explored.
Abstract: Simultaneous Multithreading (SMT) has emerged as an effective method of increasing utilization of resources in modern super-scalar processors. SMT processors increase instruction-level parallelism (ILP) and resource utilization by simultaneously executing instructions from multiple independent threads. Although simultaneously sharing resources benefits system throughput, coscheduled threads often aggressively compete for limited resources, namely the cache memory system. While compiler and hardware technologies have been traditionally examined for their effect on ILP, in the context of SMT machines, the operating system also has a substantial influence on system performance. By making informed scheduling decisions, the operating system can limit the amount of contention in the memory hierarchy between threads and reduce the impact of multiple threads simultaneously accessing the cache system. This paper explores the design of a novel fine-grained hardware cache monitoring system in an SMT-based processor that enables improved operating system scheduling and recaptures parallelism by mitigating interference.
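
The abstract does not detail the monitoring hardware itself, so the following Python sketch is purely illustrative of the kind of per-thread interference counters such a monitor could expose to the operating system's scheduler; every name here is hypothetical.

    from collections import defaultdict

    class CacheMonitor:
        """Illustrative per-thread cache-interference counters.

        The paper's monitoring hardware is fine-grained; this sketch only
        shows the shape of an OS-visible interface, with made-up names.
        """
        def __init__(self):
            self.misses = defaultdict(int)
            self.evictions_by = defaultdict(int)  # lines a thread evicted

        def record_miss(self, tid):
            self.misses[tid] += 1

        def record_eviction(self, evictor, victim):
            if evictor != victim:
                self.evictions_by[evictor] += 1   # inter-thread interference

        def least_conflicting(self, candidates):
            # Scheduler picks the candidate that evicts others' data least.
            return min(candidates, key=lambda t: self.evictions_by[t])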

30 citations


Journal Article
TL;DR: Adaptive schemes are proposed that preprocess history information so that the input vector to a perceptron predictor contains only the history bits with the strongest correlation, allowing a much larger history-information set to be explored effectively without increasing the size of the perceptron predictor.
Abstract: Perceptron branch predictors achieve high prediction accuracy by capturing correlation from very long histories. The required hardware, however, limits the history length to be explored practically. In this paper, an important observation is made that the perceptron weights can be used to estimate the strength of branch correlation. Based on such an estimate, adaptive schemes are proposed to preprocess history information so that the input vector to a perceptron predictor contains only those history bits with the strongest correlation. In this way, a much larger history-information set can be explored effectively without increasing the size of perceptron predictors. For the distributed Championship Branch Prediction (CBP-1) traces, our proposed scheme achieves a 47% improvement over a g-share predictor of the same size. For SPEC2000 benchmarks, our proposed scheme outperforms the g-share predictor by 35% on average.
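
The core idea, using weight magnitude as a correlation estimate, can be sketched in a few lines of Python: rank history positions by |weight| and feed only the strongest into a small perceptron. This is an illustration of the abstract's idea, not the authors' exact adaptive mechanism.

    def select_strong_bits(weights, history, n_inputs):
        """Keep the history positions whose perceptron weights have the
        largest magnitude; only those bits feed the (small) perceptron.
        A sketch of the abstract's idea, with hypothetical interfaces."""
        ranked = sorted(range(len(weights)),
                        key=lambda i: abs(weights[i]), reverse=True)
        chosen = sorted(ranked[:n_inputs])    # keep original bit order
        return [history[i] for i in chosen], chosen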

28 citations



Journal Article
TL;DR: An idealized branch predictor is presented that develops a set of linear functions, one for each program path to the branch to be predicted, that together separate predicted-taken from predicted-not-taken branches.
Abstract: Traditional branch predictors exploit correlations between pattern history and branch outcome to predict branches, but there is a stronger and more natural correlation between path history and branch outcome. I exploit this correlation with piecewise linear branch prediction, an idealized branch predictor that develops a set of linear functions, one for each program path to the branch to be predicted, that separate predicted taken from predicted not taken branches. Taken together, all of these linear functions form a piecewise linear decision surface. Disregarding implementation concerns modulo a 64.25 kilobit hardware budget, I present this idealized branch predictor for the first Championship Branch Predictor competition. I describe the idea of the algorithm as well as tricks used to squeeze it into 64.25 kilobits while maintaining good accuracy.
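
A minimal rendering of the idea in Python: keep one weight per (branch, path element, position) triple, sum them along the current path, and train with the usual perceptron threshold rule. Table bounds and the tricks used to fit the 64.25 kilobit budget are omitted; the history length and threshold below are arbitrary.

    from collections import defaultdict

    class PiecewiseLinear:
        """Sketch of idealized piecewise linear branch prediction: one
        linear function per path to each branch, per the abstract.
        Storage is unbounded here, unlike the actual submission."""
        def __init__(self, hist_len=32, theta=100):
            self.W = defaultdict(int)  # (branch, path_pc, position) -> weight
            self.hist_len, self.theta = hist_len, theta
            self.ghr, self.path = [], []       # outcomes and branch PCs

        def output(self, b):
            s = self.W[(b, 0, 0)]              # bias weight
            for i, (pc, taken) in enumerate(zip(self.path, self.ghr), 1):
                s += self.W[(b, pc, i)] * (1 if taken else -1)
            return s

        def predict(self, b):
            return self.output(b) >= 0

        def update(self, b, taken):
            s = self.output(b)
            if (s >= 0) != taken or abs(s) < self.theta:  # perceptron rule
                d = 1 if taken else -1
                self.W[(b, 0, 0)] += d
                for i, (pc, t) in enumerate(zip(self.path, self.ghr), 1):
                    self.W[(b, pc, i)] += d * (1 if t else -1)
            self.ghr = (self.ghr + [taken])[-self.hist_len:]
            self.path = (self.path + [b])[-self.hist_len:]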

23 citations


Journal Article
TL;DR: A scheduler for deeply pipelined microprocessors is proposed that predicts the execution delay of instructions and issues them accordingly, thereby enabling more precise scheduling of instructions dependent on in-flight loads.
Abstract: Pipeline depths in high performance dynamically scheduled microprocessors are increasing steadily. In addition, level 1 caches are shrinking to meet latency constraints - but more levels of cache are being added to mitigate this performance impact. Moreover, the growing schedule-to-execute window of deeply pipelined processors has required the use of speculative scheduling techniques. When these effects are combined, we are faced with performance degradation and increased power consumption due to load misscheduling, particularly when considering instructions dependent on in-flight loads. In this paper, we propose a scheduler for such processors. Instead of non-selectively speculating, the scheduler predicts the execution delay of instructions and issues them accordingly. This, in turn, can eliminate the issuing of some operations that would otherwise be squashed. Clearly, load operations constitute an important obstacle in predicting the latency of instructions, because their latencies are not known until the cache access stage, which happens later in the pipeline. Our proposed techniques can estimate the arrival of cache blocks in various locations of the cache hierarchy, thereby enabling more precise scheduling of instructions dependent on these loads. Our scheduler makes use of two structures: a Previously-Accessed Table that stores the source addresses of in-flight load operations and a Cache Miss Detection Engine that detects the location of the block to be accessed in the memory hierarchy. Using the SPEC 2000 CPU suite, we show that the number of instructions issued can be reduced by as much as 52.5% (16.9% on average) while increasing the performance by as much as 42.1% (14.3% on average) over the performance of an aggressive processor.
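
The scheduling decision the abstract describes reduces to predicting when a load's data will arrive. Below is a hedged Python sketch, with made-up latencies and interfaces standing in for the Previously-Accessed Table and Cache Miss Detection Engine.

    LATENCY = {"L1": 2, "L2": 10, "MEM": 200}   # illustrative cycle counts

    def wakeup_cycle(load_addr, pat, cmde, now):
        """Sketch of the abstract's mechanism: the Previously-Accessed
        Table (pat) records when each in-flight load issued, and the
        Cache Miss Detection Engine (cmde) predicts which level holds the
        block. Dependents are scheduled for the predicted arrival cycle
        instead of optimistically assuming an L1 hit. All names and
        latencies are hypothetical."""
        issued = pat.get(load_addr, now)   # cycle the load issued
        level = cmde(load_addr)            # "L1", "L2", or "MEM"
        return issued + LATENCY[level]     # cycle dependents may issue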

18 citations


Journal Article
TL;DR: This paper comprehensively analyzes the redundancy in the information stored and exchanged between the processor and the memory system and evaluates the potential of compression in improving performance, power consumption, and cost of the memory system.
Abstract: Continuing exponential growth in processor performance, combined with technology, architecture, and application trends, place enormous demands on the memory system to allow information storage and exchange at a high-enough performance (i.e., to provide low latency and high bandwidth access to large amounts of information), at low power, and cost-effectively. This paper comprehensively analyzes the redundancy in the information (addresses, instructions, and data) stored and exchanged between the processor and the memory system and evaluates the potential of compression in improving performance, power consumption, and cost of the memory system. Traces obtained with Sun Microsystems’ Shade simulator simulating SPARC executables of eight integer and seven floating-point programs in the SPEC CPU2000 benchmark suite and five programs from the MediaBench suite, and analyzed using Markov entropy models, existing compression schemes, and CACTI 3.0 and SimplePower timing, power, and area models yield impressive results.
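
As a small example of the kind of redundancy measurement the abstract mentions, the following Python computes the first-order Markov (conditional) entropy of a symbol trace; the paper's actual models, traces, and tools (Shade, CACTI 3.0, SimplePower) are of course far more involved.

    import math
    from collections import Counter

    def first_order_entropy(trace):
        """First-order Markov entropy (bits/symbol) of an address or data
        trace: H = -sum_{c,s} p(c,s) * log2 p(s|c). A small sketch of the
        redundancy measurement the abstract describes."""
        pairs = Counter(zip(trace, trace[1:]))  # (context, next) counts
        ctx = Counter(trace[:-1])               # context counts
        n = len(trace) - 1
        h = 0.0
        for (c, s), k in pairs.items():
            p_cs = k / n                # joint probability p(c, s)
            p_s_given_c = k / ctx[c]    # conditional probability p(s | c)
            h -= p_cs * math.log2(p_s_given_c)
        return h

Low entropy relative to the stored word width is precisely the redundancy that compression can exploit.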

Journal Article
TL;DR: This study shows that when these sources of waste are eliminated, processor energy has the potential to be reduced by 55% and 52% for the studied SPEC 2000 integer and floating-point benchmarks respectively.
Abstract: This paper explores the limits of microprocessor power savings available via certain classes of architecture-level optimization. It classifies architectural power optimizations into three categories, corresponding to three sources of waste that consume energy. The first is the execution of instructions that are unnecessary for correct program execution. The second source of wasted power is speculation waste: waste due to speculative execution of instructions that do not commit their results. The third source is architectural waste. This comes from suboptimal sizing of processor structures. This study shows that when these sources of waste are eliminated, processor energy has the potential to be reduced by 55% and 52% for the studied SPEC 2000 integer and floating-point benchmarks respectively.

Journal Article
TL;DR: This paper presents an ILP-based optimal register allocator that is much faster than previous work, built into the GNU C Compiler and evaluated experimentally using the SPEC92INT benchmarks.

Journal Article
TL;DR: Reinterpreting perceptron weights with non-linear translation functions and using different-sized tables with different counter widths shows promise and is practical, and the Frankenpredictor’s branch history register update rules for unconditional branches also provide improvements in prediction accuracy for gshare and path-based neural predictors with very little implementation overhead.
Abstract: The Frankenpredictor entry for the Championship Branch Prediction contest proposed several new optimizations for branch predictors. The Frankenpredictor also assimilated many previously proposed techniques. The rules of the contest were such that implementation concerns were largely ignored. In this context, many of the proposed optimizations may not actually be feasible in a realizable predictor. In this paper, we revisit some of the Frankenpredictor optimizations and attempt to apply them to conventional predictor organizations that are more typical of what is described in the literature. In particular, reinterpreting perceptron weights with non-linear translation functions and using different-sized tables with different counter widths shows promise and is practical. The Frankenpredictor’s branch history register update rules for unconditional branches also provide improvements in prediction accuracy for gshare and path-based neural predictors with very little implementation overhead.
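
The weight-translation idea can be sketched compactly in Python: pass each counter through a non-linear function before summing. The tanh used below is only a stand-in for whatever translation functions the Frankenpredictor work actually evaluated.

    import math

    def translated_sum(weights, history, scale=8.0):
        """Sketch of reinterpreting perceptron weights through a
        non-linear translation function before summing, per the abstract.
        The particular function (tanh) and scale are illustrative only."""
        def f(w):
            return math.tanh(w / scale)   # compresses large counter values

        s = f(weights[0])                 # translated bias weight
        s += sum(f(w) * (1 if h else -1)
                 for w, h in zip(weights[1:], history))
        return s >= 0                     # True = predict taken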