Proceedings ArticleDOI

Wattch: a framework for architectural-level power analysis and optimizations

01 May 2000, Vol. 28, Iss. 2, pp. 83-94
TL;DR: This paper presents Wattch, a framework for analyzing and optimizing microprocessor power dissipation at the architecture level. Wattch opens up the field of power-efficient computing to a wider range of researchers by providing a power evaluation methodology within the portable and familiar SimpleScalar framework.
Abstract: Power dissipation and thermal issues are increasingly significant in modern processors. As a result, it is crucial that power/performance tradeoffs be made more visible to chip architects and even compiler writers, in addition to circuit designers. Most existing power analysis tools achieve high accuracy by calculating power estimates for designs only after layout or floorplanning is complete. In addition to being available only late in the design process, such tools are often quite slow, which compounds the difficulty of running them for a large space of design possibilities. This paper presents Wattch, a framework for analyzing and optimizing microprocessor power dissipation at the architecture level. Wattch is 1000X or more faster than existing layout-level power tools, and yet maintains accuracy within 10% of their estimates as verified using industry tools on leading-edge designs. This paper presents several validations of Wattch's accuracy. In addition, we present three examples that demonstrate how architects or compiler writers might use Wattch to evaluate power consumption in their design process. We see Wattch as a complement to existing lower-level tools; it allows architects to explore and cull the design space early on, using faster, higher-level tools. It also opens up the field of power-efficient computing to a wider range of researchers by providing a power evaluation methodology within the portable and familiar SimpleScalar framework.
Citations
Proceedings ArticleDOI
12 Dec 2009
TL;DR: Combining power, area, and timing results of McPAT with performance simulation of PARSEC benchmarks at the 22nm technology node for both common in-order and out-of-order manycore designs shows that, when die cost is not taken into account, clustering 8 cores together gives the best energy-delay product, whereas when cost is taken into account, configuring clusters with 4 cores gives the best EDA2P and EDAP.
Abstract: This paper introduces McPAT, an integrated power, area, and timing modeling framework that supports comprehensive design space exploration for multicore and manycore processor configurations ranging from 90nm to 22nm and beyond. At the microarchitectural level, McPAT includes models for the fundamental components of a chip multiprocessor, including in-order and out-of-order processor cores, networks-on-chip, shared caches, integrated memory controllers, and multiple-domain clocking. At the circuit and technology levels, McPAT supports critical-path timing modeling, area modeling, and dynamic, short-circuit, and leakage power modeling for each of the device types forecast in the ITRS roadmap including bulk CMOS, SOI, and double-gate transistors. McPAT has a flexible XML interface to facilitate its use with many performance simulators. Combined with a performance simulator, McPAT enables architects to consistently quantify the cost of new ideas and assess tradeoffs of different architectures using new metrics like energy-delay-area2 product (EDA2P) and energy-delay-area product (EDAP). This paper explores the interconnect options of future manycore processors by varying the degree of clustering over generations of process technologies. Clustering will bring interesting tradeoffs between area and performance because the interconnects needed to group cores into clusters incur area overhead, but many applications can make good use of them due to synergies of cache sharing. Combining power, area, and timing results of McPAT with performance simulation of PARSEC benchmarks at the 22nm technology node for both common in-order and out-of-order manycore designs shows that when die cost is not taken into account clustering 8 cores together gives the best energy-delay product, whereas when cost is taken into account configuring clusters with 4 cores gives the best EDA2P and EDAP.
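The metrics named in the abstract can be illustrated with a minimal Python sketch. It assumes the usual reading of the names: energy-delay product (EDP) is energy times delay, EDAP multiplies that by area, and EDA2P multiplies it by area squared. The two designs and all of their numbers are made up for illustration and are not McPAT results.

```python
def design_metrics(energy_j, delay_s, area_mm2):
    """Return EDP, EDAP and EDA2P for one design point (lower is better)."""
    edp = energy_j * delay_s
    return {
        "EDP": edp,
        "EDAP": edp * area_mm2,        # energy * delay * area
        "EDA2P": edp * area_mm2 ** 2,  # energy * delay * area^2
    }


# Hypothetical design points: (energy in J, delay in s, die area in mm^2).
designs = {
    "cluster4": (2.0, 1.0, 100.0),
    "cluster8": (1.8, 1.0, 120.0),
}

metrics = {name: design_metrics(*values) for name, values in designs.items()}
best_edp = min(metrics, key=lambda name: metrics[name]["EDP"])
best_edap = min(metrics, key=lambda name: metrics[name]["EDAP"])
best_eda2p = min(metrics, key=lambda name: metrics[name]["EDA2P"])
```

With these invented values the larger design wins on EDP alone but loses once area is weighted in, which is the kind of ranking change the area-aware metrics are meant to expose.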

2,487 citations


Cites methods from "Wattch: a framework for architectur..."

  • ...McPAT advances the state of the art in several directions compared to Wattch, which is the current standard for power research....

    [...]

  • ...When modeling out-of-order processors, Wattch uses the synthetic RUU model that is tightly coupled to the SimpleScalar simulator [9]....

    [...]

  • ...Wattch [8], first presented in 2000, has been such a tool, enabling a tremendous surge in power-related architecture research....

    [...]

  • ...Third, Wattch uses simple linear scaling models based on 0.8μm technology that are inaccurate to make predictions for current and future deep-submicron technology nodes....

    [...]

  • ...Wattch calculates dynamic power dissipation from switching events obtained from an architectural simulation and capacitance models of components of the microarchitecture....

    [...]
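The last excerpt above describes dynamic power as a combination of simulated switching events and per-component capacitance. A minimal Python sketch of that idea follows, using the standard CMOS switching-power relation P = a * C * Vdd^2 * f. The unit names, capacitances, access counts, supply voltage, and clock frequency are all hypothetical values chosen for illustration; they are not taken from Wattch.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    """A microarchitectural block with a lumped effective capacitance."""
    name: str
    capacitance_f: float  # effective switched capacitance in farads (made-up)


def dynamic_power(unit, accesses, cycles, vdd, freq_hz):
    """Average dynamic power in watts: activity * C * Vdd^2 * f."""
    activity = accesses / cycles  # fraction of cycles in which the unit switches
    return activity * unit.capacitance_f * vdd ** 2 * freq_hz


# Hypothetical units and hypothetical event counts from a simulation run.
units = [Unit("icache", 2.0e-9), Unit("alu", 0.5e-9)]
access_counts = {"icache": 900, "alu": 400}
CYCLES, VDD, FREQ_HZ = 1000, 2.0, 600e6

per_unit = {
    u.name: dynamic_power(u, access_counts[u.name], CYCLES, VDD, FREQ_HZ)
    for u in units
}
total_power = sum(per_unit.values())
```

In this sketch the simulator only has to supply access counts per unit; everything technology-dependent is folded into the capacitance values.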

Journal ArticleDOI
TL;DR: The SimpleScalar tool set provides an infrastructure for simulation and architectural modeling that can model a variety of platforms ranging from simple unpipelined processors to detailed dynamically scheduled microarchitectures with multiple-level memory hierarchies.
Abstract: Designers can execute programs on software models to validate a proposed hardware design's performance and correctness, while programmers can use these models to develop and test software before the real hardware becomes available. Three critical requirements drive the implementation of a software model: performance, flexibility, and detail. Performance determines the amount of workload the model can exercise given the machine resources available for simulation. Flexibility indicates how well the model is structured to simplify modification, permitting design variants or even completely different designs to be modeled with ease. Detail defines the level of abstraction used to implement the model's components. The SimpleScalar tool set provides an infrastructure for simulation and architectural modeling. It can model a variety of platforms ranging from simple unpipelined processors to detailed dynamically scheduled microarchitectures with multiple-level memory hierarchies. SimpleScalar simulators reproduce computing device operations by executing all program instructions using an interpreter. The tool set's instruction interpreters also support several popular instruction sets, including Alpha, PPC, x86, and ARM.

1,656 citations

Proceedings ArticleDOI
01 May 2003
TL;DR: This paper describes HotSpot, an accurate yet fast model based on an equivalent circuit of thermal resistances and capacitances that correspond to microarchitecture blocks and essential aspects of the thermal package. It shows that power metrics are poor predictors of temperature and that sensor imprecision has a substantial impact on the performance of dynamic thermal management (DTM).
Abstract: With power density and hence cooling costs rising exponentially, processor packaging can no longer be designed for the worst case, and there is an urgent need for runtime processor-level techniques that can regulate operating temperature when the package's capacity is exceeded. Evaluating such techniques, however, requires a thermal model that is practical for architectural studies. This paper describes HotSpot, an accurate yet fast model based on an equivalent circuit of thermal resistances and capacitances that correspond to microarchitecture blocks and essential aspects of the thermal package. Validation was performed using finite-element simulation. The paper also introduces several effective methods for dynamic thermal management (DTM): "temperature-tracking" frequency scaling, localized toggling, and migrating computation to spare hardware units. Modeling temperature at the microarchitecture level also shows that power metrics are poor predictors of temperature, and that sensor imprecision has a substantial impact on the performance of DTM.
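The equivalent-circuit idea can be illustrated with a single thermal node: one resistance to ambient and one capacitance, integrated with forward Euler. This is a deliberately reduced Python sketch, not HotSpot's actual network of blocks and package layers, and every parameter value below is an assumption made for illustration.

```python
def simulate_block_temperature(power_w, r_th, c_th, t_ambient, dt, steps):
    """Integrate C * dT/dt = P - (T - T_ambient) / R with forward Euler.

    power_w   -- constant power dissipated in the block (W)
    r_th      -- thermal resistance to ambient (K/W)
    c_th      -- thermal capacitance of the block (J/K)
    t_ambient -- ambient temperature (degrees C)
    dt        -- time step (s); must be much smaller than r_th * c_th
    """
    temp = t_ambient
    for _ in range(steps):
        dtemp_dt = (power_w - (temp - t_ambient) / r_th) / c_th
        temp += dtemp_dt * dt
    return temp


# Hypothetical block: 10 W, 2 K/W, 0.5 J/K, 45 C ambient (time constant 1 s).
POWER_W, R_TH, C_TH, T_AMBIENT = 10.0, 2.0, 0.5, 45.0
steady_state = T_AMBIENT + POWER_W * R_TH
after_short_burst = simulate_block_temperature(POWER_W, R_TH, C_TH, T_AMBIENT, 0.01, 10)
after_long_run = simulate_block_temperature(POWER_W, R_TH, C_TH, T_AMBIENT, 0.01, 2000)
```

The same power level gives very different temperatures after 0.1 s and after 20 s, which is one simple way to see why an instantaneous power reading alone says little about temperature.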

1,252 citations

Proceedings ArticleDOI
03 Nov 2004
TL;DR: A scalable simulation environment for wireless sensor networks that provides an accurate, per-node estimate of power consumption and employs a novel code-transformation technique to estimate the number of CPU cycles executed by each node, eliminating the need for expensive instruction-level simulation of sensor nodes.
Abstract: Developing sensor network applications demands a new set of tools to aid programmers. A number of simulation environments have been developed that provide varying degrees of scalability, realism, and detail for understanding the behavior of sensor networks. To date, however, none of these tools have addressed one of the most important aspects of sensor application design: that of power consumption. While simple approximations of overall power usage can be derived from estimates of node duty cycle and communication rates, these techniques often fail to capture the detailed, low-level energy requirements of the CPU, radio, sensors, and other peripherals. In this paper, we present PowerTOSSIM, a scalable simulation environment for wireless sensor networks that provides an accurate, per-node estimate of power consumption. PowerTOSSIM is an extension to TOSSIM, an event-driven simulation environment for TinyOS applications. In PowerTOSSIM, TinyOS components corresponding to specific hardware peripherals (such as the radio, EEPROM, LEDs, and so forth) are instrumented to obtain a trace of each device's activity during the simulation run. PowerTOSSIM employs a novel code-transformation technique to estimate the number of CPU cycles executed by each node, eliminating the need for expensive instruction-level simulation of sensor nodes. PowerTOSSIM includes a detailed model of hardware energy consumption based on the Mica2 sensor node platform. Through instrumentation of actual sensor nodes, we demonstrate that PowerTOSSIM provides accurate estimation of power consumption for a range of applications and scales to support very large simulations.
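Trace-based energy accounting of the kind the abstract describes can be sketched in a few lines of Python: each device's state changes are logged with timestamps, and energy is the time spent in each state multiplied by that state's power draw. The trace format, the device and state names, and the power figures below are all hypothetical; they are not PowerTOSSIM's trace format or Mica2 measurements.

```python
# Hypothetical per-state power draw in milliwatts (not measured Mica2 values).
POWER_MW = {
    ("radio", "rx"): 20.0,
    ("radio", "off"): 0.0,
    ("led", "on"): 6.0,
    ("led", "off"): 0.0,
}


def energy_from_trace(trace, end_time_s):
    """Sum energy per device in millijoules from a state-change trace.

    trace is a time-sorted list of (time_s, device, new_state) tuples.
    """
    energy_mj = {}
    current = {}  # device -> (state, time the state was entered)
    for time_s, device, state in trace:
        if device in current:
            prev_state, since = current[device]
            spent = POWER_MW[(device, prev_state)] * (time_s - since)
            energy_mj[device] = energy_mj.get(device, 0.0) + spent
        current[device] = (state, time_s)
    # Close out the state each device is still in at the end of the run.
    for device, (state, since) in current.items():
        spent = POWER_MW[(device, state)] * (end_time_s - since)
        energy_mj[device] = energy_mj.get(device, 0.0) + spent
    return energy_mj


trace = [
    (0.0, "radio", "off"),
    (0.0, "led", "off"),
    (1.0, "radio", "rx"),
    (2.0, "led", "on"),
    (3.0, "radio", "off"),
]
energy = energy_from_trace(trace, end_time_s=4.0)
```

Because the accounting runs over a recorded trace, the per-state power table can be swapped for a different hardware model without re-running the simulation.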

1,174 citations

Journal ArticleDOI
TL;DR: The HotSpot compact thermal modeling approach is especially well suited for pre-register-transfer-level (pre-RTL) and presynthesis thermal analysis and is able to provide detailed static and transient temperature information across the die and the package, as it is also computationally efficient.
Abstract: This paper presents HotSpot, a modeling methodology for developing compact thermal models based on the popular stacked-layer packaging scheme in modern very large-scale integration systems. In addition to modeling silicon and packaging layers, HotSpot includes a high-level on-chip interconnect self-heating power and thermal model such that the thermal impacts on interconnects can also be considered during early design stages. The HotSpot compact thermal modeling approach is especially well suited for pre-register-transfer-level (pre-RTL) and presynthesis thermal analysis and is able to provide detailed static and transient temperature information across the die and the package, as it is also computationally efficient.

985 citations

References
Journal ArticleDOI
TL;DR: This document describes release 2.0 of the SimpleScalar tool set, a suite of free, publicly available simulation tools that offer both detailed and high-performance simulation of modern microprocessors.
Abstract: This document describes release 2.0 of the SimpleScalar tool set, a suite of free, publicly available simulation tools that offer both detailed and high-performance simulation of modern microprocessors. The new release offers more tools and capabilities, precompiled binaries, cleaner interfaces, better documentation, easier installation, improved portability, and higher performance. This paper contains a complete description of the tool set, including retrieval and installation instructions, a description of how to use the tools, a description of the target SimpleScalar architecture, and many details about the internals of the tools and how to customize them. With this guide, the tool set can be brought up and generating results in under an hour (on supported platforms).

3,079 citations


"Wattch: a framework for architectur..." refers methods in this paper

  • ...In this work we have integrated these power models into the Simplescalar architectural simulator [7]....

    [...]

Journal ArticleDOI
TL;DR: This paper shows that complementary CMOS is the logic style of choice for the implementation of arbitrary combinational circuits if low voltage, low power, and small power-delay products are of concern.
Abstract: Recently reported logic style comparisons based on full-adder circuits claimed complementary pass-transistor logic (CPL) to be much more power-efficient than complementary CMOS. However, new comparisons performed on more efficient CMOS circuit realizations and a wider range of different logic cells, as well as the use of realistic circuit arrangements demonstrate CMOS to be superior to CPL in most cases with respect to speed, area, power dissipation, and power-delay products. An implemented 32-b adder using complementary CMOS has a power-delay product of less than half that of the CPL version. Robustness with respect to voltage scaling and transistor sizing, as well as generality and ease-of-use, are additional advantages of CMOS logic gates, especially when cell-based design and logic synthesis are targeted. This paper shows that complementary CMOS is the logic style of choice for the implementation of arbitrary combinational circuits if low voltage, low power, and small power-delay products are of concern.

911 citations

Proceedings ArticleDOI
01 May 1997
TL;DR: A microarchitecture that simplifies wakeup and selection logic is proposed and discussed, which will help minimize performance degradation due to slow bypasses in future wide-issue machines.
Abstract: The performance tradeoff between hardware complexity and clock speed is studied. First, a generic superscalar pipeline is defined. Then the specific areas of register renaming, instruction window wakeup and selection logic, and operand bypassing are analyzed. Each is modeled and Spice simulated for feature sizes of 0.8µm, 0.35µm, and 0.18µm. Performance results and trends are expressed in terms of issue width and window size. Our analysis indicates that window wakeup and selection logic as well as operand bypass logic are likely to be the most critical in the future. A microarchitecture that simplifies wakeup and selection logic is proposed and discussed. This implementation puts chains of dependent instructions into queues, and issues instructions from multiple queues in parallel. Simulation shows little slowdown as compared with a completely flexible issue window when performance is measured in clock cycles. Furthermore, because only instructions at queue heads need to be awakened and selected, issue logic is simplified and the clock cycle is faster; consequently, overall performance is improved. By grouping dependent instructions together, the proposed microarchitecture will help minimize performance degradation due to slow bypasses in future wide-issue machines.

861 citations

Journal ArticleDOI
08 Feb 1996
TL;DR: This custom VLSI implementation of a microprocessor architecture delivers 184 Dhrystone MIPS at 162 MHz while dissipating 0.5 W from a 1.5 V internal supply. Clock generation uses an on-chip PLL with a 3.68 MHz input clock to minimize high-frequency clock signals on the board.
Abstract: This paper describes a 160 MHz 500 mW 32 b StrongARM® microprocessor designed for low-power, low-cost applications. The chip implements the ARM® V4 instruction set and is bus compatible with earlier implementations. The pin interface runs at 3.3 V but the internal power supplies can vary from 1.5 to 2.2 V, providing various options to balance performance and power dissipation. At 160 MHz internal clock speed with a nominal Vdd of 1.65 V, it delivers 185 Dhrystone 2.1 MIPS while dissipating less than 450 mW. The range of operating points runs from 100 MHz at 1.65 V dissipating less than 300 mW to 200 MHz at 2.0 V for less than 900 mW. An on-chip PLL provides the internal clock based on a 3.68 MHz clock input. The chip contains 2.5 million transistors, 90% of which are in the two 16 kB caches. It is fabricated in a 0.35-μm three-metal CMOS process with 0.35 V thresholds and 0.25 μm effective channel lengths. The chip measures 7.8 mm × 6.4 mm and is packaged in a 144-pin plastic thin quad flat pack (TQFP) package.
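The operating points quoted in the abstract can be turned into a small worked calculation in Python. The frequencies, voltages, and power figures are taken from the abstract; because the power figures are stated as "less than" bounds, the derived energy per cycle is an upper bound and the derived MIPS per watt is a lower bound. The function and variable names are illustrative only.

```python
# Operating points from the abstract: (clock in Hz, Vdd in V, power bound in W).
OPERATING_POINTS = [
    (100e6, 1.65, 0.300),
    (160e6, 1.65, 0.450),
    (200e6, 2.00, 0.900),
]


def energy_per_cycle_nj(freq_hz, power_w):
    """Upper bound on energy per clock cycle, in nanojoules."""
    return power_w / freq_hz * 1e9


energy_bounds_nj = [
    energy_per_cycle_nj(freq_hz, power_w)
    for freq_hz, _vdd, power_w in OPERATING_POINTS
]

# 185 Dhrystone 2.1 MIPS at under 450 mW gives a lower bound on MIPS per watt.
mips_per_watt_lower_bound = 185 / 0.450
```

The bound on energy per cycle is about 3.0 nJ and 2.8 nJ at 1.65 V and rises to 4.5 nJ at 2.0 V. That is the direction expected from the quadratic dependence of switching energy on supply voltage, although these are bounds rather than measurements.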

686 citations

Journal ArticleDOI
TL;DR: The paper finds that careful design reduces energy dissipation by almost 25%. It explores methods of reducing energy consumption that do not lead to performance loss, as well as methods of reducing delay by exploiting instruction-level parallelism.
Abstract: In this paper we investigate possible ways to improve the energy efficiency of a general purpose microprocessor. We show that the energy of a processor depends on its performance, so we chose the energy-delay product to compare different processors. To improve the energy-delay product we explore methods of reducing energy consumption that do not lead to performance loss (i.e. wasted energy), and explore methods to reduce delay by exploiting instruction level parallelism. We found that careful design reduced the energy dissipation by almost 25%. Pipelining can give approximately a 2× improvement in energy-delay product. Superscalar issue, however, does not improve the energy-delay product any further since the overhead required offsets the gains in performance. Further improvements will be hard to come by since a large fraction of the energy (50-80%) is dissipated in the clock network and the on-chip memories. Thus, the efficiency of processors will depend more on the technology being used and the algorithm chosen by the programmer than the micro-architecture.

635 citations