
Showing papers in "ACM Transactions on Embedded Computing Systems in 2003"


Journal ArticleDOI
TL;DR: This paper presents the first effort toward a formal treatment of battery-aware task scheduling and voltage scaling, based on a novel, accurate analytical battery model that can also be used for battery lifetime estimation.
Abstract: Portable embedded computing systems require energy autonomy. This is achieved by batteries serving as a dedicated energy source. The requirement of portability places severe restrictions on size and weight, which in turn limits the amount of energy that is continuously available to maintain system operability. For these reasons, efficient energy utilization has become one of the key challenges to the designer of battery-powered embedded computing systems. In this paper, we first present a novel analytical battery model, which can be used for battery lifetime estimation. The high quality of the proposed model is demonstrated with measurements and simulations. Using this battery model, we introduce a new "battery-aware" cost function, which will be used for optimizing the lifetime of the battery. This cost function generalizes the traditional minimization metric, namely the energy consumption of the system. We formulate the problem of battery-aware task scheduling on a single processor with multiple voltages. Then, we prove several important mathematical properties of the cost function. Based on these properties, we propose several algorithms for task ordering and voltage assignment, including optimal idle period insertion to exercise charge recovery. This paper presents the first effort toward a formal treatment of battery-aware task scheduling and voltage scaling, based on an accurate analytical model of the battery behavior.

274 citations
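
For concreteness, a widely cited diffusion-based analytical battery model of this kind (due to Rakhmatov and Vrudhula) has the general form sketched below in LaTeX; this is a sketch of that model family, not necessarily the paper's exact formulation. A piecewise-constant load I_k over [t_{k-1}, t_k) induces an apparent charge loss sigma(T):

% Diffusion-type analytical battery model (general form; beta is a
% diffusion parameter fitted to the battery):
\[
\sigma(T) \;=\; \sum_{k=1}^{n} I_k \left[ (t_k - t_{k-1})
  + 2\sum_{m=1}^{\infty}
    \frac{e^{-\beta^2 m^2 (T - t_k)} - e^{-\beta^2 m^2 (T - t_{k-1})}}{\beta^2 m^2} \right]
\]
% The battery is exhausted at the first T where sigma(T) reaches the
% capacity parameter alpha. During idle periods the sum terms decay,
% which models exactly the charge recovery that optimal idle period
% insertion exploits.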


Journal ArticleDOI
TL;DR: This paper provides a theoretical basis for the analysis of DPM strategies for systems with multiple power-down states, without resorting to such complex approaches, and shows how a relatively simple "online learning" scheme can be used to improve the competitive ratio over deterministic strategies.
Abstract: Online dynamic power management (DPM) strategies refer to strategies that attempt to make power-mode-related decisions based on information available at runtime. In making such decisions, these strategies do not depend upon information of future behavior of the system, or any a priori knowledge of the input characteristics. In this paper, we present online strategies, and evaluate them based on a measure called the competitive ratio that enables a quantitative analysis of the performance of online strategies. All earlier approaches (online or predictive) have been limited to systems with two power-saving states (e.g., idle and shutdown). The only earlier approaches that handled multiple power-saving states were based on stochastic optimization. This paper provides a theoretical basis for the analysis of DPM strategies for systems with multiple power-down states, without resorting to such complex approaches. We show how a relatively simple "online learning" scheme can be used to improve the competitive ratio over deterministic strategies using the notion of "probability-based" online DPM strategies. Experimental results show that the algorithm presented here attains the best competitive ratio in comparison with other known predictive DPM algorithms. The other algorithms that come close to matching its performance in power suffer at least an additional 40% wake-up latency on average. Meanwhile, the algorithms that have comparable latency to our methods use at least 25% more power on average.

191 citations
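
As background for the deterministic strategies the paper improves on, here is a minimal C sketch of the classic lower-envelope strategy for multiple sleep states, with purely illustrative power and transition-cost numbers: at idle time t, occupy the state an offline algorithm would have chosen had it known the idle period would last exactly t.

#include <stdio.h>

/* Deterministic lower-envelope DPM sketch (not the paper's
 * probability-based refinement). State i draws idle power p[i]
 * and costs a one-time transition energy c[i]; the numbers are
 * illustrative only. */
#define NSTATES 3
static const double p[NSTATES] = { 1.0, 0.4, 0.05 }; /* watts  */
static const double c[NSTATES] = { 0.0, 0.5, 2.0  }; /* joules */

/* Offline cost of riding out an idle period of length t in state i. */
static double cost(int i, double t) { return c[i] + p[i] * t; }

/* Follow the lower envelope of the cost lines: this strategy is
 * known to be 2-competitive under additive transition costs. */
static int state_at(double t) {
    int best = 0;
    for (int i = 1; i < NSTATES; i++)
        if (cost(i, t) < cost(best, t)) best = i;
    return best;
}

int main(void) {
    for (double t = 0.0; t <= 5.0; t += 0.5)
        printf("idle %.1f s -> state %d\n", t, state_at(t));
    return 0;
}

With these numbers the device drops to the shallow sleep state after about 0.83 s of idleness and to deep sleep after about 4.3 s, exactly where the cost lines cross.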


Journal ArticleDOI
TL;DR: Simulations show that an average of 73% of I-cache lines and 54% of D-cache lines are put in sleep mode with an average IPC impact of only 1.7%, for 64 KB caches, and this work proposes applying sleep mode only to the data store and not the tag store.
Abstract: Lower threshold voltages in deep submicron technologies cause more leakage current, increasing static power dissipation. This trend, combined with the trend of larger/more cache memories dominating die area, has prompted circuit designers to develop SRAM cells with low-leakage operating modes (e.g., sleep mode). Sleep mode reduces static power dissipation, but data stored in a sleeping cell is unreliable or lost. So, at the architecture level, there is interest in exploiting sleep mode to reduce static power dissipation while maintaining high performance. Current approaches dynamically control the operating mode of large groups of cache lines or even individual cache lines. However, the performance monitoring mechanism that controls the percentage of sleep-mode lines, and identifies particular lines for sleep mode, is somewhat arbitrary. There is no way to know what the performance could be with all cache lines active, so arbitrary miss rate targets are set (perhaps on a per-benchmark basis using profile information), and the control mechanism tracks these targets. We propose applying sleep mode only to the data store and not the tag store. By keeping the entire tag store active the hardware knows what the hypothetical miss rate would be if all data lines were active, and the actual miss rate can be made to precisely track it. Simulations show that an average of 73% of I-cache lines and 54% of D-cache lines are put in sleep mode with an average IPC impact of only 1.7%, for 64 KB caches.

140 citations
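
A minimal C sketch of the control idea, with a simple proportional adjustment rule of my own invention (the paper's actual policy may differ): because the tag store never sleeps, a tag hit whose data line is asleep is distinguishable from a true miss, so the hypothetical all-active miss rate is known exactly and the sleep fraction can be steered against it.

#include <stdio.h>

/* The tag store stays awake, so both counters below are exact. */
typedef struct {
    long hyp_misses;   /* misses even with all data lines awake */
    long sleep_hits;   /* tag hit, but the data line was asleep */
    long accesses;
    double sleep_frac; /* fraction of data lines kept asleep    */
} cache_ctl;

static void interval_update(cache_ctl *c, double slack) {
    double hyp = (double)c->hyp_misses / c->accesses;
    double act = hyp + (double)c->sleep_hits / c->accesses;
    if (act > hyp + slack)  /* actual miss rate drifting above target */
        c->sleep_frac *= 0.9;                      /* wake some lines */
    else
        c->sleep_frac = 0.9 * c->sleep_frac + 0.1; /* sleep more      */
    c->hyp_misses = c->sleep_hits = c->accesses = 0;
}

int main(void) {
    cache_ctl c = { 40, 25, 10000, 0.5 };
    interval_update(&c, 0.001);
    printf("new sleep fraction: %.2f\n", c.sleep_frac);
    return 0;
}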


Journal ArticleDOI
TL;DR: This work proves that the problem of energy-optimal voltage scheduling for fixed-priority hard real-time systems is NP-hard, and presents a fully polynomial time approximation scheme (FPTAS) for the problem.
Abstract: We address the problem of energy-optimal voltage scheduling for fixed-priority hard real-time systems, for which we present a complete treatment both theoretically and practically. Although most practical real-time systems are based on fixed-priority scheduling, few research results are known on the energy-optimal fixed-priority scheduling problem. First, we prove that the problem is NP-hard. Then, we present a fully polynomial time approximation scheme (FPTAS) for the problem. For any ε > 0, the proposed approximation scheme computes a voltage schedule whose energy consumption is at most (1 + ε) times that of the optimal voltage schedule. Furthermore, the running time of the proposed approximation scheme is bounded by a polynomial function of the number of input jobs and 1/ε. Given the NP-hardness of the problem, the proposed approximation scheme is practically the best solution because it can compute a near-optimal voltage schedule (i.e., provably arbitrarily close to the optimal schedule) in polynomial time. Experimental results show that the approximation scheme finds more efficient (almost optimal) voltage schedules faster than the best existing heuristic.

113 citations
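
To make the optimization target concrete, the problem can be stated as follows under the usual convex power model (the notation is mine, not the paper's):

% Job J_i executes c_i cycles at speed s_i; per-cycle energy grows
% roughly quadratically with speed under voltage scaling:
\[
\min_{s_1, \dots, s_n} \; \sum_{i=1}^{n} c_i \, s_i^{2}
\qquad \text{s.t. all jobs meet their deadlines under fixed-priority scheduling.}
\]
% The FPTAS returns a schedule S with
\[
E(S) \;\le\; (1 + \varepsilon)\, E(S^{*})
\]
% in time polynomial in the number of jobs n and in 1/epsilon,
% where S^* denotes an energy-optimal schedule.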


Journal ArticleDOI
TL;DR: In this paper, the authors use a specific scaling technique, dynamic modulation scaling (DMS), as a vehicle to outline the challenges of packet scheduling and introduce the intricacies caused by the nonpreemptive nature of packet scheduling and the time-varying wireless channel.
Abstract: System-level power management has become a key technique to render modern wireless communication devices economically viable. Despite their relatively large impact on the system energy consumption, power management for radios has been limited to shutdown-based schemes, while processors have benefited from superior techniques based on dynamic voltage scaling (DVS). However, similar scaling approaches that trade-off energy versus performance are also available for radios. To utilize these in radio power management, existing packet scheduling policies have to be thoroughly rethought to make them energy-aware, essentially opening a whole new set of challenges the same way the introduction of DVS did to CPU task scheduling. We use one specific scaling technique, dynamic modulation scaling (DMS), as a vehicle to outline these challenges, and to introduce the intricacies caused by the nonpreemptive nature of packet scheduling and the time-varying wireless channel.

108 citations
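
A first-order C sketch of the energy-latency tradeoff that dynamic modulation scaling exposes, with purely hypothetical constants: at b bits per symbol and symbol rate RS, transmission time per bit falls as 1/b, while for a fixed error rate the radiated energy per bit grows roughly like (2^b - 1)/b.

#include <stdio.h>
#include <math.h>

/* Illustrative DMS model: energy per symbol = CR*(2^b - 1) + CE,
 * where CR scales the radiated part and CE is the electronics
 * cost per symbol; both constants are made up. */
#define RS 1.0e6    /* symbols per second            */
#define CR 1.0e-9   /* joules per (2^b - 1) unit     */
#define CE 0.5e-9   /* electronics joules per symbol */

static double tx_time(double bits, int b)   { return bits / (b * RS); }
static double tx_energy(double bits, int b) {
    return bits * (CR * (pow(2.0, b) - 1.0) + CE) / b;
}

int main(void) {
    const double L = 8000.0;           /* one 1000-byte packet */
    for (int b = 2; b <= 8; b += 2)    /* 4-QAM up to 256-QAM  */
        printf("b=%d: %.2f ms, %.2f uJ\n", b,
               1e3 * tx_time(L, b), 1e6 * tx_energy(L, b));
    return 0;
}

Scaling down from 256-QAM to 4-QAM here cuts packet energy by roughly 18x at the price of a 4x longer transmission, which is precisely the knob an energy-aware packet scheduler must manage against deadlines.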


Journal ArticleDOI
TL;DR: This paper devises two algorithms: an optimal algorithm for homogeneous applications and a heuristic iterative algorithm that can also accommodate heterogeneous applications (i.e., those with different power consumption functions), and shows by simulation that the latter is fast and within 1% of the optimal.
Abstract: New technologies have brought about a proliferation of embedded systems, which vary from control systems to sensor networks to personal digital assistants. Many of the portable embedded devices run several applications, which typically have three constraints that need to be addressed: energy, deadline, and reward. However, many of these portable devices do not have powerful enough CPUs and batteries to run all applications within their deadlines. An optimal scheme would allow the device to run the most applications, each using the most amount of CPU cycles possible, without depleting the energy source while still meeting all deadlines. In this paper we propose a solution to this problem; to our knowledge, this is the first solution that combines the three constraints mentioned above. We devise two algorithms, an optimal algorithm for homogeneous applications (with respect to power consumption) and a heuristic iterative algorithm that can also accommodate heterogeneous applications (i.e., those with different power consumption functions). We show by simulation that our iterative algorithm is fast and within 1% of the optimal.

87 citations


Journal ArticleDOI
TL;DR: A novel linear-time algorithm for data remapping is presented that may reduce the memory subsystem needs of an application by 50%, concomitantly reducing the associated costs in size, power, and dollar investment, and overcoming key hurdles in designing high-performance embedded computing solutions.
Abstract: In this article, we present a novel linear time algorithm for data remapping that is (i) lightweight; (ii) fully automated; and (iii) applicable in the context of pointer-centric programming languages with dynamic memory allocation support. All previous work in this area lacks one or more of these features. We proceed to demonstrate a novel application of this algorithm as a key step in optimizing the design of an embedded memory system. Specifically, we show that by virtue of locality enhancements via data remapping, we may reduce the memory subsystem needs of an application by 50%, and hence concomitantly reduce the associated costs in terms of size, power, and dollar-investment (61%). Such a reduction overcomes key hurdles in designing high-performance embedded computing solutions. Namely, memory subsystems are very desirable from a performance standpoint, but their costs have often limited their use in embedded systems. Thus, our innovative approach offers the intriguing possibility of compilers playing a significant role in exploring and optimizing the design space of a memory subsystem for an embedded design. To this end and in order to properly leverage the improvements afforded by a compiler optimization, we identify a range of measures for quantifying the cost-impact of popular notions of locality, prefetching, regularity of memory access, and others. The proposed methodology will become increasingly important, especially as the needs for application-specific embedded architectures become prevalent. In addition, we demonstrate the wide applicability of data remapping using several existing microprocessors, such as the Pentium and UltraSparc. Namely, we show that remapping can achieve a performance improvement of 20% on the average. Similarly, for a parametric research HPL-PD microprocessor, which characterizes the new Itanium machines, we achieve a performance improvement of 28% on average. All of our results are achieved using applications from the DIS, Olden and SPEC2000 suites of integer and floating point benchmarks.

64 citations
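
To illustrate the kind of locality enhancement that data remapping automates for pointer-centric code, here is a hand-written C example of the general idea (mine, not the paper's linear-time algorithm): splitting a cold payload out of a dynamically allocated node so that the hot traversal fields pack densely into cache lines.

#include <stdlib.h>

/* Original layout: hot traversal fields interleaved with a large,
 * rarely touched payload, so every node visited drags the payload
 * into the cache. */
struct node_orig {
    int key;
    struct node_orig *next;
    char payload[120];            /* cold */
};

/* Remapped layout: the cold payload moves behind a pointer, so a
 * 64-byte line holds several hot (key, next, payload*) triples when
 * nodes are pool-allocated contiguously. */
struct node_hot {
    int key;
    struct node_hot *next;
    char *payload;                /* cold data kept elsewhere */
};

static struct node_hot *alloc_pool(size_t n) {
    /* pool allocation keeps hot nodes adjacent in memory */
    return calloc(n, sizeof(struct node_hot));
}

int main(void) {
    struct node_hot *pool = alloc_pool(1024);
    return pool == NULL;          /* nonzero exit on allocation failure */
}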


Journal ArticleDOI
TL;DR: This article centers on one of the most outstanding problems in chip design for embedded applications: it guides the reader through different memory technologies and architectures and reviews the most successful strategies for optimizing them in the power/performance plane.
Abstract: Embedded systems are often designed under stringent energy consumption budgets, to limit heat generation and battery size. Since memory systems consume a significant amount of energy to store and to forward data, it is then imperative to balance power consumption and performance in memory system design. Contemporary system design focuses on the trade-off between performance and energy consumption in processing and storage units, as well as in their interconnections. Although memory design is as important as processor design in achieving the desired design objectives, the former topic has received less attention than the latter in the literature. This article centers on one of the most outstanding problems in chip design for embedded applications. It guides the reader through different memory technologies and architectures, and it reviews the most successful strategies for optimizing them in the power/performance plane.

60 citations


Journal ArticleDOI
TL;DR: An array recovery technique is developed that automatically converts pointers to arrays, enabling the empirical evaluation of high-level transformations; the results justify pointer-to-array conversion and further investigation into the use of high-level techniques for embedded compilers.
Abstract: Efficient implementation of DSP applications is critical for many embedded systems. Optimizing compilers for application programs, written in C, largely focus on code generation and scheduling, which, with their growing maturity, are providing diminishing returns. As DSP applications typically make extensive use of pointer arithmetic, the alternative use of high-level, source-to-source, transformations has been largely ignored. This article develops an array recovery technique that automatically converts pointers to arrays, enabling the empirical evaluation of high-level transformations. High-level techniques were applied to the DSPstone benchmarks on three platforms: TriMedia TM-1000, Texas Instruments TMS320C6201, and the Analog Devices SHARC ADSP-21160. On average, the best transformation gave a factor of 2.43 improvement across the platforms. In certain cases, a speedup of 5.48 was found for the SHARC, 7.38 for the TM-1, and 2.3 for the C6201. These preliminary results justify pointer to array conversion and further investigation into the use of high-level techniques for embedded compilers.

55 citations
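
A hand-written before/after illustration of array recovery on a DSPstone-style dot-product kernel (my example in C, not code from the article):

#include <stdio.h>

/* Before: idiomatic DSP pointer walking, which is hard for
 * high-level loop transformations to analyze. */
static int dot_ptr(const int *a, const int *b, int n) {
    int s = 0;
    while (n-- > 0)
        s += *a++ * *b++;
    return s;
}

/* After array recovery: the same kernel with explicit indices,
 * exposing the access functions a[i] and b[i] to transformations
 * such as unrolling, tiling, and software pipelining. */
static int dot_arr(const int a[], const int b[], int n) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

int main(void) {
    int a[4] = { 1, 2, 3, 4 }, b[4] = { 5, 6, 7, 8 };
    printf("%d %d\n", dot_ptr(a, b, 4), dot_arr(a, b, 4)); /* 70 70 */
    return 0;
}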


Journal ArticleDOI
TL;DR: The proposed subcache architecture employs a page-based placement strategy, a dynamic page remapping policy, and a subcache prediction policy in order to improve the memory system energy behavior, especially on-chip cache energy.
Abstract: The demand for high-performance architectures and powerful battery-operated mobile devices has accentuated the need for low-power systems. In many media and embedded applications, the memory system can consume more than 50% of the overall system energy, making it a ripe candidate for optimization. To address this increasingly important problem, this article studies energy-efficient cache architectures in the memory hierarchy that can have a significant impact on the overall system energy consumption. Existing cache optimization approaches have looked at partitioning the caches at the circuit level and enabling/disabling these cache partitions (subbanks) at the architectural level for both performance and energy. In contrast, this article focuses on partitioning the cache resources architecturally for energy and energy-delay optimizations. Specifically, we investigate ways of splitting the cache into several smaller units, each of which is a cache by itself (called a subcache). Subcache architectures not only reduce the per-access energy costs, but can potentially improve the locality behavior as well. The proposed subcache architecture employs a page-based placement strategy, a dynamic page remapping policy, and a subcache prediction policy in order to improve the memory system energy behavior, especially on-chip cache energy. Using applications from the SPECjvm98 and SPEC CPU2000 benchmarks, the proposed subcache architecture is shown to be very effective in improving both the energy and energy-delay metrics. It is more beneficial in larger caches as well.

43 citations
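
A C sketch of how page-based placement and dynamic page remapping might steer accesses to subcaches (my rendering of the idea, with hypothetical sizes): each virtual page maps to exactly one subcache, so only that subcache needs to be probed, and a page that conflicts in its subcache can later be remapped to another.

#include <stdint.h>
#include <stdio.h>

#define NSUB  4        /* number of subcaches (hypothetical) */
#define MAPSZ 1024     /* page-map entries (hypothetical)    */
static uint8_t page_map[MAPSZ];   /* virtual page -> subcache id */

static unsigned subcache_for(uint32_t addr) {
    uint32_t page = addr >> 12;   /* 4 KB pages */
    return page_map[page % MAPSZ] % NSUB;
}

static void remap_page(uint32_t addr, unsigned sub) {
    /* dynamic page remapping: steer a conflicting page elsewhere */
    page_map[(addr >> 12) % MAPSZ] = (uint8_t)sub;
}

int main(void) {
    remap_page(0x40001000u, 2);
    printf("addr 0x40001234 -> subcache %u\n", subcache_for(0x40001234u));
    return 0;
}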


Journal ArticleDOI
TL;DR: A compiler framework to analyze SA-C programs, perform optimizations, and automatically map the application onto the Morphosys architecture is described, which reflects an average speedup in execution times of up to 6× compared to the execution on 800 MHz Pentium III machines.
Abstract: The rapid growth of device densities on silicon has made it feasible to deploy reconfigurable hardware as a highly parallel computing platform. However, one of the obstacles to the wider acceptance of this technology is its programmability. The application needs to be programmed in hardware description languages or an assembly equivalent, whereas most application programmers are used to the algorithmic programming paradigm. SA-C has been proposed as an expression-oriented language designed to implicitly express data parallel operations. The Morphosys project proposes an SoC architecture consisting of reconfigurable hardware that supports a data-parallel, SIMD computational model. This paper describes a compiler framework to analyze SA-C programs, perform optimizations, and automatically map the application onto the Morphosys architecture. The mapping process is static and it involves operation scheduling, processor allocation and binding, and register allocation in the context of the Morphosys architecture. The compiler also handles issues concerning data streaming and caching in order to minimize data transfer overhead. We have compiled some important image-processing kernels, and the generated schedules reflect an average speedup in execution times of up to 6× compared to the execution on 800 MHz Pentium III machines.

Journal ArticleDOI
TL;DR: An in-house energy simulator for memory systems is presented that is accelerated by special hardware support while maintaining accuracy, and that enables practical energy reduction schemes by reporting the actual amount of reduction out of the total energy consumption in main memory systems.
Abstract: Memory systems are dominant energy consumers, and thus many energy reduction techniques for memory buses and devices have been proposed. For practical energy reduction practices, we have to take into account the interaction between a processor and cache memories together with application programs. Furthermore, energy characterization of memory systems must be accurate enough to justify various techniques. In this article, we build an in-house energy simulator for memory systems that is accelerated by special hardware support while maintaining accuracy. We explore energy behavior of memory systems for various values of the processor and memory clock frequencies and cache configuration. Each experiment is performed with 24M instruction steps of real application programs to guarantee accuracy. The simulator is based on precise energy characterization of memory systems including buses, bus drivers, and memory devices by a cycle-accurate energy measurement technique. We characterize energy consumption of each component by an energy state machine whose states and transitions are associated with the dynamic and static energy costs, respectively. Our approach easily characterizes the energy consumption of complex SDRAMs. We divide and quantify energy components of main memory systems for high-level reduction. The energy simulator enables us to devise practical energy reduction schemes by providing the actual amount of reduction out of the total energy consumption in main memory systems. We introduce several practical energy reduction techniques for SDRAM memory systems and demonstrate the energy reduction ratio over SDRAM memory systems with commercial SDRAM controller chipsets. We classify the SDRAM memory systems into high-performance and mid-performance classes and achieve suitable system configurations for each class. For instance, a typical high-performance 32-bit, 64 MB SDRAM memory system consumes 19.6 mJ, 33.8 mJ, 35.4 mJ, and 37.0 mJ for 24M instructions of an MP3 decoder, a JPEG compressor, a JPEG decompressor, and an MPEG4 decoder, respectively. Our reduction scheme saves 12.7 mJ, 15.1 mJ, 15.5 mJ, and 14.8 mJ, and the reduction ratios are 64.8%, 44.6%, 43.8%, and 40.1%, respectively, without compromising execution speed.
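
A minimal C sketch of an energy state machine in this spirit, with illustrative SDRAM states and costs (the cost assignment here is mine: static power accrues while a state is occupied, dynamic energy is charged on transitions; the paper's convention may differ):

#include <stdio.h>

enum sdram_state { IDLE, ROW_ACTIVE, READ, NSTATE };
static const double p_static[NSTATE]      = { 0.02, 0.08, 0.15 }; /* W  */
static const double e_dyn[NSTATE][NSTATE] = {                     /* nJ */
    { 0, 15, 0 },
    { 5,  0, 8 },
    { 0,  6, 0 },
};

int main(void) {
    /* trace of (next state, dwell time) pairs */
    struct { enum sdram_state s; double dwell; } trace[] = {
        { ROW_ACTIVE, 30e-9 }, { READ, 60e-9 },
        { ROW_ACTIVE, 10e-9 }, { IDLE, 500e-9 },
    };
    enum sdram_state cur = IDLE;
    double energy = 0.0;
    for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++) {
        energy += e_dyn[cur][trace[i].s] * 1e-9;         /* transition */
        energy += p_static[trace[i].s] * trace[i].dwell; /* dwell      */
        cur = trace[i].s;
    }
    printf("trace energy: %.3g J\n", energy);
    return 0;
}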

Journal ArticleDOI
TL;DR: The theoretical foundation of code size reduction for software-pipelined loops based on the retiming concept is presented, and a general Code-size REDuction technique (CRED) for various kinds of processors is proposed.
Abstract: Software pipelining is extensively used to exploit the instruction-level parallelism of loops, but it also significantly expands code size. For embedded systems with very limited on-chip memory resources, code size becomes one of the most important optimization concerns. This paper presents the theoretical foundation of code size reduction for software-pipelined loops based on the retiming concept. We propose a general Code-size REDuction technique (CRED) for various kinds of processors. Our CRED algorithms integrate code size reduction with software pipelining. The experimental results show the effectiveness of the CRED technique on both code size reduction and code size/performance trade-off space exploration.

Journal ArticleDOI
TL;DR: Filter caching is shown to achieve the best instruction fetch energy reductions of 60--80% on average, but at the cost of about 20% performance degradation, which could also affect overall energy savings.
Abstract: Instruction caches have traditionally been used to improve software performance. Recently, several tiny instruction cache designs, including filter caches and dynamic loop caches, have been proposed to instead reduce software power. We propose several new tiny instruction cache designs, including preloaded loop caches, and one-level and two-level hybrid dynamic/preloaded loop caches. We evaluate the existing and proposed designs on embedded system software benchmarks from both the Powerstone and MediaBench suites, on two different processor architectures, for a variety of different technologies. We show on average that filter caching achieves the best instruction fetch energy reductions of 60--80%, but at the cost of about 20% performance degradation, which could also affect overall energy savings. We show that dynamic loop caching gives good instruction fetch energy savings of about 30%, but that if a designer is able to profile a program, preloaded loop caching can more than double the savings. We describe automated methods for quickly determining the best loop cache configuration, methods useful in a core-based design flow.
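
A back-of-envelope C model of where the filter-cache fetch-energy savings come from (all constants hypothetical): every fetch probes the tiny cache, and only misses additionally pay the L1 access energy.

#include <stdio.h>

/* Fetch energy with a filter cache in front of L1: n accesses,
 * hit rate `hit` in the tiny cache, per-access energies in nJ. */
static double fetch_energy(double n, double hit,
                           double e_small, double e_l1) {
    return n * e_small + n * (1.0 - hit) * e_l1;
}

int main(void) {
    double n = 1e9, e_small = 0.05, e_l1 = 0.6;   /* nJ per access   */
    double base = n * e_l1;                       /* no filter cache */
    double with = fetch_energy(n, 0.90, e_small, e_l1);
    printf("fetch-energy reduction: %.0f%%\n",
           100.0 * (1.0 - with / base));          /* ~82% here */
    return 0;
}

The roughly 20% performance cost reported comes from refill stalls on tiny-cache misses, which this energy-only model deliberately leaves out.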

Journal ArticleDOI
TL;DR: The paper concludes with a technique for mitigating the loss of battery energy capacity under large peak currents, showing an improvement of up to 10% in battery life, albeit at some cost to the size and weight of the system.
Abstract: This paper introduces a systematic approach to power awareness in mobile, handheld computers. It describes experimental evaluations of several techniques for improving the energy efficiency of a system, ranging from the network level down to the physical level of the battery. At the network level, a new routing method based upon the power consumed by the network subsystem is shown to improve power consumption by 15% on average and to reduce latency by 75% over methods that consider only the transmitted power. At the boundary between the network and the processor levels, the paper presents the problem of local versus remote processing and derives a figure of merit for determining whether a computation should be completed locally or remotely, one that involves the relative performance of the local and remote system, the transmission bandwidth and power consumption, and the network congestion. At the processor level, the main memory bandwidth is shown to have a significant effect on the relationship between performance and CPU frequency, which in turn determines the energy savings of dynamic CPU speed-setting. The results show that accounting for the main memory bandwidth using Amdahl's law permits the performance speed-up and peak power versus the CPU frequency to be estimated to within 5%. The paper concludes with a technique for mitigating the loss of battery energy capacity with large peak currents, showing an improvement of up to 10% in battery life, albeit at some cost to the size and weight of the system.
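
One plausible C rendering of the local-versus-remote decision (my formulation, not the paper's exact figure of merit): offload only if shipping the data plus idling through the remote computation costs less energy than computing locally.

#include <stdio.h>

static int run_remotely(double cycles,         /* work to do           */
                        double bytes,          /* data to ship         */
                        double j_per_cycle,    /* local CPU energy     */
                        double j_per_byte,     /* radio tx+rx energy   */
                        double remote_speedup, /* remote vs local perf */
                        double local_cps,      /* local cycles/s       */
                        double idle_watts)     /* power while waiting  */
{
    double e_local  = cycles * j_per_cycle;
    double wait_s   = cycles / (local_cps * remote_speedup);
    double e_remote = bytes * j_per_byte + wait_s * idle_watts;
    return e_remote < e_local;
}

int main(void) {
    /* 2e9 cycles, 100 KB of data, 10x faster server: offload wins */
    printf("offload? %s\n",
           run_remotely(2e9, 1e5, 1e-9, 2e-7, 10.0, 2e8, 0.2)
               ? "yes" : "no");
    return 0;
}

As the abstract notes, the real figure of merit must also fold in transmission bandwidth and network congestion, which in this sketch would appear as extra terms in wait_s and e_remote.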

Journal ArticleDOI
TL;DR: Two complementary media-sensitive energy-saving techniques that leverage static information are presented; the first is applicable to existing architectures, while the second proposes a new tagless caching architecture by reevaluating the architecture--compiler interface.
Abstract: The unique characteristics of multimedia/embedded applications dictate media-sensitive architectural and compiler approaches to reduce the power consumption of the data cache. Our goal is exploring energy savings for embedded/multimedia workloads without sacrificing performance. Here, we present two complementary media-sensitive energy-saving techniques that leverage static information. While our first technique is applicable to existing architectures, in our second technique we adopt a more radical approach and propose a new tagless caching architecture by reevaluating the architecture--compiler interface. Our experiments show that substantial energy savings are possible in the data cache. Across a wide range of cache and architectural configurations, we obtain up to 77% energy savings, while the performance varies from 14% improvement to 4% degradation depending on the application.

Journal ArticleDOI
TL;DR: This work develops the first framework for system design with timing QoS guarantees, latency and synchronization, and presents scheduling algorithms that solve the problem optimally for a priori specified applications.
Abstract: Modern system design is being increasingly driven by applications such as multimedia and wireless sensing and communications, which have intrinsic quality of service (QoS) requirements, such as throughput, error-rate, and resolution. One of the most crucial QoS guarantees that the system has to provide is the timing constraint among the interacting media (synchronization) and within each media (latency). We have developed the first framework for system design with timing QoS guarantees. In particular, we address how to design a system-on-chip with minimum silicon area to meet both latency and synchronization constraints. The proposed design methodology consists of two phases: hardware configuration selection and on-chip memory/storage minimization. In the first phase, we use silicon area and system performance as criteria to identify all the competitive hardware configurations (i.e., Pareto points) that facilitate the needs of synchronous applications. In the second phase, we determine the minimum on-chip memory requirement to meet the timing constraints for each Pareto point. An overall system evaluation is conducted to select the best system configuration. We have developed optimal algorithms that schedule a priori specified applications to meet their synchronization requirements with the minimum size of memory. We have also implemented on-line heuristics for real-time applications. The effectiveness of our algorithms has been demonstrated on a set of simulated MPEG streams from popular movies.

Journal ArticleDOI
TL;DR: This work couples the exploration of memory modules together with their connectivity, to evaluate a wide range of cost, performance, and energy connectivity architectures, and uses a heuristic to prune the design space, guiding the exploration towards the most promising designs.
Abstract: Memory accesses represent a major bottleneck in embedded systems power and performance. Traditionally, designers tried to alleviate this problem by relying on a simple cache hierarchy, or a limited use of special purpose memory modules such as stream buffers. Although real-life applications contain a large number of memory references to a diverse set of data structures, a significant percentage of all memory accesses in the application are generated from a few memory instructions that exhibit predictable, well-known access patterns; this creates an opportunity for memory customization, targeting the needs of these access patterns. We present APEX, an approach that extracts, analyzes and clusters the most active access patterns in the application, and aggressively customizes the memory architecture to match the needs of the application. Moreover, though the memory modules are important, the rate at which the memory system can produce the data for the CPU is significantly impacted by the connectivity architecture between the memory subsystem and the CPU. Thus, it is critical to consider the connectivity architecture early in the design flow, in conjunction with the memory architecture. We couple the exploration of memory modules together with their connectivity, to evaluate a wide range of cost, performance, and energy connectivity architectures. We use a heuristic to prune the design space, guiding the exploration towards the most promising designs. We present experiments on a set of large real-life benchmarks, showing significant performance improvements for varied cost and power characteristics, allowing the designer to evaluate customized memory and connectivity configurations for embedded systems.

Journal ArticleDOI
TL;DR: This issue addresses the system-level design gap between a given algorithm specification and the selected target architecture platform, and includes seven papers investigating some of the fundamental research issues in the critical domain of power-aware embedded computing.
Abstract: In the last few years, power awareness has emerged as one of the main research topics in both embedded and general-purpose computing. The challenge has become not only to achieve the desired target performance, but to deliver it with unprecedented energy/power efficiency. Achieving these objectives requires coordinated effort by multiple research communities, including algorithmic design, system-level design, CAD for embedded systems, compilers, operating systems, computer architectures, and VLSI technology and circuits. In this issue, we will address the system-level design gap between a given algorithm specification and the selected target architecture platform. In the critical domain of power-aware embedded computing, some of the fundamental research issues under current investigation include: which decisions/optimizations should be statically defined? Which should be dynamically performed at run time? Which require a synergy between both approaches? How can such decisions/techniques adapt to dynamically varying working conditions and/or performance targets? What is the cost-effectiveness of the required hardware assists (if any)? Which components/features/parameters of a system architecture should be statically tuned/specialized to the needs/requirements of a target embedded application so as to improve energy efficiency? Which ones should be dynamically reconfigurable and/or adaptive? Which fundamental characteristics of an embedded application influence/impact the above design choices? This special issue includes seven papers investigating some of the above questions in the context of a broad range of energy/power-saving techniques and optimizations. As expected in this field, some contributions focus strictly on particular application domains and/or consider specific (sub)system architectures, while others are of a more “general” nature. We are confident that this collection of excellent papers will provide the reader with an interesting insight/snapshot on the state-of-the-art in this challenging research area.

Journal ArticleDOI
TL;DR: This article presents an approach that reduces the need for explicit instruction selection by transferring constraints implied by the instruction set to static resource constraints; all resulting schedules are then guaranteed to correspond to a valid implementation with the given instructions.
Abstract: Due to an increasing need for flexibility, embedded systems embody more and more programmable processors as their core components. Due to silicon area and power considerations, the corresponding instruction sets are often highly encoded to minimize code size for given performance requirements. This has hampered the development of robust optimizing compilers because the resulting irregular instruction set architectures are far from convenient compiler targets. Among other considerations, they introduce an interdependence between the tasks of instruction selection and scheduling. This so-called phase coupling is so strong that, in practice, instruction selection rather than scheduling is responsible for the quality of the schedule, which tends to disappoint. The lack of efficient compilation tools has also severely hampered the design space exploration of code-size efficient instruction sets, and correspondingly, their tuning to the application domain. In this article, we present an approach that reduces the need for explicit instruction selection by transferring constraints implied by the instruction set to static resource constraints. All resulting schedules are then guaranteed to correspond to a valid implementation with given instructions. We also demonstrate the suitability of this model to enable instruction set design (-space exploration) with a simple, well-understood and proven method long used in high-level synthesis (HLS) of ASICs. Experimental results show the efficacy of our approach.

Journal ArticleDOI
TL;DR: This work proposes a microarchitectural technique that targets the reduction of the useless power dissipation of a high-performance processor by predicting whether the result of a particular block of logic will be useful in order to execute the instructions (no matter whether the instructions will be eventually committed or not).
Abstract: The power consumption of current processors keeps increasing in spite of aggressive circuit design techniques and process shrinks. One of the reasons for this increase is the complexity of the microarchitecture required to achieve the performance that each processor generation demands. These techniques, such as branch prediction and on-chip level two caches, increase not only the power consumption of the committed instructions, but also the useless power associated with those block accesses that generate results that are not needed for the correct execution and commit of the instructions. In this work, the different accesses that a particular block receives are classified into four different components, based on whether the accesses are performed by instructions of the correct path or the wrong (mispredicted) path, and also based on whether the results of the accesses are needed or not for the correct execution of the instructions. Out of the four components, only one accounts for the useful accesses to the block, that is, accesses performed to correctly execute instructions that will be committed. The other three components account for the useless activity on the block. The simulations performed indicate that, if the useless power dissipation of a high-performance processor could be totally removed with no performance degradation, the overall processor power consumption would be reduced by as much as 65% compared to the same processor in which all the blocks are accessed every cycle. This work then proposes a microarchitectural technique that targets the reduction of the useless power dissipation. The technique consists of predicting whether the result of a particular block of logic will be useful in order to execute the instructions (no matter whether the instructions will be eventually committed or not). If it is predicted useless, then the block is disabled. A case example is presented where two blocks are predicted for low power: the on-chip L2 cache for instruction fetches and the branch target buffer (BTB). The IPC versus power-consumption design space is explored for a particular microprocessor architecture. Both the average and the peak power consumption are targeted. High-level estimations are done to show that it is plausible that the ideas described might produce a significant reduction in useless block accesses. As an example, 65% of accesses to the L2 cache can be eliminated at a 0.2% IPC degradation, and about 5% of accesses to the BTB can be saved at the penalty of 0.7% IPC reduction.
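
A C sketch of one way such usefulness prediction could be organized (my rendering, all parameters hypothetical): a table of 2-bit saturating counters, indexed by fetch PC, predicts whether an access to a block such as the L2 cache or BTB will produce a needed result, and predicted-useless accesses are gated off.

#include <stdio.h>

#define TBLSZ 4096
static unsigned char useful[TBLSZ];  /* 2-bit counters, 0..3 */

static int predict_useful(unsigned pc) {
    return useful[pc % TBLSZ] >= 2;  /* weakly/strongly useful */
}

static void train(unsigned pc, int was_useful) {
    unsigned char *ctr = &useful[pc % TBLSZ];
    if (was_useful  && *ctr < 3) (*ctr)++;
    if (!was_useful && *ctr > 0) (*ctr)--;
}

int main(void) {
    for (int i = 0; i < TBLSZ; i++) useful[i] = 3; /* start useful */
    train(0x400100, 0);   /* two useless outcomes observed */
    train(0x400100, 0);
    printf("access block? %s\n",
           predict_useful(0x400100) ? "yes" : "no (gated)");
    return 0;
}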

Journal ArticleDOI
TL;DR: The proposed method results in the best compression ratio achieved so far, giving average compression ratios of 33.8% for MediaBench benchmarks and 33.6% for SPEC95 benchmarks, both compiled for a MIPS R2000 processor.
Abstract: Intuitively, the destination registers of some instructions are very likely to be used as the source registers of immediately subsequent instructions. Such destination register/source register pairs have been exploited previously to improve code compression ratio [compression ratio = (Dictionary Size + Encoded Program Size)/Original Program Size]. This paper further examines the exploitation of both register and immediate operand dependencies to improve the compression ratio. A mapping tag is used to flag dependency relationships so that dependent operands can be omitted during compression. The compression ratio is enhanced by both the removal of dependent operands and the sharing of mapping tags between different types of dependencies and between different instructions. Simulation results show that the proposed method results in the best compression ratio achieved so far, giving average compression ratios of 33.8% for MediaBench benchmarks and 33.6% for SPEC95 benchmarks, both compiled for a MIPS R2000 processor.
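
The bracketed metric is simple enough to compute directly; a tiny C example (sizes made up, chosen to land near the reported 33.8%):

#include <stdio.h>

/* compression ratio = (dictionary + encoded program) / original,
 * exactly as defined in the abstract; smaller is better. */
static double compression_ratio(double dict, double encoded, double orig) {
    return (dict + encoded) / orig;
}

int main(void) {
    printf("compression ratio: %.1f%%\n",
           100.0 * compression_ratio(4096.0, 29700.0, 100000.0));
    return 0;
}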

Journal ArticleDOI
TL;DR: The four outstanding papers selected for publication in this issue came from a solicitation that targeted CASES 2001, held in November 2001 in Atlanta, but that also sought papers from the broader research community in embedded systems.
Abstract: This special issue was suggested by the rapid growth of the International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES). The four outstanding papers selected for publication in this issue came from a solicitation that targeted CASES 2001, held in November 2001 in Atlanta, but which also sought other papers from the broader research community in embedded systems. The papers in this issue cover a number of subjects, all bearing on different aspects of embedded computing systems. In their scope they are representative of the subject, which encompasses a wide range of research areas across many disciplines in computer science and engineering. CASES, as its name is meant to suggest, provides a forum to foster synergies among apparently disparate fields. The editors would like to thank all those who submitted papers to this special issue and to the reviewers. Thanks for all the hard work. Guang Gao and Trevor Mudge