Showing papers in "ACM Transactions on Design Automation of Electronic Systems in 2010"
TL;DR: Power gating has become one of the most widely used circuit design techniques for reducing leakage current as discussed by the authors, but its application to standard-cell VLSI designs involves many careful considerations.
Abstract: Power Gating has become one of the most widely used circuit design techniques for reducing leakage current. Its concept is very simple, but its application to standard-cell VLSI designs involves many careful considerations. The great complexity of designing a power-gated circuit originates from the side effects of inserting current switches, which have to be resolved by a combination of extra circuitry and customized tools and methodologies. In this tutorial we survey these design considerations and look at the best practice within industry and academia. Topics include output isolation and data retention, current switch design and sizing, and physical design issues such as power networks, increases in area and wirelength, and power grid analysis. Designers can benefit from this tutorial by obtaining a better understanding of implications of power gating during an early stage of VLSI designs. We also review the ways in which power gating has been improved. These include reducing the sizes of switches, cutting transition delays, applying power gating to smaller blocks of circuitry, and reducing the energy dissipated in mode transitions. Power Gating has also been combined with other circuit techniques, and these hybrids are also reviewed. Important open problems are identified as a stimulus to research.
TL;DR: A novel compiler for SystemC is presented that integrates a formal and scalable race analysis and produces a simulator that uses the race analysis information at runtime to perform partial-order reduction, thereby eliminating context switches that do not affect the result of the simulation.
Abstract: SystemC is a system-level modeling language that offers a wide range of features to describe concurrent systems at different levels of abstraction. The SystemC standard permits simulators to implement a deterministic scheduling policy, which often hides concurrency-related design flaws. We present a novel compiler for SystemC that integrates a very precise formal race analysis by means of model checking. Our compiler produces a simulator that uses the outcome of the analysis to perform partial order reduction. The key insight to make the model checking engine scale is to apply it only to tiny fractions of the SystemC model. We show that the outcome of the analysis is not only valuable to eliminate redundant context switches at runtime, but can also be used to diagnose race conditions statically. In particular, our analysis is able to reveal races that can remain undetected during simulation and is able to formally prove the absence of races.
TL;DR: It is shown through real implementation of the system on a state-of-the-art testbed of server machines that vGreen improves both average performance and system-level energy savings by close to 40% across benchmarks with varying characteristics.
Abstract: In this article, we present vGreen, a multitiered software system for energy-efficient virtual machine management in a clustered virtualized environment. The system leverages the use of novel hierarchical metrics that work across the different abstractions in a virtualized environment to capture power and performance characteristics of both the virtual and physical machines. These characteristics are then used to implement policies for scheduling and power management of virtual machines across the cluster. We show through real implementation of the system on a state-of-the-art testbed of server machines that vGreen improves both average performance and system-level energy savings by close to 40p across benchmarks with varying characteristics.
TL;DR: This work proposes a methodology for a NBTI-aware power gating that allows synthesizing low-leakage circuits with maximum lifetime, and shows how it is possible to partially overcome this conflict by leveraging the benefits in terms of aging provided by power-gating.
Abstract: The emergence of Negative Bias Temperature Instability (NBTI) as the most relevant source of reliability in sub-90nm technologies has led to a new facet of the traditional trade-off between power and reliability. NBTI effects in fact manifest themselves as an increase of the propagation delay of the devices over time, which adds up to the delay penalty incurred by most low-power design solutions. This implies that, given a desired lifetime of a circuit (i.e., a given performance target at some point in time), a power-managed component will fail earlier than a nonpower-managed one.In this work, we show how it is possible to partially overcome this conflict, by leveraging the benefits in terms of aging provided by power-gating (i.e., by using switches that disconnect a logic block from the ground). Thanks to some electrical properties, it is possible to nullify aging effects during standby periods.Based on this important property, we propose a methodology for a NBTI-aware power gating that allows synthesizing low-leakage circuits with maximum lifetime.
TL;DR: The approach of abstract processor modeling in the context of multiprocessor architectures is introduced, combining modeling of computation on processors with an abstract RTOS and accurate interrupt handling into a versatile, multifaceted processor model with several levels of features.
Abstract: With growing system complexity and ever-increasing software content, the development of embedded software for upcoming MPSoC architectures is a tremendous challenge. Traditional ISS-based validation becomes infeasible due to the large complexity.Addressing the need for flexible and fast simulating models, we introduce in this article our approach of abstract processor modeling in the context of multiprocessor architectures. We combine modeling of computation on processors with an abstract RTOS and accurate interrupt handling into a versatile, multifaceted processor model with several levels of features.Our processor models are utilized in a framework allowing designers to develop a system in a top-down manner using automatic model generation and compilation down to a given MPSoC architecture. During generation, instances of our processor models are integrated into a system model combining software, hardware, and bus communication. The generated system model serves for rapid design space exploration and a fast and accurate system validation.Our experimental results show the benefits of our processor modeling using an actual multiprocessor mobile phone baseband platform. Our abstract models of this complex system reach a simulation speed of 300MCycles/s within a high accuracy of less than 3p error. In addition, our results quantify the speed/accuracy trade-off at varying abstraction levels of our models to guide future processor model designers.
TL;DR: A new architecture-level parameterized dynamic thermal behavioral modeling algorithm for emerging thermal-related design and optimization problems for high-performance multicore microprocessor design, called ParThermPOF, which offers two order of magnitudes speedup over the commercial thermal analysis tool FloTHERM.
Abstract: In this article, we propose a new architecture-level parameterized dynamic thermal behavioral modeling algorithm for emerging thermal-related design and optimization problems for high-performance multicore microprocessor design We propose a new approach, called ParThermPOF, to build the parameterized thermal performance models from the given accurate architecture thermal and power information The new method can include a number of variable parameters such as the locations of thermal sensors in a heat sink, different components (heat sink, heat spreader, core, cache, etc), thermal conductivity of heat sink materials, etc The method consists of two steps: first, a response surface method based on low-order polynomials is applied to build the parameterized models at each time point for all the given sampling nodes in the parameter space Second, an improved Generalized Pencil-Of-Function (GPOF) method is employed to build the transfer-function-based behavioral models for each time-varying coefficient of the polynomials generated in the first step Experimental results on a practical quad-core microprocessor show that the generated parameterized thermal model matches the given data very well The compact models by ParThermPOF offer two order of magnitudes speedup over the commercial thermal analysis tool FloTHERM on the given examples ParThermPOF is very suitable for design space exploration and optimization where both time and system parameters need to be considered
TL;DR: This paper presents a meta-analysis of data center power consumption in the United States over the course of a 12-month period and shows clear trends in power consumption over the long-term.
Abstract: ing total data center power. In Proceedings of the Workshop on Energy Efficient Design
TL;DR: This work presents an efficient heuristic algorithm based on kernel recognition for the pipelined scheduling problem, a technique borrowed from SW pipelining, to overcome the scalability problem of the SMT-based optimal solution technique.
Abstract: FPGAs are widely used in today's embedded systems design due to their low cost, high performance, and reconfigurability. Partially RunTime-Reconfigurable (PRTR) FPGAs, such as Virtex-2 Pro and Virtex-4 from Xilinx, allow part of the FPGA area to be reconfigured while the remainder continues to operate without interruption, so that HW tasks can be placed and removed dynamically at runtime. We address two problems related to HW task scheduling on PRTR FPGAs: (1) HW/SW partitioning. Given an application in the form of a task graph with known execution times on the HW (FPGA) and SW (CPU), and known area sizes on the FPGA, find an valid allocation of tasks to either HW or SW and a static schedule with the optimization objective of minimizing the total schedule length (makespan). (2) Pipelined scheduling. Given an input task graph, construct a pipelined schedule on a PRTR FPGA with the goal of maximizing system throughput while meeting a given end-to-end deadline. Both problems are NP-hard. Satisfiability Modulo Theories (SMT) is an extension to SAT by adding the ability to handle arithmetic and other decidable theories. We use the SMT solver Yices with Linear Integer Arithmetic (LIA) theory as the optimization engine for solving the two scheduling problems. In addition, we present an efficient heuristic algorithm based on kernel recognition for the pipelined scheduling problem, a technique borrowed from SW pipelining, to overcome the scalability problem of the SMT-based optimal solution technique.
TL;DR: This work extends modern digital synthesis with a novel technique, called SWEDE, that makes use of extensive external don't-cares present implicitly in existing simulation-based verification environments for circuit customization.
Abstract: Traditional digital circuit synthesis flows start from an HDL behavioral definition and assume that circuit functions are almost completely defined, making don't-care conditions rare. However, recent design methodologies do not always satisfy these assumptions. For instance, third-party IP blocks used in a system-on-chip are often overdesigned for the requirements at hand. By focusing only on the input combinations occurring in a specific application, one could resynthesize the system to greatly reduce its area and power consumption. Therefore we extend modern digital synthesis with a novel technique, called SWEDE, that makes use of extensive external don't-cares. In addition, we utilize such don't-cares present implicitly in existing simulation-based verification environments for circuit customization. Experiments indicate that SWEDE scales to large ICs with half-million input vectors and handles practical cases well.
TL;DR: This work proposes a novel approach to partition the register in a way that can achieve the largest power saving, and forms the register file partitioning process into a graph partitioning problem, and applies an effective algorithm to obtain the optimal result.
Abstract: Register files in modern embedded processors contribute a substantial budget in the energy consumption due to their large switching capacitance and long working time. For some embedded processors, on average 25p of registers account for 83p of register file accessing time. This motivates us to partition the register file into hot and cold regions, with the most frequently used registers placed in the hot region, and the rarely accessed ones in the cold region. We employ the bit-line splitting and drowsy register cell techniques to reduce the overall register file accessing power. We propose a novel approach to partition the register in a way that can achieve the largest power saving. We formulate the register file partitioning process into a graph partitioning problem, and apply an effective algorithm to obtain the optimal result. We evaluate our algorithm for MiBench and SPEC2000 applications on the SimpleScalar PISA system, and an average saving of 58.3p and 54.4p over the nonpartitioned register file accessing power is achieved. The area overhead is negligible, and the execution time overhead is acceptable (5.5p for MiBench 2.4p for SPEC2000). Further evaluation for MiBench applications is performed on Alpha and X86 system.
TL;DR: A heuristic method to estimate the chip-level thermal profile when the underlying randomness is non-Gaussian is given, which can generate highly accurate chip- level thermal profile estimates within a few milliseconds.
Abstract: This article addresses the problem of chip-level thermal profile estimation using runtime temperature sensor readings. We address the challenges of: (a) availability of only a few thermal sensors with constrained locations (sensors cannot be placed just anywhere); (b) random chip power density characteristics due to unpredictable workloads and fabrication variability. Firstly we model the random power density as a probability density function. Given such statistical characteristics and the runtime thermal sensor readings, we exploit the correlation in power dissipation among different chip modules to estimate the expected value of temperature at each chip location. Our methods are optimal if the underlying power density has Gaussian nature. We give a heuristic method to estimate the chip-level thermal profile when the underlying randomness is non-Gaussian. An extension of our method has also been proposed to address the dynamic case. Several speedup strategies are carefully investigated to improve the efficiency of the estimation algorithm. Experimental results indicated that, given only a few thermal sensors, our method can generate highly accurate chip-level thermal profile estimates within a few milliseconds.
TL;DR: The theory and related algorithms for complete polymorphic gate sets with more than two modes are proposed and the impact of logic-1 and logic-0 on the completeness of the polymorphic Gate set is discussed.
Abstract: Polymorphic gates are special kinds of logic gates that can exhibit different functions under the control of environmental parameters, such as light, temperature, and VDD. These polymorphic gates can be used to build polymorphic circuits that perform different functions under different environments. Because polymorphic gates are different from traditional logic gates, the existent completeness theory for the traditional logic gate set is not suitable for the polymorphic gate set. So far, only the definition of the complete polymorphic gate set is given. There is no approach to judging whether a given polymorphic gate set is complete. The contributions of this article include three aspects. First, the impact of logic-1 and logic-0 on the completeness of the polymorphic gate set is discussed. Second, the theory and two related algorithms for judging the completeness of polymorphic gate sets with two modes are given. Finally, the theory and related algorithms for complete polymorphic gate sets with more than two modes are proposed.
TL;DR: This article proposes an efficient procedure to compute an approximated behavior-level observability of every operation in a dataflow graph, and is the first time behavior- level observability analysis and optimization are performed during behavioral synthesis in a systematic manner.
Abstract: Many techniques for power reduction in advanced RTL synthesis tools rely explicitly or implicitly on observability don’t-care conditions In this article we propose a systematic approach to maximize the effectiveness of these techniques by generating power-friendly RTL descriptions in behavioral synthesis This is done using operation gating, that is, explicitly adding a predicate to an operation based on its observability condition, so that the operation, once identified as unobservable at runtime, can be avoided using RTL power optimization techniques such as clock gatingWe first introduce the concept of behavior-level observability and its approximations in the context of behavioral synthesis We then propose an efficient procedure to compute an approximated behavior-level observability of every operation in a dataflow graph Unlike previous techniques which work at the bit level in Boolean networks, our method is able to perform analysis at the word level, and thus avoids most computation effort with a reasonable approximation Our algorithm exploits the observability-masking nature of some Boolean operations, as well as the select operation, and allows certain forms of other knowledge to be considered for stronger observability conditions The approximation is proved exact for (acyclic) dataflow graphs when non-Boolean operations other than select are treated as black boxes The behavior-level observability condition obtained by our analysis can be used to guide the operation scheduler to optimize the efficiency of operation gating In a set of experiments on real-world designs, our method achieves an average of 339p reduction in total power; it outperforms a previous method by 171p on average and gives close-to-optimal solutions on several designs To the best of our knowledge, this is the first time behavior-level observability analysis and optimization are performed during behavioral synthesis in a systematic manner We believe that our idea can be applied to compiler transformations in general
TL;DR: A new (à priori counterintuitive) paradigm in device optimization for subthreshold logic: relaxing gate leakage constraints to improve robustness against short-channel effects and variability is revealed.
Abstract: Subthreshold operation of digital circuits enables minimum energy consumption. In this article, we observe that minimum energy Emin of subthreshold logic dramatically increases when reaching 45nm CMOS node. We demonstrate by circuit simulation and analytical modeling that this increase comes from the combined effects of variability, gate leakage, and Drain-Induced Barrier Lowering (DIBL) effect. We then investigate the new impact of individual MOSFET parameters Lg, Vt, and Tox on Emin in sub-45nm technologies. We further propose an optimum MOSFET selection, which favors low-Vt mid-Lg devices in 45nm CMOS technology. The use of such optimum MOSFETs yields 35p Emin reduction for a benchmark multiplier with good speed performances and negligible area overhead. This optimum MOSFET selection can easily be integrated into a standard EDA tool flow by appropriate selection of the standard cell library.We finally demonstrate that undoped-channel fully-depleted Silicon-On-Insulator (SOI) technology brings 60p Emin reduction with baseline MOSFETs thanks to strong mitigation of variability and short-channel effects. This study reveals a new (a priori counterintuitive) paradigm in device optimization for subthreshold logic: relaxing gate leakage constraints to improve robustness against short-channel effects and variability. Additionally, we propose pre-Silicon BSIM4 MOSFET model cards for realistic subthreshold circuit simulations including variability in bulk and fully depleted SOI technologies, which are made available online.
TL;DR: This article introduces a novel and efficient hardware-supported compression technique that is based on Huffman Coding, which reduces the size of the generated decoding table, which takes a large portion of the memory.
Abstract: The size of embedded software is increasing at a rapid pace. It is often challenging and time consuming to fit an amount of required software functionality within a given hardware resource budget. Code compression is a means to alleviate the problem by providing substantial savings in terms of code size. In this article we introduce a novel and efficient hardware-supported compression technique that is based on Huffman Coding. Our technique reduces the size of the generated decoding table, which takes a large portion of the memory. It combines our previous techniques, Instruction Splitting Technique and Instruction Re-encoding Technique into new one called Combined Compression Technique to improve the final compression ratio by taking advantage of both previous techniques. The instruction Splitting Technique is instruction set architecture (ISA)-independent. It splits the instructions into portions of varying size (called patterns) before Huffman coding is applied. This technique improves the final compression ratio by more than 20p compared to other known schemes based on Huffman Coding. The average compression ratios achieved using this technique are 48p and 50p for ARM and MIPS, respectively. The Instruction Re-encoding Technique is ISA-dependent. It investigates the benefits of reencoding unused bits (we call them reencodable bits) in the instruction format for a specific application to improve the compression ratio. Reencoding those bits can reduce the size of decoding tables by up to 40p. Using this technique, we improve the final compression ratios in comparison to the first technique to 46p and 45p for ARM and MIPS, respectively (including all overhead that incurs). The Combined Compression Technique improves the compression ratio to 45p and 42p for ARM and MIPS, respectively. In our compression technique, we have conducted evaluations using a representative set of applications and we have applied each technique to two major embedded processor architectures, namely ARM and MIPS.
TL;DR: This article introduces a hybrid scheduling approach that relies on an abstract symbolic representation of data flow nodes (operations) bound to control flow paths that produces a more realistic lower bound during the prescheduling resource estimation step and speeds up slower but accurate heuristic scheduling techniques, thus achieving a globally improved result.
Abstract: Hardware synthesis is the process by which system-level, Register Transfer (RT)-level, or behavioral descriptions can be turned into real implementations, in terms of logic gates. Scheduling is one of the most time-consuming steps in the overall design flow, and may become much more complex when performing hardware synthesis from high-level specifications. Exploiting a single scheduling strategy on very large designs is often reductive and potentially inadequate. Furthermore, finding the “best” single candidate among all possible scheduling algorithms is practically infeasible. In this article we introduce a hybrid scheduling approach that is a preliminary step towards a comprehensive solution not yet provided by industrial or by academic solutions. Our method relies on an abstract symbolic representation of data flow nodes (operations) bound to control flow paths: it produces a more realistic lower bound during the prescheduling resource estimation step and speeds up slower but accurate heuristic scheduling techniques, thus achieving a globally improved result.
TL;DR: This work proposes a page-level, link-time technique that minimizes not only the size of patching scripts but also perturbation to the firmware memory, over the entire sequence of updates in the system’s lifetime.
Abstract: Firmware update over a network connection is an essential but expensive feature for many embedded systems due to not only the relatively high power consumption and limited bandwidth, but also page-granular erasure before rewriting to flash memory. This work proposes a page-level, link-time technique that minimizes not only the size of patching scripts but also perturbation to the firmware memory, over the entire sequence of updates in the system’s lifetime. We propose a tool that first clusters functions to minimize caller-callee dependency across pages, and then orders the functions within each page to minimize intrapage perturbation. Experimental results show our technique to reduce the energy consumption of firmware update by 30--42p over the state-of-the-art. Most importantly, this is the first work that has ever shown to evolve well over 41 revisions of a real-world open-source real-time operating system.
TL;DR: This article investigates vulnerability-based partitioning schemes that are applicable to applications in general and effectively reduce failures due to soft errors at minimal power and performance overheads.
Abstract: Increasing exponentially with technology scaling, the soft error rate even in earth-bound embedded systems manufactured in deep subnanometer technology is projected to become a serious design consideration. Partially protected cache (PPC) is a promising microarchitectural feature to mitigate failures due to soft errors in power, performance, and cost sensitive embedded processors. A processor with PPC maintains two caches, one protected and the other unprotected, both at the same level of memory hierarchy. The intuition behind PPCs is that not all data in the application is equally prone to soft errors. By finding and mapping the data that is more prone to soft errors to the protected cache, and error-resilient data to the unprotected cache, failures induced by soft errors can be significantly reduced at a minimal power and performance penalty. Consequently, the effectiveness of PPCs critically hinges on the compiler's ability to partition application data into error-prone and error-resilient data. The effectiveness of PPCs has previously been demonstrated on multimedia applications—where an obvious partitioning of data exists, the multimedia data is inherently resilient to soft errors, and the rest of the data and the entire code is assumed to be error-prone. Since the amount of multimedia data is a quite significant component of the entire application data, this obvious partitioning is quite effective. However, no such obvious data and code partitioning exists for general applications. This severely restricts the applicability of PPCs to data caches and instruction caches in general. This article investigates vulnerability-based partitioning schemes that are applicable to applications in general and effectively reduce failures due to soft errors at minimal power and performance overheads.Our experimental results on an HP iPAQ-like processor enhanced with PPC architecture, running benchmarks from the MiBench suite demonstrate that our partitioning heuristic efficiently finds page partitions for data PPCs that can reduce the failure rate by 48p at only 2p performance and 7p energy overhead, and finds page partitions for instruction PPCs that reduce the failure rate by 50p at only 2p performance and 8p energy overhead, on average.
TL;DR: A processor platform architecture based on an application-specific programmable processor core, System-On-Chip bus, and a hardware accelerator is investigated for use in Wireless Sensor Network (WSN) nodes (motes).
Abstract: In this article we describe a low-power processor platform for use in Wireless Sensor Network (WSN) nodes (motes). WSN motes are small, battery-powered devices comprised of a processor, sensors, and a radio frequency transceiver. It is expected that WSNs consisting of large numbers of motes will offer long-term, distributed monitoring, and control of real-world equipment and phenomena. A key requirement for these applications is long battery life. We investigate a processor platform architecture based on an application-specific programmable processor core, System-On-Chip bus, and a hardware accelerator. The architecture improves on the energy consumption of a conventional microprocessor design by tuning the architecture for a suite of TinyOS-based WSN applications. The tuning method used minimizes changes to the instruction set architecture facilitating rapid software migration to the new platform. The processor platform was implemented and validated in an FPGA-based WSN mote. The benefits of the approach in terms of energy consumption are estimated to be a reduction of 48p for ASIC implementation relative to a conventional programmable processor for a typical TinyOS application suite without use of voltage scaling.
TL;DR: The impact of migrating from 2-D to 3-D on the difficulty of floorplanning and placement is discussed and the results show possible challenges in the future for physical design and CAD of3-D integrated circuits.
Abstract: Interconnect dominated electronic design stimulates a demand for developing circuits on the third dimension, leading to 3-D integration. Recent advances in chip fabrication technology enable 3-D circuit manufacturing. However, there is still a possible barrier of design complexity in exploiting 3-D technologies. This article discusses the impact of migrating from 2-D to 3-D on the difficulty of floorplanning and placement. By looking at a basic formulation of the graph cuboidal dual problem, we show that the 3-D cases and the 3-layer 2.5-D cases are fundamentally more difficult than the 2-D cases in terms of computational complexity. By comparison among these cases, the intrinsic complexity in 3-D floorplan structures is revealed in the hard-to-decide relations between topological connections and geometrical contacts. The results show possible challenges in the future for physical design and CAD of 3-D integrated circuits.
TL;DR: This benchmark suite includes seven designs; one design targets fine-grained FPGA fabrics allowing for quick state-of-the-art evaluation, and six designs are specified at a high level allowing them to target a range of existing and future reconfigurable technologies.
Abstract: We present the GroundHog 2009 benchmarking suite that evaluates the power consumption of reconfigurable technology for applications targeting the mobile computing domain. This benchmark suite includes seven designs; one design targets fine-grained FPGA fabrics allowing for quick state-of-the-art evaluation, and six designs are specified at a high level allowing them to target a range of existing and future reconfigurable technologies. Each of the six designs can be stimulated with the help of synthetically generated input stimuli created by an open-source tool included in the downloadable suite. Another tool is included to help verify the correctness of each implemented design. To demonstrate the potential of this benchmark suite, we evaluate the power consumption of two modern industrial FPGAs targeting the mobile domain. Also, we show how an academic FPGA framework, VPR 5.0, that has been updated for power estimates can be used to estimates the power consumption of different FPGA architectures and an open-source CAD flow mapping to these architectures.
TL;DR: A new approach termed Concept-Based Partitioning is presented that focuses on system evolution, product lines, and large-scale reuse when partitioning that improved the composability of concepts while keeping performance and size overhead within the 2% range.
Abstract: Hardware-software partitioning is an important phase in embedded systems. Decisions made during this phase impact the quality, cost, performance, and the delivery date of the final product. Over the past decade or more, various partitioning approaches have been proposed. A majority operate at a relatively fine granularity and use a low-level executable specification as the starting point. This presents problems if the context is families of industrial products with frequent release of upgraded or new members. Managing complexity using a low-level specification is extremely challenging and impacts developer productivity. Designing using a high-level specification and component-based development, although a better option, imposes component integration and replacement problems during system evolution and new product release. A new approach termed Concept-Based Partitioning is presented that focuses on system evolution, product lines, and large-scale reuse when partitioning. Beginning with information from UML 2.0 sequence diagrams and a concept repository concepts are identified and used as the unit of partitioning within a specification. A methodology for the refinement of interpart communication in the system specification using sequence diagrams is also presented. Change localization during system evolution, composability during large-scale reuse, and provision for configurable feature variations for a product line are facilitated by a Generic Adaptive Layer (GAL) around selected concepts. The methodology was applied on a subsystem of an Unmanned Aerial Vehicle (UAV) using various concepts which improved the composability of concepts while keeping performance and size overhead within the 2p range.
TL;DR: An analysis of the reliability of memories protected with Built-in Current Sensors (BICS) and a per-word parity bit when exposed to Single Event Upsets (SEUs) and Mean Time to Failure (MTTF) is presented.
Abstract: This article presents an analysis of the reliability of memories protected with Built-in Current Sensors (BICS) and a per-word parity bit when exposed to Single Event Upsets (SEUs). Reliability is characterized by Mean Time to Failure (MTTF) for which two analytic models are proposed. A simple model, similar to the one traditionally used for memories protected with scrubbing, is proposed for the low error rate case. A more complex Markov model is proposed for the high error rate case. The accuracy of the models is checked using a wide set of simulations. The results presented in this article allow fast estimation of MTTF enabling design of optimal memory configurations to meet specified MTTF goals at minimum cost. Additionally the power consumption of memories protected with BICS is compared to that of memories using scrubbing in terms of the number of read cycles needed in both configurations.
TL;DR: For Dynamic Voltage Scaling (DVS), a novel design methodology is proposed that is composed of an error detection circuit and three technologies to reduce the area and power penalties which are the large issues for the conventional DVS with error detection.
Abstract: For Dynamic Voltage Scaling (DVS), we propose a novel design methodology. This methodology is composed of an error detection circuit and three technologies to reduce the area and power penalties which are the large issues for the conventional DVS with error detection. The proposed circuit, Phase-Adjustable Error Detection Flip-Flip (PEDFF), adjusts the clock phase of an additional FF for the timing error detection, based on the timing slack. 2-Stage Hold-Driven Optimization (2-SHDO) technology splits the hold-driven optimization in two stages. Slack-Based Grouping Scheme (SBGS) technology divides each timing path into appropriate groups based on the timing slack. Slack Distribution Control (SDC) technology improves the sharp distribution of the path delay at which the logic synthesis tool has relaxed the delay. We evaluate the methodology by simulating a 32-bit microprocessor in 90 nm CMOS technology. The proposed methodology reduces the energy consumption by 19.8p compared to non-DVS. The OR-tree's latency is shortened to 16.3p compared to the conventional DVS. The area and power penalties for delay buffers on short paths are reduced to 35.0p and 40.6p compared to the conventional DVS, respectively. The proposed methodology with SDC reduces the energy consumption by 17.0p on another example with the sharp slack distribution by the logic synthesis compared to non-DVS.
TL;DR: This work presents the formulation and implementation of a method for analyzing the thermal (chip heating) behavior of a MPSoC task schedule, during the early stages of the design, and proposes a directed simulation methodology which uses results of a time-bounded analysis of the hybrid automata modeling thermal behavior of the application, to simulate the expected worst-case execution runs of the same.
Abstract: Overheating of computer chips leads to degradation of performance and reliability. Therefore, preventing chips from overheating in spite of increased performance requirements has emerged as a major challenge. Since the cost of cooling has been rising steadily, various architecture and application design techniques are used to prevent chip overheating. Temperature-aware task scheduling has emerged as an important application design methodology for addressing this problem in multiprocessor SoC systems.In this work we present the formulation and implementation of a method for analyzing the thermal (chip heating) behavior of a MPSoC task schedule, during the early stages of the design. We highlight the challenges in developing such a framework and propose solutions for tackling them. Due to nondeterminism in task execution times and decision branches, multiprocessor applications cannot be evaluated accurately by the current state-of-the-art thermalsimulation and steady-state analysis methods. Hence an analysis covering nondeterministic execution behaviors is required for thermal analysis of MPSoC task schedules. To address this issue we propose a model checking-based approach for solving the thermal analysis problem and formulate it as a hybrid automata reachability verification problem. We present an algorithm for constructing this hybrid automata given the task schedule, a set of power profiles of tasks, and the Compact Thermal Model (CTM) of the chip. Information about task power consumption is inferred from Markov chains which are learned from power profiles of tasks, obtained from simulation or emulation runs. A numerical analysis-based algorithm which uses CounterExample-Guided Abstraction Refinement (CEGAR) is developed for reachability analysis of this hybrid automata. We propose a directed simulation methodology which uses results of a time-bounded analysis of the hybrid automata modeling thermal behavior of the application, to simulate the expected worst-case execution runs of the same. The algorithms presented in this work have been implemented in a prototype tool called HeatCheck. We present experimental results and analysis of thermal behavior of a set of task schedules executing on a MPSoC system.
TL;DR: One of the major advantages of this technique is that it achieves a significant reduction in leakage without increasing the delay of the circuit.
Abstract: Leakage power currently comprises a large fraction of the total power consumption of an IC. Techniques to minimize leakage have been researched widely. However, most approaches to reducing leakage have an associated performance penalty. In this article, we present an approach which minimizes leakage by simultaneously modifying the circuit while deriving the input vector that minimizes leakage. In our approach, we selectively modify a gate so that its output (in sleep mode) is in a state which helps minimize the leakage of other gates in its transitive fanout. Gate replacement is performed in a slack-aware manner, to minimize the resulting delay penalty. One of the major advantages of our technique is that we achieve a significant reduction in leakage without increasing the delay of the circuit.
TL;DR: This article presents several scan-cell reordering techniques to reduce the signal transitions during the test mode while preserving the don’t-care bits in the test patterns for a later optimization.
Abstract: This article presents several scan-cell reordering techniques to reduce the signal transitions during the test mode while preserving the don’t-care bits in the test patterns for a later optimization. Combined with a pattern-filling technique, the proposed scan-cell reordering techniques can utilize both high response correlations and pattern correlations to simultaneously minimize scan-out and scan-in transitions. Those scan-shift transitions can be further reduced by selectively using the inverse connections between scan cells. In addition, the trade-off between routing overhead and power consumption can also be controlled by the proposed scan-cell reordering techniques. A series of experiments are conducted to demonstrate the effectiveness of each of the proposed techniques individually.
TL;DR: A low-overhead design technique that allows efficient characterization of Fmax at different operating voltages and temperatures and considers actual timing paths instead of critical path replicas, thereby accounting for local within-die delay variations.
Abstract: Maximum operating frequency (Fmax) of a system often needs to be determined at multiple operating points, defined by voltage and temperatures. Such calibration is important for the speed binning process, where the voltage-frequency (V-Fmax) relation needs to be accurately determined to sort chips into different bins that can be used for different applications. Moreover, adaptive systems typically require Fmax calibration at multiple operating points in order to dynamically change operating condition such as supply voltage or body bias for power, temperature, or throughput management. For example, a Dynamic Voltage and Frequency Scaling (DVFS) system requires accurate delay calibration at multiple operating voltages in order to apply the correct operating frequency corresponding to a scaled supply. In this article, we propose a low-overhead design technique that allows efficient characterization of Fmax at different operating voltages and temperatures. The proposed method selects a set of representative timing paths in a circuit based on their temperature and voltage sensitivities and dynamically configures them into a ring oscillator to compute the critical path delay. Compared to existing Fmax calibration approaches, the proposed approach provides the following two main advantages: (1) it introduces a delay sensitivity metric to isolate few representative timing paths; (2) it considers actual timing paths instead of critical path replicas, thereby accounting for local within-die delay variations. The all-digital calibration method is robust under process variations and achieves high delay estimation accuracy (> 4p error) at the cost of negligible design overhead (1.7p in delay, 0.3p in power, and 3.5p in die-area).
TL;DR: A floating point FFT processor is demonstrated that leverages both 3D integration and a unique hypercube memory division scheme to reduce the power consumption of a 1024 point F FT down to 4.227μJ.
Abstract: In this article we demonstrate a floating point FFT processor that leverages both 3D integration and a unique hypercube memory division scheme to reduce the power consumption of a 1024 point FFT down to 4.227μJ. The hypercube memory division scheme lowers the energy per memory access by 59.2p and increases the total required area by 16.8p. The use of 3D integration reduces the logic power by 5.2p. We describe the tool flow required to realize the 3D implementation and perform a thermal analysis of it.
TL;DR: An in-place search algorithm for computing the exact solutions to the resource constrained scheduling problem based on two lower-bound estimation mechanisms that can effectively prune the nonpromising search space and finds the optimum usually several times faster than existing techniques.
Abstract: We propose an in-place search algorithm for computing the exact solutions to the resource constrained scheduling problem. This algorithm supports operation chaining, pipelining and multicycling in the underlying scheduling problem. Based on two lower-bound estimation mechanisms that are capable of predicting the criterion values of search nodes represented by partially scheduled data flow graphs, the proposed algorithm can effectively prune the nonpromising search space and finds the optimum usually several times faster than existing techniques. As opposed to existing search-based scheduling techniques whose space complexity is squared or exponential in the search depth, our approach requires only a constant storage space during the traversal of the search tree. The low space complexity is accomplished by using a combination-generating algorithm, which leads our approach to visit search nodes in such a way that each one is obtained by making only a small change to its sibling without keeping any parent nodes in memory. Experimental results on several well known benchmarks with varying resource constraints show the effectiveness of the proposed algorithm.