
Showing papers in "IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems in 2008"


Journal ArticleDOI
TL;DR: The history of MPSoCs is surveyed to argue that they represent an important and distinct category of computer architecture, and computer-aided design problems relevant to the design of MPSoCs are surveyed.
Abstract: The multiprocessor system-on-chip (MPSoC) uses multiple CPUs along with other hardware subsystems to implement a system. A wide range of MPSoC architectures have been developed over the past decade. This paper surveys the history of MPSoCs to argue that they represent an important and distinct category of computer architecture. We consider some of the technological trends that have driven the design of MPSoCs. We also survey computer-aided design problems relevant to the design of MPSoCs.

435 citations


Journal ArticleDOI
TL;DR: Algorithms that perform automatic static analysis of software to detect programming errors or prove their absence are surveyed and the three techniques considered are static analysis with abstract domains, model checking, and bounded model checking.
Abstract: The quality and the correctness of software are often the greatest concern in electronic systems. Formal verification tools can provide a guarantee that a design is free of specific flaws. This paper surveys algorithms that perform automatic static analysis of software to detect programming errors or prove their absence. The three techniques considered are static analysis with abstract domains, model checking, and bounded model checking. A short tutorial on these techniques is provided, highlighting their differences when applied to practical problems. This paper also surveys tools implementing these techniques and describes their merits and shortcomings.
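
To make the bounded-model-checking idea concrete, here is a toy sketch of our own: it unrolls a transition system to depth k and searches for a violating state by explicit enumeration, whereas real BMC tools encode the same unrolled search as a single SAT instance. The example system and all names are ours.

```python
# Toy bounded model checking: unroll a transition relation up to depth k and
# look for a "bad" state. Real BMC encodes this search as one SAT instance;
# here we enumerate states explicitly for clarity.
from itertools import product

def bmc(init, trans, bad, k, state_bits=2):
    states = list(product([0, 1], repeat=state_bits))
    frontier = [s for s in states if init(s)]
    for depth in range(k + 1):
        if any(bad(s) for s in frontier):
            return f"counterexample at depth {depth}"
        frontier = [t for s in frontier for t in states if trans(s, t)]
    return f"no violation up to depth {k}"

# Example system: a 2-bit counter that increments modulo 4; state (1, 1) is bad.
init = lambda s: s == (0, 0)
trans = lambda s, t: (s[0] * 2 + s[1] + 1) % 4 == t[0] * 2 + t[1]
bad = lambda s: s == (1, 1)
print(bmc(init, trans, bad, k=3))   # -> counterexample at depth 3
```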

343 citations


Journal ArticleDOI
TL;DR: In this paper, the authors review the recent developments in statistical static-timing analysis (SSTA) and discuss its underlying models and assumptions, then survey the major approaches, and close by discussing its remaining key challenges.
Abstract: Static-timing analysis (STA) has been one of the most pervasive and successful analysis engines in the design of digital circuits for the last 20 years. However, in recent years, the increased loss of predictability in semiconductor devices has raised concern over the ability of STA to effectively model statistical variations. This has resulted in extensive research in the so-called statistical STA (SSTA), which marks a significant departure from the traditional STA framework. In this paper, we review the recent developments in SSTA. We first discuss its underlying models and assumptions, then survey the major approaches, and close by discussing its remaining key challenges.
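
For orientation, a standard device in the parameterized SSTA literature (stated here for context, not specific to this survey) is the first-order canonical form, which represents every delay or arrival time as

$$
d = d_0 + \sum_{i=1}^{n} a_i \,\Delta X_i + a_{n+1}\,\Delta R,
$$

where $d_0$ is the nominal value, the $\Delta X_i$ are globally correlated variation sources (channel length, oxide thickness, etc.), $\Delta R$ is a purely random independent component, and the sensitivities $a_i$ are propagated through the sum and max operations of the timing graph.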

341 citations


Journal ArticleDOI
TL;DR: This work proposes a high-quality analytical placement algorithm considering wirelength, preplaced blocks, and density, based on the log-sum-exp wirelength model proposed by Naylor and the multilevel framework, and uses the conjugate gradient method with dynamic step-size control together with macro shifting to find better macro positions.
Abstract: In addition to wirelength, modern placers need to consider various constraints such as preplaced blocks and density. We propose a high-quality analytical placement algorithm considering wirelength, preplaced blocks, and density based on the log-sum-exp wirelength model proposed by Naylor and the multilevel framework. To handle preplaced blocks, we use a two-stage smoothing technique, i.e., Gaussian smoothing followed by level smoothing, to facilitate block spreading during global placement (GP). The density is controlled by white-space reallocation using partitioning and cut-line shifting during GP and cell sliding during detailed placement. We further use the conjugate gradient method with dynamic step-size control to speed up the GP and macro shifting to find better macro positions. Experimental results show that our placer obtains very high-quality results.
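
For reference, Naylor's log-sum-exp model replaces the non-differentiable half-perimeter wirelength of each net with a smooth approximation controlled by a parameter $\gamma$ (this is the standard form of the model, stated here for context):

$$
W_{\mathrm{LSE}} = \gamma\left(\ln \sum_{i \in \mathrm{net}} e^{x_i/\gamma} + \ln \sum_{i \in \mathrm{net}} e^{-x_i/\gamma}\right) + (\text{analogous terms in } y),
$$

which converges to $\max_i x_i - \min_i x_i$ as $\gamma \to 0$ while staying differentiable, so it can be minimized by analytical methods such as the conjugate gradient used here.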

260 citations


Journal ArticleDOI
TL;DR: It is shown experimentally that, over 18 industrial circuits in the ISPD98 benchmark suite, FLUTE with default accuracy is more accurate than the Batched 1-Steiner heuristic and is almost as fast as a very efficient implementation of Prim's rectilinear minimum spanning tree algorithm.
Abstract: In this paper, we present a very fast and accurate rectilinear Steiner minimal tree (RSMT) algorithm called fast lookup table estimation (FLUTE). FLUTE is based on a precomputed lookup table to make RSMT construction very fast and very accurate for low-degree nets. (The degree of a net is the number of pins in the net.) For high-degree nets, a net-breaking technique is proposed to reduce the net size until the table can be used. A scheme is also presented to allow users to control the tradeoff between accuracy and runtime. FLUTE is optimal for low-degree nets (up to degree 9 in our current implementation) and is still very accurate for nets up to degree 100. Therefore, it is particularly suitable for very large scale integration applications in which most nets have a degree of 30 or less. We show experimentally that, over 18 industrial circuits in the ISPD98 benchmark suite, FLUTE with default accuracy is more accurate than the Batched 1-Steiner heuristic and is almost as fast as a very efficient implementation of Prim's rectilinear minimum spanning tree algorithm.
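
To make the lookup-plus-net-breaking strategy concrete, here is a toy sketch of our own, not FLUTE itself: nets of degree 3 or less are solved exactly (for these, the RSMT length equals the half-perimeter wirelength, which stands in for the paper's table lookup), and larger nets are split at a shared pin.

```python
# Toy rendition of the lookup-plus-net-breaking strategy (not FLUTE itself).
# For degree <= 3 the RSMT length equals the half-perimeter wirelength (HPWL),
# which stands in here for the paper's optimal lookup-table result; larger
# nets are broken at a pin shared by both halves.

def hpwl(pins):
    xs, ys = zip(*pins)
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def rsmt_estimate(pins, max_exact_degree=3):
    if len(pins) <= max_exact_degree:
        return hpwl(pins)                  # exact for degree <= 3
    pins = sorted(pins)                    # net breaking: split at the median-x
    mid = len(pins) // 2                   # pin, which both halves share, and
    return (rsmt_estimate(pins[:mid + 1])  # estimate each half recursively
            + rsmt_estimate(pins[mid:]))   # (heuristic, no longer optimal)

print(rsmt_estimate([(0, 0), (4, 1), (2, 5), (6, 3), (1, 4)]))   # -> 15
```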

245 citations


Journal ArticleDOI
TL;DR: This paper considers a local optimization technique based on templates to simplify and reduce the depth of nonoptimal quantum circuits and shows how templates can be used to compact the number of levels of a quantum circuit.
Abstract: Quantum circuits are time-dependent diagrams describing the process of quantum computation. Usually, a quantum algorithm must be mapped into a quantum circuit. Optimal synthesis of quantum circuits is intractable, and heuristic methods must be employed. With the use of heuristics, the optimality of circuits is no longer guaranteed. In this paper, we consider a local optimization technique based on templates to simplify and reduce the depth of nonoptimal quantum circuits. We present and analyze templates in the general case and provide particular details for the circuits composed of NOT, CNOT, and controlled-sqrt-of-NOT gates. We apply templates to optimize various common circuits implementing multiple control Toffoli gates and quantum Boolean arithmetic circuits. We also show how templates can be used to compact the number of levels of a quantum circuit. The runtime of our implementation is small, whereas the reduction in the number of quantum gates and number of levels is significant.
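
The smallest template is the identity formed by applying a self-inverse gate twice; the sketch below implements only that degenerate case, with a gate encoding of our own, while the paper's templates cover longer identity sequences and gate reordering.

```python
# Minimal template-based simplification (sketch): the smallest template says a
# self-inverse gate applied twice, e.g., NOT or CNOT, realizes the identity.
# The paper's templates generalize this to longer identity sequences and to
# moving gates past ones they commute with. The gate encoding here is ours.

def apply_cancellation_template(circuit):
    """circuit: list of (name, controls, target) gate tuples."""
    out = []
    for gate in circuit:
        if out and out[-1] == gate:   # identical self-inverse gates back to back
            out.pop()                 # ...cancel both (template of size 2)
        else:
            out.append(gate)
    return out

circ = [("CNOT", (0,), 1), ("CNOT", (0,), 1), ("NOT", (), 2)]
print(apply_cancellation_template(circ))   # -> [('NOT', (), 2)]
```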

237 citations


Journal ArticleDOI
TL;DR: In this article, different schemes for clocking and timing of the QCA systems are proposed; these schemes utilize 2D techniques that permit a reduction in the longest line length in each clocking zone.
Abstract: At nanoscale, quantum-dot cellular automata (QCA) defines a new device architecture that permits the innovative design of digital systems. Features of these systems include the allowed crossing of signal lines with different polarization orientations on a Cartesian plane, the potential for high throughput due to efficient pipelining, and fast signal switching and propagation. However, QCA designs of even modest complexity suffer from the negative impact of long lines of cells placed within clocking zones, resulting in increased delay, slow timing, and sensitivity to thermal fluctuations. In this paper, different schemes for clocking and timing of QCA systems are proposed; these schemes utilize 2D techniques that permit a reduction in the longest line length in each clocking zone. The proposed clocking schemes utilize logic-propagation techniques that have been developed for systolic arrays. Placement of QCA cells is modified to ensure correct signal generation and timing. The significant reduction in the longest line length permits fast timing and efficient pipelining while guaranteeing kink-free switching behavior.

197 citations


Journal ArticleDOI
TL;DR: This paper proposes an efficient, proactive, continuously engaged hardware and operating system thermal management technique governed by optimal thermal management policies and finds that proactive power-thermal budgeting allows a 30% improvement in instruction throughput compared to a proactive thermal management approach that bases decisions only upon local information.
Abstract: Three-dimensional integration has the potential to improve the communication latency and integration density of chip-level multiprocessors (CMPs). However, the stacked high-power density layers of 3D CMPs increase the importance and difficulty of thermal management. In this paper, we investigate the 3D CMP run-time thermal management problem and describe efficient management techniques. This paper makes the following main contributions: 1) It identifies and describes the critical concepts required for optimal thermal management, namely the methods by which heterogeneity in both workload power characteristics and processor core thermal characteristics should be exploited; and 2) it proposes an efficient proactive continuously engaged hardware and operating system thermal management technique governed by optimal thermal management policies. The proposed technique is evaluated using multiprogrammed and multithreaded benchmarks in an integrated power, performance, and temperature full-system simulation environment. We find that proactive power-thermal budgeting allows a 30% improvement in instruction throughput compared to a proactive thermal management approach that bases decisions only upon local information. The software components of the proposed thermal management technique have been implemented in the Linux 2.6.8 kernel. This source code will be publicly released. The analysis and technique developed in this paper provide a general solution for future 3D and 2D CMPs.
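
As a loose illustration of the power-thermal budgeting concept, one simple policy (ours, not the authors' controller) splits a chip-wide power budget in proportion to each core's thermal headroom:

```python
# Loose illustration of power-thermal budgeting (a policy of our own, not the
# authors' controller): split a chip-wide power budget across cores in
# proportion to thermal headroom, so cores in cooler positions of the 3D
# stack, e.g., near the heat sink, absorb more of the workload.

def budget_power(total_budget, temps, t_limit):
    headroom = [max(t_limit - t, 0.0) for t in temps]
    total = sum(headroom) or 1.0          # avoid divide-by-zero when saturated
    return [total_budget * h / total for h in headroom]

# Core 0 sits near the heat sink (coolest); core 3 is deep in the stack.
print(budget_power(40.0, temps=[55.0, 62.0, 70.0, 78.0], t_limit=80.0))
```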

190 citations


Journal ArticleDOI
TL;DR: This paper targets real-time applications which are dynamically mapped onto embedded MPSoCs, where communication happens via the Network-on-Chip (NoC) approach, and resources connected to the NoC have multiple voltage levels.
Abstract: Achieving effective run-time mapping on multiprocessor systems-on-chip (MPSoCs) is a challenging task, particularly since the arrival order of the target applications is not known a priori. This paper targets real-time applications which are dynamically mapped onto embedded MPSoCs, where communication happens via the Network-on-Chip (NoC) approach, and resources connected to the NoC have multiple voltage levels. We address precisely the energy- and performance-aware incremental mapping problem for NoCs with multiple voltage levels and propose an efficient technique (consisting of region selection and node allocation) to solve it. Moreover, the proposed technique allows for new applications to be added to the system with minimal interprocessor communication overhead. Experimental results show that the proposed technique is very fast, and as much as 50% communication energy savings can be achieved compared to using an arbitrary allocation scheme.

183 citations


Journal ArticleDOI
TL;DR: An approach to remove halos (free space) around large modules is described, and a method to control the module density is presented; results demonstrate that Kraftwerk2 offers both high quality and excellent computational efficiency.
Abstract: The force-directed quadratic placer "Kraftwerk2," as described in this paper, is based on two main concepts. First, the force that is necessary to distribute the modules on the chip is separated into the following two components: a hold force and a move force. Both components are implemented in a systematic manner. Consequently, Kraftwerk2 converges such that the module overlap is reduced in each placement iteration. The second concept of Kraftwerk2 is to use the "Bound2Bound" net model, which accurately represents the half-perimeter wirelength (HPWL) in the quadratic cost function. Aside from these features, this paper presents additional details about Kraftwerk2. An approach to remove halos (free space) around large modules is described, and a method to control the module density is presented. In order to choose the important tradeoff between runtime and quality, a systematic quality control is shown. Furthermore, plots demonstrating the convergence of Kraftwerk2 are presented. Results using various benchmark suites demonstrate that Kraftwerk2 offers both high quality and excellent computational efficiency.
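
For reference, the Bound2Bound model connects each pin of a $p$-pin net to the net's two boundary pins and picks the weights so that the quadratic cost reproduces the HPWL exactly at the current placement; in the $x$-direction (standard form of the model, stated here for context),

$$
\Gamma_x = \frac{1}{2}\sum_{(i,j)\,\in\,\mathrm{B2B}} w_{ij}\,(x_i - x_j)^2,
\qquad
w_{ij} = \frac{2}{p-1}\cdot\frac{1}{|x_i - x_j|},
$$

so that $\Gamma_x$ evaluates to $\max_i x_i - \min_i x_i$ at the placement where the weights were computed, with an analogous term in the $y$-direction.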

177 citations


Journal ArticleDOI
TL;DR: Results of applications such as circuit sizing, design centering, response surface modeling, and analog placement show the benefits of the sizing rules method.
Abstract: This paper presents the sizing rules method for basic building blocks in analog CMOS and bipolar circuit design. It consists of the development of a hierarchical library of transistor-pair groups as basic building blocks for analog CMOS and bipolar circuits, the derivation of a hierarchical generic list of constraints that must be satisfied to guarantee the function and robustness of each block, and the development of a reliable automatic recognition procedure of building blocks in a circuit schematic. Sizing rules efficiently capture design knowledge on the technology-specific level of transistor-pair groups. This reduces the effort and improves the resulting quality for analog circuit synthesis. Results of applications such as circuit sizing, design centering, response surface modeling, and analog placement show the benefits of the sizing rules method.

Journal ArticleDOI
TL;DR: This paper used ESPAM to automatically generate and program several multiprocessor systems that execute three image processing applications, namely Sobel edge detection, Discrete Wavelet Transform, and Motion JPEG encoder, to validate and evaluate the methodology and techniques implemented.
Abstract: For modern embedded systems in the realm of high-throughput multimedia, imaging, and signal processing, the complexity of embedded applications has reached a point where the performance requirements of these applications can no longer be supported by embedded system architectures based on a single processor. Thus, the emerging embedded system-on-chip platforms are increasingly becoming multiprocessor architectures. As a consequence, two major problems emerge, namely how to design and how to program such multiprocessor platforms in a systematic and automated way in order to reduce the design time and to satisfy the performance needs of applications executed on such platforms. As an efficient solution to these two problems, in this paper, we present the methodology and techniques implemented in a tool called Embedded System-level Platform synthesis and Application Mapping (ESPAM) for automated multiprocessor system design, programming, and implementation. ESPAM moves the design specification and programming from the Register Transfer Level and low-level C to a higher system level of abstraction. We explain how, starting from system-level platform, application, and mapping specifications, a multiprocessor platform is synthesized, programmed, and implemented in a systematic and automated way. The class of multiprocessor platforms we consider is introduced as well. To validate and evaluate our methodology, we used ESPAM to automatically generate and program several multiprocessor systems that execute three image processing applications, namely Sobel edge detection, Discrete Wavelet Transform, and Motion JPEG encoder. The performance of the systems that execute these applications is also presented in this paper.

Journal ArticleDOI
TL;DR: A high-performance droplet router for digital microfluidic biochip (DMFB) design that achieves over 35x and 20x better routability, with comparable timing and fault tolerance, compared to the popular prioritized A* search and the state-of-the-art network-flow-based algorithm, respectively.
Abstract: In this paper, we propose a high-performance droplet router for a digital microfluidic biochip (DMFB) design. Due to recent advancements in the bio-microelectromechanical system and its various applications to clinical, environmental, and military operations, the design complexity and the scale of a DMFB are expected to explode in the near future, thus requiring strong support from CAD as in conventional VLSI design. Among the multiple design stages of a DMFB, droplet routing, which schedules the movement of each droplet in a time-multiplexed manner, is one of the most critical design challenges due to its high complexity as well as its large impact on performance. Our algorithm first routes a droplet with higher bypassibility, i.e., one that is less likely to block the movement of the others. When multiple droplets form a deadlock, our algorithm resolves it by backing off some droplets for concession. The final compaction step further enhances timing as well as fault tolerance by tuning each droplet movement greedily. The experimental results on hard benchmarks show that our algorithm achieves over 35x and 20x better routability, with comparable timing and fault tolerance, compared to the popular prioritized A* search and the state-of-the-art network-flow-based algorithm, respectively.

Journal ArticleDOI
TL;DR: This paper proposes an exact common subexpression elimination algorithm for the optimum sharing of partial terms in multiple constant multiplications (MCMs) and describes how this algorithm can be modified to target the minimum area solution under a user-specified delay constraint.
Abstract: The main contribution of this paper is an exact common subexpression elimination algorithm for the optimum sharing of partial terms in multiple constant multiplications (MCMs). We model this problem as a Boolean network that covers all possible partial terms that may be used to generate the set of coefficients in the MCM instance. We cast this problem into a 0-1 integer linear programming (ILP) by requiring that the single output of this network is asserted while minimizing the number of gates representing operations in the MCM implementation that evaluate to one. A satisfiability (SAT)-based 0-1 ILP solver is used to obtain the exact solution. We argue that for many real problems, the size of the problem is within the capabilities of current SAT solvers. Because performance is often a primary design parameter, we describe how this algorithm can be modified to target the minimum area solution under a user-specified delay constraint. Additionally, we propose an approximate algorithm based on the exact approach with extremely competitive results. We have applied these algorithms on the design of digital filters and present a comprehensive set of results that evaluate ours and existing approximation schemes against exact solutions under different number representations and using different SAT solvers.
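
A two-coefficient toy instance shows what sharing partial terms buys; the example below is solved by inspection, whereas the paper finds such solutions optimally with its SAT-based 0-1 ILP formulation.

```python
# Tiny worked instance of partial-term sharing in MCM (our example, solved by
# inspection; the paper finds such sharings optimally via a SAT-based 0-1 ILP).
# Realizing 13x and 11x with shift-and-add terms:
#   naive:  13x = (x<<3)+(x<<2)+x (2 adders), 11x = (x<<3)+(x<<1)+x (2 adders)
#   shared: 9x = (x<<3)+x, then 13x = 9x+(x<<2), 11x = 9x+(x<<1) -> 3 adders

def mcm_shared(x):
    t9 = (x << 3) + x                     # shared partial term 9x (1 adder)
    return t9 + (x << 2), t9 + (x << 1)   # 13x and 11x (2 more adders)

assert mcm_shared(5) == (65, 55)          # 13*5 and 11*5
```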

Journal ArticleDOI
TL;DR: This work introduces a neural system that is trained not only to predict the pass/fail labels of devices based on a set of low-cost measurements, but also to assess the confidence in this prediction, which sustains the high accuracy of specification testing while leveraging the low cost of machine-learning-based testing.
Abstract: Machine-learning-based test methods for analog/RF devices have been the subject of intense investigation over the last decade. However, despite the significant cost benefits that these methods promise, they have seen a limited success in replacing the traditional specification testing, mainly due to the incurred test error which, albeit small, cannot meet industrial standards. To address this problem, we introduce a neural system that is trained not only to predict the pass/fail labels of devices based on a set of low-cost measurements, as aimed by the previous machine-learning-based test methods, but also to assess the confidence in this prediction. Devices for which this confidence is insufficient are then retested through the more expensive specification testing in order to reach an accurate test decision. Thus, this two-tier test approach sustains the high accuracy of specification testing while leveraging the low cost of machine-learning-based testing. In addition, by varying the desired level of confidence, it enables the exploration of the tradeoff between test cost and test accuracy and facilitates the development of cost-effective test plans. We discuss the structure and the training algorithm of an ontogenic neural network which is embodied in the neural system in the first tier, as well as the extraction of appropriate measurements such that only a small fraction of devices are funneled to the second tier. The proposed test-error-moderation method is demonstrated on a switched-capacitor filter and an ultrahigh-frequency receiver front end.
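
The two-tier decision flow reduces to a few lines; in this sketch, `predict` and `spec_test` are hypothetical stand-ins for the paper's ontogenic neural network and the conventional specification test.

```python
# Sketch of the two-tier flow: a cheap ML label with a confidence score, with
# fallback to full specification testing when confidence is too low.
# `predict` and `spec_test` are hypothetical stand-ins for the paper's
# ontogenic neural network and the conventional specification test.

def two_tier_test(devices, predict, spec_test, confidence_min=0.5):
    labels = []
    for d in devices:
        label, confidence = predict(d)   # tier 1: low-cost measurements + ML
        if confidence < confidence_min:
            label = spec_test(d)         # tier 2: expensive specification test
        labels.append(label)
    return labels

# Toy usage: a fake one-measurement predictor, unsure near the 0.5 boundary.
predict = lambda d: (d > 0.5, min(1.0, 2 * abs(d - 0.5)))
spec_test = lambda d: d > 0.5
print(two_tier_test([0.1, 0.49, 0.9], predict, spec_test))   # -> [False, False, True]
```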

Journal ArticleDOI
TL;DR: A new CSE algorithm using the binary representation of coefficients to implement higher order FIR filters with fewer adders than CSD-based CSE methods is presented, showing that the CSE method is more efficient in reducing the number of adders needed to realize the multipliers when the filter coefficients are represented in binary form.
Abstract: The complexity of linear-phase finite-impulse-response (FIR) filters is dominated by the complexity of coefficient multipliers. The number of adders (subtractors) used to implement the multipliers determines the complexity of the FIR filters. It is well known that common subexpression elimination (CSE) methods based on canonical signed digit (CSD) coefficients reduce the number of adders required in the multipliers of FIR filters. A new CSE algorithm using the binary representation of coefficients for the implementation of higher order FIR filters with fewer adders than CSD-based CSE methods is presented in this paper. We show that the CSE method is more efficient in reducing the number of adders needed to realize the multipliers when the filter coefficients are represented in binary form. Our observation is that the number of unpaired bits (bits that do not form common subexpressions) is considerably smaller for binary coefficients than for CSD coefficients, particularly for higher order FIR filters. As a result, the proposed binary-coefficient-based CSE method offers a good reduction in the number of adders in realizing higher order filters. The reduction of adders is achieved without much increase in the critical path length of the filter coefficient multipliers. Design examples of FIR filters show that our method offers an average adder reduction of 18% over the best known CSE method, without any increase in logic depth.

Journal ArticleDOI
TL;DR: This work studies the theoretical aspects of the problem of practically realizing an abstract quantum circuit on quantum hardware and presents empirical results that match the best known solutions developed by experimentalists.
Abstract: We study the problem of the practical realization of an abstract quantum circuit executed on quantum hardware. By practical, we mean adapting the circuit to the particulars of the physical environment, which restrict or complicate the establishment of certain direct interactions between qubits. This is a quantum version of the classical circuit placement problem. We study the theoretical aspects of the problem and also present empirical results that match the best known solutions that have been developed by experimentalists. Finally, we discuss the efficiency of the approach and the scalability of its implementation with regard to the future development of quantum hardware.

Journal ArticleDOI
TL;DR: This paper proposes systematic techniques for determining the optimal locations for thermal sensors to provide high-fidelity thermal monitoring of a complex microprocessor system and presents a hybrid algorithm that shows the tradeoffs associated with number of sensors and expected accuracy.
Abstract: High-performance microprocessor families employ dynamic-thermal-management techniques to cope with the increasing thermal stress resulting from peaking power densities. These techniques operate on feedback generated from on-die thermal sensors. The allocation and the placement of thermal-sensing elements directly impact the effectiveness of the dynamic management mechanisms. In this paper, we propose systematic techniques for determining the optimal locations for thermal sensors to provide high-fidelity thermal monitoring of a complex microprocessor system. Our strategies can be divided into two main categories: uniform sensor allocation and nonuniform sensor allocation. In the uniform approach, the sensors are placed on a regular grid. The nonuniform allocation identifies an optimal physical location for each sensor such that the sensor's attraction toward steep thermal gradients is maximized, which can result in uneven concentrations of sensors on different locations of the chip. We also present a hybrid algorithm that shows the tradeoffs associated with number of sensors and expected accuracy. Our experimental results show that our uniform approach using interpolation can detect the chip temperature with a maximum error of 5.47 °C and an average maximum error of 1.05 °C. On the other hand, our nonuniform strategy is able to create a sensor distribution for a given microprocessor architecture, providing thermal measurements with a maximum error of 3.18 °C and an average maximum error of 1.63 °C across a wide set of applications.
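
The uniform-grid variant pairs naturally with interpolation; below is a minimal bilinear-interpolation sketch with made-up sensor readings, estimating the temperature between four neighboring grid sensors.

```python
# Sketch of the uniform approach: sensors on a regular grid plus bilinear
# interpolation between the four surrounding sensors (grid readings and query
# point below are made up for illustration).

def bilinear(grid, x, y):
    """grid[i][j] = sensor reading at grid point (i, j); x, y in grid units."""
    i, j = int(x), int(y)
    fx, fy = x - i, y - j
    return (grid[i][j] * (1 - fx) * (1 - fy)
            + grid[i + 1][j] * fx * (1 - fy)
            + grid[i][j + 1] * (1 - fx) * fy
            + grid[i + 1][j + 1] * fx * fy)

grid = [[50.0, 52.0], [54.0, 60.0]]   # 2x2 block of sensor readings, in °C
print(bilinear(grid, 0.5, 0.5))       # -> 54.0, estimate between all four
```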

Journal ArticleDOI
Nishant Patil, Jie Deng, Albert Lin, Hon-Sum Philip Wong, Subhasish Mitra
TL;DR: This paper presents a technique for designing arbitrary logic functions using CNFET circuits that are guaranteed to implement correct functions even in the presence of a large number of misaligned and mispositioned CNTs.
Abstract: Carbon-nanotube (CNT) field-effect transistors (CNFETs) are promising extensions to silicon CMOS. Simulations show that CNFET inverters fabricated with a perfect CNFET technology have 13 times better energy delay product compared with 32-nm silicon CMOS inverters. The following two fundamental challenges prevent the fabrication of CNFET circuits with the aforementioned advantages: 1) misaligned and mispositioned CNTs and 2) metallic CNTs. Misaligned and mispositioned CNTs can cause incorrect functionality. This paper presents a technique for designing arbitrary logic functions using CNFET circuits that are guaranteed to implement correct functions even in the presence of a large number of misaligned and mispositioned CNTs. Experimental demonstration of misaligned and mispositioned CNT-immune logic structures is also presented.

Journal ArticleDOI
TL;DR: Significant improvements to core routing technologies are described that outperform the best results from the International Symposium on Physical Design 2007 Global Routing Contest and the International Conference on Computer-Aided Design 2007 in terms of route completion and total wirelength.
Abstract: In this paper, we describe significant improvements to core routing technologies and outperform the best results from the International Symposium on Physical Design 2007 Global Routing Contest and the International Conference on Computer-Aided Design 2007 in terms of route completion and total wirelength.

Journal ArticleDOI
Michael D. Moffitt
TL;DR: MaizeRouter reflects a significant leap in progress over existing publicly available routing tools yet relies upon relatively simple operations, including extreme edge shifting, a technique aimed primarily at the efficient reduction of routing congestion, and edge retraction, a counterpart to extreme edge shifting that serves to reduce unnecessary wirelength.
Abstract: In this paper, we present the complete design and architectural details of MaizeRouter. MaizeRouter reflects a significant leap in progress over existing publicly available routing tools yet relies upon relatively simple operations, including extreme edge shifting, a technique aimed primarily at the efficient reduction of routing congestion, and edge retraction, a counterpart to extreme edge shifting that serves to reduce unnecessary wirelength. We present enhanced variations of these operations to enable the rapid exploration of candidate paths, along with a form of dynamic cost deflation that provides our various path computation procedures with progressively more accurate (and less optimistic) cost information as search continues. These algorithmic contributions are built upon a framework of interdependent net decomposition, a representation that improves upon traditional two-pin net decomposition by preventing duplication of routing resources while enabling cheap and incremental topological reconstruction. Collectively, these operations permit a broad search space that previous algorithms have been unable to achieve, resulting in solutions of considerably higher quality than those of well-established routers.

Journal ArticleDOI
TL;DR: The first network-flow-based routing algorithm that can concurrently route a set of noninterfering nets for the droplet routing problem on biochips is presented, along with the first polynomial-time algorithm for simultaneous routing and scheduling using the global-routing paths with a negotiation-based routing scheme.
Abstract: Due to recent advances in microfluidics, digital microfluidic biochips are expected to revolutionize laboratory procedures. One critical problem for biochip synthesis is the droplet routing problem. Unlike traditional very large scale integration routing problems, in addition to routing path selection, the biochip routing problem needs to address the issue of scheduling droplets under practical constraints imposed by the fluidic property and timing restriction of synthesis results. In this paper, we present the first network-flow-based routing algorithm that can concurrently route a set of noninterfering nets for the droplet routing problem on biochips. We adopt a two-stage technique of global routing followed by detailed routing. In global routing, we first identify a set of noninterfering nets and then adopt the network-flow approach to generate optimal global-routing paths for nets. In detailed routing, we present the first polynomial-time algorithm for simultaneous routing and scheduling using the global-routing paths with a negotiation-based routing scheme. Our algorithm targets both the minimization of the number of cells used for routing, for better fault tolerance, and the minimization of droplet transportation time, for better reliability and faster bioassay execution. Experimental results show the robustness and efficiency of our algorithm.

Journal ArticleDOI
TL;DR: A layout-aware solution for analog cells that tackles both geometric and parasitic-aware electrical synthesis is proposed and several design examples are provided.
Abstract: In analog integrated circuit design, iterations between electrical and physical syntheses to counterbalance layout-induced performance degradations should be avoided as much as possible. One possible solution involves the integration of traditionally separated electrical and physical synthesis phases by including layout-induced effects right into the electrical synthesis phase in what has been called parasitic-aware synthesis. This solution, as such, is not yet complete since there are geometric requirements (minimization of area or fulfillment of certain layout aspect ratio, among others) whose effects on the resulting parasitics are not usually considered during the electrical synthesis. In this paper, a layout-aware solution for analog cells that tackles both geometric and parasitic-aware electrical synthesis is proposed. Several design examples are provided.

Journal ArticleDOI
TL;DR: An efficient fully automatic approach to fault localization for safety properties stated in linear temporal logic is presented by solving the satisfiability of a propositional Boolean formula using the proper decision heuristics and simulation-based preprocessing.
Abstract: We present an efficient fully automatic approach to fault localization for safety properties stated in linear temporal logic. We view the failure as a contradiction between the specification and the actual behavior and look for components that explain this discrepancy. We find these components by solving the satisfiability of a propositional Boolean formula. We show how to construct this formula and how to extend it so that we find exactly those components that can be used to repair the circuit for a given set of counterexamples. Furthermore, we discuss how to efficiently solve the formula by using the proper decision heuristics and simulation-based preprocessing. We demonstrate the quality and efficiency of our approach by experimental results.

Journal ArticleDOI
TL;DR: HybDTM is proposed, a system-level framework for doing fine-grained coordinated thermal management using a hybrid of hardware techniques and software techniques, leveraging the advantages of both approaches in a synergistic fashion, to manage the overall temperature with minimal overhead using the operating system (OS) support.
Abstract: Thermal issues are fast becoming major design constraints in high-performance systems. Temperature variations adversely affect system reliability and prompt worst-case design. In recent history, researchers have proposed dynamic thermal-management (DTM) techniques targeting average-case design and tackling the temperature issue at runtime. While past work on DTM has focused on different techniques in isolation, it fails to consider a system-level approach which uses both hardware and software support in a synergistic fashion and hence leads to a significant execution-time overhead. In this paper, we propose HybDTM, a system-level framework for doing fine-grained coordinated thermal management using a hybrid of hardware techniques (like clock gating) and software techniques (like thermal-aware process scheduling), leveraging the advantages of both approaches in a synergistic fashion. We show that while hardware techniques can be used reactively to manage the overall temperature in case of thermal emergencies, proactive use of software techniques can build on top of it to balance the overall thermal profile with minimal overhead using the operating system (OS) support. In order to evaluate our proposed hybrid-DTM policy, we develop a novel regression-based thermal model, providing fast and accurate temperature estimates to do runtime thermal characterization of all applications running on the system, using hardware performance counters available in modern high-performance processors alongside thermal sensors for training the model at runtime. Our model is validated against actual temperature measurements from online thermal sensors, with the average estimation error found to be less than 5%. We also study system-level DTM issues, jointly considering both the processor and memory, and show how a unified DTM approach can benefit from global knowledge of individual system components. We evaluate our proposed methodology on a desktop system with an Intel Pentium-4 processor and a modified Linux OS, running a number of SPEC2000 benchmarks, in both uniprocessor and simultaneous multithreaded environments and show that our proposed technique is able to successfully manage the overall temperature with an average execution-time overhead of only 10.4% (20.1% maximum) compared to the case without any DTM, as opposed to 23.9% (46% maximum) overhead for purely hardware-based DTM. Our system, including the thermal-aware OS, built-in runtime thermal-characterization model, and interface to the underlying hardware using the Pentium-4 processor, is ready for release.
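
A stripped-down sketch of the regression idea follows, with made-up counter values and temperatures; the paper's counter set and model are richer and are trained online against the real sensors.

```python
# Stripped-down version of the regression idea (made-up counter values and
# temperatures; the paper's counter set and model are richer and trained
# online against the real sensors).
import numpy as np

# Each row: [IPC, L2 misses/kilo-instr, bus transactions/kilo-instr].
counters = np.array([[1.8, 2.0, 0.5],
                     [0.9, 9.0, 2.1],
                     [1.2, 5.0, 1.0],
                     [2.1, 1.0, 0.3]])
temps = np.array([68.0, 55.0, 60.0, 71.0])   # sensor readings as training targets

X = np.hstack([counters, np.ones((len(counters), 1))])   # add an intercept term
coef, *_ = np.linalg.lstsq(X, temps, rcond=None)         # least-squares fit

print(np.round(X @ coef, 1))   # model estimates, usable between sensor reads
```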

Journal ArticleDOI
TL;DR: This paper presents a new approach to combine intra- and intertask voltage scheduling for better energy savings in hard real-time systems with uncertain task execution time, and shows that the new approach can save more energy than existing solutions while meeting hard deadlines.
Abstract: Dynamic voltage and frequency scaling can save energy for real-time systems. Frequencies are generally assumed proportional to voltages. Previous studies consider the probabilistic distributions of tasks' execution time to assist dynamic voltage scaling in task scheduling. These studies use probability information for intratask voltage scheduling but do not sufficiently explore the opportunities for intertask scheduling to save more energy. This paper presents a new approach to combine intra- and intertask voltage scheduling for better energy savings in hard real-time systems with uncertain task execution time. Our approach takes three steps: 1) We calculate statistically the optimal voltage schedules for multiple concurrent tasks, using earliest deadline first scheduling for an ideal processor that can change the frequency continuously; 2) we then adapt the solution to a processor with a limited range of discrete frequencies, using a polynomial-time heuristic algorithm; and 3) finally, we improve our solution, considering the time and energy overheads of frequency switching for schedulability and energy reduction. Our simulation shows that the new approach can save more energy than existing solutions while meeting hard deadlines.
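
For context, step 1's ideal-processor setting has a classical baseline: for periodic tasks with worst-case execution times $C_i$ and periods $T_i$ under EDF on a continuously scalable processor, the lowest constant normalized speed that still meets every deadline is the total utilization

$$
f^{*} = \sum_{i=1}^{n} \frac{C_i}{T_i},
$$

and the statistical schedules computed here exploit the fact that actual execution times usually fall short of the worst case, allowing the processor to run below this bound most of the time.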

Journal ArticleDOI
TL;DR: The synthesis approach provides multiple sets of layout parameters that help a designer in the tradeoff analysis between conflicting objectives, such as area, Q, and SRF for a target-inductance value.
Abstract: This paper presents an efficient layout-level synthesis approach for RF planar on-chip spiral inductors. A spiral inductor is modeled using artificial neural networks in which the layout design parameters, namely, spiral outer diameter, number of turns, width of metal traces, and metal spacing, are taken as input. Inductance, quality factor (Q), and self-resonance frequency (SRF) form the output of the neural model. Particle-swarm optimization is used to explore the layout space to achieve a given target inductance meeting the SRF and other constraints. Our synthesis approach provides multiple sets of layout parameters that help a designer in the tradeoff analysis between conflicting objectives, such as area, Q, and SRF for a target-inductance value. We present several synthesis results which show good accuracy with respect to full-wave electromagnetic (EM) simulations. Since the proposed procedure does not require an EM simulation in the synthesis loop, it substantially reduces the cycle time in RF-circuit design optimization.
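
The particle-swarm step can be sketched in a few lines; here the objective is a stand-in function of our own, whereas the paper's objective queries the trained neural model of inductance, Q, and SRF.

```python
# Minimal particle-swarm-optimization loop over layout parameters
# (illustrative; in the paper the objective queries the trained neural model
# of inductance, Q, and SRF, which we replace with a stand-in function here).
import random

def pso(objective, bounds, n_particles=20, iters=100, w=0.7, c1=1.5, c2=1.5):
    dim = len(bounds)
    pos = [[random.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best = [p[:] for p in pos]                  # per-particle best positions
    gbest = min(best, key=objective)            # swarm-wide best position
    for _ in range(iters):
        for k in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                vel[k][d] = (w * vel[k][d]
                             + c1 * r1 * (best[k][d] - pos[k][d])
                             + c2 * r2 * (gbest[d] - pos[k][d]))
                lo, hi = bounds[d]              # keep parameters in layout range
                pos[k][d] = min(max(pos[k][d] + vel[k][d], lo), hi)
            if objective(pos[k]) < objective(best[k]):
                best[k] = pos[k][:]
        gbest = min(best + [gbest], key=objective)
    return gbest

# Stand-in objective: distance of a fake predicted inductance from a 2 nH
# target over two layout parameters (a real flow would call the ANN and also
# check the Q and SRF constraints).
objective = lambda p: abs(p[0] * p[1] * 0.1 - 2.0)
print(pso(objective, bounds=[(1.0, 10.0), (1.0, 10.0)]))
```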

Journal ArticleDOI
TL;DR: This paper presents a technique that applies structural knowledge about the circuit during the transformation into conjunctive normal form for Boolean SAT solvers and shows that both the size of the problem instances and the run time of the ATPG process decrease.
Abstract: Due to the rapidly growing size of integrated circuits, there is a need for new algorithms for automatic test pattern generation (ATPG). While classical algorithms reach their limit, there have been recent advances in algorithms to solve Boolean Satisfiability (SAT). Because Boolean SAT solvers are working on conjunctive normal forms (CNFs), the problem has to be transformed. During transformation, relevant information about the problem might get lost and, therefore, is not available in the solving process. In this paper, we present a technique that applies structural knowledge about the circuit during the transformation. As a result, the size of the problem instances decreases, as well as the run time of the ATPG process. The technique was implemented, and experimental results are presented. The approach was combined with the ATPG framework of NXP Semiconductors. It is shown that the overall performance of an industrial framework can be significantly improved. Further experiments show the benefits with regard to the efficiency and robustness of the combined approach.

Journal ArticleDOI
TL;DR: Algorithms for automated macromodeling of nonlinear mixed-signal system blocks using piecewise-polynomial (PWP) representations and a novel technique that combines concepts from proper orthogonal decomposition with Krylov-subspace projection are presented.
Abstract: We present algorithms for automated macromodeling of nonlinear mixed-signal system blocks. A key feature of our methods is that they automate the generation of general-purpose macromodels that are suitable for a wide range of time- and frequency-domain analyses important in mixed-signal design flows. In our approach, a nonlinear circuit or system is approximated using piecewise-polynomial (PWP) representations. Each polynomial system is reduced to a smaller one via weakly nonlinear polynomial model-reduction methods. Our approach, dubbed PWP, generalizes recent trajectory-based piecewise-linear approaches and ties them with polynomial-based model-order reduction, which inherently captures stronger nonlinearities within each region. PWP-generated macromodels not only reproduce small-signal distortion and intermodulation properties well but also retain fidelity in large-signal transient analyses. The reduced models can be used as drop-in replacements for large subsystems to achieve fast system-level simulation using a variety of time- and frequency-domain analyses (such as dc, ac, transient, harmonic balance, etc.). For the polynomial reduction step within PWP, we also present a novel technique [dubbed multiple pseudoinput (MPI)] that combines concepts from proper orthogonal decomposition with Krylov-subspace projection. We illustrate the use of PWP and MPI with several examples (including op-amps and I/O buffers) and provide important implementation details. Our experiments indicate that it is easy to obtain speedups of about an order of magnitude with push-button nonlinear macromodel-generation algorithms.

Journal ArticleDOI
TL;DR: A polynomial-time algorithm which first generates a net order and then performs layer assignment one net at a time according to the order using dynamic programming is proposed, which is guaranteed to generate a layer assignment solution satisfying the given congestion constraints.
Abstract: In this paper, we study the problem of layer assignment for via minimization, which arises during multilayer global routing. In addressing this problem, we take the total overflow and the maximum overflow as the congestion constraints from a given one-layer global routing solution and aim to find a layer assignment result for each net such that the via cost is minimized while the given congestion constraints are satisfied. To solve the problem, we propose a polynomial-time algorithm which first generates a net order and then performs layer assignment one net at a time according to the order using dynamic programming. Our algorithm is guaranteed to generate a layer assignment solution satisfying the given congestion constraints. We used the six-layer benchmarks released from the ISPD'07 global routing contest to test our algorithm. The experimental results show that our algorithm was able to improve the contest results of the top three winners MaizeRouter, BoxRouter, and FGR on each benchmark. As compared to BoxRouter 2.0 and FGR 1.1, which are newer versions of BoxRouter and FGR, our algorithm respectively produced smaller via costs on all benchmarks and half the benchmarks. Our algorithm can also be adapted to refine a given multilayer global routing solution in a net-by-net manner, and the experimental results show that this refinement approach improved the via costs on all benchmarks for FGR 1.1.
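
To make the per-net dynamic program concrete, here is a simplified sketch of our own: one net is a chain of segments, each with a set of capacity-feasible layers, and each layer change between consecutive segments costs one via; the paper's cost model and congestion constraints are richer.

```python
# Toy per-net layer assignment by dynamic programming (illustrative, not the
# paper's algorithm): a net is a chain of segments, allowed[i] is the set of
# layers segment i may use after congestion checks, and each layer change
# between consecutive segments costs one via (really a via stack).

def assign_layers(allowed, via_cost=1):
    INF = float("inf")
    layers = sorted(set().union(*allowed))
    cost = {l: (0 if l in allowed[0] else INF) for l in layers}
    back = []
    for seg in allowed[1:]:
        step, new = {}, {}
        for l in layers:
            if l not in seg:
                new[l] = INF
                continue
            p = min(cost, key=lambda q: cost[q] + (0 if q == l else via_cost))
            new[l] = cost[p] + (0 if p == l else via_cost)
            step[l] = p
        back.append(step)
        cost = new
    end = min(cost, key=cost.get)
    path = [end]
    for step in reversed(back):           # backtrack the cheapest assignment
        path.append(step[path[-1]])
    return path[::-1], cost[end]

# Congestion forbids layers 1 and 2 for the outer segments, forcing two vias.
print(assign_layers([{0}, {1, 2}, {0}]))   # -> ([0, 1, 0], 2)
```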