
Showing papers in "ACM Transactions on Design Automation of Electronic Systems in 2005"


Journal ArticleDOI
TL;DR: A detailed and flexible power model, integrated into the widely used Versatile Place and Route (VPR) CAD tool, is described; it estimates the dynamic, short-circuit, and leakage power consumed by FPGAs.
Abstract: Power has become a critical issue for field-programmable gate array (FPGA) vendors. Understanding the power dissipation within FPGAs is the first step in developing power-efficient architectures and computer-aided design (CAD) tools for FPGAs. This article describes a detailed and flexible power model which has been integrated into the widely used Versatile Place and Route (VPR) CAD tool. This power model estimates the dynamic, short-circuit, and leakage power consumed by FPGAs. It is the first flexible power model developed to evaluate architectural tradeoffs and the efficiency of power-aware CAD tools for a variety of FPGA architectures, and is freely available for noncommercial use. The model is flexible, in that it can estimate the power for a wide variety of FPGA architectures, and it is fast, in that it does not require extensive simulation, meaning it can be used to explore a large architectural space. We show how the model can be used to investigate the impact of various architectural parameters on the energy consumed by the FPGA, focusing on the segment length, switch block topology, lookup-table size, and cluster size.

187 citations
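As a rough illustration of the kind of estimate such a power model aggregates, the sketch below sums the textbook dynamic-power term 0.5·C·Vdd²·f·α over a few nets. The capacitances, supply voltage, clock frequency, and switching activities are illustrative assumptions, not values taken from the VPR power model.

```python
# Minimal sketch of per-net dynamic power estimation (0.5 * C * Vdd^2 * f * alpha),
# the dominant term an FPGA power model sums over all nets. All numbers below are
# illustrative assumptions, not figures from the VPR power model.

def dynamic_power(c_load, vdd, freq, activity):
    """Average dynamic power of one net in watts."""
    return 0.5 * c_load * vdd ** 2 * freq * activity

nets = [
    # (load capacitance in farads, switching activity per cycle)
    (120e-15, 0.20),  # short routing segment
    (480e-15, 0.10),  # long routing segment
    (60e-15, 0.35),   # LUT input pin
]

vdd, freq = 1.8, 100e6
total = sum(dynamic_power(c, vdd, freq, a) for c, a in nets)
print(f"estimated dynamic power: {total * 1e6:.2f} uW")
```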


Journal ArticleDOI
TL;DR: This article mixes two encoding techniques to reduce test data volume, test pattern delivery time, and power dissipation in scan test applications by using run-length encoding followed by Huffman encoding.
Abstract: This article mixes two encoding techniques to reduce test data volume, test pattern delivery time, and power dissipation in scan test applications. This is achieved by using run-length encoding followed by Huffman encoding. This combination is especially effective when the percentage of don't cares in a test set is high, which is a common case in today's large systems-on-chips (SoCs). Our analysis and experimental results confirm that achieving up to an 89% compression ratio and a 93% scan-in power reduction is possible for scan-testable circuits such as ISCAS89 benchmarks.

139 citations
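To make the two-stage idea concrete, here is a small, hedged sketch: don't-care bits are filled with 0 to lengthen the runs, the vector is run-length encoded as lengths of 0-runs, and the run lengths are then Huffman-coded. The fill policy, run alphabet, and test cubes are illustrative assumptions, not the paper's exact encoding.

```python
# Sketch of run-length encoding followed by Huffman coding of the run lengths.
# Assumptions for illustration: 'X' (don't care) is filled with '0', and the
# symbols being Huffman-coded are the lengths of 0-runs terminated by a '1'.
import heapq
from collections import Counter
from itertools import count

def zero_runs(bits):
    """Lengths of 0-runs, each terminated by a '1' (a trailing run may be open)."""
    runs, run = [], 0
    for b in bits:
        if b == '1':
            runs.append(run)
            run = 0
        else:
            run += 1
    if run:
        runs.append(run)
    return runs

def huffman_code(symbols):
    """Map each symbol to a prefix-free codeword based on its frequency."""
    tie = count()  # tie-breaker so the heap never compares dicts
    heap = [(f, next(tie), {s: ''}) for s, f in Counter(symbols).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: '0' + w for s, w in c1.items()}
        merged.update({s: '1' + w for s, w in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

cubes = ["0XX0000X10000000X01", "X000110XXX000001000X", "00X1000000X00010001X"]
bits = ''.join(c.replace('X', '0') for c in cubes)   # fill don't cares with 0
runs = zero_runs(bits)
code = huffman_code(runs)
encoded = ''.join(code[r] for r in runs)
print(f"{len(bits)} scan bits -> {len(encoded)} encoded bits (runs: {runs})")
```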


Journal ArticleDOI
TL;DR: The essence of the new approach is the addition of a set of design and timing constraints that encodes the author's signature; the resulting signature data is highly resilient, difficult to detect and remove, yet easy to verify, and can be embedded in designs with very low hardware overhead.
Abstract: We introduce dynamic watermarking techniques for protecting the value of intellectual property of CAD and compilation tools and reusable design components. The essence of the new approach is the addition of a set of design and timing constraints which encodes the author's signature. The constraints are selected in such a way that they result in a minimal hardware overhead while embedding a unique signature that is difficult to remove and forge. Techniques are applicable in conjunction with an arbitrary behavioral synthesis task such as scheduling, assignment, allocation, transformation, and template matching. On a large set of design examples, studies indicate the effectiveness of the new approach that results in signature data that is highly resilient, difficult to detect and remove, and yet is easy to verify and can be embedded in designs with very low hardware overhead. For example, the probability that the same design with the embedded signature is obtained by any other designers by themselves is less than 1 in 10^102, and no register overhead was incurred. The probability of tampering, the probability that part of the embedded signature can be removed by random attempts, is shown to be extremely low, and the watermark is additionally protected from such tampering with error-correcting codes.

137 citations


Journal ArticleDOI
TL;DR: The polynomial-time algorithm serves as the basis for a highly efficient novel heuristic for the NP-hard version of the problem, which makes use of problem-specific knowledge, and can thus find high-quality solutions rapidly.
Abstract: One of the most crucial steps in the design of embedded systems is hardware/software partitioning, that is, deciding which components of the system should be implemented in hardware and which ones in software. Most formulations of the hardware/software partitioning problem are NP-hard, so the majority of research efforts on hardware/software partitioning have focused on developing efficient heuristics. This article considers the combinatorial structure behind hardware/software partitioning. Two similar versions of the partitioning problem are defined, one of which turns out to be NP-hard, whereas the other one can be solved in polynomial time. This helps in understanding the real cause of complexity in hardware/software partitioning. Moreover, the polynomial-time algorithm serves as the basis for a highly efficient novel heuristic for the NP-hard version of the problem. Unlike general-purpose heuristics such as genetic algorithms or simulated annealing, this heuristic makes use of problem-specific knowledge, and can thus find high-quality solutions rapidly. Moreover, it has the unique characteristic that it also calculates lower bounds on the optimum solution. It is demonstrated on several benchmarks and also large random examples that the new algorithm clearly outperforms other heuristics that are generally applied to hardware/software partitioning.

102 citations
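To make the combinatorial structure tangible, the toy sketch below enumerates all hardware/software assignments of a four-node task graph and keeps the cheapest one that meets a software-time budget. The cost model, weights, and budget are made-up illustrations of the general problem, not the article's formulation or its heuristic.

```python
# Brute-force illustration of hardware/software partitioning: each task goes to
# HW or SW; minimize hardware area subject to a bound on software time plus
# communication across the cut. All costs are made-up example values.
from itertools import product

hw_cost = {'a': 5, 'b': 8, 'c': 3, 'd': 6}               # area if mapped to HW
sw_time = {'a': 4, 'b': 9, 'c': 2, 'd': 7}               # time if mapped to SW
comm = {('a', 'b'): 2, ('b', 'c'): 1, ('c', 'd'): 3}     # penalty if edge is cut
TIME_BUDGET = 12

def evaluate(assign):
    area = sum(hw_cost[v] for v, side in assign.items() if side == 'HW')
    time = sum(sw_time[v] for v, side in assign.items() if side == 'SW')
    time += sum(w for (u, v), w in comm.items() if assign[u] != assign[v])
    return area, time

best = None
for sides in product(('HW', 'SW'), repeat=len(hw_cost)):
    assign = dict(zip(hw_cost, sides))
    area, time = evaluate(assign)
    if time <= TIME_BUDGET and (best is None or area < best[0]):
        best = (area, time, assign)

print(best)  # minimum-area partition that meets the time budget
```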


Journal ArticleDOI
TL;DR: This tutorial summarizes results from recent optimality and scalability studies of existing placement tools, and highlights the recent progress on large-scale circuit placement, including techniques for wirelength minimization, routability optimization, and performance optimization.
Abstract: Placement is one of the most important steps in the RTL-to-GDSII synthesis process, as it directly defines the interconnects, which have become the bottleneck in circuit and system performance in deep submicron technologies. The placement problem has been studied extensively in the past 30 years. However, recent studies show that existing placement solutions are surprisingly far from optimal. The first part of this tutorial summarizes results from recent optimality and scalability studies of existing placement tools. These studies show that the results of leading placement tools from both industry and academia may be up to 50% to 150% away from optimal in total wirelength. If such a gap can be closed, the corresponding performance improvement will be equivalent to several technology-generation advancements. The second part of the tutorial highlights the recent progress on large-scale circuit placement, including techniques for wirelength minimization, routability optimization, and performance optimization.

77 citations


Journal ArticleDOI
TL;DR: A generic reconfigurable online event-based NoC monitoring service, based on hardware probes attached to NoC components, offering run-time observability of NoC behavior and supporting system-level debugging is proposed.
Abstract: Networks on chip (NoCs) are a scalable interconnect solution for multiprocessor systems on chip. We propose a generic reconfigurable online event-based NoC monitoring service, based on hardware probes attached to NoC components, offering run-time observability of NoC behavior and supporting system-level debugging. We present a probe architecture, its programming model, traffic management strategies, and a cost analysis. We prove feasibility via a prototype implementation for the AEthereal NoC. Two MPEG NoC examples show that the monitoring service area, without advanced optimizations, is 17--24% of the NoC area. Two realistic monitoring examples show that monitoring traffic is several orders of magnitude lower than the 2GB/s/link raw bandwidth.

58 citations


Journal ArticleDOI
TL;DR: This work shows how to place macros consistently with large numbers of small standard cells and addresses the computational difficulty of layout problems involving large macros and numerous small logic cells at the same time.
Abstract: While recent literature on circuit layout addresses large-scale standard-cell placement, the authors typically assume that all macros are fixed. Floorplanning techniques are very good at handling macros, but do not scale to hundreds of thousands of placeable objects. Therefore we combine floorplanning techniques with placement techniques to solve the more general placement problem. Our work shows how to place macros consistently with large numbers of small standard cells. Proposed techniques can also be used to guide circuit designers who prefer to place macros by hand. We address the computational difficulty of layout problems involving large macros and numerous small logic cells at the same time. Proposed algorithms are evaluated in the context of wirelength minimization because a computational method that is not scalable in optimizing wirelength is unlikely to be successful for more complex objectives (congestion, delay, power, etc.). We propose several different design flows to place mixed-size placement instances. The first flow relies on an arbitrary black-box standard-cell placer to obtain an initial placement and then removes possible overlaps using a fixed-outline floorplanner. This results in valid placements for macros, which are considered fixed. Remaining standard cells are then placed by another call to the standard-cell placer. In the second flow a standard-cell placer generates an initial placement and a force-directed placer is used in the engineering change order (ECO) mode to generate an overlap-free placement. Empirical evaluation on IBM benchmarks shows that in most cases our proposed flows compare favorably with previously published mixed-size placers, Kraftwerk, and the mixed-size floor-placer proposed at the 2003 Conference on Design, Automation, and Test in Europe (DATE 2003), and are competitive with mPG-MS.

49 citations


Journal ArticleDOI
TL;DR: The foundations of the layered approach to modeling and performance simulation of PHMs are described, showing an example design space of a network processor explored using the simulation approach.
Abstract: Heterogeneous multiprocessing is the future of chip design with the potential for tens to hundreds of programmable elements on single chips within the next several years. These chips will have heterogeneous, programmable hardware elements that lead to different execution times for the same software executing on different resources as well as a mix of desktop-style and embedded-style software. They will also have a layer of programming across multiple programmable elements forming the basis of a new kind of programmable system which we refer to as a Programmable Heterogeneous Multiprocessor (PHM). Current modeling approaches use instruction set simulation for performance modeling, but this will become far too prohibitive in terms of simulation time for these larger designs. The fundamental question is what the next higher level of design will be. The high-level modeling, simulation and design required for these programmable systems poses unique challenges, representing a break from traditional hardware design. Programmable systems, including layered concurrent software executing via schedulers on concurrent hardware, are not characterizable with traditional component-based hierarchical composition approaches, including discrete event simulation. We describe the foundations of our layered approach to modeling and performance simulation of PHMs, showing an example design space of a network processor explored using our simulation approach.

46 citations


Journal ArticleDOI
TL;DR: This article proposes an algorithm to iteratively find the variable partition such that the maximum energy saving is achieved while satisfying the given performance constraint.
Abstract: Many high-end DSP processors employ both multiple memory banks and heterogeneous register files to improve performance and power consumption. The complexity of such architectures presents a great challenge to compiler design. In this article, we present an approach for variable partitioning and instruction scheduling to maximally exploit the benefits provided by such architectures. Our approach is built on a novel graph model which strives to capture both performance and power demands. We propose an algorithm to iteratively find the variable partition such that the maximum energy saving is achieved while satisfying the given performance constraint. Experimental results demonstrate the effectiveness of our approach.

37 citations


Journal ArticleDOI
TL;DR: An agile formal methodology based on Extreme Programming concepts to construct abstract models from natural language specifications of complex systems, focusing on Prescriptive Formal Models (PFMs) that capture the specification of the system under design in a mathematically precise manner is presented.
Abstract: We present an agile formal methodology named eXtreme Formal Modeling (XFM), based on Extreme Programming (XP) concepts to construct abstract models from natural language specifications of complex systems. In particular, we focus on Prescriptive Formal Models (PFMs) that capture the specification of the system under design in a mathematically precise manner. Such models can be used as golden reference models for formal verification, test generation, coverage monitor generation, etc. This methodology for incrementally building PFMs works by adding user stories expressed as LTL formulae gleaned from the natural language specifications, one by one, into the model. XFM builds the models, retaining correctness with respect to incrementally added properties by regressively model-checking all the LTL properties captured theretofore in the model. We illustrate XFM with a graded set of examples consisting of a traffic light controller and a DLX pipeline. To make the regressive model-checking steps feasible with current model-checking tools, we need to control the model size increments at each subsequent step in the process. We therefore analyze the effects of ordering the LTL properties in XFM on the state-space growth rate of the model. We compare three different property-ordering methodologies: ad hoc ordering, property-based ordering, and predicate-based ordering. We experiment on the models of the ISA bus monitor and the arbitration phase of the Pentium Pro bus. We experimentally show and mathematically reason that the predicate-based ordering is the best among these orderings. Finally, we present a GUI-based toolbox that we implemented to build PFMs using XFM.

30 citations
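For flavor, the block below lists the kind of LTL user stories that might be added one at a time in an XFM-style flow for a simple three-phase traffic light; these example properties are hypothetical illustrations, not the properties used in the article.

```latex
% Hypothetical XFM-style user stories for a traffic light with signals
% green, yellow, red (illustrative only).
\begin{align*}
  &\mathbf{G}\,\bigl(\mathit{green} \rightarrow \mathbf{X}\,(\mathit{green} \lor \mathit{yellow})\bigr)
     && \text{green is followed by green or yellow}\\
  &\mathbf{G}\,\bigl(\mathit{yellow} \rightarrow \mathbf{X}\,\mathit{red}\bigr)
     && \text{yellow always steps to red}\\
  &\mathbf{G}\,\bigl(\mathit{red} \rightarrow \mathbf{F}\,\mathit{green}\bigr)
     && \text{every red is eventually followed by green}
\end{align*}
```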


Journal ArticleDOI
TL;DR: A very efficient technology mapping algorithm, k_m_flow, is developed for a novel field-programmable gate array (FPGA) architecture that is based on k-input single-output programmable logic array- (PLA-) like cells, or k/m-macrocells, and can outperform 4-LUT-based FPGAs on this set of benchmarks.
Abstract: In this article, we study the technology mapping problem for a novel field-programmable gate array (FPGA) architecture that is based on k-input single-output programmable logic array- (PLA-) like cells, or k/m-macrocells. Each cell in this architecture can implement a single output function of up to k inputs and up to m product terms. We develop a very efficient technology mapping algorithm, k_m_flow, for this new type of architecture. The experimental results show that our algorithm can achieve depth-optimality on almost all the testcases in a set of 16 Microelectronics Center of North Carolina (MCNC) benchmarks. Furthermore, it is shown that on this set of benchmarks, with only a relatively small number of product terms (m ≤ k + 3), the k/m-macrocell-based FPGAs can achieve the same or similar mapping depth compared with the traditional k-input single-output lookup table- (k-LUT-) based FPGAs. We also investigate the total area and delay of k/m-macrocell-based FPGAs and compare them with those of the commonly used 4-LUT-based FPGAs. The experimental results show that k/m-macrocell-based FPGAs can outperform 4-LUT-based FPGAs in terms of both delay and area after placement and routing by VPR on this set of benchmarks.

Journal ArticleDOI
TL;DR: In this paper, the equivalence between behavioral level and RTL designs can be defined precisely using the proposed "attribute statements" in an interactive fashion, and implementation issues as well as considerations on real life industrial design examples are also presented.
Abstract: In this article, we present techniques for comparison between behavioral level and register transfer level (RTL) design descriptions by mapping the designs into virtual controllers and virtual datapaths. We also discuss how the equivalence between behavioral level and RTL designs can be defined precisely using the proposed “attribute statements” in an interactive fashion. Implementation issues as well as considerations on real-life industrial design examples are also presented.

Journal ArticleDOI
TL;DR: New datapath scheduling algorithms that use multiple supply voltages and dynamic frequency clocking in a coordinated manner in order to reduce the energy consumption ofdatapath circuits are developed.
Abstract: Recently, dynamic frequency scaling has been explored at the CPU and system levels for power optimization. Low-power datapath scheduling using multiple supply voltages has been well researched. In this work, we develop new datapath scheduling algorithms that use multiple supply voltages and dynamic frequency clocking in a coordinated manner in order to reduce the energy consumption of datapath circuits. In dynamic frequency clocking, the functional units can be operated at different frequencies depending on the computations occurring within the datapath during a given clock cycle. The strategy is to schedule high-energy units, such as multipliers, at lower frequencies so that they can be operated at lower voltages to reduce energy consumption, and the low-energy units, such as adders, at higher frequencies to compensate for speed. The proposed time- and resource-constrained algorithms have been applied to various high-level synthesis benchmark circuits under different time and resource constraints. The experimental results show significant reduction in energy for both the algorithms.
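Since per-operation dynamic energy scales roughly with the square of the supply voltage, a quick back-of-the-envelope comparison shows why running the high-energy multipliers at a reduced voltage pays off. The voltages, effective capacitances, and operation mix below are illustrative assumptions, not the article's benchmarks or algorithms.

```python
# Rough sketch: energy per operation ~ C_eff * Vdd^2. Compare a single-voltage
# schedule with one that runs multipliers at a lower voltage (and frequency).
# All numbers are illustrative assumptions.

def op_energy(c_eff, vdd):
    return c_eff * vdd ** 2

ops = [('mult', 8.0), ('mult', 8.0), ('add', 1.0), ('add', 1.0)]  # (unit, C_eff)

single_v = sum(op_energy(c, 3.3) for _, c in ops)                 # all at 3.3 V
mixed_v = sum(op_energy(c, 2.4 if unit == 'mult' else 3.3)        # mults at 2.4 V
              for unit, c in ops)

print(f"energy saving: {100 * (1 - mixed_v / single_v):.1f}%")
```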

Journal ArticleDOI
TL;DR: This work presents an algorithm that schedules a chain of operations with data dependencies among consecutive operations at a single step, and uses a technique from the computational geometry domain to solve the matching problem.
Abstract: Complexities of applications implemented on embedded and programmable systems grow with the advances in capacities and capabilities of these systems. Mapping applications onto them manually is becoming a very tedious task. This draws attention to using high-level synthesis within design flows. Meanwhile, it is essential to provide a flexible formulation of optimization objectives as well as to perform efficient planning for various design objectives early on in the design flow. In this work, we address these issues in the context of data flow graph (DFG) scheduling, which is an essential element within the high-level synthesis flow. We present an algorithm that schedules a chain of operations with data dependencies among consecutive operations at a single step. This local problem is repeated to generate the schedule for the whole DFG. The local problem is formulated as a maximum weight noncrossing bipartite matching. We use a technique from the computational geometry domain to solve the matching problem. This technique provides a theoretical guarantee on the solution quality for scheduling a single chain of operations. Although still being local, this provides a relatively wider perspective on the global scheduling objectives. In our experiments we compared the latencies obtained using our algorithm with the optimal latencies given by the exact solution to the integer linear programming (ILP) formulation of the problem. In 9 out of 14 DFGs tested, our algorithm found the optimal solution, while generating latencies comparable to the optimal solution in the remaining five benchmarks. The formulation of the objective function in our algorithm provides flexibility to incorporate different optimization goals. We present examples of how to exploit the versatility of our algorithm with specific examples of objective functions and experimental results on the ability of our algorithm to capture these objectives efficiently in the final schedules.
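The local step described above is a maximum-weight noncrossing bipartite matching. As a simpler stand-in for the article's computational-geometry technique, the sketch below solves that matching problem with a classical O(n·m) dynamic program; the weight values are made up for illustration.

```python
# Maximum-weight noncrossing bipartite matching between two linearly ordered
# sides (e.g., a chain of operations vs. candidate slots) via dynamic
# programming. This illustrates the matching subproblem, not the article's
# geometric algorithm; the weights are made-up example values.

def max_weight_noncrossing_matching(w):
    """w[i][j] >= 0 is the gain of matching left item i with right item j."""
    n, m = len(w), len(w[0])
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(dp[i - 1][j],                        # skip left item i
                           dp[i][j - 1],                        # skip right item j
                           dp[i - 1][j - 1] + w[i - 1][j - 1])  # match i with j
    return dp[n][m]

# three chained operations vs. four candidate (cycle, unit) slots
weights = [[4, 2, 0, 0],
           [0, 5, 3, 0],
           [0, 1, 6, 2]]
print(max_weight_noncrossing_matching(weights))  # 15 = 4 + 5 + 6
```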

Journal ArticleDOI
TL;DR: A routing-driven methodology for scan chain ordering with minimum wirelength objective is presented and substantial wirelength reductions for the routing-based flow versus the traditional placement-based flow are shown.
Abstract: Scan chain insertion can have a large impact on routability, wirelength, and timing of the design. We present a routing-driven methodology for scan chain ordering with minimum wirelength objective. A routing-based approach to scan chain ordering, while potentially more accurate, can result in TSP (Traveling Salesman Problem) instances which are asymmetric and highly nonmetric; this may require a careful choice of solvers. We evaluate our new methodology on recent industry place-and-route blocks with 1200 to 5000 scan cells. We show substantial wirelength reductions for the routing-based flow versus the traditional placement-based flow. In a number of our test cases, over 86% of scan routing overhead is saved. Even though our experiments are, so far, timing oblivious, the routing-based flow also improves evaluated timing, and practical timing-driven extensions appear feasible.
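For contrast with the routing-driven flow, the sketch below builds a scan order with the simplest placement-based heuristic: a greedy nearest-neighbor tour over scan-cell locations using Manhattan distance. The cell coordinates and scan-in location are made up, and this is only a baseline for the kind of TSP instance discussed above, not the article's method.

```python
# Tiny placement-based baseline for scan chain ordering: greedy nearest-neighbor
# ordering of scan cells by Manhattan distance starting from the scan-in pin.
# Coordinates are made-up example data; the article's flow uses routed wirelength.

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def order_scan_chain(scan_in, cells):
    """Return a cell order built greedily from the scan-in location."""
    order, current, remaining = [], scan_in, dict(cells)
    while remaining:
        name = min(remaining, key=lambda c: manhattan(current, remaining[c]))
        order.append(name)
        current = remaining.pop(name)
    return order

cells = {'FF0': (2, 9), 'FF1': (8, 1), 'FF2': (3, 8), 'FF3': (7, 2), 'FF4': (1, 5)}
chain = order_scan_chain(scan_in=(0, 0), cells=cells)
length = sum(manhattan(cells[a], cells[b]) for a, b in zip(chain, chain[1:]))
print(chain, "intra-chain wirelength:", length)
```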

Journal ArticleDOI
TL;DR: This article shows how phase shifters can be synthesized uniformly and efficiently for any LFSM, including the aforementioned ones, and demonstrates the method by showing how to obtain phase shifter for two-dimensional cellular automata and for ring generators.
Abstract: Phase shifters are used to shift the bit sequences produced by the successive stages of a built-in test pattern generator (TPG) based on a linear finite state machine (LFSM) by a specified amount (phase shift) relative to the characteristic sequence. An upper bound on the number of taps to be used for each phase shifter and a lower bound on the phase-shift value between successive stages of the TPG mechanism are the general parameters of the problem. Methods to design such phase shifters have been given in the past separately for Type-1 LFSRs, Type-2 LFSRs, and three-neighborhood cellular automata. In this article, we show how phase shifters can be synthesized uniformly and efficiently for any LFSM, including the aforementioned ones. We demonstrate the method by showing how to obtain phase shifters for two-dimensional cellular automata and for ring generators.

Journal ArticleDOI
TL;DR: Two different implementations of TIS are presented, one of which employs a dedicated hardware module for test vector generation, while the other is a software-based approach that reads test vectors from memory.
Abstract: TIS is an instruction-level methodology for processor core self-testing that enhances the instruction set of a CPU with test instructions. Since the functionality of test instructions is the same as that of the NOP instruction, NOP instructions can be replaced with test instructions. Online testing can be accomplished without any performance penalty. TIS tests different parts of the processor and detects stuck-at faults. This method can be employed in offline and online testing of single-cycle, multicycle, and pipelined processors. However, TIS is more appropriate for online testing of pipelined architectures in which NOP instructions are frequently executed because of data, control, and structural hazards. Running test instructions instead of these NOP instructions, TIS utilizes the time that is otherwise wasted by NOPs. In this article, two different implementations of TIS are presented. One implementation employs a dedicated hardware module for test vector generation, while the other is a software-based approach that reads test vectors from memory. These two approaches are implemented on a pipelined processor core and their area overheads are compared. To demonstrate the appropriateness of the TIS test technique, several programs are executed and fault coverage results are presented.

Journal ArticleDOI
TL;DR: To solve the obstacle-avoiding rectilinear and 4-geometry Steiner tree problems, a heuristic algorithm is presented that utilizes a cost accumulation scheme based on the maze router to determine the Torricelli vertices (points) for improving the quality of multiterminal nets.
Abstract: The maze routing problem is to find an optimal path between a given pair of cells on a grid plane. Lee's algorithm and its variants, probably the most widely used maze routing method, fail to work in the 4-geometry of the grid plane. Our algorithm solves this problem by using a suitable data structure for uniform wave propagation in the 4-geometry, 8-geometry, etc. The algorithm guarantees finding an optimal path if it exists and has linear time and space complexities. Next, to solve the obstacle-avoiding rectilinear and 4-geometry Steiner tree problems, a heuristic algorithm is presented. The algorithm utilizes a cost accumulation scheme based on the maze router to determine the Torricelli vertices (points) for improving the quality of multiterminal nets. Our experimental results show that the algorithm works well in practice. Furthermore, using the 4-geometry router, path lengths can be reduced by up to 12% compared to those in the rectilinear router.
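For reference, here is the classical rectilinear (2-geometry) Lee-style wave expansion that the article generalizes: a BFS over grid cells that is guaranteed to return a shortest obstacle-avoiding path if one exists. The grid and terminals are made up, and the 4-geometry and Steiner-tree extensions are not shown.

```python
# Classical Lee-style maze router on a rectilinear grid: BFS wave expansion
# from the source; backtrack via predecessor links once the target is reached.
# The blockage map and terminals below are made-up example data.
from collections import deque

def lee_route(grid, src, dst):
    """grid[r][c] == 1 marks an obstacle; returns a shortest path or None."""
    rows, cols = len(grid), len(grid[0])
    prev = {src: None}          # also serves as the visited set
    frontier = deque([src])
    while frontier:
        cell = frontier.popleft()
        if cell == dst:
            path = []
            while cell is not None:
                path.append(cell)
                cell = prev[cell]
            return path[::-1]
        r, c = cell
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and (nr, nc) not in prev):
                prev[(nr, nc)] = cell
                frontier.append((nr, nc))
    return None

blockage = [[0, 0, 0, 0, 0],
            [0, 1, 1, 1, 0],
            [0, 0, 0, 1, 0],
            [1, 1, 0, 1, 0],
            [0, 0, 0, 0, 0]]
print(lee_route(blockage, (0, 0), (4, 4)))
```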

Journal ArticleDOI
TL;DR: Techniques are proposed to address the problem of yield loss due to incidental overtesting of functionally-untestable transition faults, along with an efficient adjustment to the algorithm that keeps the overtest ratio low.
Abstract: Scan-based transition tests are added to improve the detection of speed failures in sequential circuits. Empirical data suggests that both data volume and application time will increase dramatically for such transition testing. Techniques to address the above problem for a class of transition tests, called enhanced transition tests, are proposed in this article. The first technique, which combines the proposed transition test chains with the ATE repeat capability, reduces test data volume by 46.5% when compared with transition tests computed by a commercial transition test ATPG tool. However, the test application time may sometimes increase. To address the test time issue, a new DFT technique, Exchange Scan, is proposed. Exchange Scan reduces both data volume and application time by 46.5%. These techniques rely on the use of hold-scan cells and highlight the effectiveness of hold-scan design to address test time and test data volume issues. In addition, we address the problem of yield loss due to incidental overtesting of functionally-untestable transition faults, and we formulate an efficient adjustment to the algorithm to keep the overtest ratio low. Our experimental results show that up to 14.5% reduction in overtest ratio can be achieved, with an average overtest reduction of 4.68%.

Journal ArticleDOI
TL;DR: It is shown that by using band-limited transient test signals, which can be supported by wafer-probe test instrumentation, significant numbers of bad ICs can be detected early during the wafer-probe test.
Abstract: It is well known that wafer-probe test costs of analog ICs are an order of magnitude less than the corresponding test costs of assembled packages. It is therefore natural to push as much of the testing process into wafer-probe testing as possible to reduce the scope of assembled package testing. However, the signal drive and response observation capabilities during wafer probe testing are limited in comparison to assembled packages. In this article, it is shown that by using band-limited transient test signals, which can be supported by wafer-probe test instrumentation, significant numbers of bad ICs can be detected early during the wafer-probe test. The optimal test stimuli are determined by cooptimizing the wafer-probe and assembled package test waveforms. Overall test costs, including the cost of packaging bad ICs, are minimized and are reduced up to four times. The proposed method has been validated using hardware test data, which were obtained through measurements made on a prototype.

Journal ArticleDOI
TL;DR: The proposed algorithm is combined with existing postprocessing procedures that find the gates that can be duplicated, and on a set of benchmark examples it is shown that, except for some cases, the proposed algorithm finds an optimal solution of the given problem.
Abstract: Minimum area is one of the important objectives in technology mapping for lookup table-based field-programmable gate arrays (FPGAs). Although there is an algorithm that can find an optimal solution in polynomial time for the minimal-area FPGA technology mapping problem without gate duplication, its time complexity can grow exponentially with the number of inputs of the lookup tables. This article proposes an algorithm that approximates the area-optimal solution with lower time complexity. The time complexity of this algorithm is proven theoretically to be bounded by O(n^3), where n is the total number of gates in the given circuit. It is shown that, except for some cases, the proposed algorithm can find an optimal solution of a given problem. We have combined the proposed algorithm with the existing postprocessing procedures which are used to find the gates that can be duplicated on a set of benchmark examples. The experimental results demonstrate the effectiveness of our algorithm.

Journal ArticleDOI
TL;DR: Four different approaches for reducing the number of accesses to the Translation Look-aside Buffer are proposed, and it is experimentally demonstrated that one of these schemes, which uses a combination of compiler and hardware enhancements, can reduce iTLB dynamic power by over 85% in most cases.
Abstract: Power consumption and power density for the Translation Look-aside Buffer (TLB) are important considerations not only in its design, but can have a consequence on cache design as well. After pointing out the importance of instruction TLB (iTLB) power optimization, this article embarks on a new philosophy for reducing the number of accesses to this structure. The overall idea is to keep a translation currently being used in a register and avoid going to the iTLB as far as possible---until there is a page change. We propose four different approaches for achieving this, and experimentally demonstrate that one of these schemes that uses a combination of compiler and hardware enhancements can reduce iTLB dynamic power by over 85% in most cases. The proposed approaches can work with different instruction-cache (iL1) lookup mechanisms and achieve significant iTLB power savings without compromising on performance. Their importance grows with higher iL1 miss rates and larger page sizes. They can work very well with large iTLB structures that can possibly consume more power and take longer to lookup, without the iTLB getting into the common case. Further, we also experimentally demonstrate that they can provide performance savings for virtually indexed, virtually tagged iL1 caches, and can even make physically indexed, physically tagged iL1 caches a possible choice for implementation.
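A tiny trace-driven count illustrates the idea of keeping the current page's translation in a register and consulting the iTLB only on a page change. The instruction-address trace and 4 KB page size are made-up assumptions, not the article's benchmarks.

```python
# Toy illustration: count iTLB lookups with and without a "current page
# translation" register that is reused until the fetch address leaves the page.
# Trace and page size are made-up assumptions.
PAGE_SHIFT = 12  # 4 KB pages

def itlb_accesses(trace, use_page_register):
    accesses, last_page = 0, None
    for pc in trace:
        page = pc >> PAGE_SHIFT
        if not use_page_register or page != last_page:
            accesses += 1          # translation must come from the iTLB
        last_page = page
    return accesses

# mostly sequential fetches inside one page, then a far jump to another page
trace = [0x400000 + 4 * i for i in range(1000)] + \
        [0x7F0000 + 4 * i for i in range(1000)]
print("baseline:", itlb_accesses(trace, False),
      "with page register:", itlb_accesses(trace, True))
```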

Journal ArticleDOI
TL;DR: A technique that simplifies the design of pipelined circuits and automates the specification and verification of structural-hazard and datapath correctness properties for pipelined circuits.
Abstract: This article describes a technique that simplifies the design of pipelined circuits and automates the specification and verification of structural-hazard and datapath correctness properties for pipelined circuits. The technique is based upon a template for pipeline stages, a control-circuit cell library, a decomposition of structural hazard and datapath correctness into a collection of simple properties, and a prototype design tool that generates verification scripts for use by external tools. Our case studies include scalar and superscalar implementations of a 32-bit OpenRISC integer microprocessor.

Journal ArticleDOI
TL;DR: This article proposes a novel risk management based technique that is capable of generating an effective tradeoff between power and “risk”: the more the risk, the less the power.
Abstract: This article addresses the problem of voltage scheduling in unpredictable situations. The voltage scheduling problem assigns voltages to operations such that the power is minimized under a clock delay constraint. In the presence of unpredictabilities, meeting the clock latency constraint cannot be guaranteed. This article proposes a novel risk management based technique to solve this problem. Here, the risk management paradigm assigns a quantified value to the amount of risk the designer is willing to take on the clock cycle constraint. The algorithm then assigns voltages in order to meet the expected value of clock cycle constraint while keeping the maximum delay within the specified “risk” and minimizing the power. The proposed algorithm is based on dynamic programming and is optimal for trees. Experimental results show that the traditional voltage scheduling approach is incapable of handling unpredictabilities. Our approach is capable of generating an effective tradeoff between power and “risk”: the more the risk, the less the power. The results show that a small increase in design risk positively affects the power dissipation.

Journal ArticleDOI
TL;DR: An experiment involving a practical situation with an early deadlock condition showed that the time measured from application initialization to deadlock detection was reduced by 46% by employing the DDU as compared to detecting deadlock in software.
Abstract: This article presents a novel Parallel Deadlock Detection Algorithm (PDDA) and its hardware implementation, Deadlock Detection Unit (DDU). PDDA uses simple Boolean representations of request, grant, and no activity so that the hardware implementation of PDDA becomes easier and operates faster. We prove the correctness of PDDA and that the DDU has a runtime complexity of O(min(m,n)), where m is the number of resources and n is the number of processes. The DDU reduces deadlock detection time by 99% (i.e., 100X) or more compared to software implementations of deadlock detection algorithms. An experiment involving a practical situation with an early deadlock condition showed that the time measured from application initialization to deadlock detection was reduced by 46% by employing the DDU as compared to detecting deadlock in software.
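As a purely software-level illustration of the condition PDDA detects, the sketch below builds a wait-for graph from Boolean request and grant matrices and checks it for a cycle; this is not the parallel hardware algorithm or the DDU, and the two-process example is made up.

```python
# Software illustration of matrix-based deadlock detection: derive the wait-for
# graph from Boolean request/grant matrices (single-instance resources assumed)
# and test it for a cycle. This is not the PDDA/DDU hardware algorithm.

def deadlocked(request, grant):
    """request[p][r] / grant[p][r] are 0/1; deadlock iff the wait-for graph has a cycle."""
    n_proc, n_res = len(request), len(request[0])
    holder = {r: p for p in range(n_proc) for r in range(n_res) if grant[p][r]}
    waits = {p: {holder[r] for r in range(n_res)
                 if request[p][r] and r in holder and holder[r] != p}
             for p in range(n_proc)}

    WHITE, GREY, BLACK = 0, 1, 2
    color = [WHITE] * n_proc

    def on_cycle(p):
        color[p] = GREY
        for q in waits[p]:
            if color[q] == GREY or (color[q] == WHITE and on_cycle(q)):
                return True
        color[p] = BLACK
        return False

    return any(color[p] == WHITE and on_cycle(p) for p in range(n_proc))

# P0 holds R0 and requests R1; P1 holds R1 and requests R0 -> circular wait
grant = [[1, 0], [0, 1]]
request = [[0, 1], [1, 0]]
print(deadlocked(request, grant))  # True
```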

Journal ArticleDOI
TL;DR: This work exploits the bipartition approach as well as encoding techniques to reduce power dissipation not only of combinational logic blocks but also of the pipeline registers to optimize power for pipelined circuits.
Abstract: In this article, we present a bipartition dual-encoding architecture for low-power pipelined circuits. We exploit the bipartition approach as well as encoding techniques to reduce power dissipation not only of combinational logic blocks but also of the pipeline registers. Based on Shannon expansion, we partition a given circuit into two subcircuits such that the number of different outputs of both subcircuits is reduced, and then encode the output of both subcircuits to minimize the Hamming distance for transitions with a high switching probability. We measure the benefits of four different combinational bipartitioning and encoding architectures for comparison. The transistor-level simulation results show that bipartition dual-encoding can effectively reduce power by 72.7% for the pipeline registers and 27.1% for the total power consumption on average. To the best of our knowledge, this is the first work that presents an in-depth study on bipartition and encoding techniques to optimize power for pipelined circuits.

Journal ArticleDOI
TL;DR: A new style for writing temporal specifications of open systems is proposed that can be integrated with traditional symbolic model-checking techniques, together with a complete tool for the verification of Verilog RTL modules in isolation.
Abstract: Assume-guarantee style verification of modules relies on the appropriate modeling of the interaction of the module with its environment. Popular temporal logics such as Computation Tree Logic (CTL) and Linear Temporal Logic (LTL) that were originally defined for closed systems (Kripke structures) do not make any syntactic discrimination between input and output variables. As a result, these logics and their recent derivatives (such as System Verilog, Sugar, Forspec, etc.) permit the specification of properties that have some semantic problems when interpreted over open systems or modules. These semantic problems are quite common in practice, but are computationally hard to detect within a given specification. In this article, we propose a new style for writing temporal specifications of open systems that helps the designer to avoid most of these problems. In the proposed style, the basic temporal operators (such as next and until) are annotated with assume constraints over the input variables. We formalize this style through an extension of LTL, namely Open-LTL, and an extension of CTL with fairness, called Open-CTL. We show that this simple syntactic separation between the assume and the guarantee achieves the desired results. We show that the proposed style can be integrated with traditional symbolic model-checking techniques and present a complete tool for the verification of Verilog RTL modules in isolation.

Journal ArticleDOI
TL;DR: This article presents a polynomial-time exact algorithm for integrated pin assignment and buffer planning for all two-pin nets from one macro block (source block) to all other blocks of a given buffer block plan, while minimizing the total cost.
Abstract: The buffer block methodology has become increasingly popular as more and more buffers are needed in deep-submicron design, and it leads to many challenging problems in physical design. In this article, we present a polynomial-time exact algorithm for integrated pin assignment and buffer planning for all two-pin nets from one macro block (source block) to all other blocks of a given buffer block plan, while minimizing the total cost αW + βR for any positive α and β, where W is the total wirelength and R is the number of buffers. By applying this algorithm iteratively (each time, pick one block as the source block), it provides a polynomial-time algorithm for pin assignment and buffer planning for nets among multiple macro blocks. Experimental results demonstrate its efficiency and effectiveness.

Journal ArticleDOI
TL;DR: The proposed approach provides solutions to the problem of how to place the minimal number of registers in Step 3 and can handle nonzero clock skew; the problem is conjectured to be NP-hard in its general form.
Abstract: Data dependency constraints constitute a lower bound P on the minimal clock period of single-phase clocked sequential circuits. In contrast to methods based on basic retiming, clocked sequential circuits with clock period P can always be obtained using software pipelining techniques. Such circuits can be derived by any method that can be framed in the following four-step process: Step 1, determine P; Step 2, compute a valid periodic schedule of the computational elements; Step 3, place registers back into the circuit; Step 4, assign the clock signals to control registers. Methods with polynomial run-time to implement this process are proposed in the literature. They implement these steps sequentially, starting with Step 1. These methods do not know how to optimally place registers, which leads to an unnecessary number of registers. In this article, we address the problem of how to simultaneously implement Steps 2 and 3 in order to minimize the total number of registers. We conjecture that the problem is NP-hard in its general form. We formulate the problem for the first time in the literature, and devise a Mixed Integer Linear Program (MILP) to solve it. From this MILP, we derive a linear program to determine approximate solutions to the problem for large general circuits. We show that the proposed approach can handle nonzero clock skew. Experimental results confirm the effectiveness of the approach and show that significant reductions of the number of registers can be obtained although register sharing is not used. When the schedule is given, the proposed approach provides solutions to the problem of how to place the minimal number of registers in Step 3.

Journal ArticleDOI
TL;DR: This article proposes a two-step synthesis scheme for skewed logic circuits; in the first step, an integer linear programming-based approach is presented to overcome the logic reconvergence problem in skewed logic circuits with minimal logic duplication cost.
Abstract: Skewed logic circuits belong to a noise-tolerant high-performance static circuit family. Skewed logic circuits can achieve performance comparable to that of Domino logic circuits but with much lower power consumption. Two factors contribute to the reduction in power. First, by exploiting the static nature of skewed logic circuits, we can alleviate the cost of logic duplication which is typically required to overcome the logic reconvergence problem in both Domino logic and skewed logic circuits. Second, a selective clocking scheme can be applied to a skewed logic circuit to reduce the clock load and hence, clock power. In this article, we propose a two-step synthesis scheme of skewed logic circuits. In the first step, an integer linear programming-based approach is presented to overcome the logic reconvergence problem in skewed logic circuits with minimal logic duplication cost. In the second step, a dynamic programming-based heuristic is applied to achieve an optimal selective clocking scheme. Experimental results show that the average power saving of skewed logic circuits over Domino logic circuits is 41.1%.