
Showing papers in "ACM Transactions on Design Automation of Electronic Systems in 2006"


Journal ArticleDOI
TL;DR: This survey attempts to provide an overview of the current state of the art for fault tolerance in FPGAs; it assumes that faults have been previously detected and diagnosed, and the methods presented are targeted towards tolerating those faults.
Abstract: A wide range of fault tolerance methods for FPGAs have been proposed. Approaches range from simple architectural redundancy to fully on-line adaptive implementations. The applications of these methods also differ; some are used only for manufacturing yield enhancement, while others can be used in-system. This survey attempts to provide an overview of the current state of the art for fault tolerance in FPGAs. It is assumed that faults have been previously detected and diagnosed; the methods presented are targeted towards tolerating the faults. A detailed description of each method is presented. Where applicable, the methods are compared using common metrics. Results are summarized to present a succinct, comprehensive comparison of the different approaches.

110 citations


Journal ArticleDOI
Ali Dasdan1, Ivan Hom1
TL;DR: The consequences of ITD in STA are analyzed and a proper handling of ITD is proposed in an industrial sign-off STA tool, believed to be the first such work.
Abstract: In digital circuit design, it is typically assumed that cell delay increases with decreasing voltage and increasing temperature. This assumption is the basis of the cornering approach with cell libraries in static timing analysis (STA). However, this assumption breaks down at low supply voltages because cell delay can decrease with increasing temperature. This phenomenon is caused by a competition between mobility and threshold voltage to dominate cell delay. We refer to this phenomenon as the inverted temperature dependence (ITD). Due to ITD, it becomes very difficult to analytically determine the temperatures that maximize or minimize the delay of a cell or a path. As such, ITD has profound consequences for STA: (1) ITD essentially invalidates the approach of defining corners by independently varying voltage and temperature; (2) ITD makes it more difficult to find short paths, leading to difficulties in detecting hold time violations; and (3) the effect of ITD will worsen as supply voltages decrease and threshold voltage variations increase. This article analyzes the consequences of ITD in STA and proposes a proper handling of ITD in an industrial sign-off STA tool. To the best of our knowledge, this article is the first such work.
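The competing effects the article describes can be seen in a toy alpha-power-law delay model. This is only an illustrative sketch; the device parameters and the simple mobility and threshold-voltage temperature models below are assumptions, not values or equations from the article.

```python
# Toy illustration of inverted temperature dependence (ITD).
# delay ~ Vdd / (mobility * (Vdd - Vth)^alpha); all numbers are assumed.

def cell_delay(vdd, temp_k, vth0=0.35, k_vth=2e-3, mu_exp=1.5, alpha=1.3, t0=300.0):
    mobility = (temp_k / t0) ** (-mu_exp)   # carrier mobility degrades as T rises
    vth = vth0 - k_vth * (temp_k - t0)      # threshold voltage drops as T rises
    return vdd / (mobility * (vdd - vth) ** alpha)

for vdd in (1.2, 0.6):
    d_cold = cell_delay(vdd, 248.0)         # -25 C
    d_hot = cell_delay(vdd, 398.0)          # 125 C
    trend = "slower when hot" if d_hot > d_cold else "faster when hot (ITD)"
    print(f"Vdd={vdd:.1f} V: delay(-25C)={d_cold:.2f}, delay(125C)={d_hot:.2f} -> {trend}")
```

At the higher supply voltage the mobility term dominates and the hot corner is the slow corner; near threshold the (Vdd - Vth) term dominates and the relationship inverts, which is why corners defined by independently varying voltage and temperature can miss the true worst case.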

60 citations


Journal ArticleDOI
TL;DR: This article proposes a data-flow analysis framework for estimating component activities at fixed points of programs, taking pipeline architectures into account, together with a set of scheduling policies that are effective in reducing leakage power in microprocessors.
Abstract: Power leakage constitutes an increasing fraction of the total power consumption in modern semiconductor technologies. Recent research efforts indicate that architectures, compilers, and software can be optimized so as to reduce the switching power (also known as dynamic power) in microprocessors. This has led to interest in using architecture and compiler optimization to reduce leakage power (also known as static power) in microprocessors. In this article, we investigate compiler-analysis techniques that are related to reducing leakage power. The architecture model in our design is a system with an instruction set to support the control of power gating at the component level. Our compiler provides an analysis framework for utilizing instructions to reduce the leakage power. We present a framework for analyzing data flow for estimating the component activities at fixed points of programs whilst considering pipeline architectures. We also provide equations that can be used by the compiler to determine whether employing power-gating instructions in given program blocks will reduce the total energy requirements. As the duration of power gating on components when executing given program routines is related to the number and complexity of program branches, we propose a set of scheduling policies and evaluate their effectiveness. We performed experiments by incorporating our compiler analysis and scheduling policies into SUIF compiler tools and by simulating the energy consumption with the Wattch toolkit. The experimental results demonstrate that our mechanisms are effective in reducing leakage power in microprocessors.
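The flavor of such break-even equations can be sketched as a simple test a compiler might apply before inserting power-gating instructions around an idle component. The linear model and all numbers below are illustrative assumptions, not the article's actual equations.

```python
# Gate a component only if the leakage energy saved over its idle window
# exceeds the energy overhead of switching the sleep transistors off and on.

def worth_power_gating(idle_cycles, cycle_time_ns, leak_power_mw, gate_overhead_nj):
    idle_ns = idle_cycles * cycle_time_ns
    leakage_saved_nj = leak_power_mw * idle_ns * 1e-3   # mW * ns = pJ; /1000 -> nJ
    return leakage_saved_nj > gate_overhead_nj

# Example: a unit idle for 5,000 cycles at 2 ns/cycle, leaking 3 mW,
# with an assumed 10 nJ gating overhead: 30 nJ saved > 10 nJ, so gate it.
print(worth_power_gating(5000, 2.0, 3.0, 10.0))   # True
```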

58 citations


Journal ArticleDOI
TL;DR: The proposed concurrent testing methodology is directed at ensuring high reliability and availability of bio-MEMS and lab-on-a-chip systems, as they are increasingly deployed for safety-critical applications.
Abstract: We present a concurrent testing methodology for detecting catastrophic faults in digital microfluidics-based biochips and investigate the related problems of test planning and resource optimization. We first show that an integer linear programming model can be used to minimize testing time for a given hardware overhead, for example, droplet dispensing sources and capacitive sensing circuitry. Due to the NP-complete nature of the problem, we also develop efficient heuristic procedures to solve this optimization problem. We apply the proposed concurrent testing methodology to a droplet-based microfluidic array that was fabricated and used to perform multiplexed glucose and lactate assays. Experimental results show that the proposed test approach interleaves test application with the biomedical assays and prevents resource conflicts. The proposed method is therefore directed at ensuring high reliability and availability of bio-MEMS and lab-on-a-chip systems, as they are increasingly deployed for safety-critical applications.

49 citations


Journal ArticleDOI
TL;DR: This article presents a new methodology for designing custom DM management mechanisms with a reduced memory footprint for multimedia and wireless network applications, and proposes a suitable way to traverse the design space and construct custom DM managers that minimize the DM used by these highly dynamic applications.
Abstract: New portable consumer embedded devices must execute multimedia and wireless network applications that demand an extensive memory footprint. Moreover, they must heavily rely on Dynamic Memory (DM) due to the unpredictability of the input data (e.g., 3D stream features) and system behavior (e.g., the number of applications running concurrently, defined by the user). Within this context, consistent design methodologies that can efficiently tackle the complex DM behavior of these multimedia and network applications are greatly needed. In this article, we present a new methodology for designing custom DM management mechanisms with a reduced memory footprint for this kind of dynamic application. First, our methodology describes the large design space of DM management decisions for multimedia and wireless network applications. Then, we propose a suitable way to traverse the aforementioned design space and construct custom DM managers that minimize the DM used by these highly dynamic applications. As a result, our methodology achieves memory footprint improvements of 60% on average in real case studies over the current state-of-the-art DM managers used for these types of dynamic applications.

45 citations


Journal ArticleDOI
TL;DR: A compile-time area estimation technique to guide SA-C compiler optimizations is presented and results show that the technique predicts the area required for a design to within 2.5% of actual for small image processing operators and to within 5.0% for larger benchmarks.
Abstract: The Cameron Project has developed a system for compiling codes written in a high-level language called SA-C to FPGA-based reconfigurable computing systems. In order to exploit the parallelism available on the FPGAs, the SA-C compiler performs a large number of optimizations such as full loop unrolling, loop fusion and strip-mining. However, since the area on an FPGA is limited, the compiler needs to know the effect of compiler optimizations on the FPGA area; this information is typically not available until after the synthesis and place and route stage, which can take hours. In this article, we present a compile-time area estimation technique to guide SA-C compiler optimizations. We demonstrate our technique for a variety of benchmarks written in SA-C. Experimental results show that our technique predicts the area required for a design to within 2.5% of actual for small image processing operators and to within 5.0% for larger benchmarks. The estimation time is on the order of milliseconds, compared with minutes for the synthesis tool.

45 citations


Journal ArticleDOI
TL;DR: Application of the proposed procedure to adaptive filters and polynomial evaluation circuits realized in a Xilinx Virtex FPGA has resulted in area and power reductions, and speed-ups of up to 36%, over common alternative design strategies.
Abstract: This article introduces an automatic design procedure for determining the sensitivity of outputs in a digital signal processing design to small errors introduced by rounding or truncation of internal variables. The proposed approach can be applied to both linear and nonlinear designs. By analyzing the resulting sensitivity values, the proposed procedure is able to determine an appropriate distinct word-length for each internal variable in a fixed-point hardware implementation. In addition, the power-optimizing capabilities of word-length optimization are studied. Application of the proposed procedure to adaptive filters and polynomial evaluation circuits realized in a Xilinx Virtex FPGA has resulted in area reductions of up to 80% (mean 66%) combined with power reductions of up to 98% (mean 87%) and speed-up of up to 36% (mean 20%) over common alternative design strategies.
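A rough, simulation-based sketch of the underlying idea (the article derives sensitivities analytically rather than by simulation): quantize one internal variable of a toy 2-tap FIR filter and pick the smallest fractional word-length whose output error stays within an assumed budget. The filter, stimulus, and budget are all illustrative assumptions.

```python
import random

def fir(x, quant_bits=None):
    """y[n] = 0.7*x[n] + 0.3*x[n-1]; optionally quantize the first product."""
    y, prev = [], 0.0
    for s in x:
        p = 0.7 * s
        if quant_bits is not None:
            step = 2.0 ** (-quant_bits)
            p = round(p / step) * step      # model rounding of the internal variable
        y.append(p + 0.3 * prev)
        prev = s
    return y

random.seed(1)
x = [random.uniform(-1, 1) for _ in range(2000)]
ref, budget = fir(x), 1e-3                   # assumed RMS output error budget
for bits in range(4, 16):
    q = fir(x, bits)
    rms = (sum((a - b) ** 2 for a, b in zip(ref, q)) / len(ref)) ** 0.5
    if rms <= budget:
        print(f"smallest fractional word-length meeting the budget: {bits} bits (rms={rms:.1e})")
        break
```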

44 citations


Journal ArticleDOI
TL;DR: An accurate and numerically stable formulation of substrate resistive coupling using boundary element methods is presented, specifically for substrates without grounded backplates (floating substrates).
Abstract: This article focuses on the formulation of the substrate resistive coupling using boundary element methods, specifically for substrates without grounded backplates (floating substrates). An accurate and numerically stable formulation is presented. Numerical results are shown to demonstrate the correctness and the numerical robustness of the formulation.

31 citations


Journal ArticleDOI
TL;DR: A language and design environment called GEZEL is presented that can be used for the design, verification, and implementation of coprocessor-based systems, and the execution ladder is introduced as an optimization framework to balance interactivity against simulation speed.
Abstract: Energy-efficient embedded systems rely on domain-specific coprocessors for dedicated tasks such as baseband processing, video coding, or encryption. We present a language and design environment called GEZEL that can be used for the design, verification and implementation of such coprocessor-based systems. The GEZEL environment creates a platform simulator by combining a hardware simulation kernel with one or more instruction-set simulators. The hardware part of the platform is programmed in GEZEL, a deterministic, cycle-true and implementation-oriented hardware description language. GEZEL designs are scripted, allowing the hardware configuration of the platform simulator to be changed quickly without going through lengthy recompiles. For this reason, we call the environment interactive. We present the execution ladder as an optimization framework to balance interactivity against simulation speed. We demonstrate our approach using several designs including an AES encryption coprocessor and a Viterbi decoding coprocessor. We discuss the advantages of our approach as opposed to more conventional approaches using SystemC and Verilog/VHDL.

29 citations


Journal ArticleDOI
TL;DR: This article proposes a new algorithm for statistical static timing analysis (SSTA) using levelized covariance propagation (LCP), which simultaneously considers the effect of die-to-die variations in process parameters as well as within-die variation, including systematic and random variations.
Abstract: Variability in process parameters is making accurate timing analysis of nano-scale integrated circuits an extremely challenging task. In this article, we propose a new algorithm for statistical static timing analysis (SSTA) using levelized covariance propagation (LCP). The algorithm simultaneously considers the effect of die-to-die variations in process parameters as well as within-die variation, including systematic and random variations. In order to efficiently handle complicated process variation models while contending with the arbitrary correlation among timing signals, we employ a compact form of the levelized statistical data structure. Furthermore, we propose two enhancements to the LCP algorithm to make it practical for the analysis of large-sized circuits. Results on several ISCAS'85 benchmark circuits in predictive 70nm technology show an average of 0.19% and 0.57% errors in the mean and standard deviation, respectively, of timing analysis using the proposed technique, as compared to the Monte Carlo-based approach.
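The flavor of covariance-aware timing propagation can be shown on the simplest possible case, a series connection of two correlated Gaussian gate delays, checked against Monte Carlo. The delays and correlation are illustrative assumptions, and the article's LCP algorithm of course handles full circuits, the max operation, and arbitrary correlations.

```python
import random

mu1, sd1 = 100.0, 8.0      # gate 1 delay, ps (assumed)
mu2, sd2 = 120.0, 10.0     # gate 2 delay, ps (assumed)
rho = 0.6                  # correlation from shared die-to-die variation (assumed)

# Analytic propagation for a series connection (sum of correlated Gaussians).
mu_path = mu1 + mu2
sd_path = (sd1**2 + sd2**2 + 2 * rho * sd1 * sd2) ** 0.5
print(f"analytic   : mean={mu_path:.1f} ps  std={sd_path:.2f} ps")

# Monte Carlo reference.
random.seed(0)
samples = []
for _ in range(200_000):
    z_shared, z1, z2 = (random.gauss(0, 1) for _ in range(3))
    d1 = mu1 + sd1 * (rho**0.5 * z_shared + (1 - rho)**0.5 * z1)
    d2 = mu2 + sd2 * (rho**0.5 * z_shared + (1 - rho)**0.5 * z2)
    samples.append(d1 + d2)
m = sum(samples) / len(samples)
s = (sum((v - m) ** 2 for v in samples) / len(samples)) ** 0.5
print(f"monte carlo: mean={m:.1f} ps  std={s:.2f} ps")
```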

26 citations


Journal ArticleDOI
TL;DR: The proposed approach is nine times faster, with better solution quality, than a recently published result, and the thermal via planning approach proves very efficient at directly eliminating localized hot spots.
Abstract: New three-dimensional (3D) floorplanning and thermal via planning algorithms are proposed for thermal optimization in two-stacked die integration. Our contributions include (1) a two-stage design flow for 3D floorplanning, which scales down the enlarged solution space due to multidevice layer structure; (2) an efficient thermal-driven 3D floorplanning algorithm with power distribution constraints; (3) a thermal via planning algorithm considering congestion minimization. Experimental results show that our approach is nine times faster, with better solution quality, compared to a recently published result. In addition, the thermal via planning approach proves very efficient at directly eliminating localized hot spots.

Journal ArticleDOI
TL;DR: This work develops a polynomial-time optimal algorithm for assigning low Vdds to as many operations as possible under the resource and latency constraints, while at the same time minimizing total switching activity through functional unit binding.
Abstract: Reducing power consumption through high-level synthesis has attracted a growing interest from researchers due to its large potential for power reduction. In this work we study functional unit binding (or module assignment) given a scheduled data flow graph under a multi-Vdd framework. We assume that each functional unit can be driven by different Vdd levels dynamically during run time to save dynamic power. We develop a polynomial-time optimal algorithm for assigning low Vdds to as many operations as possible under the resource and latency constraints, while at the same time minimizing total switching activity through functional unit binding. Our algorithm shows consistent improvement over a design flow that separates voltage assignment from functional unit binding. We also change the initial scheduling to examine power/energy-latency tradeoff scenarios under different voltage level combinations. Experimental results show that we can achieve 28.1% and 33.4% power reductions when the latency bound is the tightest with two and three Vdd levels, respectively, compared with the single-Vdd case. When latency is relaxed, multi-Vdd offers larger power reductions (up to 46.7%). We also show comparison data of energy consumption under the same experimental settings.

Journal ArticleDOI
TL;DR: A different proof of Snir's theorem is provided by capturing the structural information of zero-deficiency prefix circuits, and a new zero-deficiency prefix circuit Z(d) is proposed by constructing a prefix circuit as wide as possible for a given depth d.
Abstract: A parallel prefix circuit has n inputs x1, x2, …, xn, and computes the n outputs yi = xi • xi−1 • … • x1, 1 ≤ i ≤ n, in parallel, where • is an arbitrary binary associative operator. Snir proved that the depth t and size s of any parallel prefix circuit satisfy the inequality t + s ≥ 2n − 2. Hence, a parallel prefix circuit is said to be of zero-deficiency if equality holds. In this article, we provide a different proof for Snir's theorem by capturing the structural information of zero-deficiency prefix circuits. Following our proof, we propose a new kind of zero-deficiency prefix circuit Z(d) by constructing a prefix circuit as wide as possible for a given depth d. It is proved that the Z(d) circuit has the minimal depth among all possible zero-deficiency prefix circuits.
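Snir's bound t + s ≥ 2n − 2 is easy to check on two classical constructions (neither is the article's Z(d) circuit). The sketch below counts operator nodes (size) and levels (depth) for a serial ripple prefix, which meets the bound with equality, and for a Sklansky-style divide-and-conquer prefix, which does not.

```python
def serial_cost(n):
    """Ripple prefix: a chain of n-1 operators, depth n-1."""
    return n - 1, n - 1                      # (size, depth)

def sklansky_cost(n):
    """Divide-and-conquer prefix; n assumed to be a power of two."""
    if n == 1:
        return 0, 0
    s, d = sklansky_cost(n // 2)
    # two half-size prefixes in parallel, then n/2 combine operators one level deeper
    return 2 * s + n // 2, d + 1

for n in (8, 16, 32):
    for name, (s, d) in (("serial", serial_cost(n)), ("sklansky", sklansky_cost(n))):
        zd = "zero-deficiency" if s + d == 2 * n - 2 else ""
        print(f"n={n:2d} {name:8s} size={s:3d} depth={d:2d} t+s={s+d:3d} (bound {2*n-2:3d}) {zd}")
```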

Journal ArticleDOI
TL;DR: This work presents a compilation framework for dual instruction sets, which uses a profitability-based compiler heuristic that operates at the instruction-level granularity and is able to effectively take advantage of both Instruction Sets.
Abstract: For many embedded applications, program code size is a critical design factor. One promising approach for reducing code size is to employ a “dual instruction set”, where processor architectures support a normal (usually 32-bit) Instruction Set, and a narrow, space-efficient (usually 16-bit) Instruction Set with a limited set of opcodes and access to a limited set of registers. This feature, however, requires compilers that can reduce code size by compiling for both Instruction Sets. Existing compiler techniques operate at the routine-level granularity and are unable to make the trade-off between increased register pressure (resulting in more spills) and decreased code size. We present a compilation framework for such dual instruction sets, which uses a profitability-based compiler heuristic that operates at the instruction-level granularity and is able to effectively take advantage of both Instruction Sets. We demonstrate consistent and improved code size reduction (on average 22%) for the MIPS 32/16-bit ISA. We also show that the code compression obtained by this “dual instruction set” technique is heavily dependent on the application characteristics and the narrow Instruction Set itself.

Journal ArticleDOI
TL;DR: This article develops an instruction-level loop-scheduling technique to reduce both execution time and bus-switching activities for applications with loops on VLIW architectures, and proposes an algorithm, SAMLS (Switching-Activity Minimization Loop Scheduling), to minimize both schedule length and switching activities for applications with loops.
Abstract: In embedded systems, high-performance DSP needs to be performed not only with high data throughput but also with low power consumption. This article develops an instruction-level loop-scheduling technique to reduce both execution time and bus-switching activities for applications with loops on VLIW architectures. We propose an algorithm, SAMLS (Switching-Activity Minimization Loop Scheduling), to minimize both schedule length and switching activities for applications with loops. In the algorithm, we obtain the best schedule from the ones that are generated from an initial schedule by repeatedly rescheduling the nodes with schedule length and switching activities minimization based on rotation scheduling and bipartite matching. The experimental results show that our algorithm can reduce both schedule length and bus-switching activities. Compared with the work of Lee et al. [2003], SAMLS shows an average 11.5% reduction in schedule length and an average 19.4% reduction in bus-switching activities.
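Bus-switching activity is commonly measured as the Hamming distance between consecutive words driven onto the bus, so reordering independent instructions changes the toggle count. The instruction encodings below are made-up values used only to illustrate the metric; SAMLS itself reschedules whole loops.

```python
def toggles(words):
    """Total bit toggles between consecutive words on the bus."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))

# Four assumed-independent 32-bit instruction words in two orders.
sched_a = [0xFF00FF00, 0x00FF00FF, 0xFF00FF00, 0x00FF00FF]
sched_b = [0xFF00FF00, 0xFF00FF00, 0x00FF00FF, 0x00FF00FF]

print("schedule A toggles:", toggles(sched_a))   # 96
print("schedule B toggles:", toggles(sched_b))   # 32
```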

Journal ArticleDOI
TL;DR: This article proposes a methodology based on rewriting logic, which is adequate to quickly model and evaluate reconfigurable architectures (RA) in general and, in particular, reconfigurable systolic architectures.
Abstract: Many algebraic operations can be efficiently implemented as pipe networks in arrays of functional units such as systolic arrays that provide a large amount of parallelism. However, the applicability of classical systolic arrays is restricted to problems with strictly regular data dependencies yielding only arrays with uniform linear pipes. This limitation can be circumvented by using reconfigurable systolic arrays or reconfigurable data path arrays, where the node interconnections and operations can be redefined even at run time. In this context, several alternative reconfigurable systolic architectures can be explored and powerful tools are needed to model and evaluate them. Well-known rewriting-logic environments such as ELAN and Maude can be used to specify and simulate complex application-specific integrated systems. In this article we propose a methodology based on rewriting logic, which is adequate to quickly model and evaluate reconfigurable architectures (RA) in general and, in particular, reconfigurable systolic architectures. As an interesting case study we apply this rewriting-logic modeling methodology to the space-efficient treatment of the Fast-Fourier Transform (FFT). The FFT prototype conceived in this way has been specified and validated in VHDL using the Quartus II system.

Journal ArticleDOI
TL;DR: The concepts of reuse subspace, dependence vector, and self and group reuse are extended and applied in this new context, enabling the scratch-pad to be used in a larger context than was possible before.
Abstract: We propose techniques for identifying and exploiting spatial and temporal reuse for indirectly indexed arrays. Indirectly indexed arrays are those arrays which are, typically, accessed inside multilevel loop nests and whose index expression includes not only loop iterators and constants but arrays as well. Existing techniques for improving locality are quite sophisticated in the case of directly indexed arrays. But, unfortunately, they are inadequate for handling indirectly indexed arrays. In this article we therefore extend the existing framework and techniques from directly indexed to indirectly indexed arrays. The concepts of reuse subspace, dependence vector, and self and group reuse are extended and applied in this new context. Also, lately scratch-pad memory has become an attractive alternative to the data cache, especially in the embedded multimedia community. This is because embedded systems are very sensitive to area and energy, and the scratch-pad is smaller in area and consumes less energy on a per-access basis compared to a cache of the same capacity. Several techniques have been proposed in the past for the efficient exploitation of the scratch-pad for directly indexed arrays. We extend these techniques by presenting a method for scratch-pad mapping of indirectly indexed arrays. This enables the scratch-pad to be used in a larger context than was possible before.
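The distinction between directly and indirectly indexed arrays, and the temporal reuse hiding in the index array, can be shown in a few lines; the arrays below are illustrative assumptions.

```python
N = 8
A = [i * i for i in range(N)]                 # data array
idx = [3, 3, 7, 0, 0, 0, 5, 7]                # index array, contents known only at run time

direct = [A[i] for i in range(N)]             # directly indexed: index is affine in i
indirect = [A[idx[i]] for i in range(N)]      # indirectly indexed: index is itself an array

# Temporal reuse in the indirect stream: how often each element of A is touched.
touches = {j: idx.count(j) for j in sorted(set(idx))}
print(direct)
print(indirect)
print(touches)
```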

Journal ArticleDOI
TL;DR: This article proposes an ILP-based framework for the reduction of energy and transient power through datapath scheduling during behavioral synthesis and shows that significant reductions in power, energy and energy delay product can be obtained.
Abstract: In low-power design for battery-driven portable applications, the reduction of peak power, peak power differential, cycle difference power, average power and energy are equally important. These are different forms of dynamic power dissipation of a CMOS circuit, which is predominant compared to static power dissipation for higher switching activity. The peak power, the cycle difference power, and the peak power differential drive the transient characteristic of a CMOS circuit. In this article, we propose an ILP-based framework for the reduction of energy and transient power through datapath scheduling during behavioral synthesis. A new metric called “modified cycle power function” (CPFa) is defined that captures the above power characteristics and facilitates integer linear programming formulations. The ILP-based datapath scheduling schemes with CPFa as objective function are developed assuming three modes of datapath operation, such as single supply voltage and single frequency (SVSF), multiple supply voltages and dynamic frequency clocking (MVDFC), and multiple supply voltages and multicycling (MVMC). We conducted experiments on selected high-level synthesis benchmark circuits for various resource constraints and estimated power, energy and energy delay product for each of them. Experimental results show that significant reductions in power, energy and energy delay product can be obtained.

Journal ArticleDOI
TL;DR: Two polynomial-time heuristics are proposed that provide a speedup of up to 13.7 with an extremely low penalty for power when compared to the optimal ILP solution for the authors' selected benchmarks.
Abstract: This article proposes two very fast graph-theoretic heuristics for the low-power binding problem given a fixed number of resources and multiple architectures for the resources. First, the generalized low-power binding problem is formulated as an Integer Linear Programming (ILP) problem, which is NP-complete to solve. Then two polynomial-time heuristics are proposed that provide a speedup of up to 13.7 with an extremely low penalty for power when compared to the optimal ILP solution for our selected benchmarks.

Journal ArticleDOI
TL;DR: An efficient heuristic algorithm for selecting K index bits for improved cache performance is presented and the feasibility of the algorithm is shown by applying it to a large number of embedded system applications as well as the integer SPEC CPU 2000 benchmarks.
Abstract: The increasing use of microprocessor cores in embedded systems as well as mobile and portable devices creates an opportunity for customizing the cache subsystem for improved performance. In traditional cache design, the index portion of the memory address bus consists of the K least significant bits, where K = log2(D) and D is the depth of the cache. However, in devices where the application set is known and characterized (e.g., systems that execute a fixed application set) there is an opportunity to improve cache performance by choosing a near-optimal set of bits used as index into the cache. This technique does not add any overhead in terms of area or delay. In this article, we present an efficient heuristic algorithm for selecting K index bits for improved cache performance. We show the feasibility of our algorithm by applying it to a large number of embedded system applications as well as the integer SPEC CPU 2000 benchmarks. Specifically, for data traces, we show up to 45% reduction in cache misses. Likewise, for instruction traces, we show up to 31% reduction in cache misses. When a unified data/instruction cache architecture is considered, our results show an average improvement of 14.5% for the Powerstone benchmarks and an average improvement of 15.2% for the SPEC'00 benchmarks.
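A small sketch of why the choice of index bits matters, using an assumed direct-mapped cache with four sets and a strided block-address trace; the article's heuristic for actually choosing the bits is not reproduced here.

```python
def misses(trace, index_bits, depth=4):
    """Direct-mapped cache of `depth` sets, indexed by the chosen address bits."""
    cache, miss = [None] * depth, 0
    for addr in trace:
        idx = 0
        for pos, bit in enumerate(index_bits):   # gather the chosen bits into a set index
            idx |= ((addr >> bit) & 1) << pos
        if cache[idx] != addr:                   # full address kept as the tag
            miss += 1
            cache[idx] = addr
    return miss

# Block addresses 0, 8, 16, 24 visited repeatedly: their low bits never change.
trace = [b * 8 for b in range(4)] * 50

print("index = bits [0,1]:", misses(trace, [0, 1]))  # every block maps to set 0 -> thrashing
print("index = bits [3,4]:", misses(trace, [3, 4]))  # blocks spread over all 4 sets -> 4 misses
```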

Journal ArticleDOI
TL;DR: This article focuses on a multiprocessor-system-on-a-chip (MPSoC) architecture with a banked memory system, and shows how code and data optimizations can help to reduce memory energy consumption for embedded applications with regular data access patterns.
Abstract: The next generation embedded architectures are expected to accommodate multiple processors on the same chip. While this makes interprocessor communication less costly as compared to traditional high-end parallel machines, it also makes off-chip requests very costly. In particular, frequent off-chip memory accesses not only increase execution cycles but also increase overall power consumption. One way of alleviating this power problem is to divide the off-chip memory into multiple banks, each of which can be power-controlled independently using low-power operating modes. In this article, we focus on a multiprocessor-system-on-a-chip (MPSoC) architecture with a banked memory system, and show how code and data optimizations can help us reduce memory energy consumption for embedded applications with regular data access patterns, for example, those from the embedded image and video processing domain. This is achieved by ensuring bank locality, which means that each processor localizes its accesses into a small set of banks in a given time period. We present a mathematical formulation of the bank locality problem. Our formulation is based on constructing a set of matrix equations that capture the mappings between the data, computation, processor, and memory bank spaces. Based on this formulation, we propose a heuristic solution to the bank locality problem under different scenarios. Our solution involves an iterative process through which we try to satisfy as many matrix constraints as possible; the unsatisfied constraints represent the degree of degradation in bank locality. Finally, we report extensive experimental results showing the effectiveness of our strategy in practice. Our results show that the proposed solution improves bank locality significantly, and reduces the overall memory system energy consumption by up to 34% over an approach that makes use of the low-power modes but does not employ our strategy.
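A crude sketch of why bank locality saves energy when banks have low-power modes: banks touched in a window must stay active, while untouched banks can sleep. The access streams, window, and per-bank power numbers are illustrative assumptions, not the article's matrix formulation.

```python
P_ACTIVE, P_SLEEP, BANKS = 300.0, 20.0, 4     # assumed per-bank power in mW

def window_power(accessed_banks):
    """Power over one window: active banks at full power, the rest in sleep mode."""
    active = len(set(accessed_banks))
    return active * P_ACTIVE + (BANKS - active) * P_SLEEP

# The same number of accesses, two data layouts.
scattered = [0, 1, 2, 3, 0, 1, 2, 3]          # poor bank locality
localized = [0, 0, 0, 0, 0, 0, 0, 0]          # good bank locality

print("scattered layout:", window_power(scattered), "mW")   # 1200.0 mW
print("localized layout:", window_power(localized), "mW")   # 360.0 mW
```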

Journal ArticleDOI
TL;DR: An algorithm is proposed for routing bus structures between components on two layers such that all length constraints are satisfied, handling length extension simultaneously during routing so that maximum resource utilization is achieved.
Abstract: The increasing clock frequencies in high-end industrial circuits bring new routing challenges that cannot be handled by traditional algorithms. An important design automation problem for high-speed boards today is routing nets within tight minimum and maximum length bounds. In this article, we propose an algorithm for routing bus structures between components on two layers such that all length constraints are satisfied. This algorithm handles length extension simultaneously during the actual routing process so that maximum resource utilization is achieved during length extension. Our approach here is to process one track at a time, and choose the best subset of nets to be routed on each track. The algorithm we propose for single-track routing is guaranteed to find the optimal subset of nets together with the optimal solution with length extension on one track. The experimental comparison with a recently proposed technique shows the effectiveness of this algorithm both in terms of solution quality and run-time.

Journal ArticleDOI
TL;DR: This article incorporates the notion of timing constraints into the Phantom compiler, and shows that the approach is effective in meeting such constraints, allowing fine-grained concurrency among the tasks.
Abstract: In modern embedded systems, software development plays a vital role. Many key functions are being migrated to software, aiming at a shorter time to market and easier upgrades. Multitasking is increasingly common in embedded software, and many of these tasks incorporate real-time constraints. Although multitasking simplifies coding, it demands an operating system and imposes significant overhead on the system. The use of serializing compilers, such as the Phantom compiler, allows the synthesis of a monolithic code from a multitasking C application, eliminating the need for an operating system. In this article, we introduce the synthesis of multitasking applications that execute in a timely manner. We incorporate the notion of timing constraints into the Phantom compiler, and show that our approach is effective in meeting such constraints, allowing fine-grained concurrency among the tasks. As an additional case study, we present the implementation of a software-based modem and show that real-time applications such as the modem have guaranteed performance in the serialized code generated by the Phantom compiler.

Journal ArticleDOI
TL;DR: According to this modeling, the multi-objective optimization problem can be optimally solved by Lagrangian relaxation; by relaxing Lagrange multipliers to the critical paths, it takes only two iterations for all solutions to converge to the global optimum, which is much more efficient than related previous work.
Abstract: As technology advances apace, crosstalk becomes a design metric of comparable importance to area and delay. This article focuses mainly on the crosstalk issue, specifically on the impacts of physical design and process variation on crosstalk. As the feature size shrinks below 0.25μm, the impact of process variation on crosstalk increases rapidly. Hence, a crosstalk-insensitive design is desirable in the deep submicron regime. In this article, crosstalk sensitivity is referred to as the influence of process variation on crosstalk in a circuit. We show that the lower bound of crosstalk sensitivity grows quadratically, while that of crosstalk increases linearly. Therefore, designers should also consider crosstalk sensitivity when optimizing other design objectives such as crosstalk, area, and delay. According to our modeling, these objectives are all in posynomial forms, and thus the multi-objective optimization problem can be optimally solved by Lagrangian relaxation. Experimental results show that our method is effective and efficient. For instance, a circuit of 2856 gates and 5272 wires is optimized using a 13-minute runtime and 2.8-MB memory on a Pentium III 1.0 GHz PC with 256-MB memory. In particular, by relaxing Lagrange multipliers to the critical paths, it takes only two iterations for all solutions to converge to the global optimum, which is much more efficient than related previous work. This relaxation scheme provides a key insight into the rapid convergence in Lagrangian relaxation.

Journal ArticleDOI
TL;DR: The authors propose an algorithm that exploits relations between instructions of frequently executed instruction groups by tracing program execution sequences, and a two-stage low-power decomposition architecture for instruction decoding.
Abstract: During the execution of processor instructions, decoding the instructions is a major task in identifying instructions and generating control signals for data paths. In this article, we propose two instruction decoder decomposition techniques for low-power designs. First, by tracing program execution sequences, we propose an algorithm that explores the relations between frequently executed instructions. Second, we propose a two-stage low-power decomposition structure for decoding instructions. Experimental results demonstrate that our proposed techniques achieve an average of 34.18% in power reduction and 12.93% in critical-path delay reduction for the instruction decoder.

Journal ArticleDOI
TL;DR: This paper presents a two-phase LVS methodology: a standard LVS phase where power and ground nets are defined as global nets, and a multi-power-domain LVS phase where power and ground nets are treated as local nets.
Abstract: A unique LVS (layout-versus-schematic) methodology has been developed for the verification of a four-core microprocessor with multiple power domains using a triple-well 90-nm CMOS technology. The chip is migrated from its previous generation, which was designed for a twin-well process. Due to the design reuse, VDD and GND are designed as global nets but they are not globally connected across the entire chip. The standard LVS flow is unable to handle the additional design complexity and there seems to be no published literature tackling the problem. This paper presents a two-phase LVS methodology: a standard LVS phase where power and ground nets are defined as global nets and a multi-power-domain LVS phase where power and ground nets are treated as local nets. The first phase involves verifying LVS at the block level as well as the full-chip level. The second phase aims at verifying the integrity of the multi-power-domain power grid that is not covered in the first phase LVS. The proposed LVS methodology was successfully verified by real silicon.

Journal ArticleDOI
TL;DR: This article proposes a synthesis scheme to reduce the duplication cost by allowing inverters in Domino logic under certain timing constraints for both simple and complex gates, which translates into significant improvements in area and power.
Abstract: Logic duplication, a commonly used synthesis technique to remove trapped inverters in reconvergent paths of Domino circuits, incurs high area and power penalties. In this article, we propose a synthesis scheme to reduce the duplication cost by allowing inverters in Domino logic under certain timing constraints for both simple and complex gates. Moreover, we can include the logic duplication minimization during technology mapping for synthesis of Domino circuits with complex gates. In order to guarantee the robustness of such Domino circuits, we perform the logic optimization as a postlayout step. Experimental results show significant reduction in duplication cost, which translates into significant improvements in area and power. As a byproduct, the timing performance is also improved owing to smaller layout area and/or logic depth.

Journal ArticleDOI
TL;DR: This work proposes a maximum crosstalk effect minimization algorithm that takes logic synthesis into consideration for PLA structures and can effectively minimize the maximum coupling capacitance of a circuit by 51% as compared with the original area-minimized PLA without crosstalk effect minimization.
Abstract: We propose a maximum crosstalk effect minimization algorithm that takes logic synthesis into consideration for PLA structures. To minimize the crosstalk effect, a wire permutation technique is used, which consists of the following steps. First, product terms are partitioned into long and short sets, and then the product terms in the long and short sets are interleaved. After that, we take advantage of the crosstalk immunity of product terms in the long set to further reduce the maximum coupling capacitance of the PLA. Finally, synthesis techniques such as local and global transformations are taken into consideration to search for a better result. The experiments demonstrate that our algorithm can effectively minimize the maximum coupling capacitance of a circuit by 51% as compared with the original area-minimized PLA without crosstalk effect minimization.

Journal ArticleDOI
TL;DR: A new scheme is proposed that greatly improves technology mapping by partitioning a circuit into a set of maximal super-gates (MSGs), applying the dynamic programming technique to the resulting trees, and allowing duplication of gates in the mapping of each individual MSG.
Abstract: Traditionally, technology mapping is done by first partitioning a circuit into a forest of trees. Each individual tree is then mapped using dynamic programming. The links among the mappings of different trees are provided via propagating the essential mapping information along multiple fanout branches. While this approach may achieve optimality within each tree, the overall result is compromised from the very first treatment of fanouts. In this article, we propose a new scheme that greatly improves technology mapping. Instead of a forest of trees, we partition the circuit into a set of maximal super-gates (MSGs). These are used to transform the original circuit into trees. We then apply the dynamic programming technique to the trees and allow duplication of gates in the mapping of each individual MSG. Experimental results on ISCAS'85 benchmarks show that our approach delivers an average of 20.6% reduction in delay with only a 9.5% increase in area.

Journal ArticleDOI
TL;DR: A fault coverage metric for an FSM specification, based on the transition fault model, is proposed, and using this metric the coverage of a test sequence is derived from the specification.
Abstract: Verification and test are critical phases in the development of any hardware or software system. This article focuses on black box testing of the control part of hardware and software systems. Black box testing involves specification, test generation, and fault coverage. Finite state machines (FSMs) are commonly used for specifying controllers. FSMs may have shortcomings in modeling complex systems. With the introduction of X-machines, complex systems can be modeled at higher levels of abstraction. An X-machine can be converted into an FSM while preserving the level of abstraction. The fault coverage of a test sequence for an FSM specification provides a confidence level. We propose a fault coverage metric for an FSM specification based on the transition fault model, and using this metric, we derive the coverage of a test sequence. The article also presents a method which generates short test sequences that meet a specific coverage level and then extends this metric to determine the coverage of a test sequence for an FSM driven by an FSM network. We applied our FSM verification technique to a real-life FSM, namely, the fibre channel arbitrated loop port state machine, used in the field of storage area networks.
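A minimal sketch of the transition-coverage idea: the fraction of specified transitions exercised by an input sequence. The tiny machine and sequences below are made up, and this exercise count is only a simplified proxy for the article's transition-fault-model metric.

```python
fsm = {                                   # (state, input) -> next state
    ("IDLE", "req"): "BUSY",
    ("IDLE", "rel"): "IDLE",
    ("BUSY", "req"): "BUSY",
    ("BUSY", "rel"): "IDLE",
}

def transition_coverage(fsm, start, test_sequence):
    """Fraction of the FSM's transitions exercised by the test sequence."""
    covered, state = set(), start
    for sym in test_sequence:
        covered.add((state, sym))
        state = fsm[(state, sym)]
    return len(covered) / len(fsm)

print(transition_coverage(fsm, "IDLE", ["req", "rel"]))                # 0.5
print(transition_coverage(fsm, "IDLE", ["req", "req", "rel", "rel"]))  # 1.0
```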