
Showing papers in "IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems in 2007"


Journal ArticleDOI
TL;DR: Experimental measurements of the differences between a 90-nm CMOS field programmable gate array (FPGA) and 90-nm CMOS standard-cell application-specific integrated circuits (ASICs) in terms of logic density, circuit speed, and power consumption for core logic are presented.
Abstract: This paper presents experimental measurements of the differences between a 90-nm CMOS field programmable gate array (FPGA) and 90-nm CMOS standard-cell application-specific integrated circuits (ASICs) in terms of logic density, circuit speed, and power consumption for core logic. We are motivated to make these measurements to enable system designers to make better informed choices between these two media and to give insight to FPGA makers on the deficiencies to attack and, thereby, improve FPGAs. We describe the methodology by which the measurements were obtained and show that, for circuits containing only look-up table-based logic and flip-flops, the ratio of silicon area required to implement them in FPGAs and ASICs is on average 35. Modern FPGAs also contain "hard" blocks such as multiplier/accumulators and block memories. We find that these blocks reduce this average area gap significantly to as little as 18 for our benchmarks, and we estimate that extensive use of these hard blocks could potentially lower the gap to below five. The ratio of critical-path delay, from FPGA to ASIC, is roughly three to four, with less influence from block memory and hard multipliers. The dynamic power consumption ratio is approximately 14 times and, with hard blocks, this gap generally becomes smaller.

1,078 citations


Journal ArticleDOI
TL;DR: This paper analyzes the reasons for the failures of adder designs using QCA technology and proposes adders that exploit proper clocking schemes.
Abstract: Quantum-dot cellular automata (QCA) is attracting a lot of attention due to its extremely small feature size and ultralow power consumption. Up to now, several adder designs using QCA technology have been proposed. However, it was found that not all of the designs function properly. This paper analyzes the reasons for the failures and proposes adders that exploit proper clocking schemes.

211 citations


Journal ArticleDOI
TL;DR: Applying mathematical theories from random fields and convex analysis, this work develops robust techniques to extract a valid spatial-correlation function and matrix from measurement data by solving a constrained nonlinear optimization problem and by applying a modified alternating-projection algorithm.
Abstract: The increased variability of process parameters makes it important yet challenging to extract the statistical characteristics and spatial correlation of process variation. Recent progress in statistical static-timing analysis also makes the extraction important for modern chip designs. Existing approaches either extract only a deterministic component of spatial variation or fail to consider the actual difficulty of computing a valid spatial-correlation function, ignoring the fact that not every function and matrix can be used to describe spatial correlation. Applying mathematical theories from random fields and convex analysis, we develop: 1) a robust technique to extract a valid spatial-correlation function by solving a constrained nonlinear optimization problem and 2) a robust technique to extract a valid spatial-correlation matrix by employing a modified alternating-projection algorithm. Our techniques are guaranteed to extract a valid spatial-correlation function and matrix from measurement data, even if those measurements are affected by unavoidable random noise. Experimental results, obtained from data generated by a Monte Carlo model, confirm the accuracy and robustness of our techniques and show that we are able to recover the correlation function and matrix with very high accuracy even in the presence of significant random noise.
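The matrix-repair step lends itself to a compact sketch. Below is a minimal alternating-projections routine, a simplified stand-in for the paper's modified algorithm (which differs in detail): it repairs a noisy measured correlation matrix by alternately projecting onto the positive-semidefinite cone and onto the set of unit-diagonal symmetric matrices.

```python
import numpy as np

def nearest_valid_correlation(A, iters=100):
    """Repair a noisy measured correlation matrix by alternating
    projections: clip negative eigenvalues (PSD cone), then restore
    the unit diagonal. A simplified sketch of the idea, not the
    paper's exact modified algorithm."""
    X = (A + A.T) / 2.0
    for _ in range(iters):
        # Project onto the positive-semidefinite cone.
        w, V = np.linalg.eigh(X)
        X = (V * np.clip(w, 0.0, None)) @ V.T
        # Project onto matrices with unit diagonal.
        np.fill_diagonal(X, 1.0)
    return (X + X.T) / 2.0
```

Running it on an indefinite "correlation" matrix (e.g., off-diagonal entries 0.9, 0.9, 0.2, which has a negative eigenvalue) returns a nearby matrix that is symmetric, unit-diagonal, and numerically positive semidefinite.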

185 citations


Journal ArticleDOI
TL;DR: It is shown that, due to NBTI, the READ stability of an SRAM cell degrades while its WRITE stability and standby leakage improve with time; moreover, by carefully examining the degradation in leakage current, it is possible to characterize and predict the lifetime behavior of NBTI degradation in real circuit operation.
Abstract: One of the major reliability concerns in nanoscale very large-scale integration design is the time-dependent negative-bias-temperature-instability (NBTI) degradation. Due to the higher operating temperature and increasing vertical oxide field, the threshold voltage (Vt) of PMOS transistors can increase with time under NBTI. In this paper, we examine the impact of NBTI degradation in memory elements of digital circuits, focusing on the conventional 6T-SRAM-array topology. An analytical expression for the time-dependent Vt degradation in PMOS transistors based on the empirical reaction-diffusion (RD) framework was employed for our analysis. Using the RD-based Vt model, we analytically examine the impact of NBTI degradation on critical performance parameters of the SRAM array. These parameters include the following: (1) static noise margin; (2) statistical READ and WRITE stability; (3) parametric yield; and (4) standby leakage current (IDDQ). We show that, due to NBTI, the READ stability of the SRAM cell degrades, while WRITE stability and standby leakage improve with time. Furthermore, by carefully examining the degradation in leakage current due to NBTI, it is possible to characterize and predict the lifetime behavior of NBTI degradation in real circuit operation.
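The reaction-diffusion framework's signature is a power-law time dependence of the threshold-voltage shift, delta_Vt proportional to t^n with n about 1/6. A minimal sketch; the prefactor `a` is an arbitrary illustrative value, not a fitted number from the paper:

```python
def nbti_vt_shift(t_seconds, a=3.9e-3, n=1.0 / 6.0):
    """Long-term NBTI threshold-voltage shift: delta_Vt = a * t^n.
    The reaction-diffusion framework predicts n ~ 1/6; `a` lumps
    temperature and oxide-field dependence (illustrative value only)."""
    return a * t_seconds ** n
```

Because n is small, degradation is front-loaded: the shift after roughly three years (1e8 s) is only about 4.6 times the shift after roughly three hours (1e4 s), which is why lifetime behavior can be extrapolated from early measurements.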

161 citations


Journal ArticleDOI
TL;DR: The performance benefits of a monolithically stacked three-dimensional (3-D) field-programmable gate array (FPGA), whereby the programming overhead of an FPGA is stacked on top of a standard CMOS layer containing logic blocks and interconnects, are investigated.
Abstract: The performance benefits of a monolithically stacked three-dimensional (3-D) field-programmable gate array (FPGA), whereby the programming overhead of an FPGA is stacked on top of a standard CMOS layer containing logic blocks (LBs) and interconnects, are investigated. A Virtex-II-style two-dimensional (2-D) FPGA fabric is used as a baseline architecture to quantify the relative improvements in logic density, delay, and power consumption achieved by such a 3-D FPGA. It is assumed that only the switch transistors and configuration memory cells can be moved to the top layers and that the 3-D FPGA employs the same LB and programmable interconnect architecture as the baseline 2-D FPGA. The static random-access memory cells and switch transistors in the top layers are assumed to occupy no more than 0.7 times the area of, and to have the same characteristics as, n-channel metal-oxide-semiconductor devices in the CMOS layer. It is shown that a monolithically stacked 3-D FPGA can achieve 3.2 times higher logic density, 1.7 times lower critical-path delay, and 1.7 times lower total dynamic power consumption than the baseline 2-D FPGA fabricated in the same 65-nm technology node.

153 citations


Journal ArticleDOI
TL;DR: Experimental results indicate that the proposed heterogeneous spatial-resolution adaptation and asynchronous thermal-element time-marching techniques are sufficient to make accurate dynamic and steady-state thermal analysis practical within the inner loops of IC synthesis algorithms.
Abstract: Ever-increasing integrated circuit (IC) power densities and peak temperatures threaten reliability, performance, and economical cooling. To address these challenges, thermal analysis must be embedded within IC synthesis. However, this requires accurate three-dimensional chip-package heat flow analysis. This has typically been based on numerical methods that are too computationally intensive for numerous repeated applications during synthesis or design. Thermal analysis techniques must be both accurate and fast for use in IC synthesis. This paper presents a novel accurate incremental spatially and temporally adaptive chip-package thermal analysis technique called ISAC for use in IC synthesis and design. It is common for IC temperature variation to strongly depend on position and time. ISAC dynamically adapts spatial- and temporal-modeling granularities to achieve high efficiency while maintaining accuracy. Both steady-state and dynamic thermal analyses are accelerated by the proposed heterogeneous spatial-resolution adaptation and asynchronous thermal-element time-marching techniques. Each technique enables orders-of-magnitude improvement in performance while preserving accuracy when compared with other state-of-the-art adaptive steady-state and dynamic IC thermal analysis techniques. Experimental results indicate that these improvements are sufficient to make accurate dynamic and steady-state thermal analysis practical within the inner loops of IC synthesis algorithms. ISAC has been validated against reliable commercial thermal analysis tools using industrial and academic synthesis test cases and chip designs. It has been implemented as a software package suitable for integration in IC synthesis and design flows and has been publicly released.

143 citations


Journal ArticleDOI
TL;DR: Several orthogonal improvements to state-of-the-art lookup table (LUT)-based field-programmable gate array (FPGA) technology mapping are presented, targeting the delay and area of technology mapping as well as its runtime and memory requirements.
Abstract: This paper presents several orthogonal improvements to the state-of-the-art lookup table (LUT)-based field-programmable gate array (FPGA) technology mapping. The improvements target the delay and area of technology mapping as well as the runtime and memory requirements. 1) Improved cut enumeration computes all K-feasible cuts, without pruning, for up to seven inputs for the largest Microelectronics Center of North Carolina benchmarks. A new technique for on-the-fly cut dropping reduces, by orders of magnitude, the memory needed to represent cuts for large designs. 2) The notion of cut factorization is introduced, in which one computes a subset of cuts for a node and generates other cuts from that subset as needed. Two cut factorization schemes are presented, and a new algorithm that uses cut factorization for delay-oriented mapping for FPGAs with large LUTs is proposed. 3) Improved area recovery leads to mappings with an area, on average, 6% smaller than the previous best work while preserving delay optimality when starting from the same optimized netlists. 4) Lossless synthesis accumulates alternative circuit structures seen during logic optimization. Extending the mapper to use structural choices reduces the delay, on average, by 6% and the area by 12%, compared with the previous work, while increasing the runtime 16 times. Performing five iterations of mapping with choices reduces the delay by 10% and the area by 19% while increasing the runtime eight times. These improvements, on top of the state-of-the-art methods for LUT mapping, are available in the package ABC.
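The cut-enumeration pass the paper improves on can be sketched in a few lines: each node's K-feasible cuts are formed by merging one cut from each fanin and keeping merges with at most K leaves. The on-the-fly cut dropping and cut factorization described above are not modeled here; graph and node names are illustrative.

```python
from itertools import product

def topo_order(graph):
    # graph maps node -> list of fanin nodes ([] for primary inputs).
    order, seen = [], set()
    def visit(n):
        if n not in seen:
            seen.add(n)
            for f in graph[n]:
                visit(f)
            order.append(n)
    for n in graph:
        visit(n)
    return order

def enumerate_cuts(graph, k):
    """All K-feasible cuts per node, merged bottom-up from fanin
    cut sets (standard cut enumeration, no pruning)."""
    cuts = {}
    for node in topo_order(graph):
        node_cuts = {frozenset([node])}            # the trivial cut
        if graph[node]:                            # merge fanin cuts
            for combo in product(*(cuts[f] for f in graph[node])):
                merged = frozenset().union(*combo)
                if len(merged) <= k:
                    node_cuts.add(merged)
        cuts[node] = node_cuts
    return cuts
```

On a toy netlist with primary inputs a, b and gates c = f(a, b), d = g(c, b), 2-feasible cuts of d are {d}, {c, b}, and {a, b}, the last meaning d can be implemented in a single 2-input LUT.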

139 citations


Journal ArticleDOI
Tim Tuan, Arif Rahman, Satyaki Das, Steve Trimberger, Sean Kao
TL;DR: The design and implementation of Pika, a low-power FPGA core targeting battery-powered applications that achieves substantial power savings through a series of power optimizations and is compatible with existing commercial design tools.
Abstract: Programmable logic devices such as field-programmable gate arrays (FPGAs) are useful for a wide range of applications. However, FPGAs are not commonly used in battery-powered applications because they consume more power than application-specific integrated circuits and lack power management features. In this paper, we describe the design and implementation of Pika, a low-power FPGA core targeting battery-powered applications. Our design is based on a commercial low-cost FPGA and achieves substantial power savings through a series of power optimizations. The resulting architecture is compatible with existing commercial design tools. The implementation is done in a 90-nm triple-oxide CMOS process. Compared to the baseline design, Pika consumes 46% less active power and 99% less standby power. Furthermore, it retains circuit and configuration state during standby mode and wakes up from standby mode in approximately 100 ns.

133 citations


Journal ArticleDOI
TL;DR: Two detailed case studies of RC4 stream cipher and AES block cipher have been presented to show that the proposed strategy prevents existing scan-based attacks in the literature.
Abstract: Scan chains are exploited to develop attacks on cryptographic hardware and steal intellectual property from the chip. This paper proposes a secure strategy to test designs by inserting a certain number of inverters between randomly selected scan cells. The security of the scheme has been analyzed. Two detailed case studies of the RC4 stream cipher and the AES block cipher are presented to show that the proposed strategy prevents existing scan-based attacks in the literature. The elegance of the scheme lies in its low hardware overhead.

131 citations


Journal ArticleDOI
TL;DR: This paper solves the energy-minimization problem while accounting for the power-consumption characteristics of dc-dc converters and proposes dc-dc converter-aware energy-minimal DVS techniques for single and multiple tasks.
Abstract: Most digital systems are equipped with dc-dc converters to supply various voltage levels from batteries to logic devices. DC-DC converters maintain legal voltage ranges regardless of load-current variation as well as battery-voltage drop. Although the efficiency of dc-dc converters varies with the output voltage level and the load current, most existing power management techniques simply ignore this efficiency variation. However, without a careful consideration of the efficiency variation of dc-dc converters, finding a truly optimal power management is impossible. In this paper, we solve the problem of energy minimization with consideration of the power-consumption characteristics of dc-dc converters. Specifically, the contributions of our work are as follows: 1) We analyze the effects of the efficiency variation of dc-dc converters on single-task execution in a dynamic voltage scaling (DVS) scheme and propose a technique for dc-dc converter-aware energy-minimal DVS. 2) This technique is then extended to embed an awareness of the characteristics of dc-dc converters in general DVS techniques for multiple tasks. 3) We go on to propose a technique for generating a dc-dc converter that is most energy efficient for a particular application. 4) We also present an integrated framework, based on the above techniques, which addresses dc-dc converter configuration and DVS simultaneously. Experimental results show that the integrated framework is able to save up to 24.8% of energy compared with previous power management schemes, which do not consider the efficiency variation of dc-dc converters.
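The core observation can be made concrete with a toy model: battery energy is the load energy divided by the converter efficiency at that operating point, so a DVS policy that always picks the lowest feasible voltage can lose to one that accounts for where the converter is efficient. All numbers and the efficiency curve below are hypothetical illustrations, not values from the paper.

```python
def battery_energy(v, f_hz, cycles, c_eff, eta):
    """Battery energy for one task: dynamic CPU energy c_eff * v^2
    per cycle, delivered through a dc-dc converter with operating-
    point-dependent efficiency eta(v, i_load)."""
    t = cycles / f_hz
    cpu_energy = c_eff * v * v * cycles
    i_load = cpu_energy / (v * t)      # average load current
    return cpu_energy / eta(v, i_load)

# Hypothetical converter: efficient near its 1.0-V design point,
# poor at low output voltage.
eta = lambda v, i: 0.95 if v >= 1.0 else 0.55
```

With this curve, running 1e6 cycles at 0.8 V uses less CPU energy (0.64 mJ vs 1 mJ at 1.0 V for c_eff = 1 nF) yet draws more from the battery, because the converter is only 55% efficient there.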

128 citations


Journal ArticleDOI
TL;DR: A congestion-driven placement flow that considers in the global placement stage the routing demand to replace cells in order to avoid congested regions and allocates appropriate amounts of white space into different regions of the chip according to the congestion map.
Abstract: We present a two-stage congestion-driven placement flow. First, during each refinement stage of our multilevel global placement framework, we replace cells based on wirelength weighted by congestion level to reduce the routing demands of congested regions. Second, after the global placement stage, we allocate appropriate amounts of white space to different regions of the chip according to a congestion map by shifting cut lines in a top-down fashion, and we apply a detailed placer to legalize the placement and further reduce the half-perimeter wirelength while preserving the distribution of white space. Experimental results show that our placement flow achieves the best routability with the shortest routed wirelength among publicly available placement tools on the IBM v2 benchmarks. Our placer obtains 100% successful routings on 16 IBM v2 benchmarks with routed wirelengths 3.1% to 24.5% shorter than those of other placement tools. Moreover, our white-space allocation approach can significantly improve the routability of placements generated by other placement tools.

Journal ArticleDOI
TL;DR: The hybrid floorplanning approach combines linear programming and simulated annealing, which is shown to be very effective in obtaining high-quality solutions in a short runtime under multiobjective goals.
Abstract: This paper presents the first multiobjective microarchitectural floorplanning algorithm for high-performance processors implemented in two-dimensional (2-D) and three-dimensional (3-D) ICs. The floorplanner takes a microarchitectural netlist and determines the dimension as well as the placement of the functional modules into single- or multiple-device layers while simultaneously achieving high performance and thermal reliability. The traditional design objectives such as area and wirelength are also considered. The 3-D floorplanning algorithm considers the following 3-D-specific issues: vertical overlap optimization and bonding-aware layer partitioning. The hybrid floorplanning approach combines linear programming and simulated annealing, which is shown to be very effective in obtaining high-quality solutions in a short runtime under multiobjective goals. This paper provides comprehensive experimental results on making tradeoffs among performance, thermal, area, and wirelength for both 2-D and 3-D ICs.

Journal ArticleDOI
TL;DR: A novel paradigm for low-power variation-tolerant circuit design called critical path isolation for timing adaptiveness (CRISTA) is proposed, which allows aggressive voltage scaling by isolating and predicting the set of possible paths that may become critical under process variations.
Abstract: Design considerations for robustness with respect to variations and low-power operation typically impose contradictory design requirements. Low-power design techniques such as voltage scaling, dual-Vt, etc., can have a large negative impact on parametric yield. In this paper, we propose a novel paradigm for low-power variation-tolerant circuit design called critical path isolation for timing adaptiveness (CRISTA), which allows aggressive voltage scaling. The principal idea includes the following: 1) isolate and predict the set of possible paths that may become critical under process variations; 2) ensure that they are activated rarely; and 3) avoid possible delay failures in the critical paths by dynamically switching to two-cycle operation (assuming all standard operations are single cycle) when they are activated. This allows us to operate the circuit at reduced supply voltage while achieving the required yield. Simulation results on a set of benchmark circuits with 70-nm Berkeley-predictive-technology-model devices show an average of 60% improvement in power with a small overhead in performance and an 18% overhead in die area compared to conventional design. We also present two applications of the proposed methodology: 1) pipeline design for low power and 2) temperature-adaptive circuit design.

Journal ArticleDOI
TL;DR: This paper presents a novel parametric waveform model based on the Weibull function to represent particle strikes at individual nodes in the circuit and describes the construction of the descriptor object that efficiently captures the correlation between the transient waveforms and their associated rate distribution functions.
Abstract: Soft errors have emerged as an important reliability challenge for nanoscale very large scale integration designs. In this paper, we present a fast and efficient soft error rate (SER) analysis methodology for combinational circuits. We first present a novel parametric waveform model based on the Weibull function to represent particle strikes at individual nodes in the circuit. We then describe the construction of the descriptor object that efficiently captures the correlation between the transient waveforms and their associated rate distribution functions. The proposed algorithm consists of operations to inject, propagate, and merge these descriptors while traversing forward along the gates in a circuit. The parameterized waveforms enable an efficient static approach to calculating the SER of a circuit. We exercise the proposed approach on a wide variety of combinational circuits and observe that our algorithm has linear runtime with the size of the circuit. The runtimes for soft error estimation were observed to be on the order of 1 s, compared to several minutes or even hours for previously proposed methods.
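A Weibull-shaped current pulse is convenient because scaling a probability density by the collected charge q guarantees that the pulse integrates to q. A sketch of that idea; the paper's exact parameterization may differ:

```python
import math

def weibull_pulse(t, q, alpha, beta):
    """Particle-strike current i(t) = q * Weibull pdf(t; alpha, beta)
    = q * (beta/alpha) * (t/alpha)^(beta-1) * exp(-(t/alpha)^beta),
    so the total injected charge is q. alpha sets the time scale,
    beta the pulse shape."""
    if t <= 0.0:
        return 0.0
    u = t / alpha
    return q * (beta / alpha) * u ** (beta - 1.0) * math.exp(-(u ** beta))
```

Numerically integrating the pulse recovers the injected charge, which is what makes the two shape parameters a compact node-level strike model.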

Journal ArticleDOI
TL;DR: Experimental results indicate that the proposed method significantly reduces test power and in most cases provides greater test-data compression than LFSR reseeding alone.
Abstract: This paper presents a new low-power test-data-compression scheme based on linear feedback shift register (LFSR) reseeding. A drawback of compression schemes based on LFSR reseeding is that the unspecified bits are filled with random values, which results in a large number of transitions during scan-in, thereby causing high-power dissipation. A new encoding scheme that can be used in conjunction with any LFSR-reseeding scheme to significantly reduce test power and even further reduce test storage is presented. The proposed encoding scheme acts as the second stage of compression after LFSR reseeding. It accomplishes two goals. First, it reduces the number of transitions in the scan chains (by filling the unspecified bits in a different manner). Second, it reduces the number of specified bits that need to be generated via LFSR reseeding. Experimental results indicate that the proposed method significantly reduces test power and in most cases provides greater test-data compression than LFSR reseeding alone.
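The transition-reducing fill that such schemes build on is simple to state: every unspecified bit repeats the last specified value, so scan-in transitions occur only where the test cube forces them. This shows only the classic repeat fill; the paper's encoding additionally reduces the number of specified bits.

```python
def min_transition_fill(cube):
    """Fill 'x' bits in a scan-in test cube ('0'/'1'/'x') by
    repeating the last specified value, which minimizes
    scan-chain transitions (leading 'x' bits default to '0')."""
    out, last = [], '0'
    for bit in cube:
        if bit != 'x':
            last = bit
        out.append(last)
    return ''.join(out)

def transitions(bits):
    # Number of adjacent bit flips during scan-in.
    return sum(a != b for a, b in zip(bits, bits[1:]))
```

For the cube "x1xx0x", repeat fill yields "011100" with only two transitions, whereas a random fill of the four x bits would average far more.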

Journal ArticleDOI
TL;DR: Fundamental bounds on the number of wires required to provide joint crosstalk avoidance and error correction using memoryless codes are presented, and a code construction that results in practical codec circuits with the number of wires within 35% of the fundamental bounds is proposed.
Abstract: A reliable high-speed bus employing low-swing signaling can be designed by encoding the bus to prevent crosstalk and provide error correction. Coding for on-chip buses requires additional bus wires and codec circuits. In this paper, fundamental bounds on the number of wires required to provide joint crosstalk avoidance and error correction using memoryless codes are presented. The authors propose a code construction that results in practical codec circuits with the number of wires within 35% of the fundamental bounds. When applied to a 10-mm 32-bit bus in a 0.13-µm CMOS technology with low-swing signaling, one of the proposed codes provides a 2.14× speedup and 27.5% energy savings at the cost of a 2.1× area overhead, but without any loss in reliability.

Journal ArticleDOI
TL;DR: A parameterized reduction technique for highly nonlinear systems which is able to accurately capture the parameter dependence over the parameter ranges of plusmn50% from the nominal values and to achieve an average simulation speedup of about 10x.
Abstract: This paper presents a parameterized reduction technique for highly nonlinear systems. In our approach, we first approximate the nonlinear system with a convex combination of parameterized linear models created by linearizing the nonlinear system at points along training trajectories. Each of these linear models is then projected using a moment-matching scheme into a low-order subspace, resulting in a parameterized reduced-order nonlinear system. Several options for selecting the linear models and constructing the projection matrix are presented and analyzed. In addition, we propose a training scheme which automatically selects parameter-space training points by approximating parameter sensitivities. Results and comparisons are presented for three examples which contain distributed strong nonlinearities: a diode transmission line, a microelectromechanical switch, and a pulse-narrowing nonlinear transmission line. In most cases, we are able to accurately capture the parameter dependence over parameter ranges of ±50% from the nominal values and to achieve an average simulation speedup of about 10×.

Journal ArticleDOI
TL;DR: Experimental results demonstrate that the proposed algorithm is capable of synthesizing FIR filters with the least CSPT terms compared with existing filter synthesis algorithms.
Abstract: In this paper, a new efficient algorithm is proposed for the synthesis of low-complexity finite-impulse response (FIR) filters with resource sharing. The original problem statement based on the minimization of signed-power-of-two (SPT) terms has been reformulated to account for the sharable adders. The minimization of common SPT (CSPT) terms considered in our proposed algorithm addresses the optimization of the reusability of adders for two major types of common subexpressions, together with the minimization of adders that are needed for the spare SPT terms. The coefficient set is synthesized in two stages. In the first stage, CSPT terms in the vicinity of the scaled and rounded canonical signed digit (CSD) coefficients are allocated to obtain a CSD coefficient set, with the total number of CSPT terms not exceeding that of the initial coefficient set. The balanced normalized peak ripple magnitude due to the quantization error is fulfilled in the second stage by a local search method. The algorithm uses a common-subexpression-based Hamming-weight pyramid to search for low-cost candidate coefficients with preferential consideration of shared common subexpressions. Experimental results demonstrate that our algorithm is capable of synthesizing FIR filters with the fewest CSPT terms compared with existing filter synthesis algorithms.
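The SPT terms being minimized come from canonical signed-digit recoding, which rewrites a coefficient over digits {-1, 0, +1} with no two adjacent nonzeros; each nonzero digit costs one shifted add or subtract. A minimal recoder follows; the sharing of common subexpressions across coefficients, the heart of the paper, is not modeled here.

```python
def csd(n):
    """Canonical signed-digit recoding of a nonnegative integer.
    Returns digits in {-1, 0, 1}, least significant first; the count
    of nonzero digits is the coefficient's SPT cost."""
    digits = []
    while n:
        if n & 1:
            d = 2 - (n & 3)   # +1 if n % 4 == 1, -1 if n % 4 == 3
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits
```

For example, 7 = 111 in binary costs three SPT terms as plain binary but recodes to 8 - 1, only two, so a multiply-by-7 needs one subtractor instead of two adders.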

Journal ArticleDOI
TL;DR: An efficient circuit synthesis methodology comprising the proposed low-power logic options in a FinFET design library has been developed; results show about 8.5% area savings and 18% power savings over conventional FinFET technology for ISCAS85 benchmark circuits in 45-nm technology with no performance penalty.
Abstract: Independent control of the front and back gate in double gate (DG) devices can be used to merge parallel transistors in noncritical paths. This reduces the effective switching capacitance and, hence, the dynamic power dissipation of a circuit. However, efficient design of large-scale circuits with DG devices is not well explored due to the lack of proper modeling and large-scale design simulation tools. In this paper, we propose several low-power circuit options using independent-gate FinFETs. We developed semianalytical models for different FinFET logic gates to predict their performance. An efficient circuit synthesis methodology comprising the proposed low-power logic options in a FinFET design library has been developed. Results show about 8.5% area savings and 18% power savings over conventional FinFET technology for ISCAS85 benchmark circuits in 45-nm technology with no performance penalty.

Journal ArticleDOI
TL;DR: APEX begins by efficiently computing the high-order moments of the unknown distribution and then applies moment matching to approximate the characteristic function of the random distribution by an efficient rational function; it is proven that such a moment-matching approach is asymptotically convergent when applied to quadratic response surface models.
Abstract: While process variations are becoming more significant with each new IC technology generation, they are often modeled via linear regression models so that the resulting performance variations can be captured via normal distributions. Nonlinear response surface models (e.g., quadratic polynomials) can be utilized to capture larger scale process variations; however, such models result in nonnormal distributions for circuit performance. These performance distributions are difficult to capture efficiently since the distribution model is unknown. In this paper, an asymptotic-probability-extraction (APEX) method for estimating the unknown random distribution when using nonlinear response surface modeling is proposed. APEX begins by efficiently computing the high-order moments of the unknown distribution and then applies moment matching to approximate the characteristic function of the random distribution by an efficient rational function. It is proven that such a moment-matching approach is asymptotically convergent when applied to quadratic response surface models. In addition, a number of novel algorithms and methods, including binomial moment evaluation, PDF/CDF shifting, nonlinear companding, and reverse evaluation, are proposed to improve the computation efficiency and/or approximation accuracy. Several circuit examples from both digital and analog applications demonstrate that APEX can provide better accuracy than a Monte Carlo simulation with 10^4 samples and achieve up to 10× higher efficiency. The error incurred by the popular normal modeling assumption for several circuit examples designed in standard IC technologies is also shown.

Journal ArticleDOI
TL;DR: This paper introduces a design exploration methodology that identifies the lowest-cost FPGA pipelined implementation of an untimed synchronous data-flow graph by combining module selection with resource sharing in the context of pipeline scheduling.
Abstract: The primary goal during synthesis of digital signal processing (DSP) circuits is to minimize the hardware area while meeting a minimum throughput constraint. In field-programmable gate array (FPGA) implementations, significant area savings can be achieved by using slower, more area-efficient circuit modules and/or by time-multiplexing faster, larger circuit modules. Unfortunately, manual exploration of this design space is impractical. In this paper, we introduce a design exploration methodology that identifies the lowest-cost FPGA pipelined implementation of an untimed synchronous data-flow graph by combining module selection with resource sharing in the context of pipeline scheduling. These techniques are applied together to minimize the area cost of the FPGA implementation while meeting a user-specified minimum throughput constraint. Two different algorithms are introduced for exploring the large design space. We show that even for small DSP algorithms, combining these techniques can offer significant area savings relative to applying any of them alone.

Journal ArticleDOI
TL;DR: Three highly efficient thermal simulation algorithms for calculating the on-chip temperature distribution in a multilayered substrate structure are presented; all are based on the concept of the Green function and utilize the technique of discrete cosine transform.
Abstract: Due to technology scaling trends, the accurate and efficient calculations of the temperature distribution corresponding to a specific circuit layout and power density distribution will become indispensable in the design of high-performance very large scale integrated circuits. In this paper, we present three highly efficient thermal simulation algorithms for calculating the on-chip temperature distribution in a multilayered substrate structure. All three algorithms are based on the concept of the Green function and utilize the technique of discrete cosine transform. However, the application areas of the algorithms are different. The first algorithm is suitable for localized analysis in thermal problems, whereas the second algorithm targets full-chip temperature profiling. The third algorithm, which combines the advantages of the first two algorithms, can be used to perform thermal simulations where the accuracy requirement differs from place to place over the same chip. Experimental results show that all three algorithms can achieve relative errors of around 1% compared with that of a commercial computational fluid dynamic software package for thermal analysis, whereas their efficiencies are orders of magnitude higher than that of the direct application of the Green function method.
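The Green-function formulation computes temperature as a superposition of each source's response. Below is a direct, unaccelerated sketch with a hypothetical response table; the paper's contribution is evaluating this sum quickly via the discrete cosine transform, which is not shown here.

```python
import numpy as np

def temperature_rise(power, green):
    """T[i, j] = sum over sources power[k, l] * green[|i-k|, |j-l|],
    i.e. the Green-function response summed over all power sources.
    Direct evaluation is O(n^2 * m^2); the DCT-based algorithms in
    the paper reduce this dramatically."""
    n, m = power.shape
    T = np.zeros((n, m))
    for k in range(n):
        for l in range(m):
            p = power[k, l]
            if p:
                for i in range(n):
                    for j in range(m):
                        T[i, j] += p * green[abs(i - k), abs(j - l)]
    return T
```

With a single unit source at the origin, the temperature map simply reproduces the Green-function table; multiple sources add linearly, which is the property the DCT exploits.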

Journal ArticleDOI
TL;DR: An algorithm that generates a class of solutions to this time-multiplexed multiple-constant multiplication problem by "fusing" single-constant multiplication circuits for the required constants is presented.
Abstract: This paper studies area-efficient arithmetic circuits to multiply a fixed-point input value selectively by one of several preset fixed-point constants. We present an algorithm that generates a class of solutions to this time-multiplexed multiple-constant multiplication problem by "fusing" single-constant multiplication circuits for the required constants. Our evaluation compares our solution against a baseline implementation style that employs a full multiplier and a lookup table for the constants. The evaluation shows that we gain a significant area advantage, at the price of increased latency, for problem sizes (in terms of the number of constants) up to a threshold dependent on the bit-widths of the input and the constants. Our evaluation further shows that our solution is better suited for standard-cell application-specific integrated circuits than prior works on reconfigurable multiplier blocks.
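The building block behind such circuits is multiplierless constant multiplication: each constant is realized as a sum of shifted copies of the input. The sketch below shows only this binary shift-add decomposition and a selector over preset constants; the paper's actual contribution, fusing the per-constant circuits at the hardware level to share adders, is not modeled here.

```python
def shift_add_terms(constant):
    """Decompose a positive constant into its set bit positions: the shift
    amounts of a plain binary shift-add realization."""
    return [i for i in range(constant.bit_length()) if (constant >> i) & 1]

def tm_multiply(x, sel, constants):
    """Time-multiplexed constant multiplication: multiply x by
    constants[sel] using only shifts and adds, no general multiplier."""
    return sum(x << s for s in shift_add_terms(constants[sel]))
```

For example, multiplying by 12 (binary 1100) needs only two shifted terms, x<<2 and x<<3, rather than a full multiplier array.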

Journal ArticleDOI
TL;DR: An exploration of the microarchitectural tradeoffs for soft processors and a set of customization techniques that capitalizes on these tradeoffs to improve the efficiency of soft processors for specific applications are provided.
Abstract: As embedded systems designers increasingly use field-programmable gate arrays (FPGAs) while pursuing single-chip designs, they are motivated to have their designs also include soft processors, processors built using FPGA programmable logic. In this paper, we provide: 1) an exploration of the microarchitectural tradeoffs for soft processors and 2) a set of customization techniques that capitalizes on these tradeoffs to improve the efficiency of soft processors for specific applications. Using our infrastructure for automatically generating soft-processor implementations (which span a large area/speed design space while remaining competitive with Altera's Nios II variations), we quantify tradeoffs within soft-processor microarchitecture and explore the impact of tuning the microarchitecture to the application. In addition, we apply a technique of subsetting the instruction set to use only the portion utilized by the application. Through these two techniques, we can improve the performance-per-area of a soft processor for a specific application by an average of 25%.

Journal ArticleDOI
TL;DR: A novel methodology for testing network-on-chip (NoC) architectures, able to reduce the test time significantly as compared to previously proposed solutions, offering speedup factors ranging from 2x to 34x for the NoCs considered in the experimental evaluation.
Abstract: Network-on-chip (NoC) communication fabrics will be increasingly used in many large multicore system-on-chip designs in the near future. A relevant challenge that arises from this trend is that the test costs associated with NoC infrastructures may account for a significant part of the total test budget. In this paper, we present a novel methodology for testing such NoC architectures. The proposed methodology offers a tradeoff between test time and on-chip self-test resources. The fault models used are specific to deep submicrometer technologies and account for crosstalk effects due to interwire coupling. The novelty of our approach lies in the progressive reuse of the NoC infrastructure to transport test data to the components under test in a recursive manner. It exploits the inherent parallelism of the data transport mechanism to reduce the test time and, implicitly, the test cost. We also describe a suitable test-scheduling approach. In this manner, the test methodology developed in this paper is able to reduce the test time significantly as compared to previously proposed solutions, offering speedup factors ranging from 2x to 34x for the NoCs considered for experimental evaluation.

Journal ArticleDOI
TL;DR: A sizing algorithm is proposed, taking the NBTI-affected performance degradation into account to ensure the reliability of nanoscale circuits for a given period of time.
Abstract: Negative bias temperature instability (NBTI) has become one of the major causes for temporal reliability degradation of nanoscale circuits. In this paper, we analyze the temporal delay degradation of logic circuits due to NBTI. We show that knowing the threshold-voltage degradation of a single transistor due to NBTI, one can predict the performance degradation of a circuit with a reasonable degree of accuracy. We also propose a sizing algorithm, taking the NBTI-affected performance degradation into account to ensure the reliability of nanoscale circuits for a given period of time. Experimental results on several benchmark circuits show that with an average of 8.7% increase in area, one can ensure a reliable performance of circuits for ten years.
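The prediction step the abstract describes, from a single transistor's threshold-voltage shift to circuit delay, can be sketched with two standard modeling assumptions (not necessarily the paper's exact models): a power-law NBTI shift in stress time and the alpha-power delay model, delay ∝ 1/(Vdd − Vth)^α. All coefficients below are hypothetical.

```python
def delta_vth(t_seconds, a=0.005, n=1 / 6):
    """Hypothetical NBTI threshold-voltage shift: a power law in stress
    time, a common modeling assumption (coefficients are illustrative)."""
    return a * t_seconds ** n

def delay_degradation(t_seconds, vdd=1.0, vth0=0.3, alpha=1.3):
    """Relative path-delay increase under the alpha-power delay model,
    delay ~ 1/(Vdd - Vth)^alpha (an illustrative assumption)."""
    vth = vth0 + delta_vth(t_seconds)
    return ((vdd - vth0) / (vdd - vth)) ** alpha - 1.0
```

Evaluating the estimate at a ten-year horizon is exactly the kind of number a lifetime-aware sizing algorithm would feed back into gate upsizing.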

Journal ArticleDOI
TL;DR: A highly efficient algorithm based on dynamic programming is proposed to optimally solve slew buffering with discrete buffer locations and a new algorithm using the maximum matching technique is developed to handle the difficult cases in which no assumption is made on buffer input slew.
Abstract: As a prevalent constraint, sharp slew rate is often required in circuit design, which causes a huge demand for buffering resources. This problem requires ultrafast buffering techniques to handle large volume of nets while also minimizing buffering cost. This problem is intensively studied in this paper. First, a highly efficient algorithm based on dynamic programming is proposed to optimally solve slew buffering with discrete buffer locations. Second, a new algorithm using the maximum matching technique is developed to handle the difficult cases in which no assumption is made on buffer input slew. Third, an adaptive buffer selection approach is proposed to efficiently handle slew buffering with continuous buffer locations. Fourth, buffer blockage avoidance is handled, which makes the algorithms ready for practical use. Experiments on industrial netlists demonstrate that our algorithms are very effective and highly efficient: we achieve about 90x speedup and save up to 20% buffer area over the commonly used van Ginneken style buffering. The new algorithms also significantly outperform previous works that indirectly address the slew buffering problem.
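The first contribution, a dynamic program over discrete buffer locations, can be illustrated on a drastically simplified 1-D version: replace the real slew constraint with a cap on unbuffered segment length and minimize buffer count along a single net. This is a sketch of the DP structure only, not van Ginneken-style buffering or the paper's cost model.

```python
def min_buffers(candidates, sink, max_seg):
    """Toy slew-buffering DP on a 1-D net. candidates: sorted distances of
    legal buffer locations from the driver; insert the fewest buffers so no
    unbuffered segment exceeds max_seg (a stand-in for the slew limit).
    Returns None if no feasible solution exists."""
    points = [0.0] + list(candidates) + [sink]
    INF = float("inf")
    # best[i] = min buffers used, given a driver/buffer sits at points[i]
    best = [INF] * len(points)
    best[0] = 0
    for i in range(len(points)):
        if best[i] == INF:
            continue
        for j in range(i + 1, len(points)):
            if points[j] - points[i] > max_seg:
                break  # candidates are sorted, so farther ones also fail
            # Reaching an intermediate candidate costs one buffer there;
            # reaching the sink costs nothing extra.
            cost = best[i] + (1 if j < len(points) - 1 else 0)
            if cost < best[j]:
                best[j] = cost
    return best[-1] if best[-1] < INF else None
```

The real problem is harder (trees, buffer libraries, continuous locations, blockages), but the propagate-best-solutions-forward structure is the same.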

Journal ArticleDOI
TL;DR: A technique to optimize StWL in global and detail placement without a significant runtime penalty is developed, and this new optimization, along with congestion-driven whitespace distribution, improves overall Place-and-Route results, making the use of HPWL unnecessary.
Abstract: We demonstrate that Steiner-tree wirelength (StWL) correlates with routed wirelength (rWL) much better than the more common half-perimeter wirelength (HPWL) objective. Therefore, we develop a technique to optimize StWL in global and detail placement without a significant runtime penalty. This new optimization, along with congestion-driven whitespace distribution, improves overall Place-and-Route results, making the use of HPWL unnecessary. Additionally, our empirical results provide ample evidence that the fidelity of net-length estimates is more important than their accuracy in Place-and-Route. The new data structures that make our min-cut algorithms fast can also be useful in multilevel analytical placement. Our placement algorithm Rigorous Optimization Of Steiner-Trees Eases Routing (ROOSTER) outperforms the best published results for Dragon, Capo, FengShui, mPL-R/WSA, and APlace in terms of rWL by 10.7%, 5.6%, 9.3%, 5.5%, and 4.2%, respectively. Via counts, which are especially important at 90 nm and below, are improved by 15.6% over mPL-R/WSA and 11.9% over APlace.
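The gap between the two net-length objectives is easy to see concretely. Below, HPWL is the bounding-box half-perimeter, and the rectilinear minimum spanning tree (built with Prim's algorithm) serves as a cheap upper bound on Steiner-tree wirelength: RMST length exceeds RSMT length by at most 50% (Hwang's bound). This is a generic illustration, not the paper's Steiner evaluator.

```python
def hpwl(pins):
    """Half-perimeter wirelength of a net's bounding box."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def rmst_length(pins):
    """Rectilinear minimum spanning tree length via Prim's algorithm; an
    upper bound on (and common cheap proxy for) Steiner-tree wirelength."""
    n = len(pins)
    in_tree = [False] * n
    dist = [float("inf")] * n
    dist[0] = 0.0
    total = 0.0
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: dist[i])
        in_tree[u] = True
        total += dist[u]
        for v in range(n):
            if not in_tree[v]:
                d = abs(pins[u][0] - pins[v][0]) + abs(pins[u][1] - pins[v][1])
                dist[v] = min(dist[v], d)
    return total
```

For the four-pin net in the test below, HPWL reports 6 while any routable tree needs at least 8 units of wire, the kind of systematic underestimate that motivates optimizing StWL directly.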

Journal ArticleDOI
TL;DR: This paper proposes the first router for the flip-chip package in the literature that adopts a two-stage technique of global routing followed by detailed routing, and uses the network flow algorithm to solve the assignment problem from the wire-bonding pads to the bump pads and then create the global path for each net.
Abstract: The flip-chip package gives the highest chip density of any packaging method to support the pad-limited application-specific integrated circuit designs. In this paper, we propose the first router in the literature for the flip-chip package. The router can redistribute nets from wire-bonding pads to bump pads and then route each of them. The router adopts a two-stage technique of global routing followed by detailed routing. In global routing, we use the network flow algorithm to solve the assignment problem from the wire-bonding pads to the bump pads and then create the global path for each net. The detailed routing consists of three stages, namely: 1) cross-point assignment; 2) net ordering determination; and 3) track assignment, to complete the routing. Experimental results based on seven real designs from the industry demonstrate that the router can reduce the total wirelength by 10.2%, the critical wirelength by 13.4%, and the signal skews by 13.9%, as compared with a heuristic algorithm currently used in industry.
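The global-routing step is an assignment problem: match each wire-bonding pad to a bump pad at minimum total wirelength. The paper solves it with a network-flow formulation; the sketch below instead brute-forces the same objective over permutations, which is only viable for toy instances but makes the objective explicit.

```python
# Illustrative pad-to-bump assignment by brute force (the paper uses a
# network-flow algorithm for this, which scales to real designs).
from itertools import permutations

def assign_pads(wb_pads, bump_pads):
    """Return (assignment, cost): assignment[i] is the index of the bump
    pad matched to wire-bonding pad i, minimizing total Manhattan length."""
    def manh(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])
    best_cost, best = float("inf"), None
    for perm in permutations(range(len(bump_pads)), len(wb_pads)):
        cost = sum(manh(wb_pads[i], bump_pads[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best = cost, list(perm)
    return best, best_cost
```

A min-cost-flow solver finds the same optimum in polynomial time, which is why the flow formulation matters once designs have hundreds of pads.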

Journal ArticleDOI
TL;DR: A crucial component of ROAD is a novel projection-based scheme for quadratic (both polynomial and posynomial) performance modeling, which allows the approach to scale well to large problem sizes.
Abstract: In this paper, a robust analog design (ROAD) tool for post-tuning (i.e., locally optimizing) analog/RF circuits is proposed. Starting from an initial design derived from hand analysis or analog circuit optimization based on simplified models, ROAD extracts accurate performance models via transistor-level simulation and iteratively improves the circuit performance by a sequence of geometric programming steps. Importantly, ROAD sets up all design constraints to include large-scale process and environmental variations, thereby facilitating the tradeoff between yield and performance. A crucial component of ROAD is a novel projection-based scheme for quadratic (both polynomial and posynomial) performance modeling, which allows our approach to scale well to large problem sizes. A key feature of this projection-based scheme is a new implicit power iteration algorithm to find the optimal projection space and extract the unknown model coefficients with robust convergence. The efficacy of ROAD is demonstrated on several circuit examples.
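The "implicit power iteration" builds on the classical power iteration, whose basic recurrence is sketched below: repeatedly apply the matrix and normalize, converging to the dominant eigenvector, which then spans a low-dimensional projection space. ROAD's implicit variant avoids forming the quadratic model matrix explicitly; this sketch shows only the textbook form.

```python
def power_iteration(A, iters=200):
    """Classical power iteration on a square matrix (list of lists):
    returns (dominant eigenvalue estimate, unit eigenvector estimate)."""
    n = len(A)
    v = [1.0] * n
    for _ in range(iters):
        # One matrix-vector product, then renormalize.
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Rayleigh quotient v^T A v for the eigenvalue estimate.
    eigenvalue = sum(v[i] * sum(A[i][j] * v[j] for j in range(n)) for i in range(n))
    return eigenvalue, v
```

Projecting a high-dimensional quadratic model onto the few dominant directions found this way is what lets such modeling scale to circuits with many variation parameters.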