scispace - formally typeset
Search or ask a question

Showing papers in "IEEE Transactions on Very Large Scale Integration Systems in 2010"


Journal ArticleDOI
TL;DR: This paper develops a hierarchical framework for analyzing the impact of NBTI on the performance of logic circuits under various operation conditions, such as the supply voltage, temperature, and node switching activity, and proposes an efficient method to predict the degradation of circuit speed over a long period of time.
Abstract: Negative-bias-temperature instability (NBTI) has become the primary limiting factor of circuit life time. In this paper, we develop a hierarchical framework for analyzing the impact of NBTI on the performance of logic circuits under various operation conditions, such as the supply voltage, temperature, and node switching activity. Given a circuit topology and input switching activity, we propose an efficient method to predict the degradation of circuit speed over a long period of time. The effectiveness of our method is comprehensively demonstrated with the International Symposium on Circuits and Systems (ISCAS) benchmarks and a 65-nm industrial design. Furthermore, we extract the following key design insights for reliable circuit design under NBTI effect, including: 1) During dynamic operation, NBTI-induced degradation is relatively insensitive to supply voltage, but strongly dependent on temperature; 2) There is an optimum supply voltage that leads to the minimum of circuit performance degradation; circuit degradation rate actually goes up if supply voltage is lower than the optimum value; 3) Circuit performance degradation due to NBTI is highly sensitive to input vectors. The difference in delay degradation is up to 5× for various static and dynamic operations. Finally, we examine the interaction between NBTI effect, and process and design uncertainty in realistic conditions.

297 citations


Journal ArticleDOI
TL;DR: A novel error-tolerant adder (ETA) is proposed that is able to ease the strict restriction on accuracy, and at the same time achieve tremendous improvements in both the power consumption and speed performance.
Abstract: In modern VLSI technology, the occurrence of all kinds of errors has become inevitable. By adopting an emerging concept in VLSI design and test, error tolerance (ET), a novel error-tolerant adder (ETA) is proposed. The ETA is able to ease the strict restriction on accuracy, and at the same time achieve tremendous improvements in both the power consumption and speed performance. When compared to its conventional counterparts, the proposed ETA is able to attain more than 65% improvement in the Power-Delay Product (PDP). One important potential application of the proposed ETA is in digital signal processing systems that can tolerate certain amount of errors.

286 citations


Journal ArticleDOI
TL;DR: Analysis shows that domino logic circuits suffer from a doubled variability as compared to the static CMOS logic style, which adds to the well-known speed degradation due to the current contention associated with the keeper transistor.
Abstract: In this paper, the effect of process variations on delay is analyzed in depth for both static and dynamic CMOS logic styles. Analysis allows for gaining an insight into the delay dependence on fan-in, fan-out, and sizing in sub-100-nm technologies. Simple but reasonably accurate models are derived to capture the basic dependences. The effect of process variations in transistor stacks is analytically modeled and analyzed in detail. The impact of both interdie and intradie variations is evaluated and discussed. Interestingly, the input capacitance of static and dynamic logic is shown to be rather insensitive to variations. The delay variability was also shown to be a weak function of the input rise/fall time and load. Analysis shows that domino logic circuits suffer from a doubled variability as compared to the static CMOS logic style. The positive feedback associated with the keeper transistor is shown to be responsible for the variability increase, which, in turn, limits the speed performance. This adds to the well-known speed degradation due to the current contention associated with the keeper transistor. Monte Carlo simulations on a 90-nm technology, including layout parasitics, are performed to validate the results.

183 citations


Journal ArticleDOI
TL;DR: This work presents a lossless compression algorithm that has been designed for fast on-line data compression, and cache compression in particular, and reduces the proposed algorithm to a register transfer level hardware design, permitting performance, power consumption, and area estimation.
Abstract: Microprocessor designers have been torn between tight constraints on the amount of on-chip cache memory and the high latency of off-chip memory, such as dynamic random access memory. Accessing off-chip memory generally takes an order of magnitude more time than accessing on-chip cache, and two orders of magnitude more time than executing an instruction. Computer systems and microarchitecture researchers have proposed using hardware data compression units within the memory hierarchies of microprocessors in order to improve performance, energy efficiency, and functionality. However, most past work, and all work on cache compression, has made unsubstantiated assumptions about the performance, power consumption, and area overheads of the proposed compression algorithms and hardware. It is not possible to determine whether compression at levels of the memory hierarchy closest to the processor is beneficial without understanding its costs. Furthermore, as we show in this paper, raw compression ratio is not always the most important metric. In this work, we present a lossless compression algorithm that has been designed for fast on-line data compression, and cache compression in particular. The algorithm has a number of novel features tailored for this application, including combining pairs of compressed lines into one cache line and allowing parallel compression of multiple words while using a single dictionary and without degradation in compression ratio. We reduced the proposed algorithm to a register transfer level hardware design, permitting performance, power consumption, and area estimation. Experiments comparing our work to previous work are described.

161 citations


Journal ArticleDOI
TL;DR: The supply voltage for a minimum total energy operation (VMIN) based on activity factor is found that it is significantly higher for SRAM than for logic, and SRAM robustness is calculated using importance sampling, resulting in a seven-order run-time improvement over Monte Carlo sampling.
Abstract: Voltage scaling is desirable in static RAM (SRAM) to reduce energy consumption. However, commercial SRAM is susceptible to functional failures when VDD is scaled down. Although several published SRAM designs scale VDD to 200-300 mV, these designs do not sufficiently consider SRAM robustness, limiting them to small arrays because of yield constraints, and may not correctly target the minimum energy operation point. We examine the effects on area and energy for the differential 6T and 8T bit cells as VDD is scaled down, and the bit cells are either sized and doped, or assisted appropriately to maintain the same yield as with full VDD. SRAM robustness is calculated using importance sampling, resulting in a seven-order run-time improvement over Monte Carlo sampling. Scaling 6T and 8T SRAM VDD down to 500 mV and scaling 8T SRAM to 300 mV results in a 50% and 83% dynamic energy reduction, respectively, with no reduction in robustness and low area overhead, but increased leakage per bit. Using this information, we calculate the supply voltage for a minimum total energy operation (VMIN) based on activity factor and find that it is significantly higher for SRAM than for logic.

136 citations


Journal ArticleDOI
TL;DR: This paper analyzed and modeled the failure probabilities of STT MRAM cells due to parameter variations and developed an efficient design paradigm from circuit and/or architecture perspective-to improve the robustness and integration density.
Abstract: Spin-torque transfer magnetic RAM (STT MRAM) is a promising candidate for future embedded applications. It combines the desirable attributes of current memory technologies such as SRAM, DRAM, and flash memories (fast access time, low cost, high density, and non-volatility). It also solves the critical drawbacks of conventional MRAM technology: poor scalability and high write current. However, variations in process parameters can lead to a large number of cells to fail, severely affecting the yield of the memory array. In this paper, we analyzed and modeled the failure probabilities of STT MRAM cells due to parameter variations. Based on the model, we performed a thorough analysis of the impact of design parameters on parametric failures due to process variations. To achieve high memory yield without incurring expensive technology modification, we developed an efficient design paradigm from circuit and/or architecture perspective-to improve the robustness and integration density. The proposed technique effectively relaxes or completely decouples the conflicting design requirements for read stability, writability and cell area. It can be used at an early stage of the design cycle for yield enhancement.

131 citations


Journal ArticleDOI
TL;DR: This paper investigates the sensitivity of a power supply transient signal analysis method for detecting Trojans and focuses on determining the smallest detectable Trojan, i.e., the least number of gates a Trojan may have and still be detected, using a set of process simulation models that characterize a TSMC 0.18 μm process.
Abstract: Trust in reference to integrated circuits addresses the concern that the design and/or fabrication of the integrated circuit (IC) may be purposely altered by an adversary. The insertion of a hardware Trojan involves a deliberate and malicious change to an IC that adds or removes functionality or reduces its reliability. Trojans are designed to disable and/or destroy the IC at some future time or they may serve to leak confidential information covertly to the adversary. Trojans can be cleverly hidden by the adversary to make it extremely difficult for chip validation processes, such as manufacturing test, to accidentally discover them. This paper investigates the sensitivity of a power supply transient signal analysis method for detecting Trojans. In particular, we focus on determining the smallest detectable Trojan, i.e., the least number of gates a Trojan may have and still be detected, using a set of process simulation models that characterize a TSMC 0.18 μm process. We also evaluate the sensitivity of our Trojan detection method in the presence of measurement noise and background switching activity.

124 citations


Journal ArticleDOI
TL;DR: This work describes an on-chip NBTI degradation sensor using a delay-locked loop (DLL), in which the increase in pMOS threshold voltage due to NBTi stress is translated into a control voltage shift in the DLL for high sensing gain.
Abstract: Negative bias temperature instability (NBTI) is one of the most critical device reliability issues in sub-130 nm CMOS processes. In order to better understand the characteristics of this mechanism, accurate and efficient means of measuring its effects must be explored. In this work, we describe an on-chip NBTI degradation sensor using a delay-locked loop (DLL), in which the increase in pMOS threshold voltage due to NBTI stress is translated into a control voltage shift in the DLL for high sensing gain. The proposed sensor is capable of supporting both DC and AC stress modes. Measurements from a test chip fabricated in a 130 nm bulk CMOS process show an average gain of 10 in the operating range of interest, with measurement times in tens of microseconds possible for minimal unwanted threshold voltage recovery. NBTI degradation readings across a range of operating conditions are presented to demonstrate the flexibility of this system.

102 citations


Journal ArticleDOI
TL;DR: A dynamic bit-width adaptation scheme for applications using discrete cosine transform (DCT) that can efficiently trade off image quality and computation energy and a bit- width selection algorithm to select the appropriate operand bit- Widths.
Abstract: This paper presents a dynamic bit-width adaptation scheme for applications using discrete cosine transform (DCT). The technique can efficiently trade off image quality and computation energy. Based on sensitivity differences of 64 DCT coefficients, separate operand bit-widths are used for different frequency components to reduce computation energy. To select the appropriate operand bit-widths that achieve significant reduction of power consumption with minimum image quality degradation, we also propose a bit-width selection algorithm. The proposed variable bit precision DCT algorithm can be efficiently implemented using carry save adder trees. The reconfigurable DCT architecture can achieve power savings ranging from 36% to 75% compared to normal operation at the expense of minor image quality degradation.

95 citations


Journal ArticleDOI
TL;DR: Three error-correcting architectures, named as whole-page, sector-pipelined, and multistrip ones, are proposed and the VLSI design applies both algorithmic and architectural-level optimizations that include parallel algorithm transformation, resource sharing, and time multiplexing.
Abstract: Bit-error correction is crucial for realizing cost-effective and reliable NAND Flash-memory-based storage systems. In this paper, low-power and high-throughput error-correction circuits have been developed for multilevel cell (MLC) nand Flash memories. The developed circuits employ the Bose-Chaudhuri-Hocquenghem code to correct multiple random bit errors. The error-correcting codes for them are designed based on the bit-error characteristics of MLC NAND Flash memories for solid-state drives. To trade the code rate, circuit complexity, and power consumption, three error-correcting architectures, named as whole-page, sector-pipelined, and multistrip ones, are proposed. The VLSI design applies both algorithmic and architectural-level optimizations that include parallel algorithm transformation, resource sharing, and time multiplexing. The chip area, power consumption, and throughput results for these three architectures are presented.

94 citations


Journal ArticleDOI
TL;DR: An optimal soft real-time loop scheduling and voltage assignment algorithm, loop schedulingand voltage assignment to minimize energy, to minimize both dynamic and leakage energy via DVS and ABB is proposed.
Abstract: With the shrinking of technology feature sizes, the share of leakage in total power consumption of digital systems continues to grow. Traditional dynamic voltage scaling (DVS) fails to accurately address the impact of scaling on system power consumption as the leakage power increases exponentially. The combination of DVS and adaptive body biasing (ABB) is an effective technique to jointly optimize dynamic and leakage energy dissipation. In this paper, we propose an optimal soft real-time loop scheduling and voltage assignment algorithm, loop scheduling and voltage assignment to minimize energy, to minimize both dynamic and leakage energy via DVS and ABB. Voltage transition overhead has been considered in our approach. We conduct simulations on a set of digital signal processor benchmarks based on the power model of 70 nm technology. The simulation results show that our approach achieves significant energy saving compared to that of the integer linear programming approach.

Journal ArticleDOI
TL;DR: A new RAM cell structure design is proposed that can realize high speed and reliable sensing operations in the presence of relatively poor magnetoresistive ratio, while maintaining low sensing current through magnetic tunneling junctions (MTJs).
Abstract: With a great scalability potential, nonvolatile magnetoresistive memory with spin-torque transfer (STT) programming has become a topic of great current interest. This paper addresses cell structure design for STT magnetoresistive RAM, content addressable memory (CAM) and ternary CAM (TCAM). We propose a new RAM cell structure design that can realize high speed and reliable sensing operations in the presence of relatively poor magnetoresistive ratio, while maintaining low sensing current through magnetic tunneling junctions (MTJs). We further apply the same basic design principle to develop new cell structures for nonvolatile CAM, and TCAM. The effectiveness of the proposed RAM, CAM and TCAM cell structures has been demonstrated by circuit simulation at 0.18 ?m CMOS technology.

Journal ArticleDOI
TL;DR: A self-contained adaptive system for detecting and bypassing permanent errors in on-chip interconnects that reroutes data on erroneous links to a set of spare wires without interrupting the data flow is presented.
Abstract: We present a self-contained adaptive system for detecting and bypassing permanent errors in on-chip interconnects. The proposed system reroutes data on erroneous links to a set of spare wires without interrupting the data flow. To detect permanent errors at runtime, a novel in-line test (ILT) method using spare wires and a test pattern generator is proposed. In addition, an improved syndrome storing-based detection (SSD) method is presented and compared to the ILT method. Each detection method (ILT and SSD) is integrated individually into the noninterrupting adaptive system, and a case study is performed to compare them with Hamming and Bose-Chaudhuri-Hocquenghem (BCH) code implementations. In the presence of permanent errors, the probability of correct transmission in the proposed systems is improved by up to 140% over the standalone Hamming code. Furthermore, our methods achieve up to 38% area, 64% energy, and 61% latency improvements over the BCH implementation at comparable error performance.

Journal ArticleDOI
TL;DR: A postmanufacture self-tuning technique that aims to compensate for multiparameter variations is presented, which incorporates a response feature and hardware tuning knobs designed into the RF circuit and can be applied to other RF circuits as well.
Abstract: In the deep-submicrometer design regime, RF circuits are expected to be increasingly susceptible to process variations, and thereby suffer from significant loss of parametric yield. To address this problem, a postmanufacture self-tuning technique that aims to compensate for multiparameter variations is presented. The proposed method incorporates a ?response feature? detector and ?hardware tuning knobs,? designed into the RF circuit. The RF device test response to a specially crafted diagnostic test stimulus is logged via the built-in detector and embedded analog-to-digital converter. Analysis and prediction of the optimal tuning knob control values for performance compensation is performed using software running on the baseband DSP processor. As a result, the RF circuit performance can be diagnosed and tuned with minimal assistance from external test equipment. Multiple RF performance parameters can be adjusted simultaneously under tuning knob control. The proposed concepts are illustrated for an RF low-noise amplifier (LNA) design and can be applied to other RF circuits as well. A simulation case study and hardware measurements on a fabricated 1.9-GHz LNAs show significant parametric yield enhancement (up to 58%) across the critical RF performance specifications of interest.

Journal ArticleDOI
Yi Chen1, Xiaobin Wang1, Hai Li, Haiwen Xi1, Yuan Yan2, Wenzhong Zhu1 
TL;DR: By using the model, the scalability of STT-RAM technology down to a 22-nm Bulk-CMOS technology node is analyzed and the tradeoffs among the MTJ switching current, the thermal stability of theMTJ and the MOS transistor driving strength are discussed.
Abstract: We propose a magnetic and electric level spin-transfer torque random access memory (STT-RAM) cell model to simulate the write operation of an STT-RAM. The model of a magnetic tunneling junction (MTJ) is modified to take into account the electrical response of the MOS transistor that is connected to the MTJ. A dynamic design flow is also proposed to minimize any unnecessary design margin in an STT-RAM cell design by leveraging from the new STT-RAM cell model. The design of an STT-RAM cell with a one-transistor-one-MTJ (1T1J) structure shows that our technique can reduce more than 22% of the STT-RAM cell area, compared with a conventional STT-RAM cell model at a TSMC 90-nm technology node. The performance and the reliability of the memory cell were unaffected. By using our model, we analyzed the scalability of STT-RAM technology down to a 22-nm Bulk-CMOS technology node. The tradeoffs among the MTJ switching current, the thermal stability of the MTJ and the MOS transistor driving strength are discussed. Some magnetic- and circuit-level solutions to achieve 9F2 STT-RAM cell area at 22-nm technology node are also discussed.

Journal ArticleDOI
TL;DR: A novel customizable architecture of a neuromorphic robust optical flow (multichannel gradient model) based on reconfigurable hardware with the properties of the cortical motion pathway is presented, thus obtaining a useful framework for building future complex bioinspired real-time systems with high computational complexity.
Abstract: Motion estimation from image sequences, called optical flow, has been deeply analyzed by the scientific community. Despite the number of different models and algorithms, none of them covers all problems associated with real-world processing. This paper presents a novel customizable architecture of a neuromorphic robust optical flow (multichannel gradient model) based on reconfigurable hardware with the properties of the cortical motion pathway, thus obtaining a useful framework for building future complex bioinspired real-time systems with high computational complexity. The presented architecture is customizable and adaptable, while emulating several neuromorphic properties, such as the use of several information channels of small bit width, which is the nature of the brain. This paper includes the resource usage and performance data, as well as a comparison with other systems. This hardware platform has many application fields in difficult environments due to its bioinspired nature and robustness properties, and it can be used as starting point in more complex systems.

Journal ArticleDOI
TL;DR: A series of bandgap references (BGRs) using a self-biased symmetrically matched current-voltage mirror (SM CVM) in reducing systematic offset, thus achieving an excellent line regulation, is presented.
Abstract: A series of bandgap references (BGRs) using a self-biased symmetrically matched current-voltage mirror (SM CVM) in reducing systematic offset, thus achieving an excellent line regulation, is presented. By replacing the operational amplifier with a CVM in the feedback loop, current consumption is much reduced. An SM buffer stage that is capable of driving a resistive load with minor degradation in temperature coefficient (TC) and line regulation is also presented. The technique is extended to design a sub-1-V BGR with a TC-cancellation output buffer. All circuits are designed using a 0.35- CMOS process, and experimental results are presented, confirming the analysis.

Journal ArticleDOI
TL;DR: A novel current-mode transimpedance amplifier exploiting the common gate input stage with common source active feedback with low input impedance similar to that of the well-known regulated cascode (RGC) topology is realized in CHRT 0.8 V RFCMOS technology.
Abstract: In this paper, a novel current-mode transimpedance amplifier (TIA) exploiting the common gate input stage with common source active feedback has been realized in CHRT 0.18 ?m -1.8 V RFCMOS technology. The proposed active feedback TIA input stage is able to achieve a low input impedance similar to that of the well-known regulated cascode (RGC) topology. The proposed TIA also employs series inductive peaking and capacitive degeneration techniques to enhance the bandwidth and the gain. The measured transimpedance gain is 54.6 dB? with a -3 dB bandwidth of about 7 GHz for a total input parasitic capacitance of 0.3 pF. The measured average input referred noise current spectral density is about 17.5 pA/?{Hz} up to 7 GHz. The measured group delay is within 65 ± 10 ps over the bandwidth of interest. The chip consumes 18.6 mW DC power from a single 1.8 V supply. The mathematical analysis of the proposed TIA is presented together with a detailed noise analysis based on the van der Ziel MOSFET noise model. The effect of the induced gate noise in a broadband TIA is included.

Journal ArticleDOI
TL;DR: It has been shown that the 6T SRAM cell will be allowed long reign, even in the 15-nm process node, if ¿VT can be suppressed to < 70 mV thanks to effective oxide thickness scaling for the low-standby-power process; otherwise, 10T and 8T with read-modify-write will be needed after¿VT becomes > 85 and 75 mV, respectively.
Abstract: This paper compares area scaling capabilities of many kinds of SRAM margin-assist solutions for VT variability issues, which are based on various efforts by not only the cell topology changes from 6T to 8T and 10T but also incorporation of multiple voltage supply for cell terminal biasing and timing sequence controls of read and write. The various SRAM solutions are analyzed in light of an impact on the required area overhead for each design solution given by ever-increasing VT random variation (?VT)> resulting in a slowdown in the SRAM scaling pace. In order to predict the area scaling trends among various SRAM solutions, two different ?VT-increasing scenarios of being pessimistic and optimistic are assumed, where o-vt becomes > 130 mV and suppressed to 85 and 75 mV, respectively.

Journal ArticleDOI
TL;DR: A novel sort-free approach to path extension, as well as quantized metrics result in a high-throughput VLSI architecture with lower power and area consumption compared to state-of-the-art published systems.
Abstract: This paper describes the design and very-large-scale integration (VLSI) architecture for a 4 × 4 breadth-first K-best multiple-input-multiple-output (MIMO) decoder using a 64 quadrature-amplitude modulation (QAM) scheme. A novel sort-free approach to path extension, as well as quantized metrics result in a high-throughput VLSI architecture with lower power and area consumption compared to state-of-the-art published systems. Functionality is confirmed via a field-programmable gate array (FPGA) implementation on a Xilinx Virtex II Pro FPGA. Comparison of simulation and measurements are given, and FPGA utilization figures are provided. Finally, VLSI architectural tradeoffs are explored for a synthesized application-specific IC (ASIC) implementation in a 65-nm CMOS technology.

Journal ArticleDOI
TL;DR: A memory fault tolerance design solution geared to MLC NAND flash memories to concatenate trellis coded modulation (TCM) with an outer BCH code, which can greatly improve the error correction performance compared with the current design practice that uses BCH codes only.
Abstract: By storing more than one bit in each memory cell, multi-level per cell (MLC) NAND flash memories are dominating global flash memory market due to their appealing storage density advantage. However, continuous technology scaling makes MLC NAND flash memories increasingly subject to worse raw storage reliability. This paper presents a memory fault tolerance design solution geared to MLC NAND flash memories. The basic idea is to concatenate trellis coded modulation (TCM) with an outer BCH code, which can greatly improve the error correction performance compared with the current design practice that uses BCH codes only. The key is that TCM can well leverage the multi-level storage characteristic to reduce the memory bit error rate and hence relieve the burden of outer BCH code, at no cost of extra redundant memory cells. The superior performance of such concatenated BCH-TCM coding systems for MLC NAND flash memories has been well demonstrated through computer simulations. A modified TCM demodulation approach is further proposed to improve the tolerance to static memory cell defects. We also address the associated practical implementation issues in case of using either single-page or multi-page programming strategy, and demonstrate the silicon implementation efficiency through application-specific integrated circuit design at 65 nm node.

Journal ArticleDOI
TL;DR: An analytical method to estimate the statistics of computer arithmetic computation errors due to supply voltage overscaling is presented and can be used to choose the appropriate computer arithmetic architecture in voltage-overscaled signal processing systems.
Abstract: It has been recently demonstrated that digital signal processing systems may possibly leverage unconventional voltage overscaling (VOS) to reduce energy consumption while maintaining satisfactory signal processing performance. Due to the computation-intensive nature of most signal processing algorithms, the energy saving potential largely depends on the behavior of computer arithmetic units in response to overscaled supply voltage. This paper shows that different hardware implementations of the same computer arithmetic function may respond to VOS very differently and result in different energy saving potentials. Therefore, the selection of appropriate computer arithmetic architecture is an important issue in voltage-overscaled signal processing system design. This paper presents an analytical method to estimate the statistics of computer arithmetic computation errors due to supply voltage overscaling. Compared with computation-intensive circuit simulations, this analytical approach can be several orders of magnitude faster and can achieve a reasonable accuracy. This approach can be used to choose the appropriate computer arithmetic architecture in voltage-overscaled signal processing systems. Finally, we carry out case studies on a coordinate rotation digital computer processor and a finite-impulse-response filter to further demonstrate the importance of choosing proper computer arithmetic implementations.

Journal ArticleDOI
TL;DR: The proposed MAC showed the superior properties to the standard design in many ways and performance twice as much as the previous research in the similar clock frequency.
Abstract: In this paper, we proposed a new architecture of multiplier-and-accumulator (MAC) for high-speed arithmetic. By combining multiplication with accumulation and devising a hybrid type of carry save adder (CSA), the performance was improved. Since the accumulator that has the largest delay in MAC was merged into CSA, the overall performance was elevated. The proposed CSA tree uses 1's-complement-based radix-2 modified Booth's algorithm (MBA) and has the modified array for the sign extension in order to increase the bit density of the operands. The CSA propagates the carries to the least significant bits of the partial products and generates the least significant bits in advance to decrease the number of the input bits of the final adder. Also, the proposed MAC accumulates the intermediate results in the type of sum and carry bits instead of the output of the final adder, which made it possible to optimize the pipeline scheme to improve the performance. The proposed architecture was synthesized with 250, 180 and 130 ?m, and 90 nm standard CMOS library. Based on the theoretical and experimental estimation, we analyzed the results such as the amount of hardware resources, delay, and pipelining scheme. We used Sakurai's alpha power law for the delay modeling. The proposed MAC showed the superior properties to the standard design in many ways and performance twice as much as the previous research in the similar clock frequency. We expect that the proposed MAC can be adapted to various fields requiring high performance such as the signal processing areas.

Journal ArticleDOI
TL;DR: It is shown that FinFET-based 6-T SRAM cells designed with pass-gate feedback (PGFB) achieve significant improvements in the cell read stability without area penalty, allowing for simultaneous read and write yield enhancements when the PGFB and PUWG designs are used in combination.
Abstract: Process-induced variations and sub-threshold leakage in bulk-Si technology limit the scaling of SRAM into sub-32 nm nodes. New device architectures are being considered to improve control and reduce short channel effects. Among the likely candidates, FinFETs are the most attractive option because of their good scalability and possibilities for further SRAM performance and yield enhancement through independent gating. The enhancements to read/write margins and yield are investigated in detail for two cell designs employing independently gated FinFETs. It is shown that FinFET-based 6-T SRAM cells designed with pass-gate feedback (PGFB) achieve significant improvements in the cell read stability without area penalty. The write-ability of the cell can be improved through the use of pull-up write gating (PUWG) with a separate write word line (WWL). The benefits of these two approaches are complementary and additive, allowing for simultaneous read and write yield enhancements when the PGFB and PUWG designs are used in combination.

Journal ArticleDOI
TL;DR: The architecture performs encryption and decryption of large data with 128-b key in CBC mode using on-the-fly key generation and composite field S-box, making it more cost effective (with better thousand-gate/gigabit-per-second ratio) than conventional methods.
Abstract: As networking technology advances, the gap between network bandwidth and network processing power widens. Information security issues add to the need for developing high-performance network processing hardware, particularly that for real-time processing of cryptographic algorithms. This paper presents a configurable architecture for Advanced Encryption Standard (AES) encryption, whose major building blocks are a group of AES processors. Each AES processor provides 219 block cipher schemes with a novel on-the-fly key expansion design for the original AES algorithm and an extended AES algorithm. In this multicore architecture, the memory controller of each AES processor is designed for the maximum overlapping between data transfer and encryption, reducing interrupt handling load of the host processor. This design can be applied to high-speed systems since its independent data paths greatly reduces the input/output bandwidth problem. A test chip has been fabricated for the AES architecture, using a standard 0.25-?m CMOS process. It has a silicon area of 6.29 mm2, containing about 200,500 logic gates, and runs at a 66-MHz clock. In electronic codebook (ECB) and cipher-block chaining (CBC) cipher modes, the throughput rates are 844.9, 704, and 603.4 Mb/s for 128-, 192-, and 256-b keys, respectively. In order to achieve 1-Gb/s throughput (including overhead) at the worst case, we design a multicore architecture containing three AES processors with 0.18-?m CMOS process. The throughput rate of the architecture is between 1.29 and 3.75 Gb/s at 102 MHz. The architecture performs encryption and decryption of large data with 128-b key in CBC mode using on-the-fly key generation and composite field S-box, making it more cost effective (with better thousand-gate/gigabit-per-second ratio) than conventional methods.

Journal ArticleDOI
TL;DR: The results show that, thanks to a larger threshold voltage sensitivity to back biasing, the FinFET technology is able to offer a more favorable compromise between standby power consumption and dynamic performance and is well suited for implementing fast and energy-efficient adaptive back-biasing strategies.
Abstract: In this paper, we study the advantages offered by multi-gate fin FETs (FinFETs) over traditional bulk MOSFETs when low standby power circuit techniques are implemented. More precisely, we simulated various vehicle circuits, ranging from ring oscillators to mirror full adders, to investigate the effectiveness of back biasing and transistor-stacking in both FinFETs and bulk MOSFETs. The opportunity to separate the gates of FinFETs and to operate them independently has been systematically analyzed; mixed connected- and independent-gate circuits have also been evaluated. The study spans over the device, the layout, and the circuit level of abstraction and appropriate figures of merit are introduced to quantify the potential advantage of different schemes. Our results show that, thanks to a larger threshold voltage sensitivity to back biasing, the FinFET technology is able to offer a more favorable compromise between standby power consumption and dynamic performance and is well suited for implementing fast and energy-efficient adaptive back-biasing strategies.

Journal ArticleDOI
TL;DR: A novel variation-tolerant keeper architecture is proposed, which is capable of significantly reducing contention and improving performance and power consumption and exhibits the lowest delay deviation under different levels of process variations.
Abstract: Dynamic gates have been excellent choice in the design of high-performance modules in modern microprocessors. The only limitation of dynamic gates is their relatively low noise margin compared to that of standard CMOS gates. Traditionally, this issue has been resolved by employing a pMOS keeper circuit that compensates for leakage current of the pull-down nMOS network. In the earlier technology nodes, the keeper circuit could improve reliability of the dynamic gates with minor performance penalty. However, aggressive scaling trends of CMOS technology along with increasing levels of process variations have reduced effectiveness of the traditional keeper approach. This is because to maintain an acceptable noise margin level in deep sub-100 nm technologies, large pMOS keepers must be employed, which generates substantial contention between the keeper and the pull-down network, and hence results in severe loss of performance and high power consumption. This problem is more severe in wide fan-in dynamic gates due to the large number of leaky nMOS devices connected to the dynamic node. In this paper, a novel variation-tolerant keeper architecture is proposed, which is capable of significantly reducing contention and improving performance and power consumption. Using circuit simulations, the overall improved characteristics of the proposed keeper are demonstrated in comparison to those of the traditional as well as several state-of-the-art keepers. The proposed keeper exhibits the lowest delay deviation under different levels of process variations. Also, it is shown that for an eight-input or gate, in presence of 15% Vth fluctuations, the proposed architecture can lead to 20%, 15%, and more than 40% reduction in power consumption, mean delay, and standard deviation of delay, respectively, when compared to the traditional keeper circuit.

Journal ArticleDOI
TL;DR: It is argued in this paper that the use of error-control schemes in on-chip networks results in degradable systems, hence, performance and reliability must be measured jointly using a unified measure, i.e., performability.
Abstract: High reliability against noise, high performance, and low energy consumption are key objectives in the design of on-chip networks. Recently some researchers have considered the impact of various error-control schemes on these objectives and on the tradeoff between them. In all these works performance and reliability are measured separately. However, we will argue in this paper that the use of error-control schemes in on-chip networks results in degradable systems, hence, performance and reliability must be measured jointly using a unified measure, i.e., performability. Based on the traditional concept of performability, we provide a definition for the ?Interconnect Performability?. Analytical models are developed for interconnect performability and expected energy consumption. A detailed comparative analysis of the error-control schemes using the performability analytical models and SPICE simulations is provided taking into consideration voltage swing variations (used to reduce interconnect energy consumption) and variations in wire length. Furthermore, the impact of noise power and time constraint on the effectiveness of error-control schemes are analyzed.

Journal ArticleDOI
TL;DR: A novel test data compression technique using bitmasks which provides a substantial improvement in the compression efficiency without introducing any additional decompression penalty is proposed.
Abstract: Higher circuit densities in system-on-chip (SOC) designs have led to drastic increase in test data volume. Larger test data size demands not only higher memory requirements, but also an increase in testing time. Test data compression addresses this problem by reducing the test data volume without affecting the overall system performance. This paper proposes a novel test data compression technique using bitmasks which provides a substantial improvement in the compression efficiency without introducing any additional decompression penalty. The major contributions of this paper are as follows: 1) it develops an efficient bitmask selection technique for test data in order to create maximum matching patterns; 2) it develops an efficient dictionary selection method which takes into account the bitmask based compression; and 3) it proposes a test compression technique using efficient dictionary and bitmask selection to significantly reduce the testing time and memory requirements. We have applied our method on various test data sets and compared our results with other existing test compression techniques. Our algorithm outperforms existing dictionary-based approaches by up to 30%, giving a best possible test compression of 92%.

Journal ArticleDOI
TL;DR: The analysis shows that control voltage (V cnt) of voltage-controlled oscillator in PLL can be used as a dynamic performance signature of an operating IC and a variation-resilient system technique using adaptive body biasing (ABB) is proposed.
Abstract: Performance variability in digital integrated circuits can largely affect parametric yield and product reliability in ultra deep submicrometer technologies. As a result, variation resilience is becoming an essential design requirement for future technology nodes, especially for timing critical applications. This paper proposes an on-chip variability sensor using phase-locked loop (PLL) to detect process, supply voltage (V DD), and temperature variations (process, voltage, and temperature variation) or even temporal reliability degradation stemming from negative bias temperature instability. Our analysis shows that control voltage (V cnt) of voltage-controlled oscillator in PLL can be used as a dynamic performance signature of an operating IC. Along with the proposed PLL-based sensor circuit, we also propose a variation-resilient system technique using adaptive body biasing (ABB). The PLL V cnt signal is efficiently transformed to an optimal body bias signal for various circuit blocks to avoid possible timing failures. Correspondingly, circuits can be designed with significantly relaxed timing constraint compared to conventional approaches, where a large amount of design resources can be wasted to take care of the worst-case situations. We demonstrated our approach on a test chip fabricated in IBM 130-nm CMOS technology. Measurement results show that the PLL-based sensor is cable of tracking various sources of circuit variations. Optimization analysis shows that 42% and 43% reduction in area and power can be obtained using our approach compared to the worst-case sizing. The proposed study refers to our previous study introduced in with major improvements in measurement results and analysis.