scispace - formally typeset
Search or ask a question

Showing papers in "IEEE Transactions on Very Large Scale Integration Systems in 2005"


Journal ArticleDOI
TL;DR: It is shown that arbiter-based PUFs are realizable and well suited to build key-cards that need to be resistant to physical attacks and to be identified securely and reliably over a practical range of environmental variations such as temperature and power supply voltage.
Abstract: Modern cryptographic protocols are based on the premise that only authorized participants can obtain secret keys and access to information systems. However, various kinds of tampering methods have been devised to extract secret keys from conditional access systems such as smartcards and ATMs. Arbiter-based physical unclonable functions (PUFs) exploit the statistical delay variation of wires and transistors across integrated circuits (ICs) in manufacturing processes to build unclonable secret keys. We fabricated arbiter-based PUFs in custom silicon and investigated the identification capability, reliability, and security of this scheme. Experimental results and theoretical studies show that a sufficient amount of inter-chip variation exists to enable each IC to be identified securely and reliably over a practical range of environmental variations such as temperature and power supply voltage. We show that arbiter-based PUFs are realizable and well suited to build, for example, key-cards that need to be resistant to physical attacks.

1,002 citations


Journal ArticleDOI
TL;DR: In this article, the authors investigated the area and power-delay performances of low-voltage full adder cells in different CMOS logic styles for the predominating tree structured arithmetic circuits.
Abstract: The general objective of our work is to investigate the area and power-delay performances of low-voltage full adder cells in different CMOS logic styles for the predominating tree structured arithmetic circuits. A new hybrid style full adder circuit is also presented. The sum and carry generation circuits of the proposed full adder are designed with hybrid logic styles. To operate at ultra-low supply voltage, the pass logic circuit that cogenerates the intermediate XOR and XNOR outputs has been improved to overcome the switching delay problem. As full adders are frequently employed in a tree structured configuration for high-performance arithmetic circuits, a cascaded simulation structure is introduced to evaluate the full adders in a realistic application environment. A systematic and elegant procedure to scale the transistor for minimal power-delay product is proposed. The circuits being studied are optimized for energy efficiency at 0.18-/spl mu/m CMOS process technology. With the proposed simulation environment, it is shown that some survival cells in stand alone operation at low voltage may fail when cascaded in a larger circuit, either due to the lack of drivability or unsatisfactory speed of operation. The proposed hybrid full adder exhibits not only the full swing logic and balanced outputs but also strong output drivability. The increase in the transistor count of its complementary CMOS output stage is compensated by its area efficient layout. Therefore, it remains one of the best contenders for designing large tree structured arithmetic circuits with reduced energy consumption while keeping the increase in area to a minimum.

344 citations


Journal ArticleDOI
Amit Agarwal1, Bipul C. Paul1, Hamid Mahmoodi1, Animesh Datta1, Kaushik Roy1 
TL;DR: This technique dynamically detects and replaces faulty cells by dynamically resizing the cache and surpasses all the contemporary fault tolerant schemes such as row/column redundancy and error-correcting code (ECC) in handling failures due to process variation.
Abstract: Process parameter variations are expected to be significantly high in a sub-50-nm technology regime, which can severely affect the yield, unless very conservative design techniques are employed. The parameter variations are random in nature and are expected to be more pronounced in minimum geometry transistors commonly used in memories such as SRAM. Consequently, a large number of cells in a memory are expected to be faulty due to variations in different process parameters. We analyze the impact of process variation on the different failure mechanisms in SRAM cells. We also propose a process-tolerant cache architecture suitable for high-performance memory. This technique dynamically detects and replaces faulty cells by dynamically resizing the cache. It surpasses all the contemporary fault tolerant schemes such as row/column redundancy and error-correcting code (ECC) in handling failures due to process variation. Experimental results on a 64-K direct map L1 cache show that the proposed technique can achieve 94% yield compared to its original 33% yield (standard cache) in a 45-nm predictive technology under /spl sigma//sub Vt-inter/=/spl sigma//sub Vt-intra/=30 mV.

207 citations


Journal ArticleDOI
TL;DR: This paper presents a coding framework derived from a communication-theoretic view of a DSM bus to jointly address power, delay, and reliability, and shows that coding is a better alternative to repeater insertion for delay reduction as it reduces power dissipation at the same time.
Abstract: Global buses in deep-submicron (DSM) system-on-chip designs consume significant amounts of power, have large propagation delays, and are susceptible to errors due to DSM noise. Coding schemes exist that tackle these problems individually. In this paper, we present a coding framework derived from a communication-theoretic view of a DSM bus to jointly address power, delay, and reliability. In this framework, the data is first passed through a nonlinear source coder that reduces self and coupling transition activity and imposes a constraint on the peak coupling transitions on the bus. Next, a linear error control coder adds redundancy to enable error detection and correction. The framework is employed to efficiently combine existing codes and to derive novel codes that span a wide range of tradeoffs between bus delay, codec latency, power, area, and reliability. Using simulation results in 0.13-/spl mu/m CMOS technology, we show that coding is a better alternative to repeater insertion for delay reduction as it reduces power dissipation at the same time. For a 10-mm 4-bit bus, we show that a bus employing the proposed codes achieves up to 2.17/spl times/ speed-up and 33% energy savings over a bus employing Hamming code. For a 10-mm 32-bit bus, we show that 1.7/spl times/ speed-up and 27% reduction in energy are achievable over an uncoded bus by employing low-swing signaling without any loss in reliability.

178 citations


Journal ArticleDOI
TL;DR: A new test-data compression technique that uses exactly nine codewords that provides significant reduction in test- data volume and test-application time and is flexible in utilizing both fixed- and variable-length blocks.
Abstract: This paper presents a new test-data compression technique that uses exactly nine codewords. Our technique aims at precomputed data of intellectual property cores in system-on-chips and does not require any structural information of cores. The technique is flexible in utilizing both fixed- and variable-length blocks. In spite of its simplicity, it provides significant reduction in test-data volume and test-application time. The decompression logic is very small and can be implemented fully independent of the precomputed test-data set. Our technique is flexible and can be efficiently adopted for single- or multiple-scan chain designs. Experimental results for ISCAS'89 benchmarks illustrate the flexibility and efficiency of the proposed technique.

144 citations


Journal ArticleDOI
TL;DR: An automated design methodology for MCML circuits is proposed to overcome the complexities of the design process, and a comprehensive analytical formulation for the design parameters of MC ML circuits using the BSIM3v3 model is introduced.
Abstract: The interest in MOS current-mode logic (MCML) is increasing because of its ability to dissipate less power than conventional CMOS circuits at high frequencies, while providing an analog friendly environment. Moreover, automated design methodologies are gaining attention by circuit designers to provide shorter design cycles and faster time to market. This paper provides designers with an insight to the different tradeoffs involved in the design of MCML circuits to efficiently and systematically design MCML circuits. A comprehensive analytical formulation for the design parameters of MCML circuits using the BSIM3v3 model is introduced. In addition, a closed-form expression for the noise margin of two-level MCML circuits is derived. In order to verify the validity of the analytical formulations, an automated design methodology for MCML circuits is proposed to overcome the complexities of the design process. The effectiveness of the design methodology and the accuracy of the analytical formulations are tested by designing several MCML benchmarks built in a 0.18-/spl mu/m CMOS technology. The error in the required performance in the designed circuits is within 11% when compared to HSPICE simulations. A worst case parameter variations modeling is presented to investigate the impact of variations on MCML circuits as well as designing MCML circuits for variability. Finally, the impact of variations on MCML circuits is investigated with technology scaling and different circuit architectures.

132 citations


Journal ArticleDOI
TL;DR: A novel circuit technique to virtually eliminate test power dissipation in combinational logic by masking signal transitions at the logic inputs during scan shifting by inserting an extra supply gating transistor in the supply to ground path for the first-level gates at the outputs of the scan flip-flops.
Abstract: Reduction in test power is important to improve battery lifetime in portable electronic devices employing periodic self-test, to increase reliability of testing, and to reduce test cost. In scan-based testing, a significant fraction of total test power is dissipated in the combinational block. In this paper, we present a novel circuit technique to virtually eliminate test power dissipation in combinational logic by masking signal transitions at the logic inputs during scan shifting. We implement the masking effect by inserting an extra supply gating transistor in the supply to ground path for the first-level gates at the outputs of the scan flip-flops. The supply gating transistor is turned off in the scan-in mode, essentially gating the supply. Adding an extra transistor in only one logic level renders significant advantages with respect to area, delay, and power overhead compared to existing methods, which use gating logic at the output of scan flip-flops. Moreover, the proposed gating technique allows a reduction in leakage power by input vector control during scan shifting. Simulation results on ISCAS89 benchmarks show an average improvement of 62% in area overhead, 101% in power overhead (in normal mode), and 94% in delay overhead, compared to the lowest cost existing method.

125 citations


Journal ArticleDOI
TL;DR: The proposed method adopts a global approach for passivity enforcement by ensuring that the passivity correction at a certain region does not introduce new passivity violations at other parts of the frequency spectrum.
Abstract: With the continually increasing operating frequencies, complex high-speed interconnect and package modules require characterization based on measured/simulated data. Several algorithms were recently suggested for macromodeling such types of data to enable unified transient analysis in the presence of external network elements. One of the critical issues involved here is the passivity violations associated with the computed macromodel. To address this issue, a new passivity enforcement algorithm is presented in this paper. The proposed method adopts a global approach for passivity enforcement by ensuring that the passivity correction at a certain region does not introduce new passivity violations at other parts of the frequency spectrum. It also provides an error estimate for the response of the passivity corrected macromodel.

119 citations


Journal ArticleDOI
TL;DR: It is concluded that extending the voltage range below V/sub dd//2 will improve the energy efficiency for many processor designs and compare several different low-power approaches including MTCMOS, standard DVS, and the proposed Insomniac (extended DVS into subthreshold operation).
Abstract: Dynamic voltage scaling (DVS) is a popular approach for energy reduction of integrated circuits. Current processors that use DVS typically have an operating voltage range from full to half of the maximum V/sub dd/. However, there is no fundamental reason why designs cannot operate over a much larger voltage range: from full V/sub dd/ to subthreshold voltages. This possibility raises the question of whether a larger voltage range improves the energy efficiency of DVS. First, from a theoretical point of view, we show that, for subthreshold supply voltages, leakage energy becomes dominant, making "just-in-time computation" energy-inefficient at extremely low voltages. Hence, we introduce the existence of a so-called "energy-optimal voltage" which is the voltage at which the application is executed with the highest possible energy efficiency and below which voltage scaling reduces energy efficiency. We derive an analytical model for the energy-optimal voltage and study its trends with technology scaling and different application loads. Second, we compare several different low-power approaches including MTCMOS, standard DVS, and the proposed Insomniac (extended DVS into subthreshold operation). A study of real applications on commercial processors shows that Insomniac provides the best energy efficiency. From these results, we conclude that extending the voltage range below V/sub dd//2 will improve the energy efficiency for many processor designs.

118 citations


Journal ArticleDOI
TL;DR: This paper presents a method that synthesizes a network with the most common reversible gates, the Toffoli gate and the Fredkin gate, and compares the results to the optimal results.
Abstract: Reversible logic has applications in quantum computing, low power CMOS, nanotechnology, optical computing, and DNA computing. The most common reversible gates are the Toffoli gate and the Fredkin gate. We present a method that synthesizes a network with these gates in two steps. First, our synthesis algorithm finds a cascade of Toffoli and Fredkin gates with no backtracking and minimal look-ahead. Next we apply transformations that reduce the number of gates in the network. Transformations are accomplished via template matching. The basis for a template is a network with m gates that realizes the identity function. If a sequence of gates in the network to be reduced matches a sequence of gates comprising more than half of a template, then a transformation that reduces the gate count can be applied. We have synthesized all three input, three output reversible functions and here compare our results to the optimal results. We also present the results of applying our synthesis tool to obtain networks for a number of benchmark functions.

107 citations


Journal ArticleDOI
TL;DR: The combined device-circuit-architecture level techniques offer 64% total leakage reduction and 7.3% improvement in bit line delay compared to a previous state-of-the-art low-leakage SRAM technique.
Abstract: This paper presents a forward body-biasing (FBB) technique for active and standby leakage power reduction in cache memories. Unlike previous low-leakage SRAM approaches, we include device level optimization into the design. We utilize super high Vt (threshold voltage) devices to suppress the cache leakage power, while dynamically FBB only the selected SRAM cells for fast operation. In order to build a super high Vt device, the two-dimensional (2-D) halo doping profile was optimized considering various nanoscale leakage mechanisms. The transition latency and energy overhead associated with FBB was minimized by waking up the SRAM cells ahead of the access and exploiting the general cache access pattern. The combined device-circuit-architecture level techniques offer 64% total leakage reduction and 7.3% improvement in bit line delay compared to a previous state-of-the-art low-leakage SRAM technique. Static noise margin of the proposed SRAM cell is comparable to conventional SRAM cells.

Journal ArticleDOI
TL;DR: The new sense-amplifier-based flip-flop provides ratioless design, reduced short-circuit power dissipation, and glitch-free operation, and has been successfully employed in a high-speed direct digital frequency synthesizer chip, highlighting the effectiveness of the proposed flip- flop in high- speed standard cell-based applications.
Abstract: A new sense-amplifier-based flip-flop is presented. The output latch of the proposed circuit can be considered as an hybrid solution between the standard NAND-based set/reset latch and the NC-/sup 2/MOS approach. The proposed flip-flop provides ratioless design, reduced short-circuit power dissipation, and glitch-free operation. The simulation results, obtained for a 0.25-/spl mu/m technology, show improvements in the clock-to-output delay and the power dissipation with respect to the recently proposed high-speed flip-flops. The new circuit has been successfully employed in a high-speed direct digital frequency synthesizer chip, highlighting the effectiveness of the proposed flip-flop in high-speed standard cell-based applications.

Journal ArticleDOI
TL;DR: The simulated results show that by halving the clock frequency, dual-edge clocking strategy can save about 50% of the power consumed by the clock distribution network, and relax the design of clock distribution system, while paying virtually no penalty in throughput.
Abstract: This paper describes the classification, detailed timing characterization, evaluation, and design of the dual-edge triggered storage elements (DETSE). The performance and power characterization of DETSE includes the effect of clocking at halved clock frequency and impact of load imposed by the storage element to the clock distribution network. The presented analysis estimates the timing penalty and power savings of a system based on DETSE, and gives design guidelines for high-performance and low-power application. In addition, the paper presents a class of dual-edge triggered flip-flops with clock load, delay, and internal power consumption comparable to the fastest single-edge triggered storage elements (SETSE). Our simulated results show that by halving the clock frequency, dual-edge clocking strategy can save about 50% of the power consumed by the clock distribution network, and relax the design of clock distribution system, while paying virtually no penalty in throughput.

Journal ArticleDOI
TL;DR: High-speed field-programmable gate array (FPGA) implementations of an adaptive least mean square (LMS) filter with application in an electronic support measures (ESM) digital receiver, are presented.
Abstract: High-speed field-programmable gate array (FPGA) implementations of an adaptive least mean square (LMS) filter with application in an electronic support measures (ESM) digital receiver, are presented. They employ "fine-grained" pipelining, i.e., pipelining within the processor and result in an increased output latency when used in the LMS recursive system. Therefore, the major challenge is to maintain a low latency output whilst increasing the pipeline stage in the filter for higher speeds. Using the delayed LMS (DLMS) algorithm, fine-grained pipelined FPGA implementations using both the direct form (DF) and the transposed form (TF) are considered and compared. It is shown that the direct form LMS filter utilizes the FPGA resources more efficiently thereby allowing a 120 MHz sampling rate.

Journal ArticleDOI
TL;DR: The resulting hardware implementations are among the fastest reported: for a key size of 270 bits, a point multiplication in a Xilinx XC2V6000 FPGA at 35 MHz can run over 1000 times faster than a software implementation on a Xeon computer at 2.6 GHz.
Abstract: This paper presents a method for producing hardware designs for elliptic curve cryptography (ECC) systems over the finite field GF(2/sup m/), using the optimal normal basis for the representation of numbers. Our field multiplier design is based on a parallel architecture containing multiple m-bit serial multipliers; by changing the number of such serial multipliers, designers can obtain implementations with different tradeoffs in speed, size and level of security. A design generator has been developed which can automatically produce a customised ECC hardware design that meets user-defined requirements. To facilitate performance characterization, we have developed a parametric model for estimating the number of cycles for our generic ECC architecture. The resulting hardware implementations are among the fastest reported: for a key size of 270 bits, a point multiplication in a Xilinx XC2V6000 FPGA at 35 MHz can run over 1000 times faster than a software implementation on a Xeon computer at 2.6 GHz.

Journal ArticleDOI
TL;DR: A hardware Gaussian noise generator based on the Wallace method used for a hardware simulation system that accurately models a true Gaussian probability density function even at high /spl sigma/ values is described.
Abstract: We describe a hardware Gaussian noise generator based on the Wallace method used for a hardware simulation system. Our noise generator accurately models a true Gaussian probability density function even at high /spl sigma/ values. We evaluate its properties using: 1) several different statistical tests, including the chi-square test and the Anderson-Darling test and 2) an application for decoding of low-density parity-check (LDPC) codes. Our design is implemented on a Xilinx Virtex-II XC2V4000-6 field-programmable gate array (FPGA) at 155 MHz; it takes up 3% of the device and produces 155 million samples per second, which is three times faster than a 2.6-GHz Pentium-IV PC. Another implementation on a Xilinx Spartan-III XC3S200E-5 FPGA at 106 MHz is two times faster than the software version. Further improvement in performance can be obtained by concurrent execution: 20 parallel instances of the noise generator on an XC2V4000-6 FPGA at 115 MHz can run 51 times faster than software on a 2.6-GHz Pentium-IV PC.

Journal ArticleDOI
TL;DR: A design space for matrix multiplication on FPGAs that results in tradeoffs among energy, area, and latency is shown that improves the energy performance of state-of-the-art FPGA-based designs by 29%-51% without any increase in the area-latency product.
Abstract: We develop new algorithms and architectures for matrix multiplication on configurable devices. These have reduced energy dissipation and latency compared with the state-of-the-art field-programmable gate array (FPGA)-based designs. By profiling well-known designs, we identify "energy hot spots", which are responsible for most of the energy dissipation. Based on this, we develop algorithms and architectures that offer tradeoffs among the number of I/O ports, the number of registers, and the number of PEs. To avoid time-consuming low-level simulations for energy profiling and performance prediction of many alternate designs, we derive functions to represent the impact of algorithm design choices on the system-wide energy dissipation, area, and latency. These functions are used to either optimize the energy performance or provide tradeoffs for a family of candidate algorithms and architectures. For selected designs, we perform extensive low-level simulations using state-of-the-art tools and target FPGA devices. We show a design space for matrix multiplication on FPGAs that results in tradeoffs among energy, area, and latency. For example, our designs improve the energy performance of state-of-the-art FPGA-based designs by 29%-51% without any increase in the area-latency product. The latency of our designs is reduced one-third to one-fifteenth while area is increased 1.9-9.4 times. In terms of comprehensive metrics such as Energy-Area-Time, our designs exhibit superior performance compared with the state-of-the-art by 50%-79%.

Journal ArticleDOI
TL;DR: This paper presents the design and analysis of a novel distributed CMOS mixer for ultrawide-band (UWB) receivers that incorporates artificial inductance-capacitance delay lines in radio frequency, local oscillator, and intermediate frequency signal paths, and single-balanced mixer cells that are distributed along these LC circuits.
Abstract: This paper presents the design and analysis of a novel distributed CMOS mixer for ultrawide-band (UWB) receivers. To achieve the UWB RF frequency range required for the UWB communications, the proposed mixer incorporates artificial inductance-capacitance (LC) delay lines in radio frequency (RF), local oscillator (LO), and intermediate frequency signal paths, and single-balanced mixer cells that are distributed along these LC circuits. Closed-form analytical model for the conversion gain of the mixer is presented. Furthermore, a comprehensive noise analysis of the proposed distributed mixer is carried out, which includes calculation of the mixer noise figure (NF) and derivation of the optimum number of stages, n, minimizing the NF. The designed mixer is capable of covering the RF and LO frequencies over a wide range of frequencies from 3.1-8.72 GHz. A two-stage distributed mixer has been fabricated in a 0.18-/spl mu/m CMOS process. Experiments show a conversion gain of more than 2.5 dB for the entire range of the frequencies. The dc power consumption is 10.4 mW.

Journal ArticleDOI
TL;DR: Experimental results show that a high repair rate is achieved with the BISR scheme, and the proposed RA algorithm for RAM with a 2-D redundancy structure, i.e., spare rows and spare columns, is implemented.
Abstract: This brief presents a built-in self-repair (BISR) scheme for semiconductor memories with two-dimensional (2-D) redundancy structures, i.e., spare rows and spare columns. The BISR design is composed of a built-in self-test module and a built-in redundancy analysis (BIRA) module. The BIRA module executes the proposed RA algorithm for RAM with a 2-D redundancy structure. The BIRA module also serves as the reconfiguration unit in the normal mode. Experimental results show that a high repair rate (i.e., the ratio of the number of repaired memories to the number of defective memories) is achieved with the BISR scheme. The BISR circuit has a low area overhead-about 4.6% for an 8 K /spl times/ 64 SRAM.

Journal ArticleDOI
TL;DR: This paper proposes an efficient heuristic algorithm using a charge-based cost function derived from the analytical battery model that first creates a task sequence that ensures battery survival, and then distributes the available delay slack so that the cost function is maximized.
Abstract: Battery lifetime enhancement is a critical design parameter for mobile computing devices. Maximizing the battery lifetime is a particularly difficult problem due to the nonlinearity of the battery behavior and its dependence on the characteristics of the discharge profile. In this paper, we address the problem of task scheduling with voltage scaling in a battery-powered single and multiprocessor system such that the residual charge or the battery voltage (the parameters for evaluating battery performance) is maximized. We propose an efficient heuristic algorithm using a charge-based cost function derived from the analytical battery model. Our algorithm first creates a task sequence that ensures battery survival, and then distributes the available delay slack so that the cost function is maximized. The effectiveness of the algorithm has been verified using DUALFOIL, a low-level Li-ion battery simulator. The algorithm has been validated on synthetic examples created from applications running on Compaq's handheld computing research platform, ITSY

Journal ArticleDOI
TL;DR: The estimation is quick, not requiring extensive simulation or use of computer-aided design tools, yet sufficiently accurate to provide guidance through various choices in the design process.
Abstract: In this paper, we motivate the concept of comparing very large scale integration adders based on their energy-delay characteristics and present results of our estimation technique. This stems from a need to make appropriate selection at the beginning of the design process. The estimation is quick, not requiring extensive simulation or use of computer-aided design tools, yet sufficiently accurate to provide guidance through various choices in the design process. We demonstrate the accuracy of the method by applying it to examples of high-performance 32- and 64-b adders in 100- and 130-nm CMOS technologies.

Journal ArticleDOI
TL;DR: A closed-loop voltage swing controller that samples the error retransmission rate to determine the operational voltage swing and an embodiment of a self-calibrating circuit that compensates for significant manufacturing parameter deviations and environmental variations is described.
Abstract: Systems-on-Chip (SoC) design involves several challenges, stemming from the extreme miniaturization of the physical features and from the large number of devices and wires on a chip. Since most SoCs are used within embedded systems, specific concerns are increasingly related to correct, reliable, and robust operation. We believe that in the future most SoCs will be assembled by using large-scale macro-cells and interconnected by means of on-chip networks. We examine some physical properties of on-chip interconnect busses, with the goal of achieving fast, reliable, and low-energy communication. These objectives are reached by dynamically scaling down the voltage swing, while ensuring data integrity-in spite of the decreased signal to noise ratio-by means of encoding and retransmission schemes. In particular, we describe a closed-loop voltage swing controller that samples the error retransmission rate to determine the operational voltage swing. We present a control policy which achieves our goals with minimal complexity; such simplicity is demonstrated by implementing the policy in a synthesizable controller. Such a controller is an embodiment of a self-calibrating circuit that compensates for significant manufacturing parameter deviations and environmental variations. Experimental results show that energy savings amount up to 42%, while at the same time meeting performance requirements.

Journal ArticleDOI
TL;DR: This work proposes an adaptive error correcting scheme to reduce the leakage energy consumption and improve reliability and demonstrates that there is a tradeoff between optimizing for leakage power and improving the immunity to soft error.
Abstract: As technology scales, reducing leakage power and improving reliability of data stored in memory cells is both important and challenging. While lower threshold voltages increase leakage, lower supply voltages and smaller nodal capacitances reduce energy consumption but increase soft errors rates. In this work, we present a comprehensive study of soft error rates on low-power cache design. First, we study the effect of circuit level techniques, used to reduce the leakage energy consumption, on soft error rates. Our results using custom designs show that many of these approaches may increase the soft error rates as compared to a standard 6T SRAM. We also validate the effects of voltage scaling on soft error rate by performing accelerated tests on off-the-shelf SRAM-based chips using a neutron beam source. Next, we study the impact of cache decay and drowsy cache, which are two commonly used architectural-level leakage reduction approaches, on the cache reliability. Our results indicate that the leakage optimization techniques change the reliability of cache memory. More importantly, we demonstrate that there is a tradeoff between optimizing for leakage power and improving the immunity to soft error. We also study the impact of error correcting codes on soft error rates. Based on this study, we propose an adaptive error correcting scheme to reduce the leakage energy consumption and improve reliability.

Journal ArticleDOI
TL;DR: Analysis shows that the computational delay time of the proposed architecture is significantly less than the previously proposed digit-serial systolic multiplier, and since the new architecture has the features of regularity, modularity, and unidirectional data flow, it is well suited to VLSI implementation.
Abstract: In this paper, an efficient digit-serial systolic array is proposed for multiplication in finite field GF(2/sup m/) using the standard basis representation. From the least significant bit first multiplication algorithm, we obtain a new dependence graph and design an efficient digit-serial systolic multiplier. If input data come in continuously, the proposed array can produce multiplication results at a rate of one every /spl lceil/m/L/spl rceil/ clock cycles, where L is the selected digit size. Analysis shows that the computational delay time of the proposed architecture is significantly less than the previously proposed digit-serial systolic multiplier. Furthermore, since the new architecture has the features of regularity, modularity, and unidirectional data flow, it is well suited to VLSI implementation.

Journal ArticleDOI
TL;DR: Experimental calculations indicate that the use of dynamic reconfiguration leads to a 69% reduction in decoder power consumption over a nonreconfigurable field-programmable gate array implementation with no loss of decode accuracy.
Abstract: Error-correcting convolutional codes provide a proven mechanism to limit the effects of noise in digital data transmission. Although hardware implementations of decoding algorithms, such as the Viterbi algorithm, have shown good noise tolerance for error-correcting codes, these implementations require an exponential increase in very large scale integration area and power consumption to achieve increased decoding accuracy. To achieve reduced decoder power consumption, we have examined and implemented decoders based on the reduced-complexity adaptive Viterbi algorithm (AVA). Run-time dynamic reconfiguration is performed in response to varying communication channel-noise conditions to match minimized power consumption to required error-correction capabilities. Experimental calculations indicate that the use of dynamic reconfiguration leads to a 69% reduction in decoder power consumption over a nonreconfigurable field-programmable gate array implementation with no loss of decode accuracy.

Journal ArticleDOI
TL;DR: This work presents a comprehensive overview of two closely related approaches to designing DPM strategies, namely, competitive analysis approach and model checking approach based on adversarial modeling.
Abstract: Dynamic power management (DPM) refers to the problem of judicious application of various low-power techniques based on runtime conditions in an embedded system to minimize the total energy consumption. To be effective, often such decisions take into account the operating conditions and the system-level design goals. DPM has been a subject of intense research in the past decade driven by the need for low power consumption in modern embedded devices. We present a comprehensive overview of two closely related approaches to designing DPM strategies, namely, competitive analysis approach and model checking approach based on adversarial modeling. Although many other approaches exist for solving the system-level DPM problem, these two approaches are closely related and are based on a common theme. This commonality is in the fact that the underlying model is that of a competition between the system and an adversary. The environment that puts service demands on devices is viewed as an adversary, or to be in competition with the system to make it burn more energy, and the DPM strategy is employed by the system to counter that.

Journal ArticleDOI
TL;DR: A novel parallel interleaver and an algorithm for its design are presented, achieving the same error correction performance as the standard architecture and achieving a very high coding gain.
Abstract: Standard VLSI implementations of turbo decoding require substantial memory and incur a long latency, which cannot be tolerated in some applications. A parallel VLSI architecture for low-latency turbo decoding, comprising multiple single-input single-output (SISO) elements, operating jointly on one turbo-coded block, is presented and compared to sequential architectures. A parallel interleaver is essential to process multiple concurrent SISO outputs. A novel parallel interleaver and an algorithm for its design are presented, achieving the same error correction performance as the standard architecture. Latency is reduced up to 20 times and throughput for large blocks is increased up to six-fold relative to sequential decoders, using the same silicon area, and achieving a very high coding gain. The parallel architecture scales favorably: latency and throughput are improved with increased block size and chip area.

Journal ArticleDOI
TL;DR: TCG combines the advantages of popular representations such as sequence pair, BSG, and B*-tree and makes TCG an effective and flexible representation for handling the general floorplan/placement design problems with various constraints.
Abstract: In this brief, we introduce the concept of the P*-admissible representation and propose a P*-admissible, transitive closure graph-based representation for general floorplans, called transitive closure graph (TCG), and show its superior properties. TCG combines the advantages of popular representations such as sequence pair, BSG, and B*-tree. Like sequence pair and BSG, but unlike O-tree, B*-tree, and CBL, TCG is P*-admissible. Like B*-tree, but unlike sequence pair, BSG, O-tree, and CBL, TCG does not need to construct additional constraint graphs for the cost evaluation during packing, implying a faster runtime. Further, TCG supports incremental update during operations and keeps the information of boundary modules as well as the shapes and the relative positions of modules in the representation. More importantly, the geometric relation among modules is transparent not only to the TCG representation but also to its operation, facilitating the convergence to a desired solution. All of these properties make TCG an effective and flexible representation for handling the general floorplan/placement design problems with various constraints. Experimental results show the promise of TCG.

Journal ArticleDOI
TL;DR: A design exploration framework for application-adaptive multiple-clock processors which provides the means for analyzing and identifying the right interdomain communication scheme and the proper granularity for the choice of voltage/frequency islands in case of superscalar, out-of-order processors is proposed.
Abstract: Enabled by the continuous advancement in fabrication technology, present-day synchronous microprocessors include more than 100 million transistors and have clock speeds well in excess of the 1-GHz mark. Distributing a low-skew clock signal in this frequency range to all areas of a large chip is a task of growing complexity. As a solution to this problem, designers have recently suggested the use of frequency islands that are locally clocked and externally communicate with each other using mixed clock communication schemes. Such a design style fits nicely with the recently proposed concept of voltage islands that, in addition, can potentially enable fine-grain dynamic power management by simultaneous voltage and frequency scaling. This paper proposes a design exploration framework for application-adaptive multiple-clock processors which provides the means for analyzing and identifying the right interdomain communication scheme and the proper granularity for the choice of voltage/frequency islands in case of superscalar, out-of-order processors. In addition, the presented design exploration framework allows for comparative analysis of newly proposed or already published application-driven dynamic power management strategies. Such a design exploration framework and accompanying results can help designers and computer architects in choosing the right design strategy for achieving better power-performance tradeoffs in multiple-clock high-end processors.

Journal ArticleDOI
TL;DR: It has been shown that a complete co-design at all levels of hierarchy is necessary to reduce the overall power consumption while achieving acceptable performance in the subthreshold regime of operation.
Abstract: This paper presents a novel design methodology for ultralow-power design using subthreshold leakage as the operating current (suitable for medium frequency of operation: tens to hundreds of millihertz). Standard design techniques suitable for super-threshold design can be used in the subthreshold region. However, in this study, it has been shown that a complete co-design at all levels of hierarchy (device, circuit, and architecture) is necessary to reduce the overall power consumption while achieving acceptable performance (hundreds of millihertz) in the subthreshold regime of operation. Simulation results of co-design on a five-tap finite-impulse-response filter shows /spl sim/2.5/spl times/ improvement in throughput at iso-power compared to a conventional design.