
Showing papers in "IEEE Transactions on Very Large Scale Integration Systems in 2013"


Journal ArticleDOI
TL;DR: The proposed radix-2^k feedforward architectures not only offer an attractive solution for current applications, but also open up a new research line on feedforward structures.
Abstract: The appearance of radix-2^2 was a milestone in the design of pipelined FFT hardware architectures. Later, radix-2^2 was extended to radix-2^k. However, radix-2^k was only proposed for single-path delay feedback (SDF) architectures, but not for feedforward ones, also called multi-path delay commutator (MDC). This paper presents the radix-2^k feedforward (MDC) FFT architectures. In feedforward architectures radix-2^k can be used for any number of parallel samples which is a power of two. Furthermore, both decimation in frequency (DIF) and decimation in time (DIT) decompositions can be used. In addition to this, the designs can achieve very high throughputs, which makes them suitable for the most demanding applications. Indeed, the proposed radix-2^k feedforward architectures require fewer hardware resources than parallel feedback ones, also called multi-path delay feedback (MDF), when several samples in parallel must be processed. As a result, the proposed radix-2^k feedforward architectures not only offer an attractive solution for current applications, but also open up a new research line on feedforward structures.

198 citations


Journal ArticleDOI
TL;DR: An insight into the field of fault attacks and countermeasures to help the designer protect the design against this type of implementation attack, and a guide for selecting a set of countermeasures that provides a sufficient security level while meeting the constraints of embedded devices.
Abstract: Hardware designers invest a significant design effort when implementing computationally intensive cryptographic algorithms onto constrained embedded devices to match the computational demands of the algorithms with the stringent area, power, and energy budgets of the platforms. When it comes to designs that are employed in potentially hostile environments, another challenge arises: the design has to be resistant against attacks based on the physical properties of the implementation, the so-called implementation attacks. This creates an extra design concern for a hardware designer. This paper gives an insight into the field of fault attacks and countermeasures to help the designer protect the design against this type of implementation attack. We analyze fault attacks from different aspects and expose the mechanisms they employ to reveal a secret parameter of a device. In addition, we classify the existing countermeasures and discuss their effectiveness and efficiency. The result of this paper is a guide for selecting a set of countermeasures that provides a sufficient security level while meeting the constraints of embedded devices.

159 citations


Journal ArticleDOI
TL;DR: The proposed sequential circuits based on conservative logic gates outperform the sequential circuits implemented in classical gates in terms of testability, and a new conservative logic gate called multiplexer conservative QCA gate (MX-cqca), which is not reversible in nature but, like the Fredkin gate, works as a 2:1 multiplexer, is presented.
Abstract: In this paper, we propose the design of two vectors testable sequential circuits based on conservative logic gates. The proposed sequential circuits based on conservative logic gates outperform the sequential circuits implemented in classical gates in terms of testability. Any sequential circuit based on conservative logic gates can be tested for classical unidirectional stuck-at faults using only two test vectors. The two test vectors are all 1's, and all 0's. The designs of two vectors testable latches, master-slave flip-flops and double edge triggered (DET) flip-flops are presented. The importance of the proposed work lies in the fact that it provides the design of reversible sequential circuits completely testable for any stuck-at fault by only two test vectors, thereby eliminating the need for any type of scan-path access to internal memory cells. The reversible design of the DET flip-flop is proposed for the first time in the literature. We also showed the application of the proposed approach toward 100% fault coverage for single missing/additional cell defect in the quantum-dot cellular automata (QCA) layout of the Fredkin gate. We are also presenting a new conservative logic gate called multiplexer conservative QCA gate (MX-cqca) that is not reversible in nature but has similar properties as the Fredkin gate of working as 2:1 multiplexer. The proposed MX-cqca gate surpasses the Fredkin gate in terms of complexity (the number of majority voters), speed, and area.

130 citations
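
The conservative-logic property mentioned in this abstract can be illustrated with the Fredkin gate, which is a controlled swap. The following is a minimal Python sketch, not the paper's circuit: the function names `fredkin` and `mux_via_fredkin` are illustrative, and the code only checks three standard properties of the gate (it preserves the number of 1s, it is a permutation of its inputs, and one output behaves as a 2:1 multiplexer).

```python
from itertools import product


def fredkin(c, a, b):
    """Controlled swap: when c == 1 the two data outputs are exchanged."""
    return (c, b, a) if c else (c, a, b)


def mux_via_fredkin(sel, d0, d1):
    """2:1 multiplexer read from the second Fredkin output (illustrative helper)."""
    return fredkin(sel, d0, d1)[1]


# All eight input combinations, in lexicographic order.
inputs = list(product((0, 1), repeat=3))
outputs = [fredkin(*bits) for bits in inputs]

# Conservative: every output has as many 1s as its input.
conservative = all(sum(i) == sum(o) for i, o in zip(inputs, outputs))

# Reversible: the outputs are a permutation of the inputs.
reversible = sorted(outputs) == inputs

# Multiplexer behaviour: sel == 0 selects d0, sel == 1 selects d1.
mux_ok = all(
    mux_via_fredkin(s, d0, d1) == (d1 if s else d0) for s, d0, d1 in inputs
)

print(conservative, reversible, mux_ok)
```

Because the gate preserves the count of 1s, the all-0 and all-1 vectors map to themselves, which is the starting point for the two-vector test idea described in the abstract.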


Journal ArticleDOI
TL;DR: Experimental results show that the proposed analytical model can predict the average packet latency more than four orders of magnitude faster than an accurate simulation, while the computation error is less than 10% in non-saturated networks for different system-on-chip platforms.
Abstract: We propose an analytical model based on queueing theory for delay analysis in a wormhole-switched network-on-chip (NoC). The proposed model takes as input an application communication graph, a topology graph, a mapping vector, and a routing matrix, and estimates average packet latency and router blocking time. It works for arbitrary network topology with deterministic routing under arbitrary traffic patterns. This model can estimate per-flow average latency accurately and quickly, thus enabling fast design space exploration of various design parameters in NoC designs. Experimental results show that the proposed analytical model can predict the average packet latency more than four orders of magnitude faster than an accurate simulation, while the computation error is less than 10% in non-saturated networks for different system-on-chip platforms.

112 citations


Journal ArticleDOI
TL;DR: A multipath delay commutator (MDC)-based architecture and memory scheduling to implement fast Fourier transform (FFT) processors for multiple input multiple output-orthogonal frequency division multiplexing (MIMO-OFDM) systems with variable length is presented.
Abstract: This paper presents a multipath delay commutator (MDC)-based architecture and memory scheduling to implement fast Fourier transform (FFT) processors for multiple input multiple output-orthogonal frequency division multiplexing (MIMO-OFDM) systems with variable length. Based on the MDC architecture, we propose to use radix-Ns butterflies at each stage, where Ns is the number of data streams, so that there is only one butterfly needed in each stage. Consequently, a 100% utilization rate in computational elements is achieved. Moreover, thanks to the simple control mechanism of the MDC, we propose simple memory scheduling methods for input data and output bit/set-reversing, which again results in a full utilization rate in memory usage. Since the memory requirements usually dominate the die area of FFT/inverse fast Fourier transform (IFFT) processors, the proposed scheme can effectively reduce the memory size and thus the die area as well. Furthermore, to apply the proposed scheme in practical applications, we let Ns=4 and implement a 4-stream FFT/IFFT processor with variable length including 2048, 1024, 512, and 128 for MIMO-OFDM systems. This processor can be used in IEEE 802.16 WiMAX and 3GPP long term evolution applications. The processor was implemented with a UMC 90-nm CMOS technology with a core area of 3.1 mm^2. The power consumption at 40 MHz was 63.72/62.92/57.51/51.69 mW for 2048/1024/512/128-FFT, respectively, in the post-layout simulation. Finally, we analyze the complexity and performance of the implemented processor and compare it with other processors. The results show advantages of the proposed scheme in terms of area and power consumption.

99 citations


Journal ArticleDOI
TL;DR: A fault-tolerant solution for a bufferless network-on-chip is proposed, including an on-line fault-diagnosis mechanism to detect both transient and permanent faults, a hybrid automatic repeat request and forward error correction link-level error control scheme to handle transient faults, and a reinforcement-learning-based fault-tolerant deflection routing (FTDR) algorithm to tolerate permanent faults without deadlock and livelock.
Abstract: Continuing decrease in the feature size of integrated circuits leads to increases in susceptibility to transient and permanent faults. This paper proposes a fault-tolerant solution for a bufferless network-on-chip, including an on-line fault-diagnosis mechanism to detect both transient and permanent faults, a hybrid automatic repeat request and forward error correction link-level error control scheme to handle transient faults, and a reinforcement-learning-based fault-tolerant deflection routing (FTDR) algorithm to tolerate permanent faults without deadlock and livelock. A hierarchical-routing-table-based algorithm (FTDR-H) is also presented to reduce the area overhead of the FTDR router. Synthesized results show that, compared with the FTDR router, the FTDR-H router can reduce the area by 27% in an 8 × 8 network. Simulation results demonstrate that under synthetic workloads, in the presence of permanent link faults, the throughput of an 8 × 8 network with FTDR and FTDR-H algorithms is 14% and 23% higher on average than that with the fault-on-neighbor (FoN) aware deflection routing algorithm and the cost-based deflection routing algorithm, respectively. Under real application workloads, the FTDR-H algorithm achieves a 20% lower hop count on average than the FoN algorithm. For transient faults, the performance of the FTDR router can achieve graceful degradation even at a high fault rate. We also implement the fault-tolerant deflection router, which can achieve 400 MHz in TSMC 65-nm technology.

95 citations


Journal ArticleDOI
TL;DR: The proposed switchback switching method does not consume any power at the first digital-to-analog converter switching, which can reduce the power consumption and design effort of the reference buffer.
Abstract: This brief presents a 10-bit 30-MS/s successive-approximation-register analog-to-digital converter (ADC) that uses a power efficient switchback switching method. With respect to the monotonic switching method, the input common-mode voltage variation is reduced, which improves the dynamic offset and the parasitic capacitance variation of the comparator. The proposed switchback switching method does not consume any power at the first digital-to-analog converter switching, which can reduce the power consumption and design effort of the reference buffer. The prototype was fabricated in a 90-nm 1P9M CMOS technology. At 1-V supply and 30 MS/s, the ADC achieves a signal-to-noise-and-distortion ratio (SNDR) of 56.89 dB and consumes 0.98 mW, resulting in a figure-of-merit (FOM) of 57 fJ/conversion-step. The ADC core occupies an active area of only 190 × 525 μm^2.

92 citations
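
The figure-of-merit quoted in this abstract can be cross-checked from the other reported numbers. The sketch below assumes the commonly used Walden definition, FOM = P / (f_s · 2^ENOB) with ENOB = (SNDR − 1.76) / 6.02, and takes the 56.89 dB figure as the signal-to-noise-and-distortion ratio; the abstract does not state which FOM definition the authors used, so this is an assumption, and the function names are illustrative.

```python
def enob(sndr_db):
    """Effective number of bits from SNDR in dB (standard sine-wave formula)."""
    return (sndr_db - 1.76) / 6.02


def walden_fom(power_w, fs_hz, sndr_db):
    """Energy per conversion step in joules (Walden FOM, assumed definition)."""
    return power_w / (fs_hz * 2 ** enob(sndr_db))


# Values reported in the abstract: 0.98 mW, 30 MS/s, 56.89 dB.
fom = walden_fom(0.98e-3, 30e6, 56.89)
print(f"ENOB = {enob(56.89):.2f} bits, FOM = {fom * 1e15:.1f} fJ/conversion-step")
```

Under these assumptions the result lands close to the 57 fJ/conversion-step stated in the abstract.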


Journal ArticleDOI
TL;DR: A new hardware architecture for ECPM over GF(p) is presented, based on the residue number system (RNS), which encompasses RNS bases with various word-lengths in order to efficiently implement RNS Montgomery multiplication.
Abstract: Elliptic curve point multiplication (ECPM) is one of the most critical operations in elliptic curve cryptography. In this brief, a new hardware architecture for ECPM over GF(p) is presented, based on the residue number system (RNS). The proposed architecture encompasses RNS bases with various word-lengths in order to efficiently implement RNS Montgomery multiplication. Two architectures with four and six pipeline stages are presented, targeted on area-efficient and fast RNS Montgomery multiplication designs, respectively. The fast version of the proposed ECPM architecture achieves higher speeds and the area-efficient version achieves better area-delay tradeoffs compared to state-of-the-art implementations.

85 citations


Journal ArticleDOI
TL;DR: A novel hybrid SPM, consisting of static random-access memory (SRAM) and nonvolatile memory (NVM), is proposed to take advantage of the ultralow leakage power and high density of the latter, together with a novel dynamic data management algorithm that makes use of the full potential of NVM.
Abstract: Embedded systems normally have a tight energy budget. Since the on-chip cache typically consumes 25%-50% of the processor's area and energy consumption, scratch pad memory (SPM), which is a software-controlled on-chip memory, has been widely adopted in many embedded systems due to its smaller area and lower power consumption. However, as the speed of the CMOS transistors increases along with density, leakage power consumption is becoming a critical issue for memory components with a large number of transistors. In this paper, we propose a novel hybrid SPM which consists of static random-access memory (SRAM) and nonvolatile memory (NVM) to take advantage of the ultralow leakage power and high density of the latter. A novel dynamic data management algorithm is also proposed to make use of the full potential of NVM. According to the experimental results, with the help of the proposed algorithm, the novel hybrid SPM architecture can reduce the memory access time by 18.17%, the dynamic energy by 24.29%, and the leakage power by 37.34% compared with a baseline pure SRAM SPM with the same area.

77 citations


Journal ArticleDOI
TL;DR: A new domino circuit is proposed, which has a lower leakage and higher noise immunity without dramatic speed degradation for wide fan-in gates and is based on comparison of mirrored current of the pull-up network with its worst case leakage current.
Abstract: In this paper, a new domino circuit is proposed, which has a lower leakage and higher noise immunity without dramatic speed degradation for wide fan-in gates. The technique which is utilized in this paper is based on comparison of mirrored current of the pull-up network with its worst case leakage current. The proposed circuit technique decreases the parasitic capacitance on the dynamic node, yielding a smaller keeper for wide fan-in gates to implement fast and robust circuits. Thus, the contention current and consequently power consumption and delay are reduced. The leakage current is also decreased by exploiting the footer transistor in diode configuration, which results in increased noise immunity. Simulation results of wide fan-in gates designed using a 16-nm high-performance predictive technology model demonstrate 51% power reduction and at least 2.41× noise-immunity improvement at the same delay compared to the standard domino circuits for 64-bit OR gates.

77 citations


Journal ArticleDOI
TL;DR: This paper proposes an energy-efficient algorithm and its corresponding architecture that is capable of bypassing the superfluous carry-save addition and register write operations, leading to less energy consumption and higher throughput of Montgomery modular multipliers.
Abstract: Modular exponentiation in the Rivest, Shamir, and Adleman cryptosystem is usually achieved by repeated modular multiplications on large integers. To speed up the encryption/decryption process, many high-speed Montgomery modular multiplication algorithms and hardware architectures employ carry-save addition to avoid the carry propagation at each addition operation of the add-shift loop. In this paper, we propose an energy-efficient algorithm and its corresponding architecture to not only reduce the energy consumption but also further enhance the throughput of Montgomery modular multipliers. The proposed architecture is capable of bypassing the superfluous carry-save addition and register write operations, leading to less energy consumption and higher throughput. In addition, we also modify the barrel register full adder (BRFA) so that the gated clock design technique can be applied to significantly reduce the energy consumption of storage elements in BRFA. Experimental results show that the proposed approaches can achieve up to 60% energy saving and 24.6% throughput improvement for 1024-bit Montgomery multiplier.
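
The add-shift loop that this abstract refers to can be sketched in software. The following is a minimal bit-serial (radix-2) Montgomery multiplication in Python using ordinary integers; it does not model carry-save addition, the bypassing scheme, or the BRFA described in the paper, and the function name and parameters are illustrative.

```python
def montgomery_multiply(a, b, n, k):
    """Return a * b * 2^(-k) mod n for odd n < 2^k and 0 <= a, b < n."""
    s = 0
    for i in range(k):
        a_i = (a >> i) & 1      # scan the multiplier one bit at a time
        s += a_i * b            # conditional add of the multiplicand
        if s & 1:               # make the running sum even ...
            s += n              # ... by adding the (odd) modulus
        s >>= 1                 # exact division by 2
    return s - n if s >= n else s   # single final correction


# Small illustrative example (values are arbitrary, not from the paper).
n, k = 239, 8
a, b = 100, 200
result = montgomery_multiply(a, b, n, k)
print(result)
```

In iterations where the multiplier bit is 0 and the running sum is already even, the loop body reduces to a shift, which gives some intuition for why skipping unnecessary additions and register writes can save energy in a hardware implementation.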

Journal ArticleDOI
TL;DR: It is found that crosstalk noise can significantly limit the scalability of mesh-based ONoCs, and it is shown that symmetric meshes have the best SNR performance.
Abstract: Crosstalk noise is an intrinsic characteristic as well as a potential issue of photonic devices. In large scale optical networks-on-chips (ONoCs), crosstalk noise could cause severe performance degradation and prevent ONoC from communicating properly. The novel contribution of this paper is the systematic modeling and analysis of the crosstalk noise and the signal-to-noise ratio (SNR) of optical routers and mesh-based ONoCs using a formal method. Formal analytical models for the worst-case crosstalk noise and minimum SNR in mesh-based ONoCs are presented. The crosstalk analysis is performed at device, router, and network levels. A general 5 × 5 optical router model is proposed for router level analysis. The minimum SNR optical link candidates, which constrain the scalability of mesh-based ONoCs, are identified. It is also shown that symmetric mesh-based ONoCs have the best SNR performance. The presented formal analyses can be easily applied to other optical routers and mesh-based ONoCs. Finally, we present case studies of mesh-based ONoCs using the optimized crossbar and Crux optical routers to evaluate the proposed formal method. We find that crosstalk noise can significantly limit the scalability of mesh-based ONoCs. For example, when the mesh-based ONoC size, using optimized crossbar, is larger than 8 × 8, the optical signal power is smaller than the crosstalk noise power; when the network size is 16 × 16 and the input power is 0 dBm, in the worst case, the signal power is -24.9 dBm and the crosstalk noise power is -11 dBm.
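
The worst-case numbers quoted for the 16 × 16 network translate directly into an SNR figure. The short Python sketch below only applies the standard dBm arithmetic to the two powers given in the abstract; the helper names are illustrative and the paper's analytical models are not reproduced.

```python
import math


def dbm_to_mw(p_dbm):
    """Convert a power level in dBm to milliwatts."""
    return 10 ** (p_dbm / 10)


def snr_db(signal_dbm, noise_dbm):
    """SNR in dB is the difference of the two power levels in dBm."""
    return signal_dbm - noise_dbm


# Worst-case values from the abstract for the 16 x 16 network.
signal_dbm, noise_dbm = -24.9, -11.0
snr = snr_db(signal_dbm, noise_dbm)
ratio = dbm_to_mw(signal_dbm) / dbm_to_mw(noise_dbm)

print(f"SNR = {snr:.1f} dB, linear power ratio = {ratio:.4f}")
```

A negative SNR in dB corresponds to a linear ratio below 1, which is the situation the abstract describes when the signal power falls below the crosstalk noise power.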

Journal ArticleDOI
TL;DR: The designs of various two-input three-state QDGFET gates, including NAND- and NOR-like operations and their application in different combinational circuits like decoder, multiplier, adder, and so on are discussed.
Abstract: In this paper, we discuss logic circuit designs using the circuit model of three-state quantum dot gate field effect transistors (QDGFETs). QDGFETs produce one intermediate state between the two normal stable ON and OFF states due to a change in the threshold voltage over this range. We have developed a simplified circuit model that accounts for this intermediate state. Interesting logic can be implemented using QDGFETs. In this paper, we discuss the designs of various two-input three-state QDGFET gates, including NAND- and NOR-like operations and their application in different combinational circuits like decoder, multiplier, adder, and so on. The increased number of states in three-state QDGFETs increases the bit-handling capability of the device and allows more bits to be handled at a time with fewer circuit elements.

Journal ArticleDOI
TL;DR: An improved mathematical model of the memristor is adopted that captures the well-established features of memristive devices and is used to analyze the time and voltage characteristics of stable read and write operations.
Abstract: In this paper, we explore various aspects of memristor modeling and use them to propose improved access operations and design of a memristor-based memory. We study the current mathematical and SPICE modeling of memristors and compare them with known device specifications. Based on this survey of existing models, we adopt an improved mathematical model of the memristor that captures the well-established features of memristive devices. This modeling is used to analyze the time and voltage characteristics of stable read and write operations. The tradeoffs between the various design parameters such as voltage, frequency, noise margin, and area are also analyzed. Based on the device modeling, we propose a hybrid CMOS-memristor memory cell and architecture that addresses the limitations of the memristor such as state drift, cell-cell interference, and refresh requirements. The memristor is used as a state element, and CMOS-based transistors are used to isolate, control, decode, and interoperate the logic. We verify our design with SPICE simulation using a 28-nm model for CMOS and a modified memristor model.

Journal ArticleDOI
TL;DR: The results of synthesis show that, in the first implementation, 17 929 slices or 20% of the chip area is occupied, which makes it suitable for speed-critical cryptographic applications, while in the second implementation, 14 203 slices or 16% of the chip area is utilized, which makes it suitable for applications that may require speed-area tradeoff.
Abstract: A new and highly efficient architecture for elliptic curve scalar point multiplication is presented. To achieve the maximum architectural and timing improvements, we reorganize and reorder the critical path of the Lopez-Dahab scalar point multiplication architecture such that logic structures are implemented in parallel and operations in the critical path are diverted to noncritical paths. The results we obtained show that with G=55 our proposed design is able to compute scalar multiplication over GF(2^163) in 9.6 μs with the maximum achievable frequency of 250 MHz on Xilinx Virtex-4 (XC4VLX200), where G is the digit size of the underlying digit-serial finite-field multiplier. Another implementation variant for less resource consumption is also proposed; with G=33, the design performs the same operation in 11.6 μs at 263 MHz on the same platform. The results of synthesis show that, in the first implementation, 17 929 slices or 20% of the chip area is occupied, which makes it suitable for speed-critical cryptographic applications, while in the second implementation 14 203 slices or 16% of the chip area is utilized, which makes it suitable for applications that may require speed-area tradeoff.

Journal ArticleDOI
TL;DR: A signal selection approach based on total restorability at gate-level is proposed, which is computationally more efficient (10 times faster) and can restore up to three times more signals compared to existing methods.
Abstract: Post-silicon validation is one of the most important and expensive tasks in modern integrated circuit design methodology. The primary problem governing post-silicon validation is the limited observability due to storage of a small number of signals in a trace buffer. The signals to be traced should be carefully selected in order to maximize restoration of the remaining signals. Existing approaches have two major drawbacks. They depend on partial restorability computations that are not effective in restoring maximum signal states. They also require long signal selection time due to inefficient computation as well as operating on gate-level netlist. We have proposed a signal selection approach based on total restorability at gate-level, which is computationally more efficient (10 times faster) and can restore up to three times more signals compared to existing methods. We have also developed a register transfer level signal selection approach, which reduces both memory requirements and signal selection time by several orders-of-magnitude.

Journal ArticleDOI
TL;DR: An energy model which takes into account the realities of scaling, specifically for asynchronous systems is developed and results show fabricated results of 17 Giga Operations per Joule in 0.6 μm at subthreshold when fully pipelined.
Abstract: Statistical analysis of computations per unit energy in processors over the last 30 years is given that illustrates a sharp reduction in the rate of energy efficiency improvements over the last several years resulting in the formation of an asymptotic “wall” with our dataset; we use the measure of giga multiply accumulates per Joule. We have developed an energy model which takes into account the realities of scaling, specifically for asynchronous systems. Studies of an energy efficient asynchronous pipeline show fabricated results of 17 Giga Operations per Joule in 0.6 μm at subthreshold when fully pipelined, and simulations at a more modern 65 nm process show a further order of magnitude improvement on that.

Journal ArticleDOI
TL;DR: This work proposes a hierarchical way to merge flip-flops that significantly reduces clock power by 20-30% with a very short running time, even in the largest test case, which contains 1 700 000 flip-flops.
Abstract: Power has become a burning issue in modern VLSI design. In modern integrated circuits, the power consumed by clocking gradually takes a dominant part. Given a design, we can reduce its power consumption by replacing some flip-flops with fewer multi-bit flip-flops. However, this procedure may affect the performance of the original circuit. Hence, the flip-flop replacement without timing and placement capacity constraints violation becomes a quite complex problem. To deal with the difficulty efficiently, we have proposed several techniques. First, we perform a co-ordinate transformation to identify those flip-flops that can be merged and their legal regions. Besides, we show how to build a combination table to enumerate possible combinations of flip-flops provided by a library. Finally, we use a hierarchical way to merge flip-flops. Besides power reduction, the objective of minimizing the total wirelength is also considered. The time complexity of our algorithm is Θ(n^1.12), less than the empirical complexity of Θ(n^2). According to the experimental results, our algorithm significantly reduces clock power by 20-30% and the running time is very short. In the largest test case, which contains 1 700 000 flip-flops, our algorithm only takes about 5 min to replace flip-flops and the power reduction can achieve 21%.

Journal ArticleDOI
Taesang Cho, Hanho Lee
TL;DR: A novel modified radix-2^5 FFT algorithm that reduces the hardware complexity is proposed, which can reduce the number of complex multiplications and the size of the twiddle factor memory.
Abstract: This paper presents a high-speed low-complexity modified radix-2^5 512-point fast Fourier transform (FFT) processor using an eight data-path pipelined approach for high rate wireless personal area network applications. A novel modified radix-2^5 FFT algorithm that reduces the hardware complexity is proposed. This method can reduce the number of complex multiplications and the size of the twiddle factor memory. It also uses a complex constant multiplier instead of a complex Booth multiplier. The proposed FFT processor achieves a signal-to-quantization noise ratio of 35 dB at 12 bit internal word length. The proposed processor has been designed and implemented using 90-nm CMOS technology with a supply voltage of 1.2 V. The results demonstrate that the total gate count of the proposed FFT processor is 290 K. Furthermore, the highest throughput rate is up to 2.5 GS/s at 310 MHz while requiring much less hardware complexity.

Journal ArticleDOI
TL;DR: This work provides a qualitative perspective of the power and thermal dissipation issues in 3-D integration, studies the impact of through-silicon via (TSV) size on their mitigation, and discusses the design implications in the presence of decoupling capacitors, TSV/on-die/package parasitics, various resonance effects and power gating.
Abstract: 3-D integration presents a path to higher performance, greater density, increased functionality and heterogeneous technology implementation. However, 3-D integration introduces many challenges for power and thermal integrity due to large switching currents, longer power delivery paths, and increased parasitics compared to 2-D integration. In this work, we provide an in-depth study of power and thermal issues while incorporating the physical design characteristics unique to 3-D integration. We provide a qualitative perspective of the power and thermal dissipation issues in 3-D and study the impact of through-silicon via (TSV) size on their mitigation. We investigate and discuss the design implications of power and thermal issues in the presence of decoupling capacitors, TSV/on-die/package parasitics, various resonance effects and power gating. Our study is based on a ten-tier system utilizing existing 3-D technology specifications. Based on detailed power distribution and heat dissipation models, we present a comprehensive analysis of TSV tapering for alleviating power and thermal integrity issues in 3-D ICs.

Journal ArticleDOI
TL;DR: This paper has synthesized the proposed CORDIC cells by Synopsys Design Compiler using TSMC 90-nm library, and shown that the proposed designs offer higher throughput, less latency and less area-delay product than the reference CORDIC design for fixed and known angles of rotation.
Abstract: Rotation of vectors through fixed and known angles has wide applications in robotics, digital signal processing, graphics, games, and animation. But, we do not find any optimized coordinate rotation digital computer (CORDIC) design for vector-rotation through specific angles. Therefore, in this paper, we present optimization schemes and CORDIC circuits for fixed and known rotations with different levels of accuracy. For reducing the area- and time-complexities, we have proposed a hardwired pre-shifting scheme in barrel-shifters of the proposed circuits. Two dedicated CORDIC cells are proposed for the fixed-angle rotations. In one of those cells, micro-rotations and scaling are interleaved, and in the other they are implemented in two separate stages. Pipelined schemes are suggested further for cascading dedicated single-rotation units and bi-rotation CORDIC units for high-throughput and reduced latency implementations. We have obtained the optimized set of micro-rotations for fixed and known angles. The optimized scale-factors are also derived and dedicated shift-add circuits are designed to implement the scaling. The fixed-point mean-squared-error of the proposed CORDIC circuit is analyzed statistically, and strategies for reducing the error are given. We have synthesized the proposed CORDIC cells by Synopsys Design Compiler using TSMC 90-nm library, and shown that the proposed designs offer higher throughput, less latency and less area-delay product than the reference CORDIC design for fixed and known angles of rotation. We find similar results of synthesis for different Xilinx field-programmable gate-array platforms.

Journal ArticleDOI
TL;DR: This paper investigates extremely low-power circuits based on new Si/SiGe heterojunction tunneling transistors (HETTs) that have a subthreshold swing of less than 60 mV/decade and proposes a novel seven-transistor HETT-based SRAM cell topology to overcome, and take advantage of, the asymmetric current flow.
Abstract: The theoretical lower limit of subthreshold swing in MOSFETs (60 mV/decade) significantly restricts low-voltage operation since it results in a low ON-to-OFF current ratio at low supply voltages. This paper investigates extremely low-power circuits based on new Si/SiGe heterojunction tunneling transistors (HETTs) that have a subthreshold swing of less than 60 mV/decade. Device characteristics, as determined through technology computer aided design tools, are used to develop a Verilog-A device model to simulate and evaluate a range of HETT-based circuits. We show that an HETT-based ring oscillator (RO) shows a 9-19 times reduction in dynamic power compared to a CMOS RO. We also explore two key differences between HETTs and traditional MOSFETs, namely, asymmetric current flow and increased Miller capacitance, analyze their effect on circuit behavior, and propose methods to address them. HETT characteristics have the most dramatic impact on static random access memory (SRAM) operation and we propose a novel seven-transistor HETT-based SRAM cell topology to overcome, and take advantage of, the asymmetric current flow. This new HETT SRAM design achieves 7-37 times reduction in leakage power compared to CMOS.

Journal ArticleDOI
TL;DR: This paper decomposes the LUT-Log-BCJR architecture into its most fundamental add compare select (ACS) operations and performs them using a novel low-complexity ACS unit, facilitating a 71% energy consumption reduction.
Abstract: Turbo codes have recently been considered for energy-constrained wireless communication applications, since they facilitate a low transmission energy consumption. However, in order to reduce the overall energy consumption, lookup table-log-BCJR (LUT-Log-BCJR) architectures having a low processing energy consumption are required. In this paper, we decompose the LUT-Log-BCJR architecture into its most fundamental add compare select (ACS) operations and perform them using a novel low-complexity ACS unit. We demonstrate that our architecture employs an order of magnitude fewer gates than the most recent LUT-Log-BCJR architectures, facilitating a 71% energy consumption reduction. Compared to state-of-the-art maximum logarithmic Bahl-Cocke-Jelinek-Raviv implementations, our approach facilitates a 10% reduction in the overall energy consumption at ranges above 58 m.
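
For readers unfamiliar with the add compare select (ACS) step, the following sketch shows the standard max* (Jacobian logarithm) operation that Log-BCJR decoders build on, with the correction term read from a small lookup table. It is not the authors' low-complexity ACS unit, which the abstract does not describe; the table size, the quantization step `STEP`, and all function names are assumptions made for this example.

```python
import math

# Hypothetical 8-entry correction table, quantized in steps of 0.5.
STEP = 0.5
LUT = [math.log1p(math.exp(-k * STEP)) for k in range(8)]

def max_star_exact(a, b):
    """Jacobian logarithm: ln(e^a + e^b)."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

def max_star_lut(a, b):
    """max* with the correction term taken from the lookup table."""
    idx = int(abs(a - b) / STEP)
    corr = LUT[idx] if idx < len(LUT) else 0.0  # negligible for large gaps
    return max(a, b) + corr

def acs(metric_a, branch_a, metric_b, branch_b):
    """Add the branch metrics, then compare and select with max*."""
    return max_star_lut(metric_a + branch_a, metric_b + branch_b)

# Usage: merge two candidate paths into one state metric.
merged = acs(1.0, 0.5, 0.25, 0.25)
```

With this table the approximation error stays below about 0.22, the largest gap between adjacent table entries; a finer step trades table size for accuracy.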

Journal ArticleDOI
TL;DR: A dynamic BER-based greedy wear-leveling algorithm that uses BER statistics as the measurement of memory block wear-out pace, and guides dynamic memory block data swapping to fully maximize the wear-leveling efficiency is presented.
Abstract: This brief presents a NAND Flash memory wear-leveling algorithm that explicitly uses memory raw bit error rate (BER) as the optimization target. Although NAND Flash memory wear-leveling has been well studied, all the existing algorithms aim to equalize the number of programming/erase cycles among all the memory blocks. Unfortunately, such a conventional design practice becomes increasingly suboptimal as inter-block variation becomes increasingly significant with the technology scaling. This brief presents a dynamic BER-based greedy wear-leveling algorithm that uses BER statistics as the measurement of memory block wear-out pace, and guides dynamic memory block data swapping to fully maximize the wear-leveling efficiency. Simulations have been carried out to quantitatively demonstrate its advantages over existing wear-leveling algorithms.
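
The idea of steering data swaps by measured BER rather than by program/erase counts can be sketched as below. This is a hypothetical greedy rule written for illustration only, not the algorithm from the brief: the function name `pick_swap`, the per-block `hotness` write counter, and the `min_gap` threshold are all assumptions.

```python
def pick_swap(ber, hotness, min_gap=0.0):
    """Pick one data swap between blocks, or return None.

    ber[i]     : measured raw bit error rate of block i (wear indicator)
    hotness[i] : recent write count of the data held in block i (assumed)
    Greedy rule (assumed): if the most worn block holds hotter data than
    the least worn block, swap their contents.
    """
    blocks = range(len(ber))
    worn = max(blocks, key=lambda i: ber[i])      # highest BER
    healthy = min(blocks, key=lambda i: ber[i])   # lowest BER
    if ber[worn] - ber[healthy] <= min_gap:
        return None                               # wear is already balanced
    if hotness[worn] <= hotness[healthy]:
        return None                               # hot data is already on the healthy block
    return worn, healthy

# Usage: block 1 is the most worn and holds the hottest data.
decision = pick_swap([1e-4, 5e-4, 2e-4], [3, 10, 1])
```

A cycle-count-based scheme would apply the same kind of rule to erase counters; using BER instead lets blocks that age faster than their neighbors be relieved earlier.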

Journal ArticleDOI
TL;DR: Experimental results show that the proposed technique can reduce interconnect energy by more than 25% on average with almost the same peak temperature when compared with prior thermal-balanced solutions.
Abstract: 3-D technology that stacks silicon dies with through silicon vias (TSVs) is a promising solution to overcome the interconnect scaling problem in giga-scale integrated circuits (ICs). Thermal dissipation is a major challenge for 3-D integration, and prior thermal-balanced task scheduling methods for 3-D multiprocessor system-on-chips (MPSoCs) typically balance power gradient across vertical stacks based on the assumption of strong thermal correlation among processing cores within a stack. On the other hand, 3-D MPSoCs typically employ network-on-chip (NoC) as the communication infrastructure, which consumes a large portion of the energy budget. Because TSVs consume much less energy than horizontal links in 3-D MPSoCs when transmitting the same amount of data, owing to the reduced interconnect distance between vertically adjacent cores, it is attractive to allocate heavily communicating tasks within the same vertical stack as much as possible, so that traffic is confined to the third dimension and interconnect energy is reduced. However, aggregating active tasks within the same stack is likely to exacerbate the power density and result in hot spots. In this paper, we explore the tradeoff between thermal and interconnect energy when allocating tasks in 3-D homogeneous MPSoCs, and propose an efficient heuristic. Experimental results show that the proposed technique can reduce interconnect energy by more than 25% on average with almost the same peak temperature when compared with prior thermal-balanced solutions.

Journal ArticleDOI
TL;DR: A novel detection algorithm with an efficient VLSI architecture featuring efficient operation over infinite complex lattices and support of unbounded infinite lattice decoding distinguishes the present method from previous K-Best strategies and also allows its complexity to scale sublinearly with the modulation order.
Abstract: A novel detection algorithm with an efficient VLSI architecture featuring efficient operation over infinite complex lattices is proposed. The proposed design results in the highest throughput, the lowest latency, and the lowest energy compared to the complex-domain VLSI implementations to date. The main innovations are a novel complex-domain means of expanding/visiting the intermediate nodes of the search tree on demand, rather than exhaustively, as well as a new distributed sorting scheme to keep track of the best candidates at each search phase. Its support of unbounded infinite lattice decoding distinguishes the present method from previous K-Best strategies and also allows its complexity to scale sublinearly with the modulation order. Since the expansion and sorting cores are data-driven, the architecture is well suited for a pipelined parallel VLSI implementation. The proposed algorithm is used to fabricate a 4×4, 64-QAM complex multiple-input-multiple-output detector in a 0.13-μm CMOS technology, achieving a clock rate of 417 MHz with the core area of 340 kgates. The chip test results prove that the fabricated design can sustain a throughput of 1 Gb/s with energy efficiency of 110 pJ/bit, the best numbers reported to date.

Journal ArticleDOI
TL;DR: It is demonstrated that Asymm-ΦG shorted-gate (a-SG) n/p-FinFETs are promising, as they can yield over two orders of magnitude lower leakage without excessive degradation in ON-state current, in comparison to Symm-ΦG shorted-gate (SG) FinFETs, placing them in a better position than back-gate biased independent-gate (IG) FinFETs for leakage reduction.
Abstract: With the emergence of nonplanar CMOS devices at the 22-nm node and beyond, it is highly likely that multigate device adoption will occur in a high-performance process technology, owing to the increased performance and area benefits. In this paper, for the first time, we evaluate symmetric (Symm-ΦG) and asymmetric (Asymm-ΦG) gate-workfunction FinFETs head to head in a high-performance process, using technology computer-aided design 3-D device simulations. We demonstrate that Asymm-ΦG shorted-gate (a-SG) n/p-FinFETs, which use both workfunctions corresponding to typical high-performance metal-gate n/p-FinFETs, are promising, as they can yield over two orders of magnitude lower leakage without excessive degradation in ON-state current, in comparison to Symm-ΦG shorted-gate (SG) FinFETs, placing them in a better position than back-gate biased independent-gate (IG) FinFETs for leakage reduction. Thereafter, we explore the design space of FinFET logic gates, latches, and flip-flops, for optimal tradeoffs in leakage versus delay and temperature, using mixed-mode 2-D device simulations. Elementary logic gates (such as INV, NAND2, NOR2, XOR2, and XNOR2) using Asymm-ΦG SG-mode FinFETs appear to be located optimally in the leakage-delay spectrum, in comparison to the most versatile configurations possible by mixing corresponding Symm-ΦG SG- and IG-mode FinFETs. Latches and flip-flops, however, require an astute combination of Symm-ΦG and Asymm-ΦG FinFETs to optimize leakage, delay, and setup time simultaneously.

Journal ArticleDOI
TL;DR: This paper investigates the impact of the TSV on the quality of 3-D IC layouts and proposes two design schemes, namely TSV co-placement and TSV site, and accompanying algorithms to find and optimize locations of gates and TSVs.
Abstract: The technology of through-silicon vias (TSVs) enables fine-grained integration of multiple dies into a single 3-D stack. TSVs occupy significant silicon area due to their sheer size, which has a great effect on the quality of 3-D integrated chips (ICs). Whereas well-managed TSVs alleviate routing congestion and reduce wirelength, excessive or ill-managed TSVs increase the die area and wirelength. In this paper, we investigate the impact of the TSV on the quality of 3-D IC layouts. Two design schemes, namely TSV co-placement (irregular TSV placement) and TSV site (regular TSV placement), and accompanying algorithms to find and optimize locations of gates and TSVs are proposed for the design of 3-D ICs. Two TSV assignment algorithms are also proposed to enable the regular TSV placement. Simulation results show that the wirelength of 3-D ICs is shorter than that of 2-D ICs by up to 25%.

Journal ArticleDOI
TL;DR: The performance improvements indicate that the proposed designs are well suited for modern high-performance designs where power dissipation and latching overhead are of major concern.
Abstract: In this paper, we introduce a new dual dynamic node hybrid flip-flop (DDFF) and a novel embedded logic module (DDFF-ELM) based on DDFF. The proposed designs eliminate the large capacitance present in the precharge node of several state-of-the-art designs by following a split dynamic node structure to separately drive the output pull-up and pull-down transistors. The DDFF offers a power reduction of up to 37% and 30% compared to the conventional flip-flops at 25% and 50% data activities, respectively. The aim of the DDFF-ELM is to reduce pipeline overhead. It presents an area, power, and speed efficient method to incorporate complex logic functions into the flip-flop. The performance comparisons made in a 90 nm UMC process show a power reduction of 27% compared to the Semidynamic flip-flop, with no degradation in speed performance. The leakage power and process-voltage-temperature variations of various designs are studied in detail and are compared with the proposed designs. Also, DDFF and DDFF-ELM are compared with other state-of-the-art designs by implementing a 4-b synchronous counter and a 4-b Johnson up-down counter. The performance improvements indicate that the proposed designs are well suited for modern high-performance designs where power dissipation and latching overhead are of major concern.

Journal ArticleDOI
TL;DR: This paper presents a novel circuit-level timing error mitigation technique, which aims to increase energy-efficiency of digital signal processing datapaths without loss of robustness, and proposes a new approach to bound the magnitude of intermittent timing errors at the circuit level.
Abstract: In this paper, we present a novel circuit-level timing error mitigation technique, which aims to increase energy-efficiency of digital signal processing datapaths without loss of robustness. Timing errors are detected using razor flip-flops on critical paths, and the error-rate feedback is used to control a dynamic voltage scaling control loop. In place of conventional razor error correction by replay, we propose a new approach to bound the magnitude of intermittent timing errors at the circuit level. A timing guard-band is created by shaping the path delay distribution such that the critical paths correspond to a group of least-significant bit registers. These end-points are ensured to be critical by modifying the topology of the final stage carry-merge adder, and by using tool-based device sizing. Hence, timing violations lead to weakly correlated logical errors of small magnitude in a mean-squared-error sense. We examine this approach in a finite-impulse response (FIR) filter and a 2-D discrete cosine transform implementation, in 32-nm CMOS. Power saving compared to a conventional design at iso-frequency is 21%-23% at the typical corner, while retaining a voltage guard-band to protect against fast transient changes in switching activity and supply noise. The impact on minimum clock period is small (16%-20%), as it does not necessitate the use of ripple-carry adders and also requires only a bare minimum of additional design effort.