Showing papers in "IEEE Transactions on Very Large Scale Integration Systems in 2015"


Journal ArticleDOI
TL;DR: In this paper, a hybrid 1-bit full adder design employing both complementary metal-oxide-semiconductor (CMOS) logic and transmission gate logic is reported and is found to offer significant improvement in terms of power and speed.
Abstract: In this paper, a hybrid 1-bit full adder design employing both complementary metal–oxide–semiconductor (CMOS) logic and transmission gate logic is reported. The design was first implemented for 1 bit and then extended to 32 bits. The circuit was implemented using Cadence Virtuoso tools in 180- and 90-nm technology. Performance parameters such as power, delay, and layout area were compared with those of existing designs such as complementary pass-transistor logic, transmission gate adder, transmission function adder, and hybrid pass-logic with static CMOS output drive full adder. For a 1.8-V supply at 180-nm technology, the average power consumption (4.1563 $\mu$W) was found to be extremely low with moderately low delay (224 ps), resulting from the deliberate incorporation of very weak CMOS inverters coupled with strong transmission gates. The corresponding values were 1.17664 $\mu$W and 91.3 ps at 90-nm technology operating at a 1.2-V supply voltage. The 32-bit full adder was found to work efficiently, with only 5.578-ns (2.45-ns) delay and 112.79-$\mu$W (53.36-$\mu$W) power at 180-nm (90-nm) technology for a 1.8-V (1.2-V) supply voltage. In comparison with the existing full adder designs, the present implementation was found to offer significant improvement in terms of power and speed.
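As a purely functional illustration of what a 1-bit full adder computes and how it can be chained to 32 bits, the following Python sketch may help. It models logic only, not the hybrid CMOS/transmission-gate circuit, and it assumes a simple ripple-carry chain, which the abstract does not specify. The names `full_adder` and `ripple_add` are hypothetical.

```python
def full_adder(a, b, cin):
    """Return (sum, carry_out) for three input bits (each 0 or 1)."""
    s = a ^ b ^ cin                       # sum is the XOR of all three inputs
    cout = (a & b) | (cin & (a ^ b))      # carry is generated or propagated
    return s, cout


def ripple_add(x, y, width=32):
    """Add two unsigned integers bit by bit with a ripple-carry chain.

    Assumption: a plain ripple-carry structure, chosen only for illustration.
    Returns (sum modulo 2**width, final carry_out).
    """
    carry = 0
    result = 0
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry
```

For example, `ripple_add(0xFFFFFFFF, 1)` wraps around to a zero sum with a carry-out of 1.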

215 citations


Journal ArticleDOI
TL;DR: This brief proposes multiplier architectures that can trade off computational accuracy against energy consumption at design time and demonstrates that the resulting small computational error (about 1% on average) does not notably impact the quality of DSP and the accuracy of classification applications.
Abstract: The need to support various digital signal processing (DSP) and classification applications on energy-constrained devices has steadily grown. Such applications often extensively perform matrix multiplications using fixed-point arithmetic while exhibiting tolerance for some computational errors. Hence, improving the energy efficiency of multiplications is critical. In this brief, we propose multiplier architectures that can trade off computational accuracy against energy consumption at design time. Compared with a precise multiplier, the proposed multiplier can consume 58% less energy/op with an average computational error of $\sim$1%. Finally, we demonstrate that such a small computational error does not notably impact the quality of DSP and the accuracy of classification applications.

162 citations


Journal ArticleDOI
TL;DR: A multibit-decision approach that can significantly reduce latency of SCL decoders and a general decoding scheme that can perform intermediate decoding of any $2^{K}$ bits simultaneously, which can reduce the overall decoding latency to as short as $n/2^{K-2}-2$ cycles.
Abstract: Polar codes, as the first provable capacity-achieving error-correcting codes, have received much attention in recent years. However, the decoding performance of polar codes with the traditional successive-cancellation (SC) algorithm cannot match that of low-density parity-check or Turbo codes. Because the SC list (SCL) decoding algorithm can significantly improve the error-correcting performance of polar codes, the design of SCL decoders is important for polar codes to be deployed in practical applications. However, because the prior latency reduction approaches for SC decoders are not applicable to SCL decoders, these list decoders suffer from a long-latency bottleneck. In this paper, we propose a multibit-decision approach that can significantly reduce the latency of SCL decoders. First, we present a reformulated SCL algorithm that can perform intermediate decoding of 2 bits together. The proposed approach, referred to as the 2-bit reformulated SCL (2b-rSCL) algorithm, can reduce the latency of the SCL decoder from $(3n-2)$ to $(2n-2)$ clock cycles without any performance loss. Then, we extend the idea of 2-bit decision to the general case, and propose a general decoding scheme that can perform intermediate decoding of any $2^{K}$ bits simultaneously. This general approach, referred to as the $2^{K}$-bit reformulated SCL ($2^{K}$b-rSCL) algorithm, can reduce the overall decoding latency to as short as $n/2^{K-2}-2$ cycles. Furthermore, on the basis of the proposed algorithms, very large-scale integration architectures for 2b-rSCL and 4b-rSCL decoders are synthesized. Compared with a prior SCL decoder, the proposed (1024, 512) 2b-rSCL and 4b-rSCL decoders can achieve 21% and 60% reduction in latency, 1.66 and 2.77 times increase in coded throughput with list size 2, and 2.11 and 3.23 times increase in coded throughput with list size 4, respectively.
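The cycle-count formulas quoted in the abstract can be evaluated directly. The sketch below is only arithmetic on those formulas; the function names are hypothetical, and it counts clock cycles only, so its ratios need not match the 21% and 60% latency reductions reported for the synthesized decoders.

```python
def scl_cycles(n):
    """Baseline SCL decoder latency in clock cycles: 3n - 2."""
    return 3 * n - 2


def rscl_cycles(n, k):
    """Latency of the 2**k-bit reformulated SCL decoder: n / 2**(k-2) - 2.

    Rewritten as 4n / 2**k - 2 so the arithmetic stays in integers.
    Assumes n is a power of two with n >= 2**k.
    """
    return (4 * n) // (2 ** k) - 2


n = 1024
baseline = scl_cycles(n)          # 3070 cycles
two_bit = rscl_cycles(n, 1)       # 2046 cycles, equal to 2n - 2
four_bit = rscl_cycles(n, 2)      # 1022 cycles
```

Setting `k = 1` reproduces the $(2n-2)$ figure given for 2b-rSCL, which is a quick consistency check on the general formula.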

131 citations


Journal ArticleDOI
TL;DR: The ULV circuit design challenges are discussed and a new biasing metric for ULV and ULP designs in deep-submicrometer CMOS technologies is introduced and series inductive peaking in the feedback loop is analyzed and employed to enhance the bandwidth and noise performance of the LNA.
Abstract: This paper presents a design methodology for an ultra-low-power (ULP) and ultra-low-voltage (ULV) ultra-wideband (UWB) resistive-shunt feedback low-noise amplifier (LNA). The ULV circuit design challenges are discussed and a new biasing metric for ULV and ULP designs in deep-submicrometer CMOS technologies is introduced. Series inductive peaking in the feedback loop is analyzed and employed to enhance the bandwidth and noise performance of the LNA. Exploiting the new biasing metric, the design methodology, and series inductive peaking in the feedback loop, a 0.5-V, 0.75-mW broadband LNA with a current reuse scheme is implemented in a 90-nm CMOS technology. Measurement results show 12.6-dB voltage gain, 0.1–7-GHz bandwidth, 5.5-dB NF, −9-dBm IIP$_{3}$, and −18-dB P1dB while occupying 0.23 mm$^{2}$.

101 citations


Journal ArticleDOI
TL;DR: A novel architecture based on field-programmable gate arrays (FPGAs) for the reconstruction of compressively sensed signal using the orthogonal matching pursuit (OMP) algorithm that provides higher throughput with less area consumption is presented.
Abstract: In this paper, we present a novel architecture based on field-programmable gate arrays (FPGAs) for the reconstruction of compressively sensed signals using the orthogonal matching pursuit (OMP) algorithm. We have analyzed the computational complexities and data dependence between different stages of the OMP algorithm to design an architecture that provides higher throughput with less area consumption. Since the solution of the least squares problem accounts for a large part of the overall computation time, we have suggested a parallel low-complexity architecture for the solution of the linear system. We have further modeled the proposed design using Simulink and carried out the implementation on FPGA using the Xilinx System Generator tool. We present a methodology to optimize both area and execution time in the Simulink environment. The execution time of the proposed design is reduced by maximizing parallelism through an appropriate level of unfolding, while the FPGA resources are reduced by sharing the hardware for matrix–vector multiplication across the data-dependent sections of the algorithm. The hardware implementation on the Virtex6 FPGA provides significantly superior performance in terms of resource utilization, measured in the number of occupied slices, and maximum usable frequency compared with the existing implementations. Compared with the existing similar design, the proposed structure involves 328 more DSP48s, but it involves $25\,802$ fewer slices and 1.85 times less computation time for signal reconstruction with $N = 1024$, $K = 256$, and $m = 36$, where $N$ is the number of samples, $K$ is the size of the measurement vector, and $m$ is the sparsity. It also provides a higher peak signal-to-noise ratio value of 38.9 dB with a reconstruction time of $0.34~\mu$s, which is twice as fast as the existing design. In addition, we have presented a performance metric to implement the OMP algorithm in resource-constrained FPGAs for better quality of signal reconstruction.
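For readers unfamiliar with OMP, the following is a minimal software sketch of the textbook algorithm, not the FPGA architecture described above. It follows the abstract's symbols ($N$ samples, $K$ measurements, sparsity $m$); the function names are hypothetical, columns of the sensing matrix are assumed to have unit norm, and the least squares step is solved through the normal equations with plain Gaussian elimination, which is an illustrative choice rather than the paper's method.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Assumes the matrix is square and non-singular.
    """
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                       # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def omp(phi, y, sparsity):
    """Textbook OMP. phi is a K x N matrix (list of rows), y has length K."""
    k, n = len(phi), len(phi[0])
    cols = [[phi[r][c] for r in range(k)] for c in range(n)]
    residual = list(y)
    support, coeffs = [], []
    for _ in range(sparsity):
        # Step 1: pick the unused column most correlated with the residual.
        scores = [abs(sum(col[r] * residual[r] for r in range(k))) for col in cols]
        for j in support:
            scores[j] = -1.0
        support.append(max(range(n), key=lambda j: scores[j]))
        # Step 2: least squares fit of y on the selected columns.
        gram = [[sum(cols[i][r] * cols[j][r] for r in range(k)) for j in support]
                for i in support]
        rhs = [sum(cols[i][r] * y[r] for r in range(k)) for i in support]
        coeffs = solve_linear(gram, rhs)
        # Step 3: update the residual with the new fit.
        residual = [y[r] - sum(c * cols[j][r] for c, j in zip(coeffs, support))
                    for r in range(k)]
    x = [0.0] * n
    for c, j in zip(coeffs, support):
        x[j] = c
    return x
```

Step 2 is the part the abstract identifies as dominating computation time, which is why a parallel solver for the linear system is the focus of the hardware design.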

97 citations


Journal ArticleDOI
TL;DR: A hill-climbing maximum power point tracking algorithm is developed in an energy-efficient manner to tune the input impedance of the system and guarantee adaptive maximum power transfer under wide illumination conditions, and a capacitor value modulation is implemented to achieve a higher efficiency than the traditional pulse-frequency modulation scheme.
Abstract: Implementing a monolithic highly efficient ultralow photovoltaic (PV) power harvesting system is pivotal for smart nodes of Internet of things (IOT) networks. This paper proposes a fully integrated harvesting system in 0.18-$\mu$m CMOS technology. Utilizing a small commercial solar cell of 2.5 $\mathrm{cm}^{2}$, the proposed system can provide 0–29 $\mu$W of power, which is much higher than that of the passive radio-frequency identification devices commonly used in IOT applications. The hill-climbing maximum power point tracking algorithm is developed in an energy-efficient manner to tune the input impedance of the system and guarantee adaptive maximum power transfer under wide illumination conditions. The detailed impedance tuning approach is implemented with a capacitor value modulation to eliminate the quiescent power consumption as well as to achieve a higher efficiency than the traditional pulse-frequency modulation scheme. A supercapacitor is utilized for buffering, energy storing, and filtering purposes, which enables more functions of the IOT smart nodes such as active sensing and system-on-chip (SOC) signal processing. The output voltage ranges between 3.0 and 3.5 V for different device loads, such as sensors, SOC, or wireless transceivers. The measured results confirm that this PV harvesting system achieves both ultralow operation capability under $20~\mu$W and a self-sustaining efficiency of 89%.
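The hill-climbing idea behind maximum power point tracking can be sketched in a few lines. This is a generic perturb-and-observe loop, not the paper's capacitor-value-modulation circuit: the control setting is an abstract integer code, and `measure_power`, `toy_power_curve`, and all numeric values are hypothetical stand-ins.

```python
def hill_climb_mppt(measure_power, start, lo, hi, steps):
    """Generic perturb-and-observe tracking over an integer control setting.

    measure_power(setting) returns the harvested power at that setting.
    The loop keeps stepping in one direction while power does not fall,
    and reverses when power drops or a bound is reached.
    """
    setting = start
    direction = 1
    last_power = measure_power(setting)
    for _ in range(steps):
        candidate = min(hi, max(lo, setting + direction))
        power = measure_power(candidate)
        if power < last_power or candidate in (lo, hi):
            direction = -direction      # got worse or hit a limit: turn around
        setting, last_power = candidate, power
    return setting


def toy_power_curve(setting):
    """Hypothetical single-peak curve with its maximum at setting 20."""
    return 29.0 - 0.05 * (setting - 20) ** 2


final = hill_climb_mppt(toy_power_curve, start=5, lo=0, hi=63, steps=40)
```

A loop of this kind never settles exactly; it ends within one step of the peak and keeps oscillating around it, which is the usual behavior of perturb-and-observe trackers.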

88 citations


Journal ArticleDOI
TL;DR: This paper presents a novel approach to design obfuscated circuits for digital signal processing (DSP) applications using high-level transformations, a key-based obfuscating finite-state machine (FSM), and a reconfigurator to design DSP circuits that are harder to reverse engineer.
Abstract: This paper presents a novel approach to design obfuscated circuits for digital signal processing (DSP) applications using high-level transformations, a key-based obfuscating finite-state machine (FSM), and a reconfigurator. The goal is to design DSP circuits that are harder to reverse engineer. High-level transformations of iterative data-flow graphs have been exploited for area-speed-power tradeoffs. This is the first attempt to develop a design flow to apply high-level transformations that not only meet these tradeoffs but also simultaneously obfuscate the architectures both structurally and functionally. Several modes of operations are introduced for obfuscation where the outputs are meaningful from a signal processing point of view, but are functionally incorrect. Examples of such modes include a third-order digital filter that can also implement a sixth-order or ninth-order filter in a time-multiplexed manner. The latter two modes are meaningful but represent functionally incorrect modes. Multiple meaningful modes can be exploited to reconfigure the filter order for different applications. Other modes may correspond to nonmeaningful modes. A correct key input to an FSM activates a reconfigurator. The configure data controls various modes of the circuit operation. Functional obfuscation is accomplished by requiring use of the correct initialization key, and configure data. Wrong initialization key fails to enable the reconfigurator, and a wrong configure data activates either a meaningful but nonfunctional or nonmeaningful mode. Probability of activating the correct mode is significantly reduced leading to an obfuscated DSP circuit. Structural obfuscation is also achieved by the proposed methodology via high-level transformations. Experimental results show that the overhead of the proposed methodology is small, while a strong obfuscation is attained. 
For example, the area overhead for a $(3l)$th-order IIR filter benchmark is only 17.7% with a 128-bit configuration key, where $1 \leq l \leq 8$, i.e., the order of this filter should be a multiple of 3, and can vary from 3 to 24.

86 citations


Journal ArticleDOI
Jun Lin, Zhiyuan Yan
TL;DR: In this article, the authors propose an efficient list decoder architecture for the CRC-aided SCL algorithm, based on both algorithmic reformulations and architectural techniques, which achieves 1.24–1.83 times the area efficiency of list decoders in the literature.
Abstract: Long polar codes can achieve the symmetric capacity of arbitrary binary-input discrete memoryless channels under a low-complexity successive cancelation (SC) decoding algorithm. However, for polar codes with short and moderate code lengths, the decoding performance of the SC algorithm is inferior. The cyclic-redundancy-check (CRC)-aided SC-list (SCL)-decoding algorithm has better error performance than the SC algorithm for short or moderate polar codes. In this paper, we propose an efficient list decoder architecture for the CRC-aided SCL algorithm, based on both algorithmic reformulations and architectural techniques. In particular, an area efficient message memory architecture is proposed to reduce the area of the proposed decoder architecture. An efficient path pruning unit suitable for large list size is also proposed. For a polar code of length 1024 and rate 1/2, when list size $L=2$ and 4, the proposed list decoder architecture is implemented under a Taiwan Semiconductor Manufacturing Company (TSMC) 90-nm CMOS technology. Compared with the list decoders in the literature, our decoder achieves 1.24–1.83 times the area efficiency.

83 citations


Journal ArticleDOI
TL;DR: Because of a novel architecture combined with the use of multithreshold CMOS technique, the proposed circuit guarantees robust voltage shifting from the deep subthreshold to the above-threshold domain while exhibiting fast response and low energy consumption.
Abstract: The multisupply voltage design technique is widely used in modern system-on-chips to trade off energy and speed. Level shifters (LSs) allow different voltage domains to be interfaced. In this brief, a new LS is presented for fast and wide-range voltage conversion. Because of a novel architecture combined with the use of the multithreshold CMOS technique, the proposed circuit guarantees robust voltage shifting from the deep subthreshold to the above-threshold domain while exhibiting fast response and low energy consumption. When implemented in a 90-nm technology node, considering process-voltage-temperature variations, the proposed design reliably converts 100-mV input signals into 1-V output signals. Post-layout simulation results demonstrate that the new LS shows a propagation delay of 16.6 ns, a static power dissipation of 8.7 nW, and a total energy per transition of only 77 fJ for a 0.2-V, 1-MHz input pulse.

75 citations


Journal ArticleDOI
TL;DR: In the worst case, the crosstalk noise power exceeds the signal power in all three WDM-based ONoC architectures, even when the number of processor cores is small, e.g., 64.
Abstract: Optical networks-on-chip (ONoCs) using wavelength-division multiplexing (WDM) technology have attracted increasing attention for their use in tackling the high power consumption and low bandwidth issues in growing metallic interconnection networks in multiprocessor systems-on-chip. However, the basic optical devices employed to construct WDM-based ONoCs are imperfect and suffer from inevitable power loss and crosstalk noise. Furthermore, when employing WDM, optical signals of various wavelengths can interfere with each other through different optical switching elements within the network, creating crosstalk noise. As a result, the crosstalk noise in large-scale WDM-based ONoCs accumulates and causes severe performance degradation, restricts the network scalability, and considerably attenuates the signal-to-noise ratio (SNR). In this paper, we systematically study and compare the worst case as well as the average crosstalk noise and SNR in three well-known optical interconnect architectures: mesh-based, folded-torus-based, and fat-tree-based ONoCs using WDM. The analytical models for the worst case and the average crosstalk noise and SNR in the different architectures are presented. Furthermore, the proposed analytical models are integrated into a newly developed crosstalk noise and loss analysis platform (CLAP) to analyze the crosstalk noise and SNR in WDM-based ONoCs of any network size using an arbitrary optical router. Utilizing CLAP, we compare the worst case as well as the average crosstalk noise and SNR in different WDM-based ONoC architectures. Furthermore, we indicate how the SNR changes with respect to variations in the number of optical wavelengths in use, the free-spectral range, and the microresonators' $Q$ factor. The analysis results demonstrate that the crosstalk noise is of critical concern to WDM-based ONoCs: in the worst case, the crosstalk noise power exceeds the signal power in all three WDM-based ONoC architectures, even when the number of processor cores is small, e.g., 64.
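To make the closing claim concrete: when accumulated crosstalk power exceeds signal power, the SNR expressed in decibels becomes negative. The short sketch below uses the standard definition $\mathrm{SNR} = 10\log_{10}(P_{\text{signal}}/P_{\text{noise}})$; the function name and the power values are hypothetical and are not taken from the paper or from CLAP.

```python
import math


def snr_db(signal_power_mw, crosstalk_powers_mw):
    """SNR in dB, with noise taken as the sum of crosstalk contributions."""
    noise = sum(crosstalk_powers_mw)
    return 10.0 * math.log10(signal_power_mw / noise)


# Hypothetical numbers: few weak contributions versus many accumulated ones.
small_network = snr_db(1.0, [0.05, 0.05])      # noise is a tenth of the signal
large_network = snr_db(1.0, [0.5] * 4)         # noise is twice the signal
```

In the second case the ratio falls below one, so the result is negative, which corresponds to the worst-case situation the abstract describes.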

75 citations


Journal ArticleDOI
TL;DR: The physical design-aware fault-tolerant quantum circuit synthesis (PAQCS) contains two algorithms: one for physical qubit placement and another for routing of communications, which reduce the overhead of converting a logical to a physical circuit by 30.1%, on average, relative to previous work.
Abstract: Quantum circuits consist of a cascade of quantum gates. In a physical design-unaware quantum logic circuit, a gate is assumed to operate on an arbitrary set of quantum bits (qubits), without considering the physical location of the qubits. However, in reality, physical qubits have to be placed on a grid. Each node of the grid represents a qubit. The grid implements the architecture of the quantum computer. A physical constraint often imposed is that quantum gates can only operate on adjacent qubits on the grid. Hence, a communication channel needs to be built if the qubits in the logical circuit are not adjacent. In this paper, we introduce a tool called the physical design-aware fault-tolerant quantum circuit synthesis (PAQCS). It contains two algorithms: one for physical qubit placement and another for routing of communications. With the help of these two algorithms, the overhead of converting a logical to a physical circuit is reduced by 30.1%, on average, relative to previous work. The optimization algorithms in PAQCS are evaluated on circuits implemented using quantum operations supported by two different quantum physical machine descriptions and three quantum error-correcting codes. They reduce the number of primitive operations by 11.5%–68.6%, and the number of execution cycles by 16.9%–59.4%.
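The adjacency constraint can be illustrated with a simple cost model. The sketch below is not the PAQCS placement or routing algorithm; it assumes a 2-D grid, assumes one qubit is moved toward the other with SWAP operations, and scores a fixed placement without updating it after each gate. All names are hypothetical.

```python
def swaps_needed(pos_a, pos_b):
    """SWAPs needed to bring two grid qubits next to each other.

    Assumption: one qubit walks along a shortest path, so the count is the
    Manhattan distance minus one (zero if the qubits are already adjacent).
    """
    dist = abs(pos_a[0] - pos_b[0]) + abs(pos_a[1] - pos_b[1])
    return max(0, dist - 1)


def routing_overhead(placement, gates):
    """Total SWAP estimate for a list of two-qubit gates under a placement."""
    return sum(swaps_needed(placement[a], placement[b]) for a, b in gates)


gates = [("q0", "q1"), ("q0", "q2")]
far_placement = {"q0": (0, 0), "q1": (0, 1), "q2": (2, 2)}
near_placement = {"q0": (0, 0), "q1": (0, 1), "q2": (1, 0)}
```

Comparing `routing_overhead(far_placement, gates)` with `routing_overhead(near_placement, gates)` shows why placement matters: the same logical circuit needs three SWAPs in the first layout and none in the second.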

Journal ArticleDOI
TL;DR: This brief proposes a novel memory architecture, named Z-TCAM, which emulates the TCAM functionality with SRAM and logically partitions the classical TCAM table along columns and rows into hybrid TCAM subtables, which are then processed to map on their corresponding memory blocks.
Abstract: Ternary content addressable memories (TCAMs) perform high-speed lookup operations, but compared with static random access memories (SRAMs), TCAMs have certain limitations such as low storage density, relatively slow access time, low scalability, complex circuitry, and high cost. Thus, can we use the benefits of SRAM by configuring it (with additional logic) to enable it to behave like TCAM? This brief proposes a novel memory architecture, named Z-TCAM, which emulates the TCAM functionality with SRAM. Z-TCAM logically partitions the classical TCAM table along columns and rows into hybrid TCAM subtables, which are then processed to map on their corresponding memory blocks. Two example designs for Z-TCAM of sizes $512 \times 36$ and $64 \times 32$ have been implemented on a Xilinx Virtex-7 field-programmable gate array. The design of the $64 \times 32$ Z-TCAM has also been implemented using the OSUcells library for 0.18-$\mu$m technology, which confirms the physical and technical feasibility of Z-TCAM. Search latency for each design is three clock cycles. The detailed implementation results and power measurements for each design are reported.
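One common way to emulate ternary matching with ordinary memories is to split the search key into chunks, use each chunk as an address into its own table of per-rule match bits, and AND the fetched bit vectors. The sketch below shows that general idea only; it is not the Z-TCAM mapping, and the function names, the rule format, and the lowest-index priority rule are assumptions made for illustration.

```python
def build_tables(rules, width, chunk):
    """Precompute one lookup table per key chunk.

    rules: ternary strings over '0', '1', 'x' (most significant bit first).
    table[value] is a bitmask whose bit i is set when rule i matches `value`
    in that chunk.
    """
    assert width % chunk == 0
    tables = []
    for start in range(0, width, chunk):
        table = [0] * (1 << chunk)
        for value in range(1 << chunk):
            bits = format(value, "0{}b".format(chunk))
            for idx, rule in enumerate(rules):
                pattern = rule[start:start + chunk]
                if all(p in ("x", b) for p, b in zip(pattern, bits)):
                    table[value] |= 1 << idx
        tables.append(table)
    return tables


def search(tables, key, width, chunk):
    """Return the index of the first matching rule, or None."""
    bits = format(key, "0{}b".format(width))
    match = -1                                    # all ones before the first AND
    for i, table in enumerate(tables):
        value = int(bits[i * chunk:(i + 1) * chunk], 2)
        match &= table[value]                     # one memory read per chunk
    if match == 0:
        return None
    return (match & -match).bit_length() - 1      # lowest set bit wins


rules = ["10xx", "1x01", "xxxx"]
tables = build_tables(rules, width=4, chunk=2)
```

With these rules, `search(tables, 0b1101, 4, 2)` returns 1, because the first rule fails on the upper chunk while the second matches both chunks.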

Journal ArticleDOI
TL;DR: An efficient combined single-path delay commutator-feedback (SDC-SDF) radix-2 pipelined fast Fourier transform architecture, which includes $\log_{2} N - 1$ SDC stages and 1 SDF stage, is presented.
Abstract: We present an efficient combined single-path delay commutator-feedback (SDC-SDF) radix-2 pipelined fast Fourier transform architecture, which includes $\log_{2} N - 1$ SDC stages and 1 SDF stage. The SDC processing engine is proposed to achieve 100% hardware resource utilization by sharing the common arithmetic resources, including both adders and multipliers, in a time-multiplexed approach. Thus, the required number of complex multipliers is reduced to $\log_{4} N - 0.5$, compared with $\log_{2} N - 1$ for the other radix-2 SDC/SDF architectures. In addition, the proposed architecture requires a roughly minimum number of complex adders, $\log_{2} N + 1$, and complex delay memory, $2N + 1.5\log_{2} N - 1.5$.
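The resource expressions in the abstract are easy to tabulate for a given transform size. The sketch below simply evaluates them as written; the function name and dictionary keys are hypothetical, and the formulas are applied literally, so fractional values such as 4.5 multipliers are reported without interpreting how the hardware realizes them.

```python
def fft_resource_counts(n):
    """Evaluate the abstract's resource formulas for an N-point FFT.

    Assumes n is a power of two and at least 4.
    """
    if n < 4 or n & (n - 1):
        raise ValueError("n must be a power of two, at least 4")
    log2n = n.bit_length() - 1            # exact integer log2 for powers of two
    return {
        "complex_multipliers_proposed": log2n / 2 - 0.5,   # log4(N) - 0.5
        "complex_multipliers_other": log2n - 1,            # log2(N) - 1
        "complex_adders": log2n + 1,                       # log2(N) + 1
        "complex_delay_memory": 2 * n + 1.5 * log2n - 1.5,
    }


counts = fft_resource_counts(1024)
```

For $N = 1024$ this gives 4.5 versus 9 complex multipliers, 11 complex adders, and 2061.5 words of complex delay memory.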

Journal ArticleDOI
TL;DR: In this paper, the authors propose a new 9T SRAM cell that has good write ability and improves read stability at the same time, and they show that the proposed design increases the read static noise margin and the $I_{\mathrm{ON}}/I_{\mathrm{OFF}}$ of the read path by 219% and 113%, respectively, at a supply voltage of 300 mV.
Abstract: In this paper, we present a new 9T SRAM cell that has good write ability and improves read stability at the same time. Simulation results show that the proposed design increases the read static noise margin and the $I_{\mathrm{ON}}/I_{\mathrm{OFF}}$ of the read path by 219% and 113%, respectively, at a supply voltage of 300 mV over a conventional 6T SRAM cell in a 90-nm CMOS technology. The proposed design lets us reduce the minimum operating voltage of SRAM ($\mathrm{VDD}_{\rm min}$) to 350 mV, whereas conventional 6T SRAM cannot operate successfully with an acceptable failure rate at supply voltages below 725 mV. We also compared our design with three other SRAM cells from recent literature. To verify the proposed design, a 256-kb SRAM is designed using the new 9T and conventional 6T SRAM cells. Operating at their minimum possible $V_{\rm DD}$s, the proposed design decreases write and read power per operation by 92% and 93%, respectively, over the conventional rival. The area of the proposed SRAM cell is increased by 83% over a conventional 6T one. However, due to the large $I_{\mathrm{ON}}/I_{\mathrm{OFF}}$ of the read path for the 9T cell, we are able to put 1k cells in each column of the 256-kb SRAM block, making it possible to share the write and read circuitry of each column among more cells than with conventional 6T. Thus, the area overhead of the 256-kb SRAM based on the new 9T cell is reduced to 37% compared with 6T SRAM.

Journal ArticleDOI
TL;DR: This work proposes a bilinear quarter pixel approximation, together with a search pattern based on it to reduce the complexity of interpolation and fractional search process, and achieves more than 52% improvement on power efficiency, relative to previous works in H.264.
Abstract: Fractional motion estimation (FME) significantly enhances video compression efficiency, but its high computational complexity also limits the real-time processing capability. In this brief, we present a VLSI implementation of an FME design in High Efficiency Video Coding for ultrahigh definition video applications. We first propose a bilinear quarter pixel approximation, together with a search pattern based on it, to reduce the complexity of the interpolation and fractional search process. Furthermore, a data reuse strategy is exploited to reduce the hardware cost of the transform. In addition, using the considered pixel parallelism and a dedicated access pattern for memory, we fully pipeline the computation and achieve high hardware utilization. This design has been implemented as a 65-nm CMOS chip and verified. The measured throughput reaches 995 Mpixels/s for $7680 \times 4320$ video at 30 frames/s at 188 MHz, at least 4.7 times faster than prior art. The corresponding power dissipation is 198.6 mW, with a power efficiency of 0.2 nJ/pixel. Due to the optimization, our work achieves more than 52% improvement in power efficiency relative to previous works in H.264.

Journal ArticleDOI
TL;DR: This is the very first in-depth study on TSV inductors to make them practical for high-frequency applications and proposes a novel shield mechanism utilizing the microchannel, a technique conventionally used for heat removal, to reduce the substrate loss.
Abstract: Through-silicon-vias (TSVs) can potentially be used to implement inductors in 3-D integrated systems for minimal footprint and large inductance. However, different from conventional 2-D spiral inductors, TSV inductors are fully buried in the lossy substrate, thus suffering from low quality factors. In this paper, we systematically examine how various process and design parameters affect their performance. A few interesting phenomena that are unique to TSV inductors are observed. We then propose a novel shield mechanism utilizing the microchannel, a technique conventionally used for heat removal, to reduce the substrate loss. The technique increases the quality factor and inductance of the TSV inductor by up to $21\times$ and $17\times$, respectively. Finally, since full-wave simulations of 3-D structures are time-consuming, we develop a set of compressed sensing-based design strategies for microchannel-shielded TSV inductors, which require only a minimal number of simulations. It enables us to implement microchannel-shielded TSV inductors of up to $5.44\times$ reduced area compared with spiral inductors of the same design specs (quality factor, inductance, and frequency). To the best of our knowledge, this is the very first in-depth study on TSV inductors to make them practical for high-frequency applications. We hope our study shall point out a new and exciting research direction for 3-D integrated circuit designers.

Journal ArticleDOI
TL;DR: A new and efficient Montgomery modular multiplication architecture based on a new digit serial computation that relaxes the high-radix partial multiplication to a binary multiplication and performs several multiplications of consecutive zero bits in one clock cycle instead of several clock cycles is presented.
Abstract: Modular exponentiation with a large modulus and exponent is a fundamental operation in many public-key cryptosystems. This operation is usually accomplished by repeating modular multiplications. Montgomery modular multiplication has been widely used to relax the quotient determination. The carry–save adder has been employed to reduce the critical path. This paper presents and evaluates a new and efficient Montgomery modular multiplication architecture based on a new digit serial computation. The proposed architecture relaxes the high-radix partial multiplication to a binary multiplication. It also performs several multiplications of consecutive zero bits in one clock cycle instead of several clock cycles. Moreover, the right-to-left and left-to-right modular exponentiation architectures have been modified to use the proposed modular multiplication architecture as their structural unit. We provide the implementation results on a Xilinx Virtex 5 FPGA demonstrating that the total computation time and throughput rate of the proposed architectures outperform most results reported so far in the literature.
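As background for the abstract, here is a minimal sketch of textbook bit-serial (radix-2) Montgomery multiplication and a right-to-left exponentiation built on it. It is not the proposed digit-serial architecture and does not include its zero-skipping optimization; the function names are hypothetical, and the modulus is assumed to be odd.

```python
def montgomery_multiply(a, b, n, k):
    """Return a * b * 2**(-k) mod n for odd n < 2**k and 0 <= a, b < n.

    One bit of `a` is consumed per iteration; adding n when the running sum
    is odd makes the division by two exact, so no trial division is needed.
    """
    s = 0
    for i in range(k):
        s += ((a >> i) & 1) * b
        if s & 1:
            s += n
        s >>= 1
    if s >= n:                    # the loop keeps s below 2n, so one
        s -= n                    # conditional subtraction is enough
    return s


def montgomery_pow(base, exp, n):
    """Right-to-left modular exponentiation using Montgomery products."""
    k = n.bit_length()
    r2 = pow(2, 2 * k, n)                              # R**2 mod n, R = 2**k
    x = montgomery_multiply(base % n, r2, n, k)        # base in Montgomery form
    acc = montgomery_multiply(1, r2, n, k)             # 1 in Montgomery form
    while exp:
        if exp & 1:
            acc = montgomery_multiply(acc, x, n, k)
        x = montgomery_multiply(x, x, n, k)
        exp >>= 1
    return montgomery_multiply(acc, 1, n, k)           # leave Montgomery form
```

The result can be checked against Python's built-in three-argument `pow`, for example `montgomery_pow(7, 123, 1000003) == pow(7, 123, 1000003)`.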

Journal ArticleDOI
TL;DR: A dynamic voltage and frequency scaling scheme with SC converters is proposed that achieves high converter efficiency by allowing the output voltage to ripple and having the processor core frequency track the ripple.
Abstract: Integrating multiple power converters on-chip improves energy efficiency of manycore architectures. Switched-capacitor (SC) dc-dc converters are compatible with conventional CMOS processes, but traditional implementations suffer from limited conversion efficiency. We propose a dynamic voltage and frequency scaling scheme with SC converters that achieves high converter efficiency by allowing the output voltage to ripple and having the processor core frequency track the ripple. Minimum core energy is achieved by hopping between different converter modes and tuning body-bias voltages. A multicore processor model based on a 28-nm technology shows conversion efficiencies of 90% along with over 25% improvement in the overall chip energy efficiency.

Journal ArticleDOI
TL;DR: A digital low-dropout regulator (D-LDO) with a proposed transient-response boost technique, which enables the reduction of transient response time, as well as overshoot/undershoot, when the load current is abruptly drawn.
Abstract: This paper presents a digital low-dropout regulator (D-LDO) with a proposed transient-response boost technique, which enables the reduction of transient response time, as well as overshoot/undershoot, when the load current is abruptly drawn. The proposed D-LDO detects the deviation of the output voltage caused by overshoot/undershoot, and increases its loop gain for as long as the deviation is beyond a limit. Once the output voltage has settled again, the loop gain is returned to its original value. With the D-LDO fabricated in a 110-nm CMOS technology, we measured its settling time and peak undershoot, which were reduced by 60% and 72%, respectively, with the transient-response boost mode compared with operation without it. Using digital logic gates, the chip occupies a small area of 0.04 mm$^{2}$, and it achieves a maximum current efficiency of 99.98% by consuming a quiescent current of 15 $\mu$A at 0.7-V input voltage.

Journal ArticleDOI
TL;DR: In this paper, the authors proposed a subthreshold CMOS voltage reference operating with a minimum supply voltage of only 150 mV, which is three times lower than the minimum value presently reported in the literature.
Abstract: We propose a subthreshold CMOS voltage reference operating with a minimum supply voltage of only 150 mV, which is three times lower than the minimum value presently reported in the literature. The generated reference voltage is only 17.69 mV. This result has been achieved by introducing a temperature compensation technique that does not require the drain–source voltage of each MOSFET to be larger than $4kT/q$. The implemented solution consists of a two-transistor voltage reference with two MOSFETs of the same threshold type and exploits the dependence of the threshold voltage on transistor size. Measurements performed over a large sample population of 60 chips from two separate batches show a standard deviation of only 0.29 mV. The mean variation of the reference voltage for $V_{\rm DD}$ ranging from 0.15 to 1.8 V is 359.5 $\mu$V/V, whereas the mean variation of $V_{\rm REF}$ in the temperature range from 0 °C to 120 °C is 26.74 $\mu$V/°C. The mean power consumption at 25 °C for $V_{\rm DD} = 0.15$ V is 26.1 pW. The occupied area is $1200~\mu$m$^{2}$.

Journal ArticleDOI
TL;DR: To generate 100% accurate results, error detection and recovery circuits are added to the proposed CSPA to construct a variable-latency carry speculative adder (VLCSPA).
Abstract: Adders are one of the most critical arithmetic circuits in a system and their throughput affects the overall performance of the system. Traditional n-bit adders provide accurate results, but the lower bound of their critical path delay is $\Omega(\log n)$. To achieve a critical path delay lower than $\Omega(\log n)$, many approximate adders have been proposed. These approximate adders decrease the critical path delay and improve the speed by sacrificing computation accuracy or predicting the computation results. This paper proposes a high-performance low-power carry speculative adder (CSPA). This adder separates the carry generator and sum generator. Only one sum generator is used in a block adder to reduce the critical path delay and area overhead. In addition, to generate 100% accurate results, error detection and recovery circuits are added to the proposed CSPA to construct a variable-latency carry speculative adder (VLCSPA). Instead of recalculating all results, the error detection and recovery circuits find and correct the block adder that generates incorrect partial sum bits, reducing power consumption. The experimental results show that the proposed CSPA achieves a 26.59% delay reduction, a 14.06% area reduction, and a 19.03% power consumption reduction compared to the corresponding values for an existing speculative carry-select adder. The experimental results also show that the proposed CSPA can be used to improve image denoising results.
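The idea of speculating carries and then detecting and repairing mispredictions can be illustrated in software. The following is only a minimal behavioral sketch of a generic block-based speculative adder, not the CSPA/VLCSPA circuit from the paper: the block size, the rule "predict each block's carry-in from the previous block alone", and the names `speculative_add` and `variable_latency_add` are all assumptions made for illustration.

```python
def speculative_add(a, b, width=16, block=4):
    """Behavioral model of a block-based speculative adder (illustrative only).

    Each block guesses its carry-in by looking only at the previous block and
    assuming that block received no carry. Returns (speculative_sum,
    exact_sum, error_detected), all modulo 2**width.
    """
    mask = (1 << block) - 1
    spec_sum = 0
    exact_sum = 0
    true_carry = 0
    error = False
    for i in range(width // block):
        a_blk = (a >> (i * block)) & mask
        b_blk = (b >> (i * block)) & mask
        if i == 0:
            spec_carry = 0
        else:
            prev_a = (a >> ((i - 1) * block)) & mask
            prev_b = (b >> ((i - 1) * block)) & mask
            # Short carry chain: previous block only, carry-in assumed 0.
            spec_carry = (prev_a + prev_b) >> block
        spec_sum |= ((a_blk + b_blk + spec_carry) & mask) << (i * block)
        # Error detection: the guessed carry differs from the real one.
        if spec_carry != true_carry:
            error = True
        total = a_blk + b_blk + true_carry
        exact_sum |= (total & mask) << (i * block)
        true_carry = total >> block
    return spec_sum, exact_sum, error


def variable_latency_add(a, b, width=16, block=4):
    """Return (result, cycles): 1 cycle if speculation held, 2 if recovered."""
    spec, exact, error = speculative_add(a, b, width, block)
    return (exact, 2) if error else (spec, 1)


if __name__ == "__main__":
    print(variable_latency_add(0x1234, 0x1111))  # no long carry chain
    print(variable_latency_add(0x00FF, 0x0001))  # carry ripples across blocks
```

In this model, the extra "cycle" is paid only when a carry propagates through more than one full block, which is the sense in which such an adder has variable latency.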

Journal ArticleDOI
TL;DR: A hotspot detection scheme is proposed that identifies the PV module under a hotspot condition; it can be used to avoid permanent damage to the cells under hotspot, and thus their negative impact on the power efficiency of the entire PV system.
Abstract: In this paper, we address the problem of modeling the thermal behavior of photovoltaic (PV) cells undergoing a hotspot condition. In case of shading, PV cells may experience a dramatic temperature increase, with a consequent reduction of the provided power. Our model has been validated against experimental data, and has highlighted a counterintuitive PV cell behavior that should be considered to improve the energy efficiency of PV arrays. Then, we propose a hotspot detection scheme that identifies the PV module that is under a hotspot condition. Such a scheme can be used to avoid permanent damage to the cells under hotspot, and thus their negative impact on the power efficiency of the entire PV system.

Journal ArticleDOI
TL;DR: The GHR combines 2-D and 1-D factorization techniques and improves the throughput by a factor of two to four with comparable hardware cost compared with the previous designs; its speed–area ratio is nearly two times better than that of previous FFT processors.
Abstract: In this paper, we propose a hardware-efficient mixed generalized high-radix (GHR) reconfigurable fast Fourier transform (FFT) processor for long-term evolution applications. The GHR processor based on radix-25/16/9 uses a 2-D factorization scheme as the high-radix unit and a 1-D factorization method as the system data routing technology. The 2-D factorization scheme is implemented by an enhanced delay element matrix structure, which supports 25-, 16-, 9-, 8-, 5-, 4-, 3-, and 2-point FFTs. Two different designs were implemented. One design (called discrete Fourier transform core) supports 34 different transform sizes from 12 to 1296 points, while the other design (called FFT core) supports five different power-of-two sizes from 128 to 2048 points. The 1-D factorization method is performed by a coprime accessing technology, which accesses the data in parallel without conflict using a RAM. The GHR combines 2-D and 1-D factorization techniques and improves the throughput by a factor of two to four with comparable hardware cost compared with the previous designs. The speed–area ratio of the proposed scheme is nearly two times better than that of previous FFT processors. Application-specific integrated circuit implementation results based on a 0.18-$\mu{\rm m}$ technology are also provided.

Journal ArticleDOI
TL;DR: New codes that can correct triple adjacent errors and 3-bit burst errors are presented and have been implemented using a 45-nm library and compared with previous proposals, showing that these codes have better error protection with a moderate overhead and low redundancy.
Abstract: Static random access memories (SRAMs) are key in electronic systems. They are used not only as standalone devices, but also embedded in application-specific integrated circuits. One key challenge for memories is their susceptibility to radiation-induced soft errors that change the value of memory cells. Error correction codes (ECCs) are commonly used to ensure correct data despite soft error effects in semiconductor memories. Single error correction/double error detection (SEC-DED) codes have traditionally been the preferred choice for data protection in SRAMs. During the last decade, the percentage of errors that affect more than one memory cell has increased substantially, mainly due to multiple cell upsets (MCUs) caused by radiation. The bits affected by these errors are physically close. To mitigate their effects, ECCs that correct single errors and double adjacent errors have been proposed. These codes, known as single error correction/double adjacent error correction (SEC-DAEC), require the same number of parity bits as traditional SEC-DED codes and a moderate increase in the decoder complexity. However, MCUs are not limited to double adjacent errors, because they affect more bits as technology scales. In this brief, new codes that can correct triple adjacent errors and 3-bit burst errors are presented. They have been implemented using a 45-nm library and compared with previous proposals, showing that our codes have better error protection with a moderate overhead and low redundancy.

Journal ArticleDOI
TL;DR: A novel 128/256/512/1024/1536/2048-point single-path delay feedback (SDF) pipeline FFT processor for long-term evolution and mobile worldwide interoperability for microwave access systems and formulated a hardware-sharing mechanism to reduce the memory space requirements of the proposed 1536-point FFT computation scheme.
Abstract: Fast Fourier transform (FFT) is widely used in digital signal processing and telecommunications, particularly in orthogonal frequency division multiplexing systems, to overcome the problems associated with orthogonal subcarriers. This paper presents a novel 128/256/512/1024/1536/2048-point single-path delay feedback (SDF) pipeline FFT processor for long-term evolution and mobile worldwide interoperability for microwave access systems. The proposed design employs a low-cost computation scheme to enable 1536-point FFT, which significantly reduces hardware costs as well as power consumption. In conjunction with the aforementioned 1536-point FFT computation scheme, the proposed design includes an efficient three-stage SDF pipeline architecture on which to implement a radix-3 FFT. The new radix-3 SDF pipeline FFT processor simplifies its data flow and is easy to control, and the complexity of the resulting hardware is lower than that of existing structures. This paper also formulates a hardware-sharing mechanism to reduce the memory space requirements of the proposed 1536-point FFT computation scheme. The proposed design was implemented using 90-nm CMOS technology. Postlayout simulation results revealed a die area of approximately $1.44 \times 1.44~\mathrm{mm}^{2}$ with power consumption of only 9.3 mW at 40 MHz.
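A 1536-point transform is not a power of two ($1536 = 3 \times 512$), which is why a radix-3 stage is needed alongside radix-2 stages. The sketch below shows the underlying mixed-radix decimation-in-time idea in plain Python. It is a textbook recursive formulation written for illustration, not the paper's SDF pipeline architecture or its low-cost computation scheme; the function names and the choice to peel off the factor 3 first are assumptions.

```python
import cmath


def naive_dft(x):
    """Direct O(n^2) DFT, used as a reference and as a fallback."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
        for k in range(n)
    ]


def fft_mixed(x):
    """Recursive mixed radix-3/radix-2 decimation-in-time FFT (illustrative)."""
    n = len(x)
    if n == 1:
        return list(x)
    if n % 3 == 0:
        r = 3
    elif n % 2 == 0:
        r = 2
    else:
        return naive_dft(x)  # other prime factors: fall back to the direct DFT
    m = n // r
    # Split into r interleaved sub-sequences and transform each one.
    subs = [fft_mixed(x[s::r]) for s in range(r)]
    out = []
    for k in range(n):
        # Radix-r butterfly: combine sub-transforms with twiddle factors.
        out.append(
            sum(
                cmath.exp(-2j * cmath.pi * s * k / n) * subs[s][k % m]
                for s in range(r)
            )
        )
    return out


if __name__ == "__main__":
    impulse = [1.0] + [0.0] * 1535  # 1536 = 3 * 2**9
    spectrum = fft_mixed(impulse)
    print(len(spectrum), max(abs(v - 1.0) for v in spectrum))
```

For a 1536-point input, this recursion performs one radix-3 split followed by nine radix-2 splits, mirroring the factorization of the transform size.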

Journal ArticleDOI
TL;DR: The architecture and the implementation of a high-performance scalable elliptic curve cryptography processor (ECP) that can support all five NIST recommended prime curves without the need to reconfigure the hardware is presented.
Abstract: The architecture and the implementation of a high-performance scalable elliptic curve cryptography processor (ECP) are presented. The proposed ECP is able to support all five prime field elliptic curves recommended by the National Institute of Standards and Technology (NIST). The design takes advantage of the high-performance capabilities of the DSP48E slices available in Xilinx field-programmable gate arrays (FPGAs) to achieve high speed and low hardware resource utilization. The proposed design parallelizes the underlying prime field operations to reduce the latency of the elliptic curve point multiplication (ECPM) operation. Prime field inversion is performed efficiently using the same arithmetic blocks as the ones used for prime field multiplication and addition/subtraction. To the best of the authors' knowledge, the proposed scalable ECP is the fastest and smallest ECP that can support all five NIST recommended prime curves without the need to reconfigure the hardware. It can compute the ECPM between 1.709 and 28.04 ms using a Xilinx Virtex-5 FPGA.
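Elliptic curve point multiplication (ECPM), the operation whose latency the abstract reports, computes $kP$ from repeated point additions and doublings over a prime field. The sketch below shows the basic double-and-add method on a deliberately tiny textbook curve, $y^2 = x^3 + 2x + 2 \pmod{17}$ with base point $(5, 1)$. The curve, the function names, and the affine-coordinate formulas are illustrative assumptions; the sketch does not use a NIST curve, is not constant-time, and says nothing about how the paper's hardware schedules its field operations.

```python
# Toy curve y^2 = x^3 + A*x + B over GF(P_MOD); far too small for real use.
P_MOD = 17
A = 2
B = 2
G = (5, 1)  # base point; None represents the point at infinity


def point_add(p, q):
    """Add two affine points on the toy curve."""
    if p is None:
        return q
    if q is None:
        return p
    x1, y1 = p
    x2, y2 = q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None  # p + (-p) is the point at infinity
    if p == q:
        # Tangent slope for doubling; pow(x, -1, m) needs Python 3.8+.
        lam = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD) % P_MOD
    else:
        # Chord slope for adding distinct points.
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)


def scalar_mult(k, point):
    """Compute k * point with right-to-left double-and-add."""
    result = None
    addend = point
    while k:
        if k & 1:
            result = point_add(result, addend)
        addend = point_add(addend, addend)
        k >>= 1
    return result


if __name__ == "__main__":
    for k in (1, 2, 3, 18, 19):
        print(k, scalar_mult(k, G))
```

Each loop iteration costs one doubling and at most one addition, and every addition needs a field inversion in affine coordinates, which is why inversion and multiplication dominate the cost of this operation.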

Journal ArticleDOI
TL;DR: A new temperature metric derived from frequency domain moment matching technique incorporates both initial temperature and other transient effects to make optimized task migration decisions, which leads to more effective reduction of hot spots in the experiments on a 100-core microprocessor than the existing distributed thermal management methods.
Abstract: In this brief, a new distributed thermal management scheme using task migrations based on a new temperature metric called effective initial temperature is proposed to reduce the on-chip temperature variance and the occurrence of hot spots for many-core microprocessors. The new temperature metric derived from frequency domain moment matching technique incorporates both initial temperature and other transient effects to make optimized task migration decisions, which leads to more effective reduction of hot spots in the experiments on a 100-core microprocessor than the existing distributed thermal management methods.

Journal ArticleDOI
TL;DR: The behavior of adiabatic logic circuits in the weak inversion or subthreshold regime is analyzed in depth for the first time in the literature, with the aim of greatly improving ultralow-power circuit design.
Abstract: The behavior of adiabatic logic circuits in the weak inversion or subthreshold regime is analyzed in depth for the first time in the literature, with the aim of greatly improving ultralow-power circuit design. This novel approach is efficacious in low-speed operations where power consumption and longevity are the pivotal concerns instead of performance. The schematic and layout of a 4-bit carry look ahead adder (CLA) have been implemented to show the workability of the proposed logic. The effect of temperature and process parameter variations on the subthreshold adiabatic logic-based 4-bit CLA has also been addressed separately. Postlayout simulations show that subthreshold adiabatic units can save significant energy compared with a logically equivalent static CMOS implementation. Results are validated through extensive simulations in 22-nm CMOS technology using CADENCE SPICE Spectra.

Journal ArticleDOI
TL;DR: In this article, the authors present a formal system-level analytical approach to analyze the worst-case crosstalk noise and SNR in arbitrary fat-tree-based ONoCs.
Abstract: Optical networks-on-chip (ONoCs) have shown the potential to be substituted for electronic networks-on-chip (NoCs) to bring substantially higher bandwidth and more efficient power consumption in both on- and off-chip communication. However, basic optical devices, which are the key components in constructing ONoCs, experience inevitable crosstalk noise and power loss; the crosstalk noise from the basic devices accumulates in large-scale ONoCs and considerably hurts the signal-to-noise ratio (SNR) as well as restricts the network scalability. For the first time, this paper presents a formal system-level analytical approach to analyze the worst-case crosstalk noise and SNR in arbitrary fat-tree-based ONoCs. The analyses are performed hierarchically at the basic optical device level, then at the optical router level, and finally at the network level. A general $4\times4$ optical router model is considered to enable the proposed method to be adaptable to fat-tree-based ONoCs using an arbitrary $4\times4$ optical router. Utilizing the proposed general router model, the worst-case SNR link candidates in the network are determined. Moreover, we apply the proposed analyses to a case study of fat-tree-based ONoCs using an optical turnaround router (OTAR). Quantitative simulation results indicate low values of SNR and scalability constraints in large-scale fat-tree-based ONoCs, which is due to the high power of crosstalk noise and power loss. For instance, in fat-tree-based ONoCs using the OTAR, when the injection laser power equals 0 dBm, the crosstalk noise power is higher than the signal power when the number of processor cores exceeds 128; when it is equal to 256, the signal power, crosstalk noise power, and SNR are $-17.3$, $-11.9$, and $-5.5$ dB, respectively.

Journal ArticleDOI
TL;DR: This work presents an approach that enables efficient representations based on sparsity to be utilized throughout a signal processing system, with the aim of reducing the energy and/or resources required for computation, communication, and storage.
Abstract: Sparsity is a characteristic of a signal that potentially allows us to represent information efficiently. We present an approach that enables efficient representations based on sparsity to be utilized throughout a signal processing system, with the aim of reducing the energy and/or resources required for computation, communication, and storage. The representation we focus on is compressive sensing. Its benefit is that compression is achieved with minimal computational cost through the use of random projections; however, a key drawback is that reconstruction is expensive. We focus on inference frameworks for signal analysis. We show that reconstruction can be avoided entirely by transforming signal processing operations (e.g., wavelet transforms, finite impulse response filters, etc.) such that they can be applied directly to the compressed representations. We present a methodology and a mathematical framework that achieve this goal and also enable significant computational-energy savings through operations over fewer input samples. This enables explicit energy-versus-accuracy tradeoffs that are under the control of the designer. We demonstrate the approach through two case studies. First, we consider a system for neural prosthesis that extracts wavelet features directly from compressively sensed spikes. Through simulations, we show that spike sorting can be achieved with $54\times$ fewer samples, providing an accuracy of 98.63% in spike count, 98.56% in firing-rate estimation, and 96.51% in determining the coefficient of variation; this compares with a baseline Nyquist-domain detector with corresponding performance of 98.97%, 99.69%, and 97.09%, respectively. Second, we consider a system for detecting epileptic seizures by extracting spectral-energy features directly from compressively sensed electroencephalogram. Through simulations of the end-to-end algorithm, we show that detection can be achieved with $21\times$ fewer samples, providing a sensitivity of 94.43%, false alarm rate of 0.1543/h, and latency of 4.70 s; this compares with a baseline Nyquist-domain detector with corresponding performance of 96.03%, 0.1471/h, and 4.59 s, respectively.