
Showing papers in "IEEE Transactions on Very Large Scale Integration Systems in 2016"


Journal ArticleDOI
TL;DR: A Schmitt-trigger-based single-ended 11T SRAM cell is presented that significantly improves read and write static noise margin (SNM), consumes low power, and achieves the lowest leakage power dissipation among the cells considered for comparison.
Abstract: This paper presents a Schmitt-trigger-based single-ended 11T SRAM cell, which significantly improves read and write static noise margin (SNM) and consumes low power. Simulation results show that the cell also achieves the lowest leakage power dissipation among the cells considered for comparison. We also investigate the impact of process, voltage, and temperature variations on various performance parameters, such as hold SNM, read SNM, write margin, immunity to the half-select issue, $I_{\text{ON}}/I_{\text{OFF}}$ ratio of the read path, and leakage power of the cell; Monte Carlo simulation results confirm the robustness of the proposed cell against these variations. A layout drawn with 45-nm technology rules shows that the proposed cell occupies $2.02\times$ greater area than the 6T SRAM cell. However, the $6.9\times$ higher $I_{\text{ON}}/I_{\text{OFF}}$ ratio of the read path of the proposed cell, as compared with the 6T cell, has the potential to significantly offset the area overhead. A new figure of merit that comprehensively captures stability, delay, power dissipation, and area of an SRAM cell is also proposed. Based on the proposed metric, we observe that the proposed cell outperforms all but one of the SRAM cells considered in this paper.

134 citations


Journal ArticleDOI
TL;DR: This paper presents the design of a fully integrated electrocardiogram (ECG) signal processor (ESP) for the prediction of ventricular arrhythmia using a unique set of ECG features and a naive Bayes classifier.
Abstract: This paper presents the design of a fully integrated electrocardiogram (ECG) signal processor (ESP) for the prediction of ventricular arrhythmia using a unique set of ECG features and a naive Bayes classifier. Real-time and adaptive techniques for the detection and the delineation of the P-QRS-T waves were investigated to extract the fiducial points. Those techniques are robust to any variations in the ECG signal with high sensitivity and precision. Two databases of the heart signal recordings from the MIT PhysioNet and the American Heart Association were used as a validation set to evaluate the performance of the processor. Based on application-specific integrated circuit (ASIC) simulation results, the overall classification accuracy was found to be 86% on the out-of-sample validation data with a 3-s window size. The architecture of the proposed ESP was implemented using a 65-nm CMOS process. It occupied an area of 0.112 $\text{mm}^{2}$ and consumed 2.78 $\mu\text{W}$ of power at an operating frequency of 10 kHz and an operating voltage of 1 V. It is worth mentioning that the proposed ESP is the first ASIC implementation of an ECG-based processor that is used for the prediction of ventricular arrhythmia up to 3 h before the onset.

128 citations


Journal ArticleDOI
TL;DR: This brief elaborately analyzes these hardware security techniques and proposes a practical logic obfuscation method with low overheads to prevent an adversary from RE both the gate-level netlist and the layout-level geometry of IP/IC and protect IP/IC from piracy and overbuilding.
Abstract: A number of studies of hardware security aim to thwart piracy, overbuilding, and reverse engineering (RE) by obfuscating and/or camouflaging. However, these techniques incur high overheads, and integrated circuit (IC) camouflaging cannot provide any protection for the gate-level netlist of the third party intellectual property (IP) core or the single large monolithic IC. In order to circumvent these weaknesses, this brief elaborately analyzes these hardware security techniques and proposes a practical logic obfuscation method with low overheads to prevent an adversary from RE both the gate-level netlist and the layout-level geometry of IP/IC and protect IP/IC from piracy and overbuilding. Experimental evaluations demonstrate the low area, power, and zero performance overhead of the proposed obfuscation technique.

125 citations


Journal ArticleDOI
TL;DR: This paper jointly optimizes high parameter density (number of programmable elements/area/process normalized), as well as high accessibility of the computations due to its data flow handling; the SoC FPAA is $600\,000\times$ higher density than other non-FG approaches.
Abstract: This paper presents a floating-gate (FG)-based, field-programmable analog array (FPAA) system-on-chip (SoC) that integrates analog and digital programmable and configurable blocks with a 16-bit open-source MSP430 microprocessor ($\mu\text{P}$) and resulting interface circuitry. We show the FPAA SoC architecture, experimental results from a range of circuits compiled into this architecture, and system measurements. A compiled analog acoustic command-word classifier on the FPAA SoC requires 23 $\mu\text{W}$ to experimentally recognize the word "dark" in a TIMIT database phrase. This paper jointly optimizes high parameter density (number of programmable elements/area/process normalized), as well as high accessibility of the computations due to its data flow handling; the SoC FPAA is $600\,000\times$ higher density than other non-FG approaches.

119 citations


Journal ArticleDOI
TL;DR: A cobweb-based redundant through-silicon-via (TSV) design is proposed with efficient hardware as well as high repair rate to repair clustered faulty TSVs (FTSVs).
Abstract: Three-dimensional integrated circuits (3-D-ICs) that employ the through-silicon vias (TSVs) vertically stacking multiple dies provide many benefits, such as high density, high bandwidth, and low power. However, the fabrication and bonding of TSVs may fail because of many factors, such as the winding level of the thinned wafers, the surface roughness and cleanness of silicon dies, and bonding technology. To improve the yield of 3-D-ICs, many redundant TSV (RTSV) architectures were proposed to repair 3-D-ICs with faulty TSVs. These methods reroute the signals of faulty TSVs to other regular TSV or RTSV. In practice, the faulty TSVs may cluster because of imperfect bonding technology. To resolve the problem of clustered TSV faults, router-based RTSV architecture was the first proposed to pay attention to it. Their method enables faulty TSVs to be repaired by RTSVs that are farther apart. However, to repair some rarely occurring defective patterns, their method requires too much area. In this paper, we propose a ring-based RTSV architecture to utilize the area more efficiently as well as to maintain high yield. Assume that the size of the TSVs is 1 $\mu \text{m}$ . Simulation results show that for a given number of TSVs ( $8 \times 8$ ) and TSV failure rate (1%), our design achieves 58.9% area reduction of MUXes per signal, 54.6% total area reduction per signal, and 50.54% total wire length reduction while the yield of our ring-based RTSV architectures can still maintain 98.47%–99% as compared with the router-based design. Furthermore, the shifting length of our ring-based RTSV architecture is at most 1, which guarantees at most one MUX-delay timing overhead of each signal.

113 citations


Journal ArticleDOI
TL;DR: This paper proves in a mathematically rigorous manner that in partial product perforation, the imposed errors are bounded and predictable, depending only on the input distribution, and shows that the technique outperforms state-of-the-art approximation techniques in terms of power dissipation and error.
Abstract: Approximate computing has received significant attention as a promising strategy to decrease power consumption of inherently error tolerant applications. In this paper, we focus on hardware-level approximation by introducing the partial product perforation technique for designing approximate multiplication circuits. We prove in a mathematically rigorous manner that in partial product perforation, the imposed errors are bounded and predictable, depending only on the input distribution. Through extensive experimental evaluation, we apply the partial product perforation method on different multiplier architectures and expose the optimal architecture–perforation configuration pairs for different error constraints. We show that, compared with the respective exact design, the partial product perforation delivers reductions of up to 50% in power consumption, 45% in area, and 35% in critical delay. In addition, the product perforation method is compared with the state-of-the-art approximation techniques, i.e., truncation, voltage overscaling, and logic approximation, showing that it outperforms them in terms of power dissipation and error.

111 citations
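
The partial product perforation idea in the abstract above can be illustrated with a small behavioral sketch in Python. It assumes unsigned operands and assumes that perforation means skipping `k` consecutive partial products of the form `a * b_i * 2^i`, starting at index `j`; the function names and this operand convention are hypothetical and are not taken from the paper. Under those assumptions the error equals `a` times the masked bits of `b`, which is why it is bounded and depends only on the inputs.

```python
def perforated_multiply(a: int, b: int, n_bits: int, j: int, k: int) -> int:
    """Approximate a*b by omitting k partial products starting at index j.

    Hypothetical behavioral model: partial product i is a * b_i * 2**i.
    """
    product = 0
    for i in range(n_bits):
        if j <= i < j + k:
            continue  # perforated: this partial product is never accumulated
        if (b >> i) & 1:
            product += a << i
    return product


def perforation_error(a: int, b: int, j: int, k: int) -> int:
    """Exact error of the model above: a times the skipped bits of b."""
    skipped_bits = b & (((1 << k) - 1) << j)
    return a * skipped_bits


if __name__ == "__main__":
    a, b = 13, 11
    approx = perforated_multiply(a, b, n_bits=8, j=0, k=2)
    print(a * b, approx, perforation_error(a, b, j=0, k=2))
```

In this model the worst-case error for given `j` and `k` is `a * (2**k - 1) * 2**j`, reached when all skipped bits of `b` are 1.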


Journal ArticleDOI
TL;DR: An area-efficient, globally asynchronous, locally synchronous network-on-chip (NoC) architecture for a hard real-time multiprocessor platform that uses statically scheduled time-division multiplexing (TDM) to control the communication over a structure of routers, links, and network interfaces (NIs).
Abstract: In this paper, we present an area-efficient, globally asynchronous, locally synchronous network-on-chip (NoC) architecture for a hard real-time multiprocessor platform. The NoC implements message-passing communication between processor cores. It uses statically scheduled time-division multiplexing (TDM) to control the communication over a structure of routers, links, and network interfaces (NIs) to offer real-time guarantees. The area-efficient design is a result of two contributions: 1) asynchronous routers combined with TDM scheduling and 2) a novel NI microarchitecture. Together they result in a design in which data are transferred in a pipelined fashion, from the local memory of the sending core to the local memory of the receiving core, without any dynamic arbitration, buffering, and clock synchronization. The routers use two-phase bundled-data handshake latches based on the Mousetrap latch controller and are extended with a clock gating mechanism to reduce the energy consumption. The NIs integrate the direct memory access functionality and the TDM schedule, and use dual-ported local memories to avoid buffering, flow-control, and synchronization. To verify the design, we have implemented a 4 $\times $ 4 bitorus NoC in 65-nm CMOS technology and we present results on area, speed, and energy consumption for the router, NI, NoC, and postlayout.

89 citations


Journal ArticleDOI
TL;DR: A general multiplier-based architecture for the proposed transpose form block filter for reconfigurable applications is derived and a low-complexity design using the MCM scheme is also presented for the block implementation of fixed FIR filters.
Abstract: Transpose form finite-impulse response (FIR) filters are inherently pipelined and support multiple constant multiplications (MCM) technique that results in significant saving of computation. However, transpose form configuration does not directly support the block processing unlike direct-form configuration. In this paper, we explore the possibility of realization of block FIR filter in transpose form configuration for area-delay efficient realization of large order FIR filters for both fixed and reconfigurable applications. Based on a detailed computational analysis of transpose form configuration of FIR filter, we have derived a flow graph for transpose form block FIR filter with optimized register complexity. A generalized block formulation is presented for transpose form FIR filter. We have derived a general multiplier-based architecture for the proposed transpose form block filter for reconfigurable applications. A low-complexity design using the MCM scheme is also presented for the block implementation of fixed FIR filters. The proposed structure involves significantly less area-delay product (ADP) and less energy per sample (EPS) than the existing block implementation of direct-form structure for medium or large filter lengths, while for the short-length filters, the block implementation of direct-form FIR structure has less ADP and less EPS than the proposed structure. Application-specific integrated circuit synthesis result shows that the proposed structure for block size 4 and filter length 64 involves 42% less ADP and 40% less EPS than the best available FIR filter structure proposed for reconfigurable applications. For the same filter length and the same block size, the proposed structure involves 13% less ADP and 12.8% less EPS than that of the existing direct-form block FIR structure.

78 citations
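
As background for the abstract above, the following Python sketch shows a plain (non-block) transpose form FIR filter. It is only an illustration of why the structure suits multiple constant multiplication: each input sample is multiplied by every coefficient at once, and the products are accumulated through a chain of registers. The function name is hypothetical, and the block formulation and register optimizations proposed in the paper are not modeled.

```python
def fir_transposed(coeffs, samples):
    """Transpose form FIR: y[n] = sum_k coeffs[k] * samples[n - k]."""
    n_taps = len(coeffs)
    state = [0] * (n_taps - 1)  # registers between the adders
    output = []
    for x in samples:
        # One input sample times all coefficients (the MCM block in hardware).
        products = [c * x for c in coeffs]
        y = products[0] + (state[0] if state else 0)
        # Shift partial sums one register toward the output.
        for k in range(len(state) - 1):
            state[k] = products[k + 1] + state[k + 1]
        if state:
            state[-1] = products[-1]
        output.append(y)
    return output


def fir_direct(coeffs, samples):
    """Direct form reference used only to check the sketch above."""
    return [
        sum(c * samples[n - k] for k, c in enumerate(coeffs) if n - k >= 0)
        for n in range(len(samples))
    ]


if __name__ == "__main__":
    print(fir_transposed([1, 2, 3], [1, 0, 0, 0]))
```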


Journal ArticleDOI
TL;DR: A novel 8-transistor (8T) static random access memory cell with improved data stability in subthreshold operation is designed and enhances the static noise margin (SNM) for ultralow power supply.
Abstract: A novel 8-transistor (8T) static random access memory cell with improved data stability in subthreshold operation is designed. The proposed single-ended with dynamic feedback control 8T static RAM (SRAM) cell enhances the static noise margin (SNM) for ultralow power supply. It achieves write SNM of $1.4\times $ and $1.28\times $ as that of isoarea 6T and read-decoupled 8T (RD-8T), respectively, at 300 mV. The standard deviation of write SNM for 8T cell is reduced to $0.4\times $ and $0.56\times $ as that for 6T and RD-8T, respectively. It also possesses another striking feature of high read SNM $\sim 2.33\times $ , $1.23\times $ , and $0.89\times $ as that of 5T, 6T, and RD-8T, respectively. The cell has hold SNM of $1.43\times $ , $1.23\times $ , and $1.05\times $ as that of 5T, 6T, and RD-8T, respectively. The write time is 71% lesser than that of single-ended asymmetrical 8T cell. The proposed 8T consumes less write power $0.72\times $ , $0.6\times $ , and $0.85\times $ as that of 5T, 6T, and isoarea RD-8T, respectively. The read power is $0.49\times $ of 5T, $0.48\times $ of 6T, and $0.64\times $ of RD-8T. The power/energy consumption of 1-kb 8T SRAM array during read and write operations is $0.43\times $ and $0.34\times $ , respectively, of 1-kb 6T array. These features enable ultralow power applications of 8T.

75 citations


Journal ArticleDOI
TL;DR: A complete thermal energy harvesting power supply for implantable pacemakers is presented in this paper and has been designed using 180-nm CMOS technology.
Abstract: A complete thermal energy harvesting power supply for implantable pacemakers is presented in this paper. The designed power supply includes an internal startup and does not need any external reference voltage. The startup circuit includes a prestartup charge pump (CP) and a startup boost converter. The prestartup CP consists of an ultralow-voltage oscillator followed by a high-efficiency modified Dickson. Forward body biasing is used to effectively reduce the MOS threshold voltages as well as the supply voltage in the oscillator and CP. The steady-state circuit includes a high-efficiency boost converter that utilizes a modified maximum power point tracking scheme. The system is designed so that no failure occurs under overload conditions. Using this approach, a thermal energy harvesting power supply has been designed using 180-nm CMOS technology. According to HSPICE simulation results, the circuit operates from input voltages as low as 40 mV provided from a thermoelectric generator and generates output voltages up to 3 V. A maximum power of 130 $\mu\text{W}$ can be obtained from the output of the boost converter, which means that its efficiency is 60%. A minimum voltage of 60 mV and a maximum time of 400 ms are needed for the circuit to start up.

72 citations


Journal ArticleDOI
TL;DR: Experimental results show that the approach of dynamically adjusting the degree of hardware approximation based on the input video respects the given quality bound across different videos while achieving a power saving up to 38% over a conventional nonapproximated MPEG encoder architecture.
Abstract: The field of approximate computing has received significant attention from the research community in the past few years, especially in the context of various signal processing applications. Image and video compression algorithms, such as JPEG, MPEG, and so on, are particularly attractive candidates for approximate computing, since they are tolerant of computing imprecision due to human imperceptibility, which can be exploited to realize highly power-efficient implementations of these algorithms. However, existing approximate architectures typically fix the level of hardware approximation statically and are not adaptive to input data. For example, if a fixed approximate hardware configuration is used for an MPEG encoder (i.e., a fixed level of approximation), the output quality varies greatly for different input videos. This paper addresses this issue by proposing a reconfigurable approximate architecture for MPEG encoders that optimizes power consumption with the goal of maintaining a particular Peak Signal-to-Noise Ratio (PSNR) threshold for any video. Toward this end, we design reconfigurable adder/subtractor blocks (RABs), which have the ability to modulate their degree of approximation, and subsequently integrate these blocks in the motion estimation and discrete cosine transform modules of the MPEG encoder. We propose two heuristics for automatically tuning the approximation degree of the RABs in these two modules during runtime based on the characteristics of each individual video. Experimental results show that our approach of dynamically adjusting the degree of hardware approximation based on the input video respects the given quality bound (PSNR degradation of 1%–10%) across different videos while achieving a power saving up to 38% over a conventional nonapproximated MPEG encoder architecture. Note that although the proposed reconfigurable approximate architecture is presented for the specific case of an MPEG encoder, it can be easily extended to other DSP applications.
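
The quality metric named in the abstract above, PSNR, follows the standard definition $\text{PSNR} = 10 \log_{10}(\text{MAX}^2 / \text{MSE})$. The Python sketch below computes it for two equally sized lists of pixel values, assuming 8-bit samples by default; the function name and the flat-list input format are assumptions made for illustration, and the tuning heuristics of the paper are not modeled.

```python
import math


def psnr(reference, distorted, max_value=255):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(distorted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals: no distortion
    return 10 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    print(psnr([10, 20, 30, 40], [11, 19, 30, 42]))
```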

Journal ArticleDOI
TL;DR: Simulations with regard to supply power scaling and different load conditions confirm the superiority of the proposed cells compared with the previously reported ones in terms of power, delay, power-delay product (PDP), and energy-delay product (EDP).
Abstract: In this paper, a number of novel 1-bit full adder cells using carbon nanotube field-effect transistor devices are presented. First, some two-input XOR/XNOR circuits are proposed, and then they are employed to form 1-bit full adders. In total, five full adders with driving power and one without driving power are proposed in this paper, each of which has its own merits. Simulations with regard to supply power scaling and different load conditions confirm the superiority of the proposed cells compared with the previously reported ones in terms of power, delay, power-delay product (PDP), and energy-delay product (EDP). Embedding the proposed full adders in large circuits, such as a ripple carry adder (RCA) with a wide word length, also shows that they have better power, speed, and PDP than their counterparts. Furthermore, the susceptibility of the full adders to both input noise and process variations (diameter deviations of carbon nanotubes) is studied. In terms of noise, the proposed cells are closely competitive with their counterparts, and they are robust against high-amplitude noise. In terms of process variation, the proposed cells with driving power are the most robust compared with their counterparts.
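
At the logic level, the construction described above (XOR circuits combined into a 1-bit full adder, then chained into a ripple carry adder) can be sketched in Python as follows. This is a purely functional model under the textbook equations $S = A \oplus B \oplus C_{in}$ and $C_{out} = AB + C_{in}(A \oplus B)$; the function names are hypothetical, and nothing about the transistor-level CNFET cells, their power, or their delay is represented.

```python
def full_adder(a: int, b: int, carry_in: int):
    """1-bit full adder built from two XOR stages plus carry logic."""
    propagate = a ^ b                      # first XOR stage
    total = propagate ^ carry_in           # second XOR stage gives the sum bit
    carry_out = (a & b) | (propagate & carry_in)
    return total, carry_out


def ripple_carry_add(x: int, y: int, width: int):
    """Chain `width` full adders; returns (sum modulo 2**width, carry out)."""
    carry = 0
    result = 0
    for i in range(width):
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= bit << i
    return result, carry


if __name__ == "__main__":
    print(ripple_carry_add(9, 7, width=4))
```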

Journal ArticleDOI
TL;DR: This paper identifies design constraints for achieving Trojan detection, collusion prevention, and isolation of the Trojan-infected 3PIP, and incorporates them during high-level synthesis.
Abstract: Trustworthiness of system-on-chip designs is undermined by malicious logic (Trojans) in third-party intellectual properties (3PIPs). In this paper, duplication, diversity, and isolation principles have been extended to build trustworthy systems using untrusted, potentially Trojan-infected 3PIPs. We use a diverse set of vendors to prevent collusion between 3PIPs from the same vendor. We identify design constraints for achieving Trojan detection, collusion prevention, and isolation of the Trojan-infected 3PIP, and incorporate them during high-level synthesis. In addition, we develop techniques to reduce the number of vendors. The effectiveness of the proposed techniques is validated using high-level synthesis benchmarks.

Journal ArticleDOI
TL;DR: A suite of solutions, based on lightweight negative bias temperature instability (NBTI)-aware ring oscillators (ROs), is proposed for combating die and IC recycling (CDIR) when ICs are used for a very short duration.
Abstract: The recycling of electronic components has become a major industrial and governmental concern, as it could potentially impact the security and reliability of a wide variety of electronic systems. It is extremely challenging to detect a recycled integrated circuit (IC) that is already used for a very short period of time because the process variations outpace the degradation caused by aging, especially in lower technology nodes. In this paper, we propose a suite of solutions, based on lightweight negative bias temperature instability (NBTI)-aware ring oscillators (ROs), for combating die and IC recycling (CDIR) when ICs are used for a very short duration. The proposed solutions are implemented in the 90-nm technology node. The simulation results demonstrate that our newly proposed NBTI-aware multiple pair RO-based CDIRs can detect ICs used only for a few hours.

Journal ArticleDOI
TL;DR: A mechanism that can detect and skip the unnecessary carry-save addition operations in the one-level CCSA architecture while maintaining the short critical path delay is developed and the extra clock cycles for operand precomputation and format conversion can be hidden and high throughput can be obtained.
Abstract: This paper proposes a simple and efficient Montgomery multiplication algorithm such that the low-cost and high-performance Montgomery modular multiplier can be implemented accordingly. The proposed multiplier receives and outputs the data with binary representation and uses only one-level carry-save adder (CSA) to avoid the carry propagation at each addition operation. This CSA is also used to perform operand precomputation and format conversion from the carry-save format to the binary representation, leading to a low hardware cost and short critical path delay at the expense of extra clock cycles for completing one modular multiplication. To overcome the weakness, a configurable CSA (CCSA), which could be one full-adder or two serial half-adders, is proposed to reduce the extra clock cycles for operand precomputation and format conversion by half. In addition, a mechanism that can detect and skip the unnecessary carry-save addition operations in the one-level CCSA architecture while maintaining the short critical path delay is developed. As a result, the extra clock cycles for operand precomputation and format conversion can be hidden and high throughput can be obtained. Experimental results show that the proposed Montgomery modular multiplier can achieve higher performance and significant area–time product improvement when compared with previous designs.
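
For readers unfamiliar with the underlying operation, a minimal Python sketch of textbook radix-2 Montgomery multiplication is given here. It uses ordinary integers in place of the carry-save representation, so it shows only the arithmetic that such a multiplier computes, namely $A \cdot B \cdot 2^{-n} \bmod N$ for an odd modulus $N$; the function name is hypothetical, and the CCSA datapath and operation-skipping mechanism of the paper are not modeled.

```python
def montgomery_multiply(a: int, b: int, modulus: int, n_bits: int) -> int:
    """Return a * b * 2**(-n_bits) mod modulus (modulus odd, a, b < modulus)."""
    if modulus % 2 == 0:
        raise ValueError("modulus must be odd")
    if modulus.bit_length() > n_bits:
        raise ValueError("modulus must fit in n_bits")
    if not (0 <= a < modulus and 0 <= b < modulus):
        raise ValueError("operands must be reduced modulo the modulus")
    s = 0
    for i in range(n_bits):
        if (a >> i) & 1:
            s += b            # add the multiplicand when bit i of a is set
        if s & 1:
            s += modulus      # make s even without changing it mod modulus
        s >>= 1               # exact division by 2
    if s >= modulus:
        s -= modulus          # single final subtraction, since s < 2 * modulus
    return s


if __name__ == "__main__":
    print(montgomery_multiply(45, 76, 97, 7))
```

In hardware, each `s += ...` step is where carry propagation would occur, which is the cost that a carry-save adder avoids.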

Journal ArticleDOI
TL;DR: A reduced latency list decoding (RLLD) algorithm for polar codes is proposed, which significantly reduces the decoding latency and, hence, improves throughput, while introducing little performance degradation.
Abstract: While long polar codes can achieve the capacity of arbitrary binary-input discrete memoryless channels when decoded by a low complexity successive-cancellation (SC) algorithm, the error performance of the SC algorithm is inferior for polar codes with finite block lengths. The cyclic redundancy check (CRC)-aided SC list (SCL) decoding algorithm has better error performance than the SC algorithm. However, current CRC-aided SCL decoders still suffer from long decoding latency and limited throughput. In this paper, a reduced latency list decoding (RLLD) algorithm for polar codes is proposed. Our RLLD algorithm performs the list decoding on a binary tree, whose leaves correspond to the bits of a polar code. In existing SCL decoding algorithms, all the nodes in the tree are traversed, and all possibilities of the information bits are considered. Instead, our RLLD algorithm visits much fewer nodes in the tree and considers fewer possibilities of the information bits. When configured properly, our RLLD algorithm significantly reduces the decoding latency and, hence, improves throughput, while introducing little performance degradation. Based on our RLLD algorithm, we also propose a high throughput list decoder architecture, which is suitable for larger block lengths due to its scalable partial sum computation unit. Our decoder architecture has been implemented for different block lengths and list sizes using the TSMC 90-nm CMOS technology. The implementation results demonstrate that our decoders achieve significant latency reduction and area efficiency improvement compared with the other list polar decoders in the literature.

Journal ArticleDOI
TL;DR: The proposed processor employs extensive pipelining techniques for Karatsuba-Ofman method to achieve high throughput multiplication and supports the recommended NIST curve P256 and is based on an extended NIST reduction scheme.
Abstract: In this paper, an exportable application-specific instruction-set elliptic curve cryptography processor based on redundant signed digit representation is proposed. The processor employs extensive pipelining techniques for Karatsuba–Ofman method to achieve high throughput multiplication. Furthermore, an efficient modular adder without comparison and a high-throughput modular divider, which results in a short datapath for maximized frequency, are implemented. The processor supports the recommended NIST curve P256 and is based on an extended NIST reduction scheme. The proposed processor performs single-point multiplication employing points in affine coordinates in 2.26 ms and runs at a maximum frequency of 160 MHz in Xilinx Virtex 5 (XC5VLX110T) field-programmable gate array.
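
The Karatsuba–Ofman method mentioned above replaces one full-width multiplication with three half-width multiplications. A minimal recursive Python sketch for non-negative integers follows; the function name and the `threshold_bits` cutoff are assumptions chosen for illustration, and the pipelining and redundant signed digit arithmetic of the processor are not represented.

```python
def karatsuba(x: int, y: int, threshold_bits: int = 32) -> int:
    """Multiply non-negative integers with the Karatsuba-Ofman recursion."""
    if x.bit_length() <= threshold_bits or y.bit_length() <= threshold_bits:
        return x * y  # small operands: fall back to direct multiplication
    half = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << half) - 1
    x_hi, x_lo = x >> half, x & mask
    y_hi, y_lo = y >> half, y & mask
    low = karatsuba(x_lo, y_lo, threshold_bits)
    high = karatsuba(x_hi, y_hi, threshold_bits)
    # (x_lo + x_hi)(y_lo + y_hi) - low - high = x_lo*y_hi + x_hi*y_lo
    middle = karatsuba(x_lo + x_hi, y_lo + y_hi, threshold_bits) - low - high
    return (high << (2 * half)) + (middle << half) + low


if __name__ == "__main__":
    p256 = 2**256 - 2**224 + 2**192 + 2**96 - 1  # the NIST P-256 field prime
    print(karatsuba(p256 - 1, p256 - 2) == (p256 - 1) * (p256 - 2))
```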

Journal ArticleDOI
TL;DR: The experimental results show that the stochastic implementation of Sauvola needs much less time and area and can tolerate more faults, while consuming less power in comparison with its conventional implementation.
Abstract: Binarization plays an important role in document image processing, particularly in degraded document images. Among all local image thresholding algorithms, Sauvola has excellent binarization performance for degraded document images. However, this algorithm is computationally intensive and sensitive to the noises from the internal computational circuits. In this paper, we present a stochastic implementation of Sauvola algorithm. Our experimental results show that the stochastic implementation of Sauvola needs much less time and area and can tolerate more faults, while consuming less power in comparison with its conventional implementation.
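
For context, the conventional Sauvola method computes a local threshold $T = m\,[1 + k\,(s/R - 1)]$ from the mean $m$ and standard deviation $s$ of the pixels in a window. The Python sketch below evaluates this formula for one window given as a flat list; the defaults `k=0.5` and `r=128` are commonly used values for 8-bit images and are assumptions here, as are the function names. It models the deterministic computation only, not the stochastic implementation the paper proposes.

```python
import math


def sauvola_threshold(window, k=0.5, r=128.0):
    """Sauvola local threshold for one window of grayscale pixel values."""
    n = len(window)
    mean = sum(window) / n
    variance = sum((p - mean) ** 2 for p in window) / n
    std_dev = math.sqrt(variance)
    return mean * (1 + k * (std_dev / r - 1))


def binarize_pixel(pixel, window, k=0.5, r=128.0):
    """Return 1 (background) if the pixel exceeds its local threshold, else 0."""
    return 1 if pixel > sauvola_threshold(window, k, r) else 0


if __name__ == "__main__":
    print(sauvola_threshold([100] * 9))
```

The square root and division in this formula are among the operations that make a conventional implementation costly.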

Journal ArticleDOI
TL;DR: This paper presents an on-chip, low drop-out (LDO) voltage regulator with improved power-supply rejection (PSR) able to drive large capacitive loads without stability concerns and a custom, wide bandwidth capacitance multiplier that emulates a nanofarad-range capacitance at the LDO output node.
Abstract: This paper presents an on-chip, low drop-out (LDO) voltage regulator with improved power-supply rejection (PSR) able to drive large capacitive loads. The LDO compensation is achieved via a custom, wide bandwidth capacitance multiplier (c-multiplier) that emulates a nanofarad-range capacitance at the LDO output node. The LDO frequency response resembles that of externally compensated LDOs, leading to a wide PSR frequency range without using an off-chip capacitor. To drive large capacitive loads without stability concerns, the supply-line capacitance of the load circuit is incorporated into the design of the LDO compensation scheme. The power-stability-performance tradeoffs involved in the design are discussed in detail. The LDO and the c-multiplier are implemented in 0.18-$\mu\text{m}$ CMOS technology and target applications with load currents in the 10-mA range. Experimental results show that the LDO achieves a PSR better than −39 dB up to 20 MHz at 1.2 V output voltage, while maintaining a 97.4% current efficiency.

Journal ArticleDOI
TL;DR: A carry skip adder structure that has a higher speed yet lower energy consumption compared with the conventional one, and a hybrid variable latency extension of the proposed structure, which lowers the power consumption without considerably impacting the speed, is presented.
Abstract: In this paper, we present a carry skip adder (CSKA) structure that has a higher speed yet lower energy consumption compared with the conventional one. The speed enhancement is achieved by applying concatenation and incrementation schemes to improve the efficiency of the conventional CSKA (Conv-CSKA) structure. In addition, instead of utilizing multiplexer logic, the proposed structure makes use of AND-OR-Invert (AOI) and OR-AND-Invert (OAI) compound gates for the skip logic. The structure may be realized with both fixed stage size and variable stage size styles, wherein the latter further improves the speed and energy parameters of the adder. Finally, a hybrid variable latency extension of the proposed structure, which lowers the power consumption without considerably impacting the speed, is presented. This extension utilizes a modified parallel structure for increasing the slack time, and hence, enabling further voltage reduction. The proposed structures are assessed by comparing their speed, power, and energy parameters with those of other adders using a 45-nm static CMOS technology for a wide range of supply voltages. The results that are obtained using HSPICE simulations reveal, on average, 44% and 38% improvements in the delay and energy, respectively, compared with those of the Conv-CSKA. In addition, the power–delay product was the lowest among the structures considered in this paper, while its energy–delay product was almost the same as that of the Kogge–Stone parallel prefix adder with considerably smaller area and power consumption. Simulations on the proposed hybrid variable latency CSKA reveal reduction in the power consumption compared with the latest works in this field while having a reasonably high speed.
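
The conventional carry skip adder that this work improves on can be modeled behaviorally in Python as below. The sketch assumes fixed-size stages of ripple adders, where a stage forwards its incoming carry directly whenever every bit position in the stage propagates; the function name is hypothetical, and the concatenation and incrementation schemes, AOI/OAI skip logic, and variable latency extension of the paper are not modeled.

```python
def carry_skip_add(x: int, y: int, width: int, stage_size: int):
    """Conventional carry skip adder model; returns (sum bits, carry out)."""
    carry = 0
    result = 0
    for start in range(0, width, stage_size):
        stage_carry_in = carry
        ripple_carry = stage_carry_in
        all_propagate = 1
        for i in range(start, min(start + stage_size, width)):
            a = (x >> i) & 1
            b = (y >> i) & 1
            propagate = a ^ b
            all_propagate &= propagate
            result |= (propagate ^ ripple_carry) << i
            ripple_carry = (a & b) | (propagate & ripple_carry)
        # Skip logic: bypass the ripple chain when the whole stage propagates.
        carry = stage_carry_in if all_propagate else ripple_carry
    return result, carry


if __name__ == "__main__":
    print(carry_skip_add(0b101101, 0b010011, width=6, stage_size=2))
```

The result is identical to ordinary addition; the skip path matters only for delay, because in hardware it shortens the worst-case carry chain.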

Journal ArticleDOI
TL;DR: This paper presents the design of a compact 60-GHz phase shifter that provides 5-bit digital phase control and a 360° phase range for beam-forming systems; to the best of the authors' knowledge, the designed 360° phase shifter, with a size of 0.094 mm$^2$, is the smallest 5-bit passive phase shifter at frequencies around 60 GHz.
Abstract: This paper presents the design of a compact 60-GHz phase shifter that provides a 5-bit digital phase control and 360° phase range for beam-forming systems. The phase shifter is designed using the proposed cross-coupled bridged T-type topology and switched-varactor reflective-type topology. The topologies are analyzed using a small-signal equivalent circuit model. Furthermore, the design equations are derived and investigated. To validate the theoretical analysis, 60-GHz 5-bit 360° phase shifters are designed in a commercial 65-nm CMOS technology. The fabricated 360° phase shifter features good performance of 32 phase states from 57 to 64 GHz with an rms phase error of 4.4°, a total insertion loss of 14.3 ± 2 dB, an rms gain error of 0.5 dB, $P_{1\,\text{dB}}$ of better than 9.5 dBm, and the power consumption of almost zero. To the best of our knowledge, the designed 360° phase shifter with the size of 0.094 mm$^2$ is the smallest 5-bit passive phase shifter at frequencies around 60 GHz.

Journal ArticleDOI
TL;DR: This paper proposes a new sparse LU solver on GPUs for circuit simulation and more general scientific computing, based on a hybrid right-looking LU factorization algorithm for sparse matrices, and shows that more concurrency can be exploited in the right-looking method than in the left-looking method on GPU platforms.
Abstract: Lower-upper (LU) factorization for sparse matrices is the most important computing step for circuit simulation problems. However, parallelizing LU factorization on graphics processing units (GPUs) turns out to be a difficult problem due to intrinsic data dependence and irregular memory access, which diminish GPU computing power. In this paper, we propose a new sparse LU solver on GPUs for circuit simulation and more general scientific computing. The new method, called the GPU-accelerated LU factorization (GLU) solver, is based on a hybrid right-looking LU factorization algorithm for sparse matrices. We show that, on GPU platforms, more concurrency can be exploited in the right-looking method than in the left-looking method, which is more popular for circuit analysis. At the same time, GLU preserves the benefits of the column-based left-looking LU method, such as symbolic analysis and column-level concurrency. We show that the resulting parallel GPU LU solver allows the parallelization of all three loops in the LU factorization on GPUs. In contrast, the existing GPU-based left-looking LU factorization approach allows parallelization of only two loops. Experimental results show that the proposed GLU solver can deliver $5.71\times$ and $1.46\times$ speedup over the single-threaded and the 16-threaded PARDISO solvers, respectively, $19.56\times$ speedup over the KLU solver, $47.13\times$ over the UMFPACK solver, and $1.47\times$ speedup over a recently proposed GPU-based left-looking LU solver on the set of typical circuit matrices from the University of Florida (UFL) sparse matrix collection.
Furthermore, we also compare the proposed GLU solver on a set of general matrices from the UFL collection, where GLU achieves $6.38\times$ and $1.12\times$ speedup over the single-threaded and the 16-threaded PARDISO solvers, respectively, $39.39\times$ speedup over the KLU solver, $24.04\times$ over the UMFPACK solver, and $2.35\times$ speedup over the same GPU-based left-looking LU solver. In addition, a comparison on self-generated $RLC$ mesh networks shows a similar trend, which further validates the advantage of the proposed method over the existing sparse LU solvers.
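For readers unfamiliar with the term, a right-looking factorization finishes one column and then immediately updates everything to its right. The sketch below is a textbook dense version on the CPU, without pivoting; it is not the GLU algorithm, which is sparse, hybrid, and GPU-based. It is included only to make the three nested loops visible. The function name `lu_right_looking` is an assumption for this example.

```python
def lu_right_looking(a):
    """Dense right-looking LU factorization without pivoting.

    Returns a new matrix holding U on and above the diagonal and the
    unit-lower-triangular L (diagonal implied) below it. Illustrative
    sketch only: no sparsity, no pivoting, no parallelism.
    """
    n = len(a)
    lu = [row[:] for row in a]  # work on a copy
    for k in range(n):                      # loop 1: current pivot column
        pivot = lu[k][k]
        if pivot == 0:
            raise ZeroDivisionError("zero pivot; this sketch does no pivoting")
        for i in range(k + 1, n):           # scale column k to form L[:, k]
            lu[i][k] /= pivot
        for j in range(k + 1, n):           # loop 2: columns to the right
            for i in range(k + 1, n):       # loop 3: rows of the trailing block
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu
```

In this dense form, the updates inside loops 2 and 3 are independent of one another for a fixed `k`, which is the kind of concurrency a right-looking scheme exposes. How the paper extracts parallelism across the outer loop for sparse matrices is not described in the abstract and is not modeled here.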

Journal ArticleDOI
TL;DR: This work presents an integrated floating-gate (FG) programming system for a large-scale field-programmable analog array, using tunneling to initialize the FG devices and hot-electron injection for precise device programming.
Abstract: We present the first integrated system to handle heterogeneously used and programmed floating-gate (FG) elements in a single modular approach. We focus on IC design, integration, characterization, and algorithmic development of an integrated FG programming system for a large-scale field-programmable analog array. We work through tunneling approaches to initialize the FG devices for precision programming, as well as hot-electron injection approaches for precise device programming.

Journal ArticleDOI
TL;DR: A low-overhead two-state checkpointing (TsCp) scheme for fault-tolerant hard real-time systems that differentiates between the fault-free and faulty execution states and leverages two types of checkpoint intervals for these two different states.
Abstract: Checkpointing with rollback recovery is a well-established technique to tolerate transient faults. However, it incurs significant time and energy overheads, which are wasted in fault-free execution states, and it may not even be feasible in hard real-time systems. This paper presents a low-overhead two-state checkpointing (TsCp) scheme for fault-tolerant hard real-time systems. It differentiates between the fault-free and faulty execution states and leverages two types of checkpoint intervals for these two different states. The first type is nonuniform intervals that are used while no fault has occurred. These intervals are determined based on postponing checkpoint insertions in fault-free states, with the aim of decreasing the number of checkpoint insertions. The second type is uniform intervals that are used from the time when the first fault occurs. They are determined so as to minimize execution time for faulty states, leaving more time available for energy management in fault-free states. Experimental evaluation on an embedded processor (LEON3) and an emerging nonvolatile memory technology (ReRAM) illustrates that TsCp significantly reduces the number of checkpoints (62% on average) compared with previous works, while preserving fault tolerance. This results in 14% and 13% reduced execution time and energy consumption, respectively. Furthermore, we combine TsCp with dynamic voltage scaling (DVS) and achieve up to 26% (21% on average) energy saving compared with the state-of-the-art techniques.

Journal ArticleDOI
TL;DR: Results show that the proposed QCA-based SRAM is more efficient in terms of area, complexity, clock frequency, latency, throughput, and power consumption than the CMOS-based SRAM cell.
Abstract: Application of quantum-dot cellular automata (QCA) technology as an alternative to CMOS technology on the nanoscale has a promising future; QCA is an interesting technology for building memory. The design and simulation of a new memory cell structure based on QCA with minimum delay, area, and complexity are presented to implement a static random access memory (SRAM). This paper presents the design and simulation of a 16-bit $\times$ 32-bit SRAM with a new structure in QCA. Since QCA is inherently pipelined, this SRAM has a high operating speed. The 16-bit $\times$ 32-bit SRAM has a new structure with a 32-bit width designed and implemented in QCA. It has the ability of a conventional logic SRAM that can provide read/write operations frequently with minimum delay. The 16-bit $\times$ 32-bit SRAM is generalized and an $n \times$ 16-bit $\times$ 32-bit SRAM is implemented in QCA. Novel 16-bit decoders and multiplexers (MUXs) in QCA are presented that have been designed with a minimum number of majority gates and cells. The new SRAM, decoders, and MUXs are designed, implemented, and simulated in QCA using a signal distribution network to avoid the coplanar problem of crossing wires. The QCA-based SRAM cell was compared with the SRAM cell based on CMOS. Results show that the proposed SRAM is more efficient in terms of area, complexity, clock frequency, latency, throughput, and power consumption.

Journal ArticleDOI
TL;DR: Simulations that were performed while considering different memory accessing aspects, such as bit reading versus word reading, stored data background distribution, crossbar dimensions, etc., showed that read margins can be increased significantly as compared with standard crossbar architectures.
Abstract: Resistive random access memory (ReRAM), also referred to as memristor, is an emerging memory technology to potentially replace conventional memories, which will soon be facing serious design challenges related to continued scaling. Memristor-based crossbar architecture has been shown to be the best implementation for ReRAM. However, it faces a major challenge related to the sneak current (current sneak paths) flowing through unselected memory cells, which significantly reduces the voltage read margins. In this paper, five alternative architectures (topologies) are applied to minimize the impact of sneak current; the architectures are based on the introduction of insulating junctions within the crossbar. Simulations that were performed while considering different memory accessing aspects, such as bit reading versus word reading, stored data background distribution, crossbar dimensions, etc., showed that read margins can be increased significantly (up to $4\times$) as compared with standard crossbar architectures. In addition, the proposed architectures eliminate the requirement for extra select devices at each cross point and have no operational complexity overhead.

Journal ArticleDOI
TL;DR: The three-stage pipelined architecture is shown to have the best performance, which achieves a scalar multiplication over GF($2^{163}$) in 6.1 μs using 7354 Slices on Virtex-4.
Abstract: This paper proposes an efficient pipelined architecture of elliptic curve scalar multiplication (ECSM) over GF($2^{m}$). The architecture uses a bit-parallel finite-field (FF) multiplier accumulator (MAC) based on the Karatsuba-Ofman algorithm. The Montgomery ladder algorithm is modified for better sharing of execution paths. The data path in the architecture is well designed, so that the critical path contains few extra logic primitives apart from the FF MAC. In order to find the optimal number of pipeline stages, scheduling schemes with different pipeline stages are proposed and the ideal placement of pipeline registers is thoroughly analyzed. We implement ECSM over the five binary fields recommended by the National Institute of Standards and Technology on Xilinx Virtex-4 and Virtex-5 field-programmable gate arrays. The three-stage pipelined architecture is shown to have the best performance, which achieves a scalar multiplication over GF($2^{163}$) in 6.1 $\mu\text{s}$ using 7354 Slices on Virtex-4. Using Virtex-5, the scalar multiplication for $m = 163$, 233, 283, 409, and 571 can be achieved in 4.6, 7.9, 10.9, 19.4, and 36.5 $\mu\text{s}$, respectively, which are faster than previous results.
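To make the Karatsuba-Ofman step concrete, the following is a small software sketch of multiplication in GF($2^{163}$), with field elements stored as Python integers whose bits are polynomial coefficients. It is not the paper's bit-parallel hardware MAC, and it does not cover the Montgomery ladder. The function names, the recursion threshold, and the choice of the NIST reduction polynomial $x^{163} + x^7 + x^6 + x^3 + 1$ as the example modulus are assumptions made for this illustration.

```python
# Reduction polynomial x^163 + x^7 + x^6 + x^3 + 1 (NIST binary field, m = 163).
F163 = (1 << 163) | 0xC9


def clmul_school(a: int, b: int) -> int:
    """Schoolbook carry-less (GF(2)[x]) multiplication."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def karatsuba_gf2(a: int, b: int, threshold: int = 32) -> int:
    """Karatsuba-Ofman multiplication of binary polynomials (no reduction)."""
    n = max(a.bit_length(), b.bit_length())
    if n <= threshold:
        return clmul_school(a, b)
    half = n // 2
    mask = (1 << half) - 1
    a0, a1 = a & mask, a >> half
    b0, b1 = b & mask, b >> half
    lo = karatsuba_gf2(a0, b0, threshold)
    hi = karatsuba_gf2(a1, b1, threshold)
    # Three sub-products instead of four; addition in GF(2) is XOR.
    mid = karatsuba_gf2(a0 ^ a1, b0 ^ b1, threshold) ^ lo ^ hi
    return lo ^ (mid << half) ^ (hi << (2 * half))


def reduce_mod(p: int, modulus: int) -> int:
    """Reduce polynomial p modulo the given irreducible polynomial."""
    m = modulus.bit_length() - 1
    while p.bit_length() > m:
        p ^= modulus << (p.bit_length() - 1 - m)
    return p


def gf2m_mul(a: int, b: int, modulus: int = F163) -> int:
    """Field multiplication: Karatsuba product followed by reduction."""
    return reduce_mod(karatsuba_gf2(a, b), modulus)
```

For example, `gf2m_mul(1 << 162, 2)` multiplies $x^{162}$ by $x$, and the result reduces to $x^7 + x^6 + x^3 + 1$, that is, `0xC9`.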

Journal ArticleDOI
TL;DR: This brief presents the key concept, design strategy, and implementation of reconfigurable coordinate rotation digital computer (CORDIC) architectures that can be configured to operate for either circular or hyperbolic trajectories in both rotation and vectoring modes.
Abstract: This brief presents the key concept, design strategy, and implementation of reconfigurable coordinate rotation digital computer (CORDIC) architectures that can be configured to operate for either circular or hyperbolic trajectories in both rotation and vectoring modes. They can, therefore, be used to perform all the functions of both circular and hyperbolic CORDIC. We propose three reconfigurable CORDIC designs: 1) a reconfigurable rotation-mode CORDIC that operates either for circular or for hyperbolic trajectory; 2) a reconfigurable vectoring-mode CORDIC for circular and hyperbolic trajectories; and 3) a generalized reconfigurable CORDIC that can operate in any of the modes for both circular and hyperbolic trajectories. The reconfigurable CORDIC can compute the trigonometric and exponential functions, logarithms, square roots, and so on that circular and hyperbolic CORDIC provide, using either rotation-mode or vectoring-mode CORDIC in a single circuit. It can be used in digital synchronizers, graphics processors, scientific calculators, and so on. It offers a substantial saving in area complexity over the conventional design for reconfigurable applications.
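The idea of one iteration serving both trajectories can be sketched in software with the standard unified CORDIC recurrence, in which a parameter `m` selects circular (`m = 1`) or hyperbolic (`m = -1`) operation. The code below is a floating-point illustration of that textbook recurrence under stated assumptions, not the reconfigurable hardware proposed in the brief. The function name, the argument names, and the default iteration count are hypothetical.

```python
import math


def cordic(x, y, z, m=1, mode="rotation", iterations=24):
    """Unified CORDIC sketch.

    m = 1 selects circular, m = -1 hyperbolic trajectories.
    mode = "rotation" drives z to 0; mode = "vectoring" drives y to 0
    (assumes x > 0). Returns gain-corrected (x, y) and the final z.
    """
    if m == 1:
        shifts = list(range(iterations))
    else:
        # Hyperbolic CORDIC starts at i = 1 and repeats iterations
        # 4, 13, 40, ... to guarantee convergence.
        shifts, i, repeat = [], 1, 4
        while len(shifts) < iterations:
            shifts.append(i)
            if i == repeat:
                shifts.append(i)
                repeat = 3 * repeat + 1
            i += 1
    gain = 1.0
    for i in shifts:
        t = 2.0 ** -i
        if mode == "rotation":
            d = 1.0 if z >= 0 else -1.0
        else:
            d = -1.0 if y >= 0 else 1.0
        # Both updates use the old x and y (tuple assignment).
        x, y = x - m * d * y * t, y + d * x * t
        z -= d * (math.atan(t) if m == 1 else math.atanh(t))
        gain *= math.sqrt(1.0 + m * t * t)
    return x / gain, y / gain, z
```

With these conventions, `cordic(1.0, 0.0, theta)` approximates $(\cos\theta, \sin\theta)$, `cordic(1.0, 0.0, theta, m=-1)` approximates $(\cosh\theta, \sinh\theta)$, so that $e^{\theta}$ is their sum, and vectoring mode returns the magnitude and the angle. The inputs must lie within the usual convergence range: roughly $|\theta| \le 1.74$ for circular and $|\theta| \le 1.11$ for hyperbolic operation.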

Journal ArticleDOI
TL;DR: The proposed method is the first approach able to automatically generate SBST programs for both end-of-manufacturing and in-field test whose fault efficiency is superior to those produced by state-of-the-art manual approaches.
Abstract: Software-based self-test (SBST) techniques are used to test processors and processor cores against permanent faults introduced by the manufacturing process or to perform in-field test in safety-critical applications. However, the generation of an SBST program is usually associated with high costs as it requires significant manual effort of a skilled engineer with in-depth knowledge about the processor under test. In this paper, we propose an approach for the automatic generation of SBST programs. First, we detail an automatic test pattern generation (ATPG) framework for the generation of functional test sequences. Second, we describe the extension of this framework with the concept of a validity checker module (VCM), which allows the specification of constraints with regard to the generated sequences. Third, we use the VCM to express typical constraints that exist when SBST is adopted for in-field test. In our experimental results, we evaluate the proposed approach with a microprocessor without interlocked pipeline stages (MIPS)-like microprocessor. The results show that the proposed method is the first approach able to automatically generate SBST programs for both end-of-manufacturing and in-field test whose fault efficiency is superior to those produced by state-of-the-art manual approaches.

Journal ArticleDOI
TL;DR: The Monte Carlo simulation for 500 runs was performed to ensure the robustness of the proposed precharge-free CAM, which is free of the drawbacks of the charge sharing in the NAND and the SC current in the NOR-type CAM.
Abstract: Content-addressable memory (CAM) is the hardware for parallel lookup/search. The parallel search scheme promises a high-speed search operation but at the cost of high power consumption. Parallel NOR- and NAND-type matchline (ML) CAMs are suitable for high-search-speed and low-power-consumption applications, respectively. The NOR-type ML CAM requires high power, and therefore, the reduction of its power consumption is the subject of many reported designs. Here, we report and explore the short-circuit (SC) current during the precharge phase of the NOR-type ML. We also propose a novel precharge-free CAM. The proposed CAM is free of the drawbacks of the charge sharing in the NAND-type CAM and the SC current in the NOR-type CAM. Postlayout simulations performed with a 45-nm technology node revealed a significant reduction in the energy metric: 93% and 77% lower than those of NOR- and NAND-type CAMs, respectively. A Monte Carlo simulation of 500 runs was performed to confirm the robustness of the proposed precharge-free CAM.