Showing papers in "IEEE Transactions on Very Large Scale Integration Systems in 2018"


Journal ArticleDOI
TL;DR: This paper quantitatively analyzes and optimizes the design objectives of the CNN accelerator based on multiple design variables, and proposes a specific dataflow of hardware CNN acceleration that minimizes data communication while maximizing resource utilization to achieve high performance.
Abstract: As convolution contributes most operations in convolutional neural network (CNN), the convolution acceleration scheme significantly affects the efficiency and performance of a hardware CNN accelerator. Convolution involves multiply and accumulate operations with four levels of loops, which results in a large design space. Prior works either employ limited loop optimization techniques, e.g., loop unrolling, tiling, and interchange, or only tune some of the design variables after the accelerator architecture and dataflow are already fixed. Without fully studying the convolution loop optimization before the hardware design phase, the resulting accelerator can hardly exploit the data reuse and manage data movement efficiently. This paper overcomes these barriers by quantitatively analyzing and optimizing the design objectives (e.g., memory access) of the CNN accelerator based on multiple design variables. Then, we propose a specific dataflow of hardware CNN acceleration to minimize the data communication while maximizing the resource utilization to achieve high performance. The proposed CNN acceleration scheme and architecture are demonstrated by implementing end-to-end CNNs including NiN, VGG-16, and ResNet-50/ResNet-152 for inference. For VGG-16 CNN, the overall throughputs achieve 348 GOPS and 715 GOPS on Intel Stratix V and Arria 10 FPGAs, respectively.

228 citations
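
The four loop levels mentioned in the abstract above can be made concrete with a plain direct convolution. The sketch below is illustrative only: the names `conv_direct`, `ifmaps`, and `weights` are hypothetical, stride 1 and no padding are assumed, and the loop order shown is just one of the many orderings that loop interchange, unrolling, and tiling would explore. It is not the dataflow proposed in the paper.

```python
def conv_direct(ifmaps, weights):
    """Direct convolution (stride 1, no padding) written as four loop levels.

    ifmaps:  [n_if][height][width] nested lists
    weights: [n_of][n_if][k][k] nested lists
    Returns (ofmaps, mac_count).
    """
    n_if, height, width = len(ifmaps), len(ifmaps[0]), len(ifmaps[0][0])
    n_of, k = len(weights), len(weights[0][0])
    out_h, out_w = height - k + 1, width - k + 1
    ofmaps = [[[0] * out_w for _ in range(out_h)] for _ in range(n_of)]
    macs = 0
    for of in range(n_of):                    # level 4: across output feature maps
        for oy in range(out_h):               # level 3: scan along the output map
            for ox in range(out_w):
                acc = 0
                for c in range(n_if):         # level 2: across input feature maps
                    for ky in range(k):       # level 1: within the kernel window
                        for kx in range(k):
                            acc += ifmaps[c][oy + ky][ox + kx] * weights[of][c][ky][kx]
                            macs += 1
                ofmaps[of][oy][ox] = acc
    return ofmaps, macs
```

The multiply-accumulate count equals `n_of * out_h * out_w * n_if * k * k`, which is the size of the iteration space that any unrolling or tiling choice has to partition.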


Journal ArticleDOI
TL;DR: This paper proposes spin-transfer torque compute-in-memory (STT-CiM), a design for in-memory computing with spin-transfer torque magnetic RAM that enables multiple wordlines within an array simultaneously, so that functions of the values stored in multiple rows can be sensed directly in a single access.
Abstract: In-memory computing is a promising approach to addressing the processor-memory data transfer bottleneck in computing systems. We propose spin-transfer torque compute-in-memory (STT-CiM), a design for in-memory computing with spin-transfer torque magnetic RAM (STT-MRAM). The unique properties of spintronic memory allow multiple wordlines within an array to be simultaneously enabled, opening up the possibility of directly sensing functions of the values stored in multiple rows using a single access. We propose modifications to STT-MRAM peripheral circuits that leverage this principle to perform logic, arithmetic, and complex vector operations. We address the challenge of reliable in-memory computing under process variations by extending error-correction code schemes to detect and correct errors that occur during CiM operations. We also address the question of how STT-CiM should be integrated within a general-purpose computing system. To this end, we propose architectural enhancements to processor instruction sets and on-chip buses that enable STT-CiM to be utilized as a scratchpad memory. Finally, we present data mapping techniques to increase the effectiveness of STT-CiM. We evaluate STT-CiM using a device-to-architecture modeling framework, and integrate cycle-accurate models of STT-CiM with a commercial processor and on-chip bus (Nios II and Avalon from Intel). Our system-level evaluation shows that STT-CiM provides the system-level performance improvements of 3.93 times on average (up to 10.4 times), and concurrently reduces memory system energy by 3.83 times on average (up to 12.4 times).

205 citations


Journal ArticleDOI
TL;DR: This paper proposes an approximate hybrid high radix encoding for generating the partial products in signed multiplications that encodes the most significant bits with the accurate radix-4 encoding and the least significant bits with an approximate higher radix encoding.
Abstract: Approximate computing forms a design alternative that exploits the intrinsic error resilience of various applications and produces energy-efficient circuits with small accuracy loss. In this paper, we propose an approximate hybrid high radix encoding for generating the partial products in signed multiplications that encodes the most significant bits with the accurate radix-4 encoding and the least significant bits with an approximate higher radix encoding. The approximations are performed by rounding the high radix values to their nearest power of two. The proposed technique can be configured to achieve the desired energy-accuracy tradeoffs. Compared with the accurate radix-4 multiplier, the proposed multipliers deliver up to 56% energy and 55% area savings, when operating at the same frequency, while the imposed error is bounded by a Gaussian distribution with near-zero average. Moreover, the proposed multipliers are compared with state-of-the-art inexact multipliers, outperforming them by up to 40% in energy consumption, for similar error values. Finally, we demonstrate the scalability of our technique.

102 citations
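
The rounding step described in the abstract above can be sketched in a few lines. This is a simplified behavioral model under stated assumptions, not the encoder from the paper: the function names are hypothetical, the `k` least significant bits of the multiplier are treated as one signed digit, ties are rounded up in magnitude, and the exact (radix-4) part is modeled as an ordinary exact multiplication.

```python
def round_to_power_of_two(d):
    """Round a signed integer to the nearest power of two (0 stays 0, ties round up in magnitude)."""
    if d == 0:
        return 0
    sign = -1 if d < 0 else 1
    mag = abs(d)
    lower = 1 << (mag.bit_length() - 1)   # largest power of two <= mag
    upper = lower << 1
    return sign * (lower if mag - lower < upper - mag else upper)


def approx_multiply(a, b, k):
    """Multiply a by b, approximating only the k least significant bits of b."""
    low = b & ((1 << k) - 1)              # k LSBs as an unsigned value
    if low >= 1 << (k - 1):               # reinterpret as a signed digit in [-2^(k-1), 2^(k-1))
        low -= 1 << k
    high = b - low                        # exact part, a multiple of 2^k
    return a * high + a * round_to_power_of_two(low)
```

For example, `approx_multiply(3, 21, 4)` splits 21 into an exact part 16 and a low digit 5, rounds 5 to 4, and returns 60 instead of the exact 63. Because the approximate partial product is a power-of-two multiple of `a`, it reduces to a shift in hardware.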


Journal ArticleDOI
TL;DR: The proposed novel circuits for XOR/XNOR and simultaneous XOR–XNOR functions are highly optimized in terms of the power consumption and delay, which are due to low output capacitance and low short-circuit power dissipation.
Abstract: In this paper, novel circuits for XOR/XNOR and simultaneous XOR–XNOR functions are proposed. The proposed circuits are highly optimized in terms of the power consumption and delay, which are due to low output capacitance and low short-circuit power dissipation. We also propose six new hybrid 1-bit full-adder (FA) circuits based on the novel full-swing XOR–XNOR or XOR/XNOR gates. Each of the proposed circuits has its own merits in terms of speed, power consumption, power-delay product (PDP), driving ability, and so on. To investigate the performance of the proposed designs, extensive HSPICE and Cadence Virtuoso simulations are performed. The simulation results, based on the 65-nm CMOS process technology model, indicate that the proposed designs have superior speed and power against other FA designs. A new transistor sizing method is presented to optimize the PDP of the circuits. In the proposed method, the numerical computation particle swarm optimization algorithm is used to achieve the desired value for optimum PDP with fewer iterations. The proposed circuits are investigated in terms of variations of the supply and threshold voltages, output capacitance, input noise immunity, and the size of transistors.

101 citations


Journal ArticleDOI
TL;DR: A new BO-based global optimization algorithm titled Two-Stage BO (TSBO) is proposed, applied for clock skew minimization in 3-D integrated circuits and multiobjective co-optimization for maximizing efficiency in integrated voltage regulators.
Abstract: Increasing levels of system integration pose difficulties in meeting design specifications for high-performance systems. Oftentimes increased complexity, nonlinearity, and multiple tradeoffs need to be handled simultaneously during the design cycle. Since components in such systems are highly correlated with each other, codesign and co-optimization of the complete system are required. Machine learning (ML) provides opportunities for analyzing such systems with multiple control parameters, where techniques based on Bayesian optimization (BO) can be used to meet or exceed design specifications. In this paper, we propose a new BO-based global optimization algorithm titled Two-Stage BO (TSBO). TSBO can be applied to black box optimization problems where the computational time can be reduced through a reduction in the number of simulations required. Empirical analysis on a set of popular challenge functions with several local extrema and dimensions shows TSBO to have a faster convergence rate as compared with other optimization methods. In this paper, TSBO has been applied for clock skew minimization in 3-D integrated circuits and multiobjective co-optimization for maximizing efficiency in integrated voltage regulators. The results show that TSBO is between $2\times $ and $4\times $ faster than previously published BO algorithms and other non-ML-based techniques.

89 citations


Journal ArticleDOI
TL;DR: A recurrent neural network (RNN) accelerator design with resistive random-access memory (ReRAM)-based processing-in-memory (PIM) architecture distinguished from prior ReRAM-based convolutional neural network accelerators is presented.
Abstract: We present a recurrent neural network (RNN) accelerator design with resistive random-access memory (ReRAM)-based processing-in-memory (PIM) architecture. Distinguished from prior ReRAM-based convolutional neural network accelerators, we redesign the system to make it suitable for RNN acceleration. We measure the system throughput and energy efficiency with the detailed circuit and device characterization. Reprogrammability is enabled with our design, and an RNN-friendly pipeline is employed to increase the system throughput. We observe that on average the proposed system achieves $79{\times}$ improvement of computing efficiency compared with graphics processing unit baseline. Our simulation also indicates that to maintain high accuracy and computing efficiency, the read noise standard deviation should be less than 0.2, the device resistance should be at least 1 $\text{M}{\Omega }$ , and the device write latency should be minimized.

89 citations


Journal ArticleDOI
TL;DR: Three variations of convolutions are evaluated, including direct convolution, fast Fourier transform-based convolution (FFT-Conv), and FFT overlap and add convolution for popular CNN networks in embedded hardware to explore the tradeoff between software and hardware implementation, domain-specific logic and instructions, as well as various parallelism across different architectures.
Abstract: Fueled by ImageNet Large Scale Visual Recognition Challenge and Common Objects in Context competitions, the convolutional neural network (CNN) has become important in computer vision and natural language processing. However, state-of-the-art CNNs are computationally memory-intensive, thus energy-efficient implementation on the embedded platform is challenging. Recently, VGGNet and ResNet showed that deep neural networks with more convolution layers and a few fully connected layers can achieve lower error rates, thus reducing the complexity of convolution layers is of utmost importance. In this paper, we evaluate three variations of convolutions, including direct convolution (Direct-Conv), fast Fourier transform (FFT)-based convolution (FFT-Conv), and FFT overlap and add convolution (FFT-OVA-Conv) in terms of computation complexity and memory storage requirements for popular CNN networks in embedded hardware. We implemented these three techniques for ResNet-20 with the CIFAR-10 data set on a low-power domain-specific many-core architecture called power-efficient nanoclusters (PENCs), NVIDIA Jetson TX1 graphics processing unit (GPU), ARM Cortex A53 CPU, and SPARse Convolutional NETwork (SPARCNet) accelerator on Zynq 7020 FPGA to explore the tradeoff between software and hardware implementation, domain-specific logic and instructions, as well as various parallelism across different architectures. Results are evaluated and compared with respect to throughput per layer, energy consumption, and execution time for the three methods. SPARCNet deployed on Zynq FPGA achieved 42-ms runtime with 135-mJ energy consumption with a 10.8-MB/s throughput per layer using FFT-Conv for ResNet-20. Using built-in FFT instruction in PENC, the FFT-OVA-Conv performs $2.9\times $ and $1.65\times $ faster and achieves $6.8\times $ and $2.5\times $ higher throughput per watt than Direct-Conv and FFT-Conv. 
In ARM A53 CPU, FFT-OVA-Conv achieves $3.36\times $ and $1.38\times $ improvement in execution time and $2.72\times $ and $1.32\times $ higher throughput than Direct-Conv and FFT-Conv. In TX1 GPU, FFT-Conv is $1.9\times $ faster, $2.2\times $ more energy-efficient, and achieves $5.6\times $ higher throughput per layer than Direct-Conv. PENC is $10\,916\times $ and $1.8\times $ faster and $5053\times $ and $4.3\times $ more energy-efficient and achieves $7.5\times $ and $1.2\times $ higher throughput per layer than ARM A53 CPU and TX1 GPU, respectively.

84 citations


Journal ArticleDOI
TL;DR: This brief generalizes and systematically optimizes an architectural template for approximate adders called optimized lower part constant-OR adder (LOCA), which outperforms previous approaches in terms of accuracy and hardware cost.
Abstract: Exploiting the tradeoff between accuracy and hardware cost has a tremendous potential to improve the efficiency of integrated systems. Using this concept, numerous approximate adders have been proposed in the last ten years. Although conceptually different, all previous architectures have been obtained with an ad hoc and nonsystematic methodology. Instead, this brief generalizes and systematically optimizes an architectural template for approximate adders. The outcome, called optimized lower part constant-OR adder (LOCA), outperforms previous approaches in terms of accuracy and hardware cost. For example, an 8-bit approximate adder implemented with our new approach improves the mean squared error by 58.5%, while simultaneously reducing the cost by 7.2% with respect to the previously reported best architecture.

81 citations


Journal ArticleDOI
TL;DR: A novel secure cell design is presented for implementing the design-for-security infrastructure to prevent leaking the key to an adversary under any circumstances; it is resistant to various known attacks at the cost of very little (< 1%) area overhead.
Abstract: Due to the prohibitive costs of semiconductor manufacturing, most system-on-chip design companies outsource their production to offshore foundries. As most of these devices are manufactured in environments of limited trust that often lack appropriate oversight, a number of different threats have emerged. These include unauthorized overproduction of the integrated circuits (ICs), sale of out-of-specification/rejected ICs discarded by manufacturing tests, piracy of intellectual property, and reverse engineering of the designs. Over the years, researchers have proposed different metering and obfuscation techniques to enable trust in outsourced IC manufacturing, where the design is obfuscated by modifying the underlying functionality and only activated by using a secure obfuscation key. However, Boolean satisfiability-based algorithms have been shown to efficiently break key-based obfuscation methods, and thus circumvent the primary objectives of metering and obfuscation. In this paper, we present a novel secure cell design for implementing the design-for-security infrastructure to prevent leaking the key to an adversary under any circumstances. Importantly, our design does not limit the testability of the chip during the normal manufacturing flow in any way, including postsilicon validation and debug. Our proposed design is resistant to various known attacks at the cost of very little (< 1%) area overhead.

76 citations


Journal ArticleDOI
TL;DR: This paper presents a DTPM algorithm, which uses a practical temperature prediction methodology based on system identification and successfully regulates the maximum temperature and decreases the temperature violations by one order of magnitude while also reducing the total power consumption on average by 7% compared with the default solution.
Abstract: State-of-the-art mobile platforms are powered by heterogeneous system-on-chips that integrate multiple CPU cores, a GPU, and many specialized processors. Competitive performance on these platforms comes at the expense of increased power density due to their small form factor. Consequently, the skin temperature, which can degrade the experience, becomes a limiting factor. Since using a fan is not a viable solution for hand-held devices, there is a strong need for dynamic thermal and power management (DTPM) algorithms that can regulate temperature with minimal performance impact. This paper presents a DTPM algorithm, which uses a practical temperature prediction methodology based on system identification. The proposed algorithm dynamically computes a power budget using the predicted temperature. This budget is used to throttle the frequency and number of cores to avoid temperature violations with minimal impact on the system performance. Our experimental measurements on two different octa-core big.LITTLE processors and common Android applications demonstrate that the proposed technique predicts the temperature with less than 5% error across all benchmarks. Using this prediction, the proposed DTPM algorithm successfully regulates the maximum temperature and decreases the temperature violations by one order of magnitude while also reducing the total power consumption on average by 7% compared with the default solution.

68 citations


Journal ArticleDOI
TL;DR: An innovative approximate adder is developed, which significantly reduces the silicon area and data path delay, and algorithmic transformations for certain layers of BCNNs and a memory-efficient quantization scheme are incorporated to further reduce the energy cost and on-chip storage requirement.
Abstract: Binary weight convolutional neural networks (BCNNs) can achieve near state-of-the-art classification accuracy and have far less computation complexity compared with traditional CNNs using high-precision weights. Due to their binary weights, BCNNs are well suited for vision-based Internet-of-Things systems being sensitive to power consumption. BCNNs make it possible to achieve very high throughput with moderate power dissipation. In this paper, an energy-efficient architecture for BCNNs is proposed. It fully exploits the binary weights and other hardware-friendly characteristics of BCNNs. A judicious processing schedule is proposed so that off-chip I/O access is minimized and activations are maximally reused. To significantly reduce the critical path delay, we introduce optimized compressor trees and approximate binary multipliers with two novel compensation schemes. The latter is able to save significant hardware resource, and almost no computation accuracy is compromised. Taking advantage of error resiliency of BCNNs, an innovative approximate adder is developed, which significantly reduces the silicon area and data path delay. Thorough error analysis and extensive experimental results on several data sets show that the approximate adders in the data path cause negligible accuracy loss. Moreover, algorithmic transformations for certain layers of BCNNs and a memory-efficient quantization scheme are incorporated to further reduce the energy cost and on-chip storage requirement. Finally, the proposed BCNN hardware architecture is implemented with the SMIC 130-nm technology. The postlayout results demonstrate that our design can achieve an energy efficiency over 2.0TOp/s/W when scaled to 65 nm, which is more than two times better than the prior art.

Journal ArticleDOI
TL;DR: Simulations performed in Cadence Spectre demonstrate the ability of the proposed radiation-hardened-by-design 10T cell to tolerate both $0~\rightarrow ~1$ and $1~\rightarrow ~0$ single node upsets, at the cost of increased read/write access time.
Abstract: In this brief, based on upset physical mechanism together with reasonable transistor size, a robust 10T memory cell is first proposed to enhance the reliability level in aerospace radiation environment, while keeping the main advantages of small area, low power, and high stability. Using Taiwan Semiconductor Manufacturing Company 65-nm CMOS commercial standard process, simulations performed in Cadence Spectre demonstrate the ability of the proposed radiation-hardened-by-design 10T cell to tolerate both $0~\rightarrow ~1$ and $1~\rightarrow ~0$ single node upsets, with the increased read/write access time.

Journal ArticleDOI
TL;DR: Extensive characterizations of multi-kb RRAM arrays during forming, set, reset, and cycling operations are presented, and the relationships among programming conditions, memory window, and endurance features are presented.
Abstract: Resistive random access memories (RRAMs) feature high-speed operations, low-power consumption, and nonvolatile retention, thus serving as a promising candidate for future memory applications. To explore the applications of the RRAM, switching variability and cycling endurance need to be addressed. This paper presents extensive characterizations of multi-kb RRAM arrays during forming, set, reset, and cycling operations. The relationships among programming conditions, memory window, and endurance features are presented. The experimental results are then used to perform variability-aware simulations of a 128-bit RRAM-based ternary content-addressable-memory (TCAM) macro. The tradeoff among endurance, search latency, and reliability in terms of match/mismatch detection is explored, identifying the programming conditions that allow to obtain a searching speed comparable to static random access memory-based TCAMs (2 ns on average and 3 ns at $3\sigma $ ) while guaranteeing good reliability metrics (with a time ratio of 3000 on average and 150 at $3\sigma $ ).

Journal ArticleDOI
TL;DR: A low-power comparator using pMOS transistors at the input of the preamplifier of the comparator as well as the latch stage that reduces the power consumption and provides 30% better comparison speed at the same offset and almost the same noise budgets.
Abstract: A low-power comparator is presented. pMOS transistors are used at the input of the preamplifier of the comparator as well as the latch stage. Both stages are controlled by a special local clock generator. At the evaluation phase, the latch is activated with a delay to achieve enough preamplification gain and avoid excess power consumption. Meanwhile, small cross-coupled transistors increase the preamplifier gain and decrease the input common mode of the latch to strongly turn on the pMOS transistors (at the latch input) and reduce the delay. Unlike the conventional comparator, the proposed structure lets us set the optimum delay for preamplification and avoid excess power consumption. The speed and the power benefits of the comparator were verified using solid analytical derivations, process–VDD–temperature corners, and Monte Carlo simulations along with silicon measurements in $0.18~\mu \text{m}$ . The tests confirm that the proposed circuit reduces the power consumption by 50% and provides 30% better comparison speed at the same offset and almost the same noise budgets. Moreover, the comparator provides a rail-to-rail input $V_{\text {cm}}$ range at $f_{\text {clk}} = 500$ MHz.

Journal ArticleDOI
TL;DR: The results indicate that employing the proposed RCPAs in the hybrid adders may provide, on average, 27%, 6%, and 31% improvements in delay, energy, and energy-delay-product while providing higher levels of accuracy.
Abstract: In this paper, a reverse carry propagate adder (RCPA) is presented. In the RCPA structure, the carry signal propagates in a counter-flow manner from the most significant bit to the least significant bit; hence, the carry input signal has higher significance than the output carry. This method of carry propagation leads to higher stability in the presence of delay variations. Three implementations of the reverse carry propagate full-adder (RCPFA) cell with different delay, power, energy, and accuracy levels are introduced. The proposed structure may be combined with an exact (forward) carry adder to form hybrid adders with tunable levels of accuracy. The design parameters of the proposed RCPA implementations and some hybrid adders realized utilizing these structures are studied and compared with those of the state-of-the-art approximate adders using HSPICE simulations in a 45-nm CMOS technology. The results indicate that employing the proposed RCPAs in the hybrid adders may provide, on average, 27%, 6%, and 31% improvements in delay, energy, and energy-delay-product while providing higher levels of accuracy. In addition, the structure is more resilient to delay variation compared to the conventional approximate adder. Finally, the efficacy of the proposed RCPAs is investigated in the discrete cosine transform (DCT) block of the JPEG compression and finite-impulse response (FIR) filter applications. The investigation reveals 60% and 39% energy saving in the DCT of JPEG and FIR filter, respectively, for the proposed RCPAs.

Journal ArticleDOI
TL;DR: A novel area-efficient and power-efficient approach to sorting networks, based on “unary processing,” is proposed and validated with two implementations of median filtering, yielding a low-cost, energy-efficient median filter with only a slight accuracy loss.
Abstract: Sorting is a common task in a wide range of applications from signal and image processing to switching systems. For applications that require high performance, sorting is often performed in hardware with application-specific integrated circuits or field-programmable gate arrays. Hardware cost and power consumption are the dominant concerns. The usual approach is to wire up a network of compare-and-swap units in a configuration called the Batcher (or bitonic) network. Such networks can readily be pipelined. This paper proposes a novel area-efficient and power-efficient approach to sorting networks, based on “unary processing.” In unary processing, numbers are encoded uniformly by a sequence of one value (say 1) followed by a sequence of the other value (say 0) in a stream of 0’s and 1’s with the value defined by the fraction of 1’s in the stream. Synthesis results of complete sorting networks show up to 92% area and power saving compared to the conventional binary implementations. However, the latency increases. To mitigate the increased latency, this paper uses a novel time-encoding of data. The approach is validated with two implementations of an important application of sorting: median filtering. The result is a low cost, energy-efficient implementation of median filtering with only a slight accuracy loss, compared to conventional implementations.
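
The unary encoding described in the abstract above makes a compare-and-swap unit very cheap. The sketch below relies on a property of aligned unary (thermometer) streams: a bitwise AND of two such streams yields the smaller value and a bitwise OR yields the larger one. The function names and the three-input median network are illustrative assumptions, and the paper's time-encoding of data is not modeled.

```python
def to_unary(value, length):
    """Thermometer code: `value` ones followed by zeros (requires 0 <= value <= length)."""
    return [1] * value + [0] * (length - value)


def from_unary(stream):
    """Decode a unary stream by counting its ones."""
    return sum(stream)


def compare_and_swap(a, b):
    """Return (min, max) of two aligned unary streams using one AND and one OR per bit."""
    lo = [x & y for x, y in zip(a, b)]
    hi = [x | y for x, y in zip(a, b)]
    return lo, hi


def median3(a, b, c, length=8):
    """Median of three values via a three-comparator sorting network on unary streams."""
    ua, ub, uc = (to_unary(v, length) for v in (a, b, c))
    ua, ub = compare_and_swap(ua, ub)
    ub, uc = compare_and_swap(ub, uc)
    ua, ub = compare_and_swap(ua, ub)
    return from_unary(ub)                 # the middle wire carries the median
```

Each comparator costs two gates per stream bit, but a value with `n`-bit precision needs a stream of `2**n` bits, which is the latency increase the abstract mentions.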

Journal ArticleDOI
TL;DR: This brief presents a novel ultralow power CMOS voltage reference (CVR) with only 4.6-nW power consumption and measurement results show that the prototype design is capable of providing a 755 mV typical reference voltage with 34 ppm/°C from −15 °C to 140 °C.
Abstract: This brief presents a novel ultralow power CMOS voltage reference (CVR) with only 4.6-nW power consumption. In the proposed CVR circuit, the proportional-to-absolute-temperature voltage is generated by feeding the leakage current of a zero- $V_{\mathrm {gs}}$ nMOS transistor to two diode-connected nMOS transistors in series, both of which are in subthreshold region; while the complementary-to-absolute-temperature voltage is created by using the body diodes of another nMOS transistor. Consequently, low-power operation can be achieved without requiring resistors or bipolar junction transistors, leading to small chip area consumption. The proposed CVR circuit is fabricated in a standard 0.18- $\mu \text{m}$ CMOS process. Measurement results show that the prototype design is capable of providing a 755 mV typical reference voltage with 34 ppm/°C from −15 °C to 140 °C. Moreover, the typical power consumption is only 4.6 nW at room temperature and the active area is only 0.0598 mm$^{2}$.

Journal ArticleDOI
TL;DR: Sobol sequences are considered for stochastic computing, where they improve accuracy with a reduced sequence length; the underlying theory is formulated and a circuit design is proposed for an arbitrary level of parallelization in a power of 2.
Abstract: Stochastic computing (SC) often requires long stochastic sequences and, thus, a long latency to achieve accurate computation. The long latency leads to an inferior performance and low energy efficiency compared with most conventional binary designs. In this paper, a type of low-discrepancy sequences, the Sobol sequence, is considered for use in SC. Compared to the use of pseudorandom sequences generated by linear feedback shift registers (LFSRs), the use of Sobol sequences improves the accuracy of stochastic computation with a reduced sequence length. The inherent feature in Sobol sequence generators enables the parallel implementation of random number generators with an improved performance and hardware efficiency. In particular, the underlying theory is formulated and circuit design is proposed for an arbitrary level of parallelization in a power of 2. In addition, different strategies are implemented for parallelizing combinational and sequential stochastic circuits. The hardware efficiency of the parallel stochastic circuits is measured by energy per operation (EPO), throughput per area (TPA), and runtime. At a similar accuracy, the $8{\times}$ parallel stochastic circuits using Sobol sequences consume approximately 1% of the EPO of the conventional LFSR-based nonparallelized circuits. Meanwhile, an average of 70 (up to 89) times improvements in TPA and less than 1% runtime are achieved. A sorting network is implemented for a median filter (MF) as an application. For a similar image processing quality, a higher energy efficiency is obtained for an $8{\times}$ parallelized stochastic MF compared with its binary counterpart.
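
A small software model can illustrate why low-discrepancy sequences help stochastic computing. The sketch below generates the first two dimensions of a Sobol sequence, turns two probabilities into bit streams by comparison against those values, and multiplies them with a bitwise AND (unipolar coding). It is a behavioral illustration only: the function names are hypothetical, the direction numbers are the commonly used ones for the first two dimensions (assumed here, not taken from the paper), and the paper's parallel generator circuits are not modeled.

```python
def sobol_2d(n_bits):
    """First 2**n_bits points of a 2-D Sobol sequence as integers in [0, 2**n_bits)."""
    # Dimension 1 uses m_k = 1 for all k (the van der Corput sequence).
    m1 = [1] * n_bits
    # Dimension 2 uses the degree-1 recurrence m_k = 2*m_{k-1} XOR m_{k-1}, with m_1 = 1.
    m2 = [1]
    for _ in range(1, n_bits):
        m2.append((2 * m2[-1]) ^ m2[-1])
    # Direction numbers scaled to n_bits-bit integers: v_k = m_k * 2**(n_bits - k).
    v1 = [m << (n_bits - 1 - j) for j, m in enumerate(m1)]
    v2 = [m << (n_bits - 1 - j) for j, m in enumerate(m2)]
    points = []
    for i in range(1 << n_bits):
        x = y = 0
        for j in range(n_bits):           # XOR the direction numbers selected by the bits of i
            if (i >> j) & 1:
                x ^= v1[j]
                y ^= v2[j]
        points.append((x, y))
    return points


def sc_multiply(p, q, n_bits):
    """Estimate p*q by ANDing two comparator-generated streams of length 2**n_bits."""
    n = 1 << n_bits
    ones = sum(1 for x, y in sobol_2d(n_bits) if x < p * n and y < q * n)
    return ones / n
```

With only 16 stream bits, `sc_multiply(0.5, 0.5, 4)` evaluates to 0.25 in this model, which hints at how a short low-discrepancy stream can reach the accuracy that a pseudorandom LFSR stream would need many more bits to approach.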

Journal ArticleDOI
TL;DR: This work investigates a simple accuracy configurable adder design that contains no redundancy or error detection/correction circuitry and uses very simple carry prediction and proposes a delay-adaptive self-configuration technique to further improve accuracy-delay-power tradeoff.
Abstract: Approximate computing is a promising approach for low-power IC design and has recently received considerable research attention. To accommodate dynamic levels of approximation, a few accuracy-configurable adder (ACA) designs have been developed in the past. However, these designs tend to incur large area overheads as they rely on either redundant computing or complicated carry prediction. Some of these designs include error detection and correction circuitry, which further increase the area. In this paper, we investigate a simple ACA design that contains no redundancy or error detection/correction circuitry and uses very simple carry prediction. The simulation results show that our design dominates the latest previous work on accuracy-delay-power tradeoff while using 39% lower area. In the best case, the iso-delay power of our design is only 16% of accurate adder regardless of degradation in accuracy. One variant of this design provides finer-grained and larger tunability than that of the previous works. Moreover, we propose a delay-adaptive self-configuration technique to further improve the accuracy-delay-power tradeoff. The advantages of our method are confirmed by the applications in multiplication and discrete cosine transform computing.

Journal ArticleDOI
TL;DR: Compared with the best spurious-free dynamic range (SFDR) values obtained by the conventional DEM DACs, Monte Carlo simulations demonstrate that the proposed DAC achieves higher performance improvement with the same MSB bit width.
Abstract: Dynamic performance of the current-steering digital-to-analog converter (DAC) is mainly affected by the mismatch-induced nonlinearity. Dynamic-element-matching (DEM) method has been widely employed to effectively improve the amplitude and timing mismatches. However, the maximum performance improvement is constrained by the conventional separate-segment structure for the current source array. In this brief, a DEM DAC with nested-segment structure is proposed to improve the mismatch performance. Compared with the best spurious-free dynamic range (SFDR) values obtained by the conventional DEM DACs, Monte Carlo simulations demonstrate that the proposed DAC achieves higher performance improvement with the same MSB bit width. The largest improvement occurs at MSB bit width of 3 with the 6.95- and 4.68-dB gain over two conventional designs, respectively. In terms of the digital complexity, the proposed architecture employs at least $2.7\times$ fewer multiplexers compared with the reported DEM DACs, while achieving comparable dynamic performance. Fabricated in 130-nm CMOS process, the proposed 12-bit 100-MS/s DAC occupies 0.21 mm$^{2}$. Measurement results show that $1.9\times$ integral nonlinearity reduction ratio and 15.5-dB SFDR improvement from 46.6 to 62.1 dB at near Nyquist frequency are achieved.

Journal ArticleDOI
TL;DR: This paper presents a differential fourth-order low-pass filter suitable for electrocardiography (ECG) acquisition formed by cascading two compact and power-efficient biquads operating in the subthreshold region that performs comparably to the recent state-of-the-art nanowatt-class low-pass filter.
Abstract: This paper presents a differential fourth-order low-pass filter suitable for electrocardiography (ECG) acquisition. It is formed by cascading two compact and power-efficient biquads operating in the subthreshold region. Each biquad combines two capacitors and a flipped voltage follower circuit. The filter attains a cutoff frequency adjustable to cover the entire range of ECG (150–250 Hz). The filter prototype has been fabricated in a 0.35-$\mu\text{m}$ CMOS technology. It occupies an area of $362\,\mu\text{m} \times 466\,\mu\text{m}$ and operates from a 0.6-V supply. Measurements confirm that the filter consumes 0.9-nW static power for a 101-Hz cutoff frequency and contributes an input-referred noise of 46.27 $\mu\text{V}_{\mathbf{rms}}$. For a 60-Hz input frequency, the filter achieves a dynamic range of 47 dB, at which a third-harmonic distortion of −60 dB is produced. This leads to a figure of merit of $46.5 \times 10^{-18}$ J. When the chip area is also considered, the proposed filter performs comparably to the recent state-of-the-art nanowatt-class low-pass filter.

Journal ArticleDOI
TL;DR: The main leading applications that demand advanced technologies to fit the unconventional requirements of extreme operating conditions, including silicon (Si), silicon on insulator (SOI), silicon germanium (SiGe), silicon carbide (SiC) as well as III–V semiconductors particularly the gallium nitride (GaN) semiconductor are reviewed.
Abstract: Several industrial applications require specific electronic systems installed in harsh environments to perform measurement, monitoring, and control tasks, such as in space exploration, aerospace missions, automotive industries, the down-hole oil and gas industry, and geothermal power plants. The extreme environment may involve high, low, or wide-ranging temperatures, intense radiation, or a combination of these conditions. In this paper, we review the leading applications that demand advanced technologies to fit the unconventional requirements of extreme operating conditions, discussing their main merits and limits compared to established and emerging technologies in this field, including silicon (Si), silicon on insulator (SOI), silicon germanium (SiGe), silicon carbide (SiC), as well as III–V semiconductors, particularly gallium nitride (GaN). Although advanced semiconductor devices dedicated to harsh environments have successfully pushed the limits of extreme operating conditions, especially in high-temperature applications, packaging challenges still limit the reliability of the developed technologies. Those challenges are examined in this review in terms of limitations and proposed solutions.

Journal ArticleDOI
TL;DR: The signal flow path of photocurrent throughout a retina in a scalable 180-nm CMOS technology is synthesized and the resulting image matches biologically verified results within an error margin of 6% and exhibits the following features of the retina: lateral inhibition, asynchronous adaptation, and a low-dynamic-range integration active pixel sensor to perceive a high-dynamic-range scene.
Abstract: The development of a bioinspired image sensor, which can match the functionality of the vertebrate retina, has provided new opportunities for vision systems and processing through the realization of new architectures. Research in both retinal cellular systems and nanodriven memristive technology has made a challenging arena more accessible to emulate features of the retina that are closer to biological systems. This paper synthesizes the signal flow path of photocurrent throughout a retina in a scalable 180-nm CMOS technology, which initiates at a $128\times 128$ active pixel image sensor, and converges to a $16\times 16$ array, where each node emits a spike train synonymous to the function of the retinal ganglionic output cell. This signal can be sent to the visual cortex for image interpretation as part of an artificial vision system. Layers of memristive networks are used to emulate the functions of horizontal and amacrine cells in the retina, which average and converge signals. The resulting image matches biologically verified results within an error margin of 6% and exhibits the following features of the retina: lateral inhibition, asynchronous adaptation, and a low-dynamic-range integration active pixel sensor to perceive a high-dynamic-range scene.

Journal ArticleDOI
TL;DR: Numerical results show that the proposed novel Krylov subspace-based method for fast numerical solutions to the stress PDEs can lead to about 1–2 orders of magnitude speed-up over existing finite-difference time-domain-based methods on large interconnect trees for both void nucleation and growth phases with negligible errors.
Abstract: Electromigration effects are a key failure mechanism for copper-based dual damascene interconnect wires in semiconductor technologies. However, accurately predicting the time-to-failure for a complicated interconnect tree in a VLSI interconnect layout requires detailed knowledge of the stress evolutions over time, and is subject to time-varying currents and temperature. This is a challenging problem as one needs to solve the stress-based partial differential equations (PDEs) in the time domain for confined copper damascene interconnect trees for both void nucleation and void growth phases. To mitigate this problem, we propose a novel Krylov subspace-based method for fast numerical solutions to the stress PDEs. The new approach, which we call FastEM, is based on the finite-difference method, which is used to first discretize the PDEs into linear time-invariant ordinary differential equations (ODEs). After discretization, a modified Krylov subspace-based reduction technique is applied in the frequency domain to reduce the size of the original system matrices so that they can be efficiently simulated in the time domain. FastEM can perform the simulation process for both void nucleation and void growth phases under piecewise constant linear current density inputs and time-varying stressing temperatures. Furthermore, we show that the steady-state response of stress diffusion equations can be obtained from the resulting ODE system in the frequency domain, which agrees with the recently proposed voltage-based EM analysis method for EM immortality checks. Numerical results show that the proposed method can lead to about 1–2 orders of magnitude speed-up over existing finite-difference time-domain-based methods on large interconnect trees for both void nucleation and growth phases with negligible errors. We further show that, for most of the interconnect trees tested, we only need a small number of dominant poles for sufficient accuracy.
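
The discretization step mentioned in this abstract can be illustrated on a single wire. The sketch below assumes a Korhonen-type 1-D stress diffusion equation, $\partial\sigma/\partial t = \partial/\partial x\,[\kappa(\partial\sigma/\partial x + G)]$, with constant $\kappa$ and blocked (zero-flux) ends. The equation form, the function names and all parameter values are assumptions for illustration and are not taken from the paper. Only the finite-difference step that produces an LTI system $\dot{\sigma} = A\sigma + BG$ is covered; the Krylov subspace reduction is not shown.

```python
def discretize_wire(n, length, kappa):
    """Finite-difference model of one wire split into n >= 2 cells.

    Returns (A, B) such that d(sigma)/dt = A @ sigma + B * G, where G is the
    (assumed) electromigration driving term and both wire ends block flux.
    """
    h = length / n
    c = kappa / (h * h)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        if i > 0:                 # coupling to the left neighbour
            A[i][i - 1] = c
            A[i][i] -= c
        if i < n - 1:             # coupling to the right neighbour
            A[i][i + 1] = c
            A[i][i] -= c
    B = [0.0] * n
    B[0] = kappa / h              # zero flux at the left end
    B[-1] = -kappa / h            # zero flux at the right end
    return A, B


def residual(A, B, sigma, G):
    """Evaluate A @ sigma + B * G; all zeros means sigma is a steady state."""
    return [sum(a * s for a, s in zip(row, sigma)) + b * G
            for row, b in zip(A, B)]
```

Under these assumptions a stress profile that falls linearly along the wire with slope $-G$ makes the residual vanish, which is the steady state expected for a blocked wire.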

Journal ArticleDOI
TL;DR: A low temperature coefficient (TC) and high power supply ripple rejection (PSRR) CMOS sub-bandgap voltage reference (sub-BGR) circuit using subthreshold MOS transistors and a single BJT is presented in this brief.
Abstract: A low temperature coefficient (TC) and high power supply ripple rejection (PSRR) CMOS sub-bandgap voltage reference (sub-BGR) circuit using subthreshold MOS transistors and a single BJT is presented in this brief. The proposed sub-BGR consists of a novel complementary-to-absolute-temperature (CTAT) voltage generator based on a scaled emitter-base voltage of a BJT, and an improved proportional-to-absolute-temperature (PTAT) voltage generator based on stacking of $\Delta V_{\mathbf{GS}}$ of sub-$V_{\mathbf{TH}}$ MOSFETs. As the CTAT circuit achieves a reduced absolute value of the negative TC, the PTAT circuit achieves reduced power consumption without consuming a large chip area. The proposed sub-BGR circuit is implemented in a standard 0.18-$\mu\text{m}$ CMOS process. Measured results show that the sub-BGR circuit can run with a supply voltage down to 0.9 V while the power consumption is only 85 nW. An average TC of 33.7 ppm/°C and a PSRR of better than −40 dB over the full frequency range are achieved.
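
The link between the scaled CTAT voltage and the smaller PTAT generator can be sketched with a first-order model. The symbols $\alpha$ (scaling factor applied to the emitter-base voltage, assumed to satisfy $0 < \alpha < 1$) and $N$ (number of stacked $\Delta V_{\mathbf{GS}}$ stages) are assumptions introduced here and do not appear in the abstract.

```latex
V_{\mathbf{REF}} = \alpha V_{\mathbf{EB}} + \sum_{i=1}^{N} \Delta V_{\mathbf{GS},i},
\qquad
\frac{\partial V_{\mathbf{REF}}}{\partial T}
  = \alpha \frac{\partial V_{\mathbf{EB}}}{\partial T}
  + \sum_{i=1}^{N} \frac{\partial\, \Delta V_{\mathbf{GS},i}}{\partial T}
  \approx 0 .
```

Because $\partial V_{\mathbf{EB}}/\partial T$ is negative, scaling it by $\alpha < 1$ shrinks the slope that the PTAT sum has to cancel. In this simplified view fewer stacked stages are needed, which is consistent with the reduced power and area stated in the abstract.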

Journal ArticleDOI
TL;DR: The low area probing detector (LAPD) is presented as an efficient approach to detect microprobing and it is shown that the detection of state-of-the-art commercial microprobes is possible even under extreme conditions and the margin with respect to false positives is sufficient.
Abstract: Microprobing allows intercepting data from on-chip wires as well as injecting faults into data or control lines. This makes it a commonly used attack technique against security-related semiconductors, such as smart card controllers. We present the low area probing detector (LAPD) as an efficient approach to detect microprobing. It compares delay differences between symmetric lines such as bus lines to detect timing asymmetries introduced by the capacitive load of a probe. Compared with state-of-the-art microprobing countermeasures from industry, such as shields or bus encryption, the area overhead is minimal and no delays are introduced; in contrast to probing detection schemes from academia, such as the probe attempt detector, no analog circuitry is needed. We show the Monte Carlo simulation results of mismatch variations as well as process, voltage, and temperature corners on a 65-nm technology and present a simple reliability optimization. Finally, we show that the detection of state-of-the-art commercial microprobes is possible even under extreme conditions and the margin with respect to false positives is sufficient.

Journal ArticleDOI
TL;DR: This paper introduces a new approach to cost-effective, high-throughput hardware designs for low-density parity-check (LDPC) decoders, called nonsurjective finite alphabet iterative decoders (NS-FAIDs), which exploits the robustness of message-passing LDPC decoders to inaccuracies in the calculation of exchanged messages.
Abstract: This paper introduces a new approach to cost-effective, high-throughput hardware designs for low-density parity-check (LDPC) decoders. The proposed approach, called nonsurjective finite alphabet iterative decoders (NS-FAIDs), exploits the robustness of message-passing LDPC decoders to inaccuracies in the calculation of exchanged messages, and it is shown to provide a unified framework for several designs previously proposed in the literature. NS-FAIDs are optimized by density evolution for regular and irregular LDPC codes, and are shown to provide different tradeoffs between hardware complexity and decoding performance. Two hardware architectures targeting high-throughput applications are also proposed, integrating both Min-Sum (MS) and NS-FAID decoding kernels. ASIC post-synthesis implementation results on 65-nm CMOS technology show that NS-FAIDs yield significant improvements in the throughput to area ratio, by up to 58.75% with respect to the MS decoder, with even better or only slightly degraded error correction performance.
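
The two decoding kernels named in this abstract can be contrasted with a small sketch. The check-node rule below is the standard Min-Sum update. The `FRAME` table and the point at which it is applied are hypothetical, chosen only to show what a nonsurjective message mapping looks like. They are not the optimized mappings from the paper.

```python
# Hypothetical nonsurjective map on 3-bit magnitudes: the image {0, 1, 3, 5, 7}
# has fewer values than the domain {0, ..., 7}, so messages need fewer bits.
FRAME = (0, 1, 1, 3, 3, 5, 5, 7)


def frame(message):
    """Apply the hypothetical map to a signed message with |message| <= 7."""
    sign = -1 if message < 0 else 1
    return sign * FRAME[abs(message)]


def min_sum_check_node(messages):
    """Standard Min-Sum check-node update.

    For each edge, the outgoing sign is the product of the other incoming
    signs and the outgoing magnitude is the minimum of the other magnitudes.
    """
    outputs = []
    for i in range(len(messages)):
        others = messages[:i] + messages[i + 1:]
        sign = 1
        for m in others:
            if m < 0:
                sign = -sign
        outputs.append(sign * min(abs(m) for m in others))
    return outputs


incoming = [3, -5, 2, -7]
plain = min_sum_check_node(incoming)                        # [2, -2, 3, -2]
framed = min_sum_check_node([frame(m) for m in incoming])   # [1, -1, 3, -1]
```

The framed messages differ slightly from the plain Min-Sum ones, which is the kind of inaccuracy the abstract says message-passing decoders tolerate.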

Journal ArticleDOI
TL;DR: This paper implements and calibrates sensors in configurable logic appropriate to observe delay changes caused by transient voltage fluctuations, and places them at multiple locations on the chip to evaluate temporal and spatial changes in a timing margin due to different workload characteristics.
Abstract: Due to recent technology scaling trends and increased circuit complexity, process and runtime variabilities are becoming major threats for correct circuit operation. Among these, transient voltage fluctuations appear to be the most critical issue, accounting for the biggest component of timing margin, at increased cost. As various design and workload parameters have an impact on voltage fluctuations, they need to be fully understood in order to design efficient countermeasures and margining. Field-programmable gate arrays are well suited to this analysis because they allow more control over such experiments at lower cost than application-specific integrated circuits. Furthermore, they suffer heavily from the same issues, which are typically only handled by excessive and overpessimistic timing margining built into the mapping tools. In this paper, we implement and calibrate sensors in configurable logic appropriate to observe delay changes caused by transient voltage fluctuations. We place them at multiple locations on the chip to evaluate temporal and spatial changes in a timing margin due to different workload characteristics. Moreover, we analyze the spatial and the temporal interdependence of various workloads and investigate their combined effect on a voltage drop. This analysis provides useful insights to designers for application mapping and workload scheduling.

Journal ArticleDOI
TL;DR: This brief introduces a general and efficient method for constructing large high-quality approximate multipliers with respect to the objectives formulated in terms of the power-delay product and a provable error bound.
Abstract: Approximate computing exploits the fact that many applications are inherently error resilient. In order to reduce power consumption, approximate circuits such as multipliers have been employed in these applications. However, most current approximate multipliers are based on ad hoc circuit structures and, for automated circuit approximation methods, large efficient designs are difficult to find due to the increased search space. Moreover, existing design methods do not typically provide sufficient formal guarantees in terms of error if large approximate multipliers are constructed. To address these challenges, this brief introduces a general and efficient method for constructing large high-quality approximate multipliers with respect to the objectives formulated in terms of the power-delay product and a provable error bound. This is demonstrated by means of a comparative evaluation of approximate 16-bit multipliers constructed by the proposed method and other methods in the literature.

Journal ArticleDOI
TL;DR: This paper presents a highly reliable and invasive-attack-resistant switched-capacitor (SC) strong physical unclonable function (PUF), which can offer an extremely large number of challenge–response pairs.
Abstract: This paper presents a highly reliable and invasive-attack-resistant switched-capacitor (SC) strong physical unclonable function (PUF), which can offer an extremely large number of challenge–response pairs. Two symmetrical capacitor arrays that are controlled by challenges are used to realize the strong ability of SC PUF. The mismatch created by the capacitor arrays in real fabrication is sampled by an SC circuit and further amplified by a latch-styled sense amplifier. By covering the entire chip with the metallic transmission lines connected to the sampling capacitor, the PUF can resist invasive attacks. To provide highly reliable responses, a built-in self-test (BIST) strategy is also adopted. A test capacitor is embedded in the PUF circuits to automatically test the capacitance deviation of the SC circuit. The output of the PUF is selected for use only when the capacitance deviation is large. The proposed strong SC PUF circuit is fabricated and verified using HJ standard 0.18-$\mu\text{m}$ CMOS process. Measured results show that after using the BIST strategy, the bit error rate is less than $10^{-9}$ when the ratio of the selected responses is 19.6% for the PUF instances in our test.
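
The selection rule behind the BIST strategy can be sketched in a few lines. This is a behavioural illustration only: the function name, the normalized deviation values and the threshold are hypothetical, and the on-chip test-capacitor mechanism is reduced to a simple magnitude comparison.

```python
def select_responses(deviations, threshold):
    """Keep only responses whose mismatch magnitude clears the threshold.

    `deviations` holds one signed (hypothetical, normalized) capacitance
    deviation per challenge; the response bit is taken as the sign of the
    deviation. Returns (challenge_index, bit) pairs for the kept responses.
    """
    selected = []
    for index, deviation in enumerate(deviations):
        if abs(deviation) > threshold:     # small deviations are too noise-prone
            selected.append((index, 1 if deviation > 0 else 0))
    return selected


# Made-up deviations: only the first and third clear a threshold of 0.5.
kept = select_responses([0.8, -0.1, -1.2, 0.05], threshold=0.5)
```

Raising the threshold discards more responses but leaves those with a larger margin against noise. That tradeoff matches the abstract's figures, where a very low bit error rate is reported at a 19.6% selection ratio.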