
Showing papers on "Sense amplifier published in 2020"


Journal ArticleDOI
TL;DR: This article proposes a serial-input non-weighted product (SINWP) structure; a down-scaling weighted current translator and positive–negative current-subtractor scheme; a current-aware bitline clamper scheme; and a triple-margin small-offset current-mode sense amplifier (TMCSA).
Abstract: Computing-in-memory (CIM) based on embedded nonvolatile memory is a promising candidate for energy-efficient multiply-and-accumulate (MAC) operations in artificial intelligence (AI) edge devices. However, circuit design for NVM-based CIM (nvCIM) imposes a number of challenges, including an area-latency-energy tradeoff for multibit MAC operations, pattern-dependent degradation in signal margin, and small read margin. To overcome these challenges, this article proposes the following: 1) a serial-input non-weighted product (SINWP) structure; 2) a down-scaling weighted current translator (DSWCT) and positive–negative current-subtractor (PN-ISUB); 3) a current-aware bitline clamper (CABLC) scheme; and 4) a triple-margin small-offset current-mode sense amplifier (TMCSA). A 55-nm 1-Mb ReRAM-CIM macro was fabricated to demonstrate the MAC operation of 2-b-input, 3-b-weight with 4-b-out. This nvCIM macro achieved T_MAC = 14.6 ns at 4-b-out with peak energy efficiency of 53.17 TOPS/W.

76 citations
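The serial-input scheme above can be sketched in software: input bits are applied one at a time, and the per-bit partial sums are recombined with powers of two. This is a toy model of bit-serial MAC under assumed conventions (LSB-first input, saturation to the output width); it is not the paper's SINWP circuit, and the function name is illustrative.

```python
def bit_serial_mac(inputs, weights, in_bits=2, out_bits=4):
    """Toy bit-serial multiply-and-accumulate: multibit inputs are
    applied one bit at a time (LSB first), and each per-bit partial
    sum is weighted by 2^b before accumulation."""
    acc = 0
    for b in range(in_bits):
        # dot product of the b-th input bit-plane with the weights
        partial = sum(((x >> b) & 1) * w for x, w in zip(inputs, weights))
        acc += partial << b            # recombine with weight 2^b
    max_out = (1 << out_bits) - 1
    return min(acc, max_out)           # illustrative saturation to out_bits
```

For small operands the result matches a direct dot product; larger sums saturate at the assumed 4-b output range.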


Proceedings ArticleDOI
22 Mar 2020
TL;DR: A novel 8T SRAM-based bitcell is proposed for current-based compute-in-memory dot-product operations; Monte Carlo simulations and test-chip measurements verify both its linearity and its robustness to process variation.
Abstract: A novel 8T SRAM-based bitcell is proposed for current-based compute-in-memory dot-product operations. The proposed bitcell with two extra NMOS transistors (vs. the standard 6T SRAM) decouples the SRAM read and write operations. A 128×128 8T SRAM bitcell array is built for processing a vector-matrix multiplication (or parallel dot-products) with 64x binary (0 or 1) inputs, 64×128 binary (-1 or +1) weights, and 128x 1-5 bit outputs. Each column (i.e., neuron) of the proposed SRAM compute-in-memory macro consists of 64x bitcells for the dot-product, 32x bitcells for the ADC, and 32x bitcells for calibration. The column-based neuron minimizes the ADC overhead by reusing a sense amplifier for SRAM read. The column-wise ADC converts the analog dot-product results to N-bit output codes (N=1 to 5) by sweeping reference levels using replica bitcells for 2^N-1 cycles per conversion. Monte Carlo simulations and test-chip measurements verify both the linearity and the impact of process variation. The largest variation (σ=2.48%) results in an MNIST classification accuracy of 96.2% (i.e., 0.4% lower than a baseline with no variation). A test chip is fabricated in a 65nm process, and the 16K SRAM bitcell array occupies 0.055 mm². The energy efficiency of the 1-bit operation is 490 to 15.8 TOPS/W at 1-5 bit ADC mode using a 0.45/0.8V core supply at 200MHz.

66 citations
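The column operation above combines a binary dot product with a reference-sweeping ADC that takes 2^N-1 cycles. A minimal sketch, assuming a symmetric reference ramp over a chosen full scale (the function names and scaling are illustrative, not the paper's calibrated replica-bitcell scheme):

```python
def column_dot_product(inputs, weights):
    """Binary (0/1) inputs against binary (-1/+1) weights: the analog
    quantity one column accumulates."""
    return sum(x * w for x, w in zip(inputs, weights))

def ramp_adc(analog, n_bits, full_scale):
    """Sweep 2^N - 1 reference levels, one per cycle, and count how
    many references the analog value meets or exceeds."""
    levels = 2 ** n_bits - 1
    code = 0
    for k in range(1, levels + 1):
        # evenly spaced references across [-full_scale, +full_scale]
        ref = -full_scale + 2 * full_scale * k / (levels + 1)
        if analog >= ref:
            code += 1
    return code
```

Resolution is traded directly for conversion cycles: N=5 costs 31 sweeps per conversion, N=1 only one, which is the energy/precision knob the abstract reports.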


Journal ArticleDOI
TL;DR: A self-timed voltage-mode sense scheme named ST-VSS is proposed that enables optimal timing depending on the bit-cell discharging ability; it is applied to a 32-bit/word MRAM in a 28-nm CMOS process.
Abstract: In Spin-Transfer Torque Magnetic Random Access Memory (STT-MRAM), the most commonly used timing scheme for the conventional Voltage-mode Sense Amplifier (VSA) is globally activated timing. This method cannot achieve optimal yield because each bit-cell has its own sensing latency. This paper proposes a self-timed voltage-mode sense scheme named ST-VSS which enables optimal timing depending on the bit-cell discharging ability. Two circuit structures are proposed. The single-SA structure uses a multiplexer at the input of the SA; its successive sensing operations are implemented with input offset flipping. The dual-SA structure is reconfigured by a built-in self-test (BIST) method into opposite offset states so that the two SAs monitor each other's sensing results. The sensing operation can be terminated immediately after a successful read. The proposed ST-VSS is applied to a 32-bit/word MRAM using a 28-nm CMOS process. Simulation results show that the successful sensing rate across a wide range of voltages is improved compared with the conventional scheme. The single-SA structure obtains 32%~42% yield improvement, costs 44.1%/26.9%/19.3% energy, and brings 8.3%/5.8%/2.9% layout area penalty at 128/256/512 column depths, respectively. The dual-SA structure achieves 54%~65% yield improvement, costs 66.2%/38.6%/27.5% energy, and brings 26.4%/13.8%/7.1% area penalty at 128/256/512 column depths, respectively.

31 citations


Proceedings ArticleDOI
30 May 2020
TL;DR: This paper proposes Capacity-Latency-Reconfigurable DRAM (CLR-DRAM), a new DRAM architecture that enables a dynamic capacity-latency trade-off at low cost and can improve system performance and reduce DRAM energy consumption with four-core multiprogrammed workloads.
Abstract: DRAM is the prevalent main memory technology, but its long access latency can limit the performance of many workloads. Although prior works provide DRAM designs that reduce DRAM access latency, their reduced storage capacities hinder the performance of workloads that need large memory capacity. Because the capacity-latency trade-off is fixed at design time, previous works cannot achieve maximum performance under very different and dynamic workload demands. This paper proposes Capacity-Latency-Reconfigurable DRAM (CLR-DRAM), a new DRAM architecture that enables dynamic capacity-latency trade-off at low cost. CLR-DRAM allows dynamic reconfiguration of any DRAM row to switch between two operating modes: 1) max-capacity mode, where every DRAM cell operates individually to achieve approximately the same storage density as a density-optimized commodity DRAM chip and 2) high-performance mode, where two adjacent DRAM cells in a DRAM row and their sense amplifiers are coupled to operate as a single low-latency logical cell driven by a single logical sense amplifier. We implement CLR-DRAM by adding isolation transistors in each DRAM subarray. Our evaluations show that CLR-DRAM can improve system performance by 18.6% and reduce DRAM energy consumption by 29.7% on average with four-core multiprogrammed workloads. We believe that CLR-DRAM opens new research directions for a system to adapt to the diverse and dynamically changing memory capacity and access latency demands of workloads.

28 citations


Journal ArticleDOI
TL;DR: A novel processing-in-memory architecture (HPLG-PIM) for highly flexible, efficient, and secure logic computation that exploits a hardware-friendly approach to implement the complex logic functions between multiple operands combining a reconfigurable sense amplifier and an HPLG unit to reduce the latency and the power-hungry data movement further.
Abstract: In this article, we initially present a hybrid spin-CMOS polymorphic logic gate (HPLG) using a novel 5-terminal magnetic domain wall motion device. The proposed HPLG is able to perform a full set of 1- and 2-input Boolean logic functions (i.e., NOT, AND/NAND, OR/NOR, and XOR/XNOR) by configuring the applied keys. We further show that our proposed HPLG could become a promising hardware security primitive to address IC counterfeiting or reverse engineering by logic locking and polymorphic transformation. The experimental results on a set of ISCAS-89, ITC-99, and Ecole Polytechnique Federale de Lausanne (EPFL) benchmarks show that HPLG obtains up to 51.4% and 10% average performance improvements on the power-delay product (PDP) compared with recent non-volatile logic and CMOS-based designs, respectively. We then leverage this gate to realize a novel processing-in-memory architecture (HPLG-PIM) for highly flexible, efficient, and secure logic computation. Instead of integrating complex logic units in cost-sensitive memory, this architecture exploits a hardware-friendly approach to implement the complex logic functions between multiple operands combining a reconfigurable sense amplifier and an HPLG unit to further reduce the latency and the power-hungry data movement. The device-to-architecture co-simulation results for widely used graph processing tasks running on three social network data sets indicate roughly 3.6× higher energy efficiency and 5.3× speedup over recent resistive RAM (ReRAM) accelerators. In addition, HPLG-PIM achieves ~4× higher energy efficiency and 5.1× speedup over recent processing-in-DRAM acceleration methods.

27 citations


Posted Content
TL;DR: In this article, the authors propose a new DRAM architecture that enables dynamic capacity-latency trade-off at low cost, which is called Capacity-Latency-Reconfigurable DRAM (CLR-DRAM).
Abstract: DRAM is the prevalent main memory technology, but its long access latency can limit the performance of many workloads. Although prior works provide DRAM designs that reduce DRAM access latency, their reduced storage capacities hinder the performance of workloads that need large memory capacity. Because the capacity-latency trade-off is fixed at design time, previous works cannot achieve maximum performance under very different and dynamic workload demands. This paper proposes Capacity-Latency-Reconfigurable DRAM (CLR-DRAM), a new DRAM architecture that enables dynamic capacity-latency trade-off at low cost. CLR-DRAM allows dynamic reconfiguration of any DRAM row to switch between two operating modes: 1) max-capacity mode, where every DRAM cell operates individually to achieve approximately the same storage density as a density-optimized commodity DRAM chip and 2) high-performance mode, where two adjacent DRAM cells in a DRAM row and their sense amplifiers are coupled to operate as a single low-latency logical cell driven by a single logical sense amplifier. We implement CLR-DRAM by adding isolation transistors in each DRAM subarray. Our evaluations show that CLR-DRAM can improve system performance by 18.6% and reduce DRAM energy consumption by 29.7% on average with four-core multiprogrammed workloads. We believe that CLR-DRAM opens new research directions for a system to adapt to the diverse and dynamically changing memory capacity and access latency demands of workloads.

26 citations


Journal ArticleDOI
TL;DR: In this article, a bio-inspired variation-aware nonvolatile quaternary latch (Qlatch) is proposed to reduce standby power without extra components or data loss.
Abstract: Quaternary memory and logic circuits have been studied by researchers, as they can provide denser integrated circuits and subsequently lower area and power consumption via succinct interconnects. Using multithreshold gate-all-around carbon nanotube field-effect transistors and the nonvolatile feature of magnetic tunnel junctions (MTJs), this letter proposes a bio-inspired variation-aware nonvolatile quaternary latch (Qlatch). Nonvolatility allows the system to be completely powered off during the idle state, reducing standby power significantly without extra components or data loss. Moreover, thanks to the bio-inspired structure of the Qlatch, and the fact that no sense amplifier is used in this circuit to read the MTJs, the Qlatch is more robust to process variations. Simulation results show that the nonvolatile Qlatch consumes 7% less dynamic power and 12% less static power than the conventional Qlatch, has a 61% lower delay, and has a 68% lower power-delay product.

15 citations


Journal ArticleDOI
TL;DR: Signal integrity (SI) is used to design and analyze 3-D X-Point memory, including a phase-change memory (PCM) cell, ovonic threshold switch (OTS) selector, interconnection lines, and peripheral circuits, including decoder, sense amplifier, and analog-to-digital converter.
Abstract: In this article, we, for the first time, use signal integrity (SI) analysis to design and analyze 3-D X-Point memory, including a phase-change memory (PCM) cell, ovonic threshold switch (OTS) selector, interconnection lines, and peripheral circuits. With the narrow spacing and long interconnection lines that come with 20-nm process technology, crosstalk and IR drop can degrade the voltage margin of the memory cell and affect memory operation. For SI analysis considering crosstalk and IR drop, the unit size of the memory array tile was considered in designing the interconnection lines. Crosstalk and IR drop are analyzed using full 3-D electromagnetic and circuit simulations. To cover practical conditions, the PCM cell and OTS selector are modeled as behavioral Verilog-A modules. The word lines (WLs) and bit lines (BLs) of the 3-D X-Point memory are modeled as resistances and capacitances extracted with the ANSYS Q3D extractor. The core peripheral circuits, such as the decoder, sense amplifier, and analog-to-digital converter, are included in the circuit simulation. To verify the proposed design and analysis, a transient simulation was conducted considering the crosstalk and IR drop of 3-D X-Point memory. A tradeoff between crosstalk and IR drop in the interconnection designs was verified. Additionally, to suppress crosstalk and reduce IR drop, a new interconnection design considering this tradeoff is proposed; it shows a 30% improvement in voltage margin with respect to IR drop and under 10% enhancement of crosstalk noise. The SI analysis and design methodologies are expected to be widely applicable to other new memory developments.

15 citations


Journal ArticleDOI
TL;DR: A technique is proposed to implement a majority gate in a memory array in an energy-efficient manner as a memory READ operation and the proposed logic family disintegrates arithmetic operations to majority and NOT operations which are implemented as memory READ and WRITE operations.
Abstract: The flow of data between processing and memory units in contemporary computing systems is their main performance and energy-efficiency bottleneck, often referred to as the ‘von Neumann bottleneck’ or ‘memory wall’. Emerging resistance switching memories (memristors) show promising signs to overcome the ‘memory wall’ by enabling computation in the memory array. Majority logic is a type of Boolean logic, and in many nanotechnologies, it has been found to be an efficient logic primitive. In this paper, a technique is proposed to implement a majority gate in a memory array. The majority gate is realised in an energy-efficient manner as a memory READ operation. The proposed logic family disintegrates arithmetic operations to majority and NOT operations which are implemented as memory READ and WRITE operations. A 1-bit full adder can be implemented in 6 steps (memory cycles) in a 1T–1R array, which is faster than IMPLY, NAND, NOR and other similar logic primitives.

15 citations
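The decomposition the abstract describes can be checked in plain Python. Below, a 1-bit full adder is built from majority and NOT only, using a standard majority-logic identity (this verifies the Boolean decomposition; the 6-cycle mapping onto READ/WRITE operations is the paper's, not modeled here):

```python
def maj(a, b, c):
    """3-input majority: 1 if at least two inputs are 1."""
    return int(a + b + c >= 2)

def full_adder(a, b, cin):
    """1-bit full adder from MAJ and NOT gates only.
    carry = MAJ(a, b, cin); sum = MAJ(MAJ(a, b, NOT cin), cin, NOT carry)."""
    cout = maj(a, b, cin)
    s = maj(maj(a, b, 1 - cin), cin, 1 - cout)
    return s, cout
```

Exhaustive enumeration over the eight input combinations confirms the identity.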


Proceedings ArticleDOI
16 Jun 2020
TL;DR: A read circuitry that tackles all STT-MRAM read challenges by using a negative temperature coefficient (NTC) reference based on an MTJ in series with an “NTC” resistor circuit emulator is presented.
Abstract: In this paper we present a read circuitry that tackles all STT-MRAM read challenges. First, a negative temperature coefficient (NTC) reference based on an MTJ in series with an “NTC” resistor circuit emulator is described. Then, an offset-cancelled voltage sense amplifier using low read current and reference averaging is discussed. Measurement results show a maximum of 2% reference impedance error (vs. ideal) and 1.7% read error rate degradation (vs. technology intrinsic defectivity rate). A 14.7 Mb/mm² memory density is also achieved, which is the best published STT-MRAM density for embedded applications.

14 citations


Journal ArticleDOI
TL;DR: The proposed compact physical unclonable function (PUF) based on cross-coupled comparator has the lowest native response instability and the unpredictability of the fabricated PUF chips is validated by autocorrelation function and NIST randomness tests.
Abstract: In this article, a compact physical unclonable function (PUF) based on a cross-coupled comparator is presented. Featuring a positive feedback response generation mechanism, the mismatch in analog signals between the cross-coupled transistor pair is quickly amplified to prevent its polarity from flipping by the temporal noise. The rapid enlargement of noise margin by the sense amplifier also contributes to stabilizing the response against supply voltage variations. To improve its temperature stability, the counteracting effect of complementary-to-absolute-temperature (CTAT) and proportional-to-absolute-temperature (PTAT) drives are considered in sizing the bit cell transistors. The proposed design is fabricated in a standard 65-nm CMOS process. The bit cell occupies an area of only 4.38 μm² (i.e., 1036 F²), and the overall PUF chip consumes 2.98 pJ/bit at a throughput of 8 Mb/s, of which only 1.61 pJ/bit is due to the PUF’s core. With the uniqueness measured to be 49.53%, the unpredictability of the fabricated PUF chips is validated by autocorrelation function and NIST randomness tests. Compared with the state-of-the-art implementations, the proposed PUF has the lowest native response instability of 1.46% with 500 repeated PUF readouts at 27 °C and 1.2 V. By varying the operating temperature from −50 °C to 150 °C in a step size of 10 °C and the supply voltage from 1.0 to 1.4 V in a step size of 0.1 V simultaneously, the average reliability of the proposed PUF obtained from the 2-D plot of all operating conditions is found to be 96.87% without correction and 99.31% with spatial majority voting (SMV).

Journal ArticleDOI
Sung-Tae Lee1, Dongseok Kwon1, Hyeongsu Kim1, Honam Yoo1, Jong-Ho Lee1 
TL;DR: The low-variance conductance distribution of the NAND cells achieves a higher inference accuracy compared to that of resistive random access memory (RRAM) devices by 2~7 % and 0.04~0.23 % for CIFAR 10 and MNIST datasets, respectively.
Abstract: We propose a novel synaptic architecture based on a NAND flash memory for highly robust and high-density quantized neural networks (QNN) with 4-bit weights and binary neuron activation, for the first time. The proposed synaptic architecture is fully compatible with the conventional NAND flash memory architecture by adopting a differential sensing scheme and a binary neuron activation of (1, 0). A binary neuron enables using a 1-bit sense amplifier, which significantly reduces the burden of peripheral circuits and power consumption and enables bitwise communication between the layers of neural networks. Operating NAND cells in the saturation region eliminates the effect of metal wire resistance and serial resistance of the NAND cells. With a read-verify-write (RVW) scheme, a low-variance conductance distribution is demonstrated for 8 levels. Vector-matrix multiplication (VMM) of a 4-bit weight and binary activation can be accomplished by only one input pulse, eliminating the need for a multiplier and additional logic operations. In addition, quantization training can minimize the degradation of the inference accuracy compared to post-training quantization. Finally, the low-variance conductance distribution of the NAND cells achieves a higher inference accuracy compared to that of resistive random access memory (RRAM) devices by 2~7% and 0.04~0.23% for CIFAR-10 and MNIST datasets, respectively.
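The differential scheme with a binary output neuron can be sketched as follows: each signed weight contributes to either a positive or a negative accumulation path, and a 1-bit sense amplifier reduces to a threshold on the difference. This is a behavioral toy model under assumed conventions (signed integer weights in place of conductance levels); the function name is illustrative, not from the paper.

```python
def binary_neuron_vmm(activations, weights, threshold=0.0):
    """Toy differential VMM with a binary output neuron: binary (1/0)
    activations gate the cells, positive and negative weight parts are
    accumulated separately, and a 1-bit sense amplifier thresholds the
    difference of the two paths."""
    pos = sum(a * max(w, 0) for a, w in zip(activations, weights))
    neg = sum(a * max(-w, 0) for a, w in zip(activations, weights))
    return 1 if (pos - neg) > threshold else 0
```

Because the activations are binary and the comparison is 1-bit, one "input pulse" per vector suffices in this model, mirroring the single-pulse VMM claim.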

Journal ArticleDOI
TL;DR: In this paper, the authors proposed a timing-speculation (TS) cache to boost the cache frequency and improve energy efficiency under low supply voltages, where the voltage differences of bitlines (BLs) are continuously evaluated twice by a sense amplifier (SA).
Abstract: To mitigate the ever-worsening “power wall” problem, more and more applications need to expand their working voltage to the wide-voltage range including the near-threshold region. However, the read delay distribution of the static random access memory (SRAM) cells under the near-threshold voltage shows a more serious long-tail characteristic than that under the nominal voltage due to the process fluctuation. Such degradation of SRAM delay makes the SRAM-based cache a performance bottleneck of systems as well. To avoid unreliable data reading, circuit-level studies use larger/more transistors in a bitcell by sacrificing chip area and the static power of cache arrays. Architectural studies propose the auxiliary error correction or block disabling/remapping methods in fault-tolerant caches, which worsen both the hit latency and energy efficiency due to the complex accessing logic. This article proposes a timing-speculation (TS) cache to boost the cache frequency and improve energy efficiency under low supply voltages. In the TS cache, the voltage differences of bitlines (BLs) are continuously evaluated twice by a sense amplifier (SA), and the access timing error can be detected much earlier than that in prior methods. According to the measurement results from the fabricated chips, the TS L1 cache aggressively increases its frequency to 1.62× and 1.92× compared with the conventional scheme at 0.5- and 0.6-V supply voltages, respectively.

Journal ArticleDOI
TL;DR: A simple and feasible low power design scheme which can be used as a powerful tool for energy reduction in RRAM circuits is proposed, exclusively based on current control during write and read operations and ensures that write operations are completed without wasted energy.
Abstract: Energy efficiency remains one of the main factors for improving the key performance markers of RRAMs to support IoT edge devices. This paper proposes a simple and feasible low-power design scheme which can be used as a powerful tool for energy reduction in RRAM circuits. The design scheme is based exclusively on current control during write and read operations and ensures that write operations are completed without wasted energy. Self-adaptive write-termination circuits are proposed to control the RRAM current during FORMING, RESET and SET operations. The termination circuits sense the programming current and stop the write pulse as soon as a preferred programming current is reached. Simulation results demonstrate that an appropriate choice of the programming currents can yield a 4.1X improvement in FORMING, 9.1X improvement in SET, and 1.12X improvement in RESET energy. The possibility of tight control over the RESET resistance is also demonstrated. READ energy optimization is also covered by leveraging a differential sense amplifier offering a programmable current reference. Finally, an optimal trade-off between energy consumption during SET/RESET operations and an acceptable read margin is established according to the final application requirements.
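The self-adaptive termination idea can be sketched behaviorally: monitor the cell current during the pulse and cut the pulse at the first sample that reaches the target, saving the remaining tail energy. A toy model with assumed units (one current sample per nanosecond); all names are illustrative.

```python
def terminated_write(pulse_width_ns, i_target_ua, current_trace_ua):
    """Toy model of a self-adaptive write-termination circuit: the write
    pulse is cut as soon as the sensed programming current reaches the
    target, instead of always applying the full-width pulse.
    Returns the effective pulse width in ns."""
    for t, i in enumerate(current_trace_ua[:pulse_width_ns]):
        if i >= i_target_ua:
            return t + 1          # terminate early: energy tail saved
    return pulse_width_ns         # target never reached: full pulse
```

The energy saving is simply proportional to the truncated tail: a cell that switches at 4 ns under a 10 ns nominal pulse wastes none of the remaining 6 ns.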

Proceedings ArticleDOI
20 Jul 2020
TL;DR: An in-memory neural network accelerator architecture called MOSAIC is proposed which uses minimal form of peripheral circuits; 1-bit word line driver to replace DAC and1-bit sense amplifier to replace ADC to achieve an order of magnitude higher energy and area efficiency.
Abstract: We propose an in-memory neural network accelerator architecture called MOSAIC which uses a minimal form of peripheral circuits: a 1-bit word line driver to replace the DAC and a 1-bit sense amplifier to replace the ADC. To map multi-bit neural networks on the MOSAIC architecture, which has 1-bit-precision peripheral circuits, we also propose a bit-splitting method that approximates the original network by separating each bit path of the multi-bit network so that each bit path can propagate independently throughout the network. Thanks to the minimal form of peripheral circuits, MOSAIC can achieve an order of magnitude higher energy and area efficiency than previous in-memory neural network accelerators.
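The basic bit-splitting decomposition can be shown for a single layer: an unsigned multi-bit weight matrix is split into binary planes, each plane is evaluated as a 1-bit VMM (what 1-bit drivers and sense amplifiers can support), and the partial results are recombined with powers of two. For one layer this is exact; MOSAIC's approximation concerns propagating each bit path independently through the whole network, which is not modeled here. Names are illustrative.

```python
def bit_split_vmm(x, W, w_bits=4):
    """Split an unsigned w_bits-bit weight matrix W (rows x cols, list of
    lists) into binary planes, run a 1-bit VMM per plane with binary
    input vector x, and recombine the partial sums digitally with 2^b."""
    cols = len(W[0])
    acc = [0] * cols
    for b in range(w_bits):
        for j in range(cols):
            # 1-bit VMM over the b-th binary weight plane, column j
            partial = sum(xi * ((row[j] >> b) & 1) for xi, row in zip(x, W))
            acc[j] += partial << b
    return acc
```

For a single layer the recombined result equals the full-precision product, which is why the splitting costs cycles rather than accuracy at this level.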

Journal ArticleDOI
TL;DR: An offset-canceled DRAM sense amplifier with coupling capacitors to store and cancel the offset arising from random variations of the threshold voltages of the amplifying transistors, thereby increasing the sensing margin of the overall DRAM design.
Abstract: This article reports an offset-canceled DRAM sense amplifier with coupling capacitors to store and cancel the offset arising from random variations of the threshold voltages of the amplifying transistors. Analytical calculations of the average and standard deviation of the decision threshold voltages, defined as the voltage in the cell capacitor that bifurcates into binary levels when activated, are performed on various DRAM sensing schemes and their comparison results are presented. Based on the analysis, the proposed sense amplifier scheme using coupling capacitors is shown to offer the least amount of variation in the decision threshold, thereby increasing the sensing margin of the overall DRAM design. The coupling capacitors not only compensate for the random offset of the sense amplifiers but also mitigate the effect of the mismatch of the bitline capacitances in the open bitline scheme. Measurement on the experimental chip fabricated in a 65-nm CMOS process validates the analysis and confirms the superior performance of the proposed DRAM sensing scheme.

Journal ArticleDOI
TL;DR: An embedded level-shifting (ELS) dual-rail SRAM is proposed to enhance the availability of dual-Rail SRAMs and achieves low-power operation with 71.4% power consumption compared to single-railSRAM with 72% performance overhead in circuit-level simulation, while the previous hybrid dual- rail SRAM shows 67.8% energy consumption with 270%performance overhead.
Abstract: An embedded level-shifting (ELS) dual-rail SRAM is proposed to enhance the availability of dual-rail SRAMs. Although dual-rail SRAM is a powerful solution for satisfying the increasing demand for low-power applications, the enormous performance degradation at low supply voltages cannot meet the high-performance cache requirement in recent computing systems. The requirement of many level shifters is another drawback of the dual-rail SRAM because it degrades the energy-savings. The proposed ELS dual-rail SRAM achieves energy-savings by using a low supply voltage to precharge bitlines while minimizing the performance overhead by appropriately assigning a high-supply voltage to critical circuit blocks with effective level-shifting circuits. The sense amplifier embeds a level-shifting operation, thereby operating with a high supply voltage for a fast sensing operation. The proposed dynamic output buffer resolves the potential static current problem and improves the read delay. The number of level shifters is reduced using a proposed write driver, which conducts level-shifting and write-driving simultaneously. The proposed ELS dual-rail SRAM achieves low-power operation with 71.4% power consumption compared to single-rail SRAM with 72% performance overhead in circuit-level simulation, while the previous hybrid dual-rail SRAM shows 67.8% energy consumption with 270% performance overhead. In architecture-level simulation using Gem5 simulator with SPEC2006 benchmarks, the system with the ELS dual-rail SRAM caches shows, on average, 29% performance improvement compared to that of the system with the hybrid dual-rail SRAM caches.

Journal ArticleDOI
TL;DR: An offset-canceling zero-sensing-dead-zone sense amplifier (OCZS-SA) combined with the OCDS-SC is proposed to significantly improve the read yield of resistive nonvolatile memories.
Abstract: With technology scaling, achieving a target read yield of resistive nonvolatile memories becomes more difficult due to increased process variation and decreased supply voltage. Recently, an offset-canceling dual-stage sensing circuit (OCDS-SC) has been proposed to improve the read yield by canceling the offset voltage and utilizing a double-sensing-margin structure. In this paper, an offset-canceling zero-sensing-dead-zone sense amplifier (OCZS-SA) combined with the OCDS-SC is proposed to significantly improve the read yield. The OCZS-SA has two major advantages, namely, offset voltage cancellation and a zero sensing dead zone. The Monte Carlo HSPICE simulation results using a 65-nm predictive technology model show that the OCZS-SA achieves 2.1 times smaller offset voltage with a zero sensing dead zone than the conventional latch-type SAs at the cost of an increased area overhead of 1.0% for a subarray size of 128 × 16.

Proceedings ArticleDOI
06 Jul 2020
TL;DR: A method to compute majority while reading from a transistor-accessed RRAM array, which could achieve a latency reduction of 70% and 50% when compared to IMPLY and NAND/NOR logic-based adders, respectively.
Abstract: Efforts to combat the ‘von Neumann bottleneck’ have been strengthened by Resistive RAMs (RRAMs), which enable computation in the memory array. Majority logic can accelerate computation when compared to NAND/NOR/IMPLY logic due to its expressive power. In this work, we propose a method to compute majority while reading from a transistor-accessed RRAM array. The proposed gate was verified by simulations using a physics-based model (for the RRAM) and an industry-standard model (for the CMOS sense amplifier) and found to tolerate reasonable variations in the RRAMs’ resistive states. Together with a NOT gate, which is also implemented in-memory, the proposed gate forms a functionally complete Boolean logic, capable of implementing any digital logic. Computing is simplified to a sequence of READ and WRITE operations and does not require any major modifications to the peripheral circuitry of the array. The parallel-friendly nature of the proposed gate is exploited to implement an eight-bit parallel-prefix adder in the memory array. The proposed in-memory adder could achieve a latency reduction of 70% and 50% when compared to IMPLY and NAND/NOR logic-based adders, respectively.
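Reading majority from a resistive array amounts to summing the currents of the cells selected together and comparing against a reference placed between the "one LRS" and "two LRS" levels. A behavioral sketch with assumed, illustrative values (10 kΩ LRS, 1 MΩ HRS, 0.2 V read voltage); this is not the paper's calibrated circuit.

```python
def read_majority(r_states, r_low=10e3, v_read=0.2, i_ref=None):
    """Toy model of an in-array majority READ: three cells share a
    sense line, their read currents sum, and a sense amplifier compares
    the total against a reference between the 1-LRS and 2-LRS levels.
    r_states: three cell resistances in ohms (LRS = logic 1)."""
    i_total = sum(v_read / r for r in r_states)
    if i_ref is None:
        # reference midway between "one LRS" and "two LRS" currents
        i_ref = 1.5 * v_read / r_low
    return 1 if i_total > i_ref else 0
```

With an HRS two orders of magnitude above the LRS, the HRS contributions barely perturb the sum, which is what gives the gate its variation tolerance.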

Journal ArticleDOI
TL;DR: A low-power 1/4-rate four-level pulse amplitude modulation (PAM4) receiver with an adaptive variable-gain rectifier (AVGR)-based decoder in 28-nm CMOS technology achieves a better power efficiency by employing a 1/ 4-rate topology and merging a variable- gain function into the decoder.
Abstract: This article presents a low-power 1/4-rate four-level pulse amplitude modulation (PAM4) receiver with an adaptive variable-gain rectifier (AVGR)-based decoder in 28-nm CMOS technology. The PAM4 input signal is preconditioned by a continuous-time linear equalizer (CTLE) and then sampled into four branches of decoders by 1/4-rate clocks. The proposed AVGR-based PAM4-to-nonreturn-to-zero (NRZ) decoder performs gain adaptation and amplitude rectification simultaneously for decoding the least significant bit (LSB). The linear sense amplifier in the AVGR is modified from a latch to achieve a high gain and low power. Compared with the full-rate receiver adopting a decoder consisting of three comparators, this design achieves a better power efficiency by employing a 1/4-rate topology and merging a variable-gain function into the decoder. Experimental results demonstrate that the receiver chip can receive and decode a 24-Gb/s 190-mVpp PAM4 signal at a BER of 10^-11 and a bit efficiency of 1.38 pJ/bit.
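The rectifier-based decoding can be illustrated with the standard Gray-coded PAM4 mapping: the MSB is the sign of the sample, and the LSB is recovered by rectifying the amplitude and comparing it with a mid threshold, which is the role the AVGR performs in the analog domain. Levels are normalized to -3/-1/+1/+3 as an assumption; the circuit's actual thresholds are adapted, not fixed as here.

```python
def pam4_decode(sample):
    """Decode one Gray-coded PAM4 sample into (MSB, LSB).
    MSB: sign comparator at the middle level (0).
    LSB: rectify-and-threshold -- inner levels (+/-1) carry LSB = 1."""
    msb = 1 if sample > 0 else 0
    lsb = 1 if abs(sample) < 2 else 0
    return msb, lsb
```

Rectification is what lets a single comparator recover the LSB for both polarities, instead of the three comparators a full-rate thermometer decoder needs.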

Proceedings ArticleDOI
01 Jan 2020
TL;DR: A sense amplifier for split supply SRAMs to enable wide range of Dynamic Voltage and Frequency Scaling (DVFS) and the proposed solution has more than 25% lower offset and almost the same SA reaction time compared to the voltage latch sense amplifier.
Abstract: Embedded memories are an integral part of the design of today's processors and Systems on Chip (SoC). The sense amplifier (SA) plays an important role in determining the performance and yield of memories. In this work, we propose a sense amplifier for split-supply SRAMs to enable a wide range of Dynamic Voltage and Frequency Scaling (DVFS). The proposed solution has more than 25% lower offset and almost the same SA reaction time compared to the voltage latch sense amplifier, and more than 50% lower offset and more than 15% faster SA reaction time compared to the current latch sense amplifier, across 0.45V to 1.0V operation in 22nm HKMG CMOS technology.

Journal ArticleDOI
TL;DR: The role of the sense amplifier in modern computer memory is to sense low-power signals; this paper explores design solutions to improve the performance of the sense amplifier for CMOS SRAM.

Journal ArticleDOI
TL;DR: A sense-amplifier-based physically unclonable function (PUF) with individually embedded non-volatile memory (eNVM), implemented in a 180 nm standard CMOS process, offers 100% stable random bits.
Abstract: This paper presents a sense-amplifier-based physically unclonable function (PUF) with individually embedded non-volatile memory (eNVM) that offers 100% stable random bits. The proposed eNVM, which stores the initially generated random key, biases the sense amplifier through a feedback path so that it always reproduces the same key as the initial value. To verify the performance of the proposed architecture, a 256-bit PUF with a core area of 0.160 mm2 was implemented in a 180 nm standard CMOS process. Measurement results of the implemented PUF show an intra-chip Hamming distance (HD) of 0 (100% stability) and an inter-chip HD of 0.5047.
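The two figures of merit quoted above can be illustrated with a short sketch of the fractional Hamming distance: an intra-chip HD of 0 means repeated reads of the same chip are identical (perfect stability), while an inter-chip HD near 0.5 means roughly half the bits differ between any two chips (good uniqueness). The 8-bit responses below are hypothetical.

```python
# Fractional Hamming distance, as used to evaluate PUF stability
# (intra-chip) and uniqueness (inter-chip).  Responses are hypothetical.
def fractional_hd(bits_a, bits_b):
    assert len(bits_a) == len(bits_b)
    return sum(a != b for a, b in zip(bits_a, bits_b)) / len(bits_a)

chip1_read1 = [1, 0, 1, 1, 0, 0, 1, 0]
chip1_read2 = [1, 0, 1, 1, 0, 0, 1, 0]   # identical re-read: fully stable
chip2_read1 = [0, 0, 1, 0, 1, 0, 1, 1]

print(fractional_hd(chip1_read1, chip1_read2))  # intra-chip HD -> 0.0
print(fractional_hd(chip1_read1, chip2_read1))  # inter-chip HD -> 0.5
```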

Patent
04 Jun 2020
TL;DR: In this paper, a memory cell is coupled to the first digit line in response to activation of a wordline coupled to the memory cell, and the second transistor is disabled to decouple the second digit line from the second gut node.
Abstract: Apparatuses and methods for reducing sense amplifier leakage current during an active power-down are disclosed. An example apparatus includes a memory that includes a memory cell, a first digit line, and a second digit line. The memory cell is coupled to the first digit line in response to activation of a wordline coupled to the memory cell. The example apparatus further includes a sense amplifier comprising a first transistor coupled between the first digit line and a first gut node of the sense amplifier and a second transistor coupled between the second digit line and a second gut node of the sense amplifier. While the wordline is activated, in response to entering a power-down mode, the first transistor is disabled to decouple the first digit line from the first gut node and the second transistor is disabled to decouple the second digit line from the second gut node.

Journal ArticleDOI
TL;DR: A novel sensing approach for spin-based memories is presented that augments the read sense margin and reduces read decision failures, without deteriorating read disturb, by changing the read current dynamically according to the bit-cell state.
Abstract: This brief presents a novel sensing approach for spin-based memories that augments the read sense margin and reduces read decision failures, without deteriorating read disturb, by changing the read current dynamically according to the bit-cell state. The proposed sensing circuit consists of three main sub-circuits: 1) the bit-cell; 2) a bit-line amplifier; and 3) a standard current-latch sense amplifier. The bit-line amplifier is a basic amplifier with a positive feedback connection that achieves a higher sense margin with a dynamic read current. As a result, the significant increase in the sense margin eliminates the effect of the sense amplifier offset on read decision failure. Monte Carlo simulations in a 45 nm technology demonstrate that the proposed sensing scheme improves the read bit error rate (BER) by more than one order of magnitude compared to the conventional voltage sensing scheme, at the cost of 0.3% array area overhead and a 3% energy penalty. Moreover, quantitatively compared with some of the state-of-the-art sensing schemes, the proposed scheme achieves a better area-energy-robustness trade-off.
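The link between sense margin and read decision failure can be sketched with a toy Gaussian offset model (not the paper's circuit): if the sense amplifier's input-referred offset is normally distributed with standard deviation σ, a read decision fails whenever the offset exceeds the sense margin, giving BER ≈ Q(margin/σ). The millivolt numbers below are hypothetical.

```python
import math

# Toy model: a read decision fails when the SA's Gaussian input-referred
# offset exceeds the sense margin, so BER = Q(margin / sigma).
def read_ber(margin_mv, sigma_mv):
    return 0.5 * math.erfc(margin_mv / (sigma_mv * math.sqrt(2)))

sigma = 10.0  # hypothetical SA offset standard deviation, in mV
for margin in (20.0, 30.0, 40.0):  # e.g. margin enlarged by a bit-line amplifier
    print(f"margin {margin:.0f} mV -> BER {read_ber(margin, sigma):.2e}")
```

In this toy model each extra σ of margin buys roughly an order of magnitude in BER, which is consistent in spirit with the abstract's claim that enlarging the sense margin improves read BER by more than one order of magnitude.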

Journal ArticleDOI
10 Mar 2020
TL;DR: In this letter, 1.5-MB and 256-KB 2T-MONOS eFlash macros are developed with 65-nm silicon-on-thin-box (SOTB) technology, adopting low-energy sense amplifier and data transmission circuit techniques which enhance intrinsic advantages of SOTB devices.
Abstract: To expand the application range of the Internet of Things, ultralow active-energy operation is essential in edge devices. In particular, read energy reduction in embedded Flash (eFlash) memory is strongly required to enable real-time sensing with the limited energy generated by energy harvesting (EH). In this letter, 1.5-MB and 256-KB 2T-MONOS eFlash macros are developed in a 65-nm silicon-on-thin-box (SOTB) technology, adopting low-energy sense amplifier and data transmission circuit techniques that enhance the intrinsic advantages of SOTB devices. These macros achieve a read energy of 0.15 pJ/bit with 80-MHz random read access capability, which is low enough to utilize EH technologies as energy sources.
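The reported 0.15-pJ/bit figure translates into a small average read power budget. As a back-of-envelope check, assume a hypothetical 32-bit read word fetched at the full 80-MHz access rate (the word width and 100% duty cycle are assumptions, not values from the letter):

```python
# Back-of-envelope read power from the reported energy-per-bit figure.
# The 32-bit word width and continuous access are assumptions.
energy_per_bit = 0.15e-12   # J/bit (reported)
word_bits = 32              # hypothetical read width
access_rate = 80e6          # Hz (reported random-read rate)

power_w = energy_per_bit * word_bits * access_rate
print(f"{power_w * 1e6:.0f} uW")  # -> 384 uW
```

A few hundred microwatts of continuous read power is within the output range of typical energy-harvesting sources, which supports the letter's claim.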

Patent
30 Apr 2020
TL;DR: In this paper, a memory including a memory cell coupled to a first digit line in response to a wordline being set to an active state and a sense amplifier coupled to the first digit lines and to a second digit line.
Abstract: Apparatuses and methods for reducing row address (RAS) to column address (CAS) delay are disclosed. An example apparatus includes a memory including a memory cell coupled to a first digit line in response to a wordline being set to an active state and a sense amplifier coupled to the first digit line and to a second digit line. The sense amplifier is configured to perform a threshold voltage compensation operation to bias the first digit line and the second digit line based on a threshold voltage difference between at least two circuit components of the sense amplifier. The apparatus further comprises a decoder circuit coupled to the wordline and to the sense amplifier. In response to an activate command, the decoder circuit is configured to initiate the threshold voltage compensation operation and, during the threshold voltage compensation operation, to set the wordline to the active state.

Journal ArticleDOI
TL;DR: An analog bit-counting scheme is proposed to decrease the burden of neuron circuits with a synaptic architecture utilizing NAND flash memory, and a novel binary neuron circuit with a double-gate positive feedback (PF) device is demonstrated to replace the sense amplifier, adder, and comparator.
Abstract: Recent studies have demonstrated that binary neural networks (BNN) can achieve satisfying inference accuracy on representative image datasets. A BNN conducts XNOR and bit-counting operations instead of high-precision vector-matrix multiplication (VMM), significantly reducing the memory storage. In this work, an analog bit-counting scheme is proposed to decrease the burden of neuron circuits with a synaptic architecture utilizing NAND flash memory. A novel binary neuron circuit with a double-gate positive feedback (PF) device is demonstrated to replace the sense amplifier, adder, and comparator, thereby reducing the burden of the complementary metal-oxide-semiconductor (CMOS) circuits and the power consumption. By using the double-gate PF device, the threshold voltage of the neuron circuits can be adaptively matched to the threshold value in the algorithms, eliminating the accuracy degradation introduced by process variation. Thanks to the super-steep subthreshold swing (SS) characteristics of the PF device, the proposed neuron circuit has an off-state current of 1 pA, a 10⁵-fold improvement over a neuron circuit with a conventional metal-oxide-semiconductor field-effect transistor (MOSFET) device. A system simulation of a hardware-based BNN shows that the low-variance conductance distribution (8.4%) of the synaptic device and the adjustable threshold of the neuron circuit implement a highly efficient BNN with high inference accuracy.
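The XNOR and bit-counting operation that replaces VMM in a BNN can be sketched in a few lines: for ±1 vectors, the dot product equals twice the number of matching positions (the XNOR popcount) minus the vector length. The vectors below are arbitrary examples.

```python
# Sketch of the XNOR/bit-counting operation that replaces high-precision
# vector-matrix multiplication in a binary neural network.
def bnn_neuron(activations, weights):
    """Dot product of +/-1 vectors via XNOR matching and bit-counting."""
    n = len(activations)
    matches = sum(1 for a, w in zip(activations, weights) if a == w)  # XNOR popcount
    return 2 * matches - n  # map the match count back to a +/-1 dot product

acts    = [1, -1, 1, 1, -1, -1, 1, -1]
weights = [1, 1, -1, 1, -1, 1, 1, -1]
print(bnn_neuron(acts, weights))  # -> 2
```

Because the whole computation reduces to counting matches, it maps naturally onto an analog bit-counting scheme: summed cell currents stand in for the popcount, and a thresholding neuron circuit produces the binary activation.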

Posted Content
TL;DR: Neural network simulation on the CIFAR-10 image recognition task shows that going from binary to ternary neural networks significantly increases neural network performance, highlighting that the function of AI circuits may sometimes need to be revisited when they are operated in low-power regimes.
Abstract: The design of systems implementing low-precision neural networks with emerging memories such as resistive random access memory (RRAM) is a promising approach for reducing the energy consumption of artificial intelligence (AI). Multiple works have, for example, proposed in-memory architectures to implement low-power binarized neural networks. These simple neural networks, where synaptic weights and neuronal activations assume binary values, can indeed approach state-of-the-art performance on vision tasks. In this work, we revisit one of these architectures where synapses are implemented in a differential fashion to reduce bit errors, and synaptic weights are read using precharge sense amplifiers. Based on experimental measurements on a hybrid 130 nm CMOS/RRAM chip and on circuit simulation, we show that the same memory array architecture can be used to implement ternary weights instead of binary weights, and that this technique is particularly appropriate if the sense amplifier is operated in the near-threshold regime. We also show, based on neural network simulation on the CIFAR-10 image recognition task, that going from binary to ternary neural networks significantly increases neural network performance. These results highlight that the function of AI circuits may sometimes need to be revisited when they are operated in low-power regimes.
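The differential synapse described above can be modeled abstractly: each weight occupies a device pair on complementary bitlines, with binary weights stored as complementary low-resistance/high-resistance (LRS/HRS) states. A ternary zero can reuse the same array by storing HRS on both devices, which a precharge sense amplifier operated in the near-threshold regime can resolve as a third, "neither side wins" state. The mapping below is an illustrative toy model, not the paper's measured circuit.

```python
# Toy model of a differential RRAM synapse: each weight is a pair of
# device states on complementary bitlines.  The mapping is illustrative.
LRS, HRS = "LRS", "HRS"

def encode_ternary(w):
    """Store a ternary weight as a (BL, BLB) device-state pair."""
    return {+1: (LRS, HRS), -1: (HRS, LRS), 0: (HRS, HRS)}[w]

def sense(pair):
    """Idealized precharge sense amplifier: in near-threshold operation
    the HRS/HRS pair is resolved as a distinct third state (weight 0)."""
    bl, blb = pair
    if bl == LRS and blb == HRS:
        return +1
    if bl == HRS and blb == LRS:
        return -1
    return 0  # both devices high-resistance: neither branch discharges first

print([sense(encode_ternary(w)) for w in (+1, 0, -1)])  # -> [1, 0, -1]
```

The design choice this illustrates is that ternary weights cost no extra devices over the binary differential scheme; only the sense amplifier's operating point changes so that the double-HRS case becomes distinguishable.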

Proceedings ArticleDOI
01 Aug 2020
TL;DR: In this article, it is shown that the same memory array architecture can be used to implement ternary weights instead of binary weights, and that this technique is particularly appropriate if the sense amplifier is operated in the near-threshold regime.
Abstract: The design of systems implementing low-precision neural networks with emerging memories such as resistive random access memory (RRAM) is a promising approach for reducing the energy consumption of artificial intelligence (AI). Multiple works have, for example, proposed in-memory architectures to implement low-power binarized neural networks. These simple neural networks, where synaptic weights and neuronal activations assume binary values, can indeed approach state-of-the-art performance on vision tasks. In this work, we revisit one of these architectures where synapses are implemented in a differential fashion to reduce bit errors, and synaptic weights are read using precharge sense amplifiers. Based on experimental measurements on a hybrid 130 nm CMOS/RRAM chip and on circuit simulation, we show that the same memory array architecture can be used to implement ternary weights instead of binary weights, and that this technique is particularly appropriate if the sense amplifier is operated in the near-threshold regime. We also show, based on neural network simulation on the CIFAR-10 image recognition task, that going from binary to ternary neural networks significantly increases neural network performance. These results highlight that the function of AI circuits may sometimes need to be revisited when they are operated in low-power regimes.