
# A Parallel-friendly Majority Gate to Accelerate In-memory Computation

06 Jul 2020, pp. 93-100

TL;DR: A method to compute majority while reading from a transistor-accessed RRAM array, which could achieve a latency reduction of 70% and 50% when compared to IMPLY and NAND/NOR logic-based adders, respectively.

Abstract: Efforts to combat the ‘von Neumann bottleneck’ have been strengthened by Resistive RAMs (RRAMs), which enable computation in the memory array. Majority logic can accelerate computation when compared to NAND/NOR/IMPLY logic due to its expressive power. In this work, we propose a method to compute majority while reading from a transistor-accessed RRAM array. The proposed gate was verified by simulations using a physics-based model (for the RRAM) and an industry-standard model (for the CMOS sense amplifier) and found to tolerate reasonable variations in the RRAMs’ resistive states. Together with a NOT gate, which is also implemented in-memory, the proposed gate forms a functionally complete Boolean logic, capable of implementing any digital logic. Computing is simplified to a sequence of READ and WRITE operations and does not require any major modifications to the peripheral circuitry of the array. The parallel-friendly nature of the proposed gate is exploited to implement an eight-bit parallel-prefix adder in the memory array. The proposed in-memory adder could achieve a latency reduction of 70% and 50% when compared to IMPLY and NAND/NOR logic-based adders, respectively.


## Summary (3 min read)

### Introduction

• RRAMs are two-terminal devices (usually a Metal-Insulator-Metal structure) capable of storing data as resistance.
• Recent research [10]–[12] has confirmed that majority logic is to be preferred not only because a particular nanotechnology can realize it, but also because of its ability to implement arithmetic-intensive circuits with fewer gates.
• In Section III the authors present the framework to compute in the memory array, using the proposed majority gate.

### A. Majority gate: Operating principle

• Consider an array of RRAM cells arranged in a 1T-1R configuration, as depicted in Fig. 2. Each cell can be individually read/written into by activating the corresponding wordline (WL) and applying appropriate voltage across the cell (BL and SL).
• Table I lists the truth table of 3-input majority gate (M3(A,B,C)) and the effective resistance for all the eight possibilities.
• The 1T–1R structure consists of an NMOS transistor manufactured in IHP’s 130 nm CMOS technology, whose drain is connected in series to the RRAM.
• IHP’s 1T–1R cells were modeled using the Stanford-PKU RRAM model following the methodology presented in [16].
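The read-as-majority principle above can be sketched numerically. The following is a minimal Python model (not the authors' code) using the resistance values reported for IHP's cells (LRS = 10 kΩ, HRS = 133.3 kΩ, r_DS ≈ 544 Ω); the 6.5 kΩ sensing threshold is a hypothetical midpoint of the 4.8–8.7 kΩ window, whereas the paper senses in the time domain.

```python
# Read-as-majority sketch: three 1T-1R branches appear in parallel during
# a triple-row READ. Values from the paper: LRS = 10 kOhm, HRS = 133.3 kOhm,
# access-transistor r_DS ~ 544 Ohm. The threshold value is hypothetical.
LRS, HRS, R_DS = 10e3, 133.3e3, 544.0

def parallel(*rs):
    return 1.0 / sum(1.0 / r for r in rs)

def read_majority(a, b, c, threshold=6.5e3):
    # Logic '0' is LRS and logic '1' is HRS, as in Table I
    branch = lambda bit: (HRS if bit else LRS) + R_DS
    return int(parallel(branch(a), branch(b), branch(c)) > threshold)

# R_eff falls on opposite sides of the threshold exactly when a
# majority of the three stored bits is '1'
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            assert read_majority(a, b, c) == (a + b + c >= 2)
```

The exhaustive check mirrors Table I: the 4.8 kΩ cases (at most one HRS cell) read as '0' and the 8.7 kΩ-and-above cases read as '1'.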

### B. Sensing methodology

• As stated, the methodology to reliably translate Reff into a CMOS-compatible voltage is the crucial aspect of the proposed majority gate.
• The time-based sensing circuit is essentially a voltage-to-time converter followed by a time-domain comparator (D-flip-flop).
• EN_delay, the EN signal delayed by t_delay, acts as the edge trigger for the D-FF.
• Therefore the majority gate was evaluated by taking RRAM variations into account.
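The variation-tolerance study can be approximated with a quick Monte Carlo sketch. This simplified model (not the authors' setup) replaces the time-based SA with a fixed resistance threshold and ignores CMOS process variation, so its error rate is not comparable to the reported BER; the 6.5 kΩ threshold is a hypothetical value.

```python
import random

# Monte Carlo sketch of variation tolerance: RRAM states are Gaussian with
# sigma = 10% of the mean (sigma_LRS = 1 kOhm, sigma_HRS = 13.33 kOhm).
# The time-based SA is abstracted into a fixed (hypothetical) threshold
# on R_eff, which omits the SA's own error sources.
LRS_MU, HRS_MU = 10e3, 133.3e3

def r_eff(rs):
    return 1.0 / sum(1.0 / r for r in rs)

def trial(rng, threshold=6.5e3):
    bits = [rng.randint(0, 1) for _ in range(3)]
    rs = [rng.gauss(HRS_MU, 0.10 * HRS_MU) if b else rng.gauss(LRS_MU, 0.10 * LRS_MU)
          for b in bits]
    # True when the thresholded R_eff matches the ideal majority
    return (r_eff(rs) > threshold) == (sum(bits) >= 2)

rng = random.Random(42)
errors = sum(not trial(rng) for _ in range(2000))
print(f"errors in 2000 trials: {errors}")
```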

### A. Functional completeness and memory controller

• This is accomplished by using a control signal INV which is low during READ and majority operations (Q is latched) and goes high only during the NOT operation (Q̄ is latched).
• Majority together with NOT is functionally complete, i.e. any Boolean logic can be expressed in terms of majority and NOT gates [19].
• The memory controller of a regular memory (be it DRAM-based or NVM-based) is responsible for orchestrating the READ and WRITE operations by issuing the control signals to the peripheral circuitry of the array.
• It must be noted that majority operation is executed on three contiguous bits of data in a column and the triple row decoder of section III-B will not only select the row corresponding to the address placed on the row decoder, but also the next two rows if MAJ is ‘1’.
• The NOT operation is the same as the READ operation, with the only exception that the controller issues the control signal INV, which goes high to invert the read data at the output of the SA (Fig. 5-(a)); this signal also acts as an additional input to the row decoder (Fig. 6).
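Functional completeness of {majority, NOT} is easy to check exhaustively: fixing one input of a majority gate to a constant yields AND or OR, and De Morgan then gives the rest. A small illustrative sketch:

```python
# Majority plus NOT is functionally complete: a constant third input
# turns M3 into AND or OR, and NOT completes the set.
def M3(a, b, c):
    return int(a + b + c >= 2)  # 3-input majority

def NOT(a):
    return 1 - a

def AND(a, b):
    return M3(a, b, 0)  # constant '0' third input: majority becomes AND

def OR(a, b):
    return M3(a, b, 1)  # constant '1' third input: majority becomes OR

def NAND(a, b):
    return NOT(AND(a, b))

for a in (0, 1):
    for b in (0, 1):
        assert AND(a, b) == (a & b)
        assert OR(a, b) == (a | b)
        assert NAND(a, b) == 1 - (a & b)
```

In the array, the constant inputs correspond to cells pre-written to LRS or HRS.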

### B. Triple-row decoder design

• A conventional decoder for a 1T–1R array can select one row at a time, while the proposed majority gate needs three rows to be selected simultaneously.
• To this end, the authors propose a robust row decoder which is designed by interleaving multiple single-row decoders.
• When φ goes high, the WLi corresponding to D1D0 goes high, provided EN is ‘1’.
• The address translator does not add any significant latency to the decoding process.
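The decoder's behaviour can be summarized by a tiny behavioral model (illustrative only; how requests near the top two rows of the array are handled during a majority operation is not specified in the paper, so this sketch simply disallows them):

```python
# Behavioral model of the triple-row decoder (Fig. 6). MAJ = 0 activates a
# single wordline; MAJ = 1 activates three contiguous wordlines starting
# at the decoded address. The boundary behaviour at the top two rows is
# an assumption of this sketch, not taken from the paper.
def triple_row_decode(addr, maj, n_rows=16):
    assert 0 <= addr < n_rows
    if not maj:
        return [addr]
    assert addr + 2 < n_rows, "majority needs three contiguous rows"
    return [addr, addr + 1, addr + 2]

assert triple_row_decode(5, maj=0) == [5]        # READ/WRITE/NOT: WL5 only
assert triple_row_decode(5, maj=1) == [5, 6, 7]  # majority: WL5, WL6, WL7
```

The second assertion matches the paper's worked example (address ‘0101’ with MAJ = ‘1’ activates WL5, WL6 and WL7).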

### C. Area of time-based Sense Amplifier

• It must be emphasized that the main drawback of RRAM-based in-memory adders is their latency: numerous cycles of Boolean operations (NAND, NOR, IMPLY) are needed to perform addition, when compared to CMOS.
• The time-based SA of [17] could sense the BL voltage without an op-amp, and this was an important reason for adopting it for their majority gate (conventional SAs use an operational amplifier, which consumes a large silicon area).
• It must be noted that this area estimate does not include the area of the delay element since it is shared by all the SA in the array.
• (t_delay in Fig. 3 is implemented as a series of inverters with MOS capacitive loads between them.)

### D. Energy for in-memory operations

• To assess the energy required for computation, the authors first calculate the energy required for each logic operation.
• The authors calculate the energy for a single majority operation as E_MAJ = V_DD·∫₀^{t_READ} I_READ·dt + V_DD·∫₀^{t_READ} I_SA·dt (1), where I_READ is the current injected into the 1T–1R cell (see Fig. 3), I_SA is the current consumed by the time-based SA and t_READ is the READ cycle duration.
• The energy for a single majority operation is E_MAJ = 1.98 pJ.
• The energy for the NOT operation is the same as the energy to read a single bit, and it was calculated to be E_NOT = 1.24 pJ.
• E_NOT is smaller than E_MAJ because I_READ is smaller (22 µA) for NOT and READ, where only a single bit is read.
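With the currents approximately constant over the READ cycle, Eq. (1) reduces to E ≈ V_DD·(I_READ + I_SA)·t_READ. The sketch below uses placeholder values for V_DD, I_SA and t_READ (only I_READ = 35 µA for a majority READ is given in the paper), so it illustrates the calculation rather than reproducing E_MAJ = 1.98 pJ.

```python
# Eq. (1) with currents assumed constant over the READ cycle:
#   E = V_DD * I_READ * t_READ + V_DD * I_SA * t_READ
# V_DD, I_SA and t_READ below are illustrative placeholders, not the
# paper's operating point.
def logic_op_energy(v_dd, i_read, i_sa, t_read):
    return v_dd * (i_read + i_sa) * t_read

E = logic_op_energy(v_dd=1.2, i_read=35e-6, i_sa=20e-6, t_read=10e-9)
print(f"energy per operation with placeholder values: {E * 1e12:.2f} pJ")  # 0.66 pJ
```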

### A. Parallel-prefix adder using majority logic

• Parallel-prefix (PP) adders are a family of adders originally proposed to overcome the latency incurred by the rippling of carry in CMOS-based adders.
• The regular structure of the memory array and the proposed parallel-friendly majority gate can be combined to implement PP adders in the memory array.
• The ‘carry-generate block’ can generate the carry ‘ahead’ and is known to reduce the latency to O(log n), for n-bit adders.
• Kogge-Stone, Ladner-Fischer, Brent-Kung and the like are examples of PP adders.
• For an eight-bit adder, the logical depth is six levels of majority gates and one level of NOT gates, and at most eight gates are needed simultaneously in each level.
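The majority decomposition of addition can be checked in a few lines: the carry-out of a full adder is exactly M3(A, B, C_in), and the sum can be built from three majority nodes plus inversions, matching the MIG of Fig. 5(b). The ripple-style sketch below (illustrative, not the parallel-prefix arrangement) only verifies functional correctness; the Ladner-Fischer mapping is what reduces the depth to six majority levels.

```python
# Full adder from three majority nodes plus inversions (cf. Fig. 5(b)):
#   Cout = M3(A, B, Cin)
#   S    = M3(NOT(Cout), Cin, M3(A, B, NOT(Cin)))
# Composed here in ripple fashion purely to check functional correctness.
def M3(a, b, c):
    return int(a + b + c >= 2)

def NOT(a):
    return 1 - a

def full_adder(a, b, cin):
    cout = M3(a, b, cin)
    s = M3(NOT(cout), cin, M3(a, b, NOT(cin)))
    return s, cout

def add8(x, y):
    carry, out = 0, 0
    for i in range(8):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        out |= s << i
    return out  # eight-bit result; final carry dropped

assert all(add8(x, y) == (x + y) % 256 for x in range(256) for y in range(0, 256, 37))
```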

### B. Mapping of the eight-bit LF adder to 1T–1R array

• The authors map the eight-bit Ladner-Fischer adder structure of Fig. 8 to a 1T–1R array, using the proposed logic family, and elaborate the sequence of operations.
• To minimize latency, the authors map the adder in a way such that all the majority gates in a logic level (see Fig. 8) are executed simultaneously in a READ operation (see Fig. 9).
• In a 1T–1R array, HRS→ LRS transition (SET process when the conductive filament is created) is accomplished by applying two pulses simultaneously to the WL and BL, while SL is grounded.
• In the Table III, the authors have not compared the energy for computation since they are either not reported [2] or reported for another RRAM technology [22].
• Each step has one or more OR/AND operation [23].
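One compute step of the mapping can be modeled behaviorally: a single triple-row READ evaluates one majority gate per selected column in parallel, and the results are written back for the next logic level. A sketch with an illustrative 8×4 bit array (addresses and data are made up for the example):

```python
# One compute step in the 1T-1R array: a triple-row READ evaluates a
# majority gate in every selected column simultaneously; the outputs are
# then WRITTEN back as inputs to the next logic level.
def M3(a, b, c):
    return int(a + b + c >= 2)

def read_majority_rows(array, row, cols):
    """Triple-row READ: per-column majority of rows row..row+2."""
    return {c: M3(array[row][c], array[row + 1][c], array[row + 2][c]) for c in cols}

def write_row(array, row, values):
    for c, v in values.items():
        array[row][c] = v  # one WRITE per column

array = [[0] * 4 for _ in range(8)]
array[0], array[1], array[2] = [1, 0, 1, 0], [1, 1, 0, 0], [0, 1, 1, 1]
level_out = read_majority_rows(array, row=0, cols=range(4))
write_row(array, row=3, values=level_out)
assert array[3] == [1, 1, 1, 0]
```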

### V. CONCLUSION

• A memristive logic family formulates a functionally complete Boolean logic with a memristive device (RRAM/PCM/STT-MRAM) as the primary switching device.
• The proposed method of implementing a majority and NOT gate in a 1T–1R array forms a new memristive logic family.
• The majority gate can be implemented in a 1T–1R array without necessitating any major change in the peripheral circuit (except the row decoder which needs to be modified to activate three rows simultaneously).
• Majority logic can be combined with parallel-prefix techniques to design fast adders, and the proposed gate can be used to implement them in memory arrays, with minimum latency.



A Parallel-friendly Majority Gate to Accelerate In-memory Computation

John Reuben
Chair of Computer Science 3 - Hardware Architecture
Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU)
91058 Erlangen, Germany
johnreuben.prabahar@fau.de

Stefan Pechmann
Chair of Communications Electronics
Universität Bayreuth
95447 Bayreuth, Germany
stefan.pechmann@uni-bayreuth.de
Index Terms—Resistive RAM (RRAM), majority logic, majority gate, memristor, 1 Transistor-1 Resistor (1T–1R), von Neumann bottleneck, in-memory computing, compute-in-memory
I. INTRODUCTION

The movement of data between processing and memory units in present day computing systems is their main
to as the ‘von Neumann bottleneck’ or ‘memory wall’. The
emergence of non-volatile memory technologies like Resistive
RAM (RRAM) has created opportunities to overcome the
memory wall by enabling computing at the residence of data.
RRAMs are two terminal devices (usually a Metal-Insulator-
Metal structure) capable of storing data as resistance. The
change of resistance is due to the formation or rupture of a
conductive ﬁlament, depending on the direction of the current
ﬂow through the structure. The word ‘memristor’ is also used
by researchers to denote such a device, because it is essentially
a resistor with memory. Connecting such RRAM devices in
a certain manner, or by applying certain voltage patterns,
or by modifying the sensing circuitry, basic Boolean gates
(NOR, NAND, XOR, IMPLY logic) have been demonstrated
in RRAM arrays [1]–[6]. The motivation for such efforts is
to perform Boolean operations on data stored in the memory
array, without moving them out to a separate processing
circuit, thus mitigating the von Neumann bottleneck. Reviews
of such in-memory computing approaches are presented in
[7], [8]. To construct a memory array using such devices, two
conﬁgurations are common: 1Transistor–1Resistor (1T–1R)
and 1Selector–1Resistor (1S–1R). The 1T–1R conﬁguration
uses a transistor as an access device for each cell, isolating
the accessed cell from its neighbours in the array. The 1S–1R
conﬁguration uses a two-terminal device called a ‘selector’
which is fabricated in series with the memristive device.
The 1S–1R is area-efﬁcient, but suffers from current leakage
(sneak–path problem) due to the inability to access a particular
cell without interfering with its neighbours [9].
Majority logic, a type of Boolean logic, is defined to be true if more than half of the n inputs are true, where n is odd. Hence, a majority gate is a democratic gate and can be expressed in terms of Boolean AND/OR as MAJ(a, b, c) = a·b + b·c + a·c, where a, b, c are Boolean variables. Although
majority logic was known since 1960, there has been a
revival in using it for computation in many emerging nan-
otechnologies (spin waves, magnetic Quantum-Dot cellular
automata, nano magnetic logic, Single Electron Tunneling).
Recent research [10]–[12] has conﬁrmed that majority logic is
to be preferred not only because a particular nanotechnology
can realize it, but also because of its ability to implement
arithmetic-intensive circuits with less gates. It must be em-
phasized that majority logic did not become the dominant
logic to compute because it was more efﬁcient to implement
NAND/NOR gate than a majority gate, in CMOS technology.
However, with many emerging nanotechnologies, this is not
the case anymore, therefore, majority logic needs to be re-
evaluated for its computing efﬁciency. In [13]–[15], majority
logic is implemented in RRAM by applying the two inputs of
the majority gate as voltages across its terminals, and the initial
state of the RRAM (which is also the third input) switches to
evaluate majority. Such an approach complicates the peripheral
circuitry and is also not parallel-friendly, because two of the
three inputs of a majority gate need to be applied as voltages
at wordline/bitline (see Fig.1(a)).
In this paper, we propose a majority gate whose structure is conducive to parallel processing in the memory array.

This is the author's version of the accepted paper. For the published paper, see the 31st IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP) proceedings at https://ieeexplore.ieee.org/. See the conference presentation (20 min video) at https://asap2020.cs.manchester.ac.uk/paper.php?id=72. © 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Fig. 1: (a) In-memory majority gate of previous works [13]–[15] (b) Proposed parallel-friendly gate (c) When multiple gates have to be executed in parallel, the majority gates of previous works [13]–[15] have to be mapped diagonally because two gates cannot be executed in the same row/column. This manner of computation complicates both the peripheral circuitry and the memory controller (inputs of the gates influence row/column decoding). In the proposed method, multiple gates can be mapped to the same set of rows, thereby simplifying the peripheral circuitry and the memory controller (inputs of the gates are resistances of memory cells and row/column decoders retain their functionality as in a conventional memory).

By activating three rows of the array simultaneously, the
resistances of the RRAM cells in a column are in parallel
during the READ operation. A Sense Ampliﬁer (SA) which
can accurately sense the effective resistance implements an ‘in-memory’ majority gate. This manner of computing majority
enables parallelism and is energy-efﬁcient (both reading and
writing is energy-efﬁcient in 1T–1R when compared to 1S–
1R arrays due to the absence of sneak paths). To demonstrate
the potential of this method to accelerate computation, we
consider a parallel-preﬁx adder and formulate the steps to
perform eight-bit addition in a 1T–1R array. The remainder
of the paper is organized as follows. Section II-A presents the
principle of reading majority from a 1T–1R array. Since the
read operation is the crucial aspect of the proposed majority
gate, we present the detailed sensing methodology in Section
II-B. Further, we study tolerance to variations in resistive
states by performing Monte Carlo simulations. In Section
III we present the framework to compute in the memory
array, using the proposed majority gate. Section IV-A brieﬂy
presents parallel-preﬁx technique and the structure of an eight-
bit parallel-preﬁx adder in terms of majority gates. The adder
is then mapped to a 1T–1R array using the proposed in-
memory computing technique, in Section IV-B. We compare
the proposed eight-bit adder with the state-of-the-art, followed
by conclusions in Section V.
II. MAJORITY GATE IN 1T–1R ARRAY
A. Majority gate: Operating principle
Consider an array of RRAM cells arranged in a 1T-1R
conﬁguration, as depicted in Fig. 2. Each cell can be in-
dividually read/written into by activating the corresponding
wordline (W L) and applying appropriate voltage across the
cell (BL and SL). To read from a cell, the corresponding
W L is activated, a small current is injected into the cell and
the voltage across the cell is sensed in a voltage-mode SA, i.e. the BL voltage is sensed while the SL is grounded.

Fig. 2: When three rows are activated (WL1–3) simultaneously in a 1T-1R array, the resistances of the three RRAM devices are in parallel. An ‘in-memory’ majority gate can be implemented by accurately sensing the effective resistance R_eff.

Now, if three rows are activated simultaneously during the read operation (Rows 1 to 3 in Fig. 2), the resistances in column 1 are in parallel (neglecting the parasitic resistance of BL and SL).
During read, the access transistor will be in the linear region, and hence the transistor's resistance will be r_DS = 1/(μ_n·C_ox·(W/L)·(V_GS − V_t)) = 544 Ω [16]. The effective resistance between BL and SL will therefore be R_eff = (R_A + r_DS) || (R_B + r_DS) || (R_C + r_DS) ≈ (R_A || R_B || R_C), if the drain-to-source resistance of the transistor (r_DS) is small compared to LRS. Table I lists the truth table of the 3-input majority gate (M3(A, B, C)) and the effective resistance for all eight possibilities. To verify the proposed gate on a real RRAM
device, we choose the 1T-1R cell from IHP¹. The 1T–1R structure consists of an NMOS transistor manufactured in IHP's 130 nm CMOS technology, whose drain is connected in series to the RRAM. The RRAM is a TiN/Hf_{1−x}Al_xO_y/Ti/TiN stack integrated between Metal2 and Metal3 in the BEOL of the CMOS process. IHP's 1T–1R cells were modeled using the Stanford-PKU RRAM model following the methodology presented in [16]. The cells have a mean LRS and HRS of 10 kΩ and 133.3 kΩ, respectively. Therefore, R_eff is 8.7 kΩ when two or more cells are in HRS (shaded grey in Table I) and 4.8 kΩ when two or more cells are in LRS. Consequently, a majority gate can be implemented during a READ operation by precisely sensing R_eff. As can be deciphered from Table I, the crucial aspect of the proposed gate is to be able to differentiate between R_eff^001 (two LRS and one HRS) and R_eff^110 (two HRS and one LRS). Let us denote the resistance range to be differentiated as the sensing window:

Sensing window for majority = 8.7 kΩ − 4.8 kΩ = 3.9 kΩ

for IHP's cell (resistance window HRS/LRS = 13.3).

¹ Innovations for High Performance Microelectronics – Leibniz-Institut für innovative Mikroelektronik, Germany
TABLE I: Precisely sensing R_eff results in majority: logic ‘0’ is LRS (10 kΩ) and logic ‘1’ is HRS (133.3 kΩ)

| A | B | C | M3(A, B, C) | R_eff (formula) | R_eff |
|---|---|---|-------------|-----------------|-------|
| 0 | 0 | 0 | 0 | LRS/3 | 3.3 kΩ |
| 0 | 0 | 1 | 0 | HRS·LRS/(LRS + 2·HRS) | 4.8 kΩ |
| 0 | 1 | 0 | 0 | HRS·LRS/(LRS + 2·HRS) | 4.8 kΩ |
| 0 | 1 | 1 | 1 | HRS·LRS/(HRS + 2·LRS) | 8.7 kΩ |
| 1 | 0 | 0 | 0 | HRS·LRS/(LRS + 2·HRS) | 4.8 kΩ |
| 1 | 0 | 1 | 1 | HRS·LRS/(HRS + 2·LRS) | 8.7 kΩ |
| 1 | 1 | 0 | 1 | HRS·LRS/(HRS + 2·LRS) | 8.7 kΩ |
| 1 | 1 | 1 | 1 | HRS/3 | 44.4 kΩ |
B. Sensing methodology
As stated, the methodology to reliably translate R_eff into a CMOS-compatible voltage is the crucial aspect of the proposed majority gate. R_eff^001 is 4.8 kΩ and R_eff^110 is 8.7 kΩ, and differentiating such a resistance window (≈ 3.9 kΩ) needs a robust SA. It must be noted that this will be exacerbated by the variability exhibited by the RRAM devices. To meet this requirement, a time-based SA recently proposed in [17] was chosen. Different from conventional sensing schemes (voltage-mode and current-mode), the time-based sensing scheme converts the BL voltage (to be sensed) into a time delay and discriminates in the time domain. This sensing scheme was originally proposed to read data from STT-MRAM [17], which has a resistance window of a few kΩ. Therefore, it is ideal for the proposed majority gate. Furthermore, time-based sensing achieves a two to three orders of magnitude improvement in sensing Bit Error Rate (BER) compared to conventional schemes, in addition to being reference-less [17].
The time-based sensing circuit is essentially a voltage-to-time converter followed by a time-domain comparator (D-flip-flop). Voltage-to-time conversion is achieved by the current-starved inverter (transistors M1–5) followed by transistor M6 and an inverter (Fig. 3). During READ, a current I is injected into the 1T-1R cell (the corresponding three WLs are activated and SL is grounded). Depending on the effective resistance R_eff, the BL reaches an appropriate voltage. In the conceptual waveforms of Fig. 3, it is assumed for the purpose of illustration that BL gets charged to 300 mV if R_eff is a high resistance (8.7 kΩ) and 200 mV if R_eff is a low resistance (4.8 kΩ). Such a V_BL (a few hundred mV) limits the current flow through the inverter (transistors M1–3), hence the name current-starved inverter. When EN goes high, the current-starved inverter introduces a delay inversely related to V_BL, i.e. a higher V_BL incurs less delay. A V_BL of 300 mV incurs less delay and the low-to-high transition of EN reaches the input of the flip-flop (I_FF) faster, i.e. at T_HRS. For a lower V_BL of 200 mV, the delay is greater and the low-to-high transition occurs at T_LRS. t_delay is a chain of inverters programmed to introduce a delay between T_HRS and T_LRS. EN_delay, the EN signal delayed by t_delay, acts as the edge trigger for the D-FF. When EN_delay goes high at T_DM (Decision Moment), it latches the signal at I_FF and hence D_out is high for high resistance (R_eff^110 = 8.7 kΩ) and low for low resistance (R_eff^001 = 4.8 kΩ). It must be noted that for R_eff^111 = 44.4 kΩ, V_BL will be much larger than 300 mV and will result in a transition much before T_HRS. Similarly, for R_eff^000 = 3.3 kΩ, V_BL will be less than 200 mV and will result in a transition much later than T_LRS. Once designed to differentiate between R_eff^110 and R_eff^001, the time-based SA will output M3(A, B, C) correctly for all eight cases. Furthermore, the same SA can be used to read a single bit by using a smaller I (and activating a single WL during the normal read operation). Hence the proposed gate does not necessitate any modification to the read-out circuit of the regular memory array.

Fig. 3: A small current I injected into the cell converts the resistance to a voltage which is fed to the time-based SA. A current-starved inverter transforms this voltage into a proportional delay which is sensed as a CMOS-compatible voltage by the D-FF [17].
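The sensing decision can be sketched behaviorally: the current-starved inverter maps V_BL to a delay, and the D-FF checks whether the delayed EN edge arrives before or after that transition. The inverse-delay model and constants below are hypothetical, chosen only to place T_HRS ≈ 4 ns and T_LRS ≈ 6 ns around a 5 ns decision moment; they are not fitted to the transistor-level circuit.

```python
# Behavioral sketch of the time-domain decision in Fig. 3. The inverse
# model k / V_BL and its constants are hypothetical: V_BL = 300 mV gives
# ~4 ns (T_HRS) and V_BL = 200 mV gives ~6 ns (T_LRS), straddling a 5 ns
# decision moment (the EN_delay edge).
def starved_inverter_delay(v_bl, k=1.2e-9):
    return k / v_bl  # higher V_BL -> less current starvation -> less delay

def sense(v_bl, t_decision=5e-9):
    # The D-FF latches '1' iff the transition arrived before EN_delay
    return int(starved_inverter_delay(v_bl) < t_decision)

assert sense(0.300) == 1  # high R_eff: majority of stored bits is '1'
assert sense(0.200) == 0  # low R_eff: majority is '0'
```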
The time-based sensing circuit of Fig. 3 was designed in IHP's 130 nm CMOS process and simulated to verify the functioning of the majority gate. A current I of 35 µA was injected into the 1T-1R cell to sense the BL voltage. For R_eff^001 and R_eff^110, V_BL was 282 mV and 410 mV, respectively. Since the current-starved transistors M1–3 are the crucial factor in deciding the delay, they were made large (W/L = 1.5 µm/0.39 µm) to make the circuit less sensitive to CMOS process variations. t_delay was set to 3 ns using a chain of inverters with MOS capacitive loads between them. RRAM cells exhibit variability in their programmed resistive states, cycle-to-cycle and device-to-device [18]. Therefore, the majority gate was evaluated by taking RRAM variations into account. Since majority is computed while reading (the memory cell is not switched), the RRAM was replaced with a resistor and variability was incorporated as a Gaussian distribution in that resistor. The impact of process variations was analysed using the statistical model files for the CMOS transistors provided by the foundry. 2000 Monte Carlo simulations were performed where the resistance of the RRAM was Gaussian distributed with a standard deviation σ = 10% of the mean RRAM resistance, i.e. σ_LRS = 1 kΩ and σ_HRS = 13.33 kΩ. With the combined effects of RRAM variability and process variability (in the transistors of the SA), the Bit Error Rate (BER) was found to be 5.4%. Sample waveforms are plotted in Fig. 4. Further failure analysis of the majority gate (incorrect sensing of R_eff^001 and R_eff^110) revealed that failures occurred only when the RRAM variability was more than 2σ from the mean LRS/HRS (note that 95% of resistances fall within 2σ of the mean in a Gaussian distribution).

Fig. 4: Sample output of the time-based SA. At 13.5 ns, EN_delay goes high, deciding the output. Only 100 MC simulations are plotted (shaded light) with a single typical case highlighted dark.
III. FRAMEWORK TO COMPUTE IN 1T–1R ARRAY
A. Functional completeness and memory controller
As shown in Fig. 5-(a), the NOT operation can be implemented in a 1T–1R array by simply latching Q̄ from the output of the time-based SA during READ (the D-flip-flop of Fig. 3 outputs Q and Q̄). This is accomplished by using a control signal INV which is low during READ and majority operations (Q is latched) and goes high only during the NOT operation (Q̄ is latched). Majority together with NOT is functionally complete, i.e. any Boolean logic can be expressed in terms of majority and NOT gates [19]. In [19], the authors present the Majority-Inverter Graph (MIG), a logic manipulation structure consisting of three-input majority nodes and regular/inverted edges. Fig. 5-(b) is the MIG of a 1-bit full adder (S = A ⊕ B ⊕ C_in; C_out = AB + BC_in + AC_in) obtained by MIGhty (an MIG synthesis tool), and any Boolean logic can be synthesised in terms of majority and NOT gates in a similar manner.

Fig. 5: (a) NOT operation implemented with a 2:1 Mux at the output of the time-based SA; all logic operations are essentially READ operations. (b) 1-bit full adder expressed as a Majority-Inverter Graph using the MIGhty synthesis tool [19], where M3 represents a 3-input majority operation. (c) With majority/NOT gates computed as READ, multiple levels of logic can be executed by writing the data back to the memory, simplifying computing to READ and WRITE operations.

Since both majority and NOT gates are implemented
as READ, multiple levels of gates can be cascaded by writing
the read data back to the array. In essence, ‘computing’ is
simpliﬁed to a sequence of READ and WRITE operations,
orchestrated by the memory controller, as depicted in Fig.5-
(c).
The memory controller of a regular memory (be it DRAM-based or NVM-based) is responsible for orchestrating the READ and WRITE operations by issuing the control signals to the peripheral circuitry of the array. In addition, the memory controller must be augmented with the capability to execute majority and NOT operations. Since both majority and NOT operations are READ operations in this logic family, the controller does not require any major alterations. To execute a majority operation, an additional control signal called MAJ is needed, which is set to logic ‘1’ during the majority operation (this signal acts as an additional input to the row decoder, Fig. 6), and the address of the first row (out of the three rows on which majority is to be performed) is placed on the row decoder. It must be noted that the majority operation is executed on three contiguous bits of data in a column, and the triple-row decoder of Section III-B will select not only the row corresponding to the address placed on the row decoder, but also the next two rows if MAJ is ‘1’. The column address is placed on the column decoder to select the particular column in which majority is executed, and the SA is activated to get the output. The NOT operation is the same as the READ operation, with the only exception that the controller issues the control signal INV, which goes high to invert the read data at the output of the SA (Fig. 5-(a)). The control signals activated during logic operations are summarized in Table II.
TABLE II: Control signals for memory and logic operations

| Operation | WL | BL | SL | EN (SA) | INV | MAJ |
|-----------|----|----|----|---------|-----|-----|
| READ | single row activated | read ckt. | grounded | 1 | 0 | 0 |
| NOT | single row activated | read ckt. | grounded | 1 | 1 | 0 |
| Majority | three rows activated | read ckt. | grounded | 1 | 0 | 1 |
| WRITE ‘0’ | single row activated | V_SET | grounded | 0 | 0 | 0 |
| WRITE ‘1’ | single row activated | grounded | V_RESET | 0 | 0 | 0 |
B. Triple-row decoder design
Fig. 6: Triple-row decoding is achieved by interleaving multiple single-row decoders. When control signal MAJ is logic ‘0’ (READ/WRITE/NOT), the WL_i corresponding to row address A3A2A1A0 is selected. When MAJ is logic ‘1’ (majority), WL_i, WL_{i+1} and WL_{i+2} are selected.
A conventional decoder for a 1T–1R array can select one
row at a time, while the proposed majority gate needs three
rows to be selected simultaneously. Moreover, the row-decoder
must be versatile to switch between single-row activation and
triple-row activation seamlessly. This is because, as stated
in the previous section, one must be able to read/write a
single bit of the array (READ/WRITE/NOT) as well as read
three bits in a column (majority). To this end, we propose a
robust row decoder which is designed by interleaving multiple
single-row decoders. As depicted in Fig.6, a 4:16 triple-row
decoder can be designed by interleaving four 2:4 dynamic
NAND decoders
3
. Since single-row decoding must co-exist
with triple-row decoding, an address translator circuit is used
to switch between the two modes using MAJ as a control
3
a dynamic decoder uses a precharge signal φ, which when low, all W L
are driven to ‘0’. When φ goes high, W L
i
corresponding to D
1
D
0
goes
high, provided EN is ‘1’
signal. For example, to select the single row WL5, the row address is A3A2A1A0 = '0101' and MAJ = '0'. For these inputs, the address translator outputs EN3EN2EN1EN0 = '0010' and D7D6D5D4D3D2D1D0 = 'XXXX01XX' (the green decoder in Fig. 6 is enabled and its second row is selected, thereby activating WL5). But for the same row address A3A2A1A0 = '0101' with MAJ = '1', the address translator outputs EN3EN2EN1EN0 = '1110' and D7D6D5D4D3D2D1D0 = '010101XX' (the blue, red and green decoders are enabled and the second row of each is selected, thereby activating WL5, WL6 and WL7). The address translator takes MAJ and A3A2A1A0 as inputs and generates D7D6D5D4D3D2D1D0 and EN3EN2EN1EN0 to achieve this desired functionality for all 16 cases. With the address translator logic (88 transistors), the triple-row decoder requires 200 transistors, while a regular 4:16 dynamic decoder (single-row activation only) requires 136 transistors, a 47% increase in the row-decoder area. The address translator does not add any significant latency to the decoding process. The decoder was designed in the 130 nm IHP process, its functionality was verified, and the decoding latency was found to be 496 ps.
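The translator's mapping can be summarized in a short behavioural model. The sketch below is a hypothetical Python reconstruction of the truth table described above; the function names are illustrative, and the interleaving rule (word line WLi belongs to decoder i mod 4, at its row i div 4) is inferred from the worked example, not taken from the paper's circuit netlist:

```python
def address_translator(addr, maj):
    """Behavioural model (hypothetical) of the address-translator logic.

    Inferred interleaving rule: word line WL_i is driven by 2:4 decoder
    (i % 4), at its row (i // 4).  Returns (en, d), where en[k] is the
    enable bit of decoder k and d[k] is its 2-bit row select
    (None when the decoder is disabled, shown as 'XX' in the text).
    """
    assert 0 <= addr <= 15 and (maj == 0 or addr <= 13)
    rows = [addr] if maj == 0 else [addr, addr + 1, addr + 2]
    en, d = [0] * 4, [None] * 4
    for i in rows:
        en[i % 4] = 1      # enable the decoder that owns WL_i
        d[i % 4] = i // 4  # select the row of WL_i inside that decoder
    return en, d

def selected_wordlines(addr, maj):
    """Word lines that go high once the precharge signal rises."""
    en, d = address_translator(addr, maj)
    return sorted(4 * d[k] + k for k in range(4) if en[k])

# Reproduces the worked example for A3A2A1A0 = '0101' (address 5):
print(selected_wordlines(5, 0))  # [5]        single-row READ/WRITE/NOT
print(selected_wordlines(5, 1))  # [5, 6, 7]  triple-row majority
```

For address 5 the model yields en = [0, 1, 0, 0] (i.e. EN3EN2EN1EN0 = '0010') in single-row mode and en = [0, 1, 1, 1] (= '1110') in majority mode, matching the two cases above.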
C. Area of time-based Sense Amplifier

Fig. 7: Layout of time-based SA.

In this work, the primary motivation for pioneering a parallel-friendly gate was to exploit it to accelerate addition by executing gates in parallel. It must be emphasized that the main drawback of RRAM-based in-memory adders is their latency: numerous cycles of Boolean operations (NAND, NOR, IMPLY) are needed to perform addition, when compared to CMOS. To evaluate the number of gates that can be executed in parallel, we evaluated the area of the time-based SA. The time-based SA of [17] can sense the BL voltage without an op-amp, which was an important reason for adopting it for our majority gate (conventional SAs use an operational amplifier, which consumes a large silicon area). The layout of the time-based SA of Fig. 3 is drawn in Fig. 7 and occupies an area of 20 × 3 = 60 µm². It must be noted that this area estimate does not include the area of the delay element, since it is shared by all the SAs in the array (t_delay in Fig. 3 is implemented as a series of inverters with MOS capacitive loads between them). From [20], the layout of a single 1T–1R cell occupies 450 nm × 450 nm ≈ 0.2 µm² in 130 nm (12.4 F²). If the SA is stacked along its height of 3 µm, eight columns can share one SA. This means that the number of majority gates that can be executed in parallel in an array is the number of columns divided by eight, i.e., 32 gates can be executed simultaneously in a 256×256 array, 8 gates in a 64×64 array, etc.
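The parallelism estimate above is just an integer division of the column count by the SA-sharing factor. As a quick sanity check, a minimal Python helper (the function name and the default sharing factor of 8 are illustrative, derived from the area figures quoted above):

```python
def gates_in_parallel(num_columns, columns_per_sa=8):
    """Majority gates executable in one READ cycle, assuming one
    time-based SA is shared by every 8 columns (the 3 um SA height
    stacked against eight cells of ~450 nm pitch, per the text)."""
    return num_columns // columns_per_sa

print(gates_in_parallel(256))  # 32 gates in a 256x256 array
print(gates_in_parallel(64))   # 8 gates in a 64x64 array
```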

##### Citations

Journal ArticleDOI
TL;DR: In this review, memristive logic families that can implement the MAJORITY gate and NOT are favored for in-memory computing; one-bit full adders implemented in a memory array using different logic primitives are compared, and the efficiency of the majority-based implementation is underscored.
Abstract: As we approach the end of Moore’s law, many alternative devices are being explored to satisfy the performance requirements of modern integrated circuits. At the same time, the movement of data between processing and memory units in contemporary computing systems (‘von Neumann bottleneck’ or ‘memory wall’) necessitates a paradigm shift in the way data is processed. Emerging resistance switching memories (memristors) show promising signs to overcome the ‘memory wall’ by enabling computation in the memory array. Majority logic is a type of Boolean logic which has been found to be an efficient logic primitive due to its expressive power. In this review, the efficiency of majority logic is analyzed from the perspective of in-memory computing. Recently reported methods to implement majority gate in Resistive RAM array are reviewed and compared. Conventional CMOS implementation accommodated heterogeneity of logic gates (NAND, NOR, XOR) while in-memory implementation usually accommodates homogeneity of gates (only IMPLY or only NAND or only MAJORITY). In view of this, memristive logic families which can implement MAJORITY gate and NOT (to make it functionally complete) are to be favored for in-memory computing. One-bit full adders implemented in memory array using different logic primitives are compared and the efficiency of majority-based implementation is underscored. To investigate if the efficiency of majority-based implementation extends to n-bit adders, eight-bit adders implemented in memory array using different logic primitives are compared. Parallel-prefix adders implemented in majority logic can reduce latency of in-memory adders by 50–70% when compared to IMPLY, NAND, NOR and other similar logic primitives.

7 citations

### Cites background or methods from "A Parallel-friendly Majority Gate t..."

• ...An eight-bit parallel-prefix adder in majority logic could achieve a latency of 19 steps [46]....

• ...In contrast, in the R–V implementation [45,46], the row/column decoders retain their functionality as in a conventional memory, with a minor modification (the row decoder must be enhanced to select three rows during majority operation, which can be achieved by interleaving decoders [46])....

• ...(a) In-memory majority gate proposed in [45,46]: When three rows are activated (WL1−3) simultaneously in a 1T-1R array, the three resistances RA, RB, RC will be in parallel (Inputs of the majority gate A, B, C are represented as resistances RA, RB, RC)....

• ...Using majority logic, an 8-bit PP adder is implemented in memory in [46]....

• ...Furthermore, the R–V implementation [45,46] is conducive to parallel processing since multiple gates can be mapped to the same set of rows, as illustrated in Figure 4....

Journal ArticleDOI
TL;DR: A method to implement a majority gate in a transistor-accessed ReRAM array during the READ operation, which forms a functionally complete Boolean logic, capable of implementing any digital logic.
Abstract: To overcome the “von Neumann bottleneck,” methods to compute in memory are being researched in many emerging memory technologies, including resistive RAMs (ReRAMs). Majority logic is efficient for synthesizing arithmetic circuits when compared to NAND/NOR/IMPLY logic. In this work, we propose a method to implement a majority gate in a transistor-accessed ReRAM array during the READ operation. Together with NOT gate, which is also implemented in memory, the proposed gate forms a functionally complete Boolean logic, capable of implementing any digital logic. Computing is simplified to a sequence of READ and WRITE operations and does not require any major modifications to the peripheral circuitry of the array. While many methods have been proposed recently to implement the Boolean logic in memory, the latency of in-memory adders implemented as a sequence of such Boolean operations is exorbitant (O(n)). Parallel-prefix (PP) adders use prefix computation to accelerate addition in conventional CMOS-based adders. By exploiting the parallel-friendly nature of the proposed majority gate and the regular structure of the memory array, it is demonstrated how PP adders can be implemented in memory in O(log(n)) latency. The proposed in-memory addition technique incurs a latency of 4·log(n)+6 for n-bit addition and is energy-efficient due to the absence of sneak currents in the 1Transistor–1Resistor configuration.

4 citations

Proceedings ArticleDOI
01 Sep 2020
TL;DR: This paper presents the new concept and simulation results characterising the functionality for the new memristive ternary MLC for a ReRAM technology from Innovations for High Performance Microelectronics (IHP).
Abstract: In this paper we present a new procedure for a direct state transfer in ReRAM-based multi-level cell (MLC) memristors for future ternary data processing, i.e. the direct transitioning of one ternary MLC state to another. According to the rules of a ternary stored-transfer-adder cell, the contents of two memristors storing three different resistance values are read out and processed by a sense amplifier to produce a new ternary state for two output memristors. In contrast to our own older work, the analogue-to-digital conversion of ternary MLC-based memristors with subsequent digital processing, which requires a comparatively high energy budget, is avoided. The solution is based on an adapted version of an existing sense amplifier circuit, realising in-memory processing for a majority logic developed by ourselves. We present the new concept and simulation results characterising the functionality of the new memristive ternary MLC for a ReRAM technology from Innovations for High Performance Microelectronics (IHP).

4 citations

### Cites background or methods from "A Parallel-friendly Majority Gate t..."

• ...A robust decoder designed by interleaving multiple decoders can achieve this functionality, as presented in [7], [8]....


• ...Our solution we found is based on a procedure that was used in implementing majority gate logic [6] by modifying the processing in sense amplifiers of crossbar memristor based structures [7], [8]....


• ...This in-memory majority gate, published in our earlier works [7], [8] exploits the 1T–1R structure of the memristive array to compute majority....


Journal ArticleDOI
TL;DR: The measurement results prove the functionality of the read circuit and the programming system and demonstrate that the read system can distinguish up to eight different states with an overall resistance ratio of 7.9.
Abstract: In this work, we present an integrated read and programming circuit for Resistive Random Access Memory (RRAM) cells. Since there are a lot of different RRAM technologies in research and the process variations of this new memory technology often spread over a wide range of electrical properties, the proposed circuit focuses on versatility in order to be adaptable to different cell properties. The circuit is suitable for both read and programming operations based on voltage pulses of flexible length and height. The implemented read method is based on evaluating the voltage drop over a measurement resistor and can distinguish up to eight different states, which are coded in binary, thereby realizing a digitization of the analog memory value. The circuit was fabricated in the 130 nm CMOS process line of IHP. The simulations were done using a physics-based, multi-level RRAM model. The measurement results prove the functionality of the read circuit and the programming system and demonstrate that the read system can distinguish up to eight different states with an overall resistance ratio of 7.9.

2 citations

### Cites background from "A Parallel-friendly Majority Gate t..."

• ...Those last two applications can also be combined to realize in-memory computing, one prominent way to overcome the von Neumann bottleneck, one of the major challenges for further improvements of modern computing systems [5]....


• ...in-memory computing Non-Volatile Logic [5]...


Proceedings ArticleDOI
01 Jul 2021
Abstract: The movement of data between processing and memory units, often referred to as the ‘von Neumann bottleneck’ is the main reason for the degraded performance of contemporary computing systems. In an effort to overcome this bottleneck, methods to ‘compute’ at the location of data are being pursued in many emerging memories, including Resistive RAM (ReRAM). Although many prior works have pursued addition in memory, the latency of n-bit addition has not been judiciously optimized, resulting in O(n) or at best O(log(n)). Computing with three states can enable carry-free addition and result in a latency which is independent of operand width (O(1)). In this work, we propose a method to perform carry-free addition completely in memory (a storage array, a processing array and their peripheral circuitry). The proposed technique incurs a latency of 22 memory cycles, which outperforms other in-memory binary adders for n ≥ 32. This speed is achieved at the cost of increased peripheral hardware.

##### References

Journal ArticleDOI
01 Jun 2018
TL;DR: This Review Article examines the development of in-memory computing using resistive switching devices, where the two-terminal structure of the devices, theirresistive switching properties, and direct data processing in the memory can enable area- and energy-efficient computation.
Abstract: Modern computers are based on the von Neumann architecture in which computation and storage are physically separated: data are fetched from the memory unit, shuttled to the processing unit (where computation takes place) and then shuttled back to the memory unit to be stored. The rate at which data can be transferred between the processing unit and the memory unit represents a fundamental limitation of modern computers, known as the memory wall. In-memory computing is an approach that attempts to address this issue by designing systems that compute within the memory, thus eliminating the energy-intensive and time-consuming data movement that plagues current designs. Here we review the development of in-memory computing using resistive switching devices, where the two-terminal structure of the devices, their resistive switching properties, and direct data processing in the memory can enable area- and energy-efficient computation. We examine the different digital, analogue, and stochastic computing schemes that have been proposed, and explore the microscopic physical mechanisms involved. Finally, we discuss the challenges in-memory computing faces, including the required scaling characteristics, in delivering next-generation computing. This Review Article examines the development of in-memory computing using resistive switching devices.

593 citations

### "A Parallel-friendly Majority Gate t..." refers background in this paper

• ...Reviews of such in-memory computing approaches are presented in [7], [8]....


Journal ArticleDOI

TL;DR: The IMPLY logic gate, a memristor-based logic circuit, is described and a methodology for designing this logic family is proposed, based on a general design flow suitable for all deterministic memristive logic families.
Abstract: Memristors are novel devices, useful as memory at all hierarchies. These devices can also behave as logic circuits. In this paper, the IMPLY logic gate, a memristor-based logic circuit, is described. In this memristive logic family, each memristor is used as an input, output, computational logic element, and latch in different stages of the computing process. The logical state is determined by the resistance of the memristor. This logic family can be integrated within a memristor-based crossbar, commonly used for memory. In this paper, a methodology for designing this logic family is proposed. The design methodology is based on a general design flow, suitable for all deterministic memristive logic families, and includes some additional design constraints to support the IMPLY logic family. An IMPLY 8-bit full adder based on this design methodology is presented as a case study.

391 citations

### "A Parallel-friendly Majority Gate t..." refers background in this paper

• ...IMPLY 1S-1R 58 steps 72 cells Each step is IMPLY operation [21] NOR 1S-1R 38 steps 19×22 Each step has one or more NOR operations [22] Majority 1S-1R 48∗ steps 8×3 Each step is majority (Fig....


Journal ArticleDOI
TL;DR: This paper proposes a paradigm shift in representing and optimizing logic by using only majority (MAJ) and inversion (INV) functions as basic operations, and develops powerful Boolean methods exploiting global properties of MIGs, such as bit-error masking.
Abstract: In this paper, we propose a paradigm shift in representing and optimizing logic by using only majority (MAJ) and inversion (INV) functions as basic operations. We represent logic functions by majority-inverter graph (MIG): a directed acyclic graph consisting of three-input majority nodes and regular/complemented edges. We optimize MIGs via a new Boolean algebra, based exclusively on majority and inversion operations, that we formally axiomatize in this paper. As a complement to MIG algebraic optimization, we develop powerful Boolean methods exploiting global properties of MIGs, such as bit-error masking. MIG algebraic and Boolean methods together attain very high optimization quality. Considering the set of IWLS’05 benchmarks, our MIG optimizer (MIGhty) enables a 7% depth reduction in LUT-6 circuits mapped by ABC while also reducing size and power activity, with respect to similar and-inverter graph (AIG) optimization. Focusing on arithmetic intensive benchmarks instead, MIGhty enables a 16% depth reduction in LUT-6 circuits mapped by ABC, again with respect to similar AIG optimization. Employed as front-end to a delay-critical 22-nm application-specific integrated circuit flow (logic synthesis + physical design), MIGhty reduces the average delay/area/power by 13%/4%/3%, respectively, over 31 academic and industrial benchmarks. We also demonstrate delay/area/power improvements by 10%/10%/5% for a commercial FPGA flow.

120 citations

### "A Parallel-friendly Majority Gate t..." refers background or methods in this paper

• ...e any Boolean logic can be expressed in terms of majority and NOT gates [19]....


• ...5: (a) NOT operation implemented with a 2:1 Mux at the output of the time-based SA; all logic operations are essentially READ operations (b) 1-bit full adder expressed as Majority-InverterGraph using MIGhty synthesis tool [19], where M3 represents 3input majority operation (c) With majority/NOT gate computed as READ, multiple levels of logic can be executed by writing the data back to the memory, simplifying computing to READ and WRITE operations....


• ...In [19], the authors present MajorityInverter Graph (MIG), a new logic manipulation structure consisting of three-input majority nodes and regular/inverted edges....


Proceedings Article

14 Mar 2016
TL;DR: This paper addresses the question of controlling the in-memory computation, by proposing a lightweight unit managing the operations performed on a memristive array, and presents a standardized symmetric-key cipher for lightweight security applications.
Abstract: Realization of logic and storage operations in memristive circuits have opened up a promising research direction of in-memory computing. Elementary digital circuits, e.g., Boolean arithmetic circuits, can be economically realized within memristive circuits with a limited performance overhead as compared to the standard computation paradigms. This paper takes a major step along this direction by proposing a fully-programmable in-memory computing system. In particular, we address, for the first time, the question of controlling the in-memory computation, by proposing a lightweight unit managing the operations performed on a memristive array. Assembly-level programming abstraction is achieved by a natively-implemented majority and complement operator. This platform enables diverse sets of applications to be ported with little effort. As a case study, we present a standardized symmetric-key cipher for lightweight security applications. The detailed system design flow and simulation results with accurate device models are reported validating the approach.

108 citations

### "A Parallel-friendly Majority Gate t..." refers background or methods in this paper

• ...In [13]–[15], majority logic is implemented in RRAM by applying the two inputs of the majority gate as voltages across its terminals, and the initial state of the RRAM (which is also the third input) switches to evaluate majority....


• ...1: (a) In-memory majority gate of previous works [13]–[15] (b) Proposed parallel-friendly gate (c) When multiple gates have to be executed in parallel, the majority gates of previous works [13]–[15] have to be mapped diagonally because two gates cannot be executed in the same row/column....


Proceedings ArticleDOI
Fuxi Cai, Wen Ma, Wei Lu
01 Dec 2015
TL;DR: A new efficient in-memory computing architecture based on crossbar array based on basic operation principles and design rules is developed and verified using emerging nonvolatile devices such as very low-power resistive random access memory (RRAM).
Abstract: To solve the "big data" problems that are hindered by the Von Neumann bottleneck and semiconductor device scaling limitation, a new efficient in-memory computing architecture based on crossbar array is developed. The corresponding basic operation principles and design rules are proposed and verified using emerging nonvolatile devices such as very low-power resistive random access memory (RRAM). To prove the computing architecture, we demonstrate a parallel 1-bit full adder (FA) both by experiment and simulation. A 4-bit multiplier (Mult.) is further obtained by a programed 2-bit Mult. and 2-bit FA.

76 citations

### "A Parallel-friendly Majority Gate t..." refers background in this paper

• ...Connecting such RRAM devices in a certain manner, or by applying certain voltage patterns, or by modifying the sensing circuitry, basic Boolean gates (NOR, NAND, XOR, IMPLY logic) have been demonstrated in RRAM arrays [1]–[6]....


##### Frequently Asked Questions (1)
###### Q1. What are the contributions mentioned in the paper "A parallel-friendly majority gate to accelerate in-memory computation" ?

In this work, the authors propose a method to compute majority while reading from a transistor-accessed RRAM array.