# Low Complexity Bit Parallel Architectures for Polynomial Basis Multiplication over $\operatorname{GF}\left(2^{m}\right)$ 

Arash Reyhani-Masoleh, Member, IEEE, and M. Anwar Hasan, Senior Member, IEEE


#### Abstract

Representing the field elements with respect to the polynomial (or standard) basis, we consider bit parallel architectures for multiplication over the finite field $G F\left(2^{m}\right)$. To this end, we first derive a new formulation for polynomial basis multiplication in terms of the reduction matrix $\mathbf{Q}$. The main advantage of this new formulation is that it can be used with any field defining irreducible polynomial. Using this formulation, we then develop a generalized architecture for the multiplier and analyze the time and gate complexities of the proposed multiplier as a function of degree $m$ and the reduction matrix $\mathbf{Q}$. To the best of our knowledge, this is the first time that these complexities are given in terms of $\mathbf{Q}$. Unlike most other articles on bit parallel finite field multipliers, here we also consider the number of signals to be routed in hardware implementation and we show that, compared to the well-known Mastrovito's multiplier, the proposed architecture has fewer routed signals. In this article, the proposed generalized architecture is further optimized for three special types of polynomials, namely, equally spaced polynomials, trinomials, and pentanomials. We have obtained explicit formulas and complexities of the multipliers for these three special irreducible polynomials. This makes it very easy for a designer to implement the proposed multipliers using hardware description languages like VHDL and Verilog with minimum knowledge of finite field arithmetic.


Index Terms: Finite or Galois field, Mastrovito multiplier, all-one polynomial, polynomial basis, trinomial, pentanomial, and equally spaced polynomial.

## 1 Introduction

With the rapid expansion of the Internet and wireless communications, more and more digital systems are equipped with some form of cryptosystem to provide various kinds of data security. Many such cryptosystems rely on computations in very large finite fields and require fast computations in the fields [14], [2]. Finite field arithmetic operations are also used in error control coding [11], [16], VLSI testing [6], [27], and digital signal processing [5]. Among the basic arithmetic operations over the finite field $\operatorname{GF}\left(2^{m}\right)$, addition is easily realized using $m$ two-input XOR gates, while multiplication is costly in terms of gate count and time delay. The other operations of finite fields, such as exponentiation, division, and inversion, can be performed by repeated multiplications [21], [26], [1], [7]. In order to satisfy the high speed requirements of many such applications, there is a need to develop an efficient architecture for finite field multiplication which is suitable for VLSI implementation. In this paper, a new general bit parallel structure for the polynomial basis multiplication which is applicable to all types of irreducible binary polynomials is proposed.

### 1.1 Summary of Previous Work

The earliest parallel polynomial basis (PB) multiplier over $G F\left(2^{m}\right)$ was suggested by Bartee and Schneider [3]. Depending on the irreducible polynomial, this implementation requires as many as $m^{3}-m$ two-input adders over $G F(2)$ (i.e., XOR gates) [4]. Because of its high circuit complexity and lack of regularity, it is often advantageous to use other hardware structures to implement the multiplier [16]. In [13], [12], Mastrovito has proposed an algorithm along with its hardware architecture (hereafter referred to as the Mastrovito algorithm/multiplier) for PB multiplication. Sunar and Koc [24] have presented a new formulation for the Mastrovito algorithm using trinomials and have shown that $m^{2}-1$ XOR and $m^{2}$ AND gates are sufficient to implement the multiplier. In [8], Halbutogullari and Koc have generalized the approach of Sunar and Koc and have found a method for constructing the Mastrovito multiplier for arbitrary irreducible polynomials. This method considers general as well as special classes of irreducible polynomials such as trinomials, all-one polynomials (AOPs) and equally spaced polynomials (ESPs). So far, for these special polynomials, the XOR gate count and time delay of the Halbutogullari-Koc algorithm appear to be the lowest. In [28], Zhang and Parhi propose a systematic method to design the Mastrovito multiplier. Moreover, they extend the method to systematically design the modified Mastrovito multiplication scheme proposed in [23]. They also present new results of the complexities of the Mastrovito multiplier for two classes of irreducible pentanomials.

Unlike Mastrovito's method, a $G F\left(2^{m}\right)$ multiplication can also be performed by a straightforward polynomial multiplication followed by modular reduction. This approach has been used in a number of papers. For example, in [25], Wu considered irreducible trinomials as reduction polynomials and showed that a modular multiplication operation in $G F\left(2^{m}\right)$ can be performed with $(\omega-1)(m-1)$ bit additions, where $\omega$ is the Hamming weight of the irreducible polynomial. In hardware implementation, its multiplication operations can be realized with $m^{2}$ AND and $(m-1)^{2}+(\omega-1)(m-1)$ XOR gates. Recently, Rodriguez-Henriquez and Koc in [20] proposed a PB multiplier for a special case of pentanomials and have obtained its time delay and gate count. Although they have referred to it as the Mastrovito multiplier, their architecture is different from the original Mastrovito multiplier and uses the two steps of multiplication separately.

### 1.2 Scope of Our Work

In this paper, we present a new formulation for polynomial basis multiplication and then a generalized bit-parallel hardware architecture. We consider the time delay and gate count of the proposed multiplier as a function of degree $m$ and the reduction matrix $\mathbf{Q}$. Using the $\mathbf{Q}$ matrix, the complexities of multipliers based on special reduction polynomials, namely: 1) trinomials, 2) ESPs, and 3) two classes of pentanomials are obtained. We also present explicit formulas for multiplication for the above three special classes. These formulas maximize the number of intermediate signals that are reused. These formulas can be easily coded using hardware description languages such as VHDL or Verilog to implement an optimized multiplier. These codings can be done by a hardware designer without running an algorithm for precomputation or even having any knowledge of finite field arithmetic. In this paper, we also show that, for general irreducible polynomials, both the time delay and gate count of the proposed structures are, overall, lower than those available in the literature. Furthermore, these architectures have fewer routed signals and are suitable for VLSI implementation.

The organization of this paper is as follows: In Section 2, polynomial basis multiplication over $G F\left(2^{m}\right)$ and the Mastrovito multiplier in particular are considered. The new architecture and its complexities are introduced in Section 3. In Sections 4, 5, 6, and 7, optimized multiplication schemes using irreducible equally spaced polynomials, generic polynomials, trinomials, and pentanomials are respectively considered and comparisons between our architectures and other PB multipliers are made. Finally, conclusions are given in Section 8.

## 2 Polynomial Basis Multiplication over $G F\left(2^{m}\right)$

Let $P(x)=x^{m}+\sum_{i=0}^{m-1} p_{i} x^{i}$ be a monic irreducible polynomial over $G F(2)$ of degree $m$, where $p_{i} \in G F(2)$ for $i=0,1, \cdots, m-1$. Let $\alpha \in G F\left(2^{m}\right)$ be a root of $P(x)$, i.e., $P(\alpha)=0$. Then, the set $\left\{1, \alpha, \alpha^{2}, \cdots, \alpha^{m-1}\right\}$ is referred to as the polynomial or standard basis and each element of $G F\left(2^{m}\right)$ can be written with respect to the polynomial basis (PB). Let $A$ be an element in $\operatorname{GF}\left(2^{m}\right)$, then the representation of $A$ w.r.t. the PB is $A=\sum_{i=0}^{m-1} a_{i} \alpha^{i}, a_{i} \in\{0,1\}$, where the $a_{i}$'s are the coordinates. For convenience, these coordinates will be denoted in vector notation as $\mathbf{a}=\left[a_{0}, a_{1}, a_{2}, \cdots, a_{m-1}\right]^{T}$, where $T$ denotes the transposition. (In this paper, vectors and matrices are shown with small and capital bold faces, respectively.) Using this vector notation, the representation of $A$ can be written as $A=\boldsymbol{\alpha}^{T} \mathbf{a}$, where $\boldsymbol{\alpha}=\left[1, \alpha, \alpha^{2}, \cdots, \alpha^{m-1}\right]^{T}$. Let $S$ be the binary polynomial of degree not more than $2 m-2$ obtained by the direct multiplication of the PB representations of any two elements $A$ and $B$ of $G F\left(2^{m}\right)$, i.e.,

$$
\begin{equation*}
S=\left(\sum_{i=0}^{m-1} a_{i} \alpha^{i}\right)\left(\sum_{j=0}^{m-1} b_{j} \alpha^{j}\right)=\sum_{k=0}^{2 m-2} s_{k} \alpha^{k}, \tag{1}
\end{equation*}
$$

where

$$
\begin{equation*}
s_{k}=\sum_{i+j=k} a_{i} b_{j}, \quad 0 \leq i, j \leq m-1, \quad 0 \leq k \leq 2 m-2 \tag{2}
\end{equation*}
$$

Then, the product $C=A \cdot B$ can be obtained by the following modulo reduction:

$$
\begin{equation*}
C \triangleq \sum_{i=0}^{m-1} c_{i} \alpha^{i} \equiv S \bmod P(\alpha) \tag{3}
\end{equation*}
$$
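The two-step computation in (1)-(3) can be sketched in a few lines of Python. This is only an illustrative software model under stated assumptions: the function name `pb_multiply` and the list-based representation (index $i$ holds the coefficient of $\alpha^{i}$, and `p_low` holds $p_{0}, \cdots, p_{m-1}$) are hypothetical choices and are not part of the original formulation.

```python
def pb_multiply(a, b, p_low):
    """Multiply two GF(2^m) elements given as bit lists (index i <-> alpha^i).

    p_low = [p_0, ..., p_{m-1}] are the low coefficients of the monic P(x).
    Illustrative helper; the name and data layout are assumptions.
    """
    m = len(p_low)
    # Eq. (2): s_k is the GF(2) sum of a_i * b_j over all i + j = k.
    s = [0] * (2 * m - 1)
    for i in range(m):
        for j in range(m):
            s[i + j] ^= a[i] & b[j]
    # Eq. (3): reduce modulo P(alpha), using alpha^m = sum_i p_i alpha^i.
    # Working from the top degree down, alpha^k becomes alpha^(k-m) * alpha^m.
    for k in range(2 * m - 2, m - 1, -1):
        if s[k]:
            s[k] = 0
            for i in range(m):
                s[k - m + i] ^= p_low[i]
    return s[:m]


# Example with P(x) = x^4 + x^3 + 1: alpha * alpha^3 = alpha^4 = 1 + alpha^3.
product_bits = pb_multiply([0, 1, 0, 0], [0, 0, 0, 1], [1, 0, 0, 1])
```

Such a model is convenient as a reference when checking hardware-oriented formulas for the product coordinates.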

Using (1) and (3), the product coordinates, i.e., the $c_{i}$'s, are obtained in terms of the $a_{i}$'s, the $b_{i}$'s, and the irreducible polynomial $P(x)$. In [12], Mastrovito shows that these coordinates can be calculated using a matrix equation as follows:

$$
\begin{equation*}
\mathbf{c}=\mathbf{F b} \tag{4}
\end{equation*}
$$

where $\mathbf{b}=\left[b_{0}, b_{1}, \cdots, b_{m-1}\right]^{T}$ and $\mathbf{c}=\left[c_{0}, c_{1}, \cdots, c_{m-1}\right]^{T}$ are the vectors associated with $B$ and $C$, respectively. The exact definition of the product matrix $\mathbf{F}=\left[f_{i, j}\right]_{i, j=0}^{m-1}$ can be found in [13].

Remark 1. Matrix $\mathbf{F}$ is unique and depends on the multiplicand $A$ and the irreducible polynomial $P(x)$.

Using (4), an architecture for the Mastrovito multiplier is shown in Fig. 1a, which basically consists of two blocks, namely, $f$-network and IP-network. The $f$-network generates the entries of the product matrix $\mathbf{F}$. The IP-network performs the matrix-vector multiplication as shown in (4) and consists of $m$ inner product units, each generating one coordinate of $C$, i.e.,

$$
c_{i}=\left[f_{i, 0}, f_{i, 1}, \cdots, f_{i, m-1}\right]\left[b_{0}, b_{1}, \cdots, b_{m-1}\right]^{T}, 0 \leq i \leq m-1
$$

In Fig. 1a, block $I P(m)$ corresponds to an inner product unit which has two input vectors of $m$ elements each. Assuming that only two-input logic gates are used, $I P(k)$ for $k>0$, requires $k$ AND gates and $k-1$ XOR gates and has a gate delay of $T_{A}+\left\lceil\log _{2} k\right\rceil T_{X}$, where $T_{A}$ and $T_{X}$ correspond to the delays due to an AND and an XOR gate respectively (see Fig. 1b where $k=m$ ).

In Fig. 1a, there are two buses: the coordinates of $B$ and the interconnection bus IB which contains the coordinates of $A$. The interconnection bus IB carries the elements $f_{i, j}$ of $\mathbf{F}$ from the $f$-network to the IP-network. The number of lines on IB depends on the irreducible polynomial $P(x)$ and varies between $2 m-1$ (for trinomials) and $\frac{m(m+1)}{2}$ (for AOPs) [13].


Fig. 1. (a) Architecture of the Mastrovito multiplier over $G F\left(2^{m}\right)$. (b) Details of $I P(m)$.

## 3 An Efficient Multiplication Scheme

In this section, we first give a new formulation for multiplication over $G F\left(2^{m}\right)$. Using this formulation, we then present a bit parallel architecture for the multiplier. At the end, we give upper bounds of the space and time complexities of the architecture.

### 3.1 New Formulation

Definition 1 [12]. The reduction matrix $\mathbf{Q}$ is an $m-1$ by $m$ binary matrix which is obtained from

$$
\begin{equation*}
\boldsymbol{\alpha}^{\uparrow} \equiv \mathbf{Q} \boldsymbol{\alpha}(\bmod P(\alpha)) \tag{5}
\end{equation*}
$$

where $\boldsymbol{\alpha}^{\uparrow}=\left[\alpha^{m}, \alpha^{m+1}, \cdots, \alpha^{2 m-2}\right]^{T}$.

Remark 2. For each irreducible $P(x)$, the reduction matrix $\mathbf{Q}$ is unique.
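As a minimal sketch of Definition 1, the rows of $\mathbf{Q}$ can be generated by repeatedly multiplying by $\alpha$ and reducing with $\alpha^{m}=\sum_{i} p_{i} \alpha^{i}$. The function name `reduction_matrix` and the list layout (row $i$ holds the PB coordinates of $\alpha^{m+i}$) are assumptions made here for illustration.

```python
def reduction_matrix(p_low):
    """Return Q as a list of m-1 rows; row i holds the coordinates of alpha^(m+i).

    p_low = [p_0, ..., p_{m-1}] are the low coefficients of the monic P(x).
    Illustrative helper; the name and data layout are assumptions.
    """
    m = len(p_low)
    row = list(p_low)              # alpha^m = sum_i p_i alpha^i
    rows = [row]
    for _ in range(m - 2):
        carry = row[-1]            # coefficient of alpha^(m-1) before shifting
        row = [0] + row[:-1]       # multiply the previous row by alpha
        if carry:                  # an alpha^m term appeared: substitute p_low
            row = [r ^ p for r, p in zip(row, p_low)]
        rows.append(row)
    return rows


# P(x) = x^7 + x^5 + x^3 + x + 1, i.e., p_low = [1, 1, 0, 1, 0, 1, 0].
Q = reduction_matrix([1, 1, 0, 1, 0, 1, 0])
```

Because each row is a deterministic function of the previous row and of $P(x)$, the sketch also makes the uniqueness stated in Remark 2 easy to see.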

In order to present our new multiplication scheme, we introduce the following two Toeplitz matrices

$$
\begin{align*}
& \mathbf{L} \triangleq\left[\begin{array}{cccccc}
a_{0} & 0 & 0 & 0 & \cdots & 0 \\
a_{1} & a_{0} & 0 & 0 & \cdots & 0 \\
a_{2} & a_{1} & a_{0} & 0 & \cdots & 0 \\
\vdots & \vdots & \ddots & \ddots & \ddots & \vdots \\
a_{m-2} & a_{m-3} & \cdots & a_{1} & a_{0} & 0 \\
a_{m-1} & a_{m-2} & \cdots & a_{2} & a_{1} & a_{0}
\end{array}\right],  \tag{6}\\
& \mathbf{U} \triangleq\left[\begin{array}{cccccc}
0 & a_{m-1} & a_{m-2} & \cdots & a_{2} & a_{1} \\
0 & 0 & a_{m-1} & \cdots & a_{3} & a_{2} \\
\vdots & \vdots & \ddots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & 0 & a_{m-1} & a_{m-2} \\
0 & 0 & \cdots & 0 & 0 & a_{m-1}
\end{array}\right],
\end{align*}
$$

where $a_{i}$ s are the coordinates of $A$. Note that $\mathbf{L}$ is an $m \times m$ lower triangular matrix and $\mathbf{U}$ is an $(m-1) \times m$ upper triangular matrix. Now, define the following two vectors which are functions of $A$ and $B$ :

$$
\begin{equation*}
\mathbf{d}=\mathbf{L b}, \tag{7}
\end{equation*}
$$

$$
\begin{equation*}
\mathbf{e}=\mathbf{U b} \tag{8}
\end{equation*}
$$

Then, we can state the following theorem, which is the key step toward the development of a new architecture for the PB multiplication in $G F\left(2^{m}\right)$.

Theorem 1. Let $C$ be the product of $A$ and $B \in G F\left(2^{m}\right)$. Then,

$$
\begin{equation*}
\mathbf{c}=\mathbf{d}+\mathbf{Q}^{T} \mathbf{e} \tag{9}
\end{equation*}
$$

where $\mathbf{Q}, \mathbf{d}$, and $\mathbf{e}$ are defined in (5), (7), and (8), respectively.

Proof. In vector notation, (1) can be written as

$$
\begin{equation*}
S=\boldsymbol{\alpha}^{\uparrow \uparrow T} \mathbf{s} \tag{10}
\end{equation*}
$$

where

$$
\boldsymbol{\alpha}^{\uparrow \uparrow}=\left[1, \alpha, \cdots, \alpha^{2 m-2}\right]^{T}=\left[\begin{array}{c}
\boldsymbol{\alpha} \\
\boldsymbol{\alpha}^{\uparrow}
\end{array}\right]
$$

and $\mathbf{s}=\left[s_{0}, s_{1}, \cdots, s_{2 m-2}\right]^{T}$. Using (5), we have

$$
\boldsymbol{\alpha}^{\uparrow \uparrow} \equiv\left[\begin{array}{c}
\mathbf{I}_{m} \\
\mathbf{Q}
\end{array}\right] \boldsymbol{\alpha} \quad(\bmod P(\alpha))
$$

where $\mathbf{I}_{m}$ is the $m$ by $m$ unity matrix. Note that $d_{k}=s_{k}, 0 \leq k \leq m-1$, and $\quad e_{l}=s_{l+m}, 0 \leq l \leq m-2$, then

$$
\mathbf{s}=\left[\begin{array}{l}
\mathbf{d} \\
\mathbf{e}
\end{array}\right]=\left[\begin{array}{l}
\mathbf{L} \\
\mathbf{U}
\end{array}\right] \mathbf{b}
$$

and, using (10), $C$ is obtained as

$$
\begin{align*}
C & \equiv S(\bmod P(\alpha)) \\
& =\left(\left[\begin{array}{c}
\mathbf{I}_{m} \\
\mathbf{Q}
\end{array}\right] \boldsymbol{\alpha}\right)^{T}\left[\begin{array}{l}
\mathbf{L} \\
\mathbf{U}
\end{array}\right] \mathbf{b}=\boldsymbol{\alpha}^{T}\left(\left[\begin{array}{ll}
\mathbf{I}_{m} & \mathbf{Q}^{T}
\end{array}\right]\left[\begin{array}{l}
\mathbf{L} \\
\mathbf{U}
\end{array}\right]\right) \mathbf{b}  \tag{11}\\
& =\boldsymbol{\alpha}^{T}\left(\mathbf{L}+\mathbf{Q}^{T} \mathbf{U}\right) \mathbf{b}
\end{align*}
$$

Since $C=\boldsymbol{\alpha}^{T} \mathbf{c}$, (11) yields (9) and the proof is complete. $\square$
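Theorem 1 can be checked numerically with a short Python sketch that forms $\mathbf{d}=\mathbf{L b}$, $\mathbf{e}=\mathbf{U b}$, and $\mathbf{c}=\mathbf{d}+\mathbf{Q}^{T} \mathbf{e}$ directly from (6)-(9). The names `reduction_matrix` and `lcbp_multiply` and the bit-list representation are hypothetical and serve only this illustration.

```python
def reduction_matrix(p_low):
    """Rows of Q: row i holds the PB coordinates of alpha^(m+i), as in (5)."""
    m = len(p_low)
    row, rows = list(p_low), [list(p_low)]
    for _ in range(m - 2):
        carry, row = row[-1], [0] + row[:-1]
        if carry:
            row = [r ^ p for r, p in zip(row, p_low)]
        rows.append(row)
    return rows


def lcbp_multiply(a, b, p_low):
    """Compute c = d + Q^T e over GF(2); all vectors are bit lists."""
    m = len(p_low)
    Q = reduction_matrix(p_low)
    # Eq. (7): d = L b, where L[k][j] = a_{k-j} for j <= k.
    d = [0] * m
    for k in range(m):
        for j in range(k + 1):
            d[k] ^= a[k - j] & b[j]
    # Eq. (8): e = U b, where U[l][j] = a_{m+l-j} for j >= l + 1.
    e = [0] * (m - 1)
    for l in range(m - 1):
        for j in range(l + 1, m):
            e[l] ^= a[m + l - j] & b[j]
    # Eq. (9): add row l of Q to c whenever e_l = 1 (this is Q^T e).
    c = list(d)
    for l in range(m - 1):
        if e[l]:
            c = [ci ^ q for ci, q in zip(c, Q[l])]
    return c


# P(x) = x^4 + x^3 + 1: alpha^3 * alpha^3 = alpha^6 = 1 + alpha + alpha^2 + alpha^3.
c = lcbp_multiply([0, 0, 0, 1], [0, 0, 0, 1], [1, 0, 0, 1])
```

Comparing this function with a plain multiply-then-reduce routine over all operand pairs of a small field gives a quick sanity check of (9).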

### 3.2 Architecture

Using the formulation presented in the previous section, an architecture for polynomial basis multiplication over $G F\left(2^{m}\right)$ is shown in Fig. 2. This structure is hereafter referred to as the low complexity bit parallel (LCBP) multiplier. It is divided into two parts: IP-network and Q-network. The IP-network, which has $m$ blocks (denoted as $I_{0}, I_{1}, \cdots, I_{m-1}$ ), generates vectors $\mathbf{d}$ and $\mathbf{e}$ in accordance with (7) and (8). For $0 \leq i \leq m-2$, block $I_{i}$ consists of two inner product cells, namely, $I P(i+1)$ and $I P(m-i-1)$; however, the last block $I_{m-1}$ consists of only one such cell, namely, $I P(m)$.

Fig. 2. (a) Architecture of the LCBP multiplier over $G F\left(2^{m}\right)$. (b) Detail of cyclic shift block.

In Fig. 2, the Q-network takes $\mathbf{d}$ and $\mathbf{e}$ as inputs and generates $\mathbf{c}$. It consists of $m$ binary trees of XOR gates $\left(\mathrm{BTX}_{0 \cdots m-1}\right)$. The number of XOR gates in $\mathrm{BTX}_{i}, 0 \leq i \leq m-1$, is equal to the number of 1s in the $i$th column of the $\mathbf{Q}$ matrix. It is noted that the number of lines on the interconnection bus IB is fixed and is equal to the number of $e_{j}$'s, i.e., $m-1$. In Fig. 2a, there are three buses, $A, B$, and IB, and the total number of lines on these buses is $3 m-1$.

In order to illustrate the new multiplier structure, we consider the finite field of $G F\left(2^{4}\right)$ constructed by the irreducible polynomial $P(x)=x^{4}+x^{3}+1$. For this field, the circuit diagram based on the new multiplier structure is shown in Fig. 3. The total number of XOR gates of this figure can be reduced by reusing signals. This is considered later in this paper for special irreducible polynomials.
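For concreteness, the quantities of Theorem 1 can be written out for this field. The expressions below are worked out here directly from (5) and (7)-(9) as an illustration; they are not a transcription of Fig. 3. Since $\alpha^{4}=1+\alpha^{3}$, $\alpha^{5}=1+\alpha+\alpha^{3}$, and $\alpha^{6}=1+\alpha+\alpha^{2}+\alpha^{3}$, one obtains

$$
\mathbf{Q}=\left[\begin{array}{llll}
1 & 0 & 0 & 1 \\
1 & 1 & 0 & 1 \\
1 & 1 & 1 & 1
\end{array}\right], \quad
\begin{aligned}
c_{0} & =d_{0}+e_{0}+e_{1}+e_{2}, \\
c_{1} & =d_{1}+e_{1}+e_{2}, \\
c_{2} & =d_{2}+e_{2}, \\
c_{3} & =d_{3}+e_{0}+e_{1}+e_{2},
\end{aligned}
$$

where

$$
\begin{aligned}
d_{0} & =a_{0} b_{0}, & e_{0} & =a_{3} b_{1}+a_{2} b_{2}+a_{1} b_{3}, \\
d_{1} & =a_{1} b_{0}+a_{0} b_{1}, & e_{1} & =a_{3} b_{2}+a_{2} b_{3}, \\
d_{2} & =a_{2} b_{0}+a_{1} b_{1}+a_{0} b_{2}, & e_{2} & =a_{3} b_{3}, \\
d_{3} & =a_{3} b_{0}+a_{2} b_{1}+a_{1} b_{2}+a_{0} b_{3}. & &
\end{aligned}
$$

The columns of this $\mathbf{Q}$ contain 3, 2, 1, and 3 ones, so, before any signal reuse, the four XOR trees of the Q-network use 3, 2, 1, and 3 gates, respectively.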

### 3.3 Complexities

For the LCBP multiplier structure shown in Fig. 2, we now give its complexities, in terms of gate counts and time delay due to gates. For this purpose, let $\mathbf{q}_{j}, 0 \leq j \leq m-1$, be the $j$th column of the reduction matrix, i.e.,

$$
\mathbf{Q}=\left[\mathbf{q}_{0}, \mathbf{q}_{1}, \cdots, \mathbf{q}_{m-1}\right]
$$

and $H\left(\mathbf{q}_{j}\right)$ be the Hamming weight (i.e., the number of 1s) of $\mathbf{q}_{j}$. We denote $\theta$ as the maximum Hamming weight of a column of $\mathbf{Q}$, i.e.,

$$
\begin{equation*}
\theta=\max \left\{H\left(\mathbf{q}_{j}\right): \quad 0 \leq j \leq m-1\right\} \tag{12}
\end{equation*}
$$

and $H(\mathbf{Q})$ as the Hamming weight of $\mathbf{Q}$, i.e.,

$$
\begin{equation*}
H(\mathbf{Q})=\sum_{j=0}^{m-1} H\left(\mathbf{q}_{j}\right) \tag{13}
\end{equation*}
$$

Now, consider the IP-network of the multiplier in Fig. 2. Since each $I_{i}$ for $0 \leq i \leq m-2$ has $m$ AND and $(m-2)$ XOR gates and $I_{m-1}$ has $m$ AND and $(m-1)$ XOR gates, the IP-network has a total of $m^{2}$ AND gates and $(m-1)(m-2)+m-1=(m-1)^{2}$ XOR gates. For the Q-network, using (13), one can determine the maximum number of XOR gates needed as $H(\mathbf{Q})$.

Fig. 3. Architecture for the $G F\left(2^{4}\right)$ multiplier with $P(x)=x^{4}+x^{3}+1$.

To determine the time complexity of this architecture, we need to consider the time delays due to gates of the IP- as well as Q-networks. Using (7) and (8), the delays for $d_{j}, 0 \leq j \leq m-1$, and $e_{i}, 0 \leq i \leq m-2$, are given as

$$
\begin{align*}
T\left(d_{j}\right) & =T_{A}+\left\lceil\log _{2}(j+1)\right\rceil T_{X}, & & 0 \leq j \leq m-1,  \tag{14}\\
T\left(e_{i}\right) & =T_{A}+\left\lceil\log _{2}(m-i-1)\right\rceil T_{X}, & & 0 \leq i \leq m-2, \tag{15}
\end{align*}
$$

respectively. In the IP-network, the maximum gate delay is due to the $I_{m-1}$ cell and is equal to $T_{A}+\left\lceil\log _{2} m\right\rceil T_{X}$. Using (12), it is not difficult to see that the maximum gate delay in the Q-network is $\left\lceil\log _{2}(\theta+1)\right\rceil T_{X}$. In the worst case, a signal will have maximum delays of both the IP- and Q-networks. Thus, an upper bound for the time delay of the entire multiplier structure is $T_{C} \leq T_{A}+\left(\left\lceil\log _{2} m\right\rceil+\left\lceil\log _{2}(\theta+1)\right\rceil\right) T_{X}$. The following theorem summarizes the above results on the complexities of the proposed multiplier structure.

Theorem 2. For the LCBP multiplier, the number of two-input AND gates is

$$
\begin{equation*}
N_{A}=m^{2} \tag{16}
\end{equation*}
$$

and the number of XOR gates and time delay due to gates are upper bounded by

$$
\begin{gather*}
N_{X} \leq(m-1)^{2}+H(\mathbf{Q})  \tag{17}\\
T_{C} \leq T_{A}+\left(\left\lceil\log _{2} m\right\rceil+\left\lceil\log _{2}(\theta+1)\right\rceil\right) T_{X} \tag{18}
\end{gather*}
$$

The above theorem gives upper bounds for the number of XOR gates and time delay. However, the exact values can be obtained by designing a multiplier which is either highly space efficient or very fast. In order to minimize the number of XOR gates, the intermediate signals can be reused. This is illustrated in the following example.

### 3.4 An Example

We consider the field $G F\left(2^{7}\right)$ defined by the irreducible polynomial $P(x)=x^{7}+x^{5}+x^{3}+x+1$ for which gate and time complexities have been reported in [8]. For this irreducible polynomial, one has

$$
\mathbf{Q}=\left[\begin{array}{lllllll}
1 & 1 & 0 & 1 & 0 & 1 & 0 \\
0 & 1 & 1 & 0 & 1 & 0 & 1 \\
1 & 1 & 1 & 0 & 0 & 0 & 0 \\
0 & 1 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 & 0
\end{array}\right] \tag{19}
$$

Since $H(\mathbf{Q})=20$ and $\theta=4$, then, using (17) and (18), the upper bounds of XOR gate count and time delay are $N_{X} \leq(7-1)^{2}+20=56$ and

$$
T_{C} \leq T_{A}+\left(\left\lceil\log _{2} 7\right\rceil+\left\lceil\log _{2} 5\right\rceil\right) T_{X}=T_{A}+6 T_{X}
$$

respectively.
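The bounds of Theorem 2 depend only on $m$, $H(\mathbf{Q})$, and $\theta$, so they are easy to evaluate mechanically. The following Python sketch does so for a given $\mathbf{Q}$; the function names are hypothetical, and the delay is reported as the number of XOR levels that multiply $T_{X}$ in (18), with the single AND level left implicit.

```python
def ceil_log2(k):
    """Exact ceil(log2(k)) for an integer k >= 1."""
    return (k - 1).bit_length()


def lcbp_bounds(Q):
    """Return (N_A, bound on N_X, bound on XOR levels) following (16)-(18).

    Q is a list of m-1 rows of m bits each. Illustrative helper only.
    """
    m = len(Q[0])
    col_weights = [sum(row[j] for row in Q) for j in range(m)]  # H(q_j)
    h_q = sum(col_weights)        # H(Q), Eq. (13)
    theta = max(col_weights)      # Eq. (12)
    n_and = m * m                                  # Eq. (16)
    n_xor_max = (m - 1) ** 2 + h_q                 # Eq. (17)
    xor_levels_max = ceil_log2(m) + ceil_log2(theta + 1)  # Eq. (18)
    return n_and, n_xor_max, xor_levels_max


# Q of Eq. (19), for P(x) = x^7 + x^5 + x^3 + x + 1.
Q19 = [
    [1, 1, 0, 1, 0, 1, 0],
    [0, 1, 1, 0, 1, 0, 1],
    [1, 1, 1, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1, 1, 0],
]
bounds = lcbp_bounds(Q19)
```

For the matrix in (19) this evaluates to 49 AND gates, at most 56 XOR gates, and at most 6 XOR levels, in agreement with the figures above.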
Substituting (19) into Theorem 1, the coordinates of the product $C=A B$ over $G F\left(2^{7}\right)$ can be obtained as

$$
\begin{align*}
& c_{0}=d_{0}+e_{0}+e_{2} \\
& c_{1}=\left(d_{1}+e_{0}\right)+\left(e_{1}+\left(e_{2}+e_{3}\right)\right) \\
& c_{2}=\left(d_{2}+e_{4}\right)+\left(e_{1}+\left(e_{2}+e_{3}\right)\right) \\
& c_{3}=\left(d_{3}+e_{0}\right)+\left(e_{3}+\left(e_{4}+e_{5}\right)\right)  \tag{20}\\
& c_{4}=d_{4}+\left(e_{1}+\left(e_{4}+e_{5}\right)\right) \\
& c_{5}=d_{5}+e_{0}+e_{5} \\
& c_{6}=d_{6}+e_{1},
\end{align*}
$$

where $d_{j}, 0 \leq j \leq 6$, and $e_{i}, 0 \leq i \leq 5$, are from (7) and (8), respectively. Note that the brackets in (20) show the order of modulo two addition which defines the position of XOR gates in the Q-network. Since we reuse partial sums $\left(e_{1}+\left(e_{2}+e_{3}\right)\right)$ and $\left(e_{4}+e_{5}\right)$ in (20), for the realization of (20), 17 XOR gates are needed in the Q-network and the total number of XOR gates of the entire multiplier is $(7-1)^{2}+17=53$. Also, since the time delays of $d_{j}, 0 \leq j \leq 6$, and $e_{i}, 0 \leq i \leq 5$, are $T_{A}+\left\lceil\log _{2}(j+1)\right\rceil T_{X}$ and $T_{A}+\left\lceil\log _{2}(6-i)\right\rceil T_{X}$, respectively, the time delay of the entire multiplier is $T_{C}=T_{A}+5 T_{X}$.
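The signal reuse in (20) can be mimicked in software by computing each shared partial sum once and counting every two-input XOR. The sketch below is a behavioral model only; the function names and the counting wrapper are assumptions introduced for illustration, and $\mathbf{d}$ and $\mathbf{e}$ are obtained from the unreduced product coefficients.

```python
def de_vectors(a, b):
    """Return (d, e): the low m and high m-1 coefficients of the plain product."""
    m = len(a)
    s = [0] * (2 * m - 1)
    for i in range(m):
        for j in range(m):
            s[i + j] ^= a[i] & b[j]
    return s[:m], s[m:]


def multiply_eq20(a, b):
    """Evaluate Eq. (20) for GF(2^7), P(x) = x^7 + x^5 + x^3 + x + 1.

    Returns the product bits and the number of Q-network XORs used.
    """
    d, e = de_vectors(a, b)
    xor_count = 0

    def x(u, v):
        nonlocal xor_count
        xor_count += 1            # one two-input XOR gate
        return u ^ v

    t23 = x(e[2], e[3])           # inner term of the shared sum
    t123 = x(e[1], t23)           # shared by c_1 and c_2
    t45 = x(e[4], e[5])           # shared by c_3 and c_4
    c = [
        x(x(d[0], e[0]), e[2]),
        x(x(d[1], e[0]), t123),
        x(x(d[2], e[4]), t123),
        x(x(d[3], e[0]), x(e[3], t45)),
        x(d[4], x(e[1], t45)),
        x(x(d[5], e[0]), e[5]),
        x(d[6], e[1]),
    ]
    return c, xor_count


# alpha^6 * alpha = alpha^7 = 1 + alpha + alpha^3 + alpha^5.
c_bits, q_network_xors = multiply_eq20([0, 0, 0, 0, 0, 0, 1], [0, 1, 0, 0, 0, 0, 0])
```

The counter reproduces the 17 XOR gates stated for the Q-network, and the returned coordinates can be compared against a multiply-then-reduce reference for all 128 × 128 operand pairs.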

In the following sections, we attempt to minimize the number of XOR gates for special irreducible polynomials, namely, equally spaced polynomials, trinomials, and pentanomials. The LCBP multipliers for the above-mentioned irreducible polynomials are achieved by properly defining some intermediate signals and then reusing them as much as possible. We start with equally spaced polynomials which are very structured and will help us present the remaining special cases with fewer difficulties.

## 4 Multipliers Using Equally Spaced Polynomials

Definition 2. A polynomial

$$
\begin{equation*}
P(x)=x^{n s}+x^{(n-1) s}+\cdots+x^{s}+1, \tag{21}
\end{equation*}
$$

over $G F(2)$, with $n s=m$, is called an equally spaced polynomial (denoted as $s$-ESP) of degree $m$.


Fig. 4. Graphical representation of the locations of nonzero entries of $\mathbf{Q}$ for $s$-ESP $P(x)=x^{n s}+x^{(n-1) s}+\cdots+x^{s}+1, m=n s$. (a) $1<s<\frac{m}{2}$. (b) $s=1$ (AOP). (c) $s=\frac{m}{2}$ (trinomial).

TABLE 1
Comparison of Related s-ESP-Based Polynomial Basis Multipliers

| Reference | \#AND | \#XOR | Time delay |
| :---: | :---: | :---: | :---: |
| Itoh-Tsujii $[10]$ | $(m+s)^{2}$ | $(m+s)^{2}-s$ | $T_{A}+\left(\left\lceil\log _{2} m\right\rceil+\left\lceil\log _{2}(m+s+1)\right\rceil\right) T_{X}$ |
| Hasan et al. $[9]$ | $m^{2}$ | $m^{2}+m-2 s$ | $T_{A}+\left(\frac{m}{s}+\left\lceil\log _{2} m\right\rceil\right) T_{X}$ |
| Mastrovito $[12,13]$ | $m^{2}$ | $\frac{2 s+1}{2 s} m^{2}-\frac{3}{2} m$ | $T_{A}+\left(1+\left\lceil\log _{2} m\right\rceil\right) T_{X}$ |
| Halbutogullari-Koc $[8]$ | $m^{2}$ | $m^{2}-s$ | $T_{A}+\left(1+\left\lceil\log _{2} m\right\rceil\right) T_{X}$ |
| Zhang-Parhi $[28]$ | $m^{2}$ | $m^{2}-s$ | $T_{A}+\left(1+\left\lceil\log _{2} m\right\rceil\right) T_{X}$ |
| Presented Here | $m^{2}$ | $m^{2}-s$ | $T_{A}+\left(1+\left\lceil\log _{2} m\right\rceil\right) T_{X}$ |

An $s$-ESP is a self-reciprocal polynomial. In (21), both $n$ and $s$ are integers and $1 \leq s \leq \frac{m}{2}$. When $s=1$, we have 1-ESP and it is the same as the all-one polynomial (AOP). The latter has the highest Hamming weight among all polynomials of degree $m$. On the other hand, $s=\frac{m}{2}$ results in the least Hamming weight irreducible polynomial (i.e., trinomial) of degree $m$.

Theorem 3. For an $s$-ESP based LCBP multiplier over $G F\left(2^{m}\right)$, the gate counts, time delay, and number of lines on the buses are $N_{A}=m^{2}, N_{X}=m^{2}-s, T_{C}=T_{A}+\left(1+\left\lceil\log _{2} m\right\rceil\right) T_{X}$, and $N_{L}=2 m+s$, respectively.

Proof. When $\alpha$ is a root of the $s$-ESP in (21), we have

$$
\alpha^{m+i}= \begin{cases}\alpha^{i}+\alpha^{s+i}+\cdots+\alpha^{(n-1) s+i}, & 0 \leq i<s, \\ \alpha^{i-s}, & s \leq i \leq m-2 .\end{cases} \tag{22}
$$

Using (22), the reduction matrix $\mathbf{Q}$ is obtained as

$$
\mathbf{Q}=\left[\begin{array}{c}
\begin{array}{cccc}
\mathbf{I}_{s} & \mathbf{I}_{s} & \cdots & \mathbf{I}_{s}
\end{array} \\
\begin{array}{cc}
\mathbf{I}_{m-s-1} & \mathbf{0}_{s+1}
\end{array}
\end{array}\right] \tag{23}
$$

where $\mathbf{I}_{j}$ is the $j \times j$ unity matrix and $\mathbf{0}_{s+1}$ is a zero matrix which has $m-s-1$ rows and $s+1$ columns.

The graphical representations of $\mathbf{Q}$ in (23) for different values of $s$ are shown in Fig. 4. In this figure, nonzero entries of $\mathbf{Q}$ are shown with the small squares.

In order to obtain exact expressions for $N_{X}$ and $T_{C}$, first we attempt to obtain the coordinates of $C$. Substituting (23) into Theorem 1, one can write

$$
\begin{equation*}
c_{j}=d_{j}^{\prime}+e_{j \bmod s}, \quad 0 \leq j \leq m-1, \tag{24}
\end{equation*}
$$

where

$$
d_{j}^{\prime}= \begin{cases}d_{j}+e_{j+s}, & 0 \leq j \leq m-s-2, \\ d_{j}, & m-s-1 \leq j \leq m-1 .\end{cases} \tag{25}
$$

Thus, using (24) and (25), the exact XOR gate count for an $s$-ESP based multiplier is $N_{X}=m^{2}-s$: the IP-network contributes $(m-1)^{2}$ XOR gates, (25) adds $m-s-1$, and (24) adds $m$. Referring to Fig. 2, note that the gate delays to generate $d_{j}, 0 \leq j \leq m-1$, and $e_{i}, 0 \leq i \leq m-2$, are $T_{A}+\left\lceil\log _{2}(j+1)\right\rceil T_{X}$ and $T_{A}+\left\lceil\log _{2}(m-i-1)\right\rceil T_{X}$, respectively. Thus, $d_{j}^{\prime}$ of (25) can be generated with a maximum delay of $T_{A}+\left\lceil\log _{2} m\right\rceil T_{X}$. Although this changes the architecture for the LCBP multiplier slightly, now each $c_{j}, 0 \leq j \leq m-1$, has a maximum delay of $T_{A}+\left(1+\left\lceil\log _{2} m\right\rceil\right) T_{X}$.

It is worth mentioning that the resultant number of bus lines on IB reduces from $m-1$ to $s$. This corresponds to $e_{0}$ up to $e_{s-1}$ as used in (24). It is noted that $e_{j}$ for $s \leq j \leq m-2$ is not considered as a bus line because it is used only once in the multiplication formulations, i.e., (24) and (25). Thus, the total number of lines on the buses for the multiplier is $2 m+s$.

Table 1 compares the proposed ESP-based multiplier with a number of existing multipliers of the same kind. As seen in the table, our gate count and time delay match the best ones available in the literature.
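As an illustration of how (24) and (25) translate into a direct computation, the following Python sketch evaluates the product coordinates for an $s$-ESP from $\mathbf{d}$ and $\mathbf{e}$. The function name `esp_multiply` and the bit-list representation are assumptions; the sketch is a behavioral model, not a gate-level description.

```python
def esp_multiply(a, b, s):
    """Multiply in GF(2^m) defined by the s-ESP of degree m = len(a).

    Implements c_j = d'_j + e_{j mod s} with d'_j from Eq. (25).
    Illustrative helper; name and data layout are assumptions.
    """
    m = len(a)
    if m % s != 0 or 2 * s > m:
        raise ValueError("s must divide m and satisfy 1 <= s <= m/2")
    # d and e are the low and high parts of the unreduced product, Eqs. (7), (8).
    full = [0] * (2 * m - 1)
    for i in range(m):
        for j in range(m):
            full[i + j] ^= a[i] & b[j]
    d, e = full[:m], full[m:]
    # Eq. (25): fold e_{j+s} into d_j where the index exists.
    d_prime = [d[j] ^ e[j + s] if j <= m - s - 2 else d[j] for j in range(m)]
    # Eq. (24): each coordinate needs one more XOR with e_{j mod s}.
    return [d_prime[j] ^ e[j % s] for j in range(m)]


# AOP of degree 4 (s = 1): alpha^2 * alpha^3 = alpha^5 = 1.
c = esp_multiply([0, 0, 1, 0], [0, 0, 0, 1], 1)
```

Only $e_{0}, \cdots, e_{s-1}$ are used more than once in this computation, which mirrors the reduction of the interconnection bus to $s$ lines noted above.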

## 5 Extension to More Generic Polynomials

Here, we consider irreducible polynomials of the form $P(x)=x^{m}+x^{k_{t}}+\cdots+x^{k_{2}}+x^{k_{1}}+1$, where $1 \leq k_{1}<k_{2}<\cdots<k_{t} \leq \frac{m}{2}$. The Hamming weight of $P(x)$ is $t+2$ and the degree of the second leading term is less than or equal to $\frac{m}{2}$. All five binary fields recommended by NIST for ECDSA can be constructed by such irreducible polynomials [15].

In order to apply the general formulation stated in Section 3 to these polynomials, first we obtain the corresponding $\mathbf{Q}$ matrix. Note that all the rows of the $\mathbf{Q}$ matrix are the PB representations of $\alpha^{m+i}, 0 \leq i \leq m-2$, where $\alpha$ is a root of $P(x)$. Since $P(\alpha)=0$, then $\alpha^{m}=1+\alpha^{k_{1}}+\alpha^{k_{2}}+\cdots+\alpha^{k_{t}}$. Thus, row 0 of $\mathbf{Q}$ has 1 s in


Fig. 5. Graphical representations of the reduction matrix $\mathbf{Q}$ for a trinomial ( $t=1$ ): (a) $k=k_{1}=1$, (b) $1<k<\frac{m}{2}$ (see Fig. 4 c for $k_{1}=\frac{m}{2}$ ), and a pentanomial $\left(t=3\right.$ ): (c) $k_{1}=1$, (d) $1<k_{1}<k_{2}<k_{3} \leq \frac{m}{2}$.

$T_{C}=T_{A}+\left(\left\lceil\log _{2}(t+1)\right\rceil+\left\lceil\log _{2}\left(\left\lceil\frac{t}{2}\right\rceil+1\right)\right\rceil+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}$, and $N_{L}=3 m+k_{t}-k_{1}-2$, respectively.

Proof. Let us denote $\mathbf{e}^{(i)}=\left[e_{0}^{(i)}, e_{1}^{(i)}, \cdots, e_{m-1}^{(i)}\right]^{T}=\mathbf{Q}_{i}^{T} \mathbf{e}$, $0 \leq i \leq t$, then, using Theorem 1, we can obtain the coordinates of $C$ as

$$
\begin{equation*}
\mathbf{c}=\mathbf{d}+\mathbf{e}^{(0)}+\mathbf{e}^{(1)}+\mathbf{e}^{(2)}+\cdots+\mathbf{e}^{(t)} \tag{26}
\end{equation*}
$$

First, let us assume that $k_{1} \neq 1$. Using $\mathbf{Q}_{0}$ (see Fig. 6a for $t=3$ ), the elements of $\mathbf{e}^{(0)}$ can be written as follows:
$$
e_{j}^{(0)}= \begin{cases}e_{j}+e_{j+m-k_{t}}+\cdots+e_{j+m-k_{2}}+e_{j+m-k_{1}}, & \text { if } 0 \leq j \leq k_{1}-2, \\ e_{j}+e_{j+m-k_{t}}+\cdots+e_{j+m-k_{2}}, & \text { if } k_{1}-1 \leq j \leq k_{2}-2, \\ \vdots & \vdots \\ e_{j}+e_{j+m-k_{t}}, & \text { if } k_{t-1}-1 \leq j \leq k_{t}-2, \\ e_{j}, & \text { if } k_{t}-1 \leq j \leq m-2, \\ 0, & \text { if } j=m-1 .\end{cases} \tag{27}
$$

TABLE 2
Comparison of Related Polynomial Basis Multipliers for $P(x)=x^{m}+x^{k_{t}}+\cdots+x^{k_{2}}+x^{k_{1}}+1,1 \leq k_{1}<k_{2}<\cdots<k_{t} \leq \frac{m}{2}$

| Reference | \#AND | \#XOR | Time delay |
| :---: | :---: | :---: | :---: |
| Zhang-Parhi $[28]$ | $m^{2}$ | $(m+t)(m-1)$ | $T_{A}+\left(2 t+\left\lceil\log _{2} m\right\rceil\right) T_{X}$ |
| Presented here | $m^{2}$ | $(m+t)(m-1)$ | $T_{A}+\left(\left\lceil\log _{2}(t+1)\right\rceil+\left\lceil\log _{2}\left(\left\lceil\frac{t}{2}\right\rceil+1\right)\right\rceil+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}$ |

For $0 \leq j \leq k_{t}-2$, the total number of XOR gates needed to form the $e_{j}^{(0)}$'s is

$$
\begin{aligned}
N_{1} & =t\left(k_{1}-1\right)+(t-1)\left(k_{2}-k_{1}\right)+\cdots+\left(k_{t}-k_{t-1}\right) \\
& =\sum_{i=1}^{t} k_{i}-t .
\end{aligned}
$$

Let $T\left(e_{j}^{(0)}\right)$ denote the time delay due to the gates to generate $e_{j}^{(0)}$. As seen in (27), the longest delay is due to $e_{0}^{(0)}=e_{0}+e_{m-k_{t}}+\cdots+e_{m-k_{2}}+e_{m-k_{1}}$, i.e., $T\left(e_{j}^{(0)}\right) \leq T\left(e_{0}^{(0)}\right)$. In order to reduce this delay, we first add the terms other than $e_{0}$ in pairs, e.g., $e_{m-k_{j}}+e_{m-k_{i}}, 1 \leq i, j \leq t, \quad i \neq j$. Then, we add these $\left\lceil\frac{t}{2}\right\rceil$ terms/signals to $e_{0}$ using a binary tree of XOR gates. Since $T\left(e_{j}\right)=T_{A}+\left\lceil\log _{2}(m-j-1)\right\rceil T_{X}$, then

$$
\begin{aligned}
T\left(e_{m-k_{j}}+e_{m-k_{i}}\right) & \leq T_{X}+T\left(e_{m-k_{t}}\right) \\
& =T_{A}+\left(1+\left\lceil\log _{2}\left(k_{t}-1\right)\right\rceil\right) T_{X} \\
& \leq T_{A}+\left\lceil\log _{2}(m-1)\right\rceil T_{X},
\end{aligned}
$$

where the last inequality is due to $k_{t} \leq \frac{m}{2}$. Thus, we have

$$
\begin{align*}
& T\left(e_{j}^{(0)}\right) \leq \\
& \begin{cases}T_{A}+\left(\left\lceil\log _{2}\left(\left\lceil\frac{t}{2}\right\rceil+1\right)\right\rceil\right. \\
\left.+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}, & \text { if } 0 \leq j \leq k_{t}-2 \\
T_{A}+\left\lceil\log _{2}(m-1)\right\rceil T_{X} & \text { if } k_{t}-1 \leq j \leq m-2 .\end{cases} \tag{28}
\end{align*}
$$

By reusing the terms $e_{j}^{(0)}$ s, the coordinates of $\mathbf{e}^{(i)}$, for $1 \leq i \leq t$, can be obtained as

$$
e_{j}^{(i)}= \begin{cases}0, & \text { if } 0 \leq j \leq k_{i}-1  \tag{29}\\ e_{j-k_{i}}^{(0)} & \text { otherwise } .\end{cases}
$$

Equations (29) and (26) result in the following:

$$
c_{j}=d_{j}+ \begin{cases}e_{j}^{(0)} & \text { if } 0 \leq j \leq k_{1}-1 \\ e_{j}^{(0)}+e_{j}^{(1)} & \text { if } k_{1} \leq j \leq k_{2}-1 \\ \vdots & \vdots \\ e_{j}^{(0)}+e_{j}^{(1)}+\cdots+e_{j}^{(t-1)} & \text { if } k_{t-1} \leq j \leq k_{t}-1 \\ e_{j}^{(0)}+e_{j}^{(1)}+\cdots+e_{j}^{(t)} & \text { if } k_{t} \leq j \leq m-2 \\ e_{j}^{(1)}+e_{j}^{(2)}+\cdots+e_{j}^{(t)} & \text { if } j=m-1 .\end{cases} \tag{30}
$$

To realize (30) in hardware, one requires

$$
\begin{aligned}
N_{2} & =m+\left(k_{2}-k_{1}\right)+2\left(k_{3}-k_{2}\right)+\cdots+(t-1)\left(k_{t}-k_{t-1}\right) \\
& \quad+t\left(m-k_{t}-1\right)+t-1 \\
& =(t+1) m-\sum_{i=1}^{t} k_{i}-1
\end{aligned}
$$

XOR gates. Thus, the total number of XOR gates needed for the multiplier is $(m-1)^{2}+N_{1}+N_{2}=(m+t)(m-1)$.
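The two counts above can be checked numerically. The following is a minimal Python sketch that tallies $N_{1}$ and $N_{2}$ range by range, following (27) and (30), and confirms that the total equals $(m+t)(m-1)$. The function name `xor_gate_counts` and the list argument `ks` (holding $k_{1}, \ldots, k_{t}$) are illustrative choices and not notation from this paper.

```python
def xor_gate_counts(m, ks):
    """Return (N1, N2, total XOR gates) for P(x) = x^m + x^{k_t} + ... + x^{k_1} + 1.

    ks = [k_1, ..., k_t] with 1 <= k_1 < ... < k_t <= m/2 (hypothetical interface).
    """
    t = len(ks)
    assert ks[0] >= 1 and 2 * ks[-1] <= m
    assert all(a < b for a, b in zip(ks, ks[1:]))

    # N1: in (27), the range k_i - 1 <= j <= k_{i+1} - 2 needs (t - i) XORs per entry.
    n1, prev = 0, 1
    for i, k in enumerate(ks):
        n1 += (t - i) * (k - prev)
        prev = k

    # N2: in (30), one XOR per coordinate for d_j, plus the extra e^{(i)} terms.
    n2 = m
    for i in range(1, t):
        n2 += i * (ks[i] - ks[i - 1])
    n2 += t * (m - ks[-1] - 1) + (t - 1)

    total = (m - 1) ** 2 + n1 + n2
    assert total == (m + t) * (m - 1)  # closed form stated in the text
    return n1, n2, total


if __name__ == "__main__":
    print(xor_gate_counts(8, [1, 3, 4]))
```

For instance, with $m=8$ and $(k_{1}, k_{2}, k_{3})=(1,3,4)$ the sketch gives $N_{1}=5$, $N_{2}=23$, and a total of $77=(8+3)(8-1)$ XOR gates.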

To obtain the time delay of the proposed multiplier, we assume a binary tree of XOR gates for each coordinate in (30). For $j \notin\left[k_{t}, m-2\right]$, it can be seen from (30) that $T_{C} \leq\left\lceil\log _{2}(t+1)\right\rceil T_{X}+T\left(e_{0}^{(0)}\right)$ and the proof is complete by using (28).

Now, we need only obtain the time delay of $c_{j}$s for $k_{t} \leq j \leq m-2$. For $j \in\left[k_{t}, m-2\right]$, if we form $c_{j}=\left(d_{j}+e_{j}^{(0)}\right)+e_{j}^{(1)}+e_{j}^{(2)}+\cdots+e_{j}^{(t)}$ such that $d_{j}+e_{j}^{(0)}$ is calculated first, then

$$
\begin{align*}
T\left(d_{j}+e_{j}^{(0)}\right) & \leq T_{A}+\left(1+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X} \\
& \leq T_{A}+\left(\left\lceil\log _{2}\left(\left\lceil\frac{t}{2}\right\rceil+1\right)\right\rceil+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X} \tag{31}
\end{align*}
$$

Also, using (29) and (28), one can see $T\left(e_{j}^{(t)}\right) \leq T_{A}+\left(\left\lceil\log _{2}\left(\left\lceil\frac{t}{2}\right\rceil+1\right)\right\rceil+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}$, which implies that

$$
\begin{gathered}
T_{C} \leq T_{A}+\left(\left\lceil\log _{2}(t+1)\right\rceil+\left\lceil\log _{2}\left(\left\lceil\frac{t}{2}\right\rceil+1\right)\right\rceil\right. \\
\left.+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}
\end{gathered}
$$

and the proof is complete.

In addition to the three buses shown in Fig. 2, now there will be another bus in the middle of the Q-network for signals $e_{j}^{(0)}, 0 \leq j \leq k_{t}-2$. Also note that the signal $e_{j}$, $0 \leq j \leq k_{1}-1$, is used once in (27) and (30). Thus, the total number of bus lines is $3 m+k_{t}-k_{1}-2$.

Corollary 1. For $k_{1}=1$ and $t>1$, the time delay would reduce to

$$
T_{A}+\left(\left\lceil\log _{2}(t+1)\right\rceil+\left\lceil\log _{2}\left\lceil\frac{t}{2}\right\rceil\right\rceil+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}
$$

For this special case of irreducible polynomials, our multiplier has the same gate complexities with shorter time delay compared to the Mastrovito multiplier reported in [28]. This comparison is shown in Table 2.
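The delay expressions in Table 2 and Corollary 1 count XOR levels that follow a single AND level. A minimal Python sketch of those counts is given below, assuming the general bound holds for any admissible $t$ and $k_{1}$ and that the reduced bound of Corollary 1 applies when $k_{1}=1$ and $t>1$. The helper names `clog2`, `lcbp_xor_levels`, and `zhang_parhi_xor_levels` are hypothetical.

```python
def clog2(n):
    """ceil(log2(n)) for an integer n >= 1, computed without floating point."""
    return (n - 1).bit_length()


def lcbp_xor_levels(m, t, k1):
    """XOR levels of the proposed multiplier (Table 2 bound, or Corollary 1 if k1 == 1 and t > 1)."""
    half = (t + 1) // 2  # ceil(t / 2)
    if k1 == 1 and t > 1:
        middle = clog2(half)        # Corollary 1
    else:
        middle = clog2(half + 1)    # general bound
    return clog2(t + 1) + middle + clog2(m - 1)


def zhang_parhi_xor_levels(m, t):
    """XOR levels listed for the multiplier of [28] in Table 2."""
    return 2 * t + clog2(m)


if __name__ == "__main__":
    for k1 in (1, 2):
        print(k1, lcbp_xor_levels(163, 3, k1), zhang_parhi_xor_levels(163, 3))
```

For $m=163$ and $t=3$, the sketch gives 12 XOR levels in general and 11 for $k_{1}=1$, against 14 for the expression listed for [28].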

## 6 TRINOMIALS

Let $P(x)=x^{m}+x^{k}+1$ be an irreducible trinomial generating $G F\left(2^{m}\right)$. Trinomial $P(x)$ has only three nonzero coefficients and (for $m>1$ ) no binary irreducible polynomial can have any fewer nonzero coefficients. Since low Hamming weight polynomials can potentially reduce the space and time complexities of a finite field multiplier, irreducible trinomials have drawn significant attention in the past. Reference [22] lists an irreducible trinomial for every degree $m(\leq 10,000)$ for which such a polynomial exists.


Fig. 7. Graphical representations of the reduction matrix $\mathbf{Q}$ for trinomial $P(x)=x^{m}+x^{k}+1$. (a) $\frac{m}{2}<k<m-1, r=\left\lfloor\frac{m-2}{m-k}\right\rfloor$, (b) $k=m-1$ (see Fig. 5a, Fig. 5b, and Fig. 4c for $k=1,1<k<\frac{m}{2}$, and $k=\frac{m}{2}$, respectively).

Now, we derive $\mathbf{Q}$ for the trinomial to obtain the complexities of the LCBP multiplier. The graphical representations of the locations of the nonzeros of $\mathbf{Q}$ for irreducible trinomials with $k=1$ and $1<k<\frac{m}{2}$ have already been shown in Fig. 5a and Fig. 5b, respectively. Similarly, $\mathbf{Q}$ can be obtained for $\frac{m}{2}<k<m-1$ and $k=m-1$, and the graphical representations of the locations of its nonzeros are shown in Fig. 7a and Fig. 7b, respectively. Using these representations of $\mathbf{Q}$, we can state the following theorem.

Theorem 5. The number of XOR gates and the time delay of the LCBP multiplier based on the trinomial $x^{m}+x^{k}+1$ are

$$
N_{X}=\left\{\begin{array}{cl}
m^{2}-\frac{m}{2}, & \text { for } k=\frac{m}{2} \\
m^{2}-1, & \text { otherwise }
\end{array}\right.
$$

and

$$
T_{C}= \begin{cases}T_{A}+\left(2+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}, & \text { for } 1 \leq k<\frac{m}{2}, \\ T_{A}+\left(1+\left\lceil\log _{2} m\right\rceil\right) T_{X}, & \text { for } k=\frac{m}{2}, \\ T_{A}+\left(1+\left\lfloor\frac{m-2}{m-k}\right\rfloor+\left\lceil\log _{2}\left(m-1-\left\lfloor\frac{k-2}{m-k}\right\rfloor(m-k)\right)\right\rceil\right) T_{X}, & \text { for } \frac{m}{2}<k<m .\end{cases}
$$

Proof. Based on the results obtained in Section 5, one can obtain the time delay and the number of XOR gates by substituting $t=1$ in Theorem 4, for $1 \leq k<\frac{m}{2}$. Since the trinomial with $k=\frac{m}{2}$, i.e., $P(x)=x^{m}+x^{\frac{m}{2}}+1$, is an $\frac{m}{2}$-ESP, the complexities of the multiplier based on this type of trinomial can also be obtained from Theorem 3 with $s=\frac{m}{2}$.

Below, we discuss trinomials with $k>\frac{m}{2}$.
Case: $\frac{m}{2}<k \leq m-1$.
Using Fig. 7a and Theorem 5, one can generate the coordinates of $C$ as

$$
c_{j}=d_{j}+ \begin{cases}e_{j}^{\prime} & \text { for } 0 \leq j \leq k-2 \\ e_{k-1} & \text { for } j=k-1 \\ e_{j}+e_{j-k}^{\prime} & \text { for } k \leq j \leq m-2 \\ e_{m-k-1}^{\prime} & \text { for } j=m-1\end{cases} \tag{32}
$$

where $e_{j}^{\prime}$ can be obtained recursively from $j=k-2$ down to $j=0$ as follows:

$$
e_{j}^{\prime}= \begin{cases}e_{j}+e_{j+m-k}, & \text { for } k-2 \geq j \geq 2 k-m-1,  \tag{33}\\ e_{j}+e_{j+m-k}^{\prime}, & \text { for } 2 k-m-2 \geq j \geq 0\end{cases}
$$

These require the same number of XOR gates as in the case of $1 \leq k<\frac{m}{2}$, which is $m^{2}-1$. Also, the time delay of the multiplier is determined by

$$
\begin{aligned}
c_{k}=(( & \left.d_{k}+e_{k}\right)+\left(e_{0}+\left(e_{m-k}\right.\right. \\
& \left.\left.\left.+\cdots+\left(e_{(r-1)(m-k)}+e_{r(m-k)}\right) \cdots\right)\right)\right)
\end{aligned}
$$

where $r=\left\lfloor\frac{m-2}{m-k}\right\rfloor$. Thus, $T_{C}=(r+1) T_{X}+T\left(e_{(r-1)(m-k)}\right)$ and, using (15), the total time delay of the multiplier is

$$
\begin{aligned}
& T_{C}=T_{A} \\
& +\left(1+\left\lfloor\frac{m-2}{m-k}\right\rfloor+\left\lceil\log _{2}\left(m-1-\left\lfloor\frac{k-2}{m-k}\right\rfloor(m-k)\right)\right\rceil\right) T_{X} .
\end{aligned}
$$

Note that, for $k=m-1$, (33) becomes

$$
e_{j}^{\prime}= \begin{cases}\sum_{i=j}^{m-2} e_{i}, & \text { for } 0 \leq j \leq m-2, \\ e_{0}^{\prime}, & \text { for } j=m-1,\end{cases}
$$

which requires the same number of XOR gates and the corresponding delay is $T_{A}+m T_{X}$.

In [25], a trinomial based multiplier for the cases of $1 \leq k \leq \frac{m}{2}$ has been proposed. For these values of $k$, the above results match those reported in [25]. Table 3 compares the presented multiplier with other trinomial-based multipliers. As shown in this table, the proposed multiplier has the same gate complexities as the Mastrovito multiplier. For $k=1$, the proposed multiplier has a time delay which is longer by $T_{X}$ than the Mastrovito multiplier. However, for the other values of $k$, i.e., $1<k<m$, it has the same or shorter delay compared to the Mastrovito multiplier.
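The delay of Theorem 5 is easy to tabulate for a given pair $(m, k)$. The following minimal Python sketch evaluates the number of XOR levels (the coefficient of $T_{X}$) in each of the three ranges of $k$. The function names `clog2` and `trinomial_xor_levels` are hypothetical, and the sketch does not check whether $x^{m}+x^{k}+1$ is irreducible.

```python
def clog2(n):
    """ceil(log2(n)) for an integer n >= 1, computed without floating point."""
    return (n - 1).bit_length()


def trinomial_xor_levels(m, k):
    """Coefficient of T_X in T_C of Theorem 5 for x^m + x^k + 1, 1 <= k <= m - 1."""
    assert 1 <= k <= m - 1
    if 2 * k < m:                       # 1 <= k < m/2
        return 2 + clog2(m - 1)
    if 2 * k == m:                      # k = m/2 (equally spaced polynomial)
        return 1 + clog2(m)
    r = (m - 2) // (m - k)              # m/2 < k < m
    inner = m - 1 - ((k - 2) // (m - k)) * (m - k)
    return 1 + r + clog2(inner)


if __name__ == "__main__":
    for m, k in [(233, 74), (6, 3), (7, 4), (4, 3)]:
        print(m, k, trinomial_xor_levels(m, k))
```

For $k=m-1$ the third branch evaluates to $m$, in agreement with the delay $T_{A}+m T_{X}$ stated above.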

To reduce the time delay of the Mastrovito multiplier for $\frac{m}{2}<k<m$, a hybrid tree structure is used in [28]. A similar technique can be applied to the proposed multiplier by using a hybrid tree to generate $e_{j}^{\prime}$ in (33).

If one attempts to apply Theorem 5 to the multiplier in Fig. 3, the Q-network should be modified by reusing signals $\left(e_{0}^{\prime}, e_{1}^{\prime}, e_{2}^{\prime}\right)$ instead of signals $\left(e_{0}, e_{1}, e_{2}\right)$. The coordinates of $C$ can be obtained as $c_{j}=d_{j}+e_{j}^{\prime}, 0 \leq j \leq 3$, where $e_{0}^{\prime}=$ $e_{3}^{\prime}=e_{0}+e_{1}^{\prime}, e_{1}^{\prime}=e_{1}+e_{2}, e_{2}^{\prime}=e_{2}$.
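The recursion (33) can be exercised in software. The following is a minimal Python sketch, assuming that $d_{j}$ and $e_{j}$ are the coefficients of $x^{j}$ and $x^{m+j}$ in the unreduced product $A B$. It folds $\mathbf{e}$ into the signals $e_{j}^{\prime}$ from $j=k-2$ down to $j=0$ and then forms $c_{j}$ as $d_{j}$ plus $e_{j}^{\prime}$ (for $j \leq m-2$) plus $e_{j-k}^{\prime}$ (for $j \geq k$). This compact restatement is the sketch's own and reproduces the Fig. 3 example above, where $c_{3}=d_{3}+e_{0}^{\prime}$. The helper names `clmul`, `polymod`, and `trinomial_mul` are hypothetical.

```python
def clmul(a, b):
    """Carry-less product of two GF(2) polynomials packed into integers."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def polymod(s, p):
    """Remainder of s modulo p over GF(2); used only as a reference."""
    dp = p.bit_length() - 1
    while s.bit_length() - 1 >= dp:
        s ^= p << (s.bit_length() - 1 - dp)
    return s


def trinomial_mul(a, b, m, k):
    """Multiply a and b modulo x^m + x^k + 1 for m/2 < k <= m - 1 using (33)."""
    assert m < 2 * k <= 2 * (m - 1)
    s = clmul(a, b)
    d = [(s >> j) & 1 for j in range(m)]            # coefficients of x^0 .. x^(m-1)
    e = [(s >> (m + j)) & 1 for j in range(m - 1)]  # coefficients of x^m .. x^(2m-2)

    # e'_j = e_j for j >= k-1; otherwise e'_j = e_j + e'_{j+m-k}, as in (33).
    ep = e[:]
    for j in range(k - 2, -1, -1):
        ep[j] = e[j] ^ ep[j + m - k]

    c = 0
    for j in range(m):
        bit = d[j]
        if j <= m - 2:
            bit ^= ep[j]        # contribution through the constant term of P(x)
        if j >= k:
            bit ^= ep[j - k]    # contribution through the x^k term of P(x)
        c |= bit << j
    return c


if __name__ == "__main__":
    m, k = 4, 3  # x^4 + x^3 + 1
    p = (1 << m) | (1 << k) | 1
    ok = all(trinomial_mul(a, b, m, k) == polymod(clmul(a, b), p)
             for a in range(1 << m) for b in range(1 << m))
    print("matches reference reduction:", ok)
```

The comparison against `polymod` checks only the reduction identity, which holds whether or not the trinomial is irreducible.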

## 7 Special Classes of Pentanomials

A polynomial with five nonzero coefficients, i.e., $P(x)=x^{m}+x^{k_{3}}+x^{k_{2}}+x^{k_{1}}+1$, where

$$
1 \leq k_{1}<k_{2}<k_{3} \leq m-1
$$

is called a pentanomial of degree $m$. The nonzero constant term is due to the irreducible property needed to define the representation of the field. In terms of the values of $k_{i} \mathrm{~s}$, the pentanomials can be divided into a number of different classes. Below we consider two special classes of irreducible pentanomials as proposed in [28].

### 7.1 Class 1: $k_{3} \leq \frac{m}{2}$

For this class of irreducible pentanomial where $k_{3} \leq \frac{m}{2}$, one can apply $t=3$ to the complexity results we have presented in Section 5. This yields the following:

TABLE 3
Comparison of Related Polynomial Basis Multipliers Based on Trinomials

| Multiplier | Reference | \#AND | \#XOR | Time delay |
| :---: | :---: | :---: | :---: | :---: |
| $P(x)=x^{m}+x+1$ |  |  |  |  |
| Mastrovito | [13, 24, 8, 28] | $m^{2}$ | $m^{2}-1$ | $T_{A}+\left(1+\left\lceil\log _{2} m\right\rceil\right) T_{X}$ |
| Non-Mastrovito | [25], Presented here | $m^{2}$ | $m^{2}-1$ | $T_{A}+\left(2+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}$ |
| $P(x)=x^{m}+x^{k}+1,1<k<\frac{m}{2}$ |  |  |  |  |
| Mastrovito | [13, 24, 8, 28] | $m^{2}$ | $m^{2}-1$ | $T_{A}+\left(2+\left\lceil\log _{2} m\right\rceil\right) T_{X}$ |
| Non-Mastrovito | [25], Presented here | $m^{2}$ | $m^{2}-1$ | $T_{A}+\left(2+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}$ |
| $P(x)=x^{m}+x^{\frac{m}{2}}+1$ |  |  |  |  |
| Mastrovito | [13, 24, 8, 28] | $m^{2}$ | $m^{2}-\frac{m}{2}$ | $T_{A}+\left(1+\left\lceil\log _{2} m\right\rceil\right) T_{X}$ |
| Non-Mastrovito | [25], Presented here | $m^{2}$ | $m^{2}-\frac{m}{2}$ | $T_{A}+\left(1+\left\lceil\log _{2} m\right\rceil\right) T_{X}$ |
| $P(x)=x^{m}+x^{k}+1, \frac{m}{2}<k<m$ |  |  |  |  |
| Mastrovito | [24, 28] | $m^{2}$ | $m^{2}-1$ | $T_{A}+\left(1+\left\lfloor\frac{m-2}{m-k}\right\rfloor+\left\lceil\log _{2} m\right\rceil\right) T_{X}$ |
| Mastrovito | [8] | $m^{2}$ | $m^{2}-1$ | $T_{A}+\left(\left\lceil\frac{m-1}{m-k}\right\rceil+\left\lceil\log _{2} m\right\rceil\right) T_{X}$ |
| Non-Mastrovito | Presented here | $m^{2}$ | $m^{2}-1$ | $\leq T_{A}+\left(1+\left\lfloor\frac{m-2}{m-k}\right\rfloor+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}$ |
| $P(x)=x^{m}+x^{m-1}+1$ |  |  |  |  |
| Mastrovito | [24, 8, 28] | $m^{2}$ | $m^{2}-1$ | $T_{A}+\left(m-1+\left\lceil\log _{2} m\right\rceil\right) T_{X}$ |
| Non-Mastrovito | Presented here | $m^{2}$ | $m^{2}-1$ | $T_{A}+m T_{X}$ |

Corollary 2. The gate counts and time delay of the multiplier for the pentanomial $P(x)=x^{m}+x^{k_{3}}+x^{k_{2}}+x^{k_{1}}+1$, where $k_{1}<k_{2}<k_{3} \leq \frac{m}{2}$, are

$$
\begin{aligned}
N_{A} & =m^{2}, \\
N_{X} & =m^{2}+2 m-3, \\
T_{C} & = \begin{cases}T_{A}+\left(3+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}, & \text { if } k_{1}=1 \\
T_{A}+\left(4+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}, & \text { otherwise },\end{cases}
\end{aligned}
$$

and the number of lines on the buses is $N_{L}=3 m+k_{3}-k_{1}-2$.

The number of XOR gates can be reduced if we choose a pentanomial such that $k_{1}=k_{3}-k_{2}$. Toward this, let us introduce the following set of intermediate terms/signals:

$$
\begin{equation*}
e_{j}^{\prime}=e_{j+m-k_{3}}+e_{j+m-k_{2}}, \quad 0 \leq j \leq k_{2}-2 . \tag{34}
\end{equation*}
$$

Equation (34) can be used to generate $e_{j}^{(0)}, 0 \leq j \leq k_{2}-2$, by substituting $t=3$ in (27) as follows:

$$
e_{j}^{(0)}= \begin{cases}e_{j}+e_{j}^{\prime}+e_{j+m-k_{1}}, & \text { if } 0 \leq j \leq k_{1}-2  \tag{35}\\ e_{j}+e_{j}^{\prime} & \text { if } k_{1}-1 \leq j \leq k_{2}-2 \\ e_{j}+e_{j+m-k_{3}} & \text { if } k_{2}-1 \leq j \leq k_{3}-2 \\ e_{j} & \text { if } k_{3}-1 \leq j \leq m-2 \\ 0 & \text { if } j=m-1\end{cases}
$$

The total number of XOR gates needed to generate $e_{j}^{(0)}$ s (see (35)) is $N_{1}=k_{1}+k_{2}+k_{3}-3$ in which (34) contributes $k_{2}-1$. Also, the maximum delay due to gates in (35) is

$$
\begin{align*}
& T\left(e_{j}^{(0)}\right) \leq \\
& \begin{cases}T_{A}+\left(2+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X} & \text { if } 0 \leq j \leq k_{1}-2 \\
T_{A}+\left(1+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X} & \text { if } k_{1}-1 \leq j \leq k_{3}-2 \\
T_{A}+\left\lceil\log _{2}(m-1)\right\rceil T_{X} & \text { if } k_{3}-1 \leq j \leq m-1\end{cases} \tag{36}
\end{align*}
$$

Lemma 1. With symbols defined as above, one has

$$
\begin{aligned}
& e_{j}^{(0)}+e_{j}^{(1)}=e_{j+k_{2}-m}^{\prime}, \text { for } m-k_{2} \leq j \leq m-2, \\
& e_{j}^{(2)}+e_{j}^{(3)}=e_{j-k_{2}}^{(0)}+e_{j-k_{2}}^{(1)}, \text { for } k_{3} \leq j \leq m-1
\end{aligned}
$$

Proof. Since $k_{3} \leq \frac{m}{2}$, one can easily verify that, for all $j$ s, $k_{3}-1 \leq j-k_{1}$ (and, hence, $k_{3}-1 \leq j$ ). Thus, using (35) and (29), one can simply obtain

$$
\begin{aligned}
e_{j}^{(0)}+e_{j}^{(1)} & =e_{j}^{(0)}+e_{j-k_{1}}^{(0)} \\
& =e_{j}+e_{j-k_{1}} \\
& =e_{j+k_{2}-m}^{\prime}, \text { for } m-k_{2} \leq j \leq m-2 .
\end{aligned}
$$

Similarly, the second equation can be proven by using (35), (29), and $k_{2}=k_{3}-k_{1}$ as follows:

$$
\begin{aligned}
e_{j}^{(2)}+e_{j}^{(3)} & =e_{j-k_{2}}^{(0)}+e_{j-k_{3}}^{(0)} \\
& =e_{j-k_{2}}^{(0)}+e_{j+k_{1}-k_{3}}^{(1)} \\
& =e_{j-k_{2}}^{(0)}+e_{j-k_{2}}^{(1)}, \text { for } k_{3} \leq j \leq m-1
\end{aligned}
$$

Let us represent $e_{j}^{(01)}, 0 \leq j \leq m-1$, as the elements of $\left(\mathbf{Q}_{0}+\mathbf{Q}_{1}\right)^{T} \mathbf{e}$, where $\mathbf{Q}_{0}$ and $\mathbf{Q}_{1}$ are shown in Fig. 6a and Fig. 6b, respectively. Then, substituting $t=3$ in the general case given in (30) and using the above lemma, we can obtain the coordinates of $C=A B$ as follows:

$$
\begin{equation*}
c_{j}=d_{j}+e_{j}^{(01)}+e_{j-k_{2}}^{(01)}, 0 \leq j \leq m-1, \tag{37}
\end{equation*}
$$

where $e_{j-k_{2}}^{(01)}=0$ for $j<k_{2}$, and

$$
e_{j}^{(01)}= \begin{cases}e_{j}^{(0)} & \text { if } 0 \leq j \leq k_{1}-1  \tag{38}\\ e_{j}^{(0)}+e_{j}^{(1)} & \text { if } k_{1} \leq j \leq m-k_{2}-1 \\ e_{j+k_{2}-m}^{\prime} & \text { if } m-k_{2} \leq j \leq m-2 \\ e_{j}^{(1)} & \text { if } j=m-1\end{cases}
$$

As seen in (38), one has to realize $e_{j}^{(0)}+e_{j}^{(1)}$ for all $k_{1} \leq j \leq m-k_{2}-1$, which requires $m-k_{2}-k_{1}$ XOR gates.

TABLE 4
Maximum Time Delays of the Signals, where $t(i), 0 \leq i \leq 4$, Represents the Time Delay of $T_{A}+\left(i+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}$, Numbers inside Square Brackets Are for $k_{1}=1$, and $x$ to Indicate whether $e_{j}^{(01)}$ or $e_{j-k_{2}}^{(01)}$ Is to Be Added First to $d_{j}$

| $j$ | $e_{j}^{(0)}$ | $e_{j}^{(1)}$ | $e_{j}^{(01)}$ | $e_{j-k_{2}}^{(01)}$ | $d_{j}+x$ | $c_{j}$ |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| $0 \leq j \leq k_{1}-1$ | $t(2),[t(1)]$ | - | $t(2),[t(1)], x$ | - | $t(3)$ | $t(3)$ |
| $k_{1} \leq j \leq k_{2}-1$ | $t(1)$ | $t(2),[t(1)]$ | $t(3),[t(2)], x$ | - | $t(4),[t(3)]$ | $t(4),[t(3)]$ |
| $k_{2} \leq j \leq k_{3}-1$ | $t(1)$ | $t(2),[t(1)]$ | $t(3),[t(2)]$ | $t(2),[t(1)], x$ | $t(3),[t(2)]$ | $t(4),[t(3)]$ |
| $k_{3} \leq j \leq k_{3}+k_{1}-1$ | $t(0)$ | $t(1)$ | $t(2), x$ | $t(3),[t(2)]$ | $t(3)$ | $t(4)$ |
| $k_{3}+k_{1} \leq j \leq m-k_{2}-1$ | $t(0)$ | $t(0)$ | $t(1), x$ | $t(3),[t(2)]$ | $t(2)$ | $t(4),[t(3)]$ |
| $m-k_{2} \leq j \leq m-2$ | $t(0)$ | $t(0)$ | $t(1), x$ | $t(3),[t(2)]$ | $t(2)$ | $t(4),[t(3)]$ |
| $j=m-1$ | - | $t(0)$ | $t(1), x$ | $t(1)$ | $t(2)$ | $t(3)$ |


Fig. 8. Graphical representations of the reduction matrix $\mathbf{Q}$ for class 2 pentanomials $P(x)=x^{m}+x^{k_{3}}+x^{k_{2}}+x^{k_{1}}+1$, where $m-k_{3}=k_{3}-k_{2}=k_{2}-k_{1}=s$. (a) $\frac{m-1}{4} \leq s \leq \frac{m-1}{3}$ or $1 \leq k_{1} \leq s+1$ (see Fig. 4a for $k_{1}=s$), (b) $\frac{m-1}{5} \leq s<\frac{m-1}{4}$ or $s+1<k_{1} \leq 2 s+1$, (c) $\frac{m-1}{8} \leq s<\frac{m-1}{5}$ or $2 s+1<k_{1} \leq 5 s+1$.

Once $e_{j}^{(01)}$ 's are obtained, then (37) requires $2 m-k_{2}$ XOR gates. Thus, the total number of XOR gates needed for the multiplier is

$$
(m-1)^{2}+N_{1}+m-k_{2}-k_{1}+2 m-k_{2}=m^{2}+m+k_{1}-2 .
$$

Due to the reuse of terms $e_{j}^{\prime}, 0 \leq j \leq k_{2}-2$, and $e_{j}^{(0)}+e_{j}^{(1)}$, $k_{1} \leq j \leq m-k_{2}-1$, additional lines needed on the bus in the $\mathbf{Q}$-network are $\left(k_{2}-1\right)$ and $\left(m-k_{1}-k_{2}\right)$, respectively. Thus, the total number of lines on the buses is increased to $4 m+k_{2}-k_{1}-3$.

To obtain the time delay of the proposed multiplier, we use Table 4, which shows the maximum delay of the signals in (37) for the given ranges of $j$ in each row. In this table, the parameter $t(i), 0 \leq i \leq 4$, represents the time delay of $t(i)=T_{A}+\left(i+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}$ and the numbers inside square brackets are for $k_{1}=1$. Also, $x$ determines whether $e_{j}^{(01)}$ or $e_{j-k_{2}}^{(01)}$ is to be added to $d_{j}$ first to obtain $c_{j}$. For example, using the fifth row of this table, $c_{k_{3}}$ can be obtained as $c_{k_{3}}=\left(d_{k_{3}}+e_{k_{3}}^{(01)}\right)+e_{k_{1}}^{(01)}$. In each row of this table, the delays are obtained for the first index of the given range. This is because, as $j$ increases, the time delays of the signals used in each row of this table decrease. As seen in this table, the maximum delay of the multiplier is $T_{A}+\left(4+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}$. For $k_{1}=1$, only one signal, i.e., $c_{k_{3}}$, has the delay of $T_{A}+\left(4+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}$. One can reduce this delay to $T_{A}+\left(3+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}$ if only $c_{k_{3}}$ is realized as $c_{k_{3}}=\left(\left(d_{k_{3}}+e_{k_{3}}^{(0)}\right)+e_{k_{3}}^{(1)}\right)+e_{k_{3}-k_{2}}^{(01)}$ by using one extra XOR gate.
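The chain (34), (35), (29), (38), and (37) can be traced in software. The following is a minimal Python sketch, assuming that $d_{j}$ and $e_{j}$ are the coefficients of $x^{j}$ and $x^{m+j}$ in the unreduced product $A B$. It builds $e_{j}^{\prime}$, $e_{j}^{(0)}$, $e_{j}^{(1)}$, and $e_{j}^{(01)}$ in that order and compares the result with a direct polynomial reduction. The helper names `clmul`, `polymod`, and `pentanomial_mul` are hypothetical, and the polynomial $x^{8}+x^{4}+x^{3}+x+1$ is used only because it satisfies $k_{3}-k_{2}=k_{1}$ and $k_{3} \leq \frac{m}{2}$.

```python
def clmul(a, b):
    """Carry-less product of two GF(2) polynomials packed into integers."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def polymod(s, p):
    """Remainder of s modulo p over GF(2); used only as a reference."""
    dp = p.bit_length() - 1
    while s.bit_length() - 1 >= dp:
        s ^= p << (s.bit_length() - 1 - dp)
    return s


def pentanomial_mul(a, b, m, k1, k2, k3):
    """Multiply modulo x^m + x^k3 + x^k2 + x^k1 + 1 with k3 <= m/2 and k3 - k2 = k1."""
    assert 1 <= k1 < k2 < k3 and 2 * k3 <= m and k3 - k2 == k1
    s = clmul(a, b)
    d = [(s >> j) & 1 for j in range(m)]
    e = [(s >> (m + j)) & 1 for j in range(m - 1)]

    # (34): shared terms e'_j, 0 <= j <= k2 - 2.
    ep = [e[j + m - k3] ^ e[j + m - k2] for j in range(k2 - 1)]

    # (35): e^{(0)}_j, with e^{(0)}_{m-1} = 0.
    e0 = [0] * m
    for j in range(m - 1):
        if j <= k1 - 2:
            e0[j] = e[j] ^ ep[j] ^ e[j + m - k1]
        elif j <= k2 - 2:
            e0[j] = e[j] ^ ep[j]
        elif j <= k3 - 2:
            e0[j] = e[j] ^ e[j + m - k3]
        else:
            e0[j] = e[j]

    # (29) with i = 1: e^{(1)}_j is e^{(0)} shifted up by k1 positions.
    e1 = [0] * k1 + e0[: m - k1]

    # (38): e^{(01)}_j, reusing e' for m - k2 <= j <= m - 2 (Lemma 1).
    e01 = [0] * m
    for j in range(m):
        if j <= k1 - 1:
            e01[j] = e0[j]
        elif j <= m - k2 - 1:
            e01[j] = e0[j] ^ e1[j]
        elif j <= m - 2:
            e01[j] = ep[j + k2 - m]
        else:
            e01[j] = e1[j]

    # (37): c_j = d_j + e^{(01)}_j + e^{(01)}_{j-k2}, the last term being 0 for j < k2.
    c = 0
    for j in range(m):
        bit = d[j] ^ e01[j] ^ (e01[j - k2] if j >= k2 else 0)
        c |= bit << j
    return c


if __name__ == "__main__":
    p = 0x11B  # x^8 + x^4 + x^3 + x + 1, i.e. (k1, k2, k3) = (1, 3, 4)
    a, b = 0x57, 0x83
    print(hex(pentanomial_mul(a, b, 8, 1, 3, 4)), hex(polymod(clmul(a, b), p)))
```

The sketch models only the arithmetic of the signals, not the gate-level structure or the delays of Table 4.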

Based on the above results, we can state the following:

Theorem 6. The gate counts and time delay of the multiplier based on the pentanomial $P(x)=x^{m}+x^{k_{3}}+x^{k_{2}}+x^{k_{1}}+1$, where $k_{1}<k_{2}<k_{3} \leq \frac{m}{2}$ and $k_{3}-k_{2}=k_{1}$, are

$$
\begin{aligned}
& N_{A}=m^{2}, \\
& N_{X}= \begin{cases}m^{2}+m & \text { if } k_{1}=1 \\
m^{2}+m+k_{1}-2 & \text { otherwise },\end{cases} \\
& T_{C}= \begin{cases}T_{A}+\left(3+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}, & \text { if } k_{1}=1 \\
T_{A}+\left(4+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}, & \text { otherwise },\end{cases}
\end{aligned}
$$

and the number of lines on the buses is $N_{L}=4 m+k_{2}-k_{1}-3$.

Remark 3. To verify that class 1 irreducible pentanomials exist, we have used a Maple program for $m \in[160,600]$ and have found that at least one such irreducible pentanomial exists for every $m$ in the range of 160 to 600. This is of interest to elliptic curve cryptosystem designers. In order to minimize the number of XOR gates of the multiplier, we have obtained irreducible pentanomials such that $k_{1}$ is minimum. These are shown in Tables 5 and 6 in [18]. As can be seen from these tables, $k_{1}$ is less than or equal to 6 for any $m$ in the above mentioned range.

It is noted that the pentanomial presented in [20] is the special case of $k_{1}=1$.

### 7.2 Class 2: $m-k_{3}=k_{3}-k_{2}=k_{2}-k_{1}=s$, $\frac{m-1}{8} \leq s \leq \frac{m-1}{3}$

We refer to polynomials $P(x)=x^{m}+x^{k_{3}}+x^{k_{2}}+x^{k_{1}}+1$, where $1 \leq k_{1}<k_{2}<k_{3} \leq m-1$ and $m-k_{3}=k_{3}-k_{2}=k_{2}-k_{1}=s$, as class 2 type pentanomials. Similar to the other special irreducible polynomials, here we first obtain the corresponding reduction matrix. Then, the coordinates and complexities of the multiplier can be obtained. Based on the values of $s$ (or $k_{1}=m-3 s$), we can divide the reduction matrix into different forms. Here, only three of them are presented. These $\mathbf{Q}$ matrices for $\frac{m-1}{8} \leq s \leq \frac{m-1}{3}$ (or $1 \leq k_{1} \leq 5 s+1$) are shown in Fig. 8. Based on this figure, we can state the following theorem.

Theorem 7. The gate counts and the time delay of the multiplier for the pentanomial $P(x)=x^{m}+x^{m-s}+x^{m-2 s}+x^{m-3 s}+1$, for $\frac{m-1}{8} \leq s \leq \frac{m-1}{3}$, are $N_{A}=m^{2}$,

$$
N_{X}= \begin{cases}m^{2}+m-s-1, & \text { if } \frac{m-1}{4} \leq s \leq \frac{m-1}{3} \\ m^{2}+2 m-5 s-2, & \text { if } \frac{m-1}{5} \leq s<\frac{m-1}{4} \\ m^{2}+m-2, & \text { if } \frac{m-1}{8} \leq s<\frac{m-1}{5},\end{cases}
$$

$$
T_{C}= \begin{cases}T_{A}+\left(3+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}, & \text { if } \frac{m-1}{5} \leq s \leq \frac{m-1}{3} \\ T_{A}+\left(4+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}, & \text { otherwise, }\end{cases}
$$

and

$$
N_{L}= \begin{cases}4 m-2, & \text { if } \frac{m-1}{4} \leq s \leq \frac{m-1}{3} \\ 5 m-4 s-3 & \text { if } \frac{m-1}{5} \leq s<\frac{m-1}{4} \\ 5 k_{3}-3 & \text { if } \frac{m-1}{8} \leq s<\frac{m-1}{5} .\end{cases}
$$

Proof. Case I: $1 \leq k_{1} \leq s+1, \frac{m-1}{4} \leq s \leq \frac{m-1}{3}$.

Using (9) and Fig. 8a, one can compute the coordinates of $C$ as

$$
c_{j}=d_{j}+ \begin{cases}e_{j}+e_{j+s} & \text { if } 0 \leq j \leq k_{1}-1 \\ e_{j-k_{1}}+e_{j}+e_{j-k_{1}+s}+e_{j+s} & \text { if } k_{1} \leq j \leq k_{2}-1 \\ e_{j-k_{2}}+e_{j}+e_{j-k_{1}+s}+e_{j+s} & \text { if } k_{2} \leq j \leq k_{3}-2 \\ e_{j-k_{2}}+e_{j}+e_{j-k_{1}+s} & \text { if } j=k_{3}-1 \\ e_{j-k_{3}}+e_{j}+e_{j-k_{1}+s} & \text { if } k_{3} \leq j \leq k_{1}+k_{3}-2 \\ e_{j-k_{3}}+e_{j} & \text { if } k_{1}+k_{3}-1 \leq j \leq m-2 \\ e_{j-k_{3}} & \text { if } j=m-1 .\end{cases} \tag{39}
$$

In order to reduce the number of XOR gates needed for implementing (39), one can precompute

$$
\begin{aligned}
e_{j}^{\prime} & =e_{j}+e_{j+s}, & & \text { for } 0 \leq j \leq k_{2}-1, \\
e_{j-k_{2}}^{\prime \prime} & =e_{j-k_{2}}+e_{j+s}, & & \text { for } k_{2} \leq j<k_{3}-1 .
\end{aligned}
$$

The precomputation requires a total of $k_{3}-1$ XOR gates with a maximum time delay of $T_{A}+\left(1+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}$. Then, by reusing the first $s$ terms of $e_{j}^{\prime}$s and all signals of $e_{j-k_{2}}^{\prime \prime}$s, i.e., $e_{0}^{\prime}, e_{1}^{\prime}, \cdots, e_{s-1}^{\prime}, e_{0}^{\prime \prime}, e_{1}^{\prime \prime}, \cdots, e_{k_{3}-2}^{\prime \prime}$, one can simplify (39) as

$$
c_{j}=d_{j}+ \begin{cases}e_{j}^{\prime} & \text { if } 0 \leq j \leq k_{1}-1 \\ e_{j}^{\prime}+e_{j-k_{1}}^{\prime} & \text { if } k_{1} \leq j \leq k_{2}-1 \\ e_{j-k_{2}}^{\prime \prime}+e_{j}+e_{j-k_{1}+s} & \text { if } k_{2} \leq j \leq k_{3}-2 \\ e_{j-k_{2}}+e_{j}+e_{j-k_{1}+s} & \text { if } j=k_{3}-1 \\ e_{j-k_{3}}^{\prime \prime}+e_{j-k_{1}+s} & \text { if } k_{3} \leq j \leq k_{1}+k_{3}-2 \\ e_{j-k_{3}}^{\prime \prime} & \text { if } k_{1}+k_{3}-1 \leq j \leq m-2 \\ e_{j-k_{3}} & \text { if } j=m-1 .\end{cases} \tag{40}
$$

Equation (40) requires $m+\left(k_{2}-k_{1}\right)+2\left(k_{3}-k_{2}-1\right)+2+\left(k_{1}-1\right)=m+2 k_{3}-k_{2}-1=2 m-1$ XOR gates with a time delay of $2 T_{X}$. Thus, the total number of XOR gates required for the whole multiplier is $(m-1)^{2}+k_{3}-1+2 m-1=m^{2}+k_{3}-1=m^{2}+m-s-1$ with a time delay of $T_{C}=T_{A}+\left(3+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}$. The number of lines on the buses has now increased by the number of reused $e_{j}^{\prime}$s and $e_{j-k_{2}}^{\prime \prime}$s, i.e., $3 m-1+s+k_{3}-1=4 m-2$.

It is noted that, for the special case of $k_{1}=1$, (40) should be modified by simply removing the fifth line with condition $k_{3} \leq j \leq k_{1}+k_{3}-2$. This does not affect the complexities of the whole multiplier structure.
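The Case I expressions can be checked against a direct reduction. The following is a minimal Python sketch of (40), assuming that $d_{j}$ and $e_{j}$ are the coefficients of $x^{j}$ and $x^{m+j}$ in the unreduced product $A B$. For simplicity the sketch restricts itself to $1 \leq k_{1} \leq s$, so that every index used for $e_{j}^{\prime \prime}$ stays within $0, \ldots, s-2$; this restriction is an assumption of the sketch, not a statement from the text. The helper names `clmul`, `polymod`, and `class2_case1_mul` are hypothetical.

```python
def clmul(a, b):
    """Carry-less product of two GF(2) polynomials packed into integers."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def polymod(s, p):
    """Remainder of s modulo p over GF(2); used only as a reference."""
    dp = p.bit_length() - 1
    while s.bit_length() - 1 >= dp:
        s ^= p << (s.bit_length() - 1 - dp)
    return s


def class2_case1_mul(a, b, m, s):
    """Multiply modulo x^m + x^(m-s) + x^(m-2s) + x^(m-3s) + 1 following (40)."""
    k1, k2, k3 = m - 3 * s, m - 2 * s, m - s
    assert 1 <= k1 <= s  # restriction of this sketch (see lead-in)
    prod = clmul(a, b)
    d = [(prod >> j) & 1 for j in range(m)]
    e = [(prod >> (m + j)) & 1 for j in range(m - 1)]

    ep = [e[j] ^ e[j + s] for j in range(k2)]       # e'_j,  0 <= j <= k2 - 1
    epp = [e[i] ^ e[i + k3] for i in range(s - 1)]  # e''_i, 0 <= i <= s - 2

    c = 0
    for j in range(m):
        if j <= k1 - 1:
            t = ep[j]
        elif j <= k2 - 1:
            t = ep[j] ^ ep[j - k1]
        elif j <= k3 - 2:
            t = epp[j - k2] ^ e[j] ^ e[j - k1 + s]
        elif j == k3 - 1:
            t = e[j - k2] ^ e[j] ^ e[j - k1 + s]
        elif j <= k1 + k3 - 2:
            t = epp[j - k3] ^ e[j - k1 + s]
        elif j <= m - 2:
            t = epp[j - k3]
        else:  # j == m - 1
            t = e[j - k3]
        c |= (d[j] ^ t) << j
    return c


if __name__ == "__main__":
    m, s = 10, 3  # (k1, k2, k3) = (1, 4, 7)
    p = (1 << m) | (1 << (m - s)) | (1 << (m - 2 * s)) | (1 << (m - 3 * s)) | 1
    a, b = 0x2A7, 0x19D
    print(class2_case1_mul(a, b, m, s) == polymod(clmul(a, b), p))
```

The comparison checks only the reduction identity, so it does not depend on the chosen pentanomial being irreducible.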

Case II: $s+1<k_{1} \leq 2 s+1, \frac{m-1}{5} \leq s<\frac{m-1}{4}$.
By comparing Fig. 8b with Fig. 8a, one can see that the Q matrix in this case has four more small lines at the bottom of Fig. 8b. This results in more terms in the representations of the coordinates of $C$. In order to be consistent with the previous case and to use (40), one can introduce the following terms:

$$
e_{j}^{\prime}= \begin{cases}e_{j}+e_{j+4 s}+e_{j+s} & \text { for } 0 \leq j \leq m-2-4 s \\ e_{j}+e_{j+s} & \text { for } m-1-4 s \leq j \leq k_{2}-1,\end{cases} \tag{41}
$$

and

$$
e_{j-k_{2}}^{\prime \prime}= \begin{cases}e_{j-k_{2}}+e_{j-k_{2}+4 s}+e_{j+s} & \text { for } k_{2} \leq j \leq k_{2}+m-2-4 s \\ e_{j-k_{2}}+e_{j+s} & \text { for } k_{2}+m-1-4 s \leq j<k_{3}-1 .\end{cases} \tag{42}
$$

These new terms cause Case II to require $m-1-4 s$ more XOR gates than Case I. Note that the terms $e_{j}+e_{j+4 s}$, $0 \leq j \leq m-2-4 s$, are common between (41) and (42). Thus, using (40) with the new $e_{j}^{\prime}$s and $e_{j-k_{2}}^{\prime \prime}$s, i.e., (41) and (42), we have a total number of XOR gates of $m^{2}+k_{3}-1+m-1-4 s=m^{2}+2 m-5 s-2$, and the total number of lines on the buses is $4 m-2+m-1-4 s=5 m-4 s-3$. The maximum delays in (41) and (42) are due to $e_{0}^{\prime}$ and $e_{0}^{\prime \prime}$, respectively, and are equal to $T_{A}+\left(2+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}$ each. Compared to Case I, this delay is increased by $T_{X}$; however, for an implementation similar to Case I, one can obtain $T_{C}=T_{A}+\left(3+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}$.

Case III: $2 s+1<k_{1} \leq 5 s+1, \frac{m-1}{8} \leq s<\frac{m-1}{5}$.
Let us introduce

$$
e_{j}^{\prime}=e_{j}+e_{j+4 s}, \text { for } 0 \leq j \leq m-2-4 s,
$$

which requires $m-1-4 s$ XOR gates and a delay of $T_{A}+\left(1+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}$. Let $\mathbf{Q}_{0}$ be a submatrix which contains all four lines starting from column 0 in Fig. 8c. Then, the coordinates of $\mathbf{e}^{(0)}=\mathbf{Q}_{0}^{T} \mathbf{e}$ can be obtained as

$$
e_{j}^{(0)}= \begin{cases}e_{j}^{\prime}+e_{j+s}^{\prime} & \text { if } 0 \leq j \leq m-2-5 s  \tag{43}\\ e_{j}^{\prime}+e_{j+s} & \text { if } m-1-5 s \leq j \leq m-2-4 s \\ e_{j}+e_{j+s} & \text { if } m-1-4 s \leq j \leq k_{3}-2 \\ e_{j} & \text { if } k_{3}-1 \leq j \leq m-2 \\ 0 & \text { if } j=m-1,\end{cases}
$$

which requires $k_{3}-1$ XOR gates and a maximum time delay of $T_{A}+\left(2+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}$. Thus, using Fig. 8c and (9), the coordinates of $C$ can be obtained as

TABLE 5
Comparison of Related Pentanomial-Based Multipliers

| Reference | Special Case | \#XOR | Time delay |
| :---: | :---: | :---: | :---: |
| $P(x)=x^{m}+x^{k_{3}}+x^{k_{2}}+x^{k_{1}}+1,1 \leq k_{1}<k_{2}<k_{3} \leq \frac{m}{2}$ |  |  |  |
| [28] | $k_{1} \geq 1$ | $m^{2}+2 m-3$ | $T_{A}+\left(6+\left\lceil\log _{2} m\right\rceil\right) T_{X}$ |
| LCBP | $k_{1}>1$ | $m^{2}+2 m-3$ | $T_{A}+\left(4+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}$ |
| LCBP | $k_{1}=1$ | $m^{2}+2 m-3$ | $T_{A}+\left(3+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}$ |
| LCBP | $k_{3}-k_{2}=k_{1}$ | $m^{2}+m+k_{1}-2$ | $T_{A}+\left(4+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}$ |
| [20] | $k_{3}-k_{2}=k_{1}=1$ | $m^{2}+m+2 k_{2}$ | $T_{A}+\left(3+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}$ |
| LCBP | $k_{3}-k_{2}=k_{1}=1$ | $m^{2}+m$ | $T_{A}+\left(3+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}$ |
| LCBP, [20] | $k_{i}=i$ | $m^{2}+m$ | $T_{A}+\left(3+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}$ |
| $P(x)=x^{m}+x^{m-s}+x^{m-2 s}+x^{m-3 s}+1$ |  |  |  |
| [28] | $1 \leq s \leq \frac{m-1}{3}$ | $m^{2}+4 m-5 s-5$ | $T_{A}+\left(\left\lfloor\frac{d}{4}\right\rfloor+4+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}$ |
| [28] | $s \leq \frac{m-1}{3}$ | $\geq m^{2}+2.33 m-7$ | $\geq T_{A}+\left(4+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}$ |
| LCBP | $\frac{m-1}{8} \leq s \leq \frac{m-1}{3}$ | $\leq m^{2}+m$ | $\leq T_{A}+\left(4+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}$ |

TABLE 6
Values $m \in[160,600]$ and $s$ such that Polynomial $P(x)=x^{m}+x^{m-s}+x^{m-2 s}+x^{m-3 s}+1,1 \leq s \leq \frac{m-1}{3}$ Is Irreducible

| 161,20 | 166,43 | 167,44 | 169,45 | 170,53 | 172,57 | 175,53 | 178,49 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 182,27 | 185,48 | 191,40 | 193,40 | 194,29 | 196,43 | 199,55 | 202,49 |
| 209,67 | 212,35 | 214,47 | 215,64 | 217,51 | 218,69 | 220,71 | 223,63 |
| 233,53 | 236,77 | 238,55 | 239,27 | 241,57 | 242,49 | 244,37 | 247,55 |
| 250,49 | 253,69 | 257,72 | 260,75 | 263,31 | 265,46 | 266,73 | 268,81 |
| 271,71 | 274,69 | 278,91 | 281,33 | 284,77 | 286,71 | 287,72 | 289,28 |
| 292,85 | 295,61 | 302,87 | 305,34 | 308,5 | 310,31 | 313,78 | 314,5 |
| 316,45 | 319,89 | 322,85 | 329,93 | 332,81 | 337,94 | 340,55 | 343,53 |
| 346,21 | 350,99 | 353,86 | 358,19 | 359,97 | 362,85 | 364,99 | 367,57 |
| 370,77 | 377,112 | 380,111 | 382,27 | 383,45 | 385,81 | 386,101 | 388,53 |
| 391,121 | 394,45 | 401,83 | 404,113 | 406,83 | 407,112 | 409,29 | 412,49 |
| 415,84 | 418,73 | 422,91 | 425,78 | 428,35 | 431,77 | 433,124 | 436,55 |
| 439,130 | 446,51 | 449,105 | 455,139 | 457,147 | 458,85 | 460,147 | 463,83 |
| 470,107 | 473,91 | 476,47 | 478,119 | 479,125 | 481,77 | 484,35 | 487,131 |
| 490,73 | 494,159 | 497,76 | 500,135 | 503,159 | 505,58 | 506,161 | 508,133 |
| 511,167 | 514,149 | 518,135 | 521,163 | 524,119 | 526,143 | 527,160 | 529,124 |
| 532,177 | 538,65 | 545,141 | 550,119 | 551,80 | 553,153 | 556,91 | 559,175 |
| 566,91 | 569,164 | 574,187 | 575,143 | 577,184 | 580,79 | 583,151 | 590,31 |
| 593,169 | 596,91 | 599,70 |  |  |  |  |  |

$$
\begin{align*}
& c_{j}= \\
& d_{j}+ \begin{cases}e_{j}^{(0)} & \text { if } 0 \leq j \leq k_{1}-1 \\
e_{j}^{(0)}+e_{j-k_{1}}^{(0)} & \text { if } k_{1} \leq j \leq k_{2}-1 \\
e_{j}^{(0)}+e_{j-k_{2}}^{\prime}+e_{j+s-k_{1}}^{\prime} & \text { if } k_{2} \leq j \leq k_{3}-1 \\
e_{j}^{(0)}+e_{j-k_{3}}^{\prime}+e_{j+s-k_{1}}^{\prime} & \text { if } k_{3} \leq j \leq k_{1}+m-2-5 s \\
e_{j}^{(0)}+e_{j-k_{3}}^{\prime}+e_{j+s-k_{1}} & \text { others. }\end{cases} \tag{44}
\end{align*}
$$

To implement (44), one requires $3 m-k_{1}-k_{2}-1$ XOR gates with the time delay of $2 T_{X}$. Thus, the total number of XOR gates and time delay of the multiplier are

$$
\begin{aligned}
& (m-1)^{2}+(m-1-4 s)+\left(k_{3}-1\right)+\left(3 m-k_{1}-k_{2}-1\right) \\
& =m^{2}+m-2
\end{aligned}
$$

and $T_{A}+\left(4+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}$, respectively. Also, similar to the previous cases, one can obtain the number of lines on the buses as $5 k_{3}-3$.
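The three cases of Theorem 7 can be collected into a small lookup. The following minimal Python sketch returns $N_{X}$, the number of XOR levels in $T_{C}$ (the coefficient of $T_{X}$), and $N_{L}$ for a given pair $(m, s)$, using integer comparisons for the range tests. The function name `class2_complexities` is hypothetical, and the sketch does not test irreducibility.

```python
def class2_complexities(m, s):
    """Return (N_X, XOR levels in T_C, N_L) of Theorem 7 for class 2 pentanomials.

    Valid for (m-1)/8 <= s <= (m-1)/3; raises ValueError otherwise.
    """
    k3 = m - s
    base = (m - 2).bit_length()  # ceil(log2(m - 1)) for m >= 2
    if 4 * s >= m - 1 and 3 * s <= m - 1:      # Case I
        return m * m + m - s - 1, 3 + base, 4 * m - 2
    if 5 * s >= m - 1 and 4 * s < m - 1:       # Case II
        return m * m + 2 * m - 5 * s - 2, 3 + base, 5 * m - 4 * s - 3
    if 8 * s >= m - 1 and 5 * s < m - 1:       # Case III
        return m * m + m - 2, 4 + base, 5 * k3 - 3
    raise ValueError("s is outside the range covered by Theorem 7")


if __name__ == "__main__":
    # (m, s) pairs taken from Table 6, one for each case.
    for m, s in [(166, 43), (196, 43), (161, 20)]:
        print(m, s, class2_complexities(m, s))
```

For the Table 6 entry $(m, s)=(161,20)$, which falls in Case III, the sketch gives $N_{X}=26080$, 12 XOR levels, and $N_{L}=702$.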

Table 5 compares the gate counts and delays obtained above with those of existing pentanomial-based multipliers. As seen in this table, for class 1 pentanomials with $k_{3}-k_{2}=k_{1}$, the proposed multiplier is faster than that of [28] and has fewer XOR gates. This special case of class 1 covers the pentanomials reported in [20], where $k_{1}=1$. Compared to the multiplier proposed in [20], the proposed multiplier for the special case of $k_{1}=k_{3}-k_{2}=1$ has $2 k_{2}$ fewer XOR gates, and it matches the one proposed in [20] for $k_{1}=1$ and $k_{2}=2$. Also, for class 2 pentanomials, our multiplier is at least as fast as the multiplier reported in [28] and has at least $1.33 m-7$ fewer XOR gates.
Remark 4. Using Maple ©®i , we have found that there exist 147 values of $m$, as shown in Table 6 , where $m \in[160,600]$ such

TABLE 7
Comarison of the Numbers of Bus Lines of Fig. 2 with that of the Mastrovito Multiplier

| Multipliers | \# Lines on the buses |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | trinomial | $s$-ESP | pentanomial | generic |
| Mastrovito [13] | $3 m-1$ | $\frac{m(m-s)}{2 s}+2 m$ | $5 m-3$ | $(t+2)(m-1)+2$ |
| Presented here | $3 m-1$ | $2 m+s$ | $\leq 4 m+k_{2}$ | $3 m+k_{t}-k_{1}-2$ |

that polynomial $P(x)=x^{m}+x^{m-s}+x^{m-2 s}+x^{m-3 s}+1$, $1 \leq s \leq \frac{m-1}{3}$ is irreducible. Among them, only 23 have $1 \leq s<\frac{m-1}{8}$.
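The kind of search described in Remark 4 can be reproduced in outline without a computer algebra system. The sketch below is not the Maple procedure used here; it swaps in the Ben-Or irreducibility test over $GF(2)$, with polynomials packed into Python integers (bit $i$ holds the coefficient of $x^{i}$). All function names are hypothetical.

```python
def poly_mod(a, f):
    """Remainder of a modulo f in GF(2)[x]; f must be nonzero."""
    df = f.bit_length() - 1
    while a.bit_length() - 1 >= df:
        a ^= f << (a.bit_length() - 1 - df)
    return a


def poly_mulmod(a, b, f):
    """Product a*b modulo f in GF(2)[x]; a is assumed already reduced mod f."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a = poly_mod(a << 1, f)
    return r


def poly_gcd(a, b):
    """Greatest common divisor in GF(2)[x] (Euclidean algorithm)."""
    while b:
        a, b = b, poly_mod(a, b)
    return a


def is_irreducible(f):
    """Ben-Or test: f is irreducible iff gcd(f, x^(2^i) + x) = 1 for 1 <= i <= deg(f)/2."""
    m = f.bit_length() - 1
    if m < 1:
        return False
    x = poly_mod(0b10, f)
    h = x
    for _ in range(m // 2):
        h = poly_mulmod(h, h, f)  # h = x^(2^i) mod f
        if poly_gcd(f, h ^ x) != 1:
            return False
    return True


def remark4_pentanomial(m, s):
    """Build x^m + x^(m-s) + x^(m-2s) + x^(m-3s) + 1 for 1 <= s <= (m-1)/3."""
    if not 1 <= s <= (m - 1) // 3:
        raise ValueError("s must satisfy 1 <= s <= (m-1)/3")
    return (1 << m) | (1 << (m - s)) | (1 << (m - 2 * s)) | (1 << (m - 3 * s)) | 1


# Small sanity check: x^4 + x^3 + x^2 + x + 1 is irreducible over GF(2).
print(is_irreducible(remark4_pentanomial(4, 1)))  # prints True
```

Looping `is_irreducible(remark4_pentanomial(m, s))` over $m \in[160,600]$ and the admissible $s$ would give a search of the same shape as the one in Remark 4; integer bit-packing is chosen here only for brevity, not for speed.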

## 8 Concluding Remarks

In this paper, new bit parallel polynomial basis multipliers over $G F\left(2^{m}\right)$ have been proposed. Time and space complexities of such a multiplier heavily depend on the field defining irreducible polynomials. Based on a number of important classes of irreducible polynomials, we have given an exact complexity analysis of the multiplier. In general, our results match or outperform the previously known best results in similar classes. We have also presented exact formulations for the coordinates of the multiplier output. Such formulations are expected to be useful to efficiently implement the multiplier using hardware description languages, such as VHDL and Verilog, without having much knowledge of finite field arithmetic.

Moreover, compared to the well-known Mastrovito multiplier, the architectures discussed here have fewer lines on the buses, as shown in Table 7 (for trinomials, the two counts coincide). Having fewer bus lines can be advantageous for VLSI implementation, especially for cryptographic applications where $m$ is usually very large.
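To give a feel for the bus-line counts in Table 7, the following minimal sketch evaluates three of its columns for concrete parameters. The function names are hypothetical, the $s$-ESP helper assumes that $m$ is a multiple of $s$, and the pentanomial entry for the presented architecture is only the upper bound $4 m+k_{2}$ listed in the table.

```python
def bus_lines_trinomial(m):
    """(Mastrovito, presented) bus-line counts for a trinomial of degree m."""
    return 3 * m - 1, 3 * m - 1


def bus_lines_esp(m, s):
    """(Mastrovito, presented) bus-line counts for an s-ESP of degree m.

    Assumes m is a multiple of s, so that m(m-s)/(2s) is an integer.
    """
    if m % s != 0:
        raise ValueError("m is assumed to be a multiple of s")
    return m * (m - s) // (2 * s) + 2 * m, 2 * m + s


def bus_lines_pentanomial(m, k2):
    """(Mastrovito, upper bound for the presented architecture) for a pentanomial."""
    return 5 * m - 3, 4 * m + k2


# Illustrative parameters only: the all-one polynomial of degree 4 (s = 1)
# and a degree-163 pentanomial whose middle exponent is k2 = 6.
print(bus_lines_trinomial(233))        # prints (698, 698)
print(bus_lines_esp(4, 1))             # prints (14, 9)
print(bus_lines_pentanomial(163, 6))   # prints (812, 658)
```

The generic column is omitted because it depends on $t$, $k_{t}$, and $k_{1}$, which are defined earlier in the paper.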

## Acknowledgments

The authors would like to thank the reviewers for their comments. The work has been supported in part by an NSERC postdoctoral fellowship to A. Reyhani-Masoleh. Preliminary versions of this article can be found in [17] and [19].

## References

[1] G.B. Agnew, T. Beth, R.C. Mullin, and S.A. Vanstone, "Arithmetic Operations in $G F\left(2^{m}\right)$," J. Cryptology, vol. 6, pp. 3-13, 1993.
[2] G.B. Agnew, R.C. Mullin, and S.A. Vanstone, "An Implementation of Elliptic Curve Cryptosystems over $F_{2^{155}}$," IEEE J. Selected Areas in Comm., vol. 11, no. 5, pp. 804-813, June 1993.
[3] T.C. Bartee and D.I. Schneider, "Computation with Finite Fields," Information and Control, vol. 6, pp. 79-98, Mar. 1963.
[4] E.R. Berlekamp, Algebraic Coding Theory. McGraw-Hill, 1968.
[5] R.E. Blahut, Fast Algorithms for Digital Signal Processing. Addison-Wesley, 1985.
[6] T.A. Gulliver, M. Serra, and V.K. Bhargava, "The Generation of Primitive Polynomials in $G F(q)$ with Independent Roots and Their Application for Power Residue Codes, VLSI Testing and Finite Field Multipliers Using Normal Bases," Int'l J. Electronics, vol. 71, no. 4, pp. 559-576, 1991.
[7] J.H. Guo and C.L. Wang, "Systolic Array Implementation of Euclid's Algorithm for Inversion and Division in $G F\left(2^{m}\right)$," IEEE Trans. Computers, vol. 47, no. 10, pp. 1161-1167, Oct. 1998.
[8] A. Halbutogullari and C.K. Koc, "Mastrovito Multiplier for General Irreducible Polynomials," IEEE Trans. Computers, vol. 49, no. 5, pp. 503-518, May 2000.
[9] M.A. Hasan, M.Z. Wang, and V.K. Bhargava, "Modular Construction of Low Complexity Parallel Multipliers for a Class of Finite Fields $G F\left(2^{m}\right)$," IEEE Trans. Computers, vol. 41, no. 8, pp. 962-971, Aug. 1992.
[10] T. Itoh and S. Tsujii, "Structure of Parallel Multipliers for a Class of Fields $G F\left(2^{m}\right)$," Information and Computation, vol. 83, pp. 21-40, 1989.
[11] R. Lidl and H. Niederreiter, Introduction to Finite Fields and Their Applications. Cambridge Univ. Press, 1994.
[12] E.D. Mastrovito, "VLSI Designs for Multiplication over Finite Fields $G F\left(2^{m}\right)$," Proc. Sixth Symp. Applied Algebra, Algebraic Algorithms, and Error Correcting Codes (AAECC-6), pp. 297-309, July 1988.
[13] E.D. Mastrovito, "VLSI Architectures for Computation in Galois Fields," PhD thesis, Linkoping Univ., Linkoping, Sweden, 1991.
[14] A.J. Menezes, I.F. Blake, X. Gao, R.C. Mullin, S.A. Vanstone, and T. Yaghoobian, Applications of Finite Fields. Kluwer Academic, 1993.
[15] Nat'l Inst. of Standards and Technology, Digital Signature Standard, FIPS Publication 186-2, Jan. 2000.
[16] I.S. Reed and X. Chen, Error-Control Coding for Data Networks. Kluwer Academic, 1999.
[17] A. Reyhani-Masoleh and M.A. Hasan, "A New Efficient Architecture of Mastrovito Multiplier over $G F\left(2^{m}\right)$," Proc. 20th Biennial Symp. Comm., pp. 59-63, May 2000.
[18] A. Reyhani-Masoleh and M.A. Hasan, "Low Complexity Bit Parallel Architectures for Polynomial Basis Multiplication over $G F\left(2^{m}\right)$," Technical Report CORR 2003-19, Dept. of C \& O, Univ. of Waterloo, Canada, July 2003.
[19] A. Reyhani-Masoleh and M.A. Hasan, "On Low Complexity Bit Parallel Polynomial Basis Multipliers," Proc. Cryptographic Hardware and Embedded Systems (CHES 2003), pp. 189-202, Sept. 2003.
[20] F. Rodriguez-Henriquez and C.K. Koc, "Parallel Multipliers Based on Special Irreducible Pentanomials," IEEE Trans. Computers, vol. 52, no. 12, pp. 1535-1542, Dec. 2003.
[21] P.A. Scott, S.J. Simmons, S.E. Tavares, and L.E. Peppard, "Architectures for Exponentiation in $G F\left(2^{m}\right)$," IEEE J. Selected Areas in Comm., vol. 6, no. 3, pp. 578-586, Apr. 1988.
[22] G. Seroussi, "Table of Low-Weight Binary Irreducible Polynomials," HP Labs Tech. Report HPL-98-135, Aug. 1998.
[23] L. Song and K.K. Parhi, "Low Complexity Modified Mastrovito Multipliers over Finite Fields $G F\left(2^{M}\right)$," Proc. IEEE Int'l Symp. Circuits and Systems (ISCAS-99), pp. 508-512, 1999.
[24] B. Sunar and C.K. Koc, "Mastrovito Multiplier for All Trinomials," IEEE Trans. Computers, vol. 48, no. 5, pp. 522-527, May 1999.
[25] H. Wu, "Bit-Parallel Finite Field Multiplier and Squarer Using Polynomial Basis," IEEE Trans. Computers, vol. 51, no. 7, pp. 750-758, July 2002.
[26] H. Wu and M.A. Hasan, "Efficient Exponentiation of a Primitive Root in $G F\left(2^{m}\right)$," IEEE Trans. Computers, vol. 46, no. 2, pp. 162-172, Feb. 1997.
[27] Y. Wu and M.I. Adham, "Scan-Based BIST Fault Diagnosis," IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, vol. 18, no. 2, pp. 203-211, Feb. 1999.
[28] T. Zhang and K.K. Parhi, "Systematic Design of Original and Modified Mastrovito Multipliers for General Irreducible Polynomials," IEEE Trans. Computers, vol. 50, no. 7, pp. 734-748, July 2001.


Arash Reyhani-Masoleh received the BSc degree from Iran University of Science and Technology in 1989, the MSc degree from the University of Tehran in 1991, both with the first rank in electrical and electronic engineering, and the PhD degree in electrical and computer engineering from the University of Waterloo in 2001. From 1991 to 1997, he was with the Department of Electrical Engineering, Iran University of Science and Technology. Since June 2001, he has been a postdoctoral fellow with the Centre for Applied Cryptographic Research, University of Waterloo. His current research interests include algorithms and VLSI architectures for computations in finite fields, fault-tolerant computing, and error-control coding. He was awarded an NSERC (Natural Sciences and Engineering Research Council of Canada) postdoctoral fellowship in 2002. He is a member of the IEEE and the IEEE Computer Society.

M. Anwar Hasan received the BSc degree in electrical and electronic engineering, the MSc degree in computer engineering, both from the Bangladesh University of Engineering and Technology, in 1986 and 1988, respectively, and the PhD degree in electrical engineering from the University of Victoria in 1992. Since 1993, he has been with the Department of Electrical and Computer Engineering, University of Waterloo, where he is now a professor. At the University of Waterloo, he is also a member of the Centre for Applied Cryptographic Research and the Center for Wireless Communications. His current research interests include algorithms and architectures for computations in Galois fields, data security and reliability, and digital communication networks. From January to December of 1999, he was on sabbatical with Motorola Labs, Schaumburg, Illinois. He is a recipient of the Raihan Memorial Gold Medal. At the University of Victoria, he was awarded the President's Research Scholarship four times. He has served on the program and executive committees of several conferences and, currently, he is an associate editor of the IEEE Transactions on Computers. He is a senior member of the IEEE and a licensed professional engineer of Ontario.


[^0]:    - A. Reyhani-Masoleh is with the Centre for Applied Cryptographic Research, Department of Combinatorics and Optimization, University of Waterloo, Waterloo, Ontario, Canada N2L 3G1. E-mail: areyhani@math.uwaterloo.ca.
    - M.A. Hasan is with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, Ontario, Canada N2L 3G1. E-mail: ahasan@ece.uwaterloo.ca.
    Manuscript received 21 July 2003; accepted 15 Jan. 2004.
    For information on obtaining reprints of this article, please send e-mail to: tc@computer.org, and reference IEEECS Log Number TC-0090-0703.

