# Design Space Exploration of a Hardware-Software Codesigned GF( $2^{m}$ ) Galois Field Processor for Forward Error Correction and Cryptography 

Wei-Ming Lim ${ }^{1}$, M. Benaissa ${ }^{2}$<br>University of Sheffield<br>Department of Electronic and Electrical Engineering, Mappin Street, Sheffield, S1 3JD, United Kingdom<br>${ }^{1}$ elp00wml@sheffield.ac.uk ${ }^{2}$ m.benaissa@sheffield.ac.uk


#### Abstract

This paper describes a hardware-software co-design approach for flexible, programmable Galois Field processing for applications which require operations over $\mathrm{GF}\left(2^{\mathrm{m}}\right)$, such as RS and BCH codes, Elliptic Curve Cryptography and the AES. The complexity of flexibly implementing different applications on the same computation architecture can be migrated to software at design time. However, the underlying $\operatorname{GF}\left(2^{\mathrm{m}}\right)$ arithmetic architecture needs to be designed with software programmability (or reconfigurability) in mind. We describe novel reconfigurable subword parallel $\mathrm{GF}\left(2^{\mathrm{m}}\right)$ arithmetic architectures designed with an associated instruction set architecture for different applications over $\operatorname{GF}\left(2^{\mathrm{m}}\right)$ and for the same applications with differing parameters. Design space exploration is carried out with two simple parameters, $P$ and $Q$, which can be changed at design time and which affect the performance of different applications and the flexibility of the final implementation. We give implementation results for an FPGA prototype of the processor programmed for RS and BCH coding, AES and elliptic curve cryptography with differing parameters. Complexity figures and configuration overheads for subword parallel $\operatorname{GF}\left(2^{\mathrm{m}}\right)$ arithmetic architectures are also estimated and discussed.


## Categories and Subject Descriptors

C.3 [Special-Purpose And Application-Based Systems]: Real-time and embedded systems; B.2 [Arithmetic And Logic Structures]: Design Styles---Parallel

## General Terms:

Design, Algorithm, Performance

## Keywords

Galois Field Processor, GF $\left(2^{\mathrm{m}}\right)$ Arithmetic, Forward Error Control Coding, Reed-Solomon Code, BCH Code, Cryptography, Elliptic Curve Cryptography, Advanced Encryption Standard, Hardware-Software Co-design, Design Space Exploration

[^0]
## 1. INTRODUCTION

$\mathrm{GF}\left(2^{\mathrm{m}}\right)$ arithmetic has been used extensively in the domains of Forward Error Correction (FEC) Codes and Cryptography. Well known examples include Reed Solomon and BCH Codes for FEC [1], Elliptic Curve Cryptography (ECC) [2] for Public Key Cryptography and lately, the Advanced Encryption Standard [3] (AES) using the Rijndael Algorithm for Private Key Cryptography.
As systems get increasingly complex, more and more effort has been channelled into re-usable implementations, particularly software-controlled architectures, which allow design re-use simply through re-programming. Some examples are described in [4-7]. Here, we define two parameters $P$ and $Q$: $P$ is the number of parallel arithmetic computation units (multiplication, division and addition) and $Q$ is the bit size of each unit. This paper concentrates on the design space exploration of a software-driven hardware architecture for applications over $\mathrm{GF}\left(2^{\mathrm{m}}\right)$ with $P$ and $Q$ as the central design parameters.

A hardware-software co-design approach is described for hardware architectures over $\operatorname{GF}\left(2^{\mathrm{m}}\right)$ whereby software allows the same hardware to be re-used for different applications. This entails a design space exploration in which the requirements of different applications over $\operatorname{GF}\left(2^{\mathrm{m}}\right)$ are systematically explored, and also the formulation of a hardware architecture to facilitate these applications. This design space can be broadly divided into three levels of abstraction, as shown in Table 1. The top level determines the global requirements of the specific applications: for example, the application area (cryptography or FEC), the code rate and error correction capability of an RS or BCH code (N,K), the key size of the AES and the curve parameters of ECC.

The bottom level consists of the basic arithmetic circuits that form the basis of the applications. Choices here include the size and type of the arithmetic for example in terms of field size, irreducible polynomial and basis representation. These are of course influenced by the global requirements. The middle level provides the "bridge" between the top and bottom levels. Usually, this middle level determines the overall structure of a derived architecture and is determined by the primitive operations of an application. For each application, we identify these primitive operations, and this is an important first step towards designing efficient hardware/software architectures. Table 2 shows the requirements of various different operations.

There has been a trend towards using parallel arithmetic computation units driven by software for FEC algorithms, as evident in [5, 6], since there is substantial inherent parallelism in these algorithms. The same principle can be applied directly to the AES. Due to the large field size, parallel arithmetic computations are not considered for ECC, as these need multiple arithmetic units. A simplistic implementation of ECC would be achieved by setting $\mathrm{P}=1$ and $Q$ equal to the size of the field over which the ECC is defined. Designing an architecture flexible enough for RS codes, BCH codes and the AES is relatively simple; however, the same cannot be said if ECC is to be included with the other applications. Although a $\mathrm{GF}\left(2^{163}\right)$ processor designed for ECC can be used directly for $\operatorname{GF}\left(2^{8}\right)$ computations for the AES (or RS/BCH codes), assuming the underlying arithmetic units can support variable field sizes and irreducible polynomials, it will be highly inefficient as only 8 out of a possible 163 bits are used at any one time. This is one main obstacle towards developing efficient processor architectures for the domain of $\mathrm{GF}\left(2^{\mathrm{m}}\right)$.

Table 1 : Abstraction Levels

| Abstraction Levels | Application |  |
| :---: | :---: | :---: |
| Algorithm Level | Cryptography | FEC |
|  | AES Diff Key Lengths ECC Diff Curve sizes | RS(N, K) codec BCH (N,K) codec |
| Primitive Operations | Point Additions/ <br> Doubling over Projective or Affine Co-ordinates | Key Equation Solving <br> Using BMA or Euclidean Algorithm, Systematic Non Systematic Encoding |
|  |  | Different Polynomial Operations |
| Arithmetic Level | Variable Field Size, Variety of Bases, Variable Irreducible Polynomials |  |

Table 2: Requirements Analysis of Applications over $\mathrm{GF}\left(2^{\mathrm{m}}\right)$

| Application | RS Codes | BCH | AES | ECC |
| :---: | :---: | :---: | :---: | :---: |
| Ranges | 3-8 Bits | 3-16 Bits | 8 Bits | $>100$ Bits |
| Primitive <br> Operations | Polynomial |  | Matrix and <br> Polynomial <br> like | Quadratic <br> Equations |

Subword parallelism (SWP) is a concept from computer architecture first introduced by Lee in [8]. Multiple subwords are packed into a word and processed with a single instruction, which can be seen as a form of SIMD (Single Instruction Multiple Data). This concept can be extended to the domain of $\mathrm{GF}\left(2^{\mathrm{m}}\right)$ as well. It can provide the flexibility required for a $\mathrm{GF}\left(2^{\mathrm{m}}\right)$ processor without loss of efficiency, and it solves the problem of the large data size mismatch between different applications. A SWP $\operatorname{GF}\left(2^{\mathrm{m}}\right)$ processor is defined as an entity with a data path of $m=P \times Q$ bits that can operate in two modes: Single Instruction Single Data (SISD) and Single Instruction Multiple Data (SIMD). A SWP $\operatorname{GF}\left(2^{\mathrm{m}}\right)$ arithmetic circuit allows the processor to compute either $P$ $\operatorname{GF}\left(2^{\mathrm{n}}\right)$ arithmetic operations ($n \leq Q$) or one $\mathrm{GF}\left(2^{\mathrm{n}}\right)$ arithmetic operation ($Q<n \leq P \times Q$) per instruction. This means that, by suitable selection of $P$ and $Q$, it is possible to define a structure that can be utilized efficiently for both large and small field size operations. For example, if $P=21$ and $Q=8$, then $m=168$. The processor can be used for one large $\operatorname{GF}\left(2^{\mathrm{n}}\right)$ computation ($n \leq 168$) or 21 parallel smaller $\operatorname{GF}\left(2^{\mathrm{n}}\right)$ computations ($n \leq 8$), the former useful for ECC and the latter useful for AES, RS and BCH codecs.

The process of design space exploration is very much top down/bottom up. In the very first pass, the top down approach will attempt to identify the requirements of each application at each level as given in Table 1. The bottom up approach will attempt to formulate suitable architectures to suit these requirements at each level, and this may take several iterations before a compromise is reached for a given cost function, for example in terms of speed versus area versus power. Reaching a compromise on all of these requirements, plus considerations for flexibility, is not a trivial task and further work is needed here. In this paper, we concentrate primarily on the requirements issues for the domain of $\operatorname{GF}\left(2^{\mathrm{m}}\right)$, which is not a common subject in the published literature.

The SWP type architectural structure of the arithmetic circuits of the $\operatorname{GF}\left(2^{\mathrm{m}}\right)$ processor forms the middle abstraction level in Table 1 in this paper. An Instruction Set Architecture can then be defined over this architecture which allows the processor to compute the primitive operations for different applications.

## 2. SUBWORD PARALLEL ARCHITECTURES

This section describes the architecture of a Subword Parallel $\mathrm{GF}\left(2^{\mathrm{m}}\right)$ processor which consists of the $\mathrm{SWP} \operatorname{GF}\left(2^{\mathrm{m}}\right)$ arithmetic circuits.

### 2.1. SWP GF( $\mathbf{2}^{\mathbf{m}}$ ) Arithmetic Circuits

We briefly describe how a subword parallel $\operatorname{GF}\left(2^{\mathrm{m}}\right)$ multiplier can be designed using a simple example for $\mathrm{m}=4$. The GF multiplication algorithm used here is based on the well known MSB first algorithm operating over a polynomial basis which is outlined in [9] and reproduced briefly below. We need to compute $\mathrm{C}(\mathrm{y})=\mathrm{A}(\mathrm{y}) \cdot \mathrm{B}(\mathrm{y}) \bmod \mathrm{G}(\mathrm{y})$.
Multiplication Algorithm [9]

```
C'(y) = all zeros
for i = 0 to m-1
    C'(y) = C'(y) + G(y).c'_m + A(y).b_(m-1)
    B(y) = y.B(y) mod (y^m)
    C'(y) = y.C'(y) mod (y^(m+1))
end loop;
```

We can break line 3 of the algorithm into bit slices, $(0 \leq k \leq m)$ :

$$
c_{k}^{\prime}=c_{k}^{\prime} \oplus\left(g_{k} \bullet c_{m}^{\prime}\right) \oplus\left(a_{k} \bullet b_{m-1}\right)
$$
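The MSB-first algorithm and its bit-slice form can be sketched in software as follows. This is a minimal Python sketch under stated assumptions, not the authors' hardware description: the function name `gf2m_mult_msb_first` is hypothetical, field elements are held as integers with bit $k$ as the coefficient of $y^{k}$, `g` includes the $y^{m}$ term of $G(y)$, and the product is assumed to be read from the upper $m$ bits of $C'(y)$ after the final shift.

```python
def gf2m_mult_msb_first(a, b, g, m):
    """Compute A(y).B(y) mod G(y) over GF(2^m), MSB first.

    a, b : m-bit integers (polynomial basis).
    g    : (m+1)-bit integer for G(y), including the y^m term.
    """
    c = 0                                      # C'(y) = all zeros, m+1 bits wide
    for _ in range(m):
        c_m = (c >> m) & 1                     # global signal c'_m
        b_msb = (b >> (m - 1)) & 1             # global signal b_(m-1)
        if c_m:
            c ^= g                             # C' = C' + G(y).c'_m
        if b_msb:
            c ^= a                             # C' = C' + A(y).b_(m-1)
        b = (b << 1) & ((1 << m) - 1)          # B = y.B mod y^m
        c = (c << 1) & ((1 << (m + 1)) - 1)    # C' = y.C' mod y^(m+1)
    # Assumption: after the last shift the product sits in bits m..1 of C'.
    return c >> 1
```

With $G(y)=y^{4}+y+1$ for $m=4$, multiplying $y^{2}+y$ by $y^{2}+y+1$ gives 1, which matches a hand reduction of $y^{4}+y$.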

Figure 1 shows four bit slices cascaded together, which perform one iteration of the multiplication algorithm. By feeding the intermediate result back into the same structure $m$ times, an $m$-bit $\operatorname{GF}\left(2^{\mathrm{m}}\right)$ multiplication can be performed. Notice that as long as the data are MSB justified, the same $m$-bit structure can be used for $\operatorname{GF}\left(2^{\mathrm{n}}\right)$ multiplications ($n \leq m$) by setting the number of iterations to $n$. From the algorithm, we can see that at every iteration, $\mathrm{b}_{\mathrm{m}-1}$ and $c^{\prime}{ }_{m}$ determine whether $A(y)$ and $G(y)$ are added to $C^{\prime}(y)$. We call these global signals.

$\mathrm{B}(\mathrm{y})$ is loaded into a shift register and is MSB shifted $n$ times during the course of the algorithm. At each iteration, the most significant bit of the shift register is the next unprocessed bit of $\mathrm{B}(\mathrm{y})$, counting down from $\mathrm{b}_{\mathrm{m}-1}$. Lines 4 and 5 of the algorithm are simply the mathematical representation of a logical shift with the MSB discarded. The shifting of Line 5 can be hardwired as shown in Figure 1.

Suppose the structure in Figure 1 is cut into two parts of 2 bits each, called Logic Units (LUs), and additional configuration circuitry (basically multiplexers and basic gates) is added to each LU. The configuration circuitry of each LU is controlled by a set of control signals, MSBlock and LSBlock. Setting both MSBlock and LSBlock of a LU to '1' "isolates" that particular LU, such that all global signals are derived from the same LU and shifting signals do not cross into other LUs. In this situation, both LUs can be seen as independent 2-bit multipliers. Similarly, setting LSBlock[2] and MSBlock[1] to '0', and setting LSBlock[1] and MSBlock[2] to '1', configures the structure in Figure 2 to behave as a 4-bit multiplier, as the global signals are now multiplexed from the MSB LU (LU 2) and shifting signals are allowed to pass between LUs. In general, we can thus break up a large $m$-bit $\mathrm{GF}\left(2^{\mathrm{m}}\right)$ multiplier into $P$ smaller $Q$-bit LUs, which are essentially $Q$-bit $\mathrm{GF}\left(2^{\mathrm{Q}}\right)$ multipliers. In practice, it will be easier to design a LU of a specific size ($Q$ bits) and couple as many of them together as needed to achieve the required large field size.
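The effect of isolating or coupling the LUs can be illustrated with a behavioural model. The Python sketch below is an assumption-laden illustration rather than a gate-level description: the names `gf_mul` and `swp_mult` are hypothetical, the MSBlock/LSBlock settings are collapsed into a single `simd` flag, and in SIMD mode every subword is assumed to use the full $Q$ bits with the same irreducible polynomial.

```python
def gf_mul(a, b, g, m):
    """Multiply a.b in GF(2^m); g is G(y) as an (m+1)-bit integer."""
    r = 0
    for i in reversed(range(m)):
        r <<= 1
        if (r >> m) & 1:      # reduce when the y^m term appears
            r ^= g
        if (b >> i) & 1:      # add A(y) for each set bit of B(y), MSB first
            r ^= a
    return r


def swp_mult(ra, rb, g, P, Q, simd):
    """Behavioural model of a P x Q-bit subword parallel multiplier.

    simd=True  : all LUs isolated, P independent GF(2^Q) products.
    simd=False : all LUs coupled, one GF(2^(P*Q)) product.
    """
    if not simd:
        return gf_mul(ra, rb, g, P * Q)
    mask = (1 << Q) - 1
    out = 0
    for i in range(P):
        ai = (ra >> (i * Q)) & mask       # subword i of operand A
        bi = (rb >> (i * Q)) & mask       # subword i of operand B
        out |= gf_mul(ai, bi, g, Q) << (i * Q)
    return out
```

For the two-LU example ($P=2$, $Q=2$), the same 4-bit word can then be treated either as two GF($2^{2}$) operands with $G(y)=y^{2}+y+1$ or as one GF($2^{4}$) operand with $G(y)=y^{4}+y+1$.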


Figure 1: 4-bit GF( $\left.\mathbf{2}^{\mathrm{m}}\right)$ Multiplier


Figure 2: Modified GF( $2^{m}$ ) multiplier with 2 LUs.

The same procedure can be applied to a $\operatorname{GF}\left(2^{\mathrm{m}}\right)$ division algorithm, although this is not described here due to length constraints. In particular, Brunner in [10] described a polynomial basis $\operatorname{GF}\left(2^{\mathrm{m}}\right)$ division algorithm which is suitable for modification into the SWP structure, as it is regular in structure and computation cycles. This regularity is important as it makes modifications easier. $\operatorname{GF}\left(2^{\mathrm{m}}\right)$ addition needs no modifications as it only involves bit-wise XOR operations.

### 2.1.1. Complexity Analysis

Table 3 shows the complexity figures of the $\operatorname{SWP} \operatorname{GF}\left(2^{m}\right)$ arithmetic circuits in terms of P and Q. The size of the arithmetic circuits without modification is comparable to that outlined in [9, 10], which is expected since they use the same algorithms. The configuration circuitry incurs a penalty in overall area and time complexity. For reasonable choices of P and Q, this added complexity represents a small percentage of the total. For example, if $P=21$ and $Q=8$, giving $M=168$, the percentage overhead due to the configuration circuitry will be $4.01 \%$ for the multiplier and $5.81 \%$ for the divider circuit. The added propagation delay due to the configuration circuits for multiplication and division corresponds to $\mathrm{T}_{\text {MUX2 }}$ and to $\mathrm{T}_{\text {MUX2 }}+\mathrm{T}_{\text {MUXQ }}$ respectively. (Note: $\mathrm{T}_{\mathrm{MUXi}}$ is the propagation delay through an i-input multiplexor.)

Table 3 : Complexity Figures of SISD/SIMD ALU

| Circuit | Gates for $P$ $Q$-bit Logic Units without control | Gates for config. circuit | Overall propagation delay |
| :---: | :---: | :---: | :---: |
| Multiplier | $2PQ$ XOR <br> $2PQ$ AND <br> $2PQ$ F/F <br> $PQ$ $\mathrm{MUX}_{2}$ | $2P$ $\mathrm{MUX}_{2}$ <br> $2P$ AND | $\mathrm{T}_{\mathrm{MUX2}}+2\mathrm{T}_{\mathrm{XOR}}+\mathrm{T}_{\mathrm{AND}}$ |
| Divider | $P(4Q+1)$ XOR <br> $P(3Q+1)$ AND <br> $P(5Q+2)$ F/F <br> $P(Q+1)$ NOT <br> $P(16Q+7)$ $\mathrm{MUX}_{2}$ | $5P$ $\mathrm{MUX}_{2}$ <br> $P$ $\mathrm{MUX}_{\mathrm{Q}}$ <br> $4P$ AND | $\mathrm{T}_{\mathrm{XOR}}+\mathrm{T}_{\mathrm{AND}}+2\mathrm{T}_{\mathrm{MUX2}}+\mathrm{T}_{\mathrm{MUXQ}}$ |

### 2.2. SWP GF( $\left.\mathbf{2}^{\mathrm{m}}\right)$ Processor Architecture



Figure 3 : Processor Datapath Block Diagram

A simplified processor datapath block diagram is shown in Figure 3. It consists of the SWP arithmetic circuits (multiplication, division and addition over $\operatorname{GF}\left(2^{\mathrm{m}}\right)$) and subword permutation circuits, which are necessary to handle subword manipulations. The register file is made up of many Register Locations, each with a word size of $m$ bits. Each Register Location (a word) can be seen as $P$ Coefficient Locations (subwords), each $Q$ bits wide (i.e. $m=P \times Q$). In SIMD mode, the data in each Register Location can be viewed as a polynomial of degree $P-1$, with coefficients that are elements of a Galois Field of size up to $\mathrm{GF}\left(2^{\mathrm{Q}}\right)$ (hence the name Coefficient Locations). A polynomial with degree larger than $P-1$ can be stored in two or more Register Locations. In SISD mode, each Register Location holds only one data element. See Figure 4.


Figure 4 : Register File Structure with different contents

### 2.3. Core Instruction Set Architecture

For each different set of applications, the instruction set architecture may be different. This is because additional specialized instructions may be included to significantly improve the performance of the processor when running particular applications. However, it is possible to identify a core set of instructions that is applicable to a majority of applications, and these are usually present regardless of the applications the processor is designed for.

### 2.3.1. Core Instructions

We denote $R_{j}\left(C_{i}\right)$ as the $i^{\text {th }}$ Coefficient Location of the $j^{\text {th }}$ Register Location. In SIMD mode, the MULT, DIVI and ADDP instructions operate on $P$ parallel data pairs in a pair of Register Locations. In SISD mode, they operate on only one data pair per pair of Register Locations. The other instructions are required for subword re-arrangement, data alignment and movement (shifting, copying, subword copying etc.).

Table 4: Core Arithmetic Instructions

| Instruction | Operation | Range |
| :---: | :---: | :---: |
| $\mathrm{ADDP}_{\text {SIMD }}$ | $\mathrm{R}_{\text {des }}\left(\mathrm{C}_{\mathrm{i}}\right) \leftarrow\left[\mathrm{R}_{\mathrm{a}}\left(\mathrm{C}_{\mathrm{i}}\right)+\mathrm{R}_{\mathrm{b}}\left(\mathrm{C}_{\mathrm{i}}\right)\right] \bmod \mathrm{G}(\mathrm{y})$ | for $\mathrm{i}=0$ to $\mathrm{P}-1$ |
| $\mathrm{ADDP}_{\text {SISD }}$ | $\mathrm{R}_{\text {des }} \leftarrow\left[\mathrm{R}_{\mathrm{a}}+\mathrm{R}_{\mathrm{b}}\right] \bmod \mathrm{G}(\mathrm{y})$ |  |
| $\mathrm{SUMA}_{\text {SIMD }}$ | $\mathrm{R}_{\text {des }}\left(\mathrm{C}_{\mathrm{a}}\right) \leftarrow \sum_{i=0}^{P-1} \mathrm{R}_{\mathrm{a}}\left(\mathrm{C}_{\mathrm{i}}\right)$ |  |
| $\mathrm{MULT}_{\text {SIMD }}$ | $\mathrm{R}_{\text {des }}\left(\mathrm{C}_{\mathrm{i}}\right) \leftarrow\left[\mathrm{R}_{\mathrm{a}}\left(\mathrm{C}_{\mathrm{i}}\right) \times \mathrm{R}_{\mathrm{b}}\left(\mathrm{C}_{\mathrm{i}}\right)\right] \bmod \mathrm{G}(\mathrm{y})$ | for $\mathrm{i}=0$ to $\mathrm{P}-1$ |
| $\mathrm{MULT}_{\text {SISD }}$ | $\mathrm{R}_{\text {des }} \leftarrow\left[\mathrm{R}_{\mathrm{a}} \times \mathrm{R}_{\mathrm{b}}\right] \bmod \mathrm{G}(\mathrm{y})$ |  |
| $\mathrm{DIVI}_{\text {SIMD }}$ | $\mathrm{R}_{\text {des }}\left(\mathrm{C}_{\mathrm{i}}\right) \leftarrow\left[\mathrm{R}_{\mathrm{a}}\left(\mathrm{C}_{\mathrm{i}}\right) / \mathrm{R}_{\mathrm{b}}\left(\mathrm{C}_{\mathrm{i}}\right)\right] \bmod \mathrm{G}(\mathrm{y})$ | for $\mathrm{i}=0$ to $\mathrm{P}-1$ |
| $\mathrm{DIVI}_{\text {SISD }}$ | $\mathrm{R}_{\text {des }} \leftarrow\left[\mathrm{R}_{\mathrm{a}} / \mathrm{R}_{\mathrm{b}}\right] \bmod \mathrm{G}(\mathrm{y})$ |  |
| $\mathrm{REPA}_{\text {SIMD }}$ | $\mathrm{R}_{\text {des }}\left(\mathrm{C}_{\mathrm{i}}\right) \leftarrow \mathrm{R}_{\mathrm{a}}\left(\mathrm{C}_{\mathrm{a}}\right)$ | for $\mathrm{i}=0$ to $\mathrm{P}-1$ |
| $\mathrm{REPO}_{\text {SIMD }}$ | $\mathrm{R}_{\text {des }}\left(\mathrm{C}_{\text {des }}\right) \leftarrow \mathrm{R}_{\mathrm{a}}\left(\mathrm{C}_{\mathrm{a}}\right)$ |  |
| SHPX | $\mathrm{R}_{\text {des }} \leftarrow$ LSB/MSB bit shift of $\mathrm{R}_{\mathrm{a}}$ by x bits, padded with zeros |  |
| COPY | $\mathrm{R}_{\text {des }} \leftarrow \mathrm{R}_{\mathrm{a}}$ |  |
| SETC | Setup instruction for SIMD/SISD mode, irreducible polynomial etc. |  |
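The SIMD semantics of the core instructions can be made concrete with a small software model. The following Python sketch is illustrative only and rests on several assumptions that are not in the paper: a Register Location is modelled as a list of $P$ integers with index 0 as $C_{0}$, the parameter values `P`, `Q` and `G` are arbitrary examples, and `SHPX` is simplified to shifts of whole subwords towards $C_{0}$.

```python
P, Q, G = 4, 4, 0b10011   # example values: four 4-bit subwords, G(y) = y^4 + y + 1


def gf_mul(a, b, g=G, m=Q):
    """Multiply a.b in GF(2^m); g is G(y) as an (m+1)-bit integer."""
    r = 0
    for i in reversed(range(m)):
        r <<= 1
        if (r >> m) & 1:
            r ^= g
        if (b >> i) & 1:
            r ^= a
    return r


def ADDP(ra, rb):
    """Lane-wise addition; in GF(2^m) this is XOR and needs no reduction."""
    return [x ^ y for x, y in zip(ra, rb)]


def MULT(ra, rb):
    """Lane-wise multiplication modulo G(y)."""
    return [gf_mul(x, y) for x, y in zip(ra, rb)]


def SUMA(ra):
    """Sum of all Coefficient Locations of one Register Location."""
    total = 0
    for x in ra:
        total ^= x
    return total


def REPA(ra, a):
    """Replicate Coefficient Location a into every lane."""
    return [ra[a]] * P


def REPO(rdes, des, ra, a):
    """Copy Coefficient Location a of ra into location des of rdes."""
    out = list(rdes)
    out[des] = ra[a]
    return out


def SHPX(ra, k):
    """Simplified shift by k whole subwords towards C_0, zero padded."""
    return ra[k:] + [0] * k
```

Modelling registers as lists keeps the lane structure visible; a packed-integer model would be closer to the hardware word but would obscure the per-coefficient behaviour.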

## 3. PRIMITIVE OPERATIONS

In this section, we show that almost all of the primitive operations for different applications over $\operatorname{GF}\left(2^{\mathrm{m}}\right)$ can be synthesized from the core ISA.

### 3.1. Reed Solomon and BCH codes

All of the primitive operations required in RS and BCH codecs can be broken down into polynomial operations over $\operatorname{GF}\left(2^{m}\right)$ of one form or another. They are: Polynomial Multiplications (PM), Polynomial Divisions (PD) and Polynomial Evaluations (PE). Non-systematic RS encoding is basically a PM, whereas systematic RS encoding is a PD. Syndrome Computation and Chien Search in the decoding stage are PEs. Key Equation Solving using the Extended Euclidean Algorithm is a combination of PDs and PMs. Therefore, deriving efficient ways of computing these polynomial operations is crucial. This section briefly describes some ways in which these polynomial operations can be synthesized using only the core ISA described before; these are by no means definitive. It is straightforward to extend the techniques presented here to more general polynomial operations.

### 3.1.1. Polynomial Multiplication



Figure 5 : Data Orientation of $A(x)$ and $B(x)$ for PM

Let $A(x)=\sum_{i=0}^{\operatorname{deg}(A(x))} a_{i} x^{i}$, $B(x)=\sum_{i=0}^{\operatorname{deg}(B(x))} b_{i} x^{i}$, with $a_{i}, b_{i} \in \mathrm{GF}\left(2^{m}\right)$. The polynomial multiplication $D(x)=A(x) \times B(x)$ is given by:

$$
D(x)=\sum_{i=0}^{\operatorname{deg}(A(x))} \sum_{j=0}^{\operatorname{deg}(B(x))} a_{i} b_{j} x^{i+j}
$$

Assuming the degrees of $\mathrm{A}(\mathrm{x})$ and $\mathrm{B}(\mathrm{x})$ are less than $\mathrm{P}-1$, each polynomial can be fitted into a single Register Location, LSB justified, as shown in Figure 5. Using the core arithmetic instructions, a PM can thus be synthesised:

- $R_{1} \leftarrow A(x)$; $R_{2} \leftarrow B(x)$; $R_{\text {Temp1 }} \leftarrow$ Zero Polynomial
- For $\mathrm{i}=0$ to $\operatorname{deg}(\mathrm{A}(\mathrm{x}))$ Loop
  - $\mathrm{R}_{\text {Temp2 }} \leftarrow$ REPA $\mathrm{R}_{1}\left(\mathrm{C}_{\mathrm{i}}\right)$
  - $\mathrm{R}_{\text {Temp3 }} \leftarrow$ MULT $\mathrm{R}_{\text {Temp2 }}, \mathrm{R}_{2}$
  - $\mathrm{R}_{\text {Temp1 }} \leftarrow$ ADDP $\mathrm{R}_{\text {Temp3 }}, \mathrm{R}_{\text {Temp1 }}$
  - $\mathrm{R}_{\text {Result }}\left(\mathrm{C}_{\mathrm{i}}\right) \leftarrow$ REPO $\mathrm{R}_{\text {Temp1 }}\left(\mathrm{C}_{\mathrm{i}}\right)$
  - $\mathrm{R}_{\text {Temp1 }} \leftarrow$ SHPX LSB, $\mathrm{R}_{\text {Temp1 }}$, Q-Bits
- End Loop;
- For $\mathrm{j}=0$ to $\operatorname{deg}(\mathrm{B}(\mathrm{x}))-1$ Loop
  - $\mathrm{R}_{\text {Result }}\left(\mathrm{C}_{\operatorname{deg}(\mathrm{A}(\mathrm{x}))+\mathrm{j}}\right) \leftarrow$ REPO $\mathrm{R}_{\text {Temp1 }}\left(\mathrm{C}_{\mathrm{j}}\right)$
- End Loop;

This is basically a multiply-add-shift operation on $A(x)$ and $B(x)$. The result $D(x)$ will be stored in $R_{\text {Result}}$. Note that if $\operatorname{deg}(A(x))+\operatorname{deg}(\mathrm{B}(\mathrm{x}))>\mathrm{P}-1$, the result $\mathrm{D}(\mathrm{x})$ may exceed the size of one Register Location and has to be stored accordingly. Since the degrees of $\mathrm{A}(\mathrm{x}), \mathrm{B}(\mathrm{x})$ and $\mathrm{D}(\mathrm{x})$ of all PMs present in RS and BCH codes can be predetermined, it is relatively easy to determine the storage required.
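The multiply-add-shift idea can be sketched in Python as follows. This is an illustration under assumptions, not the exact instruction sequence: `poly_mul_simd` is a hypothetical name, registers are modelled as lists with index 0 as $C_{0}$, the lowest lane of the accumulator is taken as the finished coefficient before each shift, and $B(x)$ is assumed to fit in one Register Location.

```python
def gf_mul(a, b, g, m):
    """Multiply a.b in GF(2^m); g is G(y) as an (m+1)-bit integer."""
    r = 0
    for i in reversed(range(m)):
        r <<= 1
        if (r >> m) & 1:
            r ^= g
        if (b >> i) & 1:
            r ^= a
    return r


def poly_mul_simd(a, b, g, m, P):
    """Multiply polynomials a and b (coefficient lists, lowest degree first).

    Assumes len(b) <= P so that B(x) fits in one Register Location.
    """
    r2 = b + [0] * (P - len(b))                      # R2 <- B(x), LSB justified
    acc = [0] * P                                    # R_Temp1 <- zero polynomial
    result = []
    for ai in a:
        prod = [gf_mul(ai, bj, g, m) for bj in r2]   # REPA then MULT
        acc = [x ^ y for x, y in zip(acc, prod)]     # ADDP
        result.append(acc[0])                        # lowest lane is now final
        acc = acc[1:] + [0]                          # shift by one subword
    result.extend(acc[:len(b) - 1])                  # remaining high coefficients
    return result
```

Over GF($2^{4}$) with $G(y)=y^{4}+y+1$, the product of $1+2x$ and $3+x$ is $3+7x+2x^{2}$, where the integers denote field elements in polynomial basis.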

### 3.1.2. Polynomial Division

Again, let $\mathrm{A}(\mathrm{x})$ and $\mathrm{B}(\mathrm{x})$ follow the same notation as before. Given $\mathrm{A}(\mathrm{x})$ and $\mathrm{B}(\mathrm{x})$, find $\mathrm{D}(\mathrm{x})$ and $\mathrm{E}(\mathrm{x})$ which satisfy the expression:

$$
\mathrm{D}(\mathrm{x}) \times \mathrm{B}(\mathrm{x})+\mathrm{E}(\mathrm{x})=\mathrm{A}(\mathrm{x})
$$

In other words, we want to calculate $\mathrm{A}(\mathrm{x}) / \mathrm{B}(\mathrm{x})$ and obtain the quotient $\mathrm{D}(\mathrm{x})$ and the remainder $\mathrm{E}(\mathrm{x})$. Assuming $\operatorname{deg}(A(x))=\operatorname{deg}(B(x))+1$, the quotient has $\operatorname{deg}(D(x))=1$ and the remainder has $\operatorname{deg}(E(x)) \leq \operatorname{deg}(B(x))-1$. For simplicity, we assume here that $A(x)$ and $\mathrm{B}(\mathrm{x})$ can each be fitted into a single Register Location, MSB justified. The steps involved in the polynomial division (also commonly known as Polynomial Long Division) are:

1. $\mathrm{R}_{1} \leftarrow \mathrm{A}(\mathrm{x})$; $\mathrm{R}_{2} \leftarrow \mathrm{B}(\mathrm{x})$
2. $\mathrm{R}_{\text {Temp1 }} \leftarrow$ REPA $\mathrm{R}_{1}\left(\mathrm{C}_{\mathrm{P}-1}\right)$
3. $\mathrm{R}_{\text {Temp2 }} \leftarrow$ MULT $\mathrm{R}_{\text {Temp1 }}, \mathrm{R}_{2}$
4. $\mathrm{R}_{\text {Temp1 }} \leftarrow$ REPA $\mathrm{R}_{2}\left(\mathrm{C}_{\mathrm{P}-1}\right)$
5. $\mathrm{R}_{\text {Temp2 }} \leftarrow$ DIVI $\mathrm{R}_{\text {Temp2 }}, \mathrm{R}_{\text {Temp1 }}$
6. $\mathrm{R}_{\text {Quotient }}\left(\mathrm{C}_{1}\right) \leftarrow$ REPO $\mathrm{R}_{\text {Temp2 }}\left(\mathrm{C}_{\mathrm{P}-1}\right)$
7. $\mathrm{R}_{\text {Temp3 }} \leftarrow$ ADDP $\mathrm{R}_{\text {Temp2 }}, \mathrm{R}_{1}$
8. $\mathrm{R}_{\text {Temp1 }} \leftarrow$ REPA $\mathrm{R}_{\text {Temp3 }}\left(\mathrm{C}_{\mathrm{P}-2}\right)$
9. $\mathrm{R}_{\text {Temp2 }} \leftarrow$ MULT $\mathrm{R}_{\text {Temp1 }}, \mathrm{R}_{2}$
10. $\mathrm{R}_{\text {Temp1 }} \leftarrow$ REPA $\mathrm{R}_{2}\left(\mathrm{C}_{\mathrm{P}-1}\right)$
11. $\mathrm{R}_{\text {Temp2 }} \leftarrow$ DIVI $\mathrm{R}_{\text {Temp2 }}, \mathrm{R}_{\text {Temp1 }}$
12. $\mathrm{R}_{\text {Quotient }}\left(\mathrm{C}_{0}\right) \leftarrow$ REPO $\mathrm{R}_{\text {Temp2 }}\left(\mathrm{C}_{\mathrm{P}-2}\right)$
13. $\mathrm{R}_{\text {Remainder }} \leftarrow$ ADDP $\mathrm{R}_{\text {Temp3 }}, \mathrm{R}_{\text {Temp2 }}$
14. $\mathrm{R}_{\text {Quotient }}=\mathrm{D}(\mathrm{x})$; $\mathrm{R}_{\text {Remainder }}=\mathrm{E}(\mathrm{x})$

The number of instructions needed to compute a polynomial division will be minimum when $\mathrm{A}(\mathrm{x})$ and $\mathrm{B}(\mathrm{x})$ can be fitted into a single register. Polynomial division is the fundamental primitive operation in Key Equation Solving using the Euclidean Algorithm and it can be concluded that as long as $P \geq 2 t$, where $t$ is the error correcting capability of a BCH or RS code, the number of instructions needed for the Euclidean Algorithm will be at a minimum.
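As a reference for what the instruction sequence has to compute, polynomial long division over $\operatorname{GF}\left(2^{m}\right)$ can be sketched in Python. This is a plain software model, not the register-level sequence above: the names `gf_inv` and `poly_divmod` are hypothetical, coefficient lists are stored lowest degree first, and division by the leading coefficient of $B(x)$ is done by multiplying with its inverse.

```python
def gf_mul(a, b, g, m):
    """Multiply a.b in GF(2^m); g is G(y) as an (m+1)-bit integer."""
    r = 0
    for i in reversed(range(m)):
        r <<= 1
        if (r >> m) & 1:
            r ^= g
        if (b >> i) & 1:
            r ^= a
    return r


def gf_inv(a, g, m):
    """Inverse via a^(2^m - 2), built from repeated squarings."""
    r, s = 1, a
    for _ in range(m - 1):
        s = gf_mul(s, s, g, m)     # s = a^(2^i)
        r = gf_mul(r, s, g, m)
    return r


def poly_divmod(a, b, g, m):
    """Return (D, E) with D(x).B(x) + E(x) = A(x); b[-1] must be non-zero."""
    rem = list(a)
    db = len(b) - 1
    inv_lead = gf_inv(b[-1], g, m)
    quo = [0] * max(len(a) - db, 0)
    for k in range(len(a) - 1, db - 1, -1):
        q = gf_mul(rem[k], inv_lead, g, m)          # leading coeff ratio (DIVI)
        quo[k - db] = q
        for j, bj in enumerate(b):                  # subtract q.B(x).x^(k-db)
            rem[k - db + j] ^= gf_mul(q, bj, g, m)  # MULT then ADDP
    return quo, rem[:db]
```

For example, over GF($2^{4}$) with $G(y)=y^{4}+y+1$, dividing $6+7x+2x^{2}$ by $3+x$ returns the quotient $1+2x$ and the remainder $5$.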

### 3.1.3. Polynomial Evaluation

|  | $\mathrm{C}_{0}$ | $\mathrm{C}_{1}$ | $\mathrm{C}_{2}$ | $\mathrm{C}_{3}$ | ... | $\mathrm{C}_{\mathrm{P}-1}$ |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| $\mathrm{R}_{1}$ | $\mathrm{a}_{0}$ | $\mathrm{a}_{1}$ | $\mathrm{a}_{2}$ | $\mathrm{a}_{3}$ | ... | $\mathrm{a}_{\mathrm{P}-1}$ |
| $\mathrm{R}_{2}$ | $\mathrm{a}_{\mathrm{P}}$ | $\mathrm{a}_{\mathrm{P}+1}$ | $\mathrm{a}_{\mathrm{P}+2}$ | 0 | ... | 0 |
| $\mathrm{R}_{3}$ | 1 | $\alpha$ | $\alpha^{2}$ | $\alpha^{3}$ | ... | $\alpha^{P-1}$ |
| $\mathrm{R}_{4}$ | $\alpha^{P}$ | $\alpha^{P+1}$ | $\alpha^{P+2}$ | 0 | ... | 0 |

Figure 6 :Data Orientation for Polynomial Evaluation

Compute $\mathrm{A}(\alpha)=\sum_{i=0}^{\operatorname{deg}(A(x))} a_{i} \alpha^{i}$ where $\alpha \in \mathrm{GF}\left(2^{m}\right)$. As an example, we assume that $\mathrm{A}(\mathrm{x})$ in this case spans two Register Locations as shown in Figure 6. To save on computation cycles, P-1 multiple powers of $\alpha$ are pre-computed and stored in multiple Register Locations as well. To evaluate $\mathrm{A}(\alpha)$:

1. $\mathrm{R}_{1}, \mathrm{R}_{2} \leftarrow \mathrm{A}(\mathrm{x})$; $\mathrm{R}_{3}, \mathrm{R}_{4} \leftarrow$ pre-computed powers of $\alpha$
2. $\mathrm{R}_{\text {Temp1 }} \leftarrow$ MULT $\mathrm{R}_{1}, \mathrm{R}_{3}$
3. $\mathrm{R}_{\text {Temp2 }} \leftarrow$ MULT $\mathrm{R}_{2}, \mathrm{R}_{4}$
4. $\mathrm{R}_{\text {Temp1 }} \leftarrow$ ADDP $\mathrm{R}_{\text {Temp1 }}, \mathrm{R}_{\text {Temp2 }}$
5. $\mathrm{R}_{\text {Result }}\left(\mathrm{C}_{0}\right) \leftarrow$ SUMA $\mathrm{R}_{\text {Temp1 }}$
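The evaluation steps (lane-wise MULT against pre-computed powers, ADDP across Register Locations, then SUMA) can be sketched in Python. The function name `poly_eval_simd` is hypothetical, registers are modelled as lists of $P$ subwords, and the powers of $\alpha$ are computed on the fly here rather than loaded from a table.

```python
def gf_mul(a, b, g, m):
    """Multiply a.b in GF(2^m); g is G(y) as an (m+1)-bit integer."""
    r = 0
    for i in reversed(range(m)):
        r <<= 1
        if (r >> m) & 1:
            r ^= g
        if (b >> i) & 1:
            r ^= a
    return r


def poly_eval_simd(coeffs, alpha, g, m, P):
    """Evaluate A(alpha) for coeffs given lowest degree first."""
    powers = [1]                                   # 1, alpha, alpha^2, ...
    for _ in range(len(coeffs) - 1):
        powers.append(gf_mul(powers[-1], alpha, g, m))
    acc = [0] * P
    for start in range(0, len(coeffs), P):         # one Register Location per pass
        ra = coeffs[start:start + P]
        rp = powers[start:start + P]
        ra = ra + [0] * (P - len(ra))              # zero-pad the last register
        rp = rp + [0] * (P - len(rp))
        prod = [gf_mul(x, y, g, m) for x, y in zip(ra, rp)]   # MULT
        acc = [x ^ y for x, y in zip(acc, prod)]               # ADDP
    total = 0
    for x in acc:                                  # SUMA
        total ^= x
    return total
```

With $P=2$ the polynomial $1+2x+3x^{2}$ over GF($2^{4}$), $G(y)=y^{4}+y+1$, spans two Register Locations; evaluating it at $\alpha=y$ (the integer 2) yields 9.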

### 3.2. Advanced Encryption Standard

In general, the basic operations of the AES can be synthesized using the core ISA, with the exception of the Affine Transform required in the SubWord Transformation. We can easily modify the DIVI instruction so that a Forward Affine Transform (FAT) is computed after an inversion, or an Inverse Affine Transform (IAT) is computed before an inversion (see Table 5). There is substantial parallelism in the AES algorithm, which makes its implementation in a SWP architecture very attractive. An AES State is stored in the register file as shown in Figure 7, where a 128-bit block (or state) is stored across 4 Register Locations. If $P>4$, multiple blocks of data can be stored in the same 4 Register Locations. In the example of Figure 7, $\mathrm{P}=8$, hence we can store 2 AES states in every 4 Register Locations. This also means that multiple AES encryptions or decryptions can be computed at the same time. On closer examination, it is evident that the ShiftRow operation of the AES is the bottleneck of the system if it is implemented with just the core ISA. A new instruction, SHRW, is therefore created specifically for the ShiftRow operation in AES.

Table 5 : Extended Instructions for AES

| Instruction | Operation | Range |
| :---: | :--- | :---: |
| DIVI <br> (SIMD-AES) | $\mathrm{R}_{\text {des }}\left(\mathrm{C}_{\mathrm{i}}\right) \leftarrow \operatorname{FAT}\left[1 / \mathrm{R}_{\mathrm{b}}\left(\mathrm{C}_{\mathrm{i}}\right)\right] \bmod \mathrm{G}(\mathrm{y})$ <br> $\mathrm{R}_{\text {des }}\left(\mathrm{C}_{\mathrm{i}}\right) \leftarrow 1 / \operatorname{IAT}\left[\mathrm{R}_{\mathrm{b}}\left(\mathrm{C}_{\mathrm{i}}\right)\right] \bmod \mathrm{G}(\mathrm{y})$ | for $\mathrm{i}=0$ to $\mathrm{P}-1$ |
| SHRW | $\mathrm{R}_{\text {des }} \leftarrow$ Shift-Row-Index, $\mathrm{R}_{\mathrm{a}}$ |  |


|  | $\mathrm{C}_{1}$ | $\mathrm{C}_{2}$ | $\mathrm{C}_{3}$ | $\mathrm{C}_{4}$ | $\mathrm{C}_{5}$ | $\mathrm{C}_{6}$ | $\mathrm{C}_{7}$ | $\mathrm{C}_{8}$ |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| $\mathrm{R}_{1}$ | $\mathrm{S}_{0,0}$ | $\mathrm{S}_{0,1}$ | $\mathrm{S}_{0,2}$ | $\mathrm{S}_{0,3}$ | $\mathrm{S}_{0,0}$ | $\mathrm{S}_{0,1}$ | $\mathrm{S}_{0,2}$ | $\mathrm{S}_{0,3}$ |
| $\mathrm{R}_{2}$ | $\mathrm{S}_{1,0}$ | $\mathrm{S}_{1,1}$ | $\mathrm{S}_{1,2}$ | $\mathrm{S}_{1,3}$ | $\mathrm{S}_{1,0}$ | $\mathrm{S}_{1,1}$ | $\mathrm{S}_{1,2}$ | $\mathrm{S}_{1,3}$ |
| $\mathrm{R}_{3}$ | $\mathrm{S}_{2,0}$ | $\mathrm{S}_{2,1}$ | $\mathrm{S}_{2,2}$ | $\mathrm{S}_{2,3}$ | $\mathrm{S}_{2,0}$ | $\mathrm{S}_{2,1}$ | $\mathrm{S}_{2,2}$ | $\mathrm{S}_{2,3}$ |
| $\mathrm{R}_{4}$ | $\mathrm{S}_{3,0}$ | $\mathrm{S}_{3,1}$ | $\mathrm{S}_{3,2}$ | $\mathrm{S}_{3,3}$ | $\mathrm{S}_{3,0}$ | $\mathrm{S}_{3,1}$ | $\mathrm{S}_{3,2}$ | $\mathrm{S}_{3,3}$ |
|  | 1st AES State |  |  |  | 2nd AES State |  |  |  |

Figure 7 : Storage of AES state in Register File
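With the layout of Figure 7, the ShiftRow operation becomes a rotation inside each group of 4 subwords of a register. The sketch below shows only the data movement that results, assuming one state row per register and $P$ a multiple of 4. It does not claim to reproduce the encoding or behaviour of the dedicated ShiftRow instruction, and the function name is hypothetical.

```python
def shift_rows(regs):
    """Apply AES ShiftRows to 4 registers holding P/4 states side by side.

    regs[r] is the list of P subwords of state row r. Row r is rotated
    left by r positions separately within every group of 4 subwords,
    so each stored state is shifted independently of its neighbours.
    """
    out = []
    for row, reg in enumerate(regs):
        shifted = []
        for base in range(0, len(reg), 4):
            block = reg[base:base + 4]
            shifted.extend(block[row:] + block[:row])
        out.append(shifted)
    return out
```

For $P=8$, a row-1 register holding subwords `[0, 1, 2, 3, 4, 5, 6, 7]` becomes `[1, 2, 3, 0, 5, 6, 7, 4]`, while the row-0 register is left unchanged.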

### 3.3. Elliptic Curve Cryptography

The primitive operations of Elliptic Curve Cryptography can be broken down into Point Addition and Point Doubling. A Non-Supersingular Elliptic Curve [2] is defined over Affine Coordinates as:
$y^{2}+x y=x^{3}+a_{2} x^{2}+a_{6} \quad$ where $a_{2}$ and $a_{6} \in G F\left(2^{m}\right)$
Given that two points $\mathrm{A}=\left(x_{1}, y_{1}\right)$ and $\mathrm{B}=\left(x_{2}, y_{2}\right)$ lie on an Elliptic Curve, point addition $\mathrm{D}=\mathrm{A}+\mathrm{B}$ is defined as below, where $\mathrm{D}=\left(x_{3}, y_{3}\right)$. Note that addition and subtraction coincide in $GF\left(2^{m}\right)$. If $\mathrm{A} \neq \mathrm{B}$ then
$\theta=\frac{y_{2}+y_{1}}{x_{2}+x_{1}}, x_{3}=\theta^{2}+\theta+x_{2}+x_{1}+a_{2}, y_{3}=\theta\left(x_{3}+x_{1}\right)+x_{3}+y_{1}$
If $\mathrm{A}=\mathrm{B}$ then $\theta=x_{1}+\frac{y_{1}}{x_{1}}$ (equivalently $\theta=x_{2}+\frac{y_{2}}{x_{2}}$)
and $x_{3}=\theta^{2}+\theta+a_{2}, y_{3}=x_{1}^{2}+(\theta+1) x_{3}$.
These formulas require only additions, multiplications and divisions over $\operatorname{GF}\left(2^{m}\right)$, and can be synthesized using the core ISA, primarily ADDP, MULT and DIVI operating in SISD mode.
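The affine point operations can be checked numerically on a toy field. The sketch below is an illustration under stated assumptions: it uses $\mathrm{GF}(2^{4})$ with the polynomial $x^{4}+x+1$ and the hypothetical coefficients $a_{2}=a_{6}=1$, far too small for cryptographic use. It uses the standard characteristic-2 formulas, in which subtraction is XOR and $y_{3}=\theta\left(x_{1}+x_{3}\right)+x_{3}+y_{1}$ for point addition. The point at infinity is not modelled, so `point_add` requires $x_{1} \neq x_{2}$ and `point_double` requires $x_{1} \neq 0$.

```python
M = 4            # toy field GF(2^4), an assumption for illustration
POLY = 0b10011   # x^4 + x + 1
A2, A6 = 1, 1    # hypothetical curve coefficients


def gf_mul(a, b):
    """Multiply in GF(2^M)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= POLY
    return r


def gf_div(a, b):
    """Divide in GF(2^M), using b^(2^M - 2) as the inverse of b."""
    if b == 0:
        raise ZeroDivisionError("division by zero in GF(2^m)")
    inv = 1
    for _ in range((1 << M) - 2):
        inv = gf_mul(inv, b)
    return gf_mul(a, inv)


def on_curve(x, y):
    """Check y^2 + xy = x^3 + a2*x^2 + a6."""
    x2 = gf_mul(x, x)
    return gf_mul(y, y) ^ gf_mul(x, y) == gf_mul(x2, x) ^ gf_mul(A2, x2) ^ A6


def point_add(p, q):
    """A + B for affine points with x1 != x2."""
    (x1, y1), (x2, y2) = p, q
    theta = gf_div(y1 ^ y2, x1 ^ x2)
    x3 = gf_mul(theta, theta) ^ theta ^ x1 ^ x2 ^ A2
    y3 = gf_mul(theta, x1 ^ x3) ^ x3 ^ y1
    return x3, y3


def point_double(p):
    """2A for an affine point with x1 != 0."""
    x1, y1 = p
    theta = x1 ^ gf_div(y1, x1)
    x3 = gf_mul(theta, theta) ^ theta ^ A2
    y3 = gf_mul(x1, x1) ^ gf_mul(theta ^ 1, x3)
    return x3, y3
```

Each point operation costs one division, a small number of multiplications and several XORs, which is why ADDP, MULT and DIVI dominate the instruction mix.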

## 4. RESULTS \& CONCLUSIONS

It is evident that once the requirements of the primitive operations are determined and a processor architecture is designed to meet these requirements, a stable platform is available that can be tailored for different applications or different groups of applications by changing P and Q at design time. We show an example where P and Q are fixed and determine the range of applications this processor can be used for without re-design. This is to show the inherent flexibility of the processor architecture for different applications and does not represent a practical design flow, where applications usually determine the values of P and Q .

### 4.1. $P=8$, $Q=8$ for RS, BCH, AES and ECC

The largest small field size the processor can compute in parallel is constrained by $Q$. For BCH or $\mathrm{RS}(\mathrm{N}, \mathrm{K})$ codes, this constrains the maximum value of $N$. Table 6 shows the ranges of $(\mathrm{N}, \mathrm{K})$ codecs that can be computed without re-design for $\mathrm{Q}=8$. For the AES, the only variable is the key schedule computation for the different key sizes, and $\mathrm{P} / 4$ denotes the number of AES blocks that can be computed in parallel for a specific value of $P$. The largest field size the processor can operate on in this case is $\mathrm{GF}\left(2^{64}\right)$ ($\mathrm{m}=\mathrm{P} \times \mathrm{Q}$). In practical applications, $P$ can be chosen to be large enough for secure elliptic curve cryptography. A proof-of-concept prototype GF Processor has been implemented on a Field Programmable Gate Array (Xilinx Virtex XCV-800) using the concepts outlined in this paper for $\mathrm{P}=8$ and $\mathrm{Q}=8$; Table 7 shows the throughput of a few applications from Table 6. The processor was designed in VHDL using Xilinx Foundation 3.1e and occupies 2505 slices, with an equivalent gate count of 43,251 gates. Analysis indicates that the size of the processor scales approximately linearly with $m$ (i.e. $\mathrm{P} \times \mathrm{Q}$).
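The relationships between $P$, $Q$ and the supported applications can be tabulated with a few lines of arithmetic. The helper below is a hypothetical sketch: it assumes that the codeword length of an RS or BCH code over $\mathrm{GF}(2^{Q})$ is at most $2^{Q}-1$, and that AES (which works on 8-bit subwords) needs $Q \geq 8$ and 4 subwords per state row.

```python
def design_point(p, q):
    """Summarise what a (P, Q) design point supports.

    p: number of subwords per register (P)
    q: width of each subword in bits (Q)
    """
    return {
        "m": p * q,                                    # largest field GF(2^m), m = P x Q
        "max_codeword_length": 2 ** q - 1,             # N <= 2^Q - 1 for RS/BCH
        "parallel_aes_blocks": p // 4 if q >= 8 else 0,  # P/4 states per 4 registers
    }
```

For the prototype's parameters, `design_point(8, 8)` gives $m=64$, a maximum codeword length of 255 and 2 parallel AES blocks, in line with the figures quoted above.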

Table 6: Ranges of Applications for $\mathrm{P}=8$ and $\mathrm{Q}=8$

| Application | Variables for $\mathrm{Q}=8$ |
| :---: | :---: |
| RS, BCH | $(255, \mathrm{K})$, $(63, \mathrm{K})$, $(31, \mathrm{K})$ codecs |
| AES | 128, 192, 256-bit key size; $\mathrm{P} / 4$ parallel block computations |

Table 7: Throughput of $\mathrm{GF}\left(2^{64}\right)$ Processor for $\mathrm{P}=8$, $\mathrm{Q}=8$

| Results for $\mathrm{m}=64$, $\mathrm{P}=8$, $\mathrm{Q}=8$ | Speed at 40 MHz clock |
| :---: | :---: |
| $\mathrm{RS}(255,247)$ Decoder | 11 Mbps |
| $\mathrm{RS}(31,25)$ Decoder | 6.6 Mbps |
| $\mathrm{BCH}(31,16)$ Decoder* | 1.33 Mbps |
| 128-bit Key AES (10 Rounds) without Key Expansion, 2 Parallel Computations | 3.8 Mbps |
| 128-bit Key Schedule Computation | $42.5 \mu \mathrm{s}$ |
| Elliptic Curve Point Addition <br> Elliptic Curve Point Doubling <br> (Affine Coordinates, $\mathrm{GF}\left(2^{61}\right)$) | $7.125 \mu \mathrm{s}$ |

### 4.2. Conclusions

The flexibility of a given architecture cannot easily be measured in quantifiable terms. Yet, as systems become increasingly complex, issues of design re-use make a flexible architecture with programmable parameters over a large range of applications increasingly important. There has been a general trend to migrate the complexity of traditionally application-specific implementations towards software-controlled architectures, where the flexibility of software allows maximum flexibility over different applications. However, not all such migrations can be achieved easily, with problems ranging from designing flexible arithmetic circuits to supporting flexible software-controlled architectures. This paper has outlined a design space exploration for domain applications based on Galois Fields. The applications are broken down into their primitive operations, and a flexible software-programmable processor has been designed to handle these primitive operations. This in turn allows the same architecture to be used for all applications that can be defined using these primitive operations. This paper focuses mainly on designing for maximum flexibility across different applications in the same domain. Further work is needed to address design issues at each abstraction level for other constraints, such as the trade-offs of speed, area and power against flexibility for a given application in the GF domain. A design methodology can then be defined for primitive-based hardware/software co-design techniques.

## 5. REFERENCES

[1] S. B. Wicker, Error Control Systems for Digital Communication and Storage. Englewood Cliffs, NJ; London: Prentice Hall, 1995.
[2] M. Rosing, Implementing Elliptic Curve Cryptography. Greenwich, CT: Manning, 1999.
[3] J. Daemen and V. Rijmen, "AES Proposal: The Rijndael Block Cipher," http://csrc.nist.gov/encryption/aes/rijndael/, 2000.
[4] L. Song, K. K. Parhi, I. Kuroda, and T. Nishitani, "Hardware/software codesign of finite field datapath for low-energy Reed-Solomon codecs," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 8, no. 2, pp. 160-172, Apr. 2000.
[5] H. M. Ji, "An optimized processor for fast Reed-Solomon encoding and decoding," in Proceedings of the 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing (Cat. No.02CH37334), vol. 3, pp. III-3097-3100, 2002.
[6] W. Drescher, M. Mennenga, and G. Fettweis, "An architectural study of a digital signal processor for block codes," in Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181), vol. 5, pp. 3129-3132, 1998.
[7] M. Hasan and A. Wassal, "VLSI Algorithms, Architectures, and Implementation of a Versatile GF($2^{\mathrm{m}}$) Processor," IEEE Transactions on Computers, vol. 49, pp. 1064-1073, 2000.
[8] R. B. Lee, "Subword parallelism with MAX-2," IEEE Micro, vol. 16, no. 4, pp. 51-59, Aug. 1996.
[9] P. A. Scott, S. E. Tavares, and L. E. Peppard, "A Fast VLSI Multiplier for GF($2^{\mathrm{m}}$)," IEEE Journal on Selected Areas in Communications, vol. SAC-4, pp. 62-66, 1986.
[10] H. Brunner, A. Curiger, and M. Hofstetter, "On Computing Multiplicative Inverses in GF($2^{\mathrm{m}}$)," IEEE Transactions on Computers, vol. 42, pp. 1010, 1993.


[^0]:    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
    CODES+ISSS'03, October 1-3, 2003, Newport Beach, California, USA. Copyright 2003 ACM 1-58113-742-7/03/0010...\$5.00.

