
(Open Access) A highly regular and scalable AES hardware architecture (2003) | Stefan Mangard

A Highly Regular and Scalable AES Hardware Architecture

Stefan Mangard, Student Member, IEEE, Manfred Aigner, and Sandra Dominikus

Abstract: This article presents a highly regular and scalable AES hardware architecture, suited for full-custom as well as for semi-custom design flows. Contrary to other publications, a complete architecture (even including CBC mode) that is scalable in terms of throughput and in terms of the used key size is described. Similarities of encryption and decryption are utilized to provide a high level of performance using only a relatively small area (10,799 gate equivalents for the standard configuration). This performance is reached by balancing the combinational paths of the design. No other published AES hardware architecture provides similar balancing or a comparable regularity. Implementations of the fastest configuration of the architecture provide a throughput of 241 Mbits/sec on a 0.6 µm CMOS process using standard cells.

Index Terms: Advanced Encryption Standard (AES), hardware architecture, IP module, VLSI, scalability, regularity.

1 INTRODUCTION

The symmetric block cipher Rijndael [1] was standardized by NIST as the Advanced Encryption Standard (AES) [2] in November 2001. Being the successor of the Data Encryption Standard (DES) [3], the AES is used in a wide range of applications.

The AES is the preferred algorithm for implementations of cryptographic protocols that are based on a symmetric cipher. It is not only used to secure data transfers between small, mobile consumer products, but it is also used in high-end servers. Consequently, the requirements for implementations of the AES differ significantly.

Applications with strict requirements concerning performance, power consumption, or side-channel leakage are, in practice, usually implemented by dedicated hardware. Hardware implementations of the AES are, for example, used in Internet servers as performance accelerators or in smart cards (besides other reasons) to increase the resistance against side-channel attacks.

Due to the practical importance of hardware implementations, the different AES candidates were implemented and compared on FPGAs (see [4], [5], and [6]) and on ASICs [7] before Rijndael was finally selected to become the AES. After this selection, more effort was dedicated toward the development of efficient hardware implementations of this particular algorithm (see [8], [9], [10], and [11]). The most recent proposal for an ASIC architecture of the AES is [12]. However, this architecture has very unbalanced combinational paths and requires a time- and area-consuming selector function, which is not part of the actual AES algorithm.

This article presents a highly regular and scalable AES hardware architecture that requires only 10,799 gate equivalents to provide a throughput of 128 Mbits/sec (for AES-128 encryption and decryption) on a 0.6 µm standard cell library. These numbers include an AMBA APB bus interface, a CBC register, and a key storage register.

The architecture uses similarities of encryption and decryption to provide a high level of performance while keeping the chip size small. The high performance is reached in particular by keeping combinational paths balanced so that every clock cycle is fully utilized. The fact that the combinational paths are short compared to other published AES architectures makes the presented architecture a favorable choice for low-power applications. This is because glitches, which occur more frequently in long combinational paths than in short ones, cause significant power consumption.

Besides the small area requirements and the high performance, the presented architecture has another important property: It is highly regular. This helps to keep the size of the AES architecture small during place-and-route of a semi-custom design flow and facilitates the creation of full-custom designs. Full-custom approaches are particularly interesting for smart card implementations that are required to provide protection against power analysis attacks [13]. In a full-custom approach, the designer can balance the capacitive loads of differential nodes well, as is desired, for example, for logic styles like the one described in [14].

Another very important property of the presented architecture is its scalability. The performance of the architecture can be increased gradually at the cost of an increased chip size. Furthermore, the key size can easily be changed from 128 to 192 or 256 bits. However, the overall architecture does not change for versions with different performance and key sizes.

Section 2 gives a brief overview of the AES algorithm. In Section 3, the AES hardware architecture and the corresponding implementation options are described. The performance of the architecture is summarized and compared with other AES hardware implementations in Section 4. Concluding remarks can be found in Section 5.

IEEE TRANSACTIONS ON COMPUTERS, VOL. 52, NO. 4, APRIL 2003 483

• The authors are with the Institute for Applied Information Processing and Communications (IAIK), Graz University of Technology, Inffeldgasse 16a, A-8010 Graz, Austria. E-mail: {stefan.mangard, manfred.aigner, sandra.dominikus}@iaik.at.

Manuscript received 15 June 2002; revised 2 Dec. 2002; accepted 2 Dec. 2002. For information on obtaining reprints of this article, please send e-mail to: tc@computer.org, and reference IEEECS Log Number 117871.

1. NIST: National Institute of Standards and Technology.

0018-9340/03/$17.00 © 2003 IEEE. Published by the IEEE Computer Society

2 AES ALGORITHM

The AES is a round-based, symmetric block cipher. It is defined for a block size of 128 bits and key lengths of 128, 192, and 256 bits. According to the key length, these variants of the AES are called AES-128, AES-192, and AES-256. This article mainly focuses on implementing the AES-128, which is the most commonly used AES variant. However, the presented architecture can also be used for the other standardized key sizes.

The following subsection describes the AES transformations, which are the building blocks of AES encryptions and decryptions. In Section 2.2, the AES-128 key expansion is discussed.

2.1 AES Transformations

The AES takes a 128-bit data block as input and performs several different transformations on this block. In case of an encryption, the input block of the AES is called plaintext and the returned block is called ciphertext. All intermediate results of this block, as well as the input and the output block, are called states. For a discussion of the different transformations, executed on the 128-bit states in an AES encryption or decryption, it is best to picture a state as a 4-by-4 matrix of bytes (see Fig. 1). A 128-bit input/output block of the AES is mapped to an AES state by putting the first byte of the block in the upper left corner of the matrix and by filling in the remaining bytes column by column.
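
The column-by-column mapping between a block and a state can be sketched in a few lines of Python. This is only an illustration of the layout described in the text; the function names `block_to_state` and `state_to_block` are chosen here and do not come from the article.

```python
def block_to_state(block: bytes) -> list[list[int]]:
    """Map a 16-byte block to a 4x4 state, filling it column by column."""
    if len(block) != 16:
        raise ValueError("an AES block is exactly 16 bytes long")
    # state[row][col] holds byte number row + 4*col of the block, so the
    # first byte lands in the upper left corner of the matrix.
    return [[block[row + 4 * col] for col in range(4)] for row in range(4)]


def state_to_block(state: list[list[int]]) -> bytes:
    """Inverse mapping: read the state column by column."""
    return bytes(state[row][col] for col in range(4) for row in range(4))
```

With this layout, the first row of the state holds bytes 0, 4, 8, and 12 of the block, and the first column holds bytes 0 to 3.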

AES encryptions and decryptions are based on four different transformations that are performed repeatedly in a certain sequence. Each of these transformations, which are described in the following, maps a 128-bit input state to a 128-bit output state.

• SubBytes: The SubBytes transformation is a nonlinear substitution operation that works on bytes. Each byte of the input state is replaced using the same substitution function (called S-Box).

The S-Box is defined as the multiplicative inverse in the Galois field $GF(2^8)$ with the irreducible polynomial $m(x) = x^8 + x^4 + x^3 + x + 1$ followed by an affine transformation. The InvSubBytes transformation, which is needed for decryption, is the inverse of the affine transformation followed by the same inversion as in the SubBytes transformation.

• ShiftRows: The ShiftRows transformation rotates each row of the input state to the left, whereby the offset of the rotation corresponds to the row number. For example, row one (the row consisting of the elements $D_{1,0}$, $D_{1,1}$, $D_{1,2}$, and $D_{1,3}$) is rotated by one position to the left. The inverse of this transformation is computed by performing the corresponding rotations to the right.

• MixColumns: The MixColumns transformation maps each column of the input state to a new column in the output state. Each input column is considered as a polynomial over $GF(2^8)$ and multiplied with the constant polynomial $a(x) = \{03\}x^3 + \{01\}x^2 + \{01\}x + \{02\}$ modulo $x^4 - 1$. The coefficients of $a(x)$ are also elements of $GF(2^8)$ and are represented by hexadecimal values in this equation. The InvMixColumns transformation is the multiplication of each column with $a^{-1}(x) = \{0B\}x^3 + \{0D\}x^2 + \{09\}x + \{0E\}$ modulo $x^4 - 1$.

• AddRoundKey: The AddRoundKey transformation is self-inverting. It maps a 128-bit input state to a 128-bit output state by xoring the input state with a 128-bit round key.
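
The three linear transformations of the list above can be written down compactly in software. The following Python sketch is a behavioral model only and says nothing about the hardware structure; all function names are chosen for this illustration. The state is assumed to be a list of four rows of four bytes. Since addition and subtraction coincide in a field of characteristic 2, reducing modulo $x^4 - 1$ is the same as reducing modulo $x^4 + 1$, which the code expresses through the index arithmetic `(i - j) % 4`.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two GF(2^8) elements modulo m(x) = x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B  # reduce by m(x)
        b >>= 1
    return result


def shift_rows(state, inverse=False):
    """Rotate row r by r positions to the left (to the right if inverse)."""
    out = []
    for r, row in enumerate(state):
        k = (-r if inverse else r) % 4
        out.append(row[k:] + row[:k])
    return out


MIX = (0x02, 0x01, 0x01, 0x03)      # coefficients a_0..a_3 of a(x)
INV_MIX = (0x0E, 0x09, 0x0D, 0x0B)  # coefficients of the inverse polynomial


def mix_column(col, coeffs=MIX):
    """Multiply one column polynomial with a(x), reduced so that x^4 = 1."""
    out = []
    for i in range(4):
        acc = 0
        for j in range(4):
            acc ^= gf_mul(coeffs[(i - j) % 4], col[j])
        out.append(acc)
    return out


def mix_columns(state, inverse=False):
    coeffs = INV_MIX if inverse else MIX
    cols = [mix_column([state[r][c] for r in range(4)], coeffs) for c in range(4)]
    return [[cols[c][r] for c in range(4)] for r in range(4)]


def add_round_key(state, round_key):
    """XOR the state with a round key of the same 4x4 shape (self-inverting)."""
    return [[s ^ k for s, k in zip(srow, krow)] for srow, krow in zip(state, round_key)]
```

Applying `mix_columns` with `inverse=True` to the output of `mix_columns` returns the original state, and applying `add_round_key` twice with the same key does the same.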

These transformations are applied to a 128-bit input block in a certain sequence to perform an AES encryption or decryption. In both cases, the transformations are grouped into so-called rounds. There are three different types of rounds, namely, the initial round, the normal round, and the final round. The transformations of the different rounds and the sequence of the rounds are shown in Fig. 2. The rounds are slightly different for encryption and decryption and the number of rounds, Nr, depends on the key size.

The presented decryption algorithm is called Inverse Cipher. Compared to the encryption algorithm, it is simply the execution of the inverse transformations in reversed order. Alternatively, the so-called Equivalent Inverse Cipher can be used for decryption. However, for the presented AES hardware architecture, the Inverse Cipher is more suitable.

Fig. 1. Alignment of an AES state.

Fig. 2. Sequence of the execution of the four different transformations used in an AES encryption/decryption.
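
Because Fig. 2 is not reproduced here, the round structure can be sketched as a flat list of transformation names. The following Python sketch assumes the round composition and the round counts (Nr = 10, 12, 14) defined in the AES standard [2]; the function names are made up for this illustration. The Inverse Cipher sequence is derived exactly as the text describes it: inverse transformations in reversed order.

```python
ROUNDS = {128: 10, 192: 12, 256: 14}  # Nr per key size, as in the AES standard

INVERSE = {
    "SubBytes": "InvSubBytes",
    "ShiftRows": "InvShiftRows",
    "MixColumns": "InvMixColumns",
    "AddRoundKey": "AddRoundKey",  # self-inverting
}


def cipher_sequence(key_bits: int = 128) -> list[str]:
    """Flat list of transformations executed by one AES encryption."""
    nr = ROUNDS[key_bits]
    seq = ["AddRoundKey"]                      # initial round
    for _ in range(nr - 1):                    # normal rounds
        seq += ["SubBytes", "ShiftRows", "MixColumns", "AddRoundKey"]
    seq += ["SubBytes", "ShiftRows", "AddRoundKey"]  # final round
    return seq


def inverse_cipher_sequence(key_bits: int = 128) -> list[str]:
    """Inverse Cipher: the inverse transformations in reversed order."""
    return [INVERSE[step] for step in reversed(cipher_sequence(key_bits))]
```

For AES-128 this yields eleven AddRoundKey steps, one per round key, and nine MixColumns steps, because the final round omits MixColumns.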

2.2 AES-128 Key Expansion

For an AES-128 encryption, the 128-bit cipher key needs to be expanded to eleven 128-bit round keys. The principal idea of this key expansion is that the first round key, $\mathit{Roundkey}_0$, corresponds to the cipher key. All subsequent round keys are derived from their respective predecessor using a function $f$. So, $\mathit{Roundkey}_i = f(\mathit{Roundkey}_{i-1})$ for all $0 < i < 11$.

For an AES-128 decryption, the same round keys are used in reversed order. Using the inverse of the key expansion function, $f^{-1}$, the round keys can be derived recursively from $\mathit{Roundkey}_{10}$.

In Fig. 3, a pseudocode for the AES-128 key expansion is shown. This pseudocode is based on 32-bit key words and, so, the eleven 128-bit round keys are stored one after the other in the word array W[0..43]. The RotWord function, used in the pseudocode, rotates the input word by one byte to the left. The SubWord function applies the S-Box function to each byte of the input word. The RC values, finally, are the powers $x^{i-1}$ of $x$ in the same Galois field $GF(2^8)$ as used for the S-Box transformation.

Fig. 4 shows how the word array W[0..43] is mapped to the corresponding round keys. The key expansions for the AES-192 and for the AES-256 are very similar and are described in detail in [2].
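
Since the pseudocode of Fig. 3 is not reproduced here, the following Python sketch shows one way to write the AES-128 key expansion. It follows the formulation in the AES standard [2] and is not a transcription of the authors' figure; the S-Box table is computed from its definition (inversion in $GF(2^8)$ followed by the affine transformation), and the helper names are chosen for this illustration. The grouping in `round_keys`, four consecutive words per round key, is assumed to correspond to the mapping of Fig. 4.

```python
def _gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def _build_sbox() -> list[int]:
    table = []
    for value in range(256):
        inv = 1
        for _ in range(254):
            inv = _gf_mul(inv, value)  # value^254 is the inverse; 0 stays 0
        s = inv
        for k in range(1, 5):          # affine transformation as rotations
            s ^= ((inv << k) | (inv >> (8 - k))) & 0xFF
        table.append(s ^ 0x63)
    return table


SBOX = _build_sbox()


def rot_word(w: int) -> int:
    """Rotate a 32-bit word by one byte to the left."""
    return ((w << 8) | (w >> 24)) & 0xFFFFFFFF


def sub_word(w: int) -> int:
    """Apply the S-Box to each byte of a 32-bit word."""
    return int.from_bytes(bytes(SBOX[b] for b in w.to_bytes(4, "big")), "big")


def expand_key_128(key: bytes) -> list[int]:
    """Return the word array W[0..43] for a 128-bit cipher key."""
    if len(key) != 16:
        raise ValueError("AES-128 needs a 16-byte key")
    w = [int.from_bytes(key[4 * i:4 * i + 4], "big") for i in range(4)]
    rc = 0x01                          # RC value x^(i-1), starting at x^0
    for i in range(4, 44):
        temp = w[i - 1]
        if i % 4 == 0:
            temp = sub_word(rot_word(temp)) ^ (rc << 24)
            rc = _gf_mul(rc, 0x02)     # next power of x
        w.append(w[i - 4] ^ temp)
    return w


def round_keys(w: list[int]) -> list[list[int]]:
    """Group W[0..43] into eleven round keys of four words each."""
    return [w[4 * i:4 * i + 4] for i in range(11)]
```

The first round key returned by `round_keys` is the cipher key itself, and each later round key depends only on its predecessor, which is the property the key unit relies on.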

3 AES HARDWARE ARCHITECTURE

The AES hardware architecture presented in this article is very modular and provides a high level of scalability. While the standard version of the architecture is suited for smart cards, USB dongles, and similar devices, the high-performance version provides enough throughput to be used as an acceleration module in high-end servers. It is important to point out that, in both versions, the overall structure of the architecture remains the same, even for different key sizes.

This overall structure of the architecture, which is capable of performing AES encryptions and decryptions, is shown in Fig. 5. The AES hardware module consists of the following four components:

• Interface: The interface handles all communication of the AES module with its environment: it communicates based on 32-bit words with the other components of the AES module and via an AMBA APB bus with the environment of the module.

• Data Unit: The data unit is the main module of the architecture. It can perform any kind of AES encryption or decryption round using the round key that is assigned to its key input. Although the number of rounds is different for the three standardized key sizes, the types of rounds that are executed are always the same. Consequently, the data unit is independent of the key size.

The data unit has a highly regular structure, as indicated in Fig. 5. It consists of 16 instances of a so-called data cell and a certain number of S-Boxes. The more S-Boxes are used, the higher the performance of the AES module. The standard version of the data unit has four S-Boxes and is described in detail in Section 3.1. A high-performance version with 16 S-Boxes is presented in Section 3.2. In principle, it is also possible to implement a data unit with eight S-Boxes. This version can easily be derived from the description of the other two versions and is not presented separately.

• Key Unit: The key unit serves two main purposes: the storage of cipher keys and the calculation of the round keys. To save die size, the S-Boxes of the data unit are reused to perform the key expansion. In the presented architecture, this reuse is possible for any key size without loss of performance.

Since 128 bit is currently the most commonly used key size, a key unit capable of performing the 128-bit key expansion is described in detail in this article (see Section 3.3). The overall structure of the AES module, however, allows the usage of key modules supporting multiple key sizes in parallel or any of the standardized key sizes on its own.

• CBC Unit: An AES module just consisting of a key unit, a data unit, and an interface can already perform the AES algorithm in ECB (Electronic Code Book) mode. However, because there exist certain attacks (e.g., reordering of blocks) against this mode, usually other modes of operation [15] are used. The most popular one is the CBC (Cipher Block Chaining) mode, where the result of an AES encryption is xored with the next 128-bit input block. This procedure needs to be reversed when performing a decryption. The CBC unit of the AES module implements the CBC mode without any negative influence on the overall performance of the AES module.

Fig. 3. Pseudocode for the AES-128 key expansion.

Fig. 4. Mapping of the key words to round keys.

Fig. 5. Overall structure of the AES module.
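
The chaining performed in CBC mode can be illustrated with a short Python sketch. The block cipher is passed in as a function so that the chaining logic stays visible; `toy_encrypt` and `toy_decrypt` are hypothetical stand-ins that are NOT AES and offer no security. The initialization vector `iv` for the first block is part of the standard CBC definition [15]; how the CBC unit stores it is not covered by this sketch.

```python
from typing import Callable, List

BlockFn = Callable[[bytes], bytes]


def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def cbc_encrypt(blocks: List[bytes], iv: bytes, encrypt_block: BlockFn) -> List[bytes]:
    """XOR each input block with the previous ciphertext, then encrypt."""
    prev, out = iv, []
    for block in blocks:
        prev = encrypt_block(xor_blocks(block, prev))
        out.append(prev)
    return out


def cbc_decrypt(blocks: List[bytes], iv: bytes, decrypt_block: BlockFn) -> List[bytes]:
    """Reverse procedure: decrypt, then XOR with the previous ciphertext."""
    prev, out = iv, []
    for block in blocks:
        out.append(xor_blocks(decrypt_block(block), prev))
        prev = block
    return out


def toy_encrypt(block: bytes) -> bytes:
    """Stand-in block function (assumption, not AES): rotate bytes, add 1."""
    rotated = block[1:] + block[:1]
    return bytes((b + 1) % 256 for b in rotated)


def toy_decrypt(block: bytes) -> bytes:
    """Inverse of toy_encrypt."""
    restored = bytes((b - 1) % 256 for b in block)
    return restored[-1:] + restored[:-1]
```

Two identical plaintext blocks produce different ciphertext blocks under this chaining, which is why reordering attacks that work against ECB mode do not carry over directly.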

In the presented architecture, a 128-bit block of data is encrypted as follows: First, a cipher key needs to be loaded via the interface into the key unit. Once a key is loaded, it can be used for an arbitrary number of encryptions and decryptions. After loading the cipher key, the first 128-bit block of data is transferred via the interface and the CBC unit into the data unit. The data unit then iteratively performs the number of AES rounds that are required for the used key size.

In each round, the key unit provides the corresponding round key to the data unit. To calculate these round keys, the key unit uses the S-Boxes of the data unit during a clock cycle in which they are not used by the data unit. After the calculation of the AES rounds, the encryption result is passed in 32-bit words to the interface via the CBC unit.

Decryptions are computed in a very similar way. In this case, the data unit performs the inverse AES transformations in reversed order and the key unit also provides the round keys in reversed order.

The remainder of this section presents the details of the standard data unit and those of the high-performance data unit. Additionally, an AES-128 key unit that can be used with both data units is described.

3.1 Standard Data Unit

The data unit is the biggest and the most important component of the AES architecture. It stores the current 128-bit state (see Fig. 1) of an encryption or decryption and is capable of performing any number and type of encryption/decryption rounds on this state. Consequently, all four AES transformations (SubBytes, ShiftRows, MixColumns, and AddRoundKey) and the corresponding inverse transformations are implemented within the data unit. For the AddRoundKey transformation, a round key needs to be provided by the key unit.

Fig. 6 shows the standard version of the data unit. Its structure is highly regular and closely related to the definition of the AES state. The standard data unit consists of 16 so-called data cells and four S-Boxes. An S-Box of the architecture is a circuit capable of performing the S-Box and the inverse S-Box transformation for an 8-bit input. The data cells store eight bits per cell and perform all other AES transformations and the corresponding inverses, when connected appropriately. In full-custom designs, inputs and outputs of the data cells can be defined in a way that connection by abutment is possible when they are placed next to one another.

However, the regular design does not only facilitate full-custom designs. A regular circuit is also very desirable for FPGA and standard-cell synthesis. If one improves the synthesis results of a single data cell by special attributes for the synthesizer, the overall area reduction is 16 times higher and therefore worth the effort.

Another distinguishing feature of the presented architecture is the fact that the combinational paths are relatively short and, what is even more important, very balanced. The commonly used approach to implement the AES in hardware is to store the 128-bit state in a register and to perform the AES transformations (except for the ShiftRows transformation) column by column. So, in order to perform a normal AES encryption round, first the ShiftRows transformation is done in one clock cycle. Then, the remaining transformations of an AES round are done column by column, whereby all transformations for one column are usually done within one clock cycle.

Fig. 6. Architecture of the standard data unit.

The problem of this approach is that the combinational path to perform a SubBytes, a MixColumns, and an AddRoundKey transformation in one clock cycle is very long. Additionally, the implementation of the ShiftRows transformation causes a significant wiring overhead. The data unit, presented in this section, solves both problems. It performs AES encryptions and decryptions in the following way:

In order to load a data block, the input data is shifted column by column from the right side (see Fig. 6) into the data cells. The inputs labeled "In" are connected via the CBC unit to the interface. The initial AddRoundKey transformation is done in the fourth clock cycle at the same time as the last column is loaded.

To compute a normal AES round, the registers are rotated vertically to perform the Inv-/SubBytes and the Inv-/ShiftRows transformation row by row. In the first clock cycle, the Inv-/SubBytes transformation starts for row three. Due to the fact that the implementation of the S-Boxes is pipelined (this will be motivated in Section 3.1.1), the result of this Inv-/SubBytes transformation is stored in row zero (see Fig. 6) two clock cycles later. Using the pipelined S-Boxes and the barrel shifter between row zero and row one of the registers, the Inv-/SubBytes and the Inv-/ShiftRows transformations can be applied to all 16 bytes of the state within five clock cycles.

In the sixth clock cycle of a normal AES round, the Inv-/MixColumns and the AddRoundKey transformations are performed by all data cells in parallel. Since the S-Boxes are not used by the data unit during the sixth clock cycle, they can be utilized by the key unit to perform the key expansion for the next round key. In order to compute the final round of an encryption or decryption, the Inv-/MixColumns transformation is omitted by the data cells in this clock cycle.

In this way, the required number of encryption or decryption rounds can be executed by the data unit and the key unit until the 128-bit result is finally stored in the registers of the data unit. This result is then shifted column by column to the left (to the interface of the AES module). At the same time, a new input state can be loaded.

Using the standard data unit, the minimal number of clock cycles that are required to perform an AES-128 encryption or decryption is 64. Four clock cycles are required for the I/O of the data unit, 54 clock cycles are required to perform the nine normal AES rounds, and six are required for the final round.
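
The cycle count can be checked with a small calculation. The Python sketch below encodes the numbers from the text (four I/O cycles, six cycles per round); the function names are chosen for this illustration. Applying the same formula to AES-192 and AES-256 (12 and 14 rounds) is an extrapolation made here, not a figure stated by the authors, and the clock frequency used in the usage note is derived, not quoted.

```python
def standard_unit_cycles(rounds: int = 10, io_cycles: int = 4,
                         cycles_per_round: int = 6) -> int:
    """Clock cycles per block: I/O plus six cycles for each normal/final round."""
    return io_cycles + cycles_per_round * rounds


def throughput_mbits(clock_mhz: float, cycles_per_block: int = 64,
                     block_bits: int = 128) -> float:
    """Throughput in Mbits/sec for a given clock frequency in MHz."""
    return block_bits * clock_mhz / cycles_per_block
```

With 64 cycles per 128-bit block, the data unit processes two bits per clock cycle, so a throughput of 128 Mbits/sec would correspond to a clock frequency of 64 MHz under these assumptions.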

The following two subsections present the architecture of

the S-Boxes and the data cells.

3.1.1 S-Boxes

In hardware implementations, the SubBytes transformation and its inverse are the most expensive AES transformations. This is why the standard data unit does not contain as many S-Boxes as data cells.

In principle, there are two ways for implementing an S-Box in hardware that can be used for the SubBytes transformation and its inverse. It can either be implemented as ROM lookup or it can be calculated with combinational logic. The straightforward way to implement an S-Box is to store all possible output values for the S-Box function and its inverse in a ROM. However, this requires a small ROM with 512 bytes, where the overhead for address decoding and output signal conditioning outweighs the area requirements of the ROM matrix.

Alternatively, just the result of the inversion in $GF(2^8)$ could be stored in a 256 byte ROM and the affine transformation and its inverse could be calculated with combinational logic. This approach would only need half the ROM size of the first approach, but it would have an even worse overhead-to-matrix ratio.

The best way to implement an S-Box is to use combinational logic for the affine transformation, for its inverse and also for the computation of the inverse in $GF(2^8)$. This approach was first proposed by Rijmen in [16] and used by Rudra et al. in [11]. Implementations of S-Boxes that are particularly interesting for the presented architecture have been proposed by Satoh et al. in [12] and by Wolkerstorfer et al. in [17].
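
The decomposition into an inversion and an affine transformation can be modeled in software as follows. This Python sketch is a functional model only: the inversion is computed by plain exponentiation, not with the composite-field circuit of [17], and the names `sbox_unit`, `affine`, and `inv_affine` are chosen for this illustration. It shows how one unit can serve both SubBytes and InvSubBytes by sharing the same inverter.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo m(x) = x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def gf_inv(a: int) -> int:
    """Multiplicative inverse computed as a^254; 0 is mapped to 0."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def rotl8(b: int, k: int) -> int:
    """Rotate an 8-bit value to the left by k positions."""
    return ((b << k) | (b >> (8 - k))) & 0xFF


def affine(b: int) -> int:
    """Affine transformation of the AES S-Box, written with rotations."""
    return b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63


def inv_affine(b: int) -> int:
    """Inverse of the affine transformation."""
    return rotl8(b, 1) ^ rotl8(b, 3) ^ rotl8(b, 6) ^ 0x05


def sbox_unit(byte: int, inverse: bool = False) -> int:
    """Both S-Box directions built around a single shared inversion."""
    if inverse:
        return gf_inv(inv_affine(byte))  # inverse affine first, then inversion
    return affine(gf_inv(byte))          # inversion first, then affine
```

Only the cheap affine stages differ between the two directions; the expensive inversion is common to both, which is what makes a combined S-Box/inverse S-Box circuit attractive.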

For the presented AES module, a pipelined (one stage) implementation of the S-Box as described in [17] is used. The main idea of this implementation is to build an efficient combinational circuit for the S-Box, which is based on the fact that $GF(2^8)$ can be seen as a quadratic extension of the field $GF(2^4)$. A pipelined version of the S-Box is used to ensure that the combinational paths in the architecture are balanced (i.e., the paths of the S-Boxes and those of a MixColumns-and-AddRoundKey step are roughly the same).

3.1.2 Data Cells

The design of the data cells is crucial for the overall architecture of the data unit. The data cells serve as storage elements of the AES state and perform the Inv-/MixColumns and the AddRoundKey transformation. Each data cell consists of the following components:

• Eight flip-flops: Each data cell stores one byte of the current AES state (see Figs. 1 and 6).

• One Multiplier: The MixColumns transformation maps one column of the input state to a new column in the output state. The multiplier that is a part of each data cell computes one output byte of the MixColumns transformation based on a four byte input. This multiplier considers its four byte input as a polynomial over $GF(2^8)$ and is capable of performing a multiplication of the input with the constant polynomial $a(x) = \{03\}x^3 + \{01\}x^2 + \{01\}x + \{02\}$ and with its inverse, $a(x)^{-1}$, modulo $x^4 - 1$.

The inputs of each multiplier are connected to the outputs of the four data cells that are in the same column as the multiplier itself (see Fig. 6). However, due to the definition of the MixColumns and the InvMixColumns transformation, the input connections are different in each row. The multipliers of the architecture are designed in a way that there is a maximum reuse of components between the multiplication with $a(x)$ and the one with $a(x)^{-1}$. A detailed description of this multiplier architecture can be found in [18].

• Eight XOR-Gates: The AddRoundKey transformation is performed in parallel in the presented architecture. Consequently, eight xor gates are required in each data cell.

• Input Selection: The data cells support unidirectional vertical and horizontal shifting. Consequently, each data cell contains a multiplexer to select which input is loaded into the data cell.
