High-throughput programmable cryptocoprocessor

A. Hodjat, +1 more
- 01 May 2004 - 
- Vol. 24, Iss: 3, pp 34-45
TLDR
A loosely coupled cryptocoprocessor based on the Advanced Encryption Standard combines high throughput with programmability. Using domain-specific instructions and design principles, the security engine supports Internet Protocol security and other networking applications.
Abstract
High-speed Internet Protocol security (IPsec) applications require high throughput and flexible security engines. A loosely coupled cryptocoprocessor based on the Advanced Encryption Standard combines high throughput with programmability. Using domain-specific instructions and design principles such as control hierarchy and block pipelining, the security engine supports Internet Protocol security and other networking applications.

High-speed Internet Protocol security (IPsec) applications require high throughput and flexible security engines. Virtual private networks, for example, require a throughput of over 2 gigabits per second. IPsec uses the Advanced Encryption Standard [1] algorithm in various operation modes [2]. Most security applications combine AES, and block ciphers in general, with different operation modes because the straightforward electronic code book (ECB) mode is vulnerable to statistical attacks [3]. The US National Institute of Standards and Technology recommends block cipher modes of operation [4], which, in addition to ECB, include cipher block chaining (CBC), counter, cipher feedback (CFB), output feedback (OFB), and CCM, a new mode that combines the counter and CBC-MAC (message authentication code) modes. CCM only requires the encryption algorithm and can generate encrypted and authenticated data simultaneously [5]. As the “Related Work on Programmable Security Engines” sidebar mentions, no current systems support all four modes: ECB, CBC, counter, and CCM.
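The weakness of ECB mentioned above can be made concrete with a small sketch. The code below is only an illustration of the mode structure: `toy_encrypt_block` is a hypothetical stand-in built from SHA-256, not AES, and it is not invertible, so it shows ECB's pattern leakage and counter-mode encryption but cannot decrypt ECB output. All function names are assumptions made for this example.

```python
import hashlib

BLOCK = 16  # bytes in one 128-bit block


def toy_encrypt_block(key, block):
    """Hypothetical placeholder for AES-128 on one block (NOT AES, not invertible)."""
    return hashlib.sha256(key + block).digest()[:BLOCK]


def split_blocks(data):
    """Cut a byte string into 16-byte blocks."""
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]


def ecb_encrypt(key, plaintext):
    # Every block is processed independently, so equal inputs give equal outputs.
    return b"".join(toy_encrypt_block(key, b) for b in split_blocks(plaintext))


def ctr_crypt(key, nonce, data):
    # Counter mode encrypts a running counter and XORs the result with the data.
    # The same function encrypts and decrypts.
    out = bytearray()
    for i, block in enumerate(split_blocks(data)):
        counter_block = nonce + i.to_bytes(BLOCK - len(nonce), "big")
        keystream = toy_encrypt_block(key, counter_block)
        out += bytes(p ^ k for p, k in zip(block, keystream))
    return bytes(out)


key = bytes(range(16))
nonce = bytes(8)
plaintext = b"A" * 32  # two identical plaintext blocks

ecb_ct = split_blocks(ecb_encrypt(key, plaintext))
ctr_ct = split_blocks(ctr_crypt(key, nonce, plaintext))
print("ECB blocks equal:", ecb_ct[0] == ecb_ct[1])
print("CTR blocks equal:", ctr_ct[0] == ctr_ct[1])
```

With identical plaintext blocks, the ECB ciphertext blocks repeat while the counter-mode blocks do not, which is the statistical leak that motivates the other modes.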
Recent Internet Society Request for Comments (RFC) efforts propose combining AES with block cipher modes, such as AES in counter mode with IPsec [6] and AES in XCBC-MAC with IPsec [7]. Other researchers use AES in counter and CCM modes for IPsec [8]. Standard proposals tend to change, but these changes are usually limited to initialization, setup, key management, and so on. Combining programmability with high throughput supports a wide range of current and future standards for security applications.
A high-speed CPU is one way to implement security primitives. However, factors such as memory bandwidth and cache misses prevent the CPU from achieving multi-Gbps throughput. The “AES/Rijndael: Speed” Web site (http://www.tcs.hut.fi/~helger/aes/rijndael.html) reports AES throughput on various CPUs at over 1 GHz. Optimized C code compiled with gcc (GNU Compiler Collection) 3.0.2 achieves only 861 Mbps on a 2.25-GHz AMD Athlon. A hand-optimized assembly code of the AES algorithm achieves up to 718 Mbps on a 1.33-GHz Pentium III and up to 1,436 Mbps on a 3.06-GHz Pentium IV. The CPUs achieved these throughputs in ideal circumstances; the AES was the only algorithm running, so there was no overhead for other tasks.

Alireza Hodjat and Ingrid Verbauwhede
University of California, Los Angeles

IEEE Micro, May–June 2004. Published by the IEEE Computer Society 0272-1732/04/$20.00 © 2004 IEEE

We have developed a high-throughput, programmable cryptocoprocessor that runs the AES algorithm in different operation modes for IPsec applications. Instead of using multi-GHz CPUs, we use domain-specific processors to obtain the required throughput. Domain specialization helps close the gap between performance and programmability. The cryptocoprocessor achieves a maximum throughput of 3.43 Gbps at a 295-MHz clock frequency using 0.18-micron CMOS technology. The instruction set includes initialization, key setup, and AES encryption for different operation modes. Block pipeline instructions allow AES to run in ECB, CBC-MAC, counter, and CCM modes in 11 clock cycles per 128-bit block without loss in throughput compared to an AES without a mode of operation.
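The quoted throughput follows from the clock frequency, the 128-bit block size, and the 11 cycles per block. A minimal sketch of that arithmetic, with function and parameter names chosen only for this example:

```python
def throughput_gbps(clock_mhz, bits_per_block=128, cycles_per_block=11):
    """Bits produced per second when one block completes every `cycles_per_block` cycles."""
    bits_per_second = clock_mhz * 1e6 * bits_per_block / cycles_per_block
    return bits_per_second / 1e9


# 295 MHz, one 128-bit block every 11 cycles
print(round(throughput_gbps(295), 2))
```

At 295 MHz this gives about 3.43 Gbps, matching the figure reported for the coprocessor.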
Related Work on Programmable Security Engines

Ravi et al. present a system-level design methodology for programmable security processor platforms [1]. It uses Tensilica’s Xtensa processor [2] and includes customized instructions, which improve performance from less than one Mbps to several tens of Mbps. In the instruction set extension approach, which Barat, Lauwereins, and Deconinck refer to as a tightly coupled processing scheme, the main CPU (Xtensa) is customized for a specific domain by adding a new functional unit to its pipeline [3]. Custom instructions flow through the pipeline and the new functional unit decodes and executes them. Our approach differs in that we use loosely coupled, independent coprocessors in conjunction with a main embedded processor core. These programmable coprocessors are designed for specific domains and attached to the main processor on a dedicated interface.

A typical embedded system contains multiple tasks that might need acceleration: for example, network protocol processing in the networking domain, image or speech processing in the digital signal processing (DSP) domain, and authentication and privacy protection in the security domain. Figure A1 shows the stream of data samples that typically flow from the DSP unit to the security unit and continue to the networking unit.

Figure A. Embedded security system: Typical data sample stream from the DSP unit to the security and networking units (1); system design using a tightly coupled instruction set extension (2); and system-level view of our design (3).

Architecture
The cryptocoprocessor architecture consists of three modules: the input module, the output module, and the encryption module, which includes the AES core.
AES core
Figure 1 shows the AES core’s architecture. It implements the 128-bit key, 128-bit data version of the AES algorithm and performs encryption in 11 cycles, with one round of the algorithm executing in one clock cycle. AES-128 execution takes 10 rounds, leaving one clock cycle for the initial key-addition phase.
We optimized the AES core for speed, with a goal of minimizing delay for one round. The substitute phase (S-boxes) is implemented using lookup tables; all other steps in each round are XOR chains. Other alternatives for implementing S-boxes exist [9,10], but we found that the straightforward implementation, lookup tables, is fastest [11]. Therefore, the core performs each round in a single clock cycle optimized for minimum combinational delay.
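To illustrate the choice between a computed S-box and a lookup table, the sketch below derives the AES S-box arithmetically (a multiplicative inverse in GF(2^8) followed by the standard affine transform) and then stores the result as a 256-entry table. It is a software illustration only and does not model the authors' hardware; the function names are assumptions made for this example.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def gf_inv(a):
    """Multiplicative inverse in GF(2^8), computed as a^254; 0 maps to 0."""
    if a == 0:
        return 0
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def rotl8(x, n):
    """Rotate an 8-bit value left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def sbox_compute(x):
    """Computed S-box: field inverse followed by the AES affine transform."""
    b = gf_inv(x)
    return b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63


# Lookup-table form: all 256 outputs are precomputed once, so a substitution
# becomes a single indexed read instead of a chain of field operations.
SBOX = [sbox_compute(x) for x in range(256)]

print(hex(SBOX[0x00]), hex(SBOX[0x53]))
```

The table trades storage for a shorter critical path, which is the property the speed-optimized core relies on.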
Cryptocoprocessor
As Figure 2 shows, the cryptocoprocessor includes the input and output modules, which perform handshaking to read the input and write the encrypted data; the encryption module, which contains logic to run AES in the ECB, CBC-MAC, counter, and CCM modes; and the top controller, which issues commands to the other three modules.
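Among the modes the encryption module supports, CBC-MAC chains each block through the cipher and keeps only the final state as the authentication tag. The sketch below shows that chaining under stated assumptions: `toy_encrypt_block` is a hypothetical SHA-256-based placeholder rather than AES, the message length must be a multiple of 16 bytes, and none of the formatting rules of the CCM specification are applied.

```python
import hashlib

BLOCK = 16  # bytes in one 128-bit block


def toy_encrypt_block(key, block):
    """Hypothetical placeholder for AES-128 on one block (NOT AES)."""
    return hashlib.sha256(key + block).digest()[:BLOCK]


def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def cbc_mac(key, message):
    """Chain every block through the cipher; the last state is the tag."""
    if len(message) % BLOCK:
        raise ValueError("this sketch assumes whole 16-byte blocks")
    state = bytes(BLOCK)  # all-zero initial value
    for i in range(0, len(message), BLOCK):
        state = toy_encrypt_block(key, xor_bytes(state, message[i:i + BLOCK]))
    return state


key = bytes(range(16))
message = b"0123456789abcdef" * 3
tag = cbc_mac(key, message)
tampered_tag = cbc_mac(key, b"X" + message[1:])
print(tag != tampered_tag)
```

Only the forward (encryption) direction of the block cipher is used here, as in counter mode, which is why a mode combining the two needs no decryption data path.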
The cryptocoprocessor includes four 32-bit I/O interfaces. The input and output modules can read or write a 128-bit block of data using two of the interfaces (one for data input and one for output) asynchronously. The other two interfaces are synchronous; the main CPU core and the coprocessor use them (one as input and the other as output) for data communication.
Related Work on Programmable Security Engines (continued)

Figure A2 shows how these systems can be designed using a tightly coupled instruction set extension. The processor is customizable for each domain by adding functional units to the pipeline. This way, the corresponding functional units decode and execute the domain instructions. Figure A3 is a system-level view of our design. Programmable coprocessors meet the throughput requirements for each domain. The main embedded processor programs each domain-specific coprocessor. Thus, the embedded processor exercises control while data is transferred between coprocessors.

Another related system is the CryptoManiac, a coprocessor for cryptographic workloads [4]. Its domain-specific processor performs cryptographic functions on the data path through its processing elements. The processing elements support various cryptographic algorithms, thus creating some overhead. In the AES algorithm, CryptoManiac performs 624 Mbps at 390-MHz clock frequency.

Recent publications report ASIC implementations of the AES algorithm [5-8]. Hifn’s storage security processor [6], for example, uses 0.13-micron technology, achieving 2 Gbps at 133 MHz, and can be used in the counter and CBC modes. Carlson et al. [7] present another implementation in 0.13-micron technology that operates only in ECB mode and achieves 2.18 Gbps at 500 MHz. Satoh et al. [8] report a compact AES implementation in 0.11-micron technology that runs at 2.6 Gbps at 224 MHz for the ECB and CBC modes. None of these cases support all four modes: ECB, CBC, counter, and CCM.

References
1. S. Ravi et al., “System Design Methodologies for a Wireless Security Processing Platform,” Proc. 39th Design Automation Conf. (DAC 02), ACM Press, 2002, pp. 777-782.
2. Xtensa Application-Specific Microprocessor Solutions: Overview Handbook, Tensilica, 2001.
3. F. Barat, R. Lauwereins, and G. Deconinck, “Reconfigurable Instruction Set Processors from a HW/SW Perspective,” IEEE Trans. Software Eng., vol. 28, no. 9, Sept. 2002, pp. 847-862.
4. L. Wu, C. Weaver, and T. Austin, “CryptoManiac: A Fast Flexible Architecture for Secure Communication,” Proc. 28th Int’l Symp. Computer Architecture (ISCA-01), IEEE CS Press, 2001, pp. 110-119.
5. T. Ichikawa et al., “Hardware Evaluation of the AES Finalists,” Proc. 3rd AES Candidate Conf., ACM Press, 2000, pp. 279-285.
6. “Hifn HIPP III 4300 Storage Security Processor,” http://www.hifn.com/products/4300.html.
7. D. Carlson et al., “A High-Performance SSL IPsec Protocol Aware Security Processor,” Proc. Int’l Solid-State Circuits Conf. (ISSCC 03), IEEE Press, 2003, pp. 142-483.
8. A. Satoh et al., “A Compact Rijndael Hardware Architecture with S-Box Optimization,” Proc. AsiaCrypt 2001, LNCS 2248, Springer-Verlag, 2001, pp. 239-254.

Memory-mapped interface with host CPU. Figure 3a shows how the cryptocoprocessor attaches to a CPU core through the memory-mapped interface. Four registers connect the host CPU to the cryptocoprocessor: instruction, configuration, 32-bit input, and 32-bit output. The host CPU can read or write to these registers by accessing different memory locations. The memory-mapped interface decodes the memory addresses and updates the register values. The main CPU can therefore easily program the cryptocoprocessor.
The CPU programs the coprocessor through the 8-bit instruction and configuration registers. Moreover, the main CPU core and the coprocessor use the 32-bit input and output registers for data communication. The main CPU uses these registers for key setup and initialization of the CBC-MAC, counter, and CCM operation modes. Therefore, it is possible to change the key and initial vector values in the software.
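The register interface described above can be pictured with a small host-side model. This is a sketch under stated assumptions: the address map, the `LOAD_KEY` opcode value, and the big-endian word order are hypothetical, since the text gives only the register names and widths.

```python
LOAD_KEY = 0x01  # hypothetical opcode; the article does not list encodings


class CoprocessorRegisters:
    """Toy model of the four memory-mapped registers seen by the host CPU."""

    # Hypothetical address map.
    INSTRUCTION, CONFIGURATION, DATA_IN, DATA_OUT = 0x00, 0x04, 0x08, 0x0C

    def __init__(self):
        self.instruction = 0    # 8-bit
        self.configuration = 0  # 8-bit
        self.data_in_words = [] # 32-bit words written by the host
        self.data_out = 0       # 32-bit word read by the host

    def write(self, address, value):
        """Decode the address and update the matching register."""
        if address == self.INSTRUCTION:
            self.instruction = value & 0xFF
        elif address == self.CONFIGURATION:
            self.configuration = value & 0xFF
        elif address == self.DATA_IN:
            self.data_in_words.append(value & 0xFFFFFFFF)
        else:
            raise ValueError("address is not writable")

    def read(self, address):
        """Only the output register is readable in this sketch."""
        if address == self.DATA_OUT:
            return self.data_out
        raise ValueError("address is not readable")


def load_key(regs, key):
    """Host-side key setup: one instruction write, then four 32-bit data writes."""
    regs.write(regs.INSTRUCTION, LOAD_KEY)
    for i in range(0, 16, 4):
        regs.write(regs.DATA_IN, int.from_bytes(key[i:i + 4], "big"))


regs = CoprocessorRegisters()
load_key(regs, bytes(range(16)))
print([hex(w) for w in regs.data_in_words])
```

Because the key and initial values arrive through ordinary register writes, software can change them without touching the hardware.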
Asynchronous I/O interfaces. The input and output interfaces use two handshaking signals to read and write a 128-bit data block asynchronously in multiple clock cycles. Figure 3b shows how the cryptocoprocessor connects to the input and output modules through these interfaces. The modules work independently of the cryptocoprocessor and the main CPU host and can be programmed through the CPU core’s memory-mapped interface. The modules produce data for the cryptocoprocessor and use the coprocessor’s encrypted output.
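One way to picture the multi-cycle block transfer is a small state machine that accepts a 32-bit word whenever the producer signals that data is available and answers with an acknowledge. The sketch below is an assumption-laden model: the signal names `valid` and `ack`, the word order, and the class itself are hypothetical and not taken from the article.

```python
class InputFSM:
    """Collects four 32-bit words into one 128-bit block."""

    WORDS_PER_BLOCK = 4

    def __init__(self):
        self.words = []
        self.block_ready = False

    def step(self, valid, word=0):
        """Model one clock cycle and return the `ack` signal."""
        if self.block_ready or not valid:
            return False  # nothing offered, or the previous block is not yet taken
        self.words.append(word & 0xFFFFFFFF)
        if len(self.words) == self.WORDS_PER_BLOCK:
            self.block_ready = True
        return True

    def take_block(self):
        """Hand the assembled block to the encryption module and reset."""
        block = 0
        for w in self.words:
            block = (block << 32) | w
        self.words = []
        self.block_ready = False
        return block


fsm = InputFSM()
# The producer is ready only on some cycles; None marks an idle cycle.
cycles = [1, None, 2, 3, None, None, 4]
acks = [fsm.step(w is not None, w or 0) for w in cycles]
block = fsm.take_block()
print(acks, hex(block))
```

The transfer takes as many cycles as the producer needs, which is why the interface is described as asynchronous with respect to the encryption pipeline.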
Design principles
Several design principles let the cryptocoprocessor achieve the required throughput for IPsec and other networking applications.
Figure 1. Advanced Encryption Standard core architecture. (Diagram labels: key scheduling data path; byte substitution, shift row, mix column, and key addition stages.)

Separate data and control streams
Separating data and control streams enables high-throughput data encryption and a high level of programmability. In Figure 2, data flows through the coprocessor from the input module to the encryption module and then to the output module while the top controller handles instructions. The input and output finite-state machines (FSMs) perform handshaking to read input and write encrypted data without interference from the top controller. Following this methodology, we can program the coprocessor to encrypt the input data stream and produce output continuously while the top controller interface processes new instructions.
Control hierarchy
Designing with multiple controllers requires partitioning the control over different modules, particularly when multiple modules communicate asynchronously. Hierarchical control design simplifies the controllers’ communications and lets us combine high performance and programmability. Harel proposed a control hierarchy for specification in Statecharts [12]. We propose this design technique for implementation. We implement the system’s top-level control in the main processor core. Instructions from the main embedded CPU bring commands to the coprocessor’s top controller. The top controller also controls the lower-level modules.
Figure 4 shows the control hierarchy for the cryptocoprocessor in Figure 2. The top controller unit manages the input FSM, CBC FSM, counter FSM, and output FSM. As mentioned, the input and output FSMs perform the handshaking sequence for reading and writing of the 128-bit data blocks. The CBC FSM controls the encryption sequence to generate a CBC-MAC, and the counter FSM controls the encryption sequence for the counter operation mode.
Depending on which instruction it reads, the main controller asserts the start signal for a subcontroller. The submodule starts its operation, asserting the done signal when finished.
Figure 2. Cryptocoprocessor block architecture. (Diagram labels: top controller with 8-bit instruction and configuration inputs; input module with input FSM and input data path; encryption module with CBC, data register, key register, counter register, and AES; output module with output FSM, output data path, and output register; input handshake, output handshake, start encrypt, and encrypt done signals.)
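The start/done protocol between the top controller and its subcontrollers can be sketched as follows. The instruction names and the 4-cycle I/O duration are assumptions made for illustration; only the 11-cycle encryption and the four FSMs (input, CBC, counter, output) come from the text.

```python
class SubFSM:
    """A lower-level controller that runs for a fixed number of cycles."""

    def __init__(self, name, cycles):
        self.name = name
        self.cycles = cycles
        self.remaining = 0

    def start(self):
        """The top controller asserts `start`."""
        self.remaining = self.cycles

    @property
    def done(self):
        """The subcontroller asserts `done` when its sequence is finished."""
        return self.remaining == 0

    def tick(self):
        """Advance one clock cycle."""
        if self.remaining:
            self.remaining -= 1


class TopController:
    """Maps each instruction to the subcontroller it starts."""

    def __init__(self):
        self.subs = {
            "READ": SubFSM("input", 4),             # assumed: four 32-bit transfers
            "ENCRYPT_CTR": SubFSM("counter", 11),   # 11 cycles per block
            "ENCRYPT_CBC_MAC": SubFSM("cbc", 11),   # 11 cycles per block
            "WRITE": SubFSM("output", 4),           # assumed: four 32-bit transfers
        }

    def execute(self, instruction):
        """Start the matching subcontroller and wait for `done`; return cycles used."""
        sub = self.subs[instruction]
        sub.start()
        cycles = 0
        while not sub.done:
            sub.tick()
            cycles += 1
        return cycles


top = TopController()
program = ["READ", "ENCRYPT_CTR", "WRITE"]
total_cycles = sum(top.execute(instr) for instr in program)
print(total_cycles)
```

This sketch runs the instructions one after another; it shows only the hierarchy and the handshake, not the overlap between data movement and instruction handling that the separate data and control streams allow.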

References

Handbook of Applied Cryptography (book)
TL;DR: A valuable reference for the novice as well as for the expert who needs a wider scope of coverage within the area of cryptography, this book provides easy and rapid access of information and includes more than 200 algorithms and protocols.

Statecharts: A visual formalism for complex systems (journal article)
TL;DR: It is intended to demonstrate here that statecharts counter many of the objections raised against conventional state diagrams, and thus appear to render specification by diagrams an attractive and plausible approach.

A Compact Rijndael Hardware Architecture with S-Box Optimization (book chapter)
TL;DR: Compact and high-speed hardware architectures and logic optimization methods for the AES algorithm Rijndael are described, including a new composite field and the S-Box structure is also optimized.

Recommendation for Block Cipher Modes of Operation. Methods and Techniques (report)
TL;DR: This recommendation defines five confidentiality modes of operation for use with an underlying symmetric key block cipher algorithm: Electronic Codebook (ECB), Cipher Block Chaining (CBC), Cipher Feedback (CFB), Output Feedback (OFB), and Counter (CTR).

An ASIC Implementation of the AES SBoxes (book chapter)
TL;DR: This article presents a hardware implementation of the S-Boxes from the Advanced Encryption Standard (AES), and shows that a calculation of this function and its inverse can be done efficiently with combinational logic.
Frequently Asked Questions (16)
Q1. What have the authors contributed in "A loosely coupled cryptocoprocessor based on the advanced encryption standard combines high throughput with programmability. using domain-specific instructions and design principles such as control hierarchy and block" ?

The “AES/Rijndael: Speed” Web site (http://www.tcs.hut.fi/~helger/aes/rijndael.html) reports AES throughput on various CPUs at over 1 GHz.

Block pipeline instructions allow AES to run in ECB, CBC-MAC, counter, and CCM modes in 11 clock cycles per 128-bit block without loss in throughput compared to an AES without a mode of operation. 

Designing with multiple controllers requires partitioning the control over different modules, particularly when multiple modules communicate asynchronously. 

The AES algorithm takes 24,419 cycles per 128-bit block (1,526.2 cycles per byte) using an efficient, high-speed software code on Xtensa; and it takes 1,400 cycles per 128-bit block (87.5 cycles per byte) to run AES using custom instructions on the customized Xtensa core. 

Examples of multiple-cycle single instructions include single-block encryption, reading one block of data or key, and writing one block of output.

Separate data and control streams Separating data and control streams enables high-throughput data encryption and a high level of programmability. 

Most future embedded systems will require high throughput and programmable security engines similar to the cryptocoprocessor presented in this article.

Continuous instructions let the coprocessor encrypt the data stream and write the result to the output continuously, until the main CPU issues a done instruction. 

A hand-optimized assembly code of the AES algorithm achieves up to 718 Mbps on a 1.33-GHz Pentium III and up to 1,436 Mbps on a 3.06-GHz Pentium IV. 

It implements the 128-bit key, 128-bit data version of the AES algorithm and performs encryption in 11 cycles, with one round of the algorithm executing in one clock cycle. 

The authors performed synthesis using a typical United MicroElectronic Corp. (UMC) 0.18-micron standard cell library with the Synopsys synthesis tools and the conservative wire load model. 

Because the core produces a 128-bit output every 11 cycles, the authors calculated throughput by multiplying the frequency by 128 and dividing the result by 11, giving us the number of bits produced per second. 

The coprocessor uses most of the 1,228 cycles for transferring the data and key from the Leon core and for the context switching necessary to call the program of Figure 8b from the main C program. 

The authors have developed a high-throughput, programmable cryptocoprocessor that runs the AES algorithm in different operation modes for IPsec applications. 

Hierarchical control design simplifies the controllers’ communications and lets us combine high performance and programmability.

Achieving such a methodology requires research on highly efficient interfaces for programming encryption accelerators through an embedded CPU core as well as high-throughput data transmission schemes between the hardware accelerators of a typical embedded system on a chip.