Efficient Cache Attacks on AES, and Countermeasures
Eran Tromer^1,2, Dag Arne Osvik^3, and Adi Shamir^2

^1 Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology,
32 Vassar Street, G682, Cambridge, MA 02139, tromer@csail.mit.edu
^2 Department of Computer Science and Applied Mathematics, Weizmann Institute of Science,
Rehovot 76100, Israel, adi.shamir@weizmann.ac.il
^3 Laboratory for Cryptologic Algorithms, Station 14, École Polytechnique Fédérale de Lausanne,
1015 Lausanne, Switzerland, dagarne.osvik@epfl.ch
Abstract. We describe several software side-channel attacks based on inter-process leakage through
the state of the CPU’s memory cache. This leakage reveals memory access patterns, which can be used
for cryptanalysis of cryptographic primitives that employ data-dependent table lookups. The attacks
allow an unprivileged process to attack other processes running in parallel on the same processor,
despite partitioning methods such as memory protection, sandboxing and virtualization. Some of
our methods require only the ability to trigger services that perform encryption or MAC using the
unknown key, such as encrypted disk partitions or secure network links. Moreover, we demonstrate
an extremely strong type of attack, which requires knowledge of neither the specific plaintexts nor
ciphertexts, and works by merely monitoring the effect of the cryptographic process on the cache.
We discuss in detail several attacks on AES, and experimentally demonstrate their applicability to
real systems, such as OpenSSL and Linux’s dm-crypt encrypted partitions (in the latter case, the full
key was recovered after just 800 writes to the partition, taking 65 milliseconds). Finally, we discuss
a variety of countermeasures which can be used to mitigate such attacks.
Keywords: side-channel attack, cryptanalysis, memory cache, AES
1 Introduction
1.1 Overview
Many computer systems concurrently execute programs with different privileges, employing vari-
ous partitioning methods to facilitate the desired access control semantics. These methods include
kernel vs. userspace separation, process memory protection, filesystem permissions and chroot,
and various approaches to virtual machines and sandboxes. All of these rely on a model of the
underlying machine to obtain the desired access control semantics. However, this model is often
idealized and does not reflect many intricacies of the actual implementation.
In this paper we show how a low-level implementation detail of modern CPUs, namely the
structure of memory caches, causes subtle indirect interaction between processes running on
the same processor. This leads to cross-process information leakage. In essence, the cache forms a
shared resource which all processes compete for, and it thus affects and is affected by every process.
While the data stored in the cache is protected by virtual memory mechanisms, the metadata
about the contents of the cache, and in particular the memory access patterns of processes using
that cache, are not fully protected.
We describe several methods an attacker can use to learn about the memory access patterns
of another process, e.g., one which performs encryption with an unknown key. These are classified
into methods that affect the state of the cache and then measure the effect on the running time
of the encryption, and methods that investigate the state of the cache after or during encryption.
The latter are found to be particularly effective and noise-resistant.
We demonstrate the cryptanalytic applicability of these methods to the Advanced Encryption
Standard (AES, [39]) by showing a known-plaintext (or known-ciphertext) attack that performs
efficient full key extraction. For example, an implementation of one variant of the attack per-
forms full AES key extraction from the dm-crypt system of Linux using only 800 accesses to an
encrypted file, 65ms of measurements and 3 seconds of analysis; attacking simpler systems, such
as “black-box” OpenSSL library calls, is even faster at 13ms and 300 encryptions.
One variant of our attack has the unusual property of performing key extraction without
knowledge of either the plaintext or the ciphertext. This is a particularly strong form of attack,
which is clearly impossible in a classical cryptanalytic setting. It enables an unprivileged process,
merely by accessing its own memory space, to obtain bits from a secret AES key used by another
process, without any (explicit) communication between the two. This too is demonstrated exper-
imentally, and implementing AES in a way that is impervious to this attack, let alone developing
an efficient generic countermeasure, appears non-trivial.
This paper is organized as follows: Section 2 gives an introduction to memory caches and AES
lookup tables. In Section 3 we describe the basic attack techniques, in the “synchronous” setting
where the attacker can explicitly invoke the cipher on known data. Section 4 introduces even
more powerful “asynchronous” attacks which relax the latter requirement. In Section 5, various
countermeasures are described and analyzed. Section 6 summarizes these results and discusses
their implications.
1.2 Related work
The possibility of cross-process leakage via cache state was first considered in 1992 by Hu [24]
in the context of intentional transmission via covert channels. In 1998, Kelsey et al. [27] men-
tioned the prospect of “attacks based on cache hit ratio in large S-box ciphers”. In 2002, Page
[47] described theoretical attacks on DES via cache misses, assuming an initially empty cache and
the ability to identify cache effects with very high temporal resolution in side-channel traces. He
subsequently proposed several countermeasures for smartcards [48], though most of these require
hardware modifications and are inapplicable or insufficient in our attack scenario. Recently, vari-
ants of this attack (termed “trace-driven” in [48]) were realized by Bertoni et al. [11] and Acıiçmez
and Koç [3][4], using a power side channel of a MIPS microprocessor in an idealized simulation.
By contrast, our attacks operate purely in software, and are hence of wider applicability and
implications; they have also been experimentally demonstrated in real-life scenarios.
In 2002 and subsequently, Tsunoo et al. devised a timing-based attack on MISTY1 [57,58]
and DES [56], exploiting the effects of collisions between the various memory lookups invoked
internally by the cipher (as opposed to the cipher vs. attacker collisions we investigate, which
greatly improve the efficiency of an attack). Recently, Lauradoux [32] and Canteaut et al. [18]
proposed some countermeasures against these attacks, none of which are satisfactory against our
attacks (see Section 5).
An abridged version of this paper was published in [45], and announced in [44].
Concurrently but independently, Bernstein [10] described attacks on AES that exploit timing
variability due to cache effects. This attack can be seen as a variant of our Evict+Time measure-
ment method (see Section 3.4 and the analysis of Neve et al. [42]), though it is also somewhat
sensitive to the aforementioned collision effects. The main difference is that [10] does not use an
explicit model of the cache and active manipulation, but rather relies only on the existence of some
consistent statistical patterns in the encryption time caused by memory access effects; these pat-
terns are neither controlled nor modeled. The resulting attack is simpler and more portable than
ours, since its implementation is mostly oblivious to the fine (and often unpublished) details of the
targeted CPU and software; indeed, [10] includes the concise C source code of the attack. More-
over, the attack of [10] locally executes only time measurement code on the attacked computer,
whereas our attack code locally executes more elaborate code that also performs (unprivileged)
memory accesses. However, the attack of [10] has several shortcomings. First, it requires reference
measurements of encryption under known key in an identical configuration, and these are often
not readily available (e.g., a user may be able to write data to an encrypted filesystem, but creat-
ing a reference filesystem with a known key is a privileged operation). Second, the attack of [10]
relies on timing the encryption and thus, similarly to our Evict+Time method, seems impractical
on many real systems due to excessively low signal-to-noise ratio; our alternative methods
(Sections 3.5 and 4) address this. Third, even when the attack of [10] works, it requires a much
higher number of analyzed encryptions than our method.^4 A subsequent paper of Canteaut et
al. [18] describes a variant of Bernstein’s attack which focuses on internal collisions (following
Tsunoo et al.) and provides a more in-depth experimental analysis;^5 its properties and
applicability are similar to Bernstein’s attack.^6 See Section 6.5 for subsequent improvements.
Also concurrently with but independently of our work, Percival [50] described a cache-based
attack on RSA for processors with simultaneous multithreading. The measurement method is
similar to one variant of our asynchronous attack (Section 4), but the cryptanalysis has little
in common since the algorithms and time scales involved in RSA vs. AES operations are very
different. Both [10] and [50] contain discussions of countermeasures against the respective attacks,
and some of these are also relevant to our attacks (see Section 5).
Koeune and Quisquater [30] described a timing attack on a “bad implementation” of AES
which uses its algebraic description in a “careless way” (namely, using a conditional branch in
the MixColumn operation). That attack is not applicable to common software implementations,
but should be taken into account in regard to certain countermeasures against our attacks (see
Section 5.2).

^4 In our experiments the attack code of [10] failed to get a signal from dm-crypt even after a 10-hour
run, whereas in the same setup our Prime+Probe attack (see Section 3.5) performed full key recovery
using 65ms of measurements.
^5 Canteaut et al. [18] claim that their attack exploits only collision effects due to microarchitectural
details (i.e., low address bits) and that Bernstein’s attack [10] exploits only cache misses (i.e., higher
address bits). However, experimentally both attacks yield key bits of both types, as can be expected:
the analysis method of [10] also detects collision effects (albeit with lower sensitivity), while the attack
setting of [18] inadvertently also triggers systematic cache misses (e.g., due to the encryption function’s
use of stack and buffers).
^6 [18] reports an 85% chance of recovering 20 bits using 2^30 encryptions after a 2^30 learning phase,
even for the “lightweight” target of OpenSSL AES invocation. In the same setting, our attack reliably
recovers the full key from just 300 encryptions (Section 3.7).
Leakage of memory access information has also been considered in other contexts, yielding
theoretical [22] and heuristic [63][64] mitigation methods; these are discussed in Section 5.3.
See Section 6.5 for a discussion of additional works following our research.
2 Preliminaries
2.1 Memory and cache structure
Over the past couple of decades, CPU speed (in terms of operations per second) has been
benefiting from Moore’s law and growing at a rate of roughly 60% per year, while the latency of
main memory has been decreasing at a much slower rate (7%–9% per year).^7 Consequently, a large
gap has developed between the two. Complex multi-level cache architectures are employed to
bridge this gap, but it still shows through during cache misses: on a typical modern processor,
accessing data in the innermost (L1) cache typically requires amortized time on the order of
0.3ns, while accessing main memory may stall computation for 50 to 150ns, i.e., a slowdown of
2–3 orders of magnitude. The cache architectures are optimized to minimize the number of cache
misses for typical access patterns, but can be easily manipulated adversarially; to do so we will
exploit the special structure in the association between main memory and cache memory.
Modern processors use one or more levels of set-associative memory cache. Such a cache
consists of storage cells called cache lines, each consisting of B bytes. The cache is organized into
S cache sets, each containing W cache lines^8, so overall the cache contains B · S · W bytes. The
mapping of memory addresses into the cache is limited as follows. First, the cache holds copies
of aligned blocks of B bytes in main memory (i.e., blocks whose starting address is 0 modulo B),
which we will term memory blocks. When a cache miss occurs, a full memory block is copied into
one of the cache lines, replacing (“evicting”) its previous contents. Second, each memory block
may be cached only in a specific cache set; specifically, the memory block starting at address a
can be cached only in the W cache lines belonging to cache set ⌊a/B⌋ mod S. See Figure 1. Thus,
the memory blocks are partitioned into S classes, where the blocks in each class contend for the
W cache lines in a single cache set.^9
Modern processors have up to 3 levels of memory cache, denoted L1 to L3, with L1 being the
smallest and fastest cache and subsequent levels increasing in size and latency. For simplicity, in
the following we mostly ignore this distinction; one has a choice of which cache to exploit, and
our experimental attacks used both L1 and L2 effects. Additional complications are discussed in
Section 3.6. Typical cache parameters are given in Table 1.
^7 This relatively slow reduction in DRAM latency has proven so reliable, and so founded in basic
technological hurdles, that it has been proposed by Abadi et al. [1] and Dwork et al. [21] as a basis for
proof-of-work protocols.
^8 In common terminology, W is called the associativity and the cache is called W-way set associative.
^9 CPUs differ in their policy for choosing which cache line inside a set to evict during a cache miss. Our
attacks work for all common algorithms, but as discussed in Section 3.8, knowledge of the policy allows
further improvements.

Fig. 1. Schematic of a single level of set-associative cache. Each column of memory blocks (right
side) corresponds to S · B contiguous bytes of memory. Each row of memory blocks is mapped to
the corresponding row in the cache (left side), representing a set of W cache lines. The light gray
blocks represent an AES lookup table in the victim’s memory. The dark gray blocks represent
the attacker’s memory used for the attack, which will normally be at least as big as the size of
the cache.
CPU model            Level  B (cache line size)  S (cache sets)  W (associativity)  B · S · W (total size)
Athlon 64 / Opteron  L1     64B                  512             2                   64KB
Athlon 64 / Opteron  L2     64B                  1024            16                  1024KB
Pentium 4E           L1     64B                  32              8                   16KB
Pentium 4E           L2     128B                 1024            8                   1024KB
PowerPC 970          L1     128B                 128             2                   32KB
PowerPC 970          L2     128B                 512             8                   512KB
UltraSPARC T1        L1     16B                  128             4                   8KB
UltraSPARC T1        L2     64B                  4096            12                  3072KB

Table 1. Data cache parameters for popular CPU models
2.2 Memory access in AES implementations
This paper focuses on AES, since its memory access patterns are particularly susceptible to
cryptanalysis (see Section 6.2 for a discussion of other ciphers). The cipher is abstractly defined
by algebraic operations and could, in principle, be directly implemented using just logical and
arithmetic operations.^10 However, performance-oriented software implementations on 32-bit (or
higher) processors typically use an alternative formulation based on lookup tables, as prescribed
in the Rijndael specification [19][20]. In the subsequent discussion we assume the following
implementation, which is typically the fastest.^11
Several lookup tables are precomputed once by the programmer or during system initialization.
There are 8 such tables, T_0, T_1, T_2, T_3 and T_0^(10), T_1^(10), T_2^(10), T_3^(10), each
containing 256 4-byte words. The contents of the tables, defined in [20], are inconsequential for
most of our attacks.
^10 Such an implementation would be immune to our attack, but exhibit low performance. A major reason
for the choice of Rijndael in the AES competition was the high performance of the implementation
analyzed here.
^11 See Section 5.2 for a discussion of alternative table layouts. A common variant employs 1 or no extra
tables for the last round (instead of 4); most of our attacks analyze only the first few rounds, and are
thus unaffected.
