Efficient Cache Attacks on AES, and Countermeasures
Eran Tromer^1,2, Dag Arne Osvik^3, and Adi Shamir^2

^1 Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology,
32 Vassar Street, G682, Cambridge, MA 02139, tromer@csail.mit.edu
^2 Department of Computer Science and Applied Mathematics, Weizmann Institute of Science,
Rehovot 76100, Israel, adi.shamir@weizmann.ac.il
^3 Laboratory for Cryptologic Algorithms, Station 14, École Polytechnique Fédérale de Lausanne,
1015 Lausanne, Switzerland, dagarne.osvik@epfl.ch
Abstract. We describe several software side-channel attacks based on inter-process leakage through
the state of the CPU’s memory cache. This leakage reveals memory access patterns, which can be used
for cryptanalysis of cryptographic primitives that employ data-dependent table lookups. The attacks
allow an unprivileged process to attack other processes running in parallel on the same processor,
despite partitioning methods such as memory protection, sandboxing and virtualization. Some of
our methods require only the ability to trigger services that perform encryption or MAC using the
unknown key, such as encrypted disk partitions or secure network links. Moreover, we demonstrate
an extremely strong type of attack, which requires knowledge of neither the specific plaintexts nor
ciphertexts, and works by merely monitoring the effect of the cryptographic process on the cache.
We discuss in detail several attacks on AES, and experimentally demonstrate their applicability to
real systems, such as OpenSSL and Linux’s dm-crypt encrypted partitions (in the latter case, the full
key was recovered after just 800 writes to the partition, taking 65 milliseconds). Finally, we discuss
a variety of countermeasures which can be used to mitigate such attacks.
Keywords: side-channel attack, cryptanalysis, memory cache, AES
1 Introduction
1.1 Overview
Many computer systems concurrently execute programs with different privileges, employing vari-
ous partitioning methods to facilitate the desired access control semantics. These methods include
kernel vs. userspace separation, process memory protection, filesystem permissions and chroot,
and various approaches to virtual machines and sandboxes. All of these rely on a model of the
underlying machine to obtain the desired access control semantics. However, this model is often
idealized and does not reflect many intricacies of the actual implementation.
In this paper we show how a low-level implementation detail of modern CPUs, namely the
structure of memory caches, causes subtle indirect interaction between processes running on
the same processor. This leads to cross-process information leakage. In essence, the cache forms a
shared resource which all processes compete for, and it thus affects and is affected by every process.
While the data stored in the cache is protected by virtual memory mechanisms, the metadata
about the contents of the cache, and in particular the memory access patterns of processes using
that cache, are not fully protected.
We describe several methods an attacker can use to learn about the memory access patterns
of another process, e.g., one which performs encryption with an unknown key. These are classified
into methods that affect the state of the cache and then measure the effect on the running time
of the encryption, and methods that investigate the state of the cache after or during encryption.
The latter are found to be particularly effective and noise-resistant.
We demonstrate the cryptanalytic applicability of these methods to the Advanced Encryption
Standard (AES, [39]) by showing a known-plaintext (or known-ciphertext) attack that performs
efficient full key extraction. For example, an implementation of one variant of the attack per-
forms full AES key extraction from the dm-crypt system of Linux using only 800 accesses to an
encrypted file, 65ms of measurements and 3 seconds of analysis; attacking simpler systems, such
as “black-box” OpenSSL library calls, is even faster at 13ms and 300 encryptions.
One variant of our attack has the unusual property of performing key extraction without
knowledge of either the plaintext or the ciphertext. This is a particularly strong form of attack,
which is clearly impossible in a classical cryptanalytic setting. It enables an unprivileged process,
merely by accessing its own memory space, to obtain bits from a secret AES key used by another
process, without any (explicit) communication between the two. This too is demonstrated exper-
imentally, and implementing AES in a way that is impervious to this attack, let alone developing
an efficient generic countermeasure, appears non-trivial.
This paper is organized as follows: Section 2 gives an introduction to memory caches and AES
lookup tables. In Section 3 we describe the basic attack techniques, in the “synchronous” setting
where the attacker can explicitly invoke the cipher on known data. Section 4 introduces even
more powerful “asynchronous” attacks which relax the latter requirement. In Section 5, various
countermeasures are described and analyzed. Section 6 summarizes these results and discusses
their implications.
1.2 Related work
The possibility of cross-process leakage via cache state was first considered in 1992 by Hu [24]
in the context of intentional transmission via covert channels. In 1998, Kelsey et al. [27] men-
tioned the prospect of “attacks based on cache hit ratio in large S-box ciphers”. In 2002, Page
[47] described theoretical attacks on DES via cache misses, assuming an initially empty cache and
the ability to identify cache effects with very high temporal resolution in side-channel traces. He
subsequently proposed several countermeasures for smartcards [48], though most of these require
hardware modifications and are inapplicable or insufficient in our attack scenario. Recently, vari-
ants of this attack (termed “trace-driven” in [48]) were realized by Bertoni et al. [11] and Acıiçmez
and Koç [3][4], using a power side channel of a MIPS microprocessor in an idealized simulation.
By contrast, our attacks operate purely in software, and are hence of wider applicability and
implications; they have also been experimentally demonstrated in real-life scenarios.
In 2002 and subsequently, Tsunoo et al. devised a timing-based attack on MISTY1 [57,58]
and DES [56], exploiting the effects of collisions between the various memory lookups invoked
internally by the cipher (as opposed to the cipher vs. attacker collisions we investigate, which
greatly improve the efficiency of an attack). Recently, Lauradoux [32] and Canteaut et al. [18]
proposed some countermeasures against these attacks, none of which are satisfactory against our
attacks (see Section 5).
An abridged version of this paper was published in [45], and announced in [44].
Concurrently but independently, Bernstein [10] described attacks on AES that exploit timing
variability due to cache effects. This attack can be seen as a variant of our Evict+Time measure-
ment method (see Section 3.4 and the analysis of Neve et al. [42]), though it is also somewhat
sensitive to the aforementioned collision effects. The main difference is that [10] does not use an
explicit model of the cache and active manipulation, but rather relies only on the existence of some
consistent statistical patterns in the encryption time caused by memory access effects; these pat-
terns are neither controlled nor modeled. The resulting attack is simpler and more portable than
ours, since its implementation is mostly oblivious to the fine (and often unpublished) details of the
targeted CPU and software; indeed, [10] includes the concise C source code of the attack. More-
over, the attack of [10] locally executes only time measurement code on the attacked computer,
whereas our attack code locally executes more elaborate code that also performs (unprivileged)
memory accesses. However, the attack of [10] has several shortcomings. First, it requires reference
measurements of encryption under known key in an identical configuration, and these are often
not readily available (e.g., a user may be able to write data to an encrypted filesystem, but creat-
ing a reference filesystem with a known key is a privileged operation). Second, the attack of [10]
relies on timing the encryption and thus, similarly to our Evict+Time method, seems impractical
on many real systems due to excessively low signal-to-noise ratio; our alternative methods
(Sections 3.5 and 4) address this. Third, even when the attack of [10] works, it requires a much
higher number of analyzed encryptions than our method.^4 A subsequent paper of Canteaut et
al. [18] describes a variant of Bernstein’s attack which focuses on internal collisions (following
Tsunoo et al.) and provides a more in-depth experimental analysis;^5 its properties and
applicability are similar to Bernstein’s attack.^6 See Section 6.5 for subsequent improvements.
Also concurrently with but independently of our work, Percival [50] described a cache-based
attack on RSA for processors with simultaneous multithreading. The measurement method is
similar to one variant of our asynchronous attack (Section 4), but the cryptanalysis has little
in common since the algorithms and time scales involved in RSA vs. AES operations are very
different. Both [10] and [50] contain discussions of countermeasures against the respective attacks,
and some of these are also relevant to our attacks (see Section 5).
Koeune and Quisquater [30] described a timing attack on a “bad implementation” of AES
which uses its algebraic description in a “careless way” (namely, using a conditional branch in
the MixColumn operation). That attack is not applicable to common software implementations,
but should be taken into account in regard to certain countermeasures against our attacks (see
Section 5.2).

^4 In our experiments the attack code of [10] failed to get a signal from dm-crypt even after a 10-hour
run, whereas in the same setup our Prime+Probe attack (see Section 3.5) performed full key recovery
using 65ms of measurements.
^5 Canteaut et al. [18] claim that their attack exploits only collision effects due to microarchitectural
details (i.e., low address bits) and that Bernstein’s attack [10] exploits only cache misses (i.e., higher
address bits). However, experimentally both attacks yield key bits of both types, as can be expected:
the analysis method of [10] also detects collision effects (albeit with lower sensitivity), while the attack
setting of [18] inadvertently also triggers systematic cache misses (e.g., due to the encryption function’s
use of stack and buffers).
^6 [18] reports an 85% chance of recovering 20 bits using 2^30 encryptions after a 2^30 learning phase,
even for the “lightweight” target of OpenSSL AES invocation. In the same setting, our attack reliably
recovers the full key from just 300 encryptions (Section 3.7).
Leakage of memory access information has also been considered in other contexts, yielding
theoretical [22] and heuristic [63][64] mitigation methods; these are discussed in Section 5.3.
See Section 6.5 for a discussion of additional works following our research.
2 Preliminaries
2.1 Memory and cache structure
Over the past couple of decades, CPU speed (in terms of operations per second) has been
benefiting from Moore’s law and growing at a rate of roughly 60% per year, while the latency of
main memory has been decreasing at a much slower rate (7%–9% per year).^7 Consequently, a large
gap has developed between the two. Complex multi-level cache architectures are employed to
bridge this gap, but it still shows through during cache misses: on a typical modern processor,
accessing data in the innermost (L1) cache typically requires amortized time on the order of
0.3ns, while accessing main memory may stall computation for 50 to 150ns, i.e., a slowdown of
2–3 orders of magnitude. The cache architectures are optimized to minimize the number of cache
misses for typical access patterns, but can be easily manipulated adversarially; to do so we will
exploit the special structure in the association between main memory and cache memory.
Modern processors use one or more levels of set-associative memory cache. Such a cache
consists of storage cells called cache lines, each consisting of B bytes. The cache is organized into
S cache sets, each containing W cache lines^8, so overall the cache contains B · S · W bytes. The
mapping of memory addresses into the cache is limited as follows. First, the cache holds copies
of aligned blocks of B bytes in main memory (i.e., blocks whose starting address is 0 modulo B),
which we will term memory blocks. When a cache miss occurs, a full memory block is copied into
one of the cache lines, replacing (“evicting”) its previous contents. Second, each memory block
may be cached only in a specific cache set; specifically, the memory block starting at address a
can be cached only in the W cache lines belonging to cache set ⌊a/B⌋ mod S. See Figure 1. Thus,
the memory blocks are partitioned into S classes, where the blocks in each class contend for the
W cache lines in a single cache set.^9
Modern processors have up to 3 levels of memory cache, denoted L1 to L3, with L1 being the
smallest and fastest cache and subsequent levels increasing in size and latency. For simplicity, in
the following we mostly ignore this distinction; one has a choice of which cache to exploit, and
our experimental attacks used both L1 and L2 effects. Additional complications are discussed in
Section 3.6. Typical cache parameters are given in Table 1.
^7 This relatively slow reduction in DRAM latency has proven so reliable, and so founded in basic
technological hurdles, that it has been proposed by Abadi et al. [1] and Dwork et al. [21] as a basis for
proof-of-work protocols.
^8 In common terminology, W is called the associativity and the cache is called W-way set associative.
^9 CPUs differ in their policy for choosing which cache line inside a set to evict during a cache miss. Our
attacks work for all common algorithms, but as discussed in Section 3.8, knowledge of the policy allows
further improvements.

Fig. 1. Schematic of a single level of set-associative cache. Each column of memory blocks (right
side) corresponds to S · B contiguous bytes of memory. Each row of memory blocks is mapped to
the corresponding row in the cache (left side), representing a set of W cache lines. The light gray
blocks represent an AES lookup table in the victim’s memory. The dark gray blocks represent
the attacker’s memory used for the attack, which will normally be at least as big as the size of
the cache.
CPU model            Level  B (cache line size)  S (cache sets)  W (associativity)  B · S · W (total size)
Athlon 64 / Opteron  L1     64B                  512             2                   64KB
Athlon 64 / Opteron  L2     64B                  1024            16                  1024KB
Pentium 4E           L1     64B                  32              8                   16KB
Pentium 4E           L2     128B                 1024            8                   1024KB
PowerPC 970          L1     128B                 128             2                   32KB
PowerPC 970          L2     128B                 512             8                   512KB
UltraSPARC T1        L1     16B                  128             4                   8KB
UltraSPARC T1        L2     64B                  4096            12                  3072KB

Table 1. Data cache parameters for popular CPU models
2.2 Memory access in AES implementations
This paper focuses on AES, since its memory access patterns are particularly susceptible to
cryptanalysis (see Section 6.2 for a discussion of other ciphers). The cipher is abstractly defined
by algebraic operations and could, in principle, be directly implemented using just logical and
arithmetic operations.^10 However, performance-oriented software implementations on 32-bit (or
higher) processors typically use an alternative formulation based on lookup tables, as prescribed
in the Rijndael specification [19][20]. In the subsequent discussion we assume the following
implementation, which is typically the fastest.^11
Several lookup tables are precomputed once by the programmer or during system initialization.
There are 8 such tables, T_0, T_1, T_2, T_3 and T_0^(10), T_1^(10), T_2^(10), T_3^(10), each
containing 256 4-byte words. The contents of the tables, defined in [20], are inconsequential for
most of our attacks.
^10 Such an implementation would be immune to our attack, but exhibit low performance. A major reason
for the choice of Rijndael in the AES competition was the high performance of the implementation
analyzed here.
^11 See Section 5.2 for a discussion of alternative table layouts. A common variant employs 1 or no extra
tables for the last round (instead of 4); most of our attacks analyze only the first few rounds, and are
thus unaffected.
