Proceedings ArticleDOI

Reducing Cache Pollution Through Detection and Elimination of Non-Temporal Memory Accesses

13 Nov 2010, pp. 1-11
TL;DR: A classification of applications into four cache usage categories is introduced; the paper discusses how applications from different categories affect each other's performance indirectly through cache sharing, devises a scheme to optimize such sharing, and proposes a low-overhead method to find the best per-instruction cache management policy.
Abstract: Contention for shared cache resources has been recognized as a major bottleneck for multicores--especially for mixed workloads of independent applications. While most modern processors implement instructions to manage caches, these instructions are largely unused due to a lack of understanding of how to best leverage them. This paper introduces a classification of applications into four cache usage categories. We discuss how applications from different categories affect each other's performance indirectly through cache sharing and devise a scheme to optimize such sharing. We also propose a low-overhead method to automatically find the best per-instruction cache management policy. We demonstrate how the indirect cache-sharing effects of mixed workloads can be tamed by automatically altering some instructions to better manage cache resources. Practical experiments demonstrate that our software-only method can improve application performance up to 35% on x86 multicore hardware.

Summary (4 min read)

Introduction

  • This paper introduces a classification of applications into four cache usage categories.
  • The authors also propose a low-overhead method to automatically find the best per-instruction cache management policy.
  • When an application shares a multicore with other applications, new types of performance considerations are required for good system throughput.

II. MANAGING CACHES IN SOFTWARE

  • Application performance on multicores is highly dependent on the activities of the other cores in the same chip due to contention for shared resources.
  • The miss ratio of the non-streaming application decreases as the amount of available cache increases.
  • Using the r_s and δ the authors can classify applications based on how they use the cache.
  • These applications fit their entire data set in the private cache; they are therefore largely unaffected by contention for the shared cache and memory bandwidth.
  • The large base miss ratio in such applications is due to memory accesses that touch data that is never reused while it resides in the cache.

III. CACHE MANAGEMENT INSTRUCTIONS

  • Most modern instruction sets include instructions to manage caches.
  • Many processors support at least one of these instruction classes.
  • Instructions from the second category, forced cache eviction, appear in some form in most architectures.
  • The ECB and WH64 instructions are in many ways similar to the caching hints in the previous category, but instead of annotating the load or store instruction, the hints are given after or, in case of a store, before the memory accesses in question.
  • The third category, non-temporal prefetches, is also included in several different ISAs.

IV. LOW-OVERHEAD CACHE MODELING

  • A natural starting point for modeling LRU caches is the stack distance [11].
  • StatStack is a statistical cache model that models fully associative caches with LRU replacement.
  • To estimate the stack distance of the second access to A in Figure 4, the authors sum the estimated likelihoods that the memory accesses executed between the two accesses to A have reuse distances such that their corresponding arcs reach beyond “Out Boundary”.
  • StatStack uses this approach to estimate the stack distances of all memory accesses in a reuse distance sample, effectively estimating a stack distance distribution.
  • A fourth step is included to take effects from sampled stack distances into account.

A. A first simplified approach

  • An instruction has non-temporal behavior if all forward stack distances, i.e. the number of unique cache lines accessed between this instruction and the next access to the same cache line, are greater than or equal to the size of the cache.
  • Therefore, the authors can use a non-temporal instruction to bypass the entire cache hierarchy for such accesses.
  • Most applications, even purely streaming ones that do not reuse data, may still exhibit short temporal reuse, e.g. spatial locality where neighboring data items on the same cache line are accessed in close succession.
  • Since cache management is done at a cache line granularity, this clearly restricts the number of possible instructions that can be treated as non-temporal.

B. Refining the simple approach

  • Most hardware implementations of cache management instructions allow the non-temporal data to live in parts of the cache hierarchy, such as the L1, before it is evicted to memory.
  • The authors assume that whenever a non-temporal memory access touches a cache line, the cache line is installed in the MRU-position of the LRU stack, and a special bit on the cache line, the evict to memory (ETM) bit, is set.
  • Whenever a normal memory access touches a cache line, the ETM bit is cleared.
  • The authors thus require that one stack distance is greater than or equal to d_max, and that the number of stack distances that are greater than or equal to d_ETM but smaller than d_max is smaller than some threshold, t_m.
  • In most implementations t_m will not be a single value for all accesses, but will depend on factors such as how many additional cache hits can be created by disabling caching for a memory access.

C. Handling sticky ETM bits

  • When the ETM bit is retained for the cache lines’ entire lifetime in the cache, the conditions for a memory accessing instruction to be non-temporal developed in section V-B are no longer sufficient.
  • If instruction X sets the ETM bit on a cache line, then the ETM status applies to all subsequent reuses of the cache line as well.
  • The sticky ETM bit is only a problem for non-temporal accesses that have forward reuse distances less than d_ETM.
  • When Y accesses the cache line it is moved to the MRU position of the LRU stack, and the sticky ETM bit is retained.
  • Therefore, instead of applying the non-temporal condition to a single instruction, the authors have to apply it to all instructions reusing the cache line accessed by the first instruction.

D. Handling sampled data

  • To avoid the overhead of measuring exact stack distances, the authors use StatStack to calculate stack distances from sampled reuse distances.
  • Sampled stack distances can generally be used in place of a full stack distance trace with only a small decrease in average accuracy.
  • There is always a risk of missing some critical behavior.
  • This could potentially lead to flagging an access as non-temporal, even though the instruction in fact has some temporal behavior in some cases, and thereby introducing an unwanted cache miss.

A. Model system

  • To evaluate their model the authors used an x86 based system with an AMD Phenom II X4 920 processor with the AMD family 10h micro-architecture.
  • The processor has 4-cores, each with a private L1 and L2 cache and a shared L3 cache.
  • According to the documentation of the prefetchnta instruction, data fetched using the non-temporal prefetch is not installed in the L2 unless it was fetched from the L2 in the first place.
  • Their experiments show that this is not the case.
  • The system therefore works like the system modeled in section V-C where the ETM-bit is sticky.

B. Benchmark preparation

  • The benchmarks were first compiled normally for initial reference runs and sampling.
  • Sampling was done on each benchmark running with the reference input set.
  • The benchmarks were then recompiled taking this profile into account.
  • The assembly output was then modified before it was passed to the assembler.
  • Before each non-temporal memory access the script inserted a prefetchnta instruction to the same memory location as the original access.
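The effect of this rewriting can be sketched in C with compiler intrinsics. The loop below is illustrative only (it is not the rewriting script, and the function and array names are hypothetical): each access classified as non-temporal is preceded by a prefetchnta to the same address, which on the modeled AMD system brings the line into the L1 without installing it in the higher cache levels.

```c
#include <stddef.h>
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_NTA */

/* Illustrative loop: every load that profiling flagged as non-temporal is
 * preceded by a prefetchnta to the same address, mirroring the instruction
 * the rewriting script inserts into the assembly output.                  */
double sum_non_temporal(const double *data, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        _mm_prefetch((const char *)&data[i], _MM_HINT_NTA);
        sum += data[i];          /* the original memory access */
    }
    return sum;
}
```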

C. Algorithm parameters

  • The authors model the cache behavior of their benchmarks using StatStack and a reuse distance sample with 100 000 memory access pairs per benchmark.
  • This behavior lets us merge the two caches and treat them as one larger LRU stack where each cache level corresponds to a contiguous section of the stack.
  • In most cases this is a valid assumption, especially for large caches with a high degree of associativity.
  • The authors therefore have to be more conservative when evaluating stack distances within this range.
  • The authors use different, conservative, values of dETM , when calculating the number of introduced misses and handling the stickiness of the ETM bits.

D. Benchmarks

  • Using the software classification introduced in section II the authors selected two benchmarks representing each category for analysis.
  • Applications on the right-hand side of the map, Gobblers & Victims and Cache Gobblers, have a high base miss ratio and store a large amount of non-temporal data in the shared cache.
  • Whenever there is a cache miss, a new cache line is installed and another one is replaced.
  • Looking at Figure 6a the authors see that libquantum’s replacement ratio is reduced from approximately 20% to 0% in the shared cache, while the miss ratio stays at 20%.
  • The authors reclassify their benchmarks based on their new replacement ratio curves; the new classification allows them to predict how applications affect each other after they introduce the non-temporal memory accesses.

VII. RESULTS AND ANALYSIS

  • The results for runs of six different mixes of four SPEC2006 benchmarks running with the reference input set, with and without software cache management are shown in Figure 8 and Figure 9.
  • Figure 9 shows five different mixes consisting of two pairs of benchmarks from different categories.
  • The speedup is the improvement in IPC over the unmanaged version when running in a mix.
  • Applying software cache management pushes the knee to the left, i.e. towards smaller cache sizes, and decreases the miss ratio for systems with between 4MB and 8MB of cache.
  • Looking at Figure 9a, Figure 9c and Figure 9e the authors see that running together with applications from these categories causes a significant decrease in IPC compared to when running in isolation.

ACKNOWLEDGMENTS

  • The authors would like to thank Kelly Shaw and David Black-Schaffer for valuable comments and insights that have helped to improve this paper.
  • This work was financially supported by the CoDeR-MP and UPMARC projects.




Reducing Cache Pollution Through Detection and
Elimination of Non-Temporal Memory Accesses
Andreas Sandberg, David Eklöv and Erik Hagersten
Department of Information Technology
Uppsala University, Sweden
{andreas.sandberg, david.eklov, eh}@it.uu.se
Abstract—Contention for shared cache resources has been
recognized as a major bottleneck for multicores—especially for
mixed workloads of independent applications. While most mod-
ern processors implement instructions to manage caches, these
instructions are largely unused due to a lack of understanding
of how to best leverage them.
This paper introduces a classification of applications into
four cache usage categories. We discuss how applications from
different categories affect each other’s performance indirectly
through cache sharing and devise a scheme to optimize such
sharing. We also propose a low-overhead method to automatically
find the best per-instruction cache management policy.
We demonstrate how the indirect cache-sharing effects of
mixed workloads can be tamed by automatically altering some
instructions to better manage cache resources. Practical experi-
ments demonstrate that our software-only method can improve
application performance up to 35% on x86 multicore hardware.
I. INTRODUCTION
The introduction of multicore processors has significantly
changed the landscape for most applications. The literature
has mostly focused on parallel multithreaded applications.
However, multicores are often used to run several independent
applications. Such mixed workloads are common in a wide
range of systems, spanning from cell phones to HPC servers.
HPC clusters often run a large number of serial applications
in parallel across their physical cores. For example, parameter
studies in science and engineering where the same application
is run with different input data sets.
When an application shares a multicore with other appli-
cations, new types of performance considerations are required
for good system throughput. Typically, the co-scheduled appli-
cations share resources with limited capacity and bandwidth,
such as a shared last-level cache (SLLC) and DRAM inter-
faces. An application overusing any of these resources can
degrade the performance of the other applications sharing the
same multicore chip.
Consider a simple example: Application A has an active
working set that barely fits in the SLLC, and application
B makes a copy of a data structure much larger than the
SLLC. When run together, B will use a large portion of
the SLLC and will force A to miss much more often than
when run in isolation. Fortunately, most libraries implementing
memory copying routines, e.g. memcpy, have been hand-
optimized and use special cache-bypass instructions, such as
non-temporal reads and writes. On most implementations,
these instructions will avoid allocation of resources in the
SLLC and subsequently will not force any replacements of
application A's working set in the cache.
In the example above the use of cache bypass instructions
may seem obvious, and hand-tuning a common routine, such
as memcpy, may motivate the use of special assembler instruc-
tions. However, many common programs also have memory
accesses that allocate data with little benefit in the SLLC
and may slow down co-scheduled applications. Detecting such
reckless use is beyond the capability of most application pro-
grammers, as is the use of assembly coding. Ideally, both the
detection and cache-bypassing should be done automatically
using existing hardware support.
Several software techniques for managing caches have been
proposed in the past [1], [2], [3], [4]. However, most of
these methods require an expensive simulation analysis. These
techniques assume the existence of heavily specialized instruc-
tions [1], [4], or extensions to the cache state and replacement
policy [3], none of which can be found in today’s processors.
Several researchers have proposed hardware improvements to
the LRU replacement algorithm [5], [6], [7], [8]. In general,
such algorithms tweak the LRU order by including additional
predictions about future re-references. Others have tried to
predict [9] and quantify [10] interference due to cache sharing.
In this paper, we propose an efficient and practical software-
only technique to automatically manage cache sharing to
improve the performance of mixed workloads of common
applications running on existing x86 hardware. Unlike pre-
viously proposed methods, our technique does not rely on
any hardware modifications and can be applied to existing
applications running on commodity hardware. This paper
makes the following contributions:
We propose a scheme to classify applications according
to their impact, and dependence, on the SLLC.
We propose an automatic and low-overhead method to
find instructions that use the SLLC recklessly and au-
tomatically introduce cache bypass instructions into the
binary.
We demonstrate how this technique can change the classi-
fication of many applications, making them better mixed
workload citizens.
We evaluate the performance gain of the applications and
show that their improved behavior is in agreement with
the classification.

Figure 1. Miss ratio as a function of cache size for an application with
streaming behavior and a typical non-streaming application that reuses most
of its data. The plot shows miss ratio (0% to 30%) versus cache size, from
the private cache alone up to the private cache plus the SLLC, with curves
for the streaming and non-streaming applications and markers for the together
and isolation cases. When run in isolation, each application has access to both the
private cache and the entire SLLC. Running together causes the non-streaming
application to receive a small fraction of the SLLC, while the streaming
application receives a large fraction without decreasing its miss ratio. The
change in perceived cache size and miss ratio is illustrated by the arrows.
II. MANAGING CACHES IN SOFTWARE
Application performance on multicores is highly dependent
on the activities of the other cores in the same chip due to
contention for shared resources. In most modern processors
there is no explicit hardware policy to manage these shared
resources. However, there are usually instructions to manage
these resources in software. By using these instructions prop-
erly, it is possible to increase the amount of data that is reused
through the cache hierarchy. However, this requires being able
to predict which applications, and which instructions, benefit
from caching and which do not.
In order to know which application would benefit from
using more of the shared cache resources, we need to know
the applications’ cache usage characteristics. The cache miss
ratio as a function of cache size, i.e. the number of cache
misses as a fraction of the total number of memory accesses
as a function of cache size, is a useful metric to determine
such characteristics. Figure 1 shows the miss ratio curves
for a typical streaming application and an application that
reuses its data. The miss ratio of the non-streaming application
decreases as the amount of available cache increases. This
occurs because more of the data set fits in the cache. Since
the streaming application does not reuse its data, the miss ratio
stays constant even when the cache size is increased.
When the applications are run in isolation, they will get
access to both the core-local private cache and the SLLC.
Assuming that the cache hierarchy is exclusive, the amount
of cache available to an application running in isolation is
the sum of the private cache and the SLLC. When two
applications share the cache, they will perceive the SLLC
as being smaller. In the case illustrated by Figure 1, the
streaming application misses much more frequently than the
non-streaming application. The frequent misses cause the
streaming application to install more data in the cache than
the non-streaming application. The non-streaming application
will therefore perceive the cache as being much smaller than
when run in isolation. The change in perceived cache size,
and how this affects miss ratio is illustrated by the arrows in
Figure 1.

Figure 2. A generalized miss ratio curve for an application. The minimum,
i.e. only the private cache, and the maximum, i.e. the private cache and the
full shared cache, amount of cache available to an application are shown on
the x-axis. The miss ratio when running in isolation (r_s) is the smallest miss
ratio that an application can achieve on this system, while the miss ratio when
running only in the private cache (r_p) is the worst miss ratio. The δ represents
how much an application is affected by competition for the shared cache.
Decreasing the perceived cache size for the streaming
application does not affect its miss ratio. The non-streaming
application, however, sees an increased miss ratio when access
to the SLLC is restricted. As the number of misses increases,
the bandwidth requirements also increase, which affects the
performance of all the applications sharing the same memory
interface. If we could make sure that the streaming application
does not install any of its streaming data into the cache, the
miss ratio, and bandwidth requirement, of the non-streaming
applications would decrease without sacrificing any perfor-
mance. In fact, the streaming application would run faster
since the total bandwidth requirement would be decreased.
Using the miss ratio curves we can classify applications
based on how they affect others and how they are affected by
competition for the shared cache. We base this classification on
the base miss ratio, r_s, when the application is run in isolation
and has access to both its private cache and the entire SLLC,
and the miss ratio, r_p, when it only has access to the private
cache, see Figure 2. The r_p miss ratio can be thought of as the
maximum miss ratio that an application can get due to cache
contention, while r_s is the ideal case when the application is
run in isolation. To capture the sensitivity to cache contention
we define the cache sensitivity, δ, to be the difference between
the two miss ratios. A large δ indicates that an application
benefits from using the shared cache, while a small δ means
that the application exhibits streaming behavior and does not
benefit from additional cache resources.

Using the r_s and δ we can classify applications based on
how they use the cache. This classification allows us to predict
how applications will affect each other and how the system
will be affected by software cache management. We define
the following categories:
Don’t care
Small r_s and small δ—Applications that are largely
unaffected by contention for the shared cache level.
These applications fit their entire data set in the
private cache; they are therefore largely unaffected
by contention for the shared cache and memory
bandwidth.
Victims
Small r_s and large δ—Applications that suffer badly
if the amount of cache at the shared level is restricted.
The data they manage to install in the shared resource
is almost always reused. Applications with a working
set larger than the private cache, but smaller than the
total cache size belong in this category.
Gobblers & Victims
Large r_s and large δ—Applications that suffer from
SLLC cache contention, but store large amounts of
data that is never reused in the shared cache. For
example, applications traversing a small and a large
data structure in parallel may reuse data in the cache
when accessing the small structure, while accesses
to the large data structure always miss. Disabling
caching for the accesses to the large data structure
would allow more of the smaller data structure to be
cached. Managing the cache for these applications
is likely to improve throughput, both when they
are running in isolation and in a mix with other
applications.
Cache Gobblers
Large r_s and small δ—Applications that do not ben-
efit from the shared cache at all, but still install large
amounts of data in it. Applications in this category
work on streaming data or data structures that are
much larger than the cache. These applications are
good candidates for software cache management.
Since they do not reuse the data they install in
the shared cache, their throughput is generally not
improved when running in isolation. Managing these
applications will improve the full system throughput
by allowing applications from other categories to use
more of the shared cache.
Figure 3. Classification map of a subset of the SPEC2006 benchmarks
(perlbench, bzip2, gcc, bwaves, gamess, mcf, milc, zeusmp, leslie3d, soplex,
hmmer, libquantum, h264ref, lbm, astar, sphinx3 and Xalan) running with the
reference input set on a system with a 576 kB private cache and 6 MB shared
cache. The x-axis shows the base miss ratio (r_s) and the y-axis the cache
sensitivity (δ). The quadrants signify different behaviors when running
together with other applications. Applications to the left tend to reuse
almost all of their data in the shared cache and generally work well with other
applications; applications to the right tend to use large parts of the shared
cache for data that is never reused and are generally troublesome in mixes
with other applications. Applications in the upper half are sensitive to the
amount of data that can be stored in the shared cache, while applications on
the bottom are insensitive.

Figure 3 shows the classification of several SPEC2006
benchmarks according to these categories. Applications clas-
sified as wasting cache resources, i.e. applications on the
right-hand side of the map, are obvious targets for cache
management. The large base miss ratio in such applications
is due to memory accesses that touch data that is never reused
while it resides in the cache. Disabling caching for such
instructions does not introduce new misses since the data is not
reused; instead, it will free up cache space for other accesses.
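To make the classification concrete, the C sketch below derives r_s, r_p and the sensitivity δ from a miss ratio curve and maps an application onto one of the four categories. The two threshold values are placeholders, since the paper does not state the exact cut-offs that separate the quadrants in Figure 3.

```c
#include <stddef.h>

/* Cache usage categories from Section II. */
enum cache_class { DONT_CARE, VICTIM, GOBBLER_AND_VICTIM, CACHE_GOBBLER };

/*
 * Classify an application from its miss ratio curve. miss_ratio(c) returns
 * the miss ratio for a cache of c bytes. The thresholds are illustrative
 * placeholders, not values taken from the paper.
 */
enum cache_class classify(double (*miss_ratio)(size_t),
                          size_t private_size, size_t shared_size)
{
    double r_p   = miss_ratio(private_size);               /* private only   */
    double r_s   = miss_ratio(private_size + shared_size); /* private + SLLC */
    double delta = r_p - r_s;                              /* sensitivity    */

    const double r_s_large   = 0.01;   /* assumed "large" base miss ratio   */
    const double delta_large = 0.001;  /* assumed "large" cache sensitivity */

    if (r_s < r_s_large)
        return (delta < delta_large) ? DONT_CARE : VICTIM;
    return (delta < delta_large) ? CACHE_GOBBLER : GOBBLER_AND_VICTIM;
}
```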
III. CACHE MANAGEMENT INSTRUCTIONS
Most modern instruction sets include instructions to manage
caches. These instructions can typically be classified into three
different categories: non-temporal memory accesses, forced
cache eviction and non-temporal prefetches. Many processors
support at least one of these instruction classes. However,
their semantics may not always make them suitable for cache
management for performance.
Examples from the first category are the memory accesses
in the PA-RISC which can be annotated with caching hints,
e.g. only spatial locality or write only. Similar instruction
annotations exist for Itanium. Other instruction sets, such as
some of the SIMD extensions to the x86, contain completely
separate instructions for handling non-temporal data. The
hardware may, based on these hints, decide not to install write-
only cache lines in the cache and use write-combining buffers
instead. Non-temporal reads can be handled using separate
non-temporal buffers or by installing the accessed cache line
in such a way that it is the next line to be evicted from a set.
Instructions from the second category, forced cache eviction,
appear in some form in most architectures. However, not
all architectures expose such instructions to user space. Yet
other implementations may have undesired semantics that limit
their usefulness in code optimizations, e.g. the x86 Flush
Cache Line (CLFLUSH) instruction forces the line to be flushed
from all caches in the coherence domain. There are some architectures
that implement instructions in this class that are specifically
intended for code optimizations. For example, the Alpha ISA
specifies an instruction, Evict Data Cache Block (ECB), that
gives the memory system a hint that a specific cache line will
not be reused in the near future. A similar instruction, Write
Hint (WH64), tells the memory subsystem that an entire cache
line will be overwritten before being read again; this allows
the memory system to allocate the cache line without actually
reading its old contents. The ECB and WH64 instructions are
in many ways similar to the caching hints in the previous
category, but instead of annotating the load or store instruction,

the hints are given after or, in case of a store, before the
memory accesses in question.
The third category, non-temporal prefetches, is also included
in several different ISAs. The SPARC ISA has both read and
write prefetch variants for data that is not temporally reused.
Similar prefetch instructions are also available in both Itanium
and x86. Some implementations may choose to prefetch into
the cache such that the fetched line is the next to be evicted
from that set; others may prevent the data from propagating
from the L1 to a higher level in the cache hierarchy.
In the remainder of this paper, we will assume an archi-
tecture with a non-temporal hint that is implemented such
that non-temporal data is fetched into the L1 cache, but never
installed in higher levels. This is how the AMD system we
target implements support for non-temporal prefetches.
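On x86, one common intrinsic exists for each of the three categories; the short C sketch below is a generic illustration of these instruction classes, not code from the paper's toolchain.

```c
#include <emmintrin.h>   /* SSE2: _mm_stream_si32, _mm_clflush */
#include <xmmintrin.h>   /* SSE:  _mm_prefetch, _MM_HINT_NTA   */

void cache_management_examples(int *dst, const int *src)
{
    /* 1. Non-temporal memory access: a streaming store that bypasses the
     *    cache hierarchy (typically via write-combining buffers).          */
    _mm_stream_si32(dst, src[0]);

    /* 2. Forced cache eviction: flush the line holding dst[0] from all
     *    caches in the coherence domain (the CLFLUSH semantics above).     */
    _mm_clflush(&dst[0]);

    /* 3. Non-temporal prefetch: fetch a line with a non-temporal hint; on
     *    the AMD system assumed in this paper the line is installed in the
     *    L1 but never in the higher cache levels.                          */
    _mm_prefetch((const char *)&src[16], _MM_HINT_NTA);
}
```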
IV. LOW-OVERHEAD CACHE MODELING
A natural starting point for modeling LRU caches is the
stack distance [11]. A stack distance is the number of unique
cache lines accessed between two successive memory accesses
to the same cache line. It can be directly used to determine
if a memory access results in a cache hit or a cache miss
for a fully-associative LRU cache: if the stack distance is less
than the cache size, the access will be a hit, otherwise it will
miss. Therefore, the stack distance distribution enables the
application’s miss ratio to be computed for any given cache
size, by simply computing the fraction of memory accesses
with a stack distance greater than the desired cache size.
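A minimal C sketch of this computation, assuming the stack distance distribution is available as a histogram (the names and representation are illustrative):

```c
#include <stddef.h>

/* Miss ratio of a fully associative LRU cache holding cache_lines lines.
 * hist[d] is the number of accesses with stack distance d, for
 * d = 0..max_distance; cold misses can be folded into the last bucket.  */
double miss_ratio_from_stack_distances(const unsigned long *hist,
                                       size_t max_distance,
                                       size_t cache_lines)
{
    unsigned long total = 0, misses = 0;

    for (size_t d = 0; d <= max_distance; d++) {
        total += hist[d];
        if (d >= cache_lines)          /* distance >= cache size: a miss */
            misses += hist[d];
    }
    return total ? (double)misses / (double)total : 0.0;
}
```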
In this work, we need to differentiate between what we call
backward and forward stack distance. Let A and B be two
successive memory accesses to the same cache line. Suppose
that there are S unique cache lines accessed by the memory
accesses executed between A and B. Here, we say that A has a
forward stack distance of S, and that B has a backward stack
distance of S.
Measuring stack distances is generally very expensive. In
this paper, we use StatStack [12] to estimate stack distances
and miss ratios. StatStack is a statistical cache model that
models fully associative caches with LRU replacement. Mod-
eling fully associative LRU caches is, for most applications, a
good approximation of the set associative pseudo LRU caches
implemented in hardware. StatStack estimates an application’s
stack distances using only a sparse sample of the application’s
reuse distances, i.e. the number of memory accesses performed
between two accesses to the same cache line. This approach
to modeling caches has been shown to be several orders of
magnitude faster than full cache simulation, and almost as
accurate. The runtime profile of an application can be collected
with an overhead of only 40% [13], and the execution time of
the cache model is only a few seconds [12].
To understand how StatStack works, consider the access
sequence shown in Figure 4. Here the arcs connect subsequent
accesses to the same cache line, and represent the reuse of data.
In this example, the second memory access to cache line A has
a reuse distance of five, since there are five memory accesses
executed between the two accesses to A, and a backward
stack distance of three, since there are three unique cache
lines (B, C and D) accessed between the two accesses to A.
Furthermore, we see that there are three arcs that cross the
vertical line labeled “Out Boundary”, which is the same as
the stack distance of the second access to A. This observation
holds true in general. Based on it we can compute the stack
distance of any memory access, given that we know the reuse
distances of all memory accesses performed between it and the
previous access to the same cache line.

Figure 4. Reuse distance in a memory access stream (access sequence
A B B C D C C D B A, with a vertical line labeled “Out Boundary”). The arcs
connect successive memory accesses to the same cache line, and represent the
reuse of cache lines. The stack distance of the second memory access to A is
equal to the number of arcs that cross “Out Boundary”.
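Assuming a full trace of reuse distances were available, the arc-counting observation above translates directly into code. The sketch below is an illustration of that rule, not the authors' implementation.

```c
#include <stddef.h>

/*
 * Exact stack distance of the access at trace position cur, whose previous
 * access to the same cache line was at position prev. reuse[k] is the reuse
 * distance of the access at position k (the number of accesses executed
 * between it and the next access to its cache line); lines that are never
 * reused again can be given a reuse distance of at least the trace length.
 * The access at position k reuses its line at position k + reuse[k] + 1, so
 * its arc crosses the boundary just before cur when that position is >= cur.
 * Counting such arcs counts the unique cache lines touched in between.
 */
size_t stack_distance(const size_t *reuse, size_t prev, size_t cur)
{
    size_t crossing_arcs = 0;

    for (size_t k = prev + 1; k < cur; k++)
        if (k + reuse[k] + 1 >= cur)
            crossing_arcs++;

    return crossing_arcs;
}
```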
The input to StatStack is a sparse reuse distance sample that
only contains the reuse distances of a sparse random selection
of an application’s memory accesses, and therefore does not
contain enough information for the above observation to be
directly applied. Instead, StatStack uses the reuse distance
sample to estimate the application’s reuse distance distribution.
This distribution is then used to estimate the likelihood that
a memory access has a reuse distance greater than a given
length. Since the length of a reuse distance determines if its
outbound arc reaches beyond the “Out Boundary”, we can
use these likelihoods to estimate the stack distance of any
memory access. For example, to estimate the stack distance
of the second access to A in Figure 4, we sum the estimated
likelihoods that the memory accesses executed between the
two accesses to A have reuse distances such
that their corresponding arcs reach beyond “Out Boundary”.
StatStack uses this approach to estimate the stack distances
of all memory accesses in a reuse distance sample, effectively
estimating a stack distance distribution. StatStack uses this
distribution to estimate the miss ratio for any given cache size,
C, as the fraction of stack distances in the estimated stack
distance distribution that are greater than C.
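The core estimation step can be sketched as follows, under the assumption that the sampled reuse distance distribution has been turned into a complementary cumulative distribution; this is a reading of the description above, not StatStack's actual code.

```c
/*
 * StatStack-style estimate of the stack distance of an access with reuse
 * distance rd. Each of the rd accesses executed between the two accesses to
 * the line contributes the probability that its own reuse arc reaches beyond
 * the "Out Boundary"; for the access m positions before the boundary this is
 * the probability that its reuse distance is at least m - 1. ccdf_ge[d] is
 * assumed to hold P(reuse distance >= d), estimated from the sparse sample,
 * so the estimate sums these probabilities for d = 0..rd-1 (ccdf_ge[0] = 1).
 */
double expected_stack_distance(const double *ccdf_ge, unsigned long rd)
{
    double sd = 0.0;

    for (unsigned long d = 0; d < rd; d++)
        sd += ccdf_ge[d];

    return sd;
}
```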
V. IDENTIFYING NON-TEMPORAL ACCESSES
Using the stack distance profile of an application we can de-
termine which memory accesses do not benefit from caching.
We will refer to memory accessing instructions whose data
is never reused during its lifetime in the cache hierarchy as
non-temporal memory accesses.
If these non-temporal accesses can be identified, the com-
piler, a post processing pass, or a dynamic instrumentation
engine can alter the application to use non-temporal instruc-
tions in these locations without hurting performance.
The system we model implements a non-temporal hint that
causes a cache line to be installed in the L1, but never in any of

the higher cache levels. It turns out that modeling this system
is fairly complicated; we will therefore describe our algorithm
to find non-temporal accesses in three steps. Each step adds
more detail to the model and brings it closer to the hardware.
A fourth step is included to take effects from sampled stack
distances into account.
A. A first simplified approach
By looking at the forward stack distances of an instruction
we can easily determine if the next access to the data used
by that instruction will be a cache miss, i.e. the instruction is
non-temporal. An instruction has non-temporal behavior if all
forward stack distances, i.e. the number of unique cache lines
accessed between this instruction and the next access to the
same cache line, are larger or equal to the size of the cache. In
that case, we know that the next instruction to touch the same
data is very likely to be a cache miss. Therefore, we can use a
non-temporal instruction to bypass the entire cache hierarchy
for such accesses.
This approach has a major drawback. Most applications,
even purely streaming ones that do not reuse data, may
still exhibit short temporal reuse, e.g. spatial locality where
neighboring data items on the same cache line are accessed in
close succession. Since cache management is done at a cache
line granularity, this clearly restricts the number of possible
instructions that can be treated as non-temporal.
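As a sketch (with an illustrative representation of the per-instruction profile, not the paper's data structures), the first condition amounts to:

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * First, simplified condition (Section V-A): treat an instruction as
 * non-temporal only if every forward stack distance observed for it is at
 * least the cache size, measured in cache lines. fwd_sd holds the forward
 * stack distances recorded for one instruction.
 */
bool is_non_temporal_simple(const size_t *fwd_sd, size_t n, size_t cache_lines)
{
    for (size_t i = 0; i < n; i++)
        if (fwd_sd[i] < cache_lines)
            return false;       /* this reuse would have hit in the cache */
    return n > 0;
}
```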
B. Refining the simple approach
Most hardware implementations of cache management in-
structions allow the non-temporal data to live in parts of the
cache hierarchy, such as the L1, before it is evicted to memory.
We can exploit this to accommodate short temporal reuse of
cache lines. We assume that whenever a non-temporal memory
access touches a cache line, the cache line is installed in the
MRU-position of the LRU stack, and a special bit on the cache
line, the evict to memory (ETM) bit, is set. Whenever a normal
memory access touches a cache line, the ETM bit is cleared.
Cache lines with the ETM bit set are evicted earlier than other
lines, see Figure 5. Instead of waiting for the line to reach the
depth d_max, it is evicted when it reaches a shallower depth,
d_ETM. This allows us to model implementations that allow
non-temporal data to live in parts of the memory hierarchy.
For example, the memory controller in our AMD system evicts
ETM tagged cache lines from the L1 to main memory, and
would therefore be modeled with d_ETM being the size of the
L1 and d_max the total combined cache size.
The model with the ETM bit allows us to consider memory
accesses as non-temporal even if they have short reuses that
hit in the small ETM area. Instead of requiring that all forward
stack distances are larger than the cache size, we require
that there is at least one such access and that the number
of accesses that reuse data in the area of the LRU stack
outside the ETM area, the gray area in Figure 5, is small,
i.e. the number of misses introduced if the access is treated
as non-temporal is small. We thus require that one stack
distance is greater than or equal to d_max, and that the number
of stack distances that are greater than or equal to d_ETM but
smaller than d_max is smaller than some threshold, t_m. In most
implementations t_m will not be a single value for all accesses,
but will depend on factors such as how many additional cache hits
can be created by disabling caching for a memory access.
The hardware we want to model does not, unfortunately,
reset the ETM bit when a temporal access reuses ETM data.
This new situation can be thought of as sticky ETM bits, as
they are only reset on cache line eviction.
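In code, the refined condition can be sketched as follows (again with an illustrative per-instruction profile; d_etm, d_max and t_m are passed in as parameters):

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Refined condition (Section V-B): at least one forward stack distance must
 * reach d_max (the total cache size), and the number of reuses falling
 * between d_etm (the ETM eviction depth, e.g. the L1 size) and d_max, i.e.
 * the misses introduced by bypassing the shared levels, must stay below the
 * threshold t_m.
 */
bool is_non_temporal_etm(const size_t *fwd_sd, size_t n,
                         size_t d_etm, size_t d_max, size_t t_m)
{
    size_t beyond_dmax = 0, introduced_misses = 0;

    for (size_t i = 0; i < n; i++) {
        if (fwd_sd[i] >= d_max)
            beyond_dmax++;
        else if (fwd_sd[i] >= d_etm)
            introduced_misses++;   /* would have hit in L2/L3 */
    }
    return beyond_dmax >= 1 && introduced_misses < t_m;
}
```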
C. Handling sticky ETM bits
When the ETM bit is retained for the cache lines’ entire
lifetime in the cache, the conditions for a memory accessing
instruction to be non-temporal developed in section V-B are no
longer sufficient. If instruction X sets the ETM bit on a cache
line, then the ETM status applies to all subsequent reuses of
the cache line as well. To correctly model this, we need to
make sure that the non-temporal condition from section V-B
applies, not only to X, but also to all instructions that reuse
the cache lines accessed by X.
The sticky ETM bit is only a problem for non-temporal
accesses that have forward reuse distances less than d_ETM.
For example, consider a memory accessing instruction, Y, that
reuses the cache line previously accessed by a non-temporal
access X (here Y is a cache hit). When Y accesses the cache
line it is moved to the MRU position of the LRU stack, and the
sticky ETM bit is retained. Now, since Y would have resulted
in a cache hit no matter if X had set the sticky ETM bit or
not, this is the same as if we would have set the sticky ETM
bit for the cache line when it was accessed by Y.
Therefore, instead of applying the non-temporal condition
to a single instruction, we have to apply it to all instructions
reusing the cache line accessed by the first instruction.
In a machine, such as our AMD system, where d_ETM
corresponds to the L1 cache, this new condition allows us to
categorize a memory access as non-temporal if all the data it
touches is reused through the L1 cache or misses in the entire
cache hierarchy. Due to the stickiness of the non-temporal
status, this condition must also hold for any memory access
that reuses the same data through the L1 cache.
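A sketch of this group-wise check, reusing the per-instruction test from the previous sketch and assuming a hypothetical layout in which each instruction's profile records which other instructions reuse its cache lines through the L1:

```c
#include <stdbool.h>
#include <stddef.h>

/* Per-instruction check from the previous sketch. */
bool is_non_temporal_etm(const size_t *fwd_sd, size_t n,
                         size_t d_etm, size_t d_max, size_t t_m);

struct insn_profile {
    const size_t *fwd_sd;     /* sampled forward stack distances          */
    size_t        n_sd;
    const size_t *reusers;    /* instructions that reuse this insn's      */
    size_t        n_reusers;  /* cache lines within the ETM area (the L1) */
};

/*
 * Sticky-ETM condition (Section V-C): because the hardware never clears the
 * ETM bit on reuse, instruction x may only be flagged as non-temporal if the
 * condition also holds for every instruction that reuses the cache lines x
 * touches through the L1.
 */
bool is_non_temporal_sticky(const struct insn_profile *prof, size_t x,
                            size_t d_etm, size_t d_max, size_t t_m)
{
    if (!is_non_temporal_etm(prof[x].fwd_sd, prof[x].n_sd, d_etm, d_max, t_m))
        return false;

    for (size_t i = 0; i < prof[x].n_reusers; i++) {
        size_t y = prof[x].reusers[i];
        if (!is_non_temporal_etm(prof[y].fwd_sd, prof[y].n_sd,
                                 d_etm, d_max, t_m))
            return false;
    }
    return true;
}
```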
D. Handling sampled data
To avoid the overhead of measuring exact stack distances,
we use StatStack to calculate stack distances from sampled
reuse distances. Sampled stack distances can generally be used
in place of a full stack distance trace with only a small decrease
in average accuracy. However, there is always a risk of missing
some critical behavior. This could potentially lead to flagging
an access as non-temporal, even though the instruction in
fact has some temporal behavior in some cases, and thereby
introducing an unwanted cache miss.
In order to reduce the likelihood of introducing misses due
to sampling, we need to make sure that flagging an instruction
as non-temporal is always based on reliable data. We do this
by introducing a sample threshold, t_s, which is the smallest
number of samples that must originate from an instruction for
it to be considered non-temporal.
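Combined with the conditions above, the final decision can be sketched as a simple gate (illustrative only):

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Sampling guard (Section V-D): only flag an instruction as non-temporal if
 * at least t_s reuse distance samples originate from it, so the decision is
 * based on reliable data. model_says_non_temporal is the outcome of the
 * (sticky) ETM condition computed from those samples.
 */
bool flag_non_temporal(size_t n_samples, size_t t_s,
                       bool model_says_non_temporal)
{
    return n_samples >= t_s && model_says_non_temporal;
}
```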

Citations
Journal ArticleDOI
01 Sep 2014
TL;DR: This paper describes the use of dynamically generated code in FFTS, a discrete Fourier transform (DFT) library for x86 and ARM based devices, modified to dynamically exploit streaming store instructions on x86 machines resulting in speedups of over 10 % when the transform is sufficiently large, depending on the size of the cache.
Abstract: This paper describes the use of dynamically generated code in FFTS, a discrete Fourier transform (DFT) library for x86 and ARM based devices. FFTS has recently been shown to be faster than FFTW, Intel IPP and Apple vDSP, partly due to the use of program specialization and dynamic code generation. In this work, FFTS has been modified to dynamically exploit streaming store instructions on x86 machines resulting in speedups of over 10 % when the transform is sufficiently large, depending on the size of the cache. FFTS has also been modified to avoid dynamic code generation on some mobile platforms where dynamic code gen is prohibited, while making every effort to maximize performance, and the results of benchmarks on Apple A4, A6, Nvidia Tegra3 and Samsung Exynos4 based devices show that disabling dynamic code generation in FFTS decreases performance by as much as 25 %, depending on the device and the parameters of the transform.

5 citations

Journal ArticleDOI
TL;DR: NightWatch is a cache management subsystem that provides general, transparent and low-overhead pollution control to applications by online monitoring current cache locality to predict future behavior and restricting potential cache polluters proactively.
Abstract: Cache pollution, by which weak-locality data unduly replaces strong-locality data, may notably degrade application performance in a shared-cache multicore machine. This paper presents NightWatch, a cache management subsystem that provides general, transparent and low-overhead pollution control to applications. NightWatch is based on the observation that data within the same memory chunk or chunks within the same allocation context often share similar locality property. NightWatch embodies this observation by online monitoring current cache locality to predict future behavior and restricting potential cache polluters proactively. We have integrated NightWatch into two popular allocators, tcmalloc and ptmalloc2 . Experiments with SPEC CPU2006 show that NightWatch improves application performance by up to 45 percent (18 percent on average), with an average monitoring overhead of 0.57 percent (up to 3.02 percent).

5 citations



Proceedings ArticleDOI
Tao Huang, Qi Zhong, Xuetao Guan, Xiaoyin Wang, Xu Cheng, Keyi Wang
26 Mar 2012
TL;DR: This paper proposes a software controlled mechanism for last level cache partitioning at the region level in order to reduce intra-application lastlevel cache misses due to cache pollution, and enhances operating system support for mapping poor locality regions to a small slice in the last level caches.
Abstract: Performance degradation caused by cache pollution in the last level cache is extremely severe. In this paper, we propose a software controlled mechanism for last level cache partitioning at the region level in order to reduce intra-application last level cache misses due to cache pollution. A profiling feedback mechanism is used to analyze the inter-region cache interference. Guided by the profiling information, we enhance operating system support for mapping poor locality regions to a small slice in the last level cache in order to eliminate the harmful effect of non-reusable data. Our approach does not require any hardware support or new instructions, and is also application transparent. In comparison with the default linux, our approach, called Soft-RP, reduces LLC MPKI, the last level cache misses per 1000 instructions, up to 30.88%, and 19.31% on average; execution time measurement shows that Soft-RP can improve the performance up to 15.51%, and 8.14% on average.

3 citations

Dissertation
28 Nov 2016
TL;DR: This thesis analyzes the behavior of the prefetching in the CMPs and its impact to the interconnect, and proposes several dynamic management techniques to improve the performance of thePrefetching mechanism in the system.
Abstract: Recently, high performance processor designs have evolved toward Chip-Multiprocessor (CMP) architectures to deal with instruction level parallelism limitations and, more important, to manage the power consumption that is becoming unaffordable due to the increased transistor count and clock frequency. At the present moment, this architecture, which implements multiple processing cores on a single die, is commercially available with up to twenty four processors on a single chip and there are roadmaps and research trends that suggest that number of cores will increase in the near future. The increasing on number of cores has converted the interconnection network in a key issue that will have significant impact on performance. Moreover, as the number of cores increases, tiled architectures are foreseen to provide a scalable solution to handle design complexity. Network-on-Chip (NoC) emerges as a solution to deal with growing on-chip wire delays. On the other hand, CMP designs are likely to be equipped with latency hiding techniques like prefetching in order to reduce the negative impact on performance that, otherwise, high cache miss rates would lead to. Unfortunately, the extra number of network messages that prefetching entails can drastically increase power consumption and the latency in the NoC. In this thesis, we do not develop a new prefetching technique for CMPs but propose improvements applicable to any of them. Specifically, we analyze the behavior of the prefetching in the CMPs and its impact to the interconnect. We propose several dynamic management techniques to improve the performance of the prefetching mechanism in the system. Furthermore, we identify the main problems when implementing prefetching in distributed memory systems like tiled architectures and propose directions to solve them. Finally, we propose several research lines to continue the work done in this thesis.

3 citations


Cites methods from "Reducing Cache Pollution Through De..."

  • ...In [71] the authors used reuse-distance based cache modeling to insert non-temporal prefetch instructions to cache bypass the data that is not reused from the lower level caches....


Dissertation
25 Sep 2018
TL;DR: A nimble M:N user-level threading runtime is presented that supports high levels of concurrency without sacrificing application performance and two locality-aware work-stealing schedulers are proposed and evaluated.
Abstract: Concurrency is an essential part of many modern large-scale software systems. Applications must handle millions of simultaneous requests from millions of connected devices. Handling such a large number of concurrent requests requires runtime systems that efficiently manage concurrency and communication among tasks in an application across multiple cores. Existing low-level programming techniques provide scalable solutions with low overhead, but require non-linear control flow. Alternative approaches to concurrent programming, such as Erlang and Go, support linear control flow by mapping multiple user-level execution entities across multiple kernel threads (M:N threading). However, these systems provide comprehensive execution environments that make it difficult to assess the performance impact of user-level runtimes in isolation. This thesis presents a nimble M:N user-level threading runtime that closes this conceptual gap and provides a software infrastructure to precisely study the performance impact of user-level threading. Multiple design alternatives are presented and evaluated for scheduling, I/O multiplexing, and synchronization components of the runtime. The performance of the runtime is evaluated in comparison to event-driven software, systemlevel threading, and other user-level threading runtimes. An experimental evaluation is conducted using benchmark programs, as well as the popular Memcached application. The user-level runtime supports high levels of concurrency without sacrificing application performance. In addition, the user-level scheduling problem is studied in the context of an existing actor runtime that maps multiple actors to multiple kernel-level threads. In particular, two locality-aware work-stealing schedulers are proposed and evaluated. It is shown that locality-aware scheduling can significantly improve the performance of a class of applications with a high level of concurrency. In general, the performance and resource utilization of large-scale concurrent applications depends on the level of concurrency that can be expressed by the programming model. This fundamental effect is studied by refining and customizing existing concurrency models.

3 citations

References
Journal ArticleDOI
TL;DR: A new and efficient method of determining, in one pass of an address trace, performance measures for a large class of demand-paged, multilevel storage systems utilizing a variety of mapping schemes and replacement algorithms.
Abstract: The design of efficient storage hierarchies generally involves the repeated running of "typical" program address traces through a simulated storage system while various hierarchy design parameters are adjusted. This paper describes a new and efficient method of determining, in one pass of an address trace, performance measures for a large class of demand-paged, multilevel storage systems utilizing a variety of mapping schemes and replacement algorithms. The technique depends on an algorithm classification, called "stack algorithms," examples of which are "least frequently used," "least recently used," "optimal," and "random replacement" algorithms. The techniques yield the exact access frequency to each storage device, which can be used to estimate the overall performance of actual storage hierarchies.

1,329 citations


"Reducing Cache Pollution Through De..." refers background or methods in this paper

  • ...However, instead of using LRU stack distances, they use OPT stack distances, which requires expensive simulation....


  • ...A natural starting point for modeling LRU caches is the stack distance [11]....



  • ...Wong et al. [3] propose a method to identify non-temporal memory accesses based on Mattson's optimal replacement algorithm (OPT) [11]....


Proceedings ArticleDOI
09 Dec 2006
TL;DR: In this article, the authors propose a low-overhead, runtime mechanism that partitions a shared cache between multiple applications depending on the reduction in cache misses that each application is likely to obtain for a given amount of cache resources.
Abstract: This paper investigates the problem of partitioning a shared cache between multiple concurrently executing applications. The commonly used LRU policy implicitly partitions a shared cache on a demand basis, giving more cache resources to the application that has a high demand and fewer cache resources to the application that has a low demand. However, a higher demand for cache resources does not always correlate with a higher performance from additional cache resources. It is beneficial for performance to invest cache resources in the application that benefits more from the cache resources rather than in the application that has more demand for the cache resources. This paper proposes utility-based cache partitioning (UCP), a low-overhead, runtime mechanism that partitions a shared cache between multiple applications depending on the reduction in cache misses that each application is likely to obtain for a given amount of cache resources. The proposed mechanism monitors each application at runtime using a novel, cost-effective, hardware circuit that requires less than 2kB of storage. The information collected by the monitoring circuits is used by a partitioning algorithm to decide the amount of cache resources allocated to each application. Our evaluation, with 20 multiprogrammed workloads, shows that UCP improves performance of a dual-core system by up to 23% and on average 11% over LRU-based cache partitioning.

1,083 citations


"Reducing Cache Pollution Through De..." refers methods in this paper

  • ...To detect non-temporal data, they introduce a set of shadow tags [15] used to count the number of hits to a cache line that would have occurred if the thread was allocated all ways in the cache set....


Proceedings ArticleDOI
09 Jun 2007
TL;DR: A Dynamic Insertion Policy (DIP) is proposed to choose between BIP and the traditional LRU policy depending on which policy incurs fewer misses, and shows that DIP reduces the average MPKI of the baseline 1MB 16-way L2 cache by 21%, bridging two-thirds of the gap between LRU and OPT.
Abstract: The commonly used LRU replacement policy is susceptible to thrashing for memory-intensive workloads that have a working set greater than the available cache size. For such applications, the majority of lines traverse from the MRU position to the LRU position without receiving any cache hits, resulting in inefficient use of cache space. Cache performance can be improved if some fraction of the working set is retained in the cache so that at least that fraction of the working set can contribute to cache hits.We show that simple changes to the insertion policy can significantly reduce cache misses for memory-intensive workloads. We propose the LRU Insertion Policy (LIP) which places the incoming line in the LRU position instead of the MRU position. LIP protects the cache from thrashing and results in close to optimal hitrate for applications that have a cyclic reference pattern. We also propose the Bimodal Insertion Policy (BIP) as an enhancement of LIP that adapts to changes in the working set while maintaining the thrashing protection of LIP. We finally propose a Dynamic Insertion Policy (DIP) to choose between BIP and the traditional LRU policy depending on which policy incurs fewer misses. The proposed insertion policies do not require any change to the existing cache structure, are trivial to implement, and have a storage requirement of less than two bytes. We show that DIP reduces the average MPKI of the baseline 1MB 16-way L2 cache by 21%, bridging two-thirds of the gap between LRU and OPT.

722 citations


"Reducing Cache Pollution Through De..." refers background in this paper

  • ...Several researchers have proposed hardware improvements to the LRU replacement algorithm [5], [6], [7], [8]....


  • ...[5] propose an insertion policy (DIP) where on a cache miss to non-temporal data it is installed in the LRU position, instead of the MRU position of the LRU stack....


Proceedings ArticleDOI
19 Jun 2010
TL;DR: This paper proposes Static RRIP that is scan-resistant and Dynamic RRIP (DRRIP) that is both scan- resistant and thrash-resistant that require only 2-bits per cache block and easily integrate into existing LRU approximations found in modern processors.
Abstract: Practical cache replacement policies attempt to emulate optimal replacement by predicting the re-reference interval of a cache block. The commonly used LRU replacement policy always predicts a near-immediate re-reference interval on cache hits and misses. Applications that exhibit a distant re-reference interval perform badly under LRU. Such applications usually have a working-set larger than the cache or have frequent bursts of references to non-temporal data (called scans). To improve the performance of such workloads, this paper proposes cache replacement using Re-reference Interval Prediction (RRIP). We propose Static RRIP (SRRIP) that is scan-resistant and Dynamic RRIP (DRRIP) that is both scan-resistant and thrash-resistant. Both RRIP policies require only 2-bits per cache block and easily integrate into existing LRU approximations found in modern processors. Our evaluations using PC games, multimedia, server and SPEC CPU2006 workloads on a single-core processor with a 2MB last-level cache (LLC) show that both SRRIP and DRRIP outperform LRU replacement on the throughput metric by an average of 4% and 10% respectively. Our evaluations with over 1000 multi-programmed workloads on a 4-core CMP with an 8MB shared LLC show that SRRIP and DRRIP outperform LRU replacement on the throughput metric by an average of 7% and 9% respectively. We also show that RRIP outperforms LFU, the state-of the art scan-resistant replacement algorithm to-date. For the cache configurations under study, RRIP requires 2X less hardware than LRU and 2.5X less hardware than LFU.

715 citations


"Reducing Cache Pollution Through De..." refers background in this paper

  • ...Several researchers have proposed hardware improvements to the LRU replacement algorithm [5], [6], [7], [8]....


  • ...A recent extension [8] introduces an additional policy that installs cache lines in the MRU− 1 position....


Proceedings ArticleDOI
20 Jun 2009
TL;DR: This work proposes a new cache management approach that combines dynamic insertion and promotion policies to provide the benefits of cache partitioning, adaptive insertion, and capacity stealing all with a single mechanism.
Abstract: Many multi-core processors employ a large last-level cache (LLC) shared among the multiple cores. Past research has demonstrated that sharing-oblivious cache management policies (e.g., LRU) can lead to poor performance and fairness when the multiple cores compete for the limited LLC capacity. Different memory access patterns can cause cache contention in different ways, and various techniques have been proposed to target some of these behaviors. In this work, we propose a new cache management approach that combines dynamic insertion and promotion policies to provide the benefits of cache partitioning, adaptive insertion, and capacity stealing all with a single mechanism. By handling multiple types of memory behaviors, our proposed technique outperforms techniques that target only either capacity partitioning or adaptive insertion.

334 citations


"Reducing Cache Pollution Through De..." refers background or methods in this paper

  • ...Several hardware methods have been proposed [1], [3], [6], [14], that dynamically identify non-temporal data....


  • ...[14] propose a replacement policy, PIPP, to effectively way-partition a shared cache, that explicitly handles nontemporal (streaming) data....


  • ...Focus has recently started to shift towards shared caches [6], [14]....


Frequently Asked Questions (10)
Q1. What are the contributions in "Reducing cache pollution through detection and elimination of non-temporal memory accesses" ?

This paper introduces a classification of applications into four cache usage categories. The authors discuss how applications from different categories affect each other ’ s performance indirectly through cache sharing and devise a scheme to optimize such sharing. The authors also propose a low-overhead method to automatically find the best per-instruction cache management policy. The authors demonstrate how the indirect cache-sharing effects of mixed workloads can be tamed by automatically altering some instructions to better manage cache resources. 

Future work will explore other hardware mechanism for handling non-temporal data hints from software and possible applications in scheduling. 

The authors used the performance counters in the processor to measure the cycles and instruction counts using the perf framework provided by recent Linux kernels. 

Since the authors are using StatStack, they have made the implicit assumption that caches can be modeled as fully associative, i.e. conflict misses are insignificant. 

The speedup when running with applications from the two victim categories can largely be attributed to a reduction in the total bandwidth requirement of the mix. 

Managing the cache for these applications is likely to improve throughput, both when they are running in isolation and in a mix with other applications. 

Most hardware implementations of cache management instructions allow the non-temporal data to live in parts of the cache hierarchy, such as the L1, before it is evicted to memory. 

the stack distance distribution enables the application’s miss ratio to be computed for any given cache size, by simply computing the fraction of memory accesses with a stack distance greater than the desired cache size. 

Using a modified StatStack implementation the authors can reclassify applications based on their replacement ratios after applying cache management; this allows them to reason about how cache management impacts performance. 

By looking at the forward stack distances of an instruction the authors can easily determine if the next access to the data used by that instruction will be a cache miss, i.e. the instruction is non-temporal. 
