
Reducing Cache Pollution Through Detection and
Elimination of Non-Temporal Memory Accesses
Andreas Sandberg, David Eklöv and Erik Hagersten
Department of Information Technology
Uppsala University, Sweden
{andreas.sandberg, david.eklov, eh}@it.uu.se
Abstract—Contention for shared cache resources has been recognized as a major bottleneck for multicores—especially for mixed workloads of independent applications. While most modern processors implement instructions to manage caches, these instructions are largely unused due to a lack of understanding of how to best leverage them.
This paper introduces a classification of applications into four cache usage categories. We discuss how applications from different categories affect each other's performance indirectly through cache sharing and devise a scheme to optimize such sharing. We also propose a low-overhead method to automatically find the best per-instruction cache management policy.
We demonstrate how the indirect cache-sharing effects of mixed workloads can be tamed by automatically altering some instructions to better manage cache resources. Practical experiments demonstrate that our software-only method can improve application performance up to 35% on x86 multicore hardware.
I. INTRODUCTION
The introduction of multicore processors has significantly
changed the landscape for most applications. The literature
has mostly focused on parallel multithreaded applications.
However, multicores are often used to run several independent
applications. Such mixed workloads are common in a wide
range of systems, spanning from cell phones to HPC servers.
HPC clusters often run a large number of serial applications
in parallel across their physical cores. One example is parameter studies in science and engineering, where the same application is run with different input data sets.
When an application shares a multicore with other appli-
cations, new types of performance considerations are required
for good system throughput. Typically, the co-scheduled appli-
cations share resources with limited capacity and bandwidth,
such as a shared last-level cache (SLLC) and DRAM inter-
faces. An application overusing any of these resources can
degrade the performance of the other applications sharing the
same multicore chip.
Consider a simple example: Application A has an active
working set that barely fits in the SLLC, and application
B makes a copy of a data structure much larger than the
SLLC. When run together, B will use a large portion of
the SLLC and will force A to miss much more often than
when run in isolation. Fortunately, most libraries implementing
memory copying routines, e.g. memcpy, have been hand-
optimized and use special cache-bypass instructions, such as
non-temporal reads and writes. On most implementations,
these instructions will avoid allocation of resources in the
SLLC and subsequently will not force any replacements of
application A's working set in the cache.
In the example above the use of cache bypass instructions
may seem obvious, and hand-tuning a common routine, such
as memcpy, may motivate the use of special assembler instruc-
tions. However, many common programs also have memory
accesses that allocate data with little benefit in the SLLC
and may slow down co-scheduled applications. Detecting such
reckless use is beyond the capability of most application pro-
grammers, as is the use of assembly coding. Ideally, both the
detection and cache-bypassing should be done automatically
using existing hardware support.
Several software techniques for managing caches have been
proposed in the past [1], [2], [3], [4]. However, most of
these methods require an expensive simulation analysis. These
techniques assume the existence of heavily specialized instruc-
tions [1], [4], or extensions to the cache state and replacement
policy [3], none of which can be found in today’s processors.
Several researchers have proposed hardware improvements to
the LRU replacement algorithm [5], [6], [7], [8]. In general,
such algorithms tweak the LRU order by including additional
predictions about future re-references. Others have tried to
predict [9] and quantify [10] interference due to cache sharing.
In this paper, we propose an efficient and practical software-
only technique to automatically manage cache sharing to
improve the performance of mixed workloads of common
applications running on existing x86 hardware. Unlike pre-
viously proposed methods, our technique does not rely on
any hardware modifications and can be applied to existing
applications running on commodity hardware. This paper
makes the following contributions:
We propose a scheme to classify applications according
to their impact, and dependence, on the SLLC.
We propose an automatic and low-overhead method to
find instructions that use the SLLC recklessly and au-
tomatically introduce cache bypass instructions into the
binary.
We demonstrate how this technique can change the classi-
fication of many applications, making them better mixed
workload citizens.
We evaluate the performance gain of the applications and
show that their improved behavior is in agreement with the classification.

Figure 1. Miss ratio as a function of cache size for an application with streaming behavior and a typical non-streaming application that reuses most of its data. When run in isolation, each application has access to both the private cache and the entire SLLC. Running together causes the non-streaming application to receive a small fraction of the SLLC, while the streaming application receives a large fraction without decreasing its miss ratio. The change in perceived cache size and miss ratio is illustrated by the arrows.
II. MANAGING CACHES IN SOFTWARE
Application performance on multicores is highly dependent
on the activities of the other cores in the same chip due to
contention for shared resources. In most modern processors
there is no explicit hardware policy to manage these shared
resources. However, there are usually instructions to manage
these resources in software. By using these instructions prop-
erly, it is possible to increase the amount of data that is reused
through the cache hierarchy. However, this requires being able
to predict which applications, and which instructions, benefit
from caching and which do not.
In order to know which application would benefit from
using more of the shared cache resources, we need to know
the applications’ cache usage characteristics. The cache miss
ratio as a function of cache size, i.e. the number of cache
misses as a fraction of the total number of memory accesses
as a function of cache size, is a useful metric to determine
such characteristics. Figure 1 shows the miss ratio curves
for a typical streaming application and an application that
reuses its data. The miss ratio of the non-streaming application
decreases as the amount of available cache increases. This
occurs because more of the data set fits in the cache. Since
the streaming application does not reuse its data, the miss ratio
stays constant even when the cache size is increased.
When the applications are run in isolation, they will get
access to both the core-local private cache and the SLLC.
Assuming that the cache hierarchy is exclusive, the amount
of cache available to an application running in isolation is
the sum of the private cache and the SLLC. When two
applications share the cache, they will perceive the SLLC
as being smaller. In the case illustrated by Figure 1, the
streaming application misses much more frequently than the
non-streaming application. The frequent misses cause the
streaming application to install more data in the cache than
the non-streaming application. The non-streaming application
will therefore perceive the cache as being much smaller than
when run in isolation. The change in perceived cache size, and how this affects miss ratio is illustrated by the arrows in Figure 1.

Figure 2. A generalized miss ratio curve for an application. The minimum, i.e. only the private cache, and the maximum, i.e. the private cache and the full shared cache, amount of cache available to an application are shown on the x-axis. The miss ratio when running in isolation (r_s) is the smallest miss ratio that an application can achieve on this system, while the miss ratio when running only in the private cache (r_p) is the worst miss ratio. The δ represents how much an application is affected by competition for the shared cache.
Decreasing the perceived cache size for the streaming
application does not affect its miss ratio. The non-streaming
application, however, sees an increased miss ratio when access
to the SLLC is restricted. As the number of misses increases,
the bandwidth requirements also increase, which affects the
performance of all the applications sharing the same memory
interface. If we could make sure that the streaming application
does not install any of its streaming data into the cache, the
miss ratio, and bandwidth requirement, of the non-streaming
applications would decrease without sacrificing any perfor-
mance. In fact, the streaming application would run faster
since the total bandwidth requirement would be decreased.
Using the miss ratio curves we can classify applications
based on how they affect others and how they are affected by
competition for the shared cache. We base this classification on
the base miss ratio, r_s, when the application is run in isolation and has access to both its private cache and the entire SLLC, and the miss ratio, r_p, when it only has access to the private cache, see Figure 2. The r_p miss ratio can be thought of as the maximum miss ratio that an application can get due to cache contention, while r_s is the ideal case when the application is
run in isolation. To capture the sensitivity to cache contention
we define the cache sensitivity, δ, to be the difference between
the two miss ratios. A large δ indicates that an application
benefits from using the shared cache, while a small δ means
that the application exhibits streaming behavior and does not
benefit from additional cache resources.
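In the notation above, the definition simply reads:

    \delta = r_p - r_s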
Using the r_s and δ we can classify applications based on
how they use the cache. This classification allows us to predict
how applications will affect each other and how the system
will be affected by software cache management. We define
the following categories:
Don’t care
Small r_s and small δ—Applications that are largely unaffected by contention for the shared cache level. These applications fit their entire data set in the private cache; they are therefore largely unaffected by contention for the shared cache and memory bandwidth.
Victims
Small r_s and large δ—Applications that suffer badly
if the amount of cache at the shared level is restricted.
The data they manage to install in the shared resource
is almost always reused. Applications with a working
set larger than the private cache, but smaller than the
total cache size belong in this category.
Gobblers & Victims
Large r_s and large δ—Applications that suffer from
SLLC cache contention, but store large amounts of
data that is never reused in the shared cache. For
example, applications traversing a small and a large
data structure in parallel may reuse data in the cache
when accessing the small structure, while accesses
to the large data structure always miss. Disabling
caching for the accesses to the large data structure
would allow more of the smaller data structure to be
cached. Managing the cache for these applications
is likely to improve throughput, both when they
are running in isolation and in a mix with other
applications.
Cache Gobblers
Large r_s and small δ—Applications that do not benefit from the shared cache at all, but still install large
amounts of data in it. Applications in this category
work on streaming data or data structures that are
much larger than the cache. These applications are
good candidates for software cache management.
Since they do not reuse the data they install in
the shared cache, their throughput is generally not
improved when running in isolation. Managing these
applications will improve the full system throughput
by allowing applications from other categories to use
more of the shared cache.
Figure 3. Classification map of a subset of the SPEC2006 benchmarks running with the reference input set on a system with a 576 kB private cache and 6 MB shared cache. The quadrants signify different behaviors when running together with other applications. Applications to the left tend to reuse almost all of their data in the shared cache and generally work well with other applications; applications to the right tend to use large parts of the shared cache for data that is never reused and are generally troublesome in mixes with other applications. Applications in the upper half are sensitive to the amount of data that can be stored in the shared cache, while applications on the bottom are insensitive.

Figure 3 shows the classification of several SPEC2006
benchmarks according to these categories. Applications clas-
sified as wasting cache resources, i.e. applications on the
right-hand side of the map, are obvious targets for cache
management. The large base miss ratio in such applications
is due to memory accesses that touch data that is never reused
while it resides in the cache. Disabling caching for such
instructions does not introduce new misses since data is not
reused; instead, it will free up cache space for other accesses.
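To make the quadrant boundaries concrete, the sketch below shows one way this classification could be expressed in code. The thresholds are illustrative placeholders rather than values from the paper, which draws the boundaries empirically in a map like Figure 3, and the function and enum names are ours.

    /* Sketch: four-way classification from the base miss ratio r_s and the
     * cache sensitivity delta = r_p - r_s. Threshold values are illustrative
     * placeholders, not values taken from the paper. */
    enum cache_class { DONT_CARE, VICTIM, GOBBLER_AND_VICTIM, CACHE_GOBBLER };

    enum cache_class classify(double r_s,       /* miss ratio with private + full SLLC */
                              double r_p,       /* miss ratio with private cache only  */
                              double rs_thresh, /* boundary for a "large" r_s          */
                              double d_thresh)  /* boundary for a "large" delta        */
    {
        double delta = r_p - r_s;               /* cache sensitivity */
        int large_rs    = (r_s   >= rs_thresh);
        int large_delta = (delta >= d_thresh);

        if (!large_rs && !large_delta) return DONT_CARE;
        if (!large_rs &&  large_delta) return VICTIM;
        if ( large_rs &&  large_delta) return GOBBLER_AND_VICTIM;
        return CACHE_GOBBLER;                   /* large r_s, small delta */
    }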
III. CACHE MANAGEMENT INSTRUCTIONS
Most modern instruction sets include instructions to manage
caches. These instructions can typically be classified into three
different categories: non-temporal memory accesses, forced
cache eviction and non-temporal prefetches. Many processors
support at least one of these instruction classes. However,
their semantics may not always make them suitable for cache
management for performance.
Examples from the first category are the memory accesses
in the PA-RISC which can be annotated with caching hints,
e.g. only spatial locality or write only. Similar instruction
annotations exist for Itanium. Other instruction sets, such as
some of the SIMD extensions to the x86, contain completely
separate instructions for handling non-temporal data. The
hardware may, based on these hints, decide not to install write-
only cache lines in the cache and use write-combining buffers
instead. Non-temporal reads can be handled using separate
non-temporal buffers or by installing the accessed cache line
in such a way that it is the next line to be evicted from a set.
Instructions from the second category, forced cache eviction,
appear in some form in most architectures. However, not
all architectures expose such instructions to user space. Yet
other implementations may have undesired semantics that limit
their usefulness in code optimizations, e.g. the x86 Flush
Cache Line (CLFLUSH) instruction forces the targeted cache line to be flushed from all caches in the coherence domain. There are some architectures
that implement instructions in this class that are specifically
intended for code optimizations. For example, the Alpha ISA
specifies an instruction, Evict Data Cache Block (ECB), that
gives the memory system a hint that a specific cache line will
not be reused in the near future. A similar instruction, Write
Hint (WH64), tells the memory subsystem that an entire cache
line will be overwritten before being read again; this allows
the memory system to allocate the cache line without actually
reading its old contents. The ECB and WH64 instructions are
in many ways similar to the caching hints in the previous
category, but instead of annotating the load or store instruction,

the hints are given after or, in case of a store, before the
memory accesses in question.
The third category, non-temporal prefetches, is also included
in several different ISAs. The SPARC ISA has both read and
write prefetch variants for data that is not temporally reused.
Similar prefetch instructions are also available in both Itanium
and x86. Some implementations may choose to prefetch into
the cache such that the fetched line is the next to be evicted
from that set; others may prevent the data from propagating
from the L1 to a higher level in the cache hierarchy.
In the remainder of this paper, we will assume an archi-
tecture with a non-temporal hint that is implemented such
that non-temporal data is fetched into the L1 cache, but never
installed in higher levels. This is how the AMD system we
target implements support for non-temporal prefetches.
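For illustration only (the paper introduces such hints automatically at the binary level rather than in source code), a non-temporal prefetch hint of this kind can be expressed on x86 through the _mm_prefetch intrinsic; the prefetch distance below is an arbitrary example value, not one derived by the paper's method.

    /* Sketch: annotating a streaming read loop with a non-temporal prefetch
     * hint (PREFETCHNTA). On hardware of the kind described above, the hinted
     * data is fetched into the L1 but kept out of the higher cache levels.
     * The distance of 64 elements ahead is an arbitrary illustrative choice. */
    #include <stddef.h>
    #include <xmmintrin.h>          /* _mm_prefetch, _MM_HINT_NTA */

    double sum_stream(const double *src, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + 64 < n)         /* hint: src[i + 64] has no temporal reuse */
                _mm_prefetch((const char *)&src[i + 64], _MM_HINT_NTA);
            sum += src[i];          /* each element is touched only once */
        }
        return sum;
    }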
IV. LOW-OVERHEAD CACHE MODELING
A natural starting point for modeling LRU caches is the
stack distance [11]. A stack distance is the number of unique
cache lines accessed between two successive memory accesses
to the same cache line. It can be directly used to determine
if a memory access results in a cache hit or a cache miss
for a fully-associative LRU cache: if the stack distance is less
than the cache size, the access will be a hit, otherwise it will
miss. Therefore, the stack distance distribution enables the
application’s miss ratio to be computed for any given cache
size, by simply computing the fraction of memory accesses
with stack distances greater than the desired cache size.
In this work, we need to differentiate between what we call
backward and forward stack distance. Let A and B be two
successive memory accesses to the same cache line. Suppose
that there are S unique cache lines accessed by the memory
accesses executed between A and B. Here, we say that A has a
forward stack distance of S, and that B has a backward stack
distance of S.
Measuring stack distances is generally very expensive. In
this paper, we use StatStack [12] to estimate stack distances
and miss ratios. StatStack is a statistical cache model that
models fully associative caches with LRU replacement. Mod-
eling fully associative LRU caches is, for most applications, a
good approximation of the set associative pseudo LRU caches
implemented in hardware. StatStack estimates an application’s
stack distances using only a sparse sample of the application’s
reuse distances, i.e. the number of memory accesses performed
between two accesses to the same cache line. This approach
to modeling caches has been shown to be several orders of
magnitude faster than full cache simulation, and almost as
accurate. The runtime profile of an application can be collected
with an overhead of only 40% [13], and the execution time of
the cache model is only a few seconds [12].
To understand how StatStack works, consider the access
sequence shown in Figure 4. Here the arcs connect subsequent
accesses to the same cache line, and represent the reuse of data.
In this example, the second memory access to cache line A has
a reuse distance of five, since there are five memory accesses executed between the two accesses to A, and a backward stack distance of three, since there are three unique cache lines (B, C and D) accessed between the two accesses to A. Furthermore, we see that there are three arcs that cross the vertical line labeled “Out Boundary”, which is the same as the stack distance of the second access to A. This observation holds true in general. Based on it we can compute the stack distance of any memory access, given that we know the reuse distances of all memory accesses performed between it and the previous access to the same cache line.

Figure 4. Reuse distance in a memory access stream. The arcs connect successive memory accesses to the same cache line and represent the reuse of cache lines. The stack distance of the second memory access to A is equal to the number of arcs that cross “Out Boundary”.
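The arc-counting observation translates directly into code. The sketch below computes an exact backward stack distance from a full trace; it is the expensive computation that StatStack replaces with a statistical estimate, and the names are ours.

    /* Sketch: exact backward stack distance of the access at index `pos` in a
     * trace of cache-line IDs, using the arc-counting observation above.
     * Deliberately exhaustive and slow; StatStack avoids this cost by sampling. */
    #include <stddef.h>

    /* Index of the next access to the same line as trace[i], or n if none. */
    static size_t next_reuse(const unsigned *trace, size_t n, size_t i)
    {
        for (size_t j = i + 1; j < n; j++)
            if (trace[j] == trace[i])
                return j;
        return n;
    }

    size_t stack_distance(const unsigned *trace, size_t n, size_t pos)
    {
        /* Find the previous access to the same cache line. */
        size_t prev = pos;
        while (prev > 0 && trace[prev - 1] != trace[pos])
            prev--;
        if (prev == 0)
            return (size_t)-1;               /* cold access: no previous use */
        prev--;

        /* Count the arcs that start between the two accesses and reach the
         * current access or beyond; this equals the number of unique cache
         * lines touched in between, i.e. the backward stack distance. */
        size_t crossing = 0;
        for (size_t j = prev + 1; j < pos; j++)
            if (next_reuse(trace, n, j) >= pos)
                crossing++;
        return crossing;
    }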
The input to StatStack is a sparse reuse distance sample that
only contains the reuse distances of a sparse random selection
of an application’s memory accesses, and therefore does not
contain enough information for the above observation to be
directly applied. Instead, StatStack uses the reuse distance
sample to estimate the application’s reuse distance distribution.
This distribution is then used to estimate the likelihood that
a memory access has a reuse distance greater than a given
length. Since the length of a reuse distance determines if its
outbound arc reaches beyond the “Out Boundary”, we can
use these likelihoods to estimate the stack distance of any
memory access. For example, to estimate the stack distance
of the second access to A in Figure 4, we sum the estimated
likelihoods that the memory accesses executed between the two accesses to A have reuse distances such
that their corresponding arcs reach beyond “Out Boundary”.
StatStack uses this approach to estimate the stack distances
of all memory accesses in a reuse distance sample, effectively
estimating a stack distance distribution. StatStack uses this
distribution to estimate the miss ratio for any given cache size,
C, as the fraction of stack distances in the estimated stack
distance distribution that are greater than C.
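Assuming the estimated stack distances have been collected into a simple histogram (a representation we choose for illustration; the paper does not prescribe one), the final step above amounts to:

    /* Sketch: miss ratio for a cache of C lines from a stack distance
     * histogram. hist[d] counts sampled accesses with stack distance d;
     * accesses with distance >= max_sd (including cold misses) are assumed
     * to have been folded into the top bucket. */
    #include <stddef.h>

    double miss_ratio(const size_t *hist, size_t max_sd, size_t C)
    {
        size_t total = 0, misses = 0;
        for (size_t d = 0; d < max_sd; d++) {
            total += hist[d];
            if (d >= C)              /* would not fit in a C-line cache */
                misses += hist[d];
        }
        return total ? (double)misses / (double)total : 0.0;
    }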
V. IDENTIFYING NON-TEMPORAL ACCESSES
Using the stack distance profile of an application we can de-
termine which memory accesses do not benefit from caching.
We will refer to memory accessing instructions whose data
is never reused during its lifetime in the cache hierarchy as
non-temporal memory accesses.
If these non-temporal accesses can be identified, the com-
piler, a post processing pass, or a dynamic instrumentation
engine can alter the application to use non-temporal instruc-
tions in these locations without hurting performance.
The system we model implements a non-temporal hint that
causes a cache line to be installed in the L1, but never in any of

the higher cache levels. It turns out that modeling this system
is fairly complicated; we will therefore describe our algorithm
to find non-temporal accesses in three steps. Each step adds
more detail to the model and brings it closer to the hardware.
A fourth step is included to take effects from sampled stack
distances into account.
A. A first simplified approach
By looking at the forward stack distances of an instruction
we can easily determine if the next access to the data used
by that instruction will be a cache miss, i.e. the instruction is
non-temporal. An instruction has non-temporal behavior if all
forward stack distances, i.e. the number of unique cache lines
accessed between this instruction and the next access to the
same cache line, are larger than or equal to the size of the cache. In
that case, we know that the next instruction to touch the same
data is very likely to be a cache miss. Therefore, we can use a
non-temporal instruction to bypass the entire cache hierarchy
for such accesses.
This approach has a major drawback. Most applications,
even purely streaming ones that do not reuse data, may
still exhibit short temporal reuse, e.g. spatial locality where
neighboring data items on the same cache line are accessed in
close succession. Since cache management is done at a cache
line granularity, this clearly restricts the number of possible
instructions that can be treated as non-temporal.
B. Refining the simple approach
Most hardware implementations of cache management in-
structions allow the non-temporal data to live in parts of the
cache hierarchy, such as the L1, before it is evicted to memory.
We can exploit this to accommodate short temporal reuse of
cache lines. We assume that whenever a non-temporal memory
access touches a cache line, the cache line is installed in the
MRU-position of the LRU stack, and a special bit on the cache
line, the evict to memory (ETM) bit, is set. Whenever a normal
memory access touches a cache line, the ETM bit is cleared.
Cache lines with the ETM bit set are evicted earlier than other
lines, see Figure 5. Instead of waiting for the line to reach the
depth d_max, it is evicted when it reaches a shallower depth, d_ETM. This allows us to model implementations that allow non-temporal data to live in parts of the memory hierarchy. For example, the memory controller in our AMD system evicts ETM tagged cache lines from the L1 to main memory, and would therefore be modeled with d_ETM being the size of the L1 and d_max the total combined cache size.
The model with the ETM bit allows us to consider memory
accesses as non-temporal even if they have short reuses that
hit in the small ETM area. Instead of requiring that all forward
stack distances are larger than the cache size, we require
that there is at least one such access and that the number
of accesses that reuse data in the area of the LRU stack
outside the ETM area, the gray area in Figure 5, is small,
i.e. the number of misses introduced if the access is treated
as non-temporal is small. We thus require that one stack
distance is greater than or equal to d_max, and that the number of stack distances that are larger than or equal to d_ETM but smaller than d_max is smaller than some threshold, t_m. In most implementations t_m will not be a single value for all accesses, but depend on factors such as how many additional cache hits can be created by disabling caching for a memory access.
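A compact sketch of this per-instruction test, assuming per-instruction samples of forward stack distances are available (the structure and names below are ours, not the paper's implementation; the sticky-ETM behavior discussed next and the sample threshold of Section V-D are not modeled here):

    /* Sketch: the condition above. An instruction is a candidate for a
     * non-temporal hint if at least one forward stack distance reaches beyond
     * the total cache size (d_max) and fewer than t_m of its reuses fall
     * between d_etm (e.g. the L1) and d_max, i.e. few reuses would be turned
     * into misses by disabling caching. */
    #include <stddef.h>

    struct instr_profile {
        const size_t *fwd_stack_dist;   /* sampled forward stack distances */
        size_t        nsamples;
    };

    int is_non_temporal(const struct instr_profile *p,
                        size_t d_etm, size_t d_max, size_t t_m)
    {
        size_t beyond_cache = 0;        /* reuses that miss anyway         */
        size_t in_gray_area = 0;        /* reuses that would become misses */

        for (size_t i = 0; i < p->nsamples; i++) {
            size_t sd = p->fwd_stack_dist[i];
            if (sd >= d_max)
                beyond_cache++;
            else if (sd >= d_etm)
                in_gray_area++;
            /* sd < d_etm: the reuse still hits in the ETM area (e.g. the L1) */
        }
        return beyond_cache >= 1 && in_gray_area < t_m;
    }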
The hardware we want to model does not, unfortunately,
reset the ETM bit when a temporal access reuses ETM data.
This new situation can be thought of as sticky ETM bits, as
they are only reset on cache line eviction.
C. Handling sticky ETM bits
When the ETM bit is retained for a cache line's entire
lifetime in the cache, the conditions for a memory accessing
instruction to be non-temporal developed in section V-B are no
longer sufficient. If instruction X sets the ETM bit on a cache
line, then the ETM status applies to all subsequent reuses of
the cache line as well. To correctly model this, we need to
make sure that the non-temporal condition from section V-B
applies, not only to X, but also to all instructions that reuse
the cache lines accessed by X.
The sticky ETM bit is only a problem for non-temporal
accesses that have forward reuse distances less than d_ETM.
For example, consider a memory accessing instruction, Y, that
reuses the cache line previously accessed by a non-temporal
access X (here Y is a cache hit). When Y accesses the cache
line it is moved to the MRU position of the LRU stack, and the
sticky ETM bit is retained. Now, since Y would have resulted in a cache hit regardless of whether X had set the sticky ETM bit, this is the same as if we had set the sticky ETM
bit for the cache line when it was accessed by Y.
Therefore, instead of applying the non-temporal condition
to a single instruction, we have to apply it to all instructions
reusing the cache line accessed by the first instruction.
In a machine, such as our AMD system, where d_ETM corresponds to the L1 cache, this new condition allows us to
categorize a memory access as non-temporal if all the data it
touches is reused through the L1 cache or misses in the entire
cache hierarchy. Due to the stickiness of the non-temporal
status, this condition must also hold for any memory access
that reuses the same data through the L1 cache.
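Continuing the sketch from the previous subsection (and reusing its is_non_temporal and instr_profile), the sticky-ETM condition can be approximated by also requiring that every instruction reusing the flagged instruction's cache lines through the L1 passes the same test. The reuse lists below are assumed to be built from the sampled reuse pairs and are our own construction.

    /* Sketch: one level of the sticky-ETM check. A full implementation would
     * apply this transitively (to a fixed point), since the reusing
     * instructions' own reusers inherit the ETM status as well. */
    int is_non_temporal_sticky(size_t instr,
                               const struct instr_profile *profiles,
                               const size_t *const *reused_by, /* per-instruction reuser lists */
                               const size_t *n_reused_by,
                               size_t d_etm, size_t d_max, size_t t_m)
    {
        if (!is_non_temporal(&profiles[instr], d_etm, d_max, t_m))
            return 0;
        for (size_t i = 0; i < n_reused_by[instr]; i++) {
            size_t other = reused_by[instr][i];
            if (!is_non_temporal(&profiles[other], d_etm, d_max, t_m))
                return 0;           /* a reuser would inherit the sticky ETM bit */
        }
        return 1;
    }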
D. Handling sampled data
To avoid the overhead of measuring exact stack distances,
we use StatStack to calculate stack distances from sampled
reuse distances. Sampled stack distances can generally be used
in place of a full stack distance trace with only a small decrease
in average accuracy. However, there is always a risk of missing
some critical behavior. This could potentially lead to flagging
an access as non-temporal, even though the instruction in
fact has some temporal behavior in some cases, and thereby
introducing an unwanted cache miss.
In order to reduce the likelihood of introducing misses due
to sampling, we need to make sure that flagging an instruction
as non-temporal is always based on reliable data. We do this
by introducing a sample threshold, t_s, which is the smallest number of samples that must originate from an instruction before it can be considered non-temporal.
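Folding this threshold into the earlier sketch is then a simple guard (again with our own naming, not the paper's code):

    /* Sketch: the sampling safeguard. Too few samples means the evidence is
     * unreliable, so the instruction is conservatively left cacheable. */
    int is_non_temporal_sampled(const struct instr_profile *p,
                                size_t d_etm, size_t d_max,
                                size_t t_m, size_t t_s)
    {
        if (p->nsamples < t_s)
            return 0;
        return is_non_temporal(p, d_etm, d_max, t_m);
    }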

References

Evaluation techniques for storage hierarchies
Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches
Adaptive insertion policies for high performance caching
High performance cache replacement using re-reference interval prediction (RRIP)
PIPP: promotion/insertion pseudo-partitioning of multi-core shared caches