
Reducing Cache Pollution Through Detection and
Elimination of Non-Temporal Memory Accesses
Andreas Sandberg, David Eklöv and Erik Hagersten
Department of Information Technology
Uppsala University, Sweden
{andreas.sandberg, david.eklov, eh}@it.uu.se
Abstract—Contention for shared cache resources has been recognized as a major bottleneck for multicores—especially for mixed workloads of independent applications. While most modern processors implement instructions to manage caches, these instructions are largely unused due to a lack of understanding of how to best leverage them.
This paper introduces a classification of applications into four cache usage categories. We discuss how applications from different categories affect each other's performance indirectly through cache sharing and devise a scheme to optimize such sharing. We also propose a low-overhead method to automatically find the best per-instruction cache management policy.
We demonstrate how the indirect cache-sharing effects of mixed workloads can be tamed by automatically altering some instructions to better manage cache resources. Practical experiments demonstrate that our software-only method can improve application performance up to 35% on x86 multicore hardware.
I. INTRODUCTION
The introduction of multicore processors has significantly
changed the landscape for most applications. The literature
has mostly focused on parallel multithreaded applications.
However, multicores are often used to run several independent
applications. Such mixed workloads are common in a wide
range of systems, spanning from cell phones to HPC servers.
HPC clusters often run a large number of serial applications
in parallel across their physical cores. One example is parameter studies in science and engineering, where the same application is run with different input data sets.
When an application shares a multicore with other appli-
cations, new types of performance considerations are required
for good system throughput. Typically, the co-scheduled appli-
cations share resources with limited capacity and bandwidth,
such as a shared last-level cache (SLLC) and DRAM inter-
faces. An application overusing any of these resources can
degrade the performance of the other applications sharing the
same multicore chip.
Consider a simple example: Application A has an active
working set that barely fits in the SLLC, and application
B makes a copy of a data structure much larger than the
SLLC. When run together, B will use a large portion of
the SLLC and will force A to miss much more often than
when run in isolation. Fortunately, most libraries implementing
memory copying routines, e.g. memcpy, have been hand-
optimized and use special cache-bypass instructions, such as
non-temporal reads and writes. On most implementations,
these instructions will avoid allocation of resources in the
SLLC and subsequently will not force any replacements of
application A's working set in the cache.
In the example above the use of cache bypass instructions
may seem obvious, and hand-tuning a common routine, such
as memcpy, may motivate the use of special assembler instruc-
tions. However, many common programs also have memory
accesses that allocate data with little benefit in the SLLC
and may slow down co-scheduled applications. Detecting such
reckless use is beyond the capability of most application pro-
grammers, as is the use of assembly coding. Ideally, both the
detection and cache-bypassing should be done automatically
using existing hardware support.
Several software techniques for managing caches have been
proposed in the past [1], [2], [3], [4]. However, most of
these methods require an expensive simulation analysis. These
techniques assume the existence of heavily specialized instruc-
tions [1], [4], or extensions to the cache state and replacement
policy [3], none of which can be found in today’s processors.
Several researchers have proposed hardware improvements to
the LRU replacement algorithm [5], [6], [7], [8]. In general,
such algorithms tweak the LRU order by including additional
predictions about future re-references. Others have tried to
predict [9] and quantify [10] interference due to cache sharing.
In this paper, we propose an efficient and practical software-
only technique to automatically manage cache sharing to
improve the performance of mixed workloads of common
applications running on existing x86 hardware. Unlike pre-
viously proposed methods, our technique does not rely on
any hardware modifications and can be applied to existing
applications running on commodity hardware. This paper
makes the following contributions:
We propose a scheme to classify applications according
to their impact, and dependence, on the SLLC.
We propose an automatic and low-overhead method to
find instructions that use the SLLC recklessly and au-
tomatically introduce cache bypass instructions into the
binary.
We demonstrate how this technique can change the classi-
fication of many applications, making them better mixed
workload citizens.
We evaluate the performance gain of the applications and
show that their improved behavior is in agreement with the classification.

Figure 1. Miss ratio as a function of cache size for an application with streaming behavior and a typical non-streaming application that reuses most of its data. When run in isolation, each application has access to both the private cache and the entire SLLC. Running together causes the non-streaming application to receive a small fraction of the SLLC, while the streaming application receives a large fraction without decreasing its miss ratio. The change in perceived cache size and miss ratio is illustrated by the arrows.
II. MANAGING CACHES IN SOFTWARE
Application performance on multicores is highly dependent
on the activities of the other cores in the same chip due to
contention for shared resources. In most modern processors
there is no explicit hardware policy to manage these shared
resources. However, there are usually instructions to manage
these resources in software. By using these instructions prop-
erly, it is possible to increase the amount of data that is reused
through the cache hierarchy. However, this requires being able
to predict which applications, and which instructions, benefit
from caching and which do not.
In order to know which application would benefit from
using more of the shared cache resources, we need to know
the applications’ cache usage characteristics. The cache miss
ratio as a function of cache size, i.e. the number of cache
misses as a fraction of the total number of memory accesses
as a function of cache size, is a useful metric to determine
such characteristics. Figure 1 shows the miss ratio curves
for a typical streaming application and an application that
reuses its data. The miss ratio of the non-streaming application
decreases as the amount of available cache increases. This
occurs because more of the data set fits in the cache. Since
the streaming application does not reuse its data, the miss ratio
stays constant even when the cache size is increased.
When the applications are run in isolation, they will get
access to both the core-local private cache and the SLLC.
Assuming that the cache hierarchy is exclusive, the amount
of cache available to an application running in isolation is
the sum of the private cache and the SLLC. When two
applications share the cache, they will perceive the SLLC
as being smaller. In the case illustrated by Figure 1, the
streaming application misses much more frequently than the
non-streaming application. The frequent misses cause the
streaming application to install more data in the cache than
the non-streaming application. The non-streaming application
will therefore perceive the cache as being much smaller than
when run in isolation. The change in perceived cache size, and how this affects miss ratio is illustrated by the arrows in Figure 1.

Figure 2. A generalized miss ratio curve for an application. The minimum, i.e. only the private cache, and the maximum, i.e. the private cache and the full shared cache, amount of cache available to an application are shown on the x-axis. The miss ratio when running in isolation (r_s) is the smallest miss ratio that an application can achieve on this system, while the miss ratio when running only in the private cache (r_p) is the worst miss ratio. The δ represents how much an application is affected by competition for the shared cache.
Decreasing the perceived cache size for the streaming
application does not affect its miss ratio. The non-streaming
application, however, sees an increased miss ratio when access
to the SLLC is restricted. As the number of misses increases,
the bandwidth requirements also increase, which affects the
performance of all the applications sharing the same memory
interface. If we could make sure that the streaming application
does not install any of its streaming data into the cache, the
miss ratio, and bandwidth requirement, of the non-streaming
applications would decrease without sacrificing any perfor-
mance. In fact, the streaming application would run faster
since the total bandwidth requirement would be decreased.
Using the miss ratio curves we can classify applications
based on how they affect others and how they are affected by
competition for the shared cache. We base this classification on
the base miss ratio, r_s, when the application is run in isolation and has access to both its private cache and the entire SLLC, and the miss ratio, r_p, when it only has access to the private cache, see Figure 2. The r_p miss ratio can be thought of as the maximum miss ratio that an application can get due to cache contention, while r_s is the ideal case when the application is
run in isolation. To capture the sensitivity to cache contention
we define the cache sensitivity, δ, to be the difference between
the two miss ratios. A large δ indicates that an application
benefits from using the shared cache, while a small δ means
that the application exhibits streaming behavior and does not
benefit from additional cache resources.
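In the notation above, the definition simply reads:

    \delta = r_p - r_s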
Using the r_s and δ we can classify applications based on
how they use the cache. This classification allows us to predict
how applications will affect each other and how the system
will be affected by software cache management. We define
the following categories:
Don’t care
Small r_s and small δ—Applications that are largely unaffected by contention for the shared cache level. These applications fit their entire data set in the private cache; they are therefore largely unaffected by contention for the shared cache and memory bandwidth.
Victims
Small r_s and large δ—Applications that suffer badly
if the amount of cache at the shared level is restricted.
The data they manage to install in the shared resource
is almost always reused. Applications with a working
set larger than the private cache, but smaller than the
total cache size belong in this category.
Gobblers & Victims
Large r_s and large δ—Applications that suffer from
SLLC cache contention, but store large amounts of
data that is never reused in the shared cache. For
example, applications traversing a small and a large
data structure in parallel may reuse data in the cache
when accessing the small structure, while accesses
to the large data structure always miss. Disabling
caching for the accesses to the large data structure
would allow more of the smaller data structure to be
cached. Managing the cache for these applications
is likely to improve throughput, both when they
are running in isolation and in a mix with other
applications.
Cache Gobblers
Large r_s and small δ—Applications that do not benefit from the shared cache at all, but still install large
amounts of data in it. Applications in this category
work on streaming data or data structures that are
much larger than the cache. These applications are
good candidates for software cache management.
Since they do not reuse the data they install in
the shared cache, their throughput is generally not
improved when running in isolation. Managing these
applications will improve the full system throughput
by allowing applications from other categories to use
more of the shared cache.
Figure 3. Classification map of a subset of the SPEC2006 benchmarks running with the reference input set on a system with a 576 kB private cache and 6 MB shared cache. The quadrants signify different behaviors when running together with other applications. Applications to the left tend to reuse almost all of their data in the shared cache and generally work well with other applications; applications to the right tend to use large parts of the shared cache for data that is never reused and are generally troublesome in mixes with other applications. Applications in the upper half are sensitive to the amount of data that can be stored in the shared cache, while applications on the bottom are insensitive.

Figure 3 shows the classification of several SPEC2006
benchmarks according to these categories. Applications clas-
sified as wasting cache resources, i.e. applications on the
right-hand side of the map, are obvious targets for cache
management. The large base miss ratio in such applications
is due to memory accesses that touch data that is never reused
while it resides in the cache. Disabling caching for such
instructions does not introduce new misses since data is not
reused; instead, it will free up cache space for other accesses.
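To make the quadrant boundaries concrete, the sketch below shows one way this classification could be expressed in code. The thresholds are illustrative placeholders rather than values from the paper, which draws the boundaries empirically in a map like Figure 3, and the function and enum names are ours.

    /* Sketch: four-way classification from the base miss ratio r_s and the
     * cache sensitivity delta = r_p - r_s. Threshold values are illustrative
     * placeholders, not values taken from the paper. */
    enum cache_class { DONT_CARE, VICTIM, GOBBLER_AND_VICTIM, CACHE_GOBBLER };

    enum cache_class classify(double r_s,       /* miss ratio with private + full SLLC */
                              double r_p,       /* miss ratio with private cache only  */
                              double rs_thresh, /* boundary for a "large" r_s          */
                              double d_thresh)  /* boundary for a "large" delta        */
    {
        double delta = r_p - r_s;               /* cache sensitivity */
        int large_rs    = (r_s   >= rs_thresh);
        int large_delta = (delta >= d_thresh);

        if (!large_rs && !large_delta) return DONT_CARE;
        if (!large_rs &&  large_delta) return VICTIM;
        if ( large_rs &&  large_delta) return GOBBLER_AND_VICTIM;
        return CACHE_GOBBLER;                   /* large r_s, small delta */
    }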
III. CACHE MANAGEMENT INSTRUCTIONS
Most modern instruction sets include instructions to manage
caches. These instructions can typically be classified into three
different categories: non-temporal memory accesses, forced
cache eviction and non-temporal prefetches. Many processors
support at least one of these instruction classes. However,
their semantics may not always make them suitable for cache
management for performance.
Examples from the first category are the memory accesses
in the PA-RISC which can be annotated with caching hints,
e.g. only spatial locality or write only. Similar instruction
annotations exist for Itanium. Other instruction sets, such as
some of the SIMD extensions to the x86, contain completely
separate instructions for handling non-temporal data. The
hardware may, based on these hints, decide not to install write-
only cache lines in the cache and use write-combining buffers
instead. Non-temporal reads can be handled using separate
non-temporal buffers or by installing the accessed cache line
in such a way that it is the next line to be evicted from a set.
Instructions from the second category, forced cache eviction,
appear in some form in most architectures. However, not
all architectures expose such instructions to user space. Yet
other implementations may have undesired semantics that limit
their usefulness in code optimizations, e.g. the x86 Flush
Cache Line (CLFLUSH) instruction forces the targeted cache line to be flushed from all caches in the coherence domain. There are some architectures
that implement instructions in this class that are specifically
intended for code optimizations. For example, the Alpha ISA
specifies an instruction, Evict Data Cache Block (ECB), that
gives the memory system a hint that a specific cache line will
not be reused in the near future. A similar instruction, Write
Hint (WH64), tells the memory subsystem that an entire cache
line will be overwritten before being read again; this allows
the memory system to allocate the cache line without actually
reading its old contents. The ECB and WH64 instructions are
in many ways similar to the caching hints in the previous
category, but instead of annotating the load or store instruction,

the hints are given after or, in case of a store, before the
memory accesses in question.
The third category, non-temporal prefetches, is also included
in several different ISAs. The SPARC ISA has both read and
write prefetch variants for data that is not temporally reused.
Similar prefetch instructions are also available in both Itanium
and x86. Some implementations may choose to prefetch into
the cache such that the fetched line is the next to be evicted
from that set; others may prevent the data from propagating
from the L1 to a higher level in the cache hierarchy.
In the remainder of this paper, we will assume an archi-
tecture with a non-temporal hint that is implemented such
that non-temporal data is fetched into the L1 cache, but never
installed in higher levels. This is how the AMD system we
target implements support for non-temporal prefetches.
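For illustration only (the paper introduces such hints automatically at the binary level rather than in source code), a non-temporal prefetch hint of this kind can be expressed on x86 through the _mm_prefetch intrinsic; the prefetch distance below is an arbitrary example value, not one derived by the paper's method.

    /* Sketch: annotating a streaming read loop with a non-temporal prefetch
     * hint (PREFETCHNTA). On hardware of the kind described above, the hinted
     * data is fetched into the L1 but kept out of the higher cache levels.
     * The distance of 64 elements ahead is an arbitrary illustrative choice. */
    #include <stddef.h>
    #include <xmmintrin.h>          /* _mm_prefetch, _MM_HINT_NTA */

    double sum_stream(const double *src, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + 64 < n)         /* hint: src[i + 64] has no temporal reuse */
                _mm_prefetch((const char *)&src[i + 64], _MM_HINT_NTA);
            sum += src[i];          /* each element is touched only once */
        }
        return sum;
    }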
IV. LOW-OVERHEAD CACHE MODELING
A natural starting point for modeling LRU caches is the
stack distance [11]. A stack distance is the number of unique
cache lines accessed between two successive memory accesses
to the same cache line. It can be directly used to determine
if a memory access results in a cache hit or a cache miss
for a fully-associative LRU cache: if the stack distance is less
than the cache size, the access will be a hit, otherwise it will
miss. Therefore, the stack distance distribution enables the
application’s miss ratio to be computed for any given cache
size, by simply computing the fraction of memory accesses
with stack distances greater than the desired cache size.
In this work, we need to differentiate between what we call
backward and forward stack distance. Let A and B be two
successive memory accesses to the same cache line. Suppose
that there are S unique cache lines accessed by the memory
accesses executed between A and B. Here, we say that A has a
forward stack distance of S, and that B has a backward stack
distance of S.
Measuring stack distances is generally very expensive. In
this paper, we use StatStack [12] to estimate stack distances
and miss ratios. StatStack is a statistical cache model that
models fully associative caches with LRU replacement. Mod-
eling fully associative LRU caches is, for most applications, a
good approximation of the set associative pseudo LRU caches
implemented in hardware. StatStack estimates an application’s
stack distances using only a sparse sample of the application’s
reuse distances, i.e. the number of memory accesses performed
between two accesses to the same cache line. This approach
to modeling caches has been shown to be several orders of
magnitude faster than full cache simulation, and almost as
accurate. The runtime profile of an application can be collected
with an overhead of only 40% [13], and the execution time of
the cache model is only a few seconds [12].
To understand how StatStack works, consider the access
sequence shown in Figure 4. Here the arcs connect subsequent
accesses to the same cache line, and represent the reuse of data.
In this example, the second memory access to cache line A has
a reuse distance of five, since there are five memory accesses executed between the two accesses to A, and a backward stack distance of three, since there are three unique cache lines (B, C and D) accessed between the two accesses to A. Furthermore, we see that there are three arcs that cross the vertical line labeled “Out Boundary”, which is the same as the stack distance of the second access to A. This observation holds true in general. Based on it we can compute the stack distance of any memory access, given that we know the reuse distances of all memory accesses performed between it and the previous access to the same cache line.

Figure 4. Reuse distance in a memory access stream. The arcs connect successive memory accesses to the same cache line and represent the reuse of cache lines. The stack distance of the second memory access to A is equal to the number of arcs that cross “Out Boundary”.
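The arc-counting observation translates directly into code. The sketch below computes an exact backward stack distance from a full trace; it is the expensive computation that StatStack replaces with a statistical estimate, and the names are ours.

    /* Sketch: exact backward stack distance of the access at index `pos` in a
     * trace of cache-line IDs, using the arc-counting observation above.
     * Deliberately exhaustive and slow; StatStack avoids this cost by sampling. */
    #include <stddef.h>

    /* Index of the next access to the same line as trace[i], or n if none. */
    static size_t next_reuse(const unsigned *trace, size_t n, size_t i)
    {
        for (size_t j = i + 1; j < n; j++)
            if (trace[j] == trace[i])
                return j;
        return n;
    }

    size_t stack_distance(const unsigned *trace, size_t n, size_t pos)
    {
        /* Find the previous access to the same cache line. */
        size_t prev = pos;
        while (prev > 0 && trace[prev - 1] != trace[pos])
            prev--;
        if (prev == 0)
            return (size_t)-1;               /* cold access: no previous use */
        prev--;

        /* Count the arcs that start between the two accesses and reach the
         * current access or beyond; this equals the number of unique cache
         * lines touched in between, i.e. the backward stack distance. */
        size_t crossing = 0;
        for (size_t j = prev + 1; j < pos; j++)
            if (next_reuse(trace, n, j) >= pos)
                crossing++;
        return crossing;
    }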
The input to StatStack is a sparse reuse distance sample that
only contains the reuse distances of a sparse random selection
of an application’s memory accesses, and therefore does not
contain enough information for the above observation to be
directly applied. Instead, StatStack uses the reuse distance
sample to estimate the application’s reuse distance distribution.
This distribution is then used to estimate the likelihood that
a memory access has a reuse distance greater than a given
length. Since the length of a reuse distance determines if its
outbound arc reaches beyond the “Out Boundary”, we can
use these likelihoods to estimate the stack distance of any
memory access. For example, to estimate the stack distance
of the second access to A in Figure 4, we sum the estimated
likelihoods that the memory accesses executed between the two accesses to A have reuse distances such
that their corresponding arcs reach beyond “Out Boundary”.
StatStack uses this approach to estimate the stack distances
of all memory accesses in a reuse distance sample, effectively
estimating a stack distance distribution. StatStack uses this
distribution to estimate the miss ratio for any given cache size,
C, as the fraction of stack distances in the estimated stack
distance distribution that are greater than C.
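Assuming the estimated stack distances have been collected into a simple histogram (a representation we choose for illustration; the paper does not prescribe one), the final step above amounts to:

    /* Sketch: miss ratio for a cache of C lines from a stack distance
     * histogram. hist[d] counts sampled accesses with stack distance d;
     * accesses with distance >= max_sd (including cold misses) are assumed
     * to have been folded into the top bucket. */
    #include <stddef.h>

    double miss_ratio(const size_t *hist, size_t max_sd, size_t C)
    {
        size_t total = 0, misses = 0;
        for (size_t d = 0; d < max_sd; d++) {
            total += hist[d];
            if (d >= C)              /* would not fit in a C-line cache */
                misses += hist[d];
        }
        return total ? (double)misses / (double)total : 0.0;
    }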
V. IDENTIFYING NON-TEMPORAL ACCESSES
Using the stack distance profile of an application we can de-
termine which memory accesses do not benefit from caching.
We will refer to memory accessing instructions whose data
is never reused during its lifetime in the cache hierarchy as
non-temporal memory accesses.
If these non-temporal accesses can be identified, the com-
piler, a post processing pass, or a dynamic instrumentation
engine can alter the application to use non-temporal instruc-
tions in these locations without hurting performance.
The system we model implements a non-temporal hint that
causes a cache line to be installed in the L1, but never in any of

the higher cache levels. It turns out that modeling this system
is fairly complicated; we will therefore describe our algorithm
to find non-temporal accesses in three steps. Each step adds
more detail to the model and brings it closer to the hardware.
A fourth step is included to take effects from sampled stack
distances into account.
A. A first simplified approach
By looking at the forward stack distances of an instruction
we can easily determine if the next access to the data used
by that instruction will be a cache miss, i.e. the instruction is
non-temporal. An instruction has non-temporal behavior if all
forward stack distances, i.e. the number of unique cache lines
accessed between this instruction and the next access to the
same cache line, are larger than or equal to the size of the cache. In
that case, we know that the next instruction to touch the same
data is very likely to be a cache miss. Therefore, we can use a
non-temporal instruction to bypass the entire cache hierarchy
for such accesses.
This approach has a major drawback. Most applications,
even purely streaming ones that do not reuse data, may
still exhibit short temporal reuse, e.g. spatial locality where
neighboring data items on the same cache line are accessed in
close succession. Since cache management is done at a cache
line granularity, this clearly restricts the number of possible
instructions that can be treated as non-temporal.
B. Refining the simple approach
Most hardware implementations of cache management in-
structions allow the non-temporal data to live in parts of the
cache hierarchy, such as the L1, before it is evicted to memory.
We can exploit this to accommodate short temporal reuse of
cache lines. We assume that whenever a non-temporal memory
access touches a cache line, the cache line is installed in the
MRU-position of the LRU stack, and a special bit on the cache
line, the evict to memory (ETM) bit, is set. Whenever a normal
memory access touches a cache line, the ETM bit is cleared.
Cache lines with the ETM bit set are evicted earlier than other
lines, see Figure 5. Instead of waiting for the line to reach the
depth d_max, it is evicted when it reaches a shallower depth, d_ETM. This allows us to model implementations that allow non-temporal data to live in parts of the memory hierarchy. For example, the memory controller in our AMD system evicts ETM tagged cache lines from the L1 to main memory, and would therefore be modeled with d_ETM being the size of the L1 and d_max the total combined cache size.
The model with the ETM bit allows us to consider memory
accesses as non-temporal even if they have short reuses that
hit in the small ETM area. Instead of requiring that all forward
stack distances are larger than the cache size, we require
that there is at least one such access and that the number
of accesses that reuse data in the area of the LRU stack
outside the ETM area, the gray area in Figure 5, is small,
i.e. the number of misses introduced if the access is treated
as non-temporal is small. We thus require that one stack
distance is greater than or equal to d_max, and that the number of stack distances that are larger than or equal to d_ETM but smaller than d_max is smaller than some threshold, t_m. In most implementations t_m will not be a single value for all accesses, but depend on factors such as how many additional cache hits can be created by disabling caching for a memory access.
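A compact sketch of this per-instruction test, assuming per-instruction samples of forward stack distances are available (the structure and names below are ours, not the paper's implementation; the sticky-ETM behavior discussed next and the sample threshold of Section V-D are not modeled here):

    /* Sketch: the condition above. An instruction is a candidate for a
     * non-temporal hint if at least one forward stack distance reaches beyond
     * the total cache size (d_max) and fewer than t_m of its reuses fall
     * between d_etm (e.g. the L1) and d_max, i.e. few reuses would be turned
     * into misses by disabling caching. */
    #include <stddef.h>

    struct instr_profile {
        const size_t *fwd_stack_dist;   /* sampled forward stack distances */
        size_t        nsamples;
    };

    int is_non_temporal(const struct instr_profile *p,
                        size_t d_etm, size_t d_max, size_t t_m)
    {
        size_t beyond_cache = 0;        /* reuses that miss anyway         */
        size_t in_gray_area = 0;        /* reuses that would become misses */

        for (size_t i = 0; i < p->nsamples; i++) {
            size_t sd = p->fwd_stack_dist[i];
            if (sd >= d_max)
                beyond_cache++;
            else if (sd >= d_etm)
                in_gray_area++;
            /* sd < d_etm: the reuse still hits in the ETM area (e.g. the L1) */
        }
        return beyond_cache >= 1 && in_gray_area < t_m;
    }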
The hardware we want to model does not, unfortunately,
reset the ETM bit when a temporal access reuses ETM data.
This new situation can be thought of as sticky ETM bits, as
they are only reset on cache line eviction.
C. Handling sticky ETM bits
When the ETM bit is retained for a cache line's entire
lifetime in the cache, the conditions for a memory accessing
instruction to be non-temporal developed in section V-B are no
longer sufficient. If instruction X sets the ETM bit on a cache
line, then the ETM status applies to all subsequent reuses of
the cache line as well. To correctly model this, we need to
make sure that the non-temporal condition from section V-B
applies, not only to X, but also to all instructions that reuse
the cache lines accessed by X.
The sticky ETM bit is only a problem for non-temporal
accesses that have forward reuse distances less than d_ETM.
For example, consider a memory accessing instruction, Y, that
reuses the cache line previously accessed by a non-temporal
access X (here Y is a cache hit). When Y accesses the cache
line it is moved to the MRU position of the LRU stack, and the
sticky ETM bit is retained. Now, since Y would have resulted in a cache hit regardless of whether X had set the sticky ETM bit, this is the same as if we had set the sticky ETM
bit for the cache line when it was accessed by Y.
Therefore, instead of applying the non-temporal condition
to a single instruction, we have to apply it to all instructions
reusing the cache line accessed by the first instruction.
In a machine, such as our AMD system, where d_ETM corresponds to the L1 cache, this new condition allows us to
categorize a memory access as non-temporal if all the data it
touches is reused through the L1 cache or misses in the entire
cache hierarchy. Due to the stickiness of the non-temporal
status, this condition must also hold for any memory access
that reuses the same data through the L1 cache.
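Continuing the sketch from the previous subsection (and reusing its is_non_temporal and instr_profile), the sticky-ETM condition can be approximated by also requiring that every instruction reusing the flagged instruction's cache lines through the L1 passes the same test. The reuse lists below are assumed to be built from the sampled reuse pairs and are our own construction.

    /* Sketch: one level of the sticky-ETM check. A full implementation would
     * apply this transitively (to a fixed point), since the reusing
     * instructions' own reusers inherit the ETM status as well. */
    int is_non_temporal_sticky(size_t instr,
                               const struct instr_profile *profiles,
                               const size_t *const *reused_by, /* per-instruction reuser lists */
                               const size_t *n_reused_by,
                               size_t d_etm, size_t d_max, size_t t_m)
    {
        if (!is_non_temporal(&profiles[instr], d_etm, d_max, t_m))
            return 0;
        for (size_t i = 0; i < n_reused_by[instr]; i++) {
            size_t other = reused_by[instr][i];
            if (!is_non_temporal(&profiles[other], d_etm, d_max, t_m))
                return 0;           /* a reuser would inherit the sticky ETM bit */
        }
        return 1;
    }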
D. Handling sampled data
To avoid the overhead of measuring exact stack distances,
we use StatStack to calculate stack distances from sampled
reuse distances. Sampled stack distances can generally be used
in place of a full stack distance trace with only a small decrease
in average accuracy. However, there is always a risk of missing
some critical behavior. This could potentially lead to flagging
an access as non-temporal, even though the instruction in
fact has some temporal behavior in some cases, and thereby
introducing an unwanted cache miss.
In order to reduce the likelihood of introducing misses due
to sampling, we need to make sure that flagging an instruction
as non-temporal is always based on reliable data. We do this
by introducing a sample threshold, t_s, which is the smallest number of samples that must originate from an instruction before it can be considered non-temporal.
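Folding this threshold into the earlier sketch is then a simple guard (again with our own naming, not the paper's code):

    /* Sketch: the sampling safeguard. Too few samples means the evidence is
     * unreliable, so the instruction is conservatively left cacheable. */
    int is_non_temporal_sampled(const struct instr_profile *p,
                                size_t d_etm, size_t d_max,
                                size_t t_m, size_t t_s)
    {
        if (p->nsamples < t_s)
            return 0;
        return is_non_temporal(p, d_etm, d_max, t_m);
    }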

References

Evaluation techniques for storage hierarchies
Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches
Adaptive insertion policies for high performance caching
High performance cache replacement using re-reference interval prediction (RRIP)
PIPP: promotion/insertion pseudo-partitioning of multi-core shared caches