Managing Wire Delay in Large Chip-Multiprocessor Caches
Bradford M. Beckmann and David A. Wood
Computer Sciences Department
University of Wisconsin—Madison
{beckmann, david}@cs.wisc.edu

This work was supported by the National Science Foundation
(CDA-9623632, EIA-9971256, EIA-0205286, and CCR-0324878), a
Wisconsin Romnes Fellowship (Wood), and donations from Intel
Corp. and Sun Microsystems, Inc. Dr. Wood has a significant
financial interest in Sun Microsystems, Inc.
Abstract
In response to increasing (relative) wire delay, archi-
tects have proposed various technologies to manage the
impact of slow wires on large uniprocessor L2 caches.
Block migration (e.g., D-NUCA [27] and NuRapid [12])
reduces average hit latency by migrating frequently used
blocks towards the lower-latency banks. Transmission Line
Caches (TLC) [6] use on-chip transmission lines to provide
low latency to all banks. Traditional stride-based hardware
prefetching strives to tolerate, rather than reduce, latency.
Chip multiprocessors (CMPs) present additional chal-
lenges. First, CMPs often share the on-chip L2 cache,
requiring multiple ports to provide sufficient bandwidth.
Second, multiple threads mean multiple working sets, which
compete for limited on-chip storage. Third, sharing code
and data interferes with block migration, since one proces-
sor’s low-latency bank is another processor’s high-latency
bank.
In this paper, we develop L2 cache designs for CMPs
that incorporate these three latency management tech-
niques. We use detailed full-system simulation to analyze
the performance trade-offs for both commercial and scien-
tific workloads. First, we demonstrate that block migration
is less effective for CMPs because 40-60% of L2 cache hits
in commercial workloads are satisfied in the central banks,
which are equally far from all processors. Second, we
observe that although transmission lines provide low
latency, contention for their restricted bandwidth limits
their performance. Third, we show stride-based prefetching
between L1 and L2 caches alone improves performance by
at least as much as the other two techniques. Finally, we
present a hybrid design, combining all three techniques, that
improves performance by an additional 2% to 19%
over prefetching alone.
1 Introduction
Many factors—both technological and marketing—are
driving the semiconductor industry to implement multiple
processors per chip. Small-scale chip multiprocessors
(CMPs), with two processors per chip, are already commer-
cially available [24, 30, 44]. Larger-scale CMPs seem likely
to follow as transistor densities increase [5, 18, 45, 28]. Due
to the benefits of sharing, current and future CMPs are
likely to have a shared, unified L2 cache [25, 37].
Wire delay plays an increasingly significant role in
cache design. Design partitioning, along with the integra-
tion of more metal layers, allows wire dimensions to
decrease more slowly than transistor dimensions, thus keeping
wire delay controllable for short distances [20, 42]. For
instance, as technology improves, designers split caches into
multiple banks, controlling the wire delay within a bank.
However, wire delay between banks is a growing perfor-
mance bottleneck. For example, transmitting data 1 cm
requires only 2-3 cycles in current (2004) technology, but
will necessitate over 12 cycles in 2010 technology assum-
ing a cycle time of 12 fanout-of-three delays [16]. Thus, L2
caches are likely to have hit latencies in the tens of cycles.
Increasing wire delay makes it difficult to provide uni-
form access latencies to all L2 cache banks. One alternative
is Non-Uniform Cache Architecture (NUCA) designs [27],
which allow nearer cache banks to have lower access laten-
cies than further banks. However, supporting multiple pro-
cessors (e.g., 8) places additional demands on NUCA cache
designs. First, simple geometry dictates that eight regular-
shaped processors must be physically distributed across the
2-dimensional die. A cache bank that is physically close to
one processor cannot be physically close to all the others.
Second, an 8-way CMP requires eight times the sustained
cache bandwidth. These two factors strongly suggest a
physically distributed, multi-port NUCA cache design.
This paper examines three techniques—previously
evaluated only for uniprocessors—for managing L2 cache
latency in an eight-processor CMP. First, we consider using
hardware-directed stride-based prefetching [9, 13, 23] to
tolerate the variable latency in a NUCA cache design.
While current systems perform hardware-directed strided
prefetching [19, 21, 43], its effectiveness is workload
dependent [10, 22, 46, 49]. Second, we consider cache
block migration [12, 27], a recently proposed technique for
NUCA caches that moves frequently accessed blocks to
cache banks closer to the requesting processor. While block
migration works well for uniprocessors, adapting it to
CMPs poses two problems. One, blocks shared by multiple
processors are pulled in multiple directions and tend to con-
gregate in banks that are equally far from all processors.
Two, due to the extra freedom of movement, the effective-
ness of block migration in a shared CMP cache is more
dependent on “smart searches” [27] than its uniprocessor
counterpart, yet smart searches are harder to implement in a
CMP environment. Finally, we consider using on-chip trans-
mission lines [8] to provide fast access to all cache banks [6].
On-chip transmission lines use thick global wires to reduce
communication latency by an order of magnitude versus
long conventional wires. Transmission Line Caches (TLCs)
provide fast, nearly uniform, access latencies. However, the
limited bandwidth of transmission lines—due to their large
dimensions—may lead to a performance bottleneck in
CMPs.
This paper evaluates these three techniques—against a
baseline NUCA design with L2 miss prefetching—using
detailed full-system simulation and both commercial and sci-
entific workloads. We make the following contributions:
• Block migration is less effective for CMPs than previous
results have shown for uniprocessors. Even with a perfect
search mechanism, block migration alone only improves
performance by an average of 3%. This is in part because
shared blocks migrate to the middle, equally-distant cache
banks, accounting for 40-60% of L2 hits for the commercial
workloads.
• Transmission line caches in CMPs exhibit performance
improvements comparable to previously published uni-
processor results [6]—8% on average. However, conten-
tion for their limited bandwidth accounts for 26% of L2
hit latency.
• Hardware-directed strided prefetching hides L2 hit
latency about as well as block migration and transmis-
sion lines reduce it. However, prefetching is largely
orthogonal, permitting hybrid techniques.
• A hybrid implementation—combining block migration,
transmission lines, and on-chip prefetching—provides
the best performance. The hybrid design improves per-
formance by an additional 2% to 19% over the baseline.
• Finally, prefetching and block migration improve net-
work efficiency for some scientific workloads, while
transmission lines potentially improve efficiency across
all workloads.
2 Managing CMP Cache Latency
This section describes the baseline CMP design for this
study and how we adapt the three latency management tech-
niques to this framework.
2.1 Baseline CMP Design
We target eight-processor CMP chip designs assuming
the 45 nm technology generation projected in 2010 [16].
Table 1 specifies the system parameters for all designs. Each
CMP design assumes approximately 300 mm² of available
die area [16]. We estimate eight 4-wide superscalar proces-
sors would occupy 120 mm² [29] and 16 MB of L2 cache
storage would occupy 64 mm² [16]. The on-chip intercon-
nection network and other miscellaneous structures occupy
the remaining area.
As illustrated in Figure 1, the baseline design—denoted
CMP-SNUCA—assumes a Non-Uniform Cache Architec-
ture (NUCA) L2 cache, derived from Kim et al.'s S-NUCA-2
design [27]. Similar to the original proposal, CMP-
SNUCA statically partitions the address space across cache
banks, which are connected via a 2D mesh interconnection
network. CMP-SNUCA differs from the uniprocessor design
in several important ways. First, it places eight processors
around the perimeter of the L2 cache, effectively creating
eight distributed access locations rather than a single central-
ized location. Second, the 16 MB L2 storage array is parti-
tioned into 256 banks to control bank access latency [1] and
to provide sufficient bandwidth to support up to 128 simulta-
neous on-chip processor requests. Third, CMP-SNUCA con-
nects four banks to each switch and expands the link width to
32 bytes. The wider CMP-SNUCA network provides the
additional bandwidth needed by an 8-processor CMP, but
requires longer latencies as compared to the originally pro-
posed uniprocessor network. Fourth, shared CMP caches are
subject to contention from different processors’ working sets
[32], motivating 16-way set-associative banks with a pseudo-
LRU replacement policy [40]. Finally, we assume an ideal-
ized off-chip communication controller to provide consistent
off-chip latency for all processors.
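
As a concrete illustration of this static partitioning, the sketch
below maps a physical address to one of the 256 banks using the block
size, bank count, and associativity from Table 1; the particular bit
fields and their ordering are our assumption rather than the
implemented indexing function.

    #include <cstdint>

    // Illustrative CMP-SNUCA address mapping (assumed field ordering,
    // not the paper's exact indexing function).  16 MB of storage in
    // 256 banks of 64 KB each; with 64-byte blocks and 16 ways, each
    // bank holds 64 sets.
    constexpr uint64_t kBlockBits = 6;   // 64-byte blocks
    constexpr uint64_t kBankBits  = 8;   // 256 statically addressed banks
    constexpr uint64_t kSetBits   = 6;   // 64 sets per 16-way, 64 KB bank

    struct L2Index {
      uint32_t bank;   // which of the 256 banks (fixed for an address)
      uint32_t set;    // set within that bank
      uint64_t tag;    // remaining address bits
    };

    inline L2Index MapAddress(uint64_t paddr) {
      const uint64_t block = paddr >> kBlockBits;
      L2Index idx;
      idx.bank = static_cast<uint32_t>(block & ((1u << kBankBits) - 1));
      idx.set  = static_cast<uint32_t>((block >> kBankBits) &
                                       ((1u << kSetBits) - 1));
      idx.tag  = block >> (kBankBits + kSetBits);
      return idx;
    }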
2.2 Strided Prefetching
Strided or stride-based prefetchers utilize repeatable
memory access patterns to tolerate cache miss latency [11,
23, 38]. Though the L1 cache filters many memory requests,
L1 and L2 misses often show repetitive access patterns. Most
current prefetchers utilize miss patterns to predict cache
misses before they happen [19, 21, 43]. Specifically, current
hardware prefetchers observe the stride between two recent
cache misses, then verify the stride using subsequent misses.
Once the prefetcher reaches a threshold of fixed strided
misses, it launches a series of fill requests to reduce or elimi-
nate additional miss latency.
We base our prefetching strategy on the IBM Power 4
implementation [43] with some slight modifications. We
evaluate both L2 prefetching (i.e., between the L2 cache and
memory) and L1 prefetching (i.e., between the L1 and L2
caches). Both the L1 and L2 prefetchers contain three sepa-
rate 32-entry filter tables: positive unit stride, negative unit
stride, and non-unit stride. Similar to Power 4, once a filter
table entry recognizes 4 fixed-stride misses, the prefetcher
allocates the miss stream into its 8-entry stream table. Upon
allocation, the L1I and L1D prefetchers launch 6 consecutive
prefetches along the stream to compensate for the L1 to L2
latency, while the L2 prefetcher launches 25 prefetches.
Each prefetcher issues prefetches for both loads and stores
because, unlike the Power 4, our simulated machine uses an
L1 write-allocate protocol supporting sequential consistency.
Also we model separate L2 prefetchers per processor, rather
than a single shared prefetcher. We found that with a shared
prefetcher, interference between the different processors’
miss streams significantly disrupts the prefetching accuracy
and coverage
1
.
2.3 Block Migration
Block migration reduces global wire delay from L2 hit
latency by moving frequently accessed cache blocks closer
to the requesting processor. Migrating data to reduce latency
has been extensively studied in multiple-chip multiproces-
sors [7, 15, 17, 36, 41]. Kim et al. recently applied data
migration to reduce latency inside future aggressively-
banked uniprocessor caches [27]. Their Dynamic NUCA (D-
NUCA) design used a 2-dimensional mesh to interconnect 2-
way set-associative banks, and dynamically migrated fre-
quently accessed blocks to the closest banks. NuRapid used
centralized tags and a level of indirection to decouple data
placement from set indexing, thus reducing conflicts in the
nearest banks [12]. Both D-NUCA and NuRapid assumed a
single processor chip accessing the L2 cache network from a
single location.
For CMPs, we examine a block migration scheme as an
extension to our baseline CMP-SNUCA design. Similar to
the uniprocessor D-NUCA design [27], CMP-DNUCA per-
mits block migration by logically separating the L2 cache
banks into 16 unique banksets, where an address maps to a
bankset and can reside within any one bank of the bankset.
CMP-DNUCA physically separates the cache banks into 16
different bankclusters, shown as the shaded “Tetris” pieces
in Figure 1. Each bankcluster contains one bank from every
bankset, similar to the uniprocessor “fair mapping” policy
[27]. The bankclusters are grouped into three distinct
regions. The 8 banksets closest to each processor form the
local regions, shown by the 8 lightly shaded bankclusters in
Figure 1. The 4 bankclusters that reside in the center of the
shared cache form the center region, shown by the 4 darkest
shaded bankclusters in Figure 1. The remaining 4 bankclus-
ters form the inter, or intermediate, region. Ideally, block
migration would maximize L2 hits within each processor's
local bankcluster, where the uncontended L2 hit latency (i.e.,
load-to-use latency) varies between 13 and 17 cycles, and limit
the hits to another processor's local bankcluster, where the
uncontended latency can be as high as 65 cycles.
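
The organization can be summarized by the sketch below: the bankset
an address maps to never changes, while migration only changes which
of the 16 bankclusters currently holds the block. The cluster
numbering, the region assignment, and the single migration step shown
are illustrative assumptions; the actual migration policy is
evaluated later in the paper.

    #include <cstdint>

    // Sketch of CMP-DNUCA placement state: 16 banksets (static per
    // address) by 16 bankclusters (changed only by migration).
    enum class Region { kLocal, kInter, kCenter };

    constexpr int kBanksets = 16;

    inline int Bankset(uint64_t paddr) {             // static, never changes
      return static_cast<int>((paddr >> 6) & (kBanksets - 1));
    }

    // Assumed numbering: clusters 0-7 are the eight processors' local
    // clusters, 8-11 the intermediate clusters, 12-15 the center ones.
    inline Region RegionOf(int cluster) {
      if (cluster < 8)  return Region::kLocal;
      if (cluster < 12) return Region::kInter;
      return Region::kCenter;
    }

    // One migration step toward the requesting CPU's local cluster
    // (center -> intermediate -> local); blocks already in some
    // processor's local cluster are left in place in this sketch.
    inline int MigrateToward(int cpu, int cluster) {
      switch (RegionOf(cluster)) {
        case Region::kCenter: return 8 + (cpu % 4);  // an intermediate cluster
        case Region::kInter:  return cpu;            // the CPU's local cluster
        case Region::kLocal:  return cluster;
      }
      return cluster;
    }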
To reduce the latency of detecting a cache miss, the uni-
processor D-NUCA design utilized a “smart search” [27]
mechanism using a partial tag array. The centrally-located
partial tag structure [26] replicated the low-order bits of each
bank’s cache tags. If a request missed in the partial tag struc-
ture, the block was guaranteed not to be in the cache. This
smart search mechanism allowed nearly all cache misses to
be detected without searching the entire bankset.
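
A minimal sketch of such a filter, assuming the 6-bit partial tags
discussed below: a lookup that matches no partial tag guarantees the
block is absent, while a match only indicates that the full bankset
must still be searched.

    #include <cstdint>
    #include <vector>

    // Partial tag "smart search" filter: only the low-order 6 bits of
    // each cached tag are replicated.  No match => guaranteed miss.
    class PartialTagFilter {
     public:
      PartialTagFilter(size_t sets, size_t ways)
          : partial_(sets, std::vector<uint8_t>(ways, 0)),
            valid_(sets, std::vector<bool>(ways, false)) {}

      void Install(size_t set, size_t way, uint64_t tag) {
        partial_[set][way] = static_cast<uint8_t>(tag & 0x3F);  // low 6 bits
        valid_[set][way] = true;
      }

      void Evict(size_t set, size_t way) { valid_[set][way] = false; }

      // false => guaranteed miss; true => possible hit, search the bankset.
      bool MayHit(size_t set, uint64_t tag) const {
        const uint8_t p = static_cast<uint8_t>(tag & 0x3F);
        for (size_t w = 0; w < partial_[set].size(); ++w)
          if (valid_[set][w] && partial_[set][w] == p) return true;
        return false;
      }

     private:
      std::vector<std::vector<uint8_t>> partial_;
      std::vector<std::vector<bool>> valid_;
    };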
In CMP-DNUCA, adopting a partial tag structure
appears impractical. A centralized partial tag structure can-
not be quickly accessed by all processors due to wire delays.
Fully replicated 6-bit partial tag structures (as used in uni-
processor D-NUCA [27]) require 1.5 MB of state, an
extremely high overhead. More importantly, separate partial
tag structures require a complex coherence scheme that
updates address location state in the partial tags with block
migrations. However, because architects may invent a solu-
tion to this problem, we evaluate CMP-DNUCA both with
and without a perfect search mechanism.

Table 1. 2010 System Parameters

Memory System
  split L1 I & D caches            64 KB, 2-way, 3 cycles
  unified L2 cache                 16 MB, 256 x 64 KB, 16-way, 6 cycle bank access
  L1/L2 cache block size           64 Bytes
  memory latency                   260 cycles
  memory bandwidth                 320 GB/s
  memory size                      4 GB of DRAM
  outstanding memory requests/CPU  16

Dynamically Scheduled Processor
  clock frequency                  10 GHz
  reorder buffer / scheduler       128 / 64 entries
  pipeline width                   4-wide fetch & issue
  pipeline stages                  30
  direct branch predictor          3.5 KB YAGS
  return address stack             64 entries
  indirect branch predictor        256 entries (cascaded)

Figure 1. CMP-SNUCA Layout with CMP-DNUCA Bankcluster Regions
(bankcluster key: Local, Inter., Center; the eight CPUs and their
split L1 I & D caches surround the shared L2 bank array)

2.4 On-chip Transmission Lines
On-chip transmission line technology reduces L2 cache
access latency by replacing slow conventional wires with
ultra-fast transmission lines [6]. The delay in conventional
wires is dominated by a wire’s resistance-capacitance prod-
uct, or RC delay. RC delay increases with improving tech-
nology as wires become thinner to match the smaller feature
sizes below. Specifically, wire resistance increases due to the
smaller cross-sectional area and sidewall capacitance
increases due to the greater surface area exposed to adjacent
wires. On the other hand, transmission lines attain significant
performance benefit by increasing wire dimensions to the
point where the inductance-capacitance product (LC delay)
determines delay [8]. In the LC range, data can be communi-
cated by propagating an incident wave across the transmis-
sion line instead of charging the capacitance across a series
of wire segments. While techniques such as low-k intermetal
dielectrics, additional metal layers, and more repeaters
across a link, will mitigate RC wire latency for short and
intermediate links, transmitting data 1 cm will require more
than 12 cycles in 2010 technology [16]. In contrast, on-chip
transmission lines implemented in 2010 technology will
transmit data 1 cm in less than a single cycle [6].
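
First-order wire models make this contrast concrete (standard
approximations, not numbers taken from the paper): the delay of a
repeaterless RC-dominated wire grows quadratically with its length,
while the delay of an LC-dominated transmission line is set by the
wave velocity in the dielectric and grows only linearly:

    t_{RC} \approx 0.38\, r\, c\, \ell^{2}
    \qquad
    t_{LC} \approx \ell\,\sqrt{l\,c} \;=\; \frac{\ell\,\sqrt{\varepsilon_{r}}}{c_{0}}

where r, c, and l are the wire's resistance, capacitance, and
inductance per unit length, ℓ is its length, ε_r is the relative
permittivity of the dielectric, and c_0 is the speed of light.
Repeater insertion restores roughly linear scaling for conventional
wires, which is why the RC mitigation techniques listed above remain
adequate for short and intermediate links.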
While on-chip transmission lines achieve significant
latency reduction, they sacrifice substantial bandwidth or
require considerable manufacturing cost. To achieve trans-
mission line signalling, on-chip wire dimensions and spacing
must be an order of magnitude larger than minimum pitch
global wires. To attain these large dimensions, transmission
lines must be implemented in the chip’s uppermost metal
layers. The sparseness of these upper layers severely limits
the number of transmission lines available. Alternatively,
extra metal layers may be integrated to the manufacturing
process, but each new metal layer adds about a day of manu-
facturing time, increasing wafer cost by hundreds of dollars
[47].
Applying on-chip transmission lines to reduce the
access latency of a shared L2 cache requires efficient utiliza-
tion of their limited bandwidth. Similar to our uniprocessor
TLC designs [6], we first propose using transmission lines to
connect processors with a shared L2 cache through a single
L2 interface, as shown in Figure 2. Because transmission
lines do not require repeaters, CMP-TLC creates a direct
connection between the centrally located L2 interface and
the peripherally located storage arrays by routing directly
over the processors. Similar to CMP-SNUCA, CMP-TLC
statically partitions the address space across all L2 cache
banks. Sixteen banks (2 adjacent groups of 8 banks) share a
common pair of thin 8-byte wide unidirectional transmission
line links to the L2 cache interface. To mitigate the conten-
tion for the thin transmission line links, our CMP-TLC
design provides 16 separate links to different segments of the
L2 cache. Also to further reduce contention, the CMP-TLC
L2 interface provides a higher bandwidth connection (80-
byte wide) between the transmission lines and processors
than the original uniprocessor TLC design. Due to the higher
bandwidth, requests encounter greater communication
latency (2-10 cycles) within the L2 cache interface.
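
The bandwidth pressure on these thin links follows from simple
arithmetic (a sketch under the parameters above; the occupancy model
is ours, not the simulator's): a 64-byte block needs 64 / 8 = 8 cycles
on an 8-byte link, so back-to-back requests to the same 16-bank
segment serialize behind one another.

    #include <algorithm>
    #include <cstdint>

    // Back-of-the-envelope occupancy model for one thin 8-byte CMP-TLC
    // data link.
    struct TlcLink {
      static constexpr uint64_t kBytesPerCycle = 8;   // thin link width
      static constexpr uint64_t kBlockBytes = 64;     // block size (Table 1)
      uint64_t freeAt = 0;                            // cycle link goes idle

      // Cycle at which a block ready at `readyCycle` finishes crossing.
      uint64_t Transfer(uint64_t readyCycle) {
        const uint64_t start = std::max(readyCycle, freeAt);  // wait for link
        freeAt = start + kBlockBytes / kBytesPerCycle;        // 8 busy cycles
        return freeAt;
      }
    };

On this reading, the 16 segment links together supply at most
16 x 8 = 128 bytes per cycle of data in each direction, which is what
the wider 80-byte L2 interface and the static spreading of the address
space across segments attempt to keep utilized.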
We also propose using transmission lines to quickly
access the central banks in the CMP-DNUCA design. We
refer to this design as CMP-Hybrid. CMP-Hybrid, illustrated
in Figure 3, assumes the same design as CMP-DNUCA
except the closest switch to each processor has a 32-byte
wide transmission line link to a center switch in the DNUCA
cache. Because the processors are distributed around the
perimeter of the chip and the distance between the processor
switches and the center switches is relatively short (approxi-
mately 8 mm), the transmission line links in CMP-Hybrid
are wider (32 bytes) than their CMP-TLC counterparts
(8 bytes). The transmission line links of CMP-Hybrid pro-
vide low latency access to those blocks that tend to congre-
gate in the center banks of the block migrating NUCA cache
(Section 5.3).

Figure 2. CMP-TLC Layout (transmission line links, routed over the
eight processors and their split L1 I & D caches, connect a centrally
located L2 interface to the peripherally located cache banks)

Figure 3. CMP-Hybrid Layout (the CMP-SNUCA floorplan of Figure 1 with
32-byte transmission line links from each processor's nearest switch
to the center switches)

Figure 4 compares the uncontended L2 cache hit latency
between the CMP-SNUCA, CMP-TLC, and CMP-Hybrid
designs. The plotted hit latency includes L1 miss latency, i.e.,
it plots the load-to-use latency for L2 hits. While CMP-TLC
achieves a much lower average hit latency than CMP-
SNUCA, CMP-SNUCA exhibits lower latency to the closest
1 MB to each processor. For instance, Figure 4 shows all
processors in the CMP-SNUCA design can access their local
bankcluster (6.25% of the entire cache) in 18 cycles or less.
CMP-DNUCA attempts to maximize the hits to this closest
6.25% of the NUCA cache through migration, while CMP-
TLC utilizes a much simpler logical design and provides fast
access for all banks. CMP-Hybrid uses transmission lines to
attain similar average hit latency as CMP-TLC, as well as
achieving fast access to more banks than CMP-SNUCA.
3 Methodology
We evaluated all cache designs using full system simula-
tion of a SPARC V9 CMP running Solaris 9. Specifically, we
used Simics [33] extended with the out-of-order processor
model, TFSim [34], and a memory system timing model.
Our memory system implements a two-level directory cache-
coherence protocol with sequential memory consistency. The
intra-chip MSI coherence protocol maintains inclusion
between the shared L2 cache and all on-chip L1 caches. All
L1 requests and responses are sent via the L2 cache allowing
the L2 cache to maintain up-to-date L1 sharer knowledge.
The inter-chip MOSI coherence protocol maintains directory
state at the off-chip memory controllers and only tracks
which CMP nodes contain valid block copies. Our memory
system timing model includes a detailed model of the intra-
and inter-chip network. Our network models all messages
communicated in the system including all requests,
responses, replacements, and acknowledgements. Network
routing is performed using a virtual cut-through scheme with
infinite buffering at the switches.
We studied the CMP cache designs for various commer-
cial and scientific workloads. Alameldeen et al. described in
detail the four commercial workloads used in this study [2].
We also studied four scientific workloads: two Splash2
benchmarks [48]: barnes (16k-particles) and ocean
(514×514), and two SPECOMP benchmarks [4]: apsi and
fma3d. We used a work-related throughput metric to address
multithreaded workload variability [2]. Thus for the com-
mercial workloads, we measured transactions completed, and
for the scientific workloads, runs were completed after the
cache warm-up period indicated in Table 2. However, for the
SPECOMP workloads using the reference input sets, runs
were too long to be completed in a reasonable amount of
time. Instead, these loop-based benchmarks were split by
main loop completion. This allowed us to evaluate all work-
loads using throughput metrics, rather than IPC.
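
Concretely, the metric can be read as follows (a sketch; the number of
runs and the normal-approximation interval are our assumptions): each
design is scored by cycles per completed unit of work, transactions
for the commercial workloads and runs or main loops for the scientific
ones, averaged over several perturbed simulations, with 95% confidence
intervals reported as error bars.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Work-related throughput metric: cycles per completed unit of work,
    // averaged over n simulation runs, with a 95% confidence interval
    // (normal approximation; requires n >= 2).
    struct Throughput {
      double meanCyclesPerUnit;
      double ci95;  // half-width of the 95% confidence interval
    };

    inline Throughput CyclesPerUnit(const std::vector<double>& cycles,
                                    const std::vector<double>& workUnits) {
      const size_t n = cycles.size();
      std::vector<double> perUnit(n);
      for (size_t i = 0; i < n; ++i) perUnit[i] = cycles[i] / workUnits[i];

      double mean = 0.0;
      for (double x : perUnit) mean += x;
      mean /= static_cast<double>(n);

      double var = 0.0;
      for (double x : perUnit) var += (x - mean) * (x - mean);
      var /= static_cast<double>(n - 1);            // sample variance

      const double ci = 1.96 * std::sqrt(var / static_cast<double>(n));
      return {mean, ci};
    }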
4 Strided Prefetching
Both on and off-chip strided prefetching significantly
improve the performance of our CMP-SNUCA baseline.
Figure 5 presents runtime results for no prefetching, L2
prefetching only, and L1 and L2 prefetching combined, nor-
malized to no prefetching. Error bars signify the 95% confi-
dence intervals [3] and the absolute runtime (in 10K
instructions per transaction/scientific benchmark) of the no
prefetch case is presented below. Figure 5 illustrates the sub-
stantial benefit from L2 prefetching, particularly for regular
scientific workloads. L2 prefetching reduces the run times of
ocean and apsi by 43% and 59%, respectively. Strided L2
prefetching also improves performance of the commercial
workloads by 4% to 17%.
The L1&L2 prefetching bars of Figure 5 indicate on-
chip prefetching between each processor’s L1 I and D caches
and the shared L2 cache improves performance by an addi-
Figure 4. CMP-SNUCA vs. CMP-TLC vs. CMP-Hybrid Uncontended L2 Hit
Latency (latency in cycles versus % of L2 cache storage; series:
CMP-SNUCA, CMP-TLC, CMP-Hybrid)
Table 2. Evaluation Methodology

Bench    Fast Forward   Warm-up   Executed
Commercial Workloads (unit = transactions)
  apache   500000        2000      500
  zeus     500000        2000      500
  jbb      1000000       15000     2000
  oltp     100000        300       100
Scientific Workloads (unit = billion instructions)
  barnes   None          1.9       run completion
  ocean    None          2.4       run completion
  apsi     88.8          4.64      loop completion
  fma3d    190.4         2.08      loop completion