Managing Wire Delay in Large Chip-Multiprocessor Caches
Bradford M. Beckmann and David A. Wood
Computer Sciences Department
University of Wisconsin—Madison
{beckmann, david}@cs.wisc.edu

This work was supported by the National Science Foundation
(CDA-9623632, EIA-9971256, EIA-0205286, and CCR-0324878), a
Wisconsin Romnes Fellowship (Wood), and donations from Intel
Corp. and Sun Microsystems, Inc. Dr. Wood has a significant
financial interest in Sun Microsystems, Inc.
Abstract
In response to increasing (relative) wire delay, archi-
tects have proposed various technologies to manage the
impact of slow wires on large uniprocessor L2 caches.
Block migration (e.g., D-NUCA [27] and NuRapid [12])
reduces average hit latency by migrating frequently used
blocks towards the lower-latency banks. Transmission Line
Caches (TLC) [6] use on-chip transmission lines to provide
low latency to all banks. Traditional stride-based hardware
prefetching strives to tolerate, rather than reduce, latency.
Chip multiprocessors (CMPs) present additional chal-
lenges. First, CMPs often share the on-chip L2 cache,
requiring multiple ports to provide sufficient bandwidth.
Second, multiple threads mean multiple working sets, which
compete for limited on-chip storage. Third, sharing code
and data interferes with block migration, since one proces-
sor’s low-latency bank is another processor’s high-latency
bank.
In this paper, we develop L2 cache designs for CMPs
that incorporate these three latency management tech-
niques. We use detailed full-system simulation to analyze
the performance trade-offs for both commercial and scien-
tific workloads. First, we demonstrate that block migration
is less effective for CMPs because 40-60% of L2 cache hits
in commercial workloads are satisfied in the central banks,
which are equally far from all processors. Second, we
observe that although transmission lines provide low
latency, contention for their restricted bandwidth limits
their performance. Third, we show stride-based prefetching
between L1 and L2 caches alone improves performance by
at least as much as the other two techniques. Finally, we
present a hybrid design, combining all three techniques, that
improves performance by an additional 2% to 19%
over prefetching alone.
1 Introduction
Many factors—both technological and marketing—are
driving the semiconductor industry to implement multiple
processors per chip. Small-scale chip multiprocessors
(CMPs), with two processors per chip, are already commer-
cially available [24, 30, 44]. Larger-scale CMPs seem likely
to follow as transistor densities increase [5, 18, 45, 28]. Due
to the benefits of sharing, current and future CMPs are
likely to have a shared, unified L2 cache [25, 37].
Wire delay plays an increasingly significant role in
cache design. Design partitioning, along with the integra-
tion of more metal layers, allows wire dimensions to
decrease more slowly than transistor dimensions, thus keeping
wire delay controllable for short distances [20, 42]. For
instance, as technology improves, designers split caches into
multiple banks, controlling the wire delay within a bank.
However, wire delay between banks is a growing perfor-
mance bottleneck. For example, transmitting data 1 cm
requires only 2-3 cycles in current (2004) technology, but
will necessitate over 12 cycles in 2010 technology assum-
ing a cycle time of 12 fanout-of-three delays [16]. Thus, L2
caches are likely to have hit latencies in the tens of cycles.
Increasing wire delay makes it difficult to provide uni-
form access latencies to all L2 cache banks. One alternative
is Non-Uniform Cache Architecture (NUCA) designs [27],
which allow nearer cache banks to have lower access laten-
cies than further banks. However, supporting multiple pro-
cessors (e.g., 8) places additional demands on NUCA cache
designs. First, simple geometry dictates that eight regular-
shaped processors must be physically distributed across the
2-dimensional die. A cache bank that is physically close to
one processor cannot be physically close to all the others.
Second, an 8-way CMP requires eight times the sustained
cache bandwidth. These two factors strongly suggest a
physically distributed, multi-port NUCA cache design.
This paper examines three techniques—previously
evaluated only for uniprocessors—for managing L2 cache
latency in an eight-processor CMP. First, we consider using
hardware-directed stride-based prefetching [9, 13, 23] to
tolerate the variable latency in a NUCA cache design.
While current systems perform hardware-directed strided
prefetching [19, 21, 43], its effectiveness is workload
dependent [10, 22, 46, 49]. Second, we consider cache
block migration [12, 27], a recently proposed technique for
NUCA caches that moves frequently accessed blocks to
cache banks closer to the requesting processor. While block
migration works well for uniprocessors, adapting it to
CMPs poses two problems. One, blocks shared by multiple
processors are pulled in multiple directions and tend to con-
gregate in banks that are equally far from all processors.
Two, due to the extra freedom of movement, the effective-
ness of block migration in a shared CMP cache is more
dependent on “smart searches” [27] than its uniprocessor
counterpart, yet smart searches are harder to implement in a
CMP environment. Finally, we consider using on-chip trans-
mission lines [8] to provide fast access to all cache banks [6].
On-chip transmission lines use thick global wires to reduce
communication latency by an order of magnitude versus
long conventional wires. Transmission Line Caches (TLCs)
provide fast, nearly uniform, access latencies. However, the
limited bandwidth of transmission lines—due to their large
dimensions—may lead to a performance bottleneck in
CMPs.
This paper evaluates these three techniques—against a
baseline NUCA design with L2 miss prefetching—using
detailed full-system simulation and both commercial and sci-
entific workloads. We make the following contributions:
• Block migration is less effective for CMPs than previous
results have shown for uniprocessors. Even with a perfect
search mechanism, block migration alone only improves
performance by an average of 3%. This is in part because
shared blocks migrate to the middle, equally-distant cache
banks, accounting for 40-60% of L2 hits for the commercial
workloads.
• Transmission line caches in CMPs exhibit performance
improvements comparable to previously published uni-
processor results [6]—8% on average. However, conten-
tion for their limited bandwidth accounts for 26% of L2
hit latency.
• Hardware-directed strided prefetching hides L2 hit
latency about as well as block migration and transmis-
sion lines reduce it. However, prefetching is largely
orthogonal, permitting hybrid techniques.
• A hybrid implementation—combining block migration,
transmission lines, and on-chip prefetching—provides
the best performance. The hybrid design improves per-
formance by an additional 2% to 19% over the baseline.
• Finally, prefetching and block migration improve net-
work efficiency for some scientific workloads, while
transmission lines potentially improve efficiency across
all workloads.
2 Managing CMP Cache Latency
This section describes the baseline CMP design for this
study and how we adapt the three latency management tech-
niques to this framework.
2.1 Baseline CMP Design
We target eight-processor CMP chip designs assuming
the 45 nm technology generation projected in 2010 [16].
Table 1 specifies the system parameters for all designs. Each
CMP design assumes approximately 300 mm² of available
die area [16]. We estimate eight 4-wide superscalar proces-
sors would occupy 120 mm² [29] and 16 MB of L2 cache
storage would occupy 64 mm² [16]. The on-chip intercon-
nection network and other miscellaneous structures occupy
the remaining area.
As illustrated in Figure 1, the baseline design—denoted
CMP-SNUCA—assumes a Non-Uniform Cache Architec-
ture (NUCA) L2 cache, derived from Kim et al.'s S-NUCA-2
design [27]. Similar to the original proposal, CMP-
SNUCA statically partitions the address space across cache
banks, which are connected via a 2D mesh interconnection
network. CMP-SNUCA differs from the uniprocessor design
in several important ways. First, it places eight processors
around the perimeter of the L2 cache, effectively creating
eight distributed access locations rather than a single central-
ized location. Second, the 16 MB L2 storage array is parti-
tioned into 256 banks to control bank access latency [1] and
to provide sufficient bandwidth to support up to 128 simulta-
neous on-chip processor requests. Third, CMP-SNUCA con-
nects four banks to each switch and expands the link width to
32 bytes. The wider CMP-SNUCA network provides the
additional bandwidth needed by an 8-processor CMP, but
requires longer latencies as compared to the originally pro-
posed uniprocessor network. Fourth, shared CMP caches are
subject to contention from different processors’ working sets
[32], motivating 16-way set-associative banks with a pseudo-
LRU replacement policy [40]. Finally, we assume an ideal-
ized off-chip communication controller to provide consistent
off-chip latency for all processors.
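
As a concrete illustration of this static partitioning, the sketch
below maps a physical address to one of the 256 banks using the block
size, bank count, and associativity from Table 1; the particular bit
fields and their ordering are our assumption rather than the
implemented indexing function.

    #include <cstdint>

    // Illustrative CMP-SNUCA address mapping (assumed field ordering,
    // not the paper's exact indexing function).  16 MB of storage in
    // 256 banks of 64 KB each; with 64-byte blocks and 16 ways, each
    // bank holds 64 sets.
    constexpr uint64_t kBlockBits = 6;   // 64-byte blocks
    constexpr uint64_t kBankBits  = 8;   // 256 statically addressed banks
    constexpr uint64_t kSetBits   = 6;   // 64 sets per 16-way, 64 KB bank

    struct L2Index {
      uint32_t bank;   // which of the 256 banks (fixed for an address)
      uint32_t set;    // set within that bank
      uint64_t tag;    // remaining address bits
    };

    inline L2Index MapAddress(uint64_t paddr) {
      const uint64_t block = paddr >> kBlockBits;
      L2Index idx;
      idx.bank = static_cast<uint32_t>(block & ((1u << kBankBits) - 1));
      idx.set  = static_cast<uint32_t>((block >> kBankBits) &
                                       ((1u << kSetBits) - 1));
      idx.tag  = block >> (kBankBits + kSetBits);
      return idx;
    }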
2.2 Strided Prefetching
Strided or stride-based prefetchers utilize repeatable
memory access patterns to tolerate cache miss latency [11,
23, 38]. Though the L1 cache filters many memory requests,
L1 and L2 misses often show repetitive access patterns. Most
current prefetchers utilize miss patterns to predict cache
misses before they happen [19, 21, 43]. Specifically, current
hardware prefetchers observe the stride between two recent
cache misses, then verify the stride using subsequent misses.
Once the prefetcher reaches a threshold of fixed strided
misses, it launches a series of fill requests to reduce or elimi-
nate additional miss latency.
We base our prefetching strategy on the IBM Power 4
implementation [43] with some slight modifications. We
evaluate both L2 prefetching (i.e., between the L2 cache and
memory) and L1 prefetching (i.e., between the L1 and L2
caches). Both the L1 and L2 prefetchers contain three sepa-
rate 32-entry filter tables: positive unit stride, negative unit
stride, and non-unit stride. Similar to Power 4, once a filter
table entry recognizes 4 fixed-stride misses, the prefetcher
allocates the miss stream into its 8-entry stream table. Upon
allocation, the L1I and L1D prefetchers launch 6 consecutive
prefetches along the stream to compensate for the L1 to L2
latency, while the L2 prefetcher launches 25 prefetches.
Each prefetcher issues prefetches for both loads and stores
because, unlike the Power 4, our simulated machine uses an
L1 write-allocate protocol supporting sequential consistency.
Also we model separate L2 prefetchers per processor, rather
than a single shared prefetcher. We found that with a shared
prefetcher, interference between the different processors’
miss streams significantly disrupts the prefetching accuracy
and coverage
1
.
2.3 Block Migration
Block migration reduces global wire delay from L2 hit
latency by moving frequently accessed cache blocks closer
to the requesting processor. Migrating data to reduce latency
has been extensively studied in multiple-chip multiproces-
sors [7, 15, 17, 36, 41]. Kim et al. recently applied data
migration to reduce latency inside future aggressively-
banked uniprocessor caches [27]. Their Dynamic NUCA (D-
NUCA) design used a 2-dimensional mesh to interconnect 2-
way set-associative banks, and dynamically migrated fre-
quently accessed blocks to the closest banks. NuRapid used
centralized tags and a level of indirection to decouple data
placement from set indexing, thus reducing conflicts in the
nearest banks [12]. Both D-NUCA and NuRapid assumed a
single processor chip accessing the L2 cache network from a
single location.
For CMPs, we examine a block migration scheme as an
extension to our baseline CMP-SNUCA design. Similar to
the uniprocessor D-NUCA design [27], CMP-DNUCA per-
mits block migration by logically separating the L2 cache
banks into 16 unique banksets, where an address maps to a
bankset and can reside within any one bank of the bankset.
CMP-DNUCA physically separates the cache banks into 16
different bankclusters, shown as the shaded “Tetris” pieces
in Figure 1. Each bankcluster contains one bank from every
bankset, similar to the uniprocessor “fair mapping” policy
[27]. The bankclusters are grouped into three distinct
regions. The 8 banksets closest to each processor form the
local regions, shown by the 8 lightly shaded bankclusters in
Figure 1. The 4 bankclusters that reside in the center of the
shared cache form the center region, shown by the 4 darkest
shaded bankclusters in Figure 1. The remaining 4 bankclus-
ters form the inter, or intermediate, region. Ideally, block
migration would maximize L2 hits within each processor's
local bankcluster, where the uncontended L2 hit latency (i.e.,
load-to-use latency) varies between 13 and 17 cycles, and limit
the hits to another processor's local bankcluster, where the
uncontended latency can be as high as 65 cycles.
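
The organization can be summarized by the sketch below: the bankset
an address maps to never changes, while migration only changes which
of the 16 bankclusters currently holds the block. The cluster
numbering, the region assignment, and the single migration step shown
are illustrative assumptions; the actual migration policy is
evaluated later in the paper.

    #include <cstdint>

    // Sketch of CMP-DNUCA placement state: 16 banksets (static per
    // address) by 16 bankclusters (changed only by migration).
    enum class Region { kLocal, kInter, kCenter };

    constexpr int kBanksets = 16;

    inline int Bankset(uint64_t paddr) {             // static, never changes
      return static_cast<int>((paddr >> 6) & (kBanksets - 1));
    }

    // Assumed numbering: clusters 0-7 are the eight processors' local
    // clusters, 8-11 the intermediate clusters, 12-15 the center ones.
    inline Region RegionOf(int cluster) {
      if (cluster < 8)  return Region::kLocal;
      if (cluster < 12) return Region::kInter;
      return Region::kCenter;
    }

    // One migration step toward the requesting CPU's local cluster
    // (center -> intermediate -> local); blocks already in some
    // processor's local cluster are left in place in this sketch.
    inline int MigrateToward(int cpu, int cluster) {
      switch (RegionOf(cluster)) {
        case Region::kCenter: return 8 + (cpu % 4);  // an intermediate cluster
        case Region::kInter:  return cpu;            // the CPU's local cluster
        case Region::kLocal:  return cluster;
      }
      return cluster;
    }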
To reduce the latency of detecting a cache miss, the uni-
processor D-NUCA design utilized a “smart search” [27]
mechanism using a partial tag array. The centrally-located
partial tag structure [26] replicated the low-order bits of each
bank’s cache tags. If a request missed in the partial tag struc-
ture, the block was guaranteed not to be in the cache. This
smart search mechanism allowed nearly all cache misses to
be detected without searching the entire bankset.
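
A minimal sketch of such a filter, assuming the 6-bit partial tags
discussed below: a lookup that matches no partial tag guarantees the
block is absent, while a match only indicates that the full bankset
must still be searched.

    #include <cstdint>
    #include <vector>

    // Partial tag "smart search" filter: only the low-order 6 bits of
    // each cached tag are replicated.  No match => guaranteed miss.
    class PartialTagFilter {
     public:
      PartialTagFilter(size_t sets, size_t ways)
          : partial_(sets, std::vector<uint8_t>(ways, 0)),
            valid_(sets, std::vector<bool>(ways, false)) {}

      void Install(size_t set, size_t way, uint64_t tag) {
        partial_[set][way] = static_cast<uint8_t>(tag & 0x3F);  // low 6 bits
        valid_[set][way] = true;
      }

      void Evict(size_t set, size_t way) { valid_[set][way] = false; }

      // false => guaranteed miss; true => possible hit, search the bankset.
      bool MayHit(size_t set, uint64_t tag) const {
        const uint8_t p = static_cast<uint8_t>(tag & 0x3F);
        for (size_t w = 0; w < partial_[set].size(); ++w)
          if (valid_[set][w] && partial_[set][w] == p) return true;
        return false;
      }

     private:
      std::vector<std::vector<uint8_t>> partial_;
      std::vector<std::vector<bool>> valid_;
    };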
In CMP-DNUCA, adopting a partial tag structure
appears impractical. A centralized partial tag structure can-
not be quickly accessed by all processors due to wire delays.
Fully replicated 6-bit partial tag structures (as used in uni-
processor D-NUCA [27]) require 1.5 MB of state, an
extremely high overhead. More importantly, separate partial
tag structures require a complex coherence scheme that
updates address location state in the partial tags with block
migrations. However, because architects may invent a solu-
tion to this problem, we evaluate CMP-DNUCA both with
and without a perfect search mechanism.

Table 1. 2010 System Parameters

Memory System
  split L1 I & D caches            64 KB, 2-way, 3 cycles
  unified L2 cache                 16 MB, 256 x 64 KB, 16-way, 6 cycle bank access
  L1/L2 cache block size           64 Bytes
  memory latency                   260 cycles
  memory bandwidth                 320 GB/s
  memory size                      4 GB of DRAM
  outstanding memory requests/CPU  16

Dynamically Scheduled Processor
  clock frequency                  10 GHz
  reorder buffer / scheduler       128 / 64 entries
  pipeline width                   4-wide fetch & issue
  pipeline stages                  30
  direct branch predictor          3.5 KB YAGS
  return address stack             64 entries
  indirect branch predictor        256 entries (cascaded)

Figure 1. CMP-SNUCA Layout with CMP-DNUCA Bankcluster Regions
(bankcluster key: Local, Inter., Center; the eight CPUs and their
split L1 I & D caches surround the shared L2 bank array)

2.4 On-chip Transmission Lines
On-chip transmission line technology reduces L2 cache
access latency by replacing slow conventional wires with
ultra-fast transmission lines [6]. The delay in conventional
wires is dominated by a wire’s resistance-capacitance prod-
uct, or RC delay. RC delay increases with improving tech-
nology as wires become thinner to match the smaller feature
sizes below. Specifically, wire resistance increases due to the
smaller cross-sectional area and sidewall capacitance
increases due to the greater surface area exposed to adjacent
wires. On the other hand, transmission lines attain significant
performance benefit by increasing wire dimensions to the
point where the inductance-capacitance product (LC delay)
determines delay [8]. In the LC range, data can be communi-
cated by propagating an incident wave across the transmis-
sion line instead of charging the capacitance across a series
of wire segments. While techniques such as low-k intermetal
dielectrics, additional metal layers, and more repeaters
across a link, will mitigate RC wire latency for short and
intermediate links, transmitting data 1 cm will require more
than 12 cycles in 2010 technology [16]. In contrast, on-chip
transmission lines implemented in 2010 technology will
transmit data 1 cm in less than a single cycle [6].
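
First-order wire models make this contrast concrete (standard
approximations, not numbers taken from the paper): the delay of a
repeaterless RC-dominated wire grows quadratically with its length,
while the delay of an LC-dominated transmission line is set by the
wave velocity in the dielectric and grows only linearly:

    t_{RC} \approx 0.38\, r\, c\, \ell^{2}
    \qquad
    t_{LC} \approx \ell\,\sqrt{l\,c} \;=\; \frac{\ell\,\sqrt{\varepsilon_{r}}}{c_{0}}

where r, c, and l are the wire's resistance, capacitance, and
inductance per unit length, ℓ is its length, ε_r is the relative
permittivity of the dielectric, and c_0 is the speed of light.
Repeater insertion restores roughly linear scaling for conventional
wires, which is why the RC mitigation techniques listed above remain
adequate for short and intermediate links.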
While on-chip transmission lines achieve significant
latency reduction, they sacrifice substantial bandwidth or
require considerable manufacturing cost. To achieve trans-
mission line signalling, on-chip wire dimensions and spacing
must be an order of magnitude larger than minimum pitch
global wires. To attain these large dimensions, transmission
lines must be implemented in the chip’s uppermost metal
layers. The sparseness of these upper layers severely limits
the number of transmission lines available. Alternatively,
extra metal layers may be integrated to the manufacturing
process, but each new metal layer adds about a day of manu-
facturing time, increasing wafer cost by hundreds of dollars
[47].
Applying on-chip transmission lines to reduce the
access latency of a shared L2 cache requires efficient utiliza-
tion of their limited bandwidth. Similar to our uniprocessor
TLC designs [6], we first propose using transmission lines to
connect processors with a shared L2 cache through a single
L2 interface, as shown in Figure 2. Because transmission
lines do not require repeaters, CMP-TLC creates a direct
connection between the centrally located L2 interface and
the peripherally located storage arrays by routing directly
over the processors. Similar to CMP-SNUCA, CMP-TLC
statically partitions the address space across all L2 cache
banks. Sixteen banks (2 adjacent groups of 8 banks) share a
common pair of thin 8-byte wide unidirectional transmission
line links to the L2 cache interface. To mitigate the conten-
tion for the thin transmission line links, our CMP-TLC
design provides 16 separate links to different segments of the
L2 cache. Also to further reduce contention, the CMP-TLC
L2 interface provides a higher bandwidth connection (80-
byte wide) between the transmission lines and processors
than the original uniprocessor TLC design. Due to the higher
bandwidth, requests encounter greater communication
latency (2-10 cycles) within the L2 cache interface.
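
The bandwidth pressure on these thin links follows from simple
arithmetic (a sketch under the parameters above; the occupancy model
is ours, not the simulator's): a 64-byte block needs 64 / 8 = 8 cycles
on an 8-byte link, so back-to-back requests to the same 16-bank
segment serialize behind one another.

    #include <algorithm>
    #include <cstdint>

    // Back-of-the-envelope occupancy model for one thin 8-byte CMP-TLC
    // data link.
    struct TlcLink {
      static constexpr uint64_t kBytesPerCycle = 8;   // thin link width
      static constexpr uint64_t kBlockBytes = 64;     // block size (Table 1)
      uint64_t freeAt = 0;                            // cycle link goes idle

      // Cycle at which a block ready at `readyCycle` finishes crossing.
      uint64_t Transfer(uint64_t readyCycle) {
        const uint64_t start = std::max(readyCycle, freeAt);  // wait for link
        freeAt = start + kBlockBytes / kBytesPerCycle;        // 8 busy cycles
        return freeAt;
      }
    };

On this reading, the 16 segment links together supply at most
16 x 8 = 128 bytes per cycle of data in each direction, which is what
the wider 80-byte L2 interface and the static spreading of the address
space across segments attempt to keep utilized.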
We also propose using transmission lines to quickly
access the central banks in the CMP-DNUCA design. We
refer to this design as CMP-Hybrid. CMP-Hybrid, illustrated
in Figure 3, assumes the same design as CMP-DNUCA
except the closest switch to each processor has a 32-byte
wide transmission line link to a center switch in the DNUCA
cache. Because the processors are distributed around the
perimeter of the chip and the distance between the processor
switches and the center switches is relatively short (approxi-
mately 8 mm), the transmission line links in CMP-Hybrid
are wider (32 bytes) than their CMP-TLC counterparts
(8 bytes). The transmission line links of CMP-Hybrid pro-
vide low latency access to those blocks that tend to congre-
gate in the center banks of the block migrating NUCA cache
(Section 5.3).

Figure 2. CMP-TLC Layout (transmission line links, routed over the
eight processors and their split L1 I & D caches, connect a centrally
located L2 interface to the peripherally located cache banks)

Figure 3. CMP-Hybrid Layout (the CMP-SNUCA floorplan of Figure 1 with
32-byte transmission line links from each processor's nearest switch
to the center switches)

Figure 4 compares the uncontended L2 cache hit latency
between the CMP-SNUCA, CMP-TLC, and CMP-Hybrid
designs. The plotted hit latency includes L1 miss latency, i.e.,
it plots the load-to-use latency for L2 hits. While CMP-TLC
achieves a much lower average hit latency than CMP-
SNUCA, CMP-SNUCA exhibits lower latency to the closest
1 MB to each processor. For instance, Figure 4 shows all
processors in the CMP-SNUCA design can access their local
bankcluster (6.25% of the entire cache) in 18 cycles or less.
CMP-DNUCA attempts to maximize the hits to this closest
6.25% of the NUCA cache through migration, while CMP-
TLC utilizes a much simpler logical design and provides fast
access for all banks. CMP-Hybrid uses transmission lines to
attain similar average hit latency as CMP-TLC, as well as
achieving fast access to more banks than CMP-SNUCA.
3 Methodology
We evaluated all cache designs using full system simula-
tion of a SPARC V9 CMP running Solaris 9. Specifically, we
used Simics [33] extended with the out-of-order processor
model, TFSim [34], and a memory system timing model.
Our memory system implements a two-level directory cache-
coherence protocol with sequential memory consistency. The
intra-chip MSI coherence protocol maintains inclusion
between the shared L2 cache and all on-chip L1 caches. All
L1 requests and responses are sent via the L2 cache allowing
the L2 cache to maintain up-to-date L1 sharer knowledge.
The inter-chip MOSI coherence protocol maintains directory
state at the off-chip memory controllers and only tracks
which CMP nodes contain valid block copies. Our memory
system timing model includes a detailed model of the intra-
and inter-chip network. Our network models all messages
communicated in the system including all requests,
responses, replacements, and acknowledgements. Network
routing is performed using a virtual cut-through scheme with
infinite buffering at the switches.
We studied the CMP cache designs for various commer-
cial and scientific workloads. Alameldeen et al. described in
detail the four commercial workloads used in this study [2].
We also studied four scientific workloads: two Splash2
benchmarks [48]: barnes (16k-particles) and ocean
(514×514), and two SPECOMP benchmarks [4]: apsi and
fma3d. We used a work-related throughput metric to address
multithreaded workload variability [2]. Thus for the com-
mercial workloads, we measured transactions completed, and
for the scientific workloads, runs were completed after the
cache warm-up period indicated in Table 2. However, for the
SPECOMP workloads using the reference input sets, runs
were too long to be completed in a reasonable amount of
time. Instead, these loop-based benchmarks were split by
main loop completion. This allowed us to evaluate all work-
loads using throughput metrics, rather than IPC.
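
Concretely, the metric can be read as follows (a sketch; the number of
runs and the normal-approximation interval are our assumptions): each
design is scored by cycles per completed unit of work, transactions
for the commercial workloads and runs or main loops for the scientific
ones, averaged over several perturbed simulations, with 95% confidence
intervals reported as error bars.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Work-related throughput metric: cycles per completed unit of work,
    // averaged over n simulation runs, with a 95% confidence interval
    // (normal approximation; requires n >= 2).
    struct Throughput {
      double meanCyclesPerUnit;
      double ci95;  // half-width of the 95% confidence interval
    };

    inline Throughput CyclesPerUnit(const std::vector<double>& cycles,
                                    const std::vector<double>& workUnits) {
      const size_t n = cycles.size();
      std::vector<double> perUnit(n);
      for (size_t i = 0; i < n; ++i) perUnit[i] = cycles[i] / workUnits[i];

      double mean = 0.0;
      for (double x : perUnit) mean += x;
      mean /= static_cast<double>(n);

      double var = 0.0;
      for (double x : perUnit) var += (x - mean) * (x - mean);
      var /= static_cast<double>(n - 1);            // sample variance

      const double ci = 1.96 * std::sqrt(var / static_cast<double>(n));
      return {mean, ci};
    }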
4 Strided Prefetching
Both on and off-chip strided prefetching significantly
improve the performance of our CMP-SNUCA baseline.
Figure 5 presents runtime results for no prefetching, L2
prefetching only, and L1 and L2 prefetching combined, nor-
malized to no prefetching. Error bars signify the 95% confi-
dence intervals [3] and the absolute runtime (in 10K
instructions per transaction/scientific benchmark) of the no
prefetch case is presented below. Figure 5 illustrates the sub-
stantial benefit from L2 prefetching, particularly for regular
scientific workloads. L2 prefetching reduces the run times of
ocean and apsi by 43% and 59%, respectively. Strided L2
prefetching also improves performance of the commercial
workloads by 4% to 17%.
The L1&L2 prefetching bars of Figure 5 indicate on-
chip prefetching between each processor’s L1 I and D caches
and the shared L2 cache improves performance by an addi-
Figure 4. CMP-SNUCA vs. CMP-TLC vs. CMP-Hybrid Uncontended L2 Hit
Latency (latency in cycles versus % of L2 cache storage; series:
CMP-SNUCA, CMP-TLC, CMP-Hybrid)
Table 2. Evaluation Methodology

Bench    Fast Forward   Warm-up   Executed
Commercial Workloads (unit = transactions)
  apache   500000        2000      500
  zeus     500000        2000      500
  jbb      1000000       15000     2000
  oltp     100000        300       100
Scientific Workloads (unit = billion instructions)
  barnes   None          1.9       run completion
  ocean    None          2.4       run completion
  apsi     88.8          4.64      loop completion
  fma3d    190.4         2.08      loop completion