Comparative performance evaluation of cache-coherent NUMA and COMA architectures

Per Stenström, Truman Joe, Anoop Gupta
Vol. 20, Iss. 2, pp. 80-91

Comparative Performance Evaluation of
Cache-Coherent NUMA and COMA Architectures
Per Stenström†, Truman Joe, and Anoop Gupta
Computer Systems Laboratory
Stanford University, CA 94305
Abstract
Two interesting variations of large-scale shared-memory ma-
chines that have recently emerged are cache-coherent non-
uniform-memory-access machines (CC-NUMA) and cache-
only memory architectures (COMA). They both have dis-
tributed main memory and use directory-based cache coher-
ence. Unlike CC-NUMA, however, COMA machines auto-
matically migrate and replicate data at the main-memory level
in cache-line sized chunks. This paper compares the perfor-
mance of these two classes of machines. We first present a
qualitative model that shows that the relative performance is
primarily determined by two factors: the relative magnitude
of capacity misses versus coherence misses, and the granu-
larity of data partitions in the application. We then present
quantitative results using simulation studies for eight parallel
applications (including all six applications from the SPLASH
benchmark suite). We show that COMA’s potential for perfor-
mance improvement is limited to applications where data ac-
cesses by different processors are finely interleaved in memory
space and, in addition, where capacity misses dominate over
coherence misses. In other situations, for example where co-
herence misses dominate, COMA can actually perform worse
than CC-NUMA due to increased miss latencies caused by its
hierarchical directories. Finally, we propose a new architec-
tural alternative, called COMA-F, that combines the advantages
of both CC-NUMA and COMA.
1 Introduction
Large-scale multiprocessors with a single address-space and
coherent caches offer a flexible and powerful computing en-
vironment. The single address space and coherent caches
together ease the problem of data partitioning and dynamic
load balancing. They also provide better support for paral-
lelizing compilers, standard operating systems, and multipro-
gramming, thus enabling more flexible and effective use of the
† Per Stenström's address is Department of Computer Engineering,
Lund University, P.O. Box 118, S-221 00 LUND, Sweden.
machine. Currently, many research groups are pursuing the
design and construction of such multiprocessors [12, 1, 10].
As research has progressed in this area, two interesting vari-
ants have emerged, namely CC-NUMA (cache-coherent non-
uniform memory access machines) and COMA (cache-only
memory architectures).
Examples of the CC-NUMA ma-
chines are the Stanford DASH multiprocessor [12] and the MIT
Alewife machine [1], while examples of COMA machines are
the Swedish Institute of Computer Science’s Data Diffusion
Machine (DDM) [10] and Kendall Square Research’s KSR1
machine [4].
Common to both CC-NUMA and COMA machines are the
features of distributed main memory, scalable interconnection
network, and directory-based cache coherence.
Distributed
main memory and scalable interconnection networks are es-
sential in providing the required scalable memory bandwidth,
while directory-based schemes provide cache coherence with-
out requiring broadcast and consuming only a small fraction
of the system bandwidth. In contrast to CC-NUMA machines,
however, in COMA the per-node main memory is converted
into an enormous secondary/tertiary cache (called attraction
memory (AM) by the DDM group) by adding tags to cache-
line sized chunks in main memory. A consequence is that the
location of a data item in the machine is totally decoupled
from its physical address, and the data item is automatically
migrated or replicated in main memory depending on the mem-
ory reference pattern.
The main advantage of the COMA machines is that they
can reduce the average cache miss latency, since data are dy-
namically migrated and replicated at the main-memory level.
However, there are also several disadvantages. First, allowing
migration of data at the memory level requires a mechanism
to locate the data on a miss. To avoid broadcasting such re-
quests, current machines use a hierarchical directory structure,
which increases the miss latency for global requests. Second,
the coherence protocol is more complex because it needs to
ensure that the last copy of a data item is not replaced in
the attraction memory (main memory). Also, as compared to
CC-NUMA, there is additional complexity in the design of
the main-memory subsystem and in the interface to the disk
subsystem.
Even though CC-NUMA and COMA machines are being
built, so far no studies have been published that evaluate the
performance benefits of one machine model over the other.
Such a study is the focus of this paper. We note that the paper
focuses on the relative performance of the two machines, and

not on the hardware complexity. We do so because without a
good understanding of the performance benefits, it is difficult
to argue about what hardware complexity is justified.
The organization of the rest of the paper is as follows. In
the next section, we begin with detailed descriptions of CC-
NUMA and COMA machines. Then in Section 3, we present
a qualitative model that helps predict the relative performance
of applications on CC-NUMA and COMA machines. Section
4 presents the architectural assumptions and our simulation en-
vironment. It also presents the eight benchmark applications
used in our study, which include all six applications from the
SPLASH benchmark suite [14]. The performance results are
presented in Section 5. We show that COMA’s potential for
performance improvement is limited to applications where data
accesses by different processors are interleaved at a fine spa-
tial granularity and, in addition, where capacity misses dom-
inate over coherence misses. We also show that for applica-
tions which access data at a coarse granularity, CC-NUMA can
perform nearly as well as a COMA by exploiting page-level
placement or migration. Furthermore, when coherence misses
dominate, CC-NUMA often performs better than COMA. This
is due to the extra latency introduced by the hierarchical di-
rectory structure in COMA. In Section 6, we present a new
architectural alternative, called COMA-F (for COMA-FLAT),
that is shown to perform better than both regular CC-NUMA
and COMA. We finally conclude in Section 7.
2 CC-NUMA and COMA Machines
In this section we briefly present the organization of CC-
NUMA and COMA machines based on the Stanford DASH
multiprocessor [12] and the Swedish Institute of Computer Sci-
ence’s Data Diffusion Machine (DDM) [10]. We discuss the
basic architecture and the coherence protocols, directory struc-
ture and interconnection network requirements, and finally the
software model presented by the architectures.
2.1 CC-NUMA Architecture
A CC-NUMA machine consists of a number of processing
nodes connected through a high-bandwidth, low-latency inter-
connection network. Each processing node consists of a high-
performance processor, the associated cache, and a portion of
the global shared memory. Cache coherence is maintained by a
directory-based, write-invalidate cache coherence protocol. To
keep all caches consistent, each processing node has a direc-
tory memory corresponding to its portion of the shared phys-
ical memory. For each memory line (aligned memory block
that has the same size as a cache line), the directory memory
stores identities of remote nodes caching that line. Thus, us-
ing the directory, it is possible for a node writing a location
to send point-to-point messages to invalidate remote copies
of the corresponding cache line. Another important attribute
of the directory-based protocol is that it does not depend on
any specific interconnection network topology. Therefore, any
scalable network, such as a mesh, a hypercube, or a multi-stage
network, can be used to connect the processing nodes.
Handling a cache miss in a CC-NUMA machine requires
knowledge about the home node for the corresponding physical
address. The home node is the processing node from whose
main memory the data is allocated. (It is usually determined by
the high-order bits of the physical address.) If the local node
and the home node are the same, a cache miss can be serviced
by main memory in the local node. Otherwise, the miss is
forwarded to the remote home node. If the home node has a
clean copy, it returns the block to the requesting cache. (We
call this a 2-hop miss since it requires two network traversals:
one from the requesting node to the home node, and the
other back.) Otherwise, the read request is forwarded to the
node that has the dirty copy. This node returns the block to
the requesting node and also writes back the block to the home
node. (We call this a 3-hop miss, as it takes three network
traversals before the data is returned to the requesting node.)
When a processor issues a write request to a block that it does
not exclusively own, a read-exclusive request is sent to the
home node. The home node returns ownership and multicasts
invalidation requests to any other nodes that have a copy of the
block. Acknowledgments are returned directly to the issuing
node so as to indicate when the write operation has completed.
A write request to a dirty block is forwarded (the same way
as a read request) to the node containing the dirty copy, which
then returns the data.
The Stanford DASH multiprocessor [12] is an example of a
CC-NUMA machine. The prototype is to consist of 16 process-
ing nodes, each with 4 processors, for a total of 64 processors.
Each processor has a 64 Kbytes first-level and a 256 Kbytes
second-level cache. The interconnection network is a worm-
hole routed 2-D mesh network. The memory access latencies
for a cache hit, local memory access, 2-hop, and 3-hop read
misses are approximately 1, 30, 100, and 135 processor clocks
respectively.
The CC-NUMA software model allows processes to be at-
tached to specific processors and for data to be allocated from
any specific node's main memory. However, the granularity at
which data can be moved between different nodes' main mem-
ories (transparently to the application) is page sized chunks.
We note that the allocation and movement of data between
nodes may be done explicitly via code written by the applica-
tion programmer, or automatically by the operating system [3].
This is in contrast to the COMA machines where such migra-
tion and replication happens automatically at the granularity of
cache blocks.
2.2 COMA Architecture
Like CC-NUMA, a COMA machine consists of a number of
processing nodes connected by an interconnection network.
Each processing node has a high-performance processor, a
cache, and a portion of the global shared memory. The differ-
ence, however, is that the memory associated with each node
is augmented to act as a large cache, denoted attraction mem-
ory (AM) using DDM terminology [10]. Consistency among
cache blocks in the AMs is maintained using a write-invalidate
protocol. The AMs allow transparent migration and replication
of data items to nodes where they are referenced.
In a COMA machine, the AMs constitute the only memory
in the system (other than the disk subsystem). A consequence
is that the location of a memory block is totally decoupled from
its physical address. This creates several problems. First, when
a reference misses in the local AM, a mechanism is needed to
trace a copy of that block in some other node’s AM. Unlike
CC-NUMA, there is no notion of a home node for a block.
Second, some mechanism is needed to ensure that the last
81

copy of a block (possibly the only valid copy) is not purged.
To address the above problems and to maintain cache and
memory consistency, COMA machines use a hierarchical di-
rectory scheme and a corresponding hierarchical interconnec-
tion network (at least logically so¹). Each directory maintains
state information about all blocks stored in the subsystem be-
low. The state of a block is either exclusive in exactly one
node or shared in several nodes. Note that directories only
contain state information to reduce memory overhead; the data
themselves are not stored.
Upon a read miss, a read request locates the closest node
that has a copy of that block by propagating up the hierarchy
until a copy in state shared or exclusive is found. At that
point, it propagates down the hierarchy to a node that has the
copy. The node returns the block rdong the same path as the
request. (A directory read-m odify-tite needs to be done at
each intermediate directory rdong the path, both in the forward
and return directions.) Because of the hierarchical directory
structure, COMA machines can exploit combining [7]; if a
directory receives a read request to a block that is already
being fetched, it does not have to send the new request up the
hierarchy. When the reply comes back, both requesters are
supplied the data.
A write request to an unowned block propagates up the hier-
archy until a directory indicates that the copy is exclusive. This
directory, the root directory, multicasts invalidation requests to
all subsystems having a copy of the block and returns an ac-
knowledgement to the issuing processor.
As stated earlier, decoupling the home location of a block
from its address raises the issue of replacement in the AMs.
A shared block (i.e., one with multiple copies) that is being
replaced is not so difficult to handle. The system simply has to
realize that there exist other copies by going up the hierarchy.
Handling an exclusive block (the only copy of a block, whether
in clean or dirty state) is, however, more complex since it must
be transferred to another attraction memory. This is done by
letting it propagate up in the hierarchy until a directory finds
an empty or a shared block in its subsystem that can host the
block.
Examples of COMA machines include the Swedish Institute
of Computer Science’s DDM machine [10] and Kendall Square
Research’s KSR1 machine [4]. The processing nodes in DDM
are also clusters with multiple processors, as in DASH. How-
ever, the interconnect is a hierarchy of buses, in contrast to
the wormhole-routed grid in DASH. In KSR1, each process-
ing node consists of only a single processor. The intercomect
consists of a hierarchy of slotted ring networks.
In summary, by allowing individual memory blocks to be
migrated and replicated in attraction memories, COMA ma-
chines have the potential of reducing the number of cache
misses that need to be serviced remotely. However, because of
the hierarchy in COMA, latency for remote misses is usually
higher (except when combining is successful); this may offset
the advantages of the higher hit rates. We study these tradeoffs
qualitatively in the next section.
¹ Wallach and Dally are investigating a COMA implementation based
on a hierarchy embedded in a 3-dimensional mesh network [16].
3 Qualitative Comparison
In this section, we qualitatively evaluate the advantages and
disadvantages of the CC-NUMA and COMA models. In par-
ticular, we focus on application data access patterns that are
expected to cause one model to perform better than the other.
We show that the critical parameters are the relative magnitudes
of different miss types in the cache, and the spatial granularity
of access to shared data. We begin with a discussion of the
types of misses observed in shared-memory parallel programs,
and discuss the expected miss latencies for CC-NUMA and
COMA machines.
3.1 Miss Types and Expected Latencies
Since both CC-NUMA and COMA have private coherent
caches, presumably of the same size, the differences in per-
formance stem primarily because of differences in the miss
latencies. In shared-memory multiprocessors that use a write-
invalidate cache coherence protocol, cache misses may be clas-
sified into four types: cold misses, capacity misses, conflict
misses, and coherence misses.
A cold miss is the result of a block being accessed by the
processor for the first time. A capacity miss is a miss due to
the finite size of the processor cache and a conflict miss is due
to the limited associativity of the cache. For our discussion
below, we do not distinguish between conflict and capacity
misses since CC-NUMA and COMA respond to them in the
same way. We collectively refer to them as capacity misses.
A coherence miss (or an invalidation miss) is a miss to a block
that has been referenced before, but has been written by another
processor since the last time it was referenced by this processor.
Coherence misses include both false sharing misses as well as
true sharing misses [15]. False sharing misses result from write
references to data that are not shared but happen to reside in
the same cache line. True sharing misses are coherence misses
that would still exist even if the block size were one access
unit. They represent true communication between the multiple
processes in the application.
We now investigate how the two models respond to cache
misses of different types. Beginning with cold misses, we ex-
pect the average miss penalty to be higher for COMA, assum-
ing that data is distributed among the main memory modules
in the same way. The reason is simply that cold misses that are
not serviced locally have to traverse the directory and network
hierarchy in COMA. The only reason for a shorter latency for
COMA would be if combining worked particularly well for an
application.
For coherence misses, we again expect COMA to have
higher miss latencies as compared to CC-NUMA. The reason
is that the data is guaranteed not to be in the local attraction
memory for COMA, and therefore it will need to traverse the
directory hierarchy.² In contrast to CC-NUMA, the latency for
COMA can, of course, be shortened if combining is successful,
or if the communication is localized in the hierarchy.
Finally, for capacity misses, we expect to see shorter miss
latencies for COMA. Most such misses are expected to hit
in the local attraction memory since it is extremely large and
² We assume one processor per node. If several processors share the
same AM (or local memory), a coherence miss can sometimes be serviced
locally.

organized as a cache. In contrast, in CC-NUMA, unless data
referenced by a processor are carefully allocated to local main
memory, there is a high likelihood that a capacity miss will
have to be serviced by a remote node.
In summary, since there are some kinds of misses that are
serviced with lower latency by COMA and others that are ser-
viced with lower latency by CC-NUMA, the relative perfor-
mance of an application on COMA or CC-NUMA will depend
on what kinds of misses dominate.
3.2 Application Performance
In this subsection, we classify applications based on their data
access patterns, and the resulting cache miss behavior, to eval-
uate the relative advantages and disadvantages of COMA and
CC-NUMA. A summary of this classification is presented in
Figure 1. As illustrations, we use many applications that we
evaluate experimentally in Section 5.
On the left of the tree in Figure 1, we group all applications
that exhibit low cache miss rates. Linear algebra applications
that can be blocked, for example, fall into this category [11].
Other applications where computation grows at a much faster
rate than the data set size, e.g., the O(N²) algorithm used to
compute interactions between molecules in the Water applica-
tion [14], often also fall into this category. In such cases, since
the data sets are quite small, capacity misses are few and the
miss penalty has only a small impact on performance. Overall,
for these applications that exhibit a low miss rate, CC-NUMA
and COMA should perform about the same.
Looking at the right portion of the tree, for applications that
exhibit moderate to high miss rates, we differentiate between
applications where coherence misses dominate and those where
capacity misses dominate. Focusing first on the former, we
note that this class of applications is not that unusual. High
coherence misses can arise because an application programmer
(or compiler) may not have done a very good job of scheduling
tasks or partitioning data, or because the cache line size is too
large causing false sharing, or because solving the problem
actually requires such communication to happen. Usually, all
three factors are involved to varying degrees. In all of these
cases, CC-NUMA is expected to do better than COMA because
the remote misses take a shorter time to service. As we will
show in Section 5, even when very small processor caches
are used (thus increasing the magnitude of capacity misses), at
least half of the applications in our study fall into this category.
The situation where COMA has a potential performance ad-
vantage is when the majority of cache misses are capacity
misses. Almost all such misses get serviced by the local at-
traction memory in COMA, in contrast to CC-NUMA where
they may have to go to a remote node. CC-NUMA can deliver
good performance only if a majority of the capacity misses are
serviced by local main memory.
We believe that it is possible for CC-NUMA machines to
get high hit rates in local memory for many applications that
access data in a “coarse-grained” manner. By coarse grained,
we mean applications where large chunks of data (greater than
page size) are primarily accessed by one process in the program
for significant periods of time. For example, many scientific
applications where large data arrays are statically partitioned
among the processes fall into this class. A specific example
is the Cholesky sparse factorization algorithm that we evaluate
later in the paper. In the Cholesky application, contiguous
columns with similar non-zero structure (called supernodes)
are assigned to various processors. If the placement is done
right for such applications, the local memory hit rates can be
very high.
Even if the locus of accesses to these large data chunks
changes from one process to another over time, automatic repli-
cation and migration algorithms implemented in the operating
system (possibly with hardware support) can ensure high hit
rates in local memory for CC-NUMA. In fact, we believe that
an interesting way to think of a CC-NUMA machine is as
a COMA machine where the line size for the main-memory
cache is the page size, which is not unreasonable since main
memory is very large, and where this main-memory cache is
managed in software. The latter is also not an unreasonable
policy decision because the very large line size helps hide the
overheads of managing the cache in software.
In applications where many small objects that are collocated
on a page are accessed in an interleaved manner by processes,
page-level placement/migration obviously cannot ensure high
local hit rates. An example is the Barnes-Hut application for
simulating N-body interactions (discussed in
Section 5). In this application multiple bodies that are accessed
by several processors reside on a single page and, as a result,
page placement does not help CC-NUMA.
In summary, we expect the relative performance of CC-
NUMA and COMA to be similar if the misses are few. When
the miss rate is high, we expect CC-NUMA to perform better
than COMA if coherence misses dominate; CC-NUMA to per-
form similar to COMA if the capacity misses dominate but the
data usage is coarse grained; and finally, COMA to perform
better than CC-NUMA if capacity misses dominate and data
usage is fine grained.
4 Experimental Methodology
This section presents the simulation environment, the architec-
tural models, and the benchmark applications we use to make
a quantitative comparison between CC-NUMA and COMA.
4.1 Simulation Environment
We use a simulated multiprocessor environment to study the
behavior of applications under CC-NUMA and COMA. The
simulation environment consists of two parts: (i) a functional
simulator that executes the parallel applications and (ii) the two
architectural simulators.
The functional simulator is based on Tango [5]. The Tango
system takes a parallel application program and interleaves the
execution of its processes on a uniprocessor to simulate a mul-
tiprocessor. This is achieved by associating a virtual timer
with each process of the application and by always running
the process with the lowest virtual time first. By letting the
architectural simulator update the virtual timers according to
the access time of each memory reference, a correct interleav-
ing of all memory references is maintained. The architectural
simulator takes care of references to shared data; instruction
fetches and private data references are assumed to always hit
in the processor cache.

[Figure 1: Prediction of relative performance between CC-NUMA and COMA based
on relative frequency of cache miss types and application data access pattern.
The decision tree reads as follows.
- Low miss rates: COMA and CC-NUMA should perform about the same. Examples are
applications that are blockable (e.g., blocked matrix multiply) and applications
with natural locality or where computation grows at a much faster rate than the
data set size (e.g., O(N²) algorithms for N-body problems, such as the Water
application).
- High miss rates, mostly coherence misses: COMA may have worse performance due
to its hierarchy. Misses can occur due to false sharing or due to true
communication between the processes (e.g., MP3D); they need to go to a remote
node to get serviced, and COMA suffers due to the hierarchy. Combining may help
in limited situations.
- High miss rates, mostly capacity misses, coarse-grained data access: CC-NUMA
can perform almost as well as COMA with page-level migration and/or
replication. By coarse grained, we mean applications that use large data
structures (greater than page size) that are coarsely shared between processes;
many scientific applications fall into this category, and the hit rate to local
memory can be increased by page placement policies supported by the user,
compiler, or OS (possibly with some hardware support). Examples are Cholesky
factorization using supernodes, Ocean simulation where domain decomposition is
exploited, and particle simulation algorithms with spatial chunking.
- High miss rates, mostly capacity misses, fine-grained data access: COMA has a
potential performance advantage and is expected to perform better. The data
accesses are finely interleaved, i.e., data objects being accessed by multiple
processing elements are collocated on the same page, so page placement policies
do not help. Applications that do not carefully partition data (e.g., bodies in
Barnes-Hut) or where data objects are small and dynamically linked fall into
this category.]
4.2 Architecture Simulators
Both architectures consist of a number of processing nodes
connected via low-latency networks. In our simulations we
assume a 16 processor configuration, with one processor per
processing node. (Figure 2(a) shows the organization of the
processing node.) We assume a cache line size of 16 bytes. In
the default configurations, we use a processor cache size of 4
Kbytes, and in the case of COMA, infinite attraction memories.
For COMA we assume a branching factor of 4, implying a
two-level hierarchy.
The reasons for choosing this rather small default processor
cache size are several. First, since the simulations are quite
slow, the data sets used by our applications are smaller than
what we may use on a real machine. As a result, if we were
to use full-size caches (say 256 Kbytes), then for some of the
applications all of the data would fit into the caches and we
would not get any capacity misses. This would take away all
of the advantages of the COMA model, and the results would
obviously not be interesting. Second, by using very small
cache sizes, we favor the COMA model, and our goal is to
see whether there are situations where CC-NUMA can still do
better. Third, 4 Kbytes is only the default, and we also present
results for larger cache sizes.
As for the use of infinite attraction
memories, our choice was motivated by the observation that the
capacity miss-rates are expected to be extremely small for the
attraction memories. As a result, the complexity of modeling
finite sized attraction memories did not seem justified.
We now focus on the latency of
cache misses in CC-NUMA
and COMA. To do this in a consistent manner, we have de-
fined a common set of primitive operations that are used to
construct protocol transactions for both architectures.

[Figure 2: Node organization (a) and latency calculations for local and remote
misses for CC-NUMA (b) and (c).]

We can
then choose latency values for these operations and use them
to evaluate the latency of memory operations. In the follow-
ing subsections, we describe reference latencies for the two
architectures in terms of these primitive operations. Table 1,
located at the end of this section, summarizes these primitive
operations and lists the default latency numbers, assuming a
processor clock rate of 100 MHz.
4.2.1 CC-NUMA Latencies
For CC-NUMA, a load request that hits in the cache incurs a
latency of a cache access, Tcache. For misses, the processor is
stalled for the full service latency of the load request.
In Figure 2(b) we depict the memory latency for a load

References (partial list)

- SPLASH: Stanford parallel applications for shared-memory.
- The cache performance and optimizations of blocked algorithms.
- The NYU Ultracomputer—Designing an MIMD Shared Memory Parallel Computer.
- The directory-based cache coherence protocol for the DASH multiprocessor.
- APRIL: a processor architecture for multiprocessing.