Comparative performance evaluation of cache-coherent NUMA and COMA architectures

Per Stenström, Truman Joe, Anoop Gupta
Vol. 20, Iss. 2, pp. 80-91

Comparative Performance Evaluation of
Cache-Coherent NUMA and COMA Architectures
Per Stenström†, Truman Joe, and Anoop Gupta
Computer Systems Laboratory
Stanford University, CA 94305
Abstract
Two interesting variations of large-scale shared-memory ma-
chines that have recently emerged are cache-coherent non-
uniform-memory-access machines (CC-NUMA) and cache-
only memory architectures (COMA). They both have dis-
tributed main memory and use directory-based cache coher-
ence. Unlike CC-NUMA, however, COMA machines auto-
matically migrate and replicate data at the main-memory level
in cache-line sized chunks. This paper compares the perfor-
mance of these two classes of machines. We first present a
qualitative model that shows that the relative performance is
primarily determined by two factors: the relative magnitude
of capacity misses versus coherence misses, and the granu-
larity of data partitions in the application. We then present
quantitative results using simulation studies for eight parallel
applications (including all six applications from the SPLASH
benchmark suite). We show that COMA’s potential for perfor-
mance improvement is limited to applications where data ac-
cesses by different processors are finely interleaved in memory
space and, in addition, where capacity misses dominate over
coherence misses. In other situations, for example where co-
herence misses dominate, COMA can actually perform worse
than CC-NUMA due to increased miss latencies caused by its
hierarchical directories. Finally, we propose a new architec-
tural alternative, called COMA-F, that combines the advantages
of both CC-NUMA and COMA.
1 Introduction
Large-scale multiprocessors with a single address-space and
coherent caches offer a flexible and powerful computing en-
vironment. The single address space and coherent caches
together ease the problem of data partitioning and dynamic
load balancing. They also provide better support for paral-
lelizing compilers, standard operating systems, and multipro-
gramming, thus enabling more flexible and effective use of the
† Per Stenström's address is Department of Computer Engineering,
Lund University, P.O. Box 118, S-221 00 LUND, Sweden.
machine. Currently, many research groups are pursuing the
design and construction of such multiprocessors [12, 1, 10].
As research has progressed in this area, two interesting vari-
ants have emerged, namely CC-NUMA (cache-coherent non-
uniform memory access machines) and COMA (cache-only
memory architectures).
Examples of the CC-NUMA ma-
chines are the Stanford DASH multiprocessor [12] and the MIT
Alewife machine [1], while examples of COMA machines are
the Swedish Institute of Computer Science’s Data Diffusion
Machine (DDM) [10] and Kendall Square Research’s KSR1
machine [4].
Common to both CC-NUMA and COMA machines are the
features of distributed main memory, scalable interconnection
network, and directory-based cache coherence.
Distributed
main memory and scalable interconnection networks are es-
sential in providing the required scalable memory bandwidth,
while directory-based schemes provide cache coherence with-
out requiring broadcast and consuming only a small fraction
of the system bandwidth. In contrast to CC-NUMA machines,
however, in COMA the per-node main memory is converted
into an enormous secondary/tertiary cache (called attraction
memory (AM) by the DDM group) by adding tags to cache-
line sized chunks in main memory. A consequence is that the
location of a data item in the machine is totally decoupled
from its physical address, and the data item is automatically
migrated or replicated in main memory depending on the mem-
ory reference pattern.
The main advantage of the COMA machines is that they
can reduce the average cache miss latency, since data are dy-
namically migrated and replicated at the main-memory level.
However, there are also several disadvantages. First, allowing
migration of data at the memory level requires a mechanism
to locate the data on a miss. To avoid broadcasting such re-
quests, current machines use a hierarchical directory structure,
which increases the miss latency for global requests. Second,
the coherence protocol is more complex because it needs to
ensure that the last copy of a data item is not replaced in
the attraction memory (main memory). Also, as compared to
CC-NUMA, there is additional complexity in the design of
the main-memory subsystem and in the interface to the disk
subsystem.
Even though CC-NUMA and COMA machines are being
built, so far no studies have been published that evaluate the
performance benefits of one machine model over the other.
Such a study is the focus of this paper. We note that the paper
focuses on the relative performance of the two machines, and

not on the hardware complexity. We do so because without a
good understanding of the performance benefits, it is difficult
to argue about what hardware complexity is justified.
The organization of the rest of the paper is as follows. In
the next section, we begin with detailed descriptions of CC-
NUMA and COMA machines. Then in Section 3, we present
a qualitative model that helps predict the relative performance
of applications on CC-NUMA and COMA machines. Section
4 presents the architectural assumptions and our simulation en-
vironment. It also presents the eight benchmark applications
used in our study, which include all six applications from the
SPLASH benchmark suite [14]. The performance results are
presented in Section 5. We show that COMA’s potential for
performance improvement is limited to applications where data
accesses by different processors are interleaved at a fine spa-
tial granularity and, in addition, where capacity misses dom-
inate over coherence misses. We also show that for applica-
tions which access data at a coarse granularity, CC-NUMA can
perform nearly as well as a COMA by exploiting page-level
placement or migration. Furthermore, when coherence misses
dominate, CC-NUMA often performs better than COMA. This
is due to the extra latency introduced by the hierarchical di-
rectory structure in COMA. In Section 6, we present a new
architectural alternative, called COMA-F (for COMA-FLAT),
that is shown to perform better than both regular CC-NUMA
and COMA. We finally conclude in Section 7.
2 CC-NUMA and COMA Machines
In this section we briefly present the organization of CC-
NUMA and COMA machines based on the Stanford DASH
multiprocessor [12] and the Swedish Institute of Computer Sci-
ence’s Data Diffusion Machine (DDM) [10]. We discuss the
basic architecture and the coherence protocols, directory struc-
ture and interconnection network requirements, and finally the
software model presented by the architectures.
2.1 CC-NUMA Architecture
A CC-NUMA machine consists of a number of processing
nodes connected through a high-bandwidth, low-latency inter-
connection network. Each processing node consists of a high-
performance processor, the associated cache, and a portion of
the global shared memory. Cache coherence is maintained by a
directory-based, write-invalidate cache coherence protocol. To
keep all caches consistent, each processing node has a direc-
tory memory corresponding to its portion of the shared phys-
ical memory. For each memory line (aligned memory block
that has the same size as a cache line), the directory memory
stores identities of remote nodes caching that line. Thus, us-
ing the directory, it is possible for a node writing a location
to send point-to-point messages to invalidate remote copies
of the corresponding cache line. Another important attribute
of the directory-based protocol is that it does not depend on
any specific interconnection network topology. Therefore, any
scalable network, such as a mesh, a hypercube, or a multi-stage
network, can be used to connect the processing nodes.
Handling a cache miss in a CC-NUMA machine requires
knowledge about the home node for the corresponding physical
address. The home node is the processing node from whose
main memory the data is allocated. (It is usually determined by
the high-order bits of the physical address.) If the local node
and the home node are the same, a cache miss can be serviced
by main memory in the local node. Otherwise, the miss is
forwarded to the remote home node. If the home node has a
clean copy, it returns the block to the requesting cache. (We
call this a 2-hop miss since it requires two network traversals:
one from the requesting node to the home node, and the
other back.) Otherwise, the read request is forwarded to the
node that has the dirty copy. This node returns the block to
the requesting node and also writes back the block to the home
node. (We call this a 3-hop miss, as it takes three network
traversals before the data is returned to the requesting node.)
When a processor issues a write request to a block that it does
not exclusively own, a read-exclusive request is sent to the
home node. The home node returns ownership and multicasts
invalidation requests to any other nodes that have a copy of the
block. Acknowledgments are returned directly to the issuing
node so as to indicate when the write operation has completed.
A write request to a dirty block is forwarded (the same way
as a read request) to the node containing the dirty copy, which
then returns the data.
The Stanford DASH multiprocessor [12] is an example of a
CC-NUMA machine. The prototype is to consist of 16 process-
ing nodes, each with 4 processors, for a total of 64 processors.
Each processor has a 64 Kbytes first-level and a 256 Kbytes
second-level cache. The interconnection network is a worm-
hole routed 2-D mesh network. The memory access latencies
for a cache hit, local memory access, 2-hop, and 3-hop read
misses are approximately 1, 30, 100, and 135 processor clocks
respectively.
The CC-NUMA software model allows processes to be at-
tached to specific processors and for data to be allocated from
any specific node's main memory. However, the granularity at
which data can be moved between different nodes' main mem-
ories (transparently to the application) is page sized chunks.
We note that the allocation and movement of data between
nodes may be done explicitly via code written by the applica-
tion programmer, or automatically by the operating system [3].
This is in contrast to the COMA machines where such migra-
tion and replication happens automatically at the granularity of
cache blocks.
2.2 COMA Architecture
Like CC-NUMA, a COMA machine consists of a number of
processing nodes connected by an interconnection network.
Each processing node has a high-performance processor, a
cache, and a portion of the global shared memory. The differ-
ence, however, is that the memory associated with each node
is augmented to act as a large cache, denoted attraction mem-
ory (AM) using DDM terminology [10]. Consistency among
cache blocks in the AMs is maintained using a write-invalidate
protocol. The AMs allow transparent migration and replication
of data items to nodes where they are referenced.
In a COMA machine, the AMs constitute the only memory
in the system (other than the disk subsystem). A consequence
is that the location of a memory block is totally decoupled from
its physical address. This creates several problems. First, when
a reference misses in the local AM, a mechanism is needed to
trace a copy of that block in some other node’s AM. Unlike
CC-NUMA, there is no notion of a home node for a block.
Second, some mechanism is needed to ensure that the last
81

copy of a block (possibly the only valid copy) is not purged.
To address the above problems and to maintain cache and
memory consistency, COMA machines use a hierarchical di-
rectory scheme and a corresponding hierarchical interconnec-
tion network (at least logically so¹). Each directory maintains
state information about all blocks stored in the subsystem be-
low. The state of a block is either exclusive in exactly one
node or shared in several nodes. Note that directories only
contain state information to reduce memory overhead; the data
themselves are not stored.
Upon a read miss, a read request locates the closest node
that has a copy of that block by propagating up the hierarchy
until a copy in state shared or exclusive is found. At that
point, it propagates down the hierarchy to a node that has the
copy. The node returns the block rdong the same path as the
request. (A directory read-m odify-tite needs to be done at
each intermediate directory rdong the path, both in the forward
and return directions.) Because of the hierarchical directory
structure, COMA machines can exploit combining [7]; if a
directory receives a read request to a block that is already
being fetched, it does not have to send the new request up the
hierarchy. When the reply comes back, both requesters are
supplied the data.
A write request to an unowned block propagates up the hier-
archy until a directory indicates that the copy is exclusive. This
directory, the root directory, multicasts invalidation requests to
all subsystems having a copy of the block and returns an ac-
knowledgement to the issuing processor.
As stated earlier, decoupling the home location of a block
from its address raises the issue of replacement in the AMs.
A shared block (i.e., one with multiple copies) that is being
replaced is not so difficult to handle. The system simply has to
realize that there exist other copies by going up the hierarchy.
Handling an exclusive block (the only copy of a block, whether
in clean or dirty state) is, however, more complex since it must
be transferred to another attraction memory. This is done by
letting it propagate up in the hierarchy until a directory finds
an empty or a shared block in its subsystem that can host the
block.
Examples of COMA machines include the Swedish Institute
of Computer Science’s DDM machine [10] and Kendall Square
Research’s KSR1 machine [4]. The processing nodes in DDM
are also clusters with multiple processors, as in DASH. How-
ever, the interconnect is a hierarchy of buses, in contrast to
the wormhole-routed grid in DASH. In KSR1, each process-
ing node consists of only a single processor. The intercomect
consists of a hierarchy of slotted ring networks.
In summary, by allowing individual memory blocks to be
migrated and replicated in attraction memories, COMA ma-
chines have the potential of reducing the number of cache
misses that need to be serviced remotely. However, because of
the hierarchy in COMA, latency for remote misses is usually
higher (except when combining is successful); this may offset
the advantages of the higher hit rates. We study these tradeoffs
qualitatively in the next section.
¹ Wallach and Dally are investigating a COMA implementation based
on a hierarchy embedded in a 3-dimensional mesh network [16].
3 Qualitative Comparison
In this section, we qualitatively evaluate the advantages and
disadvantages of the CC-NUMA and COMA models. In par-
ticular, we focus on application data access patterns that are
expected to cause one model to perform better than the other.
We show that the critical parameters are the relative magnitudes
of different miss types in the cache, and the spatial granularity
of access to shared data. We begin with a discussion of the
types of misses observed in shared-memory parallel programs,
and discuss the expected miss latencies for CC-NUMA and
COMA machines.
3.1 Miss Types and Expected Latencies
Since both CC-NUMA and COMA have private coherent
caches, presumably of the same size, the differences in per-
formance stem primarily because of differences in the miss
latencies. In shared-memory multiprocessors that use a write-
invalidate cache coherence protocol, cache misses may be clas-
sified into four types: cold misses, capacity misses, conflict
misses, and coherence misses.
A cold miss is the result of a block being accessed by the
processor for the first time. A capacity miss is a miss due to
the finite size of the processor cache and a conflict miss is due
to the limited associativity of the cache. For our discussion
below, we do not distinguish between conflict and capacity
misses since CC-NUMA and COMA respond to them in the
same way. We collectively refer to them as capacity misses.
A coherence miss (or an invalidation miss) is a miss to a block
that has been referenced before, but has been written by another
processor since the last time it was referenced by this processor.
Coherence misses include both false sharing misses as well as
true sharing misses [15]. False sharing misses result from write
references to data that are not shared but happen to reside in
the same cache line. True sharing misses are coherence misses
that would still exist even if the block size were one access
unit. They represent true communication between the multiple
processes in the application.
We now investigate how the two models respond to cache
misses of different types. Beginning with cold misses, we ex-
pect the average miss penalty to be higher for COMA, assum-
ing that data is distributed among the main memory modules
in the same way. The reason is simply that cold misses that are
not serviced locally have to traverse the directory and network
hierarchy in COMA. The only reason for a shorter latency for
COMA would be if combining worked particularly well for an
application.
For coherence misses, we again expect COMA to have
higher miss latencies as compared to CC-NUMA. The reason
is that the data is guaranteed not to be in the local attraction
memory for COMA, and therefore it will need to traverse the
directory hierarchy.² In contrast to CC-NUMA, the latency for
COMA can, of course, be shortened if combining is successful,
or if the communication is localized in the hierarchy.
Finally, for capacity misses, we expect to see shorter miss
latencies for COMA. Most such misses are expected to hit
in the local attraction memory since it is extremely large and
² We assume one processor per node. If several processors share the
same AM (or local memory), a coherence miss can sometimes be serviced
locally.

organized as a cache. In contrast, in CC-NUMA, unless data
referenced by a processor are carefully allocated to local main
memory, there is a high likelihood that a capacity miss will
have to be serviced by a remote node.
In summary, since there are some kinds of misses that are
serviced with lower latency by COMA and others that are ser-
viced with lower latency by CC-NUMA, the relative perfor-
mance of an application on COMA or CC-NUMA will depend
on what kinds of misses dominate.
3.2 Application Performance
In this subsection, we classify applications based on their data
access patterns, and the resulting cache miss behavior, to eval-
uate the relative advantages and disadvantages of COMA and
CC-NUMA. A summary of this classification is presented in
Figure 1. As illustrations, we use many applications that we
evaluate experimentally in Section 5.
On the left of the tree in Figure 1, we group all applications
that exhibit low cache miss rates. Linear algebra applications
that can be blocked, for example, fall into this category [11].
Other applications where computation grows at a much faster
rate than the data set size, e.g., the O(N²) algorithm used to
compute interactions between molecules in the Water applica-
tion [14], often also fall into this category. In such cases, since
the data sets are quite small, capacity misses are few and the
miss penalty has only a small impact on performance. Overall,
for these applications that exhibit a low miss rate, CC-NUMA
and COMA should perform about the same.
Looking at the right portion of the tree, for applications that
exhibit moderate to high miss rates, we differentiate between
applications where coherence misses dominate and those where
capacity misses dominate. Focusing first on the former, we
note that this class of applications is not that unusual. High
coherence misses can arise because an application programmer
(or compiler) may not have done a very good job of scheduling
tasks or partitioning data, or because the cache line size is too
large causing false sharing, or because solving the problem
actually requires such communication to happen. Usually, all
three factors are involved to varying degrees. In all of these
cases, CC-NUMA is expected to do better than COMA because
the remote misses take a shorter time to service. As we will
show in Section 5, even when very small processor caches
are used (thus increasing the magnitude of capacity misses), at
least half of the applications in our study fall into this category.
The situation where COMA has a potential performance ad-
vantage is when the majority of cache misses are capacity
misses. Almost all such misses get serviced by the local at-
traction memory in COMA, in contrast to CC-NUMA where
they may have to go to a remote node. CC-NUMA can deliver
good performance only if a majority of the capacity misses are
serviced by local main memory.
We believe that it is possible for CC-NUMA machines to
get high hit rates in local memory for many applications that
access data in a “coarse-grained” manner. By coarse grained,
we mean applications where large chunks of data (greater than
page size) are primarily accessed by one process in the program
for significant periods of time. For example, many scientific
applications where large data arrays are statically partitioned
among the processes fall into this class. A specific example
is the Cholesky sparse factorization algorithm that we evaluate
later in the paper. In the Cholesky application, contiguous
columns with similar non-zero structure (called supernodes)
are assigned to various processors. If the placement is done
right for such applications, the local memory hit rates can be
very high.
Even if the locus of accesses to these large data chunks
changes from one process to another over time, automatic repli-
cation and migration algorithms implemented in the operating
system (possibly with hardware support) can ensure high hit
rates in local memory for CC-NUMA. In fact, we believe that
an interesting way to think of a CC-NUMA machine is as
a COMA machine where the line size for the main-memory
cache is the page size, which is not unreasonable since main
memory is very large, and where this main-memory cache is
managed in software. The latter is also not an unreasonable
policy decision because the very large line size helps hide the
overheads of managing the cache in software.
In applications where many small objects that are collocated
on a page are accessed in an interleaved manner by processes,
page-level placement/migration obviously cannot ensure high
local hit rates. An example is the Barnes-Hut application for
simulating N-body interactions (discussed in
Section 5). In this application multiple bodies that are accessed
by several processors reside on a single page and, as a result,
page placement does not help CC-NUMA.
In summary, we expect the relative performance of CC-
NUMA and COMA to be similar if the misses are few. When
the miss rate is high, we expect CC-NUMA to perform better
than COMA if coherence misses dominate; CC-NUMA to per-
form similar to COMA if the capacity misses dominate but the
data usage is coarse grained; and finally, COMA to perform
better than CC-NUMA if capacity misses dominate and data
usage is fine grained.
4 Experimental Methodology
This section presents the simulation environment, the architec-
tural models, and the benchmark applications we use to make
a quantitative comparison between CC-NUMA and COMA.
4.1 Simulation Environment
We use a simulated multiprocessor environment to study the
behavior of applications under CC-NUMA and COMA. The
simulation environment consists of two parts: (i) a functional
simulator that executes the parallel applications and (ii) the two
architectural simulators.
The functional simulator is based on Tango [5]. The Tango
system takes a parallel application program and interleaves the
execution of its processes on a uniprocessor to simulate a mul-
tiprocessor. This is achieved by associating a virtual timer
with each process of the application and by always running
the process with the lowest virtual time first. By letting the
architectural simulator update the virtual timers according to
the access time of each memory reference, a correct interleav-
ing of all memory references is maintained. The architectural
simulator takes care of references to shared data; instruction
fetches and private data references are assumed to always hit
in the processor cache.

[Figure 1: Prediction of relative performance between CC-NUMA and COMA based
on relative frequency of cache miss types and application data access pattern.
The decision tree reads as follows.
- Low miss rates: COMA and CC-NUMA should perform about the same. Examples are
applications that are blockable (e.g., blocked matrix multiply) and applications
with natural locality or where computation grows at a much faster rate than the
data set size (e.g., O(N²) algorithms for N-body problems, such as the Water
application).
- High miss rates, mostly coherence misses: COMA may have worse performance due
to its hierarchy. Misses can occur due to false sharing or due to true
communication between the processes (e.g., MP3D); they need to go to a remote
node to get serviced, and COMA suffers due to the hierarchy. Combining may help
in limited situations.
- High miss rates, mostly capacity misses, coarse-grained data access: CC-NUMA
can perform almost as well as COMA with page-level migration and/or
replication. By coarse grained, we mean applications that use large data
structures (greater than page size) that are coarsely shared between processes;
many scientific applications fall into this category, and the hit rate to local
memory can be increased by page placement policies supported by the user,
compiler, or OS (possibly with some hardware support). Examples are Cholesky
factorization using supernodes, Ocean simulation where domain decomposition is
exploited, and particle simulation algorithms with spatial chunking.
- High miss rates, mostly capacity misses, fine-grained data access: COMA has a
potential performance advantage and is expected to perform better. The data
accesses are finely interleaved, i.e., data objects being accessed by multiple
processing elements are collocated on the same page, so page placement policies
do not help. Applications that do not carefully partition data (e.g., bodies in
Barnes-Hut) or where data objects are small and dynamically linked fall into
this category.]
4.2 Architecture Simulators
Both architectures consist of a number of processing nodes
connected via low-latency networks. In our simulations we
assume a 16 processor configuration, with one processor per
processing node. (Figure 2(a) shows the organization of the
processing node.) We assume a cache line size of 16 bytes. In
the default configurations, we use a processor cache size of 4
Kbytes, and in the case of COMA, infinite attraction memories.
For COMA we assume a branching factor of 4, implying a
two-level hierarchy.
The reasons for choosing this rather small default processor
cache size are several. First, since the simulations are quite
slow, the data sets used by our applications are smaller than
what we may use on a real machine. As a result, if we were
to use full-size caches (say 256 Kbytes), then for some of the
applications all of the data would fit into the caches and we
would not get any capacity misses. This would take away all
of the advantages of the COMA model, and the results would
obviously not be interesting. Second, by using very small
cache sizes, we favor the COMA model, and our goal is to
see whether there are situations where CC-NUMA can still do
better. Third, 4 Kbytes is only the default, and we also present
results for larger cache sizes.
As for the use of infinite attraction
memories, our choice was motivated by the observation that the
capacity miss-rates are expected to be extremely small for the
attraction memories. As a result, the complexity of modeling
finite sized attraction memories did not seem justified.
We now focus on the latency of
cache misses in CC-NUMA
and COMA. To do this in a consistent manner, we have de-
fined a common set of primitive operations that are used to
construct protocol transactions for both architectures.

[Figure 2: Node organization (a) and latency calculations for local and remote
misses for CC-NUMA (b) and (c).]

We can
then choose latency values for these operations and use them
to evaluate the latency of memory operations. In the follow-
ing subsections, we describe reference latencies for the two
architectures in terms of these primitive operations. Table 1,
located at the end of this section, summarizes these primitive
operations and lists the default latency numbers, assuming a
processor clock rate of 100 MHz.
4.2.1 CC-NUMA Latencies
For CC-NUMA, a load request that hits in the cache incurs a
latency of a cache access, Tcache. For misses, the processor is
stalled for the full service latency of the load request.
In Figure 2(b) we depict the memory latency for a load

References (partial list)

- SPLASH: Stanford parallel applications for shared-memory.
- The cache performance and optimizations of blocked algorithms.
- The NYU Ultracomputer—Designing an MIMD Shared Memory Parallel Computer.
- The directory-based cache coherence protocol for the DASH multiprocessor.
- APRIL: a processor architecture for multiprocessing.