Disaggregated Memory for Expansion and Sharing
in Blade Servers
Kevin Lim*, Jichuan Chang†, Trevor Mudge*, Parthasarathy Ranganathan†, Steven K. Reinhardt+*, Thomas F. Wenisch*
* University of Michigan, Ann Arbor ({ktlim,tnm,twenisch}@umich.edu)
† Hewlett-Packard Labs ({jichuan.chang,partha.ranganathan}@hp.com)
+ Advanced Micro Devices, Inc. (steve.reinhardt@amd.com)
ABSTRACT
Analysis of technology and application trends reveals a growing
imbalance in the peak compute-to-memory-capacity ratio for
future servers. At the same time, the fraction contributed by
memory systems to total datacenter costs and power consumption
during typical usage is increasing. In response to these trends, this
paper re-examines traditional compute-memory co-location on a
single system and details the design of a new general-purpose
architectural building block—a memory blade—that allows
memory to be "disaggregated" across a system ensemble. This
remote memory blade can be used for memory capacity expansion
to improve performance and for sharing memory across servers to
reduce provisioning and power costs. We use this memory blade
building block to propose two new system architecture
solutions—(1) page-swapped remote memory at the virtualization
layer, and (2) block-access remote memory with support in the
coherence hardware—that enable transparent memory expansion
and sharing on commodity-based systems. Using simulations of a
mix of enterprise benchmarks supplemented with traces from live
datacenters, we demonstrate that memory disaggregation can
provide substantial performance benefits (on average 10X) in
memory-constrained environments, while the sharing enabled by
our solutions can improve performance-per-dollar by up to 87%
when optimizing memory provisioning across multiple servers.
Categories and Subject Descriptors
C.0 [Computer System Designs]: General system
architectures; B.3.2 [Memory Structures]: Design Styles – primary memory, virtual memory.
General Terms
Design, Management, Performance.
Keywords
Memory capacity expansion, disaggregated memory, power and
cost efficiencies, memory blades.
1. INTRODUCTION
Recent trends point to the likely emergence of a new memory
wall—one of memory capacity—for future commodity systems.
On the demand side, current trends point to an increasing number of
cores per socket, with some studies predicting a two-fold increase
every two years [1]. Concurrently, we are likely to see an
increased number of virtual machines (VMs) per core (VMware
quotes 2-4X memory requirements from VM consolidation every
generation [2]), and increased memory footprint per VM (e.g., the
footprint of Microsoft® Windows® has been growing faster than
Moore’s Law [3]). However, from a supply point of view, the
International Technology Roadmap for Semiconductors (ITRS)
estimates that the pin count at a socket level is likely to remain
constant [4]. As a result, the number of channels per socket is
expected to be near-constant. In addition, the rate of growth in
DIMM density is starting to wane (2X every three years versus 2X
every two years), and the DIMM count per channel is declining
(e.g., two DIMMs per channel on DDR3 versus eight for DDR)
[5]. Figure 1(a) aggregates these trends to show historical and
extrapolated increases in processor computation and associated
memory capacity. The processor line shows the projected trend of
cores per socket, while the DRAM line shows the projected trend
of capacity per socket, given DRAM density growth and DIMM
per channel decline. If the trends continue, the growing imbalance
between supply and demand may lead to memory capacity per
core dropping by 30% every two years, particularly for
commodity solutions. If not addressed, future systems are likely to
be performance-limited by inadequate memory capacity.
At the same time, several studies show that the contribution of
memory to the total costs and power consumption of future
systems is trending higher from its current value of about 25%
[6][7][8]. Recent trends point to an interesting opportunity to
address these challenges—namely that of optimizing for the
ensemble [9]. For example, several studies have shown that there
is significant temporal variation in how resources like CPU time
or power are used across applications. We can expect similar
trends in memory usage based on variations across application
types, workload inputs, data characteristics, and traffic patterns.
Figure 1(b) shows how the memory allocated by TPC-H queries
can vary dramatically, and Figure 1(c) presents an eye-chart
illustration of the time-varying memory usage of 10 randomly-
chosen servers from a 1,000-CPU cluster used to render a recent
animated feature film [10]. Each line illustrates a server’s memory
usage varying from a low baseline when idle to the peak memory
usage of the application. Rather than provision each system for its
worst-case memory usage, a solution that provisions for the
typical usage, with the ability to dynamically add memory capacity across the ensemble, can reduce costs and power.

Figure 1: Motivating the need for memory extension and sharing. (a) On average, memory capacity per processor core is extrapolated to decrease 30% every two years. (b) The amount of granted memory for TPC-H queries can vary by orders of magnitude. (c) "Ensemble" memory usage trends over one month across 10 servers from a cluster used for animation rendering (one of the 3 datacenter traces used in this study).
Whereas some prior approaches (discussed in Section 2) can
alleviate some of these challenges individually, there is a need for
new architectural solutions that can provide transparent memory
capacity expansion to match computational scaling and
transparent memory sharing across collections of systems. In
addition, given recent trends towards commodity-based solutions
(e.g., [8][9][11]), it is important for these approaches to require at
most minor changes to ensure that the low-cost benefits of
commodity solutions not be undermined. The increased adoption
of blade servers with fast shared interconnection networks and
virtualization software creates the opportunity for new memory
system designs.
In this paper, we propose a new architectural building block to
provide transparent memory expansion and sharing for
commodity-based designs. Specifically, we revisit traditional
memory designs in which memory modules are co-located with
processors on a system board, restricting the configuration and
scalability of both compute and memory resources. Instead, we
argue for a disaggregated memory design that encapsulates an
array of commodity memory modules in a separate shared
memory blade that can be accessed, as needed, by multiple
compute blades via a shared blade interconnect.
We discuss the design of a memory blade and use it to propose
two new system architectures to achieve transparent expansion
and sharing. Our first solution requires no changes to existing
system hardware, using support at the virtualization layer to
provide page-level access to a memory blade across the standard
PCI Express® (PCIe®) interface. Our second solution proposes
minimal hardware support on every compute blade, but provides
finer-grained access to a memory blade across a coherent network
fabric for commodity software stacks.
We demonstrate the validity of our approach through simulations
of a mix of enterprise benchmarks supplemented with traces from
three live datacenter installations. Our results show that memory
disaggregation can provide significant performance benefits (on
average 10X) in memory-constrained environments. Additionally,
the sharing enabled by our solutions can enable large
improvements in performance-per-dollar (up to 87%) and greater
levels of consolidation (3X) when optimizing memory
provisioning across multiple servers.
The rest of the paper is organized as follows. Section 2 discusses
prior work. Section 3 presents our memory blade design and the
implementation of our proposed system architectures, which we
evaluate in Section 4. Section 5 discusses other tradeoffs and
designs, and Section 6 concludes.
2. RELATED WORK
A large body of prior work (e.g., [12][13][14][15][16][17][18])
has examined using remote servers’ memory for swap space
[12][16], file system caching [13][15], or RamDisks [14],
typically over conventional network interfaces (i.e., Ethernet).
These approaches do not fundamentally address the compute-to-
memory capacity imbalance: the total memory capacity relative to
compute is unchanged when all the servers need maximum
capacity at the same time. Additionally, although these
approaches can be used to provide sharing, they suffer from
significant limitations when targeting commodity-based systems.
In particular, these proposals may require substantial system
modifications, such as application-specific programming
interfaces [18] and protocols [14][17]; changes to the host
operating system and device drivers [12][13][14][16]; reduced
reliability in the face of remote server crashes [13][16]; and/or
impractical access latencies [14][17]. Our solutions target the
commodity-based volume server market and thus avoid invasive
changes to applications, operating systems, or server architecture.
Symmetric multiprocessors (SMPs) and distributed shared
memory systems (DSMs) [19][20][21][22][23][24][25][26][27]
allow all the nodes in a system to share a global address space.
However, like the network-based sharing approaches, these
designs do not target the compute-to-memory-capacity ratio.

Hardware shared-memory systems typically require specialized
interconnects and non-commodity components that add costs; in
addition, signaling, electrical, and design complexity increase
rapidly with system size. Software DSMs [24][25][26][27] can
avoid these costs by managing the operations to send, receive, and
maintain coherence in software, but come with practical
limitations to functionality, generality, software transparency,
total costs, and performance [28]. A recent commercial design in
this space, Versatile SMP [29], uses a virtualization layer to chain
together commodity x86 servers to provide the illusion of a single
larger system, but the current design requires specialized
motherboards, I/O devices, and non-commodity networking, and
there is limited documentation on performance benefits,
particularly with respect to software DSMs.
To increase memory capacity directly, researchers have proposed compressing memory contents [30][31] or
augmenting/replacing conventional DRAM with alternative
devices or interfaces. Recent startups like Virident [32] and Texas
Memory [33] propose the use of solid-state storage, such as
NAND Flash, to improve memory density albeit with higher
access latencies than conventional DRAM. From a technology
perspective, fully-buffered DIMMs [34] have the potential to
increase memory capacity but with significant trade-offs in power
consumption. 3D die-stacking [35] allows DRAM to be placed
on-chip as different layers of silicon; in addition to the open
architectural issues on how to organize 3D-stacked main memory,
this approach further constrains the extensibility of memory
capacity. Phase change memory (PCM) is emerging as a
promising alternative to increase memory density. However,
current PCM devices suffer from several drawbacks that limit
their straightforward use as a main memory replacement,
including high energy requirements, slow write latencies, and
finite endurance. In contrast to our work, none of these
approaches enable memory capacity sharing across nodes. In
addition, many of these alternatives provide only a one-time
improvement, thus delaying but failing to fundamentally address
the memory capacity wall.
A recent study [36] demonstrates the viability of a two-level
memory organization that can tolerate increased access latency
due to compression, heterogeneity, or network access to second-
level memory. However, that study does not discuss a commodity
implementation for x86 architectures or evaluate sharing across
systems. Our prior work [8] employs a variant of this two-level
memory organization as part of a broader demonstration of how
multiple techniques, including the choice of processors, new
packaging design, and use of Flash-based storage, can help
improve performance in warehouse computing environments. The
present paper follows up on our prior work by: (1) extending the
two-level memory design to support x86 commodity servers; (2)
presenting two new system architectures for accessing the remote
memory; and (3) evaluating the designs on a broad range of
workloads and real-world datacenter utilization traces.
As is evident from this discussion, there is currently no single
architectural approach that simultaneously addresses memory-to-
compute-capacity expansion and memory capacity sharing, and
does it in an application/OS-transparent manner on commodity-
based hardware and software. The next section describes our
approach to define such an architecture.
3. DISAGGREGATED MEMORY
Our approach is based on four observations: (1) The emergence of
blade servers with fast shared communication fabrics in the
enclosure enables separate blades to share resources across the
ensemble. (2) Virtualization provides a level of indirection that
can enable OS-and-application-transparent memory capacity
changes on demand. (3) Market trends towards commodity-based
solutions require special-purpose support to be limited to the non-
volume components of the solution. (4) The footprints of
enterprise workloads vary across applications and over time, but
current approaches to memory system design fail to leverage these
variations, resorting instead to worst-case provisioning.
Given these observations, our approach argues for a re-
examination of conventional designs that co-locate memory
DIMMs in conjunction with computation resources, connected
through conventional memory interfaces and controlled through
on-chip memory controllers. Instead, we argue for a
disaggregated multi-level design where we provision an
additional separate memory blade, connected at the I/O or
communication bus. This memory blade comprises arrays of
commodity memory modules assembled to maximize density and
cost-effectiveness, and provides extra memory capacity that can
be allocated on-demand to individual compute blades. We first
detail the design of a memory blade (Section 3.1), and then
discuss system architectures that can leverage this component for
transparent memory extension and sharing (Section 3.2).
3.1 Memory Blade Architecture
Figure 2(a) illustrates the design of our memory blade. The
memory blade comprises a protocol engine to interface with the
blade enclosure’s I/O backplane interconnect, a custom memory-
controller ASIC (or a light-weight CPU), and one or more
channels of commodity DIMM modules connected via on-board
repeater buffers or alternate fan-out techniques. The memory
controller handles requests from client blades to read and write
memory, and to manage capacity allocation and address mapping.
Optional memory-side accelerators can be added to support
functions like compression and encryption.
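To make the interface concrete, the sketch below shows one plausible layout for the request and response messages exchanged between a client blade's protocol engine and the memory blade over the backplane. The operation set, field names, and widths are our own illustrative assumptions; the paper does not specify a wire format.

#include <stdint.h>

enum mb_op { MB_READ, MB_WRITE, MB_ALLOC, MB_REVOKE };

struct mb_request {
    uint8_t  op;            /* one of enum mb_op */
    uint8_t  blade_id;      /* identifies the requesting compute blade */
    uint16_t length;        /* payload size in bytes (e.g., 64 for a cache block,
                               4096 for a page transfer) */
    uint64_t sma_addr;      /* address in the client's System Memory Address space */
    uint8_t  payload[];     /* write data, if any */
};

struct mb_response {
    uint8_t  status;        /* OK, permission fault, unmapped address, ... */
    uint16_t length;
    uint8_t  payload[];     /* read data, if any */
};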
Although the memory blade itself includes custom hardware, it
requires no changes to volume blade-server designs, as it connects
through standard I/O interfaces. Its costs are amortized over the
entire server ensemble. The memory blade design is
straightforward compared to a typical server blade, as it does not
have the cooling challenges of a high-performance CPU and does
not require local disk, Ethernet capability, or other elements (e.g.,
management processor, SuperIO, etc.). Client access latency is
dominated by the enclosure interconnect, which allows the
memory blade’s DRAM subsystem to be optimized for power and
capacity efficiency rather than latency. For example, the controller
can aggressively place DRAM pages into active power-down
mode, and can map consecutive cache blocks into a single
memory bank to minimize the number of active devices at the
expense of reduced single-client bandwidth. A memory blade can
also serve as a vehicle for integrating alternative memory
technologies, such as Flash or phase-change memory, possibly in
a heterogeneous combination with DRAM, without requiring
modification to the compute blades.
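The power and capacity trade-off described above can be illustrated with two alternative address-to-bank mappings, sketched in C below. The block, bank, and capacity parameters are assumptions chosen only to contrast conventional interleaving with a placement that keeps consecutive blocks in a single bank.

#include <stdint.h>

#define BLOCK_SHIFT   6                  /* 64-byte cache blocks */
#define NUM_BANKS     64                 /* assumed bank count across the blade */
#define BANK_CAPACITY (1ULL << 26)       /* 64 MB per bank (assumed) */

/* Conventional interleaving: consecutive blocks rotate across banks, which
 * maximizes bandwidth but activates many devices for a single client stream. */
static unsigned bank_interleaved(uint64_t rmma)
{
    return (unsigned)((rmma >> BLOCK_SHIFT) % NUM_BANKS);
}

/* Capacity/power-optimized placement: consecutive blocks stay within one bank,
 * so a single client touches few devices and the rest can remain powered down,
 * at the cost of reduced single-client bandwidth. */
static unsigned bank_packed(uint64_t rmma)
{
    return (unsigned)((rmma / BANK_CAPACITY) % NUM_BANKS);
}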
Figure 2: Design of the memory blade. (a) The memory blade connects to the compute blades via the enclosure backplane. (b) The data structures that support memory access and allocation/revocation operations.

To provide protection and isolation among shared clients, the memory controller translates each memory address accessed by a
client blade into an address local to the memory blade, called the
Remote Machine Memory Address (RMMA). In our design, each
client manages both local and remote physical memory within a
single System Memory Address (SMA) space. Local physical
memory resides at the bottom of this space, with remote memory
mapped at higher addresses. For example, if a blade has 2 GB of
local DRAM and has been assigned 6 GB of remote capacity, its
total SMA space extends from 0 to 8 GB. Each blade’s remote
SMA space is mapped to a disjoint portion of the RMMA space.
This process is illustrated in Figure 2(b). We manage the blade’s
memory in large chunks (e.g., 16 MB) so that the entire mapping
table can be kept in SRAM on the memory blade’s controller. For
example, a 512 GB memory blade managed in 16 MB chunks
requires only a 32K-entry mapping table. Using these “superpage”
mappings avoids complex, high-latency DRAM page table data
structures and custom TLB hardware. Note that providing shared-
memory communications among client blades (as in distributed
shared memory) is beyond the scope of this paper.
Allocation and revocation: The memory blade’s total capacity is
partitioned among the connected clients through the cooperation
of the virtual machine monitors (VMMs) running on the clients,
in conjunction with enclosure-, rack-, or datacenter-level
management software. The VMMs in turn are responsible for
allocating remote memory among the virtual machine(s) (VMs)
running on each client system. The selection of capacity allocation
policies, both among blades in an enclosure and among VMs on a
blade, is a broad topic that deserves separate study. Here we
restrict our discussion to designing the mechanisms for allocation
and revocation.
Allocation is straightforward: privileged management software on
the memory blade assigns one or more unused memory blade
superpages to a client, and sets up a mapping from the chosen
blade ID and SMA range to the appropriate RMMA range.
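Continuing the translation sketch above, allocation might then look as follows; the free-list representation and helper signature are again assumptions.

/* Sketch of superpage allocation on the memory blade. */
static int allocate_superpages(struct client_map *m,
                               uint32_t *rmma_free_list, uint32_t *free_count,
                               uint64_t num_superpages)
{
    if (*free_count < num_superpages)
        return -1;                    /* fully subscribed: revocation is needed first */
    for (uint64_t i = 0; i < num_superpages; i++) {
        /* Assumes the SRAM map has spare entries preallocated beyond 'limit'. */
        uint32_t sp = rmma_free_list[--(*free_count)];
        m->entries[m->limit + i].rmma_superpage = sp;
        m->entries[m->limit + i].permission     = 0x3;   /* read + write */
    }
    m->limit += num_superpages;       /* extends this client's remote SMA range */
    return 0;
}

Because the map is indexed by superpage, extending a client's allocation only appends entries and raises its limit.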
In the case where there are no unused superpages, some existing
mapping must be revoked so that memory can be reallocated. We
assume that capacity reallocation is a rare event compared to the
frequency of accessing memory using reads and writes.
Consequently, our design focuses primarily on correctness and
transparency and not performance.
When a client is allocated memory on a fully subscribed memory
blade, management software first decides which other clients must
give up capacity, then notifies the VMMs on those clients of the
amount of remote memory they must release. We propose two
general approaches for freeing pages. First, most VMMs already
provide paging support to allow a set of VMs to oversubscribe
local memory. This paging mechanism can be invoked to evict
local or remote pages. When a remote page is to be swapped out,
it is first transferred temporarily to an empty local frame and then
paged to disk. The remote page freed by this transfer is released
for reassignment.
Alternatively, many VMMs provide a “balloon driver” [37] within
the guest OS to allocate and pin memory pages, which are then
returned to the VMM. The balloon driver increases memory
pressure within the guest OS, forcing it to select pages for
eviction. This approach generally provides better results than the
VMM’s paging mechanisms, as the guest OS can make a more
informed decision about which pages to swap out and may simply
discard clean pages without writing them to disk. Because the
newly freed physical pages can be dispersed across both the local
and remote SMA ranges, the VMM may need to relocate pages
within the SMA space to free a contiguous 16 MB remote
superpage.
Once the VMMs have released their remote pages, the memory
blade mapping tables may be updated to reflect the new
allocation. We assume that the VMMs can generally be trusted to
release memory on request; the unlikely failure of a VMM to
release memory promptly indicates a serious error and can be
resolved by rebooting the client blade.
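The overall revocation flow can be summarized in the following sketch, where every helper function is a hypothetical hook standing in for the management-software and VMM actions described above.

#include <stdint.h>

/* Hypothetical hooks standing in for management-software and VMM actions. */
void notify_vmm_release(uint8_t blade_id, uint64_t superpages);
void wait_for_release_ack(uint8_t blade_id);
void unmap_and_reclaim(uint8_t blade_id, uint64_t superpages);

struct revoke_request { uint8_t blade_id; uint64_t superpages; };

/* Reallocation is assumed to be rare, so the flow favors correctness and
 * transparency over speed. */
void revoke_capacity(struct revoke_request *victims, int n)
{
    /* 1. Ask each victim blade's VMM to release remote capacity. The VMM frees
     *    pages via its paging mechanism or a guest balloon driver, relocating
     *    pages within its SMA space until whole 16 MB superpages are empty. */
    for (int i = 0; i < n; i++)
        notify_vmm_release(victims[i].blade_id, victims[i].superpages);

    for (int i = 0; i < n; i++) {
        /* 2. Wait for confirmation; a VMM that never responds is treated as a
         *    serious error and its client blade is rebooted (not shown). */
        wait_for_release_ack(victims[i].blade_id);
        /* 3. Only then update the mapping table and return the superpages to
         *    the RMMA free list for reassignment. */
        unmap_and_reclaim(victims[i].blade_id, victims[i].superpages);
    }
}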
3.2 System Architecture with Memory Blades
Whereas our memory-blade design enables several alternative
system architectures, we discuss two specific designs, one based
on page swapping and another using fine-grained remote access.
In addition to providing more detailed examples, these designs
also illustrate some of the tradeoffs in the multi-dimensional
design space for memory blades. Most importantly, they compare
the method and granularity of access to the remote blade (page-
based versus block-based) and the interconnect fabric used for
communication (PCI Express versus HyperTransport).
3.2.1 Page-Swapping Remote Memory (PS)
Our first design avoids any hardware changes to the high-volume
compute blades or enclosure; the memory blade itself is the only
non-standard component. This constraint implies a conventional
I/O backplane interconnect, typically PCIe. This basic design is
illustrated in Figure 3(a).
Figure 3: Page-swapping remote memory system design. (a) No changes are required to compute servers and networking on existing blade designs. Our solution adds minor modules (shaded block) to the virtualization layer. (b) The address mapping design places the extended capacity at the top of the address space.

Because CPUs in a conventional system cannot access cacheable memory across a PCIe connection, the system must bring locations into the client blade's local physical memory before they
can be accessed. We leverage standard virtual-memory
mechanisms to detect accesses to remote memory and relocate the
targeted locations to local memory on a page granularity. In
addition to enabling the use of virtual memory support, page-
based transfers exploit locality in the client’s access stream and
amortize the overhead of PCIe memory transfers.
To avoid modifications to application and OS software, we
implement this page management in the VMM. The VMM detects
accesses to remote data pages and swaps those data pages to local
memory before allowing a load or store to proceed.
Figure 3(b) illustrates our page management scheme. Recall that,
when remote memory capacity is assigned to a specific blade, we
extend the SMA (machine physical address) space at that blade to
provide local addresses for the additional memory. The VMM
assigns pages from this additional address space to guest VMs,
where they will in turn be assigned to the guest OS or to
applications. However, because these pages cannot be accessed
directly by the CPU, the VMM cannot set up valid page-table
entries for these addresses. It instead tracks the pages by using
“poisoned” page table entries without their valid bits set or by
tracking the mappings outside of the page tables (similar
techniques have been used to prototype hybrid memory in
VMWare [38]). In either case, a direct CPU access to remote
memory will cause a page fault and trap into the VMM. On such a
trap, the VMM initiates a page swap operation. This simple OS-
transparent memory-to-memory page swap should not be
confused with OS-based virtual memory swapping (paging to
swap space), which is orders of magnitude slower and involves an
entirely different set of sophisticated data structures and
algorithms.
In our design, we assume page swapping is performed on a 4 KB
granularity, a common page size used by operating systems. Page
swaps logically appear to the VMM as a swap from high SMA
addresses (beyond the end of local memory) to low addresses
(within local memory). To decouple the swap of a remote page to
local memory and eviction of a local page to remote memory, we
maintain a pool of free local pages for incoming swaps. The
software fault handler thus allocates a page from the local free list
and initiates a DMA transfer over the PCIe channel from the
remote memory blade. The transfer is performed synchronously
(i.e., the execution thread is stalled during the transfer, but other
threads may execute). Once the transfer is complete, the fault
handler updates the page table entry to point to the new, local
SMA address and puts the prior remote SMA address into a pool
of remote addresses that are currently unused.
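A minimal sketch of this fault-handling path, written as hypothetical VMM code in C, is shown below; the helper functions and data structures are assumptions, not the actual hypervisor interface.

#include <stdint.h>

#define PAGE_SIZE 4096ULL   /* 4 KB pages, as assumed by the PS design */

/* Hypothetical VMM helpers. */
uint64_t local_free_pop(void);                 /* pool of free local page frames */
void     remote_free_push(uint64_t sma);       /* pool of currently unused remote SMAs */
void     pcie_dma_read(uint64_t remote_sma, uint64_t local_sma, uint64_t len);
void     set_pte(uint64_t guest_pa, uint64_t sma, int valid);

/* Invoked when a CPU access traps on a "poisoned" mapping that refers to a
 * remote SMA address (beyond the end of local memory). */
void handle_remote_fault(uint64_t guest_pa, uint64_t remote_sma)
{
    uint64_t local_sma = local_free_pop();     /* decoupled from eviction */

    /* Synchronous DMA over PCIe from the memory blade into the local frame;
     * the faulting thread stalls, but other threads continue to execute. */
    pcie_dma_read(remote_sma, local_sma, PAGE_SIZE);

    set_pte(guest_pa, local_sma, 1);           /* now a normal, valid local mapping */
    remote_free_push(remote_sma);              /* remote slot is free for later evictions */
}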
To maintain an adequate supply of free local pages, the VMM
must occasionally evict local pages to remote memory, effectively
performing the second half of the logical swap operation. The
VMM selects a high SMA address from the remote page free list
and initiates a DMA transfer from a local page to the remote
memory blade. When complete, the local page is unmapped and
placed on the local free list. Eviction operations are performed
asynchronously, and do not stall the CPU unless a conflicting
access to the in-flight page occurs during eviction.
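A companion sketch for the eviction path, reusing the assumed helpers above, might look like this:

/* More hypothetical helpers, continuing the sketch above. */
uint64_t remote_free_pop(void);
uint64_t pick_victim_page(uint64_t *guest_pa_out);   /* VMM replacement policy */
void     local_free_push(uint64_t sma);
void     pcie_dma_write(uint64_t local_sma, uint64_t remote_sma, uint64_t len);

/* Runs asynchronously to keep the local free pool topped up. */
void evict_one_page(void)
{
    uint64_t victim_pa;
    uint64_t local_sma  = pick_victim_page(&victim_pa);
    uint64_t remote_sma = remote_free_pop();

    set_pte(victim_pa, remote_sma, 0);         /* poison the mapping first so any
                                                  conflicting access traps and waits */
    pcie_dma_write(local_sma, remote_sma, PAGE_SIZE);
    local_free_push(local_sma);                /* frame is ready for future swap-ins */
}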
3.2.2 Fine-Grained Remote Memory Access (FGRA)
The previous solution avoids any hardware changes to the
commodity compute blade, but at the expense of trapping to the
VMM and transferring full pages on every remote memory access.
In our second approach, we examine the effect of a few minimal
hardware changes to the high-volume compute blade to enable an
alternate design that has higher performance potential. In
particular, this design allows CPUs on the compute blade to
access remote memory directly at cache-block granularity.
Our approach leverages the glueless SMP support found in
current processors. For example, AMD Opteron processors
have up to three coherent HyperTransport links coming out of
the socket. Our design, shown in Figure 4, uses custom hardware
on the compute blade to redirect cache fill requests to the remote
memory blade. Although it does require custom hardware, the
changes to enable our FGRA design are relatively straightforward
adaptations of current coherent memory controller designs.
This hardware, labeled “Coherence filter” in Figure 4, serves two
purposes. First, it selectively forwards only necessary coherence
protocol requests to the remote memory blade. For example,
because the remote blade does not contain any caches, the
coherence filter can respond immediately to invalidation requests.
Only memory read and write requests require processing at the
remote memory blade. In the terminology of glueless x86

Figure 4: Fine-grained remote memory access system design. This design assumes minor coherence hardware support in every compute blade.
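As a rough illustration of the coherence filter's role, the sketch below shows the kind of decision logic it might apply; the message types and local/remote boundary check are generic assumptions rather than the actual HyperTransport protocol handling.

#include <stdint.h>

enum coh_msg { COH_READ, COH_WRITE, COH_INVALIDATE, COH_PROBE };

enum filter_action {
    FORWARD_TO_MEMORY_BLADE,   /* only reads and writes need the remote blade */
    ANSWER_LOCALLY,            /* blade has no caches, so ack probes/invalidations here */
    HANDLE_ON_COMPUTE_BLADE    /* address falls in the compute blade's local DRAM */
};

static enum filter_action coherence_filter(enum coh_msg msg, uint64_t sma,
                                           uint64_t local_sma_limit)
{
    if (sma < local_sma_limit)
        return HANDLE_ON_COMPUTE_BLADE;        /* ordinary on-board memory access */

    switch (msg) {
    case COH_READ:
    case COH_WRITE:
        return FORWARD_TO_MEMORY_BLADE;        /* cache-block fill or writeback */
    case COH_INVALIDATE:
    case COH_PROBE:
    default:
        return ANSWER_LOCALLY;                 /* respond immediately, no remote traffic */
    }
}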

References
The Landscape of Parallel Computing Research: A View from Berkeley.
Memory Resource Management in VMware ESX Server.
Memory Coherence in Shared Virtual Memory Systems.
Web Search for a Planet: The Google Cluster Architecture.
The Stanford Dash Multiprocessor.