Disaggregated Memory for Expansion and Sharing
in Blade Servers
Kevin Lim*, Jichuan Chang†, Trevor Mudge*, Parthasarathy Ranganathan†, Steven K. Reinhardt+*, Thomas F. Wenisch*
* University of Michigan, Ann Arbor ({ktlim,tnm,twenisch}@umich.edu)
† Hewlett-Packard Labs ({jichuan.chang,partha.ranganathan}@hp.com)
+ Advanced Micro Devices, Inc. (steve.reinhardt@amd.com)
ABSTRACT
Analysis of technology and application trends reveals a growing
imbalance in the peak compute-to-memory-capacity ratio for
future servers. At the same time, the fraction contributed by
memory systems to total datacenter costs and power consumption
during typical usage is increasing. In response to these trends, this
paper re-examines traditional compute-memory co-location on a
single system and details the design of a new general-purpose
architectural building block—a memory blade—that allows
memory to be "disaggregated" across a system ensemble. This
remote memory blade can be used for memory capacity expansion
to improve performance and for sharing memory across servers to
reduce provisioning and power costs. We use this memory blade
building block to propose two new system architecture
solutions—(1) page-swapped remote memory at the virtualization
layer, and (2) block-access remote memory with support in the
coherence hardware—that enable transparent memory expansion
and sharing on commodity-based systems. Using simulations of a
mix of enterprise benchmarks supplemented with traces from live
datacenters, we demonstrate that memory disaggregation can
provide substantial performance benefits (on average 10X) in
memory-constrained environments, while the sharing enabled by
our solutions can improve performance-per-dollar by up to 87%
when optimizing memory provisioning across multiple servers.
Categories and Subject Descriptors
C.0 [Computer System Designs]: General system
architectures; B.3.2 [Memory Structures]: Design Styles – primary memory, virtual memory.
General Terms
Design, Management, Performance.
Keywords
Memory capacity expansion, disaggregated memory, power and
cost efficiencies, memory blades.
1. INTRODUCTION
Recent trends point to the likely emergence of a new memory
wall—one of memory capacity—for future commodity systems.
On the demand side, current trends point to an increasing number of
cores per socket, with some studies predicting a two-fold increase
every two years [1]. Concurrently, we are likely to see an
increased number of virtual machines (VMs) per core (VMware
quotes 2-4X memory requirements from VM consolidation every
generation [2]), and increased memory footprint per VM (e.g., the
footprint of Microsoft® Windows® has been growing faster than
Moore’s Law [3]). However, from a supply point of view, the
International Technology Roadmap for Semiconductors (ITRS)
estimates that the pin count at a socket level is likely to remain
constant [4]. As a result, the number of channels per socket is
expected to be near-constant. In addition, the rate of growth in
DIMM density is starting to wane (2X every three years versus 2X
every two years), and the DIMM count per channel is declining
(e.g., two DIMMs per channel on DDR3 versus eight for DDR)
[5]. Figure 1(a) aggregates these trends to show historical and
extrapolated increases in processor computation and associated
memory capacity. The processor line shows the projected trend of
cores per socket, while the DRAM line shows the projected trend
of capacity per socket, given DRAM density growth and DIMM
per channel decline. If the trends continue, the growing imbalance
between supply and demand may lead to memory capacity per
core dropping by 30% every two years, particularly for
commodity solutions. If not addressed, future systems are likely to
be performance-limited by inadequate memory capacity.
At the same time, several studies show that the contribution of
memory to the total costs and power consumption of future
systems is trending higher from its current value of about 25%
[6][7][8]. Recent trends point to an interesting opportunity to
address these challenges—namely that of optimizing for the
ensemble [9]. For example, several studies have shown that there
is significant temporal variation in how resources like CPU time
or power are used across applications. We can expect similar
trends in memory usage based on variations across application
types, workload inputs, data characteristics, and traffic patterns.
Figure 1(b) shows how the memory allocated by TPC-H queries
can vary dramatically, and Figure 1(c) presents an eye-chart
illustration of the time-varying memory usage of 10 randomly-
chosen servers from a 1,000-CPU cluster used to render a recent
animated feature film [10]. Each line illustrates a server’s memory
usage varying from a low baseline when idle to the peak memory
usage of the application. Rather than provision each system for its
worst-case memory usage, a solution that provisions for the
typical usage, with the ability to dynamically add memory capacity across the ensemble, can reduce costs and power.

Figure 1: Motivating the need for memory extension and sharing. (a) On average, memory capacity per processor core is extrapolated to decrease 30% every two years. (b) The amount of granted memory for TPC-H queries can vary by orders of magnitude. (c) "Ensemble" memory usage trends over one month across 10 servers from a cluster used for animation rendering (one of the 3 datacenter traces used in this study).
Whereas some prior approaches (discussed in Section 2) can
alleviate some of these challenges individually, there is a need for
new architectural solutions that can provide transparent memory
capacity expansion to match computational scaling and
transparent memory sharing across collections of systems. In
addition, given recent trends towards commodity-based solutions
(e.g., [8][9][11]), it is important for these approaches to require at
most minor changes to ensure that the low-cost benefits of
commodity solutions not be undermined. The increased adoption
of blade servers with fast shared interconnection networks and
virtualization software creates the opportunity for new memory
system designs.
In this paper, we propose a new architectural building block to
provide transparent memory expansion and sharing for
commodity-based designs. Specifically, we revisit traditional
memory designs in which memory modules are co-located with
processors on a system board, restricting the configuration and
scalability of both compute and memory resources. Instead, we
argue for a disaggregated memory design that encapsulates an
array of commodity memory modules in a separate shared
memory blade that can be accessed, as needed, by multiple
compute blades via a shared blade interconnect.
We discuss the design of a memory blade and use it to propose
two new system architectures to achieve transparent expansion
and sharing. Our first solution requires no changes to existing
system hardware, using support at the virtualization layer to
provide page-level access to a memory blade across the standard
PCI Express® (PCIe®) interface. Our second solution proposes
minimal hardware support on every compute blade, but provides
finer-grained access to a memory blade across a coherent network
fabric for commodity software stacks.
We demonstrate the validity of our approach through simulations
of a mix of enterprise benchmarks supplemented with traces from
three live datacenter installations. Our results show that memory
disaggregation can provide significant performance benefits (on
average 10X) in memory-constrained environments. Additionally,
the sharing enabled by our solutions can enable large
improvements in performance-per-dollar (up to 87%) and greater
levels of consolidation (3X) when optimizing memory
provisioning across multiple servers.
The rest of the paper is organized as follows. Section 2 discusses
prior work. Section 3 presents our memory blade design and the
implementation of our proposed system architectures, which we
evaluate in Section 4. Section 5 discusses other tradeoffs and
designs, and Section 6 concludes.
2. RELATED WORK
A large body of prior work (e.g., [12][13][14][15][16][17][18])
has examined using remote servers’ memory for swap space
[12][16], file system caching [13][15], or RamDisks [14],
typically over conventional network interfaces (i.e., Ethernet).
These approaches do not fundamentally address the compute-to-
memory capacity imbalance: the total memory capacity relative to
compute is unchanged when all the servers need maximum
capacity at the same time. Additionally, although these
approaches can be used to provide sharing, they suffer from
significant limitations when targeting commodity-based systems.
In particular, these proposals may require substantial system
modifications, such as application-specific programming
interfaces [18] and protocols [14][17]; changes to the host
operating system and device drivers [12][13][14][16]; reduced
reliability in the face of remote server crashes [13][16]; and/or
impractical access latencies [14][17]. Our solutions target the
commodity-based volume server market and thus avoid invasive
changes to applications, operating systems, or server architecture.
Symmetric multiprocessors (SMPs) and distributed shared
memory systems (DSMs) [19][20][21][22][23][24][25][26][27]
allow all the nodes in a system to share a global address space.
However, like the network-based sharing approaches, these
designs do not target the compute-to-memory-capacity ratio.

Hardware shared-memory systems typically require specialized
interconnects and non-commodity components that add costs; in
addition, signaling, electrical, and design complexity increase
rapidly with system size. Software DSMs [24][25][26][27] can
avoid these costs by managing the operations to send, receive, and
maintain coherence in software, but come with practical
limitations to functionality, generality, software transparency,
total costs, and performance [28]. A recent commercial design in
this space, Versatile SMP [29], uses a virtualization layer to chain
together commodity x86 servers to provide the illusion of a single
larger system, but the current design requires specialized
motherboards, I/O devices, and non-commodity networking, and
there is limited documentation on performance benefits,
particularly with respect to software DSMs.
To increase memory capacity directly, researchers have proposed compressing memory contents [30][31] or
augmenting/replacing conventional DRAM with alternative
devices or interfaces. Recent startups like Virident [32] and Texas
Memory [33] propose the use of solid-state storage, such as
NAND Flash, to improve memory density albeit with higher
access latencies than conventional DRAM. From a technology
perspective, fully-buffered DIMMs [34] have the potential to
increase memory capacity but with significant trade-offs in power
consumption. 3D die-stacking [35] allows DRAM to be placed
on-chip as different layers of silicon; in addition to the open
architectural issues on how to organize 3D-stacked main memory,
this approach further constrains the extensibility of memory
capacity. Phase change memory (PCM) is emerging as a
promising alternative to increase memory density. However,
current PCM devices suffer from several drawbacks that limit
their straightforward use as a main memory replacement,
including high energy requirements, slow write latencies, and
finite endurance. In contrast to our work, none of these
approaches enable memory capacity sharing across nodes. In
addition, many of these alternatives provide only a one-time
improvement, thus delaying but failing to fundamentally address
the memory capacity wall.
A recent study [36] demonstrates the viability of a two-level
memory organization that can tolerate increased access latency
due to compression, heterogeneity, or network access to second-
level memory. However, that study does not discuss a commodity
implementation for x86 architectures or evaluate sharing across
systems. Our prior work [8] employs a variant of this two-level
memory organization as part of a broader demonstration of how
multiple techniques, including the choice of processors, new
packaging design, and use of Flash-based storage, can help
improve performance in warehouse computing environments. The
present paper follows up on our prior work by: (1) extending the
two-level memory design to support x86 commodity servers; (2)
presenting two new system architectures for accessing the remote
memory; and (3) evaluating the designs on a broad range of
workloads and real-world datacenter utilization traces.
As is evident from this discussion, there is currently no single
architectural approach that simultaneously addresses memory-to-
compute-capacity expansion and memory capacity sharing, and
does it in an application/OS-transparent manner on commodity-
based hardware and software. The next section describes our
approach to define such an architecture.
3. DISAGGREGATED MEMORY
Our approach is based on four observations: (1) The emergence of
blade servers with fast shared communication fabrics in the
enclosure enables separate blades to share resources across the
ensemble. (2) Virtualization provides a level of indirection that
can enable OS-and-application-transparent memory capacity
changes on demand. (3) Market trends towards commodity-based
solutions require special-purpose support to be limited to the non-
volume components of the solution. (4) The footprints of
enterprise workloads vary across applications and over time, but
current approaches to memory system design fail to leverage these
variations, resorting instead to worst-case provisioning.
Given these observations, our approach argues for a re-
examination of conventional designs that co-locate memory
DIMMs in conjunction with computation resources, connected
through conventional memory interfaces and controlled through
on-chip memory controllers. Instead, we argue for a
disaggregated multi-level design where we provision an
additional separate memory blade, connected at the I/O or
communication bus. This memory blade comprises arrays of
commodity memory modules assembled to maximize density and
cost-effectiveness, and provides extra memory capacity that can
be allocated on-demand to individual compute blades. We first
detail the design of a memory blade (Section 3.1), and then
discuss system architectures that can leverage this component for
transparent memory extension and sharing (Section 3.2).
3.1 Memory Blade Architecture
Figure 2(a) illustrates the design of our memory blade. The
memory blade comprises a protocol engine to interface with the
blade enclosure’s I/O backplane interconnect, a custom memory-
controller ASIC (or a light-weight CPU), and one or more
channels of commodity DIMM modules connected via on-board
repeater buffers or alternate fan-out techniques. The memory
controller handles requests from client blades to read and write
memory, and to manage capacity allocation and address mapping.
Optional memory-side accelerators can be added to support
functions like compression and encryption.
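To make the interface concrete, the sketch below shows one plausible layout for the request and response messages exchanged between a client blade's protocol engine and the memory blade over the backplane. The operation set, field names, and widths are our own illustrative assumptions; the paper does not specify a wire format.

#include <stdint.h>

enum mb_op { MB_READ, MB_WRITE, MB_ALLOC, MB_REVOKE };

struct mb_request {
    uint8_t  op;            /* one of enum mb_op */
    uint8_t  blade_id;      /* identifies the requesting compute blade */
    uint16_t length;        /* payload size in bytes (e.g., 64 for a cache block,
                               4096 for a page transfer) */
    uint64_t sma_addr;      /* address in the client's System Memory Address space */
    uint8_t  payload[];     /* write data, if any */
};

struct mb_response {
    uint8_t  status;        /* OK, permission fault, unmapped address, ... */
    uint16_t length;
    uint8_t  payload[];     /* read data, if any */
};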
Although the memory blade itself includes custom hardware, it
requires no changes to volume blade-server designs, as it connects
through standard I/O interfaces. Its costs are amortized over the
entire server ensemble. The memory blade design is
straightforward compared to a typical server blade, as it does not
have the cooling challenges of a high-performance CPU and does
not require local disk, Ethernet capability, or other elements (e.g.,
management processor, SuperIO, etc.). Client access latency is
dominated by the enclosure interconnect, which allows the
memory blade’s DRAM subsystem to be optimized for power and
capacity efficiency rather than latency. For example, the controller
can aggressively place DRAM pages into active power-down
mode, and can map consecutive cache blocks into a single
memory bank to minimize the number of active devices at the
expense of reduced single-client bandwidth. A memory blade can
also serve as a vehicle for integrating alternative memory
technologies, such as Flash or phase-change memory, possibly in
a heterogeneous combination with DRAM, without requiring
modification to the compute blades.
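The power and capacity trade-off described above can be illustrated with two alternative address-to-bank mappings, sketched in C below. The block, bank, and capacity parameters are assumptions chosen only to contrast conventional interleaving with a placement that keeps consecutive blocks in a single bank.

#include <stdint.h>

#define BLOCK_SHIFT   6                  /* 64-byte cache blocks */
#define NUM_BANKS     64                 /* assumed bank count across the blade */
#define BANK_CAPACITY (1ULL << 26)       /* 64 MB per bank (assumed) */

/* Conventional interleaving: consecutive blocks rotate across banks, which
 * maximizes bandwidth but activates many devices for a single client stream. */
static unsigned bank_interleaved(uint64_t rmma)
{
    return (unsigned)((rmma >> BLOCK_SHIFT) % NUM_BANKS);
}

/* Capacity/power-optimized placement: consecutive blocks stay within one bank,
 * so a single client touches few devices and the rest can remain powered down,
 * at the cost of reduced single-client bandwidth. */
static unsigned bank_packed(uint64_t rmma)
{
    return (unsigned)((rmma / BANK_CAPACITY) % NUM_BANKS);
}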
Figure 2: Design of the memory blade. (a) The memory blade connects to the compute blades via the enclosure backplane. (b) The data structures that support memory access and allocation/revocation operations.

To provide protection and isolation among shared clients, the memory controller translates each memory address accessed by a
client blade into an address local to the memory blade, called the
Remote Machine Memory Address (RMMA). In our design, each
client manages both local and remote physical memory within a
single System Memory Address (SMA) space. Local physical
memory resides at the bottom of this space, with remote memory
mapped at higher addresses. For example, if a blade has 2 GB of
local DRAM and has been assigned 6 GB of remote capacity, its
total SMA space extends from 0 to 8 GB. Each blade’s remote
SMA space is mapped to a disjoint portion of the RMMA space.
This process is illustrated in Figure 2(b). We manage the blade’s
memory in large chunks (e.g., 16 MB) so that the entire mapping
table can be kept in SRAM on the memory blade’s controller. For
example, a 512 GB memory blade managed in 16 MB chunks
requires only a 32K-entry mapping table. Using these “superpage”
mappings avoids complex, high-latency DRAM page table data
structures and custom TLB hardware. Note that providing shared-
memory communications among client blades (as in distributed
shared memory) is beyond the scope of this paper.
Allocation and revocation: The memory blade’s total capacity is
partitioned among the connected clients through the cooperation
of the virtual machine monitors (VMMs) running on the clients,
in conjunction with enclosure-, rack-, or datacenter-level
management software. The VMMs in turn are responsible for
allocating remote memory among the virtual machine(s) (VMs)
running on each client system. The selection of capacity allocation
policies, both among blades in an enclosure and among VMs on a
blade, is a broad topic that deserves separate study. Here we
restrict our discussion to designing the mechanisms for allocation
and revocation.
Allocation is straightforward: privileged management software on
the memory blade assigns one or more unused memory blade
superpages to a client, and sets up a mapping from the chosen
blade ID and SMA range to the appropriate RMMA range.
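Continuing the translation sketch above, allocation might then look as follows; the free-list representation and helper signature are again assumptions.

/* Sketch of superpage allocation on the memory blade. */
static int allocate_superpages(struct client_map *m,
                               uint32_t *rmma_free_list, uint32_t *free_count,
                               uint64_t num_superpages)
{
    if (*free_count < num_superpages)
        return -1;                    /* fully subscribed: revocation is needed first */
    for (uint64_t i = 0; i < num_superpages; i++) {
        /* Assumes the SRAM map has spare entries preallocated beyond 'limit'. */
        uint32_t sp = rmma_free_list[--(*free_count)];
        m->entries[m->limit + i].rmma_superpage = sp;
        m->entries[m->limit + i].permission     = 0x3;   /* read + write */
    }
    m->limit += num_superpages;       /* extends this client's remote SMA range */
    return 0;
}

Because the map is indexed by superpage, extending a client's allocation only appends entries and raises its limit.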
In the case where there are no unused superpages, some existing
mapping must be revoked so that memory can be reallocated. We
assume that capacity reallocation is a rare event compared to the
frequency of accessing memory using reads and writes.
Consequently, our design focuses primarily on correctness and
transparency and not performance.
When a client is allocated memory on a fully subscribed memory
blade, management software first decides which other clients must
give up capacity, then notifies the VMMs on those clients of the
amount of remote memory they must release. We propose two
general approaches for freeing pages. First, most VMMs already
provide paging support to allow a set of VMs to oversubscribe
local memory. This paging mechanism can be invoked to evict
local or remote pages. When a remote page is to be swapped out,
it is first transferred temporarily to an empty local frame and then
paged to disk. The remote page freed by this transfer is released
for reassignment.
Alternatively, many VMMs provide a “balloon driver” [37] within
the guest OS to allocate and pin memory pages, which are then
returned to the VMM. The balloon driver increases memory
pressure within the guest OS, forcing it to select pages for
eviction. This approach generally provides better results than the
VMM’s paging mechanisms, as the guest OS can make a more
informed decision about which pages to swap out and may simply
discard clean pages without writing them to disk. Because the
newly freed physical pages can be dispersed across both the local
and remote SMA ranges, the VMM may need to relocate pages
within the SMA space to free a contiguous 16 MB remote
superpage.
Once the VMMs have released their remote pages, the memory
blade mapping tables may be updated to reflect the new
allocation. We assume that the VMMs can generally be trusted to
release memory on request; the unlikely failure of a VMM to
release memory promptly indicates a serious error and can be
resolved by rebooting the client blade.
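The overall revocation flow can be summarized in the following sketch, where every helper function is a hypothetical hook standing in for the management-software and VMM actions described above.

#include <stdint.h>

/* Hypothetical hooks standing in for management-software and VMM actions. */
void notify_vmm_release(uint8_t blade_id, uint64_t superpages);
void wait_for_release_ack(uint8_t blade_id);
void unmap_and_reclaim(uint8_t blade_id, uint64_t superpages);

struct revoke_request { uint8_t blade_id; uint64_t superpages; };

/* Reallocation is assumed to be rare, so the flow favors correctness and
 * transparency over speed. */
void revoke_capacity(struct revoke_request *victims, int n)
{
    /* 1. Ask each victim blade's VMM to release remote capacity. The VMM frees
     *    pages via its paging mechanism or a guest balloon driver, relocating
     *    pages within its SMA space until whole 16 MB superpages are empty. */
    for (int i = 0; i < n; i++)
        notify_vmm_release(victims[i].blade_id, victims[i].superpages);

    for (int i = 0; i < n; i++) {
        /* 2. Wait for confirmation; a VMM that never responds is treated as a
         *    serious error and its client blade is rebooted (not shown). */
        wait_for_release_ack(victims[i].blade_id);
        /* 3. Only then update the mapping table and return the superpages to
         *    the RMMA free list for reassignment. */
        unmap_and_reclaim(victims[i].blade_id, victims[i].superpages);
    }
}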
3.2 System Architecture with Memory Blades
Whereas our memory-blade design enables several alternative
system architectures, we discuss two specific designs, one based
on page swapping and another using fine-grained remote access.
In addition to providing more detailed examples, these designs
also illustrate some of the tradeoffs in the multi-dimensional
design space for memory blades. Most importantly, they compare
the method and granularity of access to the remote blade (page-
based versus block-based) and the interconnect fabric used for
communication (PCI Express versus HyperTransport).
3.2.1 Page-Swapping Remote Memory (PS)
Our first design avoids any hardware changes to the high-volume
compute blades or enclosure; the memory blade itself is the only
non-standard component. This constraint implies a conventional
I/O backplane interconnect, typically PCIe. This basic design is
illustrated in Figure 3(a).
Figure 3: Page-swapping remote memory system design. (a) No changes are required to compute servers and networking on existing blade designs. Our solution adds minor modules (shaded block) to the virtualization layer. (b) The address mapping design places the extended capacity at the top of the address space.

Because CPUs in a conventional system cannot access cacheable memory across a PCIe connection, the system must bring locations into the client blade's local physical memory before they
can be accessed. We leverage standard virtual-memory
mechanisms to detect accesses to remote memory and relocate the
targeted locations to local memory on a page granularity. In
addition to enabling the use of virtual memory support, page-
based transfers exploit locality in the client’s access stream and
amortize the overhead of PCIe memory transfers.
To avoid modifications to application and OS software, we
implement this page management in the VMM. The VMM detects
accesses to remote data pages and swaps those data pages to local
memory before allowing a load or store to proceed.
Figure 3(b) illustrates our page management scheme. Recall that,
when remote memory capacity is assigned to a specific blade, we
extend the SMA (machine physical address) space at that blade to
provide local addresses for the additional memory. The VMM
assigns pages from this additional address space to guest VMs,
where they will in turn be assigned to the guest OS or to
applications. However, because these pages cannot be accessed
directly by the CPU, the VMM cannot set up valid page-table
entries for these addresses. It instead tracks the pages by using
“poisoned” page table entries without their valid bits set or by
tracking the mappings outside of the page tables (similar
techniques have been used to prototype hybrid memory in
VMWare [38]). In either case, a direct CPU access to remote
memory will cause a page fault and trap into the VMM. On such a
trap, the VMM initiates a page swap operation. This simple OS-
transparent memory-to-memory page swap should not be
confused with OS-based virtual memory swapping (paging to
swap space), which is orders of magnitude slower and involves an
entirely different set of sophisticated data structures and
algorithms.
In our design, we assume page swapping is performed on a 4 KB
granularity, a common page size used by operating systems. Page
swaps logically appear to the VMM as a swap from high SMA
addresses (beyond the end of local memory) to low addresses
(within local memory). To decouple the swap of a remote page to
local memory and eviction of a local page to remote memory, we
maintain a pool of free local pages for incoming swaps. The
software fault handler thus allocates a page from the local free list
and initiates a DMA transfer over the PCIe channel from the
remote memory blade. The transfer is performed synchronously
(i.e., the execution thread is stalled during the transfer, but other
threads may execute). Once the transfer is complete, the fault
handler updates the page table entry to point to the new, local
SMA address and puts the prior remote SMA address into a pool
of remote addresses that are currently unused.
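A minimal sketch of this fault-handling path, written as hypothetical VMM code in C, is shown below; the helper functions and data structures are assumptions, not the actual hypervisor interface.

#include <stdint.h>

#define PAGE_SIZE 4096ULL   /* 4 KB pages, as assumed by the PS design */

/* Hypothetical VMM helpers. */
uint64_t local_free_pop(void);                 /* pool of free local page frames */
void     remote_free_push(uint64_t sma);       /* pool of currently unused remote SMAs */
void     pcie_dma_read(uint64_t remote_sma, uint64_t local_sma, uint64_t len);
void     set_pte(uint64_t guest_pa, uint64_t sma, int valid);

/* Invoked when a CPU access traps on a "poisoned" mapping that refers to a
 * remote SMA address (beyond the end of local memory). */
void handle_remote_fault(uint64_t guest_pa, uint64_t remote_sma)
{
    uint64_t local_sma = local_free_pop();     /* decoupled from eviction */

    /* Synchronous DMA over PCIe from the memory blade into the local frame;
     * the faulting thread stalls, but other threads continue to execute. */
    pcie_dma_read(remote_sma, local_sma, PAGE_SIZE);

    set_pte(guest_pa, local_sma, 1);           /* now a normal, valid local mapping */
    remote_free_push(remote_sma);              /* remote slot is free for later evictions */
}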
To maintain an adequate supply of free local pages, the VMM
must occasionally evict local pages to remote memory, effectively
performing the second half of the logical swap operation. The
VMM selects a high SMA address from the remote page free list
and initiates a DMA transfer from a local page to the remote
memory blade. When complete, the local page is unmapped and
placed on the local free list. Eviction operations are performed
asynchronously, and do not stall the CPU unless a conflicting
access to the in-flight page occurs during eviction.
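A companion sketch for the eviction path, reusing the assumed helpers above, might look like this:

/* More hypothetical helpers, continuing the sketch above. */
uint64_t remote_free_pop(void);
uint64_t pick_victim_page(uint64_t *guest_pa_out);   /* VMM replacement policy */
void     local_free_push(uint64_t sma);
void     pcie_dma_write(uint64_t local_sma, uint64_t remote_sma, uint64_t len);

/* Runs asynchronously to keep the local free pool topped up. */
void evict_one_page(void)
{
    uint64_t victim_pa;
    uint64_t local_sma  = pick_victim_page(&victim_pa);
    uint64_t remote_sma = remote_free_pop();

    set_pte(victim_pa, remote_sma, 0);         /* poison the mapping first so any
                                                  conflicting access traps and waits */
    pcie_dma_write(local_sma, remote_sma, PAGE_SIZE);
    local_free_push(local_sma);                /* frame is ready for future swap-ins */
}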
3.2.2 Fine-Grained Remote Memory Access (FGRA)
The previous solution avoids any hardware changes to the
commodity compute blade, but at the expense of trapping to the
VMM and transferring full pages on every remote memory access.
In our second approach, we examine the effect of a few minimal
hardware changes to the high-volume compute blade to enable an
alternate design that has higher performance potential. In
particular, this design allows CPUs on the compute blade to
access remote memory directly at cache-block granularity.
Our approach leverages the glueless SMP support found in
current processors. For example, AMD Opteron processors
have up to three coherent HyperTransport links coming out of
the socket. Our design, shown in Figure 4, uses custom hardware
on the compute blade to redirect cache fill requests to the remote
memory blade. Although it does require custom hardware, the
changes to enable our FGRA design are relatively straightforward
adaptations of current coherent memory controller designs.
This hardware, labeled “Coherence filter” in Figure 4, serves two
purposes. First, it selectively forwards only necessary coherence
protocol requests to the remote memory blade. For example,
because the remote blade does not contain any caches, the
coherence filter can respond immediately to invalidation requests.
Only memory read and write requests require processing at the
remote memory blade. In the terminology of glueless x86

Figure 4: Fine-grained remote memory access system design. This design assumes minor coherence hardware support in every compute blade.
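As a rough illustration of the coherence filter's role, the sketch below shows the kind of decision logic it might apply; the message types and local/remote boundary check are generic assumptions rather than the actual HyperTransport protocol handling.

#include <stdint.h>

enum coh_msg { COH_READ, COH_WRITE, COH_INVALIDATE, COH_PROBE };

enum filter_action {
    FORWARD_TO_MEMORY_BLADE,   /* only reads and writes need the remote blade */
    ANSWER_LOCALLY,            /* blade has no caches, so ack probes/invalidations here */
    HANDLE_ON_COMPUTE_BLADE    /* address falls in the compute blade's local DRAM */
};

static enum filter_action coherence_filter(enum coh_msg msg, uint64_t sma,
                                           uint64_t local_sma_limit)
{
    if (sma < local_sma_limit)
        return HANDLE_ON_COMPUTE_BLADE;        /* ordinary on-board memory access */

    switch (msg) {
    case COH_READ:
    case COH_WRITE:
        return FORWARD_TO_MEMORY_BLADE;        /* cache-block fill or writeback */
    case COH_INVALIDATE:
    case COH_PROBE:
    default:
        return ANSWER_LOCALLY;                 /* respond immediately, no remote traffic */
    }
}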

References
The Landscape of Parallel Computing Research: A View from Berkeley.
Memory Resource Management in VMware ESX Server.
Memory Coherence in Shared Virtual Memory Systems.
Web Search for a Planet: The Google Cluster Architecture.
The Stanford Dash Multiprocessor.