Scalable Locality-Conscious Multithreaded Memory Allocation
Schneider, S., Antonopoulos, C. D., & Nikolopoulos, D. S. (2006). Scalable Locality-Conscious Multithreaded Memory Allocation. In Proceedings of the 2006 ACM SIGPLAN International Symposium on Memory Management (ISMM) (pp. 84-94). ACM. https://doi.org/10.1145/1133956.1133968

Scalable Locality-Conscious Multithreaded Memory Allocation
Scott Schneider
Department of Computer Science
College of William and Mary
scotts@cs.wm.edu
Christos D. Antonopoulos
Department of Computer Science
College of William and Mary
cda@cs.wm.edu
Dimitrios S. Nikolopoulos
Department of Computer Science
College of William and Mary
dsn@cs.wm.edu
Abstract
We present Streamflow, a new multithreaded memory manager
designed for low overhead, high-performance memory allocation
while transparently favoring locality. Streamflow enables low over-
head simultaneous allocation by multiple threads and adapts to se-
quential allocation at speeds comparable to that of custom sequen-
tial allocators. It favors the transparent exploitation of temporal and
spatial object access locality, and reduces allocator-induced cache
conflicts and false sharing, all using a unified design based on seg-
regated heaps. Streamflow introduces an innovative design which
uses only synchronization-free operations in the most common case
of local allocations and deallocations, while requiring minimal,
non-blocking synchronization in the less common case of remote
deallocations. Spatial locality at the cache and page level is favored
by eliminating small object headers, reducing allocator-induced
conflicts via contiguous allocation of page blocks in physical mem-
ory, reducing allocator-induced false sharing by using segregated
heaps and achieving better TLB performance and fewer page faults
via the use of superpages. Combining these locality optimizations
with the drastic reduction of synchronization and latency over-
head allows Streamflow to perform comparably with optimized se-
quential allocators and outperform—on a shared-memory system
with four two-way SMT processors—four state-of-the-art multi-
processor allocators by sizeable margins in our experiments. The
allocation-intensive sequential and parallel benchmarks used in our
experiments represent a variety of behaviors, including mostly lo-
cal object allocation-deallocation patterns and producer-consumer
allocation-deallocation patterns.
Categories and Subject Descriptors D.4.2 [Operating Systems]:
Storage Management—Allocation/deallocation strategies; D.3.3
[Programming Languages]: Language Constructs and Features—
Dynamic storage management; D.4.1 [Operating Systems]: Pro-
cess Management—Concurrency, Deadlocks, Synchronization,
Threads; D.1.3 [Programming Techniques]: Concurrent Program-
ming
General Terms Algorithms, Management, Performance
Keywords memory management, multithreading, shared mem-
ory, synchronization-free, non-blocking
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. To copy otherwise, to republish, to post on servers or to redistribute
to lists, requires prior specific permission and/or a fee.
ISMM’06, June 10–11, 2006, Ottawa, Ontario, Canada.
Copyright © 2006 ACM 1-59593-221-6/06/0006...$5.00.
1. Introduction
Efficient dynamic memory allocation is essential for desktop,
server and scientific applications [27]. As more of these appli-
cations use thread-level parallelism to exploit multiprocessors and
emerging processors with multiple cores and threads, scalable mul-
tiprocessor memory allocation becomes of paramount importance.
Dynamic memory allocation can negatively affect performance
by adding overhead during allocation and deallocation operations,
and by exacerbating object access latency due to poor locality.
Therefore, effective memory allocators need to be optimized for
both low allocation overhead and good object access locality. Scal-
ability and synchronization overhead reduction have been the cen-
tral considerations in the context of thread-safe memory allocators
[3, 18], while locality has been the focal point of the design of se-
quential memory allocators for more than a decade [11].
Multiprocessor allocators add synchronization overhead on the
critical path of all allocations and deallocations. Synchronization
is needed because a thread may need to access another thread’s
heap in order to remotely release an object to the owning thread.
Since such operations may be initiated concurrently by multiple
threads, synchronization is used to arbitrate thread accesses to the
data structures used for managing the heaps. Therefore, local heaps
need to be protected with locks or updated atomically with read-
modify-write operations such as cmp&swap. The vast majority of
thread-safe allocators use object headers [3, 9, 15, 18, 25], which
facilitate object deallocation in local heaps but pollute the cache
in codes that allocate many small objects.
Locality-conscious sequential allocators segregate objects of
different sizes to different page blocks allocated from the operating
system [7]. Objects are allocated by merely bumping a pointer and
no additional information is stored with each object. In general,
the allocation order of objects does not necessarily match their
access pattern. However, contiguous allocation of small objects
works well in practice because eliminating object headers helps
avoid fragmentation and cache pollution.
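To make the bump-pointer scheme concrete, here is a minimal sketch in C of headerless allocation inside a segregated page block. The names (page_block, unallocated) are illustrative, not taken from any particular allocator's sources.

#include <stddef.h>

/* One page block serves a single object size class. */
typedef struct page_block {
    char   *unallocated;  /* bump pointer: next never-used byte  */
    char   *end;          /* one past the last byte of the block */
    size_t  object_size;  /* shared by all objects in the block  */
} page_block;

/* Allocate one object by bumping a pointer. No per-object header
 * is written, so same-sized objects are densely packed, which is
 * exactly what preserves cache and page locality. */
static void *bump_alloc(page_block *pb)
{
    if (pb->unallocated + pb->object_size > pb->end)
        return NULL;                     /* block exhausted */
    void *obj = pb->unallocated;
    pb->unallocated += pb->object_size;
    return obj;
}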
Efficient, thread-safe memory allocators use local heaps to re-
duce contention between threads. The use of local heaps helps a
multiprocessor allocator avoid false sharing, since threads tend to
allocate and deallocate most of their objects locally [3]. At a lower
level, page block allocation and recycling policies in thread-safe
allocators are primarily concerned with fragmentation and blowup,
without necessarily accounting for locality [3].
The design space of thread-safe allocators that achieve both
good scalability and good data locality merits further investigation.
It is natural to consider combining scalable synchronization mech-
anisms (such as lock-free management of heaps) with locality-
conscious object allocation mechanisms (such as segregated heaps
with headerless objects). Although the two design considerations of
locality and scalability may seem orthogonal and complementary
at first glance, combining them in a unified design is not merely an
engineering effort. Several problems and trade-offs arise in an at-

tempt to integrate scalable concurrent allocation mechanisms with
cache- and page-conscious object allocation mechanisms in a uni-
fied design. Addressing these problems is a central contribution of
this paper. We show that both memory management overhead and
locality exploitation in thread-safe memory allocators can be im-
proved, compared to what is currently offered by state-of-the-art
multiprocessor allocators. These design improvements and the as-
sociated performance benefits are also a key contribution of this
paper.
We present Streamflow, a thread-safe allocator designed for
both scalability and locality. Streamflow’s design is a direct re-
sult of eliminating synchronization operations in the common case,
while at the same time avoiding the memory blowup that occurs when strictly
thread-local heaps are used in codes with producer-consumer
allocation-freeing patterns. Local operations in Streamflow are
synchronization-free. Not only do these operations proceed without
thread contention due to locking shared data, but they also proceed
without the latency imposed by uncontested locks and atomic in-
structions. The synchronization-free design of local heaps enables
Streamflow to exploit established sequential allocation optimiza-
tions which are critical for locality, such as eliminating object
headers for small objects and using bump-pointer allocation in
page blocks comprising thread-local heaps.
Streamflow also includes an innovative remote object deallo-
cation mechanism. Remote deallocations, namely deallocations
of objects from threads other than the ones that initially al-
located them, are decoupled from local allocations and deallo-
cations by forwarding remotely freed objects to per-thread, non-
blocking, lock-free lists. Streamflow’s remote deallocation mecha-
nism enables lazy object reclamation from the owning thread. As
a result, most allocation and deallocation operations proceed with-
out the cost of atomic instructions, and the infrequent operations
that do require atomic instructions are non-blocking, lock-free and
provably fast under various producer-consumer object allocation-
freeing patterns.
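To make this concrete, the sketch below shows, in C11 atomics, the kind of non-blocking LIFO push a remote deallocation can perform on a per-page-block list of remotely freed objects. The names are illustrative rather than Streamflow's own, and a production version must also consider the ABA problem.

#include <stdatomic.h>

/* The freed object's own first word is reused as the link field,
 * so the remote list costs no extra space. */
typedef struct remote_node { struct remote_node *next; } remote_node;

static void remote_free_push(_Atomic(remote_node *) *remotely_freed,
                             void *object)
{
    remote_node *node = object;
    remote_node *head = atomic_load(remotely_freed);
    do {
        node->next = head;   /* link in front of the current head */
    } while (!atomic_compare_exchange_weak(remotely_freed,
                                           &head, node));
}

The owning thread can later detach the whole list with a single atomic exchange and reclaim the objects lazily, keeping atomic instructions entirely off the common allocation and deallocation paths.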
Streamflow’s design favors locality at multiple levels. Beyond
reducing memory management overhead and latency, decoupling
local and remote operations promotes temporal locality by allowing
threads to favor locally recycled objects in their private heaps. The
use of thread-local heaps reduces allocator-induced false sharing.
Removing object headers improves spatial locality within cache
lines and page blocks. The integration with a lower level custom
page manager which utilizes superpages [19, 20] avoids allocator-
induced cache conflicts via contiguous allocation of page blocks in
physical memory, and reduces the activity of the OS page manager,
the number of page faults and the rate of TLB misses. Combining
these techniques produces a memory allocator that consistently
outperforms other multithreaded allocators in experiments with up
to 8 threads on a 4-processor system with Hyperthreaded Xeon
processors. Streamflow, by design, also adapts well to sequential
codes and performs competitively with optimized sequential and
application-specific allocators.
This paper makes the following contributions:
• We present a new thread-safe dynamic memory manager which
bridges the design space between allocators focused on locality
and allocators focused on scalability. To our knowledge, this
is the first time a memory allocator efficiently unifies locality
considerations with multiprocessor scalability.
• We present a new method for eliminating (in the common case)
and minimizing (in the uncommon case) synchronization over-
head in multiprocessor memory allocators. Our method decou-
ples remote and local free lists and uses a new non-blocking re-
mote object deallocation mechanism. This technique preserves
the desirable properties of a multiprocessor memory allocator,
namely blowup avoidance and false sharing avoidance, without
sacrificing the locality and low latency benefits of bump-pointer
allocation.
• We present memory allocation and deallocation schemes that
take into account cache-conscious layout of heaps, page- and
TLB-locality. To our knowledge, Streamflow is the first mul-
tiprocessor allocator designed with multilevel and multigrain
locality considerations.
• We demonstrate the performance advantages of our design
using realistic sequential and multithreaded applications as
well as synthesized benchmarks. Streamflow outperforms four
widely used, state-of-the-art multiprocessor allocators in alloca-
tion-intensive parallel applications. It also performs compara-
bly to optimized sequential allocators in allocation-intensive
sequential applications. Streamflow exhibits solid performance
improvements both in codes with mostly local object allocation-
freeing patterns and codes with producer-consumer object
allocation-freeing patterns. We have experimented with an SMP
with four two-way SMT processors¹. Such SMPs are popular
as commercial server platforms, affordable high-performance
computing platforms for scientific problems, and building
blocks for large-scale supercomputers. Since Streamflow elim-
inates (in the common case) or significantly reduces (in the
uncommon case) synchronization, the key scalability-limiting
factor of multithreaded memory managers, we expect it to be
scalable and efficient on larger shared-memory multiprocessors
as well.
The rest of this paper is organized as follows. Section 2 dis-
cusses related work. Section 3 presents the major design princi-
ples, mechanisms and policies of Streamflow. Section 4 presents
our experimental evaluation of Streamflow alongside other multi-
processor allocators and some optimized sequential allocators. In
Section 5 we discuss some implications of the design of Stream-
flow and potential future improvements. Section 6 summarizes the
paper.
2. Related Work
Streamflow includes elements adopted from efficient sequential
memory allocators proposed in the past. Streamflow’s segregated
heap storage and BIBOP (big bag of pages)-style allocation derive
from an allocation scheme originally proposed by Guy Steele
in [24] and from the concept of independently managed mem-
ory zones which dates back to 1967 [21]. Segregated heap storage
has since been used in numerous allocators, including the standard
GNU C allocator in Linux [16], an older GNU allocator [11], Vmal-
loc [26], and more recent allocators such as Reaps [4] and Vam [7].
Modern allocators tend to adopt segregated heaps because they en-
able very fast allocation. Deallocation in segregated heap allocators
is more intricate, because in order to comply with the semantics of
free(), the allocator needs to be able to discover internally the
size of each deallocated object, using the object pointer as its only
input. Deallocation is simple if each object has a header pointing
to the base of the heap segment from where the object was allo-
cated. This technique is used, for example, in the GNU C allocator
and in Reaps [4,16]. However, object headers introduce fragmenta-
tion, pollute caches, and eventually penalize codes with many small
object allocations. Therefore, locality-conscious allocators such as
PHKmalloc [12] and Vam [7] eliminate object headers entirely for
small objects and use tables of free lists to manage released space
in segregated heaps. Elimination of headers is common practice in
custom memory allocators [4], as well as semi-custom allocators
¹ This is the largest shared-memory system we have direct access to.

with alternate semantics for free(), such as region-based alloca-
tors [8].
Streamflow uses segregated object allocation in thread-private
heaps, as in several other thread-safe allocators including Hoard
[3], Maged Michael’s lock-free memory allocator [18], Tcmalloc
from Google’s performance tools [10], LKmalloc [15], ptmalloc
[9], and Vee and Hsu’s allocator [25]. In particular, Streamflow
uses strictly thread-local object allocation, both thread-local and
remote deallocation, and mechanisms for recycling free page blocks
to avoid false sharing and memory blowup [3, 18].
Streamflow differs from earlier multithreaded memory alloca-
tors in several critical aspects: First, its design decouples local
from remote object deallocation to allow local allocation and deal-
location without any atomic instructions. Atomic instructions are
used only sparingly for remote object deallocation and for recy-
cling page blocks. Second, Streamflow eliminates object headers
for small objects, thereby reducing cache pollution and improv-
ing spatial locality. Tcmalloc is the only thread-safe allocator we
are aware of that uses the same technique, although Tcmalloc uses
locks whenever memory has to be allocated from or returned to a
global pool of free memory objects. Third, Streamflow uses further
optimizations for temporal locality, cache-conscious page block
layout and better TLB performance. Fourth, unlike many other high
performance allocators, Streamflow allows returning memory to
the OS when the footprint of the application shrinks.
To our knowledge, Streamflow is the first user-level memory
allocator to control the layout of page blocks in physical memory,
using superpages as the means to achieve contiguous allocation
of each page block in physical memory. It should be noted that
superpages are a generic optimization tool and their scope extends
beyond just memory allocators [6, 19]. However, since superpages
(the size of which is set by the operating system) may subsume
multiple page blocks (the size of which is set by the memory
allocator), a multiprocessor memory allocator that uses superpages to
achieve cache-conscious layout of page blocks faces certain design choices
as to how it manages free memory inside each superpage and how
it divides superpages between page blocks from different threads.
Streamflow’s design includes some educated choices for effective
management and utilization of superpages.
Several of the design goals of Streamflow, in particular its local-
ity optimizations, can be achieved with allocators that utilize feed-
back from program profiles. For example, earlier work has shown
that object lifetime predictors and reference traces can be used to
customize small object allocation and object segregation [2, 22, 23].
Streamflow assumes no knowledge of object allocation and access
profiles, although its design does not prevent the addition of profile-
guided optimization.
3. Design of Streamflow
Streamflow primarily optimizes dynamic allocation of small ob-
jects, which is a common bottleneck in many sequential and mul-
tithreaded applications, including desktop, server and scientific ap-
plications. Streamflow optimizes dynamic allocation for low la-
tency and scalability, as well as for temporal locality, spatial lo-
cality and cache-conscious layout of data. These optimizations are
accomplished via the use of a decoupled local heap architecture, the
elimination of object headers, the careful layout of heaps in con-
tiguously allocated physical memory and the exploitation of large
pages (superpages). At the same time, Streamflow provides mech-
anisms that facilitate both memory transfer between local heaps
and returning memory to the system. As a result, it is not sensitive
to pathological memory usage patterns, such as producer-consumer
ones, that could lead to high memory overhead and pressure.
Streamflow consists of two modules. Its front-end is a multi-
threaded memory allocator, which minimizes the overhead of mem-
ory requests by eliminating inter-thread synchronization and all as-
sociated atomic operations during common-case memory request
patterns. Even in the infrequent cases when synchronization be-
tween threads is necessary, it is performed with a single, non-
blocking atomic operation². The front-end also includes optimiza-
tions for spatial locality, temporal locality, and the avoidance of
false-sharing.
The back-end of Streamflow is a locality-conscious page man-
ager. This module manages contiguous page blocks, each of which
is used by the front-end for the allocation of objects that belong
to a given size class. The page manager allocates page blocks
within superpages to achieve contiguous layout of each page block
in physical memory, thus reducing self-interference (within page
blocks) and cross-interference (between page blocks) in the cache.
The use of superpages can also improve the TLB performance and
reduce page faults in applications with large memory footprints.
Moreover, the Streamflow back-end facilitates the interchange of
page blocks between threads, should the memory demand of each
thread change during execution.
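As a rough illustration of what the back-end needs from the operating system, the sketch below reserves a superpage-backed region on Linux with the MAP_HUGETLB mmap flag. The flag is an assumption of the sketch (it postdates this paper, whose back-end relied on its own page manager and contemporary kernel support); it merely shows the kind of physically contiguous region that is then carved into page blocks.

#include <sys/mman.h>
#include <stddef.h>

#define SUPERPAGE_SIZE (2u * 1024 * 1024)  /* 2 MB on x86 */

static void *superpage_alloc(void)
{
    /* A superpage is physically contiguous, so page blocks placed
     * inside it are physically contiguous as well. */
    void *sp = mmap(NULL, SUPERPAGE_SIZE, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    return sp == MAP_FAILED ? NULL : sp;
}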
We describe the front-end multithreaded memory allocator in
Section 3.1 and the back-end page manager in Section 3.2. The
source code of Streamflow can be downloaded from
http://www.cs.wm.edu/streamflow and can be used as a reference through-
out this section.
3.1 Multithreaded Memory Allocator
3.1.1 Small Object Management
Objects in Streamflow are classified as small if their size does not
exceed 2 KB (half a page in our experimental platform). The man-
agement of objects larger than 2 KB is described in Section 3.1.2. In
the following paragraphs we describe Streamflow’s heap architec-
ture, the techniques used to eliminate object headers, small object
allocation and deallocation procedures and specialized support for
recycling memory upon thread termination.
Local heaps: Each thread in Streamflow allocates memory from
a local heap. The heap data structure, shown in Figure 1(a), is
strictly private; only the owner thread can modify it. As a result,
the vast majority of simultaneous memory management operations
issued by multiple threads can be served simultaneously and inde-
pendently, without synchronization. Synchronization is necessary
only when the local heap does not have enough free memory avail-
able to fulfill a request, or during deallocations, when an object is
freed by a thread other than the owner of the heap it was allocated
from.
Local heaps facilitate the reduction of allocator-induced false-
sharing between threads, since memory allocation requests by dif-
ferent threads are not interleaved in the same memory segment.
This technique cannot, however, totally eliminate false-sharing in
the presence of object migrations between threads [3].
Each thread-local heap consists of page blocks, shown in Fig-
ure 1(b). Page blocks are contiguous virtual memory areas. Each
page block is used for the allocation of objects with sizes that fall
into a specific range, which we call an object class. In Streamflow,
each object class differs from the previous one by 4 bytes. This de-
sign provides for fine-grain object segregation and tends to improve
spatial locality in codes that make heavy use of very small objects
[7].
² We use cmp&swap(ptr, old_val, new_val), which atomically checks
that the value in memory address ptr is old_val and changes it to
new_val. If the value is not equal to old_val the operation fails. The op-
eration may be replayed more than once if it fails. All modern processors
offer cmp&swap for 32-bit and 64-bit operands, either as an instruction or as
a high-level primitive built from simpler instructions, such as load linked-
store conditional.
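Expressed with C11 atomics, the primitive described in this footnote looks roughly as follows; this is a sketch for exposition (Streamflow predates C11 and used platform-specific instructions).

#include <stdatomic.h>
#include <stdbool.h>

static bool cmp_and_swap(_Atomic long *ptr, long old_val, long new_val)
{
    /* Succeeds only if *ptr still holds old_val, exactly as
     * described above; on failure the caller may retry. */
    return atomic_compare_exchange_strong(ptr, &old_val, new_val);
}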

[Figure 1. Streamflow front-end design. (a) Overview of a heap: a per-thread table of object size classes (1-4, 5-8, 9-12, 13-16, ... bytes), each pointing, through active head and tail pointers, to a doubly linked list of page blocks that serves malloc/free requests. (b) Detail of a page block within that heap: its next/prev links, pointers to freed and unallocated space, the remotely freed list, and the owner id.]
One or more page blocks, organized as a doubly linked list, can
serve the same object class. A simple page block rotation strategy
guarantees that if there is enough free memory for the allocation
of a specific object class, a page block with available memory will
be found at the head of the list. More specifically, when a page
block becomes full, it is transferred to the end of the list. Similarly,
when an object is freed by the owner of the heap, the page block
it belongs to is placed at the head of the list, if it is not already
there. The block rotation is a fast operation involving exactly seven
pointer updates and no atomic instructions.
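The rotation is a plain move-to-head operation on a thread-private doubly linked list, roughly as sketched below in C. Names are illustrative; the point is that every update is an ordinary store, with no atomic instructions, because only the owner thread touches the list.

struct page_block {
    struct page_block *next, *prev;
    /* ... free-space pointers, owner id, bookkeeping elided ... */
};

typedef struct pb_list {
    struct page_block *head, *tail;
} pb_list;

static void move_to_head(pb_list *list, struct page_block *pb)
{
    if (list->head == pb)
        return;                          /* already at the head */
    /* Unlink pb; pb->prev is non-NULL since pb is not the head. */
    pb->prev->next = pb->next;
    if (pb->next) pb->next->prev = pb->prev;
    else          list->tail = pb->prev;
    /* Relink pb in front of the current head. */
    pb->next = list->head;
    pb->prev = NULL;
    list->head->prev = pb;
    list->head = pb;
}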
Page blocks are always page aligned. Their sizes vary, depend-
ing on the object class they serve. As a rule of thumb, each page
block in Streamflow is large enough to accommodate 1024 objects;
however, minimum and maximum page block size limitations also ap-
ply. There is clearly a trade-off between the number of objects
in each page block and the average amount of unused memory a
page block may contain. The minimum page block size (16 KB
in Streamflow) allows more than 1024 very small objects to be
packed inside a single page block, given that the size of the re-
sulting page blocks is also small and the additional memory con-
sumption is not a concern. This amortizes costly heap expansion
operations among more object allocations. On the other hand, the
maximum page block size (256 KB in our implementation) limits
the memory requirements for page blocks which serve relatively
large object classes. Without a limit, page blocks for large objects
could otherwise grow up to 2 MB. This limit reduces internal allo-
cator fragmentation, which is the amount of memory reserved from
the system, yet never used inside each page block. The resulting
page block size is always rounded to the nearest power of two.
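Putting these rules together, the sizing logic can be sketched as below. The constants come from the text; the sketch rounds up to a power of two, a simplification of the nearest-power-of-two rounding described above.

#include <stddef.h>

#define MIN_PAGE_BLOCK (16u * 1024)    /* 16 KB  */
#define MAX_PAGE_BLOCK (256u * 1024)   /* 256 KB */

static size_t page_block_size(size_t object_size)
{
    size_t target = 1024 * object_size;     /* aim for ~1024 objects */
    if (target < MIN_PAGE_BLOCK) target = MIN_PAGE_BLOCK;
    if (target > MAX_PAGE_BLOCK) target = MAX_PAGE_BLOCK;
    size_t size = MIN_PAGE_BLOCK;           /* round to a power of two */
    while (size < target) size <<= 1;
    return size;
}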
The beginning of each page block is occupied by the page
block header. The header consists of all the data structures and
bookkeeping information necessary for the management of the
page block. It contains: i) Pointers for linking the page block to the
doubly-linked list of page blocks for each object class, ii) Pointers
to free memory inside the page block (freed and unallocated),
iii) An identifier of the owner-thread of the page block (id), iv) The
head of a LIFO list used for object deallocations to the page block
by threads other than the owner-thread (remotely_freed), and v)
bookkeeping information, such as the number of free objects in the
page block and the size of each object. All the fields in the header,
with the exception of remotely_freed, are accessed only by the
page block owner-thread, thus accesses and modifications of these
fields require no synchronization.
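An illustrative C declaration of such a header follows. The field names mirror the enumeration above but are not the verbatim Streamflow declarations; only remotely_freed is ever touched by threads other than the owner.

#include <stddef.h>
#include <stdatomic.h>

struct page_block_header {
    struct page_block_header *next, *prev; /* (i) links in the class list */
    void  *freed;                   /* (ii) list of locally freed objects */
    void  *unallocated;             /* (ii) bump pointer into fresh space */
    unsigned long id;               /* (iii) owner-thread identifier      */
    _Atomic(void *) remotely_freed; /* (iv) lock-free remote LIFO list    */
    size_t object_size;             /* (v) bookkeeping                    */
    size_t free_objects;            /* (v) bookkeeping                    */
};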
Headerless objects: When an object is freed, a memory allocator
needs to discover whether the object is large or small as well as its
size and—if the object is small—the exact page block it originated
from. A common technique is to attach a header to each object and
encode the necessary information in the header. Some architectures
impose an 8-byte alignment requirement for certain data types;
otherwise accesses to these data types suffer significant performance penal-
ties. This limits the minimum memory required for headers to 8
bytes and the minimum object granularity supported by the alloca-
tor to 16 bytes (including the header). As a result, the use of headers
introduces two serious side-effects: a) Significant space overhead,
which can reach up to 300% (12 bytes of overhead for every 4-byte
object), and b) fewer objects can be packed in a single cache line or
a single page, thus hurting spatial locality.
Streamflow eliminates headers from small objects using the BI-
BOP technique [24]. We introduce a global table with one byte
for each virtual memory page in the system. The table is simply
indexed by the page starting address. A single bit of
each table cell characterizes objects allocated in the specific page as
small or large. If the object is small, the remaining 7 bits are used to
encode the offset (in pages) of the header of the parent page
block. This encoding is sufficient for realistic page block sizes (up
to 512 KB, considering a page size of 4 KB). As soon as the header
of the parent page block is available, the memory manager has all
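A sketch of the lookup just described, in C: one table byte per 4 KB virtual page, the high bit marking pages that hold small objects, and the low seven bits giving the distance, in pages, back to the parent page block header. The table name and bit assignments here are illustrative.

#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12          /* 4 KB pages */
#define SMALL_BIT  0x80u

struct page_block_header;      /* defined by the allocator */
extern uint8_t bibop[];        /* one byte per virtual page */

static struct page_block_header *parent_block(void *obj)
{
    uintptr_t page = (uintptr_t)obj >> PAGE_SHIFT;
    uint8_t   cell = bibop[page];
    if (!(cell & SMALL_BIT))
        return NULL;           /* a large object: handled separately */
    uintptr_t header_page = page - (cell & 0x7Fu);
    return (struct page_block_header *)(header_page << PAGE_SHIFT);
}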

References (partial list)
• Dynamic Storage Allocation: A Survey and Critical Review (book chapter). Surveys a variety of memory allocator designs, points out issues relevant to their design and evaluation, and chronologically covers most of the literature on allocators between 1961 and 1995.
• Hoard: a scalable memory allocator for multithreaded applications (journal article). Combines one global heap with per-processor heaps under a discipline that provably bounds memory consumption and has very low synchronization costs in the common case.
• Thread Scheduling for Multiprogrammed Multiprocessors (journal article). A user-level thread scheduler for shared-memory multiprocessors that achieves linear speedup whenever the number of processors P is small relative to the parallelism T1/T∞.
• Thread scheduling for multiprogrammed multiprocessors (proceedings article). Conference version of the above.
• A fast storage allocator (journal article).