Scalable Locality-Conscious Multithreaded Memory Allocation
Schneider, S., Antonopoulos, C. D., & Nikolopoulos, D. S. (2006). Scalable Locality-Conscious Multithreaded Memory Allocation. In Proceedings of the 2006 ACM SIGPLAN International Symposium on Memory Management (ISMM) (pp. 84-94). ACM. https://doi.org/10.1145/1133956.1133968

Scalable Locality-Conscious Multithreaded Memory Allocation
Scott Schneider
Department of Computer Science
College of William and Mary
scotts@cs.wm.edu
Christos D. Antonopoulos
Department of Computer Science
College of William and Mary
cda@cs.wm.edu
Dimitrios S. Nikolopoulos
Department of Computer Science
College of William and Mary
dsn@cs.wm.edu
Abstract
We present Streamflow, a new multithreaded memory manager
designed for low overhead, high-performance memory allocation
while transparently favoring locality. Streamflow enables low over-
head simultaneous allocation by multiple threads and adapts to se-
quential allocation at speeds comparable to that of custom sequen-
tial allocators. It favors the transparent exploitation of temporal and
spatial object access locality, and reduces allocator-induced cache
conflicts and false sharing, all using a unified design based on seg-
regated heaps. Streamflow introduces an innovative design which
uses only synchronization-free operations in the most common case
of local allocations and deallocations, while requiring minimal,
non-blocking synchronization in the less common case of remote
deallocations. Spatial locality at the cache and page level is favored
by eliminating small object headers, reducing allocator-induced
conflicts via contiguous allocation of page blocks in physical mem-
ory, reducing allocator-induced false sharing by using segregated
heaps and achieving better TLB performance and fewer page faults
via the use of superpages. Combining these locality optimizations
with the drastic reduction of synchronization and latency over-
head allows Streamflow to perform comparably with optimized se-
quential allocators and outperform—on a shared-memory system
with four two-way SMT processors—four state-of-the-art multi-
processor allocators by sizeable margins in our experiments. The
allocation-intensive sequential and parallel benchmarks used in our
experiments represent a variety of behaviors, including mostly lo-
cal object allocation-deallocation patterns and producer-consumer
allocation-deallocation patterns.
Categories and Subject Descriptors D.4.2 [Operating Systems]:
Storage Management—Allocation/deallocation strategies; D.3.3
[Programming Languages]: Language Constructs and Features—
Dynamic storage management; D.4.1 [Operating Systems]: Pro-
cess Management—Concurrency, Deadlocks, Synchronization,
Threads; D.1.3 [Programming Techniques]: Concurrent Program-
ming
General Terms Algorithms, Management, Performance
Keywords memory management, multithreading, shared mem-
ory, synchronization-free, non-blocking
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. To copy otherwise, to republish, to post on servers or to redistribute
to lists, requires prior specific permission and/or a fee.
ISMM’06, June 10–11, 2006, Ottawa, Ontario, Canada.
Copyright © 2006 ACM 1-59593-221-6/06/0006...$5.00.
1. Introduction
Efficient dynamic memory allocation is essential for desktop,
server and scientific applications [27]. As more of these appli-
cations use thread-level parallelism to exploit multiprocessors and
emerging processors with multiple cores and threads, scalable mul-
tiprocessor memory allocation becomes of paramount importance.
Dynamic memory allocation can negatively affect performance
by adding overhead during allocation and deallocation operations,
and by exacerbating object access latency due to poor locality.
Therefore, effective memory allocators need to be optimized for
both low allocation overhead and good object access locality. Scal-
ability and synchronization overhead reduction have been the cen-
tral considerations in the context of thread-safe memory allocators
[3, 18], while locality has been the focal point of the design of se-
quential memory allocators for more than a decade [11].
Multiprocessor allocators add synchronization overhead on the
critical path of all allocations and deallocations. Synchronization
is needed because a thread may need to access another thread’s
heap in order to remotely release an object to the owning thread.
Since such operations may be initiated concurrently by multiple
threads, synchronization is used to arbitrate thread accesses to the
data structures used for managing the heaps. Therefore, local heaps
need to be protected with locks or updated atomically with read-
modify-write operations such as cmp&swap. The vast majority of
thread-safe allocators use object headers [3, 9, 15, 18, 25], which
facilitate object deallocation in local heaps but pollute the cache
in codes that allocate many small objects.
Locality-conscious sequential allocators segregate objects of
different sizes to different page blocks allocated from the operating
system [7]. Objects are allocated by merely bumping a pointer and
no additional information is stored with each object. In general,
the allocation order of objects does not necessarily match their
access pattern. However, contiguous allocation of small objects
works well in practice because eliminating object headers helps
avoid fragmentation and cache pollution.
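To make the bump-pointer scheme concrete, here is a minimal sketch in C of headerless allocation inside a segregated page block. The names (page_block, unallocated) are illustrative, not taken from any particular allocator's sources.

#include <stddef.h>

/* One page block serves a single object size class. */
typedef struct page_block {
    char   *unallocated;  /* bump pointer: next never-used byte  */
    char   *end;          /* one past the last byte of the block */
    size_t  object_size;  /* shared by all objects in the block  */
} page_block;

/* Allocate one object by bumping a pointer. No per-object header
 * is written, so same-sized objects are densely packed, which is
 * exactly what preserves cache and page locality. */
static void *bump_alloc(page_block *pb)
{
    if (pb->unallocated + pb->object_size > pb->end)
        return NULL;                     /* block exhausted */
    void *obj = pb->unallocated;
    pb->unallocated += pb->object_size;
    return obj;
}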
Efficient, thread-safe memory allocators use local heaps to re-
duce contention between threads. The use of local heaps helps a
multiprocessor allocator avoid false sharing, since threads tend to
allocate and deallocate most of their objects locally [3]. At a lower
level, page block allocation and recycling policies in thread-safe
allocators are primarily concerned with fragmentation and blowup,
without necessarily accounting for locality [3].
The design space of thread-safe allocators that achieve both
good scalability and good data locality merits further investigation.
It is natural to consider combining scalable synchronization mech-
anisms (such as lock-free management of heaps) with locality-
conscious object allocation mechanisms (such as segregated heaps
with headerless objects). Although the two design considerations of
locality and scalability may seem orthogonal and complementary
at first glance, combining them in a unified design is not merely an
engineering effort. Several problems and trade-offs arise in an at-

tempt to integrate scalable concurrent allocation mechanisms with
cache- and page-conscious object allocation mechanisms in a uni-
fied design. Addressing these problems is a central contribution of
this paper. We show that both memory management overhead and
locality exploitation in thread-safe memory allocators can be im-
proved, compared to what is currently offered by state-of-the-art
multiprocessor allocators. These design improvements and the as-
sociated performance benefits are also a key contribution of this
paper.
We present Streamflow, a thread-safe allocator designed for
both scalability and locality. Streamflow’s design is a direct re-
sult of eliminating synchronization operations in the common case,
while at the same time avoiding the memory blowup that occurs when strictly
thread-local heaps are used in codes with producer-consumer
allocation-freeing patterns. Local operations in Streamflow are
synchronization-free. Not only do these operations proceed without
thread contention due to locking shared data, but they also proceed
without the latency imposed by uncontested locks and atomic in-
structions. The synchronization-free design of local heaps enables
Streamflow to exploit established sequential allocation optimiza-
tions which are critical for locality, such as eliminating object
headers for small objects and using bump-pointer allocation in
page blocks comprising thread-local heaps.
Streamflow also includes an innovative remote object deallo-
cation mechanism. Remote deallocations, namely deallocations
of objects from threads other than the ones that initially al-
located them, are decoupled from local allocations and deallo-
cations by forwarding remotely freed objects to per-thread, non-
blocking, lock-free lists. Streamflow’s remote deallocation mecha-
nism enables lazy object reclamation from the owning thread. As
a result, most allocation and deallocation operations proceed with-
out the cost of atomic instructions, and the infrequent operations
that do require atomic instructions are non-blocking, lock-free and
provably fast under various producer-consumer object allocation-
freeing patterns.
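To make this concrete, the sketch below shows, in C11 atomics, the kind of non-blocking LIFO push a remote deallocation can perform on a per-page-block list of remotely freed objects. The names are illustrative rather than Streamflow's own, and a production version must also consider the ABA problem.

#include <stdatomic.h>

/* The freed object's own first word is reused as the link field,
 * so the remote list costs no extra space. */
typedef struct remote_node { struct remote_node *next; } remote_node;

static void remote_free_push(_Atomic(remote_node *) *remotely_freed,
                             void *object)
{
    remote_node *node = object;
    remote_node *head = atomic_load(remotely_freed);
    do {
        node->next = head;   /* link in front of the current head */
    } while (!atomic_compare_exchange_weak(remotely_freed,
                                           &head, node));
}

The owning thread can later detach the whole list with a single atomic exchange and reclaim the objects lazily, keeping atomic instructions entirely off the common allocation and deallocation paths.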
Streamflow’s design favors locality at multiple levels. Beyond
reducing memory management overhead and latency, decoupling
local and remote operations promotes temporal locality by allowing
threads to favor locally recycled objects in their private heaps. The
use of thread-local heaps reduces allocator-induced false sharing.
Removing object headers improves spatial locality within cache
lines and page blocks. The integration with a lower level custom
page manager which utilizes superpages [19, 20] avoids allocator-
induced cache conflicts via contiguous allocation of page blocks in
physical memory, and reduces the activity of the OS page manager,
the number of page faults and the rate of TLB misses. Combining
these techniques produces a memory allocator that consistently
outperforms other multithreaded allocators in experiments with up
to 8 threads on a 4-processor system with Hyperthreaded Xeon
processors. Streamflow, by design, also adapts well to sequential
codes and performs competitively with optimized sequential and
application-specific allocators.
This paper makes the following contributions:
• We present a new thread-safe dynamic memory manager which
bridges the design space between allocators focused on locality
and allocators focused on scalability. To our knowledge, this
is the first time a memory allocator efficiently unifies locality
considerations with multiprocessor scalability.
• We present a new method for eliminating (in the common case)
and minimizing (in the uncommon case) synchronization over-
head in multiprocessor memory allocators. Our method decou-
ples remote and local free lists and uses a new non-blocking re-
mote object deallocation mechanism. This technique preserves
the desirable properties of a multiprocessor memory allocator,
namely blowup avoidance and false sharing avoidance, without
sacrificing the locality and low latency benefits of bump-pointer
allocation.
• We present memory allocation and deallocation schemes that
take into account cache-conscious layout of heaps, page- and
TLB-locality. To our knowledge, Streamflow is the first mul-
tiprocessor allocator designed with multilevel and multigrain
locality considerations.
• We demonstrate the performance advantages of our design
using realistic sequential and multithreaded applications as
well as synthesized benchmarks. Streamflow outperforms four
widely used, state-of-the-art multiprocessor allocators in alloca-
tion-intensive parallel applications. It also performs compara-
bly to optimized sequential allocators in allocation-intensive
sequential applications. Streamflow exhibits solid performance
improvements both in codes with mostly local object allocation-
freeing patterns and codes with producer-consumer object
allocation-freeing patterns. We have experimented with an SMP
with four two-way SMT processors¹. Such SMPs are popular
as commercial server platforms, affordable high-performance
computing platforms for scientific problems, and building
blocks for large-scale supercomputers. Since Streamflow elim-
inates (in the common case) or significantly reduces (in the
uncommon case) synchronization, the key scalability-limiting
factor of multithreaded memory managers, we expect it to be
scalable and efficient on larger shared-memory multiprocessors
as well.
The rest of this paper is organized as follows. Section 2 dis-
cusses related work. Section 3 presents the major design princi-
ples, mechanisms and policies of Streamflow. Section 4 presents
our experimental evaluation of Streamflow alongside other multi-
processor allocators and some optimized sequential allocators. In
Section 5 we discuss some implications of the design of Stream-
flow and potential future improvements. Section 6 summarizes the
paper.
2. Related Work
Streamflow includes elements adopted from efficient sequential
memory allocators proposed in the past. Streamflow’s segregated
heap storage and BIBOP (big bag of pages)-style allocation derive
from an allocation scheme originally proposed by Guy Steele
in [24] and from the concept of independently managed mem-
ory zones which dates back to 1967 [21]. Segregated heap storage
has since been used in numerous allocators, including the standard
GNU C allocator in Linux [16], an older GNU allocator [11], Vmal-
loc [26], and more recent allocators such as Reaps [4] and Vam [7].
Modern allocators tend to adopt segregated heaps because they en-
able very fast allocation. Deallocation in segregated heap allocators
is more intricate, because in order to comply with the semantics of
free(), the allocator needs to be able to discover internally the
size of each deallocated object, using the object pointer as its only
input. Deallocation is simple if each object has a header pointing
to the base of the heap segment from where the object was allo-
cated. This technique is used, for example, in the GNU C allocator
and in Reaps [4,16]. However, object headers introduce fragmenta-
tion, pollute caches, and eventually penalize codes with many small
object allocations. Therefore, locality-conscious allocators such as
PHKmalloc [12] and Vam [7] eliminate object headers entirely for
small objects and use tables of free lists to manage released space
in segregated heaps. Elimination of headers is common practice in
custom memory allocators [4], as well as semi-custom allocators
¹ This is the largest shared-memory system we have direct access to.

with alternate semantics for free(), such as region-based alloca-
tors [8].
Streamflow uses segregated object allocation in thread-private
heaps, as in several other thread-safe allocators including Hoard
[3], Maged Michael’s lock-free memory allocator [18], Tcmalloc
from Google’s performance tools [10], LKmalloc [15], ptmalloc
[9], and Vee and Hsu’s allocator [25]. In particular, Streamflow
uses strictly thread-local object allocation, both thread-local and
remote deallocation, and mechanisms for recycling free page blocks
to avoid false sharing and memory blowup [3, 18].
Streamflow differs from earlier multithreaded memory alloca-
tors in several critical aspects: First, its design decouples local
from remote object deallocation to allow local allocation and deal-
location without any atomic instructions. Atomic instructions are
used only sparingly for remote object deallocation and for recy-
cling page blocks. Second, Streamflow eliminates object headers
for small objects, thereby reducing cache pollution and improv-
ing spatial locality. Tcmalloc is the only thread-safe allocator we
are aware of that uses the same technique, although Tcmalloc uses
locks whenever memory has to be allocated from or returned to a
global pool of free memory objects. Third, Streamflow uses further
optimizations for temporal locality, cache-conscious page block
layout and better TLB performance. Fourth, unlike many other high
performance allocators, Streamflow allows returning memory to
the OS when the footprint of the application shrinks.
To our knowledge, Streamflow is the first user-level memory
allocator to control the layout of page blocks in physical memory,
using superpages as the means to achieve contiguous allocation
of each page block in physical memory. It should be noted that
superpages are a generic optimization tool and their scope extends
beyond just memory allocators [6, 19]. However, since superpages
(the size of which is set by the operating system) may subsume
multiple page blocks (the size of which is set by the memory
allocator), a multiprocessor memory allocator that uses superpages to
achieve cache-conscious layout of page blocks faces certain design choices
as to how it manages free memory inside each superpage and how
it divides superpages between page blocks from different threads.
Streamflow’s design includes some educated choices for effective
management and utilization of superpages.
Several of the design goals of Streamflow, in particular its local-
ity optimizations, can be achieved with allocators that utilize feed-
back from program profiles. For example, earlier work has shown
that object lifetime predictors and reference traces can be used to
customize small object allocation and object segregation [2, 22, 23].
Streamflow assumes no knowledge of object allocation and access
profiles, although its design does not prevent the addition of profile-
guided optimization.
3. Design of Streamflow
Streamflow primarily optimizes dynamic allocation of small ob-
jects, which is a common bottleneck in many sequential and mul-
tithreaded applications, including desktop, server and scientific ap-
plications. Streamflow optimizes dynamic allocation for low la-
tency and scalability, as well as for temporal locality, spatial lo-
cality and cache-conscious layout of data. These optimizations are
accomplished via the use of a decoupled local heap architecture, the
elimination of object headers, the careful layout of heaps in con-
tiguously allocated physical memory and the exploitation of large
pages (superpages). At the same time, Streamflow provides mech-
anisms that facilitate both memory transfer between local heaps
and returning memory to the system. As a result, it is not sensitive
to pathological memory usage patterns, such as producer-consumer
ones, that could lead to high memory overhead and pressure.
Streamflow consists of two modules. Its front-end is a multi-
threaded memory allocator, which minimizes the overhead of mem-
ory requests by eliminating inter-thread synchronization and all as-
sociated atomic operations during common-case memory request
patterns. Even in the infrequent cases when synchronization be-
tween threads is necessary, it is performed with a single, non-
blocking atomic operation². The front-end also includes optimiza-
tions for spatial locality, temporal locality, and the avoidance of
false-sharing.
The back-end of Streamflow is a locality-conscious page man-
ager. This module manages contiguous page blocks, each of which
is used by the front-end for the allocation of objects that belong
to a given size class. The page manager allocates page blocks
within superpages to achieve contiguous layout of each page block
in physical memory, thus reducing self-interference (within page
blocks) and cross-interference (between page blocks) in the cache.
The use of superpages can also improve the TLB performance and
reduce page faults in applications with large memory footprints.
Moreover, the Streamflow back-end facilitates the interchange of
page blocks between threads, should the memory demand of each
thread change during execution.
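As a rough illustration of what the back-end needs from the operating system, the sketch below reserves a superpage-backed region on Linux with the MAP_HUGETLB mmap flag. The flag is an assumption of the sketch (it postdates this paper, whose back-end relied on its own page manager and contemporary kernel support); it merely shows the kind of physically contiguous region that is then carved into page blocks.

#include <sys/mman.h>
#include <stddef.h>

#define SUPERPAGE_SIZE (2u * 1024 * 1024)  /* 2 MB on x86 */

static void *superpage_alloc(void)
{
    /* A superpage is physically contiguous, so page blocks placed
     * inside it are physically contiguous as well. */
    void *sp = mmap(NULL, SUPERPAGE_SIZE, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    return sp == MAP_FAILED ? NULL : sp;
}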
We describe the front-end multithreaded memory allocator in
Section 3.1 and the back-end page manager in Section 3.2. The
source code of Streamflow can be downloaded from
http://www.cs.wm.edu/streamflow and can be used as a reference through-
out this section.
3.1 Multithreaded Memory Allocator
3.1.1 Small Object Management
Objects in Streamflow are classified as small if their size does not
exceed 2 KB (half a page in our experimental platform). The man-
agement of objects larger than 2 KB is described in Section 3.1.2. In
the following paragraphs we describe Streamflow’s heap architec-
ture, the techniques used to eliminate object headers, small object
allocation and deallocation procedures and specialized support for
recycling memory upon thread termination.
Local heaps: Each thread in Streamflow allocates memory from
a local heap. The heap data structure, shown in Figure 1(a), is
strictly private; only the owner thread can modify it. As a result,
the vast majority of simultaneous memory management operations
issued by multiple threads can be served simultaneously and inde-
pendently, without synchronization. Synchronization is necessary
only when the local heap does not have enough free memory avail-
able to fulfill a request, or during deallocations, when an object is
freed by a thread other than the owner of the heap it was allocated
from.
Local heaps facilitate the reduction of allocator-induced false-
sharing between threads, since memory allocation requests by dif-
ferent threads are not interleaved in the same memory segment.
This technique cannot, however, totally eliminate false-sharing in
the presence of object migrations between threads [3].
Each thread-local heap consists of page blocks, shown in Fig-
ure 1(b). Page blocks are contiguous virtual memory areas. Each
page block is used for the allocation of objects with sizes that fall
into a specific range, which we call an object class. In Streamflow,
each object class differs from the previous one by 4 bytes. This de-
sign provides for fine-grain object segregation and tends to improve
spatial locality in codes that make heavy use of very small objects
[7].
² We use cmp&swap(ptr, old_val, new_val), which atomically checks
that the value in memory address ptr is old_val and changes it to
new_val. If the value is not equal to old_val the operation fails. The op-
eration may be replayed more than once if it fails. All modern processors
offer cmp&swap for 32-bit and 64-bit operands, either as an instruction or as
a high-level primitive built from simpler instructions, such as load linked-
store conditional.
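Expressed with C11 atomics, the primitive described in this footnote looks roughly as follows; this is a sketch for exposition (Streamflow predates C11 and used platform-specific instructions).

#include <stdatomic.h>
#include <stdbool.h>

static bool cmp_and_swap(_Atomic long *ptr, long old_val, long new_val)
{
    /* Succeeds only if *ptr still holds old_val, exactly as
     * described above; on failure the caller may retry. */
    return atomic_compare_exchange_strong(ptr, &old_val, new_val);
}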

[Figure 1. Streamflow front-end design. (a) Overview of a heap: a per-thread table of object size classes (1-4, 5-8, 9-12, 13-16, ... bytes), each pointing, through active head and tail pointers, to a doubly linked list of page blocks that serves malloc/free requests. (b) Detail of a page block within that heap: its next/prev links, pointers to freed and unallocated space, the remotely freed list, and the owner id.]
One or more page blocks, organized as a doubly linked list, can
serve the same object class. A simple page block rotation strategy
guarantees that if there is enough free memory for the allocation
of a specific object class, a page block with available memory will
be found at the head of the list. More specifically, when a page
block becomes full, it is transferred to the end of the list. Similarly,
when an object is freed by the owner of the heap, the page block
it belongs to is placed at the head of the list, if it is not already
there. The block rotation is a fast operation involving exactly seven
pointer updates and no atomic instructions.
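The rotation is a plain move-to-head operation on a thread-private doubly linked list, roughly as sketched below in C. Names are illustrative; the point is that every update is an ordinary store, with no atomic instructions, because only the owner thread touches the list.

struct page_block {
    struct page_block *next, *prev;
    /* ... free-space pointers, owner id, bookkeeping elided ... */
};

typedef struct pb_list {
    struct page_block *head, *tail;
} pb_list;

static void move_to_head(pb_list *list, struct page_block *pb)
{
    if (list->head == pb)
        return;                          /* already at the head */
    /* Unlink pb; pb->prev is non-NULL since pb is not the head. */
    pb->prev->next = pb->next;
    if (pb->next) pb->next->prev = pb->prev;
    else          list->tail = pb->prev;
    /* Relink pb in front of the current head. */
    pb->next = list->head;
    pb->prev = NULL;
    list->head->prev = pb;
    list->head = pb;
}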
Page blocks are always page aligned. Their sizes vary, depend-
ing on the object class they serve. As a rule of thumb, each page
block in Streamflow is large enough to accommodate 1024 objects;
however, minimum and maximum page block size limitations also ap-
ply. There is clearly a trade-off between the number of objects
in each page block and the average amount of unused memory a
page block may contain. The minimum page block size (16 KB
in Streamflow) allows more than 1024 very small objects to be
packed inside a single page block, given that the size of the re-
sulting page blocks is also small and the additional memory con-
sumption is not a concern. This amortizes costly heap expansion
operations among more object allocations. On the other hand, the
maximum page block size (256 KB in our implementation) limits
the memory requirements for page blocks which serve relatively
large object classes. Without a limit, page blocks for large objects
could otherwise grow up to 2 MB. This limit reduces internal allo-
cator fragmentation, which is the amount of memory reserved from
the system, yet never used inside each page block. The resulting
page block size is always rounded to the nearest power of two.
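Putting these rules together, the sizing logic can be sketched as below. The constants come from the text; the sketch rounds up to a power of two, a simplification of the nearest-power-of-two rounding described above.

#include <stddef.h>

#define MIN_PAGE_BLOCK (16u * 1024)    /* 16 KB  */
#define MAX_PAGE_BLOCK (256u * 1024)   /* 256 KB */

static size_t page_block_size(size_t object_size)
{
    size_t target = 1024 * object_size;     /* aim for ~1024 objects */
    if (target < MIN_PAGE_BLOCK) target = MIN_PAGE_BLOCK;
    if (target > MAX_PAGE_BLOCK) target = MAX_PAGE_BLOCK;
    size_t size = MIN_PAGE_BLOCK;           /* round to a power of two */
    while (size < target) size <<= 1;
    return size;
}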
The beginning of each page block is occupied by the page
block header. The header consists of all the data structures and
bookkeeping information necessary for the management of the
page block. It contains: i) Pointers for linking the page block to the
doubly-linked list of page blocks for each object class, ii) Pointers
to free memory inside the page block (freed and unallocated),
iii) An identifier of the owner-thread of the page block (id), iv) The
head of a LIFO list used for object deallocations to the page block
by threads other than the owner-thread (remotely_freed), and v)
bookkeeping information, such as the number of free objects in the
page block and the size of each object. All the fields in the header,
with the exception of remotely_freed, are accessed only by the
page block owner-thread, thus accesses and modifications of these
fields require no synchronization.
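An illustrative C declaration of such a header follows. The field names mirror the enumeration above but are not the verbatim Streamflow declarations; only remotely_freed is ever touched by threads other than the owner.

#include <stddef.h>
#include <stdatomic.h>

struct page_block_header {
    struct page_block_header *next, *prev; /* (i) links in the class list */
    void  *freed;                   /* (ii) list of locally freed objects */
    void  *unallocated;             /* (ii) bump pointer into fresh space */
    unsigned long id;               /* (iii) owner-thread identifier      */
    _Atomic(void *) remotely_freed; /* (iv) lock-free remote LIFO list    */
    size_t object_size;             /* (v) bookkeeping                    */
    size_t free_objects;            /* (v) bookkeeping                    */
};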
Headerless objects: When an object is freed, a memory allocator
needs to discover whether the object is large or small as well as its
size and—if the object is small—the exact page block it originated
from. A common technique is to attach a header to each object and
encode the necessary information in the header. Some architectures
impose an 8-byte alignment requirement for certain data types;
otherwise accesses to these data types suffer significant performance penal-
ties. This limits the minimum memory required for headers to 8
bytes and the minimum object granularity supported by the alloca-
tor to 16 bytes (including the header). As a result, the use of headers
introduces two serious side-effects: a) Significant space overhead,
which can reach up to 300% (12 bytes of overhead for every 4-byte
object), and b) fewer objects can be packed in a single cache line or
a single page, thus hurting spatial locality.
Streamflow eliminates headers from small objects using the BI-
BOP technique [24]. We introduce a global table with one byte
for each virtual memory page in the system. The table is simply
indexed by the page starting address. A single bit of
each table cell characterizes objects allocated in the specific page as
small or large. If the object is small, the remaining 7 bits are used to
encode the offset (in pages) of the header of the parent page
block. This encoding is sufficient for realistic page block sizes (up
to 512 KB, considering a page size of 4 KB). As soon as the header
of the parent page block is available, the memory manager has all
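A sketch of the lookup just described, in C: one table byte per 4 KB virtual page, the high bit marking pages that hold small objects, and the low seven bits giving the distance, in pages, back to the parent page block header. The table name and bit assignments here are illustrative.

#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12          /* 4 KB pages */
#define SMALL_BIT  0x80u

struct page_block_header;      /* defined by the allocator */
extern uint8_t bibop[];        /* one byte per virtual page */

static struct page_block_header *parent_block(void *obj)
{
    uintptr_t page = (uintptr_t)obj >> PAGE_SHIFT;
    uint8_t   cell = bibop[page];
    if (!(cell & SMALL_BIT))
        return NULL;           /* a large object: handled separately */
    uintptr_t header_page = page - (cell & 0x7Fu);
    return (struct page_block_header *)(header_page << PAGE_SHIFT);
}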

References (partial list)
• Dynamic Storage Allocation: A Survey and Critical Review (book chapter). Surveys a variety of memory allocator designs, points out issues relevant to their design and evaluation, and chronologically covers most of the literature on allocators between 1961 and 1995.
• Hoard: a scalable memory allocator for multithreaded applications (journal article). Combines one global heap with per-processor heaps under a discipline that provably bounds memory consumption and has very low synchronization costs in the common case.
• Thread Scheduling for Multiprogrammed Multiprocessors (journal article). A user-level thread scheduler for shared-memory multiprocessors that achieves linear speedup whenever the number of processors P is small relative to the parallelism T1/T∞.
• Thread scheduling for multiprogrammed multiprocessors (proceedings article). Conference version of the above.
• A fast storage allocator (journal article).