
Composable scheduler activations for Haskell

TL;DR: A novel concurrency substrate design for the Glasgow Haskell Compiler is described that allows multicore schedulers for concurrent and parallel Haskell programs to be safely and modularly described as libraries in Haskell.
Abstract: The runtime for a modern, concurrent, garbage collected language like Java or Haskell is like an operating system: sophisticated, complex, performant, but alas very hard to change. If more of the runtime system were in the high level language, it would be far more modular and malleable. In this paper, we describe a novel concurrency substrate design for the Glasgow Haskell Compiler (GHC) that allows multicore schedulers for concurrent and parallel Haskell programs to be safely and modularly described as libraries in Haskell. The approach relies on abstracting the interface to the user-implemented schedulers through scheduler activations, together with the use of Software Transactional Memory (STM) to promote safety in a multicore context.

Summary (6 min read)

1. Introduction

  • High performance, multicore-capable runtime systems (RTS) for garbage-collected languages have been in widespread use for many years.
  • As a result, they are extremely difficult to modify, even for their own authors.
  • Different strategies might suit different multi-cores, or different application programs or parts thereof.
  • By abstracting the interface to the ULS through scheduler activations, their concurrency substrate seamlessly integrates with the existing RTS concurrency support such as MVars, asynchronous exceptions [16], safe foreign function interface [17], software transactional memory [10], resumable black-holes [20], etc.
  • This design absolves the scheduler writer from having to reason about the interaction between the ULS and the RTS, thus lowering the bar for writing new schedulers.

2. Background

  • To understand the design of the new concurrency substrate for Haskell, the authors must first give some background on the existing RTS support for concurrency in their target platform – the Glasgow Haskell Compiler (GHC).
  • The authors then articulate the goals of their concurrency substrate.

2.1 The GHC runtime system

  • GHC has a sophisticated, highly tuned RTS that has a rich support for concurrency with advanced features such as software transactional memory [10], asynchronous exceptions [16], safe foreign function interface [17], and transparent scaling on multicores [9].
  • Each HEC is in turn animated by an operating system thread; in this paper the authors use the term tasks for these OS threads, to distinguish them from Haskell threads.
  • GHC’s current scheduler is written in C, and is hardwired into the RTS (Figure 1).
  • It uses a single run-queue per processor, and has a single, fixed notion of work-sharing to move work from one processor to another.
  • There is no notion of thread priority; nor is there support for advanced scheduling policies such as gang or spatial scheduling.

2.2 The challenge

  • Because there is such a rich design space for schedulers, their goal is to allow a user-level scheduler (ULS) to be written in Haskell, giving programmers the freedom to experiment with different scheduling or work-stealing algorithms.
  • Applications might also combine the schedulers in a hierarchical fashion; a scheduler receives computational resources from its parent, and divides them among its children.
  • Matters are made more complicated by asynchronous exceptions, which may cause a thread to abandon evaluation of a thunk, replacing the thunk with a “resumable black hole”.
  • The difficulty is that they are all intricate and highly optimised.
  • Given that the ULS will be implemented in Haskell, the authors would like to utilise the concurrency control abstractions provided by Haskell (notably transactional memory) to simplify the task of scheduler implementation.

3. Design

  • The authors describe the design of their concurrency substrate and present the concurrency substrate API.
  • Along the way, the authors will describe how their design achieves the goals put forth in the previous section.

3.1 Scheduler activation

  • The authors' key observation is that the interaction between the scheduler and the rest of the RTS can be reduced to two fundamental operations: (1) a block operation, in which the currently running thread blocks on some RTS event and execution proceeds by switching to the next available thread from the scheduler, and (2) an unblock operation, in which the RTS event that a blocked thread is waiting on occurs and the blocked thread is resumed by adding it to the scheduler.
  • Eventually, the MVar might be filled by some other thread (analogous to lock release), in which case, the blocked thread is unblocked and resumed with the value from the MVar.
  • The activations provide an abstract interface to the ULS to which the thread belongs.
  • The substrate not only allows the programmer to implement schedulers as Haskell libraries, but also enables other RTS mechanisms to interface with the user-level schedulers through upcalls to the activations.

3.2 Software transactional memory

  • Since Haskell computations can run in parallel on different HECs, the substrate must provide a method for safely coordinating activities across multiple HECs.
  • Similar to Li’s substrate design [14], the authors adopt software transactional memory (STM) as the sole multiprocessor synchronisation mechanism exposed by the substrate.
  • Using transactional memory, rather than locks and condition variables, makes complex concurrent programs much more modular and less error-prone [10] – and schedulers are prime candidates, because they are prone to subtle concurrency bugs. A small illustrative example follows this list.
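As a small illustration of why this matters for schedulers (our own example, not from the paper), STM lets two run queues be manipulated in one atomic step, with no lock ordering to get wrong:

    import Control.Concurrent.STM (STM, TVar, readTVar, writeTVar, modifyTVar')

    -- Atomically move one item from one run queue to another; the whole
    -- read-modify-write of both queues commits as a single transaction.
    migrate :: TVar [a] -> TVar [a] -> STM ()
    migrate from to = do
      q <- readTVar from
      case q of
        []     -> return ()                  -- nothing to move
        (s:ss) -> do writeTVar from ss       -- pop from the source
                     modifyTVar' to (++ [s]) -- append to the destination

With locks, the same operation would need two locks held at once, inviting deadlock; here the transaction either commits both updates or neither.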

3.3 Concurrency substrate

  • Now that the authors have motivated their design decisions, they will present the API for the concurrency substrate.
  • The concurrency substrate includes the primitives for instantiating and switching between language level threads, manipulating thread local state, and an abstraction for scheduler activations.
  • The API is presented below:

    data SCont
    type DequeueAct = SCont -> STM SCont
    type EnqueueAct = SCont -> STM ()

    -- activation interface
    dequeueAct :: DequeueAct
    enqueueAct :: EnqueueAct

    -- SCont manipulation
    newSCont     :: IO () -> IO SCont
    switch       :: (SCont -> STM SCont) -> IO ()
    runOnIdleHEC :: SCont -> IO ()

    -- Manipulating local state
    setDequeueAct :: DequeueAct -> IO ()
    setEnqueueAct :: EnqueueAct -> IO ()
    getAux        :: SCont -> STM Dynamic
    setAux        :: SCont -> Dynamic -> STM ()

3.3.1 Activation interface

  • Rather than directly exposing the notion of a “thread”, the substrate offers one-shot continuations [3], of type SCont.
  • When the program begins execution, a fixed number of HECs (N) is provided to it by the environment.
  • Notice that the result of the dequeue activation and the body of the switch primitive are STM transactions.
  • Along the same lines, the authors interpret the use of retry within a switch or dequeue activation transaction as putting the whole HEC to sleep.
  • The activations of an SCont can be invoked via the dequeueAct and enqueueAct primitives.

4.1 User-level scheduler

  • The first step in designing a scheduler is to describe the scheduler data structure.
  • This involves two steps: (1) allocating the scheduler and initialising the main thread, and (2) spinning up additional HECs.
  • The authors assume that the Haskell program wishing to utilise the ULS performs these two steps at the start of the main IO computation.
  • The main SCont’s activations, initialised in newScheduler, are copied to the newly allocated SCont; a sketch of such a scheduler is given below.
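A minimal sketch of step (1) follows (our own code, assuming the substrate API of Section 3.3; the paper's actual newScheduler is richer). It allocates a single FIFO run queue and installs round-robin activations on the main SCont; step (2) would then use runOnIdleHEC to animate the remaining HECs:

    import Control.Concurrent.STM (TVar, newTVarIO, readTVar, writeTVar, retry)

    -- A round-robin ULS: a single FIFO run queue held in a TVar.
    newScheduler :: IO ()
    newScheduler = do
      q <- newTVarIO ([] :: [SCont])
      let deq _ = do ss <- readTVar q               -- dequeue activation
                     case ss of
                       []     -> retry              -- empty: the HEC sleeps
                       (x:xs) -> writeTVar q xs >> return x
          enq s = do ss <- readTVar q               -- enqueue activation
                     writeTVar q (ss ++ [s])
      setDequeueAct deq   -- install on the main SCont; SConts created
      setEnqueueAct enq   -- later inherit these activations via newSCont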

4.2 Scheduler agnostic user-level MVars

  • The authors' scheduler activations abstract the interface to the ULS's.
  • This fact can be exploited to build scheduler-agnostic implementations of user-level concurrency libraries such as MVars.
  • The following snippet describes the structure of an MVar implementation:

    newtype MVar a = MVar (TVar (MVPState a))
    data MVPState a = Full a [(a, SCont)]
                    | Empty [(IORef a, SCont)]
  • If the MVar is full, the SCont consumes the value and unblocks the next waiting putter SCont, if any.
  • The implementation of putMVar is the dual of this implementation; a sketch of takeMVar in this style is given below.
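Below is a sketch of takeMVar in this scheduler-agnostic style, reconstructed by us from the datatype above rather than copied from the paper. The IORef hole for the received value comes straight from the Empty constructor; the use of unsafeIOToSTM to fill it is our assumption:

    import Control.Concurrent.STM (readTVar, writeTVar)
    import Data.IORef (IORef, newIORef, readIORef, writeIORef)
    import GHC.Conc (unsafeIOToSTM)

    takeMVar :: MVar a -> IO a
    takeMVar (MVar ref) = do
      hole <- newIORef undefined         -- receives the taken value
      switch $ \s -> do                  -- s is the current SCont
        st <- readTVar ref
        case st of
          Full v ((v', putter):ps) -> do -- full, with a blocked putter
            unsafeIOToSTM (writeIORef hole v)
            writeTVar ref (Full v' ps)   -- install the putter's value
            enqueueAct putter            -- unblock the putter via its own ULS
            return s                     -- we keep running
          Full v [] -> do
            unsafeIOToSTM (writeIORef hole v)
            writeTVar ref (Empty [])
            return s
          Empty ts -> do                 -- empty: block ourselves
            writeTVar ref (Empty (ts ++ [(hole, s)]))
            dequeueAct s                 -- switch to the next thread in our ULS
      readIORef hole

Note that blocking is simply "record yourself in the MVar, then switch to whatever your own dequeue activation returns"; no scheduler-specific code appears anywhere.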

5. Semantics

  • The authors present the formal semantics of the concurrency substrate primitives introduced in Section 3.3.
  • The authors will subsequently utilise the semantics to formally describe the interaction of the ULS with the RTS in Section 6.
  • The aim of this is to precisely describe the issues with respect to the interactions between the ULS and the RTS, and have the language to enunciate their solutions.

5.1 Syntax

  • Figure 5 shows the syntax of program states.
  • Each HEC is either idle (Idle) or a triple 〈s,M,D〉t, where s is a unique identifier of the currently executing SCont, M is the currently executing term, and D represents the SCont-local state.
  • The authors represent the stack local state D as a tuple with two terms and a name (M,N, r).
  • For perspicuity, the authors define accessor functions as shown below.
  • The number of HECs remains constant, and each HEC runs one, and only one, SCont.

5.2 Basic transitions

  • Some basic transitions are presented in Figure 6.
  • This says that the program makes a transition from S; Θ to S′; Θ′, possibly interacting with the underlying RTS through action a (rendered as a judgement after this list).
  • Similarly, Rule PureStep enables one of the HECs to perform a purely functional transition under the evaluation context E (defined in Figure 5).
  • The purely functional transitions M → N include β-reduction, arithmetic expressions, case expressions, monadic operations return, bind, throw, catch, and so on, according to their standard definitions.
  • A bind operation on the transactional memory primitive retry simply reduces to retry (Figure 6).
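For reference, the top-level transition described above can be written out as a judgement (our LaTeX rendering of the prose; S; Θ is the program state before, S′; Θ′ the state after, and a the optional RTS interaction):

$$ S;\,\Theta \;\xrightarrow{\,a\,}\; S';\,\Theta' $$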

5.3 Transactional memory

  • Since the concurrency substrate primitives utilise STM as the sole synchronisation mechanism, the authors will present the formal semantics of basic STM operations in this section.
  • Figure 7 presents the semantics of non-blocking STM operations.
  • The current SCont s and its local state D are read-only, and are not used at all in this section, but will be needed when manipulating SCont-local state.
  • Since an exception can carry a TVar allocated in the aborted transaction, the effects of the current transaction are undone except for the newly allocated TVars.
  • If the resultant SCont s′ is different from the current SCont s, the authors transfer control to the new SCont s′ by making it the running SCont and saving the state of the original SCont s in the heap.

6. Interaction with the RTS

  • The key aspect of their design is composability of ULS’s with the existing RTS concurrency mechanisms (Section 3.1).
  • The authors will describe in detail the interaction of RTS concurrency mechanisms and the ULS’s.
  • The formalisation brings out the tricky cases associated with the interaction between the ULS and the RTS.

6.1 Timer interrupts

  • In GHC, concurrent threads are preemptively scheduled.
  • On a tick, the current SCont needs to be de-scheduled and a new SCont from the scheduler needs to be scheduled.
  • The semantics of handling timer interrupts is shown in Figure 9.
  • In this case the RTS-interaction Tick indicates that the RTS wants to signal a timer tick.
  • The transition here injects yield into the instruction stream of the SCont running on the HEC. (Technically, the authors should ensure that every HEC receives a tick, and of course their implementation does just that, but they elide it here.)

6.2 STM blocking operations

  • As mentioned before (Section 3.4), STM supports blocking operations through the retry primitive.
  • Blocking the SCont (Section 6.2.1): Rule TRETRYATOMIC is similar to TTHROW in Figure 7.
  • The rules presented in Figure 11 are the key rules in abstracting the interface between the ULS and the RTS, and describe the invocation of upcalls.
  • Invoking the dequeue upcall on the blocked SCont s can lead to a race on s between multiple HECs if s happens to be unblocked and enqueued to the scheduler before the switch transaction is completed.
  • Resuming the SCont: Some time later, the RTS will see that some thread has written to one of the TVars read by s's transaction, so it will signal a RetrySTM s interaction (rule TRESUMERETRY).

6.2.3 HEC sleep and wakeup

  • Recall that invoking retry within a switch transaction or dequeue activation puts the HEC to sleep (Section 3.4).
  • Also, notice that the dequeue activation is always invoked by the RTS from a switch transaction (Rule UPDEQUEUE).
  • If a switch transaction blocks, the authors put the whole HEC to sleep; this motivates rule TRETRYSWITCH.
  • Then, dual to TRESUMERETRY, rule TWAKEUP wakes up the HEC when the RTS sees that the transaction may now be able to make progress.

6.2.4 Implementation of upcalls

  • Notice that the rules UPDEQUEUE and UPENQUEUEIDLE in Figure 11 instantiate a fresh SCont.
  • The freshly instantiated SCont performs just a single transaction; switch in UPDEQUEUE and atomically in UPENQUEUEIDLE, after which it is garbage-collected.
  • Since instantiating a fresh SCont for every upcall is unwise, the RTS maintains a dynamic pool of dedicated upcall SConts for performing the upcalls.
  • It is worth mentioning that the authors need an “upcall SCont pool” rather than a single “upcall SCont” since the upcall transactions can themselves get blocked synchronously on STM retry as well as asynchronously due to optimizations for lazy evaluation (Section 6.5).

6.3 Safe foreign function calls

  • Foreign calls in GHC are highly efficient but intricately interact with the scheduler [17].
  • Each HEC is animated by one of a pool of tasks (OS threads); the current task may become blocked in a foreign call (e.g. a blocking I/O operation), in which case another task takes over the HEC.
  • The authors' decision to preserve the task model in the RTS allows them to delegate much of the work involved in safe foreign calls to the RTS.
  • Rule OCBLOCK illustrates that the HEC performing the foreign call moves into the Outcall state, where it is no longer runnable.
  • The scheduler is resumed using the dequeue upcall.

6.4 Timer interrupts and transactions

  • This is faithful to the semantics expressed by the rule, but it does mean that a rogue transaction could completely monopolise a HEC.
  • An alternative possibility (Plan B) is for the RTS to roll the transaction back to the beginning, and then deliver the tick using rule (TICK).
  • That too is implementable, but this time the risk is that a slightly-too-long transaction would always be rolled back, so it would never make progress.
  • And that transaction is likely to run the very same code that has just been interrupted.

6.5 Black holes

  • To avoid duplicate evaluation, the RTS (in intimate cooperation with the compiler) arranges for a thread B to black-hole a thunk x when B starts to evaluate it.
  • This mechanism, and its implementation on a multicore, is described in detail in earlier work [9].
  • The RTS behaves as if the black-hole suspension and resumption occurred just before the transaction, but the implementation actually arranges to resume the transaction from where it left off.
  • Moreover, it is just possible that the thunk is under evaluation by an SCont in this very scheduler’s runqueue, so the black hole is preventing us from scheduling the very SCont that is evaluating it.
  • Since the authors cannot sensibly suspend the switch transaction, they must find a way for it to make progress.

6.6 Interaction with RTS MVars

  • An added advantage of their scheduler activation interface is that the authors are able to reuse the existing MVar implementation in the RTS.
  • This significantly reduces the burden of migrating to a ULS implementation.

6.7 Asynchronous exceptions

  • GHC supports asynchronous exceptions, in which one thread can send an asynchronous interrupt to another [16].
  • This is a very tricky area; for example, if a thread is blocked on a user-level MVar (Section 4.2), and receives an exception, it should wake up and do something — even though it is linked onto an unknown queue of blocked threads.
  • The authors' implementation does in fact handle asynchronous exceptions, but they are not yet happy with the details of the design, and in any case space precludes presenting them here.

6.8 On the correctness of user-level schedulers

  • While the concurrency substrate exposes the ability to build ULS’s, the onus is on the scheduler implementation to ensure that it is sensible.
  • The authors' implementation dynamically enforces such invariants through runtime assertions.
  • An activation raising an exception indicates an error in the ULS implementation, and the substrate simply reports it to the standard error stream.
  • A thread suspended on an ULS may become unreachable if the scheduler data structure holding it becomes unreachable.
  • A thread indefinitely blocked on an RTS MVar operation has an exception raised in it and is added to its ULS.

7. Results

  • The authors' implementation is a fork of GHC, and supports all of the features discussed in the paper.
  • The benchmarks offer varying degrees of parallelisation opportunity.
  • K-nucleotide, mandelbrot and spectral-norm are computation intensive, while chameneos and primes-sieve are communication intensive and are specifically intended to test the overheads of thread synchronisation.
  • Additionally, in these benchmarks, the LWC version performs 3×–8× more allocations than the vanilla version.


Purdue University
Purdue e-Pubs
Department of Computer Science Technical Reports
Department of Computer Science
2014

Composable Scheduler Activations for Haskell

KC Sivaramakrishnan, Purdue University, chandras@cs.purdue.edu
Tim Harris, Oracle Labs, timothy.l.harris@oracle.com
Simon Marlow, Facebook UK Ltd., smarlow@fb.com
Simon Peyton Jones, Microsoft Research, Cambridge, simonpj@microsoft.com

Report Number: 14-004

Sivaramakrishnan, KC; Harris, Tim; Marlow, Simon; and Peyton Jones, Simon, "Composable Scheduler Activations for Haskell" (2014). Department of Computer Science Technical Reports. Paper 1774.
https://docs.lib.purdue.edu/cstech/1774

This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries. Please contact epubs@purdue.edu for additional information.

Composable Scheduler Activations for Haskell

KC Sivaramakrishnan, Purdue University, chandras@cs.purdue.edu
Tim Harris¹, Oracle Labs, timothy.l.harris@oracle.com
Simon Marlow¹, Facebook UK Ltd., smarlow@fb.com
Simon Peyton Jones, Microsoft Research, Cambridge, simonpj@microsoft.com
Abstract

The runtime for a modern, concurrent, garbage collected language like Java or Haskell is like an operating system: sophisticated, complex, performant, but alas very hard to change. If more of the runtime system were in the high level language, it would be far more modular and malleable. In this paper, we describe a novel concurrency substrate design for the Glasgow Haskell Compiler (GHC) that allows multicore schedulers for concurrent and parallel Haskell programs to be safely and modularly described as libraries in Haskell. The approach relies on abstracting the interface to the user-implemented schedulers through scheduler activations, together with the use of Software Transactional Memory (STM) to promote safety in a multicore context.
1. Introduction

High performance, multicore-capable runtime systems (RTS) for garbage-collected languages have been in widespread use for many years. Examples include virtual machines for popular object-oriented languages such as Oracle's Java HotSpot VM [12], IBM's Java VM [13], Microsoft's Common Language Runtime (CLR) [19], as well as functional language runtimes such as Manticore [22], MultiMLton [27] and the Glasgow Haskell Compiler (GHC) [8].

These runtime systems tend to be complex monolithic pieces of software, written not in the high-level source language (Java, Haskell, etc), but in an unsafe, systems programming language (usually C or C++). They are highly concurrent, with extensive use of locks, condition variables, timers, asynchronous I/O, thread pools, and other arcana. As a result, they are extremely difficult to modify, even for their own authors. Moreover, such modifications typically require a rebuild of the runtime, so it is not an easy matter to make changes on a program-by-program basis, let alone within a single program.
¹ This work was done at Microsoft Research, Cambridge.
This lack of malleability is particularly unfortunate for the thread scheduler, which governs how the computational resources of the multi-core are deployed to run zillions of lightweight high-level language threads. A broad range of strategies are possible, including ones using priorities, hierarchical scheduling, gang scheduling, and work stealing. Different strategies might suit different multi-cores, or different application programs or parts thereof. The goal of this paper is, therefore, to allow programmers to write a User Level Scheduler (ULS), as a library written in the high-level language itself. Not only does this make the scheduler more modular and changeable, but it can readily be varied between programs, or even within a single program.

The difficulty is that the scheduler interacts intimately with other aspects of the runtime such as transactional memory or blocking I/O. Our main contribution is the design of an interface that allows expressive user-level schedulers to interact cleanly with these low-level communication and synchronisation primitives:
• We present a new concurrency substrate design for Haskell that allows application programmers to write schedulers for Concurrent Haskell programs in Haskell (Section 3). These schedulers can then be plugged in as ordinary user libraries in the target program.

• By abstracting the interface to the ULS through scheduler activations, our concurrency substrate seamlessly integrates with the existing RTS concurrency support such as MVars, asynchronous exceptions [16], safe foreign function interface [17], software transactional memory [10], resumable black-holes [20], etc. The RTS makes upcalls to the activations whenever it needs to interact with the ULS. This design absolves the scheduler writer from having to reason about the interaction between the ULS and the RTS, thus lowering the bar for writing new schedulers.

• Concurrency primitives and their interaction with the RTS are particularly tricky to specify and reason about. An unusual feature of this paper is that we precisely formalise not only the concurrency substrate primitives (Section 5), but also their interaction with the RTS concurrency primitives (Section 6).

• We present an implementation of our concurrency substrate in GHC. Experimental evaluation indicates that the performance of ULS's is comparable to the highly optimised default scheduler of GHC (Section 7).

[Figure 1. The anatomy of the Glasgow Haskell Compiler runtime system: the scheduler, MVar support, safe FFI, GC, asynchronous exceptions, and STM live in the RTS, written in C by the language developer; the concurrent application is written in Haskell by the application developer.]
2. Background

To understand the design of the new concurrency substrate for Haskell, we must first give some background on the existing RTS support for concurrency in our target platform – the Glasgow Haskell Compiler (GHC). We then articulate the goals of our concurrency substrate.
2.1 The GHC runtime system

GHC has a sophisticated, highly tuned RTS that has rich support for concurrency with advanced features such as software transactional memory [10], asynchronous exceptions [16], safe foreign function interface [17], and transparent scaling on multicores [9]. The Haskell programmer can use very lightweight Haskell threads, which are executed by a fixed number of Haskell execution contexts, or HECs. Each HEC is in turn animated by an operating system thread; in this paper we use the term tasks for these OS threads, to distinguish them from Haskell threads. The choice of which Haskell thread is executed by which HEC is made by the scheduler.

GHC's current scheduler is written in C, and is hardwired into the RTS (Figure 1). It uses a single run-queue per processor, and has a single, fixed notion of work-sharing to move work from one processor to another. There is no notion of thread priority; nor is there support for advanced scheduling policies such as gang or spatial scheduling. From an application developer's perspective, the lack of flexibility hinders deployment of new programming models on top of GHC such as data-parallel computations [4, 15], and applications such as virtual machines [7] and web-servers [11] that can benefit from the ability to define custom scheduling policies.
2.2 The challenge

Because there is such a rich design space for schedulers, our goal is to allow a user-level scheduler (ULS) to be written in Haskell, giving programmers the freedom to experiment with different scheduling or work-stealing algorithms. Indeed, we would like the ability to combine multiple ULS's in the same program. For example, in order to utilise the best scheduling strategy, a program could dynamically switch from a priority-based scheduler to gang scheduling when switching from general purpose computation to data-parallel computation. Applications might also combine the schedulers in a hierarchical fashion; a scheduler receives computational resources from its parent, and divides them among its children.

This goal is not easy to achieve. The scheduler interacts intimately with other RTS components, including:

• MVars and transactional memory [10] allow Haskell threads to communicate and synchronise; they may cause threads to block or unblock.

• The garbage collector must somehow know about the run-queue on each HEC, so that it can use it as a root for garbage collection.

• Lazy evaluation means that if a Haskell thread tries to evaluate a thunk that is already under evaluation by another thread (it is a "black hole"), the former must block until the thunk's evaluation is complete [9]. Matters are made more complicated by asynchronous exceptions, which may cause a thread to abandon evaluation of a thunk, replacing the thunk with a "resumable black hole".

• A foreign-function call may block (e.g. when doing I/O). GHC's RTS can schedule a fresh task (OS thread) to re-animate the HEC, blocking the in-flight Haskell thread, and scheduling a new one [17].

All of these components do things like "block a thread" or "unblock a thread" that require interaction with the scheduler. One possible response, taken by Li et al [14], is to program these components, too, into Haskell. The difficulty is that they are all intricate and highly-optimised. Moreover, unlike scheduling, there is no call from Haskell's users for them to be user-programmable.

Instead, our goal is to tease out the scheduler implementation from the rest of the RTS, establishing a clear API between the two, and leaving unchanged the existing implementation of MVars, STM, black holes, FFI, and so on.

Lastly, schedulers are themselves concurrent programs, and they are particularly devious ones. Using the facilities available in C, they are extremely hard to get right. Given that the ULS will be implemented in Haskell, we would like to utilise the concurrency control abstractions provided by Haskell (notably transactional memory) to simplify the task of scheduler implementation.
3. Design

In this section, we describe the design of our concurrency substrate and present the concurrency substrate API. Along the way, we will describe how our design achieves the goals put forth in the previous section.
3.1 Scheduler activation
Our key observation is that the interaction between the
scheduler and the rest of the RTS can be reduced to two
fundamental operations:
1. Block operation. The currently running thread blocks
on some event in the RTS. The execution proceeds by
switching to the next available thread from the scheduler.
2. Unblock operation. The RTS event that a blocked thread
is waiting on occurs. After this, the blocked thread is
resumed by adding it to the scheduler.
For example, in Haskell, a thread might encounter an empty MVar while attempting to take the value from it².
In this case, the thread performing the MVar read operation
should block. Eventually, the MVar might be filled by some
other thread (analogous to lock release), in which case, the
blocked thread is unblocked and resumed with the value
from the MVar. As we will see, all of the RTS interactions
(as well as the interaction with the concurrency libraries) fall
into this pattern.
Notice that the RTS blocking operations enqueue and
dequeue threads from the scheduler. But the scheduler is
now implemented as a Haskell library. So how does the RTS
find the scheduler? We could equip each HEC with a fixed
scheduler, but it is much more flexible to equip each Haskell
thread with its own scheduler. That way, different threads
(or groups thereof) can have different schedulers.
But what precisely is a “scheduler”? In our design, the scheduler is represented by two function values, or scheduler activations³. Every user-level thread has a dequeue activation and an enqueue activation. The activations provide an abstract interface to the ULS to which the thread belongs. At the very least, the dequeue activation fetches the next available thread from the ULS encapsulated in the activation, and the enqueue activation adds the given thread to the encapsulated ULS. The activations are stored at known offsets in the thread object so that the RTS may find them. The RTS makes upcalls to the activations to perform the enqueue and dequeue operations on a ULS.
Figure 2 illustrates the modified RTS design that supports the implementation of ULS's. The idea is to have a minimal concurrency substrate which is implemented in C and is a part of the RTS. The substrate not only allows the programmer to implement schedulers as Haskell libraries, but also enables other RTS mechanisms to interface with the user-level schedulers through upcalls to the activations.
² This operation is analogous to attempting to take a lock that is currently held by some other thread.
³ The term “activation” comes from the operating systems literature [1].

[Figure 2. New GHC RTS design with Concurrency Substrate: the RTS now contains a minimal concurrency substrate (in C, by the language developer) alongside MVar, safe FFI, GC, asynchronous exception and STM support; the user-level scheduler is an ordinary Haskell library written by the application developer, reached through the activation interface via upcalls.]

[Figure 3. Blocking on an RTS event: the current thread t waits on event e; the RTS invokes t.dequeueAct() to obtain the next runnable thread t' and switches to it.]

Figure 3 illustrates the steps associated with blocking on an RTS event. Since the scheduler is implemented in user-space, each HEC in the RTS is aware of only the currently
running thread, say t. Suppose thread t waits for an abstract
event e in the RTS, which is currently disabled. Since the
thread t cannot continue until e is enabled, the RTS adds t
to the queue of threads associated with e, which are currently
waiting for e to be enabled. Notice that the RTS “owns” t
at this point. The RTS now invokes the dequeue activation
associated with t, which returns the next runnable thread
from ts scheduler queue, say t’. This HEC now switches
control to t’ and resumes execution. The overall effect of the
operation ensure that although the thread t is blocked, ts
scheduler (and the threads that belong to it) is not blocked.
[Figure 4. Unblocking from an RTS event: when event e is enabled, the RTS invokes t.enqueueAct() to hand the blocked thread t back to its scheduler, while the current thread t' continues to run.]
Figure 4 illustrates the steps involved in unblocking from
an RTS event. Eventually, the disabled event e can become
enabled. At this point, the RTS wakes up all of the threads
waiting on event e by invoking their enqueue activation.
Suppose we want to resume the thread t which is blocked on e. The RTS invokes t's enqueue activation to add t to its scheduler. Since t's scheduler is already running, t will eventually be scheduled again.
3.2 Software transactional memory

Since Haskell computations can run in parallel on different HECs, the substrate must provide a method for safely coordinating activities across multiple HECs. Similar to Li's substrate design [14], we adopt transactional memory (STM) as the sole multiprocessor synchronisation mechanism exposed by the substrate. Using transactional memory, rather than locks and condition variables, makes complex concurrent programs much more modular and less error-prone [10] – and schedulers are prime candidates, because they are prone to subtle concurrency bugs.
3.3 Concurrency substrate

Now that we have motivated our design decisions, we will present the API for the concurrency substrate. The concurrency substrate includes the primitives for instantiating and switching between language level threads, manipulating thread local state, and an abstraction for scheduler activations. The API is presented below:

data SCont
type DequeueAct = SCont -> STM SCont
type EnqueueAct = SCont -> STM ()

-- activation interface
dequeueAct :: DequeueAct
enqueueAct :: EnqueueAct

-- SCont manipulation
newSCont     :: IO () -> IO SCont
switch       :: (SCont -> STM SCont) -> IO ()
runOnIdleHEC :: SCont -> IO ()

-- Manipulating local state
setDequeueAct :: DequeueAct -> IO ()
setEnqueueAct :: EnqueueAct -> IO ()
getAux        :: SCont -> STM Dynamic
setAux        :: SCont -> Dynamic -> STM ()
3.3.1 Activation interface

Rather than directly exposing the notion of a “thread”, the substrate offers one-shot continuations [3], of type SCont. An SCont is a heap-allocated object representing the current state of a Haskell computation. In the RTS, SConts are represented quite conventionally by a heap-allocated Thread Storage Object (TSO), which includes the computation's stack and local state, saved registers, and program counter. Unreachable SConts are garbage collected.

The call (dequeueAct s) invokes s's dequeue activation, passing s to it like a “self” parameter. The return type of dequeueAct indicates that the computation encapsulated in the dequeueAct is transactional (under the STM monad⁴), which, when discharged, returns an SCont. Similarly, the call (enqueueAct s) invokes the enqueue activation transactionally, which enqueues s to its ULS.

⁴ http://hackage.haskell.org/package/stm-2.1.1.0/docs/Control-Concurrent-STM.html

Since the activations are under the STM monad, we have the assurance that ULS's cannot be built with low-level unsafe components such as locks and condition variables. Such low-level operations would be under the IO monad, which cannot be part of an STM transaction. Thus, our concurrency substrate statically prevents the implementation of potentially unsafe schedulers.
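As an illustration of the local-state part of the API, an SCont's aux field can carry scheduler-specific data. The sketch below is our own example (the helper names are not from the paper): it tags each thread with an Int priority, which a priority-aware ULS could consult in its dequeue activation.

import Data.Dynamic (toDyn, fromDynamic)
import Data.Maybe (fromMaybe)

setPriority :: SCont -> Int -> STM ()
setPriority s p = setAux s (toDyn p)

-- Defaults to priority 0 if the aux field does not hold an Int.
getPriority :: SCont -> STM Int
getPriority s = do
  d <- getAux s
  return (fromMaybe 0 (fromDynamic d))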
3.3.2 SCont management

The substrate offers primitives for creating, constructing and transferring control between SConts. The call (newSCont M) creates a new SCont that, when scheduled, executes M. By default, the newly created SCont is associated with the ULS of the invoking thread. This is done by copying the invoking SCont's activations.

An SCont is scheduled (i.e. is given control of a HEC) by the switch primitive. The call (switch M) applies M to the current continuation s. Notice that (M s) is an STM computation. In a single atomic transaction switch performs the computation (M s), yielding an SCont s′, and switches control to s′. Thus, the computation encapsulated by s′ becomes the currently running computation on this HEC.

Since our continuations are one-shot, capturing a continuation simply fetches the reference to the underlying TSO object. Hence, continuation capture involves no copying, and is cheap. Using the SCont interface, a cooperative scheduler can be built as follows:

yield :: IO ()
yield = switch (\s -> enqueueAct s >> dequeueAct s)
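In the same style, a forkIO-like primitive can be built on the substrate (a minimal sketch of our own; the paper does not present this exact function). Since newSCont copies the parent's activations, handing the new SCont to its scheduler is a single transactional enqueue:

import Control.Concurrent.STM (atomically)

forkULS :: IO () -> IO ()
forkULS task = do
  s <- newSCont task          -- s inherits the invoking SCont's activations
  atomically (enqueueAct s)   -- hand s to its (inherited) ULS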
3.4 Parallel SCont execution

When the program begins execution, a fixed number of HECs (N) is provided to it by the environment. This signifies the maximum number of parallel computations in the program. Of these, one HEC runs the main IO computation. All other HECs are in the idle state. The call (runOnIdleHEC s) initiates parallel execution of SCont s on an idle HEC. Once the SCont running on a HEC finishes evaluation, the HEC moves back to the idle state.

Notice that the upcall from the RTS to the dequeue activation, as well as the body of the switch primitive, returns an SCont. This is the SCont to which control subsequently switches. But what if such an SCont cannot be found? This situation can occur during multicore execution, when the number of available threads is less than the number of HECs. If a HEC does not have any work to do, it should be put to sleep.

Notice that the result of the dequeue activation and the body of the switch primitive are STM transactions. GHC today supports blocking operations under STM. When the programmer invokes retry inside a transaction, the RTS blocks the thread until another thread writes to any of the transactional variables read by the transaction; then the thread is re-awoken, and retries the transaction [10]. We reuse this mechanism for HECs: the use of retry within a switch transaction or dequeue activation is interpreted as putting the whole HEC to sleep.
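To make the idle-HEC story concrete, the following sketch (ours, following the description of "spinning up additional HECs" in Section 4.1) starts a trivial bootstrap SCont on each idle HEC. Each bootstrap simply switches to whatever its inherited dequeue activation yields; if the ULS is empty, the retry inside the activation puts that HEC to sleep, exactly as described above:

import Control.Monad (replicateM_)

spinUpHECs :: Int -> IO ()
spinUpHECs n = replicateM_ n $ do
  -- The new SCont inherits the current ULS's activations.
  s <- newSCont (switch dequeueAct)  -- fetch the next thread and run it
  runOnIdleHEC s                     -- animate an idle HEC with it

Since the bootstrap SCont switches away and is never re-enqueued, it is a one-shot continuation that is simply garbage collected, much like the upcall SConts of Section 6.2.4.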

Citations
Book ChapterDOI
19 Jun 2017
TL;DR: It is made the observation that effect handlers can elegantly express particularly difficult programs that combine system programming and concurrency without compromising performance.
Abstract: Algebraic effects and their handlers have been steadily gaining attention as a programming language feature for composably expressing user-defined computational effects. While several prototype implementations of languages incorporating algebraic effects exist, Multicore OCaml incorporates effect handlers as the primary means of expressing concurrency in the language. In this paper, we make the observation that effect handlers can elegantly express particularly difficult programs that combine system programming and concurrency without compromising performance. Our experimental results on a highly concurrent and scalable web server demonstrate that effect handlers perform on par with highly optimised monadic concurrency libraries, while retaining the simplicity of direct-style code.

40 citations


Cites background from "Composable scheduler activations fo..."

  • ...Attempts to lift the scheduler from the runtime system to a library in the high-level language while retaining other features in the runtime system lead to further complications [31]....


  • ...In Multicore OCaml, the user-level thread schedulers themselves are expressed as OCaml libraries, thus minimising the secret sauce that gets baked into high-performance multicore runtime systems [31]....


Proceedings ArticleDOI
28 Aug 2013
TL;DR: The design and implementation of a new parallel Haskell RTE implementation, GUMSMP, which exploits hierarchical platforms more effectively is presented, designed to efficiently combine distributed memory parallelism, using a virtual shared heap over a cluster, with low-overhead shared memory Parallelism on the multicores.
Abstract: The most widely available high performance platforms today are multilevel clusters of multicores. The Glasgow Haskell Compiler (GHC) provides a number of parallel Haskell implementations targeting different parallel architectures. In particular, GHC-SMP supports shared memory, and GHC-GUM supports distributed memory machines. Both implementations use different, but related, runtime environment (RTE) mechanisms. Good performance results can be achieved on shared memory architectures and on networks individually. However, a combination of both, for networks of multicores, is lacking. We present the design and implementation of a new parallel Haskell RTE implementation, GUMSMP, which exploits hierarchical platforms more effectively. It is designed to efficiently combine distributed memory parallelism, using a virtual shared heap over a cluster, with low-overhead shared memory parallelism on the multicores. Key design objectives for realising this system are: asymmetric load balance, effective latency hiding, and mostly passive load distribution. We show that the automatic hierarchical load distribution policies must be carefully tuned to obtain good performance, showing the impact of several policies, including work pre-fetching and favouring inter-node work distribution. We present the initial performance results for this implementation, demonstrating the good scalability of a set of 8 benchmarks on up to 100 cores, and show performance gains of up to 20% compared to GHC-GUM.

8 citations

Journal ArticleDOI
TL;DR: The PArallEl shAred Nothing runtime system design aims to provide a portable and high-level shared-nothing implementation platform for parallel Haskell dialects, and builds on, unifies and extends, existing well-developed support for shared-memory parallelism that is provided by the widely used GHC Haskell compiler.
Abstract: Abstract Over time, several competing approaches to parallel Haskell programming have emerged. Different approaches support parallelism at various different scales, ranging from small multicores to massively parallel high-performance computing systems. They also provide varying degrees of control, ranging from completely implicit approaches to ones providing full programmer control. Most current designs assume a shared memory model at the programmer, implementation and hardware levels. This is, however, becoming increasingly divorced from the reality at the hardware level. It also imposes significant unwanted runtime overheads in the form of garbage collection synchronisation etc. What is needed is an easy way to abstract over the implementation and hardware levels, while presenting a simple parallelism model to the programmer. The PArallEl shAred Nothing runtime system design aims to provide a portable and high-level shared-nothing implementation platform for parallel Haskell dialects. It abstracts over major issues such as work distribution and data serialisation, consolidating existing, successful designs into a single framework. It also provides an optional virtual shared-memory programming abstraction for (possibly) shared-nothing parallel machines, such as modern multicore/manycore architectures or cluster/cloud computing systems. It builds on, unifies and extends, existing well-developed support for shared-memory parallelism that is provided by the widely used GHC Haskell compiler. This paper summarises the state-of-the-art in shared-nothing parallel Haskell implementations, introduces the PArallEl shAred Nothing abstractions, shows how they can be used to implement three distinct parallel Haskell dialects, and demonstrates that good scalability can be obtained on recent parallel machines.

5 citations


Cites background from "Composable scheduler activations fo..."

  • ...Two separate lightweight implementations of concurrent Haskell have also been produced that lift scheduling and other concurrency features to the Haskell level (Li et al., 2007; Sivaramakrishnan et al., 2013)....


Proceedings ArticleDOI
22 Jun 2020
TL;DR: This paper presents a concurrent, statically enforced IFC language that, as a novelty, features asynchronous exceptions, and shows how asynchronous exceptions easily enable useful programming patterns like speculative execution and some degree of resource management.
Abstract: Language-based information-flow control (IFC) techniques often rely on special purpose, ad-hoc primitives to address different covert channels that originate in the runtime system, beyond the scope of language constructs. Since these piecemeal solutions may not compose securely, there is a need for a unified mechanism to control covert channels. As a first step towards this goal, we argue for the design of a general interface that allows programs to safely interact with the runtime system and the available computing resources. To coordinate the communication between programs and the runtime system, we propose the use of asynchronous exceptions (interrupts), which, to the best of our knowledge, have not been considered before in the context of IFC languages. Since asynchronous exceptions can be raised at any point during execution – often due to the occurrence of an external event – threads must temporarily mask them out when manipulating locks and shared data structures to avoid deadlocks and, therefore, breaking program invariants. Crucially, the naive combination of asynchronous exceptions with existing features of IFC languages (e.g., concurrency and synchronization variables) may open up new possibilities of information leakage. In this paper, we present $\mathrm{MAC}_{async}$, a concurrent, statically enforced IFC language that, as a novelty, features asynchronous exceptions. We show how asynchronous exceptions easily enable (out of the box) useful programming patterns like speculative execution and some degree of resource management. We prove that programs in $\mathrm{MAC}_{async}$ satisfy progress-sensitive non-interference and mechanize our formal claims in the Agda proof assistant.

1 citations

Proceedings ArticleDOI
01 Oct 2018
TL;DR: This work presents four transactional schedulers implemented entirely in Haskell using different abstraction levels and presents, despite the inherent overhead of high-level implementations, a reduction in the conflict rates.
Abstract: Transactional Memory is an abstraction that helps concurrent programming; however, in high-contention scenarios it presents low performance because of the high conflict rate between transactions. In this work, we present four transactional schedulers implemented entirely in Haskell using different abstraction levels. The results show that, despite the inherent overhead of high-level implementations, the schedulers achieve a reduction in conflict rates.

Cites background from "Composable scheduler activations fo..."

  • ...The thread-scheduling model developed here was based on a model presented in [15]....


  • ...[15] K....


References
Proceedings ArticleDOI
11 Oct 2009
TL;DR: This work investigates a new OS structure, the multikernel, that treats the machine as a network of independent cores, assumes no inter-core sharing at the lowest level, and moves traditional OS functionality to a distributed system of processes that communicate via message-passing.
Abstract: Commodity computer systems contain more and more processor cores and exhibit increasingly diverse architectural tradeoffs, including memory hierarchies, interconnects, instruction sets and variants, and IO configurations. Previous high-performance computing systems have scaled in specific cases, but the dynamic nature of modern client and server workloads, coupled with the impossibility of statically optimizing an OS for all workloads and hardware variants, pose serious challenges for operating system structures. We argue that the challenge of future multicore hardware is best met by embracing the networked nature of the machine, rethinking OS architecture using ideas from distributed systems. We investigate a new OS structure, the multikernel, that treats the machine as a network of independent cores, assumes no inter-core sharing at the lowest level, and moves traditional OS functionality to a distributed system of processes that communicate via message-passing. We have implemented a multikernel OS to show that the approach is promising, and we describe how traditional scalability problems for operating systems (such as memory management) can be effectively recast using messages and can exploit insights from distributed systems and networking. An evaluation of our prototype on multicore systems shows that, even on present-day machines, the performance of a multikernel is comparable with a conventional OS, and can scale better to support future hardware.

926 citations


"Composable scheduler activations fo..." refers methods in this paper

  • ...Scheduler activations have successfully been demonstrated to interface kernel with the user-level process scheduler (Williams, 2002; Baumann et al., 2009)....


  • ...Scheduler activations [2] have successfully been demonstrated to interface kernel with the user-level process scheduler [3, 20]....


Proceedings ArticleDOI
15 Jun 2005
TL;DR: This paper presents a new concurrency model, based on transactional memory, that offers far richer composition, and describes new modular forms of blocking and choice that have been inaccessible in earlier work.
Abstract: Writing concurrent programs is notoriously difficult, and is of increasing practical importance. A particular source of concern is that even correctly-implemented concurrency abstractions cannot be composed together to form larger abstractions. In this paper we present a new concurrency model, based on transactional memory, that offers far richer composition. All the usual benefits of transactional memory are present (e.g. freedom from deadlock), but in addition we describe new modular forms of blocking and choice that have been inaccessible in earlier work.

815 citations


"Composable scheduler activations fo..." refers background or methods in this paper

  • ...This mechanism, and its implementation on a multicore, is described in detail in earlier work (Harris et al., 2005b)....


  • ...The ability to perform blocking operations in the scheduler allows us to utilise STM based concurrency libraries such as TMVar [8] with minimal refactoring....


  • ...The scheduler interacts intimately with other RTS components including • MVars and transactional memory (Harris et al., 2005a) allow Haskell threads to communicate and synchronise; they may cause threads to block or unblock....


  • ...• Lazy evaluation means that if a Haskell thread tries to evaluate a thunk that is already under evaluation by another thread (it is a “black hole”), the former must block until the thunk’s evaluation is complete (Harris et al., 2005b)....


  • ...…tuned RTS that has a rich support for concurrency with advanced features such as software transactional memory (Harris et al., 2005a), asynchronous exceptions (Marlow et al., 2001), safe foreign function interface (Marlow et al., 2004), and transparent scaling on multicores (Harris et al., 2005b)....


Proceedings ArticleDOI
01 Sep 1991
TL;DR: It is argued that the performance of kernel threads is inherently worse than that of user-level threads, rather than this being an artifact of existing implementations, and that managing parallelism at the user level is essential to high-performance parallel computing.
Abstract: Threads are the vehicle for concurrency in many approaches to parallel programming. Threads separate the notion of a sequential execution stream from the other aspects of traditional UNIX-like processes, such as address spaces and I/O descriptors. The objective of this separation is to make the expression and control of parallelism sufficiently cheap that the programmer or compiler can exploit even fine-grained parallelism with acceptable overhead. Threads can be supported either by the operating system kernel or by user-level library code in the application address space, but neither approach has been fully satisfactory. This paper addresses this dilemma. First, we argue that the performance of kernel threads is inherently worse than that of user-level threads, rather than this being an artifact of existing implementations; we thus argue that managing parallelism at the user level is essential to high-performance parallel computing. Next, we argue that the lack of system integration exhibited by user-level threads is a consequence of the lack of kernel support for user-level threads provided by contemporary multiprocessor operating systems; we thus argue that kernel threads or processes, as currently conceived, are the wrong abstraction on which to support user-level management of parallelism. Finally, we describe the design, implementation, and performance of a new kernel interface and user-level thread package that together provide the same functionality as kernel threads without compromising the performance and flexibility advantages of user-level management of parallelism.

581 citations


"Composable scheduler activations fo..." refers background or methods in this paper

  • ...Scheduler activations [2] have successfully been demonstrated to interface kernel with the user-level process scheduler [3, 20]....


  • ...• Our concurrency substrate design relies on abstracting the interface to the user-level scheduler through scheduler activations [2] (Section 4....


  • ...2 The term “activation” comes from the operating systems literature [2]....


Proceedings ArticleDOI
16 Mar 2013
TL;DR: The Mirage prototype compiles OCaml code into unikernels that run on commodity clouds and offer an order of magnitude reduction in code size without significant performance penalty, and demonstrates that the hypervisor is a platform that overcomes the hardware compatibility issues that have made past library operating systems impractical to deploy in the real-world.
Abstract: We present unikernels, a new approach to deploying cloud services via applications written in high-level source code. Unikernels are single-purpose appliances that are compile-time specialised into standalone kernels, and sealed against modification when deployed to a cloud platform. In return they offer significant reduction in image sizes, improved efficiency and security, and should reduce operational costs. Our Mirage prototype compiles OCaml code into unikernels that run on commodity clouds and offer an order of magnitude reduction in code size without significant performance penalty. The architecture combines static type-safety with a single address-space layout that can be made immutable via a hypervisor extension. Mirage contributes a suite of type-safe protocol libraries, and our results demonstrate that the hypervisor is a platform that overcomes the hardware compatibility issues that have made past library operating systems impractical to deploy in the real-world.

476 citations


"Composable scheduler activations fo..." refers methods in this paper

  • ...MirageOS (Madhavapeddy et al., 2013) is a unikernel implemented in OCaml, and uses monadic Lwt threads (Vouillon, 2008) for cooperative concurrency....


Journal ArticleDOI
TL;DR: In this paper, the authors argue that the performance of kernel threads is inherently worse than that of user-level threads, rather than this being an artifact of existing implementations; managing parallelism at the user level is essential to high-performance parallel computing.
Abstract: Threads are the vehicle for concurrency in many approaches to parallel programming. Threads can be supported either by the operating system kernel or by user-level library code in the application address space, but neither approach has been fully satisfactory. This paper addresses this dilemma. First, we argue that the performance of kernel threads is inherently worse than that of user-level threads, rather than this being an artifact of existing implementations; managing parallelism at the user level is essential to high-performance parallel computing. Next, we argue that the problems encountered in integrating user-level threads with other system services is a consequence of the lack of kernel support for user-level threads provided by contemporary multiprocessor operating systems; kernel threads are the wrong abstraction on which to support user-level management of parallelism. Finally, we describe the design, implementation, and performance of a new kernel interface and user-level thread package that together provide the same functionality as kernel threads without compromising the performance and flexibility advantages of user-level management of parallelism.

437 citations

Frequently Asked Questions (15)
Q1. What have the authors contributed in "Composable scheduler activations for haskell" ?

In this paper, the authors describe a novel concurrency substrate design for the Glasgow Haskell Compiler (GHC) that allows multicore schedulers for concurrent and parallel Haskell programs to be safely and modularly described as libraries in Haskell.

As the next step, the authors plan to improve upon their current solution for handling asynchronous exceptions. 



While Manticore [22] and MultiMLton [27] utilise the low-level compare-and-swap operation as the core synchronisation primitive, Li et al.'s concurrency substrate [14] for GHC was the first to utilise transactional memory for multiprocessor synchronisation in the context of ULS's.

Jikes supports unsafe low-level operations to block and synchronise threads in order to implement other operations such as garbage collection. 





The fact that the scheduler itself is now implemented in user-space complicates error recovery and reporting when threads become unreachable. 


Since the authors have already resumed the scheduler, the correct behaviour is to prepare the SCont s with the result and add it to its ULS. 

