
Composable scheduler activations for Haskell

TL;DR: A novel concurrency substrate design for the Glasgow Haskell Compiler is described that allows multicore schedulers for concurrent and parallel Haskell programs to be safely and modularly described as libraries in Haskell.
Abstract: The runtime for a modern, concurrent, garbage collected language like Java or Haskell is like an operating system: sophisticated, complex, performant, but alas very hard to change. If more of the runtime system were in the high level language, it would be far more modular and malleable. In this paper, we describe a novel concurrency substrate design for the Glasgow Haskell Compiler (GHC) that allows multicore schedulers for concurrent and parallel Haskell programs to be safely and modularly described as libraries in Haskell. The approach relies on abstracting the interface to the user-implemented schedulers through scheduler activations, together with the use of Software Transactional Memory (STM) to promote safety in a multicore context.

Summary (6 min read)

1. Introduction

  • High performance, multicore-capable runtime systems (RTS) for garbage-collected languages have been in widespread use for many years.
  • As a result, they are extremely difficult to modify, even for their own authors.
  • Different strategies might suit different multi-cores, or different application programs or parts thereof.
  • By abstracting the interface to the ULS through scheduler activations, their concurrency substrate seamlessly integrates with the existing RTS concurrency support such as MVars, asynchronous exceptions [16], safe foreign function interface [17], software transactional memory [10], resumable black-holes [20], etc.
  • This design absolves the scheduler writer from having to reason about the interaction between the ULS and the RTS, thus lowering the bar for writing new schedulers.

2. Background

  • To understand the design of the new concurrency substrate for Haskell, the authors must first give some background on the existing RTS support for concurrency in their target platform – the Glasgow Haskell Compiler (GHC).
  • The authors then articulate the goals of their concurrency substrate.

2.1 The GHC runtime system

  • GHC has a sophisticated, highly tuned RTS that has a rich support for concurrency with advanced features such as software transactional memory [10], asynchronous exceptions [16], safe foreign function interface [17], and transparent scaling on multicores [9].
  • Each HEC is in turn animated by an operating system thread; in this paper the authors use the term tasks for these OS threads, to distinguish them from Haskell threads.
  • GHC’s current scheduler is written in C, and is hardwired into the RTS (Figure 1).
  • It uses a single run-queue per processor, and has a single, fixed notion of work-sharing to move work from one processor to another.
  • There is no notion of thread priority; nor is there support for advanced scheduling policies such as gang or spatial scheduling.

2.2 The challenge

  • Because there is such a rich design space for schedulers, their goal is to allow a user-level scheduler (ULS) to be written in Haskell, giving programmers the freedom to experiment with different scheduling or work-stealing algorithms.
  • Applications might also combine the schedulers in a hierarchical fashion; a scheduler receives computational resources from its parent, and divides them among its children.
  • Matters are made more complicated by asynchronous exceptions, which may cause a thread to abandon evaluation of a thunk, replacing the thunk with a “resumable black hole”.
  • The difficulty is that they are all intricate and highly optimised.
  • Given that the ULS will be implemented in Haskell, the authors would like to utilise the concurrency control abstractions provided by Haskell (notably transactional memory) to simplify the task of scheduler implementation.

3. Design

  • The authors describe the design of their concurrency substrate and present the concurrency substrate API.
  • Along the way, the authors will describe how their design achieves the goals put forth in the previous section.

3.1 Scheduler activation

  • The authors' key observation is that the interaction between the scheduler and the rest of the RTS can be reduced to two fundamental operations: (1) a block operation, in which the currently running thread blocks on some RTS event and execution proceeds by switching to the next available thread from the scheduler, and (2) an unblock operation, in which the RTS event that a blocked thread is waiting on occurs and the blocked thread is resumed by adding it to the scheduler.
  • Eventually, the MVar might be filled by some other thread (analogous to lock release), in which case, the blocked thread is unblocked and resumed with the value from the MVar.
  • The activations provide an abstract interface to the ULS to which the thread belongs.
  • The substrate not only allows the programmer to implement schedulers as Haskell libraries, but also enables other RTS mechanisms to interface with the user-level schedulers through upcalls to the activations.

3.2 Software transactional memory

  • Since Haskell computations can run in parallel on different HECs, the substrate must provide a method for safely coordinating activities across multiple HECs.
  • Similar to Li’s substrate design [14], the authors adopt software transactional memory (STM) as the sole multiprocessor synchronisation mechanism exposed by the substrate.
  • Using transactional memory, rather than locks and condition variables, makes complex concurrent programs much more modular and less error-prone [10] – and schedulers are prime candidates, because they are prone to subtle concurrency bugs. A small illustrative example follows this list.
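As a small illustration of why this matters for schedulers (our own example, not from the paper), STM lets two run queues be manipulated in one atomic step, with no lock ordering to get wrong:

    import Control.Concurrent.STM (STM, TVar, readTVar, writeTVar, modifyTVar')

    -- Atomically move one item from one run queue to another; the whole
    -- read-modify-write of both queues commits as a single transaction.
    migrate :: TVar [a] -> TVar [a] -> STM ()
    migrate from to = do
      q <- readTVar from
      case q of
        []     -> return ()                  -- nothing to move
        (s:ss) -> do writeTVar from ss       -- pop from the source
                     modifyTVar' to (++ [s]) -- append to the destination

With locks, the same operation would need two locks held at once, inviting deadlock; here the transaction either commits both updates or neither.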

3.3 Concurrency substrate

  • Now that the authors have motivated their design decisions, they will present the API for the concurrency substrate.
  • The concurrency substrate includes the primitives for instantiating and switching between language level threads, manipulating thread local state, and an abstraction for scheduler activations.
  • The API is presented below:

    data SCont
    type DequeueAct = SCont -> STM SCont
    type EnqueueAct = SCont -> STM ()

    -- activation interface
    dequeueAct :: DequeueAct
    enqueueAct :: EnqueueAct

    -- SCont manipulation
    newSCont     :: IO () -> IO SCont
    switch       :: (SCont -> STM SCont) -> IO ()
    runOnIdleHEC :: SCont -> IO ()

    -- Manipulating local state
    setDequeueAct :: DequeueAct -> IO ()
    setEnqueueAct :: EnqueueAct -> IO ()
    getAux        :: SCont -> STM Dynamic
    setAux        :: SCont -> Dynamic -> STM ()

3.3.1 Activation interface

  • Rather than directly exposing the notion of a “thread”, the substrate offers one-shot continuations [3], of type SCont.
  • When the program begins execution, a fixed number of HECs (N) is provided to it by the environment.
  • Notice that the result of the dequeue activation and the body of the switch primitive are STM transactions.
  • Along the same lines, the authors interpret the use of retry within a switch or dequeue activation transaction as putting the whole HEC to sleep.
  • The activations of an SCont can be invoked via the dequeueAct and enqueueAct primitives.

4.1 User-level scheduler

  • The first step in designing a scheduler is to describe the scheduler data structure.
  • This involves two steps: (1) allocating the scheduler and initialising the main thread, and (2) spinning up additional HECs.
  • The authors assume that the Haskell program wishing to utilise the ULS performs these two steps at the start of the main IO computation.
  • The main SCont’s activations, initialised in newScheduler, are copied to the newly allocated SCont; a sketch of such a scheduler is given below.
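A minimal sketch of step (1) follows (our own code, assuming the substrate API of Section 3.3; the paper's actual newScheduler is richer). It allocates a single FIFO run queue and installs round-robin activations on the main SCont; step (2) would then use runOnIdleHEC to animate the remaining HECs:

    import Control.Concurrent.STM (TVar, newTVarIO, readTVar, writeTVar, retry)

    -- A round-robin ULS: a single FIFO run queue held in a TVar.
    newScheduler :: IO ()
    newScheduler = do
      q <- newTVarIO ([] :: [SCont])
      let deq _ = do ss <- readTVar q               -- dequeue activation
                     case ss of
                       []     -> retry              -- empty: the HEC sleeps
                       (x:xs) -> writeTVar q xs >> return x
          enq s = do ss <- readTVar q               -- enqueue activation
                     writeTVar q (ss ++ [s])
      setDequeueAct deq   -- install on the main SCont; SConts created
      setEnqueueAct enq   -- later inherit these activations via newSCont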

4.2 Scheduler agnostic user-level MVars

  • The authors' scheduler activations abstract the interface to the ULS's.
  • This fact can be exploited to build scheduler-agnostic implementations of user-level concurrency libraries such as MVars.
  • The following snippet describes the structure of an MVar implementation:

    newtype MVar a = MVar (TVar (MVPState a))
    data MVPState a = Full a [(a, SCont)]
                    | Empty [(IORef a, SCont)]
  • If the MVar is full, the SCont consumes the value and unblocks the next waiting putter SCont, if any.
  • The implementation of putMVar is the dual of this implementation; a sketch of takeMVar in this style is given below.
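Below is a sketch of takeMVar in this scheduler-agnostic style, reconstructed by us from the datatype above rather than copied from the paper. The IORef hole for the received value comes straight from the Empty constructor; the use of unsafeIOToSTM to fill it is our assumption:

    import Control.Concurrent.STM (readTVar, writeTVar)
    import Data.IORef (IORef, newIORef, readIORef, writeIORef)
    import GHC.Conc (unsafeIOToSTM)

    takeMVar :: MVar a -> IO a
    takeMVar (MVar ref) = do
      hole <- newIORef undefined         -- receives the taken value
      switch $ \s -> do                  -- s is the current SCont
        st <- readTVar ref
        case st of
          Full v ((v', putter):ps) -> do -- full, with a blocked putter
            unsafeIOToSTM (writeIORef hole v)
            writeTVar ref (Full v' ps)   -- install the putter's value
            enqueueAct putter            -- unblock the putter via its own ULS
            return s                     -- we keep running
          Full v [] -> do
            unsafeIOToSTM (writeIORef hole v)
            writeTVar ref (Empty [])
            return s
          Empty ts -> do                 -- empty: block ourselves
            writeTVar ref (Empty (ts ++ [(hole, s)]))
            dequeueAct s                 -- switch to the next thread in our ULS
      readIORef hole

Note that blocking is simply "record yourself in the MVar, then switch to whatever your own dequeue activation returns"; no scheduler-specific code appears anywhere.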

5. Semantics

  • The authors present the formal semantics of the concurrency substrate primitives introduced in Section 3.3.
  • The authors will subsequently utilise the semantics to formally describe the interaction of the ULS with the RTS in Section 6.
  • The aim of this is to precisely describe the issues with respect to the interactions between the ULS and the RTS, and have the language to enunciate their solutions.

5.1 Syntax

  • Figure 5 shows the syntax of program states.
  • Each HEC is either idle (Idle) or a triple 〈s,M,D〉t, where s is a unique identifier of the currently executing SCont, M is the currently executing term, and D represents the SCont-local state.
  • The authors represent the stack local state D as a tuple with two terms and a name (M,N, r).
  • For perspicuity, the authors define accessor functions as shown below.
  • The number of HECs remains constant, and each HEC runs one, and only one, SCont.

5.2 Basic transitions

  • Some basic transitions are presented in Figure 6.
  • This says that the program makes a transition from S; Θ to S′; Θ′, possibly interacting with the underlying RTS through action a (rendered as a judgement after this list).
  • Similarly, Rule PureStep enables one of the HECs to perform a purely functional transition under the evaluation context E (defined in Figure 5).
  • The purely functional transitions M → N include β-reduction, arithmetic expressions, case expressions, monadic operations return, bind, throw, catch, and so on, according to their standard definitions.
  • A bind operation on the transactional memory primitive retry simply reduces to retry (Figure 6).
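For reference, the top-level transition described above can be written out as a judgement (our LaTeX rendering of the prose; S; Θ is the program state before, S′; Θ′ the state after, and a the optional RTS interaction):

$$ S;\,\Theta \;\xrightarrow{\,a\,}\; S';\,\Theta' $$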

5.3 Transactional memory

  • Since the concurrency substrate primitives utilise STM as the sole synchronisation mechanism, the authors will present the formal semantics of basic STM operations in this section.
  • Figure 7 presents the semantics of non-blocking STM operations.
  • The current SCont s and its local state D are read-only, and are not used at all in this section, but will be needed when manipulating SCont-local state.
  • Since an exception can carry a TVar allocated in the aborted transaction, the effects of the current transaction are undone except for the newly allocated TVars.
  • If the resultant SCont s′ is different from the current SCont s, the authors transfer control to the new SCont s′ by making it the running SCont and saving the state of the original SCont s in the heap.

6. Interaction with the RTS

  • The key aspect of their design is composability of ULS’s with the existing RTS concurrency mechanisms (Section 3.1).
  • The authors will describe in detail the interaction of RTS concurrency mechanisms and the ULS’s.
  • The formalisation brings out the tricky cases associated with the interaction between the ULS and the RTS.

6.1 Timer interrupts

  • In GHC, concurrent threads are preemptively scheduled.
  • On a tick, the current SCont needs to be de-scheduled and a new SCont from the scheduler needs to be scheduled.
  • The semantics of handling timer interrupts is shown in Figure 9.
  • In this case the RTS-interaction Tick indicates that the RTS wants to signal a timer tick.
  • The transition here injects yield into the instruction stream of the SCont running on the HEC. (Technically, the authors should ensure that every HEC receives a tick, and of course their implementation does just that, but they elide it here.)

6.2 STM blocking operations

  • As mentioned before (Section 3.4), STM supports blocking operations through the retry primitive.
  • Blocking the SCont (Section 6.2.1): Rule TRETRYATOMIC is similar to TTHROW in Figure 7.
  • The rules presented in Figure 11 are the key rules in abstracting the interface between the ULS and the RTS, and describe the invocation of upcalls.
  • Invoking the dequeue upcall on the blocked SCont s can lead to a race on s between multiple HECs if s happens to be unblocked and enqueued to the scheduler before the switch transaction is completed.
  • Resuming the SCont: Some time later, the RTS will see that some thread has written to one of the TVars read by s's transaction, so it will signal a RetrySTM s interaction (rule TRESUMERETRY).

6.2.3 HEC sleep and wakeup

  • Recall that invoking retry within a switch transaction or dequeue activation puts the HEC to sleep (Section 3.4).
  • Also, notice that the dequeue activation is always invoked by the RTS from a switch transaction (Rule UPDEQUEUE).
  • If a switch transaction blocks, the authors put the whole HEC to sleep; this motivates rule TRETRYSWITCH.
  • Then, dual to TRESUMERETRY, rule TWAKEUP wakes up the HEC when the RTS sees that the transaction may now be able to make progress.

6.2.4 Implementation of upcalls

  • Notice that the rules UPDEQUEUE and UPENQUEUEIDLE in Figure 11 instantiate a fresh SCont.
  • The freshly instantiated SCont performs just a single transaction; switch in UPDEQUEUE and atomically in UPENQUEUEIDLE, after which it is garbage-collected.
  • Since instantiating a fresh SCont for every upcall is unwise, the RTS maintains a dynamic pool of dedicated upcall SConts for performing the upcalls.
  • It is worth mentioning that the authors need an “upcall SCont pool” rather than a single “upcall SCont” since the upcall transactions can themselves get blocked synchronously on STM retry as well as asynchronously due to optimizations for lazy evaluation (Section 6.5).

6.3 Safe foreign function calls

  • Foreign calls in GHC are highly efficient but intricately interact with the scheduler [17].
  • Each HEC is animated by one of a pool of tasks (OS threads); the current task may become blocked in a foreign call (e.g. a blocking I/O operation), in which case another task takes over the HEC.
  • The authors' decision to preserve the task model in the RTS allows them to delegate much of the work involved in safe foreign calls to the RTS.
  • Rule OCBLOCK illustrates that the HEC performing the foreign call moves into the Outcall state, where it is no longer runnable.
  • The scheduler is resumed using the dequeue upcall.

6.4 Timer interrupts and transactions

  • This is faithful to the semantics expressed by the rule, but it does mean that a rogue transaction could completely monopolise a HEC.
  • An alternative possibility (Plan B) is for the RTS to roll the transaction back to the beginning, and then deliver the tick using rule (TICK).
  • That too is implementable, but this time the risk is that a slightly-too-long transaction would always be rolled back, so it would never make progress.
  • And that transaction is likely to run the very same code that has just been interrupted.

6.5 Black holes

  • To avoid duplicate evaluation, the RTS (in intimate cooperation with the compiler) arranges for a thread B to black-hole a thunk x when B starts to evaluate it.
  • This mechanism, and its implementation on a multicore, is described in detail in earlier work [9].
  • The RTS behaves as if the black-hole suspension and resumption occurred just before the transaction, but the implementation actually arranges to resume the transaction from where it left off.
  • Moreover, it is just possible that the thunk is under evaluation by an SCont in this very scheduler’s runqueue, so the black hole is preventing us from scheduling the very SCont that is evaluating it.
  • Since the authors cannot sensibly suspend the switch transaction, they must find a way for it to make progress.

6.6 Interaction with RTS MVars

  • An added advantage of their scheduler activation interface is that the authors are able to reuse the existing MVar implementation in the RTS.
  • This significantly reduces the burden of migrating to a ULS implementation.

6.7 Asynchronous exceptions

  • GHC supports asynchronous exceptions, in which one thread can send an asynchronous interrupt to another [16].
  • This is a very tricky area; for example, if a thread is blocked on a user-level MVar (Section 4.2), and receives an exception, it should wake up and do something — even though it is linked onto an unknown queue of blocked threads.
  • The authors' implementation does in fact handle asynchronous exceptions, but they are not yet happy with the details of the design, and in any case space precludes presenting them here.

6.8 On the correctness of user-level schedulers

  • While the concurrency substrate exposes the ability to build ULS’s, the onus is on the scheduler implementation to ensure that it is sensible.
  • The authors' implementation dynamically enforces such invariants through runtime assertions.
  • An activation raising an exception indicates an error in the ULS implementation, and the substrate simply reports it to the standard error stream.
  • A thread suspended on an ULS may become unreachable if the scheduler data structure holding it becomes unreachable.
  • A thread indefinitely blocked on an RTS MVar operation has an exception raised in it and is added to its ULS.

7. Results

  • The authors' implementation is a fork of GHC, and supports all of the features discussed in the paper.
  • The benchmarks offer varying degrees of parallelisation opportunity.
  • K-nucleotide, mandelbrot and spectral-norm are computation intensive, while chameneos and primes-sieve are communication intensive and are specifically intended to test the overheads of thread synchronisation.
  • Additionally, in these benchmarks, the LWC version performs 3×–8× more allocations than the vanilla version.


Purdue University
Purdue e-Pubs
Department of Computer Science Technical Reports
Department of Computer Science
2014

Composable Scheduler Activations for Haskell

KC Sivaramakrishnan, Purdue University, chandras@cs.purdue.edu
Tim Harris, Oracle Labs, timothy.l.harris@oracle.com
Simon Marlow, Facebook UK Ltd., smarlow@fb.com
Simon Peyton Jones, Microsoft Research, Cambridge, simonpj@microsoft.com

Report Number: 14-004

Sivaramakrishnan, KC; Harris, Tim; Marlow, Simon; and Peyton Jones, Simon, "Composable Scheduler Activations for Haskell" (2014). Department of Computer Science Technical Reports. Paper 1774.
https://docs.lib.purdue.edu/cstech/1774

This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries. Please contact epubs@purdue.edu for additional information.

Composable Scheduler Activations for Haskell

KC Sivaramakrishnan, Purdue University, chandras@cs.purdue.edu
Tim Harris¹, Oracle Labs, timothy.l.harris@oracle.com
Simon Marlow¹, Facebook UK Ltd., smarlow@fb.com
Simon Peyton Jones, Microsoft Research, Cambridge, simonpj@microsoft.com
Abstract

The runtime for a modern, concurrent, garbage collected language like Java or Haskell is like an operating system: sophisticated, complex, performant, but alas very hard to change. If more of the runtime system were in the high level language, it would be far more modular and malleable. In this paper, we describe a novel concurrency substrate design for the Glasgow Haskell Compiler (GHC) that allows multicore schedulers for concurrent and parallel Haskell programs to be safely and modularly described as libraries in Haskell. The approach relies on abstracting the interface to the user-implemented schedulers through scheduler activations, together with the use of Software Transactional Memory (STM) to promote safety in a multicore context.
1. Introduction

High performance, multicore-capable runtime systems (RTS) for garbage-collected languages have been in widespread use for many years. Examples include virtual machines for popular object-oriented languages such as Oracle's Java HotSpot VM [12], IBM's Java VM [13], Microsoft's Common Language Runtime (CLR) [19], as well as functional language runtimes such as Manticore [22], MultiMLton [27] and the Glasgow Haskell Compiler (GHC) [8].

These runtime systems tend to be complex monolithic pieces of software, written not in the high-level source language (Java, Haskell, etc), but in an unsafe, systems programming language (usually C or C++). They are highly concurrent, with extensive use of locks, condition variables, timers, asynchronous I/O, thread pools, and other arcana. As a result, they are extremely difficult to modify, even for their own authors. Moreover, such modifications typically require a rebuild of the runtime, so it is not an easy matter to make changes on a program-by-program basis, let alone within a single program.
¹ This work was done at Microsoft Research, Cambridge.
This lack of malleability is particularly unfortunate for the thread scheduler, which governs how the computational resources of the multi-core are deployed to run zillions of lightweight high-level language threads. A broad range of strategies are possible, including ones using priorities, hierarchical scheduling, gang scheduling, and work stealing. Different strategies might suit different multi-cores, or different application programs or parts thereof. The goal of this paper is, therefore, to allow programmers to write a User Level Scheduler (ULS), as a library written in the high-level language itself. Not only does this make the scheduler more modular and changeable, but it can readily be varied between programs, or even within a single program.

The difficulty is that the scheduler interacts intimately with other aspects of the runtime such as transactional memory or blocking I/O. Our main contribution is the design of an interface that allows expressive user-level schedulers to interact cleanly with these low-level communication and synchronisation primitives:
• We present a new concurrency substrate design for Haskell that allows application programmers to write schedulers for Concurrent Haskell programs in Haskell (Section 3). These schedulers can then be plugged in as ordinary user libraries in the target program.

• By abstracting the interface to the ULS through scheduler activations, our concurrency substrate seamlessly integrates with the existing RTS concurrency support such as MVars, asynchronous exceptions [16], safe foreign function interface [17], software transactional memory [10], resumable black-holes [20], etc. The RTS makes upcalls to the activations whenever it needs to interact with the ULS. This design absolves the scheduler writer from having to reason about the interaction between the ULS and the RTS, thus lowering the bar for writing new schedulers.

• Concurrency primitives and their interaction with the RTS are particularly tricky to specify and reason about. An unusual feature of this paper is that we precisely formalise not only the concurrency substrate primitives (Section 5), but also their interaction with the RTS concurrency primitives (Section 6).

• We present an implementation of our concurrency substrate in GHC. Experimental evaluation indicates that the performance of ULS's is comparable to the highly optimised default scheduler of GHC (Section 7).

[Figure 1. The anatomy of the Glasgow Haskell Compiler runtime system: the scheduler, MVar support, safe FFI, GC, asynchronous exceptions, and STM live in the RTS, written in C by the language developer; the concurrent application is written in Haskell by the application developer.]
2. Background

To understand the design of the new concurrency substrate for Haskell, we must first give some background on the existing RTS support for concurrency in our target platform – the Glasgow Haskell Compiler (GHC). We then articulate the goals of our concurrency substrate.
2.1 The GHC runtime system

GHC has a sophisticated, highly tuned RTS that has rich support for concurrency with advanced features such as software transactional memory [10], asynchronous exceptions [16], safe foreign function interface [17], and transparent scaling on multicores [9]. The Haskell programmer can use very lightweight Haskell threads, which are executed by a fixed number of Haskell execution contexts, or HECs. Each HEC is in turn animated by an operating system thread; in this paper we use the term tasks for these OS threads, to distinguish them from Haskell threads. The choice of which Haskell thread is executed by which HEC is made by the scheduler.

GHC's current scheduler is written in C, and is hardwired into the RTS (Figure 1). It uses a single run-queue per processor, and has a single, fixed notion of work-sharing to move work from one processor to another. There is no notion of thread priority; nor is there support for advanced scheduling policies such as gang or spatial scheduling. From an application developer's perspective, the lack of flexibility hinders deployment of new programming models on top of GHC such as data-parallel computations [4, 15], and applications such as virtual machines [7] and web-servers [11] that can benefit from the ability to define custom scheduling policies.
2.2 The challenge

Because there is such a rich design space for schedulers, our goal is to allow a user-level scheduler (ULS) to be written in Haskell, giving programmers the freedom to experiment with different scheduling or work-stealing algorithms. Indeed, we would like the ability to combine multiple ULS's in the same program. For example, in order to utilise the best scheduling strategy, a program could dynamically switch from a priority-based scheduler to gang scheduling when switching from general purpose computation to data-parallel computation. Applications might also combine the schedulers in a hierarchical fashion; a scheduler receives computational resources from its parent, and divides them among its children.

This goal is not easy to achieve. The scheduler interacts intimately with other RTS components, including:

• MVars and transactional memory [10] allow Haskell threads to communicate and synchronise; they may cause threads to block or unblock.

• The garbage collector must somehow know about the run-queue on each HEC, so that it can use it as a root for garbage collection.

• Lazy evaluation means that if a Haskell thread tries to evaluate a thunk that is already under evaluation by another thread (it is a "black hole"), the former must block until the thunk's evaluation is complete [9]. Matters are made more complicated by asynchronous exceptions, which may cause a thread to abandon evaluation of a thunk, replacing the thunk with a "resumable black hole".

• A foreign-function call may block (e.g. when doing I/O). GHC's RTS can schedule a fresh task (OS thread) to re-animate the HEC, blocking the in-flight Haskell thread, and scheduling a new one [17].

All of these components do things like "block a thread" or "unblock a thread" that require interaction with the scheduler. One possible response, taken by Li et al [14], is to program these components, too, into Haskell. The difficulty is that they are all intricate and highly-optimised. Moreover, unlike scheduling, there is no call from Haskell's users for them to be user-programmable.

Instead, our goal is to tease out the scheduler implementation from the rest of the RTS, establishing a clear API between the two, and leaving unchanged the existing implementation of MVars, STM, black holes, FFI, and so on.

Lastly, schedulers are themselves concurrent programs, and they are particularly devious ones. Using the facilities available in C, they are extremely hard to get right. Given that the ULS will be implemented in Haskell, we would like to utilise the concurrency control abstractions provided by Haskell (notably transactional memory) to simplify the task of scheduler implementation.
3. Design

In this section, we describe the design of our concurrency substrate and present the concurrency substrate API. Along the way, we will describe how our design achieves the goals put forth in the previous section.
3.1 Scheduler activation
Our key observation is that the interaction between the
scheduler and the rest of the RTS can be reduced to two
fundamental operations:
1. Block operation. The currently running thread blocks
on some event in the RTS. The execution proceeds by
switching to the next available thread from the scheduler.
2. Unblock operation. The RTS event that a blocked thread
is waiting on occurs. After this, the blocked thread is
resumed by adding it to the scheduler.
For example, in Haskell, a thread might encounter an empty MVar while attempting to take the value from it².
In this case, the thread performing the MVar read operation
should block. Eventually, the MVar might be filled by some
other thread (analogous to lock release), in which case, the
blocked thread is unblocked and resumed with the value
from the MVar. As we will see, all of the RTS interactions
(as well as the interaction with the concurrency libraries) fall
into this pattern.
Notice that the RTS blocking operations enqueue and
dequeue threads from the scheduler. But the scheduler is
now implemented as a Haskell library. So how does the RTS
find the scheduler? We could equip each HEC with a fixed
scheduler, but it is much more flexible to equip each Haskell
thread with its own scheduler. That way, different threads
(or groups thereof) can have different schedulers.
But what precisely is a “scheduler”? In our design, the scheduler is represented by two function values, or scheduler activations³. Every user-level thread has a dequeue activation and an enqueue activation. The activations provide an abstract interface to the ULS to which the thread belongs. At the very least, the dequeue activation fetches the next available thread from the ULS encapsulated in the activation, and the enqueue activation adds the given thread to the encapsulated ULS. The activations are stored at known offsets in the thread object so that the RTS may find them. The RTS makes upcalls to the activations to perform the enqueue and dequeue operations on a ULS.
Figure 2 illustrates the modified RTS design that supports the implementation of ULS's. The idea is to have a minimal concurrency substrate which is implemented in C and is a part of the RTS. The substrate not only allows the programmer to implement schedulers as Haskell libraries, but also enables other RTS mechanisms to interface with the user-level schedulers through upcalls to the activations.
² This operation is analogous to attempting to take a lock that is currently held by some other thread.
³ The term “activation” comes from the operating systems literature [1].

[Figure 2. New GHC RTS design with Concurrency Substrate: the RTS now contains a minimal concurrency substrate (in C, by the language developer) alongside MVar, safe FFI, GC, asynchronous exception and STM support; the user-level scheduler is an ordinary Haskell library written by the application developer, reached through the activation interface via upcalls.]

[Figure 3. Blocking on an RTS event: the current thread t waits on event e; the RTS invokes t.dequeueAct() to obtain the next runnable thread t' and switches to it.]

Figure 3 illustrates the steps associated with blocking on an RTS event. Since the scheduler is implemented in user-space, each HEC in the RTS is aware of only the currently
running thread, say t. Suppose thread t waits for an abstract
event e in the RTS, which is currently disabled. Since the
thread t cannot continue until e is enabled, the RTS adds t
to the queue of threads associated with e, which are currently
waiting for e to be enabled. Notice that the RTS “owns” t
at this point. The RTS now invokes the dequeue activation
associated with t, which returns the next runnable thread
from ts scheduler queue, say t’. This HEC now switches
control to t’ and resumes execution. The overall effect of the
operation ensure that although the thread t is blocked, ts
scheduler (and the threads that belong to it) is not blocked.
[Figure 4. Unblocking from an RTS event: when event e is enabled, the RTS invokes t.enqueueAct() to hand the blocked thread t back to its scheduler, while the current thread t' continues to run.]
Figure 4 illustrates the steps involved in unblocking from
an RTS event. Eventually, the disabled event e can become
enabled. At this point, the RTS wakes up all of the threads
waiting on event e by invoking their enqueue activation.
Suppose we want to resume the thread t which is blocked on e. The RTS invokes t's enqueue activation to add t to its scheduler. Since t's scheduler is already running, t will eventually be scheduled again.
3.2 Software transactional memory

Since Haskell computations can run in parallel on different HECs, the substrate must provide a method for safely coordinating activities across multiple HECs. Similar to Li's substrate design [14], we adopt transactional memory (STM) as the sole multiprocessor synchronisation mechanism exposed by the substrate. Using transactional memory, rather than locks and condition variables, makes complex concurrent programs much more modular and less error-prone [10] – and schedulers are prime candidates, because they are prone to subtle concurrency bugs.
3.3 Concurrency substrate

Now that we have motivated our design decisions, we will present the API for the concurrency substrate. The concurrency substrate includes the primitives for instantiating and switching between language level threads, manipulating thread local state, and an abstraction for scheduler activations. The API is presented below:

data SCont
type DequeueAct = SCont -> STM SCont
type EnqueueAct = SCont -> STM ()

-- activation interface
dequeueAct :: DequeueAct
enqueueAct :: EnqueueAct

-- SCont manipulation
newSCont     :: IO () -> IO SCont
switch       :: (SCont -> STM SCont) -> IO ()
runOnIdleHEC :: SCont -> IO ()

-- Manipulating local state
setDequeueAct :: DequeueAct -> IO ()
setEnqueueAct :: EnqueueAct -> IO ()
getAux        :: SCont -> STM Dynamic
setAux        :: SCont -> Dynamic -> STM ()
3.3.1 Activation interface

Rather than directly exposing the notion of a “thread”, the substrate offers one-shot continuations [3], of type SCont. An SCont is a heap-allocated object representing the current state of a Haskell computation. In the RTS, SConts are represented quite conventionally by a heap-allocated Thread Storage Object (TSO), which includes the computation's stack and local state, saved registers, and program counter. Unreachable SConts are garbage collected.

The call (dequeueAct s) invokes s's dequeue activation, passing s to it like a “self” parameter. The return type of dequeueAct indicates that the computation encapsulated in the dequeueAct is transactional (under the STM monad⁴), which, when discharged, returns an SCont. Similarly, the call (enqueueAct s) invokes the enqueue activation transactionally, which enqueues s to its ULS.

⁴ http://hackage.haskell.org/package/stm-2.1.1.0/docs/Control-Concurrent-STM.html

Since the activations are under the STM monad, we have the assurance that ULS's cannot be built with low-level unsafe components such as locks and condition variables. Such low-level operations would be under the IO monad, which cannot be part of an STM transaction. Thus, our concurrency substrate statically prevents the implementation of potentially unsafe schedulers.
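As an illustration of the local-state part of the API, an SCont's aux field can carry scheduler-specific data. The sketch below is our own example (the helper names are not from the paper): it tags each thread with an Int priority, which a priority-aware ULS could consult in its dequeue activation.

import Data.Dynamic (toDyn, fromDynamic)
import Data.Maybe (fromMaybe)

setPriority :: SCont -> Int -> STM ()
setPriority s p = setAux s (toDyn p)

-- Defaults to priority 0 if the aux field does not hold an Int.
getPriority :: SCont -> STM Int
getPriority s = do
  d <- getAux s
  return (fromMaybe 0 (fromDynamic d))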
3.3.2 SCont management

The substrate offers primitives for creating, constructing and transferring control between SConts. The call (newSCont M) creates a new SCont that, when scheduled, executes M. By default, the newly created SCont is associated with the ULS of the invoking thread. This is done by copying the invoking SCont's activations.

An SCont is scheduled (i.e. is given control of a HEC) by the switch primitive. The call (switch M) applies M to the current continuation s. Notice that (M s) is an STM computation. In a single atomic transaction switch performs the computation (M s), yielding an SCont s′, and switches control to s′. Thus, the computation encapsulated by s′ becomes the currently running computation on this HEC.

Since our continuations are one-shot, capturing a continuation simply fetches the reference to the underlying TSO object. Hence, continuation capture involves no copying, and is cheap. Using the SCont interface, a cooperative scheduler can be built as follows:

yield :: IO ()
yield = switch (\s -> enqueueAct s >> dequeueAct s)
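In the same style, a forkIO-like primitive can be built on the substrate (a minimal sketch of our own; the paper does not present this exact function). Since newSCont copies the parent's activations, handing the new SCont to its scheduler is a single transactional enqueue:

import Control.Concurrent.STM (atomically)

forkULS :: IO () -> IO ()
forkULS task = do
  s <- newSCont task          -- s inherits the invoking SCont's activations
  atomically (enqueueAct s)   -- hand s to its (inherited) ULS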
3.4 Parallel SCont execution

When the program begins execution, a fixed number of HECs (N) is provided to it by the environment. This signifies the maximum number of parallel computations in the program. Of these, one HEC runs the main IO computation. All other HECs are in the idle state. The call (runOnIdleHEC s) initiates parallel execution of SCont s on an idle HEC. Once the SCont running on a HEC finishes evaluation, the HEC moves back to the idle state.

Notice that the upcall from the RTS to the dequeue activation, as well as the body of the switch primitive, returns an SCont. This is the SCont to which control subsequently switches. But what if such an SCont cannot be found? This situation can occur during multicore execution, when the number of available threads is less than the number of HECs. If a HEC does not have any work to do, it should be put to sleep.

Notice that the result of the dequeue activation and the body of the switch primitive are STM transactions. GHC today supports blocking operations under STM. When the programmer invokes retry inside a transaction, the RTS blocks the thread until another thread writes to any of the transactional variables read by the transaction; then the thread is re-awoken, and retries the transaction [10]. We reuse this mechanism for HECs: the use of retry within a switch transaction or dequeue activation is interpreted as putting the whole HEC to sleep.
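To make the idle-HEC story concrete, the following sketch (ours, following the description of "spinning up additional HECs" in Section 4.1) starts a trivial bootstrap SCont on each idle HEC. Each bootstrap simply switches to whatever its inherited dequeue activation yields; if the ULS is empty, the retry inside the activation puts that HEC to sleep, exactly as described above:

import Control.Monad (replicateM_)

spinUpHECs :: Int -> IO ()
spinUpHECs n = replicateM_ n $ do
  -- The new SCont inherits the current ULS's activations.
  s <- newSCont (switch dequeueAct)  -- fetch the next thread and run it
  runOnIdleHEC s                     -- animate an idle HEC with it

Since the bootstrap SCont switches away and is never re-enqueued, it is a one-shot continuation that is simply garbage collected, much like the upcall SConts of Section 6.2.4.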

Citations
Book ChapterDOI
19 Jun 2017
TL;DR: It is made the observation that effect handlers can elegantly express particularly difficult programs that combine system programming and concurrency without compromising performance.
Abstract: Algebraic effects and their handlers have been steadily gaining attention as a programming language feature for composably expressing user-defined computational effects. While several prototype implementations of languages incorporating algebraic effects exist, Multicore OCaml incorporates effect handlers as the primary means of expressing concurrency in the language. In this paper, we make the observation that effect handlers can elegantly express particularly difficult programs that combine system programming and concurrency without compromising performance. Our experimental results on a highly concurrent and scalable web server demonstrate that effect handlers perform on par with highly optimised monadic concurrency libraries, while retaining the simplicity of direct-style code.

40 citations


Cites background from "Composable scheduler activations fo..."

  • ...Attempts to lift the scheduler from the runtime system to a library in the high-level language while retaining other features in the runtime system lead to further complications [31]....


  • ...In Multicore OCaml, the user-level thread schedulers themselves are expressed as OCaml libraries, thus minimising the secret sauce that gets baked into high-performance multicore runtime systems [31]....


Proceedings ArticleDOI
28 Aug 2013
TL;DR: The design and implementation of a new parallel Haskell RTE implementation, GUMSMP, which exploits hierarchical platforms more effectively is presented, designed to efficiently combine distributed memory parallelism, using a virtual shared heap over a cluster, with low-overhead shared memory Parallelism on the multicores.
Abstract: The most widely available high performance platforms today are multilevel clusters of multicores. The Glasgow Haskell Compiler (GHC) provides a number of parallel Haskell implementations targeting different parallel architectures. In particular, GHC-SMP supports shared memory, and GHC-GUM supports distributed memory machines. Both implementations use different, but related, runtime environment (RTE) mechanisms. Good performance results can be achieved on shared memory architectures and on networks individually. However, a combination of both, for networks of multicores, is lacking. We present the design and implementation of a new parallel Haskell RTE implementation, GUMSMP, which exploits hierarchical platforms more effectively. It is designed to efficiently combine distributed memory parallelism, using a virtual shared heap over a cluster, with low-overhead shared memory parallelism on the multicores. Key design objectives for realising this system are: asymmetric load balance, effective latency hiding, and mostly passive load distribution. We show that the automatic hierarchical load distribution policies must be carefully tuned to obtain good performance, showing the impact of several policies, including work pre-fetching and favouring inter-node work distribution. We present the initial performance results for this implementation, demonstrating the good scalability of a set of 8 benchmarks on up to 100 cores, and show performance gains of up to 20% compared to GHC-GUM.

8 citations

Journal ArticleDOI
TL;DR: The PArallEl shAred Nothing runtime system design aims to provide a portable and high-level shared-nothing implementation platform for parallel Haskell dialects, and builds on, unifies and extends, existing well-developed support for shared-memory parallelism that is provided by the widely used GHC Haskell compiler.
Abstract: Abstract Over time, several competing approaches to parallel Haskell programming have emerged. Different approaches support parallelism at various different scales, ranging from small multicores to massively parallel high-performance computing systems. They also provide varying degrees of control, ranging from completely implicit approaches to ones providing full programmer control. Most current designs assume a shared memory model at the programmer, implementation and hardware levels. This is, however, becoming increasingly divorced from the reality at the hardware level. It also imposes significant unwanted runtime overheads in the form of garbage collection synchronisation etc. What is needed is an easy way to abstract over the implementation and hardware levels, while presenting a simple parallelism model to the programmer. The PArallEl shAred Nothing runtime system design aims to provide a portable and high-level shared-nothing implementation platform for parallel Haskell dialects. It abstracts over major issues such as work distribution and data serialisation, consolidating existing, successful designs into a single framework. It also provides an optional virtual shared-memory programming abstraction for (possibly) shared-nothing parallel machines, such as modern multicore/manycore architectures or cluster/cloud computing systems. It builds on, unifies and extends, existing well-developed support for shared-memory parallelism that is provided by the widely used GHC Haskell compiler. This paper summarises the state-of-the-art in shared-nothing parallel Haskell implementations, introduces the PArallEl shAred Nothing abstractions, shows how they can be used to implement three distinct parallel Haskell dialects, and demonstrates that good scalability can be obtained on recent parallel machines.

5 citations


Cites background from "Composable scheduler activations fo..."

  • ...Two separate lightweight implementations of concurrent Haskell have also been produced that lift scheduling and other concurrency features to the Haskell level (Li et al., 2007; Sivaramakrishnan et al., 2013)....


Proceedings ArticleDOI
22 Jun 2020
TL;DR: This paper presents a concurrent, statically enforced IFC language that, as a novelty, features asynchronous exceptions, and shows how asynchronous exceptions easily enable useful programming patterns like speculative execution and some degree of resource management.
Abstract: Language-based information-flow control (IFC) techniques often rely on special purpose, ad-hoc primitives to address different covert channels that originate in the runtime system, beyond the scope of language constructs. Since these piecemeal solutions may not compose securely, there is a need for a unified mechanism to control covert channels. As a first step towards this goal, we argue for the design of a general interface that allows programs to safely interact with the runtime system and the available computing resources. To coordinate the communication between programs and the runtime system, we propose the use of asynchronous exceptions (interrupts), which, to the best of our knowledge, have not been considered before in the context of IFC languages. Since asynchronous exceptions can be raised at any point during execution – often due to the occurrence of an external event – threads must temporarily mask them out when manipulating locks and shared data structures to avoid deadlocks and, therefore, breaking program invariants. Crucially, the naive combination of asynchronous exceptions with existing features of IFC languages (e.g., concurrency and synchronization variables) may open up new possibilities of information leakage. In this paper, we present $\mathrm{MAC}_{async}$, a concurrent, statically enforced IFC language that, as a novelty, features asynchronous exceptions. We show how asynchronous exceptions easily enable (out of the box) useful programming patterns like speculative execution and some degree of resource management. We prove that programs in $\mathrm{MAC}_{async}$ satisfy progress-sensitive non-interference and mechanize our formal claims in the Agda proof assistant.

1 citations

Proceedings ArticleDOI
01 Oct 2018
TL;DR: This work presents four transactional schedulers implemented entirely in Haskell using different abstraction levels and presents, despite the inherent overhead of high-level implementations, a reduction in the conflict rates.
Abstract: Transactional Memory is an abstraction that helps concurrent programming; however, in high-contention scenarios it presents low performance because of the high conflict rate between transactions. In this work, we present four transactional schedulers implemented entirely in Haskell using different abstraction levels. The results show that, despite the inherent overhead of high-level implementations, the schedulers achieve a reduction in conflict rates.

Cites background from "Composable scheduler activations fo..."

  • ...The thread-scheduling model developed here was based on a model presented in [15]....


  • ...[15] K....


References
Proceedings ArticleDOI
11 Oct 2009
TL;DR: This work investigates a new OS structure, the multikernel, that treats the machine as a network of independent cores, assumes no inter-core sharing at the lowest level, and moves traditional OS functionality to a distributed system of processes that communicate via message-passing.
Abstract: Commodity computer systems contain more and more processor cores and exhibit increasingly diverse architectural tradeoffs, including memory hierarchies, interconnects, instruction sets and variants, and IO configurations. Previous high-performance computing systems have scaled in specific cases, but the dynamic nature of modern client and server workloads, coupled with the impossibility of statically optimizing an OS for all workloads and hardware variants, pose serious challenges for operating system structures. We argue that the challenge of future multicore hardware is best met by embracing the networked nature of the machine, rethinking OS architecture using ideas from distributed systems. We investigate a new OS structure, the multikernel, that treats the machine as a network of independent cores, assumes no inter-core sharing at the lowest level, and moves traditional OS functionality to a distributed system of processes that communicate via message-passing. We have implemented a multikernel OS to show that the approach is promising, and we describe how traditional scalability problems for operating systems (such as memory management) can be effectively recast using messages and can exploit insights from distributed systems and networking. An evaluation of our prototype on multicore systems shows that, even on present-day machines, the performance of a multikernel is comparable with a conventional OS, and can scale better to support future hardware.

926 citations


"Composable scheduler activations fo..." refers methods in this paper

  • ...Scheduler activations have successfully been demonstrated to interface kernel with the user-level process scheduler (Williams, 2002; Baumann et al., 2009)....


  • ...Scheduler activations [2] have successfully been demonstrated to interface kernel with the user-level process scheduler [3, 20]....


Proceedings ArticleDOI
15 Jun 2005
TL;DR: This paper presents a new concurrency model, based on transactional memory, that offers far richer composition, and describes new modular forms of blocking and choice that have been inaccessible in earlier work.
Abstract: Writing concurrent programs is notoriously difficult, and is of increasing practical importance. A particular source of concern is that even correctly-implemented concurrency abstractions cannot be composed together to form larger abstractions. In this paper we present a new concurrency model, based on transactional memory, that offers far richer composition. All the usual benefits of transactional memory are present (e.g. freedom from deadlock), but in addition we describe new modular forms of blocking and choice that have been inaccessible in earlier work.

815 citations


"Composable scheduler activations fo..." refers background or methods in this paper

  • ...This mechanism, and its implementation on a multicore, is described in detail in earlier work (Harris et al., 2005b)....


  • ...The ability to perform blocking operations in the scheduler allows us to utilise STM based concurrency libraries such as TMVar [8] with minimal refactoring....


  • ...The scheduler interacts intimately with other RTS components including • MVars and transactional memory (Harris et al., 2005a) allow Haskell threads to communicate and synchronise; they may cause threads to block or unblock....


  • ...• Lazy evaluation means that if a Haskell thread tries to evaluate a thunk that is already under evaluation by another thread (it is a “black hole”), the former must block until the thunk’s evaluation is complete (Harris et al., 2005b)....


  • ...…tuned RTS that has a rich support for concurrency with advanced features such as software transactional memory (Harris et al., 2005a), asynchronous exceptions (Marlow et al., 2001), safe foreign function interface (Marlow et al., 2004), and transparent scaling on multicores (Harris et al., 2005b)....


Proceedings ArticleDOI
01 Sep 1991
TL;DR: It is argued that the performance of kernel threads is inherently worse than that of user-level threads, rather than this being an artifact of existing implementations, and that managing parallelism at the user level is essential to high-performance parallel computing.
Abstract: Threads are the vehicle for concurrency in many approaches to parallel programming. Threads separate the notion of a sequential execution stream from the other aspects of traditional UNIX-like processes, such as address spaces and I/O descriptors. The objective of this separation is to make the expression and control of parallelism sufficiently cheap that the programmer or compiler can exploit even fine-grained parallelism with acceptable overhead. Threads can be supported either by the operating system kernel or by user-level library code in the application address space, but neither approach has been fully satisfactory. This paper addresses this dilemma. First, we argue that the performance of kernel threads is inherently worse than that of user-level threads, rather than this being an artifact of existing implementations; we thus argue that managing parallelism at the user level is essential to high-performance parallel computing. Next, we argue that the lack of system integration exhibited by user-level threads is a consequence of the lack of kernel support for user-level threads provided by contemporary multiprocessor operating systems; we thus argue that kernel threads or processes, as currently conceived, are the wrong abstraction on which to support user-level management of parallelism. Finally, we describe the design, implementation, and performance of a new kernel interface and user-level thread package that together provide the same functionality as kernel threads without compromising the performance and flexibility advantages of user-level management of parallelism.

581 citations


"Composable scheduler activations fo..." refers background or methods in this paper

  • ...Scheduler activations [2] have successfully been demonstrated to interface kernel with the user-level process scheduler [3, 20]....


  • ...• Our concurrency substrate design relies on abstracting the interface to the user-level scheduler through scheduler activations [2] (Section 4....


  • ...2 The term “activation” comes from the operating systems literature [2]....


Proceedings ArticleDOI
16 Mar 2013
TL;DR: The Mirage prototype compiles OCaml code into unikernels that run on commodity clouds and offer an order of magnitude reduction in code size without significant performance penalty, and demonstrates that the hypervisor is a platform that overcomes the hardware compatibility issues that have made past library operating systems impractical to deploy in the real-world.
Abstract: We present unikernels, a new approach to deploying cloud services via applications written in high-level source code. Unikernels are single-purpose appliances that are compile-time specialised into standalone kernels, and sealed against modification when deployed to a cloud platform. In return they offer significant reduction in image sizes, improved efficiency and security, and should reduce operational costs. Our Mirage prototype compiles OCaml code into unikernels that run on commodity clouds and offer an order of magnitude reduction in code size without significant performance penalty. The architecture combines static type-safety with a single address-space layout that can be made immutable via a hypervisor extension. Mirage contributes a suite of type-safe protocol libraries, and our results demonstrate that the hypervisor is a platform that overcomes the hardware compatibility issues that have made past library operating systems impractical to deploy in the real-world.

476 citations


"Composable scheduler activations fo..." refers methods in this paper

  • ...MirageOS (Madhavapeddy et al., 2013) is a unikernel implemented in OCaml, and uses monadic Lwt threads (Vouillon, 2008) for cooperative concurrency....


Journal ArticleDOI
TL;DR: In this paper, the authors argue that the performance of kernel threads is inherently worse than that of user-level threads, rather than this being an artifact of existing implementations; managing parallelism at the user level is essential to high-performance parallel computing.
Abstract: Threads are the vehicle for concurrency in many approaches to parallel programming. Threads can be supported either by the operating system kernel or by user-level library code in the application address space, but neither approach has been fully satisfactory. This paper addresses this dilemma. First, we argue that the performance of kernel threads is inherently worse than that of user-level threads, rather than this being an artifact of existing implementations; managing parallelism at the user level is essential to high-performance parallel computing. Next, we argue that the problems encountered in integrating user-level threads with other system services is a consequence of the lack of kernel support for user-level threads provided by contemporary multiprocessor operating systems; kernel threads are the wrong abstraction on which to support user-level management of parallelism. Finally, we describe the design, implementation, and performance of a new kernel interface and user-level thread package that together provide the same functionality as kernel threads without compromising the performance and flexibility advantages of user-level management of parallelism.

437 citations

Frequently Asked Questions (15)
Q1. What have the authors contributed in "Composable scheduler activations for haskell" ?

In this paper, the authors describe a novel concurrency substrate design for the Glasgow Haskell Compiler (GHC) that allows multicore schedulers for concurrent and parallel Haskell programs to be safely and modularly described as libraries in Haskell.

As the next step, the authors plan to improve upon their current solution for handling asynchronous exceptions. 



While Manticore [22] and MultiMLton [27] utilise the low-level compare-and-swap operation as the core synchronisation primitive, Li et al.'s concurrency substrate [14] for GHC was the first to utilise transactional memory for multiprocessor synchronisation in the context of ULS's.

Jikes supports unsafe low-level operations to block and synchronise threads in order to implement other operations such as garbage collection. 





The fact that the scheduler itself is now implemented in user-space complicates error recovery and reporting when threads become unreachable. 


Since the authors have already resumed the scheduler, the correct behaviour is to prepare the SCont s with the result and add it to its ULS. 

