
Showing papers by "Tim Harris published in 2015"


Proceedings Article
18 May 2015
TL;DR: This paper proposes a Holistic Runtime System, a distributed language runtime that collectively manages runtime services across multiple nodes, and presents initial results showing that this Holistic GC approach is effective both in reducing the impact of GC pauses on a batch workload and in improving GC-related tail-latencies in an interactive setting.
Abstract: Cloud systems such as Hadoop, Spark and Zookeeper are frequently written in Java or other garbage-collected languages. However, GC-induced pauses can have a significant impact on these workloads. Specifically, GC pauses can reduce throughput for batch workloads, and cause high tail-latencies for interactive applications. In this paper, we show that distributed applications suffer from each node's language runtime system making GC-related decisions independently. We first demonstrate this problem on two widely-used systems (Apache Spark and Apache Cassandra). We then propose solving this problem using a Holistic Runtime System, a distributed language runtime that collectively manages runtime services across multiple nodes. We present initial results to demonstrate that this Holistic GC approach is effective both in reducing the impact of GC pauses on a batch workload, and in improving GC-related tail-latencies in an interactive setting.
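
The paper contains no code, but the batch-mode policy it evaluates (coordinating collections so that nodes pause together rather than one after another) can be sketched briefly. Everything below is a hypothetical illustration: NodeState, the occupancy threshold, and the trigger_gc callback merely stand in for whatever interface the Holistic Runtime System actually exposes.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical per-node state a GC coordinator might gather; the real
// system exchanges richer runtime information between nodes.
struct NodeState {
    std::size_t heap_used;      // bytes currently allocated on this node
    std::size_t heap_capacity;  // bytes available before a collection is forced
};

// Batch-workload policy sketch: if any node is about to collect anyway,
// trigger a collection on *all* nodes so their pauses overlap instead of
// stalling the distributed job one node at a time.
void coordinate_batch_gc(const std::vector<NodeState>& nodes,
                         const std::function<void(std::size_t)>& trigger_gc,
                         double threshold = 0.9) {
    bool someone_must_collect = false;
    for (const NodeState& n : nodes) {
        double occupancy = static_cast<double>(n.heap_used) / n.heap_capacity;
        if (occupancy >= threshold) { someone_must_collect = true; break; }
    }
    if (someone_must_collect) {
        for (std::size_t i = 0; i < nodes.size(); ++i) {
            trigger_gc(i);  // ask node i's runtime to start a collection now
        }
    }
}
```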

74 citations


Proceedings ArticleDOI
16 Jun 2015
TL;DR: LIRA, a spatial-scheduling heuristic for selecting which parallel applications should run on the same socket of a multi-socket machine, is introduced, along with two flavors of scheduler based on it: LIRA-static, which collects performance data in an offline profiling step to decide the schedule when a program starts, and LIRA-adaptive, which operates dynamically based on hardware performance counters available on off-the-shelf hardware.
Abstract: Running multiple parallel programs on multi-socket multi-core machines using commodity hardware is increasingly common for data analytics and cluster workloads. These workloads exhibit bursty behavior and are rarely tuned to specific hardware. This leads to poor performance due to suboptimal decisions, such as poor choices for which programs run on the same socket. Consequently, there is a renewed importance for schedulers to consider the structure of the machine alongside the dynamic behavior of workloads. This paper introduces LIRA, a spatial-scheduling heuristic for selecting which parallel applications should run on the same socket in a multi-socket machine. We devise two flavors of scheduler using this heuristic: (i) LIRA-static, which collects performance data in an offline profiling step to decide the schedule when a program starts, and (ii) LIRA-adaptive, which operates dynamically based on hardware performance counters available on off-the-shelf hardware. LIRA-adaptive does not require separate, offline workload characterization runs, and it accommodates a dynamically changing mix of applications, including those with phase changes. We evaluate LIRA-static and LIRA-adaptive using programs from SPEC OMP and two graph analytics projects. We compare our approaches to the best possible performance obtained across all static mappings of 4 programs to 2 sockets, to the libgomp OpenMP runtime that comes with GCC, and to Callisto, a state-of-the-art scheduler. LIRA-static improves system throughput by 10% compared to libgomp, and LIRA-adaptive improves system throughput by 13%. Compared to Callisto, LIRA-adaptive improves performance in 30 of the 32 combinations tested, with an improvement in system throughput of up to 7%, and of 3% on average over the 32 combinations.
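
The abstract does not give the LIRA heuristic itself, so the scoring below is only a guess at the shape of such a decision: it assumes a per-application memory-intensity figure is already available (LIRA-adaptive would derive something similar from hardware performance counters) and picks, out of the three ways to pair 4 programs onto 2 sockets, the pairing whose busier socket is lightest. The struct fields and the cost function are assumptions, not LIRA's actual formula.

```cpp
#include <algorithm>
#include <limits>
#include <string>

// Hypothetical per-application profile; an adaptive scheduler would keep
// something like this up to date from performance-counter samples.
struct AppProfile {
    std::string name;
    double memory_intensity;  // e.g. last-level-cache misses per kilo-instruction
};

// Illustrative spatial-scheduling decision for 4 programs on 2 sockets:
// enumerate the three possible pairings and pick the one whose heavier
// socket is as light as possible, since the busier socket limits throughput.
int choose_pairing(const AppProfile apps[4]) {
    static const int pairings[3][2][2] = {
        {{0, 1}, {2, 3}},
        {{0, 2}, {1, 3}},
        {{0, 3}, {1, 2}},
    };
    int best = 0;
    double best_cost = std::numeric_limits<double>::max();
    for (int i = 0; i < 3; ++i) {
        double socket0 = apps[pairings[i][0][0]].memory_intensity +
                         apps[pairings[i][0][1]].memory_intensity;
        double socket1 = apps[pairings[i][1][0]].memory_intensity +
                         apps[pairings[i][1][1]].memory_intensity;
        double cost = std::max(socket0, socket1);
        if (cost < best_cost) { best_cost = cost; best = i; }
    }
    return best;  // index of the chosen pairing
}
```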

28 citations


Proceedings Article
08 Jul 2015
TL;DR: An array abstraction is presented that allows data placement to be inferred automatically from program analysis; it is implemented in Shoal, a runtime library for parallel programs on NUMA machines in which memory allocation statements are annotated to indicate access patterns.
Abstract: Modern NUMA multi-core machines exhibit complex latency and throughput characteristics, making it hard to allocate memory optimally for a given program's access patterns. However, sub-optimal allocation can significantly impact performance of parallel programs. We present an array abstraction that allows data placement to be automatically inferred from program analysis, and implement the abstraction in Shoal, a runtime library for parallel programs on NUMA machines. In Shoal, arrays can be automatically replicated, distributed, or partitioned across NUMA domains based on annotating memory allocation statements to indicate access patterns. We further show how such annotations can be automatically provided by compilers for high-level domain-specific languages (for example, the Green-Marl graph language). Finally, we show how Shoal can exploit additional hardware such as programmable DMA copy engines to further improve parallel program performance. We demonstrate significant performance benefits from automatically selecting a good array implementation based on memory access patterns and machine characteristics. We present two case studies: (i) Green-Marl, a graph analytics workload using automatically annotated code based on information extracted from the high-level program and (ii) a manually annotated version of the PARSEC Streamcluster benchmark.
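
Shoal's real API is not reproduced in the abstract, so the interface below is a hypothetical sketch of what annotation-driven allocation might look like: the programmer (or the Green-Marl compiler) states how an array will be accessed, and the library chooses replication, distribution, or partitioning across NUMA domains. All names (AccessPattern, alloc_array) are invented for illustration.

```cpp
#include <cstddef>

// Hypothetical access-pattern annotation attached to an allocation site.
enum class AccessPattern {
    ReadMostly,        // many readers, few writes  -> replicate per NUMA node
    SequentialScan,    // streaming access          -> distribute in large chunks
    RandomReadWrite    // scattered updates         -> partition by index range
};

// Sketch of an annotation-driven allocator in the spirit of Shoal: the
// caller describes the pattern, the library chooses the placement.
template <typename T>
T* alloc_array(std::size_t n, AccessPattern pattern) {
    switch (pattern) {
        case AccessPattern::ReadMostly:
            // A real implementation would allocate one copy per NUMA node
            // (e.g. via libnuma) and keep the replicas in sync on writes.
            return new T[n];
        case AccessPattern::SequentialScan:
            // Could interleave or chunk pages across nodes for bandwidth.
            return new T[n];
        case AccessPattern::RandomReadWrite:
            // Could partition by index so each node owns a contiguous range.
            return new T[n];
    }
    return nullptr;
}

// Example: a graph kernel that reads vertex data far more often than it
// writes it might request a read-mostly (replicated) placement:
//   double* ranks = alloc_array<double>(num_vertices, AccessPattern::ReadMostly);
```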

28 citations


Patent
27 May 2015
TL;DR: A garbage collection coordination mechanism (a coordinator implemented by a dedicated process on a single node or distributed across the nodes) may obtain or receive state information from each of the nodes and apply one of multiple supported garbage collection coordination policies to reduce the impact of garbage collection pauses, dependent on that information.
Abstract: Fast modern interconnects may be exploited to control when garbage collection is performed on the nodes (e.g., virtual machines, such as JVMs) of a distributed system in which the individual processes communicate with each other and in which the heap memory is not shared. A garbage collection coordination mechanism (a coordinator implemented by a dedicated process on a single node or distributed across the nodes) may obtain or receive state information from each of the nodes and apply one of multiple supported garbage collection coordination policies to reduce the impact of garbage collection pauses, dependent on that information. For example, if the information indicates that a node is about to collect, the coordinator may trigger a collection on all of the other nodes (e.g., synchronizing collection pauses for batch-mode applications where throughput is important) or may steer requests to other nodes (e.g., for interactive applications where request latencies are important).
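
For interactive workloads, the abstract describes steering requests away from nodes that are about to collect. The routine below is a minimal sketch of such a policy under assumed names (NodeGcState, the 0.85 threshold); it is not the patented mechanism, only an illustration of the decision it makes.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical view of a node as seen by the coordinator.
struct NodeGcState {
    double heap_occupancy;   // fraction of the heap in use (0.0 - 1.0)
    bool   collecting_now;   // true while a GC pause is in progress
};

// Request-steering policy sketch: route the next request to the node that
// is least likely to pause for GC while handling it.
std::size_t pick_node_for_request(const std::vector<NodeGcState>& nodes,
                                  double about_to_collect = 0.85) {
    std::size_t best = 0;
    double best_occupancy = 2.0;  // larger than any real occupancy
    for (std::size_t i = 0; i < nodes.size(); ++i) {
        if (nodes[i].collecting_now) continue;                       // paused right now
        if (nodes[i].heap_occupancy >= about_to_collect) continue;   // will pause soon
        if (nodes[i].heap_occupancy < best_occupancy) {
            best_occupancy = nodes[i].heap_occupancy;
            best = i;
        }
    }
    return best;  // falls back to node 0 if every node is close to collecting
}
```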

20 citations


Proceedings Article
08 Jul 2015
TL;DR: Callisto-RTS, a parallel runtime system designed for multi-socket shared-memory machines, is introduced; it uses per-core iteration counts to distribute work initially and a new asynchronous request-combining technique when threads require more work.
Abstract: We introduce Callisto-RTS, a parallel runtime system designed for multi-socket shared-memory machines. It supports very fine-grained scheduling of parallel loops, down to batches of work of around 1K cycles. Fine-grained scheduling helps avoid load imbalance while reducing the need for tuning workloads to particular machines or inputs. We use per-core iteration counts to distribute work initially, and a new asynchronous request combining technique for when threads require more work. We present results using graph analytics algorithms on a 2-socket Intel 64 machine (32 h/w contexts), and on an 8-socket SPARC machine (1024 h/w contexts). In addition to reducing the need for tuning, on the SPARC machine we improve absolute performance by up to 39% (compared with OpenMP). On both architectures Callisto-RTS provides improved scaling and performance compared with a state-of-the-art parallel runtime system (Galois).
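
The abstract describes distributing work via per-core iteration counts plus an asynchronous request-combining step when threads run dry; that combining protocol is not spelled out here, so the sketch below shows only the general flavor of fine-grained dynamic loop scheduling: workers claim small batches of iterations from a shared counter, so imbalance is bounded by the batch size. The names and the batch size are illustrative, not Callisto-RTS code.

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Fine-grained dynamic loop scheduling: each worker repeatedly claims a
// small batch of iterations, so a slow core simply claims fewer batches.
// (Callisto-RTS additionally seeds work with per-core iteration counts and
// uses asynchronous request combining, which this sketch omits.)
void parallel_for(std::uint64_t n,
                  const std::function<void(std::uint64_t)>& body,
                  unsigned num_threads = std::thread::hardware_concurrency(),
                  std::uint64_t batch = 256) {  // roughly "1K cycles" of work
    std::atomic<std::uint64_t> next{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (;;) {
                std::uint64_t start = next.fetch_add(batch);  // claim a batch
                if (start >= n) break;
                std::uint64_t end = std::min(start + batch, n);
                for (std::uint64_t i = start; i < end; ++i) body(i);
            }
        });
    }
    for (auto& w : workers) w.join();
}
```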

18 citations


Posted Content
TL;DR: This paper demonstrates the impact of the placement policies of memory allocators on the performance of applications that use hardware transactional memory, and shows that index-aware allocators can avoid pathological memory placements that cause transactions to abort consistently.
Abstract: In this paper, we demonstrate the impact of the placement policies of memory allocators on the performance of applications that use hardware transactional memory. In particular, commonly used allocators such as the default GNU glibc malloc allocator may place objects in such a way that causes hardware transactions to consistently abort, even when running single-threaded. In multithreaded applications, these consistent aborts can force applications to fall back to using locks, significantly limiting the parallelism. We also show that using index-aware allocators can avoid these pathological memory placements. We have observed read-only transactions commit where the cache footprint exceeds the L2 size, but have never observed transactions commit where the footprint is above the size of the L3. With TSX-RTM, misses turn into aborts; aborts amplify the impact of misses (wasted cycles, and loss of the write set from the L1), and they also force the slow path and the loss of concurrent execution for TLE. A thread-local problem thus shifts to a global impediment: misses become aborts, and aborts become serialization.
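
The abstract notes that under TSX-RTM cache misses become aborts, and that transactions whose footprint exceeds the last-level cache never commit. The probe below is a minimal sketch, not the paper's methodology: it uses the real _xbegin/_xend intrinsics from <immintrin.h> (an RTM-capable CPU and -mrtm are required) to show that a read-only transaction aborts once its read set outgrows the cache, or when an unlucky allocator packs the data into too few cache sets. The sizes and the single-attempt policy are arbitrary choices for illustration.

```cpp
#include <immintrin.h>   // _xbegin, _xend, _XBEGIN_STARTED (compile with -mrtm)
#include <cstddef>
#include <cstdio>
#include <vector>

// Single-threaded probe: read `bytes` of data inside one hardware
// transaction and report whether it committed. No other thread is
// involved, so any abort comes from capacity or placement, not contention.
bool reads_commit_in_one_txn(std::size_t bytes) {
    std::vector<char> data(bytes, 1);   // allocate *before* the transaction
    volatile long sink = 0;
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        long sum = 0;
        for (std::size_t i = 0; i < data.size(); i += 64) sum += data[i];
        _xend();                        // reached only if the transaction commits
        sink = sum;
        return true;
    }
    (void)sink;
    return false;                       // aborted: capacity, conflict, or other reason
}

int main() {
    for (std::size_t kb : {64, 256, 1024, 8192, 32768}) {
        std::printf("%6zu KiB: %s\n", kb,
                    reads_commit_in_one_txn(kb * 1024) ? "committed" : "aborted");
    }
}
```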

13 citations


Journal ArticleDOI
Tim Harris1
TL;DR: This article is about three trends in computer hardware, and some of the challenges and opportunities that I think they provide for the distributed computing community.
Abstract: This article is about three trends in computer hardware, and some of the challenges and opportunities that I think they provide for the distributed computing community. A common theme in all of these trends is that hardware is moving away from assumptions that have often been made about the relative performance of different operations (e.g., computation versus network communication), the reliability of operations (e.g., that memory accesses are reliable, but network communication is not), and even some of the basic properties of the system (e.g., that the contents of main memory are lost on power failure).Section 1 introduces "rack-scale" systems and the kinds of properties likely in their interconnect networks. Section 2 describes challenges in systems with shared physical memory but without hardware cache coherence. Section 3 discusses non-volatile byte-addressable memory. The article is based in part on my talk at the ACM PODC 2014 event in celebration of Maurice Herlihy's sixtieth birthday.

9 citations


Proceedings Article
17 Apr 2015
TL;DR: It is hard to believe that it is already ten years since the first EuroSys conference in Leuven in 2006; in the years since that first meeting, EuroSys has grown a reputation as one of the leading systems conferences.
Abstract: It is hard to believe that it is already ten years since the first EuroSys conference in Leuven in 2006. In the years since that first meeting, EuroSys has grown a reputation as one of the leading systems conferences. We believe that this year certainly follows that trend.

7 citations


DOI
01 Jan 2015
TL;DR: The Dagstuhl Seminar was successful and facilitated interaction between researchers working in a diverse set of fields, including computer architecture, parallel workloads, systems software, and programming language design.
Abstract: This report documents the program and the outcomes of Dagstuhl Seminar 15421 "Rack-scale Computing". The seminar was successful and facilitated interaction between researchers working in a diverse set of fields, including computer architecture, parallel workloads, systems software, and programming language design. In addition to stimulating interaction during the seminar, the event led to a follow-on Workshop on Rack-Scale Computing to be organized during 2016.

6 citations


01 Jan 2015
TL;DR: It is shown that hardware-based transactional lock elision can provide benefit by reducing the incidence of lock holder preemption, decreasing lock hold times and promoting improved scalability.
Abstract: In this short paper we show that hardware-based transactional lock elision can provide benefit by reducing the incidence of lock holder preemption, decreasing lock hold times and promoting improved scalability.

4 citations


Patent
10 Jun 2015
TL;DR: Transactional lock elision as mentioned in this paper allows hardware transactions to execute unmodified critical sections protected by the same lock concurrently, by subscribing to the lock and verifying that it is available before committing the transaction.
Abstract: Transactional Lock Elision allows hardware transactions to execute unmodified critical sections protected by the same lock concurrently, by subscribing to the lock and verifying that it is available before committing the transaction. A “lazy subscription” optimization, which delays lock subscription, can potentially cause behavior that cannot occur when the critical sections are executed under the lock. Hardware extensions may provide mechanisms to ensure that lazy subscriptions are safe (e.g., that they result in correct behavior). Prior to executing a critical section transactionally, its lock and subscription code may be identified (e.g., by writing their locations to special registers). Prior to committing the transaction, the thread executing the critical section may verify that the correct lock was correctly subscribed to. If not, or if locations identified by the special registers have been modified, the transaction may be aborted. Nested critical sections associated with different lock types may invoke different subscription code.
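
As a companion to the abstract, the sketch below contrasts eager subscription (read the lock word when the transaction starts) with the lazy variant (read it just before commit), using the RTM intrinsics from <immintrin.h>. The SpinLock type, the single-attempt fallback, and the abort codes are simplified placeholders, and the hardware validation mechanism the patent proposes is not modeled.

```cpp
#include <immintrin.h>   // RTM intrinsics; compile with -mrtm
#include <atomic>

// Simplified spin lock standing in for the lock being elided.
struct SpinLock {
    std::atomic<int> word{0};
    void lock()   { while (word.exchange(1, std::memory_order_acquire)) { } }
    void unlock() { word.store(0, std::memory_order_release); }
    bool held() const { return word.load(std::memory_order_relaxed) != 0; }
};

// Eager subscription: the lock word is read at the start of the
// transaction, so a later acquisition by another thread conflicts with the
// transaction's read set and aborts it before it can commit bad state.
// (Real TLE retries the transaction a few times before taking the lock.)
template <typename CriticalSection>
void tle_eager(SpinLock& l, CriticalSection body) {
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        if (l.held()) _xabort(0xff);  // lock is taken: give up and fall back
        body();
        _xend();
        return;
    }
    l.lock(); body(); l.unlock();     // non-transactional fallback path
}

// Lazy subscription defers the lock check until just before commit. The
// patent's concern is that, without extra safeguards, the speculative
// execution of body() can misbehave in ways a lock-based execution never
// would; the proposed hardware extensions validate, before commit, that
// the correct lock was subscribed to.
template <typename CriticalSection>
void tle_lazy(SpinLock& l, CriticalSection body) {
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        body();
        if (l.held()) _xabort(0xfe);  // subscribe late, right before commit
        _xend();
        return;
    }
    l.lock(); body(); l.unlock();
}
```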