
Showing papers by "Tim Harris published in 2015"


Proceedings Article
18 May 2015
TL;DR: This paper proposes a Holistic Runtime System, a distributed language runtime that collectively manages runtime services across multiple nodes, and presents initial results showing that this Holistic GC approach is effective both in reducing the impact of GC pauses on a batch workload and in improving GC-related tail-latencies in an interactive setting.
Abstract: Cloud systems such as Hadoop, Spark and Zookeeper are frequently written in Java or other garbage-collected languages. However, GC-induced pauses can have a significant impact on these workloads. Specifically, GC pauses can reduce throughput for batch workloads, and cause high tail-latencies for interactive applications. In this paper, we show that distributed applications suffer from each node's language runtime system making GC-related decisions independently. We first demonstrate this problem on two widely-used systems (Apache Spark and Apache Cassandra). We then propose solving this problem using a Holistic Runtime System, a distributed language runtime that collectively manages runtime services across multiple nodes. We present initial results to demonstrate that this Holistic GC approach is effective both in reducing the impact of GC pauses on a batch workload, and in improving GC-related tail-latencies in an interactive setting.
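
The paper contains no code, but the batch-mode policy it evaluates (coordinating collections so that nodes pause together rather than one after another) can be sketched briefly. Everything below is a hypothetical illustration: NodeState, the occupancy threshold, and the trigger_gc callback merely stand in for whatever interface the Holistic Runtime System actually exposes.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical per-node state a GC coordinator might gather; the real
// system exchanges richer runtime information between nodes.
struct NodeState {
    std::size_t heap_used;      // bytes currently allocated on this node
    std::size_t heap_capacity;  // bytes available before a collection is forced
};

// Batch-workload policy sketch: if any node is about to collect anyway,
// trigger a collection on *all* nodes so their pauses overlap instead of
// stalling the distributed job one node at a time.
void coordinate_batch_gc(const std::vector<NodeState>& nodes,
                         const std::function<void(std::size_t)>& trigger_gc,
                         double threshold = 0.9) {
    bool someone_must_collect = false;
    for (const NodeState& n : nodes) {
        double occupancy = static_cast<double>(n.heap_used) / n.heap_capacity;
        if (occupancy >= threshold) { someone_must_collect = true; break; }
    }
    if (someone_must_collect) {
        for (std::size_t i = 0; i < nodes.size(); ++i) {
            trigger_gc(i);  // ask node i's runtime to start a collection now
        }
    }
}
```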

74 citations


Proceedings ArticleDOI
16 Jun 2015
TL;DR: LIRA, a spatial-scheduling heuristic for selecting which parallel applications should run on the same socket of a multi-socket machine, is introduced, along with two flavors of scheduler based on it: LIRA-static, which collects performance data in an offline profiling step to decide the schedule when a program starts, and LIRA-adaptive, which operates dynamically based on hardware performance counters available on off-the-shelf hardware.
Abstract: Running multiple parallel programs on multi-socket multi-core machines using commodity hardware is increasingly common for data analytics and cluster workloads. These workloads exhibit bursty behavior and are rarely tuned to specific hardware. This leads to poor performance due to suboptimal decisions, such as poor choices for which programs run on the same socket. Consequently, there is a renewed importance for schedulers to consider the structure of the machine alongside the dynamic behavior of workloads. This paper introduces LIRA, a spatial-scheduling heuristic for selecting which parallel applications should run on the same socket in a multi-socket machine. We devise two flavors of scheduler using this heuristic: (i) LIRA-static, which collects performance data in an offline profiling step to decide the schedule when a program starts, and (ii) LIRA-adaptive, which operates dynamically based on hardware performance counters available on off-the-shelf hardware. LIRA-adaptive does not require separate, offline workload characterization runs, and it accommodates a dynamically changing mix of applications, including those with phase changes. We evaluate LIRA-static and LIRA-adaptive using programs from SPEC OMP and two graph analytics projects. We compare our approaches to the best possible performance obtained across all static mappings of 4 programs to 2 sockets, to the libgomp OpenMP runtime that comes with GCC, and to Callisto, a state-of-the-art scheduler. LIRA-static improves system throughput by 10% compared to libgomp, and LIRA-adaptive improves system throughput by 13%. Compared to Callisto, LIRA-adaptive improves performance in 30 of the 32 combinations tested, with an improvement in system throughput of up to 7%, and of 3% on average over the 32 combinations.
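
The abstract does not give the LIRA heuristic itself, so the scoring below is only a guess at the shape of such a decision: it assumes a per-application memory-intensity figure is already available (LIRA-adaptive would derive something similar from hardware performance counters) and picks, out of the three ways to pair 4 programs onto 2 sockets, the pairing whose busier socket is lightest. The struct fields and the cost function are assumptions, not LIRA's actual formula.

```cpp
#include <algorithm>
#include <limits>
#include <string>

// Hypothetical per-application profile; an adaptive scheduler would keep
// something like this up to date from performance-counter samples.
struct AppProfile {
    std::string name;
    double memory_intensity;  // e.g. last-level-cache misses per kilo-instruction
};

// Illustrative spatial-scheduling decision for 4 programs on 2 sockets:
// enumerate the three possible pairings and pick the one whose heavier
// socket is as light as possible, since the busier socket limits throughput.
int choose_pairing(const AppProfile apps[4]) {
    static const int pairings[3][2][2] = {
        {{0, 1}, {2, 3}},
        {{0, 2}, {1, 3}},
        {{0, 3}, {1, 2}},
    };
    int best = 0;
    double best_cost = std::numeric_limits<double>::max();
    for (int i = 0; i < 3; ++i) {
        double socket0 = apps[pairings[i][0][0]].memory_intensity +
                         apps[pairings[i][0][1]].memory_intensity;
        double socket1 = apps[pairings[i][1][0]].memory_intensity +
                         apps[pairings[i][1][1]].memory_intensity;
        double cost = std::max(socket0, socket1);
        if (cost < best_cost) { best_cost = cost; best = i; }
    }
    return best;  // index of the chosen pairing
}
```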

28 citations


Proceedings Article
08 Jul 2015
TL;DR: An array abstraction is presented that allows data placement to be inferred automatically from program analysis; it is implemented in Shoal, a runtime library for parallel programs on NUMA machines in which memory allocation statements are annotated to indicate access patterns.
Abstract: Modern NUMA multi-core machines exhibit complex latency and throughput characteristics, making it hard to allocate memory optimally for a given program's access patterns. However, sub-optimal allocation can significantly impact performance of parallel programs. We present an array abstraction that allows data placement to be automatically inferred from program analysis, and implement the abstraction in Shoal, a runtime library for parallel programs on NUMA machines. In Shoal, arrays can be automatically replicated, distributed, or partitioned across NUMA domains based on annotating memory allocation statements to indicate access patterns. We further show how such annotations can be automatically provided by compilers for high-level domain-specific languages (for example, the Green-Marl graph language). Finally, we show how Shoal can exploit additional hardware such as programmable DMA copy engines to further improve parallel program performance. We demonstrate significant performance benefits from automatically selecting a good array implementation based on memory access patterns and machine characteristics. We present two case studies: (i) Green-Marl, a graph analytics workload using automatically annotated code based on information extracted from the high-level program and (ii) a manually annotated version of the PARSEC Streamcluster benchmark.
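
Shoal's real API is not reproduced in the abstract, so the interface below is a hypothetical sketch of what annotation-driven allocation might look like: the programmer (or the Green-Marl compiler) states how an array will be accessed, and the library chooses replication, distribution, or partitioning across NUMA domains. All names (AccessPattern, alloc_array) are invented for illustration.

```cpp
#include <cstddef>

// Hypothetical access-pattern annotation attached to an allocation site.
enum class AccessPattern {
    ReadMostly,        // many readers, few writes  -> replicate per NUMA node
    SequentialScan,    // streaming access          -> distribute in large chunks
    RandomReadWrite    // scattered updates         -> partition by index range
};

// Sketch of an annotation-driven allocator in the spirit of Shoal: the
// caller describes the pattern, the library chooses the placement.
template <typename T>
T* alloc_array(std::size_t n, AccessPattern pattern) {
    switch (pattern) {
        case AccessPattern::ReadMostly:
            // A real implementation would allocate one copy per NUMA node
            // (e.g. via libnuma) and keep the replicas in sync on writes.
            return new T[n];
        case AccessPattern::SequentialScan:
            // Could interleave or chunk pages across nodes for bandwidth.
            return new T[n];
        case AccessPattern::RandomReadWrite:
            // Could partition by index so each node owns a contiguous range.
            return new T[n];
    }
    return nullptr;
}

// Example: a graph kernel that reads vertex data far more often than it
// writes it might request a read-mostly (replicated) placement:
//   double* ranks = alloc_array<double>(num_vertices, AccessPattern::ReadMostly);
```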

28 citations


Patent
27 May 2015
TL;DR: A garbage collection coordination mechanism (a coordinator implemented by a dedicated process on a single node or distributed across the nodes) may obtain or receive state information from each of the nodes and apply one of multiple supported garbage collection coordination policies to reduce the impact of garbage collection pauses, dependent on that information.
Abstract: Fast modern interconnects may be exploited to control when garbage collection is performed on the nodes (e.g., virtual machines, such as JVMs) of a distributed system in which the individual processes communicate with each other and in which the heap memory is not shared. A garbage collection coordination mechanism (a coordinator implemented by a dedicated process on a single node or distributed across the nodes) may obtain or receive state information from each of the nodes and apply one of multiple supported garbage collection coordination policies to reduce the impact of garbage collection pauses, dependent on that information. For example, if the information indicates that a node is about to collect, the coordinator may trigger a collection on all of the other nodes (e.g., synchronizing collection pauses for batch-mode applications where throughput is important) or may steer requests to other nodes (e.g., for interactive applications where request latencies are important).
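
For interactive workloads, the abstract describes steering requests away from nodes that are about to collect. The routine below is a minimal sketch of such a policy under assumed names (NodeGcState, the 0.85 threshold); it is not the patented mechanism, only an illustration of the decision it makes.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical view of a node as seen by the coordinator.
struct NodeGcState {
    double heap_occupancy;   // fraction of the heap in use (0.0 - 1.0)
    bool   collecting_now;   // true while a GC pause is in progress
};

// Request-steering policy sketch: route the next request to the node that
// is least likely to pause for GC while handling it.
std::size_t pick_node_for_request(const std::vector<NodeGcState>& nodes,
                                  double about_to_collect = 0.85) {
    std::size_t best = 0;
    double best_occupancy = 2.0;  // larger than any real occupancy
    for (std::size_t i = 0; i < nodes.size(); ++i) {
        if (nodes[i].collecting_now) continue;                       // paused right now
        if (nodes[i].heap_occupancy >= about_to_collect) continue;   // will pause soon
        if (nodes[i].heap_occupancy < best_occupancy) {
            best_occupancy = nodes[i].heap_occupancy;
            best = i;
        }
    }
    return best;  // falls back to node 0 if every node is close to collecting
}
```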

20 citations


Proceedings Article
08 Jul 2015
TL;DR: Callisto-RTS, a parallel runtime system designed for multi-socket shared-memory machines, is introduced; it uses per-core iteration counts to distribute work initially and a new asynchronous request-combining technique when threads require more work.
Abstract: We introduce Callisto-RTS, a parallel runtime system designed for multi-socket shared-memory machines. It supports very fine-grained scheduling of parallel loops, down to batches of work of around 1K cycles. Fine-grained scheduling helps avoid load imbalance while reducing the need for tuning workloads to particular machines or inputs. We use per-core iteration counts to distribute work initially, and a new asynchronous request combining technique for when threads require more work. We present results using graph analytics algorithms on a 2-socket Intel 64 machine (32 h/w contexts), and on an 8-socket SPARC machine (1024 h/w contexts). In addition to reducing the need for tuning, on the SPARC machine we improve absolute performance by up to 39% (compared with OpenMP). On both architectures Callisto-RTS provides improved scaling and performance compared with a state-of-the-art parallel runtime system (Galois).
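
The abstract describes distributing work via per-core iteration counts plus an asynchronous request-combining step when threads run dry; that combining protocol is not spelled out here, so the sketch below shows only the general flavor of fine-grained dynamic loop scheduling: workers claim small batches of iterations from a shared counter, so imbalance is bounded by the batch size. The names and the batch size are illustrative, not Callisto-RTS code.

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Fine-grained dynamic loop scheduling: each worker repeatedly claims a
// small batch of iterations, so a slow core simply claims fewer batches.
// (Callisto-RTS additionally seeds work with per-core iteration counts and
// uses asynchronous request combining, which this sketch omits.)
void parallel_for(std::uint64_t n,
                  const std::function<void(std::uint64_t)>& body,
                  unsigned num_threads = std::thread::hardware_concurrency(),
                  std::uint64_t batch = 256) {  // roughly "1K cycles" of work
    std::atomic<std::uint64_t> next{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (;;) {
                std::uint64_t start = next.fetch_add(batch);  // claim a batch
                if (start >= n) break;
                std::uint64_t end = std::min(start + batch, n);
                for (std::uint64_t i = start; i < end; ++i) body(i);
            }
        });
    }
    for (auto& w : workers) w.join();
}
```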

18 citations


Posted Content
TL;DR: This paper demonstrates the impact of the placement policies of memory allocators on the performance of applications that use hardware transactional memory, and shows that index-aware allocators can avoid pathological memory placements that cause transactions to abort consistently.
Abstract: In this paper, we demonstrate the impact of the placement policies of memory allocators on the performance of applications that use hardware transactional memory. In particular, commonly used allocators such as the default GNU glibc malloc allocator may place objects in such a way that causes hardware transactions to consistently abort, even when running single-threaded. In multithreaded applications, these consistent aborts can force applications to fall back to using locks, significantly limiting the parallelism. We also show that using index-aware allocators can avoid these pathological memory placements. We have observed read-only transactions commit where the cache footprint exceeds the L2 size, but have never observed transactions commit where the footprint is above the size of the L3. With TSX-RTM, misses turn into aborts; aborts amplify the impact of misses (wasted cycles, and loss of the write set from the L1), and they also force the slow path and the loss of concurrent execution for TLE. A thread-local problem thus shifts to a global impediment: misses become aborts, and aborts become serialization.
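
The abstract notes that under TSX-RTM cache misses become aborts, and that transactions whose footprint exceeds the last-level cache never commit. The probe below is a minimal sketch, not the paper's methodology: it uses the real _xbegin/_xend intrinsics from <immintrin.h> (an RTM-capable CPU and -mrtm are required) to show that a read-only transaction aborts once its read set outgrows the cache, or when an unlucky allocator packs the data into too few cache sets. The sizes and the single-attempt policy are arbitrary choices for illustration.

```cpp
#include <immintrin.h>   // _xbegin, _xend, _XBEGIN_STARTED (compile with -mrtm)
#include <cstddef>
#include <cstdio>
#include <vector>

// Single-threaded probe: read `bytes` of data inside one hardware
// transaction and report whether it committed. No other thread is
// involved, so any abort comes from capacity or placement, not contention.
bool reads_commit_in_one_txn(std::size_t bytes) {
    std::vector<char> data(bytes, 1);   // allocate *before* the transaction
    volatile long sink = 0;
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        long sum = 0;
        for (std::size_t i = 0; i < data.size(); i += 64) sum += data[i];
        _xend();                        // reached only if the transaction commits
        sink = sum;
        return true;
    }
    (void)sink;
    return false;                       // aborted: capacity, conflict, or other reason
}

int main() {
    for (std::size_t kb : {64, 256, 1024, 8192, 32768}) {
        std::printf("%6zu KiB: %s\n", kb,
                    reads_commit_in_one_txn(kb * 1024) ? "committed" : "aborted");
    }
}
```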

13 citations


Journal ArticleDOI
Tim Harris1
TL;DR: This article is about three trends in computer hardware, and some of the challenges and opportunities that I think they provide for the distributed computing community.
Abstract: This article is about three trends in computer hardware, and some of the challenges and opportunities that I think they provide for the distributed computing community. A common theme in all of these trends is that hardware is moving away from assumptions that have often been made about the relative performance of different operations (e.g., computation versus network communication), the reliability of operations (e.g., that memory accesses are reliable, but network communication is not), and even some of the basic properties of the system (e.g., that the contents of main memory are lost on power failure).Section 1 introduces "rack-scale" systems and the kinds of properties likely in their interconnect networks. Section 2 describes challenges in systems with shared physical memory but without hardware cache coherence. Section 3 discusses non-volatile byte-addressable memory. The article is based in part on my talk at the ACM PODC 2014 event in celebration of Maurice Herlihy's sixtieth birthday.

9 citations


Proceedings Article
17 Apr 2015
TL;DR: It is hard to believe that it is already ten years since the first EuroSys conference in Leuven in 2006; in the years since that first meeting, EuroSys has grown a reputation as one of the leading systems conferences.
Abstract: It is hard to believe that it is already ten years since the first EuroSys conference in Leuven in 2006. In the years since that first meeting, EuroSys has grown a reputation as one of the leading systems conferences. We believe that this year certainly follows that trend.

7 citations


DOI
01 Jan 2015
TL;DR: The Dagstuhl Seminar was successful and facilitated interaction between researchers working in a diverse set of fields, including computer architecture, parallel workloads, systems software, and programming language design.
Abstract: This report documents the program and the outcomes of Dagstuhl Seminar 15421 "Rack-scale Computing". The seminar was successful and facilitated interaction between researchers working in a diverse set of fields, including computer architecture, parallel workloads, systems software, and programming language design. In addition to stimulating interaction during the seminar, the event led to a follow-on Workshop on Rack-Scale Computing to be organized during 2016.

6 citations


01 Jan 2015
TL;DR: It is shown that hardware-based transactional lock elision can provide benefit by reducing the incidence of lock holder preemption, decreasing lock hold times and promoting improved scalability.
Abstract: In this short paper we show that hardware-based transactional lock elision can provide benefit by reducing the incidence of lock holder preemption, decreasing lock hold times and promoting improved scalability.

4 citations


Patent
10 Jun 2015
TL;DR: Transactional lock elision as mentioned in this paper allows hardware transactions to execute unmodified critical sections protected by the same lock concurrently, by subscribing to the lock and verifying that it is available before committing the transaction.
Abstract: Transactional Lock Elision allows hardware transactions to execute unmodified critical sections protected by the same lock concurrently, by subscribing to the lock and verifying that it is available before committing the transaction. A “lazy subscription” optimization, which delays lock subscription, can potentially cause behavior that cannot occur when the critical sections are executed under the lock. Hardware extensions may provide mechanisms to ensure that lazy subscriptions are safe (e.g., that they result in correct behavior). Prior to executing a critical section transactionally, its lock and subscription code may be identified (e.g., by writing their locations to special registers). Prior to committing the transaction, the thread executing the critical section may verify that the correct lock was correctly subscribed to. If not, or if locations identified by the special registers have been modified, the transaction may be aborted. Nested critical sections associated with different lock types may invoke different subscription code.
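
As a companion to the abstract, the sketch below contrasts eager subscription (read the lock word when the transaction starts) with the lazy variant (read it just before commit), using the RTM intrinsics from <immintrin.h>. The SpinLock type, the single-attempt fallback, and the abort codes are simplified placeholders, and the hardware validation mechanism the patent proposes is not modeled.

```cpp
#include <immintrin.h>   // RTM intrinsics; compile with -mrtm
#include <atomic>

// Simplified spin lock standing in for the lock being elided.
struct SpinLock {
    std::atomic<int> word{0};
    void lock()   { while (word.exchange(1, std::memory_order_acquire)) { } }
    void unlock() { word.store(0, std::memory_order_release); }
    bool held() const { return word.load(std::memory_order_relaxed) != 0; }
};

// Eager subscription: the lock word is read at the start of the
// transaction, so a later acquisition by another thread conflicts with the
// transaction's read set and aborts it before it can commit bad state.
// (Real TLE retries the transaction a few times before taking the lock.)
template <typename CriticalSection>
void tle_eager(SpinLock& l, CriticalSection body) {
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        if (l.held()) _xabort(0xff);  // lock is taken: give up and fall back
        body();
        _xend();
        return;
    }
    l.lock(); body(); l.unlock();     // non-transactional fallback path
}

// Lazy subscription defers the lock check until just before commit. The
// patent's concern is that, without extra safeguards, the speculative
// execution of body() can misbehave in ways a lock-based execution never
// would; the proposed hardware extensions validate, before commit, that
// the correct lock was subscribed to.
template <typename CriticalSection>
void tle_lazy(SpinLock& l, CriticalSection body) {
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        body();
        if (l.held()) _xabort(0xfe);  // subscribe late, right before commit
        _xend();
        return;
    }
    l.lock(); body(); l.unlock();
}
```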