Proceedings ArticleDOI

The Cilk++ concurrency platform

26 Jul 2009, pp. 522-527
TL;DR: This paper overviews the Cilk++ programming environment, which incorporates a compiler, a runtime system, and a race-detection tool, and provides a "hyperobject" library which allows races on nonlocal variables to be mitigated without lock contention or substantial code restructuring.
Abstract: The availability of multicore processors across a wide range of computing platforms has created a strong demand for software frameworks that can harness these resources. This paper overviews the Cilk++ programming environment, which incorporates a compiler, a runtime system, and a race-detection tool. The Cilk++ runtime system guarantees to load-balance computations effectively. To cope with legacy codes containing global variables, Cilk++ provides a "hyperobject" library which allows races on nonlocal variables to be mitigated without lock contention or substantial code restructuring.
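
A minimal sketch of the keyword extensions the platform provides, in the style of the classic parallel Fibonacci example (the header name below follows the later Intel Cilk Plus spelling; the 2009 Cilk++ release spelled some of these differently):

    #include <cilk/cilk.h>   // Cilk Plus-style header; Cilk++ (2009) used <cilk.h>

    // cilk_spawn lets the callee run in parallel with the continuation;
    // cilk_sync waits for all children spawned in this function.
    long fib(long n) {
        if (n < 2) return n;
        long x = cilk_spawn fib(n - 1);  // may execute on another worker
        long y = fib(n - 2);             // runs concurrently with the spawn
        cilk_sync;                       // join before using x
        return x + y;
    }

The work-stealing runtime, not the programmer, decides where spawned work executes, which is the load-balancing guarantee the abstract refers to.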


Citations
Proceedings ArticleDOI
06 Oct 2014
TL;DR: HPX is presented -- a parallel runtime system which extends the C++11/14 standard to facilitate distributed operations, enable fine-grained constraint-based parallelism, and support runtime adaptive resource management -- providing a widely accepted API enabling programmability, composability, and performance portability of user applications.
Abstract: The significant increase in complexity of Exascale platforms due to energy-constrained, billion-way parallelism, with major changes to processor and memory architecture, requires new energy-efficient and resilient programming techniques that are portable across multiple future generations of machines. We believe that guaranteeing adequate scalability, programmability, performance portability, resilience, and energy efficiency requires a fundamentally new approach, combined with a transition path for existing scientific applications, to fully explore the rewards of today's and tomorrow's systems. We present HPX -- a parallel runtime system which extends the C++11/14 standard to facilitate distributed operations, enable fine-grained constraint-based parallelism, and support runtime adaptive resource management. This provides a widely accepted API enabling programmability, composability and performance portability of user applications. By employing a global address space, we seamlessly augment the standard to apply to a distributed case. We present HPX's architecture, design decisions, and results selected from a diverse set of application runs showing superior performance, scalability, and efficiency over conventional practice.
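
A minimal sketch of the future-based style HPX supports, assuming the classic hpx_main/hpx::init entry points and header paths that have varied across HPX versions (treat both as approximate):

    #include <hpx/hpx_init.hpp>
    #include <hpx/include/async.hpp>   // header layout differs between releases

    // HPX mirrors the C++11 std::async/std::future API, so future-based
    // fan-out reads like standard C++; hpx::async can also target tasks
    // on remote localities via the global address space.
    int square(int x) { return x * x; }

    int hpx_main(int, char**) {
        hpx::future<int> f = hpx::async(square, 7);   // asynchronous task
        int r = f.get();                              // join
        (void)r;
        return hpx::finalize();
    }

    int main(int argc, char* argv[]) {
        return hpx::init(argc, argv);   // boots the HPX runtime, runs hpx_main
    }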

268 citations

Proceedings ArticleDOI
11 Aug 2009
TL;DR: This paper introduces hyperobjects, a linguistic mechanism that allows different branches of a multithreaded program to maintain coordinated local views of the same nonlocal variable, and examines a randomized locking methodology for reducers.
Abstract: This paper introduces hyperobjects, a linguistic mechanism that allows different branches of a multithreaded program to maintain coordinated local views of the same nonlocal variable. We have identified three kinds of hyperobjects that seem to be useful -- reducers, holders, and splitters -- and we have implemented reducers and holders in Cilk++, a set of extensions to the C++ programming language that enables "dynamic" multithreaded programming in the style of MIT Cilk. We analyze a randomized locking methodology for reducers and show that a work-stealing scheduler can support reducers without incurring significant overhead.
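
A minimal sketch of the reducer variety of hyperobject, using the opadd reducer from the Cilk++ library (accessor spellings changed across releases, so treat the exact API as approximate):

    #include <cilk/cilk.h>
    #include <cilk/reducer_opadd.h>

    // Each worker strand gets a private view of `sum`, so the parallel loop
    // can update it without locks; views are merged with + when strands
    // join, yielding the same value as the serial loop.
    cilk::reducer_opadd<long> sum;

    void total(const long *a, int n) {
        cilk_for (int i = 0; i < n; ++i)
            sum += a[i];             // no race: per-strand view
    }
    // After the loop, sum.get_value() holds the reduction result.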

171 citations


Cites methods from "The Cilk++ concurrency platform"

  • "Figure 2 illustrates a straightforward parallelization of this code in Cilk++ [15], a set of simple extensions to the C++ programming language that enables “dynamic” multithreaded programming in the style of the MIT Cilk multithreaded programming language [8]."


Proceedings Article
28 Jun 2011
TL;DR: This work proves convergence bounds for Shotgun which predict linear speedups, up to a problem-dependent limit, and presents a comprehensive empirical study of Shotgun for Lasso and sparse logistic regression.
Abstract: We propose Shotgun, a parallel coordinate descent algorithm for minimizing L1-regularized losses. Though coordinate descent seems inherently sequential, we prove convergence bounds for Shotgun which predict linear speedups, up to a problem-dependent limit. We present a comprehensive empirical study of Shotgun for Lasso and sparse logistic regression. Our theoretical predictions on the potential for parallelism closely match behavior on real data. Shotgun outperforms other published solvers on a range of large problems, proving to be one of the most scalable algorithms for L1.
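
As a hedged sketch of the algorithmic idea (not the authors' code): Shotgun performs Lasso coordinate descent on P coordinates at once. Assuming unit-norm feature columns and an incrementally maintained residual, one round looks like this; the parallel execution of the inner updates is indicated in comments rather than tied to a particular runtime:

    #include <vector>
    #include <random>

    // Soft-thresholding operator, the closed-form coordinate update for the
    // Lasso objective (1/2)||Xw - y||^2 + lambda * ||w||_1.
    double soft_threshold(double z, double lambda) {
        if (z >  lambda) return z - lambda;
        if (z < -lambda) return z + lambda;
        return 0.0;
    }

    // One Shotgun round: pick P random coordinates and update each from the
    // residual r = y - Xw. In Shotgun these P updates run in parallel, and
    // convergence holds up to a problem-dependent limit on P.
    void shotgun_round(const std::vector<std::vector<double>> &X,  // X[j] = column j, unit norm
                       std::vector<double> &w, std::vector<double> &r,
                       double lambda, int P, std::mt19937 &rng) {
        std::uniform_int_distribution<size_t> pick(0, w.size() - 1);
        for (int p = 0; p < P; ++p) {          // conceptually a parallel loop
            size_t j = pick(rng);
            double g = 0.0;                    // g = x_j^T r
            for (size_t i = 0; i < r.size(); ++i) g += X[j][i] * r[i];
            double w_new = soft_threshold(w[j] + g, lambda);
            double delta = w_new - w[j];
            if (delta != 0.0) {                // keep r = y - Xw consistent
                for (size_t i = 0; i < r.size(); ++i) r[i] -= delta * X[j][i];
                w[j] = w_new;
            }
        }
    }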

128 citations

Proceedings ArticleDOI
24 Aug 2014
TL;DR: The asymmetric scheduling algorithm uses low-overhead online profiling to automatically partition the work of data-parallel kernels between the CPU and GPU without input from application developers, underscoring the feasibility of online profile-based heterogeneous scheduling on integrated CPU-GPU processors.
Abstract: Many processors today integrate a CPU and GPU on the same die, which allows them to share resources like physical memory and lowers the cost of CPU-GPU communication. As a consequence, programmers can effectively utilize both the CPU and GPU to execute a single application. This paper presents novel adaptive scheduling techniques for integrated CPU-GPU processors. We present two online profiling-based scheduling algorithms: naive and asymmetric. Our asymmetric scheduling algorithm uses low-overhead online profiling to automatically partition the work of data-parallel kernels between the CPU and GPU without input from application developers. It profiles on the CPU and GPU in a way that does not penalize GPU-centric workloads that run significantly faster on the GPU. It adapts to application characteristics by addressing: 1) load imbalance due to irregularity caused by, e.g., data-dependent control flow, 2) differing amounts of work across kernel calls, and 3) multiple kernels with different characteristics. Unlike many existing approaches, which primarily target NVIDIA discrete GPUs, our scheduling algorithm does not require offline processing. We evaluate our asymmetric scheduling algorithm on a desktop system with an Intel 4th generation Core processor using a set of sixteen regular and irregular workloads from diverse application areas. On average, our asymmetric scheduling algorithm performs within 3.2% of the maximum throughput achieved by a perfect CPU-and-GPU oracle that always chooses the ideal work partitioning between the CPU and GPU. These results underscore the feasibility of online profile-based heterogeneous scheduling on integrated CPU-GPU processors.
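
Purely as an illustration of online profiling-based partitioning (the Kernel abstraction and function names here are hypothetical, not an API from the paper): profile a small slice of the iteration range on each device, then split the remaining iterations in proportion to measured throughput.

    #include <chrono>
    #include <functional>

    // A kernel is anything that can process an iteration range [begin, end).
    using Kernel = std::function<void(int begin, int end)>;

    // Measure iterations per second on a probe slice.
    double throughput(const Kernel &k, int begin, int end) {
        auto t0 = std::chrono::steady_clock::now();
        k(begin, end);
        std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
        return (end - begin) / dt.count();
    }

    // Profile a slice on each device, then partition the rest proportionally.
    void schedule(const Kernel &cpu, const Kernel &gpu, int n, int probe) {
        double tc = throughput(cpu, 0, probe);           // CPU probe slice
        double tg = throughput(gpu, probe, 2 * probe);   // GPU probe slice
        int split = 2 * probe
                  + static_cast<int>((n - 2 * probe) * tc / (tc + tg));
        cpu(2 * probe, split);   // in practice these two calls run concurrently
        gpu(split, n);
    }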

106 citations

Journal ArticleDOI
04 Jun 2011
TL;DR: Kremlin is examined: an automatic tool that, given a serial version of a program, recommends to the user which regions of the program to attack first; the paper introduces a novel hierarchical critical path analysis and develops a new metric for estimating the potential of parallelizing a region: self-parallelism.
Abstract: Many recent parallelization tools lower the barrier for parallelizing a program, but overlook one of the first questions that a programmer needs to answer: which parts of the program should I spend time parallelizing? This paper examines Kremlin, an automatic tool that, given a serial version of a program, will make recommendations to the user as to what regions (e.g. loops or functions) of the program to attack first. Kremlin introduces a novel hierarchical critical path analysis and develops a new metric for estimating the potential of parallelizing a region: self-parallelism. We further introduce the concept of a parallelism planner, which provides a ranked order of specific regions to the programmer that are likely to have the largest performance impact when parallelized. Kremlin supports multiple planner personalities, which allow the planner to more effectively target a particular programming environment or class of machine. We demonstrate the effectiveness of one such personality, an OpenMP planner, by comparing versions of programs that are parallelized according to Kremlin's plan against third-party manually parallelized versions. The results show that Kremlin's OpenMP planner is highly effective, producing plans whose performance is typically comparable to, and sometimes much better than, manual parallelization. At the same time, these plans would require that the user parallelize significantly fewer regions of the program.
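
Kremlin's actual implementation works on dynamic traces; purely as an illustration of the underlying quantity, the sketch below computes work, critical-path length, and their ratio (average parallelism) for a region modeled as a DAG. Self-parallelism refines this idea by discounting parallelism already credited to nested regions.

    #include <vector>
    #include <algorithm>

    // A DAG node with an operation cost and predecessor indices.
    struct Node { long long cost; std::vector<int> preds; };

    // Nodes must be in topological order. Returns work / critical path,
    // an upper bound on the speedup parallelizing this region can yield.
    double region_parallelism(const std::vector<Node> &dag) {
        long long work = 0;
        std::vector<long long> finish(dag.size(), 0);
        for (size_t v = 0; v < dag.size(); ++v) {
            work += dag[v].cost;
            long long ready = 0;    // earliest start: after all predecessors
            for (int u : dag[v].preds) ready = std::max(ready, finish[u]);
            finish[v] = ready + dag[v].cost;
        }
        long long cp = 0;           // critical-path length
        for (long long f : finish) cp = std::max(cp, f);
        return cp ? static_cast<double>(work) / cp : 0.0;
    }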

106 citations

References
Book
01 Jan 1990
TL;DR: The updated new edition of the classic Introduction to Algorithms is intended primarily for use in undergraduate or graduate courses in algorithms or data structures and presents a rich variety of algorithms and covers them in considerable depth while making their design and analysis accessible to all levels of readers.
Abstract: From the Publisher: The updated new edition of the classic Introduction to Algorithms is intended primarily for use in undergraduate or graduate courses in algorithms or data structures. Like the first edition, this text can also be used for self-study by technical professionals since it discusses engineering issues in algorithm design as well as the mathematical aspects. In its new edition, Introduction to Algorithms continues to provide a comprehensive introduction to the modern study of algorithms. The revision has been updated to reflect changes in the years since the book's original publication. New chapters on the role of algorithms in computing and on probabilistic analysis and randomized algorithms have been included. Sections throughout the book have been rewritten for increased clarity, and material has been added wherever a fuller explanation has seemed useful or new information warrants expanded coverage. As in the classic first edition, this new edition of Introduction to Algorithms presents a rich variety of algorithms and covers them in considerable depth while making their design and analysis accessible to all levels of readers. Further, the algorithms are presented in pseudocode to make the book easily accessible to students from all programming language backgrounds. Each chapter presents an algorithm, a design technique, an application area, or a related topic. The chapters are not dependent on one another, so the instructor can organize his or her use of the book in the way that best suits the course's needs. Additionally, the new edition offers a 25% increase over the first edition in the number of problems, giving the book 155 problems and over 900 exercises that reinforce the concepts the students are learning.

21,651 citations

01 Jan 2005

19,250 citations

Book
Bjarne Stroustrup
01 Jan 1985
TL;DR: Bjarne Stroustrup makes C++ even more accessible to those new to the language, while adding advanced information and techniques that even expert C++ programmers will find invaluable.
Abstract: From the Publisher: Written by Bjarne Stroustrup, the creator of C++, this is the world's most trusted and widely read book on C++. For this special hardcover edition, two new appendixes on locales and standard library exception safety have been added. The result is complete, authoritative coverage of the C++ language, its standard library, and key design techniques. Based on the ANSI/ISO C++ standard, The C++ Programming Language provides current and comprehensive coverage of all C++ language features and standard library components. For example: abstract classes as interfaces; class hierarchies for object-oriented programming; templates as the basis for type-safe generic software; exceptions for regular error handling; namespaces for modularity in large-scale software; run-time type identification for loosely coupled systems; the C subset of C++ for C compatibility and system-level work; standard containers and algorithms; standard strings, I/O streams, and numerics; C compatibility, internationalization, and exception safety. Bjarne Stroustrup makes C++ even more accessible to those new to the language, while adding advanced information and techniques that even expert C++ programmers will find invaluable.
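
For concreteness, a tiny illustrative sketch (names invented here, not from the book) of two items on that list: an abstract class as an interface, and a template as the basis for type-safe generic code.

    #include <memory>
    #include <vector>

    struct Shape {                                   // abstract class as interface
        virtual double area() const = 0;
        virtual ~Shape() = default;
    };

    struct Square : Shape {
        explicit Square(double s) : side(s) {}
        double area() const override { return side * side; }
        double side;
    };

    template <typename Container>                    // type-safe generic algorithm
    double total_area(const Container &shapes) {
        double sum = 0.0;
        for (const auto &s : shapes) sum += s->area();
        return sum;
    }

    // Usage: std::vector<std::unique_ptr<Shape>> v;
    //        v.push_back(std::make_unique<Square>(2.0));
    //        total_area(v) == 4.0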

6,795 citations

Journal ArticleDOI
12 Jun 2005
TL;DR: Pin's goals are to provide easy-to-use, portable, transparent, and efficient instrumentation; to illustrate Pin's versatility, two Pintools in daily use to analyze production software are described.
Abstract: Robust and powerful software instrumentation tools are essential for program analysis tasks such as profiling, performance evaluation, and bug detection. To meet this need, we have developed a new instrumentation system called Pin. Our goals are to provide easy-to-use, portable, transparent, and efficient instrumentation. Instrumentation tools (called Pintools) are written in C/C++ using Pin's rich API. Pin follows the model of ATOM, allowing the tool writer to analyze an application at the instruction level without the need for detailed knowledge of the underlying instruction set. The API is designed to be architecture independent whenever possible, making Pintools source compatible across different architectures. However, a Pintool can access architecture-specific details when necessary. Instrumentation with Pin is mostly transparent as the application and Pintool observe the application's original, uninstrumented behavior. Pin uses dynamic compilation to instrument executables while they are running. For efficiency, Pin uses several techniques, including inlining, register re-allocation, liveness analysis, and instruction scheduling to optimize instrumentation. This fully automated approach delivers significantly better instrumentation performance than similar tools. For example, Pin is 3.3x faster than Valgrind and 2x faster than DynamoRIO for basic-block counting. To illustrate Pin's versatility, we describe two Pintools in daily use to analyze production software. Pin is publicly available for Linux platforms on four architectures: IA32 (32-bit x86), EM64T (64-bit x86), Itanium®, and ARM. In the ten months since Pin 2 was released in July 2004, there have been over 3000 downloads from its website.
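
A sketch of the canonical instruction-counting Pintool from Pin's tutorial material; the API names below (PIN_Init, INS_AddInstrumentFunction, INS_InsertCall) are from the Pin 2 era and details may differ across versions:

    #include "pin.H"
    #include <iostream>

    // Counts every executed instruction by inserting a call before each one.
    static UINT64 icount = 0;

    static VOID docount() { icount++; }

    // Instrumentation routine: called once per instruction as Pin compiles it.
    static VOID Instruction(INS ins, VOID *v) {
        INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)docount, IARG_END);
    }

    static VOID Fini(INT32 code, VOID *v) {
        std::cerr << "Executed " << icount << " instructions" << std::endl;
    }

    int main(int argc, char *argv[]) {
        if (PIN_Init(argc, argv)) return 1;      // parse Pin's command line
        INS_AddInstrumentFunction(Instruction, 0);
        PIN_AddFiniFunction(Fini, 0);
        PIN_StartProgram();                      // never returns
        return 0;
    }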

4,019 citations

Proceedings ArticleDOI
Gene Myron Amdahl
18 Apr 1967
TL;DR: In this paper, the authors argue that the organization of a single computer has reached its limits and that truly significant advances can be made only by interconnection of a multiplicity of computers in such a manner as to permit cooperative solution.
Abstract: For over a decade prophets have voiced the contention that the organization of a single computer has reached its limits and that truly significant advances can be made only by interconnection of a multiplicity of computers in such a manner as to permit cooperative solution. Variously the proper direction has been pointed out as general purpose computers with a generalized interconnection of memories, or as specialized computers with geometrically related memory interconnections and controlled by one or more instruction streams.
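
The limit argued for here was later distilled into what is now called Amdahl's law; the formula below is that standard distillation, not text from the paper. For a program whose parallelizable fraction is p, the speedup on N processors is

    \[
      S(N) = \frac{1}{(1 - p) + p/N},
      \qquad
      \lim_{N \to \infty} S(N) = \frac{1}{1 - p}.
    \]

For example, with p = 0.95 and N = 64, S = 1/(0.05 + 0.95/64) ≈ 15.4, against an asymptotic ceiling of 1/0.05 = 20 no matter how many processors are added.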

3,653 citations