Proceedings ArticleDOI

Optimizing recursive task parallel programs

TL;DR: A new optimization, DECAF, is presented that optimizes recursive task parallel (RTP) programs by reducing task creation and termination overheads; it also extends the traditional loop chunking technique to perform load-balanced chunking, at runtime, based on the number of available worker threads.
Abstract: We present DECAF, a new optimization that optimizes recursive task parallel (RTP) programs by reducing the task creation and termination overheads. DECAF reduces the task termination (join) operations by aggressively increasing the scope of join operations (in a semantics-preserving way) and eliminating the redundant join operations discovered along the way. Further, DECAF extends the traditional loop chunking technique to perform load-balanced chunking, at runtime, based on the number of available worker threads. This helps reduce the redundant parallel tasks at different levels of recursion. We also discuss the impact of exceptions on our techniques and extend them to handle RTP programs that may throw exceptions. We implemented DECAF in the X10 v2.3 compiler and tested it over a set of benchmark kernels on two different hardware platforms (a 16-core Intel system and a 64-core AMD system). With respect to the base X10 compiler extended with the loop chunking of Nandivada et al. [26] (LC), DECAF achieved geometric-mean speedups of 2.14× and 2.53× on the Intel and AMD systems, respectively. We also present an evaluation of energy consumption on the Intel system and show that, on average, the DECAF versions consume 7.12% less energy than the LC versions.
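To make the setting concrete, here is a minimal sketch of the kind of RTP kernel the paper targets, written with Java's fork/join framework as a stand-in for X10's async/finish (the class name and cutoff constant are illustrative, not from the paper). Every fork() is a task creation and every join() a task termination; DECAF's transformations aim to reduce how many of these the program executes.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical RTP kernel (illustrative names): recursive array sum.
// Each fork() creates a task (cf. X10's async); each join() is a task
// termination (cf. the implicit joins of X10's finish).
class Sum extends RecursiveTask<Long> {
    static final int CUTOFF = 1 << 12;  // sequential threshold (assumed, not from the paper)
    final long[] a; final int lo, hi;
    Sum(long[] a, int lo, int hi) { this.a = a; this.lo = lo; this.hi = hi; }

    @Override protected Long compute() {
        if (hi - lo <= CUTOFF) {                 // base case: run sequentially, no task overhead
            long s = 0;
            for (int i = lo; i < hi; i++) s += a[i];
            return s;
        }
        int mid = (lo + hi) >>> 1;
        Sum left = new Sum(a, lo, mid);
        left.fork();                             // task creation
        long right = new Sum(a, mid, hi).compute();
        return right + left.join();              // task termination (join)
    }

    public static void main(String[] args) {
        long[] data = new long[1 << 20];
        java.util.Arrays.fill(data, 1);
        long total = new ForkJoinPool().invoke(new Sum(data, 0, data.length));
        System.out.println(total);               // 1048576
    }
}
```

In this analogue, DECAF-style load-balanced chunking would correspond to bounding the number of outstanding forks by the worker count discovered at runtime, rather than relying on a hand-tuned CUTOFF.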
Citations
01 Jun 1990
TL;DR: Mul-T as discussed by the authors is a parallel Lisp system based on Multilisp's future construct that has been developed to run on an Encore Multimax multiprocessor.
Abstract: Mul-T is a parallel Lisp system, based on Multilisp's future construct, that has been developed to run on an Encore Multimax multiprocessor. Mul-T is an extended version of the Yale T system and uses the T system's ORBIT compiler to achieve “production quality” performance on stock hardware — about 100 times faster than Multilisp. Mul-T shows that futures can be implemented cheaply enough to be useful in a production-quality system. Mul-T is fully operational, including a user interface that supports managing groups of parallel tasks.
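For readers unfamiliar with Multilisp-style futures, a rough Java analogue is sketched below (not Mul-T code; Mul-T is a Lisp system): supplyAsync plays the role of (future e), starting the computation eagerly, and join() corresponds to touching the future.

```java
import java.util.concurrent.CompletableFuture;

public class Futures {
    public static void main(String[] args) {
        // Rough Java analogue of Multilisp's (future expr): start both
        // computations eagerly, touch the results later with join().
        CompletableFuture<Integer> x = CompletableFuture.supplyAsync(() -> fib(30));
        CompletableFuture<Integer> y = CompletableFuture.supplyAsync(() -> fib(31));
        System.out.println(x.join() + y.join()); // "touching" the futures
    }
    static int fib(int n) { return n < 2 ? n : fib(n - 1) + fib(n - 2); }
}
```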

153 citations

Proceedings ArticleDOI
08 Jun 2019
TL;DR: This paper proposes TaskProf2, a parallelism profiler and an adviser for task parallel programs that uses a performance model that captures series-parallel relationships between various dynamic execution fragments of tasks and includes fine-grained measurement of computation in those fragments.
Abstract: This paper proposes TaskProf2, a parallelism profiler and an adviser for task parallel programs. As a parallelism profiler, TaskProf2 pinpoints regions with serialization bottlenecks, scheduling overheads, and secondary effects of execution. As an adviser, TaskProf2 identifies regions that matter in improving parallelism. To accomplish these objectives, it uses a performance model that captures series-parallel relationships between various dynamic execution fragments of tasks and includes fine-grained measurement of computation in those fragments. Using this performance model, TaskProf2’s what-if analyses identify regions that improve the parallelism of the program while considering tasking overheads. Its differential analyses perform fine-grained differencing of an oracle and the observed performance model to identify static regions experiencing secondary effects. We have used TaskProf2 to identify regions with serialization bottlenecks and secondary effects in many applications.
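A hedged sketch of the series-parallel performance model the abstract describes (the class names are invented here, not TaskProf2's API): work adds up across all fragments, while span adds only along series composition and takes the maximum across parallel composition, so work/span bounds the achievable parallelism of a region.

```java
import java.util.List;

// Minimal sketch (not TaskProf2's actual data structure): a
// series-parallel tree over execution fragments. Work is total
// computation; span is the critical path, which caps speedup.
abstract class SPNode {
    abstract long work();
    abstract long span();
}
class Leaf extends SPNode {            // a measured execution fragment
    final long cost;
    Leaf(long cost) { this.cost = cost; }
    long work() { return cost; }
    long span() { return cost; }
}
class Series extends SPNode {          // fragments that run one after another
    final List<SPNode> kids;
    Series(List<SPNode> kids) { this.kids = kids; }
    long work() { return kids.stream().mapToLong(SPNode::work).sum(); }
    long span() { return kids.stream().mapToLong(SPNode::span).sum(); }
}
class Parallel extends SPNode {        // fragments that may run concurrently
    final List<SPNode> kids;
    Parallel(List<SPNode> kids) { this.kids = kids; }
    long work() { return kids.stream().mapToLong(SPNode::work).sum(); }
    long span() { return kids.stream().mapToLong(SPNode::span).max().orElse(0); }
}
// Parallelism of a region = work / span; a what-if analysis re-runs
// this query after hypothetically parallelizing a chosen subtree.
```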

10 citations

Book ChapterDOI
05 Jul 2020
TL;DR: Rec2Poly as mentioned in this paper detects automatically if recursive programs may be transformed into affine loops that are compliant with the polyhedral model, and then the replacing loops can then take advantage of advanced loop optimizing and parallelizing transformations such as tiling or skewing.
Abstract: In this paper, we propose Rec2Poly, a framework which automatically detects whether recursive programs may be transformed into affine loops that are compliant with the polyhedral model. If successful, the replacing loops can then take advantage of advanced loop optimizing and parallelizing transformations such as tiling or skewing.
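The core idea can be illustrated with a toy example (mine, not Rec2Poly's output): a recursion whose control flow and memory accesses are affine admits a semantically equivalent affine loop, which polyhedral transformations such as tiling can then target.

```java
// Illustrative only: a recursion whose control flow and accesses are
// affine in i, and the affine loop a framework like Rec2Poly could
// replace it with, opening the door to polyhedral transformations.
class Affine {
    static void initRec(int[] a, int i) {      // recursive original
        if (i >= a.length) return;
        a[i] = 2 * i + 1;                      // affine access a[i], affine value
        initRec(a, i + 1);                     // affine recursion: i -> i + 1
    }
    static void initLoop(int[] a) {            // equivalent affine loop nest
        for (int i = 0; i < a.length; i++)
            a[i] = 2 * i + 1;
    }
}
```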

3 citations


Cites methods from "Optimizing recursive task parallel ..."

  • ...[6] propose an approach to optimize recursive task parallel programs through lessening task creation and termination overhead....


Proceedings Article
23 Jan 2019
TL;DR: Early executions of a recursive program are analyzed using a Nested Loop Recognition algorithm, performing the affine loop modeling of the original program runtime behavior, which is then used to generate an equivalent iterative program, finally optimized using the polyhedral compiler Polly.
Abstract: There may be a huge gap between the statements outlined by programmers in a program's source code and the instructions that are actually performed by a given processor architecture when running the executable code. This gap is due to the way the input code has been interpreted, translated, and transformed by the compiler and the final processor hardware. Thus, there is an opportunity for efficient optimization strategies, dedicated to specific control structures and memory access patterns, to apply as soon as the actual runtime behavior has been discovered, even if they could not have been applied on the original source code. In this paper, we develop this idea by identifying code extracts that behave as polyhedral-compliant loops at runtime, while not having been outlined as loops at all in the original source code. In particular, we are interested in recursive functions whose runtime behavior can be modeled as polyhedral loops. Therefore, the scope of this study exclusively includes recursive functions whose control flow and memory accesses exhibit an affine behavior, which means that there exists a semantically equivalent affine loop nest, a candidate for polyhedral optimizations. Accordingly, our approach is based on analyzing early executions of a recursive program using a Nested Loop Recognition (NLR) algorithm, performing the affine loop modeling of the original program's runtime behavior, which is then used to generate an equivalent iterative program, finally optimized using the polyhedral compiler Polly. We present some preliminary results showing that this approach brings recursion optimization techniques to a higher level, in addition to widening the scope of the polyhedral model to include originally non-loop programs.
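The trace-based flavor of this approach can be sketched as follows (a toy illustration, not the actual NLR algorithm): instrument early runs to record the argument of each recursive call, then check whether the trace fits an affine progression before committing to a loop rewrite.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of the trace-based idea: log the argument of each
// recursive call during an early run, then check whether the trace
// fits the affine model i_k = i_0 + k*stride. If so, the recursion
// is a candidate for rewriting as an (optimizable) iterative loop.
class TraceCheck {
    static final List<Integer> trace = new ArrayList<>();

    static void rec(int i, int n) {
        if (i >= n) return;
        trace.add(i);                  // instrumentation for early runs
        rec(i + 2, n);
    }

    static boolean isAffine(List<Integer> t) {
        if (t.size() < 2) return true;
        int stride = t.get(1) - t.get(0);
        for (int k = 1; k < t.size(); k++)
            if (t.get(k) - t.get(k - 1) != stride) return false;
        return true;
    }

    public static void main(String[] args) {
        rec(0, 20);
        System.out.println(isAffine(trace)); // true: i_k = 0 + 2k
    }
}
```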

3 citations


Cites methods from "Optimizing recursive task parallel ..."

  • ...DECAF [10] is a technique to optimize recursive task parallel programs by reducing the task creation and termination overheads....


References
Book
01 Jan 1997
TL;DR: Advanced Compiler Design and Implementation, Steven Muchnick's reference text on optimizing compilers.
Abstract: Advanced Compiler Design and Implementation by Steven Muchnick Preface 1 Introduction to Advanced Topics 1.1 Review of Compiler Structure 1.2 Advanced Issues in Elementary Topics 1.3 The Importance of Code Optimization 1.4 Structure of Optimizing Compilers 1.5 Placement of Optimizations in Aggressive Optimizing Compilers 1.6 Reading Flow Among the Chapters 1.7 Related Topics Not Covered in This Text 1.8 Target Machines Used in Examples 1.9 Number Notations and Data Sizes 1.10 Wrap-Up 1.11 Further Reading 1.12 Exercises 2 Informal Compiler Algorithm Notation (ICAN) 2.1 Extended Backus-Naur Form Syntax Notation 2.2 Introduction to ICAN 2.3 A Quick Overview of ICAN 2.4 Whole Programs 2.5 Type Definitions 2.6 Declarations 2.7 Data Types and Expressions 2.8 Statements 2.9 Wrap-Up 2.10 Further Reading 2.11 Exercises 3 Symbol-Table Structure 3.1 Storage Classes, Visibility, and Lifetimes 3.2 Symbol Attributes and Symbol-Table Entries 3.3 Local Symbol-Table Management 3.4 Global Symbol-Table Structure 3.5 Storage Binding and Symbolic Registers 3.6 Approaches to Generating Loads and Stores 3.7 Wrap-Up 3.8 Further Reading 3.9 Exercises 4 Intermediate Representations 4.1 Issues in Designing an Intermediate Language 4.2 High-Level Intermediate Languages 4.3 Medium-Level Intermediate Languages 4.4 Low-Level Intermediate Languages 4.5 Multi-Level Intermediate Languages 4.6 Our Intermediate Languages: MIR, HIR, and LIR 4.7 Representing MIR, HIR, and LIR in ICAN 4.8 ICAN Naming of Data Structures and Routines that Manipulate Intermediate Code 4.9 Other Intermediate-Language Forms 4.10 Wrap-Up 4.11 Further Reading 4.12 Exercises 5 Run-Time Support 5.1 Data Representations and Instructions 5.2 Register Usage 5.3 The Local Stack Frame 5.4 The Run-Time Stack 5.5 Parameter-Passing Disciplines 5.6 Procedure Prologues, Epilogues, Calls, and Returns 5.7 Code Sharing and Position-Independent Code 5.8 Symbolic and Polymorphic Language Support 5.9 Wrap-Up 5.10 Further Reading 5.11 Exercises 6 Producing Code Generators Automatically 6.1 Introduction to Automatic Generation of Code Generators 6.2 A Syntax-Directed Technique 6.3 Introduction to Semantics-Directed Parsing 6.4 Tree Pattern Matching and Dynamic Programming 6.5 Wrap-Up 6.6 Further Reading 6.7 Exercises 7 Control-Flow Analysis 7.1 Approaches to Control-Flow Analysis 7.2 Depth-First Search, Preorder Traversal, Postorder Traversal, and Breadth-First Search 7.3 Dominators 7.4 Loops and Strongly Connected Components 7.5 Reducibility 7.6 Interval Analysis and Control Trees 7.7 Structural Analysis 7.8 Wrap-Up 7.9 Further Reading 7.10 Exercises 8 Data-Flow Analysis 8.1 An Example: Reaching Definitions 8.2 Basic Concepts: Lattices, Flow Functions, and Fixed Points 8.3 Taxonomy of Data-Flow Problems and Solution Methods 8.4 Iterative Data-Flow Analysis 8.5 Lattices of Flow Functions 8.6 Control-Tree-Based Data-Flow Analysis 8.7 Structural Analysis 8.8 Interval Analysis 8.9 Other Approaches 8.10 Du-Chains, Ud-Chains, and Webs 8.11 Static Single-Assignment (SSA) Form 8.12 Dealing with Arrays, Structures, and Pointers 8.13 Automating Construction of Data-Flow Analyzers 8.14 More Ambitious Analyses 8.15 Wrap-Up 8.16 Further Reading 8.17 Exercises 9 Dependence Analysis and Dependence Graph 9.1 Dependence Relations 9.2 Basic-Block Dependence DAGs 9.3 Dependences in Loops 9.4 Dependence Testing 9.5 Program-Dependence Graphs 9.6 Dependences Between Dynamically Allocated Objects 9.7 Wrap-Up 9.8 Further Reading 9.9 Exercises 10 Alias Analysis 10.1 Aliases in Various 
Real Programming Languages 10.2 The Alias Gatherer 10.3 The Alias Propagator 10.4 Wrap-Up 10.5 Further Reading 10.6 Exercises 11 Introduction to Optimization 11.1 Global Optimizations Discussed in Chapters 12 Through 18 11.2 Flow Sensitivity and May vs. Must Information 11.3 Importance of Individual Optimizations 11.4 Order and Repetition of Optimizations 11.5 Further Reading 11.6 Exercises 12 Early Optimizations 12.1 Constant-Expression Evaluation (Constant Folding) 12.2 Scalar Replacement of Aggregates 12.3 Algebraic Simplifications and Reassociation 12.4 Value Numbering 12.5 Copy Propagation 12.6 Sparse Conditional Constant Propagation 12.7 Wrap-Up 12.8 Further Reading 12.9 Exercises 13 Redundancy Elimination 13.1 Common-Subexpression Elimination 13.2 Loop-Invariant Code Motion 13.3 Partial-Redundancy Elimination 13.4 Redundancy Elimination and Reassociation 13.5 Code Hoisting 13.6 Wrap-Up 13.7 Further Reading 13.8 Exercises 14 Loop Optimizations 14.1 Induction-Variable Optimizations 14.2 Unnecessary Bounds-Checking Elimination 14.3 Wrap-Up 14.4 Further Reading 14.5 Exercises 15 Procedure Optimizations 15.1 Tail-Call Optimization and Tail-Recursion Elimination 15.2 Procedure Integration 15.3 In-Line Expansion 15.4 Leaf-Routine Optimization and Shrink Wrapping 15.5 Wrap-Up 15.6 Further Reading 15.7 Exercises 16 Register Allocation 16.1 Register Allocation and Assignment 16.2 Local Methods 16.3 Graph Coloring 16.4 Priority-Based Graph Coloring 16.5 Other Approaches to Register Allocation 16.6 Wrap-Up 16.7 Further Reading 16.8 Exercises 17 Code Scheduling 17.1 Instruction Scheduling 17.2 Speculative Loads and Boosting 17.3 Speculative Scheduling 17.4 Software Pipelining 17.5 Trace Scheduling 17.6 Percolation Scheduling 17.7 Wrap-Up 17.8 Further Reading 17.9 Exercises 18 Control-Flow and Low-Level Optimizations 18.1 Unreachable-Code Elimination 18.2 Straightening 18.3 If Simplifications 18.4 Loop Simplifications 18.5 Loop Inversion 18.6 Unswitching 18.7 Branch Optimizations 18.8 Tail Merging or Cross Jumping 18.9 Conditional Moves 18.10 Dead-Code Elimination 18.11 Branch Prediction 18.12 Machine Idioms and Instruction Combining 18.13 Wrap-Up 18.14 Further Reading 18.15 Exercises 19 Interprocedural Analysis and Optimization 19.1 Interprocedural Control-Flow Analysis: The Call Graph 19.2 Interprocedural Data-Flow Analysis 19.3 Interprocedural Constant Propagation 19.4 Interprocedural Alias Analysis 19.5 Interprocedural Optimizations 19.6 Interprocedural Register Allocation 19.7 Aggregation of Global References 19.8 Other Issues in Interprocedural Program Management 19.9 Wrap-Up 19.10 Further Reading 19.11 Exercises 20 Optimization of the Memory Hierarchy 20.1 Impact of Data and Instruction Caches 20.2 Instruction-Cache Optimization 20.3 Scalar Replacement of Array Elements 20.4 Data-Cache Optimization 20.5 Scalar vs. 
Memory-Oriented Optimizations 20.6 Wrap-Up 20.7 Further Reading 20.8 Exercises 21 Case Studies of Compilers and Future Trends 21.1 the Sun Compilers for SPARC 21.2 The IBM XL Compilers for the POWER and PowerPC Architectures 21.3 Digital Equipment's Compilers for Alpha 21.4 The Intel Reference Compilers for the Intel 386 Architecture 21.5 Future Trends in Compiler Design and Implementation 21.6 Further Reading A Guide to Assembly Languages Used in This Book A.1 Sun SPARC Versions 8 and 9 Assembly Language A.2 IBM POWER and PowerPC Assembly Language A.3 DEC Alpha Assembly Language A.4 Intel 386 Architecture Assembly Language A.5 Hewlett-Packard's PA-RISC Assembly Language B Representation of Sets, Sequences, Trees, DAGs, and Functions B.1 Representation of Sets B.2 Representation of Sequences B.3 Representation of Trees and DAGs B.4 Representation of Functions B.5 Further Reading C Software Resources View Appendix C with live links to download sites C.1 Finding and Accessing Software on the Internet C.2 Machine Simulators C.3 Compilers C.4 Code-Generator Generators: BURG and IBURG C.5 Profiling Tools Bibliography Indices

2,482 citations


Additional excerpts

  • ...It starts by invoking LC on parallel loops in canonical form [24]....


Proceedings ArticleDOI
01 May 1998
TL;DR: Cilk-5's novel "two-clone" compilation strategy and its Dijkstra-like mutual-exclusion protocol for implementing the ready deque in the work-stealing scheduler are presented.
Abstract: The fifth release of the multithreaded language Cilk uses a provably good "work-stealing" scheduling algorithm similar to the first system, but the language has been completely redesigned and the runtime system completely reengineered. The efficiency of the new implementation was aided by a clear strategy that arose from a theoretical analysis of the scheduling algorithm: concentrate on minimizing overheads that contribute to the work, even at the expense of overheads that contribute to the critical path. Although it may seem counterintuitive to move overheads onto the critical path, this "work-first" principle has led to a portable Cilk-5 implementation in which the typical cost of spawning a parallel thread is only between 2 and 6 times the cost of a C function call on a variety of contemporary machines. Many Cilk programs run on one processor with virtually no degradation compared to equivalent C programs. This paper describes how the work-first principle was exploited in the design of Cilk-5's compiler and its runtime system. In particular, we present Cilk-5's novel "two-clone" compilation strategy and its Dijkstra-like mutual-exclusion protocol for implementing the ready deque in the work-stealing scheduler.
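The ready-deque discipline the abstract refers to can be sketched as follows (a conceptual illustration using a Java library deque; Cilk-5's actual implementation is the lock-free, Dijkstra-like THE protocol): the owner pushes and pops at one end so the common case stays cheap, in line with the work-first principle, while thieves steal from the opposite end.

```java
import java.util.concurrent.ConcurrentLinkedDeque;

// Conceptual sketch of a work-stealing ready deque (not Cilk-5's THE
// protocol): the owner operates on the bottom of its own deque, so
// spawn/return overhead stays on the work rather than the critical
// path; idle workers steal the oldest task from the top.
class WorkerDeque<T> {
    private final ConcurrentLinkedDeque<T> deque = new ConcurrentLinkedDeque<>();

    void push(T task) { deque.addLast(task); }     // owner: spawn
    T pop()           { return deque.pollLast(); }  // owner: resume newest task
    T steal()         { return deque.pollFirst(); } // thief: take oldest task
}
```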

1,367 citations


"Optimizing recursive task parallel ..." refers background in this paper

  • ...Recursive Task Parallel (RTP) programs constitute an important subset of task parallel programs written in popular languages like Cilk [12], X10 [31], Chapel [7], OpenMP [28], HJ [5], and so on....


  • ...Though our results are shown in the context of X10, we believe that DECAF can be applied (with similar effect) to other task parallel languages like OpenMP, Chapel, Cilk and HJ that admit RTP programs....


  • ...Cilk [12] and TBB [30] both implement specialised mechanisms of loop scheduling at runtime, to achieve load balancing, by controlling the number of worker threads and the division of tasks among the workers....


Book
10 Oct 2001
TL;DR: A broad introduction to data dependence, to the many transformation strategies it supports, and to its applications to important optimization problems such as parallelization, compiler memory hierarchy management, and instruction scheduling are provided.
Abstract: Modern computer architectures designed with high-performance microprocessors offer tremendous potential gains in performance over previous designs. Yet their very complexity makes it increasingly difficult to produce efficient code and to realize their full potential. This landmark text from two leaders in the field focuses on the pivotal role that compilers can play in addressing this critical issue. The basis for all the methods presented in this book is data dependence, a fundamental compiler analysis tool for optimizing programs on high-performance microprocessors and parallel architectures. It enables compiler designers to write compilers that automatically transform simple, sequential programs into forms that can exploit special features of these modern architectures. The text provides a broad introduction to data dependence, to the many transformation strategies it supports, and to its applications to important optimization problems such as parallelization, compiler memory hierarchy management, and instruction scheduling. The authors demonstrate the importance and wide applicability of dependence-based compiler optimizations and give the compiler writer the basics needed to understand and implement them. They also offer cookbook explanations for transforming applications by hand to computational scientists and engineers who are driven to obtain the best possible performance of their complex applications.
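A minimal example of the data-dependence reasoning the book is built around (example mine, not from the text): the first loop below carries a flow dependence across iterations and so cannot be parallelized as written, while the second has no loop-carried dependence and parallelizes trivially.

```java
// Data-dependence illustration: dependence analysis decides which
// loop transformations (parallelization, tiling, reordering) are legal.
class Dependence {
    static void carried(int[] a) {
        for (int i = 1; i < a.length; i++)
            a[i] = a[i - 1] + 1;      // flow dependence: iteration i reads
    }                                 // the value written by iteration i-1
    static void independent(int[] a, int[] b) {
        for (int i = 0; i < a.length; i++)
            a[i] = b[i] * 2;          // iterations touch disjoint data:
    }                                 // safe to run in parallel
}
```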

1,087 citations


"Optimizing recursive task parallel ..." refers methods in this paper

  • ...Loop scheduling [21] has been one of the most popular techniques to efficiently execute loop nests....


Journal ArticleDOI
01 Aug 2007
TL;DR: A candidate list of desirable qualities for a parallel programming language is offered, and how these qualities are addressed in the design of the Chapel language is described, providing an overview of Chapel's features and how they help address parallel productivity.
Abstract: In this paper we consider productivity challenges for parallel programmers and explore ways that parallel language design might help improve end-user productivity. We offer a candidate list of desirable qualities for a parallel programming language, and describe how these qualities are addressed in the design of the Chapel language. In doing so, we provide an overview of Chapel's features and how they help address parallel productivity. We also survey current techniques for parallel programming and describe ways in which we consider them to fall short of our idealized productive programming model.

905 citations


"Optimizing recursive task parallel ..." refers background in this paper

  • ...Recursive Task Parallel (RTP) programs constitute an important subset of task parallel programs written in popular languages like Cilk [12], X10 [31], Chapel [7], OpenMP [28], HJ [5], and so on....


  • ...Though our results are shown in the context of X10, we believe that DECAF can be applied (with similar effect) to other task parallel languages like OpenMP, Chapel, Cilk and HJ that admit RTP programs....


  • ...DECAF can also be extended to other task parallel languages (such as HJ, Chapel and OpenMP) that have similar constructs for task creation and task termination operations....


Book
01 Jan 2007
TL;DR: The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications, and the information here is subject to change without notice.
Abstract: Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's Web Site. Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See Inside are trademarks of Intel Corporation in the U.S. and other countries. * Other names and brands may be claimed as the property of others. 1.14 Type atomic now allows T to be an enumeration type. Clarify zero-initialization of atomic. Default partitioner changed from simple_partitioner to auto_partitioner. Instance of task_scheduler_init is optional. Discuss cancellation and exception handling. Describe tbb_hash_compare and tbb_hasher.

889 citations