Proceedings ArticleDOI

Optimizing recursive task parallel programs

TL;DR: A new optimization, DECAF, is presented that optimizes recursive task parallel (RTP) programs by reducing task creation and termination overheads; it also extends the traditional loop chunking technique to perform load-balanced chunking, at runtime, based on the number of available worker threads.
Abstract: We present DECAF, a new optimization that optimizes recursive task parallel (RTP) programs by reducing the task creation and termination overheads. DECAF reduces the task termination (join) operations by aggressively increasing the scope of join operations (in a semantics-preserving way) and eliminating the redundant join operations discovered along the way. Further, DECAF extends the traditional loop chunking technique to perform load-balanced chunking, at runtime, based on the number of available worker threads. This helps reduce the redundant parallel tasks at different levels of recursion. We also discuss the impact of exceptions on our techniques and extend them to handle RTP programs that may throw exceptions. We implemented DECAF in the X10 v2.3 compiler and tested it over a set of benchmark kernels on two different hardware platforms (a 16-core Intel system and a 64-core AMD system). With respect to the base X10 compiler extended with the loop chunking of Nandivada et al. [26] (LC), DECAF achieved geometric-mean speedups of 2.14× and 2.53× on the Intel and AMD systems, respectively. We also present an evaluation of energy consumption on the Intel system and show that, on average, the DECAF versions consume 7.12% less energy than the LC versions.
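To make the setting concrete, here is a minimal sketch of the kind of RTP kernel the paper targets, written with Java's fork/join framework as a stand-in for X10's async/finish (the class name and cutoff constant are illustrative, not from the paper). Every fork() is a task creation and every join() a task termination; DECAF's transformations aim to reduce how many of these the program executes.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical RTP kernel (illustrative names): recursive array sum.
// Each fork() creates a task (cf. X10's async); each join() is a task
// termination (cf. the implicit joins of X10's finish).
class Sum extends RecursiveTask<Long> {
    static final int CUTOFF = 1 << 12;  // sequential threshold (assumed, not from the paper)
    final long[] a; final int lo, hi;
    Sum(long[] a, int lo, int hi) { this.a = a; this.lo = lo; this.hi = hi; }

    @Override protected Long compute() {
        if (hi - lo <= CUTOFF) {                 // base case: run sequentially, no task overhead
            long s = 0;
            for (int i = lo; i < hi; i++) s += a[i];
            return s;
        }
        int mid = (lo + hi) >>> 1;
        Sum left = new Sum(a, lo, mid);
        left.fork();                             // task creation
        long right = new Sum(a, mid, hi).compute();
        return right + left.join();              // task termination (join)
    }

    public static void main(String[] args) {
        long[] data = new long[1 << 20];
        java.util.Arrays.fill(data, 1);
        long total = new ForkJoinPool().invoke(new Sum(data, 0, data.length));
        System.out.println(total);               // 1048576
    }
}
```

In this analogue, DECAF-style load-balanced chunking would correspond to bounding the number of outstanding forks by the worker count discovered at runtime, rather than relying on a hand-tuned CUTOFF.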
Citations
01 Jun 1990
TL;DR: Mul-T as discussed by the authors is a parallel Lisp system based on Multilisp's future construct that has been developed to run on an Encore Multimax multiprocessor.
Abstract: Mul-T is a parallel Lisp system, based on Multilisp's future construct, that has been developed to run on an Encore Multimax multiprocessor. Mul-T is an extended version of the Yale T system and uses the T system's ORBIT compiler to achieve “production quality” performance on stock hardware — about 100 times faster than Multilisp. Mul-T shows that futures can be implemented cheaply enough to be useful in a production-quality system. Mul-T is fully operational, including a user interface that supports managing groups of parallel tasks.
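For readers unfamiliar with Multilisp-style futures, a rough Java analogue is sketched below (not Mul-T code; Mul-T is a Lisp system): supplyAsync plays the role of (future e), starting the computation eagerly, and join() corresponds to touching the future.

```java
import java.util.concurrent.CompletableFuture;

public class Futures {
    public static void main(String[] args) {
        // Rough Java analogue of Multilisp's (future expr): start both
        // computations eagerly, touch the results later with join().
        CompletableFuture<Integer> x = CompletableFuture.supplyAsync(() -> fib(30));
        CompletableFuture<Integer> y = CompletableFuture.supplyAsync(() -> fib(31));
        System.out.println(x.join() + y.join()); // "touching" the futures
    }
    static int fib(int n) { return n < 2 ? n : fib(n - 1) + fib(n - 2); }
}
```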

153 citations

Proceedings ArticleDOI
08 Jun 2019
TL;DR: This paper proposes TaskProf2, a parallelism profiler and an adviser for task parallel programs that uses a performance model that captures series-parallel relationships between various dynamic execution fragments of tasks and includes fine-grained measurement of computation in those fragments.
Abstract: This paper proposes TaskProf2, a parallelism profiler and an adviser for task parallel programs. As a parallelism profiler, TaskProf2 pinpoints regions with serialization bottlenecks, scheduling overheads, and secondary effects of execution. As an adviser, TaskProf2 identifies regions that matter in improving parallelism. To accomplish these objectives, it uses a performance model that captures series-parallel relationships between various dynamic execution fragments of tasks and includes fine-grained measurement of computation in those fragments. Using this performance model, TaskProf2’s what-if analyses identify regions that improve the parallelism of the program while considering tasking overheads. Its differential analyses perform fine-grained differencing of an oracle and the observed performance model to identify static regions experiencing secondary effects. We have used TaskProf2 to identify regions with serialization bottlenecks and secondary effects in many applications.
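A hedged sketch of the series-parallel performance model the abstract describes (the class names are invented here, not TaskProf2's API): work adds up across all fragments, while span adds only along series composition and takes the maximum across parallel composition, so work/span bounds the achievable parallelism of a region.

```java
import java.util.List;

// Minimal sketch (not TaskProf2's actual data structure): a
// series-parallel tree over execution fragments. Work is total
// computation; span is the critical path, which caps speedup.
abstract class SPNode {
    abstract long work();
    abstract long span();
}
class Leaf extends SPNode {            // a measured execution fragment
    final long cost;
    Leaf(long cost) { this.cost = cost; }
    long work() { return cost; }
    long span() { return cost; }
}
class Series extends SPNode {          // fragments that run one after another
    final List<SPNode> kids;
    Series(List<SPNode> kids) { this.kids = kids; }
    long work() { return kids.stream().mapToLong(SPNode::work).sum(); }
    long span() { return kids.stream().mapToLong(SPNode::span).sum(); }
}
class Parallel extends SPNode {        // fragments that may run concurrently
    final List<SPNode> kids;
    Parallel(List<SPNode> kids) { this.kids = kids; }
    long work() { return kids.stream().mapToLong(SPNode::work).sum(); }
    long span() { return kids.stream().mapToLong(SPNode::span).max().orElse(0); }
}
// Parallelism of a region = work / span; a what-if analysis re-runs
// this query after hypothetically parallelizing a chosen subtree.
```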

10 citations

Book ChapterDOI
05 Jul 2020
TL;DR: Rec2Poly as mentioned in this paper detects automatically if recursive programs may be transformed into affine loops that are compliant with the polyhedral model, and then the replacing loops can then take advantage of advanced loop optimizing and parallelizing transformations such as tiling or skewing.
Abstract: In this paper, we propose Rec2Poly, a framework which automatically detects whether recursive programs may be transformed into affine loops that are compliant with the polyhedral model. If successful, the replacing loops can then take advantage of advanced loop optimizing and parallelizing transformations such as tiling or skewing.
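The core idea can be illustrated with a toy example (mine, not Rec2Poly's output): a recursion whose control flow and memory accesses are affine admits a semantically equivalent affine loop, which polyhedral transformations such as tiling can then target.

```java
// Illustrative only: a recursion whose control flow and accesses are
// affine in i, and the affine loop a framework like Rec2Poly could
// replace it with, opening the door to polyhedral transformations.
class Affine {
    static void initRec(int[] a, int i) {      // recursive original
        if (i >= a.length) return;
        a[i] = 2 * i + 1;                      // affine access a[i], affine value
        initRec(a, i + 1);                     // affine recursion: i -> i + 1
    }
    static void initLoop(int[] a) {            // equivalent affine loop nest
        for (int i = 0; i < a.length; i++)
            a[i] = 2 * i + 1;
    }
}
```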

3 citations


Cites methods from "Optimizing recursive task parallel ..."

  • ...[6] propose an approach to optimize recursive task parallel programs through lessening task creation and termination overhead....


Proceedings Article
23 Jan 2019
TL;DR: Early executions of a recursive program are analyzed using a Nested Loop Recognition algorithm, performing the affine loop modeling of the original program runtime behavior, which is then used to generate an equivalent iterative program, finally optimized using the polyhedral compiler Polly.
Abstract: There may be a huge gap between the statements outlined by programmers in a program's source code and the instructions that are actually performed by a given processor architecture when running the executable code. This gap is due to the way the input code has been interpreted, translated, and transformed by the compiler and the final processor hardware. Thus, there is an opportunity for efficient optimization strategies, dedicated to specific control structures and memory access patterns, to apply as soon as the actual runtime behavior has been discovered, even if they could not have been applied on the original source code. In this paper, we develop this idea by identifying code extracts that behave as polyhedral-compliant loops at runtime, while not having been outlined as loops at all in the original source code. In particular, we are interested in recursive functions whose runtime behavior can be modeled as polyhedral loops. Therefore, the scope of this study exclusively includes recursive functions whose control flow and memory accesses exhibit an affine behavior, which means that there exists a semantically equivalent affine loop nest, a candidate for polyhedral optimizations. Accordingly, our approach is based on analyzing early executions of a recursive program using a Nested Loop Recognition (NLR) algorithm, performing the affine loop modeling of the original program's runtime behavior, which is then used to generate an equivalent iterative program, finally optimized using the polyhedral compiler Polly. We present some preliminary results showing that this approach brings recursion optimization techniques to a higher level, in addition to widening the scope of the polyhedral model to include originally non-loop programs.
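The trace-based flavor of this approach can be sketched as follows (a toy illustration, not the actual NLR algorithm): instrument early runs to record the argument of each recursive call, then check whether the trace fits an affine progression before committing to a loop rewrite.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of the trace-based idea: log the argument of each
// recursive call during an early run, then check whether the trace
// fits the affine model i_k = i_0 + k*stride. If so, the recursion
// is a candidate for rewriting as an (optimizable) iterative loop.
class TraceCheck {
    static final List<Integer> trace = new ArrayList<>();

    static void rec(int i, int n) {
        if (i >= n) return;
        trace.add(i);                  // instrumentation for early runs
        rec(i + 2, n);
    }

    static boolean isAffine(List<Integer> t) {
        if (t.size() < 2) return true;
        int stride = t.get(1) - t.get(0);
        for (int k = 1; k < t.size(); k++)
            if (t.get(k) - t.get(k - 1) != stride) return false;
        return true;
    }

    public static void main(String[] args) {
        rec(0, 20);
        System.out.println(isAffine(trace)); // true: i_k = 0 + 2k
    }
}
```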

3 citations


Cites methods from "Optimizing recursive task parallel ..."

  • ...DECAF [10] is a technique to optimize recursive task parallel programs by reducing the task creation and termination overheads....


References
Book
01 Jan 1997
TL;DR: Advanced Compiler Design and Implementation, Steven Muchnick's reference text on optimizing compilers.
Abstract: Advanced Compiler Design and Implementation by Steven Muchnick Preface 1 Introduction to Advanced Topics 1.1 Review of Compiler Structure 1.2 Advanced Issues in Elementary Topics 1.3 The Importance of Code Optimization 1.4 Structure of Optimizing Compilers 1.5 Placement of Optimizations in Aggressive Optimizing Compilers 1.6 Reading Flow Among the Chapters 1.7 Related Topics Not Covered in This Text 1.8 Target Machines Used in Examples 1.9 Number Notations and Data Sizes 1.10 Wrap-Up 1.11 Further Reading 1.12 Exercises 2 Informal Compiler Algorithm Notation (ICAN) 2.1 Extended Backus-Naur Form Syntax Notation 2.2 Introduction to ICAN 2.3 A Quick Overview of ICAN 2.4 Whole Programs 2.5 Type Definitions 2.6 Declarations 2.7 Data Types and Expressions 2.8 Statements 2.9 Wrap-Up 2.10 Further Reading 2.11 Exercises 3 Symbol-Table Structure 3.1 Storage Classes, Visibility, and Lifetimes 3.2 Symbol Attributes and Symbol-Table Entries 3.3 Local Symbol-Table Management 3.4 Global Symbol-Table Structure 3.5 Storage Binding and Symbolic Registers 3.6 Approaches to Generating Loads and Stores 3.7 Wrap-Up 3.8 Further Reading 3.9 Exercises 4 Intermediate Representations 4.1 Issues in Designing an Intermediate Language 4.2 High-Level Intermediate Languages 4.3 Medium-Level Intermediate Languages 4.4 Low-Level Intermediate Languages 4.5 Multi-Level Intermediate Languages 4.6 Our Intermediate Languages: MIR, HIR, and LIR 4.7 Representing MIR, HIR, and LIR in ICAN 4.8 ICAN Naming of Data Structures and Routines that Manipulate Intermediate Code 4.9 Other Intermediate-Language Forms 4.10 Wrap-Up 4.11 Further Reading 4.12 Exercises 5 Run-Time Support 5.1 Data Representations and Instructions 5.2 Register Usage 5.3 The Local Stack Frame 5.4 The Run-Time Stack 5.5 Parameter-Passing Disciplines 5.6 Procedure Prologues, Epilogues, Calls, and Returns 5.7 Code Sharing and Position-Independent Code 5.8 Symbolic and Polymorphic Language Support 5.9 Wrap-Up 5.10 Further Reading 5.11 Exercises 6 Producing Code Generators Automatically 6.1 Introduction to Automatic Generation of Code Generators 6.2 A Syntax-Directed Technique 6.3 Introduction to Semantics-Directed Parsing 6.4 Tree Pattern Matching and Dynamic Programming 6.5 Wrap-Up 6.6 Further Reading 6.7 Exercises 7 Control-Flow Analysis 7.1 Approaches to Control-Flow Analysis 7.2 Depth-First Search, Preorder Traversal, Postorder Traversal, and Breadth-First Search 7.3 Dominators 7.4 Loops and Strongly Connected Components 7.5 Reducibility 7.6 Interval Analysis and Control Trees 7.7 Structural Analysis 7.8 Wrap-Up 7.9 Further Reading 7.10 Exercises 8 Data-Flow Analysis 8.1 An Example: Reaching Definitions 8.2 Basic Concepts: Lattices, Flow Functions, and Fixed Points 8.3 Taxonomy of Data-Flow Problems and Solution Methods 8.4 Iterative Data-Flow Analysis 8.5 Lattices of Flow Functions 8.6 Control-Tree-Based Data-Flow Analysis 8.7 Structural Analysis 8.8 Interval Analysis 8.9 Other Approaches 8.10 Du-Chains, Ud-Chains, and Webs 8.11 Static Single-Assignment (SSA) Form 8.12 Dealing with Arrays, Structures, and Pointers 8.13 Automating Construction of Data-Flow Analyzers 8.14 More Ambitious Analyses 8.15 Wrap-Up 8.16 Further Reading 8.17 Exercises 9 Dependence Analysis and Dependence Graph 9.1 Dependence Relations 9.2 Basic-Block Dependence DAGs 9.3 Dependences in Loops 9.4 Dependence Testing 9.5 Program-Dependence Graphs 9.6 Dependences Between Dynamically Allocated Objects 9.7 Wrap-Up 9.8 Further Reading 9.9 Exercises 10 Alias Analysis 10.1 Aliases in Various 
Real Programming Languages 10.2 The Alias Gatherer 10.3 The Alias Propagator 10.4 Wrap-Up 10.5 Further Reading 10.6 Exercises 11 Introduction to Optimization 11.1 Global Optimizations Discussed in Chapters 12 Through 18 11.2 Flow Sensitivity and May vs. Must Information 11.3 Importance of Individual Optimizations 11.4 Order and Repetition of Optimizations 11.5 Further Reading 11.6 Exercises 12 Early Optimizations 12.1 Constant-Expression Evaluation (Constant Folding) 12.2 Scalar Replacement of Aggregates 12.3 Algebraic Simplifications and Reassociation 12.4 Value Numbering 12.5 Copy Propagation 12.6 Sparse Conditional Constant Propagation 12.7 Wrap-Up 12.8 Further Reading 12.9 Exercises 13 Redundancy Elimination 13.1 Common-Subexpression Elimination 13.2 Loop-Invariant Code Motion 13.3 Partial-Redundancy Elimination 13.4 Redundancy Elimination and Reassociation 13.5 Code Hoisting 13.6 Wrap-Up 13.7 Further Reading 13.8 Exercises 14 Loop Optimizations 14.1 Induction-Variable Optimizations 14.2 Unnecessary Bounds-Checking Elimination 14.3 Wrap-Up 14.4 Further Reading 14.5 Exercises 15 Procedure Optimizations 15.1 Tail-Call Optimization and Tail-Recursion Elimination 15.2 Procedure Integration 15.3 In-Line Expansion 15.4 Leaf-Routine Optimization and Shrink Wrapping 15.5 Wrap-Up 15.6 Further Reading 15.7 Exercises 16 Register Allocation 16.1 Register Allocation and Assignment 16.2 Local Methods 16.3 Graph Coloring 16.4 Priority-Based Graph Coloring 16.5 Other Approaches to Register Allocation 16.6 Wrap-Up 16.7 Further Reading 16.8 Exercises 17 Code Scheduling 17.1 Instruction Scheduling 17.2 Speculative Loads and Boosting 17.3 Speculative Scheduling 17.4 Software Pipelining 17.5 Trace Scheduling 17.6 Percolation Scheduling 17.7 Wrap-Up 17.8 Further Reading 17.9 Exercises 18 Control-Flow and Low-Level Optimizations 18.1 Unreachable-Code Elimination 18.2 Straightening 18.3 If Simplifications 18.4 Loop Simplifications 18.5 Loop Inversion 18.6 Unswitching 18.7 Branch Optimizations 18.8 Tail Merging or Cross Jumping 18.9 Conditional Moves 18.10 Dead-Code Elimination 18.11 Branch Prediction 18.12 Machine Idioms and Instruction Combining 18.13 Wrap-Up 18.14 Further Reading 18.15 Exercises 19 Interprocedural Analysis and Optimization 19.1 Interprocedural Control-Flow Analysis: The Call Graph 19.2 Interprocedural Data-Flow Analysis 19.3 Interprocedural Constant Propagation 19.4 Interprocedural Alias Analysis 19.5 Interprocedural Optimizations 19.6 Interprocedural Register Allocation 19.7 Aggregation of Global References 19.8 Other Issues in Interprocedural Program Management 19.9 Wrap-Up 19.10 Further Reading 19.11 Exercises 20 Optimization of the Memory Hierarchy 20.1 Impact of Data and Instruction Caches 20.2 Instruction-Cache Optimization 20.3 Scalar Replacement of Array Elements 20.4 Data-Cache Optimization 20.5 Scalar vs. 
Memory-Oriented Optimizations 20.6 Wrap-Up 20.7 Further Reading 20.8 Exercises 21 Case Studies of Compilers and Future Trends 21.1 the Sun Compilers for SPARC 21.2 The IBM XL Compilers for the POWER and PowerPC Architectures 21.3 Digital Equipment's Compilers for Alpha 21.4 The Intel Reference Compilers for the Intel 386 Architecture 21.5 Future Trends in Compiler Design and Implementation 21.6 Further Reading A Guide to Assembly Languages Used in This Book A.1 Sun SPARC Versions 8 and 9 Assembly Language A.2 IBM POWER and PowerPC Assembly Language A.3 DEC Alpha Assembly Language A.4 Intel 386 Architecture Assembly Language A.5 Hewlett-Packard's PA-RISC Assembly Language B Representation of Sets, Sequences, Trees, DAGs, and Functions B.1 Representation of Sets B.2 Representation of Sequences B.3 Representation of Trees and DAGs B.4 Representation of Functions B.5 Further Reading C Software Resources View Appendix C with live links to download sites C.1 Finding and Accessing Software on the Internet C.2 Machine Simulators C.3 Compilers C.4 Code-Generator Generators: BURG and IBURG C.5 Profiling Tools Bibliography Indices

2,482 citations


Additional excerpts

  • ...It starts by invoking LC on parallel loops in canonical form [24]....


Proceedings ArticleDOI
01 May 1998
TL;DR: Cilk-5's novel "two-clone" compilation strategy and its Dijkstra-like mutual-exclusion protocol for implementing the ready deque in the work-stealing scheduler are presented.
Abstract: The fifth release of the multithreaded language Cilk uses a provably good "work-stealing" scheduling algorithm similar to the first system, but the language has been completely redesigned and the runtime system completely reengineered. The efficiency of the new implementation was aided by a clear strategy that arose from a theoretical analysis of the scheduling algorithm: concentrate on minimizing overheads that contribute to the work, even at the expense of overheads that contribute to the critical path. Although it may seem counterintuitive to move overheads onto the critical path, this "work-first" principle has led to a portable Cilk-5 implementation in which the typical cost of spawning a parallel thread is only between 2 and 6 times the cost of a C function call on a variety of contemporary machines. Many Cilk programs run on one processor with virtually no degradation compared to equivalent C programs. This paper describes how the work-first principle was exploited in the design of Cilk-5's compiler and its runtime system. In particular, we present Cilk-5's novel "two-clone" compilation strategy and its Dijkstra-like mutual-exclusion protocol for implementing the ready deque in the work-stealing scheduler.
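The ready-deque discipline the abstract refers to can be sketched as follows (a conceptual illustration using a Java library deque; Cilk-5's actual implementation is the lock-free, Dijkstra-like THE protocol): the owner pushes and pops at one end so the common case stays cheap, in line with the work-first principle, while thieves steal from the opposite end.

```java
import java.util.concurrent.ConcurrentLinkedDeque;

// Conceptual sketch of a work-stealing ready deque (not Cilk-5's THE
// protocol): the owner operates on the bottom of its own deque, so
// spawn/return overhead stays on the work rather than the critical
// path; idle workers steal the oldest task from the top.
class WorkerDeque<T> {
    private final ConcurrentLinkedDeque<T> deque = new ConcurrentLinkedDeque<>();

    void push(T task) { deque.addLast(task); }     // owner: spawn
    T pop()           { return deque.pollLast(); }  // owner: resume newest task
    T steal()         { return deque.pollFirst(); } // thief: take oldest task
}
```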

1,367 citations


"Optimizing recursive task parallel ..." refers background in this paper

  • ...Recursive Task Parallel (RTP) programs constitute an important subset of task parallel programs written in popular languages like Cilk [12], X10 [31], Chapel [7], OpenMP [28], HJ [5], and so on....


  • ...Though our results are shown in the context of X10, we believe that DECAF can be applied (with similar effect) to other task parallel languages like OpenMP, Chapel, Cilk and HJ that admit RTP programs....


  • ...Cilk [12] and TBB [30] both implement specialised mechanisms of loop scheduling at runtime, to achieve load balancing, by controlling the number of worker threads and the division of tasks among the workers....


Book
10 Oct 2001
TL;DR: A broad introduction to data dependence, to the many transformation strategies it supports, and to its applications to important optimization problems such as parallelization, compiler memory hierarchy management, and instruction scheduling are provided.
Abstract: Modern computer architectures designed with high-performance microprocessors offer tremendous potential gains in performance over previous designs. Yet their very complexity makes it increasingly difficult to produce efficient code and to realize their full potential. This landmark text from two leaders in the field focuses on the pivotal role that compilers can play in addressing this critical issue. The basis for all the methods presented in this book is data dependence, a fundamental compiler analysis tool for optimizing programs on high-performance microprocessors and parallel architectures. It enables compiler designers to write compilers that automatically transform simple, sequential programs into forms that can exploit special features of these modern architectures. The text provides a broad introduction to data dependence, to the many transformation strategies it supports, and to its applications to important optimization problems such as parallelization, compiler memory hierarchy management, and instruction scheduling. The authors demonstrate the importance and wide applicability of dependence-based compiler optimizations and give the compiler writer the basics needed to understand and implement them. They also offer cookbook explanations for transforming applications by hand to computational scientists and engineers who are driven to obtain the best possible performance of their complex applications.
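A minimal example of the data-dependence reasoning the book is built around (example mine, not from the text): the first loop below carries a flow dependence across iterations and so cannot be parallelized as written, while the second has no loop-carried dependence and parallelizes trivially.

```java
// Data-dependence illustration: dependence analysis decides which
// loop transformations (parallelization, tiling, reordering) are legal.
class Dependence {
    static void carried(int[] a) {
        for (int i = 1; i < a.length; i++)
            a[i] = a[i - 1] + 1;      // flow dependence: iteration i reads
    }                                 // the value written by iteration i-1
    static void independent(int[] a, int[] b) {
        for (int i = 0; i < a.length; i++)
            a[i] = b[i] * 2;          // iterations touch disjoint data:
    }                                 // safe to run in parallel
}
```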

1,087 citations


"Optimizing recursive task parallel ..." refers methods in this paper

  • ...Loop scheduling [21] has been one of the most popular techniques to efficiently execute loop nests....


Journal ArticleDOI
01 Aug 2007
TL;DR: A candidate list of desirable qualities for a parallel programming language is offered, and how these qualities are addressed in the design of the Chapel language is described, providing an overview of Chapel's features and how they help address parallel productivity.
Abstract: In this paper we consider productivity challenges for parallel programmers and explore ways that parallel language design might help improve end-user productivity. We offer a candidate list of desirable qualities for a parallel programming language, and describe how these qualities are addressed in the design of the Chapel language. In doing so, we provide an overview of Chapel's features and how they help address parallel productivity. We also survey current techniques for parallel programming and describe ways in which we consider them to fall short of our idealized productive programming model.

905 citations


"Optimizing recursive task parallel ..." refers background in this paper

  • ...Recursive Task Parallel (RTP) programs constitute an important subset of task parallel programs written in popular languages like Cilk [12], X10 [31], Chapel [7], OpenMP [28], HJ [5], and so on....


  • ...Though our results are shown in the context of X10, we believe that DECAF can be applied (with similar effect) to other task parallel languages like OpenMP, Chapel, Cilk and HJ that admit RTP programs....


  • ...DECAF can also be extended to other task parallel languages (such as HJ, Chapel and OpenMP) that have similar constructs for task creation and task termination operations....


Book
01 Jan 2007
TL;DR: The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications, and the information here is subject to change without notice.
Abstract: Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's Web Site. Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See Inside are trademarks of Intel Corporation in the U.S. and other countries. * Other names and brands may be claimed as the property of others. 1.14 Type atomic now allows T to be an enumeration type. Clarify zero-initialization of atomic. Default partitioner changed from simple_partitioner to auto_partitioner. Instance of task_scheduler_init is optional. Discuss cancellation and exception handling. Describe tbb_hash_compare and tbb_hasher.

889 citations