
Showing papers by Charles E. Leiserson published in 2017


Proceedings ArticleDOI
26 Jan 2017
TL;DR: This paper explores how fork-join parallelism, as supported by concurrency platforms such as Cilk and OpenMP, can be embedded into a compiler's intermediate representation (IR) with only minor changes to its existing analyses and code transformations.
Abstract: This paper explores how fork-join parallelism, as supported by concurrency platforms such as Cilk and OpenMP, can be embedded into a compiler's intermediate representation (IR). Mainstream compilers typically treat parallel linguistic constructs as syntactic sugar for function calls into a parallel runtime. These calls prevent the compiler from performing optimizations across parallel control constructs. Remedying this situation is generally thought to require an extensive reworking of compiler analyses and code transformations to handle parallel semantics. Tapir is a compiler IR that represents logically parallel tasks asymmetrically in the program's control flow graph. Tapir allows the compiler to optimize across parallel control constructs with only minor changes to its existing analyses and code transformations. To prototype Tapir in the LLVM compiler, for example, we added or modified about 6000 lines of LLVM's 4-million-line codebase. Tapir enables LLVM's existing compiler optimizations for serial code -- including loop-invariant-code motion, common-subexpression elimination, and tail-recursion elimination -- to work with parallel control constructs such as spawning and parallel loops. Tapir also supports parallel optimizations such as loop scheduling.

67 citations
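
To make concrete the kind of optimization Tapir enables, the sketch below shows a parallel loop written in Cilk syntax (compiling it requires a Cilk-enabled compiler such as OpenCilk) whose body contains a loop-invariant subexpression. When the loop is lowered to opaque calls into a parallel runtime, the invariant computation is trapped inside an outlined function; with a parallel-aware IR, ordinary loop-invariant code motion can hoist it. The function and variable names are illustrative and do not come from the paper.

    // Illustrative only: a loop-invariant expression inside a parallel loop.
    // If the cilk_for is lowered to runtime calls before optimization, the
    // body becomes an outlined function and sqrt(...) cannot be hoisted;
    // with a parallel-aware IR, standard loop-invariant code motion applies.
    #include <cilk/cilk.h>   // cilk_for (OpenCilk / Cilk Plus)
    #include <cmath>

    void scale_by_norm(double *a, const double *x, int n) {
        cilk_for (int i = 0; i < n; ++i) {
            // Invariant with respect to i: a serial optimizer would compute
            // this once, outside the loop.
            double norm = std::sqrt(x[0] * x[0] + x[1] * x[1]);
            a[i] /= norm;
        }
    }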



Journal ArticleDOI
05 Oct 2017
TL;DR: The experimental results show that several autodiscovered algorithms significantly outperform parallel looping and tiled loop-based algorithms and are less sensitive to fluctuations of memory and bandwidth compared with their looping counterparts, and their running times and energy profiles remain relatively more stable.
Abstract: We present Autogen—an algorithm that for a wide class of dynamic programming (DP) problems automatically discovers highly efficient cache-oblivious parallel recursive divide-and-conquer algorithms from inefficient iterative descriptions of DP recurrences. Autogen analyzes the set of DP table locations accessed by the iterative algorithm when run on a DP table of small size and automatically identifies a recursive access pattern and a corresponding provably correct recursive algorithm for solving the DP recurrence. We use Autogen to autodiscover efficient algorithms for several well-known problems. Our experimental results show that several autodiscovered algorithms significantly outperform parallel looping and tiled loop-based algorithms. Also, these algorithms are less sensitive to fluctuations of memory and bandwidth compared with their looping counterparts, and their running times and energy profiles remain relatively more stable. To the best of our knowledge, Autogen is the first algorithm that can automatically discover new nontrivial divide-and-conquer algorithms.

14 citations
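
For concreteness, the following hand-written sketch (not code generated by Autogen) shows the shape of the cache-oblivious recursive divide-and-conquer algorithms the abstract refers to, applied to an LCS-style DP recurrence X[i][j] = f(X[i-1][j], X[i][j-1], X[i-1][j-1]). The names and the power-of-two size assumption are illustrative.

    // Hand-written illustration of a cache-oblivious divide-and-conquer DP
    // (longest-common-subsequence recurrence); not produced by Autogen.
    #include <algorithm>
    #include <string>
    #include <vector>

    std::string a, b;                  // inputs, indexed from 1 in the DP
    std::vector<std::vector<int>> X;   // (n+1) x (n+1); row 0 and column 0 hold 0

    // Fill the n x n sub-table whose top-left cell is (i0, j0); n is a power of two.
    void solve(int i0, int j0, int n) {
        if (n == 1) {
            X[i0][j0] = (a[i0 - 1] == b[j0 - 1])
                            ? X[i0 - 1][j0 - 1] + 1
                            : std::max(X[i0 - 1][j0], X[i0][j0 - 1]);
            return;
        }
        int h = n / 2;
        solve(i0, j0, h);          // top-left quadrant first
        solve(i0, j0 + h, h);      // top-right and bottom-left depend only on
        solve(i0 + h, j0, h);      // the top-left, so they could run in parallel
        solve(i0 + h, j0 + h, h);  // bottom-right last
    }

The quadrant ordering respects the recurrence's dependencies while keeping each recursive call's working set small, which is where the cache efficiency and the parallelism opportunities come from.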


Journal ArticleDOI
TL;DR: Surprisingly, for some benchmarks, Ztune autotuned in less time than it takes to perform the stencil computation once, whereas the autotuning time of OpenTuner was typically measured in hours or days.
Abstract: This paper explores autotuning strategies for serial divide-and-conquer stencil computations, comparing the efficacy of traditional "heuristic" autotuning with that of "pruned-exhaustive" autotuning. We present a pruned-exhaustive autotuner called Ztune that searches for optimal divide-and-conquer trees for stencil computations. Ztune uses three pruning properties—space-time equivalence, divide subsumption, and favored dimension—that greatly reduce the size of the search domain without significantly sacrificing the quality of the autotuned code. We compared the performance of Ztune with that of a state-of-the-art heuristic autotuner called OpenTuner in tuning the divide-and-conquer algorithm used in the Pochoir stencil compiler. Over a nightly run on ten application benchmarks across two machines with different hardware configurations, the Ztuned code ran 5%–12% faster on average, and the OpenTuner-tuned code ran from 9% slower to 2% faster on average, than Pochoir's default code. In the best case, the Ztuned code ran 40% faster, and the OpenTuner-tuned code ran 33% faster, than Pochoir's code. Whereas the autotuning time of Ztune for each benchmark could be measured in minutes, the autotuning time of OpenTuner needed to achieve comparable results was typically measured in hours or days. Surprisingly, for some benchmarks, Ztune autotuned in less time than it takes to perform the stencil computation once.

5 citations
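
As a rough illustration of the exhaustive-search idea that Ztune prunes, the toy harness below times a divide-and-conquer kernel at each candidate base-case cutoff and keeps the fastest. It tunes a single parameter and omits Ztune's pruning properties and its search over divide-and-conquer trees; all names here are hypothetical.

    // Toy exhaustive autotuner: pick the base-case cutoff that minimizes the
    // measured running time of a user-supplied divide-and-conquer kernel.
    #include <chrono>
    #include <functional>
    #include <limits>
    #include <vector>

    int pick_best_cutoff(const std::vector<int>& candidates,
                         const std::function<void(int)>& run_kernel) {
        double best_time = std::numeric_limits<double>::infinity();
        int best_cutoff = candidates.front();
        for (int cutoff : candidates) {
            auto start = std::chrono::steady_clock::now();
            run_kernel(cutoff);  // run the computation once at this cutoff
            std::chrono::duration<double> elapsed =
                std::chrono::steady_clock::now() - start;
            if (elapsed.count() < best_time) {
                best_time = elapsed.count();
                best_cutoff = cutoff;
            }
        }
        return best_cutoff;
    }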


Journal ArticleDOI
TL;DR: In this article, the authors studied the problem of work stealing in multithreaded computations and obtained tight upper bounds on the number of steals when the computation can be modeled by rooted trees.
Abstract: Inspired by applications in parallel computing, we analyze the setting of work stealing in multithreaded computations. We obtain tight upper bounds on the number of steals when the computation can be modeled by rooted trees. In particular, we show that if the computation with $n$ processors starts with one processor having a complete $k$-ary tree of height $h$ (and the remaining $n-1$ processors having nothing), the maximum possible number of steals is $\sum_{i=1}^n(k-1)^i\binom{h}{i}$.

2 citations
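
As a worked instance, specializing the stated bound to a complete binary tree ($k=2$) gives $\sum_{i=1}^n (2-1)^i\binom{h}{i} = \sum_{i=1}^n \binom{h}{i}$, which equals $2^h-1$ once $n \ge h$, since $\sum_{i=0}^{h}\binom{h}{i} = 2^h$ and the terms with $i>h$ vanish. This is a direct arithmetic specialization of the bound above, not an additional result from the paper.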


01 Jan 2017
TL;DR: This paper presents results of an ongoing investigation into the design and implementation of systolic Gram-Schmidt processors in 0.5-micron VHSIC for the adaptive filtering of spatial and temporal inputs in communications applications.
Abstract: High-performance architectures for adaptive filtering based on the Gram-Schmidt algorithm. Kyle A. Gallivan, Government Aerospace Systems Division, Harris Corporation, P.O. Box 94000, Melbourne, Florida 32902; Charles E. Leiserson, Laboratory for Computer Science, Massachusetts Institute of Technology, Room NE43-802, Cambridge, Massachusetts 02139.

The difficulties in designing systolic processors can be reduced by applying the architectural transformations of code motion, retiming, slowdown, coalescing, parallel/serial compromises, and partitioning to a more easily designed combinational or semisystolic form of the processor. In this paper, the use of these transformations and the attendant tradeoffs in the design of architectures for adaptive filtering based on the Gram-Schmidt algorithm are considered. A modification to the classical Gram-Schmidt algorithm which eliminates the use of division under certain assumptions is suggested. Also, size and speed statistics are given for a projected 0.5-micron VHSIC implementation of the processor.

Introduction

In recent years, systolic architectures have been proposed for many digital signal processing applications. The regular repetition of simple processing elements and the simple communication scheme make such architectures very amenable to VLSI implementation. The direct design of systolic architectures, however, can be a difficult and less than intuitive exercise. If one has a combinational or semisystolic form of the processor, the design activity can be simplified tremendously by using the architectural transformations of code motion, retiming, slowdown, coalescing, parallel/serial compromises, and partitioning [1]. The Gram-Schmidt algorithm for adaptive filtering has a well-known regular computation structure and therefore is a good candidate for systolic implementation. This algorithm has been considered in the literature as an orthogonalization preprocessor for LMS [2], as a linear predictor for temporal input [3], as a sidelobe canceller [4], and for clutter rejection [5]. Liles, Demmel, and Brennan [6] presented a detailed study of the use of the Gram-Schmidt algorithm for adaptive filtering in radar applications. Many of their observations concerning a processor to implement their Universal Adaptive Algorithm are applicable to the design of systolic systems in general. Liles, Ritchey, and Demmel [7] presented a brief paper considering a particular implementation of a sidelobe canceller based on the Universal Adaptive Algorithm. In this paper, we present some results of an ongoing investigation into the design and implementation of systolic Gram-Schmidt processors in 0.5-micron VHSIC for the adaptive filtering of spatial and temporal input in communications applications. The tradeoffs and the design process using the above architectural transformations are discussed, along with some modifications to the Gram-Schmidt algorithm itself. Observations coincident with those of Liles, Demmel, and Brennan [6] are noted and extended where appropriate.

The Algorithm

In its simplest form, the classical Gram-Schmidt algorithm operates on two complex stochastic processes x0 and x1. The algorithm produces two stochastic processes u0 = x0 and u1 = x1 - gx0, where g = E(x0x1)/E(x0x0) and E denotes expectation. It is easy to verify that the output processes are orthogonal in the mean. This algorithm can be generalized to produce n orthogonal outputs from n linearly independent inputs by cascading cells which implement the above two-input form. Figure 1 shows the result for n = 3. The horizontal communication paths are broadcasts. To see that the outputs are mutually orthogonal, note that u2 and x3 are orthogonal to u1 since they both result from the simple two-input form. Further, u3 is orthogonal to u2 for the same reason, and u3 is orthogonal to u1 since it is a linear combination of processes which are orthogonal to u1.
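
As a quick numerical check of the two-input form described above, the sketch below estimates g from real-valued samples (so the expectations reduce to plain sample means and no complex conjugation is needed) and verifies that E(u1*x0) comes out to zero. It illustrates the formula only, not the systolic processor design; all names are illustrative.

    // Two-input Gram-Schmidt with expectations replaced by sample means over
    // real-valued data: u0 = x0, u1 = x1 - g*x0, g = E(x0*x1)/E(x0*x0).
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    double mean_product(const std::vector<double>& a, const std::vector<double>& b) {
        double s = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
        return s / static_cast<double>(a.size());
    }

    int main() {
        std::vector<double> x0 = {1.0, 2.0, -1.0, 0.5};
        std::vector<double> x1 = {0.5, 1.0, 2.0, -1.5};

        double g = mean_product(x0, x1) / mean_product(x0, x0);  // g = E(x0*x1)/E(x0*x0)

        std::vector<double> u1(x0.size());
        for (std::size_t i = 0; i < x0.size(); ++i) u1[i] = x1[i] - g * x0[i];

        // E(u1*x0) should be numerically zero: the outputs are orthogonal in the mean.
        std::printf("g = %f, E(u1*x0) = %g\n", g, mean_product(u1, x0));
        return 0;
    }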