Showing papers by "Charles E. Leiserson" published in 1995


Proceedings Article
01 Aug 1995
TL;DR: This paper shows that on real and synthetic applications, the “work” and “critical path” of a Cilk computation can be used to accurately model performance, and proves that for the class of “fully strict” (well-structured) programs, the Cilk scheduler achieves space, time and communication bounds all within a constant factor of optimal.
Abstract: Cilk (pronounced “silk”) is a C-based runtime system for multi-threaded parallel programming. In this paper, we document the efficiency of the Cilk work-stealing scheduler, both empirically and analytically. We show that on real and synthetic applications, the “work” and “critical path” of a Cilk computation can be used to accurately model performance. Consequently, a Cilk programmer can focus on reducing the work and critical path of his computation, insulated from load balancing and other runtime scheduling issues. We also prove that for the class of “fully strict” (well-structured) programs, the Cilk scheduler achieves space, time and communication bounds all within a constant factor of optimal. The Cilk runtime system currently runs on the Connection Machine CM5 MPP, the Intel Paragon MPP, the Silicon Graphics Power Challenge SMP, and the MIT Phish network of workstations. Applications written in Cilk include protein folding, graphic rendering, backtrack search, and the *Socrates chess program, which won third prize in the 1994 ACM International Computer Chess Championship.

985 citations
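
The work and critical-path measures named in the abstract above can be illustrated with a small sketch. It treats a computation as a DAG of tasks, sums the task costs to get the work, takes the longest dependency chain as the critical path, and combines the two in a simple estimate of the form work/P + c * critical path. The function names, the example DAG, and the constant `c_inf` are assumptions made for illustration; they are not taken from the paper or from the Cilk runtime.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def work_and_critical_path(cost, deps):
    """Return (work, critical_path) for a task DAG.

    cost: task -> execution time of that task
    deps: task -> list of tasks that must finish first (hypothetical layout)
    """
    work = sum(cost.values())
    finish = {}
    graph = {task: deps.get(task, ()) for task in cost}
    # static_order() yields every task after all of its predecessors.
    for task in TopologicalSorter(graph).static_order():
        start = max((finish[p] for p in deps.get(task, ())), default=0)
        finish[task] = start + cost[task]
    return work, max(finish.values(), default=0)


def estimate_time(work, critical_path, processors, c_inf=1.0):
    """Simple model: work / P + c_inf * critical path (c_inf is an assumed constant)."""
    return work / processors + c_inf * critical_path


# A diamond-shaped computation: a feeds b and c, which both feed d.
cost = {"a": 1, "b": 2, "c": 3, "d": 1}
deps = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
work, path = work_and_critical_path(cost, deps)
print(work, path, estimate_time(work, path, processors=2))
```

In this toy model, adding processors only shrinks the work term, so a long critical path limits the achievable speedup no matter how many processors are available.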


Dissertation
01 Jan 1995
TL;DR: The thesis demonstrates that a new method for creating locality, called the blocking covers method, can improve the performance of iterative algorithms including multigrid, conjugate gradient, and implicit time stepping.
Abstract: How do you determine the running time of a program without actually running it? How do you design an efficient out-of-core iterative algorithm? These are the two questions answered in this thesis. The first part of the thesis demonstrates that the performance of programs can be predicted accurately, automatically, and rapidly using a method called benchmapping. The key aspects of benchmapping are: automatic creation of detailed performance models, prediction of the performance of runtime system calls using these models, and automatic decomposition of a data-parallel program into a sequence of runtime system calls. The feasibility and utility of benchmapping are established using two performance-prediction systems called PERFSIM and BENCHCVL. Empirical studies show that PERFSIM's relative prediction errors are within 21% and that BENCHCVL's relative prediction errors are almost always within 33%. The second part of the thesis presents methods for creating locality in numerical algorithms. Designers of computers, compilers, and runtime systems strive to create designs that exploit the temporal locality of reference found in some programs. Unfortunately, many iterative numerical algorithms lack temporal locality. Executions of such algorithms on current high-performance computers are characterized by saturation of some communication channel (such as a bus or an I/O channel) while the CPU is idle most of the time. The thesis demonstrates that a new method for creating locality, called the blocking covers method, can improve the performance of iterative algorithms including multigrid, conjugate gradient, and implicit time stepping. The thesis proves that the method reduces the number of input-output operations in these algorithms and demonstrates that the method reduces the solution time on workstations by up to a factor of 5.
The thesis also describes a parallel linear equation solver which is based on a method called local densification. The method increases the number of dependencies that can be handled by individual processors but not the number of dependencies that generate interprocessor communication. An implementation of the resulting algorithm is up to 2.5 times faster than conventional algorithms. (Copies available exclusively from MIT Libraries, Rm. 14-0551, Cambridge, MA 02139-4307. Ph. 617-253-5668; Fax 617-253-1690.)

45 citations


Patent
27 Jan 1995
TL;DR: A digital computer comprises a plurality of processing elements, a communications router, and a control network. Each processing element performs data processing operations in connection with commands, and at least some of the processing elements act on commands in messages they receive over the control network.
Abstract: A digital computer comprises a plurality of processing elements, a communications router, and a control network. Each processing element performs data processing operations in connection with commands, at least some of the processing elements performing the data processing operations in connection with the commands in messages they receive over the control network. Each processing element also generates and receives data transfer messages, each including an address portion containing an address, for transfer to another processing element as identified by the address. At least one of the processing elements further generates the control network messages for transfer over the communications router. The communications router comprises router nodes interconnected in the form of a "fat-tree," and the control network comprises control network nodes interconnected in the form of a tree, with the processing elements being connected at the leaf nodes of the respective communications router and control network.

31 citations


Patent
13 Feb 1995
TL;DR: A digital computer includes a plurality of processing elements, a command processor, a diagnostic processor and a communications network. The processing elements perform data processing and data communications operations in connection with commands, and also perform diagnostic operations in response to diagnostic operation requests, providing diagnostic results in response thereto.
Abstract: A digital computer includes a plurality of processing elements, a command processor, a diagnostic processor and a communications network. The processing elements each perform data processing and data communications operations in connection with commands. The processing elements also perform diagnostic operations in response to diagnostic operation requests and provide diagnostic results in response thereto. The command processor generates commands for the processing elements, and also performs diagnostic operations in response to diagnostic operation requests and provides diagnostic results in response thereto. The diagnostic processor generates diagnostic requests. The communications network includes three elements: a data router, a control network and a diagnostic network. The data router is connected to the processing elements for facilitating the transfer of data among them during a data communications operation. The control network is connected to the processing elements and the command processor for transferring commands from the command processor to the processing elements. The diagnostic network is connected to the processing elements, the command processor and the diagnostic processor for transferring diagnostic requests from the diagnostic processor to the processing elements and the command processor, and for transferring diagnostic results from the processing elements and the command processor to the diagnostic processor.

20 citations


Proceedings Article
20 Jul 1995
TL;DR: The circuit value update problem, that of updating values in a representation of a combinational circuit when some of the inputs are changed, is easily solved on an ordinary serial computer in O(W+D) time, where W is the number of elements in the altered subcircuit and D is the subcircuit's embedded depth.
Abstract: The circuit value update problem is the problem of updating values in a representation of a combinational circuit when some of the inputs are changed. We assume for simplicity that each combinational element has bounded fan-in and fan-out and can be evaluated in constant time. This problem is easily solved on an ordinary serial computer in O(W+D) time, where W is the number of elements in the altered subcircuit and D is the subcircuit's embedded depth (its depth measured in the original circuit).

2 citations
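
The serial setting described in the circuit value update abstract above can be sketched as follows. This is a minimal illustration and not the paper's algorithm: it keeps one bucket of pending elements per depth level of the original circuit, re-evaluates only elements whose inputs actually changed, and walks the levels in increasing order until no work remains. The data layout (`gates`, `fanout`, `level`, `values`) and the function name are assumptions made for this example.

```python
from collections import defaultdict


def update_circuit(gates, fanout, level, values, changed_inputs):
    """Propagate changed inputs through a combinational circuit.

    gates:  element -> (function, list of input names)
    fanout: name -> list of elements that read it
    level:  element -> depth in the original circuit (primary inputs are 0)
    values: name -> current value, updated in place
    changed_inputs: primary input -> new value
    Returns the number of elements re-evaluated.
    """
    buckets = defaultdict(set)  # depth -> elements waiting to be re-evaluated
    for name, new in changed_inputs.items():
        if values[name] != new:
            values[name] = new
            for g in fanout.get(name, ()):
                buckets[level[g]].add(g)

    evaluated = 0
    depth = min(buckets, default=0)
    while buckets:
        # Elements scheduled from here always lie at a strictly greater depth.
        for g in buckets.pop(depth, ()):
            fn, inputs = gates[g]
            new = fn(*(values[i] for i in inputs))
            evaluated += 1
            if new != values[g]:
                values[g] = new
                for h in fanout.get(g, ()):
                    buckets[level[h]].add(h)
        depth += 1
    return evaluated


# Tiny circuit: g1 = a AND b, g2 = g1 OR c.
gates = {"g1": (lambda x, y: x & y, ["a", "b"]), "g2": (lambda x, y: x | y, ["g1", "c"])}
fanout = {"a": ["g1"], "b": ["g1"], "c": ["g2"], "g1": ["g2"]}
level = {"g1": 1, "g2": 2}
values = {"a": 1, "b": 1, "c": 0, "g1": 1, "g2": 1}
print(update_circuit(gates, fanout, level, values, {"a": 0}), values)
```

With bounded fan-in and fan-out, each re-evaluation costs constant time, and the loop visits only the depth levels between the first and last affected element, which is the intuition behind a bound that depends on the altered subcircuit rather than on the whole circuit.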