Showing papers by Charles E. Leiserson published in 1996


Journal ArticleDOI
TL;DR: It is shown that on real and synthetic applications, the “work” and “critical-path length” of a Cilk computation can be used to model performance accurately, and it is proved that for the class of “fully strict” (well-structured) programs, the Cilk scheduler achieves space, time, and communication bounds all within a constant factor of optimal.
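
As a concrete illustration of a "fully strict" computation (every spawned child joins its parent before the parent returns or uses the child's result), here is the canonical recursive Fibonacci sketch in Cilk-style C. It uses the modern OpenCilk spellings cilk_spawn and cilk_sync; the 1996 system wrote these as spawn and sync, so treat this as an illustrative sketch rather than the paper's own code. For fib(n) the work T_1 grows exponentially while the critical-path length T_∞ is only Θ(n), so the model T_P ≈ T_1/P + O(T_∞) predicts near-linear speedup.

    #include <stdio.h>
    #include <cilk/cilk.h>   /* OpenCilk; compile with: clang -fopencilk fib.c */

    /* Fully strict: the spawned child is synced inside its parent's
       frame before the parent returns or reads the child's result. */
    long fib(long n) {
        if (n < 2) return n;
        long x = cilk_spawn fib(n - 1);  /* may execute in parallel */
        long y = fib(n - 2);             /* parent continues meanwhile */
        cilk_sync;                       /* join before reading x */
        return x + y;
    }

    int main(void) {
        printf("fib(30) = %ld\n", fib(30));  /* expect 832040 */
        return 0;
    }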

1,688 citations


Journal ArticleDOI
TL;DR: The Connection Machine Model CM-5 supercomputer is a massively parallel computer system designed to offer performance in the range of 1 teraflops (10^12 floating-point operations per second).

453 citations


Proceedings ArticleDOI
24 Jun 1996
TL;DR: It is proved that if the accesses to the backing store are random and independent (the BACKER algorithm actually uses hashing), the expected execution time T_P(C) of a “fully strict” multithreaded computation on P processors, each with an LRU cache of C pages, is O(T_1(C)/P + mCT_∞).
Abstract: In this paper, we analyze the performance of parallel multithreaded algorithms that use dag-consistent distributed shared memory. Specifically, we analyze execution time, page faults, and space requirements for multithreaded algorithms executed by a work-stealing thread scheduler and the BACKER algorithm for maintaining dag consistency. We prove that if the accesses to the backing store are random and independent (the BACKER algorithm actually uses hashing), the expected execution time T_P(C) of a “fully strict” multithreaded computation on P processors, each with an LRU cache of C pages, is O(T_1(C)/P + mCT_∞), where T_1(C) is the total work of the computation including page faults, T_∞ is its critical-path length excluding page faults, and m is the minimum page transfer time. As a corollary to this theorem, we show that the expected number F_P(C) of page faults incurred by a computation executed on P processors can be related to the number F_1(C) of serial page faults by the formula F_P(C) ≤ F_1(C) + O(CPT_∞). Finally, we give simple bounds on the number of page faults and the space requirements for “regular” divide-and-conquer algorithms. We use these bounds to analyze parallel multithreaded algorithms for matrix multiplication and LU-decomposition.
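
The "regular" divide-and-conquer structure analyzed above can be made concrete with a recursive blocked matrix multiplication. The sketch below is plain serial C of my own, not the paper's code; comments mark where a Cilk version would spawn the recursive calls.

    #include <stdio.h>

    #define N 4   /* tiny size for demonstration; assumed a power of two */

    /* C += A * B on n-by-n blocks embedded row-major with leading
       dimension ld. A Cilk version would spawn the first four recursive
       calls in parallel, sync, then spawn the second four (which update
       the same quadrants of C). */
    static void matmul(const double *A, const double *B, double *C,
                       int n, int ld) {
        if (n == 1) { C[0] += A[0] * B[0]; return; }
        int h = n / 2;
        const double *A11 = A,          *A12 = A + h,
                     *A21 = A + h * ld, *A22 = A + h * ld + h;
        const double *B11 = B,          *B12 = B + h,
                     *B21 = B + h * ld, *B22 = B + h * ld + h;
        double *C11 = C,          *C12 = C + h,
               *C21 = C + h * ld, *C22 = C + h * ld + h;
        matmul(A11, B11, C11, h, ld);  matmul(A11, B12, C12, h, ld);
        matmul(A21, B11, C21, h, ld);  matmul(A21, B12, C22, h, ld);
        /* a Cilk version would sync here before the second group */
        matmul(A12, B21, C11, h, ld);  matmul(A12, B22, C12, h, ld);
        matmul(A22, B21, C21, h, ld);  matmul(A22, B22, C22, h, ld);
    }

    int main(void) {
        double A[N * N], B[N * N], C[N * N] = {0};
        for (int i = 0; i < N * N; i++) { A[i] = 1.0; B[i] = 2.0; }
        matmul(A, B, C, N, N);
        printf("C[0][0] = %g (expect %g)\n", C[0], 2.0 * N);
        return 0;
    }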

141 citations


Proceedings ArticleDOI
15 Apr 1996
TL;DR: This work introduces DAG (directed acyclic graph) consistency, a relaxed consistency model for distributed shared memory which is suitable for multithreaded programming and provides empirical evidence of the flexibility and efficiency of DAG consistency for applications that include blocked matrix multiplication, Strassen's matrix multiplication algorithm and a Barnes-Hut code.
Abstract: Introduces DAG (directed acyclic graph) consistency, a relaxed consistency model for distributed shared memory which is suitable for multithreaded programming. We have implemented DAG consistency in software for the Cilk multithreaded runtime system running on a CM-5 Connection Machine. Our implementation includes a DAG-consistent distributed cactus stack for storage allocation. We provide empirical evidence of the flexibility and efficiency of DAG consistency for applications that include blocked matrix multiplication, Strassen's (1969) matrix multiplication algorithm, and a Barnes-Hut code. Although Cilk schedules the executions of these programs dynamically, their performances are competitive with statically scheduled implementations in the literature. We also prove that the number F_P of page faults incurred by a user program running on P processors can be related to the number F_1 of page faults running serially by the formula F_P ≤ F_1 + 2Cs, where C is the cache size and s is the number of thread migrations executed by Cilk's scheduler.
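
As a quick back-of-the-envelope illustration of the F_P ≤ F_1 + 2Cs bound, this small C sketch plugs in made-up values; the numbers are assumptions for illustration, not measurements from the paper.

    #include <stdio.h>

    int main(void) {
        long f1 = 1000000;  /* serial page faults F_1 (assumed) */
        long C  = 512;      /* cache size in pages (assumed)    */
        long migrations[] = {10, 100, 1000};  /* values of s (assumed) */
        for (int i = 0; i < 3; i++) {
            long s = migrations[i];
            /* extra faults are at most 2*C per thread migration */
            printf("s = %4ld: F_P <= %ld\n", s, f1 + 2 * C * s);
        }
        return 0;
    }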

100 citations


Journal Article
TL;DR: In this article, the authors present a parallel algorithm for sorting in O(lg^2 n) time with n processors, the fastest known sorting algorithm for the Connection Machine supercomputer model CM-2.
Abstract: Sorting is arguably the most studied problem in computer science, both because it is used as a substep in many applications and because it is a simple, combinatorial problem with many interesting and diverse solutions. Sorting is also an important benchmark for parallel supercomputers. It requires significant communication bandwidth among processors, unlike many other supercomputer benchmarks, and the most efficient sorting algorithms communicate data in irregular patterns.

Parallel algorithms for sorting have been studied since at least the 1960s. An early advance in parallel sorting came in 1968, when Batcher discovered the elegant Θ(lg^2 n)-depth bitonic sorting network [3]. For certain families of fixed interconnection networks, such as the hypercube and shuffle-exchange, Batcher's bitonic sorting technique provides a parallel algorithm for sorting n numbers in Θ(lg^2 n) time with n processors. The question of the existence of an o(lg^2 n)-depth sorting network remained open until 1983, when Ajtai, Komlós, and Szemerédi [1] provided an optimal Θ(lg n)-depth sorting network; unfortunately, their construction leads to larger networks than those given by bitonic sort for all "practical" values of n. Leighton [15] has shown that any Θ(lg n)-depth family of sorting networks can be used to sort n numbers in Θ(lg n) time in the bounded-degree fixed interconnection network domain. Not surprisingly, the optimal Θ(lg n)-time fixed interconnection sorting networks implied by the AKS construction are also impractical. In 1983, Reif and Valiant proposed a more practical O(lg n)-time randomized algorithm for sorting [19], called flashsort. Many other parallel sorting algorithms have been proposed in the literature, including parallel versions of radix sort and quicksort [5], a variant of quicksort called hyperquicksort [23], smoothsort [18], column sort [15], Nassimi and Sahni's sort [17], and parallel merge sort [6].

This paper reports the findings of a project undertaken at Thinking Machines Corporation to develop a fast sorting algorithm for the Connection Machine Supercomputer model CM-2. The primary goals of this project were:

9 citations
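
Batcher's Θ(lg^2 n)-depth bitonic network cited in the abstract above is compact enough to sketch directly. The following serial C program is my own illustration, not the paper's code: it performs the network's compare-exchange stages in place for n a power of two, and the doubly nested stage loops over (k, j) trace out the network's Θ(lg^2 n) levels.

    #include <stdio.h>

    /* Batcher's bitonic sort, iterative form; n must be a power of two.
       Each (k, j) pair is one depth level of the sorting network, for
       (lg n)(lg n + 1)/2 = Theta(lg^2 n) levels in total. */
    static void bitonic_sort(int *a, int n) {
        for (int k = 2; k <= n; k *= 2) {         /* merge size doubles  */
            for (int j = k / 2; j > 0; j /= 2) {  /* compare distance     */
                for (int i = 0; i < n; i++) {
                    int partner = i ^ j;
                    if (partner > i) {
                        int up = ((i & k) == 0);  /* block sort direction */
                        if (( up && a[i] > a[partner]) ||
                            (!up && a[i] < a[partner])) {
                            int t = a[i]; a[i] = a[partner]; a[partner] = t;
                        }
                    }
                }
            }
        }
    }

    int main(void) {
        int a[8] = {5, 1, 7, 3, 8, 2, 6, 4};
        bitonic_sort(a, 8);
        for (int i = 0; i < 8; i++) printf("%d ", a[i]);  /* 1 2 ... 8 */
        printf("\n");
        return 0;
    }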