Author

Ramesh Subramonian

Other affiliations: Lockheed Corporation
Bio: Ramesh Subramonian is an academic researcher from the University of California, Berkeley. The author has contributed to research in the topics of dissemination and parallel algorithms. The author has an h-index of 2 and has co-authored 4 publications receiving 1,800 citations. Previous affiliations of Ramesh Subramonian include Lockheed Corporation.

Papers
Proceedings ArticleDOI
01 Jul 1993
TL;DR: A new parallel machine model, called LogP, is offered that reflects the critical technology trends underlying parallel computers and is intended to serve as a basis for developing fast, portable parallel algorithms and to offer guidelines to machine designers.
Abstract: A vast body of theoretical research has focused either on overly simplistic models of parallel computation, notably the PRAM, or overly specific models that have few representatives in the real world. Both kinds of models encourage exploitation of formal loopholes, rather than rewarding development of techniques that yield performance across a range of current and future parallel machines. This paper offers a new parallel machine model, called LogP, that reflects the critical technology trends underlying parallel computers. It is intended to serve as a basis for developing fast, portable parallel algorithms and to offer guidelines to machine designers. Such a model must strike a balance between detail and simplicity in order to reveal important bottlenecks without making analysis of interesting problems intractable. The model is based on four parameters that specify abstractly the computing bandwidth, the communication bandwidth, the communication delay, and the efficiency of coupling communication and computation. Portable parallel algorithms typically adapt to the machine configuration in terms of these parameters. The utility of the model is demonstrated through examples that are implemented on the CM-5.
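The four LogP parameters lend themselves to a simple back-of-the-envelope cost calculation. The sketch below is my own illustration, not code from the paper: it estimates single-item broadcast completion time under a greedy schedule in which every informed processor keeps re-sending, with `L` the latency, `o` the per-message overhead, `g` the gap between consecutive sends, and `P` the number of processors.

```python
import heapq

def logp_broadcast_time(P, L, o, g):
    """Completion time of a greedy single-item broadcast under LogP.

    A message sent at time t is fully received at t + o + L + o
    (sender overhead, wire latency, receiver overhead). A sender's
    next send slot is max(g, o) after its previous one.
    """
    ready = [0.0]            # times at which some informed processor can send
    finish = 0.0
    informed = 1             # the source already has the item
    while informed < P:
        t = heapq.heappop(ready)         # earliest available sender
        recv = t + o + L + o             # a new processor is informed here
        finish = max(finish, recv)
        informed += 1
        heapq.heappush(ready, t + max(g, o))  # sender can send again
        heapq.heappush(ready, recv)           # receiver can start sending
    return finish
```

For example, with L=1, o=0, g=1 the informed set doubles every step, so 4 processors finish at time 2.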

1,515 citations

Journal ArticleDOI
TL;DR: A practical model of parallel computation must be simple enough to be generally useful and to keep algorithm analysis tractable; ideally, producing a better algorithm under the model should yield a better program in practice.
Abstract: …enough to be generally useful and to keep the algorithm analysis tractable. Ideally, producing a better algorithm under the model should yield a better program in practice. The Parallel Random Access Machine (PRAM) [8] is the most popular model for representing and analyzing the complexity of parallel algorithms.

328 citations

Journal ArticleDOI
TL;DR: This paper addresses the problem of simulating multiple-item broadcast by point-to-point message transmission, in which a source processor has many messages it wishes to disseminate to P−1 other processors, and develops a simpler and almost optimal solution that makes no assumptions about L or P.
Abstract: A commonly arising problem in many parallel algorithms is broadcast. Many multiprocessors do not have dedicated hardware for this purpose. In this paper, we address the problem of simulating multiple-item broadcast by point-to-point message transmission, where a source processor has many messages which it wishes to disseminate to P−1 other processors. In a step, a processor can send at most one item from among those in its possession and receive at most one item. An item is received at most L steps after it is sent. The goal is to find a schedule that achieves the broadcast in minimum time. We improve on previous results by developing a simpler and almost optimal solution which makes no assumptions about L or P. We also provide a bound on the performance degradation when the latency, L, is allowed to vary.

2 citations

01 Jan 1993
TL;DR: This paper improves on previous results by developing an optimal solution of the general problem which makes no assumptions about L or P, and a simpler but slightly sub-optimal solution which also makes no assumptions about L or P.
Abstract: A commonly arising problem in many parallel algorithms is broadcast. Many multiprocessors do not have dedicated hardware for this purpose. Therefore, a broadcast must be effected by point-to-point message passing. In [KSSS93], the k-broadcast problem was formulated as follows: one processor possesses k items which it wishes to disseminate to P−1 other processors. In a step, a processor can send at most one item and receive at most one item. An item is received L steps after it is sent. The goal is to find an optimal schedule. In this paper, we improve on previous results by developing (i) an optimal solution of the general problem which makes no assumptions about L or P, and (ii) a simpler but slightly sub-optimal solution which also makes no assumptions about L or P. We also discuss the results of implementing the schemes developed on a 64-processor CM-5.
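The step and latency constraints of the k-broadcast formulation are easy to exercise in code. The toy simulation below uses a simple pipeline schedule of my own for illustration, not the optimal schedule from the paper: all k items are forwarded along a chain of P processors, respecting one send and one receive per processor per step and an L-step delivery delay.

```python
def pipeline_broadcast_finish(k, P, L):
    """Finish step of a chain-pipeline k-item broadcast.

    arrive[j][i] = step at which processor j holds item i;
    the source (processor 0) holds every item at step 0.
    An item sent at step t arrives at step t + L, and each
    processor sends at most one item per step.
    """
    arrive = [[0] * k] + [[None] * k for _ in range(P - 1)]
    for j in range(1, P):
        next_send = 0                    # predecessor's next free send slot
        for i in range(k):
            send = max(next_send, arrive[j - 1][i])  # must hold the item
            arrive[j][i] = send + L
            next_send = send + 1         # one send per step
    return max(arrive[P - 1])            # last processor has every item
```

The pipeline finishes at step (k−1) + (P−1)·L, which makes clear why good schedules use trees rather than chains to hide the latency L.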

Cited by
Journal ArticleDOI
TL;DR: The 100-node NOW prototype aims to demonstrate practical solutions to the challenges of efficient communication hardware and software, global coordination of multiple workstation operating systems, and enterprise-scale network file systems.
Abstract: Networks of workstations are poised to become the primary computing infrastructure for science and engineering. NOWs may dramatically improve virtual memory and file system performance; achieve cheap, highly available, and scalable file storage; and provide multiple CPUs for parallel computing. Hurdles that remain include efficient communication hardware and software, global coordination of multiple workstation operating systems, and enterprise-scale network file systems. Our 100-node NOW prototype aims to demonstrate practical solutions to these challenges.

871 citations

Journal ArticleDOI
01 Feb 2005
TL;DR: The work on improving the performance of collective communication operations in MPICH is described, with results indicating that to achieve the best performance for a collective communication operation, one needs to use a number of different algorithms and select the right algorithm for a particular message size and number of processes.
Abstract: We describe our work on improving the performance of collective communication operations in MPICH for clusters connected by switched networks. For each collective operation, we use multiple algorithms depending on the message size, with the goal of minimizing latency for short messages and minimizing bandwidth use for long messages. Although we have implemented new algorithms for all MPI (Message Passing Interface) collective operations, because of limited space we describe only the algorithms for allgather, broadcast, all-to-all, reduce-scatter, reduce, and allreduce. Performance results on a Myrinet-connected Linux cluster and an IBM SP indicate that, in all cases, the new algorithms significantly outperform the old algorithms used in MPICH on the Myrinet cluster, and, in many cases, they outperform the algorithms used in IBM's MPI on the SP. We also explore in further detail the optimization of two of the most commonly used collective operations, allreduce and reduce, particularly for long messages and non-power-of-two numbers of processes. The optimized algorithms for these operations perform several times better than the native algorithms on a Myrinet cluster, IBM SP, and Cray T3E. Our results indicate that to achieve the best performance for a collective communication operation, one needs to use a number of different algorithms and select the right algorithm for a particular message size and number of processes.
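The size-based algorithm selection the abstract describes can be sketched schematically. The cutoff value and dispatch logic below are illustrative placeholders, not MPICH's actual internals: the idea is only that latency-optimal algorithms win for short messages and bandwidth-optimal ones for long messages.

```python
def choose_bcast_algorithm(message_bytes, num_procs):
    """Pick a broadcast algorithm by message size and process count.

    The 12 KiB threshold is a hypothetical value; real libraries
    tune such cutoffs empirically per network and machine.
    """
    SHORT = 12 * 1024
    if message_bytes < SHORT or num_procs < 8:
        # Binomial tree: ~ceil(log2(P)) message steps,
        # best when per-message latency dominates the cost.
        return "binomial_tree"
    # Scatter followed by allgather: each process moves roughly
    # 2x message_bytes in total, best when bandwidth dominates.
    return "scatter_allgather"
```

A production library would carry a dispatch table like this for every collective, which is exactly why the paper concludes that one algorithm per operation is never enough.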

838 citations

Proceedings ArticleDOI
04 Oct 2010
TL;DR: Mantri, a system that monitors tasks and culls outliers using cause- and resource-aware techniques, improves job completion times by 32% and detects and acts on outliers early in their lifetime.
Abstract: Experience from an operational MapReduce cluster reveals that outliers significantly prolong job completion. The causes for outliers include run-time contention for processor, memory, and other resources; disk failures; varying bandwidth and congestion along network paths; and imbalance in task workload. We present Mantri, a system that monitors tasks and culls outliers using cause- and resource-aware techniques. Mantri's strategies include restarting outliers, network-aware placement of tasks, and protecting outputs of valuable tasks. Using real-time progress reports, Mantri detects and acts on outliers early in their lifetime. Early action frees up resources that can be used by subsequent tasks and expedites the job overall. Acting based on the causes and the resource and opportunity cost of actions lets Mantri improve over prior work that only duplicates the laggards. Deployment in Bing's production clusters and trace-driven simulations show that Mantri improves job completion times by 32%.
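Mantri's cause-aware strategies are considerably richer, but the core idea of spotting outliers from real-time progress reports can be shown with a toy heuristic. The remaining-time estimate and the 1.5x-median threshold below are my own simplification, not Mantri's actual policy.

```python
import statistics

def find_outliers(progress, elapsed, threshold=1.5):
    """Flag straggler tasks from progress reports.

    progress: fraction of work done per task (0..1]
    elapsed:  seconds each task has run so far
    A task whose estimated remaining time far exceeds the median
    of its peers is a candidate for restart or duplication.
    """
    remaining = [e * (1 - p) / max(p, 1e-6)      # linear-rate estimate
                 for p, e in zip(progress, elapsed)]
    med = statistics.median(remaining)
    return [i for i, r in enumerate(remaining) if r > threshold * med]
```

Mantri goes further by asking *why* a task is slow (contention, bad disk, congested link, heavy input) before deciding whether restarting it would actually help.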

737 citations

Proceedings ArticleDOI
17 Jan 2010
TL;DR: A simulation lemma is proved showing that a large class of PRAM algorithms can be efficiently simulated via MapReduce, and it is demonstrated how algorithms can take advantage of this fact to compute an MST of a dense graph in only two rounds.
Abstract: In recent years the MapReduce framework has emerged as one of the most widely used parallel computing platforms for processing data on terabyte and petabyte scales. Used daily at companies such as Yahoo!, Google, Amazon, and Facebook, and adopted more recently by several universities, it allows for easy parallelization of data-intensive computations over many machines. One key feature of MapReduce that differentiates it from previous models of parallel computation is that it interleaves sequential and parallel computation. We propose a model of efficient computation using the MapReduce paradigm. Since MapReduce is designed for computations over massive data sets, our model limits the number of machines and the memory per machine to be substantially sublinear in the size of the input. On the other hand, we place very loose restrictions on the computational power of any individual machine---our model allows each machine to perform sequential computations in time polynomial in the size of the original input. We compare MapReduce to the PRAM model of computation. We prove a simulation lemma showing that a large class of PRAM algorithms can be efficiently simulated via MapReduce. The strength of MapReduce, however, lies in the fact that it uses both sequential and parallel computation. We demonstrate how algorithms can take advantage of this fact to compute an MST of a dense graph in only two rounds, as opposed to Ω(log(n)) rounds needed in the standard PRAM model. We show how to evaluate a wide class of functions using the MapReduce framework. We conclude by applying this result to show how to compute some basic algorithmic problems such as undirected s-t connectivity in the MapReduce framework.
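The two-round MST result rests on the cycle property: an edge that is the heaviest on some cycle cannot be in the global MST, so each machine can safely discard the non-MST edges of its local edge subset. The sketch below shows this filter-then-finish idea; the round-robin edge partitioning is my own illustration, not the paper's exact scheme.

```python
def kruskal(n, edges):
    """MST forest of a graph with vertices 0..n-1; edges are (w, u, v)."""
    parent = list(range(n))
    def find(x):                       # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    mst = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            mst.append((w, u, v))
    return mst

def two_round_mst(n, edges, parts=4):
    # Round 1 ("map"): each of `parts` machines keeps only the edges
    # in the MST forest of its own share; the rest are provably useless.
    buckets = [[] for _ in range(parts)]
    for i, e in enumerate(edges):
        buckets[i % parts].append(e)
    survivors = [e for b in buckets for e in kruskal(n, b)]
    # Round 2 ("reduce"): one machine finishes the MST on the survivors.
    return kruskal(n, survivors)
```

For a dense graph, round 1 shrinks the edge set to at most parts·(n−1) survivors, few enough for a single machine to handle in round 2.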

643 citations

Posted Content
TL;DR: MPICH-G2, as discussed by the authors, is a Grid-enabled implementation of the Message Passing Interface (MPI) that allows a user to run MPI programs across multiple computers, at the same or different sites, using the same commands that would be used on a parallel computer.
Abstract: Application development for distributed computing "Grids" can benefit from tools that variously hide or enable application-level management of critical aspects of the heterogeneous environment. As part of an investigation of these issues, we have developed MPICH-G2, a Grid-enabled implementation of the Message Passing Interface (MPI) that allows a user to run MPI programs across multiple computers, at the same or different sites, using the same commands that would be used on a parallel computer. This library extends the Argonne MPICH implementation of MPI to use services provided by the Globus Toolkit for authentication, authorization, resource allocation, executable staging, and I/O, as well as for process creation, monitoring, and control. Various performance-critical operations, including startup and collective operations, are configured to exploit network topology information. The library also exploits MPI constructs for performance management; for example, the MPI communicator construct is used for application-level discovery of, and adaptation to, both network topology and network quality-of-service mechanisms. We describe the MPICH-G2 design and implementation, present performance results, and review application experiences, including record-setting distributed simulations.

638 citations