Author

Ramesh Subramonian

Other affiliations: Lockheed Corporation
Bio: Ramesh Subramonian is an academic researcher from the University of California, Berkeley. The author has contributed to research in the topics of dissemination and parallel algorithms. The author has an h-index of 2 and has co-authored 4 publications receiving 1,800 citations. Previous affiliations of Ramesh Subramonian include Lockheed Corporation.

Papers
Proceedings ArticleDOI
01 Jul 1993
TL;DR: A new parallel machine model, called LogP, is offered that reflects the critical technology trends underlying parallel computers and is intended to serve as a basis for developing fast, portable parallel algorithms and to offer guidelines to machine designers.
Abstract: A vast body of theoretical research has focused either on overly simplistic models of parallel computation, notably the PRAM, or overly specific models that have few representatives in the real world. Both kinds of models encourage exploitation of formal loopholes, rather than rewarding development of techniques that yield performance across a range of current and future parallel machines. This paper offers a new parallel machine model, called LogP, that reflects the critical technology trends underlying parallel computers. It is intended to serve as a basis for developing fast, portable parallel algorithms and to offer guidelines to machine designers. Such a model must strike a balance between detail and simplicity in order to reveal important bottlenecks without making analysis of interesting problems intractable. The model is based on four parameters that specify abstractly the computing bandwidth, the communication bandwidth, the communication delay, and the efficiency of coupling communication and computation. Portable parallel algorithms typically adapt to the machine configuration in terms of these parameters. The utility of the model is demonstrated through examples that are implemented on the CM-5.
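The four LogP parameters lend themselves to a simple back-of-the-envelope cost calculation. The sketch below is my own illustration, not code from the paper: it estimates single-item broadcast completion time under a greedy schedule in which every informed processor keeps re-sending, with `L` the latency, `o` the per-message overhead, `g` the gap between consecutive sends, and `P` the number of processors.

```python
import heapq

def logp_broadcast_time(P, L, o, g):
    """Completion time of a greedy single-item broadcast under LogP.

    A message sent at time t is fully received at t + o + L + o
    (sender overhead, wire latency, receiver overhead). A sender's
    next send slot is max(g, o) after its previous one.
    """
    ready = [0.0]            # times at which some informed processor can send
    finish = 0.0
    informed = 1             # the source already has the item
    while informed < P:
        t = heapq.heappop(ready)         # earliest available sender
        recv = t + o + L + o             # a new processor is informed here
        finish = max(finish, recv)
        informed += 1
        heapq.heappush(ready, t + max(g, o))  # sender can send again
        heapq.heappush(ready, recv)           # receiver can start sending
    return finish
```

For example, with L=1, o=0, g=1 the informed set doubles every step, so 4 processors finish at time 2.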

1,515 citations

Journal ArticleDOI
TL;DR: A practical model of parallel computation must be simple enough to be generally useful and to keep algorithm analysis tractable; ideally, producing a better algorithm under the model should yield a better program in practice.
Abstract: …enough to be generally useful and to keep the algorithm analysis tractable. Ideally, producing a better algorithm under the model should yield a better program in practice. The Parallel Random Access Machine (PRAM) [8] is the most popular model for representing and analyzing the complexity of parallel algorithms.

328 citations

Journal ArticleDOI
TL;DR: This paper addresses the problem of simulating multiple-item broadcast by point-to-point message transmission, in which a source processor has many messages it wishes to disseminate to P−1 other processors, and develops a simpler and almost optimal solution that makes no assumptions about L or P.
Abstract: A commonly arising problem in many parallel algorithms is broadcast. Many multiprocessors do not have dedicated hardware for this purpose. In this paper, we address the problem of simulating multiple-item broadcast by point-to-point message transmission, where a source processor has many messages which it wishes to disseminate to P−1 other processors. In a step, a processor can send at most one item from among those in its possession and receive at most one item. An item is received at most L steps after it is sent. The goal is to find a schedule that achieves the broadcast in minimum time. We improve on previous results by developing a simpler and almost optimal solution which makes no assumptions about L or P. We also provide a bound on the performance degradation when the latency, L, is allowed to vary.

2 citations

01 Jan 1993
TL;DR: This paper improves on previous results by developing an optimal solution of the general problem which makes no assumptions about L or P, and a simpler but slightly sub-optimal solution which also makes no assumptions about L or P.
Abstract: A commonly arising problem in many parallel algorithms is broadcast. Many multiprocessors do not have dedicated hardware for this purpose. Therefore, a broadcast must be effected by point-to-point message passing. In [KSSS93], the k-broadcast problem was formulated as follows: one processor possesses k items which it wishes to disseminate to P−1 other processors. In a step, a processor can send at most one item and receive at most one item. An item is received L steps after it is sent. The goal is to find an optimal schedule. In this paper, we improve on previous results by developing (i) an optimal solution of the general problem which makes no assumptions about L or P, and (ii) a simpler but slightly sub-optimal solution which also makes no assumptions about L or P. We also discuss the results of implementing the schemes developed on a 64-processor CM-5.
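The step and latency constraints of the k-broadcast formulation are easy to exercise in code. The toy simulation below uses a simple pipeline schedule of my own for illustration, not the optimal schedule from the paper: all k items are forwarded along a chain of P processors, respecting one send and one receive per processor per step and an L-step delivery delay.

```python
def pipeline_broadcast_finish(k, P, L):
    """Finish step of a chain-pipeline k-item broadcast.

    arrive[j][i] = step at which processor j holds item i;
    the source (processor 0) holds every item at step 0.
    An item sent at step t arrives at step t + L, and each
    processor sends at most one item per step.
    """
    arrive = [[0] * k] + [[None] * k for _ in range(P - 1)]
    for j in range(1, P):
        next_send = 0                    # predecessor's next free send slot
        for i in range(k):
            send = max(next_send, arrive[j - 1][i])  # must hold the item
            arrive[j][i] = send + L
            next_send = send + 1         # one send per step
    return max(arrive[P - 1])            # last processor has every item
```

The pipeline finishes at step (k−1) + (P−1)·L, which makes clear why good schedules use trees rather than chains to hide the latency L.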

Cited by
Journal ArticleDOI
TL;DR: The 100-node NOW prototype aims to demonstrate practical solutions to the challenges of efficient communication hardware and software, global coordination of multiple workstation operating systems, and enterprise-scale network file systems.
Abstract: Networks of workstations are poised to become the primary computing infrastructure for science and engineering. NOWs may dramatically improve virtual memory and file system performance; achieve cheap, highly available, and scalable file storage; and provide multiple CPUs for parallel computing. Hurdles that remain include efficient communication hardware and software, global coordination of multiple workstation operating systems, and enterprise-scale network file systems. Our 100-node NOW prototype aims to demonstrate practical solutions to these challenges.

871 citations

Journal ArticleDOI
01 Feb 2005
TL;DR: The work on improving the performance of collective communication operations in MPICH is described, with results indicating that to achieve the best performance for a collective communication operation, one needs to use a number of different algorithms and select the right algorithm for a particular message size and number of processes.
Abstract: We describe our work on improving the performance of collective communication operations in MPICH for clusters connected by switched networks. For each collective operation, we use multiple algorithms depending on the message size, with the goal of minimizing latency for short messages and minimizing bandwidth use for long messages. Although we have implemented new algorithms for all MPI (Message Passing Interface) collective operations, because of limited space we describe only the algorithms for allgather, broadcast, all-to-all, reduce-scatter, reduce, and allreduce. Performance results on a Myrinet-connected Linux cluster and an IBM SP indicate that, in all cases, the new algorithms significantly outperform the old algorithms used in MPICH on the Myrinet cluster, and, in many cases, they outperform the algorithms used in IBM's MPI on the SP. We also explore in further detail the optimization of two of the most commonly used collective operations, allreduce and reduce, particularly for long messages and non-power-of-two numbers of processes. The optimized algorithms for these operations perform several times better than the native algorithms on a Myrinet cluster, IBM SP, and Cray T3E. Our results indicate that to achieve the best performance for a collective communication operation, one needs to use a number of different algorithms and select the right algorithm for a particular message size and number of processes.
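The size-based algorithm selection the abstract describes can be sketched schematically. The cutoff value and dispatch logic below are illustrative placeholders, not MPICH's actual internals: the idea is only that latency-optimal algorithms win for short messages and bandwidth-optimal ones for long messages.

```python
def choose_bcast_algorithm(message_bytes, num_procs):
    """Pick a broadcast algorithm by message size and process count.

    The 12 KiB threshold is a hypothetical value; real libraries
    tune such cutoffs empirically per network and machine.
    """
    SHORT = 12 * 1024
    if message_bytes < SHORT or num_procs < 8:
        # Binomial tree: ~ceil(log2(P)) message steps,
        # best when per-message latency dominates the cost.
        return "binomial_tree"
    # Scatter followed by allgather: each process moves roughly
    # 2x message_bytes in total, best when bandwidth dominates.
    return "scatter_allgather"
```

A production library would carry a dispatch table like this for every collective, which is exactly why the paper concludes that one algorithm per operation is never enough.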

838 citations

Proceedings ArticleDOI
04 Oct 2010
TL;DR: Mantri, a system that monitors tasks and culls outliers using cause- and resource-aware techniques, improves job completion times by 32% and detects and acts on outliers early in their lifetime.
Abstract: Experience from an operational MapReduce cluster reveals that outliers significantly prolong job completion. The causes for outliers include run-time contention for processor, memory, and other resources; disk failures; varying bandwidth and congestion along network paths; and imbalance in task workload. We present Mantri, a system that monitors tasks and culls outliers using cause- and resource-aware techniques. Mantri's strategies include restarting outliers, network-aware placement of tasks, and protecting outputs of valuable tasks. Using real-time progress reports, Mantri detects and acts on outliers early in their lifetime. Early action frees up resources that can be used by subsequent tasks and expedites the job overall. Acting based on the causes and the resource and opportunity cost of actions lets Mantri improve over prior work that only duplicates the laggards. Deployment in Bing's production clusters and trace-driven simulations show that Mantri improves job completion times by 32%.
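Mantri's cause-aware strategies are considerably richer, but the core idea of spotting outliers from real-time progress reports can be shown with a toy heuristic. The remaining-time estimate and the 1.5x-median threshold below are my own simplification, not Mantri's actual policy.

```python
import statistics

def find_outliers(progress, elapsed, threshold=1.5):
    """Flag straggler tasks from progress reports.

    progress: fraction of work done per task (0..1]
    elapsed:  seconds each task has run so far
    A task whose estimated remaining time far exceeds the median
    of its peers is a candidate for restart or duplication.
    """
    remaining = [e * (1 - p) / max(p, 1e-6)      # linear-rate estimate
                 for p, e in zip(progress, elapsed)]
    med = statistics.median(remaining)
    return [i for i, r in enumerate(remaining) if r > threshold * med]
```

Mantri goes further by asking *why* a task is slow (contention, bad disk, congested link, heavy input) before deciding whether restarting it would actually help.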

737 citations

Proceedings ArticleDOI
17 Jan 2010
TL;DR: A simulation lemma is proved showing that a large class of PRAM algorithms can be efficiently simulated via MapReduce, and it is demonstrated how algorithms can take advantage of this fact to compute an MST of a dense graph in only two rounds.
Abstract: In recent years the MapReduce framework has emerged as one of the most widely used parallel computing platforms for processing data on terabyte and petabyte scales. Used daily at companies such as Yahoo!, Google, Amazon, and Facebook, and adopted more recently by several universities, it allows for easy parallelization of data-intensive computations over many machines. One key feature of MapReduce that differentiates it from previous models of parallel computation is that it interleaves sequential and parallel computation. We propose a model of efficient computation using the MapReduce paradigm. Since MapReduce is designed for computations over massive data sets, our model limits the number of machines and the memory per machine to be substantially sublinear in the size of the input. On the other hand, we place very loose restrictions on the computational power of any individual machine---our model allows each machine to perform sequential computations in time polynomial in the size of the original input. We compare MapReduce to the PRAM model of computation. We prove a simulation lemma showing that a large class of PRAM algorithms can be efficiently simulated via MapReduce. The strength of MapReduce, however, lies in the fact that it uses both sequential and parallel computation. We demonstrate how algorithms can take advantage of this fact to compute an MST of a dense graph in only two rounds, as opposed to Ω(log(n)) rounds needed in the standard PRAM model. We show how to evaluate a wide class of functions using the MapReduce framework. We conclude by applying this result to show how to compute some basic algorithmic problems such as undirected s-t connectivity in the MapReduce framework.
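The two-round MST result rests on the cycle property: an edge that is the heaviest on some cycle cannot be in the global MST, so each machine can safely discard the non-MST edges of its local edge subset. The sketch below shows this filter-then-finish idea; the round-robin edge partitioning is my own illustration, not the paper's exact scheme.

```python
def kruskal(n, edges):
    """MST forest of a graph with vertices 0..n-1; edges are (w, u, v)."""
    parent = list(range(n))
    def find(x):                       # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    mst = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            mst.append((w, u, v))
    return mst

def two_round_mst(n, edges, parts=4):
    # Round 1 ("map"): each of `parts` machines keeps only the edges
    # in the MST forest of its own share; the rest are provably useless.
    buckets = [[] for _ in range(parts)]
    for i, e in enumerate(edges):
        buckets[i % parts].append(e)
    survivors = [e for b in buckets for e in kruskal(n, b)]
    # Round 2 ("reduce"): one machine finishes the MST on the survivors.
    return kruskal(n, survivors)
```

For a dense graph, round 1 shrinks the edge set to at most parts·(n−1) survivors, few enough for a single machine to handle in round 2.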

643 citations

Posted Content
TL;DR: MPICH-G2, as discussed by the authors, is a Grid-enabled implementation of the Message Passing Interface (MPI) that allows a user to run MPI programs across multiple computers, at the same or different sites, using the same commands that would be used on a parallel computer.
Abstract: Application development for distributed computing "Grids" can benefit from tools that variously hide or enable application-level management of critical aspects of the heterogeneous environment. As part of an investigation of these issues, we have developed MPICH-G2, a Grid-enabled implementation of the Message Passing Interface (MPI) that allows a user to run MPI programs across multiple computers, at the same or different sites, using the same commands that would be used on a parallel computer. This library extends the Argonne MPICH implementation of MPI to use services provided by the Globus Toolkit for authentication, authorization, resource allocation, executable staging, and I/O, as well as for process creation, monitoring, and control. Various performance-critical operations, including startup and collective operations, are configured to exploit network topology information. The library also exploits MPI constructs for performance management; for example, the MPI communicator construct is used for application-level discovery of, and adaptation to, both network topology and network quality-of-service mechanisms. We describe the MPICH-G2 design and implementation, present performance results, and review application experiences, including record-setting distributed simulations.

638 citations