Author

Jason Duell

Bio: Jason Duell is an academic researcher from the University of California, Berkeley. The author has contributed to research in topics including Compiler and Microarchitecture, has an h-index of 9, and has co-authored 11 publications receiving 1,563 citations. Previous affiliations of Jason Duell include Lawrence Berkeley National Laboratory.

Papers
Journal ArticleDOI
01 Sep 2006
TL;DR: The motivation, design and implementation of Berkeley Lab Checkpoint/Restart (BLCR), a system-level checkpoint/restart implementation for Linux clusters that targets the space of typical High Performance Computing applications, including MPI, are described.
Abstract: This article describes the motivation, design and implementation of Berkeley Lab Checkpoint/Restart (BLCR), a system-level checkpoint/restart implementation for Linux clusters that targets the space of typical High Performance Computing applications, including MPI. Application-level solutions, including both checkpointing and fault-tolerant algorithms, are recognized as more time and space efficient than system-level checkpoints, which cannot make use of any application-specific knowledge. However, system-level checkpointing allows for preemption, making it suitable for responding to "fault precursors" (for instance, elevated error rates from ECC memory or network CRCs, or elevated temperature from sensors). Preemption can also increase the efficiency of batch scheduling; for instance, reducing idle cycles (by allowing for shutdown without any queue draining period, or reallocation of resources to eliminate idle nodes when better-fitting jobs are queued), and reducing the average queued time (by limiting large jobs to running during off-peak hours, without the need to limit the length of such jobs). Each of these potential uses makes BLCR a valuable tool for efficient resource management in Linux clusters.

439 citations
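
To make the "fault precursor" use case concrete, here is a minimal sketch of a watcher that requests a preemptive checkpoint when a health metric crosses a threshold. It is an illustration only: the sensor path, threshold, and polling interval are assumptions, and while cr_checkpoint is BLCR's command-line checkpoint tool, its exact flags can vary between BLCR versions.

```c
/*
 * Illustrative sketch only: a tiny "fault precursor" watcher that requests a
 * preemptive checkpoint of a running process, in the spirit of the BLCR use
 * case above.  The sysfs path, threshold, and polling interval are assumed
 * for illustration; cr_checkpoint is BLCR's command-line tool, but its exact
 * flags may differ between BLCR versions.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define TEMP_SENSOR   "/sys/class/hwmon/hwmon0/temp1_input"  /* assumed path */
#define TEMP_LIMIT_MC 85000   /* 85 C in millidegrees, assumed threshold */

static long read_temp_millideg(void)
{
    FILE *f = fopen(TEMP_SENSOR, "r");
    long t = -1;
    if (f) {
        if (fscanf(f, "%ld", &t) != 1)
            t = -1;
        fclose(f);
    }
    return t;
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid-to-checkpoint>\n", argv[0]);
        return 1;
    }
    for (;;) {
        if (read_temp_millideg() > TEMP_LIMIT_MC) {
            char cmd[128];
            /* Ask BLCR to checkpoint the job before the node gets worse. */
            snprintf(cmd, sizeof cmd, "cr_checkpoint %s", argv[1]);
            return system(cmd) == 0 ? 0 : 1;
        }
        sleep(10);   /* assumed polling interval */
    }
}
```

In a production setting this role would typically be played by the batch system or a monitoring daemon rather than a hand-rolled loop.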

Journal ArticleDOI
01 Nov 2005
TL;DR: This work designs and implements a system for providing coordinated checkpointing and rollback recovery for MPI-based parallel applications that integrates the Berkeley Lab BLCR kernel-level process checkpoint system with the LAM implementation of MPI through a defined checkpoint/restart interface.
Abstract: As high performance clusters continue to grow in size and popularity, issues of fault tolerance and reliability are becoming limiting factors on application scalability. To address these issues, we present the design and implementation of a system for providing coordinated checkpointing and rollback recovery for MPI-based parallel applications. Our approach integrates the Berkeley Lab BLCR kernel-level process checkpoint system with the LAM implementation of MPI through a defined checkpoint/restart interface. Checkpointing is transparent to the application, allowing the system to be used for cluster maintenance and scheduling reasons as well as for fault tolerance. Experimental results show negligible communication performance impact due to the incorporation of the checkpoint support capabilities into LAM/MPI.

302 citations
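
The coordination that LAM/MPI and BLCR perform transparently at the system level can be pictured with an application-level analogue: all ranks reach a point where no messages are in flight, then each saves its local state. The sketch below is not the LAM/MPI checkpoint/restart interface; the file naming and the state payload are illustrative assumptions.

```c
/*
 * Minimal sketch, not the LAM/MPI mechanism itself: LAM/MPI + BLCR checkpoint
 * transparently at the system level, but the coordination idea can be shown
 * with an application-level analogue.  All ranks reach a point with no
 * in-flight messages (the barrier), then each rank saves its local state.
 */
#include <mpi.h>
#include <stdio.h>

static void checkpoint(int rank, const double *state, int n)
{
    char path[64];
    snprintf(path, sizeof path, "ckpt.rank%d.bin", rank);   /* assumed naming */
    FILE *f = fopen(path, "wb");
    if (f) {
        fwrite(state, sizeof *state, n, f);
        fclose(f);
    }
}

int main(int argc, char **argv)
{
    int rank;
    double state[1024] = {0};            /* stand-in for real per-rank state */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int step = 0; step < 100; step++) {
        /* ... exchange messages, advance the computation ... */

        if (step % 25 == 0) {
            MPI_Barrier(MPI_COMM_WORLD);   /* coordinate: quiesce communication */
            checkpoint(rank, state, 1024); /* then take a consistent snapshot  */
        }
    }

    MPI_Finalize();
    return 0;
}
```

The barrier stands in for the protocol that drains in-flight messages before the per-process checkpoints are taken, which is what keeps the set of per-rank snapshots globally consistent.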

ReportDOI
TL;DR: BLCR can be used either as a stand-alone system for checkpointing applications on a single machine, or as a component used by a scheduling system or parallel communication library for checkpointing and restoring parallel jobs running on multiple machines.
Abstract: This paper describes Berkeley Linux Checkpoint/Restart (BLCR), a Linux kernel module that allows system-level checkpoints on a variety of Linux systems. BLCR can be used either as a stand-alone system for checkpointing applications on a single machine, or as a component used by a scheduling system or parallel communication library for checkpointing and restoring parallel jobs running on multiple machines. Integration with the Message Passing Interface (MPI) and other parallel systems is described.

266 citations

Proceedings ArticleDOI
27 Jul 2007
TL;DR: Two related projects, the Titanium and UPC projects, combine compiler, runtime, and application efforts to demonstrate some of the performance and productivity advantages of these languages.
Abstract: Partitioned Global Address Space (PGAS) languages combine the programming convenience of shared memory with the locality and performance control of message passing. One such language, Unified Parallel C (UPC), is an extension of ISO C defined by a consortium that boasts multiple proprietary and open source compilers. Another PGAS language, Titanium, is a dialect of Java designed for high performance scientific computation. In this paper we describe some of the highlights of two related projects, the Titanium project centered at U.C. Berkeley and the UPC project centered at Lawrence Berkeley National Laboratory. Both compilers use a source-to-source strategy that translates the parallel languages to C with calls to a communication layer called GASNet. The result is portable high-performance compilers that run on a large variety of shared and distributed memory multiprocessors. Both projects combine compiler, runtime, and application efforts to demonstrate some of the performance and productivity advantages of these languages.

205 citations
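
The source-to-source strategy described above can be sketched in miniature: a UPC-style write to a shared array element owned by another thread is lowered to ordinary C plus a call into the communication layer. The runtime_put function below is a hypothetical stand-in for the generated GASNet call, not the actual GASNet API, and its stub implementation just copies locally so the example is self-contained.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the generated communication call (not the real
 * GASNet API).  A real runtime would issue a one-sided remote put; the stub
 * here copies locally so the sketch compiles and runs on its own. */
static void runtime_put(int owner, void *dst, const void *src, size_t nbytes)
{
    (void)owner;
    memcpy(dst, src, nbytes);
}

/* UPC source (for reference):
 *     shared double a[N];
 *     a[i] = x;                  // element i may live on another thread
 *
 * Generated C (simplified): the compiler computes the owning thread and the
 * element's address inside that thread's partition, then emits a put. */
static void generated_store(int owner, double *remote_elem, double x)
{
    runtime_put(owner, remote_elem, &x, sizeof x);
}

int main(void)
{
    double partition[4] = {0};          /* pretend this is thread 1's data */
    generated_store(1, &partition[2], 3.14);
    printf("%f\n", partition[2]);
    return 0;
}
```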

Proceedings ArticleDOI
23 Jun 2003
TL;DR: This paper describes a portable open source compiler for UPC, identifies some of the challenges in compiling UPC, and uses a combination of micro-benchmarks and application kernels to show that the compiler has low overhead for basic operations on shared data and is competitive with, and sometimes faster than, the commercial HP compiler.
Abstract: Unified Parallel C (UPC) is a parallel language that uses a Single Program Multiple Data (SPMD) model of parallelism within a global address space. The global address space is used to simplify programming, especially on applications with irregular data structures that lead to fine-grained sharing between threads. Recent results have shown that the performance of UPC using a commercial compiler is comparable to that of MPI [7]. In this paper we describe a portable open source compiler for UPC. Our goal is to achieve similar performance while enabling easy porting of the compiler and runtime, and also to provide a framework that allows for extensive optimizations. We identify some of the challenges in compiling UPC and use a combination of micro-benchmarks and application kernels to show that our compiler has low overhead for basic operations on shared data and is competitive with, and sometimes faster than, the commercial HP compiler. We also investigate several communication optimizations, and show significant benefits from hand-optimizing the generated code.

134 citations
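
The micro-benchmark methodology mentioned above, timing many repetitions of a single operation to estimate its per-operation overhead, can be sketched as follows. The timed operation here is only a placeholder local store; in the study it would be a shared-data read or write emitted by the UPC-to-C translator, and the repetition count is an arbitrary assumption.

```c
/*
 * Sketch of a per-operation overhead micro-benchmark: time many repetitions
 * of one operation and report the average cost.  The operation below is a
 * placeholder, not an actual UPC shared access.
 */
#include <stdio.h>
#include <time.h>

#define REPS 10000000L   /* assumed repetition count */

int main(void)
{
    volatile double cell = 0.0;        /* placeholder for a shared element */
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < REPS; i++)
        cell = (double)i;              /* placeholder "basic operation"    */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("avg %.2f ns per operation (last value %f)\n", ns / REPS, cell);
    return 0;
}
```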


Cited by
Book ChapterDOI
TL;DR: Open MPI provides a unique combination of novel features previously unavailable in an open-source, production-quality implementation of MPI; its component architecture provides a stable platform for third-party research and enables the run-time composition of independent software add-ons.
Abstract: A large number of MPI implementations are currently available, each of which emphasizes different aspects of high-performance computing or is intended to solve a specific research problem. The result is a myriad of incompatible MPI implementations, all of which require separate installation, and the combination of which presents significant logistical challenges for end users. Building upon prior research, and influenced by experience gained from the code bases of the LAM/MPI, LA-MPI, and FT-MPI projects, Open MPI is an all-new, production-quality MPI-2 implementation that is fundamentally centered around component concepts. Open MPI provides a unique combination of novel features previously unavailable in an open-source, production-quality implementation of MPI. Its component architecture provides a stable platform for third-party research and enables the run-time composition of independent software add-ons. This paper presents a high-level overview of the goals, design, and implementation of Open MPI.

1,603 citations
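
The component concept at the heart of Open MPI can be illustrated with a toy registry, shown below under invented names (this is not Open MPI's actual MCA interface): a framework defines a small function table, several components implement it, and one is selected at run time, here from an environment variable.

```c
/*
 * Toy sketch of a component architecture, with invented names: a framework
 * interface, two components implementing it, and run-time selection.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct transport_component {             /* hypothetical framework interface */
    const char *name;
    int  (*send)(const void *buf, int len);
};

static int send_tcp(const void *buf, int len)   { (void)buf; printf("tcp send %d bytes\n", len);   return 0; }
static int send_shmem(const void *buf, int len) { (void)buf; printf("shmem send %d bytes\n", len); return 0; }

static const struct transport_component components[] = {
    { "tcp",   send_tcp },
    { "shmem", send_shmem },
};

static const struct transport_component *select_component(const char *want)
{
    for (size_t i = 0; i < sizeof components / sizeof components[0]; i++)
        if (!want || strcmp(components[i].name, want) == 0)
            return &components[i];
    return &components[0];                /* fall back to a default component */
}

int main(void)
{
    /* Run-time composition: the user picks a component without rebuilding.
     * TRANSPORT is an invented environment variable for this sketch. */
    const struct transport_component *c = select_component(getenv("TRANSPORT"));
    char msg[] = "hello";
    printf("using component: %s\n", c->name);
    return c->send(msg, (int)sizeof msg);
}
```

Run-time selection of this kind is what allows independent add-ons to be composed without rebuilding the library, which is the property the abstract highlights.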

Journal ArticleDOI
01 Aug 2007
TL;DR: This work offers a candidate list of desirable qualities for a parallel programming language and describes how these qualities are addressed in the design of the Chapel language, providing an overview of Chapel's features and how they help address parallel productivity.
Abstract: In this paper we consider productivity challenges for parallel programmers and explore ways that parallel language design might help improve end-user productivity. We offer a candidate list of desirable qualities for a parallel programming language, and describe how these qualities are addressed in the design of the Chapel language. In doing so, we provide an overview of Chapel's features and how they help address parallel productivity. We also survey current techniques for parallel programming and describe ways in which we consider them to fall short of our idealized productive programming model.

905 citations

Proceedings ArticleDOI
10 Nov 2012
TL;DR: This paper presents Legion, a programming model and runtime system whose runtime dynamically extracts parallelism from Legion programs using a distributed, parallel scheduling algorithm that identifies both independent tasks and nested parallelism.
Abstract: Modern parallel architectures have both heterogeneous processors and deep, complex memory hierarchies. We present Legion, a programming model and runtime system for achieving high performance on these machines. Legion is organized around logical regions, which express both locality and independence of program data, and tasks, functions that perform computations on regions. We describe a runtime system that dynamically extracts parallelism from Legion programs, using a distributed, parallel scheduling algorithm that identifies both independent tasks and nested parallelism. Legion also enables explicit, programmer controlled movement of data through the memory hierarchy and placement of tasks based on locality information via a novel mapping interface. We evaluate our Legion implementation on three applications: fluid-flow on a regular grid, a three-level AMR code solving a heat diffusion equation, and a circuit simulation.

500 citations
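
The vocabulary in the abstract, logical regions and the tasks that operate on them, can be pictured with a toy sketch using invented types (this is not the Legion runtime API): a region names a chunk of program data, and a task declares which region it writes, which is the information a real runtime would use to find independent tasks and place them near their data.

```c
/*
 * Toy sketch of regions and tasks, with invented types; not the Legion API.
 */
#include <stdio.h>

struct region {                     /* hypothetical "logical region"        */
    double *data;
    size_t  len;
};

typedef void (*task_fn)(struct region *rw);

struct task {                       /* hypothetical task descriptor         */
    const char    *name;
    task_fn        fn;
    struct region *writes;          /* region this task modifies            */
};

static void scale_task(struct region *r)
{
    for (size_t i = 0; i < r->len; i++)
        r->data[i] *= 2.0;
}

int main(void)
{
    double a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8};
    struct region ra = { a, 4 }, rb = { b, 4 };
    struct task tasks[] = {
        { "scale_a", scale_task, &ra },
        { "scale_b", scale_task, &rb },
    };

    /* The two tasks write disjoint regions, so a scheduler could run them in
     * parallel; this sketch just runs them sequentially. */
    for (size_t i = 0; i < 2; i++)
        tasks[i].fn(tasks[i].writes);

    printf("%g %g\n", a[0], b[0]);
    return 0;
}
```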

Proceedings ArticleDOI
17 Jun 2007
TL;DR: The authors believe this is the first comprehensive study of proactive fault tolerance in which live migration is actually triggered by health monitoring, and their enhancements make proactive FT a valuable asset for long-running MPI applications that is complementary to reactive FT using full checkpoint/restart schemes.
Abstract: Large-scale parallel computing is relying increasingly on clusters with thousands of processors. At such large counts of compute nodes, faults are becoming commonplace. Current techniques to tolerate faults focus on reactive schemes to recover from faults and generally rely on a checkpoint/restart mechanism. Yet, in today's systems, node failures can often be anticipated by detecting a deteriorating health status. Instead of a reactive scheme for fault tolerance (FT), we are promoting a proactive one where processes automatically migrate from "unhealthy" nodes to healthy ones. Our approach relies on operating system virtualization techniques exemplified by, but not limited to, Xen. This paper contributes an automatic and transparent mechanism for proactive FT for arbitrary MPI applications. It leverages virtualization techniques combined with health monitoring and load-based migration. We exploit Xen's live migration mechanism for a guest operating system (OS) to migrate an MPI task from a health-deteriorating node to a healthy one without stopping the MPI task during most of the migration. Our proactive FT daemon orchestrates the tasks of health monitoring, load determination and initiation of guest OS migration. Experimental results demonstrate that live migration hides migration costs and limits the overhead to only a few seconds, making it an attractive approach to realize FT in HPC systems. Overall, our enhancements make proactive FT a valuable asset for long-running MPI applications that is complementary to reactive FT using full checkpoint/restart schemes, since checkpoint frequencies can be reduced as fewer unanticipated failures are encountered. In the context of OS virtualization, we believe that this is the first comprehensive study of proactive fault tolerance where live migration is actually triggered by health monitoring.

394 citations
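
The daemon loop described above, which monitors node health and triggers a live guest migration before the node fails, can be sketched as follows. The health check is a placeholder, the polling interval is an assumption, and the migration command is shown as xm migrate --live, the classic Xen toolstack syntax, which may differ with other toolstack versions.

```c
/*
 * Illustrative sketch of a proactive-FT daemon loop in the spirit of the
 * paper above: poll node health and, when it deteriorates, live-migrate the
 * guest (and the MPI task inside it) to a healthy node.  The health check is
 * a placeholder, and the exact migration command depends on the Xen
 * toolstack version ("xm migrate --live" shown here as an assumption).
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static int node_health_deteriorating(void)
{
    /* Placeholder: the paper's daemon consults hardware health monitoring
     * (e.g. temperatures, fan speeds, error counts). */
    return 0;
}

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <guest-domain> <target-host>\n", argv[0]);
        return 1;
    }
    for (;;) {
        if (node_health_deteriorating()) {
            char cmd[256];
            /* Move the guest off this node without stopping the MPI task. */
            snprintf(cmd, sizeof cmd,
                     "xm migrate --live %s %s", argv[1], argv[2]);
            return system(cmd) == 0 ? 0 : 1;
        }
        sleep(5);   /* assumed polling interval */
    }
}
```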