Proceedings Article

GPFS: A Shared-Disk File System for Large Computing Clusters

Frank B. Schmuck1, Roger L. Haskin1
28 Jan 2002-pp 231-244
TL;DR: GPFS is IBM's parallel, shared-disk file system for cluster computers, available on the RS/6000 SP parallel supercomputer and on Linux clusters; the paper describes GPFS and discusses how distributed locking and recovery techniques were extended to scale to large clusters.
Abstract: GPFS is IBM's parallel, shared-disk file system for cluster computers, available on the RS/6000 SP parallel supercomputer and on Linux clusters. GPFS is used on many of the largest supercomputers in the world. GPFS was built on many of the ideas that were developed in the academic community over the last several years, particularly distributed locking and recovery technology. To date it has been a matter of conjecture how well these ideas scale. We have had the opportunity to test those limits in the context of a product that runs on the largest systems in existence. While in many cases existing ideas scaled well, new approaches were necessary in many key areas. This paper describes GPFS, and discusses how distributed locking and recovery techniques were extended to scale to large clusters.
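A minimal sketch of the byte-range locking idea the abstract alludes to: a token manager grants byte-range write tokens to nodes and revokes only the conflicting portions held elsewhere, so writers to non-overlapping parts of a file proceed without further coordination. The class and method names are hypothetical illustrations, not IBM's implementation.

# Minimal, hypothetical sketch of byte-range token negotiation in the spirit of
# GPFS's distributed locking; an illustration of the idea, not GPFS code.

class ByteRangeTokenManager:
    def __init__(self):
        # node_id -> list of (start, end) half-open write ranges it holds
        self.tokens = {}

    def acquire(self, node_id, start, end):
        """Grant [start, end) to node_id, revoking conflicting ranges elsewhere."""
        revoked = []
        for other, ranges in self.tokens.items():
            if other == node_id:
                continue
            kept = []
            for (s, e) in ranges:
                if s < end and start < e:          # ranges overlap -> conflict
                    revoked.append((other, (s, e)))
                    # keep the non-overlapping pieces of the other node's token
                    if s < start:
                        kept.append((s, start))
                    if end < e:
                        kept.append((end, e))
                else:
                    kept.append((s, e))
            self.tokens[other] = kept
        self.tokens.setdefault(node_id, []).append((start, end))
        return revoked                              # caller would flush/invalidate these


mgr = ByteRangeTokenManager()
mgr.acquire("node1", 0, 1 << 20)                    # node1 writes the first 1 MiB
print(mgr.acquire("node2", 512 * 1024, 2 << 20))    # node2's request revokes only the overlap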


Citations
Journal ArticleDOI
19 Oct 2003
TL;DR: This paper presents file system interface extensions designed to support distributed applications, discusses many aspects of the design, and reports measurements from both micro-benchmarks and real world use.
Abstract: We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients. While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to reexamine traditional choices and explore radically different design points. The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients. In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real world use.

5,429 citations


Cites methods from "GPFS: A Shared-Disk File System for..."

  • ...GPFS: A shared-disk file system for large computing clusters....

  • ...Some distributed file systems like Frangipani, xFS, Minnesota's GFS [11] and GPFS [10] remove the centralized server and rely on distributed algorithms for consistency and management....

Journal ArticleDOI
17 Aug 2008
TL;DR: This paper shows how to leverage largely commodity Ethernet switches to support the full aggregate bandwidth of clusters consisting of tens of thousands of elements and argues that appropriately architected and interconnected commodity switches may deliver more performance at less cost than available from today's higher-end solutions.
Abstract: Today's data centers may contain tens of thousands of computers with significant aggregate bandwidth requirements. The network architecture typically consists of a tree of routing and switching elements with progressively more specialized and expensive equipment moving up the network hierarchy. Unfortunately, even when deploying the highest-end IP switches/routers, resulting topologies may only support 50% of the aggregate bandwidth available at the edge of the network, while still incurring tremendous cost. Non-uniform bandwidth among data center nodes complicates application design and limits overall system performance. In this paper, we show how to leverage largely commodity Ethernet switches to support the full aggregate bandwidth of clusters consisting of tens of thousands of elements. Similar to how clusters of commodity computers have largely replaced more specialized SMPs and MPPs, we argue that appropriately architected and interconnected commodity switches may deliver more performance at less cost than available from today's higher-end solutions. Our approach requires no modifications to the end host network interface, operating system, or applications; critically, it is fully backward compatible with Ethernet, IP, and TCP.
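The "tens of thousands of elements" figure follows from the fat-tree arithmetic the paper builds on: k-port switches form k pods with (k/2)^2 core switches and support k^3/4 hosts at full bisection bandwidth. The sketch below is only a sizing aid under that standard construction.

# Back-of-the-envelope k-ary fat-tree sizing (assumes the standard construction:
# k pods, k/2 edge + k/2 aggregation switches per pod, (k/2)^2 core switches).
def fat_tree_size(k: int) -> dict:
    return {
        "pods": k,
        "edge_switches": k * k // 2,
        "aggregation_switches": k * k // 2,
        "core_switches": (k // 2) ** 2,
        "hosts": k ** 3 // 4,
    }

print(fat_tree_size(48))   # 48-port switches -> 27,648 hosts at full bisection bandwidth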

3,549 citations

Proceedings ArticleDOI
06 Nov 2006
TL;DR: Performance measurements under a variety of workloads show that Ceph has excellent I/O performance and scalable metadata management, supporting more than 250,000 metadata operations per second.
Abstract: We have developed Ceph, a distributed file system that provides excellent performance, reliability, and scalability. Ceph maximizes the separation between data and metadata management by replacing allocation tables with a pseudo-random data distribution function (CRUSH) designed for heterogeneous and dynamic clusters of unreliable object storage devices (OSDs). We leverage device intelligence by distributing data replication, failure detection and recovery to semi-autonomous OSDs running a specialized local object file system. A dynamic distributed metadata cluster provides extremely efficient metadata management and seamlessly adapts to a wide range of general purpose and scientific computing file system workloads. Performance measurements under a variety of workloads show that Ceph has excellent I/O performance and scalable metadata management, supporting more than 250,000 metadata operations per second.
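The key idea behind replacing allocation tables with a placement function is that any client can compute an object's location from its name alone. The sketch below illustrates that idea with a simple hash-based ranking; it is not the CRUSH algorithm, which additionally accounts for device weights, failure domains, and cluster-map changes.

import hashlib

# Illustrative deterministic pseudo-random placement: rank OSDs by a per-object
# hash and take the top replicas. NOT CRUSH, just the table-free placement idea.
def place(object_name: str, osds: list[str], replicas: int = 3) -> list[str]:
    ranked = sorted(
        osds,
        key=lambda osd: hashlib.sha1(f"{object_name}:{osd}".encode()).hexdigest(),
    )
    return ranked[:replicas]

osds = [f"osd.{i}" for i in range(12)]
print(place("inode123.chunk0", osds))   # same answer on every client, no lookup table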

1,621 citations


Cites background from "GPFS: A Shared-Disk File System for..."

  • ...GPFS: A shared-disk file system for large computing clusters....

  • ...GPFS [22] and StorageTank [14] partially decouple metadata and data management, but are limited by their use of block-based disks and their metadata distribution architecture....

Journal ArticleDOI
TL;DR: This paper discusses approaches and environments for carrying out analytics on Clouds for Big Data applications, and identifies possible gaps in technology and provides recommendations for the research community on future directions on Cloud-supported Big Data computing and analytics solutions.

773 citations


Cites background from "GPFS: A Shared-Disk File System for..."

  • ...How to optimise resource usage and energy consumption when executing the analytics application?...


Book
30 Apr 2010
TL;DR: This half-day tutorial introduces participants to data-intensive text processing with the MapReduce programming model using the open-source Hadoop implementation, with a focus on scalability and the tradeoffs associated with distributed processing of large datasets.
Abstract: This half-day tutorial introduces participants to data-intensive text processing with the MapReduce programming model [1], using the open-source Hadoop implementation. The focus will be on scalability and the tradeoffs associated with distributed processing of large datasets. Content will include general discussions about algorithm design, presentation of illustrative algorithms, case studies in HLT applications, as well as practical advice in writing Hadoop programs and running Hadoop clusters.
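As a concrete illustration of the programming model the tutorial covers, the toy word count below simulates the map, group-by-key, and reduce phases locally in Python; a real job would be expressed with Hadoop's APIs.

from collections import defaultdict

# Toy word count in the MapReduce style: a mapper emits (word, 1) pairs, the
# framework groups by key, and a reducer sums the counts. Local simulation only.
def mapper(line):
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    return word, sum(counts)

def run(lines):
    groups = defaultdict(list)
    for line in lines:                      # map phase
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(k, v) for k, v in groups.items())   # reduce phase

print(run(["the quick brown fox", "the lazy dog"]))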

538 citations


Cites background from "GPFS: A Shared-Disk File System for..."

  • ...Of course, distributed file systems are not new [74, 32, 7, 147, 133]....


References
Book ChapterDOI
Jim Gray1
01 Jan 1978
TL;DR: This paper is a compendium of data base management operating systems folklore and focuses on particular issues unique to the transaction management component, especially locking and recovery.
Abstract: This paper is a compendium of data base management operating systems folklore. It is an early paper and is still in draft form. It is intended as a set of course notes for a class on data base operating systems. After a brief overview of what a data management system is, it focuses on particular issues unique to the transaction management component, especially locking and recovery.

1,635 citations

Proceedings ArticleDOI
01 Sep 1996
TL;DR: This paper describes the design, implementation, and performance of Petal, a system that attempts to approximate the ideal storage system in practice through a novel combination of features.
Abstract: The ideal storage system is globally accessible, always available, provides unlimited performance and capacity for a large number of clients, and requires no management. This paper describes the design, implementation, and performance of Petal, a system that attempts to approximate this ideal in practice through a novel combination of features. Petal consists of a collection of network-connected servers that cooperatively manage a pool of physical disks. To a Petal client, this collection appears as a highly available block-level storage system that provides large abstract containers called virtual disks. A virtual disk is globally accessible to all Petal clients on the network. A client can create a virtual disk on demand to tap the entire capacity and performance of the underlying physical resources. Furthermore, additional resources, such as servers and disks, can be automatically incorporated into Petal. We have an initial Petal prototype consisting of four 225 MHz DEC 3000/700 workstations running Digital Unix and connected by a 155 Mbit/s ATM network. The prototype provides clients with virtual disks that tolerate and recover from disk, server, and network failures. Latency is comparable to a locally attached disk, and throughput scales with the number of servers. The prototype can achieve I/O rates of up to 3150 requests/sec and bandwidth up to 43.1 Mbytes/sec.
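A rough sketch of the virtual-disk abstraction described above: clients address a large virtual block space, and a mapping layer translates each virtual block to a (server, physical block) pair. The round-robin striping and names below are hypothetical illustrations, not Petal's actual data structures or protocol.

# Hypothetical virtual-disk mapping: translate a virtual block number to a
# (server, physical block) location, striping across servers. Illustration only.
class VirtualDisk:
    def __init__(self, servers, stripe_unit_blocks=64):
        self.servers = servers
        self.stripe = stripe_unit_blocks

    def locate(self, virtual_block: int):
        stripe_index = virtual_block // self.stripe
        server = self.servers[stripe_index % len(self.servers)]
        physical_block = (stripe_index // len(self.servers)) * self.stripe + (
            virtual_block % self.stripe
        )
        return server, physical_block

vdisk = VirtualDisk(["petal0", "petal1", "petal2", "petal3"])
print(vdisk.locate(1000))   # -> ('petal3', 232)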

725 citations


"GPFS: A Shared-Disk File System for..." refers methods in this paper

  • ...A Frangipani file system resides on a single, large (2^64 byte) virtual disk provided by Petal [19], which redirects I/O requests to a set of Petal servers and handles physical storage allocation and striping....

  • ...The granularity of disk space allocation (64 kB) in Petal, however, is too large and its virtual address space is too small to simply reserve a fixed, contiguous virtual disk area (e.g., 1 TB) for each file in a Frangipani file system....

  • ...Therefore, accessing large files in GFS entails significantly more locking overhead than the byte-range locks used in GPFS. Similar to Frangipani/Petal, striping in GFS is handled in a "Network Storage Pool" layer; once created, however, the stripe width cannot be changed (it is possible to add new "sub-pools", but striping is confined to a sub-pool, i.e., GFS will not stripe across sub-pools)....

  • ...Therefore, Frangipani still needs its own allocation maps to manage the virtual disk space provided by Petal....

Journal ArticleDOI
TL;DR: This work studies, by analysis and simulation, the performance of extendible hashing and indicates that it provides an attractive alternative to other access methods, such as balanced trees.
Abstract: Extendible hashing is a new access technique, in which the user is guaranteed no more than two page faults to locate the data associated with a given unique identifier, or key. Unlike conventional hashing, extendible hashing has a dynamic structure that grows and shrinks gracefully as the database grows and shrinks. This approach simultaneously solves the problem of making hash tables that are extendible and of making radix search trees that are balanced. We study, by analysis and simulation, the performance of extendible hashing. The results indicate that extendible hashing provides an attractive alternative to other access methods, such as balanced trees.
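A compact in-memory sketch of extendible hashing as described in the abstract: a directory of 2^global_depth slots points at buckets, a full bucket splits, and the directory doubles only when the splitting bucket's local depth equals the global depth. The bucket capacity and the use of Python's built-in hash are illustrative choices.

# Illustrative in-memory extendible hashing (not the on-disk variant the paper analyzes).
class Bucket:
    def __init__(self, local_depth, capacity=4):
        self.local_depth = local_depth
        self.capacity = capacity
        self.items = {}

class ExtendibleHash:
    def __init__(self, capacity=4):
        self.global_depth = 1
        self.capacity = capacity
        self.directory = [Bucket(1, capacity), Bucket(1, capacity)]

    def _index(self, key):
        return hash(key) & ((1 << self.global_depth) - 1)   # low-order bits select a slot

    def insert(self, key, value):
        bucket = self.directory[self._index(key)]
        if key in bucket.items or len(bucket.items) < bucket.capacity:
            bucket.items[key] = value
            return
        self._split(bucket)
        self.insert(key, value)                              # retry after the split

    def _split(self, bucket):
        if bucket.local_depth == self.global_depth:          # directory must double
            self.directory += self.directory
            self.global_depth += 1
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth, bucket.capacity)
        bit = 1 << (bucket.local_depth - 1)                  # new distinguishing bit
        moved = {k: v for k, v in bucket.items.items() if hash(k) & bit}
        for k in moved:
            del bucket.items[k]
        sibling.items = moved
        # repoint the directory slots that should now refer to the sibling
        for i, b in enumerate(self.directory):
            if b is bucket and (i & bit):
                self.directory[i] = sibling

    def lookup(self, key):
        return self.directory[self._index(key)].items.get(key)

table = ExtendibleHash()
for name in ["alpha", "beta", "gamma", "delta", "epsilon", "zeta"]:
    table.insert(name, len(name))
print(table.lookup("gamma"))   # -> 5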

709 citations


"GPFS: A Shared-Disk File System for..." refers methods in this paper

  • ...To support efficient file name lookup in very large directories (millions of files), GPFS uses extensible hashing [6] to organize directory entries within a directory....


Proceedings ArticleDOI
01 Oct 1997
TL;DR: Initial measurements indicate that Frangipani has excellent single-server performance and scales well as servers are added, and can be exported to untrusted machines using ordinary network file access protocols.
Abstract: The ideal distributed file system would provide all its users with coherent, shared access to the same set of files, yet would be arbitrarily scalable to provide more storage space and higher performance to a growing user community. It would be highly available in spite of component failures. It would require minimal human administration, and administration would not become more complex as more components were added. Frangipani is a new file system that approximates this ideal, yet was relatively easy to build because of its two-layer structure. The lower layer is Petal (described in an earlier paper), a distributed storage service that provides incrementally scalable, highly available, automatically managed virtual disks. In the upper layer, multiple machines run the same Frangipani file system code on top of a shared Petal virtual disk, using a distributed lock service to ensure coherence. Frangipani is meant to run in a cluster of machines that are under a common administration and can communicate securely. Thus the machines trust one another and the shared virtual disk approach is practical. Of course, a Frangipani file system can be exported to untrusted machines using ordinary network file access protocols. We have implemented Frangipani on a collection of Alphas running DIGITAL Unix 4.0. Initial measurements indicate that Frangipani has excellent single-server performance and scales well as servers are added.

579 citations


"GPFS: A Shared-Disk File System for..." refers methods in this paper

  • ...Frangipani [18] is a shared-disk cluster file system that is similar in principle to GPFS....


Proceedings Article
22 Jan 1996
TL;DR: The architecture and design of a new file system, XFS, for Silicon Graphics' IRIX operating system is described, and the use of B+ trees in place of many of the more traditional linear file system structures is discussed.
Abstract: In this paper we describe the architecture and design of a new file system, XFS, for Silicon Graphics' IRIX operating system. It is a general purpose file system for use on both workstations and servers. The focus of the paper is on the mechanisms used by XFS to scale capacity and performance in supporting very large file systems. The large file system support includes mechanisms for managing large files, large numbers of files, large directories, and very high performance I/O. In discussing the mechanisms used for scalability we include both descriptions of the XFS on-disk data structures and analyses of why they were chosen. We discuss in detail our use of B+ trees in place of many of the more traditional linear file system structures. XFS has been shipping to customers since December of 1994 in a version of IRIX 5.3, and we are continuing to improve its performance and add features in upcoming releases. We include performance results from running on the latest version of XFS to demonstrate the viability of our design.
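To illustrate why extent maps indexed by file offset (which XFS keeps in B+ trees) beat linear block lists, the sketch below resolves a file block to a disk block with a binary search over extent start offsets. It is a simplified illustration, not XFS's on-disk format.

import bisect

# Each extent maps a contiguous range of file blocks to contiguous disk blocks,
# so a lookup is a binary search rather than a scan of per-block pointers.
extents = [
    # (file_block_start, length, disk_block_start)
    (0,    128, 10_000),
    (128,  512, 42_000),
    (640,  256, 91_500),
]
starts = [e[0] for e in extents]

def file_block_to_disk_block(file_block):
    i = bisect.bisect_right(starts, file_block) - 1
    if i < 0:
        return None
    start, length, disk_start = extents[i]
    if file_block < start + length:
        return disk_start + (file_block - start)
    return None   # hole in a sparse file

print(file_block_to_disk_block(700))   # -> 91560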

458 citations


"GPFS: A Shared-Disk File System for..." refers background in this paper

  • ...SGI’s XFS file system [16] is designed for similar, large-scale, high throughput applications that GPFS excels at....
