Proceedings Article

GPFS: A Shared-Disk File System for Large Computing Clusters

Frank B. Schmuck1, Roger L. Haskin1
28 Jan 2002-pp 231-244
TL;DR: GPFS is IBM's parallel, shared-disk file system for cluster computers, available on the RS/6000 SP parallel supercomputer and on Linux clusters; the paper describes GPFS and discusses how distributed locking and recovery techniques were extended to scale to large clusters.
Abstract: GPFS is IBM's parallel, shared-disk file system for cluster computers, available on the RS/6000 SP parallel supercomputer and on Linux clusters. GPFS is used on many of the largest supercomputers in the world. GPFS was built on many of the ideas that were developed in the academic community over the last several years, particularly distributed locking and recovery technology. To date it has been a matter of conjecture how well these ideas scale. We have had the opportunity to test those limits in the context of a product that runs on the largest systems in existence. While in many cases existing ideas scaled well, new approaches were necessary in many key areas. This paper describes GPFS, and discusses how distributed locking and recovery techniques were extended to scale to large clusters.
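A minimal sketch of the byte-range locking idea the abstract alludes to: a token manager grants byte-range write tokens to nodes and revokes only the conflicting portions held elsewhere, so writers to non-overlapping parts of a file proceed without further coordination. The class and method names are hypothetical illustrations, not IBM's implementation.

# Minimal, hypothetical sketch of byte-range token negotiation in the spirit of
# GPFS's distributed locking; an illustration of the idea, not GPFS code.

class ByteRangeTokenManager:
    def __init__(self):
        # node_id -> list of (start, end) half-open write ranges it holds
        self.tokens = {}

    def acquire(self, node_id, start, end):
        """Grant [start, end) to node_id, revoking conflicting ranges elsewhere."""
        revoked = []
        for other, ranges in self.tokens.items():
            if other == node_id:
                continue
            kept = []
            for (s, e) in ranges:
                if s < end and start < e:          # ranges overlap -> conflict
                    revoked.append((other, (s, e)))
                    # keep the non-overlapping pieces of the other node's token
                    if s < start:
                        kept.append((s, start))
                    if end < e:
                        kept.append((end, e))
                else:
                    kept.append((s, e))
            self.tokens[other] = kept
        self.tokens.setdefault(node_id, []).append((start, end))
        return revoked                              # caller would flush/invalidate these


mgr = ByteRangeTokenManager()
mgr.acquire("node1", 0, 1 << 20)                    # node1 writes the first 1 MiB
print(mgr.acquire("node2", 512 * 1024, 2 << 20))    # node2's request revokes only the overlap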


Citations
Journal ArticleDOI
19 Oct 2003
TL;DR: This paper presents file system interface extensions designed to support distributed applications, discusses many aspects of the design, and reports measurements from both micro-benchmarks and real world use.
Abstract: We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients. While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to reexamine traditional choices and explore radically different design points. The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients. In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real world use.

5,429 citations


Cites methods from "GPFS: A Shared-Disk File System for..."

  • ...GPFS: A shared-disk file system for large computing clusters....

  • ...Some distributed file systems like Frangipani, xFS, Minnesota's GFS [11] and GPFS [10] remove the centralized server and rely on distributed algorithms for consistency and management....

Journal ArticleDOI
17 Aug 2008
TL;DR: This paper shows how to leverage largely commodity Ethernet switches to support the full aggregate bandwidth of clusters consisting of tens of thousands of elements and argues that appropriately architected and interconnected commodity switches may deliver more performance at less cost than available from today's higher-end solutions.
Abstract: Today's data centers may contain tens of thousands of computers with significant aggregate bandwidth requirements. The network architecture typically consists of a tree of routing and switching elements with progressively more specialized and expensive equipment moving up the network hierarchy. Unfortunately, even when deploying the highest-end IP switches/routers, resulting topologies may only support 50% of the aggregate bandwidth available at the edge of the network, while still incurring tremendous cost. Non-uniform bandwidth among data center nodes complicates application design and limits overall system performance. In this paper, we show how to leverage largely commodity Ethernet switches to support the full aggregate bandwidth of clusters consisting of tens of thousands of elements. Similar to how clusters of commodity computers have largely replaced more specialized SMPs and MPPs, we argue that appropriately architected and interconnected commodity switches may deliver more performance at less cost than available from today's higher-end solutions. Our approach requires no modifications to the end host network interface, operating system, or applications; critically, it is fully backward compatible with Ethernet, IP, and TCP.
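The "tens of thousands of elements" figure follows from the fat-tree arithmetic the paper builds on: k-port switches form k pods with (k/2)^2 core switches and support k^3/4 hosts at full bisection bandwidth. The sketch below is only a sizing aid under that standard construction.

# Back-of-the-envelope k-ary fat-tree sizing (assumes the standard construction:
# k pods, k/2 edge + k/2 aggregation switches per pod, (k/2)^2 core switches).
def fat_tree_size(k: int) -> dict:
    return {
        "pods": k,
        "edge_switches": k * k // 2,
        "aggregation_switches": k * k // 2,
        "core_switches": (k // 2) ** 2,
        "hosts": k ** 3 // 4,
    }

print(fat_tree_size(48))   # 48-port switches -> 27,648 hosts at full bisection bandwidth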

3,549 citations

Proceedings ArticleDOI
06 Nov 2006
TL;DR: Performance measurements under a variety of workloads show that Ceph has excellent I/O performance and scalable metadata management, supporting more than 250,000 metadata operations per second.
Abstract: We have developed Ceph, a distributed file system that provides excellent performance, reliability, and scalability. Ceph maximizes the separation between data and metadata management by replacing allocation tables with a pseudo-random data distribution function (CRUSH) designed for heterogeneous and dynamic clusters of unreliable object storage devices (OSDs). We leverage device intelligence by distributing data replication, failure detection and recovery to semi-autonomous OSDs running a specialized local object file system. A dynamic distributed metadata cluster provides extremely efficient metadata management and seamlessly adapts to a wide range of general purpose and scientific computing file system workloads. Performance measurements under a variety of workloads show that Ceph has excellent I/O performance and scalable metadata management, supporting more than 250,000 metadata operations per second.
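The key idea behind replacing allocation tables with a placement function is that any client can compute an object's location from its name alone. The sketch below illustrates that idea with a simple hash-based ranking; it is not the CRUSH algorithm, which additionally accounts for device weights, failure domains, and cluster-map changes.

import hashlib

# Illustrative deterministic pseudo-random placement: rank OSDs by a per-object
# hash and take the top replicas. NOT CRUSH, just the table-free placement idea.
def place(object_name: str, osds: list[str], replicas: int = 3) -> list[str]:
    ranked = sorted(
        osds,
        key=lambda osd: hashlib.sha1(f"{object_name}:{osd}".encode()).hexdigest(),
    )
    return ranked[:replicas]

osds = [f"osd.{i}" for i in range(12)]
print(place("inode123.chunk0", osds))   # same answer on every client, no lookup table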

1,621 citations


Cites background from "GPFS: A Shared-Disk File System for..."

  • ...GPFS: A shared-disk file system for large computing clusters....

  • ...GPFS [22] and StorageTank [14] partially decouple metadata and data management, but are limited by their use of block-based disks and their metadata distribution architecture....

Journal ArticleDOI
TL;DR: This paper discusses approaches and environments for carrying out analytics on Clouds for Big Data applications, and identifies possible gaps in technology and provides recommendations for the research community on future directions on Cloud-supported Big Data computing and analytics solutions.

773 citations


Cites background from "GPFS: A Shared-Disk File System for..."

  • ...How to optimise resource usage and energy consumption when executing the analytics application?...


Book
30 Apr 2010
TL;DR: This half-day tutorial introduces participants to data-intensive text processing with the MapReduce programming model using the open-source Hadoop implementation, with a focus on scalability and the tradeoffs associated with distributed processing of large datasets.
Abstract: This half-day tutorial introduces participants to data-intensive text processing with the MapReduce programming model [1], using the open-source Hadoop implementation. The focus will be on scalability and the tradeoffs associated with distributed processing of large datasets. Content will include general discussions about algorithm design, presentation of illustrative algorithms, case studies in HLT applications, as well as practical advice in writing Hadoop programs and running Hadoop clusters.
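As a concrete illustration of the programming model the tutorial covers, the toy word count below simulates the map, group-by-key, and reduce phases locally in Python; a real job would be expressed with Hadoop's APIs.

from collections import defaultdict

# Toy word count in the MapReduce style: a mapper emits (word, 1) pairs, the
# framework groups by key, and a reducer sums the counts. Local simulation only.
def mapper(line):
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    return word, sum(counts)

def run(lines):
    groups = defaultdict(list)
    for line in lines:                      # map phase
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(k, v) for k, v in groups.items())   # reduce phase

print(run(["the quick brown fox", "the lazy dog"]))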

538 citations


Cites background from "GPFS: A Shared-Disk File System for..."

  • ...Of course, distributed file systems are not new [74, 32, 7, 147, 133]....


References
Book ChapterDOI
Jim Gray1
01 Jan 1978
TL;DR: This paper is a compendium of data base management operating systems folklore and focuses on particular issues unique to the transaction management component, especially locking and recovery.
Abstract: This paper is a compendium of data base management operating systems folklore. It is an early paper and is still in draft form. It is intended as a set of course notes for a class on data base operating systems. After a brief overview of what a data management system is, it focuses on particular issues unique to the transaction management component, especially locking and recovery.

1,635 citations

Proceedings ArticleDOI
01 Sep 1996
TL;DR: This paper describes the design, implementation, and performance of Petal, a system that attempts to approximate the ideal storage system in practice through a novel combination of features.
Abstract: The ideal storage system is globally accessible, always available, provides unlimited performance and capacity for a large number of clients, and requires no management. This paper describes the design, implementation, and performance of Petal, a system that attempts to approximate this ideal in practice through a novel combination of features. Petal consists of a collection of network-connected servers that cooperatively manage a pool of physical disks. To a Petal client, this collection appears as a highly available block-level storage system that provides large abstract containers called virtual disks. A virtual disk is globally accessible to all Petal clients on the network. A client can create a virtual disk on demand to tap the entire capacity and performance of the underlying physical resources. Furthermore, additional resources, such as servers and disks, can be automatically incorporated into Petal. We have an initial Petal prototype consisting of four 225 MHz DEC 3000/700 workstations running Digital Unix and connected by a 155 Mbit/s ATM network. The prototype provides clients with virtual disks that tolerate and recover from disk, server, and network failures. Latency is comparable to a locally attached disk, and throughput scales with the number of servers. The prototype can achieve I/O rates of up to 3150 requests/sec and bandwidth up to 43.1 Mbytes/sec.
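A rough sketch of the virtual-disk abstraction described above: clients address a large virtual block space, and a mapping layer translates each virtual block to a (server, physical block) pair. The round-robin striping and names below are hypothetical illustrations, not Petal's actual data structures or protocol.

# Hypothetical virtual-disk mapping: translate a virtual block number to a
# (server, physical block) location, striping across servers. Illustration only.
class VirtualDisk:
    def __init__(self, servers, stripe_unit_blocks=64):
        self.servers = servers
        self.stripe = stripe_unit_blocks

    def locate(self, virtual_block: int):
        stripe_index = virtual_block // self.stripe
        server = self.servers[stripe_index % len(self.servers)]
        physical_block = (stripe_index // len(self.servers)) * self.stripe + (
            virtual_block % self.stripe
        )
        return server, physical_block

vdisk = VirtualDisk(["petal0", "petal1", "petal2", "petal3"])
print(vdisk.locate(1000))   # -> ('petal3', 232)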

725 citations


"GPFS: A Shared-Disk File System for..." refers methods in this paper

  • ...A Frangipani file system resides on a single, large (2^64 byte) virtual disk provided by Petal [19], which redirects I/O requests to a set of Petal servers and handles physical storage allocation and striping....

  • ...The granularity of disk space allocation (64 kB) in Petal, however, is too large and its virtual address space is too small to simply reserve a fixed, contiguous virtual disk area (e.g., 1 TB) for each file in a Frangipani file system....

  • ...Therefore, accessing large files in GFS entails significantly more locking overhead than the byte-range locks used in GPFS. Similar to Frangipani/Petal, striping in GFS is handled in a "Network Storage Pool" layer; once created, however, the stripe width cannot be changed (it is possible to add new "sub-pools", but striping is confined to a sub-pool, i.e., GFS will not stripe across sub-pools)....

  • ...Therefore, Frangipani still needs its own allocation maps to manage the virtual disk space provided by Petal....

Journal ArticleDOI
TL;DR: This work studies, by analysis and simulation, the performance of extendible hashing and indicates that it provides an attractive alternative to other access methods, such as balanced trees.
Abstract: Extendible hashing is a new access technique, in which the user is guaranteed no more than two page faults to locate the data associated with a given unique identifier, or key. Unlike conventional hashing, extendible hashing has a dynamic structure that grows and shrinks gracefully as the database grows and shrinks. This approach simultaneously solves the problem of making hash tables that are extendible and of making radix search trees that are balanced. We study, by analysis and simulation, the performance of extendible hashing. The results indicate that extendible hashing provides an attractive alternative to other access methods, such as balanced trees.
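A compact in-memory sketch of extendible hashing as described in the abstract: a directory of 2^global_depth slots points at buckets, a full bucket splits, and the directory doubles only when the splitting bucket's local depth equals the global depth. The bucket capacity and the use of Python's built-in hash are illustrative choices.

# Illustrative in-memory extendible hashing (not the on-disk variant the paper analyzes).
class Bucket:
    def __init__(self, local_depth, capacity=4):
        self.local_depth = local_depth
        self.capacity = capacity
        self.items = {}

class ExtendibleHash:
    def __init__(self, capacity=4):
        self.global_depth = 1
        self.capacity = capacity
        self.directory = [Bucket(1, capacity), Bucket(1, capacity)]

    def _index(self, key):
        return hash(key) & ((1 << self.global_depth) - 1)   # low-order bits select a slot

    def insert(self, key, value):
        bucket = self.directory[self._index(key)]
        if key in bucket.items or len(bucket.items) < bucket.capacity:
            bucket.items[key] = value
            return
        self._split(bucket)
        self.insert(key, value)                              # retry after the split

    def _split(self, bucket):
        if bucket.local_depth == self.global_depth:          # directory must double
            self.directory += self.directory
            self.global_depth += 1
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth, bucket.capacity)
        bit = 1 << (bucket.local_depth - 1)                  # new distinguishing bit
        moved = {k: v for k, v in bucket.items.items() if hash(k) & bit}
        for k in moved:
            del bucket.items[k]
        sibling.items = moved
        # repoint the directory slots that should now refer to the sibling
        for i, b in enumerate(self.directory):
            if b is bucket and (i & bit):
                self.directory[i] = sibling

    def lookup(self, key):
        return self.directory[self._index(key)].items.get(key)

table = ExtendibleHash()
for name in ["alpha", "beta", "gamma", "delta", "epsilon", "zeta"]:
    table.insert(name, len(name))
print(table.lookup("gamma"))   # -> 5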

709 citations


"GPFS: A Shared-Disk File System for..." refers methods in this paper

  • ...To support efficient file name lookup in very large directories (millions of files), GPFS uses extensible hashing [6] to organize directory entries within a directory....


Proceedings ArticleDOI
01 Oct 1997
TL;DR: Initial measurements indicate that Frangipani has excellent single-server performance and scales well as servers are added, and can be exported to untrusted machines using ordinary network file access protocols.
Abstract: The ideal distributed file system would provide all its users with coherent, shared access to the same set of files, yet would be arbitrarily scalable to provide more storage space and higher performance to a growing user community. It would be highly available in spite of component failures. It would require minimal human administration, and administration would not become more complex as more components were added. Frangipani is a new file system that approximates this ideal, yet was relatively easy to build because of its two-layer structure. The lower layer is Petal (described in an earlier paper), a distributed storage service that provides incrementally scalable, highly available, automatically managed virtual disks. In the upper layer, multiple machines run the same Frangipani file system code on top of a shared Petal virtual disk, using a distributed lock service to ensure coherence. Frangipani is meant to run in a cluster of machines that are under a common administration and can communicate securely. Thus the machines trust one another and the shared virtual disk approach is practical. Of course, a Frangipani file system can be exported to untrusted machines using ordinary network file access protocols. We have implemented Frangipani on a collection of Alphas running DIGITAL Unix 4.0. Initial measurements indicate that Frangipani has excellent single-server performance and scales well as servers are added.

579 citations


"GPFS: A Shared-Disk File System for..." refers methods in this paper

  • ...Frangipani [18] is a shared-disk cluster file system that is similar in principle to GPFS....


Proceedings Article
22 Jan 1996
TL;DR: The architecture and design of a new file system, XFS, for Silicon Graphics' IRIX operating system is described, and the use of B+ trees in place of many of the more traditional linear file system structures is discussed.
Abstract: In this paper we describe the architecture and design of a new file system, XFS, for Silicon Graphics' IRIX operating system. It is a general purpose file system for use on both workstations and servers. The focus of the paper is on the mechanisms used by XFS to scale capacity and performance in supporting very large file systems. The large file system support includes mechanisms for managing large files, large numbers of files, large directories, and very high performance I/O. In discussing the mechanisms used for scalability we include both descriptions of the XFS on-disk data structures and analyses of why they were chosen. We discuss in detail our use of B+ trees in place of many of the more traditional linear file system structures. XFS has been shipping to customers since December of 1994 in a version of IRIX 5.3, and we are continuing to improve its performance and add features in upcoming releases. We include performance results from running on the latest version of XFS to demonstrate the viability of our design.
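To illustrate why extent maps indexed by file offset (which XFS keeps in B+ trees) beat linear block lists, the sketch below resolves a file block to a disk block with a binary search over extent start offsets. It is a simplified illustration, not XFS's on-disk format.

import bisect

# Each extent maps a contiguous range of file blocks to contiguous disk blocks,
# so a lookup is a binary search rather than a scan of per-block pointers.
extents = [
    # (file_block_start, length, disk_block_start)
    (0,    128, 10_000),
    (128,  512, 42_000),
    (640,  256, 91_500),
]
starts = [e[0] for e in extents]

def file_block_to_disk_block(file_block):
    i = bisect.bisect_right(starts, file_block) - 1
    if i < 0:
        return None
    start, length, disk_start = extents[i]
    if file_block < start + length:
        return disk_start + (file_block - start)
    return None   # hole in a sparse file

print(file_block_to_disk_block(700))   # -> 91560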

458 citations


"GPFS: A Shared-Disk File System for..." refers background in this paper

  • ...SGI’s XFS file system [16] is designed for similar, large-scale, high throughput applications that GPFS excels at....
