Author
Frank B. Schmuck
Bio: Frank B. Schmuck is a researcher at IBM. His work centers on topics such as file systems and stub files. He has an h-index of 40 and has co-authored 120 publications receiving 5,981 citations.
Papers
IBM
TL;DR: GPFS is IBM's parallel, shared-disk file system for cluster computers, available on the RS/6000 SP parallel supercomputer and on Linux clusters; this paper describes GPFS and discusses how distributed locking and recovery techniques were extended to scale to large clusters.
Abstract: GPFS is IBM's parallel, shared-disk file system for cluster computers, available on the RS/6000 SP parallel supercomputer and on Linux clusters. GPFS is used on many of the largest supercomputers in the world. GPFS was built on many of the ideas that were developed in the academic community over the last several years, particularly distributed locking and recovery technology. To date it has been a matter of conjecture how well these ideas scale. We have had the opportunity to test those limits in the context of a product that runs on the largest systems in existence. While in many cases existing ideas scaled well, new approaches were necessary in many key areas. This paper describes GPFS, and discusses how distributed locking and recovery techniques were extended to scale to large clusters.
1,434 citations
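The distributed locking described above is token based: a node asks a central token manager for a byte-range token once, and can then grant further locks within that range locally, so lock traffic does not grow with every file operation. Below is a minimal Python sketch of that idea under invented names (TokenServer, Node); it is an illustration, not GPFS code.

```python
class TokenServer:
    """Grants byte-range tokens; a real token manager would also persist state for recovery."""
    def __init__(self):
        self.grants = []  # (holder_node, start, end)

    def acquire(self, node, start, end):
        # Revoke overlapping tokens held by other nodes. In a real system this is an
        # asynchronous revoke message, and the holder flushes dirty data before yielding.
        kept = []
        for holder, s, e in self.grants:
            if holder is not node and s < end and start < e:
                holder.tokens.remove((s, e))
            else:
                kept.append((holder, s, e))
        self.grants = kept
        self.grants.append((node, start, end))
        return (start, end)


class Node:
    """A cluster node caches the byte-range tokens it holds and locks locally under them."""
    def __init__(self, name, server):
        self.name, self.server, self.tokens = name, server, []

    def write(self, start, end):
        if not any(s <= start and end <= e for (s, e) in self.tokens):
            self.tokens.append(self.server.acquire(self, start, end))
        # ... the actual write proceeds under a purely local lock on (start, end) ...


server = TokenServer()
a, b = Node("a", server), Node("b", server)
a.write(0, 4096)      # one round trip to the token server
a.write(1024, 2048)   # covered by the cached token: no server traffic
b.write(0, 512)       # conflicting range: the server revokes node a's token
```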
IBM
TL;DR: In this article, a first snapshot of a first set of source files in a file system is generated; stored in each inode are a first identifier associated with the first set of files and a second identifier associated with the time of the first snapshot.
Abstract: A system, method and computer readable medium for providing a snapshot of a subset of a file system. A first snapshot of a first set of source files in a file system is generated. The first snapshot includes an inode corresponding to each source file in the first set of files. Stored in each inode is a first identifier associated with the first set of files and a second identifier associated with the time of the first snapshot. Next, a second snapshot of a second set of source files is taken. The second snapshot includes an inode corresponding to each source file in the second set of files. Stored in each inode are a first identifier and a second identifier. Subsequent snapshots are taken every first period and every second period for the first set of files and the second set of files, respectively.
283 citations
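A minimal sketch of the two-identifier scheme in the abstract above: each inode touched by a snapshot records which set of source files it belongs to and when that set was snapshotted, so different file sets can be snapshotted on independent schedules. The names (Inode, snapshot) are hypothetical, not the patented implementation.

```python
import time
from dataclasses import dataclass

@dataclass
class Inode:
    path: str
    snap_set_id: int | None = None   # first identifier: which set of source files
    snap_time: float | None = None   # second identifier: time of that set's snapshot

def snapshot(file_set_id: int, inodes: list[Inode]) -> list[Inode]:
    """Snapshot one set of source files by stamping both identifiers into each inode."""
    now = time.time()
    snap_inodes = []
    for ino in inodes:
        ino.snap_set_id = file_set_id
        ino.snap_time = now
        snap_inodes.append(Inode(ino.path, file_set_id, now))  # inode owned by the snapshot
    return snap_inodes

# Two independent file sets; in the abstract each set is re-snapshotted on its own period,
# here each is simply snapshotted once for illustration.
set1 = [Inode("/data/a"), Inode("/data/b")]
set2 = [Inode("/logs/x")]
snap1, snap2 = snapshot(1, set1), snapshot(2, set2)
```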
IBM
TL;DR: In this paper, a computer system is described in which a shared-disk file system runs on multiple computers, each with its own instance of an operating system, coupled for parallel data-sharing access to files residing on network-attached shared disks.
Abstract: A computer system having a shared disk file system running on multiple computers each having their own instance of an operating system and being coupled for parallel data sharing access to files residing on network attached shared disks. A metadata node manages file metadata for parallel read and write actions. Metadata tokens are used for controlled access to the metadata and initial selection and changing of the metadata node.
237 citations
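The abstract above designates one node per file as the metadata node and uses tokens to select and change that node. The Python sketch below illustrates the idea under invented names (MetadataNodeRegistry, ClusterNode); it is a simplified illustration, not the patented design.

```python
class MetadataNodeRegistry:
    """Hands out the per-file metanode token; first requester wins, and the role can move."""
    def __init__(self):
        self.metanode = {}           # file -> node currently holding the metanode token

    def metanode_for(self, file, requesting_node):
        return self.metanode.setdefault(file, requesting_node)

    def release(self, file, node):
        if self.metanode.get(file) is node:
            del self.metanode[file]  # the next requester becomes the new metadata node


class ClusterNode:
    def __init__(self, name, registry):
        self.name, self.registry = name, registry
        self.pending = {}            # file -> locally accumulated metadata (here: file size)

    def write(self, file, new_size):
        metanode = self.registry.metanode_for(file, self)
        if metanode is self:
            self.pending[file] = max(self.pending.get(file, 0), new_size)
        else:
            # Other writers forward their metadata updates to the current metadata node.
            metanode.pending[file] = max(metanode.pending.get(file, 0), new_size)


reg = MetadataNodeRegistry()
n1, n2 = ClusterNode("n1", reg), ClusterNode("n2", reg)
n1.write("/shared/f", 4096)   # n1 becomes the metadata node for /shared/f
n2.write("/shared/f", 8192)   # n2 forwards its size update to n1
```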
IBM
TL;DR: In this article, the authors describe a parallel file system for a shared-disk environment that uses a scalable directory service, improvements to caching, and cache-performance balance pools for multiple accesses; a metadata node manages file metadata, and locking techniques reduce the overhead of a token manager, which is also used in file system recovery if a computer participating in the management of the shared disks becomes unavailable or fails.
Abstract: A computer system having a shared parallel disk file system running on a network of multiple computers, each with its own instance of an operating system, and with a protocol that makes disks appear to be locally attached to each file system. This parallel file system in a shared-disk environment uses a scalable directory service, improvements to caching, and cache-performance balance pools for multiple accesses. A metadata node manages file metadata, and locking techniques reduce the overhead of a token manager, which is also used in file system recovery if a computer participating in the management of shared disks becomes unavailable or fails. Synchronous and asynchronous takeover of a metadata node occurs to correct metadata that was under modification and to appoint a new computer node as the metadata node for that file. Locks are not constantly required to allocate new blocks on behalf of a user. Hash buckets are used, and each hash bucket is stored in a sparse file at an offset of i*s, where i is the hash bucket number and s is the hash bucket size. A directory starts out as an empty file whose size increases as records are inserted until it needs to be split; upon the first split an additional bucket is written, increasing the file size from s to 2*s. A lookup operation computes the hash value of the key being looked up and a hash tree depth equal to log2(file size / hash bucket size); the same computations are performed for an insert operation.
227 citations
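The hash-bucket layout in the abstract above is concrete enough to sketch directly: bucket i lives at byte offset i*s in a sparse directory file, and the hash tree depth is log2(file size / bucket size). The helpers below use hypothetical names and an assumed bucket size; a real file system would also use a stable on-disk hash function rather than Python's built-in hash().

```python
import math

BUCKET_SIZE = 4096          # s: size of one hash bucket (an assumed value)

def bucket_offset(i: int, s: int = BUCKET_SIZE) -> int:
    """Bucket i is stored in the sparse directory file at offset i*s."""
    return i * s

def hash_tree_depth(file_size: int, s: int = BUCKET_SIZE) -> int:
    """Depth = log2(file_size / bucket_size); a directory of one bucket has depth 0."""
    return 0 if file_size <= s else int(math.log2(file_size // s))

def lookup_bucket(key: str, file_size: int, s: int = BUCKET_SIZE) -> int:
    """Pick the bucket for a key: use the low `depth` bits of the key's hash value."""
    depth = hash_tree_depth(file_size, s)
    return hash(key) & ((1 << depth) - 1)

# After the first split the file grows from s to 2*s and lookups use one hash bit;
# after the next split, two bits, and so on.
for size in (BUCKET_SIZE, 2 * BUCKET_SIZE, 4 * BUCKET_SIZE):
    print(size, hash_tree_depth(size), bucket_offset(lookup_bucket("foo", size)))
```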
IBM
TL;DR: In this paper, a system, method and computer readable medium for deferring copy-on-write of a snapshot is disclosed, which includes the generation of a snapshot of a source file.
Abstract: A system, method and computer readable medium for deferring copy-on-write of a snapshot is disclosed. The method includes the generation of a snapshot of a source file. Upon modification of a first data block referenced by the source file, the first data block is referenced by the snapshot and a second data block is allocated for the source file. Then, a first variable associated with the source file is set to a value indicating an incomplete source file data block and a second variable associated with the source file is set to a value indicating the valid portion of the second data block. Any portion of the second data block that is overwritten is considered valid. The second data block is then modified and the second variable is changed to reflect the modification. Upon reception of a read request, the corresponding portion of the second data block is retrieved.
224 citations
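A minimal sketch of the deferred copy-on-write idea from the abstract above: on the first write to a snapshotted block, the old block is handed to the snapshot, a new block is allocated for the source file, and two variables record that the new block is incomplete and which portion of it is valid; reads fall back to the snapshot's copy for anything not yet overwritten. The class below is an invented illustration, not the patented implementation.

```python
class DeferredCowBlock:
    def __init__(self, old_data: bytes):
        self.snapshot_block = old_data                # first block, now owned by the snapshot
        self.new_block = bytearray(len(old_data))     # second block, allocated for the source
        self.incomplete = True                        # first variable: block not yet fully copied
        self.valid = set()                            # second variable: valid offsets in new block

    def write(self, offset: int, data: bytes):
        self.new_block[offset:offset + len(data)] = data
        self.valid.update(range(offset, offset + len(data)))
        if len(self.valid) == len(self.new_block):
            self.incomplete = False                   # whole block copied; no more fallback needed

    def read(self, offset: int, length: int) -> bytes:
        # Serve each byte from the new block where valid, otherwise from the snapshot's copy.
        return bytes(
            self.new_block[i] if (not self.incomplete or i in self.valid) else self.snapshot_block[i]
            for i in range(offset, offset + length)
        )


blk = DeferredCowBlock(b"old old old old!")
blk.write(0, b"NEW")
assert blk.read(0, 7) == b"NEW old"   # overwritten portion from the new block, rest deferred
```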
Cited by
19 Oct 2003
TL;DR: This paper presents file system interface extensions designed to support distributed applications, discusses many aspects of the design, and reports measurements from both micro-benchmarks and real world use.
Abstract: We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients. While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to reexamine traditional choices and explore radically different design points. The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients. In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real world use.
5,429 citations
17 Aug 2008
TL;DR: This paper shows how to leverage largely commodity Ethernet switches to support the full aggregate bandwidth of clusters consisting of tens of thousands of elements and argues that appropriately architected and interconnected commodity switches may deliver more performance at less cost than available from today's higher-end solutions.
Abstract: Today's data centers may contain tens of thousands of computers with significant aggregate bandwidth requirements. The network architecture typically consists of a tree of routing and switching elements with progressively more specialized and expensive equipment moving up the network hierarchy. Unfortunately, even when deploying the highest-end IP switches/routers, resulting topologies may only support 50% of the aggregate bandwidth available at the edge of the network, while still incurring tremendous cost. Non-uniform bandwidth among data center nodes complicates application design and limits overall system performance. In this paper, we show how to leverage largely commodity Ethernet switches to support the full aggregate bandwidth of clusters consisting of tens of thousands of elements. Similar to how clusters of commodity computers have largely replaced more specialized SMPs and MPPs, we argue that appropriately architected and interconnected commodity switches may deliver more performance at less cost than available from today's higher-end solutions. Our approach requires no modifications to the end host network interface, operating system, or applications; critically, it is fully backward compatible with Ethernet, IP, and TCP.
3,549 citations
06 Nov 2006
TL;DR: Performance measurements under a variety of workloads show that Ceph has excellent I/O performance and scalable metadata management, supporting more than 250,000 metadata operations per second.
Abstract: We have developed Ceph, a distributed file system that provides excellent performance, reliability, and scalability. Ceph maximizes the separation between data and metadata management by replacing allocation tables with a pseudo-random data distribution function (CRUSH) designed for heterogeneous and dynamic clusters of unreliable object storage devices (OSDs). We leverage device intelligence by distributing data replication, failure detection and recovery to semi-autonomous OSDs running a specialized local object file system. A dynamic distributed metadata cluster provides extremely efficient metadata management and seamlessly adapts to a wide range of general purpose and scientific computing file system workloads. Performance measurements under a variety of workloads show that Ceph has excellent I/O performance and scalable metadata management, supporting more than 250,000 metadata operations per second.
1,621 citations
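The pseudo-random data distribution function mentioned in the abstract above lets any client compute where an object lives without consulting an allocation table. The sketch below illustrates that property using simple rendezvous (highest-random-weight) hashing; it is not the actual CRUSH algorithm, and all names are assumptions.

```python
import hashlib

def placement(object_name: str, osds: list[str], replicas: int = 3) -> list[str]:
    """Deterministically map an object to `replicas` OSDs by hashing (object, osd) pairs."""
    def score(osd: str) -> int:
        digest = hashlib.sha256(f"{object_name}:{osd}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    # Every client computes the same ranking, so no central lookup table is needed.
    return sorted(osds, key=score, reverse=True)[:replicas]

osds = [f"osd.{i}" for i in range(8)]
print(placement("volume1/object42", osds))   # same answer on every client
```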
11 Jan 2011
TL;DR: In this article, an intelligent automated assistant system engages with the user in an integrated, conversational manner using natural language dialog, and invokes external services when appropriate to obtain information or perform various actions.
Abstract: An intelligent automated assistant system engages with the user in an integrated, conversational manner using natural language dialog, and invokes external services when appropriate to obtain information or perform various actions. The system can be implemented using any of a number of different platforms, such as the web, email, smartphone, and the like, or any combination thereof. In one embodiment, the system is based on sets of interrelated domains and tasks, and employs additional functionality powered by external services with which the system can interact.
1,462 citations