Journal ArticleDOI

The Google file system

19 Oct 2003 - Vol. 37, Iss. 5, pp. 29-43
TL;DR: This paper presents file system interface extensions designed to support distributed applications, discusses many aspects of the design, and reports measurements from both micro-benchmarks and real world use.
Abstract: We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients. While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to reexamine traditional choices and explore radically different design points. The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients. In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real world use.
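
The interface extensions mentioned above include snapshot and atomic record append, in which clients supply only the data and GFS chooses the offset, so many producers can append to the same file concurrently without coordinating. Below is a minimal, purely illustrative Python sketch of such a record-append call; the class and method names are hypothetical, not the real GFS client API.

# Illustrative sketch of a GFS-style atomic record append: the client supplies
# only the data, and the file system chooses the offset. Names are hypothetical.

CHUNK_SIZE = 64 * 1024 * 1024  # GFS stores files as 64 MB chunks

class Chunk:
    def __init__(self):
        self.data = bytearray()

class ToyGFSFile:
    def __init__(self, num_replicas=3):
        # Each "replica" is just a list of in-memory chunks in this sketch.
        self.replicas = [[Chunk()] for _ in range(num_replicas)]

    def record_append(self, record: bytes) -> int:
        """Append `record` to every replica of the last chunk; return the
        offset the system chose for it."""
        primary = self.replicas[0]
        if len(primary[-1].data) + len(record) > CHUNK_SIZE:
            # No room left: start a new chunk on every replica so that a
            # record never straddles a chunk boundary.
            for replica in self.replicas:
                replica.append(Chunk())
        offset = (len(primary) - 1) * CHUNK_SIZE + len(primary[-1].data)
        for replica in self.replicas:
            replica[-1].data.extend(record)
        return offset

f = ToyGFSFile()
print(f.record_append(b"log entry 1"))  # 0
print(f.record_append(b"log entry 2"))  # 11

Choosing the offset on the server side is what lets hundreds of clients append to a shared log or results file without any client-side locking.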


Citations
Proceedings ArticleDOI
06 Nov 2006
TL;DR: Bigtable, as discussed by the authors, is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Projects that store data in it include web indexing, Google Earth, and Google Finance.
Abstract: Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this paper we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.
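
The data model the abstract refers to is a sparse, sorted map from (row key, column key, timestamp) to an uninterpreted byte string, with rows kept in lexicographic order so clients control locality through their choice of keys. A small illustrative Python sketch of that mapping (not the real Bigtable client API):

import time
from collections import defaultdict

class ToyBigtable:
    """Illustrative sketch of Bigtable's data model:
    (row key, column key, timestamp) -> uninterpreted bytes, rows kept sorted."""

    def __init__(self):
        # row -> column -> list of (timestamp, value), newest first
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, ts=None):
        cells = self.rows[row][column]
        cells.append((time.time() if ts is None else ts, value))
        cells.sort(key=lambda cell: cell[0], reverse=True)  # newest first

    def get(self, row, column):
        cells = self.rows[row].get(column)
        return cells[0][1] if cells else None                # latest version

    def scan(self, start_row, end_row):
        """Rows sort lexicographically, so related keys are read together."""
        for row in sorted(self.rows):
            if start_row <= row < end_row:
                yield row, {col: cells[0][1] for col, cells in self.rows[row].items()}

t = ToyBigtable()
t.put("com.cnn.www", "contents:", b"<html>...")
t.put("com.cnn.www", "anchor:cnnsi.com", b"CNN")
print(t.get("com.cnn.www", "contents:"))
print(list(t.scan("com.cnn.", "com.cnn/")))

Using a reversed domain such as com.cnn.www as the row key, as in the paper's Webtable example, keeps pages from the same site adjacent in a scan.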

1,523 citations

Proceedings ArticleDOI
13 Apr 2010
TL;DR: This work proposes a simple algorithm called delay scheduling, which achieves nearly optimal data locality in a variety of workloads and can increase throughput by up to 2x while preserving fairness.
Abstract: As organizations start to use data-intensive cluster computing systems like Hadoop and Dryad for more applications, there is a growing need to share clusters between users. However, there is a conflict between fairness in scheduling and data locality (placing tasks on nodes that contain their input data). We illustrate this problem through our experience designing a fair scheduler for a 600-node Hadoop cluster at Facebook. To address the conflict between locality and fairness, we propose a simple algorithm called delay scheduling: when the job that should be scheduled next according to fairness cannot launch a local task, it waits for a small amount of time, letting other jobs launch tasks instead. We find that delay scheduling achieves nearly optimal data locality in a variety of workloads and can increase throughput by up to 2x while preserving fairness. In addition, the simplicity of delay scheduling makes it applicable under a wide variety of scheduling policies beyond fair sharing.
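
The rule in the abstract is compact enough to sketch directly: offer each free slot to jobs in fairness order, let the head-of-line job pass if it has no local task, and only give up on locality after it has waited a bounded amount of time. The sketch below is a simplified, time-based variant with hypothetical Job and Task structures, not the paper's exact scheduler.

import time
from dataclasses import dataclass

WAIT_THRESHOLD = 5.0  # seconds a job may wait for locality before running remotely

@dataclass
class Task:
    input_locations: set              # nodes holding this task's input block

@dataclass
class Job:
    pending_tasks: list
    first_skipped: float = None       # when this job was first skipped for locality

def assign_task(node, jobs_in_fairness_order, now=None):
    """Delay scheduling: offer `node`'s free slot to jobs in fairness order,
    preferring local tasks; a job skipped longer than WAIT_THRESHOLD may
    launch a non-local task rather than starve."""
    now = time.time() if now is None else now
    for job in jobs_in_fairness_order:
        if not job.pending_tasks:
            continue
        local = [t for t in job.pending_tasks if node in t.input_locations]
        if local:
            job.first_skipped = None              # got locality; reset the wait clock
            job.pending_tasks.remove(local[0])
            return local[0]
        if job.first_skipped is None:
            job.first_skipped = now               # start the wait clock
        elif now - job.first_skipped > WAIT_THRESHOLD:
            return job.pending_tasks.pop(0)       # waited long enough: run remotely
        # otherwise skip this job briefly and let later jobs use the slot
    return None

job = Job(pending_tasks=[Task(input_locations={"node-7"})])
print(assign_task("node-3", [job]))   # None at first: the job waits briefly for locality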

1,514 citations

Proceedings ArticleDOI
08 Oct 2012
TL;DR: This article describes how Spanner is structured, its feature set, the rationale underlying various design decisions, and a novel time API that exposes clock uncertainty, which is critical to supporting external consistency and a variety of powerful features.
Abstract: Spanner is Google's scalable, multi-version, globally-distributed, and synchronously-replicated database. It is the first system to distribute data at global scale and support externally-consistent distributed transactions. This paper describes how Spanner is structured, its feature set, the rationale underlying various design decisions, and a novel time API that exposes clock uncertainty. This API and its implementation are critical to supporting external consistency and a variety of powerful features: nonblocking reads in the past, lock-free read-only transactions, and atomic schema changes, across all of Spanner.
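
The time API in question is TrueTime: instead of a single timestamp, TT.now() returns an interval [earliest, latest] guaranteed to contain the true absolute time, and a committing transaction waits out the uncertainty ("commit wait") before its effects become visible. A hedged Python sketch of that contract, with an invented uncertainty bound:

import time
from dataclasses import dataclass

EPSILON = 0.007  # illustrative uncertainty bound (seconds); real bounds come from GPS/atomic clocks

@dataclass
class TTInterval:
    earliest: float
    latest: float

def tt_now() -> TTInterval:
    """TrueTime-style clock: the true absolute time is assumed to lie
    somewhere inside the returned interval."""
    t = time.time()
    return TTInterval(t - EPSILON, t + EPSILON)

def commit(apply_transaction):
    """Commit wait: pick a timestamp no earlier than now, then wait until that
    timestamp is guaranteed to be in the past before making effects visible."""
    s = tt_now().latest               # chosen commit timestamp
    while tt_now().earliest < s:      # wait out the clock uncertainty
        time.sleep(0.001)
    apply_transaction(s)
    return s

commit(lambda ts: print("committed at", ts))

The short wait bounds the cost of external consistency by the clock uncertainty itself, which is why exposing that uncertainty in the API matters.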

1,366 citations

Journal ArticleDOI
Jeffrey Dean, Sanjay Ghemawat
TL;DR: MapReduce advantages over parallel databases include storage-system independence and fine-grain fault tolerance for large jobs.
Abstract: MapReduce advantages over parallel databases include storage-system independence and fine-grain fault tolerance for large jobs.

1,293 citations

Proceedings ArticleDOI
16 Aug 2009
TL;DR: Through the design and implementation of PortLand, a scalable, fault tolerant layer 2 routing and forwarding protocol for data center environments, it is shown that PortLand holds promise for supporting a "plug-and-play" large-scale data center network.
Abstract: This paper considers the requirements for a scalable, easily manageable, fault-tolerant, and efficient data center network fabric. Trends in multi-core processors, end-host virtualization, and commodities of scale are pointing to future single-site data centers with millions of virtual end points. Existing layer 2 and layer 3 network protocols face some combination of limitations in such a setting: lack of scalability, difficult management, inflexible communication, or limited support for virtual machine migration. To some extent, these limitations may be inherent for Ethernet/IP style protocols when trying to support arbitrary topologies. We observe that data center networks are often managed as a single logical network fabric with a known baseline topology and growth model. We leverage this observation in the design and implementation of PortLand, a scalable, fault tolerant layer 2 routing and forwarding protocol for data center environments. Through our implementation and evaluation, we show that PortLand holds promise for supporting a "plug-and-play" large-scale data center network.
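
One concrete way PortLand exploits the known baseline topology (a detail from the paper itself, not this abstract) is hierarchical Pseudo MAC addresses that encode a host's location - pod, switch position, switch port, and a VM index - so forwarding state stays small and address resolution can be answered centrally. A rough sketch of such location-encoded addressing; the field widths here are illustrative:

# Sketch of location-encoded addressing in the spirit of PortLand's PMACs:
# pack pod, switch position, port, and a VM index into a 48-bit address so
# forwarding can follow the known fat-tree topology. Field widths illustrative.

def encode_pmac(pod: int, position: int, port: int, vmid: int) -> str:
    raw = (pod << 32) | (position << 24) | (port << 16) | vmid
    return ":".join(f"{x:02x}" for x in raw.to_bytes(6, "big"))

def decode_pmac(pmac: str):
    raw = int.from_bytes(bytes(int(x, 16) for x in pmac.split(":")), "big")
    return {
        "pod": raw >> 32,
        "position": (raw >> 24) & 0xFF,
        "port": (raw >> 16) & 0xFF,
        "vmid": raw & 0xFFFF,
    }

pmac = encode_pmac(pod=3, position=1, port=7, vmid=42)
print(pmac)                # 00:03:01:07:00:2a
print(decode_pmac(pmac))   # {'pod': 3, 'position': 1, 'port': 7, 'vmid': 42}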

1,238 citations

References
Journal ArticleDOI
01 Jun 1988
TL;DR: Five levels of RAID are introduced, giving their relative cost/performance, and RAID is compared to an IBM 3380 and a Fujitsu Super Eagle.
Abstract: Increasing performance of CPUs and memories will be squandered if not matched by a similar performance increase in I/O. While the capacity of Single Large Expensive Disks (SLED) has grown rapidly, the performance improvement of SLED has been modest. Redundant Arrays of Inexpensive Disks (RAID), based on the magnetic disk technology developed for personal computers, offers an attractive alternative to SLED, promising improvements of an order of magnitude in performance, reliability, power consumption, and scalability. This paper introduces five levels of RAIDs, giving their relative cost/performance, and compares RAID to an IBM 3380 and a Fujitsu Super Eagle.
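
The redundancy behind most of the RAID levels the paper introduces comes down to a parity block per stripe: XOR the data blocks together, and any single failed disk can be rebuilt from the survivors plus parity. A tiny Python illustration:

from functools import reduce

def parity(blocks):
    """Parity block for a stripe: byte-wise XOR of the data blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def rebuild(surviving_blocks, parity_block):
    """Reconstruct the single missing block from the survivors plus parity."""
    return parity(surviving_blocks + [parity_block])

stripe = [b"AAAA", b"BBBB", b"CCCC"]    # data blocks on three disks
p = parity(stripe)                      # stored on a fourth disk
assert rebuild([stripe[0], stripe[2]], p) == stripe[1]   # disk 1 lost and rebuilt

GFS sidesteps this machinery entirely, as the excerpts below note, by replicating whole chunks instead of computing parity.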

3,015 citations


Additional excerpts

  • ...As disks are relatively cheap and replication is simpler than more sophisticated RAID [9] approaches, GFS currently uses only replication for redundancy and so consumes more raw storage than xFS or Swift....

  • ...A case for redundant arrays of inexpensive disks (RAID)....

Journal ArticleDOI
TL;DR: Observations of a prototype implementation are presented; changes in the areas of cache validation, server process structure, name translation, and low-level storage representation are motivated; and Andrew's ability to scale gracefully is quantitatively demonstrated.
Abstract: The Andrew File System is a location-transparent distributed file system that will eventually span more than 5000 workstations at Carnegie Mellon University. Large scale affects performance and complicates system operation. In this paper we present observations of a prototype implementation, motivate changes in the areas of cache validation, server process structure, name translation, and low-level storage representation, and quantitatively demonstrate Andrew's ability to scale gracefully. We establish the importance of whole-file transfer and caching in Andrew by comparing its performance with that of Sun Microsystems' NFS file system. We also show how the aggregation of files into volumes improves the operability of the system.
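
The cache-validation change the Andrew work motivated was to replace check-on-open with server-initiated callbacks: the server promises to notify a client when a cached file changes, so whole-file copies can be reused without a validation round trip. A hedged sketch of that interaction, with purely illustrative names:

class Server:
    def __init__(self):
        self.files = {}          # name -> contents
        self.callbacks = {}      # name -> set of clients holding a callback

    def fetch(self, client, name):
        self.callbacks.setdefault(name, set()).add(client)   # register callback
        return self.files[name]

    def store(self, name, contents):
        self.files[name] = contents
        for client in self.callbacks.pop(name, set()):
            client.break_callback(name)                       # invalidate remote caches

class Client:
    def __init__(self, server):
        self.server = server
        self.cache = {}          # name -> whole-file copy

    def open(self, name):
        if name not in self.cache:                 # callback still valid: no round trip
            self.cache[name] = self.server.fetch(self, name)
        return self.cache[name]

    def break_callback(self, name):
        self.cache.pop(name, None)

srv = Server()
srv.files["/afs/notes.txt"] = b"v1"
c = Client(srv)
print(c.open("/afs/notes.txt"))     # fetched once, callback registered
srv.store("/afs/notes.txt", b"v2")  # callback broken; client cache invalidated
print(c.open("/afs/notes.txt"))     # re-fetched: b"v2"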

1,604 citations

Proceedings Article
Frank B. Schmuck, Roger L. Haskin
28 Jan 2002
TL;DR: GPFS is IBM's parallel, shared-disk file system for cluster computers, available on the RS/6000 SP parallel supercomputer and on Linux clusters; the paper describes GPFS and discusses how distributed locking and recovery techniques were extended to scale to large clusters.
Abstract: GPFS is IBM's parallel, shared-disk file system for cluster computers, available on the RS/6000 SP parallel supercomputer and on Linux clusters. GPFS is used on many of the largest supercomputers in the world. GPFS was built on many of the ideas that were developed in the academic community over the last several years, particularly distributed locking and recovery technology. To date it has been a matter of conjecture how well these ideas scale. We have had the opportunity to test those limits in the context of a product that runs on the largest systems in existence. While in many cases existing ideas scaled well, new approaches were necessary in many key areas. This paper describes GPFS, and discusses how distributed locking and recovery techniques were extended to scale to large clusters.
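
The distributed locking the paper describes is token based: a token manager grants a node the right to lock an object, and once the token is held, further lock and unlock operations on that object are purely local until another node requests a conflicting token. A simplified, illustrative sketch:

class TokenServer:
    def __init__(self):
        self.holder = {}                 # object id -> node currently holding the token

    def acquire(self, node, obj):
        current = self.holder.get(obj)
        if current is not None and current is not node:
            current.revoke(obj)          # ask the current holder to give the token up
        self.holder[obj] = node

class Node:
    def __init__(self, name, server):
        self.name, self.server = name, server
        self.tokens = set()              # tokens held locally

    def lock(self, obj):
        if obj not in self.tokens:       # only the first lock needs the server
            self.server.acquire(self, obj)
            self.tokens.add(obj)
        # subsequent lock/unlock on obj is purely local while the token is held

    def revoke(self, obj):
        self.tokens.discard(obj)         # a real system would flush dirty state here

server = TokenServer()
a, b = Node("a", server), Node("b", server)
a.lock("inode:42")      # token granted to a
a.lock("inode:42")      # local, no server traffic
b.lock("inode:42")      # token revoked from a, granted to b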

1,434 citations


"The Google file system" refers methods in this paper

  • ...GPFS: A shared-disk file system for large computing clusters....

  • ...Some distributed file systems like Frangipani, xFS, Minnesota's GFS[11] and GPFS [10] remove the centralized server and rely on distributed algorithms for consistency and management....

Journal ArticleDOI
TL;DR: It is found that most people use few search terms, modify few queries, view few Web pages, and rarely use advanced search features, and that the language of Web queries is distinctive.
Abstract: In studying actual Web searching by the public at large, we analyzed over one million Web queries by users of the Excite search engine. We found that most people use few search terms, modify few queries, view few Web pages, and rarely use advanced search features. A small number of search terms are used with high frequency, and a great many terms are unique; the language of Web queries is distinctive. Queries about recreation and entertainment rank highest. Findings are compared to data from two other large studies of Web queries. This study provides insight into public practices and choices in Web searching.

1,153 citations

Journal ArticleDOI
TL;DR: Google's architecture features clusters of more than 15,000 commodity-class PCs with fault-tolerant software that achieves superior performance at a fraction of the cost of a system built from fewer, but more expensive, high-end servers.
Abstract: Amenable to extensive parallelization, Google's web search application lets different queries run on different processors and, by partitioning the overall index, also lets a single query use multiple processors. To handle this workload, Google's architecture features clusters of more than 15,000 commodity-class PCs with fault-tolerant software. This architecture achieves superior performance at a fraction of the cost of a system built from fewer, but more expensive, high-end servers.
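
Partitioning the index means a single query fans out to every index shard in parallel, each shard scores its own slice of the documents, and the per-shard results are merged; each shard is in turn replicated across several commodity machines. A minimal sketch of that fan-out, with hypothetical data structures:

import random
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shard data: term -> list of (doc id, score) for the documents
# assigned to that shard of the partitioned index; each shard has replicas.
index_shards = [
    [{"google": [("doc1", 0.9)], "file": [("doc2", 0.4)]}],   # shard 0 replicas
    [{"google": [("doc7", 0.8)], "system": [("doc9", 0.6)]}], # shard 1 replicas
]

def search_shard(shard_replicas, term):
    replica = random.choice(shard_replicas)     # any replica can serve the query
    return replica.get(term, [])

def search(term, top_k=10):
    """Fan the query out to every index shard in parallel and merge the results."""
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda shard: search_shard(shard, term), index_shards)
    merged = [hit for partial in partials for hit in partial]
    return sorted(merged, key=lambda hit: hit[1], reverse=True)[:top_k]

print(search("google"))   # [('doc1', 0.9), ('doc7', 0.8)]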

1,129 citations