Spark: cluster computing with working sets

Home
/
Papers
/
Spark: cluster computing with working sets

Proceedings Article•

Spark: cluster computing with working sets

Matei Zaharia¹, Mosharaf Chowdhury¹, Michael J. Franklin¹, Scott Shenker¹, Ion Stoica¹ - Show less +1 more•Institutions (1)

University of California, Berkeley¹

22 Jun 2010-pp 10-10

TL;DR: Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.

read less

Abstract: MapReduce and its variants have been highly successful in implementing large-scale data-intensive applications on commodity clusters. However, most of these systems are built around an acyclic data flow model that is not suitable for other popular applications. This paper focuses on one such class of applications: those that reuse a working set of data across multiple parallel operations. This includes many iterative machine learning algorithms, as well as interactive data analysis tools. We propose a new framework called Spark that supports these applications while retaining the scalability and fault tolerance of MapReduce. To achieve these goals, Spark introduces an abstraction called resilient distributed datasets (RDDs). An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.

...read moreread less

Content maybe subject to copyright Report

Citations

PDF

Open Access

More filters

Book•

Distributed Optimization and Statistical Learning Via the Alternating Direction Method of Multipliers

[...]

Stephen Boyd¹, Neal Parikh¹, Eric Chu¹, Borja Peleato¹, Jonathan Eckstein² - Show less +1 more•Institutions (2)

Stanford University¹, Rutgers University²

23 May 2011

TL;DR: It is argued that the alternating direction method of multipliers is well suited to distributed convex optimization, and in particular to large-scale problems arising in statistics, machine learning, and related areas.

...read moreread less

Abstract: Many problems of recent interest in statistics and machine learning can be posed in the framework of convex optimization. Due to the explosion in size and complexity of modern datasets, it is increasingly important to be able to solve problems with a very large number of features or training examples. As a result, both the decentralized collection or storage of these datasets as well as accompanying distributed solution methods are either necessary or at least highly desirable. In this review, we argue that the alternating direction method of multipliers is well suited to distributed convex optimization, and in particular to large-scale problems arising in statistics, machine learning, and related areas. The method was developed in the 1970s, with roots in the 1950s, and is equivalent or closely related to many other algorithms, such as dual decomposition, the method of multipliers, Douglas–Rachford splitting, Spingarn's method of partial inverses, Dykstra's alternating projections, Bregman iterative algorithms for l1 problems, proximal methods, and others. After briefly surveying the theory and history of the algorithm, we discuss applications to a wide variety of statistical and machine learning problems of recent interest, including the lasso, sparse logistic regression, basis pursuit, covariance selection, support vector machines, and many others. We also discuss general distributed optimization, extensions to the nonconvex setting, and efficient implementation, including some details on distributed MPI and Hadoop MapReduce implementations.

...read moreread less

17,433 citations

Cites background from "Spark: cluster computing with worki..."

...There has also been some recent work on alternative MapReduce systems that are specifically designed for iterative computation, which are likely better suited for ADMM [25, 179], though the implementations are less mature and less widely available....
[...]

Journal Article•DOI•

Edge Computing: Vision and Challenges

[...]

Weisong Shi¹, Jie Cao¹, Quan Zhang¹, Youhuizi Li¹, Lanyu Xu¹ - Show less +1 more•Institutions (1)

Wayne State University¹

09 Jun 2016-IEEE Internet of Things Journal

TL;DR: The definition of edge computing is introduced, followed by several case studies, ranging from cloud offloading to smart home and city, as well as collaborative edge to materialize the concept of edge Computing.

...read moreread less

Abstract: The proliferation of Internet of Things (IoT) and the success of rich cloud services have pushed the horizon of a new computing paradigm, edge computing, which calls for processing the data at the edge of the network. Edge computing has the potential to address the concerns of response time requirement, battery life constraint, bandwidth cost saving, as well as data safety and privacy. In this paper, we introduce the definition of edge computing, followed by several case studies, ranging from cloud offloading to smart home and city, as well as collaborative edge to materialize the concept of edge computing. Finally, we present several challenges and opportunities in the field of edge computing, and hope this paper will gain attention from the community and inspire more research in this direction.

...read moreread less

5,198 citations

Journal Article•DOI•

The rise of big data on cloud computing

[...]

Ibrahim Abaker Targio Hashem¹, Ibrar Yaqoob¹, Nor Badrul Anuar¹, Salimah Binti Mokhtar¹, Abdullah Gani¹, Samee U. Khan² - Show less +2 more•Institutions (2)

Information Technology University¹, North Dakota State University²

01 Jan 2015-Information Systems

TL;DR: The definition, characteristics, and classification of big data along with some discussions on cloud computing are introduced, and research challenges are investigated, with focus on scalability, availability, data integrity, data transformation, data quality, data heterogeneity, privacy, legal and regulatory issues, and governance.

...read moreread less

2,141 citations

Cites methods from "Spark: cluster computing with worki..."

...nodes and stores configuration information [82] Spark A fast and general computation engine for Hadoop data [83] Chukwa Chukwa has just passed its development stage; it is a data collection and analysis framework incorporated with MapReduce and HDFS; the workflow of Chukwa allows for data collection from distributed systems, data processing, and data storage in Hadoop; as an independent module, Chukwa is included in the Apache Hadoop distribution [76] Twister Provides support for iterative MapReduce computations and Twister; extremely faster than Hadoop MAPR Comprehensive distribution processing for Apache Hadoop and Hbase...
[...]

Proceedings Article•DOI•

Apache Hadoop YARN: yet another resource negotiator

[...]

Vinod Kumar Vavilapalli, Arun C. Murthy, Chris Douglas¹, Sharad Agarwal, Mahadev Konar, Robert Evans², Thomas Graves², Jason Lowe², Hitesh Shah, Siddharth Seth, Bikas Saha, Carlo Curino¹, Owen O'Malley, Sanjay Radia, Benjamin Reed³, Eric Baldeschwieler - Show less +12 more•Institutions (3)

Microsoft¹, Yahoo!², Facebook³

01 Oct 2013

TL;DR: The design, development, and current state of deployment of the next generation of Hadoop's compute platform: YARN is summarized, which decouples the programming model from the resource management infrastructure, and delegates many scheduling functions to per-application components.

...read moreread less

Abstract: The initial design of Apache Hadoop [1] was tightly focused on running massive, MapReduce jobs to process a web crawl. For increasingly diverse companies, Hadoop has become the data and computational agora---the de facto place where data and computational resources are shared and accessed. This broad adoption and ubiquitous usage has stretched the initial design well beyond its intended target, exposing two key shortcomings: 1) tight coupling of a specific programming model with the resource management infrastructure, forcing developers to abuse the MapReduce programming model, and 2) centralized handling of jobs' control flow, which resulted in endless scalability concerns for the scheduler. In this paper, we summarize the design, development, and current state of deployment of the next generation of Hadoop's compute platform: YARN. The new architecture we introduced decouples the programming model from the resource management infrastructure, and delegates many scheduling functions (e.g., task fault-tolerance) to per-application components. We provide experimental evidence demonstrating the improvements we made, confirm improved efficiency by reporting the experience of running YARN on production environments (including 100% of Yahoo! grids), and confirm the flexibility claims by discussing the porting of several programming frameworks onto YARN viz. Dryad, Giraph, Hoya, Hadoop MapReduce, REEF, Spark, Storm, Tez.

...read moreread less

2,006 citations

Cites background or methods from "Spark: cluster computing with worki..."

...Examples of alternative programming models that are becoming available on YARN are: Dryad [18], Giraph, Hoya, REEF [10], Spark [32], Storm [4] and Tez [2]....
[...]
...Spark is an open-source research project from UC Berkeley [32], that targets machine learning and interactive querying workloads....
[...]
...If one composes this flow as a sequence of MapReduce jobs, the scheduling overhead will significantly delay the result [32]....
[...]

Proceedings Article•DOI•

Mesos: a platform for fine-grained resource sharing in the data center

[...]

Benjamin Hindman¹, Andy Konwinski¹, Matei Zaharia¹, Ali Ghodsi¹, Anthony D. Joseph¹, Randy H. Katz¹, Scott Shenker¹, Ion Stoica¹ - Show less +4 more•Institutions (1)

University of California, Berkeley¹

30 Mar 2011

TL;DR: The results show that Mesos can achieve near-optimal data locality when sharing the cluster among diverse frameworks, can scale to 50,000 (emulated) nodes, and is resilient to failures.

...read moreread less

Abstract: We present Mesos, a platform for sharing commodity clusters between multiple diverse cluster computing frameworks, such as Hadoop and MPI. Sharing improves cluster utilization and avoids per-framework data replication. Mesos shares resources in a fine-grained manner, allowing frameworks to achieve data locality by taking turns reading data stored on each machine. To support the sophisticated schedulers of today's frameworks, Mesos introduces a distributed two-level scheduling mechanism called resource offers. Mesos decides how many resources to offer each framework, while frameworks decide which resources to accept and which computations to run on them. Our results show that Mesos can achieve near-optimal data locality when sharing the cluster among diverse frameworks, can scale to 50,000 (emulated) nodes, and is resilient to failures.

...read moreread less

1,786 citations

Cites methods from "Spark: cluster computing with worki..."

...For example, at time 350, when both Spark and the Facebook Hadoop framework have no running jobs and Torque is using 1/8 of the cluster, the large-job Hadoop framework scales up to 7/8 of the cluster....
[...]
...We evaluated the benefit of running iterative jobs using the specialized Spark framework we developed on top of Mesos (Section 5.3) over the general-purpose Hadoop framework....
[...]
...To validate our hypothesis that specialized frameworks provide value over general ones, we have also built a new framework on top of Mesos called Spark, optimized for iterative jobs where a dataset is reused in many parallel oper- ations, and shown that Spark can outperform Hadoop by 10x in iterative machine learning workloads....
[...]
...The longer time for the first iteration in Spark is due to the use of slower text parsing routines....
[...]
...• Spark running a series of machine learning jobs....
[...]

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse

References

PDF

Open Access

More filters

Journal Article•DOI•

MapReduce: simplified data processing on large clusters

[...]

Jeffrey Dean¹, Sanjay Ghemawat¹•Institutions (1)

Google¹

06 Dec 2004

TL;DR: This paper presents the implementation of MapReduce, a programming model and an associated implementation for processing and generating large data sets that runs on a large cluster of commodity machines and is highly scalable.

...read moreread less

Abstract: MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.

...read moreread less

20,309 citations

Journal Article•DOI•

MapReduce: simplified data processing on large clusters

[...]

Jeffrey Dean¹, Sanjay Ghemawat¹•Institutions (1)

Google¹

01 Jan 2008-Communications of The ACM

TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.

...read moreread less

Abstract: MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day.

...read moreread less

17,663 citations

Journal Article•DOI•

IPython: A System for Interactive Scientific Computing

[...]

Fernando Perez¹, Brian E. Granger•Institutions (1)

University of Colorado Boulder¹

01 May 2007

TL;DR: The IPython project as mentioned in this paper provides an enhanced interactive environment that includes, among other features, support for data visualization and facilities for distributed and parallel computation for interactive work and a comprehensive library on top of which more sophisticated systems can be built.

...read moreread less

Abstract: Python offers basic facilities for interactive work and a comprehensive library on top of which more sophisticated systems can be built. The IPython project provides on enhanced interactive environment that includes, among other features, support for data visualization and facilities for distributed and parallel computation

...read moreread less

3,355 citations

Proceedings Article•DOI•

Dryad: distributed data-parallel programs from sequential building blocks

[...]

Michael Isard¹, Mihai Budiu¹, Yuan Yu¹, Andrew Birrell¹, Dennis Fetterly¹ - Show less +1 more•Institutions (1)

Microsoft¹

21 Mar 2007

TL;DR: The Dryad execution engine handles all the difficult problems of creating a large distributed, concurrent application: scheduling the use of computers and their CPUs, recovering from communication or computer failures, and transporting data between vertices.

...read moreread less

Abstract: Dryad is a general-purpose distributed execution engine for coarse-grain data-parallel applications. A Dryad application combines computational "vertices" with communication "channels" to form a dataflow graph. Dryad runs the application by executing the vertices of this graph on a set of available computers, communicating as appropriate through flies, TCP pipes, and shared-memory FIFOs.The vertices provided by the application developer are quite simple and are usually written as sequential programs with no thread creation or locking. Concurrency arises from Dryad scheduling vertices to run simultaneously on multiple computers, or on multiple CPU cores within a computer. The application can discover the size and placement of data at run time, and modify the graph as the computation progresses to make efficient use of the available resources.Dryad is designed to scale from powerful multi-core single computers, through small clusters of computers, to data centers with thousands of computers. The Dryad execution engine handles all the difficult problems of creating a large distributed, concurrent application: scheduling the use of computers and their CPUs, recovering from communication or computer failures, and transporting data between vertices.

...read moreread less

2,867 citations

"Spark: cluster computing with worki..." refers methods in this paper

...MapReduce [11] pioneered this model, while systems like Dryad [16] and Map-ReduceMerge [23] generalized the types of data flows supported....
[...]

Journal Article•

[''R"--project for statistical computing].

[...]

Ram Benny Dessau, Christian Bressen Pipper

28 Jan 2008-Ugeskrift for Læger

TL;DR: An introduction to the R project for statistical computing (www.R-project.org) is presented to make the professional community aware of "R" as a potent and free software for graphical and statistical analysis of medical data.

...read moreread less

Abstract: An introduction to the R project for statistical computing (www.R-project.org) is presented. The main topics are: 1. To make the professional community aware of "R" as a potent and free software for graphical and statistical analysis of medical data; 2. Simple well-known statistical tests are fairly easy to perform in R, but more complex modelling requires programming skills; 3. R is seen as a tool for teaching statistics and implementing complex modelling of medical data among medical professionals.

...read moreread less

2,670 citations