Journal ArticleDOI
Clustera: an integrated computation and data management system
David J. DeWitt,Paulson Erik S,Eric Robinson,Jeffrey F. Naughton,Joshua Royalty,Srinath Shankar,Andrew Krioukov +6 more
- Vol. 1, Iss: 1, pp 28-41
Reads0
Chats0
TLDR
Clustera is designed for extensibility, enabling the system to be easily extended to handle a wide variety of job types ranging from computationally-intensive, long-running jobs with minimal I/O requirements to complex SQL queries over massive relational tables.Abstract:
This paper introduces Clustera, an integrated computation and data management system. In contrast to traditional cluster-management systems that target specific types of workloads, Clustera is designed for extensibility, enabling the system to be easily extended to handle a wide variety of job types ranging from computationally-intensive, long-running jobs with minimal I/O requirements to complex SQL queries over massive relational tables. Another unique feature of Clustera is the way in which the system architecture exploits modern software building blocks including application servers and relational database systems in order to realize important performance, scalability, portability and usability benefits. Finally, experimental evaluation suggests that Clustera has good scale-up properties for SQL processing, that Clustera delivers performance comparable to Hadoop for MapReduce processing and that Clustera can support higher job throughput rates than previously published results for the Condor and CondorJ2 batch computing systems.read more
Citations
More filters
Book
Mining of Massive Datasets
TL;DR: This book focuses on practical algorithms that have been used to solve key problems in data mining and which can be used on even the largest datasets, and explains the tricks of locality-sensitive hashing and stream processing algorithms for mining data that arrives too fast for exhaustive processing.
Proceedings ArticleDOI
Mesos: a platform for fine-grained resource sharing in the data center
Benjamin Hindman,Andy Konwinski,Matei Zaharia,Ali Ghodsi,Anthony D. Joseph,Randy H. Katz,Scott Shenker,Ion Stoica +7 more
TL;DR: The results show that Mesos can achieve near-optimal data locality when sharing the cluster among diverse frameworks, can scale to 50,000 (emulated) nodes, and is resilient to failures.
Journal ArticleDOI
Parallel data processing with MapReduce: a survey
TL;DR: In this survey, the MapReduce framework is characterized and its inherent pros and cons are discussed, and its optimization strategies reported in the recent literature are introduced.
Proceedings ArticleDOI
SkewTune: mitigating skew in mapreduce applications
TL;DR: The results show that SkewTune can significantly reduce job runtime in the presence of skew and adds little to no overhead in the absence of skew.
Journal ArticleDOI
The performance of MapReduce: an in-depth study
TL;DR: By carefully tuning these factors, the overall performance of Hadoop can be improved by a factor of 2.5 to 3.5, and is thus more comparable to that of parallel database systems.
References
More filters
Journal ArticleDOI
MapReduce: simplified data processing on large clusters
Jeffrey Dean,Sanjay Ghemawat +1 more
TL;DR: This paper presents the implementation of MapReduce, a programming model and an associated implementation for processing and generating large data sets that runs on a large cluster of commodity machines and is highly scalable.
Journal ArticleDOI
MapReduce: simplified data processing on large clusters
Jeffrey Dean,Sanjay Ghemawat +1 more
TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
Journal ArticleDOI
The Google file system
TL;DR: This paper presents file system interface extensions designed to support distributed applications, discusses many aspects of the design, and reports measurements from both micro-benchmarks and real world use.
Proceedings ArticleDOI
Dryad: distributed data-parallel programs from sequential building blocks
TL;DR: The Dryad execution engine handles all the difficult problems of creating a large distributed, concurrent application: scheduling the use of computers and their CPUs, recovering from communication or computer failures, and transporting data between vertices.
Proceedings ArticleDOI
Condor-a hunter of idle workstations
TL;DR: The design, implementation, and performance of the Condor scheduling system, which operates in a workstation environment, are presented and a performance profile of the system is presented that is based on data accumulated from 23 stations during one month.