Journal Article

Clustera: an integrated computation and data management system

TLDR
Clustera is designed for extensibility, enabling the system to be easily extended to handle a wide variety of job types ranging from computationally-intensive, long-running jobs with minimal I/O requirements to complex SQL queries over massive relational tables.
Abstract
This paper introduces Clustera, an integrated computation and data management system. In contrast to traditional cluster-management systems that target specific types of workloads, Clustera is designed for extensibility, enabling the system to be easily extended to handle a wide variety of job types ranging from computationally-intensive, long-running jobs with minimal I/O requirements to complex SQL queries over massive relational tables. Another unique feature of Clustera is the way in which the system architecture exploits modern software building blocks including application servers and relational database systems in order to realize important performance, scalability, portability and usability benefits. Finally, experimental evaluation suggests that Clustera has good scale-up properties for SQL processing, that Clustera delivers performance comparable to Hadoop for MapReduce processing and that Clustera can support higher job throughput rates than previously published results for the Condor and CondorJ2 batch computing systems.

Citations
Book

Mining of Massive Datasets

TL;DR: This book focuses on practical algorithms that have been used to solve key problems in data mining and which can be used on even the largest datasets, and explains the tricks of locality-sensitive hashing and stream processing algorithms for mining data that arrives too fast for exhaustive processing.
Proceedings Article

Mesos: a platform for fine-grained resource sharing in the data center

TL;DR: The results show that Mesos can achieve near-optimal data locality when sharing the cluster among diverse frameworks, can scale to 50,000 (emulated) nodes, and is resilient to failures.
Journal Article

Parallel data processing with MapReduce: a survey

TL;DR: This survey characterizes the MapReduce framework, discusses its inherent pros and cons, and introduces optimization strategies reported in the recent literature.
Proceedings Article

SkewTune: mitigating skew in MapReduce applications

TL;DR: The results show that SkewTune can significantly reduce job runtime in the presence of skew and adds little to no overhead in the absence of skew.
Journal Article

The performance of MapReduce: an in-depth study

TL;DR: By carefully tuning these factors, the overall performance of Hadoop can be improved by a factor of 2.5 to 3.5, and is thus more comparable to that of parallel database systems.
References
Journal Article

MapReduce: simplified data processing on large clusters

TL;DR: This paper presents MapReduce, a programming model and an associated implementation for processing and generating large data sets; the implementation runs on a large cluster of commodity machines and is highly scalable.
Journal Article

MapReduce: simplified data processing on large clusters

TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
Journal Article

The Google file system

TL;DR: This paper presents file system interface extensions designed to support distributed applications, discusses many aspects of the design, and reports measurements from both micro-benchmarks and real world use.
Proceedings Article

Dryad: distributed data-parallel programs from sequential building blocks

TL;DR: The Dryad execution engine handles all the difficult problems of creating a large distributed, concurrent application: scheduling the use of computers and their CPUs, recovering from communication or computer failures, and transporting data between vertices.
Proceedings Article

Condor - a hunter of idle workstations

TL;DR: This paper presents the design, implementation, and performance of the Condor scheduling system, which operates in a workstation environment, along with a performance profile of the system based on data accumulated from 23 stations during one month.