Clustera: an integrated computation and data management system

doi:10.14778/1453856.1453865

Journal Article

David J. DeWitt, +6 more

Vol. 1, Iss. 1, pp. 28-41

Abstract:

This paper introduces Clustera, an integrated computation and data management system. In contrast to traditional cluster-management systems that target specific types of workloads, Clustera is designed for extensibility, enabling the system to be easily extended to handle a wide variety of job types ranging from computationally-intensive, long-running jobs with minimal I/O requirements to complex SQL queries over massive relational tables. Another unique feature of Clustera is the way in which the system architecture exploits modern software building blocks including application servers and relational database systems in order to realize important performance, scalability, portability and usability benefits. Finally, experimental evaluation suggests that Clustera has good scale-up properties for SQL processing, that Clustera delivers performance comparable to Hadoop for MapReduce processing and that Clustera can support higher job throughput rates than previously published results for the Condor and CondorJ2 batch computing systems.

Citations

Book

Mining of Massive Datasets

Anand Rajaraman, +1 more

TL;DR: This book focuses on practical algorithms that have been used to solve key problems in data mining and which can be used on even the largest datasets, and explains the tricks of locality-sensitive hashing and stream processing algorithms for mining data that arrives too fast for exhaustive processing.

Proceedings Article

Mesos: a platform for fine-grained resource sharing in the data center

Benjamin Hindman, +7 more

TL;DR: The results show that Mesos can achieve near-optimal data locality when sharing the cluster among diverse frameworks, can scale to 50,000 (emulated) nodes, and is resilient to failures.

Journal Article

Parallel data processing with MapReduce: a survey

Kyong-Ha Lee, +4 more

TL;DR: This survey characterizes the MapReduce framework, discusses its inherent pros and cons, and introduces optimization strategies reported in the recent literature.

Proceedings Article

SkewTune: mitigating skew in MapReduce applications

YongChul Kwon, +3 more

TL;DR: The results show that SkewTune can significantly reduce job runtime in the presence of skew and adds little to no overhead in the absence of skew.

Journal Article

The performance of MapReduce: an in-depth study

Dawei Jiang, +3 more

TL;DR: By carefully tuning these factors, the overall performance of Hadoop can be improved by a factor of 2.5 to 3.5, and is thus more comparable to that of parallel database systems.

References

Journal Article

MapReduce: simplified data processing on large clusters

Jeffrey Dean, +1 more

TL;DR: This paper presents MapReduce, a programming model and an associated implementation for processing and generating large data sets; the implementation runs on a large cluster of commodity machines and is highly scalable.

Journal Article

MapReduce: simplified data processing on large clusters

Jeffrey Dean, +1 more

01 Jan 2008, Communications of the ACM

TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.

Journal Article

The Google file system

Sanjay Ghemawat, +2 more

TL;DR: This paper presents file system interface extensions designed to support distributed applications, discusses many aspects of the design, and reports measurements from both micro-benchmarks and real world use.

Proceedings Article

Dryad: distributed data-parallel programs from sequential building blocks

Michael Isard, +4 more

TL;DR: The Dryad execution engine handles all the difficult problems of creating a large distributed, concurrent application: scheduling the use of computers and their CPUs, recovering from communication or computer failures, and transporting data between vertices.

Proceedings Article

Condor - a hunter of idle workstations

M. Litzkow, +2 more

TL;DR: The design, implementation, and performance of the Condor scheduling system, which operates in a workstation environment, are presented, along with a performance profile based on data accumulated from 23 stations during one month.

Related Papers (5)

Pig Latin: a not-so-foreign language for data processing

Christopher Olston, +4 more

MapReduce: simplified data processing on large clusters

Jeffrey Dean, +1 more

01 Jan 2008, Communications of the ACM

Dryad: distributed data-parallel programs from sequential building blocks

The Google file system

A comparison of approaches to large-scale data analysis