Proceedings ArticleDOI
Nephele/PACTs: a programming model and execution framework for web-scale analytical processing
Dominic Battré, Stephan Ewen, Fabian Hueske, Odej Kao, Volker Markl, Daniel Warneke
pp. 119-130
TL;DR: The PACT programming model is a generalization of the well-known map/reduce programming model, extending it with further second-order functions, as well as with Output Contracts that give guarantees about the behavior of a function.
Abstract:
We present a parallel data processor centered around a programming model of so-called Parallelization Contracts (PACTs) and the scalable parallel execution engine Nephele [18]. The PACT programming model is a generalization of the well-known map/reduce programming model, extending it with further second-order functions, as well as with Output Contracts that give guarantees about the behavior of a function. We describe methods to transform a PACT program into a data flow for Nephele, which executes its sequential building blocks in parallel and deals with communication, synchronization, and fault tolerance. Our definition of PACTs makes it possible to apply several types of optimizations to the data flow during the transformation. The system as a whole is designed to be as generic as (and compatible with) map/reduce systems, while overcoming several of their major weaknesses: 1) The functions map and reduce alone are not sufficient to express many data processing tasks both naturally and efficiently. 2) Map/reduce ties a program to a single fixed execution strategy, which is robust but highly suboptimal for many tasks. 3) Map/reduce makes no assumptions about the behavior of the functions and hence offers only very limited optimization opportunities. With a set of examples and experiments, we illustrate how our system is able to naturally represent and efficiently execute several tasks that do not fit the map/reduce model well.
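To make the idea of "further second-order functions" concrete, here is a minimal Python sketch (not from the paper; the function names are illustrative, not the actual PACT/Stratosphere API). Each second-order function takes a user-defined first-order function and decides which subsets of the input records it is invoked on, which is the essence of a Parallelization Contract:

```python
from itertools import product
from collections import defaultdict

def pact_map(udf, records):
    # Invoke the user function independently on every record.
    return [out for rec in records for out in udf(rec)]

def pact_reduce(udf, records, key):
    # Group records by key and invoke the user function once per group.
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec)
    return [out for grp in groups.values() for out in udf(grp)]

def pact_cross(udf, left, right):
    # Invoke the user function on every pair of the Cartesian product.
    return [out for l, r in product(left, right) for out in udf(l, r)]

def pact_match(udf, left, right, lkey, rkey):
    # Invoke the user function on pairs sharing the same key (a join).
    index = defaultdict(list)
    for r in right:
        index[rkey(r)].append(r)
    return [out for l in left
                for r in index[lkey(l)]
                for out in udf(l, r)]

# Example: join two small relations on their first field.
users = [(1, "ada"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (2, "mug")]
joined = pact_match(lambda u, o: [(u[1], o[1])],
                    users, orders,
                    lkey=lambda u: u[0], rkey=lambda o: o[0])
# joined == [("ada", "book"), ("ada", "pen"), ("bob", "mug")]
```

Because each second-order function fixes how records may be partitioned into independent invocations, a system can parallelize and reorder them without inspecting the user code; the paper's Output Contracts add further per-function guarantees on top of this.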
Citations
Journal ArticleDOI
Parallel data processing with MapReduce: a survey
TL;DR: This survey characterizes the MapReduce framework, discusses its inherent pros and cons, and introduces optimization strategies reported in the recent literature.
Journal ArticleDOI
The Stratosphere platform for big data analytics
Alexander Alexandrov, Rico Bergmann, Stephan Ewen, Johann-Christoph Freytag, Fabian Hueske, Arvid Heise, Odej Kao, Marcus Leich, Ulf Leser, Volker Markl, Felix Naumann, Mathias Peters, Astrid Rheinländer, Matthias J. Sax, Sebastian Schelter, Mareike Höger, Kostas Tzoumas, Daniel Warneke
TL;DR: The overall system architecture and design decisions are presented, Stratosphere is introduced through example queries, and the paper dives into the internal workings of the system's components that relate to extensibility, programming model, optimization, and query execution.
Proceedings ArticleDOI
SkewTune: mitigating skew in mapreduce applications
TL;DR: The results show that SkewTune can significantly reduce job runtime in the presence of skew and adds little to no overhead in the absence of skew.
Journal ArticleDOI
Exploiting Dynamic Resource Allocation for Efficient Parallel Data Processing in the Cloud
Daniel Warneke, Odej Kao
TL;DR: Nephele is the first data processing framework to explicitly exploit the dynamic resource allocation offered by today's IaaS clouds for both task scheduling and execution.
Proceedings ArticleDOI
Hyracks: A flexible and extensible foundation for data-intensive computing
TL;DR: Describes the Hyracks end-user model for authors of dataflow jobs, and the extension model for users who wish to augment Hyracks' built-in library with new operator and/or connector types.
References
Journal ArticleDOI
MapReduce: simplified data processing on large clusters
Jeffrey Dean, Sanjay Ghemawat
TL;DR: This paper presents MapReduce, a programming model and associated implementation for processing and generating large data sets, which runs on large clusters of commodity machines and is highly scalable.
Proceedings ArticleDOI
Dryad: distributed data-parallel programs from sequential building blocks
TL;DR: The Dryad execution engine handles all the difficult problems of creating a large distributed, concurrent application: scheduling the use of computers and their CPUs, recovering from communication or computer failures, and transporting data between vertices.
Proceedings ArticleDOI
Access path selection in a relational database management system
TL;DR: System R is an experimental database management system developed to carry out research on the relational model of data; it chooses access paths for both simple (single-relation) and complex queries (such as joins), given a user specification of desired data as a boolean expression of predicates.
Proceedings ArticleDOI
Pig latin: a not-so-foreign language for data processing
TL;DR: Describes Pig Latin, a new language designed to fit in a sweet spot between the declarative style of SQL and the low-level, procedural style of map-reduce; it is an open-source Apache Incubator project available for general use.
Journal ArticleDOI
Hive: a warehousing solution over a map-reduce framework
Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, Raghotham Murthy
TL;DR: Presents Hive, a warehousing solution built on Hadoop, a popular open-source map-reduce implementation used as an alternative to store and process extremely large data sets on commodity hardware.