https://andreashellander.se/wp-content/uploads/2016/01/1-s2.0-S0306437914001288-main.pdf

The rise of big data on cloud computing

Spanner is Google's scalable, multi-version, globally-distributed, and synchronously-replicated database. It is the first system to distribute data at global scale and support externally-consistent distributed transactions. This paper describes how Spanner is structured, its feature set, the rationale underlying various design decisions, and a novel time API that exposes clock uncertainty. This API and its implementation are critical to supporting external consistency and a variety of powerful features: nonblocking reads in the past, lock-free read-only transactions, and atomic schema changes, across all of Spanner.

/pdf/spanner-google-s-globally-distributed-database-57sdsy7wpa.pdf

Spanner: Google's globally-distributed database
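
The abstract above says only that the time API "exposes clock uncertainty"; it does not give the interface. As a rough illustration of the idea, the sketch below models a clock that returns an interval rather than a single instant, and a wait loop that picks a timestamp and blocks until that timestamp has definitely passed. The class and function names (`UncertainClock`, `TimeInterval`, `commit_wait`) and the fixed `epsilon` bound are assumptions made for this example, not Spanner's actual API.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeInterval:
    """Bounds on the true time: earliest <= true time <= latest."""
    earliest: float
    latest: float


class UncertainClock:
    """Hypothetical clock that reports an interval instead of one instant.

    epsilon is an assumed, fixed bound on clock error in seconds.
    """

    def __init__(self, epsilon, source=time.time):
        self.epsilon = epsilon
        self.source = source

    def now(self):
        t = self.source()
        return TimeInterval(t - self.epsilon, t + self.epsilon)

    def after(self, ts):
        """True only if ts has definitely passed on every clock."""
        return self.now().earliest > ts


def commit_wait(clock, sleep=time.sleep):
    """Choose a timestamp, then wait until it is definitely in the past."""
    ts = clock.now().latest
    while not clock.after(ts):
        sleep(clock.epsilon / 2 or 0.001)
    return ts
```

With `epsilon = 0.5`, the loop above waits for a little more than one second of clock time before returning, which is the price paid for never handing out a timestamp that some other clock could still consider to be in the future.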

MapReduce advantages over parallel databases include storage-system independence and fine-grain fault tolerance for large jobs.

MapReduce: a flexible data processing tool

Recent technological advancements have led to a deluge of data from distinctive domains (e.g., health care and scientific sensors, user-generated data, Internet and financial companies, and supply chain systems) over the past two decades. The term big data was coined to capture the meaning of this emerging trend. In addition to its sheer volume, big data also exhibits other unique characteristics as compared with traditional data. For instance, big data is commonly unstructured and requires more real-time analysis. This development calls for new system architectures for data acquisition, transmission, storage, and large-scale data processing mechanisms. In this paper, we present a literature survey and system tutorial for big data analytics platforms, aiming to provide an overall picture for nonexpert readers and instill a do-it-yourself spirit for advanced audiences to customize their own big-data solutions. First, we present the definition of big data and discuss big data challenges. Next, we present a systematic framework to decompose big data systems into four sequential modules, namely data generation, data acquisition, data storage, and data analytics. These four modules form a big data value chain. Following that, we present a detailed survey of numerous approaches and mechanisms from research and industry communities. In addition, we present the prevalent Hadoop framework for addressing big data challenges. Finally, we outline several evaluation benchmarks and potential research directions for big data systems.

Toward Scalable Systems for Big Data Analytics: A Technology Tutorial

The growing demand for large-scale data mining and data analysis applications has led both industry and academia to design new types of highly scalable data-intensive computing platforms. MapReduce and Dryad are two popular platforms in which the dataflow takes the form of a directed acyclic graph of operators. These platforms lack built-in support for iterative programs, which arise naturally in many applications including data mining, web ranking, graph analysis, model fitting, and so on. This paper presents HaLoop, a modified version of the Hadoop MapReduce framework that is designed to serve these applications. HaLoop not only extends MapReduce with programming support for iterative applications, it also dramatically improves their efficiency by making the task scheduler loop-aware and by adding various caching mechanisms. We evaluated HaLoop on real queries and real datasets. Compared with Hadoop, on average, HaLoop reduces query runtimes by a factor of 1.85, and shuffles only 4% of the data between mappers and reducers.

/pdf/haloop-efficient-iterative-data-processing-on-large-clusters-55j3q58dsk.pdf

HaLoop: efficient iterative data processing on large clusters
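
To make the caching idea in the HaLoop abstract concrete, here is a minimal single-process sketch of an iterative job whose loop-invariant input (a graph's edge list) is grouped once and then reused in every iteration, with the loop stopping at a fixpoint. It is an illustration only: the function names and the reachability example are assumptions chosen for this sketch, and nothing here uses or mirrors HaLoop's real API or scheduler.

```python
from collections import defaultdict


def build_invariant_cache(edges):
    """Group the loop-invariant edge list by source node, once."""
    cache = defaultdict(list)
    for src, dst in edges:
        cache[src].append(dst)
    return cache


def iterate_reachable(edges, start, max_iterations=100):
    """Find all nodes reachable from start by repeated join steps.

    Returns (reached_nodes, iterations_run).
    """
    cache = build_invariant_cache(edges)  # built once, reused every iteration
    reached = {start}
    frontier = {start}
    for iteration in range(1, max_iterations + 1):
        # Join step: only the small, changing frontier is combined with the
        # cached edges; the edge list itself is never re-read or re-grouped.
        candidates = {dst for node in frontier for dst in cache.get(node, ())}
        frontier = candidates - reached
        if not frontier:  # fixpoint: this iteration found nothing new
            return reached, iteration
        reached |= frontier
    return reached, max_iterations
```

In a plain MapReduce job chain, each iteration would typically reload and reshuffle the full edge list; keeping it cached next to the computation is the kind of saving the abstract attributes to loop-aware scheduling and caching.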

The production environment for analytical data management applications is rapidly changing. Many enterprises are shifting away from deploying their analytical databases on high-end proprietary machines, and moving towards cheaper, lower-end, commodity hardware, typically arranged in a shared-nothing MPP architecture, often in a virtualized environment inside public or private "clouds". At the same time, the amount of data that needs to be analyzed is exploding, requiring hundreds to thousands of machines to work in parallel to perform the analysis. There tend to be two schools of thought regarding what technology to use for data analysis in such an environment. Proponents of parallel databases argue that the strong emphasis on performance and efficiency of parallel databases makes them well-suited to perform such analysis. On the other hand, others argue that MapReduce-based systems are better suited due to their superior scalability, fault tolerance, and flexibility to handle unstructured data. In this paper, we explore the feasibility of building a hybrid system that takes the best features from both technologies; the prototype we built approaches parallel databases in performance and efficiency, yet still yields the scalability, fault tolerance, and flexibility of MapReduce-based systems.

/pdf/hadoopdb-an-architectural-hybrid-of-mapreduce-and-dbms-29369umveq.pdf

HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads

A system, method, and computer program product for processing data are disclosed. The system includes a data processing framework configured to receive a data processing task for processing, a plurality of database systems coupled to the data processing framework, wherein the database systems are configured to perform a data processing task, and a storage component in communication with the data processing framework and the plurality of database systems, configured to store information about each partition of the data processing task being processed by each database system and the data processing framework. The data processing task is configured to be partitioned into a plurality of partitions, and each database system is configured to process a partition of the data processing task assigned for processing to that database system. Each database system is configured to perform processing of its assigned partition of the data processing task in parallel with another database system processing another partition of the data processing task assigned to the other database system. The data processing framework is configured to perform at least one partition of the data processing task.

Systems and methods for processing data

Hadapt is a start-up company currently commercializing the Yale University research project called HadoopDB. The company focuses on building a platform for Big Data analytics in the cloud by introducing a storage layer optimized for structured data and by providing a framework for executing SQL queries efficiently. This work considers processing data warehousing queries over very large datasets. Our goal is to maximize performance while, at the same time, not giving up fault tolerance and scalability. We analyze the complexity of this problem in the split execution environment of HadoopDB. Here, incoming queries are examined; parts of the query are pushed down and executed inside the higher performing database layer; and the rest of the query is processed in a more generic MapReduce framework. In this paper, we discuss in detail performance-oriented query execution strategies for data warehouse queries in split execution environments, with particular focus on join and aggregation operations. The efficiency of our techniques is demonstrated by running experiments using the TPC-H benchmark with 3TB of data. In these experiments we compare our results with a standard commercial parallel database and an open-source MapReduce implementation featuring a SQL interface (Hive). We show that HadoopDB successfully competes with other systems.

/pdf/efficient-processing-of-data-warehousing-queries-in-a-split-1phl8da5aa.pdf

Efficient processing of data warehousing queries in a split execution environment
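
The split-execution idea described above (push part of a query into the database layer, finish the rest in a generic MapReduce-style step) can be sketched in a few lines. This is a toy under stated assumptions, not HadoopDB's implementation: in-memory SQLite databases stand in for the node-local database layer, and the `orders` table, the SQL text, and all function names are hypothetical. The pushed-down part computes partial sums and counts per group on each node; the reduce step merges those partials into final totals and averages.

```python
import sqlite3
from collections import defaultdict


def make_node(rows):
    """Create one node-local database holding a partition of the data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    return conn


# Query fragment pushed down into each node's database (partial aggregation).
PUSHED_DOWN_SQL = (
    "SELECT region, SUM(amount), COUNT(*) FROM orders GROUP BY region"
)


def map_phase(nodes):
    """Run the pushed-down SQL on every node and emit partial aggregates."""
    for conn in nodes:
        yield from conn.execute(PUSHED_DOWN_SQL)


def reduce_phase(partials):
    """Merge per-node partials into (total, count, average) per region."""
    totals = defaultdict(lambda: [0.0, 0])
    for region, partial_sum, partial_count in partials:
        totals[region][0] += partial_sum
        totals[region][1] += partial_count
    return {r: (s, c, s / c) for r, (s, c) in totals.items()}
```

Sums and counts are pushed down rather than averages because averages from different partitions cannot be merged directly, while sums and counts can. Only a few rows per node cross the node boundary instead of the raw table.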

HadoopDB is a hybrid of MapReduce and DBMS technologies, designed to meet the growing demand of analyzing massive datasets on very large clusters of machines. Our previous work has shown that HadoopDB approaches parallel databases in performance and still yields the scalability and fault tolerance of MapReduce-based systems. In this demonstration, we focus on HadoopDB's flexible architecture and versatility with two real world application scenarios: a semantic web data application for protein sequence analysis and a business data warehousing application based on TPC-H. The demonstration offers a thorough walk-through of how to easily build applications on top of HadoopDB.

https://www.cs.cmu.edu/~pavlo/courses/fall2013/static/papers/p1111-abouzied.pdf

HadoopDB in action: building real world applications

A system and method for performing distributed execution of database queries includes a query server that receives a query to be executed on a database, forms a query plan based on the query, assigns tasks to task slots on a plurality of worker nodes in a cluster, and, upon receipt of a notification that a task has completed on a worker node, immediately assigns an unassigned task to a free task slot on that worker node, such that the task may begin executing on that worker node substantially immediately thereafter. The task slots on worker nodes include pools of resources that run tasks without start-up overhead.
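
The slot-assignment behavior described in this abstract can be illustrated with a small in-memory sketch: tasks wait in a queue, each worker has a fixed number of slots, and a completion notification immediately hands the freed slot to the next unassigned task. The class name, method names, and data structures below are assumptions made for the example; the abstract does not specify an implementation, and the sketch does not model the pre-started resource pools it mentions.

```python
from collections import deque


class SlotScheduler:
    """Toy scheduler that assigns queued tasks to free worker slots."""

    def __init__(self, slots_per_worker):
        # slots_per_worker: mapping of worker name -> number of task slots
        self.free = dict(slots_per_worker)
        self.pending = deque()
        self.running = {}  # task -> worker currently executing it

    def submit(self, tasks):
        """Queue tasks and fill whatever slots are free right now."""
        self.pending.extend(tasks)
        for worker in self.free:
            self._fill(worker)

    def _fill(self, worker):
        # Assign pending tasks while this worker still has free slots.
        while self.free[worker] > 0 and self.pending:
            task = self.pending.popleft()
            self.free[worker] -= 1
            self.running[task] = worker

    def on_task_complete(self, task):
        """Handle a completion notification and reuse the slot at once."""
        worker = self.running.pop(task)
        self.free[worker] += 1
        self._fill(worker)
        return worker
```

For example, with two single-slot workers and three submitted tasks, the third task stays queued until either running task completes, and it is then placed on whichever worker reported the completion.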

Kamil Bajda-Pawlikowski

Papers

HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads

Systems and methods for processing data

Efficient processing of data warehousing queries in a split execution environment

HadoopDB in action: building real world applications

Systems and methods for fault tolerant, adaptive execution of arbitrary queries at low latency